Linköping studies in science and technology. Dissertations. No. 1710

Analytical Approximations for Bayesian Inference

Tohid Ardeshiri
[email protected]
www.control.isy.liu.se
Division of Automatic Control
Department of Electrical Engineering
Linköping University
SE–581 83 Linköping
Sweden
ISBN : 978-91-7685-930-8
ISSN 0345-7524
Copyright © 2015 Tohid Ardeshiri
Printed by LiU-Tryck, Linköping, Sweden 2015
To Adrian
Abstract
Bayesian inference is a statistical inference technique in which Bayes’ theorem is
used to update the probability distribution of a random variable using observations. Except for a few simple cases, such probability distributions cannot be expressed using compact analytical expressions. Approximation methods are required to express the a priori knowledge about a random variable in the form of prior distributions. Further approximations are needed to compute posterior distributions of the random variables using the observations. When the computational complexity of representing such posteriors increases over time, as in mixture models, approximations are required to reduce the complexity of such representations.
This thesis extends and generalizes existing approximation methods for Bayesian inference in three respects, namely prior selection, posterior evaluation given the observations, and maintenance of computational complexity.
In particular, the maximum entropy properties of the first-order stable spline kernel for identification of linear time-invariant, stable and causal systems are shown.
Analytical approximations are used to express the prior knowledge about the
properties of the impulse response of a linear time-invariant stable and causal
system.
The variational Bayes (VB) method is used to compute an approximate posterior in two inference problems. In the first problem, an approximate posterior for the state smoothing problem for linear state-space models with unknown and time-varying noise covariances is proposed. In the second problem, the VB method is used for approximate inference in state-space models with skewed measurement noise.
Moreover, a novel approximation method for Bayesian inference is proposed. The
proposed Bayesian inference technique is based on Taylor series approximation
of the logarithm of the likelihood function. The proposed approximation is devised for the case where the prior distribution belongs to the exponential family
of distributions.
Finally, two contributions are dedicated to the mixture reduction (MR) problem. The first contribution generalizes the existing MR algorithms for Gaussian mixtures to the exponential family of distributions and compares them in an extended target tracking scenario. The second contribution proposes a new Gaussian mixture reduction algorithm which minimizes the reverse Kullback-Leibler divergence and has specific peak-preserving properties.
Popular Science Summary (Populärvetenskaplig sammanfattning)
Bayes' theorem is a fundamental tool in statistics that can be used to refine prior knowledge about a variable with the help of observations. The prior knowledge is called the prior and is described mathematically as a probability function for the unknown variable, and the observation is described by a so-called likelihood function. Bayes' theorem states that the normalized product of these describes the so-called posterior, i.e., the distribution of the variable on which the estimate is to be based. The core problem of the thesis is that in most cases this function is not analytical, i.e., it cannot be written as a mathematical expression, and must be approximated in one way or another. A number of efficient methods are presented in this work.
Automotive safety is an illustrative application studied by the author. Suppose that the software in a camera has detected a bicycle in front of the car in which it is mounted, and that the software can also find the positions of the two wheels in the image. The wheels appear as ellipses in the image, and the shape of the ellipses together with Bayes' theorem can then be used to refine the information about the bicycle's position and, in addition, to estimate how the bicycle will change direction, e.g., whether it is about to swerve out in front of the car. Bayes' theorem can in fact be used once more to predict where the bicycle will be when the next camera image is taken. These two steps of prediction and estimation are basic components of nonlinear filters, which are a focus area of the thesis. The concept of modeling the bicycle not just as a point, but as a structure with two wheels, is known as an extended target, and this area has motivated many of the results in the thesis.
When using Bayes' theorem, there are a number of combinations of prior and likelihood that actually yield an analytical posterior. Such a prior that matches a certain likelihood is called a conjugate prior. One proposed method is to approximate the likelihood so that the prior at hand becomes conjugate. With this trick the posterior becomes analytical, and moreover of the same form as the prior. The latter is important when the operation is to be repeated many times, as in, e.g., a nonlinear filter.
Another example where a posterior has an analytical expression is when both the prior and the likelihood are weighted sums of normal distributions. Such distributions are very flexible and can approximate any distribution arbitrarily well. The posterior then also becomes a weighted sum of normal distributions. The problem is that it gains more and more components every time Bayes' theorem is used. One of the contributions of the thesis develops concrete algorithms for limiting the number of components through clever approximations. In general, the posterior can always be approximated within a given class of functions, and here the Kullback-Leibler divergence is studied as the measure to optimize over. This is used to estimate impulse responses of dynamical systems. A method used in several contributions is variational Bayes (VB). Here VB is used to find a product form of the posterior over two subsets of the variables to be estimated, which according to VB can be done stepwise with large savings in computational complexity.
Acknowledgments
On the 6th of November 1632, the Swedish king Gustav II Adolf, the founder of
the Swedish Empire (1611-1721), was killed in the battle of Lützen in modern-day Germany. Gustav II Adolf, an extremely able commander (rather than an obedient soldier), was nearsighted and had a prominent nose. It is claimed by
historians that in the thick mix of gun smoke and fog covering the battlefield, he
was separated from his fellow riders and killed by several shots.
It is indeed just a coincidence that this thesis will be defended on the very
same day 383 years later. Even the fact that I have well-known aspirations for becoming the king of Sweden does not worry me. Furthermore, I am not concerned
about the fact that my opponent Dr Wolfgang Koch comes from a German defense institution which is situated only 500 km away from Lützen. Let us stay
objective when forming prior beliefs. Furthermore, separation from my fellow riders is not expected to happen since I have written this thesis to enable me and
my fellow riders to see through the fog of noise and smoke of disturbances using
Bayes’ rule.
I would never have been able to write such a dissertation without the support
of my fellow riders. Here, I want to acknowledge their contributions to this thesis.
131 years after that fateful day, the Reverend Thomas Bayes wrote the article “An Essay towards solving a Problem in the Doctrine of Chances”, for which I am grateful. It took another 246 years until Lennart Ljung admitted me to the group and gave me the opportunity to find my supervisor and my research subject. Thank
you Lennart.
I want to thank my supervisor Fredrik Gustafsson for his engagement in the
beginning and the end of my PhD studies and his patience along the way. I want to thank the head of the Division of Automatic Control, Svante Gunnarsson, for giving me the space to maneuver and making exceptions to traditions whenever I wished for them.
Umut Orguner possesses every quality one would desire in a supervisor. I
offer my sincerest gratitude to Umut Orguner for sharing his vast knowledge, his
patience and his generosity. Thank you Emre Özkan and Fredrik Gustafsson for proofreading this thesis.
During my graduate studies I had the opportunity to collaborate with many
talented researchers. It was a pleasure to co-author papers with Jonas Sjöberg,
Jonas Bärgman, Mathias Lidberg, Robert Thomson, Anders Hansson, Johan
Löfberg, Mikael Norrlöf, Fredrik Larsson, Michael Felsberg, Fredrik Gustafsson,
Thomas B. Schön, Christian Lundquist, Umut Orguner, Emre Özkan, Karl
Granström, Tianshi Chen, Henri Nurminen, Robert Piché, Francesca P. Carli,
Alessandro Chiuso, Lennart Ljung, and Gianluigi Pillonetto. Not all the research
work resulted in publications; my conversations with Fredrik Lindsten, Saikat
Saha, Michael Roth, Daniel Axehill, Martin Enqvist, Carsten Fritsche, Liam Ellis,
and Claudio Altafini have been enlightening. I am very excited about my ongoing collaborations with Rafael Rui, Michael Roth, Henri Nurminen, and Mikhail
Lifshits. Furthermore, I learned a lot from Anders Hansson and Torkel Glad
through their graduate courses for which I am grateful.
I would like to thank the division’s secretaries Ulla Salaneck, Åsa Karmelind and Ninna Stensgård. Peter Rosander, with whom I shared my office for three years, is the kindest and most considerate Swedish man I have met so far. I want to thank Michael Roth and Kora Neupert for being such wonderful friends. I want to thank Henrik Ohlsson for helping me get on my board so many times on those cold windy days.
I want to extend my gratitude to George Mathai, Niklas Wahlström, Alicia Pamela Tonnoli, Bram Dil, Ylva Jung, Fredrik Lindsten, Patrik Axelsson, Manon Kok, Martin Lindfors, Gustav Lindmark, Christian Andersson Naesseth, Hanna Nyqvist, Jonas Linder, Karl Granström, André Carvalho Bittencourt and Christian Lundquist for all the good times, be it playing board games, skiing in the Alps, or kitesurfing on the Swedish west coast.
I also want to thank Rikard Falkeborn, Daniel Petersson, Martin Skoglund
and Christian Lyzell for patiently helping me with software-related questions, and especially Henrik Tidefelt for his contributions to the LaTeX template used
for this thesis.
I offer my sincerest gratitude to the hardworking people of Sweden who gave me the possibility to pursue my PhD degree without worrying about funding and gave me the opportunity to have paid paternity leave. I wish to acknowledge the financial support from the frame project Extended Target Tracking and the project Scalable Kalman Filters, both funded by the Swedish Research Council.
I want to thank my teachers at Sharif University of Technology for being exceptionally dedicated teachers, especially Mohammad Durali and Ali Meghdari.
I want to thank Azadeh, Nazanin, Navid, Kavous, Hamed, Shahin, Shahla, Hossein and Hooshang for their support during my undergraduate studies.
Adrian! Your very existence, beautiful smiles, big hugs and kisses have been my shelter during these years. I love you more than “a thousand million and nine hundreds ninety one”. Last but not least, I want to thank my wonderful parents who taught me many things by example, above all how to endure during difficult times.
Linköping, October 2015
Tohid Ardeshiri
Contents

Notation

I Background

1 Introduction
   1.1 Approximate Bayesian Inference
   1.2 Contributions
   1.3 Publications
   1.4 Applications
       1.4.1 Bicycle tracking using ellipse extraction
       1.4.2 Positioning using ultra wide-band data
       1.4.3 Path tracking for robots
   1.5 Thesis outline
       1.5.1 Outline of Part I
       1.5.2 Outline of Part II

2 Entropy, Exponential Family, and Variational Bayes
   2.1 Entropy
       2.1.1 Maximum entropy prior distributions
   2.2 Exponential Family
   2.3 Variational Bayes

3 System Identification
   3.1 Impulse Response Identification
   3.2 Continuous-time impulse response
   3.3 Discrete-time impulse response
   3.4 Maximum Entropy Kernel

4 Mixture Reduction
   4.1 Mixture Reduction
   4.2 Mixture Reduction for Target Tracking
   4.3 Greedy mixture reduction
   4.4 Divergence measures
       4.4.1 Integral square error
       4.4.2 Kullback-Leibler Divergence
       4.4.3 α-Divergences
   4.5 Numerical comparison of mixture reduction algorithms

5 Concluding remarks

A Expressions for some members of exponential family

B Multiple hypothesis testing

C Implementation aspects of the ISE approach

Bibliography

II Publications

A Maximum entropy properties of discrete-time first-order stable spline kernel
   1 Introduction
   2 MaxEnt property of Wiener and SS-1 kernels
       2.1 DT Wiener process
       2.2 The first order SS kernel
   3 Special structure of Wiener and SS-1 kernels and their MaxEnt interpretation
       3.1 MaxEnt covariance completion
   4 Conclusion
   Bibliography

B Approximate Bayesian Smoothing with Unknown Process and Measurement Noise Covariances
   1 Introduction
   2 Problem Definition
   3 Variational Solution
   4 Simulations
       4.1 Unknown time-varying noise covariances
       4.2 Unknown time-invariant noise covariances
   5 Discussion and Conclusion
   A Derivations for the smoother
       A.1 Derivations for the approximate posterior q_x^{(i+1)}(·)
       A.2 Derivations for the approximate posterior q_Q^{(i+1)}(·)
       A.3 Derivations for the approximate posterior q_R^{(i+1)}(·)
       A.4 Calculation of the expected values
   B Comparison with Expectation Maximization
   Bibliography

C Robust Inference for State-Space Models with Skewed Measurement Noise
   1 Introduction
   2 Skew t-distribution
   3 Problem formulation
   4 Variational solution
   5 Simulations
       5.1 One-dimensional positioning
       5.2 Pseudorange positioning
   6 Conclusions
   A Derivations for the smoother
       A.1 Derivations for q_x
       A.2 Derivations for q_u
       A.3 Derivations for q_Λ
   B Derivations for the filter
       B.1 Derivations for q_x
       B.2 Derivations for q_u
       B.3 Derivations for q_Λ
   Bibliography

D Bayesian Inference via Approximation of Log-likelihood for Priors in Exponential Family
   1 Introduction
   2 The Exponential Family
       2.1 Conjugate Priors in Exponential Family
       2.2 Conjugate Likelihoods in Exponential Family
   3 Measurement Update via Approximation of the Log-likelihood
       3.1 Taylor series expansion
       3.2 The extended Kalman filter
       3.3 A general linearization guideline
   4 Extended Target Tracking
       4.1 The problem formulation
       4.2 Solution proposed by Feldmann et al. (Feldmann et al., 2011)
       4.3 ETT via log-likelihood linearization
   5 Numerical simulation
       5.1 Monte-Carlo simulations
       5.2 Single extended target tracking scenario
   6 Conclusion
   7 Acknowledgments
   A Proof of Lemma 13
   B First Order Taylor Series Approximations for Some Scalar Valued Functions of Matrix Variables
   C Proof of EKF derivation in Example 9
   Bibliography

E Greedy Reduction Algorithms for Mixtures of Exponential Family
   1 Introduction
   2 Background
       2.1 Mixtures and Their Reduction
       2.2 Exponential Family of Distributions
   3 Merging algorithm
   4 General Mixture Reduction Algorithms
       4.1 Global Approaches
       4.2 Local Approach
   5 Numerical simulations
       5.1 Example-I
       5.2 Example-II
   6 Conclusion
   A Proof of Theorem 1
   Bibliography

F Gaussian Mixture Reduction Using Reverse Kullback-Leibler Divergence
   1 Introduction
   2 Background
   3 Related work
       3.1 Runnalls’ Method
       3.2 Williams’ Method
       3.3 Discussion
   4 Proposed Method
   5 Approximations for RKLD
       5.1 Approximations for pruning hypotheses
       5.2 Approximations for merging hypotheses
   6 Simulation Results
       6.1 Example with real world data
       6.2 Robust clustering
   7 Conclusion
   A Proof of Lemma 5
   B Derivation of V(q_K, q_I, q_J)
   Bibliography
Notation

Abbreviations

Abbreviation   Meaning
ARMSE          Average Root Mean Square Error
ETT            Extended Target Tracking
EKF            Extended Kalman Filter
EM             Expectation Maximization
GLM            Generalized Linear Models
GM             Gaussian Mixture
GM-PHD         Gaussian Mixture Probability Hypothesis Density
GP             Gaussian Process
GPS            Global Positioning System
iid            independent and identically distributed
INLA           Integrated Nested Laplace Approximation
KF             Kalman Filter
KLD            Kullback-Leibler Divergence
LTI            Linear Time-Invariant
MAP            Maximum A Posteriori
MC             Monte-Carlo
MR             Mixture Reduction
MRA            Mixture Reduction Algorithm
MTT            Multiple Target Tracking
PDF            Probability Density Function
PHD            Probability Hypothesis Density
RKLD           Reverse Kullback-Leibler Divergence
RMSE           Root Mean Square Error
RTS            Rauch-Tung-Striebel
SS-1           First Order Stable Spline
STVBF          Skew-t Variational Bayes Filter
STVBS          Skew-t Variational Bayes Smoother
VB             Variational Bayes

Some sets

Notation       Meaning
N              Set of natural numbers
R              Set of real numbers
S^d_{++}       Set of d × d symmetric positive definite matrices

Probability

Notation       Meaning
E              Expectation
V              Variance
D_KL(·||·)     Kullback-Leibler divergence
H              Differential entropy
H              Differential entropy rate
∼              Distributed according to or sampled from

Common distributions

Notation              Meaning
N(µ, Σ)               Multivariate Gaussian with mean µ and covariance Σ
U(a, b)               Uniform over the interval [a, b]
Exp(x; λ)             Exponential with rate λ
Weibull(λ, k)         Weibull with scale λ and shape k
Laplace(µ, b)         Laplace with location µ and scale b
Rayleigh(σ)           Rayleigh with scale σ
log-N(µ, σ)           Log-normal with location µ and scale σ
Gamma(α, β)           Gamma with shape α and rate β
IGamma(α, β)          Inverse gamma with shape α and rate β
W_d(n, V)             Wishart with degrees of freedom n and scale matrix V ∈ S^d_{++}
IW_d(ν, Ψ)            Inverse Wishart with degrees of freedom ν and scale matrix Ψ ∈ S^d_{++}
t(µ, σ^2, ν)          Student’s t-distribution with location parameter µ, spread parameter σ, and degrees of freedom ν
T(·; 0, 1, ν)         Cumulative distribution function (CDF) of Student’s t-distribution with degrees of freedom ν
ST(z; µ, σ^2, δ, ν)   Skew t-distribution with location parameter µ, spread parameter σ, shape parameter δ and degrees of freedom ν
N+(µ, Σ)              Truncated multivariate Gaussian with the closed positive orthant as support, location parameter µ and squared-scale matrix Σ

Operators and Symbols

Notation       Meaning
tr(A)          Trace of matrix A
det(A), |A|    Determinant of matrix A
A^T            Transpose of matrix A
vec(A)         Vectorized matrix A
Diag(·)        Diagonal matrix whose diagonal elements are the arguments of the operator
x_{m:n}        Sequence (x_m, x_{m+1}, · · · , x_n)
I_d            d-dimensional identity matrix
H              Hypothesis
argmin_λ       Minimizing argument with respect to λ
argmax_λ       Maximizing argument with respect to λ
Part I
Background
1 Introduction
This chapter introduces the research area considered in this thesis and summarizes its contributions. In Section 1.1, an introduction to approximate Bayesian inference is given. In Section 1.2, the main contributions are summarized. In Section 1.3, the publications by the PhD candidate are listed. In Section 1.4, three applied research results produced by the PhD candidate are presented. In Section 1.5, the outline of the thesis is given.
1.1 Approximate Bayesian Inference
Bayesian inference is a statistical inference technique in which Bayes’ theorem
is used to update the probability distribution of a random latent variable using
observations. This technique provides a mathematical tool for modeling systems
where uncertainties of the model, as well as the system, are reflected by probability distributions. The probabilistic models, which are constructed from probability distributions describing our knowledge about the system, are manipulated using the rules of probability calculus.
Probabilistic models describe the relation between the random latent variables, the deterministic parameters, and the measurements. Such relations are
specified by prior distributions of the latent variables p(x), and the likelihood
function p(y|x) which gives a probabilistic description of the measurements given
(some of) the latent variables. Using the probabilistic model and the measurements, the exact posterior can be expressed in functional form using Bayes’ rule

p(x|y) = p(x) p(y|x) / ∫ p(x) p(y|x) dx.    (1.1)
The prior knowledge about the latent variables and the parameters is expressed via prior distributions. Ideally, the prior distribution should express
this prior knowledge about the latent variables without any extra assumptions.
The maximum entropy method (Jaynes, 1982) provides a tool to express the prior
knowledge in the form of prior distributions without further assumptions. Describing the prior knowledge about a random variable using compact analytical expressions is not always feasible. In such cases, approximation methods are required. One of the contributions in this thesis concerns such approximations.
Determination of the posterior distribution of a latent variable x given the
measurements (observed data) y is at the core of Bayesian inference using probabilistic models. The exact posterior distribution can be analytical. A subclass of
cases where the posterior is analytical is when the posterior belongs to the same
family of distributions as the prior distribution. In such cases, the prior distribution is called a conjugate prior for the likelihood function. A well-known example where an analytical posterior is obtained using conjugate priors is when the
latent variable is a priori normal-distributed and the likelihood function given
the latent variable as its mean is again normal.
Example 1.1
Let x have a normal prior distribution with mean µ and covariance Σ, i.e., x ∼
N (x; µ, Σ). A measurement y with the likelihood function p(y|x) = N (y; Hx, R)
is in hand where H is a matrix with proper dimensions and R is a covariance
matrix. The posterior distribution of x can be obtained using Bayes’ rule given in (1.1):

p(x|y) = p(x) p(y|x) / ∫ p(x) p(y|x) dx    (1.2)
       = N(x; µ, Σ) N(y; Hx, R) / ∫ N(x; µ, Σ) N(y; Hx, R) dx.    (1.3)

The posterior distribution p(x|y) has an analytical solution and turns out to be the normal distribution N(x; µ′, Σ′) whose parameters can be computed via the closed-form expressions

µ′ = µ + K(y − Hµ),    (1.4a)
Σ′ = Σ − KHΣ,    (1.4b)

where

K = ΣH^T (HΣH^T + R)^{−1}.    (1.5)
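To make the closed-form update (1.4)–(1.5) concrete, here is a minimal NumPy sketch of the conjugate Gaussian measurement update; the function name and all numerical values of µ, Σ, H, R and y are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def gaussian_measurement_update(mu, Sigma, y, H, R):
    """Conjugate update of a Gaussian prior N(mu, Sigma) with a linear
    Gaussian likelihood N(y; H x, R), following (1.4)-(1.5)."""
    S = H @ Sigma @ H.T + R             # innovation covariance
    K = Sigma @ H.T @ np.linalg.inv(S)  # gain, K = Sigma H^T (H Sigma H^T + R)^(-1)
    mu_post = mu + K @ (y - H @ mu)     # posterior mean, cf. (1.4a)
    Sigma_post = Sigma - K @ H @ Sigma  # posterior covariance, cf. (1.4b)
    return mu_post, Sigma_post

# Illustrative numbers (assumed, not from the thesis).
mu = np.array([0.0, 1.0])
Sigma = np.eye(2)
H = np.array([[1.0, 0.0]])              # observe the first state component
R = np.array([[0.5]])
y = np.array([0.3])

mu_post, Sigma_post = gaussian_measurement_update(mu, Sigma, y, H, R)
print(mu_post, Sigma_post)
```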
The exact posterior distribution of a latent variable cannot always be given a
compact analytical expression. In the following, three examples of such cases will
be given. In Example 1.2, a problem that is encountered in nonlinear filtering is
presented.
Example 1.2
Let x have a normal prior distribution p(x) = N (x; µ, Σ). A measurement y with
the likelihood function p(y|x) = N(y; h(x), R) is in hand, where h(·) is a vector-valued nonlinear function and R is a covariance matrix. The posterior distribution of x can be expressed using Bayes’ rule given in (1.1). However, the posterior distribution does not necessarily have a compact analytical solution. A remedy can be obtained by approximation of the likelihood function via linearization of the function h(·) around the prior mean µ as in

h(x) ≈ h(µ) + Ĥ(x − µ),    (1.6)

where

Ĥ ≜ ∇_x h(x)|_{x=µ}.    (1.7)

Using the approximate likelihood p(y|x) = N(y; Ĥx, R), the approximate posterior distribution can be computed using the analytical expressions given in Example 1.1.
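A minimal sketch of the remedy in Example 1.2: linearize h(·) at the prior mean, as in (1.6)–(1.7), and reuse the linear update from Example 1.1. The range-measurement function, the finite-difference Jacobian, and all numbers below are assumptions made for illustration only.

```python
import numpy as np

def numerical_jacobian(h, x, eps=1e-6):
    """Finite-difference approximation of H_hat = dh/dx evaluated at x, cf. (1.7)."""
    hx = np.atleast_1d(h(x))
    J = np.zeros((hx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.atleast_1d(h(x + dx)) - hx) / eps
    return J

def linearized_update(mu, Sigma, y, h, R):
    """Approximate posterior for p(y|x) = N(y; h(x), R) via linearization
    of h around the prior mean, as in (1.6)-(1.7) and Example 1.1."""
    H = numerical_jacobian(h, mu)
    S = H @ Sigma @ H.T + R
    K = Sigma @ H.T @ np.linalg.inv(S)
    mu_post = mu + K @ (np.atleast_1d(y) - np.atleast_1d(h(mu)))
    Sigma_post = Sigma - K @ H @ Sigma
    return mu_post, Sigma_post

# Assumed example: a range measurement of a 2D position.
h = lambda x: np.array([np.sqrt(x[0]**2 + x[1]**2)])
mu, Sigma = np.array([1.0, 1.0]), 0.5 * np.eye(2)
R = np.array([[0.1]])
y = np.array([1.6])
print(linearized_update(mu, Sigma, y, h, R))
```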
One of the contributions in this thesis concerns the problem and the solution
proposed in Example 1.2. In Example 1.3, a problem that is encountered in simulating a Markov chain with multi-modal transition density will be presented.
Example 1.3
Consider a Markov chain with transition density:
p(x_{k+1} | x_k) = w N(x_{k+1}; A x_k, Q) + (1 − w) N(x_{k+1}; Ā x_k, Q),    (1.8)

where 0 < w < 1, and the factors A and Ā are two square matrices. We are interested in the marginal distribution of x_{1000}, where x_1 has the distribution p(x_1) = N(x_1; µ_1, Σ_1). The marginal distribution of x_k can be obtained recursively by integration as in

p(x_k) = ∫ p(x_k | x_{k−1}) p(x_{k−1}) dx_{k−1}.    (1.9)

The first step of the recursion is computed here:

p(x_2) = ∫ [ w N(x_2; A x_1, Q) + (1 − w) N(x_2; Ā x_1, Q) ] N(x_1; µ_1, Σ_1) dx_1    (1.10)
       = w N(x_2; A µ_1, Q + A Σ_1 A^T) + (1 − w) N(x_2; Ā µ_1, Q + Ā Σ_1 Ā^T).    (1.11)

Figure 1.1: A probabilistic graphical model for a stochastic dynamical system with latent states x_1, …, x_T and measurements y_1, …, y_T.

Although the marginal density of x_2 can be computed analytically, the complexity of p(x_2) has increased compared to p(x_1) due to the increase in the number of Gaussian components needed to express p(x_2). The number of Gaussian components needed to express the marginal density of x_k grows exponentially with respect to k and is 2^{k−1}. In order to maintain the computational complexity of the marginal distributions of x_k at a tractable level, the number of components needs to be reduced via approximation of the true marginal density of x_k with another distribution with fewer components. A candidate solution for the problem is minimizing a statistical distance between the true density of x_k and its approximation.
Two of the contributions in this thesis concern the problem described in Example 1.3. Approximate Bayesian inference is particularly important when the
measurements appear sequentially in time as in the filtering task for a stochastic
dynamical system, whose probabilistic graphical model is presented in Figure 1.1.
In Example 1.4, the Bayesian filtering recursion is introduced and the need for approximations is highlighted. Three contributions in this thesis concern problems
of similar nature to Example 1.4.
Example 1.4
Consider a stochastic dynamical system represented by the following recursion
x_1 ∼ p(x_1),    (1.12a)
y_k ∼ p(y_k | x_k),    (1.12b)
x_{k+1} ∼ p(x_{k+1} | x_k).    (1.12c)

The Bayesian filtering recursion corresponds to computing the posterior distributions p(x_k | y_{1:k}):

p(x_k | y_{1:k}) = p(x_k | y_{1:k−1}) p(y_k | x_k) / ∫ p(x_k | y_{1:k−1}) p(y_k | x_k) dx_k.    (1.13)

The density p(x_k | y_{1:k−1}) in the numerator of (1.13) is called the predicted density of x_k and is obtained by integration as in

p(x_k | y_{1:k−1}) = ∫ p(x_k | x_{k−1}) p(x_{k−1} | y_{1:k−1}) dx_{k−1}.    (1.14)
In such filtering problems, the posterior after processing the last measurement becomes the prior distribution at the next time step. To be able to use the same inference algorithm in a recursive manner, the posterior distribution at each time step should retain the same form as the prior. When this is not the case, approximations can be used. One class of such approximations is the variational approximations, where the posterior is assumed to have a specific functional form (the same as the prior). Subsequently, a statistical distance between the assumed posterior and the true posterior is minimized to find the hyper-parameters of the assumed (approximate) posterior.
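For a linear state-space model with Gaussian noise, both the prediction (1.14) and the update (1.13) stay in closed form and the posterior keeps the Gaussian form of the prior, so the recursion can run indefinitely; this is the Kalman filter. The sketch below, with an assumed constant-velocity model and simulated measurements, illustrates that recursion.

```python
import numpy as np

def kalman_filter(y_seq, mu0, P0, A, Q, H, R):
    """Closed-form Bayesian filtering recursion for a linear Gaussian model:
    the prediction implements (1.14), the update implements (1.13)."""
    mu, P = mu0, P0
    estimates = []
    for y in y_seq:
        # Prediction: p(x_k | y_{1:k-1}) = N(A mu, A P A^T + Q)
        mu_pred, P_pred = A @ mu, A @ P @ A.T + Q
        # Update: the posterior is again Gaussian, so the recursion continues.
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        mu = mu_pred + K @ (y - H @ mu_pred)
        P = P_pred - K @ H @ P_pred
        estimates.append(mu)
    return estimates

# Assumed constant-velocity model with simulated noisy position measurements.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
H = np.array([[1.0, 0.0]])
R = np.array([[0.25]])
rng = np.random.default_rng(0)
y_seq = [np.array([k + rng.normal(0.0, 0.5)]) for k in range(10)]
print(kalman_filter(y_seq, np.zeros(2), np.eye(2), A, Q, H, R)[-1])
```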
Several methods for approximate inference over probabilistic models have been proposed in the literature, such as variational Bayes (Jordan et al., 1999), expectation propagation (Minka, 2001), integrated nested Laplace approximation (INLA) (Rue et al., 2009), generalized linear models (GLMs) (Nelder and Wedderburn, 1972), and Monte-Carlo (MC) sampling methods (Hastings, 1970; Geman and Geman, 1984).
Variational Bayes (VB) and expectation propagation (EP) are two optimization-based solutions to the approximate Bayesian inference problem (Wainwright and Jordan, 2008). In these two approaches, the Kullback-Leibler divergence (Cover and Thomas, 2006) between the true posterior distribution and an approximate posterior is minimized. INLA is a technique to perform approximate Bayesian inference in latent Gaussian models (Hennevogl et al., 2001) using the Laplace approximation. GLMs are an extension of ordinary linear regression to the case where the errors belong to the exponential family.
Sampling methods such as Markov Chain Monte Carlo (MCMC) methods provide a general class of solutions to the approximate Bayesian inference problem.
In this thesis, the focus is on fast analytical approximations which are applicable to large-scale inference problems. These approximations provide solutions to Bayesian inference problems whose basic versions are described in Examples 1.2, 1.3 and 1.4. These analytical approximations either involve minimization of a statistical divergence between the true distribution and its approximation, or are based on the expansion of a function with respect to a basis function.
In Section 1.2, the main contributions of this thesis are summarized. In Section 1.5, the connection between the problems highlighted here in the form of Examples 1.2, 1.3 and 1.4 and their corresponding contributions in this thesis will be drawn.
1.2 Contributions
The contributions of this thesis address various aspects of Bayesian inference. These contributions can be categorized into three groups:

1. Prior selection: The prior information about a stochastic process in a Gaussian process regression problem can be encoded in the covariance function. The maximum entropy properties of a covariance function for Gaussian process regression, referred to as the discrete-time first-order stable spline kernel, are proven.

2. Determination of the posterior distribution for dynamical systems: Approximate posteriors for two Bayesian inference problems are derived using the variational Bayes technique. Furthermore, an approximation method for general Bayesian inference problems using linearization of the log-likelihood function is proposed. The contributions in this category concern problems such as those highlighted in Examples 1.2 and 1.4.

3. Maintenance of computational complexity: The contributions in this category concern the maintenance of computational complexity in problems such as the one introduced in Example 1.3.
1.3 Publications
The following papers, listed in reverse chronological order, have been published or submitted for publication:
T. Ardeshiri, U. Orguner, and F. Gustafsson. Bayesian inference via approximation of log-likelihood for priors in exponential family. ArXiv
e-prints, October 2015b. Submitted to Signal Processing, IEEE Transactions on.
T. Ardeshiri, E. Özkan, U. Orguner, and F. Gustafsson. Approximate
Bayesian smoothing with unknown process and measurement noise
covariances. To appear in Signal Processing Letters, IEEE, 2015.
T. Chen, T. Ardeshiri, F. P. Carli, A. Chiuso, L. Ljung, and G. Pillonetto.
Maximum entropy properties of discrete-time first-order stable spline
kernel. To appear in Automatica, 2015.
T. Ardeshiri, U. Orguner, and E. Özkan. Gaussian Mixture Reduction
Using Reverse Kullback-Leibler Divergence. ArXiv e-prints, August
2015. To be Submitted to Signal Processing, IEEE Transactions on.
H. Nurminen, T. Ardeshiri, R. Piché, and F. Gustafsson. A NLOSrobust TOA positioning filter based on a skew-t measurement noise
model. In 2015 International Conference on Indoor Positioning and
Indoor Navigation (IPIN), Banff, Alberta, Canada, October 2015b.
H. Nurminen, T. Ardeshiri, R. Piché, and F. Gustafsson. Robust inference for state-space models with skewed measurement noise. Signal
Processing Letters, IEEE, 22(11):1898–1902, Nov 2015a. ISSN 1070-9908. doi: 10.1109/LSP.2015.2437456.
T. Ardeshiri, K. Granström, E. Özkan, and U. Orguner. Greedy reduction algorithms for mixtures of exponential family. Signal Processing Letters, IEEE, 22(6):676–680, June 2015a. ISSN 1070-9908. doi:
10.1109/LSP.2014.2367154.
T. Ardeshiri and T. Chen. Maximum entropy property of discretetime stable spline kernel. In Acoustics, Speech and Signal Processing
(ICASSP), 2015 IEEE International Conference on, pages 3676–3680,
April 2015. doi: 10.1109/ICASSP.2015.7178657.
T. Ardeshiri and E. Özkan. An adaptive PHD filter for tracking with
unknown sensor characteristics. In Information Fusion (FUSION),
2013 16th International Conference on, pages 1736–1743, July 2013.
T. Ardeshiri, U. Orguner, C. Lundquist, and T. Schön. On mixture
reduction for multiple target tracking. In Information Fusion (FUSION), 2012 15th International Conference on, pages 692–699, July
2012.
T. Ardeshiri, F. Larsson, F. Gustafsson, T. Schön, and M. Felsberg. Bicycle tracking using ellipse extraction. In Information Fusion (FUSION),
2011 Proceedings of the 14th International Conference on, pages 1–8,
July 2011a.
T. Ardeshiri, M. Norrlöf, J. Löfberg, and A. Hansson. Convex optimization approach for time-optimal path tracking of robots with speed dependent constraints. In Proceedings of the 18th IFAC World Congress,
Milan, Italy, pages 14648–14653, August 2011b.
T. Ardeshiri, S. Kharrazi, R. Thomson, and J. Bärgman. Offset eliminative map matching algorithm for intersection active safety applications. In Intelligent Vehicles Symposium, 2006 IEEE, pages 82–88,
2006b. doi: 10.1109/IVS.2006.1689609.
T. Ardeshiri, S. Kharrazi, J. Sjöberg, J. Bärgman, and L. M. Sensor fusion for vehicle positioning in intersection active safety applications.
In International Symposium on Advanced Vehicle Control, 2006a.
1.4 Applications
In this section, a summary of three applied research results produced by the PhD
candidate is presented.
1.4.1 Bicycle tracking using ellipse extraction
A new approach to track bicycles from imagery sensor data is proposed in
(Ardeshiri et al., 2011a). It is based on detecting ellipsoids in the image as in
Figures 1.2 and 1.3. These ellipses are treated pair-wise using a dynamic
bicycle model illustrated in Figure 1.4. One important application area is in automotive collision avoidance systems, where no dedicated systems for bicyclists
yet exist and where very few theoretical studies have been published. Possible
conflicts can be predicted from the position and velocity state in the model, but
also from the steering wheel articulation and roll angle that indicate yaw changes
before the velocity vector changes. An algorithm is proposed in (Ardeshiri et al.,
2011a) which consists of an ellipsoid detection and estimation algorithm and a
particle filter. A simulation study of three critical single target scenarios is presented, and the algorithm is shown to produce excellent state estimates. An experiment using a stationary camera and the particle filter for state estimation is
performed and has shown encouraging results.
Figure 1.2: The green ellipses indicate measurements obtained from the two
bike wheels. The ellipse parameters are later fed through a particle filter
framework in order to estimate the bicycle state.
Figure 1.3: Ellipse extraction. Top left: Query image, Top right: Query image after background subtraction. Bottom: Ellipses plotted with 0.9 and 1.1
times the estimated size, the actual estimated ellipses are halfway between
the lines.
Figure 1.4: (a) Illustration of the coordinate system and the bicycle parameters. The wheelbase L and the distances from the center of gravity to the wheel centers are denoted by l1 and l2. The y-axis goes through the center of gravity and the x-axis goes through the wheel centers. (b) Illustration of the inclination θ of the bicycle. The inclination angle can be calculated using Newton’s second law of motion. The gravitational force is denoted by mg and the reaction force of the ground is denoted by N. (c) An extended bicycle model is used as the motion model, where ψ and δ are shown in this figure. The orientation of the camera at the origin of the global coordinate system is shown. (d) The slope of the bicycle’s track is denoted by α.
1.4.2 Positioning using ultra wide-band data
The skew-t variational Bayes filter (STVBF) (Nurminen et al., 2015a) is applied to
indoor positioning with time-of-arrival (TOA) based distance measurements and
pedestrian dead reckoning (PDR) in (Nurminen et al., 2015b). The proposed filter accommodates large positive outliers caused by occasional non-line-of-sight
(NLOS) conditions by using a skew-t model of measurement errors. Real-data
tests using the fusion of inertial-sensor-based PDR and ultra-wideband-based TOA ranging show that the STVBF clearly outperforms the extended Kalman filter (EKF) in positioning accuracy, with a computational complexity about three times that of the EKF. The tracking performance on one of the test tracks is illustrated in Figure 1.5.
Figure 1.5: Test track 1 consists of corridors and turns at corridor junctions.
1.4.3 Path tracking for robots
The task of generating time-optimal trajectories for a six-degrees-of-freedom industrial robot is discussed in (Ardeshiri et al., 2011b), and an existing convex optimization formulation of the problem is extended to include new types of constraints. The new constraints are speed-dependent and can be motivated from physical modeling of the motors and the drive system. It is shown how the speed-dependent constraints should be added in order to keep the convexity of the overall problem. A method to conservatively approximate the linear speed-dependent constraints by a convex constraint is also proposed (see Figure 1.6). A numerical example demonstrates the versatility of the extension proposed in (Ardeshiri et al., 2011b).
Figure 1.6: The torque at a joint of a robotic arm is plotted versus the square
of angular velocity of the same joint. The non-convex true feasible set is
approximated by a set of affine constraints. The true actuator’s constraints
is represented by the dashed line. The approximation of the feasible set by a
convex set is illustrated by the hatched area.
1.5 Thesis outline
The thesis is divided into two parts. In the rest of the first part, background material for these contributions will be provided.¹ In the second part of the thesis, a compilation of six edited publications is presented.
1.5.1 Outline of Part I
In Chapter 2, the concepts of entropy, relative entropy, and maximum entropy priors, and their relation to the exponential family are introduced. Also, a short introduction to the variational Bayes method is given. The background material in Chapter 2 is intended to lay the theoretical foundation for Papers A, B, C, D and E in the second part of this thesis.

In Chapter 3, a short introduction to the problem of identification of linear time-invariant, stable and causal systems using Gaussian process regression methods is given. This chapter is intended to give an introduction to the problem addressed in Paper A, which is about approximation of the prior knowledge for the purpose of devising a maximum entropy prior distribution.

¹ Parts of the material presented in the first part of the thesis have already been published by the author in the form of technical reports, conference papers and journal articles.
In Chapter 4, an introduction to the mixture reduction problem introduced
in Example 1.3 is presented. The mixture reduction problem is addressed in the
second part of this thesis by Papers E and F. Concluding remarks are given in
Chapter 5.
1.5.2 Outline of Part II
Part II of the thesis is a compilation of six edited contributions which are summarized in the following.
Maximum entropy properties of discrete-time first-order stable spline kernel
Paper A
T. Chen, T. Ardeshiri, F. P. Carli, A. Chiuso, L. Ljung, and G. Pillonetto.
Maximum entropy properties of discrete-time first-order stable spline
kernel. To appear in Automatica, 2015.
presents the maximum entropy properties of the discrete-time first-order stable spline kernel. The first order stable spline (SS-1) kernel (also known as the tuned-correlated kernel) is used extensively in regularized system identification, where the impulse response is modeled as a zero-mean Gaussian process whose covariance function is given by well-designed and tuned kernels. In particular, the exact maximum entropy problem solved by the SS-1 kernel without Gaussian and uniform sampling assumptions is formulated. Under a general sampling assumption, the special structure of the SS-1 kernel (e.g., its tridiagonal inverse and its factorization have closed-form expressions) is derived. Also, a maximum entropy covariance completion interpretation of the kernel is given.
Approximate Bayesian smoothing with unknown process and measurement
noise covariances
Paper B
T. Ardeshiri, E. Özkan, U. Orguner, and F. Gustafsson. Approximate
Bayesian smoothing with unknown process and measurement noise
covariances. To appear in Signal Processing Letters, IEEE, 2015.
presents an adaptive smoother for linear state-space models with unknown process and measurement noise covariances. The proposed method utilizes the variational Bayes technique to perform approximate inference. The resulting smoother
is computationally efficient, easy to implement, and can be applied to high-dimensional linear systems. The performance of the algorithm is illustrated on a target
tracking example.
Robust inference for state-space models with skewed measurement noise
Paper C
H. Nurminen, T. Ardeshiri, R. Piché, and F. Gustafsson. Robust inference for state-space models with skewed measurement noise. Signal
Processing Letters, IEEE, 22(11):1898–1902, Nov 2015a. ISSN 10709908. doi: 10.1109/LSP.2015.2437456.
presents filtering and smoothing algorithms for linear discrete-time state-space
models with skewed and heavy-tailed measurement noise. The algorithms use a
variational Bayes approximation of the posterior distribution of models that have
normal prior and skew-t-distributed measurement noise. The proposed filter
and smoother are compared with conventional low-complexity alternatives in a
simulated pseudorange positioning scenario. In the simulations, the proposed methods achieve better accuracy than the alternative methods, with the computational
complexity of the filter being roughly 5 to 10 times that of the Kalman filter.
Bayesian inference via approximation of log-likelihood for priors in
exponential family
Paper D
T. Ardeshiri, U. Orguner, and F. Gustafsson. Bayesian inference via approximation of log-likelihood for priors in exponential family. ArXiv
e-prints, October 2015b. Submitted to Signal Processing, IEEE Transactions on.
presents a Bayesian inference technique based on Taylor series approximation of the logarithm of the likelihood function. The proposed approximation is devised for the case where the prior distribution belongs to the exponential family of distributions and is continuous. The logarithm of the likelihood function is linearized with respect to the sufficient statistic of the prior distribution in the exponential family such that the posterior obtains the same exponential family form as the prior. Similarities between the proposed method and the extended Kalman filter for nonlinear filtering are illustrated. Further, an extended target measurement update is derived for target models where the target extent is represented by a random matrix having an inverse Wishart distribution. The approximate update covers the important case where the spread of the measurements is due to the target extent as well as the measurement noise in the sensor.
Greedy reduction algorithms for mixtures of exponential family
Paper E
T. Ardeshiri, K. Granström, E. Özkan, and U. Orguner. Greedy reduction algorithms for mixtures of exponential family. Signal Processing Letters, IEEE, 22(6):676–680, June 2015a. ISSN 1070-9908. doi:
10.1109/LSP.2014.2367154.
presents a general framework for greedy reduction of mixture densities of the exponential family. The performance of the generalized algorithms is illustrated both on an artificial example, where randomly generated mixture densities are reduced, and on a target tracking scenario, where the reduction is carried out in the recursion of a Gaussian inverse Wishart probability hypothesis density (PHD) filter.
Gaussian mixture reduction using reverse Kullback-Leibler divergence
Paper F
T. Ardeshiri, U. Orguner, and E. Özkan. Gaussian Mixture Reduction
Using Reverse Kullback-Leibler Divergence. ArXiv e-prints, August
2015. To be Submitted to Signal Processing, IEEE Transactions on.
presents a greedy mixture reduction algorithm which is capable of pruning mixture components as well as merging them based on the Kullback-Leibler divergence (KLD). The algorithm is distinct from the well-known Runnalls’ KLD-based method since it is not restricted to merging operations. The capability of pruning (in addition to merging) gives the algorithm the ability to preserve the peaks of the original mixture during the reduction. Analytical approximations are derived to circumvent the computational intractability of the KLD, which results in a computationally efficient method. The proposed algorithm is compared with Runnalls’ and Williams’ methods in two numerical examples, using both simulated and real-world data. The results indicate that the performance and computational complexity of the proposed approach make it an efficient alternative to existing mixture reduction methods.
2 Entropy, Exponential Family, and Variational Bayes
The analytical approximations proposed in the second part of this thesis build upon the existing literature on maximum entropy priors, the exponential family of distributions, and variational Bayes. In this chapter, some preliminary definitions and results relating to these contributions will be given. In Section 2.1, the entropy and the relative entropy will be defined. Furthermore, maximum entropy distributions will be derived. The background material in Section 2.1 will lay the foundations for Paper A in the second part of the thesis. Also, the relationship between the maximum entropy priors and the exponential family will be explained. In Section 2.2, the exponential family of distributions and some of its properties will be given. This background material will be used in Papers E and D, which are about approximate inference techniques relating to the exponential family of distributions. In Section 2.3, the variational Bayes (VB) method is described. The VB method is used to derive approximate posteriors in Papers B and C.
2.1 Entropy
Entropy is a measure of the uncertainty of a random variable. In this thesis, only
continuous random variables are considered. Consequently, only the aspects of
information theory that are related to continuous random variables will be
covered. The definitions of the differential entropy and relative entropy will be
given in the following.
Definition 2.1. For a distribution with its support on S with density p( · ), the
differential entropy is defined by (Cover and Thomas, 2012)
H(p) = − ∫_S p(x) log p(x) dx.    (2.1)
Example 2.2
For the standard normal distribution in R^n, where p(x) = (2π)^{−n/2} exp{− Σ_{j=1}^n x_j^2 / 2} and log p(x) = −(n/2) log(2π) − (1/2) Σ_{j=1}^n x_j^2, the following holds:

H(p) = − E[log p(x)] = (n/2) log(2π) + n/2 = (n/2) log(2πe),

where e is Euler’s number.
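A quick Monte Carlo sanity check of Example 2.2 (not part of the original text): estimating −E[log p(x)] from samples reproduces the closed-form value (n/2) log(2πe). The dimension and sample size below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, num_samples = 3, 200_000

x = rng.standard_normal((num_samples, n))
# Log density of the standard normal distribution in R^n.
log_p = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum(x**2, axis=1)

mc_entropy = -np.mean(log_p)                     # Monte Carlo estimate of H(p)
closed_form = 0.5 * n * np.log(2 * np.pi * np.e)
print(mc_entropy, closed_form)                   # the two values agree closely
```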
Definition 2.3. The relative entropy or the Kullback-Leibler divergence between
two PDFs is defined by
D_KL(p || q) = E_{p(x)} [ log( p(x) / q(x) ) ].    (2.2)
2.1.1 Maximum entropy prior distributions
By maximizing the differential entropy of a distribution subject to constraints imposed by prior knowledge, the probability distribution which makes the fewest assumptions about the data can be obtained. In the following, the maximum entropy distribution subject to constraints expressed as equality constraints on the expectations of some functions will be derived.
Example 2.4
Maximize the entropy H(p) over all probability densities p(·) satisfying

1. p(x) ≥ 0, with equality outside the support set S,
2. ∫_S p(x) dx = 1,
3. ∫_S p(x) T_i(x) dx = α_i, for 1 ≤ i ≤ m.

The solution to the maximum entropy problem can be found using calculus (Cover and Thomas, 2012); the Lagrangian for the problem is given by

J(p) = − ∫ p log p + λ_0 ∫ p + Σ_{i=1}^m λ_i ∫ T_i p.    (2.3)

Since the entropy is a concave function defined over a convex set, we can compute the functional derivative and equate it to zero to obtain

∂J/∂p(x) = − log p(x) − 1 + λ_0 + Σ_{i=1}^m λ_i T_i(x) = 0.    (2.4)

Hence,

p(x) = exp( −1 + λ_0 + Σ_{i=1}^m λ_i T_i(x) ).    (2.5)
The result of the example above will be proven using the information inequality in the following theorem.
Theorem 2.5. Let p∗(x) = exp( −1 + λ_0 + Σ_{i=1}^m λ_i T_i(x) ), x ∈ S, where λ_0, λ_1, ..., λ_m are chosen so that p∗ satisfies (Cover and Thomas, 2012, Theorem 12.1.1)

1. p(x) ≥ 0, with equality outside the support set S,
2. ∫_S p(x) dx = 1,
3. ∫_S p(x) T_i(x) dx = α_i, for 1 ≤ i ≤ m.
p(x)Ti (x) dx = αi , for 1 ≤ i ≤ m .
Then, p∗ uniquely maximizes H(p) over all probability densities p satisfying the
constraints.
Proof: Proof is obtained using the information inequality. Let g satisfy the constraints. Then
Z
Z
Z
g ∗
∗
H(g) = − g ln g = − g ln ∗ p = −DK L (g||p ) − g ln p∗
p
S
S
S

Z
Z 
m
X



≤ − g ln p∗ = − g −1 + λ0 +
λi Ti 
S
i=1
S


Z
Z
m
X


∗

= − p −1 + λ0 +
λi Ti  = − p∗ ln p∗ = H(p∗ )
S
i=1
S
Note that the equality holds iff DK L (g||p∗ ) = 0 for all x. Therefore, g = p∗ except
for a set of measure 0.
Example 2.6
The maximum entropy distribution on the support S = (−∞, ∞) satisfying the constraints E[x] = µ and E[(x − µ)^2] = σ^2 is p(x) = N(x; µ, σ^2).
Example 2.7
The maximum entropy distribution on the support S = [0, +∞) satisfying the constraint E[x] = λ is p(x) = Exp(x; λ^{−1}).
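As a worked instance of Theorem 2.5 (added here for illustration), the derivation below shows how Example 2.7 follows from the form (2.5) with a single sufficient statistic T_1(x) = x on S = [0, ∞).

```latex
% Specializing (2.5) to m = 1 and T_1(x) = x on S = [0, \infty):
%   p(x) = \exp(-1 + \lambda_0 + \lambda_1 x) = c\, e^{\lambda_1 x},
%   with c := e^{-1 + \lambda_0}.
\begin{align*}
\int_0^\infty c\, e^{\lambda_1 x}\, dx = -\frac{c}{\lambda_1} = 1
  &\;\Longrightarrow\; c = -\lambda_1 \quad (\text{which requires } \lambda_1 < 0),\\
\int_0^\infty x\,(-\lambda_1)\, e^{\lambda_1 x}\, dx = -\frac{1}{\lambda_1} = \lambda
  &\;\Longrightarrow\; \lambda_1 = -\tfrac{1}{\lambda},\\
\text{hence}\quad p(x) &= \tfrac{1}{\lambda}\, e^{-x/\lambda} = \mathrm{Exp}(x; \lambda^{-1}),
\end{align*}
% i.e., the exponential density with rate 1/\lambda and mean \lambda,
% as stated in Example 2.7.
```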
Example 2.8
The maximum entropy distribution on the support S = [a, b], satisfying no constraint other than integrability, is p(x) = U(x; a, b).
The differential entropy of continuous random variables has some weaknesses compared to the entropy of discrete random variables, which are listed here.
Remark 2.9. Differential entropy differs from the entropy of a finely quantized version of the continuous random variable (the Shannon entropy) by the logarithm of the quantization resolution, which is infinite in the limit (Cover and Thomas, 2012, Theorem 8.3.1).
Remark 2.10. Differential entropy is not scale invariant on Rn . That is, for a vector-valued
random variable X ∈ Rn and a non-singular matrix A ∈ Rn×n (Cover and Thomas, 2012,
page 254)
H(AX) = H(X) + log | det(A)|.
(2.6)
Remark 2.11. Differential entropy can be negative. Hence, the well-known relation between the information content of a distribution and the Shannon entropy does not hold
for the differential entropy.
As we showed in Theorem 2.5, the maximum entropy distribution subject to the expectation constraints given in the theorem obtains the form exp( −1 + λ_0 + Σ_{i=1}^m λ_i T_i(x) ). In the following section, the exponential family of distributions will be introduced. The members of this family arise naturally as the solution to the problem of finding the maximum entropy distribution subject to expectation constraints on their sufficient statistic T(x).
2.2 Exponential Family
The exponential family of distributions (Wainwright and Jordan, 2008) includes
many common distributions such as Gaussian, beta, Dirichlet, gamma and Wishart.
The exponential family in its natural form can be represented by its natural parameters η, sufficient statistic T(x), log-partition function A(η) and base measure h(x) as in

q(x; η) = h(x) exp(η · T(x) − A(η)),    (2.7)

where the natural parameter η belongs to the natural parameter space Ω = {η ∈ R^m | A(η) < +∞}. Here a · b denotes the inner product of a and b. In Table 2.1, the sufficient statistics of some continuous members of the exponential family are given.
Definition 2.12. The set corresponding to all mean values of the sufficient statistics
M = {µ ∈ R^m | ∃p, E_p[T(x)] = µ}    (2.8)
is called the mean parameter space (Wainwright and Jordan, 2008).
Definition 2.13. In a regular exponential family the domain Ω is an open set (Wainwright and Jordan, 2008).
Definition 2.14. In a minimal representation of an exponential family a unique parameter vector is associated with each distribution (Wainwright and Jordan, 2008).
Table 2.1: Some continuous exponential family distributions and their sufficient statistics are listed.
Continuous Exp. Family Distribution                 T(·)
Exponential distribution                            x
Normal distribution with known variance σ²          x/σ
Normal distribution                                 (x, xxᵀ)
Pareto distribution with known minimum x_m          log x
Weibull distribution with known shape k             x^k
Chi-squared distribution                            log x
Dirichlet distribution                              (log x₁, · · · , log x_n)
Laplace distribution with known mean µ              |x − µ|
Inverse Gaussian distribution                       (x, 1/x)
Scaled inverse Chi-squared distribution             (log x, 1/x)
Beta distribution                                   (log x, log(1 − x))
Lognormal distribution                              (log x, (log x)²)
Gamma distribution                                  (log x, x)
Inverse gamma distribution                          (log x, 1/x)
Gaussian Gamma distribution                         (log τ, τ, τx, τx²)
Wishart distribution                                (log |X|, X)
Inverse Wishart distribution                        (log |X|, X⁻¹)
The formulas for representation of some probability distribution functions in
the exponential family form are given in Appendix A.
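As an illustration of (2.7) (a minimal sketch added here, not code from the thesis), the univariate Gaussian can be written in natural form with η₁ = µ/σ², η₂ = −1/(2σ²), T(x) = (x, x²), h(x) = 1/√(2π) and the log-partition function listed in Appendix A; the snippet below checks that the natural-form evaluation matches the usual density formula.

```python
import numpy as np

# N(x; mu, sigma^2) in the natural form q(x; eta) = h(x) exp(eta . T(x) - A(eta)) of (2.7),
# with A(eta) = -eta1^2/(4*eta2) - 0.5*log(-2*eta2) as in Appendix A.
def gaussian_natural_params(mu, sigma2):
    return np.array([mu / sigma2, -0.5 / sigma2])

def log_partition(eta):
    eta1, eta2 = eta
    return -eta1**2 / (4 * eta2) - 0.5 * np.log(-2 * eta2)

def expfam_pdf(x, eta):
    h = 1.0 / np.sqrt(2 * np.pi)                  # base measure
    T = np.array([x, x**2])                       # sufficient statistic
    return h * np.exp(eta @ T - log_partition(eta))

mu, sigma2, x = 1.5, 0.8, 0.3
eta = gaussian_natural_params(mu, sigma2)
direct = np.exp(-0.5 * (x - mu)**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
print(expfam_pdf(x, eta), direct)                 # the two evaluations agree
```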
2.3 Variational Bayes
The variational Bayes (VB) method is used to find an approximate solution to inference problems when an exact solution is not analytically tractable. Consider a Bayesian model in which prior distributions are assigned to all parameters and latent variables. We will denote all these parameters and latent variables by x, where x ≜ {x₁, x₂, · · · , x_n}. Now, consider the measurement vector y along with the joint distribution p(x, y).
When there is no analytical solution for the posterior p(x|y) we can look for an approximate analytical solution using the following factorized variational approximation,
p(x|y) ≈ q(x),    (2.9)
q(x) ≜ q₁(x₁) q₂(x₂) · · · q_n(x_n),    (2.10)
where the densities q₁(x₁), q₂(x₂), · · · , q_n(x_n) are the approximate posterior densities for x₁, x₂, · · · , x_n, respectively. The VB technique (Bishop, 2006, Ch. 10), (Tzikas et al., 2008) chooses the estimates q̂₁(x₁), q̂₂(x₂), · · · , q̂_n(x_n) for the factors in (2.10)
using the following optimization problem
q̂(x) = argmin_{q(x)} D_KL(q(x)||p(x|y)).    (2.11)
The optimal solution for the optimization problem satisfies the following set of equations,
log q̂_i(x_i) = E_{−i}[log p(x, y)] + const.,  1 ≤ i ≤ n,    (2.12)
where the term const. is constant with respect to the variable x_i and the subscript −i under the expectation operator means that the expectation is taken with respect to the factors other than q_i(x_i).
The solution to (2.12) can be obtained via fixed-point iterations where only one factor in (2.10) is updated and all the other factors are fixed to their last estimated values (Bishop, 2006, Ch. 10). The iterations converge to a local optimum of (2.11) (Bishop, 2006, Ch. 10), (Wainwright and Jordan, 2008, Ch. 3).
The posterior in (2.9) can be the smoothing distribution of the states and model parameters, which may not be analytically tractable. In Papers B and C, it is shown that the VB technique can be used to find an approximate posterior.
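To make the fixed-point iterations concrete, the following is a minimal sketch for a toy conjugate model (a Gaussian likelihood with unknown mean and precision, a standard textbook example and not one of the models treated in Papers B and C); the updates follow from (2.12) for the factorization q(µ)q(τ).

```python
import numpy as np

# Data x_i ~ N(mu, 1/tau), prior mu | tau ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0).
# q(mu, tau) = q(mu) q(tau) is refined by the coordinate-wise fixed-point updates of (2.12).
rng = np.random.default_rng(0)
x = rng.normal(2.0, 0.5, size=200)
N, xbar = x.size, x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0
E_tau = a0 / b0                                   # initial guess for E[tau]

for _ in range(50):
    # q(mu) = N(mu_N, 1/lam_N)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # q(tau) = Gamma(a_N, b_N), using E_mu[(x_i - mu)^2] = (x_i - mu_N)^2 + 1/lam_N
    a_N = a0 + 0.5 * (N + 1)
    b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N / lam_N
                      + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N))
    E_tau = a_N / b_N

print("posterior mean of mu ~", mu_N, "  E[tau] ~", E_tau, "(true tau = 4)")
```

Each pass updates one factor while the other is held fixed, which is exactly the coordinate-wise scheme described above.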
3 System Identification
This chapter concerns a maximum entropy prior for a specific approximate Bayesian inference problem. In particular, the prior information about the impulse response of a linear time-invariant (LTI) stable and causal system will be described. The background material presented in this chapter lays the foundation for the contribution in Paper A in the second part of this thesis, where the prior information is approximated to construct a maximum entropy kernel for Gaussian process regression. Parts of the background material are published in (Ardeshiri and Chen, 2015).
3.1 Impulse Response Identification
System identification is about how to construct mathematical models based on observed data, see e.g., (Ljung, 1999). For linear time-invariant (LTI) and causal systems, the identification problem can be stated as follows. Consider
y(t_i) = f ∗ u(t_i) + v(t_i),   i = 0, 1, · · · , N,    (3.1)
where t_i, i = 0, 1, · · · , N are the time instants at which the measured input u(t) and output y(t) are collected, v(t) is the disturbance, f(t) is the impulse response with t ∈ R₊ ≜ [0, ∞) for continuous-time systems and t = t_i, i = 0, 1, · · · for discrete-time systems, and f ∗ u(t_i) is the convolution of f(·) and u(·) evaluated at t = t_i. The goal is to estimate f(t) as well as possible.
Recently, there has been increasing interest in the system identification community in studying system identification problems with machine learning methods, see e.g., (Ljung et al., 2011), (Pillonetto et al., 2014). An emerging trend among others is to apply Gaussian process regression methods to LTI stable and causal system identification problems, see (Pillonetto and Nicolao, 2010) and its follow-up papers (Pillonetto et al., 2011), (Chen et al., 2012a), (Chen et al., 2014). The idea is to model the impulse response f(t) with a suitably defined Gaussian process which is characterized by
f (t) ∼ GP(m(t), k(t, s)),
(3.2)
where m(t) is the mean function and is often set to be zero, and k(t, s) is the covariance function, also called the kernel function in machine learning and statistics,
see e.g., (Rasmussen and Williams, 2006).
The kernel k(t, s) is parametrized by a hyper-parameter β and further written
as k(t, s; β). The key issue is to design a suitable parametrization of k(t, s; β), or
in other words, the structure of k(t, s; β), because it reflects our prior knowledge
about the system to be identified. Several kernel structures have been proposed
in the literature, e.g., the stable spline (SS) kernel in (Pillonetto and Nicolao,
2010) and the diagonal and correlated (DC) kernel in (Chen et al., 2012a).
Interestingly, (Pillonetto and Nicolao, 2011) shows, based on a result in (Nicolao et al., 1998), that for continuous-time systems the continuous-time first-order SS kernel (also derived by deterministic arguments in (Chen et al., 2012a) and called the Tuned Correlated (TC) kernel):
k(t, s) = min{e^{−βt}, e^{−βs}},   t, s ∈ R₊,    (3.3)
has a certain maximum entropy property. In Example 3.1 the impulse response identification problem is further illustrated.
Example 3.1
Consider the LTI system and the simulated input-output data presented in Figure 3.1.
Figure 3.1: Input u(t) (black) and output y(t) (green) data versus time (seconds).
The impulse response f can be computed using the input-output data and the prior knowledge about the form of the impulse response expressed by the kernel function
k(t, s) = min{e^{−βt}, e^{−βs}},   t, s = t_i, i = 0, 1, · · · ,    (3.4)
where f(t) ∼ GP(0, k(t, s)). The estimated impulse response is presented in Figure 3.2.
Figure 3.2: The estimated impulse response (dark blue) along with a one standard deviation band (cyan), versus time (seconds).
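As a complement to Example 3.1, the following is a minimal sketch (with an assumed toy system and an assumed, known value of β, not the data of Figure 3.1) of the kind of computation behind Figure 3.2: Gaussian process regression of a discrete-time impulse response under the first-order SS (TC) kernel (3.4).

```python
import numpy as np

# Measured output y = U f + v, where U is the convolution (Toeplitz) matrix of the input.
rng = np.random.default_rng(1)
n = 80                                               # length of the estimated impulse response
t = np.arange(n)
f_true = 0.5 * np.exp(-0.05 * t) * np.sin(0.2 * t)   # hypothetical true impulse response

u = rng.normal(size=200)                             # input signal
U = np.array([[u[i - j] if 0 <= i - j < len(u) else 0.0 for j in range(n)]
              for i in range(len(u))])
sigma2 = 0.01
y = U @ f_true + np.sqrt(sigma2) * rng.normal(size=len(u))

beta = 0.05                                          # kernel hyper-parameter (assumed known here)
K = np.minimum.outer(np.exp(-beta * t), np.exp(-beta * t))   # TC kernel of (3.4)

# GP posterior mean of f given y:  E[f|y] = K U^T (U K U^T + sigma2 I)^{-1} y
S = U @ K @ U.T + sigma2 * np.eye(len(u))
f_hat = K @ U.T @ np.linalg.solve(S, y)
print("relative error:", np.linalg.norm(f_hat - f_true) / np.linalg.norm(f_true))
```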
In the following, some characteristics of the impulse response of LTI stable and causal systems will be given in two separate sections, one for the continuous-time case and one for the discrete-time case.
3.2 Continuous-time impulse response
The prior knowledge about the continuous-time impulse response of a stable and causal LTI system consists of
1. bounded-input bounded-output (BIBO) stability of the system, and
2. smoothness of the impulse response.
For a continuous-time impulse response, BIBO stability is assured when the impulse response is absolutely integrable, i.e., its L₁-norm exists;
∫_{−∞}^{∞} |f(t)| dt = ‖f‖₁ < ∞.    (3.5)
The smoothness constraint on the continuous-time impulse responses can be addressed as in (Nicolao et al., 1998, Theorem 1), where the authors suggest that the smoothness of a signal can be imposed by assuming that the variances of its derivatives are finite,
V[df/dt] = λ,   λ < ∞.    (3.6)
The impulse response of a continuous-time LTI system and its L1 -norm is
illustrated in Figure 3.3.
Figure 3.3: The impulse response of a continuous-time LTI stable system.
The shaded area under the impulse response should be finite.
Some definitions which will be needed to solve the maximum entropy kernel estimation problem for continuous-time stochastic processes are given in the following.
Definition 3.2 (L₂ differentiation, (Åström, 1970, page 37)). A second-order stochastic process f is said to be differentiable in the mean square at t if the limit
lim_{s→0} (f(t + s) − f(t))/s = f′(t)    (3.7)
exists in the sense of mean square convergence, that is, if
lim_{s→0} E[((f(t + s) − f(t))/s − f′(t))²] = 0.    (3.8)
Recall that the derivative variances can be expressed via the spectral measure by
E[(f^(m)(t))²] = (1/2π) ∫_{−∞}^{∞} ω^{2m} S(ω) dω,    (3.9)
where the mth mean square derivative of f exists iff the integral on the right-hand side of (3.9) is finite (Lifshits, 2014, page 107).
Definition 3.3. The differential entropy rate of a real-valued continuous-time stochastic process f(·) is defined in (Nicolao et al., 1998) as
H(f) = (1/4π) ∫_{−∞}^{∞} log S(ω) dω.    (3.10)
3.3 Discrete-time impulse response
The prior knowledge about the discrete-time impulse response of a stable and causal LTI system consists of
1. bounded-input bounded-output (BIBO) stability of the system, and
2. smoothness of the impulse response.
BIBO stability is assured when the impulse response is absolutely summable, i.e., its ℓ₁-norm exists;
Σ_{n=−∞}^{∞} |f(n)| = ‖f‖₁ < ∞.    (3.11)
The smoothness constraint on the discrete-time impulse responses can be imposed by assuming that the variances of its finite differences are proportional to
the time increment over which the finite difference is computed;
V[f (ti+1 ) − f (ti )] = λ(ti+1 − ti ), ∞ > λ > 0.
(3.12)
Some definitions which will be needed to solve the maximum entropy kernel estimation problem for discrete-time stochastic processes are given in the following.
Definition 3.4 (Differential entropy rate of a sequence). Let {X(n)} be a sequence. Its differential entropy rate is defined as (Cover and Thomas, 2012)
H(X) ≜ lim_{n→∞} (1/n) H(p(X(1), ..., X(n))),    (3.13)
when the limit exists.
Note that stationarity or even wide-sense stationarity is not required for Definition 3.4 to hold.
Proposition 3.5. Among all sequences with a given covariance, the Gaussian one has the maximal differential entropy rate (Cover and Thomas, 2012).
Example 3.6
For an independent and identically distributed (iid) standard Gaussian sequence we have
H(X) = log √(2πe).    (3.14)
Let X(·) be a centered (zero-mean) discrete-time stationary sequence with autocorrelation R(·), where R(n) = E[X(n)X(0)]. Also, let S(·) denote the spectral density of X(·) on [−π, π]. The following holds for the spectral representation S(·) (Papoulis and Pillai, 2002, page 421),
R(n) = (1/2π) ∫_{−π}^{π} e^{inω} S(ω) dω.    (3.15)
Example 3.7
X(·) is the iid standard sequence with covariance function
R(n) = 1 for n = 0, and R(n) = 0 otherwise,    (3.16)
iff S(ω) = 1 for ω ∈ [−π, π].
Theorem 3.8. If a stationary sequence is Gaussian with spectral density S(ω), then (Papoulis and Pillai, 2002, page 663)
H(X) = log √(2πe) + (1/4π) ∫_{−π}^{π} log S(ω) dω.    (3.17)
Example 3.9
Here, we verify this theorem for an iid Gaussian sequence with variance σ². From (3.14) we obtain
H(X) = log √(2πe) + log σ.    (3.18)
From (3.17) we obtain the same result as in
H(X) = log √(2πe) + (1/4π) ∫_{−π}^{π} log σ² dω = log √(2πe) + log σ.    (3.19)
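A quick numerical check of this example (an illustration added here, not from the thesis) is sketched below; it evaluates the spectral integral in (3.17) for a flat spectrum S(ω) = σ² and compares it with (3.18).

```python
import numpy as np

# For an iid Gaussian sequence the spectral density is flat, S(w) = sigma^2 on [-pi, pi],
# and the entropy rate should equal log(sqrt(2*pi*e)) + log(sigma).
sigma = 0.7
w = np.linspace(-np.pi, np.pi, 100001)
S = np.full_like(w, sigma**2)

rate_spectral = (np.log(np.sqrt(2 * np.pi * np.e))
                 + np.sum(np.log(S)) * (w[1] - w[0]) / (4 * np.pi))
rate_direct = np.log(np.sqrt(2 * np.pi * np.e)) + np.log(sigma)
print(rate_spectral, rate_direct)   # the two values agree up to quadrature error
```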
3.4 Maximum Entropy Kernel
In (Pillonetto and Nicolao, 2011), the continuous-time stochastic process with maximum differential entropy rate subject to constraints on smoothness and bounded-input bounded-output (BIBO) stability is sought. In (Pillonetto and Nicolao, 2011), Definition 3.3 of the differential entropy rate of a stationary continuous-time Gaussian process g(t) with power spectrum S(ω) is adopted from (Nicolao et al., 1998). Furthermore, the following proposition is adopted from (Nicolao et al., 1998).
Proposition 3.10 ((Nicolao et al., 1998, Theorem 1)). Let g(t) be a zero-mean band-limited stationary Gaussian process with power spectrum S(ω) = 0 for |ω| > B. Given finite λ_{2k}, k = 0, 1, · · · , m, assume that there exist real numbers α_j, j = 0, 1, · · · , m such that
∫_{−B}^{B} ω^{2k} / (Σ_{j=0}^{m} α_j ω^{2j}) dω = 2πλ_{2k},   k = 0, 1, · · · , m.    (3.20)
Under this assumption, if there exists S(ω) that maximizes H(g) in
H(g) = (1/4π) ∫_{−∞}^{+∞} log S(ω) dω,    (3.21)
subject to the constraints V[d^k g(t)/dt^k] = λ_{2k}, k = 0, 1, · · · , m, then the spectrum is given by S(ω) = 1 / (Σ_{j=0}^{m} α_j ω^{2j}). In particular, if there are no constraints on the first m − 1 order derivatives, then the spectrum becomes S(ω) = 1/(α_m ω^{2m}).
Deriving the maximum entropy process in continuous time in (Nicolao et al., 1998) and (Pillonetto and Nicolao, 2011) is quite involved, due to the infinite-dimensional nature of the problem and the absence of a well-defined differential entropy rate for a generic continuous-time stochastic process.
In Paper A, we focus on discrete-time impulse responses (stochastic processes), and provide a simple and self-contained proof of the maximum entropy properties of the discrete-time first-order SS kernel (3.4). The advantages of working in the discrete-time domain include
1. The differential entropy rate is well-defined for discrete-time stochastic processes.
2. Given a stochastic process, its finite difference process can be well-defined in the discrete-time domain.
3. It is possible to show what maximum entropy property a zero-mean discrete-time Gaussian process with covariance function (3.4) has.
4 Mixture Reduction
The background material presented in this chapter introduces the mixture reduction problem and presents the background material related to Papers E and F. Some of the material presented in this chapter is published by the PhD candidate in (Ardeshiri et al., 2012) and (Ardeshiri et al., 2014).
4.1 Mixture Reduction
A common problem encountered in Bayesian inference, and particularly in tracking, is mixture reduction (MR). Examples of such circumstances are multi-hypotheses tracking (MHT) (Blackman and Popoli, 1999), the Gaussian sum filter (Alspach and Sorenson, 1972), multiple model filtering (Blackman and Popoli, 1999) and the Gaussian mixture probability hypothesis density (GM-PHD) filter (Vo and Ma, 2006). In these algorithms the information about the state of a random variable is modeled as a mixture density.
A mixture density is a probability density which is a convex combination of (more basic) component probability densities, see e.g. (Bishop, 2006). A normalized mixture with N components is defined as
p(x) = Σ_{I=1}^{N} w^I q(x; η^I),    (4.1)
where the terms w^I are positive weights summing to unity, and η^I are the parameters of the component density q(x; η^I). When the component density is a Gaussian density the mixture density is referred to as a Gaussian mixture (GM).
The mixture reduction problem (MRP) is to find an approximation of the original mixture density by a mixture density with fewer components. To be able to implement these algorithms for real-time applications a mixture reduction step is necessary. The aim of the reduction algorithm is to reduce the computational complexity to a predefined budget while keeping the inevitable error introduced by the approximation as small as possible.
4.2 Mixture Reduction for Target Tracking
This section concerns the mixture reduction algorithms used in multiple target tracking (MTT). The current mixture reduction convention in MTT is to use exactly the same algorithm for reducing the computational load to a feasible level as for extracting the state estimates. In general, the mixture reduction for state extraction should be much more aggressive than that for computational feasibility. For this reason, the number of components in the mixtures has to be reduced much more than what the computational resources actually allow for. This can result in coarser approximations than what is actually necessary. It is proposed in (Ardeshiri et al., 2012) to split the reduction step into two separate procedures according to:
• Reduction in the loop is a reduction step which must be performed at each
point in time for computational feasibility of the overall target tracking
framework. The objective for this algorithm is to reduce the number of
components and to minimize the information loss.
• Reduction for state extraction aims at reducing the number of components
so that the remaining components can be considered as state estimates in
the target tracking framework.
This separation makes it possible to tailor these two algorithms to fulfill their individual objectives, which reduces the unnecessary approximations in the overall
algorithm. A block diagram of the conventional mixture reduction method on a
high level is shown in Figure 4.1.
Figure 4.1: The standard flowchart of the MTT algorithms has only one mixture reduction block (Prediction → Update → Mixture Reduction → State Extraction).
In the proposed implementation of MR for MTT in (Ardeshiri et al., 2012),
the reduction algorithm is split into two subroutines each of which is tailored for
its own purpose, see Figure 4.2. The first reduction algorithm, denoted reduction
in the loop, is designed to reduce the computational cost of the algorithm to the
computational budget between the updates. In this reduction step, the number
of components should be reduced to a number that is tractable by the available
computational budget and minimal loss of information is in focus. The second
reduction algorithm, denoted reduction for extraction, is designed to reduce the
mixture to as many components as the number of targets. In this part of the algorithm, application dependent specifications and heuristics can enter into the
picture. If the purpose of state extraction is only visualization, the second reduction does not have to be performed at the same frequency as the measurements
are received and can be made less frequent. The advantages of the proposed algorithm are that the unnecessary loss of information in the reduction in the loop
step will only be due to the finite computational budget rather than closeness of
the components. Furthermore, some computational cost can be discounted if the
state extraction does not have to be performed for every measurement update
step.
Figure 4.2: The proposed block diagram of the MTT algorithm with two mixture reduction blocks (Prediction → Update → Mixture Reduction for Computational Feasibility → Mixture Reduction for State Extraction → State Extraction); one block is tailored to keep the computational complexity within the computational budget and one is tailored for state extraction.
Another important advantage of the proposed algorithm in (Ardeshiri et al., 2012) is that the number of final components in both of the reduction algorithms is known, since the computational budget is predefined in the reduction in the loop algorithm. Furthermore, the number of target states can be predetermined by summing the weights in, e.g., a GM-PHD filter and utilized in the reduction for extraction algorithm. The clustering or optimization method selected for reduction can be executed more efficiently compared to a scenario where the number of components is left to be decided by the algorithm itself.
4.3 Greedy mixture reduction
Ideally, the MRP is formulated as a nonlinear optimization problem where a divergence measure between a mixture and its approximation with a desired number
of components is selected. The optimization problem is then solved by numerical solvers when the problem is not analytically tractable. The numerical optimization based approaches can be computationally quite expensive, especially
for high dimensional data and they generally suffer from the problem of local
optima. Hence, a common alternative solution to the MRP has been the greedy iterative approach. When the computational budget permits a numerical solution,
the greedy approaches are used to initialize the global optimization approach
(Williams and Maybeck, 2006).
In the greedy approach, the number of components in the mixture is reduced
one at a time. By applying the same procedure over and over, a desired number
of components can be reached. In order to reduce the number of components
by one, two types of operations are considered, namely pruning one component and merging two components. These two operations are formally defined in the following.
Pruning, which is the simplest operation for reducing the number of components in a mixture density, is to remove one (or more) components of the mixture and rescale the remaining components such that the density integrates to unity. For example, pruning component J from (4.1) results in the mixture density
p′(x) = (1 − w^J)⁻¹ Σ_{I=1, I≠J}^{N} w^I q(x; η^I).    (4.2)
The merging operation in a mixture reduction algorithm (MRA) approximates a subset of components in a mixture density with a single component of the same component density type. In general, an optimization problem minimizing the KLD between the normalized subset of the mixture and the single component is used for this purpose, leading to a moment matching operation. More formally, the approximation of a fraction of the mixture density (4.1) consisting of two components I and J, w^I q(x; η^I) + w^J q(x; η^J), by a single weighted component (w^I + w^J) q(x; η^{IJ}) is referred to as merging components I and J, where
η^{IJ} = arg min_{η^{IJ}} D_KL( (w^I q(x; η^I) + w^J q(x; η^J)) / (w^I + w^J) || q(x; η^{IJ}) ).
When the component densities are Gaussian densities with mean µ and covariance Σ, the parameters of the approximate density are given by
µ^{IJ} = (w^I µ^I + w^J µ^J) / (w^I + w^J),    (4.3a)
Σ^{IJ} = Σ_{K∈{I,J}} (w^K / (w^I + w^J)) (Σ^K + (µ^K − µ^{IJ})(µ^K − µ^{IJ})ᵀ).    (4.3b)
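The merge in (4.3) is a simple moment-matching step; a minimal sketch for two weighted Gaussian components is given below (illustrative code with made-up example numbers, not from the thesis papers).

```python
import numpy as np

# Moment-matching merge (4.3) of two weighted Gaussian components.
def merge_gaussians(wI, muI, SigI, wJ, muJ, SigJ):
    w = wI + wJ
    mu = (wI * muI + wJ * muJ) / w
    Sig = sum(wK / w * (SigK + np.outer(muK - mu, muK - mu))
              for wK, muK, SigK in [(wI, muI, SigI), (wJ, muJ, SigJ)])
    return w, mu, Sig

# Example: merging two 2-D components.
w, mu, Sig = merge_gaussians(0.3, np.array([0.0, 0.0]), np.eye(2),
                             0.7, np.array([2.0, 1.0]), 0.5 * np.eye(2))
print(w, mu, Sig)
```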
There are two different types of greedy approaches in the literature, local and
global approaches. The local approaches consider only the merging operation.
The (two) components to be merged are selected among all possible pairs of components based on a divergence measure between the individual components; the divergence between the original mixture and its approximation is not (explicitly) taken into account. Well-known examples of local approaches are given in (Salmond, 1990; Granström and Orguner, 2012b).
In the global approach, each of the pruning or merging possibilities is considered to be a hypothesis. The decisions are then made by choosing the candidate hypothesis that minimizes a divergence measure involving the original mixture and the corresponding reduced mixtures (all of which have one less component).
In the global approach to mixture reduction, pruning or merging operations applicable to the original mixture p(x) are considered to be hypotheses denoted by H. The resulting mixtures that would be obtained if the Ith component is pruned, or if the Ith and Jth components are merged, are denoted by p(x|H_0I) and p(x|H_IJ), respectively. The single component obtained by merging the Ith and Jth components is denoted by q(x; η^{IJ}). If p(x) has K components there are K pruning and K(K − 1)/2 merging hypotheses. In order to decide on the candidate pruning and merging operations, all corresponding mixtures p(x|H_0I), p(x|H_IJ) and their associated divergence measures are calculated. The hypothesis which results in the smallest divergence measure, i.e., the mixture most similar to the original mixture, is selected.
More particularly, at the kth stage of reducing the mixture density of equation (4.1), n_k = N − k + 1 components are left and there are (1/2) n_k (n_k − 1) possible merging decisions and n_k possible pruning decisions to choose from. Let the reduced density at the kth stage be denoted by p_k(x). We have a multiple hypotheses decision problem at hand where the hypotheses are formulated according to
Pruning hypotheses:
H_{01}: x ∼ p_k(x|H_{01}),
H_{02}: x ∼ p_k(x|H_{02}),
...
H_{0 n_k}: x ∼ p_k(x|H_{0 n_k}),
Merging hypotheses:
H_{12}: x ∼ p_k(x|H_{12}),
H_{13}: x ∼ p_k(x|H_{13}),
...
H_{(n_k−1) n_k}: x ∼ p_k(x|H_{(n_k−1) n_k}),
which is a decision problem with n_k(n_k + 1)/2 hypotheses. The first n_k hypotheses account for pruning and the rest account for merging decisions. The subscript on a hypothesis H refers to the two components to be merged for merging hypotheses, while in the case of pruning hypotheses the subscript refers to the label of the component to be pruned, preceded by zero.
The divergence measures used for the aforementioned decision problem are
presented in the following Section.
4.4 Divergence measures
A divergence measure is a function which establishes the distance from one probability distribution to another on a statistical manifold (Minka, 2005). A divergence measure is a weaker form of a metric; in particular, a divergence does not need to be symmetric and does not need to satisfy the triangle inequality.
4.4.1 Integral square error
The integral square error (ISE) is a divergence measure between two densities p(x) and q(x) which is defined as
ISE(p||q) = ∫ |p(x) − q(x)|² dx.    (4.4)
The ISE has all properties of a metric.
The ISE is used by Williams and Maybeck in (Williams and Maybeck, 2006) as a divergence measure for mixture reduction. The cost of the hypothesis H_K obeys
ISE(H_K) = ∫ |p(x) − p_k(x|H_K)|² dx.    (4.5)
In this approach, the hypothesis which gives the smallest ISE is chosen at each step of the reduction, i.e., the decision rule based on the ISE becomes "decide H_K if ISE(H_K) < ISE(H_L) for all L ≠ K", where K and L are permissible indices of the hypotheses.
An attractive property of the ISE as a divergence measure is that the ISE between two Gaussian mixtures has an analytical solution.
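A minimal sketch of that analytical evaluation for Gaussian mixtures is given below (an illustration added here, not the implementation used in Paper E); it relies on the standard identity ∫ N(x; m₁, P₁) N(x; m₂, P₂) dx = N(m₁; m₂, P₁ + P₂), and the example numbers (a two-component mixture and its moment-matched single Gaussian) are chosen only for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Sum over all pairs of components of int q(x; eta_i) q(x; eta_j) dx, weighted.
def cross_term(w1, mu1, Sig1, w2, mu2, Sig2):
    return sum(a * b * mvn.pdf(ma, mean=mb, cov=Sa + Sb)
               for a, ma, Sa in zip(w1, mu1, Sig1)
               for b, mb, Sb in zip(w2, mu2, Sig2))

def ise(w1, mu1, Sig1, w2, mu2, Sig2):
    return (cross_term(w1, mu1, Sig1, w1, mu1, Sig1)
            - 2 * cross_term(w1, mu1, Sig1, w2, mu2, Sig2)
            + cross_term(w2, mu2, Sig2, w2, mu2, Sig2))

# Example: a 2-component mixture versus its single moment-matched Gaussian.
w1 = [0.6, 0.4]; mu1 = [np.array([-2.0]), np.array([2.0])]
S1 = [np.array([[1.0]]), np.array([[0.16]])]
w2 = [1.0]; mu2 = [np.array([-0.4])]; S2 = [np.array([[4.504]])]
print("ISE =", ise(w1, mu1, S1, w2, mu2, S2))
```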
4.4.2 Kullback-Leibler Divergence
The global approach to the mixture reduction problem can be posed as a multiple hypothesis testing problem.¹ Suppose that we have a mixture p(x) with N components as in (4.1). Suppose we have a number of reduced mixtures {p(x|H_j)}_{j=1}^{K} and we would like to select one of them. Assuming that we have the data {x_i}_{i=1}^{S} sampled from p(·), the selection of the best reduced mixture can be posed as a multiple hypothesis testing problem where the test statistic becomes the log-likelihood of the data given as
¹ For a short introduction to multiple hypothesis testing and the maximum a posteriori decision rule see Appendix B.
log p({x_i}_{i=1}^{S} | H_j) = Σ_{i=1}^{S} log p(x_i | H_j),    (4.6)
and the decision is made to select H_{j*} where
j* ≜ arg max_j log p({x_i}_{i=1}^{S} | H_j).    (4.7)
When we let the number of samples S go to ∞, we see that
lim_{S→∞} (1/S) log p({x_i}_{i=1}^{S} | H_j) = E_{p(·)}[log p(x|H_j)]    (4.8)
by the law of large numbers. The Kullback-Leibler divergence D_KL(p(·)||p(·|H_j)) between p(·) and p(·|H_j) is given as
D_KL(p(·)||p(·|H_j)) = −H(p(x)) − E_{p(·)}[log p(x|H_j)],    (4.9)
where H(·) is the entropy of its argument density. Therefore, the optimization (4.7) is equivalently given as
j* ≜ arg min_j D_KL(p(x)||p(x|H_j)).    (4.10)
The cost function in (4.10) cannot be analytically evaluated when one of the arguments in the KLD is a Gaussian mixture. Runnalls in (Runnalls, 2007) used a nice analytical approximation of the KLD between two mixtures which can only be used for evaluating the merging hypotheses. The approximation is in fact an upper bound on D_KL(p(x)||p(x|H_IJ)), which is the cost of merging two components I and J, and is denoted by B(I, J) and defined by
B(I, J) ≜ w^I D_KL(q(x; η^I)||q(x; η^{IJ})) + w^J D_KL(q(x; η^J)||q(x; η^{IJ})).    (4.11)
Runnalls has shown that B(I, J) ≥ D_KL(p(x)||p(x|H_IJ)) in (Runnalls, 2007). The greedy MR algorithm suggested by (Runnalls, 2007) will be referred to as the approximate Kullback-Leibler (AKL) algorithm in the rest of this thesis.
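A minimal sketch of evaluating the merging cost (4.11) for Gaussian components is given below (illustrative code added here, not Runnalls' original implementation); it uses the well-known closed-form KLD between two Gaussians together with the moment-matched merge of (4.3).

```python
import numpy as np

# D_KL(N0 || N1) = 0.5 * ( tr(S1^-1 S0) + (m1-m0)' S1^-1 (m1-m0) - d + ln(|S1|/|S0|) ).
def kld_gauss(m0, S0, m1, S1):
    d = m0.size
    S1inv = np.linalg.inv(S1)
    dm = m1 - m0
    return 0.5 * (np.trace(S1inv @ S0) + dm @ S1inv @ dm - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def runnalls_cost(wI, mI, SI, wJ, mJ, SJ):
    # moment-matched merged component, as in (4.3)
    w = wI + wJ
    m = (wI * mI + wJ * mJ) / w
    S = (wI * (SI + np.outer(mI - m, mI - m)) + wJ * (SJ + np.outer(mJ - m, mJ - m))) / w
    return wI * kld_gauss(mI, SI, m, S) + wJ * kld_gauss(mJ, SJ, m, S)

print(runnalls_cost(0.6, np.array([-2.0]), np.array([[1.0]]),
                    0.4, np.array([ 2.0]), np.array([[0.16]])))
```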
4.4.3 α-Divergences
A generalization of the KLD, called the α-divergence, is a family of divergences defined over a range of a continuous hyper-parameter α ∈ (−∞, ∞) by
D_α(p||q) ≜ (4/(1 − α²)) (1 − ∫ p(x)^{(1+α)/2} q(x)^{(1−α)/2} dx).
Some special cases of the α-divergence are
lim_{α→1} D_α(p||q) = D_KL(p||q),    (4.12a)
lim_{α→−1} D_α(p||q) = D_KL(q||p),    (4.12b)
D_0(p||q) = D_H(p||q),    (4.12c)
where D_H(p||q) is the Hellinger distance (Bishop, 2006).
We will analyze the divergence measures given above in Example 4.1.
Example 4.1
Using an example given in (Minka, 2005), the effect of changing the hyper-parameter α in the divergence measure is illustrated and compared with the ISE distance. Consider the Gaussian mixture p(x) = 0.6 N(x; −2, 1) + 0.4 N(x; 2, 0.16), and its approximation q(x), which is a Gaussian distribution with unknown mean and standard deviation. In Figure 4.3 the minimizing argument of D_α(p||q) over q is given for various values of α, alongside the minimizing argument of ISE(p||r). The parameters of q (mean and standard deviation) are given in Figure 4.4. The parameters of q vary smoothly with α except when −1 ≤ α ≤ 1. When α ≤ −1 the solution is mode seeking (it concentrates on the mode with the largest mass), and when α ≥ 1 the optimal solution distributes the probability mass over the support wherever there is considerable probability mass in the original distribution. The observations made in this simple example are general and can be conveniently explained by the definition of the α-divergence. The minimizing argument of ISE(p||r) in this example does not have the same general interpretation; in this example the ISE is rather similar to D_α for α = 1, but if the two Gaussian densities were far away from each other, the ISE would become more similar to D_α for α = −1. Another observation is that the minimizing argument of D_α(p||q) over q does not vary much for values of α outside the interval [−1, 1]. Therefore, we will only study the α-divergence in the limit as α → 1, where it corresponds to the KLD, and as α → −1, where it corresponds to the reversed Kullback-Leibler divergence (RKLD). In applications where the mode seeking property of the solution is desired the RKLD is suitable. On the other hand, when the solution should preserve the statistical moments of a mixture density the KLD is the most appropriate.
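The behaviour in Example 4.1 can be reproduced numerically. The sketch below (an illustration added here, not code from Paper F) fits a single Gaussian to the mixture of Example 4.1 by minimizing the α-divergence evaluated by quadrature on a grid; the values α = ±5 are chosen as stand-ins for the regimes outside [−1, 1].

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
p = 0.6 * norm.pdf(x, -2, 1.0) + 0.4 * norm.pdf(x, 2, 0.4)   # mixture of Example 4.1

def alpha_div(params, alpha):
    m, log_s = params
    q = np.maximum(norm.pdf(x, m, np.exp(log_s)), 1e-300)    # guard against underflow
    integral = np.sum(p ** ((1 + alpha) / 2) * q ** ((1 - alpha) / 2)) * dx
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - integral)

for alpha in [-5.0, 5.0]:
    res = minimize(alpha_div, x0=[0.0, 0.0], args=(alpha,), method="Nelder-Mead")
    m, s = res.x[0], np.exp(res.x[1])
    print(f"alpha={alpha:+.0f}: mean={m:+.2f}, std={s:.2f}")
# For alpha well below -1 the fit locks onto the heavier mode near -2 (mode seeking);
# for alpha well above 1 it spreads over both modes (mass covering).
```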
In Paper F a Gaussian mixture reduction algorithm using the RKLD is proposed which has the mode seeking property illustrated in Example 4.1.
Figure 4.3: The Gaussian mixture p (black) is approximated by two Gaussian densities q (red) and r (blue), shown in panels for α = −∞, −1, 0, 1, ∞; q minimizes the α-divergence for the different values of α and r minimizes the ISE.
Figure 4.4: The mean and standard deviation of the Gaussian density q which minimizes the α-divergence to p, for different values of α (black). For α = 1 the mean and standard deviation of q match those of p (red). The mean and standard deviation of the Gaussian density r which minimizes the ISE to p are given for comparison (blue).
4.5 Numerical comparison of mixture reduction algorithms
In Paper E three mixture reduction algorithms for mixtures of the exponential family are given and evaluated in simulations. In these algorithms the ISE approach and the AKL approach are compared with a local approach referred to as the symmetrized Kullback-Leibler divergence. The symmetrized Kullback-Leibler divergence is used for the comparison of the merging hypotheses in local algorithms such as (Kitagawa, 1994), (Chen et al., 2012b), (Granström and Orguner, 2012b) and (Granström and Orguner, 2012a). The symmetrized KLD (SKL) for two component densities is defined as
D_SKL(I, J) = D_KL(q(x; η^I)||q(x; η^J)) + D_KL(q(x; η^J)||q(x; η^I)).    (4.13)
This approach is referred to as SKL and is used in the numerical simulation intended for comparison of the different MR algorithms in the following.
In this section, eight mixture reduction examples are illustrated. In Figures 4.5, 4.6, 4.7, 4.8, 4.9, 4.10, 4.11 and 4.12 mixture densities of exponential, Weibull, Laplace, Rayleigh, log-normal, gamma, inverse gamma and Gaussian distributions are reduced, respectively. In each figure, a mixture density with 25 components is plotted along with its reduced approximations with 3 components obtained using the three reduction algorithms AKL, SKL and ISE. In these figures the original mixture density (black solid line) and its components (black dashed lines) are given. In the sub-figures AKL, SKL and ISE are used to approximate the original mixture, which has 25 component densities, with mixtures with 3 component densities. The approximate densities (thick dashed lines) and their components (thin dashed lines) are drawn in different colors; red (AKL), green (SKL) and blue (ISE). AKL is used in the left sub-figure, SKL is used in the center sub-figure and ISE is used in the right sub-figure. The reduced mixture in the right sub-figure is not rescaled after possible pruning steps and is plotted as it is used in the ISE algorithm. For implementation aspects of the ISE approach see Appendix C.
Figure 4.5: Exponential Distribution
Figure 4.6: Weibull Distribution with known shape k
Figure 4.7: Laplace Distribution with known mean µ
Figure 4.8: Rayleigh Distribution
Figure 4.9: Log-normal Distribution
Figure 4.10: Gamma Distribution
Figure 4.11: Inverse Gamma Distribution
Figure 4.12: Univariate Gaussian Distribution
5 Concluding remarks
This chapter concludes the first part of this thesis. An overall summary of the contributions given in the second part of the thesis and some directions for further research are given here. For a more detailed discussion on each contribution see the discussions and concluding remarks at the end of each contribution.
In paper A, the maximum entropy properties of the first-order stable spline
kernel for identification of linear time-invariant stable and causal systems are
shown. Analytical approximations are used to express the prior knowledge about
the properties of the impulse response of a linear time-invariant stable and causal
system. Future work on the subject includes studying maximum entropy interpretation of other kernels used for regression using Gaussian processes. Furthermore, the maximum entropy approach can be used to construct new kernels for
system identification.
In Paper B, the variational Bayes (VB) method is used to compute an approximate posterior for the state smoothing problem for linear state-space models with unknown and time-varying noise covariances. The VB method gives an approximate posterior for the unknown noise covariances. Nevertheless, VB-type algorithms approximate the posterior by minimizing the Kullback-Leibler divergence in zero-forcing mode, meaning that if there are multiple modes in the true posterior, the algorithm approximates only one of the modes. Hence the posterior covariance might underestimate the true covariance significantly in such cases. Computing a better estimate of the estimation uncertainty for the noise covariances can be a topic for future work. Theoretical comparison of the proposed VB method with expectation maximization and maximum likelihood estimation of the noise covariances is another possible future work.
In Paper C, the VB method is used for approximate inference in state-space models with skewed measurement noise. A filter and a smoother that take into account the skewness and heavy-tailedness of the measurement noise are proposed, where the skew-t distribution is used to model the distribution of the measurement noise. Future research on the subject includes learning the skewness and spread parameters of the measurement noise from the data. Further research on the subject can include studying a class of hierarchical models for modeling the noise parameters and devising algorithms for learning the parameters of such a model from the data.
In Paper D, a novel approximation method for Bayesian inference is proposed. The proposed Bayesian inference technique is based on a Taylor series approximation of the logarithm of the likelihood function. The proposed approximation is devised for the case where the prior distribution belongs to the exponential family of distributions. The linearization of the log-likelihood is performed with respect to the sufficient statistic of the prior distribution. Extension of the proposed method to prior distributions outside the exponential family of distributions can be a future research direction. The comparison of possible choices for the linearization point and linearization methods with respect to the sufficient statistic are among the future research problems.
In Papers E and F, two contributions are dedicated to the mixture reduction (MR) problem. The first contribution generalizes the existing MR algorithms for Gaussian mixtures to the exponential family of distributions and compares them in an extended target tracking scenario. The second contribution proposes a new Gaussian mixture reduction algorithm using the reversed Kullback-Leibler divergence which has specific peak preserving properties. Future research on these topics includes evaluation of these methods in real life scenarios with real measurements.
There is a general class of solutions to the Bayesian inference problem referred to as sampling methods, which can obtain much better performance than the proposed approximation methods with respect to accuracy when the computation time is not critical. Sampling methods are not covered in this thesis; that is why the approximations used in this thesis are specified as analytical approximations. The proposed analytical approximations can, however, be used for initialization of the sampling based methods as well as for selecting proposals in Monte-Carlo (MC) methods. Speeding up these MC methods using the proposed approximation methods is a general direction for future research.
Appendix A
Expressions for some members of the exponential family
Essential expressions and formulas for the reduction of mixture densities of common exponential family distributions are given in this section. These expressions can be found in (Ardeshiri et al., 2014) as well. Some functions which are used in the expressions, such as the gamma function, are defined here for completeness. The gamma function is defined by
Γ(t) = ∫_0^∞ x^{t−1} exp(−x) dx.    (A.1)
The multivariate gamma function, which is a generalization of the gamma function, is
Γ_d(t) = ∫_{S≻0} exp(−Tr(S)) |S|^{t−(d+1)/2} dS = π^{d(d−1)/4} Π_{j=1}^{d} Γ(t + (1−j)/2).    (A.2)
The digamma function is given as
ψ(t) = (d/dt) log Γ(t) = Γ′(t)/Γ(t).    (A.3)
The multivariate polygamma function of order n is defined as
ψ_d^{(n)}(t) = (d^{n+1}/dt^{n+1}) log Γ_d(t)    (A.4)
            = (d^{n+1}/dt^{n+1}) Σ_{j=1}^{d} log Γ(t + (1−j)/2)    (A.5)
            = Σ_{j=1}^{d} ψ^{(n)}(t + (1−j)/2).    (A.6)
The multinomial beta function in terms of the gamma function is given as
B_K(α) = (Π_{j=1}^{K} Γ(α_j)) / Γ(Σ_{j=1}^{K} α_j).    (A.7)
Exponential Distribution
Exp(x; λ) = λ exp(−λx)    (A.8a)
Support: x ∈ [0, +∞)    (A.8b)
Parameter space: λ ∈ (0, +∞)    (A.8c)
η = −λ    (A.8d)
A(η) = −log(−η)    (A.8e)
∇_η A = ∂A/∂η = −1/η    (A.8f)
h(x) = 1    (A.8g)
E[h(x)] = 1    (A.8h)
T(x) = x    (A.8i)
The solution to ∇_{η^L} A = Y is given by
η^L = −1/Y.    (A.9)
Weibull Distribution with known shape k
Weibull(x; λ, k) = (k/λ)(x/λ)^{k−1} exp(−x^k/λ^k)    (A.10a)
Support: x ∈ [0, +∞)    (A.10b)
Parameter space: λ ∈ (0, +∞), k ∈ (0, +∞)    (A.10c)
η = −1/λ^k    (A.10d)
A(η) = −log(−η) − log(k)    (A.10e)
∇_η A = ∂A/∂η = −1/η    (A.10f)
h(x) = x^{k−1}    (A.10g)
E[h(x)] = Γ((2k − 1)/k) (−η)^{(1−k)/k}    (A.10h)
T(x) = x^k    (A.10i)
The solution to ∇_{η^L} A = Y is given by
η^L = −1/Y.    (A.11)
The expression for E_{q(x;η)}[h(x)] is derived here:
E[h(x)] = ∫_0^∞ x^{k−1} (k/λ)(x/λ)^{k−1} exp(−x^k/λ^k) dx    {substituting z = x^k/λ^k, dz = (k x^{k−1}/λ^k) dx}
        = ∫_0^∞ λ^{k−1} z^{(k−1)/k} exp(−z) dz
        = λ^{k−1} Γ((k − 1)/k + 1) = λ^{k−1} Γ((2k − 1)/k) = Γ((2k − 1)/k) (−η)^{(1−k)/k}.    (A.12)
Laplace Distribution with known mean µ
Laplace(x; µ, b) = (1/(2b)) exp(−|x − µ|/b)    (A.13a)
Support: x ∈ (−∞, ∞)    (A.13b)
Parameter space: b ∈ (0, +∞), µ ∈ R    (A.13c)
η = −1/b    (A.13d)
A(η) = log(−2/η)    (A.13e)
∇_η A = ∂A/∂η = −1/η    (A.13f)
h(x) = 1    (A.13g)
E[h(x)] = 1    (A.13h)
T(x) = |x − µ|    (A.13i)
The solution to ∇_{η^L} A = Y is given by
η^L = −1/Y.    (A.14)
Rayleigh Distribution
Rayleigh(x; σ) = (x/σ²) exp(−x²/(2σ²))    (A.15a)
Support: x ∈ [0, +∞)    (A.15b)
Parameter space: σ ∈ (0, +∞)    (A.15c)
η = −1/(2σ²)    (A.15d)
A(η) = −log(−2η)    (A.15e)
∇_η A = ∂A/∂η = −1/η    (A.15f)
h(x) = x    (A.15g)
E[h(x)] = √(π/(−4η))    (A.15h)
T(x) = x²    (A.15i)
The solution to ∇_{η^L} A = Y is given by
η^L = −1/Y.    (A.16)
The expression for E_{q(x;η)}[h(x)] is derived here:
E[h(x)] = ∫_0^∞ x (x/σ²) exp(−x²/(2σ²)) dx = σ √(π/2) = √(π/(−4η)).    (A.17)
Log-normal Distribution
logN(x; µ, σ) = (1/(xσ√(2π))) exp(−(1/(2σ²))(log x − µ)²)    (A.18a)
Support: x ∈ (0, +∞)    (A.18b)
Parameter space: σ ∈ (0, +∞), µ ∈ R    (A.18c)
η = (η₁, η₂)    (A.18d)
η₁ = µ/σ²    (A.18e)
η₂ = −1/(2σ²)    (A.18f)
A(η) = −η₁²/(4η₂) − (1/2) log(−2η₂)    (A.18g)
∇_η A = (∂A/∂η₁, ∂A/∂η₂)    (A.18h)
∂A/∂η₁ = −η₁/(2η₂)    (A.18i)
∂A/∂η₂ = η₁²/(4η₂²) − 1/(2η₂)    (A.18j)
h(x) = 1/(x√(2π))    (A.18k)
E[h(x)] = (1/√(2π)) exp(η₁/(2η₂) − 1/(4η₂))    (A.18l)
T(x) = (log x, (log x)²)    (A.18m)
The solution to the system of equations ∇_{η^L} A = Y is given by
η₂^L = −1/(2(Y₂ − Y₁²)),    (A.19a)
η₁^L = −2Y₁ η₂^L.    (A.19b)
The expression for E_{q(x;η)}[h(x)] is derived here:
E[h(x)] = (1/√(2π)) E[1/x] = (1/√(2π)) exp(−µ + σ²/2) = (1/√(2π)) exp(η₁/(2η₂) − 1/(4η₂)).    (A.20)
Gamma Distribution
Gamma(x; α, β) = (β^α/Γ(α)) x^{α−1} exp(−βx)    (A.21a)
Support: x ∈ (0, +∞)    (A.21b)
Parameter space: α ∈ (0, +∞), β ∈ (0, +∞)    (A.21c)
η = (η₁, η₂)    (A.21d)
η₁ = α − 1    (A.21e)
η₂ = −β    (A.21f)
A(η) = log Γ(η₁ + 1) − (η₁ + 1) log(−η₂)    (A.21g)
∇_η A = (∂A/∂η₁, ∂A/∂η₂)    (A.21h)
∂A/∂η₁ = ψ(η₁ + 1) − log(−η₂)    (A.21i)
∂A/∂η₂ = −(η₁ + 1)/η₂    (A.21j)
h(x) = 1    (A.21k)
E[h(x)] = 1    (A.21l)
T(x) = (log x, x)    (A.21m)
To solve the system of equations ∇_{η^L} A = Y, first let Z = log(Y₂) − Y₁ and u = η₁ + 1. Then solve ψ(u) − log(u) + Z = 0 numerically and obtain
η₁^L = u − 1,    (A.22a)
η₂^L = −u/Y₂.    (A.22b)
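The numerical step above can be carried out, for example, with Newton's method; the sketch below (an illustration added here, with assumed moments of a Gamma(3, 2) density, not code from the referenced papers) solves ψ(u) − log(u) + Z = 0 and maps the result back to the natural parameters via (A.22).

```python
import numpy as np
from scipy.special import digamma, polygamma

# Given moments Y = (Y1, Y2) = (E[log x], E[x]), solve psi(u) - log(u) + Z = 0
# with Z = log(Y2) - Y1 by Newton's method, then recover (eta1_L, eta2_L) as in (A.22).
def gamma_natural_from_moments(Y1, Y2, iters=50):
    Z = np.log(Y2) - Y1
    u = 0.5 / Z                          # common initial guess for the shape
    for _ in range(iters):
        f = digamma(u) - np.log(u) + Z
        fprime = polygamma(1, u) - 1.0 / u
        u -= f / fprime
    return u - 1.0, -u / Y2

# Example: moments of Gamma(alpha=3, beta=2), for which E[x] = alpha/beta and
# E[log x] = psi(alpha) - log(beta).
alpha, beta = 3.0, 2.0
eta1, eta2 = gamma_natural_from_moments(digamma(alpha) - np.log(beta), alpha / beta)
print(eta1, eta2)   # should recover eta1 = alpha - 1 = 2 and eta2 = -beta = -2
```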
Inverse Gamma Distribution
IGamma(x; α, β) = (β^α/Γ(α)) x^{−α−1} exp(−β/x)    (A.23a)
Support: x ∈ (0, +∞)    (A.23b)
Parameter space: α ∈ (0, +∞), β ∈ (0, +∞)    (A.23c)
η = (η₁, η₂)    (A.23d)
η₁ = −α − 1    (A.23e)
η₂ = −β    (A.23f)
A(η) = log Γ(−η₁ − 1) − (−η₁ − 1) log(−η₂)    (A.23g)
∇_η A = (∂A/∂η₁, ∂A/∂η₂)    (A.23h)
∂A/∂η₁ = −ψ(−η₁ − 1) + log(−η₂)    (A.23i)
∂A/∂η₂ = (η₁ + 1)/η₂    (A.23j)
h(x) = 1    (A.23k)
E[h(x)] = 1    (A.23l)
T(x) = (log x, 1/x)    (A.23m)
To solve the system of equations ∇_{η^L} A = Y, first let Z = log(Y₂) + Y₁ and u = −η₁ − 1. Then solve ψ(u) + log(u) − Z = 0 numerically and obtain
η₁^L = −u − 1,    (A.24a)
η₂^L = −u/Y₂.    (A.24b)
Univariate Gaussian Distribution
N(x; µ, σ²) = (1/(σ√(2π))) exp(−(1/(2σ²))(x − µ)²)    (A.25a)
Support: x ∈ R    (A.25b)
Parameter space: σ ∈ (0, +∞), µ ∈ R    (A.25c)
η = (η₁, η₂)    (A.25d)
η₁ = µ/σ²    (A.25e)
η₂ = −1/(2σ²)    (A.25f)
A(η) = −η₁²/(4η₂) − (1/2) log(−2η₂)    (A.25g)
∇_η A = (∂A/∂η₁, ∂A/∂η₂)    (A.25h)
∂A/∂η₁ = −η₁/(2η₂)    (A.25i)
∂A/∂η₂ = η₁²/(4η₂²) − 1/(2η₂)    (A.25j)
h(x) = 1/√(2π)    (A.25k)
E[h(x)] = 1/√(2π)    (A.25l)
T(x) = (x, x²)    (A.25m)
The solution to the system of equations ∇_{η^L} A = Y is given by
η₂^L = −1/(2(Y₂ − Y₁²)),    (A.26a)
η₁^L = −2Y₁ η₂^L.    (A.26b)
Multivariate Gaussian Distribution
N(x; m, P) = (2π)^{−k/2} |P|^{−1/2} exp(−(1/2)(x − m)ᵀ P⁻¹ (x − m))    (A.27a)
Support: x ∈ R^k    (A.27b)
Parameter space: P ∈ R^{k×k} and P = Pᵀ ≻ 0, m ∈ R^k    (A.27c)
η = (η₁, η₂)    (A.27d)
η₁ = P⁻¹ m    (A.27e)
η₂ = −(1/2) P⁻¹    (A.27f)
A(η) = −(1/4) η₁ᵀ η₂⁻¹ η₁ − (1/2) log |−2η₂|    (A.27g)
∇_η A = (∂A/∂η₁, ∂A/∂η₂)    (A.27h)
∂A/∂η₁ = −(1/2) η₁ᵀ η₂⁻¹    (A.27i)
∂A/∂η₂ = (1/4) η₂⁻ᵀ η₁ η₁ᵀ η₂⁻ᵀ − (1/2) η₂⁻¹    (A.27j)
h(x) = (2π)^{−k/2}    (A.27k)
E[h(x)] = (2π)^{−k/2}    (A.27l)
T(x) = (x, xxᵀ)    (A.27m)
The solution to the system of equations ∇_{η^L} A = Y is given by
η₂^L = −(1/2)(Y₂ − Y₁ᵀ Y₁)⁻¹,    (A.28)
η₁^L = −2 (Y₁ η₂^L)ᵀ.    (A.29)
Gaussian Gamma Distribution
GaussianGamma(x, τ; µ, λ, α, β) = N(x; µ, 1/(λτ)) Gamma(τ; α, β)    (A.30a)
Support: x ∈ R, τ ∈ (0, +∞)    (A.30b)
Parameter space: α ∈ (0, +∞), β ∈ (0, +∞), λ ∈ (0, +∞), µ ∈ R    (A.30c)
η = (η₁, η₂, η₃, η₄)    (A.30d)
η₁ = α − 1/2    (A.30e)
η₂ = −β − λµ²/2    (A.30f)
η₃ = λµ    (A.30g)
η₄ = −λ/2    (A.30h)
A(η) = log Γ(η₁ + 1/2) − (1/2) log(−2η₄) − (η₁ + 1/2) log(−η₂ + η₃²/(4η₄))    (A.30i)
∇_η A = (∂A/∂η₁, ∂A/∂η₂, ∂A/∂η₃, ∂A/∂η₄)    (A.30j)
∂A/∂η₁ = ψ(η₁ + 1/2) − log(−η₂ + η₃²/(4η₄))    (A.30k)
∂A/∂η₂ = −(η₁ + 1/2)/(−η₂ + η₃²/(4η₄))    (A.30l)
∂A/∂η₃ = −η₃(η₁ + 1/2)/(2η₄(−η₂ + η₃²/(4η₄)))    (A.30m)
∂A/∂η₄ = −1/(2η₄) + η₃²(η₁ + 1/2)/(4η₄²(−η₂ + η₃²/(4η₄)))    (A.30n)
h(x) = 1/√(2π)    (A.30o)
E[h(x)] = 1/√(2π)    (A.30p)
T(x) = (log τ, τ, τx, τx²)    (A.30q)
To solve the system of equations ∇_{η^L} A = Y, first let Z = log(−Y₂) − Y₁ and u = η₁ + 1/2. Then solve ψ(u) − log(u) + Z = 0 numerically and obtain
η₁^L = u − 1/2,    (A.31a)
η₄^L = −(1/2)(Y₃²/Y₂ + Y₄)⁻¹,    (A.31b)
η₃^L = 2η₄^L Y₃/Y₂,    (A.31c)
η₂^L = (η₃^L)²/(4η₄^L) + (η₁^L + 1/2)/Y₂.    (A.31d)
Dirichlet distribution
Dir_K(x; α) = (1/B(α)) Π_{i=1}^{K} x_i^{α_i−1}    (A.32a)
Support: x_i ∈ [0, 1] for i = 1, · · · , K and Σ_{i=1}^{K} x_i = 1    (A.32b)
Parameter space: α_i > 0 and K ≥ 2    (A.32c)
η = (η₁, · · · , η_K)    (A.32d)
η_i = α_i − 1    (A.32e)
A(η) = Σ_{i=1}^{K} log Γ(η_i + 1) − log Γ(Σ_{i=1}^{K}(η_i + 1))    (A.32f)
∇_η A = (∂A/∂η₁, ∂A/∂η₂, · · · , ∂A/∂η_K)    (A.32g)
∂A/∂η_i = ψ(η_i + 1) − ψ(Σ_{k=1}^{K}(η_k + 1))    (A.32h)
h(x) = 1    (A.32i)
E[h(x)] = 1    (A.32j)
T(x) = (log x₁, · · · , log x_K)    (A.32k)
The system of equations ∇_{η^L} A = Y can be solved using a numerical method such as Newton's method, where the Hessian is given by
∂²A/∂η_i² = ψ^{(1)}(η_i + 1) − ψ^{(1)}(Σ_{k=1}^{K}(η_k + 1)),    (A.33a)
∂²A/∂η_i ∂η_j = −ψ^{(1)}(Σ_{k=1}^{K}(η_k + 1)),   i ≠ j.    (A.33b)
Wishart Distribution
W_d(X; n, V) = |X|^{(n−d−1)/2} exp(Tr(−(1/2) V⁻¹ X)) / (2^{nd/2} Γ_d(n/2) |V|^{n/2})    (A.34a)
Support: X ∈ R^{d×d} and X = Xᵀ ≻ 0    (A.34b)
Parameter space: V ∈ R^{d×d} and V = Vᵀ ≻ 0, n ≥ d    (A.34c)
η = (η₁, η₂)    (A.34d)
η₁ = (n − d − 1)/2    (A.34e)
η₂ = −(1/2) V⁻¹    (A.34f)
A(η) = −(η₁ + (d+1)/2) log |−η₂| + log Γ_d(η₁ + (d+1)/2)    (A.34g)
∇_η A = (∂A/∂η₁, ∂A/∂η₂)    (A.34h)
∂A/∂η₁ = −log |−η₂| + ψ_d(η₁ + (d+1)/2)    (A.34i)
∂A/∂η₂ = −(η₁ + (d+1)/2) η₂⁻¹    (A.34j)
h(X) = 1    (A.34k)
E[h(X)] = 1    (A.34l)
T(X) = (log |X|, X)    (A.34m)
To solve the system of equations ∇_{η^L} A = Y, first let Z = log |Y₂| − Y₁ and u = η₁ + (d+1)/2. Then solve ψ_d(u) − d log(u) + Z = 0 numerically and obtain
η₁^L = u − (d+1)/2,    (A.35a)
η₂^L = −u Y₂⁻¹.    (A.35b)
Inverse Wishart Distribution
IW_d(X; ν, Ψ) = |Ψ|^{(ν−d−1)/2} exp(Tr(−(1/2) Ψ X⁻¹)) / (2^{(ν−d−1)d/2} Γ_d((ν−d−1)/2) |X|^{ν/2})    (A.36a)
Support: X ∈ R^{d×d} and X = Xᵀ ≻ 0    (A.36b)
Parameter space: ν > 2d, Ψ ∈ R^{d×d}, Ψ = Ψᵀ ≻ 0    (A.36c)
η = (η₁, η₂)    (A.36d)
η₁ = −ν/2    (A.36e)
η₂ = −(1/2) Ψ    (A.36f)
A(η) = (η₁ + (d+1)/2) log |−η₂| + log Γ_d(−η₁ − (d+1)/2)    (A.36g)
∇_η A = (∂A/∂η₁, ∂A/∂η₂)    (A.36h)
∂A/∂η₁ = log |−η₂| − ψ_d(−η₁ − (d+1)/2)    (A.36i)
∂A/∂η₂ = (η₁ + (d+1)/2) η₂⁻¹    (A.36j)
h(X) = 1    (A.36k)
E[h(X)] = 1    (A.36l)
T(X) = (log |X|, X⁻¹)    (A.36m)
To solve the system of equations ∇_{η^L} A = Y, first let Z = −log |Y₂| − Y₁ and u = −η₁ − (d+1)/2. Then solve −ψ_d(u) + d log(u) + Z = 0 numerically and obtain
η₁^L = −u − (d+1)/2,    (A.37a)
η₂^L = −u Y₂⁻¹.    (A.37b)
Gaussian Inverse Wishart Distribution
GIW(x, X; m, P, ν, Ψ) = N(x; m, P) IW_d(X; ν, Ψ)    (A.38a)
Support: x ∈ R^k, X ∈ R^{d×d} and X = Xᵀ ≻ 0    (A.38b)
Parameter space: ν > 2d, Ψ ∈ R^{d×d}, Ψ = Ψᵀ ≻ 0, P ∈ R^{k×k} and P = Pᵀ ≻ 0, m ∈ R^k    (A.38c)
η = (η₁, η₂, η₃, η₄)    (A.38d)
η₁ = −ν/2    (A.38e)
η₂ = −(1/2) Ψ    (A.38f)
η₃ = P⁻¹ m    (A.38g)
η₄ = −(1/2) P⁻¹    (A.38h)
A(η) = (η₁ + (d+1)/2) log |−η₂| + log Γ_d(−η₁ − (d+1)/2) − (1/4) η₃ᵀ η₄⁻¹ η₃ − (1/2) log |−2η₄|    (A.38i)
∇_η A = (∂A/∂η₁, ∂A/∂η₂, ∂A/∂η₃, ∂A/∂η₄)    (A.38j)
∂A/∂η₁ = log |−η₂| − ψ_d(−η₁ − (d+1)/2)    (A.38k)
∂A/∂η₂ = (η₁ + (d+1)/2) η₂⁻¹    (A.38l)
∂A/∂η₃ = −(1/2) η₃ᵀ η₄⁻¹    (A.38m)
∂A/∂η₄ = (1/4) η₄⁻ᵀ η₃ η₃ᵀ η₄⁻ᵀ − (1/2) η₄⁻¹    (A.38n)
h(x, X) = (2π)^{−k/2}    (A.38o)
E[h(x, X)] = (2π)^{−k/2}    (A.38p)
T(x, X) = (log |X|, X⁻¹, x, xxᵀ)    (A.38q)
To solve the system of equations ∇_{η^L} A = Y, first let Z = −log |Y₂| − Y₁ and u = −η₁ − (d+1)/2. Then solve −ψ_d(u) + d log(u) + Z = 0 numerically and obtain
η₁^L = −u − (d+1)/2,    (A.39a)
η₂^L = −u Y₂⁻¹,    (A.39b)
η₄^L = −(1/2)(Y₄ − Y₃ᵀ Y₃)⁻¹,    (A.39c)
η₃^L = −2 (Y₃ η₄^L)ᵀ.    (A.39d)
Appendix B
Multiple hypothesis testing
Here, the multiple hypothesis testing problem and the maximum a posteriori decision rule are given for the sake of completeness (Ardeshiri et al., 2014). For a more complete treatment see (Kay, 1998).
Consider that we want to decide among M hypotheses {H₁, H₂, ..., H_M}. Let the cost assigned to the decision to choose H_i when H_j is true be denoted by C_ij, where
C_ij = 0 if i = j, and C_ij = 1 if i ≠ j.    (B.1)
The expected Bayes risk (Kay, 1998) becomes
R = Σ_{i=1}^{M} Σ_{j=1}^{M} C_ij P(H_i|H_j) P(H_j).    (B.2)
We are looking for a decision rule that minimizes R. Let us partition the space into regions R_i for i = 1, ..., M so that
R = Σ_{i=1}^{M} Σ_{j=1}^{M} C_ij ∫_{R_i} p(x|H_j) P(H_j) dx
  = Σ_{i=1}^{M} ∫_{R_i} Σ_{j=1}^{M} C_ij P(H_j|x) p(x) dx
  = Σ_{i=1}^{M} ∫_{R_i} C_i p(x) dx,    (B.3)
where C_i(x) = Σ_{j=1}^{M} C_ij P(H_j|x). Since each data point x should trigger only one decision, i.e., be assigned to only one of the R_i partitions, we should decide H_k for which C_i is minimum.
Since C_i(x) = Σ_{j=1}^{M} P(H_j|x) − P(H_i|x), C_i(x) is minimized if P(H_i|x) is maximized. Thus the decision rule is to decide H_k if P(H_k|x) > P(H_i|x) for i ≠ k. For equal prior probabilities P(H_k) = P(H_i) the decision rule becomes to decide H_k if p(x|H_k) > p(x|H_i) for i ≠ k. This decision rule is also referred to as the maximum a posteriori decision rule.
If the prior probabilities are not equal, due to e.g. heuristics, P(H_k) ≠ P(H_i), Bayes' rule P(H_i|x) ∝ p(x|H_i) P(H_i) can be used. This possibility is not exploited in this thesis.
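A minimal sketch of the resulting decision rule for equal priors is given below (an illustration added here, with assumed candidate densities): the observed data point is assigned to the hypothesis under which its likelihood is largest.

```python
import numpy as np
from scipy.stats import norm

# Three assumed candidate densities for a scalar observation (hypothetical example).
hypotheses = {
    "H1": norm(loc=-1.0, scale=1.0),
    "H2": norm(loc=0.0,  scale=2.0),
    "H3": norm(loc=2.0,  scale=0.5),
}
x = 1.7
loglik = {name: h.logpdf(x) for name, h in hypotheses.items()}
decision = max(loglik, key=loglik.get)   # maximum likelihood = MAP under equal priors
print(loglik, "-> decide", decision)
```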
Appendix C
Implementation aspects of the ISE approach
An advantage of the ISE metric is that it can be computed analytically for many distributions (Ardeshiri et al., 2015a). In the ISE approach two parameters can be varied to create slightly different reduction algorithms, as detailed below (Ardeshiri et al., 2014):
1. In the first variation, the ISE is calculated for each hypothesis according to ISE(H_K) = ∫ |p(x) − p_k(x|H_K)|² dx and the density after pruning is re-normalized. This variation is consistent with the presentation of the ISE algorithm so far in this thesis.
2. In the second variation, as pointed out in (Williams and Maybeck, 2006), when the ISE is being calculated for a pruning hypothesis the rescaling can be skipped, since re-normalizing the weights will increase the error value in parts of the support that are not affected by the pruning hypothesis. This choice also brings substantial computational savings.
3. In the third variation, instead of comparing p(x|H_K) with the original mixture p(x), it is compared with the resulting mixture of the previous reduction step p_k(x), as given here
ISE(H_K) = ∫ |p_k(x) − p_k(x|H_K)|² dx.
In this way, the ISE metric for a merging decision can be simplified to
ISE(H_IJ) = (w^I)² Q(I, I) + (w^J)² Q(J, J) + (w^{IJ})² Q(IJ, IJ) + 2 w^I w^J Q(I, J) − 2 w^I w^{IJ} Q(I, IJ) − 2 w^J w^{IJ} Q(J, IJ),
where
Q(I, J) = ∫ q(x; η^I) q(x; η^J) dx.    (C.1)
Q(I, J) can be calculated analytically for many basic densities of interest belonging to the exponential family, such as the Gaussian, gamma and Wishart distributions. For explicit expressions for the exponential family of distributions see (Ardeshiri et al., 2015a) and (Ardeshiri et al., 2014).
Similarly, the ISE metric for a pruning decision can be simplified as in
ISE(H_0I) = (w^I/(1 − w^I))² ( Q(I, I) − 2 Σ_{i=1}^{N} w^i Q(I, i) + Σ_{i=1}^{N} Σ_{j=1}^{N} w^i w^j Q(i, j) ).
4. The fourth variation is similar to the third variation in terms of the choice of the reference density, but the mixture is not re-normalized after each pruning, which results in the expression
ISE(H_0I) = (w^I)² Q(I, I)
for pruning hypotheses.
Calculation of the ISE for each hypothesis at every step of the reduction is costly. A scheme is suggested here to cache the calculated quantities and thereby reduce the computational cost of the reduction. The cost reduction scheme is given for the second type of implementation of the ISE approach, where the mixture density after a pruning hypothesis is not re-normalized.
In the first step of the reduction of the mixture density (4.1), merging of all possible pairs of components results in (1/2) N(N − 1) hypotheses. For the evaluation of these hypotheses the resultant component of each merging should be calculated. To calculate the ISE of each hypothesis, Q(·, ·) should be calculated for all pairs of components in the mixture, as well as for the pairs of components where one component is among the merged components and the other is among the existing components. All these quantities should be stored and can be reused in future reduction steps.
At the kth step of the reduction of the mixture density given in (4.1), the reduced density is denoted by p_k(x). In order to keep the notation less cluttered, let the term q^J denote w^J q(x; η^J), p denote p(x) and p_k denote p_k(x). Let us assume that the costs of the reduction hypotheses at the kth stage, denoted by ISE_k(H_R), are stored in a vector Y_k and let M = argmin_R ISE_k(H_R) over all permissible values of R.
When M corresponds to a pruning hypothesis, for example M = 0J, the vector Y_{k+1} can be updated with fewer computations for the next pruning hypotheses using
ISE_{k+1}(H_0S | M = 0J) = ∫ (p − p_k + q^J + q^S)² dx
  = ∫ (p − p_k + q^S)² dx + ∫ (q^J)² dx + 2 ∫ q^J (p − p_k + q^S) dx
  = ∫ (p − p_k + q^S)² dx + ∫ (q^J)² dx + 2 ∫ q^J (p − p_k) dx + 2 ∫ q^J q^S dx
  = ISE_k(H_0S) + A(J) + 2 ∫ q^J q^S dx,    (C.2)
where A(J) ≜ ∫ (q^J)² dx + 2 ∫ q^J (p − p_k) dx. The quantity ISE_k(H_0S) is already known from the previous step and A(J) is the part of the ISE added to the elements of Y_k due to the pruning of the Jth component.
Similarly, when M corresponds to a pruning hypothesis, for example M = 0J, the vector Y_{k+1} can be updated with fewer computations for the next merging hypotheses using
ISE_{k+1}(H_ST | M = 0J) = ∫ (p − p_k + q^J + q^S + q^T − q^{ST})² dx
  = ∫ (p − p_k + q^S + q^T − q^{ST})² dx + ∫ (q^J)² dx + 2 ∫ q^J (p − p_k + q^S + q^T − q^{ST}) dx
  = ∫ (p − p_k + q^S + q^T − q^{ST})² dx + ∫ (q^J)² dx + 2 ∫ q^J (p − p_k) dx + 2 ∫ q^J (q^S + q^T − q^{ST}) dx
  = ISE_k(H_ST) + A(J) + 2 ∫ q^J (q^S + q^T − q^{ST}) dx.    (C.3)
After each pruning step all elements of the vector Y_{k+1} corresponding to the pruned component will be eliminated from Y_{k+1}.
Using a similar approach, when M corresponds to a merging hypothesis, say M = IJ, the vector Y_{k+1} can be updated with fewer computations for the next pruning hypotheses using
ISE_{k+1}(H_0S | M = IJ) = ∫ (p − p_k + q^J + q^I − q^{IJ} + q^S)² dx
  = ∫ (p − p_k + q^S)² dx + ∫ (q^J + q^I − q^{IJ})² dx + 2 ∫ (q^J + q^I − q^{IJ})(p − p_k + q^S) dx
  = ∫ (p − p_k + q^S)² dx + ∫ (q^J + q^I − q^{IJ})² dx + 2 ∫ (q^J + q^I − q^{IJ})(p − p_k) dx + 2 ∫ (q^J + q^I − q^{IJ}) q^S dx
  = ISE_k(H_0S) + C(I, J) + 2 ∫ (q^J + q^I − q^{IJ}) q^S dx,    (C.4)
where C(I, J) ≜ ∫ (q^J + q^I − q^{IJ})² dx + 2 ∫ (q^J + q^I − q^{IJ})(p − p_k) dx,
and for the next merging hypotheses using
\begin{align}
\mathrm{ISE}_{k+1}(H_{ST} \mid M = IJ)
&= \int \big(p - p_k + q^J + q^I - q^{IJ} + q^S + q^T - q^{ST}\big)^2 \, \mathrm{d}x \nonumber\\
&= \int \big(p - p_k + q^S + q^T - q^{ST}\big)^2 \, \mathrm{d}x + \int \big(q^J + q^I - q^{IJ}\big)^2 \, \mathrm{d}x
   + 2\int \big(q^J + q^I - q^{IJ}\big)\big(p - p_k + q^S + q^T - q^{ST}\big) \, \mathrm{d}x \nonumber\\
&= \int \big(p - p_k + q^S + q^T - q^{ST}\big)^2 \, \mathrm{d}x + \int \big(q^J + q^I - q^{IJ}\big)^2 \, \mathrm{d}x
   + 2\int \big(q^J + q^I - q^{IJ}\big)(p - p_k) \, \mathrm{d}x
   + 2\int \big(q^J + q^I - q^{IJ}\big)\big(q^S + q^T - q^{ST}\big) \, \mathrm{d}x \nonumber\\
&= \mathrm{ISE}_k(H_{ST}) + C(I,J)
   + 2\int \big(q^J + q^I - q^{IJ}\big)\big(q^S + q^T - q^{ST}\big) \, \mathrm{d}x. \tag{C.5}
\end{align}
When two components $I$ and $J$ are merged, the merged component labeled $IJ$ obtains the label of component $I$ in the computational environment, and all elements of $Y_{k+1}$ corresponding to element $J$ are eliminated. The vector $Y_{k+1}$ should be updated for the new component as in
\begin{align}
\mathrm{ISE}_{k+1}(H_{(IJ)S} \mid M = IJ)
&= \int \big(p - p_k + q^J + q^I - q^{IJ} + q^S + q^{IJ} - q^{(IJ)S}\big)^2 \, \mathrm{d}x \nonumber\\
&= \int (p - p_k)^2 \, \mathrm{d}x + \int \big(q^J + q^I + q^S - q^{(IJ)S}\big)^2 \, \mathrm{d}x
   + 2\int (p - p_k)\big(q^J + q^I + q^S - q^{(IJ)S}\big) \, \mathrm{d}x, \tag{C.6}
\end{align}
where the first term is known from the last reduction step.
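Analogously to the pruning case, a sketch of the updates in (C.4)–(C.6) is given below. The scalar $C(I,J)$ is added to every remaining entry together with a cross term that can itself be expanded through cached pairwise integrals, since $q^a = w^a q(x;\eta^a)$ gives $\int (q^J + q^I - q^{IJ})\,q^S\,\mathrm{d}x = w^J w^S Q(J,S) + w^I w^S Q(I,S) - w^{IJ} w^S Q(IJ,S)$. The dictionary-based structures and function names are illustrative assumptions.

```python
def cross_with_merged(w, Q, I, J, IJ, S):
    """int (q^J + q^I - q^IJ) q^S dx expanded through cached pairwise
    integrals Q[(a, b)] and weights w[a], using q^a = w^a q(x; eta^a)."""
    return (w[J] * w[S] * Q[(J, S)]
            + w[I] * w[S] * Q[(I, S)]
            - w[IJ] * w[S] * Q[(IJ, S)])


def update_costs_after_merge(Y_prune, Y_merge, C_IJ, cross_prune, cross_merge, new_costs):
    """Incremental cost update after merging I and J, cf. (C.4)-(C.6).

    cross_prune[S]     : int (q^J + q^I - q^IJ) q^S dx
    cross_merge[(S,T)] : int (q^J + q^I - q^IJ) (q^S + q^T - q^ST) dx
    new_costs          : ISE_{k+1}(H_(IJ)S) values recomputed from (C.6)
    """
    for S in Y_prune:
        Y_prune[S] += C_IJ + 2.0 * cross_prune[S]
    for ST in Y_merge:
        Y_merge[ST] += C_IJ + 2.0 * cross_merge[ST]
    Y_merge.update(new_costs)   # hypotheses involving the new component IJ
    return Y_prune, Y_merge
```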
Bibliography
D. Alspach and H. Sorenson. Nonlinear Bayesian estimation using Gaussian
sum approximations. Automatic Control, IEEE Transactions on, 17(4):439–448,
1972. ISSN 0018-9286. doi: 10.1109/TAC.1972.1100034.
T. Ardeshiri and T. Chen. Maximum entropy property of discrete-time stable spline kernel. In Acoustics, Speech and Signal Processing (ICASSP),
2015 IEEE International Conference on, pages 3676–3680, April 2015. doi:
10.1109/ICASSP.2015.7178657.
T. Ardeshiri and E. Özkan. An adaptive PHD filter for tracking with unknown
sensor characteristics. In Information Fusion (FUSION), 2013 16th International Conference on, pages 1736–1743, July 2013.
T. Ardeshiri, S. Kharrazi, J. Sjöberg, J. Bärgman, and L. M. Sensor fusion for
vehicle positioning in intersection active safety applications. In International
Symposium on Advanced Vehicle Control, 2006a.
T. Ardeshiri, S. Kharrazi, R. Thomson, and J. Bärgman. Offset eliminative map
matching algorithm for intersection active safety applications. In Intelligent
Vehicles Symposium, 2006 IEEE, pages 82–88, 2006b. doi: 10.1109/IVS.2006.
1689609.
T. Ardeshiri, F. Larsson, F. Gustafsson, T. Schön, and M. Felsberg. Bicycle tracking
using ellipse extraction. In Information Fusion (FUSION), 2011 Proceedings of
the 14th International Conference on, pages 1–8, July 2011a.
T. Ardeshiri, M. Norrlöf, J. Löfberg, and A. Hansson. Convex optimization approach for time-optimal path tracking of robots with speed dependent constraints. In Proceedings of the 18th IFAC World Congress, Milan, Italy, pages
14648–14653, August 2011b.
T. Ardeshiri, U. Orguner, C. Lundquist, and T. Schön. On mixture reduction for
multiple target tracking. In Information Fusion (FUSION), 2012 15th International Conference on, pages 692–699, July 2012.
T. Ardeshiri, K. Granström, E. Özkan, and U. Orguner. Greedy reduction algorithms for mixtures of exponential family. Signal Processing Letters, IEEE, 22
(6):676–680, June 2015a. ISSN 1070-9908. doi: 10.1109/LSP.2014.2367154.
T. Ardeshiri, U. Orguner, and F. Gustafsson. Bayesian inference via approximation of log-likelihood for priors in exponential family. ArXiv e-prints, October
2015b. Submitted to Signal Processing, IEEE Transactions on.
T. Ardeshiri, U. Orguner, and E. Özkan. Gaussian Mixture Reduction Using Reverse Kullback-Leibler Divergence. ArXiv e-prints, August 2015. To be Submitted to Signal Processing, IEEE Transactions on.
T. Ardeshiri, E. Özkan, U. Orguner, and F. Gustafsson. Approximate Bayesian
smoothing with unknown process and measurement noise covariances. To appear in Signal Processing Letters, IEEE, 2015.
T. Ardeshiri, E. Özkan, and U. Orguner. On reduction of mixtures of the exponential family distributions. Technical Report LiTH-ISY-R-3076, Department
of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden, August 2014. URL http://urn.kb.se/resolve?urn=urn:nbn:se:
liu:diva-100234.
C. M. Bishop. Pattern Recognition and Machine Learning (Information Science
and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN
0387310738.
S. Blackman and R. Popoli. Design and analysis of modern tracking systems.
Artech House radar library. Artech House, 1999. ISBN 9781580530064.
T. Chen, H. Ohlsson, and L. Ljung. On the estimation of transfer functions, regularizations and Gaussian processes - Revisited. Automatica, 48:1525–1535,
2012a.
T. Chen, T. Ardeshiri, F. P. Carli, A. Chiuso, L. Ljung, and G. Pillonetto. Maximum
entropy properties of discrete-time first-order stable spline kernel. To appear
in Automatica, 2015.
T. Chen, M. Andersen, L. Ljung, A. Chiuso, and G. Pillonetto. System identification via sparse multiple kernel-based regularization using sequential convex
optimization techniques. Automatic Control, IEEE Transactions on, 59(11):
2933–2945, Nov 2014. ISSN 0018-9286. doi: 10.1109/TAC.2014.2351851.
X. Chen, R. Tharmarasa, M. Pelletier, and T. Kirubarajan. Integrated clutter estimation and target tracking using Poisson point processes. Aerospace and
Electronic Systems, IEEE Transactions on, 48(2):1210–1235, April 2012b. ISSN
0018-9251. doi: 10.1109/TAES.2012.6178058.
T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley &
Sons, 2012.
T. M. Cover and J. Thomas. Elements of Information Theory. John Wiley and
Sons, 2006.
M. Feldmann, D. Fränken, and W. Koch. Tracking of extended objects and group
targets using random matrices. Signal Processing, IEEE Transactions on, 59(4):
1409–1420, April 2011.
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984. ISSN 0162-8828.
K. Granström and U. Orguner. Estimation and maintenance of measurement rates for multiple extended target tracking. In Information Fusion (FUSION), 2012 15th International Conference on, pages 2170–2176, July 2012a.
K. Granström and U. Orguner. On the reduction of Gaussian inverse Wishart mixtures. In Information Fusion (FUSION), 2012 15th International Conference on, pages 2162–2169, July 2012b.
W. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970. doi: 10.1093/biomet/57.1.97.
W. Hennevogl, L. Fahrmeir, and G. Tutz. Multivariate Statistical Modelling Based
on Generalized Linear Models. Springer Series in Statistics. Springer New York,
2001. ISBN 9780387951874.
E. T. Jaynes. On the rationale of maximum-entropy methods. Proceedings of the
IEEE, 70(9):939–952, 1982.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction
to variational methods for graphical models. Mach. Learn., 37(2):183–233,
November 1999. ISSN 0885-6125. doi: 10.1023/A:1007665907178.
S. Kay. Fundamentals of Statistical Signal Processing: Detection theory. Prentice
Hall signal processing series. Prentice-Hall PTR, 1998. ISBN 9780135041352.
G. Kitagawa. The two-filter formula for smoothing and an implementation of the
Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46
(4):605–623, 1994.
M. Lifshits. Random Processes by Example. World Scientific Publishing Co. Pte.
Ltd, 2014. ISBN 978-981-4522-28-1.
L. Ljung. System Identification - Theory for the User. Prentice-Hall, Upper Saddle River, N.J., 2nd edition, 1999.
L. Ljung, H. Hjalmarsson, and H. Ohlsson. Four encounters with system identification. European Journal of Control, 17:449–471, 2011.
T. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference Annual Conference on Uncertainty in
Artificial Intelligence (UAI-01), pages 362–369, San Francisco, CA, 2001. Morgan Kaufmann.
T. Minka. Divergence measures and message passing. Technical report, Microsoft
Research Ltd., Cambridge, UK, 2005.
J. A. Nelder and R. W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society. Series A (General), 135(3):370–384, 1972. ISSN 0035-9238.
G. D. Nicolao, G. Ferrari-Trecate, and A. Lecchini. MAXENT priors for stochastic
filtering problems. In Mathematical Theory of Networks and Systems, Padova,
Italy, July 1998.
H. Nurminen, T. Ardeshiri, R. Piché, and F. Gustafsson. Robust inference for
state-space models with skewed measurement noise. Signal Processing Letters,
IEEE, 22(11):1898–1902, Nov 2015a. ISSN 1070-9908. doi: 10.1109/LSP.2015.
2437456.
H. Nurminen, T. Ardeshiri, R. Piché, and F. Gustafsson. A NLOS-robust TOA
positioning filter based on a skew-t measurement noise model. In 2015 International Conference on Indoor Positioning and Indoor Navigation (IPIN),
Banff, Alberta, Canada, October 2015b.
A. Papoulis and S. Pillai. Probability, Random Variables, and Stochastic Processes.
McGraw-Hill series in electrical engineering: Communications and signal processing. Tata McGraw-Hill, 2002. ISBN 9780070486584.
G. Pillonetto and G. D. Nicolao. A new kernel-based approach for linear system
identification. Automatica, 46(1):81–93, 2010.
G. Pillonetto and G. D. Nicolao. Kernel selection in linear system identification.
Part I: A Gaussian process perspective. In Proc. 50th IEEE Conference on Decision and Control, pages 4318–4325, Orlando, Florida, 2011.
G. Pillonetto, A. Chiuso, and G. D. Nicolao. Prediction error identification of
linear systems: a nonparametric Gaussian regression approach. Automatica,
47(2):291–305, 2011.
G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung. Kernel methods
in system identification, machine learning and function estimation: A survey.
Automatica, 50(3):657–682, 2014.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning.
MIT Press, Cambridge, MA, 2006.
K. J. Åström. Introduction to stochastic control theory, volume 70 of Mathematics
in science and engineering. Academic press, New York, London, 1970. ISBN
0-12-065650-7.
H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent
Gaussian models by using integrated nested Laplace approximations. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–
392, 2009. ISSN 1467-9868.
A. Runnalls.
Kullback-Leibler approach to Gaussian mixture reduction.
Aerospace and Electronic Systems, IEEE Transactions on, 43(3):989–999, July
2007. ISSN 0018-9251. doi: 10.1109/TAES.2007.4383588.
D. J. Salmond. Mixture reduction algorithms for target tracking in clutter. In
Proceeding of SPIE, Signal and Data Processing of Small Targets, volume 1305,
pages 434–445, 1990.
D. G. Tzikas, A. C. Likas, and N. P. Galatsanos. The variational approximation for
Bayesian inference. IEEE Signal Processing Magazine, 25(6):131–146, November 2008.
B.-N. Vo and W.-K. Ma. The Gaussian mixture probability hypothesis density filter. Signal Processing, IEEE Transactions on, 54(11):4091–4104, Nov. 2006. ISSN 1053-587X. doi: 10.1109/TSP.2006.881190.
M. Wainwright and M. Jordan. Graphical Models, Exponential Families, and Variational Inference. Foundations and trends in machine learning. Now Publishers, 2008. ISBN 9781601981844.
J. L. Williams and P. S. Maybeck. Cost-function-based hypothesis control techniques for multiple hypothesis tracking. Mathematical and Computer Modelling, 43(9-10):976–989, May 2006. ISSN 08957177. doi: 10.1016/j.mcm.
2005.05.022.
Part II
Publications
Papers
The articles associated with this thesis have been removed for copyright
reasons. For more details about these see:
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-121619
PhD Dissertations
Division of Automatic Control
Linköping University
M. Millnert: Identification and control of systems subject to abrupt changes. Thesis
No. 82, 1982. ISBN 91-7372-542-0.
A. J. M. van Overbeek: On-line structure selection for the identification of multivariable
systems. Thesis No. 86, 1982. ISBN 91-7372-586-2.
B. Bengtsson: On some control problems for queues. Thesis No. 87, 1982. ISBN 91-7372-593-5.
S. Ljung: Fast algorithms for integral equations and least squares identification problems.
Thesis No. 93, 1983. ISBN 91-7372-641-9.
H. Jonson: A Newton method for solving non-linear optimal control problems with general constraints. Thesis No. 104, 1983. ISBN 91-7372-718-0.
E. Trulsson: Adaptive control based on explicit criterion minimization. Thesis No. 106,
1983. ISBN 91-7372-728-8.
K. Nordström: Uncertainty, robustness and sensitivity reduction in the design of single
input control systems. Thesis No. 162, 1987. ISBN 91-7870-170-8.
B. Wahlberg: On the identification and approximation of linear systems. Thesis No. 163,
1987. ISBN 91-7870-175-9.
S. Gunnarsson: Frequency domain aspects of modeling and control in adaptive systems.
Thesis No. 194, 1988. ISBN 91-7870-380-8.
A. Isaksson: On system identification in one and two dimensions with signal processing
applications. Thesis No. 196, 1988. ISBN 91-7870-383-2.
M. Viberg: Subspace fitting concepts in sensor array processing. Thesis No. 217, 1989.
ISBN 91-7870-529-0.
K. Forsman: Constructive commutative algebra in nonlinear control theory. Thesis
No. 261, 1991. ISBN 91-7870-827-3.
F. Gustafsson: Estimation of discrete parameters in linear systems. Thesis No. 271, 1992.
ISBN 91-7870-876-1.
P. Nagy: Tools for knowledge-based signal processing with applications to system identification. Thesis No. 280, 1992. ISBN 91-7870-962-8.
T. Svensson: Mathematical tools and software for analysis and design of nonlinear control
systems. Thesis No. 285, 1992. ISBN 91-7870-989-X.
S. Andersson: On dimension reduction in sensor array signal processing. Thesis No. 290,
1992. ISBN 91-7871-015-4.
H. Hjalmarsson: Aspects on incomplete modeling in system identification. Thesis No. 298,
1993. ISBN 91-7871-070-7.
I. Klein: Automatic synthesis of sequential control schemes. Thesis No. 305, 1993.
ISBN 91-7871-090-1.
J.-E. Strömberg: A mode switching modelling philosophy. Thesis No. 353, 1994. ISBN 91-7871-430-3.
K. Wang Chen: Transformation and symbolic calculations in filtering and control. Thesis
No. 361, 1994. ISBN 91-7871-467-2.
T. McKelvey: Identification of state-space models from time and frequency data. Thesis
No. 380, 1995. ISBN 91-7871-531-8.
J. Sjöberg: Non-linear system identification with neural networks. Thesis No. 381, 1995.
ISBN 91-7871-534-2.
R. Germundsson: Symbolic systems – theory, computation and applications. Thesis
No. 389, 1995. ISBN 91-7871-578-4.
P. Pucar: Modeling and segmentation using multiple models. Thesis No. 405, 1995.
ISBN 91-7871-627-6.
H. Fortell: Algebraic approaches to normal forms and zero dynamics. Thesis No. 407,
1995. ISBN 91-7871-629-2.
A. Helmersson: Methods for robust gain scheduling. Thesis No. 406, 1995. ISBN 91-7871-628-4.
P. Lindskog: Methods, algorithms and tools for system identification based on prior
knowledge. Thesis No. 436, 1996. ISBN 91-7871-424-8.
J. Gunnarsson: Symbolic methods and tools for discrete event dynamic systems. Thesis
No. 477, 1997. ISBN 91-7871-917-8.
M. Jirstrand: Constructive methods for inequality constraints in control. Thesis No. 527,
1998. ISBN 91-7219-187-2.
U. Forssell: Closed-loop identification: Methods, theory, and applications. Thesis No. 566,
1999. ISBN 91-7219-432-4.
A. Stenman: Model on demand: Algorithms, analysis and applications. Thesis No. 571,
1999. ISBN 91-7219-450-2.
N. Bergman: Recursive Bayesian estimation: Navigation and tracking applications. Thesis
No. 579, 1999. ISBN 91-7219-473-1.
K. Edström: Switched bond graphs: Simulation and analysis. Thesis No. 586, 1999.
ISBN 91-7219-493-6.
M. Larsson: Behavioral and structural model based approaches to discrete diagnosis. Thesis No. 608, 1999. ISBN 91-7219-615-5.
F. Gunnarsson: Power control in cellular radio systems: Analysis, design and estimation.
Thesis No. 623, 2000. ISBN 91-7219-689-0.
V. Einarsson: Model checking methods for mode switching systems. Thesis No. 652, 2000.
ISBN 91-7219-836-2.
M. Norrlöf: Iterative learning control: Analysis, design, and experiments. Thesis No. 653,
2000. ISBN 91-7219-837-0.
F. Tjärnström: Variance expressions and model reduction in system identification. Thesis
No. 730, 2002. ISBN 91-7373-253-2.
J. Löfberg: Minimax approaches to robust model predictive control. Thesis No. 812, 2003.
ISBN 91-7373-622-8.
J. Roll: Local and piecewise affine approaches to system identification. Thesis No. 802,
2003. ISBN 91-7373-608-2.
J. Elbornsson: Analysis, estimation and compensation of mismatch effects in A/D converters. Thesis No. 811, 2003. ISBN 91-7373-621-X.
O. Härkegård: Backstepping and control allocation with applications to flight control.
Thesis No. 820, 2003. ISBN 91-7373-647-3.
R. Wallin: Optimization algorithms for system analysis and identification. Thesis No. 919,
2004. ISBN 91-85297-19-4.
D. Lindgren: Projection methods for classification and identification. Thesis No. 915,
2005. ISBN 91-85297-06-2.
R. Karlsson: Particle Filtering for Positioning and Tracking Applications. Thesis No. 924,
2005. ISBN 91-85297-34-8.
J. Jansson: Collision Avoidance Theory with Applications to Automotive Collision Mitigation. Thesis No. 950, 2005. ISBN 91-85299-45-6.
E. Geijer Lundin: Uplink Load in CDMA Cellular Radio Systems. Thesis No. 977, 2005.
ISBN 91-85457-49-3.
M. Enqvist: Linear Models of Nonlinear Systems. Thesis No. 985, 2005. ISBN 91-85457-64-7.
T. B. Schön: Estimation of Nonlinear Dynamic Systems — Theory and Applications. Thesis No. 998, 2006. ISBN 91-85497-03-7.
I. Lind: Regressor and Structure Selection — Uses of ANOVA in System Identification.
Thesis No. 1012, 2006. ISBN 91-85523-98-4.
J. Gillberg: Frequency Domain Identification of Continuous-Time Systems Reconstruction and Robustness. Thesis No. 1031, 2006. ISBN 91-85523-34-8.
M. Gerdin: Identification and Estimation for Models Described by Differential-Algebraic
Equations. Thesis No. 1046, 2006. ISBN 91-85643-87-4.
C. Grönwall: Ground Object Recognition using Laser Radar Data – Geometric Fitting,
Performance Analysis, and Applications. Thesis No. 1055, 2006. ISBN 91-85643-53-X.
A. Eidehall: Tracking and threat assessment for automotive collision avoidance. Thesis
No. 1066, 2007. ISBN 91-85643-10-6.
F. Eng: Non-Uniform Sampling in Statistical Signal Processing. Thesis No. 1082, 2007.
ISBN 978-91-85715-49-7.
E. Wernholt: Multivariable Frequency-Domain Identification of Industrial Robots. Thesis
No. 1138, 2007. ISBN 978-91-85895-72-4.
D. Axehill: Integer Quadratic Programming for Control and Communication. Thesis
No. 1158, 2008. ISBN 978-91-85523-03-0.
G. Hendeby: Performance and Implementation Aspects of Nonlinear Filtering. Thesis
No. 1161, 2008. ISBN 978-91-7393-979-9.
J. Sjöberg: Optimal Control and Model Reduction of Nonlinear DAE Models. Thesis
No. 1166, 2008. ISBN 978-91-7393-964-5.
D. Törnqvist: Estimation and Detection with Applications to Navigation. Thesis No. 1216,
2008. ISBN 978-91-7393-785-6.
P-J. Nordlund: Efficient Estimation and Detection Methods for Airborne Applications.
Thesis No. 1231, 2008. ISBN 978-91-7393-720-7.
H. Tidefelt: Differential-algebraic equations and matrix-valued singular perturbation.
Thesis No. 1292, 2009. ISBN 978-91-7393-479-4.
H. Ohlsson: Regularization for Sparseness and Smoothness — Applications in System
Identification and Signal Processing. Thesis No. 1351, 2010. ISBN 978-91-7393-287-5.
S. Moberg: Modeling and Control of Flexible Manipulators. Thesis No. 1349, 2010.
ISBN 978-91-7393-289-9.
J. Wallén: Estimation-based iterative learning control. Thesis No. 1358, 2011. ISBN 978-91-7393-255-4.
J. Hol: Sensor Fusion and Calibration of Inertial Sensors, Vision, Ultra-Wideband and GPS.
Thesis No. 1368, 2011. ISBN 978-91-7393-197-7.
D. Ankelhed: On the Design of Low Order H-infinity Controllers. Thesis No. 1371, 2011.
ISBN 978-91-7393-157-1.
C. Lundquist: Sensor Fusion for Automotive Applications. Thesis No. 1409, 2011.
ISBN 978-91-7393-023-9.
P. Skoglar: Tracking and Planning for Surveillance Applications. Thesis No. 1432, 2012.
ISBN 978-91-7519-941-2.
K. Granström: Extended target tracking using PHD filters. Thesis No. 1476, 2012.
ISBN 978-91-7519-796-8.
C. Lyzell: Structural Reformulations in System Identification. Thesis No. 1475, 2012.
ISBN 978-91-7519-800-2.
J. Callmer: Autonomous Localization in Unknown Environments. Thesis No. 1520, 2013.
ISBN 978-91-7519-620-6.
D. Petersson: A Nonlinear Optimization Approach to H2-Optimal Modeling and Control.
Thesis No. 1528, 2013. ISBN 978-91-7519-567-4.
Z. Sjanic: Navigation and Mapping for Aerial Vehicles Based on Inertial and Imaging
Sensors. Thesis No. 1533, 2013. ISBN 978-91-7519-553-7.
F. Lindsten: Particle Filters and Markov Chains for Learning of Dynamical Systems. Thesis No. 1530, 2013. ISBN 978-91-7519-559-9.
P. Axelsson: Sensor Fusion and Control Applied to Industrial Manipulators. Thesis
No. 1585, 2014. ISBN 978-91-7519-368-7.
A. Carvalho Bittencourt: Modeling and Diagnosis of Friction and Wear in Industrial
Robots. Thesis No. 1617, 2014. ISBN 978-91-7519-251-2.
M. Skoglund: Inertial Navigation and Mapping for Autonomous Vehicles. Thesis
No. 1623, 2014. ISBN 978-91-7519-233-8.
S. Khoshfetrat Pakazad: Divide and Conquer: Distributed Optimization and Robustness
Analysis. Thesis No. 1676, 2015. ISBN 978-91-7519-050-1.