Linköping studies in science and technology. Thesis.
No. 1652
Licentiate’s Thesis
Sequential Monte Carlo for inference in nonlinear state space models
Johan Dahlin
Division of Automatic Control
Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
http://www.control.isy.liu.se
[email protected]
Linköping 2014
This is a Swedish Licentiate’s Thesis.
Swedish postgraduate education leads to a Doctor’s degree and/or a Licentiate’s degree.
A Doctor’s degree comprises 240 ECTS credits (4 years of full-time studies).
A Licentiate’s degree comprises 120 ECTS credits,
of which at least 60 ECTS credits constitute the Licentiate’s thesis.
ISBN 978-91-7519-369-4
ISSN 0280-7971
LIU-TEK-LIC-2014:85
Copyright © 2014 Johan Dahlin
Printed by LiU-Tryck, Linköping, Sweden 2014
This thesis is dedicated to my family!
Abstract
Nonlinear state space models (SSMs) are a useful class of models for describing many
different kinds of systems. Some examples of their applications are modelling the
volatility in financial markets, the number of infected persons during an influenza
epidemic and the annual number of major earthquakes around the world. In this
thesis, we are concerned with state inference, parameter inference and input design
for nonlinear SSMs based on sequential Monte Carlo (SMC) methods.
The state inference problem consists of estimating some latent variable that is
not directly observable in the output from the system. The parameter inference
problem is concerned with fitting a pre-specified model structure to the observed
output from the system. In input design, we are interested in constructing an
input to the system that maximises the information about the parameters that is
available in the system output. All of these problems are analytically intractable
for nonlinear SSMs. Instead, we make use of SMC to approximate the solution to
the state inference problem and to solve the input design problem. Furthermore,
we make use of Markov chain Monte Carlo (MCMC) and Bayesian optimisation
(BO) to solve the parameter inference problem.
In this thesis, we propose new methods for parameter inference in SSMs using both
Bayesian and maximum likelihood inference. More specifically, we propose a new
proposal for the particle Metropolis-Hastings algorithm, which includes gradient
and Hessian information about the target distribution. We demonstrate that the
use of this proposal can reduce the length of the burn-in phase and improve the
mixing of the Markov chain.
Furthermore, we develop a novel parameter inference method based on the combination of BO and SMC. We demonstrate that this method requires a relatively
small number of samples from the analytically intractable likelihood, which are
computationally costly to obtain. Therefore, it could be a good alternative to
other optimisation based parameter inference methods. The proposed BO and
SMC combination is also extended for parameter inference in nonlinear SSMs with
intractable likelihoods using approximate Bayesian computations. This method is
used for parameter inference in a stochastic volatility model with α-stable returns
using real-world financial data.
Finally, we develop a novel method for input design in nonlinear SSMs which
makes use of SMC methods to estimate the expected information matrix. This
information is used in combination with graph theory and convex optimisation to
estimate optimal inputs with amplitude constraints. We also consider parameter
estimation in ARX models with Student-t innovations and unknown model orders.
Two different algorithms are used for this inference: reversible jump Markov chain
Monte Carlo and Gibbs sampling with sparseness priors. These methods are used
to model real-world EEG data with promising results.
Popular science summary (Populärvetenskaplig sammanfattning)

The world we live in is filled with different types of systems that can be described by
mathematical models. With the help of these models, we can gain a better understanding
of how these systems are affected by their surroundings, and predict how they will evolve
over time. For example, one can construct models of the weather based on knowledge of
physics and the weather of previous years. These models can then be used, for instance,
to predict whether it will rain tomorrow. Another type of model can be used to price
different kinds of financial contracts, based on past outcomes and economic theory. A
third example is models for predicting the number of future earthquakes in the world,
given historical data and some model assumptions.

What these examples have in common is that they all describe nonlinear dynamical
systems, that is, systems that evolve over time. In this thesis, we are interested in
building nonlinear state space models of dynamical systems using data-driven statistical
inference methods. With these models and methods, it is possible to combine theoretical
knowledge, encoded by a model structure, with observations from the system. The latter
information can be used to determine the values of some unknown parameters in the
model, which is also known as parameter estimation. Another common problem is state
estimation, where we want to determine the value of some dynamical quantity in the
system that cannot be observed directly. The main difficulty with this approach is that
neither of these estimation problems can be solved exactly using analytical methods.

Instead, we use approximate methods based on statistical simulation to solve the
problems. So-called sequential Monte Carlo methods are used to approximate the
solution to the state estimation problem. This is done using a computer that simulates
a large number of hypotheses (also known as particles) about how the system behaves.
The hypotheses that agree well with the actual observed behaviour are kept and refined
in the next step. The remaining hypotheses are removed from the simulation in order to
focus the computational effort on the hypotheses that are relevant according to the
observed information. The parameter estimation problem can be solved approximately
with similar simulation-based methods.

In this thesis, we mainly work on improving parameter estimation methods based on
particle Markov chain Monte Carlo (MCMC) and Bayesian optimisation. The improvements
that we propose allow the estimates to be computed faster than before, by making better
use of the information contained in the observed data. These methods are used, for
example, to price financial options. We also propose a new algorithm for constructing
inputs to a system such that the observations obtained from the system contain as much
information as possible about the unknown parameters. Finally, we demonstrate how MCMC
methods can be used for parameter estimation in models that can describe EEG signals.
Acknowledgments
I could never have written my thesis without the help, love and support from all
the people around me, now and in the past. I will now spend a few lines to express
my gratitude to some of the special people that have helped me along the way.
First and foremost, I would like to acknowledge the help and support from my supervisors. My main supervisor Prof. Thomas Schön has always provided me with
encouragement, ideas, new people to meet, new opportunities to grab and challenges to help me grow. Also, my co-supervisor Dr. Fredrik Lindsten has always
been there for me with answers to my questions, explanations to my problems
and ideas/suggestions for my work. I am most impressed by your enthusiasm and
dedication in the roles as supervisors. I think that you both have surpassed what
can be expected from a supervisor and for this I am extremely grateful. I could
never have done this without you! Thank you for your support and all our time
running together in Stanley Park, Maastricht, Warwick and Söderköping!
Furthermore, to be able to write a good thesis you require a good working environment. Prof. Svante Gunnarsson and Ninna Stensgård are two very important
people in this effort. Thank you for all your kindness, support and helpfulness
in all matters, small and large. I would also like to acknowledge Dr. Henrik Tidefelt and Dr. Gustaf Hendeby for constructing and maintaining the LaTeX template
in which this thesis is written. My brother Fredrik Dahlin, Lic. Joel Kronander,
Jonas Linder, Dr. Fredrik Lindsten, Prof. Thomas Schön, Andreas Svensson, Patricio Valenzuela and Johan Wågberg have all helped with proof-reading and with
suggestions to improve the thesis. All remaining errors are entirely my own.
I am most grateful for the financial support from the project Probabilistic modelling
of dynamical systems (Contract number: 621-2013-5524) funded by the Swedish
Research Council.
Another aspect of the atmosphere at work is all my wonderful colleagues. Especially my roommate Lic. Michael Roth, who helps with his experience and advice
about balancing life as a PhD student. Thank you for sharing the room with me
and watering our plants while I am away! I would also like to thank Jonas Linder
for all our adventures and our nice friendship together both at and outside of work.
Manon Kok and I started in the group at the same time and have helped each other
over the years. Thank you for your positive attitude and for always being open
for discussions. Also, I would like to acknowledge the wonderful BBQs and other
funny things that Lic. Sina Khoshfetrat Pakazad arranges and lets me participate
in!
Furthermore, I would like to thank my remaining friends and colleagues at the
group. Especially, (without any specific ordering) Dr. Christian Lyzell, Lic. Ylva
Jung, Isak Nielsen, Hanna Nyquist, Dr. Daniel Petersson, Clas Veibäck and Dr.
Emre Özkan for all the funny things that we have done together. This includes
everything from beer tastings, wonderful food in France and fitting as many of RT's
PhD students into a jacuzzi as possible, to hitting some shallows on the open sea with
a canoe, cross-country skiing in Chamonix and eating food sitting on the floor in
a Japanese restaurant with screaming people everywhere. You have given me the
most wonderful memories and times!
I have also had the benefit of working with a lot of talented and enthusiastic researchers during my time in academia. I would first like to thank all my fellow
students at Teknisk Fysik, Umeå University for a wonderful time as an undergraduate student. Also, Prof. Anders Fällström, Dr. Konrad Abramowicz, Dr. Ulf
Holmberg and Dr. Sang Hoon Lee have inspired, supported and encouraged me to
pursue a PhD degree, something I have not (a.s.) regretted!
Also, my wonderful (former) colleagues at the Swedish Defence Research Agency
(FOI) have supported and encouraged me to continue climbing the educational
ladder. Thank you, Dr. Pontus Svensson, Dr. Fredrik Johansson, Dr. Tove Gustavi
and Christian Mårtensson. Finally, I would like to thank all my co-authors during
my time at Linköping University for some wonderful and fruitful collaborations:
Daniel Hultqvist, Lic. Daniel Jönsson, Lic. Joel Kronander, Dr. Fredrik Lindsten,
Cristian Rojas, Dr. Jakob Roll, Prof. Thomas Schön, Fredrik Svensson, Dr. Jonas
Unger, Patricio Valenzuela, Prof. Mattias Villani and Dr. Adrian Wills.
Furthermore, Lic. Joel Kronander and Dr. Jonas Unger have helped me with the
images from the computer graphics application in the introduction. Prof. Mattias
Villani together with Stefan Laséen and Vesna Crobo at Riksbanken helped with
the economics application and made the forecasts from RAMSES II.
Finally, I am most grateful to my loving family and close relatives for their support
all the way from childhood until now and beyond. I love you all very much! Also,
my friends are always a great source for support, inspiration and encouragement
both at work and in life! My life would be empty and meaningless without you
all! I hope that we all can spend some more time together now that my thesis is done
and new challenges await! Because what would life be without challenges,
meeting new people and spending time with your loved ones? Empty.
Linköping, May 2014
Johan Dahlin
Contents

Notation

Part I: Background

1 Introduction
  1.1 Examples of applications
    1.1.1 Predicting GDP growth
    1.1.2 Rendering photorealistic images
  1.2 Thesis outline and contributions
  1.3 Publications

2 Nonlinear state space models and statistical inference
  2.1 State space models and inference problems
  2.2 Some motivating examples
    2.2.1 Linear Gaussian model
    2.2.2 Volatility models in econometrics and finance
    2.2.3 Earthquake count model in geology
    2.2.4 Daily rainfall models in meteorology
  2.3 Maximum likelihood parameter inference
  2.4 Bayesian parameter inference

3 State inference using particle methods
  3.1 Filtering and smoothing recursions
  3.2 Monte Carlo and importance sampling
  3.3 Particle filtering
    3.3.1 The auxiliary particle filter
    3.3.2 State inference using the auxiliary particle filter
    3.3.3 Statistical properties of the auxiliary particle filter
    3.3.4 Estimation of the likelihood and log-likelihood
  3.4 Particle smoothing
    3.4.1 State inference using the particle fixed-lag smoother
    3.4.2 Estimation of additive state functionals
    3.4.3 Statistical properties of the particle fixed-lag smoother
  3.5 SMC for Image Based Lighting

4 Parameter inference using sampling methods
  4.1 Overview of computational methods for parameter inference
    4.1.1 Maximum likelihood parameter inference
    4.1.2 Bayesian parameter inference
  4.2 Metropolis-Hastings
    4.2.1 Statistical properties of the MH algorithm
    4.2.2 Proposals using Langevin and Hamiltonian dynamics
  4.3 Particle Metropolis-Hastings
  4.4 Bayesian optimisation
    4.4.1 Gaussian processes as surrogate functions
    4.4.2 Acquisition rules
    4.4.3 Gaussian process optimisation

5 Concluding remarks and future work
  5.1 Summary of the contributions
  5.2 Outlook and future work
    5.2.1 Particle Metropolis-Hastings
    5.2.2 Gaussian process optimisation using the particle filter
    5.2.3 Input design in SSMs
  5.3 Source code and data

Bibliography

Part II: Publications

A PMH using gradient and Hessian information
  1 Introduction
  2 Particle Metropolis-Hastings
    2.1 MH sampling with unbiased likelihoods
    2.2 Constructing the first and second order proposals
    2.3 Properties of the first and second order proposals
  3 Estimation of likelihoods, gradients, and Hessians
    3.1 Auxiliary particle filter
    3.2 Estimation of the likelihood
    3.3 Estimation of the gradient
    3.4 Estimation of the Hessian
    3.5 Accuracy of the estimated gradients and Hessians
    3.6 Resulting SMC algorithm
  4 Numerical illustrations
    4.1 Estimation of the log-likelihood and the gradient
    4.2 Burn-in and scale-invariance
    4.3 The mixing of the Markov chains at stationarity
  5 Discussion and future work
  Bibliography

B Particle filter-based GPO for parameter inference
  1 Introduction
  2 Maximum likelihood estimation with a surrogate cost function
  3 Estimating the log-likelihood
    3.1 The particle filter
    3.2 Estimation of the likelihood
    3.3 Estimation of the log-likelihood
  4 Modelling the surrogate function
    4.1 Gaussian process model
    4.2 Updating the model and the hyperparameters
    4.3 Example of log-likelihood modelling
  5 Acquisition rules
    5.1 Expected improvement
  6 Numerical illustrations
    6.1 Implementation details
    6.2 Linear Gaussian state space model
    6.3 Nonlinear stochastic volatility model
  7 Conclusions
  Bibliography

C Approximate inference in SSMs with intractable likelihoods using GPO
  1 Introduction
  2 An intuitive overview
  3 Estimating the posterior distribution
    3.1 State inference
    3.2 Estimation of the log-likelihood
  4 Gaussian process optimisation
    4.1 Constructing the surrogate function
    4.2 The acquisition rule
  5 Putting the algorithm together
  6 Numerical illustrations
    6.1 Inference in α-stable data
    6.2 Linear Gaussian model
    6.3 Stochastic volatility model with α-stable returns
  7 Conclusions and outlook
  Bibliography
  A α-stable distributions
    A.1 Definitions
    A.2 Simulating random variables
    A.3 Parameter estimation

D A graph/particle-based method for experiment design
  1 Introduction
  2 Problem formulation
  3 New input design method
    3.1 Graph theoretical input design
    3.2 Estimation of the score function
    3.3 Monte Carlo-based optimisation
    3.4 Summary of the method
  4 Numerical examples
    4.1 Linear Gaussian state space model
    4.2 Nonlinear growth model
  5 Conclusion
  Bibliography

E Hierarchical Bayesian approaches for robust inference in ARX models
  1 Introduction
  2 Hierarchical Bayesian ARX Models
    2.1 Student's t distributed innovations
    2.2 Parametric model order
    2.3 Automatic relevance determination
  3 Markov chain Monte Carlo
  4 Posteriors and proposal distributions
    4.1 Model order
    4.2 ARX coefficients
    4.3 ARX coefficients variance
    4.4 Latent variance variables
    4.5 Innovation scale parameter
    4.6 Innovation DOF
  5 Numerical illustrations
    5.1 Average model performance
    5.2 Robustness to outliers and missing data
    5.3 Real EEG data
  6 Conclusions and Future work
  Bibliography
Notation

Probability

  −→ (a.s.)                 Almost sure convergence.
  −→ (d)                    Convergence in distribution.
  −→ (p)                    Convergence in probability.
  δ_z(dx)                   Dirac point mass located at x = z.
  P, E, V                   Probability, expectation and covariance operators.
  ∼                         Sampled from or distributed according to.

Statistical distributions

  A(α, β, γ, η)             α-stable distribution with stability α, skewness β, scale γ and location η.
  B(p)                      Bernoulli distribution with success probability p.
  N(µ, σ²)                  Gaussian (normal) distribution with mean µ and variance σ².
  G(α, β)                   Gamma distribution with rate α and shape β.
  IG(α, β)                  Inverse Gamma distribution with rate α and shape β.
  P(λ)                      Poisson distribution with mean λ.
  U(a, b)                   Uniform distribution on the interval [a, b].

Operators and other symbols

  I_d                       d × d identity matrix.
  ≜                         Definition.
  diag(v)                   Diagonal matrix with the vector v on the diagonal.
  ∇f(x)                     Gradient of f(x).
  ∇²f(x)                    Hessian of f(x).
  𝟙                         Indicator function.
  det(A), |A|               Matrix determinant of A.
  A⁻¹                       Matrix inverse of A.
  tr(A)                     Matrix trace of A.
  Aᵀ                        Matrix transpose of A.
  v^{⊗2} = vvᵀ              Outer product of the vector v.
  a_{n:m}                   Sequence {a_n, a_{n+1}, ..., a_{m−1}, a_m}, for m > n.
  sign(x)                   Sign of x.
  supp(f)                   Support of the function f, {x : f(x) > 0}.

Statistical quantities

  I(θ)                      Expected information matrix evaluated at θ.
  L(θ)                      Likelihood function evaluated at θ.
  ℓ(θ)                      Log-likelihood function evaluated at θ.
  θ̂_ML                     Maximum likelihood parameter estimate.
  J(θ)                      Observed information matrix evaluated at θ.
  θ̂                        Parameter estimate.
  p(θ|y_{1:T})              Parameter posterior distribution.
  p(θ)                      Parameter prior distribution.
  θ                         Parameter vector, θ ∈ Θ ⊆ R^d.
  S(θ)                      Score function evaluated at θ.

Algorithmic quantities

  a_t^{(i)}                 Ancestor of particle i at time t.
  Z                         Normalisation constant.
  x_t^{(i)}                 Particle i at time t.
  R_θ(x_t | x_{0:t−1}, y_t) Particle proposal kernel.
  W_θ(x_t, x_{t−1})         Particle weighting function.
  q(θ)                      Proposal distribution.
  π(θ)                      Target distribution.
  γ(θ)                      Unnormalised target distribution.
  w_t^{(i)}, w̃_t^{(i)}     Un- and normalised weight of particle i at time t.
Abbreviations

  a.s.          Almost surely (with probability 1).
  ABC           Approximate Bayesian computations.
  ACF           Autocorrelation function.
  AIS           Adaptive importance sampling.
  APF           Auxiliary particle filter.
  AR(p)         Autoregressive process of order p.
  ARD           Automatic relevance determination.
  ARCH(p)       AR conditional heteroskedasticity process of order p.
  ARX(p)        Autoregressive exogenous process of order p.
  BIS           Bidirectional importance sampling.
  BO            Bayesian optimisation.
  bPF           Bootstrap particle filter.
  BRDF          Bidirectional reflectance distribution function.
  CDF           Cumulative distribution function.
  CLT           Central limit theorem.
  CPI           Consumer price index.
  DSGE          Dynamic stochastic general equilibrium.
  EB            Empirical Bayes.
  EEG           Electroencephalography.
  EI            Expected improvement.
  EM            Environment map.
  ESS           Effective sample size.
  faPF          Fully-adapted particle filter.
  FFBSm         Forward-filtering backward-smoothing.
  FFBSi         Forward-filtering backward-simulation.
  FL            Fixed-lag (particle smoother).
  GARCH(p,q)    Generalised ARCH process of order (p, q).
  GPO           Gaussian process optimisation.
  GPU           Graphical processing unit.
  HMM           Hidden Markov model.
  IACT          Integrated autocorrelation time.
  IBL           Image-based lighting.
  IID           Independent and identically distributed.
  IS            Importance sampling.
  KDE           Kernel density estimate/estimator.
  LGSS          Linear Gaussian state space.
  LTE           Light transport equation.
  MCMC          Markov chain Monte Carlo.
  MH            Metropolis-Hastings.
  MIS           Multiple importance sampling.
  ML            Maximum likelihood.
  MLE           Maximum likelihood estimator.
  MLT           Metropolis light transport.
  MSE           Mean square error.
  PD            Positive definite.
  PDF           Probability density function.
  PMF           Probability mass function.
  PF            Particle filter.
  PG            Particle Gibbs.
  PI            Probability of improvement.
  PMCMC         Particle Markov chain Monte Carlo.
  PMH           Particle Metropolis-Hastings.
  PMH0          Marginal particle Metropolis-Hastings.
  PMH1          PMH using first order information.
  PMH2          PMH using first and second order information.
  PS            Particle smoother.
  RJ-MCMC       Reversible jump Markov chain Monte Carlo.
  RTS           Rauch-Tung-Striebel.
  RW            Random walk.
  SIS           Sequential importance sampling.
  SIR           Sequential importance sampling and resampling.
  SLLN          Strong law of large numbers.
  SMC           Sequential Monte Carlo.
  SPSA          Simultaneous perturbation stochastic approximation.
  SSM           State space model.
  UCB           Upper confidence bound.
Part I: Background

1 Introduction
Science is the art of collecting and organising knowledge about the universe by
tested explanations and validated predictions. Therefore, modelling the world
using observations and statistical inference is an integral part of the scientific
method. The resulting statistical models can be used to describe certain observed
phenomena or to predict new phenomena and future behaviours. An example of the
former is to discover new physical models by generalising from observed data using
induction. Examples of prediction applications are to validate scientific theories
or to forecast the future GDP of Sweden, the probability of rainfall tomorrow and
the number of earthquakes during the coming year. We discuss some of the details
of these problems in the following chapters.
This thesis is concerned with building dynamical models from recorded observations,
i.e. models of systems that evolve over time. The observations are combined with
past experiences and established scientific theory to build models using statistical
tools. Here, we limit ourselves to discussing nonlinear state space models (SSMs),
where most of the structure is known beforehand except a few parameters. A fairly
general class of SSMs can be expressed as

\begin{align*}
x_{t+1} \mid x_t &\sim f_\theta(x_{t+1} \mid x_t), \\
y_t \mid x_t &\sim g_\theta(y_t \mid x_t),
\end{align*}

where x_t and y_t denote an unobserved (latent) state and an observation from
the system at time t. Here, f_θ(x_{t+1} | x_t) and g_θ(y_t | x_t) denote two Markov kernels
parametrised by an unknown static real-valued parameter vector θ. Applications
of this class of SSMs can be found in almost all of the natural sciences and most of
the social sciences. Some specific examples are biology (Wilkinson, 2011), control
(Ljung, 1999), epidemiology (Keeling and Rohani, 2008) and finance (Tsay, 2005;
Hull, 2009).
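To make the model class concrete, the following is a minimal sketch in Python of how data can be simulated from a scalar linear Gaussian SSM of this form. The model, parameter values and function names here are illustrative assumptions for this sketch, not taken from the thesis.

    import numpy as np

    def simulate_lgss(phi, sigma_v, sigma_e, T, seed=0):
        # Toy model (illustrative assumption, not from the thesis):
        #   x_{t+1} = phi * x_t + v_t,  v_t ~ N(0, sigma_v^2)   (the kernel f)
        #   y_t     = x_t + e_t,        e_t ~ N(0, sigma_e^2)   (the kernel g)
        # with the initial state x_0 = 0.
        rng = np.random.default_rng(seed)
        x = np.zeros(T + 1)
        y = np.zeros(T + 1)
        for t in range(T):
            x[t + 1] = phi * x[t] + sigma_v * rng.standard_normal()
            y[t + 1] = x[t + 1] + sigma_e * rng.standard_normal()
        return x, y[1:]  # state trajectory x_{0:T} and observations y_{1:T}

    x, y = simulate_lgss(phi=0.9, sigma_v=0.5, sigma_e=1.0, T=250)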
The procedure to determine the parameter vector θ from the observations y1:T is
referred to as parameter inference and this problem is analytically intractable for
nonlinear SSMs. Another related problem is the state inference problem, where we
would like to determine the value of xt given the information in the observations
y1:T or y1:t . This problem is also analytically intractable for most SSMs.
Instead, we make use of statistical simulation methods to estimate the parameters
and states. As the name suggests, these methods are based on simulating a large
number (often thousands or millions) of hypotheses, referred to as particles. The particles
that match the recorded observations are retained and the others are discarded.
This procedure can be repeated in a sequential manner, where the solution of the
problem is obtained as the solution to many subproblems.
This idea is the basis for the sequential Monte Carlo (SMC) methods that are an
integral part of the methods that we consider for state inference in SSMs. For
example, the marginal filtering distribution p_θ(x_t | y_{1:t}) can be approximated by an
empirical distribution,

\begin{align*}
\hat{p}_\theta(dx_t \mid y_{1:t}) = \sum_{i=1}^{N} \tilde{w}_t^{(i)} \, \delta_{x_t^{(i)}}(dx_t),
\end{align*}

where x_t^{(i)} and \tilde{w}_t^{(i)} denote particle i and its corresponding (normalised)
weight obtained from the SMC algorithm. Here, δ_z(dx) denotes a Dirac point mass
located at x = z. The empirical distribution summarises all the information that
is contained within the data about the value of the latent state at some time t.
This information can then be used by other methods for solving the parameter
inference problem in SSMs.
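As an illustration, a minimal bootstrap particle filter for the scalar linear Gaussian model simulated above can be sketched as follows. This is a generic textbook variant under the assumptions of the earlier snippet, not the exact algorithms developed later in the thesis.

    import numpy as np

    def bootstrap_pf(y, phi, sigma_v, sigma_e, N=500, seed=0):
        # Bootstrap particle filter (generic sketch): propagate particles through
        # the state transition f, weight them by the observation density g and
        # resample. Returns estimates of E[x_t | y_{1:t}] and the log of the
        # unbiased likelihood estimate.
        rng = np.random.default_rng(seed)
        x = np.zeros(N)                       # particles, initialised at x_0 = 0
        xhat = np.zeros(len(y))
        loglik = 0.0
        for t in range(len(y)):
            x = phi * x + sigma_v * rng.standard_normal(N)       # propagate via f
            logw = -0.5 * np.log(2.0 * np.pi * sigma_e**2) \
                   - 0.5 * ((y[t] - x) / sigma_e) ** 2           # weight via g
            c = logw.max()
            w = np.exp(logw - c)
            loglik += c + np.log(w.sum()) - np.log(N)            # p(y_t | y_{1:t-1})
            w /= w.sum()
            xhat[t] = np.sum(w * x)           # weighted mean approximates the filter mean
            x = x[rng.choice(N, size=N, p=w)]                    # multinomial resampling
        return xhat, loglik

    xhat, loglik = bootstrap_pf(y, phi=0.9, sigma_v=0.5, sigma_e=1.0, N=1000)

Here, the argument N makes the accuracy/cost trade-off discussed next explicit: more particles give a better empirical approximation at a higher computational cost.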
The number of particles N in the SMC method controls both the accuracy of the
empirical distributions and the computational cost. That is, high accuracy requires many particles, which incurs a high computational cost; this results in a
trade-off between accuracy and speed. Also, we make use of the SMC algorithm
within some iterative parameter inference methods to solve the state inference
problem at each iteration. Therefore, we would like to limit the number of iterations required by the parameter inference methods to obtain an accurate parameter
estimate with a reasonable computational cost.
These ideas are the two main themes of this thesis. The first theme is to propose
some developments to improve the efficiency of existing parameter inference methods based on SMC algorithms. The second theme is to extend some of the current
methods for linear SSMs to nonlinear SSMs by making use of SMC algorithms.
We return to the contributions of this thesis in Section 1.2.
1.1 Examples of applications
In this section, we give two examples of problems in which computational statistical methods based on the SMC algorithm are useful. In the first example, we use a
model to forecast the future development of the Swedish economy given past experience and economic theory. In the second example, we use a model constructed
from the physics of light transport to render photorealistic images.
1.1.1 Predicting GDP growth
The economy of a country is a complex system with an emergent behaviour depending on the actions of many interacting heterogeneous agents. In an economy,
these agents correspond to consumers, companies, banks, politicians, governmental
agencies, other countries, etc. As such, these agents may or may not act rationally
in response to their situation and could therefore be difficult to model on an individual level.
As a result, economic models mostly deal with the aggregated behaviour of
many homogeneous agents, i.e. rational utility maximising agents with a common
valuation of goods and services. These models can be used to produce forecasts,
to gain understanding about the current situation in the economy and to simulate
the result of different policy decisions. An example of this could be to study the
impact of changing the repo rate on the unemployment level and the GDP growth
of the economy.
For this purpose, many central banks are today using dynamic stochastic general
equilibrium (DSGE) models (An and Schorfheide, 2007; Del Negro and Schorfheide,
2004) for modelling the economy of a country. The outputs from these models
are various macroeconomic quantities, such as GDP growth, unemployment rate,
inflation, etc. The general structure is given by economic theory, but there are
some unknown parameters that need to be inferred from data.
Riksbanken (the Swedish central bank) has developed a DSGE model called the
Riksbank Aggregate Macromodel for Studies of the Economy of Sweden II (RAMSES II) (Adolfson et al., 2013, 2007a) to model the Swedish economy. Essentially,
RAMSES II is a nonlinear SSM with 12 outputs, about 40 latent states and about
65 unknown parameters.
For computational convenience, only the log-linearised version of the full model is
considered in most of the analysis. Consequently, Kalman filtering methods can
be used to solve the state inference problem. The parameter inference problem is
solved using a Metropolis-Hastings (MH) algorithm, where the proposal is a multivariate Gaussian distribution with the covariance matrix given by the inverse of
the observed information matrix at the posterior mode. The information matrix
is estimated using quasi-Newton optimisation algorithms such as the BFGS algorithm (Nocedal and Wright, 2006). For more details, see Adolfson et al. (2007b).
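To give a flavour of this type of algorithm, a generic random-walk Metropolis-Hastings sampler can be sketched as below. The log_posterior function and the proposal covariance Sigma are hypothetical placeholders, and the actual RAMSES II implementation differs in many details.

    import numpy as np

    def random_walk_mh(log_posterior, theta0, Sigma, n_iter=10000, seed=0):
        # Random-walk MH with a multivariate Gaussian proposal; Sigma could for
        # instance be the inverse observed information at the posterior mode.
        # log_posterior is a user-supplied placeholder in this sketch.
        rng = np.random.default_rng(seed)
        L = np.linalg.cholesky(Sigma)
        theta = np.array(theta0, dtype=float)
        lp = log_posterior(theta)
        chain = np.zeros((n_iter, theta.size))
        for k in range(n_iter):
            prop = theta + L @ rng.standard_normal(theta.size)   # propose
            lp_prop = log_posterior(prop)
            if np.log(rng.uniform()) < lp_prop - lp:             # accept/reject
                theta, lp = prop, lp_prop
            chain[k] = theta
        return chain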
In Chapters 3 and 4, we discuss alternative methods that could solve the state and
parameter inference problem in the original nonlinear version of RAMSES II. For
related treatments using SMC and MCMC in combination with DSGE models, see
Flury and Shephard (2011), Fernández-Villaverde and Rubio-Ramírez (2007) and
Amisano and Tristani (2010).
In Figure 1.1, we give an example of how the RAMSES II model can be used for
forecasting the changes in GDP, the consumer price index with fixed interest rates
(CPIF) inflation, the repo rate and the unemployment gap¹ in Sweden. We first
make use of some historical data (up until the dotted vertical line) to estimate the
parameters and the latent states. The resulting model is then used to forecast the
predictive posterior mean (solid lines) with some credibility intervals (gray areas).
These predictive means are used by Riksbanken together with experts and other
models to construct forecasts of the economy. This forecast is presented as crosses
and differs from the output of the RAMSES II model.

¹ The unemployment gap is the amount (in percent) that the GDP must increase to be able
to achieve full employment of the work force. Decreasing the unemployment gap is an
important aim of financial policy making and can be achieved (in Keynesian economic theory)
by increasing public spending or lowering taxes.

[Figure 1.1: The quarterly GDP growth (green), the repo rate (red), the annual
change in the CPIF (blue) and the unemployment gap (orange). The historical
data is presented for each variable up until the dotted vertical lines. The
predicted means are presented with the 50%, 70%, 90% and 95% credibility
intervals (from lighter to darker gray). The published forecasts from Riksbanken
are presented as crosses. The data and predictions are obtained by courtesy of
Riksbanken.]
1.1.2 Rendering photorealistic images
To simulate light transport, we make use of a geometrical optics model, developed
during centuries of research in the field of physics (Hecht, 2013). By the use of
this model, we can simulate how light behaves in different environments and make
use of this to render images. A popular related application is to add objects into
an image or video sequence that were not present when the scene was captured.
In this section, we briefly discuss how to do this using the so-called image-based
lighting (IBL) method (Debevec, 1998; Pharr and Humphreys, 2010).
To make use of the IBL method, we require a panoramic image of the real scene
captured using a high dynamic range (HDR) camera. This type of camera can
record much larger variations in brightness than a standard camera, which are
needed to capture all the different light sources within the scene. The resulting
image is referred to as an environment map (EM). In IBL, this panoramic image serves as the source of illumination when rendering images, allowing the real
objects to cast shadows and interact with the virtual objects. Secondly, we need
geometrical models of the objects that we would like to add into the scene. Finally,
we require a mathematical description of the optical properties of the materials in
these objects to be able to simulate how the light scatters over their surfaces.
The IBL method combines all of this information using the light transport equation
(LTE), which is a physical model of how light rays propagate through space and
reflect off surfaces. The LTE model cannot be solved analytically, but it can be
approximated using methods related to SMC algorithms. To see how this can be
done, consider a cartoon of the setup presented in Figure 1.2. In the first step, a
set of light rays originating from a pixel in the image plane is generated. We then
track how these rays bounce around in the scene until they finally hit the EM.
The colours and brightnesses of the EM in these locations are recorded and used
to compute the resulting colour and brightness of the pixel in the image plane.
This approach is repeated for all the pixels in the image plane.
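As a heavily simplified sketch of this idea, the snippet below estimates the brightness of a single pixel covering one diffuse surface point that is lit directly by the EM. The environment_map function is a hypothetical placeholder, and a real renderer traces full ray paths through the scene geometry.

    import numpy as np

    def sample_hemisphere(n, rng):
        # Cosine-weighted directions on the hemisphere around the normal (0, 0, 1),
        # a common sampling choice for diffuse (Lambertian) surfaces.
        u1, u2 = rng.uniform(size=n), rng.uniform(size=n)
        r, angle = np.sqrt(u1), 2.0 * np.pi * u2
        return np.stack([r * np.cos(angle), r * np.sin(angle), np.sqrt(1.0 - u1)], axis=1)

    def estimate_pixel(environment_map, albedo=0.7, n_rays=1000, seed=0):
        # Monte Carlo estimate of the outgoing brightness: with cosine-weighted
        # sampling, the estimator reduces to the albedo times the average
        # radiance arriving from the sampled directions.
        rng = np.random.default_rng(seed)
        dirs = sample_hemisphere(n_rays, rng)
        radiance = np.array([environment_map(d) for d in dirs])
        return albedo * radiance.mean()

    # Example with a toy EM: a single bright region straight above the surface.
    pixel = estimate_pixel(lambda d: 5.0 if d[2] > 0.9 else 0.1)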
However, in the real world there are infinitely many rays that bounce around in the
scene before they hit the pixels in the image plane. As a result, it is computationally
infeasible to simulate all the light rays and all the bounces in the scene. Instead,
there are methods to select only the light rays that contribute the most to
the brightness and colour of each pixel in the image plane. This can be done in
a similar manner to the methods discussed later in Section 4.2, resulting in the
Metropolis light transport (MLT) algorithm (Veach and Guibas, 1997). The basis
for these methods is to solve the LTE problem by simulating different hypotheses
and improving them in analogy with SMC methods. That is, light rays that hit
bright areas of the EM are kept and modified, whereas rays that hit the EM
in dim regions or bounce around for too long are discarded.

[Figure 1.2: The basic setup of ray tracing underlying photorealistic image
synthesis. The colour and brightness of a pixel in the image plane is determined
by the EM, the geometry of the scene and the optical properties of the objects.]

[Figure 1.3: The scene before (left) and after (right) the rendering using a
version of the IBL method. The image is taken from Unger et al. (2013) and
is used with courtesy of the authors.]
Note that it can take several days to render a single image using the IBL algorithm,
even when only allowing for a few bounces and light rays per pixel in the image
plane. This problem grows even further when we would like to render a sequence of
images. A possible solution could be to start from the solution from the previous
frame and adapt it to the new frame. If the EMs are similar, this could lead to a
decrease in the total computational cost. We return to this idea in Section 3.5.
In Figure 1.3, we present an example from Unger et al. (2013) of a scene before
(left) and after (right) it is rendered in a computer by the use of the IBL method
and the methods discussed in Section 4.2. Note that in the final result, we have
added several photorealistic objects into the scene, such as the sofa and the table,
and have also changed the floor in the room. These methods are used in many
entertainment applications to create special effects and to modify scenes in post
production. Furthermore, they are useful in rendering images of scenes that are
difficult or costly to build in the real-world. Some well-known companies (such as
IKEA and Volvo) make use of these methods for digital design and advertisements
as a cost effective alternative to traditional photography.
1.2 Thesis outline and contributions
This thesis is divided into two parts. In Part I, we give some examples of models and applications together with an introduction to the different computational
inference methods that are used. In Part II, we present edited versions of some
published peer-reviewed papers and unpublished technical reports.
Part I - Background
In this part, we begin by introducing the SSM and provide some additional examples of its real-world applications in Chapter 2. Furthermore, we introduce
two different statistical paradigms for parameter inference problems in SSMs: the
maximum likelihood (ML) based approach and the Bayesian approach. Finally,
we discuss why computational methods are required for estimating the solution to
these problems.
In Chapter 3, we review the state inference problem in SSMs and discuss the
use of SMC methods for approximating the solution to these problems. We also
discuss the use of SMC algorithms for other classes of models, which includes the
computer graphics example discussed in Section 1.1.2.
Chapter 4 is devoted to discussing the parameter inference problem for nonlinear
SSMs. We begin by giving an overview of different parameter inference methods
and then discuss Markov chain Monte Carlo (MCMC) and Bayesian optimisation
(BO) in more detail. The former can be used for Bayesian parameter inference and
the latter can be used for ML or maximum a posteriori (MAP) based parameter
inference.
We conclude Part I with Chapter 5, which contains a summary of the contributions of
the thesis together with some general conclusions and possible avenues for future
work.
Part II - Publications
The main part of this thesis is the compilation of five papers published in peer-reviewed conference proceedings or as technical reports. These papers contain the
main contributions of this thesis:
• In Paper A, we develop a novel particle MCMC algorithm that combines
the Particle Metropolis-Hastings (PMH) with Langevin dynamics. The
resulting algorithm explores the posterior distribution more efficiently than
the marginal PMH algorithm, is invariant to affine transformations of the
parameter vector and reduces the length of the burn-in. As a consequence, the proposed algorithm requires fewer iterations, which makes it more
computationally efficient than the marginal PMH algorithm.
• In Paper B, we develop a novel algorithm for ML parameter inference by
combining ideas from BO with SMC for log-likelihood estimation. The resulting algorithm is computationally efficient, as it requires fewer samples of
the log-likelihood compared with other popular methods.
• In Paper C, we extend the combination of BO and SMC to parameter inference in nonlinear SSMs with intractable likelihoods. Computationally costly
approximate Bayesian computations (ABC) are used to approximate the
likelihood. We illustrate the proposed algorithm for parameter inference in
a stochastic volatility model with α-stable returns using real-world data.
• In Paper D, we develop a novel algorithm for input design in nonlinear SSMs,
which can handle amplitude constraints on the input. The proposed method
makes use of SMC for estimating the expected information matrix. The
algorithm performs well compared with some other methods in the literature
and decreases the variance of the parameter estimates with almost an order
of magnitude.
• In Paper E, we propose two algorithms for parameter inference in ARX
models with Student-t innovations, which include automatic model order selection. These methods make use of reversible jump MCMC (RJ-MCMC)
and the Gibbs sampler together with sparseness priors to estimate the model
order and the parameter vector. We illustrate the use of the proposed algorithm to model real-world EEG data with promising results.
Here, we present an abstract of each paper together with an account of the contribution of the author of this thesis.
Paper A
Paper A of this thesis is an edited version of,
J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis-Hastings
using gradient and Hessian information. Pre-print, 2014b. arXiv:1311.0686v2.
which is a combination and development of the two earlier publications
J. Dahlin, F. Lindsten, and T. B. Schön. Second-order particle MCMC
for Bayesian parameter inference. In Proceedings of the 19th IFAC
World Congress, Cape Town, South Africa, August 2014a. (accepted
for publication).
J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis Hastings
using Langevin dynamics. In Proceedings of the 38th International
Conference on Acoustics, Speech, and Signal Processing (ICASSP),
Vancouver, Canada, May 2013a.
Abstract: PMH allows for Bayesian parameter inference in nonlinear state space
models by combining MCMC and particle filtering. The latter is used to estimate the intractable likelihood. In its original formulation, PMH makes use of a
marginal MCMC proposal for the parameters, typically a Gaussian random walk.
However, this can lead to a poor exploration of the parameter space and an inefficient use of the generated particles.
We propose two alternative versions of PMH that incorporate gradient and Hessian information about the posterior into the proposal. This information is more
or less obtained as a byproduct of the likelihood estimation. Indeed, we show
how to estimate the required information using a fixed-lag particle smoother, with
a computational cost growing linearly in the number of particles. We conclude
that the proposed methods can: (i) decrease the length of the burn-in phase, (ii)
increase the mixing of the Markov chain at the stationary phase, and (iii) make
the proposal distribution scale invariant, which simplifies tuning.
Contributions and background: The author of this thesis contributed with the
majority of the work including the design, the implementation, the numerical illustrations and the written presentation.
Paper B
Paper B of this thesis is an edited version of,
J. Dahlin and F. Lindsten. Particle filter-based Gaussian process optimisation for parameter inference. In Proceedings of the 19th IFAC
World Congress, Cape Town, South Africa, August 2014. (accepted for
publication).
Abstract: We propose a novel method for maximum-likelihood-based parameter
inference in nonlinear and/or non-Gaussian state space models. The method is an
iterative procedure with three steps. At each iteration a particle filter is used to
estimate the value of the log-likelihood function at the current parameter iterate.
Using these log-likelihood estimates, a surrogate objective function is created by
utilizing a Gaussian process model. Finally, we use a heuristic procedure to obtain
a revised parameter iterate, providing an automatic trade-off between exploration
and exploitation of the surrogate model. The method is profiled on two state space
models with good performance both considering accuracy and computational cost.
Contributions and background: The author of this thesis contributed with the
majority of the work including the design, the implementation, the numerical illustrations and the written presentation.
Paper C
Paper C of this thesis is an edited version of,
J. Dahlin, T. B. Schön, and M. Villani. Approximate inference in state
space models with intractable likelihoods using Gaussian process optimisation. Technical Report LiTH-ISY-R-3075, Department of Electrical Engineering, Linköping University, Linköping, Sweden, April 2014c.
Abstract: We propose a novel method for MAP parameter inference in nonlinear state space models with intractable likelihoods. The method is based on a
combination of BO, SMC and ABC. SMC and ABC are used to approximate the
intractable likelihood by using the similarity between simulated realisations from
the model and the data obtained from the system. The BO algorithm is used for
the MAP parameter estimation given noisy estimates of the log-likelihood.
The proposed parameter inference method is evaluated in three problems using
both synthetic and real-world data. The results are promising, indicating that the
proposed algorithm converges fast and with reasonable accuracy compared with
existing methods.
Contributions and background: The author of this thesis contributed with the
majority of the work including the design, the implementation, the numerical
illustrations and the written presentation. This contribution resulted from the
participation in the course Bayesian learning given by Prof. Mattias Villani at
Linköping University during the autumn of 2013.
Paper D
Paper D of this thesis is an edited version of,
P. E. Valenzuela, J. Dahlin, C. R. Rojas, and T. B. Schön. A graph/particle-based method for experiment design in nonlinear systems. In Proceedings of the 19th IFAC World Congress, Cape Town, South Africa,
August 2014. (accepted for publication).
Abstract: We propose an extended method for experiment design in nonlinear
state space models. The proposed input design technique optimizes a scalar cost
function of the information matrix, by computing the optimal stationary probability mass function (PMF) from which an input sequence is sampled. The feasible
set of the stationary PMF is a polytope, allowing it to be expressed as a convex
combination of its extreme points. The extreme points in the feasible set of PMFs
can be computed using graph theory.
Therefore, the final information matrix can be approximated as a convex combination of the information matrices associated with each extreme point. For nonlinear
SSMs, the information matrices for each extreme point can be computed by using
particle methods. Numerical examples show that the proposed technique can be
successfully employed for experiment design in nonlinear SSMs.
Contributions and background: This is an extension of the work presented in
Valenzuela et al. (2013) and a result of the cooperation with the Department of
Automatic Control at the Royal Institute of Technology (KTH). The author of
this thesis designed and implemented the algorithm for estimating the expected
information matrix and the Monte Carlo method for estimating the optimal input.
The corresponding sections in the paper were also written by the author of this
thesis.
Paper E
Paper E of this thesis is an edited version of,
J. Dahlin, F. Lindsten, T. B. Schön, and A. Wills. Hierarchical Bayesian
ARX models for robust inference. In Proceedings of the 16th IFAC
Symposium on System Identification (SYSID), Brussels, Belgium, July
2012b.
Abstract: Gaussian innovations are the typical choice in most ARX models but using other distributions such as the Student-t could be useful. We demonstrate that
this choice of distribution for the innovations provides an increased robustness to
data anomalies, such as outliers and missing observations. We consider these models in a Bayesian setting and perform inference using numerical procedures based
on MCMC methods. These models include automatic order determination by two
alternative methods, based on a parametric model order and a sparseness prior,
respectively. The methods and the advantage of our choice of innovations are illustrated in three numerical studies using both simulated data and real EEG data.
Contributions and background: The author of this thesis contributed to parts of
the implementation, generated most of the numerical illustrations and wrote
the sections covering the numerical illustrations and the conclusions in the paper.
The EEG data was kindly provided by Eline Borch Petersen and Thomas Lunner
at Eriksholm Research Centre, Oticon A/S, Denmark.
1.3 Publications
Published works of relevance to this thesis are listed below in reverse chronological
order. Items marked with ★ are included in Part II of this thesis.
★ J. Dahlin, T. B. Schön, and M. Villani. Approximate inference in state
space models with intractable likelihoods using Gaussian process optimisation. Technical Report LiTH-ISY-R-3075, Department of Electrical Engineering, Linköping University, Linköping, Sweden, April 2014c.
★ J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis-Hastings
using gradient and Hessian information. Pre-print, 2014b. arXiv:1311.0686v2.
★ J. Dahlin and F. Lindsten. Particle filter-based Gaussian process optimisation for parameter inference. In Proceedings of the 19th IFAC
World Congress, Cape Town, South Africa, August 2014. (accepted for
publication).
J. Dahlin, F. Lindsten, and T. B. Schön. Second-order particle MCMC
for Bayesian parameter inference. In Proceedings of the 19th IFAC
World Congress, Cape Town, South Africa, August 2014a. (accepted
for publication).
★ P. E. Valenzuela, J. Dahlin, C. R. Rojas, and T. B. Schön. A graph/particle-based method for experiment design in nonlinear systems. In Proceedings of the 19th IFAC World Congress, Cape Town, South Africa,
August 2014. (accepted for publication).
J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis Hastings
using Langevin dynamics. In Proceedings of the 38th International
Conference on Acoustics, Speech, and Signal Processing (ICASSP),
Vancouver, Canada, May 2013a.
★ J. Dahlin, F. Lindsten, T. B. Schön, and A. Wills. Hierarchical Bayesian
ARX models for robust inference. In Proceedings of the 16th IFAC
Symposium on System Identification (SYSID), Brussels, Belgium, July
2012b.
Other published works related to but not included in the thesis are:
J. Kronander, J. Dahlin, D. Jönsson, M. Kok, T. B. Schön, and J. Unger.
Real-time Video Based Lighting Using GPU Raytracing. In Proceedings of the 2014 European Signal Processing Conference (EUSIPCO),
Lisbon, Portugal, September 2014a. (submitted, pending review).
J. Kronander, T. B. Schön, and J. Dahlin. Backward sequential Monte
Carlo for marginal smoothing. In Proceedings of the 2014 IEEE Statistical Signal Processing Workshop (SSP), Gold Coast, Australia, July
2014b. (accepted for publication).
D. Hultqvist, J. Roll, F. Svensson, J. Dahlin, and T. B. Schön. Detection and positioning of overtaking vehicles using 1D optical flow. In
Proceedings of the IEEE Intelligent Vehicles (IV) Symposium, Dearborn, MI, USA, June 2014. (accepted for publication).
J. Dahlin and P. Svenson. Ensemble approaches for improving community detection methods. Pre-print, 2013. arXiv:1309.0242v1.
J. Dahlin, F. Lindsten, and T. B. Schön. Inference in Gaussian models
with missing data using Equalisation Maximisation. Pre-print, 2013b.
arXiv:1308.4601v1.
J. Dahlin, F. Johansson, L. Kaati, C. Mårtensson, and P. Svenson.
A Method for Community Detection in Uncertain Networks. In Proceedings of International Symposium on Foundation of Open Source
Intelligence and Security Informatics 2012, Istanbul, Turkey, August
2012a.
J. Dahlin and P. Svenson. A Method for Community Detection in
Uncertain Networks. In Proceedings of 2011 European Intelligence
and Security Informatics Conference, Athens, Greece, August 2011.
2 Nonlinear state space models and statistical inference
In this chapter, we introduce the SSM and give some motivating examples of
different applications in which the model is used. We also review ML inference
and Bayesian inference in connection with SSMs. Interested readers are referred
to Douc et al. (2014), Cappé et al. (2005), Ljung (1999), Shumway and Stoffer
(2010) and Brockwell and Davis (2002) for more detailed accounts of the topics
covered here.
2.1 State space models and inference problems
An SSM or hidden Markov model (HMM) consists of a pair of discrete-time stochastic processes¹ $x_{0:T} \triangleq \{x_t\}_{t=0}^{T}$ and $y_{1:T} \triangleq \{y_t\}_{t=1}^{T}$. Here, $x_t \in \mathcal{X} \subseteq \mathbb{R}^n$ denotes the latent state and $y_t \in \mathcal{Y} \subseteq \mathbb{R}^m$ denotes the observation obtained from the system at time t.
The latent state is modelled as a Markov chain with initial state x0 ∼ µ(x0 ) and the
transition kernel fθ (xt+1 |xt , ut ). Furthermore, we assume that the observations are
mutually independent given the latent states and have the conditional observation
density gθ (yt |xt , ut ). In both kernels, θ ∈ Θ ⊆ Rd denotes the static parameter
vector of the Markov kernel and ut denotes a known input to the system. With
these definitions, we can write the SSM in the compact form

$$x_0 \sim \mu(x_0), \tag{2.1a}$$
$$x_{t+1} \mid x_t \sim f_\theta(x_{t+1} \mid x_t, u_t), \tag{2.1b}$$
$$y_t \mid x_t \sim g_\theta(y_t \mid x_t, u_t), \tag{2.1c}$$
¹ In this thesis, we do not make any distinction in notation between a random variable and its realisation. This is done to ease the notation.
Figure 2.1: Graphical model of an SSM with latent process (red) and observed
process (blue).
which we make use of in this thesis. This is a fairly general class of models that can be used to describe both nonlinear and non-Gaussian systems.

Another popular description of stochastic models like SSMs is as a graphical model (Murphy, 2012; Bishop, 2006). The corresponding graphical representation of an SSM is depicted in Figure 2.1, where the latent state process is presented in red and the observed process in blue. From this graphical model, we see that the state xt depends only on the previous state xt−1, due to the Markov property inherent in the model. That is, all the information about the past is summarised in the state at time t − 1. Also, we see that the observations are mutually independent given the states, as there are no arrows directly between two observations.
In this thesis, we are interested in two different inference problems connected to
SSMs: (i) the state inference problem and (ii) the parameter inference problem.
The first problem is to infer the density of the latent state process given the
observations and the model. If we are interested in the current state given all the
observations up until now, we would like to estimate the marginal filtering density
pθ (xt |y1:t ). We return to this and other related problems in Chapter 3. There, we
define the problem in mathematical terms and present numerical methods designed
to approximate the filtering and smoothing densities.
The second problem is to infer the values of the parameter θ given the set of observations y1:T and the model structure encoded by the Markov transition kernels
fθ (xt+1 |xt ) and gθ (yt |xt ). It turns out that we have to solve the state inference
problem as a part of the parameter inference problem. We later return to the
mathematical formulation of this problem in the ML setting in Section 2.3 and in
the Bayesian setting in Section 2.4. These problems are analytically intractable
and cannot be solved in closed form. Therefore, we present computational methods based on sampling for parameter inference in Chapter 4.
2.2 Some motivating examples
SSMs have been successfully applied in various areas for modelling dynamical systems. In this section, we give three examples from different research fields and connect them with the inference problems discussed in the previous section. The first model is taken from finance, where we would like to model the real-valued latent volatility given some stock or exchange rate data. In the second model, we would like to make predictions of the annual number of major earthquakes. The third model is taken from meteorology, where we would like to predict the probability of rainfall during the coming days. However, we start with the well-known linear Gaussian state space (LGSS) model, which we make use of as a benchmark throughout the thesis.
2.2.1 Linear Gaussian model
Consider the scalar LGSS model²,

$$x_{t+1} \mid x_t \sim \mathcal{N}\left(x_{t+1};\, \phi x_t + \gamma u_t,\, \sigma_v^2\right), \tag{2.2a}$$
$$y_t \mid x_t \sim \mathcal{N}\left(y_t;\, x_t,\, \sigma_e^2\right), \tag{2.2b}$$
where the parameter vector is θ = {φ, γ, σv , σe }. Here, φ describes the persistence
of the state and {σv , σe } controls the noise levels. In this model, we have added
an optional input u1:T to the system, which is scaled by the parameter γ. Here,
we require that φ ∈ (−1, 1) ⊂ R to obtain a stable system and that {σv , σe } ∈ R2+
as they correspond to standard deviations.
The state inference problem can be solved exactly for this model using Kalman
filters and smoothers (Kailath et al., 2000). This is a result of the model being linear and only including Gaussian kernels. Due to this property, we make use of
the model as a benchmark problem for some of the algorithms reviewed in this
thesis. Comparing the methods that we develop for the nonlinear SSMs with the
LGSS model can reveal important properties of the algorithms and help with insights about how to calibrate them. The Kalman methods can also be used to
estimate the log-likelihood, the score function and the information matrix, which
are important quantities in the ML parameter inference problem discussed in Section 2.3.
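To make this concrete, the following is a minimal sketch (our own illustration, not the implementation used in the thesis) of a scalar Kalman filter for the LGSS model (2.2). It returns the filtered state means and the log-likelihood (2.9), accumulated from the Gaussian one-step-ahead predictors; the initialisation x0 = 0, P0 = 1 is an arbitrary choice.

```python
import numpy as np
from scipy.stats import norm

def kalman_filter_lgss(y, u, phi, gamma, sigma_v, sigma_e, x0=0.0, P0=1.0):
    """Scalar Kalman filter for the LGSS model (2.2).

    Returns the filtered state means E[x_t | y_{1:t}] and the
    log-likelihood (2.9), accumulated from the Gaussian
    one-step-ahead predictors p(y_t | y_{1:t-1})."""
    T = len(y)
    x_hat = np.zeros(T)
    x_pred, P_pred = x0, P0          # predictive mean and variance
    log_lik = 0.0
    for t in range(T):
        # Measurement update: y_t = x_t + e_t.
        S = P_pred + sigma_e**2      # innovation variance
        log_lik += norm.logpdf(y[t], loc=x_pred, scale=np.sqrt(S))
        K = P_pred / S               # Kalman gain
        x_filt = x_pred + K * (y[t] - x_pred)
        P_filt = (1.0 - K) * P_pred
        x_hat[t] = x_filt
        # Time update: x_{t+1} = phi * x_t + gamma * u_t + v_t.
        x_pred = phi * x_filt + gamma * u[t]
        P_pred = phi**2 * P_filt + sigma_v**2
    return x_hat, log_lik
```

Sweeping a routine like this over a grid of parameter values gives the kind of log-likelihood profiles used later in Example 2.6.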
2.2.2 Volatility models in econometrics and finance
Nonlinear SSMs are often encountered in econometric and financial problems,
where we would like to e.g. model the variations in the log-returns of a stock
or an index. The log-returns are calculated by yt = log(st /st−1 ), where st denotes
the price of some financial asset at time t. The variations in the log-returns can
be seen as the instantaneous standard deviation and is referred to as the volatility.
2 This model is also known as ARX(1) in noise, where ARX(1) denotes an exogenous autoregressive process of order 1.
The volatility plays an important role in the famous Black-Scholes pricing model
(Black and Scholes, 1973). In this model, the log-returns are assumed to follow
a Brownian motion with independent and identically distributed (IID) increments
distributed according to some Gaussian distribution N (µ, σ 2 ), where σ denotes the
volatility. Therefore, the volatility is an important component when calculating
the price of options and other financial instruments based on the Black-Scholes
model. For a discussion of how the volatility is used for pricing options, see Hull
(2009), Björk (2004) and Glasserman (2004). For a more extensive treatment of
financial time series and alternative inference methods for volatility models, see
Tsay (2005).
In Figure 2.2, we present the closing prices and daily log-returns for the NASDAQ OMX Stockholm 30 Index during a 14 year period. Note the large drops in the closing prices (middle) in the periods around the years 2001, 2008 and 2011, in connection with the most recent financial crises (shocks). At these drops, we see that the log-returns are quite volatile, as they vary much between consecutive days. During other periods, the volatility is quite low and the log-returns do not vary much between consecutive days. These variations in the volatility are referred to as volatility clustering in finance.
Also, from the QQ-plot we see that the log-returns are heavy-tailed and clearly non-Gaussian, with large deviations from the theoretical quantiles in the tails. All these features (and some others) are known as stylized facts (Cont, 2001). In this section, we present three different volatility models that try to capture different aspects of the observed properties of financial data. A complication is that the resulting inference problems become more challenging as we try to capture more and more of the stylized facts. This results in a trade-off between the accuracy of the model and the computational complexity of the resulting inference problems.
A common theme for all the models considered here is that they are based on a (slowly) varying random walk-type model of the volatility. This model can be motivated by the volatility clustering behaviour, i.e. the underlying volatility varies slowly and gives rise to volatility clustering. Furthermore, the log-returns are modelled as a non-stationary white noise process where the variance is determined by the latent volatility. This corresponds quite well with the log-returns presented in the upper part of Figure 2.2. For a more thorough discussion of different volatility models, see Mitra (2011) and Kim et al. (1998).
The first model is the generalised autoregressive conditional heteroskedasticity
(GARCH) model (Bollerslev, 1986), which is a generalisation of the ARCH model
(Engle, 1982). Here, we consider the GARCH(1,1) in noise model given by
$$h_{t+1} \mid x_t, h_t = \alpha + \beta x_t^2 + \gamma h_t, \tag{2.3a}$$
$$x_{t+1} \mid x_t, h_t \sim \mathcal{N}\left(x_{t+1};\, 0,\, h_{t+1}\right), \tag{2.3b}$$
$$y_t \mid x_t \sim \mathcal{N}\left(y_t;\, x_t,\, \tau^2\right), \tag{2.3c}$$
where the parameter vector is θ = {α, β, γ, τ} with the constraints {α, β, γ, τ} ∈ R^4_+ and β + γ ∈ (0, 1) ⊂ R for stability. In this model, the current log-return is taken into account when computing the volatility at the next time step. This construction aims to capture the volatility clustering.

Figure 2.2: Daily log-returns (upper) for the NASDAQ OMX Stockholm 30 Index from 2000-01-04 to 2014-03-14. The daily closing prices (middle), QQ-plot of the log-returns (lower left) and histogram of the log-returns with the kernel density estimate (KDE) (lower right) are also presented.
The second model is the Hull-White stochastic volatility (HWSV) model³ (Hull and White, 1987) given by

$$x_{t+1} \mid x_t \sim \mathcal{N}\left(x_{t+1};\, \mu + \phi(x_t - \mu),\, \sigma^2\right), \tag{2.4a}$$
$$y_t \mid x_t \sim \mathcal{N}\left(y_t;\, 0,\, \beta^2 \exp(x_t)\right), \tag{2.4b}$$

where the parameter vector is θ = {µ, φ, σ, β} with the constraints {µ, σ, β} ∈ R^3_+ and φ ∈ (−1, 1) ⊂ R for stability. There are many variants of the HWSV model that include correlations between the noise sources (called leverage models) and outliers in the form of jump processes. See Chib et al. (2002) and Jacquier et al. (2004) for more information.
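To make the model concrete, the following is a small simulation sketch of (2.4); the function name and the parameter values are our own illustrative choices, not estimates from data.

```python
import numpy as np

def simulate_hwsv(T, mu, phi, sigma, beta, seed=0):
    """Simulate latent log-volatility x and log-returns y from (2.4)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T + 1)
    y = np.zeros(T)
    # Initialise from the stationary distribution of the AR(1) state.
    x[0] = mu + sigma / np.sqrt(1.0 - phi**2) * rng.standard_normal()
    for t in range(1, T + 1):
        x[t] = mu + phi * (x[t - 1] - mu) + sigma * rng.standard_normal()
        y[t - 1] = beta * np.exp(x[t] / 2.0) * rng.standard_normal()
    return x[1:], y

x, y = simulate_hwsv(T=500, mu=0.0, phi=0.98, sigma=0.16, beta=0.7)
```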
One problem with the HWSV model is that the Gaussian observation noise in some cases cannot fully capture the heavy-tailed behaviour found in real-world data. Instead, e_t is often assumed to be simulated from a Student-t distribution, which has heavier tails than the Gaussian distribution. Another modification is to assume that e_t is generated from an α-stable distribution, which can model both large outliers in the data and non-symmetric log-returns. This results in the third model, the stochastic volatility model with symmetric α-stable returns (SVα) (Casarin, 2004) given by

$$x_{t+1} \mid x_t \sim \mathcal{N}\left(x_{t+1};\, \mu + \phi x_t,\, \sigma^2\right), \tag{2.5a}$$
$$y_t \mid x_t \sim \mathcal{A}\left(y_t;\, \alpha,\, 0,\, \exp(x_t/2),\, 0\right), \tag{2.5b}$$

where the parameter vector is θ = {µ, φ, σ, α} with the constraints {µ, σ} ∈ R^2_+, α ∈ (0, 2] \ {1} ⊂ R and φ ∈ (−1, 1) ⊂ R for stability. Here, A(α, 0, 1, 0) denotes a symmetric α-stable distribution⁴ with stability parameter α. For this distribution, we cannot evaluate gθ(yt | xt), which results in problems when inferring the states and the parameter vector of the model. In Paper C, we apply ABCs to solve this problem.
As previously discussed, the inference problem in volatility models is mainly state inference, which requires a model and hence also results in the need for parameter inference. For example, the state estimate can be used as the volatility estimate for option pricing. Also, the parameter vector of the model can be used to analyse whether the log-returns are symmetric and heavy-tailed, or to quantify the persistence of the underlying volatility process.
³ A similar model is used for modelling the glacial varve thickness (the thickness of the clay collected within the glacial) in Shumway and Stoffer (2010). This model is obtained by replacing the noise in (2.4b) with gamma distributed noise, i.e. e_t ∼ G(α⁻¹, α) for some parameter α ∈ R_+.
⁴ See Appendix A of Paper C for a brief summary of α-stable distributions and their properties. For a more detailed presentation, see Nolan (2003).
Figure 2.3: The major earthquakes in the world between 2000 and 2013. The size of the circle is proportional to the relative magnitude of the earthquake.
2.2.3 Earthquake count model in geology
Another application of SSMs is to model the number of major (with magnitude 7 or higher on the Richter scale) earthquakes each year. As a motivation for the model, we consider some real-world data from the Earthquake Data Base System of the U.S. Geological Survey⁵. The data describes the number of major earthquakes around the world between the years 1900 and 2013. In Figure 2.3, we present the locations of the major earthquakes during the period 2000 to 2013. In Figure 2.4, we present the annual number of major earthquakes with some exploratory plots.

From the data, we see that the number of earthquakes is clearly correlated over time, similar to the clustering behaviour found in the volatility models from the previous application. That is, a year with many earthquakes is likely to be followed by another year with high earthquake intensity. The reason for this is quite intuitive, as pointed out in Langrock (2011), since earthquakes are due to stresses in the tectonic plates of the Earth. Therefore, the underlying process which models these stresses should be slowly varying over a time span of years.
Due to the autocorrelation in the number of earthquakes, it is reasonable to assume that the intensity is determined by some underlying latent variable. Therefore, we can model the number of major earthquakes as an SSM. To this end, we use the model⁶ from Zeger (1988) and Chan and Ledolter (1995), which assumes that the number of earthquakes y_t is a Poisson distributed variable (corresponding to a positive integer or count data). Also, we assume that the mean of the Poisson process λ_t follows an AR(1) process,

$$\log(\lambda_t) - \mu = \phi\left(\log(\lambda_{t-1}) - \mu\right) + \sigma_v v_t,$$

where v_t denotes a standard Gaussian random variable. By introducing x_t = log(λ_t) − µ and β = exp(µ), we obtain the SSM

$$x_{t+1} \mid x_t \sim \mathcal{N}\left(x_{t+1};\, \phi x_t,\, \sigma_v^2\right), \tag{2.6a}$$
$$y_t \mid x_t \sim \mathcal{P}\left(y_t;\, \beta \exp(x_t)\right), \tag{2.6b}$$

where the parameter vector is θ = {φ, σv, β} with the constraints φ ∈ (−1, 1) ⊂ R and {σv, β} ∈ R^2_+. Here, P(λ) denotes a Poisson distributed variable with mean
λ. That is, the probability of k ∈ N earthquakes during year t is given by the probability mass function (PMF),

$$\mathbb{P}[N_t = k] = \frac{\exp(-\lambda)\,\lambda^k}{k!}.$$

The inference problem in this type of model could be to determine the underlying intensity of the process (the state). This information could be useful in making predictions of the future number of major earthquakes.
5 This data can be accessed from http://earthquake.usgs.gov/earthquakes/eqarchives/.
6 This type of Poisson count model can also be used to model the number of yearly polio
infections (Zeger, 1988) and the number of transactions per minute of a stock on the market
(Fokianos et al., 2009).
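As an illustration, the model (2.6) is straightforward to simulate; the following sketch uses our own arbitrary parameter values, with T roughly matching the length of the data record.

```python
import numpy as np

def simulate_earthquake_model(T, phi, sigma_v, beta, seed=0):
    """Simulate latent log-intensity x and counts y from the Poisson SSM (2.6)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T + 1)
    y = np.zeros(T, dtype=int)
    for t in range(1, T + 1):
        x[t] = phi * x[t - 1] + sigma_v * rng.standard_normal()
        y[t - 1] = rng.poisson(beta * np.exp(x[t]))   # mean beta * exp(x_t)
    return x[1:], y

# Illustrative (not estimated) parameter values.
x, y = simulate_earthquake_model(T=114, phi=0.9, sigma_v=0.15, beta=17.0)
```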
Figure 2.4: Upper: the number of annual major earthquakes (with magnitude 7 or higher on the Richter scale) during the period between the years 1900 and 2013. Middle: the corresponding histogram with the KDE (blue). Lower: the estimated autocorrelation function (ACF).
2.2.4 Daily rainfall models in meteorology
A common problem in meteorology and weather forecasting is to estimate the probability and amount of rainfall. In practice, this problem is often split into two subproblems, each with its own model. The first model determines the probability of rainfall and the second model determines the amount of rainfall. For more information, see Srikanthan and McMahon (1999) and Woolhiser (1992). Here, we consider the problem of constructing a model to determine the probability of rainfall given historic rainfall data from the region of interest. It is also possible to use user data from the Internet to predict the probability of rainfall in some region. An example of this is to make use of collected data from Twitter; see Naesseth (2012) for more information about this approach.
We first consider some real-world data from the Swedish weather service (SMHI), collected daily at Malmslätt near Linköping during the period between the years 1952 and 2002. The data is presented in Figure 2.5 as the daily probability of rainfall (upper) and the average daily amount of rainfall (middle), calculated per week. We also present the ACF of rainfall, which indicates that there is a correlation between rainy and non-rainy days. Furthermore, the probability of rainfall seems to follow a cyclic behaviour with a high probability during the latter part of the year. The amount of rainfall also varies, with a peak around week 30.
To construct a model of the probability of rainfall, we follow the insights from the previous analysis and review the model proposed by Langrock and Zucchini (2011). To account for the autocorrelation in the rainfall, we make use of a latent process to describe the persistence of the weather. This construction can be used to create a rough model that accounts for the structure of low pressure weather systems, which pass over a period of days. Hence, this can be seen as a short term model of the probability of rainfall. From practical knowledge, we also know that the probability of rainfall is connected with the season of the year. This cyclic behaviour was also seen in the weather data from Malmslätt. Therefore, we assume that there also exists a cyclical part of the latent process and that this, together with the short term persistence, determines the probability of rainfall.

Furthermore, we assume that the output from the model is a binary variable y_t ∼ B(p_t) from a Bernoulli distribution with success probability p_t. That is, the variable assumes the value 1 with probability p_t (rain falls during day t) and the
value 0 with probability 1 − p_t (no rain falls during day t). The final model in SSM form is given by

$$h_t = \sum_{k=1}^{2} \alpha_k \cos\left(\frac{2k\pi t}{365}\right) + \sum_{k=1}^{2} \beta_k \sin\left(\frac{2k\pi t}{365}\right), \tag{2.7a}$$
$$x_{t+1} \mid x_t \sim \mathcal{N}\left(x_{t+1};\, \phi x_t,\, \sigma_v^2\right), \tag{2.7b}$$
$$y_t \sim \mathcal{B}\left(\frac{\exp(\mu + x_t + h_t)}{1 + \exp(\mu + x_t + h_t)}\right), \tag{2.7c}$$

where the parameter vector is θ = {φ, σv, µ, α1, α2, β1, β2} with the constraints φ ∈ (−1, 1) ⊂ R and σv ∈ R_+.

Figure 2.5: The daily probability of rainfall (upper), the daily amount of rainfall (middle) and the ACF of rainfall (lower). The values are calculated as weekly averages of daily data from Malmslätt during the period between the years 1952 and 2002. The data is provided by the Swedish weather service (SMHI) and is used under the creative commons license.

The inference problem in this model could be to determine the probability of rainfall (estimate p_t) given the data. It could also be interesting to determine the strength of the persistence in the system, determined by φ, or the strength of the seasonal components, determined by {α1, α2, β1, β2}.
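The model (2.7) can also be simulated directly, which is useful for checking an implementation before fitting real data. The following is a minimal sketch with illustrative parameter values of our own choosing.

```python
import numpy as np

def simulate_rainfall_model(T, phi, sigma_v, mu, alpha, beta, seed=0):
    """Simulate rain indicators y from the Bernoulli SSM (2.7).

    alpha and beta are length-2 sequences holding {alpha_1, alpha_2}
    and {beta_1, beta_2} in the seasonal component (2.7a)."""
    rng = np.random.default_rng(seed)
    t_grid = np.arange(1, T + 1)
    # Deterministic seasonal component h_t in (2.7a).
    h = sum(alpha[k] * np.cos(2 * (k + 1) * np.pi * t_grid / 365)
            + beta[k] * np.sin(2 * (k + 1) * np.pi * t_grid / 365)
            for k in range(2))
    x = np.zeros(T + 1)
    y = np.zeros(T, dtype=int)
    for t in range(1, T + 1):
        x[t] = phi * x[t - 1] + sigma_v * rng.standard_normal()
        p = 1.0 / (1.0 + np.exp(-(mu + x[t] + h[t - 1])))   # logistic link (2.7c)
        y[t - 1] = rng.binomial(1, p)
    return h, x[1:], y

h, x, y = simulate_rainfall_model(T=3 * 365, phi=0.8, sigma_v=0.4,
                                  mu=-0.3, alpha=(0.2, 0.1), beta=(0.3, 0.1))
```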
2.3 Maximum likelihood parameter inference
In this section, we present the fundamentals of ML based parameter inference in
SSMs. This presentation is mainly included to set the notation and to highlight
some features of the method that are needed in the sequel. An accessible general
introduction to ML inference is given by Casella and Berger (2001). For more
extensive treatments, see Rao (1965) and Lehmann and Casella (1998).
Inference in the ML paradigm is focused on optimising the likelihood function
L(θ) = pθ (y1:T ), which also appears in Bayesian inference. The likelihood encodes
the information contained with the observations y1:T into a quantity that can be
used for inference.
2.1 Definition (Likelihood function for an SSM). The likelihood (function) of an SSM can be expressed as the decomposition

$$L(\theta) = p_\theta(y_{1:T}) = p_\theta(y_1) \prod_{t=2}^{T} p_\theta(y_t \mid y_{1:t-1}), \tag{2.8}$$

where pθ(yt | y1:t−1) denotes the one-step-ahead predictor.
It is common to replace the likelihood function with the log-likelihood (function)
in many inference problems. This is done to simplify analytical calculations and
improve the numerical stability of many algorithms. The log-likelihood is given
by

$$\ell(\theta) = \log p_\theta(y_{1:T}) = \log p_\theta(y_1) + \sum_{t=2}^{T} \log p_\theta(y_t \mid y_{1:t-1}). \tag{2.9}$$
In general, there are two different interpretations of the likelihood in statistics.
The first is that the likelihood is a function of the data for a fixed parameter θ.
A common name for this distribution is the sampling distribution and it plays
an important role when calculating the distribution of some sampled data. The
second interpretation (which we adopt in this thesis) is that the likelihood is
a function of the parameter. That is, the data is fixed and the likelihood therefore summarises the information in the data.
The ML parameter inference problem is formulated as a maximisation problem
of the likelihood or, equivalently, the log-likelihood. This follows since the logarithm is a monotone function, and hence any maximiser of the likelihood is also a maximiser of the log-likelihood.
2.2 Definition (Maximum likelihood parameter inference problem). The parameter inference problem in the ML setting is given by

$$\widehat{\theta}_{\text{ML}} = \operatorname*{argmax}_{\theta \in \Theta} L(\theta) = \operatorname*{argmax}_{\theta \in \Theta} \ell(\theta), \tag{2.10}$$

where $\widehat{\theta}_{\text{ML}}$ denotes the ML parameter estimate.
The interpretation of (2.10) is that we should select the parameter that, together with the model, is the most likely to have generated the observations. This definition makes good intuitive sense, but as we shall see in the following section it is not the only method for estimating the parameter given the data.
We continue by discussing some useful quantities connected with the log-likelihood
and will then return to discuss the properties of the estimator in (2.10). The gradient of the log-likelihood is referred to as the score function and the negative Hessian of the log-likelihood is referred to as the observed information matrix. These
quantities are useful in many optimisation algorithms as they are the first order
and second order information about the objective function (the log-likelihood) in
(2.10), respectively. Also, this information can be used to build efficient proposals
in some sampling algorithms, see Paper A.
2.3 Definition (Score function). The score function is defined as the gradient of
the log-likelihood,
$$\mathcal{S}(\theta') = \nabla \ell(\theta)\big|_{\theta = \theta'}, \tag{2.11}$$
where the gradient is taken with respect to the parameter vector.
The score function has a natural interpretation as the slope of the log-likelihood.
Hence, the score function is zero when evaluated at the true parameter vector,
S(θ⋆) = 0. However, note that this is not necessarily true when we work with finite data samples, as is discussed in Example 2.6.
2.4 Definition (Observed information matrix). The observed information matrix
is defined as the negative Hessian of the log-likelihood,
$$\mathcal{J}(\theta') = -\nabla^2 \ell(\theta)\big|_{\theta = \theta'}, \tag{2.12}$$
where the Hessian is taken with respect to the parameter vector.
The statistical interpretation of the observed information matrix is as a measure
of the amount of information in the data regarding the parameter θ. That is, if the
data is informative the resulting information matrix is large (according to some
measure). Also, the information matrix can geometrically be seen as the negative
curvature of the log-likelihood. As such, we expect it to be positive definite (PD)
at the ML parameter estimate (cf. the second-derivative test in basic calculus).
Finally, we note that there exists a limiting behaviour for the observed information
matrix, which approaches the so called expected information matrix as the number
of data points tends to infinity.
2.5 Definition (Expected information matrix). The expected information matrix (or the Fisher information matrix) is defined by the expected value of the observed information matrix (2.12),

$$\mathcal{I}(\theta') = -\mathbb{E}_{y_{1:T}}\left[\nabla^2 \ell(\theta)\big|_{\theta = \theta'}\right] = \mathbb{E}_{y_{1:T}}\left[\left(\nabla \ell(\theta)\big|_{\theta = \theta'}\right)^2\right], \tag{2.13}$$

which is evaluated with respect to the data record.
Note that the expected information matrix is independent of the data realisation,
whereas the observed information is dependent on the realisation. The expected
information matrix is PD for all values of θ as it can be seen as the variance of the
score function. We make use of this property in Section 4.2.2 and in Paper A to
construct a random walk on a Riemann manifold using the information matrices.
We conclude the discussion on the score function and the information matrix by
Example 2.6, where we investigate the defined quantities for an LGSS model as a
function of the parameter φ.
2.6 Example: Score and information matrix in the LGSS model
Consider the LGSS model (2.2) with the parameter vector θ⋆ = {0.5, 0, 1, 0.1}, from which we generate a realisation of length T = 250 using the initial value x0 = 0. We fix {σv, σe} at their true values and create a grid over φ. For each grid point, we calculate the log-likelihood, the score function and the expected information matrix. The results are presented in Figure 2.6.

We note that the log-likelihood has a distinct maximum near the true parameter (presented as dotted vertical lines) and that the score function is zero close to this point. The small difference in the score function is due to the finite amount of data. Finally, the zero of the score function corresponds to a maximum of the log-likelihood function, as the expected information (negative Hessian) is positive.
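A numerical recreation of Example 2.6 is straightforward; the sketch below is our own illustration, with the score approximated by finite differences on the grid rather than computed exactly.

```python
import numpy as np

def lgss_loglik(y, phi, sigma_v=1.0, sigma_e=0.1):
    """Exact log-likelihood of the scalar LGSS model (2.2) via the
    Kalman filter, with gamma = 0 and {sigma_v, sigma_e} held fixed."""
    x_pred, P_pred, ll = 0.0, 1.0, 0.0
    for t in range(len(y)):
        S = P_pred + sigma_e**2
        ll += -0.5 * (np.log(2.0 * np.pi * S) + (y[t] - x_pred)**2 / S)
        K = P_pred / S
        x_filt = x_pred + K * (y[t] - x_pred)
        P_filt = (1.0 - K) * P_pred
        x_pred, P_pred = phi * x_filt, phi**2 * P_filt + sigma_v**2
    return ll

# Simulate a realisation as in Example 2.6 and sweep a grid over phi.
rng = np.random.default_rng(0)
T, phi_true = 250, 0.5
x, y = 0.0, np.zeros(T)
for t in range(T):
    x = phi_true * x + rng.standard_normal()   # sigma_v = 1
    y[t] = x + 0.1 * rng.standard_normal()     # sigma_e = 0.1

phi_grid = np.linspace(0.0, 0.99, 100)
ll = np.array([lgss_loglik(y, p) for p in phi_grid])
score = np.gradient(ll, phi_grid)              # finite-difference score (2.11)
print(phi_grid[np.argmax(ll)])                 # grid-based ML estimate of phi
```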
With the definition of the expected information matrix in place, we now return
to discussing the properties of the ML parameter estimate. The ML estimator
obtained from (2.10) has a number of strong asymptotic properties, i.e. when
the number of observations tends to infinity. It is possible to show that this
estimator is consistent, asymptotically normal and efficient under some regularity
conditions. These conditions include that the parameter space is compact and that
the likelihood, score function and information matrix exist and are well-behaved.
The ML estimator is said to be consistent as it fulfils

$$\widehat{\theta}_{\text{ML}} \xrightarrow{\text{a.s.}} \theta^\star, \qquad T \to \infty.$$
That is, the estimate almost surely converges to the true value of the parameter in
the limit of infinite data. Furthermore, as the estimator is asymptotically normal,
we have that the error in the estimate satisfies a CLT given by
$$\sqrt{T}\left(\widehat{\theta}_{\text{ML}} - \theta^\star\right) \xrightarrow{d} \mathcal{N}\left(0,\, \mathcal{I}^{-1}(\theta^\star)\right),$$

which follows from a Taylor expansion of the log-likelihood around the point θ = θ⋆. Note that the expected information matrix enters into the expression and limits the accuracy of the estimate.

Figure 2.6: The estimates of the log-likelihood function (upper), the score function (lower left) and the expected information matrix (lower right) of the LGSS model in Example 2.6. The dotted lines indicate the true parameter and the zero-level.
Therefore, a natural question is whether we can somehow change the size of this matrix
to decrease the lower-bound on the variance of the estimate. This is a key problem
in the field of input design, where an input is added to the SSM to maximise some
scalar function of the expected information matrix. This problem is discussed for
nonlinear SSMs in Example 4.10 and Paper D.
Lastly, we say that an estimator is efficient if it attains the Cramér-Rao lower bound, which means that no other unbiased estimator has a lower mean square error (MSE). That is, the ML estimator is the best unbiased estimator in the MSE sense and there are no better unbiased estimators. This last property is appealing and one might be tempted to say that this estimator is the best choice for parameter inference. However, this result is only valid when we have an infinite number of samples. Therefore, other estimators (e.g. the Bayes estimators discussed in the next section) could have better properties in the finite sample regime. Also, there can exist biased estimators with lower MSE than the ML estimator.
We have now introduced the ML parameter inference problem and some of the properties of the resulting estimate. In practice, we usually encounter a number of problems when we try to solve the optimisation problem in (2.10). These include, for nonlinear SSMs, that: (i) the log-likelihood, the score and the information matrix are all intractable, (ii) the optimisation problem cannot be solved in closed form and (iii) there could exist many local maxima of the log-likelihood, making numerical optimisation problematic.

To solve these problems, a number of different numerical approaches have been developed. We return to the use of these approaches for estimating the log-likelihood, the score and the information matrix in Chapter 3. We also discuss other numerical methods for solving the ML parameter inference problem in Chapter 4.
2.4 Bayesian parameter inference
Previously, in the ML paradigm, we implicitly assumed that the true parameter is a specific value. In this section, we instead assume that the true parameter is distributed according to some probability distribution. From this assumption, we can construct a different statistical paradigm called Bayesian inference, which can be used for state and parameter inference in nonlinear SSMs.
As in the previous section, we only briefly discuss Bayesian inference for SSMs to
set the notation and highlight some important features that we make use of in this
thesis. For general treatments of Bayesian analysis, see Robert (2007) and Berger
(1985). Furthermore, two accessible introductions are Gelman et al. (2013) and
Casella and Berger (2001).
In the Bayesian parameter inference problem, we combine the information in the
data, described by the likelihood pθ(y1:T), with prior information encoded as a probability distribution denoted p(θ). This combination of the prior information and the information in the data is made using Bayes' theorem. This procedure is referred to as the prior-posterior update in Bayesian inference. The result is an updated probability distribution called the posterior distribution, which we denote p(θ|y1:T) in the parameter inference problem. Similar distributions can be defined for the state inference problem and we return to this in Section 3.1.
2.7 Definition (Bayesian parameter inference problem). Given the parameter prior p(θ) and the likelihood pθ(y1:T), we obtain the parameter posterior by Bayes' theorem as

$$p(\theta \mid y_{1:T}) = \frac{p_\theta(y_{1:T})\, p(\theta)}{p(y_{1:T})} \propto p_\theta(y_{1:T})\, p(\theta), \tag{2.14}$$

where p(y1:T) denotes the marginal likelihood (or the evidence). The name is due to the fact that it can be computed by marginalisation,

$$p(y_{1:T}) = \int p_\theta(y_{1:T})\, p(\theta)\, d\theta. \tag{2.15}$$
One of the main questions in Bayesian inference is the choice of the prior and how we may encode our prior beliefs about the data in it. A common view is that this choice is subjective, as it is often made using intuition and previous knowledge, which depend on subjective experiences. In this thesis, we make use of simple priors to encode stability properties of the nonlinear system or to keep the standard deviation positive at all times. For this, we make use of uniform priors over different subsets of the parameter space. An example of this is that we assume a uniform prior over φ ∈ (−1, 1) ⊂ R for the LGSS model (2.2), which can be expressed as p(φ) = U(φ; −1, 1). Note that the use of improper priors can be seen as a link between ML and Bayesian inference, but we shall not discuss this further. Instead, interested readers are referred to Robert (2007) for more information.
Another important class of priors that we make use of in Paper E is conjugate
priors, which enables us to compute closed-form expressions for the posterior. A
conjugate prior has the same functional form as the posterior and depends on the
form of the likelihood. For example, the conjugate prior for the mean (given that
the variance is known) of the Gaussian distribution is again a Gaussian distribution. This results from the fact that the product of two Gaussian distributions
(the likelihood and the prior) is again a Gaussian distribution and hence it is a
conjugate prior. We give another example of a conjugate prior in Example 2.8.
Conjugate priors exist mainly for some combinations of priors and likelihoods in
the exponential family of distributions, see Robert (2007) for a discussion.
Figure 2.7: The posterior (upper) of the parameter σ² in the Gaussian distribution resulting from the combination of an inverse-Gamma prior (lower left) and the data log-likelihood (lower right).
2.8 Example: Prior-posterior update in a conjugate model
Consider the problem of inferring the value of the variance σ² in a Gaussian distribution, given that we know the value of the mean µ. Also, we assume that we have obtained a set of IID data y1:T generated by the model. The conjugate prior for the variance in a Gaussian distribution is the inverse-Gamma distribution with shape α0 and scale β0, denoted IG(α0, β0). By direct calculation, we obtain that the posterior has the parameters

$$\alpha_T = \alpha_0 + T/2, \qquad \beta_T = \beta_0 + \frac{1}{2} \sum_{t=1}^{T} (y_t - \mu)^2.$$

Here, we simulate a data realisation of length T = 200 using the parameters {µ, σ²} = {1.5, 4} and use the prior coefficients {α0, β0} = {1, 4}. The corresponding posterior, prior and log-likelihood are presented in Figure 2.7. We note that the prior assigns a small probability to the variance being equal to 4. However, as the data record is relatively large, the resulting posterior is centred close to the true value of the parameter.
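In code, the conjugate update in Example 2.8 is a couple of lines; the following sketch assumes the shape/scale parameterisation used by scipy.stats.invgamma.

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(0)
mu, sigma2_true, T = 1.5, 4.0, 200
y = rng.normal(mu, np.sqrt(sigma2_true), size=T)

# Prior IG(alpha0, beta0) and its conjugate update for the variance.
alpha0, beta0 = 1.0, 4.0
alphaT = alpha0 + T / 2.0
betaT = beta0 + 0.5 * np.sum((y - mu)**2)

posterior = invgamma(alphaT, scale=betaT)   # p(sigma^2 | y_{1:T})
print(posterior.mean(), posterior.interval(0.95))
```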
The Bayesian parameter inference problem is completely described by (2.14) and everything that we need to know is encoded in the posterior. This follows from the likelihood principle, which states that all information obtained from the data is contained within the likelihood function, see Robert (2007). However, we are sometimes interested in computing point estimates of the parameter vector. To compute point estimates, we can make use of statistical decision theory to make a decision about what information from the posterior to use. Consider a loss function L : Θ × Θ → R_+, which takes the parameter and its estimate as input and returns a real-valued positive loss. The expected posterior loss (or posterior risk) ρ(p(θ), δ | y1:T) is given by

$$\rho\left(p(\theta), \delta \mid y_{1:T}\right) = \int L\left(\theta, \delta(y_{1:T})\right) p(\theta \mid y_{1:T})\, d\theta, \tag{2.16}$$

where δ(y1:T) denotes the decision of the parameter estimate given the data. The Bayes estimator is defined as the minimising argument of the expected posterior loss,

$$\delta^\star(y_{1:T}) = \operatorname*{argmin}_{\delta(y_{1:T}) \in \Theta} \rho\left(p(\theta), \delta \mid y_{1:T}\right).$$

Here, we restrict ourselves to discussing the resulting Bayes estimators for the three different loss functions in Table 2.1, but there are many other possibly suitable choices for our application. From this, we see that the estimate that minimises the quadratic loss function is the expected value of the posterior,

$$\mathbb{E}[\theta] = \int \theta\, p(\theta \mid y_{1:T})\, d\theta. \tag{2.17}$$

That is, the Bayes estimator is given by the posterior mean for this choice of loss function. This is an example of a common expectation operation in Bayesian inference, as the expected value of the parameter is computed with respect to the parameter posterior distribution. Similar expressions can be written for many other Bayes estimators.
             Loss function              Bayes estimator
Linear       L(θ, δ) = |θ − δ|          Posterior median
Quadratic    L(θ, δ) = (θ − δ)²         Posterior mean
0-1          L(θ, δ) = I(θ ≠ δ)         Posterior mode

Table 2.1: Different loss functions and the resulting Bayes estimator.
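Given samples from the posterior (for example, produced by the sampling methods discussed in Chapter 4), the estimators in Table 2.1 can be approximated directly. A minimal sketch with stand-in posterior samples follows; the histogram-based mode is our own simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.gamma(5.0, 0.2, size=10000)     # stand-in posterior samples

post_median = np.median(theta)              # linear loss
post_mean = theta.mean()                    # quadratic loss
counts, edges = np.histogram(theta, bins=100)
k = np.argmax(counts)                       # 0-1 loss (approximate mode)
post_mode = 0.5 * (edges[k] + edges[k + 1])
print(post_median, post_mean, post_mode)
```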
The statistical properties of the Bayes estimator depend on the choice of loss function when we work with finite data. However, it follows (under some conditions) from the Bernstein–von Mises theorem that the influence of the prior diminishes when the amount of data grows. Consequently, the MAP estimator (the posterior mode) converges to the ML estimator in some cases when the amount of data increases. Therefore, it is possible to see ML parameter inference as a special case of Bayesian parameter inference in some sense. Finally, we note that Bayes estimators asymptotically have the same properties as the ML estimator, i.e. they are consistent, asymptotically normal and efficient. Also, as the amount of data tends to infinity, the posterior tends to a Gaussian distribution, which is known as the Bayesian CLT. For more information about the statistical properties of Bayes estimators, see Robert (2007), Lehmann and Casella (1998) and Berger (1985).
We have now introduced the Bayesian paradigm for parameter and state inference in nonlinear SSMs. We have seen that the key quantity in Bayesian inference is the posterior distribution, which depends on the likelihood. Therefore, we in practice encounter the same problems with analytically intractable likelihoods as in the ML paradigm. Another complication is that the posterior might not be described by any known distribution. As a consequence, we cannot compute in closed form many of the integrals depending on the posterior distribution encountered in Bayesian analysis. Examples of such integrals are the normalisation in (2.15), the marginalisation in (2.16) and the expectation in (2.17).

To counter these problems, we can make use of the same numerical methods from Chapter 3 as for the ML paradigm to estimate the likelihood function. To solve the Bayesian parameter inference problem, we can make use of sampling methods to obtain approximations of the posterior distributions of interest. These methods use computer simulations and have therefore only been available for practical use during the last decades. However, during this period these methods have been well-studied and are today established tools for Bayesian inference. Another approach is to make analytical approximations of the posterior distribution using known distributions or iterative procedures. We discuss some of these methods in Chapter 4.
3 State inference using particle methods
In this chapter, we discuss the state inference problem in detail and present some algorithms for approximate state inference. The foundation of state inference in SSMs is given by the Bayesian filtering and smoothing recursions that are discussed in Section 3.1. Unfortunately, these recursions are analytically intractable for most SSMs. Therefore, the remaining part of this chapter is devoted to discussing numerical methods based on SMC for approximate state inference. We also discuss how to make use of SMC methods to estimate the likelihood, the score function and the information matrix in SSMs. The chapter is concluded by discussing the direct illumination problem in computer graphics and how it can be solved using SMC methods.
3.1 Filtering and smoothing recursions
There are mainly two types of state inference problems in an SSM: filtering and
smoothing. Both the filtering and smoothing problem can be marginal (inference
on a single state), k-interval (inference of k states) or joint (inference on all states).
In marginal filtering, only observations y1:t collected until the current time step
t are used to infer the current value of the state xt . In marginal smoothing, all
the collected observations including (possibly) future observations y1:T are used
to infer the value of the current state xt with t ≤ T . In Table 3.1, we summarise
some common filtering and smoothing problems in SSMs.
The solutions to the filtering and smoothing problems are given by the Bayesian
filtering and smoothing recursions (Jazwinski, 1970; Anderson and Moore, 2005).
These (as the names suggest) are based on Bayesian inference similar to the parameter inference problems discussed in Section 2.4. These recursions can be used
Name                                        Density
(Marginal) filtering                        pθ(xt | y1:t)
(Marginal) smoothing (t ≤ T)                pθ(xt | y1:T)
Joint smoothing                             pθ(x0:T | y1:T)
Fixed-interval smoothing (s < t ≤ T)        pθ(xs:t | y1:T)
Fixed-lag smoothing (for lag ∆)             pθ(xt−∆−1:t | y1:t)

Table 3.1: Common filtering and smoothing densities in SSMs.
to iteratively solve the filtering problem for each time t using the two steps

$$p_\theta(x_t \mid y_{1:t}) = \frac{p_\theta(x_t \mid y_{1:t-1})\, g_\theta(y_t \mid x_t)}{p_\theta(y_t \mid y_{1:t-1})}, \tag{3.1a}$$
$$p_\theta(x_t \mid y_{1:t-1}) = \int f_\theta(x_t \mid x_{t-1})\, p_\theta(x_{t-1} \mid y_{1:t-1})\, dx_{t-1}. \tag{3.1b}$$
In the first step, the state estimate at time t is updated with the new observation
yt (also known as the measurement update) using Bayes’ theorem. In the second
step, a marginalisation is used to predict the next state xt using the information
in the current state and the collected observations (also known as the time update). In a similar manner, the smoothing problem can be solved by iterating the
marginalisation

$$p(x_t \mid y_{1:T}) = p(x_t \mid y_{1:t}) \int \frac{f(x_{t+1} \mid x_t)\, p(x_{t+1} \mid y_{1:T})}{p(x_{t+1} \mid y_{1:t})}\, dx_{t+1}, \tag{3.2a}$$
$$p(x_{t+1} \mid y_{1:t}) = \int f(x_{t+1} \mid x_t)\, p(x_t \mid y_{1:t})\, dx_t, \tag{3.2b}$$
backwards in time. Here, the smoothing recursion makes use of the filtering distribution computed in a forward pass. The resulting smoother from this procedure is therefore known as the forward filtering backward smoothing (FFBSm) algorithm. We return to a numerical method based on the FFBSm algorithm for use in SSMs later in this chapter.
The filtering (3.1) and smoothing (3.2) recursions can only be solved analytically
for two different classes of SSMs: linear Gaussian SSMs and SSMs with finite state
processes. For the former, the recursions can be solved by the Kalman filter (KF)
and e.g. the Rauch–Tung–Striebel (RTS) smoother (Rauch et al., 1965), respectively. Using the Kalman methods, we can exactly compute the likelihood and the
states in an LGSS model. We do not discuss the details of the Kalman methods
here and refer readers to Anderson and Moore (2005) and Kailath et al. (2000) for
extensive treatments of Kalman filtering and different Kalman smoothers.
The general intractability of the Bayesian filtering and smoothing recursions results from the lack of closed-form expressions for the densities in the recursions. This is similar to the problems that we encountered in the Bayesian parameter inference problem. Instead, we consider numerical approximations that rely on
Monte Carlo (MC) methods to estimate the filtering and smoothing distributions.
This results in SMC and related methods, which we return to in Section 3.3.
3.2 Monte Carlo and importance sampling
MC methods are a collection of statistical simulation methods based on sampling
and the strong law of large numbers (SLLN). For a general introduction to MC
methods, see Robert and Casella (2004) for an extensive treatment or Ross (2012)
for an accessible introduction. In this thesis, we mainly use MC methods for estimating the expected value (an integral) of an arbitrary well-behaved test function
ϕ(x),

$$\widehat{\varphi} = \mathbb{E}_\pi[\varphi(x)] = \int \varphi(x)\, \pi(x)\, dx, \tag{3.3}$$
where π(x) denotes a (normalised) target distribution from which we can simulate
IID particles (or samples). As a consequence of the SLLN¹, we can estimate the
expected value by the sample average
$$\widehat{\varphi}_{\text{MC}} = \frac{1}{N} \sum_{i=1}^{N} \varphi\big(x^{(i)}\big), \qquad x^{(i)} \sim \pi(x),$$
taken over independent realisations x(i) simulated from π(x). This estimator is
strongly consistent, i.e.
$$\widehat{\varphi}_{\text{MC}} \xrightarrow{\text{a.s.}} \widehat{\varphi}, \qquad N \to \infty.$$
Also, we can construct a CLT for the error of the MC estimator, given by
$$\sqrt{N}\left(\widehat{\varphi} - \widehat{\varphi}_{\text{MC}}\right) \xrightarrow{d} \mathcal{N}\left(0,\, \sigma^2_{\text{MC}}\right), \qquad \sigma^2_{\text{MC}} = \mathbb{V}[\varphi(x)] < \infty,$$
where we assume that the function ϕ(x) has a finite second moment. Here, we
see that the MC estimator is asymptotically unbiased with Gaussian errors. Also,
the variance of the error decreases as 1/N independent of the dimension of the
problem, which is one of the main advantages of the MC methods.
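As a small illustration (our own, not from the thesis), the MC estimator can be checked on a test function with a known expectation, where the error is seen to shrink at the rate 1/√N:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = lambda x: x**2                 # test function; E[x^2] = 1 under N(0, 1)

for N in (100, 10000, 1000000):
    x = rng.standard_normal(N)       # IID samples from the target pi(x)
    print(N, np.mean(phi(x)))        # MC estimate approaches 1 as N grows
```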
Consider the problem of estimating (3.3) when the normalised target π(x) = γ(x)Z⁻¹ is unknown together with the normalisation factor Z. Instead, we can only evaluate the unnormalised target γ(x) point-wise. In this case, importance sampling (IS) (Marshall, 1956) can be used to sample from another distribution called the proposal distribution (or importance distribution) q(x) and adapt the particles using a weighting scheme. For this, it is required that q(x) ≠ 0 for almost all x ∈ supp(π) and that the support of ϕ(x)q(x) contains the support of π(x), i.e. supp(π) ⊂ supp(ϕq). The IS algorithm follows by rewriting the expected value in
¹ The SLLN states that the sample average x̄, computed using N IID samples from π(x), converges almost surely to µ = E_π[x] when N tends to ∞, i.e. $\bar{x} \xrightarrow{\text{a.s.}} \mu$.
Algorithm 1 Importance sampling (IS)
Inputs: γ(x) (unnormalised target), q(x) (proposal) and N > 0 (no. particles).
Output: p̂(dx) (empirical distribution).
1: for i = 1 to N do
2:   Sample a particle by $x^{(i)} \sim q(x)$.
3:   Compute the weight by $w^{(i)} = \gamma(x^{(i)}) / q(x^{(i)})$.
4: end for
5: Normalise the particle weights by (3.5).
6: Estimate p̂(dx) using (3.7).
(3.3) to obtain

$$\mathbb{E}_\pi[\varphi(x)] = \int \varphi(x)\, \pi(x)\, dx = \int \varphi(x) \underbrace{\frac{\gamma(x)}{q(x)}}_{\triangleq\, w(x)}\, q(x)\, dx = \mathbb{E}_q[w(x)\varphi(x)],$$
where w(x) denotes the (unnormalised) importance weights. The IS estimator
follows in analogy with the vanilla MC estimator,
$$\widehat{\varphi}_{\text{IS}} = \sum_{i=1}^{N} \widetilde{w}^{(i)} \varphi\big(x^{(i)}\big), \qquad x^{(i)} \sim q(x), \tag{3.4}$$
where the normalised weights are given by
$$\widetilde{w}^{(i)} \triangleq \widetilde{w}\big(x^{(i)}\big) = \frac{w^{(i)}}{\sum_{k=1}^{N} w^{(k)}}, \qquad i = 1, \ldots, N. \tag{3.5}$$
We can also construct an unbiased estimator for the normalisation constant by
$$\widehat{Z}_{\text{IS}} = \frac{1}{N} \sum_{i=1}^{N} w^{(i)}, \tag{3.6}$$
which we later shall make use of to estimate the (marginal) likelihood in SSMs.
The IS algorithm can be interpreted as generating an empirical distribution, which
can be seen by rewriting the estimator in (3.4) as a Dirac mixture given by
$$\widehat{p}(dx) = \sum_{i=1}^{N} \widetilde{w}^{(i)} \delta_{x^{(i)}}(dx), \tag{3.7}$$
which we can insert into (3.3) to recover (3.4) as presented in Algorithm 1. Here,
δz (dx) denotes a Dirac point mass located at x = z. In the following, we make
use of this property to compute expectations in state inference problems using
sequential variants of the IS algorithm.
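Algorithm 1 in code form: a sketch of self-normalised IS for a generic unnormalised target, with weights computed in log-space for numerical stability. The toy target and proposal are our own; for γ(x) = exp(−x²/2) the true normalisation constant is √(2π) ≈ 2.507.

```python
import numpy as np
from scipy.stats import norm

def importance_sampling(log_gamma, proposal, N, rng):
    """Self-normalised IS (Algorithm 1): returns particles, normalised
    weights (3.5) and the estimate of the normalising constant (3.6)."""
    x = proposal.rvs(size=N, random_state=rng)
    log_w = log_gamma(x) - proposal.logpdf(x)    # unnormalised log-weights
    w = np.exp(log_w - np.max(log_w))            # stabilise before normalising
    w_tilde = w / np.sum(w)
    Z_hat = np.mean(np.exp(log_w))               # (3.6)
    return x, w_tilde, Z_hat

# Toy example: gamma(x) = exp(-x^2 / 2), proposal N(0, 2^2).
rng = np.random.default_rng(0)
x, w, Z = importance_sampling(lambda x: -0.5 * x**2, norm(0, 2), 5000, rng)
print(np.sum(w * x), Z)   # estimate of E[x] (3.4), roughly 0, and Z ~ 2.507
```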
We conclude the discussion of the IS algorithm with Example 3.1, in which we estimate the parameters of the HWSV model using some real-world data. Note that the IS algorithm can be developed further to include multiple and adaptive
proposals. In multiple importance sampling (MIS), the algorithm can make use of
several proposal algorithms that are then combined to form the estimate. In adaptive importance sampling (AIS), we update a mixture of proposals or the proposal
after each iteration to fit the target better. These methods could be interesting
in developing new approximate state inference algorithms similar to the particle
methods discussed in the next section. For more information, see Kronander and
Schön (2014), Cornuet et al. (2011) and Veach and Guibas (1995).
3.1 Example: IS for Bayesian parameter inference in the HWSV model
Consider the Bayesian parameter inference problem in the Hull-White SV model (2.4) and the data presented in Section 2.2.2. In this model, we would like to estimate the parameters given the data by the use of the IS algorithm. The unnormalised target distribution is given by

$$\gamma(\theta) = p_\theta(y_{1:T})\, p(\theta),$$

where we assume that we can evaluate the likelihood point-wise. Note that the solution to this problem is analytically intractable, but it can be approximated using methods discussed later in Section 3.3.4. Furthermore, we assume uniform priors over φ ∈ (−1, 1) ⊂ R and {σv, β} ∈ R^2_+. The parameter proposals are given by

$$q(\phi) = \mathcal{U}(\phi; -1, 1), \qquad q(\sigma_v) = \mathcal{G}(\sigma_v; 2, 0.1), \qquad q(\beta) = \mathcal{G}(\beta; 7, 0.1).$$

These proposals are based on prior information regarding the typical values of the parameters in the model. Hence, they can be seen as a kind of prior distribution in the Bayesian inference. For the IS algorithm, we use N = 200 samples and the procedure outlined in Algorithm 1.
In Figure 3.1, we present the resulting posterior distributions together with the proposal distributions. The parameter estimates obtained by the posterior mode are θ̂_IS = {0.996, 0.129, 0.837}, which indicates a slowly varying latent volatility process. In Chapter 4, we discuss other methods for parameter inference that are more efficient when the number of parameters increases. However, for small problems the IS algorithm could be a useful alternative in cases where the prior information is good. Otherwise, the number of samples required from the parameter posterior increases exponentially with the number of elements in the parameter vector.
3.3 Particle filtering
The IS algorithm discussed in the previous section can be combined with the
filtering recursions in Section 3.1 to sequentially estimate the joint smoothing
distribution π(x0:t ) = pθ (x0:t |y1:t ). To this end, we can apply Algorithm 1 to
sequentially approximate the target using an empirical distribution. The resulting
algorithm is known as the sequential importance sampling (SIS) algorithm.

Figure 3.1: The estimated posterior distributions for φ (green), σv (red) and β (orange), computed as KDEs, with the proposal distributions presented in blue. The grey dots indicate the samples obtained from the proposal and the dotted lines indicate the posterior means.

The
main problem with this algorithm is that the variance of the estimate increases rapidly as t increases. For a long time, this was a major obstacle in approximate state inference. It results from the particle weights deteriorating over time, so that in the limit only a single particle has a non-zero weight after a few iterations. Hence, the effective number of particles is one, which is the reason for the high variance in the estimates.
However, by including a resampling step in the SIS algorithm, we can focus the attention of the algorithm on the areas of interest. The resampling step essentially duplicates particles with large weights and discards particles with small weights, while keeping the total number of particles fixed. Hence, we do not end up with only one effective particle in the estimator. This development of the algorithm leads to the SIS with resampling (SIR) algorithm. When this method is applied to SSMs, we obtain the basic particle filtering algorithm introduced by Gordon et al. (1993). In this thesis, we refer to this algorithm as the bootstrap particle filter (bPF).
In subsequent developments, the particle filter has been generalised to include more advanced proposals and resampling schemes. In this section, we present a refined version of the bPF called the auxiliary particle filter (APF) (Pitt and Shephard, 1999), which can use more general particle proposals and weighting functions than the original formulation. This can result in a large decrease in the variance of the estimate and also a decrease in the number of particles required to achieve a certain accuracy. The APF also allows for the use of resampling schemes other than the multinomial resampling used in the original formulation of the bPF.

To keep the presentation brief, we do not derive the APF and refer interested readers to Doucet and Johansen (2011) for a derivation of the APF starting from the IS algorithm. Furthermore, we note that the APF is a member of the more general family of SMC algorithms, which are discussed in Cappé et al. (2007) and Del Moral et al. (2006).
3.3.1 The auxiliary particle filter
The APF algorithm operates by constructing a particle system $\{w_t^{(i)}, x_t^{(i)}\}_{i=1}^{N}$ sequentially over time. By the use of this particle system, we can construct an empirical marginal filtering distribution in analogy with (3.7) as

$$\widehat{p}_\theta(dx_t \mid y_{1:t}) \triangleq \sum_{i=1}^{N} \widetilde{w}_t^{(i)} \delta_{x_t^{(i)}}(dx_t), \tag{3.8a}$$
$$\widetilde{w}_t^{(i)} \triangleq \frac{w_t^{(i)}}{\sum_{k=1}^{N} w_t^{(k)}}, \tag{3.8b}$$

which can be seen as a discrete approximation of the filtering distribution using a collection of weighted Dirac point masses. Here, $x_t^{(i)}$ and $w_t^{(i)}$ denote particle i at time t and its corresponding (unnormalised) importance weight. Also, the empirical joint smoothing distribution follows similarly as

$$\widehat{p}_\theta(dx_{0:t} \mid y_{1:t}) \triangleq \sum_{i=1}^{N} \widetilde{w}_t^{(i)} \delta_{x_{0:t}^{(i)}}(dx_{0:t}). \tag{3.9}$$
Three steps are carried out during each iteration of the APF to update the particle
system from time t − 1 to t:
(i) The particles are resampled according to their auxiliary weights. This means that particles with small weights are discarded and particles with large weights are multiplied. This mitigates the weight depletion problem experienced by the SIS algorithm. This step is often the computational bottleneck of the APF algorithm, as all the other steps can easily be implemented in parallel.
(ii) The particles are propagated from time t − 1 to t. This can be seen as
a simulation step, where new particles are generated by sampling from a
Markov kernel.
(iii) The (unnormalised) particle weight is calculated for each particle. These
weights compare the particle with the recorded output from the system. A
high weight is given to a particle that is likely to generate yt .
After these three steps, we can construct empirical distributions using (3.8) and
(3.9). It is also possible to make use of the particle system for estimating other
quantities. We return to this in the following. Note that, by comparison with
the filtering recursions, Step (ii) corresponds to the time update in (3.1b) and
Step (iii) corresponds to the measurement update in (3.1a). We now proceed to
discuss each step of the APF in more detail.
Step (i) can be seen as a sampling of ancestor indices, denoted $a_t^{(i)}$, i.e. the index of the particle at time t − 1 from which particle i at time t originates. This can be expressed as simulating from a multinomial distribution with probabilities

$$\mathbb{P}\big(a_t^{(i)} = j\big) = \widetilde{\nu}_{t-1}^{(j)}, \qquad j = 1, \ldots, N, \quad i = 1, \ldots, N, \tag{3.10}$$

where $\widetilde{\nu}_t^{(j)}$ denotes a normalised auxiliary weight given by

$$\widetilde{\nu}_{t-1}^{(j)} = \nu_{t-1}^{(j)} \left[\sum_{k=1}^{N} \nu_{t-1}^{(k)}\right]^{-1},$$

for some auxiliary weight function $\nu_{t-1}^{(k)}$ determined by the user. After the resampling step, we obtain the unweighted particle system $\{\widetilde{x}_{t-1}^{(i)}, 1/N\}_{i=1}^{N}$.
In this thesis, we mainly consider two different types of APFs. The bPF is obtained by selecting the auxiliary weights as the particle weights, i.e. νt = wt. The fully-adapted particle filter (faPF) (Pitt and Shephard, 1999) takes into account the (future) observation in the resampling step. This information is included by using the auxiliary weight given by

$$\nu_{t-1} = p_\theta(y_t \mid x_{t-1}) = \int g_\theta(y_t \mid x_t)\, f_\theta(x_t \mid x_{t-1})\, dx_t, \tag{3.11}$$

when it can be computed in closed-form. This is only possible for some systems, e.g. the LGSS model (2.2) and the GARCH(1,1) model (2.3).
In (3.10), we make use of multinomial resampling but there are other resampling
algorithms that can be useful. Systematic resampling and stratified resampling
are good alternatives to the multinomial resampling. These resampling methods
often decrease the variance in the estimates of the filtering distribution. See Douc
and Cappé (2005) and Hol et al. (2006) for comparisons of different resampling
strategies.
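To make the difference between these schemes concrete, the following sketch in Python with NumPy (our own illustration; function names are not taken from any particular software package) implements multinomial and systematic resampling of a set of normalised weights. Systematic resampling uses a single uniform draw to place $N$ stratified points on the cumulative weight function, which typically reduces the resampling variance.

```python
import numpy as np

def multinomial_resampling(weights, rng):
    """Draw N ancestor indices i.i.d. from the normalised weights, cf. (3.10)."""
    N = len(weights)
    return rng.choice(N, size=N, p=weights)

def systematic_resampling(weights, rng):
    """Draw N ancestor indices using one uniform draw and a fixed grid of offsets."""
    N = len(weights)
    u = (rng.uniform() + np.arange(N)) / N          # one random offset, N strata
    return np.searchsorted(np.cumsum(weights), u)   # invert the empirical CDF

rng = np.random.default_rng(0)
w = rng.uniform(size=100); w /= w.sum()             # some normalised weights
print(multinomial_resampling(w, rng)[:5])
print(systematic_resampling(w, rng)[:5])
```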
In Step (ii), each particle is propagated using a propagation kernel as
$$x_t^{(i)} \sim R_\theta\Big(x_t \,\Big|\, x_{0:t-1}^{a_t^{(i)}}, y_t\Big), \qquad i = 1, \ldots, N. \qquad (3.12)$$
The bPF is recovered by selecting the state dynamics as the propagation kernel,
$$R_\theta(x_t \,|\, x_{0:t-1}, y_t) = f_\theta(x_t \,|\, x_{t-1}).$$
The faPF makes use of the current observation when proposing new particles. This results in the propagation kernel
$$R_\theta(x_t \,|\, x_{0:t-1}, y_t) = p_\theta(x_t \,|\, y_t, x_{t-1}) = \frac{p_\theta(y_t \,|\, x_t) \, p_\theta(x_t \,|\, x_{t-1})}{p_\theta(y_t \,|\, x_{t-1})}, \qquad (3.13)$$
when it can be computed in closed form. Each new particle is appended to the particle trajectory by $x_{0:t}^{(i)} = \big\{x_{0:t-1}^{a_t^{(i)}}, x_t^{(i)}\big\}$, for $i = 1, \ldots, N$.
Finally, in Step (iii) each particle is assigned an importance weight by a weighting function $W_\theta(x_t, x_{0:t-1})$, similar to the IS algorithm in (3.4). The resulting weight is given by
$$w_t^{(i)} = W_\theta\Big(x_t^{(i)}, x_{0:t-1}^{a_t^{(i)}}\Big), \qquad i = 1, \ldots, N, \qquad (3.14a)$$
$$W_\theta(x_t, x_{0:t-1}) \triangleq \frac{w_{t-1} \, g_\theta(y_t \,|\, x_t) \, f_\theta(x_t \,|\, x_{t-1})}{\nu_{t-1} \, R_\theta(x_t \,|\, x_{0:t-1}, y_t)}. \qquad (3.14b)$$
For the bPF, the choice of resampling step and propagation kernel means that the weights are given by
$$\nu_t = w_t = g_\theta(y_t \,|\, x_t),$$
i.e. the probability of observing $y_t$ given $x_t$. For the faPF, we have that the particle weights are given by $w_t \equiv 1$.
The general form of the APF for approximate state inference in nonlinear SSMs is presented in Algorithm 2. The complexity of the algorithm is linear in the number of time steps and particles, i.e. $\mathcal{O}(NT)$. The user choices are mainly the number of particles $N$, the auxiliary weights $\nu$, the propagation kernel $R_\theta(x_t \,|\, x_{0:t-1}, y_t)$ and the resampling method.
Algorithm 2 Auxiliary particle filter (APF)
Inputs: $y_{1:T}$ (observations), $R_\theta(x_t \,|\, x_{0:t-1}, y_t)$ (propagation kernel), $\nu_t$ (auxiliary weights) and $N > 0$ (no. particles).
Outputs: $\hat{p}_\theta(x_t \,|\, y_{1:t})$ for $t = 1$ to $T$ (the empirical marginal filtering distributions) and $\hat{p}_\theta(x_{0:t} \,|\, y_{1:t})$ for $t = 1$ to $T$ (the empirical joint smoothing distributions).
1: Initialise each particle $x_0^{(i)}$.
2: for $t = 1$ to $T$ do
3:   Sample new ancestor indices to obtain $\{a_t^{(i)}\}_{i=1}^N$ using (3.10).
4:   Propagate the particles to obtain $\{x_t^{(i)}\}_{i=1}^N$ using (3.12).
5:   Calculate new importance weights to obtain $\{w_t^{(i)}\}_{i=1}^N$ using (3.14).
6:   Estimate the marginal filtering distribution $\hat{p}_\theta(x_t \,|\, y_{1:t})$ using (3.8).
7:   Estimate the joint smoothing distribution $\hat{p}_\theta(x_{0:t} \,|\, y_{1:t})$ using (3.9).
8: end for
We now proceed with discussing the use of the APF for state inference and log-likelihood estimation.
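As a concrete illustration of Algorithm 2, the sketch below implements the bPF special case ($\nu_t = w_t$ and $R_\theta = f_\theta$) for an LGSS model of the form $x_t = \phi x_{t-1} + v_t$, $y_t = x_t + e_t$. This is our own illustration: the function and variable names are not from the thesis, and the parameter values are chosen only for the example.

```python
import numpy as np

def bootstrap_pf(y, phi, sigma_v, sigma_e, N=500, seed=0):
    """Bootstrap particle filter (Algorithm 2 with nu_t = w_t and R = f_theta)
    for the LGSS model x[t] = phi*x[t-1] + v, y[t] = x[t] + e."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma_v, size=N)              # initialise the particles
    w = np.full(N, 1.0 / N)
    x_filt = np.zeros(len(y))
    for t in range(len(y)):
        a = rng.choice(N, size=N, p=w)                # (i) resample ancestors, (3.10)
        x = phi * x[a] + rng.normal(0.0, sigma_v, N)  # (ii) propagate, (3.12)
        logw = -0.5 * ((y[t] - x) / sigma_e) ** 2     # (iii) weight by g(y_t | x_t)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        x_filt[t] = np.sum(w * x)                     # filtered state estimate, (3.15)
    return x_filt

# simulate some data and run the filter
rng = np.random.default_rng(1)
T, phi, sv, se = 250, 0.5, 1.0, 0.1
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = phi * x_true[t - 1] + rng.normal(0.0, sv)
y = x_true + rng.normal(0.0, se, T)
print(bootstrap_pf(y, phi, sv, se)[:5])
```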
3.3.2 State inference using the auxiliary particle filter
Algorithm 2 can be used to estimate the marginal filtering and joint smoothing distributions in an SSM given some observations. From these empirical distributions, we can compute an estimate of the filtered marginal state $\hat{x}_{t|t}$,
$$\hat{x}_{t|t} = \int x_t \, \hat{p}_\theta(x_t \,|\, y_{1:t}) \, dx_t = \sum_{i=1}^N \tilde{w}_t^{(i)} x_t^{(i)}, \qquad (3.15)$$
where we have inserted (3.8). Note that the state trajectory $x_{0:t|t}$ can be estimated using an analogous expression given by
$$\hat{x}_{0:t|t} = \int x_{0:t} \, \hat{p}_\theta(x_{0:t} \,|\, y_{1:t}) \, dx_{0:t} = \sum_{i=1}^N \tilde{w}_t^{(i)} x_{0:t}^{(i)}, \qquad (3.16)$$
by the use of the empirical smoothing distribution (3.9). In Example 3.2, we illustrate the use of the bPF for marginal state inference in the earthquake model (2.6) and the data discussed in Section 2.2.3.
3.2 Example: State inference in the earthquake count model
Consider the earthquake count model (2.6) using the data presented in Figure 2.4. To infer the state in the model, we make use of the bPF with $N = 1\,000$ particles and the parameter vector $\hat{\theta} = \{0.88, 0.15, 17.65\}$ estimated later in Example 4.9.
In the upper part of Figure 3.2, we present the filtered number of major earthquakes obtained from the bPF (red line) together with the data (black dots). We also present the predicted number of major earthquakes (dark green) for 2014 to 2022 together with a 95% CI for the predictions (gray area). The uncertainty grows quickly for the predictions, which indicates that a more advanced model is probably required to make good predictions.
In the lower part of Figure 3.2, we present the estimated state obtained from the bPF, which indicates that the world is currently in a calmer period with a smaller earthquake intensity compared with e.g. the 1950s. This conclusion is also supported by the actual number of earthquakes recorded during these two periods. We also present the predicted latent state (dark green) for 2014 to 2022 together with a 95% CI for the predictions (gray area). Again, we see that there is a large uncertainty in the future latent state.
From Algorithm 2, we know that the APF algorithm can be used to approximate the joint smoothing distribution. In the following, we require estimates of this distribution for use in some of the proposed methods for approximate parameter inference. However, the accuracy of these estimates is poor due to problems with path degeneracy. To explain the nature of this problem, consider the resampling step in the APF algorithm. During each resampling, we discard some particles with non-zero probability because they have small auxiliary weights. This means that the number of unique particles is smaller than $N$ with non-zero probability after each resampling step.
As $t$ increases, we repeatedly resample the particle system, and this leads to the number of unique particles tending to one before some time $s < t$. That is, the particle trajectories collapse into a single trajectory for all the particles before time $s$. Consequently, the variance of the estimates of the joint smoothing distribution is large due to the same problem as in the SIS algorithm. To mitigate this effect, we could make use of the faPF or increase the number of particles. However, this could be problematic, as the faPF is not available for many interesting models. Also, the computational cost of the APF algorithm increases linearly with $N$. We illustrate this effect and compare these two alternatives in Example 3.3.
There exist additional alternatives to mitigate the path degeneracy and obtain accurate approximations of the joint smoothing distribution. A popular alternative in practice is to resample only when the variance in the particle weights is larger than some threshold (Doucet and Johansen, 2011); a small sketch of this strategy is given below. This decreases the path degeneracy problem, but it is often not enough to completely solve it. Instead, a particle smoother is often used for this problem, which gives better accuracy in the estimates but often increases the computational cost. In Section 3.4, we return to the use of particle smoothing for estimating the smoothing distributions. We now continue with discussing the statistical properties of the APF and how to use the APF to estimate the likelihood and log-likelihood for an SSM.
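A minimal sketch of this adaptive strategy (our own illustration, not taken from the cited references): resampling is triggered only when the effective sample size of the current weights falls below a chosen fraction of $N$, here one half.

```python
import numpy as np

def maybe_resample(x, w, rng, threshold=0.5):
    """Resample only if the effective sample size of the normalised weights w
    drops below threshold * N, which limits the path degeneracy."""
    N = len(w)
    ess = 1.0 / np.sum(w ** 2)           # effective sample size of the weights
    if ess < threshold * N:
        a = rng.choice(N, size=N, p=w)   # multinomial resampling, cf. (3.10)
        return x[a], np.full(N, 1.0 / N)
    return x, w                          # keep particles and weights unchanged
```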
Figure 3.2: Upper: the filtered number of major earthquakes (red line) and
the data (black dots) obtained from the bPF using the data and the model
from Example 3.2. The predicted number of earthquakes (green line) is also
presented together with a 95% CI (gray area). Lower: the estimated latent
earthquake intensity obtained from the bPF. The predicted latent state (green
line) is also presented together with a 95% CI (gray area).
3.3 Example: Path degeneracy in the GARCH(1,1) model
Consider the GARCH(1,1) model in (2.3), from which we generate $T = 20$ observations using the parameter vector $\theta^\star = \{0.10, 0.80, 0.05, 0.30\}$. Here, we make use of the bPF and faPF with systematic resampling at every iteration. The aim is to estimate the state trajectory $x_{0:t|t}$ using the APF and (3.16).
For this model, the weight function (3.11) and the proposal (3.13) required for the faPF can be computed in closed form by properties of joint Gaussian distributions, see Bishop (2006). The required quantities are given by
$$p(y_t \,|\, x_{t-1}, h_t) = \int \mathcal{N}\big(y_t; x_t, \tau^2\big) \, \mathcal{N}\big(x_t; 0, h_t\big) \, dx_t = \mathcal{N}\big(y_t; 0, \tau^2 + h_t\big),$$
$$p(x_t \,|\, y_t, x_{t-1}) \propto p(y_t \,|\, x_t) \, p(x_t \,|\, x_{t-1}) = \mathcal{N}\big(y_t; x_t, \tau^2\big) \, \mathcal{N}\big(x_t; 0, h_t\big) = \mathcal{N}\Big(x_t; \big(\tau^{-2} + h_t^{-1}\big)^{-1} \tau^{-2} y_t, \big(\tau^{-2} + h_t^{-1}\big)^{-1}\Big),$$
from the marginalisation property and the conditioning property.
In Figure 3.3, we present the state trajectories obtained by tracing the ancestral lineage backwards from time $T$ to $0$. We see that for the bPF with $N = 10$ particles, the ancestral lines collapse to a single unique particle before time $t = 10$ due to the path degeneracy problem. Increasing the number of particles to $N = 20$ results in the paths degenerating before $t = 8$ instead. However, the faPF does not have the same problem with degeneracy, as many ancestors survive the repeated resamplings and contribute information for estimating the joint smoothing distribution.
3.3.3 Statistical properties of the auxiliary particle filter
In this section, we review some results regarding two aspects of the statistical properties of the APF. Note that the analysis of the APF is rather complicated compared with other estimators that make use of independent samples from the target distribution, as the particles obtained from the APF are not independent due to the interaction during the resampling step. However, there are many strong results regarding the statistical properties of the bPF in the literature. Extensive technical accounts are found in Del Moral (2013) and Del Moral (2004). Some of the statistical properties are also discussed in a more application oriented setting in Crisan and Doucet (2002), Douc et al. (2014) and Doucet and Johansen (2011).
Assume that we would like to compute the expected value of some well-behaved test function $\varphi(x)$ using the particle system generated by the APF, in analogy with the IS estimator in (3.3).
Figure 3.3: The ancestral paths obtained from the bPF with 10 particles
(upper), the bPF with 20 particles (middle) and the faPF with 10 particles
(lower) with T = 20 in the GARCH(1,1) model in Example 3.3. The discarded
particles are presented as gray dots.
From the empirical filtering distribution (3.8), the expected value can be approximated by
$$\hat{\varphi}^{\mathrm{APF}} = \sum_{i=1}^N \tilde{w}_t^{(i)} \varphi\big(x_t^{(i)}\big). \qquad (3.17)$$
Under some assumptions, it is possible to prove that this estimator is consistent and therefore converges in probability to the true expectation as the number of particles tends to infinity,
$$\hat{\varphi}^{\mathrm{APF}} \overset{p}{\longrightarrow} \mathbb{E}\big[\varphi(x)\big], \qquad N \longrightarrow \infty.$$
Hence, this estimator is asymptotically unbiased and equivalent to some unknown optimal filter. However, for finite $N$ the estimator is often biased, and we return to this problem in Section 3.4.3.
We would also like to say something about the MSE of the estimator in (3.17). For this, we assume that the function $\varphi(x)$ is bounded² for all $x$, together with some additional assumptions that are discussed by Crisan and Doucet (2002). It then follows that the MSE of the estimator obtained by the bPF can be upper bounded as
$$\mathbb{E}\Big[\hat{\varphi}^{\mathrm{APF}} - \mathbb{E}[\varphi(x)]\Big]^2 \leq C_T \, \frac{\|\varphi\|^2}{N}, \qquad (3.18)$$
where $\|\cdot\|$ denotes the supremum norm. Here, $C_T$ denotes a function that possibly depends on $T$ but is independent of $N$.
There exist numerous other results regarding the MSE of the estimator in (3.17). For example, it is possible to relax the assumptions that $\varphi(x)$ should be bounded and that we only use the bPF with multinomial resampling. The resulting upper bounds have a similar structure to (3.18) but with different functions $C$. For more information, see Del Moral (2013) and Douc et al. (2014).
From this upper bound on the MSE, we would like to give some general recommendations regarding how to select $N$ given $T$. However, this is difficult, as the accuracy of the estimates is connected with the mixing property of the SSM (see Example 3.5). In practice, it is recommended to use at least $N \propto T$ particles in the APF, but sometimes even more particles are required to obtain reasonable estimates. Hence, we recommend that the user estimates the MSE for each model using e.g. a pilot run with some Monte Carlo simulations or by comparing with the solution obtained from a particle smoother.
3.3.4 Estimation of the likelihood and log-likelihood
As previously discussed, the likelihood $L(\theta)$ and log-likelihood $\ell(\theta)$ play important roles in both ML and Bayesian parameter inference. We also discussed that they are analytically intractable for nonlinear SSMs. However, we can obtain an unbiased estimate of the likelihood for any number of particles using the weights
²This is a rather restrictive assumption, as it is not satisfied by the function $\varphi(x) = x$, which is used to compute the estimate of the filtered state $\hat{x}_{t|t}$.
generated by the APF. To understand how this can be done, we consider the decomposition in Definition 2.1 and the fact that each predictive likelihood can be written as
$$p_\theta(y_t \,|\, y_{1:t-1}) = \int p_\theta(y_t, x_t \,|\, y_{1:t-1}) \, dx_t = \int g_\theta(y_t \,|\, x_t) \, f_\theta(x_t \,|\, x_{t-1}) \, p_\theta(x_{t-1} \,|\, y_{1:t-1}) \, dx_{t-1:t}.$$
If we consider the bPF algorithm, we can rewrite this as
$$\hat{p}^{\mathrm{bPF}}(y_t \,|\, y_{1:t-1}) = \int \frac{g_\theta(y_t \,|\, x_t) \, f_\theta(x_t \,|\, x_{t-1})}{R_\theta(x_t \,|\, x_{t-1}, y_t)} \, R_\theta(x_t \,|\, x_{t-1}, y_t) \, p_\theta(x_{t-1} \,|\, y_{1:t-1}) \, dx_{t-1:t} = \int W_\theta(x_t, x_{t-1}) \, R_\theta(x_t \,|\, x_{t-1}, y_t) \, p_\theta(x_{t-1} \,|\, y_{1:t-1}) \, dx_{t-1:t},$$
by the expression for the weight function (3.14), as $w_{t-1} = \nu_{t-1}$ for the bPF.
To approximate the predictive likelihood, we make use of the fact that $\{\tilde{x}_t^{(i)}, x_{t-1}^{(i)}\}_{i=1}^N$ is approximately distributed according to $R_\theta(x_t \,|\, x_{t-1}, y_t) \, p_\theta(x_{t-1} \,|\, y_{1:t-1})$. From this, it follows that
$$\hat{p}^{\mathrm{bPF}}(y_t \,|\, y_{1:t-1}) \approx \frac{1}{N} \sum_{i=1}^N \int W_\theta(x_t, x_{t-1}) \, \delta_{\tilde{x}_t^{(i)}, x_{t-1}^{(i)}}(dx_{t-1:t}) = \frac{1}{N} \sum_{i=1}^N W_\theta\big(\tilde{x}_t^{(i)}, x_{t-1}^{(i)}\big) = \frac{1}{N} \sum_{i=1}^N w_t^{(i)}.$$
It is also possible to show that the faPF leads to a similar estimator by replacing $w_t$ with $\nu_t$, see Pitt (2002) for the derivation. The resulting estimator for the likelihood using the APF (including both the bPF and faPF as special cases) is given by
$$\hat{L}(\theta) = \hat{p}_\theta(y_{1:T}) = \frac{1}{N^{T+1}} \left\{ \sum_{i=1}^N w_T^{(i)} \right\} \left\{ \prod_{t=0}^{T-1} \sum_{i=1}^N \nu_t^{(i)} \right\}, \qquad (3.19)$$
where the first summation is unity for the faPF and where $\nu_t = w_t$ for the bPF. To implement this estimator, we run Algorithm 2 and then calculate the likelihood estimate using (3.19) by inserting the particle weights generated by the APF.
The statistical properties of the likelihood estimator are studied by Del Moral (2004). It turns out that the estimator is consistent and unbiased for any $N \geq 1$. Furthermore, the error of the estimate satisfies a CLT,
$$\sqrt{N} \Big[ \hat{L}(\theta) - L(\theta) \Big] \overset{d}{\longrightarrow} \mathcal{N}\big(0, \psi^2(\theta)\big), \qquad (3.20)$$
for some asymptotic variance $\psi^2(\theta)$, see Proposition 9.4.1 in Del Moral (2004).
It is also straightforward to obtain an estimator for the log-likelihood from (3.19),
$$\hat{\ell}(\theta) = \log \hat{p}_\theta(y_{1:T}) = \log \left\{ \sum_{i=1}^N w_T^{(i)} \right\} + \sum_{t=0}^{T-1} \log \left\{ \sum_{i=1}^N \nu_t^{(i)} \right\} - (T+1) \log N. \qquad (3.21)$$
However, this estimator is biased for a finite number of particles, but it is still consistent and asymptotically normal. This result follows from applying the second-order delta method (Casella and Berger, 2001) on (3.20). The resulting CLT for the log-likelihood estimator is given by
$$\sqrt{N} \left( \hat{\ell}(\theta) - \ell(\theta) + \frac{\gamma^2(\theta)}{2N} \right) \overset{d}{\longrightarrow} \mathcal{N}\big(0, \gamma^2(\theta)\big), \qquad (3.22)$$
where we introduce $\gamma(\theta) = \psi(\theta)/L(\theta)$. As a result, we have an expression for the bias of the estimator given by $-\gamma^2(\theta)/2N$ for a finite number of particles. Consequently, it is possible to compensate for the bias, as the variance of the estimator $\gamma^2(\theta)$ can be estimated using Monte Carlo simulations by repeated application of the APF on the same data. This could be an interesting improvement for the proposed methods in Papers B and C, where we make use of the log-likelihood estimator.
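To make the estimator concrete, the sketch below accumulates $\hat{\ell}(\theta)$ inside a bPF, where $\nu_t = w_t$ so that (3.21) reduces to a sum of the logarithms of the average unnormalised weights (one term per observation; the convention for the initial weights is our own). This is our own illustration with a hypothetical LGSS model; working with log-weights and the log-sum-exp trick avoids numerical underflow.

```python
import numpy as np

def bpf_loglik(y, phi, sigma_v, sigma_e, N=500, seed=0):
    """Estimate the log-likelihood, cf. (3.21) with nu_t = w_t, using a bPF
    in the LGSS model x[t] = phi*x[t-1] + v, y[t] = x[t] + e."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma_v, size=N)
    w = np.full(N, 1.0 / N)
    loglik = 0.0
    for t in range(len(y)):
        a = rng.choice(N, size=N, p=w)                   # resample ancestors
        x = phi * x[a] + rng.normal(0.0, sigma_v, N)     # propagate particles
        logw = (-0.5 * np.log(2 * np.pi * sigma_e**2)
                - 0.5 * ((y[t] - x) / sigma_e) ** 2)     # log g(y_t | x_t)
        c = logw.max()                                   # log-sum-exp trick
        loglik += c + np.log(np.sum(np.exp(logw - c))) - np.log(N)
        w = np.exp(logw - c)
        w /= w.sum()
    return loglik
```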
3.4 Example: Bias and variance of the log-likelihood estimate
Consider the setup in Example 2.6 and the problem of estimating the log-likelihood at $\theta = \theta^\star$ using the faPF with $N = 10$ particles. We repeat the estimation over $1\,000$ Monte Carlo simulations using the same data. The error of the log-likelihood estimate is calculated using the true value obtained from the Kalman filter.
In Figure 3.4, we present the histogram of the error together with a Gaussian approximation (upper), the box plot of the errors (lower left) and the QQ-plot of the errors (lower right). The Gaussian approximation fits the data quite well, and the resulting average error is $-0.03$ with variance $\hat{\gamma}^2(\theta^\star) = 0.05$. The predicted bias is calculated using (3.22) as $-\hat{\gamma}^2(\theta^\star)/2\sqrt{N} = -0.01$. The QQ-plot validates the Gaussian assumption, as we do not see any deviating tail behaviour.
3.4 Particle smoothing
Particle smoothers approximate the solution to the smoothing problem in an SSM, similar to how the APF approximates the corresponding filtering problem. However, there exist a number of different approaches to carry out the smoothing given the particle system from the APF. The simplest smoother is to make use of the APF to approximate the joint smoothing distribution, as discussed previously. The main problem with this approach is that the path degeneracy problem limits the accuracy of the estimate. Another similar approach is to make use of the fixed-lag (FL) particle smoother (Kitagawa and Sato, 2001), which is based on using the APF to estimate the fixed-lag smoothing distribution (recall Table 3.1). In the following, we make use of this smoother, as it has a low computational cost and a reasonable accuracy compared with other particle smoothers.
Figure 3.4: The histogram with a Gaussian approximation (blue line) of the error in the log-likelihood estimates (upper) in the LGSS model using the faPF. The boxplot (lower left) and QQ-plot (lower right) support the Gaussian assumption of the error.
More advanced smoothers are often based on approximations of the Bayesian smoothing recursion (3.2). There are two main families of smoothers that result from this approach: the FFBSm and the forward filtering backward simulator (FFBSi). The original marginal FFBSm and FFBSi algorithms are discussed in Doucet et al. (2000) and Godsill et al. (2004), respectively. Another type of smoother makes use of two APFs (one running forward in time and the other backwards) and combines the output of these by using a two-filter formula (Briers et al., 2010; Fearnhead et al., 2010). Furthermore, other types of smoothers have been proposed in Bunch and Godsill (2013) and Dubarry and Douc (2011) using MCMC methods (discussed in the next chapter). The interesting improvement in these two smoothers is that they generate new particles in the backward sweep, which is not done in the FFBSm and FFBSi. See Lindsten and Schön (2013) for a recent survey of different particle smoothing methods.
In this section, we focus on the use of the FL smoother for estimating some quantities that are required for the proposed methods in Papers A and D. We introduce
the underlying assumptions of the smoother and discuss how to implement it. We
also show how it can be used to estimate the score function of an SSM and parts of
the information matrix. We conclude by discussing the properties of the estimates
obtained from the FL smoother.
3.4.1 State inference using the particle fixed-lag smoother
The FL smoother relies on the forgetting properties of an SSM, i.e. that the Markov
chain quickly forgets about its earlier states. This property is illustrated in Example 3.5 for an LGSS model.
3.5 Example: Mixing property in the LGSS model
Consider the LGSS model using the same setup as in Example 2.6, where 20 different state processes are simulated during 8 time steps. Here, each state process has a randomly selected initial state distributed as $x_0 \sim \mathcal{N}(0, 20^2)$.
In Figure 3.5, we present the evolution of the state processes in three different LGSS models. We note that the value of $\phi$ determines the rate at which the processes converge to a stationary phase. A larger value of $\phi$ gives the process a longer memory, which means that it takes longer for the process to forget its initial condition. That is, the state process mixes slowly, and therefore future observations contain useful information about the current state.
The observation that some SSMs mix quickly is the basis for the assumption that
$$p_\theta(x_{0:t} \,|\, y_{1:T}) \approx p_\theta(x_{0:t} \,|\, y_{1:\kappa_t}), \qquad (3.23)$$
where $\kappa_t = \min\{T, t + \Delta\}$ and $\Delta$ denotes some lag determined by the user. This means that future observations contain a decreasing amount of information about the current state. The rate of this decrease is determined by the mixing of the model, as discussed in Example 3.5. If the model mixes quickly, future observations have a limited amount of information about the current state as the state process quickly forgets its past.
Figure 3.5: The evolution of 20 different state processes in the LGSS model
from Example 2.6 using different initial values. Three different values of φ
are used in the LGSS model: φ = 0.2 (upper), φ = 0.5 (middle) and φ = 0.8
(lower).
Algorithm 3 Two-step fixed-lag (FL) particle smoother
Inputs: $y_{1:T}$ (observations), $R_\theta(x_t \,|\, x_{0:t-1}, y_t)$ (propagation kernel), $\nu_t$ (auxiliary weights), $N > 0$ (no. particles) and $\Delta$ (lag).
Outputs: $\hat{p}_\theta(x_{t-1:t} \,|\, y_{1:T})$ for $t = 1$ to $T$ (empirical two-step smoothing dist.).
1: Initialise each particle $x_0^{(i)}$.
2: for $t = 1$ to $T$ do
3:   Sample new ancestor indices to obtain $\{a_t^{(i)}\}_{i=1}^N$ using (3.10).
4:   Propagate the particles to obtain $\{x_t^{(i)}\}_{i=1}^N$ using (3.12).
5:   Calculate new importance weights to obtain $\{w_t^{(i)}\}_{i=1}^N$ using (3.14).
6:   if $t > \Delta + 1$ then
7:     Compute $\kappa_t = \min\{t + \Delta, T\}$ and recover $\{x_{\kappa_t, t-1:t}^{(i)}\}_{i=1}^N$.
8:     Compute (3.24) to obtain $\hat{p}_\theta(x_{t-1:t} \,|\, y_{1:T})$.
9:   end if
10: end for
Hence, the lag $\Delta$ can be selected to be rather small and still make (3.23) a valid approximation. This is the main intuition behind the FL smoother, and the procedure follows directly from the APF in Algorithm 2.
In the following, we require estimates of the two-step smoothing distribution $\hat{p}_\theta(dx_{t-1:t} \,|\, y_{1:\kappa_t})$ to estimate the score function and the information matrix for an SSM. By using the FL smoother assumption, we can compute the required estimate by a marginalisation over the joint smoothing distribution,
$$p_\theta(x_{t-1:t} \,|\, y_{1:\kappa_t}) = \int p_\theta(x_{0:\kappa_t} \,|\, y_{1:\kappa_t}) \, dx_{0:t-2} \, dx_{t+1:\kappa_t}.$$
By inserting the empirical joint smoothing distribution $\hat{p}_\theta(x_{1:\kappa_t} \,|\, y_{1:\kappa_t})$ from (3.9), an estimate of the two-step smoothing distribution is obtained as
$$\hat{p}_\theta(dx_{t-1:t} \,|\, y_{1:\kappa_t}) = \sum_{i=1}^N \tilde{w}_{\kappa_t}^{(i)} \, \delta_{x_{\kappa_t, t-1:t}^{(i)}}(dx_{t-1:t}), \qquad (3.24)$$
where $x_{\kappa_t, t}^{(i)}$ denotes the ancestor particle at time $t$ from which the particle $x_{\kappa_t}^{(i)}$ originated. This ancestor is obtained by tracing the lineage backwards, similar to Figure 3.3, where the same is done for the particles starting at time $T$.
We summarise the procedure for estimating the two-step smoothing distribution using the FL smoother in Algorithm 3. The marginal smoothing distribution can be computed using an analogous expression, which results in a similar procedure; a sketch of the required ancestor tracing is given below. Finally, we present an application of these methods for volatility estimation in Example 3.6.
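The key implementation detail in Algorithm 3 is recovering the ancestor particles $x_{\kappa_t, t}^{(i)}$ by tracing the stored ancestor indices backwards. A minimal sketch of this tracing step and of the resulting marginal FL estimate is given below; the array layout (particles[t, i], ancestors[t, i] and normalised weights[t, i] stored during the APF run) is our own assumption, not prescribed by the thesis.

```python
import numpy as np

def trace_back(ancestors, t_target, t_current, idx):
    """Follow the ancestor indices of particles idx at time t_current
    backwards to time t_target, cf. the lineages in Figure 3.3."""
    idx = np.asarray(idx)
    for s in range(t_current, t_target, -1):
        idx = ancestors[s, idx]          # step one generation back
    return idx

def fl_smoother_estimate(particles, ancestors, weights, delta):
    """Fixed-lag estimate of the marginal smoothed state for each t, using
    the particles and normalised weights at time kappa_t = min(t + delta, T)."""
    T, N = particles.shape[0] - 1, particles.shape[1]
    x_smooth = np.zeros(T + 1)
    for t in range(T + 1):
        kappa = min(t + delta, T)
        idx = trace_back(ancestors, t, kappa, np.arange(N))
        x_smooth[t] = np.sum(weights[kappa] * particles[t, idx])
    return x_smooth
```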
Figure 3.6: Upper: the log-returns (gray dots) from the Nasdaq OMX Stockholm 30 Index presented in Figure 2.2. The estimated 95% CIs are also
presented for the bPF (red) and the FL smoother (blue). Middle: the estimated latent volatility obtained from the bPF. Lower: the estimated latent
volatility obtained from the FL smoother.
3.6 Example: State inference in the Hull-White SV model
Consider the Hull-White SV model (2.4) with data from Section 2.2.2 and the
parameter vector estimate from Example 3.1. To infer the latent volatility (the
state), we apply the bPF using Algorithm 2 and the FL smoother according to
Algorithm 3 using N = 5 000 particles and ∆ = 12.
In the upper part of Figure 3.6, we present the filtered log-returns with a 95% CI
for the bPF (red) and the FL smoother (blue). Here, we present the upper CI for
the bPF and the lower CI for the FL smoother so that the two lines do not overlap.
Note that the mean of the log-returns is zero and therefore the CIs are symmetric
around zero. We conclude that the CIs seem reasonable as they cover most of the
log-returns except at the financial crises. Also, the CI computed from the bPF is
a bit rougher than the CI computed using the FL smoother.
In the middle and lower parts of Figure 3.6, we present the estimated volatilities using both methods. We see that the estimates correspond reasonably well with the variation of the log-returns. Note that the periods with larger volatility coincide with the financial crises.
3.4.2 Estimation of additive state functionals
In this section, we consider the use of particle smoothing to estimate the expected value of an additive functional given the observations. This is analogous to the Monte Carlo estimate of the expectation of a function in (3.3). An additive functional satisfies the expression
$$S_\theta(x_{0:T}) = \sum_{t=1}^T s_{\theta,t}(x_{t-1:t}), \qquad (3.25)$$
which means that we can decompose a function that depends on the entire particle trajectory into several functionals. Here, $s_{\theta,t}(x_{t-1:t})$ denotes some general functional that depends on only two states of the trajectory. This type of additive functional occurs frequently in nonlinear SSMs when we would like to compute functions that depend on the kernels $f_\theta(x_{t+1} \,|\, x_t)$ and $g_\theta(y_t \,|\, x_t)$. In the following, we give some concrete examples of these functionals connected with parameter inference in SSMs. In these problems, we would like to compute the expected value of the additive functional given the observations,
$$\mathbb{E}\big[S_\theta(x_{0:T}) \,\big|\, y_{1:T}\big] = \int S_\theta(x_{0:T}) \, p_\theta(x_{0:T} \,|\, y_{1:T}) \, dx_{0:T} = \sum_{t=1}^T \int s_{\theta,t}(x_{t-1:t}) \, p_\theta(x_{t-1:t} \,|\, y_{1:T}) \, dx_{t-1:t}. \qquad (3.26)$$
This can be done by inserting the two-step smoothing distribution estimated by
any particle filter or smoothing algorithm. Examples of some different approaches
for this are found in Poyiadjis et al. (2011) using the APF and in Del Moral et al.
(2010) using a forward smoother based on the FFBSm. In this section and in
Paper A, we discuss the use of the FL smoother for this application. The resulting estimate is obtained by inserting (3.24) into (3.26),
$$\hat{S}_\theta(x_{0:T}) = \sum_{t=1}^T \sum_{i=1}^N \tilde{w}_{\kappa_t}^{(i)} \, s_{\theta,t}\big(x_{\kappa_t, t-1:t}^{(i)}\big), \qquad (3.27)$$
which can be computed by some minor modifications of Algorithm 3.
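As an illustration of (3.27), the following sketch (our own, under the same storage assumptions as the tracing sketch in Section 3.4.1) accumulates the expected additive functional term by term; here, s_fun stands for any two-state functional $s_{\theta,t}$.

```python
import numpy as np

def fl_additive_functional(particles, ancestors, weights, delta, s_fun):
    """Estimate E[S_theta(x_{0:T}) | y_{1:T}] as in (3.27): each term s_theta,t
    is evaluated at the traced ancestor pair (x_{kappa_t,t-1}, x_{kappa_t,t})
    and weighted by the normalised weights at time kappa_t = min(t + delta, T)."""
    T, N = particles.shape[0] - 1, particles.shape[1]
    S_hat = 0.0
    for t in range(1, T + 1):
        kappa = min(t + delta, T)
        idx = np.arange(N)
        for s in range(kappa, t, -1):      # trace lineages back from kappa_t to t
            idx = ancestors[s, idx]
        idx_prev = ancestors[t, idx]       # one more step back gives x_{t-1}
        S_hat += np.sum(weights[kappa] * s_fun(particles[t - 1, idx_prev],
                                               particles[t, idx]))
    return S_hat

# e.g. the score functional (3.29) for an LGSS model with theta = phi:
# s_fun = lambda x_prev, x_cur: (x_cur - phi * x_prev) * x_prev / sigma_v**2
```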
This type of additive functional is encountered in two common problems concerning SSMs: estimating the score function and estimating parts of the information matrix. In this thesis, we make use of the material in this section for estimating these two quantities in Papers A and D. To derive the expressions for the additive functionals related to these problems, we consider the logarithm of the joint distribution of states and observations in an SSM,
$$\log p_\theta(x_{0:T}, y_{1:T}) = \log \mu(x_0) + \sum_{t=1}^T \Big[ \log f_\theta(x_t \,|\, x_{t-1}) + \log g_\theta(y_t \,|\, x_t) \Big]. \qquad (3.28)$$
By using this quantity, the score function can be estimated using Fisher's identity (Fisher, 1925; Cappé et al., 2005),
$$\mathcal{S}(\theta_0) = \nabla \ell(\theta) \big|_{\theta = \theta_0} = \int \Big[ \nabla \log p_\theta(x_{0:T}, y_{1:T}) \big|_{\theta = \theta_0} \Big] \, p_\theta(x_{0:T} \,|\, y_{1:T}) \, dx_{0:T},$$
which results in the functional
$$\xi_{\theta,t}(x_{t-1:t}) = \nabla \log f_\theta(x_t \,|\, x_{t-1}) \big|_{\theta = \theta_0} + \nabla \log g_\theta(y_t \,|\, x_t) \big|_{\theta = \theta_0}, \qquad (3.29)$$
corresponding to the gradient of (3.28) evaluated at $\theta = \theta_0$. An estimator of the score function is obtained by inserting the additive functional (3.29) into the empirical distribution obtained by the FL smoother (3.27) as
$$\hat{\mathcal{S}}(\theta_0) = \sum_{t=1}^T \sum_{i=1}^N \tilde{w}_{\kappa_t}^{(i)} \, \xi_{\theta,t}\big(x_{\kappa_t, t-1:t}^{(i)}\big).$$
The observed information matrix can be estimated using Louis' identity (Louis, 1982; Cappé et al., 2005). However, only some parts of the identity can be directly estimated using the FL smoother. The remaining parts must be estimated using the APF or a more advanced smoother like the FFBSm. To see this, we rewrite the observed information matrix using Louis' identity,
$$\mathcal{J}(\theta_0) = -\nabla^2 \ell(\theta) \big|_{\theta = \theta_0} = \Big[ \nabla \ell(\theta) \big|_{\theta = \theta_0} \Big]^2 - \Big[ \nabla^2 L(\theta) \big|_{\theta = \theta_0} \Big] \Big[ L(\theta) \Big]^{-1}.$$
Here, the second term of the identity can be written as
$$\Big[ \nabla^2 L(\theta) \big|_{\theta = \theta_0} \Big] \Big[ L(\theta) \Big]^{-1} = \int \Big[ \nabla \log p_\theta(x_{0:T}, y_{1:T}) \big|_{\theta = \theta_0} \Big]^2 p_\theta(x_{0:T} \,|\, y_{1:T}) \, dx_{0:T} + \int \Big[ \nabla^2 \log p_\theta(x_{0:T}, y_{1:T}) \big|_{\theta = \theta_0} \Big] \, p_\theta(x_{0:T} \,|\, y_{1:T}) \, dx_{0:T}. \qquad (3.30)$$
The first term in (3.30) cannot be directly estimated by the FL smoother, as it does not provide approximations of the densities needed. Instead, it can be estimated using the APF directly, as proposed by Poyiadjis et al. (2011), or by the use of a combination of the FL smoother and the APF. The latter alternative is discussed in Paper A. However, the second term in (3.30) can be estimated using the FL smoother directly, as it can be written as an additive functional given by
$$\zeta_{\theta,t}(x_{t-1:t}) = \nabla^2 \log f_\theta(x_t \,|\, x_{t-1}) \big|_{\theta = \theta_0} + \nabla^2 \log g_\theta(y_t \,|\, x_t) \big|_{\theta = \theta_0}, \qquad (3.31)$$
corresponding to the Hessian of (3.28) evaluated at $\theta = \theta_0$. An estimator for this part of Louis' identity is obtained by inserting the additive functional (3.31) into the empirical distribution obtained by the FL smoother (3.27). The expected information matrix can be estimated as the sample covariance matrix of a large number of score functions estimated using different data sets. We discuss this method further in Paper D.
We have now seen two examples of additive functionals in connection with parameter inference in SSMs and how to estimate them using the FL smoother. The same
setup can also be used in the expectation maximisation algorithm (Dempster et al.,
1977; McLachlan and Krishnan, 2008) as discussed in Del Moral et al. (2010) using
a forward smoother. We return to this problem in Section 4.1, where we discuss
different methods for parameter inference. We end this section with Example 3.7,
where we estimate the log-likelihood, score and observed information matrix for a
nonlinear SSM using the FL smoother.
3.7 Example: Score and information matrix in the Hull-White SV model
Consider the problem of estimating the log-likelihood, the score function and the natural gradient (the score function divided by the observed information matrix) for $\theta = \phi$ in the HWSV model (2.4). We again make use of the data from Section 2.2.2 and the parameter vector estimate from Example 3.1. For this model, the additive functional connected with the score function (3.29) is
$$\xi_{\theta,t}(x_{t-1:t}) = \nabla \left[ -\frac{1}{2} \log\big(2\pi\sigma_v^2\big) - \frac{1}{2\sigma_v^2} \big(x_{t+1} - \phi x_t\big)^2 \right] \Bigg|_{\phi = \phi_0} + \nabla \left[ -\frac{1}{2} \log\big(2\pi\beta^2 \exp(x_t)\big) - \frac{y_t^2 \exp(-x_t)}{2\beta^2} \right] \Bigg|_{\phi = \phi_0} = \sigma_v^{-2} x_t \big(x_{t+1} - \phi_0 x_t\big),$$
and similarly for the observed information matrix (3.31),
$$\zeta_{\theta,t}(x_{t-1:t}) = \nabla \Big[ \sigma_v^{-2} x_t \big(x_{t+1} - \phi x_t\big) \Big] \Big|_{\phi = \phi_0} = -\sigma_v^{-2} x_t^2.$$
Here, we make use of the FL smoother in Algorithm 3 for estimating (3.27) with lag $\Delta = 12$ and $N = 5\,000$ particles. We vary the parameter on a grid within the interval $\phi \in [0.70, 0.99]$ and estimate the required quantities at each grid point. Here, we fix $\{\sigma_v, \beta\}$ to their estimated values. The observed information matrix is estimated using the combination of the APF and the FL smoother introduced in Paper A.
Figure 3.7: The estimates of the log-likelihood function (upper), the score function (lower left) and the natural gradient (lower right) of the HWSV model in Example 3.7. The dotted lines indicate the true parameters and the zero level.
The resulting estimates are presented in Figure 3.7. We see that the log-likelihood and score estimates seem reasonable and have a rather low variance. The natural gradient is more difficult to estimate, because the observed information matrix estimates are noisy. We see that the sign of the estimate is correct, but the gradient is noisy and rather small.
3.4.3 Statistical properties of the particle fixed-lag smoother
We conclude the discussion of particle smoothing by reviewing some results regarding the statistical properties of the estimates of the additive functionals obtained by the FL smoother. In Olsson et al. (2008), the authors analyse the bias and variance of the estimates from the APF and the FL smoother, respectively. The variances of the estimates are concluded (under some regularity conditions) to be upper bounded by quantities proportional to $T^2/\sqrt{N}$ and $(T \log T)/\sqrt{N}$ for the APF and the FL smoother, respectively.
The bias is also analysed in Olsson et al. (2008), and the authors conclude (under some regularity conditions) that the bias is upper bounded by quantities proportional to $T^2/N$ and $\lambda + (T \log T)/N$, respectively. Here, $\lambda$ denotes a quantity which is independent of the number of particles. Hence, the FL smoother gives a biased estimate for all $N$, whereas the bias of the estimate obtained from the APF decreases as $1/N$.
In the analysis, it is assumed that the lag is selected according to $\Delta \propto c \log T$, where $c > -1/\log \rho$ and $\rho \in [0, 1]$ denotes a measure of the mixing of the model (Olsson et al., 2008). If $\Delta$ is too small, the underlying approximation of the FL smoother (3.23) and the accuracy of the estimate are rather poor. This is because more information about the current state is available in future observations that are not taken into account by the smoother. If instead $\Delta$ is too large, we get problems with path degeneracy, and in the limit when $\Delta = T$, the APF estimate is recovered. The choice of $\Delta$ is therefore crucial for obtaining good estimates from the FL smoother, and its optimal value depends on both the number of observations and the mixing properties of the model. Hence, it must be tailored for each application individually, as is further discussed in Paper A.
Finally, we mention that even better accuracy can be obtained by using other, more advanced smoothing algorithms. The main drawback with these algorithms is that their computational cost is larger than for the APF and the FL smoother, which have a computational cost proportional to $\mathcal{O}(NT)$. For example, the FFBSm algorithm (Doucet et al., 2000) has a computational cost proportional to $\mathcal{O}(N^2 T)$. The benefit of using this smoother is that the variance of the estimate grows as $\mathcal{O}(T)$, instead of as $\mathcal{O}(T^2)$ and $\mathcal{O}(T \log T)$ for the APF and the FL smoother, respectively. These properties are discussed in more detail in Del Moral et al. (2010) and Poyiadjis et al. (2011). There are also some new promising particle smoothers with a computational complexity proportional to $\mathcal{O}(TN \log N)$. For more information, see Klaas et al. (2006), Taghavi et al. (2013) and Gray (2003).
3.5 SMC for Image Based Lighting
In Sections 3.3 and 3.4, we have discussed the application of SMC to the filtering
and smoothing problems in SSMs. However, the SMC algorithm can be applied
to other problems that are sequential in nature and where the target distribution
π(x0:k ) grows over some index k = 1, . . . , K. It is also possible to make use of
SMC methods for static problems by defining an artificial target that grows over
k even if the original target does not. This approach is discussed by Del Moral
et al. (2006) and opens up for using SMC for a wide range of problems.
Figure 3.8: The setup used in the IBL approach, where the LTE gives the outgoing radiance at the angle $\omega_r$ that hits the image plane. This radiance is computed by taking the EM, the BRDF and the visibility into account. Note that one of the light sources is occluded in this scene and does not contribute to the outgoing radiance.
We exemplify the usefulness of this general class of SMC algorithms by returning to the problem of rendering a sequence of photorealistic images using IBL (Debevec, 1998; Pharr and Humphreys, 2010), as discussed in Section 1.1.2. In Figure 3.8, we present a cartoon of the setup of the LTE. Here, we are interested in calculating the outgoing radiance $L_{r,k}(\omega_r)$ from a point at the outgoing angle $\omega_r$ at frame $k$, which is given by the LTE as
$$L_{r,k}(\omega_r) = \int f_r(\omega_i \to \omega_r) \, L_k(\omega_i) \, V(\omega_i) \, (\omega_i \cdot n) \, d\omega_i, \qquad (3.32)$$
where $n$ denotes the surface normal and $\omega_i$ denotes the incoming angles of the light rays that hit the object at $n$. Here, $f_r$, $L_k$ and $V$ denote the bidirectional reflectance distribution function (BRDF), the EM in frame $k$ and the binary visibility function, respectively. Furthermore, we assume that these functions can be evaluated point-wise. In this problem, the BRDF describes the optical properties of the object and can be seen as a distribution over incoming angles.
To see how SMC can be used to solve (3.32), we define this as a sequence of normalisation problems, where the $k$th unnormalised target distribution is given by
$$\gamma_k(\omega_i) = f_r(\omega_i \to \omega_r) \, L_k(\omega_i) \, (\omega_i \cdot n),$$
where we neglect the visibility term for computational convenience. Calculating the outgoing radiance (without regarding visibility) $L_{r,k}(\omega_r)$ can therefore be seen as a normalisation,
$$L_{r,k}(\omega_r) = \int \gamma_k(\omega_i) \, d\omega_i,$$
for a sequence of EMs indexed by $k$. From this setup, we see that the desired particles correspond to incoming angles $\omega_i$ of the light that hits the object and contributes to the outgoing radiance. As previously discussed, the challenge is to select only a few important angles to take into account when solving the LTE. Otherwise, the problem becomes computationally infeasible.
In the filtering problem, this normalisation factor is the marginal likelihood of the problem, which can be computed using the particle weights. In a similar manner, we can compute $L_{r,k}(\omega_r)$, and the only difference compared with the APF in Algorithm 2 is the choice of propagation kernel. Here, we do not have an SSM to describe the dynamical behaviour of the EM between frames. Instead, we make use of an MCMC kernel to adapt the particles between EMs. Details of this approach are found in Kronander et al. (2014a) and Ghosh et al. (2006).
Another approach to solve (3.32) is to use the IS algorithm directly on $L_{r,k}(\omega_r)$ for each frame individually. This can be done with an algorithm similar to SIR, where we first sample from the EM and compute the particle weights using the BRDF. The resulting particles are obtained after a resampling step. This method is referred to as the bidirectional importance sampling (BIS) algorithm (Burke et al., 2005); a small sketch is given below. This approach is computationally cheap, but cannot make use of information from the previous frame when estimating $L_{r,k}(\omega_r)$ in the current frame $k$. As previously discussed, this problem can be solved with the SMC setup that we consider here. We conclude this section with Example 3.8, in which we compare these two approaches.
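A schematic sketch of the BIS idea is given below (our own simplification: the EM is treated as a discrete set of directions with an associated sampling distribution, and the BRDF and cosine terms are assumed to be evaluable point-wise, as stated above; visibility is neglected as in the text).

```python
import numpy as np

def bis_radiance(em_radiance, em_pdf, brdf, cos_term, N, rng):
    """Bidirectional importance sampling for one frame: sample incoming
    directions from the EM, weight by the BRDF and cosine term, and estimate
    the outgoing radiance; the resampling step gives the surviving angles."""
    M = len(em_radiance)
    idx = rng.choice(M, size=N, p=em_pdf)         # sample directions from the EM
    w = brdf[idx] * cos_term[idx] * em_radiance[idx] / em_pdf[idx]
    radiance = np.mean(w)                         # IS estimate of the radiance
    a = rng.choice(N, size=N, p=w / w.sum())      # resampling step of BIS
    return radiance, idx[a]                       # estimate and surviving angles
```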
3.8 Example: SMC for direct illumination
Consider a simple setup, where we would like to render a sphere given a sequence of HDR images. This is similar to the problem considered in Section 1.1.2, but simpler, since we do not consider multiple bounces. Hence, this problem is referred to as direct illumination. Here, we compare four different approaches to render the sphere in the image. The first and second methods are to make use of the IS algorithm to sample directly from the EM and the BRDF, respectively. Hence, we disregard the information in the BRDF and the EM, respectively. We also make use of the BIS algorithm and the SMC algorithm previously outlined.
In Figure 3.9, we compare the four different approaches at frame $k = 10$ in a sequence of EMs. The first three methods solve the problem for each frame individually, whereas the SMC approach makes use of the previous information. We
see that using the IS algorithm to sample directly from the EM does not perform well. The reason is that we consider a mirror sphere, where the BRDF is rather narrow, which means that only light from a few incoming angles contributes to the outgoing radiance. The remaining three methods give comparable results. Therefore, we conclude that there could be advantages to using the SMC algorithm for this application. See Kronander et al. (2014a) for a more detailed discussion and comparison of these approaches.
Figure 3.9: Mirror sphere, frame 10 with different sampling techniques: IS
the EM (upper left), IS the BRDF (upper right), BIS (lower left), and the
SMC algorithm (lower right).
4 Parameter inference using sampling methods
In this chapter, we consider some sampling based methods for ML and Bayesian
parameter inference in SSMs. We start by giving a small overview of some of the
current algorithms in the field. Later, we introduce and discuss the use of MCMC
(Robert and Casella, 2004) and particle MCMC (PMCMC) (Andrieu et al., 2010)
approaches to Bayesian parameter inference. Here, we also discuss the use of
Langevin and Hamiltonian dynamics to improve the efficiency of the method. This
material is later used in Paper A to construct a particle version of these MCMC
algorithms.
The second part of this chapter is devoted to discussing the BO algorithm (Jones,
2001; Boyle, 2007; Lizotte, 2008; Osborne, 2010) and how it can be used together
with GPs (Rasmussen and Williams, 2006) for ML parameter inference and input
design. This approach is used in Papers B and C for ML based parameter inference
and MAP based parameter inference, respectively.
4.1 Overview of computational methods for parameter inference
There are many different methods developed for parameter inference in SSMs using
both ML and Bayesian methods. Here, we review some of these methods to give
an overview of the area and discuss some of the more popular methods. For more
exhaustive accounts of different methods, see Douc et al. (2014), Kantas et al.
(2009), Cappé et al. (2007) and Cappé et al. (2005).
4.1.1 Maximum likelihood parameter inference
Most parameter inference problems in the ML framework cannot be solved by analytical calculations. Instead, popular approaches are based on optimisation and other iterative algorithms to solve the inference problem in Definition 2.2. From Chapter 3, we know that the APF can be used to calculate the log-likelihood and that the FL smoother can be used to estimate the score function (the gradient of the log-likelihood). Hence, we can make use of a gradient-based optimisation algorithm to maximise the log-likelihood and thereby estimate the parameters of the model. This method operates by an iterative application of
$$\hat{\theta}_{k+1} = \hat{\theta}_k + \epsilon_k \, \hat{\mathcal{S}}(\hat{\theta}_k), \qquad (4.1)$$
where $\hat{\theta}_k$ and $\epsilon_k$ denote the current estimate of the parameter vector and the step length, respectively. This approach is used by Poyiadjis et al. (2011) and Yildirim et al. (2013) for parameter inference in SV models with Gaussian returns and $\alpha$-stable returns, respectively. Furthermore, we can estimate the Hessian information by a particle smoother and make use of it in (4.1). This results in a Newton optimisation algorithm, which operates by an iterative application of
$$\hat{\theta}_{k+1} = \hat{\theta}_k + \epsilon_k \, \hat{\mathcal{J}}^{-1}(\hat{\theta}_k) \, \hat{\mathcal{S}}(\hat{\theta}_k). \qquad (4.2)$$
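A schematic sketch of the iteration (4.1) is given below (our own illustration: estimate_score stands for any noisy score estimator, e.g. the FL smoother based estimator from Section 3.4.2, and the decaying step-length schedule is a common stochastic approximation choice, not prescribed by the thesis).

```python
import numpy as np

def gradient_ascent_ml(estimate_score, theta0, step0=0.01, K=100):
    """Iterate (4.1): theta_{k+1} = theta_k + eps_k * S_hat(theta_k), where
    estimate_score(theta) returns a (noisy) estimate of the score function."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, K + 1):
        eps_k = step0 / k ** 0.6   # decaying step lengths
        theta = theta + eps_k * estimate_score(theta)
    return theta
```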
Another approach is to estimate the Hessian on-the-fly in a Quasi-Newton algorithm using finite difference approximation. A popular member of this class of
algorithms is the simultaneous perturbation stochastic approximation (SPSA) algorithm discussed in Spall (1987). The SPSA algorithm is used for inference in the
αSV model in Ehrlich et al. (2012).
The expectation maximisation algorithm (Dempster et al., 1977; McLachlan and
Krishnan, 2008) is another popular alternative for ML inference. In this method,
the latent states are seen as missing information and are estimated together with
the parameter vector of the model. The quantity required by the expectation maximisation algorithm is an additive functional that can be estimated using a particle
smoother. Many different versions of the expectation maximisation algorithm are
available in the literature using different particle smoothers. Some examples are
Del Moral et al. (2010) using a forward smoother, Olsson et al. (2008) using the
FL smoother, Lindsten (2013) using the conditional particle filter with ancestor
sampling and Schön et al. (2011) using the FFBSm.
Finally, we mention another approach based on Bayesian optimisation (Jones,
2001; Boyle, 2007; Lizotte, 2008), introduced in Papers B and C for ML parameter
inference in SSMs where the likelihood can be intractable. We return to this
approach in Section 4.4, where we introduce the method in detail.
4.1.2 Bayesian parameter inference
As for the ML parameter inference approach, most interesting Bayesian parameter
inference problems are analytically intractable. Some problems can be solved in
closed-form using conjugate priors as discussed in Section 2.4. However, in general we require approximate methods to solve expectation, marginalisation and
normalisation problems in Bayesian inference. Previously, we discussed the sampling based approach, in which we make use of statistical simulations and computers to approximate the posterior. Examples of sampling based approaches are MCMC methods and SMC methods. Some recent methods that make use of the latter are SMC-squared (SMC²) (Chopin et al., 2013) and particle learning (Carvalho et al., 2010). An alternative class of methods is based on approximate analytical computations, which often make use of iterated approximations of the posterior. Examples of analytical approximations are the Laplace approximation, the integrated nested Laplace approximation (INLA) (Rue et al., 2009), variational Bayes (VB) (Bishop, 2006) and expectation propagation (EP) (Minka, 2001).
In this thesis, we are primarily interested in sampling based inference using MCMC (Robert and Casella, 2004) for Bayesian parameter inference. MCMC is a family of algorithms that are all based on simulating a Markov chain with the target as its stationary distribution. After some burn-in, the chain reaches stationarity, and the samples obtained from the chain can be used as samples from the parameter posterior by the ergodic theorem discussed in Section 4.2.1. Here, we consider two instances of MCMC algorithms, the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970) and the Gibbs sampler (Geman and Geman, 1984). The Gibbs sampler can be seen as a special case of the more general MH algorithm. As such, it can only be used in models with a certain conditional structure of the joint parameter posterior. We return to the Gibbs sampler in Paper E for parameter inference in a non-Gaussian ARX model.
The MH algorithm is a general method for sampling from intractable distributions and is applied in many different fields. Some applications of the MH algorithm are found in biology (Wilkinson, 2011), system identification (Ninness and Henriksen, 2010), finance (Johannes and Polson, 2009) and statistics (Robert and Casella, 2004). A major problem with this algorithm is that it requires evaluations of the likelihood of the SSM. As discussed in Section 3.3.4, we can only obtain an unbiased estimator of this quantity, but it turns out that this is enough. As a result of the unbiasedness property, the resulting Markov chain keeps the target as its stationary distribution. This type of method is known as a pseudo-marginal algorithm (Andrieu and Roberts, 2009; Andrieu et al., 2010), and this family includes the particle MH (PMH) and particle Gibbs (PG) algorithms. This opens up for Bayesian parameter inference in SSMs. We return to discussing the PMH algorithm in Section 4.3 and propose refinements to it in Paper A.
4.2 Metropolis-Hastings
In this section, we discuss the use of the MH algorithm for sampling from the parameter posterior $p(\theta \,|\, y_{1:T})$. We give a short introduction to the MH algorithm, briefly discuss why it works, and mention some interesting extensions of the algorithm based on random walks on Riemann manifolds. Interested readers are referred to Robert and Casella (2004), Gelman et al. (2013) and Liu (2008) for
more detailed accounts of the algorithm and its general application.
The MH algorithm samples the target distribution $\pi(\theta) = p(\theta \,|\, y_{1:T})$ by simulating a carefully constructed Markov chain on the parameter space $\Theta$. The Markov chain is constructed in such a way that it admits the target as its unique stationary distribution. The algorithm consists of two steps: (i) a new parameter $\theta''$ is sampled from a proposal distribution $q(\theta'' \,|\, \theta')$ given the current state $\theta'$, and (ii) the current parameter is changed to $\theta''$ with acceptance probability $\alpha(\theta'', \theta')$, otherwise the chain remains at the current parameter. The acceptance probability is given by
$$\alpha(\theta'', \theta') = 1 \wedge \frac{\pi(\theta'') \, q(\theta' \,|\, \theta'')}{\pi(\theta') \, q(\theta'' \,|\, \theta')} = 1 \wedge \frac{p(\theta'') \, p_{\theta''}(y_{1:T}) \, q(\theta' \,|\, \theta'')}{p(\theta') \, p_{\theta'}(y_{1:T}) \, q(\theta'' \,|\, \theta')}, \qquad (4.3)$$
where $a \wedge b \triangleq \min\{a, b\}$ and $p_\theta(y_{1:T})$ denotes the likelihood. The resulting procedure is outlined in Algorithm 4 for parameter inference in models where we can compute the likelihood exactly. Here, we have used the form of the parameter posterior from (2.14), where the marginal likelihoods cancel in the acceptance probability. Therefore, the algorithm only requires that we can point-wise evaluate the unnormalised target distribution.
Note that the IS algorithm could be used to estimate the parameter posterior using e.g. a multivariate Gaussian distribution as the proposal, as in Example 3.1. However, if $\theta$ is high-dimensional, this approach would require many samples to accurately estimate the posterior. In the MH algorithm, we could instead construct a Markov chain that explores the posterior distribution by local moves, thus exploiting the previously accepted parameter. Hence, it focuses its attention on areas of the parameter space in which the posterior assigns a relatively large probability mass. This makes sampling of high-dimensional targets more efficient, as fewer iterations of the algorithm are required to obtain an accurate estimate.
To see this, assume that we have a symmetric proposal $q(\theta'' \,|\, \theta') = q(\theta' \,|\, \theta'')$, so that the ratio between the proposals cancels in (4.3). The remaining part is the ratio between the target evaluated at $\theta''$ and $\theta'$. If the target assumes a larger value at the proposed parameter $\theta''$ than at $\theta'$, the proposal is always accepted. Also, we accept a proposed parameter with some probability if it results in a small decrease of the target compared with the previous iteration. Consequently, the MH sampler both allows the algorithm to climb the posterior towards its mode and to explore the area surrounding the mode. Hence, the MH algorithm can possibly escape local extrema, which is a problem for many local optimisation algorithms used in numerical ML parameter inference.
The performance of the MH algorithm is dependent on the choice of the proposal distribution. A poor proposal leads to a poor exploration of the posterior distribution, which means that many iterations of the algorithm are required to obtain a good approximation of the posterior. Here, we discuss some common choices of proposals, which are needed for the discussion in Section 4.2.2 and in Paper A. The perhaps simplest example is the independent proposal, in which $q(\theta'' \,|\, \theta') = q(\theta'')$, i.e. a parameter is proposed without taking the previously accepted parameter into account.
Algorithm 4 Metropolis-Hastings (MH) for Bayesian inference in SSMs
Inputs: $M > 0$ (no. MCMC steps), $q(\cdot)$ (proposal) and $\theta_0$ (initial parameter).
Output: $\theta = \{\theta_1, \ldots, \theta_M\}$ (samples from the parameter posterior).
1: Initialise using $\theta_0$.
2: for $k = 1$ to $M$ do
3:   Sample $\theta'$ from the proposal $\theta' \sim q(\theta' \,|\, \theta_{k-1})$.
4:   Calculate the likelihood $p_{\theta'}(y_{1:T})$.
5:   Sample $\omega_k$ from $\mathcal{U}(0, 1)$.
6:   if $\omega_k < \alpha(\theta', \theta_{k-1})$ given by (4.3) then
7:     $\theta_k \leftarrow \theta'$. {Accept the parameter}
8:   else
9:     $\theta_k \leftarrow \theta_{k-1}$. {Reject the parameter}
10:  end if
11: end for
This proposal cannot exploit the previously accepted parameters, and this could be a difficulty in inference problems where the posterior is quite complicated. However, if the proposal is similar to the posterior, this leads to a good mixing of the Markov chain, as the proposed parameters are uncorrelated. This insight has been used in Giordani and Kohn (2010) to construct a proposal distribution based on a mixture of Gaussian distributions, which results in an efficient MH sampler in some applications.
Another popular choice is the (symmetric) Gaussian random walk (RW) with some step length $\epsilon$, which results from selecting $q(\theta'' \,|\, \theta') = \mathcal{N}(\theta''; \theta', \epsilon^2)$. The choice of the step length in the RW proposal determines the mixing of the Markov chain and thereby the efficiency of the exploration of the parameter posterior. If the step length is too small, the exploration is poor, since it takes a long time to explore the target. If it is too large, we seldom accept the proposed parameter, as the difference in the values that the posterior assumes is too large, resulting in a small acceptance probability. This is often referred to as the mixing of the Markov chain (c.f. Example 3.5). That is, we would like to balance the mixing of the Markov chain to get a reasonable exploration of the posterior and a reasonable acceptance rate. This problem is illustrated in Example 4.1, where we make use of Algorithm 4 to infer the parameters in an LGSS model for which we can compute the likelihood exactly. We return to the choice of proposal and the mixing of the Markov chain in the following sections.
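To make Algorithm 4 with the Gaussian RW proposal concrete, the following sketch (our own illustration; loglik and logprior stand for the exact log-likelihood, e.g. computed by a Kalman filter as in Example 4.1, and the log-prior) implements the sampler for a scalar parameter. Working with log-densities makes the accept/reject step numerically stable, and a log-prior returning minus infinity outside the support enforces e.g. the uniform prior over a stability region.

```python
import numpy as np

def rw_metropolis_hastings(loglik, logprior, theta0, eps, M, seed=0):
    """Algorithm 4 with the Gaussian RW proposal q(theta''|theta') =
    N(theta''; theta', eps^2); accept/reject via the logarithm of (4.3)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(M + 1)
    theta[0] = theta0
    lp = loglik(theta0) + logprior(theta0)
    for k in range(1, M + 1):
        prop = theta[k - 1] + eps * rng.normal()   # symmetric RW proposal
        lp_prop = loglik(prop) + logprior(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept with probability (4.3)
            theta[k], lp = prop, lp_prop
        else:
            theta[k] = theta[k - 1]                # reject, keep current value
    return theta
```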
4.1 Example: Parameter inference in the LGSS model
Consider the parameter inference problem in the LGSS model (2.2) with the unknown parameter $\theta = \phi$. We use the parameter $\theta^\star = 0.5$ together with $\{\sigma_v, \sigma_e\} = \{1.0, 0.1\}$ to generate a realisation from the model with $T = 250$ and known initial state $x_0 = 0$. Here, we use the MH algorithm defined in Algorithm 4 together with the Kalman filter for calculating the likelihood. A Gaussian random walk proposal is used with zero mean and variance $\epsilon^2$. For simplicity, we use a uniform prior over $\phi \in (-1, 1) \subset \mathbb{R}$ to ensure that the system is stable at all times.
Figure 4.1: The traces (left) of the first 500 iterations of $\phi$ in the LGSS model (2.2) using the RW proposal with three different step lengths: $\epsilon = 0.01$ (upper), $\epsilon = 0.10$ (middle) and $\epsilon = 1.00$ (lower). The estimated autocorrelation functions (right) using 9 000 iterations, corresponding to each step length in the RW proposal. The dotted lines show the true parameters from which the data was generated.
Figure 4.2: The trace of the first 500 iterations of φ (upper left) and σv (lower
left) in the LGSS model in Example 4.1. The estimated parameter posteriors
(right) obtained from 9 000 iterations of the MH algorithm. The dotted lines
show the true parameters from which the data was generated.
In Figure 4.1, we present the trace plot of the Markov chain together with the
estimated autocorrelation function from M = 10 000 iterations (discarding the first
1 000 iterations as burn-in) for the RW proposal with three different step lengths.
We see that the mixing is rather poor for the smallest step length ε = 0.01, as this
results in a poor exploration of the posterior and therefore a large correlation in
the Markov chain. This is also the case if the step length is too large, as illustrated
by the proposal with step length ε = 1.00. Here, the large step length results in
a small acceptance probability and therefore a large correlation in the Markov
chain.
We now add the parameter σv to our inference problem and would therefore like
to infer the parameters {φ, σv} using the same setup as before. Here, we also
add a uniform prior over σv ∈ R+, as it corresponds to a standard deviation.
We use the step length ε = 0.1 with the RW proposal and simulate the Markov
chain for M = 10 000 iterations (discarding the first 1 000 iterations as burn-in).
In Figure 4.2, we present the trace of each parameter and the resulting posterior
density estimate. We see that the Markov chain mixes well in both parameters
and that the posterior estimates are rather close to the true parameters.
4.2.1 Statistical properties of the MH algorithm
In this section, we discuss some of the statistical results that the MH algorithm
relies upon and briefly mention their underlying assumptions. For more information about the properties of the MH algorithm and MCMC algorithms in general,
see e.g. Tierney (1994), Robert and Casella (2004) and Meyn and Tweedie (2009).
The core result that is used in the MH algorithm is the ergodic theorem (Tierney,
1994; Robert and Casella, 2004). Given a well-behaved test function $\varphi$, we can
construct a Monte Carlo estimator,

$$\widehat{\varphi}_{\mathrm{MH}} = \frac{1}{M} \sum_{i=1}^{M} \varphi(\theta_i),$$

where $\{\theta_i\}_{i=1}^{M}$ denotes the samples obtained from the MH algorithm. If the Markov
chain is ergodic, then by the ergodic theorem this estimator is strongly consistent,
i.e.

$$\widehat{\varphi}_{\mathrm{MH}} \xrightarrow{\mathrm{a.s.}} \mathbb{E}\big[\varphi(\theta)\big], \qquad M \rightarrow \infty.$$

Note that this property does not follow directly from the SLLN, as the samples
obtained from the posterior are not IID; the proposed parameters are correlated.
Hence, the ergodic theorem can be seen as the SLLN for correlated samples. Also,
a CLT for the error in the estimator (under some conditions) is given by

$$\sqrt{M}\,\big(\widehat{\varphi}_{\mathrm{MH}} - \mathbb{E}\big[\varphi(\theta)\big]\big) \xrightarrow{d} \mathcal{N}\big(0, \sigma_\varphi^2\big),$$
where $\sigma_\varphi^2$ denotes the variance of the estimator, which depends on the mixing properties of the Markov chain. In fact, this variance is proportional to the
integrated autocorrelation time (IACT), which is given by

$$\mathrm{IACT}(\theta_{1:M}) = 1 + 2 \sum_{k=1}^{\infty} \rho_k(\theta_{1:M}).$$

The IACT cannot be computed analytically in practice, as we do not know the true
autocorrelation function for many models. Therefore, we often approximate it by

$$\widehat{\mathrm{IACT}}(\theta_{1:M}) = 1 + 2 \sum_{k=1}^{K} \widehat{\rho}_k(\theta_{1:M}),$$

where $\widehat{\rho}_k(\,\cdot\,)$ denotes the empirical autocorrelation function at lag $k$ and $K$ denotes
some maximum lag for which to compute the IACT. This value can be selected
as a fixed value or as the first index at which the ACF becomes statistically
insignificant, i.e. the first index $K$ such that $|\widehat{\rho}_K(\theta_{1:M})| < 2/\sqrt{M}$. Another related
measure is the effective sample size (ESS), defined as

$$\mathrm{ESS}(\theta_{1:M}) = \frac{M}{\mathrm{IACT}(\theta_{1:M})} = \frac{M}{1 + 2 \sum_{k=1}^{\infty} \rho_k(\theta_{1:M})}, \qquad (4.4)$$
which can be approximated in the same manner as the IACT. The IACT and ESS
can be interpreted as the number of iterations between each independent
sample and the number of independent samples obtained from the posterior, respectively. Hence, we would like to minimise the IACT and maximise the ESS to
obtain optimal mixing in the Markov chain and many uncorrelated samples from
the posterior. We illustrate the ESS values for parameter inference in an LGSS
model in Example 4.2.
4.2 Example: ESS values for inference in the LGSS model
We return to Example 4.1 and calculate the ESS values for the three different
Markov chains considered in Figure 4.1. The ESS values (4.4) are {22, 1292, 353}
when calculated using the adaptive method for the RW proposal with each of the
three step lengths, respectively. These numbers validate the previous discussion
that the proposal with ε = 0.10 is preferable for this problem, as it maximises
the number of uncorrelated samples from the posterior.
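These quantities are straightforward to estimate from a stored chain. The following Python sketch, an illustrative helper under the conventions above and not code from the included papers, computes the IACT and ESS by truncating the empirical ACF at the first statistically insignificant lag.

import numpy as np

def estimate_ess(theta, max_lag=None):
    # Estimate the IACT and the ESS (4.4) from the samples of a Markov chain.
    M = len(theta)
    if max_lag is None:
        max_lag = M // 10
    x = theta - np.mean(theta)
    # Empirical autocorrelation function at lags 0, ..., max_lag.
    acf = np.array([np.sum(x[:M - k] * x[k:]) for k in range(max_lag + 1)])
    acf = acf / acf[0]
    # Truncate at the first lag K where |rho_K| < 2 / sqrt(M).
    insignificant = np.where(np.abs(acf[1:]) < 2.0 / np.sqrt(M))[0]
    K = insignificant[0] + 1 if insignificant.size > 0 else max_lag
    iact = 1.0 + 2.0 * np.sum(acf[1:K + 1])
    return iact, M / iact

# Usage on a synthetic AR(1) chain that mimics correlated MCMC output;
# the true IACT of this chain is (1 + 0.9) / (1 - 0.9) = 19.
rng = np.random.default_rng(0)
chain = np.zeros(10000)
for t in range(1, 10000):
    chain[t] = 0.9 * chain[t - 1] + rng.normal()
iact, ess = estimate_ess(chain)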
The consistency result and the CLT rely on the ergodic theorem, which assumes
that the Markov chain is irreducible and aperiodic. Irreducibility means that we
should be able to get from any state to any other state in the state space. Aperiodicity
means that the Markov chain does not return to a state at regular intervals, i.e.
there are no loops that the chain can get stuck in. These requirements need to be checked for
each MH algorithm to validate its assumptions. However, in practice these
assumptions often hold, at least for unimodal parameter posteriors.
Another property that is important for a Markov chain used in MCMC is reversibility.
This property holds if the chain satisfies the detailed balance equation,

$$\pi(\theta'') K(\theta'', \theta') = \pi(\theta') K(\theta', \theta''),$$

where $K(\theta'', \theta')$ denotes the Markov transition kernel. From this, we can write

$$\int \pi(\theta'') K(\theta'', \theta')\, \mathrm{d}\theta'' = \int \pi(\theta') K(\theta', \theta'')\, \mathrm{d}\theta'' = \pi(\theta') \int K(\theta', \theta'')\, \mathrm{d}\theta'' = \pi(\theta'),$$

which shows that $\pi$ is an invariant distribution of $K$, i.e. the kernel admits $\pi(\theta)$ as its stationary distribution. To show that the MH algorithm satisfies the detailed balance
equation, we follow Liu (2008) and write the transition kernel of the Markov chain
as

$$K(\theta'', \theta') = q(\theta'', \theta') \min\left\{1, \frac{\pi(\theta')\, q(\theta', \theta'')}{\pi(\theta'')\, q(\theta'', \theta')}\right\}, \qquad \theta'' \neq \theta',$$

which gives

$$\pi(\theta'') K(\theta'', \theta') = \min\big\{\pi(\theta'')\, q(\theta'', \theta'),\ \pi(\theta')\, q(\theta', \theta'')\big\},$$

a symmetric function in $\theta''$ and $\theta'$. Hence, detailed balance is fulfilled and the Markov chain obtained from the MH algorithm
is reversible.
4.2.2 Proposals using Langevin and Hamiltonian dynamics
We have previously discussed two different proposals for the MH algorithm and
noted that these can perform better than the IS algorithm when the target is high
dimensional. It turns out that the RW proposal in the MH algorithm still scales
rather poorly as the dimension d of the parameter space increases, see Roberts et al.
(1997). That is, the mixing of the Markov chain decreases and more iterations are
required to maintain the number of independent samples from the target. This is
because the random walk does not explore the parameter space well when
the dimension increases.
A modification of the random walk proposal that can increase the mixing of the
Markov chain is to add a drift term that is proportional to the gradient of the log-target. This leads to a proposal which is the noisy equivalent of (4.1). In statistics,
the resulting algorithm is known as the Metropolis adjusted Langevin algorithm
(MALA) (Roberts and Rosenthal, 1998; Neal, 2010), where the proposal is said
to follow Langevin dynamics. This means that the proposal can be seen as a
discrete version of a continuous time Langevin diffusion process (Øksendal, 2010;
Kloeden and Platen, 1992). The Langevin diffusion with stationary distribution
p(θ|y1:T) is given by the stochastic differential equation,

$$\mathrm{d}\theta(\tau) = \frac{1}{2} \Big[\nabla \log p(\theta \mid y_{1:T})\Big]_{\theta = \theta(\tau)}\, \mathrm{d}\tau + \mathrm{d}B(\tau),$$

where $B(\tau)$ denotes a Brownian motion. To obtain samples from the parameter
posterior, we can simulate the Langevin diffusion to stationarity. The drift is useful
in the proposal, as it guides the process towards the mode of the posterior.
The discrete time proposal used in MALA follows from a first order Euler discretisation
of the Langevin diffusion above as

$$q(\theta'' \mid \theta') = \mathcal{N}\left(\theta'';\ \theta' + \frac{\varepsilon^2}{2} S(\theta'),\ \varepsilon^2 I_d\right),$$

where $\varepsilon$ denotes the step length of the discretisation and $S(\theta')$ the gradient of the log-posterior.
The proposal can also be derived using a Laplace approximation of the log-posterior,
as discussed in Paper A. Another version of this algorithm is the manifold MALA
(mMALA) (Neal, 2010; Girolami and Calderhead, 2011), which also includes the
Hessian information of the log-posterior in analogy with Newton algorithms (c.f.
(4.2)). This proposal can be derived similarly to the MALA proposal and has the
form

$$q(\theta'' \mid \theta') = \mathcal{N}\left(\theta'';\ \theta' + \frac{\varepsilon^2}{2} \mathcal{J}^{-1}(\theta') S(\theta'),\ \varepsilon^2 \mathcal{J}^{-1}(\theta')\right),$$
where we include the observed information matrix J(θ) in the proposal. We
make use of the MALA and mMALA proposals in Paper A to construct particle
versions of the algorithms for parameter inference in SSMs. Alternative algorithms
that solve the same problem are proposed by Girolami and Calderhead (2011),
where the MALA and mMALA are used for parameter inference in the SV model.
In this thesis, we adopt an optimisation mindset and refer to the MALA and
mMALA algorithms as first order and second order proposals, respectively. The
names of the proposals refer to the use of first order information (the gradient) and
second order information (the Hessian) in the proposals. In the optimisation
literature, this corresponds to the first order gradient-based search method and
the second order Newton method, respectively. The corresponding MH algorithms
are therefore called MH1 and MH2, respectively.
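To make the first order proposal concrete, the following Python sketch implements one MH1 (MALA) step for a generic log-target with a user-supplied gradient. It is a minimal illustration, not the particle version developed in Paper A.

import numpy as np

def mala_step(theta, log_target, grad_log_target, eps, rng):
    # One MALA (MH1) step; the drift follows the gradient of the log-target.
    d = theta.shape[0]
    prop = theta + 0.5 * eps**2 * grad_log_target(theta) + eps * rng.normal(size=d)

    def log_q(to, frm):
        # Log-density (up to a constant) of the asymmetric Gaussian proposal.
        mean = frm + 0.5 * eps**2 * grad_log_target(frm)
        return -0.5 * np.sum((to - mean)**2) / eps**2

    log_alpha = (log_target(prop) - log_target(theta)
                 + log_q(theta, prop) - log_q(prop, theta))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True   # accept
    return theta, False     # reject

# Usage: sample from a bivariate standard Gaussian target.
rng = np.random.default_rng(1)
theta = np.zeros(2)
for _ in range(1000):
    theta, accepted = mala_step(theta, lambda t: -0.5 * np.sum(t**2),
                                lambda t: -t, 0.9, rng)

Note that the asymmetric proposal density must enter the acceptance probability, in contrast to the symmetric RW proposal where the q-ratio cancels.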
Another perspective on the second order proposal is that the added gradient and
Hessian information about the log-target is used to construct a random walk on
a Riemann manifold (Livingstone and Girolami, 2014; Girolami and Calderhead,
2011). To this end, we require that the negative Hessian is a PD matrix, which is
not always the case when we use the observed information matrix as the second
order information. In Girolami and Calderhead (2011), it is proposed that the expected information matrix should be used instead, but this is computationally costly
to estimate for an SSM. Therefore, we make use of methods from optimisation (Nocedal and Wright, 2006) to make the observed information matrix PD if necessary.
This is done by regularisation, i.e. adding an appropriate diagonal matrix to make
the negative eigenvalues positive, see Paper A for more information.
Another related method is called manifold Hamiltonian MCMC (mHMC) and it
can improve the mixing of the Markov chain further. This method originates from
physics and is one instance of the hybrid MC (Duane et al., 1987) algorithms, which
are used in statistical physics to simulate from high dimensional targets. Here, we
briefly discuss their use in a statistical setting for parameter inference. Interested
readers are referred to Liu (2008), Neal (2010) and Girolami and Calderhead (2011)
for more information. The mHMC algorithm is based on simulating a physical
system with the Hamiltonian (the total energy),

$$H(\theta, p) = U(\theta) + K(p), \qquad (4.5)$$

where $p \sim \mathcal{N}(0, \mathcal{J}^{-1}(\theta))$ denotes a random fictitious momentum for each parameter. Here, the potential energy function and the kinetic energy function are given
by $U(\theta) = -\log \pi(\theta)$ and $K(p) = p^\top \mathcal{J}^{-1}(\theta)\, p / 2$, respectively. The proposal simulates this Hamiltonian system for a number of steps $L$ using the so-called leap-frog
algorithm (Neal, 2010) with some step size $\varepsilon$. The result of this procedure
is the proposed parameter and the resulting momentum $\{\theta'', p''\}$. This pair is
accepted/rejected using a similar procedure as in the MH algorithm. The acceptance probability compares the Hamiltonian of the system at the last accepted
parameter and at the proposed parameter by

$$\alpha(\theta'', \theta') = 1 \wedge \exp\big\{H(\theta', p') - H(\theta'', p'')\big\}. \qquad (4.6)$$
As the accept/reject decision is delayed for L steps, this proposal allows the
chain to move a larger distance between the iterations of the algorithm. This
increases the mixing of the Markov chain and also allows the Markov chain
to visit isolated modes of the posterior. This leads to a better exploration of
the posterior as well as more effective samples. The mHMC algorithm is used in
many applications and is currently a popular algorithm in statistics and machine
learning, see e.g. Chen et al. (2014), Beam et al. (2014) and Betancourt and
Girolami (2013). In Girolami and Calderhead (2011), the authors make use of the
mHMC algorithm for parameter inference in an SV model with impressive results.
Adapting the mHMC algorithm to the PMCMC framework is therefore an exciting
opportunity for future research.
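A minimal Python sketch of the proposal mechanism is given below, assuming an identity mass matrix in place of J(θ) (i.e. plain HMC rather than mHMC) and user-supplied log-target and gradient functions.

import numpy as np

def hmc_step(theta, log_target, grad_log_target, eps, L, rng):
    # One HMC step; mHMC would replace the identity mass matrix by J(theta).
    p0 = rng.normal(size=theta.shape[0])   # random fictitious momentum
    q, p = theta.copy(), p0.copy()

    # Leap-frog integration of the Hamiltonian dynamics for L steps.
    p = p + 0.5 * eps * grad_log_target(q)
    for step in range(L):
        q = q + eps * p
        if step < L - 1:
            p = p + eps * grad_log_target(q)
    p = p + 0.5 * eps * grad_log_target(q)

    # H(theta, p) = U(theta) + K(p) with U = -log pi and K = p'p / 2.
    h_old = -log_target(theta) + 0.5 * np.dot(p0, p0)
    h_new = -log_target(q) + 0.5 * np.dot(p, p)
    if np.log(rng.uniform()) < h_old - h_new:
        return q, True
    return theta, False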
We conclude this section with a comparison between the three MCMC algorithms
previously discussed. In Example 4.3, we replicate and extend an illustration
from Neal (2010) to compare the different methods when sampling from a high-dimensional target distribution.
4.3 Example: Parameter inference in a 100-dimensional Gaussian dist.
Consider the problem of sampling from a non-isotropic 100-dimensional Gaussian
distribution with zero mean and covariance matrix Σ = diag(1.00, 0.99, . . . , 0.01).
In this problem, we use random step sizes where ε_RW ∼ U(0.018, 0.026) for MH-RW
and ε_HMC ∼ U(0.010, 0.016) for HMC with L = 150. These values follow the suggestions by Neal (2010). For MALA, we use the step length ε_LMC ∼ U(0.008, 0.013),
which results in an acceptance rate of about 60%. The HMC algorithm is executed
for 1 000 iterations and the RW and MALA algorithms are executed for 150 000
iterations, but only every 150th iteration is kept to make a fair comparison
with the HMC. The resulting acceptance probabilities are 0.26, 0.60 and 0.88 for
each proposal, respectively.
The trace plots and ACF for the first coordinate (with standard deviation 1)
are presented in Figure 4.3. The estimated mean and standard deviations of the
target distribution are presented in Figure 4.4.
Figure 4.3: The trace of x1 (left) and the corresponding estimated ACF (right)
for the RW-MH, MALA and HMC algorithms.
Figure 4.4: The estimated parameters obtained for each coordinate from the
RW-MH, MALA and HMC algorithms. The estimates of the mean vector
(left) and the diagonal of the covariance matrix (right) are presented for each
coordinate. The dashed lines indicate the true parameter values for each
coordinate.
The HMC algorithm gives almost independent samples from the target. The ACF falls off quicker for the MALA
compared with the RW-MH, which indicates a more efficient exploration of the
posterior. However, the estimates of the covariance matrix are biased for the
MALA, which could be a result of the Markov chain getting stuck somewhere
in the parameter space. We conclude that there is a large gain in using the HMC
algorithm for inference in models with a high-dimensional parameter vector.
4.3 Particle Metropolis-Hastings
As previously discussed in Chapter 2, the likelihood of an SSM is analytically
intractable and therefore we cannot make use of the MH algorithm directly for
parameter inference. In Section 3.3.4, we reviewed how to construct an unbiased
estimator of the likelihood based on the APF. A natural solution to this problem
is therefore to replace the intractable quantities in the acceptance probability with unbiased estimates.
This idea was first used by Beaumont (2003), where the author replaces an analytically
intractable likelihood with an unbiased estimate in a genetics application of the
MH algorithm. The first use of this idea for parameter inference in nonlinear SSMs
is found in Fernández-Villaverde and Rubio-Ramírez (2007), where the intractable
likelihood is replaced with an estimate from the APF. Subsequent work by Andrieu and Roberts (2009) and Andrieu et al. (2010) analyses the resulting PMH
algorithm and proves that this is a valid approach to the problem. To see
why this method works, we review the derivation of the PMH algorithm following
the presentation in Flury and Shephard (2011).
Consider the problem of using the MH algorithm to sample from the parameter posterior (2.14), i.e. π(θ) = p(θ|y1:T). This implies that the acceptance probability (4.3)
depends explicitly on the intractable likelihood pθ(y1:T), preventing direct application of the MH algorithm to this problem. Instead, assume that there exists an
unbiased, non-negative estimator of the likelihood p̂θ(y1:T|u), i.e.

$$\mathbb{E}_{u \mid \theta}\big[\widehat{p}_\theta(y_{1:T} \mid u)\big] = \int \widehat{p}_\theta(y_{1:T} \mid u)\, m_\theta(u)\, \mathrm{d}u = p_\theta(y_{1:T}), \qquad (4.7)$$

where $u \in \mathsf{U}$ denotes the multivariate random variable (vector) used to construct
this estimator. Here, $m_\theta(u)$ denotes the probability density of $u$ on $\mathsf{U}$. When the
APF is used to construct the estimator of the likelihood, the random variable $u$ consists of
the particles and their ancestors $\{x_{0:T}^{(i)}, a_{1:T}^{(i)}\}_{i=1}^{N}$.
The PMH algorithm can then be seen as a standard MH algorithm operating on a
non-standard extended space $\Theta \times \mathsf{U}$, with the extended target given by

$$\pi(\theta, u \mid y_{1:T}) = \frac{\widehat{p}_\theta(y_{1:T} \mid u)\, m_\theta(u)\, p(\theta)}{p(y_{1:T})} = \frac{\widehat{p}_\theta(y_{1:T} \mid u)\, m_\theta(u)}{p_\theta(y_{1:T})}\, p(\theta \mid y_{1:T}),$$
and the proposal distribution mθ''(u'') q(θ''|θ').
Algorithm 5 Particle Metropolis-Hastings (PMH) for Bayesian inference in SSMs
Inputs: Algorithm 2, M > 0 (no. MCMC steps), q(θ''|θ') (proposal) and θ0
(initial parameters).
Output: θ = {θ1, . . . , θM} (samples from the parameter posterior).
1: Initialise using θ0.
2: for k = 1 to M do
3: Sample θ' from the proposal θ' ∼ q(θ'|θk−1).
4: Estimate the likelihood p̂θ'(y1:T) using Algorithm 2 and (3.19).
5: Sample ωk from U(0, 1).
6: if ωk < α(θ', θk−1) given by (4.8) then
7: {θk, p̂θk(y1:T)} ← {θ', p̂θ'(y1:T)}. {Accept the parameter}
8: else
9: {θk, p̂θk(y1:T)} ← {θk−1, p̂θk−1(y1:T)}. {Reject the parameter}
10: end if
11: end for
As a result, we can recover the parameter posterior by marginalisation of the extended target,

$$\int \pi(\theta, u \mid y_{1:T})\, \mathrm{d}u = \frac{p(\theta \mid y_{1:T})}{p_\theta(y_{1:T})} \underbrace{\int \widehat{p}_\theta(y_{1:T} \mid u)\, m_\theta(u)\, \mathrm{d}u}_{=\, p_\theta(y_{1:T})} = p(\theta \mid y_{1:T}),$$
using the unbiasedness property (4.7) of the likelihood estimator. Samples from the parameter posterior can therefore be obtained as a byproduct of simulating from
$\pi(\theta, u \mid y_{1:T})$. By selecting the proposal distribution as $q(\theta'' \mid \theta', u')$, the acceptance
probability is given by

$$\alpha(\theta'', \theta') = 1 \wedge \frac{\widehat{p}_{\theta''}(y_{1:T} \mid u'')\, p(\theta'')\, q(\theta' \mid \theta'', u'')}{\widehat{p}_{\theta'}(y_{1:T} \mid u')\, p(\theta')\, q(\theta'' \mid \theta', u')}. \qquad (4.8)$$
Note that the acceptance probability is the same as for the MH algorithm, but
with the intractable likelihood replaced by an unbiased estimator and with an
extended proposal. As previously discussed, the random variable u contains the
entire particle system generated by the APF algorithm. From Section 3.4.2, we
know that this information can be used in combination with the FL smoother
to estimate the score and information matrix, which can be used to construct
particle versions of the MH1 and MH2 algorithms. This is the main idea behind
the proposed PMH1 and PMH2 algorithms in Paper A, to which we refer interested
readers for more information.
It turns out that the acceptance rate of the PMH algorithm is closely connected
with the number of particles used to estimate the log-likelihood. If N is too small,
then the variance of the log-likelihood estimates is large and the Markov chain often gets
stuck, resulting in a low acceptance rate. We also know
that N determines the computational cost of the APF algorithm. Therefore,
we have a trade-off between the number of MCMC iterations and the number of
particles in the APF, where we would like to minimise the total computational cost
of the PMH algorithm. This problem is analysed and discussed by Doucet et al.
(2012), Pitt et al. (2010) and Pitt et al. (2012). From this work, it is recommended
to use a value of N such that the variance of the log-likelihood estimates is between
0.25 and 2.25. Consequently, we can determine the optimal number of particles
by a pilot run.
Two other common versions of the PMH algorithm are the particle independent MH
(PIMH) algorithm and the particle marginal MH (PMMH or PMH0) algorithm.
The Gaussian versions of these proposals are obtained by using q(θ'') = N(θ''; 0, ε²)
and q(θ''|θ') = N(θ''; θ', ε²), respectively. The resulting general version of the PMH
algorithm, which incorporates PMH0 and PIMH as special cases, is presented in
Algorithm 5. The full procedure for PMH1 and PMH2 is found in Algorithm 1
in Paper A.
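To make the structure of Algorithm 5 concrete, the following Python sketch implements the PMH0 special case. The routine estimate_loglik is a hypothetical stand-in for running the APF in Algorithm 2 and returning an estimate of the log-likelihood; the sketch is an illustration, not the implementation used in the papers.

import numpy as np

def pmh0(estimate_loglik, log_prior, theta0, eps, M, rng):
    # PMH0 with a Gaussian RW proposal, following Algorithm 5.
    d = theta0.shape[0]
    trace = np.zeros((M, d))
    theta, loglik = theta0, estimate_loglik(theta0)
    for k in range(M):
        prop = theta + eps * rng.normal(size=d)
        if np.isfinite(log_prior(prop)):
            loglik_prop = estimate_loglik(prop)
            log_alpha = (loglik_prop + log_prior(prop)
                         - loglik - log_prior(theta))
            if np.log(rng.uniform()) < log_alpha:
                # Accept and store the likelihood estimate; it is reused
                # (not re-estimated) until the next accepted parameter.
                theta, loglik = prop, loglik_prop
        trace[k] = theta
    return trace

Note that the stored log-likelihood estimate is carried along with the parameter; re-estimating it for the current parameter at every iteration would change the extended target that the chain admits as its stationary distribution.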
4.4 Example: PMH0 for parameter inference in the GARCH(1,1) model
Consider the parameter inference problem for the GARCH(1,1) model (2.3) using
the NASDAQ OMX Stockholm 30 Index data from Section 2.2.2. We make use of
the PMH0 algorithm and the RW proposal with the step length ε = 0.005 for all
elements in the parameter vector. We run the algorithm for M = 20 000 iterations
(discarding the first 5 000 as burn-in) with the faPF algorithm and N = 3 000
particles for estimating the likelihood.
The resulting acceptance rate is 0.06 (after burn-in) with ESS {39, 21, 21, 11} for
the four parameters, respectively. Note that the poor ESS values are due to the fact that we,
for simplicity, make use of the same step length for all parameters. In Figure 4.5, we
present the trace plots and posterior estimates obtained from the run. We see that
the mixing is rather poor and longer runs with smaller step lengths are needed.
Alternatively, we could make use of PMH1 or PMH2 from Paper A to improve the mixing.
The posterior means are θ̂PMH = {0.0088, 0.14, 0.86, 0.63}, which can be used as
point estimates of the parameters in the model.
4.4 Bayesian optimisation
BO (Jones, 2001; Boyle, 2007; Lizotte, 2008; Osborne, 2010) is a popular (global)
derivative-free optimisation method, which is currently studied extensively in the
machine learning community. Here, we consider the use of BO to solve

$$x_{\max} = \underset{x \in \mathsf{B}}{\operatorname{argmax}}\, f(x), \qquad (4.9)$$

where $\mathsf{B} \subset \mathbb{R}^d$ denotes some compact set. Here, we assume that we cannot directly
evaluate the real-valued function $f(x)$, but can obtain noisy estimates (samples)
modelled as

$$y_k = f(x_k) + z_k, \qquad z_k \sim \mathcal{N}(0, \sigma_z^2), \qquad (4.10)$$
Figure 4.5: The trace plots (left) and the corresponding posterior estimates
(right) of the GARCH(1,1) model using the NASDAQ OMX Stockholm 30 Index
data from Section 2.2.2. The estimates are computed using the PMH0 algorithm
with 20 000 iterations, discarding the first 5 000 as burn-in.
where zk denotes zero-mean Gaussian noise with unknown variance σz².
The BO algorithm is therefore useful when we can only estimate the value of
the objective function using some simulation based algorithm. It also turns out
that the BO algorithm requires fewer estimates of the objective function than many other
optimisation algorithms (Brochu et al., 2010). As a consequence, BO is useful
when the noisy estimates of the objective function are computationally costly to
obtain. In the following, we make use of BO for ML based parameter inference
and for input design in nonlinear SSMs. In these cases, the objective function
corresponds to the log-likelihood and the logarithm of the determinant of the expected information matrix, respectively. Both of these quantities are
analytically intractable, but we can obtain noisy estimates of the objective function by the use of computationally costly particle filtering and smoothing. Hence,
these are two applications that fit the BO algorithm well.
The BO algorithm operates by constructing a surrogate function, also called a
response surface, that emulates the objective function. In BO, this surrogate function is modelled as a probabilistic function with some prior form selected by the
user. The name refers to the fact that the samples from the objective function are combined
with the prior to update the model using Bayes' theorem. Using the
updated posterior, we can predict the value of the objective function anywhere in
the space of interest and also obtain an uncertainty for the predicted value.
Using the predictive distribution, the algorithm can analyse where the optimum of
the objective function could be located and focus the sampling on that area. The
algorithm can also choose to explore areas where there are large uncertainties in
the predicted value of the objective function. We refer to these two situations as
exploitation and exploration, respectively. In the following, we discuss this part
of the algorithm in more detail together with the acquisition rules that are used to make
the decisions regarding exploitation and exploration. To conclude this overview of
the BO algorithm, we present the three steps that are carried out during the kth
iteration:
(i) Sample the objective function at xk to obtain yk.
(ii) Use the collected data Dk = {xi, yi} for i = 1, . . . , k to construct a surrogate function.
(iii) Use an acquisition rule and the surrogate function to select xk+1, i.e. where
to sample the objective function in the next iteration of the algorithm.
We now proceed by discussing Steps (ii) and (iii) in more detail. Step (i) depends
on the specific optimisation problem that we would like to solve; for the two
previously discussed applications, it corresponds to the APF and the FL smoother in
Algorithms 2 and 3, respectively.
In this thesis, we make use of GPs (Rasmussen and Williams, 2006) as the surrogate function. Therefore, we devote the following section to introducing the GP,
discussing its structure and how to combine it with the obtained samples from
the objective function. Then, we discuss some different acquisition functions and
compare their properties. Finally, we combine the GP with the BO algorithm to
obtain the Gaussian process optimisation (GPO) algorithm. We conclude this section by discussing some applications of GPO in connection with SSMs. For more
information regarding BO, see Lizotte (2008), Boyle (2007), Brochu et al. (2010)
and Snoek et al. (2012).
4.4.1 Gaussian processes as surrogate functions
GPs are an instance of Bayesian nonparametric models and have their origins in
kriging (Cressie, 1993; Matheron, 1963) methods from spatial statistics. An application of kriging is to interpolate between elevation measurements sampled in
some terrain to build a map of the elevation in an area. The underlying assumption is that the elevation varies smoothly between the sampled points. This is
the property of the GP that we would like to use for interpolating the objective
function between the points at which we sample it.
A GP (Rasmussen and Williams, 2006) can be seen as a generalisation of the multivariate Gaussian distribution to infinite dimension. As such, it is a collection
of random variables, where each finite subset is jointly distributed according to a
Gaussian distribution. A realisation drawn from a GP can therefore be seen as an
infinitely long vector of values, i.e. a function over the real space
R^d. This is why the GP is considered by some to be a prior over functions on R^d.
As the GP is a Gaussian distribution of infinite dimension, we cannot characterise
it using a mean vector and covariance matrix. Instead, we introduce a mean
function m(x) and a kernel (or covariance function) κ(x, x') defined as

$$m(x) = \mathbb{E}[f(x)], \qquad (4.11a)$$

$$\kappa(x, x') = \mathbb{E}\Big[\big(f(x) - m(x)\big)\big(f(x') - m(x')\big)\Big]. \qquad (4.11b)$$
To construct the surrogate function, we assume a priori that the objective function
can be modelled as a GP,

$$f(x) \sim \mathcal{GP}\big(m(x), \kappa(x, x')\big), \qquad (4.12)$$
with the mean function and kernel defined by (4.11). Here, the mean function
specifies the average value of the process and the kernel specifies the correlation
between (nearby) samples. Both functions are considered prior choices in the
algorithm and are used to encode the beliefs about the data before it is observed.
Consequently, both the prior (4.12) and the data likelihood (4.10)
are Gaussian. Hence, the posterior resulting from Bayes' theorem is a
Gaussian distribution with some mean and covariance that can be calculated using
standard results. From this posterior, we can
construct the predictive distribution at some test point x⋆ given the data Dk by

$$f(x_\star) \mid \mathcal{D}_k \sim \mathcal{N}\big(x_\star;\ \mu_f(x_\star \mid \mathcal{D}_k),\ \sigma_f^2(x_\star \mid \mathcal{D}_k)\big), \qquad (4.13a)$$

$$\mu_f(x_\star \mid \mathcal{D}_k) = \kappa_\star^\top \big[\kappa(x_{1:k}, x_{1:k}) + \sigma_z^2 I_k\big]^{-1} y_{1:k}, \qquad (4.13b)$$

$$\sigma_f^2(x_\star \mid \mathcal{D}_k) = \kappa(x_\star, x_\star) - \kappa_\star^\top \big[\kappa(x_{1:k}, x_{1:k}) + \sigma_z^2 I_k\big]^{-1} \kappa_\star + \sigma_z^2, \qquad (4.13c)$$

where $\kappa_\star = \kappa(x_\star, x_{1:k})$ denotes the covariance between the test value and the
sampling points. Here, $\kappa(x_{1:k}, x_{1:k})$ denotes a matrix where the element at $(i, j)$
is given by $\kappa(x_i, x_j)$ for $i = 1, \ldots, k$ and $j = 1, \ldots, k$.
To obtain the GP posterior, we need to select a kernel function. Note that it
is possible to include the assumption of a non-zero mean function in the kernel
by adding an appropriate constant kernel. Therefore, we only make use of
a zero mean function in this thesis and focus on the kernel design problem, where
several kernels can be combined by different operations to encode the prior beliefs
about the structure of the data. Here, we only consider the combination of a constant
covariance function and three different popular kernels: the squared exponential
(SE), the Matérn 3/2 and the Matérn 5/2. See Rasmussen and Williams (2006)
for other kernels and for a discussion of how they can be combined.
The SE kernel is also known as the radial basis function (RBF) and has the form

$$\kappa_{\mathrm{SE}}(x, x') = \sigma_\kappa^2 \exp\left(-\frac{(x - x')^2}{2 l^2}\right), \qquad (4.14)$$

where the hyperparameters are $\alpha = \{\sigma_\kappa^2, l\}$. Here, $l$ is called the characteristic
length scale, as it scales the Euclidean distance between the two points $x$ and $x'$.
Two other kernels are the Matérn 3/2 and the Matérn 5/2 with the forms

$$\kappa_{3/2}(x, x') = \sigma_\kappa^2 \left(1 + \frac{\sqrt{3}\,|x - x'|}{l}\right) \exp\left(-\frac{\sqrt{3}\,|x - x'|}{l}\right), \qquad (4.15a)$$

$$\kappa_{5/2}(x, x') = \sigma_\kappa^2 \left(1 + \frac{\sqrt{5}\,|x - x'|}{l} + \frac{5 (x - x')^2}{3 l^2}\right) \exp\left(-\frac{\sqrt{5}\,|x - x'|}{l}\right), \qquad (4.15b)$$
where the hyperparameters are α = {σκ², l}. The main difference between the three
kernels is their smoothness properties. The SE kernel is the smoothest and has
infinitely many continuous derivatives. The κ3/2 and κ5/2 kernels only have one
and two continuous derivatives, respectively. To illustrate the different kernels, we
present some simulated realisations from each in Example 4.5.
4.5 Example: GP kernels
In Figure 4.6, we present realisations from the GP prior (4.12) using three different
kernels with two different length scales. We see that the smoothness of the prior
decreases from top to bottom, which verifies the previous discussion. Also, we
see that the length scale determines the rate of change of the realisations.
Figure 4.6: Realisations simulated from the GP prior using three different
kernels: SE (upper), Matérn 5/2 (middle) and Matérn 3/2 (lower), with length
scales 1 (left) and 3 (right).
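The predictive equations (4.13) are straightforward to implement. The following Python sketch does so for the SE kernel under a zero mean function (the constant-kernel trick for non-zero means is omitted), and the hyperparameter values in the usage example are arbitrary assumptions rather than EB estimates.

import numpy as np

def se_kernel(x1, x2, sigma2_k=1.0, ell=1.0):
    # SE kernel (4.14) between two sets of one-dimensional inputs.
    return sigma2_k * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ell**2)

def gp_predict(x_star, x, y, sigma2_z, sigma2_k=1.0, ell=1.0):
    # Predictive mean (4.13b) and variance (4.13c) at the test points x_star.
    K = se_kernel(x, x, sigma2_k, ell) + sigma2_z * np.eye(len(x))
    k_star = se_kernel(x, x_star, sigma2_k, ell)
    mu = k_star.T @ np.linalg.solve(K, y)
    # kappa(x_star, x_star) equals sigma2_k on the diagonal for the SE kernel.
    var = sigma2_k - np.sum(k_star * np.linalg.solve(K, k_star), axis=0) + sigma2_z
    return mu, var

# Usage on noisy samples of the smooth test function from Example 4.6 below.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=15)
y = 3.0 * (np.cos(0.5 * x) + np.sin(2 + 0.25 * x))**2 + rng.normal(0.0, np.sqrt(2.0), size=15)
x_star = np.linspace(0, 10, 200)
mu, var = gp_predict(x_star, x, y, sigma2_z=2.0, sigma2_k=25.0, ell=1.5)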
Finally, we need to determine suitable values for the hyperparameters of the kernel.
This can be done using an ML procedure called empirical Bayes (EB), where the
marginal likelihood of the data is optimised with respect to the hyperparameters α.
Note that this is not a purely Bayesian approach, as the data is used to determine the
properties of the prior. Nevertheless, it is a popular approach in the GP literature
and therefore we make use of it here. The marginal likelihood can be computed
by marginalisation,
by a marginalisation,
Z
p(y|x, α) = p(y|f, x, α)p(f |x, α) df,
where we drop the subscript on y1:k , x1:k and f1:k for brevity. The log-marginal
likelihood can be obtained in closed form using results for the Gaussian distribution
as
i−1
1
1 h
y − log κ x, x + σz2 IN ,
log p(y|x, α) ∝ − y > κ x, x + σz2 IN
2
2
where we have neglected the terms that are independent of α (independent of the
kernel). The gradient of the log-marginal likelihood for αj can be computed using
−1 ∂
−1
∂
1
>
log p(y|x, α) = tr ββ − κ x, x
κ x, x ,
β = κ x, x
y.
∂αj
2
∂θj
Therefore, the hyperparameters can be estimated by maximising the log-marginal
likelihood using a gradient-based search algorithm (4.1). In Example 4.6, we make
use of the EB method for GP regression using the different kernels discussed in
this section.
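A minimal sketch of the EB objective, reusing se_kernel and the data (x, y) from the previous snippet and replacing the gradient-based search by a simple grid search for brevity, could read as follows.

import numpy as np

def log_marginal_likelihood(sigma2_k, ell, x, y, sigma2_z):
    # Log-marginal likelihood (up to a constant) for the SE kernel.
    K = se_kernel(x, x, sigma2_k, ell) + sigma2_z * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * y @ np.linalg.solve(K, y) - 0.5 * logdet

# EB by grid search; a gradient-based search would instead use the
# gradient expression given above.
candidates = [(s2, l) for s2 in (1.0, 10.0, 25.0) for l in (0.5, 1.0, 2.0)]
sigma2_k_hat, ell_hat = max(
    candidates, key=lambda a: log_marginal_likelihood(a[0], a[1], x, y, 2.0))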
4.6 Example: GP regression
Consider the GP regression problem where we would like to recover the underlying
function f(x) given by

$$f(x) = 3 \big[\cos(0.5 x) + \sin(2 + 0.25 x)\big]^2,$$

from the noisy measurements y generated by (4.10) with σz² = 2. Here, we make
use of the three kernels in (4.14) and (4.15) with an added constant kernel to
account for the non-zero mean of the data. The hyperparameters of the resulting
kernels are estimated using the EB procedure.
In the left part of Figure 4.7, we use N = 5 observations and present the resulting
predictive mean (solid line), the underlying function f(x) (dashed) and the 95%
CI of the predictive distribution (gray area). We see that most of the samples are
located in the left part of the region, and as a result the predictive means of the
GPs follow the underlying function there.
In the right part of Figure 4.7, we present the same setup but using N = 15 sampled
points instead. Here, the overall fit is much better and the three GP predictive
distributions recover the underlying function rather well across the region. As the
underlying function is smooth (it has infinitely many continuous derivatives), it is
well captured by the SE kernel.
Figure 4.7: The GP regression problem in Example 4.6 using N = 5 (left)
and N = 15 (right) data points with three different kernels. The mean of
the predictive distributions (solid lines) and the corresponding 95% CI (gray
area) are presented together with the true function (dashed line) and the noisy
observations (black dots).
4.4.2 Acquisition rules
We now proceed by discussing Step (iii) of the BO algorithm, i.e. the acquisition
rule and how it operates. The main idea of this rule is to make use of the
predictive mean and its uncertainty to decide on a good parameter at which to sample the
objective function during the next iteration. We would like the algorithm to explore
the parameter space to find all the peaks, but still focus the samples around the
maximum to decrease the number of samples required to solve the problem. This
is called the exploration and exploitation trade-off, as we would like to explore the
space, but also exploit the information encoded in the surrogate function about
where the maxima are located.
Using a general acquisition rule AQ(x|Dk), we would like to select xk+1 in Step (iii)
as the maximising argument,

$$x_{k+1} = \underset{x \in \mathsf{B}}{\operatorname{argmax}}\ \mathcal{AQ}(x \mid \mathcal{D}_k), \qquad (4.16)$$

which is itself an optimisation problem. We discuss some different methods for
solving it in Paper B and now instead proceed with discussing three different
acquisition rules that are popular and widely used in GPO (Brochu et al., 2010).
These are: the probability of improvement (PI), the expected improvement (EI) and
the upper confidence bound (UCB).
The PI and EI make use of the fact that the predictive distribution is Gaussian,
so that the predictive mean and covariance can be used directly. From this, we can use
the probability density function (PDF) and the cumulative distribution function
(CDF) of the Gaussian distribution to calculate the PI and EI. This is done
by introducing the highest predicted value of the surrogate function (the incumbent),

$$\mu_{\max} = \max_{x_i \in x_{1:k}} \mu_f(x_i \mid \mathcal{D}_k). \qquad (4.17)$$

The PI (Kushner, 1964) can then be computed using

$$\mathrm{PI}(x \mid \mathcal{D}_k) = \mathbb{P}\big[\mu_f(x \mid \mathcal{D}_k) \geq \mu_{\max} + \xi\big] = \Phi(Z_k), \qquad (4.18a)$$

$$Z_k = \frac{\mu_f(x \mid \mathcal{D}_k) - \mu_{\max} - \xi}{\sigma_f(x \mid \mathcal{D}_k)}, \qquad (4.18b)$$
where Φ( · ) denotes the Gaussian CDF. Here, ξ ≥ 0 denotes a user-defined coefficient proposed by Lizotte (2008) to balance the exploitation and exploration
behaviour. From the form of the PI expression, we note that the variable Zk can
be seen as a standard Gaussian random variable. Hence, Zk assumes a (large)
positive value if the predictive mean is close to or larger than µmax. We then
obtain a value of the PI close to one and it is probable that the GPO algorithm
will sample the objective function in this region during its next iteration. Conversely,
we obtain a small value of the PI if the predictive mean is much smaller than µmax
and/or the uncertainty is very large. A small value of the PI means that it is
unlikely that the GPO algorithm will sample the objective function in this region
during its next iteration.
However, the PI only takes into account the probability of an improvement and
not its size. To include this information, we consider the EI rule (Mockus et al.,
1978; Jones et al., 1998) of the form

$$\mathrm{EI}(x \mid \mathcal{D}_k) = \big(\mu_f(x \mid \mathcal{D}_k) - \mu_{\max} - \xi\big)\, \Phi(Z_k) + \sigma_f(x \mid \mathcal{D}_k)\, \phi(Z_k), \qquad (4.19)$$

where φ( · ) denotes the Gaussian PDF. The interpretation of the EI follows in
analogy with the PI rule. If the predictive mean is close to or larger than µmax,
then Φ(Zk) assumes a value close to one, which scales the expected gain in the
objective function. Here, we also take the uncertainty into account through the second
term in (4.19). This term can be seen as a scaling of the uncertainty in the
predictive distribution. The scaling is large if Zk is close to zero and decreases
for larger values. This means that we get an extra contribution to the EI if the
predictive mean is close to µmax, indicating that it could be interesting to explore
that area. For more information, see Paper B, where we review the derivation of
the EI rule.
The third acquisition rule, the UCB rule, follows from the fact that we can construct a CI
using the predictive distribution. The intuition is that if the predictive
mean is high in an area of the parameter space, the resulting UCB is also large.
Moreover, uncertainty in a region increases the predictive covariance, which also
increases the value of the UCB. By this rule, we therefore explore areas where
peaks have been found and where the uncertainty is large. As the name suggests,
we are interested in the upper bound, which for a Gaussian distribution is given by

$$\mathrm{UCB}(x \mid \mathcal{D}_k) = \mu_f(x \mid \mathcal{D}_k) + \eta\, \sigma_f(x \mid \mathcal{D}_k). \qquad (4.20)$$

Here, η ≥ 0 denotes a coefficient determining the confidence level of the interval.
As the predictive distribution is Gaussian, we would choose η = 1.96 to obtain a
95% CI and η = 2.58 to obtain a 99% CI.
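All three rules are simple to evaluate once the predictive distribution (4.13) is available. The following Python sketch computes them at a set of candidate points; the UCB coefficient is passed as eta, matching the notation above.

import numpy as np
from scipy.stats import norm

def acquisition(mu, sigma, mu_max, rule="EI", xi=0.01, eta=1.96):
    # PI (4.18), EI (4.19) and UCB (4.20) from the predictive mean mu and
    # standard deviation sigma, evaluated elementwise over candidate points.
    Z = (mu - mu_max - xi) / sigma
    if rule == "PI":
        return norm.cdf(Z)
    if rule == "EI":
        return (mu - mu_max - xi) * norm.cdf(Z) + sigma * norm.pdf(Z)
    return mu + eta * sigma  # UCB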
4.7 Example: GPO using different acquisition rules
Consider again the problem in Example 4.6 using the Matérn 5/2 kernel with
N = 3, N = 5 and N = 10 samples. In the left part of Figure 4.8, we present the
predictive distributions together with the underlying function as before. In the
right part of the figure, we present the normalised values of the three different
acquisition functions previously discussed: the PI (green), the EI (red) and the
UCB (blue). Here, we use the value ξ = 0.01 recommended by Lizotte (2008)
and η = 1.96, so that the UCB corresponds to a 95% CI.
We note that the three acquisition functions have quite different behaviours in
the three situations. In the first situation (upper), the PI and the UCB have two
peaks, located in the left and right parts of the region, to exploit the current
information and to explore the region better, respectively. The EI would like to
sample the left end of the region to exploit the current information and to reduce
the uncertainty in that area.
In the second situation (middle), the EI again would like to exploit the current
information by placing a sample near the peak of the predictive mean. The other
two acquisition functions would like to explore the right part of the region better.
Figure 4.8: The GP regression problem in Example 4.6 using N = 3 (upper),
N = 5 (middle) and N = 10 (lower) data points with the Matérn 5/2 kernel.
In the left part, we present the mean of the predictive distributions (solid lines)
and the corresponding 95% CI (gray area) together with the true function
(dashed line) and the noisy observations (black dots). In the right part,
we present normalised values of the three different acquisition functions for each
situation.
Finally, in the third situation (lower), the three functions agree and would like to
exploit the current peak in the predictive mean. Hence, we conclude that the three
acquisition functions have rather different behaviours, and we return in a following
example to investigate how this affects the resulting GPO algorithm.
4.4.3 Gaussian process optimisation
The full procedure for using GPO to estimate the solution to (4.9) is presented
in Algorithm 6. As previously discussed, the user choices include the kernel for
the GP prior and the acquisition function. Again, we remind the reader that the
choice of kernel is crucial for the performance of the algorithm. Furthermore, an
optimisation method is needed to optimise the acquisition function in (4.16). Here,
we make use of a global derivative-free optimisation algorithm called DIRECT
(Jones et al., 1993), but we discuss other possible choices in Paper B.
The GPO algorithm is often initialised with some randomly selected parameters
sampled from the parameter prior. These samples are used to estimate the hyperparameters of the GP kernels so that the AQ function can operate. The number
of such samples varies with the dimension of the problem, but between 5 and 50
is reasonable for small problems. Also, as the EB procedure is computationally costly, it is beneficial to only re-estimate the GP hyperparameters every 5th or
10th iteration to save computations.
We end this section by discussing three different examples where GPO and/or
surrogate function modelling is useful. In Examples 4.8 and 4.9, we use the GPO
algorithm for solving the ML parameter inference problem in an SSM and compare
the different possible choices of AQs. Note that more examples of this application
are found in Papers B and C.
Finally, in Example 4.10, we illustrate the use of GPO for creating an input that
maximises the logarithm of the determinant of the expected information matrix in an SSM.
Remember that this corresponds to maximising the accuracy of the ML parameter
estimate, which is the objective of input design.
Algorithm 6 Gaussian process optimisation (GPO)
Inputs: κ( · ) (GP kernel), AQ (acquisition func.), K (no. iterations) and x1 (init. value).
Output: x̂max (estimate of the maximising argument of (4.9)).
1: Initialise the algorithm by random sampling.
2: for k = 1 to K do
3: Sample f(xk) to obtain the noisy estimate yk.
4: Compute (4.13) to obtain µf(x|Dk).
5: Compute (4.17) to obtain µmax.
6: Compute (4.16) to obtain xk+1.
7: end for
8: Compute the maximiser of µf(x|DK) to obtain the estimate x̂max of (4.9).
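The following Python sketch assembles Algorithm 6 in one dimension from the gp_predict and acquisition snippets given earlier. For brevity, the acquisition rule is maximised over a fixed grid instead of with DIRECT, and the GP hyperparameters and noise variance are kept fixed rather than re-estimated by EB.

import numpy as np

def gpo(sample_objective, lower, upper, K, rng):
    # GPO (Algorithm 6) for a one-dimensional, noisy objective function.
    grid = np.linspace(lower, upper, 500)
    xs, ys = [rng.uniform(lower, upper)], []
    for k in range(K):
        ys.append(sample_objective(xs[-1]))                     # Step (i)
        x_arr, y_arr = np.array(xs), np.array(ys)
        mu, var = gp_predict(grid, x_arr, y_arr, sigma2_z=1.0)  # Step (ii)
        # The incumbent (4.17): highest predictive mean at the sampled points.
        mu_sampled, _ = gp_predict(x_arr, x_arr, y_arr, sigma2_z=1.0)
        aq = acquisition(mu, np.sqrt(var), np.max(mu_sampled), rule="EI")
        xs.append(grid[np.argmax(aq)])                          # Step (iii)
    return grid[np.argmax(mu)]  # maximiser of the final predictive mean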
4.8 Example: GPO for ML inference in the GARCH(1,1) model
Consider the ML parameter inference problem in the GARCH(1,1) model (2.3)
using synthetic data with θ = τ and Θ = (0, 0.25) ⊂ R. Here, we make use of the
GPO algorithm to solve this problem, by first rewriting (4.10) as

$$\widehat{\ell}(\theta_k) = \ell(\theta_k) + z_k, \qquad z_k \sim \mathcal{N}(0, \sigma_z^2),$$

which is similar to the CLT established in Section 3.3.4. As a consequence, (4.9)
turns into the maximum likelihood problem discussed in Section 2.3.
The resulting procedure follows from Algorithm 6 by plugging in the APF from
Algorithm 2 for log-likelihood estimation.
Here, we use a one-dimensional parameter vector to be able to compare the three
different AQs discussed in the previous section. We generate T = 250 observations
from the model using {α, β, γ, τ} = {0.1, 0.8, 0.05, 0.3}. We make use of the faPF
with N = 100 particles to estimate the log-likelihood during each iteration. Here, we use the
value ξ = 0.01 recommended by Lizotte (2008) and η = 1.96 for the acquisition
rules.
In Figure 4.9, we present the procedure at five consecutive iterations using the three
different acquisition rules. The procedure is initialised with two randomly selected
samples of the log-likelihood, which are not shown in the figure. Here, we see that
the behaviour of the three acquisition rules is rather different, but the resulting
mode of the log-likelihood is almost the same. The parameter estimate is obtained
around θ̂ML = 0.12 for all three choices of the acquisition rule. Note that this
small toy example is probably not complex enough to show the real differences
between the acquisition rules. Remember that more extensive evaluations by
Lizotte (2008) show that the EI rule is often a good choice in many different
applications.
Figure 4.9: Five steps of the GPO algorithm for ML parameter inference in
Example 4.8 using three different acquisition rules: PI (left), EI (center), UCB
(right). The predictive mean and the resulting value of the acquisition rule
are presented with coloured solid and dotted lines, respectively. The black
dots and gray areas indicate the samples obtained from the log-likelihood and
the 95% predictive CI, respectively.
4.9 Example: GPO for ML inference in the earthquake count model
We return to ML parameter inference in the earthquake count model (2.6) using
the real-world data discussed in Section 2.2.3. Here, the parameter vector is given by θ = {φ, σv, β} and the parameter
space is Θ = (0, 1) × (0, 1) × (10, 20) ⊂ R³. We make use of the bPF with N = 1 000
particles to estimate the log-likelihood. The procedure is initialised with 50 samples of the log-likelihood at randomly selected parameters and continues with
150 iterations of the GPO algorithm.
In Figure 4.10, we present the current ML parameter estimate at each iteration of
the GPO algorithm. We see that the parameter estimates and the predicted log-likelihood stabilise after about 150 iterations. This is rather fast compared with
e.g. SPSA, which in Paper B requires an order of magnitude more samples of the log-likelihood. The final parameter estimate is obtained as θ̂ML = {0.88, 0.15, 17.65}.
This shows that the underlying intensity is rather slowly varying and that the
mean number of major earthquakes each year is 17.65.
4.10 Example: Input design in the LGSS model using GPO
Consider the LGSS model (2.2) with the parameters θ⋆ = {0.5, 1, 0.1, 0.1}, T = 250
and an input u1:T generated by

$$u_t = \begin{cases} -1, & \text{with probability } (1 - \alpha_1)(1 - \alpha_2), \\ 0, & \text{with probability } \alpha_1, \\ 1, & \text{with probability } (1 - \alpha_1)\,\alpha_2, \end{cases}$$

for t = 1, . . . , T. Here, the aim is to find the input parameters α⋆ = {α1⋆, α2⋆} such
that

$$\alpha^\star = \underset{\alpha \in [0,1] \times [0,1]}{\operatorname{argmax}} \log \det \widehat{\mathcal{I}}\big(\theta^\star, u_{1:T}(\alpha)\big),$$

where $\widehat{\mathcal{I}}(\theta^\star, u_{1:T}(\alpha))$ denotes the estimated expected information matrix using the
input u1:T(α) generated with the input parameters α. We make use of the second
formulation in (2.13) to estimate the expected information matrix. This is done
by first estimating the score function with the FL smoother in Algorithm 3 for
M = 100 different data realisations from the model, using the same input u1:T and
the faPF with N = 100. Finally, the expected information matrix is estimated
using the sample covariance matrix of the score function estimates.
We integrate this problem into the GPO procedure outlined in Algorithm 6, where
the estimation of the expected information matrix is Step (i). Furthermore, we
make use of the EI as the acquisition rule and initialise the procedure with 10
uniform random samples of α. In Figure 4.11, we present the resulting predictive
mean for the input parameters and the sampling points of the algorithm. The
input parameter estimates are obtained as α̂ = {0.26, 0.50}. We see that the
algorithm converges quickly to the estimates, which shows that the GPO
algorithm can be a useful alternative for input design, as it requires only a limited
number of computationally costly estimates of the expected information matrix.
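Generating such an input signal is straightforward; a minimal Python sketch could read as follows.

import numpy as np

def generate_input(alpha1, alpha2, T, rng):
    # Draw the ternary input u_{1:T} of Example 4.10, where u_t takes the
    # values -1, 0 and 1 with the stated probabilities (which sum to one).
    probs = [(1 - alpha1) * (1 - alpha2), alpha1, (1 - alpha1) * alpha2]
    return rng.choice([-1, 0, 1], size=T, p=probs)

u = generate_input(0.26, 0.50, 250, np.random.default_rng(3))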
Figure 4.10: The ML parameter estimate and the resulting predicted log-likelihood at each iteration of the GPO algorithm in Example 4.9.
Figure 4.11: Upper: the estimated mean of the expected information matrix
as a function of the input parameters. Middle: the sampling points of the
GPO algorithm. Lower: a realisation of the estimated optimal input.
5 Concluding remarks and future work
In this chapter, we give a summary of the contributions in the papers included in
this thesis and discuss some avenues for future work.
5.1 Summary of the contributions
Broadly speaking, the contributions of this thesis fall mainly within two areas.
The first contribution is to develop new methods and improve existing methods for
efficient parameter inference in SSMs. Here, we are concerned with computational
efficiency, which means that we would like to reduce the number of particles or
iterations needed to reach a certain accuracy in the parameter estimates. This
contribution is contained within Papers A-C. The second contribution is to make
use of SMC and MCMC to extend existing methods for parameter inference and
input design to nonlinear problems. This contribution is discussed in Papers D
and E.
In Paper A, we develop a novel PMCMC algorithm that combines the PMH algorithm from Section 4.3 with the Langevin dynamics discussed in Section 4.2.2.
The key idea is to make use of the particle system within the proposal of the
PMH algorithm. With this information, we can make use of the FL smoother
from Section 3.4.2 to estimate the score function and the observed information
matrix. These quantities are then used to construct the PMH1 and PMH2 algorithms in Paper A, in analogy with the MH1 and MH2 algorithms discussed
in Section 4.2.2. The resulting algorithm is efficient, as it explores the posterior
distribution better, which results in a higher ESS compared with the PMH0
algorithm. Furthermore, the added information makes the algorithm invariant to
affine transformations of the parameter vector and reduces the length of the burn-in.
As a consequence, the proposed algorithm requires fewer iterations than the
PMH0 algorithm in some settings, which makes it more computationally efficient,
as the computational complexity per iteration of the two algorithms is the same.
In Paper B, we develop a novel algorithm for ML parameter inference by combining
ideas from GPO in Section 4.4 with log-likelihood estimation using the APF from
Section 3.3.4. The resulting algorithm is computationally efficient, as it requires fewer
samples of the log-likelihood compared with some other optimisation methods.
As these estimates are computationally costly to obtain, this results in an overall
decreased computational cost. Compared with SPSA, the gain is about one order
of magnitude, see Paper B for a comparison on the HWSV model. In Paper C,
we extend the combination of GPO and SMC to parameter inference in nonlinear
SSMs with intractable likelihoods. Computationally costly ABC methods are used
to approximate the intractable likelihood. Therefore, there could be substantial
gains in using this algorithm for inference in this type of model.
In Paper D, we develop a novel algorithm for input design in nonlinear SSMs, which
can handle amplitude constraints on the input. The proposed method combines
results from Valenzuela et al. (2013) with SMC from Section 3.4.2 for estimating
the expected information matrix. In Paper E, we propose two algorithms for
parameter inference in ARX models with Student-t innovations, which include
automatic model order selection by two different methods. These methods make
use of the MH algorithm discussed in Section 4.2 with reversible jump and of the
Gibbs sampler together with sparseness priors.
5.2 Outlook and future work
In this section, we summarise some ideas for future work and extensions of the
contributions presented within this thesis. We discuss three different areas: the
PMH algorithm, the GPO-SMC algorithm and input design in SSMs.
5.2.1 Particle Metropolis-Hastings
The proposed contributions to the PMH algorithm are mainly methodological
developments of existing methods. Therefore, it would be interesting to examine
the theoretical properties of the PMH1 and PMH2 algorithms. This includes
questions regarding the convergence rate of the algorithms and how their properties
scale with the dimension of the parameter space. Similar analyses have previously
been done for MH0 (Roberts et al., 1997), MH1 (Roberts and Rosenthal, 1998)
and PMH0 (Sherlock et al., 2013). A possible first step for the PMH analysis
is to consider the situation where the number of observations is large. By the
discussion in Section 2.4, we know that the Bayesian CLT would give a roughly
Gaussian posterior, which simplifies the analysis.
Further methodological developments could also be interesting within the PMCMC
framework. This includes the development of adaptive PMH1 and PMH2 algorithms, which
automatically determine suitable step sizes and numbers of particles. It could
also be possible to relax the reversibility constraint of the Markov chain during
the burn-in phase of the algorithm. This could decrease the hitting time of the
posterior mode by the Markov chain. The adaptive mechanism could then decide
when the chain has reached the mode and turn the reversibility condition back on,
so that the chain admits the target as its stationary distribution. Relevant work
for this idea is found in Andrieu and Thoms (2008), Peters et al. (2010) and Pitt
et al. (2012). Another approach to reduce the length of the burn-in is to make
use of GPO to estimate the location of the posterior mode in a pilot run. This is
similar to the work by Owen et al. (2014), where the authors make use of some
pilot runs of the ABC algorithm to initialise the PMH0 algorithm in SSMs with
intractable likelihoods.
Also, it would be interesting to develop a particle HMC algorithm as suggested
in the discussions (Doucet et al., 2011) following Girolami and Calderhead (2011).
The challenge with this idea is to handle the multiple APFs that are run within
each PMH iteration, which might require some additional developments of the
PMCMC framework. Better particle smoothers could also be useful, as they could
improve the estimates of the score function and the information matrix. This
would allow larger step lengths in the PMH2 algorithm and could lead to even
larger improvements in the mixing of the Markov chain.
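For reference, a minimal Python sketch of one PMH1 iteration is given below.
Here, estimate_ll_and_score is a hypothetical stand-in for the particle
filter-based estimators of the log-likelihood and the score, and eps denotes the
step size whose adaptation is discussed above; this is not the exact
implementation used in the papers.

    import numpy as np

    def pmh1_step(theta, ll, grad, estimate_ll_and_score, log_prior, eps=0.1):
        # One PMH1 iteration: a Langevin-style proposal shifted by the
        # estimated score, accepted using the pseudo-marginal MH ratio with
        # particle filter estimates of the log-likelihood.
        mean_fwd = theta + 0.5 * eps**2 * grad
        theta_prop = mean_fwd + eps * np.random.randn(*np.shape(theta))
        ll_prop, grad_prop = estimate_ll_and_score(theta_prop)
        mean_rev = theta_prop + 0.5 * eps**2 * grad_prop
        # Log-densities of the asymmetric Gaussian proposals (constants cancel).
        log_q_fwd = -0.5 * np.sum((theta_prop - mean_fwd) ** 2) / eps**2
        log_q_rev = -0.5 * np.sum((theta - mean_rev) ** 2) / eps**2
        log_alpha = (ll_prop + log_prior(theta_prop) + log_q_rev
                     - ll - log_prior(theta) - log_q_fwd)
        if np.log(np.random.rand()) < log_alpha:
            return theta_prop, ll_prop, grad_prop  # accept
        return theta, ll, grad                     # reject

Turning eps and the number of particles into quantities tuned on the fly during
the burn-in is precisely the kind of adaptation mentioned above.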
Online methods for Bayesian inference would also be of great interest, especially
for the many big data problems that are likely to be faced in the future. Also,
graphics processing units (GPUs) and other multicore architectures can be used to
decrease the computational cost as some parts of the SMC and MCMC algorithms
are possible to run in parallel. Interested readers are referred to Beam et al.
(2014), Neiswanger et al. (2013), Henriksen et al. (2012) and Murray (2012) for
more information.
5.2.2 Gaussian process optimisation using the particle filter
As we have demonstrated in this thesis, the performance of the GPO algorithm
depends on the kernel function in the GP prior and the choice of acquisition
function. Therefore, it would be interesting to develop new acquisition functions
that could make use of ideas from sparse GPs, Newton methods and/or proximal
point algorithms (Rasmussen and Williams, 2006; Nocedal and Wright, 2006). The
main challenge is to construct a rule that keeps exploring the objective function,
but still keeps the fast convergence that we have illustrated in the examples in
Section 4.4.3 and in Papers B and C.
Another possible improvement to the algorithm is to remove the bias in the
log-likelihood estimate. This could be done by the bias compensation discussed in
Example 3.4. Also, it would be interesting to develop online methods for this
algorithm, perhaps by using some kind of stochastic approximation scheme. Finally,
we think that there are many interesting applications of GP models for estimating
the score and information matrix of an SSM. As these are computationally costly
to evaluate with good accuracy, perhaps the ideas from probabilistic numerics
could be helpful. This is an emerging field in machine learning, where GPs are
also used to estimate derivatives and integrals as well as to solve ordinary
differential equations. For more information, see Hennig (2013), Osborne et al.
(2012), Osborne (2010) and Boyle (2007).
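As an illustration of the kind of compensation we have in mind (a sketch under
the assumption that the error of the log-likelihood estimator is roughly
Gaussian, not a restatement of Example 3.4), the downward bias of roughly half
the estimator variance can be estimated from repeated particle filter runs:

    import numpy as np

    def bias_corrected_loglik(theta, estimate_loglik, reps=10):
        # Under a Gaussian assumption on the error of the log-likelihood
        # estimator, E[log z_hat] ~ log z - sigma^2 / 2, so adding half of
        # the sample variance compensates the downward bias.
        lls = np.array([estimate_loglik(theta) for _ in range(reps)])
        return lls.mean() + 0.5 * lls.var(ddof=1)

In a GPO setting, the corrected value would then replace the raw estimate before
the GP surrogate is updated.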
5.2.3 Input design in SSMs
The main drawback of the input design method proposed in Paper D is that the
expected information matrix is computationally costly to evaluate. It is possible
to make a perfectly parallel implementation of this method to reduce the
computational cost. However, it would be even more interesting to develop the
idea from Example 4.10 and make use of GPO for input design. This could be useful
since GPO does not require many evaluations of the objective function, which here
corresponds to estimates of the expected information matrix.
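A schematic version of such a design step could look as follows (Python), where
estimate_info_matrix is a hypothetical stand-in for the SMC-based estimator in
Paper D and the candidate inputs are assumed to already satisfy the amplitude
constraints:

    import numpy as np

    def design_input(candidates, estimate_info_matrix, theta):
        # D-optimal style choice: score each candidate input sequence by the
        # log-determinant of its estimated expected information matrix.
        scores = [np.linalg.slogdet(estimate_info_matrix(u, theta))[1]
                  for u in candidates]
        return candidates[int(np.argmax(scores))]

The log-determinant corresponds to a D-optimality-type criterion; other
scalarisations of the information matrix could be used in the same way.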
Also, it would be interesting to consider methods that infer the parameters of the
model at the same time as the optimal input. This would relax the unrealistic
assumption that we need to know the true parameters to be able to construct an
optimal input. There are mainly three approaches for solving this problem. The
first is to pose the problem in a robust optimisation setting and only assume that
the true parameter is located within some set. By this construction, the resulting
optimal input would be an average (in some sense) over this set.
The second approach is to construct a sequential algorithm that first infers the
parameters and then the optimal input. This is repeated over many iterations by
exciting the system with the input constructed in the last iteration. However, it is
difficult to prove that this approach would converge to the true optimal input and
the true parameters. Finally, we could pose this problem in a Bayesian manner
and marginalise over the parameters of the model. The resulting optimal input
would then be an average over the parameter posterior, so we would not need to
know the true parameters of the system. See Müller et al. (2007),
Kuck et al. (2006) and Bubeck and Cesa-Bianchi (2012) for more information.
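In this third approach, the same criterion would simply be averaged over draws
from the parameter posterior, for instance obtained with PMH. A sketch, reusing
the hypothetical estimator above:

    import numpy as np

    def design_input_marginal(candidates, estimate_info_matrix, posterior_draws):
        # Marginalise the log-det criterion over posterior samples of theta,
        # removing the need to know the true parameters.
        def utility(u):
            return np.mean([np.linalg.slogdet(estimate_info_matrix(u, th))[1]
                            for th in posterior_draws])
        return candidates[int(np.argmax([utility(u) for u in candidates]))]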
5.3 Source code and data
Source code written in Python and R for recreating most of the examples in Part I
is available from the author’s homepages at: http://users.isy.liu.se/en/
rt/johda87/ and http://code.johandahlin.com/. Furthermore, source code
for recreating some of the numerical illustrations from Papers A, B, C and E is
also available from the same homepages. The source code and data are provided
under the MIT license with no guaranteed support and no responsibility for their
use and function.
Bibliography
M. Adolfson, S. Laséen, J. Lindé, and M. Villani. RAMSES – a new general
equilibrium model for monetary policy analysis. Sveriges Riksbank Economic
Review, 2, 2007a.
M. Adolfson, S. Laséen, J. Lindé, and M. Villani. Bayesian estimation of an open
economy DSGE model with incomplete pass-through. Journal of International
Economics, 72(2):481–511, 2007b.
M. Adolfson, S. Laséen, L. Christiano, M. Trabandt, and K. Walentin. RAMSES
II – Model Description. Occasional Paper Series, 12, 2013.
G. Amisano and O. Tristani. Euro area inflation persistence in an estimated
nonlinear DSGE model. Journal of Economic Dynamics and Control, 34(10):
1837–1858, 2010.
S. An and F. Schorfheide. Bayesian analysis of DSGE models. Econometric reviews,
26(2-4):113–172, 2007.
B. D. O. Anderson and J. B. Moore. Optimal filtering. Courier Publications, 2005.
C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte
Carlo computations. The Annals of Statistics, 37(2):697–725, 2009.
C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, 2008.
C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo
methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
A. L. Beam, S. K. Ghosh, and J. Doyle. Fast Hamiltonian Monte Carlo Using
GPU Computing. Pre-print, 2014. arXiv:1402.4089v1.
M. A. Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160, 2003.
J. O. Berger. Statistical decision theory and Bayesian analysis. Springer, 1985.
M. J. Betancourt and M. Girolami. Hamiltonian Monte Carlo for Hierarchical
Models. Pre-print, 2013. arXiv:1312.0906v1.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York,
USA, 2006.
T. Björk. Arbitrage theory in continuous time. Oxford University Press, 2004.
F. Black and M. Scholes. The pricing of options and corporate liabilities. The
Journal of Political Economy, pages 637–654, 1973.
T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal
of Econometrics, 31(3):307–327, 1986.
P. Boyle. Gaussian processes for regression and optimisation. PhD thesis, Victoria
University of Wellington, 2007.
M. Briers, A. Doucet, and S. Maskell. Smoothing algorithms for state-space models.
Annals of the Institute of Statistical Mathematics, 62(1):61–89, 2010.
E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimization
of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Pre-print, 2010. arXiv:1012.2599v1.
P. J. Brockwell and R. A. Davis. Introduction to time series and forecasting.
Springer, 2002.
S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic
multi-armed bandit problems. Foundations and Trends in Machine Learning,
5(1):1–122, 2012.
P. Bunch and S. Godsill. Improved particle approximations to the joint smoothing
distribution using Markov Chain Monte Carlo. IEEE Transactions on Signal
Processing, 61(4):956–963, 2013.
D. Burke, A. Ghosh, and W. Heidrich. Bidirectional importance sampling for
direct illumination. In Proceedings of the 16th Eurographics Symposium on
Rendering Techniques, pages 147–156, Konstanz, Germany, June 2005.
O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models.
Springer, 2005.
O. Cappé, S. J. Godsill, and E. Moulines. An overview of existing methods and
recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):
899–924, 2007.
C. M. Carvalho, M. S. Johannes, H. F. Lopes, and N. G. Polson. Particle learning
and smoothing. Statistical Science, 25(1):88–106, 2010.
R. Casarin. Bayesian inference for generalised Markov switching stochastic volatility models, 2004. CEREMADE Journal Working Paper 0414.
G. Casella and R. L. Berger. Statistical Inference. Duxbury Press, 2 edition, 2001.
K. S. Chan and J. Ledolter. Monte Carlo EM estimation for time series models
involving counts. Journal of the American Statistical Association, 90(429):242–
252, 1995.
T. Chen, E. B. Fox, and C. Guestrin. Stochastic Gradient Hamiltonian Monte
Carlo. Pre-print, 2014. arXiv:1402.4102v1.
S. Chib, F. Nardari, and N. Shephard. Markov chain Monte Carlo methods for
stochastic volatility models. Journal of Econometrics, 108(2):281–316, 2002.
N. Chopin, P. E. Jacob, and O. Papaspiliopoulos. SMC²: an efficient algorithm
for sequential analysis of state space models. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 75(3):397–426, 2013.
R. Cont. Empirical properties of asset returns: stylized facts and statistical issues.
Quantitative Finance, 1:223–236, 2001.
J-M. Cornuet, J-M. Marin, A. Mira, and C. P. Robert. Adaptive Multiple Importance Sampling. Pre-print, 2011. arXiv:0907.1254v5.
N. Cressie. Statistics for spatial data. Wiley, 1993.
D. Crisan and A. Doucet. A survey of convergence results on particle filtering
methods for practitioners. IEEE Transactions on Signal Processing, 50(3):736–
746, 2002.
J. Dahlin and F. Lindsten. Particle filter-based Gaussian process optimisation for
parameter inference. In Proceedings of the 19th IFAC World Congress, Cape
Town, South Africa, August 2014. (accepted for publication).
J. Dahlin and P. Svenson. A Method for Community Detection in Uncertain Networks. In Proceedings of 2011 European Intelligence and Security Informatics
Conference, Athens, Greece, August 2011.
J. Dahlin and P. Svenson. Ensemble approaches for improving community detection methods. Pre-print, 2013. arXiv:1309.0242v1.
J. Dahlin, F. Johansson, L. Kaati, C. Mårtensson, and P. Svenson. A Method for
Community Detection in Uncertain Networks. In Proceedings of International
Symposium on Foundation of Open Source Intelligence and Security Informatics
2012, Istanbul, Turkey, August 2012a.
J. Dahlin, F. Lindsten, T. B. Schön, and A. Wills. Hierarchical Bayesian ARX
models for robust inference. In Proceedings of the 16th IFAC Symposium on
System Identification (SYSID), Brussels, Belgium, July 2012b.
J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis Hastings using
Langevin dynamics. In Proceedings of the 38th International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May
2013a.
J. Dahlin, F. Lindsten, and T. B. Schön. Inference in Gaussian models with missing
data using Equalisation Maximisation. Pre-print, 2013b. arXiv:1308.4601v1.
J. Dahlin, F. Lindsten, and T. B. Schön. Second-order particle MCMC for Bayesian
parameter inference. In Proceedings of the 19th IFAC World Congress, Cape
Town, South Africa, August 2014a. (accepted for publication).
J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis-Hastings using gradient and Hessian information. Pre-print, 2014b. arXiv:1311.0686v2.
J. Dahlin, T. B. Schön, and M. Villani. Approximate inference in state space models with intractable likelihoods using Gaussian process optimisation. Technical
Report LiTH-ISY-R-3075, Department of Electrical Engineering, Linköping University, Linköping, Sweden, April 2014c.
P. Debevec. Rendering synthetic objects into real scenes: bridging traditional and
image-based graphics with global illumination and high dynamic range photography. In Proceedings of the 25th annual conference on Computer graphics and
interactive techniques, pages 189–198, Orlando, FL, USA, July 1998. ACM.
P. Del Moral. Feynman-Kac Formulae - Genealogical and Interacting Particle
Systems with Applications. Probability and its Applications. Springer, 2004.
P. Del Moral. Mean field simulation for Monte Carlo integration. CRC Press,
2013.
P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–
436, 2006.
P. Del Moral, A. Doucet, and S. Singh. Forward smoothing using sequential Monte
Carlo. Pre-print, 2010. arXiv:1012.5390v1.
M. Del Negro and F. Schorfheide. Priors from General Equilibrium Models for
VARS. International Economic Review, 45(2):643–673, 2004.
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):
1–38, 1977.
R. Douc and O. Cappé. Comparison of resampling schemes for particle filtering. In
Proceedings of the 4th International Symposium on Image and Signal Processing
and Analysis (ISPA), pages 64–69, Zagreb, Croatia, September 2005.
R. Douc, E. Moulines, and D. S Stoffer. Nonlinear Time Series: theory, methods
and applications with R examples. CRC Press, 2014.
A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: Fifteen
years later. In D. Crisan and B. Rozovsky, editors, The Oxford Handbook of
Nonlinear Filtering. Oxford University Press, 2011.
A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling
methods for Bayesian filtering. Statistics and computing, 10(3):197–208, 2000.
A. Doucet, P. Jacob, and A. M. Johansen. Discussion on Riemann manifold
Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2), page 162, 2011.
A. Doucet, M. K. Pitt, and R. Kohn. Efficient implementation of Markov
chain Monte Carlo when using an unbiased likelihood estimator. Pre-print, 2012.
arXiv:1210.1871.
S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo.
Physics letters B, 195(2):216–222, 1987.
C. Dubarry and R. Douc. Particle approximation improvement of the joint
smoothing distribution with on-the-fly variance estimation. Pre-print, 2011.
arXiv:1107.5524v1.
E. Ehrlich, A. Jasra, and N. Kantas. Static Parameter Estimation for ABC Approximations of Hidden Markov Models. Pre-print, 2012. arXiv:1210.4683v1.
R. F. Engle. Autoregressive conditional heteroscedasticity with estimates of the
variance of United Kingdom inflation. Econometrica: Journal of the Econometric Society, pages 987–1007, 1982.
P. Fearnhead, D. Wyncoll, and J. Tawn. A sequential smoothing algorithm with
linear computational cost. Biometrika, 97(2):447–464, 2010.
J. Fernández-Villaverde and J. F. Rubio-Ramírez. Estimating macroeconomic
models: A likelihood approach. The Review of Economic Studies, 74(4):1059–
1087, 2007.
R. A. Fisher. Theory of statistical estimation. Mathematical Proceedings of the
Cambridge Philosophical Society, 22(05):700–725, 1925.
T. Flury and N. Shephard. Bayesian inference based only on simulated likelihood:
particle filter analysis of dynamic economic models. Econometric Theory, 27(5):
933–956, 2011.
K. Fokianos, A. Rahbek, and D. Tjøstheim. Poisson autoregression. Journal of
the American Statistical Association, 104(488):1430–1439, 2009.
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin.
Bayesian data analysis. Chapman & Hall/CRC, 3 edition, 2013.
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6:721–741, 1984.
A. Ghosh, A. Doucet, and W. Heidrich. Sequential sampling for dynamic environment map illumination. In Proceedings of the 17th Eurographics conference on
Rendering Techniques, pages 115–126, Nicosia, Cyprus, June 2006.
P. Giordani and R. Kohn. Adaptive independent Metropolis-Hastings by fast
estimation of mixtures of normals. Journal of Computational and Graphical
Statistics, 19(2):243–259, 2010.
M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian
Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):1–37, 2011.
P. Glasserman. Monte Carlo methods in financial engineering, volume 53. Springer,
2004.
S. J. Godsill, A. Doucet, and M. West. Monte Carlo smoothing for nonlinear time
series. Journal of the American Statistical Association, 99(465):156–168, March
2004.
N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to
nonlinear/non-Gaussian Bayesian state estimation. IEEE Proceedings of Radar
and Signal Processing, 140(2):107–113, 1993.
A. G. Gray. Bringing tractability to generalized n-body problems in statistical
and scientific computation. PhD thesis, Carnegie Mellon University, 2003.
W. K. Hastings. Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57(1):97–109, 1970.
E. Hecht. Optics. Pearson, 4 edition, 2013.
P. Hennig. Fast probabilistic optimization from noisy gradients. In Proceedings
of the 30th International Conference on Machine Learning, Atlanta, GA, USA,
June 2013.
S. Henriksen, A. Wills, T. B. Schön, and B. Ninness. Parallel implementation of
particle MCMC methods on a GPU. In Proceedings of the 16th IFAC Symposium on System Identification (SYSID), Brussels, Belgium, July 2012.
J. D. Hol, T. B. Schön, and F. Gustafsson. On resampling algorithms for particle
filters. In Proceedings of the Nonlinear Statistical Signal Processing Workshop,
Cambridge, UK, September 2006.
J. Hull. Options, Futures, and other Derivatives. Pearson, 7 edition, 2009.
J. Hull and A. White. The pricing of options on assets with stochastic volatilities.
The Journal of Finance, 42(2):281–300, 1987.
D. Hultqvist, J. Roll, F. Svensson, J. Dahlin, and T. B. Schön. Detection and
positioning of overtaking vehicles using 1D optical flow. In Proceedings of the
IEEE Intelligent Vehicles (IV) Symposium, Dearborn, MI, USA, June 2014.
(accepted for publication).
E. Jacquier, N. G. Polson, and P. E. Rossi. Bayesian analysis of stochastic volatility
models with fat-tails and correlated errors. Journal of Econometrics, 122(1):185–
212, 2004.
A. H. Jazwinski. Stochastic processes and filtering theory, volume 63. Academic
press, 1970.
M. Johannes and N. Polson. MCMC Methods for Continuous-time Financial
Econometrics. In Y. Ait-Sahalia and L. Hansen, editors, Handbook of Financial Econometrics, Vol. 1: Tools and Techniques, volume 2, pages 1–72. North-Holland, 2009.
D. R. Jones. A taxonomy of global optimization methods based on response
surfaces. Journal of Global Optimization, 21(4):345–383, 2001.
D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications,
79(1):157–181, 1993.
D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of
expensive black-box functions. Journal of Global optimization, 13(4):455–492,
1998.
T. Kailath, A. H. Sayed, and B. Hassibi. Linear Estimation. Prentice Hall, Upper
Saddle River, NJ, USA, 2000.
N. Kantas, A. Doucet, S. S. Singh, and J. M. Maciejowski. An overview of sequential
Monte Carlo methods for parameter estimation in general state-space models. In
IFAC Symposium on System Identification (SYSID), Saint-Malo, France, July
2009.
M. J. Keeling and P. Rohani. Modeling infectious diseases in humans and animals.
Princeton University Press, 2008.
S. Kim, N. Shephard, and S. Chib. Stochastic volatility: likelihood inference
and comparison with ARCH models. The Review of Economic Studies, 65(3):
361–393, 1998.
G. Kitagawa and S. Sato. Monte Carlo smoothing and self-organising state-space
model. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte
Carlo methods in practice, pages 177–195. Springer, 2001.
M. Klaas, M. Briers, N. de Freitas, A. Doucet, S. Maskell, and D. Lang. Fast
particle smoothing: if I had a million particles. In Proceedings of the 23rd
International Conference on Machine Learning, Pittsburgh, USA, June 2006.
P. E. Kloeden and E. Platen. Numerical solution of stochastic differential equations,
volume 23. Springer, 4 edition, 1992.
J. Kronander and T. B. Schön. Robust auxiliary particle filters using multiple
importance sampling. In Proceedings of the 2014 IEEE Statistical Signal Processing Workshop (SSP), Gold Coast, Australia, July 2014. (accepted for publication).
J. Kronander, J. Dahlin, D. Jönsson, M. Kok, T. B. Schön, and J. Unger. Real-time
Video Based Lighting Using GPU Raytracing. In Proceedings of the 2014 European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, September
2014a. (submitted, pending review).
J. Kronander, T. B. Schön, and J. Dahlin. Backward sequential Monte Carlo
for marginal smoothing. In Proceedings of the 2014 IEEE Statistical Signal
Processing Workshop (SSP), Gold Coast, Australia, July 2014b. (accepted for
publication).
H. Kuck, N. de Freitas, and A. Doucet. SMC samplers for Bayesian optimal nonlinear design. In Proceedings of the 2006 Nonlinear Statistical Signal Processing
Workshop, pages 99–102, Cambridge, UK, September 2006.
H. J. Kushner. A new method of locating the maximum point of an arbitrary
multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):
97–106, 1964.
R. Langrock. Some applications of nonlinear and non-Gaussian state–space modelling by means of hidden Markov models. Journal of Applied Statistics, 38(12):
2955–2970, 2011.
R. Langrock and W. Zucchini. Hidden Markov models with arbitrary state dwell-time distributions. Computational Statistics & Data Analysis, 55(1):715–724,
2011.
E. L. Lehmann and G. Casella. Theory of point estimation. Springer, 1998.
F. Lindsten. An efficient stochastic approximation EM algorithm using conditional
particle filters. In Proceedings of the 38th International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013.
F. Lindsten and T. B. Schön. Backward simulation methods for Monte Carlo
statistical inference. In Foundations and Trends in Machine Learning, volume 6,
pages 1–143, August 2013.
J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2008.
S. Livingstone and M. Girolami. Information-geometric Markov Chain Monte
Carlo methods using Diffusions. Pre-print, 2014. arXiv:1403.7957v1.
D. J. Lizotte. Practical Bayesian optimization. PhD thesis, University of Alberta,
2008.
L. Ljung. System identification: theory for the user. Prentice Hall, 1999.
T. A. Louis. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 44(02):226–233, 1982.
A. Marshall. The use of multi-stage sampling schemes in Monte Carlo simulations.
In M. Meyer, editor, Symposium on Monte Carlo Methods, pages 123–140. Wiley,
1956.
G. Matheron. Principles of geostatistics. Economic Geology, 58(8):1246–1266,
1963.
G. J. McLachlan and T. Krishnan. The EM algorithm and extensions. Wiley-Interscience, second edition, 2008.
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller.
Equation of state calculations by fast computing machines. The Journal of
Chemical Physics, 21(6):1087–1092, 1953.
S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
University Press, 2009.
T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th conference on Uncertainty in Artificial Intelligence, pages
362–369, Seattle, WA, USA, August 2001.
S. Mitra. A review of volatility and option pricing. International Journal of
Financial Markets and Derivatives, 2(3):149–179, 2011.
J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for
seeking the extremum. In L. C. W. Dixon and G. P. Szego, editors, Toward
Global Optimization, pages 117–129. North-Holland, 1978.
P. Müller, D. A. Berry, A. P. Grieve, M. Smith, and M. Krams. Simulation-based
sequential Bayesian design. Journal of Statistical Planning and Inference, 137
(10):3140–3150, 2007.
K. P. Murphy. Machine learning: a probabilistic perspective. The MIT Press,
2012.
L. Murray. GPU acceleration of the particle filter: The Metropolis resampler.
Pre-print, 2012. arXiv:1202.6163v1.
C. A. Naesseth. Nowcasting using Microblog Data. Bachelor thesis, Linköping
University, September 2012. LiTH-ISY-EX-ET-12/0398.
R. M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman,
G. Jones, and X-L. Meng, editors, Handbook of Markov Chain Monte Carlo.
Chapman & Hall/ CRC Press, June 2010.
W. Neiswanger, C. Wang, and E Xing. Asymptotically Exact, Embarrassingly
Parallel MCMC. Pre-print, 2013. arXiv:1311.4780v2.
B. Ninness and S. Henriksen. Bayesian system identification via Markov chain
Monte Carlo techniques. Automatica, 46(1):40–51, 2010.
J. Nocedal and S. Wright. Numerical Optimization. Springer, 2 edition, 2006.
J. Nolan. Stable distributions: models for heavy-tailed data. Birkhauser, 2003.
B. Øksendal. Stochastic differential equations. Springer, 6 edition, 2010.
J. Olsson, O. Cappé, R. Douc, and E. Moulines. Sequential Monte Carlo smoothing with application to parameter estimation in nonlinear state space models.
Bernoulli, 14(1):155–179, 2008.
M. Osborne. Bayesian Gaussian Processes for Sequential Prediction, Optimisation
and Quadrature. PhD thesis, University of Oxford, 2010.
M. A. Osborne, R. Garnett, S. J. Roberts, C. Hart, S. Aigrain, and N. Gibson.
Bayesian quadrature for ratios. In Proceedings of the Fifteenth International
Conference on Artificial Intelligence and Statistics (AISTATS), pages 832–840,
La Palma, Canary Islands, Spain, April 2012.
J. Owen, D. J. Wilkinson, and C. S. Gillespie. Scalable Inference for Markov
Processes with Intractable Likelihoods. Pre-print, 2014. arXiv:1403.6886v1.
G. W. Peters, G. R. Hosack, and K. R. Hayes. Ecological non-linear state space
model selection via adaptive particle Markov chain Monte Carlo. Pre-print, 2010.
arXiv:1005.2238v1.
M. Pharr and G. Humphreys. Physically based rendering: From theory to implementation. Morgan Kaufmann, 2010.
M. K. Pitt. Smooth Particle Filters for Likelihood Evaluation and Maximisation.
Technical Report 651, Department of Economics, University of Warwick, Coventry, UK, July 2002. Warwick economic research papers.
M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters.
Journal of the American Statistical Association, 94(446):590–599, 1999.
M. K. Pitt, R. S. Silva, P. Giordani, and R. Kohn. Auxiliary particle filtering within
adaptive Metropolis-Hastings sampling. Pre-print, 2010. arXiv:1006.1914v1.
M. K. Pitt, R. S. Silva, P. Giordani, and R. Kohn. On some properties of Markov
chain Monte Carlo simulation methods based on the particle filter. Journal of
Econometrics, 171(2):134–151, 2012.
G. Poyiadjis, A. Doucet, and S. S. Singh. Particle approximations of the score and
observed information matrix in state space models with application to parameter
estimation. Biometrika, 98(1):65–80, 2011.
C. R. Rao. Linear Statistical Inference and Its Applications. Wiley, 1965.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning.
MIT Press, 2006.
H. E. Rauch, F. Tung, and C. T. Striebel. Maximum likelihood estimates of linear
dynamic systems. AIAA Journal, 3(8):1445–1450, August 1965.
C. P. Robert. The Bayesian choice. Springer, 2007.
C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2 edition,
2004.
G. O. Roberts and J. S. Rosenthal. Optimal Scaling of Discrete Approximations to
Langevin Diffusions. Journal of the Royal Statistical Society. Series B (Statistical
Methodology), 60(1):255–268, 1998.
G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling
of random walk Metropolis algorithms. The Annals of Applied Probability, 7
(1):110–120, 1997.
S. M. Ross. Simulation. Academic Press, 5 edition, 2012.
H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent
Gaussian models by using integrated nested Laplace approximations. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392,
2009.
T. B. Schön, A. Wills, and B. Ninness. System identification of nonlinear state-space models. Automatica, 47(1):39–49, 2011.
C. Sherlock, A. H. Thiery, G. O. Roberts, and J. S. Rosenthal. On the efficiency of pseudo-marginal random walk Metropolis algorithms. Pre-print, 2013.
arXiv:1309.7209v1.
R. H. Shumway and D. S. Stoffer. Time series analysis and its applications.
Springer, 3 edition, 2010.
J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of
Machine Learning Algorithms. In Advances in Neural Information Processing
Systems 25 (NIPS 2012), pages 2951–2959. Curran Associates, Inc., 2012.
J. C. Spall. A stochastic approximation technique for generating maximum likelihood parameter estimates. In American Control Conference, pages 1161–1167,
Minneapolis, MN, USA, June 1987.
R. Srikanthan and T. A. McMahon. Stochastic generation of annual, monthly
and daily climate data: A review. Hydrology and Earth System Sciences, 5(4):
653–670, 1999.
E. Taghavi, F. Lindsten, L. Svensson, and T. B. Schön. Adaptive stopping for
fast particle smoothing. In Proceedings of the 38th International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May
2013.
L. Tierney. Markov chains for exploring posterior distributions. The Annals of
Statistics, 22(4):1701–1728, 1994.
R. S. Tsay. Analysis of financial time series. John Wiley & Sons, 2 edition, 2005.
J. Unger, J. Kronander, P. Larsson, S. Gustavson, J. Löw, and A. Ynnerman.
Spatially varying image based lighting using HDR-video. Computers & graphics,
37(7):923–934, 2013.
P. E. Valenzuela, C. R. Rojas, and H. Hjalmarsson. Optimal input design for
dynamic systems: a graph theory approach. In Proceedings of the IEEE Conference on Decision and Control (CDC), Florence, Italy, December 2013.
P. E. Valenzuela, J. Dahlin, C. R. Rojas, and T. B. Schön. A graph/particle-based
method for experiment design in nonlinear systems. In Proceedings of the 19th
IFAC World Congress, Cape Town, South Africa, August 2014. (accepted for
publication).
E. Veach and L. J. Guibas. Optimally combining sampling techniques for Monte
Carlo rendering. In Proceedings of the 22nd Annual Conference on Computer
Graphics, pages 419–428, Los Angeles, CA, USA, August 1995. ACM.
E. Veach and L. J. Guibas. Metropolis light transport. In Proceedings of the
24th annual conference on Computer graphics and interactive techniques, pages
65–76, 1997.
D. J. Wilkinson. Stochastic modelling for systems biology. CRC press, 2 edition,
2011.
D. A. Woolhiser. Modeling daily precipitation – progress and problems. In Statistics in the Environmental and Earth Sciences 5, pages 71–89. Halsted Press New
York, 1992.
S. Yildirim, S. S. Singh, T. Dean, and A. Jasra. Parameter Estimation in Hidden
Markov Models with Intractable Likelihoods Using Sequential Monte Carlo. Pre-print, 2013. arXiv:1311.4117v1.
S. L. Zeger. A regression model for time series of counts. Biometrika, 75(4):
621–629, 1988.
Part II
Publications
The articles associated with this thesis have been removed for copyright
reasons. For more details about these see:
http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-106752