Linköping studies in science and technology. Thesis. No. 1652
Licentiate's Thesis

Sequential Monte Carlo for inference in nonlinear state space models

Johan Dahlin
[email protected]
www.control.isy.liu.se

Division of Automatic Control
Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden

Linköping 2014

This is a Swedish Licentiate's Thesis. Swedish postgraduate education leads to a Doctor's degree and/or a Licentiate's degree. A Doctor's degree comprises 240 ECTS credits (4 years of full-time studies). A Licentiate's degree comprises 120 ECTS credits, of which at least 60 ECTS credits constitute the Licentiate's thesis.

ISBN 978-91-7519-369-4    ISSN 0280-7971    LIU-TEK-LIC-2014:85
Copyright © 2014 Johan Dahlin
Printed by LiU-Tryck, Linköping, Sweden 2014

This thesis is dedicated to my family!

Abstract

Nonlinear state space models (SSMs) are a useful class of models for describing many different kinds of systems. Examples of their applications include modelling the volatility in financial markets, the number of infected persons during an influenza epidemic and the annual number of major earthquakes around the world. In this thesis, we are concerned with state inference, parameter inference and input design for nonlinear SSMs based on sequential Monte Carlo (SMC) methods. The state inference problem consists of estimating some latent variable that is not directly observable in the output from the system. The parameter inference problem is concerned with fitting a pre-specified model structure to the observed output from the system.
In input design, we are interested in constructing an input to the system that maximises the information about the parameters available in the system output. All of these problems are analytically intractable for nonlinear SSMs. Instead, we make use of SMC to approximate the solution to the state inference problem and to solve the input design problem. Furthermore, we make use of Markov chain Monte Carlo (MCMC) and Bayesian optimisation (BO) to solve the parameter inference problem.

In this thesis, we propose new methods for parameter inference in SSMs using both Bayesian and maximum likelihood inference. More specifically, we propose a new proposal for the particle Metropolis-Hastings algorithm that includes gradient and Hessian information about the target distribution. We demonstrate that the use of this proposal can reduce the length of the burn-in phase and improve the mixing of the Markov chain. Furthermore, we develop a novel parameter inference method based on the combination of BO and SMC. We demonstrate that this method requires a relatively small number of samples from the analytically intractable likelihood, which are computationally costly to obtain. Therefore, it could be a good alternative to other optimisation-based parameter inference methods. The proposed combination of BO and SMC is also extended to parameter inference in nonlinear SSMs with intractable likelihoods using approximate Bayesian computations. This method is applied to parameter inference in a stochastic volatility model with α-stable returns using real-world financial data.

Finally, we develop a novel method for input design in nonlinear SSMs that makes use of SMC methods to estimate the expected information matrix. This information is used in combination with graph theory and convex optimisation to estimate optimal inputs with amplitude constraints. We also consider parameter estimation in ARX models with Student-t innovations and unknown model orders.
Two different algorithms are used for this inference: reversible jump Markov chain Monte Carlo and Gibbs sampling with sparseness priors. These methods are used to model real-world EEG data with promising results.

Popular science summary

The world we live in is full of different kinds of systems that can be described by mathematical models. With the help of these models, we can gain a better understanding of how these systems are affected by their surroundings and predict how they will evolve over time. For example, one can construct models of the weather based on knowledge of physics and the weather of previous years. These models can then be used, for instance, to predict whether it will rain tomorrow. Another type of model can be used to price different kinds of financial contracts, based on previous outcomes and economic theory. A third example is models for predicting the number of future earthquakes in the world, given historical data and some model assumptions. What these examples have in common is that they all describe nonlinear dynamical systems, that is, systems that evolve over time.

In this thesis, we are interested in building nonlinear state space models of dynamical systems using data-driven statistical inference methods. With these models and methods, it is possible to combine the theoretical knowledge described by a model structure with observations from the system. The latter information can be used to determine the values of some unknown parameters in the model; this is also known as parameter estimation. Another commonly occurring problem is state estimation, where we want to determine the value of some dynamic quantity in the system that cannot be observed directly. The main difficulty with this approach is that neither of these estimation problems can be solved exactly using analytical methods. Instead, we use approximate methods, based on statistical simulation, to solve them.
So-called sequential Monte Carlo methods are used to approximate the solution to the state estimation problem. This is done using a computer that simulates a large number of hypotheses (also called particles) about how the system behaves. The hypotheses that agree well with the actually observed behaviour are kept and refined in the following step. The remaining hypotheses are removed from the simulation in order to focus the computational effort on the hypotheses that are relevant given the observed information. The parameter estimation problem can be solved approximately with similar simulation-based methods. In this thesis, we mainly work on improving parameter estimation methods based on particle Markov chain Monte Carlo (MCMC) and Bayesian optimisation. The improvements that we propose allow the estimates to be computed faster than before by making better use of the information contained in the observed data. These methods are used, for example, to price financial options. We also propose a new algorithm for designing input signals to systems so that the observations obtained from the system contain as much information as possible about the unknown parameters. Finally, we demonstrate how MCMC methods can be used for parameter estimation in models that can describe EEG signals.

Acknowledgments

I could never have written my thesis without the help, love and support from all the people around me, now and in the past. I will now spend a few lines to express my gratitude to some of the special people that have helped me along the way.

First and foremost, I would like to acknowledge the help and support from my supervisors. My main supervisor Prof. Thomas Schön has always provided me with encouragement, ideas, new people to meet, new opportunities to grab and challenges to help me grow. Also, my co-supervisor Dr.
Fredrik Lindsten has always been there for me with answers to my questions, explanations to my problems and ideas and suggestions for my work. I am most impressed by your enthusiasm and dedication in your roles as supervisors. I think that you both have surpassed what can be expected from a supervisor, and for this I am extremely grateful. I could never have done this without you! Thank you for your support and all our time running together in Stanley Park, Maastricht, Warwick and Söderköping!

Furthermore, to be able to write a good thesis you require a good working environment. Prof. Svante Gunnarsson and Ninna Stensgård are two very important persons in this effort. Thank you for all your kindness, support and helpfulness in all matters, small and large. I would also like to acknowledge Dr. Henrik Tidefelt and Dr. Gustaf Hendeby for constructing and maintaining the LaTeX template in which this thesis is written. My brother Fredrik Dahlin, Lic. Joel Kronander, Jonas Linder, Dr. Fredrik Lindsten, Prof. Thomas Schön, Andreas Svensson, Patricio Valenzuela and Johan Wågberg have all helped with proof-reading and with suggestions to improve the thesis. All remaining errors are entirely my own. I am most grateful for the financial support from the project Probabilistic modelling of dynamical systems (contract number: 621-2013-5524), funded by the Swedish Research Council.

Another aspect of the atmosphere at work is all my wonderful colleagues. Especially my roommate Lic. Michael Roth, who helps with his experience and advice about balancing life as a PhD student. Thank you for sharing the room with me and watering our plants while I am away! I would also like to thank Jonas Linder for all our adventures and our nice friendship both at and outside of work. Manon Kok and I started in the group at the same time and have helped each other over the years. Thank you for your positive attitude and for always being open to discussions.
Also, I would like to acknowledge the wonderful BBQs and other fun things that Lic. Sina Khoshfetrat Pakazad arranges and lets me participate in! Furthermore, I would like to thank my remaining friends and colleagues in the group. Especially (without any specific ordering) Dr. Christian Lyzell, Lic. Ylva Jung, Isak Nielsen, Hanna Nyquist, Dr. Daniel Petersson, Clas Veibäck and Dr. Emre Özkan for all the fun things that we have done together. This includes everything from beer tastings, wonderful food in France and fitting as many of RT's PhD students into a jacuzzi as possible, to hitting some shallows on the open sea with a canoe, cross-country skiing in Chamonix and eating food sitting on the floor in a Japanese restaurant with screaming people everywhere. You have given me the most wonderful memories and times!

I have also had the benefit of working with a lot of talented and enthusiastic researchers during my time in academia. I would first like to thank all my fellow students at Teknisk Fysik, Umeå University for a wonderful time as an undergraduate student. Also, Prof. Anders Fällström, Dr. Konrad Abramowicz, Dr. Ulf Holmberg and Dr. Sang Hoon Lee have inspired, supported and encouraged me to pursue a PhD degree, something I have not (a.s.) regretted! Also, my wonderful (former) colleagues at the Swedish Defence Research Agency (FOI) have supported and encouraged me to continue climbing the educational ladder. Thank you, Dr. Pontus Svensson, Dr. Fredrik Johansson, Dr. Tove Gustavi and Christian Mårtensson.

Finally, I would like to thank all my co-authors during my time at Linköping University for some wonderful and fruitful collaborations: Daniel Hultqvist, Lic. Daniel Jönsson, Lic. Joel Kronander, Dr. Fredrik Lindsten, Cristian Rojas, Dr. Jakob Roll, Prof. Thomas Schön, Fredrik Svensson, Dr. Jonas Unger, Patricio Valenzuela, Prof. Mattias Villani and Dr. Adrian Wills. Furthermore, Lic. Joel Kronander and Dr.
Jonas Unger have helped me with the images from the computer graphics application in the introduction. Prof. Mattias Villani, together with Stefan Laséen and Vesna Crobo at Riksbanken, helped with the economics application and made the forecasts from RAMSES II.

Finally, I am most grateful to my loving family and close relatives for their support all the way from childhood until now and beyond. I love you all very much! Also, my friends are always a great source of support, inspiration and encouragement, both at work and in life! My life would be empty and meaningless without you all! I hope that we can all spend some more time together now that my thesis is done and new challenges await! Because, what would life be without challenges, meeting new people and spending time with your loved ones? Empty.

Linköping, May 2014
Johan Dahlin

Contents

Notation

I Background

1 Introduction
   1.1 Examples of applications
       1.1.1 Predicting GDP growth
       1.1.2 Rendering photorealistic images
   1.2 Thesis outline and contributions
   1.3 Publications

2 Nonlinear state space models and statistical inference
   2.1 State space models and inference problems
   2.2 Some motivating examples
       2.2.1 Linear Gaussian model
       2.2.2 Volatility models in econometrics and finance
       2.2.3 Earthquake count model in geology
       2.2.4 Daily rainfall models in meteorology
   2.3 Maximum likelihood parameter inference
   2.4 Bayesian parameter inference

3 State inference using particle methods
   3.1 Filtering and smoothing recursions
   3.2 Monte Carlo and importance sampling
   3.3 Particle filtering
       3.3.1 The auxiliary particle filter
       3.3.2 State inference using the auxiliary particle filter
       3.3.3 Statistical properties of the auxiliary particle filter
       3.3.4 Estimation of the likelihood and log-likelihood
   3.4 Particle smoothing
       3.4.1 State inference using the particle fixed-lag smoother
       3.4.2 Estimation of additive state functionals
       3.4.3 Statistical properties of the particle fixed-lag smoother
   3.5 SMC for Image Based Lighting

4 Parameter inference using sampling methods
   4.1 Overview of computational methods for parameter inference
       4.1.1 Maximum likelihood parameter inference
       4.1.2 Bayesian parameter inference
   4.2 Metropolis-Hastings
       4.2.1 Statistical properties of the MH algorithm
       4.2.2 Proposals using Langevin and Hamiltonian dynamics
   4.3 Particle Metropolis-Hastings
   4.4 Bayesian optimisation
       4.4.1 Gaussian processes as surrogate functions
       4.4.2 Acquisition rules
       4.4.3 Gaussian process optimisation

5 Concluding remarks and future work
   5.1 Summary of the contributions
   5.2 Outlook and future work
       5.2.1 Particle Metropolis-Hastings
       5.2.2 Gaussian process optimisation using the particle filter
       5.2.3 Input design in SSMs
   5.3 Source code and data

Bibliography

II Publications

A PMH using gradient and Hessian information
   1 Introduction
   2 Particle Metropolis-Hastings
       2.1 MH sampling with unbiased likelihoods
       2.2 Constructing the first and second order proposals
       2.3 Properties of the first and second order proposals
   3 Estimation of likelihoods, gradients, and Hessians
       3.1 Auxiliary particle filter
       3.2 Estimation of the likelihood
       3.3 Estimation of the gradient
       3.4 Estimation of the Hessian
       3.5 Accuracy of the estimated gradients and Hessians
       3.6 Resulting SMC algorithm
   4 Numerical illustrations
       4.1 Estimation of the log-likelihood and the gradient
       4.2 Burn-in and scale-invariance
       4.3 The mixing of the Markov chains at stationarity
   5 Discussion and future work
   Bibliography

B Particle filter-based GPO for parameter inference
   1 Introduction
   2 Maximum likelihood estimation with a surrogate cost function
   3 Estimating the log-likelihood
       3.1 The particle filter
       3.2 Estimation of the likelihood
       3.3 Estimation of the log-likelihood
   4 Modelling the surrogate function
       4.1 Gaussian process model
       4.2 Updating the model and the hyperparameters
       4.3 Example of log-likelihood modelling
   5 Acquisition rules
       5.1 Expected improvement
   6 Numerical illustrations
       6.1 Implementation details
       6.2 Linear Gaussian state space model
       6.3 Nonlinear stochastic volatility model
   7 Conclusions
   Bibliography

C Approximate inference in SSMs with intractable likelihoods using GPO
   1 Introduction
   2 An intuitive overview
   3 Estimating the posterior distribution
       3.1 State inference
       3.2 Estimation of the log-likelihood
   4 Gaussian process optimisation
       4.1 Constructing the surrogate function
       4.2 The acquisition rule
   5 Putting the algorithm together
   6 Numerical illustrations
       6.1 Inference in α-stable data
       6.2 Linear Gaussian model
       6.3 Stochastic volatility model with α-stable returns
   7 Conclusions and outlook
   Bibliography
   A α-stable distributions
       A.1 Definitions
       A.2 Simulating random variables
       A.3 Parameter estimation

D A graph/particle-based method for experiment design
   1 Introduction
   2 Problem formulation
   3 New input design method
       3.1 Graph theoretical input design
       3.2 Estimation of the score function
       3.3 Monte Carlo-based optimisation
       3.4 Summary of the method
   4 Numerical examples
       4.1 Linear Gaussian state space model
       4.2 Nonlinear growth model
   5 Conclusion
   Bibliography

E Hierarchical Bayesian approaches for robust inference in ARX models
   1 Introduction
   2 Hierarchical Bayesian ARX Models
       2.1 Student's t distributed innovations
       2.2 Parametric model order
       2.3 Automatic relevance determination
   3 Markov chain Monte Carlo
   4 Posteriors and proposal distributions
       4.1 Model order
       4.2 ARX coefficients
       4.3 ARX coefficients variance
       4.4 Latent variance variables
       4.5 Innovation scale parameter
       4.6 Innovation DOF
   5 Numerical illustrations
       5.1 Average model performance
       5.2 Robustness to outliers and missing data
       5.3 Real EEG data
   6 Conclusions and Future work
   Bibliography

Notation

Probability

   Notation        Meaning
   → (a.s.)        Almost sure convergence.
   → (d)           Convergence in distribution.
   → (p)           Convergence in probability.
   δ_z(dx)         Dirac point mass located at x = z.
   P, E, V         Probability, expectation and covariance operators.
   ∼               Sampled from or distributed according to.

Statistical distributions

   Notation          Meaning
   A(α, β, γ, η)     α-stable distribution with stability α, skewness β, scale γ and location η.
   B(p)              Bernoulli distribution with success probability p.
   N(µ, σ²)          Gaussian (normal) distribution with mean µ and variance σ².
   G(α, β)           Gamma distribution with rate α and shape β.
   IG(α, β)          Inverse Gamma distribution with rate α and shape β.
   P(λ)              Poisson distribution with mean λ.
   U(a, b)           Uniform distribution on the interval [a, b].

Operators and other symbols

   Notation                  Meaning
   I_d                       d × d identity matrix.
   ≜                         Definition.
   diag(v)                   Diagonal matrix with the vector v on the diagonal.
   ∇f(x)                     Gradient of f(x).
   ∇²f(x)                    Hessian of f(x).
   I                         Indicator function.
   det(A), |A|               Matrix determinant of A.
   A⁻¹                       Matrix inverse of A.
   tr(A)                     Matrix trace of A.
   Aᵀ                        Matrix transpose of A.
   v² = vvᵀ                  Outer product of the vector v.
   a_{n:m}                   Sequence {a_n, a_{n+1}, ..., a_{m−1}, a_m}, for m > n.
   sign(x)                   Sign of x.
   supp(f)                   Support of the function f, {x : f(x) > 0}.

Statistical quantities

   Notation          Meaning
   I(θ)              Expected information matrix evaluated at θ.
   L(θ)              Likelihood function evaluated at θ.
   ℓ(θ)              Log-likelihood function evaluated at θ.
   θ̂_ML              Maximum likelihood parameter estimate.
   J(θ)              Observed information matrix evaluated at θ.
   θ̂                 Parameter estimate.
   p(θ|y_{1:T})      Parameter posterior distribution.
   p(θ)              Parameter prior distribution.
   θ                 Parameter vector, θ ∈ Θ ⊆ R^d.
   S(θ)              Score function evaluated at θ.

Algorithmic quantities

   Notation                       Meaning
   a_t^(i)                        Ancestor of particle i at time t.
   Z                              Normalisation constant.
   x_t^(i)                        Particle i at time t.
   R_θ(x_t | x_{0:t−1}, y_t)      Particle proposal kernel.
   W_θ(x_t, x_{t−1})              Particle weighting function.
   q(θ)                           Proposal distribution.
   π(θ)                           Target distribution.
   γ(θ)                           Unnormalised target distribution.
   w_t^(i), w̃_t^(i)               Unnormalised and normalised weight of particle i at time t.

Abbreviations

   Abbreviation    Meaning
   a.s.            Almost surely (with probability 1).
   ABC             Approximate Bayesian computations.
   ACF             Autocorrelation function.
   AIS             Adaptive importance sampling.
   APF             Auxiliary particle filter.
   AR(p)           Autoregressive process of order p.
   ARD             Automatic relevance determination.
   ARCH(p)         AR conditional heteroskedasticity process of order p.
   ARX(p)          Autoregressive exogenous process of order p.
   BIS             Bidirectional importance sampling.
   BO              Bayesian optimisation.
   bPF             Bootstrap particle filter.
   BRDF            Bidirectional reflectance distribution function.
   CDF             Cumulative distribution function.
   CLT             Central limit theorem.
   CPI             Consumer price index.
   DSGE            Dynamic stochastic general equilibrium.
   EB              Empirical Bayes.
   EEG             Electroencephalography.
   EI              Expected improvement.
   EM              Environment map.
   ESS             Effective sample size.
   faPF            Fully-adapted particle filter.
   FFBSm           Forward-filtering backward-smoothing.
   FFBSi           Forward-filtering backward-simulation.
   FL              Fixed-lag (particle smoother).
   GARCH(p,q)      Generalised ARCH process of order (p, q).
   GPO             Gaussian process optimisation.
   GPU             Graphical processing unit.
   HMM             Hidden Markov model.
   IACT            Integrated autocorrelation time.
   IBL             Image-based lighting.
   IID             Independent and identically distributed.
   IS              Importance sampling.
   KDE             Kernel density estimate/estimator.
   LGSS            Linear Gaussian state space.
   LTE             Light transport equation.
   MCMC            Markov chain Monte Carlo.
   MH              Metropolis-Hastings.
   MIS             Multiple importance sampling.
   ML              Maximum likelihood.
   MLE             Maximum likelihood estimator.
   MLT             Metropolis light transport.
   MSE             Mean square error.
   PD              Positive definite.
   PDF             Probability density function.
   PMF             Probability mass function.
   PF              Particle filter.
   PG              Particle Gibbs.
   PI              Probability of improvement.
   PMCMC           Particle Markov chain Monte Carlo.
   PMH             Particle Metropolis-Hastings.
   PMH0            Marginal particle Metropolis-Hastings.
   PMH1            PMH using first order information.
   PMH2            PMH using first and second order information.
   PS              Particle smoother.
   RJ-MCMC         Reversible jump Markov chain Monte Carlo.
   RTS             Rauch-Tung-Striebel.
   RW              Random walk.
   SIS             Sequential importance sampling.
   SIR             Sequential importance sampling and resampling.
   SLLN            Strong law of large numbers.
   SMC             Sequential Monte Carlo.
   SPSA            Simultaneous perturbation stochastic approximation.
   SSM             State space model.
   UCB             Upper confidence bound.

Part I
Background

1 Introduction

Science is the art of collecting and organising knowledge about the universe by tested explanations and validated predictions. Therefore, modelling the world using observations and statistical inference is an integral part of the scientific method. The resulting statistical models can be used to describe certain observed phenomena or to predict new phenomena and future behaviours.
An example of the former is to discover new physical models by generalising from observed data using induction. Examples of prediction applications are to validate scientific theories or to forecast the future GDP of Sweden, the probability of rainfall tomorrow and the number of earthquakes during the coming year. We discuss some of the details of these problems in the following chapters.

This thesis is concerned with building dynamical models from recorded observations, i.e. models of systems that evolve over time. The observations are combined with past experiences and established scientific theory to build models using statistical tools. Here, we limit ourselves to discussing nonlinear state space models (SSMs), where most of the structure is known beforehand except a few parameters. A fairly general class of SSMs can be expressed as

    x_{t+1} | x_t ∼ f_θ(x_{t+1} | x_t),
    y_t | x_t ∼ g_θ(y_t | x_t),

where x_t and y_t denote an unobserved (latent) state and an observation from the system at time t. Here, f_θ(x_{t+1}|x_t) and g_θ(y_t|x_t) denote two Markov kernels parametrised by an unknown static real-valued parameter vector θ. Applications of this class of SSMs can be found in almost all of the natural sciences and most of the social sciences. Some specific examples are biology (Wilkinson, 2011), control (Ljung, 1999), epidemiology (Keeling and Rohani, 2008) and finance (Tsay, 2005; Hull, 2009).

The procedure of determining the parameter vector θ from the observations y_{1:T} is referred to as parameter inference, and this problem is analytically intractable for nonlinear SSMs. Another related problem is the state inference problem, where we would like to determine the value of x_t given the information in the observations y_{1:T} or y_{1:t}. This problem is also analytically intractable for most SSMs. Instead, we make use of statistical simulation methods to estimate the parameters and states.
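As a concrete illustration of this model class (not code from the thesis), the sketch below simulates data from one particular nonlinear SSM, a standard stochastic volatility model; the function name and the parameter values are our own illustrative assumptions.

```python
import numpy as np

def simulate_sv(T, phi=0.98, sigma_v=0.16, beta=0.70, seed=0):
    """Simulate T steps of a stochastic volatility SSM of the general form above:
        x_{t+1} | x_t ~ N(phi * x_t, sigma_v^2)      (state transition, f_theta)
        y_t     | x_t ~ N(0, beta^2 * exp(x_t))      (observation, g_theta)
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    y = np.zeros(T)
    # Draw the initial state from the stationary distribution of the AR(1) state
    x[0] = rng.normal(0.0, sigma_v / np.sqrt(1.0 - phi**2))
    for t in range(T):
        # Observe y_t given the current latent log-volatility x_t
        y[t] = rng.normal(0.0, beta * np.exp(x[t] / 2.0))
        if t < T - 1:
            # Propagate the latent state through the Markov kernel f_theta
            x[t + 1] = rng.normal(phi * x[t], sigma_v)
    return x, y

x_true, y_obs = simulate_sv(500)
```

Given such simulated data, the inference problems discussed below amount to recovering x (state inference) and (phi, sigma_v, beta) (parameter inference) from y alone.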
As the name suggests, these methods are based on simulating a large number (often thousands or millions) of hypotheses, referred to as particles. The particles that match the recorded observations are retained and the others are discarded. This procedure can be repeated in a sequential manner, where the solution of the problem is obtained as the solution to many subproblems. This idea is the basis for the sequential Monte Carlo (SMC) methods that are an integral part of the methods that we consider for state inference in SSMs. For example, the marginal filtering distribution p_θ(x_t | y_{1:t}) can be approximated by an empirical distribution,

    p̂_θ(dx_t | y_{1:t}) = Σ_{i=1}^{N} w̃_t^(i) δ_{x_t^(i)}(dx_t),

where x_t^(i) and w̃_t^(i) denote particle i and its corresponding (normalised) weight obtained from the SMC algorithm. Here, δ_z(dx) denotes a Dirac point mass located at x = z. The empirical distribution summarises all the information that is contained within the data about the value of the latent state at some time t. This information can then be used by other methods for solving the parameter inference problem in SSMs.

The number of particles N in the SMC method controls both the accuracy of the empirical distribution and the computational cost. That is, a high accuracy requires many particles, which incurs a high computational cost, and this results in a trade-off between accuracy and speed. Also, we make use of the SMC algorithm within some iterative parameter inference methods to solve the state inference problem at each iteration. Therefore, we would like to limit the number of iterations required by the parameter inference methods to obtain an accurate parameter estimate at a reasonable computational cost. These ideas are the two main themes of this thesis. The first theme is to propose some developments to improve the efficiency of existing parameter inference methods based on SMC algorithms.
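The weight-resample-propagate cycle behind the empirical filtering distribution above can be sketched in a few lines. The following is a minimal bootstrap particle filter for a linear Gaussian model, written by us for illustration only; the model, the parameter values and all names are assumptions, not the thesis implementation.

```python
import numpy as np

def bootstrap_pf(y, N=500, phi=0.7, sigma_v=1.0, sigma_e=0.5, seed=0):
    """Bootstrap particle filter for the linear Gaussian SSM
        x_{t+1} | x_t ~ N(phi * x_t, sigma_v^2),  y_t | x_t ~ N(x_t, sigma_e^2).
    Returns the filtered state means and an estimate of the log-likelihood."""
    rng = np.random.default_rng(seed)
    T = len(y)
    x = rng.normal(0.0, sigma_v, size=N)   # initial particle set (assumed prior)
    x_filt = np.zeros(T)
    loglik = 0.0
    for t in range(T):
        # Weight each hypothesis by how well it matches y_t: log g_theta(y_t | x_t^(i))
        logw = -0.5 * np.log(2 * np.pi * sigma_e**2) - 0.5 * ((y[t] - x) / sigma_e) ** 2
        w = np.exp(logw - logw.max())              # shift for numerical stability
        loglik += logw.max() + np.log(w.mean())    # log of the average unnormalised weight
        w_norm = w / w.sum()                       # normalised weights, the w-tilde above
        x_filt[t] = np.sum(w_norm * x)             # mean of the empirical distribution
        # Resample: retain particles matching the data, discard the rest
        idx = rng.choice(N, size=N, p=w_norm)
        # Propagate the survivors through the state transition f_theta
        x = rng.normal(phi * x[idx], sigma_v)
    return x_filt, loglik
```

Note the design choice of working with log-weights and subtracting the maximum before exponentiating; this avoids numerical underflow when N is large or the observations are unlikely.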
The second theme is to extend some of the current methods for linear SSMs to nonlinear SSMs by making use of SMC algorithms. We return to the contributions of this thesis in Section 1.2.

1.1 Examples of applications

In this section, we give two examples of problems in which computational statistical methods based on the SMC algorithm are useful. In the first example, we use a model to forecast the future development of the Swedish economy given past experience and economic theory. In the second example, we use a model constructed from the physics of light transport to render photorealistic images.

1.1.1 Predicting GDP growth

The economy of a country is a complex system with an emergent behaviour depending on the actions of many interacting heterogeneous agents. In an economy, these agents correspond to consumers, companies, banks, politicians, governmental agencies, other countries, etc. As such, these agents may or may not act rationally in their situation and could therefore be difficult to model on an individual level. As a result, economic models mostly deal with the aggregated behaviour of many homogeneous agents, i.e. rational utility-maximising agents with a common valuation of goods and services. These models can be used to produce forecasts, to gain understanding about the current situation in the economy and to simulate the result of different policy decisions. An example of this could be to study the impact of changing the repo rate on the unemployment level and the GDP growth of the economy. For this purpose, many central banks today use dynamic stochastic general equilibrium (DSGE) models (An and Schorfheide, 2007; Del Negro and Schorfheide, 2004) for modelling the economy of a country. The outputs from these models are various macroeconomic quantities, such as GDP growth, unemployment rate, inflation, etc.
The general structure is given by economic theory, but there are some unknown parameters that need to be inferred from data. Riksbanken (the Swedish central bank) has developed a DSGE model called the Riksbank Aggregate Macromodel for Studies of the Economy of Sweden II (RAMSES II) (Adolfson et al., 2013, 2007a) to model the Swedish economy. Essentially, RAMSES II is a nonlinear SSM with 12 outputs, about 40 latent states and about 65 unknown parameters. For computational convenience, only the log-linearised version of the full model is considered in most of the analysis. Consequently, Kalman filtering methods can be used to solve the state inference problem. The parameter inference problem is solved using a Metropolis-Hastings (MH) algorithm, where the proposal is a multivariate Gaussian distribution with the covariance matrix given by the inverse of the observed information matrix at the posterior mode. The information matrix is estimated using quasi-Newton optimisation algorithms such as the BFGS algorithm (Nocedal and Wright, 2006). For more details, see Adolfson et al. (2007b). In Chapters 3 and 4, we discuss alternative methods that could solve the state and parameter inference problems in the original nonlinear version of RAMSES II. For related treatments using SMC and MCMC in combination with DSGE models, see Flury and Shephard (2011), Fernández-Villaverde and Rubio-Ramírez (2007) and Amisano and Tristani (2010).
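The random-walk MH step described above can be sketched as follows. The Gaussian target used in the usage example below is a stand-in for a posterior; all names are hypothetical and this is not the RAMSES II implementation.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, Sigma, n_iter=5000, seed=1):
    """Random-walk Metropolis-Hastings sketch with a fixed Gaussian proposal
    covariance Sigma. In the text, Sigma is the inverse observed information
    matrix at the posterior mode; here it is simply supplied by the user."""
    rng = np.random.default_rng(seed)
    d = len(theta0)
    L = np.linalg.cholesky(Sigma)           # proposal covariance factor
    chain = np.empty((n_iter, d))
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    for i in range(n_iter):
        prop = theta + L @ rng.standard_normal(d)   # Gaussian random walk
        lp_prop = log_post(prop)
        # accept with probability min(1, posterior ratio)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain
```

In practice the proposal covariance is tuned (here, via the observed information at the mode) so that the chain mixes well in all directions of the parameter space.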
In Figure 1.1, we give an example of how the RAMSES II model can be used for forecasting the changes in GDP, the consumer price index with fixed interest rates (CPIF) inflation, the repo rate and the unemployment gap1 in Sweden.

Figure 1.1: The quarterly GDP growth (green), the repo rate (red), the annual change in the CPIF (blue) and the unemployment gap (orange). The historical data is presented for each variable up until the dotted vertical lines. The predicted means are presented with the 50%, 70%, 90% and 95% credibility intervals (from lighter to darker gray). The published forecasts from Riksbanken are presented as crosses. The data and predictions are obtained by the courtesy of Riksbanken.

We first make use of some historical data (up until the dotted vertical lines) to estimate the parameters and the latent states. The resulting model is then used to forecast the predictive posterior mean (solid lines) with some credibility intervals (gray areas). These predictive means are used by Riksbanken, together with experts and other models, to construct forecasts of the economy. This forecast is presented by crosses and differs from the output of the RAMSES II model.

1.1.2 Rendering photorealistic images

To simulate light transport, we make use of a geometrical optics model, developed during centuries of research in the field of physics (Hecht, 2013).
By the use of these models, we can simulate how light behaves in different environments and make use of this to render images. A popular related application is to add objects into an image or video sequence that were not present when the scene was captured. In this section, we briefly discuss how to do this using the so-called image-based lighting (IBL) method (Debevec, 1998; Pharr and Humphreys, 2010).

To make use of the IBL method, we first require a panoramic image of the real scene captured using a high dynamic range (HDR) camera. This type of camera can record much larger variations in brightness than a standard camera, which is needed to capture all the different light sources within the scene. The resulting image is referred to as an environment map (EM). In IBL, this panoramic image serves as the source of illumination when rendering images, allowing the real objects to cast shadows and interact with the virtual objects. Secondly, we need geometrical models of the objects that we would like to add into the scene. Finally, we require a mathematical description of the optical properties of the materials in these objects, to be able to simulate how the light scatters over their surfaces.

The IBL method combines all of this information using the light transport equation (LTE), which is a physical model of how light rays propagate through space and reflect off surfaces. The LTE cannot be solved analytically, but it can be approximated using methods related to SMC algorithms. To see how this can be done, consider the cartoon of the setup presented in Figure 1.2. In the first step, a set of light rays originating from a pixel in the image plane is generated. We then track how these rays bounce around in the scene until they finally hit the EM. The colours and brightnesses of the EM at these locations are recorded and used to compute the resulting colour and brightness of the pixel in the image plane.
This approach is repeated for all the pixels in the image plane. However, in the real world, there are infinitely many rays that bounce around in the scene before they hit the pixels in the image plane. As a result, it is computationally infeasible to simulate all the light rays and all the bounces in the scene. Instead, there are methods to select only the light rays which contribute the most to the brightness and colour of each pixel in the image plane. This can be done in a manner similar to the methods discussed later in Section 4.2, resulting in the Metropolis light transport (MLT) algorithm (Veach and Guibas, 1997). The basis for these methods is to solve the LTE problem by simulating different hypotheses and improving them, in analogy with SMC methods. That is, light rays that hit bright areas of the EM are kept and modified, whereas rays that hit the EM in dim regions, or bounce around for too long, are discarded.

Note that it can take several days to render a single image using the IBL algorithm, even when only allowing for a few bounces and light rays per pixel in the image plane. This problem grows even further when we would like to render a sequence of images.

1 The unemployment gap is the amount (in percent) that the GDP must increase to achieve full employment of the work force. Decreasing the unemployment gap is an important aim of financial policy making and can be achieved (in Keynesian economic theory) by increasing public spending or lowering taxes.

Figure 1.2: The basic setup of ray tracing underlying photorealistic image synthesis. The colour and brightness of a pixel in the image plane is determined by the EM, the geometry of the scene and the optical properties of the objects.

Figure 1.3: The scene before (left) and after (right) the rendering using a version of the IBL method. The image is taken from Unger et al. (2013) and is used with courtesy of the authors.
A possible solution could be to start from the solution for the previous frame and adapt it to the new frame. If the EMs are similar, this could lead to a decrease in the total computational cost. We return to this idea in Section 3.5.

In Figure 1.3, we present an example from Unger et al. (2013) of a scene before (left) and after (right) it is rendered in a computer by the use of the IBL method and the methods discussed in Section 4.2. Note that, in the final result, we have added several photorealistic objects into the scene, such as the sofa and the table, and have also changed the floor in the room. These methods are used in many entertainment applications to create special effects and to modify scenes in post production. Furthermore, they are useful for rendering images of scenes that are difficult or costly to build in the real world. Some well-known companies (such as IKEA and Volvo) make use of these methods for digital design and advertisements, as a cost-effective alternative to traditional photography.

1.2 Thesis outline and contributions

This thesis is divided into two parts. In Part I, we give some examples of models and applications together with an introduction to the different computational inference methods that are used. In Part II, we present edited versions of some published peer-reviewed papers and unpublished technical reports.

Part I - Background

In this part, we begin by introducing the SSM and provide some additional examples of its real-world applications in Chapter 2. Furthermore, we introduce two different statistical paradigms for parameter inference problems in SSMs: the maximum likelihood (ML) based approach and the Bayesian approach. Finally, we discuss why computational methods are required for estimating the solution to these problems. In Chapter 3, we review the state inference problem in SSMs and discuss the use of SMC methods for approximating the solution to these problems.
We also discuss the use of SMC algorithms for other classes of models, which includes the computer graphics example discussed in Section 1.1.2. Chapter 4 is devoted to discussing the parameter inference problem for nonlinear SSMs. We begin by giving an overview of different parameter inference methods and then discuss Markov chain Monte Carlo (MCMC) and Bayesian optimisation (BO) in more detail. The former can be used for Bayesian parameter inference and the latter can be used for ML or maximum a posteriori (MAP) based parameter inference. We conclude Part I with Chapter 5, which contains a summary of the contributions of the thesis together with some general conclusions and possible avenues for future work.

Part II - Publications

The main part of this thesis is the compilation of five papers published in peer-reviewed conference proceedings or as technical reports. These papers contain the main contributions of this thesis:

• In Paper A, we develop a novel particle MCMC algorithm that combines the particle Metropolis-Hastings (PMH) algorithm with Langevin dynamics. The resulting algorithm explores the posterior distribution more efficiently than the marginal PMH algorithm, is invariant to affine transformations of the parameter vector and reduces the length of the burn-in. As a consequence, the proposed algorithm requires fewer iterations, which makes it more computationally efficient than the marginal PMH algorithm.

• In Paper B, we develop a novel algorithm for ML parameter inference by combining ideas from BO with SMC for log-likelihood estimation. The resulting algorithm is computationally efficient, as it requires fewer samples from the log-likelihood compared with other popular methods.

• In Paper C, we extend the combination of BO and SMC to parameter inference in nonlinear SSMs with intractable likelihoods. Computationally costly approximate Bayesian computations (ABC) are used to approximate the likelihood.
We illustrate the proposed algorithm for parameter inference in a stochastic volatility model with α-stable returns using real-world data.

• In Paper D, we develop a novel algorithm for input design in nonlinear SSMs, which can handle amplitude constraints on the input. The proposed method makes use of SMC for estimating the expected information matrix. The algorithm performs well compared with some other methods in the literature and decreases the variance of the parameter estimates by almost an order of magnitude.

• In Paper E, we propose two algorithms for parameter inference in ARX models with Student-t innovations, which include automatic model order selection. These methods make use of reversible jump MCMC (RJMCMC) and the Gibbs sampler, together with sparseness priors, to estimate the model order and the parameter vector. We illustrate the use of the proposed algorithms to model real-world EEG data with promising results.

Here, we present an abstract of each paper together with an account of the contribution of the author of this thesis.

Paper A

Paper A of this thesis is an edited version of,

J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis-Hastings using gradient and Hessian information. Pre-print, 2014b. arXiv:1311.0686v2.

which is a combination and development of the two earlier publications

J. Dahlin, F. Lindsten, and T. B. Schön. Second-order particle MCMC for Bayesian parameter inference. In Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, August 2014a. (accepted for publication).

J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis Hastings using Langevin dynamics. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013a.

Abstract: PMH allows for Bayesian parameter inference in nonlinear state space models by combining MCMC and particle filtering.
The latter is used to estimate the intractable likelihood. In its original formulation, PMH makes use of a marginal MCMC proposal for the parameters, typically a Gaussian random walk. However, this can lead to a poor exploration of the parameter space and an inefficient use of the generated particles. We propose two alternative versions of PMH that incorporate gradient and Hessian information about the posterior into the proposal. This information is more or less obtained as a byproduct of the likelihood estimation. Indeed, we show how to estimate the required information using a fixed-lag particle smoother, with a computational cost growing linearly in the number of particles. We conclude that the proposed methods can: (i) decrease the length of the burn-in phase, (ii) increase the mixing of the Markov chain at the stationary phase, and (iii) make the proposal distribution scale invariant, which simplifies tuning.

Contributions and background: The author of this thesis contributed with the majority of the work including the design, the implementation, the numerical illustrations and the written presentation.

Paper B

Paper B of this thesis is an edited version of,

J. Dahlin and F. Lindsten. Particle filter-based Gaussian process optimisation for parameter inference. In Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, August 2014. (accepted for publication).

Abstract: We propose a novel method for maximum-likelihood-based parameter inference in nonlinear and/or non-Gaussian state space models. The method is an iterative procedure with three steps. At each iteration a particle filter is used to estimate the value of the log-likelihood function at the current parameter iterate. Using these log-likelihood estimates, a surrogate objective function is created by utilizing a Gaussian process model.
Finally, we use a heuristic procedure to obtain a revised parameter iterate, providing an automatic trade-off between exploration and exploitation of the surrogate model. The method is profiled on two state space models with good performance in terms of both accuracy and computational cost.

Contributions and background: The author of this thesis contributed with the majority of the work including the design, the implementation, the numerical illustrations and the written presentation.

Paper C

Paper C of this thesis is an edited version of,

J. Dahlin, T. B. Schön, and M. Villani. Approximate inference in state space models with intractable likelihoods using Gaussian process optimisation. Technical Report LiTH-ISY-R-3075, Department of Electrical Engineering, Linköping University, Linköping, Sweden, April 2014c.

Abstract: We propose a novel method for MAP parameter inference in nonlinear state space models with intractable likelihoods. The method is based on a combination of BO, SMC and ABC. SMC and ABC are used to approximate the intractable likelihood by using the similarity between simulated realisations from the model and the data obtained from the system. The BO algorithm is used for the MAP parameter estimation given noisy estimates of the log-likelihood. The proposed parameter inference method is evaluated in three problems using both synthetic and real-world data. The results are promising, indicating that the proposed algorithm converges fast and with reasonable accuracy compared with existing methods.

Contributions and background: The author of this thesis contributed with the majority of the work including the design, the implementation, the numerical illustrations and the written presentation. This contribution resulted from participation in the course Bayesian learning given by Prof. Mattias Villani at Linköping University during the autumn of 2013.

Paper D

Paper D of this thesis is an edited version of,

P. E. Valenzuela, J. Dahlin, C. R.
Rojas, and T. B. Schön. A graph/particle-based method for experiment design in nonlinear systems. In Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, August 2014. (accepted for publication).

Abstract: We propose an extended method for experiment design in nonlinear state space models. The proposed input design technique optimizes a scalar cost function of the information matrix, by computing the optimal stationary probability mass function (PMF) from which an input sequence is sampled. The feasible set of the stationary PMF is a polytope, allowing it to be expressed as a convex combination of its extreme points. The extreme points in the feasible set of PMFs can be computed using graph theory. Therefore, the final information matrix can be approximated as a convex combination of the information matrices associated with each extreme point. For nonlinear SSMs, the information matrices for each extreme point can be computed by using particle methods. Numerical examples show that the proposed technique can be successfully employed for experiment design in nonlinear SSMs.

Contributions and background: This is an extension of the work presented in Valenzuela et al. (2013) and a result of the cooperation with the Department of Automatic Control at the Royal Institute of Technology (KTH). The author of this thesis designed and implemented the algorithm for estimating the expected information matrix and the Monte Carlo method for estimating the optimal input. The corresponding sections in the paper were also written by the author of this thesis.

Paper E

Paper E of this thesis is an edited version of,

J. Dahlin, F. Lindsten, T. B. Schön, and A. Wills. Hierarchical Bayesian ARX models for robust inference. In Proceedings of the 16th IFAC Symposium on System Identification (SYSID), Brussels, Belgium, July 2012b.
Abstract: Gaussian innovations are the typical choice in most ARX models, but using other distributions, such as the Student-t, could be useful. We demonstrate that this choice of distribution for the innovations provides an increased robustness to data anomalies, such as outliers and missing observations. We consider these models in a Bayesian setting and perform inference using numerical procedures based on MCMC methods. These models include automatic order determination by two alternative methods, based on a parametric model order and a sparseness prior, respectively. The methods and the advantage of our choice of innovations are illustrated in three numerical studies using both simulated data and real EEG data.

Contributions and background: The author of this thesis contributed to parts of the implementation, generated most of the numerical illustrations and wrote the sections covering the numerical illustrations and the conclusions in the paper. The EEG data was kindly provided by Eline Borch Petersen and Thomas Lunner at Eriksholm Research Centre, Oticon A/S, Denmark.

1.3 Publications

Published works of relevance to this thesis are listed below in reverse chronological order. Items marked with ⋆ are included in Part II of this thesis.

⋆ J. Dahlin, T. B. Schön, and M. Villani. Approximate inference in state space models with intractable likelihoods using Gaussian process optimisation. Technical Report LiTH-ISY-R-3075, Department of Electrical Engineering, Linköping University, Linköping, Sweden, April 2014c.

⋆ J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis-Hastings using gradient and Hessian information. Pre-print, 2014b. arXiv:1311.0686v2.

⋆ J. Dahlin and F. Lindsten. Particle filter-based Gaussian process optimisation for parameter inference. In Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, August 2014. (accepted for publication).

J. Dahlin, F. Lindsten, and T. B. Schön.
Second-order particle MCMC for Bayesian parameter inference. In Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, August 2014a. (accepted for publication).

⋆ P. E. Valenzuela, J. Dahlin, C. R. Rojas, and T. B. Schön. A graph/particle-based method for experiment design in nonlinear systems. In Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, August 2014. (accepted for publication).

J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis Hastings using Langevin dynamics. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013a.

⋆ J. Dahlin, F. Lindsten, T. B. Schön, and A. Wills. Hierarchical Bayesian ARX models for robust inference. In Proceedings of the 16th IFAC Symposium on System Identification (SYSID), Brussels, Belgium, July 2012b.

Other published works related to but not included in the thesis are:

J. Kronander, J. Dahlin, D. Jönsson, M. Kok, T. B. Schön, and J. Unger. Real-time Video Based Lighting Using GPU Raytracing. In Proceedings of the 2014 European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, September 2014a. (submitted, pending review).

J. Kronander, T. B. Schön, and J. Dahlin. Backward sequential Monte Carlo for marginal smoothing. In Proceedings of the 2014 IEEE Statistical Signal Processing Workshop (SSP), Gold Coast, Australia, July 2014b. (accepted for publication).

D. Hultqvist, J. Roll, F. Svensson, J. Dahlin, and T. B. Schön. Detection and positioning of overtaking vehicles using 1D optical flow. In Proceedings of the IEEE Intelligent Vehicles (IV) Symposium, Dearborn, MI, USA, June 2014. (accepted for publication).

J. Dahlin and P. Svenson. Ensemble approaches for improving community detection methods. Pre-print, 2013. arXiv:1309.0242v1.

J. Dahlin, F. Lindsten, and T. B. Schön. Inference in Gaussian models with missing data using Equalisation Maximisation. Pre-print, 2013b. arXiv:1308.4601v1.
J. Dahlin, F. Johansson, L. Kaati, C. Mårtensson, and P. Svenson. A Method for Community Detection in Uncertain Networks. In Proceedings of International Symposium on Foundation of Open Source Intelligence and Security Informatics 2012, Istanbul, Turkey, August 2012a.

J. Dahlin and P. Svenson. A Method for Community Detection in Uncertain Networks. In Proceedings of 2011 European Intelligence and Security Informatics Conference, Athens, Greece, August 2011.

2 Nonlinear state space models and statistical inference

In this chapter, we introduce the SSM and give some motivating examples of different applications in which the model is used. We also review ML inference and Bayesian inference in connection with SSMs. Interested readers are referred to Douc et al. (2014), Cappé et al. (2005), Ljung (1999), Shumway and Stoffer (2010) and Brockwell and Davis (2002) for more detailed accounts of the topics covered here.

2.1 State space models and inference problems

An SSM or hidden Markov model (HMM) consists of a pair of discrete-time stochastic processes1, x_{0:T} ≜ {x_t}_{t=0}^{T} and y_{1:T} ≜ {y_t}_{t=1}^{T}. Here, x_t ∈ X ⊆ R^n denotes the latent state and y_t ∈ Y ⊆ R^m denotes the observation obtained from the system at time t. The latent state is modelled as a Markov chain with initial state x_0 ∼ µ(x_0) and transition kernel f_θ(x_{t+1} | x_t, u_t). Furthermore, we assume that the observations are mutually independent given the latent states, with conditional observation density g_θ(y_t | x_t, u_t). In both kernels, θ ∈ Θ ⊆ R^d denotes the static parameter vector and u_t denotes a known input to the system. With these definitions, we can write the SSM in the compact form

x_0 ∼ µ(x_0),                                (2.1a)
x_{t+1} | x_t ∼ f_θ(x_{t+1} | x_t, u_t),     (2.1b)
y_t | x_t ∼ g_θ(y_t | x_t, u_t),             (2.1c)

1 In this thesis, we do not distinguish in notation between a random variable and its realisation. This is done to ease the notation.
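Given samplers for µ, f_θ and g_θ, one realisation from a model on the form (2.1) can be generated as in the following sketch. The callable interface and all names are assumptions made for illustration; they are not defined in the thesis.

```python
import numpy as np

def simulate_ssm(mu_sample, f_sample, g_sample, T, u=None, seed=0):
    """Draw one realisation (x_{0:T}, y_{1:T}) from an SSM on the form (2.1).
    mu_sample, f_sample and g_sample are user-supplied sampler callables
    (a hypothetical interface used only for this sketch)."""
    rng = np.random.default_rng(seed)
    u = np.zeros(T) if u is None else np.asarray(u, dtype=float)
    x = np.empty(T + 1)
    y = np.empty(T)
    x[0] = mu_sample(rng)                       # x_0 ~ mu(x_0),        (2.1a)
    for t in range(T):
        x[t + 1] = f_sample(rng, x[t], u[t])    # x_{t+1} ~ f_theta,    (2.1b)
        y[t] = g_sample(rng, x[t + 1], u[t])    # y_{t+1} ~ g_theta,    (2.1c)
    return x, y
```

For example, plugging in Gaussian samplers recovers the linear Gaussian special case discussed later in Section 2.2.1.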
We make use of this compact formulation throughout the thesis. This is a fairly general class of models and can be used to model both nonlinear and non-Gaussian systems.

Figure 2.1: Graphical model of an SSM with the latent process (red) and the observed process (blue).

Another popular description of stochastic models, such as the class of SSMs, is the graphical model (Murphy, 2012; Bishop, 2006). The corresponding graphical representation of an SSM is depicted in Figure 2.1, where the latent state process is presented in red and the observed process in blue. From this graphical model, we see that the state x_t only depends on the previous state x_{t-1}, due to the Markov property inherent in the model. That is, all the information about the past is summarised in the state at time t-1. Also, we see that the observations are mutually independent given the states, as there are no arrows directly between two observations.

In this thesis, we are interested in two different inference problems connected to SSMs: (i) the state inference problem and (ii) the parameter inference problem. The first problem is to infer the density of the latent state process given the observations and the model. If we are interested in the current state given all the observations up until now, we would like to estimate the marginal filtering density p_θ(x_t | y_{1:t}). We return to this and other related problems in Chapter 3, where we define the problem in mathematical terms and present numerical methods designed to approximate the filtering and smoothing densities. The second problem is to infer the values of the parameters θ given the set of observations y_{1:T} and the model structure encoded by the kernels f_θ(x_{t+1} | x_t) and g_θ(y_t | x_t). It turns out that we have to solve the state inference problem as a part of the parameter inference problem.
We later return to the mathematical formulation of this problem in the ML setting in Section 2.3 and in the Bayesian setting in Section 2.4. These problems are analytically intractable and cannot be solved in closed form. Therefore, we present computational methods based on sampling for parameter inference in Chapter 4.

2.2 Some motivating examples

SSMs have been successfully applied in various areas for modelling dynamical systems. In this section, we give three examples from different research fields and connect them with the inference problems discussed in the previous section. The first model is taken from finance, where we would like to model the real-valued latent volatility given some stock or exchange rate data. In the second model, we would like to make predictions of the number of annual major earthquakes. The third model is taken from meteorology, where we would like to predict the probability of rain fall during the coming days. However, we start with the well-known linear Gaussian state space (LGSS) model, which we make use of as a benchmark throughout the thesis.

2.2.1 Linear Gaussian model

Consider the scalar LGSS model2,

x_{t+1} | x_t ∼ N(x_{t+1}; φ x_t + γ u_t, σ_v²),    (2.2a)
y_t | x_t ∼ N(y_t; x_t, σ_e²),                      (2.2b)

where the parameter vector is θ = {φ, γ, σ_v, σ_e}. Here, φ describes the persistence of the state and {σ_v, σ_e} control the noise levels. In this model, we have added an optional input u_{1:T} to the system, which is scaled by the parameter γ. We require that φ ∈ (-1, 1) ⊂ R to obtain a stable system and that {σ_v, σ_e} ∈ R_+², as they correspond to standard deviations.

The state inference problem can be solved exactly for this model using Kalman filters and smoothers (Kailath et al., 2000). This is a consequence of the model being linear with only Gaussian kernels. Due to this property, we make use of the model as a benchmark problem for some of the algorithms reviewed in this thesis.
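For the scalar LGSS model (2.2), the exact filtering solution and the log-likelihood can be sketched as a standard Kalman recursion. The initialisation (x0, P0) below is an illustrative choice, not a value prescribed by the thesis.

```python
import numpy as np

def kalman_loglik(y, phi, gamma, sigma_v, sigma_e, u=None, x0=0.0, P0=1.0):
    """Kalman filter sketch for the scalar LGSS model (2.2). Returns the
    filtered means E[x_t | y_{1:t}] and the exact log-likelihood of y_{1:T}."""
    T = len(y)
    u = np.zeros(T) if u is None else np.asarray(u, dtype=float)
    m, P, ll = x0, P0, 0.0
    means = np.empty(T)
    for t in range(T):
        # predict through the state dynamics (2.2a)
        m_pred = phi * m + gamma * u[t]
        P_pred = phi ** 2 * P + sigma_v ** 2
        # update with the observation y_t via (2.2b)
        S = P_pred + sigma_e ** 2            # innovation variance
        K = P_pred / S                       # Kalman gain
        v = y[t] - m_pred                    # innovation
        m = m_pred + K * v
        P = (1.0 - K) * P_pred
        ll += -0.5 * (np.log(2.0 * np.pi * S) + v ** 2 / S)
        means[t] = m
    return means, ll
```

The exact log-likelihood returned here is what makes the LGSS model useful as a benchmark: particle-based estimates can be checked against it directly.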
Comparing the output of the methods that we develop for nonlinear SSMs with the exact solutions available for the LGSS model can reveal important properties of the algorithms and provide insights into how to calibrate them. The Kalman methods can also be used to estimate the log-likelihood, the score function and the information matrix, which are important quantities in the ML parameter inference problem discussed in Section 2.3.

2.2.2 Volatility models in econometrics and finance

Nonlinear SSMs are often encountered in econometric and financial problems, where we would like to e.g. model the variations in the log-returns of a stock or an index. The log-returns are calculated as y_t = log(s_t / s_{t-1}), where s_t denotes the price of some financial asset at time t. The variation in the log-returns can be seen as the instantaneous standard deviation and is referred to as the volatility.

2 This model is also known as ARX(1) in noise, where ARX(1) denotes an exogenous autoregressive process of order 1.

The volatility plays an important role in the famous Black-Scholes pricing model (Black and Scholes, 1973). In this model, the log-returns are assumed to follow a Brownian motion with independent and identically distributed (IID) increments distributed according to some Gaussian distribution N(µ, σ²), where σ denotes the volatility. Therefore, the volatility is an important component when calculating the price of options and other financial instruments based on the Black-Scholes model. For a discussion of how the volatility is used for pricing options, see Hull (2009), Björk (2004) and Glasserman (2004). For a more extensive treatment of financial time series and alternative inference methods for volatility models, see Tsay (2005).

In Figure 2.2, we present the closing prices and daily log-returns for the NASDAQ OMX Stockholm 30 Index during a 14 year period.
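The log-returns used in this example follow directly from the price series; a minimal sketch of the transformation y_t = log(s_t / s_{t-1}) (the helper name is illustrative):

```python
import numpy as np

def log_returns(prices):
    """Daily log-returns y_t = log(s_t / s_{t-1}) from a price series s."""
    s = np.asarray(prices, dtype=float)
    return np.diff(np.log(s))
```

A series of T prices thus yields T - 1 log-returns, which form the observations y_{1:T-1} in the volatility models below.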
Note the large drops in the closing prices (middle) in the periods around the years 2001, 2008 and 2011, in connection with the most recent financial crises (shocks). At these drops, we see that the log-returns are quite volatile, as they vary considerably between consecutive days. During other periods, the volatility is quite low and the log-returns do not vary much between consecutive days. These variations in the volatility are referred to as volatility clustering in finance. Also, from the QQ-plot we see that the log-returns are heavy-tailed and clearly non-Gaussian, with large deviations from the theoretical quantiles in the tails. All these features (and some others) are known as stylized facts (Cont, 2001).

In this section, we present three different volatility models that try to capture different aspects of the observed properties of financial data. A complication is that the resulting inference problems become more challenging as we try to capture more and more of the stylized facts. This results in a trade-off between the accuracy of the model and the computational complexity of the resulting inference problems. A common theme for all the models considered here is that they are based on a (slowly) varying random walk-type model of the volatility. This model can be motivated by the volatility clustering behaviour, i.e. the underlying volatility varies slowly and gives rise to volatility clustering. Furthermore, the log-returns are modelled as a non-stationary white noise process where the variance is determined by the latent volatility. This corresponds quite well with the log-returns presented in the upper part of Figure 2.2. For a more thorough discussion of different volatility models, see Mitra (2011) and Kim et al. (1998).

The first model is the generalised autoregressive conditional heteroskedasticity (GARCH) model (Bollerslev, 1986), which is a generalisation of the ARCH model (Engle, 1982).
Here, we consider the GARCH(1,1) in noise model given by

ht+1 | xt, ht = α + β xt² + γ ht, (2.3a)
xt+1 | xt, ht ∼ N(xt+1; 0, ht+1), (2.3b)
yt | xt ∼ N(yt; xt, τ²), (2.3c)

where the parameter vector is θ = {α, β, γ, τ} with the constraints {α, β, γ, τ} ∈ R⁴₊ and β + γ ∈ (0, 1) ⊂ R for stability.

Figure 2.2: Daily log-returns (upper) for the NASDAQ OMX Stockholm 30 Index from 2000-01-04 to 2014-03-14. The daily closing prices (middle), QQ-plot of the log-returns (lower left) and histogram of the log-returns with the kernel density estimate (KDE) (lower right) are also presented.

In this model, the current log-return is taken into account when computing the volatility at the next time step. This construction tries to capture the volatility clustering. The second model is the Hull-White stochastic volatility (HWSV) model3 (Hull and White, 1987) given by

xt+1 | xt ∼ N(xt+1; µ + φ(xt − µ), σ²), (2.4a)
yt | xt ∼ N(yt; 0, β² exp(xt)), (2.4b)

where the parameter vector is θ = {µ, φ, σ, β} with the constraints {µ, σ, β} ∈ R³₊ and φ ∈ (−1, 1) ⊂ R for stability. There are many variants of the HWSV model that include correlations between the noise sources (called leverage models) and outliers in the form of jump processes. See Chib et al. (2002) and Jacquier et al. (2004) for more information. One problem with the HWSV model is that the Gaussian observation noise in some cases cannot fully capture the heavy-tailed behaviour found in real-world data.
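As an illustration of how data is generated from the HWSV model (2.4), the following is a minimal simulation sketch; the parameter values are illustrative choices, not estimates from the thesis:

```python
import numpy as np

def simulate_hwsv(T, mu=0.0, phi=0.98, sigma=0.16, beta=0.70, seed=0):
    """Simulate the HWSV model (2.4): the latent log-volatility x_t follows a
    mean-reverting AR(1) process and the log-return y_t is zero-mean Gaussian
    with variance beta^2 * exp(x_t)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    y = np.zeros(T)
    # Initialise from the stationary distribution of the AR(1) state.
    x[0] = mu + sigma / np.sqrt(1.0 - phi**2) * rng.standard_normal()
    for t in range(T):
        y[t] = beta * np.exp(x[t] / 2.0) * rng.standard_normal()
        if t < T - 1:
            x[t + 1] = mu + phi * (x[t] - mu) + sigma * rng.standard_normal()
    return x, y

x, y = simulate_hwsv(500)
```

Simulated series of this kind exhibit the volatility clustering discussed above: periods where |x_t| is large produce bursts of large log-returns.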
Instead, et is often assumed to be simulated from a Student-t distribution, which has heavier tails than the Gaussian distribution. Another modification is to assume that et is generated from an α-stable distribution, which can model both large outliers in the data and non-symmetric log-returns. This results in the third model, the stochastic volatility model with symmetric α-stable returns (SVα) (Casarin, 2004), given by

xt+1 | xt ∼ N(xt+1; µ + φ xt, σ²), (2.5a)
yt | xt ∼ A(yt; α, 0, exp(xt/2), 0), (2.5b)

where the parameter vector is θ = {µ, φ, σ, α} with the constraints {µ, σ} ∈ R²₊, α ∈ (0, 2] \ {1} ⊂ R and φ ∈ (−1, 1) ⊂ R for stability. Here, A(α, 0, 1, 0) denotes a symmetric α-stable distribution4 with stability parameter α. For this distribution, we cannot evaluate gθ(yt|xt), which results in problems when inferring the states and parameter vector of the model. In Paper C, we apply ABCs to solve this problem. As previously discussed, the inference problem in volatility models is mainly state inference, which requires a model and hence results in the need for parameter inference. For example, the state estimate can be used as the volatility estimate for option pricing. Also, the parameter vector of the model can be used to analyse whether the log-returns are symmetric and heavy-tailed, or to determine the persistence of the underlying volatility process.

3 A similar model is used for modelling the glacial varve thickness (the thickness of the clay collected within the glacial varves) in Shumway and Stoffer (2010). This model is obtained by replacing the noise in (2.4b) with gamma distributed noise, i.e. et ∼ G(α⁻¹, α) for some parameter α ∈ R₊.
4 See Appendix A of Paper C for a brief summary of α-stable distributions and their properties. For a more detailed presentation, see Nolan (2003).
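Although the density of A(α, 0, 1, 0) cannot be evaluated pointwise, it is straightforward to simulate from, which is what ABC methods exploit. A sketch using the Chambers-Mallows-Stuck transformation for the symmetric case (a standard simulation technique, stated here under the assumption β = 0):

```python
import numpy as np

def sample_symmetric_alpha_stable(alpha, size, rng):
    """Draw from the symmetric alpha-stable law A(alpha, 0, 1, 0) using the
    Chambers-Mallows-Stuck transformation (symmetric case, beta = 0)."""
    u = rng.uniform(-np.pi / 2.0, np.pi / 2.0, size)  # uniform angle
    w = rng.exponential(1.0, size)                    # unit exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

# Observation noise draws for an SV-alpha-type model (alpha = 1.8 is illustrative).
rng = np.random.default_rng(1)
eps = sample_symmetric_alpha_stable(1.8, 1000, rng)
```

Multiplying such draws by the scale exp(xt/2) gives simulated observations from (2.5b); for α = 2 the transformation recovers a Gaussian with variance 2, consistent with the standard parameterisation of stable laws.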
Figure 2.3: The major earthquakes in the world between 2000 and 2013. The size of the circle is proportional to the relative magnitude of the earthquake.

2.2.3 Earthquake count model in geology

Another application of SSMs is to model the number of major (magnitude 7 or higher on the Richter scale) earthquakes each year. As a motivation for the model, we consider some real-world data from the Earthquake Data Base System of the U.S. Geological Survey5. The data describes the number of major earthquakes around the world between the years 1900 and 2013. In Figure 2.3, we present the locations of the major earthquakes during the period 2000 to 2013. In Figure 2.4, we present the annual number of major earthquakes with some exploratory plots. From the data, we see that the number of earthquakes is clearly correlated, similar to the clustering behaviour found in the volatility models from the previous application. That is, a year with many earthquakes is likely to be followed by another year with high earthquake intensity. The reason for this is quite intuitive, as pointed out in Langrock (2011), since earthquakes are due to stresses in the tectonic plates of the Earth. Therefore, the underlying process which models these stresses should be slowly varying over a time span of years. Due to the autocorrelation in the number of earthquakes, it is reasonable to assume that the intensity is determined by some underlying latent variable. Therefore, we can model the number of major earthquakes as an SSM. To this end, we use the model6 from Zeger (1988) and Chan and Ledolter (1995), which assumes that the number of earthquakes yt is a Poisson distributed variable (corresponding to a positive integer or count data).
Also, we assume that the mean of the Poisson process λt follows an AR(1) process,

log(λt) − µ = φ (log(λt−1) − µ) + σv vt,

where vt denotes a standard Gaussian random variable. By introducing xt = log(λt) − µ and β = exp(µ), we obtain the SSM

xt+1 | xt ∼ N(xt+1; φ xt, σv²), (2.6a)
yt | xt ∼ P(yt; β exp(xt)), (2.6b)

where the parameter vector is θ = {φ, σv, β} with the constraints φ ∈ (−1, 1) ⊂ R and {σv, β} ∈ R²₊. Here, P(λ) denotes a Poisson distributed variable with mean λ. That is, the probability of k ∈ N earthquakes during year t is given by the probability mass function (PMF),

P[Nt = k] = exp(−λ) λᵏ / k!.

The inference problem in this type of model could be to determine the underlying intensity of the process (the state). This information could be useful in making predictions of the future number of major earthquakes.

5 This data can be accessed from http://earthquake.usgs.gov/earthquakes/eqarchives/.
6 This type of Poisson count model can also be used to model the number of yearly polio infections (Zeger, 1988) and the number of transactions per minute of a stock on the market (Fokianos et al., 2009).

Figure 2.4: Upper: number of annual major earthquakes (with magnitude 7 or higher on the Richter scale) during the period between the years 1900 and 2013. Middle: the corresponding histogram with the KDE (blue). Lower: the estimated autocorrelation function (ACF).

2.2.4 Daily rainfall models in meteorology

A common problem in meteorology and weather forecasting is to estimate the probability and amount of rainfall. In practice, this problem is often split into two subproblems, each with its own model.
The first model determines the probability of rainfall and the second model determines the amount of rainfall. For more information, see Srikanthan and McMahon (1999) and Woolhiser (1992). Here, we consider the problem of constructing a model to determine the probability of rainfall given historic data of rainfall in the region of interest. It is also possible to use user data from the Internet to predict the probability of rainfall in some region. An example of this is to make use of data collected from Twitter, see Naesseth (2012) for more information about this approach. We first consider some real-world data from the Swedish weather service (SMHI), collected daily at Malmslätt near Linköping during the period between the years 1952 and 2002. The data is presented in Figure 2.5 as the daily probability of rainfall (upper) and the average daily amount of rainfall (middle), calculated per week. We also present the ACF of rainfall, which indicates that there is a correlation between rainy and non-rainy days. Furthermore, the probability of rainfall seems to follow a cyclic behaviour, with a high probability during the latter part of the year. The amount of rainfall also varies, with a peak around week 30. To construct a model of the probability of rainfall, we follow the insights from the previous analysis and review the model proposed by Langrock and Zucchini (2011). To account for the autocorrelation in the rainfall, we make use of a latent process to describe the persistence of the weather. This construction can be used to create a rough model that accounts for the structure of low pressure weather systems which pass over a period of days. Hence, this can be seen as a short term model of the probability of rainfall. From practical knowledge, we also know that the probability of rainfall is connected with the season of the year. This cyclic behaviour was also seen in the weather data from Malmslätt.
Therefore, we assume that there also exists a cyclical part of the latent process and that this, together with the short term persistence, determines the probability of rainfall. Furthermore, we assume that the output from the model is a binary variable yt ∼ B(pt) from a Bernoulli distribution with success probability pt. That is, the variable assumes the value 1 with probability pt (rain falls during day t) and the value 0 with probability 1 − pt (no rain falls during day t). The final model on SSM form is given by

ht = Σ_{k=1}^2 αk cos(2kπt/365) + Σ_{k=1}^2 βk sin(2kπt/365), (2.7a)
xt+1 | xt ∼ N(xt+1; φ xt, σv²), (2.7b)
yt ∼ B(exp(µ + xt + ht) / (1 + exp(µ + xt + ht))), (2.7c)

where the parameter vector is θ = {φ, σv, µ, α1, α2, β1, β2} with the constraints φ ∈ (−1, 1) ⊂ R and σv ∈ R+.

Figure 2.5: The daily probability of rainfall (upper), the daily amount of rainfall (middle) and the ACF of rainfall (lower). The values are calculated as weekly averages of daily data from Malmslätt during the period between the years 1952 and 2002. The data is provided by the Swedish weather service (SMHI) and is used under the creative commons license.

The inference problem in this model could be to determine the probability of rainfall (estimate pt) given the data. Also, it could be interesting to determine the strength of the persistence in the system, determined by φ, or the strength of the seasonal components, determined by {α1, α2, β1, β2}.

2.3 Maximum likelihood parameter inference

In this section, we present the fundamentals of ML based parameter inference in SSMs.
This presentation is mainly included to set the notation and to highlight some features of the method that are needed in the sequel. An accessible general introduction to ML inference is given by Casella and Berger (2001). For more extensive treatments, see Rao (1965) and Lehmann and Casella (1998). Inference in the ML paradigm is focused on optimising the likelihood function L(θ) = pθ(y1:T), which also appears in Bayesian inference. The likelihood encodes the information contained within the observations y1:T into a quantity that can be used for inference.

2.1 Definition (Likelihood function for an SSM). The likelihood (function) of an SSM can be expressed as the decomposition

L(θ) = pθ(y1:T) = pθ(y1) Π_{t=2}^T pθ(yt|y1:t−1), (2.8)

where pθ(yt|y1:t−1) denotes the one-step-ahead predictor.

It is common to replace the likelihood function with the log-likelihood (function) in many inference problems. This is done to simplify analytical calculations and improve the numerical stability of many algorithms. The log-likelihood is given by

ℓ(θ) = log pθ(y1:T) = log pθ(y1) + Σ_{t=2}^T log pθ(yt|y1:t−1). (2.9)

In general, there are two different interpretations of the likelihood in statistics. The first is that the likelihood is a function of the data for a fixed parameter θ. A common name for this distribution is the sampling distribution, and it plays an important role when calculating the distribution of some sampled data. The second interpretation (which we adopt in this thesis) is that the likelihood is a function of the parameter. That is, the data is fixed and the likelihood therefore summarises the information in the data. The ML parameter inference problem is formulated as a maximisation problem of the likelihood or, equivalently, the log-likelihood. This follows from the fact that the logarithm is a monotone function and hence any maximiser of the likelihood is also a maximiser of the log-likelihood.
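In the special case of IID observations, the one-step-ahead predictors in (2.8)-(2.9) reduce to the marginal density and the log-likelihood is a simple sum. A minimal sketch for IID Gaussian data:

```python
import numpy as np

def gaussian_loglik(y, mu, sigma):
    """Log-likelihood (2.9) when the observations are IID N(mu, sigma^2),
    so that each one-step-ahead predictor equals the marginal density."""
    y = np.asarray(y)
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2)
                  - 0.5 * (y - mu) ** 2 / sigma**2)
```

For dependent data, such as the output of an SSM, each term pθ(yt|y1:t−1) must instead be computed recursively, e.g. by a filter, which is the topic of Chapter 3.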
2.2 Definition (Maximum likelihood parameter inference problem). The parameter inference problem in the ML setting is given by

θ̂ML = argmax_{θ∈Θ} L(θ) = argmax_{θ∈Θ} ℓ(θ), (2.10)

where θ̂ML denotes the ML parameter estimate.

The interpretation of (2.10) is that we should select the parameter that, together with the model, is the most likely to have generated the observations. This definition makes good intuitive sense, but as we shall see in the following section it is not the only method for estimating the parameter given the data. We continue by discussing some useful quantities connected with the log-likelihood and then return to discuss the properties of the estimator in (2.10). The gradient of the log-likelihood is referred to as the score function and the negative Hessian of the log-likelihood is referred to as the observed information matrix. These quantities are useful in many optimisation algorithms, as they provide the first order and second order information about the objective function (the log-likelihood) in (2.10), respectively. Also, this information can be used to build efficient proposals in some sampling algorithms, see Paper A.

2.3 Definition (Score function). The score function is defined as the gradient of the log-likelihood,

S(θ') = ∇ℓ(θ)|θ=θ', (2.11)

where the gradient is taken with respect to the parameter vector.

The score function has a natural interpretation as the slope of the log-likelihood. Hence, the score function is zero when evaluated at the true parameter vector, S(θ⋆) = 0. However, note that this is not necessarily true when we work with finite data samples, as discussed in Example 2.6.

2.4 Definition (Observed information matrix). The observed information matrix is defined as the negative Hessian of the log-likelihood,

J(θ') = −∇²ℓ(θ)|θ=θ', (2.12)

where the Hessian is taken with respect to the parameter vector.
The statistical interpretation of the observed information matrix is as a measure of the amount of information in the data regarding the parameter θ. That is, if the data is informative, the resulting information matrix is large (according to some measure). Also, the information matrix can geometrically be seen as the negative curvature of the log-likelihood. As such, we expect it to be positive definite (PD) at the ML parameter estimate (c.f. the second-derivative test in basic calculus). Finally, we note that there exists a limiting behaviour for the observed information matrix, which approaches the so-called expected information matrix as the number of data points tends to infinity.

2.5 Definition (Expected information matrix). The expected information matrix (or the Fisher information matrix) is defined as the expected value of the observed information matrix (2.12),

I(θ') = −E_{y1:T}[∇²ℓ(θ)]|θ=θ' = E_{y1:T}[∇ℓ(θ) (∇ℓ(θ))ᵀ]|θ=θ', (2.13)

where the expectation is taken with respect to the data record.

Note that the expected information matrix is independent of the data realisation, whereas the observed information is dependent on the realisation. The expected information matrix is PD for all values of θ, as it can be seen as the variance of the score function. We make use of this property in Section 4.2.2 and in Paper A to construct a random walk on a Riemann manifold using the information matrices. We conclude the discussion on the score function and the information matrix with Example 2.6, where we investigate the defined quantities for an LGSS model as a function of the parameter φ.

2.6 Example: Score and information matrix in the LGSS model
Consider the LGSS model (2.2) with the parameter vector θ⋆ = {0.5, 0, 1, 0.1}, from which we generate a realisation of length T = 250 using the initial value x0 = 0. We fix {σv, σe} at their true values and create a grid over φ.
For each grid point, we calculate the log-likelihood, the score function and the expected information matrix. The results are presented in Figure 2.6. We note that the log-likelihood has a distinct maximum near the true parameter (presented as dotted vertical lines) and that the score function is zero close to this point. The small difference in the score function is due to the fact that we use a finite amount of data. Finally, the zero of the score function results in a maximum of the log-likelihood function, as the expected information (negative Hessian) is positive.

Figure 2.6: The estimates of the log-likelihood function (upper), the score function (lower left) and the expected information matrix (lower right) of the LGSS model in Example 2.6. The dotted lines indicate the true parameter and the zero-level.

With the definition of the expected information matrix in place, we now return to discussing the properties of the ML parameter estimate. The ML estimator obtained from (2.10) has a number of strong asymptotic properties, i.e. when the number of observations tends to infinity. It is possible to show that this estimator is consistent, asymptotically normal and efficient under some regularity conditions. These conditions include that the parameter space is compact and that the likelihood, score function and information matrix exist and are well-behaved. The ML estimator is said to be consistent as it fulfils

θ̂ML → θ⋆ a.s., T → ∞.

That is, the estimate almost surely converges to the true value of the parameter in the limit of infinite data. Furthermore, as the estimator is asymptotically normal, we have that the error in the estimate satisfies a CLT given by

√T (θ̂ML − θ⋆) →d N(0, I⁻¹(θ⋆)),
which follows from a Taylor expansion of the log-likelihood around the point θ = θ⋆. Note that the expected information matrix enters into the expression and limits the accuracy of the estimate. Therefore, a natural question is whether we can somehow change the size of this matrix to decrease the lower bound on the variance of the estimate. This is a key problem in the field of input design, where an input is added to the SSM to maximise some scalar function of the expected information matrix. This problem is discussed for nonlinear SSMs in Example 4.10 and Paper D. Lastly, we say that an estimator is efficient if it attains the Cramér-Rao lower bound, which means that no other unbiased estimator has a lower mean square error (MSE). That is, the ML estimator is the best unbiased estimator in the MSE sense. This last property is appealing and one might be tempted to say that this estimator is the best choice for parameter inference. However, this result is only valid when we have an infinite number of samples. Therefore, other estimators (e.g. the Bayes estimators discussed in the next section) could have better properties in the finite sample regime. Also, there can exist biased estimators with lower MSE than the ML estimator. We have now introduced the ML parameter inference problem and some of the properties of the resulting estimate. In practice, we usually encounter a number of problems when we try to solve the optimisation problem in (2.10). These include that for nonlinear SSMs: (i) the log-likelihood, the score and the information matrix are all intractable, (ii) the optimisation problem cannot be solved in closed-form and (iii) there could exist many local maxima of the log-likelihood, making numerical optimisation problematic. To solve these problems, a number of different numerical approaches have been developed.
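For the LGSS model, the quantities in Example 2.6 are computable exactly. A sketch of a grid evaluation of the log-likelihood using a scalar Kalman filter, assuming the simple form x_{t+1} = φ x_t + v_t, y_t = x_t + e_t (the exact parameterisation of (2.2) is given earlier in the chapter) with illustrative values matching the example:

```python
import numpy as np

def lgss_loglik(y, phi, sigma_v=1.0, sigma_e=0.1):
    """Kalman-filter log-likelihood for the assumed scalar LGSS model
    x_{t+1} = phi*x_t + v_t, y_t = x_t + e_t, with x_0 = 0."""
    ll, m, P = 0.0, 0.0, sigma_v**2              # predictive moments of x_1
    for yt in y:
        S = P + sigma_e**2                       # innovation variance
        ll += -0.5 * (np.log(2.0 * np.pi * S) + (yt - m)**2 / S)
        K = P / S                                # Kalman gain
        m, P = m + K * (yt - m), (1.0 - K) * P   # measurement update
        m, P = phi * m, phi**2 * P + sigma_v**2  # time update
    return ll

# Simulate T = 250 observations at phi = 0.5 and locate the grid maximiser.
rng = np.random.default_rng(0)
T, phi_true = 250, 0.5
x = np.zeros(T)
x[0] = rng.standard_normal()
for t in range(T - 1):
    x[t + 1] = phi_true * x[t] + rng.standard_normal()
y = x + 0.1 * rng.standard_normal(T)
grid = np.linspace(-0.95, 0.95, 39)
phi_hat = grid[np.argmax([lgss_loglik(y, p) for p in grid])]
```

The grid maximiser lands close to the true value φ = 0.5, mirroring the distinct maximum of the log-likelihood seen in Figure 2.6.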
We return to the use of these approaches for estimating the log-likelihood, the score and the information matrix in Chapter 3. We also discuss other numerical methods for solving the ML parameter inference problem in Chapter 4.

2.4 Bayesian parameter inference

Previously, in the ML paradigm, we implicitly assumed that the true parameter is a specific value. In this section, we instead assume that the true parameter is distributed according to some probability distribution. From this assumption, we can construct a different statistical paradigm called Bayesian inference, which can be used for state and parameter inference in nonlinear SSMs. As in the previous section, we only briefly discuss Bayesian inference for SSMs to set the notation and highlight some important features that we make use of in this thesis. For general treatments of Bayesian analysis, see Robert (2007) and Berger (1985). Furthermore, two accessible introductions are Gelman et al. (2013) and Casella and Berger (2001). In the Bayesian parameter inference problem, we combine the information in the data described by the likelihood pθ(y1:T) with prior information encoded as a probability distribution denoted p(θ). This combination of the prior information and the information in the data is made using Bayes' theorem. This procedure is referred to as the prior-posterior update in Bayesian inference. The result is an updated probability distribution called the posterior distribution, which we denote p(θ|y1:T) in the parameter inference problem. Similar distributions can be defined for the state inference problem and we return to this in Section 3.1.

2.7 Definition (Bayesian parameter inference problem). Given the parameter prior p(θ) and the likelihood pθ(y1:T), we obtain the parameter posterior by Bayes' theorem as

p(θ|y1:T) = pθ(y1:T) p(θ) / p(y1:T) ∝ pθ(y1:T) p(θ), (2.14)

where p(y1:T) denotes the marginal likelihood (or the evidence).
The name is due to the fact that it can be computed by the marginalisation

p(y1:T) = ∫ pθ(y1:T) p(θ) dθ. (2.15)

One of the main questions in Bayesian inference is the choice of the prior and how we may encode our prior beliefs about the data into it. A common view is that this choice is subjective, as it is often made using intuition and previous knowledge, which depend on subjective experiences. In this thesis, we make use of simple priors to encode stability properties of the nonlinear system or to keep the standard deviation positive at all times. For this, we make use of uniform priors over different subsets of the parameter space. An example of this is that we assume a uniform prior over φ ∈ (−1, 1) ⊂ R for the LGSS model (2.2), which can be expressed as p(φ) = U(φ; −1, 1). Note that the use of improper priors can be seen as a link between ML and Bayesian inference, but we shall not discuss this further. Instead, interested readers are referred to Robert (2007) for more information. Another important class of priors that we make use of in Paper E is conjugate priors, which enable us to compute closed-form expressions for the posterior. A conjugate prior has the same functional form as the posterior and depends on the form of the likelihood. For example, the conjugate prior for the mean (given that the variance is known) of the Gaussian distribution is again a Gaussian distribution. This results from the fact that the product of two Gaussian densities (the likelihood and the prior) is again proportional to a Gaussian density, and hence it is a conjugate prior. We give another example of a conjugate prior in Example 2.8. Conjugate priors exist mainly for some combinations of priors and likelihoods in the exponential family of distributions, see Robert (2007) for a discussion.
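The inverse-Gamma/Gaussian conjugate pair, used in Example 2.8 below, admits a particularly simple closed-form update. A minimal sketch with illustrative settings matching the example ({µ, σ²} = {1.5, 4}, T = 200, prior coefficients {α0, β0} = {1, 4}):

```python
import numpy as np

def ig_posterior(y, mu, alpha0, beta0):
    """Conjugate update: an IG(alpha0, beta0) prior on the variance of a
    Gaussian with known mean mu gives an IG(alpha_T, beta_T) posterior."""
    y = np.asarray(y)
    alpha_T = alpha0 + len(y) / 2.0
    beta_T = beta0 + 0.5 * np.sum((y - mu) ** 2)
    return alpha_T, beta_T

# Simulated data with {mu, sigma^2} = {1.5, 4}, prior {alpha0, beta0} = {1, 4}.
rng = np.random.default_rng(0)
y = rng.normal(1.5, 2.0, 200)
aT, bT = ig_posterior(y, 1.5, 1.0, 4.0)
post_mean = bT / (aT - 1.0)   # mean of IG(aT, bT), close to the true variance 4
```

With T = 200 observations the posterior mean lies close to the true variance, illustrating how the data dominates the prior as the record grows.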
Figure 2.7: The posterior (upper) of the parameter σ² in the Gaussian distribution resulting from the combination of an inverse-Gamma prior (lower left) and the data log-likelihood (lower right).

2.8 Example: Prior-posterior update in a conjugate model
Consider the problem of inferring the value of the variance σ² in a Gaussian distribution, given that we know the value of the mean µ. Also, we assume that we have obtained a set of IID data y1:T generated by the model. The conjugate prior for the variance in a Gaussian distribution is the inverse-Gamma distribution with shape α0 and scale β0, denoted IG(α0, β0). By direct calculation, we obtain that the posterior has the parameters

αT = α0 + T/2 and βT = β0 + (1/2) Σ_{t=1}^T (yt − µ)².

Here, we simulate a data realisation of length T = 200 using the parameters {µ, σ²} = {1.5, 4} and use the prior coefficients {α0, β0} = {1, 4}. The corresponding posterior, prior and log-likelihood are presented in Figure 2.7. We note that the prior assigns a small probability to the variance being equal to 4. However, as the data record is relatively large, the resulting posterior is centred close to the true value of the parameter.

The Bayesian parameter inference problem is completely described by (2.14) and everything that we need to know is encoded in the posterior. This follows from the likelihood principle, which states that all information obtained from the data is contained within the likelihood function, see Robert (2007). However, we are sometimes interested in computing point estimates of the parameter vector. To compute point estimates, we can make use of statistical decision theory to make a decision about what information from the posterior to use.
Consider a loss function L : Θ × Θ → R+, which takes the parameter and its estimate as input and returns a real-valued positive loss. The expected posterior loss (or posterior risk) ρ(p(θ), δ|y1:T) is given by

ρ(p(θ), δ|y1:T) = ∫ L(θ, δ(y1:T)) p(θ|y1:T) dθ, (2.16)

where δ(y1:T) denotes the decision of the parameter estimate given the data. The Bayes estimator is defined as the minimising argument of the expected posterior loss,

δ⋆(y1:T) = argmin_{δ(y1:T)∈Θ} ρ(p(θ), δ|y1:T).

Here, we restrict ourselves to discussing the resulting Bayes estimators for the three different loss functions in Table 2.1, but there are many other possibly suitable choices for our application. From this, we see that the estimate that minimises the quadratic loss function is the expected value of the posterior,

E[θ] = ∫ θ p(θ|y1:T) dθ. (2.17)

That is, the Bayes estimator is given by the posterior mean for this choice of loss function. This is an example of a common expectation operation in Bayesian inference, as the expected value of the parameter is computed with respect to the parameter posterior distribution. Similar expressions can be written for many other Bayes estimators.

Table 2.1: Different loss functions and the resulting Bayes estimator.

Loss function | Bayes estimator
Linear: L(θ, δ) = |θ − δ| | Posterior median
Quadratic: L(θ, δ) = (θ − δ)² | Posterior mean
0-1: L(θ, δ) = I(θ ≠ δ) | Posterior mode

The statistical properties of the Bayes estimator depend on the choice of loss function when we work with finite data. However, it follows (under some conditions) from the Bernstein–von Mises theorem that the influence of the prior diminishes when the amount of data grows. Consequently, the MAP estimator (the posterior mode) converges to the ML estimator in some cases when the amount of data increases. Therefore, it is possible to see ML parameter inference as a special case of Bayesian parameter inference in some sense.
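Given samples from the posterior (e.g. from an MCMC run), the three Bayes estimators in Table 2.1 are straightforward to approximate. A sketch, where the mode is approximated crudely by the peak of a histogram:

```python
import numpy as np

def bayes_estimates(theta_samples, bins=50):
    """Approximate the Bayes estimators of Table 2.1 from posterior samples:
    posterior median (linear loss), mean (quadratic loss) and a histogram
    approximation of the mode (0-1 loss)."""
    s = np.asarray(theta_samples)
    counts, edges = np.histogram(s, bins=bins)
    k = np.argmax(counts)
    mode = 0.5 * (edges[k] + edges[k + 1])   # centre of the tallest bin
    return np.median(s), np.mean(s), mode

# Stand-in posterior samples from N(2, 0.5^2), for illustration only.
rng = np.random.default_rng(3)
med, mean, mode = bayes_estimates(rng.normal(2.0, 0.5, 100000))
```

For a symmetric unimodal posterior like this one, all three estimators agree; for skewed posteriors the choice of loss function matters.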
Finally, we note that Bayes estimators asymptotically have the same properties as the ML estimator, i.e. they are consistent, asymptotically normal and efficient. Also, as the amount of data tends to infinity, the posterior tends to a Gaussian distribution, which is known as the Bayesian CLT. For more information about the statistical properties of Bayes estimators, see Robert (2007), Lehmann and Casella (1998) and Berger (1985). We have now introduced the Bayesian paradigm for parameter and state inference in nonlinear SSMs. We have seen that the key quantity in Bayesian inference is the posterior distribution, which depends on the likelihood. Therefore, in practice we encounter the same problems with analytically intractable likelihoods as in the ML paradigm. Another complication is that the posterior might not be described by any known distribution. As a consequence, we cannot compute in closed form many of the integrals depending on the posterior distribution that are encountered in Bayesian analysis. Examples of such integrals are the normalisation in (2.15), the marginalisation in (2.16) and the expectation in (2.17). To counter these problems, we can make use of the same numerical methods from Chapter 3 as for the ML paradigm to estimate the likelihood function. To solve the Bayesian parameter inference problem, we can make use of sampling methods to obtain approximations of the posterior distributions of interest. These methods use computer simulations and have therefore only been available for practical use during the last few decades. However, during this period these methods have been well-studied and are today established tools for Bayesian inference. Another approach is to make analytical approximations of the posterior distribution using known distributions or iterative procedures.
We discuss some of these methods in Chapter 4.

3 State inference using particle methods

In this chapter, we discuss the state inference problem in detail and present some algorithms for approximate state inference. The foundation of state inference in SSMs is given by the Bayesian filtering and smoothing recursions that are discussed in Section 3.1. Unfortunately, these recursions are analytically intractable for most SSMs. Therefore, the remaining part of this chapter is devoted to discussing numerical methods based on SMC for approximate state inference. We also discuss how to make use of SMC methods to estimate the likelihood, the score function and the information matrix in SSMs. The chapter is concluded by discussing the direct illumination problem in computer graphics and how it can be solved using SMC methods.

3.1 Filtering and smoothing recursions

There are mainly two types of state inference problems in an SSM: filtering and smoothing. Both the filtering and smoothing problems can be marginal (inference on a single state), k-interval (inference on k states) or joint (inference on all states). In marginal filtering, only the observations y1:t collected until the current time step t are used to infer the current value of the state xt. In marginal smoothing, all the collected observations, including (possibly) future observations y1:T, are used to infer the value of the current state xt with t ≤ T. In Table 3.1, we summarise some common filtering and smoothing problems in SSMs. The solutions to the filtering and smoothing problems are given by the Bayesian filtering and smoothing recursions (Jazwinski, 1970; Anderson and Moore, 2005). These (as the names suggest) are based on Bayesian inference, similar to the parameter inference problems discussed in Section 2.4.
Table 3.1: Common filtering and smoothing densities in SSMs.

    (Marginal) filtering                    pθ(xt | y1:t)
    (Marginal) smoothing (t ≤ T)            pθ(xt | y1:T)
    Joint smoothing                         pθ(x0:T | y1:T)
    Fixed-interval smoothing (s < t ≤ T)    pθ(xs:t | y1:T)
    Fixed-lag smoothing (for lag ∆)         pθ(xt−∆−1:t | y1:t)

These recursions can be used to iteratively solve the filtering problem for each time t using the two steps

    pθ(xt | y1:t) = pθ(xt | y1:t−1) gθ(yt | xt) / pθ(yt | y1:t−1),            (3.1a)
    pθ(xt | y1:t−1) = ∫ fθ(xt | xt−1) pθ(xt−1 | y1:t−1) dxt−1.                (3.1b)

In the first step, the state estimate at time t is updated with the new observation yt (also known as the measurement update) using Bayes' theorem. In the second step, a marginalisation is used to predict the next state xt using the information in the current state and the collected observations (also known as the time update). In a similar manner, the smoothing problem can be solved by iterating the marginalisation

    p(xt | y1:T) = p(xt | y1:t) ∫ [f(xt+1 | xt) p(xt+1 | y1:T) / p(xt+1 | y1:t)] dxt+1,    (3.2a)
    p(xt+1 | y1:t) = ∫ f(xt+1 | xt) p(xt | y1:t) dxt,                                      (3.2b)

backwards in time. Here, the smoothing recursion makes use of the filtering distribution computed in a forward pass. The resulting smoother is therefore known as the forward filtering backward smoothing (FFBSm) algorithm. We return to a numerical method based on the FFBSm algorithm for use in SSMs later in this chapter. The filtering (3.1) and smoothing (3.2) recursions can only be solved analytically for two different classes of SSMs: linear Gaussian SSMs and SSMs with finite state processes. For the former, the recursions can be solved by the Kalman filter (KF) and e.g. the Rauch–Tung–Striebel (RTS) smoother (Rauch et al., 1965), respectively. Using the Kalman methods, we can exactly compute the likelihood and the states in an LGSS model.
We do not discuss the details of the Kalman methods here and refer readers to Anderson and Moore (2005) and Kailath et al. (2000) for extensive treatments of Kalman filtering and different Kalman smoothers. The general intractability of the Bayesian filtering and smoothing recursions results from the fact that there are no closed-form expressions for the densities in the recursions. This is similar to the problems that we encountered in the Bayesian parameter inference problem. Instead, we consider numerical approximations that rely on Monte Carlo (MC) methods to estimate the filtering and smoothing distributions. This results in SMC and related methods, which we return to in Section 3.3.

3.2 Monte Carlo and importance sampling

MC methods are a collection of statistical simulation methods based on sampling and the strong law of large numbers (SLLN). For a general introduction to MC methods, see Robert and Casella (2004) for an extensive treatment or Ross (2012) for an accessible introduction. In this thesis, we mainly use MC methods for estimating the expected value (an integral) of an arbitrary well-behaved test function ϕ(x),

    ϕ̂ = Eπ[ϕ(x)] = ∫ ϕ(x) π(x) dx,    (3.3)

where π(x) denotes a (normalised) target distribution from which we can simulate IID particles (or samples). As a consequence of the SLLN [1], we can estimate the expected value by the sample average

    ϕ̂MC = (1/N) Σ_{i=1}^{N} ϕ(x(i)),   x(i) ∼ π(x),

taken over independent realisations x(i) simulated from π(x). This estimator is strongly consistent, i.e.

    ϕ̂MC → ϕ̂ almost surely as N → ∞.

Also, we can construct a CLT for the error of the MC estimator, given by

    √N (ϕ̂ − ϕ̂MC) →d N(0, σ²MC),   σ²MC = V[ϕ(x)] < ∞,

where we assume that the function ϕ(x) has a finite second moment. Here, we see that the MC estimator is asymptotically unbiased with Gaussian errors.
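As an illustration (our own sketch in Python, not code from the thesis; the helper `mc_estimate` and the Gaussian test case are our own choices), the vanilla MC estimator of (3.3) takes only a few lines:

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_estimate(phi, sampler, n):
    """Vanilla MC estimate of E_pi[phi(x)] from n IID draws, as in (3.3)."""
    x = sampler(n)
    return np.mean(phi(x))

# Example: E[x^2] under N(0, 1) is exactly 1; the error variance decays as 1/N.
est = mc_estimate(lambda x: x**2, lambda n: rng.standard_normal(n), 100_000)
```

Repeating the call with increasing n shows the 1/N decay of the error variance discussed next.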
Also, the variance of the error decreases as 1/N independently of the dimension of the problem, which is one of the main advantages of MC methods. Consider the problem of estimating (3.3) when the normalised target π(x) = γ(x)Z⁻¹ is unknown together with the normalisation factor Z, and we can only evaluate the unnormalised target γ(x) point-wise. In this case, importance sampling (IS) (Marshall, 1956) can be used to sample from another distribution, called the proposal distribution (or importance distribution) q(x), and adapt the particles using a weighting scheme. For this, it is required that q(x) > 0 for almost all x ∈ supp(π) and that the support of ϕ(x)q(x) contains the support of π(x), i.e. supp(π) ⊂ supp(ϕq). The IS algorithm follows by rewriting the expected value in (3.3) to obtain

    Eπ[ϕ(x)] = ∫ ϕ(x) π(x) dx = ∫ ϕ(x) [γ(x)/q(x)] q(x) dx = Eq[w(x) ϕ(x)],

where w(x) ≜ γ(x)/q(x) denotes the (unnormalised) importance weights. The IS estimator follows in analogy with the vanilla MC estimator,

    ϕ̂IS = Σ_{i=1}^{N} w̃(i) ϕ(x(i)),   x(i) ∼ q(x),    (3.4)

where the normalised weights w̃(i) ≜ w̃(x(i)) are given by

    w̃(i) = w(i) / Σ_{k=1}^{N} w(k),   i = 1, …, N.    (3.5)

We can also construct an unbiased estimator of the normalisation constant by

    ẐIS = (1/N) Σ_{i=1}^{N} w(i),    (3.6)

which we shall later make use of to estimate the (marginal) likelihood in SSMs.

Algorithm 1 Importance sampling (IS)
Inputs: γ(x) (unnormalised target), q(x) (proposal) and N > 0 (no. particles).
Output: p̂(dx) (empirical distribution).
1: for i = 1 to N do
2:   Sample a particle by x(i) ∼ q(x).
3:   Compute the weight by w(i) = γ(x(i))/q(x(i)).
4: end for
5: Normalise the particle weights by (3.5).
6: Estimate p̂(dx) using (3.7).

[1] The SLLN states that the sample average x̄, computed using N IID samples from π(x), converges almost surely to µ = Eπ[x] as N tends to infinity.
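Algorithm 1 can be sketched in a few lines of Python (our own illustration, not thesis code; the Gaussian target and proposal are hypothetical choices used only to exercise the algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_sampling(log_gamma, q_sample, log_q, phi, n):
    """Self-normalised IS estimate of E_pi[phi(x)], where pi = gamma / Z.

    log_gamma: log of the unnormalised target, evaluated point-wise.
    q_sample, log_q: sampler and log-density of the proposal q(x).
    Returns the estimate (3.4) and the normalisation estimate (3.6).
    """
    x = q_sample(n)                          # steps 1-2: x^(i) ~ q(x)
    log_w = log_gamma(x) - log_q(x)          # step 3: unnormalised log-weights
    w = np.exp(log_w - np.max(log_w))        # stabilised exponentiation
    w_tilde = w / np.sum(w)                  # step 5: normalise, eq. (3.5)
    z_hat = np.mean(np.exp(log_w))           # eq. (3.6)
    return np.sum(w_tilde * phi(x)), z_hat   # step 6 plugged into (3.3)

# Target: unnormalised N(2, 1), gamma(x) = exp(-(x - 2)^2 / 2), so Z = sqrt(2 pi).
est, z_hat = importance_sampling(
    log_gamma=lambda x: -0.5 * (x - 2.0) ** 2,
    q_sample=lambda n: rng.normal(0.0, 3.0, n),
    log_q=lambda x: -0.5 * (x / 3.0) ** 2 - np.log(3.0) - 0.5 * np.log(2 * np.pi),
    phi=lambda x: x,
    n=200_000,
)
```

Here `est` recovers the target mean 2 and `z_hat` recovers Z = √(2π), even though only the unnormalised target is evaluated.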
The IS algorithm can be interpreted as generating an empirical distribution, which can be seen by rewriting the estimator in (3.4) as a Dirac mixture given by

    p̂(dx) = Σ_{i=1}^{N} w̃(i) δ_{x(i)}(dx),    (3.7)

which we can insert into (3.3) to recover (3.4), as presented in Algorithm 1. Here, δz(dx) denotes a Dirac point mass located at x = z. In the following, we make use of this property to compute expectations in state inference problems using sequential variants of the IS algorithm. We conclude the discussion of the IS algorithm with Example 3.1, in which we estimate the parameters of the HWSV model using some real-world data. Note that the IS algorithm can be developed further to include multiple and adaptive proposals. In multiple importance sampling (MIS), the algorithm can make use of several proposals that are then combined to form the estimate. In adaptive importance sampling (AIS), we update a mixture of proposals, or the proposal itself, after each iteration to fit the target better. These methods could be interesting for developing new approximate state inference algorithms similar to the particle methods discussed in the next section. For more information, see Kronander and Schön (2014), Cornuet et al. (2011) and Veach and Guibas (1995).

3.1 Example: IS for Bayesian parameter inference in the HWSV model

Consider the Bayesian parameter inference problem in the Hull-White SV model (2.4) and the data presented in Section 2.2.2. In this model, we would like to estimate the parameters given the data by the use of the IS algorithm. The unnormalised target distribution is given by γ(θ) = pθ(y1:T) p(θ), where we assume that we can evaluate the likelihood point-wise. Note that the solution to this problem is analytically intractable but can be approximated using methods discussed later in Section 3.3.4. Furthermore, we assume uniform priors over φ ∈ (−1, 1) ⊂ R and {σv, β} ∈ R²₊.
The parameter proposals are given by

    q(φ) = U(φ; −1, 1),   q(σv) = G(σv; 2, 0.1),   q(β) = G(β; 7, 0.1).

These proposals are based on prior information regarding the typical values of the parameters in the model. Hence, they can be seen as a kind of prior distribution in the Bayesian inference. For the IS algorithm, we use N = 200 samples and the procedure outlined in Algorithm 1. In Figure 3.1, we present the resulting posterior distributions together with the proposal distributions. The parameter estimates obtained from the posterior mode are θ̂IS = {0.996, 0.129, 0.837}, which indicates a slowly varying latent volatility process. In Chapter 4, we discuss other methods for parameter inference that are more efficient when the number of parameters increases. However, for small problems the IS algorithm can be a useful alternative when the prior information is good. Otherwise, the number of samples required from the parameter posterior increases exponentially with the number of elements in the parameter vector.

3.3 Particle filtering

The IS algorithm discussed in the previous section can be combined with the filtering recursions in Section 3.1 to sequentially estimate the joint smoothing distribution π(x0:t) = pθ(x0:t | y1:t). To this end, we can apply Algorithm 1 to sequentially approximate the target using an empirical distribution. The resulting algorithm is known as the sequential importance sampling (SIS) algorithm.

Figure 3.1: The estimated posterior distributions for φ (green), σv (red) and β (orange), computed as KDEs, with the proposal distributions presented in blue. The grey dots indicate the samples obtained from the proposal and the dotted lines indicate the posterior means.
The main problem with this algorithm is that the variance of the estimate increases rapidly as t increases. For a long time, this was a major obstacle in approximate state inference. The problem results from the fact that the particle weights deteriorate over time: in the limit, only a single particle has a non-zero weight after a few iterations. Hence, the effective number of particles is one, which is the reason for the high variance in the estimates. However, by including a resampling step in the SIS algorithm, we can focus the attention of the algorithm on the areas of interest. The resampling step essentially duplicates particles with large weights and discards particles with small weights, while keeping the total number of particles fixed. Hence, we do not end up with only one effective particle in the estimator. This development leads to the SIS with resampling (SIR) algorithm. When this method is applied to SSMs, we obtain the basic particle filtering algorithm introduced by Gordon et al. (1993). In this thesis, we refer to this algorithm as the bootstrap particle filter (bPF). In subsequent developments, the particle filter has been generalised to include more advanced proposals and resampling schemes. In this section, we present a refined version of the bPF called the auxiliary particle filter (APF) (Pitt and Shephard, 1999), which can use more general particle proposals and weighting functions than the original formulation. This can result in a large decrease in the variance of the estimate and also a decrease in the number of particles required to achieve a certain accuracy. The APF also allows for the use of other resampling schemes than the multinomial resampling used in the original formulation of the bPF. To keep the presentation brief, we do not derive the APF and refer interested readers to Doucet and Johansen (2011) for a derivation of the APF starting from the IS algorithm.
Furthermore, we note that the APF is a member of the more general family of SMC algorithms, which are discussed in Cappé et al. (2007) and Del Moral et al. (2006).

3.3.1 The auxiliary particle filter

The APF algorithm operates by constructing a particle system {wt(i), xt(i)}_{i=1}^{N} sequentially over time. By the use of this particle system, we can construct an empirical marginal filtering distribution in analogy with (3.7) as

    p̂θ(dxt | y1:t) ≜ Σ_{i=1}^{N} w̃t(i) δ_{xt(i)}(dxt),    (3.8a)
    w̃t(i) ≜ wt(i) / Σ_{k=1}^{N} wt(k),                      (3.8b)

which can be seen as a discrete approximation of the filtering distribution using a collection of weighted Dirac point masses. Here, xt(i) and wt(i) denote particle i at time t and its corresponding (unnormalised) importance weight. Also, the empirical joint smoothing distribution follows similarly as

    p̂θ(dx0:t | y1:t) ≜ Σ_{i=1}^{N} w̃t(i) δ_{x0:t(i)}(dx0:t).    (3.9)

Three steps are carried out during each iteration of the APF to update the particle system from time t − 1 to t:

(i) The particles are resampled according to their auxiliary weights. This means that particles with small weights are discarded and particles with large weights are multiplied. This mitigates the weight depletion problem experienced by the SIS algorithm. This step is often the computational bottleneck of the APF, as all the other steps can easily be implemented in parallel.

(ii) The particles are propagated from time t − 1 to t. This can be seen as a simulation step, where new particles are generated by sampling from a Markov kernel.

(iii) The (unnormalised) particle weight is calculated for each particle. These weights compare the particle with the recorded output from the system. A high weight is given to a particle that is likely to generate yt.

After these three steps, we can construct empirical distributions using (3.8) and (3.9).
It is also possible to make use of the particle system for estimating other quantities. We return to this in the following. Note that, by comparison with the filtering recursions, Step (ii) corresponds to the time update in (3.1b) and Step (iii) corresponds to the measurement update in (3.1a). We now proceed to discuss each step of the APF in more detail.

Step (i) can be seen as sampling ancestor indices, denoted at(i), i.e. the index of the particle at time t − 1 from which particle i at time t originates. This can be expressed as simulating from a multinomial distribution with probabilities

    P(at(i) = j) = ν̃t−1(j),   j = 1, …, N,   i = 1, …, N,    (3.10)

where ν̃t(j) denotes a normalised auxiliary weight given by

    ν̃t−1(j) = νt−1(j) [Σ_{k=1}^{N} νt−1(k)]⁻¹,

for some auxiliary weight function νt−1(k) determined by the user. After the resampling step, we obtain the unweighted particle system {x̃t−1(i), 1/N}_{i=1}^{N}. In this thesis, we mainly consider two different types of APFs. The bPF is obtained by selecting the auxiliary weights as the particle weights, i.e. νt = wt. The fully-adapted particle filter (faPF) (Pitt and Shephard, 1999) instead takes the (future) observation into account in the resampling step. This information is included by using the auxiliary weight given by

    νt−1 = pθ(yt | xt−1) = ∫ gθ(yt | xt) fθ(xt | xt−1) dxt,    (3.11)

when it can be computed in closed form. This is only possible for some systems, e.g. the LGSS model (2.2) and the GARCH(1,1) model (2.3). In (3.10), we make use of multinomial resampling, but there are other resampling algorithms that can be useful. Systematic resampling and stratified resampling are good alternatives to multinomial resampling, and often decrease the variance in the estimates of the filtering distribution. See Douc and Cappé (2005) and Hol et al. (2006) for comparisons of different resampling strategies.
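The resampling step can be sketched as follows (our own Python illustration, not thesis code), showing both multinomial resampling as in (3.10) and the systematic alternative:

```python
import numpy as np

def multinomial_resampling(weights, rng):
    """Draw N ancestor indices with P(a = j) = weights[j], as in (3.10)."""
    n = len(weights)
    return rng.choice(n, size=n, p=weights)

def systematic_resampling(weights, rng):
    """Systematic resampling: one uniform offset on a stratified grid.

    Often gives lower variance than multinomial resampling in practice.
    """
    n = len(weights)
    positions = (rng.uniform() + np.arange(n)) / n
    return np.searchsorted(np.cumsum(weights), positions)

rng = np.random.default_rng(0)
w_tilde = np.array([0.7, 0.1, 0.1, 0.1])    # normalised auxiliary weights
idx = systematic_resampling(w_tilde, rng)   # heavy particle 0 is duplicated
```

Both functions return the ancestor indices at(i); duplicating particle 0 (which carries 70% of the mass) while discarding light particles is exactly the behaviour described in Step (i).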
In Step (ii), each particle is propagated using a propagation kernel as

    xt(i) ∼ Rθ(xt | x0:t−1^{at(i)}, yt),   i = 1, …, N.    (3.12)

The bPF is recovered by selecting the state dynamics as the propagation kernel, Rθ(xt | x0:t−1, yt) = fθ(xt | xt−1). The faPF makes use of the current observation when proposing new particles. This results in the propagation kernel

    Rθ(xt | x0:t−1, yt) = pθ(xt | yt, xt−1) = pθ(yt | xt) p(xt | xt−1) / p(yt | xt−1),    (3.13)

when it can be computed in closed form. Each new particle is appended to the particle trajectory by x0:t(i) = {x0:t−1^{at(i)}, xt(i)}, for i = 1, …, N. Finally, in Step (iii), each particle is assigned an importance weight by a weighting function Wθ(xt, x0:t−1), similar to the IS algorithm in (3.4). The resulting weight is given by

    wt(i) = Wθ(xt(i), x0:t−1^{at(i)}),   i = 1, …, N,                              (3.14a)
    Wθ(xt, x0:t−1) ≜ wt−1 gθ(yt | xt) fθ(xt | xt−1) / [νt−1 Rθ(xt | x0:t−1, yt)].  (3.14b)

For the bPF, the choices of resampling step and propagation kernel lead to the weights νt = wt = gθ(yt | xt), i.e. the probability of observing yt given xt. For the faPF, the particle weights are given by wt ≡ 1. The general form of the APF for approximate state inference in nonlinear SSMs is presented in Algorithm 2. The complexity of the algorithm is linear in the number of time steps and particles, i.e. O(NT). The user choices are mainly the number of particles N, the auxiliary weights ν, the propagation kernel Rθ(xt | x0:t−1, yt) and the resampling method.

Algorithm 2 Auxiliary particle filter (APF)
Inputs: y1:T (observations), Rθ(xt | x0:t−1, yt) (propagation kernel), νt (auxiliary weights) and N > 0 (no. particles).
Outputs: p̂θ(xt | y1:t) for t = 1 to T (the empirical marginal filtering distributions) and p̂θ(x0:t | y1:t) for t = 1 to T (the empirical joint smoothing distributions).
1: Initialise each particle x0(i).
2: for t = 1 to T do
3:   Sample the new ancestor indices to obtain {at(i)}_{i=1}^{N} using (3.10).
4:   Propagate the particles to obtain {xt(i)}_{i=1}^{N} using (3.12).
5:   Calculate the new importance weights to obtain {wt(i)}_{i=1}^{N} using (3.14).
6:   Estimate the marginal filtering distribution p̂θ(xt | y1:t) using (3.8).
7:   Estimate the joint smoothing distribution p̂θ(x0:t | y1:t) using (3.9).
8: end for

We now proceed with discussing the use of the APF for state inference and log-likelihood estimation.

3.3.2 State inference using the auxiliary particle filter

Algorithm 2 can be used to estimate the marginal filtering and joint smoothing distributions in an SSM given some observations. From these empirical distributions, we can compute an estimate of the filtered marginal state x̂t|t,

    x̂t|t = ∫ xt p̂θ(xt | y1:t) dxt = Σ_{i=1}^{N} w̃t(i) xt(i),    (3.15)

where we have inserted (3.8). Note that the state trajectory x̂0:t|t can be estimated using the analogous expression

    x̂0:t|t = ∫ x0:t p̂θ(x0:t | y1:t) dx0:t = Σ_{i=1}^{N} w̃t(i) x0:t(i),    (3.16)

by the use of the empirical smoothing distribution (3.9). In Example 3.2, we illustrate the use of the bPF for marginal state inference in the earthquake model (2.6) with the data discussed in Section 2.2.3.

3.2 Example: State inference in the earthquake count model

Consider the earthquake count model (2.6) using the data presented in Figure 2.4. To infer the state in the model, we make use of the bPF with N = 1 000 particles and the parameter vector θ̂ = {0.88, 0.15, 17.65} estimated later in Example 4.9. In the upper part of Figure 3.2, we present the filtered number of major earthquakes obtained from the bPF (red line) together with the data (black dots). We also present the predicted number of major earthquakes (dark green) for 2014 to 2022 together with a 95% CI for the predictions.
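To make Algorithm 2 concrete, the sketch below (our own Python illustration, not thesis code) implements the bPF special case (νt = wt, propagation by the state dynamics) for a scalar linear Gaussian model; the parametrisation xt = φ xt−1 + σv vt, yt = xt + σe et is an assumption made for this example and may differ in details from the LGSS model (2.2) in the thesis:

```python
import numpy as np

def bootstrap_pf(y, phi, sigma_v, sigma_e, n_particles, rng):
    """Bootstrap particle filter for an assumed scalar linear Gaussian model.

    Model: x_t = phi * x_{t-1} + sigma_v * v_t, y_t = x_t + sigma_e * e_t,
    with v_t, e_t IID standard Gaussian. Returns the filtered means (3.15)
    and the log-likelihood estimate (3.21) with nu_t = w_t.
    """
    T, N = len(y), n_particles
    x = rng.normal(0.0, sigma_v, N)                     # initialise particles
    w = np.full(N, 1.0 / N)
    x_filt = np.zeros(T)
    log_lik = 0.0
    for t in range(T):
        idx = rng.choice(N, size=N, p=w)                # step (i): resample
        x = phi * x[idx] + sigma_v * rng.standard_normal(N)  # step (ii): propagate
        log_w = (-0.5 * np.log(2 * np.pi * sigma_e**2)
                 - 0.5 * ((y[t] - x) / sigma_e) ** 2)   # step (iii): weight
        c = np.max(log_w)                               # log-sum-exp shift
        w = np.exp(log_w - c)
        log_lik += c + np.log(np.sum(w)) - np.log(N)    # predictive likelihood term
        w /= np.sum(w)
        x_filt[t] = np.sum(w * x)                       # filtered mean, eq. (3.15)
    return x_filt, log_lik

# Simulate data from the same model and run the filter on it.
rng = np.random.default_rng(0)
T, phi, sv, se = 200, 0.5, 1.0, 0.1
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = phi * x_true[t - 1] + sv * rng.standard_normal()
y = x_true + se * rng.standard_normal(T)
x_filt, ll = bootstrap_pf(y, phi, sv, se, n_particles=500, rng=rng)
```

With a small measurement noise σe, the filtered means track the true states closely, and the same loop delivers the log-likelihood estimate discussed in Section 3.3.4.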
The uncertainty grows quickly for the predictions (gray area), which indicates that a more advanced model is probably required to make good predictions. In the lower part of Figure 3.2, we present the estimated state obtained from the bPF, which indicates that the world currently is in a calmer period with a smaller earthquake intensity compared with e.g. the 1950s. This conclusion is also supported by the actual number of earthquakes recorded during these two periods. We also present the predicted latent state (dark green) for 2014 to 2022 together with a 95% CI for the predictions (gray area). Again, we see that there is a large uncertainty in the future latent state.

From Algorithm 2, we know that the APF algorithm can be used to approximate the joint smoothing distribution. In the following, we require estimates of this distribution for use in some of the proposed methods for approximate parameter inference. However, the accuracy of these estimates is poor due to problems with path degeneracy. To explain the nature of this problem, consider the resampling step in the APF algorithm. During each resampling, some particles are discarded with non-zero probability because they have small auxiliary weights. This means that, after each resampling step, the number of unique particles is smaller than N with non-zero probability. As t increases, we repeatedly resample the particle system, and the number of unique particles at some earlier time s < t eventually tends to one. That is, the particle trajectories collapse into a single trajectory for all times before s. Consequently, the variance of the estimates of the joint smoothing distribution is large, due to the same problem as in the SIS algorithm. To mitigate this effect, we could make use of the faPF or increase the number of particles. However, this can be problematic, as the faPF is not available for many interesting models.
Also, the computational cost of the APF algorithm increases linearly with N. We illustrate this effect and compare these two alternatives in Example 3.3. There exist additional alternatives to mitigate path degeneracy and obtain accurate approximations of the joint smoothing distribution. A popular alternative in practice is to resample only when the variance of the particle weights is larger than some threshold (Doucet and Johansen, 2011). This reduces the path degeneracy problem, but is often not enough to solve it completely. Instead, a particle smoother is often used for this problem, which gives better accuracy in the estimates but often increases the computational cost. In Section 3.4, we return to the use of particle smoothing for estimating the smoothing distributions. We now continue with discussing the statistical properties of the APF and how to use the APF to estimate the likelihood and log-likelihood of an SSM.

Figure 3.2: Upper: the filtered number of major earthquakes (red line) and the data (black dots) obtained from the bPF using the data and the model from Example 3.2. The predicted number of earthquakes (green line) is also presented together with a 95% CI (gray area). Lower: the estimated latent earthquake intensity obtained from the bPF. The predicted latent state (green line) is also presented together with a 95% CI (gray area).

3.3 Example: Path degeneracy in the GARCH(1,1) model

Consider the GARCH(1,1) model in (2.3), from which we generate T = 20 observations using the parameter vector θ⋆ = {0.10, 0.80, 0.05, 0.30}. Here, we make use of the bPF and the faPF with systematic resampling at every iteration. The aim is to estimate the state trajectory x̂0:t|t using the APF and (3.16).
For this model, the weight function (3.11) and the proposal (3.13) required for the faPF can be computed in closed form by properties of jointly Gaussian distributions, see Bishop (2006). The required quantities are given by

    p(yt | xt−1, ht) = ∫ N(yt; xt, τ²) N(xt; 0, ht) dxt = N(yt; 0, τ² + ht),
    p(xt | yt, xt−1) ∝ p(yt | xt) p(xt | xt−1) = N(yt; xt, τ²) N(xt; 0, ht)
                     = N(xt; (τ⁻² + ht⁻¹)⁻¹ τ⁻² yt, (τ⁻² + ht⁻¹)⁻¹),

from the marginalisation property and the conditioning property, respectively. In Figure 3.3, we present the state trajectories obtained by tracing the ancestor lineage backwards from time T to 0. We see that for the bPF with N = 10 particles, the ancestral lines collapse to a single unique particle before time t = 10 due to the path degeneracy problem. Increasing the number of particles to N = 20 results in the paths degenerating before t = 8 instead. However, the faPF does not have the same problem with degeneracy, as many ancestors survive the repeated resamplings and contribute information for estimating the joint smoothing distribution.

3.3.3 Statistical properties of the auxiliary particle filter

In this section, we review some results regarding two aspects of the statistical properties of the APF. Note that the analysis of the APF is rather complicated compared with other estimators that make use of independent samples from the target distribution: the particles obtained from the APF are not independent, due to the interaction in the resampling step. However, there are many strong results regarding the statistical properties of the bPF in the literature. Extensive technical accounts are found in Del Moral (2013) and Del Moral (2004). Some of the statistical properties are also discussed in a more application-oriented setting in Crisan and Doucet (2002), Douc et al. (2014) and Doucet and Johansen (2011).
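The two Gaussian identities above are easy to verify numerically. The following sketch (our own illustration, with hypothetical values for τ² and ht) checks the marginal variance by simulation, and the conditioning result through the equivalent regression slope ht/(τ² + ht) of xt on yt:

```python
import numpy as np

rng = np.random.default_rng(3)
tau2, ht = 0.5, 2.0    # hypothetical observation variance and latent variance

# Simulate x_t ~ N(0, h_t) and y_t | x_t ~ N(x_t, tau^2).
xs = rng.normal(0.0, np.sqrt(ht), 1_000_000)
ys = xs + rng.normal(0.0, np.sqrt(tau2), xs.size)

# Marginalisation: Var(y_t) should equal tau^2 + h_t = 2.5.
var_y_mc = np.var(ys)

# Conditioning: the posterior mean of x_t is linear in y_t with coefficient
# s2 / tau2 = h_t / (tau2 + h_t), where s2 = (1/tau2 + 1/h_t)^-1. This
# coefficient equals the regression slope Cov(x, y) / Var(y).
s2 = 1.0 / (1.0 / tau2 + 1.0 / ht)
slope_theory = s2 / tau2
slope_mc = np.cov(xs, ys)[0, 1] / np.var(ys)
```

The simulated moments match the closed-form expressions, which is the conjugacy that makes the faPF tractable for this model.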
Figure 3.3: The ancestral paths obtained from the bPF with 10 particles (upper), the bPF with 20 particles (middle) and the faPF with 10 particles (lower) with T = 20 in the GARCH(1,1) model in Example 3.3. The discarded particles are presented as gray dots.

Assume that we would like to compute the expected value of some well-behaved test function ϕ(x) using the particle system generated by the APF, in analogy with the IS estimator in (3.3). From the empirical filtering distribution (3.8), we have that the expected value can be approximated by

    ϕ̂APF = Σ_{i=1}^{N} w̃t(i) ϕ(xt(i)).    (3.17)

Under some assumptions, it is possible to prove that this estimator is consistent and therefore converges in probability to the true expectation as the number of particles tends to infinity,

    ϕ̂APF →p E[ϕ(x)],   N → ∞.

Hence, this estimator is asymptotically unbiased and equivalent to the (unknown) optimal filter. However, for finite N the estimator is often biased, and we return to this problem in Section 3.4.3. We would also like to say something about the MSE of the estimator in (3.17). For this, we assume that the function ϕ(x) is bounded [2] for all x, together with some additional assumptions that are discussed by Crisan and Doucet (2002). It then follows that the MSE of the estimator obtained by the bPF can be upper bounded as

    E[(ϕ̂APF − E[ϕ(x)])²] ≤ CT ‖ϕ‖² / N,    (3.18)

where ‖·‖ denotes the supremum norm. Here, CT denotes a function that possibly depends on T but is independent of N. There exist numerous other results regarding the MSE of the estimator in (3.17). For example, it is possible to relax the assumption that ϕ(x) is bounded and the restriction to the bPF with multinomial resampling.
The resulting upper bounds have a similar structure to (3.18) but with different functions C. For more information, see Del Moral (2013) and Douc et al. (2014). From this upper bound on the MSE, we would like to give some general recommendations regarding how to select N given T. However, this is difficult, as the accuracy of the estimates is connected with the mixing properties of the SSM (see Example 3.5). In practice, it is recommended to use at least N ∝ T particles in the APF, but sometimes even more particles are required to obtain reasonable estimates. Hence, we recommend that the user estimates the MSE for each model, e.g. using a pilot run with some Monte Carlo simulations or by comparing with the solution obtained from a particle smoother.

3.3.4 Estimation of the likelihood and log-likelihood

As previously discussed, the likelihood L(θ) and the log-likelihood ℓ(θ) play important roles in both ML and Bayesian parameter inference. We also discussed that they are analytically intractable for nonlinear SSMs. However, we can obtain an unbiased estimate of the likelihood for any number of particles using the weights generated by the APF. To understand how this can be done, we consider the decomposition in Definition 2.1 and the fact that each predictive likelihood can be written as

    pθ(yt | y1:t−1) = ∫ pθ(yt, xt | y1:t−1) dxt
                    = ∫∫ gθ(yt | xt) fθ(xt | xt−1) pθ(xt−1 | y1:t−1) dxt−1:t.

If we consider the bPF algorithm, we can rewrite this as

    pθ(yt | y1:t−1) = ∫∫ [gθ(yt | xt) fθ(xt | xt−1) / Rθ(xt | xt−1, yt)] Rθ(xt | xt−1, yt) pθ(xt−1 | y1:t−1) dxt−1:t
                    = ∫∫ Wθ(xt, xt−1) Rθ(xt | xt−1, yt) pθ(xt−1 | y1:t−1) dxt−1:t,

by the expression for the weight function (3.14), as wt−1 = νt−1 for the bPF.

[2] This is a rather restrictive assumption, as it is not satisfied by the function ϕ(x) = x, which is used to compute the estimate of the filtered state x̂t|t.
To approximate the predictive likelihood, we make use of the fact that {x̃t(i), xt−1(i)}_{i=1}^{N} is approximately distributed according to Rθ(xt | xt−1, yt) pθ(xt−1 | y1:t−1). From this, it follows that

    p̂(yt | y1:t−1) ≈ (1/N) Σ_{i=1}^{N} ∫ Wθ(xt, xt−1) δ_{x̃t(i), xt−1(i)}(dxt−1:t)
                   = (1/N) Σ_{i=1}^{N} Wθ(x̃t(i), xt−1(i))
                   = (1/N) Σ_{i=1}^{N} wt(i).

It is also possible to show that the faPF leads to a similar estimator by replacing wt with νt, see Pitt (2002) for the derivation. The resulting estimator of the likelihood using the APF (including both the bPF and the faPF as special cases) is given by

    L̂(θ) = p̂θ(y1:T) = N^{−(T+1)} {Σ_{i=1}^{N} wT(i)} {∏_{t=0}^{T−1} Σ_{i=1}^{N} νt(i)},    (3.19)

where the first summation is unity for the faPF and where νt = wt for the bPF. To implement this estimator, we run Algorithm 2 and then calculate the likelihood estimate using (3.19) by inserting the particle weights generated by the APF. The statistical properties of the likelihood estimator are studied by Del Moral (2004). It turns out that the estimator is consistent and unbiased for any N ≥ 1. Furthermore, the error of the estimate satisfies a CLT,

    √N [L̂(θ) − L(θ)] →d N(0, ψ²(θ)),    (3.20)

for some asymptotic variance ψ²(θ), see Proposition 9.4.1 in Del Moral (2004). It is also straightforward to obtain an estimator of the log-likelihood from (3.19),

    ℓ̂(θ) = log p̂θ(y1:T) = log{Σ_{i=1}^{N} wT(i)} + Σ_{t=0}^{T−1} log{Σ_{i=1}^{N} νt(i)} − (T + 1) log N.    (3.21)

However, this estimator is biased for a finite number of particles, although it is still consistent and asymptotically normal. This result follows from applying the second-order delta method (Casella and Berger, 2001) to (3.20). The resulting CLT for the log-likelihood estimator is given by

    √N [ℓ̂(θ) − ℓ(θ) + γ²(θ)/(2N)] →d N(0, γ²(θ)),    (3.22)

where we introduce γ(θ) = ψ(θ)/L(θ). As a result, we have an expression for the bias of the estimator, −γ²(θ)/(2N), for a finite number of particles.
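In an implementation, the factors of (3.19) and (3.21) are evaluated on the log scale to avoid numerical under- and overflow, since products of many small weight sums underflow quickly in floating point. A standard log-sum-exp shift (our own implementation sketch, not thesis code) computes each term log{(1/N) Σᵢ exp(log wᵢ)}:

```python
import numpy as np

def log_mean_weights(log_w):
    """Stable evaluation of log( (1/N) sum_i exp(log_w[i]) ), one factor of (3.19)."""
    c = np.max(log_w)                        # shift by the largest log-weight
    return c + np.log(np.mean(np.exp(log_w - c)))

# Equals log(1/3) for weights {0.2, 0.3, 0.5}, and stays finite even for
# log-weights around -1000, where a naive exp() would underflow to zero.
exact = log_mean_weights(np.log(np.array([0.2, 0.3, 0.5])))
extreme = log_mean_weights(np.array([-1000.0, -1001.0]))
```

Summing these per-step terms over t = 0, …, T directly yields the log-likelihood estimate (3.21).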
Consequently, it is possible to compensate for the bias, as the variance γ²(θ) can be estimated using Monte Carlo simulations by repeated application of the APF on the same data. This could be an interesting improvement for the proposed methods in Papers B and C, where we make use of the log-likelihood estimator.

3.4 Example: Bias and variance of the log-likelihood estimate

Consider the setup in Example 2.6 and the problem of estimating the log-likelihood at θ = θ⋆ using the faPF with N = 10 particles. We repeat the estimation over 1 000 Monte Carlo simulations using the same data. The error of the log-likelihood estimate is calculated using the true value obtained from the Kalman filter. In Figure 3.4, we present the histogram of the errors together with a Gaussian approximation (upper), the box plot of the errors (lower left) and the QQ-plot of the errors (lower right). The Gaussian approximation fits the data quite well, and the resulting average error is −0.03 with variance γ̂²(θ⋆) = 0.05. The predicted bias, calculated using (3.22) as −γ̂²(θ⋆)/(2√N), is −0.01. The QQ-plot validates the Gaussian assumption, as we do not see any deviating tail behaviour.

3.4 Particle smoothing

Particle smoothers approximate the solution to the smoothing problem in an SSM, similar to how the APF approximates the corresponding filtering problem. However, there exist a number of different approaches to carry out the smoothing given the particle system from the APF. The simplest smoother makes use of the APF itself to approximate the joint smoothing distribution, as discussed previously. The main problem with this approach is that path degeneracy limits the accuracy of the estimate. A similar approach is to make use of the fixed-lag (FL) particle smoother (Kitagawa and Sato, 2001), which is based on using the APF to estimate the fixed-lag smoothing distribution (recall Table 3.1).
In the following, we make use of this smoother as it has a low computational cost and a reasonable accuracy compared with other particle smoothers.

Figure 3.4: The histogram with a Gaussian approximation (blue line) of the error in the log-likelihood estimates (upper) in the LGSS model using a faPF. The boxplot (lower left) and QQ-plot (lower right) support the Gaussian assumption of the error.

More advanced smoothers are often based on approximations of the Bayesian smoothing recursion (3.2). There are two main families of smoothers that result from this approach: the FFBSm and the forward filtering backward simulator (FFBSi). The original marginal FFBSm and FFBSi algorithms are discussed in Doucet et al. (2000) and Godsill et al. (2004), respectively. Another type of smoother makes use of two APFs (one running forward in time and the other backwards) and combines the output of these by using a two-filter formula (Briers et al., 2010; Fearnhead et al., 2010). Furthermore, other types of smoothers have been proposed in Bunch and Godsill (2013) and Dubarry and Douc (2011) using MCMC methods (discussed in the next chapter). The interesting improvement in these two smoothers is that they generate new particles in the backward sweep, which is not done in the FFBSm and FFBSi. See Lindsten and Schön (2013) for a recent survey of different particle smoothing methods. In this section, we focus on the use of the FL smoother for estimating some quantities that are required for the proposed methods in Papers A and D. We introduce the underlying assumptions of the smoother and discuss how to implement it.
We also show how it can be used to estimate the score function of an SSM and parts of the information matrix. We conclude by discussing the properties of the estimates obtained from the FL smoother.

3.4.1 State inference using the particle fixed-lag smoother
The FL smoother relies on the forgetting properties of an SSM, i.e. that the Markov chain quickly forgets its earlier states. This property is illustrated in Example 3.5 for an LGSS model.

3.5 Example: Mixing property in the LGSS model
Consider the LGSS model using the same setup as in Example 2.6, where 20 different state processes are simulated during 8 time steps. Here, each state process has a randomly selected initial state distributed as x₀ ∼ N(0, 20²). In Figure 3.5, we present the evolution of the state processes in three different LGSS models. We note that the value of φ determines the rate at which the processes converge to a stationary phase. A larger value of φ gives the process a longer memory, and this means that it requires a longer time to forget its initial condition. That is, the state process mixes slowly and therefore future observations contain useful information about the current state.

The observation that some SSMs mix quickly is the basis for the assumption that

$$ p_\theta(x_{0:t}|y_{1:T}) \approx p_\theta(x_{0:t}|y_{1:\kappa_t}), \qquad (3.23) $$

where $\kappa_t = \min\{T, t + \Delta\}$ and ∆ denotes some lag determined by the user. This means that future observations contain a decreasing amount of information about the current state. The rate of this decrease is determined by the mixing of the model, as discussed in Example 3.5.
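The forgetting property behind (3.23) can also be checked numerically. The sketch below simulates an AR(1) state equation, as in Example 3.5, from widely dispersed initial states and measures how quickly the trajectories collapse towards the stationary distribution; the unit process-noise variance is an assumption made for this illustration.

```python
import numpy as np

# Illustration of the forgetting (mixing) property of an AR(1) state process
# x_t = phi * x_{t-1} + v_t, v_t ~ N(0, 1), started from dispersed initial
# states x_0 ~ N(0, 20^2), as in Example 3.5. The noise variance is an
# assumption for this sketch.

def spread_after(phi, steps=8, n_traj=20, seed=0):
    """Standard deviation across trajectories after a number of steps."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 20.0, n_traj)   # dispersed initial states
    for _ in range(steps):
        x = phi * x + rng.normal(size=n_traj)
    return x.std()

spread_fast = spread_after(phi=0.2)     # fast-mixing process
spread_slow = spread_after(phi=0.8)     # slow-mixing process
```

After 8 steps, the φ = 0.2 process has essentially forgotten its initial condition (the spread is close to the stationary standard deviation), whereas the φ = 0.8 process still carries a visible imprint of its starting point, in line with Figure 3.5.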
If the model mixes quickly, future observations carry a limited amount of information about the current state, as the state process quickly forgets its past.

Figure 3.5: The evolution of 20 different state processes in the LGSS model from Example 2.6 using different initial values. Three different values of φ are used in the LGSS model: φ = 0.2 (upper), φ = 0.5 (middle) and φ = 0.8 (lower).

Algorithm 3 Two-step fixed-lag (FL) particle smoother
Inputs: y_{1:T} (observations), R_θ(x_t|x_{0:t−1}, y_t) (propagation kernel), ν_t (auxiliary weights), N > 0 (no. particles) and ∆ (lag).
Outputs: p̂_θ(x_{t−1:t}|y_{1:T}) for t = 1 to T (empirical two-step smoothing dist.).
1: Initialise each particle x₀⁽ⁱ⁾.
2: for t = 1 to T do
3:   Sample the new ancestor indices to obtain {a_t⁽ⁱ⁾}ᵢ₌₁ᴺ using (3.10).
4:   Propagate the particles to obtain {x_t⁽ⁱ⁾}ᵢ₌₁ᴺ using (3.12).
5:   Calculate new importance weights to obtain {w_t⁽ⁱ⁾}ᵢ₌₁ᴺ using (3.14).
6:   if t > ∆ + 1 then
7:     Compute κ_t = min{t + ∆, T} and recover {x⁽ⁱ⁾_{κ_t,t−1:t}}ᵢ₌₁ᴺ.
8:     Compute (3.24) to obtain p̂_θ(x_{t−1:t}|y_{1:T}).
9:   end if
10: end for

Hence, the lag ∆ can be selected to be rather small and still make (3.23) a valid approximation. This is the main intuition behind the FL smoother, and the procedure follows directly from the APF in Algorithm 2. In the following, we require estimates of the two-step smoothing distribution p̂_θ(dx_{t−1:t}|y_{1:κ_t}) to estimate the score function and the information matrix for an SSM. By using the FL smoother assumption, we can compute the required estimate by a marginalisation over the joint smoothing distribution,

$$ p_\theta(x_{t-1:t}|y_{1:\kappa_t}) = \int p_\theta(x_{0:\kappa_t}|y_{1:\kappa_t}) \, \mathrm{d}x_{0:t-2} \, \mathrm{d}x_{t+1:\kappa_t}. $$
By inserting the empirical joint smoothing distribution $\hat{p}_\theta(x_{0:\kappa_t}|y_{1:\kappa_t})$ from (3.9), an estimate of the two-step smoothing distribution is obtained as

$$ \hat{p}_\theta(\mathrm{d}x_{t-1:t}|y_{1:\kappa_t}) = \sum_{i=1}^{N} \tilde{w}_{\kappa_t}^{(i)} \, \delta_{x_{\kappa_t, t-1:t}^{(i)}}(\mathrm{d}x_{t-1:t}), \qquad (3.24) $$

where $x_{\kappa_t, t}^{(i)}$ denotes the ancestor particle at time t from which the particle $x_{\kappa_t}^{(i)}$ originated. This ancestor is obtained by tracing the lineage backwards, similar to Figure 3.3, where the same is done for the particles starting at time T. We summarise the procedure for estimating the two-step smoothing distribution using the FL smoother in Algorithm 3. The marginal smoothing distribution can be computed using an analogous expression, and this results in a similar procedure. Finally, we present an application of these methods for volatility estimation in Example 3.6.

Figure 3.6: Upper: the log-returns (gray dots) from the Nasdaq OMX Stockholm 30 Index presented in Figure 2.2. The estimated 95% CIs are also presented for the bPF (red) and the FL smoother (blue). Middle: the estimated latent volatility obtained from the bPF. Lower: the estimated latent volatility obtained from the FL smoother.

3.6 Example: State inference in the Hull-White SV model
Consider the Hull-White SV model (2.4) with data from Section 2.2.2 and the parameter vector estimate from Example 3.1. To infer the latent volatility (the state), we apply the bPF using Algorithm 2 and the FL smoother according to Algorithm 3, using N = 5 000 particles and ∆ = 12.
In the upper part of Figure 3.6, we present the filtered log-returns with a 95% CI for the bPF (red) and the FL smoother (blue). Here, we present the upper CI for the bPF and the lower CI for the FL smoother, so that the two lines do not overlap. Note that the mean of the log-returns is zero and therefore the CIs are symmetric around zero. We conclude that the CIs seem reasonable, as they cover most of the log-returns except at the financial crises. Also, the CI computed from the bPF is a bit rougher than the CI computed using the FL smoother. In the middle and lower parts of Figure 3.6, we present the estimated volatilities using both methods. We see that the estimates correspond reasonably well with the variation of the log-returns. Note that the periods with larger volatility are connected with the financial crises.

3.4.2 Estimation of additive state functionals
In this section, we consider the use of particle smoothing to estimate the expected value of an additive functional given the observations. This is analogous to the Monte Carlo estimate of the expectation of a function in (3.3). An additive functional satisfies the expression

$$ S_\theta(x_{0:T}) = \sum_{t=1}^{T} s_{\theta,t}(x_{t-1:t}), \qquad (3.25) $$

which means that we can decompose a function that depends on the entire particle trajectory into several functionals. Here, $s_{\theta,t}(x_{t-1:t})$ denotes some general functional that depends on only two states of the trajectory. This type of additive functional occurs frequently in nonlinear SSMs when we would like to compute functions that depend on the kernels $f_\theta(x_{t+1}|x_t)$ and $g_\theta(y_t|x_t)$. In the following, we give some concrete examples of these functionals connected with parameter inference in SSMs. In these problems, we would like to compute the expected value of the additive functional given the observations,

$$ \mathbb{E}\big[ S_\theta(x_{0:T}) \,\big|\, y_{1:T} \big] = \int S_\theta(x_{0:T}) \, p_\theta(x_{0:T}|y_{1:T}) \, \mathrm{d}x_{0:T} = \sum_{t=1}^{T} \int s_{\theta,t}(x_{t-1:t}) \, p_\theta(x_{t-1:t}|y_{1:T}) \, \mathrm{d}x_{t-1:t}. \qquad (3.26) $$
This can be done by inserting the two-step smoothing distribution estimated by any particle filter or smoothing algorithm. Examples of some different approaches for this are found in Poyiadjis et al. (2011) using the APF and in Del Moral et al. (2010) using a forward smoother based on the FFBSm. In this section and in Paper A, we discuss the use of the FL smoother for this application. The resulting estimate is obtained by inserting (3.24) into (3.26),

$$ \hat{S}_\theta(x_{0:T}) = \sum_{t=1}^{T} \sum_{i=1}^{N} \tilde{w}_{\kappa_t}^{(i)} \, s_{\theta,t}\big( x_{\kappa_t, t-1:t}^{(i)} \big), \qquad (3.27) $$

which can be computed by some minor modifications of Algorithm 3. We encounter this type of additive functional in two common problems concerning SSMs: when estimating the score function and parts of the information matrix. In this thesis, we make use of the material in this section for estimating these two quantities in Papers A and D. To derive the expressions for the additive functions related to these problems, we consider the logarithm of the joint distribution of states and observations in an SSM,

$$ \log p_\theta(x_{0:T}, y_{1:T}) = \log \mu(x_0) + \sum_{t=1}^{T} \Big[ \log f_\theta(x_t|x_{t-1}) + \log g_\theta(y_t|x_t) \Big]. \qquad (3.28) $$

By using this quantity, the score function can be estimated using Fisher's identity (Fisher, 1925; Cappé et al., 2005),

$$ \mathcal{S}(\theta_0) = \nabla \ell(\theta) \big|_{\theta=\theta_0} = \int \Big[ \nabla \log p_\theta(x_{0:T}, y_{1:T}) \big|_{\theta=\theta_0} \Big] \, p_\theta(x_{0:T}|y_{1:T}) \, \mathrm{d}x_{0:T}, $$

which results in the functional

$$ \xi_{\theta,t}(x_{t-1:t}) = \nabla \log f_\theta(x_t|x_{t-1}) \big|_{\theta=\theta_0} + \nabla \log g_\theta(y_t|x_t) \big|_{\theta=\theta_0}, \qquad (3.29) $$

corresponding to the gradient of (3.28) evaluated at θ = θ₀. An estimator of the score function is obtained by inserting the additive function (3.29) into the empirical distribution obtained by the FL smoother (3.27) as

$$ \hat{\mathcal{S}}(\theta_0) = \sum_{t=1}^{T} \sum_{i=1}^{N} \tilde{w}_{\kappa_t}^{(i)} \, \xi_{\theta,t}\big( x_{\kappa_t, t-1:t}^{(i)} \big). $$

The observed information matrix can be estimated using Louis' identity (Louis, 1982; Cappé et al., 2005). However, only some parts of the identity can be directly estimated using the FL smoother.
The remaining parts must be estimated using the APF or a more advanced smoother like the FFBSm. To see this, we rewrite the observed information matrix using Louis' identity,

$$ \mathcal{J}(\theta_0) = -\nabla^2 \ell(\theta) \big|_{\theta=\theta_0} = \Big[ \nabla \ell(\theta) \big|_{\theta=\theta_0} \Big]^2 - \Big[ \nabla^2 L(\theta) \big|_{\theta=\theta_0} \Big] \big[ L(\theta) \big]^{-1}. $$

Here, the second term of the identity can be written as

$$ \Big[ \nabla^2 L(\theta) \big|_{\theta=\theta_0} \Big] \big[ L(\theta) \big]^{-1} = \int \Big[ \nabla \log p_\theta(x_{0:T}, y_{1:T}) \big|_{\theta=\theta_0} \Big]^2 p_\theta(x_{0:T}|y_{1:T}) \, \mathrm{d}x_{0:T} + \int \Big[ \nabla^2 \log p_\theta(x_{0:T}, y_{1:T}) \big|_{\theta=\theta_0} \Big] p_\theta(x_{0:T}|y_{1:T}) \, \mathrm{d}x_{0:T}. \qquad (3.30) $$

The first term in (3.30) cannot be directly estimated by the FL smoother, as it does not provide approximations of the densities needed. Instead, it can be estimated using the APF directly, as proposed by Poyiadjis et al. (2011), or by using a combination of the FL smoother and the APF. The latter alternative is discussed in Paper A. However, the second term in (3.30) can be estimated using the FL smoother directly, as it can be written as an additive functional given by

$$ \zeta_{\theta,t}(x_{t-1:t}) = \nabla^2 \log f_\theta(x_t|x_{t-1}) \big|_{\theta=\theta_0} + \nabla^2 \log g_\theta(y_t|x_t) \big|_{\theta=\theta_0}, \qquad (3.31) $$

corresponding to the Hessian of (3.28) evaluated at θ = θ₀. An estimator for this part of Louis' identity is obtained by inserting the additive function (3.31) into the empirical distribution obtained by the FL smoother (3.27). The expected information matrix can be estimated as the sample covariance matrix of a large number of score functions estimated using different data sets. We discuss this method further in Paper D. We have now seen two examples of additive functionals in connection with parameter inference in SSMs and how to estimate them using the FL smoother. The same setup can also be used in the expectation maximisation algorithm (Dempster et al., 1977; McLachlan and Krishnan, 2008), as discussed in Del Moral et al. (2010) using a forward smoother. We return to this problem in Section 4.1, where we discuss different methods for parameter inference.
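The mechanics behind (3.27), tracing each particle's lineage backwards through the ancestor indices and accumulating the weighted functional along the surviving paths, can be sketched as follows. The two-particle system and the functional s(x′, x) = x − x′ are made-up so the result can be checked by hand; for simplicity, the sketch uses the final-time weights and lineages, i.e. the joint-smoothing special case ∆ = T, whereas the FL smoother would instead use the weights and lineages at time κ_t.

```python
import numpy as np

# Sketch of the additive-functional estimator (3.27): trace lineages through
# the ancestor indices and accumulate the weighted functional s(x_{t-1}, x_t).
# The tiny system below (2 particles, 2 time steps) and the functional are
# made-up examples so the estimate can be verified by hand.

def trace_lineage(X, A, i, t_end):
    """Return the state path x_{0:t_end} of particle i at time t_end."""
    path = [X[t_end][i]]
    for t in range(t_end, 0, -1):
        i = A[t - 1][i]              # ancestor index of the current particle
        path.append(X[t - 1][i])
    return path[::-1]

def additive_estimate(X, A, w_final, s):
    """Estimate E[sum_t s(x_{t-1}, x_t) | y] from lineages and final weights."""
    T = len(X) - 1
    total = 0.0
    for i, w in enumerate(w_final):
        path = trace_lineage(X, A, i, T)
        total += w * sum(s(path[t - 1], path[t]) for t in range(1, T + 1))
    return total

X = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]   # particle positions, times 0..2
A = [[0, 1], [1, 1]]                       # ancestor indices for times 1, 2
w_final = [0.5, 0.5]                       # normalised weights at final time
est = additive_estimate(X, A, w_final, s=lambda xp, x: x - xp)
```

Here both surviving lineages pass through particle 1 at times 0 and 1 (an instance of path degeneracy), giving the paths (1, 3, 4) and (1, 3, 5) with functional sums 3 and 4, so the weighted estimate is 3.5.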
We end this section with Example 3.7, where we estimate the log-likelihood, score and observed information matrix for a nonlinear SSM using the FL smoother.

3.7 Example: Score and information matrix in the Hull-White SV model
Consider the problem of estimating the log-likelihood, the score function and the natural gradient (the score function divided by the observed information matrix) for θ = φ in the HWSV model (2.4). We again make use of the data from Section 2.2.2 and the parameter vector estimate from Example 3.1. For this model, the additive functional connected with the score function (3.29) is

$$ \xi_{\theta,t}(x_{t-1:t}) = \nabla \left[ -\frac{1}{2} \log 2\pi\sigma_v^2 - \frac{1}{2\sigma_v^2} \big( x_{t+1} - \phi x_t \big)^2 \right]_{\phi=\phi_0} + \nabla \left[ -\frac{1}{2} \log\big( 2\pi\beta^2 \exp(x_t) \big) - \frac{y_t^2 \exp(-x_t)}{2\beta^2} \right]_{\phi=\phi_0} = \frac{x_t \big( x_{t+1} - \phi_0 x_t \big)}{\sigma_v^2}, $$

and similarly for the observed information matrix (3.31),

$$ \zeta_{\theta,t}(x_{t-1:t}) = \nabla \left[ \frac{x_t \big( x_{t+1} - \phi x_t \big)}{\sigma_v^2} \right]_{\phi=\phi_0} = -\frac{x_t^2}{\sigma_v^2}. $$

Here, we make use of the FL smoother in Algorithm 3 for estimating (3.27) with lag ∆ = 12 and N = 5 000 particles. We vary the parameter on a grid within the interval φ ∈ [0.70, 0.99] and estimate the required quantities at each grid point. Here, we fix {σ_v, β} to their estimated values. The observed information matrix is estimated using the combination of the APF and the FL smoother introduced in Paper A.

Figure 3.7: The estimates of the log-likelihood function (upper), the score function (lower left) and the natural gradient (lower right) of the HWSV model in Example 3.7. The dotted lines indicate the true parameters and the zero level.

The resulting estimates are presented in Figure 3.7.
We see that the log-likelihood and score estimates seem reasonable and have a rather low variance. The natural gradient is more difficult to estimate, because the observed information matrix estimates are noisy. We see that the sign of the estimate is correct, but the gradient is noisy and rather small.

3.4.3 Statistical properties of the particle fixed-lag smoother
We conclude the discussion of particle smoothing by reviewing some results regarding the statistical properties of the estimates obtained by the FL smoother for the additive functionals. In Olsson et al. (2008), the authors analyse the bias and variance of the estimates from the APF and the FL smoother, respectively. The variances of the estimates are shown (under some regularity conditions) to be upper bounded by quantities proportional to $T^2/\sqrt{N}$ and $(T \log T)/\sqrt{N}$ for the APF and the FL smoother, respectively. The bias is also analysed in Olsson et al. (2008), and the authors conclude (under some regularity conditions) that the bias is upper bounded by quantities proportional to $T^2/N$ and $\lambda + (T \log T)/N$, respectively. Here, λ denotes a quantity which is independent of the number of particles. Hence, the FL smoother gives a biased estimate for all N, whereas the bias of the estimate obtained from the APF decreases as 1/N. In the analysis, it is assumed that the lag is selected according to ∆ ∝ c log T, where c > −1/log ρ and ρ ∈ [0, 1] denotes a measure of the mixing of the model (Olsson et al., 2008). If ∆ is too small, the underlying approximation (3.23) of the FL smoother is poor, and so is the accuracy of the estimate. This is because more information about the current state is available in future observations, which is not taken into account by the smoother. If instead ∆ is too large, we get problems with path degeneracy and, in the limit when ∆ = T, the APF estimate is recovered.
The choice of ∆ is therefore crucial for obtaining good estimates from the FL smoother, and its optimal value depends on both the number of observations and the mixing properties of the model. Hence, it must be tailored for each application individually, as is further discussed in Paper A. Finally, we mention that even better accuracy can be obtained by using other, more advanced smoothing algorithms. The main drawback with these algorithms is that their computational cost is larger than for the APF and the FL smoother, which have a computational cost proportional to O(N T). For example, the FFBSm algorithm (Doucet et al., 2000) has a computational cost proportional to O(N² T). The benefit of using this smoother is that the variance of the estimate grows proportionally to O(T), instead of as O(T²) and O(T log T) for the APF and the FL smoother, respectively. These properties are discussed in more detail in Del Moral et al. (2010) and Poyiadjis et al. (2011). There are also some new promising particle smoothers with a computational complexity proportional to O(T N log N). For more information, see Klaas et al. (2006), Taghavi et al. (2013) and Gray (2003).

3.5 SMC for Image Based Lighting
In Sections 3.3 and 3.4, we have discussed the application of SMC to the filtering and smoothing problems in SSMs. However, the SMC algorithm can be applied to other problems that are sequential in nature and where the target distribution π(x_{0:k}) grows over some index k = 1, …, K. It is also possible to make use of SMC methods for static problems by defining an artificial target that grows over k, even if the original target does not. This approach is discussed by Del Moral et al. (2006) and opens up for using SMC for a wide range of problems.

Figure 3.8: The setup used in the IBL approach, where the LTE gives the outgoing radiance at the angle ω_r that hits the image plane.
This radiance is computed by taking the EM, the BRDF and the visibility into account. Note that one of the light sources is occluded in this scene and does not contribute to the outgoing radiance.

We exemplify the usefulness of this general class of SMC algorithms by returning to the problem of rendering a sequence of photorealistic images using IBL (Debevec, 1998; Pharr and Humphreys, 2010), as discussed in Section 1.1.2. In Figure 3.8, we present a cartoon of the setup of the LTE. Here, we are interested in calculating the outgoing radiance $L_{r,k}(\omega_r)$ from a point at the outgoing angle ω_r at frame k, which is given by the LTE as

$$ L_{r,k}(\omega_r) = \int f_r(\omega_i \to \omega_r) \, L_k(\omega_i) \, V(\omega_i) \, (\omega_i \cdot n) \, \mathrm{d}\omega_i, \qquad (3.32) $$

where n denotes the surface normal and ω_i denotes the incoming angles of the light rays that hit the object at n. Here, $f_r$, $L_k$ and V denote the bidirectional reflectance distribution function (BRDF), the EM in frame k and the binary visibility function, respectively. Furthermore, we assume that these functions can be evaluated pointwise. In this problem, the BRDF describes the optical properties of the object and can be seen as a distribution over incoming angles.

To see how SMC can be used to solve (3.32), we define this as a sequence of normalisation problems, where the kth unnormalised target distribution is given by

$$ \gamma_k(\omega_i) = f_r(\omega_i \to \omega_r) \, L_k(\omega_i) \, (\omega_i \cdot n), $$

where we neglect the visibility term for computational convenience. Calculating the outgoing radiance (without regarding visibility) $L_{r,k}(\omega_r)$ can therefore be seen as a normalisation,

$$ L_{r,k}(\omega_r) = \int \gamma_k(\omega_i) \, \mathrm{d}\omega_i, $$

for a sequence of EMs indexed by k. From this setup, we see that the desired particles correspond to incoming angles ω_i of the light that hits the object and contributes to the outgoing radiance. As previously discussed, the challenge is to select only a few important angles to take into account when solving the LTE. Otherwise, the problem becomes computationally infeasible.
In the filtering problem, this normalisation factor is the marginal likelihood of the problem, which can be computed using the particle weights. In a similar manner, we can compute $L_r(\omega_r)$, and the only difference from the APF in Algorithm 2 is the choice of propagation kernel. Here, we do not have an SSM to describe the dynamical behaviour of the EM between frames. Instead, we make use of an MCMC kernel to adapt the particles between EMs. Details of this approach are found in Kronander et al. (2014a) and Ghosh et al. (2006). Another approach to solve (3.32) is to use the IS algorithm directly on $L_{r,k}(\omega_r)$ for each frame individually. This can be done with an algorithm similar to SIR, where we first sample from the EM and compute the particle weights using the BRDF. The resulting particles are obtained after a resampling step. This method is referred to as the bidirectional importance sampling (BIS) algorithm (Burke et al., 2005). This approach is computationally cheap, but cannot make use of information from the previous frame when estimating $L_{r,k}(\omega_r)$ in the current frame k. As previously discussed, this problem can be solved with the SMC setup that we consider here. We conclude this section with Example 3.8, in which we compare these two approaches.

3.8 Example: SMC for direct illumination
Consider a simple setup, where we would like to render a sphere given a sequence of HDR images. This is similar to the problem considered in Section 1.1.2 but simpler, since we do not consider multiple bounces. Hence, this problem is referred to as direct illumination. Here, we compare four different approaches to render the sphere in the image. The first and second methods make use of the IS algorithm to sample directly from the EM and the BRDF, respectively. Hence, we disregard the information in the BRDF and the EM, respectively. We also make use of the BIS algorithm and the SMC algorithm previously outlined.
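The sample-then-weight-then-resample idea behind BIS can be illustrated on a one-dimensional toy version of the visibility-free radiance integral. The constant environment map, the cosine-shaped BRDF and the angular domain [0, π/2] are all illustrative assumptions for this sketch, chosen so that the exact answer is known (π/4):

```python
import numpy as np

# Sketch of importance sampling for the (visibility-free) radiance integral
# L_r = ∫ f_r(w) L(w) cos(w) dw over w in [0, pi/2]. The BRDF f_r(w) = cos(w)
# and constant environment map L(w) = 1 are illustrative assumptions; the
# exact value is then ∫ cos^2(w) dw = pi/4.

rng = np.random.default_rng(0)
N = 200_000
w = rng.uniform(0.0, np.pi / 2, N)         # sample angles from the EM (uniform)
density = 2.0 / np.pi                      # uniform proposal density on [0, pi/2]
weights = np.cos(w) * 1.0 * np.cos(w) / density   # f_r * L * cos / proposal
L_hat = weights.mean()                     # IS estimate of the radiance

# A resampling step on these weights (as in BIS/SIR) returns particles,
# i.e. important incoming angles, concentrated where f_r * L * cos is large.
idx = rng.choice(N, size=1000, p=weights / weights.sum())
important_angles = w[idx]
```

The resampled angles cluster near the surface normal (small w), which is exactly the "few important angles" behaviour that makes the approach tractable; the SMC variant additionally reuses such particles across frames.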
In Figure 3.9, we compare the four different approaches at frame k = 10 in a sequence of EMs. The first three methods solve the problem for each frame individually, whereas the SMC approach makes use of the previous information. We see that using the IS algorithm directly to sample from the EM does not perform well. The reason is that we consider a mirror sphere, where the BRDF is rather narrow, which means that only light from a few incoming angles contributes to the outgoing radiance. The remaining three methods give comparable results. Therefore, we conclude that there could be advantages in using the SMC algorithm for this application. See Kronander et al. (2014a) for a more detailed discussion and comparison of these approaches.

Figure 3.9: Mirror sphere, frame 10 with different sampling techniques: IS the EM (upper left), IS the BRDF (upper right), BIS (lower left), and the SMC algorithm (lower right).

4 Parameter inference using sampling methods
In this chapter, we consider some sampling based methods for ML and Bayesian parameter inference in SSMs. We start by giving a small overview of some of the current algorithms in the field. Later, we introduce and discuss the use of MCMC (Robert and Casella, 2004) and particle MCMC (PMCMC) (Andrieu et al., 2010) approaches to Bayesian parameter inference. Here, we also discuss the use of Langevin and Hamiltonian dynamics to improve the efficiency of the method. This material is later used in Paper A to construct a particle version of these MCMC algorithms. The second part of this chapter is devoted to discussing the BO algorithm (Jones, 2001; Boyle, 2007; Lizotte, 2008; Osborne, 2010) and how it can be used together with GPs (Rasmussen and Williams, 2006) for ML parameter inference and input design. This approach is used in Papers B and C for ML based parameter inference and MAP based parameter inference, respectively.
4.1 Overview of computational methods for parameter inference
There are many different methods developed for parameter inference in SSMs using both ML and Bayesian methods. Here, we review some of these methods to give an overview of the area and discuss some of the more popular methods. For more exhaustive accounts of different methods, see Douc et al. (2014), Kantas et al. (2009), Cappé et al. (2007) and Cappé et al. (2005).

4.1.1 Maximum likelihood parameter inference
Most parameter inference problems in the ML framework cannot be solved by analytical calculations. Instead, popular approaches are based on optimisation and other iterative algorithms to solve the inference problem in Definition 2.2. From Chapter 3, we know that the APF can be used to calculate the log-likelihood and that the FL smoother can be used to estimate the score function (the gradient of the log-likelihood). Hence, we can make use of a gradient-based optimisation algorithm to maximise the log-likelihood and thereby estimate the parameters of the model. This method operates by an iterative application of

$$ \hat{\theta}_{k+1} = \hat{\theta}_k + \epsilon_k \, \hat{\mathcal{S}}(\hat{\theta}_k), \qquad (4.1) $$

where $\hat{\theta}_k$ and $\epsilon_k$ denote the current estimate of the parameter vector and the step length, respectively. This approach is used by Poyiadjis et al. (2011) and Yildirim et al. (2013) for parameter inference in SV models with Gaussian returns and α-stable returns, respectively. Furthermore, we can estimate the Hessian information by a particle smoother and make use of it in (4.1). This results in a Newton optimisation algorithm, which operates by an iterative application of

$$ \hat{\theta}_{k+1} = \hat{\theta}_k + \epsilon_k \, \hat{\mathcal{J}}^{-1}(\hat{\theta}_k) \, \hat{\mathcal{S}}(\hat{\theta}_k). \qquad (4.2) $$

Another approach is to estimate the Hessian on-the-fly in a quasi-Newton algorithm using a finite difference approximation. A popular member of this class of algorithms is the simultaneous perturbation stochastic approximation (SPSA) algorithm discussed in Spall (1987).
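The iterations (4.1) and (4.2) can be sketched on a toy model where the score and observed information are available in closed form; here the parameter is the mean of Gaussian data, an illustrative assumption not taken from the thesis, so that both updates can be checked against the closed-form maximiser:

```python
import numpy as np

# Sketch of the gradient update (4.1) and the Newton update (4.2) on a toy
# log-likelihood: theta is the mean of N(theta, 1) data, so the score is
# S(theta) = sum(y - theta) and the observed information is J(theta) = T.
# The data and the step lengths are illustrative assumptions.

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, 100)      # data with true mean 2.0
T = len(y)

def score(theta):
    return np.sum(y - theta)       # gradient of the log-likelihood

def info(theta):
    return float(T)                # observed information for this model

theta_grad = 0.0
for k in range(200):
    theta_grad += 0.005 * score(theta_grad)                      # update (4.1)

theta_newton = 0.0
for k in range(10):
    theta_newton += score(theta_newton) / info(theta_newton)     # update (4.2)

ml_estimate = y.mean()             # closed-form maximiser, for reference
```

For this quadratic log-likelihood, the Newton update (4.2) converges in a single step, while the gradient update (4.1) needs many iterations and a carefully chosen step length; in the SSM setting, the exact score and information would be replaced by the particle-smoother estimates, which adds noise to both recursions.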
The SPSA algorithm is used for inference in the α-stable SV model in Ehrlich et al. (2012). The expectation maximisation algorithm (Dempster et al., 1977; McLachlan and Krishnan, 2008) is another popular alternative for ML inference. In this method, the latent states are seen as missing information and are estimated together with the parameter vector of the model. The quantity required by the expectation maximisation algorithm is an additive functional that can be estimated using a particle smoother. Many different versions of the expectation maximisation algorithm are available in the literature using different particle smoothers. Some examples are Del Moral et al. (2010) using a forward smoother, Olsson et al. (2008) using the FL smoother, Lindsten (2013) using the conditional particle filter with ancestor sampling and Schön et al. (2011) using the FFBSm. Finally, we mention another approach based on Bayesian optimisation (Jones, 2001; Boyle, 2007; Lizotte, 2008), introduced in Papers B and C for ML parameter inference in SSMs where the likelihood can be intractable. We return to this approach in Section 4.4, where we introduce the method in detail.

4.1.2 Bayesian parameter inference
As for the ML parameter inference approach, most interesting Bayesian parameter inference problems are analytically intractable. Some problems can be solved in closed form using conjugate priors, as discussed in Section 2.4. However, in general we require approximate methods to solve expectation, marginalisation and normalisation problems in Bayesian inference. Previously, we discussed the sampling based approach, in which we make use of statistical simulations and computers to approximate the posterior. Examples of some sampling based approaches are MCMC methods and SMC methods. Some recent methods that make use of the latter are SMC-squared (SMC²) (Chopin et al., 2013) and particle learning (Carvalho et al., 2010).
An alternative class of methods is based on approximate analytical computations, which often make use of iterated approximations to approximate the posterior. Examples of analytical approximations are the Laplace approximation, the integrated nested Laplace approximation (INLA) (Rue et al., 2009), variational Bayes (VB) (Bishop, 2006) and expectation propagation (EP) (Minka, 2001). In this thesis, we are primarily interested in sampling based inference using MCMC (Robert and Casella, 2004) for Bayesian parameter inference. MCMC is a family of algorithms that are all based on simulating a Markov chain with the target as its stationary distribution. After some burn-in, the chain reaches stationarity and the samples obtained from the chain can be treated as samples from the parameter posterior by the ergodic theorem discussed in Section 4.2.1. Here, we consider two instances of MCMC algorithms: the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970) and the Gibbs sampler (Geman and Geman, 1984). The Gibbs sampler can be seen as a special case of the more general MH algorithm. As such, it can only be used in models with a certain conditional structure of the joint parameter posterior. We return to the Gibbs sampler in Paper E for parameter inference in a non-Gaussian ARX model. The MH algorithm is a general method for sampling from intractable distributions and is applied in many different fields. Some applications of the MH algorithm are found in biology (Wilkinson, 2011), system identification (Ninness and Henriksen, 2010), finance (Johannes and Polson, 2009) and statistics (Robert and Casella, 2004). A major problem with this algorithm is that it requires evaluations of the likelihood of the SSM. As discussed in Section 3.3.4, we can only obtain an unbiased estimator of this quantity, but it turns out that this is enough.
As a result of the unbiasedness property, the resulting Markov chain retains the target as its stationary distribution. This type of method is known as a pseudo-marginal algorithm (Andrieu and Roberts, 2009; Andrieu et al., 2010), and this family includes the particle MH (PMH) and particle Gibbs (PG) algorithms. This opens up for Bayesian parameter inference in SSMs. We return to discussing the PMH algorithm in Section 4.3 and propose refinements to it in Paper A.

4.2 Metropolis-Hastings
In this section, we discuss the use of the MH algorithm for sampling from the parameter posterior p(θ|y_{1:T}). Here, we give a short introduction to the MH algorithm, briefly discuss why it works and discuss some interesting extensions of the algorithm based on random walks on Riemann manifolds. Interested readers are referred to Robert and Casella (2004), Gelman et al. (2013) and Liu (2008) for more detailed accounts of the algorithm and its general application.

The MH algorithm samples the target distribution π(θ) = p(θ|y_{1:T}) by simulating a carefully constructed Markov chain on the parameter space Θ. The Markov chain is constructed in such a way that it admits the target as its unique stationary distribution. The algorithm consists of two steps: (i) a new parameter θ′′ is sampled from a proposal distribution q(θ′′|θ′) given the current state θ′, and (ii) the current parameter is changed to θ′′ with acceptance probability α(θ′′, θ′); otherwise the chain remains at the current parameter. The acceptance probability is given by

$$ \alpha(\theta'', \theta') = 1 \wedge \frac{\pi(\theta'') \, q(\theta'|\theta'')}{\pi(\theta') \, q(\theta''|\theta')} = 1 \wedge \frac{p(\theta'') \, p_{\theta''}(y_{1:T}) \, q(\theta'|\theta'')}{p(\theta') \, p_{\theta'}(y_{1:T}) \, q(\theta''|\theta')}, \qquad (4.3) $$

where $a \wedge b \triangleq \min\{a, b\}$ and $p_\theta(y_{1:T})$ denotes the likelihood. The resulting procedure is outlined in Algorithm 4 for parameter inference in models where we can compute the likelihood exactly.
Here, we have used the form of the parameter posterior from (2.14), where the marginal likelihood cancels in the acceptance probability. Therefore, the algorithm only requires that we can evaluate the unnormalised target distribution point-wise. Note that the IS algorithm could be used to estimate the parameter posterior using e.g. a multivariate Gaussian distribution as the proposal, as in Example 3.1. However, if θ is high-dimensional, this approach would require many samples to accurately estimate the posterior. In the MH algorithm, we instead construct a Markov chain that explores the posterior distribution by local moves, thus exploiting the previously accepted parameter. Hence, it focuses its attention on areas of the parameter space in which the posterior assigns a relatively large probability mass. This makes sampling of high-dimensional targets more efficient, as fewer iterations of the algorithm are required to obtain an accurate estimate.

To see this, assume that we have a symmetric proposal q(θ′′|θ′) = q(θ′|θ′′), so that the ratio between the proposals cancels in (4.3). The remaining part is the ratio between the target evaluated at θ′′ and θ′. If the proposed parameter θ′′ results in a larger value of the target than θ′, it is always accepted. We also accept a proposed parameter with some probability if it results in a small decrease of the target compared with the previous iteration. Hence, the MH sampler both climbs the posterior towards its mode and explores the area surrounding the mode. Consequently, the MH algorithm can possibly escape local extrema, which is a problem for many local optimisation algorithms used in numerical ML parameter inference. The performance of the MH algorithm is dependent on the choice of the proposal distribution.
A poor proposal leads to a poor exploration of the posterior distribution, which means that many iterations of the algorithm are required to obtain a good approximation of the posterior. Here, we discuss some common choices of proposals which are needed for the discussion in Section 4.2.2 and in Paper A. Perhaps the simplest example is the independent proposal, in which q(θ′′|θ′) = q(θ′′), i.e. a parameter is proposed without taking the previously accepted parameter into account.

Algorithm 4 Metropolis-Hastings (MH) for Bayesian inference in SSMs
Inputs: M > 0 (no. MCMC steps), q(·) (proposal) and θ0 (initial parameter).
Output: θ = {θ1, . . . , θM} (samples from the parameter posterior).
1: Initialise using θ0.
2: for k = 1 to M do
3:   Sample θ′ from the proposal θ′ ∼ q(θ′|θk−1).
4:   Calculate the likelihood pθ′(y1:T).
5:   Sample ωk from U(0, 1).
6:   if ωk < α(θ′, θk−1) given by (4.3) then
7:     θk ← θ′. {Accept the parameter}
8:   else
9:     θk ← θk−1. {Reject the parameter}
10:  end if
11: end for

The independent proposal cannot exploit the previously accepted parameters, which can be a difficulty in inference problems where the posterior is complicated. However, if the proposal is similar to the posterior, it leads to good mixing of the Markov chain, as the proposed parameters are uncorrelated. This insight has been used in Giordani and Kohn (2010) to construct a proposal distribution based on a mixture of Gaussian distributions, which results in an efficient MH sampler in some applications. Another popular choice is the (symmetric) Gaussian random walk (RW) with some step length ε, which results from selecting q(θ′′|θ′) = N(θ′′; θ′, ε²). The choice of the step length in the RW proposal determines the mixing of the Markov chain and thereby the efficiency of the exploration of the parameter posterior. If the step length is too small, the exploration is poor since it takes a long time for the chain to traverse the target.
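As a complement to the pseudo-code, Algorithm 4 can be sketched in a few lines. The toy target and tuning values below are assumptions made for illustration only; the symmetric RW proposal means that the proposal ratio cancels in (4.3), and the computations are done in log-space for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_target, theta0, M, eps):
    # Algorithm 4 with a symmetric Gaussian RW proposal, so the proposal
    # ratio cancels in the acceptance probability (4.3).
    theta, lt = theta0, log_target(theta0)
    trace = np.empty(M)
    for k in range(M):
        prop = theta + eps * rng.standard_normal()
        lt_prop = log_target(prop)
        if np.log(rng.random()) < lt_prop - lt:   # accept the parameter
            theta, lt = prop, lt_prop
        trace[k] = theta                          # else: keep the current state
    return trace

# Toy unnormalised target: N(1, 0.5^2)
log_target = lambda th: -0.5 * (th - 1.0)**2 / 0.25
trace = metropolis_hastings(log_target, theta0=0.0, M=20000, eps=0.5)
print(round(np.mean(trace[2000:]), 1), round(np.std(trace[2000:]), 1))
```

After discarding a burn-in, the chain mean and standard deviation should be close to 1.0 and 0.5 for this toy target.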
If it is too large, we seldom accept the proposed parameter, as the difference in the values that the posterior assumes is too large, resulting in a small acceptance probability. This behaviour is often referred to as the mixing of the Markov chain (c.f. Example 3.5). That is, we would like to balance the mixing of the Markov chain to obtain both a reasonable exploration of the posterior and a reasonable acceptance rate. This problem is illustrated in Example 4.1, where we make use of Algorithm 4 to infer the parameters in an LGSS model for which we can compute the likelihood exactly. We return to the choice of proposal and the mixing of the Markov chain in the following sections.

4.1 Example: Parameter inference in the LGSS model
Consider the parameter inference problem in the LGSS model (2.2) with the unknown parameter θ = φ. We use the parameter θ⋆ = 0.5 together with {σv, σe} = {1.0, 0.1} to generate a realisation from the model with T = 250 and known initial state x0 = 0. Here, we use the MH algorithm defined in Algorithm 4 together with the Kalman filter for computing the likelihood. A Gaussian random walk proposal is used with zero mean and variance ε². For simplicity, we use a uniform prior over φ ∈ (−1, 1) ⊂ R to ensure that the system is stable at all times.

Figure 4.1: The traces (left) of the first 500 iterations of φ in the LGSS model (2.2) using the RW proposal with three different step lengths ε = 0.01 (upper), ε = 0.10 (middle) and ε = 1.00 (lower). The estimated autocorrelation functions (right) using 9 000 iterations corresponding to each step length in the RW proposal. The dotted lines show the true parameters from which the data was generated.

In Figure 4.1, we present the trace plot of the Markov chain together with the estimated autocorrelation function from M = 10 000 iterations (discarding the first 1 000 iterations as burn-in) for the RW proposal with three different step lengths. We see that the mixing is rather poor for the smallest step size ε = 0.01, as this results in a poor exploration of the posterior and therefore a large correlation in the Markov chain. This is also the case if the step length is too large, as illustrated by the proposal with step length ε = 1.00. Here, the large step length results in a small acceptance probability and therefore a larger correlation in the Markov chain.

We now add the parameter σv to the inference problem and would therefore like to infer the parameters {φ, σv} using the same setup as before. Here, we also add a uniform prior over σv ∈ R+, as it corresponds to a standard deviation. We use the step length ε = 0.1 with the RW proposal and simulate the Markov chain for M = 10 000 iterations (discarding the first 1 000 iterations as burn-in).

Figure 4.2: The trace of the first 500 iterations of φ (upper left) and σv (lower left) in the LGSS model in Example 4.1. The estimated parameter posteriors (right) obtained from 9 000 iterations of the MH algorithm. The dotted lines show the true parameters from which the data was generated.

In Figure 4.2, we present the trace of each parameter and the resulting posterior density estimates. We see that the Markov chain mixes well in both parameters and that the posterior estimates are rather close to the true parameters.
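The example above can be reproduced in a few lines of code. The sketch below simulates the scalar LGSS model, computes the exact log-likelihood with a Kalman filter, and runs Algorithm 4 with the Gaussian RW proposal; the seeds and the number of iterations are illustrative assumptions.

```python
import numpy as np

def lgss_simulate(phi, sigma_v, sigma_e, T, x0=0.0, seed=0):
    # Simulate x_{t+1} = phi x_t + v_t, y_t = x_t + e_t from a known x0.
    rng = np.random.default_rng(seed)
    x, ys = x0, []
    for _ in range(T):
        x = phi * x + sigma_v * rng.standard_normal()
        ys.append(x + sigma_e * rng.standard_normal())
    return np.array(ys)

def kalman_loglik(phi, sigma_v, sigma_e, y, x0=0.0):
    # Exact log-likelihood of the scalar LGSS model via the Kalman filter.
    m, P, ll = x0, 0.0, 0.0
    for yt in y:
        m, P = phi * m, phi**2 * P + sigma_v**2        # prediction
        S = P + sigma_e**2                             # innovation variance
        ll += -0.5 * (np.log(2.0 * np.pi * S) + (yt - m)**2 / S)
        K = P / S                                      # measurement update
        m, P = m + K * (yt - m), (1.0 - K) * P
    return ll

def mh_lgss(y, sigma_v, sigma_e, M=5000, eps=0.1, phi0=0.0, seed=1):
    # Algorithm 4 with a Gaussian RW proposal and uniform prior on (-1, 1).
    rng = np.random.default_rng(seed)
    phi, ll = phi0, kalman_loglik(phi0, sigma_v, sigma_e, y)
    trace = np.empty(M)
    for k in range(M):
        prop = phi + eps * rng.standard_normal()
        if abs(prop) < 1.0:                            # inside the prior support
            ll_prop = kalman_loglik(prop, sigma_v, sigma_e, y)
            if np.log(rng.random()) < ll_prop - ll:
                phi, ll = prop, ll_prop
        trace[k] = phi
    return trace

y = lgss_simulate(0.5, 1.0, 0.1, T=250)
trace = mh_lgss(y, 1.0, 0.1)
print(round(np.mean(trace[1000:]), 2))   # posterior mean, close to the true 0.5
```

The posterior mean after burn-in should be close to the true value φ = 0.5 used to generate the data.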
4.2.1 Statistical properties of the MH algorithm

In this section, we discuss some of the statistical results that the MH algorithm relies upon and briefly mention their underlying assumptions. For more information about the properties of the MH algorithm and MCMC algorithms in general, see e.g. Tierney (1994), Robert and Casella (2004) and Meyn and Tweedie (2009). The core result that is used in the MH algorithm is the ergodic theorem (Tierney, 1994; Robert and Casella, 2004). Given a well-behaved test function ϕ, we can construct a Monte Carlo estimator,

  ϕ̂MH = (1/M) ∑_{i=1}^{M} ϕ(θi),

where {θi}_{i=1}^{M} denotes the samples obtained from the MH algorithm. If the Markov chain is ergodic, then by the ergodic theorem this estimator is strongly consistent, i.e.

  ϕ̂MH → E[ϕ(θ)] a.s. as M → ∞.

Note that this property does not follow directly from the SLLN, as the samples obtained from the posterior are not IID; the proposed parameters are correlated. Hence, the ergodic theorem can be seen as the SLLN for correlated samples. Also, a CLT for the error in the estimator (under some conditions) is given by

  √M (ϕ̂MH − E[ϕ(θ)]) →d N(0, σϕ²),

where σϕ² denotes the asymptotic variance of the estimator, which depends on the mixing properties of the Markov chain. In fact, this variance is proportional to the integrated autocorrelation time (IACT), which is given by

  IACT(θ1:M) = 1 + 2 ∑_{k=1}^{∞} ρk(θ1:M).

The IACT cannot be computed analytically in practice, as we do not know the true autocorrelation function for many models. Therefore, we often approximate it by

  ÎACT(θ1:M) = 1 + 2 ∑_{k=1}^{K} ρ̂k(θ1:M),

where ρ̂k(·) denotes the empirical autocorrelation function at lag k and K denotes some maximum lag for which to compute the IACT. This value can be selected as a fixed value or as the first lag at which the ACF becomes statistically insignificant, i.e. the first index K such that |ρ̂K(θ1:M)| < 2/√M.
Another related measure is the effective sample size (ESS), defined as

  ESS(θ1:M) = M / IACT(θ1:M) = M / (1 + 2 ∑_{k=1}^{∞} ρk(θ1:M)),   (4.4)

which can be approximated in the same manner as the IACT. The IACT and ESS can be interpreted as the number of iterations between each independent sample and the number of independent samples obtained from the posterior, respectively. Hence, we would like to minimise the IACT and maximise the ESS to obtain optimal mixing of the Markov chain and many uncorrelated samples from the posterior. We illustrate the ESS values for parameter inference in an LGSS model in Example 4.2.

4.2 Example: ESS values for inference in the LGSS model
We return to Example 4.1 and calculate the ESS values for the three different Markov chains considered in Figure 4.1. The ESS values (4.4) are {22, 1292, 353} when calculated using the adaptive truncation of the IACT for the RW proposal with each of the three step lengths, respectively. These numbers validate the previous discussion that the proposal with ε = 0.10 is preferable for this problem, as it maximises the number of uncorrelated samples from the posterior.

The consistency result and the CLT rely on the ergodic theorem, which assumes that the Markov chain is irreducible and aperiodic. Irreducibility means that we should be able to get from any state to any other state in the state space. Aperiodicity means that the Markov chain does not return to a state at regular intervals, i.e. there are no loops that the chain can get stuck in. These requirements need to be checked for each MH algorithm to validate its assumptions. However, in practice these assumptions often hold, at least for unimodal parameter posteriors. Another property that is important for a Markov chain used in MCMC is reversibility. This property holds if the chain satisfies the detailed balance equation,

  π(θ′′) K(θ′′, θ′) = π(θ′) K(θ′, θ′′),

where K(θ′′, θ′) denotes the Markov transition kernel.
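The truncated IACT estimator and the corresponding ESS can be computed as follows. As a sanity check (an assumption made for illustration), the sketch uses an AR(1) chain with coefficient a = 0.9, whose theoretical IACT is (1 + a)/(1 − a) = 19.

```python
import numpy as np

def acf(x, max_lag):
    # Empirical autocorrelation function at lags 1, ..., max_lag.
    x = np.asarray(x) - np.mean(x)
    var = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / var for k in range(1, max_lag + 1)])

def iact_ess(chain, max_lag=500):
    # IACT truncated at the first lag where |acf| < 2/sqrt(M), and the
    # corresponding ESS = M / IACT from (4.4).
    M = len(chain)
    rho = acf(chain, max_lag)
    small = np.abs(rho) < 2.0 / np.sqrt(M)
    K = int(np.argmax(small)) if small.any() else max_lag
    iact = 1.0 + 2.0 * np.sum(rho[:K])
    return iact, M / iact

# AR(1) chain with known theoretical IACT = (1 + 0.9)/(1 - 0.9) = 19.
rng = np.random.default_rng(0)
a, M = 0.9, 100000
x = np.empty(M)
x[0] = 0.0
for t in range(1, M):
    x[t] = a * x[t - 1] + rng.standard_normal()
iact, ess = iact_ess(x)
print(round(iact), round(ess))
```

The estimated IACT should be close to the theoretical value of 19 for this chain.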
This means that we can write

  ∫ π(θ′′) K(θ′′, θ′) dθ′′ = ∫ π(θ′) K(θ′, θ′′) dθ′′ = π(θ′) ∫ K(θ′, θ′′) dθ′′ = π(θ′),

which shows that π is an invariant distribution of K, i.e. the kernel admits π(θ) as its stationary distribution. To show that the MH algorithm satisfies the detailed balance equation, we follow Liu (2008) and write the transition kernel of the Markov chain as

  K(θ′′, θ′) = q(θ′′, θ′) min{1, [π(θ′) q(θ′, θ′′)] / [π(θ′′) q(θ′′, θ′)]},   θ′′ ≠ θ′,

which gives

  π(θ′′) K(θ′′, θ′) = min{π(θ′′) q(θ′′, θ′), π(θ′) q(θ′, θ′′)},

which is a symmetric function in θ′′ and θ′. Hence, detailed balance is fulfilled and the Markov chain obtained from the MH algorithm is reversible.

4.2.2 Proposals using Langevin and Hamiltonian dynamics

We have previously discussed two different proposals for the MH algorithm and noted that they can perform better than the IS algorithm when the target is high-dimensional. It turns out that the RW proposal in the MH algorithm still scales rather poorly as the dimension d of the parameter space increases, see Roberts et al. (1997). That is, the mixing of the Markov chain decreases and more iterations are required to maintain the number of independent samples from the target. This is because the random walk does not explore the parameter space well when the dimension increases.

A modification of the random walk proposal that can increase the mixing of the Markov chain is to add a drift term that is proportional to the gradient of the log-target. This leads to a proposal which is the noisy equivalent of (4.1). In statistics, the resulting algorithm is known as the Metropolis adjusted Langevin algorithm (MALA) (Roberts and Rosenthal, 1998; Neal, 2010), where the proposal is said to follow Langevin dynamics. This means that the proposal can be seen as the discrete version of a continuous-time Langevin diffusion process (Øksendal, 2010; Kloeden and Platen, 1992).
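The detailed balance argument is easy to verify numerically on a discrete state space. The three-state target and proposal matrix below are arbitrary illustrative choices; the sketch builds the MH kernel K and checks both detailed balance and stationarity of π.

```python
import numpy as np

# Unnormalised target over three states and a proposal matrix Q (rows sum
# to one); both are arbitrary illustrative choices.
pi = np.array([0.2, 0.5, 0.3])
Q = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

# Build the MH transition kernel K from the proposal and the acceptance
# probability; the rejection mass stays on the diagonal.
n = len(pi)
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            alpha = min(1.0, pi[j] * Q[j, i] / (pi[i] * Q[i, j]))
            K[i, j] = Q[i, j] * alpha
    K[i, i] = 1.0 - K[i].sum()

# Detailed balance pi_i K_ij = pi_j K_ji and stationarity pi K = pi.
assert np.allclose(pi[:, None] * K, (pi[:, None] * K).T)
assert np.allclose(pi @ K, pi)
print("detailed balance and stationarity hold")
```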
The Langevin diffusion with stationary distribution p(θ|y1:T) is given by the stochastic differential equation

  dθ(τ) = ½ ∇ log p(θ|y1:T)|θ=θ(τ) dτ + dB(τ),

where B(τ) denotes a Brownian motion. To obtain samples from the parameter posterior, we can simulate the Langevin diffusion to stationarity. The drift term is useful in the proposal as it guides the process towards the mode of the posterior. The discrete-time proposal used in MALA follows from a first-order Euler discretisation of this diffusion as

  q(θ′′|θ′) = N(θ′′; θ′ + (ε²/2) S(θ′), ε² Id),

where ε denotes the step length of the discretisation and S(θ) denotes the gradient of the log-posterior (the score function). The proposal can also be derived using a Laplace approximation of the log-posterior, as discussed in Paper A. Another version of this algorithm is the manifold MALA (mMALA) (Neal, 2010; Girolami and Calderhead, 2011), which also includes the Hessian information of the log-posterior, in analogy with Newton algorithms (c.f. (4.2)). This proposal can be derived similarly to the MALA proposal and has the form

  q(θ′′|θ′) = N(θ′′; θ′ + (ε²/2) J⁻¹(θ′) S(θ′), ε² J⁻¹(θ′)),

where we include the observed information matrix J(θ) in the proposal. We make use of the MALA and mMALA proposals in Paper A to construct particle versions of the algorithms for parameter inference in SSMs. Alternative algorithms that solve the same problem are proposed by Girolami and Calderhead (2011), where the MALA and mMALA are used for parameter inference in the SV model. In this thesis, we adopt an optimisation mindset and refer to the MALA and mMALA algorithms as first-order and second-order proposals, respectively. The names refer to the use of first-order information (the gradient) and second-order information (the Hessian) in the proposals. In the optimisation literature, this corresponds to the first-order gradient-based search method and the second-order Newton method, respectively.
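A single MALA iteration can be sketched as follows. Since the drifted proposal is not symmetric, both the forward and reverse proposal densities enter the acceptance ratio. The bivariate Gaussian target, step length and seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def mala_step(theta, log_post, grad_log_post, eps):
    # One MALA iteration: the Euler discretisation of the Langevin
    # diffusion serves as the proposal, corrected by an accept/reject step.
    def log_q(to, frm):
        mean = frm + 0.5 * eps**2 * grad_log_post(frm)
        return -0.5 * np.sum((to - mean)**2) / eps**2
    prop = (theta + 0.5 * eps**2 * grad_log_post(theta)
            + eps * rng.standard_normal(theta.shape))
    log_alpha = (log_post(prop) + log_q(theta, prop)
                 - log_post(theta) - log_q(prop, theta))
    return prop if np.log(rng.random()) < log_alpha else theta

# Toy target: standard bivariate Gaussian, so S(theta) = -theta.
log_post = lambda th: -0.5 * th @ th
grad = lambda th: -th

theta = np.zeros(2)
draws = []
for _ in range(5000):
    theta = mala_step(theta, log_post, grad, eps=0.9)
    draws.append(theta)
draws = np.array(draws)
print(np.round(draws.mean(axis=0), 1))   # should be close to [0, 0]
```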
The corresponding MH algorithms are therefore called MH1 and MH2, respectively. Another perspective of the second-order proposal is that the added gradient and Hessian information about the log-target is used to construct a random walk on a Riemann manifold (Livingstone and Girolami, 2014; Girolami and Calderhead, 2011). To this end, we require that the negative Hessian is a PD matrix, which is not always the case when we use the observed information matrix as the second-order information. In Girolami and Calderhead (2011), it is proposed that the expected information matrix should be used instead, but this is computationally costly to estimate for an SSM. Therefore, we make use of methods from optimisation (Nocedal and Wright, 2006) to make the observed information matrix PD if necessary. This is done by regularisation, i.e. adding an appropriate diagonal matrix to make the negative eigenvalues positive; see Paper A for more information.

Another related method, called manifold Hamiltonian MCMC (mHMC), can improve the mixing of the Markov chain further. This method originates from physics and is one instance of the hybrid MC (Duane et al., 1987) algorithms, which are used in statistical physics to simulate from high-dimensional targets. Here, we briefly discuss their use in a statistical setting for parameter inference. Interested readers are referred to Liu (2008), Neal (2010) and Girolami and Calderhead (2011) for more information. The mHMC algorithm is based on simulating a physical system with the Hamiltonian (the total energy)

  H(θ, p) = U(θ) + K(p),   (4.5)

where p ∼ N(0, J(θ)) denotes a random fictitious momentum for each parameter. Here, the potential energy function and the kinetic energy function are given by U(θ) = −log π(θ) and K(p) = pᵀ J⁻¹(θ) p / 2, respectively.
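A minimal sketch of a Hamiltonian Monte Carlo step follows, simplifying the position-dependent metric J(θ) above to an identity mass matrix, so that p ∼ N(0, I) and K(p) = pᵀp/2. The Hamiltonian dynamics is simulated with the leap-frog integrator discussed in the text; the target and tuning values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def hmc_step(theta, log_post, grad_log_post, eps, L):
    # One HMC iteration with an identity mass matrix: leap-frog
    # integration of the Hamiltonian H = U(theta) + p'p/2 with
    # U(theta) = -log pi(theta), followed by one accept/reject step.
    p0 = rng.standard_normal(theta.shape)
    th = theta.copy()
    p = p0 + 0.5 * eps * grad_log_post(th)        # initial half step
    for step in range(L):
        th = th + eps * p                         # full position step
        g = grad_log_post(th)
        p = p + (eps if step < L - 1 else 0.5 * eps) * g
    h0 = -log_post(theta) + 0.5 * p0 @ p0
    h1 = -log_post(th) + 0.5 * p @ p
    # Accept with probability min{1, exp(H_current - H_proposed)}
    return th if np.log(rng.random()) < h0 - h1 else theta

log_post = lambda th: -0.5 * th @ th   # standard Gaussian target in 5 dims
grad = lambda th: -th

theta = np.ones(5)
draws = []
for _ in range(2000):
    theta = hmc_step(theta, log_post, grad, eps=0.2, L=10)
    draws.append(theta)
draws = np.array(draws)
print(np.round(draws[500:].std(axis=0), 1))   # marginal std devs, close to 1
```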
The proposal simulates this Hamiltonian system for a number of steps L using the so-called leap-frog algorithm (Neal, 2010) with some step size ε. The result of this procedure is the proposed parameter and the resulting momentum {θ′′, p′′}. This pair is accepted/rejected using a similar procedure as in the MH algorithm. The acceptance probability compares the Hamiltonian of the system at the last accepted parameter with that at the proposed parameter,

  α(θ′′, θ′) = 1 ∧ exp{H(θ′, p′) − H(θ′′, p′′)}.   (4.6)

As the accept/reject decision is delayed for L steps, this proposal allows the chain to move a larger distance between iterations of the algorithm. This increases the mixing of the Markov chain and also allows the Markov chain to visit isolated modes in the posterior. This leads to a better exploration of the posterior as well as more effective samples. The mHMC algorithm is used in many applications and is currently a popular algorithm in statistics and machine learning, see e.g. Chen et al. (2014), Beam et al. (2014) and Betancourt and Girolami (2013). In Girolami and Calderhead (2011), the authors make use of the mHMC algorithm for parameter inference in an SV model with impressive results. Adapting the mHMC algorithm to the PMCMC framework is therefore an exciting opportunity for future research.

We conclude this section with a comparison between the three MCMC algorithms previously discussed. In Example 4.3, we replicate and extend an illustration from Neal (2010) to compare the different methods when sampling from a high-dimensional target distribution.

4.3 Example: Parameter inference in a 100-dimensional Gaussian dist.
Consider the problem of sampling from a non-isotropic 100-dimensional Gaussian distribution with zero mean and covariance matrix Σ = diag(1.00, 0.99, . . . , 0.01). In this problem, we use random step sizes where εRW ∼ U(0.018, 0.026) for MH-RW and εHMC ∼ U(0.010, 0.016) for HMC with L = 150.
These values follow the suggestions by Neal (2010). For MALA, we use the step length εMALA ∼ U(0.008, 0.013), which results in an acceptance rate of about 60%. The HMC algorithm is executed for 1 000 iterations and the RW and MALA algorithms are executed for 150 000 iterations, but only every 150th iteration is kept to make a fair comparison with HMC. The resulting acceptance probabilities are 0.26, 0.60 and 0.88 for each proposal, respectively. The trace plots and ACFs for the first coordinate (with standard deviation 1) are presented in Figure 4.3. The estimated means and standard deviations of the target distribution are presented in Figure 4.4.

Figure 4.3: The trace of x1 (left) and the corresponding estimated ACF (right) for the RW-MH, MALA and HMC algorithms.

Figure 4.4: The estimated parameters obtained for each coordinate from the RW-MH, MALA and HMC algorithms. The estimates of the mean vector (left) and the diagonal of the covariance matrix (right) are presented for each coordinate. The dashed lines indicate the true parameter values for each coordinate.

The HMC algorithm gives almost independent samples from the target. The ACF falls off quicker for MALA compared with RW-MH, which indicates a more efficient exploration of the posterior. However, the estimates of the covariance matrix are biased for MALA, which could be the result of the Markov chain getting stuck somewhere in the parameter space. We conclude that there is a large gain in using the HMC algorithm for inference in models with a high-dimensional parameter vector.

4.3 Particle Metropolis-Hastings

As previously discussed in Chapter 2, the likelihood of an SSM is analytically intractable and therefore we cannot make use of the MH algorithm directly for parameter inference. In Section 3.3.4, we reviewed how to construct an unbiased estimator of the likelihood based on the APF. A natural solution could therefore be to replace the intractable quantities in the acceptance probability with unbiased estimates. This idea was first used by Beaumont (2003), where the author replaces an analytically intractable likelihood with an unbiased estimate in a genetics application of the MH algorithm. The first use of this idea for parameter inference in nonlinear SSMs is found in Fernández-Villaverde and Rubio-Ramírez (2007), where the intractable likelihood is replaced with an estimate from the APF. Subsequent work by Andrieu and Roberts (2009) and Andrieu et al. (2010) analyses the resulting PMH algorithm and proves that this is a valid approach.

To see why this method works, we review the derivation of the PMH algorithm following the presentation in Flury and Shephard (2011). Consider the problem of using the MH algorithm to sample the parameter posterior (2.14), i.e. π(θ) = p(θ|y1:T). The acceptance probability (4.3) then depends explicitly on the intractable likelihood pθ(y1:T), preventing direct application of the MH algorithm. Instead, assume that there exists an unbiased, non-negative estimator of the likelihood p̂θ(y1:T|u), i.e.
  Eu|θ[p̂θ(y1:T|u)] = ∫ p̂θ(y1:T|u) mθ(u) du = pθ(y1:T),   (4.7)

where u ∈ U denotes the multivariate random variable (vector) used to construct this estimator. Here, mθ(u) denotes the probability density of u on U. When the APF is used to construct the estimator of the likelihood, the random variable u consists of the particles and their ancestors {x0:T^(i), a1:T^(i)}_{i=1}^N. The PMH algorithm can then be seen as a standard MH algorithm operating on a non-standard extended space Θ × U, with the extended target given by

  π(θ, u|y1:T) = p̂θ(y1:T|u) mθ(u) p(θ) / p(y1:T) = p̂θ(y1:T|u) mθ(u) p(θ|y1:T) / pθ(y1:T),

and the proposal distribution mθ′′(u′′) q(θ′′|θ′). As a result, we can recover the parameter posterior by marginalisation of the extended target,

  ∫ π(θ, u|y1:T) du = [p(θ|y1:T) / pθ(y1:T)] ∫ p̂θ(y1:T|u) mθ(u) du = p(θ|y1:T),

using the unbiasedness property (4.7) of the likelihood estimator, since the inner integral equals pθ(y1:T). Samples from the parameter posterior can therefore be obtained as a byproduct of simulating from π(θ, u|y1:T). By selecting the proposal distribution as mθ′′(u′′) q(θ′′|θ′, u′), the acceptance probability is given by

  α(θ′′, θ′) = 1 ∧ [p̂θ′′(y1:T|u′′) p(θ′′) q(θ′|θ′′, u′′)] / [p̂θ′(y1:T|u′) p(θ′) q(θ′′|θ′, u′)].   (4.8)

Algorithm 5 Particle Metropolis-Hastings (PMH) for Bayesian inference in SSMs
Inputs: Algorithm 2, M > 0 (no. MCMC steps), q(θ′′|θ′) (proposal) and θ0 (initial parameters).
Output: θ = {θ1, . . . , θM} (samples from the parameter posterior).
1: Initialise using θ0.
2: for k = 1 to M do
3:   Sample θ′ from the proposal θ′ ∼ q(θ′|θk−1).
4:   Estimate the likelihood p̂θ′(y1:T) using Algorithm 2 and (3.19).
5:   Sample ωk from U(0, 1).
6:   if ωk < α(θ′, θk−1) given by (4.8) then
7:     {θk, p̂θk(y1:T)} ← {θ′, p̂θ′(y1:T)}. {Accept the parameter}
8:   else
9:     {θk, p̂θk(y1:T)} ← {θk−1, p̂θk−1(y1:T)}. {Reject the parameter}
10:  end if
11: end for

Note that the acceptance probability is the same as for the MH algorithm, but with the intractable likelihood replaced by an unbiased estimator and with an extended proposal. As previously discussed, the random variable u contains the entire particle system generated by the APF algorithm. From Section 3.4.2, we know that this information can be used in combination with the FL smoother to estimate the score and the information matrix, which can be used to construct particle versions of the MH1 and MH2 algorithms. This is the main idea behind the PMH1 and PMH2 algorithms proposed in Paper A, to which we refer interested readers for more information.

It turns out that the acceptance rate of the PMH algorithm is closely connected with the number of particles used to estimate the log-likelihood. If N is too small, the variance of the log-likelihood estimates is large and the Markov chain therefore often gets stuck, resulting in a low acceptance rate. We also know that N determines the computational cost of the APF algorithm. Therefore, we have a trade-off between the number of MCMC iterations and the number of particles in the APF, where we would like to minimise the total computational cost of the PMH algorithm. This problem is analysed and discussed by Doucet et al. (2012), Pitt et al. (2010) and Pitt et al. (2012). From this work, it is recommended to use a value of N such that the variance of the log-likelihood estimates is between 0.25 and 2.25. Consequently, we can determine the optimal number of particles by a pilot run.

Two other common versions of the PMH algorithm are the particle independent MH (PIMH) algorithm and the particle marginal MH (PMMH or PMH0) algorithm. The Gaussian versions of these proposals are obtained by using q(θ′′) = N(θ′′; 0, ε²) and q(θ′′|θ′) = N(θ′′; θ′, ε²), respectively.
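The pseudo-marginal mechanism can be illustrated without a full particle filter. In the sketch below, a simple latent-variable model (an assumption made for illustration, not the SSM setting of the text) has its likelihood estimated unbiasedly by Monte Carlo, which plays the role of the APF estimate in Algorithm 5. Note that the estimate for the current parameter is stored and reused, never recomputed, which is essential for the validity of the method.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy latent-variable model: y_t = x_t + e_t with x_t ~ N(theta, 1) and
# e_t ~ N(0, 1), so the exact likelihood integrates out x_t.
y = rng.normal(1.5, np.sqrt(2.0), size=50)   # data generated with theta = 1.5

def loglik_hat(theta, N=300):
    # Log of an unbiased likelihood estimate: for each y_t, average
    # p(y_t | x_i) over samples x_i ~ N(theta, 1).
    x = rng.normal(theta, 1.0, size=(N, len(y)))
    w = np.exp(-0.5 * (y - x)**2) / np.sqrt(2.0 * np.pi)
    return np.sum(np.log(np.mean(w, axis=0)))

def pmh0(M=2000, eps=0.15, theta0=0.0):
    # Pseudo-marginal RW-MH with a flat prior: the estimate for the
    # current state is reused, keeping the correct stationary distribution.
    theta, ll = theta0, loglik_hat(theta0)
    trace = np.empty(M)
    for k in range(M):
        prop = theta + eps * rng.standard_normal()
        ll_prop = loglik_hat(prop)
        if np.log(rng.random()) < ll_prop - ll:
            theta, ll = prop, ll_prop
        trace[k] = theta
    return trace

trace = pmh0()
print(round(np.mean(trace[500:]), 1))   # close to the sample mean of y
```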
The general version of the PMH algorithm, which incorporates PMH0 and PIMH as special cases, is presented in Algorithm 5. The full procedure for PMH1 and PMH2 is found in Algorithm 1 in Paper A.

4.4 Example: PMH0 for parameter inference in the GARCH(1,1) model
Consider the parameter inference problem for the GARCH(1,1) model (2.3) using the NASDAQ OMX Stockholm 30 Index data from Section 2.2.2. We make use of the PMH0 algorithm and the RW proposal with the step length ε = 0.005 for all elements in the parameter vector. We run the algorithm for M = 20 000 iterations (discarding the first 5 000 as burn-in) with the faPF algorithm and N = 3 000 particles for estimating the likelihood. The resulting acceptance rate is 0.06 (after burn-in) with ESS {39, 21, 21, 11} for the four parameters, respectively. Note that the poor ESS values are due to using, for simplicity, the same step length for all parameters. In Figure 4.5, we present the trace plots and posterior estimates obtained from the run. We see that the mixing is rather poor and that longer runs with smaller step lengths are needed. We could also make use of PMH1 or PMH2 from Paper A to improve the mixing. The posterior means are θ̂PMH = {0.0088, 0.14, 0.86, 0.63}, which can be used as point estimates of the parameters in the model.

4.4 Bayesian optimisation

BO (Jones, 2001; Boyle, 2007; Lizotte, 2008; Osborne, 2010) is a popular (global) derivative-free optimisation method, which is currently studied extensively in the machine learning community. Here, we consider the use of BO to solve

  xmax = argmax_{x∈B} f(x),   (4.9)

where B ⊂ R^d denotes some compact set.
Figure 4.5: The trace plots (left) and the corresponding posterior estimates (right) of the GARCH(1,1) model using the Nasdaq OMX Stockholm 30 Index data from Section 2.2.2. The estimates are computed using the PMH0 algorithm with 20 000 iterations, discarding the first 5 000 as burn-in.

Here, we assume that we cannot directly evaluate the real-valued function f(x), but can obtain noisy estimates (samples) modelled as

  yk = f(xk) + zk,   zk ∼ N(0, σz²),   (4.10)

where zk denotes zero-mean Gaussian noise with unknown variance σz². Therefore, the BO algorithm is useful when we can only estimate the value of the objective function with some simulation-based algorithm. It also turns out that the BO algorithm requires fewer evaluations of the objective function than many other optimisation algorithms (Brochu et al., 2010). As a consequence, BO is useful when the noisy estimates of the objective function are computationally costly to obtain. In the following, we make use of BO for ML-based parameter inference and for input design in nonlinear SSMs. In these cases, the objective function corresponds to the log-likelihood and the logarithm of the determinant of the expected information matrix, respectively. In both cases, these quantities are analytically intractable, but we can obtain noisy estimates of the objective function by the use of computationally costly particle filtering and smoothing.
Hence, these are two applications that fit the BO algorithm well. The BO algorithm operates by constructing a surrogate function, also called a response surface, that emulates the objective function. In BO, this surrogate function is modelled as a probabilistic function with some prior form selected by the user. The name comes from the fact that samples from the objective function are used together with the prior to update the model using Bayes' theorem. Using the updated posterior, we can predict the value of the objective function anywhere in the space of interest and also obtain an uncertainty of the predicted value. Using the predictive distribution, the algorithm can analyse where the optimum of the objective function could be located and focus the sampling on that area. The algorithm can also choose to explore areas where there is large uncertainty in the predicted value of the objective function. We refer to these two situations as exploitation and exploration, respectively. In the following, we discuss this part of the algorithm in more detail, together with the acquisition rules that are used to make the decisions regarding exploitation and exploration. To conclude this overview of the BO algorithm, we present the three steps that are carried out during the kth iteration:

(i) Sample the objective function at xk to obtain yk.
(ii) Use the collected data Dk = {xi, yi}_{i=1}^k to construct a surrogate function.
(iii) Use an acquisition rule and the surrogate function to select xk+1, i.e. where to sample the objective function in the next iteration of the algorithm.

We now proceed by discussing Steps (ii) and (iii) in more detail. Step (i) depends on the specific optimisation problem that we would like to solve; for the two previously discussed applications it corresponds to the APF and the FL smoother in Algorithms 2 and 3, respectively. In this thesis, we make use of GPs (Rasmussen and Williams, 2006) as the surrogate function.
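The three steps above can be sketched as a loop. The surrogate below is a GP with an SE kernel and the acquisition rule is an upper confidence bound; both choices, as well as the toy objective and all tuning values, are illustrative assumptions (the acquisition rules actually used are discussed later in this section).

```python
import numpy as np

rng = np.random.default_rng(5)

def se_kernel(a, b, ell=0.2):
    # Squared exponential kernel with unit prior variance.
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def gp_predict(x_star, x, y, sigma_z=0.05):
    # GP surrogate: predictive mean and variance with a zero mean function.
    K = se_kernel(x, x) + sigma_z**2 * np.eye(len(x))
    k_star = se_kernel(x, x_star)
    mu = k_star.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0) + sigma_z**2
    return mu, np.maximum(var, 0.0)

f = lambda x: -(x - 0.7)**2              # "unknown" objective, maximiser at 0.7
grid = np.linspace(0.0, 1.0, 201)        # compact set B = [0, 1]
xs = [0.1]
ys = [f(xs[0]) + 0.05 * rng.standard_normal()]
for k in range(20):
    mu, var = gp_predict(grid, np.array(xs), np.array(ys))   # step (ii)
    x_next = grid[int(np.argmax(mu + 2.0 * np.sqrt(var)))]   # step (iii), UCB
    xs.append(x_next)                                        # step (i)
    ys.append(f(x_next) + 0.05 * rng.standard_normal())
print(round(xs[int(np.argmax(ys))], 2))
```

The best sampled point should end up close to the true maximiser at 0.7 after a handful of iterations.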
Therefore, we devote the following section to introducing the GP and discussing its structure and how to combine it with the obtained samples from the objective function. Then, we discuss some different acquisition functions and compare their properties. Finally, we combine the GP with the BO algorithm to obtain the Gaussian process optimisation (GPO) algorithm. We conclude this section by discussing some applications of GPO in connection with SSMs. For more information regarding BO, see Lizotte (2008), Boyle (2007), Brochu et al. (2010) and Snoek et al. (2012).

4.4.1 Gaussian processes as surrogate functions

GPs are an instance of Bayesian nonparametric models and have their origins in kriging (Cressie, 1993; Matheron, 1963) methods from spatial statistics. An application of kriging is to interpolate between elevation measurements sampled in some terrain to build a map of the elevation in an area. The underlying assumption is that the elevation should vary smoothly between sampled points. This is the property of the GP that we would like to use for interpolating the objective function between the values at which we sample it.

A GP (Rasmussen and Williams, 2006) can be seen as a generalisation of the multivariate Gaussian distribution to infinite dimension. As such, it is a collection of random variables, where each finite subset is jointly distributed according to a Gaussian distribution. A realisation drawn from a GP can therefore be seen as an infinitely long vector of values, which can be interpreted as a function over the real space Rᵈ. This is why the GP is considered by some to be a prior over functions on Rᵈ. As the GP is a Gaussian distribution of infinite dimension, we cannot characterise it using a mean vector and covariance matrix. Instead, we introduce a mean function m(x) and a kernel (or covariance function) κ(x, x′) defined as

    m(x) = E[f(x)],   (4.11a)
    κ(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))].   (4.11b)
To construct the surrogate function, we assume a priori that the objective function can be modelled as a GP,

    f(x) ∼ GP(m(x), κ(x, x′)),   (4.12)

with the mean function and kernel defined by (4.11). Here, the mean function specifies the average value of the process and the kernel specifies the correlation between (nearby) samples. Both functions are considered prior choices of the algorithm and are used to encode the beliefs about the data before it is observed. Consequently, both the prior (4.12) and the data likelihood (4.10) are distributed according to Gaussian distributions. Hence, the posterior resulting from Bayes' theorem is a Gaussian distribution with some mean and covariance that can be calculated using standard results. From this posterior, we can construct the predictive distribution at some test point x⋆ given the data Dk by

    f(x⋆) | Dk ∼ N(x⋆; µf(x⋆|Dk), σf²(x⋆|Dk)),   (4.13a)
    µf(x⋆|Dk) = κ⋆ᵀ [κ(x1:k, x1:k) + σz² Ik]⁻¹ y1:k,   (4.13b)
    σf²(x⋆|Dk) = κ(x⋆, x⋆) − κ⋆ᵀ [κ(x1:k, x1:k) + σz² Ik]⁻¹ κ⋆ + σz²,   (4.13c)

where κ⋆ = κ(x⋆, x1:k) denotes the covariance between the test value and the sampling points. Here, κ(x1:k, x1:k) denotes a matrix where the element at (i, j) is given by κ(xi, xj) for i = 1, …, k and j = 1, …, k.

To obtain the GP posterior, we need to select a kernel function. Note that it is possible to include the assumption of a non-zero mean function in the kernel function by adding an appropriate constant kernel. Therefore, we only make use of a zero mean function in this thesis and focus on the kernel design problem, where several kernels can be combined by different operations to encode the prior beliefs about the structure in the data. Here, we only consider the combination of a constant covariance function and three popular kernels: the squared exponential (SE), the Matérn 3/2 and the Matérn 5/2.
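The predictive equations (4.13) can be implemented directly with a few linear solves. Below is a minimal NumPy sketch; the function name gp_predict is our own and the kernel is passed in as a callable:

```python
import numpy as np

def gp_predict(x_train, y_train, x_star, kernel, sigma2_z):
    """Predictive mean and variance of a zero-mean GP at a test point, cf. (4.13)."""
    K = kernel(x_train[:, None], x_train[None, :])  # Gram matrix kappa(x_{1:k}, x_{1:k})
    k_star = kernel(x_star, x_train)                # kappa_star = kappa(x_star, x_{1:k})
    A = K + sigma2_z * np.eye(len(x_train))
    mu = k_star @ np.linalg.solve(A, y_train)       # predictive mean (4.13b)
    var = kernel(x_star, x_star) - k_star @ np.linalg.solve(A, k_star) + sigma2_z  # (4.13c)
    return mu, var
```

A simple sanity check: with zero observation noise the GP interpolates the data exactly, so the predictive mean at a training point equals the observed value and the predictive variance is zero.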
See Rasmussen and Williams (2006) for other kernels and for a discussion of how they can be combined. The SE kernel is also known as the radial basis function (RBF) kernel and has the form

    κSE(x, x′) = σκ² exp( −(x − x′)² / (2l²) ),   (4.14)

where the hyperparameters are α = {σκ², l}. Here, l is called the characteristic length scale, as it scales the Euclidean distance between the two points x and x′. Two other kernels are the Matérn 3/2 and the Matérn 5/2 with the forms

    κ3/2(x, x′) = σκ² (1 + √3|x − x′|/l) exp( −√3|x − x′|/l ),   (4.15a)
    κ5/2(x, x′) = σκ² (1 + √5|x − x′|/l + 5(x − x′)²/(3l²)) exp( −√5|x − x′|/l ),   (4.15b)

where the hyperparameters are α = {σκ², l}. The main difference between the three kernels is their smoothness properties. The SE kernel is the smoothest and has infinitely many continuous derivatives. The κ3/2 and κ5/2 kernels only have one and two continuous derivatives, respectively. To illustrate the different kernels, we present some simulated realisations from each in Example 4.5.

4.5 Example: GP kernels

In Figure 4.6, we present realisations from the GP prior (4.12) using three different kernels with two different length scales. We see that the smoothness of the prior decreases from top to bottom, which verifies the previous discussion. Also, we see that the length scale determines the rate of change in the realisations.

Figure 4.6: Realisations simulated from the GP prior using three different kernels: SE (upper), Matérn 5/2 (middle) and Matérn 3/2 (lower) and length scales 1 (left) and 3 (right).

Finally, we need to determine suitable values for the hyperparameters in the kernel.
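As a reference, the three kernels in (4.14)–(4.15) can be implemented as follows (a sketch; the function names are our own):

```python
import numpy as np

def kernel_se(x, xp, sigma2_k=1.0, l=1.0):
    """Squared exponential (RBF) kernel (4.14); infinitely many derivatives."""
    return sigma2_k * np.exp(-(x - xp) ** 2 / (2.0 * l ** 2))

def kernel_matern32(x, xp, sigma2_k=1.0, l=1.0):
    """Matérn 3/2 kernel (4.15a); one continuous derivative."""
    r = np.sqrt(3.0) * np.abs(x - xp) / l
    return sigma2_k * (1.0 + r) * np.exp(-r)

def kernel_matern52(x, xp, sigma2_k=1.0, l=1.0):
    """Matérn 5/2 kernel (4.15b); two continuous derivatives."""
    r = np.sqrt(5.0) * np.abs(x - xp) / l
    return sigma2_k * (1.0 + r + r ** 2 / 3.0) * np.exp(-r)
```

All three return σκ² at x = x′ and decay with the scaled distance |x − x′|/l; suitable values for the hyperparameters α = {σκ², l} still have to be determined from data.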
This can be done using an ML procedure called empirical Bayes (EB), where the marginal likelihood of the data is optimised with respect to the hyperparameters α. Note that this is not a purely Bayesian approach, as the data is used to determine the properties of the prior. Nevertheless, it is a popular approach in the GP literature and we therefore make use of it here. The marginal likelihood can be computed by marginalisation,

    p(y|x, α) = ∫ p(y|f, x, α) p(f|x, α) df,

where we drop the subscripts on y1:k, x1:k and f1:k for brevity. The log-marginal likelihood can be obtained in closed form using results for the Gaussian distribution as

    log p(y|x, α) ∝ −(1/2) yᵀ [κ(x, x) + σz² IN]⁻¹ y − (1/2) log |κ(x, x) + σz² IN|,

where we have neglected the terms that are independent of α (independent of the kernel). The gradient of the log-marginal likelihood with respect to αj can be computed using

    ∂/∂αj log p(y|x, α) = (1/2) tr[ (ββᵀ − κ(x, x)⁻¹) ∂κ(x, x)/∂αj ],   β = κ(x, x)⁻¹ y.

Therefore, the hyperparameters can be estimated by maximising the log-marginal likelihood using a gradient-based search algorithm (4.1). In Example 4.6, we make use of the EB method for GP regression using the different kernels discussed in this section.

4.6 Example: GP regression

Consider the GP regression problem where we would like to recover the underlying function f(x) given by

    f(x) = 3[cos(0.5x) + sin(2 + 0.25x)]²,

from the noisy measurements y generated by (4.10) with σz² = 2. Here, we make use of the three kernels in (4.14) and (4.15) with an added constant kernel to account for the non-zero mean of the data. The hyperparameters of the resulting kernels are estimated using the EB procedure. In the left part of Figure 4.7, we use N = 5 observations and present the resulting predictive mean (solid line), the underlying function f(x) (dashed) and the 95% CI of the predictive distribution (gray area).
We see that most of the samples are located in the left part of the region and, as a result, the predictive means of the GPs follow the underlying function there. In the right part of Figure 4.7, we present the same setup but using N = 15 sampled points instead. Here, the overall fit is much better and the three GP predictive distributions recover the underlying function rather well across the region. As the underlying function is smooth (it has infinitely many continuous derivatives), it is well captured by the SE kernel.

Figure 4.7: The GP regression problem in Example 4.6 using N = 5 (left) and N = 15 (right) data points with three different kernels. The mean of the predictive distributions (solid lines) and the corresponding 95% CI (gray area) are presented together with the true function (dashed line) and the noisy observations (black dots).

4.4.2 Acquisition rules

We now proceed by discussing Step (iii) of the BO algorithm, i.e. the acquisition rule and how it operates. The main idea of this rule is to make use of the predictive mean and its uncertainty to decide where to sample the objective function during the next iteration. We would like the algorithm to explore the parameter space to find all the peaks, but still focus the samples around the maximum to decrease the number of samples required to solve the problem. This is called the exploration and exploitation trade-off, as we would like to explore the space, but also exploit the information encoded in the surrogate function about where the maxima are located.
Using a general acquisition rule AQ(x|Dk), we would like to select xk+1 in Step (iii) as the maximising argument,

    xk+1 = argmax_{x∈B} AQ(x|Dk),   (4.16)

which in itself is an optimisation problem. We discuss some different methods for solving this in Paper B and now instead proceed with discussing three acquisition rules that are popular and widely used in GPO (Brochu et al., 2010). These are: the probability of improvement (PI), the expected improvement (EI) and the upper confidence bound (UCB).

The PI and EI make use of the fact that the predictive distribution is Gaussian, so that the predictive mean and covariance are available. From this, we can use the probability density function (PDF) and the cumulative distribution function (CDF) of the Gaussian distribution to calculate the PI and the EI. This is done by introducing the highest predicted value of the surrogate function (the incumbent),

    µmax = max_{xi∈x1:k} µf(xi|Dk),   (4.17)

after which the PI (Kushner, 1964) can be computed using

    PI(x|Dk) = P(f(x) ≥ µmax + ξ | Dk) = Φ(Zk),   (4.18a)
    Zk = (µf(x|Dk) − µmax − ξ) / σf(x|Dk),   (4.18b)

where Φ(·) denotes the Gaussian CDF. Here, ξ ≥ 0 denotes a user-defined coefficient proposed by Lizotte (2008) to balance the exploitation and exploration behaviour. From the form of the PI expression, we note that the variable Zk can be seen as a standard Gaussian random variable. Hence, Zk assumes a (large) positive value if the predictive mean is close to or larger than µmax. We then obtain a value of the PI close to one, and it is probable that the GPO algorithm will sample the objective function in this region during its next iteration. In contrast, we obtain a small value of the PI if the predictive mean is much smaller than µmax and/or the uncertainty is very large. A small value of the PI means that it is unlikely that the GPO algorithm will sample the objective function in this region during its next iteration.
However, the PI only takes into account the probability of an improvement and not its size. To include this information, we consider the EI rule (Mockus et al., 1978; Jones et al., 1998) of the form

    EI(x|Dk) = (µf(x|Dk) − µmax − ξ) Φ(Zk) + σf(x|Dk) φ(Zk),   (4.19)

where φ(·) denotes the Gaussian PDF. The interpretation of the EI is analogous to that of the PI rule. If the predictive mean is close to or larger than µmax, then Φ(Zk) assumes a value close to one, which scales the expected gain in the objective function. Here, we also take the uncertainty into account through the second term in (4.19). This term can be seen as a scaling of the uncertainty in the predictive distribution. The scaling is large if Zk is close to zero and decreases for larger values. This means that we get an extra contribution to the EI if the predictive mean is close to µmax, which means that it could be interesting to explore that area. For more information, see Paper B, where we review the derivation of the EI rule.

The third acquisition rule, the UCB, follows from the fact that we can construct a CI using the predictive distribution. The intuition is that if the predictive mean is high in an area of the parameter space, the resulting UCB is also large. Moreover, uncertainty in a region increases the predictive covariance, which also increases the value of the UCB. By this rule, we therefore explore areas where peaks have been found and where the uncertainty is large. As the name suggests, we are interested in the upper bound, which for a Gaussian distribution is given by

    UCB(x|Dk) = µf(x|Dk) + β σf(x|Dk).   (4.20)

Here, β ≥ 0 denotes a coefficient determining the confidence level of the interval. As the predictive distribution is Gaussian, we would choose β = 1.96 to obtain a 95% CI and β = 2.58 to obtain a 99% CI.
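The three rules can be computed directly from the GP predictive mean and standard deviation at a point. A minimal sketch using only the standard library (the function name and the argument name beta for the UCB coefficient are our own):

```python
import math

def acquisition_rules(mu, sigma, mu_max, xi=0.01, beta=1.96):
    """PI (4.18), EI (4.19) and UCB (4.20) at a point with predictive mean mu
    and predictive standard deviation sigma; mu_max is the incumbent (4.17)."""
    z = (mu - mu_max - xi) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # Gaussian CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # Gaussian PDF
    pi = Phi                                                  # probability of improvement
    ei = (mu - mu_max - xi) * Phi + sigma * phi               # expected improvement
    ucb = mu + beta * sigma                                   # upper confidence bound
    return pi, ei, ucb
```

In a GPO iteration, these would be evaluated over a set of candidate points and the next sampling point chosen as the maximiser of the selected rule, cf. (4.16).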
4.7 Example: GPO using different acquisition rules

Consider again the problem in Example 4.6 using the Matérn 5/2 kernel with N = 3, N = 5 and N = 10 samples. In the left part of Figure 4.8, we present the predictive distributions together with the underlying function as before. In the right part of the figure, we present the normalised values of the three acquisition functions previously discussed: the PI (green), the EI (red) and the UCB (blue). Here, we use the value ξ = 0.01 recommended by Lizotte (2008) and β = 1.96, so that the UCB corresponds to a 95% CI. We note that the three acquisition functions have quite different behaviours in the three situations. In the first situation (upper), the PI and the UCB have two peaks, located in the left and right parts of the region, to exploit the current information and to explore the region better, respectively. The EI would like to sample the left end of the region to exploit the current information and to reduce the uncertainty in that area. In the second situation (middle), the EI again would like to exploit the current information by placing a sample near the peak of the predictive mean. The other two acquisition functions would like to explore the right part of the region better.

Figure 4.8: The GP regression problem in Example 4.6 using N = 3 (upper), N = 5 (middle) and N = 10 (lower) data points with the Matérn 5/2 kernel. In the left part, we present the mean of the predictive distributions (solid lines) and the corresponding 95% CI (gray area) together with the true function (dashed line) and the noisy observations (black dots).
In the right part, we present the normalised values of the three acquisition functions for each situation.

Finally, in the third situation (lower), the three functions agree and would like to exploit the current peak in the predictive mean. Hence, we conclude that the three acquisition functions have rather different behaviour, and we return to this in a following example to investigate how it affects the resulting GPO algorithm.

4.4.3 Gaussian process optimisation

The full procedure for using GPO to estimate the solution to (4.9) is presented in Algorithm 6. As previously discussed, the user choices include the kernel for the GP prior and the acquisition function. Again, we remind the reader that the choice of kernel is crucial for the performance of the algorithm. Furthermore, an optimisation method is needed to optimise the acquisition function in (4.16). Here, we make use of a global derivative-free optimisation algorithm called DIRECT (Jones et al., 1993), but we discuss other possible choices in Paper B.

The GPO algorithm is often initialised with some randomly selected parameters sampled from the parameter prior. These samples are used to estimate the hyperparameters of the GP kernels so that the AQ function can operate. The number of such samples varies with the dimension of the problem, but between 5 and 50 is reasonable for small problems. Also, as the EB procedure is computationally costly, it is beneficial to re-estimate the GP hyperparameters only every 5th or 10th iteration to save computations.

We end this section by discussing three different examples of where GPO and/or surrogate function modelling is useful. In Examples 4.8 and 4.9, we use the GPO algorithm for solving the ML parameter inference problem in an SSM and compare the different possible choices of AQs. Note that more examples of this application are found in Papers B and C.
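A minimal, self-contained sketch of the GPO loop in Algorithm 6, using an SE kernel and the EI rule (with ξ = 0) on a one-dimensional grid. All names are our own, and this is only an illustration of the mechanics; a real implementation would also re-estimate the hyperparameters with EB and optimise the acquisition function with e.g. DIRECT:

```python
import numpy as np
from math import erf

def gpo(sample_objective, grid, K, sigma2_z=0.1, l=1.0, seed=0):
    """Sketch of Algorithm 6: GPO with an SE kernel and the EI rule (K >= 1)."""
    rng = np.random.default_rng(seed)
    kern = lambda a, b: np.exp(-(a - b) ** 2 / (2.0 * l ** 2))
    xs, ys = [float(rng.choice(grid))], []             # random initialisation
    for k in range(K):
        ys.append(sample_objective(xs[k]))             # step (i): noisy sample of f
        X, Y = np.array(xs), np.array(ys)
        A = kern(X[:, None], X[None, :]) + sigma2_z * np.eye(len(X))
        Kg = kern(grid[:, None], X[None, :])
        mu = Kg @ np.linalg.solve(A, Y)                # step (ii): predictive mean (4.13b)
        var = 1.0 + sigma2_z - np.sum(Kg * np.linalg.solve(A, Kg.T).T, axis=1)
        sd = np.sqrt(np.maximum(var, 1e-12))           # predictive std from (4.13c)
        z = (mu - mu.max()) / sd                       # incumbent taken over the grid
        Phi = np.array([0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in z])
        phi = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
        ei = (mu - mu.max()) * Phi + sd * phi          # step (iii): EI rule (4.19)
        xs.append(float(grid[int(np.argmax(ei))]))
    return float(grid[int(np.argmax(mu))]), xs, ys
```

The returned value is the maximiser of the final predictive mean, corresponding to Step 8 of Algorithm 6.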
Finally, in Example 4.10, we illustrate the use of GPO for constructing an input that maximises the logarithm of the determinant of the expected information matrix in an SSM. Remember that this corresponds to maximising the accuracy of the ML parameter estimate, which is the objective in input design.

Algorithm 6 Gaussian process optimisation (GPO)
Inputs: κ(·) (GP kernel), AQ (acquisition func.), K (no. iterations) and x1 (init. value).
Output: x̂max (estimate of the maximising argument of (4.9)).
1: Initialise the algorithm by random sampling.
2: for k = 1 to K do
3:   Sample f(xk) to obtain the noisy estimate yk.
4:   Compute (4.13) to obtain µf(x|Dk).
5:   Compute (4.17) to obtain µmax.
6:   Compute (4.16) to obtain xk+1.
7: end for
8: Compute the maximiser of µf(x|DK) to obtain the estimate x̂max of (4.9).

4.8 Example: GPO for ML inference in the GARCH(1,1) model

Consider the ML parameter inference problem in the GARCH(1,1) model (2.3) using synthetic data with θ = τ and Θ = (0, 0.25) ⊂ R. Here, we make use of the GPO algorithm for solving this problem by first rewriting (4.10) as

    ℓ̂(θk) = ℓ(θk) + zk,  zk ∼ N(0, σz²),

which is similar to the CLT established in Section 3.3.4. As a consequence, (4.9) turns into the maximum likelihood maximisation problem discussed in Section 2.3. The resulting procedure follows from Algorithm 6 by plugging in the APF from Algorithm 2 for log-likelihood estimation. Here, we use a one-dimensional parameter vector to be able to compare the three different AQs discussed in the previous section. We generate T = 250 observations from the model using {α, β, γ, τ} = {0.1, 0.8, 0.05, 0.3}. We make use of the faPF with N = 100 particles to estimate the log-likelihood during each iteration. Here, we use the value ξ = 0.01 recommended by Lizotte (2008) and 1.96 for the coefficient in the UCB rule. In Figure 4.9, we present the procedure at five consecutive iterations using the three different acquisition rules.
The procedure is initialised with two randomly selected samples of the log-likelihood, which are not shown in the figure. Here, we see that the behaviour of the three acquisition rules is rather different, but the resulting mode of the log-likelihood is almost the same. The parameter estimate is obtained around θ̂ML = 0.12 for all three choices of the acquisition rule. Note that this small toy example is probably not complex enough to show the real differences between the acquisition rules. Remember that more extensive evaluations by Lizotte (2008) show that the EI rule is often a good choice in many different applications.

Figure 4.9: Five steps of the GPO algorithm for ML parameter inference in Example 4.8 using three different acquisition rules: PI (left), EI (center), UCB (right).
The predictive mean and the resulting value of the acquisition rule are presented with coloured solid and dotted lines, respectively. The black dots and gray areas indicate the samples obtained from the log-likelihood and the 95% predictive CI, respectively.

4.9 Example: GPO for ML inference in the earthquake count model

We return to ML parameter inference in the earthquake count model (2.6) using the real-world data discussed in Section 2.2.3. Here, the parameter vector is given by θ = {φ, σv, β} and the parameter space is Θ = (0, 1) × (0, 1) × (10, 20) ⊂ R³. We make use of the bPF with N = 1 000 particles to estimate the log-likelihood. The procedure is initialised with 50 samples of the log-likelihood at randomly selected parameters and continues with 150 iterations of the GPO algorithm. In Figure 4.10, we present the current ML parameter estimate at each iteration of the GPO algorithm. We see that the parameter estimates and the predicted log-likelihood stabilise after about 150 iterations. This is rather fast compared with e.g. SPSA, which in Paper B requires an order of magnitude more samples of the log-likelihood. The final parameter estimate is obtained as θ̂ML = {0.88, 0.15, 17.65}. This shows that the underlying intensity is rather slowly varying and that the mean number of major earthquakes each year is 17.65.

4.10 Example: Input design in the LGSS model using GPO

Consider the LGSS model (2.2) with the parameters θ⋆ = {0.5, 1, 0.1, 0.1}, T = 250 and an input u1:T generated by

    ut = −1, with probability (1 − α1)(1 − α2),
    ut = 0, with probability α1,
    ut = 1, with probability (1 − α1)α2,

for t = 1, …, T. Here, the aim is to find the input parameters α⋆ = {α1⋆, α2⋆} such that

    α⋆ = argmax_{α∈[0,1]×[0,1]} log det Î(θ⋆, u1:T(α)),

where Î(θ⋆, u1:T(α)) denotes the estimated expected information matrix using the input u1:T(α) generated with the input parameters α.
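The ternary input above can be simulated in a couple of lines (a sketch; the function name is our own):

```python
import numpy as np

def generate_input(alpha1, alpha2, T, seed=0):
    """Sample the ternary input u_{1:T} used in Example 4.10."""
    rng = np.random.default_rng(seed)
    p = [(1.0 - alpha1) * (1.0 - alpha2),  # P(u_t = -1)
         alpha1,                           # P(u_t =  0)
         (1.0 - alpha1) * alpha2]          # P(u_t = +1)
    return rng.choice([-1, 0, 1], size=T, p=p)
```

Note that α1 controls how often the input is zero, while α2 controls the balance between the two amplitude levels ±1, so the input automatically satisfies the amplitude constraint |ut| ≤ 1.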
We make use of the second formulation in (2.13) to estimate the expected information matrix. This is done by first estimating the score function with the FL smoother in Algorithm 3 for M = 100 different data realisations from the model, using the same input u1:T and the faPF with N = 100 particles. Then, the expected information matrix is estimated as the sample covariance matrix of the score function estimates. We integrate this problem into the GPO procedure outlined in Algorithm 6, where the estimation of the expected information matrix constitutes Step (i). Furthermore, we make use of the EI as the acquisition rule and initialise the procedure with 10 uniform random samples of α. In Figure 4.11, we present the resulting predictive mean over the input parameters and the sampling points of the algorithm. The input parameter estimates are obtained as α̂ = {0.26, 0.50}. We see that the algorithm converges quickly to the estimates, which shows that the GPO algorithm can be a useful alternative for input design, as it requires only a limited number of computationally costly estimates of the expected information matrix.

Figure 4.10: The ML parameter estimate and the resulting predicted log-likelihood at each iteration of the GPO algorithm in Example 4.9.

Figure 4.11: Upper: the estimated mean of the expected information matrix as a function of the input parameters. Middle: the sampling points of the GPO algorithm. Lower: a realisation of the estimated optimal input.
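The covariance-based estimator of the expected information matrix used in Example 4.10 can be sketched as follows, assuming the M score estimates are already available as rows of an array (the function name is our own):

```python
import numpy as np

def log_det_expected_information(score_estimates):
    """Estimate log det of the expected information matrix from score estimates.

    score_estimates: array of shape (M, d), one estimated score vector per
    data realisation, where d is the parameter dimension.
    """
    info = np.cov(score_estimates, rowvar=False)  # sample covariance of the scores
    info = np.atleast_2d(info)                    # handle the scalar (d = 1) case
    sign, logdet = np.linalg.slogdet(info)
    return logdet
```

This scalar is exactly the objective that the GPO procedure maximises over the input parameters α.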
5 Concluding remarks and future work

In this chapter, we give a summary of the contributions in the papers included in this thesis and discuss some avenues for future work.

5.1 Summary of the contributions

Broadly speaking, the contributions of this thesis fall within two areas. The first contribution is to develop new methods, and improve existing methods, for efficient parameter inference in SSMs. Here, we are concerned with computational efficiency, which means that we would like to reduce the number of particles or iterations needed to reach a certain accuracy in the parameter estimates. This contribution is contained within Papers A-C. The second contribution is to make use of SMC and MCMC to extend existing methods for parameter inference and input design to nonlinear problems. This contribution is discussed in Papers D and E.

In Paper A, we develop a novel PMCMC algorithm that combines the PMH algorithm from Section 4.3 with the Langevin dynamics discussed in Section 4.2.2. The key idea is to include the particle system within the proposal of the PMH algorithm. With this information, we can make use of the FL smoother from Section 3.4.2 to estimate the score function and the observed information matrix. These quantities are then used to construct the PMH1 and PMH2 algorithms in Paper A, in analogy with the MH1 and MH2 algorithms discussed in Section 4.2.2. The resulting algorithm is efficient, as it explores the posterior distribution better, which results in a higher ESS compared with the PMH0 algorithm. Furthermore, the added information makes the algorithm invariant to affine transformations of the parameter vector and reduces the length of the burn-in. As a consequence, the proposed algorithm requires fewer iterations than the PMH0 algorithm in some settings, which makes it more computationally efficient as the computational complexity of the two algorithms is the same.
In Paper B, we develop a novel algorithm for ML parameter inference by combining ideas from GPO in Section 4.4 with log-likelihood estimation using the APF from Section 3.3.4. The resulting algorithm is computationally efficient, as it requires fewer samples of the log-likelihood than some other optimisation methods. As these estimates are computationally costly to obtain, this results in an overall decreased computational cost. Compared with SPSA, the gain is about one order of magnitude; see Paper B for a comparison on the HWSV model.

In Paper C, we extend the combination of GPO and SMC to parameter inference in nonlinear SSMs with intractable likelihoods. Computationally costly ABC methods are used to approximate the intractable likelihood. Therefore, there could be substantial gains in using this algorithm for inference in this type of model.

In Paper D, we develop a novel algorithm for input design in nonlinear SSMs, which can handle amplitude constraints on the input. The proposed method combines results from Valenzuela et al. (2013) with SMC from Section 3.4.2 for estimating the expected information matrix.

In Paper E, we propose two algorithms for parameter inference in ARX models with Student-t innovations, which include automatic model order selection by two different methods. These methods make use of the MH algorithm discussed in Section 4.2 with reversible jump, and of the Gibbs sampler together with sparseness priors.

5.2 Outlook and future work

In this section, we summarise some ideas for future work and extensions of the contributions presented within this thesis. We discuss three different areas: the PMH algorithm, the GPO-SMC algorithm and input design in SSMs.

5.2.1 Particle Metropolis-Hastings

The proposed contributions to the PMH algorithm are mainly methodological developments of existing methods. Therefore, it would be interesting to examine the theoretical properties of the PMH1 and PMH2 algorithms.
This includes questions regarding the convergence rate of the algorithm and how its properties scale with the dimension of the parameter space. Similar analyses have previously been done for MH0 (Roberts et al., 1997), MH1 (Roberts and Rosenthal, 1998) and PMH0 (Sherlock et al., 2013). A possible first step for the PMH analysis is to consider the situation where the number of observations is large. By the discussion in Section 2.4, we know that the Bayesian CLT would give a roughly Gaussian posterior, which simplifies the analysis.

Further methodological developments could also be interesting in the PMCMC framework. This includes the development of adaptive PMH1 and PMH2 algorithms, which automatically determine suitable step sizes and the number of particles. It could also be possible to relax the reversibility constraint of the Markov chain during the burn-in phase of the algorithm. This could decrease the hitting time of the posterior mode by the Markov chain. The adaptive mechanism could then decide when the chain has reached the mode and turn on the reversibility condition, so that the chain admits the target as its stationary distribution. Relevant work for this idea is found in Andrieu and Thoms (2008), Peters et al. (2010) and Pitt et al. (2012). Another approach to reducing the length of the burn-in is to make use of GPO to estimate the location of the posterior mode in a pilot run. This is similar to the work by Owen et al. (2014), where the authors make use of some pilot runs of the ABC algorithm to initialise the PMH0 algorithm in SSMs with intractable likelihoods.

Also, it would be interesting to develop a particle HMC algorithm, as suggested in the discussions (Doucet et al., 2011) following Girolami and Calderhead (2011). The challenge with this idea is how to handle the fact that multiple APFs are run within each PMH iteration, which might require some additional developments to the PMCMC framework.
Better particle smoothers could also be useful, as they could improve the estimates of the score function and the information matrix. This would allow larger step lengths in the PMH2 algorithm and could lead to even larger increases in the mixing of the Markov chain. Online methods for Bayesian inference would also be of great interest, especially for the many big-data problems that are likely to be faced in the future. Also, graphical processing units (GPUs) and other multicore architectures can be used to decrease the computational cost, as some parts of the SMC and MCMC algorithms can be run in parallel. Interested readers are referred to Beam et al. (2014), Neiswanger et al. (2013), Henriksen et al. (2012) and Murray (2012) for more information.

5.2.2 Gaussian process optimisation using the particle filter

As we have demonstrated in this thesis, the performance of the GPO algorithm depends on the kernel function in the GP prior and the choice of acquisition function. Therefore, it would be interesting to develop new acquisition functions that could make use of ideas from sparse GPs, Newton methods and/or proximal point algorithms (Rasmussen and Williams, 2006; Nocedal and Wright, 2006). The main challenge is to construct a rule that keeps exploring the objective function, but still retains the fast convergence that we have illustrated in the examples in Section 4.4.3 and in Papers B and C. Another possible improvement to the algorithm is to remove the bias in the log-likelihood estimate. This could be done by the bias compensation discussed in Example 3.4. Also, it would be interesting to develop online versions of this algorithm, perhaps by using some kind of stochastic approximation scheme. Finally, we think that there are many interesting applications of GP models for estimating the score function and the information matrix of an SSM.
As the score and information matrix are computationally costly to evaluate with good accuracy, perhaps ideas from probabilistic numerics could be helpful. This is an emerging field in machine learning, where GPs are also used to estimate derivatives and integrals as well as to solve ordinary differential equations. For more information, see Hennig (2013), Osborne et al. (2012), Osborne (2010) and Boyle (2007). 5.2.3 Input design in SSMs The main drawback of the input design method proposed in Paper D is that the expected information matrix is computationally costly to evaluate. It is possible to make a perfectly parallel implementation of this method to reduce the computational cost. However, it would be even more interesting to develop the idea from Example 4.10 and make use of GPO for input design. This could be useful since GPO does not require many evaluations of the objective function, which here corresponds to estimates of the expected information matrix. Also, it would be interesting to consider methods that infer the parameters of the model at the same time as the optimal input. This would relax the unrealistic assumption that we need to know the true parameters to be able to construct an optimal input. There are three main approaches for solving this problem. The first is to pose the problem in a robust optimisation setting and only assume that the true parameter is located within some set. By this construction, the resulting optimal input would be an average (in some sense) over this set. The second approach is to construct a sequential algorithm that first infers the parameters and then the optimal input. This is repeated over many iterations by exciting the system with the input constructed in the previous iteration. However, it is difficult to prove that this approach would converge to the true optimal input and the true parameters.
Finally, we could pose this problem in a Bayesian manner and marginalise over the parameters of the model. The resulting optimal input would then be a marginalisation over the parameter posterior, and we would therefore not need to know the true parameters of the system. See Müller et al. (2007), Kuck et al. (2006) and Bubeck and Cesa-Bianchi (2012) for more information. 5.3 Source code and data Source code written in Python and R for recreating most of the examples in Part I is available from the author’s homepages at: http://users.isy.liu.se/en/rt/johda87/ and http://code.johandahlin.com/. Furthermore, source code for recreating some of the numerical illustrations from Papers A, B, C and E is also available from the same homepages. The source code and data are provided under the MIT license with no guaranteed support and no responsibility for its use and function. Bibliography M. Adolfson, S. Laséen, J. Lindé, and M. Villani. RAMSES – a new general equilibrium model for monetary policy analysis. Sveriges Riksbank Economic Review, 2, 2007a. M. Adolfson, S. Laséen, J. Lindé, and M. Villani. Bayesian estimation of an open economy DSGE model with incomplete pass-through. Journal of International Economics, 72(2):481–511, 2007b. M. Adolfson, S. Laséen, L. Christiano, M. Trabandt, and K. Walentin. RAMSES II – Model Description. Occasional Paper Series, 12, 2013. G. Amisano and O. Tristani. Euro area inflation persistence in an estimated nonlinear DSGE model. Journal of Economic Dynamics and Control, 34(10):1837–1858, 2010. S. An and F. Schorfheide. Bayesian analysis of DSGE models. Econometric Reviews, 26(2-4):113–172, 2007. B. D. O. Anderson and J. B. Moore. Optimal filtering. Courier Publications, 2005. C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2):697–725, 2009. C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, 2008. C. Andrieu, A.
Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010. A. L. Beam, S. K. Ghosh, and J. Doyle. Fast Hamiltonian Monte Carlo Using GPU Computing. Pre-print, 2014. arXiv:1402.4089v1. M. A. Beaumont. Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160, 2003. J. O. Berger. Statistical decision theory and Bayesian analysis. Springer, 1985. M. J. Betancourt and M. Girolami. Hamiltonian Monte Carlo for Hierarchical Models. Pre-print, 2013. arXiv:1312.0906v1. C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, USA, 2006. T. Björk. Arbitrage theory in continuous time. Oxford University Press, 2004. F. Black and M. Scholes. The pricing of options and corporate liabilities. The Journal of Political Economy, pages 637–654, 1973. T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307–327, 1986. P. Boyle. Gaussian processes for regression and optimisation. PhD thesis, Victoria University of Wellington, 2007. M. Briers, A. Doucet, and S. Maskell. Smoothing algorithms for state-space models. Annals of the Institute of Statistical Mathematics, 62(1):61–89, 2010. E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Pre-print, 2010. arXiv:1012.2599v1. P. J. Brockwell and R. A. Davis. Introduction to time series and forecasting. Springer, 2002. S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012. P. Bunch and S. Godsill. Improved particle approximations to the joint smoothing distribution using Markov Chain Monte Carlo.
IEEE Transactions on Signal Processing, 61(4):956–963, 2013. D. Burke, A. Ghosh, and W. Heidrich. Bidirectional importance sampling for direct illumination. In Proceedings of the 16th Eurographics Symposium on Rendering Techniques, pages 147–156, Konstanz, Germany, June 2005. O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer, 2005. O. Cappé, S. J. Godsill, and E. Moulines. An overview of existing methods and recent advances in sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, 2007. C. M. Carvalho, M. S. Johannes, H. F. Lopes, and N. G. Polson. Particle learning and smoothing. Statistical Science, 25(1):88–106, 2010. R. Casarin. Bayesian inference for generalised Markov switching stochastic volatility models, 2004. CEREMADE Journal Working Paper 0414. G. Casella and R. L. Berger. Statistical Inference. Duxbury Press, 2 edition, 2001. K. S. Chan and J. Ledolter. Monte Carlo EM estimation for time series models involving counts. Journal of the American Statistical Association, 90(429):242–252, 1995. T. Chen, E. B. Fox, and C. Guestrin. Stochastic Gradient Hamiltonian Monte Carlo. Pre-print, 2014. arXiv:1402.4102v1. S. Chib, F. Nardari, and N. Shephard. Markov chain Monte Carlo methods for stochastic volatility models. Journal of Econometrics, 108(2):281–316, 2002. N. Chopin, P. E. Jacob, and O. Papaspiliopoulos. SMC2: an efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397–426, 2013. R. Cont. Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance, 1:223–236, 2001. J-M. Cornuet, J-M. Marin, A. Mira, and C. P. Robert. Adaptive Multiple Importance Sampling. Pre-print, 2011. arXiv:0907.1254v5. N. Cressie. Statistics for spatial data. Wiley, 1993. D. Crisan and A. Doucet. A survey of convergence results on particle filtering methods for practitioners.
IEEE Transactions on Signal Processing, 50(3):736–746, 2002. J. Dahlin and F. Lindsten. Particle filter-based Gaussian process optimisation for parameter inference. In Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, August 2014. (accepted for publication). J. Dahlin and P. Svenson. A Method for Community Detection in Uncertain Networks. In Proceedings of the 2011 European Intelligence and Security Informatics Conference, Athens, Greece, August 2011. J. Dahlin and P. Svenson. Ensemble approaches for improving community detection methods. Pre-print, 2013. arXiv:1309.0242v1. J. Dahlin, F. Johansson, L. Kaati, C. Mårtensson, and P. Svenson. A Method for Community Detection in Uncertain Networks. In Proceedings of the International Symposium on Foundation of Open Source Intelligence and Security Informatics 2012, Istanbul, Turkey, August 2012a. J. Dahlin, F. Lindsten, T. B. Schön, and A. Wills. Hierarchical Bayesian ARX models for robust inference. In Proceedings of the 16th IFAC Symposium on System Identification (SYSID), Brussels, Belgium, July 2012b. J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis Hastings using Langevin dynamics. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013a. J. Dahlin, F. Lindsten, and T. B. Schön. Inference in Gaussian models with missing data using Equalisation Maximisation. Pre-print, 2013b. arXiv:1308.4601v1. J. Dahlin, F. Lindsten, and T. B. Schön. Second-order particle MCMC for Bayesian parameter inference. In Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, August 2014a. (accepted for publication). J. Dahlin, F. Lindsten, and T. B. Schön. Particle Metropolis-Hastings using gradient and Hessian information. Pre-print, 2014b. arXiv:1311.0686v2. J. Dahlin, T. B. Schön, and M. Villani. Approximate inference in state space models with intractable likelihoods using Gaussian process optimisation.
Technical Report LiTH-ISY-R-3075, Department of Electrical Engineering, Linköping University, Linköping, Sweden, April 2014c. P. Debevec. Rendering synthetic objects into real scenes: bridging traditional and image-based graphics with global illumination and high dynamic range photography. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pages 189–198, Orlando, FL, USA, July 1998. ACM. P. Del Moral. Feynman-Kac Formulae – Genealogical and Interacting Particle Systems with Applications. Probability and its Applications. Springer, 2004. P. Del Moral. Mean field simulation for Monte Carlo integration. CRC Press, 2013. P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006. P. Del Moral, A. Doucet, and S. Singh. Forward smoothing using sequential Monte Carlo. Pre-print, 2010. arXiv:1012.5390v1. M. Del Negro and F. Schorfheide. Priors from General Equilibrium Models for VARs. International Economic Review, 45(2):643–673, 2004. A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977. R. Douc and O. Cappé. Comparison of resampling schemes for particle filtering. In Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (ISPA), pages 64–69, Zagreb, Croatia, September 2005. R. Douc, E. Moulines, and D. S. Stoffer. Nonlinear Time Series: theory, methods and applications with R examples. CRC Press, 2014. A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovsky, editors, The Oxford Handbook of Nonlinear Filtering. Oxford University Press, 2011. A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000. A. Doucet, P.
Jacob, and A. M. Johansen. Discussion on Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):162, 2011. A. Doucet, M. K. Pitt, and R. Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Pre-print, 2012. arXiv:1210.1871. S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987. C. Dubarry and R. Douc. Particle approximation improvement of the joint smoothing distribution with on-the-fly variance estimation. Pre-print, 2011. arXiv:1107.5524v1. E. Ehrlich, A. Jasra, and N. Kantas. Static Parameter Estimation for ABC Approximations of Hidden Markov Models. Pre-print, 2012. arXiv:1210.4683v1. R. F. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica: Journal of the Econometric Society, pages 987–1007, 1982. P. Fearnhead, D. Wyncoll, and J. Tawn. A sequential smoothing algorithm with linear computational cost. Biometrika, 97(2):447–464, 2010. J. Fernández-Villaverde and J. F. Rubio-Ramírez. Estimating macroeconomic models: A likelihood approach. The Review of Economic Studies, 74(4):1059–1087, 2007. R. A. Fisher. Theory of statistical estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 22(05):700–725, 1925. T. Flury and N. Shephard. Bayesian inference based only on simulated likelihood: particle filter analysis of dynamic economic models. Econometric Theory, 27(5):933–956, 2011. K. Fokianos, A. Rahbek, and D. Tjøstheim. Poisson autoregression. Journal of the American Statistical Association, 104(488):1430–1439, 2009. A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian data analysis. Chapman & Hall/CRC, 3 edition, 2013. S. Geman and D. Geman.
Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984. A. Ghosh, A. Doucet, and W. Heidrich. Sequential sampling for dynamic environment map illumination. In Proceedings of the 17th Eurographics conference on Rendering Techniques, pages 115–126, Nicosia, Cyprus, June 2006. P. Giordani and R. Kohn. Adaptive independent Metropolis-Hastings by fast estimation of mixtures of normals. Journal of Computational and Graphical Statistics, 19(2):243–259, 2010. M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):1–37, 2011. P. Glasserman. Monte Carlo methods in financial engineering, volume 53. Springer, 2004. S. J. Godsill, A. Doucet, and M. West. Monte Carlo smoothing for nonlinear time series. Journal of the American Statistical Association, 99(465):156–168, March 2004. N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEEE Proceedings of Radar and Signal Processing, 140(2):107–113, 1993. A. G. Gray. Bringing tractability to generalized n-body problems in statistical and scientific computation. PhD thesis, Carnegie Mellon University, 2003. W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970. E. Hecht. Optics. Pearson, 4 edition, 2013. P. Hennig. Fast probabilistic optimization from noisy gradients. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, June 2013. S. Henriksen, A. Wills, T. B. Schön, and B. Ninness. Parallel implementation of particle MCMC methods on a GPU. In Proceedings of the 16th IFAC Symposium on System Identification (SYSID), Brussels, Belgium, July 2012. J. D. Hol, T. B. Schön, and F. Gustafsson.
On resampling algorithms for particle filters. In Proceedings of the Nonlinear Statistical Signal Processing Workshop, Cambridge, UK, September 2006. J. Hull. Options, Futures, and other Derivatives. Pearson, 7 edition, 2009. J. Hull and A. White. The pricing of options on assets with stochastic volatilities. The Journal of Finance, 42(2):281–300, 1987. D. Hultqvist, J. Roll, F. Svensson, J. Dahlin, and T. B. Schön. Detection and positioning of overtaking vehicles using 1D optical flow. In Proceedings of the IEEE Intelligent Vehicles (IV) Symposium, Dearborn, MI, USA, June 2014. (accepted for publication). E. Jacquier, N. G. Polson, and P. E. Rossi. Bayesian analysis of stochastic volatility models with fat-tails and correlated errors. Journal of Econometrics, 122(1):185–212, 2004. A. H. Jazwinski. Stochastic processes and filtering theory, volume 63. Academic Press, 1970. M. Johannes and N. Polson. MCMC Methods for Continuous-time Financial Econometrics. In Y. Ait-Sahalia and L. Hansen, editors, Handbook of Financial Econometrics, Vol. 1: Tools and Techniques, volume 2, pages 1–72. North-Holland, 2009. D. R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001. D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, 1993. D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998. T. Kailath, A. H. Sayed, and B. Hassibi. Linear Estimation. Prentice Hall, Upper Saddle River, NJ, USA, 2000. N. Kantas, A. Doucet, S. S. Singh, and J. M. Maciejowski. An overview of sequential Monte Carlo methods for parameter estimation in general state-space models. In IFAC Symposium on System Identification (SYSID), Saint-Malo, France, July 2009. M. J. Keeling and P. Rohani.
Modeling infectious diseases in humans and animals. Princeton University Press, 2008. S. Kim, N. Shephard, and S. Chib. Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies, 65(3):361–393, 1998. G. Kitagawa and S. Sato. Monte Carlo smoothing and self-organising state-space model. In A. Doucet, N. de Freitas, and N. Gordon, editors, Sequential Monte Carlo methods in practice, pages 177–195. Springer, 2001. M. Klaas, M. Briers, N. de Freitas, A. Doucet, S. Maskell, and D. Lang. Fast particle smoothing: if I had a million particles. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, USA, June 2006. P. E. Kloeden and E. Platen. Numerical solution of stochastic differential equations, volume 23. Springer, 4 edition, 1992. J. Kronander and T. B. Schön. Robust auxiliary particle filters using multiple importance sampling. In Proceedings of the 2014 IEEE Statistical Signal Processing Workshop (SSP), Gold Coast, Australia, July 2014. (accepted for publication). J. Kronander, J. Dahlin, D. Jönsson, M. Kok, T. B. Schön, and J. Unger. Real-time Video Based Lighting Using GPU Raytracing. In Proceedings of the 2014 European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, September 2014a. (submitted, pending review). J. Kronander, T. B. Schön, and J. Dahlin. Backward sequential Monte Carlo for marginal smoothing. In Proceedings of the 2014 IEEE Statistical Signal Processing Workshop (SSP), Gold Coast, Australia, July 2014b. (accepted for publication). H. Kuck, N. de Freitas, and A. Doucet. SMC samplers for Bayesian optimal nonlinear design. In Proceedings of the 2006 Nonlinear Statistical Signal Processing Workshop, pages 99–102, Cambridge, UK, September 2006. H. J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964. R. Langrock.
Some applications of nonlinear and non-Gaussian state-space modelling by means of hidden Markov models. Journal of Applied Statistics, 38(12):2955–2970, 2011. R. Langrock and W. Zucchini. Hidden Markov models with arbitrary state dwell-time distributions. Computational Statistics & Data Analysis, 55(1):715–724, 2011. E. L. Lehmann and G. Casella. Theory of point estimation. Springer, 1998. F. Lindsten. An efficient stochastic approximation EM algorithm using conditional particle filters. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013. F. Lindsten and T. B. Schön. Backward simulation methods for Monte Carlo statistical inference. In Foundations and Trends in Machine Learning, volume 6, pages 1–143, August 2013. J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2008. S. Livingstone and M. Girolami. Information-geometric Markov Chain Monte Carlo methods using Diffusions. Pre-print, 2014. arXiv:1403.7957v1. D. J. Lizotte. Practical Bayesian optimization. PhD thesis, University of Alberta, 2008. L. Ljung. System identification: theory for the user. Prentice Hall, 1999. T. A. Louis. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 44(02):226–233, 1982. A. Marshall. The use of multi-stage sampling schemes in Monte Carlo simulations. In M. Meyer, editor, Symposium on Monte Carlo Methods, pages 123–140. Wiley, 1956. G. Matheron. Principles of geostatistics. Economic Geology, 58(8):1246–1266, 1963. G. J. McLachlan and T. Krishnan. The EM algorithm and extensions. Wiley-Interscience, second edition, 2008. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953. S. P. Meyn and R. L. Tweedie.
Markov chains and stochastic stability. Cambridge University Press, 2009. T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th conference on Uncertainty in Artificial Intelligence, pages 362–369, Seattle, WA, USA, August 2001. S. Mitra. A review of volatility and option pricing. International Journal of Financial Markets and Derivatives, 2(3):149–179, 2011. J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. In L. C. W. Dixon and G. P. Szego, editors, Toward Global Optimization, pages 117–129. North-Holland, 1978. P. Müller, D. A. Berry, A. P. Grieve, M. Smith, and M. Krams. Simulation-based sequential Bayesian design. Journal of Statistical Planning and Inference, 137(10):3140–3150, 2007. K. P. Murphy. Machine learning: a probabilistic perspective. The MIT Press, 2012. L. Murray. GPU acceleration of the particle filter: The Metropolis resampler. Pre-print, 2012. arXiv:1202.6163v1. C. A. Naesseth. Nowcasting using Microblog Data. Bachelor thesis, Linköping University, September 2012. LiTH-ISY-EX-ET-12/0398. R. M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. Jones, and X-L. Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Press, June 2010. W. Neiswanger, C. Wang, and E. Xing. Asymptotically Exact, Embarrassingly Parallel MCMC. Pre-print, 2013. arXiv:1311.4780v2. B. Ninness and S. Henriksen. Bayesian system identification via Markov chain Monte Carlo techniques. Automatica, 46(1):40–51, 2010. J. Nocedal and S. Wright. Numerical Optimization. Springer, 2 edition, 2006. J. Nolan. Stable distributions: models for heavy-tailed data. Birkhauser, 2003. B. Øksendal. Stochastic differential equations. Springer, 6 edition, 2010. J. Olsson, O. Cappé, R. Douc, and E. Moulines. Sequential Monte Carlo smoothing with application to parameter estimation in nonlinear state space models. Bernoulli, 14(1):155–179, 2008. M. Osborne.
Bayesian Gaussian Processes for Sequential Prediction, Optimisation and Quadrature. PhD thesis, University of Oxford, 2010. M. A. Osborne, R. Garnett, S. J. Roberts, C. Hart, S. Aigrain, and N. Gibson. Bayesian quadrature for ratios. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 832–840, La Palma, Canary Islands, Spain, April 2012. J. Owen, D. J. Wilkinson, and C. S. Gillespie. Scalable Inference for Markov Processes with Intractable Likelihoods. Pre-print, 2014. arXiv:1403.6886v1. G. W. Peters, G. R. Hosack, and K. R. Hayes. Ecological non-linear state space model selection via adaptive particle Markov chain Monte Carlo. Pre-print, 2010. arXiv:1005.2238v1. M. Pharr and G. Humphreys. Physically based rendering: From theory to implementation. Morgan Kaufmann, 2010. M. K. Pitt. Smooth Particle Filters for Likelihood Evaluation and Maximisation. Technical Report 651, Department of Economics, University of Warwick, Coventry, UK, July 2002. Warwick economic research papers. M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599, 1999. M. K. Pitt, R. S. Silva, P. Giordani, and R. Kohn. Auxiliary particle filtering within adaptive Metropolis-Hastings sampling. Pre-print, 2010. arXiv:1006.1914v1. M. K. Pitt, R. S. Silva, P. Giordani, and R. Kohn. On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics, 171(2):134–151, 2012. G. Poyiadjis, A. Doucet, and S. S. Singh. Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika, 98(1):65–80, 2011. C. R. Rao. Linear Statistical Inference and Its Applications. Wiley, 1965. C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. H. E. Rauch, F. Tung, and C. T. Striebel.
Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1445–1450, August 1965. C. P. Robert. The Bayesian choice. Springer, 2007. C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2 edition, 2004. G. O. Roberts and J. S. Rosenthal. Optimal Scaling of Discrete Approximations to Langevin Diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998. G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120, 1997. S. M. Ross. Simulation. Academic Press, 5 edition, 2012. H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009. T. B. Schön, A. Wills, and B. Ninness. System identification of nonlinear state-space models. Automatica, 47(1):39–49, 2011. C. Sherlock, A. H. Thiery, G. O. Roberts, and J. S. Rosenthal. On the efficiency of pseudo-marginal random walk Metropolis algorithms. Pre-print, 2013. arXiv:1309.7209v1. R. H. Shumway and D. S. Stoffer. Time series analysis and its applications. Springer, 3 edition, 2010. J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems 25 (NIPS 2012), pages 2951–2959. Curran Associates, Inc., 2012. J. C. Spall. A stochastic approximation technique for generating maximum likelihood parameter estimates. In American Control Conference, pages 1161–1167, Minneapolis, MN, USA, June 1987. R. Srikanthan and T. A. McMahon. Stochastic generation of annual, monthly and daily climate data: A review. Hydrology and Earth System Sciences, 5(4):653–670, 1999. E. Taghavi, F. Lindsten, L. Svensson, and T. B. Schön.
Adaptive stopping for fast particle smoothing. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013. L. Tierney. Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4):1701–1728, 1994. R. S. Tsay. Analysis of financial time series. John Wiley & Sons, 2 edition, 2005. J. Unger, J. Kronander, P. Larsson, S. Gustavson, J. Löw, and A. Ynnerman. Spatially varying image based lighting using HDR-video. Computers & Graphics, 37(7):923–934, 2013. P. E. Valenzuela, C. R. Rojas, and H. Hjalmarsson. Optimal input design for dynamic systems: a graph theory approach. In Proceedings of the IEEE Conference on Decision and Control (CDC), Florence, Italy, December 2013. P. E. Valenzuela, J. Dahlin, C. R. Rojas, and T. B. Schön. A graph/particle-based method for experiment design in nonlinear systems. In Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, August 2014. (accepted for publication). E. Veach and L. J. Guibas. Optimally combining sampling techniques for Monte Carlo rendering. In Proceedings of the 22nd Annual Conference on Computer Graphics, pages 419–428, Los Angeles, CA, USA, August 1995. ACM. E. Veach and L. J. Guibas. Metropolis light transport. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pages 65–76, 1997. D. J. Wilkinson. Stochastic modelling for systems biology. CRC Press, 2 edition, 2011. D. A. Woolhiser. Modeling daily precipitation – progress and problems. In Statistics in the Environmental and Earth Sciences 5, pages 71–89. Halsted Press, New York, 1992. S. Yildirim, S. S. Singh, T. Dean, and A. Jasra. Parameter Estimation in Hidden Markov Models with Intractable Likelihoods Using Sequential Monte Carlo. Pre-print, 2013. arXiv:1311.4117v1. S. L. Zeger. A regression model for time series of counts. Biometrika, 75(4):621–629, 1988.
Part II Publications The articles associated with this thesis have been removed for copyright reasons. For more details about these, see: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-106752
