# Analytical Approximations for Bayesian Inference

Tohid Ardeshiri

Linköping studies in science and technology. Dissertations. No. 1710

**Analytical Approximations for Bayesian Inference**

Tohid Ardeshiri
[email protected]
www.control.isy.liu.se
Division of Automatic Control
Department of Electrical Engineering
Linköping University
SE–581 83 Linköping, Sweden

ISBN 978-91-7685-930-8, ISSN 0345-7524

Copyright © 2015 Tohid Ardeshiri

Printed by LiU-Tryck, Linköping, Sweden 2015

*To Adrian*

# Abstract

Bayesian inference is a statistical inference technique in which Bayes' theorem is used to update the probability distribution of a random variable using observations. Except for a few simple cases, expressing such probability distributions using compact analytical expressions is infeasible. Approximation methods are required to express the a priori knowledge about a random variable in the form of prior distributions. Further approximations are needed to compute posterior distributions of the random variables using the observations. When the computational complexity of the representation of such posteriors increases over time, as in mixture models, approximations are required to reduce the complexity of the representations.

This thesis extends existing approximation methods for Bayesian inference and generalizes them in three respects: prior selection, posterior evaluation given the observations, and maintenance of computational complexity. In particular, the maximum entropy properties of the first-order stable spline kernel for identification of linear time-invariant stable and causal systems are shown. Analytical approximations are used to express the prior knowledge about the properties of the impulse response of a linear time-invariant stable and causal system. The variational Bayes (VB) method is used to compute an approximate posterior in two inference problems.
In the first problem, an approximate posterior for the state smoothing problem for linear state-space models with unknown and time-varying noise covariances is proposed. In the second problem, the VB method is used for approximate inference in state-space models with skewed measurement noise. Moreover, a novel approximation method for Bayesian inference is proposed, based on a Taylor series approximation of the logarithm of the likelihood function. The proposed approximation is devised for the case where the prior distribution belongs to the exponential family of distributions. Finally, two contributions are dedicated to the mixture reduction (MR) problem. The first contribution generalizes the existing MR algorithms for Gaussian mixtures to the exponential family of distributions and compares them in an extended target tracking scenario. The second contribution proposes a new Gaussian mixture reduction algorithm which minimizes the reverse Kullback-Leibler divergence and has specific peak-preserving properties.

# Popular Science Summary (Populärvetenskaplig sammanfattning)

Bayes' theorem is a fundamental tool in statistics that can be used to refine prior knowledge about a variable with the help of observations. The prior knowledge is called the prior and is described mathematically as a probability function for the unknown variable, while the observation is described by a so-called likelihood function. Bayes' theorem states that the normalized product of these describes the so-called posterior, i.e., the distribution of the variable on which the estimate is to be based. The core problem of the thesis is that in most cases this function is not analytical, i.e., it cannot be written as a mathematical expression, and it must be approximated in one way or another. A number of efficient methods are presented in this work. Automotive safety is an illustrative application studied by the author.
Suppose the software in a camera has detected a bicycle in front of the car in which the camera is mounted, and that the software can also locate the positions of the two wheels in the image. The wheels appear as ellipses in the image, and with the help of the shapes of the ellipses and Bayes' theorem, the information about the bicycle's position can be refined; in addition, it can be estimated how the bicycle is about to change direction, e.g., whether it is about to swerve out in front of the car. Bayes' theorem can then be used once more to predict where the bicycle will be when the next camera frame is captured. These two steps, prediction and estimation, are the basic components of nonlinear filters, which are a focus area of this thesis. The concept of modeling the bicycle not merely as a point, but as a structure with two wheels, goes under the name of extended targets, and this area has motivated many of the results in the thesis.

When Bayes' theorem is applied, there are a number of combinations of prior and likelihood that actually yield an analytical posterior. Such a prior, matched to a given likelihood, is called a conjugate prior. One proposed method is to approximate the likelihood so that the available prior becomes conjugate. With this trick the posterior becomes analytical, and moreover of the same form as the prior. The latter is important when the operation is to be repeated many times, as, for example, in a nonlinear filter. Another example where the posterior has an analytical expression is when both the prior and the likelihood are weighted sums of normal distributions. Such distributions are very flexible and can approximate any distribution arbitrarily well. The posterior then also becomes a weighted sum of normal distributions. The problem is that it acquires more and more components every time Bayes' theorem is applied. One of the contributions of the thesis develops concrete algorithms for limiting the number of components through smart approximations. More generally, the posterior can always be approximated within a given class of functions, and here the Kullback-Leibler divergence is studied as the measure to optimize over.
This is used to estimate impulse responses of dynamical systems. A method used in several contributions is variational Bayes (VB). Here, VB is used to find a product form of the posterior over two subsets of the variables to be estimated, which according to VB can be done stepwise with large savings in computational complexity.

# Acknowledgments

On the 6th of November 1632, the Swedish king Gustav II Adolf, the founder of the Swedish Empire (1611–1721), was killed in the battle of Lützen in modern-day Germany. Gustav II Adolf was an extremely able commander (rather than an obedient soldier), was nearsighted, and had a prominent nose. It is claimed by historians that in the thick mix of gun smoke and fog covering the battlefield, he was separated from his fellow riders and killed by several shots. It is indeed just a coincidence that this thesis will be defended on the very same day 383 years later. Even the fact that I have well-known aspirations for becoming the king of Sweden does not worry me. Furthermore, I am not concerned about the fact that my opponent Dr. Wolfgang Koch comes from a German defense institution which is situated only 500 km away from Lützen. Let us stay objective when forming prior beliefs. Furthermore, separation from my fellow riders is not expected to happen, since I have written this thesis to enable me and my fellow riders to see through the fog of noise and smoke of disturbances using Bayes' rule. I would never have been able to write such a dissertation without the support of my fellow riders. Here, I want to acknowledge their contributions to this thesis.

131 years after that fateful day, Reverend Thomas Bayes wrote the article "An Essay towards solving a Problem in the Doctrine of Chances", for which I am grateful. It took another 246 years until Lennart Ljung admitted me to the group and gave me the opportunity to find my supervisor and my research subject. Thank you Lennart.
I want to thank my supervisor Fredrik Gustafsson for his engagement in the beginning and the end of my PhD studies and his patience along the way. I want to thank the head of the Division of Automatic Control, Svante Gunnarsson, for giving me the space to maneuver and making exceptions to traditions whenever I wished for them. Umut Orguner possesses every quality one would desire in a supervisor. I offer my sincerest gratitude to Umut Orguner for sharing his vast knowledge, his patience, and his generosity. Thank you Emre Özkan and Fredrik Gustafsson for proofreading this thesis.

During my graduate studies I had the opportunity to collaborate with many talented researchers. It was a pleasure to co-author papers with Jonas Sjöberg, Jonas Bärgman, Mathias Lidberg, Robert Thomson, Anders Hansson, Johan Löfberg, Mikael Norrlöf, Fredrik Larsson, Michael Felsberg, Fredrik Gustafsson, Thomas B. Schön, Christian Lundquist, Umut Orguner, Emre Özkan, Karl Granström, Tianshi Chen, Henri Nurminen, Robert Piché, Francesca P. Carli, Alessandro Chiuso, Lennart Ljung, and Gianluigi Pillonetto. Not all the research work resulted in publications; my conversations with Fredrik Lindsten, Saikat Saha, Michael Roth, Daniel Axehill, Martin Enqvist, Carsten Fritsche, Liam Ellis, and Claudio Altafini have been enlightening. I am very excited about my ongoing collaborations with Rafael Rui, Michael Roth, Henri Nurminen, and Mikhail Lifshits. Furthermore, I learned a lot from Anders Hansson and Torkel Glad through their graduate courses, for which I am grateful.

I would like to thank the division's secretaries Ulla Salaneck, Åsa Karmelind and Ninna Stensgård. Peter Rosander, with whom I shared my office for three years, is the kindest and most considerate Swedish man I have met so far. I want to thank Michael Roth and Kora Neupert for being such wonderful friends. I want to thank Henrik Ohlsson for helping me get on my board so many times on those cold windy days.
I want to extend my gratitude to George Mathai, Niklas Wahlström, Alicia Pamela Tonnoli, Bram Dil, Ylva Jung, Fredrik Lindsten, Patrik Axelsson, Manon Kok, Martin Lindfors, Gustav Lindmark, Christian Andersson Naesseth, Hanna Nyqvist, Jonas Linder, Karl Granström, André Carvalho Bittencourt and Christian Lundquist for all the good times, be it playing board games, skiing in the Alps, or kitesurfing on the Swedish west coast. I also want to thank Rikard Falkeborn, Daniel Petersson, Martin Skoglund and Christian Lyzell for patiently helping me with software-related questions, and especially Henrik Tidefelt for his contributions to the LaTeX template used for this thesis. I offer my sincerest gratitude to the hardworking people of Sweden, who gave me the possibility to pursue my PhD studies without worrying about funding and gave me the opportunity to have paid paternity leave. I wish to acknowledge the financial support from the frame project Extended Target Tracking and the project Scalable Kalman Filters, both funded by the Swedish Research Council. I want to thank my teachers at Sharif University of Technology for being exceptionally dedicated teachers, especially Mohammad Durali and Ali Meghdari. I want to thank Azadeh, Nazanin, Navid, Kavous, Hamed, Shahin, Shahla, Hossein and Hooshang for their support during my undergraduate studies.

Adrian! Your very existence, beautiful smiles, big hugs and kisses have been my shelter during these years. I love you more than "a thousand million and nine hundreds ninety one". Last but not least, I want to thank my wonderful parents, who taught me by example many things, above all how to endure during difficult times.

Linköping, October 2015
Tohid Ardeshiri

# Contents

Notation

I Background

1 Introduction
  1.1 Approximate Bayesian Inference
  1.2 Contributions
  1.3 Publications
  1.4 Applications
    1.4.1 Bicycle tracking using ellipse extraction
    1.4.2 Positioning using ultra wide-band data
    1.4.3 Path tracking for robots
  1.5 Thesis outline
    1.5.1 Outline of Part I
    1.5.2 Outline of Part II

2 Entropy, Exponential Family, and Variational Bayes
  2.1 Entropy
    2.1.1 Maximum entropy prior distributions
  2.2 Exponential Family
  2.3 Variational Bayes

3 System Identification
  3.1 Impulse Response Identification
  3.2 Continuous-time impulse response
  3.3 Discrete-time impulse response
  3.4 Maximum Entropy Kernel

4 Mixture Reduction
  4.1 Mixture Reduction
  4.2 Mixture Reduction for Target Tracking
  4.3 Greedy mixture reduction
  4.4 Divergence measures
    4.4.1 Integral square error
    4.4.2 Kullback-Leibler Divergence
    4.4.3 α-Divergences
  4.5 Numerical comparison of mixture reduction algorithms

5 Concluding remarks

A Expressions for some members of exponential family
B Multiple hypothesis testing
C Implementation aspects of the ISE approach
Bibliography

II Publications

A Maximum entropy properties of discrete-time first-order stable spline kernel
  1 Introduction
  2 MaxEnt property of Wiener and SS-1 kernels
    2.1 DT Wiener process
    2.2 The first order SS kernel
  3 Special structure of Wiener and SS-1 kernels and their MaxEnt interpretation
    3.1 MaxEnt covariance completion
  4 Conclusion
  Bibliography

B Approximate Bayesian Smoothing with Unknown Process and Measurement Noise Covariances
  1 Introduction
  2 Problem Definition
  3 Variational Solution
  4 Simulations
    4.1 Unknown time-varying noise covariances
    4.2 Unknown time-invariant noise covariances
  5 Discussion and Conclusion
  A Derivations for the smoother
    A.1 Derivations for the approximate posterior q_x^(i+1)(·)
    A.2 Derivations for the approximate posterior q_Q^(i+1)(·)
    A.3 Derivations for the approximate posterior q_R^(i+1)(·)
    A.4 Calculation of the expected values
  B Comparison with Expectation Maximization
  Bibliography

C Robust Inference for State-Space Models with Skewed Measurement Noise
  1 Introduction
  2 Skew t-distribution
  3 Problem formulation
  4 Variational solution
  5 Simulations
    5.1 One-dimensional positioning
    5.2 Pseudorange positioning
  6 Conclusions
  A Derivations for the smoother
    A.1 Derivations for q_x
    A.2 Derivations for q_u
    A.3 Derivations for q_Λ
  B Derivations for the filter
    B.1 Derivations for q_x
    B.2 Derivations for q_u
    B.3 Derivations for q_Λ
  Bibliography

D Bayesian Inference via Approximation of Log-likelihood for Priors in Exponential Family
  1 Introduction
  2 The Exponential Family
    2.1 Conjugate Priors in Exponential Family
    2.2 Conjugate Likelihoods in Exponential Family
  3 Measurement Update via Approximation of the Log-likelihood
    3.1 Taylor series expansion
    3.2 The extended Kalman filter
    3.3 A general linearization guideline
  4 Extended Target Tracking
    4.1 The problem formulation
    4.2 Solution proposed by Feldmann et al. (Feldmann et al., 2011)
    4.3 ETT via log-likelihood linearization
  5 Numerical simulation
    5.1 Monte-Carlo simulations
    5.2 Single extended target tracking scenario
  6 Conclusion
  7 Acknowledgments
  A Proof of Lemma 13
  B First Order Taylor Series Approximations for Some Scalar Valued Functions of Matrix Variables
  C Proof of EKF derivation in Example 9
  Bibliography

E Greedy Reduction Algorithms for Mixtures of Exponential Family
  1 Introduction
  2 Background
    2.1 Mixtures and Their Reduction
    2.2 Exponential Family of Distributions
  3 Merging algorithm
  4 General Mixture Reduction Algorithms
    4.1 Global Approaches
    4.2 Local Approach
  5 Numerical simulations
    5.1 Example-I
    5.2 Example-II
  6 Conclusion
  A Proof of Theorem 1
  Bibliography

F Gaussian Mixture Reduction Using Reverse Kullback-Leibler Divergence
  1 Introduction
  2 Background
  3 Related work
    3.1 Runnalls' Method
    3.2 Williams' Method
    3.3 Discussion
  4 Proposed Method
  5 Approximations for RKLD
    5.1 Approximations for pruning hypotheses
    5.2 Approximations for merging hypotheses
  6 Simulation Results
    6.1 Example with real world data
    6.2 Robust clustering
  7 Conclusion
  A Proof of Lemma 5
  B Derivation of V(q_K, q_I, q_J)
  Bibliography
# Notation

## Abbreviations

| Abbreviation | Meaning |
| --- | --- |
| ARMSE | Average Root Mean Square Error |
| ETT | Extended Target Tracking |
| EKF | Extended Kalman Filter |
| EM | Expectation Maximization |
| GLM | Generalized Linear Models |
| GM | Gaussian Mixture |
| GM-PHD | Gaussian Mixture Probability Hypothesis Density |
| GP | Gaussian Process |
| GPS | Global Positioning System |
| iid | independent and identically distributed |
| INLA | Integrated Nested Laplace Approximation |
| KF | Kalman Filter |
| KLD | Kullback-Leibler Divergence |
| LTI | Linear Time-Invariant |
| MAP | Maximum A Posteriori |
| MC | Monte-Carlo |
| MR | Mixture Reduction |
| MRA | Mixture Reduction Algorithm |
| MTT | Multiple Target Tracking |
| PDF | Probability Density Function |
| PHD | Probability Hypothesis Density |
| RKLD | Reverse Kullback-Leibler Divergence |
| RMSE | Root Mean Square Error |
| RTS | Rauch-Tung-Striebel |
| SS-1 | First Order Stable Spline |
| STVBF | Skew-t Variational Bayes Filter |
| STVBS | Skew-t Variational Bayes Smoother |
| VB | Variational Bayes |

## Some sets

| Notation | Meaning |
| --- | --- |
| N | Set of natural numbers |
| R | Set of real numbers |
| S_{++}^d | Set of d × d symmetric positive definite matrices |

## Probability

| Notation | Meaning |
| --- | --- |
| E | Expectation |
| V | Variance |
| D_KL(·‖·) | Kullback-Leibler divergence |
| H | Differential entropy |
| H̄ | Differential entropy rate |
| ∼ | Distributed according to or sampled from |

## Common distributions

| Notation | Meaning |
| --- | --- |
| N(µ, Σ) | Multivariate Gaussian with mean µ and covariance Σ |
| U(a, b) | Uniform over the interval [a, b] |
| Exp(x; λ) | Exponential with rate λ |
| Weibull(λ, k) | Weibull with scale λ and shape k |
| Laplace(µ, b) | Laplace with location µ and scale b |
| Rayleigh(σ) | Rayleigh with scale σ |
| log-N(µ, σ) | Log-normal with location µ and scale σ |
| Gamma(α, β) | Gamma with shape α and rate β |
| IGamma(α, β) | Inverse gamma with shape α and rate β |
| W_d(n, V) | Wishart with degrees of freedom n and scale matrix V ∈ S_{++}^d |
| IW_d(ν, Ψ) | Inverse Wishart with degrees of freedom ν and scale matrix Ψ ∈ S_{++}^d |
| t(µ, σ², ν) | Student's t-distribution with location parameter µ, spread parameter σ, and degrees of freedom ν |
| T(·; 0, 1, ν) | Cumulative distribution function (CDF) of Student's t-distribution with degrees of freedom ν |
| ST(z; µ, σ², δ, ν) | Skew t-distribution with location parameter µ, spread parameter σ, shape parameter δ, and degrees of freedom ν |
| N₊(µ, Σ) | Truncated multivariate Gaussian with the closed positive orthant as support, location parameter µ, and squared-scale matrix Σ |

## Operators and symbols

| Notation | Meaning |
| --- | --- |
| tr(A) | Trace of matrix A |
| det(A), \|A\| | Determinant of matrix A |
| Aᵀ | Transpose of matrix A |
| vec(A) | Vectorized matrix A |
| Diag(·) | Diagonal matrix whose diagonal elements are the arguments of the operator |
| x_{m:n} | Sequence (x_m, x_{m+1}, …, x_n) |
| I_d | d-dimensional identity matrix |
| H | Hypothesis |
| argmin_λ | Minimizing argument with respect to λ |
| argmax_λ | Maximizing argument with respect to λ |

# Part I: Background

# 1 Introduction

This chapter introduces the research area considered in this thesis and summarizes the contributions that constitute it. In Section 1.1, an introduction to approximate Bayesian inference is given. In Section 1.2, the main contributions are summarized. In Section 1.3, the publications by the PhD candidate are listed. In Section 1.4, three applied research results produced by the PhD candidate are presented. In Section 1.5, the outline of the thesis is given.

## 1.1 Approximate Bayesian Inference

Bayesian inference is a statistical inference technique in which Bayes' theorem is used to update the probability distribution of a random latent variable using observations. This technique provides a mathematical tool for modeling systems in which the uncertainties of the model, as well as of the system, are reflected by probability distributions. The probabilistic models, constructed from probability distributions that describe our knowledge about the system, are manipulated using the rules of probability calculus.
Probabilistic models describe the relation between the random latent variables, the deterministic parameters, and the measurements. Such relations are specified by the prior distributions of the latent variables, p(x), and the likelihood function p(y|x), which gives a probabilistic description of the measurements given (some of) the latent variables. Using the probabilistic model and the measurements, the exact posterior can be expressed in functional form using Bayes' rule:

$$
p(x|y) = \frac{p(x)\,p(y|x)}{\int p(x)\,p(y|x)\,\mathrm{d}x}. \tag{1.1}
$$

The prior knowledge about the latent variables and the parameters is expressed via prior distributions. Ideally, the prior distribution should express this prior knowledge about the latent variables without any extra assumptions. The maximum entropy method (Jaynes, 1982) provides a tool to express the prior knowledge in the form of prior distributions without further assumptions. Describing the prior knowledge about a random variable using compact analytical expressions is not always feasible; in such cases, approximation methods are required. One of the contributions in this thesis concerns such approximations.

Determination of the posterior distribution of a latent variable x given the measurements (observed data) y is at the core of Bayesian inference using probabilistic models. The exact posterior distribution can be analytical. A subclass of cases where the posterior is analytical is when the posterior belongs to the same family of distributions as the prior distribution; in such cases the prior distribution is called a conjugate prior for the likelihood function. A well-known example where an analytical posterior is obtained using conjugate priors is when the latent variable has a normal prior and the likelihood function, given the latent variable as its mean, is again normal.

**Example 1.1**
Let x have a normal prior distribution with mean µ and covariance Σ, i.e., x ∼ N(x; µ, Σ).
A measurement y with the likelihood function p(y|x) = N(y; Hx, R) is at hand, where H is a matrix of proper dimensions and R is a covariance matrix. The posterior distribution of x can be obtained using Bayes' rule given in (1.1):

$$
p(x|y) = \frac{p(x)\,p(y|x)}{\int p(x)\,p(y|x)\,\mathrm{d}x} \tag{1.2}
$$
$$
\phantom{p(x|y)} = \frac{N(x; \mu, \Sigma)\,N(y; Hx, R)}{\int N(x; \mu, \Sigma)\,N(y; Hx, R)\,\mathrm{d}x}. \tag{1.3}
$$

The posterior distribution p(x|y) has an analytical solution and turns out to be the normal distribution N(x; µ′, Σ′), whose parameters can be computed via the closed-form expressions

$$
\mu' = \mu + K(y - H\mu), \tag{1.4a}
$$
$$
\Sigma' = \Sigma - KH\Sigma, \tag{1.4b}
$$
where
$$
K = \Sigma H^T (H \Sigma H^T + R)^{-1}. \tag{1.5}
$$

The exact posterior distribution of a latent variable cannot always be given a compact analytical expression. In the following, three examples of such cases are given. In Example 1.2, a problem encountered in nonlinear filtering is presented.

**Example 1.2**
Let x have a normal prior distribution p(x) = N(x; µ, Σ). A measurement y with the likelihood function p(y|x) = N(y; h(x), R) is at hand, where h(·) is a vector-valued nonlinear function and R is a covariance matrix. The posterior distribution of x can be expressed using Bayes' rule given in (1.1). However, the posterior distribution does not necessarily have a compact analytical solution. A remedy can be obtained by approximating the likelihood function via linearization of h(·) around the prior mean µ, as in

$$
h(x) \approx h(\mu) + \widehat{H}(x - \mu), \tag{1.6}
$$
where
$$
\widehat{H} \triangleq \nabla_x h(x)\big|_{x=\mu}. \tag{1.7}
$$

Using the approximate likelihood p(y|x) = N(y; Ĥx, R), the approximate posterior distribution can be computed using the analytical expressions given in Example 1.1.

One of the contributions in this thesis concerns the problem and the solution proposed in Example 1.2. In Example 1.3, a problem encountered in simulating a Markov chain with a multi-modal transition density is presented.
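The closed-form update (1.4)–(1.5) of Example 1.1 can be checked numerically. The following is a minimal sketch in NumPy; all numerical values are illustrative, not taken from the thesis.

```python
import numpy as np

# Prior x ~ N(mu, Sigma), likelihood y | x ~ N(Hx, R), cf. Example 1.1.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
H = np.array([[1.0, 0.0]])   # observe the first state component
R = np.array([[0.5]])
y = np.array([0.9])

# Gain K = Sigma H^T (H Sigma H^T + R)^{-1}, Eq. (1.5)
S = H @ Sigma @ H.T + R
K = Sigma @ H.T @ np.linalg.inv(S)

# Posterior parameters, Eqs. (1.4a)-(1.4b)
mu_post = mu + K @ (y - H @ mu)
Sigma_post = Sigma - K @ H @ Sigma

print(mu_post)      # posterior mean, pulled toward the measurement
print(Sigma_post)   # posterior covariance, smaller than the prior
```

For the nonlinear case of Example 1.2, one would replace H by the Jacobian Ĥ of h(·) evaluated at the prior mean, as in (1.6)–(1.7).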
**Example 1.3**
Consider a Markov chain with the transition density

$$
p(x_{k+1}|x_k) = w\,N(x_{k+1}; A x_k, Q) + (1-w)\,N(x_{k+1}; \bar{A} x_k, Q), \tag{1.8}
$$

where 0 < w < 1 and A and Ā are two square matrices. We are interested in the marginal distribution of x₁₀₀₀, where x₁ has the distribution p(x₁) = N(x₁; µ₁, Σ₁). The marginal distribution of x_k can be obtained recursively by integration, as in

$$
p(x_k) = \int p(x_k|x_{k-1})\,p(x_{k-1})\,\mathrm{d}x_{k-1}. \tag{1.9}
$$

The first step of the recursion is computed here:

$$
p(x_2) = \int \big[ w\,N(x_2; A x_1, Q) + (1-w)\,N(x_2; \bar{A} x_1, Q) \big]\, N(x_1; \mu_1, \Sigma_1)\, \mathrm{d}x_1 \tag{1.10}
$$
$$
\phantom{p(x_2)} = w\,N(x_2; A\mu_1, Q + A\Sigma_1 A^T) + (1-w)\,N(x_2; \bar{A}\mu_1, Q + \bar{A}\Sigma_1 \bar{A}^T). \tag{1.11}
$$

Although the marginal density of x₂ can be computed analytically, the complexity of p(x₂) has increased compared to p(x₁) due to the increase in the number of Gaussian components needed to express it. The number of Gaussian components needed to express the marginal density of x_k grows exponentially with k and is 2^(k−1). In order to keep the computational complexity of the marginal distributions of x_k at a tractable level, the number of components needs to be reduced by approximating the true marginal density of x_k with another distribution with fewer components. A candidate solution is to minimize a statistical distance between the true density of x_k and its approximation. Two of the contributions in this thesis concern the problem described in Example 1.3.

Approximate Bayesian inference is particularly important when the measurements appear sequentially in time, as in the filtering task for a stochastic dynamical system, whose probabilistic graphical model is presented in Figure 1.1. In Example 1.4, the Bayesian filtering recursion is introduced and the need for approximations is highlighted.

*Figure 1.1: A probabilistic graphical model for a stochastic dynamical system with latent states x_k and measurements y_k.*
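The exponential growth in Example 1.3, and the moment-matching remedy, can be sketched numerically. The scalar case is used so that A, Ā, Q, and Σ₁ are numbers; all values are illustrative, and the single-Gaussian collapse shown at the end is the standard moment-matching result (the minimizer of the forward Kullback-Leibler divergence over Gaussian approximations), not a specific algorithm from the thesis.

```python
import numpy as np

# Scalar instance of Example 1.3; each component splits in two per step,
# cf. Eq. (1.11), so the count is 2^(k-1).
w, A, Abar, Q = 0.6, 0.9, -0.5, 0.1
components = [(1.0, 0.0, 1.0)]        # p(x1): list of (weight, mean, variance)

for k in range(4):                    # four applications of Eq. (1.9)
    new_components = []
    for wi, mi, vi in components:
        new_components.append((wi * w,       A * mi,    Q + A * vi * A))
        new_components.append((wi * (1 - w), Abar * mi, Q + Abar * vi * Abar))
    components = new_components
    print(f"k = {k + 2}: {len(components)} Gaussian components")

# One way to keep the complexity bounded: collapse the mixture back to a
# single Gaussian by matching its first two moments.
ws = np.array([c[0] for c in components])
ms = np.array([c[1] for c in components])
vs = np.array([c[2] for c in components])
m = np.sum(ws * ms)
v = np.sum(ws * (vs + ms**2)) - m**2
print("moment-matched single Gaussian:", m, v)
```

The printed counts are 2, 4, 8, and 16, matching the 2^(k−1) growth; a mixture reduction algorithm trades some of this fidelity for a bounded number of components.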
Three contributions in this thesis concern problems of a similar nature to Example 1.4.

**Example 1.4**
Consider a stochastic dynamical system represented by the following recursion:

$$
x_1 \sim p(x_1), \tag{1.12a}
$$
$$
y_k \sim p(y_k|x_k), \tag{1.12b}
$$
$$
x_{k+1} \sim p(x_{k+1}|x_k). \tag{1.12c}
$$

The Bayesian filtering recursion corresponds to computing the posterior distributions p(x_k|y_{1:k}):

$$
p(x_k|y_{1:k}) = \frac{p(x_k|y_{1:k-1})\,p(y_k|x_k)}{\int p(x_k|y_{1:k-1})\,p(y_k|x_k)\,\mathrm{d}x_k}. \tag{1.13}
$$

The density p(x_k|y_{1:k−1}) in the numerator of (1.13), called the predicted density of x_k, is obtained by integration, as in

$$
p(x_k|y_{1:k-1}) = \int p(x_k|x_{k-1})\,p(x_{k-1}|y_{1:k-1})\,\mathrm{d}x_{k-1}. \tag{1.14}
$$

In such filtering problems, the posterior after processing the last measurement becomes the prior distribution at the next time step. To be able to use the same inference algorithm recursively, the posterior distribution at each time step should have the same form as the prior. When this condition does not hold, approximations can be used. One class of such approximations is the variational approximations, where the posterior is assumed to have a specific functional form (the same as the prior). Subsequently, a statistical distance between the assumed posterior and the true posterior is minimized to find the hyper-parameters of the assumed (approximate) posterior.

Several methods for approximate inference over probabilistic models have been proposed in the literature, such as variational Bayes (Jordan et al., 1999), expectation propagation (Minka, 2001), the integrated nested Laplace approximation (INLA) (Rue et al., 2009), generalized linear models (GLMs) (Nelder and Wedderburn, 1972), and Monte-Carlo (MC) sampling methods (Hastings, 1970; Geman and Geman, 1984). Variational Bayes (VB) and expectation propagation (EP) are two optimization-based solutions to approximate Bayesian inference (Wainwright and Jordan, 2008).
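Before turning to these approximate methods, it helps to see the one case where the recursion (1.13)–(1.14) needs no approximation: the linear-Gaussian model, where both integrals are analytical and yield the Kalman filter. A minimal scalar sketch with illustrative parameter values (not taken from the thesis):

```python
import numpy as np

# Scalar linear-Gaussian model: x_{k+1} = a x_k + process noise with
# variance q, y_k = h x_k + measurement noise with variance r. The
# posterior stays Gaussian at every step, i.e., the same form as the
# prior, which is exactly the conjugacy condition discussed above.
a, q = 0.8, 0.2       # transition parameters
h, r = 1.0, 0.5       # measurement parameters
m, P = 0.0, 1.0       # prior mean and variance for x_1

for y in [0.4, -0.1, 0.7]:        # a short illustrative measurement record
    # prediction step, Eq. (1.14): p(x_k | y_{1:k-1})
    m_pred = a * m
    P_pred = a * P * a + q
    # update step, Eq. (1.13): p(x_k | y_{1:k})
    S = h * P_pred * h + r        # normalizing (innovation) variance
    K = P_pred * h / S            # Kalman gain
    m = m_pred + K * (y - h * m_pred)
    P = P_pred - K * h * P_pred
    print(f"filtered mean {m:.3f}, variance {P:.3f}")
```

For nonlinear, non-Gaussian, or mixture models, these two steps no longer close on a fixed family, which is where the approximations surveyed next come in.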
In these two approaches, the Kullback-Leibler divergence (Cover and Thomas, 2006) between the true posterior distribution and an approximate posterior is minimized. INLA is a technique to perform approximate Bayesian inference in latent Gaussian models (Hennevogl et al., 2001) using the Laplace approximation. GLMs are an extension of ordinary linear regression to the case where the errors belong to the exponential family. Sampling methods such as Markov chain Monte Carlo (MCMC) methods provide a general class of solutions to the approximate Bayesian inference problem.

In this thesis, the focus is on fast analytical approximations which are applicable to large-scale inference problems. These approximations propose solutions to the Bayesian inference problems whose vanilla versions are described in Examples 1.2, 1.3 and 1.4. These analytical approximations either involve minimization of a statistical divergence between the true distribution and its approximation, or are based on expansion of a function with respect to a basis function. In Section 1.2, the main contributions of this thesis are summarized. In Section 1.5, the connection between the problems highlighted here in the form of Examples 1.2, 1.3 and 1.4 and their corresponding contributions in this thesis will be drawn.

1.2 Contributions

The contributions of this thesis address various aspects of Bayesian inference. These contributions can be categorized in three groups:

1. Prior selection: The prior information about a stochastic process in a Gaussian process regression problem can be encoded in the covariance function. The maximum entropy properties of a covariance function for Gaussian process regression, referred to as the discrete-time first-order stable spline kernel, are proven.

2. Determination of the posterior distribution for dynamical systems: Approximate posteriors for two Bayesian inference problems are derived using the variational Bayes technique.
Furthermore, an approximation method for general Bayesian inference problems using linearization of the log-likelihood function is proposed. The contributions in this category concern problems such as those highlighted in Examples 1.2 and 1.4.

3. Maintenance of computational complexity: The contributions in this category concern the maintenance of computational complexity in problems such as the one introduced in Example 1.3.

1.3 Publications

The following papers, listed in reverse chronological order, have been published:

T. Ardeshiri, U. Orguner, and F. Gustafsson. Bayesian inference via approximation of log-likelihood for priors in exponential family. ArXiv e-prints, October 2015b. Submitted to Signal Processing, IEEE Transactions on.

T. Ardeshiri, E. Özkan, U. Orguner, and F. Gustafsson. Approximate Bayesian smoothing with unknown process and measurement noise covariances. To appear in Signal Processing Letters, IEEE, 2015.

T. Chen, T. Ardeshiri, F. P. Carli, A. Chiuso, L. Ljung, and G. Pillonetto. Maximum entropy properties of discrete-time first-order stable spline kernel. To appear in Automatica, 2015.

T. Ardeshiri, U. Orguner, and E. Özkan. Gaussian mixture reduction using reverse Kullback-Leibler divergence. ArXiv e-prints, August 2015. To be submitted to Signal Processing, IEEE Transactions on.

H. Nurminen, T. Ardeshiri, R. Piché, and F. Gustafsson. A NLOS-robust TOA positioning filter based on a skew-t measurement noise model. In 2015 International Conference on Indoor Positioning and Indoor Navigation (IPIN), Banff, Alberta, Canada, October 2015b.

H. Nurminen, T. Ardeshiri, R. Piché, and F. Gustafsson. Robust inference for state-space models with skewed measurement noise. Signal Processing Letters, IEEE, 22(11):1898–1902, Nov 2015a. ISSN 1070-9908. doi: 10.1109/LSP.2015.2437456.

T. Ardeshiri, K. Granström, E. Özkan, and U. Orguner. Greedy reduction algorithms for mixtures of exponential family. Signal Processing Letters, IEEE, 22(6):676–680, June 2015a.
ISSN 1070-9908. doi: 10.1109/LSP.2014.2367154.

T. Ardeshiri and T. Chen. Maximum entropy property of discrete-time stable spline kernel. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 3676–3680, April 2015. doi: 10.1109/ICASSP.2015.7178657.

T. Ardeshiri and E. Özkan. An adaptive PHD filter for tracking with unknown sensor characteristics. In Information Fusion (FUSION), 2013 16th International Conference on, pages 1736–1743, July 2013.

T. Ardeshiri, U. Orguner, C. Lundquist, and T. Schön. On mixture reduction for multiple target tracking. In Information Fusion (FUSION), 2012 15th International Conference on, pages 692–699, July 2012.

T. Ardeshiri, F. Larsson, F. Gustafsson, T. Schön, and M. Felsberg. Bicycle tracking using ellipse extraction. In Information Fusion (FUSION), 2011 Proceedings of the 14th International Conference on, pages 1–8, July 2011a.

T. Ardeshiri, M. Norrlöf, J. Löfberg, and A. Hansson. Convex optimization approach for time-optimal path tracking of robots with speed dependent constraints. In Proceedings of the 18th IFAC World Congress, Milan, Italy, pages 14648–14653, August 2011b.

T. Ardeshiri, S. Kharrazi, R. Thomson, and J. Bärgman. Offset eliminative map matching algorithm for intersection active safety applications. In Intelligent Vehicles Symposium, 2006 IEEE, pages 82–88, 2006b. doi: 10.1109/IVS.2006.1689609.

T. Ardeshiri, S. Kharrazi, J. Sjöberg, J. Bärgman, and L. M. Sensor fusion for vehicle positioning in intersection active safety applications. In International Symposium on Advanced Vehicle Control, 2006a.

1.4 Applications

In this section, a summary of three applied research results produced by the PhD candidate is presented.

1.4.1 Bicycle tracking using ellipse extraction

A new approach to track bicycles from imagery sensor data is proposed in (Ardeshiri et al., 2011a). It is based on detecting ellipses in the image, as in Figures 1.2 and 1.3.
These ellipses are then treated pair-wise using a dynamic bicycle model, illustrated in Figure 1.4. One important application area is automotive collision avoidance systems, where no dedicated systems for bicyclists yet exist and where very few theoretical studies have been published. Possible conflicts can be predicted from the position and velocity states in the model, but also from the steering wheel articulation and roll angle, which indicate yaw changes before the velocity vector changes. An algorithm is proposed in (Ardeshiri et al., 2011a) which consists of an ellipse detection and estimation algorithm and a particle filter. A simulation study of three critical single-target scenarios is presented, and the algorithm is shown to produce excellent state estimates. An experiment using a stationary camera and the particle filter for state estimation has been performed and has shown encouraging results.

Figure 1.2: The green ellipses indicate measurements obtained from the two bike wheels. The ellipse parameters are later fed through a particle filter framework in order to estimate the bicycle state.

Figure 1.3: Ellipse extraction. Top left: Query image. Top right: Query image after background subtraction. Bottom: Ellipses plotted with 0.9 and 1.1 times the estimated size; the actual estimated ellipses are halfway between the lines.

Figure 1.4: (a) Illustration of the coordinate system and the bicycle parameters. The wheelbase L and the distances of the center of gravity to the wheel centers are denoted by l_1 and l_2. The y-axis goes through the center of gravity and the x-axis goes through the wheel centers. (b) Illustration of the inclination θ of the bicycle. The inclination angle can be calculated using Newton's second law of motion. The gravitational force is denoted by mg and the reaction force of the ground is denoted by N.
(c) An extended bicycle model is used as the motion model, where ψ and δ are shown in this figure. The orientation of the camera at the origin of the global coordinate system is shown. (d) The slope of the bicycle's track is denoted by α.

1.4.2 Positioning using ultra wide-band data

The skew-t variational Bayes filter (STVBF) (Nurminen et al., 2015a) is applied to indoor positioning with time-of-arrival (TOA) based distance measurements and pedestrian dead reckoning (PDR) in (Nurminen et al., 2015b). The proposed filter accommodates large positive outliers caused by occasional non-line-of-sight (NLOS) conditions by using a skew-t model of the measurement errors. Real-data tests using the fusion of inertial-sensor-based PDR and ultra-wideband-based TOA ranging show that the STVBF clearly outperforms the extended Kalman filter (EKF) in positioning accuracy, with a computational complexity about three times that of the EKF. The tracking performance on one of the test tracks is illustrated in Figure 1.5.

Figure 1.5: Test track 1 consists of corridors and turns at corridor junctions.

1.4.3 Path tracking for robots

The task of generating time-optimal trajectories for a six-degrees-of-freedom industrial robot is discussed in (Ardeshiri et al., 2011b), and an existing convex optimization formulation of the problem is extended to include new types of constraints. The new constraints are speed dependent and can be motivated from physical modeling of the motors and the drive system. It is shown how the speed dependent constraints should be added in order to keep the convexity of the overall problem. A method to, conservatively, approximate the linear speed dependent constraints by a convex constraint is also proposed (see Figure 1.6). A numerical example demonstrates the versatility of the extension proposed in (Ardeshiri et al., 2011b).
Figure 1.6: The torque at a joint of a robotic arm is plotted versus the square of the angular velocity of the same joint. The non-convex true feasible set is approximated by a set of affine constraints. The true actuator constraint is represented by the dashed line. The approximation of the feasible set by a convex set is illustrated by the hatched area.

1.5 Thesis outline

The thesis is divided into two parts. In the rest of the first part, background material for these contributions will be provided.¹ In the second part of the thesis, a compilation of six edited publications is presented.

1.5.1 Outline of Part I

In Chapter 2, the concepts of entropy, relative entropy, and maximum entropy priors, and their relation to the exponential family, are introduced. Also, a short introduction to the variational Bayes method is given. The background material in Chapter 2 is intended to lay the theoretical foundation for Papers A, B, C, D and E in the second part of this thesis. In Chapter 3, a short introduction to the problem of identification of linear time-invariant, stable and causal systems using Gaussian process regression methods is given. This chapter is intended to give an introduction to the problem addressed in Paper A, which is about approximation of the prior knowledge for the purpose of devising a maximum entropy prior distribution.

¹ Parts of the material presented in the first part of the thesis have already been published by the author in the form of technical reports, conference papers and journal articles.

In Chapter 4, an introduction to the mixture reduction problem introduced in Example 1.3 is presented. The mixture reduction problem is addressed in the second part of this thesis by Papers E and F. Concluding remarks are given in Chapter 5.

1.5.2 Outline of Part II

Part II of the thesis is a compilation of six edited contributions, which are summarized in the following.
Paper A: Maximum entropy properties of discrete-time first-order stable spline kernel

T. Chen, T. Ardeshiri, F. P. Carli, A. Chiuso, L. Ljung, and G. Pillonetto. Maximum entropy properties of discrete-time first-order stable spline kernel. To appear in Automatica, 2015.

presents the maximum entropy properties of the discrete-time first-order stable spline kernel. The first-order stable spline (SS-1) kernel (also known as the tuned-correlated kernel) is used extensively in regularized system identification, where the impulse response is modeled as a zero-mean Gaussian process whose covariance function is given by well designed and tuned kernels. In particular, the exact maximum entropy problem solved by the SS-1 kernel, without Gaussian and uniform sampling assumptions, is formulated. Under a general sampling assumption, the special structure of the SS-1 kernel (e.g., its tridiagonal inverse and its factorization have closed-form expressions) is derived. Also, a maximum entropy covariance completion interpretation is given to it.

Paper B: Approximate Bayesian smoothing with unknown process and measurement noise covariances

T. Ardeshiri, E. Özkan, U. Orguner, and F. Gustafsson. Approximate Bayesian smoothing with unknown process and measurement noise covariances. To appear in Signal Processing Letters, IEEE, 2015.

presents an adaptive smoother for linear state-space models with unknown process and measurement noise covariances. The proposed method utilizes the variational Bayes technique to perform approximate inference. The resulting smoother is computationally efficient, easy to implement, and can be applied to high dimensional linear systems. The performance of the algorithm is illustrated on a target tracking example.

Paper C: Robust inference for state-space models with skewed measurement noise

H. Nurminen, T. Ardeshiri, R. Piché, and F. Gustafsson. Robust inference for state-space models with skewed measurement noise.
Signal Processing Letters, IEEE, 22(11):1898–1902, Nov 2015a. ISSN 1070-9908. doi: 10.1109/LSP.2015.2437456.

presents filtering and smoothing algorithms for linear discrete-time state-space models with skewed and heavy-tailed measurement noise. The algorithms use a variational Bayes approximation of the posterior distribution for models that have a normal prior and skew-t-distributed measurement noise. The proposed filter and smoother are compared with conventional low-complexity alternatives in a simulated pseudorange positioning scenario. In the simulations, the proposed methods achieve better accuracy than the alternative methods, the computational complexity of the filter being roughly 5 to 10 times that of the Kalman filter.

Paper D: Bayesian inference via approximation of log-likelihood for priors in exponential family

T. Ardeshiri, U. Orguner, and F. Gustafsson. Bayesian inference via approximation of log-likelihood for priors in exponential family. ArXiv e-prints, October 2015b. Submitted to Signal Processing, IEEE Transactions on.

presents a Bayesian inference technique based on a Taylor series approximation of the logarithm of the likelihood function. The proposed approximation is devised for the case where the prior distribution belongs to the exponential family of distributions and is continuous. The logarithm of the likelihood function is linearized with respect to the sufficient statistic of the prior distribution in the exponential family, such that the posterior obtains the same exponential family form as the prior. Similarities between the proposed method and the extended Kalman filter for nonlinear filtering are illustrated. Further, an extended target measurement update is derived for target models where the target extent is represented by a random matrix having an inverse Wishart distribution. The approximate update covers the important case where the spread of measurements is due to the target extent as well as the measurement noise in the sensor.
Paper E: Greedy reduction algorithms for mixtures of exponential family

T. Ardeshiri, K. Granström, E. Özkan, and U. Orguner. Greedy reduction algorithms for mixtures of exponential family. Signal Processing Letters, IEEE, 22(6):676–680, June 2015a. ISSN 1070-9908. doi: 10.1109/LSP.2014.2367154.

presents a general framework for greedy reduction of mixture densities of exponential family. The performance of the generalized algorithms is illustrated both on an artificial example, where randomly generated mixture densities are reduced, and on a target tracking scenario, where the reduction is carried out in the recursion of a Gaussian inverse Wishart probability hypothesis density (PHD) filter.

Paper F: Gaussian mixture reduction using reverse Kullback-Leibler divergence

T. Ardeshiri, U. Orguner, and E. Özkan. Gaussian Mixture Reduction Using Reverse Kullback-Leibler Divergence. ArXiv e-prints, August 2015. To be submitted to Signal Processing, IEEE Transactions on.

presents a greedy mixture reduction algorithm which is capable of pruning mixture components as well as merging them based on the Kullback-Leibler divergence (KLD). The algorithm is distinct from the well-known Runnalls' KLD-based method since it is not restricted to merging operations. The capability of pruning (in addition to merging) gives the algorithm the ability to preserve the peaks of the original mixture during the reduction. Analytical approximations are derived to circumvent the computational intractability of the KLD, which results in a computationally efficient method. The proposed algorithm is compared with Runnalls' and Williams' methods in two numerical examples, using both simulated and real world data. The results indicate that the performance and computational complexity of the proposed approach make it an efficient alternative to existing mixture reduction methods.
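As a point of reference for Papers E and F, the greedy merging strategy of Runnalls' baseline method can be sketched as follows: repeatedly merge the pair of Gaussian components whose moment-matched merge incurs the smallest upper bound on the KLD. This is a simplified illustration of the baseline (merging only, no pruning), not the algorithms proposed in the papers; all numbers are illustrative.

```python
import numpy as np

def merge(w1, m1, P1, w2, m2, P2):
    # moment-preserving merge of two Gaussian components
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    d1, d2 = m1 - m, m2 - m
    P = (w1 * (P1 + np.outer(d1, d1)) + w2 * (P2 + np.outer(d2, d2))) / w
    return w, m, P

def merge_cost(w1, m1, P1, w2, m2, P2):
    # Runnalls' upper bound on the KLD increase caused by the merge
    w, _, P = merge(w1, m1, P1, w2, m2, P2)
    return 0.5 * (w * np.log(np.linalg.det(P))
                  - w1 * np.log(np.linalg.det(P1))
                  - w2 * np.log(np.linalg.det(P2)))

def reduce_mixture(comp, target):
    """Greedily merge the cheapest pair until only `target` components remain."""
    comp = list(comp)
    while len(comp) > target:
        pairs = [(merge_cost(*comp[i], *comp[j]), i, j)
                 for i in range(len(comp)) for j in range(i + 1, len(comp))]
        _, i, j = min(pairs)
        merged = merge(*comp[i], *comp[j])
        comp = [c for k, c in enumerate(comp) if k not in (i, j)] + [merged]
    return comp

rng = np.random.default_rng(0)
comp = [(1.0 / 8, rng.normal(size=2), np.eye(2)) for _ in range(8)]
red = reduce_mixture(comp, 3)
```

Merging always preserves the total weight, mean and covariance of the mixture, but can wash out its peaks, which is the limitation that the pruning capability of Paper F addresses.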
2 Entropy, Exponential Family, and Variational Bayes

The analytical approximations proposed in the second part of this thesis build upon the existing literature on maximum entropy priors, the exponential family of distributions and variational Bayes. In this chapter, some preliminary definitions and results relating to these contributions will be given. In Section 2.1, the entropy and the relative entropy will be defined. Furthermore, maximum entropy distributions will be derived. The background material in Section 2.1 lays the foundations for Paper A in the second part of the thesis. Also, the relationship between maximum entropy priors and the exponential family will be explained. In Section 2.2, the exponential family of distributions and some of its properties will be given. This background material will be used in Papers E and D, which are about approximate inference techniques relating to the exponential family of distributions. In Section 2.3, the variational Bayes (VB) method is described. The VB method is used to derive approximate posteriors in Papers B and C.

2.1 Entropy

Entropy is a measure of the uncertainty of a random variable. In this thesis, only continuous random variables are considered. Consequently, only the aspects of information theory which are related to continuous random variables will be covered. The definitions of the differential entropy and the relative entropy are given in the following.

Definition 2.1. For a distribution with support S and density p(·), the differential entropy is defined by (Cover and Thomas, 2012)

H(p) = − ∫_S p(x) log p(x) dx. (2.1)

Example 2.2
For the standard normal distribution in Rⁿ, where p(x) = (2π)^{−n/2} exp{−Σ_{j=1}^{n} x_j²/2} and log p(x) = −(n/2) log(2π) − (1/2) Σ_{j=1}^{n} x_j², the following holds:

H(p) = −E[log p(x)] = (n/2) log(2π) + n/2 = (n/2) log(2πe),

where e is Euler's number.

Definition 2.3.
The relative entropy, or the Kullback-Leibler divergence, between two PDFs p and q is defined by

D_KL(p||q) = E_p[log (p(x)/q(x))]. (2.2)

2.1.1 Maximum entropy prior distributions

By maximizing the differential entropy of a distribution subject to constraints imposed by prior knowledge, the probability distribution which encompasses the fewest assumptions about the data can be obtained. In the following, the maximum entropy distribution subject to constraints expressed as equality constraints on the expectations of some functions will be derived.

Example 2.4
Maximize the entropy H(p) over all probability densities p(·) satisfying

1. p(x) ≥ 0, with equality outside the support set S,
2. ∫_S p(x) dx = 1,
3. ∫_S p(x) T_i(x) dx = α_i, for 1 ≤ i ≤ m.

The solution to the maximum entropy problem can be found using calculus (Cover and Thomas, 2012). The Lagrangian for the problem is given by

J(p) = −∫ p log p + λ_0 ∫ p + Σ_{i=1}^{m} λ_i ∫ T_i p. (2.3)

Since the entropy is a concave function defined over a convex set, we can compute the functional derivative and equate it to zero to obtain

∂J/∂p(x) = −log p(x) − 1 + λ_0 + Σ_{i=1}^{m} λ_i T_i(x) = 0. (2.4)

Hence,

p(x) = exp(−1 + λ_0 + Σ_{i=1}^{m} λ_i T_i(x)). (2.5)

The result of the example above will be proven using the information inequality in the following theorem.

Theorem 2.5. Let p*(x) = exp(−1 + λ_0 + Σ_{i=1}^{m} λ_i T_i(x)), x ∈ S, where λ_0, λ_1, ..., λ_m are chosen so that p* satisfies (Cover and Thomas, 2012, Theorem 12.1.1)

1. p*(x) ≥ 0, with equality outside the support set S,
2. ∫_S p*(x) dx = 1,
3. ∫_S p*(x) T_i(x) dx = α_i, for 1 ≤ i ≤ m.

Then, p* uniquely maximizes H(p) over all probability densities p satisfying the constraints.

Proof: The proof is obtained using the information inequality. Let g satisfy the constraints.
Then

H(g) = −∫_S g ln g = −∫_S g ln((g/p*) p*) = −D_KL(g||p*) − ∫_S g ln p*
≤ −∫_S g ln p* = −∫_S g (−1 + λ_0 + Σ_{i=1}^{m} λ_i T_i)
= −∫_S p* (−1 + λ_0 + Σ_{i=1}^{m} λ_i T_i) = −∫_S p* ln p* = H(p*),

where the second-to-last equality uses that g and p* satisfy the same expectation constraints. Note that equality holds iff D_KL(g||p*) = 0. Therefore, g = p* except on a set of measure 0.

Example 2.6
The maximum entropy distribution on the support S = (−∞, ∞) satisfying the constraints E[x] = µ and E[(x − µ)²] = σ² is p(x) = N(x; µ, σ²).

Example 2.7
The maximum entropy distribution on the support S = [0, +∞) satisfying the constraint E[x] = λ is p(x) = Exp(x; λ⁻¹).

Example 2.8
The maximum entropy distribution on the support S = [a, b] satisfying no constraint other than integrability is p(x) = U(x; a, b).

The differential entropy of continuous random variables has some weaknesses compared to the entropy of discrete random variables, which are listed here.

Remark 2.9. Differential entropy differs from the entropy of a finely quantized version of the continuous random variable (the Shannon entropy) by the logarithm of the quantization resolution, which is infinite in the limit (Cover and Thomas, 2012, Theorem 8.3.1).

Remark 2.10. Differential entropy is not scale invariant on Rⁿ. That is, for a vector-valued random variable X ∈ Rⁿ and a non-singular matrix A ∈ Rⁿˣⁿ (Cover and Thomas, 2012, page 254)

H(AX) = H(X) + log |det(A)|. (2.6)

Remark 2.11. Differential entropy can be negative. Hence, the well-known relation between the information content of a distribution and the Shannon entropy does not hold for the differential entropy.

As shown in Theorem 2.5, the maximum entropy distribution subject to the expectation constraints given in the theorem takes the form exp(−1 + λ_0 + Σ_{i=1}^{m} λ_i T_i(x)). In the following section, the exponential family of distributions will be introduced.
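Example 2.6 can be sanity-checked numerically: among densities with the same variance, the Gaussian attains the largest differential entropy. The check below compares the closed-form entropies of three unit-variance densities; the common variance of 1.0 is an illustrative choice.

```python
import math

sigma2 = 1.0                                             # common variance (illustrative)
h_gauss = 0.5 * math.log(2 * math.pi * math.e * sigma2)  # N(0, sigma2)
b = math.sqrt(sigma2 / 2.0)                              # Laplace scale so Var = 2 b^2 = sigma2
h_laplace = 1.0 + math.log(2.0 * b)
w = math.sqrt(12.0 * sigma2)                             # uniform width so Var = w^2 / 12 = sigma2
h_uniform = math.log(w)

# the Gaussian dominates, as Theorem 2.5 predicts
assert h_gauss > h_laplace > h_uniform
```

With unit variance the three values are roughly 1.419, 1.347 and 1.242 nats, respectively, so the ordering is strict.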
The members of this family arise naturally as the solution to the problem of finding the maximum entropy distribution subject to expectation constraints on their sufficient statistic T(x).

2.2 Exponential Family

The exponential family of distributions (Wainwright and Jordan, 2008) includes many common distributions such as the Gaussian, beta, Dirichlet, gamma and Wishart distributions. The exponential family in its natural form can be represented by its natural parameters η, sufficient statistic T(x), log-partition function A(η) and base measure h(x) as in

q(x; η) = h(x) exp(η · T(x) − A(η)), (2.7)

where the natural parameter η belongs to the natural parameter space Ω = {η ∈ R^m | A(η) < +∞}. Here, a · b denotes the inner product of a and b. In Table 2.1, the sufficient statistics of some continuous members of the exponential family are given.

Definition 2.12. The set corresponding to all mean values for the sufficient statistics

M = {µ ∈ R^m | ∃p : E_p[T(x)] = µ} (2.8)

is called the mean parameter space (Wainwright and Jordan, 2008).

Definition 2.13. In a regular exponential family, the domain Ω is an open set (Wainwright and Jordan, 2008).

Definition 2.14. In a minimal representation of an exponential family, a unique parameter vector is associated with each distribution (Wainwright and Jordan, 2008).

Table 2.1: Some continuous exponential family distributions and their sufficient statistics T(·).

Exponential distribution: x
Normal distribution with known variance σ²: x/σ
Normal distribution: (x, xxᵀ)
Pareto distribution with known minimum x_m: log x
Weibull distribution with known shape k: x^k
Chi-squared distribution: log x
Dirichlet distribution: (log x_1, ..., log x_n)
Laplace distribution with known mean µ: |x − µ|
Inverse Gaussian distribution: (x, 1/x)
Scaled inverse chi-squared distribution: (log x, 1/x)
Beta distribution: (log x, log(1 − x))
Lognormal distribution: (log x, (log x)²)
Gamma distribution: (log x, x)
Inverse gamma distribution: (log x, 1/x)
Gaussian-gamma distribution: (log τ, τ, τx, τx²)
Wishart distribution: (log |X|, X)
Inverse Wishart distribution: (log |X|, X⁻¹)

The formulas for the representation of some probability distribution functions in exponential family form are given in Appendix A.

2.3 Variational Bayes

The variational Bayes (VB) method is used to find an approximate solution to inference problems when an exact solution is not analytically tractable. Consider a Bayesian model in which prior distributions are assigned to all parameters and latent variables. We denote all these parameters and latent variables by x, where x ≜ {x_1, x_2, ..., x_n}. Now, consider the measurement vector y along with the joint distribution p(x, y). When there is no analytical solution for the posterior p(x|y), we can look for an approximate analytical solution using the following factorized variational approximation:

p(x|y) ≈ q(x) (2.9)
q(x) ≜ q_1(x_1) q_2(x_2) ··· q_n(x_n), (2.10)

where the densities q_1(x_1), q_2(x_2), ..., q_n(x_n) are the approximate posterior densities for x_1, x_2, ..., x_n, respectively. The VB technique (Bishop, 2006, Ch. 10), (Tzikas et al., 2008) chooses the estimates q̂_1(x_1), q̂_2(x_2), ..., q̂_n(x_n) for the factors in (2.10) using the following optimization problem:

q̂(x) = argmin_{q(x)} D_KL(q(x) || p(x|y)). (2.11)
The optimal solution of the optimization problem satisfies the following set of equations:

log q̂_i(x_i) = E_{−i}[log p(x, y)] + const., 1 ≤ i ≤ n, (2.12)

where the term const. is constant with respect to the variable x_i, and the subscript −i under the expectation operator means that the expectation is taken with respect to all factors other than q_i(x_i). The solution to (2.12) can be obtained via fixed-point iterations, where only one factor in (2.10) is updated at a time and all the other factors are fixed to their last estimated values (Bishop, 2006, Ch. 10). The iterations converge to a local optimum of (2.11) (Bishop, 2006, Ch. 10), (Wainwright and Jordan, 2008, Ch. 3). The posterior in (2.9) can be the smoothing distribution of the states and model parameters, which may not be analytical. In Papers B and C, it is shown that the VB technique can be used to find an approximate posterior.

3 System Identification

This chapter concerns a maximum entropy prior for a specific approximate Bayesian inference problem. In particular, the prior information about the impulse response of a linear time-invariant (LTI) stable and causal system will be described. The background material presented in this chapter lays the foundation for the contribution in Paper A in the second part of this thesis, where the prior information is approximated to construct a maximum entropy kernel for Gaussian process regression. Parts of the background material are published in (Ardeshiri and Chen, 2015).

3.1 Impulse Response Identification

System identification is about how to construct mathematical models based on observed data; see e.g., (Ljung, 1999). For linear time-invariant (LTI) and causal systems, the identification problem can be stated as follows.
Consider

y(t_i) = (f ∗ u)(t_i) + v(t_i), i = 0, 1, ..., N, (3.1)

where t_i, i = 0, 1, ..., N are the time instants at which the measured input u(t) and output y(t) are collected, v(t) is the disturbance, f(t) is the impulse response with t ∈ R₊ ≜ [0, ∞) for continuous-time systems and t = t_i, i = 0, 1, ... for discrete-time systems, and (f ∗ u)(t_i) is the convolution of f(·) and u(·) evaluated at t = t_i. The goal is to estimate f(t) as well as possible. Recently, there has been increasing interest in the system identification community in studying system identification problems with machine learning methods; see e.g., (Ljung et al., 2011), (Pillonetto et al., 2014). An emerging trend, among others, is to apply Gaussian process regression methods to LTI stable and causal system identification problems; see (Pillonetto and Nicolao, 2010) and its follow-up papers (Pillonetto et al., 2011), (Chen et al., 2012a), (Chen et al., 2014). The idea is to model the impulse response f(t) with a suitably defined Gaussian process, which is characterized by

f(t) ∼ GP(m(t), k(t, s)), (3.2)

where m(t) is the mean function, often set to zero, and k(t, s) is the covariance function, also called the kernel function in machine learning and statistics; see e.g., (Rasmussen and Williams, 2006). The kernel k(t, s) is parametrized by a hyper-parameter β and is further written as k(t, s; β). The key issue is to design a suitable parametrization of k(t, s; β), or in other words, the structure of k(t, s; β), because it reflects our prior knowledge about the system to be identified. Several kernel structures have been proposed in the literature, e.g., the stable spline (SS) kernel in (Pillonetto and Nicolao, 2010) and the diagonal and correlated (DC) kernel in (Chen et al., 2012a).
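Kernel-based impulse response estimation can be sketched in a few lines of code. The kernel below is the first-order stable spline / TC kernel of (3.3); the data-generating system, the hyper-parameters and all numerical values are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def tc_kernel(t, s, beta):
    # first-order stable spline / TC kernel: k(t, s) = min(e^{-beta t}, e^{-beta s})
    return np.minimum(np.exp(-beta * t[:, None]), np.exp(-beta * s[None, :]))

rng = np.random.default_rng(1)
n, N, beta, sig2 = 50, 120, 0.1, 0.01           # illustrative sizes and hyper-parameters
t = np.arange(n, dtype=float)
f0 = 0.5 * np.exp(-0.1 * t) * np.sin(0.3 * t)   # "true" impulse response (made up)
u = rng.normal(size=N)
# convolution matrix: y(t_i) = sum_j f(j) u(i - j) + v(t_i), as in (3.1)
U = np.array([[u[i - j] if 0 <= i - j < N else 0.0 for j in range(n)]
              for i in range(N)])
y = U @ f0 + np.sqrt(sig2) * rng.normal(size=N)

# GP posterior mean of f given y, with zero prior mean and covariance K
K = tc_kernel(t, t, beta)
f_hat = K @ U.T @ np.linalg.solve(U @ K @ U.T + sig2 * np.eye(N), y)
```

The prior covariance K encodes both stability (exponential decay) and smoothness, which is exactly the prior knowledge discussed in the rest of this chapter.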
Interestingly, (Pillonetto and Nicolao, 2011) shows, based on a result in (Nicolao et al., 1998), that for continuous-time systems, the continuous-time first-order SS kernel (also derived by deterministic arguments in (Chen et al., 2012a) and called the Tuned Correlated (TC) kernel)

k(t, s) = min{e^{−βt}, e^{−βs}}, t, s ∈ R₊ (3.3)

has a certain maximum entropy property. In Example 3.1, the impulse response identification problem is further illustrated.

Example 3.1
Consider the LTI system and the simulated input-output data presented in Figure 3.1.

Figure 3.1: Input (black) and output (green) data versus time.

The impulse response f can be computed using the input-output data and the prior knowledge about the form of the impulse response expressed by the kernel function

k(t, s) = min{e^{−βt}, e^{−βs}}, t, s = t_i, i = 0, 1, ... (3.4)

where f(t) ∼ GP(0, k(t, s)). The estimated impulse response is presented in Figure 3.2.

Figure 3.2: The estimated impulse response (dark blue) along with a one standard deviation band (cyan).

In the following, some characteristics of the impulse response of LTI stable and causal systems are given in two separate sections, one for the continuous-time case and one for the discrete-time case.

3.2 Continuous-time impulse response

The prior knowledge about the continuous-time impulse response of a stable and causal LTI system consists of

1. bounded-input bounded-output (BIBO) stability of the system and,
2. smoothness of the impulse response.

For a continuous-time impulse response, BIBO stability is assured when the impulse response is absolutely integrable, i.e., its L₁-norm exists:

∫_{−∞}^{∞} |f(t)| dt = ||f||₁ < ∞. (3.5)
(3.5) −∞ The smoothness constraint on the continuous-time impulse responses can be addressed as in (Nicolao et al., 1998, Theorem 1) where the authors suggest that the smoothness of a signal can be imposed by assuming that the variances of its derivatives are finite. # " df = λ, λ < ∞ (3.6) V dt The impulse response of a continuous-time LTI system and its L1 -norm is illustrated in Figure 3.3. 28 3 System Identification Figure 3.3: The impulse response of a continuous-time LTI stable system. The shaded area under the impulse response should be finite. Some definitions which would be needed to solve the maximum entropy kernel estimation problem for continuous-time stochastic processes are given in the following. Definition 3.2. (L2 differentiation) (Åström, 1970, page 37) A second order stochastic process f is said to be differentiable in the mean square at t if the limit lim s→0 f (t + s) − f (t) = f 0 (t) s (3.7) exists in the sense of mean square convergence, that is, if " lim E s→0 f (t + s) − f (t) − f 0 (t) s #2 = 0. (3.8) Recall that the derivative variances can be expressed via spectral measure by h E f (m) (t) 2 i 1 = 2π Z∞ ω2m S(ω) dω (3.9) −∞ where, mth square mean derivative of f exists iff the integral in the right hand side of (3.9) is finite (Lifshits, 2014, page 107). Definition 3.3. The differential entropy rate of a real-valued continuous-time stochastic process f ( · ) is defined in (Nicolao et al., 1998) as 1 H(f ) = 4π Z∞ −∞ log S(ω) dω. (3.10) 3.3 3.3 29 Discrete-time impulse response Discrete-time impulse response The prior knowledge about the discrete-time impulse response of a stable and causal LTI system are 1. bounded input bounded output (BIBO) stability of the system and, 2. smoothness of the impulse response. BIBO stability is assured when the impulse response is absolutely summable, i.e., its ` 1 norm exists; ∞ X (3.11) |f (n)| = kf k1 < ∞. 
n=−∞ The smoothness constraint on the discrete-time impulse responses can be imposed by assuming that the variances of its finite differences are proportional to the time increment over which the finite difference is computed; V[f (ti+1 ) − f (ti )] = λ(ti+1 − ti ), ∞ > λ > 0. (3.12) Some definitions which would be needed to solve the maximum entropy kernel estimation problem for discrete-time stochastic processes are given in the following. Definition 3.4. (Differential entropy rate of a sequence) Let {X(n)} be a sequence. Its differential entropy rate is defined as (Cover and Thomas, 2012) H(X) , lim n→∞ 1 H(p(X(1), ..., X(n))), n (3.13) when the limit exists. Note that stationarity or even wide-sense stationarity are not required for definition 3.4 to hold. Proposition 3.5. Among all sequences with given covariance, Gaussian one has the maximal differential entropy rate (Cover and Thomas, 2012). Example 3.6 For independent and identically distributed (iid) standard Gaussian sequence we have √ H(X) = log 2πe. (3.14) Let X( · ) be a centered (zero-mean) discrete-time stationary sequence with autocorrelation R( · ) where, R(n) = E[X(n)X(0)]. Also, let S( · ) denote the spectral density of X( · ) on [−π, π]. The following hold for a spectral representation S( · ) (Papoulis and Pillai, 2002, page 421), 1 R(n) = 2π Zπ −π e inω S(ω) dω. (3.15) 30 3 System Identification Example 3.7 X( · ) is the iid standard sequence with covariance function 1 R(n) = 0 n=0 otherwise (3.16) iff S(ω) = 1 for ω ∈ [−π, π]. Theorem 3.8. If a stationary sequence is Gaussian with spectral density S(ω), then (Papoulis and Pillai, 2002, page 663) Zπ √ 1 log S(ω) dω. H(X) = log 2πe + 4π (3.17) −π Example 3.9 Here, we will verify this theorem for iid Gaussian sequence with variance σ 2 . From (3.14) we obtain √ H(X) = log 2πe + log σ . (3.18) From (3.17) we can obtain the same result as in Zπ √ 1 log σ 2 du H(X) = log 2πe + 4π −π √ = log 2πe + log σ . 
3.4 (3.19) Maximum Entropy Kernel In (Pillonetto and Nicolao, 2011), the maximum differential entropy rate continuoustime stochastic process subject to constraints on smoothness and bounded-input bounded-output (BIBO) stability is sought. In (Pillonetto and Nicolao, 2011) the Definition 3.3 for the differential entropy rate of a stationary continuous-time Gaussian process g(t) with power spectrum S(ω) is adopted from (Nicolao et al., 1998). Furthermore, the following proposition is adopted from (Nicolao et al., 1998). Proposition 3.10. (Nicolao et al., 1998, Theroem 1) Let g(t) be a zero-mean bandlimited stationary Gaussian process with power spectrum S(ω) = 0 for |ω| > B. 3.4 31 Maximum Entropy Kernel Given finite λ2k , k = 0, 1, · · · , m, assume that there exist real numbers αj , j = 0, 1, · · · , m such that ZB ω2k dω = 2πλ2k , 2j j=0 αj ω Pm −B k = 0, 1, · · · , m. (3.20) Under this assumption, if there exists S(ω) that maximizes H(g) in 1 H(g) = 4π +∞ Z log S(ω) dω. (3.21) −∞ dk g(t) subject to constraints V[ dt k ] = λ2k , k = 0, 1, · · · , m, then the spectrum is given by S(ω) = Pm 1α ω2j . In particular, if there is no constraints on the first m − 1 j=0 j order derivatives, then the spectrum becomes S(ω) = 1 . αm ω2m Deriving the maximum entropy process in continuous-time in (Nicolao et al., 1998) and (Pillonetto and Nicolao, 2011) is quite involved, due to the infinitedimensional nature of of the problem and absence of a well-defined differential entropy rate for a generic continuous-time stochastic process. In Paper A, we focus on discrete-time impulse responses (stochastic processes), and provide a simple and self-contained proof to show the maximum entropy properties of the discrete-time first-order SS kernel (3.4). The advantages of working in discrete-time domain include 1. The differential entropy rate is well-defined for discrete-time stochastic process. 2. 
Given a stochastic process, its finite difference process can be well-defined in discrete-time domain. 3. It is possible to show what maximum entropy property a zero-mean discretetime Gaussian process with covariance function (3.4) has. 4 Mixture Reduction The background material presented in this chapter introduces the mixture reduction problem and presents the background material which are related to papers E and F. Some of the material presented in this chapter are published by the PhD candidate in (Ardeshiri et al., 2012) and (Ardeshiri et al., 2014). 4.1 Mixture Reduction A common problem encountered in Bayesian inference and particularly tracking is mixture reduction (MR). Examples of such circumstances are multi-hypotheses tracking (MHT)(Blackman and Popoli, 1999), Gaussian sum filter(Alspach and Sorenson, 1972), multiple model filtering (Blackman and Popoli, 1999), Gaussian mixture probability hypothesis density (GM-PHD) filter (Vo and Ma, 2006). In these algorithms the information about the state of a random variable is modeled as a mixture density. A mixture density is a probability density which is a convex combination of (more basic) component probability densities, see e.g. (Bishop, 2006). A normalized mixture with N components is defined as p(x) = N X w I q(x; η I ), (4.1) I=1 where the terms w I are positive weights summing to unity, and η I are the parameters of the component density q(x; η I ). When the component density is a Gaussian density the mixture density is referred to as Gaussian mixture (GM). The mixture reduction problem (MRP) is to find an approximation of the original mixture density by a mixture density with fewer components. To be able to implement these algorithms for real time applications a mixture reduction 33 34 4 Mixture Reduction step is necessary. 
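To make the definition in (4.1) concrete, the following minimal Python sketch evaluates a two-component Gaussian mixture on a grid and checks numerically that it integrates to one; the particular weights and component parameters are chosen arbitrarily for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, weights, means, variances):
    # p(x) = sum_I w^I q(x; eta^I), with positive weights summing to one
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

w = np.array([0.6, 0.4])
mu = np.array([-2.0, 2.0])
var = np.array([1.0, 0.16])

x = np.linspace(-8.0, 8.0, 4001)
p = mixture_pdf(x, w, mu, var)
mass = np.sum(p) * (x[1] - x[0])   # numerically close to 1
```

Because the weights form a convex combination, the mixture is itself a valid probability density.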
The aim of the reduction algorithm is to reduce the computational complexity to a predefined budget while keeping the inevitable error introduced by the approximation as small as possible.

4.2 Mixture Reduction for Target Tracking

This section concerns mixture reduction algorithms in multiple target tracking. The current mixture reduction convention in multiple target tracking (MTT) is to use exactly the same algorithm for reducing the computational load to a feasible level as for extracting the state estimates. In general, the mixture reduction for state extraction should be much more aggressive than that for computational feasibility. For this reason, the number of components in the mixtures has to be reduced much more than what the computational resources actually allow for. This can result in coarser approximations than what is actually necessary. It is proposed in (Ardeshiri et al., 2012) to split the reduction step into two separate procedures:

• Reduction in the loop is a reduction step which must be performed at each point in time for computational feasibility of the overall target tracking framework. The objective for this algorithm is to reduce the number of components while minimizing the information loss.

• Reduction for state extraction aims at reducing the number of components so that the remaining components can be considered as state estimates in the target tracking framework.

This separation makes it possible to tailor these two algorithms to fulfill their individual objectives, which reduces the unnecessary approximations in the overall algorithm. A high-level block diagram of the conventional mixture reduction method is shown in Figure 4.1.

[Figure 4.1: The standard flowchart of MTT algorithms has only one mixture reduction block: Prediction → Update → Mixture Reduction → State Extraction.]
In the proposed implementation of MR for MTT in (Ardeshiri et al., 2012), the reduction algorithm is split into two subroutines, each of which is tailored for its own purpose, see Figure 4.2. The first reduction algorithm, denoted reduction in the loop, is designed to reduce the computational cost of the algorithm to the computational budget between the updates. In this reduction step, the number of components should be reduced to a number that is tractable within the available computational budget, and minimal loss of information is in focus. The second reduction algorithm, denoted reduction for extraction, is designed to reduce the mixture to as many components as the number of targets. In this part of the algorithm, application-dependent specifications and heuristics can enter into the picture. If the purpose of state extraction is only visualization, the second reduction does not have to be performed at the same frequency as the measurements are received and can be made less frequent. The advantages of the proposed algorithm are that the unnecessary loss of information in the reduction-in-the-loop step will only be due to the finite computational budget rather than the closeness of the components. Furthermore, some computational cost can be saved if the state extraction does not have to be performed at every measurement update step.

[Figure 4.2: The proposed block diagram of the MTT algorithm with two mixture reduction blocks (Prediction → Update → Mixture Reduction for Computational Feasibility → Mixture Reduction for State Extraction → State Extraction); one block is tailored to keep the computational complexity within the computational budget and one is tailored for state extraction.]

Another important advantage of the proposed algorithm in (Ardeshiri et al., 2012) is that the number of final components in both of the reduction algorithms is known, since the computational budget is predefined in the reduction-in-the-loop algorithm.
Furthermore, the number of target states can be predetermined by summing the weights in, e.g., a GM-PHD filter, and utilized in the reduction-for-extraction algorithm. The clustering or optimization method selected for reduction can then be executed more efficiently compared to a scenario where the number of components is left to be decided by the algorithm itself.

4.3 Greedy mixture reduction

Ideally, the MRP is formulated as a nonlinear optimization problem where a divergence measure between a mixture and its approximation with a desired number of components is selected. The optimization problem is then solved by numerical solvers when the problem is not analytically tractable. The numerical optimization based approaches can be computationally quite expensive, especially for high dimensional data, and they generally suffer from the problem of local optima. Hence, a common alternative solution to the MRP has been the greedy iterative approach. When the computational budget permits a numerical solution, the greedy approaches are used to initialize the global optimization approach (Williams and Maybeck, 2006). In the greedy approach, the number of components in the mixture is reduced one at a time. By applying the same procedure repeatedly, a desired number of components can be reached. In order to reduce the number of components by one, two types of operations are considered, namely pruning one component and merging two components. These two operations are given formal definitions in the following.

Pruning, which is the simplest operation for reducing the number of components in a mixture density, is to remove one (or more) components of a mixture and rescale the remaining components such that the mixture integrates to unity. For example, pruning component J from (4.1) results in the mixture density

p′(x) = (1 − w^J)^{−1} Σ_{I=1, I≠J}^{N} w^I q(x; η^I).  (4.2)

The merging operation in an MR algorithm approximates a subset of components in a mixture density with a single component of the same component density type. In general, an optimization problem minimizing the KLD between the normalized subset of the mixture and the single component is used for this purpose, leading to a moment matching operation. More formally, the approximation of a fraction of the mixture density (4.1) consisting of two components I and J, i.e., w^I q(x; η^I) + w^J q(x; η^J), by a single weighted component (w^I + w^J) q(x; η^{IJ}) is referred to as merging components I and J, where

η^{IJ} = arg min D_{KL}( (w^I q(x; η^I) + w^J q(x; η^J)) / (w^I + w^J) || q(x; η^{IJ}) ).

When the component densities are Gaussian densities with mean µ and covariance Σ, the parameters of the approximate density are given by

µ^{IJ} = (w^I µ^I + w^J µ^J) / (w^I + w^J),  (4.3a)
Σ^{IJ} = Σ_{K∈{I,J}} (w^K / (w^I + w^J)) (Σ^K + (µ^K − µ^{IJ})(µ^K − µ^{IJ})^T).  (4.3b)

There are two different types of greedy approaches in the literature: local and global approaches. The local approaches consider only the merging operation. The (two) components to be merged are selected among all possible pairs of components based on a divergence measure between the individual components; the divergence between the original mixture and its approximation is not (explicitly) taken into account. Well-known examples of local approaches are given in (Salmond, 1990; Granström and Orguner, 2012b). In the global approach, each of the pruning or merging possibilities is considered to be a hypothesis. The decisions are then made by choosing the candidate hypothesis that minimizes a divergence measure involving the original mixture and the corresponding reduced mixtures (all of which have one less component). In the global approach to mixture reduction, the pruning and merging operations applicable to the original mixture p(x) are considered to be hypotheses denoted by H.
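The pruning rule (4.2) and the moment-matching merge (4.3) can be sketched for scalar Gaussian components as follows; this is an illustrative implementation, not the exact code used in the papers.

```python
import numpy as np

def prune(w, mu, var, j):
    # Remove component j and rescale the remaining weights (Eq. 4.2)
    keep = [i for i in range(len(w)) if i != j]
    return w[keep] / (1.0 - w[j]), mu[keep], var[keep]

def merge(w, mu, var, i, j):
    # Moment-preserving merge of components i and j (Eqs. 4.3a-4.3b), scalar case
    wij = w[i] + w[j]
    mu_ij = (w[i] * mu[i] + w[j] * mu[j]) / wij
    var_ij = (w[i] * (var[i] + (mu[i] - mu_ij) ** 2)
              + w[j] * (var[j] + (mu[j] - mu_ij) ** 2)) / wij
    return wij, mu_ij, var_ij
```

By construction, the merged component has the same weight, mean and variance as the normalized pair it replaces, which is the moment matching property mentioned above.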
The resulting mixtures that would be obtained if the I-th component is pruned, or if the I-th and J-th components are merged, are denoted by p(x|H_{0I}) and p(x|H_{IJ}), respectively. The single component obtained by merging the I-th and J-th components is denoted by q(x; η^{IJ}). If p(x) has K components, there are K pruning and K(K − 1)/2 merging hypotheses. In order to decide on the candidate pruning and merging operations, all corresponding mixtures p(x|H_{0I}) and p(x|H_{IJ}) and their associated divergence measures are calculated. The hypothesis which results in the smallest divergence measure, i.e., the mixture most similar to the original mixture, is selected. More particularly, at the k-th stage of reducing the mixture density of equation (4.1), n_k = N − k + 1 components are left and there are (1/2) n_k (n_k − 1) possible merging decisions and n_k possible pruning decisions to choose from. Let the reduced density at the k-th stage be denoted by p_k(x). We have a multiple hypothesis decision problem at hand, where the hypotheses are formulated according to

Pruning hypotheses:
H_{01}: x ∼ p_k(x|H_{01}),
H_{02}: x ∼ p_k(x|H_{02}),
...
H_{0n_k}: x ∼ p_k(x|H_{0n_k}),

Merging hypotheses:
H_{12}: x ∼ p_k(x|H_{12}),
H_{13}: x ∼ p_k(x|H_{13}),
...
H_{(n_k−1)n_k}: x ∼ p_k(x|H_{(n_k−1)n_k}),

which is a decision problem with n_k(n_k + 1)/2 hypotheses. The first n_k hypotheses account for pruning and the rest account for merging decisions. The subscript on a hypothesis H refers to the two components to be merged for merging hypotheses, while for pruning hypotheses the subscript refers to the label of the component to be pruned, preceded by a zero. The divergence measures used for the aforementioned decision problem are presented in the following section.

4.4 Divergence measures

A divergence measure is a function which establishes the distance of one probability distribution to the other on a statistical manifold (Minka, 2005).
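The hypothesis count n_k(n_k + 1)/2 above can be checked with a short enumeration. The tuple encoding below, (0, I) for pruning and (I, J) for merging, mirrors the subscript convention in the text and is otherwise a hypothetical choice for this sketch.

```python
from itertools import combinations

def reduction_hypotheses(n):
    # All single-step reduction hypotheses for an n-component mixture:
    # n pruning hypotheses (0, I) and n(n-1)/2 merging hypotheses (I, J)
    pruning = [(0, i) for i in range(1, n + 1)]
    merging = list(combinations(range(1, n + 1), 2))
    return pruning + merging

hyps = reduction_hypotheses(5)   # 5 + 10 = 15 hypotheses
```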
A divergence measure is a weaker form of a metric; in particular, a divergence does not need to be symmetric and does not need to satisfy the triangle inequality.

4.4.1 Integral square error

The integral square error (ISE) is a divergence measure between two densities p(x) and q(x) which is defined as

ISE(p||q) = ∫ |p(x) − q(x)|² dx.  (4.4)

ISE has all the properties of a metric. ISE is used by Williams and Maybeck in (Williams and Maybeck, 2006) as a divergence measure for mixture reduction. The cost of the hypothesis H_K obeys

ISE(H_K) = ∫ |p(x) − p_k(x|H_K)|² dx.  (4.5)

In this approach, the hypothesis which gives the smallest ISE is chosen at each step of the reduction, i.e., the decision rule based on ISE becomes "decide H_K if ISE(H_K) < ISE(H_L) for all L ≠ K", where K and L are permissible indices of the hypotheses. An attractive property of the ISE as a divergence measure is that the ISE between two Gaussian mixtures has an analytical solution.

4.4.2 Kullback-Leibler Divergence

The global approach to the mixture reduction problem can be posed as a multiple hypothesis testing problem.¹ Suppose that we have a mixture p(x) with N components as in (4.1). Suppose we have a number of reduced mixtures {p(x|H_j)}_{j=1}^{K} and we would like to select one of them. Assuming that we have the data {x_i}_{i=1}^{S} sampled from p(·), the selection of the best reduced mixture can be posed as a multiple hypothesis testing problem where the test statistic becomes the log-likelihood of the data, given as

log p({x_i}_{i=1}^{S} | H_j) = Σ_{i=1}^{S} log p(x_i | H_j),  (4.6)

and the decision is made to select H_{j*}, where

j* ≜ arg max_j log p({x_i}_{i=1}^{S} | H_j).  (4.7)

When we let the number of samples S go to ∞, we see that

lim_{S→∞} (1/S) log p({x_i}_{i=1}^{S} | H_j) = E_{p(·)}[log p(x|H_j)]  (4.8)

by the law of large numbers.

¹ For a short introduction to multiple hypothesis testing and the maximum a posteriori decision rule, see Appendix B.
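The analytical tractability of the ISE between Gaussian mixtures rests on the identity ∫ N(x; m₁, v₁) N(x; m₂, v₂) dx = N(m₁; m₂, v₁ + v₂). The following scalar-case sketch is illustrative and not the thesis implementation.

```python
import numpy as np

def norm_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gm_cross_term(w1, mu1, var1, w2, mu2, var2):
    # Closed-form  ∫ p(x) q(x) dx  for two scalar Gaussian mixtures, using
    # ∫ N(x; m1, v1) N(x; m2, v2) dx = N(m1; m2, v1 + v2)
    total = 0.0
    for wa, ma, va in zip(w1, mu1, var1):
        for wb, mb, vb in zip(w2, mu2, var2):
            total += wa * wb * norm_pdf(ma, mb, va + vb)
    return total

def ise(w1, mu1, var1, w2, mu2, var2):
    # ISE(p||q) = ∫ p² - 2 ∫ p q + ∫ q², each term analytic (Eq. 4.4)
    return (gm_cross_term(w1, mu1, var1, w1, mu1, var1)
            - 2.0 * gm_cross_term(w1, mu1, var1, w2, mu2, var2)
            + gm_cross_term(w2, mu2, var2, w2, mu2, var2))
```

All three terms are sums of Gaussian evaluations, so no numerical integration is needed.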
The Kullback-Leibler divergence D_{KL}(p(·)||p(·|H_j)) between p(·) and p(·|H_j) is given as

D_{KL}(p(·)||p(·|H_j)) = −H(p(x)) − E_{p(·)}[log p(x|H_j)],  (4.9)

where H(·) is the entropy of its argument density. Therefore, the optimization (4.7) is equivalently given as

j* ≜ arg min_j D_{KL}(p(x)||p(x|H_j)).  (4.10)

The cost function in (4.10) cannot be analytically evaluated when one of the arguments in the KLD is a Gaussian mixture. Runnalls (Runnalls, 2007) used an elegant analytical approximation of the KLD between two mixtures, which can only be used for evaluating the merging hypotheses. The approximation is in fact an upper bound on D_{KL}(p(x)||p(x|H_{IJ})), which is the cost of merging two components I and J; it is denoted by B(I, J) and defined by

B(I, J) ≜ w^I D_{KL}(q(x; η^I)||q(x; η^{IJ})) + w^J D_{KL}(q(x; η^J)||q(x; η^{IJ})).  (4.11)

Runnalls has shown in (Runnalls, 2007) that B(I, J) ≥ D_{KL}(p(x)||p(x|H_{IJ})). The greedy MR algorithm suggested by (Runnalls, 2007) will be referred to as the approximate Kullback-Leibler (AKL) algorithm in the rest of this thesis.

4.4.3 α-Divergences

A generalization of the KLD called the α-divergence is a family of divergences defined over a continuous hyper-parameter α ∈ (−∞, ∞) by

D_α(p||q) ≜ (4/(1 − α²)) (1 − ∫ p(x)^{(1+α)/2} q(x)^{(1−α)/2} dx).

Some special cases of the α-divergence are

lim_{α→1} D_α(p||q) = D_{KL}(p||q),  (4.12a)
lim_{α→−1} D_α(p||q) = D_{KL}(q||p),  (4.12b)
D_0(p||q) = D_H(p||q),  (4.12c)

where D_H(p||q) is the Hellinger distance (Bishop, 2006). We analyze the divergence measures given above in Example 4.1.

Example 4.1: Using an example given in (Minka, 2005), the effect of changing the hyper-parameter α in the divergence measure is illustrated and compared with the ISE distance. Consider the Gaussian mixture

p(x) = 0.6 N(x; −2, 1) + 0.4 N(x; 2, 0.16),

and its approximation q(x), which is a Gaussian distribution with unknown mean and standard deviation.
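For scalar Gaussian components, both the merge in (4.3) and the bound B(I, J) in (4.11) are available in closed form. The sketch below combines them using the standard closed-form KLD between two Gaussians; the helper name runnalls_bound is hypothetical.

```python
import numpy as np

def kld_gauss(mu0, var0, mu1, var1):
    # Closed-form KLD between scalar Gaussians N(mu0, var0) and N(mu1, var1)
    return 0.5 * (var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1.0 + np.log(var1 / var0))

def runnalls_bound(wi, mui, vari, wj, muj, varj):
    # Upper bound B(I, J) on the KLD cost of merging components I and J (Eq. 4.11)
    wij = wi + wj
    mu = (wi * mui + wj * muj) / wij                       # Eq. 4.3a
    var = (wi * (vari + (mui - mu) ** 2)
           + wj * (varj + (muj - mu) ** 2)) / wij          # Eq. 4.3b
    return wi * kld_gauss(mui, vari, mu, var) + wj * kld_gauss(muj, varj, mu, var)
```

Merging two identical components costs nothing, and the bound grows as the components move apart, which is what a greedy AKL-style algorithm exploits when ranking merge candidates.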
In Figure 4.3 the minimizing argument of D_α(p||q) over q is given for various values of α, alongside the minimizing argument of ISE(p||r). The parameters of q (mean and standard deviation) are given in Figure 4.4. The parameters of q vary smoothly with α except when −1 ≤ α ≤ 1. When α ≤ −1 the solution is mode seeking (it seeks the mode with the largest mass), and when α ≥ 1 the optimal solution distributes the probability mass over the support, wherever there is considerable probability mass in the original distribution. The observations made in this simple example are general and can be conveniently explained by the definition of the α-divergence. The minimizing argument of ISE(p||r) in this example does not have the same general interpretation; in this example ISE is rather similar to D_α for α = 1, but if the two Gaussian densities were far away from each other, ISE would become more similar to D_α for α = −1. Another observation is that the minimizing argument of D_α(p||q) over q does not vary much for values of α outside the interval [−1, 1]. Therefore, we will only study the α-divergence in the limit as α → 1, where it corresponds to the KLD, and as α → −1, where it corresponds to the reversed Kullback-Leibler divergence (RKLD). In applications where the mode seeking property of the solution is desired, the RKLD is suitable. On the other hand, when the solution should preserve the statistical moments of a mixture density, the KLD is the most appropriate. In Paper F a Gaussian mixture reduction algorithm using the RKLD is proposed which has the mode seeking property illustrated in Example 4.1.

[Figure 4.3: The Gaussian mixture p (black) is approximated by two Gaussian densities, q (red) and r (blue), for α ∈ {−∞, −1, 0, 1, ∞}; q minimizes the α-divergence for the different values of α and r minimizes the ISE.]
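The qualitative behavior described above can be reproduced numerically. The sketch below evaluates D_α on a grid for the mixture of Example 4.1 and compares a moment-matched Gaussian with a Gaussian centered on the heaviest mode; the two candidate approximations and the α values near ±1 are chosen for illustration only.

```python
import numpy as np

def alpha_divergence(p, q, dx, alpha):
    # Grid approximation of D_alpha(p||q) for alpha != ±1
    integrand = p ** ((1 + alpha) / 2) * q ** ((1 - alpha) / 2)
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - np.sum(integrand) * dx)

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def gauss(mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# The mixture of Example 4.1
p = 0.6 * gauss(-2.0, 1.0) + 0.4 * gauss(2.0, 0.16)

# Candidate approximations: moment matched vs. centered on the heaviest mode
q_moment = gauss(-0.4, 4.504)   # mean -0.4 and variance 4.504 match those of p
q_mode = gauss(-2.0, 1.0)       # the mode carrying the largest mass

# Near alpha = 1 (KLD-like) moment matching wins; near alpha = -1 (RKLD-like)
# the mode-seeking candidate wins.
kl_like_moment = alpha_divergence(p, q_moment, dx, 0.99)
kl_like_mode = alpha_divergence(p, q_mode, dx, 0.99)
rkl_like_moment = alpha_divergence(p, q_moment, dx, -0.99)
rkl_like_mode = alpha_divergence(p, q_mode, dx, -0.99)
```

This mirrors the zero-forcing versus mass-covering distinction: the RKLD-like divergence prefers the single-mode fit, while the KLD-like divergence prefers the moment-matched, wider fit.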
[Figure 4.4: The mean and standard deviation of the Gaussian density q which minimizes the α-divergence to p for different values of α (black). For α = 1 the mean and standard deviation of q match those of p (red). The mean and standard deviation of the Gaussian density r which minimizes the ISE to p are given for comparison (blue).]

4.5 Numerical comparison of mixture reduction algorithms

In Paper E, three mixture reduction algorithms for mixtures of the exponential family are given and evaluated in simulations. In these algorithms the ISE approach and the AKL approach are compared with a local approach referred to as the symmetrized Kullback-Leibler divergence. The symmetrized Kullback-Leibler divergence is used for the comparison of the merging hypotheses in local algorithms such as (Kitagawa, 1994), (Chen et al., 2012b), (Granström and Orguner, 2012b) and (Granström and Orguner, 2012a). The symmetrized KLD (SKL) for two component densities is defined as

D_{SKL}(I, J) = D_{KL}(q_{η^I} || q_{η^J}) + D_{KL}(q_{η^J} || q_{η^I}).  (4.13)

This approach is referred to as SKL and is used in the numerical simulations intended for comparison of the different MR algorithms in the following. In this section, eight mixture reduction examples are illustrated. In Figures 4.5-4.12, mixture densities of exponential, Weibull, Laplace, Rayleigh, log-normal, gamma, inverse gamma and Gaussian distributions are reduced, respectively. In each figure, a mixture density with 25 components is plotted along with its reduced approximations with 3 components obtained using the three reduction algorithms AKL, SKL and ISE. In these figures the original mixture density (black solid line) and its components (black dashed line) are given.
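A local SKL-based merge selection as in (4.13) can be sketched as follows for scalar Gaussian components; best_merge_pair_skl is a hypothetical helper name, and the closed-form Gaussian KLD is used.

```python
import numpy as np
from itertools import combinations

def kld_gauss(mu0, var0, mu1, var1):
    # Closed-form KLD between scalar Gaussians
    return 0.5 * (var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1.0 + np.log(var1 / var0))

def best_merge_pair_skl(mu, var):
    # Local approach: pick the pair with the smallest symmetrized KLD (Eq. 4.13)
    def skl(i, j):
        return (kld_gauss(mu[i], var[i], mu[j], var[j])
                + kld_gauss(mu[j], var[j], mu[i], var[i]))
    return min(combinations(range(len(mu)), 2), key=lambda ij: skl(*ij))

mu = np.array([0.0, 0.1, 5.0])
var = np.array([1.0, 1.0, 1.0])
pair = best_merge_pair_skl(mu, var)   # the two closest components
```

Note that only pairwise component divergences are compared; the divergence to the full original mixture is never evaluated, which is what distinguishes this local approach from the global one.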
In the subfigures, AKL, SKL and ISE are used to approximate the original mixture, which has 25 component densities, with mixtures with 3 component densities. The approximate densities (thick dashed lines) and their components (thin dashed lines) are drawn in different colors: red (AKL), green (SKL) and blue (ISE). AKL is used in the left subfigure, SKL in the center subfigure and ISE in the right subfigure. The reduced mixture in the right subfigure is not rescaled after possible pruning steps and is plotted as it is used in the ISE algorithm. For implementation aspects of the ISE approach see Appendix C.

[Figure 4.5: Exponential distribution.]
[Figure 4.6: Weibull distribution with known shape k.]
[Figure 4.7: Laplace distribution with known mean µ.]
[Figure 4.8: Rayleigh distribution.]
[Figure 4.9: Log-normal distribution.]
[Figure 4.10: Gamma distribution.]
[Figure 4.11: Inverse gamma distribution.]
[Figure 4.12: Univariate Gaussian distribution.]

5 Concluding remarks

This chapter concludes the first part of this thesis.
An overall summary of the contributions given in the second part of the thesis and some directions for further research are given here. For a more detailed discussion on each contribution, see the discussions and concluding remarks at the end of each contribution.

In Paper A, the maximum entropy properties of the first-order stable spline kernel for identification of linear time-invariant stable and causal systems are shown. Analytical approximations are used to express the prior knowledge about the properties of the impulse response of a linear time-invariant stable and causal system. Future work on the subject includes studying maximum entropy interpretations of other kernels used for regression using Gaussian processes. Furthermore, the maximum entropy approach can be used to construct new kernels for system identification.

In Paper B, the variational Bayes (VB) method is used to compute an approximate posterior for the state smoothing problem for linear state-space models with unknown and time-varying noise covariances. The VB method gives an approximate posterior for the unknown noise covariances. Nevertheless, VB-type algorithms approximate the posterior by minimizing the Kullback-Leibler divergence in zero-forcing mode, meaning that if there are multiple modes in the true posterior, the algorithm approximates only one of the modes. Hence the posterior covariance might underestimate the true covariance significantly in such cases. Computing a better estimate of the estimation uncertainty for the noise covariances can be future work. Theoretical comparison of the proposed VB method with expectation maximization and maximum likelihood estimation of the noise covariances is another possible direction for future work.

In Paper C, the VB method is used for approximate inference in state-space models with skewed measurement noise.
A filter and a smoother that take into account the skewness and heavy-tailedness of the measurement noise are proposed, where the skew-t distribution is used to model the distribution of the measurement noise. Future research on the subject includes learning the skewness and spread parameters of the measurement noise from the data. Further research can include studying a class of hierarchical models for modeling the noise parameters and devising algorithms for learning the parameters of such a model from the data.

In Paper D, a novel approximation method for Bayesian inference is proposed. The proposed Bayesian inference technique is based on a Taylor series approximation of the logarithm of the likelihood function. The proposed approximation is devised for the case where the prior distribution belongs to the exponential family of distributions. The linearization of the log-likelihood is performed with respect to the sufficient statistic of the prior distribution. Extension of the proposed method to prior distributions outside the exponential family of distributions can be a future research direction. The comparison of possible choices for the linearization point and linearization methods with respect to the sufficient statistic are among the future research problems.

In Papers E and F, two contributions are dedicated to the mixture reduction (MR) problem. The first contribution generalizes the existing MR algorithms for Gaussian mixtures to the exponential family of distributions and compares them in an extended target tracking scenario. The second contribution proposes a new Gaussian mixture reduction algorithm using the reversed Kullback-Leibler divergence, which has specific peak preserving properties. Future research on these topics includes evaluation of these methods in real-life scenarios with real measurements.
There is a general class of solutions to the Bayesian inference problem referred to as sampling methods, which can achieve much better accuracy than the proposed approximation methods when the computation time is not critical. Sampling methods are not covered in this thesis; that is why the approximations used in this thesis are specified as analytical approximations. The proposed analytical approximations can, however, be used to initialize sampling-based methods as well as to select proposals in Monte-Carlo (MC) methods. Speeding up these MC methods using the proposed approximation methods is a general direction for future research.

Appendix A
Expressions for some members of the exponential family

Essential expressions and formulas for the reduction of mixture densities of common exponential family distributions are given in this section. These expressions can be found in (Ardeshiri et al., 2014) as well. Some functions used in the expressions, such as the gamma function, are defined here for completeness. The gamma function is defined by

\Gamma(t) = \int_0^\infty x^{t-1} \exp(-x) \, dx.  (A.1)

The multivariate gamma function, which is a generalization of the gamma function, is

\Gamma_d(t) = \int_{S \succ 0} \exp(-\operatorname{Tr}(S)) \, |S|^{t - \frac{d+1}{2}} \, dS = \pi^{d(d-1)/4} \prod_{j=1}^{d} \Gamma\!\left(t + \tfrac{1-j}{2}\right).  (A.2)

The digamma function is given by

\psi(t) = \frac{d}{dt} \log \Gamma(t) = \frac{\Gamma'(t)}{\Gamma(t)}.  (A.3)

The multivariate polygamma function of order n is defined as

\psi_d^{(n)}(t) = \frac{d^{n+1}}{dt^{n+1}} \log \Gamma_d(t)  (A.4)
= \sum_{j=1}^{d} \frac{d^{n+1}}{dt^{n+1}} \log \Gamma\!\left(t + \tfrac{1-j}{2}\right)  (A.5)
= \sum_{j=1}^{d} \psi^{(n)}\!\left(t + \tfrac{1-j}{2}\right).  (A.6)

The multinomial beta function in terms of the gamma function is given by

B_K(\alpha) = \frac{\prod_{j=1}^{K} \Gamma(\alpha_j)}{\Gamma\!\left(\sum_{j=1}^{K} \alpha_j\right)}.  (A.7)

Exponential Distribution

\mathrm{Exp}(x; \lambda) = \lambda \exp(-\lambda x)  (A.8a)
Support: x \in [0, +\infty)  (A.8b)
Parameter space: \lambda \in (0, +\infty)  (A.8c)
\eta = -\lambda  (A.8d)
A(\eta) = -\log(-\eta)  (A.8e)
\nabla_\eta A = \partial A / \partial \eta = -1/\eta  (A.8f)
h(x) = 1  (A.8g)
E[h(x)] = 1  (A.8h)
T(x) = x  (A.8i)

The solution to \nabla_{\eta^L} A = Y is given by \eta^L = -1/Y.  (A.9)

Weibull Distribution with known shape k

\mathrm{Weibull}(x; \lambda, k) = \frac{k}{\lambda^k} x^{k-1} \exp\!\left(-\frac{x^k}{\lambda^k}\right)  (A.10a)
Support: x \in [0, +\infty)  (A.10b)
Parameter space: \lambda \in (0, +\infty), \; k \in (0, +\infty)  (A.10c)
\eta = -1/\lambda^k  (A.10d)
A(\eta) = -\log(-\eta) - \log(k)  (A.10e)
\nabla_\eta A = \partial A / \partial \eta = -1/\eta  (A.10f)
h(x) = x^{k-1}  (A.10g)
E[h(x)] = \Gamma\!\left(\frac{2k-1}{k}\right) (-\eta)^{\frac{1-k}{k}}  (A.10h)
T(x) = x^k  (A.10i)

The solution to \nabla_{\eta^L} A = Y is given by \eta^L = -1/Y.  (A.11)

The expression for E_{q(x;\eta)}[h(x)] is derived here. Substituting z = x^k/\lambda^k, dz = (k x^{k-1}/\lambda^k) \, dx,

E[h(x)] = \int_0^\infty x^{k-1} \frac{k x^{k-1}}{\lambda^k} \exp\!\left(-\frac{x^k}{\lambda^k}\right) dx = \int_0^\infty \lambda^{k-1} z^{\frac{k-1}{k}} \exp(-z) \, dz = \lambda^{k-1} \Gamma\!\left(\frac{k-1}{k} + 1\right) = \lambda^{k-1} \Gamma\!\left(\frac{2k-1}{k}\right) = \Gamma\!\left(\frac{2k-1}{k}\right) (-\eta)^{\frac{1-k}{k}}.  (A.12)

Laplace Distribution with known mean \mu

\mathrm{Laplace}(x; \mu, b) = \frac{1}{2b} \exp\!\left(-\frac{|x - \mu|}{b}\right)  (A.13a)
Support: x \in (-\infty, \infty)  (A.13b)
Parameter space: b \in (0, +\infty), \; \mu \in \mathbb{R}  (A.13c)
\eta = -1/b  (A.13d)
A(\eta) = \log(-2/\eta)  (A.13e)
\nabla_\eta A = \partial A / \partial \eta = -1/\eta  (A.13f)
h(x) = 1  (A.13g)
E[h(x)] = 1  (A.13h)
T(x) = |x - \mu|  (A.13i)

The solution to \nabla_{\eta^L} A = Y is given by \eta^L = -1/Y.  (A.14)

Rayleigh Distribution

\mathrm{Rayleigh}(x; \sigma) = \frac{x}{\sigma^2} \exp\!\left(-\frac{x^2}{2\sigma^2}\right)  (A.15a)
Support: x \in [0, +\infty)  (A.15b)
Parameter space: \sigma \in (0, +\infty)  (A.15c)
\eta = -\frac{1}{2\sigma^2}  (A.15d)
A(\eta) = -\log(-2\eta)  (A.15e)
\nabla_\eta A = \partial A / \partial \eta = -1/\eta  (A.15f)
h(x) = x  (A.15g)
E[h(x)] = \sqrt{\frac{\pi}{-4\eta}}  (A.15h)
T(x) = x^2  (A.15i)

The solution to \nabla_{\eta^L} A = Y is given by \eta^L = -1/Y.  (A.16)

The expression for E_{q(x;\eta)}[h(x)] is derived here:

E[h(x)] = \int_0^\infty x \cdot \frac{x}{\sigma^2} \exp\!\left(-\frac{x^2}{2\sigma^2}\right) dx = \sigma \sqrt{\frac{\pi}{2}} = \sqrt{\frac{\pi}{-4\eta}}.  (A.17)

Log-normal Distribution

\log\mathcal{N}(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\!\left(-\frac{1}{2\sigma^2} (\log x - \mu)^2\right)  (A.18a)
Support: x \in (0, +\infty)  (A.18b)
Parameter space: \sigma \in (0, +\infty), \; \mu \in \mathbb{R}  (A.18c)
\eta = (\eta_1, \eta_2)  (A.18d)
\eta_1 = \mu / \sigma^2  (A.18e)
\eta_2 = -\frac{1}{2\sigma^2}  (A.18f)
A(\eta) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2} \log(-2\eta_2)  (A.18g)
\nabla_\eta A = (\partial A / \partial \eta_1, \; \partial A / \partial \eta_2)  (A.18h)
\partial A / \partial \eta_1 = -\frac{\eta_1}{2\eta_2}  (A.18i)
\partial A / \partial \eta_2 = \frac{\eta_1^2}{4\eta_2^2} - \frac{1}{2\eta_2}  (A.18j)
h(x) = \frac{1}{x \sqrt{2\pi}}  (A.18k)
E[h(x)] = \frac{1}{\sqrt{2\pi}} \exp\!\left(\frac{\eta_1}{2\eta_2} - \frac{1}{4\eta_2}\right)  (A.18l)
T(x) = (\log x, \; (\log x)^2)  (A.18m)

The solution to the system of equations \nabla_{\eta^L} A = Y is given by
\eta_2^L = -\frac{1}{2(Y_2 - Y_1^2)},  (A.19a)
\eta_1^L = -2 Y_1 \eta_2^L.  (A.19b)

The expression for E_{q(x;\eta)}[h(x)] is derived here:

E[h(x)] = \frac{1}{\sqrt{2\pi}} E\!\left[\frac{1}{x}\right] = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\mu + \frac{\sigma^2}{2}\right) = \frac{1}{\sqrt{2\pi}} \exp\!\left(\frac{\eta_1}{2\eta_2} - \frac{1}{4\eta_2}\right).  (A.20)

Gamma Distribution

\mathrm{Gamma}(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} \exp(-\beta x)  (A.21a)
Support: x \in (0, +\infty)  (A.21b)
Parameter space: \alpha \in (0, +\infty), \; \beta \in (0, +\infty)  (A.21c)
\eta = (\eta_1, \eta_2)  (A.21d)
\eta_1 = \alpha - 1  (A.21e)
\eta_2 = -\beta  (A.21f)
A(\eta) = \log \Gamma(\eta_1 + 1) - (\eta_1 + 1) \log(-\eta_2)  (A.21g)
\nabla_\eta A = (\partial A / \partial \eta_1, \; \partial A / \partial \eta_2)  (A.21h)
\partial A / \partial \eta_1 = \psi(\eta_1 + 1) - \log(-\eta_2)  (A.21i)
\partial A / \partial \eta_2 = -\frac{\eta_1 + 1}{\eta_2}  (A.21j)
h(x) = 1  (A.21k)
E[h(x)] = 1  (A.21l)
T(x) = (\log x, \; x)  (A.21m)

To solve the system of equations \nabla_{\eta^L} A = Y, first let Z = \log(Y_2) - Y_1 and u = \eta_1 + 1. Then solve \psi(u) - \log(u) + Z = 0 numerically and obtain
\eta_1^L = u - 1,  (A.22a)
\eta_2^L = -\frac{u}{Y_2}.  (A.22b)

Inverse Gamma Distribution

\mathrm{IGamma}(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{-\alpha - 1} \exp\!\left(-\frac{\beta}{x}\right)  (A.23a)
Support: x \in (0, +\infty)  (A.23b)
Parameter space: \alpha \in (0, +\infty), \; \beta \in (0, +\infty)  (A.23c)
\eta = (\eta_1, \eta_2)  (A.23d)
\eta_1 = -\alpha - 1  (A.23e)
\eta_2 = -\beta  (A.23f)
A(\eta) = \log \Gamma(-\eta_1 - 1) - (-\eta_1 - 1) \log(-\eta_2)  (A.23g)
\nabla_\eta A = (\partial A / \partial \eta_1, \; \partial A / \partial \eta_2)  (A.23h)
\partial A / \partial \eta_1 = -\psi(-\eta_1 - 1) + \log(-\eta_2)  (A.23i)
\partial A / \partial \eta_2 = \frac{\eta_1 + 1}{\eta_2}  (A.23j)
h(x) = 1  (A.23k)
E[h(x)] = 1  (A.23l)
T(x) = \left(\log x, \; \frac{1}{x}\right)  (A.23m)

To solve the system of equations \nabla_{\eta^L} A = Y, first let Z = \log(Y_2) + Y_1 and u = -\eta_1 - 1. Then solve \psi(u) - \log(u) + Z = 0 numerically and obtain
\eta_1^L = -u - 1,  (A.24a)
\eta_2^L = -\frac{u}{Y_2}.  (A.24b)

Univariate Gaussian Distribution

\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left(-\frac{1}{2\sigma^2} (x - \mu)^2\right)  (A.25a)
Support: x \in \mathbb{R}  (A.25b)
Parameter space: \sigma \in (0, +\infty), \; \mu \in \mathbb{R}  (A.25c)
\eta = (\eta_1, \eta_2)  (A.25d)
\eta_1 = \mu / \sigma^2  (A.25e)
\eta_2 = -\frac{1}{2\sigma^2}  (A.25f)
A(\eta) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2} \log(-2\eta_2)  (A.25g)
\nabla_\eta A = (\partial A / \partial \eta_1, \; \partial A / \partial \eta_2)  (A.25h)
\partial A / \partial \eta_1 = -\frac{\eta_1}{2\eta_2}  (A.25i)
\partial A / \partial \eta_2 = \frac{\eta_1^2}{4\eta_2^2} - \frac{1}{2\eta_2}  (A.25j)
h(x) = \frac{1}{\sqrt{2\pi}}  (A.25k)
E[h(x)] = \frac{1}{\sqrt{2\pi}}  (A.25l)
T(x) = (x, \; x^2)  (A.25m)

The solution to the system of equations \nabla_{\eta^L} A = Y is given by
\eta_2^L = -\frac{1}{2(Y_2 - Y_1^2)},  (A.26a)
\eta_1^L = -2 Y_1 \eta_2^L.  (A.26b)

Multivariate Gaussian Distribution

\mathcal{N}(x; m, P) = (2\pi)^{-k/2} |P|^{-1/2} \exp\!\left(-\frac{1}{2} (x - m)^T P^{-1} (x - m)\right)  (A.27a)
Support: x \in \mathbb{R}^k  (A.27b)
Parameter space: P \in \mathbb{R}^{k \times k}, \; P = P^T \succ 0, \; m \in \mathbb{R}^k  (A.27c)
\eta = (\eta_1, \eta_2)  (A.27d)
\eta_1 = P^{-1} m  (A.27e)
\eta_2 = -\frac{1}{2} P^{-1}  (A.27f)
A(\eta) = -\frac{1}{4} \eta_1^T \eta_2^{-1} \eta_1 - \frac{1}{2} \log |-2\eta_2|  (A.27g)
\nabla_\eta A = (\partial A / \partial \eta_1, \; \partial A / \partial \eta_2)  (A.27h)
\partial A / \partial \eta_1 = -\frac{1}{2} \eta_1^T \eta_2^{-1}  (A.27i)
\partial A / \partial \eta_2 = \frac{1}{4} \eta_2^{-T} \eta_1 \eta_1^T \eta_2^{-T} - \frac{1}{2} \eta_2^{-1}  (A.27j)
h(x) = (2\pi)^{-k/2}  (A.27k)
E[h(x)] = (2\pi)^{-k/2}  (A.27l)
T(x) = (x, \; x x^T)  (A.27m)

The solution to the system of equations \nabla_{\eta^L} A = Y is given by
\eta_2^L = -\frac{1}{2} (Y_2 - Y_1^T Y_1)^{-1},  (A.28)
(\eta_1^L)^T = -2 Y_1 \eta_2^L.  (A.29)

Gaussian Gamma Distribution

\mathrm{GaussianGamma}(x, \tau; \mu, \lambda, \alpha, \beta) = \mathcal{N}\!\left(x; \mu, \frac{1}{\lambda \tau}\right) \mathrm{Gamma}(\tau; \alpha, \beta)  (A.30a)
Support: x \in \mathbb{R}, \; \tau \in (0, +\infty)  (A.30b)
Parameter space: \alpha \in (0, +\infty), \; \beta \in (0, +\infty), \; \lambda \in (0, +\infty), \; \mu \in \mathbb{R}  (A.30c)
\eta = (\eta_1, \eta_2, \eta_3, \eta_4)  (A.30d)
\eta_1 = \alpha - \frac{1}{2}  (A.30e)
\eta_2 = -\beta - \frac{\lambda \mu^2}{2}  (A.30f)
\eta_3 = \lambda \mu  (A.30g)
\eta_4 = -\frac{\lambda}{2}  (A.30h)
A(\eta) = \log \Gamma\!\left(\eta_1 + \tfrac{1}{2}\right) - \frac{1}{2} \log(-2\eta_4) - \left(\eta_1 + \tfrac{1}{2}\right) \log\!\left(-\eta_2 + \frac{\eta_3^2}{4\eta_4}\right)  (A.30i)
\nabla_\eta A = (\partial A / \partial \eta_1, \; \partial A / \partial \eta_2, \; \partial A / \partial \eta_3, \; \partial A / \partial \eta_4)  (A.30j)
\partial A / \partial \eta_1 = \psi\!\left(\eta_1 + \tfrac{1}{2}\right) - \log\!\left(-\eta_2 + \frac{\eta_3^2}{4\eta_4}\right)  (A.30k)
\partial A / \partial \eta_2 = \frac{\eta_1 + \frac{1}{2}}{-\eta_2 + \frac{\eta_3^2}{4\eta_4}}  (A.30l)
\partial A / \partial \eta_3 = -\frac{\eta_3 \left(\eta_1 + \frac{1}{2}\right)}{2\eta_4 \left(-\eta_2 + \frac{\eta_3^2}{4\eta_4}\right)}  (A.30m)
\partial A / \partial \eta_4 = \frac{\eta_3^2 \left(\eta_1 + \frac{1}{2}\right)}{4\eta_4^2 \left(-\eta_2 + \frac{\eta_3^2}{4\eta_4}\right)} - \frac{1}{2\eta_4}  (A.30n)
h(x, \tau) = \frac{1}{\sqrt{2\pi}}  (A.30o)
E[h(x, \tau)] = \frac{1}{\sqrt{2\pi}}  (A.30p)
T(x, \tau) = (\log \tau, \; \tau, \; \tau x, \; \tau x^2)  (A.30q)

To solve the system of equations \nabla_{\eta^L} A = Y, first let Z = \log(Y_2) - Y_1 and u = \eta_1 + \frac{1}{2}. Then solve \psi(u) - \log(u) + Z = 0 numerically and obtain
\eta_1^L = u - \frac{1}{2},  (A.31a)
\eta_4^L = -\frac{1}{2} \left(Y_4 - \frac{Y_3^2}{Y_2}\right)^{-1},  (A.31b)
\eta_3^L = -2 \eta_4^L \frac{Y_3}{Y_2},  (A.31c)
\eta_2^L = \frac{(\eta_3^L)^2}{4 \eta_4^L} - \frac{\eta_1^L + \frac{1}{2}}{Y_2}.  (A.31d)

Dirichlet Distribution

\mathrm{Dir}_K(x; \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}  (A.32a)
Support: x_i \in [0, 1] for i = 1, \ldots, K and \sum_{i=1}^{K} x_i = 1  (A.32b)
Parameter space: \alpha_i > 0 and K \geq 2  (A.32c)
\eta = (\eta_1, \ldots, \eta_K)  (A.32d)
\eta_i = \alpha_i - 1  (A.32e)
A(\eta) = \sum_{i=1}^{K} \log \Gamma(\eta_i + 1) - \log \Gamma\!\left(\sum_{i=1}^{K} (\eta_i + 1)\right)  (A.32f)
\nabla_\eta A = (\partial A / \partial \eta_1, \; \partial A / \partial \eta_2, \ldots, \partial A / \partial \eta_K)  (A.32g)
\partial A / \partial \eta_i = \psi(\eta_i + 1) - \psi\!\left(\sum_{k=1}^{K} (\eta_k + 1)\right)  (A.32h)
h(x) = 1  (A.32i)
E[h(x)] = 1  (A.32j)
T(x) = (\log(x_1), \ldots, \log(x_K))  (A.32k)

The system of equations \nabla_{\eta^L} A = Y can be solved using a numerical method such as Newton's method, where the Hessian is given by
\frac{\partial^2 A}{\partial \eta_i^2} = \psi^{(1)}(\eta_i + 1) - \psi^{(1)}\!\left(\sum_{k=1}^{K} (\eta_k + 1)\right),  (A.33a)
\frac{\partial^2 A}{\partial \eta_i \partial \eta_j} = -\psi^{(1)}\!\left(\sum_{k=1}^{K} (\eta_k + 1)\right), \quad i \neq j.  (A.33b)

Wishart Distribution

W_d(X; n, V) = \frac{|X|^{\frac{1}{2}(n - d - 1)} \exp\!\left(\operatorname{Tr}\!\left(-\frac{1}{2} V^{-1} X\right)\right)}{2^{\frac{1}{2} n d} \, \Gamma_d\!\left(\frac{1}{2} n\right) |V|^{\frac{1}{2} n}}  (A.34a)
Support: X \in \mathbb{R}^{d \times d} and X = X^T \succ 0  (A.34b)
Parameter space: V \in \mathbb{R}^{d \times d}, \; V = V^T \succ 0, \; n \geq d  (A.34c)
\eta = (\eta_1, \eta_2)  (A.34d)
\eta_1 = \frac{1}{2} (n - d - 1)  (A.34e)
\eta_2 = -\frac{1}{2} V^{-1}  (A.34f)
A(\eta) = -\left(\eta_1 + \frac{d+1}{2}\right) \log |-\eta_2| + \log \Gamma_d\!\left(\eta_1 + \frac{d+1}{2}\right)  (A.34g)
\nabla_\eta A = (\partial A / \partial \eta_1, \; \partial A / \partial \eta_2)  (A.34h)
\partial A / \partial \eta_1 = -\log |-\eta_2| + \psi_d\!\left(\eta_1 + \frac{d+1}{2}\right)  (A.34i)
\partial A / \partial \eta_2 = -\left(\eta_1 + \frac{d+1}{2}\right) \eta_2^{-1}  (A.34j)
h(X) = 1  (A.34k)
E[h(X)] = 1  (A.34l)
T(X) = (\log |X|, \; X)  (A.34m)

To solve the system of equations \nabla_{\eta^L} A = Y, first let Z = \log |Y_2| - Y_1 and u = \eta_1 + \frac{d+1}{2}. Then solve \psi_d(u) - d \log(u) + Z = 0 numerically and obtain
\eta_1^L = u - \frac{d+1}{2},  (A.35a)
\eta_2^L = -u Y_2^{-1}.  (A.35b)

Inverse Wishart Distribution

\mathcal{IW}_d(X; \nu, \Psi) = \frac{|\Psi|^{\frac{1}{2}(\nu - d - 1)} \exp\!\left(\operatorname{Tr}\!\left(-\frac{1}{2} \Psi X^{-1}\right)\right)}{2^{\frac{1}{2}(\nu - d - 1) d} \, \Gamma_d\!\left(\frac{1}{2}(\nu - d - 1)\right) |X|^{\frac{1}{2} \nu}}  (A.36a)
Support: X \in \mathbb{R}^{d \times d} and X = X^T \succ 0  (A.36b)
Parameter space: \nu > 2d, \; \Psi \in \mathbb{R}^{d \times d}, \; \Psi = \Psi^T \succ 0  (A.36c)
\eta = (\eta_1, \eta_2)  (A.36d)
\eta_1 = -\frac{1}{2} \nu  (A.36e)
\eta_2 = -\frac{1}{2} \Psi  (A.36f)
A(\eta) = \left(\eta_1 + \frac{d+1}{2}\right) \log |-\eta_2| + \log \Gamma_d\!\left(-\eta_1 - \frac{d+1}{2}\right)  (A.36g)
\nabla_\eta A = (\partial A / \partial \eta_1, \; \partial A / \partial \eta_2)  (A.36h)
\partial A / \partial \eta_1 = \log |-\eta_2| - \psi_d\!\left(-\eta_1 - \frac{d+1}{2}\right)  (A.36i)
\partial A / \partial \eta_2 = \left(\eta_1 + \frac{d+1}{2}\right) \eta_2^{-1}  (A.36j)
h(X) = 1  (A.36k)
E[h(X)] = 1  (A.36l)
T(X) = (\log |X|, \; X^{-1})  (A.36m)

To solve the system of equations \nabla_{\eta^L} A = Y, first let Z = -\log |Y_2| - Y_1 and u = -\eta_1 - \frac{d+1}{2}. Then solve -\psi_d(u) + d \log(u) + Z = 0 numerically and obtain
\eta_1^L = -u - \frac{d+1}{2},  (A.37a)
\eta_2^L = -u Y_2^{-1}.  (A.37b)

Gaussian Inverse Wishart Distribution

\mathrm{GIW}(x, X; m, P, \nu, \Psi) = \mathcal{N}(x; m, P) \, \mathcal{IW}_d(X; \nu, \Psi)  (A.38a)
Support: x \in \mathbb{R}^k, \; X \in \mathbb{R}^{d \times d} and X = X^T \succ 0  (A.38b)
Parameter space: \nu > 2d, \; \Psi \in \mathbb{R}^{d \times d}, \; \Psi = \Psi^T \succ 0, \; P \in \mathbb{R}^{k \times k}, \; P = P^T \succ 0, \; m \in \mathbb{R}^k  (A.38c)
\eta = (\eta_1, \eta_2, \eta_3, \eta_4)  (A.38d)
\eta_1 = -\frac{1}{2} \nu  (A.38e)
\eta_2 = -\frac{1}{2} \Psi  (A.38f)
\eta_3 = P^{-1} m  (A.38g)
\eta_4 = -\frac{1}{2} P^{-1}  (A.38h)
A(\eta) = \left(\eta_1 + \frac{d+1}{2}\right) \log |-\eta_2| + \log \Gamma_d\!\left(-\eta_1 - \frac{d+1}{2}\right) - \frac{1}{4} \eta_3^T \eta_4^{-1} \eta_3 - \frac{1}{2} \log |-2\eta_4|  (A.38i)
\nabla_\eta A = (\partial A / \partial \eta_1, \; \partial A / \partial \eta_2, \; \partial A / \partial \eta_3, \; \partial A / \partial \eta_4)  (A.38j)
\partial A / \partial \eta_1 = \log |-\eta_2| - \psi_d\!\left(-\eta_1 - \frac{d+1}{2}\right)  (A.38k)
\partial A / \partial \eta_2 = \left(\eta_1 + \frac{d+1}{2}\right) \eta_2^{-1}  (A.38l)
\partial A / \partial \eta_3 = -\frac{1}{2} \eta_3^T \eta_4^{-1}  (A.38m)
\partial A / \partial \eta_4 = \frac{1}{4} \eta_4^{-T} \eta_3 \eta_3^T \eta_4^{-T} - \frac{1}{2} \eta_4^{-1}  (A.38n)
h(x, X) = (2\pi)^{-k/2}  (A.38o)
E[h(x, X)] = (2\pi)^{-k/2}  (A.38p)
T(x, X) = (\log |X|, \; X^{-1}, \; x, \; x x^T)  (A.38q)

To solve the system of equations \nabla_{\eta^L} A = Y, first let Z = -\log |Y_2| - Y_1 and u = -\eta_1 - \frac{d+1}{2}. Then solve -\psi_d(u) + d \log(u) + Z = 0 numerically and obtain
\eta_1^L = -u - \frac{d+1}{2},  (A.39a)
\eta_2^L = -u Y_2^{-1},  (A.39b)
\eta_4^L = -\frac{1}{2} (Y_4 - Y_3^T Y_3)^{-1},  (A.39c)
(\eta_3^L)^T = -2 Y_3 \eta_4^L.  (A.39d)

Appendix B
Multiple hypothesis testing

Here, the multiple hypothesis testing problem and the maximum a posteriori decision rule are given for the sake of completeness (Ardeshiri et al., 2014). For a more complete treatment see (Kay, 1998). Consider that we want to decide among M hypotheses \{H_1, H_2, \ldots, H_M\}. Let the cost assigned to the decision to choose H_i when H_j is true be denoted by C_{ij}, where

C_{ij} = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases}.  (B.1)

The expected Bayes risk (Kay, 1998) becomes

R = \sum_{i=1}^{M} \sum_{j=1}^{M} C_{ij} \, P(H_i | H_j) P(H_j).  (B.2)

We are looking for a decision rule that minimizes R. Let us partition the space into regions R_i for i = 1, \ldots, M, so that

R = \sum_{i=1}^{M} \sum_{j=1}^{M} C_{ij} \int_{R_i} p(x | H_j) P(H_j) \, dx = \sum_{i=1}^{M} \int_{R_i} \sum_{j=1}^{M} C_{ij} \, P(H_j | x) \, p(x) \, dx = \sum_{i=1}^{M} \int_{R_i} C_i(x) \, p(x) \, dx,  (B.3)

where C_i(x) = \sum_{j=1}^{M} C_{ij} P(H_j | x). Since each data point x should trigger only one decision, i.e., be assigned to only one of the partitions R_i, we should decide the H_k for which C_i is minimum. Since C_i(x) = \sum_{j=1}^{M} P(H_j | x) - P(H_i | x), C_i(x) is minimized if P(H_i | x) is maximized. Thus the decision rule is: decide H_k if P(H_k | x) > P(H_i | x) for i \neq k. For equal prior probabilities P(H_k) = P(H_i), the decision rule is to decide H_k if p(x | H_k) > p(x | H_i) for i \neq k. This decision rule is also referred to as the maximum a posteriori decision rule. If the prior probabilities are not equal due to, e.g., heuristics, P(H_k) \neq P(H_i), Bayes' rule P(H_i | x) \propto p(x | H_i) P(H_i) can be used.
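The MAP rule can be written down directly. A minimal sketch is given below; the Gaussian measurement models and all names are illustrative, not taken from the thesis, and each hypothesis is scored by p(x | H_k) P(H_k).

```python
import math

def gauss_pdf(x, mean, var=1.0):
    """Toy measurement model p(x | H) (unit-variance Gaussian by default)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def map_decision(x, priors, means):
    """MAP decision rule: return the index k maximizing p(x | H_k) P(H_k)."""
    scores = [p * gauss_pdf(x, m) for p, m in zip(priors, means)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal priors this reduces to the maximum-likelihood rule; with unequal priors the prior term can tip the decision toward the more probable hypothesis even when its likelihood is smaller.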
This possibility is not exploited in this thesis.

Appendix C
Implementation aspects of the ISE approach

An advantage of the ISE metric is that it can be computed analytically for many distributions (Ardeshiri et al., 2015a). In the ISE approach, two parameters can be varied to create slightly different reduction algorithms, as detailed below (Ardeshiri et al., 2014):

1. In the first variation, the ISE is calculated for each hypothesis according to \mathrm{ISE}(H_K) = \int |p(x) - p_k(x | H_K)|^2 \, dx, and the density after pruning is re-normalized. This variation is consistent with the presentation of the ISE algorithm so far in this technical report.

2. In the second variation, as pointed out in (Williams and Maybeck, 2006), the re-scaling can be skipped when the ISE is calculated for a pruning hypothesis, since re-normalizing the weights would increase the error value in parts of the support that are not affected by the pruning hypothesis. This choice also brings substantial computational savings.

3. In the third variation, instead of comparing p(x | H_K) with the original mixture p(x), it is compared with the mixture resulting from the previous reduction step, p_k(x), as given by

\mathrm{ISE}(H_K) = \int |p_k(x) - p_k(x | H_K)|^2 \, dx.

In this way, the ISE metric for a merging decision simplifies to

\mathrm{ISE}(H_{IJ}) = (w^I)^2 Q(I, I) + (w^J)^2 Q(J, J) + (w^{IJ})^2 Q(IJ, IJ) + 2 w^I w^J Q(I, J) - 2 w^I w^{IJ} Q(I, IJ) - 2 w^J w^{IJ} Q(J, IJ),

where

Q(I, J) = \int q(x; \eta^I) \, q(x; \eta^J) \, dx.  (C.1)

Q(I, J) can be calculated analytically for many basic densities of interest belonging to the exponential family, such as the Gaussian, gamma and Wishart distributions. For explicit expressions for the exponential family of distributions, see (Ardeshiri et al., 2015a) and (Ardeshiri et al., 2014). Similarly, the ISE metric for a pruning decision simplifies to

\mathrm{ISE}(H_{0I}) = \left(\frac{w^I}{1 - w^I}\right)^2 \left[ Q(I, I) - 2 \sum_{i=1}^{N} w^i Q(I, i) + \sum_{i=1}^{N} \sum_{j=1}^{N} w^i w^j Q(i, j) \right].

4. The fourth variant is similar to the third variant in terms of the choice of the reference density, but the mixture is not re-normalized after each pruning, which results in the expression \mathrm{ISE}(H_{0I}) = (w^I)^2 Q(I, I) for pruning hypotheses.

Calculation of the ISE for each hypothesis at every step of the reduction is costly. A scheme is suggested here that caches the calculated quantities to reduce the computational cost of the reduction. The cost-reduction scheme is given for the second type of implementation of the ISE approach, where the mixture density is not re-normalized after a pruning hypothesis. In the first step of the reduction of the mixture density (4.1), merging all possible pairs of components results in \frac{1}{2} N(N-1) hypotheses. For the evaluation of these hypotheses, the resulting component of each merge must be calculated. To calculate the ISE of each hypothesis, Q(\cdot, \cdot) must be calculated for all pairs of components in the mixture, as well as for the pairs in which one component is among the merged components and the other is among the existing components. All these quantities should be stored so that they can be reused in future reduction steps. At the k-th step of the reduction of the mixture density given in (4.1), the reduced density is denoted by p_k(x). To keep the notation less cluttered, let q^J denote w^J q(x; \eta^J), p denote p(x), and p_k denote p_k(x). Assume that the costs of the reduction hypotheses at the k-th stage, denoted \mathrm{ISE}_k(H_R), are stored in a vector Y_k, and let M = \arg\min_R \mathrm{ISE}_k(H_R) over all permissible values of R.
When M corresponds to a pruning hypothesis, for example M = 0J, the vector Y_{k+1} can be updated with fewer computations for the next pruning hypotheses using

\mathrm{ISE}_{k+1}(H_{0S} | M = 0J) = \int (p - p_k + q^J + q^S)^2 \, dx
= \int (p - p_k + q^S)^2 \, dx + \int (q^J)^2 \, dx + 2 \int q^J (p - p_k + q^S) \, dx
= \int (p - p_k + q^S)^2 \, dx + \int (q^J)^2 \, dx + 2 \int q^J (p - p_k) \, dx + 2 \int q^J q^S \, dx
= \mathrm{ISE}_k(H_{0S}) + \underbrace{\int (q^J)^2 \, dx + 2 \int q^J (p - p_k) \, dx}_{A(J)} + 2 \int q^J q^S \, dx,  (C.2)

where the quantity \mathrm{ISE}_k(H_{0S}) is already known from the previous step, and A(J) is the part of the ISE added to the elements of Y_k due to the pruning of the J-th component. Similarly, when M corresponds to a pruning hypothesis, for example M = 0J, the vector Y_{k+1} can be updated with fewer computations for the next merging hypotheses using

\mathrm{ISE}_{k+1}(H_{ST} | M = 0J) = \int (p - p_k + q^J + q^S + q^T - q^{ST})^2 \, dx
= \int (p - p_k + q^S + q^T - q^{ST})^2 \, dx + \int (q^J)^2 \, dx + 2 \int q^J (p - p_k + q^S + q^T - q^{ST}) \, dx
= \int (p - p_k + q^S + q^T - q^{ST})^2 \, dx + \int (q^J)^2 \, dx + 2 \int q^J (p - p_k) \, dx + 2 \int q^J (q^S + q^T - q^{ST}) \, dx
= \mathrm{ISE}_k(H_{ST}) + A(J) + 2 \int q^J (q^S + q^T - q^{ST}) \, dx.  (C.3)

After each pruning step, all elements of the vector Y_{k+1} corresponding to the pruned component are eliminated from Y_{k+1}.
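These updates reduce to sums of pairwise integrals Q(\cdot, \cdot) between weighted components. For scalar Gaussian components the integral has the well-known closed form \int \mathcal{N}(x; \mu_I, \sigma_I^2) \mathcal{N}(x; \mu_J, \sigma_J^2) \, dx = \mathcal{N}(\mu_I - \mu_J; 0, \sigma_I^2 + \sigma_J^2). A minimal sketch, with an illustrative function name:

```python
import math

def gauss_overlap(mu_i, var_i, mu_j, var_j):
    """Closed-form Q(I, J) for scalar Gaussian components:
    the product integral equals a Gaussian evaluated at the mean difference,
    with variance equal to the sum of the component variances."""
    v = var_i + var_j
    return math.exp(-(mu_i - mu_j) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
```

Caching these values in a symmetric table is what makes the update formulas above cheap: only the rows and columns touched by the selected pruning or merging hypothesis need to be recomputed.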
Using a similar approach, when M corresponds to a merging hypothesis, say M = IJ, the vector Y_{k+1} can be updated with fewer computations for the next pruning hypotheses using

\mathrm{ISE}_{k+1}(H_{0S} | M = IJ) = \int (p - p_k + q^J + q^I - q^{IJ} + q^S)^2 \, dx
= \int (p - p_k + q^S)^2 \, dx + \int (q^J + q^I - q^{IJ})^2 \, dx + 2 \int (q^J + q^I - q^{IJ})(p - p_k + q^S) \, dx
= \int (p - p_k + q^S)^2 \, dx + \int (q^J + q^I - q^{IJ})^2 \, dx + 2 \int (q^J + q^I - q^{IJ})(p - p_k) \, dx + 2 \int (q^J + q^I - q^{IJ}) q^S \, dx
= \mathrm{ISE}_k(H_{0S}) + \underbrace{\int (q^J + q^I - q^{IJ})^2 \, dx + 2 \int (q^J + q^I - q^{IJ})(p - p_k) \, dx}_{C(I,J)} + 2 \int (q^J + q^I - q^{IJ}) q^S \, dx,  (C.4)

and for the next merging hypotheses using

\mathrm{ISE}_{k+1}(H_{ST} | M = IJ) = \int (p - p_k + q^J + q^I - q^{IJ} + q^S + q^T - q^{ST})^2 \, dx
= \int (p - p_k + q^S + q^T - q^{ST})^2 \, dx + \int (q^J + q^I - q^{IJ})^2 \, dx + 2 \int (q^J + q^I - q^{IJ})(p - p_k + q^S + q^T - q^{ST}) \, dx
= \int (p - p_k + q^S + q^T - q^{ST})^2 \, dx + \int (q^J + q^I - q^{IJ})^2 \, dx + 2 \int (q^J + q^I - q^{IJ})(p - p_k) \, dx + 2 \int (q^J + q^I - q^{IJ})(q^S + q^T - q^{ST}) \, dx
= \mathrm{ISE}_k(H_{ST}) + C(I, J) + 2 \int (q^J + q^I - q^{IJ})(q^S + q^T - q^{ST}) \, dx.  (C.5)

When two components I and J are merged, the merged component labeled IJ obtains the label of component I in the computation environment, and all elements of Y_{k+1} corresponding to component J are eliminated. The vector Y_{k+1} should then be updated for the new component as in

\mathrm{ISE}_{k+1}(H_{(IJ)S} | M = IJ) = \int (p - p_k + q^J + q^I - q^{IJ} + q^S + q^{IJ} - q^{(IJ)S})^2 \, dx
= \int (p - p_k)^2 \, dx + \int (q^J + q^I + q^S - q^{(IJ)S})^2 \, dx + 2 \int (p - p_k)(q^J + q^I + q^S - q^{(IJ)S}) \, dx,  (C.6)

where the first term is known from the last reduction step.

Bibliography

D. Alspach and H. Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximations. Automatic Control, IEEE Transactions on, 17(4):439–448, 1972. ISSN 0018-9286. doi: 10.1109/TAC.1972.1100034.

T. Ardeshiri and T. Chen. Maximum entropy property of discrete-time stable spline kernel.
In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 3676–3680, April 2015. doi: 10.1109/ICASSP.2015.7178657.

T. Ardeshiri and E. Özkan. An adaptive PHD filter for tracking with unknown sensor characteristics. In Information Fusion (FUSION), 2013 16th International Conference on, pages 1736–1743, July 2013.

T. Ardeshiri, S. Kharrazi, J. Sjöberg, J. Bärgman, and L. M. Sensor fusion for vehicle positioning in intersection active safety applications. In International Symposium on Advanced Vehicle Control, 2006a.

T. Ardeshiri, S. Kharrazi, R. Thomson, and J. Bärgman. Offset eliminative map matching algorithm for intersection active safety applications. In Intelligent Vehicles Symposium, 2006 IEEE, pages 82–88, 2006b. doi: 10.1109/IVS.2006.1689609.

T. Ardeshiri, F. Larsson, F. Gustafsson, T. Schön, and M. Felsberg. Bicycle tracking using ellipse extraction. In Information Fusion (FUSION), 2011 Proceedings of the 14th International Conference on, pages 1–8, July 2011a.

T. Ardeshiri, M. Norrlöf, J. Löfberg, and A. Hansson. Convex optimization approach for time-optimal path tracking of robots with speed dependent constraints. In Proceedings of the 18th IFAC World Congress, Milan, Italy, pages 14648–14653, August 2011b.

T. Ardeshiri, U. Orguner, C. Lundquist, and T. Schön. On mixture reduction for multiple target tracking. In Information Fusion (FUSION), 2012 15th International Conference on, pages 692–699, July 2012.

T. Ardeshiri, K. Granström, E. Özkan, and U. Orguner. Greedy reduction algorithms for mixtures of exponential family. Signal Processing Letters, IEEE, 22(6):676–680, June 2015a. ISSN 1070-9908. doi: 10.1109/LSP.2014.2367154.

T. Ardeshiri, U. Orguner, and F. Gustafsson. Bayesian inference via approximation of log-likelihood for priors in exponential family. ArXiv e-prints, October 2015b. Submitted to Signal Processing, IEEE Transactions on.

T. Ardeshiri, U. Orguner, and E. Özkan.
Gaussian Mixture Reduction Using Reverse Kullback-Leibler Divergence. ArXiv e-prints, August 2015. To be submitted to Signal Processing, IEEE Transactions on.

T. Ardeshiri, E. Özkan, U. Orguner, and F. Gustafsson. Approximate Bayesian smoothing with unknown process and measurement noise covariances. To appear in Signal Processing Letters, IEEE, 2015.

T. Ardeshiri, E. Özkan, and U. Orguner. On reduction of mixtures of the exponential family distributions. Technical Report LiTH-ISY-R-3076, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden, August 2014. URL http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-100234.

C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 0387310738.

S. Blackman and R. Popoli. Design and analysis of modern tracking systems. Artech House radar library. Artech House, 1999. ISBN 9781580530064.

T. Chen, H. Ohlsson, and L. Ljung. On the estimation of transfer functions, regularizations and Gaussian processes - Revisited. Automatica, 48:1525–1535, 2012a.

T. Chen, T. Ardeshiri, F. P. Carli, A. Chiuso, L. Ljung, and G. Pillonetto. Maximum entropy properties of discrete-time first-order stable spline kernel. To appear in Automatica, 2015.

T. Chen, M. Andersen, L. Ljung, A. Chiuso, and G. Pillonetto. System identification via sparse multiple kernel-based regularization using sequential convex optimization techniques. Automatic Control, IEEE Transactions on, 59(11):2933–2945, Nov 2014. ISSN 0018-9286. doi: 10.1109/TAC.2014.2351851.

X. Chen, R. Tharmarasa, M. Pelletier, and T. Kirubarajan. Integrated clutter estimation and target tracking using Poisson point processes. Aerospace and Electronic Systems, IEEE Transactions on, 48(2):1210–1235, April 2012b. ISSN 0018-9251. doi: 10.1109/TAES.2012.6178058.

T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.

T. M.
Cover and J. Thomas. Elements of Information Theory. John Wiley and Sons, 2006.

M. Feldmann, D. Fränken, and W. Koch. Tracking of extended objects and group targets using random matrices. Signal Processing, IEEE Transactions on, 59(4):1409–1420, April 2011.

S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984. ISSN 0162-8828.

K. Granström and U. Orguner. Estimation and maintenance of measurement rates for multiple extended target tracking. In Information Fusion (FUSION), 2012 15th International Conference on, pages 2170–2176, July 2012a.

K. Granström and U. Orguner. On the reduction of Gaussian inverse Wishart mixtures. In Information Fusion (FUSION), 2012 15th International Conference on, pages 2162–2169, July 2012b.

W. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970. doi: 10.1093/biomet/57.1.97.

W. Hennevogl, L. Fahrmeir, and G. Tutz. Multivariate Statistical Modelling Based on Generalized Linear Models. Springer Series in Statistics. Springer New York, 2001. ISBN 9780387951874.

E. T. Jaynes. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9):939–952, 1982.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183–233, November 1999. ISSN 0885-6125. doi: 10.1023/A:1007665907178.

S. Kay. Fundamentals of Statistical Signal Processing: Detection theory. Prentice Hall signal processing series. Prentice-Hall PTR, 1998. ISBN 9780135041352.

G. Kitagawa. The two-filter formula for smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605–623, 1994.

M. Lifshits. Random Processes by Example. World Scientific Publishing Co. Pte. Ltd, 2014. ISBN 978-981-4522-28-1.

L. Ljung.
System Identification - Theory for the User. Prentice-Hall, Upper Saddle River, N.J., 2nd edition, 1999.

L. Ljung, H. Hjalmarsson, and H. Ohlsson. Four encounters with system identification. European Journal of Control, 17:449–471, 2011.

T. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-01), pages 362–369, San Francisco, CA, 2001. Morgan Kaufmann.

T. Minka. Divergence measures and message passing. Technical report, Microsoft Research Ltd., Cambridge, UK, 2005.

J. A. Nelder and R. W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society. Series A (General), 135(3):370–384, 1972. ISSN 00359238.

G. D. Nicolao, G. Ferrari-Trecate, and A. Lecchini. MAXENT priors for stochastic filtering problems. In Mathematical Theory of Networks and Systems, Padova, Italy, July 1998.

H. Nurminen, T. Ardeshiri, R. Piché, and F. Gustafsson. Robust inference for state-space models with skewed measurement noise. Signal Processing Letters, IEEE, 22(11):1898–1902, Nov 2015a. ISSN 1070-9908. doi: 10.1109/LSP.2015.2437456.

H. Nurminen, T. Ardeshiri, R. Piché, and F. Gustafsson. A NLOS-robust TOA positioning filter based on a skew-t measurement noise model. In 2015 International Conference on Indoor Positioning and Indoor Navigation (IPIN), Banff, Alberta, Canada, October 2015b.

A. Papoulis and S. Pillai. Probability, Random Variables, and Stochastic Processes. McGraw-Hill series in electrical engineering: Communications and signal processing. Tata McGraw-Hill, 2002. ISBN 9780070486584.

G. Pillonetto and G. D. Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81–93, 2010.

G. Pillonetto and G. D. Nicolao. Kernel selection in linear system identification. Part I: A Gaussian process perspective. In Proc.
50th IEEE Conference on Decision and Control, pages 4318–4325, Orlando, Florida, 2011.

G. Pillonetto, A. Chiuso, and G. D. Nicolao. Prediction error identification of linear systems: a nonparametric Gaussian regression approach. Automatica, 47(2):291–305, 2011.

G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung. Kernel methods in system identification, machine learning and function estimation: A survey. Automatica, 50(3):657–682, 2014.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

K. J. Åström. Introduction to stochastic control theory, volume 70 of Mathematics in Science and Engineering. Academic Press, New York, London, 1970. ISBN 0-12-065650-7.

H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009. ISSN 1467-9868.

A. Runnalls. Kullback-Leibler approach to Gaussian mixture reduction. Aerospace and Electronic Systems, IEEE Transactions on, 43(3):989–999, July 2007. ISSN 0018-9251. doi: 10.1109/TAES.2007.4383588.

D. J. Salmond. Mixture reduction algorithms for target tracking in clutter. In Proceedings of SPIE, Signal and Data Processing of Small Targets, volume 1305, pages 434–445, 1990.

D. G. Tzikas, A. C. Likas, and N. P. Galatsanos. The variational approximation for Bayesian inference. IEEE Signal Processing Magazine, 25(6):131–146, November 2008.

B.-N. Vo and W.-K. Ma. The Gaussian mixture probability hypothesis density filter. Signal Processing, IEEE Transactions on, 54(11):4091–4104, Nov. 2006. ISSN 1053-587X. doi: 10.1109/TSP.2006.881190.

M. Wainwright and M. Jordan. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning. Now Publishers, 2008. ISBN 9781601981844.

J. L. Williams and P. S. Maybeck.
Cost-function-based hypothesis control techniques for multiple hypothesis tracking. Mathematical and Computer Modelling, 43(9-10):976–989, May 2006. ISSN 08957177. doi: 10.1016/j.mcm.2005.05.022.

Part II
Publications

The articles associated with this thesis have been removed for copyright reasons. For more details about these see: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-121619

PhD Dissertations
Division of Automatic Control
Linköping University

M. Millnert: Identification and control of systems subject to abrupt changes. Thesis No. 82, 1982. ISBN 91-7372-542-0.
A. J. M. van Overbeek: On-line structure selection for the identification of multivariable systems. Thesis No. 86, 1982. ISBN 91-7372-586-2.
B. Bengtsson: On some control problems for queues. Thesis No. 87, 1982. ISBN 91-7372-593-5.
S. Ljung: Fast algorithms for integral equations and least squares identification problems. Thesis No. 93, 1983. ISBN 91-7372-641-9.
H. Jonson: A Newton method for solving non-linear optimal control problems with general constraints. Thesis No. 104, 1983. ISBN 91-7372-718-0.
E. Trulsson: Adaptive control based on explicit criterion minimization. Thesis No. 106, 1983. ISBN 91-7372-728-8.
K. Nordström: Uncertainty, robustness and sensitivity reduction in the design of single input control systems. Thesis No. 162, 1987. ISBN 91-7870-170-8.
B. Wahlberg: On the identification and approximation of linear systems. Thesis No. 163, 1987. ISBN 91-7870-175-9.
S. Gunnarsson: Frequency domain aspects of modeling and control in adaptive systems. Thesis No. 194, 1988. ISBN 91-7870-380-8.
A. Isaksson: On system identification in one and two dimensions with signal processing applications. Thesis No. 196, 1988. ISBN 91-7870-383-2.
M. Viberg: Subspace fitting concepts in sensor array processing. Thesis No. 217, 1989. ISBN 91-7870-529-0.
K. Forsman: Constructive commutative algebra in nonlinear control theory. Thesis No. 261, 1991. ISBN 91-7870-827-3.
F.
Gustafsson: Estimation of discrete parameters in linear systems. Thesis No. 271, 1992. ISBN 91-7870-876-1.
P. Nagy: Tools for knowledge-based signal processing with applications to system identification. Thesis No. 280, 1992. ISBN 91-7870-962-8.
T. Svensson: Mathematical tools and software for analysis and design of nonlinear control systems. Thesis No. 285, 1992. ISBN 91-7870-989-X.
S. Andersson: On dimension reduction in sensor array signal processing. Thesis No. 290, 1992. ISBN 91-7871-015-4.
H. Hjalmarsson: Aspects on incomplete modeling in system identification. Thesis No. 298, 1993. ISBN 91-7871-070-7.
I. Klein: Automatic synthesis of sequential control schemes. Thesis No. 305, 1993. ISBN 91-7871-090-1.
J.-E. Strömberg: A mode switching modelling philosophy. Thesis No. 353, 1994. ISBN 91-7871-430-3.
K. Wang Chen: Transformation and symbolic calculations in filtering and control. Thesis No. 361, 1994. ISBN 91-7871-467-2.
T. McKelvey: Identification of state-space models from time and frequency data. Thesis No. 380, 1995. ISBN 91-7871-531-8.
J. Sjöberg: Non-linear system identification with neural networks. Thesis No. 381, 1995. ISBN 91-7871-534-2.
R. Germundsson: Symbolic systems – theory, computation and applications. Thesis No. 389, 1995. ISBN 91-7871-578-4.
P. Pucar: Modeling and segmentation using multiple models. Thesis No. 405, 1995. ISBN 91-7871-627-6.
H. Fortell: Algebraic approaches to normal forms and zero dynamics. Thesis No. 407, 1995. ISBN 91-7871-629-2.
A. Helmersson: Methods for robust gain scheduling. Thesis No. 406, 1995. ISBN 91-7871-628-4.
P. Lindskog: Methods, algorithms and tools for system identification based on prior knowledge. Thesis No. 436, 1996. ISBN 91-7871-424-8.
J. Gunnarsson: Symbolic methods and tools for discrete event dynamic systems. Thesis No. 477, 1997. ISBN 91-7871-917-8.
M. Jirstrand: Constructive methods for inequality constraints in control. Thesis No. 527, 1998. ISBN 91-7219-187-2.
U.
Forssell: Closed-loop identification: Methods, theory, and applications. Thesis No. 566, 1999. ISBN 91-7219-432-4.
A. Stenman: Model on demand: Algorithms, analysis and applications. Thesis No. 571, 1999. ISBN 91-7219-450-2.
N. Bergman: Recursive Bayesian estimation: Navigation and tracking applications. Thesis No. 579, 1999. ISBN 91-7219-473-1.
K. Edström: Switched bond graphs: Simulation and analysis. Thesis No. 586, 1999. ISBN 91-7219-493-6.
M. Larsson: Behavioral and structural model based approaches to discrete diagnosis. Thesis No. 608, 1999. ISBN 91-7219-615-5.
F. Gunnarsson: Power control in cellular radio systems: Analysis, design and estimation. Thesis No. 623, 2000. ISBN 91-7219-689-0.
V. Einarsson: Model checking methods for mode switching systems. Thesis No. 652, 2000. ISBN 91-7219-836-2.
M. Norrlöf: Iterative learning control: Analysis, design, and experiments. Thesis No. 653, 2000. ISBN 91-7219-837-0.
F. Tjärnström: Variance expressions and model reduction in system identification. Thesis No. 730, 2002. ISBN 91-7373-253-2.
J. Löfberg: Minimax approaches to robust model predictive control. Thesis No. 812, 2003. ISBN 91-7373-622-8.
J. Roll: Local and piecewise affine approaches to system identification. Thesis No. 802, 2003. ISBN 91-7373-608-2.
J. Elbornsson: Analysis, estimation and compensation of mismatch effects in A/D converters. Thesis No. 811, 2003. ISBN 91-7373-621-X.
O. Härkegård: Backstepping and control allocation with applications to flight control. Thesis No. 820, 2003. ISBN 91-7373-647-3.
R. Wallin: Optimization algorithms for system analysis and identification. Thesis No. 919, 2004. ISBN 91-85297-19-4.
D. Lindgren: Projection methods for classification and identification. Thesis No. 915, 2005. ISBN 91-85297-06-2.
R. Karlsson: Particle Filtering for Positioning and Tracking Applications. Thesis No. 924, 2005. ISBN 91-85297-34-8.
J. Jansson: Collision Avoidance Theory with Applications to Automotive Collision Mitigation. Thesis No.
950, 2005. ISBN 91-85299-45-6. E. Geijer Lundin: Uplink Load in CDMA Cellular Radio Systems. Thesis No. 977, 2005. ISBN 91-85457-49-3. M. Enqvist: Linear Models of Nonlinear Systems. Thesis No. 985, 2005. ISBN 91-8545764-7. T. B. Schön: Estimation of Nonlinear Dynamic Systems — Theory and Applications. Thesis No. 998, 2006. ISBN 91-85497-03-7. I. Lind: Regressor and Structure Selection — Uses of ANOVA in System Identification. Thesis No. 1012, 2006. ISBN 91-85523-98-4. J. Gillberg: Frequency Domain Identification of Continuous-Time Systems Reconstruction and Robustness. Thesis No. 1031, 2006. ISBN 91-85523-34-8. M. Gerdin: Identification and Estimation for Models Described by Differential-Algebraic Equations. Thesis No. 1046, 2006. ISBN 91-85643-87-4. C. Grönwall: Ground Object Recognition using Laser Radar Data – Geometric Fitting, Performance Analysis, and Applications. Thesis No. 1055, 2006. ISBN 91-85643-53-X. A. Eidehall: Tracking and threat assessment for automotive collision avoidance. Thesis No. 1066, 2007. ISBN 91-85643-10-6. F. Eng: Non-Uniform Sampling in Statistical Signal Processing. Thesis No. 1082, 2007. ISBN 978-91-85715-49-7. E. Wernholt: Multivariable Frequency-Domain Identification of Industrial Robots. Thesis No. 1138, 2007. ISBN 978-91-85895-72-4. D. Axehill: Integer Quadratic Programming for Control and Communication. Thesis No. 1158, 2008. ISBN 978-91-85523-03-0. G. Hendeby: Performance and Implementation Aspects of Nonlinear Filtering. Thesis No. 1161, 2008. ISBN 978-91-7393-979-9. J. Sjöberg: Optimal Control and Model Reduction of Nonlinear DAE Models. Thesis No. 1166, 2008. ISBN 978-91-7393-964-5. D. Törnqvist: Estimation and Detection with Applications to Navigation. Thesis No. 1216, 2008. ISBN 978-91-7393-785-6. P-J. Nordlund: Efficient Estimation and Detection Methods for Airborne Applications. Thesis No. 1231, 2008. ISBN 978-91-7393-720-7. H. Tidefelt: Differential-algebraic equations and matrix-valued singular perturbation. Thesis No. 
1292, 2009. ISBN 978-91-7393-479-4. H. Ohlsson: Regularization for Sparseness and Smoothness — Applications in System Identification and Signal Processing. Thesis No. 1351, 2010. ISBN 978-91-7393-287-5. S. Moberg: Modeling and Control of Flexible Manipulators. Thesis No. 1349, 2010. ISBN 978-91-7393-289-9. J. Wallén: Estimation-based iterative learning control. Thesis No. 1358, 2011. ISBN 97891-7393-255-4. J. Hol: Sensor Fusion and Calibration of Inertial Sensors, Vision, Ultra-Wideband and GPS. Thesis No. 1368, 2011. ISBN 978-91-7393-197-7. D. Ankelhed: On the Design of Low Order H-infinity Controllers. Thesis No. 1371, 2011. ISBN 978-91-7393-157-1. C. Lundquist: Sensor Fusion for Automotive Applications. Thesis No. 1409, 2011. ISBN 978-91-7393-023-9. P. Skoglar: Tracking and Planning for Surveillance Applications. Thesis No. 1432, 2012. ISBN 978-91-7519-941-2. K. Granström: Extended target tracking using PHD filters. Thesis No. 1476, 2012. ISBN 978-91-7519-796-8. C. Lyzell: Structural Reformulations in System Identification. Thesis No. 1475, 2012. ISBN 978-91-7519-800-2. J. Callmer: Autonomous Localization in Unknown Environments. Thesis No. 1520, 2013. ISBN 978-91-7519-620-6. D. Petersson: A Nonlinear Optimization Approach to H2-Optimal Modeling and Control. Thesis No. 1528, 2013. ISBN 978-91-7519-567-4. Z. Sjanic: Navigation and Mapping for Aerial Vehicles Based on Inertial and Imaging Sensors. Thesis No. 1533, 2013. ISBN 978-91-7519-553-7. F. Lindsten: Particle Filters and Markov Chains for Learning of Dynamical Systems. Thesis No. 1530, 2013. ISBN 978-91-7519-559-9. P. Axelsson: Sensor Fusion and Control Applied to Industrial Manipulators. Thesis No. 1585, 2014. ISBN 978-91-7519-368-7. A. Carvalho Bittencourt: Modeling and Diagnosis of Friction and Wear in Industrial Robots. Thesis No. 1617, 2014. ISBN 978-91-7519-251-2. M. Skoglund: Inertial Navigation and Mapping for Autonomous Vehicles. Thesis No. 1623, 2014. ISBN 978-91-7519-233-8. S. 
Khoshfetrat Pakazad: Divide and Conquer: Distributed Optimization and Robustness Analysis. Thesis No. 1676, 2015. ISBN 978-91-7519-050-1.
