# Estimation sous contraintes de communication

THESIS submitted to obtain the degree of DOCTOR OF THE UNIVERSITÉ DE GRENOBLE
Speciality: signal, image, speech, telecommunications (SIPT)
Ministerial decree: 7 August 2006
Presented by Rodrigo CABRAL FARIAS
Thesis supervised by Jean-Marc BROSSIER
Prepared at the Grenoble images, parole, signal, automatique laboratory (GIPSA-lab), in the doctoral school of electronics, electrical engineering, automatic control and signal processing (EEATS)

**Estimation sous contraintes de communication : algorithmes et performances asymptotiques** (Estimation under communication constraints: algorithms and asymptotic performance)

Thesis publicly defended on 17/07/2013 before the jury composed of:

- Eric MOULINES, Professor, Télécom ParisTech, Reviewer
- Jean-Yves TOURNERET, Professor, ENSEEIHT, Reviewer
- Josep VIDAL, Professor, Universitat Politècnica de Catalunya, Examiner
- Jean-François BERCHER, Associate Professor, ESIEE, Examiner
- Eric MOREAU, Professor, Université du Sud Toulon-Var, Examiner
- Jean-Marc BROSSIER, Professor, Grenoble-INP, Thesis supervisor

## Acknowledgements

I would like to thank the Erasmus program Euro Brazilian Windows II for funding this thesis and the director of the GIPSA-lab, Jean-Marc Thiriet, for welcoming me into his laboratory.

I would like to express my gratitude to my thesis supervisor, Jean-Marc Brossier, for allowing me to be a free researcher during my thesis, carefully pointing out the mistakes in some of my strange ideas and motivating me to look deeper when the ideas were not that strange. Special thanks are extended to professors Eric Moisan, Laurent Ros, Olivier Michel and Steeve Zozor, who helped me countless times. I would also like to thank the members of my jury, Eric Moreau, Eric Moulines, Jean-Yves Tourneret, Josep Vidal and Jean-François Bercher, for their precious remarks on my work and for all the insights on how I can extend it.

During these three years, I bothered a non-negligible number of PhD students in the laboratory. I would like to acknowledge those who survived this torture for their patience.
Thanks to the survivors: Aude, Damien, Douglas, Gailene, Humberto, Jonathan, Robin, Wei, Xuan Vu and Zhong Yang. I am particularly grateful for the assistance given by Vio, who was able to encourage me to move forward even when I felt the pressure was excessive. To my family, mãe, Dea e vó, I can only say that it was very difficult to be so far away from you for such a long time.

## Contents

- Notations
- Abbreviations and acronyms
- Assumptions
- Introduction
- Part I: Estimation based on quantized measurements: algorithms and performance
  - 1 Estimation of a constant parameter
    - 1.1 Measurement model
    - 1.2 Maximum likelihood, Cramér–Rao bound and Fisher information
    - 1.3 Binary quantization
    - 1.4 Multibit quantization
    - 1.5 Adaptive quantizers: the high complexity fusion center approach
    - 1.6 Chapter summary and directions
  - 2 Estimation of a varying parameter
    - 2.1 Parameter and measurement model
    - 2.2 Optimal estimator
    - 2.3 Particle filtering
    - 2.4 Evaluation of the estimation performance
    - 2.5 Quantized innovations
    - 2.6 Chapter summary and directions
  - 3 Adaptive quantizers for estimation
    - 3.1 Parameter model and measurement model
    - 3.2 General estimation algorithm
    - 3.3 Estimation performance
    - 3.4 Optimal algorithm parameters and performance
    - 3.5 Simulations
    - 3.6 Adaptive quantizers for estimation: extensions
    - 3.7 Chapter summary and directions
  - Conclusions of Part I
- Part II: Estimation based on quantized measurements: high-rate approximations
  - 4 High-rate approximations of the FI
    - 4.1 Asymptotic approximation
    - 4.2 Bit allocation for scalar location parameter estimation
    - 4.3 Generalization with the f-divergence
    - 4.4 Chapter summary and directions
  - Conclusions of Part II
- Conclusions
  - Main conclusions
  - Perspectives
- A Appendices
  - A.1 Why? - Proofs
  - A.2 More? - Further details
  - A.3 How? - Algorithms and implementation issues
- B Résumé détaillé en français (extended abstract in French)
  - B.1 Introduction
  - B.2 Estimation et quantification : algorithmes et performances
  - B.3 Estimation et quantification : approximations à haute résolution
  - B.4 Conclusions
- Bibliography

## List of Figures

- 1 Estimation using a sensing system.
- 2 Scalar remote sensing problem.
- 3 Estimation based on quantized measurements.
- 1.1 Quantizer function Q(Yk) with NI quantization intervals and uniform threshold spacing of length ∆.
- 1.2 Scheme representing the general measurement/estimation system.
- 1.3 Quantity B related to the CRB for quantized measurements and its upper bound B̄.
- 1.4 Function M × δ².
- 1.5 PDF of the uniform/Gaussian distribution.
- 1.6 CRBq and simulated MLE MSE for uniform/Gaussian noise.
- 1.7 CRBq and simulated MLE MSE for GGD noise.
- 1.8 FI as a function of the normalized difference between the central threshold and the true parameter.
- 3.1 Scheme representing the adjustable quantizer.
- 3.2 Block representation of the estimation scheme.
- 3.3 ODE bias approximation and simulated bias for the estimation of a Wiener process with the adaptive algorithm.
- 3.4 Adaptive algorithm loss of estimation performance due to quantization of measurements.
- 3.5 Quantization performance loss for GGD noise and NB ∈ {2, 3, 4, 5} when Xk is constant.
- 3.6 Quantization performance loss for STD noise and NB ∈ {2, 3, 4, 5} when Xk is constant.
- 3.7 Simulated quantization performance loss for a Wiener process Xk with σw = 0.001.
- 3.8 Comparison of simulated and theoretical losses in the Gaussian and Cauchy noise cases when estimating a Wiener process with σw = 0.1 or σw = 0.001.
- 3.9 Comparison of simulated and theoretical losses in the Gaussian and Cauchy noise cases for estimating a Wiener process with constant mean drift.
- 3.10 Minimum CRB and simulated MSE for the adaptive algorithm with decreasing gain and for the adaptive algorithm based on the MLE.
- 3.11 Asymptotic MSE for the optimal estimator of a Wiener process with small σw and simulated MSE for the adaptive algorithm with constant gain and for the PF with dynamic central threshold.
- 3.12 Scheme representing the adjustable quantizer. The offset and gain are adjusted dynamically.
- 3.13 CRB for estimating a location parameter of Gaussian and Cauchy distributions based on quantized and continuous measurements, and simulated MSE for the estimation of the location parameter with the adaptive location-scale parameter estimator.
- 3.14 Scheme representing the sensor network with a fusion center.
- 3.15 Cramér–Rao bound and simulated MSE for the adaptive algorithm in the fusion center approach with different numbers of sensors and 4 quantization intervals.
- 3.16 Cramér–Rao bound and simulated MSE for the adaptive algorithm with different numbers of sensors and a fixed total number of bits.
- 4.1 Interval densities for the estimation of a GGD location parameter.
- 4.2 Simulated MSE for the adaptive algorithm with nonuniform thresholds for Gaussian and Cauchy measurement distributions.
- 4.3 Water-filling solutions for multicarrier modulation power allocation and for rate-constrained sensing system bit allocation.
- A.1 Geometric scheme showing that the probability of the interval A0 + A1 is less than the probability of the exterior region of the left quarter circle C1.
- A.2 Log-likelihood function for the Cauchy noise distribution.
- A.3 An iteration of the binary threshold update on a finite grid.

## List of Tables

- 4.1 FI for the estimation of Gaussian and Cauchy location parameters based on quantized measurements.
- 4.2 Functions characterizing the GFD for different inference problems, and interval densities maximizing the inference performance based on quantized measurements.

## Notations

**Sets**

- N: Natural numbers
- R: Real numbers
- Sub +: Set with only positive elements
- Sup ⋆: Set including zero

**Vectors and sequences**

- X (boldface): Vector or matrix
- Sup ⊤: Transposition
- diag(X): Diagonal matrix built from X
- X1:N: Sequence X1, X2, ..., XN

**Probability**

- X (uppercase): Random variable
- x (lowercase): Realization or parameter
- P(): Probability measure
- P(|): Conditional probability
- E(X): Expectation
- EX|(X): Conditional expectation
- Var(X): Variance
- VarX|(X): Conditional variance
- f() or p(): Probability density function
- f(|): Conditional density
- F(): Cumulative distribution function
- f(; x): Parametrization by x

Whenever the random variables related to a function are not clear from the context, they are indicated implicitly by the variable in the argument of the function.
**Main variables and parameters**

- X: Unknown parameter
- Y: Continuous measurement
- i: Quantized measurement
- X̂: Estimator
- V: Measurement noise
- W: Parameter variation
- L: Likelihood
- S: Score function
- I: Fisher information
- CRB: Cramér–Rao bound
- BCRB: Bayesian Cramér–Rao bound
- MSE: Mean squared error
- σw: Increments variance
- δ: Noise scale
- ε: Estimation error
- Quantizer shift
- k: Sample/time index
- u: Deterministic drift
- γ (scalar): Adaptive algorithm step
- Γ (matrix): Adaptive algorithm step
- η: Adaptive algorithm correction
- Lq: Loss due to quantization
- NI: Number of quantization intervals
- NB: Number of quantization bits
- N: Number of samples
- Ns: Number of sensors
- τ: Quantization threshold
- τ′: Threshold variation
- q: Quantization interval
- ∆: Interval length
- Quantizer input parameter
- λ(): Interval density

The subscript k can be used either to indicate one specific sample of a sequence or to make explicit that a quantity is a sequence. Most quantities related to continuous measurements have a subscript c and most quantities related to quantized measurements have a subscript q.

**Functions**

- ()⁻¹: Inverse of a function
- sign(): Sign function
- Γ(): Gamma function
- γ(,): Incomplete gamma function
- B: Beta function
- I(,): Incomplete beta function
- ()+: Ramp function

## Abbreviations and acronyms

- AUV: Autonomous underwater vehicle
- BCRB: Bayesian Cramér–Rao bound
- BI: Bayesian information
- CRB: Cramér–Rao bound
- CDF: Cumulative distribution function
- DSP: Digital signal processing
- FI: Fisher information
- GFD: Generalized f-divergence
- GGD: Generalized Gaussian distribution
- i.i.d.: Independent and identically distributed
- KLD: Kullback–Leibler divergence
- MAP: Maximum a posteriori estimator
- MLE: Maximum likelihood estimator
- MSE: Mean squared error
- MMSE: Minimum mean squared error
- ODE: Ordinary differential equation
- PF: Particle filter
- PDF: Probability density function
- r.v.: Random variable
- RHS: Right-hand side
- STD: Student's-t distribution
- w.r.t.: With respect to

## Assumptions

Assumptions (on the noise distribution):

- AN1: The marginal CDF of the noise, denoted $F$, admits a PDF $f$ w.r.t. the standard Lebesgue measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
- AN2: The PDF $f(v)$ is a strictly positive even function and it is strictly decreasing w.r.t. $|v|$.
- AN3: $F$ is locally Lipschitz continuous.

Assumptions (on the quantizer):

- AQ1: $N_I$ is considered to be an even natural number and the set $\mathcal{I}$ where $i_k$ is defined is
$$\mathcal{I} = \left\{ -\frac{N_I}{2}, \cdots, -1, 1, \cdots, \frac{N_I}{2} \right\}.$$
- AQ2: The quantizer is symmetric around the central threshold. This means that the vector of thresholds $\boldsymbol{\tau}$ is given by
$$\boldsymbol{\tau} = \left[ \tau_{-\frac{N_I}{2}} = \tau_0 - \tau'_{\frac{N_I}{2}} \;\; \cdots \;\; \tau_{-1} = \tau_0 - \tau'_1 \;\; \tau_0 \;\; \tau_1 = \tau_0 + \tau'_1 \;\; \cdots \;\; \tau_{\frac{N_I}{2}} = \tau_0 + \tau'_{\frac{N_I}{2}} \right]^\top,$$
with the threshold vector elements forming a strictly increasing sequence and the nonnegative vector of threshold variations w.r.t. the central threshold given by
$$\boldsymbol{\tau}' = \left[ \tau'_0 = 0 \;\; \tau'_1 \;\; \cdots \;\; \tau'_{\frac{N_I}{2}} = +\infty \right]^\top.$$
- AQ3: The quantizer output levels have odd symmetry w.r.t. $i$: $\eta_i = -\eta_{-i}$, with $\eta_i > 0$ for $i > 0$.

Modified assumptions (on the quantizer):

- AQ2': The quantizer is symmetric around the central threshold, which is equal to zero. This means that the vector of thresholds $\boldsymbol{\tau}$ is given by the vector of threshold variations
$$\boldsymbol{\tau} = \left[ -\tau'_{\frac{N_I}{2}} \;\; \cdots \;\; -\tau'_1 \;\; 0 \;\; +\tau'_1 \;\; \cdots \;\; +\tau'_{\frac{N_I}{2}} \right]^\top,$$
where the threshold variations $\tau'_i$ form an increasing sequence.
- AQ3': The quantizer output levels $\eta_x[i]$ are odd and the output levels $\eta_\delta[i]$ are even:
$$\eta_x[i] = -\eta_x[-i], \qquad \eta_\delta[i] = \eta_\delta[-i],$$
with $\eta_x[i] > 0$ for $i > 0$ and $\eta_\delta[1] < 0$.

Assumptions on $I_q$ for the MLE update to have asymptotically optimal performance:

- A1.MLE: $I_q(\varepsilon)$ is maximum at $\varepsilon = 0$.
- A2.MLE: $I_q(\varepsilon)$ is locally decreasing around zero.
- A3.MLE: The function $I_q(\varepsilon)$ has bounded $I_q(0)$, satisfies $\left.\frac{\mathrm{d} I_q(\varepsilon)}{\mathrm{d}\varepsilon}\right|_{\varepsilon=0} = 0$ and has bounded $\left.\frac{\mathrm{d}^2 I_q(\varepsilon)}{\mathrm{d}\varepsilon^2}\right|_{\varepsilon=0}$; it therefore admits a Taylor approximation around zero (for small $\varepsilon'$):
$$I_q\left(\varepsilon'\right) = I_q(0) + \frac{\varepsilon'^2}{2} \left.\frac{\mathrm{d}^2 I_q(\varepsilon)}{\mathrm{d}\varepsilon^2}\right|_{\varepsilon=0} + \circ\left(\varepsilon'^2\right),$$
where $\circ\left(\varepsilon'^2\right)$ denotes a quantity such that $\circ\left(\varepsilon'^2\right)/\varepsilon'^2$ tends to zero when $\varepsilon'$ tends to zero.

## Introduction

### Quantization: the stranger in the room

Open a book, any basic book on digital signal processing (DSP), and count the number of pages dedicated to the sampling theorem and discrete-time signal processing: FFT, Z-transform, FIR and IIR filtering. Now, count the number of pages dedicated to quantization. Even though half of the "digital world" comes from quantization, reading basic books on DSP gives the feeling that it is a completely unimportant subject.¹ A curious person might ask: is it really unimportant? Or is it simply so difficult to treat and to explain in an easy way that most DSP books skip a detailed description of quantization? We think the latter is the reason most texts presenting DSP assume that signals are quantized with very high resolution, which gives them the possibility of explaining quantization almost in a footnote. As a consequence, quantization seems to be the stranger who comes to the "DSP party" and with whom almost nobody wants to speak (even though it is one of the party organizers). Some signal processing domains find it useful (and in some circumstances they are not wrong) to refuse "contact" with quantization; whenever they need to address quantization issues, they call it by a derogatory name: "quantization noise". In this thesis we hope to make one of the subjects at the signal processing party "talk" with quantization in a polite way, without derogatory terms. The subject we chose is estimation. In the following, we explain the motivation and the main points of their "conversation".
¹ Note that the real problem of digitizing a signal by considering sampling and quantization as a joint operation is simply a non-issue in the signal processing literature. We do not study this problem in this thesis either, but it is an interesting problem.

### Sensor networks and quantization: the welcome guest

Although we do not explicitly design estimation algorithms using a sensor network architecture, this thesis is intended to contribute to the development of estimation techniques that can be applied or extended to sensor networks.

**Sensor network emergence.** With the reduction in cost and size of electronic devices such as sensors and transceivers, a whole new field has emerged under the name of sensor networks. This term generally denotes any set of sensors capable of communication and processing, used for a specific task, e.g. estimation, detection, tracking or classification. Sensor networks are attractive for many reasons [Akyildiz 2002], [Intanagonwiwat 2000], [Zhao 2004, pp. 7–8]:

- **Fault tolerance and flexibility.** By using multiple sensors to realize a sensing task, even if one of them is unable to measure, the other sensors guarantee that the sensing system keeps working. With proper design, the sensor network can reconfigure the way it operates, so that a failure of one sensor or of a small set of sensors does not strongly affect the performance of the sensing system.
- **Easy deployment.** The decreased cost of the sensors makes it possible to deploy large quantities of sensors in a given area without detailed placement. This simplifies the deployment of sensing systems in hard-to-access and hostile environments.
- **Risky environment sensing.** By allowing the sensors to communicate wirelessly, remote sensing can be done in areas where human activity is impossible or cannot be sustained for long periods of time.
- **No-maintenance sensing.**
The fault-tolerance capabilities of sensor networks allow them to be used in applications where maintenance of the sensing system is difficult.
- **Multi-hop communication.** By using the communication capabilities of the sensors for multi-hop communication, the total energy used in communication for the sensing task may decrease, as the attenuation of transmitted signals is smaller over shorter distances.
- **Enhanced signal-to-noise ratio.** In tracking or detection applications, the performance of the task normally depends on the signal-to-noise ratio of the measurements. If we consider that the measured signal attenuates with distance, then in a sensor network, where the density of sensors can be high, at least a few sensors can be expected to measure the signal with high signal-to-noise ratio, enhancing the final performance.

**Sensor network applications.** Based on the advantages of sensor networks presented above, a plethora of applications can be developed in many different domains [Arampatzis 2005], [Chong 2003], [Durisic 2012], [Puccinelli 2005]:

- **Environmental monitoring.** Habitat monitoring, bio-complexity mapping, weather forecasting and disaster prevention (volcanic eruptions, floods, earthquakes).
- **Agricultural monitoring.** Precision irrigation, fertilization and pest control.
- **Civil engineering.** Building automation, building emergency systems and structural health monitoring.
- **Urban monitoring.** Pollution monitoring, video surveillance and traffic control.
- **Health applications.** Monitoring of human physiological data, tracking of doctors and patients in a hospital.
- **Commercial applications.** Support for logistics, production surveillance and automation.
- **Military applications.** Self-healing landmines, soldier detection and tracking, shot origin information, perimeter protection; chemical, biological and explosive vapor detection; missile canister monitoring and blast localization.

**The need for quantization.**
Even if progress in sensor and communication technologies motivates the use of a large number of communicating sensors, practical considerations such as the use of non-replenishable energy sources (sensors are self-powered by batteries) and maximum size constraints impose three design constraints:

- **Energy constraint**: follows directly from the choice of a non-replenishable energy source for the sensors.
- **Rate constraint**: related to the fact that the communication channel bandwidth must be shared by a large number of sensors and that energy is also constrained. The energy spent in a sensor network can be divided mainly into three activities: sensing, communication and processing. Of these, communication is known to be the major energy consumer [Akyildiz 2002]. As bandwidth is constrained, the simplest way to reduce energy consumption is to find a way to achieve the same or a similar goal while communicating at a lower rate (number of bits per unit of time).
- **Complexity constraint**: although much less important for energy consumption, complexity both in terms of processing and memory must be small to keep the cost and size of the sensors small.

One way to address these problems is to have the sensors quantize their measurements before any other operation². This makes it possible to:

- reduce complexity, by using pre-stored tables for the computations and by bounding memory requirements;
- reduce the rate directly, by controlling the number of quantization intervals;
- reduce energy requirements, as a consequence of the reductions in complexity and rate.

These are the main reasons for studying quantization in this thesis.

### Different objectives and the scope of the thesis

In a sensing system the main task is to infer some information that is embedded in the measurements. The two main classes of inference problems studied in signal processing are detection and estimation.
The literature on the joint subjects, detection based on quantized measurements and estimation based on quantized measurements, is small compared with the literature on the separate subjects; however, as a consequence of the emergence of sensor networks, it is growing. Some references on these subjects are the following:

² We do not claim here that imposing quantization of the measurements is the optimal solution. In some cases, it can be shown that a completely analog scheme is optimal [Gastpar 2008].

- Detection: [Benitz 1989], [Gupta 2003], [Kassam 1977], [Longo 1990], [Picinbono 1988], [Poor 1977], [Poor 1988], [Tsitsiklis 1993], [Villard 2010], [Villard 2011].
- Estimation: [Aysal 2008], [Fang 2008], [Gubner 1993], [Luo 2005], [Marano 2007], [Papadopoulos 2001], [Poor 1988], [Ribeiro 2006a], [Ribeiro 2006b], [Ribeiro 2006c], [Wang 2010].

**Estimation based on quantized measurements.** As mentioned before, in this thesis we study the second of the subjects above, namely estimation based on quantized measurements. We start by explaining the general estimation problem in a sensing system; through a sequence of simplifications of the general problem, we arrive at the main scope of this thesis.

In the general scheme, each sensor measures a continuous-amplitude quantity X(i), processes its measurement locally and sends it to the point where the estimate will be evaluated. The point of evaluation can be a fusion center, one of the sensors or all sensors; in the last case, all sensors broadcast their processed measurements. This scheme is shown in Fig. 1. The quantity in this case can be a sequence of vectors, a sequence of scalars, a constant vector or a constant scalar.

[Figure 1 shows Ns sensors, each composed of sensing, processing and transmission blocks, sending their outputs X(1), ..., X(Ns) to a common estimation block producing X̂(1), ..., X̂(Ns).]

Figure 1: Estimation problem using a sensing system.
Multiple sensors send preprocessed information to the final estimator, which must recover the quantities of interest.

The first simplification we make is to consider only one of the terminals (sensors) in the sensing system; later, we may consider the problem with multiple terminals, but with the same quantity measured by all sensors. We will also consider that the quantity to be estimated is either a sequence of scalars or a single scalar. We use the notation Xk for the quantity to be estimated in both cases; k is the sample index and, in most cases, it will also be the discrete-time index. When Xk is a scalar constant, we have Xk = x. The simplified problem, which can also be called the scalar remote sensing problem, is depicted in Fig. 2.

[Figure 2 shows a single sensor composed of sensing, processing and transmission blocks, followed by an estimation block producing X̂k.]

Figure 2: Scalar remote sensing problem. A scalar single-terminal simplification of the problem depicted in Fig. 1.

The parameter Xk is measured with continuous-amplitude additive noise Vk. The continuous measurement will be denoted Yk = Xk + Vk. The estimation problem we mainly deal with here is location estimation, as Xk in this case is a location parameter characterizing the measurement distribution. Other technical considerations about the noise sequence will be presented later. At some points of the thesis we will not constrain Xk to be a location parameter and will let it be a general parameter.

Following the previous discussion on design constraints, the processing block is replaced by a scalar quantizer. Each noisy continuous measurement Yk thus generates a quantized measurement ik according to the quantizer function Q(). Each quantized measurement takes values in a finite set, so the rate (number of possible alphabet values per measurement) is fixed and known.
We suppose that the rate in bits per unit of time is chosen such that the transmission channel capacity is not exceeded; thus, by adding proper channel coding in the transmission block, we can consider the channel to be perfect. For each time k, we are interested in estimating Xk based on the set of past measurements i1, i2, ..., ik. The problem is depicted in Fig. 3.

[Figure 3 shows the measurement chain: Xk plus noise Vk gives Yk, which is quantized by Q(Yk) into ik, transmitted through a perfect channel and fed to the estimator g(i1, ..., ik), producing X̂k.]

Figure 3: Estimation based on quantized measurements. A parameter is measured with additive noise; the measurements are then quantized and transmitted through a perfect channel. Based on the past quantized measurements, the objective is to estimate Xk for each time k with the sequence of mappings g().

As shown in Fig. 3, we also consider that the quantizer structure can depend on the past quantized measurements.

**What we want to study.** We want to propose algorithms for estimating Xk based on ik. The parameter Xk, which will be detailed later, can be either a deterministic constant or a slowly varying random process.

After proposing the algorithms, we want to evaluate their performance. Given the algorithm performance, we want to study the effects of the quantizer function parameters (the quantization thresholds) and of the quantizer resolution (the number of quantization intervals or bits). To assess how quantization impacts estimation, we will also compare the estimation performance of the proposed algorithms with that of their continuous-measurement counterparts. The objective here is to estimate Xk based on interval information (we only know in which interval the measurement lies) about a noisy version of it.
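To make this chain concrete, the following sketch simulates it for a constant parameter Xk = x with unit Gaussian noise, a symmetric uniform quantizer, and a grid-search maximum likelihood estimate computed from the quantized measurements only. It is a minimal illustration, not one of the algorithms developed in this thesis; all numerical values (NI = 8, ∆ = 1, x = 0.3, the search grid) are arbitrary choices for the example.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

N_I = 8           # number of quantization intervals (illustrative choice)
HALF = N_I // 2
DELTA = 1.0       # uniform spacing of the inner thresholds
TAU0 = 0.0        # central threshold

def quantize(y):
    """Symmetric quantizer: index in {-N_I/2, ..., -1, 1, ..., N_I/2} (no 0)."""
    k = np.floor((y - TAU0) / DELTA).astype(int)
    i = np.where(k >= 0, k + 1, k)       # skip index 0, as in assumption AQ1
    return np.clip(i, -HALF, HALF)       # outer cells are unbounded

def cell_bounds(i):
    """Thresholds delimiting quantization cell i (relative to TAU0)."""
    lo = (i - 1) * DELTA if i > 0 else i * DELTA
    hi = lo + DELTA
    if i == -HALF:
        lo = -math.inf
    if i == HALF:
        hi = math.inf
    return lo, hi

def gauss_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def loglik(x, counts):
    """Log-likelihood of location x given quantized-cell counts (unit Gaussian noise)."""
    ll = 0.0
    for i, n in counts.items():
        lo, hi = cell_bounds(i)
        ll += n * math.log(gauss_cdf(hi - x) - gauss_cdf(lo - x))
    return ll

# Measurement model Y_k = x + V_k, then quantization i_k = Q(Y_k).
x_true = 0.3
i = quantize(x_true + rng.normal(size=5000))
counts = {int(v): int(c) for v, c in zip(*np.unique(i, return_counts=True))}

# MLE of x from the quantized measurements only, by grid search.
grid = np.linspace(-1.0, 1.0, 401)
x_hat = max(grid, key=lambda g: loglik(g, counts))
```

With 5000 samples the grid-search MLE lands close to the true value, even though the estimator only ever sees cell indices; how close, as a function of the thresholds and of NI, is exactly the kind of question studied in Part I.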
**What we do not want to study (and will not study).** We do not want to reconstruct the measurement Yk from the quantized measurements and then estimate Xk from the reconstructed measurements as if they were continuous. Doing so would simply concatenate the separately optimal solutions of reconstruction and estimation, which are well known. Nor do we want to model quantization as additive noise. We want to consider the problem in its true form, that is, to study how to exploit the information contained in intervals rather than in continuous values.

**What we want to study but will not study.** To specify the scope of the thesis precisely, we also have to state the problems we have consciously overlooked. Consciously overlooked means that, unlike the class of problems above, we would have liked to study them, but they are set aside to keep the subject simple. These subjects are: vector parameters and vector quantization, the presence of noisy channels (fading or additive) and channel coding, fast-varying signals, estimation of continuous-time signals, and Bayesian estimation of a random constant.

### Structure of the thesis and outline

This thesis consists of a general introduction, two parts and a general conclusion. Each part is divided into an introduction, chapters and a conclusion. The first part has three chapters; the second has a single chapter. Each chapter is subdivided into three parts: an introduction with the main contributions of the chapter, the main development, and a summary/conclusion with some directions for future work. The conclusions increase in level of detail in the order thesis, part, chapter: the thesis conclusion is a general overview, the part conclusion presents the points we think must be retained without explaining the technical details, and the chapter summary is a detailed account of the points observed in the chapter. The thesis outline is the following:

- **Part I**: a study of algorithms and performance for estimation based on quantized measurements.
  - **Chapter 1**: the main details on the quantizer structure and the noise are given. The fundamental algorithms and performance results for the estimation of a deterministic scalar constant parameter are presented. Algorithms for both static and adaptive quantization are studied.
  - **Chapter 2**: the time-varying-parameter counterpart of Ch. 1 is presented. We consider the parameter to be a slowly varying scalar Wiener process and we present Bayesian algorithms for tracking it.
  - **Chapter 3**: low complexity algorithms are proposed as alternatives to those presented in Ch. 1 and 2. We also study some extensions of the scalar location problem: one where the noise scale parameter is unknown and one with multiple sensors.
- **Part II**: a high-resolution (high-rate) approximate analytical expression for the estimation performance.
  - **Chapter 4**: an open problem from Part I is how to fully set the key quantizer parameters so that estimation performance is maximized. In this chapter, we study how to solve this problem approximately through high-resolution approximations (small quantization intervals). We give a practical solution for obtaining the optimal quantizer and the corresponding asymptotic estimation performance.

Each part begins with an example, which can be seen as background for the presentation of the problem. The examples serve only presentation purposes and their specific subjects (water management and deep-sea water mining) are not the main subject of this work. The appendices of this thesis are divided into three parts: one presenting proofs considered not important to develop in the main text (Why? - App. A.1), one giving more details about a subject (More? - App. A.2) and one explaining some implementation issues (How? - App. A.3).
When defining a new abbreviation or acronym, we write the expression in boldface with the abbreviation in parentheses (). When citing a reference that was already cited similarly elsewhere, we write the reference followed by the work where it was cited as (cited in ...).

Publications

During this thesis, three papers were presented in international conferences:
• Rodrigo C. Farias and Jean-Marc Brossier, "Adaptive Estimation Based on Quantized Measurements", IEEE International Conference on Communications (ICC), 2013, Budapest, Hungary.
• Rodrigo C. Farias and Jean-Marc Brossier, "Adjustable Quantizers for Joint Estimation of Location and Scale Parameters", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, Vancouver, Canada.
• Rodrigo C. Farias and Jean-Marc Brossier, "Asymptotic Approximation of Optimal Quantizers for Estimation", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, Vancouver, Canada.
One paper was accepted for presentation in a French conference:
• Rodrigo C. Farias and Jean-Marc Brossier, "Quantification asymétrique optimale pour l’estimation d’un paramètre de centrage dans un bruit de loi symétrique", Colloque GRETSI, 2013, Brest, France,
and one article was published:
• Rodrigo C. Farias and Jean-Marc Brossier, "Adaptive quantizers for estimation", Signal Processing, Elsevier, vol. 93, November 2013.

Part I
Estimation based on quantized measurements: algorithms and performance

"A word to the wise is enough" - popular saying.

Motivation

As a background to introduce this part, we start with an application example. A recent trend of placing water as a key element in government strategic decisions (including possible future military interventions) led to the choice of the motivational example. Agriculture is responsible for 70% of freshwater withdrawals.
Food production satisfying the daily caloric needs of one person consumes 3000 liters of water, a very large quantity compared with the 2-5 liters used for drinking. Add to these ingredients the facts that the world population is growing and that a large part of the population is changing its diet, consuming more meat and vegetables and therefore even more water [Molden 2007], and we have a possible recipe for future water scarcity. One possible policy for preventing future water scarcity is to develop or improve irrigation systems in developing countries, where water use efficiency is very low [Molden 2007]. To do so, accurately measuring the soil moisture of crop fields is a main issue. Thus, as a background scenario for introducing this chapter, we will consider the problem of estimating the moisture level of crop fields.

Consider that each of multiple crop field areas has a set of sensors providing noisy measurements of some quantities related to soil moisture. All the data are transmitted to a central processor which, after estimating the moisture levels, decides which crops must be irrigated. As the number of sensed areas can be large, for example when the irrigation system is integrated over an entire geographic region, quantization will be applied to respect communication constraints. The solution to this problem can be simplified by assuming that the decision (control) part of the problem can be decoupled from the estimation part. We will focus here only on the estimation part. In a first approach, we can assume that the moisture levels are unknown deterministic scalars, unrelated from one region to the other, and that they are approximately constant over a block of N independent measurements. If humidity sensors are used, from the symmetry of the problem and the assumption that the moisture levels are unrelated, the joint estimation problem of all levels decouples into many scalar estimation problems with identical general form.
This general form is the following:

(a) Estimate a constant scalar location parameter x, based on N independent noisy measurements

Y1:N = {Y1 = x + V1, · · · , YN = x + VN},

which are scalarly quantized with a quantizer function Q (to be defined later)

i1:N = {i1 = Q(Y1), · · · , iN = Q(YN)}.

A more detailed meaning for "estimate" is:
(1) Give an analytical form or a procedure describing the parameter estimator X̂.
(2) Give the estimation performance, or an approximation of it, as a function of
• the number of measurements;
• the noise characteristics;
• the quantizer function.

After giving a solution for this problem, we may be interested in considering a more complex model for x; for instance, instead of considering it a constant, we can assume that it varies randomly with time. A simple dynamical model is

Xk = Xk−1 + Wk,

where k is the discrete-time index and Wk is an independent and identically distributed (i.i.d.) Gaussian process with zero mean and variance σw². Thus, if X0 is Gaussian, Xk is a Gaussian process known as a discrete-time Wiener process or a discrete-time random walk. This type of process is commonly used to describe slowly varying parameters whose evolution is random but of unknown form. A reason to use this model is that by constraining the increments to be Gaussian distributed, a minimal quantity of information (in the information theoretic sense) is imposed for a given increment variance. Now, suppose we have statistics about precipitation on the crop field region, for example its average, and we also know the last quantities of water irrigated on the field and how to relate both precipitation and irrigated water to the average increase in moisture level, denoted uk. This allows us to use a more precise dynamical model for Xk, using as increments Gaussian random variables (r.v.) with mean uk.
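To make the dynamical model concrete, here is a minimal simulation sketch of the drifting random walk and its noisy measurements. All numerical values (σw, σv, the drift sequence uk) are hypothetical choices of ours, not values from the text:

```python
import random

random.seed(0)
sigma_w = 0.05   # std of the Wiener increments W_k (hypothetical value)
sigma_v = 1.0    # std of the measurement noise V_k (hypothetical value)
u = [0.1] * 100  # known deterministic drift u_k (e.g. rain + irrigation input)

x = 0.0          # X_0
states, measurements = [], []
for k in range(100):
    x = x + u[k] + random.gauss(0.0, sigma_w)   # X_k = X_{k-1} + u_k + W_k
    y = x + random.gauss(0.0, sigma_v)          # Y_k = X_k + V_k
    states.append(x)
    measurements.append(y)
```

The quantizer Q, defined in Chapter 1, would then be applied sample by sample to the sequence of Yk.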
Consequently, our model becomes a discrete-time Wiener process with a deterministic drift. The objective is the same as before: estimate Xk based on the scalarly quantized Yk. However, the relation between measurements can now be exploited. Instead of considering the static estimation problem for separate blocks of measurements, we can use all past measurements in the estimation of the varying parameter, under the constraint that the parameter evolution must follow the dynamical model. Therefore, we are also interested in solving the following problem:

(b) Estimate a varying random parameter Xk at time k based on the past and present scalarly quantized measurements i1:k.

This is a filtering problem, as estimates depend not only on past measurements but also on the present measurement [Jazwinski 1970]. The problems of estimation based on past measurements only, i.e. prediction, and of estimation based on additional future measurements, i.e. smoothing, will not be treated in this thesis.

Outline for this part

For the problem at hand, 3 types of model with increasing complexity can be considered; these 3 models are related to the two estimation problems (a) and (b) as shown below:
• Constant model → (a) location parameter estimation;
• Scalar Wiener process model → (b) filtering;
• Scalar Wiener process model with drift → (b) filtering.

Many other practical estimation problems rely on the models presented above and consequently can be cast as (a) or (b). We will now look for their solutions. First, we will present algorithms and performance for the estimation of a constant location parameter. We will study maximum likelihood estimators and their asymptotic performance through the Cramér–Rao bound. We will see that estimation performance is sensitive to the distance between the quantizer dynamic range and the parameter. For commonly used noise models, we will see that estimation performance actually degrades when the dynamic range is far from the parameter.
As a solution, we will search for adaptive schemes that place the dynamic range close to the parameter. We will show that in the binary case the asymptotically optimal adaptive algorithm has a simple recursive form. After that, we will focus on filtering. A general solution using recursive integral expressions will be given. As this solution is analytically intractable, an approximate solution based on sequential Monte Carlo methods (particle filtering) will be considered. Its performance will be assessed through a lower bound, the Bayesian Cramér–Rao bound. Then, by analyzing the bound, we will see that a good estimation scheme can be obtained by quantizing the measurement prediction error, usually called the innovation. We will show that the asymptotically optimal filter based on the quantized innovation is also given in a simple recursive form when the parameter varies slowly. Motivated by the recursive forms that are obtained asymptotically, both in the constant and varying parameter cases, we will present a low complexity adaptive algorithm for estimation using quantized measurements. The estimation performance and the optimal algorithm parameters will be obtained for the constant and Wiener process models. Extensions of the algorithm to the cases where multiple sensors with a fusion center are used and where the noise scale factor (a measure of its amplitude) is unknown will also be obtained. At the end of this part, some conclusions will be drawn on the overall aspects of estimation based on quantized measurements.

Chapter 1
Estimation of a constant parameter: what is done and a little more

In this chapter we study the problem of estimating a constant location parameter based on quantized measurements. We start the chapter with the measurement model, which is mainly the noise model and the definition of the quantizer.
The first sections of the chapter deal with a fixed quantizer structure (fixed quantization thresholds), while in the last sections we present estimation schemes with an adaptive quantizer structure. In the part concerning the fixed quantizer structure, we start by giving a general estimation algorithm based on the maximum likelihood method. Its performance is given in terms of the Cramér–Rao bound. Then, we study the general effects of quantization on estimation performance. This is done through the analysis of the Cramér–Rao bound, a quantity that is directly related to the Fisher information. We also analyze the performance of binary and multibit quantization as a function of the quantizer tuning parameter. We give a detailed implementation of the maximum likelihood estimator for general noise distributions in the binary case, while in the multibit case the maximum likelihood estimator is detailed for a more restricted class of noise distributions, more precisely log-concave distributions. As a main result of the performance analysis for the fixed threshold scheme, we will see that, for commonly used noise models, the estimation performance degrades as the quantizer dynamic range moves away from the true parameter. This motivates the study of estimation schemes that adaptively place the quantizer dynamic range close to the true parameter. We study two adaptive schemes: one based on a simple update of the quantizer main parameter, with the final estimate given by maximum likelihood estimation, and the other based on using the last maximum likelihood estimate as the quantizer main parameter. Their performance is also given in terms of the Cramér–Rao bound. We will also see that the estimator based on the maximum likelihood threshold update is asymptotically equivalent to a low complexity recursive algorithm. We finish this chapter with a summary of the main points that were studied and with directions for further research.
These directions point to further work that is presented in other chapters or that will be studied in the future.

Contributions presented in this chapter:
• Global and local analysis for binary quantization. Reading the literature on the subject carefully, one gets the impression that setting the quantization threshold at the true parameter value is optimal for symmetric distributions [Wang 2010, p. 265]. But this assertion is actually false. We present here global and local conditions on the noise distribution that guarantee that this threshold value (equal to the parameter value) is indeed optimal.
• Asymmetric threshold case. Differently from the literature, where only symmetric cases are shown, we show some cases where the noise distribution is symmetric and yet the optimal quantization threshold is not the median.
• Laplacian noise. In the literature, most of the analysis is focused on the Gaussian noise case, where, as expected, quantization strictly decreases estimation performance. Here, we also study the Laplacian case. The Laplacian case is easier to analyze and it is a nice counterexample to the intuition that quantization strictly decreases estimation performance (see p. 55).
• Adaptive binary quantization scheme on a finite grid. We present a method to obtain the asymptotic threshold probabilities in the adaptive binary threshold scheme (see (More? - App. A.2.4)). Differently from the method presented in [Fang 2008], where a truncation approximation is used, in the method presented here we define boundaries on the possible threshold values so that the number of threshold values is finite and the asymptotic probabilities can be evaluated analytically.
• Multibit adaptive scheme based on the maximum likelihood estimator and its convergence.
We extend the binary adaptive scheme presented in [Fang 2008] to the multibit case and we also extend its proof of convergence to the general multibit non-Gaussian case.
• Asymptotic binary adaptive scheme based on the MLE. We give a less heuristic proof that the adaptive quantization scheme based on the maximum likelihood estimator asymptotically has a simple recursive form.

Contents
1.1 Measurement model . . . . . 34
  1.1.1 Noise model . . . . . 34
  1.1.2 Quantization model . . . . . 35
1.2 Maximum likelihood, Cramér–Rao bound and Fisher information . . . . . 37
  1.2.1 Maximum likelihood estimator . . . . . 38
  1.2.2 Cramér–Rao bound and the Fisher information . . . . . 39
  1.2.3 Quantization loss . . . . . 41
1.3 Binary quantization . . . . . 44
  1.3.1 The Gaussian case . . . . . 44
  1.3.2 The Laplacian case . . . . . 45
  1.3.3 The general case . . . . . 46
  1.3.4 Asymmetric threshold: surprising cases . . . . . 48
  1.3.5 Conclusions on binary quantization performance . . . . . 52
  1.3.6 MLE for binary quantization . . . . . 53
1.4 Multibit quantization . . . . . 54
  1.4.1 The Laplacian case . . . . . 55
  1.4.2 The Gaussian and Cauchy cases under uniform quantization . . . . . 56
  1.4.3 Summary of the main points . . . . .
58
  1.4.4 MLE for multibit quantization with fixed thresholds . . . . . 58
1.5 Adaptive quantizers: the high complexity fusion center approach . . . . . 60
  1.5.1 MLE for the adaptive binary scheme . . . . . 61
  1.5.2 Performance for the adaptive binary scheme . . . . . 62
  1.5.3 Adaptive scheme based on the MLE . . . . . 66
  1.5.4 Performance for the adaptive multibit scheme based on the MLE . . . . . 66
  1.5.5 Equivalent low complexity asymptotic scheme . . . . . 70
1.6 Chapter summary and directions . . . . . 73

1.1 Measurement model

We start by explaining the measurement model. The unknown deterministic scalar constant parameter to be estimated is x ∈ R and it is measured N times, N ∈ N⋆, with i.i.d. additive noise Vk. For k ∈ {1, · · · , N} the continuous measurements are

Yk = x + Vk. (1.1)

1.1.1 Noise model

The continuous sequences of r.v. Yk and Vk are defined on the probability space P = (Ω, F, P) with values in (R, B(R)). For simplification purposes, the following hypotheses on the noise distribution will be considered:

Assumptions (on the noise distribution):
AN1 The marginal cumulative distribution function (CDF) of the noise, denoted F, admits a probability density function (PDF) f with respect to (w.r.t.) the standard Lebesgue measure on (R, B(R)).
AN2 The PDF f(v) is a strictly positive even function and it strictly decreases w.r.t. |v|.

Assumption AN1 is a commonly used assumption that in practice will be invoked when the derivative of F w.r.t. its argument is needed. AN2 means that the noise distribution is unimodal and symmetric around zero; it will be used for the following reasons:

1. The unimodal behavior of the noise will allow us to obtain a general qualitative characterization of estimation performance as a function of the quantization parameters.
More precisely, it will be observed that for unimodal densities very poor estimation performance occurs for quantizers whose dynamic range is far away from x.
2. It will be used as a condition for the convergence of some new adaptive estimation algorithms presented in this thesis.
3. In the absence of physical constraints (e.g. positivity), there should be no reason for the components of the noise to be asymmetric. Thus, if we consider the noise to be a normalized sum of an infinite number of symmetric i.i.d. r.v. (an infinite sum of small perturbations), then it is known that the resulting noise r.v. distribution is a symmetric stable distribution [Samorodnitsky 1994], which is unimodal. Even if not all unimodal symmetric distributions are stable, the generalized central limit theorem above serves as an additional motivation.

1.1.2 Quantization model

For the reasons presented in the Introduction and in the motivational example given above, the measurements are quantized. We will consider that they are scalarly quantized, which means that each measurement is quantized separately from the others. The quantizer output can be written as

ik = Q(Yk), (1.2)

where ik is a value from a finite subset I of R with NI elements. Due to notation issues, we denote both the quantized measurement random variable and its realization with lowercase i. NI is the number of quantization intervals. A simple example of a quantizer Q with uniform threshold spacing is given in Fig. 1.1.

[Figure 1.1 here: staircase input–output characteristic of Q, with thresholds τ0, τ0 ± ∆, τ0 ± 2∆, ..., extreme thresholds τ±NI/2 = ±∞ and output indexes ±1, ..., ±NI/2.]
Figure 1.1: Quantizer function Q(Yk) with NI quantization intervals and uniform threshold spacing of length ∆. The number of quantization intervals NI is even, the quantizer is symmetric around the central threshold τ0 and the output indexes are integers without the zero.
Except for the uniform thresholds, Fig. 1.1 shows the main elements of the general quantization model that will be used:

• The number of quantization intervals NI will be an even number; this leads to a clearer presentation, as in each analysis we will not need to deal with an additional central interval.
• The outputs of the quantizer will be defined on a set of integers from −NI/2 to NI/2, without zero. This will simplify the notation of the algorithms presented later. Note that, as we consider that the output of the quantizer is obtained without additional noise (it passes through a noiseless channel), the assignment of the output values ik is not important, as long as the assigned values are different. For estimation purposes, only a label is needed at the quantizer output. The estimator, or parts of the estimation procedure, will carry out the role played by the output quantization levels in standard quantization, by generating estimates (values) based on the information from the intervals (indicated by the labels) where the continuous measurements lie. Observe that if we introduce in the model a noisy communication channel and constraints on transmission power, the assignment of the output values becomes important. As stated in the Introduction, we are not going to consider this model in this thesis, but we keep this extended problem as a possibility for future work.
• The quantizer is defined by NI + 1 thresholds τi, which can be separated into three types: one central threshold τ0, NI/2 − 1 thresholds larger than τ0 plus an additional threshold at +∞, and NI/2 − 1 thresholds smaller than τ0 plus an additional threshold at −∞. We will consider the non central thresholds to be symmetric w.r.t. τ0; thus, for example, the threshold τi is given by τ0 plus a variation τi′ and the threshold τ−i is given by τ0 minus the same variation.
In the figure, the variations are integer multiples of ∆, which corresponds to uniform quantization. In general, we will not impose uniform quantization. The assumption on the symmetry of the quantizer is difficult to justify at this point, but the main idea is the following: as will be shown further on, for commonly used noise models the best central threshold for estimation purposes is exactly x; thus if we set τ0 = x, from the assumption of noise symmetry, it seems reasonable to assume that a good quantizer is symmetric. In Part II, it will be shown that for large NI the optimal quantizer is indeed symmetric around τ0 for symmetric noise distributions. The infinite values assigned to the extreme positive and negative thresholds allow the same notation to be used for the probabilities of the granular region (inside the quantizer input dynamic range) and of the overload region (outside the quantizer input dynamic range).

From Fig. 1.1 and the explanations above, the quantizer function can be described as follows: if a measurement Yk ≥ τ0 falls in the quantization interval qi = [τi−1, τi), then the output will be i. Otherwise, if Yk < τ0 and it falls in q−i = [τ−i, τ−i+1), then the quantizer output will be −i. As an example, consider a uniform quantizer with 16 quantization levels, τ0 = 0 and uniform quantization step-length ∆ = 1; then for the input

y1:10 = {−20, −8.5, −3.4, −5.6, −0.1, 0.7, 3.2, 10.7, 7.1, −2.3},

we obtain

y1 = −20 → i1 = −8, y2 = −8.5 → i2 = −8, y3 = −3.4 → i3 = −4, y4 = −5.6 → i4 = −6, y5 = −0.1 → i5 = −1,
y6 = 0.7 → i6 = 1, y7 = 3.2 → i7 = 4, y8 = 10.7 → i8 = 8, y9 = 7.1 → i9 = 8, y10 = −2.3 → i10 = −3.

Observe that by using the threshold variations τi′, we can write the input–output relation in a more compact way:

ik = i sign(Yk − τ0), for |Yk − τ0| ∈ [τ′i−1, τ′i).
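The worked example above can be reproduced with a short sketch of the uniform quantizer of Fig. 1.1 (function and variable names are ours):

```python
import math

def quantize(y, tau0=0.0, delta=1.0, n_intervals=16):
    """Quantizer of Fig. 1.1: uniform threshold spacing delta, central
    threshold tau0, outputs in {-NI/2, ..., -1, 1, ..., NI/2} (no zero)."""
    half = n_intervals // 2
    i = int(math.floor(abs(y - tau0) / delta)) + 1  # interval index counted from tau0
    i = min(i, half)                                # overload region -> extreme index
    return i if y >= tau0 else -i

y_inputs = [-20, -8.5, -3.4, -5.6, -0.1, 0.7, 3.2, 10.7, 7.1, -2.3]
print([quantize(y) for y in y_inputs])
# [-8, -8, -4, -6, -1, 1, 4, 8, 8, -3]
```

Non-uniform quantizers only differ in how the index is looked up from the threshold variations τ′, not in this labeling convention.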
(1.3)

Note that the index k here is the time or sample index; it is not the particular value of i. Before proceeding, we state explicitly the assumptions on the quantizer.

Assumptions (on the quantizer):
AQ1 NI is considered to be an even natural number and the set I where ik is defined is

I = {−NI/2, · · · , −1, 1, · · · , NI/2}.

AQ2 The quantizer is symmetric around the central threshold. This means that the vector of thresholds τ is given by (⊤ is the transpose operator)

τ = [τ−NI/2 = τ0 − τ′NI/2, · · · , τ−1 = τ0 − τ′1, τ0, τ1 = τ0 + τ′1, · · · , τNI/2 = τ0 + τ′NI/2]⊤,

with the threshold vector elements forming a strictly increasing sequence, and the nonnegative vector of threshold variations w.r.t. the central threshold is

τ′ = [τ′0 = 0, τ′1, · · · , τ′NI/2 = +∞]⊤.

1.2 Maximum likelihood, Cramér–Rao bound and Fisher information

We want to estimate x based on i1:N = {i1, · · · , iN} (problem (a)). For doing so, we will look for an estimator X̂(i1:N) - which is a r.v., as it is a function of r.v. - that must be as close as possible to x. In our case, we choose the quantitative meaning of "as close as possible" to be with minimum (or small) mean squared error (MSE):

MSE = E[(X̂ − x)²], (1.4)

where E is the expectation w.r.t. the joint distribution of the noise. The MSE is a commonly used performance criterion for estimation problems. Although it is widely used, it has the inconvenience that it is impossible to find in a general form the X̂ minimizing it by direct analytical minimization [Van Trees 1968, p. 64].

1.2.1 Maximum likelihood estimator

A common solution for this problem is to suppose that N is large (in theory N must tend to infinity) and that X̂ is constrained to be unbiased, which means

E[X̂] = x;

in this case, the optimal X̂ minimizing the MSE is known to be the maximum likelihood estimator (MLE) [Kay 1993, p. 160].
The MLE consists of maximizing the likelihood function, which is the joint distribution of the measurements considered with the measurements held fixed and the parameter x variable¹. For the estimation problem considered here, the likelihood for an independent block of measurements i1:N is

L(x; i1:N) = ∏_{k=1}^{N} P(ik; x), (1.5)

where P(ik; x) is the probability of having a quantizer output ik at time k for a parameter x. This probability can be rewritten using the noise CDF and the thresholds:

P(ik; x) = P(τik−1 ≤ Yk < τik), if ik > 0,
P(ik; x) = P(τik ≤ Yk < τik+1), if ik < 0;

using the definition Yk = x + Vk given by (1.1),

P(ik; x) = F(τik − x) − F(τik−1 − x), if ik > 0,
P(ik; x) = F(τik+1 − x) − F(τik − x), if ik < 0. (1.6)

The MLE is the value of x maximizing L(x; i1:N) for a given i1:N:

X̂ML,q = X̂ML(i1:N) = argmax_x L(x; i1:N). (1.7)

The subscript q is used to make explicit that the estimation is done with quantized measurements. As the logarithm is a strictly increasing function on R⋆₊ and most used likelihood functions are given in exponential form, it is common to solve an equivalent maximization problem:

X̂ML,q = argmax_x log L(x; i1:N).

¹ Clearly, this is an inversion of roles from the modeling point of view and this is the main reason why we do not call the likelihood function simply the joint PDF.

1.2.2 Cramér–Rao bound and the Fisher information

The MLE is the procedure to find the estimate; we still need its performance. Unfortunately, no finite sample (finite N) performance results are available for the MLE. We will therefore focus on asymptotic results, for which, in some sense, as stated before, the MLE is optimal. The MSE for the MLE can be written as

E[(X̂ML,q − x)²] = [E(X̂ML,q) − x]² + Var(X̂ML,q) = bias² + variance.
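As an illustration of (1.5)-(1.7) in the simplest setting, binary quantization (NI = 2, single threshold τ0) with standard Gaussian noise, the log-likelihood depends only on the counts of outputs above and below τ0, and the maximizer can be found numerically. This is our own sketch (a crude grid search with arbitrary parameter values), not the implementation detailed later in the chapter:

```python
import math
import random

def F(v):
    """Standard Gaussian noise CDF."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def log_likelihood(x, bits, tau0):
    """log L(x; i_1:N) of (1.5) for binary outputs i_k = sign(Y_k - tau0)."""
    p_above = 1.0 - F(tau0 - x)          # P(i_k = +1; x), from (1.6)
    n_above = sum(1 for b in bits if b > 0)
    n_below = len(bits) - n_above
    return n_above * math.log(p_above) + n_below * math.log(1.0 - p_above)

random.seed(0)
x_true, tau0, N = 0.3, 0.0, 2000
bits = [1 if x_true + random.gauss(0.0, 1.0) >= tau0 else -1 for _ in range(N)]

# crude grid search for the maximizer (1.7); a numerical solver would be used in practice
grid = [i / 1000.0 for i in range(-2000, 2001)]
x_ml = max(grid, key=lambda x: log_likelihood(x, bits, tau0))
```

In this binary case the maximizer also has the closed form x̂ = τ0 − F⁻¹(n_below/N), which the grid search approximates.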
As stated, the MLE is asymptotically unbiased:

E[X̂ML,q] → x as N → ∞. (1.8)

Therefore, it is characterized asymptotically only by its variance. The Cramér–Rao bound (CRB) is a lower bound on the variance of any unbiased estimator [Kay 1993, p. 30], and the bound is valid even for finite N. Under some regularity conditions, the asymptotic variance of the MLE is known to be minimum and it attains the CRB [Kay 1993, p. 160]:

Var(X̂ML,q) ∼_{N→∞} CRBq; (1.9)

later, we will compare this CRBq with its corresponding version for continuous measurements, which we will denote CRBc. The symbol ∼_{N→∞} used here means that both quantities are equivalent:

lim_{N→∞} Var(X̂ML,q) / CRBq = 1.

As the MLE is asymptotically unbiased and with asymptotically minimum variance, it is usually called an asymptotically efficient estimator in classical estimation terms. Note that optimality in asymptotic variance does not imply optimality in the MSE sense, as a biased estimator can attain a lower asymptotic MSE than the MLE. Also, it is important to stress that the variance of the MLE will tend to the CRB only if the maximum of the likelihood can be achieved. This can be an issue when we need to evaluate the maximum of the likelihood through a numerical method; in this case we have to ensure that the numerical method converges to the global maximum. In what follows, we will assume that the MLE, whether evaluated analytically or numerically, is always the global maximum of the likelihood. For further discussion on the issues of finding the MLE see (More? - App. A.2.1). The CRB is the inverse of the Fisher information (FI) [Kay 1993, p. 30]. The FI is given by the variance of the score function Sq. As the expected value of the score function is zero [Kay 1993, p. 67], the FI is given by the second order moment of the score function. Starting from the definition of the score function for N quantized measurements and going in
the direction of the asymptotic variance of the MLE, we have the following expressions:

Sq,1:N = ∂ log L(x; i1:N) / ∂x (score function),

Iq,1:N = E[S²q,1:N] = E[(∂ log L(x; i1:N) / ∂x)²] (FI),

Var(X̂ML,q) ∼_{N→∞} CRBq = 1 / Iq,1:N = 1 / E[(∂ log L(x; i1:N) / ∂x)²] (variance and CRB).

The subscripts are used to indicate that these quantities are related to the quantized measurements i1:N. Since the measurements are i.i.d., whenever we refer to the score function and FI for one measurement ik, we can drop the sample indexes, writing Sq and Iq. Under the assumption of independent measurements (independent noise), we have the following:

• The joint probability in the FI expression decomposes into a product of marginal probabilities.
• The logarithm of the product of marginal probabilities becomes the sum of the logarithms of each probability.
• After differentiating the sum of logarithms w.r.t. x, the square of the differentiated sum can be decomposed into a sum of squared terms and a sum of products between different terms.
• The expectation of the products between different terms is zero because the factors in the products are independent and have zero mean (they are score functions, thus having zero mean [Kay 1993, p. 67]).
• The expectation of each squared term is the FI for the corresponding individual measurement.

Therefore, as the measurements are also identically distributed, the FI for N quantized measurements is N times the FI for one measurement Iq:

Var(X̂ML,q) ∼_{N→∞} CRBq = 1 / (N Iq). (1.10)

The score function for one measurement Sq is

Sq = ∂ log L(x; ik) / ∂x = (∂P(ik; x) / ∂x) / P(ik; x), (1.11)

and the corresponding FI is

Iq = E[(∂ log L(x; ik) / ∂x)²] = Σ_{ik∈I} [(∂P(ik; x)/∂x) / P(ik; x)]² P(ik; x) = Σ_{ik∈I} (∂P(ik; x)/∂x)² / P(ik; x). (1.12)
Defining the difference between the central threshold and the parameter as ε = τ0 − x, and using the CDF and PDF notations and the symmetry of the quantization thresholds, we have

Iq = Σ_{ik=1}^{NI/2} { [f(ε + τ′ik−1) − f(ε + τ′ik)]² / [F(ε + τ′ik) − F(ε + τ′ik−1)] + [f(ε − τ′ik) − f(ε − τ′ik−1)]² / [F(ε − τ′ik−1) − F(ε − τ′ik)] }. (1.13)

The solution to problem (a) (p. 27) given by the MLE is the following:

Solution to (a) - MLE for a fixed threshold set τ
1) Estimator:
X̂ML,q = argmax_x L(x; i1:N), or equivalently X̂ML,q(i1:N) = argmax_x log L(x; i1:N),
with L(x; i1:N) given by (1.5):
L(x; i1:N) = ∏_{k=1}^{N} P(ik; x).
2) Performance (asymptotic):
X̂ML,q is asymptotically unbiased,
E[X̂ML,q] → x as N → ∞,
and its asymptotic MSE or variance is given by
Var(X̂ML,q) ∼_{N→∞} CRBq = 1 / (N Iq),
with Iq given by (1.13).

The CRB given above is not only related to the MLE; it can be used to approximately assess the performance of any good (close to optimal) estimator. In our case, it can be used to characterize the performance of the measurement/estimation system (Fig. 1.2) independently of the estimator.

1.2.3 Quantization loss

The solution given above does not contain any direct characterization of the estimation performance as a function of NI and/or τ. We are going to look into these details now.

[Figure 1.2 here: block diagram Y1:N → quantization → i1:N → parameter estimation → X̂q (CRBq), with a parallel branch Y1:N → parameter estimation → X̂c (CRBc).]
Figure 1.2: Scheme representing the general measurement/estimation system. The continuous measurement sequence Y1:N is scalarly quantized and the quantized sequence i1:N is used for estimation. X̂q and X̂c are the estimators based on quantized or continuous measurements and CRBq and CRBc are their respective CRBs.
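Expression (1.13) is straightforward to evaluate numerically. The sketch below (our own naming) computes Iq for standard Gaussian noise; for the binary quantizer (NI = 2, τ′ = [0, +∞]) at ε = 0, (1.13) reduces to f(0)² / [F(0)(1 − F(0))] = 2/π, to be compared with the continuous-measurement FI Ic = 1:

```python
import math

def f(v):
    """Standard Gaussian PDF (exp(-inf) evaluates to 0.0, so v = ±inf is handled)."""
    return math.exp(-v * v / 2.0) / math.sqrt(2.0 * math.pi)

def F(v):
    """Standard Gaussian CDF (math.erf accepts ±inf)."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def fisher_info_quantized(eps, tau_prime):
    """Eq. (1.13): FI of one quantized sample; tau_prime = [0, tau'_1, ..., +inf]."""
    Iq = 0.0
    for lo, hi in zip(tau_prime[:-1], tau_prime[1:]):
        Iq += (f(eps + lo) - f(eps + hi)) ** 2 / (F(eps + hi) - F(eps + lo))
        Iq += (f(eps - hi) - f(eps - lo)) ** 2 / (F(eps - lo) - F(eps - hi))
    return Iq

print(fisher_info_quantized(0.0, [0.0, math.inf]))  # 2/pi ≈ 0.6366, while I_c = 1
```

Sweeping eps in such a sketch also exhibits the degradation of Iq when the dynamic range moves away from x, discussed in the following sections.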
Loss with respect to the continuous measurements

We start by analyzing the general effect of quantization on estimation. An approximate way of doing this (exact for N → ∞) is to study the quantized FI for one measurement I_q and its difference with respect to the continuous measurement FI I_c. I_q was given in (1.13), while I_c is given by

\[ I_c = E\left[S_c^{2}\right], \tag{1.14} \]

where S_c is the score function for continuous measurements, given by

\[ S_c\left(y\right) = \frac{\partial \log f\left(y - x\right)}{\partial x}. \tag{1.15} \]

The difference between I_c and I_q can be obtained by evaluating the quantity E[(S_c − S_q)²] [Marano 2007]. Indeed,

\[ E\left[\left(S_c - S_q\right)^{2}\right] = E\left[S_c^{2}\right] + E\left[S_q^{2}\right] - 2E\left[S_c S_q\right] = I_c + I_q - 2E\left[S_c S_q\right], \]

and it can be shown that E[S_c S_q] = E[S_q²] (Why? - App. A.1.1). Thus, from the above, we have²

\[ I_c - I_q = E\left[\left(S_c - S_q\right)^{2}\right] \geq 0. \tag{1.16} \]

As the right-hand side (RHS) is the expectation of a squared function, the FI difference is nonnegative, meaning that the FI for quantized measurements is always less than or equal to its continuous measurement equivalent. Therefore, as the corresponding CRB will have larger or equal values, it is clear, as was already expected, that quantization of measurements reduces estimation performance (see Fig. 1.2 for the two estimation settings).

2. Special attention must be given to the fact that to obtain (1.16), the measurement PDF form f(y − x) is not used; in the proof in App. A.1.1 a general form f(y; x) is used. The conclusion above is thus also valid for general parameter estimation problems, not only location parameter estimation.

Loss with respect to the number of quantization intervals

Even if the performance loss is positive or zero, nothing guarantees, until now, that estimation performance increases with increasing N_I, as is intuitively expected. We will suppose that we have a threshold set τ for N_I quantization intervals and, for simplification, that ε = 0.
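The identity (1.16) can be checked numerically. The sketch below is illustrative (not from the thesis): it assumes Gaussian noise with δ = 1, for which I_c = 2/δ² and S_c(y) = 2(y − x)/δ², computes I_q from the interval sum, and approximates E[(S_c − S_q)²] with a Riemann sum.

```python
import math

def f(e):  # Gaussian noise PDF, thesis parametrization, delta = 1
    return math.exp(-e * e) / math.sqrt(math.pi)

def F(e):
    return 0.5 * (1.0 + math.erf(e))

def edges_from(taus):
    return [float("-inf")] + list(taus) + [float("inf")]

def P_dP(lo, hi, x):
    # interval probability and its derivative w.r.t. x
    P = F(hi - x) - F(lo - x)
    dP = (f(lo - x) if math.isfinite(lo) else 0.0) \
         - (f(hi - x) if math.isfinite(hi) else 0.0)
    return P, dP

def fi_quantized(x, taus):
    edges = edges_from(taus)
    return sum(dP * dP / P for P, dP in
               (P_dP(lo, hi, x) for lo, hi in zip(edges[:-1], edges[1:])))

def score_gap(x, taus, span=6.0, n=60000):
    # Riemann-sum approximation of E[(S_c - S_q)^2]
    edges = edges_from(taus)
    h = 2.0 * span / n
    acc = 0.0
    for i in range(n):
        y = x - span + (i + 0.5) * h
        Sc = 2.0 * (y - x)  # continuous score for this Gaussian model
        for lo, hi in zip(edges[:-1], edges[1:]):
            if lo <= y < hi:
                P, dP = P_dP(lo, hi, x)
                break
        acc += (Sc - dP / P) ** 2 * f(y - x) * h
    return acc
```

For any threshold set, score_gap matches I_c − I_q up to the discretization error, confirming (1.16).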
We will add one threshold τ′ between two thresholds τ_{i−1} and τ_i (τ_i > τ′ > τ_{i−1}); i > 0 is assumed only to simplify notation. The sum terms defining I_q do not change, except for the term corresponding to the interval q_i. The old and new FI contributions of this region are respectively

\[ I^{\tau}_{q,i} = \frac{\left[f\left(\tau_i\right) - f\left(\tau_{i-1}\right)\right]^{2}}{F\left(\tau_i\right) - F\left(\tau_{i-1}\right)} = \left[\frac{f\left(\tau_i\right) - f\left(\tau_{i-1}\right)}{F\left(\tau_i\right) - F\left(\tau_{i-1}\right)}\right]^{2}\left[F\left(\tau_i\right) - F\left(\tau_{i-1}\right)\right], \tag{1.17} \]

\[ I^{\{\tau\}\cup\{\tau'\}}_{q,i} = \frac{\left[f\left(\tau_i\right) - f\left(\tau'\right)\right]^{2}}{F\left(\tau_i\right) - F\left(\tau'\right)} + \frac{\left[f\left(\tau'\right) - f\left(\tau_{i-1}\right)\right]^{2}}{F\left(\tau'\right) - F\left(\tau_{i-1}\right)}. \tag{1.18} \]

We can expand (1.17) by adding and subtracting a term f(τ′) in the numerator of the first factor, adding and subtracting F(τ′) in the denominator of the first factor, and multiplying and dividing the resulting numerator terms by F(τ_i) − F(τ′) and F(τ′) − F(τ_{i−1}). This gives

\[ I^{\tau}_{q,i} = \left[\frac{\frac{f\left(\tau_i\right) - f\left(\tau'\right)}{F\left(\tau_i\right) - F\left(\tau'\right)}\left[F\left(\tau_i\right) - F\left(\tau'\right)\right] + \frac{f\left(\tau'\right) - f\left(\tau_{i-1}\right)}{F\left(\tau'\right) - F\left(\tau_{i-1}\right)}\left[F\left(\tau'\right) - F\left(\tau_{i-1}\right)\right]}{\left[F\left(\tau_i\right) - F\left(\tau'\right)\right] + \left[F\left(\tau'\right) - F\left(\tau_{i-1}\right)\right]}\right]^{2}\left[F\left(\tau_i\right) - F\left(\tau'\right) + F\left(\tau'\right) - F\left(\tau_{i-1}\right)\right]. \tag{1.19} \]

Jensen's inequality tells us the following [Hardy 1988, p. 74]: for a sequence of values a_i, positive weights b_i and a convex function φ, we have

\[ \varphi\left(\frac{\sum_i a_i b_i}{\sum_i b_i}\right) \leq \frac{\sum_i b_i \varphi\left(a_i\right)}{\sum_i b_i}. \tag{1.20} \]

Multiplying both sides of (1.20) by \(\sum_i b_i\) and identifying in (1.19) b_i with F(τ_i) − F(τ′) and F(τ′) − F(τ_{i−1}), a_i with \(\frac{f\left(\tau_i\right) - f\left(\tau'\right)}{F\left(\tau_i\right) - F\left(\tau'\right)}\) and \(\frac{f\left(\tau'\right) - f\left(\tau_{i-1}\right)}{F\left(\tau'\right) - F\left(\tau_{i-1}\right)}\), and φ(x) with x², we have the following:

\[ I^{\tau}_{q,i} \leq I^{\{\tau\}\cup\{\tau'\}}_{q,i}. \tag{1.21} \]

As expected, adding a threshold, or equivalently a quantizer interval, increases the FI and, as a consequence, decreases the CRB, enhancing estimation performance.
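The single-interval comparison (1.17)-(1.21) is easy to verify numerically. The sketch below is illustrative, assuming Gaussian noise with δ = 1 and ε = 0: it computes the FI contribution of one interval before and after a split.

```python
import math

def f(e):  # Gaussian noise PDF, thesis parametrization, delta = 1
    return math.exp(-e * e) / math.sqrt(math.pi)

def F(e):
    return 0.5 * (1.0 + math.erf(e))

def interval_fi(lo, hi):
    # FI contribution of one quantizer interval, eqs. (1.17)/(1.18)
    return (f(hi) - f(lo)) ** 2 / (F(hi) - F(lo))

def split_gain(lo, hi, t):
    # (1.21): splitting [lo, hi) at any interior t never decreases the contribution
    assert lo < t < hi
    return interval_fi(lo, t) + interval_fi(t, hi) - interval_fi(lo, hi)
```

By Jensen's inequality, split_gain is nonnegative for every choice of interval and split point.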
Note that this is also true if we start with an optimal partition (a partition that maximizes the FI) and add a threshold arbitrarily; however, in this case, the final interval partition may not be optimal within the class of quantizers with N_I + 1 intervals, even if we try to optimize the new threshold position.

As adding thresholds increases the FI and as I_q is bounded above by I_c, the FI tends to a limit value when N_I tends to infinity. An interesting point to study is whether we can make it converge to I_c. This will be done in Part II, where we will see that, under some regularity assumptions on the quantizer intervals, I_q converges to I_c.

Now, to have a more precise characterization of the estimation performance as a function of N_I, we must first describe how it is influenced by τ. For the optimal τ, we will be able to obtain the dependence of the estimation performance only on N_I and the noise characteristics.

1.3 Binary quantization

We begin the analysis with the binary case, N_I = 2. For binary observations (τ′_{−1} = −∞ and τ′_1 = ∞), the CRB for N measurements can be written by using (1.13) in (1.10). As f(ε + τ′_1) = 0, f(ε + τ′_{−1}) = 0, 1 − F(ε + τ′_1) = 0 and F(ε + τ′_{−1}) = 0 by assumption AN2, we obtain

\[ \text{CRB}^{B}_{q} = \frac{F\left(\varepsilon\right)\left[1 - F\left(\varepsilon\right)\right]}{N f^{2}\left(\varepsilon\right)}. \tag{1.22} \]

The analysis of performance in this case reduces to the analysis of the function

\[ B\left(\varepsilon\right) = N\,\text{CRB}^{B}_{q} = \frac{F\left(\varepsilon\right)\left[1 - F\left(\varepsilon\right)\right]}{f^{2}\left(\varepsilon\right)}. \tag{1.23} \]

1.3.1 The Gaussian case

This function was studied in the Gaussian noise case in [Papadopoulos 2001] and revisited in [Ribeiro 2006a]. In this case,

\[ f\left(\varepsilon\right) = \frac{1}{\sqrt{\pi}\delta}\exp\left[-\left(\frac{\varepsilon}{\delta}\right)^{2}\right], \tag{1.24} \]

where δ is the noise scale factor, which can be linearly related to the standard deviation σ (δ = √2 σ). By plotting B as a function of ε (see Fig. 1.3), it was noted in [Papadopoulos 2001] that the minimum value B⋆ is attained for ε = 0 and that B(ε) increases when |ε| increases.
Thus, the optimal threshold τ₀⋆ must be equal to x and the minimum value of B(ε) is

\[ B^{\star} = \frac{1}{4 f^{2}\left(0\right)} = \frac{\pi\delta^{2}}{4}. \]

We can compare the CRB for one continuous measurement, B_c = 1/I_c, with B⋆, to get an idea of the loss of performance. Using (1.14), (1.15) and the expression for the PDF of the Gaussian distribution (1.24), we have

\[ B_c = \frac{1}{I_c} = \frac{1}{E\left[\left(\frac{\partial \log f\left(y - x\right)}{\partial x}\right)^{2}\right]} = \frac{\delta^{2}}{2} = \frac{2}{\pi} B^{\star}, \]

or equivalently

\[ B^{\star} = \frac{\pi}{2} B_c \approx 1.57 B_c. \]

The performance loss due to binary quantization is surprisingly small. However, note that this requires τ₀ = x, which is impossible to achieve in practice, as x is the unknown parameter to be estimated. For increasing |ε| > 0, we can observe that B increases in a rather sensitive way. An upper bound on B was given in [Ribeiro 2006a] by noting that the product in the numerator can be bounded by the following exponential (Why? - App. A.1.2):

\[ F\left(\varepsilon\right)\left[1 - F\left(\varepsilon\right)\right] \leq \frac{1}{4}\exp\left[-\left(\frac{\varepsilon}{\delta}\right)^{2}\right]. \tag{1.25} \]

This bound can be used in (1.23) with (1.24) to obtain

\[ B\left(\varepsilon\right) \leq \bar{B}\left(\varepsilon\right) = \frac{\pi\delta^{2}}{4}\exp\left[+\left(\frac{\varepsilon}{\delta}\right)^{2}\right], \tag{1.26} \]

which is a function that increases exponentially with ε. To confirm that the bound is tight, at least for moderate ε, the function B̄ is also plotted in Fig. 1.3.

Figure 1.3: Quantity related to the CRB for quantized measurements, B, as a function of the normalized difference ε/δ between threshold and parameter. B̄ is its upper bound, which has an exponential form. The noise distribution is the Gaussian distribution and the normalizing factor δ is the Gaussian noise scale parameter. The normalizations on both axes are done to be able to have a plot independent of δ.

Therefore, for the Gaussian case, we can conclude that the estimation performance loss for the binary case is relatively small if we set the threshold at the true parameter value, but it increases rapidly when we quantize far from it.
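The function B(ε) and its bound B̄(ε) are straightforward to evaluate. The following sketch is illustrative (Gaussian noise, thesis parametrization); it reproduces B⋆ = πδ²/4 and the behavior shown in Fig. 1.3.

```python
import math

def B(eps, delta=1.0):
    # B(eps) = F(eps)[1 - F(eps)] / f(eps)^2, Gaussian noise, eqs. (1.23)-(1.24)
    f = math.exp(-(eps / delta) ** 2) / (math.sqrt(math.pi) * delta)
    F = 0.5 * (1.0 + math.erf(eps / delta))
    return F * (1.0 - F) / (f * f)

def B_bar(eps, delta=1.0):
    # exponential upper bound (1.26)
    return 0.25 * math.pi * delta ** 2 * math.exp((eps / delta) ** 2)
```

B(0) equals π/4 for δ = 1, B is even and increasing in |ε|, and B̄ dominates it everywhere.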
1.3.2 The Laplacian case

We can look at another symmetric unimodal distribution to see if the same happens. For example, we can consider the Laplacian distribution, whose PDF and CDF are

\[ f\left(\varepsilon\right) = \frac{1}{2\delta}\exp\left(-\frac{\left|\varepsilon\right|}{\delta}\right), \tag{1.27} \]

\[ F\left(\varepsilon\right) = \frac{1}{2} + \frac{\text{sign}\left(\varepsilon\right)}{2}\left[1 - \exp\left(-\frac{\left|\varepsilon\right|}{\delta}\right)\right], \tag{1.28} \]

where sign(ε) is the sign function

\[ \text{sign}\left(\varepsilon\right) = \begin{cases} 1, & \text{if } \varepsilon > 0, \\ 0, & \text{if } \varepsilon = 0, \\ -1, & \text{if } \varepsilon < 0. \end{cases} \]

Applying (1.27) and (1.28) to (1.23), we get

\[ B = \frac{\left\{\frac{1}{2} + \frac{\text{sign}\left(\varepsilon\right)}{2}\left[1 - \exp\left(-\frac{\left|\varepsilon\right|}{\delta}\right)\right]\right\}\left\{\frac{1}{2} - \frac{\text{sign}\left(\varepsilon\right)}{2}\left[1 - \exp\left(-\frac{\left|\varepsilon\right|}{\delta}\right)\right]\right\}}{\frac{1}{4\delta^{2}}\exp\left(-2\frac{\left|\varepsilon\right|}{\delta}\right)} = \frac{\frac{1}{4}\left[2\exp\left(-\frac{\left|\varepsilon\right|}{\delta}\right) - \exp\left(-2\frac{\left|\varepsilon\right|}{\delta}\right)\right]}{\frac{1}{4\delta^{2}}\exp\left(-2\frac{\left|\varepsilon\right|}{\delta}\right)} = \delta^{2}\left[2\exp\left(\frac{\left|\varepsilon\right|}{\delta}\right) - 1\right], \tag{1.29} \]

and we can see that B, and consequently the CRB, is minimized for τ₀ = x and that it is sensitive to ε, growing exponentially when we increase |ε| = |τ₀ − x|.

1.3.3 The general case

We can try to verify whether the increasing behavior of B(ε) w.r.t. |ε| is observed in the general case, when the noise PDF is unimodal and symmetric.

Attempt at global analysis: dead end

For unimodal symmetric distributions we have f(ε) = f(−ε) and F(ε) = 1 − F(−ε). Therefore, as was observed for the specific Gaussian and Laplacian cases, B(ε) is a symmetric function. To analyze whether the increasing behavior holds in general, we can concentrate the analysis on the first derivative of B w.r.t. ε, for ε > 0. The derivative is

\[ \frac{\text{d}B}{\text{d}\varepsilon} = \frac{f^{2}\left(\varepsilon\right)\left[1 - 2F\left(\varepsilon\right)\right] - 2F\left(\varepsilon\right)\left[1 - F\left(\varepsilon\right)\right]f^{(1)}\left(\varepsilon\right)}{f^{3}\left(\varepsilon\right)}, \tag{1.30} \]

where f^{(1)}(ε) is the first derivative of the PDF w.r.t. ε, supposed to exist³. Observe that if the distribution is symmetric, we have 1 − 2F(0) = 0 and only the second term in the numerator can be nonzero for ε = 0. Adding the condition that f^{(1)}(0) = 0 makes ε = 0 a local extremum of B, a candidate point for a local minimum. In a first attempt to verify whether ε = 0 is a global minimum, we can calculate the second derivative and check whether its sign is positive for all ε.
If we calculate the second derivative, we get

\[ \frac{\text{d}^{2}B}{\text{d}\varepsilon^{2}} = \frac{-3f^{2}\left(\varepsilon\right)f^{(1)}\left(\varepsilon\right)\left[1 - 2F\left(\varepsilon\right)\right] + F\left(\varepsilon\right)\left[1 - F\left(\varepsilon\right)\right]\left[6{f^{(1)}}^{2}\left(\varepsilon\right) - 2f\left(\varepsilon\right)f^{(2)}\left(\varepsilon\right)\right]}{f^{4}\left(\varepsilon\right)} - 2, \tag{1.31} \]

with f^{(2)}(ε) the second derivative, supposed to exist³.

3. This rules out the evaluation of this quantity for ε = 0 in the Laplacian case, which is not a problematic case, as we know analytically that B is strictly increasing with |ε| in this case.

Even using the assumptions on the noise distribution, we cannot reach any conclusion on the sign of the second derivative. Thus, we can go back to the first derivative and analyze its sign. Using the symmetry of B, a sufficient condition for ε = 0 to be a global minimum is that dB/dε ≥ 0 for ε > 0. The derivative dB/dε has the same sign as the numerator in the RHS of (1.30); therefore, we can obtain the condition

\[ -f^{(1)}\left(\varepsilon\right) \geq \frac{f^{2}\left(\varepsilon\right)\left[2F\left(\varepsilon\right) - 1\right]}{2F\left(\varepsilon\right)\left[1 - F\left(\varepsilon\right)\right]}. \]

Using the fact that the density is monotonically decreasing for ε > 0 (f^{(1)}(ε) < 0 for ε > 0) and symmetric ([2F(ε) − 1] ≥ 0 for ε > 0), we can write

\[ \left|f^{(1)}\left(\varepsilon\right)\right| \geq \frac{f^{2}\left(\varepsilon\right)\left[2F\left(\varepsilon\right) - 1\right]}{2F\left(\varepsilon\right)\left[1 - F\left(\varepsilon\right)\right]}. \tag{1.32} \]

Unfortunately, using the assumptions on the distribution, we cannot go further. But, at least, we can use the condition above (1.32) to verify empirically, for commonly used noise models, the increasing behavior of B(ε) with |ε|. To do so, we (re)tested the Gaussian and Laplacian distributions with (1.32); we also added a heavy-tailed distribution⁴ to see if the conclusions change in this case. The heavy-tailed distribution is the Cauchy distribution, with PDF and CDF given respectively by

\[ f\left(\varepsilon\right) = \frac{1}{\pi\delta}\,\frac{1}{1 + \left(\frac{\varepsilon}{\delta}\right)^{2}}, \tag{1.33} \qquad F\left(\varepsilon\right) = \frac{1}{2} + \frac{1}{\pi}\arctan\left(\frac{\varepsilon}{\delta}\right). \tag{1.34} \]

For the three distributions (Gaussian, Laplacian and Cauchy), we calculated the quantity

\[ M = \left|f^{(1)}\left(\varepsilon\right)\right| - \frac{f^{2}\left(\varepsilon\right)\left[2F\left(\varepsilon\right) - 1\right]}{2F\left(\varepsilon\right)\left[1 - F\left(\varepsilon\right)\right]}, \]

which must be positive to have the monotonically increasing behavior of B w.r.t. |ε|. The result is displayed in Fig. 1.4, where we observe that this is indeed true.
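The quantity M can be evaluated directly for the three distributions. The sketch below is illustrative (unit scale δ = 1); it checks the positivity of M on a grid of ε > 0, as in Fig. 1.4.

```python
import math

def M(eps, pdf, cdf, dpdf):
    # M = |f'(eps)| - f^2(eps)[2F(eps) - 1] / (2 F(eps)[1 - F(eps)]), for eps > 0
    f, F = pdf(eps), cdf(eps)
    return -dpdf(eps) - f * f * (2.0 * F - 1.0) / (2.0 * F * (1.0 - F))

# unit-scale (delta = 1) noise models: (pdf, cdf, pdf derivative)
gauss = (lambda e: math.exp(-e * e) / math.sqrt(math.pi),
         lambda e: 0.5 * (1.0 + math.erf(e)),
         lambda e: -2.0 * e * math.exp(-e * e) / math.sqrt(math.pi))
laplace = (lambda e: 0.5 * math.exp(-abs(e)),
           lambda e: 1.0 - 0.5 * math.exp(-e) if e > 0 else 0.5 * math.exp(e),
           lambda e: -math.copysign(0.5 * math.exp(-abs(e)), e))
cauchy = (lambda e: 1.0 / (math.pi * (1.0 + e * e)),
          lambda e: 0.5 + math.atan(e) / math.pi,
          lambda e: -2.0 * e / (math.pi * (1.0 + e * e) ** 2))
```

For ε > 0, the derivative of each PDF is negative, so −f′(ε) = |f′(ε)| and M > 0 is the monotonicity condition.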
Figure 1.4: M × δ² as a function of ε/δ ≥ 0. The plot is given for three types of noise distribution: Gaussian, Cauchy and Laplacian. All distributions have a noise scale parameter denoted δ. The function must be positive for the optimal threshold in binary quantization to be exactly placed at the true parameter, τ⋆ = x. The normalizations on both axes are done to be able to have a plot independent of δ.

Local analysis

As condition (1.32) is difficult to verify in general, we can try to analyze the local behavior of B(ε) around ε = 0. Even if the results are weaker, as they are only local results, we can expect that the conditions for ε = 0 to be a local minimum of B(ε) will be easy to verify. We saw above that if f^{(1)}(0) = 0, then we have an extremum of B(ε) at ε = 0. If we use one more time the assumption f^{(1)}(0) = 0 on the second derivative at zero, together with the symmetry (F(0)[1 − F(0)] = 1/4), we get

\[ \left.\frac{\text{d}^{2}B}{\text{d}\varepsilon^{2}}\right|_{\varepsilon=0} = -\frac{1}{2}\,\frac{f^{(2)}\left(0\right)}{f^{3}\left(0\right)} - 2. \]

For ε = 0 to be a local minimum of B(ε), we have the condition \(\left.\frac{\text{d}^{2}B}{\text{d}\varepsilon^{2}}\right|_{\varepsilon=0} > 0\). When we apply this condition to the expression above, we can obtain the following condition on the noise PDF and its second derivative:

\[ -f^{(2)}\left(0\right) > 4 f^{3}\left(0\right). \tag{1.35} \]

For the Gaussian distribution this condition is satisfied, as we have

\[ -f^{(2)}\left(0\right) = \frac{1}{\delta^{3}}\,\frac{2}{\sqrt{\pi}} > 4 f^{3}\left(0\right) = \frac{1}{\delta^{3}}\,\frac{4}{\pi^{\frac{3}{2}}}, \]

and also for the Cauchy distribution:

\[ -f^{(2)}\left(0\right) = \frac{1}{\delta^{3}}\,\frac{2}{\pi} > 4 f^{3}\left(0\right) = \frac{1}{\delta^{3}}\,\frac{4}{\pi^{3}}. \]

4. A heavy-tailed distribution is a distribution for which the ratio between 1 − F(x + y) and 1 − F(x) tends to one when x tends to infinity [Sigman 1999]. A subclass of this family is the class of all sub-exponential distributions, which includes the Student-t distributions (of which the Cauchy distribution is a special case) and Paretian distributions.
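The local condition (1.35) can also be checked without any analytical differentiation. The sketch below is illustrative (unit scale δ = 1): it estimates f′′(0) by a central finite difference and tests the inequality for the Gaussian and Cauchy PDFs.

```python
import math

def second_deriv(f, e, h=1e-4):
    # central finite-difference estimate of f''(e)
    return (f(e + h) - 2.0 * f(e) + f(e - h)) / (h * h)

# unit-scale (delta = 1) PDFs
gauss = lambda e: math.exp(-e * e) / math.sqrt(math.pi)
cauchy = lambda e: 1.0 / (math.pi * (1.0 + e * e))

# condition (1.35): -f''(0) > 4 f(0)^3  =>  eps = 0 is a local minimum of B
check = {name: -second_deriv(f, 0.0) > 4.0 * f(0.0) ** 3
         for name, f in [("Gaussian", gauss), ("Cauchy", cauchy)]}
```

For the Gaussian case the finite-difference estimate also matches the closed-form value −f′′(0) = 2/(√π δ³).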
1.3.4 Asymmetric threshold: surprising cases

Surprisingly, we can find symmetric distributions, and even a class of unimodal symmetric distributions, for which condition (1.35) is not satisfied; as a consequence, for these distributions, ε = 0 can be a local maximum instead of a local minimum.

The uniform/Gaussian case

A simple way to define a symmetric distribution that does not satisfy (1.35) is to set the values of the PDF around zero to a nonzero constant, so that f(0) > 0 and f^{(2)}(0) = 0. This makes the second derivative of B at ε = 0 negative, leading to a local maximum of B(ε) at that point. As an example, we can consider a noise PDF that is uniform in the interval [−α/2, α/2], where α ∈ ℝ₊, and that decreases as a Gaussian distribution with a standard deviation parameter σ outside this interval. We call this noise distribution the uniform/Gaussian distribution, and the analytic expression for its PDF is

\[ f\left(\varepsilon\right) = \begin{cases} f_{GL}\left(\varepsilon\right) = \frac{1}{C}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2}\left(\frac{\varepsilon + \frac{\alpha}{2}}{\sigma}\right)^{2}\right], & \text{for } \varepsilon < -\frac{\alpha}{2}, \\ f_{U}\left(\varepsilon\right) = \frac{1}{C}\frac{1}{\sqrt{2\pi}\sigma}, & \text{for } -\frac{\alpha}{2} \leq \varepsilon \leq \frac{\alpha}{2}, \\ f_{GR}\left(\varepsilon\right) = \frac{1}{C}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{1}{2}\left(\frac{\varepsilon - \frac{\alpha}{2}}{\sigma}\right)^{2}\right], & \text{for } \varepsilon > \frac{\alpha}{2}, \end{cases} \tag{1.36} \]

where \(C = 1 + \frac{\alpha}{\sqrt{2\pi}\sigma}\) is a normalization constant that makes the integral of the PDF equal to one. This PDF is depicted in Fig. 1.5. To obtain the function B(ε), we have to describe the CDF of the uniform/Gaussian r.v. If we denote by Φ(ε) the CDF of a standard Gaussian distribution (the CDF for a Gaussian with σ = 1), we obtain the following:

\[ F\left(\varepsilon\right) = \begin{cases} \frac{1}{C}\Phi\left(\frac{\varepsilon + \frac{\alpha}{2}}{\sigma}\right), & \text{for } \varepsilon < -\frac{\alpha}{2}, \\ \frac{1}{C}\left[\frac{1}{2} + \frac{1}{\sqrt{2\pi}\sigma}\left(\varepsilon + \frac{\alpha}{2}\right)\right], & \text{for } -\frac{\alpha}{2} \leq \varepsilon \leq \frac{\alpha}{2}, \\ \frac{1}{C}\left[\frac{\alpha}{\sqrt{2\pi}\sigma} + \Phi\left(\frac{\varepsilon - \frac{\alpha}{2}}{\sigma}\right)\right], & \text{for } \varepsilon > \frac{\alpha}{2}. \end{cases} \tag{1.37} \]

Figure 1.5: PDF of the uniform/Gaussian distribution. The center region is uniform with width α, while the left and right sides are Gaussian with standard deviation parameter σ.
Using (1.36) and (1.37) in the expression for B(ε) (1.23), we get

\[ B\left(\varepsilon\right) = \frac{F\left(\varepsilon\right)\left[1 - F\left(\varepsilon\right)\right]}{f^{2}\left(\varepsilon\right)} = \begin{cases} 2\pi\sigma^{2}\exp\left[\left(\frac{\varepsilon + \frac{\alpha}{2}}{\sigma}\right)^{2}\right]\Phi\left(\frac{\varepsilon + \frac{\alpha}{2}}{\sigma}\right)\left[C - \Phi\left(\frac{\varepsilon + \frac{\alpha}{2}}{\sigma}\right)\right], & \text{for } \varepsilon < -\frac{\alpha}{2}, \\ 2\pi\sigma^{2}\left[\frac{1}{2} + \frac{\alpha}{2}\frac{1}{\sqrt{2\pi}\sigma}\right]^{2} - \varepsilon^{2}, & \text{for } -\frac{\alpha}{2} \leq \varepsilon \leq \frac{\alpha}{2}, \\ 2\pi\sigma^{2}\exp\left[\left(\frac{\varepsilon - \frac{\alpha}{2}}{\sigma}\right)^{2}\right]\left[\frac{\alpha}{\sqrt{2\pi}\sigma} + \Phi\left(\frac{\varepsilon - \frac{\alpha}{2}}{\sigma}\right)\right]\left[1 - \Phi\left(\frac{\varepsilon - \frac{\alpha}{2}}{\sigma}\right)\right], & \text{for } \varepsilon > \frac{\alpha}{2}. \end{cases} \tag{1.38} \]

Observe that in the interval [−α/2, α/2] the function is concave, so we really have a local maximum at zero.

To observe the global behavior of this bound, we plotted CRB^B_q for a number of samples N = 500, α = 1, σ = 1 and for values of ε in the interval [−2, 2]. To verify that the behavior of the bound is close to the true MSE of the MLE, we simulated the MLE 10⁵ times for N = 500; the simulation results were used to evaluate a simulated MSE. The details of the implementation of the MLE for binary quantization will be presented further in (a1.1) in Sec. 1.3.6, and for more specific implementation details about the uniform/Gaussian case see (More? - App. A.2.2). The simulation of the noise was done by exploiting the fact that the uniform/Gaussian distribution is a mixture of distributions that are easy to sample (How? - App. A.3.1).

The results are shown in Fig. 1.6. We can observe the concave behavior of the bound around ε = 0 and the presence of two minima at points different from ε = 0. This shows that for this type of noise, binary quantization must be done in an asymmetric way, by shifting the central threshold to a zone where the noise is not uniform. Note also that if we shift too much, the performance starts to degrade again. We suspect that this asymmetric behavior comes from the fact that for the uniform distribution the most informative points, in a statistical sense, are the boundaries of the distribution (where it passes from a positive value to zero). Finally, we can also see that the MSE for the MLE is quite close to the bound, indicating that we can use the bound to analyze the behavior of the MSE.
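The piecewise expression (1.38) can be evaluated directly. The sketch below is illustrative, with α = σ = 1 as in Fig. 1.6; it exhibits the local maximum of B at ε = 0 and the degradation for large |ε|.

```python
import math

ALPHA, SIGMA = 1.0, 1.0
C = 1.0 + ALPHA / (math.sqrt(2.0 * math.pi) * SIGMA)  # normalization in (1.36)

def phi(z):  # standard Gaussian CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pdf(e):  # uniform/Gaussian PDF (1.36)
    if abs(e) <= ALPHA / 2.0:
        return 1.0 / (C * math.sqrt(2.0 * math.pi) * SIGMA)
    s = abs(e) - ALPHA / 2.0
    return math.exp(-0.5 * (s / SIGMA) ** 2) / (C * math.sqrt(2.0 * math.pi) * SIGMA)

def cdf(e):  # uniform/Gaussian CDF (1.37)
    if e < -ALPHA / 2.0:
        return phi((e + ALPHA / 2.0) / SIGMA) / C
    if e <= ALPHA / 2.0:
        return (0.5 + (e + ALPHA / 2.0) / (math.sqrt(2.0 * math.pi) * SIGMA)) / C
    return (ALPHA / (math.sqrt(2.0 * math.pi) * SIGMA)
            + phi((e - ALPHA / 2.0) / SIGMA)) / C

def B(e):  # eq. (1.23), equal to the piecewise form (1.38)
    F = cdf(e)
    return F * (1.0 - F) / pdf(e) ** 2
```

B(0) matches the middle branch of (1.38) in closed form, and B(1) < B(0) < B(2) shows the "w" shape.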
Figure 1.6: CRB^B_q and simulated MLE MSE for uniform/Gaussian noise. Both the bound and the simulated MSE were evaluated for a number of samples N = 500 and for ε in the interval [−2, 2]. The MSE for the MLE was evaluated through Monte Carlo simulation using 10⁵ realizations of blocks of 500 samples. We considered the following noise parameters: α = 1 and σ = 1.

The generalized Gaussian case

We can also look for noise distributions without the central uniform behavior for which the condition on the second derivative (1.35) is not respected. All distributions that have zero second derivative at ε = 0 will not respect the condition. To have zero second derivative at zero, the PDF must be flat around zero. A class of distributions for which we can control the flatness around zero by changing a parameter is the generalized Gaussian distribution (GGD). A more detailed presentation of the GGD will be given in Ch. 3, with the motivation for using it as a noise model. Here, we present only its PDF and CDF, which are given respectively by

\[ f\left(\varepsilon\right) = \frac{\beta}{2\delta\Gamma\left(\frac{1}{\beta}\right)}\exp\left(-\left|\frac{\varepsilon}{\delta}\right|^{\beta}\right), \tag{1.39} \]

\[ F\left(\varepsilon\right) = \frac{1}{2}\left[1 + \text{sign}\left(\varepsilon\right)\frac{\gamma\left(\frac{1}{\beta}, \left|\frac{\varepsilon}{\delta}\right|^{\beta}\right)}{\Gamma\left(\frac{1}{\beta}\right)}\right], \tag{1.40} \]

where δ is the noise scale parameter and β is a shape parameter which allows for controlling the flatness around zero. Both δ and β are constrained to be strictly positive, δ > 0, β > 0. Γ(·) is the gamma function

\[ \Gamma\left(x\right) = \int_{0}^{+\infty} z^{x-1}\exp\left(-z\right)\text{d}z \]
and γ(·,·) is the incomplete gamma function

\[ \gamma\left(x, w\right) = \int_{0}^{w} z^{x-1}\exp\left(-z\right)\text{d}z. \]

We need to calculate f^{(2)}(ε) at ε = 0. To do so, we evaluate the derivatives for ε < 0 and ε > 0 and then evaluate their limits when ε tends to zero. For the first derivative we have

\[ f^{(1)}\left(\varepsilon\right) = \begin{cases} D\left(\frac{-\varepsilon}{\delta}\right)^{\beta-1}\exp\left[-\left(\frac{-\varepsilon}{\delta}\right)^{\beta}\right], & \text{for } \varepsilon < 0, \\ -D\left(\frac{\varepsilon}{\delta}\right)^{\beta-1}\exp\left[-\left(\frac{\varepsilon}{\delta}\right)^{\beta}\right], & \text{for } \varepsilon > 0, \end{cases} \]

where \(D = \frac{\beta^{2}}{2\delta^{2}\Gamma\left(\frac{1}{\beta}\right)}\). Observe that if β ≤ 1, then the first derivative at zero is not defined. For β > 1, the derivative is zero. For the evaluation of the second derivative we will consider β > 1. We get the following second derivative:

\[ f^{(2)}\left(\varepsilon\right) = \begin{cases} D\left[-\frac{\beta-1}{\delta}\left(\frac{-\varepsilon}{\delta}\right)^{\beta-2} + \frac{\beta}{\delta}\left(\frac{-\varepsilon}{\delta}\right)^{2\left(\beta-1\right)}\right]\exp\left[-\left(\frac{-\varepsilon}{\delta}\right)^{\beta}\right], & \text{for } \varepsilon < 0, \\ D\left[-\frac{\beta-1}{\delta}\left(\frac{\varepsilon}{\delta}\right)^{\beta-2} + \frac{\beta}{\delta}\left(\frac{\varepsilon}{\delta}\right)^{2\left(\beta-1\right)}\right]\exp\left[-\left(\frac{\varepsilon}{\delta}\right)^{\beta}\right], & \text{for } \varepsilon > 0. \end{cases} \]

We can see that for 1 < β < 2, the derivatives when ε approaches zero are both −∞. For these cases, the point ε = 0 is a local minimum of B(ε). In the Gaussian case, β = 2, the second derivative has a finite negative value, and we saw before that ε = 0 is a local minimum (empirically we also observed that it is a global minimum). For the cases β > 2, the second derivative is zero, thus corresponding to the special cases of local maximum that we were looking for. The function B(ε), which we expect to be a "w"-shaped function for β > 2, can be evaluated using (1.39) and (1.40) in the expression for B(ε) (1.23). This gives

\[ B\left(\varepsilon\right) = \frac{F\left(\varepsilon\right)\left[1 - F\left(\varepsilon\right)\right]}{f^{2}\left(\varepsilon\right)} = \frac{\delta^{2}\Gamma^{2}\left(\frac{1}{\beta}\right)}{\beta^{2}}\left[1 - \frac{\gamma^{2}\left(\frac{1}{\beta}, \left|\frac{\varepsilon}{\delta}\right|^{\beta}\right)}{\Gamma^{2}\left(\frac{1}{\beta}\right)}\right]\exp\left(2\left|\frac{\varepsilon}{\delta}\right|^{\beta}\right). \tag{1.41} \]

As in the uniform/Gaussian case, we also plotted CRB^B_q and the simulated MSE of the MLE. We used N = 500, β = 4, δ = 1 and values of ε in the interval [−1, 1]. We simulated the MLE 10⁵ times and the results were used to obtain an estimate of the true MSE of this estimator. For more specific implementation details about the MLE in the GGD case, see (More? - App. A.2.3). The GGD noise was generated using transformations of gamma variates (How? - App. A.3.2).

The results are shown in Fig. 1.7. We can notice again that the optimal threshold must be placed in an asymmetric way, and also that the simulated estimation performance is close to the bound.

Figure 1.7: CRB^B_q and simulated MLE MSE for GGD noise. Both the bound and the simulated MSE were evaluated for a number of samples N = 500 and for ε in the interval [−1, 1].
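Expression (1.41) can be checked numerically. The sketch below is illustrative (β = 4, δ = 1, as in Fig. 1.7); since the standard library has no incomplete gamma function, the CDF is obtained by integrating the PDF numerically, which is equivalent to (1.40).

```python
import math

BETA, DELTA = 4.0, 1.0

def pdf(e):  # GGD PDF (1.39)
    return (BETA / (2.0 * DELTA * math.gamma(1.0 / BETA))
            * math.exp(-abs(e / DELTA) ** BETA))

def cdf(e, n=4000):
    # midpoint-rule integration of the symmetric PDF from 0 to |e|
    h = abs(e) / n
    s = sum(pdf((i + 0.5) * h) for i in range(n)) * h
    return 0.5 + math.copysign(s, e)

def B(e):  # eq. (1.23), equal to the closed form (1.41)
    F = cdf(e)
    return F * (1.0 - F) / pdf(e) ** 2
```

At ε = 0 this recovers B(0) = δ²Γ²(1/β)/β², and for β = 4 the values away from zero reproduce the "w" shape: a dip below B(0) at moderate ε, then fast growth.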
The MSE for the MLE was evaluated through Monte Carlo simulation using 10⁵ realizations of blocks of 500 samples. We considered the following noise parameters: β = 4 and δ = 1.

Contrary to the uniform/Gaussian case, we cannot give a clear interpretation of the position of the minimum point. The minimum point was observed to be sensitive to changes in β and δ. It was also observed that as we set β closer to 2 (the Gaussian case), the difference in performance between the point ε = 0 and the minimum point gets smaller. On the other hand, as we increase β (getting closer to the uniform distribution), the difference seems to increase.

1.3.5 Conclusions on binary quantization performance

To conclude, we can say that the best estimation performance in the binary case for commonly used noise models (CRB^B_q) is obtained for ε = 0 or τ₀ = x:

\[ \text{CRB}^{B,\star}_{q} = \frac{F\left(0\right)\left[1 - F\left(0\right)\right]}{N f^{2}\left(0\right)} = \frac{1}{4 N f^{2}\left(0\right)}, \tag{1.42} \]

which is also a lower bound on the asymptotically achievable performance. However, even under the unimodal symmetric assumption, this rather intuitive conclusion is not always true. From the local condition on the second derivative, we can see that if the noise PDF is slightly flat around zero, then a "w"-shaped performance function appears, leading to an optimal threshold that might be placed asymmetrically w.r.t. the input r.v. distribution.

The variation between the performance at the point ε = 0 and the minimum CRB^{B,⋆}_q in the asymmetric cases seems to depend on the flatness of the distribution: increased flatness around zero seems to be related to increased performance variation. This strong dependence between the shape of the CRB and the noise distribution seems to be a good subject for future work. Another interesting direction for future work on this asymmetry issue is to analyze how it can appear in the detection problem using binary quantized measurements. It
appears that such behavior will be present for the same noise distributions considered above (uniform/Gaussian and GGD) in the problem of locally optimum detection of signals based on binary quantized measurements. For this problem, it can be shown that the asymptotic performance also depends on the FI for quantized measurements [Kassam 1977].

1.3.6 MLE for binary quantization

The specific implementation of (a1) in the binary case with a fixed threshold can be done in a simple way [Papadopoulos 2001] (revisited in [Ribeiro 2006a]). The sequence of N quantized measurements can be viewed as a sequence of N i.i.d. samples from a Bernoulli distribution with probability p = P(i_k = 1) = 1 − F(τ₀ − x). Thus, hiding the functional dependency on x and τ₀, we can calculate the likelihood of p for the sequence i_{1:N}. The likelihood of p for i_{1:N} can be written in a simple form by observing the following:

• For a measurement i_k, P(i_k = 1; p) = p and P(i_k = −1; p) = 1 − p. We can write P(i_k; p) in the form \(p^{f_1\left(i_k\right)}\left(1 - p\right)^{f_{-1}\left(i_k\right)}\), where the functions f₁ and f₋₁ are respectively 1 and 0 when i_k = 1, and 0 and 1 when i_k = −1. A simple choice for these functions is \(f_1\left(i_k\right) = \frac{i_k + 1}{2}\) and \(f_{-1}\left(i_k\right) = \frac{1 - i_k}{2}\).

• As the measurements are independent, the likelihood for the sequence i_{1:N} is the product of the marginal likelihoods P(i_k; p). This leads to

\[ L\left(p; i_{1:N}\right) = \prod_{k=1}^{N} p^{\frac{i_k + 1}{2}}\left(1 - p\right)^{\frac{1 - i_k}{2}}. \]

Calculating its logarithm and then evaluating the MLE for p, denoted \(\hat{P}_{ML}\), we get the following [Wasserman 2003, p. 123]:

\[ \hat{P}_{ML} = \frac{1}{N}\sum_{k=1}^{N}\frac{1 + i_k}{2}. \tag{1.43} \]

The MLE (in general) has the property that if we want to estimate a parameter x which is an invertible function of z, x = g(z), and we know the MLE for z, \(\hat{Z}_{ML}\), then the MLE for x is \(\hat{X}_{ML} = g\left(\hat{Z}_{ML}\right)\) [Kay 1993, p. 176]. This property is known as functional invariance. For our problem we can write

\[ x = g\left(p\right) = \tau_0 - F^{-1}\left(1 - p\right), \tag{1.44} \]

where F^{−1} is the inverse of the noise CDF.
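Combining (1.43) with (1.44) gives a one-line estimator. The sketch below is illustrative (not the thesis code): it assumes Gaussian noise in the parametrization (1.24) and inverts F by bisection, since the standard library provides no inverse error function.

```python
import math
import random

def F(e, delta=1.0):  # Gaussian noise CDF, thesis parametrization (1.24)
    return 0.5 * (1.0 + math.erf(e / delta))

def F_inv(u, delta=1.0):
    # invert the CDF by bisection on [-10 delta, 10 delta]
    lo, hi = -10.0 * delta, 10.0 * delta
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if F(mid, delta) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mle_binary(bits, tau0, delta=1.0):
    # (1.43)-(1.44): p_hat = mean of (1 + i_k)/2, x_hat = tau0 - F^{-1}(1 - p_hat)
    p_hat = sum((b + 1) / 2 for b in bits) / len(bits)
    return tau0 - F_inv(1.0 - p_hat, delta)

# quick Monte Carlo check with the threshold at the true parameter (eps = 0);
# the noise std is delta / sqrt(2) in this parametrization
random.seed(0)
x_true, tau0, delta, N = 0.3, 0.3, 1.0, 5000
bits = [1 if x_true + random.gauss(0.0, delta / math.sqrt(2.0)) > tau0 else -1
        for _ in range(N)]
x_hat = mle_binary(bits, tau0, delta)
```

With ε = 0 the asymptotic standard deviation is √(B⋆/N) = δ√(π/(4N)), about 0.0125 here, so x̂ should fall close to x = 0.3.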
By definition F^{−1} exists, as F is strictly increasing due to the monotonicity assumption on F, so the function g in this case is invertible. Thus, by the functional invariance of the MLE, we can obtain \(\hat{X}_{ML,q}\) by replacing p in (1.44) with \(\hat{P}_{ML}\) given by (1.43). This leads to an analytical expression for the MLE:

\[ \hat{X}_{ML,q} = g\left(\hat{P}_{ML}\right) = \tau_0 - F^{-1}\left(1 - \hat{P}_{ML}\right) = \tau_0 - F^{-1}\left[\frac{1}{2}\left(1 - \frac{1}{N}\sum_{k=1}^{N} i_k\right)\right]. \tag{1.45} \]

Therefore, the solution to problem (a) (p. 27) in the binary case can be detailed as follows:

Solution to (a) - MLE for binary quantized measurements and fixed threshold τ₀ (a1.1)

1) Estimator:
\[ \hat{X}_{ML,q} = g\left(\hat{P}_{ML}\right) = \tau_0 - F^{-1}\left(1 - \hat{P}_{ML}\right) = \tau_0 - F^{-1}\left[\frac{1}{2}\left(1 - \frac{1}{N}\sum_{k=1}^{N} i_k\right)\right]. \]

2) Performance (asymptotic): \(\hat{X}_{ML,q}\) is asymptotically unbiased,
\[ E\left[\hat{X}_{ML,q}\right] \underset{N\to\infty}{=} x, \]
and its asymptotic MSE or variance is given by
\[ \text{Var}\,\hat{X}_{ML,q} \underset{N\to\infty}{\sim} \text{CRB}^{B}_{q} = \frac{F\left(\tau_0 - x\right)\left[1 - F\left(\tau_0 - x\right)\right]}{N f^{2}\left(\tau_0 - x\right)}, \]
which is minimal for commonly used noise models (Gaussian, Laplacian and Cauchy distributions) if τ₀ = x, attaining \(\frac{1}{4 N f^{2}\left(0\right)}\), and increases with |τ₀ − x|.

Notice that this algorithm can be used for any noise distribution, not only for symmetric unimodal distributions.

1.4 Multibit quantization

Now we study the multiple-interval (multibit) case, N_I > 2. The expression characterizing estimation performance for this case is given by (1.13):

\[ I_q\left(\varepsilon\right) = \sum_{i_k=1}^{\frac{N_I}{2}} \frac{\left[f\left(\varepsilon + \tau'_{i_k-1}\right) - f\left(\varepsilon + \tau'_{i_k}\right)\right]^{2}}{F\left(\varepsilon + \tau'_{i_k}\right) - F\left(\varepsilon + \tau'_{i_k-1}\right)} + \frac{\left[f\left(\varepsilon - \tau'_{i_k}\right) - f\left(\varepsilon - \tau'_{i_k-1}\right)\right]^{2}}{F\left(\varepsilon - \tau'_{i_k-1}\right) - F\left(\varepsilon - \tau'_{i_k}\right)}. \]

We recall that a larger I_q(ε) gives a better asymptotic estimation performance. We will start by analyzing the influence of the central threshold. For verifying symmetry, we replace ε by −ε:
\[ I_q\left(-\varepsilon\right) = \sum_{i_k=1}^{\frac{N_I}{2}} \frac{\left[f\left(-\varepsilon + \tau'_{i_k-1}\right) - f\left(-\varepsilon + \tau'_{i_k}\right)\right]^{2}}{F\left(-\varepsilon + \tau'_{i_k}\right) - F\left(-\varepsilon + \tau'_{i_k-1}\right)} + \frac{\left[f\left(-\varepsilon - \tau'_{i_k}\right) - f\left(-\varepsilon - \tau'_{i_k-1}\right)\right]^{2}}{F\left(-\varepsilon - \tau'_{i_k-1}\right) - F\left(-\varepsilon - \tau'_{i_k}\right)}. \]

The following equalities come from the symmetry assumptions:

\[ F\left(-\varepsilon + \tau'_{i_k}\right) = 1 - F\left(\varepsilon - \tau'_{i_k}\right), \qquad F\left(-\varepsilon + \tau'_{i_k-1}\right) = 1 - F\left(\varepsilon - \tau'_{i_k-1}\right), \]
\[ F\left(-\varepsilon - \tau'_{i_k-1}\right) = 1 - F\left(\varepsilon + \tau'_{i_k-1}\right), \qquad F\left(-\varepsilon - \tau'_{i_k}\right) = 1 - F\left(\varepsilon + \tau'_{i_k}\right), \]
\[ f\left(-\varepsilon + \tau'_{i_k}\right) = f\left(\varepsilon - \tau'_{i_k}\right), \qquad f\left(-\varepsilon + \tau'_{i_k-1}\right) = f\left(\varepsilon - \tau'_{i_k-1}\right), \]
\[ f\left(-\varepsilon - \tau'_{i_k-1}\right) = f\left(\varepsilon + \tau'_{i_k-1}\right), \qquad f\left(-\varepsilon - \tau'_{i_k}\right) = f\left(\varepsilon + \tau'_{i_k}\right). \]

Applying these expressions to I_q(−ε) and multiplying by −1 inside the squared terms, we get I_q(ε) = I_q(−ε); thus, the even symmetry observed in the binary case extends to this case.

1.4.1 The Laplacian case

We start with the Laplacian case, which is easy to treat analytically. If we set ε = 0,

\[ I_q\left(0\right) = \sum_{i_k=1}^{\frac{N_I}{2}} \frac{\left[f\left(\tau'_{i_k-1}\right) - f\left(\tau'_{i_k}\right)\right]^{2}}{F\left(\tau'_{i_k}\right) - F\left(\tau'_{i_k-1}\right)} + \frac{\left[f\left(-\tau'_{i_k}\right) - f\left(-\tau'_{i_k-1}\right)\right]^{2}}{F\left(-\tau'_{i_k-1}\right) - F\left(-\tau'_{i_k}\right)}. \]

Using also the symmetry assumption (a development similar to the one above), one can easily observe that the second term inside the sum is equal to the first, which means that we can rewrite the sum as

\[ I_q\left(0\right) = 2\sum_{i_k=1}^{\frac{N_I}{2}} \frac{\left[f\left(\tau'_{i_k}\right) - f\left(\tau'_{i_k-1}\right)\right]^{2}}{F\left(\tau'_{i_k}\right) - F\left(\tau'_{i_k-1}\right)}. \tag{1.46} \]

Using the PDF and CDF of the Laplacian distribution, (1.27) and (1.28), and noting that τ'_{i_k} ≥ 0 (so that the absolute value and sign functions can be simplified), each term of the sum becomes

\[ \frac{\left[\frac{1}{2\delta}\left(\exp\left(-\frac{\tau'_{i_k-1}}{\delta}\right) - \exp\left(-\frac{\tau'_{i_k}}{\delta}\right)\right)\right]^{2}}{\frac{1}{2}\left[\exp\left(-\frac{\tau'_{i_k-1}}{\delta}\right) - \exp\left(-\frac{\tau'_{i_k}}{\delta}\right)\right]} = \frac{1}{2\delta^{2}}\left[\exp\left(-\frac{\tau'_{i_k-1}}{\delta}\right) - \exp\left(-\frac{\tau'_{i_k}}{\delta}\right)\right], \]

where for the last term τ'_{N_I/2} = ∞, so that its exponential vanishes. With the factor 2 of (1.46), the sum telescopes: all intermediate terms cancel, and I_q(0) is given by only one term,

\[ I_q\left(0\right) = \frac{1}{\delta^{2}}\exp\left(-\frac{\tau'_0}{\delta}\right); \]

as τ'₀ = 0, we have

\[ I_q\left(0\right) = \frac{1}{\delta^{2}}. \]

Surprisingly, this is exactly the same as the FI for continuous measurements (Why?
- App. A.1.3). Thus, this means not only that τ₀ = x is optimal for the Laplacian distribution, but also that no loss of performance is observed. As the quantized measurement FI can only increase by adding quantization intervals and as it is upper bounded by the continuous FI, we see that once we have placed the threshold at x, the quantized measurement FI is the same for all N_I ≥ 2. This means that in practice, as we want to minimize the rate, the optimal choice of the number of quantization intervals is N_I = 2.

1.4.2 The Gaussian and Cauchy cases under uniform quantization

Instead of diving into calculus to try to obtain some characterization of I_q as a function of ε, we preferred to directly plot its influence for a given set of thresholds. We evaluated I_q given by (1.13) as a function of ε/δ, with ε/δ ∈ [−10, 10]. The evaluation was done for the Gaussian and Cauchy distributions. The quantizer was assumed to have N_I = 8 and a uniform step ∆ between thresholds, which means that τ' = [0, ∆, 2∆, 3∆, +∞]. Here, uniform quantization was assumed only to simplify the presentation.⁵ Three different ∆ were chosen for the evaluation: ∆ = 0.1δ, ∆⋆ and 2δ. ∆⋆ was chosen as the maximizer of I_q when ε = 0 and was obtained by exhaustive search.

The results are given in Fig. 1.8, where the continuous FI I_c is also plotted for comparison. Remember that for the Gaussian distribution I_c = 2/δ². For the Cauchy distribution we have I_c = 1/(2δ²) (Why? - App. A.1.4). Observe that in all cases the point ε = 0 gives maximum I_q. Note that, differently from the binary case, the FI does not strictly decrease when |ε| increases; this only happens when |ε| is outside the quantizer range. We can also see that the optimal ∆ gives I_q values very close to I_c. It is also interesting to observe that when we choose ∆ very large compared with ∆⋆, we obtain a maximum I_q smaller than for ∆⋆, but this I_q does not decrease to zero inside the quantizer range.

5. It will be shown in Part II that, for large N_I, the optimal quantization intervals may not be uniform.
This indicates that when we have prior information on the interval of values where x is located, a more robust solution can be found by using a large quantization step (for example, a ∆ equal to the prior interval length divided by N_I). Clearly, in this case, the price to pay is that even if we have ε = 0 the performance is lower than the optimum, being very close to the performance of a binary quantizer.

Differently from the binary case, after evaluating I_q(ε) for the GGD with β > 2 and N_I > 2, it was observed that when we use ∆⋆ as quantization step, the symmetric quantizer assumption seems to force the performance to be optimal for ε = 0. Less surprisingly now, when the quantization step is chosen too large, the asymmetric behavior appears; this is due to the fact that the performance around ε = τ'_i is very close to the binary quantizer performance. In the same way as for the other noise distributions considered above, it was also observed that when the parameter is outside the quantizer range the performance is degraded.

Figure 1.8: FI for a range [−10, 10] of the normalized difference ε/δ between the central threshold τ₀ and the true parameter x. The quantizer with N_I = 8 is uniform with quantization interval length ∆ (in the granular region). In (a), the noise distribution is Gaussian and ∆ = 0.1δ, ∆⋆ = 0.399δ, 2δ; ∆⋆ is the optimal quantization step for ε/δ = 0. In (b), the Cauchy noise distribution is used and ∆ = 0.1δ, ∆⋆ = 0.5878δ, 2δ; ∆⋆ is also the optimal quantization step for ε/δ = 0. For both cases, I_c is the FI for the continuous measurement: in the Gaussian case I_c = 2/δ², while in the Cauchy case I_c = 1/(2δ²).
The normalizations of the difference range and of the FI were done to obtain a plot independent of δ.

In all tested cases⁶ (under the symmetry assumptions), it was observed that ε = 0 is the optimal solution when ∆⋆ is used. Thus, we can say that for commonly used noise models, if the quantization thresholds are well chosen, τ0 = x is optimal. The "commonly used" class here seems to be larger than in the binary case, as the GGD with β > 2 no longer exhibit the asymmetric behavior for the optimal central threshold.

After setting τ0 = x, we still need to characterize the other thresholds to have a full performance characterization depending only on NI. This can be equivalently stated as finding the variations from the central threshold τ′ maximizing Iq(0) given by (1.46):

τ′⋆ = argmax_{τ′} Iq(0).   (1.47)

Unfortunately, an analytical solution cannot be found in general. An efficient solution could be obtained if this problem were convex or convexifiable [Boyd 2004], but this is not the case, so it is a very complicated multidimensional maximization problem. A possible approach is to constrain the quantizer to be uniform; the problem is then one-dimensional and can be solved by exhaustive search (searching for the maximum on a fine grid of possible values). Existence of a non-degenerate solution (0 < ∆⋆ < ∞) is guaranteed by the following argument: for ∆ → +∞, all the distribution mass is concentrated on the first quantizer interval (remember that ε = 0), thus Iq equals the binary Iq, and for ∆ → 0 we also get directly the binary quantization performance. As explained above, Iq increases when we add thresholds, so at least one non-degenerate solution must exist.

⁶Two families of distributions were tested, the GGD and the Student-t distribution, which will be presented later in Ch. 3. They were tested with uniform symmetric quantizers.
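The one-dimensional exhaustive search over the uniform step can be sketched as follows. Gaussian (GGD β = 2) noise with δ = 1 is assumed, as are the grid bounds and spacing; the helper name `fi_uniform` is hypothetical.

```python
import math

def fi_uniform(step, n_intervals, d=1.0):
    """FI of one quantized measurement for a uniform symmetric quantizer
    centered on the parameter (eps = 0), GGD beta = 2 noise of scale d."""
    f = lambda y: math.exp(-(y / d) ** 2) / (d * math.sqrt(math.pi))
    F = lambda y: 0.5 * (1.0 + math.erf(y / d))
    half = n_intervals // 2
    taus = [-math.inf] + [j * step for j in range(-half + 1, half)] + [math.inf]
    total = 0.0
    for lo, hi in zip(taus[:-1], taus[1:]):
        p = F(hi) - F(lo)
        if p > 0.0:
            flo = 0.0 if math.isinf(lo) else f(lo)
            fhi = 0.0 if math.isinf(hi) else f(hi)
            total += (flo - fhi) ** 2 / p
    return total

# exhaustive search on a fine grid, as described in the text
grid = [k * 0.001 for k in range(1, 3001)]
step_star = max(grid, key=lambda s: fi_uniform(s, 8))
print(step_star)  # for NI = 8 the text reports a value near 0.399*delta
```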
For a non-uniform solution, we can try to use local approximations based on Taylor series; this subject is left to Part II.

1.4.3 Summary of the main points

Up to this point we have the following:
• Estimation performance based on quantized measurements is bounded above by the estimation performance based on continuous measurements.
• Adding quantization levels does not decrease estimation performance (in most cases it increases it).
• The optimal central threshold τ0 must be placed at the true parameter x for commonly used noise models (Gaussian, Laplacian, Cauchy distributions). If we consider NI > 2, thresholds symmetric w.r.t. the central threshold and well chosen quantization intervals, then it seems that τ0 = x may be optimal for a large class of symmetric unimodal distributions (all the distributions above plus other members of the GGD family).
• Maximizing the estimation performance w.r.t. the other thresholds (1.47) is in general a complicated problem.

1.4.4 MLE for multibit quantization with fixed thresholds

As in the binary case, we still need to specify how to implement the MLE. Note that in this case the likelihood is given by (1.5)

L(x; i1:N) = ∏_{k=1}^N P(ik; x).

Now the MLE cannot be written in simple form and we must resort to numerical maximization. In general, we could use a steepest ascent algorithm to iteratively climb the likelihood function. As developed in [Ribeiro 2006a], an efficient solution can be found when the noise distribution is log-concave. A log-concave distribution is one whose density has a concave logarithm; a simple example is the Gaussian distribution. If f is log-concave, it is known that P(ik; x) is log-concave [Boyd 2004, p. 107] and that their product (the expression above) is also log-concave [Boyd 2004, p. 105]. Thus, under this assumption the logarithm of L is concave, and an efficient method for finding the MLE is Newton's algorithm [Boyd 2004, p. 496], given by [Ribeiro 2006a]:

X̂ML,j = X̂ML,j−1 − [ Σ_{k=1}^N ∂ log P(ik; x)/∂x ] / [ Σ_{k=1}^N ∂² log P(ik; x)/∂x² ] |_{x = X̂ML,j−1},   (1.48)

where the subscript j represents the iteration index and |_{x = X̂ML,j−1} means that the function on its left is evaluated at the point x = X̂ML,j−1. After starting the algorithm with an arbitrary X̂ML,0, the iterations are run until a pre-specified small minimum value εmin for the variation |X̂ML,j − X̂ML,j−1| is crossed. All the interest in obtaining a concave problem formulation comes from the fact that Newton's algorithm not only guarantees convergence to the global maximum but also does so with quadratic convergence: when the iterates get close to the optimal value, the number of correct digits of X̂ML,j roughly doubles at each iteration [Boyd 2004, p. 489]. Therefore, for NI > 2, with a fixed set of thresholds and a log-concave noise distribution, we have the following solution for problem (a) (p. 27):

Solution to (a) - MLE for quantized measurements with log-concave noise distribution, NI > 2 and fixed τ (a1.2)

1) Estimator
Define an initial guess X̂ML,0. Until |X̂ML,j − X̂ML,j−1| < εmin, do

X̂ML,j = X̂ML,j−1 − [ Σ_{k=1}^N ∂ log P(ik; x)/∂x ] / [ Σ_{k=1}^N ∂² log P(ik; x)/∂x² ] |_{x = X̂ML,j−1}

and set j = j + 1. Then X̂ML,q is set to the last X̂ML,j.

2) Performance (asymptotic)
X̂ML,q is asymptotically unbiased,

E[X̂ML,q] →_{N→∞} x,

and its asymptotic MSE or variance is given by

Var[X̂ML,q] ∼_{N→∞} CRBq = 1/(N Iq),

with Iq given by (1.13).

1.5 Adaptive quantizers: the high complexity fusion center approach

The analysis and results above indicate that, to get optimal estimation performance from quantized measurements, we must in general place the central threshold close to the true parameter⁷. This can be done by using the information given by the measurements to move the central threshold adaptively.
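Before moving to adaptive thresholds, the fixed-threshold Newton solution (a1.2) can be sketched numerically. This is a sketch only: Gaussian noise of scale δ = 1 is assumed, the thresholds and the true parameter are illustrative values, and the derivatives of the log-likelihood are taken by finite differences rather than analytically.

```python
import math, random

def F(y, d=1.0):
    # Gaussian (GGD beta = 2) noise CDF, scale d
    return 0.5 * (1.0 + math.erf(y / d))

def cell_of(y, taus):
    # index of the quantizer cell containing y (taus sorted, +/-inf at the ends)
    for i, (lo, hi) in enumerate(zip(taus[:-1], taus[1:])):
        if lo <= y < hi:
            return i
    return len(taus) - 2

def mle_newton(counts, taus, x0=0.0, tol=1e-9, max_iter=50):
    """Newton ascent on the quantized-data log-likelihood; the problem is
    concave when the noise density is log-concave (e.g. Gaussian)."""
    def loglik(x):
        return sum(n * math.log(max(F(hi - x) - F(lo - x), 1e-300))
                   for n, lo, hi in zip(counts, taus[:-1], taus[1:]) if n)
    x, h = x0, 1e-4
    for _ in range(max_iter):
        g = (loglik(x + h) - loglik(x - h)) / (2 * h)               # score
        H = (loglik(x + h) - 2 * loglik(x) + loglik(x - h)) / h**2  # curvature
        step = -g / H
        x += step
        if abs(step) < tol:
            break
    return x

random.seed(0)
taus = [-math.inf, -1.2, -0.8, -0.4, 0.0, 0.4, 0.8, 1.2, math.inf]
x_true = 0.13
counts = [0] * (len(taus) - 1)
for _ in range(20000):
    counts[cell_of(x_true + random.gauss(0.0, 1.0 / math.sqrt(2.0)), taus)] += 1
print(mle_newton(counts, taus))  # close to x_true = 0.13
```

Because the measurements are i.i.d. for fixed thresholds, only the cell counts matter, which keeps each Newton iteration cheap.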
The main existing work on this subject is presented in this section. An adaptive scheme to estimate x based on a sensor network of binary quantizers is presented in [Li 2007]. The main idea is that enhanced estimation performance can be obtained if the sensors can place their thresholds dynamically around x. Here, we present an equivalent sequential version using only one sensor. The following scheme is proposed:

1. A sensor can communicate binary measurements to a fusion center. The sensor measurement noise sequence is supposed to be i.i.d.
2. The sensor starts with a known binary threshold τ0,0, where the second subscript is the discrete-time index. Note that the threshold is now time-varying.
3. At each instant k, the sensor obtains a binary quantized measurement ik (ik ∈ {−1, 1}).
4. The sensor then updates the threshold by the following simple cumulative rule:

τ0,k = τ0,k−1 + γ ik,   (1.49)

where γ is a constant positive adaptation step (see the remarks after the MLE definition).
5. The sensor sends its measurement ik to the fusion center.
6. The fusion center updates its own copy of τ0,k and stores both ik and τ0,k in memory. Note that the fusion center threshold is exactly the same as the one obtained by the sensor threshold update.
7. After a predefined number of iterations, for example N, or at each iteration k, the fusion center can get an estimate of x more precise than τ0,k by using an MLE based on all past ik.

⁷The literature on the subject also points in the same direction. The case where x is constrained to lie in a bounded interval X of R was extensively studied in [Papadopoulos 2001]. Main attention was given to the effects of different schemes for setting τ0. The schemes considered were: fixed; varying but random and i.i.d.; varying deterministically; and based on feedback.
For each scheme, the worst-case CRBq (x chosen to maximize the CRB) was evaluated and divided by the continuous measurement CRB to give a measure of the performance loss induced by quantization. The loss was shown to be more sensitive to an equivalent signal-to-noise ratio (the interval X length divided by the noise scale factor) in the fixed case and insensitive to it in the feedback case. Some solutions based on iterative maximum likelihood techniques, which put the new threshold at the last ML estimate, were presented, but no theoretical proof that they reach the minimum CRBq was given.

In [Ribeiro 2006a], where mainly the binary quantization Gaussian noise case was studied, it was pointed out that the sensitivity of the estimation performance to ε and its optimality for ε = 0 indicate that, to enhance performance, we could move the binary threshold adaptively, placing it at the last available estimate X̂ to get closer and closer to the true x.

1.5.1 MLE for the adaptive binary scheme

As the threshold depends on the measurements, the measurements are no longer independent. However, as a measurement ik depends on past measurements only through τ0,k−1, the measurements are independent conditioned on the threshold that was used. This leads to the following likelihood and log-likelihood for the measurements up to time N:

L(x; i1:N) = P(i1:N; x) = ∏_{k=1}^N P(ik | ik−1, …, i1; x) = ∏_{k=1}^N P(ik | τ0,k−1; x)
           = ∏_{k=1}^N [1 − F(τ0,k−1 − x)]^{(1+ik)/2} F(τ0,k−1 − x)^{(1−ik)/2},   (1.50)

log L(x; i1:N) = Σ_{k=1}^N { (1+ik)/2 log[1 − F(τ0,k−1 − x)] + (1−ik)/2 log F(τ0,k−1 − x) },   (1.51)

where the vertical bar inside the probability symbol means that the probability measure is evaluated for the r.v. on the left side of the bar, conditioned on the r.v. on the right side of the bar.
The conditioning makes the output ik depend on τ0,k−1 as if it were a deterministic parameter; that is why we can use the same notation, with the CDF F parametrized by a fixed nonrandom threshold. At the fusion center, at time N, all the thresholds and binary measurements are known; the maximum likelihood estimator can then be calculated by maximizing (1.50) or (1.51):

X̂ML,q = argmax_x ∏_{k=1}^N [1 − F(τ0,k−1 − x)]^{(1+ik)/2} F(τ0,k−1 − x)^{(1−ik)/2}   (1.52)

or

X̂ML,q = argmax_x Σ_{k=1}^N { (1+ik)/2 log[1 − F(τ0,k−1 − x)] + (1−ik)/2 log F(τ0,k−1 − x) }.

Note that the threshold moves with each measurement, while the estimate is obtained only at the end of the measurement block. Observe also that when the noise distribution is log-concave, the MLE can again be obtained by using Newton's algorithm, as discussed in the previous section.

Remarks: it is intuitive to expect that the mean of τ0,k will reach an equilibrium after some time. If the threshold is above the parameter, iteration (1.49) will on average reduce its value; if the threshold is below the parameter, iteration (1.49) will on average increase its value. In the mean equilibrium we have E[(τ0,k − τ0,k−1)/γ] = E[ik] = 0; as ik = 1 or ik = −1, the only possibility for this to happen is P(ik = 1; x) = P(ik = −1; x) = 1/2, which in the case of symmetric noise distributions means that E[τ0,k] = x.

The variance of the thresholds will depend on the noise distribution but also on the parameter γ: if γ is chosen relatively small, once the threshold is close to the parameter it will fluctuate around it with a small variance. The fact that the threshold update is easy to implement (it is just a cumulative sum) while the estimator is a complex one fits real implementation constraints well, where complexity is strongly constrained on the sensor side of the problem and less constrained at the fusion center.
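The sensor side of the scheme (steps 3–5, update (1.49)) can be simulated in a few lines. Gaussian noise of scale δ = 1 and the specific values of x, γ and the horizon are illustrative assumptions made for the sketch.

```python
import math, random

def adaptive_binary(x, n_steps, gamma, noise_scale=1.0, seed=1):
    """Sensor-side update (1.49): tau_k = tau_{k-1} + gamma*i_k with
    i_k = sign(y_k - tau_{k-1}); Gaussian noise of scale delta assumed."""
    rng = random.Random(seed)
    tau, taus, bits = 0.0, [], []
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, noise_scale / math.sqrt(2.0))
        i = 1 if y > tau else -1
        taus.append(tau)          # threshold actually used for this bit
        bits.append(i)
        tau += gamma * i
    return taus, bits

taus, bits = adaptive_binary(x=2.0, n_steps=5000, gamma=0.1)
tail = taus[1000:]                # discard the transient
print(sum(tail) / len(tail))      # fluctuates around x = 2.0
```

The stored pairs (ik, τ0,k−1) are exactly the inputs needed by the fusion-center MLE (1.52), and the sample mean of the tail illustrates the equilibrium remark E[τ0,k] = x.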
1.5.2 Performance for the adaptive binary scheme

We must now look at the performance of this scheme. The performance analysis presented here was proposed in [Fang 2008]. Even if the measurements are dependent, it is known that, under some conditions that are satisfied here, the MLE will still attain the CRB [Crowder 1976]. Thus, the main problem is the evaluation of the FI. As the measurements are dependent, the FI for N measurements is not N times the FI for one measurement, and we need to evaluate it using the score function for the entire block of measurements. The FI for N measurements is

Iq,1:N = E[Sq²] = E[ ( Σ_{k=1}^N ∂ log P(ik | τ0,k−1; x)/∂x )² ].

It was shown in [Li 2007] that this quantity is equal to

Iq,1:N = Σ_{k=1}^N E[ f²(τ0,k−1 − x) / ( F(τ0,k−1 − x) [1 − F(τ0,k−1 − x)] ) ],   (1.53)

where the expectation is evaluated w.r.t. the only r.v. that still appears in the expression, τ0,k−1. If we assume that τ0,0 = 0, then τ0,k−1 is a random walk on an infinite grid (more precisely finite for finite k) with values {…, −2γ, −γ, 0, γ, 2γ, …}. To understand how (1.53) was obtained, one can decompose the squared sum of score functions into a sum of squared scores and a sum of score cross products. As the measurements are conditionally independent, the expectation of each cross product is a product of expectations, which is zero because the expectation of a score function is zero [Kay 1993, p. 67]. Therefore, Iq,1:N is the expectation of the sum of squared scores. Decomposing the expectation into an expectation on ik conditioned on the thresholds and an expectation on the thresholds, one gets (1.53).

Denoting the probability of having τ0,k−1 = jγ by P(τ0,k−1 = jγ) = pj,k−1, we have:

Iq,1:N = Σ_{k=1}^N Σ_{j=−∞}^{+∞} [ f²(jγ − x) / ( F(jγ − x) [1 − F(jγ − x)] ) ] pj,k−1.   (1.54)

Note that this is equivalent to obtaining N measurements from a binary quantizer with a random thresholding scheme whose prior threshold distribution pk−1 changes in time. The prior distribution changes in such a way that, as k → ∞, most of its probability mass is expected to concentrate around the parameter. This is in contrast to the methods presented in [Ribeiro 2006a], where x is random with a given prior and N binary thresholds are chosen using a function of the prior distribution; in that case, having the right mode of the prior distribution is crucial, while in the adaptive scheme above the mode of pk−1 will be around x for large k without any initial prior.

Putting the factors of (1.54) in (infinite-dimensional) vector notation,

I′q = [ …, f²(−γ − x) / ( F(−γ − x)[1 − F(−γ − x)] ), f²(0 − x) / ( F(0 − x)[1 − F(0 − x)] ), f²(γ − x) / ( F(γ − x)[1 − F(γ − x)] ), … ]⊤,   (1.55)

pk−1 = [ …, p−1,k−1, p0,k−1, p1,k−1, … ]⊤,   (1.56)

allows the sum of products to be rewritten as a scalar product. Thus, (1.54) becomes

Iq,1:N = Σ_{k=1}^N I′q⊤ pk−1.   (1.57)

Using the definition of the threshold evolution (1.49), it is possible to observe that a specific threshold value jγ has a probability of occurring at instant k − 1 that depends on the probabilities of having thresholds at (j − 1)γ or (j + 1)γ and of measuring ik−1 = 1 or ik−1 = −1 respectively. This gives rise to a recursive equation for pj,k−1:

pj,k−1 = pj−1,k−2 [1 − F((j − 1)γ − x)] + pj+1,k−2 F((j + 1)γ − x).   (1.58)

This shows that the threshold values form a Markov chain, as the present probabilities of the threshold values depend only on the previous probabilities pj,k−2. The vector of threshold probabilities pk−1 can be written in recursive form

pk = T pk−1,   (1.59)

where T is an (infinite-dimensional) tridiagonal transition matrix whose nonzero entries, following (1.58), are T[j, j−1] = 1 − F((j − 1)γ − x) and T[j, j+1] = F((j + 1)γ − x); around the central threshold it reads

T = [ ⋱  ⋱  ⋱
      …  1 − F(−2γ − x)  0  F(0 − x)  0  0  …
      …  0  1 − F(−γ − x)  0  F(γ − x)  0  …
      …  0  0  1 − F(0 − x)  0  F(2γ − x)  …
      ⋱  ⋱  ⋱ ].

The stationarity theorem for Markov chains guarantees that pk−1 attains an asymptotic distribution p∞ [Fine 1968] (cited in [Fang 2008])⁸, and this distribution can be obtained by solving the system of equations

p∞ = T p∞.

⁸In [Fine 1968], it is shown that the possible threshold values can be separated into two classes of states, which are periodic. The probability vectors for each class are shown to converge to unique asymptotic probability vectors; put together, they form the vector p∞.

To solve this infinite-dimensional system, [Fang 2008] considered that only a part of the thresholds around the true parameter has non-negligible probability; for practical purposes, the non-negligible thresholds were taken to be those in the interval

Iτ = [−5σv − |x|, 5σv + |x|],

where σv is the standard deviation of the noise⁹. The non-negligible probability vector, denoted p̃∞, has size 2⌈(5σv + |x|)/γ⌉ + 1 = 2jmax + 1, where ⌈y⌉ is the smallest integer larger than y and the "+1" comes from the zero threshold. The approximate threshold distribution can then be obtained by solving

p̃∞ = [ p̃−jmax,∞, …, p̃0,∞, …, p̃jmax,∞ ]⊤ = T̃ p̃∞,   (1.60)

where T̃ is the transition matrix truncated around the zero threshold; writing ε = τ0,0 − x = −x for the initial threshold deviation, its upper left corner is

T̃ = [ 0  F(−γ(jmax − 1) + ε)  0  …
      1 − F(−γ jmax + ε)  0  F(−γ(jmax − 2) + ε)  …
      0  1 − F(−γ(jmax − 1) + ε)  0  …
      ⋮  ⋮  ⋮  ⋱ ].

One can also truncate I′q to keep only the elements with non-negligible probability:

Ĩ′q = [ f²(−γ jmax + ε) / ( F(−γ jmax + ε)[1 − F(−γ jmax + ε)] ), …, f²(γ jmax + ε) / ( F(γ jmax + ε)[1 − F(γ jmax + ε)] ) ]⊤.   (1.61)

Following the development in [Fang 2008], after a finite time Nc the probability vector pk will be indistinguishable from p∞; thus, when N → ∞, an infinite number of terms in Iq,1:N behave approximately as Ĩ′q⊤ p̃∞, which leads to the following asymptotic approximation of the FI:

Iq,1:N = Σ_{k=1}^N I′q⊤ pk−1 ∼_{N→∞} N Ĩ′q⊤ p̃∞.   (1.62)

⁹For one of the noise distributions considered here, the Cauchy distribution, the standard deviation is undefined. In this case, one can use the scale parameter δ instead of the standard deviation σv.

This gives the following solution for problem (a) (p. 27):

Solution to (a) - MLE for binary quantized measurements with adaptive thresholds given by a simple cumulative sum. (a2.1)

1) Estimator
Define an initial threshold τ0,0 and a positive γ, then from k = 1 to N:
• the sensor obtains a binary measurement ik using τ0,k−1;
• the sensor sends ik to the fusion center and updates the threshold (1.49): τ0,k = τ0,k−1 + γ ik;
• the fusion center stores ik and also evaluates and stores τ0,k.
With i1:N and τ0,1:N, the fusion center evaluates the MLE (1.52):

X̂ML,q = argmax_x ∏_{k=1}^N [1 − F(τ0,k−1 − x)]^{(1+ik)/2} F(τ0,k−1 − x)^{(1−ik)/2}.

2) Performance (asymptotic and approximate)
X̂ML,q is asymptotically unbiased,

E[X̂ML,q] →_{N→∞} x,

and its asymptotic MSE or variance can be approximated by

Var[X̂ML,q] ∼_{N→∞} CRBq ≈ 1/(N Ĩ′q⊤ p̃∞),

with Ĩ′q given by (1.61) and p̃∞ by (1.60).

An alternative for obtaining analytical results on the vector p∞, without resorting to a truncation approximation, is to consider that x lies in a symmetric interval [−A, A], where A is a positive real. We can then create boundaries on the possible threshold values in such a way that the number of possible thresholds is finite.
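The truncated computation of p̃∞ and of the per-sample FI Ĩ′q⊤ p̃∞ in (1.62) can be sketched numerically. Assumptions made here and not in the text: Gaussian noise with δ = 1, reflection of the random walk at the truncation boundary (to keep the truncated matrix stochastic), and a "lazy" power iteration p ← (p + Tp)/2, used because the chain itself is periodic (as noted in the footnote) so plain iteration would oscillate; the fixed point p = Tp is unchanged.

```python
import math

def F(y, d=1.0):
    return 0.5 * (1.0 + math.erf(y / d))

def f(y, d=1.0):
    return math.exp(-(y / d) ** 2) / (d * math.sqrt(math.pi))

def stationary_fi(x, gamma, jmax, d=1.0, n_iter=5000):
    """Approximate asymptotic per-sample FI I'_q^T p_inf of (1.62),
    truncating the threshold random walk to j in [-jmax, jmax]."""
    n = 2 * jmax + 1
    p = [1.0 / n] * n
    # probability of an upward move (+gamma) from threshold j*gamma
    up = [1.0 - F((j - jmax) * gamma - x, d) for j in range(n)]
    for _ in range(n_iter):
        q = [0.0] * n
        for j in range(n):
            q[min(j + 1, n - 1)] += up[j] * p[j]          # reflect at edges
            q[max(j - 1, 0)] += (1.0 - up[j]) * p[j]
        p = [0.5 * (a + b) for a, b in zip(p, q)]          # lazy averaging
    fi = 0.0
    for j in range(n):
        t = (j - jmax) * gamma - x
        den = F(t, d) * (1.0 - F(t, d))
        if den > 0.0:                                      # guard far tails
            fi += p[j] * f(t, d) ** 2 / den
    return fi

# per-sample FI of the adaptive scheme vs. the binary optimum 4/pi ~ 1.273
print(stationary_fi(x=0.0, gamma=0.2, jmax=40))
```

The result sits below 4f²(0) = 4/π, the FI of a binary threshold placed exactly at x, the gap reflecting the residual fluctuation of the threshold governed by γ.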
In this way, the threshold sequence can be modeled as a Markov chain defined on a domain with a finite number of values, and we can evaluate the asymptotic threshold distribution without truncation approximations (More? - App. A.2.4).

1.5.3 Adaptive scheme based on the MLE

One disadvantage of (a2.1) is that the threshold fluctuates around x and does not converge to x, producing a performance that is still not optimal. A remedy for this problem was proposed also in [Fang 2008] (and previously in [Papadopoulos 2001]). By accepting a feedback from the fusion center, and assuming that the fusion center has enough processing power to evaluate the MLE of the past measurements at each time, we can use the last MLE estimate instead of the cumulative sum for updating the threshold. Intuitively, with a growing number of measurements for the MLE, the threshold will be placed closer and closer to x, producing as a result an MLE with performance approaching the optimal one (for τ0,k−1 = x). The new update is given by

τ0,k = X̂ML,k,   (1.63)

where X̂ML,k is the MLE for the measurements i1:k. The asymptotic performance analysis was also presented in [Fang 2008]; the authors claim that, in the binary quantization Gaussian noise case, the performance (variance) is asymptotically given by πδ²/(4N). This update scheme is therefore asymptotically optimal, as Iq(0) = 4/(πδ²) is the maximum FI that can be achieved. We will mimic some parts of their proof, but we will change some arguments to obtain a more general result for NI ≥ 2.

1.5.4 Performance for the adaptive multibit scheme based on the MLE

Under an adaptive τ0 with the vector τ′ fixed, we can rewrite the FI given in (1.53) for a general NI using a parametrization by the error εk = τ0,k−1 − x, which now depends on time; it is given as follows (Why? - App. A.1.5):

I^{NI}_{q,1:N} = Σ_{k=1}^N E[Iq(εk)],   (1.64)

where εk is a sequence of r.v.
defined on R, contrary to the previous case where the thresholds were defined on a grid. The function Iq(εk) is given by (1.13). To proceed, we make additional assumptions on Iq(ε) (the assumptions on the noise AN1 p. 34 and AN2 p. 34 are also assumed).

Assumptions on Iq for the MLE update to have asymptotically optimal performance:

A1.MLE Iq(ε) is maximum for ε = 0.

A2.MLE Iq(ε) is locally decreasing around zero.

A3.MLE The function Iq(ε) has bounded Iq(0), dIq(ε)/dε|_{ε=0} = 0 and bounded d²Iq(ε)/dε²|_{ε=0}, therefore accepting a Taylor approximation around zero (for small ε′):

Iq(ε′) = Iq(0) + (ε′²/2) d²Iq(ε)/dε²|_{ε=0} + ◦(ε′²),   (1.65)

where ◦(ε′²) means that the quantity ◦(ε′²)/ε′² tends to zero when ε′ tends to zero.

If we look at Fig. 1.8 (p. 57), we can see that these assumptions seem to be satisfied by the Gaussian and Cauchy distributions. Except for the Laplacian-like distributions, which have a derivative discontinuity at ε = 0, a large class of smooth symmetric unimodal distributions satisfies these assumptions for NI > 2 and well chosen quantizer intervals. Note that in the binary cases where the threshold must be placed asymmetrically, we can add a fixed bias to the MLE threshold update to obtain a better performance. Also, in the asymmetric cases, all the assumptions can be stated around the maximum point of the FI instead of ε = 0.

The objective now is to bound the quantity I^{NI}_{q,1:N} from above and below in such a way that, when N → ∞, both bounds "squeeze" I^{NI}_{q,1:N} into an interval that tends asymptotically to N Iq(0). For a large number of measurements M < N, the MLE studied here is consistent even if the measurements are dependent; to verify this, one can check the regularity conditions given in [Crowder 1976]. Thus, for ε′ > 0 and ξ > 0, it is possible to choose a number of measurements M such that

P(|εk| ≤ ε′) ≥ 1 − ξ, for k ≥ M.   (1.66)
Applying this inequality with the monotonicity property of A2.MLE, we can find an M such that

P(Iq(εk) ≥ Iq(ε′)) ≥ 1 − ξ, for k ≥ M.   (1.67)

Now the sum in (1.64) can be separated into two sums, one for the terms with k < M, Iq,1:M−1, and the other with k ≥ M, Iq,M:N:

I^{NI}_{q,1:N} = Iq,1:M−1 + Iq,M:N = { Σ_{k=1}^{M−1} E[Iq(εk)] } + { Σ_{k=M}^{N} E[Iq(εk)] }.   (1.68)

Using A1.MLE and the fact that Iq(εk) is a nonnegative quantity, we know that Iq(εk) ∈ [0, Iq(0)]. Thus, the first term can be written as

Iq,1:M−1 = αM (M − 1) Iq(0),   (1.69)

with αM ∈ [0, 1]. The terms of Iq,M:N can be lower bounded using Markov's inequality, which states that for a nonnegative r.v. Y and a value y > 0 [Wasserman 2003, p. 63]:

P(Y > y) ≤ E(Y)/y.

Using this inequality for an arbitrary term of Iq,M:N with the value Iq(ε′) gives

E[Iq(εk)] ≥ Iq(ε′) P(Iq(εk) ≥ Iq(ε′)), for k ≥ M.   (1.70)

Then, supposing that the thresholds are updated using the MLE, we can use (1.67) in (1.70) to get

E[Iq(εk)] ≥ Iq(ε′)(1 − ξ), for k ≥ M.   (1.71)

For sufficiently large M (and consequently N), I^{NI}_{q,1:N} can be lower bounded using (1.71) and (1.69):

I^{NI}_{q,1:N} ≥ αM (M − 1) Iq(0) + [N − (M − 1)] Iq(ε′)(1 − ξ).   (1.72)

From A1.MLE, the FI can be upper bounded by the optimal Iq:

I^{NI}_{q,1:N} ≤ N Iq(0).   (1.73)

Joining (1.72) and (1.73) gives the following:

αM (M − 1) Iq(0) + [N − (M − 1)] Iq(ε′)(1 − ξ) ≤ I^{NI}_{q,1:N} ≤ N Iq(0).   (1.74)

For small ε′, we can use A3.MLE to obtain

αM (M − 1) Iq(0) + [N − (M − 1)] [ Iq(0) + (ε′²/2) d²Iq(ε)/dε²|_{ε=0} + ◦(ε′²) ] (1 − ξ) ≤ I^{NI}_{q,1:N} ≤ N Iq(0).   (1.75)

The term on the left of the inequality can be rewritten as

N Iq(0)(1 − ξ) + (M − 1)[αM − (1 − ξ)] Iq(0) + [N − (M − 1)] (1 − ξ) [ (ε′²/2) d²Iq(ε)/dε²|_{ε=0} + ◦(ε′²) ].

Separating a factor N Iq(0), we can write the term above as N Iq(0)(1 − ξ′), with

ξ′ = ξ − (M − 1)[αM − (1 − ξ)]/N − [1 − (M − 1)/N] (1 − ξ) [ (ε′²/2) d²Iq(ε)/dε²|_{ε=0} + ◦(ε′²) ] / Iq(0).   (1.76)

Therefore, the inequality (1.74) becomes

N Iq(0)(1 − ξ′) ≤ I^{NI}_{q,1:N} ≤ N Iq(0).

By imposing N ≫ M (N much larger than M) so that (M − 1)/N is arbitrarily small, and by choosing M sufficiently large so that ξ is small, we can make the first two terms of ξ′ approach zero. Using also N ≫ M and choosing M sufficiently large so that ε′ is arbitrarily small, we can make the last term of ξ′ approach zero. Therefore, we can make the left side of the inequality above be close to N Iq(0) when N and M tend to infinity with the condition N ≫ M. As the upper bound on I^{NI}_{q,1:N} is also N Iq(0), we have

I^{NI}_{q,1:N} ∼_{N→∞} N Iq(0),   (1.77)

or equivalently

CRBq ∼_{N→∞} 1/(N Iq(0)).   (1.78)

We have now the following solution for problem (a)¹⁰:

Solution to (a) - MLE for quantized measurements with NI ≥ 2 and adaptive thresholds given by the MLE. (a2.2)

1) Estimator
Define an initial threshold τ0,0, then from k = 1 to N:
• the sensor obtains a quantized measurement ik using τ0,k−1;
• the sensor sends ik to the fusion center;
• the fusion center stores ik, then evaluates and stores X̂ML,k = τ0,k following (1.63), where the estimate X̂ML,k is given by

X̂ML,k = argmax_x ∏_{j=1}^k P(ij; x, τ0,j−1);

• the fusion center sends τ0,k = X̂ML,k to the sensor.

2) Performance (asymptotic)
X̂ML,q = X̂ML,k=N is asymptotically unbiased,

E[X̂ML,q] →_{N→∞} x,

and its asymptotic MSE or variance attains the optimal value

Var[X̂ML,q] ∼_{N→∞} 1/(N Iq(0)).

Now we have an estimator with adaptive thresholds (mainly the central threshold) that attains the asymptotically optimal performance.
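The feedback scheme (a2.2) can be sketched numerically for the binary Gaussian case. This is a sketch only: δ = 1 is assumed, and the full per-sample maximization is replaced by a single Fisher-scoring step warm-started at the previous estimate, an approximation introduced here to keep the simulation cheap; it is not the procedure of [Fang 2008].

```python
import math, random

def F(y): return 0.5 * (1.0 + math.erf(y))       # Gaussian noise CDF, delta = 1
def f(y): return math.exp(-y * y) / math.sqrt(math.pi)

def mle_feedback(x, n_steps, seed=3):
    """Feedback scheme (1.63): the threshold is the ML estimate of x from
    all past bits, here approximated by one Fisher-scoring step per sample."""
    rng = random.Random(seed)
    taus, bits = [], []
    xhat = 0.0
    for _ in range(n_steps):
        tau = xhat                     # feedback: threshold = last estimate
        y = x + rng.gauss(0.0, 1.0 / math.sqrt(2.0))
        b = 1 if y > tau else -1
        taus.append(tau)
        bits.append(b)
        g = info = 0.0                 # score and expected information
        for t, bk in zip(taus, bits):
            e = t - xhat
            Fe = min(max(F(e), 1e-12), 1.0 - 1e-12)
            g += f(e) / (1.0 - Fe) if bk > 0 else -f(e) / Fe
            info += f(e) ** 2 / (Fe * (1.0 - Fe))
        xhat += g / info               # scoring step on log-likelihood (1.51)
    return xhat

print(mle_feedback(x=1.5, n_steps=800))  # approaches x = 1.5
```

The empirical error variance over many seeds should approach the predicted πδ²/(4N) for large N.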
The estimator guides the quantizer dynamic range close to the parameter by setting the central point of the quantizer with a decreasing fluctuation around x.

¹⁰The threshold τ0,k−1 is added in the notation of the probabilities to make the dependence on time more explicit.

1.5.5 Equivalent low complexity asymptotic scheme

The main disadvantage of (a2.2) is its high complexity, since the MLE must be obtained at each iteration. In [Papadopoulos 2001], a heuristic based on an approximation of the expectation maximization method was presented for applying the MLE update with reduced complexity in the binary quantization and Gaussian noise case. The proposed threshold/estimate update is given by the following recursive expression:

X̂k = τ0,k = X̂k−1 + (δ√π / (2k)) ik.   (1.79)

Observe that the difference in complexity is large. In general, the MLE must be obtained with a maximization algorithm, e.g. Newton's algorithm, which itself has an inner recursive procedure that may need multiple iterations to reach convergence at each time k. In (1.79), we have only a recursion in k, which requires a multiplication of ik by a gain and a summation with the last estimate.

We can show that (1.79) generalizes easily to non-Gaussian noise. We will use a less heuristic method (less heuristic than the method used to obtain (1.79)). We assume, in addition to symmetry, that the noise PDF satisfies f⁽¹⁾(0) = 0. If we consider that k is large, then from the convergence of the CRB discussed above and the asymptotic normality of the MLE [Kay 1993, p. 167] (or [Crowder 1976]), the error between the parameter and the threshold used to obtain ik, ε = X̂ML,k−1 − x = τ0,k−1 − x, is Gaussian distributed with zero mean and variance 1/[(k − 1)Iq(0)]¹¹:

fε(ε) = √[(k − 1)Iq(0)/(2π)] exp[ −(k − 1)Iq(0) ε²/2 ],   (1.80)

where fε is the PDF of the error.

¹¹Observe that here we are using the parametrization of the Gaussian distribution with its variance and not with its scale parameter.
We can try to estimate the random error using the new quantized observation ik and the knowledge of its distribution given by the PDF above. After estimating it, we can correct X̂ML,k−1 using the estimate. As ε is random, we will use an estimator equivalent to the MLE but for random parameters: the maximum a posteriori (MAP) estimator. The posterior distribution (the one to be maximized) is the conditional PDF of ε given ik. Using Bayes' theorem, it is given by

p(ε | ik) = P(ik | ε) fε(ε) / P(ik),   (1.81)

where in the binary case the conditional probability P(ik | ε) is given by

P(ik | ε) = [1 − F(ε)]^{(1+ik)/2} F(ε)^{(1−ik)/2}.   (1.82)

The denominator is the marginal probability of the output ik and it does not depend on ε. The MAP estimator is then given by [Kay 1993, p. 350]

ε̂MAP = argmax_ε p(ε | ik).   (1.83)

In the same way as for the MLE, we can maximize the logarithm of the posterior; as P(ik) does not depend on ε, we can write an equivalent form of (1.83) as

ε̂MAP = argmax_ε log p(ε | ik) = argmax_ε { log[P(ik | ε)] + log[fε(ε)] }.   (1.84)

Using (1.81), (1.82) and (1.80) in the RHS of (1.84), we obtain, up to an additive constant independent of ε,

log p(ε | ik) = (1+ik)/2 log[1 − F(ε)] + (1−ik)/2 log[F(ε)] − (k − 1)Iq(0) ε²/2.   (1.85)

Under consistency of the MLE, it is expected that for large k the probability of |ε| being small is close to 1. Thus, we can look for a maximum point of (1.85) around zero. This leads us to expand log[1 − F(ε)] and log[F(ε)] around zero. The expansions are given by

log[1 − F(ε)] = log[1 − F(0)] + ε d log[1 − F(z)]/dz|_{z=0} + (ε²/2) d² log[1 − F(z)]/dz²|_{z=0} + ◦(ε²),
log[F(ε)] = log[F(0)] + ε d log[F(z)]/dz|_{z=0} + (ε²/2) d² log[F(z)]/dz²|_{z=0} + ◦(ε²).

Using the symmetry of the distribution (1 − F(0) = F(0) = 1/2), the constant terms are log(1/2) = −log(2). The first derivatives at the zero point are

d log[1 − F(z)]/dz|_{z=0} = −f(0)/[1 − F(0)] = −2f(0),
d log[F(z)]/dz|_{z=0} = f(0)/F(0) = 2f(0),

and, using the assumption f⁽¹⁾(0) = 0, the second derivatives are

d² log[1 − F(z)]/dz²|_{z=0} = −f⁽¹⁾(0)/[1 − F(0)] − f²(0)/[1 − F(0)]² = −4f²(0),
d² log[F(z)]/dz²|_{z=0} = f⁽¹⁾(0)/F(0) − f²(0)/F(0)² = −4f²(0).

Applying these expressions to the expansions above, we get

log[1 − F(ε)] = −log(2) − 2εf(0) − 4(ε²/2) f²(0) + ◦(ε²),
log[F(ε)] = −log(2) + 2εf(0) − 4(ε²/2) f²(0) + ◦(ε²).

These expansions can be used in (1.85), which gives (up to the same constant)

log p(ε | ik) = (1+ik)/2 [ −log(2) − 2εf(0) − 2ε² f²(0) + ◦(ε²) ] + (1−ik)/2 [ −log(2) + 2εf(0) − 2ε² f²(0) + ◦(ε²) ] − (k − 1)Iq(0) ε²/2.   (1.86)

To find the maximum, we differentiate log p(ε | ik) in (1.86) w.r.t. ε and equate it to zero. This gives

(k − 1)Iq(0) ε = (1+ik)/2 [ −2f(0) − 4εf²(0) + ◦(ε) ] + (1−ik)/2 [ 2f(0) − 4εf²(0) + ◦(ε) ] = −2f(0) ik − 4εf²(0) + ◦(ε).

For binary measurements, we know that Iq(0) = 4f²(0). Thus, adding 4εf²(0) to both sides gives

k 4f²(0) ε = −2f(0) ik + ◦(ε).

Thus, we have

ε̂MAP ∼_{k→∞} −ik / (2f(0) k).

The optimal new threshold/estimate when k → ∞ is then given by

X̂k = X̂k−1 − ε̂MAP ≈ X̂k−1 + ik / (2k f(0)).   (1.87)

This is exactly the recursive estimator obtained by [Papadopoulos 2001] when the noise is Gaussian (f(0) = 1/(√π δ) for the Gaussian distribution). Note that this recursive update/estimation procedure is asymptotically equivalent to the MLE update, as both procedures (MLE and MAP) have equivalent error distributions for k → ∞ [Wasserman 2003, p. 181].
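The recursion (1.87) itself is a one-line update. A minimal simulation, assuming Gaussian noise with δ = 1 and illustrative values of x and the horizon:

```python
import math, random

def recursive_estimator(x, n_steps, noise_scale=1.0, seed=7):
    """Low-complexity recursion (1.87): X_k = X_{k-1} + i_k/(2 k f(0)),
    with Gaussian noise of scale delta, so f(0) = 1/(sqrt(pi)*delta)."""
    rng = random.Random(seed)
    f0 = 1.0 / (math.sqrt(math.pi) * noise_scale)
    xhat = 0.0
    for k in range(1, n_steps + 1):
        y = x + rng.gauss(0.0, noise_scale / math.sqrt(2.0))
        i = 1 if y > xhat else -1      # threshold = last estimate
        xhat += i / (2.0 * k * f0)
    return xhat

print(recursive_estimator(x=1.0, n_steps=20000))  # close to 1.0
```

Each step needs one comparison, one gain and one addition, in contrast to the per-sample maximization of (a2.2).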
Clearly, some questions arise about the low complexity recursive estimator above:
• Can (1.87) converge if we use it when the initial distance |ε| = |τ0 − x| is arbitrary (not necessarily small)?
• Can we extend this low complexity recursive procedure to the NI > 2 case?
Answers to these questions will be given in Ch. 3.

1.6 Chapter summary and directions

We conclude this chapter with the main points observed until now and directions for future work.
• Estimation performance in terms of MSE can be minimized asymptotically under an unbiasedness constraint by the MLE (a1). The asymptotic performance is then mainly characterized by the CRB, which is given in terms of the FI.
• The FI for quantized measurements is upper bounded by the FI for continuous measurements and lower bounded by the FI for binary quantization. Moreover, it increases as additional quantization intervals are used.
• The CRB and FI are very sensitive to the central threshold of the quantizer.
– For commonly used noise models (Gaussian, Laplacian and Cauchy), the threshold must be placed exactly at the parameter.
– In the binary quantization case, even if we restrict the noise distribution to be symmetric and unimodal, this is not always true. We can find cases (GGD) where quantizing the input r.v. asymmetrically can be optimal. In these cases, it was also observed that the performance gain obtained by using an asymmetric quantizer seems to depend on the noise distribution. In general, however, the gain from using the optimal asymmetric quantizer in place of a symmetric quantizer seems to be small compared with the gain that can be obtained by using a symmetric quantizer in place of a poorly chosen asymmetric quantizer.
– An interesting subject for future research is to study in more detail the effect of the noise distribution on the shape of the performance function B(ε) in the asymmetric cases; for example, we can try to characterize the loss incurred by imposing symmetric quantization w.r.t. optimal quantization. Another possible point for future research is to see if such asymmetric behavior also appears in the problem of detection using binary quantized measurements.
– In all cases, under symmetry assumptions on the noise and on the quantizer, estimation performance degrades when the quantizer dynamic range (the quantizer threshold in the binary case) is very distant from the true parameter.
– For multibit quantization, also under symmetry assumptions, it seems that if we choose the quantizer thresholds (or equivalently the quantizer intervals) well, then for a large class of unimodal distributions it is optimal to place the central threshold at the true parameter. Note that quantizing "well" in this case means that we choose the quantization intervals to have good symmetric quantization performance. An interesting point for future analysis is to see if we can get better performance than in the symmetric case when we optimize the quantizer intervals for an asymmetric quantizer (one that is not centered at x). A partial answer will be given in Part II, where we will see that when the number of quantization intervals tends to infinity, the optimal quantizer is symmetric for symmetric noise distributions.
• Selection of optimal quantization intervals, or equivalently of optimal non central thresholds, was observed to be a difficult problem for nonuniform quantization. The asymptotic design of the optimal quantizer that approaches the optimal finite solution will also be studied in Part II.
• The MLE for binary quantized measurements and a fixed threshold can be obtained in closed form (a1.1).
In the general case it must be obtained numerically. When the noise distribution is log-concave, Newton's algorithm can be used as an efficient numerical solution (a1.2).
• As the performance degrades when the quantizer range is far from the parameter, the quantizer central threshold must be placed adaptively around the parameter. A simple solution in the binary case is to move the threshold up or down with a constant step. Then, asymptotically, the threshold will settle with its mean close to the parameter and will fluctuate around it. The measurements obtained in this case can be used to obtain an MLE whose asymptotic performance is less sensitive to uncertainty on the true parameter value (a2.1).
• By accepting an increased complexity, the central threshold (both in binary and non binary cases) can be set closer and closer to the true parameter by updating it at each time with the MLE based on all the past measurements (a2.2). This scheme asymptotically attains the performance obtained when the threshold is placed at the parameter, which is to say that this scheme is asymptotically optimal for commonly used noise models.
• As time goes to infinity, the threshold update based on the MLE is equivalent to a simple recursive update with decreasing correction gain (1.87). Low complexity recursive schemes of this type and their performance will be studied in detail in Ch. 3.

Chapter 2
Estimation of a varying parameter: what is done and a little more

In this chapter we study the estimation of a varying parameter based on quantized measurements. First, we will present the parameter evolution model and the measurement model. Then, we will present the optimal estimator in the MSE sense and its performance. Due to the difficulties that arise when we want analytical expressions for the optimal estimator and its performance, we will obtain the optimal estimator using a numerical method.
We present and discuss a numerical solution known as particle filtering, a method based on Monte Carlo simulation. We then give a bound on its performance using the Bayesian Cramér–Rao bound. After the analysis of the bound, we present a particle filtering scheme based on the quantized prediction error, commonly known as the quantized innovation. At the end of the chapter, we show that the optimal estimator has, asymptotically, a simple recursive form for a slowly varying parameter. After obtaining the performance of the asymptotically optimal estimator and comparing it to the lower bound on the MSE, we conclude the chapter with a summary and directions for work to be presented in other chapters or in the future.

Contributions presented in this chapter:
• Motivation to use the quantized innovation. By analyzing a simple signal model, we can obtain a detailed characterization of the bound on the mean squared error for estimation based on quantized measurements. From the bound, we can see clearly that a good estimation scheme can be obtained by quantizing the innovation. This differs from [Ribeiro 2006c] and [You 2008], where the motivation for using the quantized innovation does not come from any quantitative analysis and relies only on intuition.
• Asymptotically optimal estimator for a slowly varying parameter. We show that the asymptotically optimal estimator for a slowly varying Wiener process parameter can be approximated by a low complexity recursive estimator. We also verify its optimality by comparing it to a lower bound on the mean squared error. The Wiener process model that we consider is a special case of the model in [Ribeiro 2006c], but we do not assume that the noise is Gaussian and we do not impose binary quantization.

Contents
2.1 Parameter and measurement model . . . 77
  2.1.1 Parameter model . . . 77
  2.1.2 Measurement model . . . 77
2.2 Optimal estimator . . . 78
2.3 Particle Filtering . . . 81
  2.3.1 Monte Carlo integration . . . 81
  2.3.2 Importance sampling . . . 82
  2.3.3 Sequential importance sampling . . . 83
  2.3.4 Sequential importance resampling . . . 85
2.4 Evaluation of the estimation performance . . . 87
  2.4.1 Online empirical evaluation . . . 87
  2.4.2 BCRB . . . 87
2.5 Quantized innovations . . . 90
  2.5.1 Prediction and innovation . . . 91
  2.5.2 Bound for the quantized innovations . . . 93
  2.5.3 Gaussian assumption and asymptotic estimation of a slow parameter . . . 94
2.6 Chapter summary and directions . . . 102

2.1 Parameter and measurement model

2.1.1 Parameter model
The parameter to be estimated now is a stochastic process X defined on the probability space P = (Ω, F, P) with values in (R, B(R)). At each instant k ∈ N⋆, the corresponding scalar r.v. $X_k$ is given by the Wiener process model:
$$X_k = X_{k-1} + W_k, \quad k > 0, \tag{2.1}$$
where $W_k$ is the k-th element of a sequence of independent Gaussian r.v.; its mean is given by $u_k$ and its variance is a known constant $\sigma_w^2$. If $u_k = 0$, then $X_k$ forms a standard discrete-time Wiener process; otherwise, it is a Wiener process with drift.
The initial distribution of $X_0$ is supposed to be Gaussian with known mean $x'_0$ and known variance $\sigma_0^2$. The PDF of $X_0$, denoted $p(x_0)$, is also known as the initial prior of the stochastic process. For estimation purposes, the initial mean represents a guess of the value of $X_0$ and the initial variance represents the degree of uncertainty on this guess. From (2.1), we can see that, conditioned on $X_{k-1}$, $X_k$ is independent of the past $X_{0:k-2}$. Therefore, this process is a homogeneous Markov process. Until instant k, it can be characterized by its joint PDF $p(x_{0:k})$, which factorizes as follows:
$$p(x_{0:k}) = p(x_0) \prod_{j=1}^{k} p(x_j|x_{j-1}), \tag{2.2}$$
where $p(x_j|x_{j-1})$ is the conditional PDF of $X_j$ given $X_{j-1}$. This conditional PDF can be written using the Gaussian assumption on $W_k$ as
$$p(x_j|x_{j-1}) = \frac{1}{\sqrt{2\pi}\sigma_w} \exp\left[-\frac{1}{2}\left(\frac{x_j - x_{j-1} - u_j}{\sigma_w}\right)^2\right]. \tag{2.3}$$
Therefore, from the knowledge of $p(x_0)$, $u_k$ and $\sigma_w$, we can describe probabilistically the process X until any arbitrary instant k using (2.2) and (2.3).

2.1.2 Measurement model
Continuous measurement. The process X is measured with noise:
$$Y_k = X_k + V_k. \tag{2.4}$$
The same assumptions on $V_k$ as for constant x, AN1 and AN2, are considered in this case.
Quantizer. For tracking the varying parameter, the quantizer will be assumed to be dynamic, with varying threshold set $\boldsymbol{\tau}_k$:
$$\boldsymbol{\tau}_k = \begin{bmatrix} \tau_{-\frac{N_I}{2},k} & \cdots & \tau_{-1,k} & \tau_{0,k} & \tau_{1,k} & \cdots & \tau_{\frac{N_I}{2},k} \end{bmatrix}^\top.$$
The assumptions on the labeling of the outputs and symmetry, AQ1 and AQ2, are still considered to be valid. The quantized measurements are then given as the output of the quantization function Q() defined in (1.2),
$$i_k = Q(Y_k),$$
where, as in the adaptive case, the function Q can change in time.

2.2 Optimal estimator
As stated at the beginning of this chapter, we are interested in solving problem (b) (p. 29), that is, estimating $X_k$ based on the past and present quantized measurements $i_{1:k}$. In what follows, we consider that $\boldsymbol{\tau}_{1:k}$ is a fixed sequence.
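As a concrete illustration, the state model (2.1) and quantized measurements with a fixed threshold sequence can be simulated in a few lines; all constants below (noise levels, drift, horizon) are illustrative, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative constants: state noise, measurement noise, drift, horizon
sigma_w, sigma_v, u, n = 0.1, 1.0, 0.0, 200
sigma_0, x0_mean = 1.0, 0.0

# Initial prior and Wiener process evolution (2.1): X_k = X_{k-1} + W_k
x0 = x0_mean + sigma_0 * rng.standard_normal()
x = x0 + np.cumsum(u + sigma_w * rng.standard_normal(n))

# Continuous measurements (2.4) and binary quantization with a fixed
# central threshold tau_{0,k} = 0 for all k
y = x + sigma_v * rng.standard_normal(n)
i = np.where(y > 0.0, 1, -1)
print(i[:10])
```

The resulting ±1 sequence is the only data available to the estimator studied in the rest of the chapter.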
As in the constant case, we want the estimator, or filter in this case, to have minimum MSE (MMSE). We want, for all k, an estimator $\hat{X}(i_{1:k})$ minimizing the MSE
$$\mathrm{MSE}_k = \mathbb{E}\left[\left(\hat{X}_k - X_k\right)^2\right]. \tag{2.5}$$
As the parameter itself is random, the expectation is evaluated w.r.t. the joint distribution of the measurements $i_{1:k}$ and the parameter $X_k$. Unlike the deterministic case, when the parameter is random the general form of the MMSE estimator can be obtained directly from the minimization of $\mathrm{MSE}_k$. It can be shown that its general form is [Jazwinski 1970, p. 149]
$$\hat{X}_k = \mathbb{E}_{X_k|i_{1:k}}\left(X_k\right), \tag{2.6}$$
where the subscript $X_k|i_{1:k}$ means that the expectation is evaluated w.r.t. the probability measure of $X_k$ given a realization of $i_{1:k}$. The MMSE estimator is then the posterior mean, i.e. the conditional mean of the parameter $X_k$ given a specific realization sequence of quantized measurements $i_{1:k}$.¹ The MMSE estimator is unbiased, since
$$\mathbb{E}\left[\hat{X}_k\right] = \mathbb{E}_{i_{1:k}}\left[\mathbb{E}_{X_k|i_{1:k}}\left(X_k\right)\right] = \mathbb{E}_{X_k,i_{1:k}}\left(X_k\right) = \mathbb{E}\left(X_k\right),$$
where the first equality comes from the decomposition of the expectation on the joint variables and the second from marginalization over $i_{1:k}$. Similarly, we obtain that the MSE is the mean of the posterior variance:
$$\mathrm{MSE}_k = \mathbb{E}_{i_{1:k}}\left\{\mathbb{E}_{X_k|i_{1:k}}\left[\left(X_k - \mathbb{E}_{X_k|i_{1:k}}\left(X_k\right)\right)^2\right]\right\} = \mathbb{E}_{i_{1:k}}\left[\mathrm{Var}_{X_k|i_{1:k}}\left(X_k\right)\right]. \tag{2.7}$$
¹ Note that this estimator is different from the MAP, which is the value $x_k$ that maximizes the posterior $p(x_k|i_{1:k})$. It can be shown that the posterior median is the optimal estimator under the mean absolute error [Van Trees 1968, pp. 56–57].
Note that, for a given realization $i_{1:k}$, $\mathrm{Var}_{X_k|i_{1:k}}\left(X_k\right)$ is the conditional MSE and it can be used when online assessment of the MSE is needed. Online here means that the performance is not averaged over the distribution of the measurements, but evaluated for a given realization. All the information is contained in the posterior distribution.
Its mean is the optimal estimator and its averaged variance is the MMSE. Assuming that the posterior distribution admits a PDF $p(x_k|i_{1:k})$, the MMSE estimator and its MSE are given respectively by
$$\hat{X}_k = \mathbb{E}_{X_k|i_{1:k}}\left(X_k\right) = \int_{\mathbb{R}} x_k\, p(x_k|i_{1:k})\, dx_k, \tag{2.8}$$
$$\mathrm{MSE}_k = \mathbb{E}_{i_{1:k}}\left[\mathrm{Var}_{X_k|i_{1:k}}\left(X_k\right)\right] = \sum_{i_{1:k}\in\mathcal{I}^{\otimes k}} \left[\int_{\mathbb{R}} \left(x_k - \mathbb{E}_{X_k|i_{1:k}}\left(X_k\right)\right)^2 p(x_k|i_{1:k})\, dx_k\right] P(i_{1:k}), \tag{2.9}$$
where $\mathcal{I}^{\otimes k}$ is the joint set where the quantized measurements are defined. To simplify the evaluation of the quantities above, a recursive form for $p(x_k|i_{1:k})$, and as a byproduct for $P(i_{1:k})$, can be obtained by using the Markovian property of the dynamical model for the process X. The main idea is to write the prediction PDF $p(x_k|i_{1:k-1})$ as a function of $p(x_{k-1}|i_{1:k-1})$ using the dynamical model information $p(x_k|x_{k-1})$, and then to pass from the prediction PDF to the posterior $p(x_k|i_{1:k})$ using the information given by the measurement, $P(i_k|x_k)$. These two expressions, one for prediction using the model and the other for update using the measurement, are given respectively by (Why? - App. A.1.6):
$$p(x_k|i_{1:k-1}) = \int_{\mathbb{R}} p(x_k|x_{k-1})\, p(x_{k-1}|i_{1:k-1})\, dx_{k-1}, \tag{2.10}$$
$$p(x_k|i_{1:k}) = \frac{P(i_k|x_k)\, p(x_k|i_{1:k-1})}{\int_{\mathbb{R}} P(i_k|x'_k)\, p(x'_k|i_{1:k-1})\, dx'_k}. \tag{2.11}$$
The denominator in the RHS of (2.11) is equal to $P(i_k|i_{1:k-1})$ (Why? - App. A.1.6), thus this integral can be reused for writing $P(i_{1:k})$ in recursive form for k > 1:
$$P(i_{1:k}) = P(i_k|i_{1:k-1})\, P(i_{1:k-1}) = \left[\int_{\mathbb{R}} P(i_k|x_k)\, p(x_k|i_{1:k-1})\, dx_k\right] P(i_{1:k-1}); \tag{2.12}$$
for k = 1 this probability is
$$P(i_1) = \int_{\mathbb{R}} P(i_1|x_0)\, p(x_0)\, dx_0.$$
In these expressions the prior $p(x_0)$, as stated above, is a Gaussian function
$$p(x_0) = \frac{1}{\sqrt{2\pi}\sigma_0}\exp\left[-\frac{1}{2}\left(\frac{x_0 - x'_0}{\sigma_0}\right)^2\right], \tag{2.13}$$
the conditional PDF $p(x_k|x_{k-1})$ is given by (2.3), and the probability $P(i_k|x_k)$ is given by (1.6) with the dynamical threshold set $\boldsymbol{\tau}_k$ instead of a single fixed set:
$$P(i_k|x_k) = \begin{cases} F\left(\tau_{i_k,k} - x_k\right) - F\left(\tau_{i_k-1,k} - x_k\right), & \text{if } i_k > 0,\\[2pt] F\left(\tau_{i_k+1,k} - x_k\right) - F\left(\tau_{i_k,k} - x_k\right), & \text{if } i_k < 0. \end{cases} \tag{2.14}$$
The general solution to (b) (p. 29) given by the optimal filter is the following:

Solution to (b) - MMSE estimator for a fixed threshold set sequence $\boldsymbol{\tau}_{1:k}$ (b1)
1) Estimator. For each time k, the estimator is given by
$$\hat{X}_k = \mathbb{E}_{X_k|i_{1:k}}\left(X_k\right) = \int_{\mathbb{R}} x_k\, p(x_k|i_{1:k})\, dx_k,$$
where the posterior PDF $p(x_k|i_{1:k})$ can be evaluated recursively using (2.10) and (2.11).
2) Performance (exact). $\hat{X}_k$ is unbiased,
$$\mathbb{E}\left[\hat{X}_k\right] = \mathbb{E}\left[X_k\right],$$
and its MSE for each time k is
$$\mathrm{MSE}_k = \mathbb{E}_{i_{1:k}}\left[\mathrm{Var}_{X_k|i_{1:k}}\left(X_k\right)\right] = \sum_{i_{1:k}\in\mathcal{I}^{\otimes k}} \left[\int_{\mathbb{R}} \left(x_k - \mathbb{E}_{X_k|i_{1:k}}\left(X_k\right)\right)^2 p(x_k|i_{1:k})\, dx_k\right] P(i_{1:k}),$$
where now not only (2.10) and (2.11) are used, but also (2.12) and (2.13) to obtain $P(i_{1:k})$.
Some attention must be given to the fact that the MMSE estimator given above and the recursive form for the evaluation of the posterior PDF are quite general and can be applied to many other nonlinear filtering problems. A major drawback of (b1) is that evaluating the integrals in the prediction/update expressions and in the expectation is analytically intractable. Therefore, we must look for a numerical method to solve it approximately. This will be done next.

2.3 Particle Filtering
To obtain the posterior mean (2.8), we must evaluate the integral $\int_{\mathbb{R}} x_k\, p(x_k|i_{1:k})\, dx_k$. A general solution is to evaluate it numerically, for example using a Monte Carlo integration method.

2.3.1 Monte Carlo integration
The Monte Carlo integration method consists in approximating the expectation of a function g(X),
$$\mathbb{E}\left[g(X)\right] = \int_{\mathbb{R}} g(x)\, p(x)\, dx,$$
where p(x) is the PDF of X, by the sample mean calculated using multiple i.i.d. samples $X^{(j)}$ from the distribution of X [Robert 1999, p.
83]:
$$\mathbb{E}\left[g(X)\right] \approx \bar{g}_{N_S} = \frac{1}{N_S}\sum_{j=1}^{N_S} g\left(x^{(j)}\right),$$
with $N_S$ the number of samples and $x^{(j)}$ the j-th i.i.d. sample realization. The approximation is clearly unbiased:
$$\mathbb{E}\left[\bar{g}_{N_S}\right] = \frac{1}{N_S}\sum_{j=1}^{N_S}\mathbb{E}\left[g\left(X^{(j)}\right)\right] = \mathbb{E}\left[g(X)\right].$$
By the strong law of large numbers, it converges with probability one to the true expectation $\mathbb{E}\left[g(X)\right]$ [Robert 1999, p. 83]:
$$\mathbb{P}\left(\lim_{N_S\to+\infty}\bar{g}_{N_S} = \mathbb{E}\left[g(X)\right]\right) = 1.$$
Moreover, by using a central limit theorem, the asymptotic normalized approximation error $\varepsilon_{\bar{g}}$ tends to a zero mean Gaussian distribution with variance given by
$$\mathrm{Var}\left(\varepsilon_{\bar{g}}\right) = \frac{1}{N_S}\mathrm{Var}\left[g(X)\right].$$
Thus, if g(X) has finite variance, the variance of the approximation is reduced by increasing the number of samples. In our case, we want to approximate the posterior mean
$$\hat{X}_k \approx \frac{1}{N_S}\sum_{j=1}^{N_S} X_k^{(j)}, \tag{2.15}$$
with $X_k^{(j)}$ i.i.d. samples from the posterior distribution. Observe that we can also rewrite the posterior mean in an equivalent way using the joint posterior PDF $p(x_{1:k}|i_{1:k})$:
$$\hat{X}_k = \int_{\mathbb{R}^k} x_k\, p(x_{1:k}|i_{1:k})\, dx_{1:k}. \tag{2.16}$$
In this case we sample independent trajectories $X_{1:k}^{(j)}$ from $p(x_{1:k}|i_{1:k})$ and the posterior mean is also given by (2.15). The main problem here is that the posterior distribution and the joint posterior distribution are usually difficult to sample directly. To solve this problem we will use a method called importance sampling.

2.3.2 Importance sampling
Retaining the second form of the posterior mean (2.16), the main idea of importance sampling [Robert 1999, p. 92] is to multiply and divide the integrand in the expectation by a PDF $q(x_{1:k}|i_{1:k})$² from which we know how to sample the trajectories $X_{1:k}$. This gives
$$\hat{X}_k = \int_{\mathbb{R}^k} x_k\, \frac{p(x_{1:k}|i_{1:k})}{q(x_{1:k}|i_{1:k})}\, q(x_{1:k}|i_{1:k})\, dx_{1:k}.$$
Note that the support of the PDF $q(x_{1:k}|i_{1:k})$ must contain the support of the posterior.
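As a small illustration of the idea, the sketch below forms the self-normalized importance sampling estimate of a mean; the target and proposal densities (a Gaussian N(1, 1) target and a wider Gaussian proposal) are illustrative choices, not quantities from the thesis.

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimate E[X] for a target N(1, 1) using samples from a proposal N(0, 2^2).
n_s = 200_000
x = 2.0 * rng.standard_normal(n_s)       # samples from the proposal q

def log_gauss(x, mu, sigma):
    # Log of a Gaussian density up to an additive constant (cancels in ratios)
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma)

w = np.exp(log_gauss(x, 1.0, 1.0) - log_gauss(x, 0.0, 2.0))  # p/q ratios
w_tilde = w / w.sum()                    # normalized importance weights
est = np.dot(x, w_tilde)                 # self-normalized estimate
print(est)  # close to the target mean, 1
```

Because the proposal has heavier tails than the target, the weights have finite variance and the estimate concentrates around the true mean as the number of samples grows.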
Denoting the ratio between the two PDFs as an importance weight $w(x_{1:k})$,
$$w(x_{1:k}) = \frac{p(x_{1:k}|i_{1:k})}{q(x_{1:k}|i_{1:k})}, \tag{2.17}$$
the expectation can be approximated by
$$\hat{X}_k = \int_{\mathbb{R}^k} x_k\, w(x_{1:k})\, q(x_{1:k}|i_{1:k})\, dx_{1:k} \approx \frac{1}{N_S}\sum_{j=1}^{N_S} X_k^{(j)}\, w\left(X_{1:k}^{(j)}\right), \tag{2.18}$$
where $X_{1:k}^{(j)}$ are i.i.d. trajectories from $q(x_{1:k}|i_{1:k})$. We can divide the expectation by the integral of the posterior, as its value is equal to one; this gives
$$\hat{X}_k = \frac{\int_{\mathbb{R}^k} x_k\, w(x_{1:k})\, q(x_{1:k}|i_{1:k})\, dx_{1:k}}{\int_{\mathbb{R}^k} w(x_{1:k})\, q(x_{1:k}|i_{1:k})\, dx_{1:k}} \approx \frac{\sum_{j=1}^{N_S} X_k^{(j)}\, w\left(X_{1:k}^{(j)}\right)}{\sum_{j=1}^{N_S} w\left(X_{1:k}^{(j)}\right)}.$$
Defining the normalized weights $\tilde{w}\left(X_{1:k}^{(j)}\right)$ as
$$\tilde{w}\left(X_{1:k}^{(j)}\right) = \frac{w\left(X_{1:k}^{(j)}\right)}{\sum_{j'=1}^{N_S} w\left(X_{1:k}^{(j')}\right)}, \tag{2.19}$$
we have that the posterior mean can be approximated by
$$\hat{X}_k \approx \sum_{j=1}^{N_S} X_k^{(j)}\, \tilde{w}\left(X_{1:k}^{(j)}\right). \tag{2.20}$$
By comparing the approximation in (2.20) with the integral $\int_{\mathbb{R}} x_k\, p(x_k|i_{1:k})\, dx_k$, we realize that this method is equivalent to approximating the posterior by a discrete distribution with support values chosen randomly and with probabilities given by the normalized weights:
$$p(x_k|i_{1:k}) \approx \sum_{j=1}^{N_S} \tilde{w}\left(x_{1:k}^{(j)}\right)\, \delta_D\left(x_k - x_k^{(j)}\right), \tag{2.21}$$
where $\delta_D()$ is a Dirac distribution.
² Note that $q(x_{1:k}|i_{1:k})$ can depend on the measurements; later we will choose a simplified form which does not depend on the measurements.

2.3.3 Sequential importance sampling
The remaining problems now are the choice of a PDF $q(x_{1:k}|i_{1:k})$ that is easy to sample and the evaluation of the weights. To be able to sample the trajectory $X_{1:k}^{(j)}$ without modifying the past trajectory $X_{1:k-1}^{(j)}$ (so that we do not need to resample the past trajectory), we must choose a distribution $q(x_{1:k}|i_{1:k})$ for which the marginal distribution at k − 1 is exactly $q(x_{1:k-1}|i_{1:k-1})$. This can be done using the following form for $q(x_{1:k}|i_{1:k})$ [Doucet 1998]:
$$q(x_{1:k}|i_{1:k}) = q(x_{1:k-1}|i_{1:k-1})\, q(x_k|x_{1:k-1}, i_{1:k}). \tag{2.22}$$
In this case, to extend a sample trajectory from realization $x_{1:k-1}^{(j)}$ to $x_{1:k}^{(j)}$, we sample $q\left(x_k|x_{1:k-1}^{(j)}, i_{1:k}\right)$ to generate the new point of the trajectory, $x_k^{(j)}$. To evaluate the weights, we develop $p(x_{1:k}|i_{1:k})$ using conditioning and the independence assumptions of the model:
• $X_k$ is independent of $X_{1:k-2}$ and $i_{1:k-1}$ conditioned on $X_{k-1}$;
• $i_k$ is independent of $X_{1:k-1}$ and $i_{1:k-1}$ conditioned on $X_k$.
This gives
$$p(x_{1:k}|i_{1:k}) = \frac{P(i_k|x_k)\, p(x_k|x_{k-1})\, p(x_{1:k-1}|i_{1:k-1})}{P(i_k|i_{1:k-1})}. \tag{2.23}$$
Replacing the simplified form of $q(x_{1:k}|i_{1:k})$ (2.22) and the joint posterior above (omitting $P(i_k|i_{1:k-1})$, which is constant in $x_{1:k}$) in expression (2.17), we have the following weight for trajectory j:
$$w\left(x_{1:k}^{(j)}\right) \propto \frac{P\left(i_k|x_k^{(j)}\right) p\left(x_k^{(j)}|x_{k-1}^{(j)}\right)}{q\left(x_k^{(j)}|x_{1:k-1}^{(j)}, i_{1:k}\right)}\cdot \frac{p\left(x_{1:k-1}^{(j)}|i_{1:k-1}\right)}{q\left(x_{1:k-1}^{(j)}|i_{1:k-1}\right)}, \tag{2.24}$$
where $\propto$ is the symbol for proportionality. The fact that the weights are defined up to a proportionality factor is not important, because for approximating the posterior mean we use the normalized weights. Note that the factor $\frac{p\left(x_{1:k-1}^{(j)}|i_{1:k-1}\right)}{q\left(x_{1:k-1}^{(j)}|i_{1:k-1}\right)}$ is the weight of the samples at time k − 1. Thus, we can write a recursive expression that relates the normalized weights at time k − 1 to the weights at time k:
$$w\left(x_{1:k}^{(j)}\right) \propto \frac{P\left(i_k|x_k^{(j)}\right) p\left(x_k^{(j)}|x_{k-1}^{(j)}\right)}{q\left(x_k^{(j)}|x_{0:k-1}^{(j)}, i_{1:k}\right)}\, \tilde{w}\left(x_{1:k-1}^{(j)}\right). \tag{2.25}$$
We now need to define the PDF $q(x_k|x_{0:k-1}, i_{1:k})$ used to generate the samples. The two most commonly used choices are the following:
• Choice 1: $p(x_k|x_{k-1}, i_k)$, the minimum weight variance distribution. The quality of the approximation of the posterior by the discrete distribution (2.21) depends on the variance of the weights, and the variance depends on the PDF $q(x_k|x_{0:k-1}, i_{1:k})$.
It can be shown that, conditioned on the past trajectory realization $x_{1:k-1}^{(j)}$ and on the measurement realization $i_{1:k}$, the variance of the weights is minimized for [Doucet 1998]
$$q(x_k|x_{0:k-1}, i_{1:k}) = p(x_k|x_{k-1}, i_k). \tag{2.26}$$
Unfortunately this distribution is difficult to sample directly. In our case, we can sample from it by using a rejection method (More? - App. A.2.5).
• Choice 2: $p(x_k|x_{k-1})$, the prior distribution. In order to simplify the evaluation of the weights, we can choose
$$q(x_k|x_{0:k-1}, i_{1:k}) = p(x_k|x_{k-1}) = \frac{1}{\sqrt{2\pi}\sigma_w}\exp\left[-\frac{1}{2}\left(\frac{x_k - x_{k-1} - u_k}{\sigma_w}\right)^2\right]. \tag{2.27}$$
Thus, for each previous $x_{k-1}^{(j)}$, we obtain a sample of a r.v. $X_k^{(j)}$ using the distribution $p\left(x_k|x_{k-1}^{(j)}\right)$.³ In our case, this choice reduces the problem to sampling from a Gaussian distribution, which is very simple, and updating the weights following (we choose the proportionality factor to be one)
$$w\left(x_{1:k}^{(j)}\right) = P\left(i_k|x_k^{(j)}\right) \tilde{w}\left(x_{1:k-1}^{(j)}\right). \tag{2.28}$$
Note that in both cases the sampling and the evaluation of the weights do not require the past measurements nor the samples $x_{1:k-2}^{(j)}$. This leads to memory requirements that do not increase over time. If we compare both choices in terms of complexity, the second choice is better because it only requires sampling from a Gaussian distribution and evaluating the weights with the likelihood. Therefore, from now on, we will use the second choice for the sampling distribution. We have the following procedure:
³ For details on how to sample from it using a standard Gaussian variate, see (How? - App. A.3.3).
1. Sample the prior distribution $p(x_0)$. This generates $N_S$ samples $x_0^{(1:N_S)}$. Set uniform normalized weights $\tilde{w}\left(x_0^{(j)}\right) = \frac{1}{N_S}$.
For each time k:
2. Generate $N_S$ samples, each from the corresponding r.v. $X_k^{(j)}$ with PDF given by (2.27):
$$p\left(x_k|x_{k-1}^{(j)}\right) = \frac{1}{\sqrt{2\pi}\sigma_w}\exp\left[-\frac{1}{2}\left(\frac{x_k - x_{k-1}^{(j)} - u_k}{\sigma_w}\right)^2\right].$$
3.
Evaluate the sample weights using the measurement and the previous weights with (2.28):
$$w\left(x_{1:k}^{(j)}\right) = P\left(i_k|x_k^{(j)}\right) \tilde{w}\left(x_{1:k-1}^{(j)}\right).$$
4. Normalize the weights using (2.19):
$$\tilde{w}\left(x_{1:k}^{(j)}\right) = \frac{w\left(x_{1:k}^{(j)}\right)}{\sum_{j'=1}^{N_S} w\left(x_{1:k}^{(j')}\right)}.$$
5. Obtain the estimate with the weighted mean:
$$\hat{x}_k \approx \sum_{j=1}^{N_S} x_k^{(j)}\, \tilde{w}\left(x_{1:k}^{(j)}\right).$$
This procedure is the sequential extension of importance sampling applied to filtering, hence its commonly used name: the sequential importance sampling filter. As this method is a special case of importance sampling, it has the same general characteristics; namely, it is biased for a fixed number of samples, but it converges with probability one to the optimal estimator when $N_S \to \infty$ [Doucet 1998].

2.3.4 Sequential importance resampling
We would expect that by increasing the number of samples the filter would get closer and closer to the optimal estimate. However, the convergence result is asymptotic: it holds only when $N_S$ tends to infinity. When $N_S$ is finite, it can be shown that the variance of the weights increases over time [Kong 1994]. This problem is known as the degeneracy problem; in practice, after some time most of the normalized weights are close to zero, which is to say that most of the samples are useless [Doucet 1998]. In the case of sampling with $p\left(x_k|x_{k-1}^{(j)}\right)$, the cause of this problem is easy to understand. We start with a given prior distribution; then, during the procedure, we evaluate the posterior for values of $X_k$ sampled randomly using $p\left(x_k|x_{k-1}^{(j)}\right)$. As there is no feedback from the measurements in the sampling process, after some time the samples can lie very far from the values of $X_k$ where the posterior has larger values. As a consequence, this produces a very poor discrete approximation of the posterior. A possible remedy for this problem is to drive the sampling process using the measurements $i_{1:k}$.
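The five-step procedure above can be sketched as follows for a binary quantizer with a fixed threshold; all constants (noise levels, number of particles, horizon) are illustrative assumptions, and the likelihood clip is a numerical guard, not part of the algorithm.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

# Sequential importance sampling with the prior (2.27) as sampling
# distribution and a binary quantizer with fixed threshold tau0 = 0.
sigma_w, sigma_v, u = 0.1, 1.0, 0.0
sigma_0, x0_mean, tau0 = 1.0, 0.0, 0.0
n_p, n_steps = 500, 100

def F(z):  # CDF of the (assumed Gaussian) measurement noise
    return 0.5 * (1.0 + math.erf(z / (math.sqrt(2.0) * sigma_v)))

def likelihood(i_k, xs):  # P(i_k | x_k), binary case of (2.14)
    p_plus = np.array([1.0 - F(tau0 - x) for x in xs])
    p = p_plus if i_k > 0 else 1.0 - p_plus
    return np.clip(p, 1e-12, 1.0)  # guard against exactly-zero weights

x = x0_mean + sigma_0 * rng.standard_normal()             # true initial state
particles = x0_mean + sigma_0 * rng.standard_normal(n_p)  # step 1
weights = np.full(n_p, 1.0 / n_p)
sq_errs = []
for k in range(n_steps):
    x += u + sigma_w * rng.standard_normal()              # state model (2.1)
    i_k = 1 if x + sigma_v * rng.standard_normal() > tau0 else -1
    particles += u + sigma_w * rng.standard_normal(n_p)   # step 2: sample prior
    weights = likelihood(i_k, particles) * weights        # step 3
    weights /= weights.sum()                              # step 4
    x_hat = float(np.dot(particles, weights))             # step 5
    sq_errs.append((x_hat - x) ** 2)

rmse = math.sqrt(sum(sq_errs) / n_steps)
print(rmse)
```

Running this without resampling also illustrates the degeneracy problem discussed below: over time the weight mass concentrates on a few particles.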
Resampling. This can be done in a simple way by reproducing the samples $x_k^{(j)}$ for which the posterior approximation $\tilde{w}\left(x_{1:k}^{(j)}\right)$ is large and deleting the samples for which it is small. This procedure, known as resampling, can be carried out in practice by sampling $N_S$ times⁴ the discrete posterior approximation given by
$$P(x_k) = \begin{cases} \tilde{w}\left(x_{1:k}^{(j)}\right), & \text{if } x_k = x_k^{(j)},\\[2pt] 0, & \text{otherwise.} \end{cases} \tag{2.29}$$
After resampling, to retain the posterior approximation, the weights of the samples are set to
$$\tilde{w}\left(x_{1:k}^{(j)}\right) = \frac{1}{N_S}. \tag{2.30}$$
As the posterior approximation is a multinomial distribution, the procedure of resampling using the approximation of the posterior (2.29) is known as multinomial resampling. Multinomial resampling can be easily implemented using $N_S$ independent uniform samples; for details see (How? - App. A.3.4) (app4), and for other types of resampling techniques see [Hol 2006]. The resampling process should not be performed at every step, as this leads to the impoverishment of the sample set [Berzuini 1997]. Sample impoverishment is the opposite extreme of the degeneracy problem: in this case we simply neglect possible trajectories of $X_k$ with medium and low likelihood, leading to an insufficiently rich approximation of the posterior. To trigger the resampling process, we can monitor the number of effective samples $N_{\mathrm{eff}}$, that is to say, the equivalent number of samples if we were using the true posterior for the Monte Carlo evaluation. This number can be approximated by [Doucet 1998]
$$N_{\mathrm{eff}} = \frac{1}{\sum_{j=1}^{N_S} \tilde{w}^2\left(x_{1:k}^{(j)}\right)}. \tag{2.31}$$
Therefore, each time $N_{\mathrm{eff}} < N_{\mathrm{thresh}}$, where $N_{\mathrm{thresh}} \in [1, N_S]$ is a minimum acceptable number of effective samples, the resampling process is triggered. Sequential importance sampling with the resampling step for general Bayesian estimation was first suggested in [Rubin 1988] (cited in [Doucet 1998]) under the name sequential importance resampling.
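The effective sample size test (2.31) and multinomial resampling (2.29)-(2.30) can be sketched as follows; the particle values, weights, and threshold below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def n_eff(weights):
    # Effective sample size (2.31)
    return 1.0 / np.sum(weights ** 2)

def multinomial_resample(particles, weights, rng):
    # Draw N_S indices with probabilities given by the normalized weights
    # (2.29), then reset the weights to the uniform value 1/N_S (2.30).
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))

particles = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
weights = np.array([0.7, 0.1, 0.1, 0.05, 0.05])
print(n_eff(weights))        # about 1.94, far below N_S = 5

n_thresh = 2.5               # minimum acceptable effective sample size
if n_eff(weights) < n_thresh:
    particles, weights = multinomial_resample(particles, weights, rng)
print(weights)               # uniform, 0.2 each
```

Here one weight dominates, so the effective sample size is close to one and resampling is triggered.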
Its widespread use in filtering, with the specific choice of $p(x_k|x_{k-1})$ as the sampling distribution, was initiated with [Gordon 1993] under the name of bootstrap filter. This method was proposed for solving general nonlinear non-Gaussian filtering problems. The method presented above can be found in the literature under many other names; the most common is particle filter (PF). In this case, "particle" is the name given to a sample $x_k^{(j)}$. We will use the terms particle filter and particle from now on. A proof of convergence of the general PF is given in [Berzuini 1997] for the case with resampling at each iteration. It is shown that, when $N_S \to \infty$, the error between the optimal estimator and the PF estimate, multiplied by $\sqrt{N_S}$, tends to a Gaussian r.v. with fixed finite variance. This means that, for a large number of particles, when the number of particles increases, the PF estimate is more and more concentrated around the optimal estimator. Applications of the PF for estimation based on quantized measurements with a fixed sequence of threshold sets are reported in [Ruan 2004] and [Karlsson 2005]. In [Ruan 2004], the main focus is on analyzing the main issues related to the fusion of quantized measurements from multiple sensors for tracking in general; the results reported therein are given by simulation. A more restricted model, with $X_k$ given by a vector linear Gaussian evolution and quantized linear Gaussian measurements, is used in [Karlsson 2005], where a theoretical lower bound on estimation performance is obtained and compared with simulation results. The bound that is used is the counterpart of the CRB for random parameters: the Bayesian Cramér–Rao bound (BCRB).
⁴ We could resample more or fewer than $N_S$ samples; we chose $N_S$ because it is the most commonly used choice in the literature.

2.4 Evaluation of the estimation performance
We have already explained how to obtain the estimates for our problem (b) (p. 29).
We still need to evaluate its performance.

2.4.1 Online empirical evaluation
The variance of the posterior approximation (supposing that $N_S$ is sufficiently large for the bias to be negligible),
$$\mathrm{MSE}_k \approx \sum_{j=1}^{N_S} \left(x_k^{(j)} - \hat{x}_k\right)^2 \tilde{w}\left(x_{1:k}^{(j)}\right),$$
gives an online estimate of the MSE. The problem with this approach is that the performance is conditioned on the given measurement sequence $i_{1:k}$. In this case, approximate performance can be obtained only after having the measurements, thus no design of the system (choice of the number of quantization intervals $N_I$, choice of the sensor quality $\delta$) can be done. Even if we push further into the Monte Carlo philosophy and try to evaluate the mean of the approximated MSE above using Monte Carlo integration, we would have to simulate the PF procedure a large number of times while varying the parameters needed for system design ($N_I$, $\delta$, $\sigma_w^2$). Therefore, it is better to turn our attention to analytical results on performance.

2.4.2 BCRB
The analytical form of the MSE (2.9) depends on the posterior distribution. Thus, for the same reason that we cannot have an analytical expression for the estimator, we are not going to have an analytical expression for the MSE. We must then resort to a bound on the MSE. As a consequence, we will follow [Karlsson 2005] and also analyze the BCRB. As our case ($X_k$ a scalar Wiener process) is simpler than the vector linear case studied in [Karlsson 2005], we will be able to analyze the effects of the measurement system parameters in a clearer and simpler way. The BCRB at instant k, $\mathrm{BCRB}_k$, is a lower bound on $\mathrm{MSE}_k$; it is given by the inverse of the Bayesian information (BI) [Van Trees 1968, p. 84]:
$$\mathrm{MSE}_k \ge \mathrm{BCRB}_k = \frac{1}{J_k}. \tag{2.32}$$
The BI at time k, $J_k$, is given by
$$J_k = -\mathbb{E}\left[\frac{\partial^2 \log p(X_k, i_{1:k})}{\partial X_k^2}\right]. \tag{2.33}$$
As $X_k$ is random, the expectation here is evaluated using the joint probability measure of $X_k$ and $i_{1:k}$.
This result is general and is not particularly linked to the quantization problem; we could replace $i_{1:k}$ by any measurement related to $X_k$. By assuming that $X_k$ is a Markov process (here also $i_{1:k}$ can be any type of measurement), a recursive form for evaluating the BI is obtained in [Tichavsky 1998]:
\[
J_k = C_k - \frac{B_k^2}{A_k + J_{k-1}}, \tag{2.34}
\]
where (note that we are using the notation for discrete measurements $i_{1:k}$ with $P(i_k|x_k)$)
\[
J_0 = -\mathbb{E}\left[\frac{\partial^2 \log p(X_0)}{\partial X_0^2}\right], \quad
A_k = -\mathbb{E}\left[\frac{\partial^2 \log p(X_k|X_{k-1})}{\partial X_{k-1}^2}\right], \quad
B_k = -\mathbb{E}\left[\frac{\partial^2 \log p(X_k|X_{k-1})}{\partial X_k \partial X_{k-1}}\right],
\]
\[
C_k = -\mathbb{E}\left[\frac{\partial^2 \log p(X_k|X_{k-1})}{\partial X_k^2}\right] - \mathbb{E}\left[\frac{\partial^2 \log P(i_k|X_k)}{\partial X_k^2}\right].
\]
Using (2.3) to evaluate the terms $A_k$, $B_k$ and $C_k$, we have
\[
A_k = -\mathbb{E}\left[\frac{\partial^2}{\partial X_{k-1}^2} \log \frac{1}{\sqrt{2\pi}\sigma_w} \exp\left(-\frac{1}{2}\left(\frac{X_k - X_{k-1} - u_k}{\sigma_w}\right)^2\right)\right] = \frac{1}{\sigma_w^2}.
\]
In the same way,
\[
B_k = -\frac{1}{\sigma_w^2}, \qquad
C_k = \frac{1}{\sigma_w^2} - \mathbb{E}\left[\frac{\partial^2 \log P(i_k|X_k)}{\partial X_k^2}\right].
\]
Decomposing the expectation above, we obtain
\[
C_k = \frac{1}{\sigma_w^2} + \mathbb{E}_{X_k}\left[-\mathbb{E}_{i_k|X_k}\left(\frac{\partial^2 \log P(i_k|X_k)}{\partial X_k^2}\right)\right].
\]
The inner expectation is another form of expressing the FI for estimating $X_k$ when $X_k$ is considered to be a deterministic parameter [Kay 1993, p. 34]. Thus, by using the parametrization of the FI for quantized measurements with the r.v. $\varepsilon_k = \tau_{0,k} - X_k$, we can write
\[
C_k = \frac{1}{\sigma_w^2} + \mathbb{E}\left[I_q(\varepsilon_k)\right],
\]
where the expectation is evaluated using the probability measure of $\varepsilon_k$. Using these results in (2.34) gives
\[
J_k = \frac{1}{\sigma_w^2} + \mathbb{E}\left[I_q(\varepsilon_k)\right] - \frac{1}{\sigma_w^4}\,\frac{1}{\frac{1}{\sigma_w^2} + J_{k-1}}, \tag{2.35}
\]
with $J_0$ given by
\[
J_0 = -\mathbb{E}\left[\frac{\partial^2}{\partial X_0^2} \log \frac{1}{\sqrt{2\pi}\sigma_0} \exp\left(-\frac{1}{2}\left(\frac{X_0 - x_0'}{\sigma_0}\right)^2\right)\right] = \frac{1}{\sigma_0^2}. \tag{2.36}
\]
For commonly used noise models (Gaussian, Laplacian and Cauchy), the FI is maximized for $\varepsilon_k = 0$. Thus, we can obtain a simple upper bound on the BI by assuming $\varepsilon_k = 0$ with probability one. This gives
\[
J_k \le J_k' = \frac{1}{\sigma_w^2} + I_q(0) - \frac{1}{\sigma_w^4}\,\frac{1}{\frac{1}{\sigma_w^2} + J_{k-1}'}, \tag{2.37}
\]
with $J_0' = J_0$. This yields a simple lower bound on the BCRB, and consequently on the MSE, which can be used to assess approximately the performance of the PF.
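The recursion (2.37) is straightforward to iterate numerically. The sketch below (the helper function and the parameter values are ours, for illustration only) computes the sequence $J_k'$ from $I_q(0)$, $\sigma_w^2$ and $\sigma_0^2$; the BCRB-based lower bound on $\mathrm{MSE}_k$ is then $1/J_k'$:

```python
def bi_upper_bound_sequence(iq0, sigma_w2, sigma_02, n_steps):
    """Iterate J'_k = 1/s_w^2 + Iq(0) - (1/s_w^4) / (1/s_w^2 + J'_{k-1}),
    starting from J'_0 = 1/sigma_0^2.  Returns [J'_0, ..., J'_{n_steps}]."""
    j = 1.0 / sigma_02  # J'_0 = J_0 = 1/sigma_0^2 from (2.36)
    seq = [j]
    for _ in range(n_steps):
        j = 1.0 / sigma_w2 + iq0 - (1.0 / sigma_w2 ** 2) / (1.0 / sigma_w2 + j)
        seq.append(j)
    return seq
```

As shown in the sequel, the sequence produced is monotone and converges to a fixed point.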
The solution to problem (b) (p. 29) given by the PF is

Solution to (b) - Particle filter for a fixed threshold set sequence $\tau_{1:k}$ (b1.1)

1) Estimator

• Set uniform normalized weights $\tilde{w}(x_0^{(j)}) = \frac{1}{N_S}$ and initialize $N_S$ particles $\{x_0^{(1)}, \ldots, x_0^{(N_S)}\}$ by sampling the prior
\[
p(x_0) = \frac{1}{\sqrt{2\pi}\sigma_0} \exp\left[-\frac{1}{2}\left(\frac{x_0 - x_0'}{\sigma_0}\right)^2\right].
\]
For each time $k$,

• for $j$ from 1 to $N_S$, sample the r.v. $X_k^{(j)}$ with PDF (How? - App. A.3.3)
\[
p\left(x_k|x_{k-1}^{(j)}\right) = \frac{1}{\sqrt{2\pi}\sigma_w} \exp\left[-\frac{1}{2}\left(\frac{x_k - x_{k-1}^{(j)} - u_k}{\sigma_w}\right)^2\right],
\]
• for $j$ from 1 to $N_S$, evaluate and normalize the weights
\[
w\left(x_{1:k}^{(j)}\right) = P\left(i_k|x_k^{(j)}\right) \tilde{w}\left(x_{1:k-1}^{(j)}\right), \qquad
\tilde{w}\left(x_{1:k}^{(j)}\right) = \frac{w\left(x_{1:k}^{(j)}\right)}{\sum_{j=1}^{N_S} w\left(x_{1:k}^{(j)}\right)},
\]
where $P(i_k|x_k^{(j)})$ is given by (2.14).

• Obtain the estimate with the weighted mean
\[
\hat{x}_k \approx \sum_{j=1}^{N_S} x_k^{(j)} \tilde{w}\left(x_{1:k}^{(j)}\right).
\]
• Evaluate the number of effective particles
\[
N_{\mathrm{eff}} = \frac{1}{\sum_{j=1}^{N_S} \tilde{w}^2\left(x_{1:k}^{(j)}\right)};
\]
if $N_{\mathrm{eff}} < N_{\mathrm{thresh}}$, resample using multinomial resampling (How? - App. A.3.4).

2) Performance (lower bound)

The MSE can be lower bounded as follows:
\[
\mathrm{MSE}_k \ge \frac{1}{J_k'},
\]
with $J_k'$ given recursively by
\[
J_k' = \frac{1}{\sigma_w^2} + I_q(0) - \frac{1}{\sigma_w^4}\,\frac{1}{\frac{1}{\sigma_w^2} + J_{k-1}'}.
\]

2.5 Quantized innovations

For commonly used symmetrically distributed noise models (Gaussian, Laplacian and Cauchy distributions), we saw in Ch. 1 that $I_q(\varepsilon)$ around $\varepsilon = 0$ is locally decreasing in $|\varepsilon|$; thus from (2.35) we can see that the closer $\tau_{0,k}$ is to the parameter realization $x_k$, the higher the BI will be. If we assume that the BCRB is tight enough to accept its behavior as an approximation of the behavior of the MSE, then the closer $\tau_{0,k}$ is to the parameter realization $x_k$, the lower the MSE will be. This indicates that the dynamic range of the quantizer must vary in time in order to follow the parameter and produce enhanced estimation performance.

2.5.1 Prediction and innovation

Prediction.
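Before developing the prediction idea, one iteration of solution (b1.1) can be sketched compactly for the special case of a binary quantizer ($N_I = 2$) with central threshold $\tau_{0,k}$ and unit-variance Gaussian noise. All function names below are ours, and the binary likelihood stands in for the general expression (2.14):

```python
import math
import random

def gauss_cdf(x):
    """Standard Gaussian CDF F(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lik_binary(ik, xk, tau0):
    """P(i_k | x_k) for a binary quantizer: i_k = +1 if y_k >= tau0, else -1."""
    p_plus = 1.0 - gauss_cdf(tau0 - xk)
    return p_plus if ik == 1 else 1.0 - p_plus

def pf_step(particles, weights, ik, uk, tau0, sigma_w):
    """One iteration of the bootstrap PF with a quantized measurement i_k."""
    ns = len(particles)
    # 1) sample from p(x_k | x_{k-1}^{(j)}): Wiener process with drift u_k
    particles = [x + uk + random.gauss(0.0, sigma_w) for x in particles]
    # 2) weight update with the quantized-measurement likelihood, then normalize
    weights = [w * lik_binary(ik, x, tau0) for w, x in zip(weights, particles)]
    s = sum(weights)
    weights = [w / s for w in weights]
    # 3) estimate: weighted mean of the particle cloud
    xhat = sum(w * x for w, x in zip(weights, particles))
    # 4) multinomial resampling when N_eff < N_S / 2 (a common choice of N_thresh)
    n_eff = 1.0 / sum(w * w for w in weights)
    if n_eff < ns / 2.0:
        particles = random.choices(particles, weights=weights, k=ns)
        weights = [1.0 / ns] * ns
    return particles, weights, xhat
```

With uniform initial weights and prior samples, calling `pf_step` once per quantized measurement reproduces the estimator loop of the box above.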
The main problem with the approach $\tau_{0,k} = x_k$ is that we do not know $x_k$; if we knew it, we would not need to estimate it. We might then accept a small loss of performance by using the closest value to $x_k$ that we have at hand: a prediction of $x_k$ using the last estimate $\hat{x}_{k-1}$ and the drift $u_k$. If $\hat{X}_{k-1}$ is the MMSE estimator based on $i_{1:k-1}$, then the MMSE estimator of $X_k$ based also on $i_{1:k-1}$, denoted $\hat{X}_{k|k-1}$, is the conditional mean [Jazwinski 1970, p. 149]
\[
\hat{X}_{k|k-1} = \mathbb{E}_{X_k|i_{1:k-1}}\left(X_k\right).
\]
Using the dynamical model for $X_k$ and the linearity of the conditional expectation,
\[
\hat{X}_{k|k-1} = \mathbb{E}_{X_k|i_{1:k-1}}\left(X_{k-1} + W_k\right) = \mathbb{E}_{X_{k-1}|i_{1:k-1}}\left(X_{k-1}\right) + \mathbb{E}_{W_k|i_{1:k-1}}\left(W_k\right).
\]
The first term is the optimal estimate of $X_{k-1}$. As $W_k$ is independent of all $W_n$ with $n \neq k$ and also independent of all $V_k$, it does not depend on $i_{1:k-1}$. Thus, the second term is simply $\mathbb{E}_{W_k}(W_k)$, which is $u_k$. This gives
\[
\hat{X}_{k|k-1} = \tau_{0,k} = \hat{X}_{k-1} + u_k.
\]
Considering that the estimator is good enough (at least for large $k$), we expect the r.v. $\varepsilon_k$ to have most of its probability concentrated around zero, thus leading to a higher $\mathbb{E}[I_q(\varepsilon_k)]$ and, consequently, to a lower MSE.

Quantizing the innovation. Quantizing the prediction error is a known subject in standard quantization: it is widely known under the name predictive quantization [Gersho 1992, Ch. 7]. Note, however, that the procedure considered here is different. Instead of quantizing the prediction error of the reconstruction, we quantize the error between the prediction in the estimation sense and the noisy measurement, $Y_k - \hat{X}_{k|k-1}$. The prediction error in this case is commonly called the innovation process in continuous-measurement linear filtering theory [Kay 1993, p. 433]. The name comes from the fact that it represents the previously unknown information brought by the new measurement. As a consequence, the quantized prediction error for estimation purposes is called the quantized innovation. We can slightly change solution (b1.1) (p.
90) by adding the adaptive replacement of the central threshold with the prediction. (As the measurements are now linked through the use of the prediction for quantizing, we cannot guarantee the convergence of the particle approximation through standard results; more advanced results are needed [Crisan 2000].) This is what was done in [Sukhavasi 2009b] under the assumption of Gaussian noise, a linear Gaussian vector $X_k$ ($X_k = A X_{k-1} + W_k$) and binary quantization. Constraining $X_k$ to be the scalar Wiener process considered here and generalizing the algorithm to symmetrically distributed noise and $N_I \ge 2$, we have

Solution to (b) - Particle filter with adaptive threshold sequence $\tau_{1:k}$ quantizing the innovation (b2.1)

1) Estimator

• Set uniform normalized weights $\tilde{w}(x_0^{(j)}) = \frac{1}{N_S}$ and initialize $N_S$ particles $\{x_0^{(1)}, \ldots, x_0^{(N_S)}\}$ by sampling the prior
\[
p(x_0) = \frac{1}{\sqrt{2\pi}\sigma_0} \exp\left[-\frac{1}{2}\left(\frac{x_0 - x_0'}{\sigma_0}\right)^2\right].
\]
For each time $k$,

• set the central threshold of the quantizer to the prediction
\[
\tau_{0,k} = \hat{x}_{k-1} + u_k,
\]
• for $j$ from 1 to $N_S$, sample the r.v. $X_k^{(j)}$ with PDF (How? - App. A.3.3)
\[
p\left(x_k|x_{k-1}^{(j)}\right) = \frac{1}{\sqrt{2\pi}\sigma_w} \exp\left[-\frac{1}{2}\left(\frac{x_k - x_{k-1}^{(j)} - u_k}{\sigma_w}\right)^2\right],
\]
• for $j$ from 1 to $N_S$, evaluate and normalize the weights
\[
w\left(x_{1:k}^{(j)}\right) = P\left(i_k|x_k^{(j)}\right) \tilde{w}\left(x_{1:k-1}^{(j)}\right), \qquad
\tilde{w}\left(x_{1:k}^{(j)}\right) = \frac{w\left(x_{1:k}^{(j)}\right)}{\sum_{j=1}^{N_S} w\left(x_{1:k}^{(j)}\right)},
\]
where $P(i_k|x_k^{(j)})$ is given by (2.14).

• Obtain the estimate with the weighted mean
\[
\hat{x}_k \approx \sum_{j=1}^{N_S} x_k^{(j)} \tilde{w}\left(x_{1:k}^{(j)}\right).
\]
• Evaluate the number of effective particles
\[
N_{\mathrm{eff}} = \frac{1}{\sum_{j=1}^{N_S} \tilde{w}^2\left(x_{1:k}^{(j)}\right)};
\]
if $N_{\mathrm{eff}} < N_{\mathrm{thresh}}$, resample using multinomial resampling (How? - App. A.3.4).

2) Performance (lower bound)

The MSE can be lower bounded as follows:
\[
\mathrm{MSE}_k \ge \frac{1}{J_k'},
\]
with $J_k'$ given recursively by
\[
J_k' = \frac{1}{\sigma_w^2} + I_q(0) - \frac{1}{\sigma_w^4}\,\frac{1}{\frac{1}{\sigma_w^2} + J_{k-1}'}.
\]

2.5.2 Bound for the quantized innovations

Observe that the lower bound is still valid because we still have $\mathbb{E}[I_q(\varepsilon_k)] \le I_q(0)$.
However, we might now have performance closer to the bound, as $\varepsilon_k$ may be concentrated mostly around zero. It is also important to note that now, even if the bound is tight (which might not be true), we can get very close to it but in general cannot attain it. This is due to the fact that the MSE for a varying parameter never goes to zero, leading to a residual spread in the PDF of $\varepsilon_k$, which keeps $\mathbb{E}[I_q(\varepsilon_k)]$ strictly below $I_q(0)$.

To obtain an approximation of the evolution of the MSE, we can analyze the lower bound on the BCRB; we are therefore interested in the evolution of $J_k'$. We can start by analyzing the evolution of its increments. Subtracting the expressions for $J_k'$ and $J_{k-1}'$, we have
\[
J_k' - J_{k-1}' = \frac{1}{\sigma_w^4}\left(\frac{1}{\frac{1}{\sigma_w^2} + J_{k-2}'} - \frac{1}{\frac{1}{\sigma_w^2} + J_{k-1}'}\right)
= \frac{J_{k-1}' - J_{k-2}'}{\sigma_w^4 \left(\frac{1}{\sigma_w^2} + J_{k-1}'\right)\left(\frac{1}{\sigma_w^2} + J_{k-2}'\right)}. \tag{2.38}
\]
The BI is positive by definition, as it is an expectation of a squared quantity, and $\sigma_w^2$ is also positive by definition; thus the denominator of the expression above is always positive. Hence the sign of the increment at time $k$ is the same as the sign of the increment at time $k-1$. As a conclusion, the sequence $J_k'$ is monotonic: it either always increases or always decreases. To determine which, we can see from the recursive expression above that everything is determined by the first increment $J_1' - J_0'$. Subtracting $J_0' = \frac{1}{\sigma_0^2}$ from $J_1'$, we obtain
\[
J_1' - J_0' = \frac{1}{\sigma_w^2} + I_q(0) - \frac{1}{\sigma_w^4}\,\frac{1}{\frac{1}{\sigma_w^2} + \frac{1}{\sigma_0^2}} - \frac{1}{\sigma_0^2}.
\]
Regrouping the terms with factor $\frac{1}{\sigma_w^2}$ gives
\[
J_1' - J_0' = I_q(0) + \frac{1}{\sigma_w^2}\left(1 - \frac{1}{1 + \frac{\sigma_w^2}{\sigma_0^2}}\right) - \frac{1}{\sigma_0^2}
= I_q(0) + \frac{1}{\sigma_w^2 + \sigma_0^2} - \frac{1}{\sigma_0^2}.
\]
Thus, if
\[
I_q(0) > \frac{1}{\sigma_0^2} - \frac{1}{\sigma_w^2 + \sigma_0^2},
\]
$J_1' - J_0'$ is positive and the BI bound is always increasing; otherwise it always decreases. As a consequence, if the inequality is satisfied the corresponding bound on the MSE is always decreasing, otherwise always increasing. As stated before, the information bound $J_k'$ is bounded below by zero.
Looking at (2.37),
\[
J_k' = \frac{1}{\sigma_w^2} + I_q(0) - \frac{1}{\sigma_w^4}\,\frac{1}{\frac{1}{\sigma_w^2} + J_{k-1}'},
\]
we can see that it is bounded above by $I_q(0) + \frac{1}{\sigma_w^2}$, as the subtracted term is always positive. Joining the facts that $J_k'$ is lower and upper bounded with the fact that it is always increasing or decreasing, we conclude that $J_k'$ converges to a fixed point (a fixed value). Except when the inequality above holds with equality, we see from (2.38) that the increment $J_k' - J_{k-1}'$ cannot be zero, as it is equal to the previous increment (which is nonzero) multiplied by a positive value. Therefore, the fixed point of $J_k'$ is attained only asymptotically.

Denoting this asymptotic fixed point $J_\infty'$, by definition it is the value of $J_k'$ for which $J_k' = J_{k-1}'$. Thus, it can be found by solving
\[
J_\infty' = \frac{1}{\sigma_w^2} + I_q(0) - \frac{1}{\sigma_w^4}\,\frac{1}{\frac{1}{\sigma_w^2} + J_\infty'},
\]
which is equivalent to solving
\[
J_\infty'^2 - I_q(0)\, J_\infty' - \frac{I_q(0)}{\sigma_w^2} = 0.
\]
The solutions of this equation are
\[
\frac{I_q(0) \pm \sqrt{I_q^2(0) + \frac{4 I_q(0)}{\sigma_w^2}}}{2}.
\]
In order to have $J_\infty'$ positive (it is positive by definition), we must take the positive root, obtained for the positive sign since $\frac{4 I_q(0)}{\sigma_w^2}$ is positive. Therefore,
\[
J_\infty' = \frac{I_q(0) + \sqrt{I_q^2(0) + \frac{4 I_q(0)}{\sigma_w^2}}}{2} \tag{2.39}
\]
and the asymptotic MSE is then lower bounded by the inverse of $J_\infty'$:
\[
\mathrm{MSE}_\infty \ge \frac{2}{I_q(0) + \sqrt{I_q^2(0) + \frac{4 I_q(0)}{\sigma_w^2}}}. \tag{2.40}
\]
The following behaviors can then be observed for the evolution of the bound: if we start with a very small $\sigma_0^2$ (small compared with $\frac{1}{I_q(0)}$), as we can see from the inequality related to the monotonicity pattern, the lower bound on the MSE will always increase, tending asymptotically to $\frac{1}{J_\infty'}$. If we start with a large $\sigma_0^2$, the lower bound will always decrease, also tending asymptotically to $\frac{1}{J_\infty'}$.
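The fixed point (2.39) and the asymptotic bound (2.40) are available in closed form; a small helper of ours makes the connection explicit:

```python
import math

def bi_fixed_point(iq0, sigma_w2):
    """J'_inf from (2.39): the positive root of J^2 - Iq(0) J - Iq(0)/sigma_w^2 = 0."""
    return 0.5 * (iq0 + math.sqrt(iq0 ** 2 + 4.0 * iq0 / sigma_w2))

def asymptotic_mse_bound(iq0, sigma_w2):
    """Lower bound (2.40) on the asymptotic MSE: the inverse of J'_inf."""
    return 1.0 / bi_fixed_point(iq0, sigma_w2)
```

Iterating (2.37) from any positive $J_0'$ converges to `bi_fixed_point(iq0, sigma_w2)`, from below or from above depending on the sign of the first increment.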
From the analysis we can see that the MSE is, as expected, always strictly positive: it is lower bounded approximately by $\sigma_0^2$ when this value is very small compared with $\frac{1}{I_q(0)}$, and lower bounded by $\frac{1}{J_\infty'}$ when $\sigma_0^2$ is large compared with $\frac{1}{I_q(0)}$.

2.5.3 Gaussian assumption and asymptotic estimation of a slowly varying parameter

Other filtering methods based on the quantized innovation have been proposed in the literature under the Gaussian noise assumption. In [Ribeiro 2006c], binary measurements are obtained by applying the sign function to the innovation. A procedure similar to the well-known Kalman filter [Kay 1993, Ch. 13] is derived by assuming that the posterior at instant $k-1$ is Gaussian. In the same line, [You 2008] proposes a Kalman-like procedure for quantized innovations with $N_I > 2$. A careful reader of the literature on the subject might note that the idea of considering Gaussian approximations of the posterior for filtering based on quantized data with Gaussian noise dates back to [Curry 1970]. Also, the idea of quantizing the innovation seems to have been first exploited in [Borkar 1995] (cited in [Sukhavasi 2009a]); in that case the true innovation is quantized, i.e., the innovation obtained by using the estimator based on the continuous measurements, which differs from the methods in [Ribeiro 2006c] and [You 2008], where the quantized innovation is obtained using the estimator based on the quantized measurements.

The general algorithm presented in [You 2008] has its approximate performance dependent on $I_q(0)$, with $I_q(0)$ evaluated for the Gaussian distribution with variance $\sigma_v^2 = 1$ (noise scale factor $\delta = \sqrt{2}$). The performance of the algorithms is enhanced by maximizing $I_q(0)$. This is in accordance with the lower bound on the MSE studied above for the Wiener process model with symmetric noise, $\mathrm{MSE}_k \ge \frac{1}{J_k'}$, which decreases with increasing $I_q(0)$ ((2.37) shows that $J_k'$ increases with $I_q(0)$). This gives additional motivation for studying how to maximize $I_q(0)$ w.r.t. the thresholds.

The assumption stated in [Curry 1970], [Ribeiro 2006c] and [You 2008] that the posterior is a Gaussian distribution for all $k$ and all $\sigma_w$ is a very rough approximation.
To see this, suppose that the assumption that the prediction PDF $p(x_k|i_{1:k-1})$ is Gaussian is correct. Then, from the update expression (2.11), we know that the posterior is proportional to the function $P(i_k|x_k)\, p(x_k|i_{1:k-1})$. The probability $P(i_k|x_k)$ is a difference of CDFs, which, as a function of $x_k$, is approximately a rectangular window with slowly decaying borders, centered at the quantization interval for $i_k$. If the standard deviation of the prediction is large or comparable to the equivalent width of $P(i_k|x_k)$, and the prediction distribution has a mean different from the quantization interval center, then it is easy to see that the resulting $P(i_k|x_k)\, p(x_k|i_{1:k-1})$ will be a skewed function, not similar at all to a Gaussian function. As an additional remark, note that, differently from the continuous measurement case, where the measurement noise must be Gaussian for the posterior to be Gaussian, the assumption of Gaussian noise does not help here: the function $P(i_k|x_k)$ is not close to Gaussian even in the Gaussian noise case.

We will use the Gaussian assumption when $\sigma_w$ is small and $k$ tends to infinity. Under these assumptions, and considering that we quantize the innovations, we will obtain an approximation of the asymptotically optimal estimator and its performance. To verify that the approximation is reasonable, we will compare the approximate asymptotic performance with the asymptotic BCRB.

2.5.3.1 Asymptotic estimator for a slowly varying parameter

As discussed above, it is reasonable to accept that the estimator MSE converges to a constant $\sigma_\infty^2$. When the Wiener process increment standard deviation $\sigma_w$ is small compared with the noise scale factor, the estimator has sufficient time to reduce the estimation variance before $X_k$ changes significantly; thus it is also reasonable to state that $\sigma_\infty^2$ is small.
If we assume that, after some time, the previous posterior is approximately Gaussian with mean $\mathbb{E}(X_{k-1})$ and variance $\sigma_\infty^2$, then, as the prediction PDF is the convolution (2.10) of $p(x_k|x_{k-1})$, which is Gaussian, with $p(x_{k-1}|i_{1:k-1})$, which is also Gaussian, we obtain that the prediction PDF conditioned on the past observations is Gaussian with mean $\mathbb{E}(X_{k-1}) + u_k$ and variance $\sigma_\infty^2 + \sigma_w^2$. To estimate $X_k$ optimally, we must evaluate the conditional mean
\[
\hat{X}_k = \frac{\int_{\mathbb{R}} x_k\, P(i_k|x_k)\, p(x_k|i_{1:k-1})\, \mathrm{d}x_k}{\int_{\mathbb{R}} P(i_k|x_k')\, p(x_k'|i_{1:k-1})\, \mathrm{d}x_k'}.
\]
The numerator can be seen as the prediction mean of the r.v. $X_k P(i_k|X_k)$ (the mean w.r.t. $p(x_k|i_{1:k-1})$). Under the assumption that $\sigma_\infty$ is small, the factor $P(i_k|X_k)$, which is given by (2.14),
\[
P(i_k|X_k) = \begin{cases}
F\left(\tau_{|i_k|}' + \hat{X}_{k|k-1} - X_k\right) - F\left(\tau_{|i_k|-1}' + \hat{X}_{k|k-1} - X_k\right), & \text{if } i_k > 0,\\[4pt]
F\left(-\tau_{|i_k|-1}' + \hat{X}_{k|k-1} - X_k\right) - F\left(-\tau_{|i_k|}' + \hat{X}_{k|k-1} - X_k\right), & \text{if } i_k < 0,
\end{cases}
\]
can be well approximated by a first order Taylor series expansion around $\hat{X}_{k|k-1} - X_k = 0$:
\[
P(i_k|X_k) = \left.P(i_k|X_k)\right|_{\hat{X}_{k|k-1}=X_k} + \left.f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|_{\hat{X}_{k|k-1}=X_k} \left(\hat{X}_{k|k-1} - X_k\right) + \circ\left(\hat{X}_{k|k-1} - X_k\right), \tag{2.41}
\]
where $f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)$ is the first derivative of $P(i_k|X_k)$ w.r.t. the prediction error $\hat{X}_{k|k-1} - X_k$. It can be written as a function of the noise PDF $f$:
\[
f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right) = \begin{cases}
f\left(\tau_{|i_k|}' + \hat{X}_{k|k-1} - X_k\right) - f\left(\tau_{|i_k|-1}' + \hat{X}_{k|k-1} - X_k\right), & \text{if } i_k > 0,\\[4pt]
f\left(-\tau_{|i_k|-1}' + \hat{X}_{k|k-1} - X_k\right) - f\left(-\tau_{|i_k|}' + \hat{X}_{k|k-1} - X_k\right), & \text{if } i_k < 0.
\end{cases} \tag{2.42}
\]
Note that when $\hat{X}_{k|k-1} = X_k$ the function $f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)$ depends only on $i_k$. Using (2.41), the numerator in the estimator expression is the prediction mean of
\[
X_k P(i_k|X_k) = X_k \left.P(i_k|X_k)\right|_{\hat{X}_{k|k-1}=X_k} + X_k \left.f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|_{\hat{X}_{k|k-1}=X_k}\left(\hat{X}_{k|k-1} - X_k\right) + \circ\left(X_k \hat{X}_{k|k-1} - X_k^2\right).
\]
The prediction mean of $X_k$ is the prediction $\hat{X}_{k|k-1}$, while $\hat{X}_{k|k-1}$ itself is simply a constant for the evaluation of this mean. Thus, using linearity and the fact that
\[
\mathbb{E}_{X_k|i_{1:k-1}}^2\left(X_k\right) - \mathbb{E}_{X_k|i_{1:k-1}}\left(X_k^2\right) = -\mathrm{Var}_{X_k|i_{1:k-1}}\left(X_k\right) = -\left(\sigma_\infty^2 + \sigma_w^2\right),
\]
we have
\[
\int_{\mathbb{R}} x_k\, P(i_k|x_k)\, p(x_k|i_{1:k-1})\, \mathrm{d}x_k = \hat{X}_{k|k-1} \left.P(i_k|X_k)\right|_{\hat{X}_{k|k-1}=X_k} - \left(\sigma_\infty^2 + \sigma_w^2\right)\left.f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|_{\hat{X}_{k|k-1}=X_k} + \circ\left(\sigma_\infty^2 + \sigma_w^2\right). \tag{2.43}
\]
To obtain the denominator in the estimator expression, we can use a similar procedure. Note that now the prediction expectation is evaluated for the r.v. $P(i_k|X_k)$ instead of $X_k P(i_k|X_k)$. We use the second order Taylor expansion of $P(i_k|X_k)$ in this case:
\[
P(i_k|X_k) = \left.P(i_k|X_k)\right|_{\hat{X}_{k|k-1}=X_k} + \left.f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|_{\hat{X}_{k|k-1}=X_k}\left(\hat{X}_{k|k-1} - X_k\right) + \frac{\left(\hat{X}_{k|k-1} - X_k\right)^2}{2}\left.f_d'\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|_{\hat{X}_{k|k-1}=X_k} + \circ\left[\left(\hat{X}_{k|k-1} - X_k\right)^2\right], \tag{2.44}
\]
where $f_d'\left(i_k, \hat{X}_{k|k-1}, X_k\right)$ is the second derivative of $P(i_k|X_k)$ w.r.t. the prediction error. By differentiating $f_d$ in (2.42), we can observe that, for $\hat{X}_{k|k-1} = X_k$, this function also depends only on $i_k$. The mean of the first term above is the constant $\left.P(i_k|X_k)\right|_{\hat{X}_{k|k-1}=X_k}$. For the second term, $f_d$ is a constant and the mean of the prediction error is zero, as the optimal predictor is unbiased [Jazwinski 1970, p. 150]. The third and last terms depend on the prediction mean of $\left(\hat{X}_{k|k-1} - X_k\right)^2$, which is equal to the prediction variance $\sigma_\infty^2 + \sigma_w^2$, again due to the unbiasedness of the optimal predictor. This gives
\[
\int_{\mathbb{R}} P(i_k|x_k')\, p(x_k'|i_{1:k-1})\, \mathrm{d}x_k' = \left.P(i_k|X_k)\right|_{\hat{X}_{k|k-1}=X_k} + \frac{\sigma_\infty^2 + \sigma_w^2}{2}\left.f_d'\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|_{\hat{X}_{k|k-1}=X_k} + \circ\left(\sigma_\infty^2 + \sigma_w^2\right). \tag{2.45}
\]
Dividing the RHS of expression (2.43) by that of (2.45), we obtain an expression for the estimator (in what follows, all quantities marked with $|$ are evaluated at $\hat{X}_{k|k-1} = X_k$):
\[
\hat{X}_k = \frac{\hat{X}_{k|k-1} \left.P(i_k|X_k)\right| - \left(\sigma_\infty^2 + \sigma_w^2\right)\left.f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right| + \circ\left(\sigma_\infty^2 + \sigma_w^2\right)}{\left.P(i_k|X_k)\right| + \frac{\sigma_\infty^2 + \sigma_w^2}{2}\left.f_d'\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right| + \circ\left(\sigma_\infty^2 + \sigma_w^2\right)}.
\]
Dividing the numerator and denominator by $\left.P(i_k|X_k)\right|$, we get
\[
\hat{X}_k = \frac{\hat{X}_{k|k-1} - \left(\sigma_\infty^2 + \sigma_w^2\right)\dfrac{\left.f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|}{\left.P(i_k|X_k)\right|} + \circ\left(\sigma_\infty^2 + \sigma_w^2\right)}{1 + \dfrac{\sigma_\infty^2 + \sigma_w^2}{2}\dfrac{\left.f_d'\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|}{\left.P(i_k|X_k)\right|} + \circ\left(\sigma_\infty^2 + \sigma_w^2\right)}. \tag{2.46}
\]
If $\frac{\left.f_d'\right|}{\left.P(i_k|X_k)\right|}$ is bounded and $\sigma_\infty^2 + \sigma_w^2 \ll 1$, the denominator is approximately one and we can approximate the optimal estimator by (a first order Taylor series of $\frac{1}{1+x}$ around $x = 0$ would produce a more precise approximation, but it would lead to a more complicated algorithm for the performance analysis)
\[
\hat{X}_k \approx \hat{X}_{k|k-1} - \left(\sigma_\infty^2 + \sigma_w^2\right)\frac{\left.f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|_{\hat{X}_{k|k-1}=X_k}}{\left.P(i_k|X_k)\right|_{\hat{X}_{k|k-1}=X_k}}. \tag{2.47}
\]

2.5.3.2 Performance of the asymptotic estimator

Now we need to calculate the asymptotic MSE of this estimator, which is $\sigma_\infty^2$. The idea is to rewrite the asymptotic prediction error as a sum of the estimation error and a function of the observations. Using (2.47) we have
\[
\hat{X}_k - X_k = \hat{X}_{k|k-1} - X_k - \left(\sigma_\infty^2 + \sigma_w^2\right)\frac{\left.f_d\right|}{\left.P(i_k|X_k)\right|};
\]
moving the term with $f_d$ to the left-hand side, we have
\[
\hat{X}_k - X_k + \left(\sigma_\infty^2 + \sigma_w^2\right)\frac{\left.f_d\right|}{\left.P(i_k|X_k)\right|} = \hat{X}_{k|k-1} - X_k.
\]
Squaring and taking the expectation gives
\[
\mathbb{E}\left[\left(\hat{X}_k - X_k\right)^2\right] + 2\,\mathbb{E}\left[\left(\hat{X}_k - X_k\right)\left(\sigma_\infty^2 + \sigma_w^2\right)\frac{\left.f_d\right|}{\left.P(i_k|X_k)\right|}\right] + \mathbb{E}\left[\left(\sigma_\infty^2 + \sigma_w^2\right)^2 \frac{\left.f_d^2\right|}{\left.P^2(i_k|X_k)\right|}\right] = \mathbb{E}\left[\left(\hat{X}_{k|k-1} - X_k\right)^2\right].
\]
The first term is the asymptotic squared error $\sigma_\infty^2$. The second term is the expectation of the estimation error multiplied by a function of the measurement $i_k$; for small $\sigma_\infty^2 + \sigma_w^2$ the estimation procedure is optimal (it minimizes the MSE), so this expectation equals zero [Rhodes 1971] (this is a more general form of the well-known orthogonal projection theorem). The constant $\left(\sigma_\infty^2 + \sigma_w^2\right)^2$ can leave the expectation, and the term on the RHS is the prediction error variance $\sigma_\infty^2 + \sigma_w^2$. Therefore, we have
\[
\sigma_\infty^2 + \left(\sigma_\infty^2 + \sigma_w^2\right)^2 \mathbb{E}\left[\frac{\left.f_d^2\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|_{\hat{X}_{k|k-1}=X_k}}{\left.P^2(i_k|X_k)\right|_{\hat{X}_{k|k-1}=X_k}}\right] = \sigma_\infty^2 + \sigma_w^2. \tag{2.48}
\]
The expectation that still needs to be evaluated is an expectation under the marginal probability measure of $i_k$, $P(i_k)$. This probability measure can be evaluated by marginalizing over the prediction error $\varepsilon_k = \hat{X}_{k|k-1} - X_k$:
\[
P(i_k) = \int_{\mathbb{R}} P(i_k|x_k)\, p(\varepsilon_k)\, \mathrm{d}\varepsilon_k.
\]
Remember that in the quantized innovation scheme $P(i_k|x_k)$ is a function of $i_k$ and $\varepsilon_k$. The marginal can thus be seen as the mean of $P(i_k|X_k)$ evaluated w.r.t. the distribution of $\varepsilon_k$. To evaluate this mean, we can again use a second order Taylor series expansion around $\varepsilon_k = 0$, which leads to the same expression as in (2.45). The remaining expectation is then (all quantities marked with $|$ being evaluated at $\hat{X}_{k|k-1} = X_k$)
\[
\mathbb{E}\left[\frac{\left.f_d^2\right|}{\left.P^2(i_k|X_k)\right|}\right] = \sum_{i_k \in \mathcal{I}} \frac{\left.f_d^2\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|}{\left.P(i_k|X_k)\right|} + \frac{\sigma_\infty^2 + \sigma_w^2}{2} \sum_{i_k \in \mathcal{I}} \frac{\left.f_d^2\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right| \left.f_d'\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|}{\left.P^2(i_k|X_k)\right|} + \circ\left(\sigma_\infty^2 + \sigma_w^2\right).
\]
The first term on the RHS can be identified as the FI $I_q(0)$. Considering that the sum in the second term on the RHS is bounded, then, after multiplying by $\left(\sigma_\infty^2 + \sigma_w^2\right)^2$, the second term is multiplied by $\left(\sigma_\infty^2 + \sigma_w^2\right)^3$, which is a $\circ\left[\left(\sigma_\infty^2 + \sigma_w^2\right)^2\right]$ term. This leads to
\[
\left(\sigma_\infty^2 + \sigma_w^2\right)^2 \mathbb{E}\left[\frac{\left.f_d^2\right|}{\left.P^2(i_k|X_k)\right|}\right] = \left(\sigma_\infty^2 + \sigma_w^2\right)^2 I_q(0) + \circ\left[\left(\sigma_\infty^2 + \sigma_w^2\right)^2\right].
\]
Using the expression above in (2.48), we obtain
\[
\sigma_\infty^2 + \left(\sigma_\infty^2 + \sigma_w^2\right)^2 I_q(0) + \circ\left[\left(\sigma_\infty^2 + \sigma_w^2\right)^2\right] = \sigma_\infty^2 + \sigma_w^2,
\]
or equivalently
\[
\left(\sigma_\infty^2 + \sigma_w^2\right)^2 + \circ\left[\left(\sigma_\infty^2 + \sigma_w^2\right)^2\right] = \frac{\sigma_w^2}{I_q(0)}.
\]
If $\sigma_w^2$ is small enough for the $\circ$ term to be negligible, we can obtain the following approximation for $\sigma_\infty^2$:
\[
\sigma_\infty^2 \approx \frac{\sigma_w}{\sqrt{I_q(0)}} - \sigma_w^2. \tag{2.49}
\]
Considering that $\sigma_w$ is small compared with $\sqrt{I_q(0)}$ and with one, we have a rough approximation for the asymptotic performance:
\[
\sigma_\infty^2 \approx \frac{\sigma_w}{\sqrt{I_q(0)}}. \tag{2.50}
\]
Finally, replacing $\sigma_\infty^2$ from (2.49) in the approximate expression (2.47) for $\hat{X}_k$, the following is obtained:
\[
\hat{X}_k \approx \hat{X}_{k|k-1} - \frac{\sigma_w}{\sqrt{I_q(0)}}\, \frac{\left.f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|_{\hat{X}_{k|k-1}=X_k}}{\left.P(i_k|X_k)\right|_{\hat{X}_{k|k-1}=X_k}}
= \hat{X}_{k-1} + u_k - \frac{\sigma_w}{\sqrt{I_q(0)}}\, \frac{\left.f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)\right|_{\hat{X}_{k|k-1}=X_k}}{\left.P(i_k|X_k)\right|_{\hat{X}_{k|k-1}=X_k}}. \tag{2.51}
\]
A few remarks are important here. First, in a similar way to what happened for the adaptive estimator of a constant parameter, the asymptotic estimation procedure is very simple: it is a correction of the last estimate which depends on the observation through $\frac{\left.f_d\right|}{\left.P(i_k|X_k)\right|}$, a function of $i_k$ only. This means that the corrections can be stored in a table. Second, the correction gain is now even simpler than in the constant parameter case: it is a constant. Third, the rough approximation (2.50) for $\sigma_\infty^2$ agrees with the intuition on estimation performance: if $\sigma_w$ increases, the MSE increases, as the estimator has fewer effective samples to use before the parameter changes significantly; if $I_q(0)$ increases, which is equivalent to saying that the noise level is reduced and/or the quantizer resolution is increased, the MSE decreases, as the statistical information given by each sample is increased.
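For Gaussian noise with $\sigma_v = 1$, both the table of corrections $f_d/P$ at zero prediction error and the constant gain $\sigma_w/\sqrt{I_q(0)}$ can be precomputed from the positive threshold set. The sketch below is our own illustration of (2.51), with the thresholds passed as $0 = \tau_0' < \tau_1' < \cdots < \tau_{N_I/2}' = \infty$; the function names are ours:

```python
import math

def phi(x):
    """Standard Gaussian PDF f(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard Gaussian CDF F(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def corrections_and_fi(taus):
    """Per-output corrections f_d/P at zero prediction error, and Iq(0).

    taus -- positive thresholds [tau'_0 = 0, tau'_1, ..., tau'_{NI/2} = inf].
    Returns ({i_k: f_d/P}, Iq(0)) for i_k in {-NI/2, ..., -1, 1, ..., NI/2}.
    """
    table = {}
    iq0 = 0.0
    for j in range(1, len(taus)):
        p = Phi(taus[j]) - Phi(taus[j - 1])   # P(i_k = j | eps_k = 0)
        fd = phi(taus[j]) - phi(taus[j - 1])  # f_d(i_k = j) at eps_k = 0
        table[j] = fd / p
        table[-j] = -fd / p                   # symmetry of the noise PDF
        iq0 += 2.0 * fd * fd / p              # both signs contribute to the FI
    return table, iq0

def update(xhat_prev, uk, ik, table, iq0, sigma_w):
    """Low-complexity recursion (2.51): prediction plus a tabulated correction."""
    return xhat_prev + uk - (sigma_w / math.sqrt(iq0)) * table[ik]
```

In the binary case, `taus = [0.0, math.inf]` gives $I_q(0) = 2/\pi$ and corrections $\mp 2 f(0)$, so the update reduces to $\hat{x}_k = \hat{x}_{k-1} + u_k \pm \sigma_w$, the sign following that of $i_k$.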
2.5.3.3 Asymptotic lower bound on the BCRB for a slowly varying parameter

To check whether the asymptotic estimator above is indeed close to optimal, we can compare its estimation performance with the asymptotic MSE lower bound, which is given by (2.40):
\[
\mathrm{MSE}_\infty \ge \frac{2}{I_q(0) + \sqrt{I_q^2(0) + \frac{4 I_q(0)}{\sigma_w^2}}}.
\]
The comparison with (2.50) must be done for small $\sigma_w$. To evaluate the RHS above in this case, we can multiply its numerator and denominator by $\frac{\sigma_w}{2\sqrt{I_q(0)}}$. This leads to
\[
\frac{2}{I_q(0) + \sqrt{I_q^2(0) + \frac{4 I_q(0)}{\sigma_w^2}}} = \frac{\frac{\sigma_w}{\sqrt{I_q(0)}}}{\frac{\sigma_w \sqrt{I_q(0)}}{2} + \sqrt{\frac{\sigma_w^2 I_q(0)}{4} + 1}}.
\]
Using the expansion $\sqrt{1+x} = 1 + \frac{x}{2} + \circ(x)$ around $x = 0$ on the square root above gives
\[
\frac{2}{I_q(0) + \sqrt{I_q^2(0) + \frac{4 I_q(0)}{\sigma_w^2}}} = \frac{\frac{\sigma_w}{\sqrt{I_q(0)}}}{\frac{\sigma_w \sqrt{I_q(0)}}{2} + 1 + \frac{\sigma_w^2 I_q(0)}{8} + \circ\left(\sigma_w^2\right)},
\]
where we used the fact that $\sigma_w^2$ is small compared with $I_q(0)$ to make $I_q(0)$ disappear from the $\circ$ term (this was also supposed to obtain the rough approximation (2.50) above). We can use again an expansion around zero; now we use $\frac{1}{1+x} = 1 - x + \circ(x)$. Supposing additionally that $\sigma_w$ is small compared with $\frac{2}{\sqrt{I_q(0)}}$, we can use a $\circ$ term depending only on $\sigma_w$. Thus, we obtain
\[
\frac{2}{I_q(0) + \sqrt{I_q^2(0) + \frac{4 I_q(0)}{\sigma_w^2}}} = \frac{\sigma_w}{\sqrt{I_q(0)}}\left[1 - \frac{\sigma_w \sqrt{I_q(0)}}{2} + \circ\left(\sigma_w\right)\right].
\]
The squared terms can be absorbed into $\circ(\sigma_w)$, leading finally to
\[
\mathrm{MSE}_\infty \ge \frac{2}{I_q(0) + \sqrt{I_q^2(0) + \frac{4 I_q(0)}{\sigma_w^2}}} = \frac{\sigma_w}{\sqrt{I_q(0)}} + \circ\left(\sigma_w\right),
\]
which for small $\sigma_w$ is exactly the same as the rough approximation of the asymptotic estimator performance. Consequently, we can say that the asymptotic estimator obtained above is optimal, as in this specific case it attains the lower bound.

As in the previous section, where the adaptive MLE scheme was shown to have a simple recursive form asymptotically, a question arises:

• can the asymptotic estimator (2.47) for slowly varying $X_k$ converge when we use it with an arbitrary (not necessarily small) initial error?

The answer to this question will be given in Ch. 3.
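The agreement for small $\sigma_w$ can also be checked numerically. With hypothetical values such as $I_q(0) = 0.6$, the relative gap between the exact bound (2.40) and the approximation $\sigma_w/\sqrt{I_q(0)}$ shrinks linearly with $\sigma_w$, as the expansion predicts:

```python
import math

def exact_bound(iq0, sigma_w):
    """Asymptotic MSE lower bound (2.40)."""
    return 2.0 / (iq0 + math.sqrt(iq0 ** 2 + 4.0 * iq0 / sigma_w ** 2))

def approx_bound(iq0, sigma_w):
    """Small-sigma_w approximation sigma_w / sqrt(Iq(0)) from (2.50)."""
    return sigma_w / math.sqrt(iq0)

def relative_gap(iq0, sigma_w):
    """Relative difference; roughly sigma_w * sqrt(Iq(0)) / 2 to first order."""
    return abs(exact_bound(iq0, sigma_w) - approx_bound(iq0, sigma_w)) / approx_bound(iq0, sigma_w)
```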
2.6 Chapter summary and directions

We sum up the main points of this chapter:

• Instead of considering that the parameter is constant, we assumed that the parameter can vary in time, more specifically following a Wiener process model. We saw that, in general, the optimal estimator can be obtained by evaluating the mean of the parameter conditioned on the past and present quantized measurements. Thus, the core of the problem was observed to be the evaluation of the posterior PDF (the PDF of $X_k$ conditioned on $i_{1:k}$). For a Markov process $X_k$, which is the case for a Wiener process model, the posterior can be evaluated in a recursive way: first by obtaining a prediction PDF using the posterior at time $k-1$ and the evolution model, then by updating the prediction PDF to the posterior at time $k$, incorporating the new measurement $i_k$.

• The integrals involved in the recursive expressions are complicated to evaluate analytically, so we must resort to numerical algorithms to solve them. One way of doing this is to apply Monte Carlo integration. This leads to a PF solution. The PF solution is a recursive simplified form of Monte Carlo integration applied to the filtering problem, with an additional resampling step. The performance of the optimal estimator could also be obtained using Monte Carlo integration, but it would be very difficult and time consuming to study the effects of the system parameters (noise level, Wiener process increment variance and quantizer resolution) using the Monte Carlo results. Therefore, we considered a simpler solution by using a bound on the MSE for which we can have a simple analytical expression.

• In our case, we used the BCRB, which is the inverse of the Bayesian information. The BI for the Wiener process $X_k$ can be evaluated recursively.
From its recursive expression, we could see that the BI, and consequently the bound, is affected by the quantization through a $\mathbb{E}[I_q(\varepsilon_k)]$ term, where $\varepsilon_k$ is the difference between the central threshold and the parameter. If $\mathbb{E}[I_q(\varepsilon_k)]$ is increased, then the bound decreases; if it is decreased, the bound increases. For commonly used noise models, $I_q(\varepsilon_k)$ is maximum at $\varepsilon_k = 0$; therefore, a practical lower bound can be obtained by using $I_q(0)$ instead of $\mathbb{E}[I_q(\varepsilon_k)]$.

• If we accept that the bound is tight enough to mimic the behavior of the MSE, another consequence of the dependence of the bound on $\mathbb{E}[I_q(\varepsilon_k)]$, together with the fact that $I_q(\varepsilon_k)$ is large close to $\varepsilon_k = 0$, is that the central threshold must be placed as close as possible to the true parameter. This can be done in an approximate way by setting the central threshold to the prediction of $X_k$ based on the past measurements. Thus, a good estimation procedure might be based on the quantized innovation. In this case, it is expected that the estimation performance will be closer to the bound, when compared with a quantizer with an arbitrary central threshold.

• When $\sigma_w$ is small and $k$ tends to infinity, the optimal estimator can be approximated by a low complexity recursive expression, with its MSE attaining the BCRB and given by $\frac{\sigma_w}{\sqrt{I_q(0)}}$. This shows one more time the importance of studying the maximization of $I_q(0)$ w.r.t. the quantization threshold variations. As stated before, the asymptotic analysis of this problem will be done in Part II. The simplicity of the asymptotic estimator when $X_k$ varies slowly will be used as a motivation to study in more detail recursive algorithms of the prediction-plus-correction type based on $i_k$. This will be done in Ch. 3.
• A generalization of the signal model used here can be obtained by considering that the dynamical parameter is a vector $X_k$ of size $N$ that obeys a linear Gaussian model of the type
\[
X_k = \Phi_k X_{k-1} + W_k,
\]
where $\Phi_k$ is an $N \times N$ matrix and $W_k$ is a sequence of independent Gaussian vectors. The continuous measurement is a vector $Y_k$ of dimension $M$,
\[
Y_k = H_k X_k + V_k,
\]
where $H_k$ is an $M \times N$ matrix and $V_k$ is a sequence of independent vectors. Quantization can still be done scalarly, but we might consider two possibilities for the quantization of each $Y_k$: scalar quantization of each dimension, or vector quantization of the entire vector. A direct application of the estimation problem with this model is the control of linear systems under rate constraints. We will not go further in this direction in this thesis, but we keep this generalized version of the problem for future work.

Chapter 3

Adaptive quantizers for estimation

As we saw in the previous chapters, to obtain good estimation performance, the quantizer dynamic range should be adaptively set around the true parameter to be estimated. We also saw that the asymptotically optimal estimator in the constant parameter case and in the slowly varying parameter case has a simple recursive form. We asked at the end of each chapter:

• can the asymptotic estimator based on binary measurements converge when we use its simplified form (the low complexity equivalent) with an arbitrary initial error (not necessarily small)?

• Can we extend this low complexity recursive procedure to the case $N_I > 2$?

• Can the asymptotic estimator for slowly varying $X_k$ converge when we use its simplified form (the low complexity equivalent) with an arbitrary initial error (not necessarily small)?

In this chapter we will answer these questions. To do so, we will impose on the estimation algorithm a general recursive form that includes the asymptotically optimal estimators as special cases.
We will start the chapter with a brief review of the signal models that will be used (constant, Wiener process with and without drift) and with the definition of the quantizer to be used. Then, we will define the form of the estimation algorithm and we will study its performance for the signal models defined previously, in terms of the mean error and of the MSE. Based on the performance analysis, we will obtain the optimal estimator parameters and the corresponding optimal performance. As in related work [Papadopoulos 2001], the optimal performance will be used to obtain a measure of the performance loss due to quantization. This loss will be evaluated for each signal model by using the corresponding optimal performance for estimation based on continuous measurements. The performance results will be verified through simulation. We will also propose extensions of the adaptive algorithm to the following cases:

• quantized measurements from a sensor are used for estimating a constant parameter, but the noise scale parameter is considered to be unknown.

• Multiple sensors and a fusion center are used to estimate a constant parameter. The sensors can send only quantized measurements to the fusion center, while the fusion center can broadcast continuous values to the sensors.

In each of these cases we will follow a similar procedure. We will define the problem and the estimation algorithm to be used. Then, we will obtain the optimal estimator parameters and the corresponding optimal performance. Simulation will be used to check the validity of the results. At the end of the chapter, we will summarize its main results and we will give some directions for future work.

Contributions presented in this chapter:

• Design and analysis of an adaptive estimation algorithm based on multibit quantized noisy measurements. This differs from [Li 2007] and [Fang 2008], where only binary quantization is treated.
• Explicit performance analysis for tracking of a slowly varying parameter. This differs from [Papadopoulos 2001, Ribeiro 2006a, Li 2007, Fang 2008], where the parameter is set to be constant and all subsequent analysis is based on this hypothesis. Even if tracking is treated in a more general way in [Ribeiro 2006c] and [You 2008], here we do not state assumptions of noise Gaussianity. Note that the assumption that the parameter varies slowly seems more restrictive than the parameter models considered in [Ribeiro 2006c] and [You 2008]; actually, the slowly varying assumption is hidden in the performance evaluation for the binary case given in [Ribeiro 2006c], where it is shown that the performance of the proposed filter reaches the continuous equivalent when the sampling time tends to zero.

• Low complexity algorithms. The algorithms proposed here are based on simple recursive techniques that have lower complexity than the methods proposed in [Li 2007] and [Fang 2008].

• Joint location and scale adaptive estimator. The algorithm that we propose is an extension of the location estimation problem. This extension is discussed in [Ribeiro 2006b], but only for fixed quantization thresholds.

• Fusion center approach. This approach can be seen as a multisensor, multibit, low complexity alternative to the adaptive techniques presented in [Li 2007] and [Fang 2008], and also as an adaptive alternative to the optimal threshold distribution approach given in [Ribeiro 2006a], where a prior distribution on the parameter is needed.

Contents

3.1 Parameter model and measurement model
  3.1.1 Parameter model
  3.1.2 Noise model
  3.1.3 Adjustable quantizer model
3.2 General estimation algorithm
3.3 Estimation performance
  3.3.1 Mean ordinary differential equation
  3.3.2 Asymptotic MSE
3.4 Optimal algorithm parameters and performance
  3.4.1 Optimal algorithm parameters
  3.4.2 Algorithm performance for optimal gain and coefficients
3.5 Simulations
  3.5.1 General considerations
  3.5.2 Theoretical performance loss due to quantization
  3.5.3 Simulated loss
  3.5.4 Comparison with the high complexity algorithms
  3.5.5 Discussion on the results
3.6 Adaptive quantizers for estimation: extensions
  3.6.1 Joint estimation of location and scale parameters
  3.6.2 Fusion center approach with multiple sensors
3.7 Chapter summary and directions

3.1 Parameter model and measurement model

3.1.1 Parameter model

We will join the constant and varying models by using the dynamic model (2.1)

Xk = Xk−1 + Wk,

where {Wk, k = 1, 2, · · ·} is a sequence of independent Gaussian r.v. (also independent of Xn, for n < k) whose means form a deterministic sequence {uk, k = 1, 2, · · ·} and whose standard deviation is σw:

Wk ∼ N (uk, σw²).

The symbol ∼ means "distributed according to" and N is the symbol for the Gaussian distribution.
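As an illustration (not part of the thesis), the dynamic model above can be simulated in a few lines; the function name and numeric values below are ours and purely illustrative. The three regimes studied in this thesis (constant, Wiener process, Wiener process with drift) correspond to different choices of u and σw:

```python
import random

def simulate_parameter(n, x0=0.0, u=0.0, sigma_w=0.0, seed=1):
    """Simulate X_k = X_{k-1} + W_k with W_k ~ N(u, sigma_w^2)."""
    rng = random.Random(seed)
    x, path = x0, []
    for _ in range(n):
        x += rng.gauss(u, sigma_w)   # one Gaussian increment per step
        path.append(x)
    return path

# Constant parameter: u = sigma_w = 0, so X_k stays at X_0 = x.
const = simulate_parameter(100, x0=1.5)
# Slowly varying Wiener process: u = 0, small sigma_w.
wiener = simulate_parameter(100, sigma_w=1e-3)
# Wiener process with drift: small nonzero u and sigma_w.
drift = simulate_parameter(100, u=1e-3, sigma_w=1e-3)
```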
Differently from what was considered previously, the sequence uk will be considered to be a known or unknown constant u and we will assume that it has a small value. We will also assume that σw is a known, small constant. "Small" in both cases means that these constants are small when compared with the noise scale parameter; in the Gaussian noise case, this is equivalent to saying that they are small when compared with the noise standard deviation. The fact that we use a constant u instead of the varying uk will allow us to obtain asymptotic performance results. In practice, all the results that will be presented remain valid for a varying uk, as long as the sequence uk is small and slowly varying. The model above is a compact form describing the three parameter models that are studied in this thesis:

• constant: by taking u = σw = 0 and X0 = x, we have the constant parameter model.

• Wiener process: if u = 0, σw is (small and) nonzero and X0 is Gaussian with unknown mean and variance, then Xk is a (slowly) varying Wiener process.

• Wiener process with drift: in this case u and σw are nonzero (and with small amplitudes).

3.1.2 Noise model

The continuous amplitude measurement is again given by the additive model

Yk = Xk + Vk,

where the noise r.v. sequence Vk respects the assumptions considered previously:

• the sequence is i.i.d.;

• AN1 (p. 34) – the marginal noise CDF, denoted F (v), admits a PDF, denoted f (v);

• AN2 (p. 34) – f (v) is a strictly positive even function that strictly decreases w.r.t. |v|.

An additional assumption will be considered on the noise CDF.

Assumption (on the noise distribution):

AN3 F is locally Lipschitz continuous.
A function F (v) is Lipschitz continuous on an interval V if there exists a constant L such that, for every two points v1 and v2 in V,

|F (v1) − F (v2)| ≤ L |v1 − v2|;

the function is locally Lipschitz continuous if for every v ∈ R we can find an interval V′ containing v such that the function is Lipschitz continuous on V′. This assumption is required by the method of analysis that will be used to assess the performance of the proposed algorithm. Most noise CDFs considered in practice are Lipschitz continuous, thus this assumption is generally satisfied.

3.1.3 Adjustable quantizer model

We saw in Ch. 1 and 2 that the quantizer central threshold must be dynamically updated to obtain good estimation performance. We will make this feature explicit by imposing on the quantizer an adjustable offset bk. For adjusting the amplitude of the quantizer input, we can also consider that, after offsetting the input, we apply an adjustable gain 1/∆k. The quantized measurements at the output of the adjustable quantizer are given by

ik = Q [(Yk − bk) / ∆k].   (3.1)

By considering a dynamic input offset and gain, we can fix the quantizer to have a static structure with a central threshold that can now be set to zero. Thus, the quantizer thresholds are equal to the threshold variations. This modifies assumption AQ2 (p. 37).

Assumption (on the quantizer):

AQ2' The quantizer is symmetric around the central threshold, which is equal to zero. This means that the vector of thresholds τ is given by the vector of threshold variations

τ = τ′ = [−τ′_{NI/2} · · · −τ′_1 0 +τ′_1 · · · +τ′_{NI/2}]⊤,

where the threshold variations τ′_i form an increasing sequence.

The adjustable quantizer output is given by

ik = Q [(Yk − bk) / ∆k] = i sign (Yk − bk), for |Yk − bk| / ∆k ∈ [τ′_{i−1}, τ′_i).   (3.2)

A scheme representing the adjustable quantizer is given in Fig. 3.1.
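A minimal sketch (ours, not from the thesis) of the symmetric adjustable quantizer (3.1)–(3.2); it assumes the outermost cells are unbounded, so `tau` lists only the finite positive threshold variations:

```python
import bisect

def quantize(y, b, delta, tau):
    """Symmetric adjustable quantizer i_k = Q((Y_k - b_k)/Delta_k).

    `tau` holds the finite positive threshold variations
    tau'_1 < ... < tau'_{K-1} (tau'_0 = 0 and tau'_K = +inf assumed);
    |z| in [tau'_{i-1}, tau'_i) yields index i, signed by sign(z),
    so the output ranges over {-K, ..., -1, 1, ..., K}.
    """
    z = (y - b) / delta                       # offset, then gain 1/Delta
    i = bisect.bisect_right(tau, abs(z)) + 1  # cell index of |z|
    return i if z >= 0 else -i

# With threshold variations tau'_1 = 1, tau'_2 = 2 (three cells per side):
out = [quantize(y, 0.0, 1.0, [1.0, 2.0]) for y in (0.5, -1.5, 3.0)]
```

Note that the index 0 is never emitted, matching the convention that the output index is ±1 nearest the (zero) central threshold.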
Note that even if the quantizer is not uniform (with a constant step length between thresholds), it can be implemented using a uniform quantizer with a compander approach [Gersho 1992].

Figure 3.1: Scheme representing the adjustable quantizer. The offset and gain can be adjusted dynamically, while the quantizer thresholds are fixed.

Based on the quantizer outputs, the main objective is to estimate Xk. A secondary objective is to adjust the parameters bk and ∆k to enhance estimation performance. As the estimate X̂k of Xk will possibly be used in real time applications, it might be interesting to estimate it online. Therefore, we are again interested in solving problems (a) and (b); the main difference is that now we want to solve (a) for each time index k. It was observed in the previous chapters that

• when estimating a constant, we can place the central threshold at the last estimate to have an asymptotically optimal algorithm.

• When estimating a Wiener process, we can place the central threshold at the prediction. For a Wiener process without drift the prediction is exactly X̂k−1 and for a Wiener process with drift the prediction is X̂k−1 + uk.

Based on these observations and for simplification purposes, we will set in all cases

bk = X̂k−1.

Also to simplify, we will consider that the gain is set to a constant. For the algorithm presented later, the fact that the offset is set to X̂k−1 will have as a consequence asymptotically optimal parameters that do not depend on the mean of Xk, thus simplifying the analysis. Some remarks are important here:

• We will see that imposing the use of bk = X̂k−1, instead of the prediction, will make the algorithm parameters and the performance differ for Wiener processes with and without drift.
• If we use the prediction, instead of the last estimate, for setting the quantizer offset and for estimating Xk, then all the results that we will obtain for a Wiener process without drift will also be valid for the process with drift.

• In the special cases where the optimal central threshold is not the median of the continuous amplitude measurement, we can evaluate the optimal quantizer offset ε⋆ w.r.t. the true parameter (the point of minimum in the "w" shaped CRB curves) and then add this value to the offset of the adaptive quantizer: bk = X̂k−1 + ε⋆.

• As most performance results will be given asymptotically, the simplification brought by using a constant gain 1/∆ can still be partially achieved if we constrain this gain to be constant after some time, or to reach a constant value asymptotically. In this case, the analysis of error convergence will have to take into account that the measurement system varies in time and we must also be able to evaluate its asymptotic value.

• The gain 1/∆ will again be considered variable further in the chapter, where we will jointly estimate a constant Xk and the scale parameter of the noise. In this case, the gain will not only be variable, but it will also depend on the measurements.

The general scheme for the estimation of Xk is depicted in Fig. 3.2 and the main objective will be to find the algorithm that will be placed in the block named Update.

Figure 3.2: Block representation of the estimation scheme. The estimation algorithm and the procedures to set the offset and the gain are represented by the Update block.

3.2 General estimation algorithm

At the end of Ch. 1, we saw that the estimator in the adaptive binary quantization scheme based on the MLE is asymptotically given by (1.87)

X̂k = X̂k−1 + ik / [2kf (0)],

whereas at the end of Ch.
2, we saw that the asymptotic expression for the optimal estimator of a slowly varying Wiener process is (2.51)

X̂k ≈ X̂k|k−1 − [σw / √Iq (0)] · [fd (ik, X̂k|k−1, Xk) / P (ik |Xk)] |_{X̂k|k−1 = Xk}.

Both asymptotic estimators have low complexity and both are special cases of the following adaptive algorithm:

X̂k = X̂k−1 + γk η [Q ((Yk − X̂k−1) / ∆)],   (3.3)

where γk is a sequence of positive real gains and η[·] is a mapping from I to R,

η: I → R, j ↦ ηj,   (3.4)

which is characterized by the sequence of NI coefficients {η_{−NI/2}, . . . , η_{−1}, η_1, . . . , η_{NI/2}}. Notice that the coefficients η[·] can be seen as the "estimation equivalent" of the output quantization levels used in standard quantization theory. Even if nothing guarantees that the algorithm (3.3) is optimal for finite time, the fact that it can be asymptotically equivalent to the optimal estimator and that it has low complexity are strong motivations for using it. Other, more intuitive, motivations are the following:

• similarly to the binary grid method proposed by [Fang 2008], for a slowly varying or constant parameter, we can choose the coefficients η[·] in a way that the algorithm will tend to stay around the true parameter, at least in the mean.

• When estimating a constant, the maximum likelihood estimator can be approximated by a simpler online algorithm using a stochastic gradient ascent, which has the same form as (3.3). It will be shown later that for the optimal choice of ηi, algorithm (3.3) is equivalent to a stochastic gradient ascent method maximizing the log-likelihood.

• To estimate a Wiener process, an approximate choice of estimator is a Kalman filter like method based on the quantized innovation, which also has the form (3.3).
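Algorithm (3.3) can be sketched in a few lines; the sketch below is ours (the Gaussian noise, the gain sequence γk = γ/k with γ = 2 and all numeric values are arbitrary illustrative choices, and odd symmetry of the output levels is assumed):

```python
import bisect
import random

def adaptive_estimate(y_seq, gains, eta, tau, delta=1.0, x_hat0=0.0):
    """Recursive update (3.3): the quantizer offset is the last estimate
    X^_{k-1} and the correction is gamma_k * eta[i_k].

    `eta` lists the positive output levels eta_1..eta_K; odd symmetry
    (eta_{-i} = -eta_i) is applied for negative indices. `tau` lists the
    finite positive threshold variations (outermost cells unbounded).
    """
    x_hat, est = x_hat0, []
    for y, g in zip(y_seq, gains):
        z = (y - x_hat) / delta
        i = bisect.bisect_right(tau, abs(z)) + 1  # positive cell index
        x_hat += g * (eta[i - 1] if z >= 0 else -eta[i - 1])
        est.append(x_hat)
    return est

# Estimate a constant x = 1 in unit Gaussian noise with gamma_k = 2/k.
rng = random.Random(0)
n, x = 5000, 1.0
y = [x + rng.gauss(0.0, 1.0) for _ in range(n)]
est = adaptive_estimate(y, [2.0 / (k + 1) for k in range(n)],
                        eta=[1.0, 2.0], tau=[1.0])
```

With the decreasing gain, the estimate settles near the true value and its fluctuations shrink over time.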
Due to the symmetry of the problem for commonly used noise models, when X̂k is close to Xk, it seems reasonable to suppose that the corrections given by the quantizer output levels have odd symmetry, with positive values for positive ik. This symmetry will be useful later for simplification purposes and we will add it to the other assumptions.

Assumption (on the quantizer output levels):

AQ3 The quantizer output levels have odd symmetry w.r.t. i:

ηi = −η−i,   (3.5)

with ηi > 0 for i > 0.

In the special cases where the threshold must be placed asymmetrically and we put an additional constant value ε⋆ in the quantizer offset, the assumption above may lead to an asymptotic estimation bias. To see this, consider that the quantizer offset is already at the parameter. Then, the mean of the correction η [ik] will be zero, as the distribution of ik is even and ηi is odd; thus the algorithm is at a mean equilibrium point. As the offset is placed ε⋆ away from the estimate, the mean of the estimate has an equilibrium point that is different from the true parameter.

The non differentiable nonlinearity Q in (3.3) makes the algorithm difficult to analyze. Fortunately, an analysis based on mean approximations was developed in [Benveniste 1990] for a wide class of adaptive algorithms. Within this framework, the function η can be a general nonlinear, non-differentiable function of Yk and X̂k, and it is shown that the gains γk that optimize the estimation of Xk can be chosen as follows:

• γk ∝ 1/k when Xk is constant.

• γk is constant for a Wiener process Xk.

• γk is a constant proportional to u^{2/3} when Xk is a Wiener process with drift.

Notice that the gains for the constant and Wiener process models given above have the same form as the asymptotically optimal gains found in Ch. 1 and Ch. 2.
The only difference is the gain proportional to u^{2/3} in the case with drift, which reflects the choice of using X̂k−1 in place of the prediction. In the following sections we will consider the gains given above for the algorithm (3.3) and we will apply the general analysis presented in [Benveniste 1990] to obtain its performance.

3.3 Estimation performance

To obtain the estimation performance, the analysis is separated into

• the analysis of the estimator mean. This gives a rough approximation of the estimator behavior. With this information we can see whether the estimator converges in the mean and we can also characterize its bias.

• The analysis of the estimation variance. This analysis will give the details of the fluctuations around the mean and it will be obtained, in most cases, asymptotically.

3.3.1 Mean ordinary differential equation

The core of the analysis that we use here, presented in a general setting in [Benveniste 1990], is to approximate the mean E (X̂k) by x̂ (tk), where x̂ (t) is the solution of the ordinary differential equation (ODE)

dx̂/dt = h (x̂).   (3.6)

The correspondence between continuous and discrete time is given by tk = Σ_{j=1}^{k} γj and h (x̂) is the following:

h (x̂) = E {η [Q ((x − x̂ + V) / ∆)]},   (3.7)

where the expectation is evaluated w.r.t. the distribution of V, which is the marginal noise distribution. A simple heuristic to obtain the approximation is the following: first we rewrite (3.3) as

(X̂k − X̂k−1) / γk = η [Q ((Xk − X̂k−1 + Vk) / ∆)],

then we consider that the parameter is approximately a constant, Xk = x, and that X̂k−1 on the RHS can be approximated by the mean at time k, i.e. X̂k−1 = x̂. Evaluating the expectation on both sides,

[E (X̂k) − E (X̂k−1)] / γk = E {η [Q ((x − x̂ + Vk) / ∆)]},

we see that the RHS is h (x̂) and, if we consider the algorithm gain as a small time step, then [E (X̂k) − E (X̂k−1)] / γk is an approximation of the time derivative.
For the approximation given by the ODE (3.6) to be valid as an approximation of E (X̂k), at least after some time k, and for using the results from [Benveniste 1990], some conditions must be satisfied:

• Conditions on the gains. The gains must sum to infinity,

Σ_{k=1}^{∞} γk = +∞;

when they are decreasing, the sum of their powers must be finite,

Σ_{k=1}^{∞} γk^α < +∞, for some α > 1,

and when they are not decreasing, they must tend to a finite limit

γ∞ = lim_{k→∞} γk < +∞.

As the cumulative sum of the gains is an equivalent of time in the ODE approximation, the condition that the sum of the gains goes to infinity is equivalent to saying that the time in the ODE can go to infinity, so that the algorithm does not get "stuck" in time. The condition on the sum of the powers of the decreasing gains is used to guarantee that the fluctuations of the estimator decrease when we want to estimate a constant. The last condition, on the limit of the gains, is used to have fixed asymptotic performance results. We can see that all these conditions are satisfied for the three types of gain defined previously.

• Conditions on the continuous measurements. For a fixed Xk = x, the continuous measurements Yk must form a Markov chain with a unique stationary asymptotic distribution. This condition is also necessary to have fixed asymptotic results. In the problem considered here, the distribution of the continuous measurements given a fixed parameter x is the noise distribution shifted by the parameter x. As the noise is i.i.d., the distribution of Yk is stationary for all k, thus clearly respecting this condition.

• Regularity conditions on h (x̂). The function h (x̂) must be locally Lipschitz continuous. The main point of using the analysis presented in [Benveniste 1990] is that it is not necessary to have a continuous correction function η.
The analysis is mainly based on replacing the mean of the algorithm by the ODE approximation and then evaluating the fluctuations around it. This analysis thus rests mainly on h and not on η. For the local existence, uniqueness and regularity of the ODE solution, we must impose regularity conditions on h. Also, for evaluating the fluctuations around the ODE solution, we will look at local expansions of h, which naturally leads to conditions such as the one stated above.

Using the assumptions on the quantizer thresholds and output levels, the expectation in (3.7) can be written as

h (x̂) = Σ_{i=1}^{NI/2} [ηi Fd (i, x̂, x) − ηi Fd (−i, x̂, x)],   (3.8)

where Fd is a difference of CDFs:

Fd (i, x̂, x) = F (τ′_i ∆ + x̂ − x) − F (τ′_{i−1} ∆ + x̂ − x), if i ∈ {1, · · ·, NI/2},
Fd (i, x̂, x) = F (τ′_{i+1} ∆ + x̂ − x) − F (τ′_i ∆ + x̂ − x), if i ∈ {−1, · · ·, −NI/2}.   (3.9)

From assumption AN3, the function h is a linear combination of locally Lipschitz continuous functions; this implies that h is also locally Lipschitz continuous and the condition is satisfied. All the conditions are satisfied in our case, therefore we can apply the performance results from [Benveniste 1990].

Mean of the algorithm for estimating a constant

For estimating a constant, the gain of the algorithm is of the form [Benveniste 1990]

γk = γ/k.   (3.10)

The ODE is given by (3.6),

dx̂/dt = h (x̂),

with the time given by tk = γ Σ_{j=1}^{k} 1/j. The ODE approximation is valid for small gains, so in this case it is valid for large k. The estimation bias after a transient time can be approximated using the ODE above. Denoting the bias as ε (t) = x̂ (t) − x, the bias ODE is

dε/dt = h̃ (ε),   (3.11)

where h̃ (ε) = h (ε + x) is a function that does not depend on the true parameter x (to verify this, use ε + x in the place of x̂ in the expression for Fd).
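The function h̃ (ε) can be evaluated numerically from (3.8)–(3.9); the sketch below is ours and assumes Gaussian noise with unit scale, ∆ = 1, and unbounded outermost quantization cells:

```python
import math

def gauss_cdf(v):
    """CDF of the zero-mean, unit-scale Gaussian noise."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def h_tilde(eps, eta, tau):
    """Bias mean field h~(eps) = h(eps + x), from (3.8)-(3.9), Delta = 1.

    eta: positive output levels eta_1..eta_K; tau: finite positive
    threshold variations (last cell unbounded).
    """
    edges = [0.0] + list(tau) + [math.inf]
    s = 0.0
    for i, ei in enumerate(eta, start=1):
        # Probability of output +i given bias eps
        fd_pos = gauss_cdf(edges[i] + eps) - gauss_cdf(edges[i - 1] + eps)
        # Probability of output -i given bias eps (mirrored cell)
        fd_neg = gauss_cdf(-edges[i - 1] + eps) - gauss_cdf(-edges[i] + eps)
        s += ei * (fd_pos - fd_neg)
    return s

eta, tau = [1.0, 2.0], [1.0]
```

One can check numerically that h̃ (0) = 0 and that h̃ has the sign of −ε, the two properties used in the stability argument that follows.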
As the function h̃ (ε) depends on a sum of CDFs which might not even have an analytical form, it is difficult to find analytical solutions of (3.11). The solution can in general be obtained using a numerical method, for example a Runge–Kutta method (see [Golub 1991] for details on numerical solvers). Even if we cannot in general characterize the bias for all k using the ODE, we can at least analyze what happens asymptotically to the mean of the algorithm.

Asymptotic stability and asymptotic unbiasedness

An interesting point to study is the asymptotic mean convergence of the algorithm. More precisely, if we prove that ε → 0 as t → ∞ for every ε (0) ∈ R, then we prove that the algorithm is asymptotically unbiased, as its true mean can be approximated by the ODE. Convergence in the mean is not only useful for showing that the algorithm indeed works, at least in the mean; it is also a requirement for the evaluation of the MSE that will be presented later.

The fact that ε → 0 as t → ∞ for every ε (0) ∈ R means that ε = 0 is a globally asymptotically stable point [Khalil 1992]. Global asymptotic stability of ε = 0 can be shown using an asymptotic stability theorem for nonlinear ODEs. This requires the definition of an unbounded Lyapunov function of the error. To simplify, a quadratic function will be used:

L (ε) = ε²,   (3.12)

which is a positive definite function that tends to infinity when ε tends to infinity. If h̃ (ε) = 0 for ε = 0 and dL/dt < 0 for ε ≠ 0, then by the Barbashin–Krasovskii theorem [Khalil 1992, p. 124] ε = 0 is a globally asymptotically stable point. To show that both conditions are met, expression (3.8) can be rewritten as a function of ε:

h̃ (ε) = Σ_{i=1}^{NI/2} ηi [F̃d (i, ε) − F̃d (−i, ε)],   (3.13)

where F̃d (i, ε) = Fd (i, ε + x, x) is also a function that does not depend on x. When ε = 0, the differences between F̃d in the sum are differences between probabilities on symmetric intervals.
The symmetry of the noise PDF stated in AN2 and the symmetry of the quantizer stated in AQ2' imply that h̃ (0) = 0, fulfilling the first condition. The second condition can be written in more detail by using the chain rule for the derivative:

dL/dt = (dL/dε)(dε/dt) = 2ε h̃ (ε) < 0, for ε ≠ 0.   (3.14)

Thus, h̃ (ε) has to respect the following constraints:

h̃ (ε) > 0 for ε < 0, and h̃ (ε) < 0 for ε > 0.   (3.15)

When ε ≠ 0, the terms in the sum that gives h̃ (ε) are differences between integrals of the noise PDF over intervals of the same size but with asymmetric interval centers. Using the symmetry assumptions, for ε > 0, F̃d (i, ε) is the integral of f over an interval more distant from zero than that of F̃d (−i, ε); then, by the decreasing assumption on f, F̃d (i, ε) < F̃d (−i, ε) and consequently h̃ (ε) < 0. Using the same reasoning for ε < 0, one can show that h̃ (ε) > 0. Therefore, the inequalities in (3.15) are satisfied and dL/dt < 0 for ε ≠ 0. Finally, as both conditions are satisfied, one can say that ε = 0 is globally asymptotically stable, which means that the estimator is asymptotically unbiased for estimating a constant.

Mean of the algorithm for estimating a Wiener process

When we want to estimate a Wiener process, the gain of the algorithm is considered to be a constant, γk = γ. In this case, if we consider γ to be a small constant, we can also write the ODE approximation to the mean with (3.6),

dx̂/dt = h (x̂).

Now, the constant x in the expression for h is the mean of the Wiener process (which is also the mean of the initial condition X0) and the time is tk = kγ. Note that in this case, by imposing a sufficiently small γ, the ODE will be valid for all k and there will be no transient time.
Actually, this could also be done for the constant parameter but, as we will see later, the optimal γ minimizing the asymptotic MSE may not be small for estimating a constant, while it will indeed be small for estimating a Wiener process with small σw. The bias ODE is also given by (3.11); therefore, for small γ the algorithm is also asymptotically unbiased in this case.

To show an example for which the ODE approximates the estimation bias well, we simulated the adaptive algorithm for NI = 2 and NI = 4 in the Gaussian noise case. The quantizer gain was 1/∆ = 1, and the threshold variations and output coefficients were chosen to be uniform: τ′ = [τ′1 = 1, τ′2 = 2]⊤, {η1 = 1, η2 = 2} for NI = 2 and τ′ = [τ′1 = 1, τ′2 = 2, τ′3 = 3, τ′4 = 4]⊤, {η1 = 1, η2 = 2, η3 = 3, η4 = 4} for NI = 4. The noise scale parameter was chosen to be δ = 1, the Wiener process increment standard deviation σw = 10−3 and the adaptive gain γ = 10−3. We considered the mean of the Wiener process to be E (Xk) = 0 and the initial condition of the algorithm was set to X̂0 = 1. To obtain an estimate of the bias, we simulated the algorithm 10 times for blocks of 104 samples. For each sample (each index k) we averaged the error across the different simulations. The solution of the bias ODE (3.11) was obtained numerically with a Runge–Kutta method of orders 4 and 5. The results are displayed in Fig. 3.3.

Figure 3.3: ODE bias approximation and simulated bias for the estimation of a Wiener process with the adaptive algorithm. The noise was considered to be Gaussian with δ = 1. Both NI = 2 and NI = 4 were considered, with τ′ = [τ′1 = 1, τ′2 = 2]⊤, {η1 = 1, η2 = 2} and τ′ = [τ′1 = 1, τ′2 = 2, τ′3 = 3, τ′4 = 4]⊤, {η1 = 1, η2 = 2, η3 = 3, η4 = 4}. In both cases, the quantizer input gain was considered to be one.
The Wiener process increment standard deviation σw and the adaptive gain were set to 10−3. The algorithm was initialized with X̂0 = 1, while the true mean of the Wiener process was set to zero. To obtain the simulated bias, we simulated 10 realizations of the estimation procedure for blocks of 104 samples and averaged over the realizations. The ODE approximation of the bias was obtained by solving (3.11) numerically with a Runge–Kutta method.

We note that the ODE approximation corresponds well to the mean trajectory of the estimation error. For this specific choice of parameters, which corresponds to the binary constant step update presented in [Li 2007] and [Fang 2008] and to a multibit extension of it (when NI = 4), we see that the algorithm can set the mean of the central threshold, which in this case is also the estimator, at the parameter mean even if the parameter is time-varying. We also observe that, for the simulation parameters used here, the convergence time of the algorithm for NI = 4 is smaller than for NI = 2.

As a final remark on the Wiener process case, when γ → 0 the ODE approximation becomes increasingly accurate, as the inherent discretization error (from time discretization) decreases to zero. Also, when γ → 0 we recover the constant Xk case studied in [Li 2007] and [Fang 2008]. Thus, the proof of asymptotic mean convergence given above is also a proof of convergence of the fixed step algorithms presented in [Li 2007] and [Fang 2008], and of multibit extensions of them, when the step of the algorithm is small.

Mean of the algorithm for estimating a Wiener process with drift

When the Wiener process has a drift, we consider again that the algorithm has a constant gain, γk = γ. However, in this case, as the mean of the parameter is not stationary, we cannot consider the ODE approximation with a constant x in the function h.
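A comparison of this kind can be reproduced in miniature. The sketch below is ours: it uses shorter runs than the thesis experiment, a forward Euler integrator in place of Runge–Kutta, σw = 0 (the constant-mean limit), and an illustrative configuration with positive output levels η = {1, 2} and a single finite threshold variation at 1, not exactly the settings of Fig. 3.3:

```python
import bisect
import math
import random

def gauss_cdf(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def h_tilde(eps, eta, tau):
    """Bias mean field h~(eps), Gaussian noise, Delta = 1,
    outermost cells unbounded."""
    edges = [0.0] + list(tau) + [math.inf]
    return sum(e * ((gauss_cdf(edges[i] + eps) - gauss_cdf(edges[i - 1] + eps))
                    - (gauss_cdf(-edges[i - 1] + eps) - gauss_cdf(-edges[i] + eps)))
               for i, e in enumerate(eta, start=1))

eta, tau, gamma, n = [1.0, 2.0], [1.0], 1e-2, 2000

# Forward Euler integration of the bias ODE d(eps)/dt = h~(eps), t_k = k*gamma.
eps_ode = [1.0]
for _ in range(n - 1):
    eps_ode.append(eps_ode[-1] + gamma * h_tilde(eps_ode[-1], eta, tau))

# Monte Carlo bias of the constant-gain algorithm: X^_0 = 1, true
# parameter fixed at 0 (sigma_w = 0 here for simplicity).
runs, rng = 200, random.Random(3)
mean_err = [0.0] * n
for _ in range(runs):
    x_hat = 1.0
    for k in range(n):
        z = rng.gauss(0.0, 1.0) - x_hat   # Y_k - X^_{k-1}, with X_k = 0
        i = bisect.bisect_right(tau, abs(z)) + 1
        x_hat += gamma * (eta[i - 1] if z >= 0 else -eta[i - 1])
        mean_err[k] += x_hat / runs
```

The averaged error trajectory decays from the initial error 1 toward 0 along the Euler solution of the bias ODE, mirroring the behavior reported for Fig. 3.3.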
To obtain the ODE, we will use again the heuristic presented above, but in this case we will include the dynamical model of the parameter. We start with the expectations of the increments divided by γ,

[E (Xk) − E (Xk−1)] / γ = u/γ,
[E (X̂k) − E (X̂k−1)] / γ = E {η [Q ((x − x̂ + Vk) / ∆)]},

and then we approximate them by a pair of coupled ODEs:

dx/dt = u/γ,
dx̂/dt = h̃ (x̂ − x),

where the time for both equations is tk = kγ. Note that the algorithm ODE now depends on the solution of the parameter ODE. By subtracting both expressions, we have an ODE for the bias ε:

dε/dt = h̃ (ε) − u/γ.   (3.16)

As the parameter is now moving deterministically with the drift u, we can assume that most of the tracking effort of the algorithm will be spent removing the bias ε. Therefore, the algorithm must be fast enough to follow the parameter and we must have γ ≫ u. This also makes u/γ small; thus, if all the ηi are not too small, we can find an ε∞ such that h̃ (ε∞) = u/γ, which means that ε∞ is an equilibrium point for the bias. It was shown above that the bias ODE without the forcing term u/γ is globally asymptotically stable; thus, for a slowly varying parameter, we can expect that the algorithm will tend to get close to the true parameter. After a time tk−1, we can assume that the algorithm is sufficiently close to the true parameter, so that we can approximate the function h̃ (ε) with a first order Taylor expansion around ε = 0:

h̃ (ε) = h̃ (0) + h̃⁽¹⁾ (0) ε + ◦ (ε),

where h̃⁽¹⁾ (0) is the derivative of h̃ (ε) with respect to ε evaluated at ε = 0. The ODE can then be rewritten as

dε/dt = h̃⁽¹⁾ (0) ε − u/γ + ◦ (ε), for t > tk−1.   (3.17)

For tk sufficiently large we can neglect the ◦ (ε) term; thus the bias ODE can be approximated by a linear ODE. For the linear ODE approximation not to diverge, we must impose the condition

h̃⁽¹⁾ (0) < 0.   (3.18)

Under this condition, the approximate bias will tend to an approximation of the equilibrium point ε∞.
Before obtaining the asymptotic bias given by the equilibrium point ε∞, we verify condition (3.18). The derivative of h̃(ε) w.r.t. ε is given by

  h̃⁽¹⁾(ε) = dh̃/dε = Σ_{i=1}^{NI/2} ηi [f̃d(i, ε) − f̃d(−i, ε)],  (3.19)

where f̃d(i, ε) is

  f̃d(i, ε) = f(τ′i ∆ + ε) − f(τ′i−1 ∆ + ε),  if i ∈ {1, · · · , NI/2},
  f̃d(i, ε) = f(τ′i+1 ∆ + ε) − f(τ′i ∆ + ε),  if i ∈ {−1, · · · , −NI/2}.  (3.20)

At the point ε = 0, f̃d(i, 0) for i ∈ {1, · · · , NI/2} is negative, because τ′i > τ′i−1 and the noise PDF is strictly decreasing (on the positive side) by assumption. For −i, f̃d(−i, 0) has the same absolute value as f̃d(i, 0) by the symmetry assumptions, but it is positive. Therefore, f̃d(i, 0) − f̃d(−i, 0) = 2 f̃d(i, 0) and this difference is always negative. The sum h̃⁽¹⁾(0) is then given by

  h̃⁽¹⁾(0) = 2 Σ_{i=1}^{NI/2} ηi f̃d(i, 0)  (3.21)

and it is also negative, as the quantizer output levels ηi are positive for positive i by assumption. This means that condition (3.18) is satisfied and the linear ODE approximation will converge to an equilibrium point. To simplify the notation, we will write hε in place of h̃⁽¹⁾(0) from now on.

As the system is linear, the equilibrium point is unique and independent of the initial condition. We can obtain its expression by setting dε/dt to zero in the ODE approximation. This leads to the equation

  hε ε∞ − u/γ = 0,

whose solution is

  ε∞ = u/(γhε).

As the bias ODE is an approximation of the true bias, this is equivalent to saying that, for small u,

  E(X̂k − Xk) ≈ u/(γhε)  as k → ∞.  (3.22)

Note that, differently from the constant and Wiener process cases, the estimator is not asymptotically unbiased. Observe also that if uk is not a small constant but a small-amplitude, slowly varying sequence, we could replace u by u(t) in the ODE approximation above and, for each time step (t ∈ [tk, tk + γ)), approximate the varying u(t) by the constant uk.
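The quality of the linearization can be checked numerically. The sketch below uses assumed illustrative values: a binary quantizer with unit output level and standard Gaussian noise, for which h̃(ε) = −erf(ε/√2) and hε = −√(2/π); it compares the linear prediction ε∞ = u/(γhε) of (3.22) with the exact equilibrium obtained by solving h̃(ε∞) = u/γ by bisection:

```python
from math import erf, sqrt, pi

u, gamma = 1e-5, 1e-2          # hypothetical drift and gain, with γ ≫ u
h_eps = -sqrt(2 / pi)          # hε = h̃'(0) for h̃(ε) = −erf(ε/√2)

eps_lin = u / (gamma * h_eps)  # linearized equilibrium bias (3.22)

# exact equilibrium: h̃ is decreasing in ε, so solve −erf(ε/√2) = u/γ by bisection
lo, hi = -1.0, 0.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if -erf(mid / sqrt(2)) > u / gamma:
        lo = mid               # h̃(mid) too large: the root lies to the right
    else:
        hi = mid
eps_exact = 0.5 * (lo + hi)

print(eps_lin, eps_exact)      # both ≈ −1.2533e-3: the linearization is accurate
```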
This would lead to replacing u by uk in the approximate bias expression (3.22); instead of considering it a valid expression for k → ∞, we would say that it is valid for large k.

3.3.2 Asymptotic MSE

After characterizing the mean behavior of the algorithm, we must quantify its random fluctuations. To do so, we will mainly use asymptotic results on the variance of the algorithm. With the asymptotic bias and the asymptotic variance, we can obtain the asymptotic MSE. The asymptotic MSE is a function of the parameter γ; thus, by minimizing it over γ, we will obtain expressions for the MSE independent of γ.

Asymptotic variance for estimating a constant

Under the condition that the algorithm is asymptotically unbiased, it can be shown using a central limit theorem that the normalized estimation error is asymptotically distributed as a Gaussian r.v. [Benveniste 1990, p. 109]:

  √k (X̂k − x) → N(0, σ∞²)  as k → ∞,  (3.23)

where the arrow denotes convergence in distribution. The asymptotic variance σ∞² is given by

  σ∞² = γ²R/(−2γhε − 1),  (3.24)

where the term hε is the derivative of h̃(ε) w.r.t. ε at ε = 0, as defined before. The term R in the numerator is the variance of the normalized increments of the adaptive algorithm, (X̂k − X̂k−1)/γk, when the mean of the algorithm, which is approximated by the ODE solution x̂, is equal to x. From the symmetry assumptions on the noise and on the quantizer, the normalized mean of the increments h(x̂) is zero when x̂ = x. Thus, this variance is given by the second-order moment of the quantizer output levels:

  R = Var{η[Q((x − x̂ + V)/∆)]}|_{x̂=x}
    = Σ_{i=1}^{NI/2} [ηi² Fd(i, x, x) + η−i² Fd(−i, x, x)]
    = 2 Σ_{i=1}^{NI/2} ηi² Fd(i, x, x)
    = 2 Σ_{i=1}^{NI/2} ηi² F̃d(i, 0),  (3.25)

where the second equality comes from the symmetry of the quantizer and of the noise distribution, and the last equality uses the F̃d notation. To minimize the asymptotic variance w.r.t. γ, we must find the positive γ for which dσ∞²(γ)/dγ = 0.
The expression for the derivative is

  dσ∞²(γ)/dγ = R [2γ/(−2γhε − 1) + 2γ²hε/(−2γhε − 1)²] = R (−2γ²hε − 2γ)/(−2γhε − 1)²,

which equals zero for γ = −1/hε. Note that this gain is positive, as hε is negative. By rewriting the derivative above as

  dσ∞²(γ)/dγ = [−2Rγhε/(−2γhε − 1)²] (γ + 1/hε),

we can see that for γ > −1/hε the derivative is positive and for γ < −1/hε the derivative is negative; thus γ = −1/hε gives a minimum of σ∞². The optimum gain γ⋆ and its corresponding variance are

  γ⋆ = −1/hε,  (3.26)
  σ∞² = R/hε².  (3.27)

Note that this result is valid under the condition that the estimator is asymptotically unbiased, a condition that was shown to hold in the previous subsection.

Asymptotic variance for estimating a Wiener process

The MSE for a varying parameter and a constant adaptive gain can be expressed as a sum of three terms:

  MSEk = E²[x̂(tk) − x(tk)] + E{[X̂k − x̂(tk)] − [Xk − x(tk)]}² + ◦(γ)
       = ε²(tk) + E[ξk²] + ◦(γ),  (3.28)

where ε²(tk) = E²[x̂(tk) − x(tk)] and ξk = [X̂k − x̂(tk)] − [Xk − x(tk)]. The first term ε²(tk) is an approximation of the squared bias E²[εk]. The second term is an approximation of the error variance, which can be obtained by evaluating the second-order moment of the total fluctuation of the error, ξk. The last term is the error due to the approximations; if γ is small, this term is negligible. As σw is small by assumption, γ must be small for tracking Xk without large fluctuations, so this last term is expected to be negligible. It was shown in the last subsection that the algorithm is asymptotically unbiased; thus the first term of the decomposition tends to zero as k tends to infinity. As a consequence, the asymptotic MSE, which we denote MSEq,∞, depends mainly on the asymptotic characterization of ξk.
Under the conditions that the estimator is asymptotically unbiased and that hε < 0, both of which were shown to hold in the previous subsection, it can be shown [Benveniste 1990, pp. 130–131] that ξk tends to a stationary Gaussian process with marginal distribution N(0, σξ²). The asymptotic variance σξ² is given as a sum of two terms, one produced by the fluctuations of the estimator itself and equal to γR/(−2hε), and the other due to the fluctuations of the parameter and equal to σw²/(−2γhε), thus giving

  σξ² = γR/(−2hε) + σw²/(−2γhε)

and leading to the asymptotic MSE

  MSEq,∞ = γR/(−2hε) + σw²/(−2γhε) + ◦(γ).  (3.29)

Neglecting the ◦(γ) term, we can find the approximately optimal gain by equating its derivative w.r.t. γ to zero. This gives the equation

  dMSEq,∞(γ)/dγ ≈ (1/(2hε)) (σw²/γ² − R) = 0,

which is zero for γ = σw/√R. The second derivative can be approximated by

  d²MSEq,∞(γ)/dγ² ≈ −σw²/(hε γ³);

as hε is negative and σw² is positive, the second derivative is positive for positive γ. This means that choosing γ = σw/√R leads to a minimum MSE. Thus,

  γ⋆ = σw/√R  (3.30)

and the corresponding asymptotic MSE is

  MSEq,∞ = σw√R/(−hε) + ◦(γ⋆).  (3.31)

We can express MSEq,∞ as a function of σ∞ given in (3.27). This gives

  MSEq,∞ = σw σ∞ + ◦(γ⋆).  (3.32)

Observe that both the asymptotic MSE for estimating a Wiener process and that for estimating a constant depend on the quantizer parameters (ηi, ∆ and τ′) through an increasing function of σ∞²; therefore, the asymptotically optimal quantizer parameters are the same in both cases. The only difference in the adaptive algorithm for these two cases is the sequence of gains γk.

Asymptotic MSE for estimating a Wiener process with drift

When the Wiener process has a drift, the MSE can still be written as the sum of three terms (3.28):

  MSEk = ε²(tk) + E[ξk²] + ◦(γ).

Even if γ ≫ u, we still expect γ to be small, so that the algorithm is able to reduce the effects of the measurement noise.
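The trade-off in (3.29) and the resulting optimum (3.30)–(3.32) can be checked by simulation. The sketch below uses assumed illustrative values: a binary quantizer with η = 1 and standard Gaussian noise, so that R = 1 and hε = −√(2/π); it compares the steady-state squared error at the gain γ = σw/√R with the prediction σw√R/(−hε):

```python
import numpy as np
from math import sqrt, pi

rng = np.random.default_rng(1)
h_eps, R, sigma_w = -sqrt(2 / pi), 1.0, 1e-2   # binary quantizer, N(0,1) noise, η = 1
gamma = sigma_w / sqrt(R)                      # optimal constant gain (3.30)

x, x_hat, K, burn = 0.0, 0.0, 100_000, 10_000
acc, n = 0.0, 0
for k in range(K):
    x += sigma_w * rng.standard_normal()       # Wiener parameter evolution
    y = x + rng.standard_normal()
    x_hat += gamma * np.sign(y - x_hat)        # adaptive update with η = 1
    if k >= burn:                              # discard the transient
        acc += (x_hat - x) ** 2
        n += 1

mse_sim = acc / n
mse_theory = sigma_w * sqrt(R) / -h_eps        # (3.31), ≈ 1.25e-2 here
print(mse_sim, mse_theory)
```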
Thus, we can still neglect the ◦(γ) term. We proceed similarly as for the Wiener process without drift: we evaluate the asymptotic MSE and then obtain the asymptotically optimal gain. Differently from the Wiener process without drift, the algorithm is not asymptotically unbiased, and we must use the asymptotic bias approximation (3.22), ε∞ = u/(γhε), in the first term of MSEq,∞. As explained in [Benveniste 1990, p. 133], under γ ≫ u the fluctuations of the parameter around its ODE approximation are negligible compared with the fluctuations of the algorithm. Therefore, we can approximate the asymptotic variance of the fluctuation by

  σξ² ≈ γR/(−2hε).  (3.33)

Using the bias from (3.22) and the variance from (3.33), we obtain

  MSEq,∞ ≈ u²/(γ²hε²) + γR/(−2hε) + ◦(γ).  (3.34)

To obtain the minimum w.r.t. γ, we must find γ satisfying

  dMSEq,∞(γ)/dγ ≈ −2u²/(γ³hε²) + R/(−2hε) = 0.

The solution of this equation in the variable γ is γ = [4u²/(−hεR)]^(1/3). To verify that this value of γ corresponds to a minimum of MSEq,∞, we evaluate the second derivative

  d²MSEq,∞(γ)/dγ² ≈ 6u²/(γ⁴hε²).

This quantity is positive. Therefore,

  γ⋆ = [4u²/(−hεR)]^(1/3)  (3.35)

and the corresponding asymptotic MSE is

  MSEq,∞ ≈ 3 [uR/(4hε²)]^(2/3) + ◦(γ⋆).  (3.36)

Note that, in practice, u may be unknown; in that case its value in γ⋆ must be replaced by an estimate û, which can also be obtained adaptively, for example by computing a recursive mean of the increments X̂k − X̂k−1.

The asymptotic MSE in (3.36) can also be rewritten as a function of σ∞², with a dependence on u:

  MSEq,∞ ≈ 3 (u σ∞²/4)^(2/3) + ◦(γ⋆).  (3.37)

Also in this case, the asymptotic MSE is an increasing function of σ∞².

Remark: in the previous subsection, we noted that if uk is a small-amplitude, slowly varying sequence, the bias could be approximated by εk ≈ uk/(γhε) for large k.
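The gain selection (3.35)–(3.36) is a simple bias–variance trade-off and can be verified by a direct grid minimization of (3.34); the values of hε, R and u below are hypothetical:

```python
import numpy as np

h_eps, R, u = -0.8, 0.64, 1e-6                       # hypothetical values (hε < 0, R > 0)

def mse(gamma):
    # squared equilibrium bias (3.22) plus fluctuation variance (3.33)
    return u**2 / (gamma**2 * h_eps**2) + gamma * R / (-2 * h_eps)

gammas = np.logspace(-6, -1, 500_000)
g_star = gammas[np.argmin(mse(gammas))]

g_theory = (4 * u**2 / (-h_eps * R)) ** (1 / 3)      # (3.35)
mse_theory = 3 * (u * R / (4 * h_eps**2)) ** (2 / 3) # (3.36), without the ◦(γ*) term
print(g_star, g_theory)          # grid minimizer matches the closed form
print(mse(g_star), mse_theory)
```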
Thus, following the same development and considering that the gains γk can be slowly varying, we have for large k

  γk⋆ = [4uk²/(−hεR)]^(1/3)

and the corresponding asymptotic MSE

  MSEk ≈ 3 [uk R/(4hε²)]^(2/3) + ◦(γk⋆).

3.4 Optimal algorithm parameters and performance

We now focus on the asymptotically optimal design of the quantizer parameters. From the previous results, we can see that the asymptotic performance for the three cases depends on an increasing function of σ∞². Also, for the three cases, the asymptotic performance depends on the quantizer parameters (ηi, ∆ and τ′) only through σ∞². Therefore, the optimal parameters are the same for the three cases.

In the next subsections, we first minimize σ∞² w.r.t. the quantizer update coefficients ηi; then we discuss the choice of the input gain 1/∆. After that, we present the optimal algorithm in general form and its corresponding σ∞². We then discuss how to optimize the performance w.r.t. the set of threshold variations τ′. Finally, we present the optimal gain and performance for each of the three parameter models, considering the optimal update coefficients. In each case, we also evaluate the performance loss due to quantization.

3.4.1 Optimal algorithm parameters

Update coefficients (output levels)

Using the expressions for hε (3.21) and R (3.25) in the expression for σ∞² (3.27), the optimization of the algorithm performance w.r.t. the update coefficients can be written as the following minimization problem:

  argmin_η R/hε² = argmin_η [η⊤Fd η] / [2 (η⊤fd)²],  (3.38)

where η is the vector of coefficients

  η = [η1 · · · η_{NI/2}]⊤,  (3.39)

Fd is a diagonal matrix given by

  Fd = diag[F̃d(1, 0), · · · , F̃d(NI/2, 0)],  (3.40)

with diag[·] the function that places the input sequence on the diagonal of an otherwise zero matrix.
fd is the following vector:

  fd = [f̃d(1, 0) · · · f̃d(NI/2, 0)]⊤.  (3.41)

The minimization problem is equivalent to the following maximization problem:

  argmax_η (η⊤fd)² / (η⊤Fd η).  (3.42)

Using the fact that Fd is a positive definite matrix (it is a diagonal matrix with positive diagonal elements), we can rewrite (3.42) as

  argmax_η [(Fd^(−1/2) fd)⊤ (Fd^(1/2) η)]² / [(Fd^(1/2) η)⊤ (Fd^(1/2) η)],

where the matrices Fd^(1/2) and Fd^(−1/2) are obtained by taking the square root and the inverse of the square root of the diagonal elements of Fd. Using the Cauchy–Schwarz inequality on the numerator gives

  [(Fd^(−1/2) fd)⊤ (Fd^(1/2) η)]² / [(Fd^(1/2) η)⊤ (Fd^(1/2) η)] ≤ fd⊤ Fd^(−1) fd,

with equality for

  Fd^(1/2) η ∝ Fd^(−1/2) fd.

Under the assumption that the update coefficients are positive for positive i, AQ3 (p. 112), the optimal η can be chosen to be

  η⋆ = −Fd^(−1) fd.  (3.43)

The minimum σ∞² w.r.t. η is

  σ∞² = 1 / (2 fd⊤ Fd^(−1) fd) = 1 / [2 Σ_{i=1}^{NI/2} f̃d²(i, 0)/F̃d(i, 0)].  (3.44)

We can recognize that the sum above is exactly the FI given in (1.13) when the central threshold is placed exactly at the parameter x, Iq(0):

  Iq(0) = 2 Σ_{i=1}^{NI/2} f̃d²(i, 0)/F̃d(i, 0).  (3.45)

Choice of the input gain

To simplify the choice of the constant ∆, we can consider that the noise CDF is parametrized by a known scale parameter δ, which means that

  F(x) = Fn(x/δ),

where Fn is the CDF for δ = 1. In this case, the key quantity appearing in the evaluation of the quantizer output levels is ∆/δ. Thus, the evaluation of the output levels can be simplified by setting

  ∆ = c∆ δ,  (3.46)

where c∆ is a constant used to adjust the input gain when the quantizer threshold variation range is fixed, or to adjust the quantization step-length when the threshold variations are uniform and fixed to a value that cannot be changed.
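The closed-form solution (3.43)–(3.45) is easy to verify numerically. The sketch below is an illustration under assumed settings: standard Gaussian noise (δ = 1), NI = 4, and threshold variations (0, 1, ∞) as in the uniform case (3.50) with c∆ = 1:

```python
import numpy as np
from math import erf, exp, sqrt, pi

def fn(x):  # standard Gaussian PDF
    return exp(-x * x / 2) / sqrt(2 * pi)

def Fn(x):  # standard Gaussian CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

tau = [0.0, 1.0, float("inf")]  # threshold variations for NI = 4, c∆ = 1
fd = np.array([fn(tau[i + 1]) - fn(tau[i]) for i in range(2)])  # f̃d(i,0); fn(∞) = 0
Fd = np.diag([Fn(tau[i + 1]) - Fn(tau[i]) for i in range(2)])   # diag of F̃d(i,0)

eta_star = -np.linalg.solve(Fd, fd)     # (3.43): η* = −Fd⁻¹ fd, positive entries
Iq0 = 2 * fd @ np.linalg.solve(Fd, fd)  # (3.45): Iq(0) = 2 fdᵀ Fd⁻¹ fd
print(eta_star)                         # levels grow with the magnitude index i
print(Iq0, 1 / Iq0)                     # Iq(0) ≈ 0.882 < Ic = 1; σ∞² = 1/Iq(0) (3.44)
```

Iq(0) here exceeds the one-bit value 2/π ≈ 0.637, consistent with the loss decreasing as NI grows.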
For given δ, c∆ and Fn , the coefficients do not depend on the true parameter value, neither on the estimator value, so that they can be pre-calculated and stored in a table. In scalar form the coefficients are f˜d (i, 0) . (3.47) ηi⋆ = − F̃d (i, 0) Note that for ∆ given by (3.46), ηi depends on δ only through a 1δ multiplicative factor, the other factor can be written as a function of the normalized PDF and CDF, thus it can be pre-calculated based only on the normalized distribution. An interesting observation is that ηi⋆ is given by the score function for estimating a constant location parameter when considering that the offset is fixed and placed exactly at x, therefore this algorithm is equivalent to a gradient ascent technique to maximize the log-likelihood that iterates only one time per observation and sets the offset each time at the last estimate. Optimal algorithm and general performance for the three cases Using the ηi⋆ from (3.47) and the assumption on the symmetry of the output levels AQ3, the adaptive estimator is X̂k = X̂k−1 + γk sign (ik ) η|i⋆ k | , (3.48) Y −X̂ with ik = Q k c∆ δk−1 . The asymptotic (γ, ηi )-optimized adaptive algorithm performance is approximated for all the three cases (for the constant case it is exact) by MSEq,∞ ≈ ψ [Iq (0)] , (3.49) where ψ is a decreasing function of Iq (0): • constant: M SEk ≈ 1 kIq (0) . • Wiener process: MSEq,∞ ≈ √σw . Iq (0) • Wiener process with drift: MSEq,∞ ≈ 3 u 4Iq (0) 2 3 . Optimal threshold variations In the performance given in (3.49), the threshold variations set τ ′ is influent only through Iq (0). Therefore, for optimizing the algorithm through τ ′ , we will have the same optimization problem discussed in Ch. 1, namely (1.47) Iq⋆ = argmax Iq (0) . τ′ 128 Chapter 3. Adaptive quantizers for estimation In Ch. 1, we saw that this problem is difficult in general. 
Two alternatives were proposed: the first is to constrain the quantizer to be uniform and then obtain the optimal quantizer interval step-length; the second is to consider a general quantizer with a very large (tending to infinity) number of quantizer intervals. For the simulation results to be presented later, in Sec. 3.5, we will use the first approach. We consider that the threshold variations are uniform and fixed to

  τ′ = [−τ′_{NI/2} = −∞ · · · −τ′1 = −1  0  +τ′1 = +1 · · · +τ′_{NI/2} = +∞]⊤.  (3.50)

In this case, only c∆ needs to be optimized and, as stated before, this can be done using a grid method.

3.4.2 Algorithm performance for optimal gain and coefficients

We now present, for each parameter model, the optimal adaptive gain γk⋆ and the asymptotic MSE for the update coefficients η⋆. In each case, after evaluating the asymptotic MSE, we will also evaluate the effect of quantization on the estimation performance. This will be done by evaluating the performance loss due to quantization, Lq, defined by

  Lq = 10 log10(MSEq,∞ / M̃SEc,∞),  (3.51)

where MSEq,∞ is the asymptotic MSE for the adaptive algorithm based on quantized measurements and M̃SEc,∞ is a quantity related to the asymptotic performance of estimation based on continuous measurements; M̃SEc,∞ will be specified later for each case. Observe that the loss Lq is a relative measure, expressed in decibels (dB).

Before proceeding to the performance evaluation for each case, we still need to determine the quantities hε and R for the optimal update coefficients. Using the expression for ηi⋆ (3.47) in the expressions for hε (3.21) and R (3.25), we have

  hε = −2 Σ_{i=1}^{NI/2} f̃d²(i, 0)/F̃d(i, 0) = −Iq(0),  (3.52)
  R = 2 Σ_{i=1}^{NI/2} f̃d²(i, 0)/F̃d(i, 0) = Iq(0).  (3.53)

3.4.2.1 Constant case: gain and performance

Replacing hε given by (3.52) in (3.26) and then the result in (3.10), we have the following gains:
  γk⋆ = 1/[k Iq(0)].  (3.54)

Also, replacing (3.52) and (3.53) in the expression for σ∞² (3.27), we get

  σ∞² = 1/Iq(0).  (3.55)

In practice, this means that, for large k, the MSE will be

  MSEk ≈ 1/[k Iq(0)].  (3.56)

The continuous asymptotic performance M̃SEc,∞ can be obtained through the CRB. As the measurements are independent, the FI for k continuous measurements is k times the FI for one continuous measurement, Ic; thus the continuous measurement bound CRBc is

  CRBc = 1/(k Ic).  (3.57)

The expression for Ic can be obtained by evaluating the expectation E[Sc²], where the score is given by (1.15):

  Sc(y) = ∂ log f(y − x)/∂x.

Changing variables, Ic is given by the following integral:

  Ic = ∫ [f⁽¹⁾(x)/f(x)]² f(x) dx.  (3.58)

The ratio MSEq,∞/M̃SEc,∞ is then given by

  lim_{k→∞} MSEk/CRBc = Ic/Iq(0),

leading to the loss

  Lq = −10 log10[Iq(0)/Ic].  (3.59)

We have the following solution to problem (a) (p. 27):

Solution to (a) – Adaptive algorithm with decreasing gain (a3)

1) Estimator. For each time k, the estimate and threshold update is given by (3.48):

  X̂k = τ0,k = X̂k−1 + γk sign(ik) η⋆_|ik|,

with ik = Q[(Yk − X̂k−1)/(c∆ δ)], γk = 1/[k Iq(0)] and ηi⋆ = −f̃d(i, 0)/F̃d(i, 0).

2) Performance (asymptotic). X̂k is asymptotically unbiased and its bias for large k can be approximated by ε(tk), which is the solution of the ODE (3.11)

  dε/dt = h̃(ε),

where h̃(ε) = h(ε + x), h is given by (3.7) and the time is tk = Σ_{j=1}^{k} γj. Its asymptotic MSE or variance is given by (3.56),

  MSEk ∼ 1/[k Iq(0)]  as k → ∞,

where Iq(0) is given by (1.13) with ε = 0, representing a loss of performance w.r.t. the asymptotically optimal estimator based on continuous measurements of (3.59),

  Lq = −10 log10[Iq(0)/Ic],

with Ic the continuous measurement FI given by (3.58).
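The 1/[k Iq(0)] behavior in the box above can be checked with a short Monte Carlo experiment. The sketch assumes a one-bit quantizer and standard Gaussian noise, for which η1⋆ = 2fn(0) = √(2/π) and Iq(0) = 4fn(0)² = 2/π, and a hypothetical constant parameter x = 0.3:

```python
import numpy as np
from math import sqrt, pi

rng = np.random.default_rng(2)
eta, Iq0 = sqrt(2 / pi), 2 / pi      # one-bit optimal level and FI for N(0,1) noise
x, K, n_mc = 0.3, 2_000, 400

x_hat = np.zeros(n_mc)               # n_mc independent runs, all with X̂_0 = 0
for k in range(1, K + 1):
    y = x + rng.standard_normal(n_mc)
    x_hat += (eta / (k * Iq0)) * np.sign(y - x_hat)   # (3.48) with γk = 1/(k·Iq(0))

mse = np.mean((x_hat - x) ** 2)
print(K * mse, 1 / Iq0)              # k·MSE_k ≈ 1/Iq(0) = π/2, as in (3.56)
```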
3.4.2.2 Wiener process case: gain and performance

Using (3.53) in (3.30), we obtain the optimal constant gain

  γ⋆ = σw/√Iq(0)  (3.60)

and, for this gain, the asymptotic MSE is given by substituting (3.55) in (3.32):

  MSEq,∞ = σw/√Iq(0) + ◦(σw).  (3.61)

Note that we used the fact that γ⋆ in this case depends linearly on σw for writing the ◦ term.

The comparison with the continuous case can be done by using the asymptotic BCRB for continuous measurements as M̃SEc,∞. The evaluation of the asymptotic BCRB follows the same lines as in Ch. 2 for estimation based on quantized measurements. The main difference is that, in the continuous case, the FI Ic is independent of the parameter value; thus E[Ic] = Ic and we do not need to consider a lower bound on the BCRB. For small σw (small compared with Ic), the asymptotic BCRB can be approximated exactly in the same way as the lower bound on the MSE for quantized measurements:

  BCRBc,∞ = σw/√Ic + ◦(σw).  (3.62)

The loss of performance, denoted in this case by Lq^W, is given as follows:

  Lq^W = 10 log10(MSEq,∞/BCRBc,∞) = 10 log10{[σw/√Iq(0) + ◦(σw)] / [σw/√Ic + ◦(σw)]}.  (3.63)

We multiply the numerator and the denominator inside the logarithm of (3.63) by √Ic/σw. This gives

  Lq^W = 10 log10{[√(Ic/Iq(0)) + ◦(σw)/σw] / [1 + ◦(σw)/σw]},

where we have absorbed the √Ic into the ◦(σw) term. Using the first-order Taylor expansion around x = 0, 1/(1+x) = 1 − x + ◦(x), we can obtain

  Lq^W = 10 log10[√(Ic/Iq(0)) + ◦(σw)/σw].

Then, factorizing √(Ic/Iq(0)) and using the first-order Taylor expansion around x = 0, log10(1+x) = x/ln(10) + ◦(x), where ln is the natural logarithm, we have

  Lq^W = 10 log10[√(Ic/Iq(0))] + ◦(σw)/σw = −5 log10[Iq(0)/Ic] + ◦(σw)/σw.

Note that the first term is half the loss of performance for the constant case:

  Lq^W = (1/2) Lq + ◦(σw)/σw.

From the definition of the ◦ term, we also have

  lim_{σw→0} Lq^W = (1/2) Lq.  (3.64)
This gives the following solution to problem (b) (p. 29) when the parameter is modeled by a Wiener process without drift:

Solution to (b) – Adaptive algorithm with constant gain for tracking a Wiener process with small σw (b3.1)

1) Estimator. For each time k, the estimate and threshold update is given by (3.48):

  X̂k = τ0,k = X̂k−1 + γ sign(ik) η⋆_|ik|,

with ik = Q[(Yk − X̂k−1)/(c∆ δ)], γ = σw/√Iq(0) and ηi⋆ = −f̃d(i, 0)/F̃d(i, 0).

2) Performance (approximated and asymptotic). X̂k is asymptotically unbiased and its bias can be approximated by ε(tk), which is the solution of the ODE (3.11)

  dε/dt = h̃(ε),

where h̃(ε) = h(ε + x), h is given by (3.7) and the time is tk = kγ. Its asymptotic MSE or variance is given by (3.61),

  MSEq,∞ = σw/√Iq(0) + ◦(σw),

where Iq(0) is given by (1.13) with ε = 0, representing a loss of performance w.r.t. the asymptotically optimal estimator based on continuous measurements of (3.64),

  Lq^W = −5 log10[Iq(0)/Ic] + ◦(σw)/σw = (1/2) Lq + ◦(σw)/σw,

with Ic the continuous measurement FI given by (3.58) and Lq the loss of the adaptive algorithm for estimating a constant.

3.4.2.3 Wiener process with drift case: gain and performance

Replacing the expressions for hε (3.52) and R (3.53) in the expressions for γ⋆ (3.35) and MSEq,∞ (3.36), we obtain

  γ⋆ = [4u²/Iq²(0)]^(1/3),  (3.65)
  MSEq,∞ ≈ 3 [u/(4Iq(0))]^(2/3) + ◦(γ⋆).  (3.66)

If u is unknown, it can be estimated by smoothing the differences between successive estimates:

  Ûk = Ûk−1 + γk^u [(X̂k − X̂k−1) − Ûk−1],  (3.67)

where γk^u is a sequence of small positive gains. The estimator Ûk can then replace u in the evaluation of the gain and of the asymptotic MSE. If the drift is not constant but slowly varying, the adaptive algorithm above can also be used.
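A minimal sketch of this joint scheme, under assumed illustrative settings (one-bit quantizer, standard Gaussian noise, a fixed tracking gain γ ≫ u, and a small constant gain γu in (3.67)):

```python
import numpy as np
from math import sqrt, pi

rng = np.random.default_rng(3)
eta = sqrt(2 / pi)                        # one-bit output level for N(0,1) noise
u, sigma_w = 5e-4, 1e-3                   # hypothetical drift and diffusion
gamma, gamma_u, K = 1e-2, 1e-4, 100_000   # tracking gain (γ ≫ u), drift gain γu

x, x_hat, u_hat = 0.0, 0.0, 0.0
for k in range(K):
    x += u + sigma_w * rng.standard_normal()     # Wiener process with drift
    y = x + rng.standard_normal()
    step = gamma * eta * np.sign(y - x_hat)      # increment X̂k − X̂k−1
    u_hat += gamma_u * (step - u_hat)            # recursive mean (3.67)
    x_hat += step

print(u_hat, u)   # û settles near the true drift u
```

In a full implementation, û would in turn set the gain through (3.65), with u replaced by û.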
In this case, additional information on the evolution of the drift can be incorporated in (3.67) to obtain more precise estimates and an adaptive gain closer to the optimal one.

For the evaluation of the loss due to quantization, we could use BCRBc,∞ as the continuous measurement performance. However, this would result in an unfair comparison, as the use of X̂k−1 instead of the prediction is known to be suboptimal. Therefore, the loss will be evaluated using the approximate performance of an adaptive algorithm of the same form, but based on continuous instead of quantized measurements. The algorithm has the following form:

  X̂k = X̂k−1 + γk^c ηc(Yk − X̂k−1),

where γk^c and the nonlinearity ηc(x) are optimized to minimize the asymptotic MSE. Using the same theory described for the quantized case, it is possible to show that the optimal γk^c and ηc(x) are

  γk^c,⋆ = [4u²/Ic²]^(1/3),  ηc(x) = −f′(x)/f(x),

which exist under the constraints that Ic converges and is not zero and that f′(x) exists for every x. The MSE can be approximated in a similar way as before:

  MSEc,∞ ≈ 3 [u/(4Ic)]^(2/3).  (3.68)

This asymptotic MSE can be used as M̃SEc,∞ in the evaluation of the loss. Using Taylor expansions similar to those of the previous Wiener model, and denoting the loss in this case by Lq^WD, we have

  Lq^WD ≈ −(20/3) log10[Iq(0)/Ic] + ◦(u^(2/3))/u^(2/3) = (2/3) Lq + ◦(u^(2/3))/u^(2/3).  (3.69)

Note that here the limit result is on u:

  lim_{u→0} Lq^WD = (2/3) Lq.

However, note also that, hidden in the approximation, is the fact that σw must also tend to zero.

We have the following solution to problem (b) (p. 29) when the parameter is modeled by a Wiener process with deterministic drift:

Solution to (b) – Adaptive algorithm with constant gain for tracking a Wiener process with small σw and small u
(b3.2)

1) Estimator. For each time k, the estimate and threshold update is given by (3.48):

  X̂k = τ0,k = X̂k−1 + γ sign(ik) η⋆_|ik|,

with ik = Q[(Yk − X̂k−1)/(c∆ δ)], γ = [4u²/Iq²(0)]^(1/3) and ηi⋆ = −f̃d(i, 0)/F̃d(i, 0).

2) Performance (approximated and approximated asymptotic). The estimation bias can be approximated by ε(tk), which is the solution of the ODE (3.16)

  dε/dt = h̃(ε) − u/γ,

where h̃(ε) = h(ε + x), h is given by (3.7), x is the mean of the Wiener process and the time is tk = kγ. Its asymptotic MSE or variance is approximated as follows (3.66):

  MSEq,∞ ≈ 3 [u/(4Iq(0))]^(2/3) + ◦(u^(2/3)),

where Iq(0) is given by (1.13) with ε = 0, representing a loss of performance w.r.t. the asymptotically optimal adaptive estimator based on continuous measurements of (3.69),

  Lq^WD ≈ −(20/3) log10[Iq(0)/Ic] + ◦(u^(2/3))/u^(2/3) = (2/3) Lq + ◦(u^(2/3))/u^(2/3),

with Ic the continuous measurement FI given by (3.58) and Lq the loss of the adaptive algorithm for estimating a constant.

Observe that the losses for the three models of Xk depend directly on Lq; thus Lq allows one to approximate how much performance is lost for a specific type of noise and set of thresholds when comparing with the equivalent algorithm based on continuous measurements.

3.5 Simulations

We now check the validity of the results through simulation. We will mainly focus on obtaining a simulated version of the loss of performance for the three parameter models, and we will then compare the simulated loss with the theoretical one. After that, we will compare the adaptive algorithm performance with the algorithms presented in the previous chapters, namely the adaptive MLE scheme for estimating a constant and the PF with dynamical central threshold for estimating a Wiener process.
This comparison will tell us whether, and how much, estimation performance is lost when we use the low-complexity adaptive algorithm presented in this chapter instead of the algorithms presented in the previous chapters.

3.5.1 General considerations

Threshold variations. In what follows, the threshold variations are considered to be uniform and given by (3.50):

  τ′ = [−τ′_{NI/2} = −∞ · · · −τ′1 = −1  0  +τ′1 = +1 · · · +τ′_{NI/2} = +∞]⊤.

Evaluation of Iq(0) and the algorithm parameters. For a given type of noise, supposing that its scale parameter δ is known, and for fixed NI, Iq(0) can be evaluated by using the normalized CDF and PDF, Fn and fn (the CDF and PDF for δ = 1), in (3.45) (or (1.46)). Using the parametrization ∆ = c∆ δ and the fact that f(x) = (1/δ) fn(x/δ), we have

  Iq(0) = (2/δ²) Σ_{i=1}^{NI/2} {fn[(i−1)c∆] − fn[i c∆]}² / {Fn[i c∆] − Fn[(i−1)c∆]},  (3.70)

with the convention that, the outermost threshold variation being infinite, the last term uses 0 and 1 in place of fn[(NI/2)c∆] and Fn[(NI/2)c∆]. As Iq(0) is now a function of c∆ only, it can be maximized by adjusting this parameter. Being a scalar maximization problem, this can be done by grid optimization (searching for the maximum over a fine grid of possible c∆). After finding the optimal c⋆∆, the coefficients ηi = ηi⋆ can be evaluated using the normalized CDF and PDF in (3.47). This gives

  ηi⋆ = (1/δ) {fn[(i−1)c⋆∆] − fn[i c⋆∆]} / {Fn[i c⋆∆] − Fn[(i−1)c⋆∆]}.  (3.71)

Then, with δ, the optimal Iq(0) and, depending on the model, σw or u, we can evaluate 1/∆ and γk, and all the algorithm parameters are defined.

Discussion on the signal model. Note that the model for Xk is supposed known, as the setting of γk depends on it. As a consequence of this assumption, in a real application the choice between the three models must be clear. When this choice is not clear from the application, it is always simpler to choose Xk to be a Wiener process: first, because the complexity of the algorithm is lower, and
second, because supposing that the increments are Gaussian and i.i.d. does not impose too much information on the evolution of Xk. Still, σw must be known; in practice, it can be set based on prior knowledge of the possible variation of Xk or, at the price of a slower convergence and a small loss of asymptotic performance, it can be estimated jointly with Xk using an extra adaptive estimator.

In the last case, when it is known that the increments of Xk have a deterministic component, the fact that γk depends on u is not very useful, and prior information on the variations of Xk is normally not as detailed as knowing u itself, making it necessary to accept a small loss of performance and estimate u jointly. The estimation of u can be done using (3.67), where prior knowledge on the variations of uk can be integrated in the gain γk^u. If precise knowledge on the evolution of uk is available through dynamical models, it may be more useful to use other forms of adaptive estimators, known as multi-step algorithms [Benveniste 1990, Ch. 4].

Discussion on the noise model. The evaluation of the loss and the verification of the results will be done considering two different classes of noise that satisfy assumptions AN1, AN2 and AN3, namely generalized Gaussian (GGD) noise and noise distributed according to the Student's t distribution (STD). The motivation for using these two distributions comes from signal processing, statistics and information theory. In signal processing, when additive noise is not constrained to be Gaussian, a common assumption is that the noise follows a GGD [Varanasi 1989]. This distribution not only contains the Gaussian case as a specific example but, by changing one of its parameters, can also model the impulsive Laplacian distribution as well as distributions close to uniform.
In robust statistics, when the additive noise is considered to be impulsive, a general class for the distribution of the noise is the STD [Lange 1989]. The STD includes as a specific case the Cauchy distribution, known to be heavy-tailed and used intensively in robust statistics. Also, by changing a parameter of the distribution, an entire class of heavy-tailed distributions can be represented.

From an information point of view, if no prior is used for the noise, noise models must be as random as possible to ensure that the noise is an uninformative part of the measurement. Thus, noise models must maximize some criterion of randomness. Commonly used criteria for randomness are entropy measures, and both distributions considered above are entropy maximizers: the GGD maximizes the Shannon entropy under constraints on the moments [Cover 2006, Ch. 12] and the STD maximizes the Rényi entropy under a constraint on the second-order moment [Costa 2003].

Both families of distributions are parametrized by a shape parameter β ∈ R+ and a scale parameter δ. The CDF and PDF of the GGD were given in Ch. 1 by (1.39) and (1.40):

  fGGD(x) = β/[2δ Γ(1/β)] exp(−|x/δ|^β),
  FGGD(x) = (1/2) {1 + sign(x) γ(1/β, |x/δ|^β)/Γ(1/β)},

with γ(·, ·) the lower incomplete gamma function, while for the STD the PDF and CDF are respectively

  fSTD(x) = Γ[(β+1)/2] / [δ √(βπ) Γ(β/2)] · [1 + (1/β)(x/δ)²]^(−(β+1)/2),  (3.72)
  FSTD(x) = (1/2) {1 + sign(x) [1 − I_{β/[(x/δ)²+β]}(β/2, 1/2)]},  (3.73)

where I·(·, ·) is the incomplete beta function

  Iw(x, y) = ∫₀^w z^(x−1) (1−z)^(y−1) dz.

3.5.2 Theoretical performance loss due to quantization

The main quantity to be evaluated before simulating the algorithm is the theoretical loss Lq. This quantity will not only be useful for checking the simulation results, but also for observing how the performance evolves as we change the number of quantization intervals and the noise model. To evaluate Lq, after evaluating Iq(0) based on the CDF and PDF given above, we also need to evaluate Ic.
The continuous measurement FI for the GGD can be obtained by using (1.39) in the integral expression (3.58), which gives (Why? - App. A.1.7)

  Ic,GGD = (1/δ²) β(β − 1) Γ(1 − 1/β) / Γ(1/β).   (3.74)

For the STD, the continuous measurement FI is obtained by using (3.72), also in (3.58). Integrating, we obtain (Why? - App. A.1.8)

  Ic,STD = (1/δ²) (β + 1)/(β + 3).   (3.75)

We evaluated the theoretical loss for NI ∈ {2, 4, 8, 16, 32}, which corresponds to NB = log2(NI) ∈ {1, 2, 3, 4, 5} bits, for shape parameters β ∈ {1.5, 2, 2.5, 3} for GGD noise and β ∈ {1, 2, 3} for STD noise. The results are shown in Fig. 3.4. As intuitively expected, the loss decreases with increasing NB. It is interesting to note that the maximum loss, observed for NB = 1, goes from approximately 1 dB to 4 dB, which represents factors of less than 3 in MSE increase for estimating a constant with 1 bit quantization. Also interesting is the fact that the loss decreases rapidly with NB: for 2 bit quantization, all the tested types of noise produce losses below 1 dB, resulting in increases in MSE by factors not larger than 1.3. This indicates that when using the adaptive estimators developed here, it is not very useful to use more than 4 or 5 bits for quantization.

[Figure 3.4: loss (dB) vs. number of bits NB; (a) GGD noise with β ∈ {1.5, 2 (Gaussian), 2.5, 3}, (b) STD noise with β ∈ {1 (Cauchy), 2, 3}.]

Figure 3.4: Adaptive algorithm loss of estimation performance due to quantization of measurements corresponding to the constant case Lq (theoretical). The loss is evaluated for different types of noise, GGD noise in (a) and STD noise in (b), and different numbers of quantization bits. For the other models of parameter studied here, the loss is proportional to Lq.

The performance for one bit seems to be related to the noise tail. Note that smaller losses were obtained for distributions with heavier tails (STD in general and GGD with β = 1.5). This is due to the fact that for heavy-tailed distributions a small region around the median of the distribution is very informative; thus (as most of the information is contained there), when the only threshold available is placed close to the median, the relative gain of information is greater than in the other cases, leading to smaller losses. This can also be the reason for the slow decrease of the loss for these distributions: as the quantizer thresholds are placed uniformly, some of them will be placed in the non-informative amplitude region and consequently the decrease in loss will not be as sharp as in the other cases.

The loss is not shown in Fig. 3.4 for the Laplacian distribution, because for this distribution the adaptive optimal estimator in the continuous case is already an adaptive estimator with a binary quantizer. One can see this by evaluating the coefficients ηi, which in this case are constant for positive i, showing that only the sign of the difference between the measurement and the last estimate is important. This optimality of binary quantization was already observed in Ch. 1, where we showed that the CRB for binary quantized measurements can be equal to the CRB for continuous measurements in the Laplacian case. Consequently, the loss in this case is zero dB for all NB.

3.5.3 Simulated loss

To validate the results, we will simulate the loss of performance. The simulation results will be presented in the same order as the theoretical results of the previous sections: first the constant case, then the Wiener process case and finally the Wiener process with drift. All the simulations are done for NB ∈ {2, 3, 4, 5}.
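Before the simulations, the closed forms (3.74) and (3.75) can themselves be cross-checked (this is an illustration, not part of the thesis) by numerically evaluating the Fisher information Ic = E[(∂ log f/∂x)²]; the score functions below follow from differentiating the PDFs given earlier:

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import gennorm, t
from scipy.integrate import quad

def fisher_numeric(pdf, score):
    # I_c = E[score(X)^2], with score(x) = d/dx log f(x)
    val, _ = quad(lambda x: score(x) ** 2 * pdf(x), -np.inf, np.inf)
    return val

delta = 1.5

# GGD: log f = const - |x/delta|^beta, so score(x) = -(beta/delta) sign(x) |x/delta|^(beta-1)
for beta in (1.5, 2.0, 2.5, 3.0):
    closed = beta * (beta - 1.0) * gamma(1.0 - 1.0 / beta) / (delta**2 * gamma(1.0 / beta))  # (3.74)
    num = fisher_numeric(lambda x: gennorm.pdf(x, beta, scale=delta),
                         lambda x: -(beta / delta) * np.sign(x) * np.abs(x / delta) ** (beta - 1.0))
    assert abs(num - closed) < 1e-6 * closed

# STD: log f = const - ((beta+1)/2) log(1 + x^2/(beta delta^2)),
# so score(x) = -(beta+1) x / (beta delta^2 + x^2)
for beta in (1.0, 2.0, 3.0):
    closed = (beta + 1.0) / ((beta + 3.0) * delta**2)  # (3.75)
    num = fisher_numeric(lambda x: t.pdf(x, df=beta, scale=delta),
                         lambda x: -(beta + 1.0) * x / (beta * delta**2 + x**2))
    assert abs(num - closed) < 1e-6 * closed
```

The Gaussian sanity check is immediate: for β = 2, (3.74) reduces to 2/δ², and for the Cauchy case (β = 1), (3.75) gives 1/(2δ²).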
Simulated loss: constant case

In the constant case, the 7 types of noise with previously evaluated Lq were tested; the value of X0 = x was set to zero and the initial condition of the adaptive algorithm was set with a small error (X̂0 ∈ {0, 10}). The number of samples was set to 5000 to ensure convergence. The algorithm was simulated 2.5 × 10⁶ times and the error results were averaged, yielding a simulated MSE, from which a simulated loss was calculated. GGD noise was simulated using a transformation of gamma variates (How? - App. A.3.2), while STD noise was simulated using a transformation of independent uniform variates similar to the transformation used for generating Gaussian variates (How? - App. A.3.5). The results are shown in Fig. 3.5 for GGD noise and in Fig. 3.6 for STD noise.

[Figure 3.5: loss (dB) vs. time k (log scale) for GGD noise with β ∈ {1.5, 2 (Gaussian), 2.5, 3}; (a) NB = 2 and 3, (b) NB = 4 and 5.]

Figure 3.5: Quantization loss of performance for GGD noise and NB ∈ {2, 3, 4, 5} when Xk is constant. For each type of noise there are 4 curves: the constant losses are the theoretical results and the decreasing losses are the simulated results, thus producing pairs of curves of the same type; for each pair, the higher results represent lower numbers of quantization bits. In (a), results for NB = 2 and 3 are shown. In (b), results for NB = 4 and 5 are shown. The simulated results were obtained through Monte Carlo simulation using 2.5 × 10⁶ realizations of blocks of 5000 error samples; the true parameter value in all simulations was set to zero, while X̂ was set to have a small initial error (X̂0 ∈ {0, 10}). We used δ = 1 in all simulations.
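To make this simulation setup concrete, here is a minimal Monte Carlo sketch of the binary special case (NB = 1, which is not in the figures) with Gaussian noise. The decreasing gain 1/(2 fn(0) k) and the bound 1/(4 fn(0)² k) used below are the standard expressions for the recursive sign algorithm, taken here as an illustration rather than the chapter's general NI-level coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_steps, delta = 2000, 5000, 1.0
f0 = 1.0 / np.sqrt(2.0 * np.pi * delta**2)      # Gaussian noise PDF at its median

x_true = 0.0
x_hat = np.full(n_runs, 1.0)                    # small initial error, as in the text

for k in range(1, n_steps + 1):
    y = x_true + rng.normal(0.0, delta, size=n_runs)
    # Only the sign of Y_k - X_hat_{k-1} is kept (1 bit); the decreasing
    # gain 1/(2 f0 k) is the asymptotically optimal choice for this case.
    x_hat += np.sign(y - x_hat) / (2.0 * f0 * k)

mse = np.mean((x_hat - x_true) ** 2)
crb_q = 1.0 / (4.0 * f0**2 * n_steps)           # binary-quantization CRB at k = n_steps
print(f"k*MSE = {n_steps * mse:.3f}, k*CRB = {n_steps * crb_q:.3f} (= pi/2)")
```

With δ = 1, k·CRB = π/2 ≈ 1.57, i.e. the familiar ~2 dB (factor π/2) loss of 1-bit quantization under Gaussian noise, and the simulated k·MSE converges towards it.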
[Figure 3.6: loss (dB) vs. time k (log scale) for STD noise with β ∈ {1 (Cauchy), 2, 3}; (a) NB = 2 and 3, (b) NB = 4 and 5.]

Figure 3.6: Quantization loss of performance for STD noise and NB ∈ {2, 3, 4, 5} when Xk is constant. For each type of noise there are 4 curves: the constant losses are the theoretical results and the decreasing losses are the simulated results, thus producing pairs of curves of the same type; for each pair, the higher results represent lower numbers of quantization bits. In (a), results for NB = 2 and 3 are shown. In (b), results for NB = 4 and 5 are shown. The simulated results were obtained through Monte Carlo simulation using 2.5 × 10⁶ realizations of blocks of 5000 error samples; the true parameter value in all simulations was set to zero, while X̂ was set to have a small initial error (X̂0 ∈ {0, 10}). We used δ = 1 in all simulations.

Remarks:

• Note that the losses are independent of δ, as both Iq(0) and Ic depend on it through the same multiplicative constant 1/δ².

• The simulated results seem to converge to the theoretical approximations of Lq, thus validating these approximations. This also means that the variance of estimation tends in simulation to the CRB for quantized observations, 1/(k Iq(0)), showing that the algorithm is asymptotically optimal.

• The convergence time seems to be related to NB: when NB increases, the time to get close to the optimal performance decreases.

Simulated loss: Wiener process case

For a Wiener process, LqW was evaluated by setting X̂0 randomly around 0 and X0 = 0; then 10⁴ realizations with 10⁵ samples were simulated and the MSE was estimated by averaging the realizations of the squared error for each instant. As it was observed that the error was approximately stationary after k = 1000, the sample MSE was also averaged over time, resulting in an estimate of the asymptotic MSE.
Based on the obtained values of the MSE, a simulated loss was evaluated. The results for the 7 types of noise and σw = 0.001 are shown in Fig. 3.7. As expected, the results have the same form as the theoretical loss given in Fig. 3.4.

[Figure 3.7: loss (dB) vs. number of bits NB for the 7 types of noise (GGD with β ∈ {1.5, 2 (Gaussian), 2.5, 3}, STD with β ∈ {1 (Cauchy), 2, 3}).]

Figure 3.7: Simulated quantization performance loss for a Wiener process Xk with σw = 0.001, different types of noise and numbers of quantization bits. The simulated losses were obtained through Monte Carlo simulation. For each evaluated loss (each symbol on the curves), 10⁴ realizations with 10⁵ samples were simulated. As it was observed that the error is stationary after k = 1000, the sample MSE was also averaged, leading to an estimate of the asymptotic MSE and consequently of the loss. The simulations were done by setting the initial estimate randomly around zero (with a Gaussian distribution) and also by setting X0 = 0. In all simulations, we considered δ = 1.

To verify the results for different values of σw, the loss was also evaluated through simulation for σw = 0.1 in the Gaussian (GGD with β = 2) and Cauchy (STD with β = 1) cases. The results are shown in Fig. 3.8, where the theoretical losses for these cases are also shown. These results clearly show that Xk must move slowly for the performance to be close to the theoretical results. However, it is also interesting to note that the simulated loss seems to have the same decreasing rate as a function of NB as the theoretical results. This means that the dependence of the MSE on Iq(0) still seems to be correct. Moreover, it indicates that even in a faster regime for Xk, the threshold variations can be set by maximizing Iq(0).

[Figure 3.8: loss (dB) vs. number of bits NB for Gaussian and Cauchy noise, σw ∈ {0.1, 0.001}, simulated and theoretical.]

Figure 3.8: Comparison of simulated and theoretical losses in the Gaussian and Cauchy noise cases when estimating a Wiener process with σw = 0.1 or σw = 0.001. The simulated losses were obtained through Monte Carlo simulation. For each evaluated loss (each symbol on the curves), 10⁴ realizations with 10⁵ samples were simulated. As it was observed that the error is stationary after k = 1000, the sample MSE was also averaged, leading to an estimate of the asymptotic MSE and consequently of the loss. The simulations were done by setting the initial estimate randomly around zero (with a Gaussian distribution) and also by setting X0 = 0. In all simulations, we considered δ = 1.

Simulated loss: Wiener process with drift case

For a Wiener process Xk with drift, Wk was simulated with mean and standard deviation u = σw = 10⁻⁴, which represents a slow drift with small random fluctuations. The initial conditions were set to X0 = X̂0 = 0 and the drift estimator was set with constant gain γku = 10⁻⁵. Its initial condition was set to the true u to reduce the transient time and, consequently, the simulation time. As uk is constant, the loss evaluation was done in the same way as for Xk without drift, after averaging the squared error through realizations and time. The results for the Gaussian and Cauchy cases are shown in Fig. 3.9. The small offset between the simulated and theoretical results is explained by the joint estimation of u and Xk.

Note that keeping γku small allows one to adaptively follow slow variations in the drift. The convergence to the simulated loss in Fig. 3.9 was also obtained for simulations including errors in the initial conditions. However, in this case, the transient regime was very long, indicating that other schemes might be considered when the theoretical performance is needed within a short period of time. Note also that if the drift is known, the procedure simulated for tracking Xk is clearly suboptimal.
In this case, we can obtain better asymptotic results by using the prediction (which includes the drift) in the adaptive algorithm. However, in practice, as we have to estimate the unknown drift jointly, the simulated algorithm normally has a shorter transient than the version using the prediction. This is an advantage when the drift can vary in time.

[Figure 3.9: loss (dB) vs. number of bits NB, simulated and theoretical, for Gaussian and Cauchy noise.]

Figure 3.9: Comparison of simulated and theoretical losses in the Gaussian and Cauchy noise cases for estimating a Wiener process with constant mean drift uk = 10⁻⁴ and standard deviation σw = 10⁻⁴. The simulation results were obtained with 10⁴ realizations of 10⁵ samples; for evaluating the simulated asymptotic MSE, the squared error samples were averaged through the realizations and through the time samples after the transient time (for k > 1000). The initial estimate value and initial parameter value were both set to zero. The initial value of the estimate of the drift was also set to the true parameter value to reduce the transient time.

3.5.4 Comparison with the high complexity algorithms

The adaptive algorithms that we propose will now be compared with their equivalent counterparts given in previous chapters. When the parameter is constant, we compare the adaptive algorithm with decreasing gain (a3) with the adaptive algorithm based on the MLE (a2.2) presented in Ch. 1 (p. 69). We will discuss the main differences in terms of performance and computational complexity.

Adaptive algorithm vs adaptive MLE

Asymptotic performance. Asymptotically, both algorithms are equivalent, since they are asymptotically unbiased and their asymptotic variance is equivalent to 1/(k Iq(0)). This means that for commonly used noise distributions both algorithms are asymptotically optimal under the unbiasedness constraint.
Thus, if there is a difference in performance, it is to be found in the transient, before getting close to the asymptotic performance.

Transient performance. The transient of both algorithms is difficult to study analytically. For the adaptive scheme with decreasing gain, the first few steps are mainly characterized by the bias. Unfortunately, the bias approximation given by the ODE approximation cannot be used in the initial transient, as the size of the steps is too large. For the adaptive scheme based on the MLE, we cannot obtain any result either, as the general behavior of the MLE is known only asymptotically. Therefore, we will analyze the transient through simulations.

We simulated both algorithms for NI = 8 and two different types of noise, Gaussian and Cauchy. The threshold variations were considered to be uniform with step-length chosen in the same way as for the evaluation and simulation of the losses. For evaluating the simulated MSE in the transient, we simulated 1000 realizations of the algorithms, each realization with 50 samples. The noise scale factor used in both cases was δ = 1 and the parameter and initial estimate were x = 0 and X̂0 = 1. For starting the adaptive scheme based on the MLE, 10 samples with fixed thresholds were used for obtaining the first estimate. The algorithm used in the maximization procedure of the MLE was a search algorithm¹. The results are shown in Fig. 3.10, where we also show the CRB for quantized measurements when the central threshold is placed at the true parameter, CRBq⋆ = 1/(k Iq(0)).

[Figure 3.10: MSE vs. time k; CRBq⋆ and simulated MSE for the adaptive algorithm and the adaptive MLE; (a) Gaussian noise, (b) Cauchy noise.]

Figure 3.10: Minimum CRB and simulated MSE for the adaptive algorithm with decreasing gain and for the adaptive algorithm based on the MLE.
Both algorithms were simulated with NI = 8, optimal uniform thresholds, Gaussian and Cauchy noise with δ = 1, x = 0 and X̂0 = 1. For evaluating the transient MSE, 50 samples were simulated 1000 times for each algorithm. The scheme based on the MLE is started by applying the MLE to samples obtained with fixed thresholds. The maximization in the MLE is done with a search algorithm¹. In (a), results for Gaussian noise are shown. In (b), results for Cauchy noise are shown.

We would expect the MLE based algorithm to produce better results, as it seems that we treat the data in an intuitively better way (we maximize the likelihood of the data). This is indeed the case when we consider Cauchy noise, but the opposite happens when we test with Gaussian noise. The decreasing gain algorithm is even slightly below the bound initially (which is possible only because the algorithm is initially biased). Thus, we cannot say that one of the algorithms is better than the other. As the performance of the algorithms seems equivalent, a practical choice can be made in terms of complexity.

¹More precisely, we used the MATLAB® function fminsearch. We chose this function instead of Newton's method because it can handle non-convex problems. For Cauchy noise the likelihood is not convex.

Complexity. At time k, the adaptive scheme based on the MLE must solve a maximization problem using the last k measurements i1:k. Each measurement adds a term to the log-likelihood to be maximized; thus, at time k, the evaluation of the log-likelihood function itself requires k evaluations of the logarithm of the marginal likelihood. Note that the marginal likelihood can be very costly to evaluate, as it is a difference of CDFs. For the adaptive algorithm with decreasing gains, the gains can be precalculated and stored in a table, or they can be obtained using one division; the update coefficients can also be precalculated and stored in a table.
To generate one estimate, the adaptive algorithm then requires: one table lookup for the update coefficient, one division or one table lookup for the gain, one multiplication to obtain the total correction and one sum to obtain the final estimate. One can conclude that the adaptive algorithm with decreasing gains has far lower complexity requirements than the scheme based on the MLE.

Note also that the adaptive algorithm based on the MLE needs a certain number of measurements with fixed (non-adaptive) thresholds to start. This is due to the fact that the MLE for one measurement is ill defined and produces estimates equal to +∞ or −∞. This can also happen with more than one measurement, if all measurements are equal to +1 or all equal to −1. Thus, for the adaptive algorithm based on the MLE we can have realizations with unbounded values, especially when the initial quantizer dynamic range is far away from the parameter. Such behavior cannot happen with the adaptive algorithm with decreasing gains, as the update coefficients are bounded (considering a PDF with f̃d bounded above and F̃d bounded away from zero). Therefore, for practical purposes the choice between them is clear: we should choose the algorithm with decreasing gains (a3).

Adaptive algorithm vs PF

We now compare the adaptive algorithm (with fixed gain) and the PF procedure for tracking a Wiener process.

Asymptotic performance for fast parameter evolution. In this case, for any σw, the PF is known to be optimal as the number of particles tends to infinity. Thus, for a very large number of particles, we expect the PF procedure to be at least as good as the adaptive algorithm.

Asymptotic performance for slow parameter evolution. When σw is small, the procedures have equivalent asymptotic performance. The PF is approximately unbiased if we choose a sufficiently large number of particles, and the adaptive procedure is asymptotically unbiased.
Their asymptotic MSE is approximately σw/√(Iq(0)). Thus, when σw is small, the differences, if they exist, will occur in the transient performance.

Transient performance. Similarly to the constant case, we analyze the transient performance through simulation. We simulated both the adaptive algorithm and the PF for NI = 8 and asymptotically optimal uniform quantization. The parameter model was a Wiener process with increment standard deviation σw = 0.001, initial variance Var(X0) = 0.1 and initial mean equal to zero. We simulated the algorithms for both Gaussian and Cauchy noise with δ = 1. For obtaining the simulated transient MSE, 1000 samples were simulated 2500 times for each algorithm and each noise distribution. The initial estimate X̂0 was set to zero for both algorithms in all cases. We used 5000 particles in the PF, and its resampling procedure was triggered each time the number of effective particles fell below 50. The results are shown in Fig. 3.11, where the asymptotically optimal performance for small σw (σw/√(Iq(0))) is also presented.

[Figure 3.11: MSE vs. time k; asymptotically optimal MSE and simulated MSE for the adaptive algorithm and the PF; (a) Gaussian noise, (b) Cauchy noise.]

Figure 3.11: Asymptotic MSE for the optimal estimator of a Wiener process with small σw and simulated MSE for the adaptive algorithm with constant gain and for the PF with dynamic central threshold. Both algorithms were simulated with NI = 8, optimal uniform thresholds, Gaussian and Cauchy noise with δ = 1, σw = 0.001, E(X0) = 0, Var(X0) = 0.1 and X̂0 = 1. The evaluation of the transient MSE was done with 2500 simulations of the algorithms for blocks of 1000 samples. The PF was simulated with 5000 particles and its threshold for the resampling procedure was set at Nthresh = 50.
In (a), results for Gaussian noise are shown, while in (b) we have the results for Cauchy noise.

In this case, the expected result is obtained: the PF, which should be close to optimal when the number of particles is large, clearly converges faster than the adaptive algorithm.

Complexity. When comparing the complexity of the algorithms, the difference is striking. At time k, for each particle in the PF, a Gaussian r.v. has to be simulated in the prediction step and its likelihood has to be evaluated. After that, the weighted mean of the particles is computed, followed by the evaluation of the effective number of particles and a possible resampling step. For the adaptive algorithm, the complexity is one table lookup to obtain the update coefficient, one multiplication by the constant gain and one sum with the previous estimate. Therefore, one might choose the PF whenever there is no restriction on the complexity of the algorithm². Under a strong complexity restriction, by paying the price of a slower convergence, the adaptive algorithm can be a good solution.

3.5.5 Discussion on the results

We now summarize and discuss the main points observed so far.

• We proposed a low complexity adaptive algorithm to track one of three models: constant, Wiener process and Wiener process with drift. Under the hypothesis that the noise PDF is symmetric and strictly decreasing and that the quantizer is also symmetric with its center placed at the previous parameter estimate, we could prove using Lyapunov theory that the algorithm is asymptotically unbiased for the estimation of a constant and of a Wiener process. We showed that the asymptotic performance for the optimal update coefficients is a function of the FI Iq(0), which shows that this function plays an important role in the choice of the threshold variations, as was also observed in Ch. 1 and 2.
• For the optimal update coefficients, the adaptive algorithm obtained is a generalization of the recursive algorithm found at the end of Ch. 1, being exactly equal if we constrain NI = 2. In the case of estimating a Wiener process, the adaptive algorithm with optimal update coefficients is equal to the asymptotic recursive algorithm presented at the end of Ch. 2. Therefore, the adaptive algorithm is a low complexity alternative to the algorithms presented in Ch. 1 and 2 with equivalent asymptotic performance.

• For testing the results, we considered two different families of noise, generalized Gaussian and Student's-t, both tested with uniform quantization. First, we evaluated the theoretical loss of performance due to quantization w.r.t. the continuous measurement equivalent estimator for different numbers of quantization intervals. The results indicate that with only a few quantization bits (4 or 5) the adaptive algorithm performance is very close to the continuous measurement case, and it was observed that uniform quantization seems to penalize estimation performance more under heavy-tailed distributions.

• Estimation in the three possible scenarios was simulated and the results validated the accuracy of the theoretical approximations. In the constant case, it was observed that the algorithm performance was very close to the Cramér–Rao bound. In the Wiener process case, it was observed that the theoretical results are very accurate for small increments of the Wiener process, and in the drift case it was seen that, by accepting a small increase in the MSE, it is possible to estimate the drift jointly.

²Note that the number of particles necessary to get close to optimal performance can be reduced by using the optimal proposal distribution, thus reducing complexity. This can have an impact on the choice of the algorithm when the restriction on complexity is not strong.
• As the algorithms are asymptotically equivalent in performance to the adaptive scheme based on the MLE in the constant case and to the PF in the Wiener process case, we simulated their transient performance to see if, and by how much, we lose performance by using the low complexity approach. In the constant case, we cannot say that the adaptive scheme based on the MLE is better; thus, in practice, the adaptive algorithm with decreasing gain might be used, as it requires far lower complexity. In the Wiener process case, the PF is superior to the adaptive algorithm with constant gain; thus, if no complexity constraints are considered, we might use the PF. Under strong complexity constraints, by accepting a slower convergence, the adaptive algorithm gives a good solution.

• An interesting link between standard quantization and the adaptive algorithm for tracking the Wiener process can be observed. In the binary case, the adaptive algorithm proposed here is similar to delta modulation [Gersho 1992, p. 214]; the difference is that here we do not use the quantization noise approach for obtaining its performance, and we also consider the effect of the measurement noise on the final performance. When NI > 2, the algorithm that we propose can be seen as a form of predictive quantization intended for estimation rather than for reconstruction of the measurements.

• Another interesting result is that a varying parameter has a smaller loss of performance due to quantization than a constant parameter; thus, a type of dithering effect seems to be present. In this case, the variations of the input signal bring the tracking performance of the estimator closer to the continuous measurement performance.

• The fact that the number of quantization bits does not influence much the performance of estimation leads to the conclusion that it seems more reasonable to focus on using more sensors rather than higher resolution quantizers for increasing performance.
Consequently, this motivates the use of sensor network approaches. An approach of this type will be presented in Subsec. 3.6.2.

• As in practice the sensor noise scale parameter and the Wiener process increment standard deviation can be unknown and slowly variable, it would also be interesting to study how the algorithm design and performance change when all these parameters are estimated jointly. We will study the joint estimation of the constant x and the scale parameter in Subsec. 3.6.1. The joint estimation of σw in the Wiener process case would lead to a scheme similar to delta modulation with variable gain; this is left for future work.

3.6 Adaptive quantizers for estimation: extensions to location-scale estimation and to the multiple sensor approach

We present now the two extensions discussed in the previous section. The first is the joint estimation of the unknown noise scale factor. We will see that the adaptive estimation of x does not change; the only modification is the addition of an adaptive estimator of the scale parameter δ. We will also see that not knowing the scale parameter value does not degrade the asymptotic estimation performance when compared with the location-only estimation problem. We then present the multiple sensor approach based on a fusion center architecture. We will see that the optimal correction of the adaptive algorithm based on multiple quantized measurements from different sensors is simply a weighted sum of their corrections in the single sensor case.

3.6.1 Joint estimation of location and scale parameters

We start by stating the problem and defining the adaptive estimator. In a second step, we study its performance and optimize the algorithm in a similar way as before: we find the optimal adaptive gain, i.e. the optimal adaptive gain matrix.
The optimal update coefficients are obtained in a third step. At the end of the section, we present some simulations and discuss the results.

Problem statement and estimator

We consider that a sequence of i.i.d. r.v. Yk with marginal CDF Fn((y − x)/δ) is quantized with an adjustable quantizer (Fn() is the noise CDF for δ = 1), resulting in a sequence of discrete measurements i1:k. The pair of parameters (x, δ) is unknown and the objective is to estimate it based on the quantized measurements. This is equivalent to the following modification of problem (a) (p. 27):

(a') Solve problem (a) when the noise scale parameter δ is unknown and must be estimated jointly with x.

Observe that this problem is a joint location-scale estimation based on quantized measurements. The adjustable quantizer is given by (3.1), where, for enhancing the estimation performance, we set the offset and the input gain to be

  bk = X̂k−1,   ∆k = c∆ δ̂k−1.   (3.76)

Note that the main difference with the adjustable quantizer used previously is the use of the last scale parameter estimate for setting the input gain.

The adaptive estimation algorithm can be extended to include the joint estimation of the scale parameter. The extended version is

  [X̂k ; δ̂k] = [X̂k−1 ; δ̂k−1] + (Γ/k) δ̂k−1 [ηx(ik) ; ηδ(ik)],   (3.77)

where ηx[i] and ηδ[i] are sequences of NI update coefficients {ηx[−NI/2], …, ηx[NI/2]} and {ηδ[−NI/2], …, ηδ[NI/2]}, and Γ is a 2 × 2 matrix of gains.

The advantages of this extended version are the following:

• It is still a low complexity algorithm, requiring only a few operations more than the initial adaptive algorithm.

• It is an online algorithm, making it possible for real-time applications to have access to the recent estimates at any time k.

• Its performance can also be studied using the general results from [Benveniste 1990].

The noise and quantizer follow the assumptions AN1–AQ1, AQ2' and AN3.
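The structure of update (3.77) can be sketched in a few lines. In the sketch below, the NI = 4 threshold grid and the coefficient values are illustrative placeholders that only respect the sign conventions of AQ3' (ηx odd and positive for positive i, ηδ even with ηδ[1] < 0); they are not the optimal coefficients derived later:

```python
import numpy as np

TAU = 1.0                               # single positive threshold variation (NI = 4)
ETA_X = {1: 0.5, 2: 1.0}                # eta_x[i], extended by odd symmetry
ETA_DELTA = {1: -0.5, 2: 1.0}           # eta_delta[i], extended by even symmetry

def quantize(y, x_hat, delta_hat):
    """Adjustable quantizer: offset b_k = x_hat, gain 1/delta_hat (c_Delta absorbed
    in TAU). Output index i in {-2, -1, 1, 2}."""
    z = (y - x_hat) / delta_hat
    s = 1 if z >= 0 else -1
    return s * (1 if abs(z) <= TAU else 2)

def update(x_hat, delta_hat, i, gamma_x, gamma_d):
    """One step of the joint update (3.77) with a diagonal gain matrix; in (3.77)
    the gains decrease as Gamma/k, here they are simply passed in."""
    x_new = x_hat + gamma_x * delta_hat * np.sign(i) * ETA_X[abs(i)]
    d_new = delta_hat + gamma_d * delta_hat * ETA_DELTA[abs(i)]
    return x_new, max(d_new, 1e-9)      # keep the scale estimate positive

# Outer-bin measurement: the location moves towards it and the scale grows
# (quantizer range judged too small).
x1, d1 = update(0.0, 1.0, quantize(5.0, 0.0, 1.0), 0.1, 0.1)
# Inner-bin measurement: the scale shrinks (range judged too large, eta_delta[1] < 0).
x2, d2 = update(0.0, 1.0, quantize(0.3, 0.0, 1.0), 0.1, 0.1)
```

The two sample calls illustrate the mechanism behind footnote 3: inner-bin outputs contract the quantizer range, outer-bin outputs expand it.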
For simplification purposes and to have a stable algorithm, we will assume that both ηx [i] and ηδ [i] are symmetric, ηδ [i] have even symmetry with negative3 ηδ [1] = ηδ [−1], while ηx [i] are defined with odd symmetry and they are positive for positive i, similarly as stated in AQ3. Assumption (on the quantizer output levels): AQ3’ The quantizer output levels ηx [i] are odd and the output levels ηδ [i] are even. ηx [i] = −ηx [−i] , ηδ [i] = ηδ [−i] , (3.78) with ηx [i] > 0 for i > 0 and ηδ [1] < 0. The estimation scheme is depicted in Fig. 3.12, where the UPDATE block is the estimation algorithm. 3 This constraint on ηδ [1] is imposed to guarantee the convergence of δ̂k . The idea here is that when the quantized measurements are small, it means asymptotically (when X̂k is close to x) that the quantizer range is too large, thus the range and, consequently, δ̂k must be reduced. If we set the coefficients with the opposite sign, δ̂k will diverge. 3.6. Adaptive quantizers for estimation: extensions Adjustable Quantizer 151 τ2′ τ1′ Yk 1 c∆ δ̂k−1 0 −τ1′ − ik Quantized measurements −τ2′ δ̂k−1 X̂k−1 UPDATE X̂k Estimate Figure 3.12: Scheme representing the adjustable quantizer. The offset and gain are adjusted dynamically using the estimates while the quantizer thresholds (the threshold variations) are fixed. Optimal parameters and performance The analysis of the algorithm will be done using the results from [Benveniste 1990, Ch. 3]. We will analyze the bias and the asymptotic covariance matrix of the estimation error. Similarly to the estimation of the constant location parameter, the algorithm mean can be approximated by the solution of an ODE. However, in this case, we have a vectorial ODE with one component for x̂ and one component for δ̂: d x̂ = Γh x̂, δ̂ . 
The relation between continuous and discrete time is tk = Σ_{j=1}^{k} 1/j, and h is the following mean vector field:

h(x̂, δ̂) = E[ δ̂ ηx(Q((Y − x̂)/(c∆ δ̂))) ; δ̂ ηδ(Q((Y − x̂)/(c∆ δ̂))) ]
         = [ δ̂ Σ_{i=1}^{NI/2} ηx[i] {F̃d(i, x̂, x, δ̂, δ) − F̃d(−i, x̂, x, δ̂, δ)} ;
             δ̂ Σ_{i=1}^{NI/2} ηδ[i] {F̃d(i, x̂, x, δ̂, δ) + F̃d(−i, x̂, x, δ̂, δ)} ],   (3.80)

where the expectation is w.r.t. the noise marginal probability measure, the second equality comes from the symmetry assumptions and F̃d is

F̃d(i, x̂, x, δ̂, δ) = Fn(τi c∆ δ̂/δ + (x̂ − x)/δ) − Fn(τi−1 c∆ δ̂/δ + (x̂ − x)/δ), if i ∈ {1, …, NI/2},
F̃d(i, x̂, x, δ̂, δ) = Fn(τi+1 c∆ δ̂/δ + (x̂ − x)/δ) − Fn(τi c∆ δ̂/δ + (x̂ − x)/δ), if i ∈ {−1, …, −NI/2}.   (3.81)

The conditions on the mean convergence of the algorithm are then conditions on the global asymptotic stability of the point x̂ = x, δ̂ = δ. One necessary condition for asymptotic stability is that the true parameters must be an equilibrium point of the ODE, which means that h(x̂ = x, δ̂ = δ) must be zero. From the symmetry assumptions:

h(x̂ = x, δ̂ = δ) = [0 ; 2 ηδ^⊤ Fd^vec],

where the vector Fd^vec is

Fd^vec = [F̃d[1] ⋯ F̃d[NI/2]]^⊤,

with elements F̃d[i] = F̃d(i, x, x, δ, δ) independent of the parameters. Then, the condition for the parameters to be the equilibrium point is

ηδ^⊤ Fd^vec = 0.   (3.82)

Other conditions are necessary for the mean convergence of the algorithm. These conditions can be found by the analysis of the ODE using Lyapunov theory. The analysis of these other conditions will not be detailed here; under the assumptions already stated and the constraint on ηδ given in (3.82), it will be assumed that the algorithm converges in the mean to the true parameters.

We turn our attention now to the asymptotic fluctuation of the algorithm, which is given by its asymptotic covariance matrix. Under the assumptions stated previously (assumptions AN1–AQ3' and the assumption that the algorithm is asymptotically unbiased), it can be shown [Benveniste 1990, pp.
110–113] that the normalized estimation error √k εk tends in distribution to a zero mean Gaussian random variable:

√k εk ⟶ N(0, P) as k → ∞,   (3.83)

where P is the covariance matrix given by the optimal gain Γ⋆. The matrices P and Γ⋆ are the following:

P = δ² diag( ηx^⊤ Fd ηx / (ηx^⊤ fd^(x))², ηδ^⊤ Fd ηδ / (ηδ^⊤ fd^(δ))² )   (3.84)

and

Γ⋆ = −(1/2) diag( 1/(ηx^⊤ fd^(x)), 1/(ηδ^⊤ fd^(δ)) ),   (3.85)

where Fd is the diagonal matrix Fd = diag[Fd^vec], and fd^(x) = [f̃d^(x)[1] ⋯ f̃d^(x)[NI/2]]^⊤ and fd^(δ) = [f̃d^(δ)[1] ⋯ f̃d^(δ)[NI/2]]^⊤ are the derivatives in vector form of the quantizer output probabilities F̃d(i, x̂, x, δ̂, δ) multiplied by δ̂ when x̂ = x and δ̂ = δ:

f̃d^(x)[i] = fn(τi) − fn(τi−1),   (3.86)
f̃d^(δ)[i] = c∆ [τi fn(τi) − τi−1 fn(τi−1)].   (3.87)

These results are obtained in an equivalent way to the results presented for the estimation of x. But in this case Γ⋆ is not the inverse of the scalar derivative of −h; instead, it is the inverse of the Jacobian matrix of −h evaluated at the point (x̂ = x, δ̂ = δ). In the same way, the normalized covariance for the optimal gain is the normalized covariance of the vector of corrections [ηx(ik) ; ηδ(ik)] pre- and post-multiplied by the inverse of the Jacobian of h, with all the factors being evaluated at (x̂ = x, δ̂ = δ). The specific diagonal pattern of Γ⋆ and P comes from the symmetry assumptions on the noise and the quantizer.

Minimization of the estimation variance can be done through the minimization of the diagonal terms of P w.r.t. ηx and ηδ. The two minimization problems can be solved separately. In the case of the optimization w.r.t. ηδ, the equilibrium constraint (3.82) has to be taken into account. The optimal ηx can be found by using the Cauchy–Schwarz inequality, while the optimal ηδ is obtained by casting the constrained minimization problem as a modified eigenvalue problem solved in [Golub 1973] (Why? – App. A.1.9).
The optimal coefficients are

ηx ∝ Fd⁻¹ fd^(x),
ηδ ∝ Fd⁻¹ (fd^(δ) − 1 fd^(δ)) = Fd⁻¹ fd^(δ),

where 1 is a square matrix of ones. The second equality comes from the fact that the sum of the elements of fd^(δ) is zero. To respect the assumptions we can set

ηx = −Fd⁻¹ fd^(x),   (3.88)
ηδ = −Fd⁻¹ fd^(δ).   (3.89)

Therefore, the optimal P and Γ⋆ are

P = 2δ² Γ⋆ = δ² diag( 1/(fd^(x)⊤ Fd⁻¹ fd^(x)), 1/(fd^(δ)⊤ Fd⁻¹ fd^(δ)) ).   (3.90)

Note that the asymptotic variances are equal to the CRB for estimating the parameters based on the quantized measurements, when the quantizer offset and input gain are placed exactly at x and 1/(c∆ δ).

We have the following solution to problem (a') (p. 149):

Solution to (a') – Adaptive algorithm with decreasing gain for estimating x and δ (a'1)

1) Estimator. For each time k, the estimate, the quantizer offset and the quantizer input gain are obtained using (3.77):

[X̂k ; δ̂k] = [X̂k−1 ; δ̂k−1] + (Γ/k) δ̂k−1 [ηx(ik) ; ηδ(ik)],   τ0,k = X̂k−1,   ∆k = c∆ δ̂k−1,

with ik = Q((Yk − X̂k−1)/(c∆ δ̂k−1)),

[ηx(ik) ; ηδ(ik)] = [ −f̃d^(x)[ik]/F̃d[ik] ; −f̃d^(δ)[ik]/F̃d[ik] ]

and

Γ = (1/2) diag( 1/(fd^(x)⊤ Fd⁻¹ fd^(x)), 1/(fd^(δ)⊤ Fd⁻¹ fd^(δ)) ).

2) Performance (assumed and asymptotic). The estimator is assumed to be asymptotically unbiased. When k → ∞, the normalized estimation error vector √k εk is Gaussian distributed with covariance matrix P given by (3.90):

P = δ² diag( 1/(fd^(x)⊤ Fd⁻¹ fd^(x)), 1/(fd^(δ)⊤ Fd⁻¹ fd^(δ)) ).

Observe also that the asymptotic performance can still be optimized through τ′ and c∆. As optimization through τ′ is difficult, in the simulation section we will consider again that the threshold variations are uniform as in (3.50),

τ′ = [τ′−NI/2 = −∞ ⋯ τ′−1 = −1, τ′0 = 0, τ′1 = +1 ⋯ τ′NI/2 = +∞]^⊤,

thus the only free parameter for optimization is c∆.

Simulations

The algorithm will be simulated to validate the theoretical results. The simulation will be focused on the performance for the estimation of x.
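Before turning to the simulations, the construction of the optimal coefficients (3.86)–(3.89) can be illustrated numerically. The sketch below (the helper names are ours, not from the text) assumes Gaussian noise, Fn = Φ, and uniform threshold variations τ′i = i; since Fd is diagonal, the inversions in (3.88)–(3.89) reduce to elementwise divisions.

```python
import math

def gaussian_pdf(t):
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def gaussian_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def optimal_coefficients(n_half, c_delta):
    """Optimal eta_x, eta_delta from (3.86)-(3.89) for Gaussian noise and
    uniform threshold variations tau'_i = i (tau'_{NI/2} = +inf)."""
    # Normalized thresholds tau_i = c_delta * tau'_i, i = 0..NI/2.
    tau = [c_delta * i for i in range(n_half)] + [float("inf")]
    eta_x, eta_d = [], []
    for i in range(1, n_half + 1):
        lo, hi = tau[i - 1], tau[i]
        F = gaussian_cdf(hi) - gaussian_cdf(lo)                 # F_d[i]
        f_hi = 0.0 if math.isinf(hi) else gaussian_pdf(hi)
        fx = f_hi - gaussian_pdf(lo)                            # (3.86)
        fdel = c_delta * ((0.0 if math.isinf(hi) else hi * f_hi)
                          - lo * gaussian_pdf(lo))              # (3.87)
        eta_x.append(-fx / F)    # (3.88), elementwise since F_d is diagonal
        eta_d.append(-fdel / F)  # (3.89)
    return eta_x, eta_d
```

For NI = 4 (`n_half = 2`) and c∆ = 1 this yields ηx[i] > 0 for i > 0 and ηδ[1] < 0, in agreement with assumption AQ3'.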
As it was mentioned, the quantizer is uniform and c∆ will be chosen so as to minimize the variance of estimation of x. As this is a scalar problem, it can be solved by an exhaustive search using a fine grid. After finding the optimal c∆, the other parameters of the algorithm, Γ, ηx and ηδ, can be evaluated using the information from the noise distribution.

The Gaussian and Cauchy distributions will be used for modeling the noise. The algorithm will be simulated for 5 × 10⁵ blocks with 4 × 10⁴ samples each. The simulated MSE for the estimation of the location parameter will be evaluated by calculating the mean of the squared error for each sample. Other simulation parameters are δ = 1, δ̂0 = 2, x = 0, X̂0 = 1 and NI ∈ {4, 8, 16, 32}.

For comparison purposes, the CRB for the estimation of x based on continuous measurements, CRBc, will also be evaluated for the Gaussian and Cauchy distributions. Using the fact that the measurements are independent and the expressions for Ic for the GGD given in (3.74) with β = 2 and for the STD given in (3.75) with β = 1, the CRBc for Gaussian and Cauchy noise are respectively δ²/(2k) and 2δ²/k.

The results of the simulation are shown in Fig. 3.13, where we also plotted the CRB for the estimation with quantized measurements when the offset and gain are static and set with the true parameter values. The MSE was normalized by k and a logarithmic scale is used on both axes for better visualization. It can be observed that, after a transient time, the simulated performance becomes very close to the asymptotic theoretical results. It can also be seen that the gain in performance when increasing NI is very small even for a small number of quantization intervals (NI = 8 or 16), and that the gap between the performance given by NI = 32 and the continuous measurement bound is negligible.
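The continuous-measurement bounds δ²/(2k) and 2δ²/k quoted above can be cross-checked numerically. The sketch below (with δ = 1, and with our own normalization choices for the GGD with β = 2 and the STD with β = 1) computes the location Fisher information Ic = ∫ f′(v)²/f(v) dv by quadrature:

```python
import math

def fisher_location(pdf, lo=-60.0, hi=60.0, n=120_001):
    """Numerical location Fisher information I_c = integral of f'(v)^2 / f(v),
    via central differences and the trapezoidal rule."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for j in range(n):
        v = lo + j * h
        f = pdf(v)
        if f < 1e-300:      # skip the far tails where f underflows
            continue
        df = (pdf(v + 1e-6) - pdf(v - 1e-6)) / 2e-6
        w = 0.5 if j in (0, n - 1) else 1.0
        total += w * (df * df / f)
    return total * h

# GGD with beta = 2 and delta = 1: f(v) = exp(-v^2) / sqrt(pi)
gauss = lambda v: math.exp(-v * v) / math.sqrt(math.pi)
# STD with beta = 1 and delta = 1 (standard Cauchy): f(v) = 1 / (pi (1 + v^2))
cauchy = lambda v: 1.0 / (math.pi * (1.0 + v * v))

print(fisher_location(gauss))   # ≈ 2   ->  CRBc = delta^2 / (2k)
print(fisher_location(cauchy))  # ≈ 1/2 ->  CRBc = 2 delta^2 / k
```

Both values match the closed-form informations used in the comparison above.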
Discussion on the results

Despite the very low complexity of the algorithm, its asymptotic performance for estimating the parameters is not only decoupled (the covariance is diagonal) but also optimal. The normalized asymptotic variance for estimating x is 1/Iq(0) and the variance for estimating δ is also the inverse of the corresponding FI. This optimal decoupling means that no degradation of performance is brought by jointly estimating the scale parameter. As no degradation is present, the asymptotic performance of the estimator of x has the same behavior as was shown previously: with 4 or 5 quantization bits the estimation performance is very close to the optimal continuous measurement performance. This indicates that, even when δ is unknown, there is no need to use high resolution quantizers if we have a large number of samples.

3.6.2 Fusion center approach with multiple sensors

We present now the adaptive algorithm for estimating a constant parameter when a fusion center has access to quantized measurements from multiple sensors. We will first define the problem, the architecture to be used and the adaptive estimator. Then, similarly to the joint location-scale problem, we obtain the algorithm performance and its optimal parameters. We close this section with simulations and a discussion about our results.

Figure 3.13: CRB for estimating a location parameter of Gaussian and Cauchy distributions based on quantized and continuous measurements, and simulated MSE for the estimation of the location parameter with the adaptive location-scale parameter estimator (panels (a) and (b): normalized MSE, MSE × k, versus time k). In all cases, we considered the true scale parameter and its initial estimate δ = 1, δ̂0 = 2; for the location parameter we considered x = 0, X̂0 = 1.
The numbers of quantization intervals simulated were NI ∈ {4, 8, 16, 32}. For obtaining the simulated MSE for the location parameter, the algorithm was simulated for 5 × 10⁵ blocks with 4 × 10⁴ samples each. The curves that are asymptotically lower are related to a higher number of quantization intervals.

Problem statement and estimator

The scalar parameter is supposed to be a constant x and it is measured by Ns sensors. Each sensor measures the parameter with additive noise:

Yk^(j) = x + Vk^(j), for j ∈ {1, …, Ns},   (3.91)

where Vk^(j) is the noise r.v. for the sample k obtained at the sensor j. The sensor noises are independent and each sensor noise is i.i.d. The noise r.v. also respects assumptions AN1, AN2 and AN3. Its marginal CDF for sample k of sensor j will be denoted as F^(j)(v) and its PDF as f^(j)(v).

The measurements at each sensor are quantized by a scalar adjustable quantizer, similar to the quantizer used in the previous sections. The quantizers for the sensors are then characterized by their input gains 1/∆k^(j), input offsets bk^(j) and the vector of threshold variations (considered to be static) that defines the NI^(j) quantizer intervals:

τ′^(j) = [τ′^(j)−NI^(j)/2 ⋯ τ′^(j)−1, τ′^(j)0, τ′^(j)1 ⋯ τ′^(j)NI^(j)/2].

We will consider again the following assumptions:

• AQ1 on the quantizer outputs: the set of possible quantizer outputs of the sensor j is I^(j) = {−NI^(j)/2, …, −1, 1, …, NI^(j)/2}.
• AQ2 on the quantizer threshold variations: the quantizers have symmetric threshold variations τ′^(j)i = −τ′^(j)−i with τ′^(j)0 = 0 and τ′^(j)NI^(j)/2 = +∞.

The output of quantizer j is then given by

ik^(j) = Q^(j)((Yk^(j) − bk^(j))/∆k^(j)) = i sign(Yk^(j) − bk^(j)), for (Yk^(j) − bk^(j))/∆k^(j) ∈ [τi−1^(j), τi^(j)).   (3.92)

The noise CDFs are also considered to have a known scale parameter δ^(j).
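The per-sensor quantizer (3.92) can be sketched as follows (a minimal sketch with hypothetical names; the offset b and the gain denominator ∆ are the quantities set by (3.93)–(3.94)):

```python
def quantize(y, b, delta, tau):
    """Symmetric adjustable quantizer, cf. (3.92).

    y: raw measurement, b: input offset, delta: input gain denominator,
    tau: increasing finite positive thresholds tau_1..tau_{NI/2 - 1}
         (tau_0 = 0 and tau_{NI/2} = +inf are implicit).
    Returns the index i with |i| in {1, ..., NI/2} and the sign of y - b.
    """
    u = (y - b) / delta
    mag = 1
    for t in tau:
        if abs(u) >= t:
            mag += 1
    return mag if u >= 0 else -mag
```

For example, with thresholds `[0.5, 1.0]` (NI = 6), `quantize(0.3, 0.0, 1.0, [0.5, 1.0])` returns 1 and `quantize(-1.2, 0.0, 1.0, [0.5, 1.0])` returns -3.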
Therefore, similarly to what was done before, we can use the noise scale factor to normalize the input of the quantizer:

∆k^(j) = c∆^(j) δ^(j),   (3.93)

where c∆^(j) is a free parameter which, as it was explained before, can be used to adjust the quantizer input range or to optimize quantization performance when the threshold variations are fixed.

After obtaining the quantized measurements, the sensors send their measurements to a fusion center. The transmission of the quantized measurements is supposed to be perfect, as it was explained in the Introduction. The fusion center can feed back information to the sensors through perfect continuous amplitude channels. Thus, we want to solve the following modification of problem (a) (p. 27):

(a'') Solve problem (a) with independent quantized measurements from Ns sensors. The measurements from the Ns sensors are available at a fusion center that can process these measurements and feed back information to the sensors through perfect continuous amplitude channels.

Note that the simplifying assumption of perfect feedback channels means that the fusion center has enough power and/or bandwidth for feeding back real (or very finely quantized) noiseless estimates.

To solve problem (a''), the fusion center generates an online estimate X̂k that will be broadcast to the quantizers through the feedback channels, so that they can use it as their next input offset for enhancing estimation performance. At time k, this means that

bk^(j) = X̂k−1.   (3.94)

Figure 3.14: Scheme representing the sensor network. The fusion center updates the estimate of the parameter and broadcasts it through a perfect channel to the sensors. The sensors then use the new estimate as their quantizer input offset (their quantizer central threshold).

The general scheme is depicted in Fig.
3.14, where the UPDATE block contains an online estimator of the parameter. For estimating the parameter, we can use an extension of the adaptive algorithm with decreasing gain:

X̂k = X̂k−1 + (γ/k) η(ik),   (3.95)

where γ is a positive gain, ik is the vector of quantized observations [ik^(1) ⋯ ik^(Ns)]^⊤ and η[i] is the update coefficient (or the quantizer output level) defined as a function from I^(1) × ⋯ × I^(Ns) to ℝ. The main advantage of this algorithm when compared with an adaptive scheme based on the MLE is its low complexity, both in terms of processing and memory requirements.

Optimal parameters and performance

Using the results from [Benveniste 1990, Ch. 3], the asymptotic variance of the estimation error can be obtained under the condition that the mean error converges to zero as k → ∞. To prove this convergence, it would be sufficient to use the ODE approximation of the mean of X̂k and then prove global convergence properties for the ODE using Lyapunov theory. Such analysis is left for future work. Here, only the mean behavior of the algorithm at equilibrium (X̂k = x) will be studied.

When X̂k−1 = x, the normalized mean increment (k/γ) E[X̂k − X̂k−1] is given by

(k/γ) E[X̂k − X̂k−1] = E[η(i)] = η^⊤ Fd^vec,   (3.96)

where η is a vector regrouping all possible values of the output coefficients,

η = [η(i^(1) = −NI^(1)/2, …, i^(Ns) = −NI^(Ns)/2) ⋯ η(i^(1) = NI^(1)/2, …, i^(Ns) = NI^(Ns)/2)]^⊤,

and Fd^vec = [⋯ F̃d[i] ⋯]^⊤ with

F̃d[i] = ∏_{j=1}^{Ns} F̃d^(j)[i^(j)],   (3.97)

where F̃d^(j)[i^(j)] is the probability of having the output i^(j) at the sensor j when X̂k = x:

F̃d^(j)[i^(j)] = F^(j)(τi^(j) c∆^(j) δ^(j)) − F^(j)(τi−1^(j) c∆^(j) δ^(j)), if i^(j) ∈ {1, …, NI^(j)/2},
F̃d^(j)[i^(j)] = F^(j)(τi+1^(j) c∆^(j) δ^(j)) − F^(j)(τi^(j) c∆^(j) δ^(j)), if i^(j) ∈ {−1, …, −NI^(j)/2}.   (3.98)

Thus, the following condition is needed to have an equilibrium point at the true parameter:

η^⊤ Fd^vec = 0.   (3.99)
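One round of the fusion-center scheme, combining the sensor quantization, the feedback offset (3.94) and the update (3.95), can be sketched as follows. The names are ours: `q` maps the normalized quantizer input to an index i^(j), and `eta` is the update coefficient function, whose optimal form is given later by (3.107).

```python
def fusion_step(x_hat, k, ys, sensors, gamma, eta):
    """One iteration at the fusion center: collect quantized measurements
    from all sensors (offset = last broadcast estimate) and apply (3.95).

    ys:      raw measurements y_k^(j), one per sensor
    sensors: list of (quantizer function on normalized input, c_delta * delta^(j))
    eta:     update coefficient function of the index vector i
    """
    # Each sensor quantizes with offset b_k^(j) = x_hat (3.94) and gain (3.93).
    i_vec = [q((y - x_hat) / d) for (q, d), y in zip(sensors, ys)]
    # Decreasing-gain update at the fusion center (3.95).
    x_hat = x_hat + gamma * eta(i_vec) / k
    return x_hat, i_vec
```

In a full simulation the returned `x_hat` would then be broadcast back to the sensors as the next offset, closing the feedback loop.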
Note that this is a necessary condition for asymptotic unbiasedness of the algorithm. Assuming that the algorithm is asymptotically unbiased, similarly to the single sensor case, we can use the results in [Benveniste 1990, pp. 110–113] to obtain the asymptotic distribution of the estimation error, the optimal gain γ⋆ and the minimum normalized asymptotic estimation error variance σ∞². The asymptotic estimation error is Gaussian distributed and it is given as follows:

√k εk ⟶ N(0, σ∞²) as k → ∞.   (3.100)

The optimal γ and minimum σ∞² are then given by

γ⋆ = −1/(η^⊤ fd)   (3.101)

and

σ∞² = η^⊤ Fd η / (η^⊤ fd)².   (3.102)

The matrix Fd is the diagonal matrix diag[Fd^vec] and fd is the vector form (as η and Fd^vec) regrouping the elements

f̃d[i] = Σ_{j=1}^{Ns} f̃d^(j)[i^(j)] ∏_{j′=1, j′≠j}^{Ns} F̃d^(j′)[i^(j′)],   (3.103)

where

f̃d^(j)[i^(j)] = f^(j)(τi^(j) c∆^(j) δ^(j)) − f^(j)(τi−1^(j) c∆^(j) δ^(j)), if i^(j) ∈ {1, …, NI^(j)/2},
f̃d^(j)[i^(j)] = f^(j)(τi+1^(j) c∆^(j) δ^(j)) − f^(j)(τi^(j) c∆^(j) δ^(j)), if i^(j) ∈ {−1, …, −NI^(j)/2}.   (3.104)

The asymptotic performance can also be optimized through the choice of η; this can be done by minimizing (3.102) w.r.t. η under the equilibrium constraint (3.99). This problem can be solved in the same way as it was done for finding the optimal vector ηδ in the joint estimation of location and scale parameters. Consequently, we find the following optimal vector η (Why? – App. A.1.9):

η ∝ Fd⁻¹ (fd − 1 fd) = Fd⁻¹ fd.

The second equality comes from the fact that the sum of the elements of fd is zero. For proving this, note that for each possible i there is a corresponding −i. As the function f̃d[i] is odd and F̃d[i] is even, we have f̃d[i] = −f̃d[−i]. Therefore, when adding f̃d[i] for all the possible i, the pairs (f̃d[i], f̃d[−i]) cancel each other, resulting in a zero sum. Similarly to the previous cases we will choose

η = −Fd⁻¹ fd.   (3.105)
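The zero-sum argument above can be verified on a small example: two binary (NI = 2) Gaussian sensors at equilibrium with δ = 1, for which (3.98) and (3.104) give F̃d^(j)[±1] = 1/2 and f̃d^(j)[±1] = ∓φ(0). A minimal numerical sketch (names ours):

```python
import itertools
import math

phi = lambda t: math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

# Binary quantizers (NI = 2) at equilibrium, two Gaussian sensors, delta = 1:
F_tilde = {1: 0.5, -1: 0.5}              # P(output = i), even in i (3.98)
f_tilde = {1: -phi(0.0), -1: phi(0.0)}   # odd in i (3.104)

def f_d(i_vec):
    """Element of f_d for the index vector i, cf. (3.103)."""
    total = 0.0
    for j in range(len(i_vec)):
        term = f_tilde[i_vec[j]]
        for jp in range(len(i_vec)):
            if jp != j:
                term *= F_tilde[i_vec[jp]]
        total += term
    return total

s = sum(f_d(iv) for iv in itertools.product([-1, 1], repeat=2))
print(s)  # prints 0.0: the elements of f_d sum to zero
```

The pairs f̃d[i] and f̃d[−i] cancel exactly, as claimed.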
For the update coefficients given by (3.105), the asymptotic normalized variance and the optimal gain are

σ∞² = γ⋆ = 1/(fd^⊤ Fd⁻¹ fd).   (3.106)

Using the expressions for F̃d[i] (3.97) and for f̃d[i] (3.103), for a given measurement vector i the update coefficients are⁴

η(i) = − Σ_{j=1}^{Ns} f̃d^(j)[i^(j)] / F̃d^(j)[i^(j)].   (3.107)

If we use the symmetry assumptions, the expression for the asymptotic normalized variance and for the optimal gain (3.106) becomes (Why? – App. A.1.10)

σ∞² = γ⋆ = 1 / ( Σ_{j=1}^{Ns} Σ_{i^(j) ∈ I^(j)} (f̃d^(j)[i^(j)])² / F̃d^(j)[i^(j)] ).   (3.108)

Observe that the update coefficients are the sum of the update coefficients obtained in the single sensor approach. The asymptotic normalized variance is equal to the inverse of the sum of the FI Iq(0) for each sensor, which means that the algorithm is asymptotically efficient.

⁴Using this specific form for the update coefficients, we can prove, similarly as it was done for the single sensor case, that the algorithm is asymptotically unbiased.

We have then the following solution to problem (a'') (p. 157):

Solution to (a'') – Adaptive algorithm with decreasing gain for estimating x using multiple sensors and a fusion center (a''1)

1) Estimator. For each time k,

• the sensors send ik^(j) = Q^(j)((Yk^(j) − X̂k−1)/(c∆^(j) δ^(j))) to the fusion center.
• The fusion center estimates the parameter using (3.95), X̂k = X̂k−1 + (γ⋆/k) η(ik), where γ⋆ is given by (3.108) and η(i) = −Σ_{j=1}^{Ns} f̃d^(j)[i^(j)]/F̃d^(j)[i^(j)].
• The fusion center then broadcasts the estimate to the sensors through perfect channels, to be used as the next quantizer input offset.

2) Performance (assumed and asymptotic). The estimator is assumed to be asymptotically unbiased. When k → ∞, the normalized estimation error √k εk is Gaussian distributed with variance σ∞² given by (3.108):

σ∞² = 1 / ( Σ_{j=1}^{Ns} Σ_{i^(j) ∈ I^(j)} (f̃d^(j)[i^(j)])² / F̃d^(j)[i^(j)] ).
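As a sanity check of (3.108), the sketch below (helper names are ours) computes the per-sensor term Σ_i f̃d^(j)[i]²/F̃d^(j)[i], which equals the per-sensor Fisher information Iq(0), and the resulting γ⋆ = σ∞². For binary Gaussian sensors with c∆ = δ = 1, the per-sensor term is 4φ(0)² / 2 × 2 = 2/π, the familiar one-bit Fisher information for Gaussian noise, so with Ns sensors γ⋆ = π/(2 Ns).

```python
import math

def sensor_fisher(F, f, c_delta, delta, tau_prime):
    """Per-sensor term of (3.108): sum_i f_d[i]^2 / F_d[i] (= I_q(0)),
    for a symmetric quantizer with positive threshold variations
    tau'_1..tau'_{NI/2 - 1} (tau'_0 = 0 and tau'_{NI/2} = +inf implicit)."""
    taus = [0.0] + list(tau_prime) + [float("inf")]
    total = 0.0
    for a, b in zip(taus[:-1], taus[1:]):
        lo, hi = a * c_delta * delta, b * c_delta * delta
        Fd = F(hi) - F(lo)                               # cell probability (3.98)
        fd = (0.0 if math.isinf(hi) else f(hi)) - f(lo)  # cell pdf difference (3.104)
        total += fd * fd / Fd
    return 2.0 * total  # negative indices contribute symmetrically

def optimal_gain(sensor_terms):
    """gamma* = sigma_inf^2 = 1 / sum_j I_q^(j)(0), cf. (3.108)."""
    return 1.0 / sum(sensor_terms)
```

With three identical binary Gaussian sensors this gives γ⋆ = π/6, one third of the single-sensor value π/2.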
Note that, again here, we can still optimize the performance through τ′^(j) and c∆^(j). In the same way as it was done previously, in what follows we consider that the threshold variations are uniform with unitary step-length and that only the c∆^(j) are used for optimizing the performance.

Simulations

The validity of the results will be verified through simulations. All the sensors within a simulation will be considered to have the same type of noise and the same noise scale factor δ = 1. The noise considered will be Gaussian or Cauchy distributed. Optimization w.r.t. c∆ (the same gain for all sensors in this case, as the noise is identically distributed) will be done by searching for the maximum of the corresponding FI on a fine grid. After finding the optimal c∆, the coefficients −f̃d[i]/F̃d[i] and the gain γ⋆ can be calculated.

For all the following simulations, the length of the block of samples will be 5000 and, for evaluating the MSE, the average of the squared error will be calculated using 5 × 10⁴ blocks. The parameter value and initial estimator value are x = 0 and X̂0 = 1.

In the first simulation, it will be considered that all the quantizers have NI = 4 and Ns will be 1, 2 or 3. The results can be observed in Fig. 3.15, in log scale both in time and MSE. The simulated results are compared with the theoretical approximations; for this algorithm they are asymptotically equal to the CRB for quantized measurements obtained from a number of sensors Ns, CRBq^{Ns,⋆}.

Figure 3.15: Cramér–Rao bound and simulated MSE for the adaptive algorithm when NI = 4, Ns = 1, 2, 3 and the noise is Gaussian or Cauchy distributed, both with δ = 1. For obtaining the simulated MSE, the algorithm was simulated 5 × 10⁴ times for blocks with 5000 samples.
For all simulations the true parameter was set to zero and the initial estimate was X̂0 = 1. In each set of curves, the results for the three different numbers of sensors are represented; the highest MSE curve represents the performance for Ns = 1 and the lowest represents Ns = 3. The curves are plotted in log-log scale for better visualization.

As expected, the MSE decreases with the number of sensors and the simulated results are very close to the theoretical approximation for a large number of samples. To have a more appropriate comparison between different numbers of sensors, channel bandwidth constraints must be considered. In the second simulation, the total rate will be fixed to 5 bits. Two possible settings will be considered: a single sensor approach using the 5 bits (NI = 32), and a multisensor approach with one sensor quantizing the measurements with 2 bits (NI = 4) and the other with 3 bits (NI = 8). We keep all the other simulation parameters from the previous simulation. The results are shown in Fig. 3.16, also with a comparison with the asymptotic performance (which again is equal to the optimal CRB for quantized measurements).

Figure 3.16: Cramér–Rao bound and simulated MSE for the adaptive algorithm for Ns = 1 and NB = 5 and for Ns = 2, one sensor with NB1 = 2 bits and the other with NB2 = 3 bits. The noise was considered to be Gaussian or Cauchy distributed, both with δ = 1. For obtaining the simulated MSE, the algorithm was simulated 5 × 10⁴ times for blocks with 5000 samples. For all simulations the true parameter was set to zero and the initial estimate was X̂0 = 1. In each set of results the higher curve represents the performance for Ns = 1. The curves are plotted in log-log scale for better visualization.
For both types of noise, the theoretical and simulated results show that the multisensor approach is superior.

Discussion on the results

The proposed algorithm shows that, in practice, in a rate constrained context, a multiple sensor approach with low resolution quantizers might be superior to a high resolution single sensor approach. Such an observation motivates the use of low resolution sensor networks for estimation purposes. Note that in the case studied we did not analyze the interaction between the noise scale factor (it is considered to be constant over the sensors) and the number of quantization bits used in each sensor. When the total number of bits to be transmitted to the fusion center is constrained, an interesting problem for further investigation is the optimal allocation of the number of bits to the sensors as a function of their noise scale factors. This problem will be studied in an approximate form in Part II.

The adaptive algorithm that is implemented in the fusion center has very low complexity. The complexity is roughly linear in the number of sensors, as the optimal correction η is equivalent to a weighted sum of the corrections given by the single sensor algorithm. Despite this fact, it can be very costly to implement this algorithm due to the perfect feedback channel requirement. Thus, in future work, we can consider that the feedback channels are not perfect, for example by considering that the estimates are fed back after being quantized and that they are corrupted with additive noise.

3.7 Chapter summary and directions

We summarize now the main points observed in this chapter and we also present some subjects that are interesting for further research.

• We presented an adaptive algorithm that can estimate three types of parameter: constant, slowly varying with a Wiener process model without drift, or slowly varying with a Wiener process model with drift.
The adaptive algorithm can be used for any even number of quantization intervals and, under the assumption that the noise is symmetric, unimodal and has a regular CDF (locally Lipschitz continuous), it was shown that:

– using decreasing gains, when the parameter is a constant, the algorithm converges asymptotically in the mean to the true parameter value and its asymptotic performance in terms of variance attains the minimum CRB for common noise distributions, CRBq⋆. Thus, the answers to the two initial questions (p. 105) are positive: the algorithm with gain proportional to 1/k converges and it can be extended to a multibit setting.

– Using a constant gain, when the parameter is modeled by a slowly varying Wiener process without drift, the algorithm also converges in the mean to the true parameter and the algorithm is approximately asymptotically optimal. This answers the third question also in a positive way.

– Using a constant gain, when the parameter is modeled by a slowly varying Wiener process with drift, the algorithm is biased and its asymptotic MSE can be minimized by setting the gain as a function of the drift.

• Using the asymptotic performance results, we evaluated the loss of estimation performance due to quantization for the algorithm. We observed the following:

– the loss in all cases is a function of Iq(0), showing one more time the importance of studying the behavior of this quantity as a function of the threshold variations. We remind the reader that this problem will be studied with an asymptotic approach NI → ∞ in Part II.

– When the parameter varies, the loss due to quantization is smaller than when the parameter is constant. Thus, when using quantized measurements for estimation, it seems that a type of dithering effect is present.

– The loss of performance is almost negligible in all cases for 4 or 5 quantization bits.
In a rate constrained scenario this seems to be a strong motivation for using a low to medium resolution multiple sensor approach instead of a high resolution single sensor approach. This was validated using an extension of the adaptive algorithm designed for multiple sensors that can communicate their measurements to a fusion center.

• When comparing the adaptive algorithm with its equivalent counterparts studied in Ch. 1 and 2, the following was observed:

– for estimating a constant, the adaptive algorithm has a very low complexity compared with the adaptive scheme based on the MLE, and their performance is equivalent.

– For estimating a slowly varying Wiener process, the algorithm also has a very low complexity compared with the PF scheme using a dynamical central threshold. In this case the only drawback of the adaptive algorithm is that it has a longer transient time.

Therefore, if complexity constraints are present, the adaptive algorithm seems to be the best analyzed solution. If no constraints on complexity are considered, then the adaptive algorithm is still the best choice for estimating a constant, but it should be replaced by the PF for estimating the slowly varying parameter. An interesting point for future work would be to look for ways of choosing the quantizer update coefficients during the transient, so that the adaptive algorithm performance would be similar to the PF performance.

• We presented two extensions of the algorithm, both for estimating a constant parameter. They are the following:

– the joint location-scale adaptive estimator, for which we showed that even if we do not know the noise scale parameter it is possible to estimate it with the same asymptotic performance obtained for the case with known scale parameter.

– The fusion center approach with multiple sensors. In this extension of the algorithm, we considered that measurements from multiple sensors are sent to a fusion center.
The role of the fusion center is to estimate the parameter and then broadcast the estimate to the sensors, so that it can be used for setting the quantizer offsets. As mentioned above, with this approach we showed that a low to medium resolution multiple sensor approach might be better for estimation purposes than a high resolution single sensor approach. We remind the reader that this was shown for sensors with the same type of noise distribution and the same noise scale parameter value. Thus, an interesting subject to study is the bit allocation problem among sensors, when the total bandwidth is constrained and the sensors have the same type of noise distribution but different scale parameters. This will be done in Part II, in the case of a weak bandwidth constraint.

• Many other extensions can be the subject of future work. They are:

– the joint estimation of the drift when we track a Wiener process with drift. Actually, this was already done by adding a simple adaptive estimator of the drift. However, in some cases we can have a more detailed dynamical model for the drift. By using adaptive multistep algorithms [Benveniste 1990, Sec. 4.2], we can use this additional information to have a better estimate of the Wiener process.

– The joint estimation of the Wiener process increment standard deviation and the Wiener process itself. This will lead to a robust multibit generalization of delta modulation with varying gain.

– The joint estimation of a location parameter and the shape of the noise distribution. In this case, we can consider that the noise CDF has an unknown shape but a known structure, for example that it is locally polynomial, and that we want to estimate jointly the location parameter and the parameters of the noise distribution.

– The nonparametric estimation of the location parameter.
We can consider, for example, that we only know that the noise distribution is symmetric, without any specific parametrization. Then we can try to define a nonparametric adaptive algorithm based on adaptive histograms to get as close as possible to the parametric performance.

– The joint estimation of location-scale parameters when the multiple sensor fusion center approach is considered. This extension can be directly implemented by joining the features of the adaptive location-scale estimator with the fusion center approach. The main difference in this case is that, for reducing the communication complexity, the sensors will have to estimate their individual scale parameters for setting their quantizers.

• We can also consider some extensions of the estimation problem itself for which modifications of the adaptive algorithm would be a good solution. Some examples are:

– a fusion center approach where the quantized measurements from the sensors are transmitted through noisy channels. This is a problem that we decided not to treat, but it is an interesting and more realistic point for further development.

– A fusion center approach where the fed-back information is quantized and passed through noisy channels. For this extension, we can consider an additional adaptive algorithm at the sensors for smoothing out the noise from the feedback channels. For dealing with quantization of the estimates we can consider including a dither signal. With this extension we will be able to assess the importance of the feedback channel quality, thus giving a more realistic global estimate of the sensor network cost. The main issue that makes these two extensions far more difficult to study is that the output quantizer indexes cannot be defined arbitrarily, as they are corrupted by the channel noise.

– The estimation of a scalar parameter following an autoregressive model Xk = aXk−1 + Wk instead of the Wiener process model.
This will lead to a robust generalization of scalar predictive quantization.
– Compression of a Wiener process with drift. We can consider that at sensor level we can store continuous measurements (or very finely quantized ones) and then apply the adaptive algorithm to a block of measurements in both time directions (forward and backward), averaging the results to obtain final estimates with reduced bias. By storing the initial and final continuous measurements and both the forward and backward quantized sequences, we have equivalently stored a compressed estimate of the true parameter sequence.

Conclusions of Part I

The main objective of Part I was to propose and study the performance of algorithms for estimation based on quantized measurements. We assumed simple parameter and measurement models:
• Parameter model – a scalar parameter that can be either constant or varying according to a Wiener process model.
• Measurement model (noise model) – the scalar constant is measured with independent, unimodal and symmetrically distributed noise.
• Measurement model (quantizer) – the quantizer is symmetric.
Under these settings, we obtained the following conclusions:
• Adaptiveness is important. The performance of estimation based on quantized measurements depends mainly, and directly, on the FI for quantized measurements: increased FI is equivalent to increased estimation performance. For the noise distributions considered, the FI is increased if we set the quantizer dynamic range close to the parameter to be estimated, and for commonly used noise distributions (Gaussian, Laplacian, Cauchy), we must put the quantizer central threshold exactly at the true parameter value.
As we do not know the value of the parameter, to obtain optimal estimation performance we must resort to adaptive algorithms that place the quantizer range close to the true parameter value, for example by placing the quantizer central threshold at the most recent estimate of the parameter. Therefore, this indicates that adaptiveness of the quantizer is a main requirement for optimal estimation.
• Low complexity is possible and it might even be asymptotically optimal. It is possible to estimate a constant and a slowly varying parameter with a low complexity adaptive algorithm. The adaptive algorithm is not only convergent in the mean (with a small bias in the drift case), but its parameters can be chosen in such a way that it is exactly equivalent to the asymptotically optimal estimator. This observation goes in the exact opposite direction of some proposed solutions (adaptive schemes based on the MLE and the PF), which require high complexity both in terms of memory and processing.
• Low to medium resolution is enough. Both for a constant and for a slowly varying parameter model, the loss of performance incurred by using quantized measurements instead of continuous amplitude measurements seems to be negligible for a number of quantization bits larger than 4 or 5. In a rate constrained context, this means that using more sensors with less resolution may be better than using fewer sensors with more resolution.

Part II
Estimation based on quantized measurements: high-rate approximations

"Finite – to fail, but infinite to venture" – from a poem by Emily Dickinson.

Motivation

The introduction of this part will also be given through a motivational example. To maintain their economic growth, emerging economies will have to look for new mineral and material sources. This will generate a potential increase of exploration in unusual places, for example the seafloor.
Sulfur and base metal rich mineral deposits can be found on the seafloor at hydrothermal vent sites [Hoagland 2010]. Hydrothermal vents, also called black smokers, occur when seawater penetrates the oceanic crust through fissures. The water penetrates so deeply into the oceanic crust that it comes into contact with the upper parts of magma chambers. A large increase in temperature (from ≈ 2 °C to ≈ 400 °C) is observed, along with a decrease in pH and Eh. The hot corrosive liquid then rises through fissures, carrying metal and sulfur from the rocks. The mineral rich water is released into the seawater as hot black smoke. Precipitation of the elements present in the smoke happens when the hot water mixes with cold ocean water. As a result, a mineral rich chimney and a massive sulfide deposit are formed around the hot water release point [Herzig 2002]. To mine the sulfide deposit, it must first be located. One possible way to locate it is by measuring the concentration of chemical compounds and elements in the seawater. The chemical plume generated by the hydrothermal vent can be detected using CH4, Fe, H, He or Mn concentration measurements [Baker 2004]. After detecting the plume, for example using sensor measurements from multiple autonomous underwater vehicles (AUV) that communicate with a fusion center, the source location must be found. This can be done by following the ascent gradient direction of a chemical compound concentration. The gradient direction can be obtained by exploiting the local information measured by the AUV. Underwater communication is challenging, as bandwidth is severely constrained in this environment. To overcome this problem, quantization of the concentration measurements from the AUV can be considered.
As a consequence, to calculate an approximate gradient at the fusion center we will have to deal with the same problem treated in Part I, which is the following:
• How to estimate a scalar constant (the concentration) based on noisy quantized measurements?
Algorithms for doing this were presented in Part I, where it was noted that estimation performance is given (at least asymptotically) by the FI for quantized measurements. However, a question remained unanswered:
• How to set the quantizer thresholds to obtain optimized estimation performance?
The answer to this question was given only in some cases:
1. In the binary case, it was observed that for commonly used noise models, the quantization threshold should be placed exactly at the parameter.
2. In the multibit uniform quantization case, after setting the central threshold at the parameter, the corresponding performance maximization is a one-dimensional optimization problem, which can be solved using exhaustive search.
In the general nonuniform case, setting the thresholds was observed to be a complicated optimization problem. Similarly to standard quantization [Gersho 1992, pp. 185–186], where analytical characterization of quantization performance is difficult for a finite number of quantization intervals, when the number of quantization intervals is large, the set of intervals can be approximated by an interval density. The interval density is a function whose integral over an interval gives the fraction of the number of quantization intervals contained in that interval. By using small interval approximations of the FI, we can obtain an asymptotic expression (NI → ∞) for the FI as a function of the interval density. The resulting FI can be maximized w.r.t. the interval density to get an approximation of the optimal interval set.
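To make the interval density concrete, the following small sketch (an illustration, not from the thesis) builds a quantizer whose thresholds are Gaussian quantiles; its interval density is then the Gaussian PDF itself, and integrating that density over a region recovers the fraction of thresholds falling inside it.

```python
from math import erf, sqrt

def gauss_cdf(y):
    # standard Gaussian CDF via the error function
    return 0.5 * (1.0 + erf(y / sqrt(2.0)))

def gauss_quantile(p, lo=-10.0, hi=10.0):
    # simple bisection inverse of the CDF
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if gauss_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# thresholds tau_i = F^{-1}(i/NI): the interval density lambda is then f itself
NI = 1024
thresholds = [gauss_quantile(i / NI) for i in range(1, NI)]

a, b = -1.0, 1.0
frac_intervals = sum(a <= t < b for t in thresholds) / NI
density_integral = gauss_cdf(b) - gauss_cdf(a)  # integral of lambda over [a, b]
assert abs(frac_intervals - density_integral) < 1e-2
```

The agreement between the counted fraction and the integral is exactly the defining property of the interval density used in what follows.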
After that, the optimal interval density can be used to obtain an approximate analytical expression for the optimal FI, thus giving a complete asymptotic characterization of the estimation algorithms. As the interval density is an asymptotic quantity, a main issue that must also be solved is how to approximate this density in practice with a finite number of intervals. An interesting question would be to find an analytical expression for the approximately optimal quantization thresholds as a function of the number of intervals. Written in a more detailed form, we want to do the following (problem (c)):
• Find an asymptotic (in terms of number of quantization bits) approximation of the Fisher information for estimating a constant parameter embedded in noise as a function of an interval density.
• Find the asymptotically optimal interval density.
• Give an analytical expression approximating the maximum FI for the optimal interval density.
• Obtain a practical approximation for the optimal quantization thresholds.
Now, with the optimal thresholds given by the asymptotic approximation, the adaptive estimation algorithms from Part I work (at least asymptotically) in an optimal way. We can imagine that, for reliability reasons or to reduce measurement latency, multiple concentration sensors are installed in each AUV. Due to deterioration, the sensors do not have exactly the same noise levels. Thus, under a given rate constraint, another question arises:
• How many quantization bits do we allocate to each sensor?
For an array of sensors with the same type of noise and considering independence between sensor noises, the only parameter that can change from sensor to sensor is the noise scale factor. Thus, what we want to do precisely is the following:
(d) Find the optimal or approximately optimal number of bits per sensor as a function of the noise scale factors, under a maximum constraint on the total number of bits.
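As a rough illustration of problem (d), the sketch below brute-forces an integer bit allocation. It uses a toy model of the form Iq ≈ Ic − c·2^(−2·NB) per sensor, scaled by 1/δ², for hypothetical constants Ic and c and hypothetical noise scales δ; this is only a baseline under assumed numbers, not the approximations derived later in the chapter.

```python
import itertools

# Toy model: sensor k with noise scale delta_k contributes roughly
# (Ic - c * 2**(-2*b_k)) / delta_k**2 to the total Fisher information.
# Ic and c are illustrative constants, not values from the thesis.
def total_fi(bits, deltas, Ic=1.0, c=0.5):
    return sum((Ic - c * 2.0 ** (-2 * b)) / d ** 2 for b, d in zip(bits, deltas))

def best_allocation(deltas, R):
    # brute-force search over all integer allocations summing to R bits
    best = None
    for bits in itertools.product(range(R + 1), repeat=len(deltas)):
        if sum(bits) != R:
            continue
        fi = total_fi(bits, deltas)
        if best is None or fi > best[1]:
            best = (bits, fi)
    return best

deltas = [1.0, 2.0, 4.0]            # less noisy sensors should get more bits
bits, fi = best_allocation(deltas, R=6)
assert bits[0] >= bits[1] >= bits[2]
```

Even this toy model reproduces the intuition that, under a total rate constraint, the bits should concentrate on the sensors with the smallest noise scales.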
The problems presented above are quite general and can appear in the performance analysis of any optimal estimation algorithm in a rate constrained context. In what follows, we will obtain insight on how to solve these problems. As this part has only one chapter, its outline will be given directly in the chapter introduction.

Chapter 4
High-rate approximations of the FI

To obtain insight on how to solve problems (c) and (d), we will resort to an asymptotic approach: we will make the number of quantization intervals go to infinity, NI → ∞, and see how the FI behaves as a function of the quantizer. This approach can also be found under the names high resolution or high-rate (as in the chapter title); the former is used to emphasize that the quantizer intervals are supposed to be very small, and the latter to make explicit that the communication rate must be high, as the number of quantization bits is large. Note that making NI → ∞ seems to contradict one of the conclusions of Part I, which states that with only a few quantization bits we have a negligible loss of performance due to quantization. However, even if we make NI → ∞, we will see that the asymptotic approximations still depend on NI, so that, as stated above, we can use them to gain insight on the estimation performance for finite NI. Actually, we will see that for the location parameter estimation problem studied in Part I, the asymptotic approximations are valid even for small numbers of quantization bits (NB = 4 and NB = 5), which is very fortunate, as these cases were observed to be the practically useful limit in quantization for estimation, and they are also the cases with the lowest number of bits for which the maximization of the estimation performance w.r.t. the quantizer thresholds is difficult to do directly.
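The behavior described above can be checked directly from the exact FI for quantized measurements. The sketch below (an illustration assuming standard Gaussian noise, Ic = 1, and a uniform quantizer covering [−3, 3]; the setup is not from the thesis) computes Iq = Σ (dP/dx)²/P and shows the loss of FI shrinking rapidly with the number of bits.

```python
import math

def phi(u):
    # standard Gaussian PDF; phi(+-inf) correctly evaluates to 0.0
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def Phi(u):
    # standard Gaussian CDF
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def quantized_fi(NB, x=0.0, span=3.0):
    # exact FI of one quantized measurement for a uniform NB-bit quantizer
    NI = 2 ** NB
    taus = [-math.inf] + [-span + 2 * span * i / NI for i in range(1, NI)] + [math.inf]
    fi = 0.0
    for lo, hi in zip(taus[:-1], taus[1:]):
        p = Phi(hi - x) - Phi(lo - x)
        dp = -(phi(hi - x) - phi(lo - x))  # d/dx of P(i; x)
        if p > 0.0:
            fi += dp * dp / p
    return fi

losses = [1.0 - quantized_fi(NB) for NB in (1, 2, 3, 4, 5)]
assert all(a > b for a, b in zip(losses, losses[1:]))  # loss decreases with NB
assert losses[4] < 0.02                                # under 2% loss at 5 bits
```

With a single threshold at the true parameter, the same function returns the known binary value 2/π; by 4 or 5 bits the loss is already at the percent level, consistent with the claim above.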
When NI → ∞, the quantizer can be characterized by its density of quantization intervals; thus, asymptotically, the behavior of the FI as a function of the quantizer can be characterized by studying its behavior as a function of the interval density. As a consequence, one of the main objectives of this chapter is to obtain an asymptotic analytical expression of the FI as a function of the interval density.
• Fixed rate encoding. We will obtain this expression for scalar quantization and we will not impose any strong constraints on the type of estimation problem that is treated (for example, we will not constrain it to be a location estimation problem).
• Variable rate encoding. In addition to the fixed rate encoding scheme, where all the quantizer outputs use the same number of bits for encoding, we will also obtain the optimal interval density maximizing the FI for the variable rate encoding scheme, where we can use different numbers of bits for different quantizer outputs. We will also discuss the difficulties of implementing the variable rate encoding scheme in practice.
• Practical implementation. We will describe how to implement in practice an approximation of the optimal interval density.
We will check the validity of the results in the location parameter case by comparing the theoretical results for the maximum FI obtained with the optimal interval density against the FI obtained with the practical approximation of the optimal density and against the FI for optimal uniform quantization. We will show that in practice we can obtain the asymptotic performance results by using the adaptive algorithm presented in Ch. 3. We will also look in some detail at the location and scale parameter estimation problems for GGD and STD measurements.
Moving beyond the single sensor location parameter case, we will study the problem of deciding how many quantization bits to allocate to each sensor in a sensor network, when the total rate is constrained and all the sensors have the same type of noise distribution but different noise scale parameters. Approximate solutions to this problem will be given using the asymptotic approximations. To show the connections between the results found here and asymptotic results for other inference problems, we will study the asymptotic approximation of a generalized inference performance measure known as the generalized f–divergence. The asymptotic results for this divergence were proposed in [Poor 1988], mainly for the uniform vector quantization fixed rate encoding case; they were stated but not proved in the nonuniform case. Here, we will give a simple derivation of the asymptotic approximation of this divergence in the scalar case, using the same procedure as for the FI, and we will also extend the results to the variable rate encoding case. After obtaining the general optimal density of thresholds, we will point out the similarities and differences between the way quantization must be done for three different inference problems: classic estimation (considered in this thesis), Bayesian estimation and detection. At the end of the chapter we will summarize the main results and indicate some possible points for future work.
Contributions presented in this chapter:
• Asymptotic approximation of the optimal interval density for classic parameter estimation. The asymptotic analysis presented in [Poor 1988] is only detailed for uniform quantization, differently from the development presented here, where we consider nonuniform quantization with the interval density approach.
• Practical implementation of the optimal quantizer in the location parameter estimation problem.
In this chapter, we show that the asymptotically optimal quantizer depends on the true parameter value, and we also show that in practice we can achieve the asymptotically optimal performance using the adaptive algorithm with decreasing gain presented in Ch. 3. This shows the importance of the adaptive approach. No result of this type seems to be present in the literature.
• Approximate bit allocation for the multiple sensor approach. The approximations of the optimal bit allocation among sensors seem to be new in the context of classic location parameter estimation.

Contents
4.1 Asymptotic approximation
 4.1.1 General setting
 4.1.2 Loss of estimation performance due to quantization
 4.1.3 Asymptotic approximation of the loss
 4.1.4 Optimal fixed rate encoding
 4.1.5 Variable rate encoding
 4.1.6 Estimation of GGD and STD location and scale parameters
 4.1.7 Location parameter estimation
4.2 Bit allocation for scalar location parameter estimation
 4.2.1 Unconstrained numbers of bits
 4.2.2 Positive numbers of bits
4.3 Generalization with the f–divergence
 4.3.1 Definition of the generalized f–divergence
 4.3.2 Generalized f–divergence in inference problems
 4.3.3 Asymptotic results
 4.3.4 Interval densities for inference problems
4.4 Chapter summary and directions

4.1 Asymptotic approximation

4.1.1 General setting

The general setting considered here is the estimation of a scalar deterministic parameter x ∈ R of a continuous distribution based on N independent measurements from this distribution, Y = [Y1 Y2 · · · YN]⊤. Again, we will consider that the estimation of x is not based on Y. Instead, it is based on a scalar quantized version of Y denoted i = [i1 i2 · · · iN]⊤ = [Q(Y1) Q(Y2) · · · Q(YN)]⊤. The function Q represents the scalar quantizer and is given by

Q(Y) = i, if Y ∈ q_i = [τ_{i−1}, τ_i),   (4.1)

where i ∈ {1, · · · , NI}, NI is the number of quantization intervals q_i, and the τ_i are the quantizer thresholds. The first and last thresholds are set to τ_0 = τ_min and τ_{NI} = τ_max. Note that the setting considered here is more general than the setting presented in Part I, as we do not restrict the estimation problem to be a location estimation problem and we do not impose any symmetry on the quantizer. Observe also that the quantizer interval indexes now go from 1 to NI. It will be assumed that the marginal CDF of the continuous measurements parametrized by x, F(y; x), admits a PDF f(y; x) that is positive, smooth in both x and y and defined on a bounded support. The bounded support assumption is needed to simplify the derivation of the asymptotic results.

4.1.2 Loss of estimation performance due to quantization

For estimating a constant with quantized or continuous noisy measurements, we saw in Ch. 1 that the asymptotic performance of an optimal unbiased estimator attains the corresponding CRB. This asymptotic characterization is not restricted to location parameter estimation. Under regularity conditions on the likelihood, it can be applied to any situation where we want to estimate a constant embedded in noisy measurements.
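The quantizer map (4.1) can be sketched directly; this is a minimal illustration (threshold values chosen arbitrarily), with inputs outside [τ_min, τ_max) clamped into range so that every measurement receives an index in {1, …, NI}.

```python
import bisect

def make_quantizer(thresholds):
    # thresholds = [tau_0, tau_1, ..., tau_NI], strictly increasing;
    # Q(Y) = i when Y falls in q_i = [tau_{i-1}, tau_i)
    def Q(y):
        # clamp into [tau_min, tau_max) so every input gets a valid index
        y = min(max(y, thresholds[0]), thresholds[-1] - 1e-12)
        return bisect.bisect_right(thresholds, y)  # index in {1, ..., NI}
    return Q

Q = make_quantizer([-3.0, -1.0, 0.0, 1.0, 3.0])    # NI = 4 intervals
assert Q(-2.0) == 1 and Q(-0.5) == 2 and Q(0.5) == 3 and Q(2.0) == 4
assert Q(-10.0) == 1 and Q(10.0) == 4              # out-of-range inputs clamped
```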
Thus, for a general parameter x and for a large number of samples, the estimation performance is still linked to the FI as follows:

Var[\hat{X}] \sim CRB_q = \frac{1}{N I_q},   (4.2)

where I_q is the FI for a quantized measurement that was already presented in Ch. 1 (for a location parameter). Rewriting the FI with the notation from this part, we have

I_q = E\left[S_q^2\right] = E\left[\left(\frac{\partial \log P(i;x)}{\partial x}\right)^2\right] = \sum_{i=1}^{N_I} \left(\frac{\partial \log P(i;x)}{\partial x}\right)^2 P(i;x),   (4.3)

where S_q is again the score function for quantized measurements and P(i;x) is the probability of having the quantizer output i (parametrized by x):

P(i;x) = F(\tau_i; x) - F(\tau_{i-1}; x).   (4.4)

The FI for quantized measurements can be written as a function of the FI for continuous measurements and the score functions, exactly as was done in Ch. 1 (1.16, p. 42) (Why? – App. A.1.1):

I_q = I_c - E\left[(S_c - S_q)^2\right],   (4.5)

where S_c = \frac{\partial \log f(y;x)}{\partial x} is the score function for continuous measurements and L = E\left[(S_c - S_q)^2\right] is the loss of FI, and consequently of estimation performance, due to quantization. The main objective from now on will be to minimize L through the choice of the quantizer intervals when N_I is large. Notice that minimizing L as defined here is equivalent to minimizing L_q defined in Ch. 3.

4.1.3 Asymptotic approximation of the loss

Similarly to standard quantization for measurement reconstruction, where optimal nonuniform quantization intervals can be approximated for large N_I, an approximation for I_q will now be developed. The loss L, which is an expectation under the measure F, can be rewritten as a sum of integrals, each term corresponding to the loss produced by a quantization interval:

L = \sum_{i=1}^{N_I} \int_{q_i} \left(\frac{\partial \log f(y;x)}{\partial x} - \frac{\partial \log P(i;x)}{\partial x}\right)^2 f(y;x)\, dy.   (4.6)
First term: ∂ log f(y;x)/∂x

For the interval with index i, the PDF can be approximated with a Taylor series around the central point y_i = (τ_i + τ_{i−1})/2:

f(y;x) = f_i + f_i^{(y)} (y - y_i) + \frac{f_i^{(yy)}}{2} (y - y_i)^2 + o\left((y - y_i)^2\right),   (4.7)

where the superscripts indicate the variables with respect to which the function is differentiated and the subscript indicates that the function (after differentiation) is evaluated at y_i. It will be assumed that the sequences of intervals for increasing N_I are chosen such that, for any ε > 0, it is possible to find an N_I^* for which

\frac{\left|o\left((y - y_i)^2\right)\right|}{(y - y_i)^2} < \varepsilon, \quad \text{for } N_I > N_I^*, \; y \in q_i.   (4.8)

Under the assumption that f > 0, the logarithm of f on the interval q_i can also be approximated using a Taylor series:

\log f(y;x) = \log f_i + (\log f)_i^{(y)} (y - y_i) + \frac{(\log f)_i^{(yy)}}{2} (y - y_i)^2 + o\left((y - y_i)^2\right),

and the derivative w.r.t. x is

\frac{\partial \log f(y;x)}{\partial x} = (\log f)_i^{(x)} + (\log f)_i^{(yx)} (y - y_i) + (\log f)_i^{(yyx)} \frac{(y - y_i)^2}{2} + o\left((y - y_i)^2\right),   (4.9)

which is an expression for the continuous score function on q_i to be used in (4.6).

Second term: ∂ log P(i;x)/∂x

Now, the other term in the squared factor must be calculated. Integrating the PDF in (4.7) on the interval q_i, which has length denoted by Δ_i, one gets

P(i;x) = f_i \Delta_i + f_i^{(yy)} \frac{\Delta_i^3}{24} + o\left(\Delta_i^3\right).   (4.10)

Note that the term in Δ_i² is zero, as y_i is the interval central point and the integral of (y − y_i) around it is zero. The logarithm of P(i;x) can be obtained by dividing the second and third terms of the right-hand side of (4.10) by the first term and then using the Taylor series log(1 + x) = x + o(x). Differentiating the resulting expression w.r.t. x gives

\frac{\partial \log P(i;x)}{\partial x} = (\log f)_i^{(x)} + \left(\frac{f^{(yy)}}{f}\right)_i^{(x)} \frac{\Delta_i^2}{24} + o\left(\Delta_i^2\right).   (4.11)

Loss L

Subtracting (4.11) from (4.9) and squaring makes the leading term with least power in (y − y_i) or in Δ_i to be (\log f)_i^{(yx)} (y - y_i).
When we square this difference and multiply by the Taylor series of f, the leading term is \left[(\log f)_i^{(yx)}\right]^2 f_i (y - y_i)^2, and all other terms have larger powers of (y − y_i) and/or Δ_i. Therefore, after integrating the squared difference multiplied by the Taylor series of f, we get

L = \sum_{i=1}^{N_I} \left[(\log f)_i^{(yx)}\right]^2 f_i \frac{\Delta_i^3}{12} + o\left(\Delta_i^3\right) = \sum_{i=1}^{N_I} \left(S_{c,i}^{(y)}\right)^2 f_i \frac{\Delta_i^3}{12} + o\left(\Delta_i^3\right),   (4.12)

where we have used the fact that f is smooth enough so that we can change the order of differentiation between y and x to get (\log f)_i^{(yx)} = S_{c,i}^{(y)}. To obtain a characterization w.r.t. the quantization intervals, an interval density function λ(y) is defined:

\lambda(y) = \lambda_i = \frac{1}{N_I \Delta_i}, \quad \text{for } y \in q_i.   (4.13)

The interval density, when integrated over an interval, gives, roughly, the fraction of the number of quantization intervals contained in that interval. It is a positive function that always integrates to one¹. Rewriting (4.12) with this density gives

L = \sum_{i=1}^{N_I} \frac{\left(S_{c,i}^{(y)}\right)^2 f_i \Delta_i}{12 N_I^2 \lambda_i^2} + o\left(\frac{\Delta_i}{N_I^2}\right).   (4.14)

As N_I → ∞, it will be supposed that all Δ_i converge uniformly to zero. Therefore,

\lim_{N_I \to \infty} N_I^2 L = \frac{1}{12} \int \frac{\left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)}{\lambda^2(y)}\, dy.   (4.15)

This asymptotic expression for the loss gives the following approximation for the FI:

I_q \approx I_c - \frac{1}{12 N_I^2} \int \frac{\left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)}{\lambda^2(y)}\, dy,   (4.16)

which is valid for large N_I. Note that as N_I in (4.16) tends to infinity, if the quantizer intervals are chosen such that all Δ_i tend to zero uniformly, then the asymptotic estimation performance for quantized measurements tends to the estimation performance for continuous measurements.

4.1.4 Optimal fixed rate encoding

In the fixed rate encoding scheme, all the outputs of the quantizer are encoded with binary words of the same length, namely N_B = \log_2(N_I). Thus, we can rewrite (4.16) using the number of bits N_B instead of the number of intervals N_I. This gives

I_q \approx I_c - \frac{2^{-2 N_B}}{12} \int \frac{\left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)}{\lambda^2(y)}\, dy.   (4.17)

¹ The Riemann sum is equal to one: \sum_{i=1}^{N_I} \frac{1}{N_I \Delta_i} \Delta_i = 1 \approx \int \lambda(y)\, dy.
This shows that the FI for quantized measurements under fixed rate encoding tends exponentially to the FI for continuous measurements as the number of bits increases. Moreover, the constant multiplying the exponential depends not only on the measurement distribution and on the estimation problem, through f and S_c, but also on the quantizer intervals through λ.

Optimal interval density

We can characterize asymptotically the optimal quantizer for estimation by defining an optimization problem using (4.16) as the function to be maximized w.r.t. λ. To find the optimal λ when N_B is large, we must solve the following optimization problem:

minimize w.r.t. λ(y):  \int \frac{\left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)}{\lambda^2(y)}\, dy,
subject to:  \int \lambda(y)\, dy = 1, \quad \lambda(y) > 0,

where the equality and inequality constraints on λ come from its definition as a density. This minimization problem can be solved using Hölder's inequality, which states [Hardy 1988, p. 140] that for two functions h(y) and g(y)

\left(\int |h(y)|^p\, dy\right)^{1/p} \left(\int |g(y)|^q\, dy\right)^{1/q} \geq \int |h(y)\, g(y)|\, dy,

with equality when h^p(y) \propto g^q(y) and 1/p + 1/q = 1.

Setting p = 3, q = 3/2, h(y) = \left[\frac{\left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)}{\lambda^2(y)}\right]^{1/3} and g(y) = \lambda^{2/3}(y) in Hölder's inequality, and using the constraint that the density must integrate to one, we obtain the following optimal interval density:

\lambda^\star(y) = \frac{\left|\frac{\partial S_c(y;x)}{\partial y}\right|^{2/3} f^{1/3}(y;x)}{\int \left|\frac{\partial S_c(y;x)}{\partial y}\right|^{2/3} f^{1/3}(y;x)\, dy} \propto \left|\frac{\partial S_c(y;x)}{\partial y}\right|^{2/3} f^{1/3}(y;x),   (4.18)

and the corresponding maximum FI given by this density is

I_q^\star \approx I_c - \frac{1}{12 N_I^2} \left[\int \left|\frac{\partial S_c(y;x)}{\partial y}\right|^{2/3} f^{1/3}(y;x)\, dy\right]^3.   (4.19)

Remark: in standard quantization for minimum MSE measurement reconstruction, the optimal interval density is given by [Gersho 1992, p. 186]

\lambda^\star_{rec}(y) = \frac{f^{1/3}(y;x)}{\int f^{1/3}(y;x)\, dy} \propto f^{1/3}(y;x).
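For Gaussian location estimation, S_c = (y − x)/σ², so ∂S_c/∂y = 1/σ² is constant and (4.18) reduces to exactly this reconstruction density f^{1/3}. A small numeric sketch (an illustration assuming σ = 1, not from the thesis) checks the bracketed constant of (4.19), which in this case works out to 6√3·π:

```python
import math

def f(y):
    # standard Gaussian PDF
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

# Riemann-sum integration of f^(1/3) over a wide interval; f^(1/3) is a
# scaled Gaussian of standard deviation sqrt(3), so [-30, 30] covers it.
N, a = 200000, 30.0
h = 2.0 * a / N
integral = sum(f(-a + k * h) ** (1.0 / 3.0) for k in range(N + 1)) * h

loss_constant = integral ** 3  # the bracketed [Int ...]^3 factor of (4.19)
assert abs(loss_constant - 6.0 * math.sqrt(3.0) * math.pi) < 1e-3
```

So for Gaussian noise the asymptotic loss under the optimal density is approximately 6√3·π / (12·N_I²) per measurement.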
Therefore, the main difference from standard quantization is the additional factor depending on the derivative of the score function.

Practical approximation of the interval density

From the definition of the interval density, the fraction of intervals up to interval q_i, i/N_I, must be equal to the integral of the interval density from τ_min to τ_i. Thus, a practical way of approximating the optimal thresholds is to set

\tau_i^\star = F_\lambda^{-1}\left(\frac{i}{N_I}\right), \quad \text{for } i \in \{1, \cdots, N_I - 1\},   (4.20)

where F_\lambda^{-1} is the inverse of the cumulative distribution function (CDF) related to λ. An important issue in evaluating the τ_i is that they may depend explicitly on x, which is the parameter we want to estimate. A possible solution to this problem is to initially set the τ_i with an arbitrary guess of x, then estimate x using an initial set of measurements, and finally update the thresholds with the estimate. This procedure can be performed in an adaptive way, to get closer and closer to the optimal thresholds. We can use, for example, an adaptive scheme based on the MLE to perform estimation and threshold setting at the same time. For the location parameter estimation problem, it was shown that this adaptive scheme converges; thus, in this case, if we set the τ_i according to τ_i^⋆, we expect to obtain the optimal asymptotic performance when N → ∞ and N_I is large. Also in the location parameter case, a low complexity alternative, which gives asymptotically the same performance as the scheme based on the MLE, is the adaptive algorithm presented in Ch. 3. We will see through simulation later that the low complexity adaptive algorithm with the thresholds chosen using τ_i^⋆ achieves asymptotically (N → ∞) a performance close to I_q^⋆ given by (4.19), even for a moderate number of quantization intervals. We have the following solution to problem (c) (p.
174):

Solution to (c) – Asymptotic approximation of the FI for fixed rate encoding

(c1) The asymptotic approximation of the FI is given by (4.16):

I_q \approx I_c - \frac{1}{12 N_I^2} \int \frac{\left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)}{\lambda^2(y)}\, dy \approx I_c - \frac{2^{-2 N_B}}{12} \int \frac{\left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)}{\lambda^2(y)}\, dy,

where I_c and S_c are the FI and the score function for continuous measurements and λ(y) is the interval density.

• Maximization of I_q gives the optimal interval density (4.18):

\lambda^\star(y) = \frac{\left|\frac{\partial S_c(y;x)}{\partial y}\right|^{2/3} f^{1/3}(y;x)}{\int \left|\frac{\partial S_c(y;x)}{\partial y}\right|^{2/3} f^{1/3}(y;x)\, dy}.

• The corresponding asymptotic approximation of I_q is (4.19):

I_q^\star \approx I_c - \frac{1}{12 N_I^2} \left[\int \left|\frac{\partial S_c(y;x)}{\partial y}\right|^{2/3} f^{1/3}(y;x)\, dy\right]^3.

• A practical approximation of the asymptotically optimal thresholds using a finite number of quantization intervals is (4.20):

\tau_i^\star = F_\lambda^{-1}\left(\frac{i}{N_I}\right), \quad \text{for } i \in \{1, \cdots, N_I - 1\},

where F_\lambda^{-1} is the inverse of the CDF related to the interval density. This CDF may depend on the true parameter x; therefore, it may be necessary to use an adaptive solution to obtain approximately optimal thresholds.

4.1.5 Variable rate encoding

It is known from information theory that the minimum average length H required for describing a discrete r.v. with a binary word is obtained by encoding its possible values (index j) with lengths l_j given by the negative logarithm of their probabilities p_j [Cover 2006, p. 111]:

l_j = -\log_2(p_j).

For a r.v. with n possible values, this way of encoding gives the following average length:

H = -\sum_{j=1}^{n} p_j \log_2(p_j),

which is the minimum average length and is also the entropy of the r.v.. To achieve rate requirements in the problem of estimation based on quantized measurements, instead of using the fixed rate encoding scheme, we can use a scheme with variable rate, where the outputs of the quantizer are coded with binary words with possibly different lengths.
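The potential saving of variable rate over fixed rate encoding can be seen on a small example (an illustration assuming Gaussian measurements and an arbitrary uniform quantizer, not a setup from the thesis): the entropy of the output index is never larger than log₂(N_I), and is strictly smaller whenever the output probabilities are unequal.

```python
import math

def Phi(u):
    # standard Gaussian CDF
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def output_entropy(thresholds, x=0.0):
    # entropy -sum p log2 p of the quantizer output index
    taus = [-math.inf] + list(thresholds) + [math.inf]
    probs = [Phi(hi - x) - Phi(lo - x) for lo, hi in zip(taus[:-1], taus[1:])]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

thresholds = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]   # NI = 8, so NB = 3 fixed rate
Hq = output_entropy(thresholds)
assert Hq < 3.0   # variable rate is cheaper on average than 3 bits per output
```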
The lengths of the outputs can be defined as above, leading to the following minimum average length:

H_q = -\sum_{i=1}^{N_I} P(i;x) \log_2\left[P(i;x)\right].   (4.21)

Suppose that the communication channel imposes a constraint on the maximum H_q, so that for H_q lower than or equal to the maximum, transmission through this channel occurs without any error; this constraint, which is the capacity of the channel, will be denoted R². The main objective now is to set the quantizer thresholds for a given N_I so that the FI I_q is maximized under the constraint H_q ≤ R. As this problem is complicated to solve for finite N_I, we will again use the asymptotic approach to characterize the optimal quantizer through λ. The asymptotic expression for I_q was already developed above and is given by (4.16):

I_q \approx I_c - \frac{1}{12 N_I^2} \int \frac{\left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)}{\lambda^2(y)}\, dy.

We now need to develop an asymptotic approximation for the entropy H_q. Using the Taylor series development for P(i;x) given in (4.10) in the expression for H_q (4.21), we have

H_q = -\sum_{i=1}^{N_I} \left[f_i \Delta_i + f_i^{(yy)} \frac{\Delta_i^3}{24} + o\left(\Delta_i^3\right)\right] \log_2\left[f_i \Delta_i + f_i^{(yy)} \frac{\Delta_i^3}{24} + o\left(\Delta_i^3\right)\right].

Separating the factor f_i Δ_i inside the logarithm, using the Taylor expansion for \log_2(1 + x) and multiplying the terms in the resulting expression gives

H_q = -\sum_{i=1}^{N_I} \left[f_i \Delta_i \log_2(f_i) + f_i \Delta_i \log_2(\Delta_i) + o\left(\Delta_i^2\right)\right].

Using the interval density Δ_i = 1/(N_I λ_i) in the term with \log_2(\Delta_i) leads to

H_q = -\sum_{i=1}^{N_I} \left[f_i \Delta_i \log_2(f_i) - f_i \Delta_i \log_2(\lambda_i) - f_i \Delta_i \log_2(N_I) + o\left(\Delta_i^2\right)\right].

² We have supposed in the Introduction that efficient channel coding is used, so that we can assume error-free transmission for rates below the channel capacity.
When N_I is large and the Δ_i are small, the sums can be approximated by integrals:

H_q \approx -\int f(y;x) \log_2\left[f(y;x)\right] dy + \int f(y;x) \log_2\left[\lambda(y)\right] dy + \log_2(N_I),

where, to obtain the term \log_2(N_I), we used the fact that \sum_{i=1}^{N_I} f_i \Delta_i is asymptotically close to one, as it is approximately the integral of the PDF. The integral -\int f(y;x) \log_2\left[f(y;x)\right] dy is known [Cover 2006, p. 243] as the differential entropy of the r.v. Y; therefore, from now on we will denote it h_y:

H_q \approx h_y + \int f(y;x) \log_2\left[\lambda(y)\right] dy + \log_2(N_I).   (4.22)

For large N_I, using the integral in expression (4.16) and the approximation of the entropy (4.22), we can define the following optimization problem:

minimize w.r.t. λ(y):  \int \frac{\left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)}{\lambda^2(y)}\, dy,
subject to:  \int f(y;x) \log_2\left[\lambda(y)\right] dy \leq R - h_y - \log_2(N_I),
             \int \lambda(y)\, dy = 1, \quad \lambda(y) > 0.

The solution to this problem can be adapted from the development presented in [Li 1999]. First, we define the function p(y):

p(y) = \frac{\left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)}{\int \left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)\, dy}.

Then the integral to be minimized can be rewritten as

\int \frac{\left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)}{\lambda^2(y)}\, dy = \left\{\int \left(\frac{\partial S_c(y;x)}{\partial y}\right)^2 f(y;x)\, dy\right\} \left\{\int \frac{p(y)}{\lambda^2(y)}\, dy\right\},

where we note that only the second factor depends on λ. Thus we can redefine the optimization problem as

minimize w.r.t. λ(y):  \int \frac{p(y)}{\lambda^2(y)}\, dy,
subject to:  \int f(y;x) \log_2\left[\lambda(y)\right] dy \leq R - h_y - \log_2(N_I),
             \int \lambda(y)\, dy = 1, \quad \lambda(y) > 0.

To find the optimal λ, we take the logarithm of the integral to be minimized,

\log_2 \int \frac{p(y)}{\lambda^2(y)}\, dy = \log_2 \int \frac{p(y)}{\lambda^2(y) f(y;x)} f(y;x)\, dy,

and we apply Jensen's inequality (the logarithm is a concave function):

\log_2 \int \frac{p(y)}{\lambda^2(y) f(y;x)} f(y;x)\, dy \geq \int \log_2\left[\frac{p(y)}{\lambda^2(y) f(y;x)}\right] f(y;x)\, dy.

Now we exponentiate both sides of the inequality:
(4.23)

To obtain equality in Jensen's inequality, the argument of the logarithm in the RHS of (4.23) must be constant; thus

$$\lambda^\star(y) \propto \left[\frac{p(y)}{f(y;x)}\right]^{\frac{1}{2}} = \frac{\left|\frac{\partial S_c(y;x)}{\partial y}\right|}{\left\{\int \left[\frac{\partial S_c(y;x)}{\partial y}\right]^2 f(y;x)\,\mathrm{d}y\right\}^{\frac{1}{2}}}.$$

Integrating the constraint that $\lambda(y)$ is a PDF makes the constant in the denominator of the expression above disappear, thus giving

$$\lambda^\star(y) = \frac{\left|\frac{\partial S_c(y;x)}{\partial y}\right|}{\int \left|\frac{\partial S_c(y;x)}{\partial y}\right|\mathrm{d}y}. \quad (4.24)$$

The exponential in (4.23) can be written as a function of the rate constraint. We multiply the rate constraint by $-2$ and add $h_y$ to both sides:

$$\int \log_2\left[\frac{1}{\lambda^2(y)\,f(y;x)}\right] f(y;x)\,\mathrm{d}y \ge -2R + 3h_y + 2\log_2\left(N_I\right).$$

Finally, we add $\int \log_2\left[p(y)\right] f(y;x)\,\mathrm{d}y$ to obtain

$$\int \log_2\left[\frac{p(y)}{\lambda^2(y)\,f(y;x)}\right] f(y;x)\,\mathrm{d}y \ge -2R + 3h_y + 2\log_2\left(N_I\right) + \int \log_2\left[p(y)\right] f(y;x)\,\mathrm{d}y. \quad (4.25)$$

The integral in the RHS of (4.25) is

$$\int \log_2\left[p(y)\right] f(y;x)\,\mathrm{d}y = \int \log_2\left\{\left[\frac{\partial S_c(y;x)}{\partial y}\right]^2\right\} f(y;x)\,\mathrm{d}y + \int \log_2\left[f(y;x)\right] f(y;x)\,\mathrm{d}y - \log_2\left\{\int \left[\frac{\partial S_c(y';x)}{\partial y'}\right]^2 f\left(y';x\right)\mathrm{d}y'\right\}$$
$$= \int \log_2\left\{\left[\frac{\partial S_c(y;x)}{\partial y}\right]^2\right\} f(y;x)\,\mathrm{d}y - h_y - \log_2\left\{\int \left[\frac{\partial S_c(y;x)}{\partial y}\right]^2 f(y;x)\,\mathrm{d}y\right\}.$$

Substituting the expression above in (4.25) and the result in (4.23), we obtain the minimum value of the integral in the optimization problem. This value is

$$2^{-2\left[R - h_y - \int \log_2\left|\frac{\partial S_c(y;x)}{\partial y}\right| f(y;x)\,\mathrm{d}y + \frac{1}{2}\log_2\left\{\int \left[\frac{\partial S_c(y;x)}{\partial y}\right]^2 f(y;x)\,\mathrm{d}y\right\} - \log_2\left(N_I\right)\right]}.$$

Substituting this value in the approximation of the FI, we get

$$I_q \approx I_c - \frac{1}{12}\,2^{-2\left[R - h_y - \int \log_2\left|\frac{\partial S_c(y;x)}{\partial y}\right| f(y;x)\,\mathrm{d}y\right]}. \quad (4.26)$$

Notice that here again the FI for quantized measurements tends exponentially to the FI for continuous measurements; the exponential decay rate is sensitive to the randomness of the continuous measurements and to the derivative of the score function. The difference in the quantizer characterization w.r.t.
the fixed-rate encoding scheme is that the interval density is now dictated only by the derivative of the score function: we must place more intervals around the values of $Y$ where the score function varies the most. Observe that the quantizer interval density may also depend on the true parameter value, as the score function may be a function of it. Thus, similarly to the fixed-rate scheme, it will be necessary to set the thresholds adaptively.

The main problem now is that we need to know the probabilities of the quantizer outputs to encode each output with its proper length; however, as we do not know the measurement distribution completely, we cannot encode the codewords properly. As a solution, we can make the encoding adaptive as well, encoding with the distribution corresponding to the most recent estimate of the parameter. The drawback of this solution is that we cannot encode correctly at the beginning of the adaptive estimation procedure: we will be penalized in terms of average length in the initial part of the procedure and, as a consequence, we will not respect the rate constraint. Thus, this solution is still incomplete. Further work will be necessary: we can try to quantify the increase in rate at the beginning of the estimation procedure, or we can try to find a variable-rate encoding scheme that quantizes the measurements properly without knowing the true parameter value.

4.1.6 Estimation of GGD and STD location and scale parameters

We will apply the results given in solution (c1) (p. 186) to obtain the approximately optimal quantization thresholds for the estimation of location and scale parameters of the GGD and the STD. Notice that even if their support is unbounded, as in standard quantization theory, the error caused by neglecting the extremal regions (the overload region) is expected to be small.
Results for the estimation of a GGD location parameter

The first step to obtain the approximately optimal thresholds is to evaluate the optimal interval density given by (4.18). Thus, we start by calculating the derivative of the score function w.r.t. $x$ and $y$. Differentiating the logarithm of the GGD PDF (1.39)

$$f(y;x) = \frac{\beta}{2\delta\Gamma\left(\frac{1}{\beta}\right)}\exp\left(-\left|\frac{y-x}{\delta}\right|^{\beta}\right)$$

for $\beta > 1$, we obtain

$$\frac{\partial S_c^x(y;x)}{\partial y} = \frac{\beta(\beta-1)}{\delta^2}\left|\frac{y-x}{\delta}\right|^{\beta-2}.$$

Note that for $\beta \le 1$, which includes the Laplacian case, the score function is not differentiable at $x$; thus, we cannot evaluate the interval density in these cases. For $\beta > 1$, raising the expression above to the power $\frac{2}{3}$ and multiplying it by $f^{\frac{1}{3}}(y;x)$, we have the following interval density:

$$\lambda_{GGD}^{x}(y) = \frac{\left|\frac{y-x}{\delta}\right|^{\frac{2\beta-4}{3}}\exp\left(-\frac{1}{3}\left|\frac{y-x}{\delta}\right|^{\beta}\right)}{C}, \quad (4.27)$$

where $C$ is a constant normalizing the density. Using the symmetry of the density, this constant can be evaluated as the following integral:

$$C = 2\int_{x}^{+\infty}\left(\frac{y-x}{\delta}\right)^{\frac{2\beta-4}{3}}\exp\left[-\frac{1}{3}\left(\frac{y-x}{\delta}\right)^{\beta}\right]\mathrm{d}y.$$

An expression for this integral can be obtained by the change of variables $\varepsilon = \frac{1}{3}\left(\frac{y-x}{\delta}\right)^{\beta}$ and by identifying the resulting integral factor with the gamma function. This gives

$$C = \frac{2\delta}{\beta}\,3^{\frac{1}{3}\left(2-\frac{1}{\beta}\right)}\,\Gamma\left[\frac{1}{3}\left(2-\frac{1}{\beta}\right)\right].$$

Now we can obtain the CDF related to the interval density. Exploiting again the symmetry of the distribution, we can obtain the CDF by integrating the density only for values of $y$ larger than $x$. Also, by using the same change of variables used above for calculating $C$, we get

$$F_{\lambda,GGD}^{x}(y) = \frac{1}{2} + \frac{\operatorname{sign}(y-x)}{2}\,\frac{\gamma\left[\frac{1}{3}\left(2-\frac{1}{\beta}\right),\,\frac{1}{3}\left|\frac{y-x}{\delta}\right|^{\beta}\right]}{\Gamma\left[\frac{1}{3}\left(2-\frac{1}{\beta}\right)\right]},$$

where $\gamma[\cdot,\cdot]$ is the lower incomplete gamma function. Using the inverse of this function we can obtain the approximately optimal thresholds (4.20). For $i \in \{1,\cdots,N_I\}$,

$$\tau_{i,GGD}^{\star,x} = x + \delta\operatorname{sign}\left(\frac{2i}{N_I}-1\right)\left\{3\,\gamma^{-1}\left[\frac{1}{3}\left(2-\frac{1}{\beta}\right),\,\left|\frac{2i}{N_I}-1\right|\Gamma\left(\frac{1}{3}\left(2-\frac{1}{\beta}\right)\right)\right]\right\}^{\frac{1}{\beta}}, \quad (4.28)$$

where $\gamma^{-1}\{\cdot,\cdot\}$ is the inverse of the incomplete gamma function w.r.t. its second argument. The interval densities for three GGD ($\beta = 1.5$, $2$ and $2.5$) are shown in Fig. 4.1.
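In the Gaussian case ($\beta = 2$), $\gamma\left(\frac{1}{2},z\right)/\Gamma\left(\frac{1}{2}\right) = \operatorname{erf}(\sqrt{z})$, so the CDF above has a closed form and the thresholds can be obtained by simple bisection. The following is a minimal sketch of this computation, not code from the thesis (function and variable names are illustrative):

```python
import math

def F_lambda_gauss(y, x=0.0, delta=1.0):
    # CDF of the optimal interval density (4.27) for beta = 2 (Gaussian):
    # gamma(1/2, u^2/3) / Gamma(1/2) = erf(|u| / sqrt(3)), with u = (y - x) / delta
    u = (y - x) / delta
    return 0.5 + 0.5 * math.copysign(1.0, u) * math.erf(abs(u) / math.sqrt(3.0))

def thresholds(n_intervals, x=0.0, delta=1.0):
    # invert the CDF by bisection: F(tau_i) = i / N_I, i = 1, ..., N_I - 1
    taus = []
    for i in range(1, n_intervals):
        target = i / n_intervals
        lo, hi = x - 20.0 * delta, x + 20.0 * delta
        for _ in range(80):
            mid = 0.5 * (lo + hi)
            if F_lambda_gauss(mid, x, delta) < target:
                lo = mid
            else:
                hi = mid
        taus.append(0.5 * (lo + hi))
    return taus

taus = thresholds(16)  # N_I = 16 intervals, i.e. N_B = 4 bits
```

As expected from the symmetry of the interval density, the computed thresholds are symmetric around $x$, with the central threshold at $x$ when $N_I$ is even.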
Figure 4.1: Interval densities $\delta\times\lambda_{GGD}^{x}$ for the estimation of a GGD location parameter, as functions of $\frac{y-x}{\delta}$. The GGD shape parameters are $\beta = 1.5$, $2$ (Gaussian) and $2.5$. Both axes are normalized so that the plots are independent of $x$ and $\delta$.

A few remarks can be made based on the results above:

• In Ch. 1 we saw that binary quantization is optimal for the Laplacian distribution, as long as the quantizer threshold is placed at the true parameter. This singular behavior might be related to the difficulty of defining the optimal interval density in this case.

• Observe that for $1 < \beta < 2$ (see Fig. 4.1 for $\beta = 1.5$), the interval density tends to infinity at $y = x$, showing the importance of quantizing around this point for these distributions.

• Notice that within the subclass of GGD for which the interval density at $x$ is finite, the Gaussian distribution is the distribution with the lowest $\beta$. Notice also that for $\beta > 2$ (see Fig. 4.1 for $\beta = 2.5$), the maximum of the interval density is not placed exactly at $y = x$, showing that a relation might exist between the multimodality of the threshold distribution and the asymmetric behavior of optimal binary quantization. It also shows that the Gaussian distribution lies exactly between two subclasses of the GGD family: one for which quantizing around the true parameter is very informative ($1 < \beta < 2$) and another for which quantizing symmetrically around the parameter, but not at the parameter, is informative ($\beta > 2$).

• Observing the symmetry of the interval density, we can see that, asymptotically, the best quantizer is symmetric around the parameter. Thus, if we choose $N_I$ to be a large even number, the optimal central threshold might be placed at $x$.
For $\beta > 2$, if we have a moderate odd number of quantization intervals, the interval density indicates that the optimal quantizer will probably be asymmetric, as we will have to place more quantization intervals around one of the modes of the interval density.

Results for the estimation of a GGD scale parameter

We now evaluate the derivative of the score function for $\delta$ w.r.t. $y$. This gives

$$\frac{\partial S_c^{\delta}(y;\delta)}{\partial y} = \frac{\beta^2}{\delta^2}\left|\frac{y-x}{\delta}\right|^{\beta-1}\operatorname{sign}(y-x).$$

Differently from the location problem, the derivative above exists for all positive $\beta$. Using this derivative and the expression for $f(y;\delta)$ in (4.18), we have

$$\lambda_{GGD}^{\delta}(y) = \frac{\left|\frac{y-x}{\delta}\right|^{\frac{2\beta-2}{3}}\exp\left(-\frac{1}{3}\left|\frac{y-x}{\delta}\right|^{\beta}\right)}{C}, \quad (4.29)$$

where the normalizing constant can be obtained using the symmetry of the numerator

$$C = 2\int_{x}^{+\infty}\left(\frac{y-x}{\delta}\right)^{\frac{2\beta-2}{3}}\exp\left[-\frac{1}{3}\left(\frac{y-x}{\delta}\right)^{\beta}\right]\mathrm{d}y.$$

Changing variables with $\varepsilon = \frac{1}{3}\left(\frac{y-x}{\delta}\right)^{\beta}$ and using the gamma function to rewrite the result, we get

$$C = \frac{2\delta}{\beta}\,3^{\frac{1}{3}\left(2+\frac{1}{\beta}\right)}\,\Gamma\left[\frac{1}{3}\left(2+\frac{1}{\beta}\right)\right].$$

Using again the symmetry and a development similar to the one used to obtain the CDF of the interval density in the location problem, we have

$$F_{\lambda,GGD}^{\delta}(y) = \frac{1}{2} + \frac{\operatorname{sign}(y-x)}{2}\,\frac{\gamma\left[\frac{1}{3}\left(2+\frac{1}{\beta}\right),\,\frac{1}{3}\left|\frac{y-x}{\delta}\right|^{\beta}\right]}{\Gamma\left[\frac{1}{3}\left(2+\frac{1}{\beta}\right)\right]}.$$

Its inverse gives the threshold approximation. For $i \in \{1,\cdots,N_I\}$,

$$\tau_{i,GGD}^{\star,\delta} = x + \delta\operatorname{sign}\left(\frac{2i}{N_I}-1\right)\left\{3\,\gamma^{-1}\left[\frac{1}{3}\left(2+\frac{1}{\beta}\right),\,\left|\frac{2i}{N_I}-1\right|\Gamma\left(\frac{1}{3}\left(2+\frac{1}{\beta}\right)\right)\right]\right\}^{\frac{1}{\beta}}. \quad (4.30)$$

The main difference w.r.t. the location parameter case is that now it is the Laplacian distribution that lies at the border between the distributions for which $x$ is a very informative point and the distributions for which most of the information lies around $x$ but not at $x$. Note also that the interval density still depends on $\delta$; thus, as said before, an adaptive scheme is necessary to place the thresholds optimally.

Results for the estimation of a STD location parameter

Using the STD PDF (3.72)

$$f(y;x) = \frac{\Gamma\left(\frac{\beta+1}{2}\right)}{\sqrt{\beta\pi}\,\Gamma\left(\frac{\beta}{2}\right)\delta}\left[1+\frac{1}{\beta}\left(\frac{y-x}{\delta}\right)^2\right]^{-\frac{\beta+1}{2}},$$
the derivative of the score function is

$$\frac{\partial S_c^x(y;x)}{\partial y} = \frac{\beta+1}{\beta\delta^2}\,\frac{1-\frac{1}{\beta}\left(\frac{y-x}{\delta}\right)^2}{\left[1+\frac{1}{\beta}\left(\frac{y-x}{\delta}\right)^2\right]^2}.$$

Replacing this expression and the PDF expression above in the interval density (4.18), we obtain

$$\lambda_{STD}^{x}(y) = \frac{1}{C}\,\frac{\left|1-\frac{1}{\beta}\left(\frac{y-x}{\delta}\right)^2\right|^{\frac{2}{3}}}{\left[1+\frac{1}{\beta}\left(\frac{y-x}{\delta}\right)^2\right]^{\frac{9+\beta}{6}}}. \quad (4.31)$$

In general, the constant $C$ and the corresponding CDF cannot be expressed analytically with known special functions. To obtain a general expression for the thresholds, it might be necessary to integrate the density numerically for each $y$ and then to invert an interpolation of the numerical integration. In the special case of a Cauchy distribution ($\beta = 1$), we can evaluate the constant in the density and the CDF analytically. For this distribution the interval density is

$$\lambda_{C}^{x}(y) = \frac{1}{C}\,\frac{\left|1-\left(\frac{y-x}{\delta}\right)^2\right|^{\frac{2}{3}}}{\left[1+\left(\frac{y-x}{\delta}\right)^2\right]^{\frac{5}{3}}}. \quad (4.32)$$

From the symmetry, the constant is

$$C = 2\int_{x}^{+\infty}\frac{\left|1-\left(\frac{y-x}{\delta}\right)^2\right|^{\frac{2}{3}}}{\left[1+\left(\frac{y-x}{\delta}\right)^2\right]^{\frac{5}{3}}}\,\mathrm{d}y.$$

Using the change of variables $\tan\left(\frac{\theta}{2}\right) = \frac{y-x}{\delta}$, we obtain

$$C = \delta\int_{0}^{\pi}\left|\cos^2\left(\frac{\theta}{2}\right)-\sin^2\left(\frac{\theta}{2}\right)\right|^{\frac{2}{3}}\mathrm{d}\theta = 2\delta\int_{0}^{\frac{\pi}{2}}\cos^{\frac{2}{3}}(\theta)\,\mathrm{d}\theta,$$

where the second equality was obtained using a relation between trigonometric functions and the periodic pattern of the resulting function on the interval $[0,\pi)$. Using another change of variables $u = \cos^2(\theta)$ and identifying the resulting integral factor with the beta function, we have

$$C = \delta B\left(\frac{1}{2},\frac{5}{6}\right).$$

Exploiting the symmetry of the interval density and using a similar development, we can obtain the CDF related to the interval density

$$F_{\lambda,C}^{x}(y) = \frac{1}{2} + \frac{\operatorname{sign}(y-x)}{2}\,\frac{\int_{0}^{\phi}\left|\cos(\theta)\right|^{\frac{2}{3}}\mathrm{d}\theta}{B\left(\frac{1}{2},\frac{5}{6}\right)},$$

with $\phi = 2\arctan\left(\frac{|y-x|}{\delta}\right)$. Using again the change of variables $u = \cos^2(\theta)$, the integral above can be rewritten with the regularized incomplete beta function $I_{(\cdot)}(\cdot,\cdot)$:

$$\int_{0}^{\phi}\left|\cos(\theta)\right|^{\frac{2}{3}}\mathrm{d}\theta = \begin{cases}\frac{1}{2}B\left(\frac{1}{2},\frac{5}{6}\right)I_{\sin^2\phi}\left(\frac{1}{2},\frac{5}{6}\right), & \text{for }\phi\in\left[0,\frac{\pi}{2}\right],\\ \frac{1}{2}B\left(\frac{1}{2},\frac{5}{6}\right)\left[2-I_{\sin^2\phi}\left(\frac{1}{2},\frac{5}{6}\right)\right], & \text{for }\phi\in\left(\frac{\pi}{2},\pi\right].\end{cases}$$
Transforming $\phi$ back into the original variable, with $t = \frac{|y-x|}{\delta}$ we have $\sin^2\phi = \frac{4t^2}{\left(1+t^2\right)^2}$, and the CDF becomes

$$F_{\lambda,C}^{x}(y) = \frac{1}{2} + \frac{\operatorname{sign}(y-x)}{4}\times\begin{cases}I_{\frac{4t^2}{(1+t^2)^2}}\left(\frac{1}{2},\frac{5}{6}\right), & \text{for }t\le 1,\\ 2-I_{\frac{4t^2}{(1+t^2)^2}}\left(\frac{1}{2},\frac{5}{6}\right), & \text{for }t> 1.\end{cases}$$

Inverting the CDF we can obtain the approximate expression for the thresholds. For $i \in \{1,\cdots,N_I\}$ and $i' = i-\frac{N_I}{2}$, denoting by $I^{-1}_{(\cdot)}\left(\frac{1}{2},\frac{5}{6}\right)$ the inverse of the regularized incomplete beta function w.r.t. its subscript argument and setting

$$S_i = \begin{cases}I^{-1}_{\left(\frac{4|i'|}{N_I}\right)}\left(\frac{1}{2},\frac{5}{6}\right), & \text{when }\frac{|i'|}{N_I}\le\frac{1}{4},\\ I^{-1}_{\left(2-\frac{4|i'|}{N_I}\right)}\left(\frac{1}{2},\frac{5}{6}\right), & \text{when }\frac{|i'|}{N_I}>\frac{1}{4},\end{cases}$$

the thresholds are

$$\tau_{i,C}^{\star,x} = \begin{cases}x+\delta\operatorname{sign}\left(i'\right)\sqrt{\dfrac{1-\sqrt{1-S_i}}{1+\sqrt{1-S_i}}}, & \text{when }\frac{|i'|}{N_I}\le\frac{1}{4},\\ x+\delta\operatorname{sign}\left(i'\right)\sqrt{\dfrac{1+\sqrt{1-S_i}}{1-\sqrt{1-S_i}}}, & \text{when }\frac{|i'|}{N_I}>\frac{1}{4}.\end{cases} \quad (4.33)$$

An interesting point about the optimal interval density for the estimation of a location parameter of the STD is that it equals zero exactly at $x\pm\sqrt{\beta}\,\delta$, indicating that not much statistical information about the location parameter can be obtained around these points. If we observe the score function, we see that it has an "∽" shape; the zero-derivative points are related to the maximum and the minimum of the score. In a practical sense, the points larger than the maximum and smaller than the minimum can be seen as outliers, so for estimation purposes we might not be interested in quantizing around the transition points. Note, however, that from a threshold placement point of view, the only practical way of having a zero interval density at such a point is to place a threshold at it; therefore, in practice, we are interested in knowing whether the measurement is an outlier or not.

Results for the estimation of a STD scale parameter

For estimating the scale parameter of the STD, we have the following derivative of the score function:

$$\frac{\partial S_c^{\delta}(y;\delta)}{\partial y} = \frac{2(\beta+1)}{\delta^2}\,\frac{\frac{1}{\beta}\,\frac{y-x}{\delta}}{\left[1+\frac{1}{\beta}\left(\frac{y-x}{\delta}\right)^2\right]^2}.$$
This leads to the following interval density:

$$\lambda_{STD}^{\delta}(y) = \frac{1}{C}\,\frac{\left|\frac{1}{\sqrt{\beta}}\,\frac{y-x}{\delta}\right|^{\frac{2}{3}}}{\left[1+\frac{1}{\beta}\left(\frac{y-x}{\delta}\right)^2\right]^{\frac{\beta+9}{6}}}, \quad (4.34)$$

with $C$ given by

$$C = 2\int_{x}^{+\infty}\frac{\left(\frac{1}{\sqrt{\beta}}\,\frac{y-x}{\delta}\right)^{\frac{2}{3}}}{\left[1+\frac{1}{\beta}\left(\frac{y-x}{\delta}\right)^2\right]^{\frac{\beta+9}{6}}}\,\mathrm{d}y.$$

Using the change of variables $\varepsilon = \frac{\frac{1}{\beta}\left(\frac{y-x}{\delta}\right)^2}{1+\frac{1}{\beta}\left(\frac{y-x}{\delta}\right)^2}$ and identifying the resulting integral with the beta function, we obtain

$$C = \sqrt{\beta}\,\delta\,B\left(\frac{5}{6},\frac{\beta+4}{6}\right).$$

Exploiting the symmetry of the interval density and using the previous change of variables, we have the following CDF related to the interval density:

$$F_{\lambda,STD}^{\delta}(y) = \frac{1}{2} + \frac{\operatorname{sign}(y-x)}{2}\,I_{\frac{(y-x)^2}{\beta\delta^2+(y-x)^2}}\left(\frac{5}{6},\frac{\beta+4}{6}\right).$$

For $i \in \{1,\cdots,N_I\}$, the approximately optimal thresholds are then given by

$$\tau_{i,STD}^{\star,\delta} = x + \delta\operatorname{sign}\left(\frac{2i}{N_I}-1\right)\sqrt{\beta\,\frac{S_i}{1-S_i}}, \quad S_i = I^{-1}_{\left(\left|\frac{2i}{N_I}-1\right|\right)}\left(\frac{5}{6},\frac{\beta+4}{6}\right). \quad (4.35)$$

Note that, similarly to the GGD scale parameter estimation case, the point $x$ is not very informative: most of the quantizer intervals must be placed around $x$, but not very close to $x$.

4.1.7 Location parameter estimation

To check the results, we now focus on location parameter estimation. First observe that, using the normalized form of the PDF, $f(y;x) = \frac{1}{\delta}f_n\left(\frac{y-x}{\delta}\right)$, we can rewrite the interval density given by (4.18) as

$$\lambda^{\star}(y) \propto \left\{\left[\frac{\partial^2\log\left[\frac{1}{\delta}f_n\left(\frac{y-x}{\delta}\right)\right]}{\partial y\,\partial x}\right]^2 \frac{1}{\delta}f_n\left(\frac{y-x}{\delta}\right)\right\}^{\frac{1}{3}} \propto \left\{\frac{\left[f_n^{(1)2}\left(\frac{y-x}{\delta}\right)-f_n^{(2)}\left(\frac{y-x}{\delta}\right)f_n\left(\frac{y-x}{\delta}\right)\right]^2}{f_n^{3}\left(\frac{y-x}{\delta}\right)}\right\}^{\frac{1}{3}},$$

where $f_n^{(1)}$ and $f_n^{(2)}$ are the first and second derivatives of $f_n$ w.r.t. its argument. For an $f_n$ with even symmetry, $f_n^{(1)2}$ is even, $f_n^{(2)}$ is even and consequently $\lambda^{\star}(y)$ is symmetric around $x$. This means that for large $N_I$ the optimal quantizer is symmetric around $x$, indicating that, asymptotically, the asymmetry of the optimal quantizer for binary quantization under some distributions (Subsec. 1.3.4, p. 48) might disappear.
The asymptotic approximation of $I_q$ given by (4.19) can also be rewritten using the normalized PDF:

$$I_q \approx \frac{1}{\delta^2}\left\{I_{c,n}^{x} - \frac{2^{-2N_B}}{12}\left[\int \left(\frac{\left[f_n^{(1)2}(\varepsilon)-f_n^{(2)}(\varepsilon)f_n(\varepsilon)\right]^2}{f_n^{3}(\varepsilon)}\right)^{\frac{1}{3}}\mathrm{d}\varepsilon\right]^{3}\right\}, \quad (4.36)$$

where $I_{c,n}^{x}$ is the FI for estimating a location parameter when $\delta = 1$. Note that the FI approximation can be written as $\frac{\kappa(f_n)}{\delta^2}$, where $\kappa$ is a functional depending only on the normalized PDF and independent of $x$ and $\delta$. Therefore, we can characterize the optimal estimation performance based on quantized measurements for a whole family of distributions with different $\delta$ and $x$ simply by evaluating $\kappa(f_n)$.

FI for the Gaussian and Cauchy cases

We will check the results using the Gaussian (GGD with $\beta = 2$) and Cauchy (STD with $\beta = 1$) distributions. For the Gaussian distribution, the interval density (4.27) and the asymptotic approximation of the FI (4.36) are given by

$$\lambda_{G}^{x}(y) = \frac{1}{\sqrt{3\pi}\,\delta}\exp\left[-\left(\frac{y-x}{\sqrt{3}\,\delta}\right)^2\right], \qquad I_{q,G}^{x} \approx \frac{2}{\delta^2}\left[1-\frac{\sqrt{3}\,\pi}{2}\,2^{-2N_B}\right]. \quad (4.37)$$

We can note that the interval density in this case is exactly the same as for standard quantization (proportional to $f^{\frac{1}{3}}$). Thus, in the Gaussian case, when $N_I$ is large, the optimal quantizer for estimating the location parameter and the optimal quantizer for recovering the continuous measurement are the same. This coincidence between the optimal quantizers for estimation and for reconstruction happens whenever the derivative of the score function is constant. In the location parameter estimation case, this happens only for the Gaussian distribution; if we look at the scale parameter case, it happens for the Laplacian distribution. Observe also that if it were possible to implement the variable-rate encoder in the Gaussian case, then the optimal quantizer would be uniform and it would coincide with the optimal variable-rate quantizer for reconstruction, which is uniform [Gersho 1992, p. 299].
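Approximation (4.37) can be checked against a direct evaluation of the quantized FI (4.3), $I_q = \sum_i \left[f(\tau_{i-1};x)-f(\tau_i;x)\right]^2/P(i)$. The sketch below, a check rather than thesis code, does this for the normalized Gaussian ($x = 0$, $\delta = 1$, $I_c = 2$) with a fine uniform quantizer, for which the loss predicted by (4.16) is $\frac{\Delta^2}{12}\int\left(\frac{\partial S_c}{\partial y}\right)^2 f\,\mathrm{d}y = \frac{\Delta^2}{3}$:

```python
import math

SQRT_PI = math.sqrt(math.pi)

def f(y):
    # GGD with beta = 2, x = 0, delta = 1: f(y) = exp(-y^2) / sqrt(pi)
    return math.exp(-y * y) / SQRT_PI

def F(y):
    return 0.5 * (1.0 + math.erf(y))

def fisher_quantized(thresholds):
    # I_q = sum_i (f(tau_{i-1}) - f(tau_i))^2 / P(i), tau_0 = -inf, tau_{N_I} = +inf
    taus = [-math.inf] + list(thresholds) + [math.inf]
    iq = 0.0
    for a, b in zip(taus[:-1], taus[1:]):
        p = F(b) - F(a)
        if p > 0.0:
            fa = 0.0 if math.isinf(a) else f(a)
            fb = 0.0 if math.isinf(b) else f(b)
            iq += (fa - fb) ** 2 / p
    return iq

# fine uniform quantizer over [-5, 5]: expected loss close to step^2 / 3
step = 0.1
iq = fisher_quantized([-5.0 + step * j for j in range(101)])

# asymptotic value of (4.37) for NB = 4 bits: 2 - sqrt(3) * pi * 2^(-2 NB)
approx_nb4 = 2.0 - math.sqrt(3.0) * math.pi * 2.0 ** (-8)
```

The computed `approx_nb4` reproduces the $N_B = 4$ "Optimal⋆" entry of Tab. 4.1, and the uniform-quantizer loss matches the high-rate prediction.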
For the Cauchy distribution, the interval density (4.32) and the asymptotic FI approximation are the following:

$$\lambda_{C}^{x}(y) = \frac{1}{\delta B\left(\frac{1}{2},\frac{5}{6}\right)}\,\frac{\left|1-\left(\frac{y-x}{\delta}\right)^2\right|^{\frac{2}{3}}}{\left[1+\left(\frac{y-x}{\delta}\right)^2\right]^{\frac{5}{3}}}, \qquad I_{q,C}^{x} \approx \frac{1}{2\delta^2}\left[1-\frac{B^3\left(\frac{1}{2},\frac{5}{6}\right)}{3\pi}\,2^{-2N_B+1}\right]. \quad (4.38)$$

To evaluate the validity of the results, the FI (4.3) under both distributions for $\delta = 1$ was evaluated for

• the optimal set of thresholds for $N_B \in \{1,2,3\}$, obtained through exhaustive search; for $N_B \in \{4,5,6,7,8\}$, the theoretical results (4.37) and (4.38) were used as an approximation;

• uniform quantization with $N_B \in \{1,\cdots,8\}$; after setting the central threshold to $x$, the optimal quantization step-length $\Delta^{\star}$ was also found by exhaustive search, maximizing the FI;

• the approximately optimal sets of thresholds given by (4.28) and by (4.33), for $N_B \in \{1,\cdots,8\}$.

The results are given in Tab. 4.1.

            Gaussian (I_{c,n}^x = 2)                      Cauchy (I_{c,n}^x = 0.5)
N_B   Optimal       Uniform      Practical approx.   Optimal       Uniform      Practical approx.
1     1.27323954†   –            1.27323954          0.40528473†   –            0.40528473
2     1.76503630†   1.76503630   1.75128300          0.43433896†   0.43433896   0.40528473
3     1.93090199†   1.92837814   1.92740111          0.48474865†   0.45600797   0.47893785
4     1.97874454⋆   1.97841622   1.98038526          0.49533850⋆   0.48136612   0.49504170
5     1.99468613⋆   1.99353005   1.99489906          0.49883463⋆   0.49204506   0.49879785
6     1.99867153⋆   1.99807736   1.99869886          0.49970866⋆   0.49656712   0.49970408
7     1.99966788⋆   1.99943563   1.99967136          0.49992716⋆   0.49851056   0.49992659
8     1.99991697⋆   1.99983649   1.99991741          0.49998179⋆   0.49935225   0.49998172

Table 4.1: FI for the estimation of Gaussian and Cauchy location parameters based on quantized measurements. $N_B$ is the number of quantization bits. In the Optimal† entries, the maximum FI obtained by exhaustive search over the thresholds is shown; Optimal⋆ entries are the theoretical asymptotic approximation of the FI. Uniform shows the value of the FI for optimal uniform quantization and Practical approx.
gives the FI for the practical approximation of the asymptotically optimal thresholds.

In all cases, the fast convergence to the continuous FI with increasing $N_B$ is verified. Here again, 4 or 5 bits are enough to obtain an estimation performance close to the continuous-measurement performance. The performance gap between uniform and nonuniform quantization seems to be larger for the Cauchy distribution; in the Gaussian case, this gap is negligible, indicating that in practice uniform quantization should be used (as it is easier to implement). It can also be observed that the asymptotic approximation of the FI and its true value for the practical approximation of the optimal threshold set are very close, even for small values of $N_B$ ($N_B = 4$).

Verification with the adaptive algorithm

As pointed out before, an important issue in evaluating the practical approximation of the optimal thresholds $\tau_i^{\star}$ is that they depend explicitly on $x$. Thus, a possible solution to obtain an estimate of the parameter and, at the same time, set the quantizer thresholds, is to use the adaptive algorithm proposed in Ch. 3

$$\hat{X}_k = \hat{X}_{k-1} + \frac{1}{k I_q}\,\eta\left(i_k\right),$$

with the threshold variation set $\tau'$ given by the practical approximation $\tau^{\star}$, with $x$ in (4.20) set to zero, and $\eta(i_k)$ given by

$$\eta(i) = -\frac{f\left(\tau_i^{\star};x\right)-f\left(\tau_{i-1}^{\star};x\right)}{F\left(\tau_i^{\star};x\right)-F\left(\tau_{i-1}^{\star};x\right)}.$$

If $N_B \ge 4$, for large $k$, the asymptotic variance of the algorithm will be close to optimal and approximately given by

$$\operatorname{Var}\left[\hat{X}_k\right] \approx CRB_q = \frac{1}{k I_q}, \quad (4.39)$$

where $I_q$ is the asymptotic approximation given by (4.19).

This algorithm was tested under both distributions for $N_B = 4$ and $5$. The MSE of the algorithm was evaluated using Monte Carlo simulation: $4\times10^6$ realizations of blocks with $5\times10^4$ samples were used. The initial error $x-\hat{X}_0$ and $\delta$ were both set to 1 in all simulations. The MSE of the algorithm and the approximation given by (4.39) are both shown in Fig.
4.2, where they are multiplied by $k$ for better visualization.

Figure 4.2: Simulated MSE ($\times\,k$) for the adaptive algorithm considering Gaussian (a) and Cauchy (b) measurement distributions, compared with $CRB_q$ as a function of time $k$. The numbers of quantization bits are $N_B = 4$ and $5$. The initial estimation error and $\delta$ were set to 1 in all cases. The simulated MSE was obtained through Monte Carlo simulation: $4\times10^6$ realizations of blocks with $5\times10^4$ error samples were used. The curves with asymptotically higher values correspond to $N_B = 4$.

We observe that the asymptotic performance of the algorithm is very close to the approximation. For small $k$ the CRB is not tight, and that seems to be the reason why the algorithm performs better than the bound. In other simulations, it was also observed that using uniform thresholds leads to faster convergence to the asymptotic performance. This indicates that, in practice, an algorithm with changing thresholds can be used to obtain better results: during the convergence phase, a uniform set of thresholds is used; then, after a given number of samples, the thresholds switch to the approximately optimal set.

4.2 Bit allocation for scalar location parameter estimation

The objective now is to solve problem (d) (p. 175). We have $N_s$ sensors independently measuring the same location parameter $x$, and the continuous measurements of all sensors have the same noise type with normalized PDF $f_n$. The only difference between the noise distributions of the sensors is the scale factor. For the $N_s$ sensors, the scale factors are denoted $\{\delta_1,\cdots,\delta_{N_s}\}$. Each sensor $i$ quantizes its measurements with a number of bits $N_{B,i}$ such that the total number of bits among the sensors is constrained to be $N_B$.
The objective then is to find the allocation of bits $\{N_{B,1},\cdots,N_{B,N_s}\}$ that maximizes the estimation performance. The estimation performance of unbiased estimators, in terms of variance, can be characterized asymptotically by the CRB, which is related to the inverse of the FI; thus, by maximizing the FI, the asymptotic estimation performance is maximized. As the sensor measurements are independent, the FI $I_q$ for the measurements from all the sensors is the sum of the FI $I_{q,i}\left(N_{B,i}\right)$ of each sensor:

$$I_q = \sum_{i=1}^{N_s} I_{q,i}\left(N_{B,i}\right), \quad (4.40)$$

where we made explicit the dependence of each sensor's FI on its allocated number of bits. We will assume that the thresholds can be chosen so that $I_{q,i}\left(N_{B,i}\right)$ is maximum. This can be done, for example, by using the adaptive algorithm with decreasing gain to set the central threshold optimally and then choosing the threshold variations optimally. Thus, we want to solve the following optimization problem:

maximize w.r.t. $N_{B,i}$: $\displaystyle I_q = \sum_{i=1}^{N_s} I_{q,i}\left(N_{B,i}\right),$

subject to $\displaystyle\sum_{i=1}^{N_s} N_{B,i} = N_B, \quad N_{B,i} \in \mathbb{N},$

where $I_{q,i}\left(N_{B,i}\right)$ is the maximum FI for $N_{B,i}$ bits.

This problem can be solved exactly by evaluating $I_q$ for all possible combinations of the $N_{B,i}$. The numbers of allocated bits $N_{B,i}$ can take values from 0 to $N_B$, but their sum must be $N_B$; therefore, the $N_{B,i}$ form a weak composition of $N_B$ into $N_s$ parts. The number of possible allocations is $\binom{N_B+N_s-1}{N_s-1} = \frac{(N_B+N_s-1)!}{(N_s-1)!\,N_B!}$. If we have to solve the allocation problem for $N_s = 20$ and $N_B = 100$, then the number of possible allocations to compare is approximately $4.9\times10^{21}$, which indicates that in practice the exact solution of this problem cannot be obtained by exhaustive search.
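The count of candidate allocations can be checked directly from the weak-composition formula (the function name below is illustrative):

```python
import math

def n_allocations(total_bits, n_sensors):
    # number of weak compositions of total_bits into n_sensors parts:
    # C(total_bits + n_sensors - 1, n_sensors - 1)
    return math.comb(total_bits + n_sensors - 1, n_sensors - 1)

count = n_allocations(100, 20)  # the N_s = 20, N_B = 100 case from the text
```

For $N_s = 2$ and $N_B = 2$, the three allocations $(0,2)$, $(1,1)$, $(2,0)$ are recovered, while the $N_s = 20$, $N_B = 100$ case gives the $\approx 4.9\times10^{21}$ figure quoted above.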
4.2.1 Unconstrained numbers of bits

If we neglect the constraint that $N_{B,i}$ must be a non-negative integer, and we suppose that the asymptotic approximation (4.36) of $I_q$ is valid for all real $N_{B,i}$, then we can define a maximization problem that can be solved analytically. Using the approximation (4.36), the total FI can be approximated by

$$I_q \approx \sum_{i=1}^{N_s}\frac{1}{\delta_i^2}\left\{I_{c,n}^{x} - \frac{2^{-2N_{B,i}}}{12}\left[\int \left(\frac{\left[f_n^{(1)2}(\varepsilon)-f_n^{(2)}(\varepsilon)f_n(\varepsilon)\right]^2}{f_n^{3}(\varepsilon)}\right)^{\frac{1}{3}}\mathrm{d}\varepsilon\right]^{3}\right\}. \quad (4.41)$$

Maximizing the approximation in the RHS of (4.41) is equivalent to minimizing $\sum_{i=1}^{N_s}\frac{2^{-2N_{B,i}}}{\delta_i^2}$, as $I_{c,n}^{x}$ and the integral are constants when all the sensor noise types are equal. Thus the relaxed form (without the integer constraints) of the bit allocation problem is the following:

minimize w.r.t. $N_{B,i}$: $\displaystyle\sum_{i=1}^{N_s}\frac{2^{-2N_{B,i}}}{\delta_i^2},$

subject to $\displaystyle\sum_{i=1}^{N_s} N_{B,i} = N_B.$

We can solve this optimization problem by integrating the constraint into the function to be minimized using a Lagrange multiplier. The Lagrangian (the function to be minimized) for this minimization problem is

$$L = \left[\sum_{i=1}^{N_s}\frac{2^{-2N_{B,i}}}{\delta_i^2}\right] + \lambda\left[\left(\sum_{i=1}^{N_s} N_{B,i}\right) - N_B\right],$$

where $\lambda$ is the Lagrange multiplier. As the function is convex (a sum of decreasing exponentials plus a linear term), the zero-gradient point of the Lagrangian w.r.t. the $N_{B,i}$ gives a global minimum. The derivative of the Lagrangian w.r.t. $N_{B,i}$ is

$$\frac{\partial L}{\partial N_{B,i}} = \frac{-2\ln(2)\,2^{-2N_{B,i}}}{\delta_i^2} + \lambda,$$

which is zero for

$$N_{B,i} = -\frac{1}{2}\log_2\left[\frac{\lambda\delta_i^2}{2\ln(2)}\right]. \quad (4.42)$$

To find $\lambda$, it is necessary to use (4.42) in the sum constraint $\sum_{i=1}^{N_s} N_{B,i} = N_B$:

$$-\frac{N_s}{2}\log_2(\lambda) + \frac{N_s}{2}\log_2\left[2\ln(2)\right] - \sum_{i=1}^{N_s}\log_2\left(\delta_i\right) = N_B,$$

thus

$$\log_2(\lambda) = -\frac{2N_B}{N_s} + \log_2\left[2\ln(2)\right] - \frac{2}{N_s}\sum_{i=1}^{N_s}\log_2\left(\delta_i\right). \quad (4.43)$$

Using (4.43) in (4.42) gives

$$N_{B,i} = \frac{N_B}{N_s} + \frac{1}{N_s}\sum_{j=1}^{N_s}\log_2\left(\delta_j\right) - \log_2\left(\delta_i\right),$$

which can be rewritten as

$$N_{B,i} = \frac{N_B}{N_s} - \log_2\left(\frac{\delta_i}{\sqrt[N_s]{\prod_{j=1}^{N_s}\delta_j}}\right).$$
(4.44)

This is a correction to the uniform bit allocation that depends on the weight of each sensor's scale factor in the geometric mean of the scale factors. Note that the approximate allocation depends only on the $\delta_i$, and no other information about the distribution is required. In practice, we can estimate $\delta_i$ for each sensor with an arbitrary allocation, then use the estimates in (4.44) and round the results in a proper way to obtain integer $N_{B,i}$.

If we use the approximate solution from (4.44), we obtain

$$I_q \approx I_{c,n}^{x}\sum_{i=1}^{N_s}\frac{1}{\delta_i^2} - \frac{\kappa'(f_n)}{12}\sum_{i=1}^{N_s}\frac{2^{-2N_{B,i}}}{\delta_i^2} = I_{c,n}^{x}\sum_{i=1}^{N_s}\frac{1}{\delta_i^2} - \frac{\kappa'(f_n)}{12}\,2^{-2\bar{N}_B}\sum_{i=1}^{N_s}\frac{1}{\left(\prod_{j=1}^{N_s}\delta_j^2\right)^{\frac{1}{N_s}}}$$
$$= N_s\left[\frac{I_{c,n}^{x}}{HM\left(\delta_1^2,\cdots,\delta_{N_s}^2\right)} - \frac{\kappa'(f_n)}{12}\,\frac{2^{-2\bar{N}_B}}{GM\left(\delta_1^2,\cdots,\delta_{N_s}^2\right)}\right], \quad (4.45)$$

where $HM\left(\delta_1^2,\cdots,\delta_{N_s}^2\right) = \frac{N_s}{\sum_{i=1}^{N_s}\frac{1}{\delta_i^2}}$ and $GM\left(\delta_1^2,\cdots,\delta_{N_s}^2\right) = \sqrt[N_s]{\prod_{j=1}^{N_s}\delta_j^2}$ are the harmonic and geometric means of the squared scale factors, $\kappa'(f_n)$ is the integral factor in (4.36) and $\bar{N}_B = \frac{N_B}{N_s}$ is the number of allocated bits per sensor that would be obtained with a uniform bit allocation. If we compare this result to the uniform bit allocation

$$I_q \approx \frac{N_s}{HM\left(\delta_1^2,\cdots,\delta_{N_s}^2\right)}\left[I_{c,n}^{x} - \frac{\kappa'(f_n)}{12}\,2^{-2\bar{N}_B}\right],$$

we can verify that, as the geometric mean is larger than or equal to the harmonic mean, the approximate optimal bit allocation performs better than or equal to the uniform bit allocation.

If it were possible to implement this allocation scheme, an interesting direction for future work would be to study the influence of the variability of the sensor precisions $\frac{1}{\delta_i^2}$ on the estimation performance. This might be done, for example, by considering that the $\frac{1}{\delta_i^2}$ are i.i.d. r.v.
with a given distribution (a gamma distribution, for example) with known parameters. Then, assuming large $N_s$ for fixed $\bar{N}_B$, we can apply the law of large numbers to the harmonic and geometric means in the approximation (4.45) of $I_q$ to characterize the approximately optimal FI as a function of the parameters of the precision distribution. This approach, even if approximate, might give some insight into the estimation performance of asymptotically large heterogeneous sensor arrays under communication rate constraints.

We have the following solution to problem (d) (p. 175):

Solution to (d) - Unconstrained approximate optimal bit allocation for location parameter estimation

(d1) For $i \in \{1,\cdots,N_s\}$, the approximate optimal bit allocation is given by (4.44)

$$N_{B,i} = \frac{N_B}{N_s} - \log_2\left(\frac{\delta_i}{\sqrt[N_s]{\prod_{j=1}^{N_s}\delta_j}}\right).$$

Appropriate rounding can be used to obtain $N_{B,i} \in \mathbb{N}$.

• For the approximate optimal bit allocation, the FI is given by (4.45)

$$I_q \approx N_s\left[\frac{I_{c,n}^{x}}{HM\left(\delta_1^2,\cdots,\delta_{N_s}^2\right)} - \frac{\kappa'(f_n)}{12}\,\frac{2^{-2\bar{N}_B}}{GM\left(\delta_1^2,\cdots,\delta_{N_s}^2\right)}\right],$$

where $I_{c,n}^{x}$ is the continuous FI for $\delta = 1$, $\kappa'(f_n) = \left[\int \left(\frac{\left[f_n^{(1)2}(\varepsilon)-f_n^{(2)}(\varepsilon)f_n(\varepsilon)\right]^2}{f_n^{3}(\varepsilon)}\right)^{\frac{1}{3}}\mathrm{d}\varepsilon\right]^{3}$, $\bar{N}_B = \frac{N_B}{N_s}$ is the average number of bits per sensor, and $HM$ and $GM$ are the harmonic and geometric means of the squared scale factors.

4.2.2 Positive numbers of bits

To obtain a more realistic solution, we can constrain the numbers of bits to be non-negative reals. This gives the following optimization problem:

minimize w.r.t. $N_{B,i}$: $\displaystyle\sum_{i=1}^{N_s}\frac{2^{-2N_{B,i}}}{\delta_i^2},$

subject to $\displaystyle\sum_{i=1}^{N_s} N_{B,i} = N_B, \quad N_{B,i} \ge 0.$

The Lagrangian is the same as for the unconstrained problem. Using the zero-gradient condition, we have

$$N_{B,i} = -\frac{1}{2}\log_2\left[\frac{\lambda\delta_i^2}{2\ln(2)}\right] = \nu - \log_2\left(\delta_i\right),$$

where $\nu$ is a constant to be chosen.
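Choosing $\nu$ numerically amounts to a water-filling: clip $\nu - \log_2(\delta_i)$ at zero and bisect on $\nu$ until the bits sum to $N_B$. A minimal sketch under hypothetical sensor scales (names and values are illustrative, not from the thesis):

```python
import math

def waterfill_bits(deltas, total_bits, iters=80):
    # bisect on the "water level" nu so that sum_i max(nu - log2(delta_i), 0) = total_bits
    logs = [math.log2(d) for d in deltas]
    lo = min(logs)               # here the clipped sum is 0
    hi = max(logs) + total_bits  # here the clipped sum exceeds total_bits
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        if sum(max(nu - l, 0.0) for l in logs) < total_bits:
            lo = nu
        else:
            hi = nu
    nu = 0.5 * (lo + hi)
    return [max(nu - l, 0.0) for l in logs]

# sensors with scales doubling from 1 to 16, and a 10-bit total budget
bits = waterfill_bits([1.0, 2.0, 4.0, 8.0, 16.0], 10)
```

With these scales the water level is $\nu = 4$, so the noisiest sensor ($\delta = 16$) receives no bits at all, illustrating the clipping effect of the positivity constraint.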
Note that the positivity constraint imposes the following form for the $N_{B,i}$:

$$N_{B,i} = \left[\nu - \log_2\left(\delta_i\right)\right]_{+}, \quad (4.46)$$

with $[x]_{+} = \max(x,0)$. The sum constraint gives

$$\sum_{i=1}^{N_s} N_{B,i} = \sum_{i=1}^{N_s}\left[\nu - \log_2\left(\delta_i\right)\right]_{+} = N_B. \quad (4.47)$$

Thus, the constant $\nu$ is chosen so that (4.47) is satisfied, and the numbers of bits are then given by (4.46). Here again, appropriate rounding might be used to obtain integer numbers of bits.

Observe that this approximate bit allocation is equivalent to water-filling, a common solution for allocating power to carriers in multicarrier modulation. The main difference is that in this case the channel noise level is replaced by $\log_2\left(\delta_i\right)$ and the "water depths" are numbers of bits instead of power levels. In Fig. 4.3, both water-filling solutions are shown: power allocation in multicarrier systems and approximate bit allocation in rate-constrained sensing systems.

When the $\delta_i$ are also unknown, we can mix the two extensions of the adaptive algorithm with decreasing gain presented in Ch. 3 (fusion center + joint estimation of the scale) to obtain estimates of the scale parameters, and then use these estimates to compute the approximate allocation. In practice, the value of $\nu$ can be evaluated at the fusion center and broadcast to the sensors together with the location parameter. Each sensor can then use the broadcast $\nu$ with a local estimate of its scale parameter to obtain the optimal $N_{B,i}$. The critical point of this approach is the final rounding step, which requires an agreement (and consequently communication) between the sensors to respect the total bandwidth constraint.
Figure 4.3: Both water-filling solutions: for multicarrier modulation power allocation (a), where the total "water volume" above the carrier noise levels $N_i$ sums to the power $P$, and for rate-constrained sensing system bit allocation (b), where the total "water volume" above the levels $\log_2\left(\delta_i\right)$ sums to $N_B$; $\nu$ is the "water level".

This gives the following solution to problem (d) (p. 175):

Solution to (d) - Constrained approximate optimal bit allocation for location parameter estimation

(d2) For $i \in \{1,\cdots,N_s\}$, the approximate optimal bit allocation is obtained by choosing $\nu$ so that (4.47) is satisfied:

$$\sum_{i=1}^{N_s} N_{B,i} = \sum_{i=1}^{N_s}\left[\nu - \log_2\left(\delta_i\right)\right]_{+} = N_B.$$

With the value of $\nu$ satisfying (4.47), the numbers of bits can be obtained using (4.46):

$$N_{B,i} = \left[\nu - \log_2\left(\delta_i\right)\right]_{+}.$$

Integer $N_{B,i}$ can be obtained with appropriate rounding. The corresponding FI can be approximated by substituting the optimal $N_{B,i}$ in (4.41).

4.3 Generalization with the f–divergence

In this section, we discuss a generalization of the asymptotic results to different inference problems. The generalization that we study is based on the generalized f–divergence (GFD) presented in [Poor 1988]. The objective of this section is to show the main differences between the asymptotically optimal quantizers for different inference problems.

4.3.1 Definition of the generalized f–divergence

The GFD is a generalization of the f–divergence (also known as the Ali–Silvey distance) studied in [Ali 1966] (cited in [Poor 1988]). For a continuous r.v. $Y$, the GFD $D_f$ is defined as

$$D_{f,c} = \mathbb{E}\left\{f\left[l(Y)\right]\right\}, \quad (4.48)$$

where $l$ is a measurable function and $f$ is a continuous convex function. For a quantized measurement $i$ from $Y$, the GFD is defined as

$$D_{f,q} = \mathbb{E}\left\{f\left(\mathbb{E}_{Y|i}\left[l(Y)\right]\right)\right\} = \mathbb{E}\left[f\left(l_{q_i}\right)\right].$$
(4.49) Developing the conditional expectation and supposing that Y accepts a PDF p (y), we can rewrite (4.49) as NI X Df,q = f (lqi ) P (i) , (4.50) i=1 where P (i) = Z p (y) dy, (4.51) qi and l qi = R qi l (y) p (y) dy R . p (y) dy (4.52) qi 4.3.2 Generalized f –divergence in inference problems The performance of some important inference problems can be written as a function of the f –divergence. Three examples are given below. Classical estimation For classical estimation, we want to estimate a deterministic parameter x embedded in noisy independent measurements Y1:N . The quantized version of this problem is the main problem treated in thesis. Under some regularity conditions, we know that the asymptotic MSE of the optimal unbiased estimator of x attains the CRB which is given by the inverse of the FI. The FI for N independent measurements is given by N times the FI for one measurement. 208 Chapter 4. High-rate approximations of the FI If we look to the forms of Ic and Iq we can see that the FI for one measurement is a GFD with l (y) = Sc (y; x) and f (l) = l2 . Therefore, the GFD is directly linked to the asymptotic performance of classical estimation. Bayesian estimation Consider that instead of estimating a deterministic parameter, we want to estimate a random parameter X based on a noisy measurement Y . From Ch. 2 (2.7) (p. 78), we know that MSE = EY VarX|Y (X) , which can be rewritten o n as EY EX|Y X 2 − E2X|Y (X) . This gives h i MSE = E X 2 − EY E2X|Y (X) . This function is decreasing w.r.t. the second term, which is a GFD. Proceeding similarly for the quantized measurement version of the problem, we can conclude that the performance depends on a GFD with l (y) = EX|Y =y (X) and f (l) = l2 . For N identically distributed measurements, the MSE for Bayesian estimation can also be rewritten as a GFD, but in this case a generalization to the non-scalar case is needed. 
In [Marano 2007], details can be found for this case with a variable rate approach to quantization. We can also approach Bayesian estimation for N measurements as a sequence of single-measurement problems, where at each new observation the last posterior is used as the new prior. Using this approach for each measurement, the MSE is given by the scalar version of the GFD explained above.

Neyman–Pearson detection. We now consider the detection problem. We have N i.i.d. measurements Y_{1:N}, all obtained from one of two distributions with PDF p_0(y) or p_1(y). Based on the N measurements, we want to decide from which of the two distributions the measurements were obtained. The index of the true measurement distribution is denoted H ∈ {0, 1} and the decision made based on the N measurements is denoted Ĥ. To specify the performance of the decision procedure, we consider a Neyman–Pearson strategy [Van Trees 1968, p. 33]: we set an upper bound α on the probability of deciding Ĥ = 1 when H = 0, and the performance of the decision procedure is then given by the minimum probability β of deciding Ĥ = 0 when H = 1. When N is asymptotically large, the limit of β can be characterized using Stein's lemma [Blahut 1987] (cited in [Gupta 2003]):

\lim_{N \to +\infty} \beta^{\frac{1}{N}} = \exp\left\{-D_{KL}\left[p_0(y)\,\|\,p_1(y)\right]\right\},

where D_{KL}[p_0(y) \| p_1(y)] is the Kullback–Leibler divergence (KLD)

D_{KL}\left[p_0(y)\,\|\,p_1(y)\right] = \int p_0(y) \log\frac{p_0(y)}{p_1(y)} \, dy.

For quantized measurements, a similar theorem can be stated by replacing p_0(y) and p_1(y) by the corresponding probabilities of the quantizer outputs. Therefore, for this problem, the KLD is the criterion to be maximized to increase detection performance. If we take the opposite of the KLD, we can see that it is a GFD with l the likelihood ratio, l(y) = \frac{p_0(y)}{p_1(y)}, and f(l) = -\log(l).
The expectation in the GFD is in this case evaluated w.r.t. the probability measure for H = 0.

Detection of weak signals. We can also consider the detection of a low-amplitude signal, following a presentation similar to [Poor 1988]. For this problem, the Y_k for k ∈ {1, ..., N} are independent and distributed according to p(y_k) or p(y_k − θx_k), where p is the noise marginal PDF and x_k is a known signal with finite average power \bar{x^2} = \frac{1}{N}\sum_{k=1}^{N} x_k^2. If we consider a large number of measurements, N → ∞, and a small signal amplitude, θ → 0, then the performance of the optimal detector, in terms of β in the Neyman–Pearson strategy, is related to the efficacy

\rho = \bar{x^2} \int \left[\frac{\frac{dp(y)}{dy}}{p(y)}\right]^2 p(y)\, dy = \bar{x^2} \int \left[\frac{d\log p(y)}{dy}\right]^2 p(y)\, dy.

Maximizing this quantity maximizes asymptotic detection performance. Note that the integral factor is exactly the FI for estimating a location parameter of the PDF p. Thus, in this case the inference performance can also be written as a GFD, with l(y) = \frac{dp(y)/dy}{p(y)} = \frac{d\log p(y)}{dy} and f(l) = l^2.

4.3.3 Asymptotic results

Similarly to the asymptotic development for the FI, we now write asymptotic approximations for the loss of GFD incurred by quantization. After obtaining the asymptotic loss, we derive the optimal interval densities for the fixed rate and variable rate encoding cases.

Asymptotic GFD loss. The loss of GFD due to quantization can be defined as

L_f = D_{f,c} - D_{f,q} = \sum_{i=1}^{N_I} L_{f,i}, \qquad (4.53)

where L_{f,i} is the loss on each quantization interval:

L_{f,i} = \int_{q_i} \left\{f[l(y)] - f(l_{q_i})\right\} p(y)\, dy. \qquad (4.54)

To obtain the asymptotic approximation, we write the Taylor series expansions of l and p around the central point y_i of the interval and of f around l_i = l(y_i):

l(y) = l_i + l_i^{(y)}(y - y_i) + \frac{l_i^{(yy)}}{2}(y - y_i)^2 + \circ\left[(y - y_i)^2\right], \qquad (4.55)

p(y) = p_i + p_i^{(y)}(y - y_i) + \frac{p_i^{(yy)}}{2}(y - y_i)^2 + \circ\left[(y - y_i)^2\right], \qquad (4.56)

f(l) = f_i + f_i^{(l)}(l - l_i) + \frac{f_i^{(ll)}}{2}(l - l_i)^2 + \circ\left[(l - l_i)^2\right]. \qquad (4.57)

Using (4.57) and (4.55), the function f[l(y)] on the interval q_i can be written as

f[l(y)] = f_i + f_i^{(l)}\left[l_i^{(y)}(y - y_i) + \frac{l_i^{(yy)}}{2}(y - y_i)^2\right] + \frac{f_i^{(ll)}}{2}\left(l_i^{(y)}\right)^2(y - y_i)^2 + \circ\left[(y - y_i)^2\right]. \qquad (4.58)

We use (4.55) and (4.56) in (4.52) to evaluate l_{q_i}:

l_{q_i} = \frac{\int_{q_i}\left[l_i + l_i^{(y)}(y - y_i) + \frac{l_i^{(yy)}}{2}(y - y_i)^2 + \circ\left((y-y_i)^2\right)\right]\left[p_i + p_i^{(y)}(y - y_i) + \frac{p_i^{(yy)}}{2}(y - y_i)^2 + \circ\left((y-y_i)^2\right)\right]dy}{\int_{q_i}\left[p_i + p_i^{(y)}(y - y_i) + \frac{p_i^{(yy)}}{2}(y - y_i)^2 + \circ\left((y-y_i)^2\right)\right]dy}
= \frac{l_i p_i \Delta_i + l_i^{(y)} p_i^{(y)}\frac{\Delta_i^3}{12} + \frac{l_i^{(yy)}}{2} p_i \frac{\Delta_i^3}{12} + \frac{l_i}{2} p_i^{(yy)}\frac{\Delta_i^3}{12} + \circ\left(\Delta_i^3\right)}{p_i \Delta_i + \frac{p_i^{(yy)}}{2}\frac{\Delta_i^3}{12} + \circ\left(\Delta_i^3\right)}. \qquad (4.59)

To evaluate f(l_{q_i}), we replace (4.59) in (4.57). We proceed first by evaluating l_{q_i} − l_i:

l_{q_i} - l_i = \frac{l_i^{(y)} p_i^{(y)}\frac{\Delta_i^3}{12} + \frac{l_i^{(yy)}}{2} p_i \frac{\Delta_i^3}{12} + \circ\left(\Delta_i^3\right)}{p_i \Delta_i + \frac{p_i^{(yy)}}{2}\frac{\Delta_i^3}{12} + \circ\left(\Delta_i^3\right)}.

Note that l_{q_i} − l_i has a factor \Delta_i^2, thus (l_{q_i} - l_i)^2 = \circ\left(\Delta_i^3\right). This leads to

f(l_{q_i}) = f_i + f_i^{(l)}\,\frac{l_i^{(y)} p_i^{(y)}\frac{\Delta_i^3}{12} + \frac{l_i^{(yy)}}{2} p_i \frac{\Delta_i^3}{12} + \circ\left(\Delta_i^3\right)}{p_i \Delta_i + \frac{p_i^{(yy)}}{2}\frac{\Delta_i^3}{12} + \circ\left(\Delta_i^3\right)} + \circ\left(\Delta_i^2\right). \qquad (4.60)

Now we evaluate the two terms in L_{f,i} (4.54). Multiplying the expansion of f[l(y)] (4.58) by the expansion of p(y) (4.56) and integrating, we obtain

\int_{q_i} f[l(y)]\, p(y)\, dy = f_i p_i \Delta_i + f_i \frac{p_i^{(yy)}}{2}\frac{\Delta_i^3}{12} + f_i^{(l)}\left[l_i^{(y)} p_i^{(y)} + \frac{l_i^{(yy)}}{2} p_i\right]\frac{\Delta_i^3}{12} + \frac{f_i^{(ll)}}{2}\left(l_i^{(y)}\right)^2 p_i \frac{\Delta_i^3}{12} + \circ\left(\Delta_i^3\right). \qquad (4.61)

Using (4.60) and integrating the expansion of p(y), we get

\int_{q_i} f(l_{q_i})\, p(y)\, dy = f_i p_i \Delta_i + f_i \frac{p_i^{(yy)}}{2}\frac{\Delta_i^3}{12} + f_i^{(l)}\left[l_i^{(y)} p_i^{(y)} + \frac{l_i^{(yy)}}{2} p_i\right]\frac{\Delta_i^3}{12} + \circ\left(\Delta_i^3\right). \qquad (4.62)

Subtracting (4.62) from (4.61), we get the loss on the interval q_i:

L_{f,i} = \frac{f_i^{(ll)}}{2}\left(l_i^{(y)}\right)^2 p_i \frac{\Delta_i^3}{12} + \circ\left(\Delta_i^3\right).

Therefore, the total loss is

L_f = \sum_{i=1}^{N_I}\left[\frac{f_i^{(ll)}}{2}\left(l_i^{(y)}\right)^2 p_i \frac{\Delta_i^3}{12} + \circ\left(\Delta_i^3\right)\right]. \qquad (4.63)

Similarly to the asymptotic development for the FI, we have

\lim_{N_I \to \infty} N_I^2 L_f = \frac{1}{24}\int \frac{f^{(ll)}[l(y)]\left[l^{(y)}(y)\right]^2 p(y)}{\lambda^2(y)}\, dy. \qquad (4.64)

The optimal interval density for fixed rate encoding is then given by

\lambda^\star(y) = \frac{\left\{f^{(ll)}[l(y)]\right\}^{\frac{1}{3}}\left[l^{(y)}(y)\right]^{\frac{2}{3}} p^{\frac{1}{3}}(y)}{\int \left\{f^{(ll)}[l(y)]\right\}^{\frac{1}{3}}\left[l^{(y)}(y)\right]^{\frac{2}{3}} p^{\frac{1}{3}}(y)\, dy}. \qquad (4.65)

If the PDF of the measurements is completely known and given by p(y), then a development similar to the one done for the FI leads to the following optimal variable rate encoding interval density:

\lambda^\star_{vr}(y) = \frac{\sqrt{f^{(ll)}[l(y)]}\,\left|l^{(y)}(y)\right|}{\int \sqrt{f^{(ll)}[l(y)]}\,\left|l^{(y)}(y)\right|\, dy}.

4.3.4 Interval densities for inference problems

We now compare the interval densities for the inference problems described above. Tab. 4.2 gives the functions defining the GFD for each problem and the corresponding optimal interval density, together with the optimal interval density for variable rate encoding whenever variable rate encoding is possible.

Inference problem | l(y) | f(l) | λ⋆(y) ∝ | λ⋆_vr(y) ∝
Classical estimation | S_c(y; x) | l² | \left[\frac{\partial S_c(y;x)}{\partial y}\right]^{\frac{2}{3}} p^{\frac{1}{3}}(y; x) | –
Bayesian estimation | \mathbb{E}_{X|Y=y}(X) | l² | \left[\frac{d\mathbb{E}_{X|Y=y}(X)}{dy}\right]^{\frac{2}{3}} p^{\frac{1}{3}}(y) | \left|\frac{d\mathbb{E}_{X|Y=y}(X)}{dy}\right|
N–P detection | \frac{p_0(y)}{p_1(y)} | −log(l) | \left\{\frac{d}{dy}\log\left[\frac{p_0(y)}{p_1(y)}\right]\right\}^{\frac{2}{3}} p_0^{\frac{1}{3}}(y) | –
Weak signal detection | \frac{dp(y)/dy}{p(y)} | l² | \left\{\frac{d^2\log[p(y)]}{dy^2}\right\}^{\frac{2}{3}} p^{\frac{1}{3}}(y) | \left|\frac{d^2\log[p(y)]}{dy^2}\right|

Table 4.2: Functions characterizing the GFD for different inference problems and interval densities maximizing the inference performance based on quantized measurements. The interval density λ⋆(y) optimizes the performance when encoding is done with fixed rate; λ⋆_vr(y) is the density for variable rate encoding.

Notice that for Bayesian estimation and weak signal detection we give expressions for the variable rate optimal density. In Bayesian estimation, as we have a prior on the parameter, we know the probabilities of the quantizer outputs, and thus we can assign correct codeword lengths to the outputs.
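To make the use of such a density concrete, the following sketch (our own illustration, not code from the thesis) turns an interval density λ⋆ into quantizer thresholds via the inverse of its CDF, for classical estimation of a Gaussian location parameter. In the Gaussian case the score function derivative ∂S_c/∂y = 1/δ² is constant, so λ⋆ ∝ p^{1/3}, which is itself a Gaussian shape with scale δ√3; the numerically inverted CDF should therefore match the quantiles of N(x, 3δ²). Function names and the grid resolution are illustrative:

```python
import numpy as np

def thresholds_from_density(lam_unnorm, grid, n_intervals):
    """Place quantizer thresholds so that each of the n_intervals cells
    holds equal mass under the (unnormalized) interval density lam_unnorm."""
    cdf = np.cumsum(lam_unnorm)
    cdf = cdf / cdf[-1]                      # normalized CDF of lambda
    targets = np.arange(1, n_intervals) / n_intervals
    return np.interp(targets, cdf, grid)     # inverse CDF by interpolation

# Gaussian location problem: p(y) = N(x, delta^2), lambda* ∝ p^{1/3}
x, delta = 0.0, 1.0
grid = np.linspace(x - 10 * delta, x + 10 * delta, 200_001)
p = np.exp(-0.5 * ((grid - x) / delta) ** 2)   # normalization is irrelevant
thr = thresholds_from_density(p ** (1.0 / 3.0), grid, n_intervals=8)
print(np.round(thr, 3))  # symmetric thresholds around x
```

The resulting thresholds are the 1/8, 2/8, ..., 7/8 quantiles of a Gaussian with variance 3δ², wider apart than those obtained from p itself, as the density exponent 1/3 flattens the tails.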
In weak signal detection, as the amplitude of the signal is small, the encoding can be done approximately using the noise distribution. While in classical estimation the optimal interval density is affected by the derivative of the score function, in Bayesian estimation it is affected by the optimal estimator function x̂ = \mathbb{E}_{X|Y=y}(X). Note also that, differently from the classical estimation case, where the interval density is affected directly by the true parameter value, in Bayesian estimation the influence of the parameter appears only through its prior. Thus, even if x is unknown in the Bayesian case, an optimal quantizer can be implemented in practice³.

Observe that classical estimation of a location parameter with value x = 0 and weak signal detection have exactly the same interval density. Indeed, the performance of weak signal detection can be seen equivalently as the performance of estimating a small constant in i.i.d. noise with marginal PDF p(y), so it is not surprising that the optimal interval densities coincide.

The optimal density for Neyman–Pearson detection obtained here is exactly the same as the one obtained in [Gupta 2003] in the scalar case. Note that, similarly to Bayesian estimation, where the sensitivity of the key element for inference (the optimal estimator) has a direct impact on the interval density, in detection the sensitivity of the logarithm of the continuous measurement likelihood ratio plays an important role. Note also that the log-likelihood ratio \log\frac{p_0(y)}{p_1(y)} = \log[p_0(y)] - \log[p_1(y)], for two distributions parametrized by x and x + ε with small ε, can be rewritten using an expansion around x:

\log[p_0(y)] - \log[p_1(y)] = \log[p(y; x)] - \log[p(y; x + \varepsilon)] = -\varepsilon\,\frac{\partial \log[p(y; x)]}{\partial x} + \circ(\varepsilon).

The optimal interval density is then approximately given by the optimal density for classical estimation.
This makes explicit the link between the density for weak signal detection and the density for classical estimation.

³ Optimal here means optimal for a given prior; if the prior does not represent reality well, then the Bayesian setting is not useful and optimality is meaningless.

4.4 Chapter summary and directions

We summarize the main points of this chapter and possible directions for future work:

• We developed an asymptotic high-rate approximation of the FI for quantized measurements. The approximation shows that the FI for quantized measurements tends to the FI for continuous measurements when the number of quantization intervals tends to infinity. When the quantizer outputs are all coded with binary words of the same length (fixed rate encoding), the approximation of the FI tends exponentially, as a function of the number of quantization bits, to the FI for continuous measurements.

• The asymptotic performance approximation depends on the specific choice of the quantizer intervals through the quantizer interval density. For fixed rate encoding, the optimal interval density is shown to depend not only on the PDF through its 1/3 power, as is common in standard quantization, but also on the derivative of the score function. In practice, for a finite number of bits, the optimal interval density can be approximated by setting the quantization thresholds using the inverse of the CDF related to the interval density. As this CDF depends on the parameter that we want to estimate, a recursive procedure for jointly estimating the parameter and resetting the thresholds is necessary to obtain asymptotically optimal performance (asymptotic both in N and N_I). For example, we can use the adaptive algorithms presented in Ch. 3 when we want to solve a location estimation problem; in general, we can use the adaptive MLE approach.
When the lengths of the binary words are chosen to minimize the mean length of the quantizer output, the optimal density is shown to depend directly on the derivative of the score function. The problem with this approach is that not only the setting of the quantizer thresholds depends on the measurement distribution, but the encoding method does as well. Even if we can attain the best asymptotic performance by using an adaptive technique for setting the thresholds, we will not respect the rate constraint during the initial phase of the estimation procedure, when the parameter estimate is far from the true parameter value.

• The practical approximation of the asymptotically optimal quantization thresholds was obtained for the estimation of location and scale parameters of the GGD. For the STD, we obtained the practical approximation in the Cauchy case for location and in the general case for scale. The asymptotic results were tested on the location problem with the Gaussian and Cauchy distributions. We compared the asymptotic approximation of the FI with the FI for optimal uniform quantization and with the FI for the practical approximation of the optimal thresholds. We observed that, with only 4 bits, the FI obtained with the practical approximation is very close to the asymptotic approximation. We also observed that the performance gain obtained with nonuniform quantization is negligible in the Gaussian case and small in the Cauchy case. This indicates that in practice uniform quantization might be a better solution, as it requires lower complexity. By using the adaptive algorithm, we have shown that the asymptotically optimal results can be obtained in practice. During the simulation of the adaptive algorithm, it was observed that uniform quantization leads to faster convergence than nonuniform quantization.
An interesting point for future research is then to study adaptive algorithms that start with a threshold set optimized for fast convergence and later switch to a threshold set that is asymptotically optimal in performance.

• Using the asymptotic results, we obtained approximations of the optimal bit allocation for estimating a location parameter with multiple sensors under a constraint on the total number of quantization bits. The first approximate solution was obtained by leaving the numbers of bits unconstrained (positive and negative reals); the approximate optimal bit allocation is then a correction of the uniform bit allocation (equal number of bits for each sensor) that depends on the weight of each sensor's noise scale parameter relative to the geometric mean of all the scale parameters. The FI given by this approximate optimal allocation depends on the harmonic and geometric means of the noise scale parameters. An interesting point for future work is the analysis of this approximate FI when the number of sensors is very large and the sensor scale parameters are random with a given known distribution. As the approximate FI depends on the geometric and harmonic means of the scale parameters, by using the law of large numbers we expect to obtain an approximation of the FI that depends on the parameters of the scale distribution. The second approximation was obtained by considering a more realistic scenario, with the numbers of bits constrained to be positive. The approximate optimal bit allocation is then given by a water-filling solution, a well-known solution to the problem of power allocation in multicarrier modulations. In the bit allocation problem, the logarithm of the scale parameter plays the role of the noise power in multicarrier power allocation, and the number of bits plays the role of the power to be allocated.
The water-filling solution depends on the scale parameters; in a fusion center approach, the fusion center can use estimates of the scale parameters to obtain an approximate solution. The solution is mainly determined by a single parameter, the "water level": after obtaining the approximate water level, the fusion center can broadcast it to the sensors so that they can set their quantizer resolutions. A problem that still needs to be solved in this case is how the sensors coordinate their final choice of the numbers of bits (which are constrained to be integers) so that the total rate constraint is respected.

• As a final point of this chapter, we revisited the asymptotic approximation of the f–divergence loss due to quantization presented in [Poor 1988]. The objective of this part was to show that the asymptotic approximation of the FI presented in this chapter can be seen as a special case of the asymptotic approximation of a general performance measure for inference problems, and to show the links between the asymptotic characterizations of the quantizers for different inference problems. We saw that there is a close link between quantization for weak signal detection and for classical estimation of a location parameter: in practice, as we use an adaptive algorithm with a static quantizer centered at zero for classical estimation, the quantizer thresholds for these two problems are exactly the same. The link between Neyman–Pearson detection and Bayesian estimation is that, for both, the quantizer depends on the sensitivity of their key quantities: the Neyman–Pearson optimal interval density depends on the sensitivity of the log-likelihood ratio, and the Bayesian estimation optimal density depends on the sensitivity of the estimator.
• In addition to the points for further study presented above, many other points can be investigated:

– The vector quantizer extension of the asymptotic approximation of the FI: vector quantization is the most natural extension of the results presented here.

– Further study of the Bayesian case: for the single-sample asymptotic characterization, we saw that a recursive approach can be used. In practice, this solution may be too complex to implement, as we need to fully evaluate the continuous measurement estimator, and consequently the posterior, to obtain the optimal quantizer at each sample. To obtain a simpler solution, we can consider high resolution quantizers designed to optimize the asymptotic (large number of samples) performance of Bayesian estimation.

– Dealing with the overload region: a main point neglected in the analysis is that, in practice, most noise PDFs used for modeling have infinite support. In this chapter we considered explicitly that the noise PDF has bounded support, so that it would not be necessary to deal with the overload region. In future work, we can try to deal with the overload region.

– Asymptotic approximation of the optimal uniform quantizer for estimation: throughout this thesis we considered the explicit optimization, through a grid search, of the optimal quantization step in uniform quantization. We can try to obtain an analytic characterization of the optimal step by considering an asymptotic approach.

Conclusions of Part II

In Part II, we studied the asymptotic performance of estimation of a scalar deterministic parameter based on quantized measurements. Asymptotically in this case means:

• that the number of samples tends to infinity, N → ∞, so that we can use the FI to characterize the estimation performance;
• that the number of quantization bits tends to infinity, N_B → ∞, so that we can use high-rate approximations of the FI to determine analytically the loss of performance induced by quantization.

We reached the following conclusions:

• The asymptotic loss of performance due to quantization decreases exponentially as a function of the number of bits. The loss of FI due to quantization is shown to decrease exponentially with an increasing number of bits. Even though the results are asymptotic, they indicate that it is probably not useful to increase the sensor quantizer resolution when a target performance is not met. It is probably more reasonable, as we saw in Part I, to increase the number of sensors or, when possible, to increase the sampling frequency and to use sensors with smaller noise amplitude (smaller noise scale factor).

• Asymptotic may mean low to medium resolution in practice. Using a practical approximation of the asymptotically optimal thresholds for a finite number of quantization intervals in the location estimation problem (Gaussian and Cauchy cases), we have shown that the corresponding FI is very close to the asymptotic approximation of the FI for numbers of bits as low as 4. For 1, 2 and 3 bits, the optimal threshold variations can easily be found by grid search, and the central threshold can be adjusted in all cases with an adaptive algorithm. This means that in practice, for all numbers of bits, we can set the quantizer thresholds, at least approximately, to obtain asymptotically optimal quantization for location parameter estimation under Gaussian and Cauchy distributions. A question that remains unanswered is whether this holds in general, for other measurement distributions and for the estimation of other types of parameters.

• Uniform is not bad at all. Nonuniform quantization of the measurements can be used in practice to attain asymptotically optimal performance.
However, the gap between the performance of optimal uniform quantization and that of nonuniform quantization in location parameter estimation is small. As uniform quantization is easier to implement, it seems that, in practice, uniform quantization may be the better solution.

Conclusions

Main conclusions

In this thesis, we have studied the problem of estimation based on quantized measurements, a problem that has attracted increasing attention in the signal processing community due to the emergence of sensor networks. More specifically, we treated the problem of estimating a scalar parameter, either constant or varying according to a Wiener model, based on quantized noisy measurements of the parameter. We observed that for most commonly used noise models, the estimation performance degrades when the quantizer dynamic range is far from the true parameter value, indicating that a good solution can be obtained by adaptively setting the quantizer range using the most recent estimate of the parameter. Using the adaptive scheme, the loss of estimation performance due to quantization appears to be small: for all the tested cases (different noise PDFs, constant or slowly varying parameter), a small loss is observed with 1–3 quantization bits and a negligible loss with 4 or 5 quantization bits. This indicates that the solution of the remote sensing problem under a constrained communication rate is linked to low resolution sensor networks:

• If we consider that the problem is constrained to be solved with a sensor network approach, then from the results above we can see that quantization with low resolution is a solution to this problem.

• If we constrain the problem to be a remote sensing problem based on quantized measurements, then a low resolution sensor network approach seems to be an appropriate solution.
As the standard estimation algorithms attaining this small loss of performance have high complexity, we proposed a low complexity adaptive algorithm that asymptotically achieves the same performance. Extensions of the algorithm were proposed for the cases where the noise scale factor is unknown and where multiple sensors are available.

We also studied how to set the quantization thresholds to obtain optimal estimation performance when a large number of quantization intervals is available. We used an asymptotic approach (the quantizer intervals tend to zero) to obtain an approximation of the optimal thresholds; this approach also allowed us to obtain an approximate analytical expression for the estimation performance (the FI) as a function of the number of quantization bits. The approximation of the FI for quantized measurements is shown to converge exponentially to the FI for continuous measurements. The approximate analytic expression was shown to be valid in the location estimation problem even for small numbers of bits (4 in this case), indicating that the result, expected to be exact only when the number of bits tends to infinity, can be useful in practice if we consider nonuniform quantization.

From the asymptotic approach, we showed that the optimal thresholds may depend on the parameter, which is unknown. This reinforces the importance of the adaptive approach, which allows the thresholds to be set asymptotically at their optimal values, leading to asymptotically optimal estimation performance. We also want to point out that the difference between the optimal general (nonuniform) threshold scheme and the optimal uniform scheme for the location problem is small. In practice, if low complexity is needed, then uniform quantization may be a better solution.

Perspectives

We finish this "conversation" between quantization and estimation by highlighting some subjects for future discussion.
Some details of these subjects were already discussed at the end of the chapters, so here we give only the main lines.

• Vector parameter and vector quantization: this is the direct extension of the problem. While the vector quantization extension might be straightforward to study, both in terms of proposing algorithms and of studying their asymptotic behavior (in the number of samples and of quantization intervals), the vector parameter extension seems less straightforward, especially because it would require a redefinition of the estimation performance and a full extension of the algorithms to vectors, to correctly exploit the correlation between the components.

• Noisy channels: in the "DSP party", most of the time, communication is not invited; we can propose to invite it to the next party by adding the communication channel to the problem. A noisy communication channel can be considered in multiple ways. The simplest way of introducing it is to index the quantized measurements with binary words and then consider the channel as an extension of the binary symmetric channel. While for a fixed indexing the extensions of the algorithms, especially the low complexity one proposed here, might be simple, the problem of joint optimal estimation/indexing can be difficult. Different extensions can be considered by introducing a continuous channel, for example additive or fading channels. In this case we might consider the indexing problem of assigning real values to the quantized measurements, which again generates a joint problem of estimation/codebook design.

• Estimation under unknown noise distribution: we supposed that the noise distribution is known, at least up to a scale factor. In practice, this assumption cannot always be satisfied, and we will need to look for different approaches to estimate the location parameter based on quantized measurements.
There are other topics that were not discussed explicitly in this thesis, but they are interesting subjects for future research. They are the following: Conclusions 221 • Fast variations: to develop some parts of this thesis, we considered that the parameter to be estimated was a slowly varying Wiener process, under this hypothesis we have shown that the loss of performance due to quantization is small. The unanswered question here is whether this conclusion is true or false for the estimation of fast varying processes. • Distributed problem: in this thesis, we treated the simplified remote sensing problem, where we have only one sensor. In the only case where a multiple sensor approach was treated, we used the fusion center approach. Thus, we still need to generalize the concepts and algorithms developed here to a partially or completely distributed setting, where a cluster head or each sensor wants to obtain estimates based on the information from all the sensors. • Continuous time: for a varying parameter, we considered that the parameter model was inherently discrete and we did not discuss sampling issues. Thus a subject to be studied is the estimation of a continuous process based on sampled and quantized measurements. A Appendices A.1 A.1.1 Why? - Proofs Proof that E [Sc Sq ] = E Sq2 We will consider a general parameter estimation problem in the proof. The density of the measurement will be f (y; x) instead of f (y − x). Adding the dependence of Sq on the quantizer output index i, y and x, the expectation of the product is Z ∂ log f (y; x) Sq (i (y) ; x) f (y; x) dy. (A.1) E [Sc Sq ] = ∂x R Separating the integral in (A.1) in a sum of integrals on the different quantization intervals qi : X Z ∂ log f (y; x) Sq (i (y) ; x) f (y; x) dy. E [Sc Sq ] = ∂x i∈I qi Sq is a constant function inside an interval qi , thus, in an interval, it does not depend on y and it can leave the integral Z X ∂ log f (y; x) f (y; x) dy. 
E [Sc Sq ] = Sq (i; x) ∂x i∈I qi Rewriting the continuous measurement score function in ratio form gives Z ∂f (y;x) X ∂x Sq (i; x) E [Sc Sq ] = f (y; x) dy, f (y; x) i∈I qi supposing that we can change the order of integral and the partial derivative leads to X ∂P (i; x) . E [Sc Sq ] = Sq (i; x) ∂x i∈I Multiplying and dividing each term of the sum by its corresponding P (i; x), we have E [Sc Sq ] = X Sq (i; x) i∈I ∂P(i;x) ∂x P (i; x) P (i; x) . We can identify the score function as the second factor, leading to X E [Sc Sq ] = Sq2 (i; x) P (i; x) = E Sq2 . i∈I 223 (A.2) 224 A.1.2 A. Appendices Proof of the upper bound on F (ε) [1 − F (ε)] for the Gaussian distribution We can write F (ε) [1 − F (ε)] as the probability of two i.i.d. Gaussian r.v. X1 and X2 to be in the respective intervals [−∞, x] and [x, ∞]. Thus, this probability can be written as the integral of their joint PDF (see (1.24) for the marginal Gaussian PDF form) " # x21 + x22 1 f1,2 (x1 , x2 ) = f (x1 ) f (x2 ) = 2 exp − πδ δ2 on the area A0 + A1 of Fig. A.1. From the i.i.d. assumption, the integral on the area A1 is equal to the integral on the area A′1 . Therefore, F (ε) [1 − F (ε)] is equal to the integral of f1,2 (x1 , x2 ) on A0 + A′1 . It is easy to see that the area outside the quarter circle C1 in the fourth quadrant is not smaller than the area of A0 +A′1 . Denoting the area outside the quarter ¯ circle in the fourth quadrant by C¯1 , we can say that P (X1 , X2 ∈ A0 + A1 ) ≤ P Xp 1 , X2 ∈ C 1 . 2 2 Changing the coordinates from rectangular (x1 , x2 ) to polar (r, θ), where r = x1 + x2 is x1 x2 the radius and θ = arctan P X1 , X2 ∈ C¯1 = is the angle, we have that Z0 Z∞ − π2 x " " # # Z∞ r2 r2 1 1 r exp − 2 drdθ = 2 r exp − 2 dr. πδ 2 δ 2δ δ x Changing variables one more time r′ = rδ , we obtain P X1 , X2 ∈ C¯1 1 = 2 Z∞ x δ ′ r exp −r ′2 1 dr = − 4 ′ Z∞ x δ x 2 1 . −2r′ exp −r′2 dr′ = exp − 4 δ Consequently, 1 x 2 ¯ F (ε) [1 − F (ε)] = P (X1 , X2 ∈ A0 + A1 ) ≤ P X1 , X2 ∈ C1 = exp − . 
A.1.3 Proof that the FI for estimating a Laplacian location parameter with noise scale $\delta$ is $\frac{1}{\delta^2}$

The score function (1.15) for the location parameter of the Laplacian distribution (PDF given by (1.27)) is
$$S_c = \frac{\partial\log f(y-x)}{\partial x} = \frac{\partial}{\partial x}\left[\log\frac{1}{2\delta} - \frac{|y-x|}{\delta}\right] = \frac{1}{\delta}\,\mathrm{sign}(y-x),$$
where we used the fact that the derivative of the absolute value function is the sign function. The FI is then given by
$$I_c = E[S_c^2] = \int_{-\infty}^{+\infty} \frac{1}{\delta^2}\,\frac{1}{2\delta}\exp\left(-\frac{|y-x|}{\delta}\right) dy.$$
Changing variables $y'=\frac{y-x}{\delta}$ and using the symmetry of $\exp(-|y'|)$, we get
$$I_c = \frac{1}{\delta^2}\int_0^{+\infty}\exp(-y')\, dy' = \frac{1}{\delta^2}.$$

Figure A.1: Geometric scheme to show that the probability of the region $A_0+A_1$ is less than the probability of the exterior region of the left quarter circle $C_1$.

A.1.4 Proof that the FI for estimating a Cauchy location parameter with noise scale $\delta$ is $\frac{1}{2\delta^2}$

The score function (1.15) for the location parameter of a Cauchy distribution (PDF given by (1.33)) is the following:
$$S_c = \frac{\partial\log f(y-x)}{\partial x} = \frac{\partial}{\partial x}\left[-\log(\pi\delta) - \log\left(1+\left(\tfrac{y-x}{\delta}\right)^2\right)\right] = \frac{\frac{2(y-x)}{\delta^2}}{1+\left(\frac{y-x}{\delta}\right)^2}.$$
The FI can then be evaluated with the following integral:
$$I_c = E[S_c^2] = \frac{4}{\pi\delta^3}\int_{-\infty}^{+\infty} \frac{\left(\frac{y-x}{\delta}\right)^2}{\left[1+\left(\frac{y-x}{\delta}\right)^2\right]^3}\, dy = \frac{8}{\pi\delta^3}\int_x^{+\infty} \frac{\left(\frac{y-x}{\delta}\right)^2}{\left[1+\left(\frac{y-x}{\delta}\right)^2\right]^3}\, dy,$$
where the second equality comes from the symmetry of the integrand. We change variables with $\tan\theta = \frac{y-x}{\delta}$, so that $dy = \delta\sec^2\theta\, d\theta$ and the integration limits change to $0$ and $\frac{\pi}{2}$. Using the trigonometric identity $1+\tan^2\theta=\sec^2\theta$, we have
$$I_c = \frac{8}{\pi\delta^2}\int_0^{\frac{\pi}{2}} \frac{\tan^2\theta}{\sec^6\theta}\,\sec^2\theta\, d\theta = \frac{8}{\pi\delta^2}\int_0^{\frac{\pi}{2}} \sin^2\theta\cos^2\theta\, d\theta.$$
Using trigonometric identities, we have that $\sin^2\theta\cos^2\theta = \frac{1}{8}\left[1-\cos(4\theta)\right]$, and the integral of the term $\cos(4\theta)$ is zero on the interval $\left[0,\frac{\pi}{2}\right]$. Therefore, we finally obtain
$$I_c = \frac{1}{\pi\delta^2}\int_0^{\frac{\pi}{2}} d\theta = \frac{1}{2\delta^2}.$$

A.1.5 Proof that the FI for $N$ measurements quantized adaptively with $N_I$ quantization intervals is $I_q^{N_I} = \sum_{k=1}^N E\left[I_q(\varepsilon_k)\right]$
Making explicit the dependence of $P(i_{1:N};x)$ on the adaptive central thresholds $\tau_{0,0:N-1}$ through the conditional probability $P(i_{1:N}|\tau_{0,0:N-1};x)$, and exploiting the independence between the measurements conditioned on the central thresholds used to obtain them, we can write that the joint probability used in the score function evaluation factorizes as follows:
$$P(i_{1:N}|\tau_{0,0:N-1};x) = \prod_{k=1}^N P(i_k|\tau_{0,k-1};x).$$
Thus, the log-likelihood is given by
$$\log L(x;i_{1:N}) = \sum_{k=1}^N \log P(i_k|\tau_{0,k-1};x).$$
The FI is then given by
$$I_q^{N_I} = E\left[\left(\frac{\partial\log L(x;i_{1:N})}{\partial x}\right)^2\right] = E\left[\left(\sum_{k=1}^N \frac{\partial\log P(i_k|\tau_{0,k-1};x)}{\partial x}\right)^2\right],$$
where the expectation is evaluated w.r.t. the joint probability measure of the r.v. $i_{1:N}$ and $\tau_{0,0:N-1}$. We can decompose the joint expectation into a composition of two expectations using conditioning. For two r.v. $X$ and $Y$ and a function $h$, this is
$$E_{X,Y}[h(X,Y)] = E_X\left[E_{Y|X}[h(X,Y)]\right].$$
The subscripts indicate the corresponding probability measure used for the evaluation; for example, $Y|X$ corresponds to the conditional probability measure of $Y$ given $X$. Using this decomposition on the FI above:
$$I_q^{N_I} = E_{\tau_{0,0:N-1}}\left[E_{i_{1:N}|\tau_{0,0:N-1}}\left[\left(\sum_{k=1}^N \frac{\partial\log P(i_k|\tau_{0,k-1};x)}{\partial x}\right)^2\right]\right].$$
By expanding the square of the inner sum, the inner expectation becomes a sum of expectations of squared score functions $\left(\frac{\partial\log P(i_k|\tau_{0,k-1};x)}{\partial x}\right)^2$ and of products of score functions $\frac{\partial\log P(i_k|\tau_{0,k-1};x)}{\partial x}\frac{\partial\log P(i_j|\tau_{0,j-1};x)}{\partial x}$ for different samples, $j\neq k$. As the samples are conditionally independent given their central thresholds, the expectation of the sum of squared scores is the sum of conditional expectations, each evaluated with the probability measure of its corresponding $i_k|\tau_{0,k-1}$.
For the cross terms the same happens, but now each conditional expectation is evaluated with respect to the pair $i_k|\tau_{0,k-1}$, $i_j|\tau_{0,j-1}$; as the pairs of measurements are conditionally independent, the conditional expectation of the product of scores is the product of the conditional expectations. Finally, as the expectation of each score function is zero [Kay 1993, pp. 67], the expectation of the sum of cross products is zero. Therefore, we have
$$I_q^{N_I} = E_{\tau_{0,0:N-1}}\left[\sum_{k=1}^N E_{i_k|\tau_{0,k-1}}\left[\left(\frac{\partial\log P(i_k|\tau_{0,k-1};x)}{\partial x}\right)^2\right]\right].$$
Each term of the inner sum depends on a different $\tau_{0,k-1}$; thus, by marginalization (integration w.r.t. the other thresholds), we get
$$I_q^{N_I} = \sum_{k=1}^N E_{\tau_{0,k-1}}\left[E_{i_k|\tau_{0,k-1}}\left[\left(\frac{\partial\log P(i_k|\tau_{0,k-1};x)}{\partial x}\right)^2\right]\right].$$
Observe that the inner expectation is the FI for each observation $i_k$, parametrized by $\tau_{0,k-1}$ and $x$. We can re-parametrize it by the difference $\varepsilon_k = \tau_{0,k-1}-x$, writing it with the notation of (1.13). Therefore, we obtain
$$I_q^{N_I} = \sum_{k=1}^N E_{\varepsilon_k}\left[I_q(\varepsilon_k)\right].$$

A.1.6 Proof that the posterior PDF can be written in recursive form using prediction and update expressions

To obtain a relation between the PDF $p(x_k|i_{1:k-1})$, which we will call the prediction PDF, and the posterior at instant $k-1$, $p(x_{k-1}|i_{1:k-1})$, we use conditioning on the joint density/distribution (PDF for $X$ and probability for $i$) of the variables $X_k$, $X_{k-1}$ and $i_{1:k-1}$:
$$p(x_k,x_{k-1},i_{1:k-1}) = p(x_k|x_{k-1},i_{1:k-1})\, p(x_{k-1}|i_{1:k-1})\, P(i_{1:k-1}).$$
Exploiting the fact that, conditioned on $X_{k-1}$, the r.v. $X_k$ is independent of all the past measurements, we have
$$p(x_k,x_{k-1},i_{1:k-1}) = p(x_k|x_{k-1})\, p(x_{k-1}|i_{1:k-1})\, P(i_{1:k-1}).$$
On the other hand, conditioning only on the measurements, we obtain
$$p(x_k,x_{k-1},i_{1:k-1}) = p(x_k,x_{k-1}|i_{1:k-1})\, P(i_{1:k-1}).$$
Equating the last two expressions gives
$$p(x_k,x_{k-1}|i_{1:k-1}) = p(x_k|x_{k-1})\, p(x_{k-1}|i_{1:k-1}).$$
Marginalization of $X_{k-1}$ gives the prediction expression
$$p(x_k|i_{1:k-1}) = \int_{\mathbb{R}} p(x_k|x_{k-1})\, p(x_{k-1}|i_{1:k-1})\, dx_{k-1}.$$
As stated before, to obtain the prediction PDF we must use the last posterior and the transition PDF $p(x_k|x_{k-1})$ that characterizes the dynamical model. To obtain the update expression, we start by conditioning the joint density/distribution function $p(x_k,i_k,i_{1:k-1})$:
$$p(x_k,i_k,i_{1:k-1}) = P(i_k|x_k,i_{1:k-1})\, p(x_k|i_{1:k-1})\, P(i_{1:k-1}).$$
As $i_k$ given $x_k$ is independent of all the other r.v., we have
$$p(x_k,i_k,i_{1:k-1}) = P(i_k|x_k)\, p(x_k|i_{1:k-1})\, P(i_{1:k-1}).$$
Now, conditioning on the entire set of measurements,
$$p(x_k,i_k,i_{1:k-1}) = p(x_k|i_{1:k})\, P(i_{1:k}).$$
Using the last two expressions, we get
$$p(x_k|i_{1:k}) = \frac{P(i_k|x_k)\, p(x_k|i_{1:k-1})\, P(i_{1:k-1})}{P(i_{1:k})}.$$
This result can be simplified by applying conditioning on the denominator. Absorbing the factor $P(i_{1:k-1})$, we have
$$p(x_k|i_{1:k}) = \frac{P(i_k|x_k)\, p(x_k|i_{1:k-1})}{P(i_k|i_{1:k-1})}.$$
The conditional probability in the denominator can be expressed using marginalization of
$$p(i_k,x_k|i_{1:k-1}) = P(i_k|x_k)\, p(x_k|i_{1:k-1}),$$
which finally gives the update expression
$$p(x_k|i_{1:k}) = \frac{P(i_k|x_k)\, p(x_k|i_{1:k-1})}{\int_{\mathbb{R}} P(i_k|x_k')\, p(x_k'|i_{1:k-1})\, dx_k'}.$$
Note that, to update the prediction to the posterior distribution, we introduced the information from the measurement through $P(i_k|x_k)$.

A.1.7 Proof that $I_c$ for the GGD is $\frac{1}{\delta^2}\frac{\beta(\beta-1)\Gamma\left(1-\frac{1}{\beta}\right)}{\Gamma\left(\frac{1}{\beta}\right)}$

The continuous measurement FI for the GGD distribution is obtained using the PDF expression (1.39) in the integral (3.58):
$$I_{c,GGD} = \int_{\mathbb{R}} \frac{\left[f_{GGD}^{(1)}(x)\right]^2}{f_{GGD}(x)}\, dx = \frac{\beta^3}{\delta^3\,\Gamma\left(\frac{1}{\beta}\right)}\int_0^{+\infty}\left(\frac{x}{\delta}\right)^{2\beta-2}\exp\left[-\left(\frac{x}{\delta}\right)^\beta\right] dx,$$
where we used the fact that the integrand is an even function to obtain an integral on $[0,+\infty)$. We can now change the integration variable to $z = \left(\frac{x}{\delta}\right)^\beta$; this produces $dx = \frac{\delta}{\beta}z^{\frac{1}{\beta}-1}dz$, leading to the following integral:
$$I_{c,GGD} = \frac{\beta^2}{\delta^2\,\Gamma\left(\frac{1}{\beta}\right)}\int_0^{+\infty} z^{1-\frac{1}{\beta}}\exp(-z)\, dz.$$
The integral is equal to $\Gamma\left(2-\frac{1}{\beta}\right)$; thus, using the property of the gamma function $\Gamma(1+z)=z\Gamma(z)$, we finally have
$$I_{c,GGD} = \frac{1}{\delta^2}\,\frac{\beta(\beta-1)\Gamma\left(1-\frac{1}{\beta}\right)}{\Gamma\left(\frac{1}{\beta}\right)}.$$

A.1.8 Proof that $I_c$ for the STD is $\frac{1}{\delta^2}\frac{\beta+1}{\beta+3}$

For the STD, the continuous measurement FI is obtained using its PDF expression (3.72) in (3.58). As the integrand is an even function, we can integrate it only on the positive real semi-axis. This gives
$$I_{c,STD} = 2\int_0^{+\infty} \frac{\left[f_{STD}^{(1)}(x)\right]^2}{f_{STD}(x)}\, dx = 2\,\frac{\Gamma\left(\frac{\beta+1}{2}\right)}{\sqrt{\beta\pi}\,\Gamma\left(\frac{\beta}{2}\right)}\,\frac{(\beta+1)^2}{\beta\,\delta^3}\int_0^{+\infty}\left(\frac{x}{\delta\sqrt{\beta}}\right)^2\left[1+\left(\frac{x}{\delta\sqrt{\beta}}\right)^2\right]^{-\frac{\beta+5}{2}} dx.$$
To evaluate this integral, we can change the integration variable to $\theta$ using $\tan\theta = \frac{x}{\delta\sqrt{\beta}}$; this produces $dx = \delta\sqrt{\beta}\,\frac{d\theta}{\cos^2\theta}$, $1+\left(\frac{x}{\delta\sqrt{\beta}}\right)^2 = \frac{1}{\cos^2\theta}$ and an integration interval $\left[0,\frac{\pi}{2}\right]$, leading to
$$I_{c,STD} = \frac{(\beta+1)^2}{\beta\,\delta^2}\,\frac{2}{\sqrt{\pi}}\,\frac{\Gamma\left(\frac{\beta+1}{2}\right)}{\Gamma\left(\frac{\beta}{2}\right)}\int_0^{\frac{\pi}{2}}\sin^2\theta\,\cos^{\beta+1}\theta\, d\theta.$$
The integral factor multiplied by 2 can be identified with the beta function $B\left(\frac{3}{2},\frac{\beta}{2}+1\right)$. The beta function can be written using the gamma function:
$$B\left(\frac{3}{2},\frac{\beta}{2}+1\right) = \frac{\Gamma\left(\frac{3}{2}\right)\Gamma\left(\frac{\beta}{2}+1\right)}{\Gamma\left(\frac{\beta+1}{2}+2\right)},$$
which can be rewritten using the facts that $\Gamma\left(\frac{3}{2}\right)=\frac{\sqrt{\pi}}{2}$ and $\Gamma(1+z)=z\Gamma(z)$. This gives
$$B\left(\frac{3}{2},\frac{\beta}{2}+1\right) = \sqrt{\pi}\,\frac{\beta\,\Gamma\left(\frac{\beta}{2}\right)}{(\beta+3)(\beta+1)\,\Gamma\left(\frac{\beta+1}{2}\right)},$$
leading finally to
$$I_{c,STD} = \frac{1}{\delta^2}\,\frac{\beta+1}{\beta+3}.$$

A.1.9 Minimization of the asymptotic variance w.r.t. $\eta$ under the asymptotic zero mean constraint

To simplify the notation we will use $\eta$ and $f_d$, suppressing the subscripts and superscripts. The problem we want to solve is
$$\underset{\eta}{\text{minimize}}\ \ \sigma_\infty^2 = \frac{\eta^\top F_d\,\eta}{\eta^\top f_d f_d^\top\eta},\qquad \text{subject to } F_d^{vec,\top}\eta = 0,\ \eta\neq 0, \tag{A.3}$$
where $F_d^{vec}$ is the diagonal of $F_d$ in vector form. This problem can also be cast as a maximization problem:
$$\underset{\eta}{\text{maximize}}\ \ \frac{1}{\sigma_\infty^2} = \frac{\eta^\top f_d f_d^\top\eta}{\eta^\top F_d\,\eta},\qquad \text{subject to } F_d^{vec,\top}\eta = 0,\ \eta\neq 0. \tag{A.4}$$
As $F_d$ is a diagonal matrix, it can be decomposed as the product of diagonal matrices formed with the square roots of the diagonal terms:
$$F_d = F_d^{\frac{1}{2}}F_d^{\frac{1}{2}}.$$
Thus, using the change of variables $\eta = F_d^{-\frac{1}{2}}\eta'$, the problem (A.4) becomes
$$\underset{\eta'}{\text{maximize}}\ \ \frac{\eta'^\top F_d^{-\frac{1}{2}} f_d f_d^\top F_d^{-\frac{1}{2}}\eta'}{\eta'^\top\eta'},\qquad \text{subject to } F_d^{vec,\top}F_d^{-\frac{1}{2}}\eta' = 0,\ \eta'\neq 0.$$
This problem can be solved by constraining $\eta'^\top\eta'$ to be equal to one and then maximizing the numerator:
$$\underset{\eta'}{\text{maximize}}\ \ \eta'^\top F_d^{-\frac{1}{2}} f_d f_d^\top F_d^{-\frac{1}{2}}\eta',\qquad \text{subject to } \eta'^\top\eta' = 1,\ F_d^{vec,\top}F_d^{-\frac{1}{2}}\eta' = 0,\ \eta'\neq 0. \tag{A.5}$$
Note that $F_d^{vec,\top}F_d^{-\frac{1}{2}}$ is a transposed vector containing the square roots of $F_d^{vec}$. This term will be denoted $F_d^{\frac{1}{2},vec\,\top}$ from now on. This problem has been treated in [Golub 1973] and we will apply here the same development. The Lagrangian of the maximization problem (A.5) is given by
$$\mathcal{L} = \eta'^\top F_d^{-\frac{1}{2}} f_d f_d^\top F_d^{-\frac{1}{2}}\eta' - \lambda\left(\eta'^\top\eta'-1\right) + 2\mu\,\eta'^\top F_d^{\frac{1}{2},vec},$$
where $\lambda$ and $\mu$ are Lagrange multipliers. The zero derivative point of the Lagrangian w.r.t. $\eta'$ is given as the solution of the following equation:
$$F_d^{-\frac{1}{2}} f_d f_d^\top F_d^{-\frac{1}{2}}\eta' - \lambda\eta' + \mu F_d^{\frac{1}{2},vec} = 0. \tag{A.6}$$
Multiplying by $F_d^{\frac{1}{2},vec\,\top}$ gives
$$F_d^{\frac{1}{2},vec\,\top}F_d^{-\frac{1}{2}} f_d f_d^\top F_d^{-\frac{1}{2}}\eta' - \lambda F_d^{\frac{1}{2},vec\,\top}\eta' + \mu F_d^{\frac{1}{2},vec\,\top}F_d^{\frac{1}{2},vec} = 0.$$
As the elements of $F_d^{vec}$ are quantizer output probabilities, we have $F_d^{\frac{1}{2},vec\,\top}F_d^{\frac{1}{2},vec} = 1$. Now, using the expression above and the second equality constraint (that the asymptotic mean is zero) on the factor that multiplies $\lambda$, we obtain
$$\mu = -F_d^{\frac{1}{2},vec\,\top}F_d^{-\frac{1}{2}} f_d f_d^\top F_d^{-\frac{1}{2}}\eta'.$$
Substituting this expression for $\mu$ in (A.6), we get
$$\left[\mathbf{I} - F_d^{\frac{1}{2},vec}F_d^{\frac{1}{2},vec\,\top}\right]F_d^{-\frac{1}{2}} f_d f_d^\top F_d^{-\frac{1}{2}}\eta' = \lambda\eta',$$
where $\mathbf{I}$ is the identity matrix. Clearly, $\mathbf{P}' = \mathbf{I} - F_d^{\frac{1}{2},vec}F_d^{\frac{1}{2},vec\,\top}$ is a projection matrix, and the optimal $\eta'$ is the eigenvector of $\mathbf{P}'F_d^{-\frac{1}{2}} f_d f_d^\top F_d^{-\frac{1}{2}}$ that gives the maximum $\lambda$. For a squared matrix $\mathbf{A}$ and projection matrix $\mathbf{P}'$, we know that the maximum eigenvalue function $\lambda(\cdot)$ respects the following equality:
$$\lambda\left(\mathbf{P}'\mathbf{A}\right) = \lambda\left(\mathbf{P}'^2\mathbf{A}\right) = \lambda\left(\mathbf{P}'\mathbf{A}\mathbf{P}'\right).$$
This means that the optimal $\eta'$ can also be found as the eigenvector of $\mathbf{P}'F_d^{-\frac{1}{2}} f_d f_d^\top F_d^{-\frac{1}{2}}\mathbf{P}'$ related to the maximum eigenvalue $\lambda$. As the only eigenvector of $\mathbf{P}'F_d^{-\frac{1}{2}} f_d f_d^\top F_d^{-\frac{1}{2}}\mathbf{P}'$ with nonzero eigenvalue is $\mathbf{P}'F_d^{-\frac{1}{2}} f_d$, this is the optimal $\eta'$. Changing back to the initial vector $\eta$, we have
$$\eta \propto F_d^{-\frac{1}{2}}\left[\mathbf{I} - F_d^{\frac{1}{2},vec}F_d^{\frac{1}{2},vec\,\top}\right]F_d^{-\frac{1}{2}} f_d.$$
The proportionality $\propto$ comes from the fact that the solution of (A.3) is defined up to a multiplicative factor. Expanding the expression gives
$$\eta \propto F_d^{-1}f_d - F_d^{-\frac{1}{2}}F_d^{\frac{1}{2},vec}F_d^{\frac{1}{2},vec\,\top}F_d^{-\frac{1}{2}} f_d = F_d^{-1}f_d - \mathbf{1}f_d,$$
where $\mathbf{1}$ is a squared matrix filled with ones.

A.1.10 Proof that $\frac{1}{f_d^\top F_d^{-1} f_d} = \frac{1}{\sum_{j=1}^{N_s}\sum_{i^{(j)}\in\mathcal{I}^{(j)}}\frac{\tilde{f}_d^2\left[i^{(j)}\right]}{\tilde{F}_d\left[i^{(j)}\right]}}$ in the fusion center approach

To simplify notation, the sensor superscript in $\tilde{F}_d^{(l')}\left[i^{(l')}\right]$ and $\tilde{f}_d^{(l')}\left[i^{(l')}\right]$ will not be written; the dependence on the sensor number will be implicit through the argument of the functions $\tilde{F}_d\left[i^{(l')}\right]$ and $\tilde{f}_d\left[i^{(l')}\right]$. Using the fact that $F_d$ is diagonal, we can write
$$f_d^\top F_d^{-1} f_d = \sum_{i\in\mathcal{I}^{\otimes N_s}} \frac{\left[\sum_{j=1}^{N_s}\tilde{f}_d\left[i^{(j)}\right]\prod_{\substack{j'=1\\ j'\neq j}}^{N_s}\tilde{F}_d\left[i^{(j')}\right]\right]^2}{\prod_{j=1}^{N_s}\tilde{F}_d\left[i^{(j)}\right]},$$
where $\mathcal{I}^{\otimes N_s}$ is the set of all possible $i$. Developing the quadratic term, the sum above is equal to the sum of two terms, $f_d^\top F_d^{-1} f_d = I_1 + I_2$, with
$$I_1 = \sum_{i\in\mathcal{I}^{\otimes N_s}} \frac{\sum_{j=1}^{N_s}\tilde{f}_d^2\left[i^{(j)}\right]\prod_{\substack{j'=1\\ j'\neq j}}^{N_s}\tilde{F}_d^2\left[i^{(j')}\right]}{\prod_{j=1}^{N_s}\tilde{F}_d\left[i^{(j)}\right]}$$
and
$$I_2 = \sum_{i\in\mathcal{I}^{\otimes N_s}} \frac{\sum_{l=1}^{N_s}\sum_{\substack{m=1\\ m\neq l}}^{N_s}\tilde{f}_d\left[i^{(l)}\right]\prod_{\substack{l'=1\\ l'\neq l}}^{N_s}\tilde{F}_d\left[i^{(l')}\right]\,\tilde{f}_d\left[i^{(m)}\right]\prod_{\substack{m'=1\\ m'\neq m}}^{N_s}\tilde{F}_d\left[i^{(m')}\right]}{\prod_{j=1}^{N_s}\tilde{F}_d\left[i^{(j)}\right]}.$$
Dividing out the common factors in $I_2$ and rewriting the sum, we obtain
$$I_2 = \sum_{i\in\mathcal{I}^{\otimes N_s}}\sum_{l=1}^{N_s}\sum_{\substack{m=1\\ m\neq l}}^{N_s}\tilde{f}_d\left[i^{(l)}\right]\tilde{f}_d\left[i^{(m)}\right]\prod_{\substack{p=1\\ p\neq l,\ p\neq m}}^{N_s}\tilde{F}_d\left[i^{(p)}\right]$$
$$= \sum_{l=1}^{N_s}\sum_{\substack{m=1\\ m\neq l}}^{N_s}\ \sum_{i^{(l)}\in\mathcal{I}^{(l)}}\tilde{f}_d\left[i^{(l)}\right]\sum_{i^{(m)}\in\mathcal{I}^{(m)}}\tilde{f}_d\left[i^{(m)}\right]\sum_{i\in\mathcal{I}^{\otimes N_s\star}}\ \prod_{\substack{p=1\\ p\neq l,\ p\neq m}}^{N_s}\tilde{F}_d\left[i^{(p)}\right],$$
where $\mathcal{I}^{\otimes N_s\star}$ is the set of all combinations of $i$ without considering $i^{(l)}$ and $i^{(m)}$. The interior sum in the RHS of the last equality equals one because $\tilde{F}_d\left[i^{(p)}\right]$ is a probability. Thus $I_2$ is given by
$$I_2 = \sum_{l=1}^{N_s}\sum_{\substack{m=1\\ m\neq l}}^{N_s}\left\{\sum_{i^{(l)}\in\mathcal{I}^{(l)}}\tilde{f}_d\left[i^{(l)}\right]\right\}\left\{\sum_{i^{(m)}\in\mathcal{I}^{(m)}}\tilde{f}_d\left[i^{(m)}\right]\right\}.$$
From the symmetry assumptions, $\tilde{f}_d\left[i^{(j)}\right]$ is an odd function of $i^{(j)}$; therefore $I_2 = 0$. The term $I_1$ can be rewritten by dividing out common factors from the numerator and denominator. This gives
$$I_1 = \sum_{i\in\mathcal{I}^{\otimes N_s}}\sum_{j=1}^{N_s}\frac{\tilde{f}_d^2\left[i^{(j)}\right]}{\tilde{F}_d\left[i^{(j)}\right]}\prod_{\substack{j'=1\\ j'\neq j}}^{N_s}\tilde{F}_d\left[i^{(j')}\right].$$
Changing the order of summation and separating the sum over the sensor index $j$ from the others, we obtain
$$I_1 = \sum_{j=1}^{N_s}\sum_{i^{(j)}\in\mathcal{I}^{(j)}}\frac{\tilde{f}_d^2\left[i^{(j)}\right]}{\tilde{F}_d\left[i^{(j)}\right]}\sum_{i\in\mathcal{I}^{\otimes N_s\star}}\prod_{\substack{j'=1\\ j'\neq j}}^{N_s}\tilde{F}_d\left[i^{(j')}\right],$$
where now $\mathcal{I}^{\otimes N_s\star}$ is the set of all possible $i$ without considering $i^{(j)}$. As the inner sum is equal to one, we finally have
$$f_d^\top F_d^{-1} f_d = \sum_{j=1}^{N_s}\sum_{i^{(j)}\in\mathcal{I}^{(j)}}\frac{\tilde{f}_d^2\left[i^{(j)}\right]}{\tilde{F}_d\left[i^{(j)}\right]}$$
and consequently
$$\frac{1}{f_d^\top F_d^{-1} f_d} = \frac{1}{\sum_{j=1}^{N_s}\sum_{i^{(j)}\in\mathcal{I}^{(j)}}\frac{\tilde{f}_d^2\left[i^{(j)}\right]}{\tilde{F}_d\left[i^{(j)}\right]}}.$$

A.2 More? - Further details

A.2.1 Discussion on the issues of finding the MLE

Binary quantization. In Subsection 1.3.6, we give an analytic expression for the MLE in the binary quantization case. In this case the MLE depends on the noise distribution mainly through the inverse of the CDF; thus existence and unicity of the MLE are guaranteed by the monotonicity of the noise CDF (implicitly stated in assumption AN2).

Multibit and dynamic quantization: log-concave distributions.
In the multibit case, or even in the binary case when the threshold is not static, we cannot write a closed-form expression for the MLE. In this case, we have to use a numerical method to evaluate the maximum. For log-concave distributions (the Gaussian distribution is an example), we can show that, as explained in Subsection 1.4.4, the log-likelihood with quantized measurements is concave. Thus, in this case the log-likelihood has only one maximum, which can be found very efficiently using Newton's algorithm.

Multibit and dynamic quantization: general distributions. If the distribution is not log-concave, then Newton's algorithm does not necessarily converge; and when it converges, it can converge very slowly compared with the log-concave case. It can also happen that the likelihood has multiple maxima; in this case, any technique based on the gradient may fail to find the global maximum, and other types of maximization techniques must be used. As a simple example of a non log-concave noise distribution, we can consider the Cauchy distribution, with PDF and CDF given by (1.33) and (1.34) respectively. The log-likelihood for estimating $x$ with $\delta = 1$, $\tau = [-3\ -2\ -1\ 0\ 1\ 2\ 3]^\top$ and $i_k = \{-3,-3,-4,3,3,3\}$ is shown in Fig. A.2.

Figure A.2: Log-likelihood function for estimating $x$ based on the quantized measurements $i_k = \{-3,-3,-4,3,3,3\}$. The quantizer has $N_I = 8$ and $\tau = [-3\ -2\ -1\ 0\ 1\ 2\ 3]^\top$. The distribution of the noise is Cauchy with $\delta = 1$. We can clearly note the multimodality of the log-likelihood function.

A.2.2 MLE for estimation of a constant based on binary quantized measurements: uniform/Gaussian noise case

The MLE for binary quantized measurements is given by (1.45):
$$\hat{X}_{ML} = \tau_0 - F^{-1}\left[\frac{1}{2}\left(1-\frac{1}{N}\sum_{k=1}^N i_k\right)\right].$$
The function $F^{-1}(\cdot)$ is the inverse of the noise CDF.
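As an illustration of (1.45), the sketch below computes the binary MLE numerically, assuming Gaussian noise so that $F^{-1}$ is the Gaussian quantile function; the function names and parameter values are illustrative, not from the thesis:

```python
import numpy as np
from scipy.stats import norm

def mle_binary(i, tau0, noise_ppf):
    """MLE (1.45) of x from binary measurements i_k in {-1, +1}:
    x_hat = tau0 - F^{-1}( (1 - mean(i)) / 2 )."""
    p_hat = 0.5 * (1.0 - np.mean(i))   # empirical estimate of F(tau0 - x)
    return tau0 - noise_ppf(p_hat)

# simulated example with Gaussian noise of scale sigma (illustrative choice)
rng = np.random.default_rng(0)
x_true, tau0, sigma, N = 0.3, 0.0, 1.0, 100000
y = x_true + sigma * rng.standard_normal(N)
i = np.where(y > tau0, 1, -1)          # binary quantizer with threshold tau0
x_hat = mle_binary(i, tau0, lambda p: norm.ppf(p, scale=sigma))
```

The same skeleton applies to the uniform/Gaussian and GGD cases of the next two subsections; only `noise_ppf`, the inverse CDF, changes.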
For the uniform/Gaussian case, the CDF is given by (1.37):
$$F(\varepsilon) = \begin{cases}\frac{1}{C}\,\Phi\left(\frac{\varepsilon+\frac{\alpha}{2}}{\sigma}\right), & \varepsilon < -\frac{\alpha}{2},\\[4pt] \frac{1}{C}\left[\frac{1}{2}+\frac{1}{\sqrt{2\pi}\sigma}\left(\varepsilon+\frac{\alpha}{2}\right)\right], & -\frac{\alpha}{2}\le\varepsilon\le\frac{\alpha}{2},\\[4pt] \frac{1}{C}\left[\frac{\alpha}{\sqrt{2\pi}\sigma}+\Phi\left(\frac{\varepsilon-\frac{\alpha}{2}}{\sigma}\right)\right], & \varepsilon > \frac{\alpha}{2},\end{cases}$$
where $C = 1+\frac{\alpha}{\sqrt{2\pi}\sigma}$. As the CDF is decomposed in three parts, for inverting the CDF we must distinguish three possible cases. Using the notation $1-\hat{P}_{ML} = \frac{1}{2}\left(1-\frac{1}{N}\sum_{k=1}^N i_k\right)$, the cases are the following:

• $1-\hat{P}_{ML} < \frac{1}{2C}$,

• $\frac{1}{2C} \le 1-\hat{P}_{ML} \le \frac{1}{C}\left(\frac{1}{2}+\frac{\alpha}{\sqrt{2\pi}\sigma}\right)$,

• $1-\hat{P}_{ML} > \frac{1}{C}\left(\frac{1}{2}+\frac{\alpha}{\sqrt{2\pi}\sigma}\right)$.

Using the inverse of $F(\cdot)$ for each case in the expression of the estimator above, we get
$$\hat{X}_{ML} = \begin{cases}\tau_0+\frac{\alpha}{2}-\sigma\Phi^{-1}\left[C\left(1-\hat{P}_{ML}\right)\right], & 1-\hat{P}_{ML} < \frac{1}{2C},\\[4pt] \tau_0+\frac{\alpha}{2}-\sqrt{2\pi}\sigma\left[C\left(1-\hat{P}_{ML}\right)-\frac{1}{2}\right], & \frac{1}{2C}\le 1-\hat{P}_{ML}\le\frac{1}{C}\left(\frac{1}{2}+\frac{\alpha}{\sqrt{2\pi}\sigma}\right),\\[4pt] \tau_0-\frac{\alpha}{2}-\sigma\Phi^{-1}\left[C\left(1-\hat{P}_{ML}\right)-\frac{\alpha}{\sqrt{2\pi}\sigma}\right], & 1-\hat{P}_{ML} > \frac{1}{C}\left(\frac{1}{2}+\frac{\alpha}{\sqrt{2\pi}\sigma}\right).\end{cases}$$
The function $\Phi^{-1}[\cdot]$ is the inverse of the standard Gaussian CDF.

A.2.3 MLE for estimation of a constant based on binary quantized measurements: generalized Gaussian noise case

For binary quantized measurements with a fixed threshold, the MLE is given by (1.45):
$$\hat{X}_{ML} = \tau_0 - F^{-1}\left[\frac{1}{2}\left(1-\frac{1}{N}\sum_{k=1}^N i_k\right)\right],$$
where $F^{-1}(\cdot)$ is the inverse of the noise CDF. In the GGD case, the CDF is the following (1.40):
$$F(\varepsilon) = \frac{1}{2}\left[1+\mathrm{sign}(\varepsilon)\,\frac{\gamma\left(\frac{1}{\beta},\left|\frac{\varepsilon}{\delta}\right|^\beta\right)}{\Gamma\left(\frac{1}{\beta}\right)}\right].$$
Therefore, denoting the average of the binary observations by $\bar{i} = \frac{1}{N}\sum_{k=1}^N i_k$, we have the following MLE:
$$\hat{X}_{ML} = \tau_0 + \delta\,\mathrm{sign}(\bar{i})\left[\gamma^{-1}\left(\frac{1}{\beta},\,|\bar{i}|\,\Gamma\left(\frac{1}{\beta}\right)\right)\right]^{\frac{1}{\beta}},$$
where $\gamma^{-1}[\cdot,\cdot]$ is the inverse of the incomplete gamma function.

A.2.4 Adaptive binary threshold asymptotic probabilities when the threshold is defined in a grid

We consider here that the parameter lies in an interval $[-A,A]$, where $A$ is a positive real. For assimilating this information, we are going to change the update of the binary threshold. The following is assumed:

• The step size $\gamma$ is chosen so that $A = N\gamma$, with $N$ a positive integer.
• The initial threshold $\tau_{0,0}$ is chosen to be an integer multiple of $\gamma$, $\tau_{0,0} = j\gamma$, so that $\tau_{0,0}\in[-A,A]$.

• The threshold cannot leave the interval $[-A,A]$. This means that when $\tau_{0,k-1} = A$ and $i_k = 1$, we set $\tau_{0,k} = A$, and when $\tau_{0,k-1} = -A$ and $i_k = -1$, we set $\tau_{0,k} = -A$.

This changes the adaptive update of the threshold (1.49) to
$$\tau_{0,k} = \begin{cases}-A, & \text{if } \tau_{0,k-1} = -A \text{ and } i_k = -1,\\ A, & \text{if } \tau_{0,k-1} = A \text{ and } i_k = 1,\\ \tau_{0,k-1}+\gamma\, i_k, & \text{otherwise}.\end{cases} \tag{A.7}$$
The threshold is now defined on a finite grid:
$$\tau_{0,k}\in\left\{-A,\ -A\frac{N-1}{N},\ \cdots,\ 0,\ \cdots,\ A\frac{N-1}{N},\ A\right\}.$$

Figure A.3: An iteration of the binary threshold update in the grid where it is defined. The values of the finite grid where the threshold is defined are indicated by the black squares.

Asymptotic probability distribution

In a similar way as for the infinite grid, we will define a transition matrix $T_{fg}$ for the finite grid. In this case the matrix has size $(2N+1)\times(2N+1)$. Using the following notation for the CDF elements,
$$a_j = F\left(A\frac{j}{N}-x\right) = 1-F\left(x-A\frac{j}{N}\right),$$
the transition matrix is given by
$$T_{fg} = \begin{pmatrix} a_{-N} & a_{-(N-1)} & & & \\ 1-a_{-N} & 0 & \ddots & & \\ & 1-a_{-(N-1)} & \ddots & a_{N-1} & \\ & & \ddots & 0 & a_N\\ & & & 1-a_{N-1} & 1-a_N \end{pmatrix},$$
where all unmarked entries are zero: from an interior state $j$ the threshold moves down with probability $a_j$ and up with probability $1-a_j$, while it stays at the border states $-A$ and $A$ with probabilities $a_{-N}$ and $1-a_N$ respectively. The Markov chain formed by the sequence $\tau_{0,k}$ is an ergodic chain, as all threshold values can be reached from all other threshold values and the borders $-A$ and $A$ make the chain aperiodic¹. Thus, the sequence of thresholds admits a unique asymptotic distribution $p_\infty$ [Gallager 1996, Ch. 4]. The asymptotic distribution is then the solution of $p_\infty = T_{fg}\,p_\infty$, or equivalently
$$\left(T_{fg}-\mathbf{I}\right)p_\infty = \mathbf{R}\,p_\infty = 0, \tag{A.8}$$
where $\mathbf{I}$ is a $(2N+1)\times(2N+1)$ identity matrix and $0$ is the zero vector.
¹ This is not the case for the thresholds defined in an infinite grid; in that case the thresholds must be separated into two periodic classes [Fine 1968].

The problem is then to find a vector in the null space of $\mathbf{R} = T_{fg}-\mathbf{I}$, under the constraint that the vector is a probability vector: it sums to one, $\mathbf{1}^\top p_\infty = 1$, where $\mathbf{1}$ is a vector with all elements equal to one, and all its elements are nonnegative, $p_\infty\succeq 0$. To solve (A.8), we start by solving its last line (the line at the bottom). We have
$$(1-a_{N-1})\,p_{N-1,\infty} - a_N\,p_{N,\infty} = 0,$$
which gives
$$p_{N-1,\infty} = \frac{a_N}{1-a_{N-1}}\,p_{N,\infty}.$$
For the next line (above), we obtain
$$a_N\,p_{N,\infty} - p_{N-1,\infty} + (1-a_{N-2})\,p_{N-2,\infty} = 0,$$
and solving it, we have
$$p_{N-2,\infty} = \frac{p_{N-1,\infty}-a_N\,p_{N,\infty}}{1-a_{N-2}}.$$
Using the expression for $p_{N-1,\infty}$ above, we get
$$p_{N-2,\infty} = \frac{a_{N-1}\,a_N}{(1-a_{N-2})(1-a_{N-1})}\,p_{N,\infty}.$$
Clearly, from the similarity of the equations for the other lines, we can proceed in the same way to obtain
$$p_{N-i,\infty} = \frac{\prod_{j=0}^{i-1}a_{N-j}}{\prod_{j=1}^{i}(1-a_{N-j})}\,p_{N,\infty},\qquad\text{for } i\in\{1,\cdots,2N\}. \tag{A.9}$$
If we denote
$$c_i = \frac{\prod_{j=0}^{i-1}a_{N-j}}{\prod_{j=1}^{i}(1-a_{N-j})},$$
then $p_{N,\infty}$ can be found by using the constraint that the vector must sum to one:
$$\left(\sum_{i=1}^{2N}p_{N-i,\infty}\right)+p_{N,\infty} = 1.$$
Separating the factor $p_{N,\infty}$, which appears in all terms (see (A.9)), we get
$$p_{N,\infty} = \frac{1}{1+\sum_{i=1}^{2N}c_i}. \tag{A.10}$$
Using this and (A.9), we can obtain a general expression for the probabilities:
$$p_{N-i,\infty} = \frac{c_i'}{1+\sum_{i=1}^{2N}c_i},\qquad\text{for } i\in\{0,\cdots,2N\}, \tag{A.11}$$
where
$$c_i' = \begin{cases}1, & \text{if } i = 0,\\ c_i, & \text{otherwise}.\end{cases}$$
By substituting the $c_i'$ in (A.11), we get the following expressions for the asymptotic probabilities:
$$p_{N,\infty} = \frac{1}{P(x)}\prod_{i=1}^{2N}(1-a_{N-i}),\qquad p_{-N,\infty} = \frac{1}{P(x)}\prod_{i=0}^{2N-1}a_{N-i},$$
$$p_{N-i,\infty} = \frac{1}{P(x)}\prod_{j=0}^{i-1}a_{N-j}\prod_{j=i+1}^{2N}(1-a_{N-j}), \tag{A.12}$$
where the normalization factor $P(x)$ is
$$P(x) = \prod_{j=1}^{2N}(1-a_{N-j}) + \prod_{j=0}^{2N-1}a_{N-j} + \sum_{j=1}^{2N-1}\,\prod_{k=0}^{j-1}a_{N-k}\prod_{l=j+1}^{2N}(1-a_{N-l}). \tag{A.13}$$
With the expressions above for the asymptotic probabilities of the thresholds, it is possible to obtain exact values of the asymptotic FI using (1.62).

Maximum of the probability distribution

We are now going to verify that the asymptotic threshold is indeed placed around the true parameter. We will analyze the position of the maximum probability threshold and the increasing and decreasing patterns of the asymptotic probabilities. For doing so, we obtain expressions for the signs of the differences between neighboring (in threshold position) asymptotic probabilities. Starting at the negative extremum of the interval, the difference is
$$p_{-N,\infty}-p_{-(N-1),\infty} = \frac{1}{P(x)}\left[\prod_{i=0}^{2N-1}a_{N-i} - \prod_{j=0}^{2N-2}a_{N-j}\,(1-a_{-N})\right].$$
Making the common factor explicit, we have
$$p_{-N,\infty}-p_{-(N-1),\infty} = \frac{\prod_{i=0}^{2N-2}a_{N-i}}{P(x)}\left[a_{-(N-1)}-(1-a_{-N})\right].$$
The factor $\frac{\prod_{i=0}^{2N-2}a_{N-i}}{P(x)}$ is positive: $\prod_{i=0}^{2N-2}a_{N-i}$ is a product of probabilities, and $P(x)$, a sum of products of probabilities, is also positive. Therefore, to obtain the sign of $p_{-N,\infty}-p_{-(N-1),\infty}$ as a function of $x$, we only need to analyze the sign of the difference $a_{-(N-1)}-(1-a_{-N})$. Using the expressions for the $a_j$ terms, we have
$$\mathrm{sign}\left(p_{-N,\infty}-p_{-(N-1),\infty}\right) = \mathrm{sign}\left(\left[1-F\left(x+A\frac{N-1}{N}\right)\right]-F(x+A)\right).$$
The difference on the RHS is the difference between a complementary CDF parametrized by $x$ and centered on $-A\frac{N-1}{N}$, $1-F\left(x+A\frac{N-1}{N}\right)$, and a CDF centered on $-A$, $F(x+A)$. Using the facts that the complementary CDF is a decreasing function (from one to zero), that the CDF is an increasing function (from zero to one) and that the CDF is simply a reversed and shifted version of the complementary CDF, we obtain the following conclusions:

• the sign of the probability difference is positive for $x\in\left(-A,\,-A\frac{N-\frac{1}{2}}{N}\right)$;
• the sign is negative for $x > -A\frac{N-\frac{1}{2}}{N}$;

• the probability difference is zero when $x = -A\frac{N-\frac{1}{2}}{N}$.

The difference between probabilities for $i\in\{1,\cdots,2N-1\}$ is
$$p_{N-i,\infty}-p_{N-i+1,\infty} = \frac{1}{P(x)}\left[\prod_{j=0}^{i-1}a_{N-j}\prod_{j=i+1}^{2N}(1-a_{N-j}) - \prod_{j=0}^{i-2}a_{N-j}\prod_{j=i}^{2N}(1-a_{N-j})\right],$$
and after factorization
$$p_{N-i,\infty}-p_{N-i+1,\infty} = \frac{\prod_{j=0}^{i-2}a_{N-j}\prod_{j=i+1}^{2N}(1-a_{N-j})}{P(x)}\left[a_{N-i+1}-(1-a_{N-i})\right].$$
As the first factor on the RHS is positive, the sign of the difference is determined by
$$\mathrm{sign}\left(p_{N-i,\infty}-p_{N-i+1,\infty}\right) = \mathrm{sign}\left(\left[1-F\left(x-A\frac{N-i+1}{N}\right)\right]-F\left(x-A\frac{N-i}{N}\right)\right).$$
The analysis of this sign is similar to the negative extremum case. Thus we have the following conclusions:

• the sign of the difference is positive for $x < A\frac{N-i+\frac{1}{2}}{N}$;

• the sign is negative for $x > A\frac{N-i+\frac{1}{2}}{N}$;

• the difference is zero when $x = A\frac{N-i+\frac{1}{2}}{N}$.

Using a similar procedure for the positive extremum, we have that the sign of the difference is given by
$$\mathrm{sign}\left(p_{N-1,\infty}-p_{N,\infty}\right) = \mathrm{sign}\left(\left[1-F(x-A)\right]-F\left(x-A\frac{N-1}{N}\right)\right),$$
which leads to the same conclusions as above, here with the zero crossing at $x = A\frac{N-\frac{1}{2}}{N}$. Joining all the results, we can see that the maximum of the asymptotic probability vector always occurs at the point of the grid that is closest to $x$. Moreover, the distribution always decreases as we consider thresholds with increasing distance to the maximum probability threshold. This means that the distribution is unimodal with its maximum close to the parameter, thus justifying the statement that the thresholds will be placed asymptotically around the parameter.

Small noise approximation

The analytical asymptotic probability expressions (A.12) and (A.13) are quite cumbersome to evaluate when $N$ is large.
As the CDFs are almost step (zero/one) functions for large arguments, and as the asymptotic probabilities are products of CDFs, in the case when the noise level is small compared with $\gamma$ we can obtain very simple approximate expressions for the asymptotic probabilities that involve only a few CDF terms. The small noise approximations for the complementary CDF and the CDF are the following:
$$a_{N-i} = 1-F\left(x-A\frac{N-i}{N}\right) \approx \begin{cases}1, & x < A\frac{N-i-1}{N},\\ 0, & x > A\frac{N-i+1}{N},\\ 1-F\left(x-A\frac{N-i}{N}\right), & A\frac{N-i-1}{N} < x < A\frac{N-i+1}{N},\end{cases}$$
$$1-a_{N-i} = F\left(x-A\frac{N-i}{N}\right) \approx \begin{cases}0, & x < A\frac{N-i-1}{N},\\ 1, & x > A\frac{N-i+1}{N},\\ F\left(x-A\frac{N-i}{N}\right), & A\frac{N-i-1}{N} < x < A\frac{N-i+1}{N}.\end{cases} \tag{A.14}$$
Independently of the value of $x$, we can get the following approximations of the CDF products using (A.14):
$$\prod_{j=1}^{2N}(1-a_{N-j}) \approx 1-a_{N-1},\qquad \prod_{i=0}^{2N-1}a_{N-i} \approx a_{-N+1},$$
$$\prod_{i=0}^{j-1}a_{N-i}\prod_{i=j+1}^{2N}(1-a_{N-i}) \approx a_{N-j+1}\,(1-a_{N-j-1}).$$
We can now apply these approximations to the asymptotic probabilities (A.12); note that the approximations depend on the region where $x$ lies. For $x\in\left(-A,\,-A\frac{N-1}{N}\right)$ we have
$$p_{-N,\infty} \approx \frac{a_{-N+1}}{2-a_{-N}} = \frac{1-F\left(x+A\frac{N-1}{N}\right)}{1+F(x+A)},\qquad p_{-(N-1),\infty} \approx \frac{1-a_{-N}}{2-a_{-N}} = \frac{F(x+A)}{1+F(x+A)},$$
$$p_{-(N-2),\infty} \approx \frac{1-a_{-N+1}}{2-a_{-N}} = \frac{F\left(x+A\frac{N-1}{N}\right)}{1+F(x+A)},\qquad p_{N-i,\infty} \approx 0 \text{ for } i\in\{0,\cdots,2N-3\}.$$
For $x\in\left(A\frac{N-i-1}{N},\,A\frac{N-i}{N}\right)$, we obtain 4 nonzero terms; all the others are approximately zero. The nonzero terms are
$$p_{N-i-2,\infty} \approx \frac{a_{N-i-1}}{2} = \frac{1-F\left(x-A\frac{N-i-1}{N}\right)}{2},\qquad p_{N-i-1,\infty} \approx \frac{a_{N-i}}{2} = \frac{1-F\left(x-A\frac{N-i}{N}\right)}{2},$$
$$p_{N-i,\infty} \approx \frac{1-a_{N-i-1}}{2} = \frac{F\left(x-A\frac{N-i-1}{N}\right)}{2},\qquad p_{N-i+1,\infty} \approx \frac{1-a_{N-i}}{2} = \frac{F\left(x-A\frac{N-i}{N}\right)}{2}.$$
Finally, for the positive extremum, $x\in\left(A\frac{N-1}{N},\,A\right)$, the approximations give the following:
$$p_{N-2,\infty} \approx \frac{1-F\left(x-A\frac{N-1}{N}\right)}{2-F(x-A)},\qquad p_{N-1,\infty} \approx \frac{1-F(x-A)}{2-F(x-A)},$$
$$p_{N,\infty} \approx \frac{F\left(x-A\frac{N-1}{N}\right)}{2-F(x-A)},\qquad p_{N-i,\infty} \approx 0 \text{ for } i\in\{3,\cdots,2N\}.$$
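When exact values are wanted without the analytical expressions, the stationary distribution can also be obtained numerically as a null vector of $T_{fg}-\mathbf{I}$. The sketch below (assuming Gaussian noise; all names are illustrative) also checks the unimodality result of the previous paragraphs: the maximum probability sits at the grid point closest to $x$:

```python
import numpy as np
from scipy.stats import norm

def stationary(x, A, N, F):
    """Stationary distribution of the threshold chain on the grid -A, ..., A
    (2N + 1 points), built from the column-stochastic matrix T_fg of (A.8)."""
    grid = A * np.arange(-N, N + 1) / N
    a = F(grid - x)                          # a_j = F(A j / N - x): prob. of moving down
    M = 2 * N + 1
    T = np.zeros((M, M))
    for j in range(M):
        T[max(j - 1, 0), j] += a[j]          # move down (reflecting at -A)
        T[min(j + 1, M - 1), j] += 1 - a[j]  # move up (reflecting at +A)
    # null vector of T - I via SVD; the chain is ergodic, so it is unique
    _, _, vt = np.linalg.svd(T - np.eye(M))
    p = vt[-1]
    return grid, p / p.sum()

# small noise compared with the step: the mass concentrates around x
grid, p = stationary(x=0.33, A=1.0, N=10, F=lambda e: norm.cdf(e, scale=0.02))
```

For small noise scales, the resulting vector matches the few-term approximations above, with all remaining grid points carrying negligible probability.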
Under the small noise assumption, these approximations are not only useful for evaluating the FI, but they can also be used for estimating the parameter when the number of measurements is very large. Suppose that after a number of samples $M$ the threshold probabilities reach approximately the asymptotic distribution; from this point on, we start to store the measurements, forming a histogram of the threshold values that were used. After a large number of measurements, the histogram will be very close to the asymptotic threshold probabilities. We can then search for the two largest values of the histogram and, using one of the corresponding empirical frequencies in place of the true probability, invert the corresponding approximate expression for the probability to obtain $x$. For example, suppose we have obtained the largest empirical frequencies at the points $N-i-1$ and $N-i$. The empirical frequency at $N-i-1$ is $\hat{p}_{N-i-1,\infty}$; then, inverting the corresponding approximation $p_{N-i-1,\infty}\approx\frac{1}{2}\left[1-F\left(x-A\frac{N-i}{N}\right)\right]$, we get the estimate
$$\hat{x} = A\frac{N-i}{N} + F^{-1}\left(1-2\hat{p}_{N-i-1,\infty}\right).$$

A.2.5 Particle filter using rejection sampling for tracking a scalar Wiener process

The optimal sampling distribution $p(x_k|x_{k-1},i_k)$ can be rewritten as
$$p(x_k|x_{k-1},i_k) = \frac{p(x_k,x_{k-1},i_k)}{p(x_{k-1},i_k)} = \frac{P(i_k|x_k)\,p(x_k|x_{k-1})}{P(i_k|x_{k-1})} \propto P(i_k|x_k)\,p(x_k|x_{k-1}), \tag{A.15}$$
where the proportionality relation comes from the fact that, for a given $i_k$, the probability $P(i_k|x_{k-1})$ is a constant independent of $x_k$. Note that, as $P(i_k|x_k)$ is a probability, it can be bounded above by one; as a consequence, $P(i_k|x_k)\,p(x_k|x_{k-1})$ can be bounded above by $p(x_k|x_{k-1})$, which is a Gaussian PDF. Therefore, for each previous $x_{k-1}^{(j)}$, a standard rejection sampling method [Robert 1999, pp. 50] can be applied to generate a sample from $p\left(x_k|x_{k-1}^{(j)},i_k\right)$.
This can be done by sampling independently from the Gaussian distribution $p(x_k|x_{k-1})$ and from the uniform distribution $\mathcal{U}[0,1]$. The rejection sampling method that gives the optimal samples $x_k^{(j)}$ is the following:

Rejection sampling for the optimal sampling distribution (app1)

For $j = 1$ to $N_S$:

• Set $u_k^{(j)} = 1$ and $l_k^{(j)} = 0$.

• While $l_k^{(j)} < u_k^{(j)}$, do
  – Sample $x_k^{(j)}$ from the Gaussian distribution (How? - App. A.3.3)
$$p\left(x_k|x_{k-1}^{(j)}\right) = \frac{1}{\sqrt{2\pi}\sigma_w}\exp\left[-\frac{1}{2}\left(\frac{x_k-x_{k-1}^{(j)}-u_k}{\sigma_w}\right)^2\right].$$
  – Evaluate $l_k^{(j)} = P\left(i_k|x_k^{(j)}\right)$.
  – Sample $u_k^{(j)}$, independently from $x_k^{(j)}$, from the uniform distribution $\mathcal{U}[0,1]$.

Note that we accept a sample $x_k^{(j)}$ only when its likelihood $P\left(i_k|x_k^{(j)}\right)$ is larger than the uniform sample. By replacing (A.15) in the place of $q(x_k|x_{0:k-1},i_{1:k})$ in the recursive expression for the weights (2.25), we have the following update equation for the weights:
$$w\left(x_{1:k}^{(j)}\right) = P\left(i_k|x_{k-1}^{(j)}\right)\tilde{w}\left(x_{1:k-1}^{(j)}\right).$$
Observe that we must evaluate $P\left(i_k|x_{k-1}^{(j)}\right)$, which can be obtained similarly to $P(i_k|x_k)$ with
$$P\left(i_k|x_{k-1}^{(j)}\right) = \begin{cases}F'\left(\tau_{i_k,k}-x_{k-1}^{(j)}-u_k\right)-F'\left(\tau_{i_k-1,k}-x_{k-1}^{(j)}-u_k\right), & \text{if } i_k > 0,\\[4pt] F'\left(\tau_{i_k+1,k}-x_{k-1}^{(j)}-u_k\right)-F'\left(\tau_{i_k,k}-x_{k-1}^{(j)}-u_k\right), & \text{if } i_k < 0,\end{cases} \tag{A.16}$$
where $F'$ is the CDF of the r.v. that is the sum of the noise r.v. $V_k$ and the centered increment $W_k-u_k$. The procedure for tracking the Wiener process starts by sampling independently $N_S$ times the prior distribution $p(x_0)$ and setting all the initial weights to $\frac{1}{N_S}$. After obtaining the first measurement $i_1$, both the sampling with $p(x_1|x_0,i_1)$ and the update of the weights can be done. Then, after normalizing the weights, the estimate $\hat{x}_1$ can be obtained with the weighted mean. The procedure is then repeated for each time $k$ in a sequential way. This procedure may also suffer from the degeneracy problem explained in Sec. 2.3.4 (p. 85); thus a resampling step (How? - App. A.3.4) (app4) must be carried out each time the number of effective samples is too low.
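As a concrete illustration of (app1), the sketch below draws from the optimal proposal for one particle, assuming Gaussian state noise and a binary quantizer with Gaussian measurement noise; these modeling choices and all names are illustrative assumptions, not fixed by the thesis:

```python
import numpy as np
from scipy.stats import norm

def sample_optimal(x_prev, u_k, sigma_w, lik, rng):
    """Rejection sampler (app1): draws x_k from p(x_k | x_prev, i_k), which is
    proportional to lik(x_k) * N(x_prev + u_k, sigma_w^2). Since the likelihood
    lik = P(i_k | x_k) <= 1, the Gaussian prior is a valid envelope."""
    while True:
        x = x_prev + u_k + sigma_w * rng.standard_normal()  # proposal from the prior
        if rng.uniform() < lik(x):                          # accept w.p. P(i_k | x)
            return x

# binary measurement i_k = +1 ("y_k above the threshold tau"), Gaussian noise scale delta
rng = np.random.default_rng(1)
tau, delta = 0.0, 0.5
lik = lambda x: 1.0 - norm.cdf(tau - x, scale=delta)        # P(i_k = +1 | x_k = x)
xs = np.array([sample_optimal(0.0, 0.0, 1.0, lik, rng) for _ in range(20000)])
```

The accepted samples concentrate on the side of the threshold indicated by the measurement, as expected from the product form (A.15).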
The performance of this sequential importance sampling algorithm can be assessed through a lower bound, as discussed in Sec. 2.4. Remark: to reduce the complexity of this algorithm, we could use a technique based on local linearizations of the optimal proposal distribution [Doucet 2000]. The problem with this approach is that it requires the logarithm of the optimal proposal to have a positive second derivative, and this cannot be guaranteed for all the noise distributions considered here.

The sequential procedure with the resampling step (particle filter) for solving (b) (p. 29) is the following:

Solution to (b) - Particle filter with rejection for a fixed threshold set sequence τ_{1:k} (b1.2)

1) Estimator
• Set uniform normalized weights w̃(x_0^{(j)}) = 1/N_S and initialize N_S particles {x_0^{(1)}, · · · , x_0^{(N_S)}} by sampling the prior
    p(x_0) = (1/(σ_0 √(2π))) exp( −(1/2) [ (x_0 − x′_0)/σ_0 ]² ).
For each time k,
• for j from 1 to N_S, sample the r.v. X_k^{(j)} with rejection sampling (app1).
• for j from 1 to N_S, evaluate and normalize the weights
    w(x_{1:k}^{(j)}) = P(i_k | x_{k−1}^{(j)}) w̃(x_{1:k−1}^{(j)}),   w̃(x_{1:k}^{(j)}) = w(x_{1:k}^{(j)}) / Σ_{j=1}^{N_S} w(x_{1:k}^{(j)}),
  where P(i_k | x_{k−1}^{(j)}) is given by (A.16).
• Obtain the estimate as the weighted mean
    x̂_k ≈ Σ_{j=1}^{N_S} x_k^{(j)} w̃(x_{1:k}^{(j)}).
• Evaluate the number of effective particles
    N_eff = 1 / Σ_{j=1}^{N_S} w̃²(x_{1:k}^{(j)});
  if N_eff < N_thresh, then resample using multinomial resampling (How? - App. A.3.4) (app4).

2) Performance (lower bound)
The MSE can be lower bounded as follows:
    MSE_k ≥ 1/J′_k,
with J′_k given recursively by
    J′_k = 1/σ_w² + I_q(0) − (1/σ_w⁴) · 1/( 1/σ_w² + J′_{k−1} ).

A.3 How? - Algorithms and implementation issues

A.3.1 How to sample from a uniform/Gaussian distribution.

We are going to consider that we can generate uniform and Gaussian variates easily and independently. For generating uniform variates, one can use linear congruential generators (see [Knuth 1997, Sec.
3.2] for details), while for generating Gaussian variates one can use the Box-Muller transform, which requires a pair of independent uniform variates [Box 1958].

By looking at the specific form of the PDF (1.36)

  f(ε) = f_GL(ε) = (1/(C σ √(2π))) exp( −(1/2) [ (ε + α/2)/σ ]² ),   for ε < −α/2,
         f_U(ε)  = 1/(C σ √(2π)),                                    for −α/2 ≤ ε ≤ α/2,
         f_GR(ε) = (1/(C σ √(2π))) exp( −(1/2) [ (ε − α/2)/σ ]² ),   for ε > α/2,

where C = 1 + α/(σ √(2π)), we can see that we can generate samples from it by generating samples independently from the half-Gaussian distributions

  f′_GL(ε) = (2/(σ √(2π))) exp( −(1/2) [ (ε + α/2)/σ ]² ) for ε < −α/2, and 0 otherwise,
  f′_GR(ε) = (2/(σ √(2π))) exp( −(1/2) [ (ε − α/2)/σ ]² ) for ε > α/2, and 0 otherwise,

and from the central uniform distribution

  f′_U(ε) = 1/α for −α/2 ≤ ε ≤ α/2, and 0 otherwise,

and then choosing one of the samples randomly. For the samples to be distributed correctly, we choose the sample from the left Gaussian r.v. with probability 1/(2C), the sample from the uniform distribution with probability α/(σ √(2π) C), or the sample from the right Gaussian also with probability 1/(2C).

This gives the following algorithm for generating a sample from the uniform/Gaussian distribution with parameters α and σ:

Uniform/Gaussian sample generator (app2)

To generate a sample v do the following:
• evaluate
    C = 1 + α/(σ √(2π)),
    p_1 = 1/(2C),
    p_2 = (1/C) ( 1/2 + α/(σ √(2π)) ).
• Generate 2 independent uniform variates (from U[0, 1]) u_0 and u_1, and 2 standard (zero mean and σ = 1) Gaussian variates g_1 and g_2.
• If u_0 < p_1, then
    v = −( σ |g_1| + α/2 ),
  else if p_1 ≤ u_0 ≤ p_2, then
    v = α ( u_1 − 1/2 ),
  else
    v = σ |g_2| + α/2.

A.3.2 How to sample from a GGD.

We consider that an easy method for generating independent binary samples (samples with values −1 or 1 with equal probability) and gamma samples is available.
For obtaining binary samples one can simply take the sign of a sample from a uniform U[−0.5, 0.5] distribution, and for obtaining gamma variates one can use a rejection method [Marsaglia 2000]. It can be shown that a generalized Gaussian r.v. V′ with shape parameter β and unit scale parameter can be obtained by the following transformation of two independent r.v. [Nardon 2009]:

  V′ = B Γ_{1/β}^{1/β},

where B is a binary r.v. and Γ_{1/β} is a gamma r.v. with shape parameter 1/β. If we want a generalized Gaussian r.v. V with scale parameter δ, we need only multiply V′ by δ.

This gives the following algorithm for generating a sample from a GGD with parameters β and δ:

Generalized Gaussian sample generator (app3)

To generate a sample v do the following:
• generate independently a uniform sample u from U[0, 1] and a gamma sample γ_{1/β} from Γ_{1/β} with unit scale parameter.
• Transform the uniform sample u into a binary sample b with
    b = sign( u − 1/2 ).
• Apply the transformation
    v = δ b γ_{1/β}^{1/β}.

A.3.3 How to sample from the distribution p(x_k | x_{k−1}^{(j)}) using a standard Gaussian variate.

Suppose we can generate a standard Gaussian variate W_n ∼ N(0, 1), for example using the Box-Muller transform on a pair of independent uniform variates [Box 1958]. We want to generate a Gaussian variate with PDF

  p(x_k | x_{k−1}^{(j)}) = (1/(σ_w √(2π))) exp( −(1/2) [ (x_k − x_{k−1}^{(j)} − u_k)/σ_w ]² ),

where x_{k−1}^{(j)}, u_k and σ_w are known. Using the following properties of Gaussian r.v.:

• the product of a Gaussian r.v. by a constant gives a Gaussian r.v. with variance given by the initial variance multiplied by the square of the constant;
• the sum of a Gaussian r.v. and a constant gives a Gaussian r.v. with mean shifted by the value of the constant;

we have that the r.v. X_k^{(j)} distributed according to p(x_k | x_{k−1}^{(j)}) can be generated as follows:

  X_k^{(j)} = σ_w W_n + x_{k−1}^{(j)} + u_k.

A.3.4 Multinomial resampling algorithm.
In order to sample from

  P(x_k) = w̃(x_{1:k}^{(j)}) if x_k = x_k^{(j)}, and 0 otherwise,

we can create an increasing sequence of cumulative weights

  w_+^{(j)} = Σ_{i=1}^{j} w̃(x_{1:k}^{(i)}),   where we define w̃(x_{1:k}^{(0)}) = 0.

Thus, the intervals defined by the neighboring pairs (w_+^{(j−1)}, w_+^{(j)}] of the sequence form a partition of the interval [0, 1], and their lengths equal the corresponding w̃(x_{1:k}^{(j)}). If we sample from the uniform distribution on [0, 1], U[0, 1], and choose x_k^{(j)} with j corresponding to the interval of the sequence w_+ in which the uniform sample falls, then the chosen x_k^{(j)} are distributed according to the probability distribution P(x_k) above. Resetting equal sample weights at the end of the procedure, we have the multinomial resampling algorithm:

Multinomial resampling (app4)

For j = 1 to N_S
• store the particle values in a sequence of auxiliary variables x̃_k^{(j)}:
    x̃_k^{(j)} = x_k^{(j)},
• create the sequence of cumulative weights
    w_+^{(j)} = Σ_{i=1}^{j} w̃(x_{1:k}^{(i)}),   with w̃(x_{1:k}^{(0)}) = 0.
• Create a sequence {u′_1, · · · , u′_{N_S}} by sampling independently N_S times from the distribution U[0, 1].
For j = 1 to N_S,
• set x_k^{(j)} = x̃_k^{(l_j)}, where l_j is chosen so that
    u′_j ∈ ( w_+^{(l_j−1)}, w_+^{(l_j)} ],
• reset the normalized weights to a uniform distribution:
    w̃(x_{1:k}^{(j)}) = 1/N_S.

A.3.5 How to sample from a STD.

For generating samples from the STD, we consider that a simple method for generating uniform U[0, 1] samples is available. It is possible to show that a Student's-t r.v. V′ with shape parameter β and unit scale parameter can be obtained by the following transformation of two independent r.v. [Bailey 1994]:

  V′ = sqrt( β ( U_1^{−2/β} − 1 ) ) cos(2π U_2),

where U_1 and U_2 are independent r.v. with uniform U[0, 1] distribution. If we want a Student's-t r.v. V with scale parameter δ, we need only multiply V′ by δ.
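The transformation above translates directly into code; the following is a minimal Python sketch (the function name is illustrative, not from the thesis):

```python
import math
import random

def sample_student_t(beta, delta, rng=random):
    # Polar-type transform [Bailey 1994]: from two independent U[0,1] variates,
    # V' = sqrt(beta * (U1^(-2/beta) - 1)) * cos(2*pi*U2) is Student's-t
    # with shape (degrees of freedom) beta; delta rescales it.
    u1 = 1.0 - rng.random()  # in (0, 1], avoids u1 == 0 in the power
    u2 = rng.random()
    return delta * math.sqrt(beta * (u1 ** (-2.0 / beta) - 1.0)) * math.cos(2.0 * math.pi * u2)
```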
Thus we have the following algorithm for generating a sample from a STD with parameters β and δ:

Student's-t sample generator (app5)

To generate a sample v do the following:
• generate independently two uniform samples u_1 and u_2 from U[0, 1].
• Apply the transformation
    v = δ sqrt( β ( u_1^{−2/β} − 1 ) ) cos(2π u_2).

B Résumé détaillé en français (extended abstract in French)

Ceci est un résumé détaillé en français des travaux réalisés dans cette thèse. L'introduction et les conclusions des travaux sont traduites directement du manuscrit en anglais pour une meilleure compréhension du contexte ; les chapitres concernant les développements et résultats théoriques sont présentés sous forme synthétique, avec seulement les principaux développements et résultats.

Contents
B.1 Introduction
B.2 Estimation et quantification : algorithmes et performances
  B.2.1 Estimation d'un paramètre constant
  B.2.2 Estimation d'un paramètre variable
  B.2.3 Quantifieurs adaptatifs pour l'estimation
B.3 Estimation et quantification : approximations à haute résolution
  B.3.1 Approximation à haute résolution de l'information de Fisher
B.4 Conclusions
  B.4.1 Conclusions principales
  B.4.2 Perspectives
B.1 Introduction

Quantification : une inconnue dans la salle

Ouvrez un livre, un livre quelconque sur les fondements du traitement numérique du signal, et comptez le nombre de pages dédiées au théorème de l'échantillonnage et au traitement du signal à temps discret : la transformée de Fourier rapide, la transformée en Z, le filtrage à réponse impulsionnelle finie et infinie. Maintenant, comptez le nombre de pages dédiées à la quantification. Même si la moitié du « monde numérique » est un résultat de la quantification, la lecture de quelques livres fondamentaux en traitement numérique du signal donne l'impression qu'elle est un sujet sans importance. Toutefois, une personne curieuse peut se demander : la quantification est-elle vraiment un sujet dépourvu d'importance ? Peut-être est-elle si difficile à étudier et à expliquer de façon simple que la plupart des références de base en traitement numérique du signal préfèrent omettre une explication plus détaillée. Nous croyons que cette difficulté est à l'origine de l'hypothèse omniprésente de quantification fine des signaux dans la plupart des livres sur le traitement numérique du signal. En la considérant fine, les auteurs de ces livres peuvent reléguer la quantification à une note de bas de page. On constate que la quantification semble être l'étrange participant de la « fête du traitement numérique du signal » avec qui personne ne veut discuter (même si elle est l'un des organisateurs de la fête). Quelques domaines du traitement du signal trouvent utile (et dans certaines circonstances ils n'ont pas tort) de refuser tout contact avec la quantification. Chaque fois qu'ils ont besoin de traiter des problèmes induits par la quantification, ils l'appellent de façon dépréciative « le bruit de quantification ».
Dans cette thèse, nous espérons faire « discuter » de façon respectueuse, sans termes amoindrissants, un des participants de la fête du traitement du signal avec la quantification. Le sujet que nous avons choisi est l'estimation. Dans la suite, on expliquera la motivation et les points principaux de cette « discussion ».

Quantification et réseaux de capteurs : l'invitée d'honneur

Bien que nous ne traitions pas explicitement de la conception d'algorithmes d'estimation avec une architecture du type réseau de capteurs, nous espérons avec cette thèse contribuer au développement de techniques qui peuvent être utilisées ou étendues aux réseaux de capteurs.

L'essor des réseaux de capteurs. Avec la réduction des coûts et de la taille des dispositifs électroniques, tels que les capteurs et les émetteurs-récepteurs, un nouveau domaine a émergé sous le nom de « réseaux de capteurs ». Ce terme désigne en général un groupe de capteurs capables de communiquer et de traiter des données pour réaliser une tâche donnée, e.g. faire de l'estimation, de la détection, du suivi d'un signal, de la classification, etc. Les réseaux de capteurs sont intéressants en pratique pour plusieurs raisons ; parmi les plus mentionnées dans la littérature [Akyildiz 2002], [Intanagonwiwat 2000], [Zhao 2004, pp. 7–8], on peut trouver :

• tolérance aux défaillances et flexibilité ;
• déploiement facile ;
• possibilité d'utilisation en environnement dangereux ;
• possibilité d'utilisation sans maintenance ;
• utilisation de la communication pour réduire la quantité d'énergie utilisée ;
• rapport signal à bruit amélioré pour le suivi et la détection d'événements dans une zone donnée.

Applications des réseaux de capteurs.
Les avantages cités plus haut ouvrent la voie à l'utilisation des réseaux de capteurs dans un très large spectre de domaines [Arampatzis 2005], [Chong 2003], [Durisic 2012], [Puccinelli 2005] : surveillance de l'environnement, surveillance pour l'agriculture, génie civil, surveillance urbaine, applications en santé, applications commerciales et applications militaires.

Le besoin de quantifier. Même si le progrès des technologies de conception des capteurs et des dispositifs de communication nous amène à l'utilisation de réseaux à grand nombre de capteurs, des considérations pratiques telles que l'utilisation de batteries et les contraintes sur la taille maximale des capteurs imposent trois contraintes majeures pour la conception d'un réseau de capteurs : la contrainte énergétique, la contrainte sur le débit de communication et la contrainte sur la complexité. Pour respecter ces contraintes, on peut quantifier les mesures au niveau des capteurs. Ceci permet de :

• réduire la complexité des opérations grâce à des recherches dans des tableaux pré-stockés et limiter la quantité de mémoire utilisée ;
• réduire directement le débit binaire en sortie des capteurs par le réglage du nombre d'intervalles de quantification ;
• réduire la quantité d'énergie utilisée, comme conséquence de la réduction de la complexité et du débit.

Voilà les principales raisons pour lesquelles nous avons choisi d'étudier la quantification dans cette thèse.

Différents objectifs et précisions sur le sujet de la thèse

Dans un réseau de capteurs, on s'intéresse principalement à l'inférence d'une certaine information enfouie dans les mesures. Les deux classes principales de problèmes d'inférence étudiées en traitement du signal sont la détection et l'estimation.
Si on regarde la littérature sur les problèmes conjoints détection/quantification et estimation/quantification, on constate que, en comparaison avec la littérature sur les problèmes isolés (seulement détection ou seulement quantification), sa taille n'est pas importante ; en revanche, comme conséquence de l'essor des réseaux de capteurs, elle ne cesse de grandir. Quelques références sur ces problèmes conjoints sont :

• Détection : [Benitz 1989], [Gupta 2003], [Kassam 1977], [Longo 1990], [Picinbono 1988], [Poor 1977], [Poor 1988], [Tsitsiklis 1993], [Villard 2010], [Villard 2011].
• Estimation : [Aysal 2008], [Fang 2008], [Gubner 1993], [Luo 2005], [Marano 2007], [Papadopoulos 2001], [Poor 1988], [Ribeiro 2006a], [Ribeiro 2006b], [Ribeiro 2006c], [Wang 2010].

Estimation à partir de mesures quantifiées. Dans cette thèse, on s'intéresse au second problème, l'estimation à partir de mesures quantifiées. On commence par la définition générale du problème d'estimation dans un réseau de capteurs pour, après une suite de simplifications, arriver au sujet précis de la thèse. Dans le schéma général, chaque capteur mesure une quantité à amplitude continue X(i) ; la mesure est ensuite traitée et transmise au point où l'estimation sera faite. Ce point peut être un centre de fusion, un des capteurs ou tous les capteurs ; dans le dernier cas, tous les capteurs diffuseront leurs mesures après traitement. Ce schéma est montré en Fig. B.1. La quantité mesurée peut être une suite de vecteurs, une suite de scalaires, un vecteur constant ou un scalaire constant. Comme première hypothèse de travail, on considère que seulement un des terminaux (capteurs) est utilisé dans le réseau de capteurs ; éventuellement, on peut considérer plusieurs terminaux, mais dans ce cas la quantité à estimer sera la même pour tous les capteurs.
On considère aussi que la quantité à estimer est une séquence de scalaires ou un seul scalaire ; on utilise la notation Xk dans les deux cas, l'indice k désignant l'échantillon en question ou le temps discret. Dans le cas où Xk est une constante scalaire, on a Xk = x. Le problème simplifié, qui peut être appelé problème d'estimation scalaire à distance, est montré en Fig. B.2. Le paramètre Xk est mesuré avec du bruit additif Vk. La mesure à amplitude continue est notée Yk = Xk + Vk. Le problème que nous traitons dans cette thèse est donc un problème d'estimation d'un paramètre de centrage. En raison des contraintes de conception discutées plus haut, le bloc de traitement est remplacé par un quantifieur scalaire. Par conséquent, chaque mesure continue Yk génère une mesure quantifiée ik au travers d'une fonction de quantification Q(). Chaque mesure quantifiée est définie dans un ensemble fini de valeurs, ce qui permet de fixer le débit binaire en sortie du capteur.

Figure B.1: Estimation avec un réseau de capteurs. Plusieurs capteurs transmettent des informations pré-traitées à l'estimateur final qui doit récupérer les quantités d'intérêt.

Figure B.2: Problème d'estimation scalaire à distance. Simplification scalaire et à un seul capteur du problème montré en Fig. B.1.

On suppose que le débit en bits par unité de temps est choisi de façon à ne pas dépasser la capacité du canal de transmission ; de cette manière, on peut considérer qu'un code suffisamment performant peut être mis en œuvre pour rendre le canal parfait. À chaque instant k, on s'intéresse à l'estimation de Xk à partir d'un bloc de mesures passées i1, i2, · · · , ik. Ce problème est illustré en Fig. B.3.
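Pour fixer les idées, la chaîne « mesure bruitée puis quantification » peut être esquissée ainsi (esquisse minimale en Python, sous l'hypothèse d'un quantifieur uniforme symétrique et d'un bruit gaussien ; les noms quantize, sigma_v, etc. sont illustratifs et ne viennent pas du manuscrit) :

```python
import random

def quantize(y, tau0, delta, n_i):
    # Quantifieur symetrique autour du seuil central tau0, a n_i intervalles
    # uniformes de largeur delta ; la sortie est un indice non nul dans
    # {-n_i/2, ..., -1, 1, ..., n_i/2} (convention de la Fig. B.4).
    half = n_i // 2
    i = 1 + int(abs(y - tau0) // delta)  # |y - tau0| dans [(i-1)*delta, i*delta)
    i = min(i, half)                     # les intervalles extremes ne sont pas bornes
    return i if y >= tau0 else -i

# chaine de mesure : parametre x constant, bruit gaussien, mesures quantifiees i_k
x, sigma_v = 0.3, 1.0
rng = random.Random(0)
i_k = [quantize(x + rng.gauss(0.0, sigma_v), 0.0, 0.5, 8) for _ in range(5)]
```

L'estimateur n'observe que les indices i_k, c'est-à-dire l'intervalle dans lequel tombe chaque mesure bruitée, jamais les valeurs continues Yk.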
Figure B.3: Estimation à partir de mesures quantifiées. Un paramètre est mesuré avec du bruit additif ; les mesures sont alors quantifiées et transmises à travers un canal de communication parfait. À partir des mesures passées, l'objectif est d'estimer Xk à chaque instant k avec la suite de fonctions g().

En Fig. B.3, on voit que la structure du quantifieur peut aussi dépendre des mesures quantifiées passées.

Ce que l'on veut étudier. On veut proposer des algorithmes pour l'estimation de Xk à partir des ik. Le paramètre Xk, qui sera défini de façon plus précise dans la suite, peut être déterministe et constant, ou aléatoire et lentement variable. Après avoir proposé des algorithmes, on veut étudier leurs performances. Étant données les performances des algorithmes, on veut aussi étudier les effets des différents paramètres du quantifieur : seuils de quantification et résolution du quantifieur. Pour évaluer l'impact de la quantification sur la performance d'estimation, on comparera la performance des algorithmes proposés avec leurs pendants à mesures continues. L'objectif ici est d'estimer Xk seulement à partir des informations sur les intervalles où se trouvent ses versions bruitées.

Ce que l'on ne veut pas étudier. On ne veut pas reconstruire la mesure Yk à partir de la mesure quantifiée pour ensuite estimer Xk à partir des mesures reconstruites comme si elles étaient continues. En faisant cela, on se ramènerait au groupement des solutions optimales des deux problèmes séparés ; ces solutions ont déjà été abondamment étudiées dans la littérature. On ne veut pas non plus considérer la quantification comme du bruit additif. On veut étudier le problème dans sa forme originale, c'est-à-dire le problème d'estimation à partir des informations contenues dans des intervalles et non dans des valeurs continues.
Ce que l'on veut étudier mais que l'on n'étudiera pas. Pour spécifier de façon plus précise le problème traité dans cette thèse, on doit aussi mentionner les problèmes que l'on a sciemment négligés pour rendre le sujet plus simple à traiter. Ces problèmes sont les suivants : paramètres vectoriels et quantification vectorielle, canaux de communication bruités et codage canal, signaux à variations rapides, estimation de signaux à temps continu et estimation bayésienne d'une constante aléatoire.

Plan du résumé

Le plan de ce résumé est le suivant :

• Estimation à partir de données quantifiées : algorithmes et performances. On détaille le problème à traiter (présentation des modèles de signaux à estimer, du bruit et du quantifieur), puis on étudie les algorithmes d'estimation et leurs performances.

– Estimation d'un paramètre constant. D'abord, on se concentrera sur l'estimation d'un signal constant. On présentera un estimateur du maximum de vraisemblance pour deux types de quantification : binaire et multibit. Par l'analyse de sa performance asymptotique, donnée par la borne de Cramér–Rao (BCR) ou, de façon équivalente, par l'information de Fisher, on regardera l'impact du réglage de la dynamique de quantification. Comme conséquence de cette analyse, on montrera l'importance d'une approche adaptative pour le réglage du quantifieur. Finalement, on présentera des algorithmes adaptatifs de haute complexité qui, conjointement, estiment la constante et règlent le quantifieur. On montrera qu'asymptotiquement une de ces méthodes est équivalente à un algorithme récursif de basse complexité.

– Estimation d'un paramètre variable. On passera ensuite au cas du paramètre variable. Après la présentation du modèle de variation utilisé, on définira le critère de performance d'estimation et l'estimateur optimal.
Pour réaliser l'estimateur optimal, on utilisera une méthode numérique d'intégration ; dans ce contexte (estimation bayésienne), cette méthode est connue sous le nom de filtrage particulaire. On étudiera ses performances avec la borne de Cramér–Rao bayésienne (BCRB) et on montrera encore une fois l'importance de l'adaptativité du réglage du quantifieur. Avec l'approche adaptative, on montrera qu'asymptotiquement l'estimateur optimal ainsi obtenu pour un signal lentement variable peut être mis, lui aussi, sous une forme récursive simple.

– Quantifieurs adaptatifs pour l'estimation. En se basant sur l'optimalité asymptotique des estimateurs vus précédemment, on proposera des algorithmes adaptatifs de basse complexité pour l'estimation et le réglage conjoint du quantifieur. On étudiera la performance de ces algorithmes pour deux modèles d'évolution de la quantité à estimer (constante ou lentement variable) et on les optimisera par rapport à leurs paramètres libres. Pour la performance optimale, on étudiera la perte de performance d'estimation par rapport à des schémas équivalents pour des mesures continues. On proposera deux extensions de l'algorithme adaptatif : une extension où l'on estime le paramètre x sans connaître l'échelle du bruit (équivalent de l'écart type), et une autre où plusieurs capteurs obtiennent des mesures quantifiées en parallèle et les transmettent à un centre de fusion qui applique un algorithme adaptatif pour l'estimation et diffuse son estimateur aux capteurs pour le réglage des quantifieurs.

• Estimation à partir de données quantifiées : approximations à haute résolution. Contrairement aux développements précédents, où le réglage du quantifieur n'est fait qu'en fonction du seuil central, on se concentrera ici sur le placement de tous les seuils de quantification pour maximiser la performance d'estimation d'un paramètre arbitraire (pas seulement de centrage).
Vu que ce problème est difficile à résoudre directement, on utilisera une approche asymptotique, i.e. on trouvera des approximations du quantifieur optimal quand le nombre d'intervalles de quantification est très grand.

– Approximation à haute résolution de l'information de Fisher. Après avoir montré l'importance de l'information de Fisher dans la performance d'estimation des algorithmes proposés, on appliquera cette approche asymptotique pour la maximiser en fonction des caractéristiques du quantifieur. Cette approche asymptotique permettra de trouver une caractérisation optimale du quantifieur et une expression analytique de l'information de Fisher optimale. On testera les résultats sur le problème d'estimation d'un paramètre de centrage. Pour avoir une approximation pratique des seuils de quantification optimaux, on proposera l'utilisation de l'algorithme adaptatif présenté précédemment. Avec les expressions analytiques de l'information de Fisher, on pourra aussi étudier de façon approchée le problème d'allocation optimale de bits dans un réseau de capteurs : le nombre total de bits que les capteurs peuvent envoyer à un centre de fusion étant fixé, combien de bits faut-il allouer à chaque capteur ?

• Conclusions. On présentera les principaux points qui découlent des résultats de la thèse et on regardera les travaux qui peuvent être développés dans le futur : des extensions des problèmes traités ici, ou des problèmes qui n'ont pas été traités pour garder une première approche la plus simple possible.

B.2 Estimation et quantification : algorithmes et performances

B.2.1 Estimation d'un paramètre constant

Pour commencer cette section, on présente les modèles de mesure et de bruit utilisés.

Modèle de mesure

Le paramètre inconnu, constant et scalaire est x ∈ R ; il est mesuré N fois, N ∈ N⋆, avec du bruit indépendant et identiquement distribué (i.i.d.) Vk.
Pour k ∈ {1, · · · , N}, les mesures continues sont données par

  Yk = x + Vk.   (B.1)

Modèle de bruit, hypothèses sur la distribution du bruit

Pour simplifier la suite, on considérera les hypothèses suivantes sur la distribution du bruit :

AN1 La fonction de répartition marginale du bruit, notée F, admet une densité de probabilité (d.d.p.) f par rapport à la mesure de Lebesgue standard sur (R, B(R)).

AN2 La d.d.p. f(v) est une fonction paire, strictement positive, et elle décroît strictement avec |v|.

Modèle du quantifieur

La sortie du quantifieur est donnée par

  ik = Q(Yk),

où ik est choisi dans un ensemble fini de valeurs I de R ; cet ensemble possède NI éléments. Le nombre d'intervalles de quantification est par conséquent noté NI. Un exemple simple de quantifieur Q avec seuils uniformes est donné en Fig. B.4. À l'exception de la quantification uniforme, que l'on n'imposera pas, cet exemple illustre les principales hypothèses de travail sur la structure du quantifieur :

Hypothèses (sur le quantifieur) :

AQ1 NI est un nombre naturel pair et l'ensemble I, auquel ik appartient, est

  I = { −NI/2, · · · , −1, 1, · · · , NI/2 }.

Figure B.4: Fonction de quantification Q(Yk) avec NI intervalles de quantification uniformes de taille ∆. Le nombre d'intervalles de quantification NI est pair, le quantifieur est symétrique autour d'un seuil central τ0 et ses indices de sortie sont des entiers non nuls.

AQ2 Le quantifieur est symétrique autour d'un seuil central. Par conséquent, le vecteur de seuils τ peut être écrit sous la forme suivante (⊤ est l'opérateur de transposition) :

  τ = [ τ−NI/2 = τ0 − τ′NI/2, · · · , τ−1 = τ0 − τ′1, τ0, τ1 = τ0 + τ′1, · · · , τNI/2 = τ0 + τ′NI/2 ]⊤.
Les éléments de ce vecteur forment une séquence strictement croissante et le vecteur des variations de seuils par rapport au seuil central est donné par

  τ′ = [ 0, τ′1, · · · , τ′NI/2 = +∞ ]⊤.

Avec les variations de seuils τ′i, on peut écrire la relation entrée–sortie du quantifieur sous une forme plus compacte :

  ik = i sign(Yk − τ0),   pour |Yk − τ0| ∈ [ τ′i−1, τ′i ).   (B.2)

Maximum de vraisemblance, borne de Cramér–Rao et information de Fisher

On veut estimer x à partir de i1:N = {i1, · · · , iN} ; on cherche donc un estimateur X̂(i1:N) (qui est aléatoire, vu que les i1:N le sont aussi) le plus proche de x. « Proche » peut ici être traduit de façon quantitative par un critère de performance. Dans notre cas, on considère comme critère de performance l'erreur quadratique moyenne (EQM)

  EQM = E[ (X̂ − x)² ].

Si l'on impose que l'estimateur soit non biaisé, i.e.

  E[X̂] = x,

au moins quand N → ∞, on sait que l'estimateur qui minimise l'EQM asymptotiquement (et donc qui maximise la performance asymptotiquement) est l'estimateur du maximum de vraisemblance (MV) [Kay 1993, p. 160]. Le MV consiste à maximiser la fonction de vraisemblance par rapport au paramètre inconnu. La vraisemblance est la distribution conjointe des mesures (celles-ci étant figées après observation) et elle est une fonction du paramètre inconnu (celui-ci étant considéré comme une variable). Pour le problème que l'on traite ici, la vraisemblance pour un bloc de mesures indépendantes i1:N est

  L(x; i1:N) = ∏_{k=1}^{N} P(ik; x),

où P(ik; x) est la probabilité d'obtenir la valeur quantifiée ik à l'instant k pour un paramètre x.
On peut réécrire cette probabilité en fonction des seuils et de la fonction de répartition :

  P(ik; x) = P(τik−1 ≤ Yk < τik) si ik > 0,   P(τik ≤ Yk < τik+1) si ik < 0,

soit, avec la définition Yk = x + Vk donnée par (B.1),

  P(ik; x) = F(τik − x) − F(τik−1 − x) si ik > 0,   F(τik+1 − x) − F(τik − x) si ik < 0.

L'estimateur du MV est donné par

  X̂MV,q = X̂MV(i1:N) = argmax_x L(x; i1:N),

ou, de façon équivalente, par

  X̂MV,q = argmax_x log L(x; i1:N).

On se concentre maintenant sur les performances de cet estimateur qui, à cause du manque de résultats à taille d'échantillon finie, ne sont connues qu'en régime asymptotique. L'EQM du MV peut s'écrire, en général, sous la forme suivante :

  E[ (X̂MV,q − x)² ] = ( E[X̂MV,q] − x )² + Var(X̂MV,q) = biais² + variance.

Comme mentionné auparavant, le MV est asymptotiquement non biaisé :

  E[X̂MV,q] → x quand N → ∞.

Par conséquent, son EQM asymptotique n'est caractérisée que par sa variance. La variance asymptotique du MV atteint la BCR [Kay 1993, p. 160] (qui est aussi une borne inférieure sur la variance des estimateurs non biaisés dans un contexte non asymptotique [Kay 1993, p. 30]) :

  Var(X̂MV,q) ∼_{N→∞} BCRq,

où le symbole ∼ est utilisé pour représenter une équivalence. La BCR est l'inverse de l'information de Fisher Iq [Kay 1993, p. 30], qui est la variance de la fonction score Sq. En partant de la fonction score pour N mesures quantifiées, on a les expressions suivantes :

  Sq,1:N = ∂ log L(x; i1:N)/∂x   (fonction score),
  Iq,1:N = E[ S²q,1:N ] = E[ ( ∂ log L(x; i1:N)/∂x )² ]   (information de Fisher),
  Var(X̂MV,q) ∼_{N→∞} BCRq = 1/Iq,1:N   (variance et BCR).

L'indice 1:N est utilisé pour indiquer que ces quantités sont relatives à N mesures. Pour simplifier, on utilisera la notation Sq et Iq dans le contexte d'une mesure quantifiée arbitraire.
Sous l'hypothèse de mesures indépendantes, on a

  Var(X̂MV,q) ∼_{N→∞} BCRq = 1/(N Iq).

La fonction score pour une mesure, Sq, est

  Sq = ∂ log L(x; ik)/∂x = [ ∂P(ik; x)/∂x ] / P(ik; x),

et l'information de Fisher correspondante est

  Iq = E[ ( ∂ log L(x; ik)/∂x )² ] = ∑_{ik ∈ I} [ ∂P(ik; x)/∂x ]² / P(ik; x).

Si on note ε = τ0 − x la différence entre le seuil central et le paramètre, on peut réécrire Iq sous la forme suivante :

  Iq = ∑_{ik=1}^{NI/2} { [ f(ε + τ′ik−1) − f(ε + τ′ik) ]² / [ F(ε + τ′ik) − F(ε + τ′ik−1) ]
       + [ f(ε − τ′ik) − f(ε − τ′ik−1) ]² / [ F(ε − τ′ik−1) − F(ε − τ′ik) ] }.   (B.3)

Influence du quantifieur sur la performance

La performance de l'estimateur est donc caractérisée par BCRq ou, de façon équivalente, par Iq ; par conséquent, pour étudier l'influence du quantifieur sur la performance d'estimation, on peut étudier de façon quantitative comment BCRq ou Iq se comportent en fonction de NI et τ. On commence par quelques propriétés générales de Iq.

Perte induite par la quantification : si on note Sc et Ic la fonction score et l'information de Fisher du problème d'estimation équivalent avec des mesures continues, on peut montrer que

  Ic − Iq = E[ (Sc − Sq)² ] ≥ 0.

Cela veut dire que Iq est majorée par Ic et qu'il existe une perte de performance inhérente à la quantification, donnée de manière quantitative par E[(Sc − Sq)²].

Monotonicité de Iq : on peut aussi montrer que, si l'on ajoute un seuil à un vecteur de seuils τ, alors l'information de Fisher correspondant au nouveau vecteur de seuils est toujours supérieure ou égale à l'information de Fisher précédente. Cela veut dire que l'information de Fisher croît de façon monotone en fonction de NI (pour une séquence de seuils construite en ajoutant des seuils).
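Ces propriétés se vérifient numériquement : l'esquisse Python ci-dessous (hypothèses illustratives : bruit gaussien centré réduit, variations de seuils uniformes ; les noms fisher_q, tau_p sont hypothétiques) évalue Iq par la formule (B.3) et permet de constater la monotonicité en NI ainsi que la majoration par l'information de Fisher continue Ic = 1 :

```python
import math

def f(x):
    # densite gaussienne centree reduite (hypothese de bruit illustrative)
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def F(x):
    # fonction de repartition correspondante
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def fisher_q(eps, tau_p):
    # Information de Fisher (B.3) pour un quantifieur symetrique :
    # tau_p = [0, tau'_1, ..., tau'_{NI/2 - 1}] ; le dernier seuil
    # tau'_{NI/2} = +inf est ajoute ici (f(+-inf) = 0, F(+inf) = 1, F(-inf) = 0).
    t = list(tau_p) + [math.inf]
    iq = 0.0
    for i in range(1, len(t)):
        iq += (f(eps + t[i - 1]) - f(eps + t[i])) ** 2 / (F(eps + t[i]) - F(eps + t[i - 1]))
        iq += (f(eps - t[i]) - f(eps - t[i - 1])) ** 2 / (F(eps - t[i - 1]) - F(eps - t[i]))
    return iq
```

Par exemple, fisher_q(0.0, [0.0]) redonne le cas binaire Iq = 4 f²(0) = 2/π, et ajouter des seuils ne peut qu'augmenter Iq, sans jamais dépasser Ic = 1.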
Une question qui se pose pour la suite est : comme on peut construire une séquence de seuils telle que $I_q$ croît de façon monotone en $N_I$ et comme on sait que $I_q$ est majorée par $I_c$, est-ce que $I_q$ converge vers $I_c$ ? On répondra à cette question plus loin dans ce résumé. Maintenant, on passe à l’étude de la performance d’estimation en fonction de la position des seuils. On commence par le cas binaire.

Cas binaire : dans le cas binaire, on peut utiliser l’expression de l’information de Fisher (B.3) pour obtenir la BCR suivante :
\[
\text{BCR}_q^B = \frac{F(\varepsilon)\left[1 - F(\varepsilon)\right]}{N f^2(\varepsilon)}.
\]
L’analyse de la performance se réduit alors à l’analyse de la fonction
\[
B(\varepsilon) = N\,\text{BCR}_q^B = \frac{F(\varepsilon)\left[1 - F(\varepsilon)\right]}{f^2(\varepsilon)}.
\]
L’étude de cette fonction dans le cas Gaussien ($f(\varepsilon) = \frac{1}{\sqrt{\pi}\,\delta} \exp\left[-\left(\frac{\varepsilon}{\delta}\right)^2\right]$) a été réalisée par [Papadopoulos 2001] et [Ribeiro 2006a] ; son comportement est illustré en Fig. B.5. On peut noter que la valeur minimale de $B$ est atteinte lorsque $\varepsilon = 0$ et que $B(\varepsilon)$ augmente lorsque $|\varepsilon|$ augmente. Par conséquent, la valeur optimale du seuil $\tau_0^\star$ est égale à $x$ et la valeur minimale de $B$ est $B^\star = \frac{1}{4 f^2(0)} = \frac{\pi\delta^2}{4}$. Si on compare cette valeur avec la BCR pour les mesures continues, $\text{BCR}_c \times N = \frac{\delta^2}{2}$, on peut constater que la perte produite par la quantification binaire est d’environ 2 dB, ce qui est, de façon surprenante, très peu.

Figure B.5 : BCR normalisée $B$ en fonction de la différence normalisée $\frac{\varepsilon}{\delta}$ entre le seuil et le paramètre. La distribution du bruit est Gaussienne et le facteur de normalisation $\delta$ est le paramètre d’échelle de la Gaussienne. Des normalisations sont réalisées sur les deux axes pour que la courbe affichée soit indépendante de $\delta$.

Notez que pour avoir cette petite perte, il faut que $\tau_0 = x$, ce qui est impossible en pratique, puisque $x$ est le paramètre inconnu à estimer.
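Le sketch suivant, donné à titre indicatif, évalue la fonction $B(\varepsilon)$ pour le modèle Gaussien ci-dessus (densité $f(\varepsilon) = \frac{1}{\sqrt{\pi}\delta}e^{-(\varepsilon/\delta)^2}$, choix de l’exemple avec $\delta = 1$) et vérifie le minimum $\frac{\pi\delta^2}{4}$ ainsi que la perte d’environ 2 dB par rapport à $\text{BCR}_c \times N = \frac{\delta^2}{2}$ ; le nom `B_norm` est un choix d’illustration.

```python
import math

def B_norm(eps, delta=1.0):
    # B(eps) = N * BCR binaire = F(eps)[1 - F(eps)] / f(eps)^2,
    # pour la Gaussienne du texte : f(e) = exp(-(e/delta)^2) / (sqrt(pi)*delta).
    f = math.exp(-(eps / delta) ** 2) / (math.sqrt(math.pi) * delta)
    F = 0.5 * (1 + math.erf(eps / delta))
    return F * (1 - F) / f**2

b_min = B_norm(0.0)                       # minimum : pi * delta^2 / 4
perte_db = 10 * math.log10(b_min / 0.5)   # reference continue : delta^2 / 2
print(b_min, perte_db)                    # ~0.785 et ~1.96 dB
```

On retrouve numériquement $B^\star = \pi/4 \approx 0{,}785$ (pour $\delta = 1$) et une perte de $10\log_{10}(\pi/2) \approx 1{,}96$ dB, et on vérifie que $B$ croît dès que $|\varepsilon|$ s’éloigne de zéro.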
Notez aussi que $B$ est une fonction assez sensible à la position du seuil : si l’on place $\tau_0$ loin du paramètre, la performance d’estimation est très dégradée. On peut montrer que pour d’autres distributions couramment utilisées comme modèle de bruit, telles que la distribution de Laplace et la distribution de Cauchy, des conclusions similaires peuvent être obtenues :

• la valeur optimale du seuil de quantification est $\tau_0^\star = x$ ;
• la perte due à la quantification est petite, si on utilise $\tau_0^\star$ ;
• la performance se dégrade lorsque $\tau_0$ s’éloigne de $x$.

Cas asymétriques : même si, pour plusieurs distributions de bruit couramment utilisées, la fonction $B$ a un comportement symétrique en forme de « u », ce comportement ne se généralise pas à toutes les distributions symétriques, comme on s’y attendrait intuitivement. Il suffit que la condition suivante
\[
-f^{(2)}(0) > 4 f^3(0)
\]
ne soit pas satisfaite pour que la fonction $B$ ait $\varepsilon = 0$ comme maximum local. Ceci veut dire que pour des densités ne respectant pas cette condition, le point de quantification optimal n’est pas $x$ et la quantification optimale doit être faite de manière asymétrique par rapport à la distribution des mesures.

Un cas simple de distribution symétrique qui ne respecte pas cette condition est la distribution ad hoc suivante :
\[
f(\varepsilon) = \begin{cases}
f_{GL}(\varepsilon) = C\,\frac{1}{\sqrt{2\pi}\sigma} \exp\left[-\frac{1}{2}\left(\frac{\varepsilon + \frac{\alpha}{2}}{\sigma}\right)^2\right], & \text{pour } \varepsilon < -\frac{\alpha}{2},\\[4pt]
f_{U}(\varepsilon) = C\,\frac{1}{\sqrt{2\pi}\sigma}, & \text{pour } -\frac{\alpha}{2} \le \varepsilon \le \frac{\alpha}{2},\\[4pt]
f_{GR}(\varepsilon) = C\,\frac{1}{\sqrt{2\pi}\sigma} \exp\left[-\frac{1}{2}\left(\frac{\varepsilon - \frac{\alpha}{2}}{\sigma}\right)^2\right], & \text{pour } \varepsilon > \frac{\alpha}{2}.
\end{cases}
\]
Un exemple de BCR obtenue avec cette distribution (et de la performance pratique du MV) est donné en Fig. B.6.

Figure B.6 : $\text{BCR}_q^B$ et EQM simulée du MV pour un bruit distribué selon la loi ad hoc. La borne et l’EQM simulée ont été évaluées pour $N = 500$ et $\varepsilon$ dans l’intervalle $[-2, 2]$. L’EQM du MV a été évaluée par une simulation Monte Carlo avec $10^5$ réalisations de blocs de 500 échantillons.
On a utilisé aussi : $\alpha = 1$ et $\sigma = 1$.

Cas multibit : pour l’estimation par MV avec une quantification multibit, la performance en fonction du quantifieur peut être étudiée au travers de l’analyse de l’information de Fisher (B.3). Comme résultat de cette analyse, on trouve que :

• la dynamique de quantification doit être proche du paramètre pour maximiser la performance d’estimation ;
• pour des variations de seuils symétriques bien choisies, le choix $\tau_0 = x$ est optimal pour plusieurs types de bruit (pour une classe plus large que dans le cas binaire) ;
• la performance se dégrade rapidement quand la dynamique de quantification est placée loin du paramètre à estimer ;
• le problème d’optimisation de $I_q$ en fonction de $\tau'$ est difficile à résoudre pour $N_B = \log_2(N_I) > 3$.

Quantification adaptative : l’approche à haute complexité

La conclusion directe des résultats précédents est la suivante : on doit placer le seuil central le plus proche possible du paramètre $x$. Or, comme $x$ est inconnu, on peut se baser sur les dernières mesures quantifiées pour estimer $x$ et, comme on s’attend à ce que l’estimateur soit, au moins après un certain temps, proche de $x$, on placera le seuil central de quantification exactement sur cette dernière estimation. Ceci équivaut donc à une approche d’estimation où le processus de mesure, le quantifieur, est à tout instant adapté pour améliorer la performance d’estimation. Dans la littérature, cette approche adaptative a été proposée dans [Li 2007] et [Fang 2008], dans le cas binaire et Gaussien. La première méthode, proposée dans [Li 2007], consiste à générer des estimations simples du paramètre au niveau du capteur, avec la mise à jour du seuil central donnée par
\[
\tau_{0,k} = \tau_{0,k-1} + \gamma\, i_k,
\]
où $\gamma$ est un pas d’adaptation. Les mesures quantifiées sont ensuite transmises à l’estimateur distant, qui possède suffisamment de puissance de calcul pour générer des estimations plus précises en utilisant le MV.
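Le comportement de cette mise à jour au niveau du capteur peut être illustré par le sketch suivant : sous l’hypothèse (propre à l’exemple) d’un bruit Gaussien $\mathcal{N}(0,1)$ et d’un paramètre $x = 2$, le seuil $\tau_{0,k}$ converge puis fluctue autour de $x$ ; la fonction `sensor_threshold_update` est un nom d’illustration et le pas $\gamma = 0{,}2$ un choix arbitraire.

```python
import random

def sensor_threshold_update(x=2.0, gamma=0.2, K=5000, seed=5):
    # Mise a jour du seuil au capteur [Li 2007] : tau_{0,k} = tau_{0,k-1} + gamma * i_k.
    # Bruit Gaussien N(0,1) ; on renvoie la moyenne du seuil sur la seconde moitie.
    rng = random.Random(seed)
    tau = 0.0
    taus = []
    for _ in range(K):
        ik = 1 if x + rng.gauss(0.0, 1.0) >= tau else -1
        tau += gamma * ik
        taus.append(tau)
    return sum(taus[K // 2:]) / (K - K // 2)

print(sensor_threshold_update())   # moyenne du seuil proche de x = 2.0
```

La dérive moyenne du seuil, $\gamma\,[1 - 2F(\tau - x)]$, s’annule en $\tau = x$ : le seuil oscille donc autour du paramètre, ce qui illustre la fluctuation asymptotique mentionnée plus loin.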
Dans ce cas, le MV consiste à maximiser la vraisemblance suivante :
\[
L(x; i_{1:N}) = P(i_{1:N}; x) = \prod_{k=1}^{N} P(i_k | i_{k-1}, \cdots, i_1; x) = \prod_{k=1}^{N} P(i_k | \tau_{0,k-1}; x)
= \prod_{k=1}^{N} \left[1 - F(\tau_{0,k-1} - x)\right]^{\frac{1+i_k}{2}} \left[F(\tau_{0,k-1} - x)\right]^{\frac{1-i_k}{2}}, \tag{B.4}
\]
\[
\log L(x; i_{1:N}) = \sum_{k=1}^{N} \left\{ \frac{1+i_k}{2} \log\left[1 - F(\tau_{0,k-1} - x)\right] + \frac{1-i_k}{2} \log F(\tau_{0,k-1} - x) \right\}.
\]
Du fait de la symétrie du problème, on espère que le seuil $\tau_{0,k}$ tendra en moyenne vers le point $x$ ; de cette façon, le seuil central fluctuera autour du vrai paramètre et donnera une performance d’estimation proche de l’optimum. La performance asymptotique de l’algorithme a été étudiée plus en détail dans [Fang 2008]. Elle est obtenue à partir de l’inverse de l’information de Fisher
\[
I_{q,1:N} = \sum_{k=1}^{N} \mathbb{E}\left[\frac{f^2(\tau_{0,k-1} - x)}{F(\tau_{0,k-1} - x)\left[1 - F(\tau_{0,k-1} - x)\right]}\right],
\]
où l’espérance est évaluée par rapport à la distribution de $\tau_{0,k-1}$, qui maintenant n’est plus ni fixe, ni déterministe. Sachant que la distribution des seuils tend vers une distribution asymptotique quand $N \to \infty$, cette information de Fisher peut être approchée par l’information de Fisher avec la distribution asymptotique des seuils $\tilde{p}_\infty$ :
\[
\lim_{N\to\infty} \frac{I_{q,1:N}}{N} = \tilde{I}'^{\top}_q\, \tilde{p}_\infty,
\]
où
\[
\tilde{I}'_q = \left[\cdots,\; \frac{f^2(-\gamma - x)}{F(-\gamma - x)\left[1 - F(-\gamma - x)\right]},\; \frac{f^2(0 - x)}{F(0 - x)\left[1 - F(0 - x)\right]},\; \frac{f^2(\gamma - x)}{F(\gamma - x)\left[1 - F(\gamma - x)\right]},\; \cdots\right]^{\top}.
\]
$\tilde{p}_\infty$ étant de taille infinie, pour obtenir la performance asymptotique en pratique, [Fang 2008] propose de tronquer le vecteur $\tilde{p}_\infty$. Un problème avec l’approche adaptative qui vient d’être présentée réside dans la présence d’une fluctuation asymptotique de l’emplacement de la dynamique de quantification ; ceci entraîne une sous-optimalité asymptotique, vu que l’on devrait avoir $\tau_{0,k} = x$ pour une performance optimale. Ce problème peut être résolu en utilisant un algorithme qui converge vers $x$ quand $N \to \infty$.
En ajoutant de la complexité au niveau du capteur, ou un lien de retour de l’estimateur vers le capteur, une idée assez directe consiste à utiliser la dernière estimation du MV comme nouveau seuil central :
\[
\tau_{0,k} = \hat{X}_{MV,k}.
\]
Cette idée a été proposée initialement pour le cas binaire Gaussien dans [Fang 2008], où les auteurs prétendent qu’asymptotiquement la performance en termes de variance est équivalente à $\frac{1}{N I_q(0)}$, ce qui équivaut à dire que l’algorithme est asymptotiquement optimal. On peut étendre de façon assez naturelle cette approche adaptative au cas multibit et non Gaussien. Moyennant certaines contraintes sur la fonction $I_q(\varepsilon)$, on peut montrer que la performance asymptotique est aussi optimale :
\[
\text{BCR}_q \underset{N\to\infty}{\sim} \frac{1}{N I_q(0)}.
\]
Pour réduire la complexité de l’algorithme, qui résout un problème d’optimisation à chaque nouvelle mesure, une étude heuristique de la forme asymptotique du MV dans le cas Gaussien binaire avec seuil adaptatif a été réalisée dans [Papadopoulos 2001] ; elle montre qu’asymptotiquement le MV adaptatif a une forme récursive de très basse complexité :
\[
\hat{X}_k = \tau_{0,k} = \hat{X}_{k-1} + \frac{\delta\sqrt{\pi}}{2k}\, i_k.
\]
En se basant sur les propriétés asymptotiques du MV et en considérant que l’estimateur est suffisamment proche du vrai paramètre, on peut montrer que, même dans un contexte non Gaussien, le MV adaptatif est équivalent à une forme asymptotique simple :
\[
\hat{X}_k \approx \hat{X}_{k-1} + \frac{i_k}{2 k f(0)}.
\]
On peut se poser quelques questions sur la forme équivalente simple donnée ci-dessus :
• peut-elle converger quand l’erreur initiale $|\varepsilon| = |\tau_0 - x|$ est arbitraire (pas nécessairement petite) ?
• peut-on étendre cet algorithme à basse complexité au cas $N_I > 2$ ?
On donnera les réponses à ces questions en Sous-section B.2.3.

B.2.2 Estimation d’un paramètre variable

On passe maintenant à l’estimation d’un paramètre variable.
Modèle du paramètre

Le paramètre à estimer est défini comme un processus stochastique ; il n’est donc pas seulement variable mais aussi aléatoire. À chaque instant $k \in \mathbb{N}^\star$, la variable aléatoire (v.a.) $X_k$ est donnée par le modèle de Wiener suivant :
\[
X_k = X_{k-1} + W_k, \quad k > 0,
\]
où $W_k$ est le $k$-ème élément d’une séquence indépendante de v.a. Gaussiennes. Sa moyenne est donnée par $u_k$ et sa variance est une constante connue $\sigma_w^2$. Si $u_k = 0$, alors $X_k$ forme un processus de Wiener à temps discret classique ; sinon, on l’appelle processus de Wiener avec dérive. La distribution initiale de $X_k$ est supposée Gaussienne, de moyenne $x'_0$ et de variance connue $\sigma_0^2$.

Modèle du quantifieur

Pour poursuivre le paramètre, on suppose que le quantifieur peut être dynamique, avec un vecteur de seuils donné par $\tau_k$ :
\[
\tau_k = \left[\tau_{-\frac{N_I}{2},k} \;\cdots\; \tau_{-1,k} \;\; \tau_{0,k} \;\; \tau_{1,k} \;\cdots\; \tau_{\frac{N_I}{2},k}\right]^{\top}.
\]
Les mesures quantifiées sont encore données par la fonction $Q(\cdot)$ définie en (B.2),
\[
i_k = Q(Y_k),
\]
mais dans ce cas, cette fonction peut varier dans le temps.

Estimateur optimal

De façon analogue au cas constant, on veut un estimateur $\hat{X}(i_{1:k})$ qui minimise l’EQM
\[
\text{EQM}_k = \mathbb{E}\left[\left(\hat{X}_k - X_k\right)^2\right].
\]
Il est connu que l’estimateur optimal minimisant l’EQM est donné par la moyenne a posteriori
\[
\hat{X}_k = \mathbb{E}_{X_k | i_{1:k}}(X_k) = \int_{\mathbb{R}} x_k\, p(x_k | i_{1:k})\, \mathrm{d}x_k, \tag{B.5}
\]
où $p(x_k | i_{1:k})$ est la densité a posteriori. Cet estimateur est non biaisé et sa performance est donnée par
\[
\text{EQM}_k = \mathbb{E}_{i_{1:k}}\left[\operatorname{Var}_{X_k|i_{1:k}}(X_k)\right] = \sum_{i_{1:k} \in \mathcal{I}^{\otimes k}} \left[\int_{\mathbb{R}} \left(x_k - \mathbb{E}_{X_k|i_{1:k}}(X_k)\right)^2 p(x_k|i_{1:k})\, \mathrm{d}x_k\right] P(i_{1:k}). \tag{B.6}
\]

Filtrage particulaire

La solution donnée par (B.5) est difficile à mettre en œuvre de façon analytique car, dans la plupart des cas, l’évaluation directe de la densité $p(x_k|i_{1:k})$ et de l’intégrale n’est pas possible. On doit donc utiliser une méthode numérique pour l’évaluation de la densité et de l’intégrale.
Dans notre cas, on peut utiliser la méthode de Monte Carlo, qui est une méthode d’intégration numérique basée sur la simulation. Dans le contexte du problème étudié, l’application de la méthode de Monte Carlo est connue sous le nom d’échantillonnage d’importance avec rééchantillonnage ; elle est aussi connue, plus populairement, sous le nom de filtrage particulaire (FP), que l’on utilise dans la suite. Un algorithme particulaire pour l’estimation du paramètre variable à partir de données quantifiées est donné ci-dessous :

0. Définition des poids $\tilde{w}\left(x_0^{(j)}\right) = \frac{1}{N_S}$ et initialisation de $N_S$ particules (échantillons) $\left\{x_0^{(1)}, \cdots, x_0^{(N_S)}\right\}$ par échantillonnage de la densité
\[
p(x_0) = \frac{1}{\sqrt{2\pi}\sigma_0} \exp\left[-\frac{1}{2}\left(\frac{x_0 - x'_0}{\sigma_0}\right)^2\right].
\]
À chaque instant $k$ :

1. pour $j$ de 1 à $N_S$, l’échantillonnage de $X_k^{(j)}$ est réalisé avec la d.d.p.
\[
p\left(x_k^{(j)} \middle| x_{k-1}^{(j)}\right) = \frac{1}{\sqrt{2\pi}\sigma_w} \exp\left[-\frac{1}{2}\left(\frac{x_k^{(j)} - x_{k-1}^{(j)} - u_k}{\sigma_w}\right)^2\right] ;
\]
2. pour $j$ de 1 à $N_S$, on évalue et on normalise les poids
\[
w\left(x_{1:k}^{(j)}\right) = P\left(i_k \middle| x_k^{(j)}\right) \tilde{w}\left(x_{1:k-1}^{(j)}\right), \qquad \tilde{w}\left(x_{1:k}^{(j)}\right) = \frac{w\left(x_{1:k}^{(j)}\right)}{\sum_{j=1}^{N_S} w\left(x_{1:k}^{(j)}\right)},
\]
où $P\left(i_k \middle| x_k^{(j)}\right)$ est donnée par
\[
P(i_k | x_k) = \begin{cases} F(\tau_{i_k,k} - x_k) - F(\tau_{i_k-1,k} - x_k), & \text{si } i_k > 0,\\ F(\tau_{i_k+1,k} - x_k) - F(\tau_{i_k,k} - x_k), & \text{si } i_k < 0 ; \end{cases}
\]
3. l’estimation est alors donnée par
\[
\hat{x}_k \approx \sum_{j=1}^{N_S} x_k^{(j)}\, \tilde{w}\left(x_{1:k}^{(j)}\right) ;
\]
4. finalement, on évalue le nombre effectif de particules utilisées
\[
N_{\text{eff}} = \frac{1}{\sum_{j=1}^{N_S} \tilde{w}^2\left(x_{1:k}^{(j)}\right)} ;
\]
si $N_{\text{eff}} < N_{\text{seuil}}$, alors une procédure de rééchantillonnage multinomial doit être réalisée (rééchantillonnage des valeurs $x_{1:k}^{(j)}$ avec les poids normalisés $\tilde{w}\left(x_{1:k}^{(j)}\right)$ comme probabilités de tirage).

Evaluation de la performance

Quand $N_S$ tend vers l’infini, il est connu que le filtrage particulaire converge vers l’estimateur optimal. Dans ce cas, si l’on considère qu’un $N_S$ suffisamment grand est utilisé, on peut utiliser (B.6) pour obtenir la performance du filtrage particulaire.
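Les étapes 0 à 4 ci-dessus peuvent être esquissées comme suit, dans un cadre volontairement simplifié (bruit de mesure Gaussien $\mathcal{N}(0,1)$, quantification 1 bit dont le seuil central est placé sur la dernière estimation, $u_k = 0$, $x'_0 = 0$, $\sigma_0 = 1$) ; la fonction `pf_track` et les valeurs numériques sont des choix d’illustration.

```python
import math, random

def cdf(t):
    # Fonction de repartition du bruit de mesure, ici N(0,1) (hypothese de l'exemple).
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def pf_track(K=400, sigma_w=0.05, Ns=200, seed=2):
    rng = random.Random(seed)
    x = 0.0
    parts = [rng.gauss(0.0, 1.0) for _ in range(Ns)]   # etape 0 : echantillons de p(x_0)
    w = [1.0 / Ns] * Ns
    xhat = 0.0
    for _ in range(K):
        x += rng.gauss(0.0, sigma_w)                   # processus de Wiener (u_k = 0)
        ik = 1 if x + rng.gauss(0.0, 1.0) >= xhat else -1   # mesure 1 bit, seuil = xhat
        parts = [p + rng.gauss(0.0, sigma_w) for p in parts]           # etape 1
        lik = [(1 - cdf(xhat - p)) if ik > 0 else cdf(xhat - p)        # etape 2
               for p in parts]
        w = [wi * li for wi, li in zip(w, lik)]
        s = sum(w)
        w = [wi / s for wi in w]
        xhat_new = sum(wi * p for wi, p in zip(w, parts))              # etape 3
        if 1.0 / sum(wi * wi for wi in w) < Ns / 2:                    # etape 4 : N_eff
            parts = rng.choices(parts, weights=w, k=Ns)                # reech. multinomial
            w = [1.0 / Ns] * Ns
        xhat = xhat_new
    return xhat, x

xhat, x = pf_track()
print(abs(xhat - x))   # erreur de poursuite faible pour un signal lent
```

Le seuil de rééchantillonnage $N_{\text{seuil}} = N_S/2$ est un choix classique mais arbitraire ; l’estimation suit le signal avec une erreur dont l’ordre de grandeur est discuté plus loin via la BCRB.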
Cependant, l’expression (B.6) souffre du même problème que (B.5) : l’impossibilité d’être évaluée analytiquement dans la plupart des cas. Comme solution, on pourrait donc avoir recours à une procédure d’intégration numérique similaire à celle utilisée pour obtenir l’estimateur. Le problème de cette dernière approche réside dans le fait que, si l’on voulait utiliser la performance de simulation pour la conception d’un système de mesure (choix de la qualité du capteur, choix de $N_I$, etc.), on serait obligé de réaliser des simulations pour plusieurs valeurs possibles des paramètres. Ceci demanderait un temps de simulation très long. Comme alternative, on utilise une borne inférieure sur l’EQM, qui peut être obtenue de façon analytique. Cette borne est la version Bayésienne de la BCR, la BCRB :
\[
\text{EQM}_k \ge \text{BCRB}_k = \frac{1}{J_k}, \tag{B.7}
\]
où $J_k$ est l’information Bayésienne, donnée sous forme récursive par
\[
J_k = \frac{1}{\sigma_w^2} + \mathbb{E}\left[I_q(\varepsilon_k)\right] - \frac{1}{\sigma_w^4}\,\frac{1}{\frac{1}{\sigma_w^2} + J_{k-1}}. \tag{B.8}
\]

L’innovation quantifiée

Pour les modèles symétriques de bruit couramment utilisés (Gaussien, Laplacien et Cauchy), on sait que $I_q(\varepsilon)$ est maximisée quand $\varepsilon = 0$ et que $I_q(\varepsilon)$ décroît quand $|\varepsilon|$ augmente. Or, d’après (B.8), on voit que plus $\tau_{0,k}$ est proche de la réalisation du paramètre $x_k$, plus grande est l’information Bayésienne et, par conséquent, plus petite est la BCRB. Si l’on suppose que la BCRB est une borne suffisamment serrée, de façon à ce que l’on puisse accepter son comportement comme une approximation de l’EQM, alors plus $\tau_{0,k}$ est proche de $x_k$, plus petite est l’EQM. Ceci indique que, pour avoir une performance d’estimation améliorée, la dynamique de quantification doit se déplacer dans le temps de façon à suivre le paramètre. L’approche $\tau_{0,k} = x_k$ est, encore une fois, impossible à mettre en œuvre car on ne connaît pas $x_k$.
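La récursion (B.8) se prête bien à une évaluation numérique. Le sketch suivant, donné à titre indicatif, l’itère jusqu’au point fixe en remplaçant $\mathbb{E}[I_q(\varepsilon_k)]$ par sa valeur maximale $I_q(0)$ (cas le plus favorable, cf. ci-dessus) ; le nom `bcrb_fixed_point` et la valeur $I_q(0) = 2/\pi$ (quantification binaire, bruit Gaussien réduit) sont des choix d’illustration.

```python
import math

def bcrb_fixed_point(iq0, sigma_w, K=5000, J0=1.0):
    # Iteration de (B.8) avec E[I_q(eps_k)] remplace par I_q(0) :
    # J_k = 1/s^2 + I_q(0) - (1/s^4) / (1/s^2 + J_{k-1}).
    s2 = sigma_w ** 2
    J = J0
    for _ in range(K):
        J = 1.0 / s2 + iq0 - (1.0 / s2 ** 2) / (1.0 / s2 + J)
    return 1.0 / J   # BCRB asymptotique

iq0 = 2 / math.pi                       # binaire Gaussien, seuil central sur le parametre
b = bcrb_fixed_point(iq0, sigma_w=0.01)
print(b, 0.01 / math.sqrt(iq0))         # ~ sigma_w / sqrt(I_q(0)) pour sigma_w petit
```

Pour $\sigma_w$ petit, le point fixe numérique retrouve bien le comportement $\sigma_w/\sqrt{I_q(0)}$, et on vérifie qu’un $I_q(0)$ plus grand donne une BCRB plus petite.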
On doit alors accepter une perte de performance et utiliser la valeur disponible la plus proche de $x_k$, dans notre cas la prédiction de $x_k$. Ceci revient donc à quantifier l’innovation apportée par la nouvelle mesure. Avec le modèle d’évolution de $X_k$ utilisé ici, on peut montrer que cette prédiction est
\[
\hat{X}_{k|k-1} = \tau_{0,k} = \hat{X}_{k-1} + u_k. \tag{B.9}
\]
On peut donc modifier l’algorithme particulaire pour inclure la mise à jour adaptative du centre du quantifieur (B.9). La performance du nouvel algorithme peut être approchée par la BCRB (B.7). Si l’on suppose que le signal est lent, i.e. que $\sigma_w$ est petit devant l’écart type du bruit de mesure, alors on peut s’attendre à ce que l’erreur d’estimation soit petite après un certain temps, vu que l’estimateur a le temps de « moyenner » les mesures avant un changement important d’amplitude du signal. On peut donc remplacer $\mathbb{E}\left[I_q(\varepsilon_k)\right]$ par sa borne supérieure $I_q(0)$ dans l’expression récursive de l’information Bayésienne. Si on calcule le point fixe de l’expression résultante pour $\sigma_w$ petit, on trouve une expression asymptotique simple pour la borne sur l’EQM :
\[
\text{EQM}_\infty \ge \frac{\sigma_w}{\sqrt{I_q(0)}} + \circ(\sigma_w), \tag{B.10}
\]
où la notation $\circ(\sigma_w)$ est utilisée pour représenter un terme négligeable devant $\sigma_w$ quand $\sigma_w \to 0$, c’est-à-dire quand le signal est lent.

Estimateur asymptotique optimal d’un signal lent

On peut se demander si, de la même façon que pour le MV, il existe une forme asymptotique simple pour l’estimateur optimal d’un paramètre variable quand on utilise $\tau_{0,k} = \hat{X}_{k-1}$. En effet, si l’on suppose encore une fois que $\sigma_w$ est petit devant l’écart type du bruit, on peut montrer que l’estimateur asymptotique optimal est donné par la forme récursive suivante :
\[
\hat{X}_k \approx \hat{X}_{k-1} + u_k - \frac{\sigma_w}{\sqrt{I_q(0)}}\; \frac{f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)\Big|_{\hat{X}_{k|k-1} = X_k}}{P\left(i_k | X_k\right)\Big|_{\hat{X}_{k|k-1} = X_k}}, \tag{B.11}
\]
où $P(i_k|X_k)\big|_{\hat{X}_{k|k-1} = X_k}$ est la probabilité d’avoir la sortie $i_k$ quand $\hat{X}_{k|k-1} = X_k$, et où $f_d\left(i_k, \hat{X}_{k|k-1}, X_k\right)\big|_{\hat{X}_{k|k-1} = X_k}$ est la dérivée de cette probabilité par rapport à l’erreur $\varepsilon_k$, évaluée au point $\varepsilon_k = 0$.
Notez que cette forme a une complexité encore plus basse que celle du MV adaptatif, car maintenant le gain qui corrige l’estimateur à chaque instant est constant. On peut montrer qu’une approximation de l’EQM asymptotique de cet algorithme est
\[
\text{EQM} \approx \frac{\sigma_w}{\sqrt{I_q(0)}}.
\]
Cette performance pour $\sigma_w$ petit se raccorde bien avec la BCRB asymptotique donnée en (B.10). On constate aussi que, de la même façon que pour le MV adaptatif, la performance de l’estimateur optimal avec seuil central adaptatif dépend des caractéristiques de la mesure (bruit et vecteur de seuils $\tau'$) au travers de l’information de Fisher $I_q(0)$. Par conséquent, pour caractériser complètement le quantifieur optimal pour l’estimation, on doit, encore une fois, maximiser $I_q(0)$ par rapport à $\tau'$. Comme on l’a mentionné précédemment, ce problème est difficile à résoudre de façon directe pour $N_B > 3$ ; on doit donc essayer de trouver une approximation de la solution, qui sera le sujet de la Section B.3. Comme dans le cas du paramètre constant, on peut se poser la question suivante :
• est-ce que la forme récursive (B.11) peut converger quand l’erreur initiale sur $\hat{X}_k$, $|\varepsilon_0| = \left|\hat{X}_0 - X_0\right|$, est arbitraire (pas nécessairement petite) ?
On répondra à cette question dans la suite.

B.2.3 Quantifieurs adaptatifs pour l’estimation

Dans cette sous-section, on traite les questions posées précédemment au sujet de l’application des algorithmes asymptotiques de basse complexité. Pour cela, on impose d’abord la structure de quantification adaptative, puis on définit un algorithme d’estimation général qui a comme cas spécifiques les formes asymptotiquement optimales vues précédemment. Après la définition de l’algorithme, on s’intéresse à l’analyse de sa performance : son biais et sa variance asymptotique.
Suite à l’optimisation de sa performance par rapport aux paramètres libres de l’algorithme, on analyse la perte de performance d’estimation par rapport au cas continu (mesures continues). À la fin, on présente aussi des extensions de l’algorithme à d’autres problèmes : estimation conjointe de $x$ et de l’échelle du bruit $\delta$, et estimation à partir de mesures obtenues par plusieurs capteurs.

Modèle du signal

Dans la suite, le paramètre à estimer est considéré soit constant, $X_k = x$, soit lentement variable, $X_k = X_{k-1} + W_k$ (avec $\sigma_w$ petit et $u_k = u$ petite ou nulle).

Modèle du quantifieur

On a vu que le seuil central du quantifieur doit être mis à jour dynamiquement pour améliorer la performance d’estimation. Pour rendre explicite cette caractéristique du quantifieur, on imposera un biais réglable $b_k$ à l’entrée du quantifieur ; pour régler l’amplitude de l’entrée, on appliquera aussi un gain $\frac{1}{\Delta}$. La fonction de quantification est donc donnée par
\[
i_k = Q\left(\frac{Y_k - b_k}{\Delta}\right).
\]
Avec un biais réglable et un gain d’entrée, on peut fixer la structure du quantifieur avec un seuil central statique à zéro et d’autres seuils qui seront égaux aux décalages $\tau'$. La sortie du quantifieur réglable est donnée par
\[
i_k = Q\left(\frac{Y_k - b_k}{\Delta}\right) = i\,\operatorname{sign}(Y_k - b_k), \quad \text{pour } \frac{|Y_k - b_k|}{\Delta} \in \left[\tau'_{i-1}, \tau'_i\right).
\]
À partir des mesures quantifiées, l’objectif est d’estimer le paramètre $X_k$ ; un objectif secondaire est de régler les paramètres $b_k$ et $\Delta$ pour avoir une performance d’estimation améliorée. Comme l’estimateur $\hat{X}_k$ de $X_k$ peut être utilisé dans des applications temps réel, il serait intéressant de l’estimer en ligne. Dans les sous-sections précédentes, on a vu que :
• dans le cas où $X_k = x$, si on place le centre du quantifieur sur la dernière estimation, on peut avoir un algorithme asymptotiquement optimal ;
• dans le cas où $X_k$ est variable, la performance peut être améliorée si on place le seuil central sur la prédiction du signal.
Quand le signal a pour modèle un processus de Wiener, la prédiction, et donc le seuil central, est donnée par $\hat{X}_{k-1}$ ; quand le modèle a une dérive $u_k$, le seuil est donné par $\hat{X}_{k-1} + u_k$. Étant données ces observations et pour simplifier le problème (avoir une seule forme d’algorithme pour tous les signaux), on posera $b_k = \hat{X}_{k-1}$. Notez que ce choix entraîne une possible perte de performance quand le modèle a une dérive. En réalité, si l’on utilise la prédiction $\hat{X}_{k-1} + u_k$ au lieu de la dernière estimation, les deux cas, sans et avec dérive, peuvent être traités de façon conjointe (ici on les traitera, sans perte de généralité, comme étant le cas sans dérive). Le choix $b_k = \hat{X}_{k-1}$ nous permet d’étudier le comportement d’une approche sous-optimale. Le schéma général d’estimation est donné en Fig. B.7. L’objectif maintenant est de définir l’algorithme qui sera placé dans le bloc Mise à jour.

Algorithme d’estimation

On utilise comme estimateur l’algorithme adaptatif suivant :
\[
\hat{X}_k = \hat{X}_{k-1} + \gamma_k\, \eta\left[Q\left(\frac{Y_k - \hat{X}_{k-1}}{\Delta}\right)\right], \tag{B.12}
\]
où $\gamma_k$ est une séquence de gains réels positifs et $\eta[\cdot]$ est une application de $\mathcal{I}$ vers $\mathbb{R}$,
\[
\eta : \mathcal{I} \to \mathbb{R}, \quad j \mapsto \eta_j,
\]
caractérisée par $N_I$ coefficients $\left\{\eta_{-\frac{N_I}{2}}, \ldots, \eta_{-1}, \eta_1, \ldots, \eta_{\frac{N_I}{2}}\right\}$. Les coefficients $\eta[\cdot]$ peuvent être vus comme des équivalents, pour l’estimation, des niveaux de sortie des quantifieurs dans un contexte de quantification classique (quantification pour la reconstruction des mesures).

Figure B.7 : Schéma général d’estimation : le quantifieur réglable (gain d’entrée $\frac{1}{\Delta}$, biais $\hat{X}_{k-1}$, seuils $\pm\tau'_1, \pm\tau'_2, \ldots$) produit les mesures quantifiées $i_k$ à partir de $Y_k = X_k + V_k$, et l’estimateur produit $\hat{X}_k$ dans le bloc Mise à jour, où l’algorithme d’estimation est réalisé.

Cet algorithme a les avantages d’être un algorithme en ligne, d’avoir une basse complexité et d’inclure comme cas spéciaux les formes récursives optimales des estimateurs avec quantification adaptative.
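Pour fixer les idées, le sketch suivant met en œuvre (B.12) dans le cas le plus simple : quantification binaire ($\eta_{\pm 1} = \pm 1$), gain constant et poursuite d’un processus de Wiener lent, sous l’hypothèse (propre à l’exemple) d’un bruit Gaussien $\mathcal{N}(0,1)$ ; le choix $\gamma = \sigma_w$ et la valeur théorique $\sigma_w/(2f(0))$ correspondent à l’analyse de performance présentée plus loin, et la fonction `track_wiener_binary` est un nom d’illustration.

```python
import math, random

def track_wiener_binary(K=20000, sigma_w=0.01, seed=3):
    # Algorithme (B.12), cas binaire : X_k = X_{k-1} + gamma * sign(Y_k - X_{k-1}),
    # gain constant gamma = sigma_w ; bruit de mesure Gaussien N(0,1).
    rng = random.Random(seed)
    x, xhat, gamma = 0.0, 0.0, sigma_w
    mse = 0.0
    for _ in range(K):
        x += rng.gauss(0.0, sigma_w)                         # processus de Wiener
        ik = 1 if x + rng.gauss(0.0, 1.0) >= xhat else -1    # quantifieur centre sur xhat
        xhat += gamma * ik                                   # correction a gain constant
        mse += (xhat - x) ** 2
    return mse / K

eqm = track_wiener_binary()
theo = 0.01 * math.sqrt(math.pi / 2)   # sigma_w / (2 f(0)) pour f(0) = 1/sqrt(2*pi)
print(eqm, theo)                       # EQM simulee du meme ordre que la valeur theorique
```

L’EQM temporelle simulée est du même ordre de grandeur que la valeur asymptotique $\sigma_w/(2f(0))$, ce qui illustre le caractère en ligne et à très basse complexité de (B.12) : une comparaison, une addition et une multiplication par instant.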
À cause de la symétrie du problème et pour simplifier les développements présentés dans la suite, on impose ce qui suit.

Hypothèse (sur les niveaux de sortie du quantifieur) AQ3 : les niveaux ont une symétrie impaire en $i$ : $\eta_i = -\eta_{-i}$, avec $\eta_i > 0$ pour $i > 0$.

La non-linéarité non différentiable en (B.12) rend difficile l’analyse directe de l’algorithme. Pour s’en sortir, on peut utiliser les techniques présentées en [Benveniste 1990]. Ces techniques d’analyse sont basées sur des approximations de la moyenne de l’algorithme et sont valables pour une classe assez générale d’algorithmes adaptatifs. Dans le contexte des algorithmes étudiés en [Benveniste 1990], la fonction $\eta$ peut être une fonction non linéaire et non différentiable, et il est montré que les gains $\gamma_k$ qui optimisent l’estimation de $X_k$ ont les formes suivantes :
• $\gamma_k \propto \frac{1}{k}$ pour $X_k$ constant ;
• $\gamma_k$ est constant pour $X_k$ avec un modèle de Wiener ;
• $\gamma_k$ est une constante proportionnelle à $u^{\frac{2}{3}}$ pour $X_k$ avec un modèle de Wiener qui contient une dérive.

Dans ce qui suit, on utilise en (B.12) les séquences de gains données ci-dessus et on obtient la performance de l’algorithme avec les techniques présentées en [Benveniste 1990].

Performance d’estimation

L’analyse de la performance de l’algorithme adaptatif est séparée en deux parties : l’analyse de la trajectoire moyenne de l’algorithme et, par conséquent, de son biais, puis l’analyse de l’EQM ou de la variance asymptotique.

Analyse de la moyenne : cas constant et Wiener. Une approximation de la moyenne de l’algorithme $\mathbb{E}\left[\hat{X}_k\right]$, dans les cas constant et Wiener, est donnée par $\hat{x}(t_k)$, où $\hat{x}(t)$ est la solution de l’équation différentielle ordinaire (EDO) suivante :
\[
\frac{\mathrm{d}\hat{x}}{\mathrm{d}t} = h(\hat{x}).
\]
La correspondance entre temps discret et temps continu est donnée par la relation $t_k = \sum_{j=1}^{k} \gamma_j$ et $h(\hat{x})$ est
\[
h(\hat{x}) = \mathbb{E}\left\{\eta\left[Q\left(\frac{x - \hat{x} + V}{\Delta}\right)\right]\right\},
\]
où l’espérance est évaluée par rapport à la distribution marginale du bruit $V$.
Cette approximation est valable lorsque les gains $\gamma_k$ sont petits, ce qui veut dire qu’elle est valable après un certain temps dans le cas constant (vu que les gains décroissent en $k$) et pour tout $k$ dans le cas Wiener, si on choisit un petit gain $\gamma_k = \gamma$ (ce qui doit être le cas, car pour poursuivre avec peu d’erreur un signal lentement variable, il faut que les variations de l’estimateur soient petites). On peut utiliser cette approximation de la moyenne pour obtenir une approximation du biais $\varepsilon(t)$ :
\[
\frac{\mathrm{d}\varepsilon}{\mathrm{d}t} = \tilde{h}(\varepsilon),
\]
où $\tilde{h}(\varepsilon) = h(\varepsilon + x)$ est une fonction qui ne dépend pas du paramètre $x$. En utilisant les hypothèses de symétrie AN2, AQ2 et AQ3, on peut démontrer que l’EDO est globalement asymptotiquement stable, i.e. pour tout $\varepsilon(0)$, on a $\varepsilon(t) \to 0$ quand $t \to \infty$. Comme $\varepsilon(t)$ approche le biais, l’algorithme est donc asymptotiquement non biaisé.

Analyse de la moyenne : cas Wiener avec dérive. Pour une dérive $u$ petite, on s’attend à ce que le gain $\gamma_k = \gamma$ soit petit aussi (pour suivre le signal qui est lentement variable) ; dans ce cas, on peut aussi utiliser une approximation par EDO. Par contre, contrairement aux cas précédents, on doit prendre en compte la variabilité de la moyenne de $X_k$, de sorte que la moyenne de l’algorithme est obtenue en échantillonnant la solution de la paire d’EDO suivante :
\[
\frac{\mathrm{d}x}{\mathrm{d}t} = \frac{u}{\gamma}, \qquad \frac{\mathrm{d}\hat{x}}{\mathrm{d}t} = \tilde{h}(\hat{x} - x).
\]
Si l’on soustrait les deux expressions, on a une EDO pour le biais :
\[
\frac{\mathrm{d}\varepsilon}{\mathrm{d}t} = \tilde{h}(\varepsilon) - \frac{u}{\gamma}. \tag{B.13}
\]
La principale différence dans ce cas est le second terme à droite, qui fait que le biais n’est pas asymptotiquement nul. Si l’EDO sans le second terme est globalement asymptotiquement stable, on s’attend à une convergence du biais vers une petite valeur. Pour de petites valeurs de biais, on peut linéariser (B.13) autour de zéro et obtenir une approximation du biais asymptotique :
\[
\mathbb{E}\left[\hat{X}_k - X_k\right] \underset{k\to\infty}{\approx} \frac{u}{\gamma\, h_\varepsilon},
\]
où $h_\varepsilon$ est la dérivée de $\tilde{h}(\varepsilon)$ évaluée en zéro.

EQM et variance normalisée. Les résultats de [Benveniste 1990] peuvent être utilisés pour la caractérisation des fluctuations asymptotiques de l’algorithme adaptatif. Les fluctuations asymptotiques sont ici la variance asymptotique normalisée de l’erreur d’estimation de la constante,
\[
\sigma_\infty^2 = \lim_{k\to\infty} \operatorname{Var}\left[\sqrt{k}\left(\hat{X}_k - x\right)\right],
\]
et l’EQM asymptotique pour l’estimation de $X_k$ variable,
\[
\text{EQM}_{q,\infty} = \lim_{k\to\infty} \mathbb{E}\left[\left(\hat{X}_k - X_k\right)^2\right].
\]
Les expressions asymptotiques des fluctuations étant dépendantes de $\gamma$, on peut les minimiser par rapport aux gains. Les paires (gain optimal $\gamma^\star$, performance optimale) sont données en Tab. B.1 :
• signal constant : $\gamma^\star = -\frac{1}{h_\varepsilon}$ ; performance $\sigma_\infty^2 = \frac{R}{h_\varepsilon^2}$ ;
• Wiener : $\gamma^\star = \frac{\sigma_w}{\sqrt{R}}$ ; performance $\text{EQM}_{q,\infty} = \frac{\sigma_w \sqrt{R}}{-h_\varepsilon} + \circ(\gamma^\star) = \sigma_w \sigma_\infty + \circ(\gamma^\star)$ ;
• Wiener avec dérive : $\gamma^\star = \left(\frac{4u^2}{-h_\varepsilon R}\right)^{\frac{1}{3}}$ ; performance $\text{EQM}_{q,\infty} \approx 3\left(\frac{u R}{4 h_\varepsilon^2}\right)^{\frac{2}{3}} + \circ(\gamma^\star) = 3\left(\frac{u\,\sigma_\infty^2}{4}\right)^{\frac{2}{3}} + \circ(\gamma^\star)$.

Table B.1 : Gains optimaux, EQM asymptotique et variance normalisée asymptotique de l’algorithme adaptatif.

La quantité $R$ dans ce tableau est la variance asymptotique normalisée des corrections de l’algorithme quand $\hat{X}_k = X_k$ :
\[
R = \operatorname{Var}\left\{\eta\left[Q\left(\frac{x - \hat{x} + V}{\Delta}\right)\right]\right\}\Bigg|_{\hat{x}=x} = 2 \sum_{i=1}^{\frac{N_I}{2}} \eta_i^2\, \tilde{F}_d(i, 0),
\]
où $\tilde{F}_d(i, 0)$ est la probabilité d’avoir la sortie $i$ du quantifieur quand $\hat{X}_k = X_k$.

Algorithme optimal et performance

Les performances asymptotiques présentées ci-dessus indiquent que la performance de l’algorithme dépend, dans les trois cas, de la quantité $\sigma_\infty^2$, qui est une fonction du vecteur de coefficients $\eta = \left[\eta_1 \cdots \eta_{\frac{N_I}{2}}\right]^{\top}$.
Par conséquent, pour maximiser la performance asymptotique, on doit résoudre le problème de minimisation suivant :
\[
\eta^\star = \operatorname*{argmin}_{\eta} \frac{R}{h_\varepsilon^2} = \operatorname*{argmin}_{\eta} \frac{\eta^{\top} F_d\, \eta}{2\left(\eta^{\top} f_d\right)^2},
\]
où $F_d$ est une matrice diagonale donnée par
\[
F_d = \operatorname{diag}\left[\tilde{F}_d(1,0), \cdots, \tilde{F}_d\left(\tfrac{N_I}{2}, 0\right)\right]
\]
et $f_d$ est le vecteur des dérivées des probabilités de sortie du quantifieur par rapport à $\hat{X}_k$, quand $\hat{X}_k = X_k$ :
\[
f_d = \left[\tilde{f}_d(1,0) \;\cdots\; \tilde{f}_d\left(\tfrac{N_I}{2}, 0\right)\right]^{\top}.
\]
Ce dernier peut être vu comme le vecteur des différences entre les valeurs de la d.d.p. du bruit pour des variations de seuil consécutives $\tau'_{i-1}$ et $\tau'_i$. Ce problème peut être résolu facilement à l’aide de l’inégalité de Cauchy-Schwarz. En tenant compte de la contrainte de positivité sur les coefficients, on trouve
\[
\eta^\star = -F_d^{-1} f_d.
\]
Le $\sigma_\infty^2$ minimum est donc
\[
\sigma_\infty^2 = \frac{1}{2\, f_d^{\top} F_d^{-1} f_d} = \left[2 \sum_{i=1}^{\frac{N_I}{2}} \frac{\tilde{f}_d^2(i,0)}{\tilde{F}_d(i,0)}\right]^{-1} = \frac{1}{I_q(0)}.
\]
Dans le tableau suivant, on donne les gains et les performances asymptotiques optimales. Notez que dans le cas de l’estimation d’une constante et d’un processus de Wiener lent, l’algorithme a des performances asymptotiques optimales ; il est donc une alternative à basse complexité aux algorithmes vus précédemment (le MV adaptatif et l’estimateur optimal adaptatif).

Table B.2 : Gains optimaux, EQM et variance normalisée asymptotique de l’algorithme adaptatif pour $\eta$ optimal :
• signal constant : $\gamma^\star = \frac{1}{I_q(0)}$ ; performance $\sigma_\infty^2 = \frac{1}{I_q(0)}$ ;
• Wiener : $\gamma^\star = \frac{\sigma_w}{\sqrt{I_q(0)}}$ ; performance $\text{EQM}_{q,\infty} = \frac{\sigma_w}{\sqrt{I_q(0)}} + \circ(\sigma_w)$ ;
• Wiener avec dérive : $\gamma^\star = \left[\frac{4u^2}{I_q^2(0)}\right]^{\frac{1}{3}}$ ; performance $\text{EQM}_{q,\infty} \approx 3\left[\frac{u}{4 I_q(0)}\right]^{\frac{2}{3}} + \circ(\gamma^\star)$.

Choix du gain d’entrée : pour simplifier le choix de la constante $\Delta$, on peut considérer que la fonction de répartition du bruit est caractérisée par un paramètre d’échelle $\delta$ :
\[
F(x) = F_n\left(\frac{x}{\delta}\right),
\]
où $F_n$ est la fonction de répartition pour $\delta = 1$. Dans ce cas, $\frac{\Delta}{\delta}$ est un facteur clé pour l’évaluation des coefficients $\eta$. Par conséquent, l’évaluation des coefficients peut être simplifiée si on choisit $\Delta = c_\Delta\, \delta$.
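$F_d$ étant diagonale, $\eta^\star = -F_d^{-1} f_d$ se calcule composante par composante. Le sketch suivant, donné à titre indicatif pour un bruit Gaussien $\mathcal{N}(0,1)$ (hypothèse de l’exemple), évalue $\eta^\star$ et $\sigma_\infty^2 = 1/I_q(0)$ pour des variations de seuil quelconques ; la fonction `optimal_levels` est un nom d’illustration.

```python
import math

def gpdf(t):
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def gcdf(t):
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def optimal_levels(tau, delta=1.0):
    # eta* = -Fd^{-1} fd pour des variations de seuil tau = [tau'_1, ..., tau'_{NI/2-1}]
    # (le dernier seuil vaut +inf) ; renvoie (eta*, sigma_inf^2 = 1/I_q(0)).
    edges = [0.0] + [delta * t for t in tau] + [math.inf]
    eta, iq = [], 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        Fd = gcdf(hi) - gcdf(lo)                              # F~_d(i, 0)
        fd = (gpdf(hi) if hi < math.inf else 0.0) - gpdf(lo)  # f~_d(i, 0), negatif
        eta.append(-fd / Fd)
        iq += 2 * fd * fd / Fd                                # I_q(0), par symetrie
    return eta, 1.0 / iq

eta_bin, var_bin = optimal_levels([])            # cas binaire : sigma_inf^2 = pi/2
eta_3b, var_3b = optimal_levels([1.0, 2.0])      # seuils uniformes, plus de niveaux
print(eta_bin, var_bin, var_3b)
```

Dans le cas binaire, on retrouve $\sigma_\infty^2 = \pi/2$ (soit $1/I_q(0)$ avec $I_q(0) = 2/\pi$), et l’ajout de seuils réduit $\sigma_\infty^2$, conformément à la monotonicité de $I_q$ ; les coefficients $\eta^\star$ obtenus sont bien positifs.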
La constante $c_\Delta$ peut être utilisée pour régler le gain d’entrée du quantifieur, ou pour régler le pas de quantification quand les seuils $\tau'$ sont uniformes et fixés à des valeurs qui ne peuvent pas être modifiées.

Seuils optimaux. On voit que, dans les expressions des performances de l’algorithme (Tab. B.2), l’influence des variations de seuil $\tau'$ se fait au travers de la quantité $I_q(0)$ ; donc, pour optimiser les performances par rapport aux seuils, on doit résoudre le problème d’optimisation suivant :
\[
I_q^\star = \operatorname*{argmax}_{\tau'}\; I_q(0).
\]
Or, comme on l’a mentionné précédemment, ce problème est difficile à résoudre en général (pour $N_B > 3$) et une approximation de la solution optimale sera présentée dans la Section B.3. Pour les simulations présentées dans la suite, on imposera aux variations de seuil d’être uniformes :
\[
\tau' = \left[-\tau'_{\frac{N_I}{2}} = -\infty \;\cdots\; -\tau'_1 = -1 \;\; 0 \;\; +\tau'_1 = +1 \;\cdots\; +\tau'_{\frac{N_I}{2}} = +\infty\right]^{\top}.
\]
Sous cette contrainte, l’optimisation de la performance est faite par rapport à $c_\Delta$. La valeur optimale de $c_\Delta$ peut être obtenue de façon simple par recherche exhaustive.

Perte induite par la quantification. On peut comparer les performances asymptotiques de l’algorithme adaptatif avec celles de son équivalent qui utilise des mesures continues :
• cas constant : on compare la performance asymptotique de l’algorithme adaptatif à pas décroissant avec la performance du MV avec des mesures continues, $\operatorname{Var}\left[\hat{X}_k\right] \underset{k\to\infty}{\sim} \frac{1}{k I_c}$ ;
• cas Wiener lent : la performance est comparée avec celle de l’estimateur optimal du processus de Wiener lent avec des mesures continues, $\text{EQM}_{c,\infty} = \frac{\sigma_w}{\sqrt{I_c}} + \circ(\sigma_w)$ ;
• cas Wiener lent avec dérive : on compare la performance de l’algorithme adaptatif avec mesures quantifiées avec celle de l’algorithme adaptatif avec mesures continues, $\text{EQM}_{c,\infty} \approx 3\left[\frac{u}{4 I_c(0)}\right]^{\frac{2}{3}} + \circ\left(u^{\frac{2}{3}}\right)$.

Les pertes de performance relatives en dB sont données en Tab. B.3.
| Signal | Perte |
|---|---|
| Constant | $L_q = -10\log_{10}\frac{I_q(0)}{I_c}$ |
| Wiener | $L_q^{W} \approx \frac{1}{2} L_q$ ($\sigma_w$ petit) |
| Wiener avec dérive | $L_q^{WD} \approx \frac{2}{3} L_q$ ($\sigma_w$ et $u$ petits) |

Table B.3: Pertes de performance asymptotique induites par la quantification.

Ce qui est surprenant dans ces résultats est le fait que la perte de performance est plus petite dans le cas variable que dans le cas constant ; ceci indique une certaine ressemblance avec le phénomène de « dithering », connu en quantification classique. Dans la quantification classique, un ajout de variabilité à l'entrée du quantifieur (ajout de bruit) peut améliorer les performances de reconstruction après quantification. Dans les résultats présentés ci-dessus, on voit que la variabilité intrinsèque au signal induit une perte de performance d'estimation plus petite que dans le cas constant.

Simulations

Modèle de bruit : pour les résultats de simulation présentés dans la suite, deux modèles de bruit (respectant les hypothèses de travail) ont été utilisés. Ces modèles sont caractérisés par les distributions suivantes :

• Gaussienne généralisée (GG). Cette distribution a pour d.d.p.
$$f_{GGD}(x) = \frac{\beta}{2\delta\,\Gamma\!\left(\frac{1}{\beta}\right)} \exp\left(-\left|\frac{x}{\delta}\right|^{\beta}\right),$$
où $\beta$ est un paramètre de forme (réel et positif).

• Student-t (ST). Sa d.d.p. est
$$f_{STD}(x) = \frac{\Gamma\!\left(\frac{\beta+1}{2}\right)}{\sqrt{\beta\pi}\,\delta\,\Gamma\!\left(\frac{\beta}{2}\right)} \left[1 + \frac{1}{\beta}\left(\frac{x}{\delta}\right)^2\right]^{-\frac{\beta+1}{2}}.$$

Perte théorique : les résultats de simulation de l'algorithme seront comparés aux pertes théoriques, qui dépendent toutes de $L_q$ ; l'évolution de cette quantité en fonction de $N_B$ est donnée en Fig. B.8.

Figure B.8: Perte $L_q$ induite par la quantification dans le cas constant pour différents nombres de bits et différents types de bruit (GG avec $\beta \in \{1.5,\,2\ \text{(Gaussien)},\,2.5,\,3\}$ et ST avec $\beta \in \{1\ \text{(Cauchy)},\,2,\,3\}$).
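Les deux familles de d.d.p. ci-dessus s'implémentent directement ; esquisse Python (absente du texte original, valeurs de $\beta$ et $\delta$ purement illustratives) :

```python
import math

def f_ggd(x, beta, delta=1.0):
    """d.d.p. gaussienne généralisée : beta/(2 delta Gamma(1/beta)) exp(-|x/delta|^beta)."""
    return beta / (2 * delta * math.gamma(1 / beta)) * math.exp(-abs(x / delta) ** beta)

def f_std(x, beta, delta=1.0):
    """d.d.p. de Student-t de paramètre de forme beta et d'échelle delta."""
    c = math.gamma((beta + 1) / 2) / (math.sqrt(beta * math.pi) * delta * math.gamma(beta / 2))
    return c * (1 + (x / delta) ** 2 / beta) ** (-(beta + 1) / 2)

# beta = 2 dans f_ggd redonne une gaussienne ; beta = 1 dans f_std redonne Cauchy
tot = sum(f_ggd(-30 + i * 0.01, 1.5) for i in range(6001)) * 0.01  # normalisation ~ 1
print(f_ggd(0.0, 2.0), f_std(0.0, 1.0), tot)
```

On vérifie par exemple que $f_{GGD}(0;\beta{=}2) = 1/\sqrt{\pi}$ et que $f_{STD}(\cdot;\beta{=}1)$ est bien la d.d.p. de Cauchy $1/[\pi(1+x^2)]$.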
Notez que dans tous les cas, la perte est faible pour la quantification binaire (de 1 à 4 dB) et qu'elle décroît très rapidement avec $N_B$, pour devenir négligeable pour 4 ou 5 bits de quantification.

Simulation pour le cas constant : on vérifie la convergence des pertes simulées pour $N_B$ de 2 à 5 et plusieurs distributions de bruit dans la Fig. B.9.

Simulation pour le cas Wiener : dans la Fig. B.10, on vérifie que si le signal est lent ($\sigma_w = 0.001$), alors les résultats asymptotiques simulés sont très proches des résultats théoriques. Par contre, dès que l'on s'éloigne de l'hypothèse de signal lent ($\sigma_w = 0.1$), les résultats théoriques et simulés présentent un certain écart.

Simulation pour le cas Wiener avec dérive : la Fig. B.11 montre les performances asymptotiques simulées de l'algorithme adaptatif pour la poursuite du processus de Wiener avec dérive. Le petit écart entre les résultats théoriques et simulés vient du fait que le gain optimal $\gamma^\star$ est calculé avec une estimation en ligne de $u$ (qui est inconnue en pratique).

Comparaison avec les algorithmes à haute complexité : avant de passer aux extensions de l'algorithme adaptatif, on discutera rapidement des différences entre l'algorithme adaptatif proposé ici et les algorithmes vus dans les sous-sections précédentes. Étant donnée l'équivalence en termes de performance asymptotique entre l'algorithme adaptatif et les solutions à haute complexité (le MV adaptatif et le FP adaptatif), des simulations
des transitoires des algorithmes ont été réalisées pour les différencier de façon plus précise.

Figure B.9: Perte induite par la quantification pour des distributions GG et ST et pour $N_B \in \{2, 3, 4, 5\}$ quand $X_k$ est constant. Pour chaque type de bruit il y a 4 courbes : les constantes sont les résultats théoriques et les courbes décroissantes sont les résultats simulés avec l'algorithme adaptatif. Pour chaque paire de courbes, les résultats les plus hauts correspondent à moins de bits de quantification. En (a) on a les résultats pour un bruit GG avec $N_B = 2$ et 3 et en (b) avec $N_B = 4$ et 5. En (c) on présente les résultats pour un bruit ST avec $N_B = 2$ et 3 et en (d) avec $N_B = 4$ et 5.

Les résultats de simulation indiquent que l'algorithme adaptatif peut atteindre des performances similaires, voire meilleures, que le MV adaptatif pour l'estimation d'une constante et que, dans le cas de la poursuite d'un signal variable, le FP semble être dans la plupart des cas plus performant. En conclusion, pour l'estimation d'une constante, l'algorithme adaptatif semble être la meilleure solution, car il a une complexité très basse en comparaison avec le MV. Toutefois, dans le cas de l'estimation du processus de Wiener, l'algorithme adaptatif ne sera la meilleure solution que si les contraintes de complexité empêchent d'utiliser le FP (lui aussi très complexe).

Extensions de l'algorithme adaptatif

Paramètre d'échelle inconnu : une extension possible du problème d'estimation d'une constante consiste à considérer que le paramètre d'échelle $\delta$ est inconnu et donc que l'on doit estimer conjointement la paire $(x, \delta)$ à partir de mesures quantifiées. Pour améliorer la performance d'estimation, on peut envisager non seulement l'utilisation de $\hat X_{k-1}$ comme biais du quantifieur, mais aussi celle de $\hat\delta_{k-1}$ pour régler le gain d'entrée du quantifieur.
Ceci est montré en Fig. B.12.

Figure B.10: Perte induite par la quantification dans le cas Wiener pour différents nombres de bits, écarts types des incréments du signal et types de bruit (Gaussien – GG avec $\beta = 2$ et Cauchy – ST avec $\beta = 1$).

Figure B.11: Perte induite par la quantification dans le cas Wiener avec dérive pour différents nombres de bits et types de bruit (Gaussien et Cauchy).

Figure B.12: Schéma d'estimation/quantification pour retrouver conjointement le paramètre de centrage $x$ et le paramètre d'échelle $\delta$.

Pour la mise à jour, on peut encore une fois utiliser un algorithme adaptatif de basse complexité :
$$\begin{bmatrix} \hat X_k \\ \hat\delta_k \end{bmatrix} = \begin{bmatrix} \hat X_{k-1} \\ \hat\delta_{k-1} \end{bmatrix} + \frac{\Gamma}{k}\,\hat\delta_{k-1} \begin{bmatrix} \eta_x(i_k) \\ \eta_\delta(i_k) \end{bmatrix},$$
où $\Gamma$ est une matrice $2\times 2$ de gains. Sous certaines hypothèses de convergence en moyenne de l'algorithme, on peut montrer que les coefficients optimaux $\eta_x$ et $\eta_\delta$ sont donnés par
$$\eta_x = -F_d^{-1} f_d^{(x)}, \qquad \eta_\delta = -F_d^{-1} f_d^{(\delta)},$$
où $F_d$ a déjà été détaillée plus haut et où $f_d^{(x)}$ et $f_d^{(\delta)}$ sont des vecteurs de dérivées des probabilités des sorties du quantifieur par rapport à $\hat X_k$ et $\hat\delta_k$ respectivement. Ces dérivées sont évaluées au point $\left(\hat X_k = x,\ \hat\delta_k = \delta\right)$. Avec les coefficients optimaux, on trouve les valeurs asymptotiques optimales de la covariance normalisée d'estimation $P$ et du gain $\Gamma^\star$ :
$$P = \delta^2\,\Gamma^\star = \delta^2 \begin{bmatrix} \dfrac{1}{2\,f_d^{(x)\top} F_d^{-1} f_d^{(x)}} & 0 \\[3mm] 0 & \dfrac{1}{2\,f_d^{(\delta)\top} F_d^{-1} f_d^{(\delta)}} \end{bmatrix}.$$
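Voici une esquisse de simulation de ce schéma conjoint (absente du texte original) : bruit gaussien, quantifieur 2 bits, coefficients $\eta_x$ et $\eta_\delta$ obtenus par différences finies des probabilités de sortie, gains scalaires simplifiés (et non le $\Gamma^\star$ optimal), valeurs $(x, \delta) = (1.3, 0.7)$ arbitraires :

```python
import math, random

def Phi(x): return 0.5 * (1 + math.erf(x / math.sqrt(2)))

TAU = [-float("inf"), -1.0, 0.0, 1.0, float("inf")]  # quantifieur 2 bits symétrique

def probs(b, s):
    # probabilités de sortie quand l'entrée est (Z - b)/s, Z ~ N(0,1)
    return [Phi(b + s * TAU[i + 1]) - Phi(b + s * TAU[i]) for i in range(4)]

eps = 1e-5
p0 = probs(0.0, 1.0)
eta_x = [-(a - c) / (2 * eps * p)
         for a, c, p in zip(probs(eps, 1.0), probs(-eps, 1.0), p0)]
eta_d = [-(a - c) / (2 * eps * p)
         for a, c, p in zip(probs(0.0, 1.0 + eps), probs(0.0, 1.0 - eps), p0)]

def quantize(v):
    return sum(1 for t in TAU[1:-1] if v >= t)

random.seed(1)
x_true, d_true = 1.3, 0.7
xh, dh = 0.0, 1.0
for k in range(1, 150001):
    y = x_true + d_true * random.gauss(0.0, 1.0)
    i = quantize((y - xh) / dh)       # quantifieur centré/normalisé par les estimées
    g = min(0.5, 2.0 / k)             # gain décroissant (borné au début)
    xh += g * dh * eta_x[i]
    dh += g * dh * eta_d[i]
    dh = max(dh, 0.05)                # garde-fou : l'échelle reste positive
print(xh, dh)
```

Les deux estimées convergent conjointement vers $(x, \delta)$, ce qui illustre le comportement annoncé du schéma.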
Les éléments de la diagonale de $P$ sont les inverses des informations de Fisher pour l'estimation de $x$ et de $\delta$ à partir des données quantifiées ; l'algorithme est donc asymptotiquement optimal et on voit que le fait de ne pas connaître $\delta$ ne dégrade pas les performances asymptotiques de l'estimateur de $x$.

Approche multicapteur : une autre extension consiste à utiliser des mesures obtenues de façon simultanée par plusieurs capteurs pour estimer un paramètre constant $x$. Dans cette approche, chaque capteur quantifie une mesure continue
$$Y_k^{(j)} = x + V_k^{(j)}, \qquad \text{pour } j \in \{1, \cdots, N_s\},$$
où $N_s$ est le nombre de capteurs, et transmet la mesure quantifiée à un centre de fusion. Le centre de fusion utilise toutes les mesures des capteurs pour générer une estimation $\hat X_k$ de $x$, qui est diffusée à tous les capteurs pour être utilisée comme seuil central des quantifieurs. Le schéma qui représente cette approche est montré en Fig. B.13.

Figure B.13: Schéma d'estimation/quantification multicapteur avec un centre de fusion.

Pour la mise à jour, on peut utiliser l'extension suivante de l'algorithme adaptatif :
$$\hat X_k = \hat X_{k-1} + \frac{\gamma}{k}\,\eta(\mathbf i_k),$$
où $\mathbf i_k = \left[i_k^{(1)}\ \cdots\ i_k^{(N_s)}\right]^\top$ est le vecteur des mesures quantifiées. Les coefficients $\eta$ optimaux sont donnés cette fois par
$$\eta(\mathbf i) = -\sum_{j=1}^{N_s} \frac{\tilde f_d^{(j)}\!\left[i^{(j)}\right]}{\tilde F_d^{(j)}\!\left[i^{(j)}\right]}. \tag{B.14}$$
Pour ces coefficients, on trouve la variance asymptotique d'estimation normalisée et le gain optimal suivants :
$$\sigma_\infty^2 = \gamma^\star = \left[\sum_{j=1}^{N_s}\ \sum_{i^{(j)} \in \mathcal I^{(j)}} \frac{\tilde f_d^{(j)}\!\left[i^{(j)}\right]^2}{\tilde F_d^{(j)}\!\left[i^{(j)}\right]}\right]^{-1}. \tag{B.15}$$
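Petite esquisse numérique (absente du texte original) des expressions (B.14)–(B.15), sous l'hypothèse de quantifieurs binaires (signe) et de bruits gaussiens d'échelles différentes par capteur :

```python
import math

phi0 = 1 / math.sqrt(2 * math.pi)      # d.d.p. normale standard en 0
deltas = [0.5, 1.0, 2.0]               # échelles de bruit par capteur (hypothèse)

# Quantification binaire : F~(j)[+-1] = 1/2 et |f~(j)[+-1]| = phi(0)/delta_j,
# d'où les coefficients mono-capteur eta(j)(+-1) = +- 2 phi(0)/delta_j.
eta_1 = [{+1: 2 * phi0 / d, -1: -2 * phi0 / d} for d in deltas]

def eta(bits):
    # (B.14) : le coefficient multicapteur est la somme des coefficients mono-capteur
    return sum(e[b] for e, b in zip(eta_1, bits))

# (B.15) : variance asymptotique normalisée = inverse de la somme des informations
# de Fisher mono-capteur ; en binaire gaussien, I_q^(j) = 2/(pi delta_j^2)
fisher = [2 / (math.pi * d * d) for d in deltas]
sigma_inf2 = 1.0 / sum(fisher)
print(sigma_inf2)
```

La fusion fait donc toujours mieux que le meilleur capteur pris isolément, puisque les informations de Fisher s'additionnent.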
B.3 Estimation et quantification : approximations à haute résolution

On présente maintenant des résultats concernant la caractérisation asymptotique des quantifieurs optimaux pour l'estimation. Le mot « asymptotique » vient ici du fait que l'on suppose que le nombre d'intervalles de quantification $N_I$ est très grand. Comme on impose aussi que les tailles $\Delta_i$ des intervalles de quantification tendent vers zéro, on appelle aussi ces résultats approximations à haute résolution.

B.3.1 Approximation à haute résolution de l'information de Fisher

Pour trouver la caractérisation asymptotique des quantifieurs optimaux et la performance correspondante en termes d'estimation, on s'intéresse aux questions suivantes :

• comment décrire l'information de Fisher pour l'estimation d'un paramètre $x$ en fonction du quantifieur quand $N_I$ est grand ?

• Comment maximiser l'information de Fisher par rapport à la caractérisation du quantifieur ?

• Quelle est la performance optimale correspondante ?

Remarque : notez que dans la suite on n'impose pas que le problème soit un problème d'estimation de paramètre de centrage.

Approximation asymptotique

Pour répondre à la première question, on va commencer par une réécriture de l'information de Fisher :
$$I_q = I_c - \mathbb E\left[\left(S_c - S_q\right)^2\right]. \tag{B.16}$$
Le second terme du membre de droite dans (B.16) peut être vu comme la perte $L$ induite par la quantification. L'espérance en $L$ peut être écrite comme une somme d'intégrales sur les différents intervalles de quantification $q_i$ :
$$L = \sum_{i=1}^{N_I} \int_{q_i} \left[\frac{\partial \log f(y;x)}{\partial x} - \frac{\partial \log P(i;x)}{\partial x}\right]^2 f(y;x)\,\mathrm dy.$$
Des développements en série de Taylor nous donnent
$$L = \sum_{i=1}^{N_I} S_{c,i}^{(y)\,2}\, f_i\, \frac{\Delta_i^3}{12} + o\!\left(\Delta_i^3\right), \tag{B.17}$$
où $S_{c,i}^{(y)}$ est la dérivée du score par rapport à $y$ évaluée au centre de l'intervalle de quantification $q_i$ et $f_i$ est la d.d.p. des mesures continues, elle aussi évaluée au centre de l'intervalle. Pour obtenir la caractérisation de la perte en fonction du quantifieur, on définit la densité d'intervalles $\lambda$ :
$$\lambda(y) = \lambda_i = \frac{1}{N_I\,\Delta_i}, \qquad \text{pour } y \in q_i. \tag{B.18}$$
La densité d'intervalles est une fonction qui, si on l'intègre sur un intervalle donné, donne la fraction d'intervalles de quantification dans cet intervalle. Si l'on utilise (B.18) dans (B.17), si l'on fait $N_I \to \infty$ et si les $\Delta_i$ convergent uniformément vers zéro, on obtient le résultat suivant :
$$\lim_{N_I \to \infty} N_I^2\, L = \frac{1}{12} \int \frac{\left[\frac{\partial S_c(y;x)}{\partial y}\right]^2 f(y;x)}{\lambda^2(y)}\,\mathrm dy.$$
Si on revient à (B.16), le résultat asymptotique ci-dessus nous amène à l'approximation asymptotique de l'information de Fisher :
$$I_q \approx I_c - \frac{1}{12 N_I^2} \int \frac{\left[\frac{\partial S_c(y;x)}{\partial y}\right]^2 f(y;x)}{\lambda^2(y)}\,\mathrm dy. \tag{B.19}$$
Celle-ci est la réponse à la première question. On constate que si l'intégrale du membre de droite converge, alors $I_q$ converge vers $I_c$ quand $N_I \to \infty$ ; de cette façon, on répond aussi à une question qui avait été posée précédemment (p. 265). Si toutes les valeurs possibles en sortie du quantifieur sont encodées avec des mots binaires de même taille $N_B = \log_2(N_I)$, alors (B.19) peut être réécrite de la façon suivante :
$$I_q \approx I_c - \frac{2^{-2N_B}}{12} \int \frac{\left[\frac{\partial S_c(y;x)}{\partial y}\right]^2 f(y;x)}{\lambda^2(y)}\,\mathrm dy.$$
On voit de façon explicite la convergence exponentielle en $N_B$ de $I_q$ vers $I_c$.

Densité d'intervalles optimale : on répond maintenant à la deuxième question. L'expression (B.19) nous montre directement que, pour maximiser la performance par rapport au quantifieur, on doit minimiser l'intégrale de droite par rapport à $\lambda$.
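On peut vérifier numériquement (B.19) dans un cas simple — esquisse absente du texte original : paramètre de centrage d'un bruit $\mathcal N(0,1)$ ($\partial S_c/\partial y = 1$, $I_c = 1$) et quantifieur uniforme sur $[-c, c]$, pour lequel $\lambda = 1/(2c)$ sur le support :

```python
import math

def phi(y): return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)
def Phi(y): return 0.5 * (1 + math.erf(y / math.sqrt(2)))

c, NI = 5.0, 128   # quantifieur uniforme sur [-c, c] (+ deux cellules de queue)
edges = [-float("inf")] + [-c + 2 * c * i / NI for i in range(NI + 1)] + [float("inf")]

I_q = 0.0
for i in range(len(edges) - 1):
    p = Phi(edges[i + 1]) - Phi(edges[i])
    dp = (0.0 if math.isinf(edges[i]) else phi(edges[i])) \
       - (0.0 if math.isinf(edges[i + 1]) else phi(edges[i + 1]))
    if p > 0:
        I_q += dp * dp / p        # information de Fisher exacte des sorties quantifiées

perte_exacte = 1.0 - I_q          # I_c - I_q
# (B.19) : perte ~ (1/(12 NI^2)) * (2c)^2 * int f = c^2/(3 NI^2) (queues négligées)
perte_hr = c * c / (3 * NI * NI)
print(perte_exacte, perte_hr)
```

L'approximation à haute résolution coïncide avec la perte exacte à quelques pour cent près dès $N_I = 128$.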
Ce problème de minimisation peut être facilement résolu avec l'inégalité de Hölder, ce qui donne
$$\lambda^\star(y) = \frac{\left|\frac{\partial S_c(y;x)}{\partial y}\right|^{\frac{2}{3}} f^{\frac{1}{3}}(y;x)}{\displaystyle\int \left|\frac{\partial S_c(y;x)}{\partial y}\right|^{\frac{2}{3}} f^{\frac{1}{3}}(y;x)\,\mathrm dy} \ \propto\ \left|\frac{\partial S_c(y;x)}{\partial y}\right|^{\frac{2}{3}} f^{\frac{1}{3}}(y;x). \tag{B.20}$$
Notez que, contrairement aux résultats asymptotiques pour la reconstruction des mesures, où $\lambda^\star(y) \propto f^{\frac{1}{3}}(y;x)$, en quantification optimale pour l'estimation le score du problème d'estimation intervient dans la densité d'intervalles. Si l'on remplace (B.20) dans (B.19), on peut donner une réponse à la troisième question. L'expression analytique de l'approximation asymptotique de l'information de Fisher optimale est
$$I_q^\star \approx I_c - \frac{1}{12 N_I^2} \left[\int \left|\frac{\partial S_c(y;x)}{\partial y}\right|^{\frac{2}{3}} f^{\frac{1}{3}}(y;x)\,\mathrm dy\right]^3. \tag{B.21}$$

Approximation pratique des seuils optimaux : la définition de la densité d'intervalles nous dit que le pourcentage d'intervalles jusqu'à l'intervalle $q_i$, soit $\frac{i}{N_I}$, doit être égal à l'intégrale de la densité d'intervalles jusqu'à $\tau_i$. Par conséquent, une approximation pratique des seuils optimaux est donnée par
$$\tau_i^\star = F_\lambda^{-1}\!\left(\frac{i}{N_I}\right), \qquad \text{pour } i \in \{1, \cdots, N_I - 1\}, \tag{B.22}$$
où $F_\lambda^{-1}$ est l'inverse de la fonction de répartition obtenue par intégration de la densité d'intervalles $\lambda$.

Remarque sur la solution à débit variable : on pourrait aussi considérer que les sorties du quantifieur sont encodées avec des mots de taille égale à l'opposé du logarithme de leur probabilité, ce qui entraînerait une possible réduction de la taille moyenne des mots en sortie du quantifieur. Cette solution est connue sous le nom d'encodage à débit variable. La taille moyenne des mots en sortie du quantifieur avec l'encodage à débit variable est donnée par l'entropie des mots de sortie. De la même manière que précédemment, où en imposant un $N_B$ on a trouvé la densité d'intervalles optimale pour des mots de sortie de taille égale, on peut s'intéresser au problème de quantification optimale avec encodage à débit variable.
Si l’on impose un débit moyen R, on peut montrer que la densité optimale est ∂Sc (y;x) ∂y λ⋆ (y) = R ∂S (y;x) c∂y dy et l’information de Fisher maximale Iq ≈ Ic − où hy = − R 1 −2 2 12 n i o h R ∂S (y;x) R−hy − log2 c∂y f (y;x) dy , f (y; x) log2 [f (y; x)] dy est l’entropie différentielle des mesures. Le problème avec cette solution est que l’encodage des sorties du quantifieur dépend du paramètre qui est inconnu. Même si on utilise une approche adaptative pour la quantification avec une convergence vers l’encodage optimal, on ne respectera pas les contraintes de débit moyen pendant toute la phase de convergence de l’algorithme adaptatif. 290 B. Résumé détaillé en français (extended abstract in French) Application à l’estimation d’un paramètre de centrage : pour la distribution Gaussienne, la densité d’intervalles optimale et l’approximation de l’information de Fisher maximale sont données par " # √ −(2N −1) i 2 h x B 1 y−x 2 32 . (B.23) I ≈ 1 − π x q,G λG (y) = √ exp − √ , δ2 δ 3π 3δ Pour la distribution de Cauchy on a λxC (y) = 1 δB 1 5 2; 6 h h 1− 1+ y−x 2 δ y−x 2 δ i2 3 i5 , 3 x Iq,C # " 3 B 12 ; 65 1 ≈ 2 1− 2−2NB +1 . 2δ 3π (B.24) Afin de valider les résultats théoriques, l’information de Fisher (B.3) a été évaluée avec δ = 1 pour les deux distributions et pour • les seuils optimaux pour NB ∈ {1, 2, 3}. Les seuils optimaux ont été obtenus par recherche exhaustive. Pour NB ∈ {4, 5, 6, 7, 8} les résultats théoriques (B.23) et (B.24) sont utilisés comme une approximation. • la quantification uniforme pour NB ∈ {1, · · · , 8}. En plaçant le seuil central sur x, l’intervalle de quantification optimal ∆⋆ est obtenu par maximisation de l’information de Fisher. Dans ce cas aussi, le maximum est trouvé par recherche exhaustive. • l’approximation pratique des seuils optimaux donnée par (B.22), pour NB ∈ {1, · · · , 8}. Les résultats sont montrés en Tab. B.4. NB 1 2 3 4 5 6 7 8 x = 2) Gaussien (Ic,n Approx. 
Gaussien ($I_{c,n}^x = 2$) :

| $N_B$ | Optimal | Uniforme | Approx. pratique |
|---|---|---|---|
| 1 | 1.27323954† | – | 1.27323954 |
| 2 | 1.76503630† | 1.76503630 | 1.75128300 |
| 3 | 1.93090199† | 1.92837814 | 1.92740111 |
| 4 | 1.97874454⋆ | 1.97841622 | 1.98038526 |
| 5 | 1.99468613⋆ | 1.99353005 | 1.99489906 |
| 6 | 1.99867153⋆ | 1.99807736 | 1.99869886 |
| 7 | 1.99966788⋆ | 1.99943563 | 1.99967136 |
| 8 | 1.99991697⋆ | 1.99983649 | 1.99991741 |

Cauchy ($I_{c,n}^x = 0.5$) :

| $N_B$ | Optimal | Uniforme | Approx. pratique |
|---|---|---|---|
| 1 | 0.40528473† | – | 0.40528473 |
| 2 | 0.43433896† | 0.43433896 | 0.40528473 |
| 3 | 0.48474865† | 0.45600797 | 0.47893785 |
| 4 | 0.49533850⋆ | 0.48136612 | 0.49504170 |
| 5 | 0.49883463⋆ | 0.49204506 | 0.49879785 |
| 6 | 0.49970866⋆ | 0.49656712 | 0.49970408 |
| 7 | 0.49992716⋆ | 0.49851056 | 0.49992659 |
| 8 | 0.49998179⋆ | 0.49935225 | 0.49998172 |

Table B.4: Information de Fisher $I_q$ pour l'estimation d'un paramètre de centrage des distributions Gaussienne et de Cauchy. En Optimal†, se trouve l'information de Fisher maximale obtenue par recherche exhaustive des seuils optimaux ; Optimal⋆ est l'approximation asymptotique de l'information de Fisher maximale. En Uniforme, les valeurs de l'information de Fisher pour la quantification uniforme optimale sont montrées. Les colonnes Approx. pratique correspondent à l'information de Fisher obtenue avec l'approximation pratique des seuils asymptotiquement optimaux.

On constate que, dans tous les cas, $I_q$ converge rapidement vers $I_c$ quand $N_B$ augmente. Ici encore, on voit que 4 ou 5 bits sont suffisants pour obtenir une performance d'estimation proche de celle obtenue avec des mesures continues. La différence de performance entre la quantification uniforme et la quantification non uniforme semble être plus importante pour la distribution de Cauchy, mais pratiquement négligeable dans le cas Gaussien ; ceci indique que la quantification uniforme est probablement une meilleure solution en pratique (étant donnée sa simplicité d'implantation).
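Les formes fermées (B.23)–(B.24) et l'approximation pratique (B.22) se vérifient numériquement — esquisse absente du texte original, écrite dans la normalisation $\mathcal N(0,1)$ (où $I_c = 1$ et donc les valeurs relatives coïncident avec celles de la Tab. B.4 divisées par $I_{c,n}^x$) :

```python
import math

def phi(y): return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)
def Phi(y): return 0.5 * (1 + math.erf(y / math.sqrt(2)))

NB = 4
# (B.23)-(B.24) avec delta = 1
Iq_gauss = 2 * (1 - math.sqrt(3) * math.pi / 2 * 2 ** (-2 * NB))
Bc = math.gamma(0.5) * math.gamma(5 / 6) / math.gamma(4 / 3)   # B(1/2, 5/6)
Iq_cauchy = 0.5 * (1 - Bc ** 3 / (3 * math.pi) * 2 ** (-2 * NB + 1))

# (B.22) pour le cas gaussien normalisé : lambda* est la d.d.p. de N(0, 3),
# donc tau_i = sqrt(3) * Phi^{-1}(i/NI)
def Phi_inv(u):
    lo, hi = -10.0, 10.0
    for _ in range(80):           # inversion de Phi par dichotomie
        mid = (lo + hi) / 2
        if Phi(mid) < u: lo = mid
        else: hi = mid
    return (lo + hi) / 2

NI = 2 ** NB
tau = [-float("inf")] + [math.sqrt(3) * Phi_inv(i / NI) for i in range(1, NI)] \
    + [float("inf")]
Iq_prat = 0.0
for i in range(NI):
    p = Phi(tau[i + 1]) - Phi(tau[i])
    dp = (0.0 if math.isinf(tau[i]) else phi(tau[i])) \
       - (0.0 if math.isinf(tau[i + 1]) else phi(tau[i + 1]))
    Iq_prat += dp * dp / p
print(Iq_gauss, Iq_cauchy, Iq_prat)
```

On retrouve les entrées Optimal⋆ de la Tab. B.4 pour $N_B = 4$, ainsi que la valeur Approx. pratique gaussienne (à la normalisation $I_{c,n}^x = 2$ près).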
Finalement, on observe aussi que l'approximation asymptotique de l'information de Fisher et la valeur obtenue avec l'approximation pratique des seuils optimaux sont très proches, même pour des valeurs petites de $N_B$ ($N_B = 4$).

Utilisation de l'algorithme adaptatif : pour la réalisation pratique du quantifieur optimal dans l'estimation d'un paramètre de centrage, un problème important est la dépendance explicite de l'approximation pratique des seuils optimaux $\tau_i^\star$ au paramètre $x$. Une solution pour résoudre ce problème et atteindre une performance asymptotique optimale, même en ne connaissant pas $x$, consiste à utiliser l'algorithme adaptatif proposé en Sous-section B.2.3 :
$$\hat X_k = \hat X_{k-1} + \frac{1}{k\,I_q}\,\eta(i_k),$$
avec le vecteur de variations de seuil $\tau'$ donné par $\tau^\star$ en (B.22), où $x$ est considéré comme étant égal à zéro, et les coefficients $\eta(i_k)$ donnés par
$$\eta(i) = -\frac{f\!\left(\tau_i^\star; x\right) - f\!\left(\tau_{i-1}^\star; x\right)}{F\!\left(\tau_i^\star; x\right) - F\!\left(\tau_{i-1}^\star; x\right)}.$$
Si $N_B \geq 4$, pour $k$ grand, on s'attend à une performance d'estimation proche de l'optimale et, par conséquent, proche de l'approximation suivante :
$$\operatorname{Var}\left[\hat X_k\right] \approx \mathrm{BCR}_q \approx \frac{1}{k\,I_q},$$
où $I_q$ est l'approximation asymptotique (B.21). Les résultats de simulation pour des distributions de Gauss et de Cauchy avec $N_B = 4$ et 5 indiquent la validité de cette approche. Ils sont montrés en Fig. B.14.

Allocation de bits pour l'estimation d'un paramètre de centrage scalaire

On suppose maintenant que $N_s$ capteurs mesurent, avec du bruit additif et indépendant d'un capteur à l'autre, une constante $x$. En raison des contraintes de communication, la somme des nombres de bits par mesure alloués à chaque capteur, $N_{B,i}$, est contrainte à être égale à une valeur $N_B$. La question que l'on se pose est la suivante : quelle est l'allocation de bits qui maximise la performance d'estimation sous la contrainte de communication ?
Ceci équivaut de façon quantitative à résoudre le problème de maximisation suivant :
$$\begin{aligned} \underset{N_{B,i}}{\text{maximiser}} \quad & I_q = \sum_{i=1}^{N_s} I_{q,i}\left(N_{B,i}\right),\\ \text{sujet à} \quad & \sum_{i=1}^{N_s} N_{B,i} = N_B,\\ & N_{B,i} \in \mathbb N, \end{aligned}$$
où $I_{q,i}(N_{B,i})$ est l'information de Fisher maximale pour $N_{B,i}$ bits.

Figure B.14: EQM simulée pour l'algorithme adaptatif avec les seuils non uniformes asymptotiquement optimaux, comparée à la BCR$_q$. Les mesures continues sont distribuées selon la loi de Gauss (a) et de Cauchy (b). Les nombres de bits de quantification utilisés sont $N_B = 4$ et 5 ; les courbes qui ont des valeurs asymptotiques plus hautes correspondent à $N_B = 4$.

On peut trouver la solution analytique de ce problème par la comparaison de toutes les combinaisons possibles des $N_{B,i}$. Cependant, l'aspect combinatoire de cette solution la rend impossible à appliquer en pratique, même pour quelques dizaines de capteurs. Une autre solution possible consiste à utiliser les expressions asymptotiques analytiques des informations de Fisher en levant la contrainte $N_{B,i} \in \mathbb N$. La solution du problème d'optimisation sous ces nouvelles conditions peut être trouvée sous forme analytique et, si on arrondit les $N_{B,i}$ trouvés de cette manière, on obtient une approximation pratique de la solution.

Solution à $N_{B,i}$ réels : si l'on considère que les distributions de bruit ont la même forme mais des paramètres d'échelle $\delta_i$ différents, alors, en utilisant les approximations asymptotiques, on trouve
$$N_{B,i} = \frac{N_B}{N_s} - \log_2\left[\frac{\delta_i}{\left(\prod_{j=1}^{N_s} \delta_j\right)^{\frac{1}{N_s}}}\right].$$
On voit que les $N_{B,i}$ optimaux ne dépendent que des paramètres d'échelle des bruits.
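L'allocation réelle ci-dessus se calcule en quelques lignes — esquisse absente du texte original, avec des échelles $\delta_i$ et un budget $N_B$ purement illustratifs :

```python
import math

deltas = [0.25, 1.0, 4.0]   # paramètres d'échelle supposés des bruits
NB_total = 12.0             # budget total de bits (valeur illustrative)

# moyenne géométrique des delta_i, puis N_{B,i} = NB/Ns - log2(delta_i / GM)
gm = math.exp(sum(math.log(d) for d in deltas) / len(deltas))
alloc = [NB_total / len(deltas) - math.log2(d / gm) for d in deltas]
print(alloc)
```

Les capteurs les moins bruités reçoivent plus de bits, et la somme des allocations respecte exactement le budget.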
L'approximation de l'information de Fisher dans ce cas est
$$I_q \approx N_s \left[\frac{I_{c,n}^x}{HM\!\left(\delta_1^2, \cdots, \delta_{N_s}^2\right)} - \frac{2^{-2\bar N_B}\,\kappa'(f_n)}{12\, GM\!\left(\delta_1^2, \cdots, \delta_{N_s}^2\right)}\right],$$
où $I_{c,n}^x$ est l'information de Fisher pour un paramètre d'échelle unitaire, $\kappa'(f_n)$ est une fonctionnelle de la d.d.p. du bruit, elle aussi pour $\delta = 1$, $\bar N_B = \frac{N_B}{N_s}$, et $HM\!\left(\delta_1^2, \cdots, \delta_{N_s}^2\right)$ et $GM\!\left(\delta_1^2, \cdots, \delta_{N_s}^2\right)$ sont les moyennes harmonique et géométrique des paramètres d'échelle au carré. On peut démontrer que l'approximation de $I_q$ optimale ainsi obtenue est toujours plus grande que celle donnée par une allocation de bits uniforme.

Solution à $N_{B,i}$ réels et positifs : si l'on contraint les $N_{B,i}$ à être positifs, on peut démontrer que la solution optimale est obtenue en deux étapes. D'abord, on choisit un $\nu$ qui satisfait
$$\sum_{i=1}^{N_s} \left[\nu - \log_2(\delta_i)\right]_+ = N_B,$$
où $[x]_+ = \max(x, 0)$, puis on obtient les $N_{B,i}$ avec
$$N_{B,i} = \left[\nu - \log_2(\delta_i)\right]_+.$$
On peut facilement vérifier que cette solution est équivalente à la procédure de « waterfilling » qui est utilisée pour l'allocation de puissance aux sous-porteuses dans les modulations multiporteuses.

B.4 Conclusions

Dans cette thèse, nous avons traité le problème d'estimation à partir de mesures quantifiées, un problème qui attire depuis quelque temps l'attention de la communauté du traitement du signal, en raison de l'essor des réseaux de capteurs. Nous avons traité, plus spécifiquement, le problème d'estimation d'un paramètre de centrage scalaire, soit constant, soit lentement variable avec un modèle de Wiener.

B.4.1 Conclusions principales

Nous avons observé que, pour la plupart des modèles de bruit considérés en pratique, la performance d'estimation se dégrade lorsque la dynamique de quantification est loin de la vraie valeur du paramètre. Ceci indique qu'une bonne performance d'estimation peut être obtenue par une approche adaptative, où l'on place la dynamique de quantification grâce à l'information donnée par l'estimation la plus récente du paramètre.
Avec le schéma adaptatif, nous avons vu que la perte de performance d'estimation induite par la quantification est petite. Pour tous les cas testés, nous avons observé une perte de performance petite pour 1–3 bits de quantification et une perte négligeable pour 4 ou 5 bits.

Ceci indique que dans un contexte d'estimation à distance, où le nombre de bits total est contraint, il est possible qu'une solution multicapteur/basse résolution soit préférable à la solution classique monocapteur/haute résolution. Nous avons proposé des alternatives à basse complexité pour les algorithmes trouvés dans la littérature et leurs extensions. Nous avons démontré que les algorithmes à basse complexité proposés atteignent les mêmes performances asymptotiques que leurs pendants à haute complexité. En utilisant les approches à basse complexité, nous avons présenté des solutions assez naturelles pour traiter des extensions du problème de base : l'extension au cas d'un paramètre d'échelle inconnu et l'extension à plusieurs capteurs. Pour traiter le problème de placement des seuils optimaux pour l'estimation quand un grand nombre d'intervalles de quantification est utilisé, nous avons étudié une approche asymptotique. Cette approche asymptotique nous a permis d'obtenir une approximation pratique des seuils optimaux ainsi qu'une expression analytique de la performance d'estimation optimale, dans ce cas l'information de Fisher optimale. Nous avons vu aussi avec cette approche que la performance d'estimation avec des mesures quantifiées converge exponentiellement vite vers la performance avec des mesures continues quand le nombre de bits de quantification augmente.
En appliquant les résultats à un problème d'estimation de paramètre de centrage, nous avons constaté que l'approximation proposée, censée être valable seulement asymptotiquement, l'est déjà pour un petit nombre de bits (4 dans ce cas) ; ceci indique que les résultats asymptotiques peuvent être utilisés en pratique. Nous avons montré, avec l'approche asymptotique, la dépendance des seuils asymptotiquement optimaux par rapport au paramètre inconnu. Ceci indique, encore une fois, l'importance de l'approche adaptative, qui permet de placer asymptotiquement les seuils sur leurs valeurs optimales et donc d'obtenir une performance asymptotiquement optimale. Nous voudrions aussi attirer l'attention sur le fait que la différence de performance entre un schéma de quantification uniforme et un schéma non uniforme semble être petite pour l'estimation d'un paramètre de centrage. Par conséquent, en pratique, si une forte contrainte sur la complexité est présente, la quantification uniforme peut être préférable.

B.4.2 Perspectives

Cette « discussion » entre quantification et estimation sera terminée par la présentation de possibles sujets de travaux futurs.

• Paramètre vectoriel et quantification vectorielle : ce sujet est une extension naturelle du problème. Tandis que l'extension à la quantification vectorielle est assez directe (en termes d'algorithmes d'estimation et de leurs performances asymptotiques), l'extension aux paramètres vectoriels l'est moins, car elle nécessitera une nouvelle définition de la performance d'estimation et un changement de la structure des algorithmes pour prendre en compte les corrélations possibles entre les composantes vectorielles.

• Canaux bruités : un canal de communication bruité peut être intégré au problème de différentes façons, la plus simple consistant à introduire un indice binaire pour chaque mesure quantifiée et un modèle de canal binaire symétrique.
Avec un étiquetage fixe des mesures quantifiées, des extensions de l'algorithme de basse complexité proposé peuvent être directement conçues. Cependant, si les indices ne sont pas fixés, le problème qui en résulte, avec l'étiquetage des sorties du quantifieur, peut être très difficile à traiter. D'autres extensions peuvent être envisagées par l'introduction d'un canal à amplitude continue, par exemple des canaux à bruit additif et à évanouissements. Dans ce cas, on sera obligé, encore une fois, d'ajouter le problème d'étiquetage et donc de traiter un problème conjoint de conception d'encodeur/estimation.

• Estimation avec distribution de bruit inconnue : on a supposé depuis le début que la distribution du bruit est connue ; en pratique, ceci ne sera pas toujours le cas et on sera obligé de trouver d'autres approches d'estimation.

• Variations rapides : dans certaines parties de cette thèse, nous avons supposé que le signal est lentement variable. Sous cette hypothèse, nous avons vu que la perte de performance induite par la quantification est petite. On peut se poser la question de savoir si cette conclusion reste vraie lorsque le signal varie rapidement.

• Problème distribué : pour arriver aux applications classiques des réseaux de capteurs, on doit généraliser les algorithmes et résultats obtenus ici pour un capteur à un contexte partiellement distribué, où certains capteurs recueillent l'information des capteurs qui les entourent, ou complètement distribué, où tous les capteurs recueillent l'information.

• Temps continu : dans le cas d'un paramètre variable, nous avons considéré, depuis le début, que le temps est discret et nous n'avons pas traité de l'échantillonnage. Un sujet qui reste ouvert est donc l'estimation d'un signal à temps continu, échantillonné et quantifié.
[Karlsson 2005] G.R. Karlsson and F. Gustafsson. Particle Filtering for Quantized Sensor Information. In 13th European Signal Processing Conference, EUSIPCO. EURASIP, 2005. (Cited in page(s) 87.)
[Kassam 1977] S. Kassam. Optimum quantization for signal detection. Communications, IEEE Transactions on, vol. 25, no. 5, pages 479–484, 1977. (Cited in page(s) 19, 53 and 256.)
[Kay 1993] S.M. Kay. Fundamentals of statistical signal processing, volume 1: Estimation theory. PTR Prentice Hall, 1993. (Cited in page(s) 38, 39, 40, 53, 62, 70, 88, 91, 95, 227, 263 and 264.)
[Khalil 1992] H.K. Khalil and J.W. Grizzle. Nonlinear systems. Macmillan Publishing Company, New York, 1992. (Cited in page(s) 116.)
[Knuth 1997] D.E. Knuth. The art of computer programming, volume 2: Seminumerical algorithms. Addison-Wesley, 1997. (Cited in page(s) 248.)
[Kong 1994] A. Kong, J.S. Liu and W.H. Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, vol. 89, no. 425, pages 278–288, 1994. (Cited in page(s) 85.)
[Lange 1989] K.L. Lange, R.J.A. Little and J.M.G. Taylor. Robust statistical modeling using the t distribution. Journal of the American Statistical Association, pages 881–896, 1989. (Cited in page(s) 136.)
[Li 1999] J. Li, N. Chaddha and R.M. Gray. Asymptotic performance of vector quantizers with a perceptual distortion measure. Information Theory, IEEE Transactions on, vol. 45, no. 4, pages 1082–1091, 1999. (Cited in page(s) 188.)
[Li 2007] H. Li and J. Fang. Distributed adaptive quantization and estimation for wireless sensor networks. Signal Processing Letters, IEEE, vol. 14, no. 10, pages 669–672, 2007. (Cited in page(s) 60, 62, 106, 118 and 268.)
[Longo 1990] M. Longo, T.D. Lookabaugh and R.M. Gray. Quantization for decentralized hypothesis testing under communication constraints. Information Theory, IEEE Transactions on, vol. 36, no. 2, pages 241–255, 1990. (Cited in page(s) 19 and 256.)
[Luo 2005] Z.Q. Luo. Universal decentralized estimation in a bandwidth constrained sensor network. Information Theory, IEEE Transactions on, vol. 51, no. 6, pages 2210–2219, 2005. (Cited in page(s) 20 and 256.)
[Marano 2007] S. Marano, V. Matta and P. Willett. Asymptotic design of quantizers for decentralized MMSE estimation. Signal Processing, IEEE Transactions on, vol. 55, no. 11, pages 5485–5496, 2007. (Cited in page(s) 20, 42, 208 and 256.)
[Marsaglia 2000] G. Marsaglia and W.W. Tsang. A simple method for generating gamma variables. ACM Transactions on Mathematical Software (TOMS), vol. 26, no. 3, pages 363–372, 2000. (Cited in page(s) 249.)
[Molden 2007] D. Molden. Water for food, water for life: a comprehensive assessment of water management in agriculture. Earthscan/James & James, 2007. (Cited in page(s) 27.)
[Nardon 2009] M. Nardon and P. Pianca. Simulation techniques for generalized Gaussian densities. Journal of Statistical Computation and Simulation, vol. 79, no. 11, pages 1317–1329, 2009. (Cited in page(s) 249.)
[Papadopoulos 2001] H.C. Papadopoulos, G.W. Wornell and A.V. Oppenheim. Sequential signal encoding from noisy measurements using quantizers with dynamic bias control. Information Theory, IEEE Transactions on, vol. 47, no. 3, pages 978–1002, 2001. (Cited in page(s) 20, 44, 53, 60, 66, 70, 72, 105, 106, 256, 265 and 269.)
[Picinbono 1988] B. Picinbono and P. Duvaut. Optimum quantization for detection. Communications, IEEE Transactions on, vol. 36, no. 11, pages 1254–1258, 1988. (Cited in page(s) 19 and 256.)
[Poor 1977] H.V. Poor and J. Thomas. Applications of Ali-Silvey distance measures in the design of generalized quantizers for binary decision systems. Communications, IEEE Transactions on, vol. 25, no. 9, pages 893–900, 1977. (Cited in page(s) 19 and 256.)
[Poor 1988] H.V. Poor. Fine quantization in signal detection and estimation. Information Theory, IEEE Transactions on, vol. 34, no. 5, pages 960–972, 1988. (Cited in page(s) 19, 20, 178, 207, 209, 214 and 256.)
[Puccinelli 2005] D. Puccinelli and M. Haenggi. Wireless sensor networks: applications and challenges of ubiquitous sensing. Circuits and Systems Magazine, IEEE, vol. 5, no. 3, pages 19–31, 2005. (Cited in page(s) 18 and 255.)
[Rhodes 1971] I. Rhodes. A tutorial introduction to estimation and filtering. Automatic Control, IEEE Transactions on, vol. 16, no. 6, pages 688–706, 1971. (Cited in page(s) 98.)
[Ribeiro 2006a] A. Ribeiro and G.B. Giannakis. Bandwidth-constrained distributed estimation for wireless sensor networks - Part I: Gaussian case. Signal Processing, IEEE Transactions on, vol. 54, no. 3, pages 1131–1143, 2006. (Cited in page(s) 20, 44, 53, 58, 60, 63, 106, 256 and 265.)
[Ribeiro 2006b] A. Ribeiro and G.B. Giannakis. Bandwidth-constrained distributed estimation for wireless sensor networks - Part II: Unknown probability density function. Signal Processing, IEEE Transactions on, vol. 54, no. 7, pages 2784–2796, 2006. (Cited in page(s) 20, 106 and 256.)
[Ribeiro 2006c] A. Ribeiro, G.B. Giannakis and S.I. Roumeliotis. SOI-KF: Distributed Kalman filtering with low-cost communications using the sign of innovations. Signal Processing, IEEE Transactions on, vol. 54, no. 12, pages 4782–4795, 2006. (Cited in page(s) 20, 75, 94, 95, 106 and 256.)
[Robert 1999] C.P. Robert and G. Casella. Monte Carlo statistical methods. Springer New York, 1999. (Cited in page(s) 81, 82 and 245.)
[Ruan 2004] Y. Ruan, P. Willett, A. Marrs, S. Marano and F. Palmieri. Practical fusion of quantized measurements via particle filtering. In Target Tracking 2004: Algorithms and Applications, IEE, pages 13–18. IET, 2004. (Cited in page(s) 87.)
[Rubin 1988] D.B. Rubin et al. Using the SIR algorithm to simulate posterior distributions. Bayesian statistics, vol. 3, pages 395–402, 1988. (Cited in page(s) 86.)
[Samorodnitsky 1994] G. Samorodnitsky and M.S. Taqqu. Stable non-Gaussian random processes: stochastic models with infinite variance. Chapman and Hall/CRC, 1994. (Cited in page(s) 34.)
[Sigman 1999] K. Sigman. Appendix: A primer on heavy-tailed distributions. Queueing Systems, vol. 33, no. 1, pages 261–275, 1999. (Cited in page(s) 47.)
[Sukhavasi 2009a] R.T. Sukhavasi and B. Hassibi. The Kalman like particle filter: Optimal estimation with quantized innovations/measurements. arXiv:0909.0996, September 2009. (Cited in page(s) 95.)
[Sukhavasi 2009b] R.T. Sukhavasi and B. Hassibi. Particle filtering for Quantized Innovations. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 2229–2232, April 2009. (Cited in page(s) 91.)
[Tichavsky 1998] P. Tichavsky, C.H. Muravchik and A. Nehorai. Posterior Cramér-Rao bounds for discrete-time nonlinear filtering. Signal Processing, IEEE Transactions on, vol. 46, no. 5, pages 1386–1396, May 1998. (Cited in page(s) 88.)
[Tsitsiklis 1993] J.N. Tsitsiklis. Extremal properties of likelihood-ratio quantizers. Communications, IEEE Transactions on, vol. 41, no. 4, pages 550–558, 1993. (Cited in page(s) 19 and 256.)
[Van Trees 1968] H.L. Van Trees. Detection, estimation, and modulation theory, Part 1. New York: John Wiley and Sons, Inc., 1968. (Cited in page(s) 37, 78, 88 and 208.)
[Varanasi 1989] M.K. Varanasi and B. Aazhang. Parametric generalized Gaussian density estimation. The Journal of the Acoustical Society of America, vol. 86, pages 1404–1415, 1989. (Cited in page(s) 136.)
[Villard 2010] J. Villard, P. Bianchi, E. Moulines and P. Piantanida. High-rate quantization for the Neyman-Pearson detection of hidden Markov processes. In Information Theory Workshop (ITW), 2010 IEEE, pages 1–5. IEEE, 2010. (Cited in page(s) 19 and 256.)
[Villard 2011] J. Villard and P. Bianchi. High-rate vector quantization for the Neyman–Pearson detection of correlated processes. Information Theory, IEEE Transactions on, vol. 57, no. 8, pages 5387–5409, 2011. (Cited in page(s) 19 and 256.)
[Wang 2010] L.Y. Wang, G. Yin, J.F. Zhang and Y. Zhao. System identification with quantized observations. Birkhauser, 2010. (Cited in page(s) 20, 32 and 256.)
[Wasserman 2003] L. Wasserman. All of statistics: a concise course in statistical inference. Springer, 2003. (Cited in page(s) 53, 68 and 72.)
[You 2008] K. You, L. Xie, S. Sun and W. Xiao. Multiple-level quantized innovation Kalman filter. In IFAC World Congress, volume 17, pages 1420–1425, 2008. (Cited in page(s) 75, 95 and 106.)
[Zhao 2004] F. Zhao and L. Guibas. Wireless sensor networks: an information processing approach. Morgan Kaufmann, 2004. (Cited in page(s) 17 and 254.)

Abstract: With recent advances in sensing and communication technology, sensor networks have emerged as a new field in signal processing. One application of this new field is remote estimation, where sensors gather information and send it to a distant point where the estimation is carried out. To overcome the new design challenges brought by this approach (constrained energy, bandwidth and complexity), quantization of the measurements can be considered. In this context, we study the problem of estimation based on quantized measurements. We focus mainly on the scalar location parameter estimation problem, where the parameter is either constant or varies according to a slow Wiener process model. We present estimation algorithms to solve this problem and, based on performance analysis, we show the importance of quantizer range adaptiveness for obtaining optimal performance. We propose a low-complexity adaptive scheme that jointly estimates the parameter and updates the quantizer thresholds, thereby achieving asymptotically optimal performance. With only 4 or 5 bits of resolution, the asymptotically optimal performance for uniform quantization is shown to be very close to the continuous-measurement estimation performance.
Finally, we propose a high-resolution approach to obtain an approximation of the optimal nonuniform quantization thresholds for parameter estimation, and also an analytical approximation of the estimation performance based on quantized measurements.

Keywords: estimation, quantization, compression, adaptive algorithms.

Résumé : L’essor des nouvelles technologies de télécommunication et de conception des capteurs a fait apparaître un nouveau domaine du traitement du signal : les réseaux de capteurs. Une application clé de ce nouveau domaine est l’estimation à distance : les capteurs acquièrent de l’information et la transmettent à un point distant où l’estimation est faite. Pour relever les nouveaux défis apportés par cette nouvelle approche (contraintes d’énergie, de bande et de complexité), la quantification des mesures est une solution. Ce contexte nous amène à étudier l’estimation à partir de mesures quantifiées. Nous nous concentrons principalement sur le problème d’estimation d’un paramètre de centrage scalaire. Le paramètre est considéré soit constant, soit variable dans le temps et modélisé par un processus de Wiener lent. Nous présentons des algorithmes d’estimation pour résoudre ce problème et, en nous basant sur l’analyse de performance, nous montrons l’importance de l’adaptativité de la dynamique de quantification pour l’obtention d’une performance optimale. Nous proposons un schéma adaptatif de faible complexité qui, conjointement, estime le paramètre et met à jour les seuils du quantifieur. L’estimateur atteint de cette façon la performance asymptotique optimale. Avec 4 ou 5 bits de résolution, nous montrons que la performance optimale pour la quantification uniforme est très proche des performances d’estimation à partir de mesures continues. Finalement, nous proposons une approche à haute résolution pour obtenir les seuils de quantification non-uniformes optimaux ainsi qu’une approximation analytique des performances d’estimation.
Mots clés : estimation, quantification, compression, algorithmes adaptatifs.

GIPSA-lab, 11 rue des Mathématiques, Grenoble Campus BP 46, F-38402 Saint Martin d’Hères CEDEX
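The adaptive scheme summarized in the abstract (joint parameter estimation and quantizer threshold update) can be illustrated with a minimal sketch. The code below is not the exact algorithm of the thesis: the decreasing 1/k gain, the quantization step, the number of bits and the Gaussian signal model are illustrative assumptions. A uniform quantizer is re-centered on the running estimate, so only the quantization index of each innovation would need to be transmitted.

```python
import numpy as np

def adaptive_quantized_estimator(x, n_bits=4, step=0.05):
    """Illustrative sketch: estimate a scalar location parameter from
    uniformly quantized innovations, re-centering the quantizer range
    on the running estimate (gains and step size are arbitrary)."""
    n_levels = 2 ** n_bits
    half = n_levels // 2
    # output levels of a mid-rise uniform quantizer centered at 0
    levels = step * (np.arange(n_levels) - half + 0.5)
    theta = 0.0
    estimates = np.empty(len(x))
    for k, xk in enumerate(x, start=1):
        # the sensor would only transmit this index (n_bits per sample)
        idx = int(np.clip(np.floor((xk - theta) / step) + half,
                          0, n_levels - 1))
        # stochastic-approximation update with decreasing gain 1/k;
        # moving theta also re-centers the quantizer for the next sample
        theta += levels[idx] / k
        estimates[k - 1] = theta
    return estimates

# constant location parameter (true value 1.0) observed in Gaussian noise
rng = np.random.default_rng(0)
x = 1.0 + 0.2 * rng.standard_normal(5000)
est = adaptive_quantized_estimator(x)
print(f"final estimate: {est[-1]:.3f}")
```

With 4 bits the final estimate lands close to the true value 1.0, matching the abstract's qualitative claim that a few bits of adaptively centered uniform quantization lose little relative to continuous measurements; the key ingredient is that the quantizer range follows the estimate rather than staying fixed.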
