Information Theory for Risk-based Water System Operation

Steven Weijs

Thesis, to obtain the degree of doctor at Delft University of Technology, by authority of the Rector Magnificus, prof. ir. K.C.A.M. Luyben, chairman of the Board for Doctorates, to be defended in public on Friday 1 April 2011 at 12:30 by Steven Vincent WEIJS, civil engineer, born in Groningen.

This thesis has been approved by the promotor: Prof. dr. ir. N. C. van de Giesen

Composition of the doctoral committee:
Rector Magnificus, chairman
Prof. dr. ir. N. C. van de Giesen, Technische Universiteit Delft, promotor
Prof. Dr. rer.nat. Dr.-Ing. A. Bárdossy, Universität Stuttgart
Prof. dr. ir. D. Koutsoyiannis, National Technical University of Athens
Prof. dr. ir. H. H. G. Savenije, Technische Universiteit Delft
Prof. dr. D. P. Solomatine, UNESCO-IHE
Dr. ir. P. J. A. T. M. van Overloop, Technische Universiteit Delft
Dr. ir. F. Pianosi, Politecnico di Milano

This research was performed at the section Water Resources Management, faculty of Civil Engineering & Geosciences of TU Delft, and has been financially supported by Delft Cluster.

Keywords: operational water management, control, optimization, information theory, probabilistic forecasts, risk.

Cover: a medal design by Gottfried Wilhelm Leibniz, included in a letter in 1697 to Rudolph August, Duke of Brunswick, celebrating his discovery of the binary system: "for all to spring from nothing, a oneness suffices". The version used is the one by Rudolf Nolte from 1734. The second, turbulent figure is a fractal from the Mandelbrot set.

Copyright © 2011 by S.V. Weijs
Published by VSSD, Delft, The Netherlands
ISBN 978-90-6562-264-8
All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopy, recording or any information storage and retrieval system, without written permission of the publisher.

This thesis was written using LyX and LaTeX. Printed in The Netherlands.
See www.hydroinfotheory.net for more background information and updates.

To my parents

In a predestinate world, decision would be illusory; in a world of perfect foreknowledge, empty; in a world without natural order, powerless. Our intuitive attitude to life implies non-illusory, non-empty, non-powerless decision. ... Since decision in this sense excludes both perfect foresight and anarchy in nature, it must be defined as choice in face of bounded uncertainty.
– (George Shackle, Decision Order and Time in Human Affairs, 1961)

Preface

Information is a quantity, measurable in bits. Any piece of information can be expressed in zeros and ones, and any calculation can be performed by a simple computer reading only zeros and ones. Gottfried Wilhelm Leibniz was justifiably excited when in 1697 he discovered the binary system and designed the medal for the Duke of Brunswick, depicted on the front cover, to celebrate his discovery:

"For all to spring from nothing, a oneness suffices"

While the research leading to this thesis started as a study of water system operation under uncertainty, it soon became an investigation into the role of information in that process. Because information and uncertainty are quantities measured in the same unit, the topic essentially remained the same, but with a more positive sound to it.
I admit that information theory in a broad sense has now come close to being my pet theory of the universe, which might of course bias my view on its importance. This is also visible in the contents of this thesis, whose central four chapters focus on the application of information theory.

The law that entropy always increases holds, I think, the supreme position among the laws of Nature. If someone points out to you that your pet theory of the universe is in disagreement with Maxwell's equations - then so much the worse for Maxwell's equations. If it is found to be contradicted by observation - well, these experimentalists do bungle things sometimes. But if your theory is found to be against the second law of thermodynamics I can give you no hope; there is nothing for it but to collapse in deepest humiliation.
- Arthur Stanley Eddington, The Nature of the Physical World (1928)

In support of my pet theory, note that the second law of thermodynamics, celebrated in the above quote, can be interpreted information-theoretically as a loss of information into microscopic degrees of freedom that we usually cannot observe or control. Thermodynamics can therefore be seen as a specific application of information theory. The second law also defines the direction of time, which inevitably runs out before all questions are answered. New unanswered questions are also an inevitable by-product of answers, making curiosity an addiction. This thesis therefore ends with an ellipsis rather than a period, because I do not think its submission will cure my addiction. I would also like to apologize for the unanswered questions that might feed the curiosity of its readers.

Summary

Risk-based water system operation can be formulated as a problem of rational decision making with incomplete information, which can be approached using the interlinked fields of probability, decision and information theory. This thesis presents a perspective from information theory, by focusing on selected issues for which this theory can help in understanding how information flows, or should flow, from observation to decision.

Water system operation is the task of finding a sequence of decisions in time to influence a water system in order to optimally benefit from it and minimize the risks, using real-time information. An example is the operation of a hydropower reservoir, where the daily releases should maximize the power production benefits, but balance this against flood risks downstream by offering sufficient flood storage, given the forecasted inflows and the present state. Uncertainty in forecasts plays an important role in the operation of water systems and negatively affects the performance of their operation. Conversely, information reduces uncertainty and therefore has a value for operation. Information theory offers a rigorous framework to study information and uncertainty as quantities, but its application is not widespread in water resources management.

In this thesis, the risk-based operation of water systems is studied, with a specific focus on the role of information in this process. Because probabilistic forecasts are often the interface between observations, model outcomes and decisions, their evaluation is a key component of studying the flow of information. The information-theoretical perspective results in a number of new practical techniques for the evaluation of forecasts, merging information from different sources, and using the information in sequential decision processes.
The overall aim of this thesis is to develop methods for optimal risk-based water system operation. During the research, Shannon's information theory was found to be a central framework to study risk and decision making in the context of modeling and water system operation procedures. This results in the following main research questions:

1. How does information play a role in optimal operation of water systems?
2. How can information be exploited optimally in decisions?

The nature of the research questions requires a theoretical approach to the problem. This thesis combines fundamental results from the interlinked fields of information theory, control theory and decision theory and applies them to problems in water resources management, while maintaining an integrated view on the nature of information.

The outcomes of the research can be subdivided into two levels. Firstly, at the conceptual level, the research tries to make a contribution to the debate about uncertainty analysis philosophy and methods, which are important topics in the current hydrological literature. Secondly, at the applied level, the thesis presents a number of practical methods and recommendations.

The conceptual contributions result from taking an information-theoretical perspective on a number of open problems in hydrology and water resources management, notably the evaluation of probabilistic forecasts and the calibration of models. It is shown why forecasting should be seen as a communication process, where information is transferred from the forecaster to the user in order to reduce the uncertainty of the latter. The quality of such forecasts can be measured using the expected Kullback-Leibler divergence from the observations to the forecasts. This "divergence score" can be interpreted as the remaining uncertainty of the user, after he has received the forecast. A new mathematical decomposition of the divergence score is found that can be interpreted as "the remaining uncertainty is equal to the initial uncertainty, minus the correct information, plus the wrong information." The correct information is bounded by the information in the predictors, while the wrong information must be minimized in the model calibration process. The decomposition is also used to demonstrate why deterministic forecasts are theoretically unacceptable and should be replaced by probabilistic forecasts.

Furthermore, the roles of parsimony and purpose of models are analyzed in the context of model calibration. Algorithmic information theory is used to shed light on the connection between science and data compression, and the link between the amount of information and model complexity is discussed in this context. It is argued that the purpose of a model should play no role in the choice of calibration objective, because it can lead both to a reduction in the information that reaches the model from the data and to the model learning from information that does not exist.

When information in probabilistic predictions is used to support sequential decision processes, such as finding the sequence of optimal releases from a reservoir, the complex time-dynamics of information become important. In that case, the information that will become available in the future has an influence on the current optimal decision. Although the information itself is not available for the current decision, it is possible to take into account the fact that information will become available and will benefit future decisions.
Because doing this explicitly is computationally intractable, approximate solutions are necessary that estimate the future value of water, which must be balanced against the benefits of immediate use.

The more practical contributions of this thesis include: a method to generate long lead time weighted ensemble streamflow forecasts, based on information from El Niño; a score for verification of probabilistic forecasts that rewards maximum information extraction from predictors; guidelines for optimization horizons in controller design for sequential decision processes; and a method to study the increase in value of water due to more informative predictions.

Samenvatting

Risk-based operational water management can be formulated as a search for rational decisions given incomplete information. The interlinked disciplines of probability theory, decision theory and information theory can be used for this. This thesis aims to offer an information-theoretical perspective, by highlighting a number of problems for which information theory can contribute to understanding and improving the flow of information from observation to decision.

Operational water management involves finding, using real-time information, a sequence of decisions that influence a water system in such a way that maximum benefit is obtained from it, while risks are kept as small as possible. An example is the operation of pumping stations in polder systems, where a sufficiently high water level must be maintained for agriculture among other uses, but inundation must be prevented by pumping in time, taking into account the current state of the system and anticipating any forecast rainfall. Uncertainties in forecasts have an important influence on operational water management and reduce the controllability of water systems. Conversely, information reduces uncertainty and thereby acquires a value for water management. Information theory offers a mathematical framework for information and uncertainty as measurable quantities, but has so far seen little application in water management.

In this thesis, the risk-based operation of water systems is investigated, with emphasis on the role that information plays in this process. Because probabilistic forecasts often form the link between observations, model outcomes and decisions, their evaluation plays a key role in studying the flows of information. The information-theoretical view results in new methods for evaluating forecasts, merging information from different sources, and using information in sequential decision processes.

The underlying aim of this research is to develop methods for optimal risk-based water management. During the research, Shannon's information theory proved to be a central framework for studying risks and decision problems in the context of modeling and operational water management. This leads to the following research questions:

1. What role does information play in optimal operational management of water systems?
2. How can information be exploited optimally in decisions?

The nature of the research questions calls for a theoretical research approach.
This thesis combines a number of fundamental results from information theory, control theory and decision theory and applies these to problems in water management, while the nature of information remains the frame of reference and common thread throughout.

The results can be distinguished at two levels of abstraction. At the first, conceptual level, this research contributes to the discussion about methods and philosophy of uncertainty analysis of models. This discussion plays a prominent role in the current hydrological literature. At the second, applied level, this thesis presents a number of practically applicable methods and recommendations.

The conceptual contributions follow from an information-theoretical view on a number of open problems in hydrology and water management, notably the evaluation of probabilistic forecasts and the calibration of models. It is shown why forecasting must be regarded as a communication process, in which information is transferred from the forecaster to the user in order to reduce the latter's uncertainty. The quality of such forecasts can be measured with the expected value of the Kullback-Leibler divergence from the observations to the forecasts. This so-called "divergence score" can be interpreted as the remaining uncertainty of the user, after the forecast has been received. A new mathematical decomposition of the divergence score is presented, which can be interpreted in words as "the remaining uncertainty is equal to the initial uncertainty, minus the correct information, plus the wrong information." The correct information is limited to the information present in the data used for the forecast, while the wrong information must be minimized in the model calibration. The decomposition is also used to show why deterministic forecasts are theoretically unacceptable and must be replaced by probabilistic forecasts.

Furthermore, the roles of parsimony and of the purpose of a model are examined in the context of model calibration. Algorithmic information theory is used to clarify the analogy between science and data compression, and in this context the relation between the amount of information in data and the complexity of models is considered. It is argued that the purpose of a model should in principle play no role in the calibration, because this can lead to a loss of information from the data, or, conversely, to learning information that does not exist.

When the information from probabilistic forecasts is used to support sequential decision processes such as reservoir operation, the complex time-dynamics of information come into play. In that case, the information that will become available in the future influences the optimal value of the current decision. Although this information itself is not yet available for that decision, it is possible to take into account the fact that information becoming available in the future will improve future decisions. Because it is computationally infeasible to do this explicitly, approximations are needed to estimate the future value of water, so that it can be weighed against the benefits of immediate use.
The more practically oriented contributions of this thesis include: a method to produce weighted ensemble forecasts at the seasonal time scale based on information about "El Niño"; a score for the evaluation of probabilistic forecasts that rewards maximizing the extracted information; design guidelines for optimization horizons of controllers for sequential decision processes; and a method to estimate the increase in the value of water resulting from more informative forecasts.

Contents

Preface
Summary
Samenvatting
Contents
List of Figures
List of Tables

1 Introduction
1.1 Why operate water systems?
1.2 Uncertainty, rationality and risk in decision making
1.2.1 Uncertainty
1.2.2 Rationality
1.2.3 Risk
1.2.4 Risk-based water system operation is rational
1.3 Information reduces uncertainty
1.4 The value of information
1.5 Objectives and research questions
1.5.1 The broader perspective
1.5.2 Open questions
1.5.3 Research objectives
1.5.4 Research questions
1.6 Thesis outline

2 Risk-based water system operation
2.1 Introduction
2.1.1 Water system operation as a mathematical optimization problem
2.1.2 Formulation of a control problem
2.1.3 Solution techniques
2.2 An example: a lowland drainage system
2.2.1 Delfland
2.2.2 The Delfland decision support system
2.2.3 Objective of control
2.3 Off-line optimization of operation
2.3.1 Example: the "regelton"
2.3.2 Disadvantages compared to online optimization
2.4 Online optimization of operation by model predictive control
2.5 Uncertainties affecting the Delfland system
2.5.1 Uncertainties
2.5.2 Data driven inflow model
2.6 Relevant time horizons for uncertainty and optimization
2.6.1 Time horizons relevant for prediction and control
2.6.2 The time horizons for the Delfland system
2.6.3 Results
2.7 Certainty equivalence
2.8 Multiple model predictive control (MMPC)
2.9 The problem of dependence on future decisions
2.10 Summary and Conclusions
3 Uncertainty or missing information defined as entropy
3.1 Uncertainty and probability
3.2 The uncertainty of dice: Entropy
3.3 Side information on a die: Conditional Entropy
3.3.1 Conditioning, on average, reduces entropy
3.3.2 Conditional entropy
3.4 Mutual Information and Relative Entropy
3.5 Rolling dice against a fair and ill-informed bookmaker
3.6 Interpretation in terms of surprise
3.6.1 Surprise and meaning in a message
3.6.2 Can information be wrong?
3.7 Laws and Applications
3.7.1 Information cannot be produced from nothing
3.7.2 Information never hurts
3.7.3 Given the information, uncertainty should be maximized
3.7.4 Applications of information theory
3.8 Relation to thermodynamics
3.9 Applications of information theory in water resources research

4 Adding seasonal forecast information by weighting ensemble forecasts
4.1 Introduction
4.1.1 Use of ensembles in water resources
4.1.2 Weighted ensembles
4.1.3 Previous work on adding forecast information to climatic ensembles by weighting
4.2 Information, Assumptions and Entropy
4.2.1 The principle of maximum entropy
4.3 The Minimum Relative Entropy Update
4.3.1 Rationale of the method
4.3.2 Formulation of the method
4.4 Theoretical test case on a smooth sample and comparison to existing methods
4.4.1 Results in a theoretical test case
4.4.2 Discussion
4.4.3 Conclusions from the theoretical test case
4.5 Multivariate case
4.6 Application to ESP forecasts
4.6.1 Seasonal forecast model
4.6.2 Results
4.6.3 Discussion
4.6.4 Conclusion
4.7 Conclusions and recommendations

5 Using information theory to measure forecast quality
5.1 Introduction
5.2 Definition of the divergence score
5.2.1 Background
5.2.2 Definitions
5.2.3 The divergence score
5.2.4 Decomposition
5.2.5 Relation to Brier score and its components
5.2.6 Normalization to a skill score
5.3 Relation to existing information-theoretical scores
5.3.1 Relation to the ranked mutual information skill scores
5.3.2 Equivalence to the Ignorance score
5.3.3 Relation to information gain
5.4 Generalization to multi-category forecasts
5.4.1 Nominal category forecasts
5.4.2 Ordinal category forecasts
5.4.3 The Ranked divergence score
5.4.4 Relation to Ranked Mutual Information
5.4.5 Information and useful information
5.5 An example: rainfall forecasts in the Netherlands
5.6 Generalization to uncertain observations
5.6.1 Introduction
5.6.2 Decomposition of the divergence score for uncertain observations
5.6.3 Expected remaining uncertainty about the truth: the cross-entropy score
5.6.4 Example application
5.6.5 Discussion: divergence vs. cross-entropy
5.7 Deterministic forecasts cannot be evaluated
5.7.1 Deterministic forecasts are implicitly probabilistic (information interpretation)
5.7.2 Deterministic forecasts can still have value for decisions (utility interpretation)
5.8 Conclusions

6 Some thoughts on modeling, information and data compression
6.1 Introduction
6.1.1 The principle of parsimony
6.1.2 Formalizing parsimony
6.2 Algorithmic information theory, complexity and probability
6.2.1 The Bayesian perspective
6.2.2 Universal computability
6.2.3 Kolmogorov complexity, patterns and randomness
6.2.4 Algorithmic probability and Solomonoff induction
6.2.5 Computable approximations to automated science
6.3 The divergence score: prediction, gambling and data compression
6.3.1 Dependency
6.4 A practical test: "Zipping" hydrological time series
6.4.1 Data and Methods
6.4.2 Results
6.4.3 Compressing with hydrological models
6.4.4 Discussion and conclusions
6.5 Prediction versus understanding
6.5.1 Hydrological models approximate emergent behavior
6.5.2 Science and explanation as data compression?
6.5.3 What is understanding?
6.6 Modeling for decisions: understanding versus utility
6.6.1 Information versus utility as calibration objective
6.6.2 Locality and philosophy of science: knowledge from observation
6.6.3 Utility as a data filter
6.6.4 Practical example
6.7 Conclusions and recommendations for modeling practice

7 Stochastic dynamic programming to discover relations between information, time and value of water
7.1 Introduction
7.2 Stochastic dynamic programming (SDP)
7.2.1 Formulation of a typical hydropower reservoir optimization problem using SDP
7.2.2 Computational burden versus information loss
7.3 Example case description
7.3.1 Simulation and re-optimization
7.4 Optimization and the value of water
7.4.1 Water value in the example problem
7.5 Interdependence of steady state solution and real-time control
7.6 How predictable is the inflow? - entropy rate and the Markov-property
7.7 The influence of information on the marginal value of water
7.8 Sharing additional benefits of real-time information between stakeholders
7.9 Reinforcement Learning to approximate value functions
7.10 Conclusions and recommendations

8 Conclusions and recommendations
8.1 Conclusions at the conceptual level
8.1.1 The nature of information
8.1.2 The flow of information
8.1.3 The value of information
8.1.4 The necessity of probabilistic predictions
8.1.5 Information theory as philosophy of science
8.2 Methodological contributions and recommendations for practice
8.2.1 On risk based water system operation
8.2.2 On weighted ensemble forecasts
8.2.3 On the evaluation of probabilistic forecasts
8.2.4 On performance measures for model inference
8.2.5 On optimization of reservoir release policies
8.3 Limitations and recommendations for further research
8.3.1 Going from discrete to continuous
8.3.2 Expressing prior information
8.3.3 Merging information theory with statistical thermodynamics
8.3.4 An integrated information-theoretical framework from observation to decision
8.3.5 Problem solved, but the solution is a problem

References

A Equivalence between MRE-update and pdf-ratio solutions for the normal case
B The decomposition of the divergence score
C Relation divergence score and doubling rate in a horse race

Acknowledgements
About the author
Publications

List of Figures

1.1 Control increases flexibility of the Q-h relation of a weir
1.2 Information is the reduction in uncertainty
2.1 A schematic representation of a "polder-boezem" system
2.2 Variability of water levels within the Delfland boezem canals
2.3 Setpoints and simulated water levels for water butt optimized off-line
2.4 Schematic representation of model predictive control (MPC) on a real system
2.5 Rainfall-runoff model found by linear system identification
2.6 Increasing uncertainty in inflow modeled as second order Markov chain
2.7 Two methods for comparison of observed and forecast rainfall
2.8 Time-decreasing skill of Delfland rainfall forecasts
2.9 MPC simulations for different prediction horizons with perfect forecasts
2.10 MPC simulation with real imperfect forecasts
2.11 MPC performance against prediction horizon with perfect and actual forecasts
2.12 Schematic representation of MMPC
2.13 Two MMPC formulations with different assumptions on future information availability
3.1 Expected number of questions to know the outcome of a die
3.2 Observing the dice can yield a further conditioning model
3.3 Venn diagrams depicting the relations between information measures
4.1 Various climatic and conditional ensembles related to ESP forecasting
4.2 Equally weighted ensemble members can represent a nonuniform density; this density can be changed by shifting or by weighting the ensemble traces
4.3 Resulting ensemble weights for µ1 = 3, σ1 = 0.5 and resulting empirical CDF
4.4 Resulting ensemble weights for µ1 = 4, σ1 = 0.5 and resulting empirical CDF
4.5 Resulting ensemble weights for µ1 = 3, σ1 = 1.2 and resulting empirical CDF
4.6 Resulting ensemble weights for µ1 = 4, σ1 = 1.2 and resulting empirical CDF
4.7 Pareto-front showing trade-off between lost information and lost uncertainty
4.8 Weights and CDF for MRE-update with skewness constraint
4.9 Bubble plots of weights for MRE-update in the multivariate case
4.10 Bivariate kernel density estimate of the joint distribution of the ENSO index (November-February) and the average streamflow (April-August)
4.11 An example of the kernel density estimate used in the pdf-ratio method for the year 2003 (hindcast mode)
4.12 Ensemble weights for 2003, plotted against the average streamflow of each trace from April to September (hindcast mode)
4.13 Forecast CDFs for the different weighting methods
4.14 The RPSS for the weighted ensemble forecasts by the different weighting methods for the whole forecast period (hindcast mode)
4.15 The RPSS for the weighted ensemble forecasts by the different weighting methods for the period from 1970 (forecast mode, starting from 20 traces)
5.1 The components of the divergence score as additive bars
5.2 Comparison of the uncertainty component in Brier and divergence scores
5.3 Comparison of the resolution component in Brier and divergence scores
5.4 Comparison of the reliability component in Brier and divergence scores
5.5 Skill against lead-time for Dutch probability of precipitation forecasts
5.6 Sensitivity of the components for rounding of the forecast probabilities
5.7 The relation between Brier and divergence scores is not monotonic
5.8 The additive component bars for the case of uncertainty in observations
5.9 Uncertain binary observation for Gaussian measurement error
5.10 Decomposition for uncertain observations applied to KNMI data
6.1 Temporal dependence reduces the uncertainty of a guess
6.2 The effect of quantization on the hydrological signal
6.3 The effect of lossy compression on the signal; it introduces a noise that is not significantly more than the quantization noise
6.4 Compression vs. signal to noise ratio for JPG on hydrological data
6.5 Local vs. non-local scores: DS vs. (C)RPS
6.6 Schematic representation of how a model learns from data; information reaches the model through three routes, and can be filtered through a utility function
6.7 Model behavior in validation of utility-calibrated versus information-calibrated parameters
7.1 Schematic illustration of the value-to-go calculation in SDP
7.2 The storage volume-area-head relation for the reservoir and the storage discretization using the Savarenskiy scheme
7.3 The discretization of the flow in 5 equiprobable classes for each month
7.4 Reservoir behavior (level, releases, spills, power) for the simulated policy
7.5 Immediate water value and opportunity cost as a function of the release
7.6 Water values from the re-optimization for the different months of the year
7.7 The reservoir behavior as a function of the month of the year
7.8 Marginal values of water as a function of storage, inflow and month
7.9 Mutual information due to temporal dependence in the inflow process
7.10 Shift in Pareto front and negotiated solution due to real-time operations
7.11 The agent-environment interaction in reinforcement learning
List of Tables

3.1 Joint distributions expressing side information for two models of a die
4.1 Overview of the methods and types of forecasts that are compared in chapter 4
4.2 Resulting tercile probabilities for the four methods compared
4.3 Resulting mean and standard deviation for the various methods
4.4 Resulting relative entropy for the different methods
4.5 The resulting skill score for the different methods
6.1 Analogy science ⇔ data compression and physical systems ⇔ computation
6.2 Optimal code lengths proportional to minus the log of events' probability
6.3 Compression performance for well-known compression algorithms on various time series
6.4 Information-theoretical and variance statistics and compression results for rainfall-runoff modeling
6.5 Validation and calibration results for information and utility objectives
7.1 Characteristics of the hydropower reservoir toy model

Chapter 1
Introduction

"Only a fool tests the depth of the water with both feet." - African proverb

Water systems are all around us: our ancestors emerged from one; we drink from them; we feed from them; we fear to be submerged in them; we ship goods over them; we store heat in them; we generate power from them; we drain them; we replenish them; we rinse in them; we pollute them; we try to harness them and preserve them for our children, who, like us, are water systems themselves.

Although water, which plays such an important role in our lives and life in general, lends itself perfectly to long-winded and poetic descriptions, this thesis will try to solidify some of its fluidity and will look at water mostly in dry and mathematical terms. This by no means indicates an under-appreciation of the beauty that water enables and possesses. On the contrary, appreciation of complex systems and their many interrelated subsystems that all have their functions is only enhanced by understanding them. Understanding amounts to linking observations of quantities in the water system by describing them in a compact and precise manner that enables making sensible predictions. The system view is useful because it helps to formalize the thinking about behavior, interaction with the surroundings, interactions with us, boundaries, modeling, division into subsystems, and inputs and outputs to the system. This facilitates making predictions, which are of paramount importance in both science and engineering. In science, predictions are the only way to test our theories, and in engineering they enable decisions that improve our quality of life.

This thesis is specifically focused on water systems that can be influenced by humans and that are interesting for us to influence. Examples are artificial lakes, rivers, aquifers, polder systems and irrigation systems. All these systems have in common that we can exert an influence on them, for example through dams, dikes, wells, pumps and weirs. The construction and installation of these structures helps to harness these systems and make them behave in a way that is beneficial to mankind.
Once in place, the functionality of these structures can be enhanced by operating them in an optimal manner according to our objectives.

Figure 1.1: The relation between discharge and water level is more flexible if a controllable weir is used. The gray area shows the possible combinations for a controllable weir, while the black line gives the combinations attainable with a fixed weir.

1.1 Why operate water systems?

In contrast to structural measures, such as the construction of new infrastructure like dams, weirs and pumps, operational measures do not aim to alter the water system permanently. Instead, they use existing infrastructure to exert a temporary influence on the system's behavior to cope with the specific situation at hand. Switching on a pumping station in a polder or releasing water from a hydropower reservoir are examples of operational measures. As with many classifications, the distinction is not clear-cut and depends on the time scale of reference.

Often, water system operation is needed, in addition to structural measures, to achieve a good and cost-effective performance of the water system. This need arises from the fact that, apart from our influence, the water system is influenced by external forcings, such as the weather, which typically vary in time. Therefore, the best response will often also be time-varying. Although it is possible to achieve a time-varying response by structural measures (a weir, for example, "reacts" to a higher water level by letting more water pass), operation provides much more flexibility (extra water can be released by lowering the weir when needed, even when the water level is still low). The key difference is that operational measures can vary in time, depending on objectives that change in time and on information beyond the local and present situation. The processing of this information into an action or decision is extremely flexible. See for example figure 1.1, where the relation between discharge and upstream water level is plotted for a free-flowing weir. The downstream water level is considered to be low enough not to influence the flow. The equation that describes the flow is

Q = k (h1 − h2)^{3/2}    (1.1)

where Q is the flow over the weir, k is a constant and h1 and h2 are the water levels indicated in figure 1.1. The range of possible flows for a given water level, or the range of possible water levels for a given flow, is far larger for a controllable weir.

Structural and operational measures complement each other. A well-designed dam, built at the right location, enhances the possibilities to benefit from the water system, but only if it is thoughtfully operated. Conversely, the room for operation is very small if the structures through which the water system can be influenced do not have sufficient capacity. Therefore, optimal decisions about operational measures have to take into account the constraints posed by the infrastructure. Furthermore, the effects of future operation need to be considered already at the stage where structures are planned and designed. Because of this interdependency, decisions on structural and operational measures interact. This thesis is purely concerned with decisions on operational measures, referred to as water system operation. The design of the structures and the range of possible actions to control the water system will be considered as given and not as part of the optimization problem.
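As a concrete illustration of the flexibility argument around equation (1.1), the following minimal sketch compares the discharges attainable with a fixed and with a controllable weir for the same upstream level. The coefficient k and the crest levels are illustrative assumptions, not values from this thesis.

# Minimal sketch (illustrative numbers): a fixed weir ties discharge to the
# upstream level, while a controllable weir can realize a range of discharges
# for the same upstream level by adjusting its crest level h2.

def weir_flow(h1, h2, k=1.7):
    """Free-flowing weir discharge Q = k * (h1 - h2)**1.5, zero if h1 <= h2."""
    return k * max(h1 - h2, 0.0) ** 1.5

h1 = 1.20                            # upstream water level [m], assumed
q_fixed = weir_flow(h1, h2=1.00)     # fixed crest at 1.00 m: one possible flow

# Controllable crest anywhere between 0.80 m and h1: a whole range of flows
crest_levels = [0.80 + 0.01 * i for i in range(41)]
q_controllable = [weir_flow(h1, h2) for h2 in crest_levels]

print(f"fixed weir:        Q = {q_fixed:.3f} m3/s")
print(f"controllable weir: Q between {min(q_controllable):.3f} "
      f"and {max(q_controllable):.3f} m3/s")

The printed range corresponds to the gray area in figure 1.1 along a single vertical line (one upstream water level), whereas the fixed weir gives only the single point on the black curve.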
The focus will be on water system operation under uncertainty and the role information plays in that process. Later in this thesis, the close link between information and uncertainty will be further analyzed. The next sections give a short introduction to the related concepts of information, uncertainty and risk in the context of finding optimal decisions, such as optimal actions for operating the structures to influence a controlled water system.

1.2 Uncertainty, rationality and risk in decision making

1.2.1 Uncertainty

Almost every decision in real life is influenced by uncertainty. In some cases this uncertainty is more obvious than in others. A famous example of a decision under uncertainty is the decision to take an umbrella or not, based on the weather forecast. In case it is forecast that it will rain with certainty, the obvious decision is to take an umbrella. In another case, when the sky is blue and the air pressure is high, it would be a better decision not to carry the useless umbrella around (assuming pure water management and no fashion objectives). In many cases, however, the decision maker does not know with certainty whether it will rain. In such a case, his decision depends on how much he values remaining dry, how annoying carrying the umbrella is, and his estimate of the probability that it will rain.

An example in which the uncertainty is less obvious is the decision of a Ph.D. student to travel to university by bicycle. Suppose it is simply the best mode of transportation according to this Ph.D. student's preferences. No uncertainty seems to affect his choice. However, if he knew beforehand that he was to get a flat tire, which is always possible, he would prefer to go on foot. Again, the optimal decision depends on the outcome of an uncertain event. The apparent absence of uncertainty stems from the fact that the probability of a flat tire is very small compared to the threshold probability above which the student would prefer walking. A large amount of new information is therefore necessary to convince the student to leave his bicycle at home.

In water resources management, decision makers are often faced with decisions under uncertainty. An example is the daily release of water at a hydropower dam, given the incomplete information about future inflow into the reservoir. If the future inflow into the lake is high, the optimal current release will also be high, because future spills are likely to occur if the water is not used now and the capacity of the generators is limited. If the future inflows are low, on the other hand, a better decision would be to keep the reservoir filled, so power is produced at a higher reservoir level, yielding more power per unit of water volume. The decision maker does not know the future inflow and the best he can do is to use all information available at the time of the decision to choose the release that he expects to be best. Because the water manager is not able to influence the inflow, it makes little sense to judge these decisions in hindsight, based upon information on the actual inflow that turned out to occur.

Uncertainty, although on some occasions more than on others, plays a role in all decisions, adversely affecting their quality. There are two things we can do about that. In the first place, we try to reduce uncertainty by obtaining information. This information can result in understanding and predictions that remove some of the initial uncertainty.
In the second place, we have to deal with the remaining uncertainty by making rational decisions.

1.2.2 Rationality

Even if we are excellent decision makers, the presence of uncertainty at the time of decision means that actions are not always optimal in hindsight, once more information has become available. Therefore, it is important to make a distinction between decisions that are right in hindsight and decisions that are right given the information available at the time the decision was made. The latter are often referred to as rational decisions. By definition, rational decisions maximize the expected fulfillment of the decision maker's objective from the perspective of the information available to him at the time. This thesis focuses on rational decisions in the context of water system operation. To limit the scope, the objective of the operator is assumed to have been correctly formalized and to reflect the objectives of society. Questioning the objectives is outside the scope of this thesis.

1.2.3 Risk

Risk is usually associated with adverse uncertain events, such as floods and nuclear accidents. However, uncertainty with respect to positive events can also be formulated in terms of risk by considering the failure to attain a potential benefit as an adverse event. Irrigation risk, for example, can be defined in terms of the probabilities of the water supply falling short of what is needed to produce the maximum agricultural yield. In this risk, both the probability and the magnitude of the reduction in crop yield play a role. Generally, risk is defined as a function of probabilities and consequences. A widely applied quantification of this definition is the engineering definition of risk:

Risk = probability of an event × consequences of that event.

In this thesis, this definition of risk is used. Because probability is dimensionless, the risk is expressed in the same unit as the consequences. Often these are expressed in monetary terms, rendering them readily applicable in cost-benefit analyses. In the decision whether or not to build a dike, for example, construction and maintenance costs are weighed against the benefit of decreased flood risk. This can be a very challenging task, given the fact that both the costs (e.g. ecological damage, destruction of the landscape) and the benefits (e.g. saving human lives, preventing emotional shock) are hard to express in monetary terms.

In addition to monetary values, a more general measure of costs and benefits within the framework of decision theory is the concept of utility. Utility cannot be defined directly, but is measured through the preferences of a decision maker. If a decision maker prefers A over B, then A is supposed to have a higher utility than B for that decision maker. At first sight, there seems to be a circular definition in the concept of utility, since it is used to define the rational choice but is also defined in terms of choices (preferences). However, when applied to a set of decisions, the concept can be used to check whether a decision maker's preferences are rational (i.e. self-consistent) and to make him conscious of his preferences. Once the utility function of a decision maker has been captured in a quantitative objective, the best way to maximize utility is to make rational decisions that take into account all available information. When the information is incomplete, this entails taking into account uncertainty.
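To make the risk definition and the notion of a rational decision concrete, the sketch below works out the umbrella example of section 1.2.1 as a maximization of expected utility. The utilities and the rain probability are illustrative assumptions, not values from this thesis.

# Minimal sketch (illustrative numbers): choosing the action that maximizes
# expected utility, i.e. minimizes risk, for the umbrella decision.

# Utility of each (action, weather) outcome; carrying an umbrella is a small
# nuisance, getting wet is a large loss.
utility = {
    ("take umbrella", "rain"): -1.0,   # dry, but carried an umbrella
    ("take umbrella", "dry"):  -1.0,   # carried it for nothing
    ("no umbrella",   "rain"): -10.0,  # soaked
    ("no umbrella",   "dry"):   0.0,   # best possible outcome
}

def expected_utility(action, p_rain):
    return p_rain * utility[(action, "rain")] + (1 - p_rain) * utility[(action, "dry")]

p_rain = 0.3  # forecast probability of rain (assumed)
actions = ["take umbrella", "no umbrella"]
best = max(actions, key=lambda a: expected_utility(a, p_rain))
print(best, {a: round(expected_utility(a, p_rain), 2) for a in actions})
# With these numbers the rational choice flips to taking the umbrella once
# p_rain exceeds 1/10, the ratio of the nuisance to the loss of getting wet.

The same structure carries over to the hydropower release decision: the weather outcomes become inflow scenarios, the actions become releases, and the utilities become power revenues and flood damages.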
1.2.4 Risk-based water system operation is rational

One way to cope with uncertainty is to always be on the safe side. A decision can for instance be based on a worst case scenario, no matter how small its probability. An example from the context of structural water management measures is the construction of dikes. One could argue dikes should be designed to withstand the "maximum possible flood", because flooding leads to damage and loss of lives. However, flood protection also comes at a cost. Not only the monetary cost of dikes, but also the damage to environment and landscape has to be considered. These costs have to be balanced by the benefits of the increased flood protection. A worried citizen could of course argue that the loss of human lives can never be outweighed by such costs, but it is easy to point out that resources spent on dikes cannot be spent on, for example, healthcare. The potential loss of lives as a result of cutting funds there makes it clear that there is no evident "safe side" in this decision problem.

Risk-based decision making is rational decision making, given that a rational agent values an uncertain outcome as a linear function of its probability. If a decision problem is accurately formulated in terms of objectives (utilities), possible decisions and available information, the decision that minimizes risk is the best one. Other results from decision theory prescribe that preferences for decisions must be transitive and complete (Peterson, 2009). Transitivity means that preferences cannot be circular: if A is better than B and B is better than C, C cannot be better than A. Completeness means that different options are always comparable: a decision must always be made, even if the decision is to postpone action until more information is available.

If for two rational decision makers the objectives and available information are equal, the decisions will also be equal and conditionally optimal in terms of expected utility. In practice, this situation never occurs, for three reasons. Firstly, prior information is often implicit and never the same for two persons. Secondly, the objectives are often not entirely the same and partly implicit (hidden agendas). Thirdly, human decision makers are only partly rational. Specifically, rational decision making is sometimes avoided to prevent explicit moral choices. For example, in flood protection, the risks involve a combination of monetary value and human lives. A clear statement of the objectives together with a systematic treatment of information and uncertainties would prevent circular preferences and irrational choices. However, such an explicit formulation of objectives can only be done on the basis of moral choices. Because these are difficult to agree on, the objectives are often kept partly implicit. Some less harmful and very entertaining examples of predictably irrational behavior can be found in Ariely (2008).

1.3 Information reduces uncertainty

Figure 1.2: Information is the reduction in uncertainty

Uncertainty in a decision problem can be associated with incomplete information. The less information is available about the outcome of some event, the more uncertain we are about it, and this will be reflected in uncertainty about the best decision. Within the framework of information theory, which will be used extensively in this thesis and will be introduced in chapter 3, uncertainty is defined as missing information and information is defined as a reduction in uncertainty.
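A small numerical sketch of this definition, anticipating chapter 3: uncertainty can be expressed as entropy in bits, and the information provided by a forecast is the resulting drop in entropy. The probabilities below are illustrative assumptions, not values from this thesis.

# Minimal sketch (illustrative probabilities): uncertainty as Shannon entropy
# in bits, and information as the reduction of that entropy.
from math import log2

def entropy(p):
    """Shannon entropy in bits of a discrete probability distribution p."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

# Climatological uncertainty about "rain tomorrow": P(rain) = 0.5
h_prior = entropy([0.5, 0.5])        # 1.0 bit of missing information

# After receiving a forecast, the decision maker's probability becomes 0.9
h_posterior = entropy([0.9, 0.1])    # about 0.47 bits remain

information_gained = h_prior - h_posterior
print(f"prior uncertainty:     {h_prior:.2f} bits")
print(f"remaining uncertainty: {h_posterior:.2f} bits")
print(f"information received:  {information_gained:.2f} bits")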
This relation is schematically represented in figure 1.2. In the information-theoretical framework, uncertainty is defined as a unique measurable quantity defined in terms of probabilities. The exact definitions of these concepts will be introduced in chapter 3 and some extensions to these definitions will be presented in chapter 5. For now, it must be stressed that uncertainty in this definition influences the risk and value of decisions, but is not equivalent to them. Uncertainty depends only on how much is known or believed about some event, but not on what that event is. To put it differently: there is a difference between uncertainty and risk. Imagine a coin toss that is used for deciding who will get coffee to fuel the research in some university office. Now imagine that same coin toss executed by some morbid dictator who uses it to make a decision on starting a nuclear war for fun. In both cases the uncertainty about the outcome of the toss is the same, but the risk associated with that outcome is far higher in the latter case. The reason that some people might perceive the second coin toss to be associated with more uncertainty is the higher uncertainty about the future state of the world as a whole. The uncertainty about the outcome of the coin toss itself, however, remains the same.

1.4 The value of information

Intuitively, most people would agree that more information leads to better decisions. A water system that is better known and monitored is easier to control. Uncertainty adversely affects the quality of decisions and information reduces this negative influence by resolving uncertainty. For the person making decisions, or the groups whose objectives he represents, information therefore has a value. The value of information can be defined as the expected value of the outcome given a decision using that information, minus the expected value of the outcome with a decision based on the prior information. An example is the increase in hydropower revenue due to improved inflow predictions. According to Hamlet et al. (2002), including information about the El Niño Southern Oscillation into a more flexible operating strategy of the Columbia river hydropower system would lead to an increase in expected annual revenue of $153 million in comparison with the status quo. Generally, inflow predictions are important for hydropower production, and better forecasts or better ways to make decisions based on these forecasts are an important topic in water resources research. Summarizing, it can be observed that information acquires value through its use in decisions.

1.5 Objectives and research questions

1.5.1 The broader perspective

The overall objective of water resources management is to make water systems behave in a way that is beneficial from a human perspective. Of course this is easier said than done. First of all, we need to define beneficial, and this leads to a number of questions that are progressively more difficult to answer: In what ways can we benefit from the water system? Who should benefit? Should we benefit now or later? Should we benefit at the cost of reducing benefits for future generations? Is there a reason to preserve ecosystems or the earth beyond the human benefits? Is survival of the human race the ultimate objective? Why are we here in the first place? Are we actually here? The preceding series of questions makes it clear that it is in the end impossible to satisfactorily define beneficial.
However, just as it is possible to live life without knowing the 8 Introduction ultimate purpose, it is possible to manage water resources without an all-encompassing definition of beneficial. We just do what we think is best, considering our incomplete information and simplified, derived objectives. Information is beneficial to water management. Firstly, the degree of achievement of a derived objective can never be decreased by obtaining information about the workings of the system, provided the information is true and processed in a correct fashion. Secondly, more information about the water system helps to better define the objectives and possibilities to control the system. In other words, it helps posing the right optimization problem. Often, defining the problem and objectives correctly involves participation of stakeholders and negotiation (Soncini-Sessa et al., 2007). This process helps to derive the immediate objectives from more abstract underlying objectives. Ultimately, however, the underlying objectives reduce to moral statements and information cannot tell us what we should or should not strive for. In fact, it could very well be that all purpose is just an emergent behavior of our evolution. This thesis deals with the role and benefits of information in the first sense. It is supposed that the objective has been defined and that it is clear how the water system can be influenced. The question then remains how to convert the information about the system and its surroundings into a decision that will optimize the objectives. Some tools that facilitate this process are developed within this research, often using simplified representations of the real decision problem. The main results are therefore general methodologies rather than solutions to specific real-world problems. Notwithstanding the somewhat theoretical character, the application of these methodologies can be part of finding such solutions. 1.5.2 Open questions Water system operation continuously deals with decisions under uncertainty. This uncertainty is reduced by information from forecasts, which are based on models, which are based on observations. The remaining uncertainty must be taken into account in decisions. Because new decisions for the operation of water systems are taken continuously, there is a complex interaction between the current decision, future decisions and new information that becomes available in between subsequent decisions. The flow of information appears to be a key component in the whole process of going from observations to decisions. In water system operation, this process is usually not viewed from the perspective of information being a quantity that flows through this process. Yet, such a perspective could provide new insights in these complex relations. Studying how to maximize the information that infiltrates our models and how it percolates into our decisions might be as important as studying the flow of water itself. 1.5.3 Research objectives The objective of this thesis is to develop a framework to study the role of information in the context of water system operation. Because uncertainty plays an important role 1.6. Thesis outline 9 in most water systems, it is important that the framework takes into account both the information that is available and the remaining uncertainty. 
Studying information in risk-based water system operation requires describing the interrelations between objectives, decisions, uncertainty, measurements, models, forecasts, probability, value or utility, time and information. Apart from offering another view on existing methodologies, a self-consistent framework can also serve to reflect on them and suggest improvements. Therefore, the search for the framework is expected to yield several concrete and practical recommendations about existing methodologies and possibly some alternatives to these methodologies. Ideally, these methodologies should be defensible from a self-consistent framework.

1.5.4 Research questions

The main questions that need to be addressed in the framework that is to be developed in this research are:

• How does information play a role in optimal operation of water systems?
• How can this information be exploited optimally?

To answer these questions, both water system operation and the nature of information have to be investigated. A formal framework for quantifying information can be used to study its flow through the processing of observations into decisions. Because decisions are usually not just based on raw observations, but also on predictions made by models, both predictions and models are expected to play an important role in optimal risk-based decisions. Ultimately, the information that enters the decisions should improve their quality and therefore possesses some value. An important question is how this value can be maximized, i.e. how to exploit information optimally. Special attention will be given to probabilistic forecasts, as they form summaries of available information, but also represent the remaining uncertainty. Providing the right information for risk-based decisions might be seen as the task of producing a probabilistic forecast that is in some sense optimal. Having optimal forecasts requires that the information processing from observation to decision, which includes the employed models, is optimized. The evaluation of probabilistic forecasts is therefore an important part of the problem that will be studied in this thesis, because it allows some diagnosis of the information flow.

1.6 Thesis outline

This thesis focuses on several aspects of information in its course from observations to decisions. In chapter 2, the context of risk-based water system operation is explored using a case study. A framework for relevant time horizons that describes how information about the future can affect current decisions is presented. It is also shown why uncertainty needs to be taken into account to arrive at optimal decisions. Shannon's formal theory of information and uncertainty is introduced in chapter 3, along with some additional interpretations that will facilitate the understanding of the methods in the next chapters. This chapter also gives a short overview of some of the other applications of information theory within water resources management and hydrology. Chapter 4 presents the "minimum relative entropy update" (MRE-update), a method to add seasonal forecast information to an existing ensemble of historical time series by attaching weights to them. The information-theoretical foundation of the method ensures that not more information is added than is present in the forecast, but also not less. The information perspective is also used to analyze what happens in the existing methods that this method aims to complement or replace.
Probabilistic forecasts, such as resulting from the MRE-update, are sometimes claimed to be problematic in terms of their evaluation and acceptance by decision makers. In chapter 5, a radically opposite view is presented, where it is claimed that probabilistic forecasts are the only forecasts that actually contain information in a strict sense. In this pivotal chapter, an information-theoretical analogy of a well-known decomposition of an existing score for the quality of probabilistic forecasts (the Brier score) is found. In light of this result, forecasting can be seen as a communication process in which information is transferred to the user to reduce his uncertainty. The presented framework for the evaluation of forecasts defines the relations between climatic uncertainty, correct information, wrong information and remaining uncertainty in this context and also explicitly distinguishes between useful information and pure information. The evaluation of predictions, as treated in chapter 5, is of critical importance to the process of inference of models and therefore to both science and engineering. The quality of a model is determined by the quality of its predictions of unseen data. Chapter 6, building on the insights from the previous chapter, presents information as a central concept in the philosophy of science. Some thoughts on the link between the principle of parsimony, data compression and algorithmic information theory are presented. As an example application, common data compression algorithms are used to “ZIP” hydrological time series to estimate their information content. Some philosophical implications of the deep theories of algorithmic information theory, which sees science as data compression, are also discussed. Chapter 7 returns to the narrower scope of information in the context of optimal water system operation. Stochastic dynamic programming is used to study the value of water when being allocated under uncertain conditions. It is shown why the complex dynamics of information play a role in both the value of water and in optimal decisions. A possible way forward is presented in the form of a more empirical approach to estimate optimal decisions under these complex information-dynamics. Chapter 8 summarizes the conclusions on both the conceptual and the applied level and proposes a few potentially interesting future research avenues. Chapter 2 Risk-based water system operation “The policy of being too cautious is the greatest risk of all.” - Jawaharlal Nehru 2.1 Introduction This chapter1 presents the formulation of an example risk based water system operation problem. Furthermore, a framework is introduced that describes how uncertainties about future events influence the current decision. The typically Dutch water system of the Delfland storage canals is used as an illustration of these concepts and the practice of water system operation in general. For optimization of water system operation, an online and an off-line approach is distinguished and illustrated. For the online approach, the Model Predictive Control (MPC) formulation for the Delfland system is analyzed. The background about the problem formulation and solution techniques is presented in sections 2.2-2.4, while section 2.5 concerns uncertainties affecting the water system. Uncertainties influence the operation of the Delfland system in various ways. The uncertainties are present in the measurements, models, and predictions. The uncertainties in the predictions have an important time-dimension. 
On the one hand, uncertainties tend to grow with increasing lead time. On the other hand, the importance of future events for the present decisions decreases with lead time. Section 2.6 presents a framework for time horizons relating these effects and draws some conclusions for controller design. The time horizons are determined empirically for a simplified schematization of the Delfland system.

1. based on:
– Weijs, S.V., van Leeuwen, P.E.R.M., van Overloop, P.J., van de Giesen, N.C. Effect of uncertainties on the real time operation of a lowland water system in The Netherlands. IAHS publication 313, IUGG Perugia, 2007
– Weijs, S.V. Information content of weather predictions for flood-control in a Dutch lowland water system. 4th International Symposium on Flood Defense: Managing Flood Risk, Reliability and Vulnerability, Toronto, Ontario, Canada, 2008
– Overloop, P.J. van, Weijs, S.V., Dijkstra, S.J. Multiple Model Predictive Control on a drainage system, Control Engineering Practice, 16:531-540, 2008

Using the concept of certainty equivalence (section 2.7), it is found that uncertainties in the predictions need to be considered explicitly in the decision making process to make optimal risk-based decisions. A multiple model extension for MPC is proposed for this task (section 2.8). A further analysis of the Multiple Model Predictive Control methodology (section 2.9) reveals that future decisions influence the current decision. Therefore, also the information that will be available for future decisions is important for the current decision. Two MMPC formulations are presented that represent the two extreme assumptions of no new information and perfect new information. More realistic representations are dealt with in chapter 7, analyzing stochastic dynamic programming (SDP) formulations of this problem.

2.1.1 Water system operation as a mathematical optimization problem

When water systems are operated, usually some objective is involved, depending on the functions of the water system and the preferences of the operator and the organization he works for. A reasonable requirement for operation is that it strives to optimally meet these objectives, given the constraints posed by the water system and other requirements (e.g. legal norms). Examples of objectives are maximal profit, minimal danger, and keeping the water level within predefined limits. This last objective can also be formulated as a constraint, which can be combined with other objectives, like minimizing pumping costs. Generally, objectives and constraints are interchangeable mathematically, although in our logical perception, it would be awkward to formulate adherence to physical law as an objective instead of a constraint. Optimal water system operation can therefore be regarded as the optimization problem of finding the choice for all actions within our control that optimizes the objective while satisfying the constraints. Such a problem is here referred to as the control problem.

2.1.2 Formulation of a control problem

Solving a control problem requires defining a system, controls, objectives, constraints, measurements and disturbances. The system to be controlled is usually formulated in discrete time (Eq. 2.1):

x_{t+1} = f(x_t, u_t, d_t)    (2.1)

where the vector x_t is the state of the system, u_t the vector of control actions and d_t the vector of disturbances at time t. The function f describes the behavior of the system under the influence of the controls u_t and the disturbances d_t.
This can be a nonlinear function, but in control theory the system is often assumed to be linear for computational reasons, leading to the state-space formulation

x_{t+1} = A x_t + B_u u_t + B_d d_t    (2.2)

in which matrix A describes the autonomous behavior of the system, B_u the influence of the controlled inputs, and B_d the influence of the known disturbances on the system.

The controls u_t on a system are often limited by physical or other constraints. For example, if u_t represents pump flows, the capacity may be limited by a vector of maximum pump capacities u_{t,max}. There may also be constraints on the states. If state i represents the total volume of water stored in a reservoir, for example, that state cannot become negative, leading to the constraint [x_t]_i ≥ 0, where [x_t]_i is element i of the state vector.

Apart from satisfying the constraints, water systems are usually controlled with additional objectives, which can be expressed in quantities that need to be maximized or minimized. Identifying these quantities is not always straightforward, as they reflect the definition of desirable behavior, which inevitably leads to questions about whose desires are most important. The objective function can be interpreted as reflecting the utility of a particular system behavior and can therefore be formulated and analyzed using economic theory. The objective can then be satisfied by optimizing

J = g(x_t, u_t, d_t)    (2.3)

where g is the objective function and J the objective function value. This function usually accounts not only for immediate benefits, but needs to take into account future behavior of the system as well². Therefore, the objective function that actually should be optimized is

J = g(x_{t..T}, u_{t..T}, d_{t..T})    (2.4)

where T is the end of the time horizon of interest. For example, when a reservoir is used for hydropower production, the largest immediate benefits are obtained by using the turbines at full capacity. This, however, may lead to a rapid depletion of the reservoir, which is not in the interest of overall long term benefits. Keeping the reservoir levels higher results in higher power yields per unit of water that flows through the system. In case the function g for the benefits can be disaggregated in time, the objective function in Eq. 2.4 can be written as a sum of the benefits in individual timesteps

J = \sum_{t=1}^{T} g(x_t, u_t, d_t)    (2.5)

which allows the use of certain special solution techniques for finding the optimum of J (see subsection 2.1.3).

Measurements from the system can be a function of the state, decision and disturbance

y_t = h(x_t, u_t, d_t) = C x_t + D_u u_t + D_d d_t  (for a linear system)    (2.6)

which can mean that not all states are directly observable. All information a controller can receive about the system is in the measurements. This can pose important challenges for the design of a control system. This chapter deals mostly with states that are directly observable.

The above state-space formulation of a control problem is just one of the possibilities. In other formulations, only the objective is considered and all other equations, including the model, are seen as constraints.

². Note that the most basic objective that emerges from our evolution, reproduction, also needs to optimize survival over a future time period in order to optimize reproduction probability. Optimizing over a finite time horizon is therefore very natural.
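To make the state-space notation of Eq. 2.2 concrete, the following minimal sketch propagates a single storage state under a pump control and an inflow disturbance. The surface area is loosely based on the boezem dimensions quoted later in this chapter; all other numbers are purely illustrative and do not describe any specific system in this thesis.

```python
import numpy as np

# Minimal sketch of the discrete-time state-space model of Eq. 2.2 for a single
# storage element: the state is the water level deviation [m], the control is the
# pump flow [m3/s] and the disturbance is the uncontrolled inflow [m3/s].
dt = 900.0                       # timestep [s]
area = 7.3e6                     # storage surface area [m2]
A = np.array([[1.0]])            # the level integrates the net flow
B_u = np.array([[-dt / area]])   # pumping lowers the level
B_d = np.array([[dt / area]])    # inflow raises the level

def step(x, u, d):
    """One transition x_{t+1} = A x_t + B_u u_t + B_d d_t."""
    return A @ x + B_u @ u + B_d @ d

x = np.array([0.0])              # start at the target level
for t in range(4):
    d = np.array([80.0])         # storm inflow [m3/s]
    u = np.array([60.0])         # pump flow, within an assumed 70 m3/s capacity constraint
    x = step(x, u, d)
    print(f"t = {t + 1}: level deviation = {100 * x[0]:.2f} cm")
```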
Ultimately, every optimization problem can in theory be reduced to an objective and a number of decision variables. 2.1.3 Solution techniques When the control problem has been formulated, there are several techniques to find the optimal solution. Some of these techniques find analytical solutions and others approach the solution numerically. Several techniques make assumptions or require the problem to be cast into a predefined form that does not exactly match the original problem. In this section, the solution techniques that are used in this thesis will be briefly introduced. Global optimization algorithms When optimizing an objective function by manipulating two decisions that are real numbers (e.g. adjustable weir flows), the problem can be visualized as finding the highest point in a mountain landscape. One option to find such a point is to test each point in the landscape and compare it with the highest point so far. This strategy is referred to as exhaustive optimization or brute force optimization. Especially when there are many decision variables to optimize, this strategy becomes infeasible due to the high computational cost, e.g. exponentially increasing time of computation. Global optimization algorithms reduce this computational burden significantly, by using efficient strategies to sample the search space, using results from previous samples to identify promising solutions. These algorithms are often biologically inspired, because in living organisms, populations and ecosystems, optimality often emerges from simple rules. Examples are ant colony optimization (Dorigo and Stützle, 2004), particle swarm optimization (Vesterstrøm and Thomsen, 2004) and evolutionary algorithms like differential evolution (Storn and Price, 1997) and its self-adapting variant (Brest et al., 2006). Some of these algorithms have been combined in meta-algorithms like AMALGAM (Vrugt and Robinson, 2007) that let several algorithms run in parallel and adapt their relative importance based on their performance. Apart from applications to model parameter estimation (see for example Duan et al. (1992); Shoemaker et al. (2007)) and design problems (e.g. Dandy et al. (1996); Savic and Walters (1997); Abebe and Solomatine (1998); Solomatine (1999)), evolutionary algorithms have also been applied in control problems in for example Rauch and Harremoës (1999); Merabtene et al. (2002); Huang and Hsieh (2010); Koutsoyiannis and Economou (2003). The global optimization algorithms are sometimes also referred to as black box optimization algorithms, because they do not need any information about the problem they 2.1. Introduction 15 optimize. They are especially valuable for problems that are difficult to solve (e.g. highdimensional search space, many local optima), because the more efficient class of gradient based search algorithms fail to find the global optimum in those cases. Any structure that might exist in the search space is left for the algorithm to discover. This makes global optimization algorithms less competitive on problems with a clear structure, where thinking carefully during the problem formulation might reveal a problem that is far easier to solve. However, given the low cost of computer power compared to brain power (even Ph.D. students’), solving such problems using global optimization might still be the most economical solution. Dynamic Programming (DP) Dynamic Programming is a solution technique that makes use of the structure in a particular class of problems. 
One type of problem in this class is the sequential decision process, where decisions have to be taken repeatedly, each decision influences the state of the world for the next decision, and the objectives can be disaggregated in time (as in Eq. 2.5). While the original problem has many decision variables (e.g. a hydropower reservoir release for each timestep in the planning horizon) that have to be solved simultaneously, DP allows splitting the problem into a series of simpler problems that can be solved one by one. This is achieved by disaggregating the problem in time, making use of the Bellman principle of optimality (Bellman, 1952), which states that "An optimal policy has the property that whatever the initial state and initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decisions". This leads to the Bellman equation

H_t(x_t) = \min_{u_t} \{ g_t(x_t, u_t, d_t) + H_{t+1}(x_{t+1}) \}    (2.7)

in which H_t(x_t) is the optimal cost-to-go function in timestep t, evaluated at state x_t, and g_t is the step cost associated with the transition from x_t to x_{t+1}. The disturbance d_t is assumed to be deterministically known in this case. Equation 2.7 can be solved recursively, going backwards in time. At each stage, only the step cost g_t and the cost-to-go at the next timestep, H_{t+1} (calculated in the previous iteration), need to be considered. For certain functional forms of H_t(x_t) and g_t, the equations can be solved analytically to yield an optimal policy u_t = f(x_t), but for most problems the solution has to be found numerically and discretization of the state vector x_t is required. Discretization makes H_t(x_t) a look-up table, of which all values have to be calculated one by one. The policy f then also becomes a look-up table consisting of the optimal control actions for each state. Finding the table is referred to as off-line optimization (see section 2.3), although it is also possible to re-optimize the policy online. When the number of states in x_t increases, the computational cost grows exponentially. This is referred to as "the curse of dimensionality". Yakowitz (1982) provides a review of applications of dynamic programming to water resources planning, which date back to Hall and Buras (1961), and of the application in the form of Stochastic Dynamic Programming (SDP), which is able to deal with uncertainties and has now largely replaced deterministic DP. In chapter 7, SDP will be applied to a reservoir operation problem and analyzed in terms of information flows and the value of water.

Model Predictive Control

Model Predictive Control (MPC) is an approximate solution of the Bellman equations for a finite time horizon (Bertsekas, 2005), usually for the case where the model is linear and the objective functions are quadratic. It has the advantage that no discretization of the states is necessary and it does not suffer from the curse of dimensionality. Instead of finding the optimal action for each possible state, the problem is solved "online" over a receding horizon, calculating a sequence of control actions u_{t..t+h} that is optimal for the initial state x_t and the predicted sequence of disturbances d_{t..t+h}. For each timestep the optimization is repeated and the first control action u_t is executed.
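Before continuing with MPC, a minimal numerical sketch may help make the backward recursion of Eq. 2.7 concrete. The toy reservoir below, with discretized storage, a known inflow sequence and a quadratic penalty on deviations from a target release, is purely illustrative; it is not one of the case studies in this thesis.

```python
import numpy as np

# Minimal sketch of the backward recursion of Eq. 2.7 for a toy reservoir.
# Storage and release are discretized; the disturbance (inflow) is assumed known,
# as in the deterministic DP described above. All numbers are illustrative.
storage = np.arange(0, 11)          # discretized storage states [units]
release = np.arange(0, 5)           # discretized release decisions [units/step]
inflow = [3, 1, 4, 2]               # known inflow sequence
demand = 3                          # target release [units/step]

T = len(inflow)
H = np.zeros((T + 1, storage.size))           # cost-to-go, zero beyond the horizon
policy = np.zeros((T, storage.size), dtype=int)

for t in range(T - 1, -1, -1):                # backward in time
    for i, s in enumerate(storage):
        best = np.inf
        for u in release:
            s_next = s + inflow[t] - u
            if s_next < 0 or s_next > storage[-1]:
                continue                      # infeasible transition (constraint)
            step_cost = (u - demand) ** 2     # quadratic penalty on deficit/excess
            cost = step_cost + H[t + 1, s_next]
            if cost < best:
                best, policy[t, i] = cost, u
        H[t, i] = best

print("optimal first release from a half-full reservoir:", policy[0, 5])
```

Because every combination of state, decision and timestep is visited, the cost of this table-filling grows exponentially with the number of state variables, which is exactly the curse of dimensionality mentioned above.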
For MPC, the objective function that is optimized in each timestep is

\min_{u_{t..t+h}} J = \sum_{i=0}^{h} \left[ x^T(t+i|t) Q x(t+i|t) + u^T(t+i|t) R u(t+i|t) \right]    (2.8)

where Q and R are the weight matrices for the penalty on states and control actions, respectively, and i is a counter for the timesteps in the optimization horizon of length h. The solution of this quadratic objective function by MPC can explicitly take into account the linear model (Eq. 2.2) and possible constraints on the states x_{t..T} and control actions u_{t..T}. Model predictive control has been applied in various studies of optimal water system operation, for systems ranging from irrigation channels to the centralized water management of the main water systems in the Netherlands (van Overloop et al., 2010a,b) or emergency flood areas (Barjas Blanco et al., 2010). The Ph.D. thesis by van Overloop (2006) presents a number of applications and further references. Recent advances include the application of MPC to the control of both water quality and quantity simultaneously (see for example Xu et al. (2010)). The next section describes the practical details of a real world water system control problem, where MPC has been implemented in a decision support system (DSS) that is used in daily practice.

2.2 An example: a lowland drainage system

The western part of The Netherlands is mainly situated below sea level. Most of the water systems there are divided into hydrological units called "polders", in which a certain water level is maintained, depending on the land use and functions assigned to the area (Schuurmans et al., 2002; Lobbrecht et al., 1999; van Andel et al., 2010). For the drainage of these areas, the responsible water boards mainly depend on pumping stations. The evacuation of the drainage water usually takes place in two stages, as depicted in Fig. 2.1. The water is first pumped to the storage belt ("boezem"), a system of canals and lakes connecting the pumping stations of different polders to large pumping stations at the coast or large rivers. Here the water is further elevated in a second step. The boezem serves both for transport and for storage of drainage water. The water level is usually higher than the surface level in most of the surrounding polders, but lower than the water level in the surrounding outer waters. These typically Dutch water systems are referred to as "polder-boezem" systems. In this chapter, the water system of the Delfland area is used as an example.

Figure 2.1: A schematic representation of the "polder-boezem" systems that characterize the western part of The Netherlands. The "polder land" drains onto the main "boezem" canals through pumping stations, limiting the peak flows. The higher lying "boezem land" can cause higher peak flows into the system. The water is eventually pumped into the sea (see bottom).

2.2.1 Delfland

Delfland is the water board responsible for the south-western part of the province of South Holland. With 1.4 million inhabitants on 410 km2, the Delfland region is one of the most densely populated areas in The Netherlands. The area is characterized by a high concentration of economic value. The water system has important functions for drainage, water supply for agriculture, navigation and recreation. The eastern part consists mainly of polder areas while the western part also has more elevated areas, draining into the boezem canals directly. Greenhouses cover a large area, resulting in a very fast runoff process.
The total area of the boezem canal system is 7.3 km2 (about 1.8 percent of the total area). The water level in these canals is usually kept close to the target level of -0.42 m below mean sea level. The tolerable range for extreme conditions is between -0.60 m and -0.30 m below sea level. If large amounts of precipitation are expected, the water level is temporarily lowered. The total capacity of the polder pumping stations discharging on the canals is around 50 m3/s, but due to the fast runoff from the higher lying areas, the total inflow can easily exceed 100 m3/s during heavy rainstorms. The main pumping stations discharge the water from the boezem canals to the North Sea and the artificially canalized tidal river "Nieuwe Waterweg", connecting the port of Rotterdam to the sea. The total capacity of the main pumping stations depends on the tide of the outside water and can vary between 50 and 70 m3/s. However, in practice this capacity cannot be reached, because of limitations on the transport capacity of the canal system. High pump flows should be avoided where possible to avoid problems with high flow velocities and steep water level gradients.

Importance of anticipation of extreme events

As can be seen from the quantities mentioned above, inflows to the boezem can exceed the maximum outflow capacity by 50 m3/s. This can last several hours and in extreme cases lead to space-averaged water level rises of 20 cm. Because of the limited channel capacity, local water levels can even rise up to 30 cm in a few hours. To accommodate this water level rise, both the lower and the upper margin need to be used. This means that the pumps must lower the water level to -0.60 m before the start of the event, so that the maximum level of -0.30 m will not be exceeded. To be able to anticipate these events, it is necessary to use inflow predictions up to several hours ahead. Altogether, management of the Delfland water system is a challenging task. The increase in greenhouse area and the apparent regional climate trend towards more frequent extreme events have led to considerable damage over the last 15 years. The water board is currently executing a large program of structural measures to increase storage and discharge capacity in the system. Apart from the structural measures, a new decision support system is being tested to improve operational management.

2.2.2 The Delfland decision support system

Traditionally, all pumping stations were managed by operators of the water board, based on visual observation of water level gauges. Over the years, this process has become increasingly automated. Nowadays, most of the polder pumping stations switch on and off automatically, based on local water level measurements in the polders. The main pumping stations of the boezem are operated centrally, based on water level measurements, information about the situation in the polders, and expected meteorological conditions. Until recently, this was done purely relying on the judgment of the operators. Now, a new Decision Support System (DSS), built by the engineering consultant "Nelen & Schuurmans", aids the operators in controlling the water levels in the boezem by advising flows for the main pumps. When results are satisfactory, it is also possible to switch the system to fully automated mode, in which the main pumping stations are directly controlled by the DSS.
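As a rough consistency check on the anticipation requirement quantified above, the few lines below (a sketch, assuming the excess inflow acts uniformly on the full boezem surface area) reproduce the order of magnitude of the water level rise and the time needed to consume the available margin.

```python
# Back-of-the-envelope check of the anticipation requirement described above:
# an excess inflow of 50 m3/s spread over the 7.3 km2 boezem surface.
excess_inflow = 50.0     # inflow minus pump capacity [m3/s]
area = 7.3e6             # boezem surface area [m2]
rise_per_hour = excess_inflow * 3600.0 / area
print(f"space-averaged level rise: {100 * rise_per_hour:.1f} cm/h")          # about 2.5 cm/h
print(f"hours to use the 30 cm margin (-0.60 to -0.30 m): {0.30 / rise_per_hour:.0f} h")
```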
While testing the system, the operators give feedback on the decisions the DSS proposes, which helps to elicit extra operational constraints and objectives that the operators take into account through their experience. 2.2.3 Objective of control Damage occurs if the water system fails during extreme events. Making optimal use of the possibilities to control the boezem system will reduce both failure frequency and impact. Avoiding system failure is not the only objective. In fact, failure in a boezem system should not be viewed as a single Boolean event, but rather as a continuous damage function of 2.2. An example: a lowland drainage system 19 the water level deviations from a target level. Low water levels are associated with longterm effects such as acceleration of land subsidence, decay of foundations and instability of embankments. High water levels can cause flooding, risk of dike breach and the necessity to impose a pump restriction for the polders, causing flooding problems there. The challenge for the operators is to secure the evacuation of water, discharged from the surrounding land, while balancing the short- and long-term costs associated with high and low water levels. Usually a target water level has been set by the water board with democratically elected representatives of the various interest sectors. These include owners of agricultural land, house owners and inhabitants. Because zero deviation of the target level throughout the system is never attainable, the objective should also quantify the “cost” associated with the deviations, as a function of their direction, magnitude, location and duration. This makes it possible to balance the deviations in a way that minimizes costs or damage. Apart from the water-level related variables, the variables associated with control actions, like the pump flow, could also be part of the objective function. In this way, the operation costs of the pumps can be incorporated into the objective. In most cases, these are relatively low compared to the water level related costs, making minimization of operating costs a secondary objective that becomes relevant for the decisions only in non-critical situations. High pump flows also cause high water level gradients in the canals, which leads to relatively high water levels in the center of the system (see Fig. 2.2). Therefore, a penalty on high pump flows indirectly also reduces water levels in the center of the system, because it encourages a pumping policy that is spread out more over time. The objective function that was established in collaboration with the operators is quadratic for the deviation from target level and the control flow and linear (by summation over the time steps) for the duration. The spatial variability has not been included, but this is possible by introducing extra state variables for water levels in the different areas. This objective function can either be used as a performance indicator to evaluate control rules, or directly in a control policy, based on real time optimization. In the latter case, optimization should take place over a certain time horizon, to balance current and future costs and to allow anticipation of future events if necessary. This type of optimization is known as Model Predictive Control (MPC), which was introduced in subsection 2.1.3 and will be further explained in section 2.4. 
The objective is minimizing the cost function J over this time horizon:

\min_{u} J = \sum_{i=0}^{n} \left\{ x^T(k+i|k) Q x(k+i|k) + u^T(k+i|k) R u(k+i|k) \right\}    (2.9)

in which x = e = h − h_opt is the state vector (deviation of the water level h from the target level h_opt), k is the current time step, n is the number of time steps within the prediction horizon, i is the counter for these time steps, u is the control action vector (pump flow), and Q and R are the weight matrices for the penalty on states and control actions, respectively. In the example case, these vectors and matrices are scalars, because the system is modeled as one reservoir controlled by one pumping station. Before going into more detail about the MPC formulation for the Delfland system, the alternative strategy of solving a control problem off-line is analyzed first.

Figure 2.2: The water levels in the Delfland system can have some variability during extreme events. The map shows a snapshot of instantaneous water levels resulting from a simulation of an extreme event in 1998 using the hydrodynamical model Sobek. The large circles correspond to high water levels.

2.3 Off-line optimization of operation

Operation of water systems requires reacting to the current situation and anticipating future events that can influence the water system. In the off-line approach, this reaction and anticipation is formulated as a precalculated rule that specifies how to act upon each specific situation, as summarized by the input variables of the rule. The rules thus map each situation to a control action, where a situation can be described by the state x and additional information I:

u = f(x, I)    (2.10)

The simplest off-line rules take into account only one local variable. For example, a pump may switch on and off based on the water level of the connected canal. This is called feedback control, because the control action depends on the variable that is controlled. The levels at which the pump switches can then be optimized off-line. This can for example be done by running simulations of a model of the water system, including the rule for the pump, and changing the rule until it best satisfies the objectives. Finding the optimal rule can be done by trial and error by an optimization algorithm, but can also be done analytically in some cases. Examples of well-established feedback controllers are proportional (P) or proportional integral (PI) feedback controllers and linear quadratic regulators (LQR) (Kwakernaak and Sivan, 1972).

An improvement to optimal feedback rules can be made when the rules depend on more information. For example, the rule may be varied depending on the season. This is typically the case in many polder systems, where the target water levels are higher in summer, when the demand for water is higher and there is an infiltrating flow from the canals into the adjacent agricultural lands. The different objectives in different seasons can also be interpreted as emerging from the underlying, implicit objective to control the water availability in the root zone of the agricultural fields.

In feedforward control, information about the actual disturbance is used as an input to the rule. The advantage of including this information is that actions can be taken before the disturbance starts to affect the states that have to be controlled.
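The difference between the two rule types can be illustrated with a minimal sketch for a single storage element: a proportional feedback term on the level deviation, optionally augmented with a feedforward term on the measured inflow. The gains, inflow series and pump cap below are illustrative assumptions, not Delfland's actual rules.

```python
import numpy as np

# Minimal sketch contrasting pure proportional feedback (a rule of the form of
# Eq. 2.10 with only the water level as input) with an added feedforward term
# that reacts to the measured inflow directly. All numbers are illustrative.
dt, area = 3600.0, 7.3e6
inflow = np.array([0, 0, 40, 80, 80, 40, 10, 0, 0, 0], dtype=float)   # [m3/s]

def simulate(kp, kf):
    x, levels = 0.0, []
    for d in inflow:
        # feedback on the level deviation plus feedforward on the inflow, capped
        u = np.clip(kp * x * area / dt + kf * d, 0.0, 70.0)
        x += dt / area * (d - u)                    # mass balance of the storage
        levels.append(x)
    return 100 * np.max(np.abs(levels))             # worst deviation [cm]

print("feedback only    : %.1f cm" % simulate(kp=0.5, kf=0.0))
print("with feedforward : %.1f cm" % simulate(kp=0.5, kf=1.0))
```

With the feedforward term, the pump already matches the inflow before the level has risen, so the remaining deviation is only caused by the period in which the inflow exceeds the pump cap.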
For example, the pumps in the Delfland system can be started at the moment inflow into the system is observed, proportional to its magnitude. This would lead to a smaller deviation than postponing action until the inflow causes a measurable water level rise. To effectively use feedforward control, a model of how the disturbance and the control action influence the controlled state is necessary. If the measured disturbance is rainfall, a rainfall-runoff model is needed, in combination with a simple balance model that relates inflow, controlled outflow and water level rises. Feedforward control can also be based on experience, in which case the models are implicit.

Control rules can also include anticipation of forecast rainfall. For example, the water level in the canal system may be lowered to an emergency level that is predefined by the water board when exceedance of a certain rainfall threshold is expected. The value of the threshold can be optimized off-line. When uncertainty in the forecasts is taken into account, the rules can also include a probability threshold, e.g. 'the water level will be lowered by 10 cm if there is a probability of more than 25% that a rainfall of 20 mm in the next 12 hours is exceeded'. Thresholds for the amount, time horizon, and probability can be optimized off-line to meet certain requirements on the water system (van Andel et al., 2008; van Andel, 2009). This is often implicitly the case in Dutch polder and storage canal systems, although the target levels for water systems and rules for anticipation are determined in a negotiation process rather than in a mathematical optimization with predefined objectives. After an optimal rule has been found by off-line optimization, the only computation that needs to be executed in the real time operation of the water system is the implementation of that predefined rule.

2.3.1 Example: the "regelton"

Suppose a water butt will be equipped with a proportional feedback controller (regelton < Dutch: regenton = water butt, regel = control). The water butt serves both for reduction of peak flows and for watering plants. Therefore, there will be a penalty for both spills and shortages. The outflow of the reservoir can be regulated proportionally to the water level deviation from a certain target level, but is limited to a certain maximum outflow. The proportional feedback controller can have a different target level for each month and has one k_p parameter, which determines how much the outlet is opened when a positive deviation from the target level occurs. The optimization problem that has to be solved is

\min_{k_p, h^*_{1..12}} J = k_1 \, \mathrm{shortage}^2 + k_2 \, \mathrm{spill}^2    (2.11)

subject to the mass balance constraint, the inflow and the maximum outflow. To optimize the 13 parameters for the controller, a global optimization algorithm was used that does not need to make any assumptions on the structure of the problem. During the optimization, many simulations of the controlled water butt are performed, with a 100 year time series of daily measured rainfall as an input.

Figure 2.3: The optimal monthly setpoints and feedback gain for the water butt. The box and whisker plots give the median, quartiles and extremes for the volumes in the butt for each month, based on a simulation of the rule for a 100 year rainfall time series.
In each simulation, the controller has a different 13 parameter vector. The objective function value over the 100 years is evaluated and used to generate new proposed parameter sets. The optimization was performed using a genetic algorithm developed by Storn and Price (1997), called Differential Evolution (DE). This algorithm efficiently searches the 13-dimensional parameter space by maintaining a good balance between exploration of the whole space and refining results in promising regions. The resulting monthly setpoints and some statistics of the water levels achieved within the 100 year period are shown in figure 2.3. The problem of finding the optimal setpoints for the water butt is of a similar nature as finding an optimal rule curve for a hydropower reservoir (see chapter 7). The approach that was taken here is empirical and finds the best setpoints and proportional feedback gain for a given time series of disturbances. In this case the amount of information in the data was sufficient to determine parameters that will also perform reasonably well for 2.4. Online optimization of operation by model predictive control 23 future situations. When the control rules that need to be found become more complex or less data are available, it may become more difficult to find good control rules empirically. 2.3.2 Disadvantages compared to online optimization A problem with finding optimal control actions empirically is that every possible situation must be considered beforehand. Especially if the best action depends on a combination of many variables, or on the distribution in space and time, many different combinations and patterns must be considered to find the best rule. If only a limited amount of data is available, these patterns have to be generated using stochastic models that accurately capture the spatial and temporal dependencies of the inputs. It is important to note, however, that such models usually do not increase the amount of information on which the decision rules are based. More complex control actions, e.g. the use of multiple pumps, need more data to reliably establish the control rules. Projection of the situation to lower dimensions may make it easier to identify the optimal rule, but often results in a loss of information. If the system to be controlled is well defined and the way the optimal control action depends on the inputs is complex, a better approach is to execute the optimization online, especially if forecasts of reasonable quality are available. This avoids the need to represent the patterns that are the input for the rule in low-dimensional form. Instead, any information that influences the decision through the model is automatically taken into account. Instead of a rule, the result directly consists of the optimal action, suited to that particular present situation. Consequently, the online optimization has to be repeated every timestep. 2.4 Online optimization of operation by model predictive control MPC solves a constrained optimization problem over a receding horizon. This means that the optimization finds a sequence of actions that optimizes the value of the objective function over the whole prediction horizon, while satisfying the constraints on water levels and control actions. In this optimization, MPC makes use of an internal model of the controlled system, to calculate the future states as a result of the projected actions (see Fig. 2.4). 
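A minimal sketch of one such receding-horizon optimization is given below, for a single-reservoir internal model with quadratic penalties (cf. Eq. 2.8 and Eq. 2.9) and a bound on the pump flow, solved with a generic numerical optimizer. The numbers are illustrative assumptions and this is not the formulation implemented in the Delfland DSS; in a full controller only the first action of the resulting sequence would be applied before the optimization is repeated.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of one receding-horizon optimization for a single-reservoir
# internal model: quadratic penalties on level deviation and pump flow, pump
# flow bounded by its capacity. Illustrative numbers, not the Delfland DSS.
dt, area = 3600.0, 7.3e6            # timestep [s], storage surface area [m2]
horizon = 24                        # optimization horizon [steps]
Q, R = 1.0, 1e-4                    # weights on level deviation and pump flow
u_min, u_max = 0.0, 70.0            # pump capacity constraint [m3/s]
x_init = 0.0                        # initial level deviation [m]
# forecast inflow: a 12 h event exceeding the pump capacity, preceded by calm hours
d_forecast = np.concatenate([np.full(6, 10.0), np.full(12, 90.0), np.full(6, 10.0)])

def cost(u):
    x, J = x_init, 0.0
    for i in range(horizon):
        x = x + dt / area * (d_forecast[i] - u[i])   # internal model (mass balance)
        J += Q * (100 * x) ** 2 + R * u[i] ** 2      # deviation in cm, pump penalty
    return J

res = minimize(cost, x0=np.full(horizon, 50.0),
               bounds=[(u_min, u_max)] * horizon, method="L-BFGS-B")
u_plan = res.x
print("first control action applied: %.1f m3/s" % u_plan[0])
# Only u_plan[0] is executed; the whole optimization is repeated at the next timestep.
```

Because the forecast event exceeds the pump capacity, the optimizer starts pumping before the event (anticipatory pumping), trading a small negative deviation beforehand against a smaller positive deviation during the event.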
In the case of the Delfland system, the model represents the system with one single reservoir, which is controlled by one pumping station representing the total pump flow out of the system. The optimization horizon is 24 hours ahead and the internal model has a calculation timestep of 15 minutes. The pump flow can vary between the constraints of -6 m3/s and 70 m3/s. The negative pump flow corresponds to letting fresh high-quality water into the system from a nearby lake. The maximum pump flow constraint varies in time with the outside water level. For predicting the outside water level, the system uses the astronomical tide. Other, second order effects are neglected. The Delfland decision support system uses a post-processing scheme to convert the total projected pump flow into instructions for the individual pumping stations. For the calculations in the rest of this chapter, a simplified representation of the MPC controller is used, with a fixed maximum pump flow.

Figure 2.4: Schematic representation of model predictive control on a real system (modified from van Overloop, 2006).

2.5 Uncertainties affecting the Delfland system

In this section, a number of sources of uncertainty relevant for the Delfland control problem are reviewed. The estimation of the inflow to the boezem canals is treated in more detail, including a description of the rainfall-runoff model that is used in subsequent analyses.

2.5.1 Uncertainties

The optimal pump flow is computed every 15 minutes on the basis of the latest information available. Apart from the actual and forecast inflow, this information includes water levels at 8 different locations in the boezem canal system. These point measurements are used in a weighted average to calculate the representative water level (RWL). This RWL is used in all mass-balance equations and serves as an initial condition for the state of the single reservoir model. A difference of 1 cm between RWL and target level takes 20 minutes at full pumping capacity to compensate. Especially during high pump flows, the limited conveyance capacity of the canals can cause water level differences of up to 10 cm within the system. The limited spatial resolution of the measurements, in combination with the spatial variability of water levels, is the main source of uncertainty for the actual state of the system. Apart from the actual water level and volume in the boezem system, the actual inflow is important information for the controller maintaining the water level. Part of this inflow is the outflow of polder pumping stations; the rest originates from higher lying areas and flows into the boezem canals through weirs and other uncontrolled flow processes. Only the polder pump flows are partly known, but the telemetry system is
Another possibility for inflow prediction is using the mass-balance of the boezem canal system. Because of the persistence, the past inflow can give quite an accurate estimation of the current inflow. By using the known outflow through the main pumping stations and the changes in volume calculated from the water level measurements, the past inflow can be estimated. Because of the large high frequency errors (see noise in lower graph of Fig. 2.5) in the representation of the volume by the level measurements, this balance calculation needs a considerable balance period to be accurate. A longer period acts as a moving average filter, removing errors due to temporary local drawdown effects and waves. The downside of using a long balance period is that the existent temporal variability of the inflow is also filtered out and the estimation lags behind approximately half of the balance period. The DSS uses a combination of the two approaches. A fast, lumped rainfall runoff model makes a first estimation of the inflow for the past 12 hours, using rainfall measurements. When this inflow is different from the inflow derived from the water balance over the boezem canals, a correction term is added to the inflow predicted by the rainfall runoff model. In this way, persistent errors in times of slowly varying inflow are compensated by the flow calculated from the balance, while the fast response to precipitation events is secured by the rainfall runoff model. The rainfall runoff model is fed by actual precipitation measurements at 8 locations. Weighted averages are used for different areas. For the inflow prediction, the model is fed by the updated weather forecast, but because of integrator elements in the model, the short term inflow prediction is also very much determined by past rainfall. Uncertainties in the inflow forecast are caused by model uncertainty, predicted rainfall uncertainty and measurement errors in the rainfall. 2.5.2 Data driven inflow model In the DSS, a conceptual rainfall-runoff model is used, that was developed in Weijs (2004). For the analyses in this chapter, an even more parsimonious model was formulated, making use of the data collected by the DSS since its installation. A rainfall runoff model was identified from the data using linear system identification (Ljung, 1987). First, the measured water levels and pump flows for a two year period were used to estimate the inflows to the canal system, using the mass balance over the canal system. Second, linear system identification was applied to link the inflow signal to the measured rainfall signal. 26 Risk-based water system operation Figure 2.5: The response of the identified rainfall-runoff model to a rainfall event, compared to the inflow signal that was derived from the water balance over the canal system. The noisy signal for the balance-derived inflow results from small fluctuations of the area-averaged water level, which is not necessarily representative for the volume in the system. Several model structures were tested and their parameters estimated. The model with the best performance was used. The transfer function after the Laplace transform reads G(s) = K 1 + TZ s s (1 + TP 1 s) (2.12) where s is the laplace transform of the input signal and the constants are (gain, zero, pole) K = 2.69 ∗ 10−7 , TZ = 7.66 ∗ 107 and TP 1 = 4968.3. As can be seen from figure 2.5, this model behaves almost like a linear reservoir, where outflow is linearly dependent on storage, with some additional delay. 
The model captures the essential dynamics of the transformation of rainfall to inflow into the canal system sufficiently for the purpose of evaluating the control performance under uncertainties in the prediction of precipitation.

2.6 Relevant time horizons for uncertainty and optimization

Because uncertainties in predictions usually increase with the lead time, one would expect that the most problematic uncertainties exist close to the end of the time horizon. However, depending on the problem and formulation, the influence of the prediction on the current decision usually also decreases with lead time. This also means that the optimality of control will be less sensitive to this information and thus to the uncertainties therein. To get insight into the uncertainties in the forecast, use of probabilistic forecasts (e.g. ensemble forecasts) is proposed. In the light of Bayes' rule, such a forecast can be seen as the prior climatic distribution, conditioned on actual information, to become a posterior distribution (Krzysztofowicz, 1999, 2001; Murphy and Winkler, 1987). The forecast is thus based on the multi-year average and spread of rainfall in a certain season, combined with information about current weather patterns. With increasing lead time, this conditional, posterior distribution approaches the climatic distribution, in which no information about the actual state is contained (see Fig. 2.6 for an illustration).

2.6.1 Time horizons relevant for prediction and control

If we use a probabilistic forecast in real time control, we can define two horizons relevant to the problem:

(a) The information-prediction horizon (TIp)
(b) The information-control horizon (TIc)

Horizon (a) can be defined as the time span from the actual moment until the moment where the conditional distribution of future events, conditional to all actual information, becomes the same as the marginal (climatic) distribution of these events. The length of this horizon depends on the forecast system and the statistical properties of the input forecast. This may also depend on the season. Horizon (b) is defined as the time span from the actual moment until the moment from which information does not influence the control actions anymore. This can be the case because the control action is at one of its constraints, or when optimality requires postponing actions.

Note that in control theory there is also a definition of the control (Tc) and prediction (Tp) horizons. These horizons define, respectively, the number of control moves to be calculated and number of prediction time steps over which the future state is calculated. These are parameters of the controller configuration. Rules to choose these parameters are usually based on characteristic time-constants of the system (Camacho and Bordons, 1999). In contrast to these definitions, here the horizons TIp and TIc are defined, which depend on predictability of inputs and sensitivity of the control action, respectively. Regarding the relation between these four horizons, the following observations can be made:

1. If the controlled system has delays in its behavior, Tp should be much larger than the delay-time.
2. Tp should also be much larger than Tc + delay time. This is necessary to evaluate the full effects of each calculated control action.
3. If TIc > TIp, extending the TIp by more advanced predictions based on more information helps to improve control. Tp and Tc should be chosen larger than TIp.
At the end of Tp, the cost-to-go function can be used that is based on a steady state optimization (Faber and Stedinger, 2001; Kelman et al., 1990; Loucks and van Beek, 2005; Negenborn et al., 2005).
4. If TIp > TIc, we can know something about the future after TIc, but it has no influence on the decision now. The information about cost-to-go functions after TIc is not relevant, because the possible control sequences that lead to minimum total cost do not diverge yet. Tp does not need to be longer than TIc.
5. In general, it can be stated that extending Tc and Tp beyond TIc is not necessary, but has no negative influence on the control, except for the computational cost.
6. Extending Tp and Tc beyond TIp is not necessary either, and can have a negative influence if the forecast system is biased. In that case, the climatic distribution would provide a better estimate than the forecast.

Figure 2.6: For increasing lead time, the uncertainty in the inflow increases and approaches the climatic uncertainty. The figure shows a Monte-Carlo simulation ensemble of inflow (m3/s) from a second order Markov chain, with transition probabilities empirically derived from the modelled inflow series of Eq. 2.12, and the Shannon entropy and KL-divergence (in bits) of the flow distribution as a function of lead time (900 s timesteps). The Shannon entropy is a measure for uncertainty and the KL-divergence a measure for information relative to the climate. These measures will be introduced in the next chapters.

2.6.2 The time horizons for the Delfland system

For the model predictive controller used in the DSS for Delfland, the sensitivity of the control action to uncertainties was tested by providing different forecast errors and measuring performance by evaluating the objective function over a closed-loop simulation period. The sensitivity can be determined in two steps:

1. Test whether the control action is sensitive to information after a certain time step.
2. If so, test what the resulting reduction in optimality is, by evaluating the closed-loop value of the objective function.

Because positive and negative deviations from the target level are punished equally strongly in the objective function, anticipatory pumping is only used if pump constraints are likely to be violated and positive deviations are expected somewhere in the future. In this case, the pumping necessary to counter the positive deviations is postponed as much as possible to avoid long periods with low water levels. Under normal conditions, in which the pump constraints are not relevant, the actual water level and the actual flow are therefore the only variables influencing control actions. In this case, the MPC controller behaves similarly to a feedforward controller. In cases where constraints become relevant (when the expected inflow exceeds the maximum outflow), anticipatory pumping becomes necessary. In these cases, there will be a period of time in the future in which the pumps are planned to operate at their full capacity. If the inflow peak is close enough to the actual moment, the pump flow is already at full capacity and will not be sensitive to an increase of the forecast flow. If the event is farther away or smaller, anticipation will be postponed, so in that case the action will be insensitive to changes in the future.
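Referring back to figure 2.6, the mechanism can be illustrated with a much simpler example than the inflow model: a two-state (dry/wet) Markov chain whose lead-time-conditional distribution relaxes towards the climatic distribution, so that its entropy grows and its Kullback-Leibler divergence from climate decays. The transition probabilities below are invented for illustration and are not derived from the Delfland data.

```python
import numpy as np

# Minimal sketch of the idea behind figure 2.6: the distribution conditional on
# today's state relaxes towards the climatic (stationary) distribution as lead
# time grows, so its entropy rises and its KL-divergence from climate decays.
P = np.array([[0.9, 0.1],     # dry -> dry, dry -> wet (illustrative)
              [0.4, 0.6]])    # wet -> dry, wet -> wet

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# climatic (stationary) distribution: left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
clim = np.real(v[:, np.argmax(np.real(w))])
clim = clim / clim.sum()

cond = np.array([0.0, 1.0])   # forecast issued on a wet day
for lead in range(1, 11):
    cond = cond @ P           # propagate the conditional distribution one step
    print(f"lead {lead:2d}: H = {entropy(cond):.3f} bit, "
          f"D_KL(cond||clim) = {kl(cond, clim):.4f} bit")
```

The lead time at which the divergence from climate becomes negligible is, in this simplified setting, the information-prediction horizon TIp defined above.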
2.6.2 The time horizons for the Delfland system

For the model predictive controller used in the DSS for Delfland, the sensitivity of the control action to uncertainties was tested by providing different forecast errors and measuring performance by evaluating the objective function over a closed-loop simulation period. The sensitivity can be determined in two steps:

1. Test whether the control action is sensitive to information after a certain time step.
2. If so, test what the resulting reduction in optimality is, by evaluating the closed-loop value of the objective function.

Because positive and negative deviations from the target level are penalized equally strongly in the objective function, anticipatory pumping is only used if pump constraints are likely to be violated and positive deviations are expected somewhere in the future. In this case, the pumping necessary to counter the positive deviations is postponed as much as possible, to avoid long periods with low water levels. Under normal conditions, in which the pump constraints are not relevant, the actual water level and the actual flow are therefore the only variables influencing control actions. In this case, the MPC controller behaves similarly to a feedforward controller. In cases where constraints become relevant (when the expected inflow exceeds the maximum outflow), anticipatory pumping becomes necessary. In these cases, there will be a period of time in the future in which the pumps are planned to operate at their full capacity. If the inflow peak is close enough to the actual moment, the pump flow is already at full capacity and will not be sensitive to an increase of the forecast flow. If the event is farther away or smaller, anticipation will be postponed, so in that case the action will be insensitive to changes in the future. In between these two, there is a relatively small range of situations in which the optimal action consists of anticipation of future constraints, but not yet at full pump capacity. In that case, the control action is sensitive to any change in the forecast, as long as it occurs within the period of projected full pump capacity use. This period can be bounded by physical or operational constraints on the water level, but in the theoretical case that the expected inflow is very close to the maximum pump flow, it can be unbounded. In this case, T_Ic goes to infinity. Summarizing, the controller can be in three situations:

(a) no anticipation necessary;
(b) sensitive to the prediction;
(c) pump at full capacity.

Situation (b) is always a transition from situation (a) to (c). In a deterministic optimization with a perfect prediction, the time period of this transition is limited and very much determined by the objective function and the dimensions of the system. In practice, however, predictions are uncertain to a considerable extent. Apart from small errors in the prediction, which may or may not influence the actions depending on the situation, larger uncertainties in the prediction may also exist. These errors may also influence which situation (no anticipation, sensitive, full capacity) the controller is in. The controller is insensitive to the prediction only if all future scenarios lead to situation (a), or all lead to situation (c). If this is not the case, the consequences and probabilities of all inflow scenarios have some impact on the decision.

Forecast accuracy: the information prediction horizon

For the Delfland system, 2.5 years of hourly rainfall forecasts for up to 24 hours ahead are available. The forecasts concern the hourly average rainfall for the Delfland area. When comparing the hourly forecasts from the Delfland dataset with the measured rainfall, different choices can be made about the period over which the sums are taken and the lead time for which to compare the forecasts. There are also two different methods of comparison: the "horizontal" and the "vertical" method, which are illustrated in figure 2.7.

Figure 2.7: The various options to compare forecasts and measured rainfall. M3 is the measured rainfall over a 3-hour period, while F3-L1v, F3-L8v and F3-L6h are the forecasts for lead times of 1, 8 and 6 hours respectively. The first two use the vertical method, while the last uses the horizontal method.

Figure 2.8: The accuracy of the rainfall forecasts as a function of lead time: the Nash-Sutcliffe efficiency (NSE) of the predicted rainfall amount within a 3-hour period, where the measured rainfall is compared with the amount forecast at a fixed lead time (F3-LXv, vertical) and within one prediction (F3-LXh, horizontal), for (a) all forecasts and (b) excluding correctly forecast dry days.

The measured rainfall was obtained from a weighted average of 8 tipping-bucket rain gauges in the area. This is the same measured rainfall information that is available to the DSS. Which comparison is relevant depends on the sensitivity of the control decision to those different characteristics of the forecast. For a slow-reacting water system, for example, it makes sense to compare multi-hour sums instead of single hours.
An example of how the forecast accuracy depends on lead time, for both the vertical and the horizontal method, is given for the 3-hour sum in figure 2.8. The right panel shows the results for the data set from which all correct forecasts of no rain have been filtered out. This gives more insight into whether the amounts during events are forecast correctly. The Nash-Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970), which was used in this analysis, has a number of drawbacks. The fact that the forecasts have significant skill up to long lead times may be the result of seasonality, which reflects knowledge of the seasonal cycle rather than the ability to make good short-term forecasts (see Schaefli and Gupta (2007)). Apart from that, the time horizons are formulated in terms of information, which is described by quadratic performance measures like NSE only under certain circumstances. In chapter 5, a more coherent framework for forecast evaluation using information theory is given, which could be applied to determine the time horizons if the forecasts are probabilistic.
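To make the two comparison methods concrete, the sketch below computes the NSE of 3-hour sums as a function of lead time for synthetic data, under one possible reading of Fig. 2.7: the "vertical" sum combines three successive forecast issues at the same lead time, while the "horizontal" sum takes three consecutive lead hours from a single issue. The data set, the error model and the skill decay are all invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hourly data: obs[t] is measured rainfall in hour t,
# fc[t, k] is the forecast for hour t+k issued at hour t (k = 1 .. 24).
T, K = 5000, 24
obs = np.maximum(0.0, rng.gamma(0.25, 2.0, T) - 0.5)
fc = np.zeros((T, K + 1))
for k in range(1, K + 1):
    a = max(0.0, 1.0 - k / 20.0)                       # skill fades with lead time
    noise = rng.normal(0, 0.1 + 0.02 * k, T - k)
    fc[:T - k, k] = np.maximum(0.0, a * obs[k:] + (1 - a) * obs.mean() + noise)

def nse(sim, o):
    return 1.0 - np.sum((sim - o) ** 2) / np.sum((o - np.mean(o)) ** 2)

print(" lead   vertical  horizontal")
for L in range(3, K + 1, 3):
    m3, f3v, f3h = [], [], []
    for s in range(K, T - K):
        m3.append(obs[s:s + 3].sum())
        # vertical: each of the 3 hours taken from a different issue, same lead L
        f3v.append(sum(fc[s + i - L, L] for i in range(3)))
        # horizontal: all 3 hours taken from the single issue at time s + 2 - L
        f3h.append(sum(fc[s + 2 - L, L - 2 + i] for i in range(3)))
    print(f"{L:5d}   {nse(np.array(f3v), np.array(m3)):8.2f}"
          f"   {nse(np.array(f3h), np.array(m3)):8.2f}")
```

The horizontal sums mix shorter lead times into the comparison and therefore tend to score slightly higher at the same nominal lead time, which is one of the reasons the choice of comparison method matters.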
Sensitivity to predictions: the information control horizon

To determine to what extent the controlled system is sensitive to rainfall forecast information, several simulations were made. Because this study focused on the influence of information and uncertainties in rainfall forecasts, uncertainties in water level and rainfall measurements were not considered, and the models for the rainfall-runoff process and the storage canal system are assumed to be perfect. In the simulations, both the storage canal system and the MPC controller are simulated. For the MPC controller, different horizons were used for the optimization, and the resulting performance in controlling the water levels is compared. The simulations were done using a one-hour time step. To be able to compare results with optimal feedback control, the penalty on control actions was set to zero. This led to small differences in behavior compared to the real controller, but does not significantly affect the results on time horizons.

First, to determine the Information Control Horizon, simulations were made in which the predictions used by the MPC controller are taken from the measured rainfall data that are also used to simulate the storage canal system. This corresponds to perfect foresight, which allows the controller to optimally anticipate extreme events. This is only true in case the optimization horizon is larger than the Information Control Horizon; if it is shorter, the controller does not see events in time to take the necessary anticipatory actions. In other words, the Information Control Horizon is the point beyond which extending the optimization horizon does not improve performance anymore. The results for four different optimization horizons with perfect foresight are shown in figure 2.9. It is visible that for shorter horizons, the controller cannot anticipate the high inflow in time to avoid relatively high positive deviations. For longer optimization horizons, the controller lowers the water level beforehand, leading to some negative and some positive deviations, which have a smaller total quadratic penalty. The results for the penalty as a function of the optimization horizon are shown in figure 2.11.

As can be seen from the dashed line in that figure, for the Delfland storage canals the Information Control Horizon is approximately 6 to 7 hours, because extending the optimization horizon beyond that point does not improve control performance anymore. The results for an optimization horizon of 1 hour correspond to feed forward control. This means the controller does not anticipate future inflows, but knows the current inflow exactly. This already assumes perfect rainfall measurements and knowledge of the rainfall-runoff process. Feed forward control is an improvement compared to feedback control, where control actions are based on measured water levels in the system, which leads to actions that always lag behind. As long as the inflow does not exceed the outflow, a feed forward controller can perfectly maintain the water level at target level, but when the outflow is exceeded, water levels start to deviate. From the performance, which is proportional to the resulting squared water level deviations, it can be seen that anticipation based on perfect forecasts would improve water level control by a factor of about 3 compared to feed forward control, while feed forward control already outperforms an optimally tuned (linear-quadratic regulator) feedback controller (not shown) by a factor 4.5.

Figure 2.9: Influence of the prediction horizons on the control flows, when the predictions are perfect. Panels a to d give the control flows (which differ slightly in timing) and the resulting water levels for perfect foresight over 1, 2, 3 and 12 hours.

Figure 2.10: The behavior of the controller when fed with the forecasts that were actually available, which are not perfect. It becomes clear that none of the events were anticipated in time to lower the water level, resulting in larger mean squared deviations from target level compared to the perfect predictions. The optimization horizon was set to 12 hours, which is long enough for perfect anticipation if the forecasts were perfect (see dashed line in Fig. 2.11).
Secondly, simulations were done in which the controlled system was fed with the measured rainfall, while the controller was fed with the forecasts that were actually available at the time, instead of perfect forecasts (see Fig. 2.10). From the results of this second experiment, it becomes clear that the uncertainties in the forecasts remove most of the value of anticipating. This also shows the large gain that can be made by improving forecasts, especially for the first 6 hours. For these simulations, the performance as a function of the optimization horizon is shown as the solid line in figure 2.11.

2.6.3 Results

Analysis of the RMSE of the forecasts compared to measured rainfall has shown that rainfall forecasts have some predictive power for lead times up to at least 20 hours. The Information Prediction Horizon for this method of forecasting rainfall is therefore more than 20 hours. Analysis of controller performance under perfect foresight as a function of the optimization horizon shows that the current control action is only sensitive to events within the first 6 hours. Therefore, extending the optimization horizon beyond 6 hours does not improve control performance. This defines the Information Control Horizon.

Figure 2.11: The performance of the controller at maintaining the water level as a function of the length of the optimization horizon. The performance is measured in terms of the root mean squared error (RMSE) of the water level, where a lower value indicates better performance. The dashed line represents performance with perfect forecasts, while the solid line was obtained using the actual, imperfect forecasts that were available.

However, the value of the predictive power in the forecasts for controlling the water levels already disappears at lead times of 5 hours, due to prediction errors. Although predictive power is present and the controller is sensitive to events between 5 and 7 hours ahead, no gain is made using these forecasts. A probable cause for this is that performance mainly depends on a limited number of events for which anticipation is necessary or appears to be necessary. Even if forecasts have some predictive power over the whole range of events, they might not have this power for correctly forecasting the extreme events that need to be anticipated more than 5 hours beforehand. The way the "information" in the forecasts was measured apparently does not correspond one to one with the information that is useful for controlling the water system (related discussions can be found in chapter 5 and section 6.6). Another conclusion from the simulations with real forecasts is that the value of anticipating largely vanishes as a result of forecast inaccuracies. Comparison between performance with perfect and with real forecasts shows a potential gain in performance by a factor 3 that could be made by improving forecasts. Even without anticipation, simulated feed forward control using perfect knowledge of the current inflow showed performance to be improved by a factor 4.5 compared to LQR-derived optimal feedback control, which only uses measured water levels as inputs. This shows the importance of good rainfall measurements, accurate rainfall-runoff models and real-time availability of flow data from the polder pumping stations.
2.7 Certainty equivalence

If uncertainties in the forecast information influence the control decisions, the challenge is to find the best decision, given the probability distribution of the inputs. If a water system is certainty-equivalent, the optimal operation can easily be found by optimizing the control actions given the best deterministic forecast, i.e. the expected value of the input (Eq. 2.13). However, the necessary conditions for certainty equivalence require that (Philbrick and Kitanidis, 1999):

1. the objective function is quadratic;
2. system dynamics are linear;
3. there are no inequality constraints;
4. uncertain inputs are normally distributed and independent.

For the Delfland case, the last two conditions are not fulfilled (maximum pump flow, correlated errors), making it a non-certainty-equivalent problem. This means that

u^*_{deterministic} = \arg\min_u J(x, u, E\{Q_d\})     (2.13)

u^*_{stochastic} = \arg\min_u E_{Q_d}\{J(x, u, Q_d)\}     (2.14)

u^*_{deterministic} \neq u^*_{stochastic}     (2.15)

in which x is the state (in this case e), u is the control vector (in this case Q_c), Q_d is the uncertain inflow, J is the objective function (see equation 2.8) and E is the expectation operator. In this case, optimal operation, given the uncertainties and the available information, is only possible by using a stochastic approach (Eq. 2.14) to the optimization problem, routing the uncertainties to the objective function, which will then express risk instead of costs (Weijs et al., 2006; van Overloop et al., 2008). By minimizing this objective function, expected damage is minimized. This can be approximated by using different inflow scenarios as a discrete representation of the uncertain inflow in a multiple model configuration of MPC (van Overloop, 2006; van Overloop et al., 2008).
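A deliberately simplified two-step example (with invented numbers) illustrates Eqs. 2.13-2.15: because of the maximum pump flow, the first pump setting that is optimal for the expected inflow differs from the one that minimizes the expected, risk-based cost over the inflow scenarios. In this sketch the second-step pump flow is re-optimized inside the objective, a simple form of recourse; it is not the Delfland controller itself, only a minimal demonstration of why certainty equivalence fails once a constraint can bind.

```python
import numpy as np

# Hypothetical two-step illustration of Eqs. 2.13-2.15 (all numbers invented).
# Level-deviation dynamics: e_next = e + (q - u) * dt / A, pump flow 0 <= u <= u_max.
A, dt, u_max = 5.0e5, 3600.0, 40.0       # storage area (m2), time step (s), pump capacity (m3/s)
c = dt / A
q1 = 10.0                                 # known inflow in step 1 (m3/s)
q2 = np.array([10.0, 90.0])               # uncertain inflow in step 2: normal / extreme
p = np.array([0.8, 0.2])                  # scenario probabilities

grid = np.linspace(0.0, u_max, 801)       # brute-force search over pump settings

def step2_cost(e1, q):
    """Best achievable squared deviation after step 2, given level e1 and inflow q."""
    return np.min((e1 + (q - grid) * c) ** 2)

def total_cost(u1, q_scenarios, probs):
    e1 = (q1 - u1) * c
    return e1 ** 2 + sum(pj * step2_cost(e1, qj) for pj, qj in zip(probs, q_scenarios))

# Eq. 2.13: optimise against the expected inflow only
u1_det = grid[np.argmin([total_cost(u, [p @ q2], [1.0]) for u in grid])]
# Eq. 2.14: minimise the expected cost (risk) over the scenarios
u1_sto = grid[np.argmin([total_cost(u, q2, p) for u in grid])]

print(f"certainty-equivalent first action: u1 = {u1_det:5.2f} m3/s")
print(f"risk-based first action:           u1 = {u1_sto:5.2f} m3/s (anticipatory pumping)")
```

With these numbers the expected inflow never saturates the pump, so the certainty-equivalent controller keeps the level at target, while the risk-based objective starts pumping early to hedge against the extreme scenario in which the pump constraint binds.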
2.8 Multiple model predictive control (MMPC)

Model predictive control calculates the optimal actions, given a certain predicted sequence of disturbances. In the past sections it has been shown how uncertainties in the predicted disturbance exist and can negatively affect the performance of the MPC controller. Because the Delfland control problem is not certainty equivalent, the best control action is not equal to the control action that would be best for the most likely disturbance. Instead, the control problem is stochastic, meaning that the action sought should minimize the expected value of the cost function over the optimization horizon. A possible way to achieve this in a standard model predictive control setup is to extend the internal model in MPC to contain multiple copies of the system. Each copy can then be fed with a different scenario for the future disturbance to the system, where the spread in scenarios reflects the uncertainty. This method combines well with ensemble forecasts, which have become quite common in meteorology and flood forecasting. This approach is referred to as Multiple Model Predictive Control (MMPC). MMPC and its application to the Delfland system are described in more detail in Weijs (2004); van Overloop (2006); van Overloop et al. (2008). This section introduces the basic concept through which risk-based water system operation can be implemented in the objective function. Section 2.9 analyzes the method in terms of information flows about future decisions.

Figure 2.12: Schematic representation of MMPC. In the MMPC formulation, the number of states is multiplied by the number of different inflow scenarios.

In the optimization, a sequence of control actions is sought that leads to reasonable results for each copy of the model (see Fig. 2.12). For each possible scenario, the states of the model develop differently over the optimization horizon. If each scenario j represents a certain probability measure p_j in the probability space of future events, the expected value of the cost function (risk) can be calculated as

J = \sum_{i=0}^{h} \sum_{j=1}^{n} \left\{ x_j^T(t+i|t)\, p_j\, Q\, x_j(t+i|t) + u^T(t+i|t)\, R\, u(t+i|t) \right\}     (2.16)

where the symbols are equal to those used in Eq. 2.8, and j is the counter for the n different models or scenarios. Minimizing this objective function by a certain control sequence minimizes risk in the open-loop behavior. Subsequently, the first action of the sequence can be executed and the optimization is repeated at the next control time step, using the latest information from measurements and updated predictions. In this research, a stochastic inflow model generated 50 inflow scenarios. These were distilled into 3 representative scenarios with unequal weights, in order to reduce the computational cost of the MMPC algorithm. More details can be found in van Overloop et al. (2008). New research focuses on more sophisticated tree-based scenario reduction techniques (Raso et al., 2010).

2.9 The problem of dependence on future decisions

The optimization problem that has to be solved in one control time step in the MMPC approach can be viewed from the Dynamic Programming perspective. In fact, the current decision is sought that minimizes the sum of current and future costs. Because the uncertain future is described by multiple scenarios, the future cost is an expected value or risk. This is calculated by weighting the consequences of each scenario with their probability. However, these future costs not only depend on the external disturbance scenario, but are also determined by how we react to it in the future. The quality of the decisions in the future influences the decision now, and the quality of those decisions is partly determined by the information that these future decisions are based on. In the MMPC formulation, the tacit assumption in the calculation of the benefits and penalties of the scenarios is that the whole control sequence is chosen to be optimal for the whole range of scenarios, but not changed at the next control steps. In other words, it is assumed that no new information is taken into account during the optimization horizon. The sequence is thus optimal for the case where the operator has a day off tomorrow, programs the pump flows for the whole next day and returns to change the settings only after 24 hours. The true closed-loop behavior of the control system, however, is that the optimization is executed every control time step. The true sequence of control actions will therefore be more adapted to the scenario that becomes reality (see figure 2.13b). The true control action at every future time step will take into account the extra information about the state of the water system at that moment and the information in the updated inflow forecasts. The feedback on the actual water levels that occur for each scenario will result in smaller deviations than assumed in the MMPC formulation. The value of future decisions is thus underestimated in this formulation. An alternative formulation for the MMPC controller requires only the first control action to be equal for all scenarios.
All control actions that follow the first are different for each scenario. The tacit assumption here is that after the first control action, which takes into account uncertainty, the future control actions are based on perfect forecasts. These perfect forecasts are the opposite end of the spectrum compared to the absence of new information that was assumed in the previous formulation. In this new formulation, the value of future decisions is thus overestimated. This approach, although coming from MPC rather than SDP, is equivalent to a special case of Sampling Stochastic Dynamic Programming (SSDP), as proposed by Faber and Stedinger (2001). In section 4 of that paper, it is also argued that assuming a case where no transitions between scenarios occur leads to an overestimation of future benefits. The authors propose Bayesian updating of the transition probabilities as a remedy. The correct estimation of the optimal control action thus has to take into account how much information will be available for the future control actions, in order to estimate their value. This amount of information lies somewhere between the no information and perfect information assumed in the previous two formulations. In chapter 7, the problem of accounting for future availability of information is further addressed in the context of SDP.

2.10 Summary and Conclusions

Analysis of the control of the Delfland storage canal system revealed that anticipation of heavy rainstorm events is necessary. The sequential decision process of finding optimal sequences of pump flows can be solved off-line, by finding an optimal decision rule, or on-line, by finding a sequence of optimal decisions for a finite horizon, based on predictions of inflow into the system. Model Predictive Control (MPC) can solve this optimization problem, making use of an internal model that predicts the behavior of the system over the control horizon.

Figure 2.13: The difference between MMPC formulations that assume no new information (a) and perfect new information (b). Each panel shows the disturbance (m³/s), the control flow (m³/s) and the water level (mMSL) over time (h), with the average, minimum and maximum over the scenarios.

Uncertainties in measurements and predictions affect the performance of the MPC setup negatively. The largest uncertainties are present in forecasts of events far into the future. At the same time, the sensitivity of the current action to errors in prediction diminishes with lead time. Two conceptual time horizons regarding these effects were defined, and rules for choosing the optimization horizon of a controller were formulated in terms of these horizons. For the Delfland system, the first eight hours of the prediction are most important for performance, which can be improved significantly by more accurate inflow predictions. The rainfall predictions contain information up to at least 20 hours into the future, according to the Nash-Sutcliffe efficiency. In chapter 5, a more rigorous approach to evaluating the amount of information in forecasts will be described, using a mathematical theory of information, which is introduced in the next chapter. Due to constraints on pump flow, the control problem is not certainty equivalent. This demands a stochastic approach, which minimizes the risk associated with decisions.
This can be implemented in MPC by an extension to multiple parallel models (MMPC). Analysis of the MMPC formulation revealed an intricate interplay between future decisions, current decisions, information and risk. This issue will be revisited in chapter 7, where stochastic dynamic programming is applied to analyze the value (utility) of water as a function of forecast information. Information can be seen as a key ingredient of good decisions and appropriate operation of water systems. Taking this viewpoint, the main part of this thesis deals with the application of formal theories about information and uncertainty. The next chapters will first introduce information theory and then show its applications and insights regarding forecasting and modeling for decisions.

Chapter 3
Uncertainty or missing information defined as entropy

"My greatest concern was what to call it. I thought of calling it 'information', but the word was overly used, so I decided to call it 'uncertainty'. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, 'You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.'" - Claude Shannon (Scientific American 1971, volume 225, page 180)

The word "uncertainty" in daily conversation is a qualitative notion meaning lack of certainty. In many uses of the word, the notion is also something quantitative: "There is too much uncertainty to take a decision"; "The uncertainty of forecasts tends to grow with lead-time". Both these expressions reflect the idea that there is such a thing as "the amount of uncertainty". To be useful, the ordinary language concept of uncertainty needs to be explicated. A highly convincing explication of uncertainty was provided by Claude Shannon in his seminal paper "A Mathematical Theory of Communication" (1948), in which he introduced the concept of entropy as a measure for uncertainty. Furthermore, he introduced the concepts of conditional entropy, entropy rate and relative entropy (although the latter term was later used for another concept), and stated and proved many of the main theorems relating to these quantities and their application to the transmission of information. In this thesis, information theory is extensively used as a framework of reference and a basis for methodological development. This chapter therefore introduces the basic measures and theorems, along with some additional interpretations.

3.1 Uncertainty and probability

In the English language, certainty and uncertainty are not necessarily opposites on two ends of a scale. One can be very certain that one particular outcome of an uncertain event will happen, or one can be highly uncertain about the outcome of an uncertain event. Both cases refer to certainty and uncertainty as quantitative properties. See for example the statements:

1. I'm quite certain that it will rain tomorrow.
2. I'm very uncertain about whether it will rain tomorrow or not.

Although both expressions seem to refer to the same uncertainty-certainty scale, they do not. When relating the expression of uncertainty to statements in terms of probabilities, the difference becomes clearer. The first statement could imply that my subjective probability for rain tomorrow is 90%, but cannot mean that it is 10%.
Certainty in this statement means a high probability attached to the outcome following the word "that". So "I'm quite certain" is synonymous with "I find it highly probable". In the second statement, however, uncertainty does not refer to the opposite, i.e. a low probability of rain, but to close to equal probabilities for both outcomes, e.g. a 50% chance of rain. If we obtained new information that made us revise our probability of rain to either 10% or 90%, in both cases we would say that we are more certain about whether it will rain tomorrow than in the 50%-50% case. In this thesis, the word uncertainty refers to the meaning in the second statement. The first concept would be more accurately conveyed by statements in terms of the improbable-probable scale.

3.2 The uncertainty of dice: Entropy

A quantitative definition of uncertainty is related to choice and possibilities. The more possibilities there are, the more uncertainty is associated with a choice. The outcome of throwing a fair die is thus more uncertain than the outcome of a fair coin toss. The uncertainty is related to a lack of information about which of the possibilities is the truth. Therefore, intuitively, uncertainty can be equated with missing information, or the amount of information that needs to be gained to obtain certainty. One could express this in terms of the expected number of binary questions that need to be asked. For the coin toss, one binary (yes-no) question is enough to determine the outcome ("is the outcome heads?"). For the die, one question could be enough to know the answer ("is the outcome six?"), but with a probability of 5/6 the answer will not be enough to determine the outcome and additional questions are necessary. One could continue asking questions like "is the outcome four?" to obtain the answer in five or fewer questions (see left of Fig. 3.1), with an expected number of questions of

E{number of questions} = \frac{1}{6} \times 1 + \frac{1}{6} \times 2 + \frac{1}{6} \times 3 + \frac{1}{6} \times 4 + \frac{2}{6} \times 5 = \frac{10}{3}

This is, however, not the optimal way of asking the questions. A scheme where each question divides the possibilities approximately equally leads to certainty within three questions, with an expected number of 8/3 questions (see right of Fig. 3.1). Intuitively, the answer to one yes-no question could serve as a unit of information. Shannon called this unit one "bit" of information, following a suggestion by J.W. Tukey (Shannon, 1948). One bit corresponds to one binary digit (0 or 1), and its derived units, from the byte (8 bits) to the terabyte, have now become familiar to most people who have ever touched a computer.

Figure 3.1: The expected number of questions to determine the outcome of a fair die for a risky questioning scheme (left) and an optimal questioning scheme (right).

The expected number of questions does not yet define uncertainty. The missing information still depends on how clever the person asking those questions is (this is the basis of the game "Guess Who?"). The minimum expected number of questions, however, is a function of the probability distribution alone.
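The two schemes of Fig. 3.1 can be checked with a few lines of code; the mapping from outcomes to question counts below is a hypothetical but representative encoding of the two question trees.

```python
from fractions import Fraction

p = Fraction(1, 6)          # fair die: each outcome has probability 1/6

# risky scheme: "is it six?", "is it five?", ...; the last question settles two outcomes
questions_risky = {6: 1, 5: 2, 4: 3, 3: 4, 2: 5, 1: 5}

# balanced scheme: first split {1,2,3} vs {4,5,6}, then split the remaining outcomes
questions_balanced = {1: 2, 2: 3, 3: 3, 4: 2, 5: 3, 6: 3}

for name, scheme in [("risky", questions_risky), ("balanced", questions_balanced)]:
    expected = sum(p * n for n in scheme.values())
    print(f"{name:9s}: E[questions] = {expected} = {float(expected):.3f}")
# risky    : E[questions] = 10/3 = 3.333
# balanced : E[questions] = 8/3  = 2.667
```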
Unfortunately, this measure of missing information still leaves us with two problems. The first problem is why we would restrict ourselves to asking binary questions. The second problem is the counter-intuitive result that a coin toss would always present one bit of uncertainty, regardless of whether it is a fair coin or an extremely biased coin for which we are 99% sure it will land on heads. In our intuition, the heavily biased coin would present less uncertainty. Shannon (1948), instead of presenting this interpretation in terms of questions, took a more rigorous approach. He started from three reasonable properties that would be required of a measure of uncertainty, if it were to exist. If all that is known about the outcome of an event are the probabilities attached to the different possible outcomes, the measure for uncertainty should be a measure of those probabilities: H(p_1, p_2, ..., p_n). In the case of the fair die, p_1 ... p_6 would all be 1/6. The requirements for H are (quoting Shannon, 1948):

1. H should be continuous in the p_i.
2. If all the p_i are equal, p_i = 1/n, then H should be a monotonic increasing function of n. With equally likely events there is more choice, or uncertainty, when there are more possible events.
3. If a choice be broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

The first of these requirements corresponds to the intuitive notion that adding an infinitely small bias to the die should not lead to a large jump towards more certainty. The second requirement is the most self-evident of the three and needs no further explanation. The third requirement becomes clearest when seen in the light of the expected-number-of-questions interpretation. It was explained before that this interpretation was unsatisfactory because the information gained in one answer does not depend on the prior probabilities. However, when the information gained in one question is equated with the uncertainty it resolves (to be defined by the measure that is sought), it will depend on the probabilities. Furthermore, uncertainty can then be defined in a way that does not depend on which questions are asked (this is the interpretation of requirement 3). Instead, the expected information gained in each question depends on the question asked (and the actual information gained depends on luck, cf. "is the outcome six?"). Shannon proved that the only measures simultaneously satisfying these three requirements are of the form

H = -K \sum_{i=1}^{n} p_i \log p_i     (3.1)

where K is a positive constant that determines, together with the base of the logarithm, the unit in which uncertainty is measured. For K = 1 and a base-2 logarithm, the uncertainty is measured in bits. In the rest of this thesis, logarithms have base 2 unless specified otherwise. Shannon named his measure "entropy", because the expression is similar to the concept of entropy in statistical thermodynamics as interpreted by Boltzmann and Gibbs. Some further explanation of this connection will be given in section 3.8. Just after deriving the entropy measure for uncertainty, Shannon writes: "This theorem, and the assumptions required for its proof, are in no way necessary for the present theory. It is given chiefly to lend a certain plausibility to some of our later definitions. The real justification of these definitions, however, will reside in their implications." Indeed, the implications were many, ranging from data compression in computer science to portfolio theory for the stock market.
The fact that fundamental theorems in all these fields can be stated in terms of the measures introduced by Shannon exposes a surprising unity of underlying principles. It allows an intuitive understanding of connections between methods in many fields, including some of the methods presented in this thesis. For uniform distributions, H simplifies to log n. The entropy of a fair die is therefore approximately 2.585 bits. The entropy is a lower bound for the minimum expected number of binary questions. This lower bound is achieved if the answer to every binary question in the scheme resolves one bit of uncertainty. This is only the case if the prior information about the answer to each question is as uncertain as a fair coin toss, i.e. 50-50. No questioning scheme about a fair die can meet this property (see figure 3.1). However, as proven in a theorem related to data compression, it is always possible to find a questioning scheme that achieves an expected number of questions within 1 bit from the bound. The questioning scheme in the right of figure 3.1 is an example of such a scheme for the fair die, achieving an expected number of questions of 2.667, where the entropy bound is 2.585.
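A quick numerical check of Eq. 3.1 for the distributions discussed above; this minimal sketch assumes K = 1 and base-2 logarithms, so that the result is expressed in bits.

```python
import numpy as np

def H(p):
    """Shannon entropy (Eq. 3.1) in bits, with K = 1 and base-2 logarithm."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

fair_die    = [1/6] * 6
fair_coin   = [1/2, 1/2]
biased_coin = [0.99, 0.01]

print(f"fair die   : H = {H(fair_die):.3f} bits  (log2(6) = {np.log2(6):.3f})")
print(f"fair coin  : H = {H(fair_coin):.3f} bit")
print(f"biased coin: H = {H(biased_coin):.3f} bits")
# The optimal questioning scheme of Fig. 3.1 needs 8/3 = 2.667 questions on average,
# which stays within one bit of the entropy bound of 2.585 bits.
```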
3.3 Side information on a die: Conditional Entropy

3.3.1 Conditioning, on average, reduces entropy

The entropy measure introduced in the previous section represents the uncertainty a gambler has about the outcome of a fair die. The uncertainty is a function of the probabilities attached to each outcome, and those probabilities follow from the beliefs of the gambler. This means that even after the die has been thrown and the outcome is fixed, the uncertainty for the gambler is not eliminated until he sees the outcome. Conversely, this also means that if the die is thrown with a very precisely controlled and known velocity and spin, the probability distribution changes according to this state of knowledge, and uncertainty can be reduced significantly before the die has even landed (see e.g. Diaconis et al. (2007), where this is shown for a coin toss). This makes the propensity interpretation of probability, where it is seen as a property of the die (Popper, 1959), untenable. Even when the propensity refers to the whole experimental setting, the interpretation needs many ad hoc complications. A far simpler interpretation of probability is that it is just in our heads, and reflects our incomplete information about the outcome. The probability thus depends on how much we know about the initial conditions of the throw or the situation after the throw. An extensive treatment of this view on probability, which is also adopted in this thesis, can be found in Jaynes (2003).

Given this probability interpretation, we can now imagine the following situation, in which some side information on the die is available. After the die has been thrown, the gambler is allowed to observe the face of the die facing him. This observation, although not directly revealing the outcome, gives some information on which side faces upwards. This information reduces the uncertainty of the observing gambler, because it rules out the observed face as an outcome. In the absence of other information, the gambler should assign equal probabilities to the remaining outcomes. For example, when the observed side face is ⚁, the conditional probability over the remaining possible outcomes is P(X | y = ⚁) = (1/5, 0, 1/5, 1/5, 1/5, 1/5), where the random variable X denotes the outcome of the throw, given the observed side face y = ⚁.

The entropy of the conditional distribution H(X | x ≠ ⚁)¹ is now the same as if the die had only five faces (2.32 bits). A clever gambler, however, can combine the observation with his prior knowledge of the die to reduce his uncertainty even further: the sum of two opposing faces is always seven. Using this model of the die, ⚄ can be ruled out as well, yielding P(X | y = ⚁, model) = (1/4, 0, 1/4, 1/4, 0, 1/4) and a remaining uncertainty of only 2 bits. In this case the gambler conditioned his probability estimate both on the observation and on his model. One of the theorems of information theory states that conditioning, on average², always reduces entropy, sometimes also stated as "information can't hurt" (Cover and Thomas, 2006), although this is not always true in some sense, as will be explained in subsection 3.6.2.

¹ Note that, for ease of notation, a shorthand notation is introduced here, in which the information-theoretical measure is written with a random variable as argument, but in fact is a function of its probability distribution.
² Note that by taking the average, we arrive at the conditional entropy H(X|Y) as defined in the next section, which is not the same as the entropy of the conditional distribution H(X|y). The latter entropy could actually be larger than H(X) for some particular values of y, although this does not happen in the example with the die, because there conditioning entails eliminating possible outcomes.

Figure 3.2: Observing the die can yield a further conditioning model.

The model the gambler used for the die could have been obtained from the various sources familiar to scientists. The model might have been discovered by another gambler who shared his research in a publication; the model could have been derived by observing the die directly and theorizing about its geometry; and the model could also be an empirical one, stating nothing about the seven-eye-sum theorem or even the shape of the die, but just observing that in a large number of throws ⚄ never comes up when ⚁ is observed, and using that for prediction. The gambler's estimate can in that case be seen as conditional not only on the actual observation, but also on all those observations that were used to infer the model. In a hydrological context, this is analogous to a flood forecast, based on side information about the weather and the hydrological conditions in the basin. The side information on those conditions enables better flood forecasts. However, without a model the side information is worthless. By observing long time series of streamflow and weather, the models can be improved. Also direct observation of certain properties of the catchment, like slopes and area, may help, but some remain hidden underground (cf. the fact that the die is cubic vs. the fact that its density is homogeneous). The more specific the observations are, the better the forecast becomes. Also for our gambler, those observations can still be refined...

3.3.2 Conditional entropy

When looking closer at the observations of the die, the gambler might find further structure in them. Apart from conditioning on the number of eyes observed, the gambler can use the fact that ⚁, ⚂ and ⚅ are only 2-fold rotationally symmetric and not 4-fold, like the other three faces. In his observation, he can therefore further distinguish between the two possible orientations of ⚁. Of the four outcomes that are possible when two eyes face him, only two can occur for each of the distinguishable orientations.
As can be seen from figure 3.2, P(X | y = ⚁ in the observed orientation) = (0, 0, 1/2, 1/2, 0, 0), leaving only 1 bit of uncertainty. Besides looking at the remaining uncertainty for one particular case of side information y, the gambler might be interested in his average or expected remaining uncertainty about X, given side information Y, which is now also a random variable. Because for Y the orientation is also of interest, there are more possible outcomes and H(Y) > H(X). The possible observations for Y are now the nine distinguishable side faces (⚀, ⚃ and ⚄, plus two orientations each of ⚁, ⚂ and ⚅), with P(Y) = (1/6, 1/12, 1/12, 1/12, 1/12, 1/6, 1/6, 1/12, 1/12), yielding H(Y) = 3.085 bits. When observing the outcome y, 3.085 bits of information are thus gained about Y. When having side information Y on the die, while employing the best possible model, the uncertainty about X can with 50% probability be reduced to 2 bits (when observing ⚀, ⚃ or ⚄) and to one bit when one of the other faces is observed, including its orientation. It can then be said that the conditional entropy of X, given Y, is 1.5 bits:

H(X|Y) = \sum_{y \in Y} P(y) H(X|y) = -\sum_{j=1}^{m} P(y_j) \sum_{i=1}^{n} P(x_i|y_j) \log P(x_i|y_j)     (3.2)

in which the y_j are the m possible states the random variable Y can have (the possible observations for the observed side), the x_i are the n possible states the outcome of the throw can have, P(x_i|y_j) denotes the probability of the ith outcome given an observation of the jth state of the observed side, and H(X|Y) is the conditional entropy of the random variable X, given the random variable Y. The conditional entropy measures the remaining uncertainty about X, given Y. The information inequality ("conditioning, on average, always reduces entropy") can now be stated as H(X|Y) ≤ H(X), with equality only when X and Y are completely independent. In that last case, Y does not reduce uncertainty about X and therefore contains no information about this variable.
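The numbers in this section can be verified with a short script that implements Eq. 3.2 directly from the description of the side information; the grouping of the nine possible observations into the two cases below follows the text, not an explicit model of the die's geometry.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Side information Y with the sophisticated model (section 3.3.2):
# faces 1, 4, 5 carry no extra orientation information -> 4 outcomes remain,
# faces 2, 3, 6 are seen in one of two distinguishable orientations -> 2 outcomes remain.
cases = (
    [(1/6,  H([0.25] * 4))] * 3 +   # P(y), H(X|y) for y = observed face 1, 4 or 5
    [(1/12, H([0.5] * 2))] * 6      # P(y), H(X|y) for the six oriented views of 2, 3, 6
)

P_Y = [py for py, _ in cases]
H_X_given_Y = sum(py * h for py, h in cases)     # Eq. 3.2

print(f"H(X)   = {H([1/6] * 6):.3f} bits")       # uncertainty about the throw
print(f"H(Y)   = {H(P_Y):.3f} bits")             # uncertainty about the observed side
print(f"H(X|Y) = {H_X_given_Y:.3f} bits")        # remaining uncertainty: 1.5 bits
```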
3.4 Relations between uncertainty and information gain: Mutual Information and Relative Entropy

Due to the logarithms used in the entropy definition of uncertainty, the measures introduced have some intuitive additive properties, which are depicted in figure 3.3. In the figure, two new measures appear: joint entropy and mutual information. The joint entropy is simply the combined uncertainty about more than one variable. It is the missing information needed to obtain certainty about both X and Y, and can be calculated with

H(X;Y) = -\sum_{j=1}^{m} \sum_{i=1}^{n} P(x_i, y_j) \log P(x_i, y_j)     (3.3)

In case the variables are completely independent, the joint entropy is equal to the sum of the entropies of both variables. If the variables are dependent, which is the case shown in the figure, the joint entropy is less than this sum. This is due to the fact that once the questions about Y have been asked and it is known that Y = y, the uncertainty about X is not H(X), but H(X|y). On average, the uncertainty about X when also informing about Y is the conditional entropy H(X|Y). This leads to the following equation:

H(X;Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)     (3.4)

Furthermore, because this equation shows that X reduces the uncertainty about Y by exactly the same amount as Y reduces the uncertainty about X, the measure mutual information can be defined as

I(X;Y) = H(X) + H(Y) - H(X;Y)     (3.5)

Figure 3.3: The relations between the entropy measures of two variables: (a) the entropy of X, H(X) = 2.585 bits; (b) the joint entropy of X and Y, H(X;Y) = 4.585 bits; (c) the conditional entropy of X given Y, H(X|Y) = 1.5 bits; (d) the mutual information of X and Y, I(X;Y) = 1.085 bits. The numerical values are for the die with side information, using the sophisticated model.

The mutual information defined by this equation can also be calculated directly from the joint probability mass function, such as the one defined in table 3.1:

I(X;Y) = \sum_{j=1}^{m} \sum_{i=1}^{n} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i) P(y_j)}     (3.6)

It measures the degree of dependency between two random variables. The mutual information I(X;Y) ≥ 0, with equality only when the variables are completely independent. In the case of the simple model of the die, the mutual information between the number of eyes on the observed face and the outcome is 0.585 bits, while for the sophisticated model, using the exact observed pattern of the eyes, it is 1.085 bits.

Table 3.1: The joint probability mass function of X and Y for the die, for the simple model (left) and the sophisticated model (right).

The mutual information in Eq. 3.6 can also be interpreted as a measure of divergence between the joint distribution of X and Y, P(x_i, y_j), and the hypothetical joint distribution if both variables were independent, which is P(x_i)P(y_j). Equation 3.6 shows that the mutual information is the expectation of the logarithm of the ratio between these two probability distributions. This interpretation of mutual information makes it a special case of a general information-theoretical measure for the divergence from one distribution to another. The measure is known as relative entropy, relative information or Kullback-Leibler divergence and was introduced by Kullback and Leibler (1951). The Kullback-Leibler divergence from distribution P(X) to distribution Q(X) is defined by

D_{KL}(P||Q) = \sum_{i=1}^{n} P(x_i) \log \frac{P(x_i)}{Q(x_i)}     (3.7)

The divergence measures how much more uncertain about X a person is when holding probability estimate Q(X) rather than P(X), while the true probabilities are P(X). The measure is referred to as a divergence and not a distance, because it is not symmetrical and does not satisfy the triangle inequality. In mathematical terms: D_{KL}(P||Q) ≠ D_{KL}(Q||P), and D_{KL}(P||Q) is not necessarily smaller than D_{KL}(P||R) + D_{KL}(R||Q). The explanation for the asymmetry is that the divergence depends on which of the distributions is (considered to be) the true one. This is the distribution in the first argument, which is used to calculate the expectation.
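As a small worked example of Eqs. 3.4-3.7, the sketch below combines the entropy values found above through Eqs. 3.4 and 3.5, and evaluates the Kullback-Leibler divergence of Eq. 3.7 for the die example, showing its asymmetry between the climatic (uniform) estimate and the conditional estimate obtained after observing the two-face.

```python
import numpy as np

def D_kl(p, q):
    """Kullback-Leibler divergence (Eq. 3.7) in bits; infinite if p puts mass where q = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((p > 0) & (q == 0)):
        return np.inf
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

H_X, H_Y, H_X_given_Y = 2.585, 3.085, 1.5        # values from sections 3.2 and 3.3
H_XY = H_Y + H_X_given_Y                          # Eq. 3.4
I_XY = H_X + H_Y - H_XY                           # Eq. 3.5

prior       = [1/6] * 6                           # no side information (climatic estimate)
conditional = [1/4, 0, 1/4, 1/4, 0, 1/4]          # after observing the two-face (section 3.3.1)

print(f"H(X;Y) = {H_XY:.3f} bits, I(X;Y) = {I_XY:.3f} bits")
print(f"D_KL(conditional || prior) = {D_kl(conditional, prior):.3f} bits")
print(f"D_KL(prior || conditional) = {D_kl(prior, conditional)}")   # asymmetry: infinite
```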
Further interpretation of this measure and of the other information-theoretical measures is given in the next two sections. In the first it is related to gambling, and in section 3.6 it is related to the concept of surprise.

3.5 Rolling dice against a fair and ill-informed bookmaker

The information-theoretical measures introduced in this chapter can be interpreted in terms of gains in a gambling game against a fair bookmaker. These interpretations are presented here along with some essential definitions. The derivation of these results is outside the scope of this thesis and can partly be found in Cover and Thomas (2006). A fair bookmaker is defined as a bookmaker whose odds offered for the various possible outcomes do not lead to any expected gain or loss in an individual bet, if his probability estimate is correct. For example, for a fair die, a fair bookmaker could offer 6 times the stake on every outcome, leading to an expected gain of zero for both the gambler and the bookmaker. When $1 is bet on one outcome and the bet turns out to be correct, the bookmaker keeps the $1 and pays out $6 to the gambler; if the bet was wrong, the bookmaker keeps the $1 and does not pay out anything. The odds the fair bookmaker offers reflect his subjective probabilities. For every outcome, the probability estimate of the bookmaker is h_i = 1/r_i, where r_i are the odds offered for outcome i and h_i is the bookmaker's subjective probability for outcome i. The bookmaker is said to be fair if

\sum_{i=1}^{n} \frac{1}{r_i} = 1     (3.8)

This indicates that a bookmaker can be fair regardless of what his probability estimate is, as long as the odds offered reflect some probability distribution in which the sum of the probabilities of all possible events is one. From his own perspective, the expected gain of a fair bookmaker is zero in each individual bet, regardless of where the gambler chooses to put his money. However, he can still expect to make money in a long series of bets, if the gambler has worse probability estimates than he has. As will become clear from the remainder of this section, this is related to the fact that a gambler cannot bet more money than he possesses. Suppose a gambler, who starts the game with $1, wants to maximize his gain in a long sequence of repeated bets against this bookmaker. Because the bets are repeated, the gambler can reinvest in each new bet all the capital he accumulated in previous bets. As a consequence, if he loses all his money, he cannot continue gambling and the bookmaker wins. To guard against this situation, he should always put some money on each outcome that is not impossible. Kelly (1956) showed that the best strategy for the gambler is to follow a proportional betting strategy, which means that he should "put his money where his probability is". A gambler betting on a fair die should therefore split his capital into six equal parts and spread his stakes over all possible outcomes. In general, the fraction of the gambler's capital bet on each outcome should be equal to his probability estimate for that outcome. One remarkable consequence of this theorem is that the best strategy for the gambler does not depend on which fair odds are offered, but only on his own beliefs.
Some further background on this result and its relations with information-theoretical measures can be found in appendix C, where the gambling interpretation is related to weather forecasts, which are the subject of chapter 5. The gambling context is now used to give some additional interpretation to the information-theoretical measures introduced in this chapter; some proofs can be found in Cover and Thomas (2006). A gambler can make money only if his probability estimates are better than the bookmaker's. If they were worse than the bookmaker's, he would do better not to bet. However, he can never knowingly have a worse estimate than the bookmaker, because if he thought the bookmaker had a better estimate than he, he could simply adopt the bookmaker's estimate, which can be read from the odds. In that case, the gambler is sure neither to win nor to lose any money. For the gambler, the game becomes interesting when he (thinks he) has side information. If the gambler observes one side and knows the simple seven-eye-sum theorem, he can use this information in his bets against the ignorant, fair, and soon to be poor bookmaker. The mutual information between the side and the outcome, I(X;Y), is 0.585 bits. Because the bookmaker assumed the full uncertainty in setting his odds, the gambler's capital will now on average grow by a factor 2^{I(X;Y)} = 1.5 per bet. After 30 bets his expected capital will be about $191,751. For one single bet, in which the gambler for example observes ⚁ and the bookmaker offers naive fair odds, the gambler's estimate is P_g(X | y = ⚁) = (1/4, 0, 1/4, 1/4, 0, 1/4) and the bookmaker's estimate is P_b(X) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). The gambler then estimates the exponent of his own expected gain at D_{KL}(P_g||P_b) = 0.585 (a factor 1.5 per bet), while the bookmaker will think the exponent of the gambler's expected gain is -D_{KL}(P_b||P_g) = -∞, which means he expects the gambler to reduce his capital from $1 to $0 in the long run. The asymmetry in the Kullback-Leibler divergence can thus be seen as originating from a different view on the "true" probabilities: the bookmaker and the gambler cannot both be right. The gambler is more right in this case and will win, because he has more information than the bookmaker. Also an independent third party can make an estimate of the gambler's winnings. Suppose an experienced second gambler observes the same ⚁, but has knowledge of the sophisticated model of dice. His probability estimate will be P_{g2}(X | y = ⚁) = (0, 0, 1/2, 1/2, 0, 0). In his estimate, the first gambler is expected to receive

capital received = capital bet \times 2^{D_{KL}(P_{g2}||P_b) - D_{KL}(P_{g2}||P_g)} = 1.5 \times capital bet     (3.9)

meaning that he earns half of his bet as winnings. In this case, the better informed second gambler confirms the first gambler's own expectation.
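The growth of the gambler's capital can be simulated directly. The hypothetical sketch below lets a gambler with the simple seven-eye-sum model bet proportionally against fair odds of 6 for thirty throws; because the true outcome always carries a probability estimate of 1/4 under this side information, the capital grows by exactly a factor 1.5 per bet, i.e. by 2^{I(X;Y)} = 2^{0.585}.

```python
import numpy as np

rng = np.random.default_rng(2)

# Proportional (Kelly) betting against a fair but ignorant bookmaker, for the die
# with side information and the simple seven-eye-sum model.
odds = 6.0                       # fair odds for a uniform estimate: payout 6 x stake
n_bets = 30
capital = 1.0

for _ in range(n_bets):
    outcome = rng.integers(1, 7)                              # true top face
    sides = [f for f in range(1, 7) if f not in (outcome, 7 - outcome)]
    observed = rng.choice(sides)                              # one of the four side faces
    possible = [f for f in range(1, 7) if f not in (observed, 7 - observed)]
    bets = {f: capital / len(possible) for f in possible}     # bet proportional to P_g(x|y)
    capital = bets.get(outcome, 0.0) * odds                   # winning stake paid at the odds

growth = np.log2(capital) / n_bets
print(f"capital after {n_bets} bets: ${capital:,.0f}")
print(f"realised growth: {growth:.3f} bits/bet  (I(X;Y) = 0.585 bits -> factor 1.5)")
```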
3.6 Interpretation in terms of surprise

Apart from the axiomatic derivation, the interpretation in terms of gambling gains, and the interpretation related to the expected minimum number of questions, an intuitive natural language view of information theory is now presented in terms of surprise (Tribus, 1961). Surprise is something we feel when something unexpected happens. The lower the probability we assume something to have, the more surprised we are when observing it. Rain in a desert is surprising, rain in the Netherlands is less surprising and rain on the moon is a miracle yielding almost unbounded surprise. When, following Tribus (1961), the surprise of observing outcome x is defined as S_x = \log \frac{1}{P(x)}, surprise can be measured in bits, like information and uncertainty. Observing something that was a certain fact yields no surprise, heads on a fair coin yields one bit of surprise, and observing a 1/1000-year flood in some year yields a surprise of log 1000 ≈ 10 bits. Entropy can now be defined as the expected surprise about the true outcome: H(X) = E_X(S_x) = \sum_{x \in X} P(x) \log \frac{1}{P(x)}. Mutual information between X and Y is the expected reduction of surprise about the outcome of X due to knowing the outcome of Y: I(X;Y) = E_Y\{E_X\{S_x - S_{x|y}\}\} or, vice versa, I(X;Y) = E_X\{E_Y\{S_y - S_{y|x}\}\}. The Kullback-Leibler divergence from the bookmaker's estimate to the gambler's estimate, D_{KL}(P_b||P_g), is the extra surprise the bookmaker expects the gambler to experience about the true outcome compared to his own surprise, while the reverse divergence D_{KL}(P_g||P_b) is the expected extra surprise the bookmaker will experience, seen from the viewpoint of the gambler (Eq. 3.10).

D_{KL}(P_g||P_b) = E_{P_g}\{S_{P_b} - S_{P_g}\} = \sum_{i=1}^{n} P_g(x_i) \{-\log P_b(x_i) - (-\log P_g(x_i))\} = \sum_{i=1}^{n} P_g(x_i) \log \frac{P_g(x_i)}{P_b(x_i)}     (3.10)

In general, uncertainty can now be interpreted as expected surprise about the true outcome. The fact that different expectations can be calculated according to different subjective probability distributions reflects the fact that uncertainty can be both something objective and something subjective. The uncertainty perceived by the person holding a subjective probability distribution is the entropy of that distribution. Kullback-Leibler divergence can be seen as the additional uncertainty one person estimates another person to have, compared to his own. When the uncertainty of a person is estimated from the viewpoint of a hypothetical all-knowing observer, who knows the true outcome, an objective perfect estimate of the uncertainty (expected surprise about the truth) of that person is obtained. This estimate is D_{KL}(P_{teapot fairy}||P_{person}), and because the estimate of the truth has P_t(X = x_i) = 1 for the i that corresponds to the true outcome and zero everywhere else, the estimate reduces to -\log P_{person}(X = x_{true}). In chapter 5 we will see that this divergence has an important role as a measure of forecast quality.

3.6.1 Surprise and meaning in a message

When the word information³ is used in daily conversation, it normally comprises two elements, surprise and meaning, as noted by Applebaum (1996), who demonstrated this with the following example of three possible messages:

1. I will eat some food tomorrow.
2. The prime minister and the leader of the opposition will dance naked in the street tomorrow.
3. XQWQ YK VZXPU VVBGXWQ.

Of these messages, message 2 is considered to convey most information. Message 1 has meaning but only little surprise, message 3 has surprise (when taking the English language as a prior) but little or no meaning, and only message 2 has both surprise and meaning. In this thesis, this framework is used to define useful information, which can offer new insights, noting the explicit distinction between meaning and surprise as two components of information transfer in the context of decision making. In this context, we can further specify meaning as meaning to a specific user or receiver of information.
Message 2 in the above example might have meaning to someone interested in politics or dance, but is of little relevance to a farmer deciding whether or not to irrigate. A good forecast in the eyes of this farmer will have both meaning (e.g. exceedance of some rain threshold that will change his decision and influences his yield) and surprise (something new to be learned about this meaningful uncertain event).

3.6.2 Can information be wrong?

Is wrong information also information? This question is related to the notion that uncertainty is subjective. Decisions can be improved if uncertainty relative to the truth is reduced. Reduction of uncertainty relative to an arbitrary other (possibly irrational) belief is not guaranteed to help decisions. A message can be true or wrong. It can be noted that surprise, as defined by Applebaum (1996), is a necessary but not a sufficient condition to convey information about the truth. An extension to the framework of Applebaum can therefore be proposed, in which we require the message to be true in order to truly constitute information. If the dancing in message 2 does not take place the next day, the message conveyed the same amount of information (surprise), but wrong information. Since forecasting for decision making is concerned with information about the truth, the information in a forecast should reduce the expected surprise upon hearing the true outcome. Because one cannot be surprised by the same information twice, and surprise cannot be reduced without information, any reduction of surprise about the truth will involve a surprising message, and any message that is true and surprising will reduce the user's uncertainty about the truth. In chapter 5, a mathematical decomposition of Kullback-Leibler divergence is presented that defines the concepts of missing information, true information and wrong information in the context of forecast verification.

³ In this section, the word information is printed in italic font where it refers to information in the ordinary language sense, to distinguish it from the information-theoretical definition.

3.7 Laws and Applications

Once information is defined as a quantity, some inequalities can be formulated that can be applied to many problems. Once these "laws of information" are allowed to seep into our intuition, they are as useful as concepts such as the conservation laws in thermodynamics and the equations of motion in classical mechanics. Just like an apparent perpetuum mobile tells us we are missing some source of energy, results from information theory can, for example, serve to detect flaws in data manipulation methodologies.

3.7.1 Information cannot be produced from nothing

The data processing inequality states that information can never increase in a Markov chain (Cover and Thomas, 2006). This means that no matter how clever a data manipulation method is, it is not possible to extract more information than there is in the original raw data. Therefore, there exists an upper limit to predictability, given the information that is in the predictors; see Westra and Sharma (2010) for an empirical approach to find this limit. An example is given by the low-frequency components (long-term memory) in the climate system, which can be used for seasonal forecasts. The sea-surface temperatures (often summarized in indexes like the El Niño Southern Oscillation) at a certain moment in time can contain information about the average weather conditions up to more than a year ahead (see e.g.
Namias (1969); Piechota et al. (1998); Barnston et al. (1994)), but the predictability is limited by the mutual information between these temperature patterns and the predictands. Statistical methods can help to optimally extract this information, but can never increase it. The only way to improve these forecasts is to add informative predictors on long timescales, such as ice coverage and vegetation. Deep-rooting vegetation can, for example, serve as an indicator for soil moisture in the entire root zone, which can show considerable long-term persistence. New sources of information from remote sensing can provide valuable information for long-term predictions.

3.7.2 Information never hurts

On average, information always reduces uncertainty (Cover and Thomas, 2006). As was shown in subsection 3.6.2, this is only true for correct information. This "information inequality" is often also stated as: conditioning on average reduces entropy. In this form, the inequality is always valid, because conditioning information is true by definition. Only when Bayes' rule is bypassed, because the information consists of a probability estimate that is directly assigned to the outcome, can the information be partly wrong. In chapter 5, a framework is presented for distinguishing correct and wrong information. Unfortunately, this can only be done in hindsight. The wrong information is referred to as an unreliable probability estimate.

3.7.3 Given the information, uncertainty should be maximized

The principle of maximum entropy (Jaynes, 1957) states that given certain pieces of information, one should use that information as much as possible, but not use information that is not there. In other words, apart from the information that is there, uncertainty should be maximized. This leads to probability estimates that reflect both what is known and what is unknown. The principle of maximum entropy (PME, also POME) also defines the concept of maximum entropy distributions, which maximize entropy given certain constraints on, for example, moments and domain. The constraints constitute the information, and the remaining maximum entropy constitutes the uncertainty that is left. Many frequently used parametric distributions turn out to be maximum entropy distributions for natural constraints. For example, the normal distribution maximizes entropy, given mean and variance. Section 3.9 gives some references for the use of PME in hydrology and corresponding maximum entropy distributions. In chapter 4, the principle of maximum entropy is used to add information from low-frequency components in the climate system to existing ensembles reflecting climatological uncertainty⁴.

3.7.4 Applications of information theory

Although originally developed as a mathematical theory of communication, information theory has found application in a wide range of fields and problems. Initial results dealt with how much information could be sent over a communication line. When extending these results to a noisy communication channel, error-correcting codes became important. Without these techniques, every scratch in an audio CD would be audible. The same codes are also important in data recovery. If information is stored in a redundant way, even partly damaged data allows full recovery of the information. This is closely linked to cryptography, where the aim is to make information unrecoverable without the correct key.
Data compression is based on removing all redundancy in data, resulting in a new data set where every piece of data is a piece of unique information. If such a piece of data is lost, it cannot be recovered by any error-correcting code. Therefore, there is a trade-off between communication speed (compression) and robustness (error correction).

⁴ "Climatological uncertainty" here refers to knowledge of historic frequencies, without other conditioning information.

Kelly (1956) presented a new interpretation of information rate, which illustrates the connection between information and gambling. In the end, this connection follows from the strong connection between information and probability on the one hand, and probability and rational decisions on the other. The results for gambling, which were initially derived for idealized horse races, have been extended to more complicated cases, resulting in optimal portfolio theory for the stock market.

3.8 Relation to thermodynamics

In physics, entropy was used long before Shannon introduced it as a measure of uncertainty in his theory of communication. The term was coined by Clausius in 1865 for a concept of "transformation content", the amount of energy that was used or dissipated, referring to the Greek ἐντροπία (entropia; from the verb ἐντρέπεσθαι / entrepesthai = to turn into). Clausius' theory of thermodynamics was stated in terms of the macroscopical quantities heat and temperature. He defined the change in entropy associated with the irreversible transformation of a quantity of heat Q from a body of temperature T1 to a body of temperature T2 as

∆S = Q ( 1/T2 − 1/T1 )    (3.11)

It was by then understood that heat flowing from warm to cold in an isolated system is an irreversible process and that entropy changes can only be positive. Boltzmann managed to find an explanation in terms of statistical properties of microscopical behavior, and thereby founded the field of statistical mechanics. The formulation was later refined by Gibbs. In the statistical mechanical perspective, temperature, heat and entropy are related to the motion of microscopical particles. The classical theory of thermodynamics emerges from statistical mechanics through statistical relationships between averaged macroscopical variables. These relationships, although emergent from random and complex behavior, are quite accurate, but still only an approximation, given the lack of knowledge of the precise microscopical behavior. This is one of the first times that statistics and incomplete information entered fundamental physics so explicitly. A nice historical account of these developments is given in Lindley (2008). Shannon, when developing a measure for uncertainty in his theory of communication, found that his measure has the same mathematical form as the statistical mechanics formulation of entropy. Jaynes (1957) investigated the relation between information theory and statistical mechanics and proposed that the idea of maximum entropy in physics could be seen as a general principle of statistical inference. In this view, maximum entropy is seen not so much as a specific physical law, but more as the best estimate of the distribution by inference from the incomplete information in the macroscopical variables. The entropy of the dice in section 3.3 reflects the remaining uncertainty for the gambler, given the side information.
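As a minimal numerical illustration of this inferential view of maximum entropy (a sketch added for this text, not part of the original thesis; it assumes Python with NumPy and SciPy and a discretized support), the snippet below maximizes entropy under moment constraints. With a fixed mean and variance, the optimum approximates a discretized normal distribution, as stated in subsection 3.7.3; with only a fixed mean on a nonnegative support, it approximates a discretized exponential distribution, the case that reappears for molecular kinetic energies in the following paragraph.

```python
import numpy as np
from scipy.optimize import minimize

def max_entropy(x, moment_constraints):
    """Discrete maximum entropy distribution on support x, subject to
    equality constraints given as (function_of_x, target_value) pairs."""
    n = len(x)
    cons = [{'type': 'eq', 'fun': lambda q: q.sum() - 1.0}]          # probabilities sum to one
    for g, target in moment_constraints:
        # default arguments bind g and target at loop time
        cons.append({'type': 'eq', 'fun': lambda q, g=g, t=target: q @ g(x) - t})
    neg_entropy = lambda q: np.sum(q * np.log(np.clip(q, 1e-12, None)))
    res = minimize(neg_entropy, np.full(n, 1.0 / n), bounds=[(0.0, 1.0)] * n,
                   constraints=cons, method='SLSQP', options={'maxiter': 500})
    return res.x

# Fixed mean and variance on a wide, symmetric support -> approximately a discretized normal
x1 = np.linspace(-6, 6, 121)
q1 = max_entropy(x1, [(lambda x: x, 0.0), (lambda x: x**2, 1.0)])

# Fixed mean on a nonnegative support -> approximately a discretized exponential
x2 = np.linspace(0, 15, 151)
q2 = max_entropy(x2, [(lambda x: x, 2.0)])
```

The resulting vectors q1 and q2 can be compared with the corresponding normal and exponential densities evaluated on the same grids; the agreement improves as the grid is refined.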
In statistical thermodynamics, entropy represents the remaining uncertainty about the microstates (what the molecules are up to), given some conditioning information about the macrostates (pressure, temperature, volume). For example, the macrostate temperature of an ideal gas corresponds to the average kinetic energy of the molecules; in addition, we know that these energies must be positive. Given these two constraints, or pieces of conditioning information, the best guess is that the kinetic energies of the individual molecules follow an exponential distribution, which is the maximum entropy distribution under these constraints. The distribution for the velocities then becomes the Maxwell-Boltzmann distribution

f(v) = √(2/π) (m/(kT))^(3/2) v² exp( −m v² / (2kT) )    (3.12)

where f(v) is the probability distribution of the instantaneous speed of a randomly selected particle, T the temperature of the gas, m the molecular mass of the gas and k Boltzmann's constant.

The second law of thermodynamics, which states that entropy tends to increase over time, can also be interpreted in terms of information. In that interpretation, it states that we tend to lose track of the microscopical behavior, because energy that is concentrated in ways that can accurately be described in terms of macroscopical constraints tends to spread over microscopical degrees of freedom. When only having access to macroscopical states, this represents a loss of information, which is analogous to a loss of free energy. Since the work of Szilard (1964), who analyzed Maxwell's demon, and Landauer (1961), who connected the irreversibility of logical operations to the irreversibility of heat generation, information is increasingly seen as a physical quantity, which is tightly linked with the laws of thermodynamics. Recently, the principle of Maxwell's demon, which can apparently violate the laws of thermodynamics by using knowledge, was demonstrated experimentally for the first time. Toyabe et al. (2010) showed that information can be converted into free energy by applying feedback control on the molecular level. However, obtaining and processing that information is impossible without increasing entropy. The precise connections between thermodynamics and information theory are quite subtle and need a more precise mathematical formulation to be fully appreciated. This is beyond the scope of this thesis, but would be an interesting direction of future research. Especially for hydrology, where inference about complex systems plays a major role, the cross-pollination between information theory and thermodynamics could yield interesting new perspectives.⁵

⁵ Note that also in fundamental physics, information theory plays a central role through the "black hole information paradox" (information seems to be lost in a black hole, while quantum mechanics says that information should be preserved), which led to black hole thermodynamics, the holographic principle and even to new theories for gravity as an emergent entropic force; see Verlinde (2010).

3.9 Applications of information theory in water resources research

Information theory has been applied to several problems in hydrology and water resources research. It is outside the scope of this thesis to give an extensive overview, but a few references are given in this section. For more references, the reader is referred to the review papers by Harmancioglu et al. (1992a,b); Singh and Rajagopal (1987); Singh (1997).
One area where information theory has been applied is the derivation, justification and parameter estimation of several statistical distributions that are common in hydrology by the principle of maximum entropy (PME, also POME); see for example (Sonuga, 1972; Singh and Guo, 1997, 1995b,a; Singh et al., 1993; Singh and Singh, 1991, 1985). Also in hydraulics, this principle has been used to derive velocity profiles (Chiu, 1988), and to describe the behavior of complex natural systems in terms of their distribution and temporal dependence structure (Koutsoyiannis, 2005a,b, 2011). One of the first applications of information theory in water resources is due to Amorocho and Espildora (1973), who investigated the use of entropy, conditional entropy and transinformation as measures of model performance. Later, information theory has also been applied to investigate predictability and forecast quality, mainly for meteorological applications. For relevant references, see chapter 5, where an information-theoretical framework for forecast verification is presented. Other applications of information theory concerned the morphological analysis of river basin networks and landscapes (Fiorentino et al., 1993; Rodriguez-Iturbe and Rinaldo, 2001) and the design and analysis of monitoring networks for rainfall and water quality (Krstanovic and Singh, 1992a,b; Alfonso et al., 2010; Alfonso Segura, 2010).

Because using information theory was not a predefined objective, this thesis builds mostly on literature about information theory itself, rather than on previous applications to water research. The results mainly followed from the idea that information plays a central role in risk-based water system operation and that a rigorous theory of information existed. In this thesis, information theory is used

• to characterize the information that is contained in forecasts (chapter 5),
• to investigate the amount of information in data that is available for inference of models (chapter 6),
• to clarify the distinction between information and value for decisions (chapter 5),
• to give guidelines for the objective function for training models (chapters 5 and 6),
• to point out the philosophical objections to deterministic forecasts (chapter 5),
• to relate future information availability to the value of water (chapter 7) and,
• in the next chapter (4), to ensure that the correct amount of information in a forecast is reflected in a weighted ensemble.

Chapter 4
Adding seasonal forecast information by weighting ensemble forecasts

"Probability is relative, in part to [our] ignorance, in part to our knowledge." - Pierre-Simon Laplace, 1820

Abstract - This chapter¹ presents an information-theoretical method for weighting ensemble forecasts with new information. Weighted ensemble forecasts can be used to adjust the distribution that an existing ensemble of time series represents, without modifying the values in the ensemble itself. The weighting can, for example, add new seasonal forecast information to an existing ensemble of historically measured time series that represents climatic uncertainty. A recent article compared several methods to determine the weights for the ensemble members and introduced the pdf-ratio method. In this chapter, an information-theoretical view on these weighting methods is presented. A new method, the minimum relative entropy update (MRE-update), is presented.
Based on the principle of minimum discrimination information, the method ensures that no more information is added to the ensemble than is present in the forecast. This is achieved by minimizing relative entropy, with the forecast information imposed as constraints. The MRE-update is compared with the existing methods and the parallels with the pdf-ratio method are analyzed. The method is illustrated with an example application to a data set from the Columbia River basin in the USA.

¹ Based on: Weijs, S.V., van de Giesen, N., An information-theoretical perspective on weighted ensemble forecasts, to be submitted for publication.

4.1 Introduction

This chapter presents an information-theoretical view on methods to produce weighted ensemble forecasts, as recently addressed by Stedinger and Kim (2010). It is argued that the updating of ensemble weights constitutes adding information and is therefore amenable to the information-theoretical principle of maximum entropy. Using the information-theoretical concepts, it is shown that the existing parametric update of Croley (2003) is a second order approximation of the information-theoretical approach. It is also shown that, for normal distributions, the pdf-ratio method of Stedinger and Kim (2010) has a solution of the same shape, but with some deviations in the parameters. When the pdf-ratio method is forced to exactly match the prescribed conditional mean and variance, the results are identical. Firstly, this is an information-theoretical justification for this version of the pdf-ratio method. Secondly, it indicates the pdf-ratio method as a fast way to solve the MRE-update in case the forecast information consists of a conditional mean and variance. We only give a short introduction to weighted ensemble forecasts here. For more background and references, the reader is referred to Stedinger and Kim (2010).

4.1.1 Use of ensembles in water resources

Decision making about water resources systems often requires uncertainty to be taken into account. For the operation of a system of reservoirs, for example, forecasts at different timescales can improve decisions. Because these decision problems are not certainty equivalent (Philbrick and Kitanidis, 1999), optimal decisions can only be found by explicitly taking uncertainties into account. Ensembles are a common method to describe uncertainty in forecasts, such as future inflows to a reservoir system. An ensemble consists of several scenarios (also called members or traces), which represent the possible future development of the variables of interest. An ensemble of past measured streamflows can for example be used as a stochastic description of the inputs to a system of reservoirs (Kelman et al., 1990). Ideally, such a historical ensemble represents the climatic uncertainty about the interannual variability, and at the same time contains a realistic stochastic description of spatial and temporal variability and correlations at smaller timescales. Using ensembles directly has the advantage that no statistical models have to be assumed. Ensemble members can be multivariate, e.g. yearly sets of daily time series of various hydrological variables at various locations (Faber and Stedinger, 2001).
Other examples of ensembles are a multi-model ensemble of different climate models predicting the global average temperature over the next century, the ensemble weather forecasts of the European Centre for Medium-Range Weather Forecasts (ECMWF), and the ensemble streamflow predictions (ESP) that are used throughout the USA as an input for reservoir optimization models or decision making about flood protection. These ESP forecasts reduce uncertainty by conditioning on actual basin conditions. The ESP forecast is produced by feeding a distributed hydrological model, which has an initial state consistent with actual basin conditions, with past observed weather patterns (Day, 1985; Wood and Lettenmaier, 2008). The result is an ensemble with one trace for each historical weather pattern. The ensemble reflects the climatic uncertainty, but also the information that is in the actual basin conditions. The most important information is in the states that represent the largest seasonal storage changes, such as soil moisture and snow pack. This information reduces the climatic uncertainty in the flows; see the bottom left of Fig. 4.1.

[Figure 4.1: four panels of daily average flow (m³/s) against time (days after 1st of January), titled "Climate (modeled)", "Selected equal Niño3.4 tercile (Hamlet)", "Conditioned on initial state" and "Conditioned on true meteo forcing".]
Figure 4.1: The ESP forecasts are generated by forcing a hydrological model with historical weather patterns. The top left panel represents the modeled climatic uncertainty, in which the modeled flow traces are obtained by forcing the model with the meteorological data, while neither the initial conditions nor the meteorological data are conditioned on the actual year. The bottom left panel is a typical ESP forecast, in which all traces start from the initial conditions in one particular year (2003), but each trace corresponds to a different historical weather pattern. The top right shows the climatic ensemble, where only traces matching the actual tercile of the ENSO index are selected (also 2003, above normal ENSO). The lower right panel shows traces produced by forcing the model from different initial conditions, all with the same weather pattern from 2003. Remarkably, the effect of the uncertainty in the initial conditions only becomes important after 5 months, in the melting season, when the different initial snowpack conditions are translated into different streamflows.

Apart from the initial basin conditions, information about the streamflows might be present in climatic indexes that characterize long-term persistence in the atmospheric and oceanic circulation (Piechota et al., 1998). The indexes are usually based on sea surface temperatures and pressure heights. For example, the phase of the El Niño Southern Oscillation (ENSO) and the Pacific Decadal Oscillation (PDO) gives information about the precipitation in the Pacific Northwest of the USA (Hamlet and Lettenmaier, 1999). Hamlet et al. (2002) proposed a method to select only the one third of ESP traces that match the ENSO conditions of the actual year. They calculated that this information, in combination with a more flexible reservoir operation strategy, could lead to "an increase of nonfirm energy production from the major Columbia River hydropower dams, ...
resulting in an average increase in annual revenue of approximately $153 million per year in comparison with the status quo." This chapter presents a method to include this type of information in ensemble forecasts by weighting rather than selecting ensemble traces.

[Figure 4.2: left panel: the historical ensemble of Columbia streamflows with representative averages (streamflow against days); right panels: climatic and forecast pdfs against streamflow (m³/s), once with original, increased and decreased probabilities and once with the original and shifted samples.]
Figure 4.2: Equally weighted ensemble members can represent a nonuniform density. This density can be changed by shifting or by weighting the ensemble traces.

4.1.2 Weighted ensembles

Ensemble traces are often stated to be "equally likely" (see e.g. Cloke and Pappenberger (2009)). This should not be taken too literally. For example, streamflows close to the interannual mean are more likely than streamflows of the most extreme ensemble members. Ideally, ensemble forecasts are produced in such a way that all ensemble members can be considered to represent equally probable ranges of the possible outcomes. This is reflected by the fact that scenarios usually lie closer to each other around the mean value. Each scenario represents the same probability, but a different region of the space of outcomes and therefore a different probability density; see Fig. 4.2. In that way, the ensemble is a discrete representation of the underlying probability density function and can be used in risk analysis and decision making (see e.g. Georgakakos and Krzysztofowicz (2001) and the MMPC approach in van Overloop et al. (2008); see also chapter 2).

Often, long-term forecasts based on, for example, ENSO do not contain information at a high spatial and temporal detail level, but rather contain information about averages in time and space, e.g. the total flow at the outlet of a river basin, averaged over several months. Yet, risk analysis may depend on events and sequences at shorter timescales and smaller spatial scales. One could attempt to shift the time series in the ensemble (bottom right of Fig. 4.2), but this could destroy the realism of the traces if the shifting or scaling procedure is not sophisticated enough. A reasonable alternative to combine detailed information in the ensemble of historical time series with forecast information is to update the probabilities of individual ensemble members, while leaving the time series they contain intact (Stedinger and Kim, 2010). This has the advantage of preserving the high-resolution stochastic structure within the ensembles.

The update of the weights is thus based on averages in space and time, derived from the time series, compared to information on these same quantities in the seasonal forecast; see the top right of Fig. 4.2. Such an update of probability weights is consistent with Laplace's principle of insufficient reason, which states that all outcomes of an experiment should be considered equally likely, unless one has information that indicates otherwise. In this case, the information from e.g. ENSO indicates otherwise. As is shown later, Laplace's principle also corresponds to the more general information-theoretical principle of maximum entropy.
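As a small aside, the amount of information that any reweighting of traces carries can already be made concrete here. The following minimal Python sketch (added for illustration and not part of the original study; the ensemble size is arbitrary) computes the relative entropy, in bits, between updated and uniform trace weights. For a selection strategy such as that of Hamlet et al. (2002), which retains one third of the traces, this relative entropy is log2(3) ≈ 1.58 bits, independent of the ensemble size.

```python
import numpy as np

n = 51                               # illustrative ensemble size, divisible by 3
p = np.full(n, 1.0 / n)              # uniform climatic weights

# selection of one third of the traces (e.g. those matching the actual ENSO tercile)
q_select = np.zeros(n)
q_select[: n // 3] = 3.0 / n

def added_information_bits(q, p):
    """Relative entropy D_KL(q||p) in bits; terms with q_i = 0 contribute zero."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log2(q[mask] / p[mask])))

print(added_information_bits(q_select, p))   # log2(3) ~ 1.585 bits
```

The same function applies to the gentler weight updates discussed in the remainder of this chapter, where the added information is correspondingly smaller.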
When updating ensemble probabilities to deviate from equal probabilities, additional information is added to the ensemble. That information is measured by the relative entropy between the original and the new probabilities. To prevent adding more information to the ensemble than is justified by the forecast, the relative entropy is minimized, constrained by the information in the forecast. In this chapter, this method is referred to as the minimum relative entropy update (MRE-update). Apart from the example in this chapter, concerning climatic ensembles and additional forecast information, the MRE-update method that is introduced is generally applicable whenever information is added to an ensemble by adjusting probabilities. Another possible application could be a bias correction or variance adjustment for ensembles generated by Monte Carlo simulations with models. In finance, the concept of minimum relative entropy has been used to include price information in a weighted Monte Carlo simulation, which is mathematically equivalent to the proposed bias correction application of the MRE-update (Avellaneda et al., 2001).

4.1.3 Previous work on adding forecast information to climatic ensembles by weighting

The focus of this chapter is how to find a proper adjustment of probabilities in an existing ensemble to reflect forecast information that is given in the form of moments or conditional tercile probabilities, also referred to as probability triplets, which are often used to present seasonal forecasts. Finding such adjustments was recently discussed in Stedinger and Kim (2010). We now briefly review some existing weighting methods, to which the MRE-update in this chapter will be compared. After that, this chapter also addresses the question whether the chosen form of the forecast is an accurate representation of the information that the seasonal forecasts are supposed to convey.

Croley (1996) presents a method for updating ensemble member probabilities, assuming forecast information is given by a third party in the form of conditional tercile probabilities. These are the probabilities of below normal, normal or above normal conditions, which have equal probabilities of 1/3 in the climatic distribution. Croley presents a nonparametric probability adjustment procedure based on minimization of the sum of squared deviations of the probabilities from the uniform distribution. The result is a block adjustment of probabilities, in which all ensemble members within one tercile are assigned the same weight. This is in line with the literal interpretation of the probability triplets as considered by Wilks (2000, 2002). The method can also deal with multiple forecasts, including deterministic "most probable event" forecasts (Croley, 1997). A procedure that eliminates constraints one by one, according to user priorities, helps to reach a feasible solution for the probability weights. Croley (2003) presents an alternative parametric approach, in which sample moments of the forecast distribution can be imposed as equality constraints on corresponding moments of the weighted ensemble. A problem with this method is that often many of the probabilities become zero, so only part of the original ensemble is used. The cause for this partly lies in the objective function that Croley proposed.
Although it seems reasonable to minimize the adjustment to the probabilities, there is no clear rationale for using minimum squared deviations as the objective function; see page 345 of Jaynes (2003). Of the two methods, the parametric one leads to more reasonable, smoother adjustments than the block adjustment, avoiding sudden jumps in probability between adjacent ensemble members (Stedinger and Kim, 2010).

Stedinger and Kim (2002, 2007, 2010) introduce the pdf-ratio method, also focusing on obtaining a weighted ensemble, but now assuming a forecast is given by the third party as a target conditional distribution. They also argue that forecast information in the form of probability triplets should not be taken literally, but as a representation of a smooth underlying target distribution. They propose that probability triplets should be converted to a likely target distribution that can subsequently be used in the pdf-ratio method. The pdf-ratio method adjusts the probability of each ensemble member with the ratio between the marginal (climatic) and conditional (forecast) probability density functions (pdf) at each sample point. The pdf-ratio method then normalizes the probabilities to make them sum to one. Although the method does not seem to be very sensitive to the distribution type, one still has the problem of assuming a distribution from only two tercile probabilities or moments. Another problem is that for relatively large deviations from the climatic distribution, significant deviations of the resulting moments from the target moments occur.

In this chapter, we analyze the problem of updating ensemble probabilities with forecast information from an information-theoretical viewpoint. We present a new method to include forecast information in a historical ensemble, based on minimum relative entropy. In a comparison between our method and the existing methods, we explicitly show the assumptions in the existing methods and the differences between various ways of presenting forecast information. Before introducing the MRE-update, a short review of relevant information-theoretic concepts and principles is given.

4.2 Information, Assumptions and Entropy

A reduction of entropy implies that information is added about the uncertain event the distribution describes. Information can be added in the form of data or knowledge, but can also enter implicitly through unwarranted assumptions, all reducing the entropy of the distribution. When new information in, for example, a forecast motivates a revision of the probability distribution from P(X) to Q(X), the relative entropy D_KL(Q||P) is an exact measure of the amount of information in that specific forecast. The expectation of D_KL(Q||P) over all possible forecasts is equal to the mutual information between the forecasts and the random variable X; see chapter 3. A good overview of the information-theoretic concepts reviewed in this section can be found in Cover and Thomas (2006).

4.2.1 The principle of maximum entropy

Among all discrete probability distributions, the uniform distribution, in which all outcomes are equally likely, maximizes entropy. The uniform distribution has maximum missing information amongst all distributions on a finite support set. So without any information available except for the support set, it is rational to assume a uniform distribution. Assuming any other distribution leads to less uncertainty without having the information to justify that reduction.
This idea was already formulated by Laplace. Jaynes (1957) first formulated the principle of maximum entropy, which is in fact a generalization of Laplace's principle. It states that when making inferences based on incomplete information, one should choose the probability distribution that maximizes uncertainty (entropy), subject to the constraints provided by the available information. Applying this principle leads to a distribution with maximum uncertainty, but bounded by what is known. This automatically implies that no false certainty is created and only truly existing information is added. The principle of maximum entropy (PME or POME) has been widely applied for the derivation of prior distributions and parameter estimation; see e.g. Singh and Singh (1985); Singh and Rajagopal (1987); Singh (1997) for relevant references. Along the same lines of reasoning, the principle of minimum relative entropy or principle of minimum discrimination information (Kullback, 1997) states that given new facts, a new distribution should be chosen that is consistent with those facts, but apart from that minimizes the information gain with respect to the original distribution. This principle ensures that no more new information is included than is justified by the new facts. The principle leads to results identical to those of PME, but generalizes to non-uniform prior distributions.

4.3 The Minimum Relative Entropy Update

4.3.1 Rationale of the method

We propose to apply the principle of minimum relative entropy to adjust the probabilities of a climatic ensemble to reflect new forecast information. This method is referred to as the minimum relative entropy update (MRE-update). The MRE-update is a constrained minimization of relative entropy to optimally combine new information, in the form of constraints, with an existing ensemble. In this example, the method is used for updating a climatic ensemble, whose members may contain high-resolution spatial and temporal patterns of several variables. The new information that is added concerns some averaged quantities that characterize the traces in the climatic ensemble. This new information is for example expressed in the form of conditional moments of those averaged quantities. The information is added by adjusting the weights of the ensemble members in such a way that the weighted moments match the forecast. The amount of new information added by the forecast is the relative entropy between the original uniform distribution of probabilities and the updated probabilities assigned to the ensemble. Minimizing this relative entropy, constrained by the information contained in the forecast, will use exactly all the information in the forecast, without adding information that is not in the forecast. Consequently, the new ensemble is consistent with the forecast, but does not deviate more than necessary from the observed climatic distribution. The minimum relative entropy update thus optimally combines forecast information with climatic information in an ensemble.

4.3.2 Formulation of the method

In the minimum relative entropy update (MRE-update), we try to find updated probabilities q_i by minimizing the relative entropy between the original uniform distribution of probabilities p_i and the updated probabilities q_i assigned to the n samples x_i, given the general constraints on probabilities and the constraints posed by the forecast information.
This results in a nonlinear optimization problem with objective function

min_{q_1...q_n} Σ_{i=1}^{n} q_i log( q_i / p_i )    (4.1)

Because in this case we start from a uniform distribution of equiprobable ensemble members (p_i is constant), which has maximum entropy, minimizing relative entropy is equivalent to maximizing the entropy of the distribution of q_i:

max_{q_1...q_n} { − Σ_{i=1}^{n} q_i log(q_i) }    (4.2)

subject to the constraint that the probabilities sum to one

Σ_{i=1}^{n} q_i = 1    (4.3)

and that all probabilities are nonnegative:

q_i ≥ 0    (4.4)

This last constraint is never binding in the MRE-update, because the objective function (Eq. 4.1) already ensures positive weights. Without any forecast information, no extra constraints are added. Objective function (4.1) minimizes the divergence from the original uniform distribution. Because constraints (4.3) and (4.4) are already satisfied by the original distribution, no adjustment is made. When forecast information is available, it can be introduced through additional constraints to the minimization problem. In case the forecast information is given as probability triplets, with probabilities of below (p_b) and above (p_a) normal conditions, the following constraints are added:

Σ_{i∈S_b} q_i = p_b    (4.5)

Σ_{i∈S_a} q_i = p_a    (4.6)

in which S_b and S_a are the sets of i for which x_i ≤ x_b and x_i ≥ x_a, respectively, with x_b and x_a being the lower and upper terciles of the climatic distribution for X. In case the forecast information is given as the conditional mean µ1 and standard deviation σ1, the following constraints are imposed:

Σ_{i=1}^{n} q_i x_i = µ1    (4.7)

Σ_{i=1}^{n} q_i (x_i − µ1)² = σ1²    (4.8)

The resulting constrained convex optimization problem is subsequently solved using a standard gradient search.

4.4 Theoretical test case on a smooth sample and comparison to existing methods

In this section, the results of the minimum relative entropy update are compared with the results of the Croley nonparametric adjustment (Croley, 1996), the Croley parametric adjustment (Croley, 2003) and the pdf-ratio method (Stedinger and Kim, 2010). The same example as the univariate case in Stedinger and Kim (2010) was used. In this example, an artificially generated smooth climatic sample of n = 50 scalar values x_i is updated with forecast information of the previously mentioned forms. In a real-world application, the sample would represent an ensemble of, for example, time series. The sample is created by evaluating the inverse cumulative distribution function of the prescribed original distribution at the Hazen plotting positions ((i − 0.5)/n); see Stedinger and Kim (2010). The sample is drawn from a normal distribution with mean 3 and standard deviation 1. In the absence of extra information, all 50 samples are considered equiprobable, with probability 1/50.

The challenge is now to update the probabilities of the sample values in such a way that the ensemble reflects the forecast distribution, i.e. the climatic information conditioned on the forecast information. Comparing the methods, each using its own required type of input information, we automatically compare forecast information given in three different forms:

(TC) the conditional tercile probabilities p_b, p_n and p_a;
(N) an assumed forecast normal distribution with given parameters;
(M) the conditional mean µ1 and standard deviation σ1 of the forecast distribution.

In table 4.1, an overview of the methods and forecast types is given, indicating which combinations are compared and which abbreviations are used for the results. Three of the methods have also been compared by Stedinger and Kim (2010). For the pdf-ratio method, they considered normal, lognormal and gamma type distributions. This section focuses on their results using the assumption of a normal distribution for both the climatic and the forecast distribution.

Adjustment method                   | TC: p_b, p_n, p_a | N: N(µ1, σ1) | M: µ1, σ1²
pdf-ratio method                    |                   | (pdf-N)      |
Croley non-parametric adjustment    | (TC)              |              |
Croley parametric adjustment        |                   |              | (CP-M)
Minimum relative entropy update     | (TC)              |              | (MRE-M)

Table 4.1: An overview of the methods and types of forecasts that are compared in this chapter. The columns give the forecast used: tercile constraints (TC), a conditional distribution (N) or a conditional mean and variance (M); the entries give the abbreviations used for the results.

In Stedinger and Kim (2010), forecast information of type TC is converted to type N, using the assumption of a normal distribution. The rationale behind this is that a forecast of type TC is likely to represent a smooth underlying distribution. Results of the pdf-ratio method using forecast N are then compared with the Croley nonparametric adjustment using forecast TC and with CP-M. We compare results of the MRE-update, with both tercile probability constraints (TC) and constraints on mean and standard deviation (MRE-M), to the results of the Croley nonparametric adjustment, the Croley parametric adjustment, and the pdf-ratio method. For the Croley methods, we use the same constraints as for the MRE-update, for both forecast TC (results: TC) and forecast M (results: CP-M). For CP-M, the MRE objective (Eq. 4.1) is replaced by the minimum squared adjustment (equation 4.9; see Croley (2003)):

min_{q_1...q_n} Σ_{i=1}^{n} (q_i − p_i)²    (4.9)

4.4.1 Results in a theoretical test case

From the results it becomes clear that there is a large difference between forecasts in the form of probability triplets and forecasts in the form of moments. When the deviations from the original moments are small, the results for the methods using moments and the pdf-ratio method are similar. When the deviations become larger, the Croley parametric method shows clearly different behavior, while the MRE-update and the pdf-ratio method show very similar results in many cases. The results are presented as graphs showing the weights of the individual traces as a function of the value they represent and the cumulative weights, which form an empirical cumulative distribution function (CDF). Next to the graphs, a number of tables shows the resulting tercile probabilities, moments and relative entropies. The relative entropy of each case can be interpreted as the reduction in uncertainty or the information added by the forecast.

[Figure 4.3: two panels, the discrete probabilities P(X = x) and the discrete CDF P(X < x) against x, for the climatic weights and the TC, pdf-N, CP-M and MRE-M updates; the CDF panel also shows the target-N distribution.]
Figure 4.3: Resulting ensemble member probabilities for µ1 = 3, σ1 = 0.5 and the resulting empirical cumulative distribution.
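The weights plotted in figures 4.3-4.6 and summarized in tables 4.2-4.4 can be reproduced approximately with the following minimal Python sketch (added for illustration; the original study did not publish code and may have used different tooling, and the forecast values µ1 = 4, σ1 = 0.5 are only one example case). It builds the 50-member Hazen sample from N(3, 1) and computes weights with the MRE-update (Eqs. 4.1, 4.3, 4.7, 4.8), with the Croley parametric objective (Eq. 4.9) under the same constraints, and with the pdf-ratio method assuming normal climatic and forecast distributions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Smooth climatic "ensemble": 50 values at Hazen plotting positions of N(3, 1)
n = 50
x = norm.ppf((np.arange(1, n + 1) - 0.5) / n, loc=3, scale=1)
p = np.full(n, 1.0 / n)                        # original uniform weights

mu1, sigma1 = 4.0, 0.5                         # example forecast of type M

def rel_entropy_bits(q):                       # objective (4.1), expressed in bits
    q = np.clip(q, 1e-12, None)
    return float(np.sum(q * np.log2(q / p)))

def squared_adjustment(q):                     # Croley parametric objective (4.9)
    return float(np.sum((q - p) ** 2))

constraints = [
    {'type': 'eq', 'fun': lambda q: q.sum() - 1.0},                      # (4.3)
    {'type': 'eq', 'fun': lambda q: q @ x - mu1},                        # (4.7)
    {'type': 'eq', 'fun': lambda q: q @ (x - mu1) ** 2 - sigma1 ** 2},   # (4.8)
]
bounds = [(0.0, 1.0)] * n                      # (4.4); never binding for the MRE objective

q_mre = minimize(rel_entropy_bits, p, bounds=bounds, constraints=constraints,
                 method='SLSQP', options={'maxiter': 1000}).x
q_cp = minimize(squared_adjustment, p, bounds=bounds, constraints=constraints,
                method='SLSQP', options={'maxiter': 1000}).x

# pdf-ratio method (pdf-N): weight by conditional/marginal density ratio, then normalize
w = p * norm.pdf(x, mu1, sigma1) / norm.pdf(x, 3, 1)
q_pdf = w / w.sum()

for name, q in [('MRE-M', q_mre), ('CP-M', q_cp), ('pdf-N', q_pdf)]:
    mean = q @ x
    std = np.sqrt(q @ (x - mean) ** 2)
    print(f'{name}: mean {mean:.2f}, std {std:.2f}, '
          f'added information {rel_entropy_bits(q):.3f} bits, '
          f'weights ~zero: {(q < 1e-6).sum()}')
```

For moderate adjustments, the printed moments and added information (in bits) can be compared directly with the corresponding rows of tables 4.3 and 4.4.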
[Figure 4.4: same panel layout as Figure 4.3.]
Figure 4.4: Resulting ensemble member probabilities for µ1 = 4, σ1 = 0.5 and the resulting empirical cumulative distribution.

[Figure 4.5: same panel layout as Figure 4.3.]
Figure 4.5: Resulting ensemble member probabilities for µ1 = 3, σ1 = 1.2 and the resulting empirical cumulative distribution.

[Figure 4.6: same panel layout as Figure 4.3.]
Figure 4.6: Resulting ensemble member probabilities for µ1 = 4, σ1 = 1.2 and the resulting empirical cumulative distribution.

True moments    Target prob.     Estimated probabilities
                                 TC               pdf-N            MRE-M            CP-M
µ1     σ1       p_b     p_a      p_b     p_a      p_b     p_a      p_b     p_a      p_b     p_a
3.00   0.25     0.043   0.043    0.043   0.043    0.049   0.049    0.050   0.050    0.047   0.047
2.00   0.50     0.873   0.002    0.873   0.002    0.880   0.002    0.881   0.002    0.886   0.000
3.00   0.50     0.195   0.195    0.195   0.195    0.205   0.205    0.205   0.205    0.227   0.227
4.00   0.50     0.002   0.873    0.002   0.873    0.002   0.880    0.002   0.881    0.000   0.885
4.50   0.50     0.000   0.984    0.000   0.984    0.000   0.986    n.a.    n.a.     n.a.    n.a.
5.00   0.50     0.000   0.999    0.000   0.999    0.000   0.999    n.a.    n.a.     n.a.    n.a.
3.00   1.00     0.333   0.333    0.333   0.333    0.340   0.340    0.342   0.342    0.342   0.342
3.00   1.20     0.360   0.360    0.360   0.360    0.364   0.364    0.375   0.374    0.382   0.382
4.00   1.20     0.117   0.682    0.117   0.682    0.127   0.669    0.142   0.712    0.110   0.840
4.50   1.20     0.054   0.814    0.054   0.814    0.064   0.791    0.097   0.843    0.081   0.919
5.00   1.20     0.021   0.905    0.021   0.905    0.029   0.877    0.071   0.931    0.071   0.931

Table 4.2: Resulting tercile probabilities for the four methods compared.

Target (assuming normal)          Mean                               Standard deviation
p_b     p_a     µ1     σ1         TC     pdf-N   CP-M   MRE-M        TC     pdf-N   CP-M   MRE-M
0.043   0.043   3.00   0.25       3.00   3.00    3.00   3.00         0.41   0.25    0.25   0.25
0.873   0.002   2.00   0.50       2.07   2.00    2.00   2.00         0.61   0.50    0.50   0.50
0.195   0.195   3.00   0.50       3.00   3.00    3.00   3.00         0.76   0.50    0.50   0.50
0.002   0.873   4.00   0.50       3.93   4.00    4.00   4.00         0.61   0.50    0.50   0.50
0.000   0.984   4.50   0.50       4.05   4.52    n.a.   n.a.         0.52   0.50    n.a.   n.a.
0.000   0.999   5.00   0.50       4.07   4.96    n.a.   n.a.         0.51   0.41    n.a.   n.a.
0.333   0.333   3.00   1.00       3.00   3.00    3.00   3.00         0.98   0.99    1.00   1.00
0.360   0.360   3.00   1.20       3.00   3.00    3.00   3.00         1.01   1.13    1.20   1.20
0.117   0.682   4.00   1.20       3.61   3.84    4.00   4.00         0.88   1.04    1.20   1.20
0.054   0.814   4.50   1.20       3.81   4.19    4.50   4.50         0.75   0.95    1.20   1.20
0.021   0.905   5.00   1.20       3.95   4.47    5.00   5.00         0.64   0.84    1.20   1.20

Table 4.3: Resulting mean and standard deviation for the various methods.

Target          Assuming normal    Resulting divergence (relative entropy) from the original distribution (bits)
µ1     σ1       p_b     p_a        TC       pdf-N    CP-M     MRE-M
3.00   0.25     0.043   0.043      1.132    1.324    1.379    1.324
2.00   0.50     0.873   0.002      1.002    1.174    1.251    1.181
3.00   0.50     0.195   0.195      0.257    0.459    0.500    0.459
4.00   0.50     0.002   0.873      1.002    1.174    1.251    1.178
4.50   0.50     0.000   0.984      1.438    2.081    n.a.     n.a.
5.00   0.50     0.000   0.999      1.547    3.356    n.a.     n.a.
3.00   1.00     0.333   0.333      0.001    0.000    0.000    0.000
3.00   1.20     0.360   0.360      0.005    0.036    0.081    0.078
4.00   1.20     0.117   0.682      0.371    0.561    1.333    0.949
4.50   1.20     0.054   0.814      0.712    1.105    2.870    2.283
5.00   1.20     0.021   0.905      1.035    1.709    5.274    5.274

Table 4.4: Resulting relative entropy for the different methods.

The codes in tables 4.2-4.4 and the legends of figures 4.3-4.6 correspond to the different methods given in table 4.1. For the case of tercile probability constraints (forecast TC), the MRE-update (with objective eq. 4.1) always results in exactly the same block adjustment as the Croley nonparametric adjustment (objective eq. 4.9). For tercile constraints, the minima of the objective functions thus coincide. The identical results for these two methods are indicated with TC. Pdf-N indicates the pdf-ratio method, using normal climatic and normal forecast distributions (forecast N). The original smooth sample with uniform weights, drawn from the climatic normal distribution, is also shown (climatic). In the cumulative distribution plots, the cumulative distribution function of the target distribution used in the pdf-ratio method is also plotted. The comparison is made for a hypothetical case, with the various combinations of µ1 and σ1 as described in Stedinger and Kim (2010). For the case of tercile probability constraints (TC), target tercile probabilities were derived from µ1 and σ1, assuming a normal distribution. Figures 4.3-4.6 show the assigned probabilities for individual ensemble members against their x-values. The right graphs in these figures show the corresponding discrete approximations of the cumulative distribution functions (CDF), using Hazen plotting positions, following Stedinger and Kim (2010). Table 4.2 shows the target and resulting tercile probabilities for below and above normal conditions (p_b and p_a) for the methods, while table 4.3 shows the resulting means and standard deviations. Table 4.4 shows the resulting relative entropy for the set of ensemble member probabilities, relative to the original uniform distribution. The "n.a." entries correspond to combinations of µ1 and σ1 constraints for which the optimization-based methods (CP-M) and (MRE-M) were not able to find a solution. This means that those µ1 and σ1 combinations are not achievable with the given sample.

Results of the methods compared

Small rounding errors are caused by the limited number of ensemble members and the way the original sample is drawn. The effects become apparent for the case µ1 = µ0 = 3 and σ1 = σ0 = 1 (no new information). Firstly, the discrete approximation of the outer tercile probabilities, with 17 of the 50 members in each outer tercile, results in probabilities of 0.34 rather than 1/3. Secondly, the standard deviation of the original sample is not exactly one, but 0.987. When ignoring small differences due to these numerical effects, table 4.2 shows that for all methods p_b and p_a match the assumed target reasonably well for cases with σ1 ≤ 1. For the cases with increased variance, the methods using moment constraints show somewhat larger deviations from the target probabilities p_a and p_b.
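The two discretization effects mentioned above can be verified with a short sketch (added for illustration; it only reconstructs the Hazen sample described in section 4.4):

```python
import numpy as np
from scipy.stats import norm

n = 50
x = norm.ppf((np.arange(1, n + 1) - 0.5) / n, loc=3, scale=1)   # Hazen sample of N(3, 1)

lower, upper = norm.ppf([1 / 3, 2 / 3], loc=3, scale=1)         # climatic tercile boundaries

print((x <= lower).sum())   # 17 members in the lower tercile -> discrete probability 0.34
print((x >= upper).sum())   # 17 members in the upper tercile
print(x.mean(), x.std())    # mean 3.0 and a standard deviation slightly below one (0.987 in the text)
```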
Results for the mean-variance forecast (type M) show a difference between the MRE-update (MRE-M) and the Croley parametric adjustment (CP-M). The latter tends to result in more ensemble member probabilities set to zero; see Figs. 4.3-4.6. Although the results are different, both satisfy the constraints given by the forecast. The difference in results is purely due to the difference in objective function. Naturally, because the moments are imposed as constraints, both methods exactly match the target mean and standard deviation (table 4.3), just as the results for the methods using tercile constraints (TC) exactly match p_b and p_a (table 4.2).

Information contained in forecasts

The relative entropies of the resulting discrete distributions, as compared to the original uniform distribution, are shown in table 4.4. These relative entropies are a direct measure of the amount of information added to the ensemble. In all cases, the result for TC has the lowest relative entropy, followed in turn by pdf-N, MRE-M and CP-M. The relative entropy resulting from the MRE-update is the uncertainty reduction by the information in the forecast. Hence we can see that the forecast of type TC is less informative than type M (compare TC and MRE-M). The entropies for MRE-M in table 4.4 also show that larger shifts in the mean result in larger relative entropy. This corresponds to the intuition that forecasts add more information when they deviate more from climatology. Because the pdf-N and CP-M methods have no information-theoretical founding, the relative entropy resulting from those methods does not say much about the amount of information in the forecast, but does indicate the uncertainty reduction in the ensemble. Because CP-M has higher relative entropy than MRE-M, we can say that the former introduces information (reduction of uncertainty) that is not present in the forecast. This will be further explained in the discussion in section 4.4.2. For cases with reduced variance and not too large a shift in the mean, the results of pdf-N very closely resemble MRE-M. Apparently, the information contained in the mean and variance constraints is the same information contained in a normal distribution. This is related to the fact that the maximum entropy distribution for a given mean and variance is a normal distribution; see appendix A. In general, there is a duality between sufficient statistics and the constraints of a maximum entropy distribution; see Jaynes (2003), page 520. If forecast information is given as constraints on conditional tercile probabilities (TC), there are infinitely many adjustments possible that satisfy those constraints. However, the adjustment that minimizes relative entropy is a block adjustment. Generally, when information about a distribution is given in the form of constraints on quantile probabilities, the maximum entropy distribution is piecewise uniform. This also holds for the distribution of probabilities over the discrete scenarios. For the case of tercile constraints, the objective of minimum squared deviations of the Croley method has its minimum in the same location, leading to results identical to MRE. It might seem strange that results for the MRE-update depend on how the forecast information is presented. Forecasts TC and M give completely different results.
However, considering the tercile probabilities (TC) and moments (M) as different ways to present the same information implies that some underlying information is present, i.e. that we know more about the information than what is presented. In this example, forecast M contains the extra knowledge that the forecast distribution is normal. Taking the information-theoretical viewpoint, forecasts TC and M contain different information. If we do not know anything but these forecasts, we have to take them literally and thus we get different results for the two forecasts. Forecasts of type TC appear to be less informative than forecasts of type M; see table 4.4. Moreover, forecast TC does not seem an appropriate way to represent a smooth forecast distribution. Deriving a mean and standard deviation (forecast M) as given variables from tercile probabilities (forecast TC), as was done in Stedinger and Kim (2010), implicitly introduces the assumption that the forecast distribution is normal and hence results in a smooth update.

MRE compared to Croley

The MRE-update (MRE-M) and the Croley parametric method (CP-M) both use a forecast of type M. Although the same information is used by both methods, they lead to different results. Logically, because it is the objective, the MRE-update results in a smaller divergence (relative entropy) than CP-M (table 4.4). This means that the MRE-update retains more of the climatic uncertainty. However, the Croley parametric method uses the same constraints as the MRE-update, so the amount of forecast information used is the same. Table 4.3 shows that both results are equally consistent with the forecast, because the mean and standard deviation are exactly reproduced in both cases. Consequently, from an information-theoretical point of view, we can say that the Croley method makes an unnecessary extra deviation from climatology, not motivated by the forecast. The higher relative entropy means uncertainty is reduced by artificially introduced information that is not actually there in the forecast. The minimum squared deviation objective therefore results in an over-confident adjustment. This is also demonstrated by the fact that several probabilities are set to zero, without having information that explicitly rules out those scenarios as representing possible future conditions.

MRE compared to pdf-ratio, using equivalent forecasts M and N

Because the maximum entropy distribution for a given mean and variance is a normal distribution, forecast M implies a normal forecast distribution. When the forecast used in the pdf-ratio method is a normal distribution (forecast type N) with the mean and standard deviation of forecast type M as parameters, forecasts N and M are equivalent and add the same information. Therefore, the differences in the resulting weights for pdf-N and MRE-M are purely due to the methods. MRE-M gives results similar to the pdf-ratio method (pdf-N) for adjustments that are not too large and where the ensemble is sufficiently dense; see fig. 1. In cases where the adjustment is large and the sample values do not cover the entire forecast distribution range, the approximation of the forecast distribution needs probability mass in the range outside the sample values. In the pdf-ratio method, this results in some "missing" probability. This can clearly be seen in figure 4.6 (right), where the value of the target CDF at the highest sample value is still far from one.
The pdf-ratio method (pdf-N) needs a large normalization step in these cases. All ensemble member probabilities are multiplied by the same factor, to make them sum to one. This results in deviations from the target mean and variance, because the missing probability outside the sample range is divided equally over the traces. Although this leads to smooth adjustments, it also results in weighted ensembles that do not conserve the mean and variance of the new information, possibly biasing results of planning and risk analysis. The (MRE-M) method distributes the missing probability to the sample values in a way to match exactly the target mean and variance, as long as the constraints do not become over-restrictive. Especially when a high adjustment in the mean and a small variance are required, the problem might become infeasible (the “n.a.” entries in the tables). An infeasible MRE-update indicates that the new information is conflicting with the historical ensemble and use of a weighted ensemble may be questionable. When the MRE-update is asked to match the resulting moments of the weighted ensemble resulting from pdf-N, the results of MRE-M and pdf-N are identical. Conversely, when pdf-N is forced to exactly match the target moments from the forecast, it will yield a result identical to MRE-M with the original target moments. This can, for example, be achieved by changing the parameters of the target distribution of pdf-N in an optimization, until the resulting moments after normalization exactly match the targets. In appendix A, it is shown analytically why the methods yield the same results. It is also shown that the Croley parametric method (CP-M) results in a second order approximation of the MRE-result. 4.4. Theoretical test case on a smooth sample and comparison to existing methods 73 4.4.2 Discussion An information-theoretical view The previous results showed that the pdf-ratio method does not exactly match the target moments in the case of large shifts. Stedinger and Kim (2010) discussed whether it is desirable to exactly match target moments, arguing that if the moments in the forecast can not be trusted completely, it might be better to not exactly match them. The question can then be asked what justifies that deviation and in what way the resulting moments should deviate from the forecast. We now take a look at this problem from an informationtheoretical perspective. In the information-theoretical framework, the information is the reduction in uncertainty. A requirement of the distribution of updated weights is therefore that the uncertainty is maximum, given a quantity of information that is added. If less information is taken from the forecast because it is not completely trusted, the maximum permissible reduction in uncertainty will also be less, and vice versa. This can be visualized as a tradeoff between forecast information lost and uncertainty lost. Figure 4.7 shows the tradeoff as a Pareto front. Points below the Pareto front are not attainable, because the change in weights to include a given portion of the forecast information inevitably leads to a given minimum loss of uncertainty. Solutions above the Pareto front, however, lose more uncertainty than permitted by the information. In other words, these weighted ensembles incorporate a gain in information that did not come from the forecast. 
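To make the difference between the two objective functions concrete, the sketch below reweights a smooth 50-member sample under identical mean and variance constraints, once with the relative-entropy objective of the MRE-update and once with a quadratic (Croley-type) objective. It is an illustrative reconstruction using scipy, not the implementation used for the results in this chapter, and the sample values and target moments are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def reweight(x, mu1, sigma1, objective="mre"):
    """Reweight an equally likely ensemble x so that the weighted mean and
    standard deviation match the forecast (mu1, sigma1).

    objective="mre":  minimize relative entropy sum(q*log(q*N))   (MRE-update)
    objective="quad": minimize squared deviations sum((q-1/N)^2)  (Croley-type)
    """
    n = len(x)
    q0 = np.full(n, 1.0 / n)
    if objective == "mre":
        f = lambda q: np.sum(q * np.log(np.maximum(q, 1e-12) * n))
    else:
        f = lambda q: np.sum((q - 1.0 / n) ** 2)
    cons = (
        {"type": "eq", "fun": lambda q: q.sum() - 1.0},                     # probabilities sum to one
        {"type": "eq", "fun": lambda q: q @ x - mu1},                       # weighted mean constraint
        {"type": "eq", "fun": lambda q: q @ (x - mu1) ** 2 - sigma1 ** 2},  # weighted variance constraint
    )
    res = minimize(f, q0, bounds=[(0.0, 1.0)] * n, constraints=cons, method="SLSQP")
    return res.x

# smooth climatological sample, loosely in the spirit of the theoretical test case
x = np.linspace(0.5, 5.5, 50)
for name in ("mre", "quad"):
    q = reweight(x, mu1=3.5, sigma1=0.8, objective=name)
    dkl_bits = np.sum(q[q > 0] * np.log2(q[q > 0] * len(x)))  # uncertainty lost w.r.t. uniform weights
    print(f"{name}: zero weights = {np.sum(q < 1e-6)}, D_KL from climate = {dkl_bits:.3f} bits")
```

Because both solutions satisfy the same constraints and the MRE-update minimizes the divergence from climatology among all such solutions, the quadratic solution can never lose fewer bits than the MRE solution; in line with table 4.4, it typically loses more and sets more member probabilities to zero.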
The Pareto front in figure 4.7 was obtained by formulating a multi-objective optimization problem and solving it with the fast global optimization algorithm AMALGAM, developed by Vrugt and Robinson (2007). The problem consisted of minimizing two objectives by finding Pareto-optimal vectors Q of 50 weights for the ensemble members. The first objective is the maximization of the entropy of the weights, here plotted as the minimization of the difference with the entropy of uniform weights (Eq. 4.10). This objective is plotted on the vertical axis.

min_Q { H_uniform − H(Q) }    (4.10)

The second objective is the minimization of the Kullback-Leibler divergence of the sought distribution from the closest distribution that exactly matches the target moments. This objective, plotted on the horizontal axis, measures the information loss with respect to the exact forecast.

min_Q { D_KL(P_target || Q) }    (4.11)

where

P_target = arg min_P { D_KL(P || Q) }    (4.12)

and P_target is subject to the constraints on sum, mean and variance in equations 4.3, 4.7 and 4.8.

Figure 4.7: Pareto front showing the trade-off between losing as little uncertainty as possible (D_KL from climate, in bits, on the vertical axis) and losing as little information in the forecast as possible (D_KL to target µ and σ, in bits, on the horizontal axis). The different points of the front represent different levels of trust in the forecast. The result is shown here for µ1 = 4.5 and σ1 = 0.8.

From the figure, it can be seen that the climatic distribution and the MRE-update both lie on the Pareto front. In contrast, the Croley parametric method is not Pareto-optimal according to these criteria. Although it exactly matches the forecast and loses no information from the forecast, it does so with more reduction in uncertainty than strictly needed. The pdf-ratio method also does not reach the Pareto front, although it comes close in many cases. However, it is conjectured that when the forecast is seen as two separate pieces of information, one about the mean and one about the variance, the pdf-ratio solution would lie on the Pareto front in a three-dimensional space, where lost information with respect to mean and variance would be plotted on separate axes.

While all solutions on the Pareto front indicate rational sets of weights given varying degrees of trust in the forecast, the MRE-update completely trusts the forecast. The onus is therefore on the forecaster to reflect both the information and the uncertainty about the variable under consideration in that forecast. The MRE-update reflects exactly this information and uncertainty in the weighted ensemble, and does not further increase or decrease the uncertainty. A discussion on why forecasters should communicate carefully chosen summary statistics or preferably their entire probability estimates can be found in Weijs et al. (2010a); see also chapter 5.

About the use of the weighting methods

When forecast TC is received, additional information should be gathered about the moments, support set, or distribution types to assume. If really no other information is available, the MRE-update can be used directly with the forecast, resulting in a block adjustment.
When information about moments or other appropriately summarized statistics 4.4. Theoretical test case on a smooth sample and comparison to existing methods 75 of the forecast distribution is available, the MRE-update is the most suitable method, as it exactly uses the available information and does not make implicit assumptions. The use of the optimization based adjustments proposed by Croley (1996; 1997; 2003) are a second order approximation of the MRE-update, but can introduce a reduction of uncertainty that is not supported by the forecast information. The MRE objective function should be preferred over the quadratic objective on information-theoretical grounds. For implementing the MRE-update in places where the Croley method is applied, it suffices to replace the quadratic objective function by relative entropy. This also resolves the problem of many probabilities set to zero. Because the pdf-ratio method does not need to solve an optimization problem, it is easier to apply and faster then the MRE-update. Another advantage of the pdf-ratio method is that it is relatively easy to include a large amount of information, included in estimated climatic and forecast distributions. In many practical cases, the forecast distribution lies well within the climatic distribution, and normalization is not required in the pdf-ratio method. In those cases, the pdf-ratio method provides a fast and correct adjustment, given that no unfounded assumptions are introduced in the estimation of climatic and forecast distributions. When an extra optimization is done to exactly match the target moments, it can be used as a fast solver for the MRE-update. The MRE-update uses the full information from the forecast, provided the information contained in the forecast distribution can be converted into mathematical constraints for the optimization problem. In principle the MRE-update offers possibilities to include constraints on for example skew, variance of the log-transformed variable, other quantiles or correlations in a multivariate setting. Many known parametric distributions are in fact maximum entropy distributions for combinations of these types of constraints; see e.g. Singh and Singh (1985); Singh and Guo (1995a). This offers the possibility to reformulate pdf-ratio problems as a MRE-update problem. Conversely, it allows fast parametric solution of the MRE-update by using the pdf-ratio method which is forced to exactly match the constraints. Making more use of all available information When we have more information available about the forecast distribution than only mean and variance, like the complete time series of the predictors and responses , it is possible to estimate a joint pdf for them. Bivariate kernel density estimators, as applied by Sharma (2000), would then be a good way to derive continuous climatic and target distributions for the pdf-ratio method. Once one has the joint pdf, the marginal climatic and conditional forecast pdfs can be derived from it and used in the pdf-ratio method. If the conditional distribution from the kernel density estimate can be summarized in a number of constraints, it can also be used in the MRE-update. Figure 4.8 shows for example how an extra constraint on skewness results in a different update. 
Figure 4.8: Resulting weights (left panel, "Discrete Probabilities") and CDF (right panel, "Discrete CDF") when an extra constraint on skewness is imposed in the MRE-update. The result is shown for µ1 = 3, σ1 = 0.5 and a target skewness of 2.

4.4.3 Conclusions from the theoretical test case

There is an important difference in the information that is conveyed by forecasts in the form of conditional tercile probabilities (TC) and forecasts in the form of a mean and a variance (M). The TC forecast is less informative and, taken literally, suggests a reweighting of the ensemble in the form of a block adjustment, following Croley (2001). Probability triplets therefore do not seem an appropriate way to convey forecast information.

For forecasts in the form of moments, the Croley parametric method (Croley, 2003) makes an adjustment that reduces the uncertainty represented by the ensemble more than is warranted by the information in the forecast. It excludes some of the scenarios in the ensemble by setting their probabilities to zero, without receiving information to justify that.

The pdf-ratio method (Stedinger and Kim, 2010), used with Gaussian distributions, does not use all information in the forecast. It results in an adjustment of the same form as the MRE-update, but with resulting moments that deviate from the moments of the forecast information. The distribution of weights that is found by this version of the pdf-ratio method is not Pareto-optimal in the two-dimensional trade-off of lost uncertainty versus lost forecast information. The solution loses more uncertainty than is justified by the partial information that is taken from the forecast. However, the method is Pareto-optimal in a 3D objective space, with as objectives minimum lost uncertainty, minimum lost information on the mean and minimum lost information on the variance. This results from the fact that the solution of the pdf-ratio method is identical to the MRE-update solution for the moments that result from it; see appendix A.

An adaptation of the pdf-ratio method is possible that adjusts the parameters of the target distribution in such a way that the resulting moments exactly match the target moments of the forecast (Jery Stedinger, personal communication). This offers an opportunity to significantly reduce the dimensionality of the optimization problem for the MRE-update in case of a mean-variance forecast. Instead of seeking values for all individual weights, it suffices to optimize the two parameters of the target normal distribution and the normalization factor. Appendix A shows that this amounts to finding the three Lagrange multipliers in the analytical solution to the MRE-update.

4.5 Multivariate case

Another important case is updating ensemble probabilities to reflect forecast information on multiple variables (Stedinger and Kim, 2010). For more background on the importance of the multivariate case, see Stedinger and Kim (2007, 2010). For all variables, constraints on mean and variance can be specified separately. Because the size of the ensemble stays the same, the dimensionality of the optimization problem does not increase.
The only difference is the addition of more constraints, which results in a slightly higher risk of the optimization problem becoming infeasible. However, tests show that for most practical problems, enough degrees of freedom exist to find a solution. Another important issue is the preservation of cross-correlations (Stedinger and Kim, 2010), especially in cases where risk depends on the joint occurrence of, for example, high water temperatures and low flows. Preservation of the cross-correlations can be ensured by imposing additional equality constraints on the weighted cross-correlation of the adjusted sample.

Although in this chapter we concentrate on the univariate case, we briefly show some results to demonstrate the potential of the MRE-update for multivariate updates as well. We consider the theoretical example from Stedinger and Kim (2010) for comparison, using the exact same data. We chose the same bubble plots to show the resulting weights as a function of both variables. The top plot in figure 4.9 shows the results for the MRE-update exactly matching mean and variance of both variables, but without explicitly preserving the initial cross-correlation by including it as a constraint. The middle plot shows the MRE-update result when a constraint enforced the preservation of the initial cross-correlation of 0.8. The bottom plot in figure 4.9 shows the resulting weights when the MRE-update is asked to exactly match the means, variances and cross-correlation resulting from the pdf-ratio method with a bivariate normal distribution (σ1x = 1.145, σ1y = 1.292, ρ1 = 0.751). Also for the multivariate case, it turned out that the MRE-update using means, variances, and cross-correlation is equivalent to the pdf-ratio method with a bivariate normal distribution, when its moments and cross-correlation would be forced to exactly match the targets.

4.6 Application to ESP forecasts

In this section, the various methods for generating weighted ESP forecasts are applied to a data set with hindcast ESP forecasts for the Columbia river basin in the Pacific Northwest of the USA. The data concerns a "climatic" ensemble. The ensemble traces are modeled flows from the VIC hydrological model (Wood et al., 1992) that were generated using different historical initial conditions and historical weather patterns from 1950 to 2005; see Wood et al. (2005) for a more detailed description of this data set.

Figure 4.9 (three panels: "Exactly matching target moments, no constraint on correlation"; "Exactly matching target moments, preserving correlation"; "Matching the resulting moments from Stedinger and Kim"): Results for the bivariate update of a sample with initial means {µ0x, µ0y} = {3, 3} and initial standard deviations {σ0x, σ0y} = {1, 1}; the new means and standard deviations are {µ1x, µ1y} = {3, 3} and {σ1x, σ1y} = {1.5, 1.5}. The area of the circles represents the weights. The upper two graphs result from the MRE-update and match these targets exactly. The middle plot maintains the original cross-correlation ρ0 = 0.8, while the upper plot results in a cross-correlation of 0.934. The bottom plot shows the result of the MRE-update for the target moments that are the resulting moments of the pdf-ratio method with a bivariate normal target. The resulting weights are identical to the pdf-ratio solution.
4.6.1 Seasonal forecast model

The predictor that was used is the ENSO climate index "Nino3.4" as defined by Trenberth (1997), obtained from the NOAA server². The inclusion of other information, like the phase of the PDO, can probably improve predictions further, but for the purpose of demonstrating the weighting methods, the simple linear regression model with ENSO is considered sufficient.

2. ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/sstoi.indices

Two different modes, "forecast" and "hindcast", were investigated. In hindcast mode, the whole data set (from 1950 to 2005) of streamflows and ENSO is used in the regression to derive the linear relation. For each forecast, the current year is excluded from the data set. In forecast mode, which is more representative of a real situation, only data up to the year that is forecast is used. For the forecast for 1970, for example, only the ENSO indexes and flows from 1950 to 1969 are used in the regression model. Furthermore, the ESP forecasts consist of historical weather patterns only from 1950 to 1969. The number of traces in the ESP forecast thus grows each year that a new forecast is made. To avoid problems with small ensemble sizes and little training data for the regression, a "warm-up" period of 20 years is used in forecast mode. The linear model that is found by the regression in hindcast mode is

Q_f = 235.33 − 10.96 · ENSO_{11..2}  (R² = 0.08)    (4.13)

where Q_f is the average flow in the months April to September and ENSO_{11..2} is the average ENSO index for November of the year before to February. In forecast mode, the regression coefficients varied from year to year. The deterministic forecasts were subsequently converted to normally distributed probabilistic forecasts using three different methods.

Mean and variance forecast

To obtain a forecast in the form of a mean and a variance of Q_f, the joint distribution of previous forecasts and observed average flows is used. The joint distribution is assumed to be a bivariate normal distribution and the parameters µ_{Qf}, µ_{Qobs}, σ_{Qf}, σ_{Qobs}, ρ are estimated using the values of Q_f and Q_obs over the years up to the forecast (forecast mode) or for the entire dataset (hindcast mode). Subsequently, the conditional mean and variance, given the actual forecast Q_f(t) from the regression model, can be calculated using

µ_Q(t) = µ_{Qobs} + ρ σ_{Qobs} (Q_f(t) − µ_{Qf}) / σ_{Qf}    (4.14)

σ_Q(t) = σ_{Qobs} √(1 − ρ²)    (4.15)

Kernel Density Estimate (KDE)

Kernel density estimation is a method to estimate an empirical distribution from a sample. It can be interpreted as a smoothed histogram, which is a sum of kernels around the sample values. Sharma (2000) describes a method to use bivariate kernel density estimation of forecasts and responses to derive the joint distribution and the conditional distribution of the response, given a certain value of a predictor (e.g. a climate index or a linear combination of several variables). This method was used to estimate a joint distribution of past ENSO indexes and past streamflow values. For the kernels, a correlated bivariate normal distribution was used, for which the parameters were estimated from the data, using the method described in Botev (2006).
This method introduces relatively few assumptions and is based on information-theoretic concepts. One still has to make an assumption about the kernel shape, but in most cases the result is not very sensitive to that (Wand and Jones, 1993). After the joint distribution has been estimated, the conditional distribution at the current predictor value can be found, which is a weighted sum of Gaussian kernels. The kernel weights and resulting kernel density estimates for the climatic and the conditional distribution are shown in Fig. 4.11. These distributions can then be used in the pdf-ratio method, yielding the updated weights for the ensemble traces, which are shown in the same figure.

ESP forecast with information about the hydrological initial state

Next to the weighted climatic ensembles, ESP forecasts were also used. The ESP ensembles are obtained by forcing a hydrological model with the observed weather patterns from historical observations, while the initial conditions are the same for each model run. These initial conditions are based on the hydrological state of the catchment in the year of the forecast. As a result of this, the ESP forecasts are based on information that is not available to the weighting methods. A direct comparison of the forecast skill can therefore not lead to conclusions about the weighting methods, but can lead to conclusions about the predictors (basin conditions vs. teleconnections); see also Wood et al. (2005); Wood and Lettenmaier (2008). The weighting can also be combined with the conditional ESP forecasts. The conditional ensemble contains the information about the initial state, in the form of an ensemble that differs from the climatological ensemble. Adjusting the weights of this conditional ensemble also adds the information from the teleconnections.

4.6.2 Results

Figure 4.12 shows how the probabilities are updated by the different methods. For the MRE-update on the conditioned ensemble, the weights found for the original ensemble were used. Because the streamflow magnitudes in the different years can reverse order when conditioned on the initial basin state, the weights are no longer a smooth function of the streamflows. Changes in order occur, for example, when a relatively dry but warm meteo-year is used as a meteo trace with snowy initial conditions. In the conditional forecast, the increased snowmelt gives a high flow in the melting season, while for the climatic forecast the flow might be one of the lowest.

Figure 4.10: Bivariate kernel density estimate of the joint distribution of the ENSO index (November–February) and the average streamflow (April–August).

Figure 4.11: An example of the kernel density estimate used in the pdf-ratio method for the year 2003 (hindcast mode).

Figure 4.12: Ensemble weights for 2003, plotted against the average streamflow of each trace from April to September (hindcast mode).
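A minimal sketch of this kernel-density variant of the pdf-ratio weighting is given below. It uses scipy's Gaussian kernel density estimator with its default bandwidth instead of the Botev (2006) bandwidth selection used here, and the synthetic ENSO and flow series are invented for illustration only; they are not the Columbia river data.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# synthetic history: ENSO index and seasonal flow, weakly negatively related
enso = rng.normal(0.0, 1.0, 56)
flow = 235.0 - 11.0 * enso + rng.normal(0.0, 55.0, 56)

joint = gaussian_kde(np.vstack([enso, flow]))   # bivariate KDE of (predictor, response)
clim = gaussian_kde(flow)                       # climatic (marginal) density of flow
marg_enso = gaussian_kde(enso)                  # marginal density of the predictor

def pdf_ratio_weights(traces, enso_now):
    """Weights for ensemble traces: conditional density / climatic density, normalized."""
    pts = np.vstack([np.full_like(traces, enso_now), traces])
    conditional = joint(pts) / marg_enso(enso_now)   # f(flow | enso_now)
    w = conditional / clim(traces)                   # pdf-ratio
    return w / w.sum()                               # normalization step

traces = flow   # ESP traces: here simply the historical flows
w = pdf_ratio_weights(traces, enso_now=1.5)
print("weighted mean:", round(float(w @ traces), 1), " climatic mean:", round(float(traces.mean()), 1))
```

The shift of the weighted mean away from the climatic mean illustrates the adjustment; as discussed in section 4.4, the resulting moments are not forced to match any target exactly.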
After the probabilities are updated, the empirical CDFs are shifted. The shifts differ between the various methods, as can be seen from figure 4.13. Weighting methods make vertical shifts to the points in the CDF, while the conditioned ensembles of the ESP forecasts have their CDF points shifted horizontally. The lower graphs make clear that, for this year, the change in CDF due to conditioning on the initial basin state is far larger than the change due to weighting based on ENSO information. The observed value for the average flow for the forecast period in 2003 was 188.7 m³/s. In this case, the shift towards lower values was correct and the forecasts, especially the ones in the lower graphs, were an improvement compared to the climatological forecast.

Over the whole time period, the forecasts were evaluated using the Ranked Probability Skill Score (RPSS; Epstein, 1969), which is common for these forecasts. The RPSS is defined as

RPSS = 1 − RPS / RPS_clim    (4.16)

where

RPS = Σ_{i=1}^{n} (P_i − O_i)²    (4.17)

and where P_i is the CDF value of the forecast for bin i out of n bins and O_i the CDF value of the observation, which is 0 below the observed value and 1 above it. RPS_clim is the RPS for the reference climatological CDF. An RPSS of 0 therefore indicates no skill over climatology and an RPSS of 1 indicates a perfect forecast. In chapter 5, an alternative skill score based on information theory is presented. In this chapter, the traditional RPSS is used.

The resulting scores for the different weighting methods are shown in figures 4.15 (forecast mode) and 4.14 (hindcast mode). It can be observed that the score for the pdf-ratio forecast using KDE fluctuates less, because the probability distribution is smoothed somewhat by the Gaussian kernels. The methods that depart more from the climatic distribution have a more fluctuating score, most notably the deterministic forecasts. In chapter 5 it is argued that this fluctuation in skill should in fact range between 1 and −∞. From table 4.5 it becomes clear that in terms of RPSS, all weighted ensemble probabilistic forecasts have some skill over climatology, while the deterministic forecasts do not. The strong variation of forecast quality between the different years makes it hard to draw definitive conclusions about the relative performance of the different weighting methods. Longer periods are necessary to demonstrate the practical advantages of one method over the other. In forecast mode, results are most representative of a real situation, but only a short time series is available for evaluation. The conditioned ensemble yields an RPSS of 0.4528 for uniform weights and 0.4310 for an MRE-updated set of weights (both in forecast mode).

Figure 4.13: The empirical CDFs for the different methods. The right figures are quantile–quantile (QQ) plots, clearly showing the changes by the weighting. The lower two graphs show the CDFs for the ESP forecasts that are conditioned on initial basin conditions.
In hindcast mode, the results are 0.4487 for uniform weights and 0.4298 for the updated weights. First of all, this shows that conditioning on initial basin conditions leads to far more skillful forecasts than weighting on the basis of ENSO. The deterioration when going from the ESP forecast to the weighted ESP forecast could be attributed to the fact that the initial basin conditions and the ENSO phase are not independent. By basing the weights on a regression of ENSO with the climatic flows, the total dependence of the flows on ENSO is reflected in the weights. The part of this dependence that is also present in the initial basin conditions is therefore used twice. Using the same information twice results in overconfident forecasts, which leads to a reduction in forecast skill. Using the MRE-update with weights based on a regression of ENSO with the conditioned ensemble prevents using the information twice. This leads to an RPSS of 0.0117 with the conditioned ensemble as a reference, showing that the weighting does not add much information over the conditional ESP forecast. Most of the ENSO information is already present in the basin conditions.

Figure 4.14: The RPSS for the weighted ensemble forecasts by the different weighting methods for the whole forecast period (hindcast mode).

Figure 4.15: The RPSS for the weighted ensemble forecasts by the different weighting methods for the period from 1970 (forecast mode, starting from 20 traces).

Method              RPSS forecast    RPSS hindcast
Croley                   0.0365           0.0228
MRE                      0.0479           0.0327
pdf-ratio                0.0323           0.0350
Hamlet                   0.0182           0.0214
ESP                      0.4528           0.4487
MRE-weighted ESP         0.4310           0.4298
pdf-ratio KDE            0.0247           0.0356
Deterministic           −0.3970          −0.4638

Table 4.5: The resulting skill scores for the different methods.
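For reference, the RPS and RPSS of Eqs. 4.16–4.17 can be computed as in the sketch below; the bin edges, CDF values and observation are made-up numbers chosen only to illustrate the calculation, not values from the Columbia river data set.

```python
import numpy as np

def rps(bins, forecast_cdf, observation):
    """Ranked probability score (Eq. 4.17): squared distance between the forecast CDF
    and the step-function CDF of the observation, summed over the bin edges."""
    obs_cdf = (bins >= observation).astype(float)
    return np.sum((forecast_cdf - obs_cdf) ** 2)

def rpss(bins, forecast_cdf, clim_cdf, observation):
    """Skill relative to climatology (Eq. 4.16): 1 = perfect, 0 = no skill over climate."""
    return 1.0 - rps(bins, forecast_cdf, observation) / rps(bins, clim_cdf, observation)

# toy example: five flow bins, a climatological CDF and a sharper forecast CDF
bins = np.array([150.0, 200.0, 250.0, 300.0, 350.0])
clim_cdf = np.array([0.10, 0.35, 0.65, 0.90, 1.00])
fcst_cdf = np.array([0.30, 0.70, 0.95, 1.00, 1.00])   # shifted towards low flows
print(round(rpss(bins, fcst_cdf, clim_cdf, observation=188.7), 3))
```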
This may indicate the kernel density estimates are over-fitted to the data. 4.7. Conclusions and recommendations 85 However, we can not draw any general conclusions based on this single case, which just served to demonstrate the different methodologies. 4.6.4 Conclusion The different weighting methods show a small positive skill compared to climatology. The skill fluctuates from year to year, which makes it difficult to draw significant conclusions from this case about the performance of the different weighting methods. The skill is also mostly determined by the quality of the forecasts and the predictors they are based on and less by the weighting method to combine the forecasts with an ensemble. However, the MRE-update shows promising results, also in a practical case. 4.7 Conclusions and recommendations In this chapter, we introduced the minimum relative entropy update (MRE-update) as an approach to update ensemble member probabilities. Our method is based on the minimization of relative entropy, with forecast information imposed as constraints. The main advantage of the method is that it optimally combines available climatic and forecast information, without introducing extra assumptions. Results were compared with three existing methods to make probability-adjustments to ensembles, based on different type of forecast information. We considered forecast information given in the form of conditional tercile probabilities, a normal distribution and a given mean and variance. Analysis of the results from an information-theoretical viewpoint explicitly revealed the differences in information contained in these different types of forecasts and the way existing methods interpret them. The block adjustment that results from the Croley nonparametric method may be undesirable due to the discontinuities in weights at the arbitrarily selected quantiles. However, the result is in line with the literal information-theoretical interpretation of the probability triplets. When interpreting the forecasts in any other way, information is added. It is important to be aware of this and think carefully about what information is added. Conversely, forecasts that require this extra interpretation are in fact incomplete, leaving too much interpretation to the user of the forecasts. Ideally, seasonal forecasts should provide pieces of information that are a good summary of the probability estimate. For smooth distributions, a mean and variance are more appropriate than probability triplets. See also chapter 5 and Weijs et al. (2010a) for more discussion on how forecasts should be presented to be most informative. The information contained in the mean-variance forecast and in a normal forecast distribution is the same. The MRE-update results in a weighted ensemble that exactly incorporates this information. The pdf-ratio method diverges from the forecast information by not exactly matching the given moments. From an information-theoretical perspective, it is unclear how to justify this divergence. A multi-objective optimization was performed to 86 Adding seasonal forecast information by weighting ensemble forecasts find a Pareto-front that represents the tradeoff between lost information from the forecast and lost initial uncertainty. An analysis of the methods in this objective-space revealed that in some cases, the pdf-ratio method reduces uncertainty more than is justified by the partial information taken from the forecast. This results in a solution that is not Paretooptimal. 
Also the Croley parametric method lies above the Pareto-front. It uses the full information from the forecast, but reduces uncertainty more than that information permits. One of the symptoms of this false certainty are the ensemble members which get weight zero, although nothing in the forecast rules them out. By definition of the chosen objectives, the MRE-update results in a Pareto-optimal solution in which the complete information from the forecast is used and no other information is added to the weighted ensemble. The pdf-ratio method has the advantage that it is fast and does not require an optimization search for all individual weights. An adaptation of the pdf-ratio method that includes a search for parameters that result in an exact match of the target moments is possible. This is equivalent to finding the values of the Lagrange multipliers for the analytical solution of the MRE-update (see appendix A) and results in an optimization problem that is much easier to solve. In this chapter, the equivalence for the univariate and bivariate normal distributions was demonstrated, but it is anticipated that similar results can be found for other distributions for which sufficient statistics exist. In a test case in a practical application, the variations in skill of the seasonal mean-variance forecasts for Columbia river streamflows were too large to enable strong conclusions on the performance of the weighting methods. The methods showed similar performance, which indicated a small skill over climatology, which was considerably less than the skill of ensembles that where conditioned on initial basin conditions. Optimally combining the information from ENSO with the basin state information is an important challenge. The ENSO-based forecasts have the advantage that they can be extended to longer lead times, and can be issued before the snow accumulates in the basin. For reservoir operation, having such forecasts, with a seasonal jump in predictive power, presents an interesting control problem, where it might be necessary to include knowledge of the future growth of information when snow falls into the operation strategy; see also the discussion in chapter 7. The MRE-update can incorporate information that can be formulated in terms of constraints. This chapter showed also how skew can be included. In addition, a multivariate example was given in which information in both means and variances was matched while also preserving initial cross-correlation. Generation weighted ensembles using information from non-parametric forecast pdfs, such for example obtained from kernel density estimation, remains an open issue. The Matlab-source code (still in development) for the MRE-update will be available from the website www.hydroinfotheory.net. Chapter 5 Using information theory to measure forecast quality “Ignorance is preferable to error, and he is less remote from the truth who believes nothing than he who believes what is wrong.” - Thomas Jefferson, 1781 Abstract - This chapter1 presents a score that can be used for evaluating probabilistic forecasts of discrete events. The score is a reinterpretation of the logarithmic score or Ignorance score, now formulated as the relative entropy or Kullback-Leibler divergence of the forecast distribution from the observation distribution. 
Using the information-theoretical concepts of entropy and relative entropy, a decomposition into three components is presented, analogous to the classical decomposition of the Brier score, which is also extended to the case of uncertain observations. The information-theoretical twins of the components uncertainty, resolution and reliability provide diagnostic information about the quality of forecasts. The overall score measures the the uncertainty that remains after the forecast. As was shown recently, information theory provides a sound framework for forecast verification. The new decomposition, which has proven to be very useful for the Brier score and is widely used, can help acceptance of information-theoretical scores in meteorology and hydrology. 5.1 Introduction Forecasts are intended to provide information to the user. Forecast verification is the assessment of the quality of a single forecast or forecasting scheme (Jolliffe and Stephenson, 2008). Verification should therefore assess the quality of the information provided by the 1. based on: – S.V. Weijs, R. van Nooijen, and N. van de Giesen. Kullback–Leibler divergence as a forecast skill score with classic reliability–resolution–uncertainty decomposition. Monthly Weather Review, 138, (9): 3387–3399, September 2010 – S.V. Weijs, G. Schoups, and N. van de Giesen. Why hydrological forecasts should be evaluated using information theory. Hydrology and Earth System Sciences ,14 (12), 2545–2558, 2010 – S.V. Weijs and N. van de Giesen, Accounting for observational uncertainty in forecast verification: an information–theoretical view on forecasts, observations and truth, Monthly Weather Review, early online release, 2011. 87 88 Using information theory to measure forecast quality forecast. It is important here to note the distinction between quality, which depends on the correspondence between forecasts and observations, and value, which depends on the benefits of forecasts to users (Murphy, 1993). In this chapter, it is assumed that the verification is intended to quantitatively measure quality. Several scores and visualization techniques have been developed that measure certain desirable properties of forecasts with the purpose of assessing their quality. One of the most commonly used skill scores (Stephenson et al., 2008) is the Brier score (BS) (Brier, 1950), which is applicable to probabilistic forecasts of binary events. The Brier skill score (BSS) measures the BS relative to some reference forecast, which is usually climatology. Murphy (1973) showed that the BS can be decomposed into three components; uncertainty, resolution and reliability. These components give insight into some different aspects of forecast quality. The first component, uncertainty, measures the inherent uncertainty in the process that is forecast. Resolution measures how much of this uncertainty is explained by the forecast. Reliability measures the bias in the probability estimates of the probabilistic forecasts. A perfect forecast has a resolution that is equal to (fully explains) the uncertainty and a perfect reliability. Information theory provides a framework for measuring information and uncertainty; see Cover and Thomas, 2006 for a good introduction. As forecast verification should assess the information that the forecaster provides to the user, using information theory for forecast verification appears to be a logical choice. A concept central to information theory is the measure of uncertainty named entropy. 
However, consulting two standard works about forecast verification, it was noted that the word entropy is mentioned only thrice in (Jolliffe and Stephenson, 2003) and not one single time in (Wilks, 1995). This indicates that the use of information-theoretical measures for forecast verification is not yet widespread, although some important work has been done by Roulston and Smith (2002); Ahrens and Walser (2008); Leung and North (1990); Kleeman (2002). Leung and North (1990) used information-theoretical measures like entropy and transinformation in relation to predictability. Kleeman (2002) proposed to use the relative entropy between the climatic and the forecast distribution to measure predictability. The applications of information theory in the framework of predictability are mostly concerned with modeled distributions of states and how uncertainty evolves over time. Forecast verification, however, is concerned with comparing observed values with the forecast probability distributions. Roulston and Smith (2002) introduced the Ignorance score, a logarithmic score for forecast verification, reinterpreting the logarithmic score (Good, 1952) from an information-theoretical point of view. They related their score to relative entropy between the forecast distribution and the “true PDF”, which they defined as “the PDF of consistent initial conditions evolved forward in time under the dynamics of the real atmosphere”. Ahrens and Walser (2008) proposed information-theoretical skill scores to be applied to cumulative probabilities of multi-category forecasts. Very recently, Benedetti (2010) showed that the logarithmic score is a unique measure of forecast goodness. He showed that the logarithmic score is the only score that simultaneously satisfies three basic requirements for such a measure. These requirements are additivity, locality (which he interprets as exclusive dependence on physical observations) and strictly proper behav- 5.2. Definition of the divergence score 89 ior. For a discussion on these requirements, see Benedetti (2010). Furthermore, Benedetti (2010) analyzed the Brier score and showed that it is equivalent to a second order approximation of the logarithmic score. He concludes that lasting success of the Brier score can be explained by the fact that it is an approximation of the logarithmic score. Benedetti also mentions the well-known and useful decomposition of the Brier score into uncertainty, resolution and reliability as a possible reason for its popularity. This chapter, follows a similar route as Benedetti, but from a different direction. From an analogy with the Brier score, it is proposed to use the Kullback-Leibler divergence (or relative entropy) of the observation from the forecast distribution as a measure for forecast verification. The score is named ‘divergence score’ (DS). When assuming perfect observations, DS is equal to the Ignorance score or logarithmic score, and can be seen as a new reinterpretation of Ignorance as the Kullback-Leibler divergence from the observation to the forecast distribution. By presenting a new decomposition into uncertainty, resolution and reliability, analogous to the well-known decomposition of the Brier score (Murphy, 1973), insight is provided in the way the divergence score measures the information content of probabilistic binary forecasts. The decomposition can help acceptance and wider application of the logarithmic score in meteorology and hydrology. 
Section 5.2 of this chapter presents the mathematical formulation of the DS and its components. Section 5.2 also shows the analogy with the Brier score components. Section 5.3 compares the divergence score with existing information-theoretical scores. It is shown that the DS is actually a reinterpretation of the Ignorance score (Roulston and Smith, 2002) and that one of the ranked mutual information scores defined by Ahrens and Walser (2008) is equal to the skill score version of DS, when the reliability component is neglected (perfect calibration assumed). A generalization to multi-category forecasts is presented in Section 5.4. The inherent difficulty found in formulating skill scores for ordinal category forecasts is also analyzed and leads to the idea that this can be explained by explicitly distinguishing between information and useful information for some specific user. This distinction provides some insights into the roles of the forecaster and the user of the forecast. Section 5.5 presents an application to a real data set of precipitation forecasts. Section 5.6.3 introduces the cross entropy score and another decomposition, which become relevant if the observations are not perfect and contain uncertainties. A paradox which arises from the use of deterministic forecasts is analyzed in section 5.7. Section 5.8 summarizes the conclusions and restates the main arguments for adopting the divergence score.

5.2 Definition of the divergence score

5.2.1 Background

By viewing the Brier score as a quadratic distance measure and translating it into the information-theoretical measures for uncertainty and divergence of one distribution from another, in this chapter an information-theoretical twin of the Brier score and its components is formulated. First, some notation is introduced, followed by a formulation of the Brier score. Then the information-theoretical concept of relative entropy is presented as an alternative scoring rule. In the second part of this section, it is shown how the new score can be decomposed into the classical Brier score components: uncertainty, resolution and reliability.

5.2.2 Definitions

Consider a binary event, like a day without rainfall or with rainfall. This can be seen as a stochastic process with two possible outcomes. The outcome of the event can be represented in a probability mass function (PMF). For the case of binary events, the empirical PMF of the event after the outcome has been observed is a two-element vector, denoted by o = (1 − o, o)^T. Assuming certainty in the observations, o ∈ {0, 1}. Therefore, o = (0, 1)^T if it rained and (1, 0)^T otherwise. Now suppose a probabilistic forecast of the outcome of the binary event is issued in the form of a probability of occurrence f. This can also be written as a forecast PMF f = (1 − f, f)^T, with f ∈ [0, 1]. If for example an 80% chance of rainfall is forecast, this is denoted as f = (0.2, 0.8)^T. The Brier score for a single forecast at time t measures the distance between observation and forecast PMFs by the squared Euclidean distance

BS_t = 2(f_t − o_t)² = (f_t − o_t)^T (f_t − o_t)    (5.1)

For a series of forecasts and observations, the Brier score is simply the average of the Brier scores for the individual forecasts.

BS = (1/N) Σ_{t=1}^{N} (f_t − o_t)^T (f_t − o_t)    (5.2)

Note that this is the original definition by Brier (1950). Nowadays, the Brier score is almost always defined as half the value of (5.2) (Ahrens and Walser, 2008).
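As a point of reference, the following minimal sketch evaluates Eq. 5.2 for a small set of binary forecasts. It is an illustration only; the data and function name are invented for the example, and the original factor-two convention of Brier (1950) is used.

```python
import numpy as np

def brier_score(f, o):
    """Brier score (Eq. 5.2, original Brier 1950 convention with factor 2):
    mean squared Euclidean distance between forecast and observation PMFs.
    f: forecast probabilities of occurrence, o: observed outcomes (0 or 1)."""
    f_pmf = np.column_stack([1.0 - f, f])
    o_pmf = np.column_stack([1.0 - o, o])
    return np.mean(np.sum((f_pmf - o_pmf) ** 2, axis=1))

f = np.array([0.8, 0.2, 0.6, 0.9])   # forecast probabilities of rain
o = np.array([1.0, 0.0, 0.0, 1.0])   # 1 = rain observed, 0 = dry
print(round(brier_score(f, o), 3))   # halve this value for the modern convention
```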
5.2.3 The divergence score

The entropy of a forecast distribution gives an indication of the uncertainty that is represented by that forecast. It measures the expected surprise S_f upon hearing the true outcome, when the state of knowledge is f, with respect to probability distribution f. For example, the uncertainty associated with a binary probabilistic forecast of 70% chance of precipitation, f = (0.3, 0.7)^T, is

H(f) = E_f{S_f} = − Σ_{i=1}^{n} [f]_i log [f]_i = 0.88 bits    (5.3)

where H(f) is the entropy of f, calculated in the unit "bits" because the logarithm is taken to base 2 (throughout this chapter). E_f denotes the expectation operator with respect to f, and n is the number of categories in which the outcome can fall, in this case 2.

Next to this entropy measure, introduced by Shannon (1948), there is also a definition of relative entropy, or Kullback-Leibler divergence (Kullback and Leibler, 1951). This is a measure for the expected amount of additional surprise a person is expected to experience, compared to another person having a more accurate and reliable probability estimate. Therefore, it is a relative uncertainty. The divergence score is based on this idea and measures the expected extra surprise that a person having the forecast f_t for instance t will experience, compared to a person knowing the observation o_t:

DS_t = D_KL(o_t || f_t) = E_{o_t}{S_{f_t} − S_{o_t}} = Σ_{i=1}^{n} [o_t]_i log ([o_t]_i / [f_t]_i)    (5.4)

The divergence score (DS) replaces the quadratic distance from the BS with the Kullback-Leibler divergence. For one single forecast, the DS functions as a scoring rule. It is the Kullback-Leibler divergence of the forecast distribution from the observation distribution over the n = 2 possible events i. The DS over a series of forecast-observation pairs measures the average divergence of the forecast distribution from the observation:

DS = (1/N) Σ_{t=1}^{N} D_KL(o_t || f_t)    (5.5)

The divergence score can be interpreted as the information gain when one moves from the prior forecast distribution to the observation distribution. When this information gain is zero, the forecast already contained all the information that is in the observation and therefore is perfect. If the information gain from the forecast to the certain observation is equal to the climatological uncertainty, the forecast did not contain more information than the climate, and therefore was useless. Another way to view the divergence score is as the remaining uncertainty about the true outcome, after having received the forecast; see Fig. 5.1.

5.2.4 Decomposition

The new score can be decomposed in a similar way as the Brier score. The classical decomposition of the Brier score (BS) into reliability (REL), resolution (RES) and uncertainty (UNC) is

BS = REL_BS − RES_BS + UNC_BS    (5.6)

BS = (1/N) Σ_{k=1}^{K} n_k (f_k − ō_k)² − (1/N) Σ_{k=1}^{K} n_k (ō_k − ō)² + ō(1 − ō)    (5.7)

Figure 5.1: The components measure the differences in uncertainty at different moments in the forecasting process (climatological, after forecast, after recalibration, after observation), judged in hindsight. The divergence score measures the remaining uncertainty after taking the forecast at face-value.
with N being the total number of forecasts issued, K the number of unique forecasts issued, ō = (1/N) Σ_{t=1}^{N} o_t the observed climatological base rate for the event to occur, n_k the number of forecasts with the same probability category and ō_k the observed frequency, given forecasts of probability f_k. The reliability and resolution terms in (5.7) are summations of some distance measure between two binary probability distributions, while the last term measures the uncertainty in the climatic distribution, using a polynomial of degree two. Now, information-theoretical twins are presented for each of the three quadratic components of the BS, using entropy and relative entropy. It is shown that they add up to the divergence score proposed earlier.

The first component, reliability, measures the conditional bias in the forecast probabilities. In the DS, it is the expected divergence of the observed probability distribution from the forecast probability distribution, both stratified (conditioned) on all issued forecast probabilities. In the ideal case, the observed frequency is equal to the forecast probability for all of the issued forecast probabilities. Only in this case is the reliability 0. This is referred to as a perfectly calibrated forecast. Note that reliability is defined in the opposite direction to the meaning of the word in the English language: a perfectly reliable forecast has a reliability of 0. The reliability can be calculated with

REL_DS = (1/N) Σ_{k=1}^{K} n_k D_KL(ō_k || f_k)    (5.8)

with N being the total number and K the number of unique forecasts issued, n_k the number of forecasts with the same probability category, ō_k the observed frequency distribution for forecasts in group k and f_k the forecast PMF for group k.

The second component, resolution, measures the reduction in climatic uncertainty. It can be seen as the amount of information in the forecast. In the DS, it is defined as the expected divergence of the conditional frequencies from the marginal frequency of occurrence. The minimum resolution is 0, which occurs when the climatological probability is always forecast or the forecasts are completely random. The resolution measures the amount of uncertainty in the observation explained by the forecast. In the ideal case the resolution is equal to the uncertainty, which means all uncertainty is explained. This is only the case for a deterministic forecast that is either always right or always wrong. In the last case, the forecast needs to be recalibrated.

RES_DS = (1/N) Σ_{k=1}^{K} n_k D_KL(ō_k || ō)    (5.9)

From Equation (5.9) it becomes clear that the resolution term is the expectation over all forecast probabilities of the divergence from the conditional probability of occurrence to the marginal probability of occurrence. In information theory this quantity is known as the mutual information (I) between the forecasts and the observation.

RES_DS = E_k{D_KL(ō_k || ō)}    (5.10)

= E_k{D_KL((ō|f_k) || ō)} = I(f; o)    (5.11)

The third component, uncertainty, measures the initial uncertainty about the event. This observational uncertainty is measured by the entropy of the climatological distribution (H(ō)). It is a function of the climatological base rate only and does not depend on the forecast. The uncertainty is maximum if the probability of occurrence is 0.5 and zero if the probability is either 0 or 1.
UNC_DS = H(ō) = − Σ_{i=1}^{n} [ō]_i log [ō]_i    (5.12)

Like for the BS, for a single forecast-observation pair, uncertainty and resolution are 0 and the total score is equal to the reliability, which acts as a scoring rule. Over a larger number of forecasts uncertainty approaches climatic uncertainty and reliability should go to zero if the forecast is well calibrated. In appendix B it is shown that, just like in the Brier score, the relation DS = REL − RES + UNC holds. The relation between the components and the total score (DS) is

DS = (1/N) Σ_{t=1}^{N} D_KL(o_t || f_t) = (1/N) Σ_{k=1}^{K} n_k D_KL(ō_k || f_k) − (1/N) Σ_{k=1}^{K} n_k D_KL(ō_k || ō) + H(ō)    (5.13)

Note that this decomposition is valid for forecasts of events with an arbitrary number of categories and is not restricted to the binary case.

5.2.5 Relation to Brier score and its components

For the binary case, the Brier score can be seen as a second order approximation of the divergence score (also noted by Benedetti (2010)). Both scores have their minimum only with a perfect forecast. When the forecast is not perfect, the Brier score is symmetric in the error in probabilities, while the divergence score is not, except for the case where the true forecast probability is 0.5. Therefore the divergence score, like the logarithmic score, is a double valued function of the Brier score (Roulston and Smith, 2002). Consequently, when two forecasting systems are compared, the forecasting system with the higher Brier score may have the lower divergence score.

The uncertainty component in the Brier score is a second order approximation of the uncertainty term in the divergence score (entropy), with the same location of zero uncertainty (100% probability of one of the two events) and maximum uncertainty (equiprobable events with 50% probability; see Fig. 5.2). In the Brier score, the maximum of the uncertainty component is 0.5, while in the divergence score it is 1 (bit).

Resolution in the Brier score is the variance of conditional mean probabilities. It is a mean of squared deviations from the climatic probability. Resolution in the divergence score is a mean of divergences. Divergences are asymmetric in probabilities. The resolution in both the Brier and the divergence score can take on values between zero and the uncertainty. In both scores, it can be seen as the amount of uncertainty explained. The resolution in the Brier score is the second order approximation of the resolution of the divergence score, satisfying the condition that the minimum is zero and in the same location (the climatic probability) and that the maximum possible value is equal to the inherent uncertainty of the forecast event; see Fig. 5.3.

Reliability in the Brier score is bounded between zero and one, while in the divergence score, the reliability can reach infinity; see Fig. 5.4. This is the case when wrong deterministic forecasts are issued. Generally the reliability in the divergence score is especially sensitive to events with near deterministic wrong forecasts. Overconfident forecasting is therefore sanctioned more heavily than in the Brier score.

5.2.6 Normalization to a skill score

Because the Brier score depends on the climatological probability, which is independent of the forecast quality, it is common practice to normalize it to the Brier skill score (BSS) with the climatology forecast as a reference. A perfect forecast is taken as a second reference.
For the DS it is possible to use the same normalization procedure, yielding the divergence skill score (DSS):

$$\mathrm{DSS} = \frac{\mathrm{DS} - \mathrm{DS}_{ref}}{\mathrm{DS}_{perf} - \mathrm{DS}_{ref}} = 1 - \frac{\mathrm{DS}}{\mathrm{DS}_{ref}} \qquad (5.14)$$

The score for a perfect forecast ($\mathrm{DS}_{perf}$) is zero. For the climatological forecast ($\mathrm{DS}_{ref}$), both resolution and reliability are 0 (perfect reliability, no resolution). The DSS therefore reduces to

$$\mathrm{DSS} = 1 - \frac{\mathrm{UNC} - \mathrm{RES} + \mathrm{REL}}{\mathrm{UNC}} = \frac{\mathrm{RES} - \mathrm{REL}}{\mathrm{UNC}} \qquad (5.15)$$

Figure 5.2: The uncertainty component of the Brier score is a second order approximation of the entropy, with coinciding minimum and maximum values. The uncertainty for the Brier score is divided by its maximum value of 0.5, to allow a clear comparison with the uncertainty term of the divergence score, measured in bits.

This leads to a positively oriented skill score that becomes one for a perfect forecast and zero for a forecast that always states the climatological probability. A completely random forecast whose marginal distribution equals climatology also receives a zero skill score. Negative (i.e. "worse than climate") skill scores are possible if the reliability is larger (worse) than the resolution. If the resolution is significant, calibration of the forecast can yield a positive skill score, meaning that a decision maker using the recalibrated forecast is better off than a decision maker using climatology or a random strategy. This shows the importance of looking at the individual components when diagnosing a forecast system's performance.

Figure 5.3: The resolution term of the divergence score is asymmetric in probability, while the Brier score resolution is not.

Summarizing, the divergence score and its components combine two types of measures to replace the quadratic components in the Brier score decomposition. Firstly, the quadratic distances between probability distributions are replaced by Kullback-Leibler divergences, which are asymmetric; care should therefore be taken over the direction in which the divergence is calculated. Secondly, the polynomial uncertainty term is replaced by the entropy of the climatological distribution. The total scores and components are visualized in Fig. 5.1.
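Continuing the sketch above (again illustrative rather than prescriptive code), the skill score follows directly from the components:

dss = 1.0 - ds / unc                           # Eq. (5.14) with DS_ref = UNC (climatology)
assert np.isclose(dss, (res - rel) / unc)      # Eq. (5.15)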
5.3 Relation to existing information-theoretical scores

5.3.1 Relation to the ranked mutual information skill scores

Ahrens and Walser (2008) proposed the ranked mutual information skill score (RMIS). The score is intended for use with multi-category forecasts, which will be treated later in this chapter. For the special case of forecasts of binary events, the RMIS$_O$ can be written as the mutual information between forecasts and observations divided by the entropy of the observations

$$\mathrm{RMIS}_O = \frac{I(f, o)}{H(\bar{o})} \qquad (5.16)$$

When comparing Eq. (5.16) with Eqs. (5.10), (5.12) and (5.15), it becomes clear that

$$\mathrm{RMIS}_O = \frac{\mathrm{RES}_{DS}}{\mathrm{UNC}_{DS}} \qquad (5.17)$$

This means that for the case of a binary forecast, RMIS$_O$ equals the DSS when the reliability is perfect (zero). In case the forecast is not well calibrated, RMIS$_O$ neglects the reliability component and measures the amount of information that would be available to a user after calibration. The DSS measures the information conveyed to a user taking the forecasts at face value. The individual components of the DSS also indicate the potentially extractable information as measured by the RMIS$_O$.

Figure 5.4: The reliability is asymmetric in the DS, while symmetric in the BS. In the BS, it is bounded, while in the DS, it can reach infinity.

5.3.2 Equivalence to the Ignorance score

Roulston and Smith (2002) defined the Ignorance score from the viewpoint of using the forecast probability distribution as the basis for a data compression scheme. The scoring rule measures the Ignorance or information deficit of a forecaster, compared to a person knowing the true outcome of the event ($j$). The Ignorance scoring rule is defined as

$$\mathrm{IGN} = -\log_2 f_j \qquad (5.18)$$

in which $f_j$ is the probability that the forecaster had assigned to the event that actually occurred. The Ignorance score is a reinterpretation of the logarithmic score by Good (1952). By expanding the relative entropy measure that is used as a scoring rule, it becomes clear that the divergence from the certain observation PMF ($o$) to the forecast PMF ($f$) is actually the same as the Ignorance (IGN) or the logarithmic score.

$$D_{KL}(o_t \,\|\, f_t) = \sum_{i=1}^{n} o_i \log\frac{o_i}{f_i} = o_{i \neq j}\log\frac{o_{i \neq j}}{f_{i \neq j}} + o_{i = j}\log\frac{o_{i = j}}{f_{i = j}} \qquad (5.19)$$

Because $o_{i \neq j} = 0$ and $o_{i = j} = 1$, this reduces to

$$D_{KL}(o_t \,\|\, f_t) = 0\log\frac{0}{f_{i \neq j}} + 1\log\frac{1}{f_{i = j}} = -\log f_j = \mathrm{IGN} \qquad (5.20)$$

This means that the divergence scoring rule presented in this chapter (DS) is actually equal to the Ignorance (IGN). The Ignorance is therefore not only "A scoring rule . . . closely related to relative entropy", as defined by Roulston and Smith (2002), but actually is a relative entropy itself. The difference lies in the distributions that are used to calculate the relative entropy. Roulston and Smith (2002) refer to a relation to the divergence between the unknown "true" distribution $p$ and the forecast distribution $f$; see Eq. 5.21.

$$D_{KL}(p \,\|\, f) = E_p[\mathrm{IGN}] - H(p) \qquad (5.21)$$

The divergence that is used in the divergence score is calculated from the PMF after the observation ($o$) instead of $p$. That makes the second term on the RHS vanish and IGN equal to the divergence. Using the decomposition presented in Eq. (5.13), the Ignorance score for a series of forecasts can now also be decomposed into a reliability, a resolution and an uncertainty component. This decomposition, until now only applied to the Brier score, has proven very useful to gain insight into the aspects of forecast quality. Furthermore, the new interpretation of the Ignorance score as the average divergence of observation PMFs from forecast PMFs links more straightforwardly to results from information theory.
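The equivalence is easy to check numerically for a single binary forecast (a self-contained snippet with arbitrary illustrative numbers):

import numpy as np

f_rain = 0.8                       # forecast probability for the event that occurred (rain)
ign = -np.log2(f_rain)             # Ignorance / logarithmic score, Eq. (5.18)
dkl = 1.0 * np.log2(1.0 / f_rain)  # divergence from the one-hot observation (0, 1) to the
                                   # forecast (0.2, 0.8); the 0*log(0/0.2) term vanishes, Eq. (5.20)
assert np.isclose(ign, dkl)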
5.3.3 Relation to information gain

Peirolo (2010) defines "information gain" as a skill score for probabilistic forecasts. The information gain is defined as

$$IG_f = \log_2\frac{f_j}{c_j} = \mathrm{IGN}_c - \mathrm{IGN}_f \qquad (5.22)$$

where $f_j$ denotes the probability attached to the event that actually occurred and $c_j$ the climatological probability of that event. For a series of forecasts, the score is simply the average over the different timesteps.

$$IG_f = \frac{1}{T}\sum_{t=1}^{T}\log_2\frac{f_{k(t)}}{c_{k(t)}} \qquad (5.23)$$

From this definition it becomes clear that the information gain, defined by Peirolo (2010) as a positively oriented skill score, is equal to the reduction in uncertainty from the climatic uncertainty to the remaining uncertainty after the forecast ($IG_f$ = UNC − DS). Alternatively, the information gain can be expressed as the correct information minus the wrong information ($IG_f$ = RES − REL).

5.4 Generalization to multi-category forecasts

5.4.1 Nominal category forecasts

When extending verification scores from forecasts of binary events to multi-category forecasts, it is important to differentiate between nominal and ordinal forecast categories. In the case of nominal forecasts, there is basically one question that is relevant for assessing their quality: how well did the forecaster predict the category to which the outcome of the event belongs? In nominal category forecasts, there is no natural ordering in the categories into which the forecast event is classified. For this case of forecast verification, there is no difference between the categories in which the event did not fall. Although the probability attached to those events conveys information at the moment the forecast is received, the only probability relevant for verification, after the observation has been made, is the probability that the forecaster attached to the event that did occur.

The quadratic score of Brier (1950) can also be used for multiple-category forecasts. In that case, $f_t$ and $o_t$ are the PMFs of the event before and after the observation, now having more than two elements. The problem with this score is that it depends on how the forecast probabilities are spread out over the categories that did not occur. For nominal events this dependency is not desirable, as all probability attached to the events that did not occur is equally wrong. The divergence score (DS) does not suffer from this problem, because it only depends on the forecast probability of the event that did occur; see Eq. 5.20. The DS as presented in Eq. 5.4 can directly be applied to nominal category forecasts. A property of the score is that a high number of categories makes it more difficult to obtain a good score. To compare nominal category forecasts with different numbers of categories, the DS should be normalized to a skill score (DSS); see Eq. 5.14.

5.4.2 Ordinal category forecasts

When dealing with forecasts of events in ordinal categories, there is a natural ordering in the categories. This means that the cumulative distribution function (CDF) starts to become meaningful. There are now two possible questions that can be relevant for verification of the probabilistic forecast.

1. How well did the forecaster predict the category to which the outcome of the event belongs?
2. How well did the forecaster predict the exceedence probabilities of the thresholds that define the category boundaries?

The first question is equal to the one that is of interest for nominal forecasts. However, in the ordinal case there is a difference between the categories in which the observed event did not fall. Forecasts of categories close to the one observed are preferred over categories more distant from the one observed. Therefore, skill scores for ordinal category forecasts are often required to be "sensitive to distance" (Epstein, 1969; Laio and Tamea, 2007; Murphy, 1971, 1970).
This requirement has led to the introduction of the ranked probability score (RPS) (Epstein, 1969), which is now widely used .The DS is not sensitive to distance in the sense of the RPS, because DS is insensitive to the forecasts for the non-occurring events. However, there still is an apparent sensitivity to distance introduced through the forecast PMF. A forecaster will usually attach more probability to the categories adjacent to the one that is considered most likely, simply because they are also likely. Therefore, missing the exact category of the observation with the most likely forecast still leads to a relatively 100 Using information theory to measure forecast quality low penalty in the score, if the uncertainty estimation of the forecaster was correct and significant probability was forecast for the other likely (often neighboring) categories. Over a series of forecasts, the apparent distance-sensitivity of the penalty given by the DS is therefore defined by the PMF of the forecaster alone, and independent of what the categories represent. In verification literature, the property of only being dependent on the probability assigned to the event that actually occurred is known as locality, which is often seen as a non-desirable property of a score. Whether or not locality is desirable can be questioned (Mason, 2008). In this chapter it is argued that in absence of a context of the users of the forecast there is no justification for using non-local scores, which require some sensible distance measure to be specified apart from the natural distance sensitivity introduced by the forecast PMF. When assessing value or utility of forecasts, as opposed to quality, non-local scores can be used; see also the analogy with a horse race on page 117. However, in that case the distance measure should depend on the users and associated decision processes, as these determine the consequences of missing the true event by a certain number of categories distance. In non-local verification scores, the distance measure is often not explicitly specified in terms of utility, making it unclear what is actually measured. In those cases, the utility function of the users becomes more like an emerging property of the skill score instead of the other way around. Benedetti (2010) also presents locality as a basic requirement for a measure of forecast goodness, interpreting locality as “exclusive dependence on physical observations”. He correctly states that it is a violation of scientific logic if two series of forecasts that assign the same probabilities to a series of observed events gain different scores, based on probabilities assigned to events that have never been observed. For a more elaborate treatment of this view on the fundamental discussion about locality, the reader is referred to Benedetti (2010) and Mason (2008). The second question, regarding the forecast quality of exceedence probabilities, differs from the first, because all the thresholds are considered at once. Therefore, the quality of a single forecast depends on the entire PMF of the forecast and not only on the probability forecast for the event that occurs. Therefore, scores that are formulated for cumulative distributions can never be local. This means that apart from the physical observations, the importance attached to the events influences the score. So some assumption about value is added and the score is not a pure measure of quality alone. 
The RPS evaluates the sum of squared differences in CDF values between the forecast and the observation of the event:

$$\mathrm{RPS} = \frac{1}{n-1}\sum_{m=1}^{n-1}\left[\left(\sum_{k=1}^{m} f_k\right) - \left(\sum_{k=1}^{m} o_k\right)\right]^2 \qquad (5.24)$$

The RPS can be seen as a summation of the binary Brier scores over all $n-1$ thresholds defined by the category boundaries. The summation implies that the Brier scores for all thresholds are weighted equally. Whether the BS for all thresholds should be considered equally important depends on the users. It has been shown that the RPS is a strictly proper scoring rule in case the cost-loss ratio is uniformly distributed over the users (Murphy, 1970). In that case the RPS is a linear function of the expected utility.

5.4.3 The Ranked divergence score

Now an information-theoretical score is presented for ordinal category forecasts, which are defined in terms of cumulative probabilities. An equivalent to the RPS would be the ranked divergence score (RDS), obtained by averaging the DS over all $n-1$ category thresholds $m$:

$$\mathrm{RDS} = \frac{1}{n-1}\sum_{m=1}^{n-1}\mathrm{DS}_m \qquad (5.25)$$

with $\mathrm{DS}_m$ denoting the divergence score for the forecast of the binary event $j \le m$. This assumes equally important thresholds. The RDS, just like the DS, is dependent on the climatological uncertainty. To make the score comparable between forecasts, the RDS can be converted into a skill score. Two intuitive options exist for the normalization. The first is to normalize the individual $\mathrm{DS}_m$ scores for each threshold $m$ to a skill score for that threshold, like (5.15), using the climatic uncertainty for the binary event defined by that threshold,

$$\mathrm{DSS}_m = 1 - \frac{\mathrm{DS}_m}{\mathrm{UNC}_m} \qquad (5.26)$$

and then to average the resulting skill scores over all thresholds:

$$\mathrm{RDSS}_1 = \frac{1}{n-1}\sum_{m=1}^{n-1}\mathrm{DSS}_m \qquad (5.27)$$

This means that the relative contributions to the reduction of climatic uncertainty about each threshold $m$ are considered equally important. In other words, the skills of the forecasts about the exceedence of all $n-1$ thresholds are equally weighted. The second option is to first sum the $\mathrm{DS}_m$ and then normalize with the climatic score for that sum:

$$\mathrm{RDSS}_2 = 1 - \frac{\sum_{m=1}^{n-1}\mathrm{DS}_m}{\sum_{m=1}^{n-1}\mathrm{UNC}_m} \qquad (5.28)$$

The formulation of the $\mathrm{RDSS}_2$ according to Eq. (5.28) does not normalize the scores for the different thresholds individually, but applies the same normalization to every $\mathrm{DS}_m$. This means that the merits of the forecaster for all thresholds are implicitly weighted according to the inherent uncertainties in the climate. In this way, forecasts of extreme (nearly certain) probabilities, which are often associated with extreme events, hardly contribute to the total score, while they could in fact be most important for the users.
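A sketch of the two normalization options (reusing the illustrative ds_decomposition helper from the earlier snippet; the function name and the data layout, an N x n array of forecast PMFs plus a vector of observed 0-indexed category numbers, are assumptions made for this example only):

import numpy as np

def ranked_divergence_scores(F, j):
    """RDS, RDSS1 and RDSS2 (Eqs. 5.25, 5.27, 5.28) for ordinal-category forecasts."""
    F, j = np.asarray(F, float), np.asarray(j)
    N, n = F.shape
    ds_m, unc_m, dss_m = [], [], []
    for m in range(1, n):                          # the n-1 thresholds between categories
        f_bin = F[:, :m].sum(axis=1)               # forecast probability of falling below threshold m
        o_bin = (j < m).astype(float)              # binary observation for that threshold
        _, _, unc, ds = ds_decomposition(f_bin, o_bin)   # only DS_m and UNC_m are used here
        ds_m.append(ds)
        unc_m.append(unc)
        dss_m.append(1.0 - ds / unc)               # Eq. (5.26)
    rds = np.mean(ds_m)                            # Eq. (5.25): plain average over thresholds
    rdss1 = np.mean(dss_m)                         # Eq. (5.27): normalize per threshold, then average
    rdss2 = 1.0 - np.sum(ds_m) / np.sum(unc_m)     # Eq. (5.28): one normalization for the sum
    return rds, rdss1, rdss2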
5.4.4 Relation to Ranked Mutual Information

An alternative skill score defined in terms of cumulative probabilities is the RMIS$_O$ (5.16) as defined by Ahrens and Walser (2008). The version of the RMIS$_O$ for multiple-category forecasts can be written as

$$\mathrm{RMIS}_O = \frac{\sum_{m=1}^{n-1} I(f_m, o_m)}{\sum_{m=1}^{n-1} H(\bar{o}_m)} \qquad (5.29)$$

with $f_m$ denoting the series of forecast probabilities of exceedence of threshold $m$, $o_m$ the corresponding series of observations and $\bar{o}_m$ the average observed occurrence. For a perfectly reliable forecast, the RMIS$_O$ is therefore equal to the RDSS$_2$ formulated in Eq. (5.28). For forecasts that are not well calibrated, the RMIS$_O$ measures the amount of information that would be available after calibration, while the RDSS measures the information as presented by the forecaster. By using the decomposition presented in Eq. (5.15) it is possible to write

$$\mathrm{RDSS}_1 = \frac{1}{n-1}\sum_{m=1}^{n-1}\frac{\mathrm{RES}_m}{\mathrm{UNC}_m} - \frac{1}{n-1}\sum_{m=1}^{n-1}\frac{\mathrm{REL}_m}{\mathrm{UNC}_m} \qquad (5.30)$$

By presenting the resolution and the reliability terms of Eq. (5.30) separately, both the potential information and the loss of information due to imperfect calibration become visible.

Apart from the question whether to include the reliability or not, another question is how to weight the scores for the different thresholds to arrive at one aggregated score. As every binary decision by some user with a certain cost-loss ratio can be associated with some threshold, the weighting reflects the importance of the forecast to the various users. No matter what aggregation method is chosen, there will always be an implicit assumption about the users' importance and stakes in a decision making process. This is inherent to summarizing forecast performance in one score. A diagram that plots the two skill score components against the thresholds contains the relevant information characteristics for different users. In this way each user can look up the score on the individual threshold that is relevant for his decision problem and compare it with the performance of some other forecasting system on that threshold.

5.4.5 Information and useful information

Forecasting is concerned with the transfer of information about the true outcome of uncertain future events that are important to a given specific user. The information in the forecast should reduce the uncertainty about the true outcome. It is important to note the difference between two estimations of this uncertainty. Firstly, there is the uncertainty a receiver of a forecast has about the truth, estimated in hindsight, knowing the observation. This uncertainty is measured by the divergence score. Secondly, there is the perceived uncertainty about the truth in the eyes of the user after adopting the forecast, which is measured by the entropy of the forecast. The first depends on the observation, while the second does not.

Note that information-theoretical concepts measure information objectively, without considering its use. The usefulness of information is different for each specific user. The amount of useful information in a forecast can explicitly be subdivided into two elements:

1. the reduction of uncertainty about the truth (the information theory part);
2. the usefulness of this uncertainty reduction (the user part).

The first element only depends on the user's probability estimate of the event's outcome before and after the forecast is received and on the true outcome. If the (subjective) probability distribution of the receiver does not change upon receiving the forecast, no information is transferred by it. If the probability distribution changed, but the divergence to the observation increased, the forecast increased the uncertainty about the truth as estimated from the observation, which is in itself an estimation of the unknown truth (although for the decomposition it was assumed to be a perfect estimation). A forecast is less informative to a user who already has a good forecast. To make the information-theoretical part of useful information in a forecast independent of the user, the remaining uncertainty is estimated instead of its reduction.
The second element of useful information in a forecast, usefulness, is user and problem specific. A forecast is useful if it is about a relevant subject. Communicating the exceedence probability of a certain threshold that is not a threshold for action for a specific user does not help him much. Usefulness also depends on how much importance is attached to events. This can be, for example, the costs associated with a certain outcome-action combination, typically reflected in a so-called payoff matrix. Implicitly, information-theoretical scores also make some assumption on the usefulness of events. The assumption is that the user attaches his own importance to the events by placing repeated proportional bets, each time reinvesting his money. This is referred to as Kelly-betting; for a more detailed explanation see Kelly (1956); Roulston and Smith (2002) and appendix C. In other words, the assumption is that the user maximizes the utility of his information in a fair game by strategically deciding on his importance or stakes.

The explicit consideration of usefulness of information brings up an interesting question about the roles of the forecaster and the user of forecasts. The divergence score measures the remaining uncertainty after adopting the forecast, which is completely independent of the user. This focuses the score on evaluating a main task of the forecaster, which is to give the best possible estimate of probabilities. It might also be argued, however, that a forecaster should not just reduce uncertainty, but also deliver utility for users' decisions. To be able to judge forecasts on that criterion, assumptions need to be made about the users and their decision problems. When scores based on these objectives are used to improve forecasting procedures, maximizing these two objectives does not always lead to the same answer. In such cases an improvement of the utility of forecasts may coincide with a reduction in informativeness. Chapter 6 further looks into this issue, which is strongly related to model complexity, overfitting and calibration versus validation.

5.5 An example: rainfall forecasts in the Netherlands

As an illustration of the practical application of the divergence score and its decomposition, it was applied to a series of probabilistic rainfall forecasts and corresponding observations for the measurement location De Bilt, the Netherlands.

Figure 5.5: The skill according to both scores decreases with lead-time. The potential skills, which should be interpreted with caution, indicate the part of uncertainty that would be explained after recalibration. (In the figure: the forecasts were rounded to 5% intervals to be able to do the decomposition; minimum and maximum forecast probabilities were 5% and 95% respectively. Horizontal axis: days ahead, where day 0 is the day the forecast was issued; vertical axis: skill against observed climatology over 3 years, for DSS, potential DSS, BSS and potential BSS.)

The forecast series consist of daily forecasts of the probability of a precipitation amount equal to or larger than 0.3 mm, for zero to five days ahead. They are derived using output of both the ECMWF deterministic numerical weather prediction model and the ECMWF ensemble prediction model. Predictors from both models are used in a logistic regression to produce probability forecasts for precipitation. The data cover the period from Dec. 2005 to Nov. 2008, in which the average probability was 0.4613, leading to an initial uncertainty of 0.9957 bits.
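The quoted initial uncertainty follows directly from this base rate; a one-line check (illustrative code):

import numpy as np
p = 0.4613                                         # climatological probability of >= 0.3 mm
print(-p * np.log2(p) - (1 - p) * np.log2(1 - p))  # ~0.9957 bits of climatic uncertainty (UNC)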
Figure 5.5 confirms the expectation that both the Brier skill score and the divergence skill score decline with increasing lead time. It also shows that the forecasts possess skill over annual climatology up to a lead time of at least 5 days. The dashed lines show the potential skill that could be attainable after recalibration. The estimation of this potential skill, however, depends on a correct decomposition.

The decompositions of both the Brier and the divergence score need enough data (large enough $n_k$) to calculate the conditional distributions for all unique forecasts $k \in \{1 \ldots K\}$. To be able to calculate the contribution to reliability of all the 99% forecasts, for example, at least 100 of such forecasts are necessary to not surely overestimate reliability. Also for a larger number of forecasts, there is a bias towards overestimation of the reliability, which decreases with the amount of data available per conditional distribution to be estimated. A solution for estimating the components with limited data is rounding the forecast probabilities to a limited set of values. In this way, fewer conditional distributions need to be estimated and more data per distribution are available. Figure 5.6 shows that for these three years of data, using finer grained probabilities as forecasts leads to an increasing overestimation of reliability.

Figure 5.6: The calculation of reliability is sensitive to the rounding of the forecasts. If not enough data are available, rounding is necessary to estimate the conditionals. Too coarse rounding causes an information loss. (In the figure: 20 forecast intervals means that probabilities of 5%, 10%, 15%, ..., 95% are forecast; at this value, no resolution is lost, while data scarcity does not lead to overestimation of reliability and resolution. The increasing trend of REL/UNC with the number of forecast probability intervals is an artefact of the decomposition when insufficient data are available for calculating the conditionals for k = 1..K.)

The skill scores themselves are not sensitive to this overestimation, because the lack of data causes a compensating overestimation of resolution. The potential skill, however, should be interpreted with caution, as solving reliability issues by calibration on the basis of biased estimates of reliability does not lead to a real increase in skill. From the figure it can also be noted that too coarse grained probabilities lead to a real loss of skill. In this case, giving the forecasts in 5% intervals seems the minimum needed to convey the full information that is in the raw forecast probabilities.

Fig. 5.7 sheds more light on the relation between the Brier skill score and the divergence skill score, based on 5 day ahead forecasts from a second data set, which covers February 2008 to December 2009. For this set, the forecast probabilities ranged from 1 to 99%. The black dots indicate the scores that were attained for single forecast-observation pairs. The dots show that the BSS and DSS have a monotonic relation as scoring rules. The limits of this relation are at (1, 1) for perfect forecasts and, in this case, (-∞, −3.095) for certain, but wrong forecasts. The worst forecast was 98%, while no rain fell. The total scores for different weeks of forecasts are plotted as gray dots. They are averages of sets of 7 black dots.
Because the relation of the single forecast scores is not a straight line, a scatter occurs in the relation of the weekly average scores, which is therefore no longer monotonic. The scatter implies that two series of forecasts can be ranked differently by the Brier score and the divergence score. In this example, the scatter is relatively small (r 2 = 0.9938) and will probably have no significant implications, but it would be larger if many overconfident forecasts were issued. An interesting example of differently ranked 106 Using information theory to measure forecast quality forecast series are the two weeks indicated by the triangles, where the scores disagree on which of the two weeks was better forecast than climate. The downward pointing triangle marks the score for forecasts in week A, where performance according to the divergence score was worse than the climatological forecast (DSS=-0.0758 ), but according to the brier score was slightly better than climate (BSS=0.0230). Conversely, the upward pointing triangle marks week B, where the forecasts according to Brier were worse than climate (=-0.0355), but still contained more information than climate according to the DSS (=0.0066). Given that the scatter in the practical example is small, the Brier score appears to be a reasonable approximation of the divergence score and is useful to get an idea about the quality of forecasts. More practical comparisons are needed to determine if the approximation can lead to significantly different results in practice. These are mostly expected in case extreme probabilities are forecast. The severe penalty the divergence gives for errors in the extreme probabilities, which is sometimes seen as a weakness2 , should actually be viewed as one of its strengths. As the saying goes: “you have to be cruel to be kind”. It is constructive to give an infinite penalty to a forecaster that issues a wrong forecast that was supposed to be certain. This is fair, because the value that a user would be willing to risk when trusting such a forecast is also infinite. 5.6 Generalization to uncertain observations The previous sections introduced an information-theoretical decomposition of KullbackLeibler divergence into uncertainty, reliability and resolution. In this section, this decomposition is generalized to the case where the observation is uncertain. Along with a modified decomposition of the divergence score, a second measure is presented, the cross-entropy score, which measures the estimated information loss with respect to the truth instead of relative to the uncertain observations. The difference between the two scores is equal to the average observational uncertainty and vanishes when observations are assumed to be perfect. 5.6.1 Introduction Section 5.2.4 introduced a decomposition of the divergence score analogous to the decomposition of the Brier score. See also Bröcker (2009) for a general form of this decomposition 2. Note for example that Selten (1998), the 1994 laureate of the Nobel prize in economics, takes it as an axiom that scores should not exhibit infinite penalties, even when zero probability is attached to the true outcome. Selten regards this “hypersensitivity” as an unacceptable value judgment and together with other axioms, he gives an axiomatic justification for using the Brier score, which treats such errors more mildly. It is left to the reader to speculate what the economic consequences may be of wrongly ruling out improbable events. 5.6. 
Generalization to uncertain observations 107 1 0.15 0.5 Zoom detail 0.1 0 0.05 0 BSS −0.5 −0.05 −0.1 −1 −0.2 −0.1 0 0.1 0.2 −1.5 −2 98% sure of rain, but no rain fell one week averages scoring rule values fcst rounded to 10% 1:1 line DS ranks under climate DS ranks over climate −2.5 −3 −5 −4 −3 −2 DSS −1 0 1 Figure 5.7: Relation between the Brier skill score and the divergence skill score. For single forecasts they have a monotonic relation, but for averages of series of forecasts, a scatter in the relation can cause weekly average forecast skill to be ranked differently by both scores. There is only little difference in the overall number ranked better than climatology. For this example, the Brier score ranked 79.2% and the divergence score ranked 79.4% of weekly average skills better than climate. for proper scoring rules. A possible interpretation of the new information-theoretical decomposition is “The remaining uncertainty is equal to the missing information minus the correct information plus the wrong information” (see figure 5.8). A forecast that is reliable but has no perfect resolution does not give complete information, but the information it does give is correct. In the decompositions of both the Brier score (BS) and the divergence score (DS), it was assumed that the observations are certain and correspond to the truth. In reality, no observation can be assumed to correspond to the true outcome with certainty. For example, in the evaluation of binary probabilistic precipitation forecasts of the Dutch Royal Meteorological Institute (KNMI), the observation that corresponds to “no precipitation” is defined as an observed precipitation of less than 0.3 mm on a given day. Given the observational errors in the exact precipitation amount, measured values close to the threshold would be best represented by a probabilistic observation, accounting for the uncertainties; see fig. 5.9. Briggs et al. (2005) noted that uncertainty in the observation must be taken into account to assess the true skill of forecasts. This requires either “gold standard” observations or subjective estimates of the observation errors. Bröcker and Smith (2007) proposed to use a noise model for the observation to transform the forecast, using this to define a generalized score. 108 Using information theory to measure forecast quality In this section, it is proposed to define an “uncertain observation” as the conditional distribution of the true outcome of event or quantity that was forecast, given the reading on one or more measurement instruments. For example, when the spatial scale or location of the measurements and the forecasts differs, the distribution can be based on spatial statistics of various instruments. In another case, the distribution may be derived from an model of the observational noise (e.g. due to wind around a raingauge, noise in the electronics). Note that the correctness of such uncertainty models cannot be verified, because the “true” value cannot be observed directly. Although the term verification suggests a comparison between forecasts and truth (Latin: veritas = truth), both the divergence and the Brier score are actually comparing the forecasts with observations, which are an estimate of the unknown truth. An uncertain observation acknowledges this by representing the uncertainty explicitly by a probabilistic best estimate. When the uncertainty in the observations is accounted for by representing them with probability distributions with nonzero entropy (i.e. 
the observation assigns probability to more than one outcome), the decomposition in section 5.2.4 does not hold. The divergence score as a whole, however, is still a useful measure of correspondence between forecasts and observations. It would therefore be interesting to define a meaningful decomposition of the divergence score that is applicable in the case of uncertain observations. A second point is whether the quality of forecasts should be measured with respect to the known probabilistic observations or estimated with respect to the unknown truth.

In this section, a modification to the decomposition in section 5.2.4 and Weijs et al. (2010b) is presented, which generalizes to the case of uncertain observations. The new decomposition is interpreted in terms of uncertainty and information. Furthermore, a second, related measure for forecast quality is presented. In information theory, this measure is often referred to as cross-entropy; in this case it estimates the uncertainty relative to the truth instead of relative to the observation. A decomposition for this score is also presented. The scores are applied to a real data set for illustration.

5.6.2 Decomposition of the divergence score for uncertain observations

The decomposition of the divergence score (DS) for a series of $N$ forecasts that was presented in Eq. 5.13 relies on the assumption that observations are certain, i.e. $o_t = (0, 1)^T$ or $o_t = (1, 0)^T$. In appendix B this assumption is used to rewrite the closing term of the decomposition, $\frac{1}{N}\sum_{t=1}^{N} D_{KL}(o_t \,\|\, \bar{o})$, as the uncertainty component $H(\bar{o})$. The uncertain observation $o_t$ is the probability mass function (PMF) of the true outcome of the uncertain event that is forecast, given the available information after it occurred. Because measurements are usually indirect, the observation can be regarded as a (usually subjective) conditional distribution of the true outcome, given the information from the measurement equipment. When we want the decomposition to be valid for uncertain observations, the last step of the derivation in appendix B can be omitted. The uncertainty component, the last term in Eq. 5.13, is thus replaced by the average Kullback-Leibler divergence from the uncertain observations to the average observation (i.e. the observed climatic distribution), the last term in Eq. 5.31.

$$\mathrm{DS} = \frac{1}{N}\sum_{k=1}^{K} n_k D_{KL}(\bar{o}_k \,\|\, f_k) - \frac{1}{N}\sum_{k=1}^{K} n_k D_{KL}(\bar{o}_k \,\|\, \bar{o}) + \frac{1}{N}\sum_{t=1}^{N} D_{KL}(o_t \,\|\, \bar{o}) \qquad (5.31)$$

The last term in Eq. 5.31 represents the expected climatological uncertainty relative to the observation, which is depicted in figure 5.8 as UNC$_{DS}$. By writing the uncertainty term of the divergence score decomposition in this way, it remains valid for uncertain observations. The original uncertainty term, the entropy $H(\bar{o})$, can be seen as representing the estimated climatological uncertainty relative to the truth, which from now on will be denoted as UNC$_{XES}$, because it is part of a decomposition of XES, which will be introduced in section 5.6.3. Likewise, DS represents the average remaining uncertainty relative to the observations, which in the case of uncertain observations can become different from the estimated remaining uncertainty relative to the truth.
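A small sketch of this generalized uncertainty term (reusing the illustrative _dkl helper from the earlier snippet; the function name is an assumption of this example). For certain 0/1 observations it reduces to the entropy H(ō) of Eq. (5.12); for probabilistic observations it is smaller, the difference being the mean entropy of the observations themselves.

import numpy as np

def unc_ds(o_prob):
    """Last term of Eq. (5.31): mean divergence from the (possibly uncertain)
    observations, given as probabilities of occurrence, to their average."""
    o_prob = np.asarray(o_prob, float)
    obar = o_prob.mean()
    return float(np.mean([_dkl(ot, obar) for ot in o_prob]))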
Analogy for the Brier score decomposition

Analogously to the new decomposition of the DS, the Brier score decomposition introduced by Murphy (1973) can be modified in a similar manner to remain valid for uncertain observations. This can be achieved by replacing the uncertainty term in the original decomposition by the average squared Euclidean distance from the observations to the average observation. The modified decomposition is shown in equation 5.32. For perfect observations, equations 5.7 and 5.32 are the same. When observational uncertainty is considered, Eq. 5.7 does not hold but Eq. 5.32 does.

$$\mathrm{BS} = \frac{1}{N}\sum_{k=1}^{K} n_k (f_k - \bar{o}_k)^2 - \frac{1}{N}\sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2 + \frac{1}{N}\sum_{t=1}^{N} (o_t - \bar{o})^2 \qquad (5.32)$$

5.6.3 Expected remaining uncertainty about the truth: the cross-entropy score

Now, the cross-entropy score (XES) will be introduced. The expected uncertainty relative to the unknown truth can be expressed by taking the expectation, with respect to the PMF that represents the uncertain observation, of the Kullback-Leibler divergence from the hypothetical truth to the forecast distribution.

$$\mathrm{XES} = \frac{1}{N}\sum_{t=1}^{N} E_{o_t} D_{KL}(v_t \,\|\, f_t) = \frac{1}{N}\sum_{t=1}^{N}\sum_{j=1}^{n} [o_t]_j \sum_{i=1}^{n} [v_t]_i \log\frac{[v_t]_i}{[f_t]_i} \qquad (5.33)$$

In which $n = 2$ is the number of categories in which the event can fall. $v_t$ denotes the hypothetical distribution of the truth at instance $t$, which, like a perfect observation, is either $(1, 0)^T$ if the event in fact did not occur or $(0, 1)^T$ if the event truly occurred. $E_{o_t}$ is the expectation operator with respect to the probability distribution $o_t$. In this case, the Kullback-Leibler divergence $D_{KL}(v_t \,\|\, f_t)$ reduces to the logarithmic score (Good, 1952), which is also known as the Ignorance score (Roulston and Smith, 2002). These scores are simply minus the logarithm of the probability attached to the event that truly occurred.

$$D_{KL}(v_t \,\|\, f_t) = -\log [f_t]_{k(t)} \qquad (5.34)$$

Where $k(t)$ is the category in which the true outcome of the event fell at instance $t$. Because $o_t$ is the best estimate of the unknown true outcome, we can use the expectation $E_{o_t} D_{KL}(v_t \,\|\, f_t)$ to evaluate the forecast, which can also be written as the right hand expression in Eq. 5.35. In information theory, this expression is often defined as the cross-entropy between $o_t$ and $f_t$, hence it is referred to in this thesis as the cross-entropy score XES$_t$.

$$\mathrm{XES}_t = E_{o_t} D_{KL}(v_t \,\|\, f_t) = -\sum_{i=1}^{n} [o_t]_i \log [f_t]_i \qquad (5.35)$$

This measure can be interpreted as the expected remaining uncertainty relative to the truth, when the forecasts are evaluated in the light of the observations, which are assumed to be reliable probability estimates of the truth. The difference with the divergence score becomes clear from figure 5.8. For a series of forecasts, the cross-entropy score is defined as $\mathrm{XES} = \frac{1}{N}\sum_{t=1}^{N} \mathrm{XES}_t$.
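A sketch of the single-pair score, together with a numerical check of its relation to the divergence score that is formalized below in Eq. (5.36) (self-contained illustrative code with arbitrary numbers):

import numpy as np

def xes_single(o_t, f_t):
    """Eq. (5.35) for a binary event: o_t and f_t are the probabilities that the
    observation and the forecast assign to occurrence of the event."""
    return -sum(oi * np.log2(fi)
                for oi, fi in ((1.0 - o_t, 1.0 - f_t), (o_t, f_t)) if oi > 0.0)

o_t, f_t = 0.9, 0.7
h_obs = -sum(p * np.log2(p) for p in (o_t, 1.0 - o_t))                           # entropy of the observation
d_kl = sum(p * np.log2(p / q) for p, q in ((o_t, f_t), (1.0 - o_t, 1.0 - f_t)))  # divergence score term
assert np.isclose(xes_single(o_t, f_t), d_kl + h_obs)                            # XES_t = DS_t + H(o_t)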
Decomposition of cross-entropy

Figure 5.8 shows that the relation between all the components allows for several decompositions. The relation between DS and XES can be written as

$$\mathrm{XES} = \mathrm{DS} + \frac{1}{N}\sum_{t=1}^{N} H(o_t) \qquad (5.36)$$

The estimated remaining uncertainty in the forecasts relative to the truth (XES) is equal to the average uncertainty relative to the observations (DS) plus the average uncertainty that the observations represent, relative to the estimated truth (the second term on the right hand side of Eq. 5.36). Another natural decomposition for the XES is the original decomposition of DS for perfect observations as presented in Weijs et al. (2010b). For uncertain observations, the three components presented there add up to the XES instead of to the DS; see also figure 5.8. The decomposition of the cross-entropy score (XES) therefore reads

$$\mathrm{XES} = -\frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{n} [o_t]_i \log [f_t]_i = \mathrm{REL}_{XES} - \mathrm{RES}_{XES} + \mathrm{UNC}_{XES} \qquad (5.37)$$

$$= \frac{1}{N}\sum_{k=1}^{K} n_k D_{KL}(\bar{o}_k \,\|\, f_k) - \frac{1}{N}\sum_{k=1}^{K} n_k D_{KL}(\bar{o}_k \,\|\, \bar{o}) + H(\bar{o}) \qquad (5.38)$$

Note that the resolution and reliability components are equal to those of the DS decomposition in Eq. 5.31.

Figure 5.8: The relations between the components and the scores presented in this chapter are additive. The bars give the average remaining uncertainty about the truth (measured in bits) for various (hypothetical) stages in the forecasting process: naive, climatological, after forecast, after recalibration, after observation and all-knowing. The naive forecast always assigns 50% probability of precipitation (complete absence of information); the climatological forecast takes into account observed frequencies. This climatological uncertainty (UNC$_{XES}$) can be reduced to XES by believing the forecasts $f$. If these are not completely reliable, the uncertainty can be further reduced by REL through recalibration. After observation, there is still some uncertainty ('obs unc') about the hypothetical true outcome, given that observations are not perfect. Only for an all-knowing observer is the uncertainty reduced to 0. The resolution (RES) is the information that could maximally be extracted from the forecasts by perfect calibration. The divergence score (DS) and the new uncertainty component (UNC$_{DS}$) measure the uncertainty after the forecast and in the climate, relative to the observations.

5.6.4 Example application

As an illustration of the new term in the decomposition, the scores were calculated for a real data set of binary probabilistic rainfall forecasts of the Dutch Royal Meteorological Institute (KNMI). The observed rainfall amounts were transformed into probabilistic uncertain observations using a very simple uncertainty model. The purpose of this exercise is merely to illustrate the concepts in this section. The forecasts that are evaluated are the forecast probabilities of a daily precipitation of 0.3 mm or more. This is the same dataset that was used in Weijs et al. (2010b). In that paper the rainfall amounts $x_t$, which were given with a precision of 0.1 mm, were converted to binary observations with a simple threshold filter: if $x_t \ge 0.3 \Rightarrow o_t = (0, 1)$, if $x_t < 0.3 \Rightarrow o_t = (1, 0)$. In this section, a random measurement error is assumed to make $o_t$ probabilistic and account for the uncertainty in the observation. The model of the uncertainty in the observation is Gaussian. The observed rainfall amount becomes a random variable with a normal probability density function

$$g_{obs}(x) = N(\mu_{obs}, \sigma_{obs}) \qquad (5.39)$$

with mean $\mu_{obs}$ and standard deviation $\sigma_{obs}$. Because in this case we deal with a binary predictand, the pdf of the observation can be converted to a binary probability mass function $o_t = (1 - o_t, o_t)^T$ by using

$$o_t = 1 - G_{obs}(T) = 1 - \int_{-\infty}^{T} g_{obs}(x)\,dx \qquad (5.40)$$

in which $G_{obs}(T)$ is the cumulative distribution function of the observation, evaluated at threshold $T$. This conversion is illustrated in figure 5.9.

Figure 5.9: The measurement uncertainty in the precipitation measurement leads to a probabilistic binary observation. In the example, a simple Gaussian measurement uncertainty is assumed. The measurement distributions are centered around the measurements and have a constant standard deviation.
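A sketch of this conversion for the 0.3 mm threshold (Python with SciPy's normal CDF; the numbers are chosen only for illustration and are not taken from the KNMI data set):

import numpy as np
from scipy.stats import norm

def probabilistic_observation(x_measured, sigma_obs, threshold=0.3):
    """Eq. (5.40): PMF over (no rain, rain) for the event 'precipitation >= threshold',
    given a measured amount and a Gaussian measurement error with standard deviation sigma_obs."""
    o_t = 1.0 - norm.cdf(threshold, loc=x_measured, scale=sigma_obs)
    return np.array([1.0 - o_t, o_t])

print(probabilistic_observation(0.4, 0.10))  # measurement near the threshold: about [0.16, 0.84]
print(probabilistic_observation(0.2, 0.03))  # well below the threshold: almost certainly 'no rain'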
The decompositions of the DS and XES scores for the entire dataset were calculated for a range of different standard deviations $\sigma_{obs}$ for the measurement uncertainty. In figure 5.10 it can be seen that while the average observation uncertainty grows, the divergence score improves (decreases), but the cross-entropy score (XES) deteriorates. This indicates that if the observation uncertainty is reflected by this model, the best estimate for the information loss compared to the truth is higher than would be assumed when neglecting observation uncertainty. Not taking the observation uncertainty into account would lead to an overestimation of the forecast quality in this specific case.

A closer analysis reveals that most of the deterioration is caused by observations of 0 mm, which start to give significant probability to rain when the standard deviation grows beyond 0.1 mm. As many of the 0 mm observations occur on cloudless days and are in fact almost certain, we might reconsider the simple Gaussian uncertainty model. When the uncertainty model is changed to have no uncertainty for 0 mm observations, the decomposition changes significantly (see dashed lines in fig. 5.10), leading to a cross-entropy score that is almost constant (sometimes even slightly decreasing) with increasing standard deviation. For this particular error model, the uncertainty in the observations thus hardly affects the estimation of the forecast quality. This gives us confidence that as long as the 0 mm observations are certain, the estimate of forecast quality is robust against Gaussian observation errors with standard deviations up to 0.3 mm. Although in that case there is significant observation uncertainty (lower dashed line, fig. 5.10) that lowers the divergence score, the changes in the cross-entropy score for individual forecasts cancel each other out.

Figure 5.10: The resulting decompositions (reliability not shown) as a function of the measurement standard deviation. The growing XES with standard deviation indicates that for the homoscedastic Gaussian observation uncertainty model, forecast quality is lower than would be estimated assuming perfect observations. The dashed lines show the decomposition for the same observation uncertainty model, with the exception that measurements of 0 mm are assumed to be certain. The almost constant XES in that case indicates that the estimation of forecast quality is robust against those observation uncertainties.

Not surprisingly, the robustness of forecast quality estimates depends very much on the characteristics of the observation uncertainty. Further experiments are necessary to determine how to formulate realistic observation uncertainty models and how this can benefit verification practice.

5.6.5 Discussion: divergence vs. cross-entropy

For the divergence score (DS), worse observations lead to better scores for forecast quality, because the quality is evaluated relative to the observations.
This might be considered undesirable, especially when the performances for two locations with similar climates are compared, while the quality of the observations is not the same. On the other hand, the divergence score has the advantage of not making explicit reference to a truth beyond the observations, which might be philosophically more appealing. 114 Using information theory to measure forecast quality If the cross-entropy score (XES) is used as a scoring rule, the score estimates the quality of the forecasts at reducing uncertainty about the truth. This quality may be estimated differently in the light of observation uncertainty, but should not be relative to it. The skill might both be overestimated and underestimated in the presence of observation uncertainty. This depends on the nature of the errors, which should there be modeled to the best possible extent. The XES allows a better comparison between the quality of different forecasts. In other words, the benchmark to compare the forecasts to is the truth. Because the uncertainty of the forecasts relative to this benchmark can only be evaluated if we would know the truth, we can only estimate its expected value. In contrast, in the divergence score (DS), the benchmark to which the forecasts are compared are the probabilistic estimations of the truth. The remaining uncertainty with respect to these estimates, the observations, can be calculated exactly. Summarizing, the divergence score is the exact divergence from an estimate of the truth (the observation), while the crossentropy score is an estimated (expected) divergence from the exact truth. 5.7 Deterministic forecasts cannot be evaluated Can a forecaster be completely sure about something that in the end does not happen and still get credit for his forecast? This does not appear natural, but it often turns out to happen in practice. For example, a deterministic flow forecast of 200 m3 /s is considered quite good, when 210 m3 /s is observed. Apparently, it is already expected that some error will occur and a forecast that is 10 m3 /s off is considered to be not that bad. Hydrological models are per definition simplifications of reality. Often, they describe relations between macrostates, like averaged rainfall, mass of water in the groundwater reservoir, and flow through a river cross-section. Similar to problems in statistical thermodynamics, having limited information about what really goes on inside a hydrological system on a microscopical level, our forecasts on a macroscopical level can never be perfect (Weijs, 2009; Grandy Jr, 2008). What can be said about the real world on the basis of a model is therefore inherently erroneous to some extent, or should be stated in terms of probabilities. How then, should deterministic forecasts be evaluated? Literally taken, a deterministic (point value) forecast states: “the outcome is x”. Implicitly, such a forecast asks to be evaluated from a black and white view: the forecast is either wrong or right. The divergence score also reflects this. If the forecast is right, the perfect score of 0 will be attained. If the forecast is wrong, however, a penalty of infinity will be given. If one such a forecast is given, the forecaster can look for another career , because even a future series of perfect forecasts can not average out the infinite penalty. The decomposition shows that the reliability component is responsible; See Fig. 5.4. 
Although the deterministic forecasts usually contain information about the observed outcomes, given that the resolution (correct information) is positive and removes some of the uncertainty, this is completely annihilated by the reliability term (the wrong information). The discrepancy between the information (reduction of uncertainty) that the forecasts contain and the information conveyed by the messages that constitute the forecasts is so large that the expected surprise about the 5.7. Deterministic forecasts cannot be evaluated 115 truth of a person taking the forecast at face value goes to infinity. The fact that deterministic forecasts are still used in society (and unfortunately sometimes even preferred), while they explode uncertainty to infinity, seems to present a paradox. In this section, two possible interpretations are proposed that offer a solution to this paradox. 5.7.1 Deterministic forecasts are implicitly probabilistic (information interpretation) Fortunately, in practice, almost no person takes deterministic forecasts at face value. The fact that a user does not take the forecast literally can be seen as recalibration of the forecast (“unconscious statistical post-processing”) by that user. The user bases his internal probability estimates on the forecast, but adjusts the probabilities given by the forecaster based on his own judgment, instead of literally copying the forecaster’s statements. For a deterministic forecast, this means reallocating some probability to outcomes that the forecast did not speak about. This reallocation improves the reliability of the internal probability estimates of the user on which he bases his actions. We can thus see this as the user eliminating the wrong information from the forecast.3 The user can do the recalibration based on previous experience with the forecasts, common sense and can also add information from his own observations. The user of the forecast can think “if the forecaster says the water level will be 10 cm under the embankment, he implicitly also forecasts a little that overtopping will occur”. Note that the example of Grand Forks in (Krzysztofowicz, 2001) shows that not all users do this. Mathematically this recalibration is equivalent to also attaching some probability to overtopping. However, it is not the task of a user to guess what the forecaster wanted to say. The forecaster has the task of summarizing different sources of information and expert knowledge into a forecast that various users can base their decisions on. Consistency requires that the forecaster communicates his judgments to the user (Murphy, 1993). If he deems it possible that 210 m3 /s will flow through the river instead of his best estimate 200 m3 /s, then the forecaster should also communicate a probability for this outcome to the user. The forecaster may also present the deterministic forecast as being an expected value or mean. This suggests an underlying probabilistic forecast. However, when taking the information-theoretical viewpoint, communicating an expected value means nothing without additional statements regarding the probability distribution. The principle of maximum entropy (PME) (Jaynes, 1957) states that when making inferences based on incomplete information, the best estimate for the probabilities is the distribution that is consistent with all information, but maximizes uncertainty. In this way, the uncertainty is reduced exactly by the amount the information permits, but not more. 
The resulting distribution thus gives an exact representation of the information actually conveyed by the forecast. Maximizing entropy with known mean and variance, gives a Gaussian distribution, maximizing uncertainty about the velocities of gas molecules with known 3. Note that eliminating wrong information is different from adding information. If a user takes the forecasts as true, but partial information and is rational (following Bayesian probability logic), no future information can update the zero probability. This is another argument against assigning zero probability to anything. 116 Using information theory to measure forecast quality total kinetic energy gives the Maxwell-Boltzmann distribution (Jaynes, 2003; Cover and Thomas, 2006). When PME is applied to expected value forecasts, however, the maximum entropy forecast distribution that is consistent with the information given by the forecaster is uniform between minus and plus infinity. It is the complete opposite end of the spectrum compared to the previous literal interpretation of the deterministic forecast: from claiming total certainty to claiming total uncertainty. In the case of streamflow forecasts, the user can still get a less nonsensical forecast distribution by combining the information in the forecast with the common sense notion that streamflows in rivers are nonnegative. This extra constraint turns the PME forecast distribution for a known expected value into an exponential distribution (Cover and Thomas, 2006). This brings back the question who ought to specify these constraints, which constitute information. The fact that the user can reduce the maximum entropy by adding this common sense constraint actually means that the forecaster failed to add this information. Note that the forecaster should be best equipped to give probability estimates and these should be summarized in such a way that no information is lost, but also all uncertainty is represented (cf. consistency). As was argued in the introduction, predictions only make sense when they are testable, i.e. can be evaluated. One way to evaluate deterministic forecasts with information measures is to convert them to probabilistic forecasts by looking at the joint distribution of forecasts and observations. The conditional distributions of observations for each forecast value can then be seen as probabilistic forecast distributions. It is important to note however, that the probabilistic part of such a forecast is derived from data that includes the observations. Such a forecast is thus evaluated against the same data that is used as the basis of its own uncertainty model, which is clearly undesirable. Also without explicit conversion to a probabilistic forecast, the uncertainty model becomes explicit when a series of deterministic forecasts is evaluated. A penalty (objective) function for a deterministic forecast can be interpreted as an uncertainty (information) measure for a corresponding probabilistic forecast. For example, a deterministic forecast evaluated with root mean squared error implicitly defines a Gaussian forecast probability density function. An important consequence of this insight is that the way to evaluate a deterministic model actually defines (i.e. forms) the probabilistic part of a total model, consisting of a separate deterministic and probabilistic part. The objective function (which is a likelihood measure) should therefore be stated a priori, as it forms part of the model that is put to the test against observations. 
While estimating the error model from the data may under some conditions be acceptable in calibration, for (independent) evaluation of forecasts it is unacceptable, because it uses the data against which it is evaluated. A correct approach would be to explicitly formulate a parametric error model, and find its parameters in the calibration. The combination of the hydrological model and the error model can subsequently be used to make probabilistic predictions, which can be evaluated with the divergence score in an independent evaluation 5.7. Deterministic forecasts cannot be evaluated 117 period. The error models are not restricted to Gaussian distributions, but can take more flexible forms. Such an approach is taken in Schoups and Vrugt (2010). As a last consideration, it must be stressed that even if an error model is properly formulated and added to the deterministic “physical” part, the resulting model still represents a false dichotomy between true behavior of the system and the error, as was argued by Koutsoyiannis (2010). A more consistent approach would be to explicitly make the probabilistic part of the model an integrated part of the physical reality it is supposed to simplify. Such approaches can lie in studying the time-evolution of chaotic systems (Koutsoyiannis, 2010) or in applying the principle of maximum entropy in combination with macroscopic constraints, as for example suggested by Weijs (2009) and Koutsoyiannis (2005a). Concluding, from the information-theoretical viewpoint, several reasons come to light why deterministic forecasts should in fact be considered to be implicitly probabilistic. The problem with these forecasts is that they leave too much of the probabilistic interpretation to the user. It might be considered ironic that the users who are claimed not to be able to handle probabilistic forecasts and are for that reason provided with deterministic forecasts are the ones who have to rely most on their ability to subconsciously make probability estimates based on the limited information in the deterministic forecast. 5.7.2 Deterministic forecasts can still have value for decisions (utility interpretation) A second, independent interpretation of deterministic forecasts that justifies their existence is their usefulness, even to users who do not make subconscious probability estimates. Even though a reservoir operator might be infinitely surprised if he has taken a deterministic inflow forecast of 200 m3 /s at face-value and he finds out the inflow was 210 m3 /s, his loss is not infinite. The operator might spill some water, but all is not lost. The difference between surprise and loss is due to the fact that most decision problems are not equal to placing stakes in a series of horse races. Such a horse race is the classical example where information can be directly related to utility, see Kelly (1956) and Cover and Thomas (2006) for more explanation. Kelly showed that when betting on a series of horse races, where the accumulated winnings can be reinvested in the next bet, the stakes the gambler should put on each horse should be proportional to the estimated winning probabilities. In a single instance of such a horse race, all money not bet on the winning horse is lost, so the only probability that is important for the results is the one attached to the winning horse. If zero probability (and thus no bets) were put on the winning horse, then the gambler loses all his capital and has no chance of future winnings. 
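The growth-rate argument can be made tangible numerically. In the sketch below the probabilities and odds are invented; the relation it shows (the loss in doubling rate caused by betting on a wrong probability estimate equals a Kullback-Leibler divergence) follows Kelly (1956) and Cover and Thomas (2006).

    % Horse race with full reinvestment (Kelly, 1956): expected doubling rate
    p = [0.5 0.3 0.15 0.05];      % true winning probabilities (illustrative)
    o = [2 4 8 16];               % odds paid per unit staked (o-for-1)

    W = @(b) sum(p .* log2(b .* o));   % expected doubling rate for stake fractions b

    b_good = p;                        % proportional (Kelly) betting on the true probabilities
    b_off  = [0.25 0.25 0.25 0.25];    % betting on a mis-specified, uniform estimate

    fprintf('doubling rate, Kelly stakes   : %.3f bits per race\n', W(b_good));
    fprintf('doubling rate, uniform stakes : %.3f bits per race\n', W(b_off));

    % the gap equals the Kullback-Leibler divergence D_KL(p||b_off)
    fprintf('difference = D_KL(p||b)       : %.3f bits\n', sum(p .* log2(p ./ b_off)));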
In contrast, for decision problems like reservoir operation, an operator blindly believing in an inflow into his reservoir of 200 m3 /s and optimally preparing only for that flow, will automatically also be quite well prepared for 210 m3 /s. Conversely, the preparation on a predicted event, which influences the utility of an outcome, may depend on the entire forecast distribution and not just on the probability of the event that materializes. This makes the loss function non-local (locality is discussed in sections 5.4.2 and 6.6.2). 118 Using information theory to measure forecast quality Another difference with the horse race is that the total amount of value at stake in hydrological decision making usually does not depend on the previous gains, while the results for the horse race assume that the gambler invests all his previously accumulated capital in the bets. The gambler therefore wants to maximize the product of rates of return over the whole series of bets, while for a reservoir operator, each period offers a new opportunity to gain something from the water, even in case he spilled all his stored water in the previous month. The reservoir operator is interested in the total sum of gains. This is comparable with a gambler whose spouse allows him/her to bet a fixed amount of money each week (Kelly, 1956) and then spends it all in the bar on the same evening without possibility of reinvesting in the next bet. Assuming a utility that increases linearly with the consumption of beer bought with the winnings, the best decision is to bet all money on the one horse with the best expected return4 . Again, one loss is not fatal for the whole series of bets. The gambler can still hope for better luck next week. The evaluation of the value of deterministic forecasts is therefore not as black and white as evaluation of the information they contain. The evaluation of deterministic forecasts in this interpretation is thus connected to a decision problem. Decisions can be taken as if the forecasts are really certain, and still be of positive value, although generally less value than probabilistic forecasts; see e.g. Philbrick and Kitanidis (1999); Krzysztofowicz (2001); Pianosi and Ravazzani (2010); Zhao et al. (2011). The loss functions for evaluating deterministic forecasts can be seen as functions that map the discrepancy between forecast value and observed value to a loss of the decision based on the wrong forecast, compared to a perfect forecast. In the utility interpretation, evaluating deterministic forecasts with mean squared error implicitly defines a decision process in which the disutility is a quadratic function of the distance between forecast and observation. In that case, a series of forecasts that has the smallest MSE has most utility or value for the user. 5.8 Conclusions Analogously to the Brier score, which measures the squared Euclidean distance between the distributions of observation and forecast, an information-theoretical verification score was formulated, measuring the Kullback-Leibler divergence between those distributions. More precisely, the score measures the divergence from the distribution of the event after the observation to the distribution that is the probability forecast. This “divergence score” is a reinterpretation of the Ignorance score or logarithmic score, which was previously not defined as a Kullback-Leibler divergence. 
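For a single multi-category forecast, this reinterpretation can be spelled out in a few lines of Matlab/Octave; the probabilities below are invented and only serve to show that, for a perfect observation, the divergence score collapses to minus the logarithm of the probability attached to the outcome that materialized.

    % Divergence score of one probabilistic forecast over three categories
    f  = [0.7 0.2 0.1];                 % forecast probability mass function (illustrative)
    o  = [0 1 0];                       % observation: category 2 occurred

    nz = o > 0;                         % terms with zero probability contribute nothing
    DS = sum(o(nz) .* log2(o(nz) ./ f(nz)));    % Kullback-Leibler divergence D(o||f), in bits
    DS_check = -log2(f(o == 1));                % reduces to the logarithmic (Ignorance) score
    fprintf('DS = %.3f bits; -log2(f_obs) = %.3f bits\n', DS, DS_check);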
4. Note that in the reservoir case, unlike the horse race, we also need to take into account the influence of the current decision on the future returns through the state; see chapter 7.

Extending the analogy to the useful and well-known decomposition of the Brier score, the divergence score can be decomposed into uncertainty – resolution + reliability. This decomposition can be interpreted as "the remaining uncertainty is the climatic uncertainty minus the true information plus the wrong information". For binary events, the Brier score and its components are second-order approximations of the divergence score and its components. The divergence score and its decomposition generalize to multi-category forecasts. A distinction can be made between nominal and ordinal category forecasts. Scores based on the cumulative distribution over ordinal categories can be seen as combinations of binary scores on multiple thresholds. How the scores for all thresholds should be weighted relative to each other depends on the user of the forecast. Scores on cumulative distributions are therefore not exclusively dependent on physical observations, but contain subjective weights for the different thresholds. Two possible formulations of a ranked divergence skill score have been formulated. The first weights the skill scores relative to climate equally, while the second weights the absolute scores equally. The second ranked divergence skill score is equal to the existing ranked mutual information skill score for the case of perfectly calibrated forecasts, but additionally includes a reliability component, measuring miscalibration.

In forecasting, a distinction can be made between information and useful information in a forecast. The latter cannot be evaluated without a statement about the context in which the forecast will be used. The former depends only on how the forecasts relate to the observations and is objective. Therefore, in the author's opinion, information should be the measure for forecast quality. It can be measured using the logarithmic score, which now can be interpreted as the Kullback-Leibler divergence of the forecast from the observation. Useful information or forecast value, on the other hand, is a different aspect of forecast 'goodness' (Murphy, 1993) that should be evaluated while explicitly considering the decision problems of the users of the forecast. The Brier score can be used as an approximation for quality or as an exact measure of value under the assumption of a group of users with uniformly distributed cost-loss ratios. It is argued that these two applications should be clearly separated. In case one wants to assess quality, information-theoretical scores should be preferred. If an approximation is sufficient, the Brier score could still be used, with the advantage that it is well understood and that extensive experience exists with its use. However, when extreme probabilities have to be forecast, the differences might become significant and the divergence score is to be preferred on theoretical grounds. In case value is to be measured, an inventory of the users of the forecasts should be made to assess the total utility. When explicitly investigating the user base, a better estimator for utility than the Brier score can probably be defined. Using the Brier score as a surrogate for forecast value, implicitly assuming the emergent utility function is appropriate for a specific type of forecast, is clearly unsatisfactory.
In this respect, it is important to stress that also the divergence score does not measure value, but quality. Only in a very unrealistic case (a bookmaker offering fair odds) a clear relation exists between the two. In chapter 6, it is argued that for practitioners in meteorology and hydrology, quality is most likely of more concern than value, because the latter is in fact evaluating decisions rather than forecasts. 120 Using information theory to measure forecast quality When evaluating forecasts by comparing them with observations that are affected by measurement uncertainty, the observation uncertainty can influence the evaluation of the forecast quality. The observations should then be represented by probability distributions. When extending the use of the divergence score to the case of uncertain observations, cross entropy score is a more intuitive measure for intercomparison of forecasts at locations with different observational uncertainty. The divergence score (DS) can be interpreted as a measure for the remaining uncertainty relative to the observation. The cross entropy score can be seen as the expected remaining uncertainty with respect to a hypothetical true outcome. Both scores can be decomposed into uncertainty, resolution and reliability. The difference in the decompositions is in the uncertainty component. For the case of the cross-entropy, it represents the climatic uncertainty relative to the truth, for the case of the divergence score it represents the climatic uncertainty relative to the observation. The difference between the two uncertainty components is equal to the difference between the cross-entropy and divergence scores, and corresponds to the average observational uncertainty. If the observations are assumed perfect, which is usually the case in verification practice, both scores and decompositions are equal. Starting from the observation that deterministic forecasts are still commonly used and evaluated, but are worthless from an information-theoretical viewpoint, the conclusion can be drawn that these forecasts are either implicitly probabilistic or should be viewed in connection to a decision problem. In both interpretations, the evaluation depends on external information that is not provided in the forecast. Deterministic forecasts leave too much interpretation to the user, if seen as implicit probabilistic forecasts, or make too many assumptions on the user if they are evaluated using another utility measure. On the one hand, forecasting can be seen as a communication problem in which uncertainty about the outcome of a random event is reduced by delivering an informative message to a user. On the other hand, forecasting can be seen as an addition of value to a decision problem. Any measure that is not information only becomes meaningful when it is interpreted in terms of utilities. Science is required to make testable predictions. Forecasts should therefore be stated in terms that make it clear how to evaluate them. Deterministic and interval forecasts fail this criterion, because additional assumptions on utility and probability have to be made during evaluation of the forecast. Probabilistic forecasts can be evaluated using information theory. 
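As an illustration of how the score and the decomposition summarized above can be computed for a series of binary forecasts, the following Matlab/Octave fragment is a minimal sketch; the forecast and observation series are invented, and the grouping by unique forecast values stands in for the binning that would be used with a larger sample.

    % Divergence score and its uncertainty - resolution + reliability decomposition
    % for a short series of binary probability forecasts (toy numbers).
    f = [0.1 0.1 0.8 0.8 0.8 0.3 0.3 0.1];   % forecast probabilities of occurrence
    x = [0   1   1   1   0   0   1   0  ];   % binary observations (1 = event occurred)
    N = numel(x);

    % KL divergence (bits) between Bernoulli distributions with occurrence
    % probabilities p and q; terms with p = 0 or p = 1 are handled explicitly.
    klb = @(p,q) (p>0).*p.*log2(max(p,realmin)./q) + ...
                 (p<1).*(1-p).*log2(max(1-p,realmin)./(1-q));

    obar = mean(x);                          % climatological frequency of the event
    DS   = mean(klb(x, f));                  % divergence score (observations as 0/1 pmfs)
    UNC  = 1 - klb(obar, 0.5);               % climatic uncertainty H(obar), in bits

    [fk, ~, idx] = unique(f);                % group forecasts that are identical
    nk  = accumarray(idx(:), 1);             % number of times each forecast was issued
    obk = accumarray(idx(:), x(:)) ./ nk;    % observed frequency within each group

    REL = sum(nk/N .* klb(obk, fk(:)));      % reliability: the "wrong information"
    RES = sum(nk/N .* klb(obk, obar));       % resolution: the "correct information"
    fprintf('DS = %.4f;  UNC - RES + REL = %.4f\n', DS, UNC - RES + REL);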
To facilitate the use of the score and its decomposition, scripts that can be used in Matlab and Octave are available on the website http://www.hydroinfotheory.net.

Chapter 6

Some thoughts on modeling, information and data compression

"If arbitrarily complex laws are permitted, then the concept of law becomes vacuous, because there is always a law!" – Hermann Weyl, 1932: The Open World, summarizing Gottfried Wilhelm Leibniz, 1686: Discours de métaphysique V, VI¹

Abstract – This chapter² concerns the relation between information and models. Models play an important role in optimal water system operation. They mainly serve as tools for making predictions. These predictions are of paramount importance for making good decisions, but also pure science is required to make testable predictions. While the previous chapter dealt with evaluating such predictions, this chapter goes one step further and looks into the question of how such evaluations contribute to improvements in the predictions and the models that make them. Furthermore, the question is addressed whether the purpose of a model should play a role in its calibration. The process of learning from data is analyzed from the intuitive information-theoretical point of view, starting from the strong connection between modeling, data compression and description length. Algorithmic information theory is introduced as a little-known but fundamental framework within philosophy of science. The findings lead to a reflection on the role and importance of understanding in science. From this analysis, some important recommendations follow for the practice of formulating models.

1. Original French text: "... Car supposons, par exemple, que quelqu'un fasse quantité de points sur le papier à tout hasard, comme font ceux qui exercent l'art ridicule de la géomance. Je dis qu'il est possible de trouver une ligne géométrique dont la notion soit constante et uniforme suivant une certaine règle, en sorte que cette ligne passe par tous ces points, et dans le même ordre que la main les avait marqués. ... Mais quand une règle est fort composée, ce qui lui est conforme passe pour irrégulier. ..." (For suppose, for example, that someone makes a number of points on paper entirely at random, as do those who practice the ridiculous art of geomancy. I say that it is possible to find a geometric line whose notion is constant and uniform according to a certain rule, such that this line passes through all these points, and in the same order in which the hand marked them. ... But when a rule is very complex, what conforms to it passes for irregular. ...)

2. Contains material from:
– S.V. Weijs, G. Schoups, and N. van de Giesen. Why hydrological forecasts should be evaluated using information theory. Hydrology and Earth System Sciences, 14(12), 2545–2558, 2010
– S.V. Weijs, Interactive comment on "HESS Opinions 'A random walk on water'" by D. Koutsoyiannis. Hydrology and Earth System Sciences Discussions, 6, C2733–C2745, 2009

Science                         Data compression
real world                      data generating algorithm
observations                    file to be compressed
theory                          decompression algorithm
errors/remaining uncertainty    compressed file
complexity of theory            size of decompr. algorithm

Computation                     Physics
computer                        physical system
computation                     motion
input                           initial state
rules                           laws of physics
output                          final state

Table 6.1: Analogy between science and data compression (left) and between physical systems and computation (right, according to Deutsch (1998))

6.1 Introduction

Science and data compression have the same objective: by discovering patterns in (observed) data, they can describe them in a compact form. In the case of science, we call this process of compression "explaining" and the compact form a "theory" or physical "law". The similarity of these objectives leads to strong parallels between philosophy of science and the theory of data compression; see table 6.1.
A formal description of these ideas was put forward in the “formal theory of inductive inference” by Solomonoff (1964), who was among others inspired by the grammars of Chomsky (1956) and the foundations of probability according to Carnap (1950). Together with similar ideas, independently developed by Kolmogorov (1968) and Chaitin (1966), Solomonoffs theory was the start of the field of “algorithmic information theory” (AIT). The theory offers formal definitions of complexity and algorithmic probability and gives an explication3 of Occam’s razor, which states that the simplest explanation is the best. Because this information theory uses formal theories of computation, the parallels between physical systems and computers must also be noted (see right of table 6.1). Algorithmic information theory complements the information theory of Shannon (1948). Together, these information theories offer a view on the parallels between data compression and model inference. Although the theories of Shannon and Solomonoff start from quite different perspectives, they often lead to remarkably similar results. While Shannon’s theory is useful to deal with predictions, algorithmic information theory also considers the models that produce them, making it a good basis for inference. This chapter reviews the implications of these theories, which link data compression to model inference and appear to be largely unknown in the hydrological community. An exploratory real world application is presented, where data-compression is applied to hydrological time series to reveal their information content for inference; see also Weijs and van de Giesen (2011). 6.1.1 The principle of parsimony A specially designed data compression algorithm can in principle compress a given series to one bit, if the corresponding decompression algorithm contains the whole series and outputs this series if it encounters the bit in the compressed file. This is equivalent to having built a model of the data that is in fact nothing more than restating the same 3. Explication is a term introduced by Carnap (1950). It is the process of making a pre-scientific idea (explicandum) into something precise (explicatum). 6.1. Introduction 123 data. The model has become as complex as the original data was. When there is unseen data, it is not possible to compress or decompress it with the same algorithms that were developed for the seen data. Therefore, in representing the new data, both the one bit file and the specific decompression algorithm for that file is needed, giving a total file size that is not shorter than the original data. Equivalently, a model of such complexity can reproduce the very specific structure in that data, but can not make predictions applying the found structure to new data. This is known as overfitting and is an important problem throughout science and hydrology is no exception. In hydrological modeling, the problem is often that limited data are available about complex processes. Therefore, models that perform well in calibration often lack predictive accuracy for unseen data. Generally, more complex models offer higher flexibility in fitting the calibration data, but do not always offer good predictions, while simple models lack precision both in describing the calibration data and predicting new events. A model should therefore be complex enough to describe the data, but not too complex; see e.g. Dooge (1997); Koutsoyiannis (2009); Schoups et al. (2008). 
Qualitatively this can be formulated in the form of Occam’s razor (“The simplest explanation is the best”) or the principle of parsimony. Quantum physicist David Deutsch states a similar principle in terms of explanations (Deutsch, 1998): “do not complicate explanations beyond necessity, because if you do, the unnecessary complications themselves remain unexplained.” Of course this definition leaves us with the question how to define explanation. Some discussion on this topic follows in section 6.5. Jaynes (2003) puts it this way: “Do not introduce details that do not contribute to the quality of your inferences” and the data compression view would be: “Do not increase the size of your compression algorithm beyond the gains in compression that you achieve” Independently of how the principle is stated, there seems to be such a thing as an optimal model complexity. Several methods have been proposed to determine this optimum. Schoups et al. (2008) compare several of these methods applied to hydrological prediction. Examples of model complexity control methods are cross validation, the Akaike information criterion, AIC, Akaike (1974) and the Bayesian information criterion, BIC, Schwarz (1978), which are all formalizations of the principle of parsimony. More background about several model selection methods can be found in the books by Burnham and Anderson (2002); Li and Vitanyi (2008); Vapnik (1998); Grünwald (2007); Rissanen (2007). 6.1.2 Formalizing parsimony In the Bayesian framework, comparing model predictions with observed data gives the likelihood, i.e. the conditional probability of the observations, given the model. By multiplying the prior probabilities of parameters and models with the likelihoods, we obtain predictive distributions. The prior parameter estimates are usually based on expert knowledge and 124 Modeling, information and data compression other side information, which is sometimes referred to as “soft information” (Winsemius et al., 2009). These distributions and model structures are thus already conditioned on large amounts of information from observations. Each prior distribution over possible models and parameters is the posterior distribution of previous gains in knowledge. In that sense, “standing on the shoulders of giants” might be translated as “updating distributions of well-informed people”. When we follow the growth of knowledge backwards, ultimately we end up with the problem of specifying a prior over models, before we have seen any data. This also becomes relevant if we want to base predictions purely on data, such as for example in the study of artificial intelligence. Intuitively, the prior probability of a model, before seeing any data, has an inverse relation with its complexity. This is a reflection of the principle of parsimony or Occam’s razor. In algorithmic information theory, introduced in section 6.2, we will see that complexity is formalized as the minimal length to describe a program (i.e. model) and has a relation to probability. It therefore quantifies the principle of parsimony. The length of a program that describes the observations is strongly related to compression. If we can produce data with a short program, we can store the program instead of the data and save space. For example, the highly detailed and seemingly complex fractal figure on the cover of this thesis can be stored by the simple algorithm of the Mandelbrot-set and some information regarding coordinates and coloring. 
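Returning to the complexity-control criteria mentioned above, a minimal sketch can show how such formalizations of parsimony behave. Polynomial regression on synthetic data is used here purely as a stand-in for a family of models of increasing complexity; AIC and BIC are written in their Gaussian-error forms, and none of the numbers refer to the hydrological examples of this thesis.

    % Select a polynomial order by AIC and BIC (Gaussian errors assumed; both
    % criteria then reduce to functions of the residual sum of squares and the
    % number of parameters k).
    x = (0:0.05:5)';
    y = 2 + 0.5*x - 0.1*x.^2 + 0.3*randn(size(x));   % synthetic data, true order 2
    n = numel(y);

    for order = 0:5
        c    = polyfit(x, y, order);
        rss  = sum((y - polyval(c, x)).^2);
        k    = order + 2;                  % polynomial coefficients plus error variance
        aic  = n*log(rss/n) + 2*k;
        bic  = n*log(rss/n) + k*log(n);
        fprintf('order %d:  AIC = %8.1f   BIC = %8.1f\n', order, aic, bic);
    end
    % Both criteria penalize the decreasing RSS with a term that grows with k,
    % so the preferred order stops increasing once the extra flexibility no
    % longer pays for its description cost.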
If a model (program) does not perfectly describe the data, but only gives high probability to it, then some extra space is needed to encode and store which observation actually occurred, given the predictions of the model. This is also how several data compression algorithms work. In section 6.3 it will be shown that, using an optimal coding scheme, this extra storage per observation converges to the divergence score treated in the previous chapter.

The next section briefly presents some of the main ideas of algorithmic information theory and gives some references for further reading. Subsequently, the relation between the divergence score and data compression is explained in detail. Section 6.4 presents a practical experiment in which hydrological data are compressed with some well-known data compression algorithms to find upper bounds for their information content for hydrological modeling. Subsequently, in section 6.5 the data-compression perspective is used to look critically at what we call understanding. When hydrological models are used to support risk-based water system operation, they often have a specific purpose other than just understanding. In section 6.6, the question is addressed whether these models should be trained to minimize risk rather than to minimize uncertainty. Some recommendations for the practice of modeling conclude the chapter.

6.2 Algorithmic information theory, complexity and probability

This section introduces algorithmic information theory (AIT), which stems from computer science and early work on artificial intelligence, but has far wider implications. Before going into the theory, it is useful to briefly review the Bayesian perspective on model inference. It provides the link between probabilistic forecasts, as treated in chapter 5, and the quality of the models that produce them. Subsequently, after introducing Turing's formal theory of computation, the connection between description length and probability will become clear when AIT is introduced.

6.2.1 The Bayesian perspective

As was argued in chapter 5, probabilistic forecasts can be judged by how small they make the remaining uncertainty about the truth. For perfect observations, this is achieved by minimizing the divergence score

\[ \min \, \mathrm{DS} = \min \left\{ \frac{1}{N}\sum_{t=1}^{N} D_{\mathrm{KL}}\left(o_t \,\|\, f_t\right) \right\} = \min \left\{ -\frac{1}{N}\sum_{t=1}^{N} \log \left[f_t\right]_{j(t)} \right\} \qquad (6.1) \]

where j(t) is the index of the event that occurred at instance t. This minimization is equivalent to maximizing the likelihood, i.e. maximizing the total probability of the data (D), given the model (M)

\[ \max \, p(D|M) = \max \left\{ \prod_{t=1}^{N} \left[f_t\right]_{j(t)} \right\} \qquad (6.2) \]

However, in hydrological model inference, we want to find a probable model, given the data, instead of the other way around. When doing model inference, i.e. induction, the probability of interest is thus p(M|D, I) instead of p(D|M). These two probabilities are related through Bayes' rule (using the notation of Jaynes (2003))

\[ p\left(M_i \mid D, I\right) = \frac{p\left(M_i \mid I\right)\, p\left(D \mid M_i, I\right)}{p\left(D \mid I\right)} \qquad (6.3) \]

where p(M_i|D, I) stands for the probability of model i, given the data D and prior information I. Therefore, a prior probability for a model (a combination of model structure and parameter values) is needed. This is captured by p(M_i|I). Note that p(D|I) serves as a normalizing constant and can be eliminated if a fixed set of models is compared. To find a plausible model, it both has to explain the data well and have a high a priori probability, i.e.
be consistent with knowledge gathered from previous experiments and observations. Ideally, to prevent false certainty, the search should be done over the entire space of models that have nonzero prior probability and nonzero likelihood, but this is impossible in practice. Note that many hydrological modeling studies that attempt Bayesian model inference start from a single model structure and convert prior to posterior parameter distributions using the likelihood based on calibration data. Unless there really is evidence that allows the identification of single model structure that is uniquely consistent with prior information, this approach will generally underestimate uncertainty. Recent developments focus on considering multiple model structures, e.g. Fenicia et al. (2008, 2010). However, if after this analysis a single model is chosen that is supposed to represent best 126 Modeling, information and data compression the true catchment behavior, a logical error is committed, because all models but the best are ruled out without evidence to do so. This is an example of overfitting by searching the model-space. Multi-model prediction schemes like Bayesian model averaging Duan et al. (2007) prevent this and are more consistent with Bayesian probability as the logic of science, as expounded in Jaynes (2003). 6.2.2 Universal computability Algorithmic information theory makes use of the concept of program length. A model is seen as an algorithm or program that runs on a computer. This might seem like a very specific, arbitrary and limited definition, but in fact it is not. It is beyond the scope of this thesis to treat the foundations of computability theory in detail, but the main results will now be mentioned; see Boolos et al. (2007) for more background. One important basis for the ideas in AIT is the Church-Turing thesis, which states that any effectively computable function can be computed by a model of computation Turing (1937) introduced: the Universal Turing Machine (UTM). This thesis has not been proven, but no counterexample has been found so far and it widely believed to be true. Furthermore it was shown that all possible UTMs can simulate each other, given unlimited resources (memory and time). All binary modern age computers are examples of UTMs, although in practice they are bounded by finite memory. They can thus simulate any other Turing machine, and be simulated by any universal Turing machine. A UTM reads symbols from an input tape one by one and changes its internal state according to the symbols read from the tape. Furthermore it can move the (infinite) tape right or left and (over)write symbols on the tape. The first part of the tape can be viewed as a program that tells it how to simulate other computers. An example of a very small UTM has only 7 states (working memory) and 4 different symbols on the tape. Shannon (1956) showed that it is possible to exchange states for symbols on the tape and gave examples of UTM’s with just two states or just binary symbols on the tape. Such simple computers, given sufficient time and tape length (memory), can thus simulate any other computer and perform all possible calculations. Another result from Turing (1937) is that the so-called halting problem is unsolvable. This means that for some programs, the only way to determine whether they will ever halt is to run those programs and test them. 
For example, one can think about programs with “while-loops” in them, that continue until some condition that follows from the calculation in the loop is fulfilled. This limitation is related to the incompleteness theorems of Gödel (1931) and has substantial implications. As we will see in this chapter, it also means that it is ultimately impossible to say whether we have found the best model. Given the fact that all known physical processes are Turing-computable, but quantum processes take extreme resources to simulate, Deutsch (1985, 1998) expanded the TuringChurch thesis to state that “every finitely realizible physical system can be perfectly simulated by a universal model computing machine operating by finite means”. This analogy is also depicted in Tab. 6.1 and means that any computer is a physical system and any physical system can be regarded as a computer. Physical processes can thus be simulated 6.2. Algorithmic information theory, complexity and probability 127 on a binary computer and, conversely, computations can be performed by physical systems (cf. e.g. quantum computers). In hydrological terms it means that any hydrological process can in principle be simulated with arbitrary precision by a binary computer. 6.2.3 Kolmogorov complexity, patterns and randomness Using the thesis that any computable sequence can be computed by a UTM and that program lengths are universal up to an additive constant (the length of the program that tells one UTM how to simulate another), Kolmogorov (1968) gave very intuitive definitions of complexity and randomness; see also (Li and Vitanyi, 2008) for more background. Kolmogorov defined the complexity of a certain string (i.e. data set, series of numbers) as the length of the minimum computer program that can produce that output on a UTM and then halt. Complexity of data is thus related to how complicated it is to describe. If there are clear patterns in the data, then they can be described by a program that is shorter than the data itself. The majority of conceivable strings of data cannot be “compressed” in this way. Data that cannot be described in a shorter way than naming the data itself is defined as random. This is analogous to the fact that a “law” of nature cannot really be called a law if its statement is more elaborate than the phenomenon that it explains (see Tab. 6.1). A problem with Kolmogorov complexity is that it is incomputable, but can only be approached from above. This is related to the unsolvability of the halting problem: it is always possible that there exists a shorter program which is still running (possibly in an infinite loop) that might eventually produce the output and then halt. A paradox that would arise if Kolmogorov complexity were computable is the following definition known as the Berry paradox: “The smallest positive integer not definable in under eleven words”. 6.2.4 Algorithmic probability and Solomonoff induction A few years before Kolmogorov published his paper on complexity, Solomonoff (1964) published a “Formal theory of inductive inference”, which has many parallels with the ideas of Kolmogorov. He realized that predictions always rely on finding patterns in past data to predict the next symbol or number in the sequence. A pattern that describes the data, but is very complex usually turns out not to be a good predictor, because it is “overfitted” to the data. Universal Turing machines provide an excellent means to formalize this. 
A program for a UTM that starts with the known data as output, but is almost as long as the data, is less likely to give correct values for subsequent data points. In other words, if a simple pattern can be recognized in the data, it is more likely to be the process that generated the data than a complex pattern. This is also related to the fact that programs for a Turing machine that execute and then halt cannot be the prefix of other programs. This means that other programs cannot start with symbols that form a halting program, because the machine would halt before the other symbols are read. Solomonoff used the requirement for prefix-free programs to define algorithmic probability. For example, if computer programs were generated by random coin flips, one quarter of these programs would start with the symbols 01. If "01" were a halting program, it would represent one quarter of the prior probability of all existing programs. Generally, we can associate an algorithmic probability of $2^{-|M_i|}$ with every halting program, where $|M_i|$ is the length of the $i$th program $M_i$ on Turing machine M. This probability $p(M_i)$ could thus be seen as a universal prior for model i with respect to machine M. These priors can be combined with the likelihood of different models with respect to the data, to yield probabilistic predictions. In Solomonoff (1964), several variations of this concept are described, which turn out to be equivalent. Solomonoff (1978) gives a sharp upper bound for the prediction error of these predictions and proves that it outperforms any other universal prediction method up to a machine-dependent additive constant. The following equation, using probabilistic models, is perhaps the most intuitive formulation of Solomonoff's universal prediction from data

\[ P_M\left(a_{n+1} \mid a_1, a_2, \ldots, a_n\right) = \sum_i 2^{-|M_i|}\, S_i\, M_i\left(a_{n+1} \mid a_1, a_2, \ldots, a_n\right) \qquad (6.4) \]

In this equation, $P_M(a_{n+1}\mid a_1, a_2, \ldots, a_n)$ is the probability that machine M predicts for outcome $a_{n+1}$, given all previous observed data, $M_i(a_{n+1}\mid a_1, a_2, \ldots, a_n)$ is the forecast probability that model $M_i$ (the $i$th program) predicts, $S_i$ is the total probability that the model assigned to the observed data (the likelihood, cf. Eq. 6.2) and $2^{-|M_i|}$ is the algorithmic prior probability for model $M_i$. We can see from the formula that the predictive probability uses a sum over the outcomes of all possible models, weighted by their prior probabilities. It can thus be seen as an instance of Bayes' rule (cf. Eqs. 6.4 and 6.5)

\[ P_M\left(D_{n+1} \mid D_1, \ldots, D_n, M\right) = \sum_i p\left(M_i\right)\, p\left(D_1, \ldots, D_n \mid M_i\right)\, p\left(D_{n+1} \mid D_1, \ldots, D_n, M_i\right) \qquad (6.5) \]

Equation 6.5 gives the probability for all possible values of datapoint $D_{n+1}$, given all previous data, for the reference machine M. When using more complex computers or languages, some functions will be relatively shorter (more probable) and others longer (less probable). The computer is thus a way to represent prior information about which basic patterns we find in nature. Note that when the output to be produced gets longer, the programs also get longer, making the program-length cost of one computer simulating another relatively less important. This is exactly equivalent to Bayes' rule, where the prior becomes less important when more data are available. The reason why Solomonoff induction is not the answer to all problems in science is that, like Kolmogorov complexity, it is incomputable due to the insoluble halting problem.
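A drastically simplified, computable caricature of Eq. 6.4 can nevertheless be written down for a finite set of candidate models. The three "models", their description lengths and the data below are invented for illustration only; the point is merely to show how description-length priors and track record combine into a mixture forecast.

    % Toy version of Eq. 6.4: mixture prediction over a finite set of models,
    % weighted by 2^(-description length) priors and by their likelihoods.
    x = [1 1 0 1 1 0 1 1 0];                 % observed binary sequence so far

    % Candidate "models": predicted probability that the next symbol is 1,
    % possibly depending on the history h (only very simple rules are used here).
    models = { @(h) 0.5, ...                             % pure chance
               @(h) (sum(h)+1) / (numel(h)+2), ...       % Laplace frequency estimate
               @(h) double(mod(numel(h),3) ~= 2) };      % "every third symbol is 0"
    len    = [1 5 8];                        % assumed description lengths |M_i| in bits
    prior  = 2.^(-len);                      % algorithmic-style prior weights

    % Likelihood S_i of each model on the data seen so far
    S = ones(size(prior));
    for i = 1:numel(models)
        for t = 1:numel(x)
            p1   = models{i}(x(1:t-1));
            S(i) = S(i) * (p1*x(t) + (1-p1)*(1-x(t)));
        end
    end

    % Mixture forecast for the next symbol (cf. Eq. 6.4), renormalized so that
    % it is a proper probability
    w      = prior .* S;
    p_next = sum(w/sum(w) .* cellfun(@(m) m(x), models));
    fprintf('P(next symbol = 1) = %.3f\n', p_next);

Even though the periodic rule has the longest description, its perfect track record on the sequence gives it most of the posterior weight, which is the trade-off the text describes.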
Solomonoff induction should thus be seen as a gold standard that shows the limits of what is possible, and as a guideline to develop methods that approach it and are computable. A perfect method of inference is thus by definition impossible with finite computational resources.

6.2.5 Computable approximations to automated science

Computable approximations to Solomonoff induction can be achieved by limiting or penalizing the time spent in a program or the memory it uses; see e.g. Levin search. Another approach is to perform the summation over a subset of functions (programs) that are computable. One could for example try ever larger neural networks. Some other approaches, like minimum description length (Rissanen, 2007; Grünwald, 2007), take only the best model instead of a sum over all models. While these methods may not be optimal, they provide computable approximations that will converge when given enough data.

In principle, the capabilities of machines to do inductive inference will increase with the growing computing power available. In combination with the growing archives of measured environmental data, see e.g. Lehning et al. (2009) in "the fourth paradigm", this will make the role of computers in science even more important. For example, in Schmidt and Lipson (2009) an algorithm discovered laws of motion for a double pendulum by searching the space of all possible models by genetic programming, and in the same issue of Science, the robot scientist Adam independently generated and tested scientific hypotheses (King et al., 2009).

6.3 The divergence score: prediction, gambling and data compression

The divergence score presented in chapter 5 as a measure for forecast quality has an interpretation in data compression and gambling. The gambling interpretation is due to Kelly (1956), and in appendix C it is explained how informative forecasts lead to good gambling returns. In the previous section it was also stated that the divergence score has an interpretation as the minimum average description length per observation. This section explains the basic background of this data compression analogy from the viewpoint of the information theory of Shannon (1948), starting from the strong analogy with gambling.

Analogously to the gambling problem, where high stakes are put on the most likely events, yielding the highest returns, data compression seeks to represent the most likely events (most frequent characters in a file) with the shortest codes, yielding the shortest total code length. As is the case with dividing the stakes, short codes are also a limited resource that has to be allocated as efficiently as possible. When required to be uniquely decodable, short codes come at the cost of longer codes elsewhere. This follows from the fact that such codes must be prefix free, i.e. no code can be the prefix of another one. This is formalized by the following theorem of McMillan (1956), who generalized the inequality (Eq. 6.6) of Kraft (1949) to all uniquely decodable codes:

\[ \sum_i A^{-l_i} \leq 1 \qquad (6.6) \]

in which A is the alphabet size (2 in the binary case) and $l_i$ is the length of the code assigned to event i. In other words, one can see the analogy between gambling and data compression through the similarity between the scarcity of short codes and the scarcity of large fractions of wealth.
Just as there are only 4 portions of 1/4 of the wealth available (you cannot divide a pie in five quarters), there are only 4 prefix-free binary codes of length $\log_2 4 = 2$ (see table 6.2, code A). In contrast to fractions of wealth, which can be chosen freely, the code lengths are limited to integers. For example, code B in the table uses one code of length 1, one of length 2 and two of length 3; we can verify that it sharply satisfies Eq. 6.6, using A = 2 (notice also the analogy to Fig. 3.1 on page 41): we find

\[ 1 \cdot 2^{-1} + 1 \cdot 2^{-2} + 2 \cdot 2^{-3} = 1 \leq 1 \]

In table 6.2, it is shown how the total code length can be reduced by assigning codes of varying length depending on occurrence frequency.

          occurrence frequencies       codes     expected code lengths per value
event      I        II       III       A    B    A_I    B_I    A_II   B_II   A_III  B_III
1          0.25     0.5      0.4       00   0    0.5    0.25   1      0.5    0.8    0.4
2          0.25     0.25     0.05      01   10   0.5    0.5    0.5    0.5    0.1    0.1
3          0.25     0.125    0.35      10   110  0.5    0.75   0.25   0.375  0.7    1.05
4          0.25     0.125    0.2       11   111  0.5    0.75   0.25   0.375  0.4    0.6
total      H=2      H=1.75   H=1.74              2      2.25   2      1.75   2      2.15

Table 6.2: Assigning code lengths proportional to minus the log of their probabilities leads to optimal compression. Code B is optimal for distribution II, but not for the other distributions. Distribution III has no optimal code that achieves the entropy bound.

As shown by Shannon (1948), if every value could be represented with one code, allowing for non-integer code lengths, the optimal code length for an event i is $l_i = \log(1/p_i)$. The minimum average code length is the expectation of this code length over all events, H bits per sample, where H can be recognized as the entropy of the distribution (Cover and Thomas, 2006), which is a lower bound for the average description length.

\[ H(p) = E_p\{l\} = \sum_{i=1}^{n} p_i \log \frac{1}{p_i} \qquad (6.7) \]

However, because in reality the code lengths have to be rounded to an integer number of bits, some overhead will occur. The rounded coding would be optimal for a probability distribution of events

\[ q_i = \frac{1}{2^{l_i}} \quad \forall i, \qquad (6.8) \]

such as frequencies II in table 6.2. In this equation, $q_i$ is the $i$th element of the probability mass function q for which the code would be optimal and $l_i$ is the code length assigned to event i. The overhead in the case where $p \neq q$ is $D_{KL}(p\,\|\,q)$, yielding a total average code length of

\[ H(p) + D_{KL}(p\,\|\,q) \qquad (6.9) \]

bits per sample. In general, if a wrong probability estimate is used, the number of bits per sample is increased by the Kullback-Leibler divergence from the true to the estimated probability mass function. For probability distributions that do not coincide with integer ideal code lengths, the algorithm known as Huffman coding (Huffman, 1952) was proven to be optimal for value-by-value compression. It finds codes of an expected average length closest to the entropy bound and is applied in popular compressed picture and music formats like JPG, TIFF, MP3 and WMA. For a good explanation of the workings of this algorithm, the reader is referred to Cover and Thomas (2006). In table 6.2, code A is optimal for the naive probability distribution and code B is optimal for the distribution II. Both these codes achieve the entropy bound. Code B is also an optimal Huffman code for the distribution III (last column in table 6.2). Although the expected code length is now more than the entropy, it is impossible to find a shorter code.
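The numbers in table 6.2 can be reproduced with a few lines of Matlab/Octave; the fragment below is only a check of the arithmetic in the table.

    % Expected code lengths and entropies for the distributions and codes of Table 6.2
    P  = [0.25 0.25 0.25  0.25;      % distribution I
          0.5  0.25 0.125 0.125;     % distribution II
          0.4  0.05 0.35  0.2 ];     % distribution III
    lA = [2 2 2 2];                  % code lengths of code A
    lB = [1 2 3 3];                  % code lengths of code B

    H   = -sum(P .* log2(P), 2);             % entropies: 2, 1.75, 1.74 bits
    ELA = P * lA';                           % expected lengths with code A
    ELB = P * lB';                           % expected lengths with code B

    % overhead of code B on distribution III equals D_KL(III||II), cf. Eq. 6.9
    q   = 2.^(-lB);                          % distribution for which code B is ideal
    DKL = sum(P(3,:) .* log2(P(3,:) ./ q));
    fprintf('E[l_B] on III = %.2f = H(III) + D_KL = %.2f + %.2f\n', ELB(3), H(3), DKL);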
The overhead is equal to the Kullback-Leibler divergence from the true distribution (III) to the distribution for which the code would be optimal:

\[ D_{KL}(\mathrm{III}\,\|\,\mathrm{II}) = D_{KL}\left((0.4, 0.05, 0.35, 0.2) \,\|\, (0.5, 0.25, 0.125, 0.125)\right) = 0.4106 \]

If the requirement that the codes are value by value (one code for each observation) is relaxed, blocks of values can be grouped together to approach an ideal probability distribution. When the series are long enough, entropy coding methods like Shannon and Huffman coding using blocks can get arbitrarily close to the entropy bound (Cover and Thomas, 2006).

6.3.1 Dependency

If the values in a time series are not independent, however, the dependencies can be used to achieve even better compression. This high compression results from the fact that for dependent values, the joint entropy is lower than the sum of entropies of indi idual values. In other words, average uncertainty per value decreases, when all the other values in the series are known, because we can recognize patterns in the series, that therefore contain information about themselves. Hydrological time series often show strong internal depen encies, leading to better compression and bett r prediction. Consider, for example, the case where you are asked to gamble on (or assign code lengths to) the streamflow value on May 12, 1973. In one case, the information offered is the dark-colored climatological histogram (Fig. 6.1 on the right); in the second case, the time series is available (the left of the same figure). Obviously, the e pected compression and expected return for the bets are better in the second case, which shows the value of exploiting dependencies in the data. The surprise ($-\log p_{\mathrm{true\ value}}$) upon hearing the true value is 3.72 bits in case the guessed distribution was assumed and 4.96 bits when using the climate as prior. These surprises are equivalent to the divergence scores treated in the previous chapter.

Figure 6.1: The missing value in the flow time series can be guessed from the surrounding values (a guess would for example be the grey histogram). This will usually lead to a better guess than one purely based on the occurrence frequencies over the whole 40 year data set (dark histogram) alone. (The answer shown in the figure: 86.42 m³/s.)

Another example is given by the omitted characters that the careful reader may (not) have found in the previous paragraph. There are 48 different characters used, but the entropy of the text is 4.3 bits, far less than $\log_2(48) = 5.6$, because of, for example, the relatively high frequencies of the space and the letter "e". Although the entropy is more than 4 bits, the actual uncertainty about the missing letters is far less for most readers, because the structure in the text is similar to English language and that structure can be used to predict the missing characters. On the one hand this means that English language is compressible and therefore fairly inefficient. On the other hand this redundancy leads to more robustness in the communication, because even with many typographical errors, the meaning is still clear. If English were 100% efficient, any error would obfuscate the meaning.

In general, better prediction, i.e. less surprise, gives better results in compression. In water resources management and hydrology we are generally concerned with predicting one
series of values from other series of values, like predicting streamflow (Q) from precipitation (P ) and potential evaporation (Ep ). In gambling we would call precipitation and evaporation series side information. In terms of data compression, knowledge of P and Ep would help compressing Q, but would also be needed for decompression. When P , Ep and Q would be compressed together in one file, the gain compared to compressing the files individually is related to what a hydrological model learns from the relation between these variables. Similarly, we can try to compress hydrological time series to investigate how much information those compressible series really contain for hydrological modeling. 6.4 A practical test: “Zipping” hydrological time series In this section, a number of compression algorithms will be applied to different datasets to obtain an indication of the amount of information they contain. Most compression algorithms use entropy-based coding methods such as introduced in the previous section, often enhanced by methods that try to discover dependencies and patterns in the data, such as autocorrelation and periodicity. The data compression perspective indicates that formulating a rainfall-runoff model has an analogy with compressing rainfall-runoff data. A short description of the data will include a good model about it. However, not all patterns found in the data should be attributed to the rainfall-runoff process. For example, a series of rainfall values is highly compressible due to the many zeros (a far from uniform distribution), the autocorrelation, and the seasonality. These dependencies are in the rainfall alone and can tell us nothing about the relation between rainfall and runoff. The amount of information that the rainfall contains for the hydrological model is thus less than the number of data points multiplied 6.4. A practical test: “Zipping” hydrological time series 133 by the number of bits to store rainfall at the accurate precision. This amount is important because it determines the model complexity that is warranted by the data (Schoups et al., 2008). In fact, we are interested in the Kolmogorov complexity of the data, but this is incomputable. A crude practical approximation of the complexity is the filesize after compression by some commonly available compression algorithms. This provides an upper bound for the information in the data. If the data can be regenerated perfectly from the compressed (colloquially referred to as zipped) files, the compression algorithm is said to be lossless. In contrast to this, lossy compression introduces some small errors in the data. Lossy compression is mainly used for various media formats (pictures; video; audio), where these errors are often beyond our perceptive capabilities. This is analogous to a model that generates the observed values to within measurement precision. This section gives one example of lossy compression, but will be mainly concerned with lossless compression. Roughly speaking, the file size that remains after compression, gives an upper bound for the amount of information in the time series. Actually, also the code-length of the decompression algorithm should be counted towards this file size (cf. a self-extracting archive). In the present exploratory example the inclusion of the algorithmic complexity of the decompression algorithm will be left for future research. The compression algorithms will be mainly used to explore the difference in information content between different signals. 
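Such an exploration can be set up with very little code. The following Matlab/Octave fragment is a minimal sketch of the idea: it quantizes a series to 8-bit integers, stores it as raw bytes and compares the size before and after generic compression. The series is synthetic, and gzip is used here only as a convenient stand-in for the archivers discussed below (any general-purpose compressor would serve).

    % Rough upper bound on the information content of a series via compression
    x  = cumsum(randn(50000,1));                                  % autocorrelated toy series
    xi = uint8(floor(0.5 + 255*(x - min(x))/(max(x) - min(x))));  % 8-bit quantization

    fid = fopen('series.raw', 'w');
    fwrite(fid, xi, 'uint8');
    fclose(fid);

    gzip('series.raw');                                           % writes series.raw.gz
    raw = dir('series.raw');  gz = dir('series.raw.gz');
    fprintf('compressed size: %.1f%% of original\n', 100*gz.bytes/raw.bytes);

Because the toy series is strongly autocorrelated, it compresses to well below the size suggested by the value-by-value entropy alone, which is the same effect exploited in the experiment that follows.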
6.4.1 Data and Methods

The time series used

The algorithms are tested on a real-world hydrological dataset from Leaf River (MS, USA) consisting of rainfall, potential evaporation and streamflow. See e.g. Vrugt et al. (2003) for a description of this data set. As a reference, various artificially generated series were used. The generated series consist of 50000 values, while the time series of the Leaf River dataset contain 14610 values (40 years of daily values). The following series were used in this experiment:

constant contains only 1 value repeatedly. Intuitively, this file contains the least possible amount of information.
linear contains a slowly linearly increasing trend, ranging from 0 at the beginning of the series to 1 at the end of the series.
uniform_white is the output from the Matlab function "rand"; it is uncorrelated white noise with a uniform distribution between 0 and 1.
Gaussian_white is the output from the Matlab function "randn"; it is uncorrelated white noise with a normal distribution and is scaled between 0 and 1.
sine_1 is a sinusoidal wave with a wavelength spanning all 50000 values, ranging between 0 and 1.
sine_100 is a sinusoidal wave with a wavelength spanning 1/100 of 50000 values, ranging between 0 and 1. It is therefore a repetition of 100 sine waves.
Leaf_P is a daily rainfall series from the catchment of Leaf River (1948-1988).
Leaf_Q is the corresponding daily series of observed streamflow in Leaf River.

Figure 6.2: The effect of quantization (original versus 8-bit quantized streamflow). Because the errors are absolute, the largest relative errors occur in the low-flow periods, such as the one selected for this figure.

Quantization

Due to the limited amount of data, quantization is necessary to make correct estimates of the distributions, which are needed to calculate the amount of information and compression. This is analogous to the maximum number of bins permitted to draw an informative histogram. Although the quantization constitutes a loss of information, it does not affect the results, as they are all measured relative to the quantized series. All series were first quantized to 8 bit precision, using a simple linear quantization scheme (Eq. 6.10). Using this scheme, the series were split into 2⁸ = 256 equal intervals and converted into an 8 bit unsigned integer (an integer ranging from 0 to 255 that can be stored in 8 binary digits).

\[ x_{\mathrm{integer}} = \left\lfloor 0.5 + 255\, \frac{x - \min x}{\max x - \min x} \right\rfloor \qquad (6.10) \]

These can be converted back to real numbers using

\[ x_{\mathrm{quantized}} = \frac{\max x - \min x}{255}\, x_{\mathrm{integer}} + \min x \qquad (6.11) \]

Because of the limited precision achievable with 8 bits, $x_{\mathrm{quantized}} \neq x$. As can be seen in figure 6.2, the quantization leads to rounding errors, which can be quantified as a signal to noise ratio (SNR). The SNR is the ratio of the variance of the original signal to the variance of the rounding errors:

\[ \mathrm{SNR} = \frac{\frac{1}{n}\sum_{t=1}^{n} \left(x_t - \bar{x}\right)^2}{\frac{1}{n}\sum_{t=1}^{n} \left(x_t - x_{t,\mathrm{quantized}}\right)^2} \qquad (6.12) \]

Because the SNR can have a large range, it is usually measured in the form of a logarithm, which is expressed in the unit decibel: $\mathrm{SNR_{dB}} = 10 \log_{10}(\mathrm{SNR})$.

Compression algorithms

The algorithms that were used are a selection of commonly available compression programs and formats.
Below are very short descriptions of the main principles and main features of each of the algorithms used and some references for more detailed descriptions. The descriptions are sufficient to understand the most significant pattern in the results. It is beyond the scope of this thesis to describe the algorithms in detail. ARJ Uses LZ77 (see LZMA) with sliding window and Huffman coding. WAVPACK is a lossless compression algorithm for audio files. JPG_LS The Joint Photography Experts Group created the JPEG standard, which includes a range of lossless and lossy compression techniques. Here the lossless coding is used, which uses a Fourier-like type of transform (Discrete cosine transform) followed by Huffman coding of the errors). JPG_50 Is the the result of the JPG format in lossy mode. After the discrete cosine transform, it discards the higher frequencies (wavelet coefficients), which results in a loss of small-scale detail. HDF_RLE HDF (hierarchical data format) is a data format for scientific data of any form, including pictures, time series and metadata. It can use several compression algorithms, including run length encoding (RLE). RLE replaces sequences of reoccurring data with the value and the number of repetitions. It would therefore be useful to compress pictures with large uniform surfaces and rainfall series with long dry periods. PPMD a variant of Prediction by Partial Matching, implemented in the 7Zip program. It uses a statistical model for predicting each value from the preceding values using a variable sliding window. Subsequently the errors are coded using Huffman Coding. LZMA The Lempel-Ziv-Markov chain algorithm combines the Lempel-Ziv algorithm, LZ77 (Ziv and Lempel, 1977), with a Markov-Chain model. LZ77 uses a sliding window to look for reoccurring sequences, which are coded with references to the previous location where the sequence occurred. The method is followed by range coding. Range coding (Martin, 1979) is an entropy-coding method which is mathematically equivalent to arithmetic coding (Rissanen and Langdon, 1979), it has less overhead than Huffman coding. BZIP2 Uses the Burrows and Wheeler (1994) block sorting algorithm in combination with Huffman-Coding. PNG Portable Network Graphics (PNG) uses a filter based on prediction of one pixel from the preceding pixels. Afterwards, the prediction errors are compressed by the algorithm “Deflate” which uses dictionary coding (matching repeating sequences) followed by Huffman coding. TIFF A container image format that can use several compression algorithms. In this case PackBits compression was used, which is a form of run length encoding. 136 Modeling, information and data compression 6.4.2 Results Lossless compression results As expected, the filesizes after quantization are exactly equal to the number of values in the series, as each value is encoded by one byte (8 bits) and stored in binary raw format. From the occurrence frequencies of the 256 unique values, the entropy of their distribution was calculated. Normalized with the maximum entropy of 8 bits, the fractions in row 3 of table 6.3 give an indication of the entropy bound for the ratio of compression achievable by value by value entropy encoding schemes such as Huffman coding, which do not use temporal dependencies. The signal to noise ratios in row 4 give an indication of the amount of data corruption that is caused by the quantization. 
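Both diagnostics can be computed in a few lines from a quantized series; in the sketch below a synthetic Gaussian series stands in for the actual data, so the printed numbers only illustrate the calculation and are not those of table 6.3.

    % Normalized entropy of the 256 quantization levels and the signal-to-noise
    % ratio of the quantization (Eqs. 6.10-6.12), for a synthetic series.
    x  = randn(50000,1);
    xi = floor(0.5 + 255*(x - min(x))/(max(x) - min(x)));     % Eq. 6.10
    xq = (max(x) - min(x))/255 * xi + min(x);                 % Eq. 6.11

    cnt   = accumarray(xi + 1, 1, [256 1]);                   % occurrence of each level
    p     = cnt(cnt > 0) / numel(xi);
    Hfrac = -sum(p .* log2(p)) / 8;                           % entropy / maximum of 8 bits

    snr   = mean((x - mean(x)).^2) / mean((x - xq).^2);       % Eq. 6.12
    fprintf('H/8 = %.1f%%,  SNR = %.0f (%.1f dB)\n', 100*Hfrac, snr, 10*log10(snr));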
As a reference, the uncompressed formats BMP (Bitmap), WAV (Waveform audio file format), and HDF (Hierarchical Data Format) are included. They indicate that the file size of those formats, relative to the raw data, does not depend on what data are in them, but does depend on the amount of data, because they have a fixed overhead that is relatively smaller for larger files. The results for the various lossless compression algorithms are shown in rows 7-17. The numbers are the file size after compression as a percentage of the original file size (a lower percentage indicates better compression). The best compression ratios per time series are highlighted.

From the results it becomes clear that the constant, linear and periodic signals can be compressed to a large extent. Most algorithms achieve this high compression, although some have more overhead than others. The uniform white noise is theoretically incompressible, and indeed none of the algorithms appears to know a clever way around this. In fact, the smallest file size is achieved by the WAV format, which does not even attempt to compress the data and has a relatively small file header (meta information about the file format). The Gaussian white noise is also completely random in time, but does not have a uniform distribution. Therefore the theoretical limit for compression is the entropy bound of 86.3%. The WAVPACK algorithm gets closest to this theoretical limit, but several file archiving algorithms (ARJ, PPMD, LZMA, BZIP2) also approach it very closely. This is because they all use a form of entropy coding as a backend (Huffman and range coding). Note that the compression of this non-uniform white noise signal is equivalent to the difference in uncertainty expressed by the bars "naive" and "climate" in Fig. 5.8 on page 111.

The results for the hydrological series firstly show that the streamflow series is better compressible than the precipitation series. This is remarkable, because the rainfall series has the lower entropy. Furthermore, it can be seen that for the rainfall series, the entropy bound is not achieved by any of the algorithms, presumably because of the overhead caused by the occurrence of 0 rainfall more than 50 percent of the time; see Eqs. 6.8 and 6.9 on page 130. Further structure like autocorrelation and seasonality cannot be used sufficiently to compensate for this overhead. In contrast to this, the streamflow series can be compressed to well below the entropy bound (27.7% vs. 42.1%), because of the strong autocorrelation in the data. These dependencies are best exploited by the PPMD
algorithm, which uses a local prediction model that apparently can predict the correlated values quite accurately. Many of the algorithms cross the entropy bound, indicating that they use at least part of the temporal dependencies in the data.

Table 6.3: The performance, as percentage of the original file size, of well-known compression algorithms on the various time series (constant, linear, uniform white, Gaussian white, sine 1, sine 100, Leaf Q and Leaf P). For each series the table also lists the file size, the entropy as a percentage of the maximum of 8 bits, and the SNR of the quantization. The formats compared are the uncompressed formats BMP, WAV and HDF_NONE, the lossless algorithms JPG_LS, HDF_RLE, WAVPACK, ARJ, PPMD, LZMA, BZIP2, PNG, GIF and TIFF, and the lossy JPG_50. The best results per signal are highlighted.

Lossy compression results

Apart from the lossless data compression, which can give upper limits for the total amount of information in a time series, lossy compression can also be of interest for hydrological modeling. Instead of representing the series exactly, lossy compression retains the subjectively most important features of the data and discards the other detail, depending on the amount of compression required. This approach has some parallels with the idea of rainfall multipliers to account for uncertainty in measured rainfall, as applied by Kavetski et al. (2006) and various related papers. The loss of detail can also be seen as the addition of a noise and is comparable to the effect of the quantization discussed earlier, which is in fact also a form of lossy data compression.

As an example of lossy compression, the picture format JPG was used to store the various time series. As can be seen from the table, the compression ratio for the streamflow series outperforms all lossless compression algorithms. Of course, this strong compression comes at the cost of errors introduced by the JPG compression algorithm. The errors are shown in figure 6.3 and follow from the loss of finer details caused by eliminating terms from the discrete cosine transform used in JPG.

Figure 6.3: The effect of lossy compression on the signal (original, quantized and JPG 50% streamflow for 1985). It introduces a noise that is not significantly larger than the quantization noise.
Figure 6.4: The dependency of compression (negative values indicate a grown file size) and signal to noise ratio (logarithmic, in dB) on the JPG quality setting in "lossy mode", for the uniform white noise, Gaussian white noise, Leaf Q and Leaf P series. Especially for the discharge signal (Q), large compression can be achieved if a small noise is allowed. For high quality settings, the compression is negative.

The signal to noise ratio (SNR) over the entire series was 75.6 in the case of JPG with a 50% quality setting. This SNR is the noise that is added to the signal already quantized to 8 bit precision. The signal to noise ratio with respect to the original signal is 76.3, which is remarkable, because it seems to indicate that in this case the two corruptions of the signal cancel each other out somewhat. Further results indicating the tradeoff between quality and compression are shown in figure 6.4.

statistic                  P      Q      Qmod   errQ   Q|Qmod   perm_Q   perm_errQ
entropy (% of 8 bits)      31.0   42.1   44.9   38.9   26.4     42.1     38.9
best compression (%)       36.4   27.7   25.8   31.5   N.A.     45.4     44.1
std. dev. (range=256)      11.7   11.6   10.4   4.95   N.A.     11.6     4.95
autocorrelation ρ          0.15   0.89   0.95   0.60   N.A.     <0.01    <0.01

Table 6.4: Information-theoretical and variance statistics and compression results (remaining file size in %) for rainfall-runoff modeling.

6.4.3 Compressing with hydrological models

In the previous paragraph, single time series were compressed to obtain an indication of their information content. Given the connection between modeling and data compression, a hydrological model should in principle be able to compress hydrological data. This can be useful to identify good models in information-theoretical terms, but can also be useful for actual compression of hydrological data. Although a more detailed analysis is left for future work, this section contains a first test of estimating the performance of hydrological models using data compression tools.

The hydrological model HYMOD was used to predict discharge from rainfall for the Leaf River dataset; see e.g. Vrugt et al. (2009) for a description of model and data. Subsequently, the modeled discharges were quantized using the same quantization scheme as the observed discharges. An error signal was defined by subtracting the modeled (Qmod) from the observed (Q) quantized discharge. This gives a signal that could range from -255 to +255, but because the errors are sufficiently small, it ranged from -55 to +128, which allows for 8 bit coding. Because the observed discharge signal (Q) can be reconstructed from the precipitation time series (P), the model, and the stored error signal (errQ), the model could enable compression of the dataset consisting of P and Q.

In table 6.4, the entropies of the signals are shown. The second row shows the resulting file size as percentage of the original file size for the best compression algorithm for each series (PPMD or LZMA). The table also shows the statistics for the series where the order of the values was randomly permuted (perm_Q and perm_errQ). As expected, this does not change the entropy, because that depends only on the histograms of the series. In contrast, the compressibility of the signals is significantly affected, indicating that the compression algorithms made use of the temporal dependence for the non-permuted signals.
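The permutation experiment behind Table 6.4 can be reproduced in outline with general-purpose compressors. The sketch below uses Python's built-in BZIP2 and LZMA codecs as stand-ins for the archivers used above, and a synthetic autocorrelated series as a stand-in for the discharge record, so the numbers will differ from those in the table; the qualitative effect, that permutation leaves the entropy unchanged but destroys the compressible temporal dependence, is the point of the illustration.

import bz2, lzma
import numpy as np

def compressed_percentage(x_integer):
    # Best of BZIP2/LZMA compressed size as a percentage of the raw 8-bit size.
    raw = np.asarray(x_integer, dtype=np.uint8).tobytes()
    best = min(len(bz2.compress(raw)), len(lzma.compress(raw)))
    return 100.0 * best / len(raw)

# Synthetic stand-in for a discharge series: a strongly autocorrelated AR(1)
# signal, quantized to 8 bits as in Eq. 6.10.
rng = np.random.default_rng(1)
x = np.zeros(14610)
for t in range(1, x.size):
    x[t] = 0.9 * x[t - 1] + rng.normal()
x_int = np.floor(0.5 + 255 * (x - x.min()) / (x.max() - x.min())).astype(np.uint8)

print(compressed_percentage(x_int))                   # exploits the autocorrelation
print(compressed_percentage(rng.permutation(x_int)))  # only the histogram remains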
The joint distribution of the modeled and observed discharges was also used to calculate the conditional entropy H(Q|Qmod). It must be noted, however, that this conditional entropy is probably underestimated, as it is based on a joint distribution with 255² probabilities estimated from 14610 value pairs. This is the cost of estimating dependency without limiting it to a specific functional form. The estimation of mutual information needs more data than Pearson correlation, because the latter is limited to a linear setting and looks at variance rather than uncertainty. In the description length, the underestimation of H(Q|Qmod) is compensated by the fact that the dependency must be stored as the entire joint distribution. If representative for the dependence in longer data sets, the conditional entropy gives a theoretical limit for compressing Q with knowledge of P and the model, while not making use of temporal dependence.

A somewhat unexpected result is that the errors seem more difficult to compress (31.5%) than the observed discharge itself (27.7%), even though their entropy is lower. Apparently the reduced temporal dependence in the errors (lag-1 autocorrelation coefficient ρ = 0.60), compared to that of the discharge (ρ = 0.89), offsets the gain in compression due to the lower entropy of the errors. Possibly, the temporal dependence in the errors becomes too complex to be detected by the compression algorithms. Further research is needed to determine the exact cause of this result, which should be consistent with the theoretical idea that the information in P should reduce uncertainty in Q. The Nash-Sutcliffe Efficiency (NSE) of the model over the mean is 0.82, while the NSE over the persistence forecast (Qmod(t) = Q_{t-1}) is 0.18 (see Schaefli and Gupta, 2007), indicating a reasonable model performance. Furthermore, the difference between the conditional entropy and the entropy of the errors could indicate that an additive error model is not the most efficient way of coding and consequently not the most efficient tool for probabilistic prediction. The use of, for example, heteroscedastic probabilistic forecasting models (e.g. Pianosi and Soncini-Sessa, 2009) for compression is left for future work.

6.4.4 Discussion and conclusions

This section presented an initial attempt at using data compression as a practical tool in the context of learning from data and estimating the information content of hydrological signals. To the author's knowledge, this is the first time that this approach is used in hydrology. The present study is limited in scope, and more elaborate studies would be interesting. The results show that hydrological time series contain a large amount of redundant information, due to their far from uniform distributions and their temporal dependencies. Model complexity control methods that use the number of observations should therefore be corrected for the true information content. Compression tools that search for patterns in individual time series can be helpful for this task. It would be interesting to develop compression tools specifically aimed at hydrological time series, which also give an overview of the various patterns found in the data. A hydrological model actually is such a compression tool. It makes use of the dependencies between, for example, rainfall and streamflow.
The patterns that are already present in the rainfall reduce the information that the hydrological model can learn from: a long dry period could for example be summarized by one parameter for an exponential recession curve in the streamflow. The information available for a rainfall-runoff model could theoretically be estimated by comparing the file size of the compressed rainfall plus the file size of the compressed streamflow with the size of a file where rainfall and streamflow are compressed together, exploiting their mutual dependencies. We could denote this as

learnable info = |ZIP(P)| + |ZIP(Q)| − |ZIP(P, Q)|    (6.13)

where |ZIP(X)| stands for the file size of a theoretically optimal compression of data X, which includes the size of the decompression algorithm. A good benchmark for this could be a self-extracting archive, i.e. the file size of an executable file that reproduces the data on some given operating system on a given computer. This brings us back to the ideas of algorithmic information theory, which use program lengths on Turing machines. The shortening in description length when merging input and output data, i.e. the compression progress, could be seen as the amount of information learned by modeling. The hydrological model that is part of the decompression algorithm embodies the knowledge gained from the data.

6.5 Prediction versus understanding

(This section is based on a comment on an early version of Koutsoyiannis (2010), which is recommended for related discussions.)

Data compression just looks for patterns, i.e. correlations and dependencies, in the data. It does not worry about cause and effect, nor does it take into account whether a certain model works for the right reasons. Model performance is reduced to one single benchmark, the description length, and there is no distinction between different aspects of the quality of a model. The single number that summarizes model performance does not allow for model diagnostics, nor does it reflect possibly different views on what constitutes a good model, depending on its use. The mechanistic approach seems to obviate the need for art in modeling, such as advocated by Savenije (2009). The process of hypothesis forming and testing has no importance in the search for patterns. In the following, it is argued that none of these objections is fundamentally a problem. For practical reasons, given the present state of the art, human intellect is still vital, but this does not preclude the possibility of more mechanistic science in the future. The following section addresses some of these issues and is at points deliberately provocative, to stimulate discussion and present a different view on some of the current discourse in hydrology.

6.5.1 Hydrological models approximate emergent behavior

Hydrological systems are high-dimensional, complex systems. They consist of an extremely large number of water molecules bouncing against each other and against those of soil and vegetation particles. Elementary forces are acting upon them and photons are hitting their atoms. Even though it is impossible to predict the paths of individual water molecules with accuracy, the macro-state of a large number of interacting molecules is surprisingly predictable. Although in some cases this might be seen as some form of (biological/ecological) self-organization, fundamentally this predictability follows from the calculus of probabilities. Given a large number of equally probable microstates, probability in a complex system often concentrates in a small number of possible macro-states. An example is the sum of the outcomes of a large number of dice.
The uncertainty about the precise microstate (the outcomes of all individual dice) is equal to the sum of the uncertainties of the individual dice, but the uncertainty about the sum is far less. Also in a hydrological system, the macro-states, which are for example sums or averages such as the water storage and the vegetation cover, are far more predictable than the micro-states, such as the positions of all water molecules and the activity of the individual stomata in the vegetation leaves. Compare also the relatively simple behavior of flow in a river with the complex flow of water through all its pathways in the catchment. We could see this as an emergence of predictability from randomness (cf. the emergence of randomness from determinism in Koutsoyiannis (2010)). This very fundamental mechanism of emergence, visible both in evolutionary processes and in the movement of systems towards maximum entropy, is the reason why we can make hydrological predictions in the first place. The realization that hydrological processes are emergent behavior from complex interactions is also a reason why forecasts should be probabilistic rather than deterministic.

Conceptual hydrological models can only be seen as approximations of complex hydrological systems. This argument is complementary to the arguments of consistency and testability given in the previous chapter, and to the notion that systems with feedbacks often show chaotic behavior, limiting predictability to a certain time horizon (Lorenz, 1963; Koutsoyiannis, 2010). When seen as emergent behavior from a complex system, the limited predictability is analogous to the random fluctuations of the sum of a very large number of dice, or, as Grandy Jr (2008) puts it: "Effects of the microscopic dynamical laws can only be studied at the macroscopic level by means of probability theory".

One might feel that modeling emergent behavior is unsatisfactory, especially when probability replaces truth and description of behavior replaces explanation. A statement that is often heard is that "a model might give the right answers for the wrong reasons". However, it can be argued that there is no way to determine what the "right reasons" are, except comparing other answers that the model gives with observations to determine whether they are right. The following section addresses the question whether there is in fact a fundamental difference between a short description and an explanation.

6.5.2 Science and explanation as data compression?

The analogy between modeling and data compression can also be applied to science as a whole. When using the theory of inductive inference of Solomonoff (1964), the principles of parsimony and good model performance are sufficiently formal to be implemented in an automatic machine. This would yield the best predictions, given all historical observations. An interesting question would therefore be whether the current state of science would follow with high probability if such an inference machine were fed with all observed data that humanity has recorded so far. Would it yield the best possible predictions and would its shortest programs be our best theories? Some authors, e.g. Deutsch (1998), make a distinction between prediction and explanation.
The question that naturally arises is then what defines a good explanation, apart from consistency with observations and a requirement for parsimony. Sometimes it is required that explanations match intuition or are understandable. It could also be posited, however, that intuition and understanding emerge from seeing patterns in observations too. This would mean that they emerge from trying to make good predictions.

Taking the data-compression view, physics can never really "explain" a phenomenon in any other way than by describing it in a more compact, accurate, or generalized way, all of which lead to shorter programs. In other words, we try to codify observed behavior in laws as much as possible (leaving little noise) and a shorter code is preferred over a longer one. Laws should therefore be as general as possible and re-use of laws for various problems is encouraged. In science, a good set of models is the set that best describes all observations so far with the smallest total complexity. Whenever it is possible to unify two parts of physics into one and this yields a shorter program, progress has been made. Sometimes this progress is visible in terms of prediction of thus far unobserved phenomena. This progress is cashed in at the moment the prediction is observed, for example if the Higgs boson is found. Another way to advance science is to observe phenomena that cannot be explained by the current set of models, forcing the models to become more complex in order to make good predictions of the phenomena causing these new observations; this can also lead to discoveries of new patterns.

Of course, scientific progress is not limited to fundamental physics and explaining the unexplained. Sometimes the emergent behavior of a complex system can be described in a much shorter way than by the reductionist explanation, avoiding the need to specify the full micro-state. Especially because the full micro-state is impossible to observe anyway, there is no value in trying to predict it. What constitutes a good model inherently depends on the location of the observer and his access to information; see also the very interesting paper about the subjectivity of theories of everything by Hutter (2010). Modeling relations between macro-states is what we mainly try to do in hydrology, referred to by Koutsoyiannis (2010) as "overstanding", but in a way this is also what Newtonian mechanics does. The difference is that the latter almost perfectly describes the emergent behavior in many conditions.

When we infer from a limited data set, a "physically based" model should in principle be able to outperform a purely data-driven model, because it has more prior information in it. The model structure has become very likely on the basis of all past observations. A model structure violating a physical law should be punished through all historical observations that confirmed that law. (Note that when, for example, a hydrological model violates the mass balance, this can usually be interpreted as a flux across the boundary that is not explicitly modelled. Only when a model attempts to model all in- and outgoing mass fluxes and does not conserve mass would we regard it as contradicting the mass balance and give it infinitesimal prior probability.) Although a physically based model might not have a better fit to the data than a data-driven model of similar complexity, the historical observations of e.g. mass balance should give it a higher probability. This makes the use of physically based models consistent with the data-compression view, which dictates that, in the absence of prior information, predictive performance and complexity are enough to fully describe the merit of a model. Another way to view this is that not the complexity of the model itself, but the amount of complexity added to accepted scientific knowledge should be penalized in model complexity control.
Theories are thus ways to conveniently describe our knowledge and to combine it with new observations. However, ultimately theories are not fundamental to science. According to Solomonoff, we do not need theories at all: everything is just about predictions.

6.5.3 What is understanding?

If science is just about compressing observations to get good predictions, where is the understanding? We could pose that what is usually seen as understanding is nothing more than seeing analogies between the mathematical relations that give good predictions and our intuition based on observations in everyday life. We intuitively understand conservation of mass because we see it every day. We understand the movement of molecules in a gas because it is analogous to bouncing marbles in some way. We understand the concept of waves by picturing the movement of ripples when we throw a stone in the water. Understanding is thus nothing more than picturing the predictive model in terms of similar relations in our observable world. This is also the reason why "Nobody understands quantum mechanics" (Feynman, 1965). If the way the world behaves at a certain scale has no analogous counterparts on a human-observable scale, it is impossible to understand, given this notion of understanding. However, physicists sufficiently familiar with the phenomena might simply expand their intuition to achieve understanding.

If we see understanding as matching intuition, we could observe that understanding is often overrated. It is often stated as an objective in itself. From an aesthetic point of view it is desirable that a model is understandable in the sense that it has analogies to observable behavior on a human scale. An example is to picture a catchment as a series of interconnected buckets. However, in many cases the system we try to model simply does not behave in such a way. In those cases, the understandable model structure compromises prediction accuracy and is not closer to how the actual system "works" than a black-box model fitted to the data. The advantage of conceptual models is that knowledge that has been gained from past observations, like conservation of momentum and mass, is easily added to the model in the form of constraints on the structure. Data that have previously been transformed into knowledge of laws (patterns that are general) help predictions. So again, the overall goal of good predictions already captures the benefits of physically understandable parameters. Also, by keeping the formulation of the relations for prediction restricted to the known physical laws, we do not unnecessarily extend the description length of our total scientific knowledge with extra relations that are only usable in one specific hydrological system.

It is important to notice that in hydrology, we are always predicting macroscopical variables. In this case intuitive understanding (through "overstanding") can sometimes still be achieved if one realizes that there are analogies with many systems in nature where an "intention" of the system emerges from randomness.
Examples are adaptations and optimality that emerge from evolution, free will that apparently emerges from self-organization of processes in the brain, and maximum entropy distributions that emerge from microscopic randomness, deterministic dynamics and macroscopic constraints. Especially under idealized assumptions, the relations between the macro-quantities can sometimes be of the same form as some of the relations between micro-quantities, which further enhances the feeling of understanding. However, it is dangerous to generalize this kind of understanding. Heterogeneity, for example, can completely change the relation between the macro-states into a form that has no relation with similar processes at the micro-scale. Fitting a model structure that still assumes that the form of the relation is "understandable" in terms of simple mechanics will yield poor predictions and thus is poor science.

Also from a more practical point of view, we could ask ourselves what the use of understanding, models and theories is. It could be posed that predictions are the fundamental goal and that understanding, explanation, theories and models are just the means to achieve good predictions. Understanding could be seen as an emergent goal from the overall objective of good predictions (an objective that itself emerges from evolution). Especially when there is a clear decision problem associated with a modeling exercise, good predictions are clearly the objective. The question could then be asked why we would want to use information and probability as an objective in model inference rather than directly trying to optimize the utility associated with the decision problem at hand. This question is addressed in the next section.

6.6 Modeling for decisions: understanding versus utility

6.6.1 Information versus utility as calibration objective

Utility-based evaluation of predictions is inevitably connected to a particular user with a decision problem and therefore cannot be done without explicit consideration of the different users of those predictions. Moreover, an obvious question that arises is whether it is desirable to base the evaluation on the value to a particular user or group of users. In that case, the evaluation becomes an evaluation of the decisions of those users rather than of the predictions themselves or of the hydrological model that produced them. The difference between information and utility is analogous to that between quality and value, as treated in Murphy (1993) and chapter 5. This difference is particularly important if the results of the evaluation are used in a learning or calibration process. In that case, two effects can occur when using utility (i.e. value) instead of information as a calibration objective:

• The model learns from information that is not there (treated in Sect. 6.6.2).
• The model fails to learn from all information that is there (treated in Sect. 6.6.3).

6.6.2 Locality and philosophy of science: knowledge from observation

In the data-compression view of science, knowledge consists of compressed observations. Knowledge should thus come from observations. Furthermore, we saw that predictions about macroscopical variables in complex systems should in principle be probabilistic. This leads to certain requirements on the way these predictions are evaluated in a calibration process. Two fundamental requirements are propriety and locality. Especially the latter is closely linked to the difference between information and utility.
Locality is a property of scores for probabilistic forecasts (Mason, 2008; Benedetti, 2010). A score is said to be local if it only depends on the probability assigned to (a small region around) the event that occurred, and does not depend on how the probability is spread out over the values that did not occur. In contrast to this, non-local scores do depend on how that probability is spread out. Usually non-local scores are required to be sensitive to distance, which means that probability attached to values far from the observed value is punished more heavily than forecast probability assigned to values close to the observation. This concept of distance only plays a role in forecasts of continuous and ordinal discrete predictands: for nominal predictands, distance is undefined. For both these types of predictands, an extension of the Brier score exists: the Ranked Probability Score (RPS) and the continuous RPS (CRPS) (see Laio and Tamea, 2007, for descriptions and references). Both these scores are non-local, while the divergence score is local.

Figure 6.5: The RPS and CRPS scores measure the sum of squared differences in CDFs. Therefore they depend on probabilities assigned to events that were not observed. The divergence score only depends on the value of the PDF (the slope of the CDF) at the value of the observation. In the example, forecast A (CRPS = 9.59, DS = 4.18) has a better (= lower) CRPS than forecast B (CRPS = 12.24, DS = 3.92), even though it assigned a lower probability to what was observed (resulting in a higher (= worse) DS).

Fig. 6.5 shows a comparison between the (non-local) CRPS and the (local) divergence score. Note that forecast B obtains a worse CRPS than forecast A, even though B gives a higher probability to what is actually observed. It can also be imagined how changes in the distribution of the lower tail of forecast B would affect the CRPS, although based on the observation no statements can be made about the merit of that redistribution of probability. Note that any preference between two forecasts that assign equal probabilities to the observed value must be based on prior information, e.g. the fact that a bimodal distribution is counter-intuitive. It is important, however, that this prior information should be included in the forecast, rather than added implicitly during the evaluation process.

For most decision problems, expected utility is a non-local score: a reservoir operator who attached most probability to values far from the true inflow is worse off than one who used a forecast with most probability close to the true value, even if the probability (density) attached to the true value was the same. Therefore, non-local scores are sometimes considered to have more intuitive appeal than local scores. It might seem logical to train a forecasting model to maximize the user-specific utility it yields for the training data, which may be a non-local function. There is, however, a serious philosophical problem with non-local scores if used in a learning (i.e. calibration) process.
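The difference between a local and a non-local score can be made concrete with a small numerical sketch. The following Python/NumPy fragment (an illustration, not the computation behind Fig. 6.5) evaluates a discrete approximation of the CRPS and of the divergence score for two forecast distributions that assign exactly the same probability to the observed value but spread the remaining probability differently; the divergence score is unchanged, the CRPS is not.

import numpy as np

x = np.arange(0.0, 351.0)            # streamflow grid (m3/s), spacing 1 m3/s
obs_idx = 150                        # the observation falls in this bin

def gauss(mean, sd):
    p = np.exp(-0.5 * ((x - mean) / sd) ** 2)
    return p / p.sum()

def divergence_score(p):             # local: depends only on p at the observation
    return -np.log2(p[obs_idx])

def crps(p):                         # non-local: squared distance between CDFs
    F = np.cumsum(p)
    H = (x >= x[obs_idx]).astype(float)
    return np.sum((F - H) ** 2)

pA = 0.5 * gauss(60.0, 20.0) + 0.5 * gauss(160.0, 20.0)   # bimodal forecast
pB = pA.copy()
pB[:100] = pB[:100][::-1]            # rearrange probability far below the observation

print(divergence_score(pA), divergence_score(pB))   # identical: p(obs) unchanged
print(crps(pA), crps(pB))                           # differ: the CRPS is non-local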
In principle, the knowledge a model embodies comes from observations or prior information, which in the end also comes from observation; see Fig. 6.6. By calibrating a model, the information in observations is merged with prior information, through a feedback of the objective function value to the search process (the arrows from "EVAL" to the model in Fig. 6.6). It is therefore a violation of scientific logic if the score that is intended to evaluate the quality of forecasts depends on what is stated about things that are not observed. Changes in the objective function would cause the model to learn something from an evaluation of what is stated about a non-observed event. In an extreme case, two series that forecast the same probabilities for all events that were observed can obtain different scores, based only on differences in the probabilities assigned to events that were never observed (Benedetti, 2010). A similar argument in the context of experimental design was made by Bernardo (1979). If these non-local scores are used as objectives in calibration or inference (see e.g. Gneiting et al., 2005), things are inferred from non-observed outcomes, i.e. from information that is not present in the observations.

6.6.3 Utility as a data filter

The use of utility in calibration can, apart from using non-existing information, also lead to learning from only part of the information that is in the observations. In that sense, the decision problem that specifies the utility acts like a filter on the information. The information-theoretical data processing inequality tells us that this filter can only decrease information (see Cover and Thomas, 2006). This filter can affect two of the three information flows to the model depicted in Fig. 6.6: the flow from the output observations (1) and the flow from the input observations (2).

The first flow of information, from the observations of streamflow, is filtered by the "state of world" block in Fig. 6.6. By evaluating based on utility, the information in the streamflow observations only reaches the model through its effect on how the state of the world affects the utility of decisions based on the forecast. Figure 6.6 depicts a hypothetical binary evacuation decision that is coupled to a conceptual rainfall-runoff model for flood forecasting. In this simplified decision problem, the utility is only influenced by a binary decision (evacuate or not) and a binary outcome (the place floods or not). There are thus no gradations in the severity of the floods that affect the damage. The calibration towards maximum utility for this decision problem will train the hydrological model to optimally distinguish flood-evacuation events. This implies that in the training, all that the hydrological model sees from the continuous observed discharges is a binary signal: flood or no flood. This constitutes at most one bit of information per observation, in the unlikely case that 50% of the observations is above the flood threshold, i.e. the climatic uncertainty is 1 bit, while the original signal, the observed flows, i.e. real numbers, contained far more information (see Fig. 6.6).

Figure 6.6: There are three routes through which information can enter the model in a learning process: the output observations (1), the input observations (2) and prior information (3). When evaluating a model based on value, the decision model that is implicitly defined by the loss function acts as a filter on the information in the observations. The figure shows the case where both the decision and the state of the world are binary, resulting in a feedback of costs to the model of only 2 bits of information per input-output observation pair (the string of characters at the bottom of the figure). The graphs in the middle show how both the predicted and the measured flows are converted to binary sequences due to the way the cost-loss model is formulated.
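The loss of information caused by this filtering can be quantified directly. A minimal sketch (Python/NumPy, with a randomly generated stand-in for the quantized discharge record, so the numbers are purely illustrative) compares the entropy of the 8-bit observation signal with the entropy of the binary flood/no-flood signal that is all the evaluation sees in the cost-loss setting:

import numpy as np

def entropy_bits(symbols):
    # Empirical entropy in bits per time step of a discrete signal.
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(2)
q_int = rng.integers(0, 256, size=14610)   # hypothetical 8-bit quantized flows
flood = (q_int >= 240).astype(int)         # binary signal behind the "state of world" block

print(entropy_bits(q_int))   # up to 8 bits per observation before the filter
print(entropy_bits(flood))   # at most 1 bit, and much less when floods are rare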
The second flow of information to the model comes from the input observations. A typical example is the amount of information in rainfall observations, see Bárdossy and Das (2008), which can influence model performance. The amount of information that reaches the model is affected by the information filter in the "decision" block. For example, if a binary decision problem is considered, e.g. to be or not to be in the flood zone tomorrow, the information from input observations travels through the model and subsequently through the decision model. While the model still gives a real number as output, the "decision" block maps that model output to a binary signal. The binary signal is all that enters the evaluation and can be learned from the input observations. When a model is evaluated based on a cost-loss model of a two-action, two-state-of-the-world decision problem, the maximum amount of information that can be learned from each input-output observation pair is thus 2 bits. In Fig. 6.6, this information is contained in the string "00CCDDLC00", which represents the sequence of utilities over all time steps. The hydrological model will therefore have far less information to learn from.

Given the fact that there is a balance between the available information for calibration and the complexity that a model is allowed to have (see Schoups et al., 2008), hydrological models that are trained on user-specific utility functions (e.g. this binary one) are likely to become overly complex relative to the data. They will surely achieve better utility results on the calibration data, because there is less information to fit, but are likely to perform worse on an independent validation dataset. The model that has been trained with maximum information as an objective is likely to yield better results for the validation set, even in terms of utility. Because it has the unfiltered information from the observations to learn from, it is less prone to overfitting: the complexity of a conceptual hydrological model is better warranted by the full information. The objective of optimally predicting binary flood events for evacuation decisions could benefit from more parsimonious data-driven models, e.g. linear regression models or neural networks; see Solomatine and Ostfeld (2008) for an overview. These models can map directly from predictors (e.g. precipitation, snowpack, soil moisture, past discharge) to decisions, but this complicates the use of prior information.

The third information flow in Fig. 6.6 consists of this prior information on the workings of the hydrological system, which can be valuable for improving forecasts. The information can enter in the form of prior parameter estimates, e.g. from regionalization (Bárdossy, 2007; Hundecha et al., 2008), or of constraints that are captured in the model structure. Examples are constraints on the mass balance and energy limits on evaporation.
These constraints describe patterns in data, or "physical laws", that ultimately come from observations. Both adding too much information (unwarranted assumptions) and too little (e.g. too wide prior parameter distributions) through this route deteriorates the forecasts, especially when little data are available.

The framework presented in this section shows some similarity with the ideas presented in Gupta et al. (2009, 1998, 2008). In those papers it is also argued that information can be lost in the evaluation. However, the important difference of the framework presented in this chapter compared to those ideas is that we argue that information is lost by using measures other than information (in other words, measures that do not reflect likelihood), while Gupta et al. (2008) argue that information is lost because of the low dimensionality of the evaluation measure. In our information-theoretical viewpoint, we can in principle learn all we need from the observations through a single measure, because a real number can contain infinitely many bits of information. What is learned depends only on the data and the prior information. The challenge is to give a reliable representation of prior information, which will result in the right likelihood function. In principle, this is equivalent to endorsing the likelihood principle, which states that all information that the data contain about a model is in the likelihood function (as argued by Robert (2007), p. 14, Jaynes (1957), p. 250, and Berger and Wolpert (1988)).

The divergence score corresponds to a logarithmic scoring rule (see Jose et al. (2008) for more context), which is the only scoring rule that is both local and proper (proofs can be found in Bernardo, 1979 and Benedetti, 2010), where propriety is the requirement that the scoring rule can only be optimized when the forecaster does not lie. Scoring rules that are not proper can be hedged, meaning that the expected score is maximized by forecasting probabilities that are not consistent with the best estimates of the forecaster (see Gneiting and Raftery (2007) for an elaborate discussion of proper scoring rules). A utility function that includes the importance of the outcomes can be hedged by attaching more forecast probability to important events. A model that is trained on such a measure is thus encouraged to "lie". All utility functions that are not affine functions of information violate either locality or propriety, which makes them doubtful objectives for calibration.

Figure 6.7: A detail of the validation results for both models (probabilistic and utility-based calibration), compared with the observed streamflow (mm/d) and the flood threshold. For the probabilistic model, the 20% and 80% quantiles are shown. With the cost-loss ratio of 0.2 used in this example, the upper quantile determines the decision for the probabilistic model. Note that the probabilistic model predicts the full distribution; the quantiles are shown just for visualization.

6.6.4 Practical example

As an illustration of the information-filter effect described in section 6.6.3, a hydrological model was calibrated both on information and on a utility function relating to a binary decision scenario similar to that depicted in Fig. 6.6.
A simple lumped conceptual rainfall-runoff model (Schoups et al., 2010) was used to simulate daily streamflow given daily forcing records of rainfall and evaporation from the French Broad River basin at Asheville, North Carolina. In order to evaluate the model for unseen data, it was calibrated using 1 year of streamflow observations (1961) and validated using 9 years of streamflow observations (1970-1978).

The calibration on the information-uncertainty scale used minimization of the divergence score (i.e. the remaining uncertainty) as an objective. In the continuous case this corresponds to maximizing the log-likelihood. This means that the model needs to provide explicit probabilistic forecasts. The probabilistic part used a flexible stochastic description, allowing for heteroscedasticity, autocorrelation and non-Gaussian distributions. The calibration relied on the general likelihood function presented in Schoups and Vrugt (2010).

Calibration objective     Result in calibration   Results in validation
min average cost          1.6                     2.47
min divergence score      3.8                     2.29

Table 6.5: The resulting average disutility per year, composed of costs for action and losses for unpredicted events. The disutility in the calibration period is minimized by explicitly calibrating on it, but performance in the validation period is better for the probabilistic model trained to minimize remaining uncertainty.

The calibration on the utility-risk scale employed a cost-loss utility function relating to the binary decision problem (Murphy, 1977). The flood threshold is defined at a value of 10 mm/d (streamflow divided by catchment area). Here, a cost C is associated with a precautionary action, which is taken if exceedence of the flood threshold is forecast. When a peak flow event occurs but was not predicted, a loss L occurs. For illustration purposes, the values of C and L were chosen to be 0.2 and 1.0, respectively.

The results in Table 6.5 show that in the validation run for this case, we indeed find that the explicit probabilistic model trained to minimize remaining uncertainty outperformed the model trained on maximum utility for the specific decision problem at hand. As expected, the large deterioration from calibration to validation seems to suggest overfitting to the filtered information. Looking at the resulting model behavior in Fig. 6.7, we can tell that the model trained on utility systematically overpredicts low flows. There is nothing in the evaluation that discourages this behavior, and apparently these parameters gave an advantage in fitting the floods in the calibration year. The probabilistic model is encouraged to attach high likelihood to each observation, learning from all data in the calibration. In this case, this gave an advantage in predicting the exceedence probability of the flood threshold for the unseen validation data.

We must note, however, that results might be different under less ideal conditions. For example, when the model structure is capable of representing high flows, but is inadequate for low-flow situations because e.g. evaporation is not correctly represented, then the utility-based calibration might do better also in validation. We can explain this by seeing the utility function as an implicit way to add prior information. If we know a priori that the model structure misses relevant processes for low flow, then it could be reasonable to ignore the low-flow data in calibration. The analogous way to represent this in an explicit uncertainty model is to give extra spread to the probabilistic predictions at low flows, making the model less sensitive to them.
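For reference, the decision rule and the disutility used in this example can be written down compactly. The sketch below (Python, illustrative only; the variable names are not those of the original experiment) takes the precautionary action whenever the forecast probability of exceeding the 10 mm/d threshold is larger than the cost-loss ratio C/L = 0.2, which for the probabilistic model is equivalent to acting when its 80% quantile exceeds the threshold, and accumulates the costs C for actions and the losses L for unpredicted floods.

import numpy as np

def average_disutility(p_exceed, q_obs, threshold=10.0, C=0.2, L=1.0):
    # Cost-loss evaluation of probabilistic flood forecasts (Murphy, 1977).
    # Act when the forecast exceedence probability is larger than C/L.
    act = np.asarray(p_exceed) > C / L
    flood = np.asarray(q_obs) > threshold
    cost = C * np.sum(act) + L * np.sum(flood & ~act)
    # Returned per time step; Table 6.5 reports the equivalent figure per year.
    return cost / len(flood)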
More elaborate case studies are needed to further investigate which practical factors might lead to different results and how they can be accounted for in the information-theoretical framework. Furthermore, applying this view to results in the past literature, especially those relating to "informal" likelihood methods, might give new insights about prior information that is implicitly added.

6.7 Conclusions and recommendations for modeling practice

A hydrological model can be seen as a tool for prediction, a theory, a hypothesis or a compact form in which to code observations. In all cases, it consists of mathematical (i.e. algorithmic) relations that represent analogies to quantities in the postulated real world. Models that concern observable quantities can be tested. When models approximate emergent behavior of macroscopic systems, such as in hydrology, predictions can never be perfect. In chapter 5 it was argued that correspondence between model predictions and observations is only testable without additional assumptions if the predictions are probabilistic. By adapting the model to mimic the observed data, the likelihood of the model can be increased. Overly complex models increase likelihood but decrease prior probability, yielding less probable predictions. Bayesian probability theory can serve as the basis for a philosophy of science, see e.g. Jaynes (2003), and can replace limited or imprecise concepts such as verification, falsification and corroboration as advocated by Popper (1968).

Knowledge of the formal theories of algorithmic information theory could help focus debates about uncertainty analysis in hydrology. Algorithmic information theory provides the link between complexity and probability needed to intuitively justify and to formalize the principle of parsimony. Solomonoff's formal theory of inductive inference offers a complete formalization of prediction from data. It combines Bayesian probability, a universal prior probability for models reflecting the principle of parsimony, Turing's theory of computation, and the simultaneous use of all possible models. The fact that the universal prediction by Solomonoff induction is incomputable is not just a practical problem of the method, but stems from the fundamental limits of computation and provability found by Turing (1937) and Gödel (1931), which apply to any method of induction. Due to this incomputability, not only perfect predictions, but also optimal probabilistic predictions are in principle not achievable. Several practical methods for prediction can be seen as computable approximations to Solomonoff induction. The principle of minimum description length (MDL; see Rissanen (2007); Grünwald (2007)), which can be approximated by data compression methods, may be a useful practical approach to learning from data. Given that inference and data compression are analogous tasks, compression algorithms may be useful to estimate the information available for learning and the merit of a model in terms of compression progress. Conversely, good models can be used to efficiently store bulky hydrological data.

Understanding and explanation are not explicit building blocks of the information-theoretical view on science.
Rather, they emerge from the objective of finding short descriptions, through the fact that analogies can be used to compress the description of scientific knowledge. Understanding may be a practical requirement for the application of science, but fundamentally science is served by making good predictions, i.e. finding short descriptions of the totality of observations. When understanding is seen as a requirement for models of complex systems, this might lead to a flawed view of how the system works and will not yield good predictions.

Calibration of models is a way to find parameter sets that, together with a given model structure, minimize some discrepancy measure between model outcomes and observations. This discrepancy measure can represent a user-specific utility for a given decision problem, or can be interpreted as a likelihood or information measure that implicitly specifies a probabilistic prediction. The likelihood principle states that all information the data contain about a model is in the likelihood function of that model. Training on a user-specific objective function can therefore never increase the amount of information that can be learned from the observations. Apart from decreasing information by filtering it through the utility function, calibration based on utility, if it is a non-local measure, also lets the model learn from information that is not in the observations, i.e. non-existing information. This makes model calibration with objective functions that reflect user-specific utilities doubtful. Although the purpose of a model may influence its design or the related data collection strategy, it should not influence its calibration. A model should learn from observation data and not from decisions based on its predictions.

Bayesian logic and the likelihood principle also dictate that the model, which defines the likelihood function, should be specified a priori, before seeing the data against which the model is tested. This seems to conflict with the idea that we can do "model diagnostics" to learn about the true behavior of the system. For example, the stepwise improvement of a model concept by repeatedly looking at the results and adding components based on missing processes, without introducing new data, introduces a high risk of overfitting. In fact, such an approach uses the same information multiple times, because there is no clear separation between prior knowledge and data. The space of possible model structures becomes an important degree of freedom. A Bayesian fundamentalist would then correctly object that we fit a model to the data by forming hypotheses from data that are then tested on those same data. Model complexity control methods can be used to prevent overfitting by balancing model complexity against the amount of information to be learned from the data. To successfully apply these methods, dependencies in the data should be accounted for. Data compression methods may be used to estimate a correction factor for the number of data points, which is an input to some of these methods. Also the use of black-box models to mimic the data might be useful to get an idea of the data complexity and therefore of the optimal model complexity.

The difference between conceptual, physically based and statistical models is in some sense gradual. Also our physical knowledge ultimately comes from fitting mathematical relations to data.
For hydrological models, it is important to include process knowledge, but its certainty should not be overestimated. Although in practice, by attributing uncertainties to parameters or states, we can still make reasonably good predictions, theoretically it makes little sense to include hypotheses beyond prior knowledge in a single model structure and attribute all uncertainty to parameter values. In principle, all models are uncertain and should not be attributed 100% prior probability. One could even say that all model structures of macroscopical systems are wrong, unless they are explicitly probabilistic.

Chapter 7

Stochastic dynamic programming to discover relations between information, time and value of water

"Water has an economic value in all its competing uses and should be recognized as an economic good" - IVth Dublin principle on integrated water resources management.

Abstract - This chapter presents stochastic dynamic programming (SDP) as a tool to reveal economic information about managed water resources. An application to the operation of an example hydropower reservoir is presented. SDP explicitly balances the marginal value of water for immediate use against the expected opportunity cost of not having more water available for future use. The result of an SDP analysis is a steady state policy, which gives the optimal decision as a function of the state. A commonly applied form gives the optimal release as a function of the month, the current reservoir level and the current inflow to the reservoir. The steady state policy can be complemented with a real-time management strategy that can depend on more real-time information. An information-theoretical perspective is given on how this information influences the value of water, and on how to deal with that influence in hydropower reservoir optimization. This results in some conjectures about how the information gain from real-time operation could affect the optimal long-term policy. Another issue is the sharing of the increased benefits that result from this information gain. It is argued that this should be accounted for in negotiations about an operation policy. Some suggestions for future research involving the technique of reinforcement learning conclude the chapter.

7.1 Introduction

As stated in principle IV of the Dublin principles of integrated water management, "Water has an economic value in all its competing uses and should be recognized as an economic good" (Global Water Partnership, 2000). This report also states: "In order to extract the maximum benefits from the available water resources there is a need to change perceptions about water values and to recognize the opportunity costs involved in current allocative patterns." Water derives its value partly from how it is used (see Tilmant et al. (2008) for a more detailed discussion of the components of the value of water). A different allocation can lead to a different value of water, which also depends on the stakeholders and their preferences. Allocating water optimally corresponds to maximizing its value. Before this maximization, moral questions have to be answered, which are not considered here. Decisions about water resources have to allocate water between different uses, but also in space and time. This chapter focuses on the latter case, where a set of subsequent decisions in time has to be taken.
Most problems of water system operation are sequential decision processes, meaning that a decision in a certain time step is followed by subsequent decisions. For the operation of pumping stations in a polder, for example, every 15 minutes the current situation is reconsidered and a decision is made about operating the pump in the next period. In chapter 2, model predictive control (MPC) was used to solve this type of problem. When extending MPC to finding optimal decisions under uncertainty (the multiple model predictive control method, chapter 2), the problem of interdependency between current and future information for decisions arises. Consequently, the amount of information that will become available between the current decision and the future decisions affects the optimal current decision. (Note that we cannot know which specific information will become available, but an estimate of the amount of information is possible.) Dynamic Programming (DP), which was also briefly reviewed in chapter 2, offers the possibility to disaggregate the decision process in time, resulting in a series of one-step decision problems. This circumvents the aforementioned interdependency problem. In this chapter, a special case of dynamic programming for systems affected by uncertainty, Stochastic Dynamic Programming (SDP), will be applied and analyzed from an information-theoretical viewpoint.

7.2 Stochastic dynamic programming (SDP)

Stochastic dynamic programming proceeds by going backwards in time, while recursively calculating the value-to-go function at time t from the value-to-go at time t + 1. The value-to-go function F_t^*(x) at time step t estimates the benefits that can be obtained by optimally operating the water system from time step t to the end of the planning horizon, while being in state x. Bellman's principle of optimality (Bellman, 1952) states that any optimal policy that visits a certain state has the property that the remaining decisions from that state to the end of the horizon would also form an optimal policy from that state if it were the starting point. When searching for an optimal sequence of T decisions in a sequential decision process, the problem can therefore be split into T independent subproblems. In each subproblem, the value-to-go for each possible state is computed based on the decision that maximizes the expected value-to-go for that state. This value is the sum of the benefits from the transition to the new state caused by that decision and the value-to-go from the state where it leaves us. Because the problem is stochastic, an expected value over several possible future states is taken as the value-to-go following a decision (see Fig. 7.1). SDP has been widely applied to the optimization of water resources. See Yakowitz (1982); Yeh (1985); Stedinger et al. (1984); Philbrick and Kitanidis (1999); Labadie (2004) for more references and e.g. Karamouz et al. (2005); Pianosi and Soncini-Sessa (2009); Tilmant et al. (2010) for more recent applications.

7.2.1 Formulation of a typical hydropower reservoir optimization problem using SDP

For hydropower production, reservoir operation is usually analyzed and optimized at a monthly time step, complemented by a daily or hourly operation that uses the results from the monthly time scale. The optimal policy of monthly releases as a function of reservoir level, current hydrological conditions and month of the year can be found using SDP.
In a typical reservoir operation SDP problem, the state variables are discretized into intervals represented by characteristic values. For a single-reservoir problem, a two-dimensional state is usually chosen, consisting of the current reservoir storage level and the current inflow to the reservoir, which is used as a characterization of the hydrological condition (Tejada-Guibert et al., 1995). This means that for the monthly flows, it is assumed that all information that is available about the next flows is present in the current flow, which will be known at the end of the current period. The streamflow persistence is assumed to be dominated by lag-1 autocorrelation. This is called a first order Markov (Markov-1) process. Note that this assumption may be problematic when the streamflow dependence structure has more persistent characteristics, as often found in nature (Koutsoyiannis, 2005b), leading for example to undersizing of reservoirs (Hurst, 1951). Using more state variables in the Markov description may be a remedy, but leads to higher computational requirements.

With the Markov-1 assumption, the equation for the value-to-go function in SDP, which is solved backwards in time, is

F_t^*(S_t, Q_t) = \max_{R_t} \left\{ B(S_t, R_t, Q_t) + \mathrm{E}_{Q_{t+1}|Q_t} \left[ F_{t+1}^*\big(S_{t+1}(S_t, Q_t, R_t), Q_{t+1}\big) \right] \right\}    (7.1)

subject to:

S_{t+1} = S_t + Q_t - R_t - L_t    (7.2)
S_{t+1,\min} \le S_{t+1} \le S_{t+1,\max}    (7.3)
R_{t,\min} \le R_t \le R_{t,\max}    (7.4)

where the asterisk stands for optimal. The optimal value-to-go function F_{t+1}^* in time step t + 1 is, in the discrete case, a lookup table stored in the previous calculation step, where for every time period the value-to-go is given for each combination of the two states: storage S_t and current flow Q_t. R_t is the release that is to be optimized, L_t represents the spills, and B is the function for the immediate benefits in the current time step. E_{Q_{t+1}|Q_t} denotes the expectation operator with respect to the conditional distribution of the next period's inflow, given the current one.

Figure 7.1: For each state, the value-to-go is computed from the optimal sum of the expected future value-to-go and the immediate benefits. The result is shown for the example reservoir used in this chapter, for the decision problem in May, with the initial state corresponding to the top left of Fig. 7.5; the value-to-go is in GWh until the end of the planning period.

Apart from this value-to-go function at every time step, solving the optimization also yields the optimal release function R_t^* at every time step:

R_t^*(S_t, Q_t) = \arg\max_{R_t} \left\{ B(S_t, R_t, Q_t) + \mathrm{E}_{Q_{t+1}|Q_t} \left[ F_{t+1}^*\big(S_{t+1}(S_t, Q_t, R_t), Q_{t+1}\big) \right] \right\}    (7.5)

When solved backward until the current time step t, the optimal release R_t^* is known for the current state S_t, Q_t and can be executed. For finding a steady state policy, the backward recursion is iterated until the yearly increase in the value-to-go function converges to a constant value for all states. The model of the reservoir system is captured in the constraints (Eq. 7.2-7.4). Next to minimum and maximum releases (Eq. 7.4) and reservoir levels (Eq.
7.3), the constraints include the mass-balance equation for the reservoir (Eq. 7.2). When a system of reservoirs is optimized, the states become vectors and a connectivity matrix can be used to specify the layout of the reservoir system (see e.g. Tilmant et al., 2007).

7.2.2 Computational burden versus information loss

Due to the discretization, F_t^* has to be computed for every combination of discrete values of the state that might be visited by the reservoir system. This has to be done for every time period. Therefore, the number of evaluations of the objective function is proportional to T \times (N_S^{n_S} \times N_Q^{n_Q}), where T is the number of time periods within the horizon, N_S and N_Q are the number of discrete values for the storage and the inflow, and n_S and n_Q are the number of reservoirs and the number of memory states in the model of inflows to the system. This gives rise to the "curse of dimensionality", which makes the classical SDP approach computationally intractable for systems with more than 3 or 4 reservoirs. Ways to overcome this curse are an important topic in the water resources optimization literature. One solution is aggregation of multiple reservoirs into one representative reservoir, but this can of course lead to a severe loss of information about the spatial distribution of storage and to a reduction in performance of the control policy. Another method is the approximation of the value-to-go with a function. Because the function can be interpolated, fewer discrete points are needed in the representation, or only the function parameters need to be stored. A number of references for such approaches, e.g. cubic Hermite polynomials (Foufoula-Georgiou and Kitanidis, 1988), splines (Johnson et al., 1993), and neural networks (Bertsekas and Tsitsiklis, 1996), are given by Pianosi (2008). Pianosi also discusses other strategies to combat dimensionality, such as reducing the model state in an online approach (Pianosi and Soncini-Sessa, 2009). Another promising direction is Stochastic Dual Dynamic Programming (SDDP; Pereira and Pinto, 1991), which approximates the value-to-go function by piecewise linear functions around the states actually visited by the system. The piecewise linear nature of the value-to-go function allows formulating the dual problem, so that the value-to-go can be approached from two sides, converging to the true value. A limitation of the approach is that the immediate benefit function is restricted to a linear function of the state. A large advantage is that systems with a much larger state dimension can be optimized (see e.g. Tilmant and Kelman (2007), where a system with 20 reservoirs is optimized). Naturally, all these simplifications have some negative impact on the performance.

In this chapter, an information-theoretical view is given on some consequences of assuming that the inflow is a Markov process, which is one of the most common assumptions in a classical SDP formulation. It is argued that in general, the real-time operation of a reservoir can use more short-term information than the long-term optimization supposes. This leads to an underestimation of the value of water in the long-term planning. If this underestimation depended on the storage level, the marginal future value of water, and therefore the optimal policy, would also change. This would mean that an estimate of the future information gain is needed to find an optimal policy. A practical demonstration of this effect would be an interesting topic for further research.
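To make the backward recursion of Eq. (7.1) concrete, the sketch below implements a minimal tabular SDP over discretized storage values and monthly inflow classes. All names (benefit, next_storage, the candidate release grid) are hypothetical placeholders rather than code from the thesis, and the constraints (7.2)-(7.4) are assumed to be enforced inside next_storage and the release grid.

```python
import numpy as np

def sdp_backward(T, S_grid, q_vals, P, benefit, next_storage,
                 release_grid=np.linspace(0.0, 2400.0, 25)):
    """Minimal tabular backward recursion for Eq. (7.1).

    T            : number of time steps in the horizon
    S_grid       : (N_S,) characteristic storage values
    q_vals       : (12, N_Q) representative inflow per month and inflow class
    P            : (12, N_Q, N_Q) monthly transition probabilities p(q_{t+1} | q_t)
    benefit      : callable B(S, R, Q) -> immediate benefit (e.g. energy)
    next_storage : callable (S, Q, R) -> end-of-period storage (mass balance + spill)
    Returns value-to-go F[t, i_S, i_Q] and optimal release R_opt[t, i_S, i_Q].
    """
    N_S, N_Q = len(S_grid), q_vals.shape[1]
    F = np.zeros((T + 1, N_S, N_Q))            # F[T] = 0: no value beyond the horizon
    R_opt = np.zeros((T, N_S, N_Q))

    for t in reversed(range(T)):
        m = t % 12                              # month of year (cyclostationary model)
        for i_S, S in enumerate(S_grid):
            for i_Q, Q in enumerate(q_vals[m]):
                best_val, best_R = -np.inf, 0.0
                for R in release_grid:
                    S_next = next_storage(S, Q, R)
                    # expected value-to-go: interpolate over storage per inflow class,
                    # then take the expectation over next month's inflow classes
                    F_next = np.array([np.interp(S_next, S_grid, F[t + 1, :, j])
                                       for j in range(N_Q)])
                    total = benefit(S, R, Q) + P[m, i_Q] @ F_next
                    if total > best_val:
                        best_val, best_R = total, R
                F[t, i_S, i_Q] = best_val
                R_opt[t, i_S, i_Q] = best_R
    return F, R_opt
```

The returned tables correspond to the value-to-go and release functions of Eqs. (7.1) and (7.5); iterating this recursion over repeated years until the value-to-go increases by a constant amount per year would give the steady state policy described above.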
7.3 Example case description

For the research described in this chapter, a toy reservoir model was constructed as an example. The model has the dimensions of Ross Lake in the Columbia river basin (WA, USA). In its current form, the model is not aimed at improving reservoir operation, but serves as an example with realistic dimensions. For the sake of transparency, it is assumed that maximizing total hydropower production is the only objective and that there is no ecological flow requirement or economic discount factor. In reality, economic objective functions will be more complex, but this would not contribute to the example.

Parameter        Description                             Value   Unit
S_max            Maximum storage in the reservoir        1890    Mm3
S_min            Dead storage of the reservoir           464     Mm3
h_min            Minimum head on turbines                82      m
R_max            Maximum release through generators      453     m3/s
R_max + L_max    Maximum release including spills        2400    m3/s
R_min            Minimum release (environmental flow)    0       m3/s

Table 7.1: Characteristics of the hydropower reservoir toy model.

Table 7.1 gives an overview of some of the most important characteristics of the reservoir, and figure 7.2 gives the reservoir area and the head on the turbines as a function of the storage, along with the characteristic storage values of the discretization. The storage volume was subdivided into 28 intervals and the midpoints of those intervals, along with the minimum and maximum storage, give the 30 discrete storage values. This method of discretization is known as the Savarenskiy scheme and was shown to have advantages over other discretization schemes (Klemeš, 1977).

For the inflows to the reservoir, 55 years of modeled inflows from the hindcast dataset for the Columbia river were used (see chapter 4). The advantage of this data set is that it provides a complete record and represents the natural flows, without the influence of releases from upstream reservoir operations. Subsequently, for each month of the year, the inflows were discretized and the transition probabilities for the Markov-1 description were calculated from observed frequencies. The flow was discretized into 5 separate classes, where the class boundaries were chosen to represent the quantiles of 20, 40, 60 and 80%. This resulted in 5 equiprobable intervals, for which the class boundaries vary per month. In this way, the a priori distributions have maximum entropy and knowledge of the flow class gives maximum information for the given number of classes. For each discrete class in each month, the expected value of all historical flows falling within that class is used as the representative value in the SDP optimization. For the transition probabilities used in the following analysis, an empirical transition matrix was chosen rather than one based on a parametric distribution. Because no assumptions on the distribution type are made, data scarcity affects the estimation of transition probabilities. This problem is aggravated when Markov models with more states are used. The difficulty of estimating transition probabilities from limited data is also referred to as the "curse of modeling" (Castelletti et al., 2010).
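A minimal sketch of this discretization into equiprobable monthly classes and the frequency-based estimation of the transition matrices is given below; the function and variable names are hypothetical and the handling of edge cases (e.g. ties at class boundaries) is simplified.

```python
import numpy as np

def monthly_classes_and_transitions(flows, months, n_classes=5):
    """Split the flows of each calendar month into equiprobable classes and estimate
    Markov-1 transition probabilities p(class_{t+1} | class_t) from observed frequencies.

    flows  : (N,) monthly inflow volumes in chronological order
    months : (N,) calendar month (1-12) for each value
    """
    flows, months = np.asarray(flows, float), np.asarray(months, int)
    K = n_classes
    bounds = np.zeros((12, K - 1))
    classes = np.zeros(len(flows), dtype=int)
    for m in range(1, 13):
        sel = months == m
        # class boundaries at the 20/40/60/80% quantiles: maximum-entropy prior per month
        bounds[m - 1] = np.quantile(flows[sel], np.arange(1, K) / K)
        classes[sel] = np.searchsorted(bounds[m - 1], flows[sel])

    # count observed transitions from each month to the next and normalize the rows
    P = np.zeros((12, K, K))
    for t in range(len(flows) - 1):
        P[months[t] - 1, classes[t], classes[t + 1]] += 1
    P /= np.maximum(P.sum(axis=2, keepdims=True), 1.0)
    return classes, bounds, P
```

The representative value of each class, i.e. the mean of the historical flows falling in that class for that month, as used in the SDP optimization above, can then be obtained by averaging flows over the matching month-and-class selection.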
7.3.1 Simulation and re-optimization

After deriving the optimal policy, i.e. the value-to-go functions, with SDP based on the Markov-1 representation, the reservoir operation was simulated using the original, non-discretized inflows of the dataset. In this re-optimization, a one-step optimization problem is solved every time step, optimizing the sum of the current benefits and the future value-to-go, which is read from the tables stored by the backward optimization. This amounts to solving the optimization problem in equation 7.5. Instead of the representative values for storage and inflow, the exact values of the current state are used during re-optimization. The value-to-go from the end-of-period state is found by linear interpolation of the stored table (see Fig. 7.1).

For this optimization, the Matlab function "fmincon" was used. This has the advantage that, next to the optimal solution and the value of the objective function, it also gives the Lagrange multipliers for all constraints as an output. These are the derivatives of the objective function with respect to the value of the constraint. For example, the Lagrange multiplier for the mass balance constraint gives the improvement in objective function value as a result of relaxing the constraint by one unit (i.e. adding one m3 of water to the system). If the objective is stated in monetary terms, the Lagrange multiplier can be interpreted as the shadow price of water. In absence of a real market, this is the imaginary price that one of the users would be willing to pay for one m3 of water in the current conditions and allocation (Tilmant et al., 2008). In the next section, other theoretical links between SDP results and water resources economics are given.

Figure 7.2: The storage volume-area-head relation for the reservoir and the storage discretization using the Savarenskiy scheme.

Figure 7.3: The discretization of the monthly flows (left) into 5 equiprobable classes for each month (right).

Figure 7.4: Reservoir behavior during the re-optimization. Because no firm energy production target was set in this case, energy production drops during the dry season. The reservoir is constantly kept full, unless future spills become likely.
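The sketch below mimics the one-step re-optimization described above in simplified form: it maximizes Eq. (7.5) over a grid of candidate releases, interpolating the stored value-to-go table, and approximates the shadow price of water by a finite difference instead of reading a Lagrange multiplier (as fmincon provides in the thesis). All names are hypothetical placeholders.

```python
import numpy as np

def reoptimize_step(S, Q, F_next, S_grid, p_cond, benefit, next_storage,
                    R_max=2400.0, n_cand=200):
    """One-step re-optimization (Eq. 7.5) for the exact current state (S, Q).

    F_next : (N_S, N_Q) stored value-to-go table for the next time step
    p_cond : (N_Q,) transition probabilities for next month's inflow class,
             conditional on the current inflow class
    Returns the optimal release and an approximate shadow price of water
    (the gain in the optimum per extra unit of water in the system).
    """
    def objective(R, S0):
        S_end = next_storage(S0, Q, R)
        # interpolate the stored table over storage for each inflow class
        F_interp = np.array([np.interp(S_end, S_grid, F_next[:, j])
                             for j in range(F_next.shape[1])])
        return benefit(S0, R, Q) + p_cond @ F_interp

    candidates = np.linspace(0.0, R_max, n_cand)
    values = np.array([objective(R, S) for R in candidates])
    R_star, V_star = candidates[np.argmax(values)], values.max()

    # finite-difference analogue of the Lagrange multiplier of the mass balance:
    # how much does the optimal objective improve if one extra unit of water is added?
    dS = 1.0
    V_plus = max(objective(R, S + dS) for R in candidates)
    return R_star, (V_plus - V_star) / dS
```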
7.4 Optimization and the value of water

In SDP, the optimal decision is found by maximizing the sum of the current and the future benefits that can be obtained from the water system. Due to the dependence of hydro-electric power generation on the reservoir level, not every unit of water has the same value, which leads to non-linearities in the optimization problem. One m3 of water released at a higher reservoir elevation yields a higher immediate benefit. Also the value per unit of water stored for the next time steps depends on the current storage level. If the reservoir is almost full, every extra unit that is stored is increasingly likely to be spilled in the future. Therefore, the marginal value of storage (the benefits of one extra m3 stored) decreases with increasing storage for a near-full reservoir in the wet season.

Both the current and the expected future benefits also depend on the current release. To maximize the total benefits, given by the value-to-go F_t^* in equation 7.1, the derivative of F_t^* with respect to the release R_t must be zero (assuming that the maximum is not constrained by R_min or R_max), so we can write

\frac{\partial F_t}{\partial R_t} = 0    (7.6)

\frac{\partial}{\partial R_t} B(S_t, R_t, Q_t) = - \frac{\partial}{\partial R_t} \mathrm{E}_{Q_{t+1}|Q_t} \left[ F_{t+1}^*(S_{t+1}, Q_{t+1}) \right]    (7.7)

The left hand side of equation 7.7 is the marginal value of using water in the current period (which is assumed to be certain). The right hand side represents the marginal value of saving water for the future, or the opportunity cost of using the water in the current period. This marginal value is the derivative of the expected future value of the water in storage. At the optimal release, the value of one m3 of water in the reservoir is thus equal to the derivative of the value-to-go with respect to the release (see Fig. 7.5). Also, the immediate value of using one extra m3 of water is then equal to the opportunity cost of not having that m3 of water available for the future. Furthermore, the marginal value is equal to the Lagrange multiplier for the mass-balance constraint in the re-optimization (Tilmant et al., 2008). Optimization can thus provide interesting information about water resource economics, as it gives insight into the trade-offs between different water uses and different time periods. See, for example, the analysis in Tilmant et al. (2008) of a large reservoir system in Turkey. Conversely, it can also be observed that the expected future value of water determines the optimal release. An accurate estimate of this future value requires an accurate model of the future decisions and therefore an accurate model of future information availability for those decisions.

7.4.1 Water value in the example problem

For the example reservoir, an SDP optimization was performed to determine the value-to-go functions and the optimal release policy. Analysis of the value-to-go function and of the Lagrange multipliers of the mass balance in a simulation of the re-optimization revealed interesting information about the marginal value of water as a function of time, current reservoir level and current inflow. In this example, only the objective of energy production was considered. Therefore, the marginal value of water can be expressed in terms of the energy produced per m3 of water, e.g. Wh/m3, thereby avoiding the need to make a price assumption. This focuses the analysis on the hydrological influence on water value and rules out the influences of the socio-economic system connected to the water resource. In a real market situation, the price per kWh is also dependent on supply and demand and may vary with the season. This may significantly influence release decisions and can even lead to typically anthropogenic weekly cycles in downstream river flows.

The high inflows in late spring and summer lead to a decrease in the marginal value or shadow price of water (see Fig. 7.6). Because the release is already at a maximum and the reservoir will most likely be replenished by July, extra water received in June is likely to be spilled in the future, which decreases the expected future value. In several instances (more than 25% of the years in June; see Fig.
7.6), the shadow price even drops to zero, corresponding to an inevitable future spill. In terms of the Lagrange multipliers this means that relaxing the mass balance constraint by one m3 of water (adding one m3 to the system) will not give any extra benefits, because the generators already operate at full capacity and, even with this release, the reservoir will be full at the end of the period. The extra water therefore cannot lead to extra immediate or future benefits and will be spilled. In March, just before the melting season, however, water is still scarce and the value of water is high. Any extra water is directly used to spin the turbines and contribute to power production, while being at the highest reservoir elevation.

The value of water depends on the end-of-period storage in the reservoir, but also on what inflow is expected further into the future. For the Markov-1 model of inflow persistence, the current inflow class determines the estimated distribution for the next period's inflow and therefore can have an effect on the value of storage. In figure 7.8, the marginal value of storing water as a function of storage level is plotted for all months. The values are the derivatives of the value-to-go functions resulting from the SDP optimization. The five lines correspond to the five inflow classes illustrated in Fig. 7.3, where 1 is the lowest inflow class.

Figure 7.5: The value and opportunity cost of releases and the marginal value of water for combinations of high inflow (top), low inflow (bottom), low storage (left) and high storage (right). The value curves are shifted to depict the extra value relative to the minimum.

Figure 7.6: The value of water at the release decisions found by re-optimization during simulation, as a function of the month of the year. The box plots show the median, quartiles and minimum and maximum values. The abundance of water in the melting season decreases its marginal value.
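These derivatives of the value-to-go with respect to storage can be extracted numerically from the stored tables; a minimal sketch (hypothetical names, table layout as in the earlier sketches) is:

```python
import numpy as np

def marginal_value_of_storage(F_month, S_grid):
    """Approximate dF*/dS from a value-to-go table for one month
    (shape: N_S storage levels x N_Q inflow classes), using central differences.
    With F in Wh and S in m3 this yields Wh/m3, one curve per inflow class,
    analogous to the lines plotted in Fig. 7.8."""
    return np.gradient(np.asarray(F_month, dtype=float), S_grid, axis=0)
```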
Figure 7.7: The reservoir behavior as a function of the month of the year.

7.5 Interdependence of steady state solution and real-time control

Real-time operation of reservoirs makes use of actual information about inflows in the near future, at a high temporal and spatial resolution, using information in high-dimensional state vectors of, for example, distributed hydrological models. In contrast, the long-term optimal steady state policy relies on a Markov-1 description of inflow uncertainty and persistence at a coarser time resolution. However, the short- and the long-term optimization are not independent. Firstly, the future value function as a function of the state, which can be obtained from the long-term optimization, might be needed as a boundary condition at the end of the time horizon for the short-term optimization. Secondly, improvements in operation due to the use of actual information in real-time operation, as compared to a steady state policy based on just climatic information, need to be reflected in the rewards attributed to actions in the long-term optimization.

Figure 7.8: The marginal value of water as a function of the end-of-period storage. The value is affected by the current inflow through the inflow persistence. This is accounted for by the different lines for the inflow classes 1 (low flow) to 5 (high flow) depicted in figure 7.3. Usually the marginal value increases with storage due to a higher head on the generators, but for an overly filled reservoir the marginal value decreases due to the increasing probability of future spills, especially in the melting season.

The first influence, of the long-term on the short-term operations, takes place through the end of the horizon for the short-term planning. The real-time optimization only has to take into account events that take place within the information control horizon, specified in chapter 2. When the information prediction horizon is shorter than the information control horizon, the actions towards the end of the information control horizon matter for the current action, but there is no predictive information to base them on. These actions can therefore be assumed to follow the long-term steady state policy. The information in these actions that is relevant for the current action to optimize is contained within the value-to-go function directly preceding the first action of the steady state policy. Inclusion of this information in the short-term optimization can be achieved by using the value-to-go of the long-term policy to reward the final state of the short-term optimization. This is in fact making use of Bellman's principle of optimality again, ensuring that the total benefits are maximized.
The second influence, of the short-term on the long-term operations, is due to the fact that the uncertainties under which the decisions in the long-term policy are supposed to be taken do not reflect the true uncertainty under which the decisions are actually taken. In the long-term policy, the decision is presumed to be based only on the information in the states of the Markov chain, which usually includes the actual reservoir level, the actual inflow and the month of the year. In reality, the probability distribution for the state transitions in the near future is conditioned on other, external information, which can be summarized in a forecast. The uncertainty under which decisions are taken is thus decreased ("conditioning reduces entropy"). To account for this effect, Stedinger et al. (1984) proposed to use the best forecast of the inflow as a state variable instead of the previous month's inflow and reported an improvement in the performance of the resulting policy. A possibility to also take the uncertainty in this forecast into account is offered by the technique of Bayesian Stochastic Dynamic Programming (BSDP; Karamouz and Vasiliadis, 1992; Kim and Palmer, 1997). In this method, the transition probabilities for the inflows are continuously updated to reflect the forecast information. The conditional distribution of forecasts, given observed inflows, is used to also account for forecast uncertainty. A limitation of the method is that it can only take into account the predictive power for one time step ahead.

7.6 How predictable is the inflow? - entropy rate and the Markov property

A process satisfies the Markov property if all information that the past carries about the next state is captured in the current state, which may be a vector that contains the flow from previous time steps. This can be written in terms of the familiar information measures mutual information (Eq. 7.9) and conditional entropy (Eq. 7.10):

P(X_{t+1} | X_t) = P(X_{t+1} | X_t, \ldots, X_1)    (7.8)
I(X_{t+1}; X_t, X_{t-1}, \ldots, X_1) = I(X_{t+1}; X_t)    (7.9)
H(X_{t+1} | X_t, X_{t-1}, \ldots, X_1) = H(X_{t+1} | X_t)    (7.10)

Consequently, the information that X_t contains about X_{t+h} decreases with increasing h and all information is transferred through the intermediate states. A first order discrete Markov process can be completely described by a transition matrix specifying the conditional probabilities of ending up in the next states, given each possible current state. For a time-homogeneous Markov process, information about the current state fades away in time, until a stationary distribution \mu is obtained (under the conditions of irreducibility and aperiodicity; see Cover and Thomas (2006)). The stationary distribution is often also referred to as the "climatic" distribution, and can be found by solving

\mu M = \mu    (7.11)

where M is the transition probability matrix of the Markov chain. If current information is available, the distribution will differ from the uniform stationary distribution for some time. When the current state is precisely known (X_t = x), the entropy H(X_{t+h} | X_t = x) will increase with lead time h and the Kullback-Leibler divergence D_{KL}(\mu \| X_{t+h} | X_t = x) will asymptotically go to zero, meaning that there is no extra information relative to climatology. This was defined as the information prediction horizon in chapter 2; see also Fig. 2.6 on page 28.
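A minimal sketch of these two quantities, for a single time-homogeneous transition matrix and with hypothetical names, could look as follows; for the cyclostationary monthly model used here, the matrix would change from step to step.

```python
import numpy as np

def stationary_distribution(M, n_iter=1000):
    """Stationary distribution mu with mu M = mu (Eq. 7.11), by repeated propagation."""
    mu = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(n_iter):
        mu = mu @ M
    return mu

def kl_to_climatology(M, x0, max_lead):
    """D_KL(mu || P(X_{t+h} | X_t = x0)) in bits, for lead times h = 1 .. max_lead.
    The divergence decays towards zero as the conditional distribution approaches
    climatology; zero-probability terms are simply skipped in this sketch."""
    mu = stationary_distribution(M)
    p = np.zeros(M.shape[0])
    p[x0] = 1.0                              # current state known exactly
    out = []
    for _ in range(max_lead):
        p = p @ M                            # propagate one step ahead
        ok = (mu > 0) & (p > 0)
        out.append(np.sum(mu[ok] * np.log2(mu[ok] / p[ok])))
    return np.array(out)
```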
In the model used in the SDP optimization, a different transition probability matrix is estimated for each month. This results in a cyclostationary process, which can have a different stationary distribution for each month. Because each month of the year is individually split into 5 equiprobable classes, the information measures between the different months can be interpreted in the same way as if the flows formed a stationary process. An important characteristic of a stationary Markov process is the entropy rate H', which gives the average uncertainty of each new value. Because the distribution of the next inflow depends on the current inflow, this uncertainty is less than the marginal entropy (log 5 in this case). The entropy rate of a first order Markov process is simply the conditional entropy H(X_{t+1} | X_t). This conditional entropy is equal to the marginal entropy minus the mutual information:

H'(X_1 \ldots X_T) = H(X) - I(X_t; X_{t+1})    (7.12)

In Fig. 7.9, the mutual information between the inflows in subsequent months is plotted for different lead times and different initial months. The mutual information is calculated in two different ways. The first assumes that the inflow is a Markov process and uses the product of the transition matrices in the different time steps to calculate the resulting transition probabilities for multiple time steps ahead. The second method directly estimates the joint distribution based on the series of pairs x_t, x_{t+h}. If the first method yields a lower mutual information than the second, the persistence is stronger than explained just by lag-1 dependence and predictability is underestimated by the Markov-1 model. Because of the limited data, the estimation of transition probabilities is coarse and leads to spurious mutual information, which would even be found in a random data set (a bootstrap revealed that 10% of randomly drawn data sets yield a mutual information of over 0.37 bits). Due to this curse of modeling, no conclusions can be drawn about "beyond Markov predictability" for this case. Notwithstanding the practical difficulties presented by this curse, we can generally expect that inflow forecasts will in reality be based on more information than just the current inflow. In theory, the predictability can be assessed by looking at the mutual information I(X_{t+h}; Y_t), in which Y_t is the vector of all potentially relevant information available at time t.

Figure 7.9: The mutual information between the inflow classes in different months calculated from the Markov matrices, and the empirically estimated mutual information based on the joint distribution of classes at various time lags.
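A sketch of the two estimates compared in Fig. 7.9, reusing the (hypothetical) class series and monthly transition matrices from the earlier sketches; it assumes that at least one pair of observations starts in the requested month.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (bits) of a joint probability table."""
    joint = joint / joint.sum()
    px, py = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
    ok = joint > 0
    return np.sum(joint[ok] * np.log2(joint[ok] / (px @ py)[ok]))

def mi_markov_vs_empirical(classes, months, P, start_month, lead):
    """Compare I(X_t; X_{t+lead}) implied by the Markov-1 model with a direct
    empirical estimate from the class series (cf. Fig. 7.9)."""
    K = P.shape[1]
    # Markov-1: propagate the monthly transition matrices over `lead` steps
    M = np.eye(K)
    for h in range(lead):
        M = M @ P[(start_month - 1 + h) % 12]
    joint_markov = np.diag(np.full(K, 1.0 / K)) @ M    # classes at time t are equiprobable

    # empirical: joint frequencies of pairs (x_t, x_{t+lead}) starting in start_month
    joint_emp = np.zeros((K, K))
    for t in np.where(months[:-lead] == start_month)[0]:
        joint_emp[classes[t], classes[t + lead]] += 1
    return mutual_information(joint_markov), mutual_information(joint_emp)
```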
7.7 The influence of information on the marginal value of water

Generally, more predictable inflows lead to better decisions. This translates into an increase in the value that can be obtained from the water in the reservoir. A real-time control system that uses more information than the states of the Markov chain in the steady state optimization capture will therefore increase the value of water. In the case of the hydropower objective in the example, this translates into an operation strategy that maintains a higher average elevation in the reservoir or prevents more spills. Given that such a system is also used to make future decisions, those decisions will improve, leading to more benefits from the same amount of water. It is therefore likely that the future value of water will also increase. In this section it is investigated how these differences in value might affect the optimal decision.

The value-to-go (and therefore the decision) at time step t depends on all information that is currently available about the future behavior of the water system. This behavior does not only include the inflows, which are not controllable, but also the future decisions about releases. These releases depend on the information that will be available at the time those future decisions are made. Therefore, the optimal release in the current time step depends not only on our best probability estimates of the future inflows themselves, but also on our best estimates of how much information will be available at the time of future decisions. A model of the future growth in information is therefore necessary.

The Markov chain model of inflows in a typical SDP problem is such a model of future information, which is assumed to be captured by the state. To enable the disaggregation in time that forms the basis of dynamic programming, the external influences need to satisfy the Markov property and be part of the state, or be completely random. To satisfy this condition, all states of the rainfall-runoff model and even the dependency structure in the rainfall need to be modeled as states in the Markov process. Due to the curse of dimensionality, the number of states of the Markov process is often assumed to be low, even if in reality more information on the future inflows is available and used in real time. Although this enables a fast solution, it also underestimates the amount of information that is available for future decisions. An underestimation of the value that can be generated from the water is the result.

A condition for this underestimation to influence the optimal operation is that the derivative of the value function also changes. A systematic, fixed underestimation of future benefits for all reservoir levels would not have an effect on operation. If the real-time information has an additional value for operation that depends on the reservoir level, the marginal value and thus the optimal decision are also affected. Although it is beyond the scope of the present case study, it is conjectured that the marginal value is indeed affected, because accurate forecast information seems to have most additional value for operations while the reservoir is nearly full. This would result in an increased marginal value of storage and decisions that favor higher reservoir elevations most of the time, with occasional drawdowns for well-predicted inflow peaks.

To achieve a steady state policy that accounts for the information gain of the true real-time operations, a higher order model would be needed. The computational effort required for such an approach is often prohibitive. An alternative approach is to correct the value functions by an empirical factor that is learned from the actual operations. This can for example be achieved using reinforcement learning based approaches (Sutton and Barto, 1998), which will be briefly discussed in the recommendations of this chapter.
7.8 Sharing additional benefits of real-time information between stakeholders

Apart from depending on the current reservoir level, the additional value of future information may also be different for different users. This can have an effect on negotiations between these users. If one of the users gains more than the others from the effects of real-time information that is used in the actual decisions, a negotiated off-line Pareto-optimal solution needs to be renegotiated online to share the additional benefits. This section discusses this issue in more detail.

Optimization by SDP can be a useful tool in negotiations between different stakeholders in a water system, because it can find optimal solutions for a given objective, taking into account uncertain natural influences. Without optimization tools, these solutions can be difficult to identify. A release policy that results from an SDP optimization could therefore present a win-win solution for two conflicting objectives compared to the suboptimal status quo. Solutions like these, which identify new benefits that can be shared, can stimulate negotiations between stakeholders; see Soncini-Sessa et al. (2007) for possible methodologies. One way to stimulate negotiations is to find optimal policies for several differently weighted objectives. The set of solutions thus obtained forms a Pareto front. The Pareto front contains the set of solutions in which no objective can be improved upon without deteriorating one of the other objectives. All solutions dominated by (i.e. above) the Pareto front can be improved upon for one of the stakeholders without a loss for the others. It therefore makes sense to only negotiate about the Pareto-optimal policies.

Figure 7.10: Hypothetical example of a shifted Pareto front as a result of online operation. The lines indicate the online performance of the off-line policy with the original weights. The way to share the benefit from real-time operation can be part of new negotiations.

Identifying the Pareto front involves several optimization runs. Simplifications are often required to make the computations tractable. A typical way to do this is to reduce the state dimension of the SDP model, for example by replacing a Markovian inflow model with white noise, as was done in Pianosi and Soncini-Sessa (2009). In the project described in that paper, this simplified SDP model was used to present possible Pareto-optimal solutions to stakeholders with conflicting objectives of flood protection in Lake Verbano and irrigation downstream. Several rounds of negotiations resulted in an objective function with fixed weights for the various objectives. Subsequently, the operations were improved by a new real-time operation system using a heteroskedastic inflow model based on current information about the actual state of the catchment; see Pianosi and Soncini-Sessa (2009). The additional information that is used in the online optimization has a value due to better meeting the objectives.
Therefore, the Pareto front found by off-line optimization, which is used in the negotiation, is improved upon by the real-time operations; see Fig. 7.10. The improvement that can be made by using more real-time information should in principle benefit at least one of the stakeholders without making the others worse off, theoretically leading to a new, improved Pareto front. The online operations, based on the off-line policy with the negotiated weights, result in a solution that dominates the off-line solution and also lies on the new online Pareto front. However, this new solution presents only one of a range of possible new solutions that provide a Pareto improvement over the negotiated off-line solution.

When the weights for the off-line policy are unchanged, the benefits of the online information will mostly go to the stakeholder whose interests are most sensitive to the real-time information. This is usually the objective associated with the shortest timescale. For example, in a reservoir used for both irrigation and flood prevention, the real-time information about current and expected rainfall in the catchment, which is not taken into account in the off-line optimization, can have large benefits for flood protection. Pre-releases based on measured precipitation lead to a significant reduction in flood peaks downstream and in the maximum level in the reservoir (Pianosi and Soncini-Sessa, 2009). The additional benefits to irrigation, on the other hand, are not very significant, as the real-time information about how long upcoming droughts will last is usually limited. The real-time optimization mostly benefits flood protection, but the stakeholder with an irrigation interest might feel entitled to a share of these benefits. This can be realized by partly compensating the real-time gains for flood protection by selecting another off-line policy from the Pareto front that benefits irrigation at the cost of seemingly increased flood risk. The real-time information in combination with this change in off-line policy results in a balanced win-win solution compared to the negotiated off-line Pareto-optimal solution.

Although just a thought experiment, this reasoning shows that, next to the information provided by models, a model of the information flows can also contribute to reaching more equitable win-win solutions and to negotiating about the right options. Again, a model of the amount of information on which the actual decisions are based seems to be necessary to fully appreciate the value of water. (Theoretically, the increased value of water due to the online information is closer to the actual value of the water and should therefore be the basis of the negotiations.) The previous case dealt with the future value, while this case concerned the value for the various stakeholders. Real-world case studies are needed to determine the practical significance of this effect in various situations.

7.9 Reinforcement Learning to approximate value functions

In high-dimensional cases, it might be possible to use techniques that empirically estimate the value-to-go functions, rather than trying to explicitly compute them based on simplified models. A promising approach is reinforcement learning; see Sutton and Barto (1998) for a good introduction. Promising first applications in water system operation have been described by Bhattacharya et al. (2003), Lee and Labadie (2007) and Castelletti et al. (2010). In reinforcement learning, an agent interacts with its environment (see Fig. 7.11) by taking actions a, which are mapped from the current state s by a policy π.
The environment, depending on the state and the action, reacts by giving a reward r to the agent. The objective of the agent is to maximize the total reward over a finite horizon or the total discounted reward over an infinite horizon. This maximization is pursued by learning from the interaction and finding out which state-action combinations result in the highest overall rewards. An important aspect is the trade-off between exploitation and exploration. On the one hand, the agent wants to benefit from the good action-state combinations found so far; on the other hand, it must explore other actions to be able to learn improved actions.

Figure 7.11: The agent-environment interaction in reinforcement learning (source: Sutton and Barto (1998)).

There are many parallels between dynamic programming (DP) and reinforcement learning (RL). Also in reinforcement learning, value functions are implicitly or explicitly calculated to summarize the expected rewards associated with one state or one state-action combination. In contrast to DP, where value functions are calculated based on an explicit model of the environment, in RL these value functions are estimated from growing experience, without requiring knowledge of the environment beyond the state. Where in (S)DP the optimal policy is an exact mapping from the state to the optimal action, in RL a policy is a mapping from the state to a probability distribution over actions. A high-entropy probability distribution over the actions means much exploration. If applied to the same problem, an ideal RL algorithm will converge to the same optimal policy as DP, gradually reducing towards a zero-entropy action distribution (i.e. after the learning is complete, there is no exploration, only exploitation).

In practice, RL algorithms usually only reach an approximation of the optimal solution. The large advantage, however, is that they can handle problems of far larger state dimension, because instead of exhaustively calculating the value functions for all possible states, they focus attention on the states that look promising from previous experience. Because experimenting on a real reservoir is usually not desirable, a simulation model can be used as the learning environment. Furthermore, if real-time information is used to improve decisions, but this information is not captured in the state of the long-term SDP optimization, that state ceases to satisfy the Markov property. Even with such an imperfect model, RL can still approximate the best policy. Where DP approaches rely on a perfect model, RL discovers an approximate model in the learning process of interacting with the environment. In this way it partly overcomes the "curse of modeling".

In the case considered in this chapter, a real-time operation module could be considered as part of the environment for the agent that optimizes the long-term policy. The actions of this agent are sent to the environment and used in the real-time optimization. The rewards, which are calculated in the environment, depend on the performance of the real-time optimization, and the states are a summary of the information that the real-time system plus the long-term policy use as input. In this setup, the corrections to the value function due to new real-time information are learned automatically by the RL algorithm, which can be initialized with the value functions from the original SDP solution. The correction to the value function made by the reinforcement learning algorithm will thus be an approximation of the additional value added by the real-time information, which is too complex to calculate exactly.
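A generic tabular Q-learning sketch with epsilon-greedy exploration is shown below to make the learning loop concrete. It is not the specific algorithm used in the cited studies; the environment interface (reset/step) is a hypothetical placeholder, and in the setting above the Q-table could be seeded from the SDP value-to-go.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, eps=0.1, q_init=None):
    """Generic tabular Q-learning with epsilon-greedy exploration.

    env must provide reset() -> state and step(action) -> (next_state, reward, done);
    q_init can be seeded from an SDP value-to-go table to speed up learning.
    """
    Q = np.zeros((n_states, n_actions)) if q_init is None else q_init.copy()
    rng = np.random.default_rng(0)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # exploration vs exploitation trade-off
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # temporal-difference update towards reward + discounted future value
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q   # greedy policy: argmax over actions per state
```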
7.10 Conclusions and recommendations

The value of water can be made more explicit by formulating an optimization problem that includes the different uses of water. Because of the disaggregation in time, SDP can explicitly represent the trade-off between immediate and future benefits. The time-varying value of water found by SDP also includes the effect of the uncertainty in the inflows, which makes water less valuable. Conversely, more information increases the benefits that are obtained from using the water optimally. This makes the value of water dependent on the information available for future decisions.

An SDP optimization model includes a simple model of the information that will be available for future decisions. The model is a Markov chain, in which all information available for the decision is captured in the state. Due to the exponential dependence of the computational burden on the state dimension, and the difficulty of estimating reliable transition probabilities for higher order models, models of low state dimension are often chosen to represent inflow predictability. Such low-dimensional states often do not capture the complete information available for actual decisions. It is conjectured that especially for operations at shorter timescales and at high reservoir elevations, additional real-time information can have significant value for operation. Not taking this value into account in the long-term policy may lead to lower than optimal lake levels in a hydropower-only setting. In a multi-objective setting with irrigation and flood control, it may also lead to an over-emphasis on the objectives of flood-related stakeholders, as they profit most from the real-time information. This may need to be accounted for in negotiations.

It is in principle impossible to fully separate the short- and long-term planning, unless the state truly has the Markov property. The long term influences the end-of-horizon value-to-go function for the short term, while the short-term information influences the benefits in the state transitions in the long term. When forecasts for real-time operations are based on models with a high-dimensional state and high-dimensional observations, it is impossible to include them in a model of future information for the decisions. Reinforcement learning may be a promising way forward to address the problem identified in this chapter. In fact, using such techniques amounts to admitting the infeasibility of an exact solution and trying to find an approximate solution to the full problem, rather than finding a full solution to an approximate problem. This might facilitate empirically taking into account the complex dynamics and value of information, which are overlooked by SDP solutions using reduced models.
Chapter 8
Conclusions and recommendations

"We have the duty of formulating, of summarizing, and of communicating our conclusions, in intelligible form, in recognition of the right of other free minds to utilize them in making their own decisions." - Ronald Fisher

The initial focus of this research was developing methods for risk-based water system operation. Information soon presented itself as the central concept underlying this task. Impressed by the coherence and broad applicability of Shannon's information theory and the "Jaynesian" viewpoint on probability and information, the author shifted the focus somewhat to specifically study the role and flow of information in this process. The problem of risk-based water system operation can be restated as "Rational water management decisions with incomplete information" and approached using theorems from the interlinked fields of decision theory, control theory and information theory. Between data and decisions are predictions (although these are not always explicit; see e.g. reinforcement learning, where actions follow directly from observations), which summarize the information extracted from the data to enable informed decisions. Some of the work in this thesis therefore focused more specifically on information in predictions. Before reviewing the more concrete methodological contributions of this thesis and giving some recommendations for practice and future research, the first section presents some collateral insights about information that were obtained during the research. To the author personally, they provide the most rewarding outcome of the Ph.D. research process.

8.1 Conclusions at the conceptual level

8.1.1 The nature of information

When the word information is used in daily conversation, it is often implicitly restricted to mean useful information: an informative message is required to cause surprise, but also to contain meaning. Information theory is concerned only with surprise, which is related to changes in probability, and not with the meaning (usefulness; utility) of information. Information about some event is the reduction in uncertainty about its outcome. As explained in chapter 3, it is important to distinguish between perceived uncertainty, which can be evaluated ex ante, and true uncertainty, which can be evaluated ex post when the true outcome is known. The perceived uncertainty can be expressed by the entropy of the probability distribution that is adopted in the light of all available information. The true uncertainty is related only to the probability attached to the outcome that actually occurred, and can be expressed as the Kullback-Leibler divergence from the truth to the probability distribution that was adopted before knowing it.

Information is contained in data (observations; evidence), which leads to conditioning of probability distributions of related phenomena and therefore to a reduction of uncertainty about those phenomena. In order to extract this information from data, we need models (algorithms; programs), which give us predictions of the uncertain events. Once a number of predictions has been made and the corresponding outcomes observed, the information that the predictions contained about the observations can be used to evaluate the quality of the model. If enough predictions and observations are available, we can distinguish between wrong and correct information.
The decomposition in chapter 5 revealed that the remaining uncertainty is equal to the original uncertainty plus the wrong information minus the correct information. Once data can be processed in a general, repeatable manner to give informative predictions that can be successfully tested against other data, we have found a pattern in the data, and this can be referred to as knowledge. The knowledge is embodied by the mathematical relations of the model that provides the informative predictions.

8.1.2 The flow of information

Information that is used for human decisions ultimately stems from observations. These can be obtained directly through our senses or with the aid of measuring instruments. To enable optimal decisions, we need to extract from those observations the information that relates to the part of the state of the world that influences the impact of our decision on our utility. In this process, information flows from observation to decision. Good decisions are fed by an inflow of information from the right observations. An increased inflow of field observations can therefore improve decisions. However, we should also manage the flow correctly. On the one hand, care should be taken not to lose any of the relevant information. On the other hand, we should also be careful not to add any information that is not justified. The principle of maximum entropy, or minimum relative entropy, as applied in the ensemble weighting in chapter 4, can help ensure this.

Chapter 5 presented a method for testing how much correct information from the observations ends up in the predictions (the resolution term) and how much wrong information is introduced by miscalibration (the reliability term). The remaining uncertainty about the observations, as measured by the divergence score, is also the performance measure to minimize in model calibration. The measure has the desirable property that its expectation is minimized only if the model gives a correct representation of uncertainty. Both information left behind and unjustifiably added information deteriorate the expected score, which therefore provides an incentive to correctly represent both our knowledge and our ignorance. This representation, a probabilistic forecast, in turn enables optimal decisions.
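Schematically, and using the reliability-resolution-uncertainty terminology above (the exact notation in chapter 5 may differ), this decomposition of the divergence score can be written as

\mathrm{DS} \;=\; \underbrace{\mathrm{REL}}_{\text{wrong information}} \;-\; \underbrace{\mathrm{RES}}_{\text{correct information}} \;+\; \underbrace{\mathrm{UNC}}_{\text{original uncertainty}}

so that the remaining uncertainty equals the original (climatic) uncertainty plus the wrong information communicated minus the correct information communicated.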
Often, however, the future information is too complex to model explicitly and we must resort to empirical, implicit approaches (reinforcement learning, chapter 7) to account for the value of future information.

8.1.4 The necessity of probabilistic predictions

Predictions or forecasts are of paramount importance in science and engineering. In science they are the interface between theory and observation. In engineering, they can be seen as messages that communicate information to a user. However, they should also communicate the missing information or uncertainty to the user. In order to minimize the remaining uncertainty about the true outcome for the user, they have to accurately reflect this uncertainty. Deterministic forecasts violate this requirement to the largest possible extent. Strictly speaking, they increase the remaining uncertainty to infinity, unless they are perfect. As is revealed by the decomposition in chapter 5, this is caused by the large amount of wrong information that they communicate by pretending to be certain, alongside the correct information they may contain. An experienced or knowledgeable user of the forecast may be able to filter out some of the wrong information by implicitly or explicitly recalibrating the forecasts, i.e. not believing them completely. However, this shifts to the user a considerable responsibility that in fact belongs to the forecaster. Ironically, the users who are claimed not to be able to handle probabilistic forecasts, and are for that reason provided with deterministic forecasts, are the ones who have to rely most on their ability to subconsciously make probability estimates based on the limited information in a deterministic forecast.

Another reason to use probabilistic forecasts is that they allow rational decisions for different users at the same time. While a deterministic forecast can theoretically be optimized to allow optimal decisions for one user with a specific decision problem, probabilistic forecasts contain a representation of the full information and uncertainty about the forecast quantity that is relevant for all decision makers, irrespective of their decision problem. Given that most forecasts are made by experts and communicated to a varied body of users who are all free to decide on their own problem, forecasts ought to be probabilistic.

8.1.5 Information theory as philosophy of science

Also in the context of pure science there is a compelling reason why forecasts should be informative, i.e. probabilistic. It is almost part of the definition of science that it is required to produce testable predictions. In fact, this could be interpreted as a requirement for predictions to mandate their own way to be tested. The joint framework of probability and information theory provides an opportunity to meet this requirement. All forecasts that are not probabilistic need external assumptions to determine the way they are tested. As argued in chapter 5, these assumptions either relate to utility or they implicitly specify a probability distribution, which should have been stated a priori.

Science has the task of extracting information from observations about other observable quantities. This is done by trying to find patterns in observed data and representing them in mathematical formulae. These formulae represent our scientific knowledge and are the algorithmic representation of the redundancy in the observations.
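To make the notion of redundancy concrete, consider the following toy sketch in Python. It uses the general-purpose compressor zlib as one stand-in for the file compressors explored in chapter 6, and purely synthetic series rather than real hydrological records, so the numbers are only illustrative:

    import zlib
    import numpy as np

    rng = np.random.default_rng(42)
    n = 5000
    t = np.arange(n)

    seasonal = np.sin(2 * np.pi * t / 365)   # purely periodic series: maximal redundancy
    noise = rng.standard_normal(n)            # white noise: no temporal pattern to exploit

    def compressed_size(x):
        # quantize to 256 levels (deliberately throwing away precision, cf. section 8.3.1)
        # and report the size in bytes after compression
        q = np.round(255 * (x - x.min()) / (x.max() - x.min())).astype(np.uint8)
        return len(zlib.compress(q.tobytes(), 9))

    print(compressed_size(seasonal), compressed_size(noise))
    # the periodic series compresses to a small fraction of the white-noise series:
    # its pattern is redundancy that the compressor's implicit model can exploit

A real hydrological record sits somewhere between these two extremes; how far a compressor can push it towards the first case is one crude probe of how much extractable structure it contains.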
All patterns that are present in the observations allow them to be represented by a theory, which can be stated more compactly than the observations themselves. As is outlined in chapter 6, the strong analogy with data compression is evident: a theory can be seen as a compression algorithm for data. This compressibility is the reason for the very possibility of science. The divergence score presented in chapter 5 can be interpreted as the minimum file size attainable by the best compression algorithm to represent the observations, while knowing the forecasts, i.e. the predictors and the model. To represent all observations, both the predictors and the model need to be stored as well. The file size for the predictors is inversely related to the generality of the model. If that file is large, the conditions for which the model yields its predictions need to be extensively specified, while a smaller file means a more general model. The file size of the decompression algorithm, which represents the model, is a measure for the complexity of the theory. The combined file size is related to all the things that are left unexplained. An explanation is thus simply a description that is shorter than the explanandum. An inverse relation between description length and prior probability of a model, as suggested by algorithmic information theory, fits into a Bayesian framework and provides a possible justification for the principle of parsimony.

A perfect theory of everything would need neither an input file for predictors nor a file to store observed outcomes given the predictions, because the predictions are already equal to the observations and the theory is so general that all input observations have become knowledge and are thus part of the algorithm. In algorithmic information theory, the file size of that algorithm is the Kolmogorov complexity of the universe. This complexity might be infinite, if we consider information entering from parallel universes through quantum interference (see Deutsch (1998)). We are now maximally out of the scope of this thesis, so the next section returns to a more practical level.

8.2 Methodological contributions and recommendations for practice

8.2.1 On risk based water system operation

Operation of water systems should be risk based when the control problem is not certainty equivalent. In those cases, the expected degree of fulfillment of the objectives, given all possible futures, should be maximized. Two conceptual time horizons, relating to sensitivity to the future and to information about the future, were defined that can serve as guidelines in the design of predictive controllers. Furthermore, two different multiple model formulations of MPC, presented in chapter 2, serve as limiting cases for the availability of new information for future decisions (no information versus perfect information). In theory, stochastic dynamic programming, as applied in chapter 7, offers the possibility to exactly model the availability of future information, but only at the expense of computational intractability for most real-world problems. Two routes of circumvention are available to achieve near-optimal risk based water system operation. The first route modifies the problem to be solved, making assumptions that allow a fast and exact solution to the modified problem.
The second route finds near-optimal actions empirically by using techniques from artificial intelligence, such as reinforcement learning. In chapter 7, that technique was identified as a promising way forward for reservoir operation.

8.2.2 On weighted ensemble forecasts

The information that is contained in a forecast resides in what is actually presented. If a forecast does not specify the entire probability distribution, but just a number of summary statistics, the maximum entropy distribution consistent with those statistics should be assumed. The forecast information contained in the statistics can be added to an ensemble by adjusting the ensemble weights to match the statistics, while minimizing relative entropy from the original weights. This method prevents the weighted ensemble from representing too much or too little of the forecast information. The minimum relative entropy update (MRE-update) presented in chapter 4 forms a readily applicable method for generating weighted ensembles using this principle. The information-theoretical foundation makes it theoretically superior to the existing methods for ensemble weighting. When forced to exactly match the forecast, the existing pdf-ratio method can be used as a fast solution method for the MRE-update, making use of the form of the analytical solution to the MRE optimization problem.

8.2.3 On the evaluation of probabilistic forecasts

The Brier score and other derived quadratic scores, like the RPS and CRPS, fail to meet certain fundamental requirements for measures of forecast quality. Forecasting should be seen as a communication problem and the quality of forecasts should be evaluated using the divergence score introduced in chapter 5. The Brier score, which has been used extensively in meteorology for the last 60 years, is a second order approximation of the divergence score and should be replaced by it, leading to notably different evaluations for the extreme forecast probabilities. The analogy between the divergence score and its second order approximation can be extended to a well-known decomposition of the Brier score into uncertainty, reliability and resolution, allowing a re-interpretation of these components as missing, wrong and correct information. The decomposition can also be generalized to the case of uncertain observations, including the observational uncertainty as a closing term. The measures proposed in chapter 5, given their axiomatic justification, are expected to become part of standard forecast verification practice.

8.2.4 On performance measures for model inference

The measure that is chosen to evaluate a model implicitly specifies the probabilistic part of that model. This measure should therefore be specified a priori, along with the model; otherwise, part of the model is formulated with knowledge of the observations it will be tested against. A completely specified model gives probabilistic predictions that can and should be tested using information-theoretical measures. In contrast to measures that reflect a decision-problem-specific utility, information-theoretical measures allow the model to optimally learn from all information in the observations. In principle, the purpose of a model should not influence its calibration. The likelihood principle states that all information that data contain about a certain model is contained in the likelihood of the data given that model. When the observations are certain, the divergence score has a direct relation to the log-likelihood.
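A minimal numerical sketch of this relation, for binary events with certain observations and hypothetical forecast probabilities, is the following. For a certain (one-hot) observation, the divergence score reduces to the mean number of bits the forecasts leave unexplained, which is exactly the negative mean log2-likelihood; the Brier score of section 8.2.3 is shown alongside for comparison:

    import numpy as np

    f = np.array([0.9, 0.6, 0.2, 0.8, 0.1])  # hypothetical forecast probabilities of the event
    o = np.array([1,   0,   0,   1,   1  ])  # what actually happened (1 = event, 0 = no event)

    p_out = np.where(o == 1, f, 1 - f)       # probability the forecast assigned to the outcome

    ds = np.mean(-np.log2(p_out))            # divergence score: mean KL divergence from the
                                             # certain observation to the forecast, in bits
    nll = -np.mean(np.log2(p_out))           # negative mean log2-likelihood: identical to ds
    bs = np.mean((f - o) ** 2)               # Brier score, its second order approximation

    print(ds, nll, bs)
    # ds and nll are the same number; unlike the Brier score, ds grows without bound
    # as a forecast approaches certainty about an outcome that does not occur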
The divergence score also represents the average number of bits per value that are needed to store the data, using an optimal compression algorithm such as arithmetic coding. At a fundamental level, the quality of a model can be seen as how much it compresses the data. In algorithmic information theory, the principle of parsimony can be formalized as the description length of a model or the size of the decompression algorithm. Algorithmic information theory measures are in principle incomputable, but practically computable approximations exist in the form of commonly used model complexity control methods. Chapter 6 contains some pioneering practical explorations of the compressibility of hydrological time series using general purpose data compression algorithms for computer files. Further developments in this direction may be useful to estimate the amount of information extractable from hydrological time series; to compare the performance of hydrological models with general purpose pattern recognition; and to provide a basis for model complexity control.

8.2.5 On optimization of reservoir release policies

Estimation of the future value of water and optimal decisions about water allocation are two sides of the same coin. Real time operation of a hydropower reservoir can be viewed as an allocation problem in time. Interesting economic information can be derived from the results of an optimization of the release policy using stochastic dynamic programming. The information-theoretical analysis and thought experiments in chapter 7 suggest that for optimal reservoir operation, a model of future information growth is necessary. Reinforcement learning is suggested as a practical approach to this problem. Furthermore, it is suggested that the discrepancy between the information in the state of common SDP formulations and the information actually available for operations may need to be accounted for in negotiations based on multi-objective SDP results.

8.3 Limitations and recommendations for further research

8.3.1 Going from discrete to continuous

The divergence score, presented in chapter 5, could be extended to forecasts of continuous variables by applying the Kullback-Leibler divergence to probability densities rather than probability masses, but the interpretation becomes more difficult and needs some further thought. The decomposition of the divergence score is limited to discrete predictands. One reason for this is that the uncertainty component, entropy, lacks a convenient extension to the continuous case, while relative entropy does have a well-behaved continuous counterpart. This can be understood by thinking about the remaining uncertainty about a real number. Unless a real number is precisely (e.g. theoretically) known, this remaining uncertainty can be thought of as infinite, because an infinite amount of information (e.g. digits) is needed to specify the number exactly. In practice, the way we deal with real numbers is usually also to discretize them. This limited precision, given by measurement equipment and the computer implementation of models, is usually so high that to do meaningful information-theoretical analysis, we need to discretize further or have vast amounts of data at our disposal. Strangely enough, this means that we have to throw away information in order to get useful estimates of information and information flows.
In many cases, e.g. for ensembles and binary forecasts, the discretization has already been made, but for practical application in an inherently continuous setting, this issue has to be investigated further.

8.3.2 Expressing prior information

In many cases, hydrology and water management require predictions about complex systems. This leads to high data requirements. Because the available data are usually limited, meaningful prediction relies heavily on prior information in the form of assumed model structures and prior parameter distributions. This information does not come out of thin air, but usually it is not rigorously justified either. As Jaynes (2003) remarked, the process of translating various forms of information into probability distributions should represent fully half of probability theory, but so far it has received little attention. The principle of maximum entropy is one of the few principles that can be used for this, but its scope is very limited. This restricts the applicability of, for example, the minimum relative entropy update in chapter 4 to cases where the available information can be expressed as constraints. For physical process knowledge, the current practice is usually to convert some rather vague notion of processes into a deterministic model structure, leading to false certainty. Until more principles are found for expressing various forms of incomplete information consistently, no good alternatives exist for this practice except using multiple deterministic model structures simultaneously.

8.3.3 Merging information theory with statistical thermodynamics

Water system operation needs predictions from hydrology, and hydrology tries to predict the behavior of complex systems. These systems are driven by fluxes of energy and water, which in turn are driven by the chaotic dynamics of the atmosphere. In all the processes that take place, low-entropy energy that ultimately comes from solar radiation is continuously dissipated. Taking an integrated view on entropy production, entropy import and export, hypotheses about maximum entropy production have been formulated (see Kleidon (2004); Kleidon and Schymanski (2008); Kleidon (2010)). Along similar lines, principles of maximum energy dissipation in hydrological systems have been postulated by Zehe et al. (2010) to explain preferential flow. Certain biological optimality principles, see e.g. Schymanski et al. (2009), also seem to explain certain phenomena quite well. In the generalized maximum entropy framework of Jaynes (1957), all these forms of optimality should be translatable to high probability given limited information on the macroscopic scale. When the maximum entropy distribution given some macroscopic constraint does not match the observed data, this is an indication that there are other constraints to be discovered, which have a relation to the sufficient statistics for the apparent distribution of the data. Furthermore, information-theoretical investigations of non-equilibrium thermodynamics, such as the pioneering work of Dewar (2003), are likely to yield important insights into the behavior of complex hydrological systems.

8.3.4 An integrated information-theoretical framework from observation to decision

This thesis, which concerned topics in forecasting, model inference and optimal decisions under uncertainty, can be summarized as pursuing optimal information extraction from the given input data to support decisions.
An interesting new dimension to this problem arises when the input data for the forecasting models are not fixed. This is the case when there is an opportunity to install new measuring equipment to extract information from the environment, to improve predictions and ultimately decisions. The hydrological literature describes a number of information-theoretical approaches to optimal monitoring network design; see e.g. Alfonso et al. (2010). By considering these methods in conjunction with the approaches presented in this thesis, which concern forecast evaluation, model inference, and Bayesian decision theory, the entire flow of information from observation to decision can be analyzed and optimized for risk reduction. Possible applications could be assessing the value of proposed monitoring networks for decisions or designing monitoring networks for a specific decision problem.

8.3.5 Problem solved, but the solution is a problem

In this thesis we have seen that a large part of science and engineering can be framed in terms of information, probability and decisions. Some of the deepest theories about algorithmic information seem to suggest that inductive inference could in principle be completely formalized and automated, but an optimal method of inference is necessarily incomputable. Different computable approximations of optimal induction exist that represent a minimum of prior information and can thus be used to study artificial intelligence. Note that one of the latest developments in artificial intelligence is AIXI (universal artificial intelligence; Hutter (2004); Legg (2008)), which merges Solomonoff induction (advocated in chapter 6) with Bayesian decision theory and reinforcement learning (advocated in chapter 7). In this way it defines a learning agent that makes optimal risk based decisions, given the information that is received from previous interactions with the a priori unknown environment and unlimited computational resources. Time and memory bounded versions, which are conditionally optimal, also exist. In principle it can thus be regarded as the ultimate gold standard solution to risk based water system operation. The only way in which this method can be improved is by making use of prior knowledge about the environment and the control problem. In most actual problems in science and engineering there is considerable prior information available. The main challenge in the field can thus be formulated as merging human experience, sensory capabilities and pattern recognition capabilities, which have evolved over 4 x 10^9 years of interacting with the environment and evolutionary computation (we can view natural selection as adding information from the environment), with theoretically optimal decisions for a naive agent that has limited experience in interaction with the environment and relatively little computational power, but does not get bored analyzing vast amounts of data. Becoming good water managers was essential for our intelligence to maintain its own hardware (i.e. survive). Maybe one day we will be replaced by silicon-managing robots, but until that time we must operate our water systems rationally, in a risk-based fashion. Once again we have ventured way outside the scope of this thesis, so it is time to stop adding information to it.
Although raising more questions than answering them, hopefully this thesis sparked some interest in pursuing the application of information theory and artificial intelligence in water resources research... References Abebe, A. J. and Solomatine, D. P. (1998). Application of global optimization to the design of pipe networks. In Proc. Int. Conf. Hydroinformatics, volume 98, pages 989– 995. Ahrens, B. and Walser, A. (2008). Information-based skill scores for probabilistic forecasts. Monthly Weather Review, 136(1):352–363. Akaike, H. (1974). A new look at the statistical model identification. IEEE transactions on automatic control, 19(6):716–723. Alfonso, L., Lobbrecht, A., and Price, R. (2010). Information theory–based approach for location of monitoring water level gauges in polders. Water Resources Research, 46(3):W03528. Alfonso Segura, J. L. (2010). Optimisation Of Monitoring Networks For Water Systems. CRC/ Balkema press. PhD Thesis, UNESCO-IHE. Amorocho, J. and Espildora, B. (1973). Entropy in the assessment of uncertainty in hydrologic systems and models. Water Resources Research, 9(6):1511–1522. Applebaum, D. (1996). Probability and information: An integrated approach. Cambridge University Press. Ariely, D. (2008). Predictably irrational: The hidden forces that shape our decisions. Harper, New York. Avellaneda, M., Bu, R., Friedman, C., Grandchamp, N., Kruk, L., and Newman, J. (2001). Weighted monte carlo: A new technique for calibrating asset-pricing models. intern. J. of Theor. and Appl. Finance, 4(1):91–119. Bárdossy, A. (2007). Calibration of hydrological model parameters for ungauged catchments. Hydrology and Earth System Sciences, 11(2):703–710. Bárdossy, A. and Das, T. (2008). Influence of rainfall observation network on model calibration and application. Hydrology and Earth System Sciences, 12(1):77–89. Barjas Blanco, T., Willems, P., Chiang, P., Haverbeke, N., Berlamont, J., and De Moor, B. (2010). Flood regulation using nonlinear model predictive control. Control Engineering Practice, 18(10):1147–1157. Barnston, A. G., van den Dool, H. M., Rodenhuis, D. R., Ropelewski, C. R., Kousky, V. E., O’Lenic, E. A., Livezey, R. E., Zebiak, S. E., Cane, M. A., Barnett, T. P., et al. (1994). Long-lead seasonal forecasts–where do we stand? Bulletin of the American Meteorological Society, 75(11):2097–2114. Bellman, R. (1952). The theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716–719. Benedetti, R. (2010). Scoring Rules for Forecast Verification. Monthly Weather Review, 138(1):203–211. 185 186 References Berger, J. O. and Wolpert, R. L. (1988). The likelihood principle. Institute of Mathematical Statistics, Hayward, CA, 2nd edition. Bernardo, J. M. (1979). Expected information as expected utility. The Annals of Statistics, 7(3):686–690. Bertsekas, D. (2005). Dynamic programming and suboptimal control: A survey from ADP to MPC. European Journal of Control, 11(4-5):310–334. Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, Boston. Bhattacharya, B., Lobbrecht, A. H., and Solomatine, D. P. (2003). Neural networks and reinforcement learning in control of water systems. Journal of Water Resources Planning and Management, 129:458–465. Boolos, G., Burgess, J., and Jeffrey, R. (2007). Computability and logic. Cambridge University Press. Botev, Z. I. (2006). A novel nonparametric density estimator. Postgraduate Seminar Series, The University of Queensland, Australia. 
Brest, J., Greiner, S., Bos̆ković, B., Mernik, M., and Z̆umer, V. (2006). Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems. IEEE Trans. Evol. Comput., 10(6):646–657. Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3. Briggs, W., Pocernich, M., and Ruppert, D. (2005). Incorporating misclassification error in skill assessment. Monthly Weather Review, 133(11):3382–3392. Bröcker, J. (2009). Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519. Bröcker, J. and Smith, L. A. (2007). Scoring probabilistic forecasts: The importance of being proper. Weather and Forecasting, 22(2):382–388. Burnham, K. and Anderson, D. (2002). Model selection and multimodel inference: A practical information-theoretic approach. Springer Verlag. Burrows, M. and Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm. Technical report, Systems Research Center, Palo Alto, CA. Camacho, E. F. and Bordons, C. (1999). Model predictive control. Advanced textbooks in control and signal processing. Springer, London. Carnap, R. (1950). Logical foundations of probability. Castelletti, A., Galelli, S., Restelli, M., and Soncini-Sessa, R. (2010). Tree-based reinforcement learning for optimal water reservoir operation. Water Resources Research, 46(9):W09507. Chaitin, G. J. (1966). On the length of programs for computing finite binary sequences. Journal of the ACM (JACM), 13(4):547–569. Chiu, C.-L. (1988). Entropy and 2-d velocity distribution in open channels. Journal of Hydraulic Engineering, 114(7):738–756. Chomsky, N. (1956). Three models for the description of language. Information Theory, IRE Transactions on, 2(3):113 –124. Cloke, H. L. and Pappenberger, F. (2009). Ensemble flood forecasting: A review. Journal of Hydrology, 375(3-4):613–626. References 187 Cover, T. M. and Thomas, J. A. (2006). Elements of information theory. WileyInterscience, New York. Croley, T. E. (1996). Using noaa’s new climate outlooks in operational hydrology. Journal of Hydrologic Engineering, 1(3):93–102. Croley, T. E. (1997). Mixing probabilistic meteorology outlooks in operational hydrology. Journal of Hydrologic Engineering, 2:161–168. Croley, T. E. (2001). Climate-biased storm-frequency estimation. Journal of Hydrologic Engineering, 6(4):275–283. Croley, T. E. (2003). Weighted-climate parametric hydrologic forecasting. Journal of Hydrologic Engineering, 8:171–180. Dandy, G. C., Simpson, A. R., and Murphy, L. J. (1996). An improved genetic algorithm for pipe network optimization. Water Resources Research, 32(2):449–458. Day, G. N. (1985). Extended streamflow forecasting using NWSRFS. Journal of Water Resources Planning and Management, 111(2):157–170. Deutsch, D. (1985). Quantum theory, the Church-Turing principle and the universal quantum computer. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 400(1818):97–117. Deutsch, D. (1998). The fabric of reality. Penguin Books London. Dewar, R. (2003). Information theory explanation of the fluctuation theorem, maximum entropy production and self-organized criticality in non-equilibrium stationary states. Journal of Physics A Mathematical and General, 36(3):631–641. Diaconis, P., Holmes, S., and Montgomery, R. (2007). Dynamical bias in the coin toss. SIAM review, 49(2):211. Dooge, J. (1997). Searching for simplicity in hydrology. 
Surveys in Geophysics, 18(5):511– 534. Dorigo, M. and Stützle, T. (2004). Ant Colony Optimization. MIT Press. Duan, Q., Ajami, N. K., Gao, X., and Sorooshian, S. (2007). Multi-model ensemble hydrologic prediction using bayesian model averaging. Advances in Water Resources, 30(5):1371–1386. Duan, Q., Gupta, V. K., and Sorooshian, S. (1992). Effective and efficient global optimization for conceptual rainfall–runoff models. Water Resources Research, 28:1015–1031. Epstein, E. S. (1969). A scoring system for probability forecasts of ranked categories. Journal of Applied Meteorology, 8(6):985–987. Faber, B. A. and Stedinger, J. R. (2001). Reservoir optimization using sampling sdp with ensemble streamflow prediction (esp) forecasts. Journal of Hydrology, 249(1-4):113–133. Fenicia, F., Savenije, H., and Hoffmann, L. (2010). An approach for matching accuracy and predictive capability in hydrological model development. IAHS-AISH publication, pages 91–99. Fenicia, F., Savenije, H. H. G., Matgen, P., and Pfister, L. (2008). Understanding catchment behavior through stepwise model concept improvement. Water Resources Research, 44:W01402. Feynman, R. (1965). The character of physical law. MIT Pr. Fiorentino, M., Claps, P., and Singh, V. P. (1993). An entropy-based morphological analysis of river basin networks. Water Resources Research, 29(4):1215–1224. 188 References Foufoula-Georgiou, E. and Kitanidis, P. K. (1988). Gradient dynamic programming for stochastic optimal control of multidimensional water resources systems. Water Resources Research, 24(8):1345–1359. Georgakakos, K. P. and Krzysztofowicz, R. (2001). Special issue: Probabilistic and ensemble forecasting. Journal of Hydrology, 249(1-4). Global Water Partnership (2000). Integrated Water Resources Management. Global Water Partnership, Technical Advisory Comittee. Gneiting, T., Raftery, A., Westveld, A., and Goldman, T. (2005). Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review, 133(5):1098–1118. Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378. Gödel, K. (1931). Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik, 38(1):173–198. Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society. Series B (Methodological), 14(1):107–114. Grandy Jr, W. T. (2008). Entropy and the Time Evolution of Macroscopic Systems. Oxford University Press, New York. Grünwald, P. D. (2007). The minimum description length principle. The MIT Press. Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F. (2009). Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. Journal of Hydrology, 377(1-2):80–91. Gupta, H. V., Sorooshian, S., and Yapo, P. O. (1998). Toward improved calibration of hydrologic models: Multiple and noncommensurable measures of information. Water Resources Research, 34(4):751–763. Gupta, H. V., Wagener, T., and Liu, Y. (2008). Reconciling theory with observations: elements of a diagnostic approach to model evaluation. Hydrological Processes, 22(18):3802–3813. Hall, W. A. and Buras, N. (1961). The Dynamic Programming Approach to WaterResources Development. Journal of Geophysical Research, 66(2):517–520. Hamlet, A. F., Huppert, D., and Lettenmaier, D. P. (2002). 
Economic value of longlead streamflow forecasts for columbia river hydropower. Journal of Water Resources Planning and Management, 128:91. Hamlet, A. F. and Lettenmaier, D. P. (1999). Columbia River streamflow forecasting based on ENSO and PDO climate signals. Journal of Water Resources Planning and Management, 125(6):333–341. Harmancioglu, N. B., Alpaslan, N., and Singh, V. P. (1992a). Application of the entropy concept in design of water quality monitoring networks. In Singh, V. and Fiorentino, M., editors, Entropy and Energy Dissipation in Water Resources, pages 283–302. Kluwer Academic Publishers, Dordrecht. Harmancioglu, N. B., Singh, V. P., and Alpaslan, N. (1992b). Versatile uses of the entropy concept in water resources. In Singh, V. and Fiorentino, M., editors, Entropy and Energy Dissipation in Water Resources, pages 91–117. Kluwer Academic Publishers, Dordrecht. References 189 Huang, W. and Hsieh, C. (2010). Real-time reservoir flood operation during typhoon attacks. Water Resources Research, 46(7):W07528. Huffman, D. A. (1952). A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, 40(9):1098–1101. Hundecha, Y., Ouarda, T. B. M. J., and Bárdossy, A. (2008). Regional estimation of parameters of a rainfall-runoff model at ungauged watersheds using the “spatial” structures of the parameters within a canonical physiographic-climatic space. Water Resources Research, 44(1):W01427. Hurst, H. (1951). Long-term storage capacity of reservoirs. Transactions of the American Society of Civil Engineers, 116:770–808. Hutter, M. (2004). Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin. 300 pages, http://www.idsia.ch/∼ marcus/ai/uaibook.htm. Hutter, M. (2010). A complete theory of everything (will be subjective). Algorithms, 3(4):329–350. Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4):620–630. Jaynes, E. T. (2003). Probability theory: the logic of science. Cambridge University Press, Cambridge, UK. Johnson, S. A., Stedinger, J. R., Shoemaker, C., Li, Y., and Tejada-Guibert, J. A. (1993). Numerical solution of continuous-state dynamic programs using linear and spline interpolation. Operations Research, 41(3):484–500. Jolliffe, I. T. and Stephenson, D. B. (2003). Forecast verification: a practitioner’s guide in atmospheric science. Wiley, Chichester, UK. Jolliffe, I. T. and Stephenson, D. B. (2008). Proper scores for probability forecasts can never be equitable. Monthly Weather Review, 136(4):1505–1510. Jose, V. R. R., Nau, R. F., and Winkler, R. L. (2008). Scoring rules, generalized entropy, and utility maximization. Operations Research, 56(5):1146. Karamouz, M. and Vasiliadis, H. V. (1992). Bayesian stochastic optimization of reservoir operation using uncertain forecasts. Water Resources Research, 28(5):1221–1232. Karamouz, M., Zahraie, B., and Araghinejad, S. (2005). Decision support system for monthly operation of hydropower reservoirs: A case study. Journal of Computing in Civil Engineering, 19(2):194–207. J. Comput. Civ. Eng. Kavetski, D., Kuczera, G., and Franks, S. W. (2006). Bayesian analysis of input uncertainty in hydrological modeling: 1. Theory. Water Resources Research, 42(3):W03407. Kelly, J. (1956). A new interpretation of information rate. Information Theory, IEEE Transactions on, 2(3):185–189. Kelman, J., Stedinger, J. R., Cooper, L. A., Hsu, E., and Yuan, S. Q. (1990). Sampling stochastic dynamic programming applied to reservoir operation. 
Water Resources Research, 26(3):447–454. Kim, Y.-O. and Palmer, R. N. (1997). Value of seasonal flow forecasts in bayesian stochastic programming. Journal of Water Resources Planning and Management, 123(6):327– 335. King, R. D., Rowland, J., Oliver, S. G., Young, M., Aubrey, W., Byrne, E., Liakata, 190 References M., Markham, M., Pir, P., Soldatova, L. N., et al. (2009). The automation of science. Science, 324(5923):85. Kleeman, R. (2002). Measuring dynamical prediction utility using relative entropy. Journal of the Atmospheric Sciences, 59(13):2057–2072. Kleidon, A. (2004). Beyond gaia: Thermodynamics of life and earth system functioning. Climatic Change, 66(3):271–319. Kleidon, A. (2010). Non-equilibrium thermodynamics, maximum entropy production and Earth-system evolution. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 368(1910):181. Kleidon, A. and Schymanski, S. (2008). Thermodynamics and optimality of the water budget on land: A review. Geophysical Research Letters, 35(20):L20404. Klemeš, V. (1977). Discrete representation of storage for stochastic reservoir optimization. Water Resources Research, 13(1):149–158. Kolmogorov, A. N. (1968). Three approaches to the quantitative definition of information. International Journal of Computer Mathematics, 2(1):157–168. Koutsoyiannis, D. (2005a). Uncertainty, entropy, scaling and hydrological statistics. 1. Marginal distributional properties of hydrological processes and state scaling. Hydrological Sciences Journal, 50(3):381–404. Koutsoyiannis, D. (2005b). Uncertainty, entropy, scaling and hydrological stochastics. 2. Time dependence of hydrological processes and time scaling. Hydrological Sciences Journal, 50(3):1–426. Koutsoyiannis, D. (2009). Seeking parsimony in hydrology and water resources technology. Geophysical Research Abstracts, 11:EGU2009–11469. Koutsoyiannis, D. (2010). HESS Opinions “A random walk on water”. Hydrology and Earth System Sciences, 14:585–601. Koutsoyiannis, D. (2011). Hurst-Kolmogorov dynamics as a result of extremal entropy production. Physica A: Statistical Mechanics and its Applications, 390(8):1424–1432. Koutsoyiannis, D. and Economou, A. (2003). Evaluation of the parameterizationsimulation-optimization approach for the control of reservoir systems. Water Resources Research, 39(6):1170. Kraft, L. G. (1949). A device for quantizing, grouping, and coding amplitude-modulated pulses. Master’s thesis, Massachusetts Institute of Technology. Dept. of Electrical Engineering. Krstanovic, P. F. and Singh, V. P. (1992a). Evaluation of rainfall networks using entropy: I. Theoretical development. Water Resources Management, 6(4):279–293. Krstanovic, P. F. and Singh, V. P. (1992b). Evaluation of rainfall networks using entropy: II. Application. Water Resources Management, 6(4):295–314. Krzysztofowicz, R. (1999). Bayesian theory of probabilistic forecasting via deterministic hydrologic model. Water Resources Research, 35(9):2739–2750. Krzysztofowicz, R. (2001). The case for probabilistic forecasting in hydrology. Journal of Hydrology, 249(1):2–9. Kullback, S. (1997). Information theory and statistics. Dover Pubns. Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86. References 191 Kwakernaak, H. and Sivan, R. (1972). Linear optimal control systems. Wiley-Interscience New York. Labadie, J. W. (2004). Optimal operation of multireservoir systems: State-of-the-art review. 
Journal of Water Resources Planning and Management, 130(2):93–111. Laio, F. and Tamea, S. (2007). Verification tools for probabilistic forecasts of continuous hydrological variables. Hydrology and Earth System Sciences, 11(4):1267–1277. Landauer, R. (1961). Irreversibility and heat generation in the computing process. IBM Journal of Research and Development, 5:183–191. Lee, J. H. and Labadie, J. W. (2007). Stochastic optimization of multireservoir systems via reinforcement learning. Water Resources Research, 43:W11408. Legg, S. (2008). Machine Super Intelligence. PhD thesis, Faculty of Informatics of the University of Lugano. Lehning, M., Dawes, N., Bavay, M., Parlange, M., Nath, S., and Zhao, F. (2009). Instrumenting the earth: Next-generation sensor networks and environmental science. In The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft research. Leung, L. Y. and North, G. R. (1990). Information theory and climate prediction. Journal of Climate, 3(1):5–14. Li, M. and Vitanyi, P. M. B. (2008). An introduction to Kolmogorov complexity and its applications. Springer-Verlag New York Inc. Lindley, D. (2008). Uncertainty: Einstein, Heisenberg, Bohr, and the struggle for the soul of science. Anchor Books, Garden City, N.Y. Ljung, L. (1987). System identification: theory for the user. Prentice-Hall, Englewood Cliffs, NJ. Lobbrecht, A. H., Sinke, M. D., and Bouma, S. B. (1999). Dynamic control of the delfland polders and storage basin, the netherlands. Water Science and Technology, 39(4):269– 279. Lorenz, E. (1963). Deterministic Nonperiodic Flow. Journal of the Atmospheric Sciences, 20(2):130–141. Loucks, D. P. and van Beek, E. (2005). Water resources systems planning and management an Introduction to methods, models and applications. Unesco, Paris. Martin, G. N. N. (1979). Range encoding: an algorithm for removing redundancy from a digitised message. In Video & Data Recording conference. Mason, S. J. (2008). Understanding forecast verification statistics. Meteorological Applications, 15(1):31–40. McMillan, B. (1956). Two inequalities implied by unique decipherability. IEEE Transactions on Information Theory, 2(4):115–116. Merabtene, T., Kawamura, A., Jinno, K., and Olsson, J. (2002). Risk assessment for optimal drought management of an integrated water resources system using a genetic algorithm. Hydrological Processes, 16(11):2189–2208. Murphy, A. H. (1970). The ranked probability score and the probability score: a comparison. Monthly Weather Review, 98(12):917–924. Murphy, A. H. (1971). A note on the ranked probability score. Journal of Applied Meteorology, 10(1):155–156. 192 References Murphy, A. H. (1973). A new vector partition of the probability score. Journal of Applied Meteorology, 12(4):595–600. Murphy, A. H. (1977). The value of climatological, categorical and probabilistic forecasts in the cost-loss ratio situation. Monthly Weather Review, 105(7):803–816. Murphy, A. H. (1993). What is a good forecast? an essay on the nature of goodness in weather forecasting. Weather and Forecasting, 8(2):281–293. Murphy, A. H. and Winkler, R. L. (1987). A general framework for forecast verification. Monthly Weather Review, 115(7):1330–1338. Namias, J. (1969). Seasonal interactions between the north pacific ocean and the atmosphere during the 1960’s. Monthly Weather Review, 97(3):173–192. Nash, J. E. and Sutcliffe, J. V. (1970). River flow forecasting through conceptual models; Part I – a discussion of principles. Journal of Hydrology, 10:282–290. Negenborn, R. 
R., De Schutter, B., Wiering, M. A., and Hellendoorn, H. (2005). Learningbased model predictive control for markov decision processes. In P. Horacek, M. S. and Zitek, P., editors, 16th IFAC World Congress, Prague, Czech Republic. O’Kane, J. P. and Flynn, D. (2007). Thresholds, switches and hysteresis in hydrology from the pedon to the catchment scale: a non-linear systems theory. Hydrology and Earth System Sciences, 11(1):443–459. Peirolo, R. (2010). Information gain as a score for probabilistic forecasts. Meteorological Applications, in print. Pereira, M. V. F. and Pinto, L. (1991). Multi-stage stochastic optimization applied to energy planning. Mathematical Programming, 52(1):359–375. Peterson, M. B. (2009). An introduction to decision theory. Cambridge University Press, Cambridge, UK. Philbrick, C. R. and Kitanidis, P. K. (1999). Limitations of deterministic optimization applied to reservoir operations. Journal of Water Resources Planning and Management, 125(3):135–142. Pianosi, F. (2008). Novel methods for water reservoirs management. PhD thesis, Politechnico di Milano. Pianosi, F. and Ravazzani, G. (2010). Assessing rainfall-runoff models for the management of lake verbano. Hydrological Processes, 24(22):3195–3205. Pianosi, F. and Soncini-Sessa, R. (2009). Real-time management of a multipurpose water reservoir with a heteroscedastic inflow model. Water Resources Research, 45(10):W10430. Piechota, T. C., Chiew, F. H. S., Dracup, J. A., and McMahon, T. A. (1998). Seasonal streamflow forecasting in Eastern Australia and the El Nino–southern oscillation. Water Resources Research, 34(11):3035–3044. Popper, K. R. (1959). The propensity interpretation of probability. The British journal for the philosophy of science, 10(37):25. Popper, K. R. (1968). The logic of scientific discovery. Taylor & Francis e-Library, second edition. Raso, L., Schwanenberg, D., van der Giesen, N., and van Overloop, P. (2010). TreeScenario Based Model Predictive Control. Geophysical Research Abstracts, 12:3178. Rauch, W. and Harremoës, P. (1999). Genetic algorithms in real time control applied References 193 to minimize transient pollution from urban wastewater systems. Water Research, 33(5):1265 – 1277. Rissanen, J. (2007). Information and complexity in statistical modeling. Springer Verlag. Rissanen, J. and Langdon, G. G. (1979). Arithmetic coding. IBM Journal of Research and Development, 23(2):149–162. Robert, C. P. (2007). The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Verlag, New York. Rodriguez-Iturbe, I. and Rinaldo, A. (2001). Fractal River Basins; chance and selforganization. Cambridge University Press. Roulston, M. S. and Smith, L. A. (2002). Evaluating probabilistic forecasts using information theory. Monthly Weather Review, 130(6):1653–1660. Savenije, G. (2009). HESS Opinions: ’The art of hydrology’. Hydrology and Earth System Sciences, 13(2):157–161. Savic, D. A. and Walters, G. A. (1997). Genetic algorithms for least-cost design of water distribution networks. Journal of Water Resources Planning and Management, 123(2):67–77. Schaefli, B. and Gupta, H. V. (2007). Do Nash values have value? Hydrological Processes, 21(15):2075–2080. Schmidt, M. and Lipson, H. (2009). Distilling free-form natural laws from experimental data. science, 324(5923):81. Schoups, G., van de Giesen, N. C., and Savenije, H. H. G. (2008). Model complexity control for hydrologic prediction. Water Resources Research, 44:W00B03. Schoups, G. and Vrugt, J. A. (2010). 
A formal likelihood function for parameter and predictive inference of hydrologic models with correlated, heteroscedastic, and nongaussian errors. Water Resources Research, 46:W10531. Schoups, G., Vrugt, J. A., Fenicia, F., and van de Giesen, N. C. (2010). Corruption of accuracy and efficiency of markov chain monte carlo simulation by inaccurate numerical implementation of conceptual hydrologic models. Water Resources Research, 46:W10530. Schuurmans, W., Leeuwen, P. E. R. M. v., and Kruiningen, F. E. v. (2002). Automation of the rijnland storage basin, the netherlands. Lowland Technology International, 4(No. 1):13–20. Schwarz, G. (1978). Estimating the dimension of a model. The annals of statistics, 6(2):461–464. Schymanski, S. J., Sivapalan, M., Roderick, M. L., Hutley, L. B., and Beringer, J. (2009). An optimality-based model of the dynamic feedbacks between natural vegetation and the water balance. Water Resources Research, 45(1):W01412. Selten, R. (1998). Axiomatic characterization of the quadratic scoring rule. Experimental Economics, 1(1):43–62. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical J., 27(3):379–423. Shannon, C. E. (1956). A universal Turing machine with two internal states. Automata studies, pages 129–153. Sharma, A. (2000). Seasonal to interannual rainfall probabilistic forecasts for improved 194 References water supply management: Part 3–a nonparametric probabilistic forecast model. Journal of Hydrology, 239(1-4):249–258. Shoemaker, C. A., Regis, R. G., and Fleming, R. C. (2007). Watershed calibration using multistart local optimization and evolutionary optimization with radial basis function approximation. Hydrological Sciences Journal, 52(3):450–465. Singh, K. and Singh, V. P. (1991). Derivation of bivariate probability density functions with exponential marginals. Stochastic Hydrology and Hydraulics, 5(1):55–68. Singh, V. P. (1997). The use of entropy in hydrology and water resources. Hydrological Processes, 11(6):587–626. Singh, V. P. and Guo, H. (1995a). Parameter estimation for 3-parameter generalized pareto distribution by the principle of maximum entropy (POME). Hydrological Sciences Journal, 40(2):165–181. Singh, V. P. and Guo, H. (1995b). Parameter estimations for 2-parameter pareto distribution by pome. Water Resources Management, 9(2):81–93. Singh, V. P. and Guo, H. (1997). Parameter estimation for 2-parameter generalized pareto distribution by pome. Stochastic Hydrology and Hydraulics, 11(3):211–227. Singh, V. P., Guo, H., and Yu, F. X. (1993). Parameter estimation for 3-parameter loglogistic distribution (LLD3) by pome. Stochastic Hydrology and Hydraulics, 7(3):163– 177. Singh, V. P. and Rajagopal, A. K. (1987). Some recent advances in application of the principle of maximum entropy (pome) in hydrology. IAHS, 194:353–364. Singh, V. P. and Singh, K. (1985). Derivation of the pearson type (PT) III distribution by using the principle of maximum entropy (POME). Journal of hydrology, 80(3-4):197– 214. Solomatine, D. P. (1999). Random search methods in model calibration and pipe network design. Water Industry Systems: Modelling and Optimization Applications, pages 317– 332. Solomatine, D. P. and Ostfeld, A. (2008). Data-driven modelling: some past experiences and new approaches. Journal of Hydroinformatics, 10(1):3–22. Solomonoff, R. J. (1964). A formal theory of inductive inference. Part I. Information and control, 7(1):1–22. Solomonoff, R. J. (1978). Complexity-based induction systems: comparisons and convergence theorems. 
Information Theory, IEEE Transactions on, 24(4):422–432. Soncini-Sessa, R., Castelletti, A., and Weber, E. (2007). Integrated and participatory water resources management: theory. Elsevier Science Ltd. Sonuga, J. O. (1972). Principle of maximum entropy in hydrologic frequency analysis. Journal of Hydrology, 17(3):177 – 191. Stedinger, J. and Kim, Y. O. (2002). Updating ensemble probabilities based on climate forecasts. Proc., Water Resources Planning and Management (19-22 May, Roanoke, Virginia)(CD-Rom), Environmental and Water Resources Institute, American Society of Civil Engineers, Reston, VA. Stedinger, J. R. and Kim, Y. (2007). Adjusting ensemble forecast probabilities to reflect several climate forecasts. IAHS PUBLICATION, 313:188. References 195 Stedinger, J. R. and Kim, Y.-O. (2010). Probabilities for ensemble forecasts reflecting climate information. Journal of Hydrology, 391(1–2):9–23. Stedinger, J. R., Sule, B. F., and Loucks, D. P. (1984). Stochastic dynamic programming models for reservoir operation optimization. Water Resources Research, 20(11). Stephenson, D. B., Coelho, C. A. S., and Jolliffe, I. T. (2008). Two extra components in the brier score decomposition. Weather and Forecasting, 23(4):752–757. Storn, R. and Price, K. (1997). Differential evolution-a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341– 359. Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press. Szilard, L. (1964). On the decrease of entropy in a thermodynamic system by the intervention of intelligent beings. Behavioral Science, 9(4):301–310. translation from original German version (1929). Tejada-Guibert, J. A., Johnson, S. A., and Stedinger, J. R. (1995). The value of hydrologic information in stochastic dynamic programming models of a multireservoir system. Water Resources Research, 31(10):2571–2579. Water Resour. Res. Tilmant, A., Beevers, L., and Muyunda, B. (2010). Restoring a flow regime through the coordinated operation of a multireservoir system-The case of the Zambezi River Basin. Water Resources Research, 46(7):W07533. Tilmant, A. and Kelman, R. (2007). A stochastic approach to analyze trade-offs and risks associated with large-scale water resources systems. Water Resources Research, 43(6):W06425. Tilmant, A., Lettany, J., and Kelman, R. (2007). Hydrological Risk Assessment in the Euphrates-tigris River Basin: A Stochastic Dual Dynamic Programming Approach. Water International, 32(2):294–309. Tilmant, A., Pinte, D., and Goor, Q. (2008). Assessing marginal water values in multipurpose multireservoir systems via stochastic programming. Water Resources Research, 44(12):W12431. Toyabe, S., Sagawa, T., Ueda, M., Muneyuki, E., and Sano, M. (2010). Experimental demonstration of information-to-energy conversion and validation of the generalized Jarzynski equality. Nature Physics, 6:988–992. Trenberth, K. E. (1997). The definition of El Nino. Bulletin of the American Meteorological Society, 78(12):2771–2777. Tribus, M. (1961). Thermostatics and thermodynamics. D. Van Nostrand Company, Inc. Turing, A. M. (1937). On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 2(1):230. van Andel, S. (2009). Anticipatory Water Management: Using ensemble weather forecasts for critical events. CRC Press/Balkema. PhD Thesis, Unesco-IHE. van Andel, S. J., Price, R., Lobbrecht, A., and van Kruiningen, F. (2010). Modeling Controlled Water Systems. 
Journal of Irrigation and Drainage Engineering, 136:392. van Andel, S. J., Price, R. K., Lobbrecht, A. H., van Kruiningen, F., and Mureau, R. (2008). Ensemble Precipitation and Water-Level Forecasts for Anticipatory WaterSystem Control. Journal of Hydrometeorology, 9(4):776–788. 196 References van Overloop, P., Negenborn, R., de Schutter, B., and van de Giesen, N. (2010a). Predictive Control for National Water Flow Optimization in The Netherlands. Intelligent Infrastructures, pages 439–461. van Overloop, P. J. (2006). Model predictive control on open water systems. PhD thesis, TU Delft, Delft. van Overloop, P. J., Negenborn, R. R., Weijs, S. V., Malda, W., Bruggers, M. R., and De Schutter, B. (2010b). Linking water and energy objectives in lowland areas through the application of model predictive control. In Proceedings of the 2010 IEEE Conference on Control Applications, pages 1887–1891, Yokohama, Japan. van Overloop, P. J., Weijs, S., and Dijkstra, S. (2008). Multiple model predictive control on a drainage canal system. Control Engineering Practice, 16(5):531–540. Vapnik, V. N. (1998). Statistical learning theory. John Wiley & Sons, NY, USA. Verlinde, E. (2010). On the Origin of Gravity and the Laws of Newton. arXiv, arXiv:1001.0785v1. Vesterstrøm, J. and Thomsen, R. (2004). A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems. In Proc. IEEE Congr. Evolutionary Computation, pages 1980–1987, Portland, OR, USA. Vrugt, J. A., Gupta, H. V., Bouten, W., and Sorooshian, S. (2003). A Shuffled Complex Evolution Metropolis algorithm for optimization and uncertainty assessment of hydrologic model parameters. Water Resources Research, 39(8):1201. Vrugt, J. A. and Robinson, B. A. (2007). Improved evolutionary optimization from genetically adaptive multimethod search. Proceedings of the National Academy of Sciences, 104(3):708. Vrugt, J. A., Ter Braak, C. J. F., Gupta, H. V., and Robinson, B. A. (2009). Equifinality of formal (DREAM) and informal (GLUE) Bayesian approaches in hydrologic modeling? Stochastic environmental research and risk assessment, 23(7):1011–1026. Wand, M. P. and Jones, M. C. (1993). Comparison of smoothing parameterizations in bivariate kernel density estimation. Journal of the American Statistical Association, 88(422):520–528. Weijs, S. (2009). Interactive comment on "HESS Opinions ‘A random walk on water’ " by D. Koutsoyiannis. Hydrology and Earth System Sciences Discussions, 6:C2733–C2745. Weijs, S., van Leeuwen, E., van Overloop, P. J., and van de Giesen, N. (2007). Effect of uncertainties on the real-time operation of a lowland water system in the netherlands. IAHS PUBLICATION, 313:463. Weijs, S. V. (2004). ’Sturen met onzekere voorspellingen’ (control using uncertain predictions), in Dutch. Master’s thesis, TU Delft. Weijs, S. V. (2007). Information content of weather predictions for flood-control in a dutch lowland water system. In 4th International Symposium on Flood Defense: Managing Flood Risk, Reliability and Vulnerability, Toronto, Ontario, Canada. Weijs, S. V., Schoups, G., and van de Giesen, N. (2010a). Why hydrological predictions should be evaluated using information theory. Hydrology and Earth System Sciences, 14(12):2545–2558. Weijs, S. V. and Van de Giesen, N. (2011). Accounting for observational uncertainty References 197 in forecast verification: an information–theoretical view on forecasts, observations and truth. Monthly Weather Review, early online release. Weijs, S. V. 
and van de Giesen, N. (2011). "zipping" hydrological timeseries: An information-theoretical view on data compression as philosophy of science. Geophysical Research Abstracts, 13:EGU2011–8105. Weijs, S. V., van Leeuwen, P., and van Overloop, P. (2006). The integration of risk analysis in real time flood control. In Cunge, J., Guinot, V., and Liong, S.-Y., editors, 7th International Conference on Hydroinformatics, Nice, France, volume 4, pages 2943– 2950. Weijs, S. V., Van Nooijen, R., and Van de Giesen, N. (2010b). Kullback–Leibler divergence as a forecast skill score with classic reliability–resolution–uncertainty decomposition. Monthly Weather Review, 138(9):3387–3399. Westra, S. and Sharma, A. (2010). An Upper Limit to Seasonal Rainfall Predictability? Journal of Climate, 23(12):3332–3351. Wilks, D. S. (1995). Statistical Methods in the Atmospheric Sciences: An Introduction. Academic Press. Wilks, D. S. (2000). On interpretation of probabilistic climate forecasts. Journal of Climate, 13(11):1965–1971. Wilks, D. S. (2002). Realizations of daily weather in forecast seasonal climate. Journal of Hydrometeorology, 3(2):195–207. Winsemius, H. C., Schaefli, B., Montanari, A., and Savenije, H. H. G. (2009). On the calibration of hydrological models in ungauged basins: A framework for integrating hard and soft hydrological information. Water Resources Research, 45(12):W12422. Wood, A. W., Kumar, A., and Lettenmaier, D. P. (2005). A retrospective assessment of National Centers for Environmental Prediction climate model-based ensemble hydrologic forecasting in the western United States. Journal of Geophysical Research, 110:0148–0227. Wood, A. W. and Lettenmaier, D. P. (2008). An ensemble approach for attribution of hydrologic prediction uncertainty. Geophysical Research Letters, 35(14):1–L14401. Wood, E. F., Lettenmaier, D. P., and Zartarian, V. G. (1992). A land-surface hydrology parameterization with subgrid variability for general circulation models. Journal of Geophysical Research, 97(D3):2717–2728. Xu, M., van Overloop, P. J., van de Giesen, N. C., and Stelling, G. S. (2010). Real-time control of combined surface water quantity and quality: polder flushing. Water Science and Technology, 61(4):869. Yakowitz, S. (1982). Dynamic programming applications in water resources. Water Resources Research, 18(4):673–696. Yeh, W. W.-G. (1985). Reservoir management and operations models: A state-of-the-art review. Water Resources Research, 21(12):1797–1818. Zehe, E., Blume, T., and Blöschl, G. (2010). The principle of ‘maximum energy dissipation’: a novel thermodynamic perspective on rapid water flow in connected soil structures. Philosophical Transactions of the Royal Society B: Biological Sciences, 365(1545):1377. 198 References Zhao, T., Cai, X., and Yang, D. (2011). Effect of streamflow forecast uncertainty on real-time reservoir operation. Advances in Water Resources, In Press, Corrected Proof. Ziv, J. and Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE transactions on Information Theory, 23(3):337–343. Appendix A Equivalence between MRE-update and pdf-ratio solutions for the normal case We try to solve min { q i subject to the constraints n X qi qi log( )} pi i=1 n X qi = 1 i=1 n X qi ≥ 0 qi xi = µ1 i=1 n X qi (xi − µ1 )2 = σ12 i=1 this leads to the Lagrangeans ∀i ∂ ∂qi ( n X qi qi log( ) + λ1 p i i=1 n X i=1 ! qi − 1 + λ2 !) = 0 1 + log qi − log pi + λ1 + λ2 (xi − µ1 )2 + λ3 xi = 0 = log qi n X 2 qi (xi − µ1 ) − σ12 i=1 ! 
Appendix B

The decomposition of the divergence score

First we use the definition of the Kullback–Leibler divergence to define the total score, the resolution and reliability components, and the entropy for the uncertainty component:

$$D_{KL}(v \| w) = \sum_{i=1}^{n} v_i \log \frac{v_i}{w_i}$$

$$DS = \frac{1}{N} \sum_{t=1}^{N} D_{KL}(o_t \| f_t)$$

$$REL = \frac{1}{N} \sum_{k=1}^{K} n_k \, D_{KL}(\bar{o}_k \| \bar{f}_k)$$

$$RES = \frac{1}{N} \sum_{k=1}^{K} n_k \, D_{KL}(\bar{o}_k \| \bar{o})$$

$$UNC = \frac{1}{N} \sum_{t=1}^{N} H(\bar{o}) = -\frac{1}{N} \sum_{t=1}^{N} \sum_{i=1}^{n} [\bar{o}]_i \log [\bar{o}]_i$$

Now we can simplify the expression

$$REL - RES = \frac{1}{N} \sum_{k=1}^{K} n_k \left\{ D_{KL}(\bar{o}_k \| \bar{f}_k) - D_{KL}(\bar{o}_k \| \bar{o}) \right\}
= \frac{1}{N} \sum_{k=1}^{K} n_k \sum_{i=1}^{n} \left\{ [\bar{o}_k]_i \log \frac{[\bar{o}_k]_i}{[\bar{f}_k]_i} - [\bar{o}_k]_i \log \frac{[\bar{o}_k]_i}{[\bar{o}]_i} \right\}
= \frac{1}{N} \sum_{k=1}^{K} n_k \sum_{i=1}^{n} [\bar{o}_k]_i \log \frac{[\bar{o}]_i}{[\bar{f}_k]_i}$$

Note that

$$DS = \frac{1}{N} \sum_{t=1}^{N} \sum_{i=1}^{n} [o_t]_i \log \frac{[o_t]_i}{[f_t]_i}$$

With $K$ equal to the number of different $f_t$ and one bin for each $f_t$, we can label both bins and outcomes by $k$. Labelling the outcomes in a bin by $o_{k,m_k}$ with $m_k = 1 \ldots n_k$, we get

$$DS = \frac{1}{N} \sum_{i=1}^{n} \sum_{k=1}^{K} \sum_{m_k=1}^{n_k} [o_{k,m_k}]_i \log \frac{[o_{k,m_k}]_i}{[\bar{f}_k]_i}$$

which can be written as

$$DS = \frac{1}{N} \sum_{i=1}^{n} \sum_{k=1}^{K} \sum_{m_k=1}^{n_k} [o_{k,m_k}]_i \left\{ \log \frac{[\bar{o}]_i}{[\bar{f}_k]_i} + \log \frac{[o_{k,m_k}]_i}{[\bar{o}]_i} \right\}
= \frac{1}{N} \sum_{i=1}^{n} \sum_{k=1}^{K} \left\{ n_k [\bar{o}_k]_i \log \frac{[\bar{o}]_i}{[\bar{f}_k]_i} + \sum_{m_k=1}^{n_k} [o_{k,m_k}]_i \log \frac{[o_{k,m_k}]_i}{[\bar{o}]_i} \right\}$$

using $\sum_{m_k=1}^{n_k} [o_{k,m_k}]_i = n_k [\bar{o}_k]_i$. We can now recognize the first term as $REL - RES$, so

$$DS - (REL - RES) = \frac{1}{N} \sum_{i=1}^{n} \sum_{k=1}^{K} \sum_{m_k=1}^{n_k} [o_{k,m_k}]_i \log \frac{[o_{k,m_k}]_i}{[\bar{o}]_i}
= \frac{1}{N} \sum_{t=1}^{N} \sum_{i=1}^{n} [o_t]_i \log \frac{[o_t]_i}{[\bar{o}]_i}
= \frac{1}{N} \sum_{t=1}^{N} D_{KL}(o_t \| \bar{o})$$

Note that, with

$$\lim_{x \downarrow 0} x \log x = 0$$

and for $n = 2$, $o_t \in \left\{ (1,0)^T, (0,1)^T \right\}$, we find

$$\sum_{i=1}^{n} [o_t]_i \log \frac{[o_t]_i}{[\bar{o}]_i} = \sum_{i=1}^{n} \left\{ [o_t]_i \log [o_t]_i - [o_t]_i \log [\bar{o}]_i \right\} = - \sum_{i=1}^{n} [o_t]_i \log [\bar{o}]_i$$

so

$$\sum_{t=1}^{N} \sum_{i=1}^{n} [o_t]_i \log \frac{[o_t]_i}{[\bar{o}]_i} = - \sum_{i=1}^{n} \left( \sum_{t=1}^{N} [o_t]_i \right) \log [\bar{o}]_i = - N \sum_{i=1}^{n} [\bar{o}]_i \log [\bar{o}]_i$$

so

$$DS - (REL - RES) = - \sum_{i=1}^{n} [\bar{o}]_i \log [\bar{o}]_i = H(\bar{o}) = UNC$$

which completes the decomposition $DS = REL - RES + UNC$.
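The algebra above can be checked numerically for the binary case ($n = 2$). The sketch below, with synthetic forecasts and outcomes and purely illustrative names, uses one bin per distinct forecast value, as the derivation assumes, and confirms that $DS$ and $REL - RES + UNC$ coincide up to rounding.

```python
# Numerical check of DS = REL - RES + UNC for binary events (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.3, 0.5, 0.7, 0.9])     # the K distinct forecast probabilities f_k
forecasts = rng.choice(probs, size=2000)         # forecast issued at each time step t
outcomes = (rng.random(forecasts.size) < forecasts).astype(float)  # o_t in {0, 1}

def dkl(v, w, eps=1e-12):
    """Kullback-Leibler divergence between binary distributions given as P(event)."""
    v = np.clip(v, eps, 1 - eps)
    w = np.clip(w, eps, 1 - eps)
    return v * np.log(v / w) + (1 - v) * np.log((1 - v) / (1 - w))

N = forecasts.size
obar = outcomes.mean()                           # climatological frequency (o-bar)
DS = dkl(outcomes, forecasts).mean()             # divergence score
REL = RES = 0.0
for fk in probs:                                 # one bin per distinct forecast value
    in_bin = forecasts == fk
    nk, obar_k = in_bin.sum(), outcomes[in_bin].mean()
    REL += nk * dkl(obar_k, fk) / N
    RES += nk * dkl(obar_k, obar) / N
UNC = -(obar * np.log(obar) + (1 - obar) * np.log(1 - obar))   # entropy H(o-bar)
print("DS              :", DS)
print("REL - RES + UNC :", REL - RES + UNC)
```

The check relies on binning by the distinct forecast values, exactly as in the derivation above.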
Appendix C

Relation between divergence score and doubling rate in a horse race

In this example we consider a horse race, a gambler and a bookmaker. Suppose there are $n$ possible events (horses winning). The bookmaker offers odds $r_i$, which means that if horse $i$ wins, the gambler receives $r_i$ times the money he bet on horse $i$. The fraction of the gambler's wealth bet on horse $i$ is denoted by $b_i$. If another horse wins, the gambler loses his stake $b_i$. A bookmaker is said to offer fair odds if

$$\sum_{i=1}^{n} \frac{1}{r_i} = 1 .$$

After one race, the outcome can be described by a vector $o$. Element $o_i$ of this vector is 1 if horse $i$ wins and 0 otherwise. The factor by which the wealth of the gambler has grown after one race is

$$S_t = \sum_{i=1}^{n} o_i b_i r_i .$$

If the gambler reinvests all his money in each new bet, the factor by which his wealth has grown after $T$ bets is

$$S_T = \prod_{t=1}^{T} S_t .$$

The expectation of the logarithm of the growth factor per race is defined as the doubling rate $W$:

$$W = E\{\log_2 S_t\} = \sum_{i=1}^{n} p_i \log_2 (b_i r_i) ,$$

in which $p_i$ is the probability that horse $i$ wins the race. The wealth after $T$ bets then typically grows as

$$S_T \approx 2^{T W} .$$

Kelly (1956) showed that $W$ is maximized by following a proportional betting strategy ($b_i = p_i$). When following this strategy, it is possible to express $W$ as a difference between two Kullback–Leibler divergences (Cover and Thomas, 2006). When the bookmaker offers fair odds, his estimates of the win probabilities can be written as $h_i = 1/r_i$, and

$$W = \sum_{i=1}^{n} p_i \log_2 (b_i r_i) = \sum_{i=1}^{n} p_i \log_2 \left( \frac{p_i}{h_i} \cdot \frac{b_i}{p_i} \right) = D_{KL}(p \| h) - D_{KL}(p \| b) .$$

This means that a gambler can make money only if his probability estimate is better than the bookie's. Distribution $p$ in these divergences is the true distribution. This truth must be conditioned at least on each combination of $h$ and $b$.

Now let us assume that the bookmaker offers fair odds with respect to climatology. Because in this case one of the forecasters (the bookie) always issues the same forecast, $p$ is the distribution of observations $\bar{o}$, conditioned on $b$ only, yielding the conditional distribution of the outcome $\bar{o}_k$. The bookie's estimate $h$ is equal to $\bar{o}$ and $W$ can be written as

$$W = D_{KL}(\bar{o}_k \| \bar{o}) - D_{KL}(\bar{o}_k \| f_k)$$

which can be recognized as the resolution minus the reliability term. $W$ can thus be seen as the information gain towards the truth, compared to climate:

$$W = RES - REL$$

It is also possible to condition the truth further, to arrive at a more general expression for $W$ that is also valid when the bookmaker's estimate differs from climate. In this case the distributions are conditioned on every single forecast, leading to the expression

$$W = D_{KL}(o_t \| h_t) - D_{KL}(o_t \| b_t)$$

which can be recognized as the difference in divergence score between bookie and gambler. A gambler can thus make money with growth rate

$$W = DS_{\text{bookie}} - DS_{\text{gambler}}$$
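A short simulation illustrates the growth-rate interpretation. The win probabilities and the bookmaker's estimate below are made up for the example; the gambler bets proportionally ($b_i = p_i$), so the empirical growth of $\log_2$ wealth per race should approach $W = D_{KL}(p \| h) - D_{KL}(p \| b)$.

```python
# Minimal simulation of the horse race example; probabilities are made up for illustration.
import numpy as np

rng = np.random.default_rng(42)
p_true = np.array([0.5, 0.3, 0.2])      # true win probabilities p_i
h_bookie = np.array([0.4, 0.4, 0.2])    # bookmaker's estimate; fair odds r_i = 1 / h_i
r_odds = 1.0 / h_bookie
b_bet = p_true.copy()                   # Kelly: proportional betting b_i = p_i

def dkl(v, w):
    """Kullback-Leibler divergence in bits (base 2), so W is in doublings per race."""
    return np.sum(v * np.log2(v / w))

W = dkl(p_true, h_bookie) - dkl(p_true, b_bet)   # theoretical doubling rate

T = 20000
winners = rng.choice(len(p_true), size=T, p=p_true)
log2_wealth = np.sum(np.log2(b_bet[winners] * r_odds[winners]))  # log2 of prod_t S_t
print("empirical growth rate per race:", log2_wealth / T)
print("theoretical doubling rate W   :", W)
```

With proportional betting the second divergence vanishes, so any positive $W$ here reflects purely the gambler's information advantage over the bookie's fair-odds estimate.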
Acknowledgements

The information in this thesis did not spring from nothing, but was the result of a long evolutionary process involving random brainstorms, selection, feedback and other forms of communication, love and support received from more persons than I can mention here.

First of all I want to thank my promotor, Nick van de Giesen, for providing me with the opportunity to do this research. Nick, thank you for the support, sense of humour, trust, thoughts, critical feedback and freedom you gave me. Peter-Jules van Overloop, thank you very much for convincing me to do a Ph.D. in Delft and for your very generous support, both in science and personally. I could always rely on your help. Sharing an office with you and my fellow Ph.D. students Luciano Raso and Xu Min was hilarious and inspiring.

Gerrit Schoups, Ronald van Nooijen, Rolf Hut, Nico de Vos and Nick, thanks for listening to my semi-religious sermons about information theory and for brainstorming about crazy ideas and experiments. You certainly made sure I reached my 10% foolishness quota to sustain my curiosity-addiction with enough interesting questions. Thank you Hanneke and Betty for providing regularity and organisation in this chaotic environment of scientists. The accurately timed lunch-calls and the fruit you supplied kept me going. My other colleagues, especially my fellow Ph.D. students, for being a bright bunch of nice people.

Martine, thank you for sharing, support and surprises. Your information will not be lost.

Thanks to the students of the dispuut water management for creating a great positive atmosphere with many motivated students and for giving me the opportunity to accompany the brilliantly organized study trip to Argentina. All the people of the Nieuwelaan for building and sharing a home together. My present and past housemates at the Nerdhuis, for building the binary encoded π-tile floor, crazy experiments, all your great cooking, random dinner conversations, the honor of receiving the nerd of the day award on multiple occasions, and much more. Nerds 22 ever!

Bart Nijssen and Andy Wood of 3TIER inc., ICIMOD, KNMI, Meteoconsult, Hoogheemraadschap van Delfland and Nelen & Schuurmans for kindly providing the data used in this research. Jery Stedinger, Amaury Tilmant, Hoshin Gupta, Federico Lombardo, Demetris Koutsoyiannis, Francesca Pianosi, Vijay P. Singh, and several anonymous reviewers for their constructive comments on my papers. The members of the examination committee for their comments that helped to improve my draft thesis.

My paranymphs Alexander Bakker and Jan Jongerden, for feedback on early versions of my chapters and for joining me on hitchhiking trips to my first conference in Nice and other locations. Thanks to all my other friends who made this period a happy one, and to some for sharing the idea that camping at subzero temperatures is fun. Juan Villarino, for sharing part of the infinite circuit. Sharing these travels of randomness allowed me to regain my focus.

I am deeply grateful to my mother, father and brother Menno, for a happy childhood, for asking or not asking about my thesis at the right moments, for letting me choose my own path, and for teaching me love for nature and the Balkans and to never stop wondering. While writing these acknowledgements, I noticed a small muscle ache developing as a result of the constant smile that came from thinking back on all the great moments I shared with all of you.
Again I want to express my deepest gratitude to you. Last and most of all, I want to thank my girlfriend Tamara for her patience, unconditional support, endurance, enthusiasm and love, which were essential ingredients for finishing this thesis. I am glad to have you by my side in our new adventure.

About the author

Steven Weijs received his first information through a partially random, but well-selected genetic code, which, notwithstanding his mother's fruitless frantic search for yoghurt during a holiday in Czechoslovakia, led to the emergence of a little creature that began receiving visual information on February 12, 1979 in Groningen. After a bit more than two years, he was joined by his younger brother. In the years that followed, he eagerly gathered more information by asking his parents and teachers progressively more foolish questions that were not always easy to answer. He was very curious about the motivation behind his father's research, which according to Steven consisted of "looking how rabbits chew".

During secondary school, Steven became interested in electronic circuits and soon he was also transmitting low-information content (a novelty in those days) through his homemade radio transmitter. He came to TU Delft to study Civil Engineering in 1997. After working as a student assistant, following M.Sc. courses in both water management and hydrology, and designing a flood defense for a town in Argentina during his internship, he obtained his M.Sc. in water resources management cum laude in 2004. He then joined Nelen & Schuurmans Hydroinformatics, an engineering consultancy, where he worked on hydrological and hydraulic modeling and the development of decision support and control systems for several Dutch water boards.

After two years, in 2006, Steven returned to academia, starting his Ph.D. research at the chair of Water Resources Management. Now convinced that modeling and control will always lack the information needed to fully eliminate uncertainty, he aimed at least to understand it. This thesis is a result of that journey.

In his spare time, Steven enjoys hiking and hitchhiking in desolate landscapes. Randomness was one of the key ingredients of his travels through South America, which taught him the beauty of uncertainty and sub-optimal decisions. Now he will embark on a slightly more organized adventure in Switzerland. After finalizing his Ph.D., Steven will join the EFLUM lab of Marc Parlange at the EPFL in Lausanne, where he will work as a post-doc, having obtained a post-doctoral grant from the AXA research fund.

Publications

Peer-reviewed publications

S.V. Weijs and N. Van de Giesen. Accounting for observational uncertainty in forecast verification: an information-theoretical view on forecasts, observations and truth. Monthly Weather Review, early online release, 2011.

S.V. Weijs, G. Schoups, and N. van de Giesen. Why hydrological predictions should be evaluated using information theory. Hydrology and Earth System Sciences, 14(12):2545–2558, 2010.

R.W. Hut, S.V. Weijs, and W.M.J. Luxemburg. Using the Wiimote as a sensor in water research. Water Resources Research, 46(12):W12601, 2010.

S.V. Weijs, R. Van Nooijen, and N. Van de Giesen. Kullback–Leibler divergence as a forecast skill score with classic reliability–resolution–uncertainty decomposition. Monthly Weather Review, 138(9):3387–3399, September 2010.

P.J. van Overloop, S. Weijs, and S. Dijkstra. Multiple model predictive control on a drainage canal system.
Control Engineering Practice, 16(5):531–540, 2008.

S. Weijs, E. van Leeuwen, P. van Overloop, and N. van de Giesen. Effect of uncertainties on the real-time operation of a lowland water system in The Netherlands. IAHS Publication, 313:463, 2007.

Non-reviewed publications

P.J. van Overloop, R.R. Negenborn, S.V. Weijs, W. Malda, M.R. Bruggers and B. De Schutter. Linking water and energy objectives in lowland areas through the application of model predictive control. In Proceedings of the 2010 IEEE Conference on Control Applications, pages 1887–1891, Yokohama, Japan, 2010.

S. Weijs. Interactive comment on "HESS Opinions 'A random walk on water'" by D. Koutsoyiannis. Hydrology and Earth System Sciences Discussions, 6:C2733–C2745, 2009.

S.V. Weijs and N.C. van de Giesen. Information theory, uncertainty and risk for evaluating hydrologic forecasts. In International Workshop Advances in Statistical Hydrology (STAHY), Taormina, Italy, 2010.

S.V. Weijs and M.M. Rutten. Application of minimum relative entropy update to long term forecast of cooling water problems in the Rhine. In 8th International Conference on Hydroinformatics, Concepcion, Chile, 2009.

S. Weijs. Information content of weather predictions for flood-control in a Dutch lowland water system. In 4th International Symposium on Flood Defense: Managing Flood Risk, Reliability and Vulnerability, Toronto, Ontario, Canada, 2007.

S.V. Weijs. The value of short-term hydrological predictions for operational management of a Dutch lowland water system. In International Conference on Water and Flood Management, Dhaka, Bangladesh, 2007.

S.V. Weijs, P.E.R.M. van Leeuwen, and P.J. van Overloop. The integration of risk analysis in real time flood control. In Jean Cunge, Vincent Guinot, and Shie-Yui Liong, editors, 7th International Conference on Hydroinformatics, Nice, France, volume 4, pages 2943–2950, 2006.

Abstracts

S.V. Weijs and N. van de Giesen. "zipping" hydrological timeseries: An information-theoretical view on data compression as philosophy of science. Geophysical Research Abstracts, 13:EGU2011–8105, 2011.

S.V. Weijs and N. van de Giesen. Evaluating reliability and resolution of ensemble forecasts using information theory. Geophysical Research Abstracts, 12:EGU2010-6489, 2010.

S.V. Weijs. Minimum relative entropy update of ensemble probabilities to reflect forecast information. Geophysical Research Abstracts, 10:EGU2008-A-01778, 2008.

S.V. Weijs. Timescales and information in early warning system design for glacial lake outburst floods. In A.G. van Os, editor, Proceedings of the NCR-days 2007: a sustainable river system?!, NCR publication 32, 2007.
