How to get data and model to fit together?

The field of statistics
• Not dry numbers, but the essence in them.
• Model vs data: estimation of parameters of interest
  - Examples of parameters: mean yearly precipitation, the explanatory value one discharge time series has for another, the relationship between stage and discharge.
  - Parameters are typically unknown, but the data give us information that can be used for estimation.
• Model choice
  - Gives answers to questions like: “Is the mean precipitation the same in two neighbouring regions?”, “Can we use one discharge time series to say something about another?”
  - These answers are not absolute, but are given with a certain precision (confidence or probability).
Data uncertainty
Perfect measurements + perfect models = zero uncertainty (whether it’s about model parameters or the models themselves).
Sources of uncertainty:
• Real measurements come with a built-in uncertainty.
• Models can’t take everything into account. There are unmeasured confounders (local variations in topography and soil in a hydrological model, for instance).
Both problems can be handled by looking at how the measurements spread out, i.e. at the probability distribution of the data, given model and parameters. Our models thus need to contain probability distributions.
This uncertainty has consequences for how sure we can be about our models and parameters.
Data summary: statistics (the numbers, not the field)
Ways of summarizing the data $(x_1, \ldots, x_n)$:
• Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Empirical quantiles: $q(p) = x_{(p \cdot n)}$, where $x_{(1)}, \ldots, x_{(n)}$ are the data ordered by value. For instance, $q(0.5)$ is the median.
• Empirical variance (sum of squared deviations, divided by $n-1$): $s^2(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$. Here $s(x) = \sqrt{s^2(x)}$ is the empirical standard deviation.
• Empirical correlation between two quantities, x and y: $\hat{\rho} = \frac{(1/n) \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{s(x)\, s(y)}$
• Histograms count the instances or rates inside intervals. They give an indication of how the data are distributed.
• Empirical cumulative histograms count the rate of instances lower than each given value. Empirical quantiles are easy to read from these.
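The summaries above can be sketched in a few lines of Python. This is only an illustration: the precipitation values are made up, and rounding p*n up to the next whole index is an assumption for the case where p*n is not an integer.

```python
import math

def mean(xs):
    """Arithmetic mean of the data."""
    return sum(xs) / len(xs)

def empirical_variance(xs):
    """Sum of squared deviations from the mean, divided by n-1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def empirical_quantile(xs, p):
    """q(p) = x_(p*n), with p*n rounded up to a whole index (an assumption)."""
    ordered = sorted(xs)
    k = max(1, math.ceil(p * len(ordered)))  # 1-based index into the ordered data
    return ordered[k - 1]

# Hypothetical yearly precipitation values in mm (not from the slides).
precip = [820.0, 640.0, 910.0, 750.0, 1030.0]

precip_mean = mean(precip)
precip_var = empirical_variance(precip)
precip_sd = math.sqrt(precip_var)
precip_median = empirical_quantile(precip, 0.5)

print(precip_mean, precip_var, precip_sd, precip_median)
```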
Data summary vs the underlying reality
While the data are known, what has produced the data is not. This means summary statistics don’t represent the reality of the system; they are only an indicator of what this reality might be.
For instance, the histogram from the previous slide was produced by the normal distribution, but this distribution doesn’t look exactly the same as the histogram.
Similarly, the mean isn’t the distribution mean (the expectancy), the empirical correlation isn’t the real correlation, the empirical median isn’t the distribution median, the empirical variance isn’t the distribution variance, etc.
One single distribution can produce a wide variety of outcomes! But the opposite is then also the case: a wide variety of distributions could produce the same data! How to go from data to distribution?
Example of distributional problems
• Regression. What is the functional relationship between x and y?
  Expressed distributionally: What’s the distribution of y given x?
• Forecasting. What to expect from floods, using what’s been happening before as input?
  Expressed distributionally: What is the distribution of future flood events, given the past?
• Model testing. Is hydrological model A better than B?
  Expressed distributionally: Does model A summarize/predict the distribution of the data better than model B?
• Filling in missing data.
  Expressed distributionally: What’s the joint distribution of the missing and actual data?
 Views on probability
I. The long-term rate of outcomes that fall into a given category. For instance, one can expect that 1/6 of all dice throws give the outcome “one”.
II. The relationship between a payoff and what you are willing to risk for it. For instance, you might be willing to risk 10kr in order to get back 60kr on a bet that the next outcome of the die is “one”.
III. A formal way of dealing with plausibility (reasoning under uncertainty). A probability of 1/6 for getting “one” on the die means that you have neither greater nor less belief in that outcome than in the other 5 possible outcomes.
Notation: We will use Pr(“something”) to say “the probability of something”.
I is a frequentist definition, while II and III are Bayesian.
Laws of probability
1. 0≤Pr(A)≤1
   Example: Pr(flood on the west coast)=1.1 means you have miscalculated.
2. Pr(A)+Pr(not A)=1
   Example: Pr(“two or more on the die”) = 1-Pr(“one”) = 1-1/6 = 5/6
3. Pr(A or B)=Pr(A)+Pr(B) when A and B are mutually exclusive.
   Example: Pr(“one or two on a single die throw”) = Pr(“one”)+Pr(“two”) = 1/6+1/6 = 1/3
Laws of probability 2 – conditional probability
Pr(A | B) gives the probability for A in cases where B is true.
Example: Pr(rain | overcast)
Pr(A|B)=Pr(A) means that A is independent of B: B gives no information about A.
Example: One throw of the die does not affect the next, so
Pr(“one on the second throw” | “one on the first throw”) = Pr(“one on the second throw”).
A dependency between parameters and data is what makes parameter estimation possible.
Pr(A and B)=Pr(A|B)Pr(B)
Example: Pr(“one on the first and second throw”) =
Pr(“one on the first throw”) * Pr(“one on the second throw” | “one on the first throw”) =
Pr(“one on the first throw”) * Pr(“one on the second throw”) = 1/36
Since Pr(A and B)=Pr(B|A)Pr(A) also, we get Bayes formula:
Pr(A|B) = Pr(B|A)Pr(A)/Pr(B)
From Bayes formula: If A is independent of B, Pr(A|B)=Pr(A), then B is also independent of A: Pr(B|A)=Pr(B).
Ex. of conditional probabilities
Assume that Pr(rain two consecutive days)=10%, and that Pr(rain on a given day)=20%.
What’s the probability of rain tomorrow if it rains today?
Pr(rain tomorrow | rain today) =
Pr(rain today and tomorrow)/Pr(rain today) = 10%/20% = 50%
If it’s overcast 50% of the time and it’s always overcast when it’s raining, what is the probability of rain given overcast?
Pr(rain | overcast) =
Pr(overcast and rain)/Pr(overcast) =
Pr(overcast | rain)Pr(rain)/Pr(overcast) = 100%*20%/50% = 40%
(PS: This re-derives Bayes formula!)
Since Pr(rain | overcast) > Pr(rain), we say that overcast is evidence for rain.
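The two rain calculations can be checked with a small Python sketch. The function names are illustrative, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction

def conditional(p_a_and_b, p_b):
    """Pr(A | B) = Pr(A and B) / Pr(B)."""
    return p_a_and_b / p_b

def bayes(p_b_given_a, p_a, p_b):
    """Pr(A | B) = Pr(B | A) Pr(A) / Pr(B)."""
    return p_b_given_a * p_a / p_b

# Numbers from the rain example.
p_rain = Fraction(20, 100)
p_rain_two_days = Fraction(10, 100)
p_overcast = Fraction(50, 100)
p_overcast_given_rain = Fraction(1)  # always overcast when it rains

p_rain_tomorrow_given_today = conditional(p_rain_two_days, p_rain)
p_rain_given_overcast = bayes(p_overcast_given_rain, p_rain, p_overcast)

print(p_rain_tomorrow_given_today)  # 1/2
print(p_rain_given_overcast)        # 2/5
```

Since 2/5 is larger than the unconditional 1/5, knowing that it is overcast raises the probability of rain.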
Conditional probability as inferential logic
From the previous example, it was seen that the probability of rain increases when
we know it’s overcast. With “probability as inferential logic” terminology, overcast is
evidence for rain.
Evidence is information that increases the probability for something we are
uncertain about. It’s possible to make rules for evidence, even when exact
probabilities are not available.
• When A->B, then B is evidence for A.
  (If rain -> overcast, then overcast is evidence for rain.)
• When A is evidence for B, then B is evidence for A.
  (If a flood at location A increases the risk of there being a flood at location B, then ...) Note that the strength of the evidence does not have to be the same both ways.
• If A is evidence for B and B is evidence for C (and there is no direct dependency between A and C), then A is evidence for C.
  (If Oddgeir mostly speaks the truth and he says it’s overcast, then that is evidence for rain.)
• If A is evidence for B, then “not A” is evidence for “not B”.
  (Not overcast, i.e. clear skies, is evidence against rain. If you have been searching for the boss inside the building without finding him/her, then that is evidence for him/her not being in the building.)
See “Reasoning under uncertainty” on YouTube for more.
The law of total probability
If one has the conditional probabilities for one thing and the unconditional
(marginal) probabilities of another, one can retrieve the unconditional (marginal)
distribution of the first thing. This process is called marginalization.
Let’s say we have three possibilities spanning the realm of all possible outcomes: B1, B2 or B3. So, one and only one of B1, B2 and B3 can be true. (For instance “rain”, “overcast without rain” and “sunny”. A could be the event that a person uses his car to get to work.)
Pr(A) = Pr(A and B1) + Pr(A and B2) + Pr(A and B3) =
Pr(A | B1)Pr(B1) + Pr(A | B2)Pr(B2) + Pr(A | B3)Pr(B3)
It’s the same if there are more (or fewer) possibilities for B.
Example: Assume that the probability of hail on a given day is 2% in the summer half-year and 20% in the winter half-year (these are thus conditional probabilities). What’s the probability of hail on an arbitrary day of the year?
Pr(hail) = Pr(hail | summer)Pr(summer) + Pr(hail | winter)Pr(winter) = 2%*50% + 20%*50% = 11%
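A minimal sketch of marginalization for the hail example, assuming summer and winter each cover half the year. The helper name is hypothetical.

```python
from fractions import Fraction

def total_probability(conditionals, marginals):
    """Pr(A) = sum over i of Pr(A | B_i) Pr(B_i), for mutually exclusive B_i."""
    if sum(marginals) != 1:
        raise ValueError("the B_i must span all possible outcomes")
    return sum(c * m for c, m in zip(conditionals, marginals))

# Pr(hail | summer), Pr(hail | winter) and Pr(summer), Pr(winter).
p_hail = total_probability(
    [Fraction(2, 100), Fraction(20, 100)],
    [Fraction(1, 2), Fraction(1, 2)],
)
print(p_hail)  # 11/100
```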
Properties of stochastic variables
(stuff that has a distribution)
• The expectation value is the mean of a distribution, weighted by the probabilities:
$$E(X) = \sum_{i=1}^{N} x_i \Pr(X = x_i)$$
when there are N different possible outcomes.
For a die, the expectation value is 3.5.
For a uniformly distributed variable between 0 and 1, the expectation is ½.
For a normally distributed variable, the expectation is a parameter, $\mu$.
• The standard deviation (sd) is a measure of how much spread you can expect. Technically it’s the square root of the variance (Var), defined as:
$$\mathrm{Var}(X) = \sum_{i=1}^{N} (x_i - E(X))^2 \Pr(X = x_i)$$
For a uniformly distributed variable between 0 and 1, the variance is 1/12.
For a normally distributed variable, the standard deviation is a parameter, $\sigma$ (or variance $\sigma^2$).
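As a quick check of the two definitions, here is a sketch that evaluates the sums for a fair six-sided die with exact fractions.

```python
from fractions import Fraction

faces = range(1, 7)
p_face = Fraction(1, 6)  # fair die: every face is equally probable

# E(X) = sum of x_i * Pr(X = x_i)
die_expectation = sum(x * p_face for x in faces)

# Var(X) = sum of (x_i - E(X))^2 * Pr(X = x_i)
die_variance = sum((x - die_expectation) ** 2 * p_face for x in faces)

print(die_expectation)  # 7/2
print(die_variance)     # 35/12
```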
Covariance and correlation
If two variables X and Y are dependent (so Pr(Y|X)≠Pr(Y)), there is a measure for that also. First off, one can define a covariance, which tells how X and Y vary linearly together:
$$\mathrm{Cov}(X, Y) = \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} (x_i - E(X))(y_j - E(Y)) \Pr(X = x_i, Y = y_j)$$
where $N_x$ and $N_y$ are the numbers of different possible outcomes for X and Y.
Covariance will, however, depend both on the (linear) dependency between X and Y and on the scale of both of them. To remove the latter, we form the correlation:
$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{sd}(X)\,\mathrm{sd}(Y)}$$
Note that $-1 \le \rho_{XY} \le 1$ always. $|\rho_{XY}| = 1$ means perfect linear dependency.
Also note that the correlation only summarizes linear dependency, not non-linear dependency! It is even possible to have perfect dependency but no correlation!
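A small constructed example of perfect dependency with zero covariance (and hence zero correlation): X takes the values -1, 0, 1 with equal probability and Y is X squared. The example is an illustration chosen here, not taken from the slides.

```python
from fractions import Fraction

# Joint distribution Pr(X = x, Y = y) with Y = X^2, so Y is fully determined by X.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expectation(joint_dist, pick):
    """Expectation of pick(x, y) under the joint distribution."""
    return sum(pick(x, y) * p for (x, y), p in joint_dist.items())

e_x = expectation(joint, lambda x, y: x)
e_y = expectation(joint, lambda x, y: y)
covariance = expectation(joint, lambda x, y: (x - e_x) * (y - e_y))

print(e_x, e_y, covariance)  # 0 2/3 0
```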
Samples from stochastic variables: the law of large numbers
If we can sample from a statistical distribution enough times, we will eventually see that:
• Rates approach the probabilities: $r_A \to \Pr(A)$ as $n \to \infty$.
• The mean approaches the expectancy value: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \to E(X)$
• The empirical variance approaches the distribution variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \to \mathrm{Var}(X)$
• The rate of samples falling inside an interval approaches the interval probability. Thus the histogram approaches the probability density.
• Empirical quantiles approach the distributional quantiles.
• Empirical correlations approach the distributional correlation.
The data we see are seen as a sample set from some (unknown) distribution.
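The convergence of the mean can be illustrated by simulating die throws. This is a sketch with an arbitrary fixed seed; the printed values will differ for other seeds, but should settle near 3.5 as the number of throws grows.

```python
import random

def die_mean(n_throws, seed=1):
    """Mean of n_throws simulated throws of a fair six-sided die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0
    for _ in range(n_throws):
        total += rng.randint(1, 6)
    return total / n_throws

for n in (10, 1000, 100000):
    print(n, die_mean(n))
```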
Diagnostic plots concerning probability distributions
• One can compare the histogram with the distribution (probability density).
• Cumulative rates can be compared to the cumulative distribution function.
• One can also plot theoretical quantiles vs sample quantiles. These are called QQ plots. If the right distribution has been chosen, the points should lie on a straight line.
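A sketch of how the points of a normal QQ plot could be computed without any plotting library. The plotting positions (i - 0.5)/n are one common convention and an assumption here, and the sample is simulated rather than real data.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pairs of (theoretical quantile, sample quantile) for a fitted normal."""
    ordered = sorted(sample)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Assumed plotting positions (i - 0.5) / n, which stay strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def pearson(pairs):
    """Correlation of the QQ points; close to 1 when they lie on a line."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)
sample = [rng.gauss(10.0, 2.0) for _ in range(1000)]  # simulated normal data
points = qq_points(sample)
straightness = pearson(points)
print(straightness)
```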
The Bernoulli process and the
binomial distribution
A process of independent events having only two outcomes of interest, success and failure, is called a Bernoulli process.
It is characterized by the success rate, p. This is often an unknown parameter that we’d like to estimate.
Examples:
i. Coin tosses. p=probability for heads (p=50%).
ii. Years where the discharge went over a given threshold in Glomma. p=probability of the discharge going over the threshold in a given year.
Incorrect use: Rainy days last month (the weather on consecutive days is not independent).
If you count the number of successes in n trials, you get the binomial distribution:
$$\Pr(x \mid n, p) = \binom{n}{x} p^x (1-p)^{n-x}$$
E(X)=np. Var(X)=np(1-p).
In the example shown on the slide, n=30, p=0.3.
Related: The negative binomial distribution. Counts the number of ‘failures’ before the k’th success:
$$\Pr(x \mid n, p) = \binom{n+x-1}{x} p^x (1-p)^n$$
In this parameterization n plays the role of k, and x counts outcomes that each have probability p.
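The binomial formula and its moments can be checked numerically. The sketch below uses n=30 and p=0.3 as on the slide and verifies that the probabilities sum to one and reproduce E(X)=np and Var(X)=np(1-p).

```python
from math import comb

def binomial_pmf(x, n, p):
    """Pr(x | n, p): probability of x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n_trials, success_rate = 30, 0.3
pmf = [binomial_pmf(x, n_trials, success_rate) for x in range(n_trials + 1)]

pmf_total = sum(pmf)
pmf_mean = sum(x * q for x, q in enumerate(pmf))
pmf_variance = sum((x - pmf_mean) ** 2 * q for x, q in enumerate(pmf))

print(pmf_total, pmf_mean, pmf_variance)
```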
Distributional families - Poisson
The Poisson distribution is something you get when you count the number of events that happen independently in time (a Poisson process) inside a time interval.
Examples:
• Number of car accidents with deadly outcome per year. $\lambda$ = deadly traffic danger.
• Number of times the discharge goes above a given threshold in a given time interval. $\lambda$ = threshold rate. (PS: Strictly speaking not independent!)
The Poisson distribution is characterized by a rate parameter, $\lambda$:
$$\Pr(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$$
In the example shown on the slide, $\lambda = 10$.
If the rate is uncertain in a particular way (gamma distributed), the outcome will be negative binomially distributed.
Probability density
For stuff that has a continuous nature (discharge/humidity/temperature) we can’t assign probabilities directly, since there are uncountably many outcomes.
What we can do instead is to assign a probability density. We use the notation f(x) for probability densities, with the x denoting which stochastic variable, X, it is the probability density of.
A probability density gives the probability of falling inside a small interval divided by the size of this interval.
• For larger intervals, integration is needed (law 3): $\Pr(a < X < b) = \int_a^b f(x)\,dx$
• Probabilities still have to sum to one (law 2): $\int f(x)\,dx = 1$
• Conditional probabilities: $f(x \mid y) = f(x, y) / f(y)$, i.e. $f(x, y) = f(x \mid y) f(y)$
• Law of total probability: $f(x) = \int f(x \mid y) f(y)\,dy$
• Bayes formula: $f(x \mid y) = \frac{f(y \mid x) f(x)}{f(y)} = \frac{f(y \mid x) f(x)}{\int f(y \mid x) f(x)\,dx}$
• Expectancy: $E(X) = \int x f(x)\,dx$
Cumulative distributions and quantiles
If one has a probability density, one can calculate (by integration) the probability of an
outcome less than or equal to a given value x. Seen as a function of x, this defines the
cumulative probability, F(x).
$$F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(u)\,du$$
If we turn the cumulative distribution around, we can ask for which value this distribution attains a given probability, i.e. for which value there is a given probability that X is lower than that.
One then gets a quantile, q(p). This is the value for which there is probability p that X is lower: $q(p) = F^{-1}(p)$
Special quantile: the median. 50% probability of being above that value and 50% probability of being below.
Quantiles can be used for indicating the spread of possible outcomes (uncertainty intervals). For instance, 95% of the probability mass lies between the 2.5% and 97.5% quantiles.
Ex: The 85% quantile of the standard normal distribution is approx. 1.
The normal distribution
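The probability density, f(x), is smooth and has a single peak around a parameter value, $\mu$.
Mathematically it looks like this:
$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where $\mu$ is the expectancy and $\sigma$ is the standard deviation.
If a stochastic variable, X, is normally distributed we often write this as $X \sim N(\mu, \sigma)$.
Probability mass within a given number of standard deviations of $\mu$:
• $\mu \pm \sigma$: 68.3% probability
• $\mu \pm 2\sigma$: 95.4% probability
• $\mu \pm 3\sigma$: 99.73% probability
• $\mu \pm 5\sigma$: 99.99994% probability
$(\mu - 1.96\sigma, \mu + 1.96\sigma)$ contains 95% of the probability mass for the normal distribution.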
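The probability masses and quantiles quoted for the normal distribution can be reproduced with Python's standard library. A sketch using `statistics.NormalDist` for the standard normal:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def mass_within(k):
    """Probability mass within k standard deviations of the expectancy."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3, 5):
    print(k, mass_within(k))

# Quantiles are the inverse of the cumulative distribution, q(p) = F^-1(p).
q85 = std_normal.inv_cdf(0.85)
q975 = std_normal.inv_cdf(0.975)
print(q85, q975)
```

The 97.5% quantile is where the 1.96 in the 95% interval comes from, and the 85% quantile is close to 1.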
Why the normal distribution?
While the normal distribution may look complicated, it has a host of nice properties:
• It’s smooth and allows all real valued outcomes.
• It’s characterized by a single peak. If you condition on a distribution being positive, smooth and having a single peak, a Taylor expansion will indicate that the normal distribution will be an approximation around this peak.
• It’s symmetric.
• Information theory suggests choosing the distribution that maximises entropy (minimizes information) conditioned on what you know. The max entropy distribution when you know the centre (expectation) and spread (standard deviation) is the normal distribution.
• The sum of two independent normally distributed variables is normally distributed.
• A large sum of identically distributed independent variables will be approximately normally distributed (the central limit theorem).
• Believe it or not, the normal distribution is pleasant to work with, mathematically!
• It is the distributional basis for a lot of the standard methodology in statistics.
Should work well for temperatures, not so well for discharge!
The lognormal distribution (scale variables)
When something needs to be strictly positive (the mass of a person, the volume in a reservoir, the discharge in a river), the normal distribution does not work. It assigns positive probabilities to negative outcomes.
A simple way to fix this is by first taking the logarithm of your original quantity, and assigning the normal distribution to that. If X>0, log(X) will take values all over the real line.
The assumption $\log(X) \sim N(\mu, \sigma)$ also gives a distribution for X, called the lognormal distribution, $X \sim \mathrm{logN}(\mu, \sigma)$:
$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma x} \exp\left(-\frac{(\log(x)-\mu)^2}{2\sigma^2}\right)$$
If $\mu$ is increased, the uncertainty (standard deviation) also increases, but the relative uncertainty remains the same.
From the central limit theorem, one can argue that the product of independent identically distributed positive variables will go towards the lognormal distribution.
The (inverse) gamma distribution
The gamma distribution is another distribution that only takes positive values:
$$f(x \mid \alpha, \beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}$$
It has a form that makes it mathematically convenient when studying variation parameters (sums of independent squared contributions) and when studying rate parameters (Poisson).
In Bayesian statistics, the distribution of its inverse is however often more convenient. If X is gamma distributed, then 1/X is inverse gamma distributed:
$$f(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{-\alpha-1} e^{-\beta/x}$$
Statistical inference
• If we have a task like extreme value analysis, regression or forecasting, there will be unknown numbers, parameters, that we want to estimate.
• A model summarizes how the data have been produced through the likelihood, $f(D \mid \theta)$, where $\theta$ is the unknown parameters and D is the data. The data might not tell a clear-cut story about $\theta$.
• Statistical inference deals with:
  - Estimation of parameter values $\theta$ in a model.
  - Uncertainty of these parameter values.
  - Model choice and uncertainty concerning model choice.
  - Estimates and uncertainty of derived quantities (for instance risk).
  - Estimates and uncertainties of latent variables (unmeasured stuff that has a distribution).
Statistical (parameter) inference
We then want to go the other way: say something about the parameters $\theta$ given the data, D. Two fundamentally different ways of dealing with this exist:
Frequentist (classic):
• Parameters are unknown but seen as fixed. Distributions (and uncertainty) are only assigned to the data, $f(D \mid \theta)$, and to quantities derived from the data.
• We can assign probabilities to methodology having something to do with the parameters (estimators, confidence intervals etc.) before the data.
• Plugging actual data into this, we don’t have anything with probability left, but talk about confidence instead.
Bayesian:
• Uncertainty is handled by probability distributions whether it’s parameters or data. Since we have $f(D \mid \theta)$, we can turn it around and ask for $f(\theta \mid D)$ (the posterior distribution) using Bayes formula:
$$f(\theta \mid D) = \frac{f(D \mid \theta) f(\theta)}{f(D)} = \frac{f(D \mid \theta) f(\theta)}{\int f(D \mid \theta) f(\theta)\,d\theta}$$
• We need to start out with a distribution $f(\theta)$ (the prior) summarizing what we know prior to the data.
• We also have a troublesome integral for the prior prediction distribution (marginal data distribution), $f(D)$.
We’ve seen this type of inference already:
Pr(rain | overcast) = Pr(overcast | rain)Pr(rain)/Pr(overcast)
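For a single parameter, the troublesome integral can be handled by brute force on a grid. The sketch below does this for the success rate p of a Bernoulli process, assuming a uniform prior and made-up data (3 successes in 10 trials). The grid approach and the function name are illustrative choices, not the method of the slides.

```python
from math import comb

def grid_posterior(successes, trials, grid_size=1001):
    """Posterior density of the success rate p on an evenly spaced grid."""
    step = 1.0 / (grid_size - 1)
    grid = [i * step for i in range(grid_size)]
    prior = [1.0] * grid_size  # assumed uniform prior f(p) = 1 on [0, 1]
    likelihood = [
        comb(trials, successes) * p ** successes * (1 - p) ** (trials - successes)
        for p in grid
    ]
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
    # Trapezoid rule for the marginal data distribution f(D).
    evidence = step * (sum(unnormalized) - 0.5 * (unnormalized[0] + unnormalized[-1]))
    posterior = [u / evidence for u in unnormalized]
    return grid, posterior, evidence

grid, posterior, evidence = grid_posterior(successes=3, trials=10)
grid_step = grid[1] - grid[0]
# The posterior is zero at both end points here, so a plain sum is a trapezoid sum.
posterior_mean = grid_step * sum(p * f for p, f in zip(grid, posterior))
print(evidence, posterior_mean)
```

With a uniform prior the exact answers are known (f(D)=1/11 and posterior mean 1/3), which makes this a convenient check of the numerics.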
When the model clashes with reality
We wish to find an uncertainty interval for the average mass of mammoths.
Data set: x=(5000kg, 6000kg, 11000kg)
Model 1: $x_i \sim N(\mu, \sigma)$ i.i.d.
• The normal distribution allows negative values for both the average mass and the measurements.
• Results in a 95% confidence interval, $C(\mu)$=(-650kg, 15300kg), which contains values that just can’t be right!
Model 2: $\log(x_i) \sim N(\mu, \sigma)$ i.i.d. ($x_i \sim \mathrm{logN}(\mu, \sigma)$)
• Only positive expectancies and measurements possible.
• The 95% confidence interval, transformed back to the original scale, contains only positive values.
• Even better if we could put in pre-knowledge.
Message: Use models that deal with reality. Learn to use various models/distributions/methods for various problems. GIGO (garbage in, garbage out).
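The two mammoth intervals can be recomputed with a short sketch. Python's standard library has no t distribution, so the 97.5% quantile for 2 degrees of freedom is hard-coded here as an assumption (about 4.303). The back-transformed interval is obtained by exponentiating the end points of the interval for log-mass.

```python
import math
from statistics import mean, stdev

# Assumed constant: 97.5% quantile of the t distribution with n-1 = 2 degrees of freedom.
T_2_0975 = 4.302653

def t_interval(values, t_quantile):
    """Interval mean +/- t * s / sqrt(n) for the expectancy of normal data."""
    n = len(values)
    centre = mean(values)
    half_width = t_quantile * stdev(values) / math.sqrt(n)
    return centre - half_width, centre + half_width

masses_kg = [5000.0, 6000.0, 11000.0]

# Model 1: normal on the original scale.
normal_ci = t_interval(masses_kg, T_2_0975)

# Model 2: normal on the log scale, end points transformed back with exp.
log_ci = t_interval([math.log(m) for m in masses_kg], T_2_0975)
lognormal_ci = tuple(math.exp(bound) for bound in log_ci)

print(normal_ci)
print(lognormal_ci)
```

The first interval reaches below zero, as on the slide, while the second cannot, since exp of any real number is positive.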
Frequentist methodology
Only data are given a distribution. Focused on estimation, with uncertainty and model choice being secondary. Model choice and uncertainty come from the probability of producing new data that look like the data you have.
Parameters do not have a distribution, but estimators do. An estimator is a method for producing a parameter estimate, defined before the data. So before the data, an estimator has a probability distribution.
Frequentist statistics: Estimation
Estimation is done through an estimator, a method for producing a number from a dataset generated by the model.
An estimator should be consistent: the probability of the difference between estimator and actual parameter being larger than a given value should go to zero as the number of data increases.
One would also like estimators to be unbiased, that is, having an expectancy equal to the parameter value.
Often used methods for making estimators:
• The method of moments. Estimate the parameters so that the expectancy matches the data mean, the distributional variance matches the empirical variance etc. Advantage: Easy to make. Disadvantage: Little theory about the estimator distribution (meaning bad for assessing uncertainty), can be pathological, limited areas of usage.
• The L moment method. Variant of the method of moments using so-called L moments. Advantage: Good experience from flood frequency analysis. Disadvantage: See above + not so easy to make.
• The ML (maximum likelihood) method. Estimate the parameters to maximize the probability for the data, i.e. the likelihood $f(D \mid \theta)$. Advantage: More or less unlimited areas of usage. Asymptotic theory for the uncertainty exists, pathological estimates impossible. Disadvantage: May require heavier numerical methods, can be biased.
Frequentist statistics: Numeric
Not all models give you a likelihood that has
readily available analytical expressions for
the (ML) estimators.
For such cases one needs numerical
optimization. These come in two categories:
1. Hill-climbing/local climbing: Start from
one (or a small set of) point(s) in
parameter space and use the local
”topography” of the likelihood to find the
nearest peak. Examples: Newton’s algorithm,
2. Global methods: More complicated,
requires much coding, execution time
and adjustments. Examples: simulated
annealing, genetic algorithms.
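As an illustration of hill-climbing, here is a sketch of Newton's algorithm applied to the Poisson log-likelihood. This case is chosen because the ML estimate is known analytically (the data mean), so the numerical answer can be checked. The counts are made up, and the starting value is assumed to lie between 0 and twice the data mean, where the iteration converges.

```python
def newton_poisson_ml(counts, start, tol=1e-12, max_iter=100):
    """Maximize the Poisson log-likelihood l(rate) with Newton's algorithm."""
    n = len(counts)
    total = sum(counts)
    rate = start
    for _ in range(max_iter):
        gradient = total / rate - n      # first derivative of l(rate)
        curvature = -total / rate ** 2   # second derivative of l(rate)
        step = gradient / curvature
        rate -= step                     # Newton update towards the peak
        if abs(step) < tol:
            break
    return rate

counts = [3, 5, 4, 6, 2]  # hypothetical event counts per interval
ml_rate = newton_poisson_ml(counts, start=1.0)
print(ml_rate, sum(counts) / len(counts))
```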
Frequentist statistics: Parameter
uncertainty and confidence intervals
An estimate isn’t the truth. There can be many different
parameter values that can with reasonable probability
produce the data.
Frequentist statistics operates with confidence intervals. A
95% confidence interval is a method for making intervals
from the data which before the data has a 95% probability of
encompassing the correct parameter value.
(A Bayesian credibility interval has a 95% probability for encompassing
the correct parameter value, given the data).
Confidence intervals are made by looking at the distribution
of so-called test-statistics (often estimators).
Confidence interval techniques
Exact methods. These can be applied when you know exactly the distribution of the test statistic. Ex: A 95% confidence interval for the expectancy in the normal distribution:
$$\left(\bar{x} - t_{n-1}(0.975)\, s/\sqrt{n},\ \bar{x} + t_{n-1}(0.975)\, s/\sqrt{n}\right)$$
where s is the empirical standard deviation and $t_{n-1}$ is the t distribution with n-1 degrees of freedom.
Asymptotic theory. When the amount of data goes towards infinity, the ML estimator has the following distribution:
$$\hat{\theta} \sim N(\theta, I(\theta)^{-1}) \quad \text{where} \quad I(\theta) = -E\left[\frac{\partial^2 l(\theta)}{\partial \theta^2}\right]$$
is Fisher's information matrix and $l(\theta)$ is the log-likelihood.
Thus $\left(\hat{\theta} - 1.96\sqrt{I(\hat{\theta})^{-1}},\ \hat{\theta} + 1.96\sqrt{I(\hat{\theta})^{-1}}\right)$ will be a 95% confidence interval.
Bootstrap. Here one tries to recreate the distribution that produced the data, either by plugging parameter estimates into the model (parametric) or by re-drawing from the data with replacement (non-parametric). One then looks at the spread of the set of new parameter estimates for these new datasets.
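A sketch of the non-parametric bootstrap for the mean. Reading the interval off the percentiles of the resampled estimates is one simple way of summarizing their spread and is an assumption here, as are the discharge values and the fixed seed.

```python
import random

def bootstrap_mean_interval(data, n_resamples=5000, level=0.95, seed=0):
    """Percentile interval for the mean from resampling with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)  # draw n values with replacement
        estimates.append(sum(resample) / n)
    estimates.sort()
    lower_index = round((1 - level) / 2 * n_resamples)
    upper_index = round((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]

# Hypothetical discharge measurements (not from the slides).
discharge = [12.1, 9.8, 15.3, 11.0, 13.7, 10.4, 14.9, 12.6]
boot_lower, boot_upper = bootstrap_mean_interval(discharge)
print(boot_lower, boot_upper)
```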
Frequentist statistics:
Hypothesis testing
Sometimes we will be uncertain about which model to use. Is there a dependency between x and y? Does the expectancy hold a particular value? Classic hypothesis testing is done by:
1. Formulate a null hypothesis (zero-hypothesis), H0, and an alternative hypothesis, Ha.
2. Set a threshold probability, called the significance level, for the probability of rejecting a correct null hypothesis. Typical value: 5%.
3. Focus on a test statistic (a function of the data, the likelihood or estimators for instance). Find an expression for the probability density of this.
4. By looking at the alternative hypothesis, find which values of the test statistic are extreme for the null hypothesis. Find from the distribution of the test statistic an interval containing the 5% (significance level) most extreme values.
5. If the test statistic is inside this interval for the data, the null hypothesis is rejected with (100% - significance level) confidence.
P value: The probability of getting a test statistic at least as extreme as the one you got, given that the null hypothesis is correct. P value < significance level means rejection.
Power: Gives the probability of rejecting the null hypothesis for various versions of the alternative hypothesis (so a function of the parameter values). You want this as high as possible. This can affect experimental planning.
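The recipe can be followed exactly for a coin. In this sketch the null hypothesis is p=0.5, the alternative is p>0.5, the test statistic is the number of heads, and the made-up data are 9 heads in 10 tosses.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def one_sided_p_value(successes, trials, p_null):
    """Probability of at least this many successes if the null hypothesis holds."""
    return sum(binomial_pmf(x, trials, p_null) for x in range(successes, trials + 1))

significance_level = 0.05
p_value = one_sided_p_value(successes=9, trials=10, p_null=0.5)
reject_null = p_value < significance_level

print(p_value, reject_null)
```

Here the P value is 11/1024, about 1.1%, so the null hypothesis of a fair coin is rejected at the 5% level.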
Frequentist statistics:
Model testing (2)
The t test. Checks if one dataset has an expectancy equal to a given value or if two datasets have the same expectancy. In practice done by seeing if the 95% confidence interval (for the expectancy minus the given value or for the difference) encompasses zero:
$$\left(\bar{x} + t_{n-1}(0.025)\, s/\sqrt{n},\ \bar{x} + t_{n-1}(0.975)\, s/\sqrt{n}\right)$$
(Note that $t_{n-1}(0.025) = -t_{n-1}(0.975)$.)
General methodology:
The likelihood ratio test. Under a null hypothesis, $2(l_A - l_0) \sim \chi^2_k$, where k is the difference in the number of parameters and $l_A$ and $l_0$ are the maximized log-likelihoods for the alternative hypothesis and the null hypothesis, respectively. (Only valid asymptotically.)
The score test. Uses
$$\hat{\theta} \sim N(\theta, I(\theta)^{-1}) \quad \text{where} \quad I(\theta) = -E\left[\frac{\partial^2 l(\theta)}{\partial \theta^2}\right] \text{ is Fisher's information matrix}$$
to check if the parameter estimate is far enough from a specific value to be rejected. (See if the value falls inside the confidence interval that ranges from $\hat{\theta} - 1.96\sqrt{I^{-1}(\hat{\theta})}$ to $\hat{\theta} + 1.96\sqrt{I^{-1}(\hat{\theta})}$ to test this.)
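A sketch of the likelihood ratio test for a Poisson rate, with a fixed rate under the null hypothesis and a free rate under the alternative (so k=1). The counts and the null rate are made up. For one degree of freedom the chi-squared tail probability can be written with the complementary error function, which avoids any external library.

```python
import math

def poisson_log_likelihood(counts, rate):
    """Log-likelihood up to a constant; the sum of log(x!) cancels in the ratio."""
    return sum(counts) * math.log(rate) - len(counts) * rate

def likelihood_ratio_test(counts, null_rate):
    """Test statistic 2(l_A - l_0) and its asymptotic P value for k = 1."""
    ml_rate = sum(counts) / len(counts)  # ML estimate under the alternative
    l_alt = poisson_log_likelihood(counts, ml_rate)
    l_null = poisson_log_likelihood(counts, null_rate)
    statistic = max(0.0, 2.0 * (l_alt - l_null))
    # For chi-squared with 1 degree of freedom: Pr(chi2 > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

counts = [12, 9, 15, 11, 13]  # hypothetical yearly event counts
lr_statistic, lr_p_value = likelihood_ratio_test(counts, null_rate=10.0)
print(lr_statistic, lr_p_value)
```

With only five observations the asymptotic distribution is a rough approximation, so the P value should be read with that in mind.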
Frequentist statistics:
other model choice strategies
Hypothesis testing is primarily for when you only want to reject a null hypothesis with strong evidence. But often, what you are seeking is whichever model serves the purpose best, like minimizing the prediction inaccuracies. Sometimes hypothesis testing can even be impossible, because both your models have equal model complexity.
Note that prediction uncertainty comes from the stochasticity of the data itself, errors in the model and uncertainty concerning parameter estimates. Stochasticity in the data is something we can’t get rid of, but the other two contributions need to be balanced.
Methods:
• Adjusted R2 (only regression)
• AIC=-2*log(ML)+2*k, k=#parameters
• Cross validation: divide the data into training and validation sets.
• CV-ANOVA (ANOVA test on the results of cross validation.)