QTL Mapping with Score Statistics (2009 Stat Methods in Genetics and Mol. Bio.)

Statistical Applications in Genetics
and Molecular Biology
Volume 8, Issue 1
2009
Article 16
Score Statistics for Mapping Quantitative Trait Loci
Myron N. Chang, University of Florida, mchang@cog.ufl.edu
Rongling Wu, University of Florida, rwu@ufl.edu
Samuel S. Wu, University of Florida, samwu@biostat.ufl.edu
George Casella, University of Florida, casella@ufl.edu
Copyright 2009 The Berkeley Electronic Press. All rights reserved.
Abstract
We propose a method to detect the existence of quantitative trait loci (QTL) in a backcross
population using a score test. Since the score test only uses the MLEs of parameters under the null
hypothesis, it is computationally simpler than the likelihood ratio test (LRT). Moreover, because
the location parameter of the QTL is unidentifiable under the null hypothesis, the distribution of
the maximum of the LRT statistics, typically the statistic of choice for testing H0: no QTL, does
not have the standard chi-square distribution asymptotically under the null hypothesis. From the
simple structure of the score test statistics, the asymptotic null distribution can be derived for the
maximum of the square of score test statistics. Numerical methods are proposed to compute the
asymptotic null distribution and the critical thresholds can be obtained accordingly. We show
that the maximum of the LR test statistics and the maximum of the square of score statistics are
asymptotically equivalent. Therefore, the critical threshold for the score test can be used for the
LR test also. A simple backcross design is used to demonstrate the application of the score test to
QTL mapping.
KEYWORDS: score test, likelihood ratio test, asymptotic equivalence, backcross design
Chang et al.: Score Statistics
1 Introduction
Since the publication of the seminal mapping paper by Lander and Botstein (1989),
there has been a vast wealth of literature concerning the development of statistical
methods for mapping quantitative trait loci (QTL) that affect complex traits (reviewed in Doerge et al. 1997; Jansen 2001; Hoeschele 2001; Wu et al. 2007).
In QTL mapping, one of the thorniest issues is the characterization of critical
thresholds to declare the statistical significance of a QTL (Lander and Schork 1994).
Typically the profile of likelihood-ratio (LR) test statistics is constructed over the
grid of possible QTL locations in a linkage group or an entire genome and the maximum of the LR (MLR) is used as a global test statistic. At a given position of the
QTL, the LR test statistic is asymptotically $\chi^2$-distributed under the null hypothesis
with degrees of freedom equal to the number of associated QTL effects. However,
the QTL position is unidentifiable under the null hypothesis
\[
H_0: \text{no QTL}, \tag{1}
\]
so the MLR test statistic does not have the standard $\chi^2$-distribution asymptotically.
To overcome the limitations due to the failure of the test statistic to follow
a standard statistical distribution, a distribution-free simulation approach has been
proposed to calculate critical values for different experimental settings with intermediate marker densities (Lander and Botstein 1989; van Ooijen 1992; Darvasi et
al. 1993). A more empirical approach for determining critical thresholds is based
on a permutation test procedure (see Churchill and Doerge 1994 and Doerge and
Churchill 1996). The simulation- or permutation-based approach requires a very
high computational workload, and this makes its application impractical in many
situations. Several researchers have derived formulae for computing approximate
critical thresholds to control the type I error rate at a chromosome- or genome-wide
level (Rebai et al. 1994; Dupuis and Siegmund 1999; Piepho 2001).
The purpose of this article is to propose a general statistical framework to
detect the existence of a QTL using a score test. A score-statistic approach for mapping QTL was considered with sibships or a natural population in humans (Schaid
et al. 2002; Wang and Huang 2002). Zou et al. (2004) utilized score statistics
for genome-wide mapping of QTL in controlled crosses, in which the resampling
method was proposed to assess the significance. Since the score test only uses maximum likelihood estimates (MLEs) of parameters under the null hypothesis (1), under which all phenotypes in the sample are independent and identically distributed
(i.i.d.), this approach is much less demanding computationally than the LR test. We
form score test statistics for given locations in the permissible range and then use
Published by The Berkeley Electronic Press, 2009
Statistical Applications in Genetics and Molecular Biology, Vol. 8 [2009], Iss. 1, Art. 16
the maximum of the square of these score test statistics as a global test statistic. We
show that under H0, the maximum of the square of score test statistics asymptotically follows the distribution of $\sup Z^2(d)$, where $Z(d)$ is a Gaussian stochastic
process with mean zero and well-formulated covariances, and the supremum is over
the permissible range of $d$, the Haldane (1919) distance between one end of the
genome and the location of the QTL. The distribution of $\sup Z^2(d)$ only depends on
Haldane distances or equivalently on recombination fractions between markers, and
the critical thresholds can be obtained by simulation. We show that the maximum
of the square of score test statistics and the MLR are asymptotically equivalent, and
hence the critical thresholds obtained for the maximum of the square of score test
statistics can be used for the MLR test as well.
The article is organized as follows. We provide background information on
QTL mapping in a backcross population in Section 2. In Section 3, we derive the score test statistic and the asymptotic null distribution of the maximum of the square of score test
statistics for interval mapping. We further show that the maximum of
the LR test statistics and the maximum of the square of score statistics are asymptotically equivalent under the null hypothesis. In Section 4, the score test and the
null distribution are extended to composite interval mapping, where we show that,
in the dense map case, the distribution of $\sup Z^2(d)$ is the same as that derived by
Lander and Botstein (1989). Numerical methods to compute the null distribution
are addressed in Section 5. In Section 6, examples are presented and the thresholds
obtained by our method are compared with those from other procedures. Section 7
is a discussion section. We provide some technical details in the Appendix.
2 QTL Mapping in Backcross Populations
This section provides a short introduction to QTL mapping in backcross populations. A more thorough introduction can be found in Doerge et al. (1997).
2.1 Basics
The aim of QTL mapping is to associate genes with quantitative phenotypic traits.
For example, we might be interested in QTL that affect the growth rate of pine
trees, so we might be looking for regions on a chromosome that are associated with
growth rate in pines.
In the simplest case we assume that there are two parents that have genotypes
QQ and qq at a certain (unknown) place, or locus, on the chromosome. If the parents
mate, the offspring (called the F1 generation) will have the genotype Qq. In a backcross,
the offspring is mated back to a parent, say QQ, producing a new generation with
http://www.bepress.com/sagmb/vol8/iss1/art16
DOI: 10.2202/1544-6115.1386
genotypes QQ and Qq. We want to investigate the association between the new generation
(sometimes called BC1 ) and the quantitative trait.
Now the problems arise. We typically cannot see if the locus is Q or q, but
rather we see a "marker", with alleles M and m. If the marker is "close" to the
gene, where closeness is measured by recombination distance, r, then we hope that
seeing an M implies that a Q is there, and so on. If r is small, with r = 0 being the
limiting case of no recombination, then M is closely linked to Q. But if r is large,
with r = 1/2 being the limiting case, then there may be no linkage between M and
Q.
A simple statistical model for all of this is the following. If $Y$ is a random
variable signifying the quantitative trait, assume that $Y$ is normal with variance $\sigma^2$
and mean $\mu_{QQ}$, $\mu_{qq}$ or $\mu_{Qq}$ depending on the alleles at the particular locus under
consideration. Clearly the parents are $N(\mu_{QQ}, \sigma^2)$ and $N(\mu_{qq}, \sigma^2)$, respectively,
and the F1 is $N(\mu_{Qq}, \sigma^2)$. The distribution of the backcross to the QQ parent is a
mixture of normals with means $\mu_{QQ}$ and $\mu_{Qq}$.
If the Q alleles could be observed, the mixing fraction would be 1/2, but
we can only observe the marker genotypes M or m. If $r$ denotes the recombination probability, then the distribution of the phenotype in the backcross population
(Doerge et al. 1997) is
\[
Y \mid MM \sim
\begin{cases}
N(\mu_{QQ}, \sigma^2) & \text{with probability } 1 - r, \\
N(\mu_{Qq}, \sigma^2) & \text{with probability } r,
\end{cases}
\]
and
\[
Y \mid Mm \sim
\begin{cases}
N(\mu_{QQ}, \sigma^2) & \text{with probability } r, \\
N(\mu_{Qq}, \sigma^2) & \text{with probability } 1 - r.
\end{cases}
\]
When categorized by the markers, the difference in means of the populations
is
\[
\mu_{MM} - \mu_{Mm} = (1 - 2r)(\mu_{QQ} - \mu_{Qq}).
\]
Assuming that $\mu_{QQ} - \mu_{Qq} \neq 0$, a test of H0: no linkage between M and Q can
be carried out by testing H0: $\mu_{MM} - \mu_{Mm} = 0$ (which is also equivalent to the null
hypothesis H0: $r = 1/2$). Using this mixture model in a likelihood analysis, we can
not only test for linkage but also estimate $r$. (The backcross is the simplest design
in which we get enough information to estimate $r$.)
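The relation between the marker-class means and the QTL means can be checked numerically. The following Python sketch is an illustration only: the function names `marker_class_means` and `simulate_backcross`, and the use of NumPy, are assumptions introduced here and are not part of the original analysis. It evaluates the two marker-class means implied by the mixture model above and draws a backcross sample under that model.

```python
import numpy as np

def marker_class_means(mu_QQ, mu_Qq, r):
    """Expected trait means in the MM and Mm marker classes of a backcross."""
    mu_MM = (1 - r) * mu_QQ + r * mu_Qq
    mu_Mm = r * mu_QQ + (1 - r) * mu_Qq
    return mu_MM, mu_Mm

def simulate_backcross(n, mu_QQ, mu_Qq, sigma, r, rng):
    """Draw marker classes (True = MM) and phenotypes from the mixture model."""
    is_MM = rng.random(n) < 0.5      # the two marker classes are equally likely
    recomb = rng.random(n) < r       # recombination between marker and QTL
    # The QTL genotype is QQ for an MM marker without recombination,
    # or for an Mm marker with recombination.
    is_QQ = is_MM ^ recomb
    y = rng.normal(np.where(is_QQ, mu_QQ, mu_Qq), sigma)
    return is_MM, y
```

With r = 1/2 the two class means coincide, which is the null hypothesis of no linkage; with smaller r the difference approaches the full QTL effect.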
2.2 Interval mapping
In practice, we typically use a somewhat more complex model, where we assume
that the QTL is bracketed by two flanking markers $M^{\ell-1}$ (with alleles $M^{\ell-1}$ and
$m^{\ell-1}$) and $M^{\ell}$ (with alleles $M^{\ell}$ and $m^{\ell}$) from a genetic linkage map. These two
markers form four distinct genotype groups
\[
\{M^{\ell-1}m^{\ell-1}M^{\ell}m^{\ell}\},\ \{M^{\ell-1}m^{\ell-1}m^{\ell}m^{\ell}\},\ \{m^{\ell-1}m^{\ell-1}M^{\ell}m^{\ell}\},\ \{m^{\ell-1}m^{\ell-1}m^{\ell}m^{\ell}\}. \tag{2}
\]
The recombination fraction between the two markers $M^{\ell-1}$ and $M^{\ell}$ is denoted by $r$, that between $M^{\ell-1}$ and the QTL by $r_1$, and that between the QTL
and $M^{\ell}$ by $r_2$. The corresponding genetic distances between these three loci are
denoted by $D$, $x$ and $D - x$, respectively. Genetic distances and recombination fractions are related through a map function (Wu et al. 2007). Here we use the Haldane
(1919) map function, given by $r = (1 - e^{-2D})/2$, to convert the genetic distance to
the corresponding recombination fraction.
To express the conditional probability of a QTL genotype, say Qq, given
each of the four two-marker genotypes, we introduce two ratio parameters,
\[
\theta_1 = \frac{r_1 r_2}{1 - r} = \frac{r_1 r_2}{(1 - r_1)(1 - r_2) + r_1 r_2}
= \frac{1 + e^{-2D} - e^{-2x} - e^{-2(D-x)}}{2(1 + e^{-2D})}, \tag{3}
\]
and
\[
\theta_2 = \frac{r_1 (1 - r_2)}{r} = \frac{r_1 (1 - r_2)}{r_1 (1 - r_2) + (1 - r_1) r_2}
= \frac{1 - e^{-2D} - e^{-2x} + e^{-2(D-x)}}{2(1 - e^{-2D})}.
\]
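As a quick check on (3), the ratio form and the exponential form of each parameter can be compared numerically. The sketch below is illustrative: the function names `haldane_r`, `theta` and `theta_closed_form` are assumptions made here, and distances are taken to be in Morgans with D > 0.

```python
import math

def haldane_r(D):
    """Haldane map function: genetic distance D (Morgans) to recombination fraction."""
    return 0.5 * (1.0 - math.exp(-2.0 * D))

def theta(x, D):
    """(theta1, theta2) of equation (3), written with recombination fractions."""
    r, r1, r2 = haldane_r(D), haldane_r(x), haldane_r(D - x)
    theta1 = r1 * r2 / (1.0 - r)
    theta2 = r1 * (1.0 - r2) / r
    return theta1, theta2

def theta_closed_form(x, D):
    """(theta1, theta2) of equation (3), written with exponentials of distances."""
    e = math.exp
    theta1 = (1 + e(-2 * D) - e(-2 * x) - e(-2 * (D - x))) / (2 * (1 + e(-2 * D)))
    theta2 = (1 - e(-2 * D) - e(-2 * x) + e(-2 * (D - x))) / (2 * (1 - e(-2 * D)))
    return theta1, theta2
```

At the left marker (x = 0) both parameters are 0, and at the right marker (x = D) they are 0 and 1, so the mixtures collapse to the single-marker case at either end of the interval.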
Furthermore, assume that the putative QTL has an additive effect on the
trait, and define the expected genotypic values as $\mu + \Delta$ and $\mu - \Delta$ for the two QTL
genotypes Qq and qq, respectively, where $\mu$ is the overall mean of the backcross
population. The corresponding distributions of the phenotypic values for the four
marker genotypes can be modelled, respectively, as
\[
\begin{aligned}
&(1 - \theta_1) N(\mu - \Delta, \sigma^2) + \theta_1 N(\mu + \Delta, \sigma^2), \\
&(1 - \theta_2) N(\mu - \Delta, \sigma^2) + \theta_2 N(\mu + \Delta, \sigma^2), \\
&\theta_2 N(\mu - \Delta, \sigma^2) + (1 - \theta_2) N(\mu + \Delta, \sigma^2), \\
&\theta_1 N(\mu - \Delta, \sigma^2) + (1 - \theta_1) N(\mu + \Delta, \sigma^2),
\end{aligned}
\]
where $\mu$, $\Delta$, $\sigma^2$, and $x$, $0 \le x \le D$, are the four unknown parameters contained in the
above mixture models.
Lastly, suppose that we take a sample of size $N$ from the backcross populations, with respective sample sizes of $n_1$, $n_2$, $n_3$ and $n_4$ in the four marker groups of
(2). We then have
\[
\lim (n_1/N) = \lim (n_4/N) = \tfrac{1}{2}(1 - r), \qquad
\lim (n_2/N) = \lim (n_3/N) = \tfrac{1}{2} r, \tag{4}
\]
as $N \to \infty$.
3 Single Interval Mapping
Since the existence of the QTL in the interval is indicated by $\Delta \neq 0$, its statistical
significance can be tested by the hypotheses:
\[
H_0: \Delta = 0 \quad \text{vs.} \quad H_a: \Delta \neq 0,
\]
where H0 is equivalent to (1). Note that under the null hypothesis, all observations
are i.i.d. from $N(\mu, \sigma^2)$. Thus, the QTL position parameter $\theta = (\theta_1, \theta_2) = \theta(x, D)$
is unidentifiable because there is no information on $\theta$ if H0 is true. In this case, the
MLR test statistic does not follow a standard $\chi^2$-distribution with two degrees of
freedom.
We derive a score statistic to overcome this problem. Recall that n1 , n2 ,
n3 and n4 are the sample sizes of the four marker groups. If we define N0 = 0,
N1 = n1 , N2 = N1 + n2 , N3 = N2 + n3 and N4 = N = N3 + n4 , the log likelihood of
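the phenotype data conditional on the marker information can be written
\[
\begin{aligned}
\ell(\Delta, \mu, \sigma^2, x) = {} & -\frac{N}{2} \log 2\pi - \frac{N}{2} \log \sigma^2 \\
& + \sum_{i=1}^{N_1} \log\Big[(1 - \theta_1)\, e^{-\frac{1}{2\sigma^2}(y_i - \mu + \Delta)^2} + \theta_1\, e^{-\frac{1}{2\sigma^2}(y_i - \mu - \Delta)^2}\Big] \\
& + \sum_{i=N_1+1}^{N_2} \log\Big[(1 - \theta_2)\, e^{-\frac{1}{2\sigma^2}(y_i - \mu + \Delta)^2} + \theta_2\, e^{-\frac{1}{2\sigma^2}(y_i - \mu - \Delta)^2}\Big] \\
& + \sum_{i=N_2+1}^{N_3} \log\Big[\theta_2\, e^{-\frac{1}{2\sigma^2}(y_i - \mu + \Delta)^2} + (1 - \theta_2)\, e^{-\frac{1}{2\sigma^2}(y_i - \mu - \Delta)^2}\Big] \\
& + \sum_{i=N_3+1}^{N} \log\Big[\theta_1\, e^{-\frac{1}{2\sigma^2}(y_i - \mu + \Delta)^2} + (1 - \theta_1)\, e^{-\frac{1}{2\sigma^2}(y_i - \mu - \Delta)^2}\Big].
\end{aligned}
\tag{5}
\]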
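The score function is the derivative of (5) with respect to $\Delta$ evaluated at $\Delta = 0$:
\[
\begin{aligned}
u(\mu, \sigma^2, x) = \left.\frac{\partial \ell}{\partial \Delta}\right|_{\Delta = 0}
= \frac{1}{\sigma^2} \Big\{ & -(1 - 2\theta_1) \sum_{i=1}^{N_1} (y_i - \mu) - (1 - 2\theta_2) \sum_{i=N_1+1}^{N_2} (y_i - \mu) \\
& + (1 - 2\theta_2) \sum_{i=N_2+1}^{N_3} (y_i - \mu) + (1 - 2\theta_1) \sum_{i=N_3+1}^{N} (y_i - \mu) \Big\}.
\end{aligned}
\]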
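Under the null hypothesis, the MLEs of $\mu$ and $\sigma^2$ are
\[
\hat\mu = \frac{1}{N} \sum_{i=1}^{N} y_i \quad \text{and} \quad \hat\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat\mu)^2,
\]
respectively. For a given $x \in [0, D]$, to form the score test statistic we need to replace
$\mu$ and $\sigma^2$ in $u(\mu, \sigma^2, x)$ by $\hat\mu$ and $\hat\sigma^2$.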
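For ease of notation, denote
\[
\bar y_t = \frac{1}{n_t} \sum_{i=N_{t-1}+1}^{N_t} y_i, \qquad t = 1, 2, 3, 4,
\]
\[
\begin{aligned}
a(x, \mathbf{N}) &= (\theta_1 - \theta_2) n_2 - (1 - \theta_1 - \theta_2) n_3 - (1 - 2\theta_1) n_4, \\
b(x, \mathbf{N}) &= (\theta_2 - \theta_1) n_1 - (1 - 2\theta_2) n_3 - (1 - \theta_1 - \theta_2) n_4, \\
c(x, \mathbf{N}) &= (1 - \theta_1 - \theta_2) n_1 + (1 - 2\theta_2) n_2 + (\theta_1 - \theta_2) n_4, \\
d(x, \mathbf{N}) &= (1 - 2\theta_1) n_1 + (1 - \theta_1 - \theta_2) n_2 + (\theta_2 - \theta_1) n_3,
\end{aligned}
\]
where $\mathbf{N} = (n_1, n_2, n_3, n_4)$. We can then write
\[
u(\hat\mu, \hat\sigma^2, x) = \frac{2}{N \hat\sigma^2} \big[a(x, \mathbf{N}) n_1 \bar y_1 + b(x, \mathbf{N}) n_2 \bar y_2 + c(x, \mathbf{N}) n_3 \bar y_3 + d(x, \mathbf{N}) n_4 \bar y_4\big],
\]
where, under the null hypothesis, the variance of $u(\hat\mu, \hat\sigma^2, x)$ is
\[
\mathrm{Var}(u(\hat\mu, \hat\sigma^2, x)) = \left(\frac{2}{N \hat\sigma}\right)^2 \big[a^2(x, \mathbf{N}) n_1 + b^2(x, \mathbf{N}) n_2 + c^2(x, \mathbf{N}) n_3 + d^2(x, \mathbf{N}) n_4\big].
\]
The score test statistic, $U(x)$, is $u(\hat\mu, \hat\sigma^2, x)$ divided by its standard deviation, that
is,
\[
U(x) = \frac{u(\hat\mu, \hat\sigma^2, x)}{\sqrt{\mathrm{Var}(u(\hat\mu, \hat\sigma^2, x))}}. \tag{6}
\]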
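The computation of (6) needs only group sizes, group means and the null MLE of the variance. The following Python sketch shows one way to arrange it; the function name `score_statistic` and the input layout (a list of four phenotype arrays, one per marker group in the order of (2)) are assumptions made for illustration, and every group is assumed to be non-empty.

```python
import numpy as np

def score_statistic(groups, theta1, theta2):
    """Score test statistic U(x) of equation (6).

    groups : list of four 1-D arrays, phenotypes of marker groups 1..4 in (2)
    theta1, theta2 : the ratio parameters of (3) at the tested position x
    """
    n = np.array([len(g) for g in groups], dtype=float)
    ybar = np.array([np.mean(g) for g in groups])
    y = np.concatenate(groups)
    N = y.size
    sigma2_hat = np.mean((y - y.mean()) ** 2)     # null MLE of sigma^2
    n1, n2, n3, n4 = n
    a = (theta1 - theta2) * n2 - (1 - theta1 - theta2) * n3 - (1 - 2 * theta1) * n4
    b = (theta2 - theta1) * n1 - (1 - 2 * theta2) * n3 - (1 - theta1 - theta2) * n4
    c = (1 - theta1 - theta2) * n1 + (1 - 2 * theta2) * n2 + (theta1 - theta2) * n4
    d = (1 - 2 * theta1) * n1 + (1 - theta1 - theta2) * n2 + (theta2 - theta1) * n3
    coef = np.array([a, b, c, d])
    u = 2.0 / (N * sigma2_hat) * np.sum(coef * n * ybar)           # score at Delta = 0
    var_u = (2.0 / (N * np.sqrt(sigma2_hat))) ** 2 * np.sum(coef ** 2 * n)
    return u / np.sqrt(var_u)
```

Because the coefficients satisfy a n1 + b n2 + c n3 + d n4 = 0, the statistic is unchanged when all phenotypes are shifted by a constant, and the division by the standard deviation makes it invariant to rescaling as well.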
Calculation of the score test statistic $U(x)$ is simple (in contrast to the computation of the MLEs $\hat\Delta_x$, $\hat\mu_x$, and $\hat\sigma_x$, which can be complex). Moreover, the maximum score test statistic is asymptotically equivalent to the maximum likelihood
ratio statistic, as we now show.
For a given $x \in [0, D]$, let $\hat\Delta_x$, $\hat\mu_x$, and $\hat\sigma_x^2$ be the MLEs of $\Delta$, $\mu$ and $\sigma^2$. By
Cox and Hinkley (1974, pages 323-324), the LR test statistic and the square of the
score test statistic are asymptotically equivalent at this $x$, that is,
\[
2\big[\ell(\hat\Delta_x, \hat\mu_x, \hat\sigma_x^2, x) - \ell(0, \hat\mu, \hat\sigma^2, x)\big] \approx U^2(x). \tag{7}
\]
Over the range of $x \in [0, D]$, the maximum of the left-hand side of (7) (MLR) equals
the maximum of the right-hand side of (7) asymptotically. The proof is deferred to
Appendix 8.3.
Theorem 1
\[
\lim_{N \to \infty} \Big| \sup_{x \in [0, D]} 2\{\ell(\hat\eta(x), x) - \ell(\hat\eta_0)\} - \sup_{x \in [0, D]} U^2(x) \Big| = 0 \quad \text{a.s.}
\]
where $\hat\eta(x) = (\hat\Delta_x, \hat\mu_x, \hat\sigma_x)$, $\hat\eta_0 = (0, \hat\mu, \hat\sigma)$, and the almost sure convergence is with
respect to the distribution of the phenotypic trait under H0.
In Appendix 8.3, we show that the difference between the two sides in (7) converges
to zero with probability one uniformly for $x \in [0, D]$ as $N \to \infty$.
To test (1) we need the null distribution of the maximum of $U^2(x)$ (which is
equivalent to the MLR test statistic). To do this we derive a simplified and asymptotically equivalent form of the score test statistic. Because of (4), we can replace
$n_1$, $n_2$, $n_3$, and $n_4$ in $U(x)$ by $\frac{1}{2}(1 - r)N$, $\frac{1}{2} rN$, $\frac{1}{2} rN$, and $\frac{1}{2}(1 - r)N$, respectively. We
then have
\[
U(x) \approx U^*(x) = \left(\frac{\sqrt{N}}{2\hat\sigma}\right) \frac{(1 - r)(1 - 2\theta_1)(\bar y_4 - \bar y_1) + r(1 - 2\theta_2)(\bar y_3 - \bar y_2)}{\sqrt{(1 - r)(1 - 2\theta_1)^2 + r(1 - 2\theta_2)^2}}. \tag{8}
\]
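Under the null hypothesis,
\[
\frac{1}{2} \sqrt{(1 - r)N}\, \frac{(\bar y_4 - \bar y_1)}{\hat\sigma} \approx W_1 \quad \text{and} \quad \frac{1}{2} \sqrt{rN}\, \frac{(\bar y_3 - \bar y_2)}{\hat\sigma} \approx W_2, \tag{9}
\]
where $W_1$ and $W_2$ are two independent standard normal random variables. Thus,
$U^*(x)$ is asymptotically equivalent to
\[
Z(x) = \frac{(1 - 2\theta_1) \sqrt{1 - r}\, W_1 + (1 - 2\theta_2) \sqrt{r}\, W_2}{\sqrt{(1 - r)(1 - 2\theta_1)^2 + r(1 - 2\theta_2)^2}}. \tag{10}
\]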
Next, consider two QTL positions $x'$ and $x'' \in [0, D]$, whose corresponding probabilities in (3) can be written as $(\theta_1', \theta_2') = \theta(x', D)$ and $(\theta_1'', \theta_2'') = \theta(x'', D)$. The
covariance between $Z(x')$ and $Z(x'')$ is
\[
\mathrm{cov}(Z(x'), Z(x'')) = \frac{(1 - r)(1 - 2\theta_1')(1 - 2\theta_1'') + r(1 - 2\theta_2')(1 - 2\theta_2'')}{\sqrt{(1 - r)(1 - 2\theta_1')^2 + r(1 - 2\theta_2')^2}\, \sqrt{(1 - r)(1 - 2\theta_1'')^2 + r(1 - 2\theta_2'')^2}}. \tag{11}
\]
Therefore, the MLR test statistic or the maximum of the square of score test statistics is asymptotically distributed as
\[
\sup_{0 \le x \le D} Z^2(x), \tag{12}
\]
where $Z(x)$ is a Gaussian stochastic process with mean 0 and covariances (11).
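Covariance (11) is the inner product of the two normalized weight vectors that multiply (W1, W2) in (10). The Python sketch below evaluates it that way; the function names `theta`, `z_weights` and `cov_z` are assumptions made for illustration, and distances are taken in Morgans with D > 0.

```python
import numpy as np

def theta(x, D):
    """(theta1, theta2) of equation (3) under the Haldane map function."""
    r = 0.5 * (1 - np.exp(-2 * D))
    r1 = 0.5 * (1 - np.exp(-2 * x))
    r2 = 0.5 * (1 - np.exp(-2 * (D - x)))
    return r1 * r2 / (1 - r), r1 * (1 - r2) / r

def z_weights(x, D):
    """Unit-length weights on (W1, W2) that define Z(x) in equation (10)."""
    r = 0.5 * (1 - np.exp(-2 * D))
    t1, t2 = theta(x, D)
    w = np.array([(1 - 2 * t1) * np.sqrt(1 - r), (1 - 2 * t2) * np.sqrt(r)])
    return w / np.linalg.norm(w)

def cov_z(x1, x2, D):
    """Covariance (11) between Z(x1) and Z(x2) within one marker interval."""
    return float(z_weights(x1, D) @ z_weights(x2, D))
```

Two sanity checks follow directly from (11): the covariance at x' = x'' is 1, and between the two ends of the interval it equals 1 - 2r.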
4 Composite Interval Mapping
In the preceding section, we discussed the asymptotic theory of the score test statistic for QTL mapping from a single marker interval. In this section, we use the
full marker information in composite interval mapping. Statistically, this is a combination of interval mapping based on two given flanking markers and a partial
regression analysis on all markers except the two bracketing the QTL.
4.1 QTL mapping on a sparse map
Suppose there are $k + 1$ diallelic markers, $M^0, M^1, \ldots, M^k$, on a sparse linkage
map. These markers encompass $k$ marker intervals. In a backcross population, four
marker genotypes at a given interval $[M^{\ell-1}, M^{\ell}]$ ($\ell = 1, 2, \ldots, k$) are indicated by
\[
G^{(\ell)} =
\begin{cases}
1 & \text{if a marker genotype is } M^{\ell-1}m^{\ell-1}M^{\ell}m^{\ell}, \\
2 & \text{if a marker genotype is } M^{\ell-1}m^{\ell-1}m^{\ell}m^{\ell}, \\
3 & \text{if a marker genotype is } m^{\ell-1}m^{\ell-1}M^{\ell}m^{\ell}, \\
4 & \text{if a marker genotype is } m^{\ell-1}m^{\ell-1}m^{\ell}m^{\ell}.
\end{cases}
\]
The number of observations within each of these four marker genotypes at a given
marker interval can be expressed as
\[
n_t^{(\ell)} = \sum_{i=1}^{N} I(G_i^{(\ell)} = t), \qquad t = 1, 2, 3, 4,
\]
where $I(A)$ is the indicator function of event $A$. We then use
\[
\mathbf{N}^{(\ell)} = (n_1^{(\ell)}, n_2^{(\ell)}, n_3^{(\ell)}, n_4^{(\ell)})
\]
to array the observations of all the four marker genotypes at the interval. We can
then express the phenotypic mean of a quantitative trait ($y$) for a marker genotype $t$
at the interval $[M^{\ell-1}, M^{\ell}]$ as
\[
\bar y_t^{(\ell)} = \frac{1}{n_t^{(\ell)}} \sum_{i=1}^{N} y_i\, I(G_i^{(\ell)} = t).
\]
Let $D_\ell$ and $r_\ell$ ($\ell = 1, 2, \ldots, k$) denote the Haldane distance and recombination fraction between $M^{\ell-1}$ and $M^{\ell}$, respectively, and denote the Haldane distance
between the two extreme markers $M^0$ and $M^k$ by $D = \sum_{\ell=1}^{k} D_\ell$. A putative QTL
must be located somewhere on the map, at unknown position $d \in [0, D]$. For each
$d \in [0, D]$, define
\[
\ell_d = \min\Big\{\ell : \sum_{i=1}^{\ell-1} D_i \le d < \sum_{i=1}^{\ell} D_i\Big\} \quad \text{and} \quad x_d = d - \sum_{i=1}^{\ell_d - 1} D_i, \tag{13}
\]
to identify the possible QTL within the interval $[M^{\ell_d - 1}, M^{\ell_d}]$ that is associated
with the distance $d$. As derived in Section 3, the score test statistic (6) for detecting
the QTL at $d \in [0, D]$, the asymptotically equivalent forms (8) and (10), and the relationship (9) are all valid with $\mathbf{N}$, $\bar y_t$, and $r$ replaced by $\mathbf{N}^{(\ell)}$, $\bar y_t^{(\ell)}$, and $r_\ell$, respectively.
In (6), (8), and (10), $(\theta_1, \theta_2) = \theta(x_d, D_{\ell_d})$ is the function of $x_d$ and $D_{\ell_d}$ defined in
(3).
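The bookkeeping in (13), which maps a genome position d to an interval index and an offset inside that interval, can be sketched in a few lines of Python. The function name `locate` is an assumption made here, and so is the treatment of the right end d = D, which (13) leaves open and which this sketch assigns to the last interval.

```python
import bisect
from itertools import accumulate

def locate(d, interval_lengths):
    """Map position d to (ell, x_d): 1-based interval index and offset within it."""
    ends = list(accumulate(interval_lengths))   # cumulative right endpoints
    if not 0 <= d <= ends[-1]:
        raise ValueError("d must lie in [0, D]")
    # Number of right endpoints <= d, plus one, gives the interval index;
    # the cap places d = D in the last interval (an assumption of this sketch).
    ell = min(bisect.bisect_right(ends, d) + 1, len(ends))
    start = ends[ell - 2] if ell > 1 else 0.0
    return ell, d - start
```

A position that coincides with an interior marker is assigned offset 0 in the interval to its right, which matches the half-open inequality in (13).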
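For example, we express (10) as
\[
Z(d) = \frac{(1 - 2\theta_1) \sqrt{1 - r_\ell}\, W_1^{(\ell)} + (1 - 2\theta_2) \sqrt{r_\ell}\, W_2^{(\ell)}}{\sqrt{(1 - r_\ell)(1 - 2\theta_1)^2 + r_\ell (1 - 2\theta_2)^2}}, \tag{14}
\]
where $W_1^{(\ell)}$ and $W_2^{(\ell)}$ are two independent standard normal random variables,
\[
W_1^{(\ell)} \approx \frac{1}{2} \sqrt{(1 - r_\ell)N}\, \frac{(\bar y_4^{(\ell)} - \bar y_1^{(\ell)})}{\hat\sigma} \quad \text{and} \quad W_2^{(\ell)} \approx \frac{1}{2} \sqrt{r_\ell N}\, \frac{(\bar y_3^{(\ell)} - \bar y_2^{(\ell)})}{\hat\sigma}. \tag{15}
\]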
In order to search for a QTL throughout the entire genome, we need to
compute the covariance between $Z(d')$ and $Z(d'')$, $0 \le d' \le d'' \le D$. Assume $(\ell', x')$
and $(\ell'', x'')$ are associated with $d'$ and $d''$ through (13). Let $(\theta_1', \theta_2') = \theta(x', D_{\ell'})$
and $(\theta_1'', \theta_2'') = \theta(x'', D_{\ell''})$ as defined in (3). If $\ell' = \ell'' = \ell$, then the covariance
between $Z(d')$ and $Z(d'')$ is given in (11) with $r$ replaced by $r_\ell$. If $\ell' \neq \ell''$, then
the covariance between $Z(d')$ and $Z(d'')$ is derived in (A.6) in Appendix 8.1. It
now follows that the MLR test statistic or the maximum of the square of score test
statistics is asymptotically distributed as
\[
\sup_{0 \le d \le D} Z^2(d), \tag{16}
\]
where $Z(d)$ is a Gaussian stochastic process with mean 0 and covariance in (11)
when $\ell' = \ell''$ or in (A.6) when $\ell' < \ell''$. Note that the null distribution of (16) only
depends on Haldane distances or equivalently on recombination fractions between
the markers. In Section 5, we will use a numerical approach to compute the distribution of (16).
4.2 QTL mapping on a dense map
For a dense genetic map, we can assume that markers are located at every point
along the genome. This can be considered as the limit of the sparse map when k
(the number of diallelic markers) converges to infinity and the genomic distance
between any two consecutive markers converges to zero.
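For a dense map, two markers $M^{\ell-1}$ and $M^{\ell}$ can be assumed to have the
identical distance ($d$) to marker $M^0$. Furthermore, for any interval $[M^{\ell-1}, M^{\ell}]$, we
have $n_2^{(\ell)} \approx n_3^{(\ell)} \approx 0$, $n_1^{(\ell)} \approx n_4^{(\ell)} \approx \frac{1}{2}N$, $\theta_1 \approx 0$, $a(x, \mathbf{N}^{(\ell)}) \approx -n_4^{(\ell)}$, and $d(x, \mathbf{N}^{(\ell)}) \approx n_1^{(\ell)}$.
If the Haldane distance between $M^0$ and the QTL, which is located within
interval $[M^{\ell-1}, M^{\ell}]$, is $d \in [0, D]$, the score statistic (6) for detecting the QTL at $d$
is reduced to
\[
U(d) = \frac{\bar y_4^{(\ell)} - \bar y_1^{(\ell)}}{\hat\sigma \sqrt{\dfrac{1}{n_1^{(\ell)}} + \dfrac{1}{n_4^{(\ell)}}}}.
\]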
For two possible locations $d'$ and $d''$ of the QTL, $d' < d''$, where $d'$ and $d''$ belong to
$[M^{\ell'-1}, M^{\ell'}]$ and $[M^{\ell''-1}, M^{\ell''}]$, the covariance (see Chang et al. 2003 for details)
is reduced to
\[
\mathrm{cov}(U(d'), U(d'')) \approx 1 - 2r_{\ell', \ell''-1} \approx 1 - 2r_{\ell' \ell''}. \tag{17}
\]
This suggests that the asymptotic distribution of the MLR test statistic, or the maximum of the square of the score test statistics, is the same as the distribution of
$\sup_{0 \le d \le D} Z^2(d)$, where $Z(\cdot)$ is a Gaussian stochastic process with mean 0 and covariance (17). This result is consistent with the finding of Lander and Botstein
(1989).
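Under the Haldane map function used throughout, 1 - 2r equals exp(-2 |d'' - d'|) for two loci at distances d' and d'', so a process with covariance (17) can be simulated on a fine grid as a stationary first-order autoregression. The Python sketch below does this; it is an illustration under stated assumptions, not the procedure of the paper. The function name `dense_map_threshold`, the grid step and the number of replicates are all choices made here, and a finite grid can only approach the supremum from below.

```python
import numpy as np

def dense_map_threshold(length, alpha=0.05, step=0.001, n_rep=5000, seed=0):
    """Approximate upper-alpha quantile of sup Z^2(d) on [0, length] (Morgans)."""
    rng = np.random.default_rng(seed)
    n_steps = int(round(length / step))
    rho = np.exp(-2.0 * step)            # correlation between neighbouring grid points
    innov_sd = np.sqrt(1.0 - rho ** 2)   # keeps the marginal variance equal to 1
    z = rng.standard_normal(n_rep)       # Z at the left end, one value per replicate
    sup_z2 = z ** 2
    for _ in range(n_steps):
        z = rho * z + innov_sd * rng.standard_normal(n_rep)
        sup_z2 = np.maximum(sup_z2, z ** 2)
    return float(np.quantile(sup_z2, 1.0 - alpha))
```

With `length=0` the function returns an estimate of the upper 5% point of a chi-square distribution with one degree of freedom (about 3.84), and the threshold grows as the chromosome gets longer.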
5 Simulation of the Null Distribution
Although it is difficult to derive an explicit formula for the asymptotic null distributions derived in Sections 3 and 4.1, it is relatively straightforward to simulate
them.
5.1 Interval mapping
For any given values of two standard normal random variables $W_1$ and $W_2$, we can
compute $\sup_{0 \le x \le D} Z^2(x)$ where $Z(x)$ is given in (10). Although $x$ does not appear
explicitly in $Z(x)$, recall from (3) that $(\theta_1, \theta_2) = \theta(x, D)$, so $Z^2(x)$ is a function of
$x$. Define
\[
y_{1,2} = \frac{-2e^{2D} W_1 W_2 \pm (W_1^2 + W_2^2) \sqrt{e^{4D} - 1}}{(W_1^2 - W_2^2) \sqrt{e^{4D} - 1} - 2 W_1 W_2},
\]
where $y_1$ corresponds to the $+$ of "$\pm$" and $y_2$ corresponds to the $-$ of "$\pm$". By straightforward computation, it can be shown that the solutions to $dZ^2(x)/dx = 0$ are
\[
x_i = \frac{1}{4} (\log y_i + 2D), \qquad i = 1, 2,
\]
provided that $y_i > 0$. We can then simulate
\[
\sup_{0 \le x \le D} Z^2(x) =
\begin{cases}
\max\{Z^2(0), Z^2(x_i), Z^2(D)\} & \text{if } y_i > 0 \text{ and } 0 \le x_i \le D, \\
\max\{Z^2(0), Z^2(D)\} & \text{otherwise},
\end{cases}
\tag{18}
\]
by simulating $W_1$ and $W_2$ i.i.d. standard normal.
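A minimal Python sketch of this recipe is given below. The function names `z_of_x`, `sup_z2` and `interval_threshold`, the seed and the default number of replicates are assumptions made for illustration. The maximum is taken over the two end points and over every stationary point x_i that falls inside [0, D], which is how (18) is read here.

```python
import numpy as np

def z_of_x(x, D, w1, w2):
    """Z(x) of equation (10) for given W1 = w1, W2 = w2 (Haldane map, D > 0)."""
    r = 0.5 * (1 - np.exp(-2 * D))
    r1 = 0.5 * (1 - np.exp(-2 * x))
    r2 = 0.5 * (1 - np.exp(-2 * (D - x)))
    t1 = r1 * r2 / (1 - r)
    t2 = r1 * (1 - r2) / r
    num = (1 - 2 * t1) * np.sqrt(1 - r) * w1 + (1 - 2 * t2) * np.sqrt(r) * w2
    den = np.sqrt((1 - r) * (1 - 2 * t1) ** 2 + r * (1 - 2 * t2) ** 2)
    return num / den

def sup_z2(D, w1, w2):
    """Supremum of Z^2(x) over [0, D] via the stationary points used in (18)."""
    q = np.sqrt(np.exp(4 * D) - 1)
    den = (w1 ** 2 - w2 ** 2) * q - 2 * w1 * w2
    candidates = [0.0, D]                       # the end points are always candidates
    for sign in (1.0, -1.0):                    # y_1 uses +, y_2 uses -
        y = (-2 * np.exp(2 * D) * w1 * w2 + sign * (w1 ** 2 + w2 ** 2) * q) / den
        if y > 0:
            x = 0.25 * (np.log(y) + 2 * D)
            if 0 <= x <= D:
                candidates.append(x)
    return max(z_of_x(x, D, w1, w2) ** 2 for x in candidates)

def interval_threshold(D, alpha=0.05, n_rep=10000, seed=0):
    """Simulated upper-alpha quantile of sup Z^2(x) for one marker interval."""
    rng = np.random.default_rng(seed)
    draws = rng.standard_normal((n_rep, 2))     # rows are (W1, W2)
    sups = [sup_z2(D, w1, w2) for w1, w2 in draws]
    return float(np.quantile(sups, 1 - alpha))
```

For example, `interval_threshold(0.2)` would give a simulated 0.05-level threshold for a single 20 cM interval. The closed form can also be compared against a brute-force maximum of `z_of_x` over a fine grid of x values, which is a useful check when adapting the code.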
5.2 Composite interval mapping
The expression (14), which is asymptotically equivalent to the score test statistic, involves standard normal random variables $W_i^{(\ell)}$, $i = 1, 2$; $\ell = 1, 2, \ldots, k$. It
can be shown (see Appendix 8.1 for details) that $W_1^{(\ell)}, W_2^{(\ell)}$, $\ell = 1, 2, \ldots, k$, follow a joint multivariate normal distribution with mean 0, variances equal to 1,
$\mathrm{cov}(W_1^{(1)}, W_2^{(1)}) = 0$, and remaining covariances
\[
\begin{aligned}
\mathrm{cov}(W_1^{(\ell')}, W_1^{(\ell'')}) &\approx \sqrt{(1 - r_{\ell'})(1 - r_{\ell''})}\, (1 - 2r_{\ell', \ell''-1}), \\
\mathrm{cov}(W_1^{(\ell')}, W_2^{(\ell'')}) &\approx \sqrt{(1 - r_{\ell'})\, r_{\ell''}}\, (1 - 2r_{\ell', \ell''-1}), \\
\mathrm{cov}(W_2^{(\ell')}, W_1^{(\ell'')}) &\approx -\sqrt{r_{\ell'} (1 - r_{\ell''})}\, (1 - 2r_{\ell', \ell''-1}), \\
\mathrm{cov}(W_2^{(\ell')}, W_2^{(\ell'')}) &\approx -\sqrt{r_{\ell'}\, r_{\ell''}}\, (1 - 2r_{\ell', \ell''-1}),
\end{aligned}
\tag{19}
\]
where $\ell' < \ell''$ and $r_{\ell', \ell''}$ is the recombination fraction between markers $M^{\ell'}$ and
$M^{\ell''}$. Thus the variance-covariance matrix $\Sigma$ of $W_1^{(1)}$, $W_2^{(1)}$ and $W_1^{(\ell)}$, $\ell = 2, 3, \ldots, k$
is well defined. It can also be shown (see Appendix 8.2) that $W_2^{(\ell+1)}$
may be recursively calculated through
\[
W_2^{(\ell+1)} = \frac{1}{\sqrt{r_{\ell+1}}} \Big( \sqrt{1 - r_\ell}\, W_1^{(\ell)} - \sqrt{r_\ell}\, W_2^{(\ell)} - \sqrt{1 - r_{\ell+1}}\, W_1^{(\ell+1)} \Big). \tag{20}
\]
To simulate the null distribution of $\sup_{0 \le d \le D} Z^2(d)$, where $Z(d)$ is in (14),
we can do the following:
(i) Generate $(W_1^{(1)}, W_2^{(1)}, W_1^{(\ell)}, \ell = 2, 3, \ldots, k) \sim N(0, \Sigma)$;
(ii) Compute $W_2^{(\ell+1)}$, $\ell = 1, 2, \ldots, k - 1$ by (20);
(iii) Using (18), for each $\ell$, find
\[
\sup_{\sum_{i=1}^{\ell-1} D_i \le d \le \sum_{i=1}^{\ell} D_i} Z^2(d), \tag{21}
\]
and take the maximum over $\ell = 1, 2, \ldots, k$.
Repeat the above steps many times (for example, 10,000 times), to obtain
the approximate distribution of $\sup_{0 \le d \le D} Z^2(d)$, the asymptotic null distribution of
the MLR test or the maximum of the square of score test statistics. The critical
thresholds can be obtained accordingly.
Note that the random variables $W_1^{(1)}$, $W_2^{(1)}$ and $W_1^{(\ell)}$, $\ell = 2, 3, \ldots, k$, can be
calculated as $A'Z$, where $A$ is the Cholesky factorization of $\Sigma$ (that is, $A'A = \Sigma$)
and $Z$ follows a multivariate standard normal distribution. Note that we only need to
factor the matrix once for the entire simulation.
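Steps (i) to (iii) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation (the paper reports using Matlab): the function names are hypothetical, interval lengths are in Morgans, and the factor 1 - 2r between markers M^{l'} and M^{l''-1} is obtained from the summed Haldane distances of the intervals between them. `numpy.linalg.cholesky` returns a lower-triangular factor L with L L' = Sigma, so L plays the role of A' in the text.

```python
import numpy as np

def haldane_r(D):
    return 0.5 * (1.0 - np.exp(-2.0 * D))

def z2_at(x, D, w1, w2):
    """Z^2 at offset x inside an interval of length D, from (14)."""
    r, r1, r2 = haldane_r(D), haldane_r(x), haldane_r(D - x)
    t1, t2 = r1 * r2 / (1 - r), r1 * (1 - r2) / r
    num = (1 - 2 * t1) * np.sqrt(1 - r) * w1 + (1 - 2 * t2) * np.sqrt(r) * w2
    return num ** 2 / ((1 - r) * (1 - 2 * t1) ** 2 + r * (1 - 2 * t2) ** 2)

def sup_z2_interval(D, w1, w2):
    """Supremum over one interval: end points plus stationary points of (18)."""
    q = np.sqrt(np.exp(4 * D) - 1)
    den = (w1 ** 2 - w2 ** 2) * q - 2 * w1 * w2
    xs = [0.0, D]
    for sign in (1.0, -1.0):
        y = (-2 * np.exp(2 * D) * w1 * w2 + sign * (w1 ** 2 + w2 ** 2) * q) / den
        if y > 0 and 0 <= 0.25 * (np.log(y) + 2 * D) <= D:
            xs.append(0.25 * (np.log(y) + 2 * D))
    return max(z2_at(x, D, w1, w2) for x in xs)

def build_sigma(Ds):
    """Covariance of (W1^(1), W2^(1), W1^(2), ..., W1^(k)) from (19)."""
    Ds = np.asarray(Ds, dtype=float)
    k, r = len(Ds), haldane_r(Ds)
    sigma = np.eye(k + 1)                      # unit variances, cov(W1^(1), W2^(1)) = 0
    pos = lambda ell: 0 if ell == 1 else ell   # row of W1^(ell); W2^(1) sits in row 1
    for l1 in range(1, k + 1):
        for l2 in range(l1 + 1, k + 1):
            # 1 - 2 r_{l1, l2-1}: intervals l1+1 .. l2-1 lie between the two markers
            link = np.exp(-2.0 * Ds[l1:l2 - 1].sum())
            c11 = np.sqrt((1 - r[l1 - 1]) * (1 - r[l2 - 1])) * link
            sigma[pos(l1), pos(l2)] = sigma[pos(l2), pos(l1)] = c11
            if l1 == 1:                        # only W2^(1) is part of the vector
                c21 = -np.sqrt(r[0] * (1 - r[l2 - 1])) * link
                sigma[1, pos(l2)] = sigma[pos(l2), 1] = c21
    return sigma

def composite_threshold(Ds, alpha=0.05, n_rep=10000, seed=0):
    """Simulated upper-alpha quantile of sup Z^2(d) over a whole linkage group."""
    Ds = np.asarray(Ds, dtype=float)
    k, r = len(Ds), haldane_r(Ds)
    L = np.linalg.cholesky(build_sigma(Ds))    # factored once for all replicates
    rng = np.random.default_rng(seed)
    sups = np.empty(n_rep)
    for rep in range(n_rep):
        v = L @ rng.standard_normal(k + 1)     # step (i)
        w1 = np.concatenate(([v[0]], v[2:]))   # W1^(1), ..., W1^(k)
        w2 = np.empty(k)
        w2[0] = v[1]
        for l in range(k - 1):                 # step (ii): recursion (20)
            w2[l + 1] = (np.sqrt(1 - r[l]) * w1[l] - np.sqrt(r[l]) * w2[l]
                         - np.sqrt(1 - r[l + 1]) * w1[l + 1]) / np.sqrt(r[l + 1])
        # step (iii): supremum within each interval, then the maximum over intervals
        sups[rep] = max(sup_z2_interval(Ds[l], w1[l], w2[l]) for l in range(k))
    return float(np.quantile(sups, 1 - alpha))
```

For instance, `composite_threshold([0.2] * 5)` would simulate a threshold for one 100 cM linkage group with six equally spaced markers; a genome-wide threshold would additionally take the maximum over independent linkage groups within each replicate.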
6 Examples
To examine the statistical properties of our proposed score test statistic, we perform a simulation study to determine the asymptotic null distribution and the critical threshold for declaring the existence of a significant QTL. Our results here are
compared to those obtained by Lander and Botstein (1989), Doerge and Churchill
(1996) and Piepho (2001). The score test statistic is further validated using a case
study from a forest tree genome project.
6.1 Simulation
We use the same simulation design as used by Lander and Botstein (1989) to simulate a backcross mapping population of size 250. Thus, our results can be directly
compared with those of Lander and Botstein. In our simulation study, equal spacing
of the six markers is assumed throughout 12 chromosomes of 100 cM each. Five
QTL are hypothesized to be located at genetic positions 70, 49, 27, 8 and 3 cM from
the left end on the first five chromosomes with effects of 1.5, 1.25, 1.0, 0.75 and
0.50, respectively. For each individual, genotypes at the markers were generated
assuming no interference. The corresponding quantitative trait was simulated by
summing the five QTL effects and random standard normal noise.
[Figure 1 plot: twelve panels, Chromosome1 through Chromosome12. Horizontal axes run from 0 to 100; vertical axes run from 0 to 60 for Chromosomes 1 to 3 and from 0 to 15 for Chromosomes 4 to 12.]
Figure 1: Score test for a simulated data set with 250 backcross progeny. As in Lander and Botstein (1989), we assumed that markers are spaced 20 cM throughout 12
chromosomes of 1 Morgan each. The QTLs, located at genetic positions 70, 49, 27,
8 and 3 cM from the left end on the first five chromosomes, were assumed to have
effects 1.5, 1.25, 1.0, 0.75 and 0.50, respectively. For each individual, genotypes at
markers were generated assuming no interference. The corresponding quantitative
trait was simulated by summing the five QTL effects and random standard normal
noise. The solid line is for score test statistics and the dotted line is for likelihood
ratio test statistics (-2log(LR)). The dashed line indicates the threshold of 12.83 at
the 0.05 level based on the simulated asymptotic null distribution. All five QTLs
attained this threshold.
Both the likelihood approach of Lander and Botstein (1989) and the score-statistic approach are used to analyze the simulated marker and phenotype data.
The QTL likelihood profiles are drawn from the likelihood ratio test statistics and
the score test statistics. As shown in Figure 1, these two statistics gave similar
profiles across each of the simulated linkage groups. The 95th percentile of score
test statistics based on 10,000 simulations under the null model is 12.83, which is
used as the threshold for declaring a QTL at the significance level α = 0.05.
The thresholds obtained by this and other approaches for the same data set are summarized in the following table:
Reference                     Method             .05 Threshold
This paper                    Simulation         12.83
Lander and Botstein (1989)    Dense markers      14.51
Lander and Botstein (1989)    Bonferroni         11.16
Piepho (2001)                 Approximation      12.96
Churchill and Doerge (1994)   Permutation test   11.44
From the example, it can be seen that the threshold obtained by our method
is close to Piepho's threshold and higher than that of Churchill and Doerge. Our
large-scale simulation study showed that thresholds from the Bonferroni method tend
to underestimate the cutoff point, whereas thresholds based on dense markers tend
to overestimate it.
6.2 A case study
The study we consider was derived from an interspecific hybridization of Populus
(poplar). A P. deltoides clone (designated I-69) was used as a female parent to mate
with a P. euramericana clone (designated I-45) as a male parent (Wu et al. 1992).
Both P. deltoides I-69 and P. euramericana I-45 were selected at the Research Institute for Poplars in Italy in the 1950s and were introduced to China in 1972. A
genetic linkage map has been constructed using 90 genotypes randomly selected
from the 450 hybrids with random amplified polymorphic DNAs (RAPDs), amplified fragment length polymorphisms (AFLPs) and inter-simple sequence repeats
(ISSRs) (Yin et al. 2002). This map comprises the 19 largest linkage groups for
each parental map, which roughly represent 19 pairs of chromosomes. The 90 hybrid genotypes used for map construction were measured for wood density with
wood samples collected from 11-year-old stems in a field trial in a completely randomized design.
The 19 linkage groups constructed for each parent were scanned for possible existence of QTL affecting wood density using both our score statistic and
the likelihood approach. We successfully detected a wood density QTL on linkage group D17, which is composed of 16 normally segregated markers. The QTL profiles
obtained from these two approaches, as shown in Figure 2, consistently exhibit a
marked peak within a narrow marker interval AG/CGA-480 – AG/CGA-330. Both
the score statistic value and log-likelihood ratio at the peak are greater than the
threshold of 10.10 (α = 0.05) obtained from the simulated asymptotic null distribution. Other approaches are also used to estimate the α = 0.05 level threshold,
which give 7.87 for permutation tests, 8.87 for Piepho’s quick method, 8.73 for the
Bonferroni method and 10.20 for Lander and Botstein’s formula for dense markers. All these suggest that both approaches have consistently detected a significant
QTL, located at the peak of the profiles, affecting wood density in hybrid poplars.
This example illustrates that the score statistic detects the
hypothesized QTL as well as the likelihood approach does. Also, the QTL profiles from
our score statistic and the likelihood approach are similar. However, in the wood
density example, the calculation of the score statistic threshold based on 10,000
simulations from the asymptotic null distribution took only 5 minutes, whereas the
calculation of the threshold based on 10,000 permutation likelihood ratio tests took
4,100 minutes. (All calculations were performed using Matlab software on a Dell
Inspiron 7000 with a 300 MHz CPU.)
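The idea of simulating a threshold from an asymptotic null distribution can be illustrated with a deliberately simplified sketch. The Python code below is not the Matlab program used for the timings above. It assumes, as a simplification, that the statistic is evaluated only at the marker positions, where the limiting Gaussian variables have correlation $e^{-2d}$ for markers separated by a Haldane distance of $d$ Morgans. The function name `simulate_marker_threshold` and the marker spacings are hypothetical.

```python
import numpy as np


def simulate_marker_threshold(distances_cm, alpha=0.05, n_sim=10_000, seed=0):
    """Monte Carlo (1 - alpha) quantile of max_j Z_j^2 over marker positions.

    distances_cm holds the Haldane distances (in cM) between adjacent markers.
    Z_1, ..., Z_k is simulated as a Gaussian Markov chain whose adjacent
    correlation is exp(-2 d) = 1 - 2 r (Haldane map function, d in Morgans).
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(distances_cm, dtype=float) / 100.0  # cM -> Morgans
    rho = np.exp(-2.0 * d)
    k = d.size + 1  # number of markers
    z = np.empty((n_sim, k))
    z[:, 0] = rng.standard_normal(n_sim)
    for j in range(1, k):
        noise = rng.standard_normal(n_sim)
        # AR(1) update keeps unit variance and gives corr(Z_{j-1}, Z_j) = rho
        z[:, j] = rho[j - 1] * z[:, j - 1] + np.sqrt(1.0 - rho[j - 1] ** 2) * noise
    max_sq = (z ** 2).max(axis=1)
    return float(np.quantile(max_sq, 1.0 - alpha))


if __name__ == "__main__":
    spacing = [12.0] * 15  # hypothetical map: 16 equally spaced markers
    print(simulate_marker_threshold(spacing))
```

Because positions inside the marker intervals are ignored, a threshold from this sketch would be expected to fall below one that accounts for the whole chromosome.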
[Figure 2 plot: test statistics (vertical axis, 0 to 12) against Haldane distance from left end, x (horizontal axis, 0 to 180).]
Figure 2: The value of the test statistics along chromosome 17, which has 16 markers. The solid line is for the score test, the dotted line is for the likelihood ratio test
(-2log(LR)), and the dashed line indicates the threshold of 10.10 at the 0.05 level
based on the simulated asymptotic null distribution.
7 Discussion
In this article, we devise a score-statistic method for QTL mapping based on a genetic linkage map. The asymptotic null distribution of the score statistics is derived.
The score test statistic U(x) in equation (6) converges to its asymptotic equivalent
Z(x) quickly, because the convergence depends on only two things: (a) the sample
proportions in the four marker groups converge to the asymptotic proportions and
(b) the standardized sample mean difference in equation (9) converges to a standard
normal distribution. In the variety of cases we studied, a sample size of 150 to 200
is usually sufficient to warrant the use of the threshold based on the asymptotic null
distribution.
Determining the threshold of the test statistic for QTL mapping is an important first
step toward the identification of individual genes that contribute to genetic variation in a complex trait. However, this is not a trivial task because many factors,
such as map density, marker distribution pattern, genome size, and the proportion of
missing marker and phenotypic data, may affect the distribution of the test statistic.
For an infinitely dense map, the test statistic may be approximated by an Ornstein-Uhlenbeck diffusion process for a backcross with a large sample (Lander and Botstein
1989). This result was then extended to more complicated genetic designs (Dupuis
and Siegmund 1999; Zou et al. 2001). Rebai et al. (1994, 1995) derived a conservative threshold by treating the position of a QTL as a nuisance parameter present
only under the alternative hypothesis, so that the theoretical results of Davies (1977,
1987) apply. In this article, we devise a score-statistic method for QTL
mapping based on a genetic linkage map. A score test statistic inherits the optimal
property of the maximum likelihood method, yet it is much easier to compute. It is
asymptotically equivalent to the likelihood ratio test statistic with asymptotic null
distribution being the distribution of the maximum of the square of a well-defined
Gaussian stochastic process. The critical thresholds obtained in Section 5 can be
used for both the MLR test and the maximum of the square of score test statistics.
The computational method in Section 5 is feasible even for a large k, the
number of molecular markers in a genetic linkage map. For a dense map, we can
assume that a QTL is located at the same position as a marker. In this case, the
numbers of recombinant groups $n_2^{(\ell)}$ and $n_3^{(\ell)}$ approach zero for all intervals and
our results on the asymptotic distribution of the MLR test statistic in Section 4 reduce
to those of Lander and Botstein (1989). Results from numerical analyses suggest
that our score-statistic model can be applied reasonably well to sparse maps, thus
making the model more useful in general genetic studies. As an alternative to permutation tests or resampling strategies that make no assumption about the distribution of
the test statistic (Churchill and Doerge 1994; Zou et al. 2004), our model offers
significant computing advantages.
Genetic mapping has now reached a point at which almost every kind
of biological trait and organism can be studied by QTL mapping approaches.
This requires that our model be useful and effective in a broad
spectrum of genetic settings. For outcrossing species, like forest trees, a full-sib
family derived from two heterozygous parents provides an informative design for
QTL mapping (Haley et al. 1994; Xu 1996; Wu et al. 2002; Lin et al. 2004). For
humans and other species, QTL mapping may be based on a set of random samples
drawn from a natural population in which associations between markers and QTL
are studied in terms of linkage disequilibrium. For many complex traits, a one-QTL
model is too simple to characterize their highly complex genetic architecture.
Traits are likely to be controlled by a web of genes, each interacting with one
another and with environmental factors. Given its theoretical foundation, the proposed
method can be readily extended to consider these more realistic situations.
Readers can find our Matlab code for computing the score statistic and the
critical threshold at the website: http://www.ehpr.ufl.edu/code/235.
8 Appendix: Some Technical Details
8.1 The covariance of $Z(d')$ and $Z(d'')$
Assume $(\ell', x')$ and $(\ell'', x'')$ are associated with $d'$ and $d''$ through (13), respectively.
Since the covariance of $Z(d')$ and $Z(d'')$ is available in (11) for $\ell' = \ell''$, we derive
the covariance for $\ell' < \ell''$.
Let $A_{s,t;\ell',\ell''}$ denote the subset of individuals who belong to Group $s$ in interval $\ell'$ and belong to Group $t$ in interval $\ell''$, i.e.,
$$A_{s,t;\ell',\ell''} = \{\, i : 1 \le i \le N,\ G_i^{(\ell')} = s,\ G_i^{(\ell'')} = t \,\}, \quad 1 \le s,t \le 4,\ 1 \le \ell' < \ell'' \le k.$$
Denote the size of $A_{s,t;\ell',\ell''}$ by $n_{s,t;\ell',\ell''}$, i.e.,
$$n_{s,t;\ell',\ell''} = \sum_{i=1}^{N} I\big(G_i^{(\ell')} = s\big)\, I\big(G_i^{(\ell'')} = t\big), \quad 1 \le s,t \le 4,\ 1 \le \ell' < \ell'' \le k. \tag{A.1}$$
Let $N^{(\ell',\ell'')} = (n_{s,t;\ell',\ell''})_{s,t=1,2,3,4}$ be the 4 × 4 matrix with $s,t$ as the row index and
column index, respectively. It is seen that
$$\lim_{N\to\infty} \frac{1}{N} N^{(\ell',\ell'')} = \frac{1}{2}
\begin{pmatrix}
1-r_{\ell'} & 0 & 0 & 0\\
0 & r_{\ell'} & 0 & 0\\
0 & 0 & r_{\ell'} & 0\\
0 & 0 & 0 & 1-r_{\ell'}
\end{pmatrix}
\times
\begin{pmatrix}
1-r_{\ell',\ell''-1} & 1-r_{\ell',\ell''-1} & r_{\ell',\ell''-1} & r_{\ell',\ell''-1}\\
r_{\ell',\ell''-1} & r_{\ell',\ell''-1} & 1-r_{\ell',\ell''-1} & 1-r_{\ell',\ell''-1}\\
1-r_{\ell',\ell''-1} & 1-r_{\ell',\ell''-1} & r_{\ell',\ell''-1} & r_{\ell',\ell''-1}\\
r_{\ell',\ell''-1} & r_{\ell',\ell''-1} & 1-r_{\ell',\ell''-1} & 1-r_{\ell',\ell''-1}
\end{pmatrix}
\times
\begin{pmatrix}
1-r_{\ell''} & 0 & 0 & 0\\
0 & r_{\ell''} & 0 & 0\\
0 & 0 & r_{\ell''} & 0\\
0 & 0 & 0 & 1-r_{\ell''}
\end{pmatrix}. \tag{A.2}$$
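In addition, we have results similar to (4):
$$\begin{aligned}
\lim_{N\to\infty} \frac{n_1^{(\ell')}}{N} &= \lim_{N\to\infty} \frac{n_4^{(\ell')}}{N} = \frac{1}{2}(1 - r_{\ell'}), &
\lim_{N\to\infty} \frac{n_2^{(\ell')}}{N} &= \lim_{N\to\infty} \frac{n_3^{(\ell')}}{N} = \frac{1}{2} r_{\ell'},\\
\lim_{N\to\infty} \frac{n_1^{(\ell'')}}{N} &= \lim_{N\to\infty} \frac{n_4^{(\ell'')}}{N} = \frac{1}{2}(1 - r_{\ell''}), &
\lim_{N\to\infty} \frac{n_2^{(\ell'')}}{N} &= \lim_{N\to\infty} \frac{n_3^{(\ell'')}}{N} = \frac{1}{2} r_{\ell''}.
\end{aligned} \tag{A.3}$$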
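The limit of the joint group proportions can be checked by simulation. The Python sketch below is illustrative only: it assumes a backcross with no interference, codes each marker genotype as 1 (heterozygous) or 0 (homozygous), and uses the group order implied by Section 8.2 (Group 1: both flanking markers heterozygous; Group 2: left heterozygous, right homozygous; Group 3: the reverse; Group 4: both homozygous). The function names and the recombination fractions in the usage lines are hypothetical.

```python
import numpy as np


def group_labels(left, right):
    """Groups 1..4 from the genotypes (1 = Mm, 0 = mm) of the two flanking markers."""
    return 1 + 2 * (1 - left) + (1 - right)


def simulate_count_matrix(r1, r_mid, r2, n, seed=0):
    """Empirical 4 x 4 matrix of joint group proportions for two intervals.

    r1 and r2 are the recombination fractions within the first and second
    interval; r_mid is the recombination fraction between the right marker
    of the first interval and the left marker of the second.
    """
    rng = np.random.default_rng(seed)
    a = rng.integers(0, 2, size=n)                 # left marker, first interval
    b = np.where(rng.random(n) < r1, 1 - a, a)     # right marker, first interval
    c = np.where(rng.random(n) < r_mid, 1 - b, b)  # left marker, second interval
    d = np.where(rng.random(n) < r2, 1 - c, c)     # right marker, second interval
    s = group_labels(a, b)
    t = group_labels(c, d)
    counts = np.zeros((4, 4))
    np.add.at(counts, (s - 1, t - 1), 1.0)
    return counts / n


def limit_matrix(r1, r_mid, r2):
    """Product of the three matrices on the right-hand side of the limit, times 1/2."""
    d1 = np.diag([1 - r1, r1, r1, 1 - r1])
    m = np.array([[1 - r_mid, 1 - r_mid, r_mid, r_mid],
                  [r_mid, r_mid, 1 - r_mid, 1 - r_mid]] * 2)
    d2 = np.diag([1 - r2, r2, r2, 1 - r2])
    return 0.5 * d1 @ m @ d2


if __name__ == "__main__":
    empirical = simulate_count_matrix(0.10, 0.20, 0.15, n=200_000, seed=3)
    print(np.abs(empirical - limit_matrix(0.10, 0.20, 0.15)).max())
```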
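Using (15), (A.2), and (A.3), we have
$$\begin{aligned}
\operatorname{cov}\big(W_1^{(\ell')}, W_1^{(\ell'')}\big)
&\approx \frac{N}{4\sigma^2}\sqrt{(1 - r_{\ell'})(1 - r_{\ell''})}
\;\operatorname{cov}\big(\bar{y}_4^{(\ell')} - \bar{y}_1^{(\ell')},\ \bar{y}_4^{(\ell'')} - \bar{y}_1^{(\ell'')}\big)\\
&= \frac{N}{4}\sqrt{(1 - r_{\ell'})(1 - r_{\ell''})}
\left(\frac{n_{4,4;\ell',\ell''}}{n_4^{(\ell')} n_4^{(\ell'')}} - \frac{n_{4,1;\ell',\ell''}}{n_4^{(\ell')} n_1^{(\ell'')}} - \frac{n_{1,4;\ell',\ell''}}{n_1^{(\ell')} n_4^{(\ell'')}} + \frac{n_{1,1;\ell',\ell''}}{n_1^{(\ell')} n_1^{(\ell'')}}\right)\\
&\approx \sqrt{(1 - r_{\ell'})(1 - r_{\ell''})}\,(1 - 2 r_{\ell',\ell''-1}).
\end{aligned} \tag{A.4}$$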
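Similarly, we have
$$\begin{aligned}
\operatorname{cov}\big(W_1^{(\ell')}, W_2^{(\ell'')}\big) &\approx \sqrt{(1 - r_{\ell'})\, r_{\ell''}}\,(1 - 2 r_{\ell',\ell''-1}),\\
\operatorname{cov}\big(W_2^{(\ell')}, W_1^{(\ell'')}\big) &\approx -\sqrt{r_{\ell'}(1 - r_{\ell''})}\,(1 - 2 r_{\ell',\ell''-1}),\\
\operatorname{cov}\big(W_2^{(\ell')}, W_2^{(\ell'')}\big) &\approx -\sqrt{r_{\ell'}\, r_{\ell''}}\,(1 - 2 r_{\ell',\ell''-1}).
\end{aligned} \tag{A.5}$$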
By (14), (A.4) and (A.5), for $\ell' < \ell''$ the covariance of $Z(d')$ and $Z(d'')$ is
$$\begin{aligned}
\operatorname{cov}\big(Z(d'), Z(d'')\big) = {}& (1 - 2 r_{\ell',\ell''-1})\\
&\times \Big[ (1 - 2\theta_1')(1 - 2\theta_1'')(1 - r_{\ell'})(1 - r_{\ell''})
+ (1 - 2\theta_1')(1 - 2\theta_2'')(1 - r_{\ell'})\, r_{\ell''}\\
&\qquad - (1 - 2\theta_2')(1 - 2\theta_1'')\, r_{\ell'}(1 - r_{\ell''})
- (1 - 2\theta_2')(1 - 2\theta_2'')\, r_{\ell'} r_{\ell''} \Big]\\
&\Big/ \sqrt{(1 - r_{\ell'})(1 - 2\theta_1')^2 + r_{\ell'}(1 - 2\theta_2')^2}\\
&\Big/ \sqrt{(1 - r_{\ell''})(1 - 2\theta_1'')^2 + r_{\ell''}(1 - 2\theta_2'')^2},
\end{aligned} \tag{A.6}$$
where $r_{\ell',\ell''}$ denotes the recombination fraction between two markers $M_{\ell'}$ and $M_{\ell''}$
($\ell' < \ell''$), with $r_{\ell-1,\ell} = r_\ell$ and $r_{\ell,\ell} = 0$.
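For readers who want to evaluate (A.6) numerically, a minimal Python sketch follows. The function name `cov_z` and its argument names are hypothetical. The values of $\theta_1$ and $\theta_2$ at each position are taken as inputs, since they are defined by (3) elsewhere in the paper and are not recomputed here.

```python
import math


def cov_z(theta1_a, theta2_a, r_a, theta1_b, theta2_b, r_b, r_between):
    """Covariance of Z(d') and Z(d'') for positions in two different intervals.

    (theta1_a, theta2_a, r_a) belong to the earlier interval and
    (theta1_b, theta2_b, r_b) to the later one; r_between is the recombination
    fraction between the right marker of the earlier interval and the left
    marker of the later interval (zero for adjacent intervals).
    """
    a1, a2 = 1 - 2 * theta1_a, 1 - 2 * theta2_a
    b1, b2 = 1 - 2 * theta1_b, 1 - 2 * theta2_b
    numerator = (a1 * b1 * (1 - r_a) * (1 - r_b)
                 + a1 * b2 * (1 - r_a) * r_b
                 - a2 * b1 * r_a * (1 - r_b)
                 - a2 * b2 * r_a * r_b)
    scale_a = math.sqrt((1 - r_a) * a1 ** 2 + r_a * a2 ** 2)
    scale_b = math.sqrt((1 - r_b) * b1 ** 2 + r_b * b2 ** 2)
    return (1 - 2 * r_between) * numerator / (scale_a * scale_b)


if __name__ == "__main__":
    print(cov_z(0.02, 0.30, 0.10, 0.05, 0.25, 0.30, 0.20))
```

As an algebraic sanity check, when all four $\theta$ values are zero the bracket in (A.6) collapses to $1 - 2 r_{\ell'}$ and both square roots equal one, so the expression reduces to $(1 - 2 r_{\ell'})(1 - 2 r_{\ell',\ell''-1})$.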
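8.2 The linear relationship between $W_1^{(\ell)}$, $W_2^{(\ell)}$, $W_1^{(\ell+1)}$, and $W_2^{(\ell+1)}$ in (20)
From (A.4) and (A.5), the variance-covariance matrix of $W_1^{(\ell)}$, $W_2^{(\ell)}$, $W_1^{(\ell+1)}$, and $W_2^{(\ell+1)}$ has the form
$$\begin{pmatrix} I_2 & \Gamma\\ \Gamma' & I_2 \end{pmatrix},
\quad\text{where}\quad
\Gamma = \begin{pmatrix}
\sqrt{(1 - r_\ell)(1 - r_{\ell+1})} & \sqrt{(1 - r_\ell)\, r_{\ell+1}}\\
-\sqrt{r_\ell (1 - r_{\ell+1})} & -\sqrt{r_\ell\, r_{\ell+1}}
\end{pmatrix}.$$
It can be seen that the above matrix is singular and hence (20) holds.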
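The singularity can be confirmed numerically. The following Python sketch builds the 4 × 4 matrix for given recombination fractions and checks that the coefficient vector $(\sqrt{1-r_\ell},\ -\sqrt{r_\ell},\ -\sqrt{1-r_{\ell+1}},\ -\sqrt{r_{\ell+1}})$, which matches the linear combination examined later in this section, lies in its null space. The function names and the numerical values are hypothetical.

```python
import numpy as np


def w_covariance(r_l, r_next):
    """4 x 4 variance-covariance matrix [[I_2, Gamma], [Gamma', I_2]]."""
    gamma = np.array([
        [np.sqrt((1 - r_l) * (1 - r_next)), np.sqrt((1 - r_l) * r_next)],
        [-np.sqrt(r_l * (1 - r_next)), -np.sqrt(r_l * r_next)],
    ])
    top = np.hstack([np.eye(2), gamma])
    bottom = np.hstack([gamma.T, np.eye(2)])
    return np.vstack([top, bottom])


def null_vector(r_l, r_next):
    """Coefficients of the linear combination of the four W's that has zero variance."""
    return np.array([np.sqrt(1 - r_l), -np.sqrt(r_l),
                     -np.sqrt(1 - r_next), -np.sqrt(r_next)])


if __name__ == "__main__":
    sigma = w_covariance(0.12, 0.27)
    c = null_vector(0.12, 0.27)
    print(sigma @ c)                    # numerically zero
    print(np.linalg.eigvalsh(sigma))    # smallest eigenvalue is numerically zero
```

Here $\Gamma$ is the outer product of two unit vectors, which is why a zero eigenvalue appears for every choice of $r_\ell$ and $r_{\ell+1}$ in $(0, 1)$.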
The relationship (20) has an intuitive interpretation. Denote by $B_{s;\ell}$ the
subset of individuals who belong to Group $s$ in interval $\ell$, i.e.,
$$B_{s;\ell} = \{\, i : 1 \le i \le N,\ G_i^{(\ell)} = s \,\}, \quad 1 \le s \le 4;\ 1 \le \ell \le k.$$
Consider two consecutive intervals $[M_{\ell-1}, M_\ell]$ and $[M_\ell, M_{\ell+1}]$. For any individual
in Group 1 or Group 3 (Group 2 or Group 4) of the first interval, the marker condition is $(M^\ell, m^\ell)$
($(m^\ell, m^\ell)$) at marker $M_\ell$. Thus this individual must belong to Group 1 or Group 2
(Group 3 or Group 4) in interval $[M_\ell, M_{\ell+1}]$. Therefore,
$$B_{1;\ell} \cup B_{3;\ell} = B_{1;\ell+1} \cup B_{2;\ell+1}$$
and similarly,
$$B_{2;\ell} \cup B_{4;\ell} = B_{3;\ell+1} \cup B_{4;\ell+1}.$$
Consequently,
$$n_1^{(\ell)} \bar{y}_1^{(\ell)} + n_3^{(\ell)} \bar{y}_3^{(\ell)} = n_1^{(\ell+1)} \bar{y}_1^{(\ell+1)} + n_2^{(\ell+1)} \bar{y}_2^{(\ell+1)}$$
and
$$n_2^{(\ell)} \bar{y}_2^{(\ell)} + n_4^{(\ell)} \bar{y}_4^{(\ell)} = n_3^{(\ell+1)} \bar{y}_3^{(\ell+1)} + n_4^{(\ell+1)} \bar{y}_4^{(\ell+1)}. \tag{A.7}$$
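Combining (15), (A.3), and (A.7) we see that
$$\begin{aligned}
&\sqrt{1 - r_\ell}\, W_1^{(\ell)} - \sqrt{r_\ell}\, W_2^{(\ell)} - \sqrt{1 - r_{\ell+1}}\, W_1^{(\ell+1)} - \sqrt{r_{\ell+1}}\, W_2^{(\ell+1)}\\
&\quad\approx \frac{\sqrt{N}}{2}\Big[ (1 - r_\ell)\big(\bar{y}_4^{(\ell)} - \bar{y}_1^{(\ell)}\big) - r_\ell\big(\bar{y}_3^{(\ell)} - \bar{y}_2^{(\ell)}\big)\\
&\qquad\qquad - (1 - r_{\ell+1})\big(\bar{y}_4^{(\ell+1)} - \bar{y}_1^{(\ell+1)}\big) - r_{\ell+1}\big(\bar{y}_3^{(\ell+1)} - \bar{y}_2^{(\ell+1)}\big) \Big]\\
&\quad\approx \frac{1}{\sqrt{N}}\Big[ \big(n_4^{(\ell)} \bar{y}_4^{(\ell)} - n_1^{(\ell)} \bar{y}_1^{(\ell)}\big) - \big(n_3^{(\ell)} \bar{y}_3^{(\ell)} - n_2^{(\ell)} \bar{y}_2^{(\ell)}\big)\\
&\qquad\qquad - \big(n_4^{(\ell+1)} \bar{y}_4^{(\ell+1)} - n_1^{(\ell+1)} \bar{y}_1^{(\ell+1)}\big) - \big(n_3^{(\ell+1)} \bar{y}_3^{(\ell+1)} - n_2^{(\ell+1)} \bar{y}_2^{(\ell+1)}\big) \Big]\\
&\quad= 0,
\end{aligned}$$
which provides a justification of (20).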
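The group-sum identities in (A.7) hold exactly for any data set, because both sides add the phenotypes of the same individuals. The Python sketch below checks this on simulated data; the marker coding (1 for heterozygous, 0 for homozygous), the function names, and the parameter values are hypothetical choices for illustration.

```python
import numpy as np


def group_sums(y, left, right):
    """Sums of y within Groups 1..4 of an interval (that is, n_s times ybar_s).

    Index 0 is Group 1 (left = 1, right = 1), index 1 is Group 2 (1, 0),
    index 2 is Group 3 (0, 1) and index 3 is Group 4 (0, 0).
    """
    groups = 2 * (1 - left) + (1 - right)
    return np.bincount(groups, weights=y, minlength=4)


def check_identity(n=500, r_l=0.1, r_next=0.2, seed=0):
    """Return both sides of the two identities for one simulated backcross sample."""
    rng = np.random.default_rng(seed)
    m_prev = rng.integers(0, 2, size=n)                            # marker M_{l-1}
    m_mid = np.where(rng.random(n) < r_l, 1 - m_prev, m_prev)      # marker M_l
    m_next = np.where(rng.random(n) < r_next, 1 - m_mid, m_mid)    # marker M_{l+1}
    y = rng.normal(size=n)                                         # phenotypes
    s = group_sums(y, m_prev, m_mid)    # interval l
    t = group_sums(y, m_mid, m_next)    # interval l + 1
    lhs = np.array([s[0] + s[2], s[1] + s[3]])
    rhs = np.array([t[0] + t[1], t[2] + t[3]])
    return lhs, rhs


if __name__ == "__main__":
    print(check_identity())
```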
8.3 Equivalence of the suprema of the MLR and score statistic
In this section we demonstrate the equivalence of the suprema of the MLR statistic and the maximum of the square of score test statistic over one marker interval
$[M_0, M_1]$. The equivalence over multiple intervals follows immediately. Specifically, we prove Theorem 1 of Section 3:
$$\lim_{N\to\infty} \Big| \sup_{x\in[0,D]} 2\big\{\ell_N(\hat{\eta}^{(N)}(x), x) - \ell_N(\hat{\eta}_0^{(N)})\big\} - \sup_{x\in[0,D]} U^2(x) \Big| = 0 \quad \text{a.s.}$$
where the log likelihood $\ell$ and the score statistics $U^2(x)$ are described in Section 3,
and the almost sure convergence is with respect to the distribution of the phenotypic
trait under $H_0$.
The proof of Theorem 1 is based on two lemmas. We will state the two
lemmas and then, using them, will prove the theorem. In separate sections we will
prove the two lemmas, and two other needed results. Before stating the lemmas we
review some notation.
The genotype variable $G^{\ell}$ is defined at the beginning of Section 4.1. The
superscript “$\ell$” will be omitted since we focus on only one interval $[M_0, M_1]$. The
probability function of $G$ is defined by
$$p(g) = P(G = g) = \begin{cases}
\frac{1}{2}(1 - r), & g = 1,\\
\frac{1}{2} r, & g = 2,\\
\frac{1}{2} r, & g = 3,\\
\frac{1}{2}(1 - r), & g = 4.
\end{cases}$$
To express the likelihood function in a compact form, we define
$$\varphi(y; \lambda, \theta) = (1 - \theta)\, e^{-\frac{1}{2\sigma^2}(y - \mu_1)^2} + \theta\, e^{-\frac{1}{2\sigma^2}(y - \mu_2)^2},$$
where $\lambda = (\mu_1, \mu_2, \sigma)^T$. We will also use the alternative parameterization $\mu_1 = \mu - \Delta$, $\mu_2 = \mu + \Delta$ and $\eta = (\Delta, \mu, \sigma)^T$.
The density function of the phenotype variable $Y$ conditional on $G = j$ is
$$f_{Y|G=j}(y; \lambda, x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \varphi(y; \lambda, \theta_j^*),$$
where $\theta_j^* = \theta_1,\ \theta_2,\ 1 - \theta_2,\ 1 - \theta_1$, for $j = 1, 2, 3, 4$, respectively, and $\theta_1$, $\theta_2$ are functions of $x$ as defined in (3).
Assume the observations are $(Y_i, G_i) = (y_i, g_i)$, $i = 1, \ldots, N$, and are arranged
so that the first $n_1$ observations belong to group 1; the next $n_2$ belong to group 2;
and so on. Writing $N_j = n_1 + \cdots + n_j$ for the cumulative group sizes, the likelihood function conditional on the genotype variable is
$$f(y^{(N)}; \lambda, x) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{N}
\prod_{i=1}^{N_1} \varphi(y_i; \lambda, \theta_1)
\prod_{i=N_1+1}^{N_2} \varphi(y_i; \lambda, \theta_2)
\prod_{i=N_2+1}^{N_3} \varphi(y_i; \lambda, 1 - \theta_2)
\prod_{i=N_3+1}^{N} \varphi(y_i; \lambda, 1 - \theta_1),$$
where $y^{(N)} = (y_1, \ldots, y_N)$. The full likelihood function for observations $(Y_i, G_i) = (y_i, g_i)$, $i = 1, \ldots, N$, is
$$L_N(\lambda; x) = f(y^{(N)}; \lambda, x) \prod_{j=1}^{4} [p(j)]^{n_j}. \tag{A.8}$$
Since the second factor does not contain parameters of interest, we consider the
log likelihood function for $y^{(N)}$ conditional on the genotype variable, $\ell_N(\lambda; x) = \log f(y^{(N)}; \lambda, x)$, which is given in (5).
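The conditional log likelihood can be evaluated directly from these definitions. The Python sketch below is a minimal illustration, not the authors' Matlab code; the function name `log_likelihood` is hypothetical, and $\theta_1$, $\theta_2$ are passed in as numbers because their dependence on $x$ through (3) is not reproduced here.

```python
import numpy as np


def log_likelihood(y, group, mu1, mu2, sigma, theta1, theta2):
    """Log of f(y; lambda, x) conditional on the genotype groups.

    y      : phenotypes, length N
    group  : group labels in {1, 2, 3, 4}, length N
    theta1, theta2 : mixing proportions at the putative QTL position x
    """
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    # theta_j^* = theta1, theta2, 1 - theta2, 1 - theta1 for groups 1..4
    theta_star = np.array([theta1, theta2, 1 - theta2, 1 - theta1])[group - 1]
    phi = ((1 - theta_star) * np.exp(-(y - mu1) ** 2 / (2 * sigma ** 2))
           + theta_star * np.exp(-(y - mu2) ** 2 / (2 * sigma ** 2)))
    return float(np.sum(np.log(phi)) - y.size * np.log(np.sqrt(2 * np.pi) * sigma))


if __name__ == "__main__":
    y = [9.1, 10.4, 11.0, 12.3]
    g = [1, 2, 3, 4]
    print(log_likelihood(y, g, mu1=9.5, mu2=11.5, sigma=1.0, theta1=0.05, theta2=0.4))
```

When $\mu_1 = \mu_2$ the mixture collapses to a single normal density, so the value no longer depends on $\theta_1$ and $\theta_2$; this mirrors the fact that $x$ is unidentifiable under $H_0$.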
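For a fixed $x \in [0, D]$, the mle's of $\mu_1, \mu_2, \sigma, \Delta, \mu, \lambda, \eta$ are denoted, respectively, by $\hat{\mu}_1^{(N)}(x)$, $\hat{\mu}_2^{(N)}(x)$, $\hat{\sigma}^{(N)}(x)$, $\hat{\Delta}^{(N)}(x)$, $\hat{\mu}^{(N)}(x)$,
$$\hat{\lambda}^{(N)}(x) = \big(\hat{\mu}_1^{(N)}(x),\ \hat{\mu}_2^{(N)}(x),\ \hat{\sigma}^{(N)}(x)\big)^T,$$
$$\hat{\eta}^{(N)}(x) = \big(\hat{\Delta}^{(N)}(x),\ \hat{\mu}^{(N)}(x),\ \hat{\sigma}^{(N)}(x)\big)^T,$$
with $\hat{\mu}_1^{(N)}(x) = \hat{\mu}^{(N)}(x) - \hat{\Delta}^{(N)}(x)$ and $\hat{\mu}_2^{(N)}(x) = \hat{\mu}^{(N)}(x) + \hat{\Delta}^{(N)}(x)$.
Let $\mu_0$ be the true common value of $\mu_1$ and $\mu_2$ and $\sigma_0$ be the true value of
$\sigma$ under $H_0$, and denote the normal density function with mean $\mu_0$ and variance $\sigma_0^2$
by $n(y; \mu_0, \sigma_0)$. Under $\lambda = \lambda_0 = (\mu_0, \mu_0, \sigma_0)$, the densities of $(Y, G)$ and $Y$ are
$$f_{Y,G}(y, g; \lambda_0, x) = n(y; \mu_0, \sigma_0)\, p(g) \quad\text{and}\quad f_Y(y; \lambda_0, x) = n(y; \mu_0, \sigma_0),$$
neither of which depends on $x$ under $H_0$.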
From now on, all probabilities P0 (·), expectations E0 (·), variances Var0 (·),
and all probability statements such as “converges in probability”, “converges almost
surely (a.s.)”, and “with probability one” are under λ = λ 0 .
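Lemma 1 (Information Matrix) The information matrix $I(x)$ is given by
$$I(x) = -\frac{1}{N} E_0 \left.\begin{pmatrix}
\dfrac{\partial^2 \ell}{\partial \Delta^2} & \dfrac{\partial^2 \ell}{\partial \Delta\, \partial \mu} & \dfrac{\partial^2 \ell}{\partial \Delta\, \partial \sigma}\\[2ex]
\dfrac{\partial^2 \ell}{\partial \mu\, \partial \Delta} & \dfrac{\partial^2 \ell}{\partial \mu^2} & \dfrac{\partial^2 \ell}{\partial \mu\, \partial \sigma}\\[2ex]
\dfrac{\partial^2 \ell}{\partial \sigma\, \partial \Delta} & \dfrac{\partial^2 \ell}{\partial \sigma\, \partial \mu} & \dfrac{\partial^2 \ell}{\partial \sigma^2}
\end{pmatrix}\right|_{\eta = \eta_0}
= \frac{1}{\sigma_0^2}\begin{pmatrix} \gamma(x) & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 2 \end{pmatrix}, \tag{A.9}$$
where $\gamma(x) = (1 - 2\theta_1)^2 (1 - r) + (1 - 2\theta_2)^2 r$ (recall that $\theta_1$ and $\theta_2$ depend on $x$;
see (3)).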
A sequence of random variables $X_N(\eta, x)$ is almost surely (a.s.) uniformly
bounded with respect to $\eta \in \Gamma = \{(\Delta, \mu, \sigma) : |\mu - \Delta| \le \xi,\ |\mu + \Delta| < \xi,\ 0 < \sigma_1 \le \sigma \le \sigma_2 < \infty\}$ and $x \in [0, D]$ if for any $\varepsilon > 0$, there exists a constant $K$ such that
$$P_0\Big[ \sup_{\substack{\eta \in \Gamma\\ x \in [0,D]\\ N = 1, 2, \ldots}} \big| X_N(\eta, x) \big| \le K \Big] > 1 - \varepsilon.$$
A sequence of random variables $X_N(x)$ converges to a random variable $Y(x)$
almost surely and uniformly for $x \in [0, D]$ if
$$\lim_{N\to\infty} \sup_{x \in [0,D]} |X_N(x) - Y(x)| = 0 \quad \text{a.s.}$$
Two sequences of random variables $X_N(x)$ and $Y_N(x)$ are called asymptotically equivalent, denoted by $X_N(x) \doteq Y_N(x)$, if $X_N(x) - Y_N(x)$ converges to zero a.s.
and uniformly for $x \in [0, D]$.
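Lemma 2 (Higher Order Terms)
(A) All elements of the 3 × 3 × 3 array
$$\frac{1}{N} \frac{\partial^3}{\partial \eta^3} \ell_N(\eta, x) \tag{A.10}$$
are a.s. bounded uniformly for $\eta \in \Gamma$ and $x \in [0, D]$.
(B)
$$\sqrt{N}\, [I(x)]^{\frac{1}{2}} \big(\hat{\eta}^{(N)}(x) - \eta_0\big) \doteq \frac{1}{\sqrt{N}}\, [I(x)]^{-\frac{1}{2}} \left.\frac{\partial}{\partial \eta} \ell_N(\eta, x)\right|_{\eta = \eta_0}.$$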
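8.3.1 Proof of Theorem 1:
It is sufficient to show that
$$2\big\{\ell_N(\hat{\eta}^{(N)}(x), x) - \ell_N(\hat{\eta}_0^{(N)})\big\} \doteq U^2(x). \tag{A.11}$$
Recall from Section 3 that, under $H_0$, we can write $U(x)$ as
$$U(x) = \frac{u(\hat{\mu}_0^{(N)}, \hat{\sigma}_0^{(N)}, x)}{\sqrt{\operatorname{Var}_0[u(\hat{\mu}_0^{(N)}, \hat{\sigma}_0^{(N)}, x)]}}.$$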
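It follows that
$$\begin{aligned}
\frac{u(\mu_0, \sigma_0, x) - u(\hat{\mu}_0^{(N)}, \hat{\sigma}_0^{(N)}, x)}{\sqrt{\operatorname{Var}_0[u(\hat{\mu}_0^{(N)}, \hat{\sigma}_0^{(N)}, x)]}}
&= \frac{1}{\sigma_0^2}\, \frac{(\hat{\mu}_0^{(N)} - \mu_0)\big[(1 - 2\theta_1)(n_4 - n_1) + (1 - 2\theta_2)(n_3 - n_2)\big]}{\sqrt{\operatorname{Var}_0[u(\hat{\mu}_0^{(N)}, \hat{\sigma}_0^{(N)}, x)]}}\\
&\quad + \left(\frac{1}{\sigma_0^2} - \frac{1}{(\hat{\sigma}_0^{(N)})^2}\right) \frac{u(\hat{\mu}_0^{(N)}, \hat{\sigma}_0^{(N)}, x)}{\sqrt{\operatorname{Var}_0[u(\hat{\mu}_0^{(N)}, \hat{\sigma}_0^{(N)}, x)]}}\\
&\doteq 0,
\end{aligned}$$
and hence
$$U(x) \doteq \frac{u(\mu_0, \sigma_0, x)}{\sqrt{\operatorname{Var}_0[u(\hat{\mu}_0^{(N)}, \hat{\sigma}_0^{(N)}, x)]}}.$$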
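Next, using the facts that
$$\frac{1}{N} \operatorname{Var}_0[u(\hat{\mu}_0^{(N)}, \hat{\sigma}_0^{(N)}, x)] \doteq \frac{1}{\sigma_0^2} \gamma(x)
\quad\text{and}\quad
\left.\frac{\partial}{\partial \Delta} \ell_N(\eta, x)\right|_{\eta = \eta_0} = u(\mu_0, \sigma_0, x),$$
we can write
$$U(x) \doteq \frac{u(\mu_0, \sigma_0, x)}{\sqrt{\operatorname{Var}_0[u(\hat{\mu}_0^{(N)}, \hat{\sigma}_0^{(N)}, x)]}}
\doteq \frac{\left.\frac{\partial}{\partial \Delta} \ell_N(\eta, x)\right|_{\eta = \eta_0}}{\sqrt{\frac{N}{\sigma_0^2} \gamma(x)}}
\doteq \frac{\sqrt{N}}{\sigma_0}\, \hat{\Delta}^{(N)}(x) \sqrt{\gamma(x)}, \tag{A.12}$$
where the last equality follows from Lemma 2 (B).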
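To make the chain of equivalences concrete, the score statistic can be computed from data in a few lines. The Python sketch below is an illustration under stated assumptions, not the definition in equation (6): it uses the form of $u$ given by the first derivative of the log likelihood with respect to $\Delta$ at the null (see the proof of Lemma 1), replaces $\operatorname{Var}_0[u]$ by its asymptotic equivalent $N\gamma(x)/\sigma_0^2$ with the null MLE plugged in for $\sigma_0$, and takes $\theta_1$, $\theta_2$ as numerical inputs. The function names are hypothetical.

```python
import numpy as np


def gamma_x(theta1, theta2, r):
    """gamma(x) = (1 - 2 theta1)^2 (1 - r) + (1 - 2 theta2)^2 r."""
    return (1 - 2 * theta1) ** 2 * (1 - r) + (1 - 2 * theta2) ** 2 * r


def score_statistic(y, group, theta1, theta2, r):
    """Approximate U(x) from phenotypes y and group labels in {1, 2, 3, 4}."""
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    n = y.size
    mu0_hat = y.mean()                          # null MLE of the common mean
    sigma0_sq_hat = np.mean((y - mu0_hat) ** 2)  # null MLE of the variance
    # Signed weights of the four groups in the score for Delta at the null
    coef = np.array([-(1 - 2 * theta1), -(1 - 2 * theta2),
                     (1 - 2 * theta2), (1 - 2 * theta1)])[group - 1]
    u = np.sum(coef * (y - mu0_hat)) / sigma0_sq_hat
    # Plug-in version of Var_0[u], assumed here to be N * gamma(x) / sigma_0^2
    return float(u / np.sqrt(n * gamma_x(theta1, theta2, r) / sigma0_sq_hat))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r = 0.2
    g = rng.choice([1, 2, 3, 4], size=200, p=[0.4, 0.1, 0.1, 0.4])
    y = rng.normal(10.0, 2.0, size=200)   # data generated under H0
    print(score_statistic(y, g, theta1=0.05, theta2=0.4, r=r) ** 2)
```

Under $H_0$ and for a fixed $x$, the square of this quantity should behave approximately like a chi-square variable with one degree of freedom, which can be checked by repeating the simulation many times.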
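Next we consider the likelihood ratio statistic. Using a Taylor expansion,
together with Lemma 2 (A), we can write
$$2\big[\ell_N(\hat{\eta}^{(N)}(x), x) - \ell_N(\eta_0)\big] \doteq N \big(\hat{\eta}^{(N)}(x) - \eta_0\big)^T I(x) \big(\hat{\eta}^{(N)}(x) - \eta_0\big), \tag{A.13}$$
and, under $H_0$, since there is no dependence on $x$, we can use classical results to
write
$$2\big[\ell_N(\hat{\eta}_0^{(N)}) - \ell_N(\eta_0)\big] \doteq \frac{N}{\sigma_0^2} \big(\hat{\nu}_0^{(N)} - \nu_0\big)^T \begin{pmatrix} 1 & 0\\ 0 & 2 \end{pmatrix} \big(\hat{\nu}_0^{(N)} - \nu_0\big), \tag{A.14}$$
where $\nu = (\mu, \sigma)^T$, $\nu_0 = (\mu_0, \sigma_0)^T$, and $\hat{\nu}_0^{(N)} = (\hat{\mu}_0^{(N)}, \hat{\sigma}_0^{(N)})^T$.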
Using the derivatives of the log likelihood (see the proof of Lemma 1) we
have
$$\sqrt{N} \big(\hat{\nu}_0^{(N)} - \nu_0\big) \doteq \frac{\sigma_0^2}{\sqrt{N}} \begin{pmatrix} 1 & 0\\ 0 & \frac{1}{2} \end{pmatrix} \left.\frac{\partial}{\partial \nu} \ell_N(\eta, x)\right|_{\eta = \eta_0}
\doteq \sqrt{N} \left[ \begin{pmatrix} \hat{\mu}^{(N)}(x)\\ \hat{\sigma}^{(N)}(x) \end{pmatrix} - \nu_0 \right],$$
where the last equality follows from Lemma 2 (B). The likelihood ratio statistic is
obtained by subtracting (A.14) from (A.13), which results in
$$2\big[\ell(y^{(N)}; \hat{\eta}^{(N)}(x), x) - \ell(y^{(N)}; \hat{\eta}_0^{(N)})\big] \doteq \frac{N}{\sigma_0^2} \big(\hat{\Delta}^{(N)}(x)\big)^2 \gamma(x) \doteq U^2(x),$$
where the last equality follows from (A.12). The proof of Theorem 1 is now complete.
8.3.2 Proof of Lemma 1
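The proof of Lemma 1 is quite straightforward, and we only sketch it here. Evaluation of the derivatives of the log likelihood is long and tedious, and we omit the
details. However, for completeness, we give the derivatives evaluated at the null
parameter.
$$\begin{aligned}
\left.\frac{\partial \ell_N}{\partial \Delta}\right|_{\eta = \eta_0} &= \frac{1}{\sigma_0^2}\Bigg[ -(1 - 2\theta_1) \sum_{i=1}^{N_1} (y_i - \mu_0) - (1 - 2\theta_2) \sum_{i=N_1+1}^{N_2} (y_i - \mu_0)\\
&\qquad\quad + (1 - 2\theta_2) \sum_{i=N_2+1}^{N_3} (y_i - \mu_0) + (1 - 2\theta_1) \sum_{i=N_3+1}^{N} (y_i - \mu_0) \Bigg],\\
\left.\frac{\partial \ell_N}{\partial \mu}\right|_{\eta = \eta_0} &= \frac{1}{\sigma_0^2} \sum_{i=1}^{N} (y_i - \mu_0),\\
\left.\frac{\partial \ell_N}{\partial \sigma}\right|_{\eta = \eta_0} &= -\frac{N}{\sigma_0} + \frac{1}{\sigma_0^3} \sum_{i=1}^{N} (y_i - \mu_0)^2,\\
\left.\frac{\partial^2 \ell_N}{\partial \Delta^2}\right|_{\eta = \eta_0} &= -\frac{N}{\sigma_0^2} + \frac{1}{\sigma_0^4}\Bigg[ \sum_{i=1}^{N} (y_i - \mu_0)^2 - (1 - 2\theta_1)^2 \sum_{i=1}^{N_1} (y_i - \mu_0)^2 - (1 - 2\theta_2)^2 \sum_{i=N_1+1}^{N_2} (y_i - \mu_0)^2\\
&\qquad\qquad\qquad - (1 - 2\theta_2)^2 \sum_{i=N_2+1}^{N_3} (y_i - \mu_0)^2 - (1 - 2\theta_1)^2 \sum_{i=N_3+1}^{N} (y_i - \mu_0)^2 \Bigg],\\
\left.\frac{\partial^2 \ell_N}{\partial \mu^2}\right|_{\eta = \eta_0} &= -\frac{N}{\sigma_0^2},\\
\left.\frac{\partial^2 \ell_N}{\partial \sigma^2}\right|_{\eta = \eta_0} &= \frac{N}{\sigma_0^2} - \frac{3}{\sigma_0^4} \sum_{i=1}^{N} (y_i - \mu_0)^2,\\
\left.\frac{\partial^2 \ell_N}{\partial \Delta\, \partial \mu}\right|_{\eta = \eta_0} &= \frac{1}{\sigma_0^2}\Big[ (1 - 2\theta_1)(n_1 - n_4) + (1 - 2\theta_2)(n_2 - n_3) \Big],\\
\left.\frac{\partial^2 \ell_N}{\partial \Delta\, \partial \sigma}\right|_{\eta = \eta_0} &= \frac{2}{\sigma_0^3}\Bigg[ (1 - 2\theta_1) \sum_{i=1}^{N_1} (y_i - \mu_0) + (1 - 2\theta_2) \sum_{i=N_1+1}^{N_2} (y_i - \mu_0)\\
&\qquad\quad - (1 - 2\theta_2) \sum_{i=N_2+1}^{N_3} (y_i - \mu_0) - (1 - 2\theta_1) \sum_{i=N_3+1}^{N} (y_i - \mu_0) \Bigg],\\
\left.\frac{\partial^2 \ell_N}{\partial \mu\, \partial \sigma}\right|_{\eta = \eta_0} &= \frac{-2}{\sigma_0^3} \sum_{i=1}^{N} (y_i - \mu_0).
\end{aligned}$$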
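The diagonal of the information matrix in Lemma 1 can be recovered numerically from the second derivatives listed above. The Python sketch below is an illustrative Monte Carlo check under $H_0$: it draws groups with probabilities $p(g)$, draws phenotypes from the null normal distribution, and averages the negative second derivatives. The function name and all numerical values are hypothetical.

```python
import numpy as np


def neg_mean_second_derivs(y, group, mu0, sigma0, theta1, theta2):
    """Return -(1/N) times the second derivatives of l_N with respect to
    Delta, mu and sigma (diagonal terms only), evaluated at eta_0."""
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    n = y.size
    z2 = (y - mu0) ** 2
    # (1 - 2 theta_j^*)^2 is the same for groups 1 and 4, and for groups 2 and 3
    w = np.array([(1 - 2 * theta1) ** 2, (1 - 2 * theta2) ** 2,
                  (1 - 2 * theta2) ** 2, (1 - 2 * theta1) ** 2])[group - 1]
    d_delta = -n / sigma0 ** 2 + (z2.sum() - (w * z2).sum()) / sigma0 ** 4
    d_mu = -n / sigma0 ** 2
    d_sigma = n / sigma0 ** 2 - 3 * z2.sum() / sigma0 ** 4
    return -np.array([d_delta, d_mu, d_sigma]) / n


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    r, theta1, theta2 = 0.2, 0.05, 0.4
    mu0, sigma0, n = 10.0, 2.0, 200_000
    p = [0.5 * (1 - r), 0.5 * r, 0.5 * r, 0.5 * (1 - r)]
    g = rng.choice([1, 2, 3, 4], size=n, p=p)
    y = rng.normal(mu0, sigma0, size=n)
    gamma = (1 - 2 * theta1) ** 2 * (1 - r) + (1 - 2 * theta2) ** 2 * r
    print(neg_mean_second_derivs(y, g, mu0, sigma0, theta1, theta2))
    print(np.array([gamma, 1.0, 2.0]) / sigma0 ** 2)   # diagonal of I(x) in (A.9)
```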
8.3.3 Proof of Lemma 2
The proof of Lemma 2 relies on two other results, which we state here and prove in
the next sections.
Lemma 3 (Boundedness of MLE)
(A) For any $x \in [0, D]$, the mle $\hat{\lambda}^{(N)}(x)$ exists with probability one.
(B) For any $\delta > 0$, there exist a measurable subset $A$ of the sample space with
$P(A) \ge 1 - \delta$, real numbers $\sigma_1, \sigma_2$ satisfying $0 < \sigma_1 < \sigma_0 < \sigma_2 < \infty$, a real
number $\xi > 0$ and an integer $M > 0$, such that for any sample point in $A$,
$x \in [0, D]$, and $N > M$, we have
$$\big|\hat{\mu}_j^{(N)}(x)\big| \le \xi,\ j = 1, 2, \quad\text{and}\quad \sigma_1 \le \big|\hat{\sigma}^{(N)}(x)\big| \le \sigma_2,$$
where $\delta$, $A$, $\sigma_1$, $\sigma_2$, $\xi$, and $M$ are independent of $x$.
Lemma 4 (Uniform convergence of the MLE) The MLEs $\hat{\mu}_1^{(N)}(x)$, $\hat{\mu}_2^{(N)}(x)$, and
$\hat{\sigma}^{(N)}(x)$ converge to $\mu_0$, $\mu_0$, and $\sigma_0$, respectively, a.s. and uniformly for
$x \in [0, D]$.
Since the second order moments of the elements in (A.10) are bounded uniformly for $\eta \in \Gamma$ and $x \in [0, D]$, the conclusion in (A) follows from Chebyshev’s
inequality, which also implies that
$$\frac{1}{\sqrt{N}} \left.\frac{\partial}{\partial \eta} \ell_N(\eta, x)\right|_{\eta = \eta_0} \tag{A.15}$$
is a.s. bounded uniformly for $x \in [0, D]$, since the elements in (A.15) have mean zero
and variances bounded uniformly for $x \in [0, D]$. Applying Chebyshev’s inequality
once more, we can show that
$$-\frac{1}{N} \left.\frac{\partial^2}{\partial \eta^2} \ell_N(\eta, x)\right|_{\eta = \eta_0} \doteq I(x), \tag{A.16}$$
since, by the definition of $I(x)$ in (A.9), the right-hand side of (A.16) is the mean of the left-hand side and the variances
of the elements of the left-hand side converge to zero uniformly for $x \in [0, D]$.
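From the Taylor expansion
$$\begin{aligned}
&\frac{\partial}{\partial \eta} \ell_N(\hat{\eta}^{(N)}(x), x) - \left.\frac{\partial}{\partial \eta} \ell_N(\eta, x)\right|_{\eta = \eta_0}\\
&\quad= \left.\frac{\partial^2}{\partial \eta^2} \ell_N(\eta, x)\right|_{\eta = \eta_0} \big(\hat{\eta}^{(N)}(x) - \eta_0\big)
+ \frac{1}{2} \frac{\partial^3}{\partial \eta^3} \ell_N(\eta^*(x), x)\big[\hat{\eta}^{(N)}(x) - \eta_0,\ \hat{\eta}^{(N)}(x) - \eta_0\big],
\end{aligned}$$
where $\eta^*(x) = \eta_0 + \zeta(\hat{\eta}^{(N)}(x) - \eta_0)$ with $0 < \zeta < 1$, and the fact that the first term on the left-hand side vanishes at the mle, we have
$$\begin{aligned}
&\sqrt{N}\, [I(x)]^{\frac{1}{2}} \big(\hat{\eta}^{(N)}(x) - \eta_0\big) - \frac{1}{\sqrt{N}}\, [I(x)]^{-\frac{1}{2}} \left.\frac{\partial}{\partial \eta} \ell_N(\eta, x)\right|_{\eta = \eta_0}\\
&\quad= -\Bigg\{ [I(x)]^{\frac{1}{2}} \Bigg[ \frac{1}{N} \left.\frac{\partial^2}{\partial \eta^2} \ell_N(\eta, x)\right|_{\eta = \eta_0} + \frac{1}{2} \frac{1}{N} \frac{\partial^3}{\partial \eta^3} \ell_N\big(\eta^*(x), x\big) \big(\hat{\eta}^{(N)}(x) - \eta_0\big) \Bigg]^{-1} [I(x)]^{\frac{1}{2}} + I \Bigg\}\\
&\qquad\cdot \frac{1}{\sqrt{N}}\, [I(x)]^{-\frac{1}{2}} \left.\frac{\partial}{\partial \eta} \ell_N(\eta, x)\right|_{\eta = \eta_0},
\end{aligned} \tag{A.17}$$
where $I$ is the 3×3 identity matrix. The argument in the first paragraph of this proof
and Lemmas 3 and 4 imply that the right-hand side of (A.17) is asymptotically
equivalent to 0.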
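8.3.4 Proof of Lemma 3
Under $H_0$, the mle’s of $\mu_1 = \mu_2 = \mu_0$ and $\sigma^2 = \sigma_0^2$ are the sample mean and sample
variance, respectively; let $\hat{\lambda}_0^{(N)} = \big(\hat{\mu}_0^{(N)}, \hat{\mu}_0^{(N)}, \hat{\sigma}_0^{(N)}\big)$. Clearly,
$$\sup_{\lambda} f(y^{(N)}; \lambda, x) \ge f(y^{(N)}; \hat{\lambda}_0^{(N)}, x) = \left(\frac{1}{\sqrt{2\pi}\, \hat{\sigma}_0^{(N)}}\right)^{N} e^{-\frac{N}{2}}. \tag{A.18}$$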
For any positive number $b$, the variance of $Y$ conditional on $|Y| \le b$ is denoted by
$$\sigma_b^2 = \operatorname{Var}_0\big(Y \,\big|\, |Y| \le b\big),$$
which converges to $\sigma_0^2 = \operatorname{Var}_0(Y)$ as $b \to \infty$. For any $\varepsilon > 0$, select a positive number
$b$ such that
$$P_0(|Y| \le b) > 1 - \varepsilon \quad\text{and}\quad \sigma_b > (1 - \varepsilon)\sigma_0.$$
The sample mean and sample variance using $y_i$’s with $|y_i| \le b$ are denoted by $\bar{y}_b^{(N)}$
and $(\hat{\sigma}_b^{(N)})^2$, respectively. We have
$$\lim_{N\to\infty} \hat{\sigma}_b^{(N)} = \sigma_b > (1 - \varepsilon)\sigma_0. \tag{A.19}$$
Now choose intervals $I_1 = [-2, -1]$, $I_2 = [0, 1]$, $I_3 = [2, 3]$, and $I_4 = [-b, b]$. The
choice of these intervals is somewhat arbitrary; we use $I_1$, $I_2$, and $I_3$ only to get a
lower bound on $\sigma^2$, and $I_4$ to get a bound on $\mu$.
Let $W_j^{(N)}$ be the number of $Y_1, \ldots, Y_N$ in $I_j$, $j = 1, 2, 3, 4$. Denote by $n_j^{(b)}$ the
number of $Y_1, \ldots, Y_N$ in $[-b, b]$ and in group $j$, $j = 1, 2, 3, 4$. Clearly,
$$W_4^{(N)} = \sum_{j=1}^{4} n_j^{(b)}.$$
Following from the fact that
$$\lim_{N\to\infty} \frac{W_j^{(N)}}{N} = P_0(Y \in I_j),\ j = 1, 2, 3, 4, \qquad \lim_{N\to\infty} \hat{\sigma}_0^{(N)} = \sigma_0,$$
and the limits in (4) and (A.19), there exist a measurable subset $A$ of the sample
space with $P_0(A) > 1 - \delta$, a positive number $\alpha$, and an integer $M$ such that for any
sample point in $A$ and $N > M$,
$$\begin{aligned}
W_j^{(N)} &> \alpha N, \quad j = 1, 2, 3,\\
W_4^{(N)} &> (1 - \varepsilon) N,\\
\hat{\sigma}_0^{(N)} &< \tfrac{3}{2}\sigma_0,\\
n_j &> (1 - \varepsilon)\tfrac{1}{2}(1 - r) N, \quad j = 1, 4,\\
n_j &> (1 - \varepsilon)\tfrac{1}{2} r N, \quad j = 2, 3,\\
\hat{\sigma}_b^{(N)} &> (1 - \varepsilon)\sigma_0.
\end{aligned} \tag{A.20}$$
Hereafter in the proof of Lemma 3, we assume that the sample point is in $A$ and
$N > M$, so that (A.20) holds. We then have
$$\sup_{\lambda} f(y^{(N)}; \lambda, x) \ge \left(\frac{1}{\sqrt{2\pi}\, \hat{\sigma}_0^{(N)}}\right)^{N} e^{-\frac{N}{2}} > \left(\frac{2 e^{-\frac{1}{2}}}{3\sqrt{2\pi}\, \sigma_0}\right)^{N}. \tag{A.21}$$
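Now, because of the gaps between $I_1$, $I_2$, and $I_3$, for any real values of $\mu_1$ and $\mu_2$,
there exists an interval $I^*$ among $I_1$, $I_2$ and $I_3$ such that $\inf\{|y - \mu_j| : j = 1, 2,\ y \in I^*\} \ge \frac{1}{2}$. This fact, together with the inequality that $f\big(y^{(N)}; \lambda, x\big)$ is less than or
equal to
$$\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{N} \prod_{y_i \in I_1 \cup I_2 \cup I_3} \exp\left\{ -\frac{1}{2\sigma^2} \min\big[(y_i - \mu_1)^2, (y_i - \mu_2)^2\big] \right\} \tag{A.22}$$
and (A.20), imply that
$$f\big(y^{(N)}; \lambda, x\big) \le \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{N} \left[ \exp\left( -\frac{1}{2\sigma^2} \cdot \frac{1}{4} \right) \right]^{\alpha N} = \left(\frac{e^{-\frac{\alpha}{8\sigma^2}}}{\sqrt{2\pi}\,\sigma}\right)^{N}. \tag{A.23}$$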
As $\sigma \to 0$ or $\sigma \to \infty$, the quantity in parentheses converges to zero, and so the right-hand side of (A.23) is smaller than the
right-hand side of (A.21). As a result, the supremum of $f(\cdot)$ is achieved in the
region $0 < \sigma_1 \le \sigma \le \sigma_2 < \infty$, where $\sigma_1$ and $\sigma_2$ are two constants independent of $x$.
Furthermore, for any $y_i \in I_1 \cup I_2 \cup I_3$, the limit
$$\lim_{\substack{|\mu_1| \to \infty\\ |\mu_2| \to \infty}} \min\big[(y_i - \mu_1)^2, (y_i - \mu_2)^2\big] = \infty$$
implies that the right-hand side of (A.22) is smaller than the right-hand side of (A.21) for a bounded
$\sigma$ as both $|\mu_1|$ and $|\mu_2|$ converge to $\infty$. Therefore, we have shown that the supremum
of $f(\cdot)$ is achieved in the region where $\sigma_1 \le \sigma \le \sigma_2$ and at least one of $\mu_1$ and
$\mu_2$ is bounded.
Now we show that for $|\mu_1| \le b$ and as $|\mu_2| \to \infty$, the value of the likelihood function is smaller than the right-hand side of (A.21). Following from the
inequalities (A.20), we have
$$n_j^{(b)} > n_j - \varepsilon N > (1 - \varepsilon)\tfrac{1}{2}(1 - r) N - \varepsilon N \ge (1 - k\varepsilon)\tfrac{1}{2}(1 - r) N, \quad j = 1, 4,$$
and
$$n_j^{(b)} > n_j - \varepsilon N > (1 - \varepsilon)\tfrac{1}{2} r N - \varepsilon N \ge (1 - k\varepsilon)\tfrac{1}{2} r N, \quad j = 2, 3,$$
where
$$k = \max\left( \frac{3 - r}{1 - r},\ \frac{2 + r}{r} \right).$$
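Consequently,
$$\begin{aligned}
\lim_{|\mu_2| \to \infty} f(y^{(N)}; \lambda, x)
&\le \lim_{|\mu_2| \to \infty} \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{N}
\prod_{\substack{i=1\\ y_i \in I_4}}^{N_1} \varphi(y_i; \lambda, \theta_1)
\prod_{\substack{i=N_1+1\\ y_i \in I_4}}^{N_2} \varphi(y_i; \lambda, \theta_2)\\
&\qquad\times \prod_{\substack{i=N_2+1\\ y_i \in I_4}}^{N_3} \varphi(y_i; \lambda, 1 - \theta_2)
\prod_{\substack{i=N_3+1\\ y_i \in I_4}}^{N} \varphi(y_i; \lambda, 1 - \theta_1)\\
&= \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{N} (1 - \theta_1)^{n_1^{(b)}} (1 - \theta_2)^{n_2^{(b)}}\, \theta_2^{\,n_3^{(b)}}\, \theta_1^{\,n_4^{(b)}}
\prod_{\substack{i=1\\ y_i \in I_4}}^{N} \exp\left[ -\frac{1}{2\sigma^2} (y_i - \mu_1)^2 \right]\\
&\le \frac{\big[(1 - \theta_1)\theta_1\big]^{(1 - k\varepsilon)\frac{1}{2}(1 - r) N}\, \big[(1 - \theta_2)\theta_2\big]^{(1 - k\varepsilon)\frac{1}{2} r N}\, e^{-\frac{N}{2}}}{\left( \sqrt{2\pi\, W_4^{(N)} / N}\; \hat{\sigma}_b^{(N)} \right)^{N}}\\
&\le \left( \frac{e^{-\frac{1}{2}}}{2^{1 - k\varepsilon} \sqrt{2\pi}\, (1 - \varepsilon)^2 \sigma_0} \right)^{N}.
\end{aligned}$$
Note that to obtain the second-to-last inequality, the MLEs $\bar{y}_b^{(N)}$ and $\hat{\sigma}_b^{(N)}$ have been
used. For small $\varepsilon$, clearly the last term is less than the right-hand side of (A.21).
Symmetrically, for $|\mu_2| \le b$ and as $|\mu_1| \to \infty$, the value of the likelihood function
is also smaller than the right-hand side of (A.21). Therefore, we have shown that
the supremum of $f(\cdot)$ is achieved in the region where $\mu_1$, $\mu_2$, and $\sigma$ are bounded uniformly
with respect to $x \in [0, D]$. The existence of mle’s of $\mu_1$, $\mu_2$, and $\sigma$ for any $x \in [0, D]$
follows from the continuity of $f(\cdot)$ with respect to $\lambda$.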
8.3.5 Proof of Lemma 4
Before we prove Lemma 4 we need a few preliminaries.
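For any function $h(\lambda, x)$ of $\lambda$ and $x$, any subset $\Lambda$ of the parameter space of
$\lambda$, and subset $\Phi$ of $[0, D]$, define
$$h(\Lambda, \Phi) = \sup_{\substack{\lambda \in \Lambda,\\ x \in \Phi}} h(\lambda, x).$$
For any $\varepsilon > 0$, $\lambda^* = (\mu_1^*, \mu_2^*, \sigma^*)$, and $x^* \in [0, D]$, define the $\varepsilon$-neighborhood of $\lambda^*$
by
$$S(\lambda^*, \varepsilon) = \{\lambda = (\mu_1, \mu_2, \sigma) : |\mu_1 - \mu_1^*| < \varepsilon,\ |\mu_2 - \mu_2^*| < \varepsilon,\ |\sigma - \sigma^*| < \varepsilon\}$$
and the $\varepsilon$-neighborhood of $x^*$ by
$$S(x^*, \varepsilon) = \{x \in [0, D] : |x - x^*| < \varepsilon\}.$$
As $\varepsilon \downarrow 0$, we have
$$f_{Y|G=i}\big(y; S(\lambda, \varepsilon), S(x, \varepsilon)\big) \downarrow f_{Y|G=i}(y; \lambda, x)$$
and
$$f_{Y,G}\big(y, g; S(\lambda, \varepsilon), S(x, \varepsilon)\big) \downarrow f_{Y,G}(y, g; \lambda, x). \tag{A.24}$$
From Rao (Section 6.6, page 59), or an application of Jensen’s inequality, for $\lambda \ne \lambda_0$,
$$E_0\left[ \log\left( \frac{f_{Y,G}(Y, G; \lambda, x)}{f_{Y,G}(Y, G; \lambda_0)} \right) \right] < 0.$$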
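Now we prove Lemma 4.
Let $\delta$, $M$, $\sigma_1$, $\sigma_2$, $\xi$ and $A$ be as in Lemma 3. It is sufficient to show that for
any $\varepsilon > 0$, there exists an $M_1$ such that when $N > M_1$,
$$\hat{\lambda}^{(N)}(x) \in S(\lambda_0, \varepsilon), \quad x \in [0, D],$$
with probability $1 - 2\delta$. Define $\Gamma = \{\lambda : |\mu_1| \le \xi,\ |\mu_2| \le \xi,\ \sigma_1 \le \sigma \le \sigma_2\}$ and observe
that
$$\left| \log\left( \frac{f_{Y,G}(y, g; \Gamma, [0, D])}{f_{Y,G}(y, g; \lambda_0)} \right) \right| \le K|y| + Q, \tag{A.25}$$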
where $K$ and $Q$ are positive constants depending on $\xi$, $\sigma_1$, and $\sigma_2$. Let $B = \Gamma - S(\lambda_0, \varepsilon)$ and note that for any $\lambda \in B$ and $x \in [0, D]$, (A.24), (A.25) and the Lebesgue
Dominated Convergence Theorem imply that
$$\lim_{\varepsilon \downarrow 0} E_0\left[ \log\left( \frac{f_{Y,G}\big(Y, G; S(\lambda, \varepsilon), S(x, \varepsilon)\big)}{f_{Y,G}(Y, G; \lambda_0)} \right) \right]
= E_0\left[ \log\left( \frac{f_{Y,G}(Y, G; \lambda, x)}{f_{Y,G}(Y, G; \lambda_0, x)} \right) \right] < C(\lambda, x) < 0.$$
Therefore, there exists an $\varepsilon = \varepsilon(\lambda, x)$ such that
$$E_0\left[ \log\left( \frac{f_{Y,G}\big(Y, G; S(\lambda, \varepsilon), S(x, \varepsilon)\big)}{f_{Y,G}(Y, G; \lambda_0)} \right) \right] < C(\lambda, x) < 0.$$
The preceding argument applies to each $(\lambda, x) \in B \times [0, D]$. Since $B \times [0, D]$ is
compact, there exist points $(\lambda_1, x_1), \ldots, (\lambda_k, x_k)$ and corresponding $\varepsilon_1, \ldots, \varepsilon_k$ and
$C_1, \ldots, C_k$ such that
$$B \times [0, D] \subset \bigcup_{i=1}^{k} S(\lambda_i, \varepsilon_i) \times S(x_i, \varepsilon_i)$$
and for $i = 1, \ldots, k$,
$$E_0\left[ \log\left( \frac{f_{Y,G}\big(Y, G; S(\lambda_i, \varepsilon_i), S(x_i, \varepsilon_i)\big)}{f_{Y,G}(Y, G; \lambda_0)} \right) \right] < C_i < 0. \tag{A.26}$$
Let $C = \max(C_1, \ldots, C_k)$. Because
$$\frac{1}{N} \log\left( \frac{L_N\big(S(\lambda_i, \varepsilon_i), S(x_i, \varepsilon_i)\big)}{L_N(\lambda_0)} \right) \tag{A.27}$$
converges a.s. to the left-hand side of (A.26), there exist a measurable subset $A_1 \subset A$
with $P_0(A_1) > 1 - 2\delta$ and an integer $M_1 > M$ such that when $N > M_1$,
$$\text{(A.27)} < C < 0, \quad \text{for } i = 1, \ldots, k.$$
Now for any $\lambda \in B$ and any $x \in [0, D]$, there exists an $i$, $1 \le i \le k$, such that $(\lambda, x) \in S(\lambda_i, \varepsilon_i) \times S(x_i, \varepsilon_i)$ and for any sample point in $A_1$ and $N > M_1$,
$$\frac{1}{N}\big[\ell_N(\lambda, x) - \ell_N(\lambda_0)\big] = \frac{1}{N} \log\left( \frac{L_N(\lambda, x)}{L_N(\lambda_0)} \right)
\le \frac{1}{N} \log\left( \frac{L_N\big(S(\lambda_i, \varepsilon_i), S(x_i, \varepsilon_i)\big)}{L_N(\lambda_0)} \right) < C < 0,$$
which implies that $\hat{\lambda}^{(N)}(x) \notin B$ for all $x \in [0, D]$. By Lemma 3,
$$\hat{\lambda}^{(N)}(x) \in \Gamma, \quad \text{for all } x \in [0, D].$$
Thus, for any sample point in $A_1$ and $N > M_1$,
$$\hat{\lambda}^{(N)}(x) \in S(\lambda_0, \varepsilon), \quad \text{for all } x \in [0, D],$$
which concludes the proof.
References
Churchill, G. A., and R. W. Doerge. (1994). Empirical threshold values for
quantitative trait mapping. Genetics 138: 963-971.
Cox, D. R. and D. V. Hinkley. (1974). Theoretical Statistics. Chapman & Hall:
London.
Darvasi, A., A. Weinreb, V. Minke, J. I. Weller and M. Soller. (1993). Detecting
marker-QTL linkage and estimating QTL gene effect and map location using a
saturated genetic map. Genetics 134: 943-951.
Davies, R. B. (1977). Hypothesis testing when a nuisance parameter is present
only under the alternative. Biometrika 64: 247-254.
Davies, R. B. (1987). Hypothesis testing when a nuisance parameter is present
only under the alternative. Biometrika 74: 33-43.
Doerge, R. W., and G. A. Churchill. (1996). Permutation tests for multiple loci
affecting a quantitative character. Genetics 142: 285-294.
Doerge, R. W., Z-B. Zeng and B. S. Weir (1997). Statistical Issues in the Search
for Genes Affecting Quantitative Traits in Experimental Populations. Statistical
Science 12: 195-219.
Dupuis, J. and D. Siegmund. (1999). Statistical methods for mapping quantitative
trait loci from a dense set of markers. Genetics 151: 373–386.
Haldane, J. B. S. (1919) The combination of linkage values and the calculation of
distance between the loci of linkage factors. J. Genet. 8: 299-309.
Haley, C. S., S. A. Knott and J. M. Elsen. (1994). Genetic mapping of quantitative
trait loci in cross between outbred lines using least squares. Genetics 136: 1195-1207.
Hoeschele, I. (2001). Mapping quantitative trait loci in outbred pedigrees. In:
Handbook of Statistical Genetics, Edited by D. J. Balding, M. Bishop and C.
Cannings. Wiley New York 599-644.
Jansen, R. C. (2001). Quantitative trait loci in inbred lines. In: Handbook of
Statistical Genetics, Edited by D. J. Balding, M. Bishop and C. Cannings. Wiley
New York. 567-597.
Lander, E. S., and D. Botstein. (1989). Mapping Mendelian factors underlying
quantitative traits using RFLP linkage maps. Genetics 121: 185-199.
Lander, E. S., and N. J. Schork. (1994). Genetic dissection of complex traits.
Science 265: 2037-2048.
Lin, M., X.-Y. Lou, M. Chang and R. L. Wu. (2004). A general statistical
framework for mapping quantitative trait loci in non-model systems: Issue for
characterizing linkage phases. Genetics 165: 901-913.
Piepho, H-P. (2001). A quick method for computing approximate thresholds for
quantitative trait loci detection. Genetics 157: 425-432.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, 2nd Edition.
Wiley, New York.
Rebai, A., B. Goffinet, and B. Mangin. (1994). Approximate thresholds of
interval mapping tests for QTL detection. Genetics 138: 235–240.
Rebai, A., B. Goffinet and B. Mangin. (1995) Comparing power of different
methods for QTL detection. Biometrics 51: 87-99.
Schaid, D. J., C. M. Rowland, D. E. Tines, R. M. Jacobson and G. A. Poland.
(2002). Score tests for association between traits and haplotypes when linkage
phase is ambiguous. Am. J. Hum. Genet. 70: 425-434.
Van Ooijen, J.W. (1992). Accuracy of mapping quantitative trait loci in
autogamous species. Theor. Appl. Genet. 84: 803-811.
Wang, K., and J. Huang. (2002). A score-statistic approach for mapping
quantitative-trait loci with sibships of arbitrary size. Am. J. Hum. Genet. 70: 412-424.
Wu, R. L., C.-X. Ma and G. Casella. (2007). Statistical Genetics of Quantitative
Traits: Linkage, Maps, and QTL. Springer-Verlag, New York.
Wu, R. L., C.-X. Ma, I. Painter and Z.-B. Zeng. (2002). Simultaneous maximum
likelihood estimation of linkages and linkage phases over a heterogeneous
genome. Theor. Pop. Biol. 61: 349-363.
Wu, R. L., M. X. Wang and M. R. Huang. (1992). Quantitative genetics of yield
breeding for Populus short rotation culture. I. Dynamics of genetic control and
selection models of yield traits. Can. J. Forest Res. 22: 175-182.
Xu, S. Z. (1996). Mapping quantitative trait loci using four-way crosses. Genet.
Res. 68: 175-181.
Yin, T. M., X. Y. Zhang, M. R. Huang, M. X. Wang, Q. Zhuge, S. M. Tu, L. H.
Zhu and R. L. Wu. (2002). Molecular linkage maps of the Populus genome.
Genome 45: 541-555.
Zeng, Z.-B. (1994). Precision mapping of quantitative trait loci. Genetics 136:
1457-1468.
Zou, F., B. S. Yandell and J. P. Fine. (2001). Statistical issues in the analysis of
quantitative traits in combined crosses. Genetics 158: 1339-1346.
Zou, F., J. P. Fine, J. Hu and D. Y. Lin. (2004). An efficient resampling method
for assessing genome-wide statistical significance in mapping quantitative trait
loci. Genetics 168 (4): 2307-2316.