Statistical Applications in Genetics and Molecular Biology
Volume 8, Issue 1, 2009, Article 16

Score Statistics for Mapping Quantitative Trait Loci

Myron N. Chang, University of Florida, mchang@cog.ufl.edu
Rongling Wu, University of Florida, rwu@ufl.edu
Samuel S. Wu, University of Florida, samwu@biostat.ufl.edu
George Casella, University of Florida, casella@ufl.edu

Copyright 2009 The Berkeley Electronic Press. All rights reserved.

Abstract

We propose a method to detect the existence of quantitative trait loci (QTL) in a backcross population using a score test. Since the score test only uses the MLEs of parameters under the null hypothesis, it is computationally simpler than the likelihood ratio test (LRT). Moreover, because the location parameter of the QTL is unidentifiable under the null hypothesis, the distribution of the maximum of the LRT statistics, typically the statistic of choice for testing $H_0$: no QTL, does not have the standard chi-square distribution asymptotically under the null hypothesis. From the simple structure of the score test statistics, the asymptotic null distribution can be derived for the maximum of the square of score test statistics. Numerical methods are proposed to compute the asymptotic null distribution and the critical thresholds can be obtained accordingly. We show that the maximum of the LR test statistics and the maximum of the square of score statistics are asymptotically equivalent. Therefore, the critical threshold for the score test can be used for the LR test also. A simple backcross design is used to demonstrate the application of the score test to QTL mapping.
KEYWORDS: score test, likelihood ratio test, asymptotic equivalence, backcross design

1 Introduction

Since the publication of the seminal mapping paper by Lander and Botstein (1989), a vast literature has developed on statistical methods for mapping quantitative trait loci (QTL) that affect complex traits (reviewed in Doerge et al. 1997; Jansen 2001; Hoeschele 2001; Wu et al. 2007). In QTL mapping, one of the thorniest issues is the characterization of critical thresholds for declaring the statistical significance of a QTL (Lander and Schork 1994). Typically, the profile of likelihood-ratio (LR) test statistics is constructed over a grid of possible QTL locations in a linkage group or an entire genome, and the maximum of the LR (MLR) is used as a global test statistic. At a given position of the QTL, the LR test statistic is asymptotically $\chi^2$-distributed under the null hypothesis, with degrees of freedom equal to the number of associated QTL effects. However, the QTL position is unidentified under the null hypothesis

\[ H_0: \text{no QTL}, \tag{1} \]

and therefore the MLR test statistic does not have the standard $\chi^2$ distribution asymptotically.

To overcome the limitations due to the failure of the test statistic to follow a standard statistical distribution, a distribution-free simulation approach has been proposed to calculate critical values for different experimental settings with intermediate marker densities (Lander and Botstein 1989; van Ooijen 1992; Darvasi et al. 1993). A more empirical approach for determining critical thresholds is based on a permutation test procedure (see Churchill and Doerge 1994 and Doerge and Churchill 1996). The simulation- or permutation-based approach requires a very high computational workload, which makes its application impractical in many situations.
Several researchers have derived formulae for computing approximate critical thresholds to control the type I error rate at a chromosome- or genome-wide level (Rebai et al. 1994; Dupuis and Siegmund 1999; Piepho 2001).

The purpose of this article is to propose a general statistical framework to detect the existence of a QTL using a score test. A score-statistic approach for mapping QTL has been considered for sibships or natural populations in humans (Schaid et al. 2002; Wang and Huang 2002). Zou et al. (2004) utilized score statistics for genome-wide mapping of QTL in controlled crosses, with a resampling method proposed to assess significance. Since the score test only uses maximum likelihood estimates (MLEs) of parameters under the null hypothesis (1), under which all phenotypes in the sample are independent and identically distributed (i.i.d.), this approach is much less demanding computationally than the LR test. We form score test statistics for given locations in the permissible range and then use the maximum of the square of these score test statistics as a global test statistic. We show that under $H_0$, the maximum of the square of score test statistics asymptotically follows the distribution of $\sup Z^2(d)$, where $Z(d)$ is a Gaussian stochastic process with mean zero and well-formulated covariances, and the supremum is over the permissible range of $d$, the Haldane (1919) distance between one end of the genome and the location of the QTL. The distribution of $\sup Z^2(d)$ depends only on the Haldane distances, or equivalently on the recombination fractions, between markers, and the critical thresholds can be obtained by simulation.
We show that the maximum of the square of score test statistics and the MLR are asymptotically equivalent, and hence the critical thresholds obtained for the maximum of the square of score test statistics can also be used for the MLR test.

The article is organized as follows. We provide background information on QTL mapping in a backcross population in Section 2. We derive the score test statistic and the asymptotic null distribution of the maximum of the square of score test statistics for interval mapping in Section 3. We further show that the maximum of the LR test statistics and the maximum of the square of score statistics are asymptotically equivalent under the null hypothesis. In Section 4, the score test and the null distribution are extended to composite interval mapping, where we show that, in the dense map case, the distribution of $\sup Z^2(d)$ is the same as that derived by Lander and Botstein (1989). Numerical methods to compute the null distribution are addressed in Section 5. In Section 6, examples are presented and the thresholds obtained by our method are compared with those from other procedures. Section 7 is a discussion section. We provide some technical details in the Appendix.

2 QTL Mapping in Backcross Populations

This section provides a short introduction to QTL mapping in backcross populations. A more thorough introduction can be found in Doerge et al. (1997).

2.1 Basics

The aim of QTL mapping is to associate genes with quantitative phenotypic traits. For example, we might be interested in QTL that affect the growth rate of pine trees, so we might be looking for regions on a chromosome that are associated with growth rate in pines. In the simplest case we assume that there are two parents that have genotypes QQ and qq at a certain (unknown) place, or locus, on the chromosome. If the parents mate, the offspring (called the $F_1$ generation) will have the genotype Qq.
In a backcross, the offspring is mated back to a parent, say QQ, producing a new generation with genotypes QQ and Qq. We want to investigate the association between the new generation (sometimes called $BC_1$) and the quantitative trait.

http://www.bepress.com/sagmb/vol8/iss1/art16 DOI: 10.2202/1544-6115.1386

Now the problems arise. We typically cannot see if the locus is Q or q, but rather we see a "marker", with alleles M and m. If the marker is "close" to the gene, where closeness is measured by the recombination fraction $r$, then we hope that seeing an M implies that a Q is there, and so on. If $r$ is small, with $r = 0$ being the limiting case of no recombination, then M is closely linked to Q. But if $r$ is large, with $r = 1/2$ being the limiting case, then there may be no linkage between M and Q.

A simple statistical model for all of this is the following. If $Y$ is a random variable signifying the quantitative trait, assume that $Y$ is normal with variance $\sigma^2$ and mean $\mu_{QQ}$, $\mu_{qq}$ or $\mu_{Qq}$ depending on the alleles at the particular locus under consideration. Clearly the parents are $N(\mu_{QQ}, \sigma^2)$ and $N(\mu_{qq}, \sigma^2)$, respectively, and the $F_1$ is $N(\mu_{Qq}, \sigma^2)$. The distribution of the backcross to the QQ parent is a mixture of normals with means $\mu_{QQ}$ and $\mu_{Qq}$. If the Q alleles could be observed, the mixing fraction would be 1/2, but we can only observe the marker genotypes MM or Mm. If $r$ denotes the recombination probability, then the distribution of the phenotype in the backcross population (Doerge et al. 1997) is

\[ Y \mid MM \sim \begin{cases} N(\mu_{QQ}, \sigma^2) & \text{with probability } 1 - r, \\ N(\mu_{Qq}, \sigma^2) & \text{with probability } r, \end{cases} \]

and

\[ Y \mid Mm \sim \begin{cases} N(\mu_{QQ}, \sigma^2) & \text{with probability } r, \\ N(\mu_{Qq}, \sigma^2) & \text{with probability } 1 - r. \end{cases} \]

When categorized by the markers, the difference in means of the populations is $\mu_{MM} - \mu_{Mm} = (1 - 2r)(\mu_{QQ} - \mu_{Qq})$.
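As a quick numerical check of the identity $\mu_{MM} - \mu_{Mm} = (1 - 2r)(\mu_{QQ} - \mu_{Qq})$, the marker-class means implied by the mixture model can be computed directly. The sketch below is illustrative only: the function name `marker_class_means` and the parameter values are assumptions made for this example and do not come from the original analysis.

```python
def marker_class_means(mu_QQ, mu_Qq, r):
    """Mean phenotype in marker classes MM and Mm under the mixture model.

    r is the recombination fraction between the marker and the QTL.
    """
    mean_MM = (1.0 - r) * mu_QQ + r * mu_Qq  # MM: QQ w.p. 1 - r, Qq w.p. r
    mean_Mm = r * mu_QQ + (1.0 - r) * mu_Qq  # Mm: QQ w.p. r, Qq w.p. 1 - r
    return mean_MM, mean_Mm


# Hypothetical genotypic means, chosen only for illustration.
mu_QQ, mu_Qq = 10.0, 8.0
for r in (0.0, 0.1, 0.5):
    mean_MM, mean_Mm = marker_class_means(mu_QQ, mu_Qq, r)
    # The two printed differences agree; both vanish at r = 1/2.
    print(r, mean_MM - mean_Mm, (1.0 - 2.0 * r) * (mu_QQ - mu_Qq))
```

The marker contrast is largest when the marker sits on the QTL ($r = 0$) and carries no information at $r = 1/2$, which is why testing for a difference in marker-class means doubles as a test for linkage.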
Assuming that $\mu_{QQ} - \mu_{Qq} \neq 0$, a test of $H_0$: no linkage between M and Q can be carried out by testing $H_0: \mu_{MM} - \mu_{Mm} = 0$ (which is also equivalent to the null hypothesis $H_0: r = 1/2$). Using this mixture model in a likelihood analysis, we can not only test for linkage but also estimate $r$. (The backcross is the simplest design in which we get enough information to estimate $r$.)

2.2 Interval mapping

In practice, we typically use a somewhat more complex model, in which the QTL is assumed to be bracketed by two flanking markers $M^{\ell-1}$ (with alleles $M^{\ell-1}$ and $m^{\ell-1}$) and $M^{\ell}$ (with alleles $M^{\ell}$ and $m^{\ell}$) from a genetic linkage map. These two markers form four distinct genotype groups

\[ \{M^{\ell-1}m^{\ell-1}M^{\ell}m^{\ell}\},\quad \{M^{\ell-1}m^{\ell-1}m^{\ell}m^{\ell}\},\quad \{m^{\ell-1}m^{\ell-1}M^{\ell}m^{\ell}\},\quad \{m^{\ell-1}m^{\ell-1}m^{\ell}m^{\ell}\}. \tag{2} \]

The recombination fraction between the two markers $M^{\ell-1}$ and $M^{\ell}$ is denoted by $r$, that between $M^{\ell-1}$ and the QTL by $r_1$, and that between the QTL and $M^{\ell}$ by $r_2$. The corresponding genetic distances between these three loci are denoted by $D$, $x$ and $D - x$, respectively. Genetic distances and recombination fractions are related through a map function (Wu et al. 2007). Here we use the Haldane (1919) map function, given by $r = (1 - e^{-2D})/2$, to convert the genetic distance to the corresponding recombination fraction.

To express the conditional probability of a QTL genotype, say Qq, given each of the four two-marker genotypes, we introduce two ratio parameters,

\[ \begin{aligned} \theta_1 &= \frac{r_1 r_2}{(1 - r_1)(1 - r_2) + r_1 r_2} = \frac{r_1 r_2}{1 - r} = \frac{1 + e^{-2D} - e^{-2x} - e^{-2(D-x)}}{2(1 + e^{-2D})}, \\ \theta_2 &= \frac{r_1 (1 - r_2)}{r_1 (1 - r_2) + (1 - r_1) r_2} = \frac{r_1 (1 - r_2)}{r} = \frac{1 - e^{-2D} - e^{-2x} + e^{-2(D-x)}}{2(1 - e^{-2D})}. \end{aligned} \tag{3} \]

Furthermore, assume that the putative QTL has an additive effect on the trait, and define the expected genotypic values as $\mu + \Delta$ and $\mu - \Delta$ for the two QTL genotypes Qq and qq, respectively, where $\mu$ is the overall mean of the backcross population. The corresponding distributions of the phenotypic values for the four marker genotypes can be modelled, respectively, as

\[ \begin{aligned} & (1 - \theta_1) N(\mu - \Delta, \sigma^2) + \theta_1 N(\mu + \Delta, \sigma^2), \\ & (1 - \theta_2) N(\mu - \Delta, \sigma^2) + \theta_2 N(\mu + \Delta, \sigma^2), \\ & \theta_2 N(\mu - \Delta, \sigma^2) + (1 - \theta_2) N(\mu + \Delta, \sigma^2), \\ & \theta_1 N(\mu - \Delta, \sigma^2) + (1 - \theta_1) N(\mu + \Delta, \sigma^2), \end{aligned} \]

where $\mu$, $\Delta$, $\sigma^2$, and $x$, $0 \le x \le D$, are the four unknown parameters contained in the above mixture models.

Lastly, suppose that we take a sample of size $N$ from the backcross population, with respective sample sizes of $n_1$, $n_2$, $n_3$ and $n_4$ in the four marker groups of (2). We then have

\[ \lim (n_1/N) = \lim (n_4/N) = \tfrac{1}{2}(1 - r), \qquad \lim (n_2/N) = \lim (n_3/N) = \tfrac{1}{2} r, \tag{4} \]

as $N \to \infty$.

3 Single Interval Mapping

Since the existence of the QTL in the interval is indicated by $\Delta \neq 0$, its statistical significance can be tested by the hypotheses

\[ H_0: \Delta = 0 \quad \text{vs.} \quad H_a: \Delta \neq 0, \]

where $H_0$ is equivalent to (1). Note that under the null hypothesis, all observations are i.i.d. from $N(\mu, \sigma^2)$. Thus, the QTL position parameter $\theta = (\theta_1, \theta_2) = \theta(x, D)$ is unidentifiable because there is no information on $\theta$ if $H_0$ is true. In this case, the MLR test statistic does not follow a standard $\chi^2$ distribution with two degrees of freedom. We derive a score statistic to overcome this problem.

Recall that $n_1$, $n_2$, $n_3$ and $n_4$ are the sample sizes of the four marker groups.
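If we define $N_0 = 0$, $N_1 = n_1$, $N_2 = N_1 + n_2$, $N_3 = N_2 + n_3$ and $N_4 = N = N_3 + n_4$, the log likelihood of the phenotype data conditional on the marker information can be written

\[ \begin{aligned} \ell(\Delta, \mu, \sigma^2, x) = {} & -\frac{N}{2}\log 2\pi - \frac{N}{2}\log \sigma^2 \\ & + \sum_{i=1}^{N_1} \log\Big[(1 - \theta_1) e^{-\frac{1}{2\sigma^2}(y_i - \mu + \Delta)^2} + \theta_1 e^{-\frac{1}{2\sigma^2}(y_i - \mu - \Delta)^2}\Big] \\ & + \sum_{i=N_1+1}^{N_2} \log\Big[(1 - \theta_2) e^{-\frac{1}{2\sigma^2}(y_i - \mu + \Delta)^2} + \theta_2 e^{-\frac{1}{2\sigma^2}(y_i - \mu - \Delta)^2}\Big] \\ & + \sum_{i=N_2+1}^{N_3} \log\Big[\theta_2 e^{-\frac{1}{2\sigma^2}(y_i - \mu + \Delta)^2} + (1 - \theta_2) e^{-\frac{1}{2\sigma^2}(y_i - \mu - \Delta)^2}\Big] \\ & + \sum_{i=N_3+1}^{N} \log\Big[\theta_1 e^{-\frac{1}{2\sigma^2}(y_i - \mu + \Delta)^2} + (1 - \theta_1) e^{-\frac{1}{2\sigma^2}(y_i - \mu - \Delta)^2}\Big]. \end{aligned} \tag{5} \]

The score function is the derivative of (5) with respect to $\Delta$ evaluated at $\Delta = 0$:

\[ \begin{aligned} u(\mu, \sigma^2, x) = \frac{\partial \ell}{\partial \Delta}\bigg|_{\Delta = 0} = \frac{1}{\sigma^2}\bigg\{ & -(1 - 2\theta_1) \sum_{i=1}^{N_1} (y_i - \mu) - (1 - 2\theta_2) \sum_{i=N_1+1}^{N_2} (y_i - \mu) \\ & + (1 - 2\theta_2) \sum_{i=N_2+1}^{N_3} (y_i - \mu) + (1 - 2\theta_1) \sum_{i=N_3+1}^{N} (y_i - \mu) \bigg\}. \end{aligned} \]

Under the null hypothesis, the MLEs of $\mu$ and $\sigma^2$ are

\[ \hat\mu = \frac{1}{N}\sum_{i=1}^{N} y_i \quad \text{and} \quad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat\mu)^2, \]

respectively. For a given $x \in [0, D]$, to form the score test statistic we need to replace $\mu$ and $\sigma^2$ in $u(\mu, \sigma^2, x)$ by $\hat\mu$ and $\hat\sigma^2$. For ease of notation, denote

\[ \bar y_t = \frac{1}{n_t} \sum_{i=N_{t-1}+1}^{N_t} y_i, \qquad t = 1, 2, 3, 4, \]

\[ \begin{aligned} a(x, \mathbf{N}) &= (\theta_1 - \theta_2) n_2 - (1 - \theta_1 - \theta_2) n_3 - (1 - 2\theta_1) n_4, \\ b(x, \mathbf{N}) &= (\theta_2 - \theta_1) n_1 - (1 - 2\theta_2) n_3 - (1 - \theta_1 - \theta_2) n_4, \\ c(x, \mathbf{N}) &= (1 - \theta_1 - \theta_2) n_1 + (1 - 2\theta_2) n_2 + (\theta_1 - \theta_2) n_4, \\ d(x, \mathbf{N}) &= (1 - 2\theta_1) n_1 + (1 - \theta_1 - \theta_2) n_2 + (\theta_2 - \theta_1) n_3, \end{aligned} \]

where $\mathbf{N} = (n_1, n_2, n_3, n_4)$. We can then write

\[ u(\hat\mu, \hat\sigma^2, x) = \frac{2}{N \hat\sigma^2}\big[a(x, \mathbf{N}) n_1 \bar y_1 + b(x, \mathbf{N}) n_2 \bar y_2 + c(x, \mathbf{N}) n_3 \bar y_3 + d(x, \mathbf{N}) n_4 \bar y_4\big], \]

where, under the null hypothesis, the variance of $u(\hat\mu, \hat\sigma^2, x)$ is

\[ \mathrm{Var}(u(\hat\mu, \hat\sigma^2, x)) = \left(\frac{2}{N \hat\sigma}\right)^2 \big[a^2(x, \mathbf{N}) n_1 + b^2(x, \mathbf{N}) n_2 + c^2(x, \mathbf{N}) n_3 + d^2(x, \mathbf{N}) n_4\big]. \]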
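To make these quantities concrete, the following sketch evaluates $u(\hat\mu, \hat\sigma^2, x)$, its null variance, and their ratio $u/\sqrt{\mathrm{Var}(u)}$ from four lists of phenotypes, one per marker group in the order of (2). It is a minimal illustration under stated assumptions: the function names `theta` and `score_statistic` and the input layout are hypothetical, the Haldane map function is used for $\theta_1$ and $\theta_2$ as in (3), and the variance is taken as $(2/(N\hat\sigma))^2[a^2 n_1 + b^2 n_2 + c^2 n_3 + d^2 n_4]$, that is, the null variance of the bracketed sum with $\sigma$ replaced by $\hat\sigma$.

```python
import math


def theta(x, D):
    """Ratio parameters (theta1, theta2) of (3) under the Haldane map function."""
    r1 = (1.0 - math.exp(-2.0 * x)) / 2.0
    r2 = (1.0 - math.exp(-2.0 * (D - x))) / 2.0
    r = (1.0 - math.exp(-2.0 * D)) / 2.0
    return r1 * r2 / (1.0 - r), r1 * (1.0 - r2) / r


def score_statistic(groups, x, D):
    """Return (u, Var(u), u / sqrt(Var(u))) for a putative QTL at x in [0, D].

    groups: four lists of phenotypes, ordered as the marker groups in (2).
    """
    n = [len(g) for g in groups]
    N = sum(n)
    ybar = [sum(g) / len(g) for g in groups]
    mu = sum(sum(g) for g in groups) / N  # null MLE of mu
    sigma2 = sum((y - mu) ** 2 for g in groups for y in g) / N  # null MLE of sigma^2
    t1, t2 = theta(x, D)
    coef = [  # a, b, c, d as defined in the text
        (t1 - t2) * n[1] - (1 - t1 - t2) * n[2] - (1 - 2 * t1) * n[3],
        (t2 - t1) * n[0] - (1 - 2 * t2) * n[2] - (1 - t1 - t2) * n[3],
        (1 - t1 - t2) * n[0] + (1 - 2 * t2) * n[1] + (t1 - t2) * n[3],
        (1 - 2 * t1) * n[0] + (1 - t1 - t2) * n[1] + (t2 - t1) * n[2],
    ]
    u = 2.0 / (N * sigma2) * sum(c * m * yb for c, m, yb in zip(coef, n, ybar))
    var_u = (2.0 / N) ** 2 / sigma2 * sum(c * c * m for c, m in zip(coef, n))
    return u, var_u, u / math.sqrt(var_u)
```

Only group sizes, group means, and the pooled mean and variance enter the calculation, so scanning many positions $x$ costs little more than re-evaluating $\theta_1$ and $\theta_2$; no iterative fitting is involved.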
The score test statistic, $U(x)$, is $u(\hat\mu, \hat\sigma^2, x)$ divided by its standard deviation, that is,

\[ U(x) = \frac{u(\hat\mu, \hat\sigma^2, x)}{\sqrt{\mathrm{Var}(u(\hat\mu, \hat\sigma^2, x))}}. \tag{6} \]

Calculation of the score test statistic $U(x)$ is simple (in contrast to the computation of the MLEs $\hat\Delta_x$, $\hat\mu_x$, and $\hat\sigma_x$, which can be complex). Moreover, the maximum score test statistic is asymptotically equivalent to the maximum likelihood ratio statistic, as we now show.

For a given $x \in [0, D]$, let $\hat\Delta_x$, $\hat\mu_x$, and $\hat\sigma_x^2$ be the MLEs of $\Delta$, $\mu$ and $\sigma^2$. By Cox and Hinkley (1974, pages 323-324), the LR test statistic and the square of the score test statistic are asymptotically equivalent at this $x$, that is,

\[ 2[\ell(\hat\Delta_x, \hat\mu_x, \hat\sigma_x^2, x) - \ell(0, \hat\mu, \hat\sigma^2, x)] \approx U^2(x). \tag{7} \]

Over the range of $x \in [0, D]$, the maximum of the left-hand side of (7) (MLR) equals the maximum of the right-hand side of (7) asymptotically. The proof is deferred to Appendix 8.3.

Theorem 1

\[ \lim_{N \to \infty} \Big| \sup_{x \in [0, D]} 2\{\ell(\hat\eta(x), x) - \ell(\hat\eta_0)\} - \sup_{x \in [0, D]} U^2(x) \Big| = 0 \quad \text{a.s.}, \]

where $\hat\eta(x) = (\hat\Delta_x, \hat\mu_x, \hat\sigma_x)$, $\hat\eta_0 = (0, \hat\mu, \hat\sigma)$, and the almost sure convergence is with respect to the distribution of the phenotypic trait under $H_0$.

In Appendix 8.3, we show that the difference between the two sides in (7) converges to zero with probability one uniformly for $x \in [0, D]$ as $N \to \infty$.

To test (1) we need the null distribution of the maximum of $U^2(x)$ (which is equivalent to the MLR test statistic). To do this we derive a simplified and asymptotically equivalent form of the score test statistic. Because of (4), we can replace $n_1$, $n_2$, $n_3$, and $n_4$ in $U(x)$ by $\tfrac{1}{2}(1 - r)N$, $\tfrac{1}{2} r N$, $\tfrac{1}{2} r N$, and $\tfrac{1}{2}(1 - r)N$, respectively. We then have

\[ U(x) \approx U^*(x) = \frac{\sqrt{N}}{2\hat\sigma} \left( \frac{(1 - r)(1 - 2\theta_1)(\bar y_4 - \bar y_1) + r(1 - 2\theta_2)(\bar y_3 - \bar y_2)}{\sqrt{(1 - r)(1 - 2\theta_1)^2 + r(1 - 2\theta_2)^2}} \right). \tag{8} \]

Under the null hypothesis,

\[ \frac{1}{2}\sqrt{(1 - r)N}\, \frac{(\bar y_4 - \bar y_1)}{\hat\sigma} \approx W_1 \quad \text{and} \quad \frac{1}{2}\sqrt{rN}\, \frac{(\bar y_3 - \bar y_2)}{\hat\sigma} \approx W_2, \tag{9} \]

where $W_1$ and $W_2$ are two independent standard normal random variables. Thus, $U^*(x)$ is asymptotically equivalent to

\[ Z(x) = \frac{(1 - 2\theta_1)\sqrt{1 - r}\, W_1 + (1 - 2\theta_2)\sqrt{r}\, W_2}{\sqrt{(1 - r)(1 - 2\theta_1)^2 + r(1 - 2\theta_2)^2}}. \tag{10} \]

Next, consider two QTL positions $x'$ and $x'' \in [0, D]$, whose corresponding probabilities in (3) can be written as $(\theta_1', \theta_2') = \theta(x', D)$ and $(\theta_1'', \theta_2'') = \theta(x'', D)$. The covariance between $Z(x')$ and $Z(x'')$ is

\[ \mathrm{cov}(Z(x'), Z(x'')) = \frac{(1 - r)(1 - 2\theta_1')(1 - 2\theta_1'') + r(1 - 2\theta_2')(1 - 2\theta_2'')}{\sqrt{(1 - r)(1 - 2\theta_1')^2 + r(1 - 2\theta_2')^2}\, \sqrt{(1 - r)(1 - 2\theta_1'')^2 + r(1 - 2\theta_2'')^2}}. \tag{11} \]

Therefore, the MLR test statistic, or the maximum of the square of score test statistics, is asymptotically distributed as

\[ \sup_{0 \le x \le D} Z^2(x), \tag{12} \]

where $Z(x)$ is a Gaussian stochastic process with mean 0 and covariances (11).

4 Composite Interval Mapping

In the preceding section, we discussed the asymptotic theory of the score test statistic for QTL mapping from a single marker interval. In this section, we use the full marker information in composite interval mapping. Statistically, this is a combination of interval mapping based on two given flanking markers and a partial regression analysis on all markers except the two bracketing the QTL.

4.1 QTL mapping on a sparse map

Suppose there are $k + 1$ diallelic markers, $M^0, M^1, \ldots, M^k$, on a sparse linkage map. These markers encompass $k$ marker intervals. In a backcross population, the four marker genotypes at a given interval $[M^{\ell-1}, M^{\ell}]$ ($\ell = 1, 2, \ldots, k$) are indicated by

\[ G^{(\ell)} = \begin{cases} 1 & \text{if a marker genotype is } M^{\ell-1}m^{\ell-1}M^{\ell}m^{\ell}, \\ 2 & \text{if a marker genotype is } M^{\ell-1}m^{\ell-1}m^{\ell}m^{\ell}, \\ 3 & \text{if a marker genotype is } m^{\ell-1}m^{\ell-1}M^{\ell}m^{\ell}, \\ 4 & \text{if a marker genotype is } m^{\ell-1}m^{\ell-1}m^{\ell}m^{\ell}. \end{cases} \]

The number of observations within each of these four marker genotypes at a given marker interval can be expressed as

\[ n_t^{(\ell)} = \sum_{i=1}^{N} I(G_i^{(\ell)} = t), \qquad t = 1, 2, 3, 4, \]

where $I(A)$ is the indicator function of event $A$. We then use

\[ \mathbf{N}^{(\ell)} = (n_1^{(\ell)}, n_2^{(\ell)}, n_3^{(\ell)}, n_4^{(\ell)}) \]

to collect the numbers of observations in the four marker genotypes at the interval. We can then express the phenotypic mean of a quantitative trait ($y$) for a marker genotype $t$ at the interval $[M^{\ell-1}, M^{\ell}]$ as

\[ \bar y_t^{(\ell)} = \frac{1}{n_t^{(\ell)}} \sum_{i=1}^{N} y_i I(G_i^{(\ell)} = t). \]

Let $D_\ell$ and $r_\ell$ ($\ell = 1, 2, \ldots, k$) denote the Haldane distance and recombination fraction between $M^{\ell-1}$ and $M^{\ell}$, respectively, and denote the Haldane distance between the two extreme markers $M^0$ and $M^k$ by $D = \sum_{\ell=1}^{k} D_\ell$. A putative QTL must be located somewhere on the map, at an unknown position $d \in [0, D]$. For each $d \in [0, D]$, define

\[ \ell_d = \min\Big\{\ell : \sum_{i=1}^{\ell-1} D_i \le d < \sum_{i=1}^{\ell} D_i\Big\} \quad \text{and} \quad x_d = d - \sum_{i=1}^{\ell_d - 1} D_i, \tag{13} \]

to identify the possible QTL within the interval $[M^{\ell_d - 1}, M^{\ell_d}]$ that is associated with the distance $d$; here $x_d$ is the distance from the left flanking marker $M^{\ell_d - 1}$ to the QTL.

As derived in Section 3, the score test statistic (6) for detecting the QTL at $d \in [0, D]$, the asymptotically equivalent forms (8) and (10), and the relationship (9) are all valid with $\mathbf{N}$, $\bar y_t$, and $r$ replaced by $\mathbf{N}^{(\ell)}$, $\bar y_t^{(\ell)}$, and $r_\ell$, respectively. In (6), (8), and (10), $(\theta_1, \theta_2) = \theta(x_d, D_{\ell_d})$ is the function of $x_d$ and $D_{\ell_d}$ defined in (3). For example, we express (10) as

\[ Z(d) = \frac{(1 - 2\theta_1)\sqrt{1 - r_\ell}\, W_1^{(\ell)} + (1 - 2\theta_2)\sqrt{r_\ell}\, W_2^{(\ell)}}{\sqrt{(1 - r_\ell)(1 - 2\theta_1)^2 + r_\ell (1 - 2\theta_2)^2}}, \tag{14} \]

where $W_1^{(\ell)}$ and $W_2^{(\ell)}$ are two independent standard normal random variables,

\[ W_1^{(\ell)} \approx \frac{1}{2}\sqrt{(1 - r_\ell)N}\, \frac{(\bar y_4^{(\ell)} - \bar y_1^{(\ell)})}{\hat\sigma} \quad \text{and} \quad W_2^{(\ell)} \approx \frac{1}{2}\sqrt{r_\ell N}\, \frac{(\bar y_3^{(\ell)} - \bar y_2^{(\ell)})}{\hat\sigma}. \tag{15} \]

In order to search for a QTL throughout the entire genome, we need to compute the covariance between $Z(d')$ and $Z(d'')$, $0 \le d' \le d'' \le D$. Assume $(\ell', x')$ and $(\ell'', x'')$ are associated with $d'$ and $d''$ through (13). Let $(\theta_1', \theta_2') = \theta(x', D_{\ell'})$ and $(\theta_1'', \theta_2'') = \theta(x'', D_{\ell''})$ as defined in (3). If $\ell' = \ell'' = \ell$, then the covariance between $Z(d')$ and $Z(d'')$ is given in (11) with $r$ replaced by $r_\ell$. If $\ell' \neq \ell''$, then the covariance between $Z(d')$ and $Z(d'')$ is derived in (A.6) in Appendix 8.1. It now follows that the MLR test statistic, or the maximum of the square of score test statistics, is asymptotically distributed as

\[ \sup_{0 \le d \le D} Z^2(d), \tag{16} \]

where $Z(d)$ is a Gaussian stochastic process with mean 0 and covariance in (11) when $\ell' = \ell''$ or in (A.6) when $\ell' < \ell''$.

Note that the null distribution of (16) depends only on the Haldane distances, or equivalently on the recombination fractions, between the markers. In Section 5, we will use a numerical approach to compute the distribution of (16).

4.2 QTL mapping on a dense map

For a dense genetic map, we can assume that markers are located at every point along the genome. This can be considered as the limit of the sparse map when $k$ (the number of marker intervals) tends to infinity and the genomic distance between any two consecutive markers converges to zero.

For a dense map, two markers $M^{\ell-1}$ and $M^{\ell}$ can be assumed to have the identical distance ($d$) to marker $M^0$. Furthermore, for any interval $[M^{\ell-1}, M^{\ell}]$, we have $n_2^{(\ell)} \approx n_3^{(\ell)} \approx 0$, $n_1^{(\ell)} \approx n_4^{(\ell)} \approx \tfrac{1}{2}N$, $\theta_1 \approx 0$, $a(x, \mathbf{N}^{(\ell)}) \approx -n_4^{(\ell)}$, and $d(x, \mathbf{N}^{(\ell)}) \approx n_1^{(\ell)}$. If the Haldane distance between $M^0$ and the QTL, which is located within interval $[M^{\ell-1}, M^{\ell}]$, is $d \in [0, D]$, the score statistic (6) for detecting the QTL at $d$ is reduced to

\[ U(d) = \frac{\bar y_4^{(\ell)} - \bar y_1^{(\ell)}}{\hat\sigma \sqrt{\dfrac{1}{n_1^{(\ell)}} + \dfrac{1}{n_4^{(\ell)}}}}. \]
For two possible locations $d'$ and $d''$ of the QTL, $d' < d''$, where $d'$ and $d''$ belong to $[M^{\ell'-1}, M^{\ell'}]$ and $[M^{\ell''-1}, M^{\ell''}]$, the covariance (see Chang et al. 2003 for details) is reduced to

\[ \mathrm{cov}(U(d'), U(d'')) \approx 1 - 2 r_{\ell', \ell''-1} \approx 1 - 2 r_{\ell' \ell''}. \tag{17} \]

This suggests that the asymptotic distribution of the MLR test statistic, or the maximum of the square of the score test statistics, is the same as the distribution of $\sup_{0 \le d \le D} Z^2(d)$, where $Z(\cdot)$ is a Gaussian stochastic process with mean 0 and covariance (17). This result is consistent with the finding of Lander and Botstein (1989).

5 Simulation of the Null Distribution

Although it is difficult to derive an explicit formula for the asymptotic null distributions derived in Sections 3 and 4.1, it is relatively straightforward to simulate them.

5.1 Interval mapping

For any given values of two standard normal random variables $W_1$ and $W_2$, we can compute $\sup_{0 \le x \le D} Z^2(x)$, where $Z(x)$ is given in (10). Although $x$ does not appear explicitly in $Z(x)$, recall from (3) that $(\theta_1, \theta_2) = \theta(x, D)$, so $Z^2(x)$ is a function of $x$. Define

\[ y_{1,2} = \frac{-2 e^{2D} W_1 W_2 \pm (W_1^2 + W_2^2)\sqrt{e^{4D} - 1}}{(W_1^2 - W_2^2)\sqrt{e^{4D} - 1} - 2 W_1 W_2}, \]

where $y_1$ corresponds to the $+$ of "$\pm$" and $y_2$ corresponds to the $-$ of "$\pm$". By straightforward computation, it can be shown that the solutions to $\frac{dZ^2(x)}{dx} = 0$ are

\[ x_i = \frac{1}{4}(\log y_i + 2D), \qquad i = 1, 2, \]

provided that $y_i > 0$. We can then simulate

\[ \sup_{0 \le x \le D} Z^2(x) = \begin{cases} \max\{Z^2(0), Z^2(x_i), Z^2(D)\} & \text{if } y_i > 0 \text{ and } 0 \le x_i \le D, \\ \max\{Z^2(0), Z^2(D)\} & \text{otherwise,} \end{cases} \tag{18} \]

by simulating $W_1$ and $W_2$ i.i.d. standard normal.

5.2 Composite interval mapping

The expression (14), which is asymptotically equivalent to the score test statistic, involves standard normal random variables $W_i^{(\ell)}$, $i = 1, 2$; $\ell = 1, 2, \ldots, k$.
It can be shown (see Appendix 8.1 for details) that $W_1^{(\ell)}, W_2^{(\ell)}$, $\ell = 1, 2, \ldots, k$, follow a joint multivariate normal distribution with mean 0, variances equal to 1, $\mathrm{cov}(W_1^{(1)}, W_2^{(1)}) = 0$, and remaining covariances

\[ \begin{aligned} \mathrm{cov}(W_1^{(\ell')}, W_1^{(\ell'')}) &\approx \sqrt{(1 - r_{\ell'})(1 - r_{\ell''})}\,(1 - 2 r_{\ell', \ell''-1}), \\ \mathrm{cov}(W_1^{(\ell')}, W_2^{(\ell'')}) &\approx \sqrt{(1 - r_{\ell'})\, r_{\ell''}}\,(1 - 2 r_{\ell', \ell''-1}), \\ \mathrm{cov}(W_2^{(\ell')}, W_1^{(\ell'')}) &\approx -\sqrt{r_{\ell'}(1 - r_{\ell''})}\,(1 - 2 r_{\ell', \ell''-1}), \\ \mathrm{cov}(W_2^{(\ell')}, W_2^{(\ell'')}) &\approx -\sqrt{r_{\ell'}\, r_{\ell''}}\,(1 - 2 r_{\ell', \ell''-1}), \end{aligned} \tag{19} \]

where $\ell' < \ell''$ and $r_{\ell', \ell''}$ is the recombination fraction between markers $M^{\ell'}$ and $M^{\ell''}$. Thus the variance-covariance matrix $\Sigma$ of $W_1^{(1)}$, $W_2^{(1)}$ and $W_1^{(\ell)}$, $\ell = 2, 3, \ldots, k$, is well defined. It can also be shown (see Appendix 8.2) that $W_2^{(\ell+1)}$ may be recursively calculated through

\[ W_2^{(\ell+1)} = \frac{1}{\sqrt{r_{\ell+1}}}\left(\sqrt{1 - r_\ell}\, W_1^{(\ell)} - \sqrt{r_\ell}\, W_2^{(\ell)} - \sqrt{1 - r_{\ell+1}}\, W_1^{(\ell+1)}\right). \tag{20} \]

To simulate the null distribution of $\sup_{0 \le d \le D} Z^2(d)$, where $Z(d)$ is in (14), we can do the following:

(i) Generate $(W_1^{(1)}, W_2^{(1)}, W_1^{(\ell)}, \ell = 2, 3, \ldots, k) \sim N(0, \Sigma)$;

(ii) Compute $W_2^{(\ell+1)}$, $\ell = 1, 2, \ldots, k - 1$, by (20);

(iii) Using (18), for each $\ell$, find

\[ \sup_{\sum_{i=1}^{\ell-1} D_i \le d \le \sum_{i=1}^{\ell} D_i} Z^2(d), \tag{21} \]

and take the maximum over $\ell = 1, 2, \ldots, k$.

Repeat the above steps many times (for example, 10,000 times) to obtain the approximate distribution of $\sup_{0 \le d \le D} Z^2(d)$, the asymptotic null distribution of the MLR test or the maximum of the square of score test statistics. The critical thresholds can be obtained accordingly.

Note that the random variables $W_1^{(1)}$, $W_2^{(1)}$ and $W_1^{(\ell)}$, $\ell = 2, 3, \ldots, k$, can be calculated as $A'Z$, where $A$ is the Cholesky factorization of $\Sigma$ (that is, $A'A = \Sigma$) and $Z$ follows a multivariate standard normal distribution. Note that we only need to factor the matrix once for the entire simulation.
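Steps (i) to (iii) can be sketched in a few dozen lines. The code below is a minimal illustration for a single linkage group, not the authors' Matlab implementation: all function names are hypothetical, distances are assumed to be in Morgans under the Haldane map function, $1 - 2r_{\ell', \ell''-1}$ is computed as $e^{-2s}$ with $s$ the map distance between markers $M^{\ell'}$ and $M^{\ell''-1}$, and a hand-written lower-triangular Cholesky factor is used so that only the Python standard library is needed.

```python
import math
import random


def haldane_r(dist):
    """Haldane map function: distance (Morgans) -> recombination fraction."""
    return (1.0 - math.exp(-2.0 * dist)) / 2.0


def z_squared(x, D, w1, w2):
    """Z^2 at position x in an interval of length D, following (10)/(14)."""
    q, e1, e2 = math.exp(-2.0 * D), math.exp(-2.0 * x), math.exp(-2.0 * (D - x))
    a = (e1 + e2) / (1.0 + q)  # 1 - 2*theta1, from (3)
    b = (e1 - e2) / (1.0 - q)  # 1 - 2*theta2, from (3)
    r = haldane_r(D)
    num = a * math.sqrt(1.0 - r) * w1 + b * math.sqrt(r) * w2
    return num * num / ((1.0 - r) * a * a + r * b * b)


def sup_z_squared(D, w1, w2):
    """Supremum of Z^2 over [0, D] via the stationary points used in (18)."""
    candidates = [0.0, D]
    root = math.sqrt(math.exp(4.0 * D) - 1.0)
    den = (w1 * w1 - w2 * w2) * root - 2.0 * w1 * w2
    for sign in (1.0, -1.0):
        if den == 0.0:
            break
        y = (-2.0 * math.exp(2.0 * D) * w1 * w2
             + sign * (w1 * w1 + w2 * w2) * root) / den
        if y > 0.0:
            x = (math.log(y) + 2.0 * D) / 4.0
            if 0.0 <= x <= D:
                candidates.append(x)
    return max(z_squared(x, D, w1, w2) for x in candidates)


def build_sigma(dists):
    """Covariance matrix of (W1^(1), W2^(1), W1^(2), ..., W1^(k)) from (19)."""
    k = len(dists)
    r = [haldane_r(d) for d in dists]
    labels = [(1, 1), (2, 1)] + [(1, l) for l in range(2, k + 1)]
    n = len(labels)
    sigma = [[float(i == j) for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            (kind, lo), (_, hi) = labels[i], labels[j]
            if lo == hi:
                continue  # W1^(1) and W2^(1) are uncorrelated
            # 1 - 2 r_{lo,hi-1}: markers M^lo and M^(hi-1) span intervals lo+1..hi-1
            link = math.exp(-2.0 * sum(dists[lo:hi - 1]))
            if kind == 1:
                c = math.sqrt((1.0 - r[lo - 1]) * (1.0 - r[hi - 1])) * link
            else:
                c = -math.sqrt(r[lo - 1] * (1.0 - r[hi - 1])) * link
            sigma[i][j] = sigma[j][i] = c
    return sigma


def cholesky_lower(m):
    """Lower-triangular L with L L^T = m (plain Cholesky, no pivoting)."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][t] * low[j][t] for t in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def simulate_sup(dists, n_rep, rng):
    """Draws of sup Z^2(d) over a map with interval lengths dists (Morgans)."""
    k = len(dists)
    r = [haldane_r(d) for d in dists]
    low = cholesky_lower(build_sigma(dists))  # factor Sigma once
    draws = []
    for _ in range(n_rep):
        z = [rng.gauss(0.0, 1.0) for _ in range(k + 1)]
        w = [sum(low[i][t] * z[t] for t in range(i + 1)) for i in range(k + 1)]
        w1, w2 = [w[0]] + w[2:], [w[1]]  # step (i)
        for l in range(k - 1):  # step (ii): recursion (20)
            w2.append((math.sqrt(1.0 - r[l]) * w1[l] - math.sqrt(r[l]) * w2[l]
                       - math.sqrt(1.0 - r[l + 1]) * w1[l + 1]) / math.sqrt(r[l + 1]))
        # step (iii): maximize within each interval, then across intervals
        draws.append(max(sup_z_squared(dists[l], w1[l], w2[l]) for l in range(k)))
    return draws
```

A call such as `sorted(simulate_sup([0.2] * 5, 10000, random.Random(0)))[9499]` would then give an empirical 95th percentile for one linkage group with six markers spaced 20 cM apart. Because the stationary points are available in closed form, each replicate needs only a handful of evaluations of $Z^2$ per interval rather than a grid search.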
6 Examples

To examine the statistical properties of the proposed score test statistic, we perform a simulation study to determine the asymptotic null distribution and the critical threshold for declaring the existence of a significant QTL. Our results are compared to those obtained by Lander and Botstein (1989), Doerge and Churchill (1996) and Piepho (2001). The score test statistic is further validated using a case study from a forest tree genome project.

6.1 Simulation

We use the same simulation design as Lander and Botstein (1989) to simulate a backcross mapping population of size 250, so that our results can be directly compared with theirs. In our simulation study, equal spacing of the six markers is assumed throughout 12 chromosomes of 100 cM each. Five QTL are hypothesized to be located at genetic positions 70, 49, 27, 8 and 3 cM from the left end on the first five chromosomes with effects of 1.5, 1.25, 1.0, 0.75 and 0.50, respectively. For each individual, genotypes at the markers were generated assuming no interference. The corresponding quantitative trait was simulated by summing the five QTL effects and random standard normal noise.

[Figure 1: twelve panels, Chromosome 1 to Chromosome 12, each plotted over positions 0 to 100.]

Figure 1: Score test for a simulated data set with 250 backcross progeny. As in Lander and Botstein (1989), we assumed that markers are spaced 20 cM throughout 12 chromosomes of 1 Morgan each.
The QTLs, located at genetic positions 70, 49, 27, 8 and 3 cM from the left end on the first five chromosomes, were assumed to have effects 1.5, 1.25, 1.0, 0.75 and 0.50, respectively. For each individual, genotypes at markers were generated assuming no interference. The corresponding quantitative trait was simulated by summing the five QTL effects and random standard normal noise. The solid line is for score test statistics and the dotted line is for likelihood ratio test statistics (-2log(LR)). The dashed line indicates the threshold of 12.83 at the 0.05 level based on the simulated asymptotic null distribution. All five QTLs attained this threshold.

Both the likelihood approach of Lander and Botstein (1989) and the score-statistic approach are used to analyze the simulated marker and phenotype data. The QTL profiles are drawn from the likelihood ratio test statistics and the score test statistics. As shown in Figure 1, these two statistics gave similar profiles across each of the simulated linkage groups. The 95th percentile of the score test statistics based on 10,000 simulations under the null model is 12.83, which is used as the threshold for declaring a QTL at the significance level α = 0.05. Other approaches to estimating the threshold for the same data set are summarized in the following table:

Reference                      Method             .05 Threshold
This paper                     Simulation         12.83
Lander and Botstein (1989)     Dense markers      14.51
Lander and Botstein (1989)     Bonferroni         11.16
Piepho (2001)                  Approximation      12.96
Churchill and Doerge (1994)    Permutation Test   11.44

From the example, it can be seen that the threshold obtained by our method is close to Piepho's threshold and higher than that of Churchill and Doerge.
Our large-scale simulation study showed that thresholds from the Bonferroni method tend to underestimate the cutoff point, whereas thresholds based on dense markers tend to overestimate it.

6.2 A case study

The study we consider was derived from an interspecific hybridization of Populus (poplar). A P. deltoides clone (designated I-69) was used as a female parent to mate with a P. euramericana clone (designated I-45) as a male parent (Wu et al. 1992). Both P. deltoides I-69 and P. euramericana I-45 were selected at the Research Institute for Poplars in Italy in the 1950s and were introduced to China in 1972. A genetic linkage map has been constructed using 90 genotypes randomly selected from the 450 hybrids with random amplified polymorphic DNAs (RAPDs), amplified fragment length polymorphisms (AFLPs) and inter-simple sequence repeats (ISSRs) (Yin et al. 2002). This map comprises the 19 largest linkage groups for each parental map, which roughly represent 19 pairs of chromosomes. The 90 hybrid genotypes used for map construction were measured for wood density with wood samples collected from 11-year-old stems in a field trial in a completely randomized design.

The 19 linkage groups constructed for each parent were scanned for the possible existence of QTL affecting wood density using both our score statistic and the likelihood approach. We successfully detected a wood density QTL on linkage group D17, which is composed of 16 normally segregated markers. The QTL profiles obtained from these two approaches, as shown in Figure 2, consistently exhibit a marked peak within a narrow marker interval, AG/CGA-480 to AG/CGA-330. Both the score statistic value and the log-likelihood ratio at the peak are greater than the threshold of 10.10 (α = 0.05) obtained from the simulated asymptotic null distribution.
Other approaches are also used to estimate the α = 0.05 level threshold, giving 7.87 for permutation tests, 8.87 for Piepho's quick method, 8.73 for the Bonferroni method and 10.20 for Lander and Botstein's formula for dense markers. All of these suggest that both approaches have consistently detected a significant QTL, located at the peak of the profiles, affecting wood density in hybrid poplars.

This example illustrates that the score statistic detects the hypothesized QTL as well as the likelihood approach does. Also, the QTL profiles from our score statistic and the likelihood approach are similar. However, in the wood density example, the calculation of the score statistic threshold based on 10,000 simulations from the asymptotic null distribution took only 5 minutes, whereas the calculation of the threshold based on 10,000 permutation likelihood ratio tests took 4,100 minutes. (All calculations were performed using Matlab software on a Dell Inspiron 7000 with a 300 MHz CPU.)

[Figure 2: vertical axis, test statistics; horizontal axis, Haldane distance from left end, x.]

Figure 2: The value of the test statistics along chromosome 17, which has 16 markers. The solid line is for the score test, the dotted line is for the likelihood ratio test (-2log(LR)), and the dashed line indicates the threshold of 10.10 at the 0.05 level based on the simulated asymptotic null distribution.

7 Discussion

In this article, we devise a score-statistic method for QTL mapping based on a genetic linkage map. The asymptotic null distribution of the score statistics is derived.
The score test statistic U(x) in equation (6) converges to its asymptotic equivalent Z(x) quickly, because the convergence depends on only two things: (a) the sample proportions in the four marker groups converge to the asymptotic proportions, and (b) the standardized sample mean difference in equation (9) converges to the standard normal distribution. In the variety of cases we studied, a sample size of 150 to 200 is usually sufficient to warrant the use of the threshold based on the asymptotic null distribution.

Determining the threshold of the test statistic for QTL mapping is an important first step toward the identification of individual genes that contribute to genetic variation in a complex trait. However, this is not a trivial task because many factors, such as map density, marker distribution pattern, genome size, and the proportion of missing marker and phenotypic data, may affect the distribution of the test statistic. For an infinitely dense map, the test statistic may be approximated by an Ornstein-Uhlenbeck diffusion process for a backcross with large samples (Lander and Botstein 1989). This result was then used in more complicated genetic designs (Dupuis and Siegmund 1999; Zou et al. 2001). Rebai et al. (1994, 1995) derived a conservative threshold by treating the position of a QTL as a nuisance parameter present only under the alternative hypothesis, so that the theoretical results of Davies (1977, 1987) apply.

The score test statistic underlying our method inherits the optimal property of the maximum likelihood method, yet it is much easier to compute. It is asymptotically equivalent to the likelihood ratio test statistic, with the asymptotic null distribution being the distribution of the maximum of the square of a well-defined Gaussian stochastic process.
The critical thresholds obtained in Section 5 can be used for both the MLR test and the maximum of the square of score test statistics. The computational method in Section 5 is feasible even for a large k, the number of molecular markers in a genetic linkage map. For a dense map, we can assume that a QTL is located at the same position as a marker. In this case, the numbers of individuals in the recombinant groups, $n_2^{(\ell)}$ and $n_3^{(\ell)}$, approach zero for all intervals, and our results on the asymptotic distribution of the MLR test statistic in Section 4 reduce to those of Lander and Botstein (1989). Results from numerical analyses suggest that our score statistic model can be applied reasonably well to sparse maps, which makes the model more useful in general genetic studies. As an alternative to permutation tests or resampling strategies that make no assumption about the distribution of the test statistic (Churchill and Doerge 1994; Zou et al. 2004), our model displays significant computing advantages.

Genetic mapping has reached a point at which almost every kind of biological trait and organism can be studied by QTL mapping approaches. This requires that our model be useful and effective in a broad spectrum of genetic settings. For outcrossing species, like forest trees, a full-sib family derived from two heterozygous parents provides an informative design for QTL mapping (Haley et al. 1994; Xu 1996; Wu et al. 2002; Lin et al. 2004). For humans and other species, QTL mapping may be based on a set of random samples drawn from a natural population in which associations between markers and QTL are studied in terms of linkage disequilibrium. For many complex traits, a one-QTL model is too simple to characterize their highly complex genetic architecture.
Such traits are likely to be controlled by a web of genes that interact with one another and with environmental factors. With strong theoretical derivations, the proposed method can be readily extended to consider these more realistic situations.

Readers can find our Matlab code for computing the score statistic and the critical threshold at the website: http://www.ehpr.ufl.edu/code/235.

8 Appendix: Some Technical Details

8.1 The covariance of $Z(d')$ and $Z(d'')$

Assume $(\ell', x')$ and $(\ell'', x'')$ are associated with $d'$ and $d''$ through (13), respectively. Since the covariance of $Z(d')$ and $Z(d'')$ is available in (11) for $\ell' = \ell''$, we derive the covariance for $\ell' < \ell''$. Let $A_{s,t;\ell',\ell''}$ denote the subset of individuals who belong to Group $s$ in interval $\ell'$ and to Group $t$ in interval $\ell''$, i.e.,

\[ A_{s,t;\ell',\ell''} = \{ i : 1 \le i \le N,\ G_i^{(\ell')} = s,\ G_i^{(\ell'')} = t \}, \quad 1 \le s,t \le 4,\ 1 \le \ell' < \ell'' \le k. \]

Denote the size of $A_{s,t;\ell',\ell''}$ by $n_{s,t;\ell',\ell''}$, i.e.,

\[ n_{s,t;\ell',\ell''} = \sum_{i=1}^{N} I(G_i^{(\ell')} = s)\, I(G_i^{(\ell'')} = t), \quad 1 \le s,t \le 4,\ 1 \le \ell' < \ell'' \le k. \tag{A.1} \]

Let $N^{(\ell',\ell'')} = (n_{s,t;\ell',\ell''})_{s,t=1,2,3,4}$ be the $4 \times 4$ matrix with $s$ and $t$ as the row index and column index, respectively. It is seen that

\[
\lim_{N\to\infty} \frac{1}{N} N^{(\ell',\ell'')} = \frac{1}{2}
\begin{pmatrix} 1-r_{\ell'} & 0 & 0 & 0 \\ 0 & r_{\ell'} & 0 & 0 \\ 0 & 0 & r_{\ell'} & 0 \\ 0 & 0 & 0 & 1-r_{\ell'} \end{pmatrix}
\times
\begin{pmatrix}
1-r_{\ell',\ell''-1} & 1-r_{\ell',\ell''-1} & r_{\ell',\ell''-1} & r_{\ell',\ell''-1} \\
r_{\ell',\ell''-1} & r_{\ell',\ell''-1} & 1-r_{\ell',\ell''-1} & 1-r_{\ell',\ell''-1} \\
1-r_{\ell',\ell''-1} & 1-r_{\ell',\ell''-1} & r_{\ell',\ell''-1} & r_{\ell',\ell''-1} \\
r_{\ell',\ell''-1} & r_{\ell',\ell''-1} & 1-r_{\ell',\ell''-1} & 1-r_{\ell',\ell''-1}
\end{pmatrix}
\times
\begin{pmatrix} 1-r_{\ell''} & 0 & 0 & 0 \\ 0 & r_{\ell''} & 0 & 0 \\ 0 & 0 & r_{\ell''} & 0 \\ 0 & 0 & 0 & 1-r_{\ell''} \end{pmatrix}.
\tag{A.2}
\]

In addition, we have results similar to (4):

\[
\begin{aligned}
\lim_{N\to\infty} \frac{n_1^{(\ell')}}{N} &= \lim_{N\to\infty} \frac{n_4^{(\ell')}}{N} = \frac{1}{2}(1 - r_{\ell'}), &
\lim_{N\to\infty} \frac{n_2^{(\ell')}}{N} &= \lim_{N\to\infty} \frac{n_3^{(\ell')}}{N} = \frac{1}{2} r_{\ell'}, \\
\lim_{N\to\infty} \frac{n_1^{(\ell'')}}{N} &= \lim_{N\to\infty} \frac{n_4^{(\ell'')}}{N} = \frac{1}{2}(1 - r_{\ell''}), &
\lim_{N\to\infty} \frac{n_2^{(\ell'')}}{N} &= \lim_{N\to\infty} \frac{n_3^{(\ell'')}}{N} = \frac{1}{2} r_{\ell''}.
\end{aligned}
\tag{A.3}
\]

Using (15), (A.2), and (A.3), we have

\[
\begin{aligned}
\mathrm{cov}(W_1^{(\ell')}, W_1^{(\ell'')})
&\approx \frac{N}{4\sigma^2} \sqrt{(1-r_{\ell'})(1-r_{\ell''})}\; \mathrm{cov}\big(\bar y_4^{(\ell')} - \bar y_1^{(\ell')},\ \bar y_4^{(\ell'')} - \bar y_1^{(\ell'')}\big) \\
&= \frac{N}{4} \sqrt{(1-r_{\ell'})(1-r_{\ell''})}
\left( \frac{n_{4,4;\ell',\ell''}}{n_4^{(\ell')} n_4^{(\ell'')}} - \frac{n_{4,1;\ell',\ell''}}{n_4^{(\ell')} n_1^{(\ell'')}} - \frac{n_{1,4;\ell',\ell''}}{n_1^{(\ell')} n_4^{(\ell'')}} + \frac{n_{1,1;\ell',\ell''}}{n_1^{(\ell')} n_1^{(\ell'')}} \right) \\
&\approx \sqrt{(1-r_{\ell'})(1-r_{\ell''})}\,(1 - 2 r_{\ell',\ell''-1}).
\end{aligned}
\tag{A.4}
\]

Similarly, we have

\[
\begin{aligned}
\mathrm{cov}(W_1^{(\ell')}, W_2^{(\ell'')}) &\approx \sqrt{(1-r_{\ell'})\, r_{\ell''}}\,(1 - 2 r_{\ell',\ell''-1}), \\
\mathrm{cov}(W_2^{(\ell')}, W_1^{(\ell'')}) &\approx -\sqrt{r_{\ell'}\,(1-r_{\ell''})}\,(1 - 2 r_{\ell',\ell''-1}), \\
\mathrm{cov}(W_2^{(\ell')}, W_2^{(\ell'')}) &\approx -\sqrt{r_{\ell'}\, r_{\ell''}}\,(1 - 2 r_{\ell',\ell''-1}).
\end{aligned}
\tag{A.5}
\]

By (14), (A.4) and (A.5), for $\ell' < \ell''$ the covariance of $Z(d')$ and $Z(d'')$ is

\[
\begin{aligned}
\mathrm{cov}(Z(d'), Z(d'')) = {} & (1 - 2 r_{\ell',\ell''-1}) \\
& \times \Big[ (1-2\theta_1')(1-2\theta_1'')(1-r_{\ell'})(1-r_{\ell''}) + (1-2\theta_1')(1-2\theta_2'')(1-r_{\ell'})\, r_{\ell''} \\
& \qquad - (1-2\theta_2')(1-2\theta_1'')\, r_{\ell'}(1-r_{\ell''}) - (1-2\theta_2')(1-2\theta_2'')\, r_{\ell'}\, r_{\ell''} \Big] \\
& \Big/ \sqrt{(1-r_{\ell'})(1-2\theta_1')^2 + r_{\ell'}(1-2\theta_2')^2}
\ \Big/ \sqrt{(1-r_{\ell''})(1-2\theta_1'')^2 + r_{\ell''}(1-2\theta_2'')^2},
\end{aligned}
\tag{A.6}
\]

where $r_{\ell',\ell''}$ denotes the recombination fraction between the two markers $M^{\ell'}$ and $M^{\ell''}$ ($\ell' < \ell''$), with $r_{\ell-1,\ell} = r_\ell$ and $r_{\ell,\ell} = 0$.

8.2 The linear relationship between $W_1^{(\ell)}$, $W_2^{(\ell)}$, $W_1^{(\ell+1)}$, and $W_2^{(\ell+1)}$ in (20)

From (A.4) and (A.5), the variance-covariance matrix of $W_1^{(\ell)}, W_2^{(\ell)}, W_1^{(\ell+1)}, W_2^{(\ell+1)}$ has the form

\[
\begin{pmatrix} I_2 & \Gamma \\ \Gamma' & I_2 \end{pmatrix},
\quad \text{where} \quad
\Gamma = \begin{pmatrix} \sqrt{(1-r_\ell)(1-r_{\ell+1})} & \sqrt{(1-r_\ell)\, r_{\ell+1}} \\ -\sqrt{r_\ell\,(1-r_{\ell+1})} & -\sqrt{r_\ell\, r_{\ell+1}} \end{pmatrix}
\]

and $I_2$ is the $2 \times 2$ identity matrix. It can be seen that the above matrix is singular and hence (20) holds.

The relationship (20) has an intuitive interpretation. Denote by $B_{s;\ell}$ the subset of individuals who belong to Group $s$ in interval $\ell$, i.e.,

\[ B_{s;\ell} = \{ i : 1 \le i \le N,\ G_i^{(\ell)} = s \}, \quad 1 \le s \le 4;\ 1 \le \ell \le k. \]

Consider two consecutive intervals $[M_{\ell-1}, M_\ell]$ and $[M_\ell, M_{\ell+1}]$. For any individual in Group 1 or Group 3 (Group 2 or Group 4) of interval $[M_{\ell-1}, M_\ell]$, the marker condition is $(M^\ell, m^\ell)$ ($(m^\ell, m^\ell)$) at marker $M_\ell$.
Thus this individual must belong to Group 1 or Group 2 (Group 3 or Group 4) in interval $[M_\ell, M_{\ell+1}]$. Therefore, $B_{1;\ell} \cup B_{3;\ell} = B_{1;\ell+1} \cup B_{2;\ell+1}$ and, similarly, $B_{2;\ell} \cup B_{4;\ell} = B_{3;\ell+1} \cup B_{4;\ell+1}$. Consequently,

\[ n_1^{(\ell)} \bar y_1^{(\ell)} + n_3^{(\ell)} \bar y_3^{(\ell)} = n_1^{(\ell+1)} \bar y_1^{(\ell+1)} + n_2^{(\ell+1)} \bar y_2^{(\ell+1)} \]

and

\[ n_2^{(\ell)} \bar y_2^{(\ell)} + n_4^{(\ell)} \bar y_4^{(\ell)} = n_3^{(\ell+1)} \bar y_3^{(\ell+1)} + n_4^{(\ell+1)} \bar y_4^{(\ell+1)}. \tag{A.7} \]

Combining (15), (A.3), and (A.7) we see that

\[
\begin{aligned}
& \sqrt{1-r_\ell}\, W_1^{(\ell)} - \sqrt{r_\ell}\, W_2^{(\ell)} - \sqrt{1-r_{\ell+1}}\, W_1^{(\ell+1)} - \sqrt{r_{\ell+1}}\, W_2^{(\ell+1)} \\
& \quad \approx \frac{\sqrt N}{2\sigma} \Big[ (1-r_\ell)\big(\bar y_4^{(\ell)} - \bar y_1^{(\ell)}\big) - r_\ell \big(\bar y_3^{(\ell)} - \bar y_2^{(\ell)}\big)
- (1-r_{\ell+1})\big(\bar y_4^{(\ell+1)} - \bar y_1^{(\ell+1)}\big) - r_{\ell+1}\big(\bar y_3^{(\ell+1)} - \bar y_2^{(\ell+1)}\big) \Big] \\
& \quad \approx \frac{1}{\sigma\sqrt N} \Big[ \big(n_4^{(\ell)} \bar y_4^{(\ell)} - n_1^{(\ell)} \bar y_1^{(\ell)}\big) - \big(n_3^{(\ell)} \bar y_3^{(\ell)} - n_2^{(\ell)} \bar y_2^{(\ell)}\big)
- \big(n_4^{(\ell+1)} \bar y_4^{(\ell+1)} - n_1^{(\ell+1)} \bar y_1^{(\ell+1)}\big) - \big(n_3^{(\ell+1)} \bar y_3^{(\ell+1)} - n_2^{(\ell+1)} \bar y_2^{(\ell+1)}\big) \Big] \\
& \quad = 0,
\end{aligned}
\]

which provides a justification of (20).

8.3 Equivalence of the suprema of the MLR and score statistic

In this section we demonstrate the equivalence of the MLR statistic and the maximum of the square of the score test statistic over one marker interval $[M_0, M_1]$. The equivalence over multiple intervals follows immediately. Specifically, we prove Theorem 1 of Section 3:

\[ \lim_{N\to\infty} \Big| \sup_{x\in[0,D]} 2\{ \ell_N(\hat\eta^{(N)}(x), x) - \ell_N(\hat\eta_0^{(N)}) \} - \sup_{x\in[0,D]} U^2(x) \Big| = 0 \quad \text{a.s.,} \]

where the log likelihood $\ell_N$ and the score statistic $U(x)$ are described in Section 3, and the almost sure convergence is with respect to the distribution of the phenotypic trait under $H_0$.

The proof of Theorem 1 is based on two lemmas. We first state the two lemmas and then use them to prove the theorem. In separate sections we prove the two lemmas and two other needed results.

Before stating the lemmas we review some notation. The genotype variable $G^{(\ell)}$ is defined at the beginning of Section 4.1.
The superscript "$\ell$" will be omitted since we focus on only one interval $[M_0, M_1]$. The probability function of $G$ is defined by

\[
p(g) = P(G = g) =
\begin{cases}
\frac12 (1-r), & g = 1, \\
\frac12 r, & g = 2, \\
\frac12 r, & g = 3, \\
\frac12 (1-r), & g = 4.
\end{cases}
\]

To express the likelihood function in a compact form, we define

\[ \varphi(y; \lambda, \theta) = (1-\theta)\, e^{-\frac{1}{2\sigma^2}(y-\mu_1)^2} + \theta\, e^{-\frac{1}{2\sigma^2}(y-\mu_2)^2}, \]

where $\lambda = (\mu_1, \mu_2, \sigma)^T$. We will also use the alternative parameterization $\mu_1 = \mu - \Delta$, $\mu_2 = \mu + \Delta$ and $\eta = (\Delta, \mu, \sigma)^T$. The density function of the phenotype variable $Y$ conditional on $G = j$ is

\[ f_{Y|G=j}(y; \lambda, x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \varphi(y; \lambda, \theta_j^*), \]

where $\theta_j^* = \theta_1,\ \theta_2,\ 1-\theta_2,\ 1-\theta_1$ for $j = 1, 2, 3, 4$, respectively, and $\theta_1, \theta_2$ are functions of $x$ as defined in (3).

Assume the observations are $(Y_i, G_i) = (y_i, g_i)$, $i = 1, \ldots, N$, and are arranged so that the first $n_1$ observations belong to group 1, the next $n_2$ belong to group 2, and so on; write $N_j = n_1 + \cdots + n_j$. The likelihood function conditional on the genotype variable is

\[
f(y^{(N)}; \lambda, x) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{N}
\prod_{i=1}^{N_1} \varphi(y_i; \lambda, \theta_1)
\prod_{i=N_1+1}^{N_2} \varphi(y_i; \lambda, \theta_2)
\prod_{i=N_2+1}^{N_3} \varphi(y_i; \lambda, 1-\theta_2)
\prod_{i=N_3+1}^{N} \varphi(y_i; \lambda, 1-\theta_1),
\]

where $y^{(N)} = (y_1, \ldots, y_N)$. The full likelihood function for the observations $(Y_i, G_i) = (y_i, g_i)$, $i = 1, \ldots, N$, is

\[ L_N(\lambda; x) = f(y^{(N)}; \lambda, x) \prod_{j=1}^{4} [p(j)]^{n_j}. \tag{A.8} \]

Since the second factor does not contain parameters of interest, we consider the log likelihood function for $y^{(N)}$ conditional on the genotype variable,

\[ \ell_N(\lambda; x) = \log f(y^{(N)}; \lambda, x), \]

which is given in (5).
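As an illustration of the conditional log likelihood just defined, the following Python sketch evaluates $\ell_N$ in the $(\Delta, \mu, \sigma)$ parameterization and compares a finite-difference derivative with respect to $\Delta$ at $\Delta = 0$ against the closed form obtained by differentiating $\log \varphi$. It is a minimal sketch, not the authors' Matlab implementation: the function names, the simulated data, and the values of $\theta_1$ and $\theta_2$ (which in the paper come from (3)) are assumptions made only for this example.

```python
import math
import random


def theta_star(group, theta1, theta2):
    """Mixing weight for marker group j = 1..4: theta1, theta2, 1-theta2, 1-theta1."""
    return (theta1, theta2, 1.0 - theta2, 1.0 - theta1)[group - 1]


def phi(y, mu1, mu2, sigma, theta):
    """Mixture kernel (1-theta)*exp(-(y-mu1)^2/(2s^2)) + theta*exp(-(y-mu2)^2/(2s^2))."""
    two_s2 = 2.0 * sigma * sigma
    return ((1.0 - theta) * math.exp(-(y - mu1) ** 2 / two_s2)
            + theta * math.exp(-(y - mu2) ** 2 / two_s2))


def log_lik(ys, groups, delta, mu, sigma, theta1, theta2):
    """Conditional log likelihood with mu1 = mu - delta and mu2 = mu + delta."""
    mu1, mu2 = mu - delta, mu + delta
    total = -len(ys) * math.log(math.sqrt(2.0 * math.pi) * sigma)
    for y, g in zip(ys, groups):
        total += math.log(phi(y, mu1, mu2, sigma, theta_star(g, theta1, theta2)))
    return total


def score_delta_at_null(ys, groups, mu0, sigma0, theta1, theta2):
    """Closed-form derivative of the log likelihood in delta, evaluated at delta = 0."""
    total = 0.0
    for y, g in zip(ys, groups):
        t = theta_star(g, theta1, theta2)
        total += -(1.0 - 2.0 * t) * (y - mu0)
    return total / sigma0 ** 2


# Hypothetical data: phenotypes drawn under the null, group labels drawn at random.
rng = random.Random(0)
ys = [rng.gauss(10.0, 2.0) for _ in range(200)]
groups = [rng.randint(1, 4) for _ in range(200)]
theta1, theta2 = 0.05, 0.40  # illustrative values only

h = 1e-4  # step for the central finite difference
numeric = (log_lik(ys, groups, h, 10.0, 2.0, theta1, theta2)
           - log_lik(ys, groups, -h, 10.0, 2.0, theta1, theta2)) / (2.0 * h)
analytic = score_delta_at_null(ys, groups, 10.0, 2.0, theta1, theta2)
print(numeric, analytic)
```

The two printed numbers should agree to several decimal places, which is a quick way to check a hand-derived score before using it in a genome scan.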
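For a fixed $x \in [0, D]$, the mle's of $\mu_1, \mu_2, \sigma, \Delta, \mu, \lambda, \eta$ are denoted, respectively, by $\hat\mu_1^{(N)}(x)$, $\hat\mu_2^{(N)}(x)$, $\hat\sigma^{(N)}(x)$, $\hat\Delta^{(N)}(x)$, $\hat\mu^{(N)}(x)$,

\[ \hat\lambda^{(N)}(x) = \big( \hat\mu_1^{(N)}(x), \hat\mu_2^{(N)}(x), \hat\sigma^{(N)}(x) \big)^T, \qquad
\hat\eta^{(N)}(x) = \big( \hat\Delta^{(N)}(x), \hat\mu^{(N)}(x), \hat\sigma^{(N)}(x) \big)^T, \]

with $\hat\mu_1^{(N)}(x) = \hat\mu^{(N)}(x) - \hat\Delta^{(N)}(x)$ and $\hat\mu_2^{(N)}(x) = \hat\mu^{(N)}(x) + \hat\Delta^{(N)}(x)$.

Let $\mu_0$ be the true common value of $\mu_1$ and $\mu_2$ and $\sigma_0$ be the true value of $\sigma$ under $H_0$, and denote the normal density function with mean $\mu_0$ and variance $\sigma_0^2$ by $n(y; \mu_0, \sigma_0)$. Under $\lambda = \lambda_0 = (\mu_0, \mu_0, \sigma_0)$, the densities of $(Y, G)$ and $Y$ are

\[ f_{Y,G}(y, g; \lambda_0, x) = n(y; \mu_0, \sigma_0)\, p(g) \quad \text{and} \quad f_Y(y; \lambda_0, x) = n(y; \mu_0, \sigma_0), \]

neither of which depends on $x$ under $H_0$. From now on, all probabilities $P_0(\cdot)$, expectations $E_0(\cdot)$, variances $\mathrm{Var}_0(\cdot)$, and all probability statements such as "converges in probability", "converges almost surely (a.s.)", and "with probability one" are under $\lambda = \lambda_0$.

Lemma 1 (Information Matrix) The information matrix $I(x)$ is given by

\[
I(x) = -E_0 \left[ \frac{1}{N}
\begin{pmatrix}
\frac{\partial^2 \ell}{\partial \Delta^2} & \frac{\partial^2 \ell}{\partial \Delta \partial \mu} & \frac{\partial^2 \ell}{\partial \Delta \partial \sigma} \\
\frac{\partial^2 \ell}{\partial \mu \partial \Delta} & \frac{\partial^2 \ell}{\partial \mu^2} & \frac{\partial^2 \ell}{\partial \mu \partial \sigma} \\
\frac{\partial^2 \ell}{\partial \sigma \partial \Delta} & \frac{\partial^2 \ell}{\partial \sigma \partial \mu} & \frac{\partial^2 \ell}{\partial \sigma^2}
\end{pmatrix}
\right]_{\eta = \eta_0}
= \frac{1}{\sigma_0^2}
\begin{pmatrix} \gamma(x) & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 2 \end{pmatrix},
\tag{A.9}
\]

where $\gamma(x) = (1-2\theta_1)^2 (1-r) + (1-2\theta_2)^2 r$ (recall that $\theta_1$ and $\theta_2$ depend on $x$; see (3)).

A sequence of random variables $X_N(\eta, x)$ is almost surely (a.s.) uniformly bounded with respect to $\eta \in \Gamma = \{ (\Delta, \mu, \sigma) : |\mu - \Delta| \le \xi,\ |\mu + \Delta| \le \xi,\ 0 < \sigma_1 \le \sigma \le \sigma_2 < \infty \}$ and $x \in [0, D]$ if for any $\varepsilon > 0$ there exists a constant $K$ such that

\[ P_0 \Big[ \sup_{\eta \in \Gamma,\ x \in [0,D],\ N = 1, 2, \ldots} \big| X_N(\eta, x) \big| \le K \Big] > 1 - \varepsilon. \]

A sequence of random variables $X_N(x)$ converges to a random variable $Y(x)$ almost surely and uniformly for $x \in [0, D]$ if

\[ \lim_{N\to\infty} \sup_{x \in [0,D]} | X_N(x) - Y(x) | = 0 \quad \text{a.s.} \]

Two sequences of random variables $X_N(x)$ and $Y_N(x)$ are called asymptotically equivalent, denoted by $X_N(x) \doteq Y_N(x)$, if $X_N(x) - Y_N(x)$ converges to zero a.s. and uniformly for $x \in [0, D]$.

Lemma 2 (Higher Order Terms)
(A) All elements of the $3 \times 3 \times 3$ matrix

\[ \frac{1}{N} \frac{\partial^3}{\partial \eta^3} \ell_N(\eta, x) \tag{A.10} \]

are a.s. bounded uniformly for $\eta \in \Gamma$ and $x \in [0, D]$.
(B)

\[ \sqrt{N}\, [I(x)]^{\frac12} \big( \hat\eta^{(N)}(x) - \eta_0 \big) \doteq \frac{1}{\sqrt N}\, [I(x)]^{-\frac12}\, \frac{\partial}{\partial \eta} \ell_N(\eta, x) \Big|_{\eta = \eta_0}. \]

8.3.1 Proof of Theorem 1

It is sufficient to show that

\[ 2\{ \ell_N(\hat\eta^{(N)}(x), x) - \ell_N(\hat\eta_0^{(N)}) \} \doteq U^2(x). \tag{A.11} \]

Recall from Section 3 that, under $H_0$, we can write $U(x)$ as

\[ U(x) = \frac{u(\hat\mu_0^{(N)}, \hat\sigma_0^{(N)}, x)}{\sqrt{\mathrm{Var}_0[u(\hat\mu_0^{(N)}, \hat\sigma_0^{(N)}, x)]}}. \]

It follows that

\[
\begin{aligned}
\frac{u(\mu_0, \sigma_0, x) - u(\hat\mu_0^{(N)}, \hat\sigma_0^{(N)}, x)}{\sqrt{\mathrm{Var}_0[u(\hat\mu_0^{(N)}, \hat\sigma_0^{(N)}, x)]}}
&= \frac{1}{\sigma_0^2}\, \frac{(\hat\mu_0^{(N)} - \mu_0)\,[(1-2\theta_1)(n_4 - n_1) + (1-2\theta_2)(n_3 - n_2)]}{\sqrt{\mathrm{Var}_0[u(\hat\mu_0^{(N)}, \hat\sigma_0^{(N)}, x)]}} \\
&\quad + \left( \frac{1}{\sigma_0^2} - \frac{1}{(\hat\sigma_0^{(N)})^2} \right) (\hat\sigma_0^{(N)})^2\, \frac{u(\hat\mu_0^{(N)}, \hat\sigma_0^{(N)}, x)}{\sqrt{\mathrm{Var}_0[u(\hat\mu_0^{(N)}, \hat\sigma_0^{(N)}, x)]}} \\
&\doteq 0,
\end{aligned}
\]

and hence

\[ U(x) \doteq \frac{u(\mu_0, \sigma_0, x)}{\sqrt{\mathrm{Var}_0[u(\hat\mu_0^{(N)}, \hat\sigma_0^{(N)}, x)]}}. \]

Next, using the facts that

\[ \frac{1}{N} \mathrm{Var}_0[u(\hat\mu_0^{(N)}, \hat\sigma_0^{(N)}, x)] \doteq \frac{1}{\sigma_0^2} \gamma(x) \quad \text{and} \quad \frac{\partial}{\partial \Delta} \ell_N(\eta, x) \Big|_{\eta = \eta_0} = u(\mu_0, \sigma_0, x), \]

we can write

\[
U(x) \doteq \frac{u(\mu_0, \sigma_0, x)}{\sqrt{\mathrm{Var}_0[u(\hat\mu_0^{(N)}, \hat\sigma_0^{(N)}, x)]}}
\doteq \frac{\frac{\partial}{\partial \Delta} \ell_N(\eta, x) \big|_{\eta = \eta_0}}{\sqrt{N \gamma(x) / \sigma_0^2}}
\doteq \frac{\sqrt N}{\sigma_0}\, \hat\Delta^{(N)}(x) \sqrt{\gamma(x)},
\tag{A.12}
\]

where the last equality follows from Lemma 2 (B).

Next we consider the likelihood ratio statistic. Using a Taylor expansion, together with Lemma 2 (A), we can write

\[ 2[\ell_N(\hat\eta^{(N)}(x), x) - \ell_N(\eta_0)] \doteq N (\hat\eta^{(N)}(x) - \eta_0)^T I(x) (\hat\eta^{(N)}(x) - \eta_0), \tag{A.13} \]

and, under $H_0$, since there is no dependence on $x$, we can use classical results to write

\[ 2[\ell_N(\hat\eta_0^{(N)}) - \ell_N(\eta_0)] \doteq \frac{N}{\sigma_0^2} (\hat\nu_0^{(N)} - \nu_0)^T \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} (\hat\nu_0^{(N)} - \nu_0), \tag{A.14} \]

where $\nu = (\mu, \sigma)^T$, $\nu_0 = (\mu_0, \sigma_0)^T$, and $\hat\nu_0^{(N)} = (\hat\mu_0^{(N)}, \hat\sigma_0^{(N)})^T$.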
Using the derivatives of the log likelihood (see the proof of Lemma 1) we have

\[
\sqrt N \big( \hat\nu_0^{(N)} - \nu_0 \big) \doteq \frac{\sigma_0^2}{\sqrt N} \begin{pmatrix} 1 & 0 \\ 0 & \frac12 \end{pmatrix} \frac{\partial}{\partial \nu} \ell_N(\eta, x) \Big|_{\eta = \eta_0}
\doteq \sqrt N \left[ \begin{pmatrix} \hat\mu^{(N)}(x) \\ \hat\sigma^{(N)}(x) \end{pmatrix} - \nu_0 \right],
\]

where the last equality follows from Lemma 2 (B). The likelihood ratio statistic is obtained by subtracting (A.14) from (A.13), which results in

\[ 2[\ell_N(\hat\eta^{(N)}(x), x) - \ell_N(\hat\eta_0^{(N)})] \doteq \frac{N}{\sigma_0^2} \big( \hat\Delta^{(N)}(x) \big)^2 \gamma(x) \doteq U^2(x), \]

where the last equality follows from (A.12). The proof of Theorem 1 is now complete.

8.3.2 Proof of Lemma 1

The proof of Lemma 1 is quite straightforward, and we only sketch it here. Evaluation of the derivatives of the log likelihood is long and tedious, and we omit the details. However, for completeness, we give the derivatives evaluated at the null parameter:

\[
\begin{aligned}
\frac{\partial \ell_N}{\partial \Delta} \Big|_{\eta = \eta_0} &= \frac{1}{\sigma_0^2} \Big[ -(1-2\theta_1) \sum_{i=1}^{N_1} (y_i - \mu_0) - (1-2\theta_2) \sum_{i=N_1+1}^{N_2} (y_i - \mu_0) \\
&\qquad\quad + (1-2\theta_2) \sum_{i=N_2+1}^{N_3} (y_i - \mu_0) + (1-2\theta_1) \sum_{i=N_3+1}^{N} (y_i - \mu_0) \Big], \\
\frac{\partial \ell_N}{\partial \mu} \Big|_{\eta = \eta_0} &= \frac{1}{\sigma_0^2} \sum_{i=1}^{N} (y_i - \mu_0), \\
\frac{\partial \ell_N}{\partial \sigma} \Big|_{\eta = \eta_0} &= -\frac{N}{\sigma_0} + \frac{1}{\sigma_0^3} \sum_{i=1}^{N} (y_i - \mu_0)^2, \\
\frac{\partial^2 \ell_N}{\partial \Delta^2} \Big|_{\eta = \eta_0} &= -\frac{N}{\sigma_0^2} + \frac{1}{\sigma_0^4} \Big[ \sum_{i=1}^{N} (y_i - \mu_0)^2 - (1-2\theta_1)^2 \sum_{i=1}^{N_1} (y_i - \mu_0)^2 - (1-2\theta_2)^2 \sum_{i=N_1+1}^{N_2} (y_i - \mu_0)^2 \\
&\qquad\qquad\qquad - (1-2\theta_2)^2 \sum_{i=N_2+1}^{N_3} (y_i - \mu_0)^2 - (1-2\theta_1)^2 \sum_{i=N_3+1}^{N} (y_i - \mu_0)^2 \Big], \\
\frac{\partial^2 \ell_N}{\partial \mu^2} \Big|_{\eta = \eta_0} &= -\frac{N}{\sigma_0^2}, \\
\frac{\partial^2 \ell_N}{\partial \sigma^2} \Big|_{\eta = \eta_0} &= \frac{N}{\sigma_0^2} - \frac{3}{\sigma_0^4} \sum_{i=1}^{N} (y_i - \mu_0)^2, \\
\frac{\partial^2 \ell_N}{\partial \Delta \partial \mu} \Big|_{\eta = \eta_0} &= \frac{1}{\sigma_0^2} \Big[ (1-2\theta_1)(n_1 - n_4) + (1-2\theta_2)(n_2 - n_3) \Big], \\
\frac{\partial^2 \ell_N}{\partial \Delta \partial \sigma} \Big|_{\eta = \eta_0} &= \frac{2}{\sigma_0^3} \Big[ (1-2\theta_1) \sum_{i=1}^{N_1} (y_i - \mu_0) + (1-2\theta_2) \sum_{i=N_1+1}^{N_2} (y_i - \mu_0) \\
&\qquad\quad - (1-2\theta_2) \sum_{i=N_2+1}^{N_3} (y_i - \mu_0) - (1-2\theta_1) \sum_{i=N_3+1}^{N} (y_i - \mu_0) \Big], \\
\frac{\partial^2 \ell_N}{\partial \mu \partial \sigma} \Big|_{\eta = \eta_0} &= -\frac{2}{\sigma_0^3} \sum_{i=1}^{N} (y_i - \mu_0).
\end{aligned}
\]

8.3.3 Proof of Lemma 2

The proof of Lemma 2 relies on two other results, which we state here and prove in the next sections.

Lemma 3 (Boundedness of MLE)
(A) For any $x \in [0, D]$, the mle $\hat\lambda^{(N)}(x)$ exists with probability one.
(B) For any $\delta > 0$, there exist a measurable subset $A$ of the sample space with $P(A) \ge 1 - \delta$, real numbers $\sigma_1, \sigma_2$ satisfying $0 < \sigma_1 < \sigma_0 < \sigma_2 < \infty$, a real number $\xi > 0$ and an integer $M > 0$, such that for any sample point in $A$, $x \in [0, D]$, and $N > M$, we have

\[ \big| \hat\mu_j^{(N)}(x) \big| \le \xi, \quad j = 1, 2, \quad \text{and} \quad \sigma_1 \le \big| \hat\sigma^{(N)}(x) \big| \le \sigma_2, \]

where $\delta, A, \sigma_1, \sigma_2, \xi$, and $M$ are independent of $x$.

Lemma 4 (Uniform convergence of the MLE) The MLEs $\hat\mu_1^{(N)}(x)$, $\hat\mu_2^{(N)}(x)$, and $\hat\sigma^{(N)}(x)$ converge to $\mu_0$, $\mu_0$, and $\sigma_0$, respectively, a.s. and uniformly for $x \in [0, D]$.

Since the second order moments of the elements in (A.10) are bounded uniformly for $\eta \in \Gamma$ and $x \in [0, D]$, the conclusion in (A) follows from Chebyshev's inequality, which also implies that

\[ \frac{1}{\sqrt N} \frac{\partial}{\partial \eta} \ell_N(\eta, x) \Big|_{\eta = \eta_0} \tag{A.15} \]

is a.s. bounded uniformly for $x \in [0, D]$, since the elements in (A.15) have mean zero and variances bounded uniformly for $x \in [0, D]$. Applying Chebyshev's inequality once more, we can show that

\[ -\frac{1}{N} \frac{\partial^2}{\partial \eta^2} \ell_N(\eta, x) \Big|_{\eta = \eta_0} \doteq I(x), \tag{A.16} \]

since, by (A.9), the right-hand side of (A.16) is the mean of the left-hand side and the variances of the elements of the left-hand side converge to zero uniformly for $x \in [0, D]$.
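The law-of-large-numbers step behind (A.16) can be illustrated numerically for the $\Delta\Delta$ entry of the information matrix, $\gamma(x)/\sigma_0^2$ from Lemma 1. The Python sketch below simulates phenotypes and marker groups under $H_0$, evaluates the closed-form second derivative in $\Delta$ at the null, and compares its negative average with $\gamma(x)/\sigma_0^2$. This is only a sketch under stated assumptions: the helper names, the sample size, the seed, and the values of $r$, $\theta_1$, $\theta_2$ are hypothetical choices for the example and are not taken from the paper.

```python
import random


def gamma_x(theta1, theta2, r):
    """gamma(x) = (1-2*theta1)^2 (1-r) + (1-2*theta2)^2 r."""
    return (1.0 - 2.0 * theta1) ** 2 * (1.0 - r) + (1.0 - 2.0 * theta2) ** 2 * r


def simulate_null(n, r, mu0, sigma0, rng):
    """Draw marker groups with probabilities (1-r)/2, r/2, r/2, (1-r)/2 and N(mu0, sigma0^2) phenotypes."""
    probs = [(1.0 - r) / 2.0, r / 2.0, r / 2.0, (1.0 - r) / 2.0]
    groups = rng.choices([1, 2, 3, 4], weights=probs, k=n)
    ys = [rng.gauss(mu0, sigma0) for _ in range(n)]
    return ys, groups


def observed_info_delta(ys, groups, mu0, sigma0, theta1, theta2):
    """-(1/N) * d^2 l_N / d Delta^2 at the null, from the closed-form second derivative."""
    w1 = (1.0 - 2.0 * theta1) ** 2  # squared weight for groups 1 and 4
    w2 = (1.0 - 2.0 * theta2) ** 2  # squared weight for groups 2 and 3
    weight_sq = {1: w1, 2: w2, 3: w2, 4: w1}
    n = len(ys)
    total_sq = sum((y - mu0) ** 2 for y in ys)
    weighted_sq = sum(weight_sq[g] * (y - mu0) ** 2 for y, g in zip(ys, groups))
    second_derivative = -n / sigma0 ** 2 + (total_sq - weighted_sq) / sigma0 ** 4
    return -second_derivative / n


# Illustrative settings (assumptions for this sketch only).
rng = random.Random(1)
r, theta1, theta2 = 0.2, 0.05, 0.40
mu0, sigma0 = 0.0, 1.5
ys, groups = simulate_null(20000, r, mu0, sigma0, rng)

estimate = observed_info_delta(ys, groups, mu0, sigma0, theta1, theta2)
target = gamma_x(theta1, theta2, r) / sigma0 ** 2
print(estimate, target)
```

With a sample of this size the two printed values should be close; the gap shrinks at the usual $1/\sqrt{N}$ rate as the sample grows.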
From the Taylor expansion

\[
\frac{\partial}{\partial \eta} \ell_N(\hat\eta^{(N)}(x), x) - \frac{\partial}{\partial \eta} \ell_N(\eta, x) \Big|_{\eta = \eta_0}
= \frac{\partial^2}{\partial \eta^2} \ell_N(\eta, x) \Big|_{\eta = \eta_0} (\hat\eta^{(N)}(x) - \eta_0)
+ \frac12 \frac{\partial^3}{\partial \eta^3} \ell_N(\eta^*(x), x) [\hat\eta^{(N)}(x) - \eta_0,\ \hat\eta^{(N)}(x) - \eta_0],
\]

where $\eta^*(x) = \eta_0 + \zeta (\hat\eta^{(N)}(x) - \eta_0)$ with $0 < \zeta < 1$, and noting that the first term on the left-hand side vanishes at the mle, we have

\[
\begin{aligned}
& \sqrt N\, [I(x)]^{\frac12} \big( \hat\eta^{(N)}(x) - \eta_0 \big) - \frac{1}{\sqrt N} [I(x)]^{-\frac12} \frac{\partial}{\partial \eta} \ell_N(\eta, x) \Big|_{\eta = \eta_0} \\
& \quad = \Big\{ -[I(x)]^{\frac12} \Big[ \frac1N \frac{\partial^2}{\partial \eta^2} \ell_N(\eta, x) \Big|_{\eta = \eta_0} + \frac12 \frac1N \frac{\partial^3}{\partial \eta^3} \ell_N\big(\eta^*(x), x\big) \big( \hat\eta^{(N)}(x) - \eta_0 \big) \Big]^{-1} [I(x)]^{\frac12} - I \Big\}
\cdot \frac{1}{\sqrt N} [I(x)]^{-\frac12} \frac{\partial}{\partial \eta} \ell_N(\eta, x) \Big|_{\eta = \eta_0},
\end{aligned}
\tag{A.17}
\]

where $I$ is the $3 \times 3$ identity matrix. The argument in the first paragraph of this proof and Lemmas 3 and 4 imply that the right-hand side of (A.17) is asymptotically equivalent to 0.

8.3.4 Proof of Lemma 3

Under $H_0$, the mle's of $\mu_1 = \mu_2 = \mu_0$ and $\sigma^2 = \sigma_0^2$ are the sample mean and sample variance, respectively; let $\hat\lambda_0^{(N)} = \big( \hat\mu_0^{(N)}, \hat\mu_0^{(N)}, \hat\sigma_0^{(N)} \big)$. Clearly,

\[ \sup_{\lambda} f(y^{(N)}; \lambda, x) \ge f(y^{(N)}; \hat\lambda_0^{(N)}, x) = \left( \frac{1}{\sqrt{2\pi}\, \hat\sigma_0^{(N)}} \right)^{N} e^{-\frac N2}. \tag{A.18} \]

For any positive number $b$, the variance of $Y$ conditional on $|Y| \le b$ is denoted by

\[ \sigma_b^2 = \mathrm{Var}_0\big( Y \,\big|\, |Y| \le b \big), \]

which converges to $\sigma_0^2 = \mathrm{Var}_0(Y)$ as $b \to \infty$. For any $\varepsilon > 0$, select a positive number $b$ such that

\[ P_0(|Y| \le b) > 1 - \varepsilon \quad \text{and} \quad \sigma_b > (1 - \varepsilon)\sigma_0. \]

The sample mean and sample variance using the $y_i$'s with $|y_i| \le b$ are denoted by $\bar y_b^{(N)}$ and $(\hat\sigma_b^{(N)})^2$, respectively. We have

\[ \lim_{N\to\infty} \hat\sigma_b^{(N)} = \sigma_b > (1 - \varepsilon)\sigma_0. \tag{A.19} \]

Now choose intervals $I_1 = [-2, -1]$, $I_2 = [0, 1]$, $I_3 = [2, 3]$, and $I_4 = [-b, b]$. The choice of these intervals is somewhat arbitrary; we only use $I_1$, $I_2$, and $I_3$ to get a lower bound on $\sigma^2$, and $I_4$ to get a bound on $\mu$.

Let $W_j^{(N)}$ be the number of $Y_1, \ldots, Y_N$ in $I_j$, $j = 1, 2, 3, 4$. Denote by $n_j^{(b)}$ the number of $Y_1, \ldots, Y_N$ in $[-b, b]$ and in group $j$, $j = 1, 2, 3, 4$. Clearly,

\[ W_4^{(N)} = \sum_{j=1}^{4} n_j^{(b)}. \]

Following from the fact that

\[ \lim_{N\to\infty} \frac{W_j^{(N)}}{N} = P_0(Y \in I_j), \quad j = 1, 2, 3, 4, \qquad \lim_{N\to\infty} \hat\sigma_0^{(N)} = \sigma_0, \]

and the limits in (4) and (A.19), there exist a measurable subset $A$ of the sample space with $P_0(A) > 1 - \delta$, a positive number $\alpha$, and an integer $M$ such that for any sample point in $A$ and $N > M$,

\[
\begin{aligned}
W_j^{(N)} &> \alpha N, \quad j = 1, 2, 3, \\
W_4^{(N)} &> (1 - \varepsilon) N, \\
\hat\sigma_0^{(N)} &< \tfrac32 \sigma_0, \\
n_j &> (1 - \varepsilon) \tfrac12 (1 - r) N, \quad j = 1, 4, \\
n_j &> (1 - \varepsilon) \tfrac12 r N, \quad j = 2, 3, \\
\hat\sigma_b^{(N)} &> (1 - \varepsilon)\sigma_0.
\end{aligned}
\tag{A.20}
\]

Hereafter in the proof of Lemma 3, we assume that the sample point is in $A$ and $N > M$, so that (A.20) holds. We then have

\[ \sup_{\lambda} f(y^{(N)}; \lambda, x) \ge \left( \frac{1}{\sqrt{2\pi}\, \hat\sigma_0^{(N)}} \right)^{N} e^{-\frac N2} > \left( \frac{2 e^{-\frac12}}{3 \sqrt{2\pi}\, \sigma_0} \right)^{N}. \tag{A.21} \]

Now, because of the gaps between $I_1$, $I_2$, and $I_3$, for any real values of $\mu_1$ and $\mu_2$ there exists an interval $I^*$ among $I_1$, $I_2$ and $I_3$ such that $\inf\{ |y - \mu_j| : j = 1, 2,\ y \in I^* \} \ge \frac12$. This fact, together with the inequality that $f(y^{(N)}; \lambda, x)$ is less than or equal to

\[ \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{N} \prod_{y_i \in I_1 \cup I_2 \cup I_3} \exp\Big\{ -\frac{1}{2\sigma^2} \min\big[ (y_i - \mu_1)^2, (y_i - \mu_2)^2 \big] \Big\} \tag{A.22} \]

and (A.20), imply that

\[ f(y^{(N)}; \lambda, x) \le \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{N} \left[ \exp\Big( -\frac{1}{2\sigma^2} \cdot \frac14 \Big) \right]^{\alpha N} = \left( \frac{e^{-\frac{\alpha}{8\sigma^2}}}{\sqrt{2\pi}\,\sigma} \right)^{N}. \tag{A.23} \]

As $\sigma \to 0$ or $\sigma \to \infty$, the quantity in parentheses converges to zero, and then the right-hand side of (A.23) is smaller than the right-hand side of (A.21). As a result, the supremum of $f(\cdot)$ can be achieved in the area $0 < \sigma_1 \le \sigma \le \sigma_2 < \infty$, where $\sigma_1$ and $\sigma_2$ are two constants independent of $x$. Furthermore, for any $y_i \in I_1 \cup I_2 \cup I_3$, the limit

\[ \lim_{|\mu_1| \to \infty,\ |\mu_2| \to \infty} \min\big[ (y_i - \mu_1)^2, (y_i - \mu_2)^2 \big] = \infty \]

implies that the right-hand side of (A.22) is smaller than the right-hand side of (A.21) for a bounded $\sigma$ as both $|\mu_1|$ and $|\mu_2|$ converge to $\infty$. Therefore, we have shown that the supremum of $f(\cdot)$ can be achieved in the area where $\sigma_1 \le \sigma \le \sigma_2$ and at least one of $\mu_1$ and $\mu_2$ is bounded.
Now we show that for $|\mu_1| \le b$ and as $|\mu_2| \to \infty$, the value of the likelihood function is smaller than the right-hand side of (A.21). Following from the inequalities (A.20), we have

\[ n_j^{(b)} > n_j - \varepsilon N > (1 - \varepsilon) \tfrac12 (1 - r) N - \varepsilon N \ge (1 - k\varepsilon) \tfrac12 (1 - r) N, \quad j = 1, 4, \]

and

\[ n_j^{(b)} > n_j - \varepsilon N > (1 - \varepsilon) \tfrac12 r N - \varepsilon N \ge (1 - k\varepsilon) \tfrac12 r N, \quad j = 2, 3, \]

where

\[ k = \max\left( \frac{3 - r}{1 - r},\ \frac{2 + r}{r} \right). \]

Consequently,

\[
\begin{aligned}
\lim_{|\mu_2| \to \infty} f(y^{(N)}; \lambda, x)
&\le \lim_{|\mu_2| \to \infty} \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{N}
\prod_{\substack{i=1 \\ y_i \in I_4}}^{N_1} \varphi(y_i; \lambda, \theta_1)
\prod_{\substack{i=N_1+1 \\ y_i \in I_4}}^{N_2} \varphi(y_i; \lambda, \theta_2)
\prod_{\substack{i=N_2+1 \\ y_i \in I_4}}^{N_3} \varphi(y_i; \lambda, 1-\theta_2)
\prod_{\substack{i=N_3+1 \\ y_i \in I_4}}^{N} \varphi(y_i; \lambda, 1-\theta_1) \\
&= \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{N} (1-\theta_1)^{n_1^{(b)}} (1-\theta_2)^{n_2^{(b)}}\, \theta_2^{\,n_3^{(b)}}\, \theta_1^{\,n_4^{(b)}}
\prod_{\substack{i=1 \\ y_i \in I_4}}^{N} \exp\Big[ -\frac{1}{2\sigma^2} (y_i - \mu_1)^2 \Big] \\
&\le [(1-\theta_1)\theta_1]^{(1-k\varepsilon)\frac12 (1-r) N}\, [(1-\theta_2)\theta_2]^{(1-k\varepsilon)\frac12 r N}\,
\frac{e^{-\frac N2}}{\Big( \sqrt{2\pi}\, \sqrt{W_4^{(N)}/N}\ \hat\sigma_b^{(N)} \Big)^{N}} \\
&\le \left( \frac{e^{-\frac12}}{2^{1-k\varepsilon} \sqrt{2\pi}\, (1-\varepsilon)^2 \sigma_0} \right)^{N}.
\end{aligned}
\]

Note that to obtain the second last inequality, the MLEs $\bar y_b^{(N)}$ and $\hat\sigma_b^{(N)}$ have been used. For small $\varepsilon$, clearly the last term is less than the right-hand side of (A.21). Symmetrically, for $|\mu_2| \le b$ and as $|\mu_1| \to \infty$, the value of the likelihood function is also smaller than the right-hand side of (A.21). Therefore, we have shown that the supremum of $f(\cdot)$ is achieved in the area where $\mu_1$, $\mu_2$, and $\sigma$ are bounded uniformly with respect to $x \in [0, D]$. The existence of the mle's of $\mu_1$, $\mu_2$, and $\sigma$ for any $x \in [0, D]$ follows from the continuity of $f(\cdot)$ with respect to $\lambda$.

8.3.5 Proof of Lemma 4

Before we prove Lemma 4 we need a few preliminaries. For any function $h(\lambda, x)$ of $\lambda$ and $x$, any subset $\Lambda$ of the parameter space of $\lambda$, and any subset $\Phi$ of $[0, D]$, define

\[ h(\Lambda, \Phi) = \sup_{\lambda \in \Lambda,\ x \in \Phi} h(\lambda, x). \]

For any $\varepsilon > 0$, $\lambda^* = (\mu_1^*, \mu_2^*, \sigma^*)$, and $x^* \in [0, D]$, define the $\varepsilon$-neighborhood of $\lambda^*$ by

\[ S(\lambda^*, \varepsilon) = \{ \lambda = (\mu_1, \mu_2, \sigma) : |\mu_1 - \mu_1^*| < \varepsilon,\ |\mu_2 - \mu_2^*| < \varepsilon,\ |\sigma - \sigma^*| < \varepsilon \} \]

and the $\varepsilon$-neighborhood of $x^*$ by

\[ S(x^*, \varepsilon) = \{ x \in [0, D] : |x - x^*| < \varepsilon \}. \]

As $\varepsilon \downarrow 0$, we have

\[ f_{Y|G=i}(y; S(\lambda, \varepsilon), S(x, \varepsilon)) \downarrow f_{Y|G=i}(y; \lambda, x) \quad \text{and} \quad f_{Y,G}(y, g; S(\lambda, \varepsilon), S(x, \varepsilon)) \downarrow f_{Y,G}(y, g; \lambda, x). \tag{A.24} \]

From Rao (1973, Section 6.6, page 59), or an application of Jensen's inequality, for $\lambda \ne \lambda_0$,

\[ E_0\left[ \log\left( \frac{f_{Y,G}(Y, G; \lambda, x)}{f_{Y,G}(Y, G; \lambda_0)} \right) \right] < 0. \]

Now we prove Lemma 4. Let $\delta$, $M$, $\sigma_1$, $\sigma_2$, $\xi$ and $A$ be as in Lemma 3. It is sufficient to show that for any $\varepsilon > 0$, there exists an $M_1$ such that when $N > M_1$, $\hat\lambda^{(N)}(x) \in S(\lambda_0, \varepsilon)$, $x \in [0, D]$, with probability $1 - 2\delta$. Define $\Gamma = \{ \lambda : |\mu_1| \le \xi,\ |\mu_2| \le \xi,\ \sigma_1 \le \sigma \le \sigma_2 \}$ and observe that

\[ \left| \log\left( \frac{f_{Y,G}(y, g; \Gamma, [0, D])}{f_{Y,G}(y, g; \lambda_0)} \right) \right| \le K|y| + Q, \tag{A.25} \]

where $K$ and $Q$ are positive constants depending on $\xi$, $\sigma_1$, and $\sigma_2$. Let $B = \Gamma - S(\lambda_0, \varepsilon)$ and note that for any $\lambda \in B$ and $x \in [0, D]$, (A.24), (A.25) and the Lebesgue Dominated Convergence Theorem imply that

\[
\lim_{\varepsilon \downarrow 0} E_0\left[ \log\left( \frac{f_{Y,G}\big(Y, G; S(\lambda, \varepsilon), S(x, \varepsilon)\big)}{f_{Y,G}(Y, G; \lambda_0)} \right) \right]
= E_0\left[ \log\left( \frac{f_{Y,G}(Y, G; \lambda, x)}{f_{Y,G}(Y, G; \lambda_0)} \right) \right] < C(\lambda, x) < 0.
\]

Therefore, there exists an $\varepsilon = \varepsilon(\lambda, x)$ such that

\[ E_0\left[ \log\left( \frac{f_{Y,G}\big(Y, G; S(\lambda, \varepsilon), S(x, \varepsilon)\big)}{f_{Y,G}(Y, G; \lambda_0)} \right) \right] < C(\lambda, x) < 0. \]

The preceding argument applies to each $(\lambda, x) \in B \times [0, D]$. Since $B \times [0, D]$ is compact, there exist points $(\lambda_1, x_1), \ldots, (\lambda_k, x_k)$ and corresponding $\varepsilon_1, \ldots, \varepsilon_k$ and $C_1, \ldots, C_k$ such that

\[ B \times [0, D] \subset \bigcup_{i=1}^{k} S(\lambda_i, \varepsilon_i) \times S(x_i, \varepsilon_i) \]

and for $i = 1, \ldots, k$,

\[ E_0\left[ \log\left( \frac{f_{Y,G}\big(Y, G; S(\lambda_i, \varepsilon_i), S(x_i, \varepsilon_i)\big)}{f_{Y,G}(Y, G; \lambda_0)} \right) \right] < C_i < 0. \tag{A.26} \]

Let $C = \max(C_1, \ldots, C_k)$. Because

\[ \frac1N \log\left( \frac{L_N\big(S(\lambda_i, \varepsilon_i), S(x_i, \varepsilon_i)\big)}{L_N(\lambda_0)} \right) \tag{A.27} \]

converges a.s. to the left-hand side of (A.26), there exist a measurable subset $A_1 \subset A$ with $P_0(A_1) > 1 - 2\delta$ and an integer $M_1 > M$ such that when $N > M_1$, (A.27) $< C < 0$ for $i = 1, \ldots, k$. Now for any $\lambda \in B$ and any $x \in [0, D]$, there exists an $i$, $1 \le i \le k$, such that $(\lambda, x) \in S(\lambda_i, \varepsilon_i) \times S(x_i, \varepsilon_i)$, and for any sample point in $A_1$ and $N > M_1$,

\[
\frac1N \big[ \ell_N(\lambda, x) - \ell_N(\lambda_0) \big] = \frac1N \log\left( \frac{L_N(\lambda, x)}{L_N(\lambda_0)} \right)
\le \frac1N \log\left( \frac{L_N\big(S(\lambda_i, \varepsilon_i), S(x_i, \varepsilon_i)\big)}{L_N(\lambda_0)} \right) < C < 0,
\]

which implies that $\hat\lambda^{(N)}(x) \notin B$ for all $x \in [0, D]$. By Lemma 3, $\hat\lambda^{(N)}(x) \in \Gamma$ for all $x \in [0, D]$. Thus, for any sample point in $A_1$ and $N > M_1$, $\hat\lambda^{(N)}(x) \in S(\lambda_0, \varepsilon)$ for all $x \in [0, D]$, which concludes the proof.

References

Churchill, G. A., and R. W. Doerge. (1994). Empirical threshold values for quantitative trait mapping. Genetics 138: 963-971.

Cox, D. R., and D. V. Hinkley. (1974). Theoretical Statistics. Chapman & Hall: London.

Darvasi, A., A. Weinreb, V. Minke, J. I. Weller and M. Soller. (1993). Detecting marker-QTL linkage and estimating QTL gene effect and map location using a saturated genetic map. Genetics 134: 943-951.

Davies, R. B. (1977). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 64: 247-254.

Davies, R. B. (1987). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 74: 33-43.

Doerge, R. W., and G. A. Churchill. (1996). Permutation tests for multiple loci affecting a quantitative character. Genetics 142: 285-294.

Doerge, R. W., Z.-B. Zeng and B. S. Weir. (1997). Statistical issues in the search for genes affecting quantitative traits in experimental populations. Statistical Science 12: 195-219.

Dupuis, J., and D. Siegmund. (1999). Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151: 373-386.
Haldane, J. B. S. (1919). The combination of linkage values and the calculation of distance between the loci of linkage factors. J. Genet. 8: 299-309.

Haley, C. S., S. A. Knott and J. M. Elsen. (1994). Genetic mapping of quantitative trait loci in cross between outbred lines using least squares. Genetics 136: 1195-1207.

Hoeschele, I. (2001). Mapping quantitative trait loci in outbred pedigrees. In: Handbook of Statistical Genetics, edited by D. J. Balding, M. Bishop and C. Cannings. Wiley, New York. 599-644.

Jansen, R. C. (2001). Quantitative trait loci in inbred lines. In: Handbook of Statistical Genetics, edited by D. J. Balding, M. Bishop and C. Cannings. Wiley, New York. 567-597.

Lander, E. S., and D. Botstein. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185-199.

Lander, E. S., and N. J. Schork. (1994). Genetic dissection of complex traits. Science 265: 2037-2048.

Lin, M., X.-Y. Lou, M. Chang and R. L. Wu. (2004). A general statistical framework for mapping quantitative trait loci in non-model systems: Issue for characterizing linkage phases. Genetics 165: 901-913.

Piepho, H.-P. (2001). A quick method for computing approximate thresholds for quantitative trait loci detection. Genetics 157: 425-432.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications, 2nd Edition. Wiley, New York.

Rebai, A., B. Goffinet and B. Mangin. (1994). Approximate thresholds of interval mapping tests for QTL detection. Genetics 138: 235-240.

Rebai, A., B. Goffinet and B. Mangin. (1995). Comparing power of different methods for QTL detection. Biometrics 51: 87-99.

Schaid, D. J., C. M. Rowland, D. E. Tines, R. M. Jacobson and G. A. Poland. (2002). Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet. 70: 425-434.
Van Ooijen, J. W. (1992). Accuracy of mapping quantitative trait loci in autogamous species. Theor. Appl. Genet. 84: 803-811.

Wang, K., and J. Huang. (2002). A score-statistic approach for mapping quantitative-trait loci with sibships of arbitrary size. Am. J. Hum. Genet. 70: 412-424.

Wu, R. L., C.-X. Ma and G. Casella. (2007). Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL. Springer-Verlag, New York.

Wu, R. L., C.-X. Ma, I. Painter and Z.-B. Zeng. (2002). Simultaneous maximum likelihood estimation of linkages and linkage phases over a heterogeneous genome. Theor. Pop. Biol. 61: 349-363.

Wu, R. L., M. X. Wang and M. R. Huang. (1992). Quantitative genetics of yield breeding for Populus short rotation culture. I. Dynamics of genetic control and selection models of yield traits. Can. J. Forest Res. 22: 175-182.

Xu, S. Z. (1996). Mapping quantitative trait loci using four-way crosses. Genet. Res. 68: 175-181.

Yin, T. M., X. Y. Zhang, M. R. Huang, M. X. Wang, Q. Zhuge, S. M. Tu, L. H. Zhu and R. L. Wu. (2002). Molecular linkage maps of the Populus genome. Genome 45: 541-555.

Zeng, Z.-B. (1994). Precision mapping of quantitative trait loci. Genetics 136: 1457-1468.

Zou, F., B. S. Yandell and J. P. Fine. (2001). Statistical issues in the analysis of quantitative traits in combined crosses. Genetics 158: 1339-1346.

Zou, F., J. P. Fine, J. Hu and D. Y. Lin. (2004). An efficient resampling method for assessing genome-wide statistical significance in mapping quantitative trait loci. Genetics 168: 2307-2316.
