Cutoff Sample Size Estimation for Survival Data: A Simulation Study

Huiwen Che
Department of Statistics, Uppsala University
Supervisors: Inger Persson, Katrin Kraus
Master Thesis, 15 hp, 2014

Abstract

This thesis demonstrates, through a practical cancer case, the possible cutoff sample size that balances goodness of estimation against study expenditure. Since determining the sample size is a crucial part of designing an experiment, researchers try to find a sample size that achieves the desired power and budget efficiency at the same time. The thesis shows how simulation can be used for sample size and precision calculations with survival data, and the presentation concentrates on the simulation involved in carrying out the estimates and precision calculations. The Kaplan-Meier estimator and the Cox regression coefficient are chosen as point estimators, and the precision measurements focus on the mean squared error and the standard error.

Keywords: sample size; simulation; survival analysis

Contents

1 Introduction
  1.1 Simulation and Sample size
  1.2 Background
2 Method
  2.1 Survival function and Hazard function
  2.2 Kaplan-Meier method
  2.3 Mean squared error
  2.4 Cox regression
3 Simulations: Kaplan-Meier Estimation
  3.1 Simulation design
  3.2 Results
4 Simulation: Cox Regression Estimation
  4.1 Simulation design
  4.2 Results
5 Conclusion and Discussion
A Appendix

Chapter 1

Introduction

1.1 Simulation and Sample size

Sample size estimation is an integral part of planning a statistical study. Usually, a trade-off between accuracy of estimation and study cost is made when deciding on the sample size. An adequate sample size is important to yield statistically significant results; a very large sample, however, may run over budget. Thus, a sample size that satisfies both aspects is required. Besides sample size, the follow-up time, which also receives careful thought in medical research, is taken into consideration in this thesis. The longer the follow-up time, the more information we obtain about life expectancy. Right censoring, which occurs when an observation's event time is greater than the specified study time, is often present in survival data due to insufficient follow-up. Statisticians calculate the required sample size based on the purpose of the study, the level of confidence, and the level of precision. Analytic formulas are one way to determine the sample size; the alternative is simulation. Zhao and Li (2011) noted in their article that the simulation technique, which accommodates more complicated statistical designs, is increasingly used in sample size specification. The availability of computer simulation tools has driven the extensive use of computing-intensive methods. Monte Carlo simulation (referred to simply as simulation in this thesis) can deal with uncertainty, and it attempts to mimic the procedure of collecting samples from the population, which supports its use in sample size and parameter estimation (Efron and Tibshirani, 1986). The motivation for this thesis is to demonstrate the use of simulation in precision and sample size estimation by example. By simulating on concrete data, I intend to illustrate practical results, confirming the theoretical analysis.
For simplicity, the survival data used in this motivating example is regarded as the target population from which samples are drawn. In this situation, the population parameter can be determined. With samples of different sizes taken from the population, statistical inferences are conducted, including sample statistics and precision estimation. The simulation serves as a procedure for evaluating the performance of different sample sizes. The multiple follow-up time plans lead to multiple situations in which to repeat the analysis; the longer the follow-up, the greater the number of events. It is easy to expect that as the sample size gets larger the estimates become more accurate, and the same applies to a longer follow-up time. What is of interest, however, is the cutoff sample size point, or sample size region, at which we reach a certain desired precision, at which the performance of the specified estimator improves dramatically, and beyond which we do not have to accept an increased sample size, and thus increased cost, to achieve a corresponding precision. It is also possible to evaluate the effects of a longer follow-up time. Similar procedures are used in some pilot studies to estimate the required sample size, but generally these pilot studies are either more parametrically oriented, assuming a certain probability distribution to simulate from, or dependent on analogous previous studies. Teare et al. (2014) compared the precision (the width of the confidence interval) for different sample sizes and recommended sample sizes for randomized controlled trials by sampling from distributions. In another paper, Lee et al. (2014) presented a real-data example of whether the 3-month pilot data for 40 patients would proceed to a main study of 233 patients at certain significance levels. The virtual scenario in my thesis may result in less adaptive but more detailed findings.
In this thesis, I outline the simulation method and report the results from the simulation study applied to the Kaplan-Meier estimation and the Cox regression. In the final part, I make some concluding remarks and give a brief discussion. The simulation and analysis are implemented in the software R.

1.2 Background

In this study, the population is generated based on the data reported by Kardaun (1983). In Kardaun's study, the survival times of 90 males with laryngeal cancer, diagnosed and treated during the period 1970-1978, were studied. Analogous to Kardaun's original data, a made-up population of 994 patients, including stage of cancer, year of diagnosis, month in which the patient was diagnosed, patient's age at diagnosis, and survival time (measured in months), is used to conduct the simulation study. The male patients in this fictitious population (hereafter referred to as the population) were diagnosed with laryngeal cancer during 1990-1998. There are four stages of cancer, among which stage 4 is the most severe and stage 1 the least severe. In the population, all patients died within 9 years after the end of the diagnosis period (i.e., within 9 years from 1998). To investigate follow-up time effects on the study, I assume two study termination dates: January 2004, and the day the last survivor died. There are thus two populations with different study termination dates. In population 1, censoring occurs; in population 2, all patients are followed until death occurs to every individual, so complete survival times are recorded. Censoring comes in the form of right censoring in this case: an observation is terminated before the event occurs. In the laryngeal cancer data, if a patient's survival time (in months) T is greater than the study follow-up time, the patient is a survivor at the end of the study, and the observation is marked as censored.
We do not obtain any survival time information after the follow-up. In right-censored data, an observation's time on study is the time interval between diagnosis and either death or the end of the study, together with an indicator of death (1) or survival (0). The two populations have been set up as follows. In population 1, the number of events is 915 (approximately 8% censoring); in population 2, the number of events is 994, with no censoring. The two simulation pools affect the information the samples inherit: samples drawn from population 1 are anticipated to carry less information than samples from population 2.

Chapter 2

Method

To find the possible threshold sample size, we estimate the precision of the inferences and specify a level of precision that is likely to achieve both estimation accuracy and cost efficiency. The measurements used for this evaluation are the mean squared error, which incorporates the variability and bias of an estimator, and the standard error, which is a typical measurement of precision. The mean squared error and the standard error are simple but useful performance measurements, and they are closely tied to the simulation in this thesis. The particular simulation methodology used is the bootstrap method introduced by Efron (1979). The basic idea for estimating the standard error is to generate a number of bootstrap samples of the same size by drawing randomly with replacement from the known observations, and to calculate the estimator of interest for each bootstrap sample. Once a number of estimates have been derived from the bootstrap samples, the standard error of the estimator of interest can be estimated. Efron and Tibshirani (1986) showed that the bootstrap estimate of the standard error approaches the true standard error when the number of simulation replications is sufficient. The same idea can be adopted when calculating the mean squared error.
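As a sketch of the bootstrap idea (illustrative Python with a made-up sample; the thesis uses R, and `statistics.median` stands in for an arbitrary estimator of interest):

```python
import random
import statistics

def bootstrap_se(data, estimator, n_boot=2000, seed=42):
    """Bootstrap estimate of the standard error of `estimator` (Efron, 1979):
    resample with replacement, re-estimate, take the standard deviation."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]  # same size, with replacement
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)

# Hypothetical survival times (months); estimator of interest: the sample median.
times = [2, 4, 5, 8, 9, 12, 20, 33, 51, 60]
print(round(bootstrap_se(times, statistics.median), 2))
```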
Regarding the estimators, the thesis considers models for survival analysis with the following three main characteristics: (1) time-to-event data features; (2) censored observations; (3) the effect of explanatory variables on the death time. The Kaplan-Meier estimator, accommodating characteristics (1) and (2), and the Cox regression, accommodating characteristics (2) and (3), are selected to analyse the survival data. The ability of the Kaplan-Meier method to summarize survival probability intuitively in the presence of censoring, and to offer further implications in survival analyses, is the main reason to devote effort to evaluating the Kaplan-Meier estimator. While the Kaplan-Meier method focuses on the basic shape of the survival function, the Cox regression proceeds to a more complicated analysis of the relationship between survival time and explanatory variables. Since the Cox regression is also the main model used for analysis in Kardaun's study, and the made-up population is grounded on the data from that study, the thesis continues to use the Cox regression to model the survival data. The survival models used are based on the following basic definitions.

2.1 Survival function and Hazard function

In survival analysis, it is common to employ the survival function to describe time-to-event phenomena (Klein and Moeschberger, 1997). The survival function is defined as

S(x) = Pr(X > x).    (2.1)

If the event of interest is death, the survival function models the probability of an individual surviving beyond time x. Being a probability, S(x) is bounded between 0 and 1. A closely related function is the cumulative distribution function of a random variable X, defined as the probability that X is less than or equal to x:

F(x) = P(X ≤ x).    (2.2)

If X is a continuous random variable, S(x) = 1 − F(x), and S(x) is a non-increasing monotone function.
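As a concrete check of definitions (2.1)-(2.2), consider an exponentially distributed survival time (an illustrative choice, not the distribution of the thesis data):

```latex
% X ~ Exp(\lambda), \lambda > 0:
F(x) = P(X \le x) = 1 - e^{-\lambda x}, \qquad
S(x) = 1 - F(x) = e^{-\lambda x}, \quad x \ge 0,
```

so that S(0) = 1 and S(x) decreases monotonically toward 0 as x grows, in line with the properties stated above.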
For continuous survival data, we want to quantify the risk of the event occurring at exactly time x (Klein and Moeschberger, 1997), and hence the hazard function is defined by

h(x) = lim_{Δx→0} P[x ≤ X < x + Δx | X ≥ x] / Δx.    (2.3)

The hazard function, or simply hazard rate, is nonnegative. It can be written as

h(x) = −(d/dx) log S(x).    (2.4)

2.2 Kaplan-Meier method

The Kaplan-Meier (KM) estimator, also known as the product-limit estimator, is widely used in estimating survival functions. Kaplan and Meier (1958) gave a theoretical justification for the method by showing that the KM estimator is a nonparametric maximum likelihood estimator. The estimator is defined as

Ŝ(t) = 1,                       if t < t₁,
Ŝ(t) = ∏_{tᵢ ≤ t} [1 − dᵢ/Yᵢ],  if t₁ ≤ t,    (2.5)

where t₁ is the first observed failure time, dᵢ is the number of individuals who died at time tᵢ, and Yᵢ is the number of individuals at risk of the event of interest at time tᵢ. The KM estimator also takes censoring into account. When there is censoring, being at risk means that an individual has neither experienced the event nor been censored prior to time tᵢ; thus Yᵢ is the number of survivors minus the number of previously censored observations. Figures 2.1 and 2.2 show the Kaplan-Meier estimates of the survival function for population 1 and population 2, respectively.

Figure 2.1: Kaplan-Meier survival function for population 1. The small vertical tick-marks indicate censoring.

Figure 2.2: Kaplan-Meier survival function for population 2.

The Kaplan-Meier method produces estimates of the survival function at the various death times. In this thesis, special interest is given to the survival times (in months) associated with particular survival probabilities: the quartile estimates (the survival times at survival probabilities of 75%, 50%, and 25%). The summary parameters of the two populations are shown in Table 2.1. As the survival time gets larger, the survival probability declines.
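A minimal product-limit computation can be sketched as follows (illustrative Python on made-up `(time, event)` pairs, with event 1 = death and 0 = censored; the thesis uses R):

```python
# Kaplan-Meier sketch: at each distinct death time t, multiply the running
# survival estimate by (1 - d/Y), with d deaths at t and Y subjects at risk.
def kaplan_meier(data):
    """Return a list of (t, S_hat(t)) at each distinct death time."""
    curve, surv = [], 1.0
    for t in sorted({time for time, event in data if event == 1}):
        d = sum(1 for time, event in data if event == 1 and time == t)
        y = sum(1 for time, event in data if time >= t)  # at risk just before t
        surv *= 1.0 - d / y
        curve.append((t, surv))
    return curve

data = [(2, 1), (3, 1), (3, 0), (5, 1), (8, 0), (9, 1)]  # hypothetical
for t, s in kaplan_meier(data):
    print(t, round(s, 3))  # steps: 0.833, 0.667, 0.444, 0.0
```

Note that the censored observation at time 3 is counted as at risk at time 3 but contributes no drop to the curve, which is exactly how censoring enters the Yᵢ of equation (2.5).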
The survival times of the two populations agree at each of the three probability points. Sample estimates will be obtained and compared with the following true parameters.

Survival probability    75%    50%    25%
population 1            4.20   8.95   51.20
population 2            4.20   8.95   51.20

Table 2.1: Quartile estimates of survival times (months)

2.3 Mean squared error

The mean squared error (MSE) measures the mean squared difference between the estimator and the parameter; it evaluates the error made by the estimator and serves as a measurement of the goodness of an estimator (Casella and Berger, 2002, chap. 7). The MSE has the interpretation

MSE(t̂) = E(t̂ − t)².    (2.6)

The MSE can be decomposed into a sum of variance and squared bias,

MSE(t̂) = E(t̂ − t)² = Var(t̂) + B(t̂)²,    (2.7)

where the variance measures the precision of the estimator, while the bias, the difference between the true survival time and the mean of the estimated survival times, measures the accuracy of the estimator.

2.4 Cox regression

Survival analysis is typically concerned with examining the relationship of the survival distribution to some covariates. Cox regression is a modelling approach for exploring the effects of variables (so-called covariates) on survival, as Fox (2002) described in his article. The prediction idea in survival regression is similar to that in ordinary regression (Klein and Moeschberger, 1997). The non-parametric strategy that leaves the baseline hazard h₀(t) unspecified is used here to regress the survival times on the explanatory variables. The model, also called the proportional hazards model, was proposed by Cox (1972) as

hᵢ(t) = h₀(t) exp(β₁xᵢ₁ + β₂xᵢ₂ + · · · + βₖxᵢₖ),    (2.8)

where h₀(t) is an arbitrary baseline hazard rate, i.e., the hazard at time t when all covariates are set to zero; Xᵢ = (xᵢ₁, . . . , xᵢₖ) are the covariates (risk factors) for the ith individual, and β = (β₁, . . . , βₖ) are regression coefficients that predict the proportional change in the hazard.
The covariates enter the model linearly through β₁xᵢ₁ + β₂xᵢ₂ + · · · + βₖxᵢₖ. For two individuals i and i′, the associated linear parts are

ηᵢ = β₁xᵢ₁ + β₂xᵢ₂ + · · · + βₖxᵢₖ

and

ηᵢ′ = β₁xᵢ′₁ + β₂xᵢ′₂ + · · · + βₖxᵢ′ₖ.

The hazard rates in the Cox model are proportional, as the following ratio demonstrates:

hᵢ(t) / hᵢ′(t) = [h₀(t) exp(ηᵢ)] / [h₀(t) exp(ηᵢ′)] = exp(ηᵢ − ηᵢ′),    (2.9)

which is a constant. The relative risk of an individual with risk factors Xᵢ experiencing the event, compared with an individual with risk factors Xᵢ′, is exp(ηᵢ − ηᵢ′). The explanatory variables of interest in this thesis are stage, age, and year of diagnosis. Using the Cox model, the hazard at time t is expressed as

hᵢ(t) = h₀(t) exp(β₁ · stage + β₂ · age + β₃ · year of diagnosis)

or, equivalently,

log hᵢ(t) = log h₀(t) + β₁ · stage + β₂ · age + β₃ · year of diagnosis.

Tables 2.2 and 2.3 show the parameter estimates of the Cox regression after fitting the model to population 1 and population 2. All three covariates have statistically significant coefficients. The regression coefficients of the two populations differ only slightly, and the standard errors of the coefficients for population 2 are slightly smaller than those for population 1 due to the censoring in population 1. The exponentiated coefficients represent the multiplicative effects on the hazard. For instance, as shown in Table 2.2, with an additional stage of the cancer and other covariates held constant, the hazard (the risk of dying at the next instant) increases by a factor of 1.344, i.e., by 34.4 percent. Holding other covariates constant, an additional year of diagnosis multiplies the hazard by a factor of 0.935, a reduction of 6.5 percent.

        coef     exp(coef)  se(coef)  z      p
stage   0.2959   1.344      0.03296   8.98   0.0e+00
age     0.0166   1.017      0.00331   5.02   5.3e-07
year    -0.0668  0.935      0.01556   -4.30  1.7e-05

Table 2.2: The Cox regression on population 1 (coef: coefficient; exp(coef): exponentiated coefficient; se(coef): standard error of the coefficient; z: Z-score; p: P-value; year: year of diagnosis)

The coefficients for population 2 are similar to those for population 1.
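The multiplicative interpretation can be reproduced directly from the coefficients reported in Table 2.2 (an illustrative Python check; the thesis itself uses R):

```python
import math

# Population 1 coefficients from Table 2.2; exp(beta) is the hazard ratio
# for a one-unit increase in the covariate, other covariates held constant.
for name, beta in [("stage", 0.2959), ("age", 0.0166), ("year", -0.0668)]:
    hr = math.exp(beta)
    print(f"{name}: exp({beta}) = {hr:.3f}, {100 * (hr - 1):+.1f}% change in hazard")
```

This reproduces the 1.344 (+34.4%) figure for stage and the 0.935 (-6.5%) figure for year of diagnosis quoted above.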
However, the presence of 8% censoring in population 1 causes some differences. Looking at the exponentiated coefficients in Tables 2.2 and 2.3, we find that the hazards in population 1 increase or decrease more strongly than the hazards in population 2. Taking the covariate stage as an example, the hazard in population 1 increases by 34.4 percent, while the hazard in population 2 increases by 33.6 percent, with an additional stage of the cancer and other covariates held constant. The same observation holds for the covariate age, though the difference is tiny. For the covariate year of diagnosis, the hazard in population 1 is reduced more (6.5 percent) than the hazard in population 2 (4.7 percent); put another way, the impact of year of diagnosis is inflated. It seems that the risk of dying is overestimated in population 1, compared with population 2, due to censoring.

        coef     exp(coef)  se(coef)  z      p
stage   0.2895   1.336      0.0317    9.13   0.0e+00
age     0.0159   1.016      0.0032    4.98   6.4e-07
year    -0.0478  0.953      0.0147    -3.25  1.2e-05

Table 2.3: The Cox regression on population 2

Chapter 3

Simulations: Kaplan-Meier Estimation

In the following two chapters, the simulation procedures and results are discussed. This chapter presents estimates of the survival function for different sample sizes drawn from the population. The nonparametric Kaplan-Meier estimator is used here. As stated in the previous chapter, the quartiles of the KM estimates at survival probabilities 0.75, 0.50, and 0.25 are the primary consideration in each simulation.

3.1 Simulation design

The simulation steps are as follows:

1. Generate a random index of size n with replacement.

2. Draw a sample X of n observations from population 1, according to the index generated in step 1.

3. Calculate the KM estimates from the drawn sample, extract the quartile estimates, and store these three estimates.

4.
Repeat steps 1 to 3 a total of 5000 times.

5. Each simulation yields 5000 estimated survival times at each survival probability; calculate the mean, variance, and bias of the estimates at each quartile.

6. Repeat steps 1 through 5 for a range of sample sizes (n = 30, 40, 50, 60, 75, 100, 125, 150, 200, 250, 300, 350, 400, 450, 500).

7. Repeat the above procedure using population 2.

In each simulation, the times t₁, t₂, t₃ associated with S(t₁) = 0.75, S(t₂) = 0.50, S(t₃) = 0.25 are stored. To capture the average performance of the estimator, we consider the MSE. The variance of the survival times at each target survival probability point, the average bias, and the MSE are computed in the simulation.

3.2 Results

Implementing the above procedures yields the estimates in Tables 3.1 and 3.2. First, take a close look at the results from population 1, shown in Table 3.1. Comparing the bias horizontally (across the three probability points for the same sample size), the high survival probability point tends to have lower bias, with a few exceptions at the 50% survival point; vertically (across sample sizes), the main trend is that the larger the sample size, the smaller the bias. The differences in bias, however, are quite small. In this simulation study, the main source of bias appears to be non-representative samples; with a larger sample size, the sample tends to be more representative. Since all biases are small, we may assume that the simulation setting does reasonably well at drawing representative samples. The remarkable difference lies in the variance. Vertically, larger sample sizes give lower variance, most notably at the 25th percentile point. Horizontally, the variance rises sharply as the survival time gets longer. One possible explanation for this phenomenon is censoring.
In population 1, the censored observations are gathered after a survival time of 60 months, which corresponds to survival probabilities below 25 percent (as shown in Figure 2.1 in the previous chapter). At the 25th percentile survival probability point, less information about death is known, leading to less accurate estimation of the survival time. Another reason may be the smaller number of observations at longer survival times: as patients die or are censored, less and less information is available, which leads to a larger variance. The results for population 2 in Table 3.2 show a trend consistent with the results for population 1; the pattern of bias, variance, and MSE is similar to Table 3.1. Comparing the two tables, it is hard to identify any major difference due to the degree of censoring. One noteworthy exception, however, is the MSE (or the variance) at the 25 percent probability point. The differences between the variances of the two populations are relatively large, especially for some small sample sizes, such as a sample size of 30. The variances at the 25 percent point of population 1 are greater than those of population 2, which may indicate less information in population 1 because of the censoring.

  n    75%Bias   75%Var    75%MSE    50%Bias  50%Var    50%MSE    25%Bias  25%Var     25%MSE
  30   0.11520   1.50823   1.52150   2.70359  51.30076  58.61016  0.68037  420.45954  420.92244
  40   0.06980   0.98183   0.98670   1.98093  34.36520  38.28928  0.86410  303.80036  304.54702
  50   0.07712   0.82087   0.82682   1.53499  23.73817  26.09436  0.81218  264.47523  265.13486
  60   0.04677   0.65724   0.65942   1.28006  20.20743  21.84599  0.55657  218.89704  219.20681
  75   -0.01316  0.52448   0.52465   0.96490  13.05787  13.98890  1.55563  187.39595  189.81593
  100  0.01390   0.37531   0.37551   0.59644  7.21642   7.57216   0.47290  138.98899  139.21262
  125  0.05390   0.30673   0.30963   0.50464  5.17277   5.42743   0.16310  112.62423  112.65083
  150  0.01550   0.25358   0.25382   0.40566  3.99660   4.16116   0.77368  95.94227   96.54085
  200  0.00210   0.18213   0.18214   0.20731  2.11572   2.15869   0.38242  70.95333   71.09958
  250  0.00928   0.15339   0.15347   0.17881  1.63492   1.66690   0.52260  60.41012   60.68323
  300  -0.00458  0.11771   0.11773   0.09620  1.14603   1.15529   0.50525  47.38854   47.64382
  350  0.00254   0.10306   0.10307   0.10310  0.91237   0.92300   0.64590  41.09315   41.51034
  400  -0.00203  0.08889   0.08889   0.09154  0.79341   0.80179   0.65155  35.53592   35.96043
  450  -0.00364  0.07895   0.07896   0.03990  0.64337   0.64496   0.47350  32.13124   32.35544
  500  -0.00138  0.07009   0.07009   0.03721  0.54959   0.55097   0.42431  27.69627   27.87631

Table 3.1: The quartile KM estimates of bias, variance and MSE for population 1 (n: sample size)

  n    75%Bias   75%Var    75%MSE    50%Bias  50%Var    50%MSE    25%Bias  25%Var     25%MSE
  30   0.12456   1.62484   1.64036   2.56653  50.89364  57.48072  0.01022  375.54986  375.54997
  40   0.06476   1.01728   1.02147   1.94487  33.60324  37.38576  0.14262  298.91219  298.93253
  50   0.05998   0.77794   0.78154   1.54496  24.23997  26.62687  0.60056  245.34415  245.70482
  60   0.04474   0.63842   0.64042   1.18701  17.30307  18.71206  0.23290  214.34067  214.39491
  75   -0.01088  0.52251   0.52263   0.93484  12.92215  13.79608  1.61344  176.30919  178.91238
  100  0.01791   0.37394   0.37426   0.67663  8.25865   8.71648   0.26768  138.95243  139.02408
  125  0.03812   0.30754   0.30899   0.47986  5.29591   5.52617   0.00652  110.72287  110.72291
  150  0.01450   0.25029   0.25050   0.38719  4.33859   4.48851   0.70214  96.88729   97.38030
  200  0.00748   0.17726   0.17731   0.26129  2.42025   2.48852   0.70176  73.01640   73.50887
  250  -0.00048  0.14877   0.14877   0.17477  1.67099   1.70154   0.60050  58.31568   58.67628
  300  0.01455   0.12133   0.12154   0.14760  1.32389   1.34568   0.60266  50.67014   51.03334
  350  -0.00358  0.10578   0.10580   0.10324  1.04873   1.05939   0.56174  43.19440   43.50995
  400  -0.00236  0.08845   0.08846   0.05087  0.76262   0.76521   0.47388  36.29475   36.51931
  450  -0.00562  0.07824   0.07827   0.03107  0.62564   0.62661   0.49254  30.30830   30.55090
  500  -0.00293  0.07348   0.07349   0.05530  0.57760   0.58066   0.55070  28.58522   28.88850

Table 3.2: The quartile KM estimates of bias, variance and MSE for population 2 (n: sample size)

Figure 3.1: MSE of survival time at the three quartile probability points for the two populations. The vertical axis is on a logarithmic scale, due to the relatively large range of the MSE values. The three probability points are shown with different symbols and the two populations in different colors.

Figure 3.1 shows the MSE results graphically. The biases of the estimates are quite close (Tables 3.1, 3.2), and the difference in MSE lies mainly in the variance. Again, we see that the MSE is larger at the lower survival probability points, and the differences between the two populations are quite small on this logarithmic scale. The MSE declines as the sample size goes up, and it is apparent that the rate of decline slows down as the size grows. Although there is no definition of a best estimator in terms of the MSE, the sample size of 100 appears to be the trade-off threshold point for all three probability points of interest. When the sample size is smaller than 100, the MSE curves have steeper slopes; beyond 100, all curves are fairly flat. This holds for both populations.
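The whole of section 3.1 can be condensed into a small sketch (illustrative Python on a made-up population, not the thesis data; the thesis uses R): resample with replacement, read the quartile survival time off the KM curve, and summarize bias, variance, and MSE over the replications.

```python
import random

def km_quantile(data, p):
    """First death time t at which the KM estimate S_hat(t) drops to <= p."""
    surv = 1.0
    for t in sorted({time for time, event in data if event == 1}):
        d = sum(1 for time, event in data if event == 1 and time == t)
        y = sum(1 for time, event in data if time >= t)
        surv *= 1.0 - d / y
        if surv <= p:
            return t
    return None  # curve never reaches p (heavy censoring)

def simulate(pop, n, p, reps=500, seed=7):
    """Bias, variance, and MSE of the KM p-quantile over `reps` resamples."""
    true_t = km_quantile(pop, p)
    rng = random.Random(seed)
    ests = [km_quantile([rng.choice(pop) for _ in range(n)], p) for _ in range(reps)]
    ests = [t for t in ests if t is not None]
    mean = sum(ests) / len(ests)
    var = sum((t - mean) ** 2 for t in ests) / len(ests)
    bias = mean - true_t
    return bias, var, var + bias ** 2  # MSE = variance + bias^2, as in (2.7)

# Made-up population: one death per month for 94 months, 6 censored at month 95.
pop = [(t, 1) for t in range(1, 95)] + [(95, 0)] * 6
print(simulate(pop, n=50, p=0.50))
```

Running this over a grid of sample sizes, as in step 6 of the design, would produce a table of the same shape as Tables 3.1 and 3.2.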
Chapter 4

Simulation: Cox Regression Estimation

4.1 Simulation design

The underlying distribution of the data is unknown, but the population is available. Therefore the sampling strategy used in this research is case sampling: individual cases (each row of the data frame) are sampled to draw random samples. The simulation steps are as follows:

1. Generate a random index of size n with replacement.

2. Draw a sample X of n observations from population 1, according to the index generated in step 1.

3. Fit the previously specified Cox regression model to the sample X and store the estimate of each coefficient: stage, age, and year of diagnosis.

4. Repeat steps 1 to 3 a total of 5000 times.

5. After the replications there are 5000 estimates of each coefficient; calculate the mean and standard deviation of these 5000 estimates to obtain the estimated mean and standard error of each coefficient.

6. Repeat steps 1 through 5 for a range of sample sizes (n = 15, 17, 19, 22, 25, 28, 32, 36, 40, 45, 50, 55, 65, 75, 85, 100, 125, 150, 200, 250, 300, 350, 400, 450, 500).

7. Apply the above procedure to population 2.

Since the population size is finite, sampling without replacement would make the covariance between different sample values non-zero. To rule out this dependence, I sample with replacement. Sample sizes range from sizes as small as 15 to relatively large sizes up to 500. In this study, if the sample size is below 15, the number of events may not be sufficient to fit the regression. The performance at small sample sizes may change markedly, and hence the intervals between the chosen small sample sizes are kept small.

4.2 Results

For the design described in the previous section, the estimated mean regression coefficients and their standard errors for varying sample sizes are derived from the simulation. A selection of the results (sample sizes n = 15, 20, 30, 40, 50, 75, 100, 125, 150, 200, 300, 400, 500) for population 1, with follow-up terminating in January 2004, is shown in Table 4.1.
sample size  stage      se.stage   age        se.age     year of diag  se.year of diag
15           0.4373869  0.4873281  0.0196584  0.0468380  -0.1003996    0.2197601
20           0.3862809  0.3451874  0.0189307  0.0350348  -0.0874264    0.1693540
30           0.3564377  0.2447388  0.0181800  0.0257017  -0.0838596    0.1229153
40           0.3439469  0.2047557  0.0178105  0.0208841  -0.0785292    0.0992787
50           0.3327355  0.1768086  0.0173763  0.0183326  -0.0771482    0.0864128
75           0.3167494  0.1420939  0.0172394  0.0145324  -0.0728275    0.0671305
100          0.3136281  0.1148034  0.0170618  0.0119367  -0.0712902    0.0562382
125          0.3088530  0.1031355  0.0169817  0.0104599  -0.0712314    0.0496486
150          0.3071950  0.0916850  0.0171219  0.0096201  -0.0682126    0.0453214
200          0.3031681  0.0770696  0.0172510  0.0082965  -0.0686442    0.0388819
300          0.3013235  0.0635650  0.0168542  0.0066123  -0.0686322    0.0311307
400          0.2990278  0.0554107  0.0167197  0.0056633  -0.0675111    0.0270559
500          0.2989356  0.0497359  0.0169062  0.0051131  -0.0666488    0.0239313
parameter    0.2959     0.03296    0.0166     0.00331    -0.0668       0.01556

Table 4.1: The estimated regression coefficients and standard errors for population 1 with the study ending in January 2004

The estimated mean coefficients, on average, get closer to the population parameters (see Table 2.2) as the sample size increases. From a regression modelling perspective, the hazards are overestimated at small sample sizes: the smaller the sample size, the greater the apparent impact of the explanatory variables. The performance improves dramatically among small sample sizes, while the results change slowly among large sample sizes. The standard error declines as the sample size increases, as expected, and approaches the population standard error; the sample size influences the standard error considerably for small sample sizes. Similar results hold for population 2 (see Table 2.3) with the longer follow-up time, as shown in Table 4.2.
sample size  stage      se.stage   age        se.age     year of diag  se.year of diag
15           0.4261169  0.4958081  0.0210507  0.0458473  -0.0939805    0.2163588
20           0.3722668  0.3313938  0.0187941  0.0337496  -0.0804861    0.1600106
30           0.3484987  0.2436342  0.0175071  0.0249944  -0.0723029    0.1182873
40           0.3317312  0.1984947  0.0172376  0.0206801  -0.0658672    0.0975629
50           0.3241549  0.1738218  0.0167006  0.0174436  -0.0617790    0.0834075
75           0.3130634  0.1334420  0.0167675  0.0139466  -0.0581667    0.0646893
100          0.3057838  0.1121780  0.0164154  0.0116479  -0.0560753    0.0560396
125          0.3049019  0.0986926  0.0162281  0.0102005  -0.0555682    0.0477849
150          0.2981441  0.0890423  0.0163201  0.0091576  -0.0537220    0.0435515
200          0.2968767  0.0776443  0.0160546  0.0079510  -0.0525709    0.0369441
300          0.2947015  0.0609716  0.0161918  0.0063128  -0.0516554    0.0302071
400          0.2946066  0.0524316  0.0161907  0.0054369  -0.0505546    0.0261691
500          0.2924009  0.0459512  0.0159905  0.0049064  -0.0501157    0.0228183
parameter    0.2895     0.0317     0.0159     0.0032     -0.0478       0.0147

Table 4.2: The estimated regression coefficients and standard errors for population 2 with the study ending at the last death of the cancer patients

Considering the two follow-up time plans, Figures 4.1, 4.2, and 4.3 illustrate that as the sample size increases, the estimates associated with the two follow-up plans approach their respective population parameters. The censoring in population 1, which means fewer deaths, results in parameter values in population 1 that are larger in magnitude than those in population 2. This further leads to a systematic estimation difference between samples drawn from the two populations. The scale of the difference is fairly small and tends to be negligible. The standard errors of the two time plans, as shown in Tables 4.1 and 4.2, differ only slightly. Generally, the estimated standard errors under the shorter follow-up are slightly greater than those under the longer follow-up, which is caused by the limited death information in the shorter follow-up.
The negligible difference in this case is likely due to the small difference in the degree of censoring (approximately 8% censoring) between the two follow-up time plans. More substantial effects of longer follow-up time, and of larger differences in the degree of censoring, may be expected in other cancer studies.

Figure 4.1: Estimated mean regression coefficient for stage. The points in red are estimates for population 1 (population parameter βstage = 0.2959) and the points in blue are estimates for population 2 (population parameter βstage = 0.2895).

Figure 4.2: Estimated mean regression coefficient for age (population 1 parameter βage = 0.0166 and population 2 parameter βage = 0.0159).

Figure 4.3: Estimated mean regression coefficient for year of diagnosis (population 1 parameter βyear = −0.0668 and population 2 parameter βyear = −0.0478).

Figures 4.4, 4.5, and 4.6 display the estimated regression coefficients and their standard errors as error bars. The error bars become shorter at a decelerating rate as the sample size increases. It is also evident in these figures that the two follow-up plans have little influence on the regression coefficient and standard error estimation, as the point estimates and error bars overlap. If we wish to achieve a prespecified standard error below, for example, 0.01 for all three coefficients, a sample size of approximately 100 to 125 is needed. The cutoff point that balances precision and cost might be found around a sample size of 100, since the estimation performance improves much more slowly beyond size 100.

Figure 4.4: Estimated standard error of the coefficient for stage. The point is the mean coefficient, and the error bar extends one standard error above and one standard error below the point. The red colour represents samples from population 1 and the blue colour samples from population 2 (population 1 s.e.stage = 0.03296 and population 2 s.e.stage = 0.0317).
Figure 4.5: Estimated standard error of the coefficient for age (population 1 s.e.age = 0.00331 and population 2 s.e.age = 0.0032).

Figure 4.6: Estimated standard error of the coefficient for year of diagnosis (population 1 s.e.year = 0.01556 and population 2 s.e.year = 0.0147).

Chapter 5

Conclusion and Discussion

Both the survival function estimator and the Cox regression coefficient estimator are investigated. In chapter 3, we consider the inferential characteristics of the Kaplan-Meier estimator. The performance of the estimator improves substantially with respect to the MSE (mainly driven by the variance) for sample sizes up to 100, with much smaller changes thereafter. The trend of increasingly drastic variance as the survival probability decreases is rather interesting; it may be accounted for by the censoring and by the decreasing information available at lower survival probabilities, as discussed before. This trend may also be caused by another key factor: the simulation design. The simulation design in this thesis is somewhat crude, in that a specific percentage of censoring in each sample is not guaranteed. Since the simulation process is random, the degree of censoring varies between samples drawn from population 1, which may explain the large variance at the low survival probability points where the censored observations in population 1 are mainly found. If every simulated sample from population 1 had 8% censoring, the result might change, although the simulation design explains little about the trend in population 2, where there is no censoring. In chapter 4, we consider the Cox regression coefficient estimation. The coefficient and its precision estimates show concave-down curves, which confirms the presumption of a cutoff sample size for some specified precision.
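The MSE decomposition used in chapter 3 (and computed in the appendix R code as mse = bias^2 + var) can be verified with a small Monte Carlo experiment. The sketch below is written in Python for portability; the exponential population, its mean of 30 months, and the sample size of 50 are purely illustrative assumptions, not the thesis data.

```python
import random
import statistics

# Monte Carlo check of the identity MSE = bias^2 + variance for a point
# estimator: here the sample median of a hypothetical exponential
# population with mean 30, so the true median is 30 * ln 2.
rng = random.Random(3)
true_median = 30 * 0.6931471805599453

estimates = []
for _ in range(2000):
    sample = [rng.expovariate(1 / 30) for _ in range(50)]
    estimates.append(statistics.median(sample))

bias = statistics.mean(estimates) - true_median
variance = statistics.pvariance(estimates)
mse = statistics.fmean((e - true_median) ** 2 for e in estimates)

# The direct Monte Carlo MSE and the decomposition agree exactly
# (up to floating-point error).
print(mse, bias ** 2 + variance)
```

The identity is exact when the population variance of the replicates is used, which is why the appendix code can report MSE as squared bias plus variance without computing squared errors directly.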
The coefficient estimates for small sample sizes, compared with those for large sample sizes, indicate an overestimation of the death risk when the sample size is small: for a unit increase in one of the covariates, the estimated hazard increases more than it does in the population. The degree of overestimation could serve as a supplementary measure when deciding the sample size for regression models, but more thought is required to set up a threshold for overestimation or possible underestimation. If the standard error of the coefficients is controlled at 0.01, we may conclude that the cutoff point is around a sample size of 100, which is consistent with the conclusion from the Kaplan-Meier estimation. Only a very small effect of follow-up time is observed, and more meaningful results might be discovered with longer follow-up designs. Another simulation design strategy, when simulating from a population with censoring, may sharpen the results on the effect of follow-up. By demonstrating with practical examples, I show how to use simulation techniques to estimate parameters, to estimate the precision of those estimates, and to calculate sample sizes. Though computationally intensive, simulation approaches are applicable to any data-generating model and any statistical test. The flexibility of simulation enables researchers to estimate the sample size required in complex medical study designs, for which conventional sample size determination methods might not be available. Arnold et al. (2011) state that it is common to examine treatment effects in clinical trials; with simulation techniques, we can determine the sample size required for detecting an interaction effect at some significance level. Stahl and Landau (2013) claim that simulation requires statisticians to state the analysis procedures clearly, which encourages investigators to be more realistic and more cautious about modelling and estimation.
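The power-by-simulation idea can be sketched briefly. The example below is a hedged Python illustration, not the thesis's own analysis: it assumes a simple two-group exponential survival model without censoring and tests the log hazard ratio with a normal approximation (for an exponential sample, the log of the sample mean has asymptotic variance 1/n). The hazard ratio of 1.5 and the candidate sample sizes are hypothetical.

```python
import math
import random
import statistics

def simulate_power(n, hazard_ratio, n_sim=2000, z_crit=1.96, seed=1):
    """Estimated power of a two-sided z-test for the log hazard ratio in
    a two-group exponential model: the MLE of the log hazard is
    -log(sample mean), so the log hazard-ratio estimate has standard
    error sqrt(1/n + 1/n)."""
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n)
    rejections = 0
    for _ in range(n_sim):
        control = [rng.expovariate(1.0) for _ in range(n)]
        treated = [rng.expovariate(hazard_ratio) for _ in range(n)]
        log_hr = math.log(statistics.mean(control)) - math.log(statistics.mean(treated))
        if abs(log_hr / se) > z_crit:
            rejections += 1
    return rejections / n_sim

# Increase the per-group sample size until the estimated power is adequate.
for n in (25, 50, 100, 200):
    print(n, simulate_power(n, hazard_ratio=1.5))
```

The same loop structure carries over to any design: replace the data-generating step and the test with the ones under study, and read off the smallest sample size whose rejection rate reaches the desired power.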
In this thesis, I only consider the MSE and the standard error as measures of precision. In the Kaplan-Meier simulation, the choice of point estimate (the survival times associated with the quartile survival probabilities) might be improved by comparing survival probabilities rather than survival times, as probabilities generalize better. It might be more reasonable to find the survival times associated with the quartile survival probabilities from the population parameters and then, for each sample, look up the approximately corresponding survival probability at those population survival times; the sample probabilities would then be compared with the population probabilities. Beyond the standard error, further statistical measures such as confidence intervals could be used to draw more reliable conclusions. More work could also be done on power analysis for sample size determination, since specifying the probability that a particular estimate will be statistically significant is also a commonly adopted approach. Since the study is based on a real data case, the results are only applicable to this cancer study.
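The cutoff-search logic that runs through the thesis can be summarised as: resample the observed data with replacement at each candidate sample size, take the standard deviation of the point estimates across replicates as the standard error, and report the smallest size meeting a precision target. The following is a Python sketch of that procedure under assumed inputs (synthetic exponential survival times and a hypothetical target of 2.5 months for the median survival time), not the R analysis of the study data.

```python
import random
import statistics

def bootstrap_se(data, n, n_boot=1000, seed=7):
    # Standard error of the sample median at sample size n, estimated by
    # drawing bootstrap samples with replacement (mirroring
    # sample(..., replace=TRUE) in the R appendix).
    rng = random.Random(seed)
    medians = [statistics.median(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(medians)

def cutoff_sample_size(data, sizes, target_se):
    # Smallest candidate size whose bootstrap standard error meets the target.
    for n in sorted(sizes):
        if bootstrap_se(data, n) < target_se:
            return n
    return None  # no candidate reaches the target precision

# A hypothetical "population" of 994 survival times in months
# (synthetic exponential data with mean 30, not the study data).
rng = random.Random(1)
times = [rng.expovariate(1 / 30) for _ in range(994)]

cutoff = cutoff_sample_size(times, sizes=(30, 50, 100, 200, 500), target_se=2.5)
print(cutoff)
```

As in the thesis, the standard error shrinks roughly with the square root of the sample size, so the gain in precision per added observation diminishes, and a cutoff appears once the curve flattens below the target.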
Appendix A

R code for simulation

dat=read.csv("popdata.csv",header=TRUE)
### Diagnosed during 1 Jan 1990 - 1 Jan 1999; the end of the study is 1 Jan 2004

# identify the censored observations
censor=function(data){
  n=dim(data)[1]
  status=rep(NA,n)
  ind=rep(0,n)
  ind=data$diag_yr+(data$diag_mn+data$time)%/%12
  for(i in 1:n){
    if(ind[i]>=104){
      status[i]=0
    } else {
      status[i]=1
    }
  }
  return(status)
}
death1=censor(data=dat)   # death indicator
cdat1=cbind(dat,death1)   # create new dataset with indicator

# modify the survival time of censored observations according to the
# study time span
surtime=function(data){
  n=dim(data)[1]
  for(i in 1:n){
    if(data$death1[i]==0){
      data$time[i]=(104-data$diag_yr[i])*12+1-data$diag_mn[i]
    }
  }
  return(data$time)
}
stime1=surtime(data=cdat1)

# my dataset for further analysis
mydata1=data.frame(cdat1$id,cdat1$stage,cdat1$diag_yr,cdat1$age,
                   stime1,death1)
colnames(mydata1)=c("id","stage","year","age","time","death")

### detect the longest time (in years) the patients survive
ls=function(data){
  n=dim(data)[1]
  st=rep(0,n)
  for(i in 1:n){
    st[i]=(data$diag_mn[i]+data$time[i])/12+data$diag_yr[i]-100
  }
  return(max(st))
}
ls(dat)   # the last survivor lived to June 2008

### Dataset 2: the end of the study is 1 Jan 2009; no censoring
death2=rep(1,994)
mydata2=data.frame(dat$id,dat$stage,dat$diag_yr,dat$age,dat$time,death2)
colnames(mydata2)=c("id","stage","year","age","time","death")

# install.packages("survival")
library(survival)

### Kaplan-Meier
# population
my.surv1=survfit(Surv(time,death)~1,data=mydata1)
my.surv1
quantile(my.surv1,c(0.25,0.5,0.75))
my.surv2=survfit(Surv(time,death)~1,data=mydata2)
my.surv2
quantile(my.surv2,c(0.25,0.5,0.75))

myfit=function(data){
  f=survfit(Surv(time,death)~1,data=data)
  return(f)
}

# simulation
set.seed(77)
KM=function(s,B,data){
  n=dim(data)[1]
  m=length(s)
  index=data$id
  quan=matrix(0,nrow=B,ncol=3)
  avg=matrix(0,nrow=m,ncol=3)
  va=matrix(0,nrow=m,ncol=3)
  bias=matrix(0,nrow=m,ncol=3)
  tt=matrix(0,nrow=B,ncol=3)
  mse=matrix(0,nrow=m,ncol=3)
  for(i in 1:m){
    for(j in 1:B){
      newindex=sample(index,size=s[i],replace=TRUE)
      newdata=data[newindex,]
      quan[j,]=quantile(myfit(newdata),c(0.25,0.50,0.75))$quantile
    }
    avg[i,]=apply(quan,2,mean)
    va[i,]=apply(quan,2,var)
    bias[i,]=avg[i,]-c(4.20,8.95,51.20)
    mse[i,]=bias[i,]^2+va[i,]
  }
  res=data.frame(s,bias[,1],va[,1],mse[,1],bias[,2],va[,2],mse[,2],
                 bias[,3],va[,3],mse[,3])
  colnames(res)=c("sample.size","bias75","Var75","mse75","bias50",
                  "Var50","mse50","bias25","Var25","mse25")
  return(res)
}
s=c(30,40,50,60,75,100,125,150,200,250,300,350,400,450,500)
km1=KM(s,B=5000,mydata1)
km2=KM(s,B=5000,mydata2)

### Cox regression
# population
m1=coxph(Surv(time,death)~stage+age+year,data=mydata1)
m1
summary(m1)
m2=coxph(Surv(time,death)~stage+age+year,data=mydata2)
m2
summary(m2)

# simulation
set.seed(43)
Reg=function(data){
  H0=coxph(Surv(time,death)~stage+age+year,data)
  H1=coef(summary(H0))
  betas=H1[,1]
  return(betas)
}
sim=function(s,B,data){
  n=dim(data)[1]
  m=length(s)
  nc=3
  index=data$id
  reg=matrix(0,nrow=B,ncol=nc)
  beta=matrix(0,nrow=m,ncol=nc)
  se=matrix(0,nrow=m,ncol=nc)
  for(i in 1:m){
    for(j in 1:B){
      newindex=sample(index,size=s[i],replace=TRUE)
      newdata=data[newindex,]
      reg[j,]=Reg(newdata)
    }
    beta[i,]=apply(reg,2,mean)
    se[i,]=apply(reg,2,sd)
  }
  res=data.frame(s,beta[,1],se[,1],beta[,2],se[,2],beta[,3],se[,3])
  colnames(res)=c("sample.size","stage","se.stage","age","se.age",
                  "year","se.year")
  return(res)
}
s=c(15,17,20,22,25,28,30,32,36,40,45,50,55,65,75,85,100,125,150,200,
    250,300,350,400,450,500)
est1=sim(s,B=5000,data=mydata1)
est2=sim(s,B=5000,data=mydata2)

Bibliography

Arnold, B. F., Hogan, D. R., Colford, Jr, J. M., and Hubbard, A. E. (2011). Simulation methods to estimate design power: an overview for applied research. BMC Medical Research Methodology, 11(1):94.

Casella, G. and Berger, R. L. (2002). Statistical Inference. Thomson Learning, Australia, second edition.

Cox, D. (1972). Regression models and life-tables.
Journal of the Royal Statistical Society, Series B, 34(2):187–220.

Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7(1):1–26.

Efron, B. and Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1(1):54–75.

Fox, J. (2002). Cox proportional-hazards regression for survival data. Available online at http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf.

Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457–481.

Kardaun, O. (1983). Statistical survival analysis of male larynx-cancer patients - a case study. Statistica Neerlandica, 37(3):103–125.

Klein, J. P. and Moeschberger, M. L. (1997). Survival Analysis: Techniques for Censored and Truncated Data. Springer, New York.

Lee, E. C., Whitehead, A. L., Jacques, R. M., and Julious, S. A. (2014). The statistical interpretation of pilot trials: should significance thresholds be reconsidered? BMC Medical Research Methodology, 14(1):41.

Stahl, D. and Landau, S. (2013). Sample size and power calculations for medical studies by simulation when closed form expressions are not available. Statistical Methods in Medical Research, 22(3):324–345.

Teare, M. D., Dimairo, M., Shephard, N., Hayman, A., Whitehead, A., and Walters, S. J. (2014). Sample size requirements to estimate key design parameters from external pilot randomised controlled trials: a simulation study. Trials, 15(1):264.

Zhao, W. and Li, A. X. (2011). Estimating sample size through simulations. Available online at http://www.pharmasug.org/proceedings/2011/SP/PharmaSUG-2011-SP08.pdf.
