Course BIOS601: ASSIGNMENT on Measurement Errors and their Effects. Fall 2017, v08.25

[Figure: "Being approximately correct and being precisely wrong" – four target-style panels illustrating the combinations of LARGE/SMALL BIAS with LARGE/SMALL VARIABILITY.]

1. Refer to the descriptions of the SMOG index, the Fry method, the Flesch Reading Ease, and the Flesch-Kincaid Grade Level, for measuring readability (under Resources for Measurement/Surveys).[1] For the article or text you have chosen (as per discussion in class), randomly select three separate 100-word passages, and use this set of three passages to measure the readability (F1) using the Fry graph. Rather than do so manually, you can use the SMOG calculator to determine the average number of sentences and syllables per hundred words. Repeat the readability measurement (F2) with a second, different set of three passages. Repeat once more (F3), using a third set. Using these same three sets, calculate the SMOG index, the Flesch Reading Ease, and the Flesch-Kincaid Grade Level. For each index, use the 3 estimates to calculate the standard error of measurement, and the coefficient of variation. Comment. (A small R sketch after question 6 below illustrates the arithmetic.)

[1] In 2010, ToneCheck (https://techcrunch.com/2010/07/20/tonecheck/) seemed like an interesting tool, but JH can't find it anymore in 2017.

2. Propose a method to assess the validity of a readability index.

3. [m-s] Derive the link between the standard error of measurement and the (intraclass correlation) reliability coefficient [last line, column 1, p. 7 of the notes on "Quantifying Reliability" in Notes on Psychometrics for students in rehabilitation sciences, in Resources for Measurement/Surveys]. Hint: it's simply a matter of using the definition of R.

4. [m-s] Exercise in section 3: Relationship between test-retest correlation and ICC(X) [in the notes on Effect of Errors in X and Y on measured correlation and slope].

5. [m-s] Exercise in section 4: Relationship between correlation(X, X′) and ICC(X) [ibid.].

6. Francis Galton (1822-1911) found that the correlation between (self-reported) parental and (adult) offspring heights was strongest for the one between father and son [0.396 ± 0.024], and weakest for the one between mother and daughter [0.284 ± 0.028]. [It was 0.302 ± 0.027 for mother & son; 0.360 ± 0.026 for father & daughter.] Given the way he obtained the measurements, can you imagine why this was?[2]

[Note: Family heights: Page 1/8 of notebook in Galton Papers: see "Galton's family data on human stature" – the link is on the left hand side of JH's home page.]

[2] After you have thought about it for a while, and looked carefully at Galton's Notebook, you might wish to compare your answer with that given by Karl Pearson: Cf. "Why Galton got different parent-offspring correlations in heights and he (KP) got larger ones" in the 'Measurement – Lecture Notes, etc' section of the bios601 resources page for Measurement.
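For question 1, the arithmetic is small enough to do by hand, but a minimal R sketch may help fix ideas. The three values below are made-up placeholders, not real readability scores:

    # Hypothetical example for question 1 (made-up numbers): three readability
    # determinations of the same text, each based on a different set of
    # three 100-word passages.
    F.estimates <- c(9.1, 10.3, 9.8)

    SEM <- sd(F.estimates)                 # standard error of measurement: SD of repeated determinations
    CV  <- 100 * SEM / mean(F.estimates)   # coefficient of variation, as a percentage
    round(c(mean = mean(F.estimates), SEM = SEM, CV.percent = CV), 2)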
7. Bridging the physical- and the psycho-metric: The notes on "Increasing Reliability by averaging several measurements" in the right hand column of page 4 of JH's notes on Quantifying Reliability give the formula for the so-called "Stepped-Up Reliability". In psychometrics (where the number of items on a test serves as the "several measurements") this formula serves as the basis for the "Spearman-Brown prediction formula".[3] [m-s] Invert the formula on p.4 to derive the one in the right hand column of p.1 for the Spearman-Brown prediction formula relating the reliability of two versions of a test, one with N times more items than the other.

[3] Wikipedia has an entry called 'Spearman Brown prediction formula'.

8. You are trying to estimate, from imperfect observations of F and C, the values of the two coefficients B0 and B1 in the temperature relation F = B0 + B1 × C. For each of the following situations, and using the true values B0 = 32 and B1 = 9/5 = 1.8, simulate[4] 1000 datasets and investigate the behaviour of the 1000 estimates, b0 and b1, of B0 and B1. In each simulation, use samples of size n = 4, with temperatures of C = 14, 16, 18 and 20.

(a) C measured perfectly, F measured with errors εF ∼ Gaussian(µ = 0, σF = 1) that are independent of F. Check – formally, using a test (or CI) based on the mean of the 1000 estimates – for evidence of bias in b1. Also check whether the empirical variance of b1 agrees with that given by the theoretical formula, namely Var(b1) = σF² / Σ(x − x̄)².

(b) F measured perfectly, C measured with errors εC ∼ Gaussian(µ = 0, σC = 1) that are independent of C [Classical type error: someone else chose situations when C was indeed exactly 14, 16, etc., but didn't tell you what C was, and instead asked you to independently record C using your own imperfect instrument, and to use your recordings of C in your estimation of the equation]. Again, formally test for evidence of bias in b1. Do your findings line up with the predictions in the Notes? If the patterns are difficult to see, you might change the number of simulations, the sizes of the errors, the range of C or the sample size.[5]

[4] If new to simulations, see "Computer code to simulate datasets with measurement error" at the bottom of the Resources webpage for measurement/surveys. It gives some 'starter' computer code, which you can modify to suit.

[5] The article by Hutcheon et al., "Random measurement error and regression dilution bias", in the Resources for Measurement page tries to explain these patterns intuitively.

9. Attenuation of fitted 'F on C' slopes when progressively greater amounts of error are added to the C measurements. Run the R code provided under the heading 'Animation (in R) of effects of errors in X on slope of Y on X'. It uses the 'animation' package to add progressively greater amounts of error to the C measurements and show how they affect the fitted slopes. Include the plot with your answers. Examine the trace of the fitted slopes, and try to mathematically link the pattern of the 'decay' with the amount of error. Hint: as we saw earlier, the attenuation should be a function of (actually, proportional to) the ICC of C; so use the various amounts of error in C (ranging from σC = 0 to σC = 22) to calculate the various values of ICC_C and see if the predicted attenuations line up with the trace.
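Footnote 4 points to the course's 'starter' simulation code; the sketch below is not that code, just one minimal way to set up situation 8(b) and, in passing, the reliability-ratio (ICC of C) prediction used in question 9. The seed, error SD and number of simulations are arbitrary choices:

    # Minimal sketch for 8(b): classical error added to C, F measured perfectly.
    set.seed(601)
    B0 <- 32; B1 <- 9/5
    C.true  <- c(14, 16, 18, 20)
    sigma.C <- 1
    n.sim   <- 1000

    b1 <- replicate(n.sim, {
      F.obs <- B0 + B1 * C.true                 # F without error
      C.obs <- C.true + rnorm(4, 0, sigma.C)    # your imperfect recording of C
      coef(lm(F.obs ~ C.obs))[2]                # fitted slope
    })

    mean(b1)                 # compare with B1 = 1.8
    t.test(b1, mu = B1)      # formal check for bias in b1

    # Question 9's prediction: attenuation roughly equal to the 'ICC of C'
    ICC.C <- var(C.true) / (var(C.true) + sigma.C^2)
    c(predicted = B1 * ICC.C, observed = mean(b1))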
10. Before we study how well we can digitize survival curves, here is an exercise on communicating what the curves are meant to convey and the context in which they were generated. Refer to the article "Associations between C-reactive protein, coronary artery calcium, and cardiovascular events: implications for the JUPITER population from MESA, a population-based cohort study", available in the Resources link opposite 'Applications' in bios601. We digitized the lowermost (green) curve in Figure 2A of that article.

(a) Read the Abstract and study the Figures in the article. Then write, in your own words, a short news item of 250 words or so (2-3 minutes or so on radio) for your local newspaper and radio station, where you moonlight as a health reporter. In your piece address (i) the rationale for the study, (ii) the principal findings and (iii) the implications of these findings. Also suggest a headline for your story. [You might want to study some health reports to see how they are structured; the order may not be the (i)-(iii) order listed above. An interesting but slightly more highbrow website devoted to science reporting in general is http://www.sciencedaily.com/. The websites http://www.cnn.com/HEALTH/, http://www.nytimes.com/pages/health/index.html, http://www.bbc.co.uk/news/health/ and http://www.cbc.ca/news/health/ are also worth consulting, and indeed monitoring.]

(b) A 65-year-old relative of yours reads your story, looks on the internet and finds that a test that measures coronary artery calcium is available in a private clinic in Montreal, and phones you to ask if it would be worth being tested and getting her "score". What would you say to this relative?

11. Errors in digitization. Refer to the duplicate readings you made of the Kaplan-Meier survival curve in the study entitled "Associations between C-reactive protein, coronary artery calcium, and cardiovascular events: implications for the JUPITER population from MESA, a population-based cohort study", available in the Resources link opposite 'Applications' in bios601.

For now, ignore the point-wise measures of precision, i.e., the standard errors and confidence intervals, that often accompany such curves. These are (decreasing) functions of the numbers of subjects and the numbers of 'events'; we will cover their calculation later in the term. For now, focus only on the loss of precision as a result of your digitization. Focus on your two measurements of each of the reported y-year risks, where y = 1, 2, 3, 4, 5, 6, 7:

    y-year CHD risk = 100 × (1 − proportion free of CHD at year y)%

(a) From your two measurements at each of the 7 timepoints, obtain a 7-d.f. estimate of the 'standard error of measurement'. Do so using a 'canned' statistical routine and also 'from scratch' in R. Write out the statistical model that you used to obtain this, and list any assumptions it makes.

(b) The estimate in (a) is an estimate of the 'within'-observer variation. In order to estimate the 'between'-observer variation, what is the minimal information you would need from each of your co-observers? (Since JH has access to all of them, he will supply each of them once you email him with your specific request: he can supply the full raw data that could then be put into a canned statistical routine, but he would prefer that you do the calculations 'from scratch' in R.) Again, write out the statistical model that you used to obtain this, and list any assumptions it makes.

(c) Here the 'objects' to be measured were 7 very specific (fixed) timepoints. Assume for the sake of this exercise that the 7 objects were 7 randomly selected human subjects and that we were interested in calculating an intra-class correlation coefficient to serve as a reliability measure. Carry out the ICC calculation. Restrict your attention to years 1-5 and recalculate the ICC. Comment on why the ICC becomes smaller.
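For question 11, the 'canned' route might be an ANOVA routine or an ICC function; purely as an illustration of the 'from scratch' arithmetic, here is a sketch with made-up duplicate readings (your own digitized values go in r1 and r2):

    # Hypothetical duplicate digitized risks (%) at years 1-7 (made-up numbers).
    r1 <- c(0.4, 1.1, 1.9, 2.8, 3.5, 4.6, 5.9)
    r2 <- c(0.5, 1.0, 2.1, 2.7, 3.7, 4.4, 6.1)

    # 'From scratch': under y_ij = mu_i + e_ij with e ~ N(0, sigma.e^2),
    # each paired difference has variance 2 * sigma.e^2, so
    SEM <- sqrt(sum((r1 - r2)^2) / (2 * length(r1)))   # 7-df estimate of sigma.e
    SEM

    # ICC, treating the 7 timepoints as if they were 7 randomly sampled subjects:
    risk   <- c(r1, r2)
    object <- factor(rep(1:7, 2))
    a   <- anova(lm(risk ~ object))
    MSB <- a["object", "Mean Sq"]; MSW <- a["Residuals", "Mean Sq"]
    ICC <- (MSB - MSW) / (MSB + MSW)    # 2 measurements per 'subject'
    ICC
    # repeat with only years 1-5 to see why the ICC drops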
12. Bernoulli Error? A not-discovered-for-almost-300-years error in Bernoulli's book? Or a not-discovered-for-almost-7-years error by A.W.F. Edwards. Which is it?

In his 'Ars conjectandi three hundred years on' article in Significance Magazine, Cambridge University Professor Edwards tells us that, a few years ago, he was reviewing Sylla's English translation of (Jacob) Bernoulli's book. He worked through one of the expectation problems, and came up with a different answer than Bernoulli. In early June of 2013, a week before the Edwards item was published in Significance, Julian Champkin, the magazine Editor, and a journalist by profession, used this '300-year-old error' in the 'trailer/teaser' for the upcoming piece, and his question 'Can you correct it?' generated a number of responses on the Significance website.

In the bios601 resources for surveys and measurement, at the bottom of the webpage, JH has collected together in one .pdf file the item by Champkin, some of the original Bernoulli text in Latin, the full article by Edwards, the Edwards review of the Sylla translation into English, and Sylla's translation of Bernoulli's treatment of the problem.

The question arises as to whether it is the probabilities that are incorrect, or the expectation based on them, or whether it is Edwards who is incorrect. What is your answer? [Remember that Edwards had studied Bernoulli earlier, when writing his book on Pascal's triangle, and had found an error, reproduced over the centuries in different books, in a table of Bernoulli numbers. So might Bernoulli (or the printers) have been a little bit careless?]

13. Imprecision in recording event times

The Introduction to a recent (2013) journal article, "Driving under the (Cellular) Influence" by Saurabh Bhargava and Vikram S. Pathania of Carnegie Mellon University (American Economic Journal: Economic Policy, August 2013), begins:

    Does talking on a cell phone while driving increase your risk of a crash? The popular belief is that it does – a recent New York Times/CBS News survey found that 80 percent of Americans believe that cell phone use should be banned. This belief is echoed by recent research. Over the last few years, more than 125 published studies have examined the impact of driver cell phone use on vehicular crashes. In an influential paper published in the New England Journal of Medicine, Redelmeier and Tibshirani (1997) – henceforth, RT – concluded that cell phones increase the relative likelihood of a crash by a factor of 4.3. Laboratory and epidemiological studies have further compared the relative crash risk of phone use while driving to that produced by illicit levels of alcohol.

[Figure 2 of the article: "Cell Phone Call Volume from Moving Vehicles for California from 8pm to 10pm in 2005" – average number of scaled moving calls in 1-minute bins, plotted separately for Monday to Thursday, Friday, and Weekend evenings.]

Later, in bios602, you will be introduced to the very clever study design that RT used to arrive at the 4.3. The 2013 authors then go on to study the topic using a very different but also clever design.
    We investigate the causal link between driver cell phone use and crash rates by exploiting a natural experiment induced by the 9pm price discontinuity that characterizes a majority of recent cellular plans. We first document a 7.2 percent jump in driver call likelihood at the 9 pm threshold. Using a prior period as a comparison, we next document no corresponding change in the relative crash rate. Our estimates imply an upper bound in the crash risk odds ratio of 3.0, which rejects the 4.3 asserted by Redelmeier and Tibshirani (1997). Additional panel analyses of cell phone ownership and cellular bans confirm our result.

But while they had very precise data on when cell phones were being used (see Figure 2), the data on crashes were quite messy. To quote the authors:

    ... additional evidence [on calls by drivers] and 30,000 pricing plans across 26 markets to affirm the sensitivity of cellular users to the 9 pm price [discontinuity]. [The] rise in call likelihood at [the] 9 pm threshold represents the first stage of our analysis. We next test whether the rise in call likelihood at the threshold leads to a corresponding rise in the crash rate. In order to smooth crash counts that are subject to substantive periodicity due to reporting conventions, we aggregate crashes into bins of varying sizes. While this strategy improves estimate precision, it introduces a bias due to potential covariate changes away from the threshold. To account for such movement in covariates, we adopt a double-difference approach to compare the change in crashes at the threshold to the analogous change in a control period prior to the prevalence of 9 pm pricing plans and characterized by low cellular use. Figure 3 plots the universe of crashes for the state of California on Monday to Thursday evenings in 2005 and during the control period from 1995 to 1998. The plot, and subsequent regressions, indicate that crash rates in 2005, or in the extended time frame of 2002 to 2005, do not appear to change across the 9 pm threshold relative to the pre-period. We then generalize our crash analysis to include eight additional states for which we have the universe of crash data. Placebo tests of weekends and proximal hours, as well as robustness checks to account for the reporting bias in crashes, confirm that cell phone use does not result in a measurable increase in the crash rate.
    Our estimates of the relative rise in crashes and call likelihood at 9 pm imply a 3.0 upper bound in the crash risk odds ratio (and a 1 s.e. upper bound of 1.4) under ...

    Our analysis principally relies on two sources of crash data. First, the State Data System (SDS) provides data for the universe of reported crashes from 1990 to 2005 for California, Florida, Illinois, Kansas, Maryland, Mississippi, Missouri, Ohio, and Pennsylvania. A well recognized drawback of using a crash database based on self-reports is the presence of periodic heaping. ... The trajectory of a crash record helps to illuminate the origins of this bias. Once a vehicular crash is reported, police at the scene document various details of the incident, including the minute of the crash occurrence, and submit the paperwork to one of several possible state agencies. While states vary in the specifics that govern data collection and crash qualification criteria, crash records are ultimately centralized and sent once a year to the NHTSA where they are standardized and maintained. ... Second, the Fatality Analysis Reporting System (FARS), also administered by the NHTSA, provides data for the universe of fatal crash records from 1987 to 2007 for each of the 50 states. FARS captures any vehicle crash resulting in a death within 30 days of the collision. Like the SDS data, FARS suffers from severe periodicity in the specific minute of the crash reports. ... Figure 4 illustrates the nature of the heaping in reports that characterizes a representative hour in 2005 across the states in our sample. A close examination indicates that nearly 11 percent of crash reports fall exactly on the hour, 31 percent are on the hour, half hour, or quarter hour, and 61 percent reside in a minute ending in either zero or five.

[Article footnote: The periodicity evident in Figure 3 is due to the aforementioned reporting bias in the timing of accident reports.]

[Figure 4 of the article, "Periodicity in SDS Crashes across Representative Hour in 2005 for All States in Sample", plots the total number of crashes (0-15,000) against the representative minute (0-60).]

Exercise: In this study, the primary contrast involves crash rates in the 1 hour after and the 1 hour before cellphone calls became "free" at 9 pm. Do you think heaping errors are an insurmountable problem? If you do, why? If not, suggest ways to deal with them.
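If you want to see numerically what the heaping the authors describe looks like, a few lines like the following would do it. The crash data themselves are not available to us, so this sketch simply simulates reported minutes with built-in rounding; all proportions are invented:

    # Illustration only: simulate reported crash minutes with heaping,
    # then tabulate the kinds of rounding the authors report.
    set.seed(1)
    minute <- sample(0:59, 10000, replace = TRUE,
                     prob = ifelse(0:59 == 0, 8,
                            ifelse(0:59 %% 15 == 0, 4,
                            ifelse(0:59 %% 5 == 0, 2, 1))))

    mean(minute == 0)          # share reported exactly on the hour
    mean(minute %% 15 == 0)    # on the hour, half hour, or quarter hour
    mean(minute %% 5 == 0)     # minute ending in 0 or 5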
14. Galton's data, more than a century later

[See also Questions 3-5 above, and see JH's notes on Quantifying Reliability under the Measurement Lecture Notes heading in the website.]

The 1985 article "Galton's Data a Century Later" re-analyzes the extensive data collected by Francis Galton at his anthropometric laboratory in the South Kensington Museum in London. JH has contacted one of the authors (Frank Ahern), who replied that "Despite a great deal of searching, neither I nor Jerry McClearn have been able to find the original data that were used back in '85."

So, we will start again. But this time, instead of having to go to London and photocopy the records, you can take advantage of the scanned copies provided by the Wellcome Library and the Galton archives. To save you having to find the books (each containing about 500 records) in the large amount of material in the Galton archives, JH has downloaded them and put them on the bios601 website, in the Resources for Sampling/Measurement folder, under the heading (flagged in red) "Data from Galton's Anthropometric Laboratory."

For this exercise, which is designed to familiarize you with how to statistically quantify the psychometric (and psychophysical) properties of different measuring instruments, we will focus on subjects who have been measured more than once, so that we can assess the reliability of the various measures. For now, we will ignore the fact that there is quite a bit of time between some of the measurements, and that some attributes are age-related (we will try later to see at what age the peak is), and so some of the non-repeatability is for legitimate biological reasons.

So as to get a feel for the (small sample) sampling variability of these measures, and also so that it is not too big a data entry burden, you are asked to enter the complete records for 10 such subjects, i.e., subjects who were measured on more than one date. We can pool these student datasets later to get a more – statistically – reliable estimate of the various reliability measures. In order to standardize the variable names, and provide a small element of quality control, a .csv file (Spreadsheet for Data Entry) with several subjects from the first book is provided on the website, immediately after the data books. Add to it the data for the first ten eligible subjects you find in the range assigned to you (enter all of the records per subject, no matter how close or far apart they are in time). After you have added your entries, delete the ones already there — they were merely provided so as to standardize the naming of variables, to act as a guide to align the columns correctly, and to make it easier to see any items that are mis-entered.

A few notes at this point (we may discover other oddities that we need to deal with as we go along). JH has noticed that subsequent measurements are sometimes recorded in metric units rather than Imperial (e.g., cm instead of inches and tenths of inches). We could discuss other ways to enter such mixed units (from JH's past experience, converting as we enter is not an option!) but JH decided that when he met a metric measurement where he had allocated a pair of fields for, say, inches and tenths, he simply put the metric measurement in the first field and left the second field blank. It should be relatively easy to use programming to harmonize them later. In the case of blanks, or illegible recordings, please leave the field blank.

JH has noticed some instances where there were several (4 in subject 0001) rows for the first several items (up to the Snellen test) but fewer (e.g. 2 in subject 0001) rows for the later items at the bottom of the page, from sitting height to strength of blow with fist. In such instances, use any indications you can to decide which rows at the bottom of the page go with which ones at the top (in the case cited, JH decided that the first and fourth rows were complete, as were both of the bottom ones, so he put these with the first and fourth). In such cases, use the remarks column to flag the case.

Here are the books assigned to the different students. Contact JH if your ID number is not in the list.

    ID:       JH, 26xxxxx21, 26xxxxx19, 26xxxxx57, 26xxxxx99, 26xxxxx78, 26xxxxx65, 26xxxxx58, 26xxxxx90, 26xxxxx94
    Subjects: 0001-0491, 0511-1028, 1029-1530, 1531-2020, 2021-2520, 2521-3021, 3022-3521, 3522-4000, 4001-4500, 4501-5000, 5001-5500, 5501-6000, 6001-6500, 7001-7459

Once you have entered the data, adapt the supplied R code to calculate the ICC for each of the measures shown in Table 1 of the 1985 article. Do not worry about timing or segregation by sex, or age-correction – you will not have enough data to do so; we will do this later when we pool the data.

It appears (but JH is not entirely certain) that the 1985 authors used a simple Pearson product-moment correlation with paired measurements. The advantage of the ICC is that while it is still connected mathematically with the Pearson correlation (see exercises above), it is more general and it uses whatever number of measurements per person there are. It is less cumbersome than using all possible pairwise correlations, or selecting just two. Compare the ICCs with the test-retest correlations in Table 1 of the 1985 'a century later' paper, and comment on any substantial differences.
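The 'supplied R code' mentioned above is the code to adapt; purely as an illustration of the one-way random-effects ICC it should agree with, here is a from-scratch sketch for a single measure. The file name and the column names (id, height) are placeholders for whatever you used in your data entry:

    # Sketch of a one-way random-effects ICC for one Galton measure,
    # with possibly unequal numbers of visits per subject.
    galton <- read.csv("galton_repeats.csv")        # your data-entry file (hypothetical name)

    a   <- anova(lm(height ~ factor(id), data = galton))
    MSB <- a[1, "Mean Sq"]; MSW <- a[2, "Mean Sq"]

    k   <- table(galton$id)                          # measurements per subject
    n   <- length(k)
    k0  <- (sum(k) - sum(k^2) / sum(k)) / (n - 1)    # 'average' k for unbalanced data

    sigma2.B <- (MSB - MSW) / k0                     # between-subject variance component
    ICC <- sigma2.B / (sigma2.B + MSW)
    ICC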
15. Physical Activity: JH 2010-2013

[Images in the original: the Omron step-counter (left) and a scanned page from the 2012-2013 log-book of paired daily step counts, with its December 2011 - February 2012 calendar pages (right).]

Since 2010, JH has used a 'step-counter' (pictured above left) to record how many steps he takes each day. His spouse AM has done the same, and has entered the pairs of daily counts onto a log book. Refer to the two files (2010-2011 and 2012-2013) under the heading "Physical Activity: How many steps a day has JH been doing since 2010?" near the top of the Resources webpage. The 2010-2011 .csv file has the paired recordings for 2010, as well as JH's ones for 2011. The 2012-2013 .pdf file has scanned images (see above right) of the pages of paired recordings from the log-book.

The exercise in sampling from these data raised the issue of how many days one needs to sample in order to ensure that the estimate one gets is close to what one would obtain with a census, i.e., a 100% sample of days. Similar issues occur in dietary recall surveys. The least costly method is the food frequency questionnaire (Google for more info); a much more costly one is the x-day 24-hour dietary recall method. How large x should be for different sub-populations (e.g., children, young adults, the elderly) has been studied. In measuring physical activity, it is common to use quite expensive accelerometers, and so they are usually given to research subjects for just one randomly chosen week. The Omron model shown costs a lot less, and unlike the accelerometers – which store minute-by-minute activity – just records the number of steps for each of the last 7 days. JH's data help us answer the question of how many weeks are needed to get a good estimate of his yearly activity.

(a) Divide the 2010-2011 data into weeks, and derive a (somewhat oversimplified) 1-way analysis of variance table, with week as the factor. In this greatly oversimplified model, the number of steps (y) on any day (j) within week w (w = 1 ... 104) can be written as

    y_{w,j} = µ + b_w + ε_{w,j}

(b) For didactic purposes, treat the model as a random-effects one, i.e., with week as the random factor. Thus, the 104 b_w's are assumed to be a random sample drawn from a N(0, σ²_w) distribution.[6] Even though they may have a lot of structure, treat the variations across days within a week as uncorrelated 'disturbances' or 'errors' (ε_{w,j}) with variance σ² but no structure (i.e., treat all ε's as exchangeable, so that the order of observations within the same week is irrelevant – in the file, you only need to know which week it is, not which day of the week). Clearly, there may be strong intra-week patterns, but for now assume that you are not even told which observation corresponds to which day of the week. From the Expected Mean Squares (EMS) for this model,[7]

    Source   Sum of Squares   df        Mean Square   EMS
    Weeks    SSw              103       SSw/df        σ² + 7σ²_w
    Error    SSe              104 × 6   SSe/df        σ²

use the method of moments to estimate the σ²_w and σ² components.

(c) Using the results from (b), and the same overly simplified model, work out the expected variance of estimators that average recordings from (i) 3 random days in 1 random week, (ii) 1 random day in each of 3 random weeks, (iii) 3 random days in each of 3 random weeks.

(d) Could you have arrived at the results in (c) using the 'Stepped-Up' Reliability formula referred to in page 4 of the Quantifying Reliability notes?

[6] Using Roman b's and Greek β's to distinguish random effects from fixed effects is a recent convention: it was not used when JH learned linear models.

[7] See also pages 4 and 5 of Notes on Introduction to Measurement Statistics, and pages 3 and 4 of the Notes on Quantifying Reliability (on the Resources website, under the heading 'Measurement – Lecture Notes, etc'). 'Weeks' in the current example correspond to 'persons' or 'subjects' or 'families' in those examples.
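A sketch of the calculations for parts (a)-(c), assuming you have assembled a data frame with one row per day and columns steps and week (both names are placeholders); the EMS table above supplies the method-of-moments step:

    # Parts (a)-(b): one-way ANOVA with week as the factor, then method of moments
    a   <- anova(lm(steps ~ factor(week), data = steps.df))
    MSw <- a[1, "Mean Sq"];  MSe <- a[2, "Mean Sq"]

    sigma2      <- MSe                 # within-week (day-to-day) variance
    sigma2.week <- (MSw - MSe) / 7     # between-week variance (7 days per week, as in the table)

    # Part (c): expected variance of a mean of d days in each of w random weeks
    var.of.mean <- function(w, d) sigma2.week / w + sigma2 / (w * d)
    var.of.mean(1, 3)   # 3 days in 1 week
    var.of.mean(3, 1)   # 1 day in each of 3 weeks
    var.of.mean(3, 3)   # 3 days in each of 3 weeks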
16. Repeatability of a Test – and of the statistical analysis itself!

Refer to the report 'A Novel Test of Endurance Running Performance' in the Resources website [under the tab 'Data from various repeatability studies'].

(a) Redo the 2-way ANOVA 'with participant and trial as main effects' to see if you can reproduce the reported coefficient of variation.

(b) Use a 1-way ANOVA, with subjects as a random effect, and the 3 trials as replicates (i.e., ignoring the order) and calculate an overall coefficient of variation. [A very similar 1-way ANOVA is shown in the 1st column of page 5 of the 'Introduction to Measurement Statistics' Notes on the Resources website. Page 3 of the Notes 'Quantifying Reliability' has an example with 2 measurements per family, but the principle is the same.] Which makes more sense to you, the CV based on their 2-way ANOVA, or yours based on a 1-way ANOVA?

(c) Calculate subject-specific coefficients of variation (just as was reported in Table 1 in the article on breath alcohol – the link to this article can be found just above the one for the endurance test). Summarize the 10 CVs using, say, the median and the range. Would you report the 'overall' CV the authors did, or some summary of the 10 subject-specific ones? Give a reason for your choice.

(d) Use the results of the 1-way ANOVA[8] to calculate an intra-class correlation (ICC).

(e) In this setting, which makes more sense, a CV or an ICC? Why?

(f) Rerun the ICC code several times on random subsets of the subjects. As you reduce the sample size to just 2 or 3, does the ICC stay stable? Use the example to say what the ICC tells us that the CV can not, and what the CV tells us that the ICC can not.

(g) How could one 'rig' (i.e., manipulate) the sample of subjects in the breath alcohol study to (i) maximize (ii) minimize the ICC?

[8] The R code supplied makes use of an ICC package, but it is always safer to check with a worked example that a package you don't know is doing what you want it to do.
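For parts (b)-(d), a minimal sketch, assuming the trial times are in a data frame named run with columns subject and time (placeholder names, 3 trials per subject); the supplied ICC-package code can be checked against it:

    # Part (b): one-way ANOVA with subjects as a random effect
    a   <- anova(lm(time ~ factor(subject), data = run))
    MSB <- a[1, "Mean Sq"]; MSW <- a[2, "Mean Sq"]

    CV.overall <- 100 * sqrt(MSW) / mean(run$time)   # within-subject CV, as a %

    # Part (d): ICC by the method of moments (k = 3 trials per subject)
    k <- 3
    sigma2.B <- (MSB - MSW) / k
    ICC <- sigma2.B / (sigma2.B + MSW)

    # Part (c): subject-specific CVs, as in the breath-alcohol article
    CV.by.subject <- with(run, tapply(time, subject, function(x) 100 * sd(x) / mean(x)))
    summary(CV.by.subject)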
17. How reproducible and accurate are free wellness smartphone apps to track your steps, calories burned, distance and active time?

The letter 'Accuracy of Smartphone Applications and Wearable Devices for Tracking Physical Activity Data', published in JAMA in February 2015 [under the tab 'Data from various repeatability studies'], reports:

    ... adoption by the general population. In contrast, nearly two-thirds of adults in the United States own a smartphone, and technology advancements have enabled these devices to track health behaviors such as physical activity and provide convenient feedback. New wearable devices that may have more consumer appeal have also been developed. Even though these devices and applications might better engage individuals in their health, for example through workplace wellness programs, there has been little evaluation of their use. The objective of this study was to evaluate the accuracy of smartphone applications and wearable devices compared with direct observation of step counts, a metric successfully used in interventions to improve clinical outcomes.

    Methods | This prospective study recruited healthy adults aged 18 years or older through direct verbal outreach at a university. Participants gave verbal informed consent to walk on a treadmill set at 3.0 mph for 500 and 1500 steps, each twice, for no compensation. An observer (M.A.C.) counted steps using a tally counter in August 2014. This study was approved by the University of Pennsylvania institutional review board. A convenience sample of 10 applications and devices was selected from among the top sellers in the United States. On the waistband, each participant wore the Digi-Walker SW-200 pedometer (Yamax), which has been well validated for research, and 2 accelerometers: the Zip and One (Fitbit).
    On the wrist, each wore 3 wearable devices: the Flex (Fitbit), the UP24 (Jawbone), and the Fuelband (Nike). In one pants pocket, each carried an iPhone 5s (Apple) simultaneously running 3 iOS applications: Fitbit (Fitbit), Health Mate (Withings), and Moves (ProtoGeo Oy). In the other pants pocket, each carried the Galaxy S4 (Samsung Electronics) running 1 Android application: Moves (ProtoGeo Oy). At the end of each trial, step counts from each device were recorded. In rare instances that a device was not properly set to record steps (8 of 560 observations), these data were not included. The mean step count and standard deviation for each device was estimated using Excel (Microsoft).

    Results | Across all devices, 552 step count observations were recorded from 14 participants in 56 walking trials. Participants were 71.4% female, had a mean (SD) age of 28.1 (6.2) years, and had a mean (SD) self-reported body mass index (calculated as weight in kilograms divided by height in meters squared) of 22.7 (1.5). ... Figure 1 shows the results for the 500 step trials by device and Figure 2 shows the results for the 1500 step trials. Compared with direct observation, the relative difference in mean step count ranged from -0.3% to 1.0% for the pedometer and accelerometers, -22.7% to -1.5% for wearable devices, and -6.7% to 6.2% for smartphone applications. Findings were mostly consistent between the 500 and 1500 step trials.

    Discussion | We found that many smartphone applications and wearable devices were accurate for tracking step counts. Data from smartphones were only slightly different than observed step counts, but could be higher or lower. Wearable devices differed more, and 1 device reported step counts more than 20% lower than observed. Step counts are often used to derive other measures of physical activity, such as distance or calories ...

[Figure 1 of the letter, 'Device Outcomes for the 500 Step Trials', and Figure 2, 'Device Outcomes for the 1500 Step Trials', show the mean number of steps (error bars: ±1 SD) recorded by each of the 10 devices/applications, with a vertical dotted line at the observed step count; each figure also lists the number of observations per device (26-28).]

(a) Rewrite the authors' findings using the words 'under-' and 'overcounted.'

(b) For which instruments is there evidence that this 'bias' is non-zero? You can use your eye to determine the means and SDs, or use the ones in the .pdf file shared by the senior author ('I'm attaching the raw data that we have to share') and available on the course website.
(c) The data summaries were in response to an email from JH to the author, asking if there was 'any chance you would be able to share the Excel file of raw data, so we should see if the deviations from the target were all over the place, or peculiar to a few people or a few devices. I can imagine the pockets on some people being a bit deep and wide.. and that the machines in them slosh around – I sometimes keep my $20 dollar step counter in my pocket instead of on my belt.'

Imagine that the author had shared these data as 552 separate lines, each one containing a step count, a participant ID (1-14), the target (500 or 1500), the occasion (1st or 2nd) and the name of the device.[9] Write out a plan for analyzing them, including the model you would use, the meaning of each component (parameter) in the statistical model, how you would estimate each component, a table of results (use made-up, but realistic, numbers), and a sketch of one or more graphs that would quickly tell the same story.

(d) In the Fall of 2016, the EPIB601 class carried out its own investigations. The Epidemiology teacher tested an app called Pacer Pedometer plus Weight Loss and BMI Tracker, by Pacer Health, Inc., that is available for free for both iPhone and Android devices. Dr Patel (senior author of the letter) 'particularly like[d] Withings HealthMate because it has a good user interface and works with both iPhones and Androids. Fitbit is also good but works with a limited set of Androids.' For the BIOS601 class of 2016, students were asked to come prepared to participate in a planning session, where together they would design (and subsequently carry out) their own investigation into the reproducibility and validity of a few smartphone apps with respect to steps, distance, calories, etc.

[9] At the end of each trial, step counts from each device were recorded. In rare instances that a device was not properly set to record steps (8 of 560 observations), these data were not included. The mean step count and standard deviation for each device was estimated using Excel (Microsoft). Across all devices, 552 step count observations were recorded from 14 participants in 56 walking trials.
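Part (c) asks only for a written plan, but if it helps to think in terms of code, here is one possible skeleton, assuming the 552-line file had the columns described above (the file name and column names are hypothetical), working on the percentage-deviation-from-target scale; lme4 is one of several reasonable choices, not the required one:

    # Sketch only: one way the hypothetical 552-line file might be analysed.
    library(lme4)

    jama <- read.csv("jama_steps.csv")   # columns: steps, participant, target, occasion, device
    jama$pct.error <- 100 * (jama$steps - jama$target) / jama$target

    # device-specific mean bias (fixed effects), with participant and
    # participant-by-device as random effects:
    fit <- lmer(pct.error ~ device + (1 | participant) + (1 | participant:device),
                data = jama)
    summary(fit)
    # fixed effects: mean under-/over-counting of each device;
    # variance components: between-participant vs trial-to-trial variation.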
18. Reaction times

The orientational material below is from the sleepstudy data re-analyzed in Ch. 3 of the excellent (online) book 'lme4: Mixed-effects modeling with R', dated June 25, 2010, by Douglas M. Bates. The data are included in the lme4 package – and were used again in the 2017 Epidemiology (teaching) article by Weichenthal, Baumgartner and Hanley.

    Belenky et al. [2003] report on a study of the effects of sleep deprivation on reaction time for a number of subjects chosen from a population of long-distance truck drivers. These subjects were divided into groups that were allowed only a limited amount of sleep each night. We consider here the group of 18 subjects who were restricted to three hours of sleep per night for the first ten days of the trial. Each subject's reaction time was measured several times on each day of the trial.

[Figure from Bates: a lattice of 18 panels, one per subject, plotting average reaction time (ms) against days of sleep deprivation.] 'Each subject's data are shown in a separate panel, along with a simple linear regression line fit to the data in that panel. The panels are ordered, from left to right along rows starting at the bottom row, by increasing intercept of these per-subject linear regression lines. The subject number is given in the strip above the panel.'

The 2003 article [European Sleep Research Society, J. Sleep Res., 12, 1-12] that Bates cites is more specific about the Psychomotor Vigilance Test (PVT), and the number of trials (JH estimates 100 or so) that went into each datapoint shown in the graph [note that Bates used the average response latency whereas Belenky used its reciprocal]:

    The PVT measures simple reaction time to a visual stimulus, presented approximately 10 times/minute (interstimulus interval varied from 2 to 10 s in 2-s increments) for 10 min and implemented in a thumb-operated, hand-held device (Dinges and Powell 1985). Subjects attended to the LED timer display on the device and pressed the response button with the preferred thumb as quickly as possible after the appearance of the visual stimulus. The visual stimulus was the LED timer turning on and incrementing from 0 at 1-ms intervals. In response to the subject's button press, the LED timer display stopped incrementing and displayed the subject's response latency for 0.5 s, providing trial-by-trial performance feedback. At the end of this 0.5-s interval the display turned off for the remainder of the foreperiod preceding the next stimulus. Foreperiods varied randomly from 2 to 10 s. Dependent measures, averaged or summed across the 10-min PVT session, included mean speed (reciprocal of average response latency), number of lapses (lapse = response latency exceeding 500 ms), and mean speed for the fastest 10% of all responses.

In bios601 in 2017, each of you will make some rough ('amateur') reaction time measurements, so as to learn what your reaction times are like, and to plan a study into whether they are faster when using your dominant rather than your non-dominant hand.

The 2003 measurements relied on a thumb-operated, hand-held device and a microcomputer program described in 1985.[10] To make your own measurements, you can choose this quite intuitive web tool[11] – and use either the keyboard or the mouse/trackpad. It only performs and shows the results of 5 trials at a time. So – since you will need to calculate the mean and SD of 10 individual times – you will need to copy the individual times into R, 5 at a time. [To get around this, JH wrote a simple R program that may not be as accurate or fancy but that stores the individual times from however many trials you do into a vector. The R code (and links to web-based tools, and to some scholarly and newspaper articles on reaction times) is available under Online Tools on the webpage for the Resources for measurement.] The main objective is to gain experience with 'hands-on' data, and with sample size planning, so try both tools and choose between them. [If you have energy to spare, you can try to empirically determine how closely this R-based instrument and the web-based instrument agree.]

Before running the measurements, be sure to practice first.

(a) Run 10 trials using your dominant hand, and calculate the mean reaction time, the SD, and the SE of the mean (SEM). Convert the SEM into a coefficient of variation (CV[12]). How does this CV (which measures the 'instability' of the mean) relate to the CV for individual measurements? Use the SEM to calculate a 95% confidence interval to accompany your point estimate of the true mean. Why use a larger-than-1.96 multiplier to calculate the margin of error? (A short R sketch at the end of this question illustrates these calculations.)

(b) Suppose you wished to perform enough trials that the margin of error would be less than 5% of the mean. Using the SD (or SEM, or CV) you already obtained,[13] calculate how many trials you would need. Guidance on such sample size considerations (JH prefers this term over sample size requirements) can be found in section 4 of his bios601 Notes on Mean/quartile of a quantitative variable: models / inference / planning.

[10] Dinges, D. F. and Powell, J. W. Microcomputer analyses of performance on a portable, simple, visual RT task during sustained operations. Behav. Res. Meth. Instrum. Comput., 1985, 17: 652-655.

[11] https://faculty.washington.edu/chudler/java/redgreen.html

[12] When reporting a CV, it is customary to do so as a percentage.

[13] Of course, if you were to run that many trials, there is no guarantee that the SD would be the same as the SD you got for the 10 – it could be higher or it could be lower. But use the SD of the 10 as the best guess for planning purposes.
(c) Suppose you wished to (i) test whether, or (ii) measure how much, the mean of reaction times (r.t.) obtained with your dominant hand (D) differs from the mean of reaction times obtained with your non-dominant hand (ND). You will make n measurements with each hand. Assume that there is no 'fatigue factor' or 'order-of-testing' effect, so that it doesn't matter whether you first do the n with one hand and then the n with the other. [If there were a fatigue factor, or order effect, then we would want to think of other designs, possibly involving pairing/blocking.] The 2 n's may be large enough that the relevant sampling distribution of the difference of two independent sample means (Student's t) is close to a Z distribution; otherwise, use trial and error. Also assume that the variability is about the same in both r.t. series.

For (i) you will use a 95% confidence interval for the difference of two unknown means, µD − µND. For (ii) you will use the test statistic

    (mean r.t.D − mean r.t.ND) / (SE of this difference),

and α = 0.05 (2-sided).

For the estimated difference, determine the n per hand that would yield a margin of error of at most: 10 milliseconds; 5 milliseconds.

For the statistical test, determine the n per hand that would give you an 80% chance of obtaining a 'statistically significant' test result if the true difference in milliseconds were: 5, 10, 25.

For the statistical test, also determine the chance of obtaining a 'statistically significant' test result (the statistical 'power', or 1 − β) if each n is fixed at 25, but the true difference in milliseconds was: 1, 5, 10, 25.

What if the SD you used for planning was too large? Too small?

(d) Do a few trials using the tool https://www.justpark.com/creative/reaction-time-test/ that was featured in the newspaper story 'Brain test judges how old you are based on your reaction time.' Consider their reaction-time vs. age curve, and how it was fitted. The website doesn't say (i) how they selected the 2,000 people aged 18 and above that they surveyed, or (ii) how many trials they asked each of them to do.
As for (i), describe one scenario where the curve they obtained would be 'flatter' than the one that would be obtained if representative population-based samples were recruited at each age.

Suppose[14] that each of the very large number of subjects in each 1-year-wide age-bin was tested a very large number of times. Suppose then that within each age-bin we sorted the persons from slowest to fastest and selected the 'median' (middlemost) person. Suppose further[15] that from age 25 to age 64, these medians made an almost perfect straight line with slope 2 ms per year of age, or 0.5 years of age per ms of response latency if we plot age on the vertical (y) axis and response latency on the horizontal (x) axis. For now, we will retain these 40 people from this 'ideal' world.

As for (ii), we will ask them to make just 1 trial each, and (like the website) use these 40 values to fit the LS line of age (y) upon latency (x). Assuming within-person variation of the same magnitude as in your own set of measurements, what is your best estimate of what the fitted slope will be? Hint: remember some earlier exercises.

The above scenario selected the median person in each bin. If you picked one random person from each bin, what is your best estimate of what the fitted slope will be? (State your assumptions.)

Write a few sentences summarizing why (even if their sample of subjects is representative) the age-latency graph in the website may be inaccurate, and in what respect.

(e) What if each median-person's latency was measured perfectly (large n), but ages were in bins (intervals) 5 years wide (so that, e.g., the persons aged 25, 26, 27, 28 and 29 are put at age 27), and we fitted the LS line of latency (y) upon the midpoint (x) of each age bin?

[14] This ideal universe where subjects are easily recruited, and have lots of patience and can maintain their attention over a very large number of trials, is just for didactic purposes.

[15] Now we are really dreaming! While we are at it, we will assume symmetric age-specific distributions.

Note re terminology:

In the situation where x = latency, the errors in measuring the true X values are uncorrelated with these true values of X. This is called the classical 'errors in X' situation. It is the nastier case.

    X = true value; x = X + ε_X, with ε_X ⊥ X

In the situation where x = the mid-age of the bin, the errors in measuring the true X values (ages) are correlated with the true values of X, but uncorrelated with the observed x's. This is called the Berkson 'errors in X' situation. It is less nasty, but it does increase the (sampling) variability of the estimated slope.

    X = true value; x = X + ε_X, with ε_X ⊥ x

JH's favourite example of Berkson error (one he adapted for the earlier exercise on F vs. C temperatures) is one that may have come from Berkson himself: an investigator wished to measure temperatures in an oven at various times.

• An unreliable thermometer, i.e., one that gives readings that fall equally on both sides of the truth, would generate classical errors.

• The temperatures shown on the thermostat are as likely to be above/below the true temperature at any given moment of interest; as you can check, these would be Berkson errors.

For more on these, consult JH's Ch. 4 notes in his Applied Linear Models course 679, or the books or presentation by the (measurement-expert) statistician Raymond Carroll: https://www.stat.tamu.edu/~carroll/talks/NCI_MEM_Call.pdf
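To make parts (a)-(b) and the terminology note above concrete, here is a minimal R sketch. The reaction times are made-up numbers and the classical/Berkson illustration uses arbitrary error sizes; neither is the course's official code:

    ## (a)-(b): summary statistics from 10 made-up reaction times (milliseconds)
    rt  <- c(261, 250, 322, 288, 245, 270, 301, 255, 284, 296)
    n   <- length(rt)
    m   <- mean(rt); SD <- sd(rt)
    SEM <- SD / sqrt(n)
    c(mean = m, SD = SD, SEM = SEM,
      CV.of.mean    = 100 * SEM / m,    # 'instability' of the mean, %
      CV.individual = 100 * SD  / m)    # = CV.of.mean * sqrt(n)
    m + c(-1, 1) * qt(0.975, df = n - 1) * SEM      # 95% CI (hence a multiplier > 1.96)
    ceiling((qt(0.975, n - 1) * SD / (0.05 * m))^2) # rough n for a 5% margin of error

    ## classical vs Berkson errors in X (cf. the terminology note)
    set.seed(601)
    N       <- 40
    age     <- runif(N, 25, 64)                         # true ages
    latency <- 180 + 2 * (age - 25) + rnorm(N, 0, 10)   # true slope: 2 ms per year

    x.classical <- age + rnorm(N, 0, 5)     # error independent of the true age
    x.berkson   <- 5 * floor(age / 5) + 2   # e.g. ages 25-29 all recorded as 27

    coef(lm(latency ~ x.classical))[2]   # attenuated towards 0
    coef(lm(latency ~ x.berkson))[2]     # roughly unbiased, but more variable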
19. What was the point of each of the assignments? For each of the assigned questions, use one sentence to describe what you think the learning objective was; use another to describe in what situations the concepts and techniques will be of use to you and to those you will work with.

http://en.wikipedia.org/wiki/Cavendish_experiment: in 1798 Cavendish found that the Earth's density was 5.448 ± 0.033 times that of water (due to a simple arithmetic error, found in 1821, the erroneous value 5.48 ± 0.038 appears in his paper).
