# Bioessays Formative Assessments

### Validity

Because of the assessment design, students using an alternate strategy to interpret the relationships on evolutionary trees were not expected to answer all questions incorrectly. The assessment included eleven questions containing one incorrect strategy as a distracter and five questions containing an unknown or no incorrect strategy as a distracter. Students who approached a question using a strategy that had been controlled for in that question (a strategy not incorporated as a distracter) could not use that strategy to arrive at an answer. When students could not identify a clear answer using their chosen strategy, they often voiced confusion during interviews and admitted guessing in order to answer the question. When guessing on a single question, students had an equal chance of answering that question correctly or incorrectly. For example, answer choices for a question with a proximity-based distracter had the same number of internal nodes between them and the focal organism; therefore, students who used node counting to determine relationships were not able to distinguish between the two choices using this strategy. Given two choices to answer the question, students had a 0.50 probability of answering correctly.
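The guessing argument above can be turned into a small expected-score calculation. The sketch below is illustrative only: the split of questions into "targeted" and "controlled" is a hypothetical parameter, and it assumes a student always picks the distracter when it matches their strategy and guesses between the two choices otherwise.

```python
def expected_correct(n_targeted: int, n_controlled: int, p_guess: float = 0.5) -> float:
    """Expected number of correct answers for a student who always applies
    one alternate strategy.

    n_targeted:   questions whose distracter matches the student's strategy
                  (assumed here to be answered incorrectly every time)
    n_controlled: questions on which the strategy cannot separate the two
                  choices, so the student guesses
    p_guess:      probability of a correct guess (0.5 for two choices)
    """
    return 0.0 * n_targeted + p_guess * n_controlled


# Hypothetical split of a 16-question assessment: 4 targeted, 12 controlled.
print(expected_correct(4, 12))
```

Under these assumptions, a student who never uses the correct strategy can still be expected to answer several questions correctly, which is why an all-incorrect score was not anticipated.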

Faculty members (content experts) verified the content validity of the assessment. They verbally affirmed that the assessment tested the ability to use MRCA to interpret relationships on an evolutionary tree, and confirmed that distracters were appropriate for each question (especially for taxa included for similarity-based distracters). The validity of the test was also investigated using the group difference method (Cronbach and Meehl 1955). Professors (content experts) and graduate student teaching associates had the construct, that is, the ability to determine relationships on evolutionary trees, whereas many of the students in the *Evolution and Biodiversity* course did not (as verified by the interviews). Higher scores for professors than for students therefore provide evidence that the assessment has construct validity. The scores of the faculty (n = 3; mean 0.98, standard error [SE] 0.019) and graduate student teaching associates (n = 4; mean 0.96, SE 0.0266) were higher than the scores of students in *Evolution and Biodiversity* (n = 205; mean 0.64, SE 0.020), supporting the construct validity of the assessment.

### Item analysis

Two-way sign test comparing student performance on distracter categories

A two-way sign test showed significant differences in student performance between all distractor categories (p < 0.05) except between the following distractor categories: similar—none, proximity—node counting, and multiple—node counting (Table 5).

We investigated several factors that could influence student performance on questions that were unrelated to either conceptual understanding or the alternative conceptions tested. Using a two-sample t test, we found no significant difference in student performance (mean number of correct responses per question ± standard deviation) (1) on the first eight questions (139.5 ± 19.54) versus the second eight questions (128 ± 11.31) of the assessment (p = 0.272, t = 1.14, df = 14), (2) when the correct answer was to the right of the focal taxon (139.5 ± 24.86) or to the left of the focal taxon (126.75 ± 19.03) (p = 0.301, t = 1.10, df = 8), or (3) when trees were drawn with up-to-the-right orientation of the root (129.44 ± 21.81) or down-to-the-right orientation of the root (132.56 ± 22.00) (p = 0.767, t = −0.301, df = 16).
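A two-sample t statistic of this kind can be recomputed from the reported means and standard deviations alone. The sketch below assumes a pooled-variance (equal-variance) test and nine questions per tree orientation; the group sizes are an assumption inferred from df = 16 and are not stated in the text.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's two-sample t statistic (equal variances assumed) and its df."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


# Comparison (3): up-to-the-right versus down-to-the-right root orientation,
# with an assumed nine questions in each group.
t, df = pooled_t(129.44, 21.81, 9, 132.56, 22.00, 9)
print(round(t, 3), df)
```

A statistic this close to zero, relative to its degrees of freedom, corresponds to a large p-value, in line with the conclusion that tree orientation did not affect performance.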

Difficulty and discrimination values for the questions in the instrument

The reliability of the instrument is the degree to which the instrument produces consistent results. Cronbach’s alpha, which measures internal consistency, was used to estimate reliability. Internal consistency estimates the extent to which items that measure the same construct have similar results. The internal consistency of the items was excellent (Cronbach’s alpha of 0.90).
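For readers unfamiliar with the statistic, Cronbach's alpha compares the sum of the individual item variances with the variance of the total score. The following is a minimal sketch using the standard formula; the 0/1 response matrix is invented for illustration and is not data from the study.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.

    scores: list of rows, one row per student, one column per item.
    """
    k = len(scores[0])  # number of items
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 4 students x 3 items, scored 1 (correct) or 0.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]
print(cronbach_alpha(responses))
```

Alpha approaches 1 when students who do well on one item tend to do well on the others, and falls toward 0 when item scores are unrelated.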

Student scores on the instrument were significantly and positively correlated with scores on Lawson’s Classroom Test of Scientific Reasoning (r = 0.31, p < 0.001); students with higher scientific reasoning scores performed better on our assessment. Scientific reasoning is composed of inquiry, experimentation, evidence evaluation, inference, and argumentation (Zimmerman 2007), a skill set that applies to evolutionary tree interpretation. While we did not measure learning gains, our results are consistent with other studies that have found a positive correlation between scientific reasoning abilities and student gains in learning science (Coletta and Phillips 2005).

We expected that students who were using one of the three alternate strategies consistently would experience some cognitive dissonance when they encountered a question that did not enable them to use that strategy (e.g., using node counting strategy on a question where the number of nodes between the focal taxon and the two choices were the same). Yet, students rarely recognized that if their strategy was correct it should work on all questions, and if their strategy wasn’t working then it was not a valid way to approach any of the questions. Students often switched strategies throughout the assessment, indicating that these strategies were not deeply seated misconceptions (see Wandersee et al. 1994), but rather, alternate approaches that should be relatively easily dispelled with additional training. Recently we have used this assessment as a diagnostic and training tool with our graduate teaching associates and undergraduate supplemental instruction leaders. The assessment has been very effective in helping us identify instructors who have problems interpreting relationships among taxa on an evolutionary tree; with relatively little additional training they master this skill fairly quickly. We find, anecdotally, that rather than learning to determine relationships in a gradual manner, students typically experience a “light-bulb” moment when they understand how to read these trees.

Our novel question design can also be adapted and used by instructors to develop their own questions using this binary, forced choice model to test for one or more alternate conceptions while controlling for the use of other strategies.

## Abstract

Most students have difficulty reasoning about chance events, and misconceptions regarding probability can persist or even strengthen following traditional instruction. Many biostatistics classes sidestep this problem by prioritizing exploratory data analysis over probability. However, probability itself, in addition to statistics, is essential both to the biology curriculum and to informed decision making in daily life. One area in which probability is particularly important is medicine. Given the preponderance of prehealth students, in addition to more general interest in medicine, we capitalized on students’ intrinsic motivation in this area to teach both probability and statistics. We use the randomized controlled trial as the centerpiece of the course, because it exemplifies the most salient features of the scientific method and the application of critical thinking to medicine. The other two pillars of the course are biomedical applications of Bayes’ theorem and science and society content. Backward design from these three overarching aims was used to select appropriate probability and statistics content, with a focus on eliciting and countering previously documented misconceptions in their medical context. Pretest/posttest assessments using the Quantitative Reasoning Quotient and Attitudes Toward Statistics instruments are positive, bucking several negative trends previously reported in statistics education.

## INTRODUCTION

Most students have difficulty reasoning about chance events (Shaughnessy, 1977, 1992). Students arrive in the classroom with theories or intuitions about probability that are at odds with conventional thinking (see examples in Table 1) and can even hold multiple mutually contradictory misconceptions about the same situation (Konold, 1995). Unfortunately, misconceptions generally persist and can even become stronger after instruction (Sundre, 2003; Delmas *et al.*, 2007). This can occur not only for traditional instruction, but also for more innovative, hands-on approaches (Hodgson, 1996; Pfaff and Weinberg, 2009). The stakes are high, because overcoming these obstacles is essential for achieving numeracy to the level necessary for informed decision making in modern society (Gigerenzer, 2002; Gaissmaier and Gigerenzer, 2011; Reyna and Brainerd, 2007).

Table 1.

Common misconceptions about probability

Because both probability and statistics are difficult to teach, some have advocated bypassing formal probability in favor of early exploratory data analysis (Moore, 1997). A risk of this approach is that many students never get up to probability at all. This is a problem, because probability is not merely the foundation for statistics but is also directly relevant to medical and other decisions that we all must make (Gaissmaier and Gigerenzer, 2011). Probability is also important to the biology curriculum via genetics (Masel, 2012), and so minimizing probability in a statistics class shifts instructional burden to the biology faculty. Given the central importance of understanding probability in becoming an informed citizen in general, as well as to the life sciences in particular, we believe that the effort to counter probability misconceptions warrants more than the brief treatment it often gets as rapid “background” in a genetics course. For students whose curriculum stresses the exploratory data analysis approach, probability has become an upper-division mathematics elective, such that even the few biology students who take it are unlikely to do so before exposure in genetics.

## COURSE DESCRIPTION AND DESIGN

Students are intrinsically motivated to learn about medicine, providing a great opening to teach probability and statistics in a medical context starting earlier in the curriculum. We therefore developed an undergraduate course in evidence-based medicine at the University of Arizona as a substitute for traditional 200-level biostatistics. It doubles as a substitute for either a traditional bioethics course or a science and society elective and meets both institutional requirements for a “writing-emphasis” course and the minimum quantity of reading and writing shown to be associated with gains on the Collegiate Learning Assessment (Arum and Roksa, 2011).

The primary tool of evidence-based medicine is the randomized controlled trial (RCT). We therefore made this the centerpiece of the class, making the class as much an exercise in the scientific method as it was a course in probability and statistics. Instead of teaching a broad diversity of scientific methods, we focused on gold-standard RCTs as an ideal paradigm for teaching the application of the scientific method not just to medicine but also to all messy data, that is, to everyday life. To reinforce the link to normal life, students read an engrossing history of RCTs (Burch, 2009), and all students wrote a proposal to perform an RCT. As a capstone, students carried out a handful of the proposed RCTs as class projects, for example, testing whether texting increases the likelihood that volunteers follow through on their commitment to give blood (Littin, 2012), whether the digital removal of a Nike logo changes the desirability of an article of clothing, or whether men can bench-press more when a woman sits on their hips (Huynh, 2014, 2015; Innes, 2015). Teaching the scientific method through RCTs is both a goal in and of itself, as well as a contextual tool that we hope may help make learning gains about probability stick.

Hypothesis testing was introduced early in the course, starting with two previously developed case studies, slightly modified by us for this course. The first, on Ignaz Semmelweis and hand-washing (Colyer, 1999), introduced hypothesis testing and the scientific method in a nonquantitative setting and prepared the way for contemporary discussions of hand-washing and checklists (Gawande, 2007). The second, based on Fisher’s original essay on the lady tasting tea (Maynard *et al.*, 2009), extended this to bring in more formal hypothesis-testing concepts, including the null hypothesis, *p*-values, and the binomial distribution.
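The lady tasting tea lends itself to a short worked calculation. The sketch below assumes the design as it is commonly described, with eight cups of which four had milk poured first and the taster asked to pick out those four; under the null hypothesis of pure guessing, every choice of four cups is equally likely.

```python
from math import comb


def p_at_least(k_correct: int, cups_per_kind: int = 4) -> float:
    """P(at least k_correct milk-first cups are identified) under guessing.

    The taster picks cups_per_kind cups out of 2 * cups_per_kind, so the
    number of correct picks follows a hypergeometric distribution.
    """
    n = cups_per_kind
    total = comb(2 * n, n)  # eight-choose-four = 70 equally likely selections
    favorable = sum(comb(n, k) * comb(n, n - k) for k in range(k_correct, n + 1))
    return favorable / total


print(p_at_least(4))  # all four correct: 1/70
print(p_at_least(3))  # three or more correct: 17/70
```

Only a perfect score gives a p-value below 0.05 in this design, which makes it a compact illustration of how sample size limits what an experiment can show.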

Motivated by the goal of understanding RCTs, we used backward-design principles to guide our choice of probability and statistics content. Discrete data in a 2 × 2 contingency table (treatment vs. control, live vs. die) is the obvious way to approach a clinical trial. Rather than the traditional Pearson’s version of the chi-square test (comparing Σ(*O* − *E*)^{2}/*E* with χ^{2}), we taught the likelihood-ratio version (comparing *G* = 2ln [*L*(data|*H*_{1})/*L*(data|*H*_{0})] with χ^{2}) (Howell, 2014), both to reinforce learning of probability, and also because, should students continue in science, likelihoods appear in most statistical settings, whereas Pearson’s approach is used only for contingency tables. To avoid the trap of a canned technique, as Pearson’s test so easily becomes, our teaching of the derivation of the likelihood values required understanding the binomial distribution. Understanding of binomial coefficients is in any case needed to understand Fisher’s argument involving eight-choose-four equally likely options in the lady tasting tea. A less mathematically intensive version of the course than ours might omit the full binomial distribution and use Pearson’s test instead. In either case, *p*-values and type I and type II error rates are central topics, and working backward from what was needed, it was clear that a basic but firm grounding in probability is key.
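To make the likelihood-ratio test concrete, here is a minimal sketch that computes G for a 2 × 2 table and converts it to a p-value with one degree of freedom. It uses the equivalent form G = 2 Σ O ln(O/E), with expected counts taken from the table margins; the trial counts are invented for illustration.

```python
import math


def g_statistic(table):
    """Likelihood-ratio G statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # a zero cell contributes nothing to the sum
                g += observed * math.log(observed / expected)
    return 2 * g


def p_value_df1(g):
    """Upper-tail chi-square probability for 1 degree of freedom."""
    return math.erfc(math.sqrt(g / 2))


# Hypothetical trial: rows are treatment and control, columns are live and die.
trial = [[30, 20],
         [20, 30]]
g = g_statistic(trial)
print(round(g, 3), round(p_value_df1(g), 4))
```

The closed-form p-value works only for one degree of freedom, which is the 2 × 2 case discussed here; larger tables would need a general chi-square distribution function.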

To achieve this, we focused on eliciting and then combating known student misconceptions about probability (Table 1). We were particularly concerned about the total failure to grasp stochasticity known as the “outcome orientation” (Konold, 1989), an especially strong danger in the medical context (Humphrey and Masel, 2014). The goal of students with an outcome orientation “in dealing with uncertainty is to predict the outcome of a single next trial” (Konold, 1989, p. 61). When guessing the outcome of the roll of an irregular die, they are happy to call their estimate as right or wrong based on a single roll and are remarkably uninterested in gathering data on multiple rolls (Konold, 1989). If students treat every patient outcome as a unique event, rather than as members of a statistical group, they will not be able to grasp the power of RCTs (Humphrey and Masel, 2014).

Probability, in its modern philosophical interpretations, can mean very different things (Hájek, 2012). Frequentism refers to “forward probability”: the probability of seeing particular data given a state of the world. For example, *p*-values give the probability of seeing data so at odds with the null hypothesis, given that the null hypothesis is true. The most accessible, classical cases of forward probability focus on randomization devices such as dice and cards, for which each of a set of outcomes is equally likely. In contrast, Bayesianism focuses on “backward probability”; it is epistemic in nature, with “probability” describing our degree of confidence in an inference about the state of the world. Rather than promoting a single interpretation of probability or confusing students by presenting multiple interpretations simultaneously, we introduced notions of probability one at a time throughout the semester, in historical order. First, we worked with dice and playing cards to reinforce classical probability, trying to counter the outcome orientation by forcing students to consider dice rolls as a group. Then we did exercises with irregular dice (Bramald, 1994) to combat equiprobability bias during the transition from classical to frequentist probability. Some students had already encountered this distinction during K–12 as “theoretical” versus “experimental” or “empirical” probability. Here, we addressed outcome orientation again, stressing that however rare an event is, it can still happen, and that frequencies are the only way to put a number on this. We used combinatorics for both classical and frequentist probabilities, connected via the binomial distribution.

We introduced Bayesian probability much later in the semester, out of fear that content on subjective probability would accidentally reinforce the outcome orientation. Bayes’ theorem was taught in the context of medical-screening programs such as mammography (Gigerenzer, 2002) and Ioannidis’ argument that “most published research findings are false” (Ioannidis, 2005). The latter required a strong grounding in type I versus type II errors, built up during work on the likelihood ratio test. Conditional probability was introduced using real data on breast cancer incidence, with students exploring tables of data themselves before receiving formal instruction designed to distinguish between prob(A|B) and prob(B|A), in this case, prob(die of breast cancer|die young) ≠ prob(die young|die of breast cancer). Building on this foundation, Bayes’ theorem was then taught using dot diagrams and natural frequency trees (Sedlmeier and Gigerenzer, 2001; Figure 1) rather than via the equation.

Figure 1.

Use of a natural frequency tree to implement Bayes’ theorem. For this problem, the information given is “About 0.01% of men with no known risk factors have HIV. HIV+ men test positive 99.9% of the time. HIV− men test negative 99.99%**...**
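The natural frequency approach can be sketched in a few lines of code. The example below uses the figures visible in the caption above (prevalence 0.01%, 99.9% of HIV+ men test positive, 99.99% of HIV− men test negative); because the caption is truncated, treating these as the complete problem statement is an assumption, as is the starting population of 10 million men.

```python
def natural_frequency_tree(population, prevalence, sensitivity, specificity):
    """Split a population into the four branches of a natural frequency tree."""
    infected = population * prevalence
    uninfected = population - infected
    true_positives = infected * sensitivity
    false_positives = uninfected * (1 - specificity)
    # Probability of infection given a positive test, read off the tree.
    p_infected_given_positive = true_positives / (true_positives + false_positives)
    return {
        "infected": infected,
        "uninfected": uninfected,
        "true_positives": true_positives,
        "false_positives": false_positives,
        "p_infected_given_positive": p_infected_given_positive,
    }


tree = natural_frequency_tree(
    population=10_000_000,  # hypothetical round number for easy counting
    prevalence=0.0001,
    sensitivity=0.999,
    specificity=0.9999,
)
for name, value in tree.items():
    print(name, round(value, 4))
```

With these inputs the true positives and false positives are nearly equal in number, so a positive test corresponds to roughly even odds of infection despite the very accurate test.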

We left out many traditional biostatistics topics, including observational statistics. We taught mean, SD, variance, and SEM as background to the insight that the effect size that a study has adequate power to detect is proportional to one divided by the square root of the number of patients. But we did not teach correlation as a formal mathematical concept, although we did mention it informally when we stressed the importance of a randomized intervention as the only way to sort out association versus causation. For example, we contrasted early observational results that women undergoing hormone replacement therapy have better health (Grodstein *et al.*, 1996) with later contradictory results from randomized trials (Women’s Health Initiative Steering Committee, 2004), drawing attention to how socioeconomic factors confound the former result but not the latter. Incoming students were all too keen to assert that it is impossible to reach conclusions without “controlling for” every conceivable confounding factor; omitting correlation almost until the end of the course allowed us to stress the power of randomization to remove the need to do this and hence distinguish causation from correlation alone.
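The square-root relationship mentioned above can be illustrated with the standard error of the mean. This is a minimal sketch with made-up numbers, not a full power calculation: it shows only that quadrupling the number of patients halves the SEM, and hence the scale of effect a study can resolve.

```python
import math


def sem(sd: float, n: int) -> float:
    """Standard error of the mean for n patients with standard deviation sd."""
    return sd / math.sqrt(n)


sd = 10.0  # hypothetical SD among patients
for n in (25, 100, 400):
    print(n, sem(sd, n))
```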

The basics of randomization turned out to be surprisingly hard to teach and required substantial time. We used a previously developed active-learning exercise in which students assign playing cards randomly into two groups (Enders *et al.*, 2006) and extended this exercise to have students physically implement a matched-pair design using playing cards. This was later reinforced by an exploration of alternative study designs, in particular comparing parallel groups with crossover design and with *N* of 1 designs.
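The two designs described above can be mimicked in code. The sketch below is a loose analogue of the playing-card exercise, not a reproduction of it: the card labels, the pairing by rank, and the function names are all hypothetical.

```python
import random


def randomize_two_groups(units, rng):
    """Simple randomization: shuffle the units, then split them in half."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def randomize_matched_pairs(pairs, rng):
    """Matched-pair design: a coin flip decides which pair member is treated."""
    treatment, control = [], []
    for first, second in pairs:
        if rng.random() < 0.5:
            first, second = second, first
        treatment.append(first)
        control.append(second)
    return treatment, control


rng = random.Random(2014)  # fixed seed so the illustration is repeatable

cards = ["2H", "2S", "7H", "7S", "KH", "KS"]
group_a, group_b = randomize_two_groups(cards, rng)

# Pair cards of equal rank so that each group receives one card of every rank.
pairs = [("2H", "2S"), ("7H", "7S"), ("KH", "KS")]
treated, controls = randomize_matched_pairs(pairs, rng)
print(group_a, group_b)
print(treated, controls)
```

Simple randomization balances the groups only on average, whereas the matched-pair version guarantees that each rank appears once in each group.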

The third pillar of the course, after RCTs and Bayes’ theorem, was science and society. Indeed, topics such as placebo effects naturally combine statistical material (regression to the mean) with the human aspects of doctors’ and patients’ desires “to please.” Students left the class with the useful take-home skill of being able to place studies, such as those cited above on hormone replacement therapy, on an evidence pyramid (Figure 2), knowing how to locate the highest quality evidence, for example, Cochrane Reviews, and knowing that *not* treating a patient can be a valid medical option for providers. Interestingly, the most disturbing content for many students came not from fiercely partisan issues such as healthcare system design or even from the troubling influence of money on medical decision making (Angell, 2005; Fugh-Berman and Ahari, 2007), but from challenges to the role of reductionism in biomedical science (Horrobin, 2003; Scannell *et al.*, 2012). Table 2 outlines the topics covered by our course, and Table 3 gives the complete list of learning objectives.

Figure 2.

Evidence pyramid. Near the end of the course, students are exposed to alternatives to RCTs and learn to identify the level to which a research article belongs and to choose the highest level of evidence available for a given question. The value of meta-analyses**...**

Table 3.

Course learning objectives are for students to

Active-learning techniques, including dice-rolling, were used wherever possible. In addition to the previously published activities cited and otherwise described above, we made liberal use of think–pair–share interspersed within the 75-min classes. For example, think–pair–share was used for numerical questions such as applications of Bayes’ theorem via natural frequency trees, for guessing how things work in the real world for questions such as which categories of medical professionals are most and least likely to adhere to hand-washing and checklist regimes, and for open-ended experimental design questions such as what are the most important factors to control for/match. The outlines of these active-learning techniques can be followed via the staggered presentation of material in the slides in the Supplemental Material. Complete course materials are also available upon request. Role-playing exercises included one in which students decide on the ratio of type I to type II errors that they consider a reasonable trade-off, both for drug main effects and for serious side effects. Students then act out the roles of a desperate patient, a drug company rep, and an insurance company as each attempts to persuade the doctor as to the appropriate ratio.

## ASSESSMENT METHODS

To assess our success in improving not only context-specific qualitative understanding, but also more generalized numeracy, we compared precourse versus postcourse results for each student using the Quantitative Reasoning Quotient (QRQ) instrument (Sundre, 2003), a refinement of the earlier Statistical Reasoning Assessment instrument (Garfield, 1998, 2003). While many later instruments focus on statistics alone, we chose the QRQ, because it also covers probability in a multiple-choice format that assesses many conceptions and misconceptions simultaneously. Note that, in previous studies, instruction does not have a good track record of improving QRQ scores. For example, sophomores who have completed their 10–12 credit-hour requirement in mathematics and sciences do not perform better on the QRQ than those who have not (Sundre, 2003). Indeed, it is not uncommon for some misconceptions to increase postcourse versus precourse (Delmas *et al.*, 2007).

We simultaneously surveyed students’ Attitudes Towards Statistics (ATS; Wise, 1985), precourse and postcourse. Previous research with this and related instruments has found that students’ positive attitudes coming into a statistics course predict their eventual performance in such a course and that attitudes improve only marginally following instruction (Elmore, 1993; Shultz and Koshino, 1998) or can even deteriorate (Schau and Emmioglu, 2012).

We compared overall correct score and individual QRQ subscores precourse versus postcourse using repeated-measures analysis of variance (ANOVA). For the ATS, we summed total positive attitude scores from the 29 ATS Likert items and compared these overall scores precourse versus postcourse using repeated-measures ANOVA, but given the coarse-grained ordinal nature of the individual Likert items, we analyzed these with the nonparametric Wilcoxon signed-rank test. Nonetheless, we display mean rather than median changes across students in ATS individual item scores precourse versus postcourse; otherwise the changes are sometimes invisible, even for statistically significant items. Raw anonymized data and scripts for QRQ and ATS analyses are available upon request.

## RESULTS AND DISCUSSION

After several years of development, the latest iteration of our evidence-based medicine course was taught in Spring 2014 to 40 students (22 women, at least 15 members of underrepresented minority groups). The only prerequisite to the course was a “C” or higher in college algebra or placement directly into calculus. In practice, our enrollment consisted of one freshman, eight sophomores, 15 juniors, and 15 seniors, most of whom had some prior exposure via an introductory biostatistics course, genetics course, and/or social science research methods course. We were delighted that, from pretest to posttest, QRQ increased by 0.63 pretest SDs (*p* < 0.001; Table 4), and the ATS increased by 0.32 pretest SDs (*p* = 0.002; Table 5). Figure 3 shows those QRQ subscores and ATS items showing improvement with *p* < 0.05 and 0.05 < *p* < 0.1; none deteriorated at this level. QRQ subscores that improved included distinguishing correlation and causation, a task for which statistically significant deteriorations have previously been observed (Delmas *et al.*, 2007).
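Gains expressed in "pretest SDs" are the mean change divided by the standard deviation of the pretest scores. The sketch below shows that arithmetic on invented scores; the use of the sample SD (n − 1 denominator) is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev


def gain_in_pretest_sds(pre, post):
    """Mean pre-to-post change, scaled by the SD of the pretest scores."""
    return (mean(post) - mean(pre)) / stdev(pre)


# Hypothetical paired scores for four students.
pre_scores = [2, 4, 6, 8]
post_scores = [3, 5, 7, 9]
print(round(gain_in_pretest_sds(pre_scores, post_scores), 3))
```

Scaling by the pretest SD makes gains comparable across instruments with different score ranges, such as the QRQ and the ATS.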

Figure 3.

We observed postcourse vs. precourse (a) overall improvements and improvements in some (b) QRQ subscores and (c) ATS item scores for our Spring 2014 course offering. The ATS is a 1–5 Likert scale, and QRQ scores are arbitrarily scaled to match.**...**

Table 4.

Precourse, postcourse, and changes in QRQ total and subscores

Table 5.

Precourse, postcourse, and changes in ATS total and individual items

Previous research suggests that QRQ-like scores correlate negatively with effort-based course grades (explaining previously noted gender biases) and only weakly positively with other graded items (Tempelaar *et al.*, 2006). Results for our course were different: posttest QRQ correlations (Pearson’s *r*) with both course grades as a whole and with our final closed-book exam (included in the Supplemental Materials) were high at 0.5, and even correlations on more effort-based items such as homework problem sets (as found in the Supplemental Materials) were 0.38. Pretest QRQ correlations with final course grade, final exam, and effort-based content were similarly high. This demonstrates that our course assessments are well aligned with the widely endorsed learning objectives of the QRQ (Sundre, 2003). This is despite the fact that course assessments, for example, the two final exams included as Supplemental Materials, differ substantially in content from the QRQ, testing course-specific information in addition to general quantitative reasoning skills. ATS pre- and posttests also predicted course performance, in line with previous studies on attitudes using both the ATS (Waters *et al.*, 1988; Vanhoof *et al.*, 2006) and similar instruments (Emmioglu and Capa-Aydin, 2012) in other statistics classes. Changes in attitudes and quantitative reasoning reflected in the ATS and QRQ were not significantly correlated with pretest scores (Pearson’s *r* = –0.18 for ATS; –0.19 for QRQ; both *p* > 0.35). This indicates that, despite the diversity in ability and attitudes present in a class with as few prerequisites as ours, initially strong or positive students were not systematically more or less likely to benefit from instruction than weaker or more negative students.
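The check described in the last two sentences, correlating each student's change with their pretest score, can be sketched as follows. Pearson's *r* is computed from its definition; the scores are invented for illustration and are not the study data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical pretest and posttest scores for four students.
pretest = [10, 12, 14, 16]
posttest = [13, 13, 16, 18]
gains = [post - pre for pre, post in zip(pretest, posttest)]
print(round(pearson_r(pretest, gains), 3))
```

A correlation near zero would indicate that gains were spread across the range of starting scores rather than concentrated among initially strong or initially weak students.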

Despite the small class size, the assessment evidence suggests that the course was a spectacular success, especially relative to the somewhat dismal history of probability and statistics education. Note that it aligns well with many calls for change (Table 6). We believe it to be far superior to the standard biostatistics curriculum in preparing students for real-world decision making, which benefits from a critical evaluation of (and perhaps even generation of) evidence. Indeed, we have heard a number of promising anecdotes about former students applying knowledge from their class, both as patients and as medical workers, in ways that affected medical care choices.

Table 6.

The course addresses calls for change

We have begun developing a new hybrid (50% online, 50% face-to-face) version of the course, taught for the first time in Spring 2015 to 29 students. This move was motivated primarily by pedagogical concerns; our quantitative material is highly cumulative in nature, inevitably leaving some students behind in face-to-face classes. When material is given online, students have more ability to set their own pace, and interspersing content with frequent autograded quizzes can provide additional help through greater formative assessment and learning through testing (Brown *et al.*, 2014). We have developed two new online apps as part of the online materials, one on confirmation bias (http://bias.oia.arizona.edu/index.html) and one on the mathematics of power (http://power.oia.arizona.edu/index.html). The power app was designed to illustrate how the effect size that a study has power to detect depends on the SD among patients divided by the square root of the number of patients. Customizable options (at http://bias.oia.arizona.edu/options.html and http://power.oia.arizona.edu/options.html for confirmation bias and power, respectively) allow the staged introduction of elements of the apps.

We hope these changes will lead to learning gains in a higher proportion of the class. QRQ and ATS scores for our first offering of the hybrid version (Spring 2015) are shown in Supplemental Tables 1 and 2, and pooled data across both semesters are shown in Supplemental Tables 3 and 4. QRQ and ATS scores each showed improvements of 0.28 pretest SDs (*p* = 0.046 for QRQ; *p* = 0.077 for ATS; Supplemental Tables 1 and 2). These effect sizes are on the whole smaller, around half the size of those for the fully face-to-face class discussed at length above, although the difference is not statistically significant. When data from both years were combined, effect sizes for both QRQ and ATS overall scores were intermediate and remained statistically significant (0.46 and 0.42 pretest SDs for QRQ and ATS, respectively; both *p* < 0.001; Supplemental Tables 3 and 4).

Note that while there is a strong correlation between subscore effect sizes across the two semesters for QRQ (Pearson’s *r* = 0.55, *p* < 0.001), the best- and worst-performing subscores in Table 4 nevertheless regress to the mean in Supplemental Tables 1 and 2, a fact that acts as a caution against the overinterpretation of outlier subscores. Nevertheless, the added power afforded by combining results from both years increased the number of individual subscore items showing a change with *p* < 0.05 (Supplemental Table 3). A consistent underperformer across both semesters was equiprobability bias, which we intend to target more actively next time. Similarly, while overall ATS improvements were seen in each year, when both were combined, the effect sizes of individual ATS items were entirely uncorrelated between years (Pearson’s *r* = 10^{−5}). This reinforces the caution that individual attitude items are likely uninformative, even though the overall effect sizes may indicate a more general and positive shift in attitudes.

While not definitively worse, clearly the hybrid version is not outperforming the face-to-face version at this time. We note that there were the inevitable teething problems associated with the transition to online instruction, and we hope to see learning gains improve over the coming years as the online materials are refined in the light of the abundant data that online instruction generates. If and when the online hybrid version outperforms the original, a second benefit of the new format is to make it easy to disseminate; its writing-intensive nature can be preserved if a high faculty–student ratio is available, or a simplified version should work for larger classes, helping meet high demand. In the meantime, extensive and up-to-date course materials beyond the Supplemental Materials are available on request.

## Acknowledgments

This work was funded in part by a grant to the University of Arizona from the Howard Hughes Medical Institute (52006942). The opinions expressed herein do not necessarily represent those of our funders, who played no role in the preparation of this article. Online course development was funded in part by an Online/Hybrid Course Development Grant as part of the Online Education Project at the University of Arizona. We thank Gretchen Gibbs for guidance in online course development and Gary Carstensen for implementing the two online apps.

## REFERENCES

- American Association for the Advancement of Science . Vision and Change in Undergraduate Biology Education: A Call to Action. Washington, DC: 2011.
- Angell M. The Truth about the Drug Companies: How They Deceive Us and What to Do About It. New York: Random House; 2005.
- Arum R, Roksa J. Academically Adrift: Limited Learning on College Campuses. Chicago: University of Chicago Press; 2011.
- Association of American Medical Colleges–Howard Hughes Medical Institute. Scientific Foundations for Future Physicians. Washington, DC: 2009.
- Bramald R. Teaching probability. Teach Stat. 1994;16:85–89.
- Brown PC, Roediger HL, III, McDaniel MA. Make It Stick. Cambridge, MA: Harvard University Press; 2014.
- Burch D. Taking the Medicine: A Short History of Medicine’s Beautiful Idea, and Our Difficulty Swallowing It. London: Chatto & Windus; 2009.
- Cech EA. Embed social awareness in science curricula. Nature. 2014;505:477–478.[PubMed]
- Colyer C. Childbed Fever: A Nineteenth-Century Mystery. Buffalo, NY: National Center for Case Study Teaching in Science, University at Buffalo; 1999.
- Delmas R, Garfield J, Ooms A, Chance B. Assessing students’ conceptual understanding after a first course in statistics. Stat Educ Res J. 2007;6:28–58.
- Elmore PB. Statistics achievement: a function of attitudes and related experiences. 1993. pp. 1–19. Paper presented at the annual meeting of the American Educational Research Association, held 12–16 April 1993, in Atlanta, GA.
- Emmioglu E, Capa-Aydin Y. Attitudes and achievement in statistics: a meta-analysis study. Stat Educ Res J. 2012;11:95–102.
- Enders CK, Stuetzle R, Laurenceau J-P. Teaching random assignment: a classroom demonstration using a deck of playing cards. Teach Psychol. 2006;33:239–242.
- Fugh-Berman A, Ahari S. Following the script: how drug reps make friends and influence doctors. PLoS Med. 2007;4:e150.[PMC free article][PubMed]
- Gaissmaier W, Gigerenzer G. When misinformed patients try to make informed health decisions. In: Gigerenzer G, Gray JAM, editors. Better Doctors, Better Patients, Better Decisions. Cambridge, MA: MIT Press; 2011.
- Garfield JB. The statistical reasoning assessment: development and validation of a research tool. 1998:781. Proceedings of the 5th International Conference on Teaching Statistics, Singapore.
- Garfield JB. Assessing statistical reasoning. Stat Educ Res J. 2003;2:22–38.
- Gawande A. The checklist. New Yorker. 2007;83:86–95.[PubMed]
- Gigerenzer G. Calculated Risks: How to Know when Numbers Deceive You. New York: Simon and Schuster; 2002.
- Grodstein F, Stampfer MJ, Manson JE, Colditz GA, Willett WC, Rosner B, Speizer FE, Hennekens CH. Postmenopausal estrogen and progestin use and the risk of cardiovascular disease. N Engl J Med. 1996;335:453–461.[PubMed]
- Hájek A. Zalta EN, editor. Interpretations of probability. In Stanford Encyclopedia of Philosophy (Winter 2012 Ed.) 2012http://plato.stanford.edu/archives/win2012/entries/probability-interpret.
- Hodgson T. The effects of hands-on activities on students’ understanding of selected statistical concepts. 1996:241–246. Proceedings of the eighteenth annual meeting, North American Chapter of the International Group for the Psychology of Mathematics Education, Florida State University, Panama City.
- Horrobin DF. Modern biomedical research: an internally self-consistent universe with little contact with medical reality. Nat Rev Drug Discov. 2003;2:151–154.[PubMed]
- Howell DC. Chi-square test: analysis of contingency tables. In: Lovric M, editor. International Encyclopedia of Statistical Science. Berlin: Springer; 2014. pp. 250–252.
- Humphrey PT, Masel J. Outcome orientation—a misconception of probability that harms medical research and practice [preprint] arXiv. 2014:1412.4604.
- Huynh J. UA students test Internet meme using statistics. The Daily Wildcat, October 23 (University of Arizona) 2014
- Huynh J. Meme inspires scientific redo by undergraduates. The Daily Wildcat, October 30 (University of Arizona) 2015
- Innes S. University of Arizona students pumped to question health system. Arizona Daily Star, May 10. 2015
- Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2:e124.[PMC free article][PubMed]
- Kahneman D, Slovic P, Tversky A. Judgment Under Uncertainty: Heuristics and Biases. Cambridge, UK: Cambridge University Press; 1982.
- Konold C. Informal conceptions of probability. Cogn Instr. 1989;6:59–98.
- Konold C. Issues in assessing conceptual understanding in probability and statistics. J Stat Educ. 1995;3:1–9.
- Lecoutre M-P. Cognitive models and problem spaces in “purely random” situations. Educ Stud Math. 1992;23:557–568.
- Littin S. Study: texting increases turnout to campus blood drive. UANews, May 8 (University of Arizona) 2012
- Masel J. Rethinking Hardy–Weinberg and genetic drift in undergraduate biology. BioEssays. 2012;34:701–710.[PubMed]
- Maynard J, Mulcahy MP, Kermick D. Lady Tasting Coffee: A Case Study in Experimental Design. Buffalo, NY: National Center for Case Study Teaching in Science, University at Buffalo; 2009.
- Moore DS. New pedagogy and new content: the case of statistics. Int Stat Rev. 1997;65:123–137.
- Pfaff TJ, Weinberg A. Do hands-on activities increase student understanding? A case study. J Stat Educ. 2009;17:1–34.
- Reyna VF, Brainerd CJ. The importance of mathematics in health and human judgment: numeracy, risk communication, and medical decision making. Learn Individ Differ. 2007;17:147–159.
- Scannell JW, Blanckley A, Boldon H, Warrington B. Diagnosing the decline in pharmaceutical R&D efficiency. Nat Rev Drug Discov. 2012;11:191–200.[PubMed]
- Schau C, Emmioglu E. Do introductory statistics courses in the United States improve students’ attitudes. Stat Educ Res J. 2012;11:86–94.
- Schwartzstein RM, Rosenfeld GC, Hilborn R, Oyewole SH, Mitchell K. Redesigning the MCAT exam: balancing multiple perspectives. Acad Med. 2013;88:560–567.[PubMed]
- Sedlmeier P, Gigerenzer G. Teaching Bayesian reasoning in less than two hours. J Exp Psychol. 2001;130:380–400.[PubMed]
- Shaughnessy JM. Misconceptions of probability: an experiment with a small-group, activity-based, model building approach to introductory probability at the college level. Educ Stud Math. 1977;8:295–316.
- Shaughnessy JM. Research in probability and statistics: reflections and directions. In: Grouws DA, editor. Handbook of Research on Mathematics Teaching and Learning. New York: Macmillan; 1992. pp. 465–494.
- Shultz KS, Koshino H. Evidence of reliability and validity for Wise’s Attitude Toward Statistics scale. Psychol Rep. 1998;82:27–31.
- Sundre DL. 2003. Assessment of quantitative reasoning to enhance educational quality. American Educational Research Association meeting, held in Chicago, IL, April.
- Tempelaar DT, Gijselaers WH, van der Loeff SS. Puzzles in statistical reasoning. J Stat Educ. 2006;14(1)
- Vanhoof S, Sotos AEC, Onghena P, Verschaffel L, Van Dooren W, Van den Noortgate W. Attitudes toward statistics and their relationship with short-and long-term exam results. J Stat Educ. 2006;14(3)
- Waters LK, Martelli TA, Zakrajsek T, Popovich PM. Attitudes towards statistics: an evaluation of multiple measures. Educ Psychol Meas. 1988;48:513–516.
- Wise SL. The development and validation of a scale measuring attitudes towards statistics. Educ Psychol Meas. 1985;45:401–405.
- Women’s Health Initiative Steering Committee Effects of conjugated equine estrogen in postmenopausal women with hysterectomy: the Women’s Health Initiative randomized controlled trial. J Am Med Assoc. 2004;291:1701–1712.[PubMed]
