When prior scores are not destiny

This post is for statistics and assessment wonks. I've been deeply engaged in a bit of data detective work, and one of my findings-in-progress has whacked me upside the head, making me rethink my interpretation of some common statistics.

Here’s the setup. In lots of educational experimental designs we have some sort of measure of prior achievement – this can be last year’s end-of-year test score, or a pre-test administered in early Fall.  Then (details vary depending on the design) we have one group of students/teachers try one thing, and another group do something else. We then administer a test at the end of the course, and compare the test score distributions (center and spread) between the two groups. What we’re looking for is a difference in the mean outcomes between the two groups.

So, why do we even need a measure of prior achievement? If we've randomly assigned students/teachers to groups, we really don't. In principle, with a large enough sample, those two groups will have roughly equal distributions of intellectual ability, motivation, special needs, etc. If the assignment isn't random, though – say one group of schools is trying out a new piece of software, while another group of schools isn't – then we have to worry that the schools using the software may be "advantaged" as a group, or different in some other substantial way. Comparing the students on prior achievement scores is one way of assuring ourselves that the two groups were similar (enough) upon entry to the study. I'm glossing over lots of technical details here – whole books have been written on the ins and outs of various experimental designs.

Here's another reason we like prior achievement measures, even with randomized experiments: they give us a lot more statistical power. What does that mean? Comparing the mean outcome scores of two groups is done against a background of a lot of variation. Let's say the mean score of group A is 75% and group B's is 65%. That's a 10 percentage point difference. But let's say the scores for both groups range from 30% to 100%. We're looking at a 10-point difference against a background of a much wider spread of scores. It turns out that if the spread of scores is very large relative to the mean difference we see, we start to worry that our result isn't "real" but is in fact just an artifact of some statistical randomness in our sample. In more jargon-y language, our result may not be "statistically significant" even if the difference is educationally important.
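To make the spread-versus-signal point concrete, here's a quick simulation sketch (all numbers invented for illustration): two pairs of groups with the same 10-point gap in true means, one pair with a wide spread of scores and one with a narrow spread. Welch's t statistic, computed by hand from the standard formula, is much larger when the spread is small:

```python
import math
import random
import statistics

random.seed(0)

def t_stat(a, b):
    """Welch's t statistic: mean difference over its standard error."""
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

n = 30  # students per group (hypothetical)
# Same 10-point gap in true means, very different spreads:
wide_a   = [random.gauss(75, 18) for _ in range(n)]
wide_b   = [random.gauss(65, 18) for _ in range(n)]
narrow_a = [random.gauss(75, 5) for _ in range(n)]
narrow_b = [random.gauss(65, 5) for _ in range(n)]

print(t_stat(wide_a, wide_b))      # modest t: the gap is murky against the noise
print(t_stat(narrow_a, narrow_b))  # much larger t: same gap, clearer signal
```

Same effect, different verdicts: the wide-spread comparison may well fail to reach significance, while the narrow-spread one clears the bar easily.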

Prior scores to the rescue. We can use these to eliminate some of the spread of outcome scores by first using the prior scores to predict what the outcomes scores would likely be for a given student. Then we look at the mean difference of two groups against not the spread of scores, but the spread of predicted scores. That ends up reducing a lot of the variation in the background and draws out our “signal” against the “noise” more clearly.  Again, this is a hand-wavy explanation, but that’s the essence of it. (A somewhat equivalent model is to look at the  gains from pretest to posttest and compare those gains across groups. This requires a few extra conditions but is entirely feasible and increases power for the same reasons).

In order for this to work, it is very helpful to have a prior achievement measure that is highly predictive of the outcome. When we have a strong predictor, we can (it turns out) be much more confident that any experimental manipulation or comparisons we observe are “real” and not due to random noise. And for many standardized tests across large samples, this is the case – the best predictor of how well a student does at the end of grade G is how well they were doing at the end of grade G-1. Scores at the end of grade G-1 swamp race, SES, first language… all of these predictors virtually disappear once we know prior scores.

What happens when the prior test scores don't predict outcomes very well? From a statistical power perspective, we're in trouble – we may not have reduced the "noise" enough to detect our signal. Or, it could indicate technical issues with the tests themselves – they may not be very reliable (meaning the same student taking both tests close together in time may get wildly different scores). In general, I've historically been disappointed by low pretest/posttest correlations.

So today I’m engaged in some really interesting data detective work. A bunch of universities are trying out this nifty new way of teaching developmental math – that’s the course you have to take if your math skills aren’t quite what are needed to engage in college-level quantitative coursework. It’s a well-known problem course, particularly in the community colleges: students may take a developmental math course 2 or 3 times, fail it each time, accumulate no college credits, and be in debt after this discouraging experience. This is a recipe for dropping out of school entirely.

In my research, I’ve been looking at how different instructors go about using this nifty new method (I’m keeping the details vague to protect both the research and participant interests – this is all very preliminary stuff). One thing I noticed is that in some classes, the pretest predicts the posttest very accurately. In others, it barely predicts the outcome at all. The “old” me was happy to see the classrooms with high prediction – it made detecting the “outlier” students, those that were going against all predicted trends, easier to spot. The classes with low prediction were going to cause me trouble in spotting “mainstream” and “outlier” students.

Then it hit me – how should I interpret the low pretest-posttest correlation? It wasn’t a problem with test reliability – the same tests were being used across all instructors and institutions, and were known to be reliable. Restriction of range wasn’t a problem either (although I still need to document that for sure) – sometimes we get low correlations because, for example, everyone aces the posttest – there is therefore very little variation to “predict” in the first place.

Here’s one interpretation: the instructors in the low pretest-posttest correlation classrooms are doing something interesting and adaptive to change a student’s trajectory. Think about it – high pretest-posttest correlation essentially means “pretest is destiny” – if I know what you score before even entering the course, I can very well predict what you’ll score on the final exam. It’s not that you won’t learn anything – we can have high correlations even if every student learns a whole lot. It’s just that whatever your rank order in the course was when you came in, that’ll likely be your rank order at the end of the course, too. And usually the bottom XX% of that distribution fails the class.

So rather than strong pretest-posttest correlations being desirable for power, I’m starting to see them as indicators of “non-adaptive instruction.” This means whatever is going on in the course, it’s not affecting the relative ranking of students; put another way, it’s affecting each student’s learning somewhat consistently. Again, it doesn’t mean they’re not learning, just that they’re still distributed similarly relative to one another. I’m agnostic as to whether this constitutes a “problem” – that’s actually a pretty deep question I don’t want to dive into in this post.

I’m intrigued for many reasons by the concept of effective adaptive instruction – giving the bottom performers extra attention or resources so that they may not just close the gap but leap ahead of other students in the class. It’s really hard to find good examples of this in general education research – for better or worse, relative ranks on test scores are stubbornly persistent. It also means, however, that the standard statistical models we use are not accomplishing everything we want in courses where adaptive instruction is the norm. “Further research is needed” is music to the ears of one who makes a living conducting research. 🙂

I’m going to be writing up a more detailed and technical treatment of this over the next months and years, but I wanted to get an idea down on “paper” and this blog seemed like a good place to plant it. It may turn out that these interesting classes are not being adaptive at all – the low pretest-posttest correlations could be due to something else entirely. Time will tell.

Voting and uncertainty

Turning and turning in the widening gyre
The falcon cannot hear the falconer;
Things fall apart; the centre cannot hold;
Mere anarchy is loosed upon the world,
The blood-dimmed tide is loosed, and everywhere
The ceremony of innocence is drowned;
The best lack all conviction, while the worst
Are full of passionate intensity.

From "The Second Coming," William Butler Yeats

“The best lack all conviction, while the worst are full of passionate intensity.” Oh how I relate to those two lines, particularly in this election cycle.  Hate mongers fill cyberspace with passionate intensity. Yes, the “good guys” express their share of passion, too, but often I find that the most thoughtful, nuanced commentaries on current events are… thoughtful, not passionate.

In my work I sometimes think of myself as a professional skeptic. We conduct large scale experiments on educational interventions, and roughly 90% of the time I end up being the bearer of bad news to passionate advocates – their particular policy/cause/gizmo was just not as effective as they’d hoped. People are complicated; social change is hard work, with few turnkey solutions (even when we’ve managed to define the “problem” adequately).

Politically, I’m generally left-leaning. Without parsing that too finely, I advocate for personal liberty (control over one’s reproduction, sexuality, freedom of expression and association, etc.), as well as a vision of government as steward and protector of the commons, particularly against those organizations/corporations/tribes who would impose their will on others.  Some issues for me are no-brainers. As often as not, however, people with similar ends in mind can argue over means. What should one do, for example, about housing stock in San Francisco? Do we even agree that preserving an economically diverse city is worth attempting in the first place?

Around this time of year I often read commentary to the effect of “why bother voting, we’re just choosing the lesser of two evils, if voting could change anything it would be illegal, etc.” While I sympathize with the spirit of these nay-sayers, I’ve come down firmly on the side of voting, even if what we’re doing is little more than a coin flip. I could cite a number of civic-minded reasons why voting is important (not the least of which is that it sends a message to politicians that people are paying attention), but I want to focus on decision making under uncertainty.

For those of us “lacking all conviction” (or who can see both sides of an issue), it’s tempting to withhold our vote until we have a strong, defensible argument for picking among A, B, C or D. What if I choose badly? What if the policy/politician I’m endorsing has unintended consequences I can’t foresee? Here’s my question – do you think it is more likely than not that your choice will further your political agenda in a desired direction? Not with a strong degree of certainty, just simply better than 50/50.  If you believe even slightly better than 50/50 that your choice will have a positive impact, you should vote.

I work with numbers and statistics so frequently that I sometimes forget to go back to basics and understand where my intuitions come from. I’ve been thinking about voting as a very error-prone process, but one in which the law of large numbers (or the Wisdom of the Crowd, to use a popular term) can tip the balance decisively.

Here’s a thought experiment. Say we have 1,000,000 voters choosing between two candidates. These million voters lack all conviction, but there is a very slight preference for the policy of candidate A over candidate B. Let’s say that out of 1,000 voters, 501 – just slightly more than 50/50 – favor candidate A. If you run the numbers, in the end we expect 50.1% – or 501,000 voters – to vote for A over B, and A wins.  Now of course, randomness means that we would never have exactly 501,000 to 499,000 in an election, but how “wobbly” would those numbers be?  After all, candidate A wins by a mere 2,000 votes in a population of one million. Here’s the magic: If I wave my statistical wand, I can tell you that in over 97% of elections of this type, candidate A would win¹.
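For readers who want to check that "magic" themselves, here's a small simulation using the same normal approximation as the footnote (the voter counts and preference rate are just the thought experiment's numbers):

```python
import math
import random

random.seed(3)

N = 1_000_000   # voters
p = 0.501       # 501 of every 1,000 lean slightly toward candidate A

mean = N * p                      # expected votes for A: 501,000
sd = math.sqrt(N * p * (1 - p))   # standard error, roughly 500

# Draw many simulated elections from the normal approximation to the binomial.
elections = 100_000
wins_for_a = sum(random.gauss(mean, sd) > N / 2 for _ in range(elections))
print(wins_for_a / elections)  # roughly 0.977
```

A 0.1 percentage-point lean in the electorate translates into candidate A winning nearly 98% of such elections.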

Let’s think about this. One interpretation is that if half the population was just a smidgeon more convinced that candidate A was preferable, candidate A would win almost every time. Another interpretation is that it only takes one in 1,000 committed voters in a sea of uncertainty to swing an election. I invite you to contribute your own interpretation (and critique) in the comments.

Obviously the real world is not as simple as this mathematical model – we have passionate factions on multiple sides of an election, the perceived costs/benefits of a good choice can vary, etc. But, having worked through the numbers, I’m struck at how little it takes to reliably tip the balance of an election in a particular direction.

Sometimes – often – the majority is wrong; our history is one of prolonged oppression of minority groups, until either a court intervenes or public opinion dramatically shifts. Still, all else being equal, I would prefer to have thoughtful citizens weighing in on an issue – even under extreme uncertainty – than any of the alternatives.

¹ Technical footnote: The standard error of the sum of 1,000,000 coin tosses with a 50/50 probability is sqrt(1,000,000 × .5 × .5) = 500. A final result of 501,000 votes is two standard errors above a tie. Using a Normal approximation, a bell curve centered at 501,000 with standard deviation 500 has about 97.7% of its area above the 500,000 (50/50) mark. That is, roughly 97.7% of elections under this scenario should come out in favor of candidate A.