This post is for statistics and assessment wonks. I’ve been really engaged in a bit of data detective work, and one of my findings-in-progress has whacked me up side the head, making me re-think my interpretation of some common statistics.
Here’s the setup. In lots of educational experimental designs we have some sort of measure of prior achievement – this can be last year’s end-of-year test score, or a pre-test administered in early Fall. Then (details vary depending on the design) we have one group of students/teachers try one thing, and another group do something else. We then administer a test at the end of the course, and compare the test score distributions (center and spread) between the two groups. What we’re looking for is a difference in the mean outcomes between the two groups.
So, why do we even need a measure of prior achievement? If we’ve randomly assigned students/teachers to groups, we really don’t. In principle, with a large enough sample, those two groups will have somewhat equal distributions of intellectual ability, motivation, special needs, etc. If the assignment isn’t random, though – say one group of schools is trying out a new piece of software, while another group of schools isn’t – then we have to worry that the schools using software may be “advantaged” as a group, or different in some other substantial way. Comparing the students on prior achievement scores can be one way of assuring ourselves that the two groups of students were similar (enough) upon entry to the study. I’m glossing over lots of technical details here – whole books have been written on the ins and outs of various experimental designs.
Here’s another reason we like prior achievement measures, even with randomized experiments: they give us a lot more statistical power. What does that mean? Comparing the mean outcome score of two groups is done against a background of a lot of variation. Let’s say the mean scores of group A are 75% and group B are 65%. That’s a 10 percentage point difference. But let’s say the scores for both groups range from 30% to 100%. We’re looking at a 10 point difference against a background of a much wider spread of scores. It turns out that if the spread of scores is very large relative to the mean difference we see, we start to worry that our result isn’t “real” but is in fact just an artifact of some statistical randomness in our sample. In more jargon-y language, our result may not be “statistically significant” even it the difference is educationally important.
Prior scores to the rescue. We can use these to eliminate some of the spread of outcome scores by first using the prior scores to predict what the outcomes scores would likely be for a given student. Then we look at the mean difference of two groups against not the spread of scores, but the spread of predicted scores. That ends up reducing a lot of the variation in the background and draws out our “signal” against the “noise” more clearly. Again, this is a hand-wavy explanation, but that’s the essence of it. (A somewhat equivalent model is to look at the gains from pretest to posttest and compare those gains across groups. This requires a few extra conditions but is entirely feasible and increases power for the same reasons).
In order for this to work, it is very helpful to have a prior achievement measure that is highly predictive of the outcome. When we have a strong predictor, we can (it turns out) be much more confident that any experimental manipulation or comparisons we observe are “real” and not due to random noise. And for many standardized tests across large samples, this is the case – the best predictor of how well a student does at the end of grade G is how well they were doing at the end of grade G-1. Scores at the end of grade G-1 swamp race, SES, first language… all of these predictors virtually disappear once we know prior scores.
What happens in the case when the prior test scores don’t predict outcomes very well? From a statistical power perspective, we’re in trouble – we may not have reduced the “noise” adequately enough to detect our signal. Or, it could indicate technical issues with the tests themselves – they may not be very reliable (meaning the same student taking both tests near to one another in time may get wildly different scores). In general, I’ve historically been disappointed by low pretest/posttest correlations.
So today I’m engaged in some really interesting data detective work. A bunch of universities are trying out this nifty new way of teaching developmental math – that’s the course you have to take if your math skills aren’t quite what are needed to engage in college-level quantitative coursework. It’s a well-known problem course, particularly in the community colleges: students may take a developmental math course 2 or 3 times, fail it each time, accumulate no college credits, and be in debt after this discouraging experience. This is a recipe for dropping out of school entirely.
In my research, I’ve been looking at how different instructors go about using this nifty new method (I’m keeping the details vague to protect both the research and participant interests – this is all very preliminary stuff). One thing I noticed is that in some classes, the pretest predicts the posttest very accurately. In others, it barely predicts the outcome at all. The “old” me was happy to see the classrooms with high prediction – it made detecting the “outlier” students, those that were going against all predicted trends, easier to spot. The classes with low prediction were going to cause me trouble in spotting “mainstream” and “outlier” students.
Then it hit me – how should I interpret the low pretest-posttest correlation? It wasn’t a problem with test reliability – the same tests were being used across all instructors and institutions, and were known to be reliable. Restriction of range wasn’t a problem either (although I still need to document that for sure) – sometimes we get low correlations because, for example, everyone aces the posttest – there is therefore very little variation to “predict” in the first place.
Here’s one interpretation: the instructors in the low pretest-posttest correlation classrooms are doing something interesting and adaptive to change a student’s trajectory. Think about it – high pretest-posttest correlation essentially means “pretest is destiny” – if I know what you score before even entering the course, I can very well predict what you’ll score on the final exam. It’s not that you won’t learn anything – we can have high correlations even if every student learns a whole lot. It’s just that whatever your rank order in the course was when you came in, that’ll likely be your rank order at the end of the course, too. And usually the bottom XX% of that distribution fails the class.
So rather than strong pretest-posttest correlations being desirable for power, I’m starting to see them as indicators of “non-adaptive instruction.” This means whatever is going on in the course, it’s not affecting the relative ranking of students; put another way, it’s affecting each student’s learning somewhat consistently. Again, it doesn’t mean they’re not learning, just that they’re still distributed similarly relative to one another. I’m agnostic as to whether this constitutes a “problem” – that’s actually a pretty deep question I don’t want to dive into in this post.
I’m intrigued for many reasons by the concept of effective adaptive instruction – giving the bottom performers extra attention or resources so that they may not just close the gap but leap ahead of other students in the class. It’s really hard to find good examples of this in general education research – for better or worse, relative ranks on test scores are stubbornly persistent. It also means, however, that the standard statistical models we use are not accomplishing everything we want in courses where adaptive instruction is the norm. “Further research is needed” is music to the ears of one who makes a living conducting research. 🙂
I’m going to be writing up a more detailed and technical treatment of this over the next months and years, but I wanted to get an idea down on “paper” and this blog seemed like a good place to plant it. It may turn out that these interesting classes are not being adaptive at all – the low pretest-posttest correlations could be due to something else entirely. Time will tell.