When prior scores are not destiny

This post is for statistics and assessment wonks. I’ve been really engaged in a bit of data detective work, and one of my findings-in-progress has whacked me upside the head, making me rethink my interpretation of some common statistics.

Here’s the setup. In lots of educational experimental designs we have some sort of measure of prior achievement – this can be last year’s end-of-year test score, or a pre-test administered in early Fall.  Then (details vary depending on the design) we have one group of students/teachers try one thing, and another group do something else. We then administer a test at the end of the course, and compare the test score distributions (center and spread) between the two groups. What we’re looking for is a difference in the mean outcomes between the two groups.

So, why do we even need a measure of prior achievement? If we’ve randomly assigned students/teachers to groups, we really don’t. In principle, with a large enough sample, those two groups will have somewhat equal distributions of intellectual ability, motivation, special needs, etc. If the assignment isn’t random, though – say one group of schools is trying out a new piece of software, while another group of schools isn’t – then we have to worry that the schools using software may be “advantaged” as a group, or different in some other substantial way. Comparing the students on prior achievement scores can be one way of assuring ourselves that the two groups of students were similar (enough) upon entry to the study.  I’m glossing over lots of technical details here – whole books have been written on the ins and outs of various experimental designs.

Here’s another reason we like prior achievement measures, even with randomized experiments: they give us a lot more statistical power. What does that mean? Comparing the mean outcome scores of two groups is done against a background of a lot of variation. Let’s say the mean score is 75% for group A and 65% for group B. That’s a 10 percentage point difference. But let’s say the scores for both groups range from 30% to 100%. We’re looking at a 10 point difference against a background of a much wider spread of scores. It turns out that if the spread of scores is very large relative to the mean difference we see, we start to worry that our result isn’t “real” but is in fact just an artifact of some statistical randomness in our sample. In more jargon-y language, our result may not be “statistically significant” even if the difference is educationally important.
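To put rough numbers on that, here is a minimal sketch of the signal-versus-noise arithmetic. The 75% and 65% means come from the example above; the standard deviation of 18 points and the group sizes are assumptions picked purely for illustration, and the formula is the textbook two-sample t statistic for equal-sized groups with a common spread.

```python
import math


def t_statistic(mean_a, mean_b, sd, n_per_group):
    """Two-sample t statistic, assuming equal group sizes and a common SD."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return (mean_a - mean_b) / standard_error


# Means are from the example in the text; the SD and group sizes are assumed.
MEAN_A, MEAN_B, ASSUMED_SD = 75.0, 65.0, 18.0

for n in (10, 40):
    t = t_statistic(MEAN_A, MEAN_B, ASSUMED_SD, n)
    print(f"n per group = {n:3d}: t = {t:.2f}")
```

Using the common rule of thumb that a t value somewhere around 2 or more counts as "significant," the same 10-point difference falls short with 10 students per group and clears the bar with 40. Shrinking the spread has the same effect as adding students, which is where prior scores come in.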

Prior scores to the rescue. We can use these to eliminate some of the spread of outcome scores by first using the prior scores to predict what the outcome score would likely be for a given student. Then we look at the mean difference between the two groups against not the raw spread of scores, but the spread of scores around their predicted values (the residuals). That removes a lot of the variation in the background and draws out our “signal” against the “noise” more clearly. Again, this is a hand-wavy explanation, but that’s the essence of it. (A somewhat equivalent model is to look at the gains from pretest to posttest and compare those gains across groups. This requires a few extra conditions but is entirely feasible and increases power for the same reasons.)
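A small simulation makes the noise reduction concrete. This is only a sketch under stated assumptions: the scores are made up (a single group, a straight-line relationship between pretest and posttest, normally distributed noise), and the helper names `fit_line` and `residual_sd` are mine, not from any particular package.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for predicting y from x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_sd(x, y):
    """Spread of y around the values predicted from x."""
    slope, intercept = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return statistics.pstdev(residuals)


# Simulated scores; every number here is an illustrative assumption.
rng = random.Random(0)
pretest = [rng.gauss(60, 15) for _ in range(500)]
posttest = [10 + 0.9 * p + rng.gauss(0, 8) for p in pretest]

raw_sd = statistics.pstdev(posttest)
adjusted_sd = residual_sd(pretest, posttest)
print(f"raw spread of posttest scores:    {raw_sd:.1f}")
print(f"spread around predicted scores:   {adjusted_sd:.1f}")
```

The spread around the predictions is what a group difference gets compared against once prior scores are in the model. The stronger the pretest predicts the posttest, the smaller that leftover spread becomes.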

In order for this to work, it is very helpful to have a prior achievement measure that is highly predictive of the outcome. When we have a strong predictor, we can (it turns out) be much more confident that any differences we observe between experimental groups are “real” and not due to random noise. And for many standardized tests across large samples, this is the case – the best predictor of how well a student does at the end of grade G is how well they were doing at the end of grade G-1. Scores at the end of grade G-1 swamp race, SES, first language… all of these predictors virtually disappear once we know prior scores.

What happens when the prior test scores don’t predict outcomes very well? From a statistical power perspective, we’re in trouble – we may not have reduced the “noise” enough to detect our signal. Or, it could indicate technical issues with the tests themselves – they may not be very reliable (meaning the same student taking both tests near to one another in time may get wildly different scores). In general, I’ve historically been disappointed by low pretest/posttest correlations.

So today I’m engaged in some really interesting data detective work. A bunch of universities are trying out this nifty new way of teaching developmental math – that’s the course you have to take if your math skills aren’t quite what are needed to engage in college-level quantitative coursework. It’s a well-known problem course, particularly in the community colleges: students may take a developmental math course 2 or 3 times, fail it each time, accumulate no college credits, and be in debt after this discouraging experience. This is a recipe for dropping out of school entirely.

In my research, I’ve been looking at how different instructors go about using this nifty new method (I’m keeping the details vague to protect both the research and participant interests – this is all very preliminary stuff). One thing I noticed is that in some classes, the pretest predicts the posttest very accurately. In others, it barely predicts the outcome at all. The “old” me was happy to see the classrooms with high prediction – it made detecting the “outlier” students, those that were going against all predicted trends, easier to spot. The classes with low prediction were going to cause me trouble in spotting “mainstream” and “outlier” students.

Then it hit me – how should I interpret the low pretest-posttest correlation? It wasn’t a problem with test reliability – the same tests were being used across all instructors and institutions, and were known to be reliable. Restriction of range wasn’t a problem either (although I still need to document that for sure) – sometimes we get low correlations because, for example, everyone aces the posttest – there is therefore very little variation to “predict” in the first place.
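For readers who haven’t run into restriction of range before, here is a hedged illustration of the ceiling-effect version described above. The data are simulated and every parameter (score distributions, the 50-point gain, the cap at 100) is an assumption chosen to make the effect easy to see; it is not drawn from the study.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Simulated class: a large gain pushes most students past the top of the scale.
rng = random.Random(1)
pretest = [rng.gauss(60, 15) for _ in range(500)]
uncapped = [p + 50 + rng.gauss(0, 5) for p in pretest]
capped = [min(score, 100.0) for score in uncapped]  # the test tops out at 100

r_uncapped = pearson_r(pretest, uncapped)
r_capped = pearson_r(pretest, capped)
print(f"correlation without a ceiling: {r_uncapped:.2f}")
print(f"correlation with a ceiling:    {r_capped:.2f}")
```

The underlying relationship between pretest and posttest is identical in both cases. The correlation only drops because most of the capped scores pile up at 100, leaving little variation to predict.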

Here’s one interpretation: the instructors in the low pretest-posttest correlation classrooms are doing something interesting and adaptive to change a student’s trajectory. Think about it – high pretest-posttest correlation essentially means “pretest is destiny” – if I know what you score before even entering the course, I can very well predict what you’ll score on the final exam. It’s not that you won’t learn anything – we can have high correlations even if every student learns a whole lot. It’s just that whatever your rank order in the course was when you came in, that’ll likely be your rank order at the end of the course, too. And usually the bottom XX% of that distribution fails the class.
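A toy example may help separate "everyone learns" from "rank order changes." The six students and their scores below are invented for illustration. In the first scenario every student gains exactly 20 points; in the second the average gain is about the same, but it is distributed unevenly.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(scores):
    """Rank of each score within its list (1 = lowest); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


# Invented scores for six hypothetical students.
pretest = [30, 40, 50, 60, 70, 80]
uniform_gain = [p + 20 for p in pretest]   # everyone learns the same amount
uneven_gain = [75, 60, 80, 70, 85, 78]     # similar average gain, reshuffled

print("uniform:", ranks(uniform_gain), round(pearson_r(pretest, uniform_gain), 2))
print("uneven: ", ranks(uneven_gain), round(pearson_r(pretest, uneven_gain), 2))
```

With the uniform gain, every student has learned a lot and the correlation is still perfect, because nobody changed places. With the uneven gain the class average is about the same, but the lowest scorer on the pretest ends up in the middle of the pack and the correlation drops sharply.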

So rather than strong pretest-posttest correlations being desirable for power, I’m starting to see them as indicators of “non-adaptive instruction.” This means whatever is going on in the course, it’s not affecting the relative ranking of students; put another way, it’s affecting each student’s learning somewhat consistently. Again, it doesn’t mean they’re not learning, just that they’re still distributed similarly relative to one another. I’m agnostic as to whether this constitutes a “problem” – that’s actually a pretty deep question I don’t want to dive into in this post.

I’m intrigued for many reasons by the concept of effective adaptive instruction – giving the bottom performers extra attention or resources so that they may not just close the gap but leap ahead of other students in the class. It’s really hard to find good examples of this in general education research – for better or worse, relative ranks on test scores are stubbornly persistent. It also means, however, that the standard statistical models we use are not accomplishing everything we want in courses where adaptive instruction is the norm. “Further research is needed” is music to the ears of one who makes a living conducting research. 🙂

I’m going to be writing up a more detailed and technical treatment of this over the next months and years, but I wanted to get an idea down on “paper” and this blog seemed like a good place to plant it. It may turn out that these interesting classes are not being adaptive at all – the low pretest-posttest correlations could be due to something else entirely. Time will tell.

I’m a data scientist!

At a recent Cyberlearning Research Summit organized by some colleagues, Bill Finzer proposed that we think about the Data Sciences as a distinct educational goal. Briefly, the data sciences exist at the intersection of three domains: math & statistics, a substantive area of knowledge (economics, psychology, education, etc.), and “hacking” or the ability to construct algorithmic solutions to problems.

What Bill described was essentially my current professional life. It usually takes me a couple of minutes to describe what I do when I meet someone at a party (I don’t have a simple job descriptor like “tax lawyer”). Now I can tell people “I’m a data scientist!” (Of course, they won’t know what that means, and it’ll take me two minutes to explain that, anyway.)

Still, I’ve felt like I’ve assembled an odd hybrid of skills and interests in my professional life, and I’m not eager to part with any of them.  I’ll take the bit of external validation I got from watching this video.  🙂

I haven’t been blogging much over the past months, but I have a collection of ideas starting to backlog and will be writing more frequently in the near future.  Stay tuned.

Virtual worlds and messy reality

My “hobby” life and professional life recently crossed paths in an interesting way. It started when the federal government announced a grant competition with one of the possible research topics involving robotics competitions. (For those not familiar with how research is funded, this isn’t as wacky as it sounds. There are lots of programs in the federal government that hold annual competitions in a broad variety of areas. The specification of focal areas is how the government – and we the taxpayers – have some assurance that research conducted with federal dollars will be important and/or useful. As I recall, the robotics topic was part of a larger program that covers innovative uses of technology in education. Robotics competitions are gaining in popularity, and there is considerable interest in their impact on future science and technology interests of the flesh-and-blood participants.)

During a meeting with colleagues we brainstormed some possibly interesting areas of research that would respond to the spirit of the grant. Two ideas in particular were notable for their contrast. One was (broadly) the question of what is gained from having a lot of practical, hands-on experience with mechanical systems. Real robots break and have problems with tolerances; builders need to respect the limits of materials, fasteners, and the laws of physics. Whereas in the 1950s teenagers tinkered with cars after school, nowadays robots are the equivalent pastime for many students.

The second idea had to do with programming and simulation. Robotics also involves control and planning. In many competitions, the robots have to solve tasks or navigate obstacles without any human intervention. This can require considerable programming prowess to execute elegantly. One colleague (who was an advisor to his son’s team) said kids’ programming tends to be a batch of spaghetti code – long lists of instructions and contingencies sort of hacked together to get the job done.

A colleague pointed out that if we care about kids learning the control/automation side of robotics, then the “messiness” of the physical machines often gets in the way. It’s hard enough to devise an intelligent algorithm for navigating obstacles without also worrying what happens when a wheel inadvertently jams up.  One could imagine kids being overloaded with the frustration of learning to program AND having to deal with clunky hardware (these robots aren’t being designed by engineers with graduate degrees, remember). So he wondered whether a “virtual robotics competition” – where the robots were just simulated avatars a la Second Life – would be an interesting case to study.

On the flip side, others felt that learning about the “messiness” of physical systems, how to improvise solutions, plan for contingencies, etc., were equally valuable lessons, perhaps more important than learning elegant programming habits. Having gotten my start in software engineering, and now being very interested in “learning with the hands,” I could see both sides of this argument. Dealing with physical systems can be very frustrating at times; that was one of the appeals of the “virtual world” when I started in computer science. On the other hand, we live in a physical world, and I wonder what is lost when kids don’t get a lot of experience just interacting with the (non-mediated) world as they grow up.

My thinking is that if you want to teach programming, then teach programming, with or without robotic avatars. Just as we teach Newtonian physics in high school with an emphasis on theoretical models (mechanical systems operating in airless vacuums using weightless strings and pulleys, for example), one could imagine teaching the fundamentals of programming with reference to “ideal” robots or objects.

But to me, there is something special about tinkering with physical systems. I can’t put my finger on it exactly, but I feel like there are some valuable lessons in there, some of which are shared with the programming world (perseverance in the face of failure and frustration; the need for careful planning; problem decomposition, etc.), but others which are entirely separate from virtual spaces (namely, how gears work, what friction “feels like” on different surfaces, the strengths and limitations of motors, etc.)  Just writing these down, I feel a big “so what” question looming – do we really care that youth gain facility with building drive trains? It’s more than that – it’s a “feel” for mechanical systems. Again, I’m at a loss for words. Maybe I’m just being sentimental. But I know I’m not alone in this. Others have been writing at some length on the need to re-integrate the hands into educational experience (e.g., Doug Stowe’s Wisdom of the Hands blog), and some have designed engineering curricula appropriate for elementary school (e.g., Engineering is Elementary).

Actually, I can think of one lesson that differentiates the physical from the virtual – I’ve written about this in the past. The physical world does not have an “Undo” button. Mistakes have consequences. A piece of bad code can be erased and revised in the blink of an eye, but a badly assembled drive train can mean a week of wasted effort.

A confluence of interests

By day I’m an educational researcher studying different facets of how kids and adults learn new things. By night (well, depending on the season) I’m a craftsperson, and my current medium is wood. Over the past few months I’ve become aware of places and writers who touch both of these interests in significant ways. I’m intrigued by the possibility of combining the two professionally – that is, studying how learning “hand work” impacts people.

It started when I became aware of the Wisdom of the Hands blog (written by Doug Stowe). Doug Stowe is a professional woodworker and educator. He teaches at the Clear Spring School, where, to quote from the school’s web site,

Since 1974 Clear Spring School, in Eureka Springs, Arkansas, has thrived on the educational principle that engagement results in learning. The school’s hands-on curriculum for pre-primary through 12th grade is based on proven Clear Spring traditions blending core subjects, camping, community service, travel, woodshop, environmental education, and conflict resolution.

Oh my!  A curriculum that integrates camping, wood shop, and conflict resolution along with core subjects?  In this era of No Child Left Behind, how could such a school exist? (Well, it’s an independent private school, for starters).  Seriously, Doug Stowe is quite the advocate for integrating the manual arts into the core educational curriculum.  He’s done research on the 19th century Swedish educational philosophy known as Sloyd. From the Wikipedia entry,

Sloyd differed from other forms of manual training in its adherence to a set of distinct pedagogical principles. These were: that instruction should move from the known to the unknown, from the easy to the more difficult, from the simple to the more complex, from the concrete to the abstract and the products made in sloyd should be practical in nature and build the relationship between home and school. Sloyd, unlike its major rival, “the Russian system” promoted by Victor Della Vos, was designed for general rather than vocational education.

After reading about Sloyd and the Clear Spring School, the researcher in me wants to know what the impact of a Sloyd-style education is on youth development. Certainly we have lots of anecdotal evidence of positive impact from promoters like Doug Stowe. What would it take, I wonder, to document these impacts using the rigor of current social science research methods?

So I’ve been pondering how to put together a research program (in particular, funding) to study the impact of “hands on” education. Meanwhile, a wonderful excerpt from a book appears in the New York Times Magazine – Shop Class as Soulcraft, by Matthew Crawford. Crawford has a PhD in philosophy from the University of Chicago, but makes his bread and butter running a small motorcycle repair shop. Huh? Why would he do such a thing? I’ve ordered his book, and it’s next on my reading list. It appears that he’s turned his philosopher’s mind to analyzing just why “the trades” have fallen into such disrepute among the intellectual elite over the years. Clearly, an impartial empirical examination of, say, the work of a master motorcycle mechanic reveals all sorts of high level cognitive abilities in diagnosis, planning, visualization, etc.

No sooner have I heard of Crawford’s book than another book by Mike Rose crosses my path: The Mind at Work. In it, Rose documents his research as he followed student carpenters, plumbers, and hair stylists on their educational trajectories, noting when and how considerable “intelligence” was called for in those tasks.  While other scholars criticized the traditional American definition of “intelligence,” Rose brings this argument home to a lay audience not necessarily versed in the history of psychological research.

Now I’m definitely intrigued.  To tie it all together, a few of us at the Center for Technology in Learning are starting to systematically plan a research program to document and assess the impact of “informal” learning environments, particularly in after school settings. Historically, CTL has studied the integration of “high tech” innovations in education. I’d argue that learning to use a Sloyd knife well counts as educational technology, and can be studied similarly. Stay tuned as we put our thoughts together – I hope to document our progress in this blog.

Meanwhile, my work in the shop has taken a bit of a hiatus. This often happens during the summer months, when I can’t resist the long daylight hours to hop on my mountain bike after work and on the weekends. It’s just part of the seasonal ebb and flow of life.

Home sweet home

(photo: rough-cut back rails for chairs waiting to be sanded and finished)

Just got back from a business trip to Chicago. Business trips aren’t like vacation travel – I’m on someone else’s schedule and agenda, and often don’t have the budget to stay an extra day to sight see. Plus, I was really worried I’d get snowed in – my outbound flight was delayed nearly 2 hours. It’s good to be home!

I helped give a one-day workshop on educational test design for researchers. Not the large-scale standardized stuff kids are subjected to – this was about the careful craftsmanship of focused assessments on particular topics. I also attended a few sessions (including another one I gave a paper at) where the general theme was “we all know No Child Left Behind-mandated tests as they currently exist aren’t measuring much that we care about… but what would it take to do better?” I’ve often believed (and stated) that we don’t assess what’s important, we assess what’s cheap and easy. Some of the empirical research is bearing that out – we don’t even assess our own state standards particularly well. Or rather, the state doesn’t assess these things well – teachers (more or less) are constantly assessing the on-going learning of students in order to guide instruction, and isn’t that what counts most? (OK, off my soap box for now…)

Funny, I never saw the parallel before – when I do get involved in assessment (or research) design it’s a lot like my woodworking. I like to be careful, artistic, and take pride in the product… hmmm… something to sit with…

One good thing about getting away for a few days is gaining some perspective on things. I was reading The Places That Scare You by Pema Chodron (an American Buddhist teacher) while on this trip. It’s hard to describe the experience on a blog, but essentially it was about cultivating equanimity and taming our demons and the “stories” we often overlay over our direct experiences. In one sense, it was like being constantly reminded of some essential Truths I knew deep down inside, but have habitually forgotten. In fact, I think it’s a little like going to Sunday church services – it’s not like the preacher says anything we haven’t heard a million times before, but it’s helpful to be reminded, and for that brief period of engagement to have our attention focused on these essential ideas.

And babies, babies everywhere! So many of my friends/colleagues are parents now! For roughly the past 10 years I’d envisioned a future without children (long story – started as a joint decision with a partner, and never really changed after that relationship ended). Now, in my (very) early 40’s, I wonder… I still come back to my baseline feeling: an absence of desire. Not an active aversion to having children, just an absence of strong desire. And I feel like you have to really want to be a parent / have a family; it’s not something that should just “happen” along the way. But it’s interesting being part of essentially the first generation of people who’ve grown up completely in control of their own fertility (between widespread availability of contraception and the right to terminate pregnancies). There isn’t a long cultural history that supports having children as an active choice rather than a matter of course. We’re charting our own course in relatively new waters.

Yikes! Just looked at the clock, but it’s still set to Chicago time. Phew! Nonetheless, time to get to bed.