Arroyo Research Services Blog

The best educational measurement strategy is the one that fits your specific circumstance, not the hottest method of the day. And not necessarily the one that pols believe is singularly able to deliver “real, hard data.”

The What Works Clearinghouse (WWC) at the Institute of Education Sciences (IES), for example, reserves its highest rating of confidence for studies based on well-implemented Randomized Controlled Trials (RCTs), arguably the gold standard in evaluation. RCTs are not only credible but are presumed to be top-of-the-line in eliminating bias and uniquely capable of surfacing replicable results (assuming fidelity to the original model).

Much has been written in the past year about whether RCTs deserve the vaunted status they’ve been assigned (see, for instance, the debate between Lisbeth Schorr and the Center for the Study of Social Policy’s “Friends of Evidence” group on one side, and Patrick Lester of the Social Innovation Research Center on the other).

We agree that it’s important to measure what you pay for in order to suss out what works: after all, we’re evaluators. But we also know from our own work that no single methodology can capture the complexities of program implementation in social sectors like education, nor is there one that can wholly eliminate bias. So it doesn’t surprise us that a 2016 American Enterprise Institute analysis of 27 RCT mathematics studies meeting WWC standards identified 12 non-selection-bias threats that are “not neutralized by randomization of students between the intervention and comparison groups, and when present, studies yield unreliable and biased outcomes inconsistent with the ‘gold standard’ designation.” (Twenty-six of the 27 studies had multiple threats to their usefulness, and the magnitude of the error generated by even a single threat frequently exceeded the average effect size uncovered by the RCT.)

Jennifer Brooks, writing in SSIR, believes we haven’t gotten enough traction for evidence-based decision-making among the powers that be, and she argues that we’re “wasting time” debating the primacy of RCTs instead of selling decision-makers on the value of measurement generally. Brooks also suggests that we’re doing so “without ever asking what problems decision-makers are trying to solve.”

That, to us, is the ultimate caveat for educational measurement. Our experience with RCTs (or any methodology) is that they can indeed make a strong case as long as they are strategically applied at the right time to the right research question(s). Often, it’s not the right time for an RCT, which is not to say it isn’t the right time to set targets, measure implementation, foster continuous improvement, or identify preliminary program outcomes. Finding the right tool for those tasks requires understanding the questions being asked, the challenges at hand, and the intervention selected to address them.

Sometimes, before we can even consider selecting appropriate measurement strategies, the intervention itself bears review. Where does it sit relative to peer interventions that aim for similar outcomes and have already been studied? Getting a window on where an intervention fits into the landscape of system-wide reform efforts provides insight into how comparable programs structure and support their work to achieve success, what the current program’s leaders can learn from others, and where the intervention may make a unique contribution.

In addition, we’ve seen that implementation itself frequently plays a role in whether outcomes are achieved. When an intervention does not self-implement but depends on adoption by school leaders, educators, students, and communities and families, it is helpful to identify implementation differences between more and less successful contexts. Knowing how a program works in practice, and which implementation measures indicate likely success, can point toward the early positive results that may later be confirmed by more rigorous methodologies. In the meantime, a qualitative exploration of those differences (through observations and site visits, interviews, focus groups, review of local documents, etc.) can produce formative feedback and suggest process improvements in real time.

Once a client knows where they are, so to speak, with their program, a roadmap for what should be measured at each stage can be developed, and baseline performance data can be gathered and analyzed. We suggest creating a logic model that shows how program components are expected to lead to specific outcomes, along with a measurement framework that articulates both implementation and outcomes, such as changes in student and teacher performance, practices, beliefs, and attitudes. Expanded data collection instrumentation is often required at this stage to standardize collection procedures and control for key variables.

In the end, we may or may not recommend an RCT to our clients. But for real leaders in real educational contexts, questions about the usefulness of RCTs relative to other forms of evidence are simply not the right questions most of the time. To ensure you’ve picked the right tool for the job, first get a solid handle on what the job is.
