When asked publicly or privately about high-stakes assessments for teachers and schools, we always say the same thing: don’t go there. Using value-added models based on student test scores to reward or punish teachers misdiagnoses educator motivation, guides educators away from good assessment practices, and unnecessarily exposes them to technical and human testing uncertainties. Now, to be clear, we do use and value standardized tests in our work. But here’s a 10,000-foot view of why we advise against the high-stakes use of value-added models in educator assessments:
When Wayne Craig, then Regional Director of the Department of Education and Early Childhood Development for Northern Melbourne, Australia, sought to drive school improvement in his historically underperforming district, he focused on building teachers’ intrinsic motivation rather than on external carrots and sticks. His framework for Curiosity and Powerful Learning presented a matrix of theories of action that connect teacher actions to learning outcomes. Data informs the research that frames core practices, which then drive teacher inquiry and adoption. The entire enterprise is built on unlocking teacher motivation and teachers’ desire to meet the needs of their students.
Value-added models tied to teacher incentive pay do the opposite: they use extrinsic motivators to drive human behavior. Drawing on 51 separate studies of incentive pay, a London School of Economics performance pay analysis found that “financial incentives may indeed reduce intrinsic motivation and diminish ethical or other reasons for complying with workplace social norms such as fairness [and] can result in a negative impact on overall performance.” And educators are demonstrably less likely than members of other professions to be driven by pay. We could go on and on…
Mathematica’s 2010 study of value-added modeling for the US Department of Education found that models using seven different nationally adopted standardized assessments were subject to “a considerable degree of random error when based on the amount of data that are typically used in practice,” which would mean that, “in a typical performance measurement system, more than 1 in 4 teachers who are truly average in performance will be erroneously identified for special treatment, and more than 1 in 4 teachers who differ from average performance by 3 months of student learning in math or 4 months in reading will be overlooked.”
Yikes! And stepping up to 10 years of data only cuts those error rates in half.
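To make the mechanism concrete, here is a toy simulation of how estimation noise alone misclassifies teachers under a simple flagging rule. The threshold, the noise level, and the normal distributions are our own illustrative assumptions, not the Mathematica study’s methodology or parameters:

```python
# Toy illustration of misclassification from estimation noise alone.
# All numbers below are made-up, illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_draws = 1_000_000

# Arbitrary "effect size" units: suppose policy flags any teacher whose
# estimated effect sits more than 0.2 away from average, and suppose the
# estimation error from a few years of data has a standard deviation of 0.175.
threshold = 0.2
noise_sd = 0.175

# A teacher whose true effect is exactly average (0) still gets a noisy estimate.
avg_teacher_estimates = rng.normal(0.0, noise_sd, n_draws)
false_flag_rate = np.mean(np.abs(avg_teacher_estimates) > threshold)

# A teacher whose true effect (0.25) genuinely clears the threshold can still be missed.
strong_teacher_estimates = rng.normal(0.25, noise_sd, n_draws)
miss_rate = np.mean(np.abs(strong_teacher_estimates) <= threshold)

print(f"truly average teacher flagged anyway:      {false_flag_rate:.0%}")
print(f"teacher truly beyond the threshold missed: {miss_rate:.0%}")
```

With these made-up numbers, roughly a quarter of truly average teachers get flagged anyway, and a teacher whose true effect clears the bar is missed more than a third of the time. The exact figures depend entirely on the assumed noise, which is precisely the problem.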
Ok, but we use multiple measures, you counter. We still say that when one of your core measures is inherently unreliable, its use will reorient teacher motivation around test scores, sour educators on using testing for positive purposes and good teaching, and drain significant time from instruction and useful assessment.
And that’s setting aside the reliability of the assessments used for the roughly 70% of educators who do not teach subjects covered by statewide tests.
While the error rates above are driven by the measurement error inherent in any standardized test combined with real-world rates of missing data and student mobility, they don’t address the very real possibility of technical and human error in testing and calculation.
Not only have we witnessed the failure of live, multi-year stress tests of statewide testing systems, such as the CTB/McGraw-Hill-administered ISTEP in Indiana, but we also find examples of coding errors affecting performance pay calculations. Though not widely reported, such errors are familiar to anyone who works in assessment analysis every day; we’ve seen them even in our own work. Internal errors we’ve found in the past few years include: allowing an ID field to be set to 9 characters rather than 10, resulting in hundreds of duplicate records that were inappropriately merged; setting a date range for enrollment 1 day past opening day, thereby missing more than 150 students; and importing as numeric a variable that contained characters in a small number of cases, thereby failing to import about 2% of the available records.
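To make the flavor of those checks concrete, here is a minimal sketch of the kind of validation pass that catches such errors before they propagate into downstream calculations. The column names, the 10-character ID convention, and the function itself are hypothetical, not drawn from any particular system we work with:

```python
# Hypothetical validation pass over an imported student extract.
# Column names (student_id, enrollment_date, score_raw) are illustrative.
import pandas as pd

def validate_extract(df: pd.DataFrame, opening_day: str) -> list[str]:
    """Return warnings for the kinds of import and merge errors described above."""
    warnings = []

    # 1. ID width check: an ID field truncated to 9 characters instead of 10
    #    silently collapses distinct students into duplicate, merged records.
    bad_ids = (df["student_id"].astype(str).str.len() != 10).sum()
    if bad_ids:
        warnings.append(f"{bad_ids} records have IDs that are not 10 characters")

    # 2. Enrollment window check: a date filter set one day past opening day
    #    quietly drops students who enrolled on day one.
    earliest = pd.to_datetime(df["enrollment_date"]).min()
    if earliest > pd.Timestamp(opening_day):
        warnings.append(f"no enrollments on or before opening day ({opening_day}); check the date filter")

    # 3. Numeric import check: score values containing stray characters become
    #    missing when coerced to numbers, dropping records from the analysis.
    coerced = pd.to_numeric(df["score_raw"], errors="coerce")
    newly_missing = int(coerced.isna().sum() - df["score_raw"].isna().sum())
    if newly_missing:
        warnings.append(f"{newly_missing} score values failed numeric conversion")

    return warnings
```

Checks like these are cheap to write and run; the harder part is making sure a performance-pay pipeline runs them at all, and discloses what they find.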
We caught each of these errors in review and adjusted our internal procedures to address them in the future. But the point is that they are human errors in technical systems. And most would never see the light of day, even if discovered.
Now add to these technical errors the human errors of miscoding students, test administration lapses, and intentionally or unintentionally missed students, plus the unreliability of locally developed assessments, and you might reasonably decide that a system this fallible is not worthy of the stakes we attach to it.
Standardized testing plays a useful role in monitoring school performance, helping policy-makers understand the broad effects of adopted curricula and reform efforts, and, when reported promptly and in detail, informing educators about students’ individual and group progress. Value-added modeling of assessment results can also provide insight into the effectiveness of educational interventions and teacher practices; we have used it that way on multiple projects.
But the surest way to compromise the validity of any assessment is to use it for purposes beyond what its reliability can support. To then attach high stakes to its outcome is misguided at best.