Perfect Test Comparability Is a Sign of Failure
Why We Need to Rethink Innovative Assessment Comparability Requirements
It’s clear that state and district leaders are interested in innovating their assessment systems for a host of reasons, yet the Innovative Assessment Demonstration Authority (IADA) created by the Every Student Succeeds Act (ESSA) is struggling to take hold. Currently, only three states—North Carolina, Louisiana, and Massachusetts—are participating, down from five states just a couple of years ago. Why is that?
There are many reasons, but as I’ve written previously, comparability and scaling the innovation statewide are two of the biggest hurdles. It’s become increasingly clear that we’ve been thinking about test comparability the wrong way.
The Innovative Assessment Demonstration Authority
A quick review: the IADA, authorized under Section 1204 of ESSA, provides states with an opportunity to develop new approaches to assessment. To be clear, states do not need federal permission to reform their assessment systems, as long as they meet the rigorous technical requirements outlined in the law and regulations. But the IADA lets them try out new approaches in a subset of districts rather than immediately changing the entire state testing system—an important way to work out the bugs.
But participating states must still ensure their innovative systems meet four key requirements (among others):
- Assessment quality: The system must comprise high-quality assessments that support the calculation of valid and reliable scores.
- Comparability: The state must produce student-level annual determinations relative to the state’s grade-level content standards that are comparable to those derived from the state’s current statewide assessment results.
- Scaling statewide: The state must implement the innovative assessment system statewide within seven years.
- Demographic representativeness: The participating pilot districts must be demographically representative of the entire state.
What is Assessment Comparability?
Psychometricians obsess (rightfully so) over comparability because users typically want to interpret score changes as reflecting changes in student achievement, not differences in the test. Users compare test scores across two or more administrations, and they want a score of 240 on the 5th grade math test in 2017 and 2018, for example, to carry the same meaning.
While the term is familiar in the popular lexicon, comparability has a particular meaning for measurement experts. Like validity, it is a judgment based on an accumulation of evidence supporting claims about the meaning of test scores and about whether scores from two or more tests or assessment conditions can support the same interpretations and uses. (For a deep dive into comparability, see the National Academy of Education’s Comparability Issues in Large-Scale Assessment.)
Most important for my discussion here is this: Most comparability analyses in educational measurement focus on comparing scores or performance determinations from one test to those from another.
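To make that concrete, here is a minimal, purely illustrative sketch (in Python) of one common piece of such an analysis: checking how often two assessments place the same students in the same performance category, and how strongly their scores correlate. The scores and the proficiency cut below are invented for illustration, not drawn from any real assessment.

```python
# Toy illustration of a score-based comparability check between two assessments.
# All scores and the cut score are invented; requires Python 3.10+ for statistics.correlation.
from statistics import correlation

# Hypothetical scale scores for the same ten students on the legacy and innovative tests
legacy_scores = [232, 245, 218, 250, 239, 226, 248, 235, 221, 242]
innovative_scores = [228, 249, 224, 247, 241, 222, 252, 230, 219, 246]

PROFICIENT_CUT = 240  # hypothetical cut score, assumed shared across both score scales


def classify(score: int) -> str:
    """Collapse a scale score into a proficient / not-proficient determination."""
    return "proficient" if score >= PROFICIENT_CUT else "not proficient"


# Exact agreement rate: share of students receiving the same determination from both tests
agreement = sum(
    classify(a) == classify(b) for a, b in zip(legacy_scores, innovative_scores)
) / len(legacy_scores)

# Pearson correlation between the two sets of scale scores
r = correlation(legacy_scores, innovative_scores)

print(f"Classification agreement: {agreement:.0%}")
print(f"Score correlation: {r:.2f}")
```

In a traditional comparability study, very high agreement and a strong correlation would be reassuring. As we’ll see, though, for a genuinely innovative assessment, pushing those numbers toward perfection is the wrong goal.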
The U.S. Department of Education (ED) recently asked for comments on the IADA comparability requirements. This is good news because it suggests the agency recognizes the ongoing challenges associated with evaluating and ensuring comparability.
But it’s also kind of a déjà-vu-all-over-again situation. Back in 2016, my colleague Susan Lyons and I convened, on behalf of the Hewlett Foundation, the world’s leading experts on test comparability to provide recommendations to ED before the initial regulations were finalized. The expert panel noted many legitimate reasons for non-comparability between a state’s existing assessment and its innovative pilot, including the possibility that the innovative system:
- Measures the state-defined learning targets more efficiently (reduces testing time),
- Measures the learning targets more flexibly (when students are ready to demonstrate “mastery,” for instance),
- Measures the learning targets more deeply, or
- Measures the targets more completely (listening, speaking, extended research, scientific investigations, for example).
We were concerned back then that requiring high levels of comparability with the state test scores might limit innovation. In fact, Dr. Robert Brennan, one of the world’s leading measurement experts, noted: “Perfect agreement would be an indication of failure.” Thanks, Bob, for the title of this blog post.
Want Real Test Innovation? Rethink Comparability
Just recently, I was trying to explain to a congressional staffer what’s wrong with the IADA’s comparability requirements when it hit me: We’re aiming at the wrong target! If the innovative assessment has to generate results that match the existing state assessment, why go to the trouble of innovating? States apply for the IADA because they want to change their existing state assessment for many reasons, including the four I mentioned above. Requiring comparability to the legacy test acts as an anchor on innovation.
So what do I recommend? Anchor comparability to the standards instead. To be fair, traditional forms of comparability include analyses of how well two or more assessments are designed to measure the same content frameworks at the same depth, as reflected in test blueprints. While this may be only one step in typical psychometric comparability studies, it’s sufficient for the purposes of innovative assessments and would restore the flexibility to truly innovate.
I propose that innovators first create a rationale for how they’ll measure the state’s content standards, including whether they will prioritize certain knowledge and skills that best fit the goals of the innovation. Then users and developers must collect and provide evidence that the new assessment system measures the knowledge and skills in the ways that the designers intended. Like validity, comparability is evaluated based on the evidence provided for score interpretations and uses.
Equity: A Critical Element in Test Comparability
I’m not giving up on the equity reasons for designing tests that hold all students to comparable learning outcomes. Let me offer a few examples of broader, legitimate conceptions of comparability that we rely on all the time and that most users accept.
First, when the two general assessment consortia, PARCC and Smarter Balanced, were awarded federal funds to create new tests, nobody expected (rightfully) that the scores would be what measurement geeks call “interchangeable.” Further, nobody claimed a student scoring proficient on Smarter Balanced, for example, would automatically be considered proficient on PARCC. Rigorous alignment studies and other technical evaluations showed that both assessments were high-quality measures of how well students learned the Common Core State Standards.
Similarly, when states switch assessment programs, as they do a little too often for my taste, they are not required to demonstrate the comparability of the scores on the old and new assessments. Rather, both assessments are evaluated against ED’s standards and assessment peer review criteria.
We should hold innovative assessments to the same expectations we apply whenever we evolve assessments, and not tie them to the traditional test. This will free states to use their innovative systems to represent the standards more deeply, or differently, in ways that advance student learning. And advancing student learning—remember?—was the goal of the innovation program to begin with.