
Comparability Isn’t All or Nothing
How States and Districts Can Reimagine Comparability
The comparability of information from assessments is a core issue that cuts across the field of educational assessment and appears in many technical reports, frameworks, and guidelines. In our professional standards alone, it is mentioned more than 60 times, and it is discussed in detail in a specialized volume on the comparability of large-scale assessments.
Why Do We Seek Comparability?
At the heart of comparability is the simple desire to understand, as clearly as possible, student performance across contexts. This is done by considering key assessment properties such as the purpose, content, and administration conditions of assessments as well as the psychometric properties of the resulting quantitative information. Importantly, this includes students’ opportunities to learn and the conditions under which learning has occurred.
In this blog, I examine comparability from a particular angle: comparability in innovative performance assessment systems. This is timely because many states and districts are currently innovating within their educational programs and formal assessment and accountability systems. As one might imagine, understanding comparability turns out to be a wicked but manageable problem if approached thoughtfully and systematically.
Comparability Is Complex and Multifaceted
Comparability is fundamentally a multifaceted problem, not a simple yes-or-no issue. Yet, in conversations about reimagining assessment and accountability systems, I sometimes still hear people say things like, “If different districts have their own measures, nothing will be comparable,” or “Federal officials expect (strict) comparability, so if we don’t achieve it (using more traditional assessments), we can’t pursue this innovation.”
Neither of these extremes is generally true, and both set up straw-person arguments that can be easily knocked down. In essence, these kinds of starting points are unproductive. In fact, there are many ways to illustrate the multifaceted nature of comparability.
As a lover of the performing arts, I think of comparability as the product of intentionally “mixing” various design features within a principled assessment design approach, as instrumental and vocal tracks are mixed in music production. Each part has its own nuances, and the result is more than the sum of its parts.
Pathways for Exploring Comparability
Performance assessments are common elements of portfolios or capstone projects, especially when the goal is to develop 21st-century skills in line with portraits of learners or graduates.
To understand comparability more deeply—and keep it from being a detriment to innovative work in assessment and accountability—it is critical to first work through some guiding questions, such as:
- Do we want to compare students, schools, or some other unit?
- What purpose does the comparison serve for our problems of practice?
- What frameworks are available to help us make comparability judgments?
- Who needs to evaluate the quality of evidence about comparability?
- What risks exist if judgments about comparability are misguided?
These are riffs on similar questions posed by colleagues, and should be expanded and adapted as needed—whatever helps your team perform a critical analysis of comparability with practical design solutions in mind.
In addition, we can critically analyze different aspects of performance assessment design from the perspective of what the state and local districts are responsible for. Understanding the overall integrated comparability of information for a performance assessment requires considering these design decisions in conjunction—not an easy feat rhetorically or practically!
My colleague Chris Brandt and I recently did an illustrative, high-level walkthrough in which these design decisions are expressed as tight, relaxed, and loose guardrails for local districts. It includes an application to the example that I discuss next.
Comparability in Action
The following example illustrates how this might play out. It is inspired by research in an area known as epistemic games and quantitative ethnography. Imagine a scenario where a state offers standardized performance tasks in urban planning that align with state standards and are designed according to best practices in principled assessment design.
The tasks allow learners to collaborate while playing different roles (e.g., traffic planner, representative of the business community, budget office, or parks and recreation) on an urban redesign project. Learners have different information and responsibilities in a jigsaw task setup where there is no single, correct solution.
The geographical area in question can be localized through integration with, say, Google My Maps or other digital platforms where students can configure different areas and negotiate compromises.
Students’ spoken exchanges are then recorded via microphones, coded for meaning by human raters or AI tools, and visualized using tools from network analysis. A teacher acting as a project mentor, or AI-supported technology, guides the nature and flow of the collaborative interactions as students work toward a solution.
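To make the coding-and-visualization step a bit more concrete, here is a minimal, hypothetical sketch in Python of how coded utterances could be turned into a simple co-occurrence network of discourse codes, in the spirit of the network-analytic tools used in quantitative ethnography. The codes, the window size, and the data structure are illustrative assumptions rather than a description of any particular platform.

```python
from collections import Counter
from itertools import combinations

# Hypothetical coded transcript: each utterance carries the discourse
# codes assigned by human raters or an AI tagging tool.
utterances = [
    {"speaker": "traffic_planner", "codes": {"data_use", "tradeoffs"}},
    {"speaker": "business_rep",    "codes": {"stakeholder_needs"}},
    {"speaker": "budget_office",   "codes": {"tradeoffs", "constraints"}},
    {"speaker": "parks_and_rec",   "codes": {"stakeholder_needs", "compromise"}},
]

def code_cooccurrences(utterances, window=3):
    """Count how often pairs of codes appear together within a moving
    conversation window; the counts become edge weights in a network."""
    edge_weights = Counter()
    for start in range(len(utterances)):
        window_codes = set()
        for utterance in utterances[start:start + window]:
            window_codes |= utterance["codes"]
        for pair in combinations(sorted(window_codes), 2):
            edge_weights[pair] += 1
    return edge_weights

for (code_a, code_b), weight in sorted(code_cooccurrences(utterances).items()):
    print(f"{code_a} -- {code_b}: {weight}")
```

The resulting edge weights could then be passed to standard network visualization tools to show which ideas students connect as they negotiate a solution.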
Customizing the Approach
In this scenario, common task design templates, work examples, and evaluation rubrics are used to develop teacher guidance on expectations for student performance and feedback processes, as well as the visualization and scoring algorithms.
This overall approach can be customized for local use: Teachers can create a specific task variant in a particular district (e.g., a planning project in a locally relevant setting with locally meaningful constraints for the stakeholders).
They can also come together locally within their district, or regionally across districts via a local educational service agency, to review student work across task variants. They discuss what they consider evidence for, say, “emerging,” “competent,” and “exemplary” student performance and work backward from these examples to identify profile patterns on metrics that capture these distinctions. They can use best-in-class examples to build up a regional resource repository to celebrate the work.
How should we think about comparability with this type of performance task?
- Competency meanings are closely comparable given a shared task design, a common approach to feedback, and a common scoring approach for work products and conversational exchanges. This is facilitated by an identical core rubric, even as local variations and expansions of criteria and best practices are welcomed. Such local variations are intentional and are unlikely to affect the core assessment properties of the tasks per se.
- Expectations for and evidence of student performance are closely comparable as well, given the common rubric and social moderation procedures, although there will be differences in the effectiveness of the support that different teachers provide to students. There will also be variations depending on how the task is used: for example, primarily to support engagement and learning versus more formal, summative evaluation.
- Classifications of students and dynamic visualizations of student behaviors as summative outcomes are partially comparable across districts. Stricter comparisons of the classifications are possible, but only if independent ratings of the same student work by different teachers are collected as part of calibration sessions (a simple sketch of such an agreement check follows this list).
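To illustrate what evidence from such calibration sessions might look like, here is a minimal, hypothetical sketch that computes two simple agreement statistics (exact agreement and Cohen’s kappa) for a pair of teachers who independently classified the same set of work samples. The labels and ratings are invented for illustration.

```python
from collections import Counter

# Hypothetical independent classifications of the same ten work samples
# by two teachers during a calibration session.
teacher_a = ["emerging", "competent", "competent", "exemplary", "emerging",
             "competent", "exemplary", "competent", "emerging", "competent"]
teacher_b = ["emerging", "competent", "emerging", "exemplary", "emerging",
             "competent", "competent", "competent", "emerging", "competent"]

def exact_agreement(r1, r2):
    """Proportion of work samples given the same label by both raters."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Chance-corrected agreement between two raters."""
    n = len(r1)
    observed = exact_agreement(r1, r2)
    counts1, counts2 = Counter(r1), Counter(r2)
    expected = sum(counts1[c] * counts2[c] for c in set(r1) | set(r2)) / n**2
    return (observed - expected) / (1 - expected)

print(f"Exact agreement: {exact_agreement(teacher_a, teacher_b):.2f}")
print(f"Cohen's kappa:   {cohens_kappa(teacher_a, teacher_b):.2f}")
```

In practice, a team might examine such statistics across many rater pairs and task variants before drawing conclusions about the comparability of classifications.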
In short, the overall design, implementation, and scoring of the task set up broader lanes or guardrails for comparability. Documenting how key design decisions were made across implementations of these performance tasks can help others understand how to make sense of student performance across local contexts. Comparability certainly is not a yes-or-no issue; it can be understood in a principled, evidence-based fashion.
As this brief example has illustrated, there is no reason to shy away from comparability considerations for technical reasons or to dismiss the desire for comparability as unduly restrictive. To pick up the earlier music metaphor, the core idea is to develop an appropriate mix of design decisions that can create learning and assessment experiences that resonate and inspire!