The importance of fully executing the test development process.

Part 1: Measure Twice, Cut Once When Developing Large-Scale Assessments 

Dec 06, 2022

Don’t Cut Corners in the Test Development Process 

In this two-part blog series, I address the problems I see with the disturbing trend of cutting corners in the test development process for large-scale assessments, and I offer some guidance for avoiding the pitfalls this practice can create.

In this post, I describe the test development process and explain why it matters. In Part 2, I’ll review the timeline for test development and deployment, emphasizing the important connections to shifts in curriculum and instruction.

Recently, a colleague shared the following image for a bit of levity.  

While the humor wasn’t lost on me, I also appreciated the broader message it conveys. It’s a reminder that careful preparation at the start of almost any endeavor can save a lot of hassle correcting problems later.  

That advice certainly applies to the field of large-scale assessment, where careful preparation is essential. Unfortunately, I’ve observed that some programs are taking risky shortcuts. For example, I’ve noticed the phrase “operational field test” being used more frequently in the test development process. This term is meant to describe the practice of trying out items and scoring them for consequential purposes in the same administration. Frankly, I think “operational field test” is an oxymoron. Either items are administered to try them out or they count – we shouldn’t pretend we can responsibly do both at the same time. 

The Test Development Process

Developing large-scale tests, such as the ones that state education agencies develop to fulfill the requirements of the Every Student Succeeds Act, involves a number of important steps. I’m focusing specifically on procedures that apply when new items must be developed. The process is different when a new test is built from content that has already been through a thorough development process, which may occur if a state has an existing item bank from previous development work or licenses content from a vendor.

To be clear, there is no single uniform process for test development, but strong approaches share similar procedures, including:

  • Planning: Identify the content and objectives of the assessment, including the primary purposes, uses, claims, and intended interpretations. 
  • Initial Development: Develop item and form specifications.
  • Item Writing: Write items to match measurement objectives and specifications.
  • Initial Review and Pilot: Empanel multiple teams of experts and specialists to review items for areas such as alignment, bias, and sensitivity. As appropriate, conduct small-scale interaction and pilot studies to gather more information about whether the items function as intended.
  • Field Testing: Administer the candidate items to broad groups of examinees that represent the full breadth of the target population to gather more data about the items.
  • Data Review: Have panels of experts and specialists review the items again, this time with the benefit of field test data, to make more informed decisions about the appropriateness of the items.
  • Form Construction: Have assessment experts create test forms or, for adaptive tests, presentation algorithms, and carefully evaluate them against technical, qualitative, and operational criteria.
  • Administration and Scoring: Deliver the test in the first administration that “counts” (also called an operational test) and score the results.
  • Final Review and Standard Setting: Review the results and establish performance standards.
  • Reporting: Distribute individual and summary test results.

Some programs will need additional steps. For example, an assessment created for special populations or one that is measuring a novel construct may require more steps to develop and evaluate specifications before initial review and piloting.
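
To make the field testing and data review steps a little more concrete, here is a minimal Python sketch of the kind of item statistics a review panel might look at. It is an illustration under stated assumptions, not a description of any particular program's procedure: the function names (`item_statistics`, `flag_items`) are hypothetical, the items are assumed to be scored 0 or 1, and the flagging thresholds are placeholder values chosen only for the example. It computes each item's proportion correct (often called item difficulty) and the correlation between the item score and the score on the remaining items (one common index of discrimination).

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_statistics(responses):
    """Return (difficulty, discrimination) for each item.

    responses: one list per examinee, each holding 0/1 scores per item.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    stats = []
    for i in range(n_items):
        scores = [row[i] for row in responses]
        difficulty = mean(scores)  # proportion of examinees answering correctly
        # Correlate the item with the score on all OTHER items so the item
        # does not inflate its own discrimination value.
        rest = [total - score for total, score in zip(totals, scores)]
        stats.append((difficulty, pearson(scores, rest)))
    return stats


def flag_items(stats, min_difficulty=0.10, max_difficulty=0.95,
               min_discrimination=0.20):
    """Return indices of items a panel may want to look at more closely.

    The default thresholds are illustrative assumptions, not standards.
    """
    flagged = []
    for index, (difficulty, discrimination) in enumerate(stats):
        too_hard_or_easy = not (min_difficulty <= difficulty <= max_difficulty)
        if too_hard_or_easy or discrimination < min_discrimination:
            flagged.append(index)
    return flagged


if __name__ == "__main__":
    # Made-up field test data: 5 examinees, 4 items.
    field_test = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
    ]
    print(flag_items(item_statistics(field_test)))
```

A flag in a sketch like this would only be a prompt for expert judgment. The panels described above still decide whether an item is appropriate; the statistics simply give them more information than they had at initial review.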

What’s the Risk in Skipping Steps in the Development Process? 

It may be tempting to skip important steps to produce something faster or cheaper, but that practice almost always leads to an unintended negative consequence later. For example, cutting corners on item development may lead to items of poor quality that are not well aligned to the standards or do not work well for all student groups. Weak alignment sends mixed signals to educators about instructional priorities and also produces assessment results that do not accurately represent student achievement with respect to state expectations.  

Cutting corners can also lead to a host of practical problems throughout the assessment cycle. For example, test items may not display appropriately on all devices, accessibility features may not be available or work as intended, the delivery system may go down during testing, or score reports may contain errors. These are just a few of the operational failures state programs have experienced in recent years.

Measure Twice to Avoid Preventable and Costly Mistakes

A testing program may be very well designed and technically strong, but if administration and reporting are flawed, none of it matters. It’s like going to a restaurant and discovering health code violations. It will not matter how talented the chefs are; you’re unlikely to order a meal if the utensils aren’t clean.     

At a time when public support for large-scale testing is already low, test administration failures greatly magnify mistrust. In fact, there really is no such thing as a ‘small failure.’ Even one bad item among many hundreds or one brief outage during online test administration is enough to erode confidence and raise questions about the integrity of the entire program.  

Finally, fixing mistakes on the back end almost always lengthens the timeline or raises the cost beyond what doing it correctly from the start would have required.

What does it look like to “measure twice”? That is, how do we ensure the test development process builds in careful preparation to minimize the chances of problems later? I address this question in Part 2 of this series.
