Balancing Skepticism and Utility in Machine Scoring
Understanding How Machine Scoring Can Be Used Now, and What We Need to Do to Expand Its Usefulness for the Future
Without a doubt, the public is skeptical about using machine scoring for examinees’ written responses. This skepticism makes sense because we know that machines do not score all elements of writing equally well. Machines do not “understand” creativity, irony, humor, allegory, and other literary techniques, which opens them to criticism that they cannot adequately evaluate these more subtle qualities of writing.
But does a machine’s inability to understand all possible nuances in writing represent a fatal flaw that invalidates any uses of machine scoring for a standardized assessment? It is probably fair to say that such a conclusion over-generalizes current limitations in machine scoring because there are many features of writing that machines are well-suited to score. Machines have routinely been shown to meet (and even exceed!) human score quality when it comes to the more structural aspects of writing such as grammar, mechanics, and usage, and there have been significant advancements in scoring content as well.
The Past and Present – Focusing on What Works and Why
Although techniques for machine scoring of constructed response items are generally considered less advanced than those for essay scoring, their feasibility and use have increased in recent years due to two important factors:
- not all constructed response items demand the features that are difficult to score, and
- current machine scoring approaches often include a human assist for scoring unexpected or highly unusual responses.
Imagine an item that requires examinees to study a table of facts related to various energy sources such as wind, solar, hydro, and fossil fuels. They then have to use those facts to choose and defend one source of energy for a particular purpose.
In this scenario, the more complete the argument, the higher the score. A high-scoring response requires concrete elements that use the facts to support claims about the energy sources discussed in the prompt. However, some examinee responses may also contain persuasive features meant to express personal beliefs (not facts) through techniques such as symbolism or humor, which might influence a human rater but which a machine might (appropriately) determine to be off-topic. Since the objective of the task is to know how accurately and completely an examinee can make an argument based on a set of facts, is a machine’s inability to measure the more esoteric features of personal commentary a meaningful limitation?
In this context, we might even conclude that machine scores provide an advantage. A human rater who is impressed by elements of a response that were not required by the scoring rules, or that may align with the rater’s belief system, may find it tempting to overlook required features that are missing from an examinee’s response. This scenario is rater bias in one of its most subtly unfair forms.
The Future – Research Agendas to Advance Utility
So, how can we move beyond present limitations and expand the utility of automated scoring? The short answer is that we can improve utility by improving score quality.
Improving score quality involves both statistical elements and a validity component. Because most machines are trained on human scores, high-quality human scores are a necessary condition for improving and evaluating the utility of a machine’s scores. In human scoring processes:
- raters are presumed to directly align features of the examinee’s response with the scoring rules,
- training procedures are used to mitigate possible sources of bias and inconsistency, and
- inter-rater reliability measures are used to assess final score quality.
To some extent, we rely on faith that these procedures adequately control and detect the kinds of bias and inconsistency in human scores that we might care about.
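To make the inter-rater reliability piece concrete, here is a minimal sketch (in Python, using scikit-learn; the scores are invented for illustration, and the 0–4 scale is an assumption) of how agreement between two trained human raters might be quantified with quadratic weighted kappa, a statistic commonly reported for ordinal rubric scores.

```python
# Minimal sketch: quantifying inter-rater agreement on ordinal rubric scores.
# Assumes two human raters scored the same ten responses on a 0-4 scale;
# the scores below are invented for illustration, not from any real study.
from sklearn.metrics import cohen_kappa_score

rater_a = [3, 2, 4, 1, 0, 3, 2, 4, 3, 1]
rater_b = [3, 3, 4, 1, 1, 2, 2, 4, 3, 0]

# Quadratic weighting penalizes large disagreements more heavily than
# adjacent-score disagreements, which suits ordinal rubric scales.
qwk = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

print(f"Quadratic weighted kappa: {qwk:.2f}")
print(f"Exact agreement rate:     {exact_agreement:.2f}")
```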
Machines are subject to the same statistical measures of quality, but they typically do not align response and rubric features directly to produce a score. Rather, machines use human scores to define the response features (and their weighting) that best approximate human scores. This difference is the second source of skepticism about machine scoring, and one that concerns testing experts on a more fundamental level than criticisms related to the types of items a machine can score.
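To illustrate that difference, here is a hypothetical sketch (word n-gram features feeding a ridge regression via scikit-learn; the responses and scores are invented): the model learns whatever weighting of surface features best reproduces the human training scores, rather than applying the rubric directly.

```python
# Hypothetical sketch of training a machine scorer on human scores.
# The features (word n-grams) and model (ridge regression) are illustrative
# choices, not a description of any particular scoring engine.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

responses = [
    "Wind power is best because the table shows the lowest cost per kilowatt hour.",
    "Solar is good and I like it a lot.",
    "Fossil fuels are reliable, but the table shows higher emissions than hydro.",
]
human_scores = [4, 1, 3]  # invented labels from trained human raters

# The pipeline converts text to weighted n-gram features, then fits weights
# that best approximate the human scores.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(responses, human_scores)

# The prediction reflects learned feature weights; the machine never
# consults the rubric itself.
print(model.predict(["Hydro is cheap and low emission according to the table."]))
```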
Is Different Necessarily Wrong?
The reason we are skeptical is that we implicitly trust humans, particularly humans with a background in the content being assessed and whom we have trained to score a specific item. Further, we know that the way a machine approaches scoring differs from the way our trusted, trained humans score. However, if we accept that we do not fully understand how humans are scoring, is different necessarily wrong? The dependency of machine scores on human scores demands that we answer this question with explicit evidence of the validity of the human scores used to train and validate machines. It follows, then, that machines might help us better understand which features humans are scoring, thereby strengthening score validity arguments for both human and machine scoring.
One practical way of improving the utility of machine scoring is to develop robust automated raters that remain accurate as items and examinee populations change. What does that mean? Currently, the most accurate automated raters are those that have been trained on a specific item or task and a specific sample of examinees. The responses of that sample, however, may or may not accurately represent the target population of examinees, or the performance of the target population may change over the lifespan of the item. What researchers have learned is that score quality degrades when examinee score distributions change, and that a machine trained to score one item often does not generalize well to other items without additional training and validation. This gap essentially means that every item is its own research project, an approach that budgets and implementation timelines do not support well.
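One way to picture the generalization check this implies: compare machine-human agreement on the sample the engine was trained for against agreement on a shifted examinee sample or a new item. The sketch below (invented scores; quadratic weighted kappa again) shows the kind of before-and-after comparison a validation study might run.

```python
# Sketch of a robustness check: does machine-human agreement hold up when the
# examinee score distribution shifts? All scores are invented placeholders.
from sklearn.metrics import cohen_kappa_score

human_original   = [4, 3, 3, 2, 4, 1, 3, 2]
machine_original = [4, 3, 2, 2, 4, 1, 3, 3]

human_shifted    = [1, 2, 1, 0, 2, 1, 0, 2]  # e.g., a lower-performing population
machine_shifted  = [2, 3, 2, 1, 3, 2, 1, 3]  # machine drifts high after the shift

for label, h, m in [("original sample", human_original, machine_original),
                    ("shifted sample ", human_shifted, machine_shifted)]:
    qwk = cohen_kappa_score(h, m, weights="quadratic")
    print(f"{label}: quadratic weighted kappa = {qwk:.2f}")
```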
Finally, for the purpose of using machine scores more widely in standardized assessments, we will need to better understand and account for the psychometric impact on overall score scales. As we expect score quality to be at least incrementally improved over time, what might that mean for the accuracy and comparability of test results? Simulation studies show the possibility of degradation in comparability over time, but empirical research on questions of psychometric impact is needed to fully support validity arguments for large scale assessment uses of machine scores.
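As a toy illustration only (the drift mechanism and the numbers are assumptions, not results from the simulation literature), the sketch below shows how a small, undetected scoring bias carried through year-to-year chained linking could accumulate into scale drift that erodes comparability.

```python
# Toy simulation under an assumed mechanism: each year's form is linked to the
# previous year's scores, which carry a small undetected machine-scoring bias,
# so the error compounds across the chain of administrations.
import numpy as np

rng = np.random.default_rng(0)
true_mean, sd, n = 200.0, 10.0, 5000   # stable population on the reporting scale
per_admin_bias = 0.4                   # small assumed bias per administration
cumulative_drift = 0.0                 # linking error carried forward each year

for year in range(1, 6):
    observed = rng.normal(true_mean, sd, n) + per_admin_bias
    cumulative_drift += observed.mean() - true_mean  # bias absorbed into the link
    print(f"Year {year}: reported scale drift = {cumulative_drift:+.1f} points")
```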
Machine scoring is not going away; it is growing in use and popularity. With this growth, we must learn how to get the most from the technique, ensuring both that it produces accurate and actionable score results and that it is accepted as a sufficient stand-in for human raters.