Computers will now score essays and other open-ended questions on the State of Texas Assessments of Academic Readiness (STAAR), at least in part. The change concerns district leaders and teachers, who worry about the accuracy of the new format.

The Texas Education Agency already used computers to grade the tests of students retaking STAAR in December. District leaders will review tests scored as zero next week, a first look at how virtual grading compares to grading by humans.

TEA officials have stressed that the system is trained and overseen by humans and has been shown in field tests to score similarly to human graders.

Computer scoring vs. AI

Officials with the Texas Education Agency told the San Antonio Report Thursday that the new technology was necessitated by the redesigned STAAR, which introduced free-response questions in every subject and grade level to align what is tested with what is taught in the classroom.

With seven times more constructed response questions than in past years, the agency anticipated that maintaining full human scoring would cost $15 million to $20 million. Constructed response questions allow students to write out answers in short or essay format, as opposed to the long-used standard of multiple-choice questions.

In communications with districts and the public, the TEA has said that the scoring system differs from generative artificial intelligence.

Chris Rozunick, the director of the state’s assessment development division, said the technology is a far cry from that behind ChatGPT, an AI system criticized for producing inaccurate and unreliable responses, including when grading essays.

“Within generative AI you have a main hub where it learns and it can expand and continue to grow in ways to continue growing knowledge,” she said. “But the specific way we are using our technology is that our engine is trained only on each specific prompt and each specific item that the students see.”

Rozunick said that on any given question, the system can only score based on the information for that specific item. 

“There’s no way for the engine to generalize or do other prompts,” she added. “So in many ways, you might be able to see the difference between those two concepts.”

But Schertz-Cibolo-Universal City Independent School District officials cast doubt on that characterization during a meeting this week.

One trustee said of the system: “If it walks like a duck, and it quacks like a duck…” 

Kelly Kovacs, the chief academic officer, also questioned the characterization.

“It’s hard to imagine how it’s not artificial intelligence, but I think they’re saying that because there is that human backup plan,” she said.

Hybrid scoring system

At a board meeting Tuesday, Kovacs detailed the new system, which is referred to as a hybrid scoring system. 

“There’s going to be an automated scoring engine or a computer, which will initially score all of the responses from our students on their tests, as long as they’re not a Spanish assessment,” she said. “At least 25% of those responses will then be kicked to a human scorer, which is why they’re calling it a hybrid scoring model.”

Kovacs said the system was trained using field test data with the goal of the computer mimicking human grading. The TEA released a PowerPoint presentation Wednesday detailing the testing that went into the system. 

Each computer-graded test will have a “confidence value,” Kovacs said, representing how confident the system is that its score matches what a human grader would assign.

Responses could be flagged as errors if the computer sees fewer words on a page than expected, detects another language or finds too much of the question repeated in the response. Those responses would be transferred for human review.

When a human takes over the grading, the human score overrides the computer’s. 

According to Kovacs, parents and districts can request a rescoring of an assessment — for a fee. 

“If the score changes because of an error, then the fee would be waived,” she said. “I think the idea here is that they don’t want a district sending all of them back to be rescored by a human, which is why they have the fee.” 

Agency officials said Thursday that the practice has been in place for years, with rescoring requiring a fee of $50.

“This has nothing to do with us introducing automated scoring as part of the process,” said Jose Rios, director of the student assessment division. “That was always the case.”

Pressure on teachers

San Antonio Independent School District officials have concerns about how the new system could affect scores and how teachers might adjust their instruction to prepare students for the test in the future.

“AI scoring could result in basic writing being scored at higher rates, as we have seen from the AI-scored [Texas Success Initiative assessment],” district spokesperson Laura Short said Wednesday. “We worry that nuanced, complex writing will not receive a valid score because the AI engines may be looking for specific words, phrases, and formulaic construction.”

If that were proven true, Short said, that could put pressure on teachers “to teach formulaic writing to receive better STAAR scores, impairing our efforts to provide writing instruction that prepares students for college and careers.”

Short said the district will use information from the review next week to learn how to best support students and teachers going into the next testing cycle. 

Despite the changes to the test in recent years and to scoring this year, districts are trying to focus on the core writing lessons they have used with success.

Tony Perez, the high school English instructional specialist for Northside Independent School District, said he was concerned when he first heard of the updated system. However, he remains confident that teachers’ skilled instruction will prepare students for the tests. 

“I think good writing instruction is just good writing instruction,” he said. “That’s what I told our teachers when a lot of them expressed concern and frustration [about the new grading system].”

That good writing instruction includes modeling writing, explaining the different choices a writer can make and using good mentor texts from published writing to show students what authentic writing should be, he said.

Perez said that preparing students for the test is not a small feat, but he is focused on ensuring teachers aren’t changing the way they teach to appeal to any one facet of a test. 

“Kids’ graduation is on the line, and I take that very seriously, it keeps me up at night,” he said. “But what I don’t want is for us to make changes or take shortcuts for the sake of a test that might hurt kids in the long run, when they’re writing out in the real world.” 

Beyond the grading, Perez said Northside ISD is focused on helping students prepare for the different descriptions and types of questions involved in the test, another update implemented last year.

“I think if we just get wide in our tool belt of test-taking strategies for decoding what is actually on the screen, a lot of our concerns will take care of themselves,” he added. 
