The bottom line: All tested LLMs miscite table values, but specialized critic models can raise answer accuracy by up to 12 percent.
Researchers have, for the first time, systematically examined how often large language models (LLMs) miscite or omit table values, a problem that compromises intermediate results even when the final answer is correct.
The study evaluates Data Referencing Errors (DREs) across different models and task types and finds that every tested model, from 1.7 billion to 20 billion parameters, commits them. Although LLMs understand table structures, they cite values incorrectly or leave them out. This error type undermines the traceability of reasoning steps, a harm that final-answer accuracy alone does not reveal.
The researchers demonstrate that a critic model can detect and correct these errors: with critic-based filtering and rejection sampling, answer accuracy increases by up to 12.0 percent. A trained 4-billion-parameter critic model achieves an average F1 score of 78.2 percent when detecting both in-distribution and out-of-distribution DREs.
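To make the idea of critic-based rejection sampling concrete, here is a minimal Python sketch. It is not the study's implementation: `rejection_sample`, `toy_generate`, and `toy_critic_flags_dre` are hypothetical names, and the toy critic is a hard-coded rule standing in for a trained critic model.

```python
from typing import Callable, Optional


def rejection_sample(
    generate: Callable[[], str],
    critic_flags_dre: Callable[[str], bool],
    max_attempts: int = 4,
) -> Optional[str]:
    """Return the first candidate the critic accepts, or None if all are rejected."""
    for _ in range(max_attempts):
        candidate = generate()
        if not critic_flags_dre(candidate):
            return candidate
    return None


# Toy stand-ins (assumptions for illustration only).
# The table cell is 120; the first candidate miscites it as 150.
_candidates = iter([
    "Row 2 lists revenue of 150, so the total is 270.",
    "Row 2 lists revenue of 120, so the total is 240.",
])


def toy_generate() -> str:
    # Stands in for sampling a new answer from the LLM.
    return next(_candidates)


def toy_critic_flags_dre(answer: str) -> bool:
    # Stands in for a trained critic: flags the known-wrong value.
    return "150" in answer


accepted = rejection_sample(toy_generate, toy_critic_flags_dre)
print(accepted)
```

In this sketch the first candidate is rejected and the second is returned. When every candidate is rejected the function returns `None`, leaving the caller to decide on a fallback.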
For developers, this means table-based reasoning pipelines should include explicit validation steps for data citations. The lightweight critic model can be integrated as a separate validation pass in inference workflows and helps larger models produce more reliable results, particularly in scenarios where traceability is critical.
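A rule-based validation pass of this kind could look like the following Python sketch. It assumes a hypothetical citation convention, `[row=R, col=C]=VALUE`, that the pipeline would have to ask the model to emit; the convention and the function `find_bad_citations` are illustrative and not taken from the study. The check catches only miscited values, not omitted ones, and it is no substitute for a learned critic.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical citation format: [row=<index>, col=<name>]=<number>
CITATION = re.compile(r"\[row=(\d+),\s*col=(\w+)\]\s*=\s*(-?\d+(?:\.\d+)?)")


def find_bad_citations(
    reasoning: str, table: List[Dict[str, float]]
) -> List[Tuple[int, str, float]]:
    """Return (row, column, cited value) for each citation that disagrees with the table."""
    bad = []
    for row_text, col, value_text in CITATION.findall(reasoning):
        row, cited = int(row_text), float(value_text)
        # A citation is bad if the cell does not exist or holds a different value.
        actual = table[row].get(col) if row < len(table) else None
        if actual is None or float(actual) != cited:
            bad.append((row, col, cited))
    return bad


table = [
    {"year": 2022, "revenue": 120.0},
    {"year": 2023, "revenue": 135.5},
]
reasoning = (
    "Revenue grew from [row=0, col=revenue]=120 "
    "to [row=1, col=revenue]=153.5."
)
print(find_bad_citations(reasoning, table))  # [(1, 'revenue', 153.5)]
```

The second citation transposes two digits (153.5 instead of 135.5), so it is reported while the first passes. A pipeline could use a non-empty result to trigger regeneration or to hand the answer to a critic model.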
Source: arxiv.org · Published June 29, 2026
Lumi AI News — AI-assisted curation pursuant to Article 50 EU AI Act. Paraphrase and classification through Lumi News Pipeline v1.7.2.