This has become such a problem in scholarly publishing that we have a business providing citation checking (https://groundedai.company/), which we've been building for a couple of years now.
So far we take a largely rule-based approach and try to avoid LLMs as much as possible. We extract and parse the citations using various ML and rule-based methods, run a set of predetermined queries, and fuzzy-match the metadata components. On top of that we have a bunch of rules around risk levels: what we should have found or matched based on the type of source, which venue we should have found it in, etc.
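To make the fuzzy-matching-plus-risk-rules idea concrete, here is a minimal sketch in Python. It is not our production code: the field names (title, first_author, year), the thresholds, and the three risk labels are all hypothetical, and difflib stands in for whatever matcher you prefer.

```python
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase and replace punctuation with spaces so formatting
    # differences do not dominate the similarity score.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def compare_citation(parsed, candidate):
    # Compare each metadata component separately (hypothetical field names).
    return {
        "title": similarity(parsed["title"], candidate["title"]),
        "first_author": similarity(parsed["first_author"], candidate["first_author"]),
        "year": 1.0 if parsed["year"] == candidate["year"] else 0.0,
    }


def risk_level(scores, title_threshold=0.9, author_threshold=0.8):
    # Illustrative rules only; the thresholds are assumptions.
    title_ok = scores["title"] >= title_threshold
    author_ok = scores["first_author"] >= author_threshold
    year_ok = scores["year"] == 1.0
    if title_ok and author_ok and year_ok:
        return "low"
    if title_ok:
        return "medium"  # title exists, but the other metadata disagrees
    return "high"  # no convincing match for the title at all


parsed = {"title": "Attention Is All You Need", "first_author": "Vaswani", "year": 2017}
found = {"title": "Attention is all you need.", "first_author": "Vaswani", "year": 2017}
print(risk_level(compare_citation(parsed, found)))
```

The "medium" branch is where a frankenstein-style citation would land in this sketch: a real title attached to the wrong author or year.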
So there are absolutely a bunch of tasks that could be evaled/benchmarked, but "hallucination rate" isn't a particularly applicable or interesting metric of how good the tool is.
That said, we do use various LLMs (mostly local, fine-tuned, small ones, for things like NER, parsing, metadata comparison, etc.), and they can and do hallucinate. But we have very hard constraints on the validation: for example, any extraction results that don't match 1:1 back to the input text are discarded. So again, rather than focusing on hallucination risk, we prefer hard constraints.
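A rough sketch of that kind of hard constraint, in Python. This assumes "match 1:1" means the extracted value appears verbatim in the input text; the function name and field names are hypothetical and only for illustration.

```python
def keep_verbatim_fields(source_text, extracted):
    # Keep only the extracted values that occur verbatim in the source.
    # Anything the model invented or paraphrased is dropped, so the check
    # does not depend on how often the model hallucinates.
    kept = {}
    for field, value in extracted.items():
        if value and value in source_text:
            kept[field] = value
    return kept


source = "Smith, J. (2020). Deep learning. Nature, 5(3), 1-10."
model_output = {
    "author": "Smith, J.",
    "year": "2020",
    "journal": "Nature Medicine",  # not in the source text, so it is discarded
}
print(keep_verbatim_fields(source, model_output))
```

The trade-off is that a correct but reformatted extraction gets discarded too, which is the price of a constraint that cannot be talked around.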
We collaborated with Nature here to study the extent of fake/frankenstein citations in scholarly literature (from the top 5 publishers: Springer, Elsevier, Wiley, Sage, Taylor & Francis).
We estimate that hundreds of thousands of papers in 2025 are affected by hallucinated citation issues.
As part of the work we analysed 20k papers generated with the ChatGPT API to figure out which citation errors are characteristic of gen AI use, and used that to classify the errors we saw in the wild.
The world's gone mad, publishing is in a nuts state, the training data is poisoned!
Yeah, this makes sense. We run a citation verification service and give publishers data along the lines of "hey, this citation could be fake", but we don't currently capture any "action" or "measured result", so I guess that's what we need to expand to next.
I work on Veracity (https://groundedai.company/veracity/), which does citation checking for academic publishers. I see stuff like this all the time in paper submissions. Publishers are inundated.
That's basically what we're doing with app.studyrecon.ai.
What we've found is that vector similarity is often not the final solution. It is still only a crude proxy for the true goal of 'informativeness' or 'usefulness' in relation to the user's goal/query. It works okay, but we're definitely seeing a need for more rigorous LLM post-processing to enrich the result set.
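As a rough illustration of the two-stage shape this implies, here is a minimal Python sketch: shortlist by cosine similarity, then reorder the shortlist with a separate usefulness score. The usefulness callable is a hypothetical stand-in for an LLM post-processing step, and the vectors and scores are made up for the example.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_vec, docs, usefulness, k=3):
    # Stage 1: cheap shortlist by vector similarity (the crude proxy).
    shortlist = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)[:k]
    # Stage 2: reorder the shortlist by a richer score. In practice this
    # would be an LLM judging usefulness against the user's goal.
    return sorted((text for text, _ in shortlist), key=usefulness, reverse=True)


docs = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
toy_scores = {"a": 0.2, "b": 0.9, "c": 1.0}  # made-up usefulness scores
print(retrieve_then_rerank([1.0, 0.0], docs, toy_scores.get, k=2))
```

Note that "c" never reaches the second stage here even though it has the highest usefulness score, which is exactly the limitation of relying on similarity for the shortlist.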
We're working on this at Grounded AI (https://www.groundedai.company/contact-us). We'd love to help you if we can. Feel free to contact me (email is on my profile page)