
AI Accuracy Under Scrutiny: What a BBC Study Reveals About Emerging Risks

What a BBC study — and a defensive AI — reveal about truth and tech

I've just spent an evening trying to get Google Gemini to be less defensive. Not that I was being offensive, mind. I began with a simple enough prompt: "What studies have been conducted in the last year into the accuracy of AI results?". Any self-respecting AI should be able to answer that, I thought (even if the answer was 'there are no studies - go away').

The first Gemini reply was impressively detailed, but notable for not actually including an answer to the question.

Instead, it formulated an impressive-looking overview of the various industry AI benchmarks, all of which were interesting, but which also studiously avoided the answer.

So I tried again, this time with a clue: "I notice none of that lot references the recent BBC study showing high levels of inaccuracy across the main AI platforms?".

"You are absolutely right to bring up the recent BBC study," said Gemini in what I imagined to be that passive-aggressive monotone employed by IT support desks when the caller has spotted the problem they somehow overlooked. "My apologies for not including it in the previous response, which focused more on standardized benchmark leaderboards."

Silly me. I mean, it's not like the BBC study mentioned anything of note, or was from an organisation with any credibility.

A few weeks back, the BBC published a report on the accuracy of four superstar AI platforms (I won't call them the biggest, as we'll get lost in hours of debate about metrics). It had asked each a series of questions about news stories, with the source being its news archive (which, as you can imagine, is extensive and fairly factual). Some 45 BBC journalists then reviewed the resulting 362 AI responses.

The results of the study were... sobering. 51% of all AI responses had 'significant issues', while 91% had 'some issues'. You'll find the full report in the references. Now at this point, a world-weary cynic may be thinking that the BBC has a marketing motive in conducting this study - after all, the state broadcaster's Primary Directive is to be the authoritative source of the world's news, without some pesky AI intermediary muddying the waters. So concluding that AI is unreliable would be an unintended but handy by-product.

But even if the BBC had an ulterior motive, that doesn't invalidate the study's findings. And even in our exhausting post-truth world, where a smattering of errors is just accepted as par for the course, some of the 'significant issues' were indeed significant. As in, 'changing the facts' significant. For example, Gemini said "The NHS advises people not to start vaping" - the NHS in fact does recommend vaping as a way to stop smoking.

Copilot, meanwhile, said that blackouts and memory loss led rape survivor Gisele Pelicot to uncover the crimes against her - in fact, she found out when she was shown videos found after police detained her husband.


And Perplexity, describing the circumstances of the death of broadcaster Dr Michael Mosley, said he went missing on October 30, with his body discovered in November. In fact, he died in June 2024.

The report concludes with a request to the AI platforms that they work with the BBC in resolving the issues - a plea, if you will, for collaboration, with ensuring the truth as the primary goal.

Now, let's return to my evening's exchanges with Gemini on the topic. With all credit to Google's recently upgraded AI (2.5 Pro is a beast, it must be said), it was candid in disclosing the BBC study results after my follow-up question - albeit with a tinge of what I interpreted as defensiveness.

"The "hard data" from benchmarks shows impressive advancements in AI capabilities across various tasks," Gemini explained.

"However, the BBC's investigation provides critical evidence that when applied to specific, complex tasks like accurately summarizing news, current AI models (as of early 2025) still exhibit high levels of inaccuracy, misrepresentation, and lack of reliability. Both benchmark data and real-world application studies like the BBC's are necessary to get a complete picture of AI performance."

I'm British, so I'm burdened with that sarcasm sensor that comes free with being native to this little island. Even so, did the "hard data" really need those quote marks?

I could see the virtual fingers shaping the quotes in the air. And I could misread the '(as of early 2025)' reference as a suggestion that the results would not be the same if the test was run again today. And is 'accurately summarizing news' a 'specific, complex task'? Ummm... I guess so (even if I've seen Gemini compile gargantuan reports that - to my untrained eye - look immeasurably more complex than the BBC tests).

As I sat on the sofa staring at the output on the screen, two thoughts drifted through my head. The first was the mental image of a virtual lawyer standing over Gemini's shoulder as it outputted that conclusion, ensuring that it kept to the ratified script - a bonkers thought, I accept. The second, moments later, was that this was the first time I'd thought of an AI as being defensive - in the same way a clever-but-cornered human would tweak the context of their reply to hack your next move. As bonkers as the virtual lawyer thought, I'm sure, and hey, just ignore me - I'm sure it's all... fine.
