[Image: AI-Arts.org]
Brilliant but untrustworthy people are basically the main characters of both fiction and history. Apparently, the same vibe applies to AI now, too.
According to an OpenAI investigation reported by The New York Times, our silicon-brained friends aren’t just making mistakes; they’re confidently spewing nonsense. Hallucinations, imaginary facts, and bald-faced lies have been in the AI playbook since day one. And while newer models are supposed to be getting smarter, they still can’t shake the habit of making stuff up.
Enter OpenAI’s latest brainchildren: o3 and o4-mini. These autocomplete machines are built to “reason” their way to answers, like a nerdy Sherlock Holmes with zero impulse control. Unlike older models that mostly focused on stringing nice sentences together, o3 and o4-mini are supposed to think things through like humans do (you know, slowly and with lots of second-guessing). OpenAI even claimed that o1 could match or outclass PhD students in chemistry, biology, and math. But if you were hoping that meant fewer screw-ups, the latest findings are here to throw cold water on that idea.
OpenAI’s own tests showed o3 hallucinating on a third of the questions in a benchmark about public figures. That’s not just bad; it’s twice the rate of last year’s o1 model. Not to be outdone, the smaller o4-mini model basically went full fever dream, hallucinating on a whopping 48% of similar tasks.
And it gets even messier. When tested on general knowledge using the SimpleQA benchmark, o3’s hallucination rate exploded to 51%. And o4-mini? Try 79%. That’s like the Matrix throwing a rave in a mirror funhouse. You’d hope a system built to “reason” wouldn’t trip over basic facts, but here we are.
One popular theory floating around AI circles? The more a model tries to “think,” the more likely it is to lose the plot. Older models mostly played it safe, sticking to high-confidence answers. But these new reasoning models? They go exploring—piecing together scattered facts, taking intellectual detours, and yes, sometimes veering straight into fantasy land. Improvising facts, after all, is just a fancy way of lying.
Sure, correlation isn’t causation, and as OpenAI told The Times, the rise in hallucinations might not mean these models are inherently worse. It could just be that they’re chattier and more willing to roll the dice. Because instead of simply regurgitating predictable facts, the new models speculate. They hypothesise. They wander into “what if” territory where truth gets a little… slippery. And unfortunately, some of those “what ifs” end up being completely disconnected from reality.
The thing is, that’s exactly what OpenAI and its competitors like Google and Anthropic don’t want. The entire idea of calling an AI chatbot a “copilot” or “assistant” is that it’s supposed to help, not hand you imaginary citations or invent fake medical advice. Lawyers have already been sanctioned for filing briefs padded with citations ChatGPT made up. Who knows how many other people are relying on confidently wrong AI answers without realising it?
And as AI rolls into classrooms, courtrooms, hospitals, and HR departments, the stakes are only going up. Sure, the tech can draft job applications or wrangle spreadsheets like a wizard. But the more useful AI gets, the less tolerance anyone has for it pulling facts out of thin air.
Because you can’t say AI is saving people time if everyone still has to fact-check it like a paranoid editor. That’s not time-saving; that’s a digital goose chase. Not that these models aren’t impressive. o3 can code like a champ and solve problems that make most humans sweat. But the second it confidently declares that Abraham Lincoln hosted a podcast or that water boils at 80°F, the whole façade falls apart.
Until that gets fixed? Treat everything an AI says like it’s coming from that guy in the meeting – you know the one. Talks a lot. Sounds super sure of himself. But halfway through, you’re googling everything he says because, deep down, you know he’s full of it.
[Source: TechRadar]