Microsoft's Dolphin and phi models have used this successfully, and there's some evidence that all newer models use big LLMs to produce synthetic data (for example, some models answer that they're ChatGPT or Claude when asked, hinting that at least some of their dataset comes from those models).
theterrasque
randomly sampled.
Semi-randomly. There are a lot of sampling strategies, for example temperature, top-K, top-p, min-p, mirostat, repetition penalty, and greedy.
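To make a few of those concrete, here's a minimal sketch of temperature, top-K, top-p, and greedy selection over a made-up set of token scores. The function name `sample_next_token` and the toy logits are my own assumptions for illustration, not any library's API, and real inference engines do this over tens of thousands of tokens with more care.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token from a {token: logit} dict using common sampling knobs."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(logits, key=logits.get)

    # Temperature scaling, then softmax (subtract the max for numerical stability).
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)

    # top-K: keep only the K most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # top-p (nucleus): keep the smallest prefix whose probability mass reaches p.
    if top_p is not None:
        kept, mass = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            mass += p
            if mass >= top_p:
                break
        ranked = kept

    tokens = [tok for tok, _ in ranked]
    probs = [p for _, p in ranked]
    # random.choices renormalises the remaining weights for us.
    return rng.choices(tokens, weights=probs, k=1)[0]

# Made-up scores for the token after "The capital of France is".
toy_logits = {"Paris": 5.0, "Lyon": 2.0, "London": 1.0, "banana": -3.0}
print(sample_next_token(toy_logits, temperature=0))  # greedy
print(sample_next_token(toy_logits, temperature=0.8, top_k=2, top_p=0.9))
```

Lower temperature and tighter top-K/top-p make the output more predictable, but every setting still just picks from the scores the model produced.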
As with any statistics you have a confidence on how true something is based on your data. It’s just a matter of putting the threshold higher or lower.
You just have to make it so that if that level of confidence is not reached, it just defaults to an “I don’t know” answer. But, once again, this will make the chatbots seem very dumb as they will answer with lots of “I don’t know”.
I think you misunderstand how LLMs work. An LLM doesn't have a confidence in an answer; it's not like it looks at its data and says "hmm, yes, most say Paris is the capital of France, so that's the answer". It "just" puts weight on the next token depending on its internal statistics, then one of those tokens is picked, and the process starts anew.
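Roughly, the loop looks like the sketch below. The "model" here is a hypothetical lookup table keyed on the previous token only, with scores I made up; a real LLM computes the scores from the whole context with a neural network, but the pick-a-token-and-repeat part is the same idea.

```python
import math
import random

# Hypothetical toy "model": maps the last token to raw scores for the next one.
# A real LLM conditions on the whole context and has a huge vocabulary.
TOY_SCORES = {
    "<start>": {"The": 3.0},
    "The": {"capital": 3.0},
    "capital": {"of": 3.0},
    "of": {"France": 3.0},
    "France": {"is": 3.0},
    "is": {"Paris": 2.0, "Lyon": 1.0, "London": 0.5},
    "Paris": {"<end>": 3.0},
    "Lyon": {"<end>": 3.0},
    "London": {"<end>": 3.0},
}

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def generate(rng, max_tokens=20):
    """Weight the next tokens, pick one, append it, and start over."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = softmax(TOY_SCORES[tokens[-1]])
        next_token = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])

rng = random.Random(0)
for _ in range(5):
    print(generate(rng))
```

Nothing in that loop checks whether the sentence is true. "Lyon" comes out through exactly the same mechanism as "Paris", just less often, and the only number available is a per-token weight, not a confidence in the answer as a whole.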
Teaching the model to say "I don't know" helps a bit, and was lauded as "The Solution" a year or two ago, but it turns out it didn't really help that much. Then you got grounded approaches, RAG, CoT, and so on, all with the goal of making the LLM more reliable. None of them solves the problem, because as the PhD said, it's inherent in how LLMs work.
And no, local LLMs aren't better, they're actually much worse, and the big companies are throwing billions at trying to solve this. And no, it's not because "that makes the LLM look dumb" that they haven't solved it.
Early on I was looking into making a business of providing local AI to businesses, especially RAG. But no model I tried - even with the documents being part of the context - came close to being reliable enough. They all hallucinated too much. I still check this out now and then just out of my own interest, and while it's become a lot better, it's still a big issue. Which is why you see it on the news again and again.
This is the single biggest hurdle for the big companies to turn their AIs from a curiosity and something assisting a human into a full-fledged autonomous / knowledge system they can sell to customers. You bet your dangleberries they try everything they can to solve this.
And if you think you have the solution that every researcher, developer, and machine learning engineer has missed, then please go prove it and collect some fat checks.
The fix is not that hard, it’s a matter of reputation on having the chatbot answer “I don’t know” when the confidence on an answer isn’t high enough.
This has been tried, it's helping but it's not enough by itself. It's one of the mitigation steps I was thinking of. And companies do work very hard to reduce hallucinations, just look at Microsoft's newest thing.
From that article:
“Trying to eliminate hallucinations from generative AI is like trying to eliminate hydrogen from water,” said Os Keyes, a PhD candidate at the University of Washington who studies the ethical impact of emerging tech. “It’s an essential component of how the technology works.”
Text-generating models hallucinate because they don’t actually “know” anything. They’re statistical systems that identify patterns in a series of words and predict which words come next based on the countless examples they are trained on.
It follows that a model’s responses aren’t answers, but merely predictions of how a question would be answered were it present in the training set. As a consequence, models tend to play fast and loose with the truth. One study found that OpenAI’s ChatGPT gets medical questions wrong half the time.
It's an inherent negative property of the way they work. It's a problem, but not a bug any more than the result of a car hitting a tree at high speed is a bug.
Calling it a bug indicates that it's something unexpected that can be fixed, and as far as we know it can't be fixed, and is expected behavior. Same as the car analogy.
The only thing we can do is raise awareness and mitigate.
Well, it's not lying because the AI doesn't know right or wrong. It doesn't know that it's wrong. It doesn't have the concept of right or wrong or true or false.
For LLMs, the hallucinations are just a result of combining statistics and producing the next word, as you say. From the LLM's "pov" it's as real as everything else it knows.
So what else can it be called? The closest concept we have is when the mind hallucinates.
This is a very simple one, but someone lower down apparently had an issue with a script like this:
https://i.imgur.com/wD9XXYt.png
I tested the code, and it works. If I was gonna change anything, I'd probably move the matplotlib import to after the else so it's only imported when needed to display the image.
I have a lot more complex generations in my history, but all of them have personal or business details, and have much more back and forth. But try it yourself, Claude has a free tier. Just try to be clear in the prompt about what you want. It might surprise you.
What LLM did you use, and how long ago was it? Claude Sonnet usually writes pretty good Python for smaller scripts (a few hundred lines).
The only issue I see with targeting Linux is the sheer variety of desktop setups. Finding one keyboard shortcut and payload that will work on even just the majority of distros would be a challenge.
The printing press, of course
https://learnprompting.org/docs/intermediate/chain_of_thought
It's suspected to be one of the reasons why Claude and OpenAI's new o1 model are so good at reasoning compared to other LLMs.
It can sometimes notice hallucinations and adjust itself, but there have also been examples where the CoT reasoning itself introduces hallucinations and makes it throw away correct answers. So it's not perfect. Overall a big improvement though.
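If you haven't seen chain-of-thought prompting, here's a rough sketch of the prompt side of it. The helper names and the "Answer:" convention are my own assumptions for illustration, and the completion at the bottom is hand-written, not real model output. How o1 or Claude do this internally isn't public in detail, so treat this as the basic prompting trick only.

```python
def direct_prompt(question):
    """Ask for the answer straight away."""
    return f"Question: {question}\nAnswer:"

def cot_prompt(question):
    """Ask the model to write out its reasoning before committing to an answer."""
    return (f"Question: {question}\n"
            "Let's think step by step, then give the final answer on its own "
            "line starting with 'Answer:'.\n")

def extract_answer(completion):
    """Return the text after the last 'Answer:' line, or None if there is none."""
    for line in reversed(completion.splitlines()):
        if line.strip().startswith("Answer:"):
            return line.split("Answer:", 1)[1].strip()
    return None

question = "I have 3 boxes with 4 apples each and eat 2 apples. How many are left?"
print(cot_prompt(question))

# Hand-written stand-in for a model completion, only to show the parsing step.
fake_completion = "3 boxes times 4 apples is 12.\n12 minus 2 is 10.\nAnswer: 10"
print(extract_answer(fake_completion))
```

The reasoning text becomes part of the context the model continues from, which is why it can help, and also why a wrong step in the reasoning can drag the final answer with it.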