#3. We need to talk about regression

2026-03-24 16:37:32 +0100 CET

You cannot train a large language model (LLM) on information generated by an LLM. If you do, the model will quickly regress to the point of being incomprehensible. The reason is simple and unavoidable: LLMs are stochastic models. They count tokens and the relations between those tokens, and when they are done counting they are trained. If you ask one a question, it presents you with the token that fits your question with the highest probability, uses that token to find the next one, and so on. It is actually pretty good at generating something decent from those probabilities. However, if you start feeding that output back into an LLM training run, those probabilities are treated as certainty, because they are counted as 1 in the new model. The next time you ask the LLM a question, that learned relation will be stronger and therefore more common in the output. The problem, of course, is that as this relation becomes stronger, other relations become weaker, and your model regresses, narrowing the answers it can give, purely because of its own previous answers.
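A toy sketch of that feedback loop (my own illustration, not anything from a real training pipeline): treat a "model" as nothing more than a categorical distribution over a small vocabulary, train each generation purely on samples drawn from the previous generation, and watch the distribution narrow.

```python
import math
import random

random.seed(0)

def entropy_bits(probs):
    """Shannon entropy of a categorical distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

VOCAB_SIZE = 50      # toy vocabulary
CORPUS_SIZE = 100    # tokens "generated" per training round
GENERATIONS = 10

# Generation 0: a model that considers every token equally likely.
model = [1 / VOCAB_SIZE] * VOCAB_SIZE

for gen in range(1, GENERATIONS + 1):
    # Generate a corpus by sampling from the current model...
    corpus = random.choices(range(VOCAB_SIZE), weights=model, k=CORPUS_SIZE)
    # ...then "train" the next model by counting token frequencies.
    counts = [0] * VOCAB_SIZE
    for token in corpus:
        counts[token] += 1
    model = [c / CORPUS_SIZE for c in counts]
    alive = sum(1 for p in model if p > 0)
    print(f"generation {gen}: {alive} tokens still possible, "
          f"entropy {entropy_bits(model):.2f} bits")
```

A token that happens to draw zero samples in one round gets probability 0 and can never come back, so diversity only ever shrinks. Real LLM training is vastly richer than counting frequencies, but the one-way drift is the same.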

Research showed that, left unchecked, a model will regress into gibberish within about nine generations. This can be mitigated somewhat: by applying gradient descent on new data, by preselecting new data to train an existing model with, or by never using data generated by AI at all. The last option suggests that such a set exists (and it does), but it will also become outdated very quickly, because adding to that set will not be possible once determining which data was generated by AI becomes impossible. Even then, models will regress. More slowly, but they will regress. All for the simple reason that you cannot feed the outcome of a statistical function back in as the thing being counted by that same function.
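A back-of-the-envelope sketch of why mixing in fresh human data only mitigates rather than cures (the three-word vocabulary and the 0.3 mixing fraction are my own arbitrary assumptions): once a relation has collapsed to zero, only fresh data can reintroduce it, and even then it comes back weakened.

```python
# Toy next-token distributions after some fixed context.
vocab = ["sat", "ran", "slept"]

# What human text actually looks like (assumed uniform for simplicity).
real = {tok: 1 / 3 for tok in vocab}

# A model that has regressed: self-training drove two relations to zero.
collapsed = {"sat": 1.0, "ran": 0.0, "slept": 0.0}

# Mitigation: build the next training corpus from 30% fresh human data
# and 70% model output (the 0.3 is purely illustrative).
MIX = 0.3
retrained = {tok: MIX * real[tok] + (1 - MIX) * collapsed[tok] for tok in vocab}

print(retrained)  # "ran" and "slept" are possible again, though still rare
```

The lost relations reappear at 0.1 each instead of their original 1/3, so the corrected model is still skewed toward its own earlier output.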

The counter-argument I have heard goes as follows: since we cannot know for certain that a completely reasonable text was written by an AI, that text could just as well have been written by a human, so adding it to the training data of an AI counts as just another voice. Let's delve into this. LLMs do not consider content, only relations between tokens (all you need is attention). The relations they produce are aggregates of the most probable relations in what they learned. If some relation had a probability of only 0.90 and is now counted as 1.00 in the next iteration, the relations will have different probabilities and the AI will most likely not be able to reproduce its own previous statement. In that sense it is not even its own voice.
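The 0.90-to-1.00 jump can be made concrete (toy numbers of my own; this assumes the model's single most likely continuation is what ends up in the next training set):

```python
# Two competing relations for the next token after some context.
p = {"sat": 0.90, "ran": 0.10}

# The model emits its most probable continuation; the next training run
# then counts that one output as the whole evidence for this context.
emitted = max(p, key=p.get)
p_next = {tok: (1.0 if tok == emitted else 0.0) for tok in p}

print(p_next)  # {'sat': 1.0, 'ran': 0.0}
```

The retrained model answers "sat" every time and can never again produce the 90/10 split it was trained from, which is the sense in which the regressed output is not even "another voice".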

The issue of regression was already common knowledge when people started working with deep learning models. It seems people were somewhat hoping that from large language models an intelligence would emerge that could solve this problem for us. Safe to say, that did not happen. So now it appears to be a race to improve LLMs enough that they can help solve this problem, using new generations of LLMs that will only get less intelligent.