If you have started to wonder whether the answers from AI language model ChatGPT are getting progressively worse, you are not alone.
Several researchers who have been tracking AI products that rely on huge amounts of data scraped from the internet are sounding the alarm over an issue dubbed ‘model collapse’.
Data Gravity
Large language models such as OpenAI’s ChatGPT and Google’s Gemini are built with machine learning, a set of techniques for processing and analyzing the data they are trained on.
One of the major downsides of this approach is that it requires enormous amounts of high-quality input data to work as intended.
Data Sources
All that data has to come from somewhere. To find it, the companies building these models scrape terabytes of content from public sources across the internet, such as news sites and social media.
AI Web
However, Google, Meta, and others are running into a problem that had been predicted but was sidelined as the hype around AI grew.
Much of the text and imagery now uploaded to the internet is itself fully or partly generated by AI.
Blind Leading the Blind
The result is that newer AI models are increasingly being trained on content generated by other AI models, or even by earlier versions of themselves.
Jathan Sadowski, a researcher at Monash University, recently described a hypothetical self-trained AI as “an inbred mutant, likely with exaggerated, grotesque features.”
Evidence Mounts
More recently, a study published in Nature tested what happens when an AI model is trained repeatedly on its own output, and the results were concerning.
By the fifth cycle of retraining, the degradation in output quality was stark; by the ninth, the model’s output was almost completely nonsensical.
Rapid Decline
Dr. Ilia Shumailov of the University of Oxford, who researched the phenomenon with his team, was shocked at the rate of decline.
He says: “It is surprising how fast model collapse kicks in and how elusive it can be.” According to Shumailov, it initially “affects [badly represented] minority data” before spreading to the rest of the model’s outputs and eroding their diversity.
Model Collapse
‘Model collapse’ is the term for this phenomenon, in which AI models trained on their own output degrade over successive iterations until that output becomes meaningless.
AI has the potential to cause its own downfall. Shumailov warns that “model collapse can have serious consequences”.
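The core dynamic can be illustrated with a toy simulation. This is only a sketch, not anything from the Nature study: the 50-token vocabulary, the sample size, and the function names are all invented for illustration. A tiny “model” here is just a probability distribution over tokens, and each “generation” is retrained purely on samples drawn from the previous one:

```python
import random
from collections import Counter

def retrain_on_own_output(probs, n_samples, rng):
    """Toy 'training' step: sample tokens from the current model,
    then re-estimate the token distribution from that sample alone."""
    tokens = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    counts = Counter(tokens)
    return [counts[t] / n_samples for t in range(len(probs))]

rng = random.Random(0)

# Start with a Zipf-like "language": a few common tokens
# and a long tail of rare ones.
weights = [1.0 / rank for rank in range(1, 51)]
total = sum(weights)
probs = [w / total for w in weights]

# Track how many tokens still have nonzero probability each generation.
support = [sum(p > 0 for p in probs)]
for generation in range(9):
    probs = retrain_on_own_output(probs, n_samples=200, rng=rng)
    support.append(sum(p > 0 for p in probs))

# A token that misses one sample gets probability 0 and can never
# reappear, so the model's vocabulary can only shrink over generations.
print(support)
```

This mirrors the dynamic Shumailov describes: poorly represented minority data is the first to vanish, and the diversity of the model’s output only shrinks with each iteration.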
Scale of Issue
One lesser-known aspect of the issue is the sheer scale of AI-generated content already on the web, as revealed by separate researchers at Amazon Web Services.
They found that as much as 57% of text on the web has been at least partly generated by AI.
Real Content
All of this means that it will not only become harder to find genuine, human-made content on the internet, but that the content we do find may become markedly worse.
So, are you an AI evangelist or a tech skeptic? What do you think of the growing proportion of AI-generated content on the internet, quite apart from its potential to scupper AI-driven language models?