AI-Generated Data Can Poison Future AI Models

Thanks to a boom in generative artificial intelligence, programs that can produce text, computer code, images and music are readily available to the average person. And we’re already using them: AI content is taking over the Internet, and text generated by “large language models” is filling hundreds of websites, including CNET and Gizmodo. But as AI developers scrape the Internet, AI-generated content may soon enter the data sets used to train new models to respond like humans. Some experts say that will inadvertently introduce errors that build up with each succeeding generation of models.

A growing body of evidence supports this idea: a training diet of AI-generated text, even in small quantities, eventually becomes “poisonous” to the model being trained. Currently there are few obvious antidotes. “While it may not be an issue right now or in, let’s say, a few months, I believe it will become a consideration in a few years,” says Rik Sarkar, a computer scientist at the School of Informatics at the University of Edinburgh in Scotland.

The possibility of AI models tainting themselves may be a bit analogous to a certain 20th-century dilemma. After the first atomic bombs were detonated at World War II’s end, decades of nuclear testing spiced Earth’s atmosphere with a dash of radioactive fallout. When that air entered newly made steel, it brought elevated radiation with it. For particularly radiation-sensitive steel applications, such as Geiger counter consoles, that fallout poses an obvious problem: it won’t do for a Geiger counter to flag itself. Thus, a rush began for a dwindling supply of low-radiation metal. Scavengers scoured old shipwrecks to extract scraps of prewar steel. Now some insiders believe a similar cycle is set to repeat in generative AI—with training data instead of steel.

Researchers can watch AI’s poisoning in action. For instance, start with a language model trained on human-produced data. Use the model to generate some AI output. Then use that output to train a new instance of the model and use the resulting output to train a third version, and so forth. With each iteration, errors build atop one another. The 10th model, prompted to write about historical English architecture, spews out gibberish about jackrabbits.
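The dynamic is easy to reproduce in miniature. The sketch below is not the researchers’ actual setup, just a minimal stand-in: “training” means fitting a Gaussian’s mean and spread to a sample, and “generating” means drawing a fresh sample from the fit. The function names and parameters here are invented for illustration.

```python
import numpy as np

def train_on(data):
    """'Train' the toy model: fit a Gaussian's mean and spread to the data."""
    return data.mean(), data.std()

def generate(model, n, rng):
    """'Generate' from the toy model: draw n samples from the fitted Gaussian."""
    mean, std = model
    return rng.normal(mean, std, size=n)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=20)  # generation 0: "human-produced" data

stds = []
for generation in range(1000):
    model = train_on(data)            # train a fresh model on the current data
    stds.append(model[1])
    data = generate(model, 20, rng)   # its output becomes the next training set

print(f"fitted spread at generation 1: {stds[0]:.3f}, "
      f"at generation 1000: {stds[-1]:.3g}")
```

Because each generation fits a noisy, slightly biased estimate of the previous one, the fitted spread tends to shrink across many generations, a bare-bones version of the compounding errors described above.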

“It gets to a point where your model is practically meaningless,” says Ilia Shumailov, a machine learning researcher at the University of Oxford.

Shumailov and his colleagues call this phenomenon “model collapse.” They observed it in a language model called OPT-125m, as well as a different AI model that generates handwritten-looking numbers and even a simple model that tries to separate two probability distributions. “Even in the simplest of models, it’s already happening,” Shumailov says. “I promise you, in more complicated models, it’s 100 percent already happening as well.”

In a recent preprint study, Sarkar and his colleagues in Madrid and Edinburgh conducted a similar experiment with a type of AI image generator called a diffusion model. Their first model in this series could generate recognizable flowers or birds. By their third model, those pictures had devolved into blurs.

Other tests showed that even a partly AI-generated training data set was toxic, Sarkar says. “As long as some reasonable fraction is AI-generated, it becomes an issue,” he explains. “Now exactly how much AI-generated content is needed to cause issues in what sort of models is something that remains to be studied.”
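The same toy Gaussian setup (again a sketch, not either group’s experiment; `human_frac` and the other parameters are invented here) can show why the synthetic fraction matters: keeping some fresh human-produced data in each generation’s training mix anchors the model, while an all-synthetic diet drifts away.

```python
import numpy as np

def run_generations(human_frac, n=40, generations=600, seed=0):
    """Toy retraining loop: each generation fits a Gaussian to a training set
    in which a fraction `human_frac` comes fresh from the true distribution
    and the rest is the previous model's own output."""
    rng = np.random.default_rng(seed)
    mean, std = 0.0, 1.0                                   # start at the truth
    for _ in range(generations):
        n_human = int(human_frac * n)
        human = rng.normal(0.0, 1.0, size=n_human)           # fresh human data
        synthetic = rng.normal(mean, std, size=n - n_human)  # model's own output
        mixed = np.concatenate([human, synthetic])
        mean, std = mixed.mean(), mixed.std()              # retrain on the mix
    return std

all_synthetic = run_generations(human_frac=0.0)
half_human = run_generations(human_frac=0.5)
print(f"final spread, all-synthetic: {all_synthetic:.3g}; "
      f"half human: {half_human:.3f}")
```

In this toy, the all-synthetic chain’s spread typically decays toward zero while the mixed chain stays near the true value; how that threshold behaves in real, large models is exactly the open question Sarkar describes.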

Both groups experimented with relatively modest models—programs that are smaller and use less training data than the likes of the language model GPT-4 or the image generator Stable Diffusion. It’s possible that larger models will prove more resistant to model collapse, but researchers say there is little reason to believe so.

The research so far indicates that a model will suffer most at the “tails” of its data—the data elements that are less frequently represented in a model’s training set. Because these tails include data that are further from the “norm,” a model collapse could cause the AI’s output to lose the diversity that researchers say is distinctive about human data. In particular, Shumailov fears this will exacerbate models’ existing biases against marginalized groups. “It’s quite clear that the future is the models becoming more biased,” he says. “Explicit effort needs to be put in order to curtail it.”
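A toy discrete version makes the tail effect concrete. In the sketch below (an illustration, not the researchers’ setup; the Zipf-like distribution over 50 made-up categories stands in for a model’s output distribution), each generation re-estimates the category frequencies from a finite sample of the previous generation’s output. Once a rare category misses a sample, its estimated probability hits zero and it can never come back.

```python
import numpy as np

rng = np.random.default_rng(0)

# A Zipf-like "true" distribution: a few common categories, a long tail of rare ones.
n_categories = 50
true_probs = 1.0 / np.arange(1, n_categories + 1)
true_probs /= true_probs.sum()

probs = true_probs
support_sizes = []
for generation in range(100):
    # "Train": estimate category frequencies from a finite sample of the
    # previous generation's output.
    sample = rng.choice(n_categories, size=200, p=probs)
    counts = np.bincount(sample, minlength=n_categories)
    probs = counts / counts.sum()
    support_sizes.append(int((probs > 0).sum()))

print(f"categories with nonzero probability: "
      f"{support_sizes[0]} -> {support_sizes[-1]}")
```

Under this resampling the support can only shrink, so the rarest categories—the “tails”—are the first casualties.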

Perhaps all this is speculation, but AI-generated content is already beginning to enter realms that machine-learning engineers rely on for training data. Take language models: even mainstream news outlets have begun publishing AI-generated articles, and some Wikipedia editors want to use language models to produce content for the site.

“I feel like we’re kind of at this inflection point where a lot of the existing tools that we use to train these models are quickly becoming saturated with synthetic text,” says Veniamin Veselovskyy, a graduate student at the Swiss Federal Institute of Technology in Lausanne (EPFL).

There are warning signs that AI-generated data might enter model training from elsewhere, too. Machine-learning engineers have long relied on crowd-work platforms, such as Amazon’s Mechanical Turk, to annotate their models’ training data or to review output. Veselovskyy and his colleagues at EPFL asked Mechanical Turk workers to summarize medical research abstracts. They found that around a third of the summaries had ChatGPT’s touch.

The EPFL group’s work, released on a preprint server last month, examined only 46 responses from Mechanical Turk workers, and summarizing is a classic language model task. But the result has raised a specter in machine-learning engineers’ minds. “It is much easier to annotate textual data with ChatGPT, and the results are extremely good,” says Manoel Horta Ribeiro, a graduate student at EPFL. Researchers such as Veselovskyy and Ribeiro have begun considering ways to protect the humanity of crowdsourced data, including tweaking platforms such as Mechanical Turk to discourage workers from turning to language models and redesigning experiments to encourage more human data.

Against the threat of model collapse, what is a hapless machine-learning engineer to do? The answer could be the equivalent of prewar steel in a Geiger counter: data known to be free (or perhaps as free as possible) of generative AI’s touch. For instance, Sarkar suggests employing “standardized” image data sets that would be curated by humans who know their content consists only of human creations, and that would be freely available for developers to use.

Some engineers may be tempted to pry open the Internet Archive and look up content that predates the AI boom, but Shumailov doesn’t see going back to historical data as a solution. For one thing, he thinks there may not be enough historical information to feed growing models’ demands. For another, such data are just that: historical and not necessarily reflective of a changing world.

“If you wanted to collect the news of the past 100 years and try and predict the news of today, it’s obviously not going to work, because technology’s changed,” Shumailov says. “The lingo has changed. The understanding of the issues has changed.”

The challenge, then, may be more direct: discerning human-generated data from synthetic content and filtering out the latter. But even if the technology for this existed, it is far from a straightforward task. As Sarkar points out, in a world where Adobe Photoshop allows its users to edit images with generative AI, is the result an AI-generated image—or not?
