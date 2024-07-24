The use of computer-generated data to train artificial intelligence models risks causing them to produce nonsensical results, according to new research that highlights looming challenges to the emerging technology.
Leading AI companies, including OpenAI and Microsoft, have tested the use of “synthetic” data, information created by AI systems and then used to train large language models (LLMs), as they reach the limits of human-made material that can improve the cutting-edge technology.
Research published in Nature on Wednesday suggests the use of such data could lead to the rapid degradation of AI models. One trial using synthetic input text about medieval architecture descended into a discussion of jackrabbits after fewer than 10 generations of output.
The work underlines why AI developers have hurried to buy troves of human-generated data for training, and raises questions of what will happen once those finite sources are exhausted.
“Synthetic data is amazing if we manage to make it work,” said Ilia Shumailov, lead author of the research. “But what we are saying is that our current synthetic data is probably erroneous in some ways. The most surprising thing is how quickly this stuff happens.”
The paper explores the tendency of AI models to collapse over time because of the inevitable accumulation and amplification of mistakes from successive generations of training. The speed of the deterioration is related to the severity of shortcomings in the design of the model, the learning process and the quality of data used.
The early stages of collapse typically involve a “loss of variance”, which means majority subpopulations in the data become progressively over-represented at the expense of minority groups. In late-stage collapse, all parts of the data may descend into gibberish.
“Your models lose utility because they are overwhelmed with all of the errors and misconceptions that are introduced by previous generations — and the models themselves,” said Shumailov, who carried out the work at Oxford university with colleagues from Cambridge, Imperial College London, Edinburgh and Toronto.
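The “loss of variance” described above can be illustrated with a toy simulation. The sketch below is not the researchers' experiment and involves no language model: it assumes a one-dimensional bell curve stands in for the training data, and the function name `simulate_collapse` and its parameters are hypothetical. Each generation fits a mean and spread to a small sample drawn only from the previous generation's fit.

```python
import random
import statistics


def simulate_collapse(generations=200, samples_per_generation=10, seed=0):
    """Return the fitted standard deviation after each generation.

    Toy assumption: every "model" is just a normal distribution, and each
    new model is trained only on samples drawn from the previous one.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for human-made data
    history = [std]
    for _ in range(generations):
        # Draw a small synthetic dataset from the current model ...
        data = [rng.gauss(mean, std) for _ in range(samples_per_generation)]
        # ... and fit the next model to that synthetic data alone.
        mean = statistics.fmean(data)
        std = statistics.pstdev(data)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 10, 50, 100, 200):
        print(f"generation {generation:3d}: std = {history[generation]:.3g}")
```

In this toy setting the fitted spread tends to shrink because each small sample under-represents the tails of the distribution it was drawn from, and that error compounds from one generation to the next. Larger samples slow the drift but do not remove it.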

Full report: Researchers suggest that using “synthetic” data, which is created by AI systems to train other AI systems, could lead to the rapid degradation of AI models and their eventual collapse.