Why computer-made data is being used to train AI models

Artificial intelligence companies are exploring a new avenue to obtain the massive amounts of data needed to develop powerful generative models: creating the information from scratch. Microsoft, OpenAI and Cohere are among the groups testing the use of so-called synthetic data, computer-generated information used to train their AI systems known as large language models (LLMs), as they reach the limits of human-made data that can further improve the cutting-edge technology. The launch of Microsoft-backed OpenAI’s ChatGPT last November has led to a flood of products rolled out publicly this year by companies including Google and Anthropic; these products can produce plausible text, images or code in response to simple prompts. The technology, known as generative AI, has driven a surge of investor and consumer interest, with the world’s biggest technology companies, including Google, Microsoft and Meta, racing to dominate the space. Currently, LLMs that power chatbots such as OpenAI’s ChatGPT and Google’s Bard are trained primarily on data scraped from the internet.

Full story: Microsoft, OpenAI, Cohere, and others are testing the use of “synthetic data” as they find generic data from the web is no longer good enough for training LLMs.
