OpenAI seeks partnerships to generate AI training data

11/10/2023

It’s an open secret that the data sets used to train AI models are deeply flawed. Image corpora tend to be U.S.- and Western-centric, partly because Western images dominated the internet when the data sets were compiled. And as most recently highlighted by a study out of the Allen Institute for AI, the data used to train large language models like Meta’s Llama 2 contains toxic language and biases. Models amplify these flaws in harmful ways.

Now, OpenAI says that it wants to combat them by partnering with outside institutions to create new, hopefully improved data sets. OpenAI today announced Data Partnerships, an effort to collaborate with third-party organizations to build public and private data sets for AI model training. In a blog post, OpenAI says Data Partnerships is intended to “enable more organizations to help steer the future of AI” and “benefit from models that are more useful.”

“To ultimately make [AI] that is safe and beneficial to all of humanity, we’d like AI models to deeply understand all subject matters, industries, cultures and languages, which requires as broad a training data set as possible,” OpenAI writes. “Including your content can make AI models more helpful to you by increasing their understanding of your domain.”

As part of the Data Partnerships program, OpenAI says that it’ll collect “large-scale” data sets that “reflect human society” and that aren’t easily accessible online today. While the company plans to work across a wide range of modalities, including images, audio and video, it’s particularly seeking data that “expresses human intention” (e.g. long-form writing or conversations) across different languages, topics and formats.

Full story: OpenAI seeks partnerships with diverse organizations to build new AI training data sets.

