UPDATED 17:39 EDT / NOVEMBER 09 2023


OpenAI launches partner initiative focused on creating AI training datasets

OpenAI LP today announced a new initiative, OpenAI Data Partnerships, through which it will collect records from other organizations to create artificial intelligence training datasets.

The quality of training files directly influences the reliability of the neural network they’re used to build. The more relevant the dataset, the more accurately the neural network can answer users’ questions. Creating a high-quality dataset is often a time-consuming and expensive process, which is likely one reason OpenAI is seeking the help of external organizations.

One goal of the company’s new partner initiative is to assemble private datasets that can be used to train its foundation models. Additionally, OpenAI will use the records for model customization. Last week at its DevDay product event, the company debuted a program that allows enterprises to customize GPT-4 for their requirements by “modifying every step of the model training process.”

Another goal of the initiative is to create an open-source AI dataset that will be free for developers to use. According to OpenAI, the dataset will be specifically geared towards language model projects. The company added that it may consider using the files in the dataset to build and publish open-source AI models.

OpenAI already offers a collection of open-source neural networks. The two newest additions to the lineup, Whisper large-v3 and Consistency Decoder, made their debut at the company’s DevDay event last week. They focus on transcription and image generation tasks, respectively.

Several early participants signed up for the OpenAI Data Partnerships initiative ahead of its debut today. The Icelandic government and Miðeind ehf, a Reykjavík-based software company, are working with OpenAI to make GPT-4 more fluent in Icelandic. Meanwhile, the nonprofit organization Free Law Project is contributing a collection of legal documents.

“We’re interested in large-scale datasets that reflect human society and that are not already easily accessible online to the public today,” OpenAI detailed in a blog post. “We’re particularly looking for data that expresses human intention (e.g. long-form writing or conversations rather than disconnected snippets), across any language, topic, and format.”

OpenAI is seeking multiple types of training data including text, images, audio and video. That suggests the company plans to use files contributed by partners to train not only language models, but also other types of neural networks such as image generators. OpenAI will accept training datasets even if they contain errors or are stored in a format that is difficult to process. 

“We can work with data in almost any form and can use our next-generation in-house AI technology to help you digitize and structure your data,” OpenAI stated. 

Image: Unsplash
