Mostly AI aims to overcome the AI training plateau with synthetic text based on proprietary datasets
Synthetic data startup Mostly AI Solutions MP GmbH said today it’s looking to solve the challenge of finding enough text-based training data for artificial intelligence models.
Its proposed solution is a new, synthetic text generator that can transform proprietary data into something much more useful for AI developers.
The startup says that finding data to train AI models has become a major headache for developers as they have largely exhausted the most useful, publicly available datasets out there. Of course, most organizations also have their own proprietary data that could be used, but they’re loath to do so due to privacy concerns.
It’s a problem that needs to be solved, according to Mostly AI Chief Executive Tobias Hann, who believes the lack of available training data is having a negative impact on the quality of newer AI models. “AI training is hitting a plateau as models exhaust public data sources and yield diminishing returns,” he insisted.
Mostly AI wants to help developers by taking their proprietary data and using it to generate synthetic text, which can then be used to fine-tune AI. In other words, what it’s doing is transforming proprietary text such as emails, customer support transcripts and chatbot conversations into a resource for AI, without compromising privacy.
The company says the shortage of text for AI training has become acute, citing a Gartner Inc. forecast that 75% of companies will be using generative AI to create synthetic customer data by 2026, up from less than 5% of companies today.
But the problem with using proprietary, or “real,” text data is that it often contains sensitive information, such as customers’ personally identifiable information, which means it cannot be exposed to large language models. In addition, these datasets might not be ideal for LLM training because they lack diversity, which results in low-quality outputs.
Synthetic data offers companies an alternative, yet at the same time, it can benefit immensely from being grounded in proprietary data that contains more useful insights relating to the owner’s business.
Putting proprietary data to work in AI
Mostly AI creates a synthetic representation of a customer’s proprietary text data, which reflects both the text and the structured insights within those proprietary datasets. By integrating structured and unstructured information, it enables organizations to create a complete and statistically accurate, yet safe-to-use version of their proprietary data assets that can then be used to fine-tune AI systems in a compliant way.
The startup also claims that its synthetic data is of extremely high quality. According to the company, its synthetic text generator significantly outperforms text generated by prompting rival generative AI models such as GPT-4o-mini.
“When training a downstream text classifier, synthetic text generated by the Mostly AI Platform delivers performance improvement as much as 35% compared to text generated by prompting GPT-4o-mini providing either no or just a few real-world examples,” the company said.
Holger Mueller of Constellation Research Inc. said AI has evolved to the stage where it can now be helpful in training other forms of AI. “Mostly AI shows us how this is possible, using AI to create synthetic data out of sensitive data, so it can be used to train AI systems,” the analyst said. “It’s an elegant solution coming from the startup out of Austria, addressing two key problems for AI today — the lack of data, and the lack of respect for privacy of data.”
The startup says that with today’s launch of its synthetic text generator, users will be able to take any model from a platform such as Hugging Face and fine-tune it with synthetic data that’s as rich and accurate as their proprietary text.
“To harness high-quality proprietary data, which offers far greater value and potential than the residual public data currently being used, enterprises must take the leap and leverage both structured and unstructured synthetic data to safely train and deploy forthcoming generative AI solutions,” Hann said.
Image: SiliconANGLE/Microsoft Designer