Snorkel AI raises $85M at $1B valuation to create AI training datasets automatically
Stanford University spinoff Snorkel AI Inc. today announced that it has raised $85 million in funding to continue commercializing its namesake software tool, which is used by companies such as Apple Inc. and Google LLC to speed up their machine learning projects.
The funding round values Snorkel AI at $1 billion. Addition and BlackRock co-led the investment, and Google’s GV venture capital arm was among the more than a half-dozen institutional backers that participated.
A large portion of enterprise AI models are developed through a method known as supervised learning. In supervised learning, a company assembles a training dataset that contains pairs of questions and answers. The company then provides this training dataset to an AI model. The AI studies the dataset and learns to answer questions on its own.
In enterprise AI projects, the “questions” that make up an AI training set are pieces of data that a company is looking to process. The “answers” are tags that describe each piece of data. For example, to build an AI model that sorts scientific papers by topic, a company would need a training dataset in which the questions are scientific papers and the answers are tags describing the topic of each paper.
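As a rough illustration of the question-and-answer pairing described above, the sketch below builds a tiny training set of paper texts and topic tags, then "trains" a deliberately simple word-count model on it. The sample texts, the `train` and `predict` helpers and the word-overlap scoring are all hypothetical and chosen for brevity; production models are far more sophisticated.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set: each entry pairs a "question"
# (paper text) with an "answer" (topic tag).
TRAINING_SET = [
    ("qubit entanglement and quantum gates", "quantum computing"),
    ("superconducting qubit error correction", "quantum computing"),
    ("neural network training with gradient descent", "machine learning"),
    ("gradient boosting for classification", "machine learning"),
]


def train(pairs):
    """Count how often each word appears under each tag."""
    counts = defaultdict(Counter)
    for text, tag in pairs:
        counts[tag].update(text.lower().split())
    return counts


def predict(counts, text):
    """Pick the tag whose training words overlap most with the text."""
    words = text.lower().split()
    return max(counts, key=lambda tag: sum(counts[tag][w] for w in words))


model = train(TRAINING_SET)
print(predict(model, "a new qubit design"))
```

The point is only the shape of the data: every example needs a tag before the model can learn from it, which is why labeling becomes the bottleneck.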
The challenge Snorkel AI is tackling is that the task of creating datasets for training AI models must often be done manually. Companies sometimes spend months manually assembling the data, which slows down AI development and comes with considerable costs. The more complex the project, the more training data is needed and the longer it takes to assemble it.
Snorkel AI’s solution is to automate the process. Normally, AI training datasets are created by manually assembling thousands of individual data points or more. Snorkel AI has developed a platform that largely removes the need for manual work and instead allows developers to write code to perform the task automatically.
The reason assembling AI datasets can be automated with code is that much of the work involved in the task consists of adding tags to pieces of data.
If the training dataset consists of scientific papers, then the tags might be keywords describing the topic of each paper, such as “machine learning” or “quantum computing.” Adding tags to data is often a simple, rule-based process, which makes it relatively easy to automate. For example, a developer could create a code snippet that simply assigns the label “quantum computing” to every scientific paper that contains the keyword “qubit.”
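The “qubit” rule described above could look something like the following minimal sketch. The function name `label_quantum`, the `ABSTAIN` marker and the sample papers are assumptions made for illustration; this is plain Python, not the API of Snorkel AI’s platform.

```python
# Hypothetical marker meaning "this rule has no opinion about the paper".
ABSTAIN = None


def label_quantum(paper_text):
    """Tag a paper "quantum computing" if it mentions "qubit"; otherwise abstain."""
    if "qubit" in paper_text.lower():
        return "quantum computing"
    return ABSTAIN


# Made-up sample papers for illustration only.
papers = [
    "We demonstrate a 50-qubit superconducting processor.",
    "A survey of convolutional networks for image recognition.",
]

labels = [label_quantum(paper) for paper in papers]
print(labels)
```

A rule like this can tag thousands of papers in seconds, but it will also mislabel any paper that mentions qubits only in passing.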
Snorkel AI has found a way to apply this approach to enterprise AI projects and eliminate much of the manual work involved in creating training datasets. The startup’s platform enables developers to assemble training datasets by creating scripts, or functions as the startup calls them, that automatically perform data tagging. The startup says it can help companies cut the time required to build an AI model by a factor of up to 100, thereby shortening projects that normally take months to just a few weeks.
To achieve that speedup, Snorkel AI’s team had to overcome a major technical challenge. Writing a piece of code that assigns tags to pieces of data such as scientific papers is fairly simple. However, there’s a catch: Training datasets created with this method frequently contain a large number of errors. In some cases, there are so many errors that the dataset can’t be used for AI training.
Snorkel AI’s platform uses statistical methods to identify and filter the errors that are generated during the process, so that only the accurate training data remains. Companies can then use it to build their AI models.
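The article does not detail the statistical methods involved, so the sketch below substitutes a much simpler stand-in: several hypothetical tagging rules vote on each paper, and a paper is kept only when a clear majority of the rules that voted agree. The rule names, the `min_agreement` threshold and the sample texts are all assumptions for illustration, not Snorkel AI’s actual technique.

```python
from collections import Counter

ABSTAIN = None  # Hypothetical marker: the rule has no opinion.


def lf_qubit(text):
    return "quantum computing" if "qubit" in text.lower() else ABSTAIN


def lf_quantum(text):
    return "quantum computing" if "quantum" in text.lower() else ABSTAIN


def lf_neural(text):
    return "machine learning" if "neural network" in text.lower() else ABSTAIN


RULES = [lf_qubit, lf_quantum, lf_neural]


def aggregate(text, rules=RULES, min_agreement=0.6):
    """Return the majority tag, or ABSTAIN if the rules are silent or split."""
    votes = [v for v in (rule(text) for rule in rules) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) < min_agreement:
        return ABSTAIN  # Too much disagreement: treat as a likely error.
    return label


def build_dataset(texts):
    """Keep only the texts that received a confident tag."""
    tagged = ((text, aggregate(text)) for text in texts)
    return [(text, label) for text, label in tagged if label is not ABSTAIN]


papers = [
    "A qubit-based quantum processor",
    "A quantum-inspired neural network",
    "A history of chemistry",
]
print(build_dataset(papers))
```

Here the second paper is dropped because the rules disagree about it, and the third because no rule fires. Simple voting treats every rule as equally reliable, which is one reason more careful statistical modeling is needed in practice.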
The statistical methods used by the platform to spot errors were developed by Snorkel AI’s founding team at the Stanford AI Lab over the course of more than five years. The team initially made the technology available as an open-source tool called Snorkel. In 2019, the founders launched Snorkel AI, whose platform is a commercial version of that open-source tool with additional features to simplify related aspects of AI development projects.
The platform not only enables companies to create training datasets with code but also provides capabilities for building machine learning algorithms using those datasets. Snorkel AI has also included a number of features that can help developers detect if a machine learning algorithm generates inaccurate results after it’s deployed and fix the issue.
The original open-source tool on which the platform is based is used by tech giants such as Intel Corp., Apple and Google. According to Snorkel AI, Google used its software to create a training dataset in 30 minutes that previously would have required more than six months to assemble. At Apple, engineers used the software to cut the number of errors in some AI applications nearly threefold.
The $85 million funding round Snorkel AI announced today will help the startup bring its technology to more companies and expand its platform’s feature set. “With this $85 million investment, we plan to accelerate turning our years of breakthrough AI research – and the continued advances of our team – into core product capabilities within Snorkel Flow,” Snorkel AI co-founder and Chief Executive Officer Alex Ratner wrote in a blog post. “We also plan to significantly accelerate the build-out of our go-to-market and customer success infrastructure.”
Snorkel AI has raised a total of $135 million to date.