UPDATED 19:10 EST / MARCH 19 2024

Databricks acquires AI dataset management startup Lilac

Databricks Inc. has acquired Lilac AI Inc., a startup with a tool that helps developers manage the text datasets they use in artificial intelligence projects. 

The companies announced the deal today without disclosing the financial terms. Boston-based Lilac AI was founded by Daniel Smilkov and Nikhil Thorat, two former Google LLC engineers who helped build TensorFlow.js. TensorFlow.js is the component of TensorFlow, the search giant’s popular AI development framework, that lets developers write machine learning applications in JavaScript.

Developing a language model requires software teams to assemble and analyze large volumes of text. First, developers must create a collection of documents on which the model can be trained. Once training is complete, the model’s outputs have to be reviewed to determine whether the text it generates meets accuracy requirements.

“Exploring and understanding these datasets is critical for building quality GenAI apps,” Databricks co-founder Matei Zaharia and other executives explained in a blog post today. “However, analyzing unstructured text data can become highly cumbersome and extremely difficult in the age of GenAI. Historically, this process has been marred by manual, labor-intensive methods that lack scalability.”

Lilac AI has developed an open-source tool, Lilac, that promises to streamline the task. The software is used by Databricks, Cohere Inc. and other players in the AI software market.

One of Lilac’s flagship features is a so-called clustering capability that’s powered by a built-in AI model. It can analyze the documents that make up a text dataset, organize similar documents into groups and generate a description of each group. Lilac could, for example, determine that two-thirds of the items in an AI training dataset are book summaries while the rest are math questions.
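The grouping-and-labeling workflow can be illustrated with a minimal sketch. It is not Lilac’s API or its clustering model: the function names (`embed`, `cluster`, `describe`), the stop-word list, the similarity threshold and the sample documents are all assumptions made for illustration, and simple word counts stand in for a learned AI model.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real tool would use a learned embedding model instead.
STOP_WORDS = {"the", "a", "of", "for", "when"}


def embed(text):
    """Toy bag-of-words vector (word -> count), standing in for a real embedding."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Greedy grouping: a document joins the first cluster whose seed is similar enough."""
    vectors = [embed(d) for d in docs]
    clusters = []  # each cluster is a list of document indices; index 0 is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def describe(docs, members, top_n=3):
    """Describe a cluster by its most frequent words."""
    counts = Counter()
    for i in members:
        counts.update(embed(docs[i]))
    return [word for word, _ in counts.most_common(top_n)]


docs = [
    "Summary of the novel: the book follows a sailor",
    "Summary of the book: a novel about a family",
    "Solve the equation x plus two equals five",
    "Solve the equation for y when y minus one equals three",
]

groups = cluster(docs)
for members in groups:
    print(members, describe(docs, members))
```

On this sample, the two book summaries land in one group and the two math questions in another. A production system would replace the word counts with model-generated embeddings and the greedy loop with a more robust clustering algorithm.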

Developers can use the tool to find parts of a training dataset that should be removed. If a software team is building an AI model that generates book summaries, the dataset with which the model is developed doesn’t necessarily need to include math questions. Removing unnecessary items can speed up training and improve the accuracy of the model’s responses.
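That kind of pruning amounts to a filter over cluster labels. The sketch below assumes each record already carries a label from an earlier clustering step; the record layout, the `cluster` field and both function names are hypothetical and do not come from Lilac.

```python
from collections import Counter

# Hypothetical labeled records; in practice the labels would come from a clustering step.
records = [
    {"text": "A summary of a novel about a sailor.", "cluster": "book summaries"},
    {"text": "A summary of a family saga.", "cluster": "book summaries"},
    {"text": "Solve for x: x + 2 = 5.", "cluster": "math questions"},
]


def cluster_shares(records):
    """Return the fraction of the dataset that falls into each cluster."""
    counts = Counter(r["cluster"] for r in records)
    total = len(records)
    return {label: n / total for label, n in counts.items()}


def drop_clusters(records, unwanted):
    """Keep only the records whose cluster label is not in `unwanted`."""
    return [r for r in records if r["cluster"] not in unwanted]


print(cluster_shares(records))
filtered = drop_clusters(records, {"math questions"})
print(len(filtered), "records kept")
```

Checking the per-cluster shares first shows how much data a filter would discard before anything is removed.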

Lilac also lends itself to other tasks. It includes a dashboard that can be used to compare individual records from a dataset with one another, which is useful for assessing the impact of dataset updates. It also allows developers to turn text data into embeddings, mathematical representations that are easier for AI models to understand.
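To make the idea of an embedding concrete, here is a minimal sketch that maps text to a fixed-length numeric vector using the hashing trick. This is an assumption chosen for simplicity, not the method Lilac uses: real embeddings come from trained models, and the function names and the 16-dimension size are purely illustrative.

```python
import hashlib
import math
import re


def hashed_embedding(text, dims=16):
    """Map text to a unit-length vector by hashing each word into one of `dims` buckets."""
    vec = [0.0] * dims
    for word in re.findall(r"[a-z]+", text.lower()):
        # hashlib gives a hash that is stable across runs, unlike the built-in hash().
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def similarity(a, b):
    """Dot product of two unit vectors, which equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


summary = hashed_embedding("A summary of a novel about a sailor")
question = hashed_embedding("Solve the equation x plus two equals five")
print(len(summary), similarity(summary, question))
```

Once text is in vector form, comparing two records reduces to arithmetic on their vectors, which is what makes the representation convenient for models and for tools that search or group records.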

Lilac AI offers a paid cloud version of its namesake tool that includes additional features. According to the company, there’s an upgraded clustering capability that can organize one million records into groups in 20 minutes. The cloud version also includes tools that make it easier to edit large datasets.

Databricks plans to integrate Lilac AI’s software into its flagship data management and AI platform. The addition will complement the technology the company obtained through its $1.3 billion acquisition of MosaicML Inc., announced in June 2023. MosaicML built an AI development platform of the same name along with several prepackaged language models.

Image: Unsplash
