UPDATED 19:10 EDT / MARCH 19 2024


Databricks acquires AI dataset management startup Lilac

Databricks Inc. has acquired Lilac AI Inc., a startup with a tool that helps developers manage the text datasets they use in artificial intelligence projects. 

The companies announced the deal today without disclosing the financial terms. Boston-based Lilac AI was founded by Daniel Smilkov and Nikhil Thorat, two former Google LLC engineers who helped build TensorFlow.js. That’s the component of TensorFlow, the search giant’s popular AI development tool, that lets developers write machine learning applications in JavaScript.

Developing an AI model requires software teams to assemble and analyze large volumes of text. First, developers must create a collection of documents on which the model can be trained. Once training is complete, the AI’s outputs have to be reviewed to determine if the text it generates meets accuracy requirements.

“Exploring and understanding these datasets is critical for building quality GenAI apps,” Databricks co-founder Matei Zaharia and other executives explained in a blog post today. “However, analyzing unstructured text data can become highly cumbersome and extremely difficult in the age of GenAI. Historically, this process has been marred by manual, labor-intensive methods that lack scalability.”

Lilac AI has developed an open-source tool, Lilac, that promises to streamline the task. The software is used by Databricks, Cohere Inc. and other players in the AI software market.

One of Lilac’s flagship features is a so-called clustering capability that’s powered by a built-in AI model. It can analyze the documents that make up a text dataset, organize similar documents into groups and generate a description of each group. Lilac could, for example, determine that two-thirds of the items in an AI training dataset are book summaries while the rest are math questions.
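To make the grouping idea concrete, here is a minimal Python sketch of text clustering. It is not Lilac's implementation: Lilac relies on a built-in AI model, while this stand-in simply groups documents by word overlap (Jaccard similarity). The function names, the sample documents and the 0.2 threshold are all assumptions chosen for illustration.

```python
import re

def tokens(text):
    """Return the set of lowercase words in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Word overlap between two token sets, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.2):
    """Greedy clustering: a document joins the first group whose seed
    document is similar enough; otherwise it starts a new group."""
    groups = []  # each entry: (seed token set, list of document indexes)
    for i, doc in enumerate(docs):
        doc_tokens = tokens(doc)
        for seed, members in groups:
            if jaccard(seed, doc_tokens) >= threshold:
                members.append(i)
                break
        else:
            groups.append((doc_tokens, [i]))
    return [members for _, members in groups]

# Hypothetical toy dataset: four book summaries and two math questions.
docs = [
    "Summary of the novel: the hero leaves home and returns changed.",
    "Summary of the novel: two families feud across generations.",
    "Solve for x: 2x + 3 = 11.",
    "Summary of the novel: a detective uncovers a hidden crime.",
    "Solve for x: 5x - 4 = 21.",
    "Summary of the novel: a sailor hunts a white whale.",
]

groups = cluster(docs)
# Share of the dataset that falls into each group.
shares = [len(members) / len(docs) for members in groups]

for members, share in zip(groups, shares):
    print(f"{share:.0%} of records, e.g. {docs[members[0]]!r}")
```

On this toy data, the first group holds the four summaries, or two-thirds of the records, and the second holds the two math questions. A production tool would swap the word-overlap measure for learned embeddings and would generate the group descriptions with a language model.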

Developers can use the tool to find parts of a training dataset that should be removed. If a software team is building an AI model that generates book summaries, the dataset with which the model is developed doesn’t necessarily need to include math questions. Removing unnecessary items can speed up training and improve the accuracy of the model’s responses.

Lilac also lends itself to other tasks. It includes a dashboard that can be used to compare individual records from a dataset with one another, which is useful for assessing the impact of dataset updates. It also allows developers to turn text data into embeddings, numerical vectors that represent the content of a piece of text in a form that AI models can process.
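As a rough illustration of what an embedding is, the Python sketch below turns short texts into word-count vectors and compares them with cosine similarity. This is a simplified stand-in, not how Lilac or any particular embedding model works: real embeddings are dense vectors produced by a trained neural network. The helper names and sample texts are assumptions made for the example.

```python
import math
import re

def words(text):
    """Split a text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(texts):
    """Assign every distinct word a fixed position in the vector."""
    vocab = sorted({w for t in texts for w in words(t)})
    return {w: i for i, w in enumerate(vocab)}

def embed(text, vocab):
    """Word-count vector: one dimension per vocabulary word."""
    vec = [0.0] * len(vocab)
    for w in words(text):
        if w in vocab:
            vec[vocab[w]] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity: 1.0 for the same direction, 0.0 for no overlap."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical sample records.
texts = [
    "book summary of a classic novel",
    "short summary of a modern novel",
    "solve the equation for x",
]

vocab = build_vocab(texts)
vectors = [embed(t, vocab) for t in texts]

print(cosine(vectors[0], vectors[1]))  # the two summaries share four words
print(cosine(vectors[0], vectors[2]))  # the summary and the math question share none
```

Once text is represented as vectors, comparing two records becomes simple arithmetic, which is what makes tasks such as grouping similar documents practical at scale.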

Lilac AI offers a paid cloud version of its namesake tool that includes additional features. According to the company, there’s an upgraded clustering capability that can organize one million records into groups in 20 minutes. The cloud version also includes tools that make it easier to edit large datasets.

Databricks plans to integrate Lilac AI’s software into its flagship data management and AI platform. The addition will complement the technology the company obtained through its $1.3 billion acquisition of MosaicML Inc. in June 2023. MosaicML developed an AI development platform of the same name along with several prepackaged language models.

