UPDATED 09:00 EDT / JULY 23 2024

Iterative debuts DataChain for curating and processing unstructured data with AI models

A startup called Iterative Inc., which is focused on helping to improve and streamline workflows for artificial intelligence engineers, today announced a new open-source tool called DataChain that it says will transform the way unstructured data is curated, processed and evaluated, by using large language models.

The company explains that unstructured data is vital for training and fine-tuning the most advanced and sophisticated AI models today. But though it makes up the bulk of information stored by companies on their servers and systems, it’s not easy to put it to use.

A recent survey by McKinsey & Company on the state of AI, published in early 2024, shows that just 15% of companies have managed to deploy generative AI systems that have had a meaningful impact on their business and their bottom lines. That study reveals that one of the major challenges is the problem of processing unstructured data at large scale, and estimating the results of those operations.

Iterative Chief Executive Dmitry Petrov told SiliconANGLE that the problem with managing unstructured data is that existing tools are designed for structured data, such as tables in databases and spreadsheets. With structured data, it’s simple to add extra information, such as income or revenue, to the existing rows. But that’s not the case with unstructured data, such as images, audio and PDFs, which is stored as files instead of being neatly organized in tables.

“These files cannot be easily enriched with additional information using traditional methods,” Petrov said. “You need to keep this extra information somewhere, without modifying the files.”

This is why AI engineers need special tooling to manage and curate unstructured data effectively. Such tools can handle the files themselves, maintain metadata about the files, and manage the relationships between those files.

“Properly curating unstructured data [in this way] is essential because it allows AI models to access and process the information accurately, leading to better insights and results,” Petrov said.

The problem Iterative is trying to overcome with DataChain is that such tools don’t really exist, Petrov said. He added that this lack of tools has become the biggest single bottleneck in the AI development chain, and he believes that the answer to this problem lies in AI itself.

“We need AI models that can evaluate and improve other AI models,” he said. “So far this has only happened at the industry forefront,” he added, citing Google LLC’s DeepMind, which trained its AlphaGo model by having it play against itself, and OpenAI, which has curated its own dataset for DALL-E 3.

Petrov said most AI engineers are still forced to create custom code to convert their JSON model responses and adapt them to databases, and they’re still running AI models in parallel with out-of-memory data. There’s a lot of potential for DataChain to change this, he believes.
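To make that kind of glue code concrete, here is a minimal sketch of what converting a JSON model response into a database row can look like. It is not taken from Iterative or DataChain: the response format, the field names and the `annotations` table are all hypothetical, and only Python's standard `json` and `sqlite3` modules are used.

```python
import json
import sqlite3

# Hypothetical JSON response from a model; the field names are invented for illustration.
RESPONSE = (
    '{"file": "photos/img_001.jpg", '
    '"labels": {"time_of_day": "night", "confidence": 0.92}}'
)


def store_response(conn, raw):
    """Flatten one nested JSON model response into a flat relational row."""
    data = json.loads(raw)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotations "
        "(file TEXT PRIMARY KEY, time_of_day TEXT, confidence REAL)"
    )
    # Every nested field has to be mapped to a column by hand.
    conn.execute(
        "INSERT OR REPLACE INTO annotations VALUES (?, ?, ?)",
        (
            data["file"],
            data["labels"]["time_of_day"],
            data["labels"]["confidence"],
        ),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
store_response(conn, RESPONSE)
rows = conn.execute(
    "SELECT file, time_of_day, confidence FROM annotations"
).fetchall()
print(rows)
```

Each new model or response format needs its own mapping of this kind, which is the repetitive work Petrov describes.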

He explained that DataChain adds a “meta-layer” that holds information about the files along with additional meta information, without modifying the files themselves. Users can slice and dice their files in numerous ways using this meta-layer, and they can do it in natural language, since the tool is powered by its own LLM. So a user could ask DataChain to surface images from East Coast plans, but omit those that weren’t taken at nighttime.

“If this information is not enough, users can enrich the data with more meta information by running models such as ChatGPT or Mistral, and asking them which of the photos were taken at night time,” Petrov said.

“In many ways, DataChain is just like using SQL queries for structured data, but the difference is that it interacts with files and meta attributes,” he added. “It’s about processing and curating unstructured data using LLM and local models.”
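The meta-layer idea can be illustrated with a small, self-contained sketch. This is not DataChain's API: `FileRecord`, `enrich`, `select` and `fake_time_of_day` are hypothetical names invented here, and the "model" is a stand-in function rather than a real LLM call. The point is only that metadata lives beside the file paths, can be extended by running a model over the files, and can then be queried much like rows in a table.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """One entry in the hypothetical meta-layer: a file path plus its attributes."""
    path: str
    meta: dict = field(default_factory=dict)


def enrich(records, key, annotate):
    """Add a new meta attribute to every record; the files are never modified."""
    for record in records:
        record.meta[key] = annotate(record.path)
    return records


def select(records, **conditions):
    """Keep records whose meta attributes equal all the given values (like SQL WHERE)."""
    return [
        r for r in records
        if all(r.meta.get(k) == v for k, v in conditions.items())
    ]


def fake_time_of_day(path):
    # Stand-in for asking a real model whether a photo was taken at night.
    return "night" if "night" in path else "day"


records = [
    FileRecord("east/site_a_night.jpg", {"region": "east"}),
    FileRecord("east/site_a_noon.jpg", {"region": "east"}),
    FileRecord("west/site_b_night.jpg", {"region": "west"}),
]

enrich(records, "time_of_day", fake_time_of_day)
selected = select(records, region="east", time_of_day="night")
print([r.path for r in selected])
```

In a real tool the annotation step would call a model and the query would run against persistent storage, but the separation between files and their meta attributes is the same.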

Image: SiliconANGLE/Microsoft Designer
