UPDATED 08:00 EDT / MARCH 14 2024

AI-focused big data startup Unstructured raises $40M to make raw data LLM-ready

Generative artificial intelligence data processing startup Unstructured Technologies Inc. has closed its second major funding round in less than a year, announcing a $40 million raise.

Today’s Series B round was led by Menlo Ventures with participation from a host of big-name backers, including Nvidia Corp.’s venture capital arm, IBM Ventures, Databricks Ventures and angel investors such as Sacramento Kings Chairman Vivek Ranadivé, Datastax Inc. Chief Executive Chet Kapoor and Allison Pickens of the New Normal Fund.

Existing investors, including Madrona, Bain Capital Ventures and Mango Capital, also invested in the round, which follows a $25 million raise announced in July 2023. All told, Unstructured has now raised more than $65 million in funding.

Unstructured is getting a lot of attention because it’s a pioneer in the area of converting unstructured data such as images, written notes, audio, video and so on into formats that can be read easily by large language models. It’s an extremely interesting proposition for many companies, as LLMs are the class of AI models that power generative AI services such as OpenAI’s ChatGPT and Google LLC’s Gemini, and few people will need reminding how popular they are these days.

The startup notes that more than half of organizations globally have stepped up their investments in generative AI tech over the past year, but they face a massive data challenge. Although structured data has long been available for advanced analytics thanks to innovations around modern data stacks, there is no easy way to take advantage of unstructured data, which accounts for more than 80% of all information stored by enterprises. If generative AI can find a way to access this information more easily, it’s likely to vastly improve its capabilities and make chatbots and other applications more powerful than ever.

This is the challenge that Unstructured is determined to tackle, and it claims to be the first and only company that can ingest and transform any unstructured data type into a format that can be immediately used by LLMs.

The startup provides customers with a platform that offers three starting points: an open-source Python library, containers and a cloud-hosted application programming interface. The API can process more than 20 natural language file types, including raw data and LLM-ready files. It comes with multiple enterprise-grade data connectors to services, including Microsoft Corp.’s Azure Blob and OneDrive, Amazon Web Services Inc.’s S3, Google LLC’s Cloud Storage and Google Drive, plus Dropbox and Elasticsearch.

Unstructured, which was founded in 2022 by former U.S. Central Intelligence Agency analyst Brian Raymond, developed its technology in collaboration with the open-source community, commercial enterprises and a number of U.S. government defense and intelligence organizations. The startup has been awarded Phase I and Phase II Small Business Innovation Research contracts by the U.S. Air Force and Space Force, with additional support coming via the U.S. Special Operations Command.

Since launching its platform that same year, Unstructured has become a valuable tool for organizations looking to put their LLMs into production. Its technology enables users to automate the transformation of unstructured data and make it usable for LLM training, fine-tuning and retrieval-augmented generation, or RAG, in which pretrained generative AI models access additional data to augment their knowledge.
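For readers unfamiliar with RAG, the idea can be illustrated with a minimal Python sketch. It is not Unstructured's code or API: the function names, the sample text chunks and the simple word-overlap ranking are all assumptions made for illustration, and production systems typically rank chunks with learned embeddings instead.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks that share the most words with the question.

    Word overlap is a stand-in for the similarity search a real RAG
    system would run against a vector database.
    """
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[str], k: int = 1) -> str:
    """Prepend the retrieved chunks to the question as extra context."""
    context = "\n".join(retrieve(question, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical preprocessed chunks, standing in for cleaned document text.
chunks = [
    "The Series B round was led by Menlo Ventures.",
    "The open-source library has been downloaded more than six million times.",
]
prompt = build_prompt("Who led the Series B round?", chunks)
print(prompt)
```

The resulting prompt, which now carries the relevant chunk alongside the question, is what would be sent to the model. The quality of that context depends on how cleanly the source documents were converted to text in the first place.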

CEO Raymond said the development of LLMs nested in RAG architectures has enabled companies to build a new generation of LLMs and analytics products based on unstructured data. “For the first time, developers are able to interact with all of their data through large foundation models,” he said.

According to Raymond, the ability to ingest and pre-process human-generated data is a critical bottleneck to realizing the value of LLMs, and his company will be the one to help organizations overcome it. “2024 will be the year of moving LLM prototypes into production and organizations of all types and sizes are hungry to build out these architectures efficiently and at scale,” he said. “Automating the process of structuring data and seamlessly delivering it into storage is critical for enterprises that want to build solutions on this new tech stack and go to market quickly.”

Constellation Research Inc. Vice President and Principal Analyst Andy Thurai told SiliconANGLE that data preparation is one of the forgotten aspects of AI development, as the work is much less exciting than prompt engineering, RAG and the actual end products, the LLMs. But he said it’s an area that can benefit tremendously from automation, as data scientists spend the bulk of their time getting data ready.

“Unstructured data can be a real mess, primarily because there are no established standards and it is difficult to find meaning within it,” Thurai said. “While vector databases help with storing unstructured data, getting the data ready to be put into a vector database or data lake is a considerable challenge.”

It’s precisely because of this challenge that Unstructured believes its platform has already become a critical piece of infrastructure for generative AI projects, transforming information into LLM-ready data and making it compatible with vector databases, which store unstructured information as numerical representations that can be accessed more easily. The company claims it can help to drive generative AI application performance improvements of up to 20% without any customization.
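To make the notion of "numerical representations" concrete, here is a toy vector store in Python. It is only a sketch under stated assumptions: the `ToyVectorStore` class, the word-count `embed` function and the sample documents are hypothetical, and real vector databases use learned dense embeddings with approximate nearest-neighbor indexes instead of a linear scan.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a sparse vector of word counts (an assumption;
    real systems use learned dense vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (vector, text) pairs in memory and searches them linearly."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self.items.append((embed(text), text))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add("Invoices are stored as scanned PDF files.")
store.add("Meeting notes are recorded as audio.")
top = store.search("Where are the scanned invoices?")
print(top)
```

Because every document is reduced to a vector, a query can be matched by similarity of meaning-bearing terms and not by exact string lookup, which is the property that makes such stores a natural fit for RAG.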

That’s why its open-source library has been downloaded more than six million times, the startup said. It’s used by more than 12,000 code bases and more than 45,000 organizations, including more than a third of the Fortune 500.

In January, Unstructured debuted its commercial software-as-a-service API, and it has since amassed more than 1,000 paying customers. The following month, it launched its enterprise platform, which is said to be the world’s first platform that can continuously extract raw information from existing databases and transform it into LLM-ready formats in close to real time before loading it into a vector database.

It provides a key advantage, as studies show that data scientists spend more than three-quarters of their time on data preparation. By providing continuous and real-time access to the latest unstructured data, Unstructured is uniquely able to keep LLMs up to date, the company says.

Thurai said Unstructured isn’t the only data preparation tool for unstructured information, but pointed out that such tools are not widely used, as many enterprises are still doing lots of manual work. What’s more, that work is becoming more difficult, he said, as the most advanced LLMs demand much more data than earlier models. “Unstructured does have good traction with its open-source downloads, and the recently announced enterprise version of its platform helps companies more by continuously extracting raw, unstructured data from existing databases, which wasn’t possible before,” Thurai said. “Unstructured’s tools can be very useful for enterprises that need to use raw, unstructured information for RAG workloads, especially given its new ability to provide models with continually updated and current information.”

Menlo Ventures partner Tim Tully unsurprisingly used even more superlatives, saying Unstructured has built an “exceptional platform” that can transform the way developers build new data pipelines for RAG, AI applications, chatbots and more. “It has become the preferred way developers build AI applications and assemble data pipelines,” he said. “People in the industry know that RAG quickly became the industry standard. Soon they will understand that Unstructured is the tip of the RAG spear.”

Unstructured said it will use the funds from today’s round to grow its engineering and sales teams and accelerate the development of its data preprocessing tools for LLMs.

Images: Unstructured
