UPDATED 10:07 EDT / OCTOBER 18 2024


Big-data dust-up: Why two AI giants are at war over who is more open

The battle for supremacy in the emerging market for data platforms supporting the coming boom in artificial intelligence development may ultimately come down to a geeky storage format that even its inventor says only 20 people in the world should care about.

Apache Iceberg, an open-source table format that acts as a management layer atop data files in cloud storage, has become the subject of an escalating war of words between Snowflake Inc. and archrival Databricks Inc. Each claims to be more committed than the other to making it easy for users to read and write data in the open format. Both have been cagey about how fully they will support open-source platforms, and each is invested in alternatives that benefit its own business.

The battle has escalated as the companies have pushed their own versions of companion software called a data catalog that is critical to unlocking the full value of data lakes — the centralized repositories that store large amounts of raw data in their native format for analysis. Snowflake released its catalog under an open-source license and charges that its rival has been intentionally evasive about its plans to do the same. Databricks charges that Snowflake’s catalog doesn’t hold a candle to its own and says it’s working to make file formats a non-issue.

Both companies say the other is trying to lock customers into proprietary technology while claiming to be open.

At stake is a market for generative AI products and services growing more than 40% annually and expected to reach $1.3 trillion by 2032, according to Bloomberg Intelligence. The choice of a data platform is crucial because training generative AI models requires gathering, cleaning and storing vast amounts of data in unstructured forms such as documents, emails, images and free-form text.

Conventional data warehouses don’t handle that kind of unstructured data well and can also be expensive to scale. Snowflake and Databricks rose to prominence by offering cheaper and more flexible alternatives.

Collision course

Iceberg’s surging popularity has now put the firms on a collision course. Iceberg simplifies data processing for large datasets in data lakes. Its flexibility to accommodate a wide range of analytical engines has found favor with customers and forced Snowflake and Databricks to embrace it as an open alternative to their proprietary architectures. The battle now is over who is more committed to openness. Neither firm has yet seized the high ground.

Snowflake kicked off the dispute two years ago when it threw its support behind Iceberg despite the threat the open table format poses to its flagship cloud data warehouse. Databricks more recently has pledged to support Iceberg fully, even though the project competes with Delta Lake, an alternative table format Databricks developed and released to open source five years ago in hopes it would become a standard.

Delta Lake has amassed a large user base of mostly existing Spark and Databricks customers. “It’s downloaded millions of times per month,” said Adam Conway, Databricks’ senior vice president of products. “It’s a massively successful project.”

But the momentum clearly favors Iceberg, a shift that has sent Databricks scrambling to make Iceberg a fully functional equivalent to Delta Lake. That isn’t a simple task. The two table formats organize, store and access data differently: Iceberg was designed for flexibility, while Delta Lake’s developers focused on transaction integrity. Each also manages metadata in its own way, tightly coupled with how it handles data. Such differences make it hard for a single engine to write to both easily.
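To make the difficulty concrete, the sketch below shows roughly what it takes today to point a single Apache Spark engine at both formats: each needs its own catalog plugin and SQL extensions, configured separately. The bucket path and table names are hypothetical, the example assumes the Iceberg and Delta Lake runtime jars are on the classpath, and it is a simplified illustration rather than either vendor’s implementation.

```python
from pyspark.sql import SparkSession

# Simplified sketch: one Spark engine, two table formats, two separately
# managed sets of table metadata. Assumes Iceberg and Delta jars are available.
spark = (
    SparkSession.builder.appName("dual-format-sketch")
    # Each format supplies its own SQL extensions (comma-separated list).
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "io.delta.sql.DeltaSparkSessionExtension",
    )
    # Iceberg: a dedicated catalog plugin that tracks table metadata itself.
    .config("spark.sql.catalog.ice", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ice.type", "hadoop")
    .config("spark.sql.catalog.ice.warehouse", "s3://example-bucket/iceberg")  # hypothetical path
    # Delta Lake: replaces the default session catalog with its own implementation.
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# The same engine now addresses two different metadata worlds.
spark.sql("CREATE NAMESPACE IF NOT EXISTS ice.db")
spark.sql("CREATE TABLE IF NOT EXISTS ice.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("CREATE TABLE IF NOT EXISTS default.events_delta (id BIGINT, ts TIMESTAMP) USING delta")
```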

“Delta Lake remains central to Databricks’ strategy, and Iceberg support within Databricks isn’t yet as robust as Delta Lake,” said Jayesh Chaurasia, an analyst at Forrester Research Inc.

Market shakeup

Snowflake and Databricks approach the problem of simplifying data management from opposite directions. “Databricks came at the market from the cloud data lake end of the spectrum, and Snowflake came at it from the cloud data warehouse end,” said David Menninger, executive director of software research with global technology research and advisory firm Information Services Group Inc.

The Databricks architecture separates table storage in Delta Lake from the compute engines that update it. That’s different from Snowflake, whose tables can only be accessed through its compute engine. Databricks has made this openness a critical distinction from its rival, saying customers should never give control of their data to a vendor. Snowflake has belatedly adopted a similar strategy, but is trying to walk a line between opening up more options for customers and cannibalizing its highly profitable cloud data warehousing business.

Snowflake shook up the market a decade ago when it introduced a platform that manages data cheaply in the cloud and scales almost infinitely. It has amassed a base of more than 10,000 customers, including many of the world’s largest companies. Its annual revenue is approaching $3 billion, and its market capitalization reached $118 billion in 2021, a year after it staged the most successful initial public offering by a software company in history.

ISG’s Menninger sees Snowflake and Databricks approaching the same problem from opposite directions. Photo: ISG

Databricks emerged at about the same time, casting itself as the more open alternative. The company’s lineage is rooted in Apache Spark, the phenomenally successful open-source data processing and analytics framework. It introduced Delta Lake in 2019 with features that addressed many of the management and quality problems that plagued data lakes. Among them are guaranteed reliable processing of database transactions, a scalable metadata model and the ability to handle streaming and batch data processing at the same time. In 2020, Databricks coined the term “data lakehouse” to describe a data storage architecture with the best features of flexible data lakes and high-performance data warehouses.
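For illustration, the hedged sketch below shows those lakehouse traits in miniature: a single Delta table accepts transactional batch appends and can simultaneously be read as a stream. It assumes a Spark session with Delta Lake enabled, and the file paths are hypothetical.

```python
# Minimal sketch: one Delta table serving batch writes and streaming reads.
# Assumes an existing Spark session ("spark") with Delta Lake enabled.
df = spark.range(1000).withColumnRenamed("id", "event_id")

# Batch write: an ACID append recorded in the table's transaction log.
df.write.format("delta").mode("append").save("/tmp/lake/events")

# Streaming read: the same table doubles as a source for continuous processing.
query = (
    spark.readStream.format("delta")
    .load("/tmp/lake/events")
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/lake/_checkpoints/events")  # hypothetical path
    .start()
)
query.awaitTermination(10)  # run briefly for demonstration purposes
```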

Databricks has surged on the popularity of its AI tooling and is seen as having the upper hand in AI and machine learning, at least for now. “Databricks has an advantage due to its integrated data and AI stack, end-to-end AI workflows, and ability to manage both AI models and data governance on the same platform,” Chaurasia said. “Snowflake is catching up, but it still has a limited AI/ML footprint compared to Databricks.”

Snowflake, on the other hand, has taken a hit recently from growing competition, the need to invest in AI tooling and some customer unease about its pricing. “Snowflake’s pricing model includes underlying cloud hardware infrastructure,” said George Gilbert, principal analyst for data and AI at SiliconANGLE sister company theCUBE Research. “To make their profit-and-loss statement appear like a software company, they mark up the hardware so that the combined price yields something close to 80% gross margins. But that makes them less competitive for price-sensitive workloads.” That leaves the company with an unenviable choice: hold the line on margins and cede price-sensitive workloads, or cut prices and erode its profitability.

Growth unexpectedly slowed last year, and the sudden departure of veteran Chief Executive Frank Slootman in February shook investors. Snowflake stock has fallen 70% from its 2021 high despite a 30% annual revenue growth rate. It has also suffered from a growing perception that its proprietary architecture is out of step with customers’ increasing preference for open platforms. “They tried to maintain control too long, and now they’re opening up on someone else’s terms,” Gilbert said.

Iceberg ahead

Iceberg was developed at Netflix Inc. and began incubating in the Apache ecosystem the same year Delta Lake was released to open source. Iceberg’s developers set out to fix some of the most vexing problems of other table formats, Ryan Blue, Iceberg’s principal architect, said in a podcast hosted by theCUBE.

Iceberg principal architect Ryan Blue said his team’s goal was to fix some of the most vexing problems of other table formats. Photo: Tabular

For example, the table format supports multiple processing engines, such as Apache Flink, Presto and the Presto fork called Trino. Iceberg has advanced partitioning capabilities, schema evolution and partition evolution, which reduce data management tasks and can dramatically increase performance on large-scale datasets. Iceberg’s approach to managing metadata operations is also considered superior to Delta Lake’s, and its “time travel” feature enables users to roll back data views to any point in time.
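The hedged sketch below shows what a few of those features look like in practice through Spark SQL, assuming an Iceberg catalog named “ice” is already configured with the Iceberg SQL extensions enabled; the table, column and snapshot identifiers are hypothetical.

```python
# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE ice.db.events ADD COLUMN country STRING")

# Partition evolution: change how new data is partitioned; older files keep
# their original layout and remain queryable.
spark.sql("ALTER TABLE ice.db.events ADD PARTITION FIELD days(ts)")

# Time travel: query the table as of an earlier snapshot or timestamp
# (the snapshot ID and timestamp below are hypothetical).
spark.sql("SELECT count(*) FROM ice.db.events VERSION AS OF 4348124361512148964").show()
spark.sql("SELECT count(*) FROM ice.db.events TIMESTAMP AS OF '2024-06-01 00:00:00'").show()
```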

A less tangible appeal of Iceberg is that it’s fully community-based. When Databricks released Delta Lake to open source in 2019, it kept some features proprietary and positioned the format as a superior option for its customers. Databricks fully open-sourced Delta Lake in 2022, but by then, Iceberg had a head of steam.

Over time, Delta Lake has achieved parity with many Iceberg features and is still considered a better choice for customers working with Spark. However, some suspicion remains in the developer community that Delta Lake is tied too closely to its original developer.

“Databricks’ support for multiple table formats aligns with its message of open data ecosystems,” said Forrester’s Chaurasia. “However, its deeper commitment to Delta Lake means some degree of reliance on Databricks’ platform may persist, even with Iceberg support.”

Snowflake has no such conflicting loyalties. Its data management technology was built before community-based open-source alternatives had fully matured. Its storage architecture is proprietary by necessity, so supporting an open option was more of an evolutionary step.

Good idea at the time

Snowflake’s Dageville says his company has always been willing to work with open-source projects. Photo: SiliconANGLE

When Snowflake built its data warehouse 12 years ago, open-source technology wasn’t an attractive option. “We didn’t take something that existed as open source and make a commercial company from it because Snowflake didn’t exist as open source,” said Snowflake co-founder and President of Product Benoît Dageville. “We looked at Parquet very closely, but it was missing support for variants, which is our semi-structured, schema-less data type, and many of the performance features we have in our own file format.” Parquet is the columnar storage format used by many data lake and data warehousing platforms.
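To illustrate what a missing variant type means in practice, the hedged sketch below shows a common workaround: schema-less payloads serialized as JSON strings inside an ordinary Parquet column, which gives up the type awareness and pruning a native variant column would provide. The records are hypothetical, and this is a generic example rather than Snowflake’s approach.

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical semi-structured events with no fixed schema for the payload.
events = [
    {"user": "a1", "payload": {"device": "ios", "clicks": 3}},
    {"user": "b2", "payload": {"device": "web", "referrer": "ad", "clicks": 1}},
]

table = pa.table({
    "user": [e["user"] for e in events],
    # The "variant" stand-in: stored as opaque JSON text, not typed columns.
    "payload_json": [json.dumps(e["payload"]) for e in events],
})
pq.write_table(table, "events.parquet")
```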

So, the company created its own format called FDN (for “flocon de neige,” the French term for snowflake). In retrospect, “I wish we had figured out a way to use Parquet, but developers couldn’t predict the emergence of full-featured table formats such as Iceberg,” Dageville said in an interview with SiliconANGLE. He said the company has always been receptive to open-source alternatives.

But Snowflake’s culture hasn’t always been so receptive. Three years ago, a quartet of its senior executives published a lengthy blog post, later taken down but still visible on the Internet Archive, arguing that open-source licensing could handcuff innovation by locking customers into old technology and limiting migration options.

They asserted that vendors who abstract away low-level technical details are freer to make improvements without subjecting customers to painful migrations. Rather than publishing the code, they said, developers are better off selectively exposing functionality through application programming interfaces.

“We believe in open where open matters,” the executives wrote. “We do not believe in open as a principle to follow blindly without weighing the trade-offs.”

Snowflake’s position has evolved since then. Although its bread and butter remains its proprietary cloud data warehouse, the company now readily adopts open-source technology, even when it competes with its own, Dageville said.

The new strategy is hybrid. Snowflake offers a fully managed and proprietary cloud data warehouse platform while allowing customers to use open-source alternatives for data management. Dageville said it’s all about giving customers a choice. “Some of our customers want to be open, and we want to provide them with the best open,” he said.

That isn’t a strategy, said Databricks’ Conway. “They’re saying it’s a good thing to have two silos,” he said. Snowflake’s principal business model of ingesting and managing all of its customers’ analytical data “is the worst thing you can do,” he said. “It locks you in, prevents you from utilizing that data, and creates an incentive to keep data and a vendor you may not be happy with.”

The popularity of open formats has now forced Snowflake to balance competing narratives: owning customer data on the one hand and letting customers manage it themselves on the other. “It’s a bunch of disconnected things, and they haven’t figured out how it all works together,” Conway said.

Change of heart

Under pressure to open up and looking to make up for lost time, Snowflake threw its full support behind Iceberg in 2022, pledging to adopt it as a native table format. It doubled down last spring with the announcement of Polaris Catalog, described as a vendor-neutral, open-source catalog that supports Iceberg and other data architectures. That’s when things got nasty.

Databricks’ Conway: Trusting critical data to a vendor model of ingesting and managing all of its customers’ analytical data “is the worst thing you can do.” Photo: Databricks

Catalogs are critical to data storage and management in data lakes, warehouses and distributed databases. They help organize vast amounts of data by providing a structured way to index and describe it across multiple locations, much like a card catalog does in a library. Catalogs also allow users to discover and understand available datasets through metadata like data type, size, schema and other characteristics.

They ensure data consistency, validity and compliance. They also help improve query efficiency against large datasets because they point queries to the most relevant parts of the data. “The new platform is the catalog,” Gilbert wrote in an analysis following Databricks’ Data + AI Summit in June.
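As a concrete example, the hedged sketch below points a Spark engine at an Iceberg REST catalog, the open protocol that catalogs such as Polaris implement. The endpoint, credential and namespace names are hypothetical placeholders, and the example is a simplified sketch rather than any vendor’s documented setup.

```python
from pyspark.sql import SparkSession

# Sketch: the engine delegates table discovery and metadata to a REST catalog.
spark = (
    SparkSession.builder.appName("rest-catalog-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com/api/catalog")  # hypothetical
    .config("spark.sql.catalog.lake.credential", "CLIENT_ID:CLIENT_SECRET")           # hypothetical
    .getOrCreate()
)

# Discovery: the catalog, not the engine, knows which tables exist and where.
spark.sql("SHOW NAMESPACES IN lake").show()
spark.sql("SHOW TABLES IN lake.analytics").show()  # "analytics" is a hypothetical namespace
```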

Snowflake positions Polaris Catalog as a tool to help organizations manage and govern their data across multiple clouds with fine-grained access controls, compliance monitoring, and data security. The company also has a proprietary catalog called Horizon that focuses on application and AI development. Horizon has more granular access controls and governance features and is intended to be “a system that your dentist or my mom could use,” unlike Polaris, which is more flexible but requires greater technical acuity, Dageville said.

In open-sourcing Polaris Catalog, Snowflake put a stake in the ground, Dageville said. “We are committed,” he said. “It is a very serious project.”

The decision also raised the stakes in the battle with Databricks, whose proprietary Unity Catalog was introduced in 2021. Wary of allowing Snowflake to reap a public relations bonanza from Polaris, Databricks quickly released Unity Catalog to open source less than two weeks later.

Databricks Chief Technology Officer Matei Zaharia did so in dramatic fashion, striding across the stage at the Data + AI Summit last June 16 and clicking a link to make the source code public.

Or did he? In fact, only a small portion of Unity Catalog – Dageville claims about 4,000 lines of code – was released that day. Nevertheless, Databricks said, and numerous news sources reported, that Unity Catalog was now public.

Intentions versus reality

“What they meant is that they intend to open-source it,” Gilbert said. “That’s fine and even admirable. But it’s not what they said.”

“It disappointed many, including myself,” said Forrester’s Chaurasia.

Catalog comparison

Databricks Unity Catalog: focused on unified governance for Databricks workspaces; interoperability limited to the Databricks ecosystem; comprehensive governance features native to Databricks; robust data discovery; fine-grained access control; extensive lineage and auditing; deployed as an integrated part of the Databricks platform.

Polaris Catalog: focused on interoperability for Apache Iceberg; the highest interoperability of the three, supporting multiple engines and platforms; basic governance features that rely on integration with Horizon; robust data discovery; relies on integration with Horizon for advanced access control; lineage and auditing not clearly specified; can be self-hosted or run on Snowflake.

Snowflake Horizon Catalog: focused on comprehensive governance for the Snowflake ecosystem; interoperability limited to the Snowflake ecosystem; comprehensive governance features native to Snowflake; robust data discovery; fine-grained access control; extensive lineage and auditing; deployed as an integrated part of the Snowflake platform.

Source: theCUBE Research

Gilbert said both vendors’ strategies mix closed and open platforms. Horizon, native Snowflake FDN data and managed Iceberg are the most functional and proprietary parts of the Snowflake ecosystem, but users can also synchronize permissions seamlessly with the open-source Polaris catalog, which governs open Iceberg tables.


TheCUBE Research’s Gilbert: Databricks markets its intentions to be fully open, “but reality sometimes lags somewhat.” Photo: SiliconANGLE

“Snowflake spans a spectrum from proprietary but richly functional, where all the complexity is hidden, to fully open data with a catalog to manage it,” Gilbert said. Databricks’ Delta tables are fully open, in that any tool or engine can read and write them, but they are managed by the still mostly proprietary Unity Catalog. “Both vendors are partly closed and partly open,” he said.

Databricks is good about marketing its intentions to be fully open, “but reality sometimes lags somewhat,” Gilbert said. That was the case with Unity, as it was with the universal table format, Uniform, that Databricks introduced in Delta Lake 3.0.

Uniform enhances interoperability across Delta Lake, Apache Iceberg and a third open format, Apache Hudi, but its current instantiation can only read Iceberg and Hudi tables, not write to them. Gilbert believes the Tabular acquisition was motivated at least in part by the need to fix that shortcoming.
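As a rough illustration, the sketch below enables Uniform on a Delta table so that Iceberg clients can read it through generated Iceberg metadata while writes continue to go through Delta. The table name is hypothetical, a Spark session with Delta Lake enabled is assumed, and the property names follow the publicly documented Uniform settings for Delta Lake 3.x; the exact set required can vary by version.

```python
# Sketch: create a Delta table with Uniform turned on so Iceberg metadata is
# generated alongside Delta metadata for external readers (hypothetical table).
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.orders (order_id BIGINT, amount DOUBLE)
    USING delta
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```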

Databricks’ Conway acknowledged that the original claims of full openness were overstated but said the roadmap leads in that direction. “When you open-source something as sophisticated as Unity Catalog, much of the code needs to be rewritten so that it is not reliant on Databricks’ specific infrastructure,” he said.

Conway dismissed Dageville’s claims that Snowflake stole a march on his company by fully releasing Polaris to open source. “It is a false comparison,” he said. “Polaris is an extremely simplistic catalog. It is mostly an implementation of the existing catalog, which was already defined in the Apache Iceberg project.”

Gilbert agreed that Polaris isn’t Unity Catalog’s equivalent. “It’s 100% open source, but it’s much less functional than Unity,” he said. A better comparison is a combination of Snowflake’s Horizon and Polaris versus Databricks’ Unity.

According to Conway, Polaris is limited to working with the Iceberg tabular format and is unsuitable for large-scale AI projects. “Essentially, Snowflake just cleared a two-foot-high jump and is attempting to criticize us for going for an eight-foot record,” he said.

Nevertheless, the incident tarnished Databricks’ image as a champion of openness.

Snowflake has made it clear that it intends to press whatever short-term advantage it can glean from that. Dageville charged that Unity Catalog lacks full read/write support for Iceberg and omits key features such as credential vending, support for Amazon Web Services Inc.’s S3 object storage and performance optimization.

He also said users can access Unity Catalog only through Delta Lake, a restriction that impacts its usability and effectiveness. Uniform was intended to bridge a gap between the projects but fell short of achieving native Iceberg support, a task that took Snowflake two years, Dageville said. “They open-sourced something that they probably wrote in a few weeks for a publicity stunt,” he said.

A Databricks spokeswoman responded, “Unity Catalog has open APIs for writes to external tables via the Unity REST spec, with writes to managed tables coming soon.”

Snowflake’s criticisms have evidently drawn some blood at Databricks. In June, Databricks agreed to acquire Tabular Technologies Inc., a small startup founded by the developers of Iceberg. The reported price tag of $2 billion raised eyebrows, given that Tabular had almost no revenue. Still, the deal gives Databricks a faster track to resolving compatibility issues between Iceberg and Delta Lake and a modicum of control over Iceberg’s future. Whether Iceberg will remain a fully community-led project is an open question.

Databricks would rather the whole table debate go away. “We had two projects that solved the exact same problem almost the same way,” Conway said. “We were seeing the projects diverge, and that was just not good for customers. Acquiring Tabular allowed us to switch directions. The Tabular product is also very clever.”

Snowflake’s Dageville maintained the Tabular deal was done out of desperation because Databricks was so far behind on native Iceberg support. “Delta Lake is dead, which is great news,” he said. “They were scared to death, so they bought Tabular for a lot of money. It’s an important indicator that Iceberg won this competition.”

Conway returned fire. “There’s a lot of data in Delta Lake,” he said. “Its adoption is massive. A lot of what we’re doing with Iceberg is bringing them together.”

Rebutting Dageville’s charges that native Iceberg support is two years away, Conway said the integration effort is moving quickly. “Everything but the metadata will be compatible in the next version,” he said. “Within a quarter, we’ll have compatibility at the Parquet layer.”

Silver lining

One positive outcome of the dispute is that the market’s attention is shifting from the compute engine to the catalog as the point of control. If catalogs are open and freely available, the value shifts to the analytics and application development tools that leverage them.

“The competition has been good for the market, resulting in less vendor lock-in and greater compatibility between products,” said ISG’s Menninger.

Writing on SiliconANGLE shortly after the Data + AI Summit concluded, Gilbert and theCUBE Research Chief Analyst David Vellante said the big-data landscape has fundamentally changed. “Because catalogs are becoming freely available, the value in data platforms is also shifting toward toolchains to enable a new breed of intelligent applications that leverage the governance catalog to combine all types of data and analytics while preserving open access,” they wrote.

It’s too early to say which company will win, but Forrester’s Chaurasia said Snowflake scored points by fully open-sourcing Polaris. The move “makes it a more transparent and open option for metadata and governance, especially for organizations prioritizing cloud-agnostic solutions,” he said. “Databricks still holds an advantage in the AI/ML space due to its integrated data and AI stack, end-to-end AI workflows and ability to manage both AI models and data governance on the same platform.”

The kerfuffle has left Ryan Blue, Iceberg’s creator, a bit baffled. “There should be maybe 20 people in the world who care about this problem, and you should not have one in your organization,” he told customers in a video posted on Databricks’ website.

Having recently been paid handsomely to solve the compatibility problem, Blue has an incentive to minimize it. How quickly it becomes a non-issue for customers is a matter for debate. “You should care about using your data and not choosing which format to use,” Blue said.

On that, everyone can agree.

Image: SiliconANGLE/Dall-E
