UPDATED 19:16 EDT / AUGUST 06 2021

BIG DATA

Data warehousing has problems. A data mesh could be the solution

Since its founding 13 years ago as an online shoe seller, Zalando SE has grown to become one of the largest e-commerce fashion retailers in Europe. Given its roots on the web, it’s not surprising that about 15% of the firm’s 14,000 employees work in technology disciplines. Zalando “has been a data-driven company pretty much from the start,” Max Schultze, a lead data engineer at Zalando, said in a presentation at the Spark + AI Summit last year.

The company had used a central data warehouse for data analysis since its early days, but scalability eventually became a problem. Moving to the cloud was a partial solution, but the bigger issue was how to satisfy the growing demand for new uses of that data. Caught in the middle were the data engineers who were responsible for both cleansing and transforming an ever-increasing amount of data and satisfying demand for access.

“They were mostly firefighting issues that were introduced upstream by changes from the data-generating teams,” said Arif Wider, a professor of software engineering at HTW Berlin, Germany, and a lead technology consultant with ThoughtWorks Inc. “They needed to solve issues where they were not the domain experts.”

Two years ago Zalando shifted to a strategy of distributing data across the company and handing over ownership to the business groups that created it. Data scientists and engineers were assigned to work with business leaders to whip the data into shape so that it could be easily shared. People on the technology side were now expected to understand the context of the data they worked with.

The result has been the best of both worlds, Schultze said. “Even though we have decentralized ownership, we still have a central governance layer that allows us to tie all these things together.”

Mesh mania

The architecture Zalando chose was a “data mesh,” a concept that is arguably the hottest topic in the data analytics world right now, despite being so new that it doesn’t even have a Wikipedia entry.

The term was coined by Zhamak Dehghani, a principal consultant at ThoughtWorks, in a post on a blog maintained by ThoughtWorks Chief Scientist Martin Fowler two years ago. Around the same time, Gartner Inc. began talking about a similar but not identical concept called a data fabric.

Dehghani: “This is a global movement and I’m humbled by the response.” Photo: SiliconANGLE

Both notions proceed from the same premise: The way organizations manage data is woefully out of step with the way they go to market. Enterprises spent the last 20 years decentralizing their organizations and vesting more authority in the people closest to their products and customers.

But at the same time, the data that people need to make decisions is held in a centralized data warehouse, a data management construct that dates back to the 1980s. That store is tended by a team of data scientists and engineers who ensure quality and usability but know little about how data is used by the business. They field requests and manage data pipelines without context. It is perhaps not surprising, then, that only 22% of 500 data and analytics managers surveyed by Dremio Corp. said they have fully realized the return on their data warehouse investments over the past two years, and that 56% reported having no consistent way of measuring it.

“The notion of storing all data together within a centralized platform creates bottlenecks where everyone is largely dependent on everyone else,” said Gil Feig, co-founder of Merge API Inc., an application integration startup. “Data mesh addresses this head-on.”

The technology foundation for a data mesh is steadily being put in place, but the bigger challenges are cultural and organizational, advocates say. Companies have a lot invested in their warehouses and the teams that maintain them, and tearing apart centralized teams to distribute data ownership throughout the organization is a massive undertaking. However, the penalties for maintaining the status quo may be even greater.

The data mesh concept “comes from a place of empathy” for the pains of CEOs, CIOs and chief data officers “who have been going through decades of spending a lot of money on infrastructure and not seeing the results they want,” Dehghani said in an interview with SiliconANGLE.

Transfer of ownership

Simply stated, a data mesh invests ownership of data in the people who create it. They’re responsible for ensuring its quality and relevance and for exposing it to others in the organization who might want to use it. An organization-wide set of definitions and governance standards ensures consistency, and an overarching metadata layer lets others find what they need. “Data mesh is the concept of domain-aligned data products,” Dehghani said in a video introduction. “Find the analytical data each part of the organization can share.”

Dehghani lists eight attributes of a data mesh: its elements must be discoverable, understandable, addressable, secure, interoperable, trustworthy and natively accessible, and they must have value on their own.

The concept of decentralized data management is nothing new. Distributed databases rode the coattails of the client/server craze in the 1990s. Part of the appeal of the Hadoop software library of a decade ago was that processing was distributed to where data lived. More recently, data virtualization has gained traction with its concept of a logical data layer that integrates data silos.

However, distributed databases were ahead of their time, and Hadoop was torpedoed by complexity. Virtualization struggles with running queries across diverse data sources. The mesh approach, by contrast, may be in the right place at the right time.

Gartner’s Beyer: “The tools are eminently more capable than they have been in the past.” Photo: Twitter

“The tools are eminently more capable than they have been in the past, specifically for using a combination of data domain discovery and use case analysis,” said Mark Beyer, a Gartner distinguished research vice president.

Open-source query engines for processing data in place have proliferated in recent years. One is the Presto distributed SQL query engine, created at Facebook Inc., along with Trino, a fork maintained by Presto’s original creators; both are highly regarded for their performance. There’s also the high-speed Apache Spark analytics engine, sold commercially by Databricks Inc., as well as the open-source projects Apache Drill, Apache Impala and Apache Flink.
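To make the idea of querying data in place concrete, here is a minimal sketch using the open-source Trino Python client to join data owned by two different domains (order events in a data lake and customer records in an operational database) without first copying either into a warehouse. The coordinator address and the catalog, schema and table names are hypothetical, and the sketch assumes the cluster already has Hive and PostgreSQL connectors configured:

```python
# A minimal sketch of a federated, in-place query with the Trino
# Python client ("pip install trino"). The coordinator address and
# the catalog, schema and table names are hypothetical; the cluster
# is assumed to have Hive and PostgreSQL connectors configured.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical Trino coordinator
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# Join lake-resident order events against customer records that
# live in another domain's PostgreSQL database, in place.
cur.execute("""
    SELECT c.region, count(*) AS order_count
    FROM hive.sales.orders AS o
    JOIN postgresql.crm.customers AS c
      ON o.customer_id = c.id
    GROUP BY c.region
""")
for region, order_count in cur.fetchall():
    print(region, order_count)
```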

Another piece of the puzzle, the data catalog, has grown in sophistication, enabling organizations to create a master record of all the data they hold, tagged for easy access. Data integration software has also improved, streamlining the messy task of cleaning up data into a consistent format.

An interesting new dynamic is Delta Sharing, an open protocol developed by Databricks and released to open source in May that provides for secure data sharing across different data management platforms. Delta Sharing can be used to share not only the results of SQL queries but also machine learning models and entire data sets. “It’s what vendors have had for decades but this is open,” said Ali Ghodsi, CEO of Databricks. Delta Sharing has earned endorsements from numerous business intelligence and analytics vendors but so far no sellers of database management systems.
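From the consumer’s side, the open-source delta-sharing Python client shows how simple the protocol is to use: a provider issues a small profile file with credentials, and the recipient can list the shared tables and load one into a pandas DataFrame. The profile path and the share, schema and table names below are hypothetical:

```python
# A minimal sketch of consuming data over the Delta Sharing protocol
# ("pip install delta-sharing"). It assumes the data provider has
# issued a profile file; the share, schema and table names are
# hypothetical.
import delta_sharing

profile = "config.share"  # credentials file issued by the provider

# Discover what the provider has shared.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table)

# Load a single shared table directly into a pandas DataFrame.
df = delta_sharing.load_as_pandas(profile + "#retail.sales.orders")
print(df.head())
```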

Another driver is the broad popularity of Amazon Web Services Inc.’s S3 object storage, which has enticed organizations to move large amounts of their data to a single cloud repository or data lake, making it easier for query engines to work against it. “Having a data lake in place is essential to having a performant architecture,” said Dipti Borkar, chief product officer at Ahana Cloud Inc., which sells a Presto managed service. “If most of the data is in S3, which is increasingly the case, a service mesh can sit on top and pull the data from wherever it lives.”

The growing popularity of microservices architectures, in which applications are assembled from loosely coupled and ephemeral services, has also created a foundation for distributed data access. “In a service mesh you have microservices across many different applications and business units that connect parts of the system together,” Borkar said. “That is now being applied to data.”

Cultural barriers

Ahana’s Borkar: The service mesh concept is now being applied to data. Photo: Alluxio

But cultural impediments and organizational inertia are likely to make the road to data meshes or fabrics a long one. “The data mesh paradigm is really much more of a mindset shift than of a technological shift,” Wider said. “You have to go from central ownership to decentralized ownership.”

Such a change has broad political implications at organizations where data science teams are large and established. On the plus side, information technology leaders may be happy to give up responsibility for continually fielding requests for information. However, “organizations that have an us-versus-them culture will have more trouble,” said Andy Mott, a solution architect at Starburst Data Inc., which sells an enterprise version of Trino.

The interest is clearly there, Beyer said. In a 2019 report, he and three other analysts estimated that a data fabric can reduce the time needed for integration design and deployment by 30% and cut maintenance overhead by 70%. Gartner is currently tracking 11 ongoing projects at large enterprises and has published case studies on three: Montefiore Health System Inc., the city of Turku in Finland and Jaguar Land Rover Ltd.

And those are the tip of the iceberg, Beyer said. “There could be 400 or 500 organizations pursuing this but that’s less than 1.5% of the market,” he said. “It usually takes about eight years to get from 1.5% to 25%,” at which point a technology or practice is considered mainstream.

“I’ve been overwhelmed with the number of large, medium, small, private, public and government organizations that reached out to us globally,” Dehghani said in an interview on theCUBE, SiliconANGLE’s livestreaming platform. “This is a global movement and I’m humbled by the response.”

Product thinking

There are a few assumptions underlying the data mesh/data fabric concept that require cultural adaptations. One is treating data as a product. Because most organizations think of data as an asset or resource, there is little incentive to package and share it with others.

Product thinking demands that organizations treat data with the same care and attention that would befit something they sold to a customer. That involves quality assurance, packaging and even marketing and sales. “It means putting the users of that data – the product – at the center, recognizing them as customers, understanding their needs and providing the data with capabilities that satisfy their journey in a frictionless, flawless and delightful fashion,” Dehghani said. “The success metrics change from tables and terabytes to happy users.”

Changing corporate culture to embrace product thinking was one of the most challenging aspects of JPMorgan Chase & Co.’s adoption of a data mesh architecture. In a recent video presentation that was analyzed in depth by Dave Vellante, chief analyst at SiliconANGLE sister market research firm Wikibon, JPMorgan executives described how they ultimately settled on a definition of the product as a “broad, cohesive collection of related data aligned to business functions and goals, and which are potentially made up from multiple contributors.”

Each data product at the financial services firm has a single owner and a technical specialist who is responsible for data maintenance, packaging and delivery. “Product owners had to defend why that product should exist, what boundaries should be put in place and what data sets do and don’t belong in the product,” Vellante said. “No doubt those conversations were engaging and perhaps sometimes heated.”

Starburst’s Mott: Data mesh is an antidote to the data warehouse “cycle of doom.” Photo: Andy Mott

In the same way that product teams have managers and developers, data product teams have owners and engineers, including people who may have formerly worked on the data warehouse. There may even be a marketing component.

“Data products are much like something you buy in a store; you buy what people recommend,” said Starburst’s Mott. “That will be significant because you’ll have a lot of data products from different domains.”

The idea also flips the data warehouse paradigm on its head. In a data mesh, the warehouse still has a role, but it becomes just another consumer of data rather than the oracular source of truth. “We don’t want to go back to ground zero and rebuild everything,” Dehghani said. “Use the infrastructure that exists.”

One of the appeals of a data mesh is that it can be built incrementally on a department-by-department basis without disrupting the entire company. “Rather than breaking down a centralized data platform, rules can be set for any new data,” Feig said. “When a new project launches, it can be placed into its own domain and exposed via an application programming interface. As bandwidth becomes available, teams can break pieces out of the original architecture until all that remains is a domain-separated data mesh.”
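As a sketch of what exposing a new domain via an API might look like, here is a minimal data product service built with the FastAPI framework. The domain, endpoint and field names are hypothetical illustrations rather than a prescribed data mesh interface, and a real service would read from the domain’s own data store:

```python
# A minimal, hypothetical sketch of a domain team exposing its data
# product through an API ("pip install fastapi uvicorn"). The orders
# data set is hardcoded to keep the sketch self-contained; a real
# service would serve curated data from the domain's own store.
from typing import Optional

from fastapi import FastAPI

app = FastAPI(title="orders-data-product")

ORDERS = [
    {"order_id": 1, "region": "EU", "total": 42.50},
    {"order_id": 2, "region": "US", "total": 19.99},
]

@app.get("/v1/orders")
def list_orders(region: Optional[str] = None, limit: int = 100):
    """Serve the domain's curated order data to other teams."""
    rows = [o for o in ORDERS if region is None or o["region"] == region]
    return rows[:limit]

# Run with: uvicorn orders_product:app
```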

But that assumes everyone can agree on the platforms and toolsets to use. “The challenge is that you can’t get everybody to standardize on the same thing,” said Databricks’ Ghodsi. “A centralized team can mandate that everyone use the same software but the teams don’t always agree.”

Domain control

Another core mesh principle is domain ownership. One of the structural weaknesses of conventional data warehouses is that the people who manage the data are functionally separate from the people who use it. Data scientists and engineers must retrieve data, cleanse it and transform it “without really understanding the domains because the teams responsible for it are not close to the source of the data,” Dehghani said.

In a mesh or fabric architecture, domains own and curate their own data. Data is cataloged and published so that others can find it and dream up new ways to use it. “The insights emerge from the interconnection of the data domains,” Dehghani said.

A third foundational principle is self-service. That addresses one of the greatest frustrations caused by the traditional warehouse model, which Mott called “the cycle of doom where tickets are created, access is provisioned and by the time you get the data, it’s no longer relevant.”

In a self-service scenario, users pick and choose the data from a catalog and decide how to apply it, in the same way that smartphone owners serve their own needs from an app store. Experts don’t expect self-service to be a major impediment to data mesh adoption since software-as-a-service has already made many organizations comfortable with self-provisioning.

Who’ll do the work?

But even companies that are all in on data mesh face some practical impediments. One is that data science and engineering skills are in chronically short supply. Online training firm QuantHub LLC estimated last year that there were 250,000 open data science jobs and that the shortage of experts in data engineering, a relatively new discipline, was on track to be even worse.

Technology may come to the rescue. Spreadsheets turned legions of accountants and financial analysts into amateur programmers. Skyrocketing interest in the new crop of low-code/no-code tools is evidence that businesspeople are increasingly confident in solving their own IT problems. And the tools are getting better. “I think in the long term you’ll get data consumers who pick up some of the data engineering tasks – the low-hanging fruit,” Mott said.

Gartner’s data fabric definition assumes that artificial intelligence can solve many of the data engineering problems. “Data is just math,” Beyer said. “Math has rules and practices. Math always obeys the rules.”

By analyzing patterns in data, machine learning algorithms will be able to take on most of the quality and transformation tasks that are now performed by humans, Beyer believes. In contrast, the mesh concept assumes that humans will play a critical ongoing role in understanding context and change.

The debate over automation is the main difference between the data fabric and data mesh concepts, Beyer said. “The fabric says this is math. The mesh says you have to solve it,” he said. Gartner asserts that humans will be involved less and less in maintaining data over time. In a recent report, it estimated that by 2023, “artificial intelligence in the data fabric will be capable of reducing data quality and data mastering ongoing operations costs up to 65%.”

Then there’s the challenge of piecing together the technologies to make data discoverable and accessible across a large enterprise. In large part, the newness of the mesh concept is the biggest impediment here. “The elephant in the room is that the blueprints are not as mature as they should be, but I think they’ll mature fast,” said Mott.

Assembling the pieces

The component pieces are in place, but it’s up to the organization to make them work together. “Data mesh is not really an architecture or a well-defined set of technologies,” said Tomer Shiran, founder and chief product officer at Dremio Corp.

And because there are so many ways to implement a data mesh, it’s unlikely that a single technology solution will emerge.

“I don’t think there’ll be a data mesh solution you can buy,” Dehghani said. “I hope what people buy are a set of collaboration tools and technologies that they can stitch together to build an end-to-end mesh.” However, she said smaller organizations may be able to adopt a packaged solution, particularly if they don’t have a legacy warehouse in place.

The competitive dynamics of the software industry also argue against a single solution for enterprise-wide deployments, said Gartner’s Beyer. “Vendors would need a new revenue model that is based not on differentiation but on orchestration,” he said. “Right now, that’s not how software is designed.”

However, Databricks’ Ghodsi disagreed, saying his company intends to build a technology stack that delivers a soup-to-nuts data mesh incorporating Delta Sharing to bridge the gap to other vendors’ platforms and its Unity Catalog for cross-platform data management. “We are productizing it,” he said.

Whether data mesh is a shooting star or long-term trend remains to be seen. Few will argue, though, that the monolithic, centralized status quo can continue to serve the needs of businesses bent on digital transformation, or what Dehghani called “the hypocrisy that we want to be data-driven but outsource responsibility to the business intelligence team.”

Data mesh may be the answer. It may even someday get its own Wikipedia entry.

Photo: Unsplash
