UPDATED 10:45 EDT / NOVEMBER 15 2020

CLOUD

Cloud computing storms a bastion of the enterprise: the data warehouse

In the course of managing 12 million requests for roadside and accident assistance across the U.S. each year, Agero Inc. crunches a lot of data. The contact center operation, including dispatch specialists, employs a team of data scientists to optimize the way service providers are deployed to deliver aid as quickly as possible to stranded motorists.

For years, the company used an on-premises data warehouse, which is a structured repository of data drawn from multiple sources and used for business intelligence analysis. When Agero launched a modernization initiative two years ago, “we realized our data warehouse infrastructure wasn’t going to keep up,” said Michael Bell, director of data science and analytics. “We were straining to ingest data into the legacy warehouse and performance was suffering.”

Onsite storage was costly and complex to manage, and query loads dragged down performance. “If we needed to ingest a new source of data and we didn’t have the drives provisioned, we’d have to spin down everything to provision the new storage,” Bell said. “It took hours and hours, and sometimes the computation couldn’t even handle it.”

Agero’s Bell: “We’ve been able to democratize data and get authorized users hands-on with it.” Photo: Twitter

So Agero made the leap to a cloud data warehouse, choosing Snowflake Inc. as its provider. The shift has paid off in many more ways than just cost savings, which have been substantial, Bell said. Storage scalability is no longer a problem. New use cases can be spun up in virtual warehouse instances in minutes and shut down just as easily.

More importantly, Bell said, “we’ve been able to democratize data and get authorized users hands-on with it.” The company is now working on new projects that would have been impossible within its legacy environment, such as building custom dashboards that clients and partners can use to see data relevant to their services.

“Some of these dashboards might have thousands of users that would have required lots of individual data marts in the past,” Bell said. “We have broken the bottlenecks.”

Neck-snapping

Agero is one of thousands of companies that are now finding new value in the more than 40-year-old data warehousing model thanks to cloud computing. And the shift is happening with stunning speed. Global Market Insights Inc. estimates that cloud providers will host the majority of data warehousing loads by 2025. Gartner Inc. estimates that 30% of data warehousing workloads are now running in the cloud, growing to two-thirds by 2024. In 2016 the figure was less than 7%, said Gartner analyst Adam Ronthal. “Everybody’s business is going to go to the cloud,” he said.

That’s going to change not only what data warehouses are but the very nature of how organizations can use data to create competitive advantage. As Dave Vellante, chief analyst at SiliconANGLE sister market research firm Wikibon, envisions it, the cloud is enabling the creation of a data mesh that will transform the way companies structure their business, with data at the core.

“There’s probably nothing more strategic than leveraging data to power your digital business and creating competitive advantage,” he said. “We believe a new approach is emerging where business owners with domain expertise will become the key figures in a distributed data model that will transform the way organizations approach data monetization.”

All this might seem a bit neck-snapping to those who remember that, just a few years ago, data warehousing had become almost a dirty word. Long maligned for their high cost and administrative overhead, data warehouses have historically been limited to large enterprises that could afford their seven-figure price tags. Nevertheless, they are an important single source of reliable data for use in business intelligence processing, a demand that has grown with the digital transformation wave.

But information technology people have always looked for a better solution. They thought they had found it a decade ago when the open-source software Hadoop stormed onto the scene with its promise of delivering warehouse-like functionality at a small fraction of the cost. The pitch was so appealing that some people began to write off data warehouses as a relic.

The Hadoop ecosystem gave birth to the metaphor of the data lake, an all-encompassing trove of information from structured sources such as relational tables, semistructured ones such as HTML code and even free-form text. A host of open-source tools emerged to index, format and provide access to data in lakes through the popular SQL query language.

Data lakes were touted as a new breed of data warehouse that didn’t have the downsides of high costs, administrative inflexibility and limited scale. “A few years ago, you didn’t even say ‘data warehouse’ because people would say you were from the horse-and-buggy days,” said Carl Olofson, research vice president at International Data Corp.

Ahana’s Borkar says complexity, “up to 300 configuration parameters,” has stymied Hadoop acceptance. Photo: Ahana

But the complex ecosystem of open-source tools that made up the typical data lake was also a problem. Users had to do much of the integration grunt work themselves. In Hadoop, “everything is configuration-based,” said Dipti Borkar, chief product officer at Ahana Cloud Inc., which curates a cloud version of the Presto distributed SQL query engine. “There can be up to 300 configuration parameters that you have to figure out on your own.”

One of the appeals of data lakes was that they replaced costly data center disk drives with cheap commodity devices. It turns out cloud object storage is even cheaper. “Data lakes were formed so you could have a lot of data and take the compute to the storage,” said Anupam Singh, chief customer officer at Cloudera Inc. “What’s changed is that there’s a lot more need for compute than storage. All the action is in the compute layer.”

Agero dabbled in Hadoop-based data lakes but found administration to be costly. “My perspective is that the ecosystem required a lot of specialized knowledge to manage,” Bell said. “Even if your software stack is open source, you need a lot of expensive engineers to leverage it effectively.” It also turned out that a lot of the data people wanted to analyze was structured anyway.

“People who were going to move everything from Teradata to Hadoop have changed their minds,” said Olofson, referring to market leader Teradata Corp.

Cloud pure play

Then along came Snowflake.

The startup, which was named both for a logical arrangement of tables commonly used in data warehouses and for the born-in-the-cloud nature of crystallized water, released a data warehouse as a service in 2014 that was built from the ground up for the cloud. With limitless scalability and support for low-cost cloud object storage, Snowflake crossed off two of the biggest items on data warehousing users’ gripe list. It also had some nice tools for integrating the semistructured data that tended to choke highly structured traditional data warehouses. And it was easy to use.
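That semistructured support is easiest to see in SQL. The following is a minimal sketch of the pattern Snowflake documents for JSON data: raw records land in a VARIANT column and are queried with path expressions, with no up-front schema required. The table, stage and field names are hypothetical, not drawn from any customer in this story.

    -- Land raw JSON in a single VARIANT column (hypothetical names).
    CREATE TABLE raw_events (payload VARIANT);

    -- Load semistructured files from a previously defined stage.
    COPY INTO raw_events
      FROM @events_stage
      FILE_FORMAT = (TYPE = 'JSON');

    -- Query nested fields directly with path expressions and casts.
    SELECT
        payload:request_id::STRING    AS request_id,
        payload:vehicle.make::STRING  AS vehicle_make,
        payload:created_at::TIMESTAMP AS created_at
    FROM raw_events;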

Founded by a team of data architects who previously worked at Oracle Corp. and Dutch analytics firm VectorWise, the company raised $1.4 billion and in September staged a blockbuster initial public offering that saw its value soar from $33 billion to $88 billion on the first day of trading.

Snowflake captivated customers with its cloud-native roots, ease of use and extensions for accommodating nontraditional data types. “If we take on new use cases, we can spin up a new virtual data warehouse on Snowflake in a matter of minutes,” Bell said. Agero can now ingest data “more or less in raw form because Snowflake has more flexibility in the data it can handle, and we can structure through Snowflake.”
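In practice, “a matter of minutes” maps to a couple of statements, because Snowflake separates compute clusters, which it calls virtual warehouses, from shared storage. A rough sketch with hypothetical names and sizing: the cluster suspends itself when idle and can be dropped without touching the stored data.

    -- Provision an independent compute cluster for a new use case.
    CREATE WAREHOUSE reporting_wh WITH
        WAREHOUSE_SIZE = 'XSMALL'
        AUTO_SUSPEND = 60      -- suspend after 60 idle seconds
        AUTO_RESUME = TRUE;

    -- ...run the new workload...

    -- Shut it down just as easily; stored data is unaffected.
    DROP WAREHOUSE reporting_wh;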

Snowflake wasn’t the first cloud data warehouse. Google LLC’s BigQuery launched in 2011 and is admired for its technical elegance. Amazon Web Services Inc.’s Redshift, introduced in 2012, is still considered the market leader. Oracle’s Autonomous Data Warehouse is applauded for its administrative efficiency and Microsoft Corp.’s Azure Synapse for its flexibility.

But those products are part of a much broader cloud portfolio, whereas Snowflake was a pure-play company that came to be seen as the manifestation of all things cloud. “An internal product doesn’t get the attention of a dedicated company,” Debanjan Saha, Google’s general manager of data analytics services, said somewhat wistfully.

A fresh look

Gartner’s Ronthal: “Everybody’s [data warehousing] business is going to go to the cloud.” Photo: Twitter

But many experts say Snowflake kicked off the cloud data warehouse craze. “Snowflake came to market with a fresh look at the architecture designed from the ground up for cloud and with full separation of resources,” said Gartner’s Ronthal. “They picked up some of what I call Redshift refugees and are now on all major clouds.” He noted that Amazon has since closed many of its initial competitive gaps with Snowflake.

“The reason Snowflake has been so successful is because they built a simple analysis tool that can be used at a departmental level, can be stood up to do simple things quickly and is easy to use at small incremental expense,” said Chris Lynch, chief executive of AtScale Inc., a Snowflake partner that develops software that abstracts a variety of back-end data stores.

Snowflake declined to be interviewed for this story, but in an interview earlier this year with theCUBE, SiliconANGLE’s streaming video platform, Chief Executive Frank Slootman explained why the company has stayed focused on its cloud-only roots.

“We burnt the ship behind us,” he said. “We’re not doing this endless hedging that people have done for 20 years of keeping a leg in both worlds. Forget it, this will only work in the public cloud. Because this is how the utility model works.” 

Slootman told Vellante that speed and the ability to share have been essential principles of Snowflake’s approach to the market. “This data is what we call analytics-ready,” he said. “It is instantly accessible. It is also continually updated; you have to do nothing. It’s augmented with incremental data and then our Snowflake users can just combine this data with supply chain, with economic data, with internal operating data.”

That simplicity is opening the warehouse troves to a new group of tech-savvy users who are now able to work with data directly, rather than submitting requests through the IT organization, a group Gartner calls “citizen data scientists.”

Teradata users “used to be a small group of data scientists in a centralized team. Now it’s diffuse across the organization,” said Hillary Ashton, chief product officer at Teradata.

The instant scalability of cloud resources has been a refreshing change from the forklift upgrades that were sometimes required on-premises when demand exceeded capacity. “In the legacy world, if we needed CPU power we’d open a request, get a quote from the vendor and it was weeks before we could get what we needed,” said Anthony Seraphim, vice president of data governance at Texas Mutual Insurance Co. “And then it still wasn’t enough.”

Escaping ETL hell

In addition to their scalability and cost advantages, cloud data warehouses and their ecosystems have succeeded at making a dent in one of the most daunting tasks of data warehouse administration: the extract/transform/load process. Because data in a warehouse is usually imported from multiple sources, it needs to be adapted to a common format and schema, which is the organizational blueprint of the database management system. The ETL process is necessary but also time-consuming and monotonous. “People hate ETL,” said IDC’s Olofson.

ETL can also be complex and expensive, involving the need to create scripts, “build custom connectors to APIs, maintain schema updates and build pipes, making sure updates are happening and don’t fail,” said Dan Maycock, vice president of IT and data at BT Loftus Ranches Inc., a Yakima, Washington-based grower of hops. “ETL put data warehouses out of the realm of possibility for a lot of companies.”

Most cloud data warehouses enable users to load data first and transform it later with tools that involve the business users of that data, a process called ELT. That addresses another common complaint about legacy data warehouses: It takes too long to ingest data and wrangle it into shape.
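The contrast is easiest to see in SQL. In classic ETL, the reshaping happens in an external tool before the load; in ELT, raw rows land first and are shaped inside the warehouse with ordinary statements. A minimal, hypothetical sketch in Snowflake-style SQL (the table and column names are illustrative):

    -- 1. Load first: raw records land untransformed in a staging table.
    CREATE TABLE stg_claims_raw (
        claim_id   STRING,
        claim_json VARIANT,   -- semistructured payload kept as-is
        loaded_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
    );

    -- 2. Transform later, in the warehouse, with plain SQL.
    CREATE TABLE claims_clean AS
    SELECT
        claim_id,
        claim_json:amount::NUMBER(12,2) AS claim_amount,
        claim_json:status::STRING       AS claim_status
    FROM stg_claims_raw
    WHERE claim_json:status IS NOT NULL;

Because the raw staging table stays behind, the transformation can be revised and rerun at will; the extra storage and compute this consumes is the tradeoff Fivetran’s chief executive describes below.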

“It’s perfect for the agricultural community that does not have a lot of system engineers,” Maycock said. “Snowflake automated a lot of the pain and agony. You set it and forget it.”

Loftus Ranches’ Maycock: “Snowflake automated a lot of the pain and agony. You set it and forget it.” Photo: Dan Maycock

At the same time, improvements in integration tools have alleviated some of the pain of transformation. A critical ally in Loftus Ranches’ march to the cloud has been integration technology from Fivetran Inc. that uses prebuilt connectors to automate data streaming from multiple sources into the destination schema. “I was able to use five new Fivetran connectors and within a couple of days had a fully functioning data warehouse,” Maycock said. “It’s exponentially less painful.”

Texas Mutual Insurance Co. has simplified administration of its Snowflake data warehouse by using a data catalog from Alation Inc. that gives business-side users the ability to assign their own metatags and build queries collaboratively.

With the firm’s legacy warehouse, data lineage was hard to trace. “I’d see reports and dashboards, but I didn’t know where the information was coming from, who built that dashboard and whether the formulas were correct,” Seraphim said. “Getting an answer could take weeks because developers had to go through the code by hand.”

Using the Alation catalog, “I can tell my business users I put data in the cloud that’s not structured optimally, but it’s quick and we have tools that can help you,” he said. “We have shifted the power from hard-core tech developers to users.” He added that the company’s on-premises data warehouse, while still running, is now “on life support.”

The ELT approach widely used in the cloud “is less efficient because you use more storage and compute but the upside is that ELT is a lot more agile and easier to iterate because you have everything in one system,” said George Fraser, Fivetran’s CEO. “ETL was popular because data warehouses were so expensive. Now the incentives have changed; storage is super-cheap and you can get as much compute as you want.”

The ability to quickly ingest data is a plus for Loftus Ranches, whose business is heavily influenced by such factors as weather and changing market prices. “If you want to lose weight the best thing to do is track everything you eat,” Maycock said. “I think we’re getting to a similar place because we can track everything as it happens.”

Cloud architecture has hastened the shift toward ELT by enabling fast parallel queries to be run with persistence that preserves the lineage of data. “If you have source data in BigQuery you don’t have to throw it away and you can create different transformations,” said Google’s Saha. “All of the lineage information is available in the data warehouse itself. I don’t think the need for ETL has gone away, but a lot of people are doing ELT instead.”
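A hypothetical BigQuery illustration of what Saha describes: the raw table persists as the system of record, and multiple transformations are derived from it without discarding the source. The dataset and table names are invented for the example.

    -- Derive one shape of the data from the persistent raw table...
    CREATE OR REPLACE TABLE analytics.trips_daily AS
    SELECT DATE(event_time) AS day, COUNT(*) AS trips
    FROM analytics.trips_raw
    GROUP BY day;

    -- ...and a different transformation from the same source later,
    -- with lineage traceable back to analytics.trips_raw.
    CREATE OR REPLACE TABLE analytics.trips_by_region AS
    SELECT region, AVG(duration_minutes) AS avg_duration
    FROM analytics.trips_raw
    GROUP BY region;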

Curtains for data lakes?

Texas Mutual’s Seraphim: “We have shifted the power from hard-core tech developers to users.” Photo: LinkedIn

Does the resurgence of data warehouses mean data lakes are no longer needed? Opinions are divided. Fans believe steadily improving performance will make data lakes attractive alternatives to warehouses over time. They also point out that data lakes are a more appropriate place for data science and the training of machine learning models. Improving those repositories to handle business intelligence queries means fewer data copies and less potential for mistakes.

“As a classically trained SQL person I want to think that SQL is the only way to think about data warehousing,” said Cloudera’s Singh. “Those distinctions are getting blurred where the warehouse is one experience, but machine learning and data science are just as important. People don’t want boundaries between transactions and analytics.”

Others say data lakes can never deliver the performance users expect. They see warehouses evolving to support a greater range of data types, making data lakes a niche technology or even irrelevant. “There doesn’t seem to be a reason to build a data lake anymore,” said Fivetran’s Fraser. “I think they are going to disappear.”

Sudi Bhattacharya, cloud machine learning leader at Deloitte LLP, said certain structural limitations will always favor structured data stores. “There are techniques that have evolved to make [data lakes] faster, but it’s not enough for certain access patterns,” he said. “For blazing-fast access, you want a data warehouse.”

Not so fast

IDC’s Olofson: “People who were going to move everything from Teradata to Hadoop have changed their minds.” Photo: Twitter

Databricks Inc. would beg to differ. The data analytics firm last week introduced technology that it said makes it possible for SQL queries of data lake repositories to perform up to nine times faster than comparable queries of a warehouse.

“We believe the data lake is the center of gravity because it’s so good at handling the unstructured information that data science and machine learning innovation comes from,” said Joel Minnick, Databricks’ vice president of marketing. “We have made good inroads around bringing transactional strengths of a data warehouse to a data lake.”

Many other software vendors are working toward the same goal using bitmap indexing, query optimizers, columnar processing and other acceleration techniques. Tableau Software Inc.’s Hyper engine uses in-memory processing and columnar storage to enable large data sets to be processed within Tableau without the need for a warehouse. Dremio Corp. takes a similar approach that builds on the Apache Arrow development platform for in-memory analytics on columnar storage.

Just as application development is evolving toward the use of microservices, data engineering “will evolve over the coming years to leverage an architecture of loosely coupled services rather than a monolithic cloud data warehouse,” said Tomer Shiran, Dremio’s co-founder. “The cloud data lake will replace the cloud data warehouse.”

Cost equation

Reasonable people also disagree over the perceived cost benefits of warehousing in the cloud. On the one hand, object storage has driven storage costs way down, addressing one of the costliest elements of the traditional data warehouse. “It’s 10-to-one savings on storage,” said Texas Mutual’s Seraphim.

“If you take into account infrastructure, licensing fees and DBAs, Snowflake is quite a lot cheaper” than a traditional warehouse, said Agero’s Bell.

But experts also warn that it can be easy to let CPU usage costs overwhelm savings in other areas, a situation made worse by the fact that cloud platforms give no warning signs as consumption grows.

“For people in the IT world where everything was on-prem, you knew when you were over-resourced because the systems would slow down,” said IDC’s Olofson. “Now you only see it in your bill.”

Before moving an existing data warehouse to the cloud, organizations need to fully understand why they’re doing it, said Rishi Diwan, chief product officer at Exasol AG, maker of an analytics database that has a large data center installed base. “If the reason is cost, do you truly understand your downtime?” he asked. “If it’s scale, model what it will take to get to the concurrency you need. Many times, contracts are renegotiated within a year because costs are higher than expected.”

Gartner’s Ronthal agreed that cloud costs can be deceptive. “In the cloud we’re in a world of abundance where you can provision the resources for whatever you need,” he said. “The conversation needs to shift from how to manage physical resources to how to manage limited budget resources.”

Despite the gotchas, no one is expecting data warehousing to migrate back on-premises again. Teradata, which for many people is synonymous with the legacy world, is emblematic of shifting attitudes. In the course of mentioning cloud 55 times in prepared remarks to analysts during the company’s third-quarter earnings call, CEO Steve McMillan underscored the company’s commitment to adopt a cloud-first approach to the market.

The company doesn’t break out cloud sales but said 80% of its revenues are now recurring. “We’ve had more product delivered in cloud this year than ever before,” said Teradata’s Ashton.

And so another bastion of the data center falls victim to the siren song of the cloud. In the case of the vilified data warehouse, many would say, “Good riddance.”

Photo: TheDigitalArtist/Pixabay
