UPDATED 08:00 EST / JANUARY 05 2026

BIG DATA

Data 2026 outlook: The rise of semantic spheres of influence

In 2024, the elephant in the room was how generative artificial intelligence seized the conversation. In 2025, the dialogue shifted to agents and the question of whether an AI bubble is forming in our midst. But as we noted, AI’s grab of the limelight shined a new spotlight on the importance of having good data, and so last year we forecast that data would have a renaissance.

Spoiler alert: Though having good data is an essential first step, AI models need the right data, and that’s where the focus will shift this year. In the words of Cindi Howson, semantics are sexy again.

But first, let’s look at how we got here and answer the question: Was there a data renaissance?

All AI and agents, all the time

While data started garnering attention last year, AI and agents continued to suck up the oxygen. Why the urgency around agents? Maybe it’s “fear of missing out.” Or maybe there’s a more rational explanation. According to Amazon Web Services Inc. CEO Matt Garman, agents are the technology that will finally make AI investments pay off. Go to the 12-minute mark in his recent AWS re:Invent conference keynote, and you’ll hear him say just that.

But are agents yet ready for prime time? While PricewaterhouseCoopers is extremely bullish, noting that 66% of respondents reported getting measurable results, a McKinsey study paints what we consider a more realistic picture, with 62% reporting that their organizations are “at least experimenting with agents.”

Agent technology is still in grammar school. Agents are only as reliable as the AI models underpinning them, and just because an AI model can reason doesn’t necessarily mean that it is reasoning correctly. And as for completing complex tasks, studies like this one from Carnegie Mellon show agents are still struggling there.

As for the AI bubble, it is coming up for conversation because it is now having a material effect on the economy at large. Some accounts estimate that AI is driving 90% of US GDP growth, while others point to data center buildout adding a percentage point to overall GDP.

Are the numbers getting too hot to handle? Michael Burry, the investor made famous in “The Big Short” for predicting the housing market collapse, is currently shorting Nvidia Corp. and Palantir Technologies Inc. Oracle Corp.’s recent ride is illustrative. In the week following its Q1 FY 2026 financials back in September, where it disclosed $455 billion of “remaining performance obligations” (the SEC term for committed revenue pipeline), the stock shot up roughly 35%. But those gains largely evaporated by the December quarter as concerns over leverage and the fact that OpenAI Group PBC accounts for more than half the pipeline took over the conversation. We’re not going to go down the rathole of making economic predictions, but we’re sticking to our story from a couple of years back that it will take at least four to five years for industry investments in generative AI to turn positive.

Offstage, data was happening

On the main stage at AWS re:Invent, Garman’s keynote had just one data-related announcement: the unveiling of a long-awaited Database Savings Plan that met even tough critic Corey Quinn’s approval. Nonetheless, the following day, the data and AI keynote from Dr. Swami Sivasubramanian totally overlooked data. Spoiler alert: Dr. Sivasubramanian, who for many years was vice president of data and AI, is now officially VP of agentic AI. So far, our prediction ain’t looking so hot.

But AWS’ off year for data doesn’t tell the whole story.

Across the industry, there was plenty about data sharing, Zero ETL, multicloud and acquisitions — for instance, IBM’s pending offer for Confluent as continuing evidence of the collapse (or consolidation) of the modern data stack. PostgreSQL also had a big year, with Databricks Inc. and Snowflake Inc. acquiring PostgreSQL databases while Microsoft Corp. finally laid out its PostgreSQL answer to Amazon Aurora and Google AlloyDB. And there were some generational upgrades, such as AWS’ unveiling of Graviton5 for accelerating memory-intensive database workloads and Snowflake announcing a new Gen2 premium compute tier.

Data sharing and Zero ETL had big years. The impetus was the pain and complexity of building and managing fragile data pipelines, and that’s where these design patterns became common themes. Data sharing, first popularized by Snowflake, became the path for third parties such as Databricks, Google BigQuery and Microsoft Fabric to access SAP data; this pattern is also used by Salesforce Inc. As for the Zero ETL pattern, it was first introduced by AWS to simplify populating Redshift and OpenSearch from its varied portfolio of operational databases. In 2025, Databricks, Microsoft and Snowflake took this approach further, using it to unify transaction and analytics tiers.

There was also a continuation of last year’s momentum for vectors becoming a checkbox feature of database platforms. While Amazon Redshift still eschews vectors, others are expanding support through new data types, unified query and storage classes. Oracle unveiled unified query of structured and unstructured (vector) data, while AWS debuted a new S3 storage tier designed specifically for vectors.
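To make “vectors as a checkbox feature” concrete, here is a generic sketch using PostgreSQL’s open-source pgvector extension rather than any of the vendor platforms mentioned above. The connection string, table and three-dimensional embeddings are hypothetical; in practice the vectors would come from an embedding model, not hand-typed values.

```python
# A generic illustration of vector support inside a relational database,
# using PostgreSQL with the pgvector extension. The DSN, table and
# embeddings are hypothetical stand-ins.
import psycopg2

conn = psycopg2.connect("dbname=demo user=demo")  # hypothetical connection
cur = conn.cursor()

# The vector becomes just another column type alongside structured fields.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS product_docs (
        id bigserial PRIMARY KEY,
        title text,
        embedding vector(3)          -- tiny dimension for illustration only
    )
""")
cur.execute(
    "INSERT INTO product_docs (title, embedding) VALUES (%s, %s)",
    ("return policy", "[0.1, 0.9, 0.2]"),
)

# Unified query: filter on structured columns, rank by vector distance.
cur.execute("""
    SELECT title
    FROM product_docs
    ORDER BY embedding <-> '[0.2, 0.8, 0.1]'
    LIMIT 5
""")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```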

No recap of 2025 would be complete without noting that Apache Iceberg cemented its place as the de facto standard open table format, capped by Databricks adding full support for it. As for AI running the database? Oracle made headlines with the Autonomous Database way back when; it’s practically taken for granted now. And as we forecast last year, language models are now making sense of what’s really in the database, combining metadata, schemas and table descriptions to generate business terms.

Databases and agents

Not surprisingly, there was lots of buzz over agents in databases. Here’s just a sampling.

For instance, the new PostgreSQL additions to the Databricks and Snowflake portfolios are well-suited to providing state management for autonomous agents. AWS’ extension to its AgentCore deployment framework supports episodic memory for agents performing complex, extended workflows; it does so by “remembering” the state of an application and the database transactions supporting it. It’s a step toward addressing the agent forgetfulness problem pointed out in the Carnegie Mellon study.
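To make the state-management idea concrete, here is a minimal sketch of how an agent workflow might checkpoint its working state in a PostgreSQL table so a long-running task can resume where it left off. The table, connection string and payload are hypothetical, and this is a generic pattern, not AWS AgentCore’s actual API.

```python
# A minimal sketch of Postgres-backed state management for a long-running
# agent workflow. Table, DSN and payload are hypothetical examples.
import psycopg2
from psycopg2.extras import Json

DDL = """
CREATE TABLE IF NOT EXISTS agent_memory (
    agent_id   text NOT NULL,
    step       int  NOT NULL,
    state      jsonb NOT NULL,
    created_at timestamptz DEFAULT now(),
    PRIMARY KEY (agent_id, step)
)
"""

def save_state(conn, agent_id: str, step: int, state: dict) -> None:
    """Checkpoint the agent's working state after each completed step."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO agent_memory (agent_id, step, state) VALUES (%s, %s, %s)",
            (agent_id, step, Json(state)),
        )
    conn.commit()

def load_latest_state(conn, agent_id: str) -> dict | None:
    """Resume from the most recent checkpoint instead of 'forgetting'."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT state FROM agent_memory WHERE agent_id = %s "
            "ORDER BY step DESC LIMIT 1",
            (agent_id,),
        )
        row = cur.fetchone()
    return row[0] if row else None

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=agents user=demo")  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
    conn.commit()
    save_state(conn, "order-bot", 1, {"task": "reconcile invoices", "done": []})
    print(load_latest_state(conn, "order-bot"))
```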

Google’s Gen AI Toolbox for Databases provides an open-source server that handles the middleware plumbing for AI agents connecting to databases. Among the features of Databricks’ new Agent Bricks toolkit are capabilities for judging whether agents are generating accurate SQL. Meanwhile, Oracle and Snowflake introduced capabilities for agents to perform iterative reasoning in-database.

And of course, no discussion of agentic interaction with databases is complete without mention of Model Context Protocol. The open-source MCP framework, which Anthropic PBC recently donated to the Linux Foundation, came out of nowhere over the past year to become the de facto standard for how AI models connect with data. The beauty of MCP is its flexibility; it was built as an extensible framework that envisions multiple paths by which models connect to data. It might be as simple as a model issuing a REST call to grab data, or as complex as triggering intermediate steps for retrieving data that could be called by agent workflows. By now, almost all major data and AI platform providers have announced previews of their own MCP Servers.
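As an illustration of how simple the “model issues a call to grab data” path can be, below is a minimal sketch of an MCP server exposing a single read-only query tool, assuming the FastMCP helper from the official Python SDK (pip install mcp). The server name, SQLite file and read-only guard are illustrative choices, not part of any vendor’s shipped MCP server.

```python
# Minimal MCP server exposing one read-only data tool.
# Assumes the official Python SDK ("mcp"); the tool name and sales.db
# are hypothetical examples.
import sqlite3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("sales-data")

@mcp.tool()
def run_query(sql: str) -> list[dict]:
    """Run a read-only SQL query against the local sales database."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed")
    conn = sqlite3.connect("sales.db")
    conn.row_factory = sqlite3.Row
    try:
        rows = conn.execute(sql).fetchall()
        return [dict(row) for row in rows]
    finally:
        conn.close()

if __name__ == "__main__":
    # An MCP-capable client (agent or model host) discovers and calls the
    # tool over the stdio transport.
    mcp.run(transport="stdio")
```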

So did a data renaissance actually occur?

A year ago, we made the case that adoption of AI would drive renewed attention to data. At the time, we noted that while the ecosystem for governing structured data was fairly mature, innovation in the near term would focus on a couple of areas: governing unstructured data and grounding it with knowledge graphs. And our aspirational hope was that the governance silos between data and AI would start getting bridged. So does a string of vendor announcements signify that a renaissance for data actually occurred?

The case could be made that the disciplines of data quality and data governance, based on available tooling and documented best practices, are fairly mature – although in reality, broad, consistent adoption will remain a continuing saga. Perhaps language models could help here, but they won’t substitute for robust process.

As for AI governance, that’s a much newer body of practice. According to the Audit Board, only a quarter of organizations have fully implemented AI governance. And data governance itself is still a formidable barrier to AI adoption according to an AWS study, with nearly 40% of chief data officers citing the perennial problems of data quality and integration. We’ve still got homework to do.

2025 saw some progress. The almost overnight emergence of MCP as the de facto standard framework for AI models connecting to data reflects just how important the industry believes that data is for AI.

There were early advances in extending governance to unstructured data, primarily documents. IBM watsonx.governance introduced a capability for curating unstructured data that transforms documents and enriches them by assigning classifications, data classes and business terms to prepare them for retrieval-augmented generation, or RAG. On a similar tack, Microsoft extended Purview’s sensitivity labeling to unstructured files while adding the capability to automatically classify over 200 information types, such as credit card or passport numbers. Databricks Unity Catalog extended access controls to unstructured data, using AI to crawl these files to generate descriptions and capture lineage that language models can read; Snowflake’s Document AI capability could conceivably be used for a similar purpose.
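To illustrate the kind of enrichment these tools automate, here is a vendor-neutral sketch that tags document chunks with a sensitivity classification and candidate business terms before they are indexed for RAG. The regex patterns and glossary are simplified stand-ins, not the watsonx.governance, Purview or Unity Catalog APIs.

```python
# A simplified, vendor-neutral sketch of document enrichment for governance
# and RAG: classify each chunk and attach business terms before indexing.
import re

CLASSIFIERS = {
    # Illustrative information types, not the ~200 built-in ones.
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "passport":    re.compile(r"\b[A-Z]{1,2}\d{6,9}\b"),
}

GLOSSARY = {  # hypothetical business terms keyed by trigger phrases
    "refund": "Customer Refund",
    "shipment": "Order Fulfillment",
}

def enrich_chunk(chunk: str) -> dict:
    """Return the chunk plus classifications, sensitivity and business terms."""
    classifications = [name for name, rx in CLASSIFIERS.items() if rx.search(chunk)]
    terms = [term for trigger, term in GLOSSARY.items() if trigger in chunk.lower()]
    sensitivity = "confidential" if classifications else "general"
    return {
        "text": chunk,
        "sensitivity": sensitivity,
        "classifications": classifications,
        "business_terms": terms,
    }

print(enrich_chunk(
    "Refund issued to card 4111 1111 1111 1111 after the shipment was lost."
))
```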

A year later, bridging the governance silos between data and AI remains elusive. Credit Databricks for being one of the few to actually build capability into Unity Catalog for correlating which models use which sets of data.

There was also growing industry support for knowledge graphs and for using them to ground AI with the GraphRAG pattern. The guiding notion behind GraphRAG has been to make vector similarity searches more relevant. Knowledge graph adoption is still embryonic. We’ve started to see case studies from classic early adopters — for instance, Deloitte using Amazon Neptune graph database’s vector support for cybersecurity intelligence, or AdaptX using FalkorDB to analyze complex medical and patient data to improve outcomes. Otherwise, GraphRAG has been fodder for research studies covering use cases in human resources, tax and financial compliance, manufacturing and logistics.
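For readers new to the pattern, here is a minimal sketch of the GraphRAG idea: after a vector search surfaces a seed entity, walk a knowledge graph to pull in connected facts so the language model gets related context, not just similar-looking chunks. The graph, relationships and seed entity are toy examples and reflect none of the vendor implementations mentioned above.

```python
# A toy sketch of GraphRAG-style context assembly using networkx.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Order 1042", "Customer: Acme Corp", relation="placed_by")
kg.add_edge("Order 1042", "Product: Widget X", relation="contains")
kg.add_edge("Product: Widget X", "Plant: Rotterdam", relation="manufactured_at")
kg.add_edge("Plant: Rotterdam", "Delay: port strike", relation="affected_by")

def graph_context(seed: str, hops: int = 2) -> list[str]:
    """Collect facts within N hops of the entity the vector search matched."""
    facts = []
    frontier = {seed}
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            for _, target, data in kg.out_edges(node, data=True):
                facts.append(f"{node} --{data['relation']}--> {target}")
                next_frontier.add(target)
        frontier = next_frontier
    return facts

# Suppose a vector search on "why is my order late?" matched "Order 1042";
# the connected facts below would be appended to the model's prompt.
print("\n".join(graph_context("Order 1042")))
```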

The industry has stepped up to build the tooling that facilitates GraphRAG. A few examples: Google Vertex AI RAG Engine provides the data framework for building RAG applications and is now integrated with Spanner. Microsoft, which originated the GraphRAG pattern, has been refining it with optimizations such as LazyGraphRAG, which reduces indexing bottlenecks for large data sets.

Neo4j, the last big man standing in a long tail of specialized graph databases, has released a suite of tools, libraries and database features. The highlights include a Python library for building full graph-based reasoning applications and an LLM Knowledge Graph Builder online app for turning unstructured text into knowledge graphs. GraphRAG does not have a monopoly on grounding RAG applications; AWS takes a different approach, employing language models to evaluate the outputs.

Of course, building the knowledge graphs for grounding generative AI applications has its fair share of challenges, among them getting the ontology right. And that comes down to a very human challenge that we pointed out a year ago: Generative AI will drive the need for more knowledge engineers. That sets the stage for what we expect to unfold in 2026.

What’s happening in 2026? Semantic spheres of influence will coalesce

AI models don’t only require “good” data; they require the “right” data. Of course, that’s true for any application that uses data, but at least with traditional business intelligence or predictive apps running against known sets of data in data marts, warehouses or lakes, context was implicit. That’s not the case for language models piercing the boundaries of data lakes. Context must be explicit.

For instance, when ingesting and chunking various sources of documents, emails, texts or social network content, can you be sure that when one or more customers are talking about a “product” or “delivery issues” they are talking about the same thing, or that the similarities with different products or delivery issues are relevant? For AI, semantics should eliminate that guesswork by making explicit the business context, data definitions and object relationships (hopefully instantiated in an underlying knowledge graph).
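A tiny sketch of what “making context explicit” can look like in practice: map free-text mentions to canonical glossary entries (a plain dictionary stands in here for a knowledge graph) before chunks are embedded, so that “delivery issues” and “late shipment” resolve to the same business concept. All terms and synonyms below are hypothetical.

```python
# A minimal sketch of resolving free-text mentions to canonical business
# concepts before embedding. The glossary is a hypothetical stand-in for
# a knowledge graph or semantic layer.
CANONICAL_TERMS = {
    "Delivery Issue": ["delivery issues", "late shipment", "package never arrived"],
    "Product: Widget X": ["widget x", "the widget", "wx-100"],
}

SYNONYM_INDEX = {
    syn: canonical
    for canonical, synonyms in CANONICAL_TERMS.items()
    for syn in synonyms
}

def tag_chunk(chunk: str) -> dict:
    """Attach canonical business concepts to a chunk before it is embedded."""
    text = chunk.lower()
    concepts = sorted({canon for syn, canon in SYNONYM_INDEX.items() if syn in text})
    return {"text": chunk, "concepts": concepts}

print(tag_chunk("Customer says the WX-100 was a late shipment again."))
# -> concepts: ['Delivery Issue', 'Product: Widget X']
```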

That’s why our eyes were opened early last year to the need for knowledge engineers. And it led to a revival of interest in an age-old idea: semantic layers. Having existed in BI tools and enterprise applications for years, semantic layers have long been taken for granted. An abstraction above business glossaries, semantic layers codified the types of reports and key performance indicators in a form of catalog. In many BI tools, semantic tiers were often called metrics tiers, as they were considered the definitive repositories of metrics and KPIs.

Business Objects Universe was the first of those BI semantic layers; the goal was establishing a single source of the truth. It defined objects by dimension (the name of the entity, such as “customer” or “store city”), measure (what the numbers stand for, e.g., “sales revenue” or “unit costs”) and detail (supporting information, e.g., the customer’s phone number or home city). In the BI world, use of semantic layers waxed and waned; for instance, when self-service visualizations grew popular after Tableau appeared on the scene, customers prioritized getting access to their own data extracts over defining enterprise standards for reporting.
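Translated into code, that Universe-style object model might look like the minimal sketch below, with dimension, measure and detail as the three building blocks. The class and field names are illustrative, not Business Objects’ actual metadata model.

```python
# An illustrative object model for a BI-style semantic layer entry.
from dataclasses import dataclass, field

@dataclass
class Dimension:
    name: str            # the entity, e.g. "customer" or "store city"
    source_column: str

@dataclass
class Measure:
    name: str            # what the numbers stand for, e.g. "sales revenue"
    source_column: str
    aggregation: str = "sum"

@dataclass
class Detail:
    name: str            # supporting information, e.g. "customer phone"
    parent_dimension: str
    source_column: str

@dataclass
class SemanticObject:
    name: str
    dimensions: list[Dimension] = field(default_factory=list)
    measures: list[Measure] = field(default_factory=list)
    details: list[Detail] = field(default_factory=list)

retail_sales = SemanticObject(
    name="Retail Sales",
    dimensions=[Dimension("customer", "dim_customer.name"),
                Dimension("store city", "dim_store.city")],
    measures=[Measure("sales revenue", "fact_sales.revenue"),
              Measure("unit costs", "fact_sales.unit_cost")],
    details=[Detail("customer phone", "customer", "dim_customer.phone")],
)
print(retail_sales.name, len(retail_sales.measures))
```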

Semantics have had a longer-lasting presence in the enterprise applications world because they defined not only the data, but also the business processes underlying it. We may not have called them semantics back then, but the data entities and processes that define enterprise applications have always been as foundational as their source code, and not surprisingly, they are just as closely guarded. For instance, you can’t run SAP’s, ServiceNow’s or Salesforce’s knowledge graphs outside their applications.

Instead, the metrics from their application platforms are typically exposed to third-party systems through abstractions, which used to be quite primitive. For instance, prior to the recent launch of Tableau Semantics, Salesforce only provided low-level metadata APIs and query languages, so each BI tool had to build its own interpretation of Salesforce’s schema. The same was true for SAP before it launched data products; significantly, SAP only grants access by sharing, not exporting, these data products to third-party systems. For ServiceNow, the recent acquisition of the data catalog data.world should provide that semantic window to the world.

In a harbinger of things to come, when SAP unveiled data products as part of its Business Data Cloud, it announced an OEM relationship with Databricks for sharing those products with Databricks users. Semantics was what brought together two fiercely independent companies: Databricks had the data science capabilities that SAP lacked, while SAP had the view into the context of its data that Databricks lacked. Since then, SAP has concluded similar data sharing agreements with Google BigQuery, Snowflake and Microsoft Fabric. We’re wondering when AWS will get around to it.

Yet another shot was fired by Microsoft with the unveiling of Fabric IQ. Building on the existing Power BI semantic layer that has been part of Fabric, IQ adds ontology. All too often, semantics and ontology are conflated; the distinction is that semantics are about what things mean (e.g., defining business terms and concepts for data entities for the purpose of analytic reporting) while ontology is about how knowledge is structured (e.g., what the business is and how it operates). What’s significant is why Fabric IQ emerged. When Microsoft added real-time intelligence to Fabric, it needed to ensure that actions taken are in the right context. The emergence of AI agents, which expands the surface through which AI interacts with data, raised the stakes.

And this was the backdrop for Snowflake’s September announcement of the Open Semantic Interchange initiative, backed by founding partners Salesforce, dbt Labs (about to become part of Fivetran), RelationalAI and more than a dozen others. Based around MetricFlow, which was developed by dbt Labs, OSI is designed to provide a universal format for defining metrics and dimensions. Under the hood, OSI uses YAML files to configure canonical semantic models composed of data entities and the dimensions, measures and labels associated with them. At this point, the specification is extremely preliminary; for instance, although a default measure is a required value, only a single one has been defined so far.
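To give a feel for what such a YAML-configured semantic model looks like, here is a hypothetical sketch built in Python and serialized with PyYAML. The entity, dimension and measure fields are illustrative guesses at the shape of an OSI/MetricFlow-style definition, not the actual specification.

```python
# A hypothetical OSI/MetricFlow-style semantic model, serialized to YAML.
# Field names and values are illustrative, not the real OSI spec.
import yaml  # PyYAML

semantic_model = {
    "semantic_model": {
        "name": "orders",
        "entities": [
            {"name": "order_id", "type": "primary"},
            {"name": "customer_id", "type": "foreign"},
        ],
        "dimensions": [
            {"name": "order_date", "type": "time"},
            {"name": "store_city", "type": "categorical"},
        ],
        "measures": [
            # A default aggregation is required; "sum" stands in here.
            {"name": "sales_revenue", "agg": "sum", "label": "Sales Revenue"},
            {"name": "unit_costs", "agg": "sum", "label": "Unit Costs"},
        ],
    }
}

print(yaml.safe_dump(semantic_model, sort_keys=False))
```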

The challenge for OSI is that open-source projects typically succeed when they address highly commoditized technology building blocks where there is no differentiating value, such as the Apache Iceberg open table format. Semantics, however, are high up the value chain – so much so that applications vendors jealously guard them. Nonetheless, AI has raised the urgency of making semantic definitions interoperable across different query engines. Snowflake and other query engine providers sense the urgency of feeding the right data in the right context to AI models precisely because so much application value is tied up in semantics, where the SAPs, ServiceNows, Oracles and Salesforces have home-court advantage.

In the AI world of 2026, the maxim “Knowledge is power” could be updated to “Semantics is power.” As we noted last year, there is huge latent demand for knowledge engineers, one of the few professions where, for now, AI is increasing the need for people rather than replacing them. But mapping the knowledge of the organization is daunting and, according to Jessica Talisman, who runs an ontology consulting practice, requires more than the skills of a data architect.

Oh, and by the way, it’s not the first time we’ve been to this dance. Recall the abortive knowledge management initiatives of the 1990s that turned into consulting boondoggles? There is the risk that project teams might scrape the wrong data for their knowledge graphs, declare victory and then go home.

Where’s the beef?

For most organizations, where are the semantics likely to reside? Traditional early adopters and organizations in sectors like financial services or telecom might kick off or ramp up initiatives to define their own semantic tiers and/or definitive ontologies.

But for most organizations lacking deep skills or rigorous enterprise architecture practices, the starting point for defining semantics is going straight to the sources: enterprise applications and/or the newer breed of data catalogs that are branching out from their original missions of locating data and/or providing the points of enforcement for data governance. In most organizations, the solution is not going to be either-or.

In the age of AI, it’s logical that the SAPs, ServiceNows, Salesforces and Oracles of the world would take this to the next level, opening up the definitions of measures, data entities and business processes to entrench their semantic spheres of influence. That has provided ample reason for the Snowflakes, Databricks, Informaticas, ThoughtSpots, Starbursts, Collibras and Alations of the world to band together and respond with OSI. In 2026, semantics will be the next battleground between umbrella solutions and best of breed, with SAP’s data sharing agreements with Databricks and Snowflake forming the most likely template.

All this is happening because AI doesn’t only need good data, but the right data. It’s going to be an interesting year.

Tony Baer is principal at dbInsight LLC, which provides an independent view on the database and analytics technology ecosystem. Baer is an industry expert in extending data management practices, governance and advanced analytics to address the desire of enterprises to generate meaningful value from data-driven transformation. He wrote this article for SiliconANGLE.

Image: SiliconANGLE/TK
