Name the hot buttons about generative artificial intelligence, and they often center on data.
Here’s just one example: A recent survey by risk management firm Riskonnect of more than 300 risk and compliance professionals found that their top AI concerns were data privacy and cyber issues, followed closely by employee decisions based on erroneous information and, with it, employee misuse and ethical risks. Then came copyright and intellectual property risks. Similar conclusions have come from KPMG, the Massachusetts Institute of Technology, PricewaterhouseCoopers and others.
That’s why we forecast that this year, AI would drive a Renaissance for Data. As AI projects advance from proof of concept to production, organizations have to pay serious attention to the data being used for training and inference. We recently had a chance to test that premise at Data Day Texas, an annual gathering of the data community down in Austin, and one theme stood out: the need for organizations to understand the context of their data.
Concern over understanding the context of data stems from the need to ensure that AI models are running on the right data. And from that comes the need for a professional who has a handle on that context and can point you to the right stuff: the knowledge engineer. That’s the person with many of the qualities that, in a previous era, we valued in reference librarians.
Hold those thoughts.
During the heyday of Big Data, conventional wisdom held that the more data you had, the more complete the picture. But we’ve learned a few lessons since then, best summed up as: bigger is not always better.
Last week’s headlines over DeepSeek underscore this point. Whether or not you buy into DeepSeek’s benchmarks, there was already an emerging trend toward delivering rightsized models based on a variant of the 80/20 rule: having just enough data and model to deliver results that are “good enough.” Providers such as Databricks Inc., IBM Corp. and Snowflake Inc. have already been working on approaches such as “mixture of experts” aimed at rightsizing models. To the initiated, it shouldn’t have been surprising that a DeepSeek would come along at some point to pop the OpenAI bubble.
Rightsized models raise the stakes on having the right data. Throwing lots of mud at the wall and hoping some will stick won’t work when you have a smaller wall to work with. And to understand whether the data is right for the query, you must know its context. That’s not simply knowing what data is in your portfolio; it’s knowing everything relevant about the data: What is the source? How is it collected? How is it meant to be used? Who is meant to use it? You get the picture.
Of course, we’ve always needed the right data. But it’s one thing to conduct a conventional analytics query against a known data set in an enterprise data warehouse or data mart; the risk surface expands when running AI because the models are often black boxes that are not easily explainable. The challenge is compounded where unstructured data is involved, as the data sources are likely not as well-known or vetted. A faulty BI query might yield a flawed statistical analysis of customer demand, but an off-the-rails AI response could generate an unexplainable hallucination where the mistake might not always be obvious.
In short, finding that context is about answering what journalists have long referred to as the Five Ws: who, what, where, when and why. To that, we’ll add a sixth, “honorary” W: how.
That message came through loud and clear in the conference keynote delivered by Ole Olesen-Bagneux. His message? Data in your organization is likely highly dispersed, and that dispersal makes it harder to get a grip on those Five Ws. A few years back, the dialogue over distributed data gave data mesh its 15 minutes of fame. But the federated governance that data mesh required proved tricky, and data mesh’s lasting legacy has been data products.
Instead, Olesen-Bagneux proposed something more modest: From the bottom up, map the systems and the people or line organizations that they serve, and more importantly, what systems are connected upstream and downstream. Track it through the flow of metadata.
Olesen-Bagneux called his concept the meta grid. He focuses on metadata rather than raw data because metadata yields insight into those Five Ws. And a lot of that context comes from the types of sources that generate the metadata: If the metadata comes from a database, it defines schema; from a business application, it is largely about business logic; and from cloud infrastructure, it is about how data gets processed.
For instance, data pipelines can yield a goldmine of context. They map out what data is shared across the organization and, importantly, how it is transformed and shared. Understanding how data is transformed matters because transformation can affect meaning: meaning can be enhanced or potentially lost. And that’s where the honorary “how” of the Five Ws comes in. Understanding the source and how and where metadata flows is an important step toward understanding the pulse of information across your organization.
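To make the idea concrete, here’s a minimal sketch, in Python, of what a meta grid might look like: systems as nodes, metadata flows as directed edges annotated with the transformations applied in transit. The system names, owners and metadata categories are our own illustrative assumptions, not part of Olesen-Bagneux’s definition.

```python
# A minimal, hypothetical sketch of a meta grid: systems as nodes, metadata
# flows as directed edges. Names and categories are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SystemNode:
    name: str           # the system itself, e.g. "crm_db"
    serves: str         # the line organization it serves (the "who")
    metadata_kind: str  # what its metadata describes: schema, business logic, processing

@dataclass
class MetadataFlow:
    upstream: str       # source system
    downstream: str     # consuming system
    transformation: str # how the data, and hence its meaning, changes in transit

systems = {
    "crm_db": SystemNode("crm_db", "Sales", "schema"),
    "billing_app": SystemNode("billing_app", "Finance", "business logic"),
    "lakehouse": SystemNode("lakehouse", "Data platform", "processing"),
}

flows = [
    MetadataFlow("crm_db", "lakehouse", "PII masked; duplicate customers merged"),
    MetadataFlow("billing_app", "lakehouse", "invoice line items rolled up monthly"),
]

def upstream_of(system: str) -> list[str]:
    """Answer a Five Ws question: where does this system's data come from?"""
    return [f.upstream for f in flows if f.downstream == system]

print(upstream_of("lakehouse"))  # ['crm_db', 'billing_app']
```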
Here’s a fun fact: Olesen-Bagneux’s concept is so new that, at least for now, you won’t find it in a Google search. So don’t confuse the meta grid with a project tracking the connections of figures in history, or with the workflow app for the Mac on the App Store.
Juan Sequeda picked up where Olesen-Bagneux left off by speaking of the importance of sharing semantics, which is critical for organizations to be able to reuse their data effectively. Andrew Nguyen, a data scientist with Best Buy Health, amplified that point with an example from his field: When a practitioner checks a box for a specific medical condition on a patient record, the question to ask is, who entered it? Was it a physician making a diagnosis, a med student or scribe interpreting the doctor’s prognostication, or a clinician making an observation?
Each of them may have a different understanding of when to check the box for that medical condition. Communication and understanding people’s roles yield insight into what those data values mean.
Nguyen’s example came as part of a broader discussion about why connecting to the context of data will improve the effectiveness of AI models. He described an emerging discipline, context engineering, for systematically capturing context and making it explicit. The discipline is not yet well-defined, and there are few, if any, published references; the closest you’ll find is a technical blog post from Data.world describing a system that persists this context in the form of an enriched knowledge graph. And it’s a further hint of the need for knowledge engineering.
You’ve heard it all before: When sharing data, we need to speak the same language.
With structured data, that’s about schema. Sequeda pointed to the success of Schema.org, an open-source project co-founded by Google, Microsoft, Yahoo and Yandex. If it could standardize how websites globally structure their data, harmonizing schema design practices inside enterprises shouldn’t have to be a moonshot.
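For a flavor of that shared language, here’s a small Schema.org description of a product in JSON-LD, expressed as a Python dictionary for consistency with the other sketches in this piece. The vocabulary comes from Schema.org; the values are invented for illustration.

```python
import json

# A small Schema.org "Product" description in JSON-LD. The @context, @type
# and property names come from schema.org; the values are invented.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Acme Anvil",
    "sku": "ACME-001",
    "brand": {"@type": "Brand", "name": "Acme"},
    "offers": {
        "@type": "Offer",
        "price": "49.99",
        "priceCurrency": "USD",
    },
}

print(json.dumps(product, indent=2))
```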
But inside enterprises, the reality all too often is that common entities have multiple definitions from one line organization or system to the next. We’ve heard tales of enterprises with dozens of definitions of what a customer or product is.
And then there’s the matter of NULL values, which different systems often use differently. For instance, does a NULL signify a data value that is missing, outside the normal range, or otherwise undefined? Though NULLs may not be used for answering queries, understanding how they are used puts a spotlight on the quality and reliability of the data, or the lack thereof.
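Here’s a minimal sketch of what disambiguating NULLs might look like: rather than leaving NULL overloaded, map each system’s convention to an explicit reason code. The per-system conventions below are hypothetical.

```python
from enum import Enum

class MissingReason(Enum):
    NOT_COLLECTED = "not_collected"    # the field was never captured
    OUT_OF_RANGE = "out_of_range"      # the system flagged an invalid reading
    NOT_APPLICABLE = "not_applicable"  # the field doesn't apply to this record

# Hypothetical per-system conventions: each source uses NULL to mean
# something different, so this mapping makes that meaning explicit.
NULL_CONVENTIONS = {
    "legacy_erp": MissingReason.NOT_COLLECTED,
    "sensor_feed": MissingReason.OUT_OF_RANGE,
    "crm": MissingReason.NOT_APPLICABLE,
}

def explain_null(source_system: str, value):
    """Replace an ambiguous NULL with a documented, queryable reason."""
    if value is None:
        return NULL_CONVENTIONS.get(source_system, MissingReason.NOT_COLLECTED)
    return value

print(explain_null("sensor_feed", None))  # MissingReason.OUT_OF_RANGE
```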
Of course, all this so far is about structured data. When it comes to language models, we’re working with written text or voice from a variety of sources, where the type of source has a huge bearing on context. Words or phrases written in an email serve different purposes and may have to be interpreted differently from text extracted from images or reference documents. For instance, an email or message is more likely to contain hearsay than a reference manual that has been vetted, whereas text extracted from an image could easily be taken out of context.
Unstructured data is the frontier when it comes not only to data governance but also to knowledge engineering. The current state of the art is rather basic: Assess the provenance and track the lineage at the file level. Ultimately, we could use the entity extraction capabilities of language models to elicit metadata, and if we kludge enough tools together, we could harvest that metadata to help discover (and govern) these assets.
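As a rough illustration of that kludged-together approach, the sketch below pairs file-level provenance with a stand-in for the entity extraction step; a real pipeline would call an actual language model or NER service where the stub sits.

```python
import hashlib
from pathlib import Path

def file_provenance(path: str) -> dict:
    """File-level provenance and lineage: roughly today's state of the art
    for unstructured data."""
    p = Path(path)
    stat = p.stat()
    return {
        "path": str(p),
        "size_bytes": stat.st_size,
        "modified": stat.st_mtime,
        "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
    }

def extract_entities(text: str) -> list[str]:
    """Naive stand-in for the language model or NER service that would
    actually harvest candidate metadata; a real system would call a model."""
    return sorted({word.strip(".,;:") for word in text.split() if word.istitle()})

def harvest_metadata(path: str) -> dict:
    """Kludge the pieces together: provenance plus extracted entities."""
    record = file_provenance(path)
    record["entities"] = extract_entities(Path(path).read_text(errors="ignore"))
    return record
```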
The technology process today is still quite complex, and it doesn’t directly address context: Would you entrust a language model to deduce the meaning of the same term or phrase coming from a message or legal contract? And would you place faith in a language model to ascertain if the source contains the right data to answer the question?
Cut to the chase: The answer to those questions involves a “who.” AI can crawl information, extract metadata and generate a layout of your data landscape, but humans must still be in the loop.
That point came through loud and clear at an end-of-day “town hall” session moderated by Joe Reis and Matthew Housley. One of the first statements from the audience was about the crying need for knowledge engineers.
What’s old is new. Knowledge engineering has roots not only in enterprise architecture and data stewardship but also in library science. Traditionally, enterprise architects and data stewards concerned themselves with structured data. But with language models bringing in unstructured text in its many forms, knowledge engineering takes important cues from the reference librarians we relied on to pick up where card catalogs left off.
Revealing our age here, we recall the days when research projects typically involved a trip to the library for printed documents and microfiche. Card catalogs pointed us to authors and subjects, but when it came to finding out which books, periodicals or technical journals contained the most relevant information, reference librarians were there to point the way. They were equipped to do so because a key skill of their profession was understanding the context of different types of sources.
Fast-forward to the present, and knowledge engineering calls for many of those skills. It updates the reference librarian with the software engineering skills associated with data science, such as thorough knowledge of languages (e.g., Python) and frameworks (e.g., TensorFlow), plus an understanding of how to use machine learning for knowledge extraction. It also borrows heavily from semantic disciplines such as ontology development and knowledge representation, along with related skills for representing knowledge through graphs (property or RDF).
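To ground those semantic skills, here’s a small example using the open-source rdflib library to express the “who checked the box” distinction from earlier as RDF triples. The mini-ontology is invented for illustration; only the rdflib calls are real.

```python
# pip install rdflib -- an open-source RDF toolkit for Python.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical mini-ontology for clinical assertions on a patient record.
EX = Namespace("http://example.org/clinical#")
g = Graph()
g.bind("ex", EX)

# Model the "who checked the box" distinction as classes of assertion.
g.add((EX.Diagnosis, RDFS.subClassOf, EX.ClinicalAssertion))
g.add((EX.Observation, RDFS.subClassOf, EX.ClinicalAssertion))

# The same checkbox carries different weight depending on who asserted it.
g.add((EX.assertion42, RDF.type, EX.Diagnosis))
g.add((EX.assertion42, EX.assertedBy, EX.attendingPhysician))
g.add((EX.assertion42, RDFS.comment, Literal("Condition confirmed by a physician")))

print(g.serialize(format="turtle"))
```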
Modern crawlers and AI technologies play supporting roles by scaling the knowledge engineer’s capability to understand the context of what’s out there. As agentic AI technology matures, it may also automate some of the legwork of entity extraction and summarization.
But human judgment must take over to ascertain that the information is appropriate and pertinent. Knowledge engineers can hardly operate solo in their ivory towers; they need to collaborate with subject matter and domain experts down in the trenches. It takes people, not machines, to ascertain the true meaning of that NULL value, or whether the source material is right for training or running AI inference.
It’s not surprising that demand for knowledge engineers has coincided with the rising prominence of knowledge graphs. Just like machine learning, which has long been present under the hood of predictive analytics and forecasting solutions, knowledge graphs have grown ubiquitous. Off-the-shelf enterprise business solutions include Microsoft Graph, which underlies Microsoft 365 (formerly Office) and powers collaboration; Salesforce Data Graph, which acts as a materialized view of the relationships between customer contacts; and SAP Datasphere Knowledge Graph, which enshrines the interrelationships of business process metadata from its semantic tier.
We were reassured, as Sequeda noted in his postmortem of the event, that knowledge graphs were no longer treated as “new” concepts. Though it still takes a combination of art and science to build knowledge graphs, ultimately we expect they will become the killer application for grounding retrieval-augmented generation, or RAG, implementations.
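Here’s a rough sketch, under our own assumptions, of how a knowledge graph might ground a RAG pipeline: before prompting the model, look up the entities in the question, pull their vetted relationships from the graph, and pass them along as context. The graph contents (reusing the invented product from the Schema.org sketch earlier) and the prompt-assembly flow are hypothetical, not a real stack.

```python
# Hypothetical knowledge-graph-grounded RAG: the graph supplies vetted facts
# that a human (the knowledge engineer) has already signed off on.
KNOWLEDGE_GRAPH = {
    # entity -> curated (relation, object) facts
    "Acme Anvil": [("manufactured_by", "Acme"), ("category", "hardware")],
}

def ground_with_graph(question: str) -> list[str]:
    """Pull vetted facts for any entity mentioned in the question."""
    facts = []
    for entity, relations in KNOWLEDGE_GRAPH.items():
        if entity.lower() in question.lower():
            facts += [f"{entity} {rel} {obj}." for rel, obj in relations]
    return facts

def build_prompt(question: str) -> str:
    """Assemble a prompt whose context comes from the graph, not the open web,
    keeping the model grounded and its answers explainable."""
    context = "\n".join(ground_with_graph(question)) or "No vetted facts found."
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("Who makes the Acme Anvil?"))
```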
With AI creating a need for knowledge engineering, information science (as we used to call it) has come full circle. It comes as enterprises demand that the principles of value engineering be applied to AI, so that training and running models doesn’t generate millions of dollars in cloud computing bills and that AI models stay properly grounded.
Enterprises must be confident that they are running their models with just the right data, and that’s where understanding the context of the data comes in. Though AI can assist in the legwork of harvesting data, ultimately it will require the skills of knowledge engineers to make the final call. With knowledge engineering’s roots in library science, it might be time to update the old Harvard Business Review article, naming the present-day equivalent of the reference librarian as the sexiest job of the 21st century.
Click here for a deeper-dive discussion on Data Day Texas, as Tony Baer, along with Juan Sequeda and Matthew Housley, review their takeaways on The Joe Reis Show.
Tony Baer is principal at dbInsight LLC, which provides an independent view on the database and analytics technology ecosystem. Baer is an industry expert in extending data management practices, governance and advanced analytics to address the desire of enterprises to generate meaningful value from data-driven transformation. He wrote this article for SiliconANGLE.