AI
AI
AI
As artificial intelligence agents become more powerful, agentic AI governance becomes increasingly important – and yet, today’s governance solutions struggle to keep AI agents from going off the rails.
In my last article in this series, I discussed the state of the art for keeping agents on the rails: multiple diverse adversarial validators with multilayer validation.
The idea is straightforward: To keep agents on track without limiting their capabilities, deploy several independent validator agents that evaluate each agent’s performance, looking for problems.
Only when enough of the validators agree the agent is performing properly can it proceed with its task.
For the most part, however, this vision for agentic AI governance is still on the drawing board. Only a few vendors are implementing validators of various levels of maturity.
Upon interviewing several of these vendors, a common bottleneck emerges: Leveraging validators to govern agentic orchestrations is too slow and expensive to support modern automation requirements. The state of the art for validator-centric governance, therefore, is coming up with ways to circumvent such latency and token consumption bottlenecks.
Here are some of the vendors tackling this problem today:
Agents that can evaluate the performance and behavior of other agents perform evaluations of those agents, and thus vendors call the practice of building them eval engineering.
The validators that ensure agents are behaving properly, therefore, are a type of evaluation. Eval engineering is broader than the practice of agentic AI governance, although the two are closely related.
Broadly speaking, eval engineering focuses on designing, running and operationalizing evaluations of large language model applications in general and agentic applications specifically.
The “LLM-as-a-judge” scoring technique is particularly useful for building agentic AI evaluators. With this technique, engineers can assess the quality, correctness and relevance of an agent’s (or other AI application’s) output.
Eval engineers combine LLM-as-a-judge scoring with software testing and observability to build LLM evals.
The most straightforward application of eval engineering is for the testing of AI agents before deployment into production. Using evals for testing avoids the performance and cost bottlenecks of eval engineering because they don’t run in production.
Eval engineers run structured evaluation pipelines against a variety of curated data sets including normal, edge case and adversarial inputs. Using LLM-as-a-judge scoring, engineers can measure accuracy, task completion, latency, policy compliance and other critical quality metrics as part of the engineering process.
Using eval engineering for testing is relatively common. My research turned up several vendors who offer this capability, including Comet ML Inc., Confident AI Inc., Evidently AI Inc., GoodEye Labs Inc. and the open-source MLflow Project, a Series of LF Projects LLC at the Linux Foundation, among others. I expect to write an article covering these vendors (and any I missed) in a future article.
The goal of many agentic systems is to orchestrate agents’ autonomous behavior to deliver automations. However, the more sophisticated the automation workflow, the more likely some agent in that workflow will go off the rails and take an undesirable action.
I spoke to Dany Kitishian, founder and chief executive of Klover Intelligence Corp. which does business as Klover AI. He explained that instead of focusing on automation, his company leverages eval engineering to deliver more accurate responses to queries than LLMs alone can typically offer.
Its platform takes input data, extracts and evaluates each fact within those data, analyzes each fact for accuracy within the context of opposing points of view, and then delivers well-reasoned responses based upon this analysis.
For Klover, evaluation is a layered framework that tests for correctness and alignment with real-world outcomes, delivering a measurable decision system rather than potentially dangerous autonomous agentic AI workflows.
Because Klover leverages curated data sets and doesn’t participate in time-sensitive automations, the cost and time limitations of eval engineering aren’t issues for its customers.
The most substantial cost and time bottlenecks for eval engineering constrain the governance of agentic workflows in production, and thus the greatest challenge – and promise – of eval engineering is supporting full-lifecycle agentic AI governance.
Without eval engineering, however, it would be impossible to implement the diverse adversarial validators so important for successful governance, thus limiting the ability for vendors to deliver effective agentic AI governance solutions at scale.
Eval engineers must conduct evals throughout the agent lifecycle, iteratively evaluating the performance of agents individually as well as agentic workflows for accuracy and alignment with goals. The eval process must automatically uncover drift and other failures, and feed back that information into the continuous integration/continuous delivery or CI/CD process.
In my interview with Vaibhavi “VG” Gangwar, co-founder and CEO of H3 Labs Inc., doing business as Maxim AI, she explained how her company combines eval engineering with prompt engineering, observability and simulations to help engineering teams build reliable agentic systems via continuous testing, monitoring and debugging.
Maxim AI combines “offline” evals (during development) as well as “online” evals during production. Offline evals focus on testing agentic behavior, while online evals work out of band to provide levels of confidence in the behavior of the agents in question.
In other words, Maxim uses a sampling-based approach during production to reduce token costs and to avoid slowing down the execution of agentic workflows, focusing its evals on high-risk interactions.
Several other vendors leverage eval engineering for full-lifecycle agentic AI governance. Arize AI Inc. offers an observability and eval platform for production AI systems including agentic AI workflows.
Arize tackles the performance challenges of running evals in production by offering continuous lightweight monitoring, reserving LLM-as-a-judge evals for high-risk situations much as Maxim does.
Conscium Ltd. also avoids limiting the performance of evals in production by delivering controlled virtual simulations that can identify unsafe agentic behavior, goal drift and policy violations.
Confident AI Inc. combines LLM-as-a-judge metrics with observability, tracing, and real-time monitoring to assess agentic behavior. It then feeds back the results of production interactions into ongoing eval datasets.
Confident AI bills itself as an eval-first platform, as it helps engineers test, monitor, and improve agentic systems across the full development and production lifecycle using automated evals, curated data sets, and repeatable testing workflows.
To address the latency and cost bottlenecks of eval engineering in production, the company moves most evals to asynchronous observability pipelines. Like Maxim AI, Confident AI leverages traffic sampling as well as targeted collection of metrics to reduce compute overhead.
Of the vendors I researched for this article, the one that stands out for having the most advanced answer to the cost/performance bottleneck is Galileo Technologies Inc., doing business as Galileo AI. To understand how Galileo AI’s solution differs from its competitors, it’s important to understand the research and innovation underlying its solution.
As co-founder and Chief Product Officer Atindriyo “Atin” Sanyal and Chief Marketing Officer Jason Garoutte explained it, Galileo’s story begins with ChainPoll. ChainPoll is a hallucination detection methodology that combines chain-of-thought reasoning and polling to deliver high-performance results.
Chain-of-thought reasoning requires evaluator models to explain their reasoning step-by-step. Polling means that the system runs evals multiple times (possibly leveraging different models) and then aggregates the results.
ChainPoll thus provides a methodology for reducing evals’ cost and performance overhead while governing agentic workflows and also sets up a framework for coordinating multiple evals. Leveraging ChainPoll, Galileo AI then developed Luna, a purpose-built evaluation model, to detect hallucinations in LLM outputs including retrieval-augmented generation or retrieval-augment generation-supported queries.
Where ChainPoll provides a methodology for providing yes/no or pass/fail results from evals, Luna offers a specialized model that delivers on the promise of ChainPoll with a substantially smaller token consumption footprint than competing LLMs can offer.
Leveraging the lessons of ChainPoll and the power and efficiency of Luna, Galileo AI implements specialized model-as-a-judge functionality for a tiny fraction of the cost and latency of LLM-as-a-judge alternatives that suffer under the overhead of general-purpose LLMs.
Unlike competing products, Galileo AI is able to offer agentic observability with 100% sampling in production, without requiring asynchronous, out-of-band evals or evals that leverage only a subset of available telemetry.
With Galileo AI, eval engineers can iterate their evals quickly, incorporating feedback to fine-tune Luna to resolve some of the knottier issues with misbehaving agents, including overconfidence, sycophantic behavior and their annoying tendency to break rules.
Given AI agents’ inherent nondeterministic behavior, no agentic AI governance is perfect, Galileo AI included. However, because of its high-efficiency approach as well as its ability to leverage chain-of-thought evals to govern agentic tasks within workflows, Galileo AI is able to deliver optimized agentic governance that gives its customers visibility and control over even the naughtiest of AI agents.
Though I focused on startups for this article, there is also innovation in eval engineering at larger vendors, including Google LLC, Microsoft Corp. and IBM Corp. Given the dominance of the major frontier models in the AI market, many LLM vendors have their fingers in the eval engineering pie as well.
Cisco Systems Inc. is also tossing its hat in the ring by acquiring Galileo AI. This transaction is imminent and promises to roll the startup into Cisco’s Splunk organization.
The primary takeaway from this article, however, isn’t the state of innovation in agentic AI governance. Rather, it’s the increasing cost and latency challenges inherent in LLM-based offerings.
These challenges, after all, are industry-wide, and are only getting worse. As LLMs become more powerful and thus consume more tokens, organizations will be looking for increasingly cost-effective ways to extract value from LLMs and AI generally.
In other words, for the perennial better-faster-cheaper triangle, LLMs are moving from the “better” corner to “faster and cheaper” – a true sign that the technology is reaching a level of maturity.
There are many vendors I was unable to fit into this article. If you think you belonged, or if I mentioned you but never spoke to you, I want to hear from you. Email me at jason@intellyx.com and we can set up a briefing.
Jason Bloomberg is founder and managing director of Intellyx, which advises business leaders and technology vendors on their digital transformation strategies. He wrote this article for SiliconANGLE. IBM, Microsoft and Splunk are former Intellyx customers. A human wrote every word of this article.
Support our mission to keep content open and free by engaging with theCUBE community. Join theCUBE’s Alumni Trust Network, where technology leaders connect, share intelligence and create opportunities.
Founded by tech visionaries John Furrier and Dave Vellante, SiliconANGLE Media has built a dynamic ecosystem of industry-leading digital media brands that reach 15+ million elite tech professionals. Our new proprietary theCUBE AI Video Cloud is breaking ground in audience interaction, leveraging theCUBEai.com neural network to help technology companies make data-driven decisions and stay at the forefront of industry conversations.