UPDATED 15:45 EST / JANUARY 23 2025

AI

Galileo unleashes platform for evaluating AI agents

Galileo Technologies Inc., which makes tools for observing and evaluating artificial intelligence models, today unveiled Agentic Evaluations, a platform aimed at evaluating the performance of AI agents powered by large language models.

The company said it’s addressing the additional complexity created by agents, which are software robots imbued with decision-making capabilities that enable them to plan, reason and execute tasks across multiple steps and adapt to changing environments and contexts with little or no human oversight.

Because agent behavior is situational, developers can struggle to understand when and why failures occur. That hasn’t dampened interest in the technology’s workflow productivity potential. Gartner Inc. expects 33% of enterprise software applications to include agentic AI by 2028, up from less than 1% in 2024.

Agents challenge existing development and testing techniques in new ways. One is that they can choose multiple action sequences in response to a user request, making them unpredictable. Complex agentic workflows are difficult to model and require more complex evaluation. Agents may also work with multiple LLMs, making performance and costs harder to pin down. The risk of errors grows with the size and complexity of the workflow.

Galileo said Agentic Evaluations provides a full lifecycle framework for system-level and step-by-step evaluation. It gives developers a view of an entire multistep agent process, from input to completion, with tracing and simple visualizations that help them quickly pinpoint inefficiencies and errors. The platform uses a set of proprietary LLM-as-a-Judge metrics, built on an evaluation technique that uses LLMs to check and adjudicate tasks, designed specifically for developers building agents.

Metrics include an assessment of whether the LLM planner selected the correct tool and arguments, an assessment of errors by individual tools, traces reflecting progress toward the ultimate goal and a measure of how well the final action aligns with the agent’s original instructions. The metrics are between 93% and 97% accurate, the company wrote in a blog post.
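To make the tool-selection idea concrete, here is a minimal Python sketch of what checking "correct tool and arguments" can look like. It is not Galileo's implementation: the company describes its metrics as proprietary and LLM-judged, while this stand-in simply compares each planner choice against a hand-labeled expected call. The names `ToolCall`, `tool_selection_correct` and `tool_selection_accuracy` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation chosen by an agent's planner (hypothetical structure)."""
    name: str
    arguments: dict = field(default_factory=dict)


def tool_selection_correct(actual: ToolCall, expected: ToolCall) -> bool:
    """Return True only if both the tool name and its arguments match exactly."""
    return actual.name == expected.name and actual.arguments == expected.arguments


def tool_selection_accuracy(pairs) -> float:
    """Fraction of (actual, expected) pairs where the planner chose correctly.

    Returns 0.0 for an empty input rather than dividing by zero.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(tool_selection_correct(actual, expected) for actual, expected in pairs)
    return hits / len(pairs)


# Illustrative data only: one correct choice and one wrong tool.
examples = [
    (ToolCall("search_flights", {"to": "SFO"}), ToolCall("search_flights", {"to": "SFO"})),
    (ToolCall("send_email", {"to": "a@example.com"}), ToolCall("search_flights", {"to": "SFO"})),
]
accuracy = tool_selection_accuracy(examples)
```

An exact-match comparison like this needs a labeled reference for every step, which is one reason an LLM judge is attractive for open-ended agent workflows where several tool choices could be acceptable.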

Performance is measured using proprietary, research-based metrics at multiple levels. Developers can choose which LLMs are involved in planning and assess errors in individual tasks.

Aggregate tracking of cost, latency and errors across sessions and spans gives developers a consolidated view of what an agent costs to run and how quickly it responds. Alerts and dashboards help identify systemic issues, such as failed tool calls or misalignment between actions and instructions, to support continuous improvement. The platform supports the popular open-source AI frameworks LangGraph and CrewAI.
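As a rough illustration of rolling up cost, latency and errors across sessions and spans, the Python sketch below sums per-span figures into per-session totals. The `Span` record and its fields (`session_id`, `latency_ms`, `cost_usd`, `error`) are assumptions made for this example and do not reflect Galileo's actual data model or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """One traced step of an agent run (hypothetical record layout)."""
    session_id: str
    name: str
    latency_ms: float
    cost_usd: float
    error: bool = False


def aggregate_by_session(spans):
    """Sum latency, cost and error counts for each session."""
    totals = defaultdict(
        lambda: {"latency_ms": 0.0, "cost_usd": 0.0, "errors": 0, "spans": 0}
    )
    for span in spans:
        session = totals[span.session_id]
        session["latency_ms"] += span.latency_ms
        session["cost_usd"] += span.cost_usd
        session["errors"] += 1 if span.error else 0
        session["spans"] += 1
    return dict(totals)


# Illustrative data only: two sessions, one of which has a failed tool call.
spans = [
    Span("s1", "plan", latency_ms=100.0, cost_usd=0.25),
    Span("s1", "tool:search", latency_ms=250.0, cost_usd=0.5, error=True),
    Span("s2", "plan", latency_ms=80.0, cost_usd=0.125),
]
summary = aggregate_by_session(spans)
```

Totals like these are the sort of input an alert could watch, for example flagging any session whose error count is above zero.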

Agentic Evaluations is now available to all Galileo users. The company has raised $68 million, including a $45 million funding round last October.

Image: SiliconANGLE/Microsoft Designer
