UPDATED 19:11 EDT / APRIL 21 2025

AI

How AI-driven development tools impact software observability

“I made this whole program in 5 minutes with just a few (Insert GenAI tool) prompts. Any developer not using AI tools to replace developers will find themselves out of a job in two years”  – random AI fanboy on X

Let’s face it, the next few years are going to be really tough for software-driven companies and software engineers. 

Even the most successful startups on their way up will be asked to deliver more software with fewer development resources. That means we can expect to see more artificial intelligence tooling being used in development, in an attempt either to enhance developer productivity or to replace some work hours with AI-driven automation and agents.

Some stories about generative AI hallucinations are making the rounds, for instance when an Air Canada chatbot speciously offered a customer a refund, which resulted in a penalty when it tried to rescind the offer. Or Microsoft’s experimental Tay chatbot, which became progressively more “racist” through dialogue with bias-trolling users.

Haha, funny. We know large language model chatbots have insanely complex models that are largely opaque to conventional testing and observability tools. But enough said about the risks of putting AI-based applications in front of customers.

Let’s shift left, and explore how the use of AI development tools within development processes is affecting software observability and see if we can figure out why these problems are happening.

How would we know AI development tools are reliable in production?

As humans developing software, we never expected to be as fully engaged as we are now. Thanks to the evolution of automation and agile DevOps practices, per-developer productivity is at an all time high. So where else can we go from here with AI assistance?

Let’s look for better data than some fanboy on X saying he developed a whole app in five minutes. 

The recent 2024 DORA Report, with a massive survey audience underwritten by Google, does highlight significant improvements in documentation quality, code quality, and code review speed. Then, the report says:

“However, despite AI’s potential benefits, our research revealed a critical finding: AI adoption may negatively impact software delivery performance. As AI adoption increased [for each 25% increment], it was accompanied by an estimated decrease in delivery throughput by 1.5%, and an estimated reduction in delivery stability by 7.2%.”

As it turns out, AI-generated code within applications, when infused with complex probabilistic weighting and nondeterministic thinking, are less observable than conventional applications that contain rules-based logic.

It’s not just that AI coding and configuration assistants can make mistakes. The real problem with AI driven development is confidence. Since generative AI is designed to produce answers that are plausible and believable to the user, the AI will seem quite confident it is providing the right code and the right answers unless told to confidently investigate its own “thinking.”

We could go deep on so many aspects of AI’s impact on observability that we’d still only be scratching the surface. So, to further complicate matters, I talked to several leading vendors involved in making observability and software quality solutions.

Starting out with a policy-first approach

When using AI for development, the problem of alignment becomes especially sticky, because the AI-driven tools used for code, configuration or operations are twice-removed from the intention of the end user or customer. In other words, the AI should align with the intentions of the developer, who in turn is aligning the AI-powered software with its intended business purpose.

SmartBear was one of the first vendors to publish specific guidelines on how it would apply AI for development of its own software, before it started releasing AI-driven tools to software delivery teams.

“You can still get trapped in viewing observability through the lens of error tracking to make sure there’s no failures — and that presupposes that every other part of what you’re doing in the SDLC is adding more value to your customers when you definitely cannot hold that as constant,” said Vineeta Puranik, chief technology officer at SmartBear. “How do I know that all the code we’re writing, whether it’s AI-generated or human generated, is actually achieving those goals and making customers feel like they are getting more value over time out of the service?”

While AI routines have proven quite effective at taking real user monitoring traffic, generating a suite of possible tests and synthetic test data, and automating test runs on each pull request, any such system still requires humans who understand the intended business outcomes to use observability and regression testing tools to look for unintended consequences of change.

“So the system just doesn’t behave well,” Puranik said. “So you fix it up with some prompt engineering. Or maybe you try a new model, to see if it improves things. But in the course of fixing that problem, you did not regress something that was already working. That’s the very nature of working with these AI systems right now — fixing one thing can often screw up something else where you didn’t know to look for it.”

Hey man, debug my vibe code

There’s a new phenomenon everyone wants to try: vibecoding. Some software vendors act like vibecoding just isn’t really happening in the field, while some low-code vendors are leveraging AI to help “citizen developers” build apps from exactly that perspective, so AI can operate within the guardrails of their toolkits.

“Vibecoding is not just doing autocomplete on lines of code, it’s developing entire new services and configuring infrastructure with just prompts,” said Camden Swita, director and head of AI/ML at New Relic. “Since a vibecoder has no requirement to understand the stack, the person may not even understand the best practices of observability or instrumentation, or how to zero in on an issue later, like an SRE [site reliability engineer]. The need for good observability baked into the process is important.”

To address this, New Relic has added an elaborate stack tracing engine within their AI monitoring solution to help engineers understand how AI agents are interfacing with different architectural elements such as vector databases, retrieval-augmented generation and external service interfaces in production.

Sure, vibecoding might replace some developers who are delivering less mission critical apps, but it seems like it might also create a new cottage industry cleaning up the mess. Here’s a dev with a compelling offer making the rounds: “I can’t wait to fix your vibe code for $200 an hour.”

Generative AI chicken meets agentic AI egg

We’ve been using AIOps-style routines productively for years to filter and tag telemetry data for better relevance in observability work. Agentic AI, meaning AI-based agents, promises to further offload some engineering work by autonomously handling multiple tasks in a workflow for an investigation, such as comparing codebases for change, documenting and escalating incidents with stack trace reports, and generating test cases.

Here’s my concern: It’s like asking AI agents to monitor AI-filtered telemetry, for applications with code generated by AI, and tested with AI-generated tests — sort of like a wyvern eating its own tail. A human still needs to be involved to keep the agent on course.

“Let’s say, ‘agent, write some code that satisfies these tests, please,'” said Phillip Carter, principal product manager of open telemetry and AI at Honeycomb. “And it does. Except one problem. It looked at all of my test cases and it planted those as ‘if-statements’ inside of the function. Oh no, when I told it to satisfy the test case, it was very literal in interpreting what I was saying, and it just wrote the code that makes the test pass. I have basically created a tautological system that does perform per the spec. And that’s simpler than talking about things like Kubernetes configuration changes.”

Carter added that there can be a legitimate acceleration of tasks, “but some people would argue the bottleneck has never been in the code generation side of it, as it shifts the bottleneck toward verification and understanding what should actually be happening. This highlights a use case where we’ll never really get away from needing experienced people.”

Honeycomb’s observability platform allows engineers to drop into code-level analysis from a heat map, and it recently added an AI-enhanced natural language query function, trained on gathering telemetry for tying specific development, SRE and ops use cases to service-level objectives.

Positive signs from hybrid testers

Well before the current AI hullabaloo, we were already seeing crossover between the space formerly known as test automation and observability. Real user monitoring and synthetic test data and generated test scenarios are getting “shifted left” for pre-production awareness, as well as “shifted right” to provide better observability and test feedback from production.

Katalon just put out an extensive 2025 State of Software Quality 2025 report that clearly indicates that QA is a bright spot for AI development, with more than 75% of respondents reporting using some AI-driven testing tools. Respondents who used AI testing tools reported prioritizing test planning and design less (36%) than non-AI users (44%), indicating some reduction in manual effort through AI.

These findings support their idea of the “hybrid tester” who will zip together several different AI models and agents with conventional test automation. The aim will be to enhance observability coverage, shorten test and delivery cycle times, and accelerate documentation and feedback loops, alongside conventional test automation and manual testing tasks.

Katalon itself has taken on a composite AI approach. Its agentic AI acts as the key in a “zipper” that stitches together prompts and responses from many different AI-driven testing, monitoring and observability tools within the context of validating a business scenario or service-level objective.

Looking at the signals of AI-developed software

Software development, like any creative work, follows the golden triangle of software development. You can either have it fast, good or cheap — or at most, two out of the three.

What observability points out for AI-driven development is that you can definitely deliver software faster, and perhaps cheaper (the jury is still out on that in the long run), but better software may remain just out of reach in many cases without clarity on who owns business and service level objectives for these tools.

“It’s not that different than other areas of specialization coming into the software lifecycle, just like observability led to SREs trying to figure out what is going on within the stack,” said Patrick Lin, senior vice president and general manager of observability at Splunk, a Cisco Systems company. “The idea of a full stack developer may expand to include AI skills as a prerequisite. At the same time, you will still have DBAs and network operations teams that are specialists.”

Even when developing with AI tools, added Hao Yang, head of AI at Splunk, “we’ve always relied on human gatekeepers to ensure performance. Now, with agentic AI, teams are finally automating some tasks, and taking the human out of the loop. But it’s not like engineers don’t care. They still need to monitor more, and know what an anomaly is, and the AI needs to give humans the ability to take back control. It will put security and observability back at the top of the list of critical features.”

In practice, the golden signals of software observability (latency, traffic, errors and saturation) are still the same, but Yang also highlights new ones for looking at AI responses: relevance, quality, hallucination and toxicity.

Who owns AI-generated code anyway?

Here’s an interesting quandary: if I use a copilot in GitHub, or a tool such as Cursor, who should take responsibility if there are faults in the application, or the wrong infrastructure is implemented? Whom does an SRE call first?

“We still have a lot of SREs and engineers who do not trust the computer with that kind of reasoning. You can still use LLMs to suggest approaches, but the more automated and complex your system becomes, the more you need humans in the loop,” said Tom Wilkie, chief technology officer at Grafana Labs. “The LLM may have written some of the code, but if there’s a bug in it, that’s still my code and pull request.”

Still, I have to wonder, who actually owns the code, and the intellectual property it represents in a product, if the developer approves a lengthy terms-of-use attestation during signup?

“As a management team, we decided to take a risk-tolerant approach to AI tools,” said Wilkie. “Also, we are open source, and 90% of our code will be out there in public, so we have no concerns about engineers using these tools and leaking proprietary code to an LLM. We can attract engineers to us because we are open source.”

To whatever extent you can use open source tools with AI-assisted coding, it makes the value of contributions higher, since they will certainly be vetted and hardened by a community of thousands or millions of developers. Nobody wants to use open source tooling that real human contributors won’t stand behind.

The Intellyx take

No matter what, it’s only going to get harder for developers. More competitive. Some companies will lay off developers because of AI, or the promise of it. 

It’s really amazing how quickly almost any software company of significant size already has a “head of AI” leadership role present. It took about five years after DevOps or cloud appeared on the scene before we saw director-level appointments with those buzzwords in their titles.

The illusory “five years of experience developing with AI” will become a seldom-achieved requirement on some developer job recs. Even AI development companies such as Anthropic have had to tell job applicants to not use AI when answering questions on their recruiting portal.

So many billions of dollars have been invested in AI development tooling that it is unlikely that any of the purported beneficiaries of reduced workforces and timelines — or the media and analysts that participated in hyping out-of-the-box AI application delivery — are going to tell the market that end-customer codebases are becoming cursed with intractable problems. 

At least, not until we have more high-profile production failures caused by AI development tools without enough human oversight and governance.

That’s why AI-aware observability and shift-left production-style testing are more important than ever before in heading off functional errors and configuration drift before they get replicated everywhere.

Jason English is director and principal analyst at Intellyx. He wrote this article for SiliconANGLE. ©2025 Intellyx B.V. Intellyx is editorially responsible for this article. No AI bots were used to write this content. At the time of writing, SmartBear is an Intellyx customer, and Dynatrace, New Relic and Splunk are former Intellyx customers. None of the other organizations mentioned here is an Intellyx customer.

Image: SiliconANGLE/Ideogram

A message from John Furrier, co-founder of SiliconANGLE:

Your vote of support is important to us and it helps us keep the content FREE.

One click below supports our mission to provide free, deep, and relevant content.  

Join our community on YouTube

Join the community that includes more than 15,000 #CubeAlumni experts, including Amazon.com CEO Andy Jassy, Dell Technologies founder and CEO Michael Dell, Intel CEO Pat Gelsinger, and many more luminaries and experts.

“TheCUBE is an important partner to the industry. You guys really are a part of our events and we really appreciate you coming and I know people appreciate the content you create as well” – Andy Jassy

THANK YOU