The holiday season is a peak time for e-commerce transactions, but also for the advertising that fuels them. Downtime or slowdowns can be disastrous for companies trying to push their latest promotions through online advertising, but the complexity of ad networks makes it difficult to pinpoint the cause of anomalies.
The Rubicon Project Inc. is one of a handful of exchanges that make Internet advertising work. Its automated brokerage system does for ad buying what the NASDAQ does for stock trading. Online publishers post the availability of inventory on their sites to Rubicon’s automated network, where about 200 advertising agencies can bid for available spots. The whole bidding process is automated and completed in just milliseconds. More than 90 percent of people who browse the Internet see an ad that goes through the Rubicon exchange each day.
The Rubicon Project processes about 13 trillion monthly bid requests using a network of more than 60,000 central processing units spread across seven data centers. That scale, combined with the speed at which the network operates, makes detecting and isolating problems a tricky proposition. Activity also varies by geography, which adds further complexity. “A Hong Kong data center will act differently than one in San Jose,” explained Rich Galan, director of analytics.
Spotting anomalies that undermine the bidding process is critical. If agency partners can’t put down offers on advertising inventory, their businesses suffer, so Rubicon constantly scans its network for signs of any behavior that falls outside the norm, such as an advertiser failing to bid or consistently bidding too slowly. “If a partner doesn’t trade as we expect them to trade, it’s an anomaly,” Galan said.
Anomalies can happen for all kinds of reasons, such as an outage in an agency’s data center, network congestion or unexpectedly slow application performance. Often the customer doesn’t even know the problem is occurring. “Either we have a problem and don’t know it, or a partner has a problem and they don’t know it,” Galan said. “If we leave the slow bleeding unattended for too long it can be a large hit to our business.”
Rubicon uses Graphite, a free open-source system monitoring and visualization tool, to capture and store time-series data, but Graphite is difficult to use and requires programming to detect patterns that fall outside of the norm, Galan said. “We could identify anomalies, but only on an aggregate level,” he said. “So if a partner stops buying with us in a specific data center, we might not know it for a day or two.” A delay of that length can translate into millions of dollars in lost sales.
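A small sketch can show why aggregate-level monitoring misses this kind of problem. The partner names, data center labels, bid counts and the 20 percent cutoff below are all invented for illustration; nothing here reflects Rubicon’s actual data or how Graphite or Anodot work internally.

```python
# Hypothetical bid counts per (partner, data center) for two comparable periods.
baseline = {
    ("partner_a", "san_jose"): 5000,
    ("partner_a", "hong_kong"): 400,
    ("partner_b", "san_jose"): 6000,
    ("partner_b", "hong_kong"): 4600,
}
# Same traffic, except partner_a has stopped bidding in one data center.
current = dict(baseline)
current[("partner_a", "hong_kong")] = 0


def relative_change(before, after):
    """Fractional change from before to after (-1.0 means a drop to zero)."""
    return (after - before) / before


def dropped_segments(baseline, current, max_drop=0.2):
    """Return segments whose volume fell by more than max_drop (assumed cutoff)."""
    flagged = []
    for key, before in baseline.items():
        if before == 0:
            continue  # no baseline volume to compare against
        if relative_change(before, current.get(key, 0)) < -max_drop:
            flagged.append(key)
    return flagged


# The total barely moves, so a check on the aggregate stays quiet.
aggregate_change = relative_change(sum(baseline.values()), sum(current.values()))
print(f"aggregate change: {aggregate_change:.1%}")
# The per-segment view exposes the partner that went silent.
print("dropped segments:", dropped_segments(baseline, current))
```

In this made-up case the total falls by only a few percent while one partner-by-data-center combination falls to zero, which is the situation Galan describes.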
Rubicon realized it needed to bring machine learning to the anomaly detection process. It currently monitors about 40,000 combinations of metrics, such as timeouts by partner by data center, and plans to grow that number dramatically. “It would be ridiculous to try to write these algorithms manually,” Galan said.
The company tested various anomaly detection systems and settled on one from Anodot Inc., a two-year-old Israeli firm whose namesake product provides real-time analytics and uses machine learning to discover outliers in vast amounts of data. “Anodot was the winner for pure simplicity,” Galan said. It helped that the software also integrates seamlessly with Graphite.
Anomaly detection software is a relatively new category of predictive analytics. Rather than relying on humans to set up thresholds and follow alerts, the software “learns” the normal behavior of a complex system and thereafter reports on aberrations. “When we win, it’s usually because a company was going to build its own solution in-house,” said David Drai, Anodot’s cofounder and chief executive.
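The difference between a hand-set threshold and a learned baseline can be sketched in a few lines. This is a deliberately simple stand-in, a trailing-window z-score, and not Anodot’s algorithm, which the article does not describe; the function name, window size, cutoff and sample numbers are all assumptions made for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, baseline_size=24, threshold=3.0):
    """Return indices of points that deviate from the trailing baseline.

    The "normal" range is learned from the previous baseline_size points
    rather than fixed by a person in advance.
    """
    anomalies = []
    for i in range(baseline_size, len(values)):
        window = values[i - baseline_size:i]
        mu = mean(window)
        sigma = stdev(window)
        if sigma == 0:
            # A perfectly flat baseline: any change at all is unusual.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical hourly bid counts for one partner: steady, then a sudden drop.
bids = [100, 102, 98, 101, 99, 100, 103, 97] * 3 + [100, 20, 101]
print(find_anomalies(bids))
```

A production system would also need to handle seasonality, such as daily and weekly cycles, and keep flagged points from skewing the baseline, which this sketch does not attempt.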
Set it and forget it
Anodot markets its software as plug-and-play simple. It typically takes about a week to accumulate enough data to establish a baseline for a new client. After that point, it lurks on the network and sends real-time alerts when anomalies are detected. Rubicon found the ease-of-use claims to be on the mark. “Implementing Anodot was super easy,” Galan said. “We were able to plug it right in. We let it run for a month, and found anomalies right away.”
The impact on Rubicon’s business has been immediate. The company is now often able to detect problems before the customer is even aware of them. Improved visibility is enabling Rubicon to load-balance activity across its seven data centers to improve performance and throughput. It’s also improving customer relations. In a recent example, one agency’s trading activity slowed due to increased network latency. Rubicon immediately contacted the company, which was unaware that it even had a problem. It turned out that a code release was the culprit. “A lot of times customers know there’s a problem, but we catch it before they do,” Galan said.
Rubicon is currently using Anodot to detect anomalies for about 25 of its customers, but plans to expand to support all 200 buyer clients while also increasing the number of metrics it monitors to a half-million. Galan said the company is confident that the software can handle the increased scale. Rubicon’s technical operations team recently began using Anodot to track system-level anomalies, such as CPUs that are running hot or slow.
Bottom line: “We’ve been able to identify problems at an early stage, stop the bleeding and protect our business,” Galan said.