Meta details its approach to detecting data errors in IT infrastructure
Meta Platforms Inc. today detailed its approach to detecting so-called silent data corruptions, or SDCs, subtle errors that often emerge in information technology infrastructure and are highly difficult to troubleshoot.
Outages and other technical issues are a frequent phenomenon in data centers. As a result, companies use a variety of methods to ensure that important business information isn’t lost in the event of a malfunction. One of the most common approaches is to create multiple copies of a record, which ensures that a backup is available in the event the original record is lost.
But despite the steps that companies take to protect business information, data errors still frequently emerge in IT infrastructure. Among the most complex errors are malfunctions that Meta refers to as SDCs. Such errors emerge because of computing mistakes made by a server’s central processing unit.
Servers and other data center systems automatically generate logs about notable events such as a malfunction. Those logs can then be used by administrators to carry out troubleshooting. SDC errors are challenging to fix because they don’t appear in server logs, which makes them highly difficult to detect.
Meta’s engineers have developed multiple methods of detecting SDCs, and the company shared technical information about two of the most important ones in a blog post today.
The first technique that Meta uses to detect SDCs is known as ripple testing.
To carry out ripple testing, Meta connects an error detection system to the applications running on a given server. The error detection system, with the help of the applications to which it’s connected, carries out a series of specialized computing operations. If the operations return an incorrect result, Meta can conclude that there was an SDC error caused by the server’s CPU.
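The core idea described above, running an operation whose correct answer is already known and flagging any mismatch, can be illustrated with a minimal sketch. This is not Meta's implementation: the class `KnownAnswerTest`, the function `run_ripple_tests` and the two arithmetic workloads are hypothetical names chosen for illustration, and the sketch assumes the expected values were computed ahead of time on hardware known to be good.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class KnownAnswerTest:
    """A short computation paired with its precomputed correct result."""
    name: str
    compute: Callable[[], int]
    expected: int  # assumption: golden value obtained on trusted hardware


def sum_of_squares() -> int:
    # 0^2 + 1^2 + ... + 999^2
    return sum(i * i for i in range(1000))


def sum_of_cubes() -> int:
    # 0^3 + 1^3 + ... + 99^3
    return sum(i ** 3 for i in range(100))


# Hypothetical test suite; a real fleet would use far more varied workloads.
TESTS = [
    KnownAnswerTest("sum_of_squares", sum_of_squares, 332_833_500),
    KnownAnswerTest("sum_of_cubes", sum_of_cubes, 24_502_500),
]


def run_ripple_tests(tests: List[KnownAnswerTest]) -> List[str]:
    """Return the names of tests whose result differs from the golden value.

    A non-empty list means the CPU produced a wrong answer without raising
    any error, which is the signature of a silent data corruption.
    """
    return [t.name for t in tests if t.compute() != t.expected]


if __name__ == "__main__":
    failures = run_ripple_tests(TESTS)
    print("suspected SDC in:", failures if failures else "none")
```

Because each check is only a few arithmetic loops, a suite like this finishes quickly, which is consistent with the sub-second run times quoted for ripple tests.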
“Ripple tests are typically in the order of hundreds of milliseconds within the fleet,” explained Meta engineer Harish Dattatraya Dixit. “They are scheduled based on workload behavior and can be switched on and off per workload.”
Because they can be completed in under a second, ripple tests require a relatively limited amount of infrastructure resources to carry out. A related benefit is that it’s possible to perform ripple tests fairly often. But while effective, this method can’t spot all types of SDCs, which is why Meta also uses a second error detection technique dubbed opportunistic testing.
Whereas a ripple test can be completed in under a second, opportunistic tests take several minutes to carry out, which reflects the fact that they are much more thorough. Meta built a custom software tool called Fleetscanner to manage the process. The company runs opportunistic tests on servers when they’re not actively used, for example while they’re undergoing maintenance.
Meta carries out opportunistic tests when a machine reboots, as well as when it installs updates to the machine’s operating system or firmware. The company also searches for SDC errors when certain changes are made to the server cluster to which a machine is attached.
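The trigger conditions listed above can be sketched as a simple decision function. This is an assumption-laden illustration rather than Fleetscanner's actual logic, which the article does not describe: the `MaintenanceEvent` enum, the `OPPORTUNISTIC_TRIGGERS` set and the `serving_traffic` flag are all hypothetical.

```python
from enum import Enum, auto


class MaintenanceEvent(Enum):
    """Hypothetical event types, mirroring the triggers named in the article."""
    REBOOT = auto()
    OS_UPDATE = auto()
    FIRMWARE_UPDATE = auto()
    CLUSTER_CHANGE = auto()
    NONE = auto()  # no maintenance activity


# Events during which a machine is assumed to be out of active use.
OPPORTUNISTIC_TRIGGERS = frozenset({
    MaintenanceEvent.REBOOT,
    MaintenanceEvent.OS_UPDATE,
    MaintenanceEvent.FIRMWARE_UPDATE,
    MaintenanceEvent.CLUSTER_CHANGE,
})


def should_run_opportunistic_test(event: MaintenanceEvent,
                                  serving_traffic: bool) -> bool:
    """Run the long test only on a trigger event and only when idle."""
    return event in OPPORTUNISTIC_TRIGGERS and not serving_traffic
```

For example, `should_run_opportunistic_test(MaintenanceEvent.REBOOT, serving_traffic=False)` returns `True`, while the same event on a machine that is still serving traffic returns `False`.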
Meta carries out 2.5 billion ripple tests every month across its data centers and has run a total of 68 million opportunistic tests to date. Ripple tests spot about 70% of SDC errors, the company says, while the rest are detected through opportunistic testing.
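To put the monthly ripple test figure in perspective, a rough back-of-the-envelope calculation converts it into a per-second rate. The 30-day month is an assumption made here for simplicity, and the variable names are illustrative.

```python
# Figure reported in the article.
ripple_tests_per_month = 2_500_000_000

# Assumption: a 30-day month.
seconds_per_month = 30 * 24 * 60 * 60  # 2,592,000 seconds

tests_per_second = ripple_tests_per_month / seconds_per_month
print(f"about {tests_per_second:.0f} ripple tests per second fleet-wide")
```

Under that assumption the fleet-wide rate works out to a little under 1,000 ripple tests per second.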
Image: Meta