UPDATED 13:00 EST / JULY 19 2018


Facebook open-sources its ‘oomd’ tool for data center memory management

Facebook Inc. is doling out yet another open-source software tool, this time aimed at data center operators that struggle with system outages from applications trying to consume more memory resources than are available to them.

The software in question is called oomd, which Facebook describes as a “faster and more reliable” solution for the “out-of-memory situations” that sometimes occur after a configuration change or software update relating to its information technology infrastructure.

“As our infrastructure has scaled, we’ve found that an increasing fraction of our machines and networks span multiple generations,” Facebook engineer Daniel Xu wrote in a blog post. “One side effect of this multigenerational production environment is that a new software release or configuration change might result in a system running healthily on one machine but experiencing an out-of-memory issue on another.”

Out-of-memory issues generally occur when no additional memory can be allocated for use by programs or the operating system. In this case, the system will be unable to load any additional programs, and since many programs may load additional data into memory during execution, they will cease to function correctly.

Like many other data center operators, Facebook has traditionally relied on the Linux OOM killer to fix these kinds of issues. But the Linux OOM killer isn’t always reliable because it often kicks in too late, by which point the system may already be in what’s called a “livelock.” That’s a state in which system components keep running and reacting to one another without making any useful progress, unlike a deadlock, in which they sit idle waiting for one another to release a lock. In more basic terms, it means the system stays busy but stops responding, possibly indefinitely.

The root of the problem is a technique called “memory overcommit,” which Linux uses to increase memory utilization. More memory is promised to processes than is actually available, on the general assumption that applications won’t use all of the memory they’re assigned. But Xu said that’s not always true.
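To make the overcommit idea concrete, here is a toy model in Python. It is only an illustration of the arithmetic, not how the kernel does its accounting; the `Proc` class, the `memory_state` function and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    reserved_mb: int  # memory the process has been promised
    touched_mb: int   # memory the process actually uses


def memory_state(procs, physical_mb):
    """Compare promised and used memory against physical memory."""
    reserved = sum(p.reserved_mb for p in procs)
    touched = sum(p.touched_mb for p in procs)
    return {
        # More memory has been promised than physically exists.
        "overcommitted": reserved > physical_mb,
        # Processes are actually using more than physically exists.
        "out_of_memory": touched > physical_mb,
    }


if __name__ == "__main__":
    procs = [Proc("a", 600, 200), Proc("b", 600, 300)]
    print(memory_state(procs, physical_mb=1000))
```

In this model a machine can be overcommitted for a long time without trouble. The out-of-memory condition appears only once processes start using what they were promised.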

“When demand exceeds total available memory, the Linux OOM killer tries to reclaim memory,” Xu explained. “The Linux OOM killer’s primary responsibility is to protect the kernel so that the machine stays up; it accomplishes this by killing some processes without heed to the importance of a given workload. Hence, whenever the OOM killer engages, there is a significant risk that applications running on the machine will be affected.”

Facebook’s oomd works differently, however. The main difference is that it monitors key system resource indicators, which allows it to take corrective action by aborting nonessential processes before a systemwide OOM problem occurs.
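The article doesn’t say which indicators oomd reads, but one plausible example is the memory pressure information that newer Linux kernels expose in /proc/pressure/memory. The Python sketch below parses text in that format and checks it against a threshold. The function names and the threshold of 20 are assumptions made for illustration, not oomd’s actual logic or defaults.

```python
def parse_pressure(text):
    """Parse pressure text into {"some": {"avg10": ...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=") for field in fields)
        }
    return result


def should_act(pressure, threshold=20.0):
    """Hypothetical rule: act when all tasks were stalled on memory
    for at least `threshold` percent of the last 10 seconds."""
    return pressure["full"]["avg10"] >= threshold


if __name__ == "__main__":
    # Sample text in the kernel's format; on a live system this would be
    # read from /proc/pressure/memory, if the kernel provides that file.
    sample = (
        "some avg10=12.50 avg60=3.10 avg300=0.80 total=123456\n"
        "full avg10=25.00 avg60=1.00 avg300=0.20 total=65432\n"
    )
    print(should_act(parse_pressure(sample)))
```

Acting on a rising stall percentage rather than on allocation failure is what lets a userspace monitor step in before the machine stops responding.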

“We use a generic kill mechanism called the kill list, which is an ordered list of ‘known offenders’ — processes or services that ought to be the first to kill in the event of an OOM, provided certain criteria are met,” Xu said. “For example, if a workload creates an auxiliary service that holds an in-memory cache for certain hot objects, oomd’s kill list can be configured to kill the cache first.”
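The kill list Xu describes can be pictured as a simple ordered lookup. The Python sketch below is a guess at the general shape rather than oomd’s implementation; `pick_victim`, the service names and the `criteria_met` callback are all hypothetical.

```python
def pick_victim(kill_list, running, criteria_met):
    """Return the first entry in the ordered kill list that is running
    and meets the configured criteria, or None if there is none."""
    for name in kill_list:
        if name in running and criteria_met(name):
            return name
    return None


if __name__ == "__main__":
    # Ordered by priority: the auxiliary cache goes first.
    kill_list = ["hot-object-cache", "batch-job"]
    running = {"web", "hot-object-cache", "batch-job"}
    print(pick_victim(kill_list, running, lambda name: True))
```

Because the list is ordered, the main workload is never a candidate unless an operator puts it on the list, which is the opposite of the kernel OOM killer choosing without regard to a workload’s importance.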

Xu said Facebook has carried out extensive testing of its new OOM killer, which gives users the ability to customize their response to situations in which a workload uses up all of its available memory. Those tests show oomd to be far more reliable than the older Linux OOM killer, resulting in a far lower frequency of livelocks and, consequently, much reduced application downtime.

Managing memory resources is becoming increasingly important for enterprises and software providers as they look to take advantage of falling memory chip prices, Holger Mueller, principal analyst and vice president of Constellation Research Inc., told SiliconANGLE. Companies are putting more of their application usage into memory as a result, but the risk is that out-of-memory situations can become very expensive when running massively scalable next-generation applications, he said.

“It’s good to see Facebook helping the overall industry by outsourcing tools such as oomd, as its scale will likely eclipse the scale of most enterprises,” Mueller said. “CXOs like to have a scalability buffer built into their software framework. But while open sourcing is easy, fostering and growing the community is a different task, so we will have to see what the uptake of oomd will be in a few quarters.”

Facebook has published the code for oomd to GitHub, so that anyone can benefit from or contribute to its further development.

Image: Facebook
