Nvidia built its Selene supercomputer for coronavirus research in just 3 weeks
Nvidia Corp. said today it managed to build Selene, the world’s seventh-fastest supercomputer, in just under three weeks.
Selene is based on the DGX A100 systems also used by the Argonne National Laboratory to research ways to stop the coronavirus. The Selene supercomputer has been deployed to tackle problems such as protein docking and quantum chemistry, which are key to understanding the coronavirus and developing a potential cure for COVID-19.
Nvidia said Selene is based on its most advanced DGX SuperPOD architecture, a new design for artificial intelligence workloads that was announced earlier this year. Each DGX A100 system in a SuperPOD incorporates eight of Nvidia’s latest A100 graphics processing units, which are designed for data analytics, scientific computing and cloud graphics workloads.
Building the Selene supercomputer so quickly in the middle of a pandemic was no easy feat, but Nvidia said in a blog post it was able to draw on its earlier experience of piecing together supercomputers based on its older DGX-2 systems. Those experiences taught Nvidia some hard lessons about networking, storage, power and thermals, and about the most efficient way to stitch those components together into a supercomputer dedicated to scientific research.
For example, when Nvidia built Circe, currently the world’s 23rd-fastest supercomputer, in June 2019, its engineers completely redesigned that machine’s network to simplify the assembly of the overall system. Circe’s network is based on scalable modules of 20 nodes that are connected by relatively simple “thin switches.” Each module can be laid down cookie-cutter style, turned on and tested before another is added.
The design let engineers specify set lengths of cables that could be bundled together with Velcro at the factory, Nvidia explained. As a result, racks could be labeled and mapped, greatly simplifying the process of filling them with dozens of systems.
Nvidia said its experiences with Circe meant it could come up with a balanced design for supercomputers that can handle many different kinds of high-performance computing workloads. The flexibility of its design also means that researchers have much more freedom to explore new directions in AI and high-performance computing, something that proved to be useful in the construction of Selene.
It generally takes teams of dozens of engineers several months to assemble, test and commission a supercomputer-class system. The challenge was compounded by the need for Nvidia’s engineers to maintain social distancing for their own safety.
Nvidia’s tactic was to use skeleton crews of two-person teams to unbox and rack the systems. The teams worked separate shifts around the clock so that they never mixed with one another. By following Nvidia’s pre-established design, those teams racked up 60 DGX A100 systems each day. The engineering teams were aided virtually by administrators who validated the cabling remotely, testing each 20-node module as soon as it was deployed.
The design method was so successful and so rapid that Nvidia said another of its customers, the University of Florida, expects to be able to rack up and power on a 140-node extension to its existing HiPerGator supercomputer in just 10 days once the necessary systems and equipment are shipped out.
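To put those figures side by side, here is a rough back-of-the-envelope sketch in Python. It is not Nvidia's planning method. It assumes the HiPerGator extension would use the same 20-node modules described for Circe, that crews could match the 60-systems-a-day racking rate reported for Selene, and that a "node" and a "system" are the same unit. The function name `build_estimate` is made up for illustration.

```python
import math

# Figures quoted in the article; applying them to HiPerGator is an assumption.
NODES_PER_MODULE = 20   # module size described for Circe's network design
SYSTEMS_PER_DAY = 60    # racking rate reported for the Selene build


def build_estimate(total_nodes,
                   nodes_per_module=NODES_PER_MODULE,
                   systems_per_day=SYSTEMS_PER_DAY):
    """Return (modules needed, whole days of racking) for a given node count.

    Hypothetical helper: it covers racking only, not cabling,
    validation or power-on.
    """
    modules = math.ceil(total_nodes / nodes_per_module)
    racking_days = math.ceil(total_nodes / systems_per_day)
    return modules, racking_days


modules, racking_days = build_estimate(140)
print(modules, racking_days)  # 7 3
```

Under those assumptions, a 140-node extension works out to seven modules and about three days of racking. The estimate leaves out cabling, validation and power-on, which the 10-day figure presumably also covers.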
Now that it’s up and running, Selene can talk to its operators via a Slack channel, in order to report any problems such as malfunctioning hardware or loose cables. Those operators are further aided by a telepresence robot dubbed “Trip” that can drive up and down its aisles of SuperPOD systems to help keep an eye on things remotely.
Selene landed at No. 7 on the TOP500 list of the world’s supercomputers in June and came in at second place on the Green500 list of the world’s most power-efficient systems that same month. Then in July, it set new records in all eight systems tests for AI training performance in the latest MLPerf benchmarks.
Photo: Nvidia