UPDATED 14:30 EDT / JUNE 10 2019


From inventing the web to colliding particles, CERN’s computer scientists manage data for the universe

This article, and hundreds of millions of others, are viewable online across the globe because 30 years ago a computer scientist took a break from his research group’s work in particle physics to tinker with a new way to manage and share information.

That group was the European Organization for Nuclear Research, or CERN; the computer scientist was Tim Berners-Lee. His proposal for a hypertext system for sharing information laid the groundwork for what ultimately became the modern web.

While this historic milestone from March 1989 resulted in the creation of the World Wide Web as an automated way to share information between scientists around the globe, CERN’s real claim to fame involves its groundbreaking work on visible and even invisible matter within the universe. Fueled by development of the Large Hadron Collider, or LHC, the world’s largest particle accelerator, CERN has been at the forefront of scientific research that led to the discovery of the elusive Higgs boson in 2012.

Behind this heavy scientific lifting is a significant computing organization, one that must handle data at a scale most of us can only imagine. This includes a data center that holds 300,000 cores, according to Ricardo Rocha (pictured, right), computing engineer at CERN.

“That’s not enough, so what we’ve done over the last 15 to 20 years is create this large distributed computing environment around the world,” Rocha said. “We link to many different institutes and research labs, and this doubles our capacity.”

Rocha spoke with Stu Miniman and Corey Quinn, co-hosts of theCUBE, SiliconANGLE Media’s mobile livestreaming studio, during the KubeCon + CloudNativeCon event in Barcelona. He was joined by Lukas Heinrich (left), physicist at CERN, and they discussed the data management process needed for scientific discovery, the role of Kubernetes in the organization’s work, and how CERN shares its findings while contributing to the open-source world. (* Disclosure below.)

This week, theCUBE features Lukas Heinrich and Ricardo Rocha as its Guests of the Week.

Uncovering the invisible

Discovery of the Higgs boson was a significant breakthrough because, until then, scientists had been unable to conclusively observe a particle associated with the invisible “Higgs field,” through which particles acquire mass. The discovery seven years ago this July led to the 2013 Nobel Prize in Physics for Peter Higgs and François Englert, the theorists who had predicted the mechanism.

The discovery was made possible through use of CERN’s LHC. Completed in 2008, the particle accelerator employs a 27-kilometer ring of superconducting magnets to boost particle energy. The protons collide 40 million times per second, according to Heinrich, and the resulting data must then be carefully captured for thorough evaluation by CERN scientists.

“We accelerate protons, which are hydrogen nuclei, to very high energy so they almost go with the speed of light,” Heinrich explained. “We essentially run 10,000 core real-time applications just to analyze this data.”

Using Kubernetes for data analysis

At the KubeCon event in Barcelona, Rocha and Heinrich offered attendees a glimpse into how open-source and containerized computing tools, not readily available in 2012, could be used to recreate the data analysis that led to the Nobel Prize-winning Higgs boson discovery.

Using a Jupyter notebook and Kubernetes on a small cluster within the CERN private cloud, the engineers demonstrated how the application and cluster itself could scale out and meet intensive data analysis needs. They also showed how work within the Kubernetes Multicluster Special Interest Group helped define scheduling policies and leveraged external cloud resources.

“Virtual machines still have a very complex setup to be able to support our diversity of software,” Rocha said. “With containerization, all people have to give us is a building block to run. It’s a standard interface, so we only have to build infrastructure to be able to handle these pieces.”

One of CERN’s ongoing challenges is dealing with the rapidly expanding amount of data it must process. In 2017, the organization passed 200 petabytes of data stored in its archives, generated in part by its LHC, which produces one petabyte of collision data per second. Although this data is ultimately reduced through filtering, CERN will soon be talking about exabytes of information, according to Rocha.

“It’s still a lot of data,” Rocha stated. “We’re now collecting something like 70 petabytes a year.”
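
To give a feel for how much filtering those two figures imply, here is a back-of-envelope sketch in Python. It uses only the numbers quoted above (one petabyte per second raw, about 70 petabytes a year collected). It assumes decimal petabytes and that collection is spread evenly over a full calendar year, which is a simplification because the accelerator does not run continuously and not everything collected is collision data. The variable names are illustrative only.

```python
# Back-of-envelope arithmetic using only figures quoted in the article.
PETABYTE = 10**15                      # bytes, decimal convention (assumption)
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # calendar year, ignoring downtime (assumption)

raw_rate_bytes_per_s = 1 * PETABYTE    # "one petabyte of collision data per second"
stored_per_year_bytes = 70 * PETABYTE  # "something like 70 petabytes a year"

# Average rate at which data is kept, if collection were spread evenly over the year.
avg_stored_rate = stored_per_year_bytes / SECONDS_PER_YEAR

# How many bytes are produced for every byte that is kept.
reduction_factor = raw_rate_bytes_per_s / avg_stored_rate

print(f"Average stored rate: {avg_stored_rate / 10**9:.2f} GB/s")
print(f"Implied reduction factor: about {reduction_factor:,.0f} to 1")
```

Under these assumptions the kept data averages a little over 2 gigabytes per second, so filtering would discard all but roughly one byte in every 450,000. The real ratio during operation will differ, since the estimate averages over periods when the accelerator is not running.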

To deal with data on such a massive scale, over 90% of CERN’s data center resources are provisioned through a private cloud based on OpenStack. CERN started with only four OpenStack projects and a few scattered hypervisors in 2012. Its cloud has now evolved into running 16 OpenStack projects, 9,000 hypervisors, and more than 400 Kubernetes clusters across two regions.

Harkening back to the information-sharing vision of Berners-Lee, CERN’s OpenStack cloud is part of the Worldwide LHC Computing Grid. This distributed scientific network involves 170 data centers across 42 countries, harnessing the power of 800,000 cores to process the Collider’s data exhaust.

“We’re looking into GPUs and machine learning to change how we do computing, and we’re looking at any kind of additional resources we might get, and the public cloud will probably play a role,” Rocha said.

Reliance on OpenStack

CERN has been diligent about contributing what it learns back upstream to the open-source community. The organization has made 745 code commits to various OpenStack projects and discovered 339 bugs, according to one published report.

Scientists and computer engineers at CERN have also demonstrated a willingness to leverage open-source tools, such as Kubernetes, and the public cloud, to share data from experiments. A portion of information generated from the Compact Muon Solenoid, or CMS, a detector at the LHC, has been released publicly, allowing scientific researchers outside of the CERN orbit to benefit, according to Heinrich.

“By using Kubernetes and public cloud infrastructure, it actually becomes possible for people who don’t work at CERN to analyze this large-scale scientific data,” Heinrich said. “This was a 70-terabyte data set that, thanks to our Google Cloud partners, we were able to put onto public cloud infrastructure and then we analyzed it on a large-scale Kubernetes cluster.”

After setting in motion an information-sharing project 30 years ago that arguably became the most significant innovation of modern times, Berners-Lee has remained active in the computing world. He added “Sir” to his name after being knighted by Queen Elizabeth in 2004 and has remained a director of the World Wide Web Consortium, the global web standards body he founded in 1994.

The computer engineer played a role in the opening ceremonies for the 2012 Olympic Games held in London. Tweeting out “This is for everyone” during the event from a special computer, Berners-Lee could just as well have been commenting on the scientific contributions of CERN itself, where his stellar career began long ago.

Here’s the complete video interview, part of SiliconANGLE’s and theCUBE’s coverage of the KubeCon + CloudNativeCon event. (* Disclosure: This segment is unsponsored. Red Hat Inc. is the headline sponsor for theCUBE’s live broadcast at KubeCon + CloudNativeCon. Neither Red Hat nor any other sponsors have editorial control over content on theCUBE or SiliconANGLE.)

Photo: SiliconANGLE
