With 2015 coming to an end, LinkedIn Corp. has taken a look back at its year of using, developing and contributing to open-source software.
Over the past year, LinkedIn made some of its biggest-ever contributions to the open-source community by releasing ten new original projects, including Burrow, Gobblin and Pinot, while pushing major updates to existing projects such as Apache Samza, Apache Kafka, Rest.li and Voldemort.
“We’ve worked to scale our infrastructure as we reached 400 million LinkedIn members, so it’s no surprise many of our open-source projects this year focus on building out our data pipelines and tools to help make sense of our data,” wrote LinkedIn’s Igor Perisic in a blog post. “The infrastructure improvements we’ve made in Kafka have allowed us to handle 1.3 trillion messages per day, and Espresso now serves 2.2 million rows per second.”
LinkedIn open-sourced its Pinot real-time analytics infrastructure last June. The technology allows LinkedIn to sort through and analyze enormous amounts of data in real time for a wide variety of its products.
“At LinkedIn, we have a large deployment of Pinot storing hundreds of billions of records and ingesting over a billion records every day,” said Kishore Gopalakrishna, a senior software engineer at LinkedIn, in a blog post describing how it works. “Pinot serves as the backend for more than 25 analytics products for our customers and members. This includes products such as Who Viewed My Profile, Who Viewed My Posts and the analytics we offer on job postings and ads to help our customers be as effective as possible and get a better return on their investment. In addition, more than 30 internal products are powered by Pinot…”
LinkedIn also open-sourced PalDB, its lightweight technology for storing side data, last October. As LinkedIn engineer Matthieu Monsch explains, side data is the extra read-only data a process needs to do its job, such as a list of stop words used by a natural language processing algorithm. Machine learning models used in machine translation, content classification or spam detection are also side data. The problem is that when this side data becomes too large, it creates bottlenecks for applications that depend on it. PalDB was built to provide a read-only embeddable database that makes it easier to scale side data.
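To make the idea of side data more concrete, here is a minimal Python sketch of a write-once, read-only key-value store. It is an illustration only and does not reflect PalDB's actual file format or API: the names `write_store` and `ReadOnlyStore` are hypothetical, and the one-JSON-record-per-line layout is an assumption chosen for brevity. The reader keeps only key-to-offset mappings in memory and fetches values from disk on demand, which is one simple way to keep large side data from crowding an application's memory.

```python
import json
import os
import tempfile


def write_store(path, items):
    """Write all key/value pairs once, one JSON record per line (hypothetical format)."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for key, value in items.items():
            f.write(json.dumps([key, value]) + "\n")


class ReadOnlyStore:
    """Read-only view of a store file written by write_store.

    Only a key -> byte offset index is held in memory; values are read
    from disk when requested. There is deliberately no put() method.
    """

    def __init__(self, path):
        self._file = open(path, "rb")
        self._offsets = {}
        offset = 0
        # Scan the file once to record where each record starts.
        for line in self._file:
            key, _ = json.loads(line)
            self._offsets[key] = offset
            offset += len(line)

    def get(self, key, default=None):
        offset = self._offsets.get(key)
        if offset is None:
            return default
        self._file.seek(offset)
        return json.loads(self._file.readline())[1]

    def close(self):
        self._file.close()


if __name__ == "__main__":
    # Example side data: a tiny stop-word list for a text-processing job.
    path = os.path.join(tempfile.mkdtemp(), "stopwords.store")
    write_store(path, {"the": True, "and": True, "of": True})

    store = ReadOnlyStore(path)
    print(store.get("the"))
    print(store.get("linkedin", False))
    store.close()
```

Separating the write phase from the read phase mirrors the read-only nature of side data: the file is produced once, for example by a batch job, and the processes that consume it only ever look values up.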
In his blog post, Perisic said he believes that LinkedIn’s engineers benefit from open-sourcing their projects because it means their work is exposed to the entire developer community.
“It seems paradoxical to think that developers write better software for others than they do for themselves, but it actually makes sense,” Perisic wrote. “When software is written ‘internally,’ developers have a tendency to cut some corners—and I’m as guilty as anyone—especially around documenting, making code easily readable and reusable and having all the right tests in order.”
“With open source, developers’ names are attached to the software they create and the entire community can look at it,” he continued. “This puts a human face on code and reputations on the line. Once a developer open sources some software, their names will be forever associated with it. This is a huge incentive to cross their T’s and dot their I’s. A developer wants to be associated with good stuff that is well written.”