Big data company Cloudera Inc. wants to lift the burden off data scientists’ shoulders with a new solution that allows them to use a variety of open-source big data tools while ensuring they remain in compliance with corporate governance regulations.
The new solution is called the Cloudera Data Science Workbench, which the company describes as a “self-service tool for data science” that sits atop its Cloudera Enterprise platform.
Cloudera said it’s trying to fix a major headache for data scientists that has arisen from the availability of so many new open-source big data technologies in the past few years. The open data science ecosystem has expanded well beyond its Python and R roots and is now swamped with deep learning frameworks such as TensorFlow, Microsoft Cognitive Toolkit, MXNet and BigDL. Data scientists want a way to bring these tools to their data, but the complexities of integrating them while ensuring compliance are little short of a nightmare.
The Data Science Workbench is Cloudera’s solution to this torment, giving “data scientists the freedom to use their favorite tools, on existing environments, while keeping in line with IT’s efforts to comply with corporate directives,” the company said in its release.
The solution is based on technology Cloudera acquired when it bought the San Francisco-based data science startup Sense.io last March, executives said. It enables data scientists to use open-source languages such as R, Python and Scala, along with a range of libraries and frameworks, on data stored in their Cloudera Enterprise Hadoop environments.
Charles Zedlewski, senior vice president of products at Cloudera, explained that the platform acts as a centralized environment for processing business information from a wide array of sources. That allows data scientists to scale their work to larger datasets and more powerful compute platforms, especially when using Spark for data processing and machine learning.
Zedlewski said that Cloudera’s customers often struggle to onboard data scientists to shared environments given their diverse needs. The problem is especially acute where open-source tools are involved, and it tends to result in duplicated work, analytic silos and limited security and governance.