UPDATED 06:00 EDT / MAY 22 2018

BIG DATA

Cloudera courts data scientists with self-service workbench and collaboration features

Cloudera Inc. is appealing to the hearts and minds of data scientists and data engineers with a series of new features and products intended to make them more organized and productive.

The software the big-data company introduced today at the Strata conference in London includes machine learning capabilities that it said make it easier for data scientists to train and deploy models more quickly, with higher confidence and lower risk, while enabling better team collaboration. Cloudera is also introducing version 6.0 of its big-data platform, sporting increases in performance, scale and capacity.

The company is trumpeting version 1.4 of its Data Science Workbench in particular, saying it introduces major new self-service enhancements. One is versioning, which enables data scientists to archive versions of their experiments for later retrieval. Another is deployment through representational state transfer, or REST, application programming interfaces, which enables scientists to more quickly deploy trained models into production.

The versioning feature of the workbench, called Experiments, provides “an immutable log of each experiment and changes between them,” said Matt Brandwein, Cloudera’s director of products. Changes might include code revisions, hardware variances and the outcome of machine learning models. Each experiment is captured in a Docker container, which is a lightweight runtime environment, and catalogued in a repository. “You get the artifacts and the commit log all together,” Brandwein said.

The new deployment feature addresses a major pain point for data scientists and software engineering teams, Brandwein said. Data scientists prefer to work in statistical modeling languages like R and Python, but deployment is typically in a procedural language like Java.

In the new version of Workbench, which will be available this summer, a data scientist can “write a Python routine, push a button and automatically build a Docker container with a web service application,” Brandwein said. Models can be deployed as reports, run as batch processing jobs or exposed via the API. “We take whatever your function is and create a REST interface and define a function signature for you,” Brandwein said. “Whatever you can encode in Python and R, you can deploy.”
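
To make the idea of turning a function into a REST interface concrete, here is a minimal sketch using only Python's standard library. It is not Cloudera's implementation, and it leaves out the Docker packaging step entirely. The names `predict`, `handle_json`, `make_handler` and `serve`, the port number, and the toy scoring formula are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(args):
    # Hypothetical "model": a fixed linear score over two named inputs.
    return {"score": 0.5 * args["x"] + 2.0 * args["y"]}


def handle_json(fn, body):
    """Decode a JSON request body, call fn on it, return (status, JSON bytes)."""
    try:
        result = fn(json.loads(body))
        return 200, json.dumps(result).encode("utf-8")
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON or missing/ill-typed inputs become a 400 response.
        return 400, json.dumps({"error": str(exc)}).encode("utf-8")


def make_handler(fn):
    """Build an HTTP handler class that exposes fn to POST requests."""

    class ModelHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            status, payload = handle_json(fn, self.rfile.read(length))
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return ModelHandler


def serve(fn, port=8080):
    # Blocks until interrupted; the port is an arbitrary choice for this sketch.
    HTTPServer(("127.0.0.1", port), make_handler(fn)).serve_forever()
```

Calling `serve(predict)` would start a local server that accepts a POST body such as `{"x": 2, "y": 1}` and answers with the function's result as JSON. The wrapper knows nothing about the function beyond "JSON in, JSON out," which is roughly the division of labor Brandwein describes.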

Taken together, the new features ease the process of building data science-based applications in teams. All workloads share common security and governance parameters and the containers can run either on-premises or in the cloud.

Improvements in Cloudera Enterprise 6.0 are said to optimize resource utilization to accelerate analytics. The new release supports version 2.0 of the Apache Hive data warehousing project, version 2.0 of the Apache HBase data store and version 5.0 of the Apache Oozie workflow scheduler. Combined with new optimization features that can assign jobs to optimal hardware resources, such as graphics processing unit chips, the new features enable machine learning processes to run up to 10 times faster.

Integrated search in Cloudera Enterprise is now done with Apache Solr 7.0, which supports nested data types and JavaScript Object Notation facets. Streaming data pipelines use the Apache Kafka 1.0 distributed streaming platform and the Apache Spark 2.2 analytics framework as fully native components. Customers can manage clusters of up to 2,500 nodes using a single management interface.

Finally, Cloudera is announcing that its Altus platform-as-a-service is now generally available on Microsoft Azure, fulfilling a promise the company made last year.

Image: Flickr CC
