The third annual Hadoop Summit in San Jose is off to a flying start with thousands in attendance and a record number of sponsors and exhibitors representing the full industry gamut, from incumbent vendors fighting to hold onto their turf to the ambitious startups looking to take their place. It’s the latter group that stole the spotlight on the first day of the event with a bevy of product announcements spanning the entire analytics lifecycle, beginning at the very start: preparing data for processing.
Checking all the boxes on information quality
Data scientists today spend as much as 80 percent of their time filtering out errors and inconsistencies and working around compatibility issues, according to Pentaho. The Hadoop business intelligence (BI) firm promises to help customers flip that number on its head with a new toolkit aimed at streamlining the process of readying information for analysis.
Included in the Data Science Pack are three utilities designed to simplify life for users working with Pentaho’s Weka open source data mining project and the R statistical language, two of the most widely used analytic technologies in the industry. Among the tools is a script execution engine that offloads all the messy details of the data transformation process to the company’s software, a scoring engine that rates datasets based on accuracy and an automated forecasting solution that generates predictions on incoming information.
Pentaho says that the bundle can not only make it easier for users to whip their information into shape for analysis but also take the hassle out of blending multiple sources, a challenge Talend is also addressing with the latest edition of its namesake platform. The release brings with it the ability to import multi-gigabyte documents into Hadoop and provides a visual environment for integrating different streams with response times up to 45 percent faster than the previous version, according to the company.
Cutting out the middle-man
While some vendors are focusing on helping data scientists be more productive, others are working to eliminate the need for specialized talent altogether. Actian is firmly in the latter camp. It too made headlines at the summit today after joining the ranks of the dozens of companies offering structured query capabilities for Hadoop with the introduction of a new SQL feature for its flagship analytics platform. The value proposition is a familiar one: the company claims that business users can leverage its software to access data stored in HDFS directly instead of going through a data scientist.
Altiscale has begun offering similar functionality to users of its Hadoop cloud, which supports the latest stable release of Apache Hive as of this morning. The open source data warehouse was originally developed by Facebook to save its developers the trouble of familiarizing themselves with MapReduce or the slightly less complex but still unwieldy Pig platform, letting them simply use familiar SQL syntax instead.
Being able to access and manipulate data in Hadoop without getting bogged down in the inherent complexity of the batch processing framework is vital to enabling the kind of velocity business users have come to expect from their applications, but using a structured query tool is not the only way to accomplish that. MetaScale, an analytics firm owned by Sears, says its newly launched “Ready-to-Go Reports” service can achieve the same results at a fraction of the cost of on-premises alternatives by eliminating the need for both data scientists and costly in-house infrastructure.