UPDATED 09:00 EDT / MARCH 26 2014

Datameer 4.0 visualizes data quality in Hadoop

Datameer, maker of self-service analytics software for Hadoop, today pulled the curtain back on Datameer 4.0, the latest version of its namesake platform. Datameer 4.0 lets data scientists and data analysts examine results at every stage of the data collection and visualization process, a feature the company hails as an industry first. “We are consistently pushing the envelope to make Big Data analysts as productive as possible, and studies show the fastest way to digest information is visually,” said Stefan Groschupf, CEO and founder of Datameer.

The ability to examine results throughout all of the stages of the data collection and visualization process is important because the data brought into Hadoop is so raw, according to Karen Hsu, Senior Director of Product Marketing at Datameer. “Data ingestion, data prep and data wrangling can take up to 80 percent of a data scientist’s or data analyst’s time,” Hsu explained in an interview with siliconANGLE. “People need a way to quickly inspect the data visually so they [can] know where the issues are (by profiling ingested data) and know if they fixed the issues (by profiling transformed data). People can’t wait until the end of the process to see the results because that means delays if the data was not cleaned up.”

Karen Hsu, Senior Director of Product Marketing at Datameer

Hsu said tools similar to Datameer 4.0 are already on the market, but that they address only part of the workflow. That is a problem, she said, because people who use them must learn multiple tools, and data quality issues can be introduced at every handoff from one product to the next. Plus, she said, most early Hadoop and Big Data tools are IT-focused. “But we have been hearing from customers that the person closest to the data is the subject matter expert,” Hsu said. “And it is this person [who] has the questions and can make the greatest impact when those questions are answered.”

The previous version of Datameer lacked this ability. Hsu said it was added in 4.0 after the company analyzed how its customers were using the earlier release and found that many were using the platform for data preparation. “The new capabilities in 4.0 augmented the existing data preparation capabilities, in particular the ones that transformed the data into a usable form,” Hsu explained. “Now people have the ability to preview their data (before transformation) and check their work (after transformation).”

End-to-end data profiling workflow

Datameer 4.0 provides this productivity boost in the form of an “end-to-end” data profiling workflow that lets users drill down into information quality metrics and other key indicators, such as type, count, cardinality and mean, immediately after an operation. In previous releases of the platform, these metrics could be viewed only after the final report had been generated, which cost data scientists a lot of time.
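To make the idea of profiling after each operation concrete, here is a minimal Python sketch. It is not Datameer's implementation or API; the function names (`profile_column`, `profile`, `to_number`) and the sample records are hypothetical and exist only for illustration. It computes the type, count, cardinality and mean of each column, once on the ingested data and once after a transformation, so a data quality issue shows up at the step that exposes it.

```python
from statistics import mean


def profile_column(values):
    """Summarize one column: type, count, missing, cardinality, mean."""
    present = [v for v in values if v is not None]
    types = {type(v).__name__ for v in present}
    if not types:
        type_name = "empty"
    elif len(types) == 1:
        type_name = next(iter(types))
    else:
        type_name = "mixed"
    # A mean is reported only when every present value is a number.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    all_numeric = bool(present) and len(numeric) == len(present)
    return {
        "type": type_name,
        "count": len(present),
        "missing": len(values) - len(present),
        "cardinality": len(set(present)),
        "mean": mean(numeric) if all_numeric else None,
    }


def profile(rows):
    """Profile every column of a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


def to_number(text):
    """Hypothetical cleanup step: parse a number, or mark it missing."""
    try:
        return float(text)
    except ValueError:
        return None


# Hypothetical ingested records: amounts arrive as raw strings.
raw = [
    {"customer": "a", "amount": "30"},
    {"customer": "b", "amount": "n/a"},
    {"customer": "a", "amount": "45"},
]
cleaned = [{**row, "amount": to_number(row["amount"])} for row in raw]

before = profile(raw)      # profile of the ingested data
after = profile(cleaned)   # profile of the transformed data
print(before["amount"])
print(after["amount"])
```

Comparing the two profiles shows that the `amount` column changes from text to numbers and that one value could not be parsed, which is the kind of issue an analyst would otherwise discover only in the final report.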

Stefan Groschupf, Datameer CEO

This can have a very tangible impact on the bottom line for organizations that have incorporated Big Data into their decision-making process. “With Datameer 4.0, analysts no longer need to wait until the final visualization to gain insights into their data,” Groschupf said. “This new paradigm means companies will realize meaningful ROI on their Big Data analytics projects faster than ever before.” This point about ROI is something Groschupf drove home during an interview with theCUBE cohosts John Furrier and Dave Vellante last year at Hadoop Summit 2013.

During that interview, Furrier commented that Big Data analytics is all about gaining “business value” and that in order “to go public, to get bought or to make money, startups or growing companies need to have metrics.” He then asked Groschupf, “What are the metrics that need to be in place in order to have a sustainable business model in this market?”

“You have to provide ROI to your customers,” Groschupf replied. “To answer your question, what is important really, if you want to be successful as a company, is to build your customer base, provide them value, and obviously make more money than you spend acquiring them.” (The entire interview is available at the end of this article.)

New data profiling capability

Now Datameer 4.0 provides granular visibility into every single ingest, join and transformation, which allows users to identify new issues and evaluate information for patterns on an ad hoc basis. The company claims this functionality streamlines analytics and, ultimately, accelerates time to insight. “This ability to immediately profile is important for data scientists and data analysts because this enables anyone doing Big Data analytics to visually inspect the data and check their work after the operation has been done,” Hsu added. “[Users can] catch errors before the end of the process where visualization is typically seen, [and they can] use one platform to complete the Big Data analytics workflow, end-to-end.”

To round out the package, the new data profiling capability has also been baked into the complementary Smart Analytics module, which makes it possible for everyday knowledge workers to identify meaningful patterns and relationships in their information. But there is no difference in Datameer’s workflow for data scientists versus everyday knowledge workers, according to Hsu. “Both groups are able to profile the data from ingestion through data preparation to every step of analysis,” she said.

Hsu added that a “flipside” capability has been added to the Smart Analytics module, which shows the “why” behind the algorithms. “For example, in Clustering, the flipside visually shows why the clusters were formed,” she explained. “A cluster could be a group of customers who are between the ages of 40 and 60 and typically made purchases between 7pm and 8pm that were, on average, $30 per purchase. The ‘flipside’ would [show] the figures above for age range, purchase time, and average purchase amount.”

Real-time “smart sampling”

In either version of Datameer 4.0, users are able to see results in real time using a data sample from which they build their analysis. “Datameer offers ‘Smart Sampling’ which allows users to work with a representative sample of their data based on patented, machine learning algorithms to build their analysis,” Hsu said. “This ‘Smart Sampling’ is what enables analysts to see results on this sample instantly. When they are satisfied with the results, they run the analysis on the full data set.”
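The sample-first workflow Hsu describes can be sketched in a few lines of Python. This sketch deliberately substitutes plain uniform random sampling for Datameer's patented, machine-learning-based Smart Sampling, whose details are not described here. The function names (`draw_sample`, `average_purchase_by_hour`) and the purchase records are hypothetical.

```python
import random


def draw_sample(rows, sample_size, seed=0):
    """Uniform random sample; a stand-in for Smart Sampling, not the same thing."""
    rng = random.Random(seed)
    return rng.sample(rows, min(sample_size, len(rows)))


def average_purchase_by_hour(rows):
    """The analysis under development: mean purchase amount per hour of day."""
    totals = {}
    for hour, amount in rows:
        total, n = totals.get(hour, (0.0, 0))
        totals[hour] = (total + amount, n + 1)
    return {hour: total / n for hour, (total, n) in totals.items()}


# Hypothetical (hour, amount) purchase records standing in for a large data set.
rows = [(19, 30.0), (19, 34.0), (20, 12.0), (20, 8.0)] * 250

# Step 1: iterate on the analysis against a small sample for fast feedback.
preview = average_purchase_by_hour(draw_sample(rows, sample_size=100))

# Step 2: once the preview looks right, run the same analysis on everything.
full = average_purchase_by_hour(rows)

print("preview:", preview)
print("full:   ", full)
```

The design point is that the analysis function is written once and applied unchanged to both the sample and the full data set, so what the analyst validates on the preview is exactly what runs at scale.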

For example, European startup Trustev uses Datameer to help identify fraud in minutes. Trustev offers a real-time identity verification engine that focuses on the individual making the transaction rather than on the payment method they’re using, correlating transactional data with behavioral patterns from social networks to confirm that a customer is who they say they are. “Before, fraud solutions have identified online fraud in days, which means the fraudster is often long gone before the fraud is even noticed,” Hsu said.

Big Data analytics in the future

There is currently a massive shift to next-generation business intelligence (BI) as enterprises re-platform their data warehouses to include Hadoop-based architectures. Hsu said Datameer is seeing companies shift more of their workloads from data warehouses to Hadoop. “In the next year, we expect this trend to continue,” she said. “In the next five to 10 years, we see companies relying on end-to-end, pre-built Big Data applications like Datameer (instead of hand-coding their own Hadoop solutions with open-source projects like Hive, Sqoop, Flume, etc.) on any data—big or small.”

While it is attractive for engineers to get open-source tool experience on their résumés right now, Hsu said that in the end time-to-insight will always win out and that businesses will opt for proven solutions instead of hand-built projects. “In every technology adoption cycle, it’s natural that the early adopters come more out of IT departments that tend to use more technical tools,” she explained. “Who would build their own CRM system today? Everyone just buys Salesforce now.

“We can think about our market the same way; it will be the same for Big Data analytics in the long run. We believe next-generation BI will be about empowering the user. The analyst who in the past has had to go to IT can now take control and find the previously hidden answers themselves.”

Watch the entire video in which theCUBE cohosts John Furrier and Dave Vellante interview Datameer’s CEO Stefan Groschupf during Hadoop Summit 2013:

Photo credit: SalFalko via photopin cc
Company logo, and photos of Stefan Groschupf and Karen Hsu courtesy of Datameer.
Photo credit: subarcticmike via photopin cc
Photo credit: mrjoro via photopin cc
Video interview courtesy of theCUBE.
Maria Deutscher contributed to this article.
