UPDATED 08:00 EST / JANUARY 12 2016

Preparation demands new rigor in Big Data age, says Wikibon analyst

Preparing data for analysis in the age of Big Data, with multiple data types and sources and severe time constraints, is radically different from the long-established extract, transform, load (ETL) process. For Big Data, the pipeline becomes collect, prepare, blend, transform and distribute (see figure above). In “Unpacking the New Requirements for Data Prep & Integration”, the first of two Professional Alerts on the subject, Wikibon Big Data & Analytics Analyst George Gilbert discusses these new requirements in detail.
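
To make the new stages concrete, here is a minimal Python sketch of a collect, prepare, blend, transform and distribute pipeline. The stage names come from the Alert; the record formats, field names and function bodies are hypothetical placeholders for illustration, not anything Gilbert prescribes.

    import json

    # Each stage is a generator, so records stream through the pipeline
    # rather than being staged in bulk as in a traditional ETL batch.

    def collect(sources):
        """Pull raw records from each source (files, APIs, queues)."""
        for source in sources:
            yield from source

    def prepare(records):
        """Parse and clean raw records, tolerating missing fields."""
        for raw in records:
            record = json.loads(raw)
            record.setdefault("device_id", "unknown")  # machine data varies
            yield record

    def blend(records, reference):
        """Enrich each record with reference data (a simple lookup here)."""
        for record in records:
            record["region"] = reference.get(record["device_id"], "n/a")
            yield record

    def transform(records):
        """Reshape records into the form the analytic tool expects."""
        for record in records:
            yield {"key": record["device_id"], "value": record.get("reading", 0)}

    def distribute(records):
        """Hand results to downstream consumers; printing stands in here."""
        for record in records:
            print(record)

    if __name__ == "__main__":
        sensor_feed = ['{"device_id": "a1", "reading": 7}', '{"reading": 3}']
        regions = {"a1": "us-east"}
        distribute(transform(blend(prepare(collect([sensor_feed])), regions)))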

The traditional ETL pipeline was built knowing exactly what data would be loaded, what form it would take and what questions the analysis would ask. That made it efficient but inflexible. Big Data upends that model: the need for ETL grows not just in proportion to the volume of data but also with the variety of machine-generated data that must be incorporated.

Gilbert discusses several specific areas in which ETL must change:

  • Governance;
  • Data wrangling;
  • Security; and
  • Integration and runtime.

A look at the leading candidates

In his second Alert, “Informatica & Integration vs Specialization & the Rest”, Gilbert goes into greater depth, providing notes on the leading choices, starting with individual tools such as the Apache Kafka message queue, Waterline Data, Inc. and Alation, Inc. He defines the place of each of these vendors in an overall roll-your-own Big Data pipeline. He then discusses the integrated choices, starting with Informatica Corp., which has the most complete tool for Big Data integration: its Big Data ETL covers almost exactly the same ground as its traditional ETL for structured data, providing everything up to analysis. Syncsort Inc. and Talend, Inc. provide less complete but still useful integrated tool sets. Pentaho Corp. provides the most complete stack overall, from data prep and integration through analytics.
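
For readers unfamiliar with the roll-your-own approach, the sketch below shows the role Kafka typically plays as the collection backbone of such a pipeline, using the open-source kafka-python client. The broker address and the "machine-events" topic name are assumptions for the example, not details from the Alert.

    import json
    from kafka import KafkaConsumer, KafkaProducer

    # Upstream systems publish raw machine-generated events to a topic...
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("machine-events", {"device_id": "a1", "reading": 7})
    producer.flush()

    # ...and the data-prep stage consumes them for cleaning and blending.
    consumer = KafkaConsumer(
        "machine-events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)  # hand off to the prepare/blend stages

A real deployment would replace the print loop with the preparation stage; the point is simply that a message queue decouples collection from the rest of the pipeline.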

IT practitioners, Gilbert writes, should collect requirements from each constituency involved in building the new analytic pipeline. If an integrated product such as Informatica or Pentaho does not meet those requirements, they should weigh how much internal integration work they can support and whether the resulting pipeline can meet its performance goals.

Figure © 2015 Wikibon
