UPDATED 19:44 EDT / SEPTEMBER 20 2017

BIG DATA

Syncsort quality manager aims to purify Hadoop data lakes

Syncsort Inc. is extending the data quality features of the Trillium Software Inc. subsidiary it acquired last November to native Hadoop environments with Trillium Quality for Big Data.

The offering combines Trillium’s data quality features with its Intelligent Execution data integration platform to enable information technology organizations to normalize and integrate data at the same time. Until now, the Trillium platform ran natively only on Linux, Unix and Windows operating systems. The Hadoop support is the first time Syncsort has applied its data quality features to applications.

Data quality is about identifying inconsistencies, errors and duplication. Examples include a ZIP code entered in a date field or duplicate customer records that appear to be different because of misspellings. Normalizing data is a tricky process. For example, different countries have different address and date formats, and two people with the same name in the same ZIP code may or may not be the same person.
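The two examples above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Trillium's method: the function names, the U.S.-only ZIP pattern, the date format and the 0.85 similarity threshold are all assumptions chosen for the example.

```python
import re
from datetime import datetime
from difflib import SequenceMatcher

# Assumption: U.S. ZIP codes only (5 digits, optional +4 suffix).
US_ZIP = re.compile(r"^\d{5}(-\d{4})?$")


def is_valid_date(value, fmt="%Y-%m-%d"):
    """Return True if value parses as a date in the assumed format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def looks_like_zip(value):
    """Return True if value has the shape of a U.S. ZIP code."""
    return bool(US_ZIP.match(value))


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two names, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def likely_same_customer(rec_a, rec_b, threshold=0.85):
    """Flag two records as candidate duplicates: same ZIP and similar names.

    The threshold is a hypothetical value; a match is a candidate for review,
    not proof that the two records describe the same person.
    """
    return (
        rec_a["zip"] == rec_b["zip"]
        and name_similarity(rec_a["name"], rec_b["name"]) >= threshold
    )
```

A value such as "90210" in a date field fails `is_valid_date` and passes `looks_like_zip`, which suggests it landed in the wrong column. Records for "Jon Smith" and "John Smith" in the same ZIP code would be flagged as a likely pair, though, as noted above, a human or a further rule still has to decide whether they are the same person.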

Users are rushing to extract data from production systems and load it into analytics engines, but are discovering that quality problems limit the usefulness of the results. “Everybody is trying to govern the data once it’s in the data lake so it doesn’t turn into a data swamp,” said Tendü Yoğurtçu, Syncsort’s chief technology officer. “The volume and variety of data makes it complex.”

Trillium has hundreds of matching algorithms to identify such problems, and can be configured to automatically apply corrective algorithms, Yoğurtçu said. The offering includes address- and name-matching data for 150 countries as well as postal directories and geocoding. Intelligent Execution examines the topology of a data flow and optimizes resources for the job without changes to the application. It supports both new and existing Trillium data quality projects across Hadoop, MapReduce and Apache Spark on-premises or in the cloud.

“Once you understand the data you can create the rules to cleanse that data,” Yoğurtçu said. “For example, if you have duplicates you can specify a process to flag them or get rid of them.”
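The choice Yoğurtçu describes, flagging duplicates or getting rid of them, might look like the following in Python. It is a minimal sketch under stated assumptions: the record layout, the `normalize_key` helper and the `action` parameter are hypothetical, and the matching here is exact after simple normalization rather than anything resembling Trillium's matching algorithms.

```python
def normalize_key(record):
    """Build a comparison key: lowercased name with collapsed whitespace, plus ZIP."""
    name = " ".join(record["name"].lower().split())
    return (name, record["zip"])


def apply_duplicate_rule(records, action="flag"):
    """Apply a duplicate-handling rule to a list of record dicts.

    action="flag" keeps every record and adds a boolean "duplicate" field.
    action="drop" keeps only the first record seen for each key.
    """
    if action not in ("flag", "drop"):
        raise ValueError("action must be 'flag' or 'drop'")

    seen = set()
    result = []
    for record in records:
        key = normalize_key(record)
        is_duplicate = key in seen
        seen.add(key)
        if action == "drop":
            if not is_duplicate:
                result.append(dict(record))
        else:
            result.append({**record, "duplicate": is_duplicate})
    return result
```

Flagging preserves the original rows for later review, while dropping produces a smaller, cleaner set at the cost of discarding data; which is appropriate depends on the downstream use.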

Trillium Quality for Big Data is available on all Hadoop distributions including Cloudera Inc.’s CDH, Hortonworks Inc.’s HDP and MapR Technologies Inc.’s Converged Data Platform. It deploys and installs via Cloudera Manager and Apache Ambari. Pricing is per node or by cloud subscription, but Syncsort didn’t provide specifics.

Image: Flickr CC
