UPDATED 19:44 EST / SEPTEMBER 20 2017

BIG DATA

Syncsort quality manager aims to purify Hadoop data lakes

Syncsort Inc. is extending the data quality features of the Trillium Software Inc. subsidiary it acquired last November to native Hadoop environments with Trillium Quality for Big Data.

The offering combines Trillium’s data quality features with Syncsort’s Intelligent Execution data integration platform, letting information technology organizations normalize and integrate data at the same time. Until now, the Trillium platform ran natively only on Linux, Unix and Windows operating systems. The Hadoop support is the first time Syncsort has applied its data quality features to applications.

Data quality is about identifying inconsistencies, errors and duplication. Examples include a ZIP code entered in a date field or duplicate customer records that appear to be different because of misspellings. Normalizing data is a tricky process. For example, different countries have different address and date formats, and two people with the same name in the same ZIP code may or may not be the same person.
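
To make those two examples concrete, here is a minimal Python sketch of a field check and a fuzzy name match. It is an illustration only, not Trillium's matching logic: the function names, the U.S. ZIP pattern, the ISO date format and the 0.85 similarity threshold are all assumptions chosen for the example.

```python
import re
from datetime import datetime
from difflib import SequenceMatcher

# Assumption: U.S. ZIP codes only (5 digits, optional +4 suffix).
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")


def is_valid_date(value, fmt="%Y-%m-%d"):
    """Return True if value parses as a date in the assumed format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def looks_like_zip(value):
    """Return True if value matches the assumed ZIP code pattern."""
    return bool(ZIP_RE.match(value))


def name_similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_same_customer(rec_a, rec_b, threshold=0.85):
    """Hypothetical rule: same ZIP code and similar enough names.

    This only suggests a candidate match. As the article notes, two
    people with the same name and ZIP code may still be different people.
    """
    same_zip = rec_a["zip"] == rec_b["zip"]
    return same_zip and name_similarity(rec_a["name"], rec_b["name"]) >= threshold


if __name__ == "__main__":
    # A ZIP code that landed in a date field fails the date check.
    print(is_valid_date("94105"), looks_like_zip("94105"))
    # A misspelled name in the same ZIP code is flagged as a likely duplicate.
    print(likely_same_customer(
        {"name": "Jon Smith", "zip": "94105"},
        {"name": "John Smith", "zip": "94105"},
    ))
```

A single string-similarity ratio is a deliberately crude stand-in here. A production tool would need country-specific address and date rules on top of it, which is the difficulty the paragraph above describes.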

Users are rushing to extract data from production systems and load it into analytics engines, but they are discovering that quality problems limit the effectiveness of those engines. “Everybody is trying to govern the data once it’s in the data lake so it doesn’t turn into a data swamp,” said Tendü Yoğurtçu, Syncsort’s chief technology officer. “The volume and variety of data makes it complex.”

Trillium has hundreds of matching algorithms to identify such problems, and can be configured to automatically apply corrective algorithms, Yoğurtçu said. The offering includes address- and name-matching data for 150 countries as well as postal directories and geocoding. Intelligent Execution examines the topology of a data flow and optimizes resources for the job without changes to the application. It supports both new and existing Trillium data quality projects across Hadoop, MapReduce and Apache Spark on-premises or in the cloud.

“Once you understand the data you can create the rules to cleanse that data,” Yoğurtçu said. “For example, if you have duplicates you can specify a process to flag them or get rid of them.”
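
The choice Yoğurtçu describes, flagging duplicates or removing them, can be sketched as a single rule with a configurable action. The Python below is a hypothetical illustration rather than Trillium's rule syntax: the `apply_duplicate_rule` name, the record fields and the exact-match key on normalized name and ZIP code are assumptions.

```python
def normalize(record):
    """Build a comparison key: lowercased name with collapsed spaces, trimmed ZIP."""
    name = " ".join(record["name"].lower().split())
    return (name, record["zip"].strip())


def apply_duplicate_rule(records, action="flag"):
    """Flag or drop records whose normalized key was already seen.

    action="flag" keeps every record and adds a "duplicate" field.
    action="drop" keeps only the first record for each key.
    """
    if action not in ("flag", "drop"):
        raise ValueError("action must be 'flag' or 'drop'")

    seen = set()
    result = []
    for record in records:
        key = normalize(record)
        is_duplicate = key in seen
        seen.add(key)
        if is_duplicate and action == "drop":
            continue
        # Copy the record so the caller's input is left unchanged.
        result.append({**record, "duplicate": is_duplicate})
    return result


if __name__ == "__main__":
    customers = [
        {"name": "Ann Lee", "zip": "10001"},
        {"name": "ann  lee", "zip": "10001 "},
        {"name": "Bob Ray", "zip": "10001"},
    ]
    for row in apply_duplicate_rule(customers, action="flag"):
        print(row)
```

Flagging preserves the original rows for later review, while dropping produces a smaller cleaned set. Which one fits depends on whether downstream users need to audit the matches.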

Trillium Quality for Big Data is available on all Hadoop distributions, including Cloudera Inc.’s CDH, Hortonworks Inc.’s HDP and MapR Technologies Inc.’s Converged Data Platform. It installs and deploys through Cloudera Manager and Apache Ambari. Pricing is per node or by cloud subscription, but Syncsort didn’t provide specifics.

