UPDATED 09:00 EST / SEPTEMBER 22 2016

Guest post: 3 ways to avoid the data silo syndrome

The inability to access data, run simple reports and develop new software and systems can be infuriating. Often, the root cause is fragmentation of data in multiple “data silos.” Unfortunately, the cure for data silos can be worse than the disease. Enterprise data lakes or warehouse projects, for example, are prone to becoming overgrown and unmanageable.

Three new data integration techniques offer hope of making things easier.

1. Data lakes
When you move all your data from disparate silos into one system – such as Hadoop – you create a data lake. Data lakes are easy to implement, highly flexible and moderately powerful. Unfortunately, they are hard to govern.

2. Virtual databases
Virtual databases, also known as federated databases, are single systems comprising multiple data sets that act as one big database. Under the covers, a virtual database queries the back-end systems and converts the data into a common format in memory. Unfortunately, that can take a lot of processing power and can cost a great deal of money.

3. Data hubs
This is a hub-and-spoke approach to data integration in which data is physically moved to a central location and re-indexed. Like data lakes, hubs are new systems. Unlike data lakes, data hubs are also operational, so they can support transactional workloads and discovery as well as analytics.

A framework for understanding

The three new approaches can be categorized in terms of how they move, harmonize and index data. Each has its advantages, which are summarized here.

Federation
Movement: No. Data remains in the silos.
Harmonization: Late. All done at query time.
Indexing: No additional indexing. Delegate queries to the source systems using their own indexes.

Data Lakes
Movement: Yes. Data is moved to one place, but left in its source format.
Harmonization: No. Or not in a manageable way. Data Lake analysis and reporting code must implicitly harmonize within the business logic of a report or analysis.
Indexing: Very little. Source systems have their own functionality and indexes, that cannot usually be altered.

Data Hubs
Movement: Yes. Data is moved to one place.
Harmonization: Yes. Data is (at least partially) harmonized as it is moved.
Indexing: Yes. Data is indexed in the harmonized form for efficient access and analysis.

Let’s look at each in more detail.

Data movement

Data movement provides operational and organizational separation from source systems. Hubs and lakes both move data. The data from each silo is copied to a new set of disks and processed by a new set of servers and software. The benefits of having a separate data store that doesn’t hit your production servers for every lookup are substantial. In contrast, a virtual (federated) database delegates every query to the source systems because the data does not move.

The problem is that copies of data lag behind the master systems. There are workarounds like the Lambda architecture or streaming changes, but data freshness must be considered.
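
The tradeoff between delegating queries and moving data can be illustrated with a minimal Python sketch. It assumes two tiny in-memory silos; the names SourceSilo, FederatedView and MovedCopy are hypothetical and do not correspond to any product's API. The sketch only counts how often production systems are hit and shows how a copy goes stale between refreshes.

```python
from dataclasses import dataclass


@dataclass
class SourceSilo:
    """Stand-in for one production system (hypothetical)."""
    name: str
    rows: list
    queries_served: int = 0

    def query(self, predicate):
        # Every call here is load on the production system.
        self.queries_served += 1
        return [r for r in self.rows if predicate(r)]


class FederatedView:
    """Federation: no data is moved, every query goes to every silo."""

    def __init__(self, silos):
        self.silos = silos

    def query(self, predicate):
        results = []
        for silo in self.silos:
            results.extend(silo.query(predicate))
        return results


class MovedCopy:
    """Lake- or hub-style copy: reads hit the copy, not production."""

    def __init__(self, silos):
        self.silos = silos
        self.rows = []

    def refresh(self):
        # One bulk read per silo; the copy lags until the next refresh.
        self.rows = [row for silo in self.silos
                     for row in silo.query(lambda r: True)]

    def query(self, predicate):
        return [r for r in self.rows if predicate(r)]


crm = SourceSilo("crm", [{"id": 1, "region": "EU"}])
erp = SourceSilo("erp", [{"id": 2, "region": "US"}])

federated = FederatedView([crm, erp])
copy = MovedCopy([crm, erp])
copy.refresh()  # one read per silo


def is_eu(row):
    return row["region"] == "EU"


for _ in range(3):
    federated.query(is_eu)  # hits both silos each time
    copy.query(is_eu)       # never touches the silos

# A new record arrives in production after the last refresh.
crm.rows.append({"id": 3, "region": "EU"})
fresh = federated.query(is_eu)  # sees the new record
stale = copy.query(is_eu)       # does not, until refresh() runs again
```

In this toy run the federated view returns both EU records while the copy still returns one, which is the freshness cost described above. The copy, in turn, added only a single read per silo to the production load.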

Data harmonization

All data must eventually be harmonized or it is nearly useless. A simple summation like “total sales per region” must access sales data in all formats, and normalize them for currency differences, postal codes and other factors. Data hubs harmonize early, federations harmonize late, and data lakes harmonize in an ad-hoc, ungoverned way.

Harmonization falls into three categories of varying computational difficulty: naming differences, which are simple matters of terminology; structural differences in the schemas storing the source data; and semantic differences where the values themselves are incompatible.
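
To make the three categories concrete, here is a small Python sketch of the “total sales per region” example. Everything in it is an assumption made for illustration: the two silo record layouts, the field names, and the fixed exchange rate are hypothetical, and a real project would take rates and mappings from a governed source.

```python
from decimal import Decimal

# Hypothetical fixed rates to USD, for illustration only.
USD_PER_UNIT = {"USD": Decimal("1"), "EUR": Decimal("1.10")}


def harmonize_silo_a(rec):
    # Naming difference: silo A says "rgn" and "amt" for region and amount.
    return {
        "region": rec["rgn"],
        "amount_usd": Decimal(rec["amt"]) * USD_PER_UNIT[rec["cur"]],
    }


def harmonize_silo_b(rec):
    # Structural difference: the region is nested under customer/address.
    region = rec["customer"]["address"]["region"]
    # Semantic difference: amounts are integer cents in a foreign currency.
    amount = Decimal(rec["total"]["cents"]) / Decimal(100)
    return {
        "region": region,
        "amount_usd": amount * USD_PER_UNIT[rec["total"]["currency"]],
    }


def total_sales_per_region(records):
    """Sum harmonized sales; works only because every record has one shape."""
    totals = {}
    for rec in records:
        region = rec["region"]
        totals[region] = totals.get(region, Decimal("0")) + rec["amount_usd"]
    return totals


silo_a = [
    {"rgn": "West", "amt": "100.00", "cur": "USD"},
    {"rgn": "East", "amt": "40.00", "cur": "USD"},
]
silo_b = [
    {"customer": {"address": {"region": "West"}},
     "total": {"cents": 5000, "currency": "EUR"}},
]

harmonized = ([harmonize_silo_a(r) for r in silo_a]
              + [harmonize_silo_b(r) for r in silo_b])
totals = total_sales_per_region(harmonized)
```

The renaming is trivial, the restructuring needs knowledge of each schema, and the currency and unit conversion needs business rules, which mirrors the rising difficulty of the three categories.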

Harmonization is one of the most important factors in determining the success of a data integration project. The process should be well-governed, and the harmonized data should be indexed for efficient operational use, beyond the simple movement of disharmonious source data.
 
Businesses typically can’t wait for a massive data modeling effort. Instead, they should identify a limited number of critical data elements to harmonize early and fold in new ones over time. A system that can store data as is – such as Hadoop or MarkLogic – is ideal for this.

Data indexing

Indexing enables fast lookups for operational workloads. Batch processes, and even fast-batch processors like Apache Spark, have the luxury of scanning every record in a data set. Operational workloads do not, so they require indexes.

Federated (virtual) databases don’t do any indexing themselves. Instead, they rely on the source silos to have adequate indexing, which is a risky approach. When indexes are missing, federated queries may fall back to full scans of the source data.

Data lakes do very little indexing. When they do, it’s typically by integrating additional components such as mini data marts, Apache Solr text indexes and HBase tables.

Data hubs are distinguished from other approaches by their comprehensive indexing. Data is indexed after it is harmonized, so the indexes are built on top of higher-value data. By moving data, hubs have both a place to store indexes and an infrastructure on which to utilize them.
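
The difference between scanning and indexed lookup can be sketched in a few lines of Python. This is a toy illustration, not how any particular hub builds its indexes: it assumes records that are already harmonized to a common "region" field, and the function names are hypothetical.

```python
from collections import defaultdict


def full_scan(records, region):
    """Batch-style lookup: touches every record on every query."""
    return [r for r in records if r["region"] == region]


def build_region_index(records):
    """Built once, after harmonization, over the common 'region' field."""
    index = defaultdict(list)
    for position, rec in enumerate(records):
        index[rec["region"]].append(position)
    return index


def indexed_lookup(records, index, region):
    """Operational-style lookup: reads only the matching positions."""
    return [records[i] for i in index.get(region, [])]


# Harmonized records, as they might look after being moved into a hub.
records = [{"id": i, "region": "West" if i % 2 == 0 else "East"}
           for i in range(6)]
index = build_region_index(records)

west = indexed_lookup(records, index, "West")
```

Both functions return the same records. The indexed version simply does its scanning once, up front, which is only possible when there is a place to store the index and a harmonized field to build it on.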

Applying the three key concepts

In summary, data hubs are the most powerful, fully evolved approach to data integration, but each approach has its place. The key concepts of data movement, harmonization and indexing tell us what tradeoffs we’re making.

For example, large organizations need separate, scalable, reliable infrastructure for integrated data. This requires data movement. Operational loads require sub-second, database-style queries, which in turn require indexing. Federated databases suffer from “least common denominator” query syndrome, and are limited by the weakest integrated system. So, to effectively plan for future growth, choose an approach with its own indexing and movement capabilities.

Focusing on movement, harmonization and indexing will make your decision easier.


Damon Feldman is solutions director at MarkLogic Corp. and a seven-year veteran of the company. He has been involved with some of the largest MarkLogic projects for customers ranging from the US Intelligence Community to HealthCare.gov to private insurance companies.

Photo by Matt Batchelor via Flickr CC
