Understand Hadoop’s limitations in order to realize its full value

theCUBE Live at Hadoop World 2014

theCUBE Live at Hadoop World 2014A major issue facing Hadoop early adopters is… now what? While all the excitement about Hadoop is well justified, many people don’t fully understand its limitations – such as the difficulty of connecting and linking data across the cluster – and thus don’t use Hadoop to its fullest potential. you

Early Hadoop adopters are starting to realize that if used alone, Hadoop is not a data integration solution. There are three core challenges with Hadoop. Without an add-on or work-around application Hadoop is unable to:

  1. connect unstructured and structured content;
  2. discover and fully leverage complex relationships within and across disparate data sets; and
  3. organize data to enable scalable searches.

Organizations need to face these challenges head on in order to get the most out of their data. Let’s look at each in turn.

Inability to connect unstructured and structured content

Unstructured data is the free-form text in emails, reports, documents, log files, and other sources that make up the increasing variety of big data sources. This dense data lacks the structured context to link it to the traditional business application. Hadoop is able to store this information and even search through it via natural language processing or text search applications like Solr. However, it is unable to extract meaningful definitions from that data and link the unstructured entities to structured profiles found throughout other structured applications.

Unstructured data produces the need for an additional analytics capability to be added on top of Hadoop. Storing data isn’t enough. There needs to be timely linking and analysis between both unstructured and structured data in order to make sense of massive amounts of information. Analysts and data scientists need to be able to quickly find relevant information across billions of records and content. Companies are not getting the full potential out of their data by not implementing analytic applications that bridge the gap and make the most use of their data sets.

Discovering complex relationships from disparate data sets

Because of the large volume and complexity of data entering business systems, it is difficult to manage and interpret data sets. These complex data sets are only going to continue to grow with the Internet of Things. Add-on solutions are necessary to make sense of large amounts of data and put it to meaningful use to benefit a company. If the data sits in Hadoop and isn’t being analyzed to find correlations across different data sets, it isn’t serving much of a purpose. Correlation analysis allows users to find value in their data and prove or deny certain hypotheses. These findings will allow businesses to make better decisions and interpretations in the future.

Lengthy development times are required to organize data and produce scalable searches

Manual processes to effectively organize the large volume and variety of data stored in Hadoop are completely out of the question. Advanced technology is necessary to scale and organize data across the cluster to enable the location of relevant data in an efficient manner. Real-time analysis and organization of data is needed to allow for quick and accurate business analysis. Hadoop should be paired with fast entity analysis software to organize both structured and unstructured data into real-world entities and their relationships, which can enable actionable decisions. Search engines within Hadoop are only as good as the data stored within them, and they lack the keys that connect the dots to present a clear picture. These engines need to be provided with the keys to take advantage of all the data and give analysts the answers they need when they need them. Investing in the right advanced analytic applications to provide these keys allows queries to process faster and increase business value.

Hadoop is a platform that has the potential to accelerate the value of business intelligence and reporting, but additional applications and technologies are required to enable real-time data analysis. Realizing the restrictions that Hadoop presents is the first step toward better data management and analysis. Hadoop is the platform for which entity analytics applications can connect the dots across domains into real-world views of people, events, locations, and products. It is important to acquire a tool that is able to create linkages and relationships — this step is vital if organizations are to understand the complexity of the business landscape and improve their models and segmentation, leading to increased ROI and successful business decisions.

About the Author 

Jennifer Reed, NovettaJennifer Reed has more than 20 years of technical expertise and background in financial services and government. She is director of product management at Novetta Solutions, LLC. She is responsible for defining and implementing product strategy for Novetta Entity Analytics, establishing and maintaining relationships with clients, partners, and analysts, seeking new market opportunities, and providing oversight of overall strategy, technical and marketing aspects of the product. Jennifer was previously senior product manager at IBM for InfoSphere MDM.