Looming questions @ Hadoop Summit: Forking, skill gaps + more
Tomorrow (Tuesday, June 3) theCUBE begins three days of in-depth coverage of Hadoop Summit in San Jose, California. The event, and theCUBE’s coverage, comes at an important point in the brief history of Hadoop, with several questions hanging over the still-immature technology. Attendees, and anyone interested in Big Data who cannot get to the conference, can join the discussion of these issues on CrowdChat.
The largest issue is the danger of forking. The market is essentially divided into two camps: the pure open source approach, championed primarily by Hortonworks and Red Hat, and what might be called the Hadoop+ group, led by Cloudera, which provides Hadoop with proprietary extensions. This is not a good-versus-bad scenario. Hadoop is an immature technology, and Hadoop skills are thin on the ground, so users often need those extensions, as well as consulting resources.
Multiple approaches to open source monetization
Hortonworks, which was founded by the original Hadoop developers from Yahoo, has made the courageous decision to donate all the code it develops or acquires back to Apache, making its money exclusively through consulting. This is the Red Hat model of pure open source, but very few vendors have built long-term success on it. Most software vendors, from startups to Oracle, SAP and IBM, make the bulk of their income from licensing fees. So one question is: can Hortonworks survive as an independent entity long term?
For the present, Hortonworks is riding high. It raised $100 million in its latest funding round, in March/April, and saw a 3x year-over-year improvement in first-quarter revenue, admittedly off a small base. Wikibon Principal Research Contributor Jeff Kelly says Hortonworks’ alliance with HP “could prove fruitful.”
The danger of the Hadoop+ approach is what might be called “Internet Explorer syndrome.” Companies have clung to old versions of Microsoft Internet Explorer (IE), despite Microsoft’s best efforts to move them on and known security flaws in the old code, because their software is optimized for that version of IE. The vendors in this group are not changing the core Hadoop code, but the extensions they add can create lock-in as the open source core evolves and comes to encompass alternative solutions to the needs those extensions fill. That can make it hard for users to stay with the most recent open source version, effectively creating forks in the technology even when the vendors never intended them.
Cloudera has good news going into the conference in the form of a $740 million investment from Intel. Kelly says Intel is withdrawing its own Hadoop distribution and will transition its customers to Cloudera, and it will contribute its Hadoop security and performance improvements to Cloudera as well. Intel may also want to make Cloudera’s distribution the target system for data from its Internet of Things initiative. The question, Kelly says, is what else Intel may want in return for its 18 percent stake in the company.
Initially this looks like a major boost for Cloudera but, Kelly cautions, Intel has no idea how to sell software to end-user companies, so the amount of actual business the deal will bring Cloudera remains to be seen.
Big Data washing
A second issue is “Big Data washing.” Everybody has big data. Certainly the Oracle and SAP databases in enterprises are big. But is a structured database, regardless of the number of petabytes involved, truly Big Data? Hadoop can certainly operate on these large structured databases, and the market is seeing traditional data system vendors label their products “Big Data.”
This is a common pattern in the market: vendors use “marketecture” to try to freeze the market in the face of disruptive new technology. Customers must be sure the products they buy can do the job they need, and if every product is “Big Data,” the term becomes meaningless. Big Data should imply the use of new data types, including unstructured data from sources such as social media and machine-generated data, for new kinds of analysis such as predicting market trends or assessing risk. To the extent that “Big Data washing” erodes this distinction, it will confuse the market.
The skills gap
Another issue is the scarcity of people with the requisite skills. These include technical experts in Hadoop and other Big Data architectures and technologies, and data scientists with the knowledge to make the best use of that technology. Again, this is a perennial problem with new technologies. Back in the early 2000s, for instance, the issue was the lack of Web site designers. Designers, however, could learn Web site design in a few months; becoming a true data scientist requires an advanced degree. And with the technologies advancing quickly, the exact skills required are not always clear. Does a Hadoop database manager need to know R, for instance? Does a data scientist need to master MapReduce?
IBM, a company heavily invested in Big Data software and services, has stepped up to this issue, last week announcing a partnership with 28 universities and business schools, including the University of Massachusetts Boston School of Business, to build Big Data curricula for the fall 2014 semester. The goal is to provide degree programs in data science and related areas that combine technical and business training. IBM estimates that by next year 4.4 million Big Data-related jobs, with titles such as Data Scientist and Chief Data Officer, will be created, and that number is expected to grow throughout the decade as Big Data analysis becomes a core part of business.
To an extent the need is being filled by technology, particularly the marriage of the SQL query language with real-time, interactive analysis, which allows businesspeople to query Hadoop directly without requiring knowledge of MapReduce or Hive.
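To make that concrete, here is a minimal sketch of what such direct SQL access can look like from Python. It assumes the PyHive client library and a HiveServer2 endpoint; the host name and the page_views table are hypothetical.

```python
# A minimal sketch: an interactive, analyst-style SQL query against Hadoop.
# Assumes the PyHive library (pip install pyhive) and a HiveServer2 endpoint;
# the host and the page_views table are hypothetical.
from pyhive import hive

conn = hive.connect(host="hadoop-gateway.example.com", port=10000)
cursor = conn.cursor()

# Plain SQL -- no hand-written MapReduce jobs.
cursor.execute("""
    SELECT page, COUNT(*) AS view_count
    FROM page_views
    WHERE view_date >= '2014-05-01'
    GROUP BY page
    ORDER BY view_count DESC
    LIMIT 10
""")

for page, view_count in cursor.fetchall():
    print(page, view_count)
```

The query itself is ordinary SQL; which engine runs it, and how interactively, is where the projects below differ.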
Several analysis initiatives are getting attention going into the Summit, Kelly says. Most recently, the Stinger initiative, a partnership led by Hortonworks that includes SAP, WANdisco and Visa, was completed; it brings full interactive query capability to Hadoop, based on the latest version of Hive. Spark, an in-memory data processing framework backed by all the major Hadoop vendors, is also getting a lot of attention. MapR is backing a third interactive solution, Apache Drill, while Cloudera has its own SQL-on-Hadoop module, Impala. This profusion of similar solutions is creating confusion in the marketplace, and Kelly will explore their differentiation and use cases on theCUBE.
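To illustrate why Spark draws so much attention, here is a minimal PySpark sketch, assuming a local Spark installation; the HDFS path and the tab-separated (user, page) log format are hypothetical. The point is the in-memory model: the parsed dataset is cached once and reused by two computations instead of being reread from disk for each pass.

```python
# A minimal PySpark sketch of in-memory processing: parse once, cache, reuse.
# Assumes a local Spark installation; the HDFS path and the tab-separated
# (user, page) log format are hypothetical.
from pyspark import SparkContext

sc = SparkContext("local[2]", "ClickstreamSketch")

# Each record: user <tab> page
events = sc.textFile("hdfs:///data/clickstream/2014-06/*.log") \
           .map(lambda line: line.split("\t")) \
           .cache()  # keep the parsed records in memory

# Two passes over the same cached data, without rereading HDFS.
views_per_page = events.map(lambda r: (r[1], 1)).reduceByKey(lambda a, b: a + b)
unique_users = events.map(lambda r: r[0]).distinct().count()

print(views_per_page.take(10))
print(unique_users)
sc.stop()
```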
But getting the full value of Big Data analysis requires more than extending tools. Using new kinds of analysis, with multiple data types from different sources, to answer new kinds of questions is far more complex than asking the old questions of a larger data set.
“Data scientist” is another term being overused: not every data admin is suddenly a data scientist. Data science requires a rare combination of new technology skills, knowledge of business issues, the creativity to identify new ways data analysis can create business advantage, and the visualization sophistication to present research results in a form that business decision makers can grasp intuitively.
Data security
Security is as big an issue with Hadoop as it is with relational databases, particularly as enterprises move from trials to production deployments involving sensitive customer, business and employee information. Again, the market has no lack of choices. Hortonworks acquired Hadoop security startup XA Secure and plans both to submit the software to Apache for open source release and to add it to the Hortonworks Data Platform. Among other things, XA Secure provides a centralized admin console for managing security in Hadoop, Kelly says.
Cloudera uses Apache Sentry for Hadoop security, which enables column-level permissions and access control as well as group- and role-based authorization. Other security offerings include Sqrrl’s Accumulo-based platform, Zettaset and Intel’s Hadoop security tools, which are going to Cloudera. This is another area of marketplace confusion, and Kelly promises to help sort it out during theCUBE’s three days of coverage.
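To make the role-based model concrete, here is a minimal sketch of Sentry-style grants issued as SQL statements through a Hive session, again assuming the PyHive library, a Sentry-enabled cluster, and hypothetical role, group and table names.

```python
# A minimal sketch of role-based grants on a Sentry-enabled Hive deployment.
# Assumes the PyHive library; the role, group and table names are hypothetical.
from pyhive import hive

cursor = hive.connect(host="hadoop-gateway.example.com", port=10000).cursor()

for stmt in (
    "CREATE ROLE analyst_role",
    # Analysts may read the sales table, and nothing more.
    "GRANT SELECT ON TABLE sales TO ROLE analyst_role",
    # Access is administered per group rather than per user.
    "GRANT ROLE analyst_role TO GROUP analysts",
):
    cursor.execute(stmt)
```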
Finally, the state of the market is in flux. Most users have been in the test phase, and the emphasis has been as much on developing the technology and building use cases as on putting Hadoop to work. So one line of questioning is: how quickly are new organizations starting to work with Hadoop, and how fast are the companies that have been in test and dev moving the technology into production?
If you are not going to Hadoop Summit, watching the live interviews on theCUBE is a good way to keep up. Whether you are going or not, log into the CrowdChat and join the conversation around the issues and events of the conference.
photo credit: JohnGoode via photopin cc
Cloudera data center image courtesy Cloudera
photo credit: coffish via photopin cc