UPDATED 13:08 EDT / DECEMBER 30 2011

NEWS

5 Big Data Startups to Watch in 2012

Big data, without question, is a 2011 buzzword finalist. But like all good buzzwords, it captures a shared understanding: data dominates our lives and will increasingly do so in the years ahead. How can you deny that a company’s success will depend in great part on how it views data and its value?

That belief is not lost on venture capitalists, who have invested $350 million in Hadoop and NoSQL startups since 2008.

To mark this mega-trend, we’ve picked five big data startups, plus one honorable mention, that we believe are ones to watch in 2012. Here they are:

Cloudera

Cloudera took the spotlight once again this year. By year’s end, the company had secured an additional $40 million in financing, bringing its total raised to $76 million. Its Hadoop World event had record attendance. And the company is playing a leading role in the emerging Hadoop ecosystem.

Jeff Hammerbacher is Cloudera’s chief scientist. In January he posted on Quora that he was looking for someone to work with him. His comments say a lot about Cloudera and what it does:

I’m looking to hire someone to work closely with me at Cloudera in my role as Chief Scientist. We’d work together to shape Cloudera’s long-term technology strategy.

The primary focus is understanding how our customers create business value from the data deluge. I meet regularly with technical decision makers wrestling down big data in the web, financial services, federal, retail, telco, life sciences, oil and gas, and other domains. We’d work to distill these discussions into high-level product requirements and partner recommendations.

A secondary focus is understanding which trends in the academic and open source communities will impact our customers. I sit on a lot of program committees and I engage with a number of open source projects. We’d work to translate the innovations of these communities into high-level product requirements and partner recommendations.

Cloudera will face its most serious competition yet this year from startups such as Hortonworks and MapR. Cloudera gets criticized for its proprietary offerings, but it remains one of the largest contributors to Apache Hadoop. Web companies such as Groupon and Klout use its technology. And its partnership and certification program should help it build new channels with consultants and systems integrators.

Jeff Kelly of Wikibon writes:

Cloudera Management Suite, while proprietary, includes important enterprise-level features such as automated, wizard-based Hadoop deployment capabilities, dashboards for configuration management and a resource management module for capacity and expansion planning. Ambari, Hortonworks’ answer to Cloudera Management Suite, is open but is less mature and currently lacks advanced cluster management capabilities.

Earlier this year, Cloudera and Dell announced a partnership. Dell’s Crowbar software integrates with the Cloudera distribution. Crowbar manages the Hadoop deployment from the initial server boot to the configuration of the primary Hadoop components. This allows for bare-metal deployments that the companies say take hours instead of days.

The partnership shows why 2012 could be such a big year for Cloudera. It gives Dell the leading Hadoop distribution, and it gives Cloudera an association with a well-known brand and its established channels. It’s a services play, too, as education and configuration assistance are pretty much required for any customer integrating Hadoop into its environment.

MapR

Here’s what I really like about MapR: it doesn’t have to charge for training. The business model is selling its stack to work with Hadoop; training is not what it does. My bet: MapR may be the best positioned of all the newcomers in the Hadoop space to make a real run against the incumbents.

It comes down to the data MapR can handle. What it does echoes what I am hearing from other smart startups: the rate of data growth has changed the bottleneck. It’s the network, not the disk, that’s the issue. The better way, according to MapR, is to keep data and compute together on the same nodes and send only the results over the network.
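That idea is the heart of the MapReduce pattern Hadoop and MapR are built around. A minimal Python sketch (the node names and sample text are illustrative, not from MapR) shows why it saves the network: each node counts words over its own local chunk, and only the small partial counts cross the wire to be merged.

```python
from collections import Counter

# Each "node" holds a local chunk of the data; names are illustrative.
node_chunks = [
    ["the quick brown fox", "the lazy dog"],
    ["the fox jumps", "over the lazy dog"],
]

def map_phase(lines):
    """Run locally on each node: count words in the node's own chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts  # a small partial result -- all that crosses the network

def reduce_phase(partials):
    """Merge the small per-node results into a global count."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

partials = [map_phase(chunk) for chunk in node_chunks]  # compute moves to data
totals = reduce_phase(partials)                         # only results are shipped
print(totals["the"])  # -> 4
```

The raw text never moves; shipping a few counters instead of gigabytes of input is the whole advantage when the network, not the disk, is the bottleneck.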

Other factors make MapR one to watch. It has a world-class team. CEO M.C. Srivas left Google to start MapR. You learn to build infrastructure at a company like Google, but you do it with a different twist: the infrastructure has to keep the end user in mind. It has to be simple, something people can use without a worry, and that requires engineers to retain a certain discipline. Srivas had a job at Google that suited him well for leading MapR: he led the BigTable project. BigTable is used in Google Maps, Google Reader, Blogger, YouTube, Gmail and a number of other Google services. It’s the basis of Apache HBase, a non-relational columnar data store included with most Hadoop distributions. Add to that Srivas’ additional background in enterprise storage and you can see why Hadoop is such a fit for him.

It’s evident Srivas developed MapR with that simplicity in mind. Customers cite its ease of use. And that’s a huge issue with Hadoop. It is not easy to use. It’s why services providers are in high demand. Customers need help. MapR eases that complexity to some degree.

Add it all up and it’s apparent why MapR received another $20 million this Fall from Lightspeed Ventures. Its partnership with EMC will also help increase access to customers.

All in all, 2012 looks like a breakout year for MapR.

10gen

10gen is cleaning up in the NoSQL market. The company is the sponsor of the open source NoSQL database MongoDB and appears ready to move into broader markets now that the big data meme is spreading into the mainstream.

MongoDB has a bustling ecosystem. Red Hat announced earlier this month that OpenShift, its Platform as a Service, would support MongoDB. 10gen lists a number of companies as partners, including: “Microsoft, VMWare Cloud Foundry, Rackspace, Amazon Web Services, MongoLab, Nebula, Joyent, Pentaho, Fusion-io, Ubuntu, O’Reilly, Manning Publications, Loggly, Github, Twilio, MongoHQ, Server Density, Right Scale, dotCloud, PalominoDB, DATAVERSITY, and Startup Monthly.”

10gen also raised $20 million this Fall. According to GigaOm, the company has 400 customers. The bigger opportunity, though, lies in the massive amount of unstructured data that some say accounts for 90% of the data we now produce. SQL databases just can’t handle those big data requirements, which means we will no doubt see huge new demand for NoSQL. And 10gen is well positioned to become a dominant player in the space.
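What makes a document store like MongoDB suited to that unstructured data is that records in one collection need not share a schema. A minimal sketch with plain Python dicts (no running database assumed; the field names are invented for illustration) shows the idea:

```python
import json

# MongoDB-style documents: records in one collection can have different fields,
# where a SQL table would force every row into one fixed schema.
events = [
    {"user": "alice", "action": "click", "page": "/home"},
    {"user": "bob", "action": "purchase", "items": ["book", "pen"], "total": 12.5},
    {"user": "carol", "action": "click", "page": "/about", "referrer": "news"},
]

# The equivalent of a MongoDB query like find({"action": "click"}),
# done here with a plain filter over the documents.
clicks = [e for e in events if e.get("action") == "click"]
print(len(clicks))  # -> 2

# Each document serializes cleanly to JSON, the format MongoDB's model builds on.
print(json.dumps(events[0], sort_keys=True))
```

New fields like `referrer` can appear on any document without a schema migration, which is exactly the flexibility irregular web and machine data demands.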

Hortonworks

In just a matter of months Hortonworks established itself as a pivotal player among vendors in the Hadoop space. It’s a company that did not come out of a garage. It came out of Yahoo! with 22 engineers and funding from Benchmark Capital.

Hortonworks has announced a series of partnerships, many of them this Fall. In November alone it announced deals with Datameer, Informatica, Karmasphere and Pervasive Software.

The next year will determine if there is room for multiple Hadoop software distributions. For Hortonworks, it means ramping up partnerships and taking the lead in marketing its products and services.

Splunk

Splunk goes into the new year looking at the possibility of an IPO with a value of $1 billion.

Splunk analyzes machine data, which means server and log files. Its proprietary database ingests data very fast. But what sets it apart is simplicity: it’s easy to use. The search UI is similar to Google’s, so it takes little real training. You can script from the command line and search for different combinations of events, and you can build dashboards and alerts.
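To make the "search over machine data" idea concrete, here is a small Python stand-in (not Splunk's actual search language; the log lines and field names are invented) for the kind of query Splunk runs: extract fields from raw log lines, filter on a combination of them, and aggregate the matches.

```python
import re
from collections import Counter

# Hypothetical server log lines -- the "machine data" a tool like Splunk indexes.
log_lines = [
    "2011-12-30 13:01:02 host=web1 status=200 path=/index",
    "2011-12-30 13:01:03 host=web2 status=500 path=/checkout",
    "2011-12-30 13:01:04 host=web1 status=500 path=/checkout",
    "2011-12-30 13:01:05 host=web2 status=200 path=/index",
]

def parse(line):
    """Extract key=value fields from a raw log line."""
    return dict(re.findall(r"(\w+)=(\S+)", line))

events = [parse(line) for line in log_lines]

# A search for a combination of events: errors on the checkout path.
errors = [e for e in events if e["status"] == "500" and e["path"] == "/checkout"]
print(len(errors))  # -> 2

# A per-host breakdown -- the kind of aggregate a dashboard panel would chart.
by_host = Counter(e["host"] for e in errors)
print(by_host["web1"])  # -> 1
```

Splunk’s value is doing this at terabyte scale with indexing, a query language, and dashboards on top, but the filter-and-aggregate shape of the work is the same.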

The pricing is simple, too. You can start using it for free; if it works, you start paying on a usage basis. It’s a modern approach designed to get the most out of on-premises systems.

Splunk shows the value of analytics on a massive scale and how it can foster a new kind of enterprise that builds a DevOps community. According to the Splunk blog, LinkedIn has 350 developers and engineers and 70 IT operations people. Splunk’s analytics allow for smarter interaction between the two groups by monitoring how service calls are performing, interacting and affecting application services. Splunk provides visibility into the impact of release changes on service calls and performance, helping teams fix issues, optimize performance and monitor production environments.

From the Splunk blog:

LinkedIn has a sizeable environment and impressive Splunk implementation: a highly distributed service-oriented architecture (270+ services) across 5 data centers, indexing about 3 terabytes in Splunk every day, 700 saved searches, and 8,000 searches per hour at peak. A typical set of DevOps uses of Splunk: tracing individual calls through the entire service stack, identifying service dependencies, and capacity planning. Since then LinkedIn has expanded its use of Splunk to a variety of security use cases as well.

“We rely on Splunk to understand the impact of new features on our back-end services. This type of operational visibility allows us to correlate the impact of front-end usage with backend processes.”

Splunk is blurring the lines between developers and operations. That puts the company in a sweet spot going into the new year. Its modern approach is in line with a trend that we believe will change IT forever, with analytics at its core. Splunk’s success shows the value that big data can bring and why companies like it represent a new generation that will challenge the traditional technology providers.

Honorable Mention: Tresata

Tresata is a favorite of the team here at SiliconANGLE. Co-founder Abhishek Mehta is an inspiring character whom we often invite to share a few moments on theCUBE. Here’s an interview with him at Hadoop World this Fall.



Mehta shares a simple message about Tresata: Hadoop does not have to be difficult. It took hundreds of years to go from the blacksmith to the steel forges of the 20th century. Today, for the most part, we still take data and shape it by hand, much as the ironsmiths of centuries past shaped metal into swords and forks.

Skyscrapers could not be built without factory-made steel. Likewise, data cannot provide scaling value without automation.

Jeff Kelly, our data analyst from Wikibon, interviewed Mehta in April. Mehta said the goal is to create data factories that are tuned to filter timely, accurate data that can be used to solve pressing business problems.

That may mean filtering tweets or any form of actionable intelligence.

It’s a three-step process that Mehta compares to manufacturing models. Henry Ford modernized the assembly line to create new kinds of products. Tresata sees itself taking a similar approach with data: create a pipeline that provides value to financial services firms, Tresata’s first market. It sees retail and healthcare as markets to explore next.

The promise for Tresata lies in its modern approach to data filtering. It is unlike the data integration providers of old that made their fortunes scrubbing data. Instead, its future is in creating data products built specifically through a process that delivers micro-focused value.

