UPDATED 13:13 EDT / JUNE 15 2011

Big Data and the Semantic Web: A Synergy of Infrastructure and Language

Edd Dumbill at O’Reilly Radar asks an interesting question about the relationship between Big Data and the semantic web: “Big data and the semantic web: At war, indifferent, or intimately connected?” The answer is complex and interesting, and it involves a lot of the two being all three at the same time. Both are tools of human understanding, and it will ultimately come down to what a particular entity wants to do with any given set of data.

There are many avenues to wed the concepts of Big Data and the semantic web: using humans as the engines of context, and using Big Data analytics to develop semantics based on those contexts. Dumbill mentions the Google “Knowledge” effort to augment search (itself becoming a Big Data enterprise) with semantic web content determined and developed by the users doing the searches.

Conventionally, semantic web systems generate metadata and identified entities explicitly, i.e., by hand or as the output of database values. But as anybody who’s tried to get users to do it will tell you, generating metadata is hard. This is part of why the full semantic web dream isn’t yet realized. Analytical approaches work differently, surfacing and classifying the metadata from analysis of the actual content and data itself. (Freely exposing metadata is also controversial and risky, as open data advocates will attest.)
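To make the contrast concrete, here is a minimal sketch in Python of the analytical approach: candidate entities and relations are surfaced from the text itself rather than authored by hand. The capitalization heuristic and co-occurrence counting are illustrative assumptions, not any particular product’s method.

```python
import re
from collections import Counter
from itertools import combinations

def surface_metadata(documents):
    """Toy analytical pipeline: treat capitalized tokens as candidate
    entities, and co-occurrence within a document as a candidate relation."""
    entity_counts = Counter()
    cooccurrence = Counter()
    pattern = re.compile(r"\b(?:[A-Z][A-Za-z]+\s?){1,3}\b")
    for doc in documents:
        candidates = {m.group().strip() for m in pattern.finditer(doc)}
        entity_counts.update(candidates)
        # Entities appearing in the same document become candidate relationships.
        cooccurrence.update(combinations(sorted(candidates), 2))
    return entity_counts, cooccurrence

docs = [
    "IBM built Watson to compete on Jeopardy.",
    "Watson relies on statistical analysis rather than hand-built metadata.",
]
entities, relations = surface_metadata(docs)
print(entities.most_common(3))
print(relations.most_common(3))
```

No human ever typed this metadata in; it falls out of the data, which is exactly the shortcut the analytical camp is betting on.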

Once big data techniques have been successfully applied, you have identified entities and the connections between them. If you want to join that information up to the rest of the web, or to concepts outside of your system, you need a language in which to do that. You need to organize, exchange and reason about those entities. It’s this framework that has been steadily built up over the last 15 years with the semantic web project.
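That language is RDF and the vocabularies built on top of it. As a hedged sketch of what “joining up” looks like, here is how surfaced entities might be expressed and linked outward using the rdflib library; the example.org namespace and the specific triples are hypothetical.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# Hypothetical namespace for entities our analytics surfaced.
EX = Namespace("http://example.org/entities/")
DBPEDIA = Namespace("http://dbpedia.org/resource/")

g = Graph()
g.bind("ex", EX)

# Express an entity and its relationships as RDF triples...
g.add((EX.Watson, RDF.type, EX.ExpertSystem))
g.add((EX.Watson, RDFS.label, Literal("Watson")))
g.add((EX.Watson, EX.builtBy, EX.IBM))
# ...and join it to concepts outside our system by linking to shared URIs.
g.add((EX.IBM, OWL.sameAs, DBPEDIA.IBM))

print(g.serialize(format="turtle"))
```

The point of the shared URI is that any other system on the web can now reason about our entities without ever having seen our data.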

The alchemy behind Big Data building contextual semantic schema currently lies hidden in developing smart systems that amplify human interaction with data. The entire purpose of producing schema is to involve humans, since humans give context to data, and human time is limited and expensive. It becomes a big deal to be able to take the limited exposure of people to small cross-sections of data and expand it outwards to apply to larger tracts of data.

This is where Big Data analytics comes into play—and why Big Data is the next big thing.

The first expert systems that generate knowledge (or at least context) from incoming data will probably be extremely advanced Bayesian filters that take into account defined similarities between different types of data. Aggregated with how a group of people interact with a set of data (adding keywords and metadata, organizing it), smart systems will be able to “guess” at the organization of incoming data as the sets increase in size; and when a system gets something wrong and enough people correct it, the entire system will shift along with the modifications to produce a better context both for new data and for data it already has.
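A toy version of such a filter, assuming a naive Bayes model over keywords with human corrections fed back in as new training data, might look like the following; the categories and vocabulary are invented for illustration.

```python
import math
from collections import Counter, defaultdict

class IncrementalNaiveBayes:
    """Toy Bayesian filter: classifies documents into user-defined
    categories and updates its counts whenever a human corrects it."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> document count

    def learn(self, words, category):
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1

    def classify(self, words):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for cat, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[cat].values())
            vocab = len(self.word_counts[cat]) + 1
            score = math.log(n_docs / total_docs)
            for w in words:
                # Laplace smoothing so unseen words don't zero out a category.
                score += math.log(
                    (self.word_counts[cat][w] + 1) / (total_words + vocab)
                )
            if score > best_score:
                best, best_score = cat, score
        return best

clf = IncrementalNaiveBayes()
clf.learn(["hadoop", "cluster", "petabyte"], "infrastructure")
clf.learn(["ontology", "rdf", "metadata"], "semantics")

doc = ["rdf", "cluster"]
print(clf.classify(doc))
# When enough users correct a wrong guess, the correction becomes training
# data and shifts every future classification:
clf.learn(doc, "semantics")
```

Each human correction is cheap on its own; the leverage comes from the model applying it to every similar document that arrives afterward.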

We’ve seen the sheer capability of systems that do this sort of thinking in IBM’s Watson, which did astoundingly well on the TV show Jeopardy!

Individual expert humans will always be better at making contextual connections than machines; that’s the power of a semantic web driven by users. But the ability to harness people in large numbers is bottlenecked by how much data users are willing (and able) to process. Crowdsourcing may be the rudder of Big Data and semantic web analytics: it will largely function as the thing that steers expert systems rather than as the expert systems themselves.

It’s All About Structure and Language

Just as human languages code for culture (human communities are bounded by what they can talk about and how they can talk about it; languages solve problems by developing new words and grammars when new experiences exceed current vocabularies), computer languages and systems are limited by their ability to describe relationships and variables. Edd Dumbill suggests that the next Big Data revolution, amid factors also discussed at GigaOM’s Structure Big Data 2011 conference, will have to build on the petabyte-scale storage we have and beyond by developing a deep lexicon for describing semantic relationships.
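In practice, a “deep lexicon” amounts to a schema: a set of named classes and properties that bound what a system can say. Here is a minimal sketch with rdflib and RDFS, using a hypothetical vocabulary namespace.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical vocabulary: the "lexicon" is just a schema naming the kinds
# of things, and kinds of relationships, the system is able to talk about.
VOC = Namespace("http://example.org/vocab/")

g = Graph()
g.bind("voc", VOC)

g.add((VOC.Dataset, RDF.type, RDFS.Class))
g.add((VOC.Person, RDF.type, RDFS.Class))
g.add((VOC.annotated, RDF.type, RDF.Property))
g.add((VOC.annotated, RDFS.domain, VOC.Person))
g.add((VOC.annotated, RDFS.range, VOC.Dataset))

# A system with only this vocabulary can describe people annotating
# datasets, and nothing it has no words for; extending the lexicon
# extends what the system can express.
print(g.serialize(format="turtle"))
```

The design point mirrors the linguistic one above: a richer schema does not add data, it adds new sentences the system is capable of forming about the data it already has.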

Expert systems will be beholden to their initial structure and code in how their analysis can approach semantic issues. Cisco, Greenplum, EMC and others are giving us the infrastructure to build highly interconnected, vast storage atop. The next stage of Big Data and semantic development, the one needed to produce expert systems that take advantage of this, will be about constructing their “thought processes,” and that will be shaped by the code we use to allow them to describe themselves.

