If IBM Watson proved anything, it’s that answering questions posed in natural language is well within the grasp of modern technology. That idea is core to Wikidata, a proposed addition to the Wikimedia project, which seeks to collect structured data – like a country’s population or an actor’s birthplace – in a form that machines can process automatically into useful overviews and charts.
It’s less complicated than it might first sound. Users will submit data to Wikidata, and that data will in turn be edited and maintained by the Wikimedia community. Data entered in any language will be available in every other language, and anything entered will be made available across the entire Wikipedia/Wikimedia Commons ecosystem. Goal number one of the Wikidata project is to provide better information boxes on the side of Wikipedia articles.
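To make that concrete, here’s a minimal sketch of the idea: a fact is stored once, language-independently, while only the labels carry language-specific text, so the same record can feed an infobox in any language. The record structure, identifiers, and function names here are purely illustrative assumptions, not Wikidata’s actual data model.

```python
# Hypothetical, simplified Wikidata-style record (illustrative only).
# Facts ("claims") are stored once; only labels vary by language.
record = {
    "id": "Q-example",  # hypothetical item identifier
    "labels": {"en": "Berlin", "de": "Berlin", "fr": "Berlin"},
    "claims": {
        "population": 3_500_000,  # stored once, language-independent
        "country": "Germany",
    },
}

def infobox_line(rec, prop, lang="en"):
    """Render one infobox row: the item's label in the reader's
    language plus the language-independent stored value."""
    return f'{rec["labels"][lang]} | {prop}: {rec["claims"][prop]}'

# The same stored fact serves readers of any language edition:
print(infobox_line(record, "population", lang="en"))
print(infobox_line(record, "population", lang="de"))
```

The point of the design is that an update to the population figure happens in one place and propagates to every language edition’s infobox, instead of being hand-edited hundreds of times.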
The timeline for Wikidata’s future is a little fuzzy. In April, a dedicated team at Wikimedia’s German chapter will begin work, with development expected to finish within a year. The idea is to bring all the language links into the fledgling Wikidata repository, and then have the different international versions of Wikipedia draw on that data in turn. Later in 2012, Wikidata will start to build a community, including the beginnings of user-submitted data set integration.
I’m calling it now – within two years, some major open-source analytics solution built on top of technology under development for Wikidata is going to hit the market. As we saw recently with Microsoft Research, there’s an emerging relationship between machine learning, big data and projects hosted on the web. If Wikidata succeeds at making structured data easier to parse and process automatically, it could be the first step towards smarter, more efficient language recognition. That means better, smarter eDiscovery, and who knows what else.