

Blekko, the search engine known for blocking out spam, has come quite a long way since its rough launch last year. People weren’t expecting much from it, especially after the company downplayed its product as “not a Google killer” and followed that with a less-than-outstanding release.
Blekko didn’t get rattled by the criticism, and it even found its first partner in DuckDuckGo, a general-purpose search engine that prides itself on not tracking its users. Now Blekko seems ready to start a new chapter, having launched a new search feature, Web Grepper. The purpose of the tool is to give access to previously hidden data. It follows from a principle written into the company’s Web Search Bill of Rights: that search should be open, transparent and readily available.
With Web Grepper, questions such as “Can you give me a list of every site that uses Facebook Connect? In rank order?”, “Can you send me a list of sites that have the Google +1 button on them?” and “Can you give me a list of every site that is running [insert name] ad network?” can now be answered easily, and the data acquired quickly.
According to their blog post, “The way Web Grepper works is simple. Everyday, we will run 2 map jobs against our crawl of 4 billion pages. These will be greps for strings, patterns, regex expressions that blekko users submit to us and decide are cool. Got a grep you want to run? Submit it here. If enough people agree with you that this grep is interesting (by voting it up), we’ll run it. And we’ll post the results here. We make the top 500 results for every grep available for free to anyone who wants it. Pretty cool, eh?”
“We look at the Web as one big, massive data set. Keyword matches (with a relevance filter) are certainly one way to pull data from that set. But it’s not the only way. There’s a ton of interesting data that lives within the source HTML that keywords just can’t get to. That’s where Web Grepper comes in – and now you have access to it.”
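To make the idea concrete, here is a minimal sketch in Python of what such a grep job could look like. It is not Blekko’s implementation: the `Page` record, its `rank` field, the function names and the sample pages are all hypothetical, and a real job would be distributed across the crawl rather than run in a single loop. It only illustrates the steps the blog post describes: match a pattern against each page’s source HTML, order the matches by rank, and keep the top 500.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Page:
    """Hypothetical crawl record; the field names are assumptions."""
    url: str
    rank: float  # assumed convention: higher means more prominent
    html: str    # raw source HTML of the page


def map_grep(pages: Iterable[Page], pattern: str) -> Iterator[Page]:
    """Map step: emit every page whose source HTML matches the pattern."""
    regex = re.compile(pattern)
    for page in pages:
        if regex.search(page.html):
            yield page


def top_results(pages: Iterable[Page], pattern: str, limit: int = 500) -> List[str]:
    """Collect the matches, sort by rank (best first), and keep `limit` URLs."""
    matches = sorted(map_grep(pages, pattern), key=lambda p: p.rank, reverse=True)
    return [p.url for p in matches[:limit]]


# Illustrative stand-in for a crawl; the markup strings are examples only.
SAMPLE_CRAWL = [
    Page("a.example", 0.9,
         '<script src="https://connect.facebook.net/en_US/all.js"></script>'),
    Page("b.example", 0.5, "<p>no widgets here</p>"),
    Page("c.example", 0.7,
         '<div id="fb-root"></div>'
         '<script src="//connect.facebook.net/en_US/all.js"></script>'),
]
```

Calling `top_results(SAMPLE_CRAWL, r"connect\.facebook\.net")` on this sample returns the two matching sites with the higher-ranked one first, which is the “in rank order” list from the example questions above.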
The tool was created by Blekko’s engineers when they saw growing demand for information from brands and networks. Blekko decided to open the tool to the public after developing a democratic process for determining which jobs the search engine would run, since it does not want hackers to acquire private and personal information.
According to Blekko Vice President of Marketing Mike Markson, anyone who wants data must first submit a job to the community, which then votes on whether the request will be granted or declined.
If you’re wondering why Google hasn’t adopted this feature if it is such an ingenious addition to searching the Web, a Webmaster video from Google’s Matt Cutts explains why. According to Cutts, this kind of search over the data found on Web pages involves “regular expressions,” and we should not expect the feature to be available in Google in the near future, as no one is requesting it.
And if you’re wondering where the name Web Grepper came from, according to Wikipedia, it’s based on grep, a command-line text-matching search tool originally developed for UNIX. The name comes from the ed command g/re/p (global / regular expression / print). The grep command searches files or standard input globally for lines matching a given regular expression, and prints the matching lines to the program’s standard output.
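The behavior described above can be sketched in a few lines of Python. This is an illustration of the idea, not the actual grep utility: the `grep` function name and the sample markup are chosen here for the example, and Python’s `re` syntax differs in places from the regular expressions classic grep accepts.

```python
import re
from typing import Iterable, Iterator


def grep(pattern: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield each line that contains a match for the regular expression."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):  # a match anywhere in the line counts
            yield line


# Illustrative input: a few lines of page source.
SAMPLE_LINES = [
    "<html>",
    '<g:plusone size="tall"></g:plusone>',
    "</html>",
]
```

Running `list(grep(r"g:plusone", SAMPLE_LINES))` keeps only the middle line, the one containing the button markup, much as the command-line tool would print it.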
It’s another example of a startup leveraging big data to differentiate itself in a market. Search in particular is going to be an important area for unstructured data analysis, and more natural interactions between humans and machines will come as a result. Google may not have this search format in its toolkit yet, but it is working towards a more unstructured search method. Wolfram Alpha is another example.