How Do You Grep the Web? Blekko Tells All

Blekko, the search engine known for filtering out spam, has come a long way since its rough launch last year. People weren’t expecting much from the company, especially after it downplayed its product as “not a Google killer” and followed that with a less-than-outstanding release.

Blekko wasn’t rattled by the criticism, and it even found its first partner in DuckDuckGo, a general-purpose search engine that touts the fact that it doesn’t track its users. Now Blekko seems ready to start a new chapter with the launch of a new search feature, Web Grepper. The purpose of the tool is to give access to previously hidden data. That goal is written into Blekko’s Web Search Bill of Rights, which holds that search should be open, transparent and readily available.

With Web Grepper, questions such as “Can you give me a list of every site that uses Facebook Connect? In rank order?”, “Can you send me a list of sites that have the Google +1 button on them?” and “Can you give me a list of every site that is running [insert name] ad network?” can now be answered easily, and the data acquired quickly.

According to their blog post, “The way Web Grepper works is simple. Everyday, we will run 2 map jobs against our crawl of 4 billion pages. These will be greps for strings, patterns, regex expressions that blekko users submit to us and decide are cool. Got a grep you want to run? Submit it here. If enough people agree with you that this grep is interesting (by voting it up), we’ll run it. And we’ll post the results here. We make the top 500 results for every grep available for free to anyone who wants it. Pretty cool, eh?”

“We look at the Web as one big, massive data set. Keyword matches (with a relevance filter) are certainly one way to pull data from that set. But it’s not the only way. There’s a ton of interesting data that lives within the source HTML that keywords just can’t get to. That’s where Web Grepper comes in – and now you have access to it.”
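
The idea in the quoted description, running a pattern over the raw HTML of every crawled page and returning the best-ranked hits, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Blekko's implementation: the Page record, its rank field, the sample pages and the function names are all hypothetical, and only the limit of 500 free results comes from the blog post.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Pattern


class Page(NamedTuple):
    """Hypothetical crawl record; a real crawl would store far more."""
    url: str
    rank: float  # assumed ranking score, higher is better
    html: str    # raw page source, which keyword search does not expose


def grep_map(pages: Iterable[Page], pattern: Pattern[str]) -> Iterator[Page]:
    """Map step: keep only the pages whose source HTML matches the pattern."""
    for page in pages:
        if pattern.search(page.html):
            yield page


def top_matches(pages: Iterable[Page], regex: str, limit: int = 500) -> List[str]:
    """Grep every page, then return the URLs of the best-ranked matches."""
    pattern = re.compile(regex, re.IGNORECASE)
    ranked = sorted(grep_map(pages, pattern), key=lambda p: p.rank, reverse=True)
    return [page.url for page in ranked[:limit]]


# Made-up sample data standing in for a crawl.
SAMPLE_PAGES = [
    Page("http://a.example", 0.9,
         '<script src="http://connect.facebook.net/en_US/all.js"></script>'),
    Page("http://b.example", 0.5, "<p>No social widgets here</p>"),
    Page("http://c.example", 0.7,
         '<div id="fb-root"></div>'
         '<script src="//connect.facebook.net/en_US/all.js"></script>'),
]

# "Which sites load the Facebook Connect script, in rank order?"
result = top_matches(SAMPLE_PAGES, r"connect\.facebook\.net")
print(result)
```

On the sample data, the query returns the two pages that load the script, best-ranked first. At the scale of 4 billion pages the map step would be spread across many machines, but the per-page test stays this simple.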

Blekko’s engineers created the tool after seeing growing demand for this kind of information from brands and networks. Blekko decided to open it to the public only after developing a democratic process for deciding which jobs the search engine would run, because the company does not want hackers to use it to acquire private and personal information.

According to Mike Markson, Blekko’s vice president of marketing, anyone who wants data must first submit a job to the community, which then votes on whether the request is granted or declined.

If you’re wondering why Google hasn’t adopted this feature if it is such an ingenious addition to Web search, a Webmaster video from Google’s Matt Cutts explains why. According to Cutts, searching the data found on Web pages this way involves “regular expressions,” and we should not expect the feature to be available in Google in the near future because no one is requesting it.

And if you’re wondering where the name Web Grepper came from: according to Wikipedia, it’s based on grep, a command-line text-matching search tool originally developed for UNIX. The name comes from the ed command g/re/p (global / regular expression / print). The grep command searches files or standard input for lines matching a given regular expression and prints those lines to the program’s standard output.
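
For readers who have never used it, the core behavior of grep described above fits in a few lines of Python. This is a minimal sketch, not the real utility: the function name and sample lines are made up for illustration, and Python's re syntax differs in places from the POSIX regular expressions that grep itself uses.

```python
import re
from typing import Iterable, Iterator


def grep(pattern: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield every line that contains a match for the regular expression."""
    regex = re.compile(pattern)
    for line in lines:
        # search() looks for a match anywhere in the line, as grep does
        if regex.search(line):
            yield line.rstrip("\n")


# Made-up input standing in for a file or standard input.
sample = ["alpha beta\n", "gamma\n", "beta gamma\n"]
matched = list(grep(r"bet+a", sample))

for line in matched:
    print(line)
```

Here the pattern matches the first and third lines, so those two are printed and "gamma" is skipped. Passing an open file or sys.stdin as the lines argument would give the files-or-standard-input behavior of the real command.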

It’s another example of a startup leveraging big data to differentiate itself in a market. Search in particular is going to be an important area for unstructured data analysis, and more natural interactions between human and machine will come as a result. Google may not have this search format in its toolkit yet, but it is working toward a more unstructured search method. Wolfram Alpha is another company moving in that direction.