This week, a startup called Limited Run made headlines when they went public with news that the majority of their Facebook ad clicks were coming from bots. The startup was rightfully upset that they were only receiving 20% human clicks from their ad buy, and ultimately decided to delete their Facebook account.
Ad clicks aren’t the only pressing issue for websites when it comes to online marketing and bots. Since tools like Google Analytics don’t provide granular insight into site traffic, it can be very tricky to differentiate between human and non-human traffic unless you really drill down, or you notice after the fact that something is amiss. Even more difficult is distinguishing between malicious bots that can harm your website and good bots, like Googlebot, that improve SEO.
If you run an e-commerce site, a company website or a personal blog, chances are you want Google to visit your site and index your content as often as possible. By doing so, Google learns what is new on your site and can immediately share updated content with the online community. However, differentiating between Google and hackers who impersonate Google can be a major challenge for website operators and can have a damaging impact on your online business.
Google uses a robot called “Googlebot” that crawls millions of sites simultaneously and indexes their content in Google’s databases. The more Googlebot visits your site, the faster your site’s content updates will appear in Google’s search results. It’s crucial to allow Googlebot to crawl your website without blocking or disturbing it, and many companies invest in special SEO tools to attract it.
A wolf in sheep’s clothing
While you want to give Googlebot the real VIP treatment, it’s complicated to differentiate between the “good” bots that enhance your SEO (among other things) and the Googlebot impersonators that hackers use to litter your blog with comment spam and copy your website’s content for publication elsewhere.
The MaMa Casper worm, for example, scans for vulnerable PHP code in Joomla and e107, two very common Content Management Systems. Once a susceptible site is found, MaMa Casper infects it with malicious code. Alternatively, some Googlebot impersonators originate from IPs registered to SEO companies. These visits are a byproduct of online SEO tools that check for competitor information and view a site as it is presented to the Google search crawler.
Recent studies have shown that over half of a medium-sized website’s traffic is non-human, much of which can potentially harm your business. And because these non-human visitors actively work to evade detection, blocking them before they damage your site becomes increasingly challenging.
How do those hackers do it?
Googlebot has a unique way of identifying itself: it uses a specific user agent, arrives from IP addresses belonging to Google, and always adheres to robots.txt (the crawling instructions that website owners provide to such bots).
Here are the most common methods used by Googlebot impersonators and how you can protect your website:
#1: Not validating Google IPs
Validating whether a bot declaring itself as Google is the real thing isn’t always a straightforward process. While bots with fake or odd-looking user agents are easy to spot, more sophisticated bots can be very deceptive. Google has a number of user agents and a very wide range of (non-public) IPs from which it can crawl a website. Most websites don’t validate IPs on the fly to check that they belong to Google’s network. And even traffic that does come from a Google-owned IP isn’t necessarily Googlebot: it may be Google App Engine traffic or a Google employee browsing. Lack of validation is the number one opening that lets the bad guys in.
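Google’s documented way to verify its crawler is a reverse DNS lookup on the visiting IP, followed by a forward lookup to confirm the hostname maps back to the same address. A minimal sketch in Python (the resolver arguments are injectable here purely so the logic can be exercised without live DNS; in production the `socket` defaults do the real lookups):

```python
import socket

# Hostnames of genuine Google crawlers end in one of these domains
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname):
    """Reverse-resolve the IP, check the hostname's domain, then
    forward-resolve that hostname and confirm it maps back to the IP."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no PTR record at all: not Google
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False  # PTR points somewhere outside Google's crawl domains
    try:
        return forward(hostname) == ip  # forward lookup must round-trip
    except OSError:
        return False
```

The forward-confirmation step matters: an attacker who controls reverse DNS for their own IP range can make it resolve to a `googlebot.com` name, but they cannot make Google’s forward DNS point that name back at their IP.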
Currently Google’s search bot has two official user agents: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) and the less common Googlebot/2.1 (+http://www.google.com/bot.html).
Reference: Webmaster Tools – Google Crawler
#2: Forging User-Agent strings
Because Google does not publish a list of IP addresses to whitelist (they change too often), the most practical way to identify Google’s crawlers is the User-Agent string. Unfortunately for website owners, and fortunately for the bad guys, User-Agent strings are very easy to forge.
There are several ways in which intruders impersonate Googlebot. The simple and non-sophisticated impersonators copy-paste its user agent into the requests their bot generates, often with glaring mistakes, such as “googlebot (googlebot),” ”Mozilla/5.0+(compatible;+googlebot/2.1;++http://www.google.com/bot.html)” or just plain “Googlebot.” Many also use cURL in their bot’s code and simply replace the default cURL user agent with Google’s. Other, more sophisticated bots generate requests that seem identical to the original and can fool the naked eye. I’ve even observed bots that mimic Google’s crawling behavior, fetching robots.txt first and taking a crawler-like approach to browsing the website.
Letting a Googlebot impersonator into your website poses a serious threat to your business. Not only can they trash your blog with comment spam, but they can also scrape (steal) your website’s content and publish it elsewhere. Most damaging of all, they can suck up your website’s computing and bandwidth resources to the point that the server crashes or your hosting provider shuts you down for hogging too many resources.
Conversely, by accidentally blocking Googlebot you can also suffer devastating results. By not allowing Google to index your site, new content will not become “searchable” by the billions of people that use Google to search for content online. Prolonged blocking will also result in loss of valuable search engine rankings that are sometimes worth millions of dollars in brand equity and online awareness that took years to achieve. It’s crucial to closely monitor your site’s non-human website traffic, or invest in tools like Incapsula or CloudFlare that monitor all visitors for you.
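One way to monitor for this, putting the two checks together, is a periodic pass over your access log that flags every hit claiming a Googlebot identity from an IP that has not passed DNS verification. A hypothetical sketch; the `(ip, user_agent)` tuple format and the precomputed `verified_ips` set are assumptions, not any particular tool’s API:

```python
import re

# Any UA that mentions "googlebot" is claiming to be Google's crawler
CLAIMS_GOOGLEBOT = re.compile(r"googlebot", re.IGNORECASE)

def suspicious_googlebot_hits(entries, verified_ips):
    """entries: iterable of (client_ip, user_agent) pairs parsed from the
    access log; verified_ips: IPs that already passed reverse-DNS checks.
    Yields the hits that claim Googlebot but come from unverified IPs."""
    for ip, ua in entries:
        if CLAIMS_GOOGLEBOT.search(ua) and ip not in verified_ips:
            yield ip, ua
```

Flagged IPs are candidates for rate limiting or blocking; verified Google IPs are never touched, which avoids the accidental-blocking scenario described above.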
About the Author
Marc Gaffan, Co-Founder, VP Marketing & Business Development
Mr. Marc Gaffan has over 15 years of R&D, Product Management & Marketing experience in high-tech companies. Prior to founding Incapsula, Marc was Director of Product Marketing at RSA, EMC’s security division, responsible for strategy and go to market activities of a $500M IT Security product portfolio. Before that, Marc was the Director of Marketing for the Consumer Solutions Business Unit at RSA. While at RSA, Marc presented at the US Congress, FDIC and Federal Trade Commission on cyber security and identity theft topics.