Google pushes for its robots.txt parser to become internet standard
Google LLC is pushing for its decades-old Robots Exclusion Protocol to be certified as an official internet standard, so today it open-sourced its robots.txt parser as part of that effort.
The REP, as it’s known, is a protocol that website owners can use to exclude web crawlers and other clients from accessing a site. Google said it’s one of the “most basic and critical components of the web” and that it’s in everyone’s interest that it becomes an official standard.
REP was first proposed as a web standard by one of its creators, the Dutch software engineer Martijn Koster, back in 1994, and has since become the de facto standard that websites use to tell crawlers which parts of a site they shouldn't process.
When indexing websites for its search engine, Google's Googlebot crawler typically scans the robots.txt file to check for any instructions on which parts of the site it should ignore. If it doesn't find a robots.txt file in a site's root directory, it will assume that it's okay to index the entire site.
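In practice, that check looks something like the following minimal sketch, which uses Python's standard-library urllib.robotparser. The rules, URLs and user-agent names here are illustrative examples, not Google's actual configuration or parser.

```python
# A minimal sketch of how a crawler consults robots.txt before fetching a page,
# using Python's standard-library urllib.robotparser. The rules and URLs below
# are illustrative only.
from urllib.robotparser import RobotFileParser

# Example robots.txt content: block every crawler from /private/,
# and additionally keep Googlebot out of /drafts/.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks each URL against the rules before fetching it.
print(parser.can_fetch("Googlebot", "https://example.com/drafts/post.html"))   # False
print(parser.can_fetch("Googlebot", "https://example.com/public/page.html"))   # True
print(parser.can_fetch("OtherBot", "https://example.com/private/data.html"))   # False
```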
But Google worries that REP was never officially adopted as an internet standard, saying that the “ambiguous de-facto protocol” has been interpreted “somewhat differently over the years” by developers, and that this makes it “difficult to write the rules correctly.”
“On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files,” Google wrote on its Webmaster Central blog. “On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes large?”
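Those corner cases translate into defensive handling on the crawler side. The sketch below shows one illustrative way a parser might strip a stray byte-order mark and cap how much of an oversized robots.txt file it reads; the 500-kibibyte limit mirrors the one Google has documented for Googlebot, while the helper function itself is hypothetical.

```python
# Illustrative defensive handling for the corner cases Google describes:
# strip a leading UTF-8 byte-order mark and cap how much of an oversized
# robots.txt file is processed. The 500 KiB limit mirrors the one Google
# documents for Googlebot; other crawlers may choose different limits.
import codecs

MAX_ROBOTS_BYTES = 500 * 1024  # process at most 500 KiB, ignore the rest

def load_robots_txt(raw: bytes) -> list[str]:
    # Some text editors prepend a BOM, which would otherwise corrupt the
    # first "User-agent:" line; drop it before decoding.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    truncated = raw[:MAX_ROBOTS_BYTES]
    return truncated.decode("utf-8", errors="replace").splitlines()
```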
To solve those problems, Google said, it has documented exactly how REP should be used with the modern web and submitted its proposal for it to become an official standard to the Internet Engineering Task Force, which is a nonprofit open-standards organization.
“The proposed REP draft reflects over 20 years of real world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP,” Google said. “These fine-grained controls give the publisher the power to decide what they’d like to be crawled on their site and potentially shown to interested users. It doesn’t change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends it for the modern web.”
Analyst Holger Mueller of Constellation Research Inc. told SiliconANGLE that standards are vital for the internet to work properly, so it's good to see Google take the lead even on something as basic as REP.
“As with any open-source initiative and standardization attempt, we’ll have to wait and see what kind of uptake there is before we know if this is a success or not,” Mueller said. “It can also be something very self-serving, as Google is one of the biggest web crawlers itself. It’s an area to keep a watchful eye on.”
Image: Google/Twitter