Google pushes for its robots.txt parser to become internet standard
Google LLC is pushing for its decades-old Robots Exclusion Protocol to be certified as an official internet standard, so today it open-sourced its robots.txt parser as part of that effort.
The REP, as it’s known, is a protocol that website owners can use to exclude web crawlers and other clients from accessing a site. Google said it’s one of the “most basic and critical components of the web” and that it’s in everyone’s interest that it becomes an official standard.
REP was first proposed as a web standard by one of its creators, the Dutch software engineer Martijn Koster, back in 1994, and has since become the de facto standard that websites use to tell crawlers which parts of a site they shouldn't process.
When indexing websites for its search engine, Google's Googlebot crawler typically scans the robots.txt file to check for any instructions on which parts of the site it should ignore. If it doesn't find a robots.txt file in a site's root directory, it will assume that it's okay to index the entire site.
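In practice, that check looks something like the following minimal sketch, which uses Python's standard-library urllib.robotparser. The rules, URLs and user-agent names here are illustrative examples, not Google's actual configuration or parser.

```python
# A minimal sketch of how a crawler consults robots.txt before fetching a page,
# using Python's standard-library urllib.robotparser. The rules and URLs below
# are illustrative only.
from urllib.robotparser import RobotFileParser

# Example robots.txt content: block every crawler from /private/,
# and additionally keep Googlebot out of /drafts/.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks each URL against the rules before fetching it.
print(parser.can_fetch("Googlebot", "https://example.com/drafts/post.html"))   # False
print(parser.can_fetch("Googlebot", "https://example.com/public/page.html"))   # True
print(parser.can_fetch("OtherBot", "https://example.com/private/data.html"))   # False
```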
But Google worries that REP was never officially adopted as an internet standard, saying that the “ambiguous de-facto protocol” has been interpreted “somewhat differently over the years” by developers, and that this makes it “difficult to write the rules correctly.”
“On one hand, for webmasters, it meant uncertainty in corner cases, like when their text editor included BOM characters in their robots.txt files,” Google wrote on its Webmaster Central blog. “On the other hand, for crawler and tool developers, it also brought uncertainty; for example, how should they deal with robots.txt files that are hundreds of megabytes large?”
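Those corner cases translate into defensive handling on the crawler side. The sketch below shows one illustrative way a parser might strip a stray byte-order mark and cap how much of an oversized robots.txt file it reads; the 500-kibibyte limit mirrors the one Google has documented for Googlebot, while the helper function itself is hypothetical.

```python
# Illustrative defensive handling for the corner cases Google describes:
# strip a leading UTF-8 byte-order mark and cap how much of an oversized
# robots.txt file is processed. The 500 KiB limit mirrors the one Google
# documents for Googlebot; other crawlers may choose different limits.
import codecs

MAX_ROBOTS_BYTES = 500 * 1024  # process at most 500 KiB, ignore the rest

def load_robots_txt(raw: bytes) -> list[str]:
    # Some text editors prepend a BOM, which would otherwise corrupt the
    # first "User-agent:" line; drop it before decoding.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    truncated = raw[:MAX_ROBOTS_BYTES]
    return truncated.decode("utf-8", errors="replace").splitlines()
```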
To solve those problems, Google said, it has documented exactly how REP should be used with the modern web and submitted its proposal for it to become an official standard to the Internet Engineering Task Force, which is a nonprofit open-standards organization.
“The proposed REP draft reflects over 20 years of real world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP,” Google said. “These fine-grained controls give the publisher the power to decide what they’d like to be crawled on their site and potentially shown to interested users. It doesn’t change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends it for the modern web.”
Analyst Holger Mueller of Constellation Research Inc. told SiliconANGLE that standards are vital for the internet to work properly, so it's good to see Google take the lead even on something as basic as REP.
“As with any open-source initiative and standardization attempt, we’ll have to wait and see what kind of uptake there is before we know if this is a success or not,” Mueller said. “It can also be something very self-serving, as Google is one of the biggest web crawlers itself. It’s an area to keep a watchful eye on.”
Image: Google/Twitter