Issue #56: AI and the Robots.txt Debate — Jason Michael Perry

Howdy 👋🏾. In the early days of search, websites needed a way to signal which of their content, if any, should be indexed by search engines. This served two purposes. First, some organizations preferred that search engines like Google or Bing not index their content at all. Second, it allowed them to keep specific types of pages out of the index, such as membership-only content or other pages meant to be gated.

As a solution, a standard called the robots.txt file was introduced: a voluntary system that allows websites to signal their intentions to search engines. It has been the law of the land since 1994.
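For anyone who has never peeked at one, a robots.txt file is just a plain-text file served from a site’s root (e.g., example.com/robots.txt). A minimal example, with hypothetical paths, might look like this:

    # Rules for every crawler
    User-agent: *
    Disallow: /members/    # ask crawlers to skip gated pages

    # Rules for one specific crawler
    User-agent: Bingbot
    Disallow: /            # ask this bot to stay out entirely

Crucially, these are requests, not enforcement: a crawler that chooses to ignore the file can still fetch every page.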

This file is getting fresh attention as a new question arises: what content on the open web should AI companies be allowed to use to train their models, and what content is off-limits? Many AI companies have slurped up content from the web, claiming it sits in the public domain, but many have also released tools that use the robots.txt file to let websites decide what content they want to make available to these bots.

A few weeks ago, I shared the backlash Perplexity AI received for its AI-powered search engine, which some refer to as an answer engine, as it uses sourced content from the web to power responses but does not send traffic to the sources it pulls from. It was also discovered that Perplexity ran afoul of its own stated rules, pulling information into its answers from sites whose robots.txt files explicitly asked it not to (Perplexity’s grand theft AI).

The response from Perplexity’s CEO made sense. “Perplexity is not ignoring the Robot Exclusions Protocol and then lying about it,” said Perplexity co-founder and CEO Aravind Srinivas in a phone interview. “I think there is a basic misunderstanding of how this works,” Srinivas said. “We don’t just rely on our own web crawlers; we rely on third-party web crawlers as well.”

In other words, indexing is not limited to a single bot crawling the Internet. Because Perplexity relies on a mix of its own bots and third-party services for search, it is hard to prove it is purposely ignoring the rules it claims to follow.

Of course, Reddit, which has made it known that it plans to fully monetize its community content, has gone a step further than robots.txt by implementing checks that identify bots and crawlers attempting to access its content and returning them a 404 (page not found) error. That blocks the content even for crawlers that ignore the voluntary pact of the robots.txt file (Reddit blocking search engine crawlers and AI bots).
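Reddit hasn’t published the details of its setup, but the underlying technique is server-side user-agent filtering. A simplified nginx sketch, my own illustration rather than Reddit’s actual configuration, with illustrative bot names, might look like:

    # Inside a server block: return 404 to crawlers that identify
    # themselves as AI bots, whether or not they honor robots.txt
    if ($http_user_agent ~* "(GPTBot|CCBot|Bytespider)") {
        return 404;
    }

A determined scraper can spoof its user-agent string, of course, which is why heavier-handed blocking typically also leans on IP ranges and behavioral checks.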

What made me interested in this now is that OpenAI just released SearchGPT, its competitor to Perplexity AI. While I haven’t had the opportunity to try it myself, I have spent time reading about how it works and noticed a pretty interesting bit about how it handles crawling for its search service (OpenAI Platform).

For the less technical readers: OpenAI runs several bots, each serving a different purpose, which gives publishers more granular control over the ways OpenAI may interact with openly crawled data (a sample robots.txt follows the list):

  • GPTBot: This is the bot most people worry about. It actively consumes the content it crawls, which may be used to train OpenAI’s many generative models.
  • ChatGPT-User: One of the features of the multimodal GPT-4o is that a user can include a web URL in a prompt, and ChatGPT can fetch that URL and use its contents to inform its response. This bot acts only in direct response to a user’s request, and the URL contents are not stored for search or training.
  • OAI-SearchBot: This bot is for search. It indexes content to power the search experience and surface results, but the content it crawls is not used to train models.
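In practice, that granularity is expressed through ordinary robots.txt rules using the user-agent tokens above. A publisher that wants to appear in SearchGPT but stay out of training data could publish something like this (the policy shown is just one possible choice):

    # Let OpenAI’s search crawler index the site...
    User-agent: OAI-SearchBot
    Allow: /

    # ...but opt out of model training
    User-agent: GPTBot
    Disallow: /

    # ChatGPT-User only fetches pages a user explicitly asks about
    User-agent: ChatGPT-User
    Allow: /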

This all makes a ton of sense, and the more granular rule system lets publishers pick and choose to what extent they would like to share their data with OpenAI. Perplexity appears to have one bot and one bot only, as stated in its documentation, but I could see it quickly following suit.

With the new complexities of AI and the diminishing line between AI-powered chatbots, search, and answer engines pushing us toward Google Zero, or zero-click traffic, one has to wonder if the voluntary robots.txt system is enough, or if Reddit’s heavier-handed approach of directly blocking bots is the better way. Who would have guessed that a small text file (The rise and fall of robots.txt) would be so important?

Now, my sponsors and my thoughts on tech & things:


🤝 This week’s newsletter issue is proudly sponsored by:

If you are looking to find qualified candidates, contact Baird Consulting.


🚫 Why Google Is No Longer Limiting Third-Party Cookies in Chrome – Google’s shift away from limiting third-party cookies in Chrome marks a significant change in the browser’s privacy approach. Explore the reasons behind this decision and its potential impact on web tracking and privacy. Read more >

🔍 OpenAI Just Released Search – OpenAI’s new search capabilities promise to revolutionize how we retrieve information. Dive into the features of this latest release and how it could reshape our interaction with AI. Read more >

🤖 Zuckerberg’s Thoughts on Open Source AI – Meta’s release of Llama 3.1, the largest open-source AI model with 405 billion parameters, sets a new standard in AI development. Discover the implications of this move and its potential impact on the AI ecosystem. Read more >

🍏 The First Taste of Apple Intelligence Is Here – Apple’s latest betas for iOS and iPadOS 18.1 introduce new Apple Intelligence features. Get a sneak peek at these innovations and what they mean for the future of Apple’s devices. Read more >


SearchGPT raises the question: what is search today? I initially viewed Perplexity AI as a search engine, but it’s quite different. It operates through a chat interface that enhances its responses with search results using a RAG-style (retrieval-augmented generation) approach. Is it search, or is the term “answer engine” or “AI chat” a better way to describe it?
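For readers curious what “RAG-style” means in practice: retrieve relevant pages from a search index, pack them into the model’s prompt, and let the model generate an answer with citations. A minimal Python sketch, where web_search and llm_complete are hypothetical stand-ins for a retrieval backend and an LLM API, looks like this:

    def answer_engine(question: str) -> str:
        # 1. Retrieve: find pages relevant to the question.
        #    web_search is a hypothetical stand-in for a search index.
        results = web_search(question, top_k=5)

        # 2. Augment: number the retrieved snippets so the model
        #    can cite its sources inline.
        sources = "\n".join(
            f"[{i + 1}] {r['title']}: {r['snippet']}"
            for i, r in enumerate(results)
        )
        prompt = (
            "Answer using only the sources below, citing them like [1].\n\n"
            f"Sources:\n{sources}\n\nQuestion: {question}"
        )

        # 3. Generate: the model composes the answer, so the user may
        #    never click through to the underlying sites.
        return llm_complete(prompt)

That last step is exactly why publishers are uneasy: the answer arrives in the chat, and the traffic often never reaches the source.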

Because these engines come from the world of AI, where the debate is over what content can be used for training, and out of fear of giving AI too much access to our information, it also feels like OpenAI and Perplexity AI are getting away with limiting the sources they index for their search systems. Take this quote from OpenAI’s website:

“We are committed to a thriving ecosystem of publishers and creators. We hope to help users discover publisher sites and experiences, while bringing more choice to search. For decades, search has been a foundational way for publishers and creators to reach users. Now, we’re using AI to enhance this experience by highlighting high-quality content in a conversational interface with multiple opportunities for users to engage.”

With OpenAI and other AI companies working hard to license data from select publishers, this could, in some ways, lead to a more limited set of sources powering these systems. Is Wired right to limit Perplexity AI? Would it ever consider doing the same to Google Search?

I don’t know the answers to these questions, but seeing how we figure them out will be interesting.

-jason

p.s. I’m a board member of the Baltimore Symphony Orchestra and a fan of jazz music, but alas, I do not play an instrument myself. I dream of playing the piano, maybe with a group of friends in a smoky dark jazz club in New Orleans, and until today I thought knowing how to play music was a prerequisite to releasing an album or performing live… but folks, H. Jon Benjamin, the voice of Archer and Bob from Bob’s Burgers, has shown me where there’s a will, there is indeed a way. Watch, and enjoy.