Protecting the Open Web: How New IETF Standards Threaten Web Scraping and AI Access

The Internet Engineering Task Force (IETF) is currently reviewing technical proposals that could allow website operators to block automated web crawling and scraping. These standards, developed by working groups such as AI Preferences and Web Bot Auth, may restrict access to public information for researchers, archivists, and non-profit organizations.

The debate centers on how the internet’s foundational protocols will handle the rise of artificial intelligence. While publishers and website owners seek to protect their infrastructure and advertising revenue from AI-driven bots, digital rights advocates warn that these technical changes could fundamentally alter the “open web.” The conflict highlights a growing tension between the economic needs of content creators and the long-standing principle of unrestricted access to publicly available information.

How new IETF standards could affect web scraping

Web scraping and crawling—the automated process of collecting data from websites—serve as the backbone for many essential digital services. These tools allow journalists to investigate public records, enable researchers to analyze large datasets, and help non-profits like the Internet Archive preserve historical snapshots of the web. Consumers also rely on automated tools for comparison shopping and finding the best available prices online.

According to the Electronic Frontier Foundation (EFF), the current technical standards maintained by the IETF are designed to be neutral protocols that encourage openness. However, new proposals aim to shift these standards toward more restrictive requirements. The EFF argues that if these changes are adopted, website operators could gain significant power to decide which automated tools are allowed to interact with their content, potentially creating a “pay-to-play” environment for data access.

The primary concern is that these standards could move the internet away from a model of open access toward one of controlled, monetized access. If a website can use technical protocols to distinguish between a “good” bot and a “bad” bot based on its identity or intent, it could effectively shut out any automated tool that has not been pre-approved or licensed.

The role of the AI Preferences working group

One of the specific avenues for this change is the IETF’s AI Preferences working group. This group is currently developing proposals that would allow publishers to express “preference signals” to automated crawlers. These signals would be integrated into the robots.txt file, a standard text file used by websites to communicate with web robots about which parts of a site should not be crawled.

Under the proposed framework, these preference signals could indicate that a website does not want its data used for specific AI-related purposes. This could include:

Training large language models (LLMs).
Generating AI-driven search outputs.
Using web content to power AI-generated summaries.
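
As a rough illustration of how a crawler might read such signals, the sketch below combines ordinary robots.txt rules, parsed with Python's standard urllib.robotparser, with a purpose-specific preference line. The field name Content-Usage, the labels train-ai and search, the y/n values, and the helper functions are all assumptions made for this example; the article does not give the working group's syntax, and the actual drafts may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The "Content-Usage" line and its labels are
# illustrative assumptions, not syntax confirmed by the article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Content-Usage: train-ai=n, search=y
"""


def parse_usage_preferences(robots_txt: str) -> dict[str, bool]:
    """Collect label=y/n pairs from every (hypothetical) Content-Usage line."""
    prefs: dict[str, bool] = {}
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "content-usage":
            continue
        for item in value.split(","):
            label, eq, flag = item.strip().partition("=")
            flag = flag.strip().lower()
            if eq and flag in ("y", "n"):
                prefs[label.strip().lower()] = flag == "y"
    return prefs


def may_use(robots_txt: str, agent: str, url: str, purpose: str) -> bool:
    """Return True if the URL may be crawled and the purpose is not refused."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # unknown fields are ignored here
    if not parser.can_fetch(agent, url):
        return False
    # Assumption: a purpose with no stated preference is treated as allowed.
    return parse_usage_preferences(robots_txt).get(purpose, True)


if __name__ == "__main__":
    url = "https://example.org/articles/1"
    print(may_use(ROBOTS_TXT, "ExampleBot", url, "train-ai"))
    print(may_use(ROBOTS_TXT, "ExampleBot", url, "search"))
```

Note that nothing in this sketch enforces the preference. Like robots.txt itself, the signal only works if the crawler chooses to honor it.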

While the use of robots.txt is a standard industry practice, the EFF notes that making these signals more formal and potentially legally binding could change the nature of web interaction. If a preference signal becomes a standard technical requirement, it could provide a mechanism for publishers to prevent AI companies from accessing their data without explicit licensing agreements, even if that data is otherwise publicly accessible.

Risks of cryptographic bot authentication

A second, more controversial effort is being led by the Web Bot Auth working group. This group’s stated objective is to protect website resources by identifying and managing “overly-aggressive” bots that strain server infrastructure. High volumes of automated requests can degrade site performance or even take websites offline, causing legitimate users to face slow loading times or service interruptions.

However, the Web Bot Auth group is also pursuing standards that would enable websites to cryptographically identify bots. Cryptographic authentication would allow a website to verify the identity of a bot before allowing it to access any data. While this could block malicious actors or “bad” bots, it would also hand site owners a powerful gatekeeping tool.

If websites can identify bots with certainty, they can implement a “preapproved list” of authenticated crawlers. This would allow them to grant access to specific, trusted partners while blocking others. Critics argue this could lead to several unintended consequences:

Economic barriers: Companies could require licensing payments from any bot not on their approved list.
Suppression of competition: Established tech giants could be granted access while smaller startups are blocked.
Censorship risks: Websites could use authentication to block bots used by dissidents, researchers, or watchdog organizations.
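
The “preapproved list” scenario can be sketched in a few lines. The article does not describe the signature scheme under discussion, so this example swaps in HMAC-SHA256 with a shared secret from Python's standard library as a stand-in for whatever public-key signatures a real scheme would likely use. The bot names, keys, message layout, and function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical registry: bot identifier -> shared secret. A real scheme would
# more likely publish public keys and verify asymmetric signatures.
KNOWN_BOT_KEYS = {
    "archive-bot": b"archive-secret",
    "startup-bot": b"startup-secret",
}

# Hypothetical operator policy: only these verified identities are served.
PREAPPROVED = {"archive-bot"}


def sign_request(bot_id: str, key: bytes, method: str, path: str) -> str:
    """Bot side: sign the parts of the request that identify it."""
    message = f"{bot_id}\n{method}\n{path}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_identity(bot_id: str, method: str, path: str, signature: str) -> bool:
    """Site side: recompute the signature and compare in constant time."""
    key = KNOWN_BOT_KEYS.get(bot_id)
    if key is None:
        return False
    expected = sign_request(bot_id, key, method, path)
    return hmac.compare_digest(expected, signature)


def decide(bot_id: str, method: str, path: str, signature: str) -> str:
    """Identity-based access decision: behavior is never consulted."""
    if not verify_identity(bot_id, method, path, signature):
        return "reject: unverified"
    if bot_id not in PREAPPROVED:
        return "reject: not preapproved"
    return "allow"


if __name__ == "__main__":
    sig = sign_request("startup-bot", b"startup-secret", "GET", "/data")
    print(decide("startup-bot", "GET", "/data", sig))
```

In this sketch the hypothetical startup-bot proves its identity correctly and is still turned away, which is the outcome critics are worried about: verification succeeds, but the policy decision depends only on list membership.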

The shift from identifying “bad” behavior to identifying “unauthorized” identity represents a fundamental change in how internet protocols operate. Instead of focusing on how a bot behaves (e.g., how many requests it sends per second), the standards would focus on who the bot is.
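
For contrast, a behavior-based control needs no identity at all. The following is a minimal sketch of a sliding-window rate limiter, one common way to cap requests per second; the class name, the thresholds, and the use of an IP address as the client key are assumptions for illustration, not anything proposed by the working groups.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per window_seconds for each client key."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client key -> timestamps of that client's recently accepted requests
        self._hits = defaultdict(deque)

    def allow(self, client_key: str, now: float) -> bool:
        hits = self._hits[client_key]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # too many recent requests, whoever the client is
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=1.0)
    client = "203.0.113.7"  # documentation-range IP used as a hypothetical key
    for t in (0.0, 0.1, 0.2, 1.05):
        print(t, limiter.allow(client, now=t))
```

Any client that stays under the threshold is served, and any client that exceeds it is throttled, regardless of who operates it. Timestamps are passed in explicitly here so the behavior is easy to test; a production limiter would typically read a monotonic clock instead.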

Balancing economic interests with open access

The push for these standards is driven by understandable economic anxieties within the publishing and media industries. The rise of AI-generated overviews in search engines has changed how users interact with the web. If a user receives a complete answer from an AI summary, they are less likely to click through to the original source website, which directly impacts advertising revenue and subscription models.

[Embedded video: IETF 124, Web Bot Auth (WEBBOTAUTH) session, 2025-11-04 19:30]

Furthermore, the computational cost of serving AI bots can be substantial. Many websites lack the financial or technical resources to upgrade their infrastructure to handle the high-frequency requests generated by crawlers that collect training data for modern AI models. For these organizations, restrictive bot controls look like a necessary measure for survival.

The following table compares the two competing perspectives regarding the proposed IETF changes:

Feature/Concern	Publisher/Big Tech Perspective	Open Web/Advocacy Perspective
Primary Goal	Protect revenue and server stability.	Maintain universal access to information.
AI Impact	AI models “steal” content and traffic.	AI tools provide new ways to analyze data.
Bot Control	Necessary to stop aggressive/unpaid bots.	Risk of creating a “paywalled” internet.
Technical Approach	Formalize preference and identity signals.	Keep protocols neutral and behavior-based.

The outcome of these IETF discussions will likely determine the accessibility of the internet for the next decade. If the standards prioritize the ability of website owners to monetize access, the web may become increasingly fragmented, with information available only to those who can afford the technical or financial entry fee.

Why this matters for the future of information

The implications of these technical standards extend far beyond the business models of media companies. They affect the fundamental ability of various stakeholders to function in a digital society. For example, accessibility tools used by people with disabilities often rely on automated processes to interpret and present web content. If these tools are classified as “unauthorized bots,” they may lose access to the very information they are designed to make accessible.

Similarly, the scientific community relies on the ability to scrape large volumes of data for research, trend analysis, and peer review. A web where data access is gated by cryptographic identity could slow the pace of scientific discovery and increase the cost of academic research. The same applies to government accountability; watchdog organizations often use automated tools to monitor public filings, election data, and legislative changes.

The fight over IETF standards is, in many ways, a fight over the architecture of the internet itself. The question is whether the internet should remain a collection of neutral protocols that facilitate the free flow of information, or whether it should become a platform where access is a privilege granted by content owners.

The next major checkpoint in this debate will come at upcoming IETF working group meetings, where the technical drafts for the AI Preferences and Web Bot Auth proposals will undergo further review and community feedback before moving through the formal standards process.

