Amnesty International: Generative AI Built on Unlawful Web Scraping and Privacy Invasions

The seamless user experience of modern generative artificial intelligence—the instant poem, the photorealistic image, the rapid code snippet—often masks a much more aggressive and extractive reality. Behind the veneer of digital sophistication lies a massive, global data pipeline that many human rights advocates argue is fundamentally incompatible with the right to privacy.

A new briefing from Amnesty International, titled “Unlawful by Design: Exposing the Human Rights Costs of Generative AI,” warns that the very foundation of these systems is being built upon mass invasions of privacy. The report contends that the industry’s reliance on large-scale, non-consensual data scraping is not merely a technical necessity, but a design choice that places the technology in direct conflict with international human rights standards.

As the race to dominate the AI landscape intensifies, the human rights costs of generative AI are becoming increasingly difficult to ignore. From the erosion of individual privacy to the exacerbation of systemic biases and a mounting environmental crisis, the “extractive” nature of the current AI development paradigm is coming under intense scrutiny from global watchdogs and local communities alike.

The Mechanics of “Unlawful” Web Scraping

At the heart of the controversy is a process known as web scraping. This is an automated method used to extract vast quantities of information from the internet, ranging from text and articles to personal images and social media activity. While scraping has legitimate uses, Amnesty International argues that its application in training generative AI models has reached an unprecedented and potentially unlawful scale.

“Companies across the world are supplying generative AI products under the veneer of efficiency and sophistication, but in reality, these systems perpetuate mass invasions of privacy through unlawful web scraping,” said Likhita Banerji, Head of the Algorithmic Accountability Lab at Amnesty International. Banerji noted that this process involves extracting data from websites, including sensitive personal information, often without the explicit consent of the individuals who created or appear in that content.

This practice challenges the principle of “privacy by design”—the concept that privacy protections should be integrated into the very architecture of a technology from its inception. By building models that rely on billions of public online posts and images harvested without permission, tech companies are, according to the briefing, creating systems that are unlawful by design.

The research conducted by Amnesty International examined several of the most prominent standalone generative AI tools currently available, including OpenAI’s GPT-3, Google’s Gemini, Meta’s Llama, and DeepSeek, as well as image generation tools like Midjourney and Stable Diffusion. The findings suggest that these models are heavily dependent on datasets that infringe upon personal privacy on a global scale.

Amplifying Bias and Threatening Freedom of Thought

The risks of this extractive approach extend far beyond the loss of data control. Because the datasets used to train these models are largely pulled from the open web, they are inherently “polluted” with the same real-world prejudices, stereotypes, and hateful content that exist online. This creates a feedback loop where AI outputs do not just reflect human bias, but actively amplify it.

The briefing highlights that racial, gendered, and cultural biases are consistent features of generative AI systems. When these models scale up, the presence of discriminatory content in their outputs can become more pronounced, disproportionately harming historically marginalized communities. This creates a digital environment where negative stereotypes are reinforced through automated suggestions.

The report also raises a profound concern regarding the right to freedom of thought. Large-scale generative AI models are increasingly capable of influencing user perspectives and shaping personal beliefs through predictive suggestions and highly convincing generated content. This ability to subtly steer human cognition represents a new and significant frontier in human rights challenges.

“These choices are not inevitable,” Banerji emphasized, urging the industry to move away from the current trajectory. “We must challenge the design choices adopted by companies who build generative AI systems by relying on training data, including personal data, that is extracted non-consensually and on a grand scale.”

Key Takeaways: The Human Rights Impact of AI

  • Privacy Erosion: Mass, non-consensual web scraping is being used to build training datasets, violating the principle of privacy by design.
  • Systemic Bias: AI models trained on uncurated web data tend to amplify racial, gender, and cultural prejudices.
  • Cognitive Risks: Predictive suggestions and highly persuasive generative outputs pose risks to the fundamental right to freedom of thought.
  • Environmental Strain: The massive computational power required for AI is driving significant increases in greenhouse gas emissions and resource consumption.
  • Community Displacement: The expansion of data center infrastructure is causing tension in regions already facing water and electricity shortages.

The Environmental Cost of the AI Boom

While much of the debate focuses on data and ethics, the physical infrastructure required to run generative AI is leaving a massive environmental footprint. The demand for higher processing speeds and larger-scale models has necessitated the development of more energy-intensive chips and much larger data centers.

This shift has direct consequences for global climate goals. Operating these massive data centers requires immense amounts of electricity, along with water for cooling. The scale of this resource consumption is reflected in the recent sustainability reports of the world’s largest tech firms.

According to Google’s 2024 sustainability reporting, the company has seen a 48% increase in greenhouse gas emissions since 2019, a surge largely attributed to data center operations and supply chain requirements. Similarly, Microsoft’s emissions reportedly increased by 29% between 2020 and 2024, driven by the intensive processes required to support AI development.

The impact is not just global, but deeply local. In several parts of the world, the push to build data centers is meeting fierce resistance from communities already struggling with resource scarcity. In Cerrillos, Chile, and Querétaro, Mexico, as well as in parts of Arizona in the United States, local populations have voiced opposition to data center expansion in areas already heavily affected by droughts and electricity shortages.

The Push for Accountability and Regulation

As the gap between technological capability and human rights protection widens, Amnesty International is calling for urgent state intervention. The organization is urging governments to prohibit the development of standalone generative AI systems that are built using unlawful web scraping—defined as the bulk and mass collection of training data through the web.

A critical distinction made in the briefing is the definition of “standalone” generative AI tools. These are products developed and marketed specifically for their generative capabilities, such as AI chatbots or text/image generators. This excludes generative AI that is merely an added feature within a larger, established software suite (such as a word processor with an optional AI assistant).

Amnesty’s recommendations for companies include an immediate cessation of the practice of non-consensual web scraping of personal data for training purposes. The organization is calling on states to hold corporations legally accountable for human rights abuses linked to their design choices and business models.

In the lead-up to the briefing, Amnesty International sought responses from several major players. While the organization reached out to Google, OpenAI, Meta, Stability AI, Midjourney, and DeepSeek regarding web scraping, and to Intel, VMware, Amazon, and Microsoft regarding discrimination and environmental harms, only Microsoft, Amazon, Intel, OpenAI, and Meta provided responses. A summary of those company responses has been included in the full research briefing.

The next significant checkpoint for this issue will be the evolving regulatory landscape in the European Union and the United States, where new frameworks regarding AI transparency and data usage are currently being debated in legislative bodies.

