AI Training Data: Tech Giants Suspected of Using Dubious Sources

The question of where artificial intelligence companies source the vast quantities of text needed to train their models has moved from technical curiosity to a matter of public scrutiny. As generative AI systems grow more capable—and more prevalent in daily life—regulators, publishers, and creators are demanding transparency about the data that powers them. Yet, despite increasing pressure, many leading AI developers remain notably vague about the origins of their training corpora, fueling speculation about the use of copyrighted material, pirated books, and other legally ambiguous sources.

This opacity is not merely an academic concern. It strikes at the heart of ongoing debates over intellectual property rights in the age of AI, with lawsuits already filed by authors, news organizations, and visual artists alleging unauthorized use of their work. Understanding where AI training data comes from is essential to assessing both the legal risks facing these companies and the broader implications for content creators worldwide.

While some firms have begun to disclose limited details—such as using publicly available web crawls or licensed datasets—many critical aspects of their data sourcing practices remain undisclosed. To clarify the landscape, this article examines verified information about the primary sources AI companies use for training text-based models, drawing from regulatory filings, technical papers, and credible investigative reporting.

Primary Sources of AI Training Data: What We Know

Based on publicly available documentation and statements from major AI developers, the training data for large language models typically falls into several broad categories. The most significant source is web-scraped text collected through automated crawls of the public internet. Companies like OpenAI, Google, and Meta have acknowledged using variations of the Common Crawl dataset—a regularly updated archive of petabytes of web pages collected since 2008—as a foundational component of their training data.

According to Common Crawl’s own documentation, the archive includes raw HTML, extracted text, and metadata from billions of web pages, updated monthly. AI firms process this data to filter out boilerplate, duplicates, and low-quality content before use. In its GPT-4 technical report, OpenAI stated that a significant portion of its training data came from “a combination of licensed, created, and publicly available data sources,” which includes web crawls similar to Common Crawl.
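
The filtering step described above can be illustrated with a minimal Python sketch. It is not any company's actual pipeline: the function names, the `min_words` cutoff, and the `boilerplate_share` threshold are all hypothetical, and production systems typically add near-duplicate detection and learned quality classifiers. The sketch drops exact duplicate pages, strips lines that recur across many pages (a rough proxy for boilerplate such as cookie banners), and discards pages that end up too short.

```python
import hashlib
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_corpus(docs, min_words=5, boilerplate_share=0.5):
    """Illustrative filter: exact dedup, repeated-line removal, length cutoff.

    min_words and boilerplate_share are assumed values chosen for the example.
    """
    # Step 1: drop exact duplicates by hashing the normalized page text.
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)

    # Step 2: count how many distinct pages each normalized line appears in.
    line_counts = Counter()
    for doc in unique:
        line_counts.update(
            {normalize(line) for line in doc.splitlines() if line.strip()}
        )
    threshold = max(2, int(len(unique) * boilerplate_share))

    # Step 3: strip lines shared by many pages, then drop very short pages.
    cleaned = []
    for doc in unique:
        kept = [
            line.strip()
            for line in doc.splitlines()
            if line.strip() and line_counts[normalize(line)] < threshold
        ]
        text = "\n".join(kept)
        if len(text.split()) >= min_words:
            cleaned.append(text)
    return cleaned
```

Deduplicating before counting repeated lines matters in this sketch: otherwise the body text of a page that was crawled twice would itself look like boilerplate and be stripped.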

Beyond general web scraping, AI companies also incorporate data from specific licensed repositories. For example, Google has confirmed using data from sources such as Wikipedia, news archives, and scientific publications under formal licensing agreements. In its 2023 model card for PaLM 2, Google noted that training data included “a mix of web documents, books, code, and conversational data,” with explicit mention of partnerships with data providers for certain domains.

Similarly, Meta’s LLaMA 2 model documentation states that its training data comprised “a mixture of publicly available online data, licensed data, and data created by Meta’s AI researchers.” The company specifically cited the use of web crawls, filtered subsets of Common Crawl, and data from sources like Project Gutenberg for older, public-domain texts.

The Role of Books and Published Works

One of the most contentious areas in AI training data involves the use of books and other long-form published works. While many AI developers acknowledge using book corpora, they often describe them in general terms without specifying exact sources. However, investigative reporting and legal filings have shed light on likely origins.

In the ongoing lawsuit Authors Guild v. OpenAI, filed in September 2023, plaintiffs allege that OpenAI trained its models on vast quantities of copyrighted books obtained without permission. The complaint points to the use of so-called “shadow libraries” such as Library Genesis (LibGen) and Z-Library as probable sources, noting that these platforms host millions of copyrighted books accessible via bulk download.

While OpenAI has not confirmed using LibGen or Z-Library, its GPT-3 paper referenced training on “two internet-based books corpora” totaling over 120,000 books. The description matches characteristics of known shadow library collections. Similarly, in the Silverman v. OpenAI case, authors Sarah Silverman, Richard Kadrey, and Christopher Golden claimed their works were ingested during training, citing evidence of near-verbatim outputs from their books.

Independent researchers have attempted to verify these claims. A 2023 study by the University of Washington and Allen Institute for AI analyzed model outputs and found statistically significant similarities between AI-generated text and passages from books known to be available on LibGen. Though the researchers stopped short of confirming direct use, they noted that the probability of such matches arising by chance was extremely low.

It is important to distinguish between copyrighted and public-domain texts. Projects like Project Gutenberg, which offers over 70,000 free eBooks of works whose copyright has expired, are explicitly cited by AI firms as legitimate sources. Meta, for instance, listed Project Gutenberg among its data sources in the LLaMA 2 documentation. This distinction matters legally: training on public-domain works does not raise the same copyright concerns as using copyrighted material without authorization.

Licensed and Proprietary Data: The Opaque Layer

Beyond scraped and public sources, AI companies frequently reference “licensed data” and “data created by humans” in their disclosures—terms that encompass a wide range of potentially valuable but poorly specified inputs. These may include subscriptions to news wire services, partnerships with publishers, or internally generated datasets from human annotators.

For example, in a 2023 blog post, OpenAI mentioned working with third-party contractors to generate training data for supervised fine-tuning, particularly for improving model behavior and reducing harmful outputs. Supervised fine-tuning relies on human-written responses to prompts. It is commonly paired with reinforcement learning from human feedback (RLHF), in which human raters compare or rank model outputs and those preferences are used to further adjust the model.

Similarly, Google has highlighted its use of proprietary data from internal products such as Search, YouTube, and Google Books—though it emphasizes that such use complies with its terms of service and privacy policies. In its AI Principles, Google states that it does not use personal data from services like Gmail for AI training without explicit consent.

News organizations have also entered into licensing deals with AI firms. In July 2023, the Associated Press announced a collaboration with OpenAI to explore use cases for generative AI in journalism, though the financial and data-sharing terms were not disclosed. In December 2023, German publisher Axel Springer announced an agreement with OpenAI to allow training on its news content, including outlets like Politico and Business Insider.

These deals represent a growing trend toward formal licensing, but they remain the exception rather than the rule. Most AI training data still appears to come from broad web crawls and less transparent sources, leaving significant gaps in public understanding.

Regulatory Scrutiny and Calls for Transparency

The lack of clarity around training data has prompted regulatory action in several jurisdictions. In the European Union, the AI Act—finalized in 2024—includes provisions requiring providers of general-purpose AI models to disclose a summary of the content used for training. While the exact format of these disclosures is still being worked out, the regulation marks a significant step toward greater accountability.

In the United States, the Copyright Office has launched an initiative to examine the intersection of AI and copyright law, holding public hearings and seeking comments on whether training AI on copyrighted works constitutes fair use. In a 2023 notice, the Office acknowledged receiving “thousands of comments” from creators, tech companies, and legal scholars, highlighting the intensity of the debate.

Legal challenges continue to mount. In addition to the Authors Guild lawsuit, cases such as Getty Images v. Stability AI and The New York Times v. OpenAI and Microsoft allege unlawful use of visual and textual content, respectively. Though outcomes remain pending, these cases are likely to shape judicial interpretations of how copyright law applies to AI training.

Meanwhile, some companies are beginning to adopt more transparent practices. Hugging Face, a prominent AI developer and platform host, encourages model creators to document their training data using tools like Model Cards and Data Sheets. Its BigScience workshop, which produced the BLOOM model, published a detailed breakdown of its training corpus, including language-specific web crawls, public domain books, and filtered social media content.

What This Means for Creators and the Public

The uncertainty surrounding AI training data has real-world consequences. For authors, journalists, and artists, the prospect of their work being used to train commercial AI systems—without consent or compensation—raises fundamental questions about fairness and control over creative output. While some view AI as a tool that can democratize access to knowledge, others fear it enables large-scale exploitation of intellectual property.

From a public perspective, transparency about training data is essential for assessing model biases, reliability, and potential risks. If a model is trained predominantly on certain types of sources—such as forum discussions, social media, or specific geographic regions—its outputs may reflect those biases. Knowing the data origins helps users and regulators evaluate whether a system is fit for purpose in contexts like healthcare, education, or legal advice.

Efforts to create open, auditable training datasets are underway. Initiatives like EleutherAI’s Pile, a curated collection of text from 22 diverse sources including arXiv, PubMed, and GitHub, aim to provide transparent, legally clear alternatives for research and development. Similarly, the French national AI initiative has supported the creation of OpenLLM-Europe, a project focused on training models on legally sourced, multilingual European data.

Looking Ahead: The Path to Greater Clarity

As regulatory frameworks evolve and legal cases proceed, the pressure on AI companies to disclose more about their training data is likely to increase. The upcoming implementation of the EU AI Act’s transparency requirements, expected to take effect in stages through 2025, will compel providers of general-purpose AI models operating in Europe to provide meaningful summaries of their training data sources.

In the United States, while no federal AI transparency law currently exists, states like California have introduced bills targeting algorithmic accountability and data disclosure. Federal agencies such as the FTC have also signaled interest in investigating whether deceptive claims about AI training data constitute unfair or deceptive practices under existing consumer protection laws.

For now, the most reliable information comes from a combination of technical papers, model cards, regulatory filings, and investigative journalism. While full transparency remains elusive, these sources allow for a reasoned understanding of where AI training data originates—and where the shadows still linger.

As this story develops, World Today Journal will continue to monitor regulatory updates, legal proceedings, and corporate disclosures related to AI training data. Readers seeking official updates can follow the European Commission’s AI Act portal, the U.S. Copyright Office’s AI initiative, and the filings in ongoing cases such as Authors Guild v. OpenAI.

We invite our global audience to share their perspectives on this critical issue. How should society balance innovation in AI with the rights of creators? What level of transparency do you expect from the companies building these powerful systems? Join the conversation in the comments below, and consider sharing this article to help inform others.
