Backxwash, Titus Andronicus Among Musicians to Find Their Songs in AI-Training Datasets Exposed by ‘The Atlantic’

Musicians including Backxwash, Titus Andronicus, Tre Mission, Lunice, and DJ Sabrina the Teenage DJ have publicly criticized the inclusion of their copyrighted works in large-scale datasets used to train artificial intelligence models. The datasets, which were recently made searchable, were highlighted following reporting on their widespread use within the AI development community. These collections, which contain millions of tracks, have prompted renewed scrutiny regarding the unauthorized use of creative content for machine learning purposes.

The controversy centers on four expansive datasets that have circulated among AI developers, as detailed in recent reporting. The largest of them alone contains 12 million tracks, or approximately 91 years of recorded audio. According to findings from industry research, the datasets are comparable in scale to the training material used by prominent commercial music-generating AI companies. Court filings from 2024 indicate that AI developer Suno stated it trained its models on essentially all music files of reasonable quality accessible online, while Google reportedly used a dataset of 44 million songs for training as early as 2022. OpenAI, for its part, used 1.2 million songs to train its Jukebox model in 2020.

The visibility of these datasets increased after producer Sophiaaaahjkl;8901 shared a searchable interface for the collections on social media on June 17. This prompted artists to verify whether their own catalogs were included in the training sets, leading to a wave of public responses from the music community.

Titus Andronicus expressed frustration on social media, noting that among their top tracks identified in the Suno training data were an ambient noise piece and a lesser-known cut from a 2022 release. The band’s response highlighted the seemingly indiscriminate nature of the data scraping process.

Backxwash also voiced opposition to the practice, stating clearly that the unauthorized use of their music for AI training was unwelcome.

Other artists, including DJ Sabrina the Teenage DJ, pointed to the irony of being accused of producing AI-generated content before the datasets were widely known, suggesting that the training models themselves may be responsible for the sonic similarities critics have noted.

The Scope of AI Training Data

The secrecy surrounding the origins of AI training material remains a point of contention. While many AI companies maintain that they rely on content that is freely available online, the existence of these massive, downloadable datasets suggests that developers have access to significant quantities of music that is not intended for free use. This discrepancy between corporate claims and the actual availability of copyrighted material in training sets is central to the ongoing debate over intellectual property rights in the age of generative AI.

Tre Mission confirmed finding 20 of their tracks within the Suno training data, emphasizing a lack of consent and expressing disappointment regarding the integration of their work into AI systems.

Lunice also responded to the discovery, acknowledging the gravity of the situation as artists continue to grapple with the implications of their digital footprints being repurposed for machine learning.

Industry Impact and Legislative Context

As the conversation around AI-generated media matures, the demand for legislative protection for artists is intensifying. The primary concern among creators is that their copyrighted works are being utilized to train commercial models without any form of compensation or explicit permission. This has fueled calls for clearer regulatory frameworks that address the intersection of copyright law and artificial intelligence.

The use of these datasets, which have reportedly been downloaded thousands of times by the AI development community, highlights a lack of transparency in how training data is sourced. As artists continue to share their findings, the pressure on companies to disclose their training methodologies and respect intellectual property is expected to grow. The situation remains fluid as more musicians check their status in the publicly available databases.

Readers can monitor further developments through upcoming court filings and legislative updates regarding AI and copyright.
