Musicians including Backxwash, Titus Andronicus, Tre Mission, Lunice, and DJ Sabrina the Teenage DJ have publicly criticized the inclusion of their copyrighted works in large-scale datasets used to train artificial intelligence models. The datasets, which were recently made searchable, were highlighted following reporting on their widespread use within the AI development community. These collections, which contain millions of tracks, have prompted renewed scrutiny regarding the unauthorized use of creative content for machine learning purposes.
The controversy centers on four expansive datasets that have circulated among AI developers, as detailed in recent reporting. The largest alone contains 12 million tracks, or approximately 91 years of recorded audio. According to industry research, the datasets are comparable in scale to the training material used by prominent commercial music-generation companies. Court filings from 2024 indicate that AI developer Suno stated it trained its models on essentially all music files of reasonable quality accessible online, while Google reportedly used a dataset of 44 million songs for its own training as early as 2022. OpenAI, for its part, used 1.2 million songs to train its Jukebox model in 2020.
The visibility of these datasets increased after producer Sophiaaaahjkl;8901 shared a searchable interface for the collections on social media on June 17. This prompted artists to verify whether their own catalogs were included in the training sets, leading to a wave of public responses from the music community.
Titus Andronicus expressed frustration on social media, noting that among their top tracks identified in the Suno training data were an ambient noise piece and a lesser-known cut from a 2022 release. The band’s response highlighted the seemingly indiscriminate nature of the data scraping process.
Interestingly enough, among the top 6 songs by Titus Andronicus used to train Suno are an ambient noise track utilizing the “mother chord” (a chord with all 12 notes in our western chromatic scale) and a deep cut from our 2022 album that no one heard or liked lmao good luck buddy pic.twitter.com/I7q3EfV8ZP
— Titus Andronicus (@TitusAndronicus) June 18, 2026
Backxwash also voiced opposition to the practice, stating clearly that the unauthorized use of their music for AI training was unwelcome.
I dont like this https://t.co/OCcPkny7fu pic.twitter.com/qrHxQCRfk6
— Backxwash (@backxwash) June 18, 2026
DJ Sabrina the Teenage DJ pointed to the irony of having been accused of making music that sounds AI-generated. The artist suggested that models trained on a dataset containing 22 of their songs may themselves be responsible for the sonic similarities critics have noted.
to everyone who thought my music sounded like ai slop, did you ever think it was because Suno was using a dataset that contained 22 of my songs? it’s funny how there were no accusations of my music sounding like ai slop until these datasets started getting used to generate slop pic.twitter.com/SerSnaLO46
— DJSabrinaTheTeenDJ (@DJSTTDJ) June 18, 2026
The Scope of AI Training Data
The secrecy surrounding the origins of AI training material remains a point of contention. While many AI companies maintain that they rely on content that is freely available online, the existence of these massive, downloadable datasets suggests that developers have access to significant quantities of music that is not intended for free use. This discrepancy between corporate claims and the actual availability of copyrighted material in training sets is central to the ongoing debate over intellectual property rights in the age of generative AI.

Tre Mission confirmed finding 20 of their tracks within the Suno training data, emphasizing a lack of consent and expressing disappointment regarding the integration of their work into AI systems.

I just found out 20 of my songs are being used to train sunos AI… I’m 100% sure I never consented to this 🤔 Anyone who knows me, knows I HATE the use of AI in music, so this is very disappointing pic.twitter.com/4VRyA8Uktn
— TRE MISSION (@TreMission) June 18, 2026
Lunice also responded to the discovery, expressing surprise and thanking the person who brought it to their attention.
That’s wild… Thanks for bringing it up. https://t.co/Z8lcA2j4vH pic.twitter.com/3350G2l2Vt
— Lunice (@Lunice) June 18, 2026
Industry Impact and Legislative Context
As the conversation around AI-generated media matures, the demand for legislative protection for artists is intensifying. The primary concern among creators is that their copyrighted works are being utilized to train commercial models without any form of compensation or explicit permission. This has fueled calls for clearer regulatory frameworks that address the intersection of copyright law and artificial intelligence.
The use of these datasets, which have reportedly been downloaded thousands of times by the AI development community, highlights a lack of transparency in how training data is sourced. As artists continue to share their findings, the pressure on companies to disclose their training methodologies and respect intellectual property is expected to grow. The situation remains fluid as more musicians check their status in the publicly available databases.
Readers can monitor further developments through upcoming court filings and legislative updates regarding AI and copyright.