
The fight over AI training data is coming to a head

As AI spreads into everyday life, pressure is mounting on tech firms to explain what data their models learn from — and who gets paid


OpenAI was an open book back in 2020. When it launched GPT-3, it released a detailed report on how the model was built, including a public “reading list” showing the kinds of material it was trained on. (About 3% of it was Wikipedia.) That allowed researchers to see exactly what made the AI tick.

Today, details like these are treated as trade secrets. AI companies say revealing too much about how their technology works would give competitors an advantage, so much of it is withheld from public view — even as these systems are integrated into schools, hospitals, and workplaces. That loss of transparency has become a major source of concern — and the basis of dozens of legal battles.

It's no secret that the work of writers, artists, musicians, and publishers helps power today’s AI models. That fact has resulted in a torrent of lawsuits from copyright holders who allege that AI companies are using their work without permission to train their systems. It's becoming one of the defining battles over how the AI industry is allowed to grow.

“You cannot avoid the fact that its sheer existence is because of the songs that I wrote in the past,” said Björn Ulvaeus, the Swedish singer-songwriter and member of ABBA, speaking to Bloomberg last year about AI tools for musicians. “I should be remunerated for that. If you make money on something that I helped you create, I get a share.”

The stakes extend far beyond individual artists. Industries built around copyrighted work — including music, film, publishing, and software — accounted for about 8% of U.S. gross domestic product in 2023 and supported nearly 12 million jobs, according to the International Intellectual Property Alliance.

The argument is moving into a decisive year. More than 50 copyright lawsuits have already been filed in the U.S., and several of the biggest are expected to move forward in 2026. Among them are cases brought by music publishers against Anthropic, accusing the AI company of using song lyrics to train its Claude models, and by visual artists challenging how Google $GOOGL built its image-generation tools. Other cases target Stability AI and companies behind AI music generators.

In a 2025 suit, Walt Disney $DIS and Universal Pictures accuse the AI image generator Midjourney of being a “bottomless pit of plagiarism,” saying it copied and reproduced famous characters without permission. “Piracy is piracy, and the fact that it’s done by an AI company does not make it any less infringing,” Disney chief legal officer Horacio Gutierrez said in a statement. 

The AI firms involved broadly reject the claims, arguing that training models on large collections of existing material is necessary to build systems that can understand language, images, or sound, and that it doesn't amount to an act of copying in the traditional sense.

That view has found some sympathy in U.S. courts. In one closely watched case brought by book authors against Anthropic, U.S. District Judge William Alsup described AI training as “quintessentially transformative,” writing that copyright law “seeks to advance original works of authorship, not to protect authors against competition.” He likened the process to “training schoolchildren to write well,” and said learning from existing works does not necessarily amount to infringement.

Other judges have been more cautious. In a separate ruling involving Meta $META, U.S. District Judge Vince Chhabria said that training would fail the fair-use test “in many circumstances,” especially where the technology risks “flood[ing] the market” with new content in ways that weaken incentives for human creators — one of the central tenets of copyright law.

What neither side disputes is the sheer volume of material involved. Meta has said one of its recent models was trained on roughly 40 trillion tokens of text — an amount that would take tens of millions of years for an average human reader to absorb. That scale has left courts grappling with how traditional copyright tests, developed long before such systems existed, apply in practice.
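The arithmetic behind that figure is easy to check. Here is a rough back-of-envelope sketch in Python; the assumptions about words per token, reading speed, and daily reading time are ours, not Meta's or the court's:

# Back-of-envelope: how long would it take a person to read 40 trillion tokens?
# Assumptions (ours): ~0.75 words per token, ~250 words per minute.

TOKENS = 40e12
WORDS_PER_TOKEN = 0.75
WORDS_PER_MINUTE = 250

total_hours = TOKENS * WORDS_PER_TOKEN / WORDS_PER_MINUTE / 60  # ~2 billion hours

# Reading around the clock, every day:
years_nonstop = total_hours / (24 * 365)           # ~228,000 years
# Reading 15 minutes a day, in line with survey estimates of
# average American reading time:
years_average_reader = total_hours / (0.25 * 365)  # ~22 million years

print(f"Nonstop: {years_nonstop:,.0f} years")
print(f"At 15 min/day: {years_average_reader:,.0f} years")

Under those assumptions, even a nonstop reader would need hundreds of thousands of years, and one reading at a typical daily pace would indeed need on the order of tens of millions.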

Some companies are opting not to wait for answers. Disney agreed late last year to invest $1 billion in OpenAI and allow the company to use Disney characters in its Sora video generator. Warner Music has settled lawsuits with AI music startups and announced plans to build licensed tools with them. Universal Music said in January it would work with Nvidia $NVDA on AI-related music projects.

That is easier for large entertainment companies, which have the leverage to negotiate bespoke deals with AI firms. Smaller rights holders and independent creators do not. Moreover, the incentive to strike such deals could disappear if courts decide they're not necessary.

The Trump administration seems unlikely to help. The White House’s AI Action Plan, announced last year, contains no provisions to protect the rights of artists or creators whose work is used to train chatbots. “You can’t be expected to have a successful AI program when every single article, book, or anything else that you've read or studied, you're supposed to pay for,” Trump said at its launch. “Gee, I read a book, I’m supposed to pay somebody.”

Copyright law isn't the only concern. In 2023, the Stanford Internet Observatory found more than a thousand images of child sexual abuse in a public dataset that was used to train popular AI image generators. The dataset had been widely shared and incorporated into multiple systems before the abusive material was discovered. Once embedded in training data, researchers said, such material can be difficult to identify or fully remove.

Other studies have raised red flags about AI systems being disproportionately trained on English-language content and Western cultural output, which likely shapes how these tools interpret the world and whose perspectives they prioritize.

All these concerns have sharpened calls for visibility into how AI systems are built. The European Union has already moved to make companies publish summaries of their training data under its AI Act. No comparable rules exist in the U.S., leaving courts and licensing agreements to fill the gaps. As generative AI becomes more deeply embedded in everyday life, the question of what goes into these systems — and who gets to see it — is more pressing than ever.
