
OpenAI’s models ‘memorized’ copyrighted content, new study suggests

A new study appears to lend credence to allegations that OpenAI trained at least some of its AI models on copyrighted content.

OpenAI is embroiled in suits brought by authors, programmers, and other rights holders who accuse the company of using their works (books, codebases, and so on) to develop its models without permission. OpenAI has long asserted a fair use defense, but the plaintiffs in these cases argue that there is no carve-out in U.S. copyright law for training data.

The study, which was co-authored by researchers from the University of Washington, the University of Copenhagen, and Stanford, proposes a new method for identifying training data "memorized" by models behind an API, such as OpenAI's.

Models are prediction engines. Trained on a lot of data, they learn patterns, which is how they're able to generate essays, photos, and more. Most of the outputs aren't verbatim copies of the training data, but because of the way models "learn," some inevitably are. Image models have been found to regurgitate screenshots from films they were trained on, while language models have been observed effectively plagiarizing news articles.

The study's method relies on words the co-authors call "high-surprisal," that is, words that stand out as uncommon in the context of a larger body of work. The word "radar" in the sentence "Jack and I sat perfectly still with the radar humming" would be considered high-surprisal because it's statistically less likely than words such as "engine" or "radio" to appear before "humming."
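The intuition is quantifiable: given the probability a language model assigns to each candidate word in a context, surprisal is the negative log-probability, so rarer continuations score higher and make better probe targets. A minimal sketch with an invented, hand-made probability table (the actual study derives probabilities from a real language model, not hard-coded values):

```python
import math

# Hypothetical probabilities a language model might assign to candidate
# words in the slot "...still with the ___ humming". Invented numbers,
# for illustration only.
candidate_probs = {
    "engine": 0.20,
    "radio": 0.10,
    "radar": 0.002,
}

def surprisal(p: float) -> float:
    """Surprisal in bits: -log2(p). Lower-probability words are more 'surprising'."""
    return -math.log2(p)

scores = {word: surprisal(p) for word, p in candidate_probs.items()}

# "radar" receives the highest surprisal, so it is the word this
# selection step would mask out for the model to guess.
most_surprising = max(scores, key=scores.get)
print(most_surprising, round(scores[most_surprising], 2))
```

Words like "engine" (about 2.3 bits here) barely register, while "radar" (about 9 bits) is exactly the kind of statistical outlier the method wants to test the model on.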


The co-authors probed several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization by removing high-surprisal words from snippets of fiction books and New York Times pieces and having the models try to "guess" which words had been masked. If the models managed to guess correctly, it's likely they memorized the snippet during training, the co-authors concluded.

An example of having a model "guess" a high-surprisal word. Image Credits: OpenAI copyright study
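The scoring side of the probe can be sketched as well: mask the high-surprisal words, ask the model to fill each blank, and compare its guesses to the original text, with an unusually high hit rate taken as a memorization signal. Below is a toy version in which a stubbed-in `model_guess` function stands in for a real API call, and the 0.5 cutoff is an arbitrary illustrative threshold, not a value from the study:

```python
# Masked snippet: each None marks a removed high-surprisal word.
masked = ["Jack", "and", "I", "sat", "still", "with", "the", None, "humming"]
ground_truth = {7: "radar"}  # position -> word that was removed

def model_guess(tokens, position):
    """Stand-in for querying a model behind an API to fill the blank.
    A real probe would send the masked text to the model; this stub
    just returns a canned answer so the example runs offline."""
    return "radar"

hits = sum(
    1 for pos, word in ground_truth.items()
    if model_guess(masked, pos) == word
)
accuracy = hits / len(ground_truth)

# A guess rate well above chance across many snippets is treated as
# evidence of memorization; 0.5 here is purely illustrative.
flagged_as_memorized = accuracy > 0.5
print(accuracy, flagged_as_memorized)
```

In practice this comparison runs over many snippets per source, since a single lucky guess proves nothing; only a consistently above-chance hit rate supports the memorization inference.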

According to the results of the tests, GPT-4 showed signs of having memorized portions of popular fiction books, including books in a dataset of copyrighted e-book samples called BookMIA. The results also suggested that the model memorized portions of New York Times articles, albeit at a comparatively lower rate.

Abhilasha Ravichander, a doctoral student at the University of Washington and a co-author of the study, said the findings shed light on the "contentious data" models may have been trained on.

"In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically," Ravichander said. "Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem."

OpenAI has long advocated for looser restrictions on developing models using copyrighted data. While the company has struck certain content licensing deals and offers opt-out mechanisms that let copyright owners flag content they'd prefer the company not use for training purposes, it has lobbied several governments to codify "fair use" rules around AI training.
