Researchers suggest OpenAI trained AI models on paywalled O’Reilly books

April 2, 2025


OpenAI has been accused by many parties of training its AI on copyrighted content without permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on non-public books it did not license to train more advanced AI models.

AI models are essentially complex prediction engines. Trained on a lot of data (books, films, TV programs, and so on), they learn patterns and new ways to extrapolate from a simple prompt. When a model “writes” about a Greek tragedy or “draws” images in Ghibli style, it is simply pulling from its enormous knowledge to approximate. It isn’t arriving at anything new.

Although a number of AI laboratories, including OpenAI, have started embracing AI-generated data to train AI as they exhaust real-world sources (especially the public web), few have abandoned real-world data entirely. This is probably because training on purely synthetic data comes with risks, such as degrading a model’s performance.

The new paper, from the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, concludes that OpenAI probably trained its GPT-4o model on paywalled books from O’Reilly Media. (O’Reilly is the CEO of O’Reilly Media.)

In ChatGPT, GPT-4o is the default model. O’Reilly has no licensing agreement with OpenAI, the paper says.

“GPT-4o, OpenAI’s more recent and capable model, shows strong recognition of paywalled O’Reilly book content … compared to OpenAI’s earlier model GPT-3.5 Turbo,” wrote the co-authors of the paper. “GPT-3.5 Turbo, on the other hand, shows greater relative recognition of publicly accessible O’Reilly book samples.”

The paper used a method called DE-COP, first introduced in an academic paper in 2024, that is designed to detect copyrighted content in the training data of language models. Also known as a “membership inference attack,” the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, that suggests the model may have prior knowledge of the text from its training data.
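The logic of such a test can be illustrated with a short Python sketch. This is not the researchers’ code; it is a minimal illustration that assumes each original excerpt is shuffled in among a handful of paraphrases and the model is asked to pick the original. The names `Chooser`, `build_quiz`, and `guess_rate` are hypothetical, and `choose` stands in for whatever call would actually put the question to a model.

```python
import random
from typing import Callable, Sequence

# Hypothetical stand-in for "ask the model which option is the original".
# It receives the shuffled options and returns the index it picks.
Chooser = Callable[[Sequence[str]], int]


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> tuple[list[str], int]:
    """Shuffle the original in among its paraphrases.

    Returns the shuffled options and the index where the original landed.
    """
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def guess_rate(items: Sequence[tuple[str, Sequence[str]]],
               choose: Chooser, seed: int = 0) -> float:
    """Fraction of excerpts for which `choose` picks the original text."""
    if not items:
        raise ValueError("need at least one excerpt")
    rng = random.Random(seed)  # fixed seed so the shuffles are repeatable
    hits = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if choose(options) == answer:
            hits += 1
    return hits / len(items)
```

With three paraphrases per excerpt, blind guessing would pick the original about one time in four. A rate well above that on a set of excerpts hints that the model has seen them before, though, as with any such test, it is a signal rather than proof.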

The co-authors of the paper, O’Reilly, Strauss, and AI researcher Sruly Rosenblat, say they probed the knowledge that GPT-4o, GPT-3.5 Turbo, and other OpenAI models have of O’Reilly Media books published before and after the models’ training cutoff dates. They used 13,962 paragraph excerpts from 34 O’Reilly books to estimate the probability that a given excerpt was included in a model’s training dataset.

According to the paper’s results, GPT-4o “recognized” far more paywalled O’Reilly book content than OpenAI’s older models, including GPT-3.5 Turbo. That holds even after accounting for potential confounding factors, the authors said, such as improvements in newer models’ ability to tell whether text was written by humans.

“GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O’Reilly books published before its training cutoff date,” wrote the co-authors.

It is not a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method is not foolproof and that OpenAI may have collected the paywalled book excerpts from users copying and pasting them into ChatGPT.

Muddying the waters further, the co-authors did not evaluate OpenAI’s most recent collection of models, which includes GPT-4.5 and “reasoning” models such as o3-mini and o1. It is possible that these models were not trained on paywalled O’Reilly book data, or were trained on a smaller amount of it than GPT-4o.

That said, it is no secret that OpenAI, which has advocated looser restrictions on developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help refine its models’ outputs. That reflects a trend in the wider industry: AI companies recruiting experts in domains such as science and physics so that these experts can effectively feed their knowledge into AI systems.

It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms, albeit imperfect ones, that let copyright owners flag content they would prefer the company not use for training purposes.

Still, as OpenAI fights several lawsuits over its training data practices and its treatment of copyright law in U.S. courts, the O’Reilly paper is not the most flattering look.

OpenAI did not respond to a request for comment.
