Court filings show Meta staffers discussed using copyrighted content for AI training

For years, Meta employees have internally discussed using copyrighted works obtained through legally questionable means to train the company’s AI models, according to court documents unsealed on Thursday.

The documents were submitted by the plaintiffs in Kadrey v. Meta, one of the many AI copyright disputes slowly winding through the U.S. court system. The defendant, Meta, claims that training models on IP-protected works, particularly books, is “fair use.” The plaintiffs, who include authors Sarah Silverman and Ta-Nehisi Coates, disagree.

Previous materials submitted in the suit alleged that Meta CEO Mark Zuckerberg gave Meta’s AI team the OK to train on copyrighted content and that Meta halted AI training data licensing talks with book publishers. But the new filings, most of which show portions of internal work chats between Meta staffers, paint the clearest picture yet of how Meta may have come to use copyrighted data to train its models, including models in the company’s Llama family.

In one chat, Meta employees, including Melanie Kambadur, a senior manager for Meta’s Llama model research team, discussed training models on works they knew may be legally fraught.

“[M]y opinion would be (in the line of ‘ask forgiveness, not for permission’): we try to acquire the books and escalate it to execs so they make the call,” wrote Xavier Martinet, a Meta research engineer, in a chat dated February 2023, according to the filings. “[T]his is why they set up this gen ai org for [sic]: so we can be less risk averse.”

Martinet floated the idea of buying e-books at retail prices to build a training set rather than cutting licensing deals with individual book publishers. After another staffer pointed out that using unauthorized, copyrighted materials might be grounds for a legal challenge, Martinet doubled down, arguing that “a gazillion” startups were probably already using pirated books for training.

“I mean, worst case: we found out it is finally ok, while a gazillion start up [sic] just pirated tons of books on bittorrent,” Martinet wrote, according to the filings. “[M]y 2 cents again: trying to have deals with publishers directly takes a long time …”

In the same chat, Kambadur, who noted that Meta was in talks with document-hosting platform Scribd “and others” for licenses, cautioned that while using “publicly available data” for model training would require approvals, Meta’s lawyers were being “less conservative” than they had been in the past with such approvals.

“Yeah we definitely need to get licenses or approvals on publicly available data still,” Kambadur said, according to the filings. “[D]ifference now is we have more money, more lawyers, more bizdev help, ability to fast track/escalate for speed, and lawyers are being a bit less conservative on approvals.”

Talk about Libgen

In another work chat relayed in the filings, Kambadur discusses possibly using Libgen, a “links aggregator” that provides access to copyrighted works from publishers, as an alternative to data sources that Meta might license.

Libgen has been sued a number of times, ordered to shut down, and fined tens of millions of dollars for copyright infringement. One of Kambadur’s colleagues responded with a screenshot of a Google Search result for Libgen containing the snippet “No, Libgen is not legal.”

Some decision-makers within Meta appear to have been under the impression that failing to use Libgen for model training could seriously hurt Meta’s competitiveness in the AI race, according to the filings.

In an email addressed to Meta AI VP Joelle Pineau, Sony Theakanath, director of product management at Meta, called Libgen “essential to meet SOTA numbers across all categories,” referring to topping the best, state-of-the-art (SOTA) AI models and benchmark categories.

Theakanath also outlined “mitigations” in the email intended to help reduce Meta’s legal exposure, including removing data from Libgen “clearly marked as pirated/stolen” and simply not publicly citing its use. “We would not disclose use of Libgen datasets used to train,” as Theakanath put it.

In practice, these mitigations entailed combing through Libgen files for words such as “stolen” or “pirated,” according to the filings.

In one work chat, Kambadur mentioned that Meta’s AI team also tuned models to “avoid IP risky prompts,” that is, configured the models to refuse to answer questions such as “reproduce the first three pages of ‘Harry Potter and the Sorcerer’s Stone’” or “tell me which e-books you were trained on.”

The filings contain other revelations, implying that Meta may have scraped Reddit data for some type of model training, possibly by mimicking the behavior of a third-party app called Pushshift. Notably, Reddit said in April 2023 that it planned to begin charging AI companies to access data for model training.

In a chat dated March 2024, Chaya Nayak, director of product management at Meta’s generative AI org, said that Meta’s leadership was considering “overriding” past decisions on training sets, including a decision not to use Quora content or licensed books and scientific articles, to ensure that the company’s models had sufficient training data.

Nayak implied that Meta’s first-party training datasets (Facebook and Instagram posts, text transcribed from videos on Meta platforms, and certain Meta for Business messages) simply weren’t enough. “[W]e need more data,” she wrote.

The plaintiffs in Kadrey v. Meta have amended their complaint several times since the case was filed in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023. The latest alleges that Meta, among other offenses, cross-referenced certain pirated books with copyrighted books available for license to determine whether it made sense to pursue a licensing agreement with a publisher.

In a sign of how high Meta considers the legal stakes to be, the company has added two Supreme Court litigators from the law firm Paul Weiss to its defense team on the case.

Meta did not immediately respond to a request for comment.
