Microsoft is exploring a way to credit contributors to AI training data

Microsoft is launching a research project to estimate the influence of specific training examples on the text, images, and other types of media that generative AI models create.

That's according to a job listing dating from December that recently resurfaced on LinkedIn.

According to the listing, which seeks a research intern, the project will attempt to demonstrate that models can be trained in such a way that the impact of specific data, for example photos and books, can be "efficiently and usefully estimated."

"Current neural network architectures are opaque in terms of providing sources for their generations, and there are […] good reasons to change this," the listing reads. "[One is,] incentives, recognition, and potentially pay for people who contribute certain valuable data to unforeseen kinds of models we will want in the future, assuming the future will surprise us fundamentally."

AI-powered text, code, image, video, and song generators are at the center of a number of IP lawsuits against AI companies. These companies often train their models on massive amounts of data scraped from public websites, some of it copyrighted. Many of the companies argue that the fair use doctrine shields their data-scraping and training practices. But creatives, from artists to programmers to authors, largely disagree.

Microsoft itself faces at least two legal challenges from copyright holders.

The New York Times sued the tech giant and its one-time collaborator, OpenAI, in December, accusing the two companies of infringing the Times' copyright by deploying models trained on millions of its articles. Several software developers have also filed suit against Microsoft, claiming that the company's GitHub Copilot AI coding assistant was unlawfully trained using their protected works.

Microsoft's new research effort, which the listing describes as "training-time provenance," reportedly has the involvement of Jaron Lanier, the veteran technologist and interdisciplinary scientist at Microsoft Research. In an April 2023 op-ed in The New Yorker, Lanier wrote about the concept of "data dignity," which to him meant connecting "digital stuff" with "the humans who want to be known for having made it."

"A data-dignity approach would trace the most unique and influential contributors when a big model provides a valuable output," Lanier wrote. "For instance, if you ask a model for 'an animated movie of my kids in an oil-painting world of talking cats on an adventure,' then certain key oil painters, cat portraitists, voice actors, and writers (or their estates) might be calculated to have been uniquely essential to the creation of the new masterpiece."

Notably, several companies are attempting this. AI model developer Bria, which recently raised $40 million in venture capital, claims to "programmatically" compensate data owners according to their "overall influence." Adobe and Shutterstock also award regular payouts to dataset contributors, though the exact payout amounts tend to be opaque.

Few large labs have set up individual contributor payout programs beyond inking licensing agreements with publishers, platforms, and data brokers. Instead, they have provided means for copyright holders to opt out of training. But some of these opt-out processes are onerous and apply only to future models, not previously trained ones.

Of course, the Microsoft project may amount to little more than a proof of concept. There is precedent for that. Back in May, OpenAI said it was developing similar technology that would let creators specify how they want their works to be included in, or excluded from, training data. But nearly a year later, the tool has yet to see the light of day, and it has reportedly not been viewed as a priority internally.

Microsoft may also be attempting to "ethics-wash" here, or to head off regulatory and/or court decisions disruptive to its AI business.

But the fact that the company is investigating ways to trace training data is notable in light of the stances on fair use that other AI labs have recently expressed. Several of the top labs, including Google and OpenAI, have published policy documents recommending that the Trump administration weaken copyright protections as they relate to AI development. OpenAI has explicitly called on the U.S. government to codify fair use for model training, which it argues would free developers from onerous restrictions.

Microsoft did not immediately respond to a request for comment.
