MLCommons and Hugging Face team up to release massive speech dataset for AI research

MLCommons, a nonprofit AI safety working group, has teamed up with AI dev platform Hugging Face to release one of the world’s largest collections of public domain voice recordings for AI research.

The dataset, called Unsupervised People’s Speech, contains more than a million hours of audio spanning at least 89 languages. MLCommons says it was motivated to create it by a desire to support R&D in “various areas of speech technology.”

“Supporting broader natural language processing research for languages other than English helps bring communication technologies to more people globally,” the organization wrote in a blog post Thursday. “We anticipate several avenues for the research community to continue to build and develop, especially in the areas of improving low-resource language speech models, enhanced speech recognition across different accents and dialects, and novel applications in speech synthesis.”

It is certainly an admirable goal. But AI datasets like Unsupervised People’s Speech can carry risks for the researchers who choose to use them.

Biased data is one of those risks. The recordings in Unsupervised People’s Speech came from Archive.org, the nonprofit perhaps best known for the Wayback Machine web archival tool. Because many of Archive.org’s contributors are English-speaking, and American, almost all of the recordings in Unsupervised People’s Speech are in American-accented English, per the readme on the official project page.

That means that, without careful filtering, AI systems such as speech recognition and voice synthesizer models trained on Unsupervised People’s Speech could exhibit some of the same biases. For example, they might struggle to transcribe English spoken by a non-native speaker, or have trouble generating synthetic voices in languages other than English.

Unsupervised People’s Speech might also contain recordings from people unaware that their voices are being used for AI research purposes, including commercial applications. While MLCommons says that all recordings in the dataset are in the public domain or available under Creative Commons licenses, there is the possibility that mistakes were made.

According to an MIT analysis, hundreds of publicly available AI training datasets lack licensing information and contain errors. Creator advocates, including Ed Newton-Rex, the CEO of the AI ethics-focused nonprofit Fairly Trained, have made the case that creators should not be required to “opt out” of AI datasets because of the onerous burden that opting out imposes on these creators.

“Many creators (e.g. Squarespace users) have no meaningful way of opting out,” Newton-Rex wrote in a post on X last June. “For creators who can opt out, there are multiple overlapping opt-out methods, which are (1) incredibly confusing and (2) woefully incomplete in their coverage. Even if a perfect universal opt-out existed, it would be hugely unfair to put the opt-out burden on creators, given that generative AI uses their work to compete with them; many would simply not realize they could opt out.”

MLCommons says that it is committed to updating, maintaining, and improving the quality of Unsupervised People’s Speech. But given the potential flaws, developers would do well to exercise serious caution.
