
A new, open source text-to-speech model called Dia has arrived to challenge ElevenLabs, OpenAI and more



A two-person startup called Nari Labs has introduced Dia, a 1.6 billion parameter text-to-speech (TTS) model designed to produce naturalistic dialogue directly from text prompts, and one of its creators claims that it surpasses the performance of competing proprietary offerings from the likes of ElevenLabs and Google’s NotebookLM podcast feature.

It could also pose a threat to the uptake of OpenAI’s recent GPT-4o-mini-TTS.

“Dia rivals NotebookLM’s podcast feature while surpassing ElevenLabs Studio and Sesame’s open model in quality,” said Toby Kim, one of the co-creators of Nari Labs and Dia, in a post from his account on the social network X.

In a separate post, Kim noted that the model was built with “zero funding,” adding in a thread: “…we were not AI experts from the start. It all started when we fell in love with NotebookLM’s podcast feature when it was released last year. We wanted more: more control over the script, more freedom.”

Kim further credited Google for giving him and his collaborator access to the Tensor Processing Unit chips (TPUs) used to train Dia through Google’s Research Cloud.

Dia’s code and weights (the internal model connection settings) are now available for anyone to download and deploy locally from Hugging Face or GitHub. Individual users can try generating speech with it in a Hugging Face Space.

Advanced controls and more customizable features

Dia supports nuanced features such as emotional tone, speaker tagging and non-verbal audio cues, all from plain text.


Users can mark speaker turns with tags such as [S1] and [S2], and include cues such as (laughs), (coughs) or (clears throat) to enrich the resulting dialogue with non-verbal behavior.

These tags are interpreted correctly by Dia during generation, something that is not reliably supported by other available models, according to the company’s examples page.
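
To make the tagging format concrete, here is a minimal sketch of driving Dia from Python. The module path, model ID, generate() signature and 44.1 kHz sample rate follow the conventions published in the Nari Labs repository, but treat them as assumptions rather than a verified API reference:

```python
# Minimal sketch of generating tagged dialogue with Dia.
# Module path, model ID, generate() signature, and sample rate
# are assumptions, not a verified API reference.
import soundfile as sf
from dia.model import Dia  # assumed module layout

model = Dia.from_pretrained("nari-labs/Dia-1.6B")  # assumed model ID

# [S1]/[S2] mark speaker turns; parenthesized cues like (laughs)
# are rendered as actual non-verbal audio rather than read aloud.
script = (
    "[S1] Did you actually read the whole paper? "
    "[S2] I skimmed it. (laughs) The appendix alone is forty pages."
)

audio = model.generate(script)          # assumed to return a waveform array
sf.write("dialogue.wav", audio, 44100)  # sample rate is an assumption
```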

The model is currently English-only and is not tied to a single speaker’s voice, producing different voices on each run unless users fix the generation seed or provide an audio prompt. Audio conditioning, or voice cloning, lets users guide the speech tone and voice likeness by uploading a sample clip.

Nari Labs offers example code to facilitate this process, as well as a Gradio-based demo so users can try it without any setup.
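
As a rough illustration of how that audio conditioning could be wired up, the sketch below extends the hypothetical API from the previous example with an audio prompt; the audio_prompt parameter name and the convention of prefixing the script with the sample clip’s transcript are assumptions for illustration, not documented behavior:

```python
# Hypothetical sketch of voice cloning via an audio prompt;
# the audio_prompt keyword is an assumption, not documented behavior.
import soundfile as sf
from dia.model import Dia  # assumed module layout, as above

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Convention assumed here: the script begins with a transcript of the
# sample clip, followed by the new lines to speak in the cloned voice.
script = (
    "[S1] This is a transcript of the uploaded sample clip. "
    "[S1] And this is a brand-new line delivered in that same voice."
)

audio = model.generate(script, audio_prompt="sample_clip.wav")  # assumed kwarg
sf.write("cloned.wav", audio, 44100)
```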

Comparison with ElevenLabs and Sesame

Nari offers a host of example audio files generated by Dia on its Notion site, comparing it to other leading text-to-speech rivals, specifically ElevenLabs Studio and Sesame CSM-1B, the latter a new text-to-speech model from Oculus VR headset co-creator Brendan Iribe that went somewhat viral on X earlier this year.

Side-by-side examples shared by Nari Labs show how Dia outperforms the competition in several areas:

In standard dialogue scenarios, Dia handles both natural timing and non-verbal expressions better. For example, in a script ending with (laughs), Dia interprets and delivers actual laughter, while ElevenLabs and Sesame output textual substitutions such as “haha”.

Here, for example, is Dia…

…and the same sentence spoken by ElevenLabs Studio.


In multi-turn conversations with emotional range, Dia shows smoother transitions and tone shifts. One test involved a dramatic, emotionally charged emergency scene. Dia rendered the urgency and speaker stress effectively, while competing models often flattened the delivery or lost the pacing.

Dia also handles scripts heavy on non-verbal cues, such as a humorous exchange involving coughing, sniffing and laughing. Competing models failed to recognize these tags or skipped them entirely.

Even with rhythmically complex content such as rap lyrics, Dia generates fluid, performance-style speech that maintains tempo, in contrast to the more monotone or incoherent outputs from ElevenLabs and Sesame’s 1B model.

Using audio prompts, Dia can extend or continue a speaker’s voice style into new lines. An example that used a conversational clip as a seed showed how Dia carried the vocal traits from the sample through the rest of the scripted dialogue. This feature is not robustly supported in other models.

In one set of tests, Nari Labs noted that Sesame’s best website demo likely used an internal 8B version of the model rather than the public 1B checkpoint, resulting in a gap between its advertised and actual performance.

Model access and technical specifications

Developers can access Dia from Nari Labs’ GitHub repository and its Hugging Face model page.

The model runs on PyTorch 2.0+ and CUDA 12.6, and requires approximately 10 GB of VRAM.

Inference on enterprise-grade GPUs such as the NVIDIA A4000 delivers around 40 tokens per second.

Although the current version runs only on GPU, Nari plans to offer CPU support and a quantized release to improve accessibility.
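
For readers planning a local install, a quick pre-flight check like the one below (plain PyTorch, no Dia-specific code) can confirm that a machine meets the roughly 10 GB VRAM, GPU-only requirements before downloading the weights:

```python
# Pre-flight check against the published requirements: a CUDA GPU with
# roughly 10 GB of VRAM. Uses only standard PyTorch calls.
import torch

REQUIRED_VRAM_GB = 10  # approximate figure quoted by Nari Labs

if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected; Dia currently requires one.")

props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {total_gb:.1f} GB")

if total_gb < REQUIRED_VRAM_GB:
    print("Warning: under ~10 GB of VRAM; generation may run out of memory.")
```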


The startup offers both a Python library and a CLI tool to further streamline deployment.

Dia’s flexibility opens up use cases ranging from content creation to assistive technologies and synthetic voiceovers.

Nari Labs is also developing a consumer version of Dia aimed at casual users who want to remix or share generated conversations. Interested users can sign up by email to a waitlist for early access.

Fully open source

The model is distributed under a fully open source Apache 2.0 license, which means it can be used for commercial purposes, something that will clearly appeal to enterprises and indie app developers.

Nari Labs explicitly prohibits uses that include impersonating individuals, spreading misinformation or engaging in illegal activities. The team encourages responsible experimentation and has taken a stance against unethical deployment.

Dia’s development credits support from the Google TPU Research Cloud, Hugging Face’s ZeroGPU grant program, and prior work on SoundStorm, Parakeet and the Descript Audio Codec.

Nari Labs itself consists of just two engineers, one full-time and one part-time, but they actively invite community contributions through the company’s Discord server and GitHub.

With a clear focus on expressive quality, reproducibility and open access, Dia adds a distinctive new voice to the landscape of generative speech models.

