
Voice AI that actually converts: New TTS model boosts sales 15% for major brands



Generating voices that are not only human-sounding and nuanced but also diverse remains a challenge in conversational AI.

Ultimately, people want to hear voices that sound like them, or at least sound natural, not just the standard-issue American voice of the 20th century.

Startup Rime has tackled this challenge with Arcana text-to-speech (TTS), a new spoken language model that can quickly generate “infinite” new voices across genders, ages, demographics and languages, based only on a simple text description of the intended characteristics.

The model has helped customers, including Domino’s and Wingstop, boost sales by 15%.

“It is one thing to have a really high-quality, lifelike model that sounds like a real person,” Rime CEO and co-founder Lily Clifford told VentureBeat. “It is another to have a model that can create not just one voice, but infinite variability of voices along demographic lines.”

A voice model that ‘acts human’

Rime’s multimodal, autoregressive TTS model was trained on natural conversations with real people (as opposed to voice actors). Users simply type a text description of a voice with the desired demographic characteristics and language.

For example: “I want a 30-year-old woman who lives in California and loves software,” or “give me the voice of an Australian man.”

“Every time you do that, you get a different voice,” said Clifford.
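As a rough illustration only, a description-to-voice request of this kind might look like the Python sketch below; the endpoint URL, payload fields and model identifier are hypothetical placeholders, not Rime’s documented API.

    import requests  # pip install requests

    # Hypothetical description-to-voice TTS request. The URL, payload
    # fields and model identifier are illustrative assumptions only.
    API_URL = "https://api.example.com/v1/tts"  # placeholder endpoint
    API_KEY = "YOUR_API_KEY"

    payload = {
        "model": "arcana",  # assumed model identifier
        "voice_description": "a 30-year-old woman who lives in "
                             "California and loves software",
        "text": "Thanks for calling! What can I get started for you?",
    }

    resp = requests.post(
        API_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()

    # Per Clifford's comment above, the same description can yield a
    # different voice on every call.
    with open("voice_sample.wav", "wb") as f:
        f.write(resp.content)

Each run against such an endpoint would, per the article, return a fresh voice matching the described demographics rather than a fixed preset.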

Rime’s Mist v2 TTS model is built for high-quality, business-critical applications, allowing companies to create unique voices for their business needs. “The customer hears a voice that enables a natural, dynamic conversation, without needing a human agent,” said Clifford.

For those looking for out-of-the-box options, Rime now offers eight flagship speakers with unique characteristics:

  • Luna (female, chill but excitable, Gen Z optimist)
  • Celeste (female, warm, laid-back, pleasant)
  • Orion (male, older, African-American, happy)
  • Ursa (male, 20s, encyclopedic knowledge of 2000s emo music)
  • Astra (female, young, wide-eyed)
  • Esther (female, older, Chinese-American, loving)
  • Estelle (female, middle-aged, African-American, sounds so sweet)
  • Andromeda (female, young, breathy, yoga vibes)

The model can switch between languages and can whisper, be sarcastic and even mock. Arcana can also insert laughter into speech when given a laughter token, which can yield varied, realistic output, from “a small chuckle to a big guffaw,” says Rime. The model can even interpret such cues correctly despite not being explicitly trained to do so.

“It infers emotion from context,” Rime writes in a technical post. “It laughs, sighs, breathes audibly and makes subtle mouth sounds. It says ‘um’ and other disfluencies naturally. It has emergent behaviors we are still discovering. In short, it acts human.”
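The article does not specify the token syntax; purely as an illustration, marking up input text for a laughter-capable model might look like the snippet below, where the <laugh> tag and payload shape are assumptions rather than Rime’s documented format.

    # Illustrative only: the <laugh> tag syntax is an assumption,
    # not Rime's documented format.
    payload = {
        "model": "arcana",  # assumed identifier, as in the sketch above
        "voice_description": "a cheerful Australian man",
        # The model would render the tag as anything from a small
        # chuckle to a big guffaw, per Rime's description.
        "text": "Oh wow <laugh> I did not expect that. "
                "Your order will be ready in twenty minutes.",
    }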

Recording natural conversations

The Rime model generates audio tokens that are decoded into speech using a codec-based approach, which, according to Rime, delivers “faster-than-real-time synthesis.” At launch, time to first audio was 250 milliseconds, and public cloud latency was around 400 milliseconds.
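Latency figures like these are straightforward to check against any streaming endpoint. Below is a minimal sketch of measuring time to first audio in Python; the streaming URL and payload are placeholders, not Rime’s actual API.

    import time
    import requests

    # Measure time to first audio (TTFA) against a hypothetical
    # streaming TTS endpoint.
    API_URL = "https://api.example.com/v1/tts/stream"  # placeholder

    start = time.monotonic()
    with requests.post(API_URL, json={"text": "Hello!"},
                       stream=True, timeout=30) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:  # first non-empty chunk of audio bytes
                ttfa_ms = (time.monotonic() - start) * 1000
                print(f"Time to first audio: {ttfa_ms:.0f} ms")
                break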

Arcana was trained in three phases:

  • Pre-training: Rime used open-source large language models (LLMs) as a backbone and pre-trained on a large corpus of text-audio pairs to help Arcana learn general linguistic and acoustic patterns.
  • Supervised fine-tuning on a “massive” proprietary dataset.
  • Speaker-specific fine-tuning: Rime identified the speakers in the dataset it found “most exemplary” for conversational quality and reliability.

Rime’s data captures sociolinguistic conversational features (factoring in social context such as class, gender and location), idiolect (an individual’s distinctive speech patterns) and paralinguistic nuance (the nonverbal aspects of communication that accompany speech).

The model was also trained on accent subtleties, filler words (those subconscious ‘uhs’ and ‘ums’), pauses, prosodic stress patterns (intonation, timing, emphasis on certain syllables) and code-switching (when multilingual speakers shift between languages).

The company took a unique approach to collecting all this data. Clifford explained that model builders typically record snippets from voice actors, then build a model that reproduces that person’s vocal characteristics from text input. Or they scrape audiobook data.


“Our approach was very different,” she explained. “It was: how do we create the world’s largest proprietary dataset of conversational speech?”

To do this, Rime built its own recording studio in a San Francisco basement and spent months recruiting people from Craigslist, through word of mouth, or simply by casually rounding up friends and family. Instead of scripting conversations, they recorded natural conversation and chitchat.

They then annotated the voices with detailed metadata coding for gender, age, dialect, speech affect and language. This has enabled Rime to achieve 98 to 100% accuracy.
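To make the idea concrete, a per-speaker annotation record covering the fields the article lists might look like this; the field names and example values are assumptions, not Rime’s schema.

    from dataclasses import dataclass

    # Hypothetical per-speaker annotation record based on the metadata
    # fields named in the article; names and values are assumptions.
    @dataclass
    class SpeakerAnnotation:
        speaker_id: str
        gender: str    # e.g. "female"
        age: int       # e.g. 34
        dialect: str   # e.g. "California English"
        affect: str    # e.g. "warm, laid-back"
        language: str  # e.g. "en-US"

    sample = SpeakerAnnotation(
        speaker_id="sf-basement-0042",
        gender="female",
        age=34,
        dialect="California English",
        affect="warm, laid-back",
        language="en-US",
    )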

Clifford noted that they are constantly expanding this dataset.

“How do we make it sound personable? You will never get there if you only use voice actors,” she said. “We have done the crazy difficult thing of collecting really naturalistic data. Rime’s enormous secret sauce is that these are not actors. These are real people.”

A ‘personalization harness’ that creates tailor-made voices

Rime plans to give customers the ability to find the voices that work best for their application. It built a ‘personalization harness’ tool so that users can run A/B tests with different voices. After a given interaction, the API reports back to Rime, which offers an analytics dashboard that identifies the best-performing voices based on success metrics.

Customers naturally have different definitions of what makes a call successful. In food service, that can be an upsold order of fries or extra wings.
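As a sketch of what such an experiment loop could look like on the customer side, the snippet below randomly assigns each call a candidate voice and tallies a success metric per voice; the voice names reuse Rime’s flagship speakers, but the harness logic itself is a hypothetical illustration, not Rime’s actual tool.

    import random
    from collections import defaultdict

    # Hypothetical A/B harness: assign each call a candidate voice and
    # tally whether it met the success metric (e.g. an upsold order).
    CANDIDATE_VOICES = ["luna", "celeste", "andromeda"]
    stats = defaultdict(lambda: {"calls": 0, "successes": 0})

    def handle_call(call_succeeded) -> None:
        voice = random.choice(CANDIDATE_VOICES)  # random A/B assignment
        stats[voice]["calls"] += 1
        if call_succeeded(voice):  # e.g. the caller added fries
            stats[voice]["successes"] += 1

    # Simulate 1,000 calls with made-up per-voice success probabilities.
    true_rates = {"luna": 0.18, "celeste": 0.22, "andromeda": 0.20}
    for _ in range(1000):
        handle_call(lambda v: random.random() < true_rates[v])

    for voice, s in sorted(stats.items()):
        rate = s["successes"] / s["calls"] if s["calls"] else 0.0
        print(f"{voice}: {s['calls']} calls, {rate:.1%} success rate")

A dashboard like the one Rime describes would then surface the voice with the highest conversion rate for that application.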

“The goal for us is: how do we create an application that makes it easy for our customers to run those experiments themselves?” said Clifford. “Because our customers are not speech directors, and neither are we. The challenge becomes how you make that personalization analytics layer really intuitive.”

Another KPI customers maximize for is callers’ willingness to talk to the AI. They have found that after switching to Rime, callers are 4x more likely to talk to the bot.

“For the first time ever, people are like: ‘No, you don’t have to transfer me. I am perfectly willing to talk to you,’” said Clifford. “Or, when they are transferred, they say ‘thank you.’” (In fact, 20% of callers end conversations with the bot warmly.)


Powering 100 million calls per month

Rime counts Domino’s, Wingstop, ConverseNow and Ylopo among its customers. The company does a lot of work with large contact centers, enterprise developers building interactive voice response (IVR) systems, and telecoms, Clifford noted.

“When we switched to Rime, we saw an immediate double-digit improvement in the performance of our calls,” said Akshay Kayastha, director of engineering at ConverseNow. “Working with Rime means we solve a lot of the last-mile problems that arise when shipping a high-impact application.”

Ylopo CPO Ge Juefeng noted that, for his company’s large outbound application, it must immediately build trust with the consumer. “We tested every model on the market and found that Rime’s voices converted customers at the highest rate,” he reported.

Rime already helps power nearly 100 million phone calls per month, Clifford said. “If you call Domino’s or Wingstop, there is an 80 to 90% chance that you are hearing a Rime voice,” she said.

Looking ahead, Rime will push further into its on-premises offering to support low latency. In fact, the company expects that by the end of 2025, 90% of its volume will be on-prem. “The reason for this is that you will never be as fast if you run these models in the cloud,” said Clifford.

Rime also continues to refine its models to take on other linguistic challenges: for example, sentences the model has never encountered, such as Domino’s tongue-twisting “MeatZZa ExtravaganZZa.” As Clifford noted, even if a voice is personalized, natural and responds in real time, it will fail if it cannot handle a company’s unique needs.

“There are still many problems that our competitors see as last-mile problems, but that our customers see as first-mile problems,” said Clifford.

