It’s funny how fast certain things can become dated and obsolete!
Late Modern Text-to-Speech
This video offers a cheery glimpse behind the curtain at the awkward truth about Siri and at what we at Rime call Late Modern Text-to-Speech. Listen to Siri today and the shine has worn off: the voice sounds more stilted and robotic than ever. And though we have nothing but love for Susan Bennett, the original voice of Apple’s Siri, the question remains: Why does Siri suck?
At Rime, we think the answer lies in two places: 1) voice cloning and 2) the nature of reading aloud. Below we dive into these issues that both characterize and plague Late Modern TTS, and how Rime is charting the path forward.
🤖 Brute Force Fakes
The voices behind the Google Assistant, Amazon Alexa, and Apple’s Siri are voice clones. Even today, nearly every synthetic speech product on the market is a machine learning model designed to replicate the voice of a single person. The fact that voice clones sound so great is a testament to how far deep learning advancements have taken us over the last decade. But voice clones lead to voice fatigue: the gradual but inevitable dissatisfaction and boredom that arise when the same voice is used indefinitely.
Voice fatigue is related to its better-known counterparts, ad fatigue and banner blindness, which occur when an audience sees the same ad so many times that they stop paying attention to it. Siri sucks in part because we are falling prey to voice fatigue.
At Rime, we offer an end to voice fatigue. We already offer over 200 voices, and we will soon offer an infinite, generative approach to creating new voices on the fly for every kind of conversational AI application. The dated feel of Alexa, Siri, and the rest will soon be a thing of the past.
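As a rough sketch of what a large voice catalog buys an application builder, here is one way to put it to work. The `synthesize` function and the catalog names below are hypothetical stand-ins, not Rime’s actual API; the point is simply that each session can get its own voice instead of every caller hearing the same clone forever.

```python
import random

# Hypothetical voice catalog; a real one would come from your TTS provider.
VOICE_CATALOG = ["amber", "colin", "maya", "jordan"]  # imagine 200+ entries


def synthesize(text: str, speaker: str) -> bytes:
    """Stand-in for whatever TTS call your stack actually makes."""
    raise NotImplementedError


def pick_session_voice(session_id: str) -> str:
    # Seed on the session ID so a caller hears one consistent voice within
    # a conversation, while different sessions hear different voices.
    return random.Random(session_id).choice(VOICE_CATALOG)


def respond(session_id: str, text: str) -> bytes:
    return synthesize(text, speaker=pick_session_voice(session_id))
```

A rotation like this is the simplest antidote to voice fatigue: no single voice is used indefinitely, so no single voice has the chance to wear out its welcome.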
📖 Reading Aloud
The dream is to have conversations with AI. Voice assistant devices are a strong first step in that direction, but talking to them doesn’t remotely sound or feel like having a conversation.
This should be unsurprising. Siri, like every other voice clone, was created from audio of humans reading aloud, not speaking the way they would in a conversation.
Deep learning-based speech synthesis has given us a status quo in which the output of TTS engines sounds nearly indistinguishable from the training data. It should be obvious, then, that if you train a TTS model on audiobook data, you’re going to get output that sounds like narration.
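To make that concrete, compare the kind of utterance an audiobook corpus contributes with the kind a conversational corpus contributes. The snippets below are invented examples, not real training data, but they show why a model trained on the first will narrate and a model trained on the second will sound like talk.

```python
# Invented examples of what each corpus teaches a TTS model to produce.
audiobook_example = {
    # Clean, fully formed sentences read with even, deliberate pacing.
    "text": "The old lighthouse stood at the edge of the cliff, silent against the grey sky.",
    "style": "narration",
}
conversational_example = {
    # Fillers, restarts, and casual phrasing that never show up in audiobooks.
    "text": "yeah, um, so I was gonna say -- can we maybe push it to three?",
    "style": "conversation",
}
```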
✨ The Future
Rime is uniquely positioned to synthesize speech that sounds more conversational, because we’ve spent a ton of time figuring out the best way to collect, label, and annotate conversational speech data. And we’re not stuck cloning voices: the advancements we’ve charted out are going to bring infinite variability to the voices we can build conversational applications with!