Text to speech: AI trained on YouTube and podcasts has a more human voice

10 March 2023

Creating artificial intelligence (AI) text to speech systems with different rhythms and pauses will make them sound more human, a new study has concluded.

Researchers from Carnegie Mellon University in Pittsburgh, in the United States, trained artificial intelligence on speech from YouTube and podcasts.

They said that although text to speech systems had become quite human sounding, they still lack the full human sound because they are trained on acted speech.

So the researchers trained the AI on more natural human voices from the real world, using hours of YouTube and podcast recordings.

Text to speech AI systems are most commonly used by conversational bots such as Alexa, Siri and Google Assistant.

Since these systems have become more and more common, experts have been looking at ways they can sound less robotic and more human.

The paper's authors used almost 900 hours of speech from YouTube and podcasts to train a text to speech AI.

They were then able to create a model which produced more natural sounding speech complete with ums and hesitations.

David Beavan at the Alan Turing Institute in London, who was not involved in the study, spoke to the publication New Scientist about the findings.

"They clearly haven't quite got to the point where it is totally human sounding, but they're absolutely going in the right direction."

He added that it could be useful for when you might want a more sensitive AI voice, like when you've just woken up in the morning.
