AI-powered lip sync puts old words into Obama's new mouth

The technology could eventually be developed to tell if a video is real or fake

Researchers have developed a machine learning algorithm that can turn audio clips into realistic, lip-synced videos.

A video shows former US president Barack Obama apparently speaking on a number of subjects including terrorism, though the clips were artificially generated using existing video addresses.

Researchers from the University of Washington believe the system could eventually be used to improve video calls – or even to ascertain whether a video is real or fake.

The system uses a neural network that was trained to watch videos of people talking and convert audio files into realistic mouth shapes. These are then grafted onto the head of that person from another existing video, by combining previous research from the university's image lab with a new mouth synthesis technique. The technology also enables a small time shift so that the neural network can anticipate what the speaker is going to say next. The team chose Obama because the system needs around 14 hours of video to learn from, which isn't a problem for one of the most-filmed faces on the planet.
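The pipeline described above can be sketched in toy form. This is not the researchers' code: the feature sizes, the look-ahead window, and the single untrained linear layer are all illustrative assumptions standing in for the learned network, showing only the shape of the mapping from audio frames (plus a small time shift into the future) to per-frame mouth shapes.

```python
# Minimal sketch (assumed, not the authors' method): map per-frame
# audio features to 2-D mouth landmark shapes, letting the model see a
# few frames of *future* audio to mirror the time-shift idea above.
import numpy as np

rng = np.random.default_rng(0)

N_AUDIO_FEATURES = 13   # e.g. MFCC coefficients per frame (assumption)
N_LANDMARKS = 18        # mouth landmark points, each an (x, y) pair
LOOKAHEAD = 5           # frames of future audio the model may consume

# Stand-in for learned weights: maps a window of current + future
# audio frames to a flattened set of mouth landmark coordinates.
W = rng.normal(size=((LOOKAHEAD + 1) * N_AUDIO_FEATURES, N_LANDMARKS * 2))

def audio_to_mouth_shapes(audio_frames: np.ndarray) -> np.ndarray:
    """audio_frames: (T, N_AUDIO_FEATURES) -> (T, N_LANDMARKS, 2)."""
    T = len(audio_frames)
    # Zero-pad the end so every frame has a full look-ahead window.
    padded = np.vstack([audio_frames,
                        np.zeros((LOOKAHEAD, N_AUDIO_FEATURES))])
    shapes = []
    for t in range(T):
        window = padded[t : t + LOOKAHEAD + 1].ravel()  # current + future
        shapes.append(window @ W)
    return np.array(shapes).reshape(T, N_LANDMARKS, 2)

audio = rng.normal(size=(100, N_AUDIO_FEATURES))  # 100 frames of features
mouths = audio_to_mouth_shapes(audio)
print(mouths.shape)  # one mouth shape per video frame
```

In the real system these mouth shapes would then be synthesised into texture and composited onto the target head video; here the output is just a placeholder array of landmark coordinates.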

"In the future, video chat tools like Skype or Messenger will enable anyone to collect videos that could be used to train computer models," said Ira Kemelmacher-Shlizerman, from UW's Paul G. Allen School of Computer Science & Engineering.

Because streaming audio over the internet takes up far less bandwidth than video, the new system could spell the end for glitchy, frozen video chats.

"When you watch Skype or Google Hangouts, often the connection is stuttery and low-resolution and really unpleasant, but often the audio is pretty good," said co-author and Allen School professor Steve Seitz. "So if you could use the audio to produce much higher-quality video, that would be terrific."

Previous audio-to-video conversion technology has focused on filming multiple people saying the same sentence repeatedly to try to capture how sounds correlate to different mouth shapes. But this process is expensive and time-consuming.

By reversing the process – feeding video into the network instead of just audio – the team could potentially develop algorithms to detect whether a video is real or fake. However, the neural network is currently designed to learn on one individual at a time.

"You can't just take anyone's voice and turn it into an Obama video," Seitz said. "We very consciously decided against going down the path of putting other people's words into someone's mouth. We're simply taking real words that someone spoke and turning them into realistic video of that individual."

In future, algorithms may be developed to recognize a person's voice and speech patterns using just an hour of video, rather than 14 hours, the researchers said.

This article was originally published by WIRED UK