Synthesizing Obama from Audio

Back in May I blogged about a then-unpublished SIGGRAPH 2017 paper presenting a method for generating video Obamas from audio tracks. Combined with tools like the Adobe Voice Generator (which in turn lets me generate arbitrary sentences from voice samples), it becomes possible to put any sentence into anyone's mouth and produce any video to go with it. Once again: kiss your reality goodbye.

The paper is now online: Synthesizing Obama: Learning Lip Sync from Audio, here as a PDF.

Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.
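
The core of that pipeline is the audio-to-mouth-shape mapping learned by the recurrent network. Here is a minimal sketch of what such a model could look like, assuming MFCC-style audio features as input and a low-dimensional mouth-shape vector (e.g. PCA coefficients of lip landmarks) as output; the layer sizes and names are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    """Recurrent net mapping a sequence of audio feature frames to mouth shapes.

    Assumed (illustrative) shapes: 28-dim MFCC-style features in,
    20-dim mouth-shape vector (e.g. PCA coefficients of lip landmarks) out.
    """
    def __init__(self, audio_dim=28, mouth_dim=20, hidden=60):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, mouth_dim)

    def forward(self, audio_feats):       # (batch, time, audio_dim)
        h, _ = self.lstm(audio_feats)     # (batch, time, hidden)
        return self.out(h)                # (batch, time, mouth_dim)

# Training would regress predicted mouth shapes against shapes extracted
# from the weekly-address footage, e.g. with a simple L2 loss:
model = AudioToMouth()
loss_fn = nn.MSELoss()
```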

A few details from the paper:

Our system first converts audio input to a time-varying sparse mouth shape. Based on this mouth shape, we generate photo-realistic mouth texture, that is composited into the mouth region of a target video. Before the final composite, the mouth texture sequence and the target video are matched and re-timed so that the head motion appears natural and fits the input speech.
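
Read as a pipeline, that is roughly three stages: audio to sparse mouth shape, mouth shape to photorealistic mouth texture, then retiming and compositing into the target video. A hedged outline of those stages, where every helper function is a placeholder rather than the authors' code:

```python
def synthesize(source_audio, target_video, audio_to_mouth_model):
    """Illustrative outline of the described pipeline; all helpers are hypothetical."""
    audio_feats = extract_audio_features(source_audio)        # e.g. MFCC frames
    mouth_shapes = audio_to_mouth_model(audio_feats)          # sparse mouth shape per frame
    mouth_textures = [render_mouth_texture(s) for s in mouth_shapes]
    # Re-time the target frames so head motion fits the new speech,
    # then composite each mouth texture into its frame with proper 3D pose matching.
    retimed_frames = retime(target_video, mouth_shapes)
    return [composite(frame, tex) for frame, tex in zip(retimed_frames, mouth_textures)]
```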

Given a source audio track of President Barack Obama speaking, we seek to synthesize a corresponding video track. To achieve this capability, we propose to train on many hours of stock video footage of the President (from his weekly addresses) to learn how to map audio input to video output. This problem may be thought of as learning a sequence to sequence mapping, from audio to video, that is tailored for one specific individual. This problem is challenging both due to the fact that the mapping goes from a lower dimensional (audio) to a higher dimensional (video) signal, and due to the need to avoid the uncanny valley, as humans are highly attuned to lip motion.

To make the problem easier, we focus on synthesizing the parts of the face that are most correlated to speech. At least for the Presidential address footage, we have found that the content of Obama’s speech correlates most strongly to the region around the mouth (lips, cheeks, and chin), and also aspects of head motion – his head stops moving when he pauses his speech (which we model through a retiming technique). We therefore focus on synthesizing the region around his mouth, and borrow the rest of Obama (eyes, head, upper torso, background) from stock footage.
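
As an illustration of that "borrow the rest of Obama" step: pasting a synthesized lower-face patch back into a stock frame can be sketched with mask-based blending. The snippet below uses OpenCV's seamless cloning as a stand-in for the paper's own compositing and 3D pose matching, so it shows the idea rather than the authors' implementation:

```python
import cv2

def paste_mouth(target_frame, mouth_texture, mouth_mask, center):
    """Blend a synthesized mouth patch into the lower face of a stock frame.

    target_frame:  frame borrowed from stock footage (eyes, head, torso, background)
    mouth_texture: synthesized patch covering lips, cheeks and chin
    mouth_mask:    uint8 mask, 255 inside the mouth region, 0 elsewhere
    center:        (x, y) position of the mouth region in the target frame
    """
    # Poisson blending hides the seam between synthesized and original pixels.
    return cv2.seamlessClone(mouth_texture, target_frame, mouth_mask,
                             center, cv2.NORMAL_CLONE)
```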