See me speaking? Differentiating on whether words are spoken on screen or off to optimize machine dubbing

Link:
Author(s):
Publisher/Corporate body:
Association for Computing Machinery (ACM)
Year of publication:
2020
Media type:
Text
Keywords:
  • Facial Animation
  • Virtual Agents
  • Speech Synthesis
  • Speech
  • Speech Recognition
  • Models
  • Dubbing
  • Activity recognition
  • Audiovisual machine translation
  • Multi-modal speech processing
Description:
  • Dubbing is the art of finding a translation from a source into a target language that can be lip-synchronously revoiced, i.e., that makes the target-language speech appear as if it were spoken by the original actors all along. Lip synchrony is essential for the full-fledged reception of foreign audiovisual media, such as movies and series, as violations of synchrony between video (lips) and audio (speech) lead to cognitive dissonance and reduce perceptual quality. Of course, synchrony constraints only apply to the translation when the speaker's lips are visible on screen. Therefore, deciding whether to apply synchrony constraints requires an automatic method for detecting whether an actor's lips are visible on screen for a given stretch of speech. In this paper, we attempt, for the first time, to distinguish on-screen from off-screen speech based on a corpus of real-world television material that has been annotated word by word for the visibility of talking lips on screen. We present classification experiments in which we classify individual words as on- or off-screen. We find that this task is far from trivial and that combined audio and visual features work best.
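  The abstract describes a word-level classification of on- vs. off-screen speech from combined audio and visual features. The Python sketch below is purely illustrative and not the authors' implementation: the feature dimensions, the synthetic placeholder data, and the choice of a random-forest classifier are all assumptions standing in for whatever features and model the paper actually uses. It shows the general shape of such a pipeline: early fusion by concatenating per-word audio and visual feature vectors, then training an off-the-shelf classifier.

    # Minimal sketch, assuming per-word audio and visual features have
    # already been extracted (e.g., MFCC statistics and lip-region
    # descriptors); all data below is a synthetic placeholder.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Hypothetical features for 1000 words: 13 audio dims, 32 visual dims.
    # Real features would come from the annotated TV corpus described above.
    audio_feats = rng.normal(size=(1000, 13))
    visual_feats = rng.normal(size=(1000, 32))
    labels = rng.integers(0, 2, size=1000)  # 1 = on-screen, 0 = off-screen

    # Early fusion: concatenate both modalities per word.
    X = np.concatenate([audio_feats, visual_feats], axis=1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=0)

    # A random forest stands in for the paper's (unspecified) classifier.
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test),
                                target_names=["off-screen", "on-screen"]))

  On real data, reporting per-class scores as above matters, since off-screen words may be much rarer (or more frequent) than on-screen ones.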
License:
  • info:eu-repo/semantics/closedAccess
Source system:
Research Information System of Universität Hamburg (UHH)

Internal metadata
Source record
oai:www.edit.fis.uni-hamburg.de:publications/fd8e24e6-5999-42c0-9e04-671e8f2ead08