Conversion of whispered speech to audible speech ...
Motivation
In recent years, advances in wireless communication technology have led to the widespread use of cellular phones for speech communication. Because of noisy environments and competing surrounding conversations, users tend to speak loudly. As a consequence, privacy concerns and public legislation tend to restrict the use of cellular phones in public places. Silent speech, which can be heard only by a limited set of listeners close to the speaker, is an attractive solution to this problem if it can effectively be used for quiet and private communication. Several silent speech capture interfaces have already been developed, including surface electromyography (sEMG), electromagnetic articulography (EMA), electroencephalography (EEG), the Non-Audible Murmur (NAM) microphone, and ultrasound (US) and optical imaging of the tongue and lips. Two main techniques are used to convert silent speech signals to audible speech: direct signal-to-signal mapping, and a phonetic pivot combining speech recognition and synthesis. However, with both approaches the naturalness of the converted speech remains unsatisfactory, mainly because F0 is poorly estimated from silent speech.
Our work at the Speech Cognition Department of Gipsa-Lab aims to improve the naturalness and intelligibility of speech converted from silent speech captured by a NAM microphone, together with facial movements estimated by an accurate motion capture system that tracks the 3D positions of coloured beads glued on the speaker's face.
Approaches
To date, we use two different systems to map silent speech to modal speech:

Direct signal-to-signal mapping using aligned corpora, based on a Gaussian mixture model (GMM).

Combining HMM-based speech recognition and HMM-based speech synthesis. By introducing linguistic levels in both recognition and synthesis, such systems can potentially compensate for the impoverished input by incorporating linguistic knowledge into the recognition process. The quality of the output speech ranges from excellent to severely degraded, depending on recognition performance.
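As a minimal illustration of the first approach, the sketch below implements GMM-based joint-density mapping (in the spirit of conditional-mean voice conversion): a GMM is fitted on stacked source/target feature vectors from aligned frames, and a new source frame is converted to its conditional expectation under the model. The synthetic 2-D data, component count, and feature dimensions are illustrative assumptions, not taken from the actual system, which would operate on spectral features extracted from NAM and modal speech.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic aligned corpora: source frames X ("silent" features) and
# target frames Y ("modal" features), linked by a noisy linear map.
rng = np.random.default_rng(0)
n, dx, dy = 500, 2, 2
X = rng.normal(size=(n, dx))
Y = X @ np.array([[1.0, 0.5], [-0.3, 0.8]]) + 0.05 * rng.normal(size=(n, dy))

# Fit a GMM on the joint (source, target) vectors of aligned frames.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(np.hstack([X, Y]))

def convert(x):
    """Map a source frame x to E[y | x] under the joint GMM."""
    mu_x = gmm.means_[:, :dx]
    mu_y = gmm.means_[:, dx:]
    S_xx = gmm.covariances_[:, :dx, :dx]
    S_yx = gmm.covariances_[:, dx:, :dx]
    # Responsibilities p(k | x) from the marginal GMM over the source.
    log_w = np.empty(gmm.n_components)
    for k in range(gmm.n_components):
        d = x - mu_x[k]
        inv = np.linalg.inv(S_xx[k])
        log_w[k] = (np.log(gmm.weights_[k])
                    - 0.5 * (d @ inv @ d)
                    - 0.5 * np.log(np.linalg.det(S_xx[k])))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Mix the component-wise conditional means by responsibility.
    y = np.zeros(dy)
    for k in range(gmm.n_components):
        cond = mu_y[k] + S_yx[k] @ np.linalg.inv(S_xx[k]) @ (x - mu_x[k])
        y += w[k] * cond
    return y

y_hat = convert(X[0])
```

Because the mapping is frame-by-frame, this sketch ignores temporal dynamics; published GMM conversion systems typically add delta features or trajectory smoothing for that reason.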
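The recognition stage of the second, phonetic-pivot approach can be sketched with a toy Viterbi decoder: frame-wise acoustic likelihoods (as an HMM acoustic model would produce from NAM features) are decoded into a phone sequence, which a synthesizer would then turn back into audible speech. The phone inventory and all probabilities below are made-up illustrative values, not parameters of the actual system.

```python
import numpy as np

states = ["sil", "a", "b"]          # hypothetical phone inventory
log_pi = np.log([0.8, 0.1, 0.1])    # initial state probabilities
log_A = np.log([[0.6, 0.3, 0.1],    # state transition matrix
                [0.1, 0.7, 0.2],
                [0.1, 0.2, 0.7]])
# Frame-wise emission likelihoods (T frames x 3 states), standing in for
# the acoustic model's scores on silent-speech feature frames.
log_B = np.log([[0.7, 0.2, 0.1],
                [0.2, 0.6, 0.2],
                [0.1, 0.7, 0.2],
                [0.2, 0.2, 0.6]])

def viterbi(log_pi, log_A, log_B):
    """Return the most likely state sequence given log-probabilities."""
    T, N = log_B.shape
    delta = np.empty((T, N))              # best path score ending in each state
    psi = np.zeros((T, N), dtype=int)     # backpointers
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):         # backtrack from the best final state
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

phones = [states[i] for i in viterbi(log_pi, log_A, log_B)]
# Decodes to ["sil", "a", "a", "a"] for these toy values.
```

The all-or-nothing quality noted above follows directly from this design: a single wrong phone in the decoded sequence propagates unchanged into the synthesis stage.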
Examples
- whisper to speech (without facial movements) (French data)
  - whisper captured by NAM microphone
  - converted speech
- acoustic whisper to audiovisual speech (Japanese data)
  - input whisper
  - audiovisual converted speech
- audiovisual whisper to audiovisual speech (Japanese data)
  - audiovisual whisper
  - audiovisual converted speech
Speaker recognition ...