Rendering Speech Across Speaker and Language Barriers
Frank K. Soong 宋謌平, Speech Group, Microsoft Research Asia
ISCSLP 2016, Tianjin, China, Oct. 20, 2016

Why Speech?
We speak to be heard, and to be understood.
– Speech is the most important and effective means for people to communicate.
– Spoken language is more variable (flexible), i.e., less regular, than written language.
– Our dream: human-machine communication via speech.
– Close the loop: the "Speech Chain" of production and perception.

Search for the "Elementary Particles" of Speech?
• Similar human speech production mechanism
– Similar anatomical structure among human speakers and languages: vocal tract + vocal folds.
• Language and speaker differences
– Languages differ at the sentence, word, syllable and phoneme levels.
– At the sub-phonemic or frame level, speech becomes more language independent, e.g., speech coders are essentially language and speaker independent.
• Speaker equalization
– Despite similar articulators, speaker differences still need to be equalized, e.g., for voice conversion, speaker adaptation or cross-lingual applications.

Audio-Visual Speech?
We speak to be heard/seen, and to be understood.
– When we communicate face-to-face (i.e., in A/V mode), the information in the two channels is reinforcing, complementary and error correcting.
– A/V information needs to be synchronous, coherent, and consistent; otherwise, the A/V channels may lead to confusion, e.g., the McGurk effect.
– A/V human-machine communication needs A/V {TTS + ASR}.
– Close the loop: the "Speech Chain" in A/V mode.

Rendering Speech from Text: Text-to-Speech (TTS)
Building TTS models (backend)
• GMM-HMM: statistical model of speech parameters
– Acoustic parameters: static and dynamic parameters of spectrum, F0, duration and voicing
– Model: Gaussians (acoustic features), HMM (temporal structure), decision tree (linking labels and acoustic models in the phonetic and acoustic spaces)
• Waveform concatenation (unit selection) TTS
– Waveform concatenation: selection of speech segments (units)
– Cost function: static (instantaneous) + dynamic (transitional)
• Deep Neural Net (DNN): mapping (linear + nonlinear) between linguistic input and acoustic output (see the sketch below)
– Feedforward DNN
– Recurrent NN (RNN), e.g., Long Short-Term Memory (LSTM)

HMM Text-to-Speech (TTS) System
[Diagram: training (speech database → statistical training → HMMs), adaptation (personal speech → adapted HMMs), and synthesis (text analysis → trajectory generation → vocoder → speech, or waveform concatenation from a speech database with pitch/duration/loudness control).]

Mandarin and Dialects (demos)
• Mandarin 普通话
• Tianjin accent 天津
• Wuhan accent 武汉
• Xi'an accent 西安
• Shandong accent 山东
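For concreteness, below is a minimal numpy sketch of the DNN backend option listed above: a feedforward net mapping a frame-level linguistic feature vector to acoustic features. In practice such a network is trained on aligned linguistic/acoustic pairs; the dimensions, names and random weights here are illustrative assumptions, not the system described in the talk.

```python
# Minimal sketch (not the talk's implementation) of a feedforward DNN backend:
# linguistic features for one frame -> acoustic parameters for that frame.
import numpy as np

rng = np.random.default_rng(0)
DIM_LING, DIM_HID, DIM_ACOUSTIC = 300, 512, 127   # assumed feature dimensions

W1 = rng.standard_normal((DIM_HID, DIM_LING)) * 0.01
b1 = np.zeros(DIM_HID)
W2 = rng.standard_normal((DIM_ACOUSTIC, DIM_HID)) * 0.01
b2 = np.zeros(DIM_ACOUSTIC)

def dnn_forward(linguistic_features):
    """Linear + nonlinear (tanh) hidden layer, then a linear regression output
    for the frame's acoustic parameters (spectrum, log F0, voicing, ...)."""
    h = np.tanh(W1 @ linguistic_features + b1)
    return W2 @ h + b2

frame_input = rng.random(DIM_LING)       # toy linguistic feature vector
acoustic_out = dnn_forward(frame_input)  # predicted acoustic parameter vector
```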
GMM-HMM Parametric TTS
• Features
– Output: acoustic (from recorded training speech): spectrum; F0 + voicing; duration; excitation, etc.
– Input: linguistic (from the text analysis front-end): context-dependent phone; position of the phone, syllable and word in the phrase and sentence; syllable stress; part of speech (POS), etc.
• Models
– Gaussian (mixture) for modeling the spectrum
– Continuous (Gaussian) + discrete probabilities: multi-space distribution (MSD) for F0 and voicing
– Gaussian, Gamma or semi-HMM models for duration
• Algorithms
– Baum-Welch (E-M with path information) for estimating HMM parameters
– Decision tree with the MDL criterion (a greedy search, for insufficient training data)

Spectrum, F0 and Duration HMMs
• Each state has three trees: duration, spectrum and pitch.

HMM parameter trajectory generation with dynamic constraints
For a given HMM $\lambda$, determine the speech parameter vector sequence
$O = [o_1^\top, o_2^\top, \ldots, o_T^\top]^\top$, with $o_t = [c_t^\top, \Delta c_t^\top, \Delta^2 c_t^\top]^\top$,
that maximizes $P(O \mid \lambda) = \sum_{\text{all } Q} P(O, Q \mid \lambda)$.
If the state sequence $Q$ is given, maximize $P(O \mid Q, \lambda)$ with respect to $O$, subject to $O = WC$:
$\log P(O \mid Q, \lambda) = -\tfrac{1}{2} O^\top U^{-1} O + O^\top U^{-1} M + K$.
Setting $\partial \log P(WC \mid Q, \lambda) / \partial C = 0$ gives
$W^\top U^{-1} W C = W^\top U^{-1} M$,
where $C$ is the static feature sequence, $W$ appends the dynamic (Δ, Δ²) features, $M$ and $U$ are the means and covariances along $Q$, and $K$ is a constant (a numpy sketch of this closed-form solution is given after the TTS demo below).
K. Tokuda, H. Zen, A. W. Black, "An HMM-based speech synthesis system applied to English," Proc. IEEE Speech Synthesis Workshop, 2002.
K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," Proc. ICASSP, 2000.

HMM Trajectory Tiling (HTT) Approach to High Quality TTS
High quality TTS: natural, intelligible, with speaker similarity → HMM Trajectory Tiling (HTT)
– HMM trajectory: highly intelligible, serves as the guide for "tiling"
– Tile the trajectory with natural speech segments ("tiles")
– Challenges:
• speech database: phonetic and prosodic coverage
• organizing segments into a structured form for easy search
• how/where to connect two non-consecutive tiles
Y. Qian, F. K. Soong, Z.-J. Yan, "A unified trajectory tiling approach to high quality speech rendering," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 2, 2013.

High Quality Speech Synthesis
[Diagram: HMM Trajectory Tiling (HTT) based TTS combines the parametric speech trajectory of HMM-based TTS (speech database → clustered HMM) with the waveform inventory of concatenative TTS, using the trajectory to guide waveform concatenation.]

HMM-Trajectory-Tiling based TTS
• HMM parameter trajectory generation
• Unit "sausage" (lattice) construction
• Search in the sausage and concatenation

Search in the "Sausage" and Concatenation
– Maximize the normalized cross-correlation (NCC) for:
• spectral similarity
• phase continuity
• choice of concatenation time instants

High Quality TTS Demo
• Trained on a British English corpus (5 hours)
• Trained on a Mandarin Chinese corpus (9 hours)
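Here is the minimal numpy sketch promised above of the parameter generation step, solving $W^\top U^{-1} W C = W^\top U^{-1} M$ for a single one-dimensional feature stream. The delta window coefficients, array layout and function names are illustrative assumptions, not the exact implementation behind the HMM-based systems in the talk.

```python
# Minimal MLPG sketch: smooth static trajectory from state-level means/variances
# of [static; delta; delta-delta] features. Windows and names are assumptions.
import numpy as np

def build_window_matrix(T):
    """W maps static features c (T,) to stacked [static; delta; delta-delta] (3T,)."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[t, t] = 1.0                                        # static
        if 0 < t < T - 1:
            W[T + t, t - 1], W[T + t, t + 1] = -0.5, 0.5     # delta window
            W[2 * T + t, t - 1 : t + 2] = [1.0, -2.0, 1.0]   # delta-delta window
    return W

def mlpg(means, variances):
    """Solve W' U^-1 W c = W' U^-1 m for the static trajectory c.

    means, variances: (3T,) means / variances of [static; delta; delta-delta]
    along the decoded state sequence Q (diagonal covariance assumed).
    """
    T = means.shape[0] // 3
    W = build_window_matrix(T)
    U_inv = np.diag(1.0 / variances)
    A = W.T @ U_inv @ W
    b = W.T @ U_inv @ means
    return np.linalg.solve(A, b)

# Toy usage: 5 frames with arbitrary static means, zero delta means, unit variances.
m = np.concatenate([np.array([1.0, 2.0, 3.0, 2.0, 1.0]), np.zeros(5), np.zeros(5)])
v = np.ones(15)
print(mlpg(m, v))   # smoothed 5-frame static trajectory
```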
Cross-lingual TTS
X-lingual speech technology for:
• Computer Assisted Language Learning (CALL)
• Speech-to-speech translation, e.g., the Microsoft Skype Translator
• Mixed-code automatic speech recognition (ASR) and text-to-speech (TTS) synthesis
• Low-resource spoken language research
• Lip-synced, cross-lingual talking heads

Training TTS across the Language Barrier
• Make a monolingual speaker, say of English, speak a new language he does not speak, say Mandarin Chinese.
Y. Qian, F. K. Soong, Z.-J. Yan, "A unified trajectory tiling approach to high quality speech rendering," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 2, 2013.
How?

Busy, but wants to speak a new language
• English source speaker: 1 hour of (un-transcribed) public speech
• Reference Mandarin speaker: 2 hours of read speech

Training TTS across the Language Barrier
[Diagram: vocal tract length normalization (VTLN) for speaker equalization; TTS model training on the reference speaker; the HMM-based TTS renders Mundie's Mandarin sentences; frame-level trajectory tiling then tiles the source speaker's warped trajectories with a "sausage" of waveform segments from the target speaker to produce the tiled waveform.]

Cross-lingual TTS Personalization (E2C)
• Cross-lingual, personalized Chinese TTS, trained with 1 hour of Craig's (un-transcribed) English data
• Demo: reference Chinese speaker's TTS vs. Mundie's Chinese TTS

Car Navigation in a Foreign Land
• Web search result: driving directions to Beijing Railway Station. Head south on 中关村南大街, then toward 大慧寺路, turn left at 白石新桥, continue onto 西直门外大街.

SI-DNN ASR and the Kullback-Leibler Divergence
For voice conversion and X-lingual TTS

Speaker-Independent DNN ASR
Two physical units (acoustic frames) $x$ and $x'$ are passed through a speaker-independent DNN ASR, which maps them into posterior vectors in the phonetic space:
$P(x) = [p_1, p_2, \ldots, p_I]$, $P(x') = [p_1', p_2', \ldots, p_I']$.

Kullback-Leibler Divergence (discrete)
$D_{KL}(P \| Q) = \sum_{i=1}^{I} (p_i - q_i)(\log p_i - \log q_i) = \sum_{i=1}^{I} p_i (\log p_i - \log q_i) + \sum_{i=1}^{I} q_i (\log q_i - \log p_i)$,
with $P = [p_1, p_2, \ldots, p_I]$, $Q = [q_1, q_2, \ldots, q_I]$, $\sum_{i=1}^{I} p_i = \sum_{i=1}^{I} q_i = 1$.
• Non-negative, i.e., ≥ 0
• Symmetric distortion measure in the probability space
Feng-Long Xie, Frank K. Soong, Haifeng Li, "A KL Divergence and DNN-based Approach to Voice Conversion without Parallel Training Sentences," Proc. Interspeech, 2016.

Voice Conversion (demo)
"Welcome to Microsoft Research Asia"

Voice Conversion System (KLD-DNN based approach)
[Diagram: the source utterance "Welcome to Microsoft Research Asia" is analyzed by the speaker-independent DNN ASR into frame posteriors over ASR senones; the target speaker's acoustic units (TTS senones, phonetic cluster centroids, or frames), also represented by ASR-senone posteriors, are matched to the source frames by minimizing the KL divergence; the matched target units then drive speech generation of the converted utterance.]
Speaker differences in F0 are equalized in the log domain:
$\log F_0^{trans} = \mu_s + \frac{\sigma_s}{\sigma_t} (\log F_0^{(t)} - \mu_t)$.

Experimental Set-up (voice conversion)
• SI-DNN ASR training
– Wall Street Journal CSR corpus (78 hours, 284 speakers)
– 38-dim MFCC + log energy, with acoustic context (~100 ms)
• Database
– CMU ARCTIC database
– Training set: 1,000 utterances; test set: 50 utterances
• Conversion pairs
– SLT (F) to BDL (M)
– CLB (F) to SLT (F)
– RMS (M) to BDL (M)

DNN-ASR Adaptation with Target-speaker Data
[Figure: objective results with and without DNN-ASR adaptation for system II (frame, acoustic delta distortion minimization) and system III (frame, window-based trajectory estimation).]

Voice Conversion Demos
• Samples: natural source recording (NR_S), systems I–III, system IV (TTS upper bound), natural target recording (NR_T)
• Systems:
– I: phonetic cluster centroid (PCC), VC
– II: frame (acoustic delta distortion minimization), VC
– III: frame (window-based trajectory estimation), VC
– IV: TTS upper bound

Cross-Lingual TTS
• Input: the source speaker's speech and the reference speaker's speech
• Challenges: the output should be intelligible, similar to the source speaker, and natural.
Feng-Long Xie, Frank K. Soong, Haifeng Li, "A KL Divergence and DNN Approach to Cross-Lingual TTS," Proc. ICASSP, 2016.
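To make the two frame-level operations above concrete, here is a small numpy sketch of the symmetric KL divergence between two ASR posterior vectors and of the log-F0 mean/variance equalization between two speakers. The function names, the epsilon floor and the toy values are illustrative assumptions, not the exact recipe used in the talk.

```python
# Minimal sketch of the symmetric KLD and the log-F0 equalization described above.
import numpy as np

def symmetric_kld(p, q, eps=1e-10):
    """D_KL(P||Q) + D_KL(Q||P) = sum_i (p_i - q_i) * (log p_i - log q_i)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum((p - q) * (np.log(p) - np.log(q))))

def equalize_log_f0(f0, mu_in, sigma_in, mu_out, sigma_out):
    """log F0_trans = mu_out + (sigma_out / sigma_in) * (log F0 - mu_in),
    applied to voiced frames only."""
    return np.exp(mu_out + (sigma_out / sigma_in) * (np.log(f0) - mu_in))

# Toy usage: two senone posterior vectors and a short voiced F0 contour (Hz).
p = [0.7, 0.2, 0.1]
q = [0.6, 0.3, 0.1]
print(symmetric_kld(p, q))                    # small, non-negative distortion
f0 = np.array([110.0, 115.0, 120.0])
print(equalize_log_f0(f0, np.log(110), 0.10, np.log(220), 0.12))  # equalized contour
```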
Senone Mapping (supervised)
• The source speaker's L1 speech and the reference speaker's L2 speech each train a GMM-HMM TTS; forced alignment through the speaker-independent DNN ASR gives every L1 and L2 TTS senone a posterior distribution over ASR senones.
• Speaker differences in F0 are equalized in the log domain:
$\log F_0^{trans} = \mu_{L1} + \frac{\sigma_{L1}}{\sigma_{L2}} (\log F_0^{(L2)} - \mu_{L2})$.
• Each L2 TTS senone is mapped to the L1 TTS senone with the minimum KL divergence $D_{KL}(P \| Q)$ between their ASR-senone posteriors; the mapped senones build the L2 TTS.

Phonetic Cluster Mapping (unsupervised)
• The source speaker's L1 speech is grouped by phonetic clustering into phonetic clusters; the reference speaker's L2 speech trains a GMM-HMM TTS whose L2 TTS senones are obtained by forced alignment.
• Clusters and senones are both represented by ASR-senone posteriors from the speaker-independent DNN ASR; a senone-cluster mapping (minimum KLD) links each L2 TTS senone to a phonetic cluster, and the mapped clusters build the L2 TTS.

Frame Mapping (unsupervised)
• The source speaker's L1 speech is used directly at the frame level; the reference speaker's L2 speech trains a GMM-HMM TTS with forced-aligned L2 TTS senones.
• F0 is equalized with the same log-domain transform as above.
• Each L2 TTS senone is represented by its ASR-senone posterior and mapped to the source speaker's closest L1 frames by KL divergence; a weighted average of these frames builds the L2 TTS (a rough sketch is given after the results below).

Voice Conversion based TTS (unsupervised)
• The L2 text input, e.g., "你好吗?" ("How are you?"), is first synthesized with the reference speaker's L2 GMM-HMM TTS.
• The synthesized frames and the source speaker's L1 speech frames are both represented by ASR-senone posteriors from the speaker-independent DNN ASR; frame-to-frame mapping by KL divergence followed by speech generation produces the L2 speech in the source speaker's voice.

Experimental Setup
• SI-DNN ASR training
– Training data: 78-hour WSJ CSR corpus; 284 (M/F) speakers; US English
– DNN: 4 hidden layers; 2,048 units/layer; 2,754 output (senone) nodes
– Input features: 38-dim MFCC + log energy, with acoustic context (110 ms)
• Cross-lingual TTS: English (L1) → Mandarin (L2)
– Source speaker M1: 1,000 English sentences (for CL-TTS training) and 1,000 Mandarin sentences (for comparison, as the upper bound)
– Reference speaker M2: 1,000 Mandarin sentences (L2)
• Systems
– I: senone mapping (supervised), CL-TTS
– II: phonetic cluster centroid mapping (unsupervised), CL-TTS
– III: frame mapping (unsupervised), CL-TTS
– IV: cross-lingual voice conversion (unsupervised), CL-TTS
– V: trajectory tiling, CL-TTS baseline
– VI: direct training with the source speaker's Mandarin, TTS

Objective Test: LSD (dB) on the test set (systems as defined above)
System     I      II     III    IV     V      VI (TTS)
LSD (dB)   4.68   4.55   4.50   4.45   5.39   3.91

Intelligibility (%) (systems as defined above)
System              I      II     III    IV     V      VI
Intelligibility (%) 98.1   97.9   98.0   98.0   98.2   98.7
(Feng-Long Xie; this work was done in the Speech Group, MSRA.)

Spectrograms
[Figure: spectrograms of speech synthesized by senone mapping (supervised), phonetic cluster mapping (unsupervised), frame mapping (unsupervised), voice conversion (unsupervised), trajectory tiling (baseline), and the directly trained TTS (upper bound).]

LSD (dB) vs. amount of data (utterances)
Utterances   10     50     100    300    500    1000
II           4.85   4.63   4.59   4.58   4.59   4.55
III          4.70   4.54   4.52   4.52   4.52   4.50
IV           4.72   4.52   4.53   4.52   4.48   4.45
(II: phonetic cluster mapping; III: frame mapping; IV: voice conversion; all unsupervised CL-TTS.)
About 50 utterances are adequate.

Naturalness (MOS) and Speaker Similarity (DMOS)
[Figure: MOS and DMOS bar chart for natural speech, the direct-trained TTS, and systems I–V: senone mapping (supervised), phonetic cluster mapping, frame mapping, voice conversion (all unsupervised), and the trajectory tiling baseline.]
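Below is the rough, self-contained sketch referenced in the Frame Mapping description above (system III). The K-nearest selection, the exponential weighting and all names are illustrative assumptions: each L2 TTS unit's ASR posterior is compared with the source speaker's frame posteriors using the symmetric KL divergence, and the closest frames are combined by a weighted average.

```python
# Rough sketch of KLD-based frame mapping (system III). Assumed names and weighting.
import numpy as np

def symmetric_kld(p, q, eps=1e-10):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum((p - q) * (np.log(p) - np.log(q))))

def map_unit_to_frames(unit_posterior, frame_posteriors, frame_features, k=5):
    """Replace one L2 TTS unit by a weighted average of the source speaker's
    k closest frames in the SI-DNN posterior (phonetic) space."""
    dists = np.array([symmetric_kld(unit_posterior, q) for q in frame_posteriors])
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest])            # closer frames weigh more
    weights /= weights.sum()
    return weights @ frame_features[nearest]     # averaged acoustic feature vector

# Toy usage: 100 source frames with 20-dim posteriors and 40-dim acoustic features.
rng = np.random.default_rng(1)
post = rng.dirichlet(np.ones(20), size=100)      # source-frame ASR posteriors
feats = rng.standard_normal((100, 40))           # source-frame acoustic features
unit = rng.dirichlet(np.ones(20))                # one L2 TTS unit's posterior
print(map_unit_to_frames(unit, post, feats).shape)   # -> (40,)
```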
Demos
• Sample 1 and Sample 2 for the natural reference (N_R) and systems I–VI
• Systems:
– I: senone mapping (supervised), CL-TTS
– II: phonetic cluster mapping (unsupervised), CL-TTS
– III: frame mapping (unsupervised), CL-TTS
– IV: voice conversion (unsupervised), CL-TTS
– V: trajectory tiling, CL-TTS baseline
– VI: direct training with the source speaker's Mandarin, TTS upper bound

Voice/Text-Driven Talking Head

A Photorealistic Talking Head
• Goal
– From uni-modal (speech or text) to bi-modal (audio + visual) rendering
• Applications
– Services in video conferencing, voice agents, online chatting, gaming, computer assisted language learning (CALL), etc.
• Solutions
– Small training set (< 30 minutes of A/V recording)
– Data driven, fully automatic, real-sample rendering
– Personalized photo-real talking head using an HMM-guided approach
– Lip-sync with speech, natural head motion and facial expressions
– No. 1 in the A/V consistency test, LIPS Challenge 2009

Rich-context Unit Selection (RUS) TTS
• Our approach to high quality text-to-speech (TTS)
[Diagram: the parametric speech trajectory from HMM-based TTS (speech database → clustered HMM) and the rich-context HMMs guide waveform concatenation from the waveform inventory of concatenative TTS.]

Talking Head Synthesis
• The same approach applied to high quality visual speech synthesis
[Diagram: the parametric visual speech trajectory from HMM-based visual speech synthesis (audio/visual database → clustered HMM) and the rich-context HMMs guide image concatenation from the video inventory of concatenative visual speech synthesis, producing the HMM-guided photo-real talking head.]

HMM-Guided Lips Image Selection
[Diagram: the HMM-based visual synthesis trajectory guides the selection among lip image candidates (e.g., for phones /n/, /m/, /ey/) to produce the synthesized lips movements; see the sketch below.]

Synthesized Lips Movements
• Demo: HMM-based vs. HMM-guided synthesis, each compared with the original recording.

Lijuan Wang, Xiaojun Qian, Wei Han, Frank K. Soong, "Synthesizing Photo-Real Talking Head via Trajectory-Guided Sample Selection," Proc. Interspeech, 2010, pp. 446-449.
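Below is a toy numpy sketch of the trajectory-guided sample selection idea referenced above: candidate lip images are chosen per frame so that they track the HMM-generated visual trajectory (target cost) while staying consistent with their neighbours (concatenation cost). The Euclidean costs, the weight w and the function name are illustrative assumptions; the published system uses richer visual features and costs.

```python
# Toy Viterbi search for trajectory-guided sample selection. Assumed costs/names.
import numpy as np

def select_samples(trajectory, candidates, w=1.0):
    """trajectory: (T, D) HMM-generated visual features.
    candidates: list of T arrays, each (N_t, D), candidate image features per frame.
    Returns one chosen candidate index per frame."""
    T = len(candidates)
    cost = [np.linalg.norm(candidates[0] - trajectory[0], axis=1)]  # frame-0 target cost
    back = []
    for t in range(1, T):
        target = np.linalg.norm(candidates[t] - trajectory[t], axis=1)
        # concatenation cost between every previous and every current candidate
        concat = np.linalg.norm(
            candidates[t - 1][:, None, :] - candidates[t][None, :, :], axis=2)
        total = cost[-1][:, None] + w * concat         # (N_prev, N_cur)
        back.append(np.argmin(total, axis=0))
        cost.append(target + np.min(total, axis=0))
    path = [int(np.argmin(cost[-1]))]
    for b in reversed(back):                           # backtrace the best path
        path.append(int(b[path[-1]]))
    return path[::-1]

# Toy usage: 10 frames of 6-dim visual features, 8 image candidates per frame.
rng = np.random.default_rng(2)
traj = rng.standard_normal((10, 6))
cands = [rng.standard_normal((8, 6)) for _ in range(10)]
print(select_samples(traj, cands))   # 10 chosen candidate indices
```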
Turning a Mono-lingual Talking Head Multi-lingual
• Demo: the same talking head driven with English TTS and with Chinese TTS

DNN-Based Cross-Lingual Speech-to-Lips Conversion
Cross-lingual talking head system
• Power of the neural net
– Models at high resolution: down to the state (sub-phoneme) level
– Speaker-independent, cross-language phonetic representation
• State decoding
– Language independent: no phone set constraint, no OOV problem
– Short look-ahead (for low-delay, real-time, voice-driven talking head scenarios, e.g., online video games)
Xinjian Zhang, Lijuan Wang, Gang Li, Frank Seide, Frank Soong, "A New Language Independent, Photo-realistic Talking Head Driven by Voice Only," Proc. Interspeech, 2013.

Results (demos)
• American English
• Mandarin Chinese
• Japanese
• French

Summary
Search for "elementary units" across speaker/language barriers
• HMM trajectory tiling (HTT) algorithm for X-lingual TTS: Microsoft Research speech-to-speech (S2S) project, ASR + MT + X-lingual TTS
• DNN-based ASR and the KL divergence
– DNN-based SI ASR
• Equalizes speaker differences in the phonetic space
• Senones (context-dependent, sub-phonemic units) are used universally within or across languages
– KL divergence
• Measures the distortion between two given acoustic segments in the phonetic space via their corresponding ASR posteriors
• X-lingual voice conversion and X-lingual TTS
– Supervised and unsupervised
– 50 sentences seem to be adequate for decent quality
– Better performance than the trajectory tiling algorithm
• Visual speech: a talking head for inputs in different languages
– Rendering the talking head's lips movements, given speech/text inputs in different languages