Rendering Speech Across Speaker and Language Barriers
Frank K. Soong
宋謌平
Speech Group
Microsoft Research Asia
ISCSLP2016, Tianjin, China
Oct. 20, 2016
Why Speech?
We speak to be heard, and to be understood.
– Speech is the most important and effective means for people
to communicate
– Spoken language is more variable (flexible), i.e., less regular,
than written language
– Our dream: human-machine communication via speech
– Close the loop: “Speech Chain” of production/perception
Search for the “Elementary Particles” of Speech?
• Similar Human Speech Production Mechanism
– Similar anatomic structure among human speakers and languages: vocal tract + vocal folds
• Language and Speaker Differences
– Languages differ at the sentence, word, syllable, and phoneme levels
– Sub-phonemic or frame-level similarity: speech becomes more language independent, e.g. speech coders are essentially language and speaker independent
– Speaker equalization: despite similar articulators, speaker differences still need to be equalized, e.g. via voice conversion, speaker adaptation, or X-lingual applications
Audio-Visual Speech?
We speak to be heard/seen, and to be understood
– Communicating face-to-face (i.e., in A/V mode), the information in both channels is reinforcing, complementary, and error correcting
– A/V information needs to be synchronous, coherent, and consistent across the two channels; otherwise, the A/V channels may lead to confusion, e.g. the McGurk effect
– A/V human-machine communication needs A/V {TTS + ASR}
– Close the loop: “Speech Chain” in A/V mode
Rendering Speech from Text
- Text-to-Speech (TTS)
Building TTS models (backend)
• GMM-HMM: statistical model of speech parameters
– Acoustic parameters: static and dynamic parameters of spectrum, F0,
duration and voicing
– Model: Gaussians (acoustic features), HMM (temporal structure), decision trees (linking linguistic labels to acoustic models in the phonetic and acoustic spaces)
• Waveform concatenation (unit selection) TTS
– Waveform concatenation: speech segments (units) selection
– Cost function: static (instantaneous) + dynamic (transitional)
• Deep Neural Net (DNN): mapping (linear + nonlinear)
between linguistic input and acoustic output
– Feedforward DNN
– Recurrent NN (RNN)
• Long Short-Term Memory (LSTM), …
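As a rough, self-contained illustration of this DNN back-end idea (not the actual MSRA implementation), the sketch below maps frame-level linguistic feature vectors to acoustic parameters with a small feedforward network; all dimensions and the randomly initialized weights are made-up stand-ins for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 300-dim linguistic input (context-dependent phone,
# positional and prosodic features), 127-dim acoustic output
# (e.g., spectrum + F0 + voicing, with delta and delta-delta).
D_IN, D_HID, D_OUT = 300, 512, 127

# Randomly initialized weights stand in for trained parameters.
W1 = rng.normal(0, 0.01, (D_IN, D_HID));  b1 = np.zeros(D_HID)
W2 = rng.normal(0, 0.01, (D_HID, D_HID)); b2 = np.zeros(D_HID)
W3 = rng.normal(0, 0.01, (D_HID, D_OUT)); b3 = np.zeros(D_OUT)

def dnn_acoustic_mapping(linguistic_frames: np.ndarray) -> np.ndarray:
    """Map frame-level linguistic features to acoustic features.
    linguistic_frames: (T, D_IN) -> returns (T, D_OUT)."""
    h1 = np.tanh(linguistic_frames @ W1 + b1)   # nonlinear hidden layer
    h2 = np.tanh(h1 @ W2 + b2)                  # second hidden layer
    return h2 @ W3 + b3                          # linear output layer

# One second of frames at a 5-ms shift (200 frames) as a usage example.
acoustic = dnn_acoustic_mapping(rng.normal(size=(200, D_IN)))
print(acoustic.shape)  # (200, 127)
```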
HMM Text-to-Speech (TTS) System
[System diagram: Training – a speech database is used for statistical training of the HMMs; Adaptation – personal speech adapts the HMMs to a target voice; Synthesis – input text passes through text analysis, trajectory generation from the HMMs, and a vocoder to produce speech, or alternatively the trajectories guide waveform concatenation from a speech database with pitch, duration, and loudness controls.]
Mandarin and Dialects
• Mandarin 普通话
• Tianjin accent 天津
• Wuhan accent 武汉
• Xi'an accent 西安
• Shandong accent 山东
GMM-HMM Parametric TTS
• Features
– Output: Acoustic (from recorded training speech)
• spectrum; F0 + voicing; duration, excitation, etc.
– Input: Linguistic (from text analysis front-end)
• context-dependent phone; position of phone; syllable and word in phrase and
sentence; stress of syllable; POS, etc.
• Models
– Gaussian (mixture) models for the spectrum
– Continuous (Gaussian) + discrete probabilities: multi-space distribution (MSD) for F0 and voicing
– Gaussian, Gamma, or semi-Markov (HSMM) models for duration
• Algorithms
– Baum-Welch (EM with hidden path information) for estimating HMM parameters
– Decision-tree clustering with an MDL criterion (a greedy search to cope with insufficient training data)
Spectrum, F0 and Duration HMMs
• Each state has three trees: duration, spectrum and pitch
HMM parameter trajectory generation with dynamic constraints

For a given HMM $\lambda$, determine the speech parameter vector sequence
$O = [o_1^T, o_2^T, o_3^T, \ldots, o_T^T]^T$, with $o_t = [c_t^T, \Delta c_t^T, \Delta^2 c_t^T]^T$, maximizing
$$P(O \mid \lambda) = \sum_{\text{all } Q} P(O, Q \mid \lambda).$$
For a given state sequence $Q$, maximize $P(O \mid Q, \lambda)$ with respect to $O = WC$:
$$\log P(O \mid Q, \lambda) = -\tfrac{1}{2} O^T U^{-1} O + O^T U^{-1} M + K.$$
Setting $\partial \log P(WC \mid Q, \lambda) / \partial C = 0$ gives
$$W^T U^{-1} W C = W^T U^{-1} M.$$

K. Tokuda, H. Zen, A. W. Black, "An HMM-based speech synthesis system applied to English", Proc. IEEE Speech Synthesis Workshop (SSW), 2002.
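A minimal 1-D sketch of the closed-form solution above: build the window matrix W from common static/delta/delta-delta coefficients (assumed here; the cited system may use different windows), then solve W^T U^-1 W C = W^T U^-1 M for the smoothed static trajectory.

```python
import numpy as np

def generate_trajectory(means, variances):
    """Minimal 1-D sketch of the ML parameter generation step:
    solve (W^T U^-1 W) c = W^T U^-1 m for the static trajectory c,
    where W stacks static, delta and delta-delta windows.
    means, variances: (T, 3) arrays of state-level Gaussian statistics
    for [static, delta, delta-delta] at each frame."""
    T = means.shape[0]
    # Build W (3T x T): rows for static, delta, delta-delta of each frame.
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                                   # static: c_t
        W[3 * t + 1, max(t - 1, 0)] -= 0.5                  # delta: 0.5*(c_{t+1} - c_{t-1})
        W[3 * t + 1, min(t + 1, T - 1)] += 0.5
        W[3 * t + 2, max(t - 1, 0)] += 1.0                  # delta-delta: c_{t-1} - 2*c_t + c_{t+1}
        W[3 * t + 2, t] -= 2.0
        W[3 * t + 2, min(t + 1, T - 1)] += 1.0
    m = means.reshape(-1)
    u_inv = 1.0 / variances.reshape(-1)
    A = W.T @ (u_inv[:, None] * W)          # W^T U^-1 W
    b = W.T @ (u_inv * m)                   # W^T U^-1 m
    return np.linalg.solve(A, b)            # smoothed static trajectory

# Toy usage: 10 frames of made-up statistics.
rng = np.random.default_rng(1)
c = generate_trajectory(rng.normal(size=(10, 3)), np.ones((10, 3)))
print(c.shape)  # (10,)
```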
Trajectory Generation with Dynamic Constraints

K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis", Proc. ICASSP 2000.
HMM Trajectory Tiling (HTT) Approach to
High Quality TTS
High quality TTS: natural, intelligible, speaker similarity →
HMM Trajectory Tiling (HTT)
– HMM trajectory: highly intelligible, guide for “tiling”
– Tiling the trajectory with natural speech segments (“tiles”)
– Challenges:
• speech database: sufficient phonetic and prosodic coverage
• organizing segments into a structured form for easy search
• how/where to connect two non-consecutive tiles
Y. Qian, F. K. Soong, Z.-J. Yan, "A unified trajectory tiling approach to high quality speech rendering", IEEE Trans. on Audio, Speech, and Language Processing, 21(2), pp. 280-290, 2013.
High Quality Speech Synthesis
HMM Trajectory Tiling (HTT) based TTS
[Diagram: the speech database provides both clustered HMMs and a waveform inventory. HMM-based TTS generates a parametric speech trajectory from the clustered HMMs; concatenative TTS performs waveform concatenation directly from the waveform inventory; HTT-based TTS uses the parametric trajectory to guide waveform concatenation from the inventory.]
HMM Trajectory Tiling based TTS
1. HMM parameter trajectory generation
2. Unit "sausage" (lattice) construction
3. Search in the sausage and concatenation
Search in "Sausage" and Concatenation
– Maximize the normalized cross-correlation (NCC) for:
• spectral similarity
• phase continuity
• choosing concatenation time instants
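A rough sketch of the NCC criterion for picking a splice point between two candidate waveform tiles; the actual HTT search also weighs other costs, and the window length and shift range below are illustrative assumptions.

```python
import numpy as np

def best_concat_offset(tail: np.ndarray, head: np.ndarray, max_shift: int = 80):
    """Pick the splice offset between two waveform tiles by maximizing the
    normalized cross-correlation (NCC) of their overlapping samples, which
    favors waveform/phase continuity at the joint.
    tail: last samples of the left tile; head: first samples of the right tile."""
    n = min(len(tail), len(head)) - max_shift
    ref = tail[-n:]
    best_shift, best_ncc = 0, -np.inf
    for shift in range(max_shift):
        cand = head[shift:shift + n]
        ncc = np.dot(ref, cand) / (np.linalg.norm(ref) * np.linalg.norm(cand) + 1e-12)
        if ncc > best_ncc:
            best_shift, best_ncc = shift, ncc
    return best_shift, best_ncc

# Toy usage with a 200 Hz sinusoid sampled at 16 kHz.
t = np.arange(16000) / 16000.0
x = np.sin(2 * np.pi * 200 * t)
shift, ncc = best_concat_offset(x[3000:4000], x[5000:6000])
print(shift, round(ncc, 3))
```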
High Quality TTS Demo
• Trained on a British English (5 hours) corpus
• Trained on a Mandarin Chinese (9 hours) corpus
Cross-lingual TTS
X-lingual speech technology for:
• Computer Assisted Language Learning (CALL)
• Speech-to-speech translation, e.g. Microsoft Skype Translator
• Mixed-code automatic speech recognition (ASR) and text-to-speech (TTS) synthesis
• Low-resource spoken language research
• Lip-synced, cross-lingual talking head
Training TTS across the Language Barrier
• Making a monolingual speaker, say of English, speak a new language that he does not speak, say Mandarin Chinese

Y. Qian, F. K. Soong, Z.-J. Yan, "A unified trajectory tiling approach to high quality speech rendering", IEEE Trans. on Audio, Speech, and Language Processing, 21(2), pp. 280-290, 2013.
How?
– English speaker: busy, but wants to speak a new language; 1 hr of public speech (untranscribed)
– Reference Mandarin speaker: 2 hrs of read speech
Train TTS across the Language Barrier
[Pipeline diagram: (1) speaker equalization via vocal tract length normalization; (2) TTS model training (HMM-based TTS); (3) rendering Mundie's Mandarin sentences by trajectory tiling at the frame level: the source speaker's warped trajectories are tiled with a "sausage" of waveform segments from the target speaker to produce the tiled waveform.]
Cross-lingual TTS Personalization (E2C)
Cross-lingual, personalized Chinese TTS, trained with 1 hr of Craig's English data (untranscribed)
– Reference Chinese speaker's TTS
– Mundie's Chinese TTS
Car Navigation in a Foreign Land
• Web search result: Driving directions to Beijing Railway Station. Head south on 中关村南大街, then toward 大慧寺路, turn left at 白石新桥, continue onto 西直门外大街。
SI-DNN ASR and
Kullback-Leibler Divergence
For voice conversion and X-lingual TTS
Speaker Independent DNN ASR
[Diagram: two acoustic frames x and x' (physical units) are passed through the speaker-independent DNN ASR, which maps them to senone posterior vectors P(x) = [p_1, p_2, ..., p_I] and P(x') = [p'_1, p'_2, ..., p'_I] in the phonetic space.]
Kullback-Leibler Divergence (Discrete)

For posterior vectors $P = [p_1, p_2, \ldots, p_I]$ and $Q = [q_1, q_2, \ldots, q_I]$ with $\sum_{i=1}^{I} p_i = \sum_{i=1}^{I} q_i = 1$, the symmetric KL divergence is
$$D_{KL}(P \| Q) + D_{KL}(Q \| P) = \sum_{i=1}^{I} (p_i - q_i)(\log p_i - \log q_i) = \sum_{i=1}^{I} \left[ p_i(\log p_i - \log q_i) + q_i(\log q_i - \log p_i) \right].$$

– Positive semi-definite, i.e., ≥ 0
– A symmetric distortion measure in the probability space

Feng-Long Xie, Frank K. Soong, Haifeng Li, "A KL Divergence and DNN-based Approach to Voice Conversion without Parallel Training Sentences", Proc. Interspeech 2016.
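The symmetric KLD above is straightforward to compute from two senone posterior vectors; a minimal sketch follows (the small flooring constant for numerical safety is an implementation detail, not specified on the slide).

```python
import numpy as np

def symmetric_kld(p: np.ndarray, q: np.ndarray, eps: float = 1e-10) -> float:
    """Symmetric KL divergence between two senone posterior vectors,
    D(P||Q) + D(Q||P) = sum_i (p_i - q_i) * (log p_i - log q_i),
    used as a distortion measure between acoustic frames in the phonetic space."""
    p = np.clip(p, eps, None); p = p / p.sum()
    q = np.clip(q, eps, None); q = q / q.sum()
    return float(np.sum((p - q) * (np.log(p) - np.log(q))))

# Toy usage with 3-class posteriors.
print(symmetric_kld(np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])))
print(symmetric_kld(np.array([0.7, 0.2, 0.1]), np.array([0.7, 0.2, 0.1])))  # 0.0
```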
Voice Conversion
[Diagram: the source utterance "Welcome to Microsoft Research Asia" is passed through a voice conversion system and re-rendered as "Welcome to Microsoft Research Asia" in the target speaker's voice.]
KLD-DNN based Approach
[Diagram: the input utterance "Welcome to Microsoft Research Asia" is first speaker-equalized; in particular, log F0 is transformed by
$$\log F0^{trans} = \mu_s + \frac{\sigma_s}{\sigma_t} (\log F0_t - \mu_t).$$
The equalized frames are passed through the speaker-independent DNN ASR to extract ASR senone posteriors, which are matched, via the symmetric KL divergence, against the target speaker's acoustic units. Acoustic units can be:
• TTS senones
• phonetic cluster centroids
• frames
The selected units then drive speech generation of the converted utterance "Welcome to Microsoft Research Asia".]
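A minimal sketch of the log-F0 speaker equalization step, assuming the means and standard deviations are estimated over voiced frames of each speaker's data; passing unvoiced frames (F0 = 0) through unchanged is an assumption, not stated on the slide.

```python
import numpy as np

def equalize_log_f0(f0_in: np.ndarray,
                    mu_s: float, sigma_s: float,
                    mu_t: float, sigma_t: float) -> np.ndarray:
    """Map voiced log-F0 from one speaker's distribution onto another's:
    log F0_trans = mu_s + (sigma_s / sigma_t) * (log F0_t - mu_t).
    Unvoiced frames (F0 == 0) are passed through unchanged."""
    f0_out = f0_in.astype(float).copy()
    voiced = f0_out > 0
    log_f0 = np.log(f0_out[voiced])
    f0_out[voiced] = np.exp(mu_s + (sigma_s / sigma_t) * (log_f0 - mu_t))
    return f0_out

# Toy usage: shift a 200-240 Hz contour toward a 120 Hz speaker.
f0 = np.array([0.0, 200.0, 220.0, 240.0, 0.0])
print(equalize_log_f0(f0, mu_s=np.log(120), sigma_s=0.15,
                      mu_t=np.log(220), sigma_t=0.12))
```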
Experimental set-up
• DNN training
– Wall Street Journal CSR corpus (78 hours, 284 speakers)
– 38-dim MFCC + log energy with acoustic contexts (~100 ms); see the context-stacking sketch below
– DNN-ASR model
• Database
– CMU ARCTIC database
– Training set: 1,000 utterances; test set: 50 utterances
• Conversion pairs
– SLT (F) to BDL (M)
– CLB (F) to SLT (F)
– RMS (M) to BDL (M)
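A small sketch of the acoustic-context (frame-splicing) step referred to above; the +/- 5-frame window (11 frames, roughly 100-110 ms at a 10-ms shift) is an illustrative choice, not necessarily the exact context used.

```python
import numpy as np

def stack_context(frames: np.ndarray, left: int = 5, right: int = 5) -> np.ndarray:
    """Splice each frame with its +/- 5 neighbors to form the DNN input,
    padding at the edges by repeating the boundary frame.
    frames: (T, D) -> (T, D * (left + 1 + right))."""
    T, D = frames.shape
    padded = np.vstack([np.repeat(frames[:1], left, axis=0),
                        frames,
                        np.repeat(frames[-1:], right, axis=0)])
    return np.hstack([padded[i:i + T] for i in range(left + 1 + right)])

# Toy usage: 100 frames of 39-dim features (38-dim MFCC + log energy).
x = np.random.default_rng(2).normal(size=(100, 39))
print(stack_context(x).shape)  # (100, 429)
```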
DNN-ASR Adaptation with T-speaker Data
[Figure: results of adapting the DNN-ASR with target-speaker data.]
Systems:
– II: frame (acoustic delta distortion minimization), VC
– III: frame (window-based trajectory estimation), VC

Voice Conversion Demos
[Demo table: natural source recording (NR_S), systems I–III, the TTS upper bound (IV-TTS), and natural target recording (NR_T).]
Systems:
– I: PCC (phonetic cluster centroid), VC
– II: frame (acoustic delta distortion minimization), VC
– III: frame (window-based trajectory estimation), VC
– IV: TTS upper bound
Cross-Lingual TTS
[Diagram: the source speaker's speech and the reference speaker's speech are used to build a cross-lingual TTS system, whose output should sound like the source speaker's speech in the new language.]
Challenges:
• Intelligible
• Similar to the source speaker
• Natural

Feng-Long Xie, Frank K. Soong, Haifeng Li, "A KL Divergence and DNN Approach to Cross-Lingual TTS", Proc. ICASSP 2016.
Senone Mapping (Supervised)
[Diagram: the source speaker's L1 speech and the reference speaker's L2 speech are speaker-equalized, e.g. the reference speaker's log F0 is mapped toward the source speaker via
$$\log F0^{trans} = \mu_{L1} + \frac{\sigma_{L1}}{\sigma_{L2}} (\log F0_{L2} - \mu_{L2}).$$
GMM-HMM TTS models are trained in each language, giving L1 and L2 TTS senones; forced alignment against the speaker-independent DNN ASR represents each TTS senone by ASR senone posteriors. L1 and L2 TTS senones are then mapped to each other by minimizing the symmetric KL divergence between their posteriors, and the mapping is used to build the L2 TTS in the source speaker's voice.]
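A minimal sketch of the mapping step, assuming each TTS senone has already been represented by an averaged ASR-senone posterior over its force-aligned frames; every L1 senone is assigned the L2 senone with the smallest symmetric KLD (the exact criterion and any additional constraints in the paper may differ).

```python
import numpy as np

def map_senones(l1_posteriors: np.ndarray, l2_posteriors: np.ndarray) -> np.ndarray:
    """For each L1 TTS senone (row), find the L2 TTS senone whose averaged
    ASR-senone posterior is closest in symmetric KL divergence.
    l1_posteriors: (N1, I), l2_posteriors: (N2, I); returns (N1,) indices into L2."""
    eps = 1e-10
    p = np.clip(l1_posteriors, eps, None); p /= p.sum(axis=1, keepdims=True)
    q = np.clip(l2_posteriors, eps, None); q /= q.sum(axis=1, keepdims=True)
    # Pairwise symmetric KLD: sum_i (p_i - q_i)(log p_i - log q_i).
    diff_p = p[:, None, :] - q[None, :, :]
    diff_log = np.log(p)[:, None, :] - np.log(q)[None, :, :]
    kld = np.sum(diff_p * diff_log, axis=-1)          # (N1, N2)
    return kld.argmin(axis=1)

# Toy usage: 4 L1 senones and 3 L2 senones over a 5-senone ASR output space.
rng = np.random.default_rng(3)
l1 = rng.dirichlet(np.ones(5), size=4)
l2 = rng.dirichlet(np.ones(5), size=3)
print(map_senones(l1, l2))
```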
Phonetic Cluster Mapping
[Diagram: the source speaker's L1 speech is grouped by unsupervised phonetic clustering, while the reference speaker's L2 speech is used for GMM-HMM TTS training and force-aligned into L2 TTS senones. Both the phonetic cluster centroids and the L2 TTS senones are represented by ASR senone posteriors from the speaker-independent DNN ASR, and a senone-cluster mapping (minimum symmetric KLD) is used to build the L2 TTS.]
Frame Mapping (unsupervised)
[Diagram: the source speaker's L1 speech and the reference speaker's L2 speech are speaker-equalized with the same log-F0 transform as above and passed through the speaker-independent DNN ASR. The L2 speech also trains a GMM-HMM TTS and is force-aligned into L2 TTS senones. Each L1 frame's ASR senone posterior is compared with the L2 TTS senones via the symmetric KL divergence, and a weighted average over the closest senones yields the statistics used to build the L2 TTS.]
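A minimal sketch of mapping one equalized L1 frame to L2 acoustic features: the k nearest L2 TTS senones under the symmetric KLD are combined with exp(-KLD) weights; the number of neighbors and the weight form are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def frame_to_l2_features(frame_posterior, l2_senone_posteriors, l2_senone_means, k=5):
    """Map one L1 frame to L2 acoustic features by a weighted average of the
    k nearest L2 TTS senones in symmetric-KLD distance (weights ~ exp(-KLD)).
    frame_posterior: (I,), l2_senone_posteriors: (N, I), l2_senone_means: (N, D)."""
    eps = 1e-10
    p = np.clip(frame_posterior, eps, None); p = p / p.sum()
    q = np.clip(l2_senone_posteriors, eps, None)
    q = q / q.sum(axis=1, keepdims=True)
    kld = np.sum((p - q) * (np.log(p) - np.log(q)), axis=1)   # (N,)
    nearest = np.argsort(kld)[:k]
    w = np.exp(-kld[nearest]); w /= w.sum()
    return w @ l2_senone_means[nearest]                        # (D,)

# Toy usage: 100 L2 senones over a 20-dim ASR posterior space, 30-dim means.
rng = np.random.default_rng(4)
post = rng.dirichlet(np.ones(20), size=100)
means = rng.normal(size=(100, 30))
print(frame_to_l2_features(rng.dirichlet(np.ones(20)), post, means).shape)  # (30,)
```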
Voice Conversion based TTS
[Diagram: the input text "你好吗?" ("How are you?") is synthesized by the L2 GMM-HMM TTS trained on the reference speaker; the synthesized L2 frames and the source speaker's L1 frames are both mapped to ASR senone posteriors by the speaker-independent DNN ASR; frame-to-frame mapping in this phonetic space then drives speech generation of "你好吗?" in the source speaker's voice.]
Experimental setup
• SI-DNN ASR training
– Training data: 78-hour WSJ CSR corpus; 284 (M/F) speakers; US English
– DNN: 4 hidden layers; 2,048 units/layer; 2,754 output (senone) nodes
– Input features: 38-dim MFCC + log energy with acoustic context (110 ms)
• Cross-lingual TTS: English (L1) -> Mandarin (L2)
– Speaker M1 used as source speaker
– 1,000 English sentences (for CL-TTS training)
– 1,000 Mandarin sentences (for comparison as upper bound)
– Reference speaker M2: 1,000 Mandarin sentences (L2)
• Systems
– I: senone mapping (supervised), CL-TTS
– II: phonetic cluster centroid mapping (unsupervised), CL-TTS
– III: frame mapping (unsupervised), CL-TTS
– IV: cross-lingual voice conversion (unsupervised), CL-TTS
– V: trajectory tiling, CL-TTS baseline [1]
– VI: direct training with source speaker's Mandarin, TTS
Objective Test: LSD (dB) on test set

System:    I      II     III    IV     V      VI (TTS)
LSD (dB):  4.68   4.55   4.50   4.45   5.39   3.91

Systems:
– I: senone mapping (supervised), CL-TTS
– II: phonetic cluster mapping (unsupervised), CL-TTS
– III: frame mapping (unsupervised), CL-TTS
– IV: cross-lingual voice conversion (unsupervised), CL-TTS
– V: trajectory tiling, CL-TTS baseline
– VI: direct training with source speaker's Mandarin, TTS
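For reference, one common way to compute a log spectral distance of this kind between time-aligned magnitude spectra is sketched below; the exact LSD definition used in the experiments (spectral representation, frame selection) may differ.

```python
import numpy as np

def log_spectral_distance(spec_a: np.ndarray, spec_b: np.ndarray) -> float:
    """Average log-spectral distance (dB) between two aligned magnitude
    spectrogram sequences of shape (T, n_bins)."""
    eps = 1e-10
    diff_db = 20.0 * (np.log10(np.maximum(spec_a, eps)) -
                      np.log10(np.maximum(spec_b, eps)))
    return float(np.mean(np.sqrt(np.mean(diff_db ** 2, axis=1))))

# Toy usage: identical spectra give 0 dB, a constant 10% ratio gives ~0.83 dB.
rng = np.random.default_rng(5)
s = np.abs(rng.normal(size=(100, 257))) + 0.1
print(log_spectral_distance(s, s))                    # 0.0
print(round(log_spectral_distance(s, s * 1.1), 2))    # ~0.83
```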
Intelligibility

System:              I      II     III    IV     V      VI
Intelligibility (%): 98.1   97.9   98.0   98.0   98.2   98.7

Systems:
– I: senone mapping (supervised), CL-TTS
– II: phonetic cluster mapping (unsupervised), CL-TTS
– III: frame mapping (unsupervised), CL-TTS
– IV: cross-lingual voice conversion (unsupervised), CL-TTS
– V: trajectory tiling, CL-TTS baseline
– VI: direct training with source speaker's Mandarin, TTS

(Feng-Long Xie's work was done in the Speech Group, MSRA.)
Spectrograms
[Spectrogram panels: senone mapping (supervised), phonetic cluster (unsupervised), frame mapping (unsupervised), voice conversion (unsupervised), trajectory tiling (baseline), directly trained TTS (upper bound).]
Amount of Data (utterances)

LSD (dB)   10     50     100    300    500    1000
II         4.85   4.63   4.59   4.58   4.59   4.55
III        4.70   4.54   4.52   4.52   4.52   4.50
IV         4.72   4.52   4.53   4.52   4.48   4.45

Systems:
– II: phonetic cluster mapping (unsupervised), CL-TTS
– III: frame mapping (unsupervised), CL-TTS
– IV: voice conversion (unsupervised), CL-TTS

→ 50 utterances are adequate.
Naturalness (MOS) and Speaker Similarity (DMOS)
[Bar chart: MOS/DMOS scores for natural speech, the directly trained TTS, senone mapping, phonetic cluster mapping, frame mapping, voice conversion, and the trajectory tiling TTS baseline.]
Systems:
– I: senone mapping (supervised), CL-TTS
– II: phonetic cluster mapping (unsupervised), CL-TTS
– III: frame mapping (unsupervised), CL-TTS
– IV: voice conversion (unsupervised), CL-TTS
– V: trajectory tiling, CL-TTS baseline
– VI: direct training with source speaker's Mandarin speech, TTS
Demos
[Demo table: Sample 1 and Sample 2 for systems I–VI and the natural reference (N_R).]
Systems:
– I: senone mapping (supervised), CL-TTS
– II: phonetic cluster mapping (unsupervised), CL-TTS
– III: frame mapping (unsupervised), CL-TTS
– IV: voice conversion (unsupervised), CL-TTS
– V: trajectory tiling, CL-TTS baseline
– VI: direct training with source speaker's Mandarin, TTS upper bound
Voice/Text-Driven
Talking Head
A Photorealistic Talking Head
• Goal
– From uni-modal (speech or text) to bi-modal (audio + visual).
• Applications
– Services in video conferencing, voice agent, online chatting, gaming,
computer assisted language learning (CALL), etc.
• Solutions
– Small training set (<30 minutes of AV recording)
– Data driven, fully automatic, real-sample rendering
– Personalized photo-real talking head using an HMM-guided approach
– Lip-synced with speech, natural head motion and facial expressions
– No. 1 in the A/V Consistency Test, LIPS Challenge 2009
Rich-context Unit Selection (RUS) TTS
• Our approach to high quality Text-to-Speech (TTS)
[Diagram: the speech database provides clustered HMMs, rich-context HMMs, and a waveform inventory. HMM-based TTS generates a parametric speech trajectory; concatenative TTS performs waveform concatenation from the inventory; RUS combines the two, using the trajectory and rich-context HMMs to guide waveform concatenation.]
Talking Head Synthesis
• The same approach applied to high quality visual speech synthesis
[Diagram: the audio/visual database provides clustered HMMs, rich-context HMMs, and a video inventory. HMM-based visual speech synthesis generates a parametric visual speech trajectory; concatenative visual speech synthesis performs video concatenation from the inventory; the HMM-guided photo-real talking head uses the trajectory to guide image concatenation.]
HMM-Guided Lips Image Selection
[Figure: the HMM-based visual synthesis trajectory guides frame-by-frame selection among lip image candidates.]
Synthesized Lips Movements
• HMM-based vs. HMM-guided
[Video comparison: original recordings vs. HMM-based vs. HMM-guided synthesized lips movements (two examples).]

Lijuan Wang, Xiaojun Qian, Wei Han, Frank K. Soong, "Synthesizing Photo-Real Talking Head via Trajectory-Guided Sample Selection", Proc. Interspeech 2010, pp. 446-449.
Turn Mono-lingual to Multi-lingual
with English TTS
with Chinese TTS
DNN-Based Cross-Lingual Speech-to-Lips Conversion
Cross-lingual Talking Head System
• Power of the neural net
– Model with high resolution: down to the state (sub-phoneme) level
– Speaker-independent, cross-language phonetic representation
• State decoding
– Language independent, no phone-set constraint, no OOV problem
– Short look-ahead (for the low-delay, real-time, voice-driven talking head scenario, e.g. online video games)

Xinjian Zhang, Lijuan Wang, Gang Li, Frank Seide, Frank Soong, "A New Language Independent, Photo-realistic Talking Head Driven by Voice Only", Proc. Interspeech 2013.
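A heavily simplified sketch of the low-delay, voice-driven idea: per-frame state posteriors from the SI-DNN are smoothed with only a few look-ahead frames and then mapped to lips parameters; the per-state lips lookup table below is a hypothetical stand-in, not the actual trajectory-guided rendering.

```python
import numpy as np

def stream_lips_parameters(frame_posteriors: np.ndarray,
                           state_to_lips: np.ndarray,
                           look_ahead: int = 3) -> np.ndarray:
    """Low-delay voice-driven lips rendering sketch: for each incoming frame,
    average the ASR state posteriors over a short look-ahead window, then take
    the expected lips parameters under that smoothed posterior.
    frame_posteriors: (T, S) state posteriors from the SI-DNN ASR;
    state_to_lips: (S, D) per-state lips parameters (illustrative stand-in)."""
    T = frame_posteriors.shape[0]
    out = np.empty((T, state_to_lips.shape[1]))
    for t in range(T):
        window = frame_posteriors[t:min(t + look_ahead + 1, T)]   # current + look-ahead
        smoothed = window.mean(axis=0)
        out[t] = smoothed @ state_to_lips                          # expected lips params
    return out

# Toy usage: 50 frames over 300 states, 6 lips parameters per state.
rng = np.random.default_rng(6)
post = rng.dirichlet(np.ones(300), size=50)
print(stream_lips_parameters(post, rng.normal(size=(300, 6))).shape)  # (50, 6)
```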
Results (American English)
Results (Mandarin Chinese)
Results (Japanese)
Results (French)
Summary
Search for "elementary units" across speaker/language barriers
• HMM trajectory tiling (HTT) algorithm for X-lingual TTS: Microsoft Research S2S project, ASR + MT + X-lingual TTS
• DNN-based ASR and KL divergence
– DNN-based SI ASR
• Equalizes speaker differences in the phonetic space
• Senones (context-dependent, sub-phonemic units) are used universally within or across languages
– KL divergence
• Measures the distortion between two given acoustic segments in the phonetic space via their corresponding ASR posteriors
• X-lingual voice conversion and X-lingual TTS
– Supervised and unsupervised approaches
– 50 sentences seem to be adequate for decent quality
– Better performance than the trajectory tiling algorithm
• Visual speech: talking head for inputs in different languages
– Rendering the talking head's lip movements, given speech/text inputs in different languages