Rapid Unsupervised Speaker Adaptation for HMM-based Text-to-Speech
Synthesis
ABSTRACT
The concatenative synthesis method has been the dominant approach in text-to-speech (TTS) synthesis over the last decade.
Despite its success, the concatenative approach has several disadvantages. One is the spurious errors that appear during
synthesis, which can significantly distract the listener. A second is the difficulty of modifying voice characteristics,
voice style, and emotions.
Recently, the statistical speech synthesis (SSS) approach has been proposed, which can address both of these problems. In
the SSS approach, statistical models are used to synthesize the speech sounds. The SSS approach can generate smooth
synthetic speech that is free of the spurious errors of the concatenative synthesis methods. Moreover, voice style and
characteristics, as well as emotions, can be easily transformed. In the latest Blizzard Challenge TTS evaluations, one SSS
system outperformed the concatenative synthesis techniques in Mean Opinion Score (MOS) quality tests.
The high-quality, intelligible speech they generate, the flexibility they offer in voice/speaker/emotion conversion, and
their small memory requirements make SSS systems strong candidates to replace the concatenative systems that are the most
popular TTS systems in use today. SSS systems have already started to enable high-quality embedded speech synthesis
products because of their small memory footprint. Moreover, SSS technology is receiving increasing attention from
companies that offer server- or PC-based TTS applications because of its competitive voice quality and flexibility in
voice conversion. The success of current SSS systems is expected to open new research avenues that will lead to new
discoveries and potentially make SSS the dominant TTS technology in the next decade.
Asst. Prof. Dr.
Cenk Demiroğlu
DEPARTMENT
Electrical & Electronics
Engineering
CONTACT
[email protected]
FUNDING SCHEME
TÜBİTAK 3501
One of the most exciting research directions in the SSS field is speaker adaptation, where the goal is to adapt the voice
model to a target speaker who is not in the training data. Maximum a posteriori (MAP) and maximum likelihood linear
regression (MLLR) are two commonly used adaptation approaches. The MLLR method performs better than the MAP method when
the amount of adaptation data is small. Therefore, MLLR adaptation is more suitable for rapid adaptation. Several
variations of the MLLR technique, as well as combinations of the MLLR and MAP techniques, are used in the context of SSS.
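The difference between the two methods with little adaptation data can be illustrated with a toy sketch. It is a simplified, one-dimensional illustration, not the project's implementation: the function names, the prior weight `tau`, and the example numbers are all assumptions, and the MLLR step is reduced to a least-squares fit of one global transform (which corresponds to the maximum-likelihood solution only under hard state assignments and equal variances).

```python
def map_adapt(means, data, tau=10.0):
    """MAP mean update: each model mean is interpolated with its own data.

    means: {state: prior mean}, data: {state: [observed frames]}.
    States with no adaptation data keep their prior mean.
    """
    adapted = dict(means)
    for state, frames in data.items():
        n = len(frames)
        adapted[state] = (tau * means[state] + sum(frames)) / (tau + n)
    return adapted


def mllr_adapt(means, data):
    """MLLR-style update: one global transform mu' = a * mu + b.

    The transform is fit on the observed states only (least squares,
    assuming equal variances) and then applied to every mean,
    including states that never appear in the adaptation data.
    """
    xs, ys = [], []
    for state, frames in data.items():
        for frame in frames:
            xs.append(means[state])
            ys.append(frame)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx > 0 else 1.0  # fall back to a bias-only shift
    b = mean_y - a * mean_x
    return {state: a * mu + b for state, mu in means.items()}


# Hypothetical example: the target speaker follows mu' = 2 * mu + 1,
# but only states "a" and "b" are seen in the adaptation data.
prior_means = {"a": 0.0, "b": 1.0, "c": 2.0}
adaptation_data = {"a": [1.0, 1.0], "b": [3.0, 3.0]}

map_means = map_adapt(prior_means, adaptation_data)
mllr_means = mllr_adapt(prior_means, adaptation_data)
```

In this example the shared transform moves the unseen state "c" to the target speaker's value, while MAP leaves it at its prior mean and shifts the seen states only slightly, which is the behaviour that makes MLLR attractive for rapid adaptation.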
START DATE
01.03.2010
Unsupervised adaptation is difficult to achieve with SSS because of the rich context information attached to speech sounds.
It is very difficult to generate the correct context using speech recognition tools, as is commonly done in unsupervised
adaptation for speech recognition systems. There is only one paper on unsupervised adaptation for SSS [1]. The idea
proposed in [1] is to extract only the triphone context, ignoring other information such as the syllable and the location
of the sound within the syllable.
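The reduction from a rich context label to a triphone can be sketched as follows. This is only an illustration of the idea, not the method of [1]: it assumes an HTS-style full-context label in which the phone fields are written as `p1^p2-p3+p4=p5` and followed by further context fields, and both the function name `to_triphone` and the example label are hypothetical.

```python
import re

# Matches the "previous-current+next" phones inside an HTS-style label,
# i.e. the p2, p3 and p4 fields of "p1^p2-p3+p4=p5@...".
_TRIPHONE_PATTERN = re.compile(r"\^([^-]+)-([^+]+)\+([^=]+)=")


def to_triphone(full_context_label):
    """Return 'left-center+right', dropping all other context fields."""
    match = _TRIPHONE_PATTERN.search(full_context_label)
    if match is None:
        raise ValueError("label does not contain a p1^p2-p3+p4=p5 field")
    left, center, right = match.groups()
    return f"{left}-{center}+{right}"


# Hypothetical full-context label with syllable and position fields.
example_label = "x^sil-h+e=l@1_2/A:0_0_0/B:1-1-2"
triphone = to_triphone(example_label)
```

A triphone label of this kind is what a speech recognizer can plausibly produce from untranscribed audio, whereas the discarded syllable and position fields would normally require the text.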
OZU BUDGET
108,400.00 TL
2010 National Grants
DURATION
36 months