Although hidden Markov modeling (HMM)
is the dominant technology for acoustic modeling in automatic
speech recognition today, many of its weaknesses are well
known and have become the focus of intensive research. One
prominent weakness of current HMMs is their difficulty in representing
long-span temporal dependency in the acoustic feature sequence of
speech, even though such dependency is an essential property of speech
dynamics. The main cause of this difficulty is the conditional IID
(independent and identically distributed) assumption inherent in the HMM
formalism. Furthermore, in the standard HMM approach the focus is on
verbal information. However, experiments have shown that non-verbal
information also plays an important role in human speech recognition,
a role that the HMM framework has not attempted to address directly. Numerous
approaches have been taken over the past dozen years to address the
above weaknesses of HMMs. These approaches can be broadly classified
into the following two categories.
The first, parametric, structure-based approach establishes
mathematical models for stochastic trajectories/segments of speech
utterances using various forms of parametric characterization,
including polynomials, linear dynamic systems, and nonlinear dynamic
systems embedding hidden structure of speech dynamics. In this
parametric modeling framework, systematic speaker variation can also be
satisfactorily handled. The essence of such a hidden-dynamic approach
is that it exploits knowledge and mechanisms of human speech production
so as to provide the structure of the multi-tiered stochastic process
models. A specific layer in this type of model represents long-range
temporal dependency in a parametric form.
The second, non-parametric, template-based approach to overcoming
the HMM weaknesses involves direct exploitation of speech feature
trajectories (i.e., “templates”) in the training data
without any modeling assumptions. Owing to the dramatic growth in the
size of speech databases and in the computer storage capacity available
for training, as well as the exponential growth in computational power,
non-parametric methods using the traditional pattern recognition
techniques of kNN (k-nearest-neighbor decision rule) and DTW (dynamic
time warping) have recently received substantial attention. Such
template-based methods have also been called exemplar-based or
data-driven techniques in the literature.
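As a rough illustration of the template-based idea, the sketch below classifies a query feature sequence by its DTW distance to stored templates and a kNN vote. It is a minimal textbook formulation, not the method of any particular system discussed in this session; the function names, the Euclidean frame distance, the unconstrained warping path, and the toy one-dimensional features are all assumptions made for the example.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (assumed local cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best monotonic alignment of two sequences.

    Uses the basic step pattern (insertion, deletion, match) with no
    slope constraints or path-length normalization.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_a frame repeated
                                 cost[i][j - 1],      # seq_b frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def knn_classify(query, templates, k=1):
    """Label a query by majority vote among its k DTW-nearest templates.

    templates is a list of (label, sequence) pairs (hypothetical layout).
    """
    ranked = sorted(templates, key=lambda t: dtw_distance(query, t[1]))
    votes = {}
    for label, _ in ranked[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


# Toy templates with one-dimensional "features", for illustration only.
templates = [
    ("yes", [[0.0], [1.0], [2.0]]),
    ("no", [[5.0], [5.0], [4.0]]),
]
query = [[0.0], [0.0], [1.0], [2.0]]  # a time-stretched version of "yes"
print(knn_classify(query, templates, k=1))
```

Because every stored template must be compared with the query, the cost of this scheme grows with the size of the training set, which is why storage capacity and computational power matter for this approach.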
The purpose of this special session is to bring together researchers
who have special interest in novel techniques that are aimed at
overcoming weaknesses of HMMs for acoustic modeling in speech
recognition. In particular, we plan to address issues related to the
representation and exploitation of long-range temporal dependency in
speech feature sequences, the incorporation of fine phonetic detail in
speech recognition algorithms and systems, comparisons of the pros and
cons of the parametric and non-parametric approaches, and the
computational resource requirements of the two approaches.
This special session addresses key issues of
Sound to Sense (S2S), a Marie Curie Research Training Network
that started in 2007. S2S's unifying theme is the role of fine phonetic detail (FPD) in speech processing. This special session
focuses on alternative theoretical and computational modeling paradigms for encoding FPD.
The special session is on Wednesday, August 29, 10:00 – 12:00, in the Astrid Park Plaza (APP) hotel. This hotel is near the Flanders Congress & Concert Centre (FCCC), on the same square. We start with a small 45-minute poster session, followed by 45 minutes for three oral presentations, and we end with the panel discussion. Information about this special session can also be found at http://www.interspeech2007.org/Technical/structure_template_based_asr.php
Session organizers:
Li Deng <deng [at] microsoft.com>
Helmer Strik <strik [at] let.ru.nl>