
IEEE SIGNAL PROCESSING MAGAZINE [51] MARCH 2015
straightforward optimization problems that can be solved using the
Karush–Kuhn–Tucker conditions. The resulting analytic solutions
are easy to implement. The later work of [8] models $A_i(E_i, D_i)$ more accurately at low SNR values and provides improved performance over the original work of [4] under low SNR conditions.
The discussion in this section assumed stationarity. Time variation can be accounted for by recursively updating the equivalent spectrum levels $E_i$ and $D_i$ and periodically recomputing the gains $g_i$ [4]. This is consistent with the SII update described in [27], which uses frequency-dependent temporal windows.
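As a hedged illustration, the recursive update could be realized with exponential smoothing of the per-band levels, followed by a periodic gain recomputation under a power constraint. The function names, the forgetting factor, and the simple SNR-based gain rule below are illustrative assumptions, not the closed-form solution of [4]:

```python
import numpy as np

def update_levels(E, D, speech_pow, noise_pow, alpha=0.9):
    """Recursively smooth the per-band equivalent speech (E) and
    disturbance (D) spectrum levels with a forgetting factor alpha."""
    E = alpha * E + (1.0 - alpha) * speech_pow
    D = alpha * D + (1.0 - alpha) * noise_pow
    return E, D

def recompute_gains(E, D, total_power):
    """Placeholder gain rule: apply more gain where the band SNR is
    low, then renormalize so the overall output power is unchanged."""
    snr = E / np.maximum(D, 1e-12)
    g = 1.0 / np.sqrt(np.maximum(snr, 1e-12))
    # renormalize to satisfy the power constraint sum(g^2 * E) = total_power
    g *= np.sqrt(total_power / np.sum(g**2 * E))
    return g
```

The smoothing step runs every frame, while the (costlier) gain recomputation can run at a slower rate, mirroring the periodic update described above.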
WORD-SEQUENCE PROBABILITY-BASED ENHANCEMENT
The section “Defining Intelligibility” identified the suitability of the
expected probability of correct message recognition as a measure for
optimizing intelligibility at a high level of abstraction. We noted in
the section “Measures Operating on a Word Sequence” that, under
an ergodicity assumption, the expectation over messages can be
approximated by averaging over time.
Optimizing a measure derived from
the probability of correct recognition
under a power constraint has been
shown to provide significant intelligi-
bility gain assuming that accurate
sound segmentation information and
an appropriate acoustic speech model
are available [11]. We emphasize that the method assumes that the ASR word probability tracks human recognition performance, which was found to be true in [11] but is not guaranteed. Here we provide more detail about this approach.
To make high-level machine-based optimization feasible in
practice, we can represent the message at the phoneme level. This
means we refine our Markov chain to include an intermediate level. The chain now becomes
$$M_T \rightarrow u_T \rightarrow a_T \rightarrow a_L \rightarrow u_L \rightarrow M_L,$$
where $u_T$ and $u_L$ denote the talker and listener phoneme sequences, respectively. By first performing time alignment of a sequence of acoustic feature vectors $a_T$ and a sequence of phonemes $u_T$ by means of an ASR engine, a practical intelligibility enhancement approach can be defined. The ASR speech model can then be used to provide the probability densities that characterize clean speech sounds in the acoustic feature space.
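As a rough sketch of how an acoustic model supplies such densities, the fragment below scores a time-aligned pairing of feature vectors and phoneme labels under hypothetical diagonal-Gaussian phoneme models. Real ASR engines use HMMs with mixture densities, so the function names and the single-Gaussian model form are illustrative assumptions:

```python
import numpy as np

def log_density(x, mean, var):
    """Log of a diagonal-Gaussian density evaluated at feature vector x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def alignment_log_likelihood(features, phonemes, models):
    """Sum of per-frame log-densities for a time-aligned pairing of
    acoustic feature vectors (a_T) and phoneme labels (u_T).
    `models` maps each phoneme to a (mean, var) pair."""
    return sum(log_density(x, *models[ph]) for x, ph in zip(features, phonemes))
```

A correct alignment should score higher than a shuffled one, which is the property an aligner exploits.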
To enhance intelligibility, we want to find the parameters $C^*$ of our speech modification scheme that maximize the average probability that the listener-interpreted phoneme sequence $u_L$ is the talker-generated sequence $u_T$:
$$C^* = \arg\max_C \, p_{u_L \mid u_T}(u_T \mid u_T, C), \qquad (14)$$
where the subscripts of the density label the density it represents. Note that the densities are consistent with the models shown in (4).
Simplifications were introduced in [11] to make the optimization tractable. It was tacitly assumed that the message is accurately represented by the phonemes, and production noise was not formally considered. It was also assumed that $v_E$ (the representation of the noise) can be approximated as deterministic, which is reasonable for typical acoustic signal representations and stationary noise. The only remaining uncertainty is due to the interpretation noise in the mapping from $a_L$ to $u_L$. In an ASR system based on an HMM, this is modeled by the observation noise.
Equation (14) can now be approximated by
$$C^* \approx \arg\max_C \, p_{u_L \mid a_L}\!\left(u_T \mid \hat{a}_L(u_T, C)\right) \qquad (15)$$
$$= \arg\max_C \, p_{a_L \mid u_L}\!\left(\hat{a}_L \mid u_T, C\right) p_{u_L}(u_T) \left( \sum_{u'} p_{a_L \mid u_L}\!\left(\hat{a}_L \mid u', C\right) p_{u_L}(u') \right)^{-1}, \qquad (16)$$
where we used Bayes’ rule and where $\hat{a}_L(u_T, C)$, abbreviated to $\hat{a}_L$, is the set of acoustic features observed by the listener, which is modeled as a deterministic function of the talker phoneme sequence $u_T$ and the speech modification parameters $C$. The first term of (16) is the likelihood of the talker phoneme sequence for the observed features $\hat{a}_L$, the second term is the a priori probability that the phoneme sequence $u_T$ is decoded by the listener, and the third term is the inverse a priori probability of the listener-observed features. Optimizing only the likelihood term reduces complexity and provides good results [11].
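A toy discrete sketch of this Bayes’-rule computation over a handful of candidate phoneme sequences might look as follows. The helper names, the brute-force search over a small set of parameter settings $C$, and the discrete likelihood tables are illustrative assumptions, not the ASR-based implementation of [11]:

```python
import numpy as np

def posterior(likelihoods, priors):
    """Bayes' rule as in (16): posterior of each candidate phoneme
    sequence given the observed features, from the per-candidate
    likelihoods p(a_L | u, C) and the priors p(u). The normalizing
    sum corresponds to the inverse third term of (16)."""
    joint = np.asarray(likelihoods) * np.asarray(priors)
    return joint / joint.sum()

def best_parameters(candidates, u_T, priors):
    """Pick the modification parameters C maximizing the posterior of
    the talker sequence (index u_T); `candidates` maps each C to the
    likelihoods of every candidate sequence under that modification."""
    return max(candidates, key=lambda C: posterior(candidates[C], priors)[u_T])
```

Restricting the objective to the first (likelihood) factor of (16) would amount to replacing `posterior(...)[u_T]` with `candidates[C][u_T]` in the `key` function.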
The theory is simplest to implement if the sequences are con-
sidered stationary. The averaging of (16) over long time intervals
(multiple sentences) is then preferred. In a practical implementa-
tion, shortcuts may have to be made due to requirements on delay
and complexity and because the stationarity assumption may not
be sufficiently accurate.
A system-level perspective of the proposed approach is
shown in Figure 3. In [11], the approach was validated for a
combination of two modifications: prosody-affecting phoneme
gain adjustment and a spectral modification redistributing the
signal energy across frequency bands. The method compared
favorably to a method based on the optimization of a measure
operating on a sequence of auditory states [14], discussed in the
section “Measures Operating on a Sequence of Auditory States.”
Results reported in [9] suggest that using the full Bayesian
approach rather than optimizing only the likelihood component
of (16) improves performance.
In text-to-speech applications it may be possible to select
from a set of phrases to convey a particular message. The mea-
sure given in (16) has also been used to determine the optimal
phrasing of utterances [19]. This study indicates that maximiz-
ing the probability of correct interpretation of the phoneme
sequence increases intelligibility. Considering prior information
on the predictability of various formulations is expected to fur-
ther enhance performance.
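Given any scoring function implementing such a measure, the phrase selection itself reduces to an argmax over the candidate formulations. The sketch below is a minimal hedged illustration, with the scoring function left abstract:

```python
def select_phrasing(phrases, intelligibility_score):
    """Among alternative phrasings of the same message, return the one
    whose predicted probability of correct interpretation (e.g., the
    measure in (16)) is highest. `intelligibility_score` maps a phrase
    to a scalar score and stands in for the full model-based measure."""
    return max(phrases, key=intelligibility_score)
```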