IEEE SIGNAL PROCESSING MAGAZINE [47] MARCH 2015
middle ear. This filter is cascaded with an auditory filter bank that
models processing at the level of the basilar membrane in the
cochlea. Subsequently, the envelope of each of the outputs of the
auditory filters is obtained, which simulates the transduction of
the inner hair cells. To model an absolute hearing threshold, a
constant is added to each envelope. In the current context, this
threshold corresponds to an interpretation noise. In the final
stage, a log transform is used to model the loudness-dependent
compression of the auditory filter bank outputs by the outer hair
cells. An important difference between the model from [21] and
the more advanced model presented in [20] is the logarithmic
transform, which is a simplification of the adaptation loops that
are used in [20]. The simplification particularly affects the output
near transitions where the gain of adaptation loops changes.
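The back end of the simplified model described above (envelope extraction, an additive constant for the absolute threshold, and a log transform) can be sketched as follows. This is a minimal illustration, not the model of [21]: the envelope extractor (half-wave rectification plus a one-pole low-pass), the function names, and the parameter values `alpha` and `threshold` are all assumptions, and the middle-ear filter and auditory filter bank are taken as given.

```python
import math

def rectify_and_smooth(band_signal, alpha=0.9):
    """Crude envelope estimate (an assumption, not the method of [21]):
    half-wave rectification followed by a one-pole low-pass filter."""
    envelope, state = [], 0.0
    for x in band_signal:
        state = alpha * state + (1.0 - alpha) * max(x, 0.0)
        envelope.append(state)
    return envelope

def log_compressed_envelopes(band_envelopes, threshold=1e-3):
    """Add a constant to each band envelope (standing in for the absolute
    hearing threshold), then apply the log transform."""
    return [[math.log(e + threshold) for e in band] for band in band_envelopes]

# Two hypothetical filter-bank outputs, one list of samples per band.
bands = [[0.0, 1.0, -1.0, 0.5], [0.2, 0.2, 0.2, 0.2]]
envelopes = [rectify_and_smooth(b) for b in bands]
representation = log_compressed_envelopes(envelopes)
```

Because the constant is added before the logarithm, bands whose envelope falls well below `threshold` all map to nearly the same value, which is how the constant acts like a noise floor.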
By applying an auditory model to the acoustic sequences $a_T$ and
$a_L$ and comparing the results, a distortion measure can be
obtained. Mutual information is a natural measure for this purpose
but, to the best of our knowledge, it has not been applied to the
auditory representation for intelligibility enhancement. Note that
while mutual information is not affected by smooth invertible
mappings, auditory representations likely are not smooth map-
pings from features such as cepstra, or line spectral frequencies.
This suggests that it may be essential to consider the detailed
behavior of more sophisticated auditory models.
In the literature, various measures have been used to compare
the auditory representations of $a_T$ and $a_L$. In [14], it was shown
that an $\ell_1$ criterion leads to a mathematically tractable method
and provides good results for intelligibility enhancement. Reference
[13] uses a similar auditory model for the so-called glimpse
proportion measure of intelligibility: rather than comparing $a_T$
and $a_L$ directly, it compares the auditory representation of $a_T$
with the auditory representation of the environmental noise $v_E$.
The glimpse proportion approach computes the proportion of sig-
nal blocks where the auditory representation of the signal is
louder than the noise. In more recent work on the glimpse pro-
portion, a sigmoidal function is applied to the difference of the
auditory signal and noise representations [6], [22]. The method
provides good intelligibility enhancement [6], [22], [23]. Neither the
$\ell_1$ criterion nor the glimpse proportion approach explicitly
considers the information conveyed in a particular signal component,
which should, at least in principle, be a disadvantage compared to
mutual information-based approaches.
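The two glimpse proportion variants described above can be sketched as follows. This is only an illustration under stated assumptions: the representations are taken as nested lists indexed by band and block, the sigmoid is assumed to be a logistic function, and the `slope` parameter and function names are hypothetical rather than taken from [6], [13], or [22].

```python
import math

def _paired_blocks(signal_rep, noise_rep):
    """Pair up corresponding blocks of the two auditory representations."""
    return [(s, n)
            for s_band, n_band in zip(signal_rep, noise_rep)
            for s, n in zip(s_band, n_band)]

def glimpse_proportion(signal_rep, noise_rep):
    """Proportion of blocks where the signal representation exceeds the noise."""
    pairs = _paired_blocks(signal_rep, noise_rep)
    return sum(1 for s, n in pairs if s > n) / len(pairs)

def soft_glimpse_proportion(signal_rep, noise_rep, slope=1.0):
    """Smooth variant: a logistic sigmoid (assumed form) of the difference
    between the signal and noise representations, averaged over all blocks.
    0.5 * (1 + tanh(x / 2)) equals the logistic function and cannot overflow."""
    pairs = _paired_blocks(signal_rep, noise_rep)
    total = sum(0.5 * (1.0 + math.tanh(0.5 * slope * (s - n))) for s, n in pairs)
    return total / len(pairs)

# Hypothetical representations: 2 bands x 2 blocks.
signal = [[1.0, 0.0], [2.0, 3.0]]
noise = [[0.5, 1.0], [1.0, 4.0]]
hard = glimpse_proportion(signal, noise)
soft = soft_glimpse_proportion(signal, noise, slope=2.0)
```

The smooth variant is differentiable in the signal representation, which is convenient when the measure is used as an optimization criterion.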
MEASURES OPERATING ON SPECTRAL BAND POWERS
The mutual information between $M_L$ and $M_T$ (6) can be seen to
correspond to a classic view of intelligibility based on band powers
of the auditory filter bank [17], [18], [24]–[27], by writing it as

$$I(M_L; M_T) = \sum_i \tilde{I}_i \, A_i(\xi_i) \tag{7}$$

$$\tilde{I}_i = -\frac{1}{2} \log\left(1 - \rho_{0,i}^2\right) \tag{8}$$

$$A_i(\xi_i) = \frac{\log \dfrac{(1 - \rho_{0,i}^2)\,\xi_i + 1}{\xi_i + 1}}{\log\left(1 - \rho_{0,i}^2\right)}. \tag{9}$$

The maximum mutual information is attained at high SNR and is
$\sum_i \tilde{I}_i$. Defining $I_i = \tilde{I}_i / \sum_j \tilde{I}_j$
and normalizing (7) accordingly, we recognize $I_i$ as the so-called
band-importance function and $A_i(\xi_i)$ as the so-called weighting
function or band-audibility function. The
formulation (7) forms the basis of speech intelligibility measures
such as the SII [18] and the extended SII [27]. These measures are
descendants of the so-called articulation index [24], [25], a measure
that predates information theory. In this classic view, $I_i$
characterizes the importance of frequency band $i$ and the factor
$A_i$ is a weighting function that indicates what fraction of the
information is delivered to the listener. The information-theory
derived form of $A_i$ shown in (9) describes a sigmoidal function that
approximates the definition of $A_i$ in the SII. [Equation (9)
neglects the threshold of hearing, the effect of high loudness, and
the self-masking of noise.]
Our derivation of the band importance function $I_i$ of (8) makes its
dependency on the production and interpretation noise explicit. If
the relative variances of the production and interpretation noise of a
band are low (high production and interpretation SNR; $\rho_{0,i}$
approaches one), that band is important for intelligibility. In the SII
definition, the values of $I_i$ are set empirically. As is shown in [15],
the differences between the formulas for the classic approach and
the aforementioned information-theoretical derivation are well
within the precision of the original heuristic derivation of the classic
view. The classic SII has proven to be highly correlated with speech
intelligibility in many conditions and has been used as a basis for
speech intelligibility enhancement [4], [8], [12], [28]. It is discussed
in additional detail in the section “SII-Based Enhancement.”
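The band-importance and band-audibility terms of (7)-(9) can be evaluated numerically with the short sketch below. It assumes the reading of the formulas in which the maximum information of band i is -(1/2) log(1 - rho0_i^2) and the audibility is log(((1 - rho0_i^2) xi_i + 1) / (xi_i + 1)) / log(1 - rho0_i^2), with xi_i the band SNR on a linear scale and 0 < rho0_i < 1. The natural logarithm, the function names, and the example values are assumptions.

```python
import math

def max_band_information(rho0):
    """Maximum (high-SNR) information of a band, as in (8); requires 0 < rho0 < 1."""
    return -0.5 * math.log(1.0 - rho0 ** 2)

def band_audibility(xi, rho0):
    """Band-audibility function as in (9): 0 at zero SNR, tending to 1 at high SNR."""
    q = 1.0 - rho0 ** 2
    return math.log((q * xi + 1.0) / (xi + 1.0)) / math.log(q)

def mutual_information(xis, rho0s):
    """Sum over bands of maximum information times audibility, as in (7)."""
    return sum(max_band_information(r) * band_audibility(x, r)
               for x, r in zip(xis, rho0s))

def normalized_index(xis, rho0s):
    """SII-like index in [0, 1]: (7) divided by its high-SNR maximum, which
    is the same as weighting each audibility by the band importance I_i."""
    total = sum(max_band_information(r) for r in rho0s)
    return mutual_information(xis, rho0s) / total

# Hypothetical three-band example: linear band SNRs and correlations.
band_snrs = [0.1, 1.0, 10.0]
band_rho0 = [0.8, 0.9, 0.95]
index = normalized_index(band_snrs, band_rho0)
```

In this sketch a band with rho0 close to one contributes a large maximum information, consistent with the statement that such a band is important for intelligibility.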
CONSTRAINTS ON OPTIMIZATION
In most cases, the optimization must be performed subject to one
or more constraints. Important constraints are the speech-like
nature of the output, the signal power, and system delay. Additional
constraints may be required. For instance, for a given message $M_T$
(and speaking rate), a longer word sequence will likely be more
intelligible than a short one, thus making a length constraint
natural.
The speech-like nature, or the speech quality, of the enhanced
output may require an explicit constraint. However, in most prac-
tical systems the speech-like nature is enforced implicitly by either
the modification strategy, or the optimization criterion, or both.
Modification strategies such as slowly varying spectral shaping
permit only speech-like output. The maximum probability of
correct phoneme recognition is an example of a criterion that
favors signal features that resemble those of clean speech.
Signal power is a natural constraint. The unconstrained optimi-
zation of signal spectral modifications may lead to an unbounded
increase of the signal power if the reduction in recognition perfor-
mance of the human auditory system for loud sounds is not consid-
ered. Thus, a power constraint must be applied to prevent hearing
injuries and loudspeaker damage. Approximations to perceived
loudness, either in the form of an analytic expression, or in the
form of an algorithm, may also be used as constraints.
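One simple way to impose a power constraint on a spectral modification is to rescale the band gains so that the total power is unchanged. The sketch below illustrates that idea only; the equal-power formulation, the function name, and the example values are assumptions and not a method prescribed here.

```python
def apply_power_constraint(band_powers, band_gains):
    """Rescale per-band power gains so that the total power after the
    modification equals the total power before it (assumed constraint)."""
    original = sum(band_powers)
    modified = sum(g * p for g, p in zip(band_gains, band_powers))
    if modified <= 0.0:
        raise ValueError("modified signal must have positive power")
    scale = original / modified
    return [g * scale for g in band_gains]

# Hypothetical example: two bands of equal power, boosted unequally.
powers = [1.0, 1.0]
gains = [2.0, 6.0]
constrained = apply_power_constraint(powers, gains)
```

The rescaling keeps the relative spectral shaping chosen by the optimizer and removes only the overall increase in level.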
The system delay must be constrained in real-time systems.
This may prevent the usage of particular distortion measures
and modification operators.