IEEE SIGNAL PROCESSING MAGAZINE [46] MARCH 2015
acoustic features are based on the
short-time discrete Fourier trans-
form (DFT) coefficients, variance
estimation can be based on the
short-time DFT periodogram, i.e.,
$\hat{\sigma}^2_{a_{T,i}} = |a_{T,i}|^2$,
having a variance of
$\left(\mathrm{E}\left[|a_{T,i}|^2\right]\right)^2$.
The low-level measure
(6) can then be used directly to opti-
mize speech intelligibility [15].
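The variance statement above can be checked numerically. The sketch below assumes that a DFT coefficient is a zero-mean circular complex Gaussian variable, an assumption the text does not state; the function name `periodogram_stats` and the sample size are illustrative. Under that assumption the periodogram value is exponentially distributed, so its variance equals the square of its mean.

```python
import numpy as np

def periodogram_stats(sigma2, n_draws=200_000, seed=0):
    """Return the sample mean and variance of |a|^2 for a ~ CN(0, sigma2)."""
    rng = np.random.default_rng(seed)
    # Assumption: circular complex Gaussian coefficient, so the real and
    # imaginary parts each carry half of the variance sigma2.
    scale = np.sqrt(sigma2 / 2.0)
    a = rng.normal(0.0, scale, n_draws) + 1j * rng.normal(0.0, scale, n_draws)
    p = np.abs(a) ** 2  # one periodogram value per draw
    return p.mean(), p.var()

mean_p, var_p = periodogram_stats(sigma2=2.0)
# The last number should be close to 1 if the variance equals the squared mean.
print(mean_p, var_p, var_p / mean_p**2)
```

A single periodogram value is therefore a noisy variance estimate: its standard deviation is as large as the quantity being estimated.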
The frequency resolution of the human auditory system
decreases with frequency, which reduces the mutual information
from that obtained with (6) for a uniform high resolution. An
improved model of information transfer is obtained by assuming
that the signal is represented with one independent component
per equivalent rectangular bandwidth (ERB), which is consistent
with studies on intelligibility [17]. We show in the section “Mea-
sures Operating on Spectral Band Powers” that this approach pro-
vides an information-theoretical justification of the well-known
SII [18], a low-level measure of intelligibility.
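To make the "one component per ERB" idea concrete, the following sketch partitions a frequency range into bands that are each one ERB wide. It assumes the Glasberg and Moore (1990) ERB-rate formula, which the text does not name; the function names and the 100 Hz to 8 kHz range are illustrative choices.

```python
import numpy as np

def erb_number(f_hz):
    """ERB-rate scale (Glasberg and Moore, 1990); an assumed choice of scale."""
    return 21.4 * np.log10(4.37e-3 * np.asarray(f_hz, dtype=float) + 1.0)

def erb_number_to_hz(e):
    """Inverse of erb_number."""
    return (10.0 ** (np.asarray(e, dtype=float) / 21.4) - 1.0) / 4.37e-3

def erb_band_edges(f_lo, f_hi):
    """Edges of consecutive bands, each one ERB wide, starting at f_lo."""
    e_lo, e_hi = erb_number(f_lo), erb_number(f_hi)
    n_bands = int(np.floor(e_hi - e_lo))  # whole ERBs that fit below f_hi
    return erb_number_to_hz(e_lo + np.arange(n_bands + 1))

edges = erb_band_edges(100.0, 8000.0)
print(len(edges) - 1, "bands")
print(np.diff(edges)[[0, -1]], "Hz: widths of the first and last band")
```

The band widths grow with frequency, so a uniform high-resolution DFT grid contributes far fewer independent components at high frequencies than its bin count suggests.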
PRACTICAL MEASURES OF INTELLIGIBILITY
Existing practical measures of intelligibility generally operate at
the word-sequence level, at the level of a sequence of auditory
states, or at the level of short-term spectra. We discuss these
classes next and end with a discussion of the constraints that must
be imposed on the optimization.
MEASURES OPERATING ON A WORD SEQUENCE
In the section “Defining Intelligibility,” we discussed that the expected
probability of correct interpretation of the message,
$\mathrm{E}[p_{L|T}(\breve{M}_T \mid \breve{M}_T)]$, is a reasonable measure
of intelligibility. This measure can be approximated as
$\overline{p_{L|T}(\check{M}_T \mid \check{M}_T)}$ on real-world data, where
the overbar indicates averaging over realizations $\check{M}_T$. If the
averaging is done in time, i.e., over segments of a single larger message
(e.g., words), then this operation assumes ergodicity. The measure is easily
evaluated in a test with human test subjects, where
$p_{L|T}(\check{M}_T \mid \check{M}_T)$ can be estimated using histograms. A
machine-based quantitative measure requires a mapping from any particular
acoustic observation $a_L$ to a message $\check{M}_L$ that captures the
probabilistic nature of this mapping as performed by
humans. As will be discussed in the section “Word-Sequence
Probability-Based Enhancement,”
the standard approach to ASR com-
putes the probability of the observa-
tions given a message (word, or word
sequence). The basic assumption for
machine-based intelligibility en-
hancement is then that the trend of
ASR word probability in noise tracks
the trend of human recognition per-
formance in noise sufficiently well for the modification parame-
ters that are optimized. Experiments confirmed this hypothesis
[11], [19] for a particular set of practical systems.
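The histogram estimate described for listening tests can be sketched as follows. The trial data are made up for illustration, and the function name `correct_interpretation_rate` is hypothetical; the sketch only shows the counting, not how a test should be designed.

```python
from collections import Counter

def correct_interpretation_rate(pairs):
    """Estimate p(M_L = m | M_T = m) per message and its average over trials.

    pairs: list of (conveyed_message, interpreted_message) tuples.
    """
    conveyed = Counter(m_t for m_t, _ in pairs)
    correct = Counter(m_t for m_t, m_l in pairs if m_t == m_l)
    # Histogram estimate of p(m | m) for each conveyed message m.
    per_message = {m: correct[m] / n for m, n in conveyed.items()}
    # Average over realizations of the conveyed message (one term per trial).
    average = sum(per_message[m_t] for m_t, _ in pairs) / len(pairs)
    return per_message, average

# Hypothetical listening-test outcomes: (word spoken, word reported).
trials = [("cat", "cat"), ("cat", "cap"), ("dog", "dog"), ("dog", "dog"),
          ("ship", "chip"), ("ship", "ship")]
per_message, average = correct_interpretation_rate(trials)
print(per_message, average)
```

Averaging over trials weights each message by how often it was conveyed, so the result equals the overall fraction of correctly interpreted trials. A machine-based measure would replace the human responses with recognizer outputs or probabilities.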
MEASURES OPERATING ON A SEQUENCE
OF AUDITORY STATES
It is advantageous to minimize the delay and computational
requirements of the intelligibility measure, particularly if the
types of modification are restricted. Let us assume that the modi-
fication is a spectral modification, that the word sequence and
speaking rate are fixed, and that the highest intelligibility is
achieved by the original speech without environmental noise.
(The latter assumption is an additional simplification required for
this approach.) Then it is natural to use a distortion measure
operating on the sequence of auditory states as a measure of
intelligibility. Such measures can exploit the fact that quantitative
knowledge of the auditory periphery has increased significantly in the
last three decades (e.g., [20]).
A direct comparison of the auditory states of the conveyed
and received signals ignores the production noise $v_T$ of (4). That is,
the auditory model does not weigh signal components according to
their relevance in terms of precision of signal production. However,
the auditory model precision of a speech component may form a
reasonable match to the precision of speech production, simplifying
the introduction of production noise.
Although auditory models differ in exactly how the inner ear
representation is obtained, they follow in many cases a similar
strategy for modeling the auditory system. In Figure 1, we outline
the basic building blocks of the psychoacoustic model presented in
[21], which is simple but representative of many other models,
such as [20]. The first stage of the auditory model consists of a fil-
ter that mimics the frequency characteristics of the outer and
[FIG1] The basic structure of the auditory model presented in [21]. Blocks shown: Outer-Middle Ear Filter, Auditory Filterbank, Envelope Follower, + Constant, Log Transform, Inner Ear Representation. Plot axes: level (dB) and response, each versus frequency (Hz).
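A distortion measure on auditory states can be sketched with the blocks named in Figure 1. The code below is a toy illustration, not the model of [21]: the order of the stages, both filter shapes, the frame length, the added constant, and the squared-error distortion are all assumptions, and the envelope follower is approximated by the band power per frame.

```python
import numpy as np

def erb_hz(fc):
    # ERB at centre frequency fc (Glasberg and Moore, 1990); assumed here.
    return 24.7 * (4.37e-3 * fc + 1.0)

def inner_ear_representation(x, fs, n_bands=24, frame=512, constant=1e-3):
    """Toy inner-ear representation: one log band-power vector per frame."""
    n_frames = len(x) // frame
    frames = x[: n_frames * frame].reshape(n_frames, frame) * np.hanning(frame)
    spec = np.fft.rfft(frames, axis=1)
    f = np.fft.rfftfreq(frame, 1.0 / fs)

    # Stage 1 (assumed shape): band-pass weighting standing in for the
    # outer-middle ear filter, peaking near 2.2 kHz.
    ear = 1.0 / (1.0 + ((f - 2200.0) / 2500.0) ** 2)

    # Stage 2 (assumed): fourth-order gammatone-like magnitude responses.
    fc = np.geomspace(100.0, 0.9 * fs / 2.0, n_bands)
    b = 1.019 * erb_hz(fc)[:, None]
    bank = (1.0 + ((f[None, :] - fc[:, None]) / b) ** 2) ** -2.0

    # Stage 3: envelope follower, approximated by the band power per frame.
    power = (np.abs(spec * ear) ** 2) @ (bank ** 2).T / frame

    # Stages 4 and 5: add a constant, then take the logarithm.
    return np.log(power + constant)

def auditory_distortion(reference, degraded, fs):
    """Mean squared difference between two inner-ear representations."""
    r = inner_ear_representation(reference, fs)
    d = inner_ear_representation(degraded, fs)
    return float(np.mean((r - d) ** 2))

fs = 16000
rng = np.random.default_rng(0)
t = np.arange(fs) / fs
# Amplitude-modulated tone as a stand-in for clean speech.
clean = np.sin(2 * np.pi * 440.0 * t) * (1.0 + 0.5 * np.sin(2 * np.pi * 4.0 * t))
for noise_std in (0.0, 0.1, 0.5):
    noisy = clean + noise_std * rng.standard_normal(fs)
    print(noise_std, auditory_distortion(clean, noisy, fs))
```

The added constant keeps the logarithm bounded in bands with little energy, which prevents near-silent bands from dominating the distortion.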
THE LACK OF FEEDBACK,
TOGETHER WITH THE RECENT
ABILITY TO COMMUNICATE
FROM ANYWHERE TO
ANYWHERE, OFTEN LEADS
TO LOW INTELLIGIBILITY.