IEEE SIGNAL PROCESSING MAGAZINE [46] MARCH 2015

acoustic features are based on the short-time discrete Fourier transform (DFT) coefficients, variance estimation can be based on the short-time DFT periodogram, i.e., $\sigma^2_{T,i} = |a_{T,i}|^2$, having a variance of $(\mathrm{E}|a_{T,i}|^2)^2$. The low-level measure (6) can then be used directly to optimize speech intelligibility [15].
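
As an illustration of periodogram-based variance estimation, the following minimal sketch computes the squared magnitude of each short-time DFT coefficient for every frame and frequency bin. The function names, the Hann window, the frame length, and the hop size are assumptions made for this example and are not taken from the article; the signal is synthetic white noise.

```python
import cmath
import math
import random


def dft(frame):
    """Direct O(N^2) DFT of one frame, using only the standard library."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def periodogram_frames(signal, frame_len, hop):
    """Return the squared DFT magnitudes for every frame and frequency bin."""
    # Hann window: an assumption, the text does not name a window.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * t / frame_len)
              for t in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = [signal[start + t] * window[t] for t in range(frame_len)]
        frames.append([abs(a) ** 2 for a in dft(segment)])
    return frames


def average_over_frames(frames):
    """Average the per-frame periodograms bin by bin (assumes stationarity)."""
    n_frames = len(frames)
    return [sum(frame[i] for frame in frames) / n_frames
            for i in range(len(frames[0]))]


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(1024)]  # synthetic test signal
pgram = periodogram_frames(noise, frame_len=64, hop=32)
smoothed = average_over_frames(pgram)
print(len(pgram), "frames,", len(pgram[0]), "bins per frame")
```

A single-frame periodogram fluctuates strongly around the underlying variance, which is why the sketch also shows averaging over frames; that averaging is only meaningful where the signal statistics change slowly.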

The frequency resolution of the human auditory system decreases with frequency, which reduces the mutual information from that obtained with (6) for a uniform high resolution. An improved model of information transfer is obtained by assuming that the signal is represented with one independent component per equivalent rectangular bandwidth (ERB), which is consistent with studies on intelligibility [17]. We show in the section “Measures Operating on Spectral Band Powers” that this approach provides an information-theoretical justification of the well-known SII [18], a low-level measure of intelligibility.
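
To give a feel for how few independent components an ERB-based representation implies, the sketch below counts ERBs between two frequencies. It assumes the widely used Glasberg and Moore approximation of the ERB and of the ERB-number scale; the article does not state which ERB formula it uses, and the function names and the 100 Hz to 8 kHz range are hypothetical choices for this example.

```python
import math


def erb_bandwidth_hz(f_hz):
    """ERB in Hz at centre frequency f_hz (Glasberg-Moore approximation, assumed)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


def erb_number(f_hz):
    """ERB-number scale: approximate count of ERBs below f_hz."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def count_erb_bands(f_low_hz, f_high_hz):
    """Approximate number of one-ERB-wide bands between two frequencies."""
    return erb_number(f_high_hz) - erb_number(f_low_hz)


bands = count_erb_bands(100.0, 8000.0)
print(f"ERB at 1 kHz: {erb_bandwidth_hz(1000.0):.1f} Hz")
print(f"ERB-wide bands between 100 Hz and 8 kHz: {bands:.1f}")
```

Under these assumptions the band count is in the tens, far fewer than the number of bins in a uniform high-resolution DFT over the same range, which is the reduction in independent components the paragraph describes.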

PRACTICAL MEASURES OF INTELLIGIBILITY

Existing practical measures of intelligibility generally operate at the word-sequence level, at the level of a sequence of auditory states, or at the level of short-term spectra. We discuss these classes next and end with a discussion of the constraints that must be imposed on the optimization.

MEASURES OPERATING ON A WORD SEQUENCE

In the section “Defining Intelligibility,” we discussed that the expected probability of correct interpretation of the message, $\mathrm{E}[p_{L|T}(\breve{M}_T \mid \breve{M}_T)]$, is a reasonable measure of intelligibility. This measure can be approximated as $\overline{p_{L|T}(\check{M}_T \mid \check{M}_T)}$ on real-world data, where the overbar indicates averaging over realizations $\check{M}$. If the averaging is done in time, i.e., over segments of a single larger message (e.g., words), then this operation assumes ergodicity. The measure is easily evaluated in a test with human test subjects, where $p_{L|T}(\check{M}_T \mid \check{M}_T)$ can be estimated using histograms. A machine-based quantitative measure requires a mapping from any particular acoustic observation to a message $\check{M}$ that captures the probabilistic nature of this mapping as performed by humans. As will be discussed in the section “Word-Sequence Probability-Based Enhancement,” the standard approach to ASR computes the probability of the observations given a message (word, or word sequence). The basic assumption for machine-based intelligibility enhancement is then that the trend of ASR word probability in noise tracks the trend of human recognition performance in noise sufficiently well for the modification parameters that are optimized. Experiments confirmed this hypothesis [11], [19] for a particular set of practical systems.
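
The histogram-based estimate from a listening test can be sketched as follows. The data layout (a list of conveyed and interpreted word pairs), the function name, and the example words are hypothetical; the sketch simply counts how often the interpreted word equals the conveyed word, per word and averaged over all trials.

```python
from collections import Counter


def estimate_correct_probability(trials):
    """Estimate the probability of correct interpretation from listening-test trials.

    trials: list of (conveyed_word, interpreted_word) pairs (hypothetical format).
    Returns (per_word, overall): per-word relative frequencies of a correct
    response, and the average over all trials.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    joint = Counter(trials)                        # histogram of (conveyed, interpreted)
    conveyed = Counter(word for word, _ in trials)  # histogram of conveyed words
    per_word = {word: joint[(word, word)] / n for word, n in conveyed.items()}
    overall = sum(joint[(word, word)] for word in conveyed) / len(trials)
    return per_word, overall


# Made-up responses, for illustration only.
trials = [
    ("cat", "cat"), ("cat", "cap"),
    ("dog", "dog"), ("dog", "dog"),
    ("sun", "fun"), ("sun", "sun"),
]
per_word, overall = estimate_correct_probability(trials)
print(per_word, overall)
```

Averaging over the words of one longer message, as done here, relies on the ergodicity assumption mentioned in the text.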

MEASURES OPERATING ON A SEQUENCE OF AUDITORY STATES

It is advantageous to minimize the delay and computational requirements of the intelligibility measure, particularly if the types of modification are restricted. Let us assume that the modification is a spectral modification, that the word sequence and speaking rate are fixed, and that the highest intelligibility is achieved by the original speech without environmental noise. (The latter assumption is an additional simplification required for this approach.) Then it is natural to use a distortion measure operating on the sequence of auditory states as a measure of intelligibility. Such measures can exploit the fact that quantitative knowledge of the auditory periphery has increased significantly in the last three decades (e.g., [20]).
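
A distortion measure of this kind can be sketched very simply if the auditory states are taken to be nonnegative band envelopes per frame. The mean squared difference of log-compressed states used below is an illustrative choice, not the measure of [20] or [21]; the function name, the data layout, and the additive constant `floor` are assumptions made for this example.

```python
import math


def auditory_state_distortion(clean_states, received_states, floor=1e-3):
    """Mean squared difference between log-compressed auditory state sequences.

    clean_states, received_states: lists of frames, each frame a list of
    nonnegative band envelopes (hypothetical representation).
    floor: additive constant that keeps the logarithm finite (assumed value).
    """
    if len(clean_states) != len(received_states):
        raise ValueError("sequences must have the same number of frames")
    total, count = 0.0, 0
    for clean_frame, received_frame in zip(clean_states, received_states):
        if len(clean_frame) != len(received_frame):
            raise ValueError("frames must have the same number of bands")
        for c, r in zip(clean_frame, received_frame):
            diff = math.log(c + floor) - math.log(r + floor)
            total += diff * diff
            count += 1
    if count == 0:
        raise ValueError("sequences must not be empty")
    return total / count


clean = [[1.0, 0.5, 0.1], [0.8, 0.4, 0.2]]     # made-up clean-speech states
received = [[1.2, 0.9, 0.6], [1.0, 0.8, 0.7]]  # made-up states with noise added
print(auditory_state_distortion(clean, received))
```

Under the stated assumption that the clean speech is the most intelligible, a lower distortion stands in for higher intelligibility, and the measure needs only the current frames, which keeps delay and computation low.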

The straight comparison of the auditory states of the conveyed and received signal ignores the production noise of (4). That is, the auditory model does not weigh signal components according to their relevance in terms of precision of signal production. However, the auditory model precision of a speech component may form a reasonable match to the precision of speech production, simplifying the introduction of production noise.

Although auditory models differ in exactly how the inner ear representation is obtained, they follow in many cases a similar strategy for modeling the auditory system. In Figure 1, we outline the basic building blocks of the psychoacoustic model presented in [21], which is simple but representative of many other models, such as [20]. The first stage of the auditory model consists of a filter that mimics the frequency characteristics of the outer and

[FIG1] The basic structure of the auditory model presented in [21].
[Figure labels: Outer-Middle Ear Filter, Auditory Filterbank, Envelope Follower, Constant, Log Transform, Inner Ear Representation. Plot axes: Frequency (Hz), Level (dB), Response.]

THE LACK OF FEEDBACK, TOGETHER WITH THE RECENT ABILITY TO COMMUNICATE FROM ANYWHERE TO ANYWHERE, OFTEN LEADS TO LOW INTELLIGIBILITY.
