IEEE SIGNAL PROCESSING MAGAZINE [47] MARCH 2015

middle ear. This filter is cascaded with an auditory filter bank that models processing at the level of the basilar membrane in the cochlea. Subsequently, the envelope of each auditory filter output is obtained, which simulates the transduction of the inner hair cells. To model an absolute hearing threshold, a constant is added to each envelope. In the current context, this threshold corresponds to an interpretation noise. In the final stage, a log transform is used to model the loudness-dependent compression of the auditory filter bank outputs by the outer hair cells. An important difference between the model from [21] and the more advanced model presented in [20] is the logarithmic transform, which is a simplification of the adaptation loops that are used in [20]. The simplification particularly affects the output near transitions, where the gain of the adaptation loops changes.
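The processing chain described above (band filtering, envelope extraction, an additive threshold constant, and log compression) can be sketched roughly as follows. This is an illustrative sketch and not the model of [21]: the Gaussian band shapes, the broad outer/middle-ear emphasis, the function name `simple_auditory_model`, and the parameter `floor` are all assumptions made here for the example. Only the ERB bandwidth formula of Glasberg and Moore is a standard expression.

```python
import numpy as np

def simple_auditory_model(x, fs, centers_hz, floor=1e-3):
    """Return log-compressed band envelopes with shape (n_bands, n_samples).

    Hypothetical stand-in for a simplified auditory model; filter shapes
    and the threshold constant `floor` are assumptions, not values from [21].
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)   # signed frequencies in Hz
    spectrum = np.fft.fft(x)

    # Analytic-signal mask: double positive frequencies, drop the rest,
    # so that the magnitude of the inverse FFT is the envelope.
    analytic_mask = np.where(freqs > 0, 2.0, 0.0)

    # Assumed outer/middle-ear weighting: broad emphasis around 3 kHz.
    safe_f = np.maximum(np.abs(freqs), 1.0)
    ear = np.exp(-0.5 * (np.log2(safe_f / 3000.0) / 2.0) ** 2)

    out = np.empty((len(centers_hz), n))
    for k, fc in enumerate(centers_hz):
        # ERB bandwidth in Hz (Glasberg and Moore).
        bw = 24.7 * (4.37 * fc / 1000.0 + 1.0)
        # Assumed Gaussian band shape centered at fc.
        band = np.exp(-0.5 * ((freqs - fc) / bw) ** 2)
        analytic = np.fft.ifft(spectrum * ear * band * analytic_mask)
        envelope = np.abs(analytic)          # inner-hair-cell stand-in
        # Additive constant models the hearing threshold; log models compression.
        out[k] = np.log(envelope + floor)
    return out
```

Because the log replaces adaptation loops, this sketch has no memory across time: each output sample depends only on the current envelope value, which is exactly where it would differ from a model with adaptation near signal transitions.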

By applying an auditory model to the two acoustic sequences and comparing the results, a distortion measure can be obtained. Mutual information is a natural measure for this purpose but, to the best of our knowledge, it has not been applied to the auditory representation for intelligibility enhancement. Note that while mutual information is not affected by smooth invertible mappings, auditory representations are likely not smooth mappings from features such as cepstra or line spectral frequencies. This suggests that it may be essential to consider the detailed behavior of more sophisticated auditory models.

In the literature, various measures have been used to compare the auditory representations of the two acoustic sequences. In [14], it was shown that the criterion used there leads to a mathematically tractable method and provides good results for intelligibility enhancement. Reference [13] uses a similar auditory model for the so-called glimpse proportion measure of intelligibility: rather than comparing the two sequences directly, it compares the auditory representation of the signal $a$ with the auditory representation of the environmental noise $v$. The glimpse proportion approach computes the proportion of signal blocks where the auditory representation of the signal is louder than that of the noise. In more recent work on the glimpse proportion, a sigmoidal function is applied to the difference of the auditory signal and noise representations [6], [22]. The method provides good intelligibility enhancement [6], [22], [23]. Neither the criterion of [14] nor the glimpse proportion approach explicitly considers the information conveyed in a particular signal component, which should, at least in principle, be a disadvantage compared to mutual-information-based approaches.
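As a minimal sketch of the glimpse proportion idea described above, the following counts the fraction of time-frequency blocks in which the signal representation exceeds the noise representation, and also shows a smooth variant. The function names, the `margin` and `slope` parameters, and the choice of a logistic function as the sigmoid are assumptions for illustration; the exact sigmoid used in [6], [22] is not specified here.

```python
import numpy as np

def glimpse_proportion(sig_rep, noise_rep, margin=0.0):
    """Fraction of blocks where the signal exceeds the noise by `margin`.

    Both inputs are arrays of auditory representations on the same
    (band, time-block) grid; `margin` is a hypothetical tuning parameter.
    """
    sig_rep = np.asarray(sig_rep, dtype=float)
    noise_rep = np.asarray(noise_rep, dtype=float)
    return float(np.mean(sig_rep > noise_rep + margin))

def soft_glimpse_proportion(sig_rep, noise_rep, slope=1.0):
    """Smooth variant: a logistic sigmoid of the signal-noise difference
    replaces the hard comparison (the logistic form is an assumption)."""
    diff = np.asarray(sig_rep, dtype=float) - np.asarray(noise_rep, dtype=float)
    return float(np.mean(1.0 / (1.0 + np.exp(-slope * diff))))
```

The smooth variant is differentiable in the signal representation, which is convenient when the measure is used as an optimization criterion. Note that neither function weights blocks by how much information they carry.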

MEASURES OPERATING ON SPECTRAL BAND POWERS

The mutual information (6) can be seen to correspond to a classic view of intelligibility based on band powers of the auditory filter bank [17], [18], [24]–[27], by writing it as

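[Equations (7)–(9) did not survive extraction. As described in the surrounding text, (7) writes the mutual information as a sum over frequency bands $i$ of a band term $I_i$ multiplied by a factor $A$; (8) gives $I_i$ as a logarithmic expression in $\rho_{0,i}$; and (9) gives the sigmoidal factor $A$.]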

The maximum mutual information is attained at high SNR. Normalizing (7) by this maximum, we recognize the normalized $I_i$ as the so-called band-importance function and $A(\cdot)$ as the so-called weighting function or band-audibility function. The formulation (7) forms the basis of speech intelligibility measures such as the SII [18] and the extended SII [27]. These measures are descendants of the so-called articulation index [24], [25], a measure that predates information theory. In this classic view, $I_i$ characterizes the importance of frequency band $i$ and the factor $A$ is a weighting function that indicates what fraction of the information is delivered to the listener. The information-theory-derived form of $A$ shown in (9) describes a sigmoidal function that approximates the definition of $A$ in the SII. [Equation (9) neglects the threshold of hearing, the effect of high loudness, and the self-masking of noise.]

Our derivation of the band-importance function $I_i$ of (8) makes its dependency on the production and interpretation noise explicit. If the relative variances of the production and interpretation noise of a band are low (high production and interpretation SNR; $\rho_{0,i}$ approaches one), that band is important for intelligibility. In the SII definition, the values of $I_i$ are set empirically. As is shown in [15], the differences between the formulas for the classic approach and the aforementioned information-theoretical derivation are well within the precision of the original heuristic derivation of the classic view. The classic SII has proven to be highly correlated with speech intelligibility in many conditions and has been used as a basis for speech intelligibility enhancement [4], [8], [12], [28]. It is discussed in additional detail in the section “SII-Based Enhancement.”
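To make the band-importance and band-audibility decomposition concrete, the sketch below uses one Gaussian form that is consistent with the description above. It is an assumption for illustration and is not claimed to reproduce (7)-(9) exactly. It relies on the standard result that two jointly Gaussian scalars with correlation coefficient $\rho$ have mutual information $-\tfrac{1}{2}\log(1-\rho^2)$, and it assumes that environmental noise at linear band SNR `snr` reduces the squared correlation to `rho0**2 * snr / (1 + snr)`. The function names and this correlation model are hypothetical.

```python
import numpy as np

def band_information(rho0):
    """Per-band information in the high-SNR limit: -0.5 * log(1 - rho0^2)."""
    rho0 = np.asarray(rho0, dtype=float)
    return -0.5 * np.log(1.0 - rho0 ** 2)

def band_audibility(snr, rho0):
    """Fraction of a band's information delivered at linear SNR `snr`.

    Assumed model: noise lowers the squared correlation from rho0^2
    to rho0^2 * snr / (1 + snr). Equals 0 at snr = 0 and tends to 1
    as snr grows.
    """
    snr = np.asarray(snr, dtype=float)
    rho0 = np.asarray(rho0, dtype=float)
    rho_sq = rho0 ** 2 * snr / (1.0 + snr)
    return np.log(1.0 - rho_sq) / np.log(1.0 - rho0 ** 2)

def intelligibility_index(snr, rho0):
    """Importance-weighted audibility, normalized to lie in [0, 1]."""
    info = band_information(rho0)
    importance = info / info.sum()      # normalized band importance
    return float(np.sum(importance * band_audibility(snr, rho0)))
```

In this sketch a band with `rho0` close to one (low production and interpretation noise) receives a large importance weight, and the audibility factor rises monotonically from 0 to 1 with increasing SNR, mirroring the roles the text assigns to the two factors.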

CONSTRAINTS ON OPTIMIZATION

In most cases, the optimization must be performed subject to one or more constraints. Important constraints are the speech-like nature of the output, the signal power, and the system delay. Additional constraints may be required. For instance, for a given message (and speaking rate), a longer word sequence will likely be more intelligible than a short one, thus making a length constraint natural.

The speech-like nature, or the speech quality, of the enhanced output may require an explicit constraint. However, in most practical systems the speech-like nature is enforced implicitly by the modification strategy, the optimization criterion, or both. Modification strategies such as slowly varying spectral shaping permit only speech-like output. The maximum probability of correct phoneme recognition is an example of a criterion that favors signal features that resemble those of clean speech.

Signal power is a natural constraint. The unconstrained optimization of signal spectral modifications may lead to an unbounded increase of the signal power if the reduction in recognition performance of the human auditory system for loud sounds is not considered. Thus, a power constraint must be applied to prevent hearing injuries and loudspeaker damage. Approximations to perceived loudness, either in the form of an analytic expression or in the form of an algorithm, may also be used as constraints.
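One simple way to enforce a power constraint on a set of per-band gains is to rescale them by a common factor whenever the output power would exceed a bound. The sketch below assumes that `gains` are power gains applied to known band powers and that the default bound is the input power; the function name and both of these conventions are assumptions for illustration, not a method taken from the text.

```python
import numpy as np

def apply_power_constraint(gains, band_powers, max_power=None):
    """Rescale non-negative band power gains so that the total output
    power does not exceed `max_power`.

    If `max_power` is None, the total input power is used as the bound
    (an assumed default).
    """
    gains = np.asarray(gains, dtype=float)
    band_powers = np.asarray(band_powers, dtype=float)
    if max_power is None:
        max_power = band_powers.sum()
    out_power = np.sum(gains * band_powers)
    if out_power <= max_power:
        return gains                    # constraint already satisfied
    # A common scale factor preserves the relative spectral shaping.
    return gains * (max_power / out_power)
```

A common scaling keeps the ratios between band gains intact, so the shape chosen by the optimizer is preserved while the total power is bounded. A loudness-based constraint would replace the simple power sum with a loudness approximation.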

The system delay must be constrained in real-time systems. This may prevent the use of particular distortion measures and modification operators.
