where $r_k$ is the correlation coefficient between the reference and processed speech envelopes estimated in filter bank channel $k$ (typically, 23 gammatone channels are used), and the $[-15, 15]$ operator refers to the process of limiting and mapping $\mathrm{SNR}_{\mathrm{app}}$ into that range. The last step consists of linearly mapping the apparent SNR to the [0, 1] range using the following rule:
$$
\mathrm{SNR}_{\mathrm{final}}^{\mathrm{NCM}}(k) = \frac{\max\big(\min\big(\mathrm{SNR}_{\mathrm{app}}(k),\, 15\big),\, -15\big) + 15}{30}. \quad (2)
$$
The $\mathrm{SNR}_{\mathrm{final}}^{\mathrm{NCM}}$ values are then weighted in each frequency channel according to the so-called articulation index (AI) weights $W(k)$ recommended in the American National Standards Institute (ANSI) S3.5 Standard [7]. The final NCM value is given by:
$$
\mathrm{NCM} = \frac{\sum_{k=1}^{23} W(k)\,\mathrm{SNR}_{\mathrm{final}}^{\mathrm{NCM}}(k)}{\sum_{k=1}^{23} W(k)}. \quad (3)
$$
The NCM has been widely used to characterize the perceived
intelligibility for CI users (e.g., [3] and [4]).
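To make the steps in (2) and (3) concrete, the following minimal Python sketch maps per-channel envelope correlations to a single NCM value. The conversion from $r_k$ to an apparent SNR is defined in the equation preceding this excerpt and is therefore an assumption here, as are the function and argument names.

```python
import numpy as np

def ncm_from_channel_correlations(r, ai_weights):
    """Sketch of (2)-(3): map per-channel envelope correlations to an NCM score.

    r          : correlation coefficients r_k between the reference and processed
                 envelopes in each of the (typically 23) gammatone channels.
    ai_weights : ANSI S3.5 articulation index weights W(k) [7].
    """
    r = np.asarray(r, dtype=float)
    w = np.asarray(ai_weights, dtype=float)

    # Apparent SNR per channel (assumed form, from the NCM definition preceding
    # this excerpt): 10*log10(r_k^2 / (1 - r_k^2)), guarded against r_k in {0, 1}.
    snr_app = 10.0 * np.log10(np.maximum(r**2, 1e-12) / np.maximum(1.0 - r**2, 1e-12))

    # Limit to [-15, 15] dB and map linearly to [0, 1], as in (2).
    snr_final = (np.clip(snr_app, -15.0, 15.0) + 15.0) / 30.0

    # AI-weighted average across channels, as in (3).
    return float(np.sum(w * snr_final) / np.sum(w))
```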
SHORT-TIME OBJECTIVE INTELLIGIBILITY
The short-time objective intelligibility (STOI) metric is based on a
correlation coefficient between the temporal envelopes of the
time-aligned reference and processed speech signals in short-time
overlapped segments [8]. The signals are first decomposed by a
1/3-octave filter bank, segmented into short-time windows, nor-
malized, clipped, and then compared by means of a correlation
coefficient. The normalization step compensates for, e.g., different
playback levels, which do not have a strong negative effect on
intelligibility. Clipping, in turn, sets an upper bound on how
severely degraded one speech time-frequency unit can be. Accord-
ing to [8], clipping is used to avoid changes in intelligibility pre-
diction once speech has already been deemed “unintelligible.” The
resultant correlation coefficients correspond to short-time inter-
mediate intelligibility measures for each of the segments, which
are then averaged to one scalar value corresponding to the pre-
dicted speech intelligibility for the processed signal. The STOI was
originally proposed to assess the intelligibility of time-frequency
weighted noisy speech and enhanced speech for NH listeners.
Nonetheless, a channel selection algorithm for CIs that employs
STOI has been recently proposed [9].
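As an illustration of the intermediate measure described above, the sketch below computes one short-time band correlation with the normalization and clipping steps of [8]. The segment length, band decomposition, and the helper name stoi_intermediate are assumptions of this sketch; the final STOI score would be the average of such values over all bands and segments.

```python
import numpy as np

def stoi_intermediate(x_seg, y_seg, beta_db=-15.0):
    """One short-time intermediate intelligibility measure in the spirit of [8].

    x_seg, y_seg : temporal-envelope segments of the reference and processed
                   signals in one 1/3-octave band (e.g., roughly 384 ms of frames).
    beta_db      : lower bound used by the clipping step.
    """
    x = np.asarray(x_seg, dtype=float)
    y = np.asarray(y_seg, dtype=float)

    # Normalization: scale the processed segment to the reference energy,
    # compensating for, e.g., playback-level differences.
    y = y * np.linalg.norm(x) / (np.linalg.norm(y) + 1e-12)

    # Clipping: bound how severely degraded one time-frequency unit can be.
    y = np.minimum(y, x * (1.0 + 10.0 ** (-beta_db / 20.0)))

    # Correlation coefficient between the normalized, clipped envelopes.
    x_c, y_c = x - x.mean(), y - y.mean()
    return float(np.dot(x_c, y_c) /
                 (np.linalg.norm(x_c) * np.linalg.norm(y_c) + 1e-12))
```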
PERCEPTUAL EVALUATION OF SPEECH QUALITY
The International Telecommunication Union ITU-T P.862 standard, also known as Perceptual Evaluation of Speech Quality (PESQ) [10], is a widely used objective quality measurement algorithm. As with most intrusive algorithms, the first
step in PESQ processing is to time-align the reference and pro-
cessed speech signals. Once the signals are time aligned, they are
mapped to an auditory representation using a perceptual model
based on power distributions over time-frequency and compres-
sive loudness scaling, and then their differences are taken. Positive
differences indicate that components such as noise are present,
whereas negative differences indicate that components have been
omitted. With PESQ, different scaling factors are applied
to positive and negative disturbances to generate the so-called
symmetrical and asymmetrical disturbances. The final PESQ qual-
ity score is obtained as a linear combination of the symmetrical
and asymmetrical disturbances, with weights optimized using
telephony data. While the original PESQ algorithm described in
[10] was developed for narrow-band speech (8-kHz sampling rate),
wideband (16 kHz) extensions were described in [11] and are used
in the experiments described herein. It is important to emphasize
that the P.862 standard was recently superseded by ITU-T Recom-
mendation P.863 [also known as Perceptual Objective Listening
Quality Assessment (POLQA); see [2] and references therein], which covers a wider range of distortions and speech bandwidths (e.g.,
superwideband). POLQA, however, is not used in this study as its
source code is not publicly available and its license is very costly.
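As a structural illustration only, the final PESQ step can be sketched as a linear combination of the two aggregated disturbances. The default coefficients below are values commonly quoted for narrow-band P.862 and should be treated as placeholders; the normative constants (and the wideband mapping of [11]) are defined in the standards themselves.

```python
def pesq_style_mos(d_sym, d_asym, a0=4.5, a1=-0.1, a2=-0.0309):
    """Structural sketch of PESQ's final step [10]: a linear combination of the
    aggregated symmetrical (d_sym) and asymmetrical (d_asym) disturbances.
    The coefficients here are illustrative placeholders, not normative values."""
    return a0 + a1 * d_sym + a2 * d_asym
```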
HEARING AID SPEECH QUALITY AND
INTELLIGIBILITY INDICES
As originally described in [12], the HA speech quality index
(HASQI) uses an auditory model to analyze the reference and pro-
cessed signals from an HA. The auditory model was recently
extended in [13] and now serves as the basis of a unified approach
for predicting both intelligibility [14] and quality [15]. This HASQI
Version 2 model is used in the experiments described herein. The
auditory model includes the middle ear, an auditory filter bank,
the dynamic-range compression mediated by the outer hair cells
in the cochlea, two-tone suppression (where a tone at one fre-
quency can reduce the cochlear output for a tone at a different fre-
quency), and the onset enhancement inherent in the inner
hair-cell neural firing behavior. Hearing impairment is incorpo-
rated in the model as a broadening of the auditory filters with
increasing hearing loss, a reduction in the amount of dynamic-
range compression, a reduction in the two-tone suppression, and a
shift in the auditory threshold.
The HA speech intelligibility index (HASPI), in turn, combines
two measures of signal fidelity. The first measure compares the
evolution of the spectral shape over time for the processed signal
with that of the reference signal. The second measure cross-corre-
lates the high-level portions of the two signals in each frequency
band. The envelope measure is sensitive to the dynamic signal
behavior associated with consonants, while the cross-correlation
measure is more responsive to preserving the harmonics in steady-
state vowels. The HASQI quality model incorporates the effects of
noise and nonlinear distortions, as well as linear spectral changes.
The noise and nonlinear terms combine two measurements. The
first measurement compares the time-frequency envelope modula-
tion of the processed and reference signals and is similar to the
envelope comparison used in HASPI. The second measurement is
based on normalized signal cross-correlations in each frequency
band. The linear term compares the long-term spectra and the
spectral slopes. The final quality prediction is the product of the
two terms. Both HASPI [14] and HASQI [15] have been evaluated
for NH and HI listeners over a wide range of processing conditions,
including additive stationary and modulated noise, nonlinear dis-
tortion, noise suppression, dynamic-range compression, frequency
compression, feedback cancellation, and linear filtering.
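The combination logic described above can be sketched as follows. This is not the published HASQI Version 2 regression: the specific weights and polynomial fits of [15] are omitted, and env_fidelity, band_coherence, spectrum_diff, and slope_diff are placeholder inputs standing in for the envelope-modulation comparison, the per-band cross-correlations, and the long-term spectrum and slope differences; only the product-of-two-terms structure is taken from the text.

```python
import numpy as np

def hasqi_style_quality(env_fidelity, band_coherence, spectrum_diff, slope_diff):
    """Structural sketch of the HASQI v2 combination described in the text [15].

    env_fidelity   : per-band time-frequency envelope-modulation comparison values.
    band_coherence : per-band normalized cross-correlations.
    spectrum_diff  : long-term spectrum difference (0 = identical).
    slope_diff     : spectral-slope difference (0 = identical).

    The published model's weights and fits are omitted; the simple averages and
    penalties below are illustrative placeholders.
    """
    # Noise/nonlinear term: envelope-modulation fidelity and band coherence.
    q_nonlin = 0.5 * (float(np.mean(env_fidelity)) + float(np.mean(band_coherence)))

    # Linear term: penalize long-term spectral shape and slope deviations.
    q_lin = max(0.0, 1.0 - 0.5 * (spectrum_diff + slope_diff))

    # Final quality prediction is the product of the two terms.
    return q_nonlin * q_lin
```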