where $r_k$ is the correlation coefficient between the reference and processed speech envelopes estimated in filter bank channel $k$ (typically, 23 gammatone channels are used), and the $[-15, 15]$ operator refers to the process of limiting and mapping $\mathrm{SNR}_{\mathrm{app}}$ into that range. The last step consists of linearly mapping the apparent SNR to the [0, 1] range using the following rule:
$$
\mathrm{SNR}_{\mathrm{final}}^{\mathrm{NCM}}(k) = \frac{\max\big(\min\big(\mathrm{SNR}_{\mathrm{app}}(k),\, 15\big),\, -15\big) + 15}{30}. \quad (2)
$$
The $\mathrm{SNR}_{\mathrm{final}}^{\mathrm{NCM}}$ values are then weighted in each frequency channel according to the so-called articulation index (AI) weights $W(k)$ recommended in the American National Standards Institute (ANSI) S3.5 Standard [7]. The final NCM value is given by:
$$
\mathrm{NCM} = \frac{\sum_{k=1}^{23} W(k)\,\mathrm{SNR}_{\mathrm{final}}^{\mathrm{NCM}}(k)}{\sum_{k=1}^{23} W(k)}. \quad (3)
$$
The NCM has been widely used to characterize the perceived
intelligibility for CI users (e.g., [3] and [4]).
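To make the steps in (2) and (3) concrete, the following minimal Python sketch maps per-channel envelope correlations to a single NCM value. The conversion from $r_k$ to an apparent SNR is defined in the equation preceding this excerpt and is therefore an assumption here, as are the function and argument names.

```python
import numpy as np

def ncm_from_channel_correlations(r, ai_weights):
    """Sketch of (2)-(3): map per-channel envelope correlations to an NCM score.

    r          : correlation coefficients r_k between the reference and processed
                 envelopes in each of the (typically 23) gammatone channels.
    ai_weights : ANSI S3.5 articulation index weights W(k) [7].
    """
    r = np.asarray(r, dtype=float)
    w = np.asarray(ai_weights, dtype=float)

    # Apparent SNR per channel (assumed form, from the NCM definition preceding
    # this excerpt): 10*log10(r_k^2 / (1 - r_k^2)), guarded against r_k in {0, 1}.
    snr_app = 10.0 * np.log10(np.maximum(r**2, 1e-12) / np.maximum(1.0 - r**2, 1e-12))

    # Limit to [-15, 15] dB and map linearly to [0, 1], as in (2).
    snr_final = (np.clip(snr_app, -15.0, 15.0) + 15.0) / 30.0

    # AI-weighted average across channels, as in (3).
    return float(np.sum(w * snr_final) / np.sum(w))
```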
SHORT-TIME OBJECTIVE INTELLIGIBILITY
The short-time objective intelligibility (STOI) metric is based on a
correlation coefficient between the temporal envelopes of the
time-aligned reference and processed speech signals in short-time
overlapped segments [8]. The signals are first decomposed by a
1/3-octave filter bank, segmented into short-time windows, nor-
malized, clipped, and then compared by means of a correlation
coefficient. The normalization step compensates for, e.g., different
playback levels, which do not have a strong negative effect on
intelligibility. Clipping, in turn, sets an upper bound on how
severely degraded one speech time-frequency unit can be. Accord-
ing to [8], clipping is used to avoid changes in intelligibility pre-
diction once speech has already been deemed “unintelligible.” The
resultant correlation coefficients correspond to short-time inter-
mediate intelligibility measures for each of the segments, which
are then averaged to one scalar value corresponding to the pre-
dicted speech intelligibility for the processed signal. The STOI was
originally proposed to assess the intelligibility of time-frequency
weighted noisy speech and enhanced speech for NH listeners.
Nonetheless, a channel selection algorithm for CIs that employs
STOI has been recently proposed [9].
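As an illustration of the intermediate measure described above, the sketch below computes one short-time band correlation with the normalization and clipping steps of [8]. The segment length, band decomposition, and the helper name stoi_intermediate are assumptions of this sketch; the final STOI score would be the average of such values over all bands and segments.

```python
import numpy as np

def stoi_intermediate(x_seg, y_seg, beta_db=-15.0):
    """One short-time intermediate intelligibility measure in the spirit of [8].

    x_seg, y_seg : temporal-envelope segments of the reference and processed
                   signals in one 1/3-octave band (e.g., roughly 384 ms of frames).
    beta_db      : lower bound used by the clipping step.
    """
    x = np.asarray(x_seg, dtype=float)
    y = np.asarray(y_seg, dtype=float)

    # Normalization: scale the processed segment to the reference energy,
    # compensating for, e.g., playback-level differences.
    y = y * np.linalg.norm(x) / (np.linalg.norm(y) + 1e-12)

    # Clipping: bound how severely degraded one time-frequency unit can be.
    y = np.minimum(y, x * (1.0 + 10.0 ** (-beta_db / 20.0)))

    # Correlation coefficient between the normalized, clipped envelopes.
    x_c, y_c = x - x.mean(), y - y.mean()
    return float(np.dot(x_c, y_c) /
                 (np.linalg.norm(x_c) * np.linalg.norm(y_c) + 1e-12))
```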
PERCEPTUAL EVALUATION OF SPEECH QUALITY
The International Telecommunication Union ITU-T P.862 standard, also known as Perceptual Evaluation of Speech Quality (PESQ) [10], is a widely used objective quality measurement algorithm. As with most intrusive algorithms, the first
step in PESQ processing is to time-align the reference and pro-
cessed speech signals. Once the signals are time aligned, they are
mapped to an auditory representation using a perceptual model
based on power distributions over time-frequency and compres-
sive loudness scaling, and then their differences are taken. Positive
differences indicate that components such as noise are present,
whereas negative differences indicate that components have been
omitted. With PESQ, different scaling factors are applied
to positive and negative disturbances to generate the so-called
symmetrical and asymmetrical disturbances. The final PESQ qual-
ity score is obtained as a linear combination of the symmetrical
and asymmetrical disturbances, with weights optimized using
telephony data. While the original PESQ algorithm described in
[10] was developed for narrow-band speech (8-kHz sampling rate),
wideband (16 kHz) extensions were described in [11] and are used
in the experiments described herein. It is important to emphasize
that the P.862 standard was recently superseded by ITU-T Recom-
mendation P.863 [also known as Perceptual Objective Listening
Quality Assessment (POLQA); see [2] and references therein], which covers a wider range of distortions and speech bandwidths (e.g.,
superwideband). POLQA, however, is not used in this study as its
source code is not publicly available and its license is very costly.
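As a structural illustration only, the final PESQ step can be sketched as a linear combination of the two aggregated disturbances. The default coefficients below are values commonly quoted for narrow-band P.862 and should be treated as placeholders; the normative constants (and the wideband mapping of [11]) are defined in the standards themselves.

```python
def pesq_style_mos(d_sym, d_asym, a0=4.5, a1=-0.1, a2=-0.0309):
    """Structural sketch of PESQ's final step [10]: a linear combination of the
    aggregated symmetrical (d_sym) and asymmetrical (d_asym) disturbances.
    The coefficients here are illustrative placeholders, not normative values."""
    return a0 + a1 * d_sym + a2 * d_asym
```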
HEARING AID SPEECH QUALITY AND
INTELLIGIBILITY INDICES
As originally described in [12], the HA speech quality index
(HASQI) uses an auditory model to analyze the reference and pro-
cessed signals from an HA. The auditory model was recently
extended in [13] and now serves as the basis of a unified approach
for predicting both intelligibility [14] and quality [15]. This HASQI
Version 2 model is used in the experiments described herein. The
auditory model includes the middle ear, an auditory filter bank,
the dynamic-range compression mediated by the outer hair cells
in the cochlea, two-tone suppression (where a tone at one fre-
quency can reduce the cochlear output for a tone at a different fre-
quency), and the onset enhancement inherent in the inner
hair-cell neural firing behavior. Hearing impairment is incorpo-
rated in the model as a broadening of the auditory filters with
increasing hearing loss, a reduction in the amount of dynamic-
range compression, a reduction in the two-tone suppression, and a
shift in the auditory threshold.
The HA speech intelligibility index (HASPI), in turn, combines
two measures of signal fidelity. The first measure compares the
evolution of the spectral shape over time for the processed signal
with that of the reference signal. The second measure cross-corre-
lates the high-level portions of the two signals in each frequency
band. The envelope measure is sensitive to the dynamic signal
behavior associated with consonants, while the cross-correlation
measure is more responsive to preserving the harmonics in steady-
state vowels. The HASQI quality model incorporates the effects of
noise and nonlinear distortions, as well as linear spectral changes.
The noise and nonlinear terms combine two measurements. The
first measurement compares the time-frequency envelope modula-
tion of the processed and reference signals and is similar to the
envelope comparison used in HASPI. The second measurement is
based on normalized signal cross-correlations in each frequency
band. The linear term compares the long-term spectra and the
spectral slopes. The final quality prediction is the product of the
two terms. Both HASPI [14] and HASQI [15] have been evaluated
for NH and HI listeners over a wide range of processing conditions,
including additive stationary and modulated noise, nonlinear dis-
tortion, noise suppression, dynamic-range compression, frequency
compression, feedback cancellation, and linear filtering.
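The combination logic described above can be sketched as follows. This is not the published HASQI Version 2 regression: the specific weights and polynomial fits of [15] are omitted, and env_fidelity, band_coherence, spectrum_diff, and slope_diff are placeholder inputs standing in for the envelope-modulation comparison, the per-band cross-correlations, and the long-term spectrum and slope differences; only the product-of-two-terms structure is taken from the text.

```python
import numpy as np

def hasqi_style_quality(env_fidelity, band_coherence, spectrum_diff, slope_diff):
    """Structural sketch of the HASQI v2 combination described in the text [15].

    env_fidelity   : per-band time-frequency envelope-modulation comparison values.
    band_coherence : per-band normalized cross-correlations.
    spectrum_diff  : long-term spectrum difference (0 = identical).
    slope_diff     : spectral-slope difference (0 = identical).

    The published model's weights and fits are omitted; the simple averages and
    penalties below are illustrative placeholders.
    """
    # Noise/nonlinear term: envelope-modulation fidelity and band coherence.
    q_nonlin = 0.5 * (float(np.mean(env_fidelity)) + float(np.mean(band_coherence)))

    # Linear term: penalize long-term spectral shape and slope deviations.
    q_lin = max(0.0, 1.0 - 0.5 * (spectrum_diff + slope_diff))

    # Final quality prediction is the product of the two terms.
    return q_nonlin * q_lin
```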