PERCEPTION-MODEL-BASED QUALITY PREDICTION
In its original version, the perception-model-based quality predic-
tion method, PEMO-Q, compares the auditory-inspired “internal
representation” of the reference speech signal to that of its pro-
cessed counterpart to objectively characterize the quality of the
processed speech signal [16]. The auditory representation is
obtained using the following signal processing chain. First, the
signals are split into critical bands using a gammatone filter bank.
Each subband is half-wave rectified and low-pass filtered at 1 kHz.
Envelope signals are then thresholded to account for the absolute
hearing threshold and passed through an adaptation chain con-
sisting of five consecutive nonlinear feedback loops. Finally, the
envelope signal is either lowpass filtered at 8-Hz modulation fre-
quency (in PEMO-Q’s optional “fast mode”) or analyzed by a linear
modulation filter bank comprising eight filters with center fre-
quencies up to 129 Hz (i.e., in the default mode used here). When
comparing the reference and processed signals, two quality meas-
ures are produced: the overall perceptual similarity measure
(PSM) and a per-frame counterpart, PSM_t.
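The processing chain above can be summarized in a short sketch. The following Python fragment is a deliberately simplified illustration, not the PEMO-Q implementation: ERB-spaced Butterworth band-passes stand in for the gammatone filter bank, a logarithmic compression stands in for the five adaptation loops, and a single 8-Hz envelope low-pass (corresponding to the "fast mode") stands in for the modulation filter bank; all filter orders, band centers, and the threshold value are assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def internal_representation(x, fs, n_bands=30):
    """Crude stand-in for a PEMO-Q-style auditory front end (see text)."""
    erb = lambda f: 24.7 * (4.37 * f / 1000.0 + 1.0)   # ERB bandwidth in Hz
    cfs = np.geomspace(235.0, 0.45 * fs, n_bands)       # assumed band centres

    # 1) "Critical band" analysis (placeholder for the gammatone bank).
    subbands = []
    for cf in cfs:
        bw = erb(cf)
        lo, hi = max(cf - bw, 20.0), min(cf + bw, 0.49 * fs)
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        subbands.append(lfilter(b, a, x))

    # 2) Half-wave rectification and 1-kHz envelope low-pass.
    b_lp, a_lp = butter(1, 1000.0 / (fs / 2))
    envs = [lfilter(b_lp, a_lp, np.maximum(s, 0.0)) for s in subbands]

    # 3) Absolute-threshold stage (fixed placeholder threshold) and
    # 4) adaptation: a log compression replaces the five feedback loops.
    envs = [np.log1p(np.maximum(e, 1e-5) / 1e-5) for e in envs]

    # 5) Modulation analysis: 8-Hz envelope low-pass ("fast mode"); the
    #    default mode would apply an 8-channel modulation filter bank.
    b_m, a_m = butter(1, 8.0 / (fs / 2))
    return np.stack([lfilter(b_m, a_m, e) for e in envs])
```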
PSM corresponds to the overall cross-correlation coefficient
between the complete internal representations of the reference and
processed speech signals. PSM_t, in turn, is a more refined measure
and explicitly accounts for the temporal course of the instantan-
eous audio quality as derived from a temporal frame-by-frame cor-
relation of internal representations. While PSM provides greater
generalizability, PSM_t has been found to be more sensitive to small
distortions [16]. Since the experiments described in this article will
be dealing with a wider range of speech quality levels, the PSM
measure will be used. PSM was also previously shown to reliably
predict the quality of speech enhancement algorithms [17].
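As a rough illustration of how the two measures relate, the sketch below computes a global correlation coefficient (PSM) and a frame-by-frame correlation (PSM_t) between two time-aligned internal representations. The frame length and the omission of the weighting and assimilation steps described in [16] are simplifications of this sketch.

```python
import numpy as np

def psm_and_psmt(ir_ref, ir_deg, fs_env, frame_s=0.010):
    """Sketch of PSM and PSM_t from two internal representations
    (arrays of shape bands x samples at envelope rate `fs_env`)."""
    def corr(a, b):
        a = a - a.mean()
        b = b - b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        return (a * b).sum() / denom if denom > 0 else 0.0

    # PSM: one cross-correlation coefficient over the whole representation.
    psm = corr(ir_ref, ir_deg)

    # PSM_t: correlation per short frame, tracking instantaneous quality.
    hop = max(1, int(frame_s * fs_env))
    n = min(ir_ref.shape[1], ir_deg.shape[1])
    psm_t = np.array([corr(ir_ref[:, i:i + hop], ir_deg[:, i:i + hop])
                      for i in range(0, n - hop + 1, hop)])
    return psm, psm_t
```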
More recently, an extension to PEMO-Q was developed to
account for hearing impairments (PEMO-Q-HI) for HA users
[18]. In the modified version, sensorineural hearing losses are
modeled by an instantaneous expansion and an attenuation
stage applied before the adaptation stage. While the former
accounts for the reduced dynamic compression caused by the
loss of outer hair cells, the latter accounts for the loss of sensi-
tivity due to loss of inner hair cells [19]. With PEMO-Q-HI, the
amount of attenuation and expansion is quantified from the
impaired listeners’ audiograms, as detailed in [18].
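A minimal sketch of such a stage is given below. The split of each band's audiogram-derived loss into an outer-hair-cell (expansion) part and an inner-hair-cell (attenuation) part is shown with a fixed fraction purely for illustration, and the expansion exponent is a placeholder; [18] derives both components from the listener's audiogram.

```python
import numpy as np

def apply_hearing_loss(envelopes, total_loss_db, ohc_fraction=0.8):
    """Hedged sketch of a PEMO-Q-HI-style hearing-loss stage applied to
    nonnegative subband envelopes (bands x samples) just before the
    adaptation stage; `total_loss_db` holds one loss value per band."""
    total_loss_db = np.asarray(total_loss_db, dtype=float)[:, None]
    ohc_loss_db = ohc_fraction * total_loss_db     # assumed OHC share
    ihc_loss_db = total_loss_db - ohc_loss_db      # remaining IHC share

    # Attenuation stage: loss of sensitivity due to inner-hair-cell loss.
    out = envelopes * 10.0 ** (-ihc_loss_db / 20.0)

    # Expansion stage: instantaneous expansion modelling the reduced
    # cochlear compression caused by outer-hair-cell loss (placeholder
    # exponent scaled by the OHC loss).
    exponent = 1.0 + ohc_loss_db / 120.0
    return out ** exponent
```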
NONINTRUSIVE METRICS
ITU-T RECOMMENDATION P.563
In 2004, the ITU-T standardized its first nonintrusive algorithm
called ITU-T P.563 [20]. The P.563 algorithm extracts a number of
signal parameters to detect one of six dominant distortion classes.
The distortion classes are, in decreasing order of “annoyance”:
high level of background noise, signal interruptions, signal-corre-
lated noise, speech robotization (voice with metallic sounds),
unnatural male speech, and unnatural female speech. For each
distortion class, a subset of the extracted parameters is used to
compute an intermediate quality rating. Once a major distortion
class is detected, the intermediate score is linearly combined with
other parameters to derive a final quality estimate. Unnaturalness
of the speech signal is characterized by vocal tract and linear pre-
diction analysis of the speech signal. More specifically, the vocal
tract is modeled as a series of tubes of different lengths and time-
varying cross-sectional areas, which are then combined with
higher-order statistics (skewness and kurtosis) of the linear predic-
tion and cepstral coefficients and tested to see if they lie within the
restricted range expected for natural speech. While P.563 was
developed as an objective quality measure for NH listeners and
telephony applications, a recent study has shown promising
results with P.563 as a correlate of noise-excited vocoded speech
intelligibility for NH listeners, thus simulating CI hearing [21].
Note that the ITU-T P.563 algorithm is only applicable to narrow-
band speech signals sampled at 8 kHz.
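The fragment below illustrates just one of these ingredients: per-frame linear-prediction coefficients and their higher-order statistics over time. The frame length, prediction order, and autocorrelation-based solver are assumptions of this sketch, which only hints at the much larger parameter set combined by the standard.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.stats import skew, kurtosis

def lpc_hos_features(x, fs=8000, order=10, frame_s=0.032):
    """Skewness and kurtosis of per-frame LP coefficients (illustrative)."""
    n = int(frame_s * fs)
    coeffs = []
    for i in range(0, len(x) - n + 1, n):
        fr = x[i:i + n] * np.hamming(n)
        # Autocorrelation up to the prediction order.
        r = np.correlate(fr, fr, mode="full")[n - 1:n + order]
        if r[0] <= 0:
            continue
        # Solve the LP normal equations (Toeplitz system).
        coeffs.append(solve_toeplitz(r[:-1], r[1:]))
    coeffs = np.array(coeffs)
    # "Unnatural" speech tends to push these statistics outside the
    # restricted range observed for natural speech.
    return skew(coeffs, axis=0), kurtosis(coeffs, axis=0)
```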
ModA
The ModA [22] measure is based on the principle that the speech
signal envelope is smeared by the late reflections in a reverberant
room, thus affecting the modulation spectrum of the speech sig-
nal. To obtain the ModA metric, the signal is first decomposed into
N (= 4) acoustic bands (lower cutoff frequencies of 300, 775,
1,375, and 3,676 Hz, as in [22]); the temporal envelopes for each
acoustic band are then computed using the Hilbert transform,
downsampled and grouped using a 1/3-octave filter bank with cen-
ter frequencies ranging between 0.5 and 8 Hz. As in [22], 13
modulation filters are used to cover the 0.5–10 Hz modulation fre-
quency range. For each acoustic frequency band, the so-called
area under the modulation spectrum (A_i) is computed and finally
averaged over all N = 4 acoustic bands to obtain the ModA meas-
ure, which has been used as an intelligibility correlate for CI users
in reverberant and enhanced conditions [22].
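A compact sketch of this computation is shown below. The acoustic band edges and the 0.5–8 Hz third-octave grid follow the description above, while the envelope sampling rate, filter types, and orders are assumptions of this sketch rather than the reference implementation of [22].

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def moda(x, fs, env_fs=40):
    """Hedged sketch of the ModA measure (see text)."""
    # 1) Four acoustic bands (lower cutoffs 300, 775, 1375, 3676 Hz).
    edges = [300.0, 775.0, 1375.0, 3676.0, min(0.49 * fs, 8000.0)]
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo / (fs / 2), hi / (fs / 2)],
                     btype="band", output="sos")
        bands.append(sosfiltfilt(sos, x))

    # 2) Hilbert envelope per band, crudely downsampled to `env_fs` Hz.
    step = int(fs // env_fs)
    envs = [np.abs(hilbert(b))[::step] for b in bands]

    # 3) Thirteen third-octave modulation filters, centres 0.5-8 Hz.
    cfs = 0.5 * 2.0 ** (np.arange(13) / 3.0)
    areas = []
    for e in envs:
        e = e - e.mean()
        spec = []
        for cf in cfs:
            lo, hi = cf / 2 ** (1 / 6), cf * 2 ** (1 / 6)
            sos = butter(2, [lo / (env_fs / 2), hi / (env_fs / 2)],
                         btype="band", output="sos")
            spec.append(np.sqrt(np.mean(sosfiltfilt(sos, e) ** 2)))
        # 4) "Area under the modulation spectrum" A_i for this band.
        areas.append(np.sum(spec))

    # 5) Average over the N = 4 acoustic bands.
    return np.mean(areas)
```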
SRMR
The SRMR was originally developed for reverberant and derever-
berated speech and evaluated against subjective NH listener data
[23]. The metric is computed as follows. First, the input speech
signal is filtered by a gammatone filter bank with center frequen-
cies ranging from 125 Hz to approximately half the sampling fre-
quency, and with bandwidths characterized by the equivalent
rectangular bandwidth. For 8-kHz and 16-kHz sampled speech
signals, 23 and 32 filters are used, respectively. Temporal envelopes
are then computed via the Hilbert transform for each of the filter
bank outputs and used to extract modulation spectral energy for
each critical band. To emulate frequency selectivity in the modula-
tion domain [24], modulation frequency bins are grouped into
eight overlapping modulation bands with center frequencies loga-
rithmically spaced between 4 and 128 Hz. Finally, the SRMR value
is computed as the ratio of the average modulation energy content
available in the first four modulation bands (3–20 Hz, consistent
with clean speech content) to the average modulation energy con-
tent available in the last four modulation bands (20–120 Hz), con-
sistent with room acoustics information [25].
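The sketch below follows these steps with simplified filters: the gammatone analysis is approximated by ERB-spaced Butterworth band-passes, and the modulation energies are obtained by band-pass filtering the downsampled Hilbert envelopes. Filter shapes, the envelope rate, and modulation bandwidths are assumptions of this sketch and will not reproduce the reference SRMR values of [23].

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def srmr(x, fs):
    """Simplified SRMR sketch (see text)."""
    n_bands = 23 if fs <= 8000 else 32
    erb = lambda f: 24.7 * (4.37 * f / 1000.0 + 1.0)
    cfs_ac = np.geomspace(125.0, 0.45 * fs, n_bands)   # acoustic bands
    cfs_mod = np.geomspace(4.0, 128.0, 8)               # modulation bands

    env_fs = 400                                        # assumed envelope rate
    step = int(fs // env_fs)

    mod_energy = np.zeros((n_bands, len(cfs_mod)))
    for i, cf in enumerate(cfs_ac):
        bw = erb(cf)
        lo, hi = max(cf - bw, 30.0), min(cf + bw, 0.49 * fs)
        sos = butter(2, [lo / (fs / 2), hi / (fs / 2)],
                     btype="band", output="sos")
        # Temporal envelope of this "critical band", crudely downsampled.
        env = np.abs(hilbert(sosfiltfilt(sos, x)))[::step]
        env = env - env.mean()
        for j, mf in enumerate(cfs_mod):
            mlo, mhi = mf / 2 ** 0.5, mf * 2 ** 0.5     # overlapping bands
            sos_m = butter(2, [mlo / (env_fs / 2), mhi / (env_fs / 2)],
                           btype="band", output="sos")
            mod_energy[i, j] = np.mean(sosfiltfilt(sos_m, env) ** 2)

    # Energy in the lower four modulation bands (speech-dominated)
    # relative to the upper four (reverberation-dominated).
    per_band = mod_energy.mean(axis=0)
    return per_band[:4].sum() / max(per_band[4:].sum(), 1e-12)
```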
SRMR-CI AND SRMR-HA
To tailor the SRMR measure for CI, a few modifications were
recently implemented [26], [27]. First, the gammatone filter