PERCEPTION-MODEL-BASED QUALITY PREDICTION
In its original version, the perception-model-based quality predic-
tion method, PEMO-Q, compares the auditory-inspired “internal
representation” of the reference speech signal to that of its pro-
cessed counterpart to objectively characterize the quality of the
processed speech signal [16]. The auditory representation is
obtained using the following signal processing chain. First, the
signals are split into critical bands using a gammatone filter bank.
Each subband is half-wave rectified and low-pass filtered at 1 kHz.
Envelope signals are then thresholded to account for the absolute
hearing threshold and passed through an adaptation chain con-
sisting of five consecutive nonlinear feedback loops. Finally, the
envelope signal is either lowpass filtered at 8-Hz modulation fre-
quency (in PEMO-Q’s optional “fast mode”) or analyzed by a linear
modulation filter bank comprising eight filters with center fre-
quencies up to 129 Hz (i.e., in the default mode used here). When
comparing the reference and processed signals, two quality meas-
ures are produced: the overall perceptual similarity measure
(PSM) and a per-frame counterpart, PSM_t.
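The processing chain above can be summarized in a short sketch. The following Python fragment is a deliberately simplified illustration, not the PEMO-Q implementation: ERB-spaced Butterworth band-passes stand in for the gammatone filter bank, a logarithmic compression stands in for the five adaptation loops, and a single 8-Hz envelope low-pass (corresponding to the "fast mode") stands in for the modulation filter bank; all filter orders, band centers, and the threshold value are assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def internal_representation(x, fs, n_bands=30):
    """Crude stand-in for a PEMO-Q-style auditory front end (see text)."""
    erb = lambda f: 24.7 * (4.37 * f / 1000.0 + 1.0)   # ERB bandwidth in Hz
    cfs = np.geomspace(235.0, 0.45 * fs, n_bands)       # assumed band centres

    # 1) "Critical band" analysis (placeholder for the gammatone bank).
    subbands = []
    for cf in cfs:
        bw = erb(cf)
        lo, hi = max(cf - bw, 20.0), min(cf + bw, 0.49 * fs)
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        subbands.append(lfilter(b, a, x))

    # 2) Half-wave rectification and 1-kHz envelope low-pass.
    b_lp, a_lp = butter(1, 1000.0 / (fs / 2))
    envs = [lfilter(b_lp, a_lp, np.maximum(s, 0.0)) for s in subbands]

    # 3) Absolute-threshold stage (fixed placeholder threshold) and
    # 4) adaptation: a log compression replaces the five feedback loops.
    envs = [np.log1p(np.maximum(e, 1e-5) / 1e-5) for e in envs]

    # 5) Modulation analysis: 8-Hz envelope low-pass ("fast mode"); the
    #    default mode would apply an 8-channel modulation filter bank.
    b_m, a_m = butter(1, 8.0 / (fs / 2))
    return np.stack([lfilter(b_m, a_m, e) for e in envs])
```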
PSM corresponds to the overall cross-correlation coefficient
between the complete internal representations of the reference and
processed speech signals. PSM_t, in turn, is a more refined measure
and explicitly accounts for the temporal course of the instantan-
eous audio quality as derived from a temporal frame-by-frame cor-
relation of internal representations. While PSM provides greater
generalizability, PSM_t has been found to be more sensitive to small
distortions [16]. Since the experiments described in this article will
be dealing with a wider range of speech quality levels, the PSM
measure will be used. PSM was also previously shown to reliably
predict the quality of speech enhancement algorithms [17].
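As a rough illustration of how the two measures relate, the sketch below computes a global correlation coefficient (PSM) and a frame-by-frame correlation (PSM_t) between two time-aligned internal representations. The frame length and the omission of the weighting and assimilation steps described in [16] are simplifications of this sketch.

```python
import numpy as np

def psm_and_psmt(ir_ref, ir_deg, fs_env, frame_s=0.010):
    """Sketch of PSM and PSM_t from two internal representations
    (arrays of shape bands x samples at envelope rate `fs_env`)."""
    def corr(a, b):
        a = a - a.mean()
        b = b - b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        return (a * b).sum() / denom if denom > 0 else 0.0

    # PSM: one cross-correlation coefficient over the whole representation.
    psm = corr(ir_ref, ir_deg)

    # PSM_t: correlation per short frame, tracking instantaneous quality.
    hop = max(1, int(frame_s * fs_env))
    n = min(ir_ref.shape[1], ir_deg.shape[1])
    psm_t = np.array([corr(ir_ref[:, i:i + hop], ir_deg[:, i:i + hop])
                      for i in range(0, n - hop + 1, hop)])
    return psm, psm_t
```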
More recently, an extension to PEMO-Q was developed to
account for hearing impairments (PEMO-Q-HI) for HA users
[18]. In the modified version, sensorineural hearing losses are
modeled by an instantaneous expansion and an attenuation
stage applied before the adaptation stage. While the former
accounts for the reduced dynamic compression caused by the
loss of outer hair cells, the latter accounts for the loss of sensi-
tivity due to loss of inner hair cells [19]. With PEMO-Q-HI, the
amount of attenuation and expansion is quantified from the
impaired listeners’ audiograms, as detailed in [18].
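A minimal sketch of such a stage is given below. The split of each band's audiogram-derived loss into an outer-hair-cell (expansion) part and an inner-hair-cell (attenuation) part is shown with a fixed fraction purely for illustration, and the expansion exponent is a placeholder; [18] derives both components from the listener's audiogram.

```python
import numpy as np

def apply_hearing_loss(envelopes, total_loss_db, ohc_fraction=0.8):
    """Hedged sketch of a PEMO-Q-HI-style hearing-loss stage applied to
    nonnegative subband envelopes (bands x samples) just before the
    adaptation stage; `total_loss_db` holds one loss value per band."""
    total_loss_db = np.asarray(total_loss_db, dtype=float)[:, None]
    ohc_loss_db = ohc_fraction * total_loss_db     # assumed OHC share
    ihc_loss_db = total_loss_db - ohc_loss_db      # remaining IHC share

    # Attenuation stage: loss of sensitivity due to inner-hair-cell loss.
    out = envelopes * 10.0 ** (-ihc_loss_db / 20.0)

    # Expansion stage: instantaneous expansion modelling the reduced
    # cochlear compression caused by outer-hair-cell loss (placeholder
    # exponent scaled by the OHC loss).
    exponent = 1.0 + ohc_loss_db / 120.0
    return out ** exponent
```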
NONINTRUSIVE METRICS
ITU-T RECOMMENDATION P.563
In 2004, the ITU-T standardized its first nonintrusive algorithm
called ITU-T P.563 [20]. The P.563 algorithm extracts a number of
signal parameters to detect one of six dominant distortion classes.
The distortion classes are, in decreasing order of “annoyance”:
high level of background noise, signal interruptions, signal-corre-
lated noise, speech robotization (voice with metallic sounds),
unnatural male speech, and unnatural female speech. For each
distortion class, a subset of the extracted parameters is used to
compute an intermediate quality rating. Once a major distortion
class is detected, the intermediate score is linearly combined with
other parameters to derive a final quality estimate. Unnaturalness
of the speech signal is characterized by vocal tract and linear pre-
diction analysis of the speech signal. More specifically, the vocal
tract is modeled as a series of tubes of different lengths and time-
varying cross-sectional areas, which are then combined with
higher-order statistics (skewness and kurtosis) of the linear predic-
tion and cepstral coefficients and tested to see if they lie within the
restricted range expected for natural speech. While P.563 was
developed as an objective quality measure for NH listeners and
telephony applications, a recent study has shown promising
results with P.563 as a correlate of noise-excited vocoded speech
intelligibility for NH listeners, thus simulating CI hearing [21].
Note that the ITU-T P.563 algorithm is only applicable to narrow-
band speech signals sampled at 8 kHz.
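The fragment below illustrates just one of these ingredients: per-frame linear-prediction coefficients and their higher-order statistics over time. The frame length, prediction order, and autocorrelation-based solver are assumptions of this sketch, which only hints at the much larger parameter set combined by the standard.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.stats import skew, kurtosis

def lpc_hos_features(x, fs=8000, order=10, frame_s=0.032):
    """Skewness and kurtosis of per-frame LP coefficients (illustrative)."""
    n = int(frame_s * fs)
    coeffs = []
    for i in range(0, len(x) - n + 1, n):
        fr = x[i:i + n] * np.hamming(n)
        # Autocorrelation up to the prediction order.
        r = np.correlate(fr, fr, mode="full")[n - 1:n + order]
        if r[0] <= 0:
            continue
        # Solve the LP normal equations (Toeplitz system).
        coeffs.append(solve_toeplitz(r[:-1], r[1:]))
    coeffs = np.array(coeffs)
    # "Unnatural" speech tends to push these statistics outside the
    # restricted range observed for natural speech.
    return skew(coeffs, axis=0), kurtosis(coeffs, axis=0)
```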
ModA
The ModA [22] measure is based on the principle that the speech
signal envelope is smeared by the late reflections in a reverberant
room, thus affecting the modulation spectrum of the speech sig-
nal. To obtain the ModA metric, the signal is first decomposed into
N (= 4) acoustic bands (lower cutoff frequencies of 300, 775,
1,375, and 3,676 Hz, as in [22]); the temporal envelopes for each
acoustic band are then computed using the Hilbert transform,
downsampled and grouped using a 1/3-octave filter bank with cen-
ter frequencies ranging between 0.5 and 8 Hz. As in [22], 13
modulation filters are used to cover the 0.5–10 Hz modulation fre-
quency range. For each acoustic frequency band, the so-called
area under the modulation spectrum (A_i) is computed and finally
averaged over all N = 4 acoustic bands to obtain the ModA meas-
ure, which has been used as an intelligibility correlate for CI users
in reverberant and enhanced conditions [22].
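A compact sketch of this computation is shown below. The acoustic band edges and the 0.5–8 Hz third-octave grid follow the description above, while the envelope sampling rate, filter types, and orders are assumptions of this sketch rather than the reference implementation of [22].

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def moda(x, fs, env_fs=40):
    """Hedged sketch of the ModA measure (see text)."""
    # 1) Four acoustic bands (lower cutoffs 300, 775, 1375, 3676 Hz).
    edges = [300.0, 775.0, 1375.0, 3676.0, min(0.49 * fs, 8000.0)]
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo / (fs / 2), hi / (fs / 2)],
                     btype="band", output="sos")
        bands.append(sosfiltfilt(sos, x))

    # 2) Hilbert envelope per band, crudely downsampled to `env_fs` Hz.
    step = int(fs // env_fs)
    envs = [np.abs(hilbert(b))[::step] for b in bands]

    # 3) Thirteen third-octave modulation filters, centres 0.5-8 Hz.
    cfs = 0.5 * 2.0 ** (np.arange(13) / 3.0)
    areas = []
    for e in envs:
        e = e - e.mean()
        spec = []
        for cf in cfs:
            lo, hi = cf / 2 ** (1 / 6), cf * 2 ** (1 / 6)
            sos = butter(2, [lo / (env_fs / 2), hi / (env_fs / 2)],
                         btype="band", output="sos")
            spec.append(np.sqrt(np.mean(sosfiltfilt(sos, e) ** 2)))
        # 4) "Area under the modulation spectrum" A_i for this band.
        areas.append(np.sum(spec))

    # 5) Average over the N = 4 acoustic bands.
    return np.mean(areas)
```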
SRMR
The SRMR was originally developed for reverberant and derever-
berated speech and evaluated against subjective NH listener data
[23]. The metric is computed as follows. First, the input speech
signal is filtered by a gammatone filter bank with center frequen-
cies ranging from 125 Hz to approximately half the sampling fre-
quency, and with bandwidths characterized by the equivalent
rectangular bandwidth. For 8-kHz and 16-kHz sampled speech
signals, 23 and 32 filters are used, respectively. Temporal envelopes
are then computed via the Hilbert transform for each of the filter
bank outputs and used to extract modulation spectral energy for
each critical band. To emulate frequency selectivity in the modula-
tion domain [24], modulation frequency bins are grouped into
eight overlapping modulation bands with center frequencies loga-
rithmically spaced between 4 and 128 Hz. Finally, the SRMR value
is computed as the ratio of the average modulation energy content
available in the first four modulation bands (3–20 Hz, consistent
with clean speech content) to the average modulation energy con-
tent available in the last four modulation bands (20–120 Hz), con-
sistent with room acoustics information [25].
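The sketch below follows these steps with simplified filters: the gammatone analysis is approximated by ERB-spaced Butterworth band-passes, and the modulation energies are obtained by band-pass filtering the downsampled Hilbert envelopes. Filter shapes, the envelope rate, and modulation bandwidths are assumptions of this sketch and will not reproduce the reference SRMR values of [23].

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def srmr(x, fs):
    """Simplified SRMR sketch (see text)."""
    n_bands = 23 if fs <= 8000 else 32
    erb = lambda f: 24.7 * (4.37 * f / 1000.0 + 1.0)
    cfs_ac = np.geomspace(125.0, 0.45 * fs, n_bands)   # acoustic bands
    cfs_mod = np.geomspace(4.0, 128.0, 8)               # modulation bands

    env_fs = 400                                        # assumed envelope rate
    step = int(fs // env_fs)

    mod_energy = np.zeros((n_bands, len(cfs_mod)))
    for i, cf in enumerate(cfs_ac):
        bw = erb(cf)
        lo, hi = max(cf - bw, 30.0), min(cf + bw, 0.49 * fs)
        sos = butter(2, [lo / (fs / 2), hi / (fs / 2)],
                     btype="band", output="sos")
        # Temporal envelope of this "critical band", crudely downsampled.
        env = np.abs(hilbert(sosfiltfilt(sos, x)))[::step]
        env = env - env.mean()
        for j, mf in enumerate(cfs_mod):
            mlo, mhi = mf / 2 ** 0.5, mf * 2 ** 0.5     # overlapping bands
            sos_m = butter(2, [mlo / (env_fs / 2), mhi / (env_fs / 2)],
                           btype="band", output="sos")
            mod_energy[i, j] = np.mean(sosfiltfilt(sos_m, env) ** 2)

    # Energy in the lower four modulation bands (speech-dominated)
    # relative to the upper four (reverberation-dominated).
    per_band = mod_energy.mean(axis=0)
    return per_band[:4].sum() / max(per_band[4:].sum(), 1e-12)
```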
SRMR-CI AND SRMR-HA
To tailor the SRMR measure for CI, a few modifications were
recently implemented [26], [27]. First, the gammatone filter