Multiple Stimuli with Hidden Reference and Anchor (MUSHRA)
quality scale, with “20” referring to poor quality and “100” repre-
senting excellent quality. Participants selected and listened to the
reference and test stimuli and then indicated their quality judg-
ments by adjusting the corresponding sliders on the computer
screen. Custom HA recordings were obtained for the purpose of
objective speech quality prediction. To this end, the Phonak Savia
BTE HA was programmed to match the amplification targets for
each participant and was subsequently connected to a 2-cc coupler
and placed inside a portable anechoic HA test box. The 32 stimuli
within the database were then played back individually through the
loudspeaker in the test box, and the resulting HA output was stored
in a .wav file with 16-kHz sample rate and 16-bit resolution.
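As an illustration of this record-and-store step, the following is a minimal Python sketch of a play-and-record loop, assuming the PortAudio-based sounddevice package and illustrative file paths; none of these names come from the original study, and the actual test-box hardware interface is not specified in the text.

import numpy as np
import sounddevice as sd
from scipy.io import wavfile

def record_ha_output(stimulus_path, output_path, tail_s=0.5):
    # Load one database stimulus and normalize 16-bit PCM to [-1, 1].
    fs, stim = wavfile.read(stimulus_path)
    stim = stim.astype(np.float32) / 32768.0
    # Append silence so the tail of the HA processing is captured.
    stim = np.concatenate([stim, np.zeros(int(tail_s * fs), np.float32)])
    # Play through the test-box loudspeaker while recording the
    # coupler microphone (one output channel, one input channel).
    rec = sd.playrec(stim, samplerate=fs, channels=1, blocking=True)
    # Store the HA output as a 16-kHz, 16-bit .wav file.
    pcm = (np.clip(rec[:, 0], -1.0, 1.0) * 32767).astype(np.int16)
    wavfile.write(output_path, fs, pcm)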
In the second database, the impact of HA speech enhancement
on perceived speech quality was investigated in noise-only, rever-
beration-only, and noise-plus-reverberation listening conditions.
Full details about the data set can be found in [28]. Twenty-two
adult HA users (average age of 71 years) with moderate to severe
sensorineural hearing loss profiles were recruited to participate in
the subjective quality experiments. Each of the participants was
fitted bilaterally with the Unitron experimental BTE HA and seated
at the center of a loudspeaker array, first in a double-walled sound booth ($RT_{60} = 0.1$ s) and then in a reverberant chamber ($RT_{60} = 0.9$ s). In each of these rooms, sentences spoken by a male talker were played from a speaker at 0° azimuth, and multitalker babble or speech-shaped noise at 0 or 5 dB SNR was played from speakers at 0, 90, 180, and 270° azimuth.
Participants listened to the degraded stimuli four times, each
time with a different HA setting: omnidirectional microphone,
adaptive directional microphone, partial strength signal enhance-
ment (directionality, noise reduction, and speech enhancement
algorithms operating below their maximum strengths), and full
strength signal enhancement (all enhancement algorithms oper-
ating at maximum strength). Within each condition, subjects
rated their perceived quality for each stimulus using the MUSHRA
quality scale. Once again, a customized set of HA recordings was
obtained to enable objective speech quality predictions. To this
end, the bilateral HAs were programmed to match the amplifica-
tion requirements for each HI participant and were then placed on
a Bruel and Kjaer head and torso simulator (HATS). The HATS
was then positioned in the center of the loudspeaker array in each
of the two room environments. The same stimuli used in subject-
ive speech quality experiments were played and the ensuing HA
outputs were stored in .wav files with 16-kHz sample rate and
16-bit resolution. In the analysis described in the section “Experi-
mental Results,” the objective metrics were computed separately
for the left and right channels (using the listeners’ left and right
audiograms, respectively) and then averaged into a final score that
would be compared against the subjective ratings using the per-
formance criteria described next. Moreover, all databases were also downsampled to 8 kHz so that the narrowband ITU-T P.563 metric could be tested.
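To make this evaluation pipeline concrete, here is a minimal sketch, assuming NumPy/SciPy and a placeholder objective_score function (hypothetical; the article does not prescribe any particular implementation), of the per-ear scoring, binaural averaging, and 8-kHz downsampling steps:

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def objective_score(reference, degraded, fs, audiogram):
    # Placeholder for an intrusive quality/intelligibility metric;
    # the article evaluates several, none implemented here.
    raise NotImplementedError

def binaural_score(ref_wav, left_wav, right_wav, audiogram_l, audiogram_r):
    fs, ref = wavfile.read(ref_wav)
    _, left = wavfile.read(left_wav)
    _, right = wavfile.read(right_wav)
    # Score each ear with the listener's corresponding audiogram,
    # then average the two channel scores into a final value.
    s_l = objective_score(ref, left, fs, audiogram_l)
    s_r = objective_score(ref, right, fs, audiogram_r)
    return 0.5 * (s_l + s_r)

def to_narrowband(signal, fs=16000):
    # Downsample 16 kHz -> 8 kHz so narrowband metrics (e.g., P.563) apply.
    return resample_poly(signal, up=1, down=fs // 8000)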
PERFORMANCE CRITERIA
To assess the performance of the tested algorithms, four perfor-
mance criteria were used. As suggested in the literature,
performance values are reported on a per-condition basis, where
condition-averaged objective and subjective intelligibility/qual-
ity ratings are used to reduce intra- and intersubject variability
[2]. First, linear relationships between predicted quality/intelligibility scores and subjective ratings are quantified via the Pearson correlation $(\rho)$. Second, the ranking capability of the objective metrics is characterized by the Spearman rank correlation $(\rho_{\mathrm{spear}})$, which is computed in a manner similar to $\rho$ but with the original data values replaced by their ranks. Together, these two measures can provide insight into the need for a nonlinear monotonic mapping between the objective metric scale and the subjective rating scale. Here, a sigmoidal mapping function is used, and once the objective values are mapped, a new Pearson correlation (termed $\rho_{\mathrm{sig}}$) is computed and used as the third performance criterion. The sigmoid mapping is given by
$$ Y = \frac{100\%}{1 + e^{-\alpha_1 (X - \alpha_2)}}, \qquad (4) $$
where $\alpha_1$ and $\alpha_2$ are the fitting parameters, $X$ represents the objective metric score, and $Y$ the mapped intelligibility/quality score.
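For concreteness, the following sketch (assuming SciPy; all variable names are illustrative rather than taken from the study) computes $\rho$, $\rho_{\mathrm{spear}}$, and $\rho_{\mathrm{sig}}$ from condition-averaged scores, fitting the sigmoid of (4) by nonlinear least squares:

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def sigmoid(x, a1, a2):
    # Equation (4): maps objective scores onto a 0-100% rating scale.
    return 100.0 / (1.0 + np.exp(-a1 * (x - a2)))

def correlation_criteria(objective, subjective):
    # objective, subjective: 1-D arrays of condition-averaged scores.
    rho, _ = pearsonr(objective, subjective)         # linearity
    rho_spear, _ = spearmanr(objective, subjective)  # rank ordering
    # Fit alpha_1 and alpha_2, then correlate the mapped scores.
    (a1, a2), _ = curve_fit(sigmoid, objective, subjective,
                            p0=[1.0, float(np.mean(objective))],
                            maxfev=10000)
    rho_sig, _ = pearsonr(sigmoid(objective, a1, a2), subjective)
    return rho, rho_spear, rho_sig

Note that curve_fit is only one way to estimate $\alpha_1$ and $\alpha_2$; the article does not state which fitting procedure was used.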
Finally, the so-called epsilon-insensitive root-mean-square estimation error ($\varepsilon$-RMSE) is used. This measure differs from the conventional RMSE in that it considers only differences that fall outside an epsilon-wide band around the target (subjective) quality/intelligibility value, thus taking the uncertainty of the subjective ratings into account. As proposed by ITU-T, epsilon can be defined as the 95% confidence interval $(ci_{95})$ of the subjective ratings and is given on a per-condition basis [31]. More specifically,
$$ ci_{95}(c) = t(0.05, M) \, \frac{\sigma(c)}{\sqrt{M}}, \qquad (5) $$
where $c$ indexes a condition type, $M$ corresponds to the total number of conditions, $\sigma(c)$ to the standard deviation of the per-condition subjective scores, and $t(0.05, M)$ to the t-value computed at a 0.05 significance level. As such, the per-condition $\varepsilon$-RMSE$(c)$ is given by
$$ \varepsilon\text{-RMSE}(c) = \max\bigl(0, \, \lvert Y(c) - S(c) \rvert - ci_{95}(c)\bigr), \qquad (6) $$
where $Y(c)$ corresponds to the average sigmoid-mapped intelligibility/quality score for a particular degradation condition $c$ (out of a total of $M$ conditions) and $S(c)$ is the corresponding average subjective score. The final $\varepsilon$-RMSE is then given by
$$ \varepsilon\text{-RMSE} = \sqrt{\frac{1}{M - d} \sum_{c=1}^{M} \varepsilon\text{-RMSE}^{2}(c)}, \qquad (7) $$
where the degree of freedom $d$ is set to 2 for the sigmoidal mapping function. An ideal objective metric will possess $\rho_{\mathrm{sig}}$ close to unity and an $\varepsilon$-RMSE close to zero.
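A minimal sketch of (5)–(7), under the assumption that the per-condition standard deviations of the subjective scores are available (names are illustrative):

import numpy as np
from scipy.stats import t as t_dist

def eps_rmse(Y, S, sigma, d=2, alpha=0.05):
    # Y: sigmoid-mapped objective score per condition (Eq. (4)).
    # S: average subjective score per condition.
    # sigma: per-condition standard deviation of the subjective scores.
    M = len(Y)
    # Equation (5): two-sided t-value at the 0.05 level with df = M.
    ci95 = t_dist.ppf(1.0 - alpha / 2.0, M) * sigma / np.sqrt(M)
    # Equation (6): only the error outside the epsilon band counts.
    per_cond = np.maximum(0.0, np.abs(Y - S) - ci95)
    # Equation (7): aggregate, removing d degrees of freedom
    # (d = 2 for the two-parameter sigmoid).
    return np.sqrt(np.sum(per_cond ** 2) / (M - d))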
When comparing the performance criteria of two or more metrics, it is important to characterize the statistical significance of the difference between them. For correlation-based criteria, a Fisher transformation z-test can be used; here, a significance level of 0.05 was used.
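A sketch of such a Fisher z-test for comparing two condition-level correlations; this is the standard construction, with n1 and n2 denoting the (illustrative) number of data points behind each correlation:

import numpy as np
from scipy.stats import norm

def fisher_z_test(r1, r2, n1, n2, alpha=0.05):
    # Fisher transformation approximately normalizes correlations.
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    # Standard error of the difference of two transformed values.
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2.0 * (1.0 - norm.cdf(abs(z)))  # two-tailed p-value
    return p < alpha  # True if the correlations differ significantly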
For the $\varepsilon$-RMSE criterion, the following statistical significance test was used, as suggested by ITU-T [31]: