Multiple Stimuli with Hidden Reference and Anchor (MUSHRA)
quality scale, with “20” referring to poor quality and “100” repre-
senting excellent quality. Participants selected and listened to the
reference and test stimuli and then indicated their quality judg-
ments by adjusting the corresponding sliders on the computer
screen. Custom HA recordings were obtained for the purpose of
objective speech quality prediction. To this end, the Phonak Savia
BTE HA was programmed to match the amplification targets for
each participant and was subsequently connected to a 2-cc coupler
and placed inside a portable anechoic HA test box. The 32 stimuli
within the database were then played back individually through the
loudspeaker in the test box, and the resulting HA output was stored
in a .wav file with 16-kHz sample rate and 16-bit resolution.
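As an illustration of this record-and-store step, the following is a minimal Python sketch of a play-and-record loop, assuming the PortAudio-based sounddevice package and illustrative file paths; none of these names come from the original study, and the actual test-box hardware interface is not specified in the text.

import numpy as np
import sounddevice as sd
from scipy.io import wavfile

def record_ha_output(stimulus_path, output_path, tail_s=0.5):
    # Load one database stimulus and normalize 16-bit PCM to [-1, 1].
    fs, stim = wavfile.read(stimulus_path)
    stim = stim.astype(np.float32) / 32768.0
    # Append silence so the tail of the HA processing is captured.
    stim = np.concatenate([stim, np.zeros(int(tail_s * fs), np.float32)])
    # Play through the test-box loudspeaker while recording the
    # coupler microphone (one output channel, one input channel).
    rec = sd.playrec(stim, samplerate=fs, channels=1, blocking=True)
    # Store the HA output as a 16-kHz, 16-bit .wav file.
    pcm = (np.clip(rec[:, 0], -1.0, 1.0) * 32767).astype(np.int16)
    wavfile.write(output_path, fs, pcm)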
In the second database, the impact of HA speech enhancement
on perceived speech quality was investigated in noise-only, rever-
beration-only, and noise-plus-reverberation listening conditions.
Full details about the data set can be found in [28]. Twenty-two
adult HA users (average age of 71 years) with moderate to severe
sensorineural hearing loss profiles were recruited to participate in
the subjective quality experiments. Each of the participants was
fitted bilaterally with the Unitron experimental BTE HA and seated
at the center of a loudspeaker array, first in a double-walled sound booth ($RT_{60} = 0.1$ s) and then in a reverberant chamber ($RT_{60} = 0.9$ s). In each of these rooms, sentences spoken by a male talker were played from a speaker at 0° azimuth, and multitalker babble or speech-shaped noise at 0 or 5 dB SNR was played from speakers at 0, 90, 180, and 270° azimuth.
Participants listened to the degraded stimuli four times, each
time with a different HA setting: omnidirectional microphone,
adaptive directional microphone, partial strength signal enhance-
ment (directionality, noise reduction, and speech enhancement
algorithms operating below their maximum strengths), and full
strength signal enhancement (all enhancement algorithms oper-
ating at maximum strength). Within each condition, subjects
rated their perceived quality for each stimulus using the MUSHRA
quality scale. Once again, a customized set of HA recordings was
obtained to enable objective speech quality predictions. To this
end, the bilateral HAs were programmed to match the amplifica-
tion requirements for each HI participant and were then placed on
a Bruel and Kjaer head and torso simulator (HATS). The HATS
was then positioned in the center of the loudspeaker array in each
of the two room environments. The same stimuli used in subject-
ive speech quality experiments were played and the ensuing HA
outputs were stored in .wav files with 16-kHz sample rate and
16-bit resolution. In the analysis described in the section “Experi-
mental Results,” the objective metrics were computed separately
for the left and right channels (using the listeners’ left and right
audiograms, respectively) and then averaged into a final score that
would be compared against the subjective ratings using the per-
formance criteria described next. Moreover, all databases were also downsampled to 8 kHz so that the narrowband ITU-T P.563 metric could be tested.
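To make this evaluation pipeline concrete, here is a minimal sketch, assuming NumPy/SciPy and a placeholder objective_score function (hypothetical; the article does not prescribe any particular implementation), of the per-ear scoring, binaural averaging, and 8-kHz downsampling steps:

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def objective_score(reference, degraded, fs, audiogram):
    # Placeholder for an intrusive quality/intelligibility metric;
    # the article evaluates several, none implemented here.
    raise NotImplementedError

def binaural_score(ref_wav, left_wav, right_wav, audiogram_l, audiogram_r):
    fs, ref = wavfile.read(ref_wav)
    _, left = wavfile.read(left_wav)
    _, right = wavfile.read(right_wav)
    # Score each ear with the listener's corresponding audiogram,
    # then average the two channel scores into a final value.
    s_l = objective_score(ref, left, fs, audiogram_l)
    s_r = objective_score(ref, right, fs, audiogram_r)
    return 0.5 * (s_l + s_r)

def to_narrowband(signal, fs=16000):
    # Downsample 16 kHz -> 8 kHz so narrowband metrics (e.g., P.563) apply.
    return resample_poly(signal, up=1, down=fs // 8000)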
PERFORMANCE CRITERIA
To assess the performance of the tested algorithms, four perfor-
mance criteria were used. As suggested in the literature,
performance values are reported on a per-condition basis, where
condition-averaged objective and subjective intelligibility/qual-
ity ratings are used to reduce intra- and intersubject variability
[2]. First, linear relationships between predicted quality/intelligibility scores and subjective ratings are quantified via the Pearson correlation $(\rho)$. Second, the ranking capability of the objective metrics is characterized by the Spearman rank correlation $(\rho_{\mathrm{spear}})$, which is computed in a manner similar to $\rho$ but with the original data values replaced by their ranks. Together, these two measures can provide insight into the need for a nonlinear monotonic mapping between the objective metric scale and the subjective rating scale. Here, a sigmoidal mapping function is used, and once the objective values are mapped, a new Pearson correlation (termed $\rho_{\mathrm{sig}}$) is computed and used as the third performance criterion. The sigmoid mapping is given by
$$ Y = \frac{100\%}{1 + e^{-\alpha_1 (X - \alpha_2)}}, \qquad (4) $$
where $\alpha_1$ and $\alpha_2$ are the fitting parameters, $X$ represents the objective metric score, and $Y$ the mapped intelligibility/quality score.
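For concreteness, the following sketch (assuming SciPy; all variable names are illustrative rather than taken from the study) computes $\rho$, $\rho_{\mathrm{spear}}$, and $\rho_{\mathrm{sig}}$ from condition-averaged scores, fitting the sigmoid of (4) by nonlinear least squares:

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def sigmoid(x, a1, a2):
    # Equation (4): maps objective scores onto a 0-100% rating scale.
    return 100.0 / (1.0 + np.exp(-a1 * (x - a2)))

def correlation_criteria(objective, subjective):
    # objective, subjective: 1-D arrays of condition-averaged scores.
    rho, _ = pearsonr(objective, subjective)         # linearity
    rho_spear, _ = spearmanr(objective, subjective)  # rank ordering
    # Fit alpha_1 and alpha_2, then correlate the mapped scores.
    (a1, a2), _ = curve_fit(sigmoid, objective, subjective,
                            p0=[1.0, float(np.mean(objective))],
                            maxfev=10000)
    rho_sig, _ = pearsonr(sigmoid(objective, a1, a2), subjective)
    return rho, rho_spear, rho_sig

Note that curve_fit is only one way to estimate $\alpha_1$ and $\alpha_2$; the article does not state which fitting procedure was used.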
Finally, the so-called epsilon-insensitive root-mean-square estimation error ($\varepsilon$-RMSE) is used. This measure differs from the conventional RMSE in that it considers only differences that fall outside an epsilon-wide band around the target (subjective) quality/intelligibility value, thus taking the uncertainty of the subjective ratings into account. As proposed by ITU-T, epsilon can be defined as the 95% confidence interval $(ci_{95})$ of the subjective ratings and is given on a per-condition basis [31]. More specifically,
$$ ci_{95}(c) = t(0.05, M) \, \frac{\sigma(c)}{\sqrt{M}}, \qquad (5) $$
where $c$ indexes a condition type, $M$ corresponds to the total number of conditions, $\sigma(c)$ to the standard deviation of the per-condition subjective scores, and $t(0.05, M)$ to the t-value computed at a 0.05 significance level. As such, the per-condition $\varepsilon$-RMSE$(c)$ is given by
$$ \varepsilon\text{-RMSE}(c) = \max\bigl(0, \, \lvert Y(c) - S(c) \rvert - ci_{95}(c)\bigr), \qquad (6) $$
where $Y(c)$ corresponds to the average sigmoid-mapped intelligibility/quality score for a particular degradation condition $c$ (out of a total of $M$ conditions) and $S(c)$ is the corresponding average subjective score. The final $\varepsilon$-RMSE is then given by
$$ \varepsilon\text{-RMSE} = \sqrt{\frac{1}{M - d} \sum_{c=1}^{M} \varepsilon\text{-RMSE}^{2}(c)}, \qquad (7) $$
where the degree of freedom $d$ is set to 2 for the sigmoidal mapping function. An ideal objective metric will possess $\rho_{\mathrm{sig}}$ close to unity and an $\varepsilon$-RMSE close to zero.
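A minimal sketch of (5)–(7), under the assumption that the per-condition standard deviations of the subjective scores are available (names are illustrative):

import numpy as np
from scipy.stats import t as t_dist

def eps_rmse(Y, S, sigma, d=2, alpha=0.05):
    # Y: sigmoid-mapped objective score per condition (Eq. (4)).
    # S: average subjective score per condition.
    # sigma: per-condition standard deviation of the subjective scores.
    M = len(Y)
    # Equation (5): two-sided t-value at the 0.05 level with df = M.
    ci95 = t_dist.ppf(1.0 - alpha / 2.0, M) * sigma / np.sqrt(M)
    # Equation (6): only the error outside the epsilon band counts.
    per_cond = np.maximum(0.0, np.abs(Y - S) - ci95)
    # Equation (7): aggregate, removing d degrees of freedom
    # (d = 2 for the two-parameter sigmoid).
    return np.sqrt(np.sum(per_cond ** 2) / (M - d))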
When comparing the performance criteria of two or more metrics, it is important to characterize the statistical significance of the difference between them. For correlation-based criteria, a Fisher transformation z-test can be used; here, a significance level of 0.05 was used.
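A sketch of such a Fisher z-test for comparing two condition-level correlations; this is the standard construction, with n1 and n2 denoting the (illustrative) number of data points behind each correlation:

import numpy as np
from scipy.stats import norm

def fisher_z_test(r1, r2, n1, n2, alpha=0.05):
    # Fisher transformation approximately normalizes correlations.
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    # Standard error of the difference of two transformed values.
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2.0 * (1.0 - norm.cdf(abs(z)))  # two-tailed p-value
    return p < alpha  # True if the correlations differ significantly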
For the $\varepsilon$-RMSE criterion, the following statistical significance test was used, as suggested by ITU-T [31]: