
IEEE SIGNAL PROCESSING MAGAZINE [64] MARCH 2015

enhancement. As a classical Wiener

filter only changes the magnitudes in

the STFT domain, the modified spectrum $X$

is inconsistent, meaning

that $\mathrm{STFT}(\mathrm{iSTFT}(X)) \neq X$.

In con-

trast to this, in [29] the relationship

between STFT coefficients across

time and frequency is taken into

account, leading to the consistent

Wiener filter [29], which modifies

both the magnitude and the phase of

the noisy observation to obtain the separated speech. Wiener filter

optimization is formulated as a maximum a posteriori problem

under Gaussian assumptions, and a consistency-enforcing term is

added either through a hard constraint or a soft penalty. Optimiza-

tion is respectively performed directly on the signal in the time

domain or jointly on phase and magnitude in the complex time-

frequency domain, through a conjugate gradient method with a

well-chosen preconditioner. Thanks to this joint optimization, the

consistent Wiener filter was shown to lead to an improved separ-

ation performance compared to the classical Wiener filter and

other methods that attempt to use phase information in combin-

ation with variance estimates [9], [21], [22], in an oracle scenario

as well as in a blind scenario where the speech spectrum is

obtained by spectral subtraction from a stationary estimate of the

noise spectrum.
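The inconsistency of a magnitude-only Wiener filter can be checked numerically. The following is a minimal sketch, not the method of [29]: the helper names `stft`, `istft`, and `inconsistency`, the toy sinusoid standing in for speech, the window length, the hop size, and the oracle Wiener gain are all assumptions made for illustration.

```python
import numpy as np

def stft(x, win, hop):
    """Windowed frames -> one-sided spectra, shape (frames, bins)."""
    n = len(win)
    starts = range(0, len(x) - n + 1, hop)
    return np.fft.rfft(np.array([x[s:s + n] * win for s in starts]), axis=1)

def istft(X, win, hop):
    """Least-squares (weighted overlap-add) inverse of stft()."""
    n = len(win)
    length = n + (X.shape[0] - 1) * hop
    y, norm = np.zeros(length), np.zeros(length)
    for k, frame in enumerate(np.fft.irfft(X, n=n, axis=1)):
        y[k * hop:k * hop + n] += frame * win
        norm[k * hop:k * hop + n] += win ** 2
    return y / np.maximum(norm, 1e-12)

def inconsistency(X, win, hop):
    """Relative distance between X and STFT(iSTFT(X))."""
    X_proj = stft(istft(X, win, hop), win, hop)
    return np.linalg.norm(X_proj - X) / np.linalg.norm(X)

# Toy setup (assumed values, not taken from the article).
rng = np.random.default_rng(0)
n, hop = 64, 16
win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)  # periodic Hann
t = np.arange(n + 99 * hop)
speech = np.sin(2 * np.pi * 0.11 * t)        # stand-in for clean speech
noise = rng.normal(scale=0.7, size=t.size)

S, N = stft(speech, win, hop), stft(noise, win, hop)
Y = S + N                                    # noisy STFT (STFT is linear)

# Oracle Wiener gain: real-valued, so only magnitudes are changed.
gain = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)
X_wiener = gain * Y

err_noisy = inconsistency(Y, win, hop)        # Y is the STFT of a real signal
err_wiener = inconsistency(X_wiener, win, hop)
print(err_noisy, err_wiener)
```

Under these assumptions, `err_noisy` should sit at the level of floating-point rounding, because the noisy spectrum is the STFT of an actual time-domain signal, whereas `err_wiener` should be clearly larger, since the real-valued gain changes magnitudes without adjusting the phases.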

To combine phase-sensitive magnitude estimation and iterative

approaches, Mowlaee and Saeidi [26] proposed placing the phase-

sensitive magnitude estimator into the loop of an iterative

approach that enforces consistency. Starting with an initial group-

delay-based phase estimate, they proposed to estimate the clean

speech spectral magnitude using a phase-sensitive magnitude esti-

mator similar to [25]. After computing the iSTFT and the STFT

they reestimated the clean speech phase, and from this reestimate

the magnitudes. With this approach,

convergence is reached after only

a few iterations.
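The structure of such a loop can be sketched as follows. This is only an illustration of the alternation between magnitude estimation and an iSTFT/STFT round trip, not the algorithm of [26]: `toy_phase_sensitive_magnitude` is a hypothetical stand-in for the estimator of [25], the perturbed clean phase replaces the group-delay-based initial estimate, and the signal, window, and iteration count are assumed values.

```python
import numpy as np

def stft(x, win, hop):
    """Windowed frames -> one-sided spectra, shape (frames, bins)."""
    n = len(win)
    starts = range(0, len(x) - n + 1, hop)
    return np.fft.rfft(np.array([x[s:s + n] * win for s in starts]), axis=1)

def istft(X, win, hop):
    """Least-squares (weighted overlap-add) inverse of stft()."""
    n = len(win)
    length = n + (X.shape[0] - 1) * hop
    y, norm = np.zeros(length), np.zeros(length)
    for k, frame in enumerate(np.fft.irfft(X, n=n, axis=1)):
        y[k * hop:k * hop + n] += frame * win
        norm[k * hop:k * hop + n] += win ** 2
    return y / np.maximum(norm, 1e-12)

def toy_phase_sensitive_magnitude(Y, phase):
    """Hypothetical stand-in: project Y onto the direction of the phase estimate."""
    return np.maximum(np.real(Y * np.exp(-1j * phase)), 0.0)

def iterate(Y, init_phase, win, hop, n_iter=5):
    """Alternate magnitude estimation and consistency enforcement."""
    phase = init_phase
    for _ in range(n_iter):
        mag = toy_phase_sensitive_magnitude(Y, phase)
        X = mag * np.exp(1j * phase)
        X = stft(istft(X, win, hop), win, hop)  # iSTFT followed by STFT
        phase = np.angle(X)                     # reestimated phase
    # Final magnitude reestimate from the last phase.
    return toy_phase_sensitive_magnitude(Y, phase) * np.exp(1j * phase)

# Toy data (assumed values, not taken from the article).
rng = np.random.default_rng(1)
n, hop = 64, 16
win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)  # periodic Hann
t = np.arange(n + 49 * hop)
clean = np.sin(2 * np.pi * 0.07 * t)
Y = stft(clean + rng.normal(scale=0.5, size=t.size), win, hop)

# Stand-in initial phase: clean phase plus a small random perturbation.
init_phase = np.angle(stft(clean, win, hop)) + rng.normal(scale=0.3, size=Y.shape)
X_hat = iterate(Y, init_phase, win, hop)
```

With this toy magnitude rule the estimated magnitude can never exceed the noisy magnitude, which makes the sketch easy to sanity-check; the behavior of the actual estimators in [25] and [26] is not reproduced here.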

Another way to jointly estimate

magnitudes and phases is to derive

a joint MMSE estimator of magni-

tudes and phases directly in the

STFT domain when an uncertain

initial phase estimate is available.

This phase-aware complex estima-

tor is referred to as the complex

estimator with uncertain phase (CUP) [12]. The initial phase

estimate can be obtained by an estimator based on signal charac-

teristics, such as the sinusoidal model-based approach [4]. Using

this joint MMSE estimator [12], no STFT iterations are required.

The resulting magnitude estimate is a nonlinear tradeoff

between a phase-blind and a phase-aware magnitude estimator,

while the resulting phase is a tradeoff between the noisy phase

and the initial phase estimate. These tradeoffs are controlled by

the uncertainty of the initial phase estimate, avoid processing

artifacts, and lead to an improvement in predicted speech quality

[12]. Experimental results for the CUP estimator are given in the

following section.

EXPERIMENTAL RESULTS

In this section, we demonstrate the potential of phase processing

to improve speech enhancement algorithms. To focus only on the

differences due to the incorporation of the spectral phases, we

choose algorithms that employ the same statistical models and

PSD estimators: for the estimation of the noise PSD we choose the

speech presence probability-based estimator with fixed priors (see

[6, Sec. 6.3] and references therein) while for the speech PSD we

choose the decision-directed approach [7]. We assume a complex

Gaussian distribution for the noise STFT coefficients and a

heavy-tailed $\chi$-distribution for the speech magnitudes. Furthermore, we

use an MMSE estimate of the square root of the magnitudes to

incorporate the compressive character of the human auditory sys-

tem. These models are employed in the phase-blind magnitude

estimator [30], the phase-aware magnitude estimator [25], and the

phase-aware CUP [12]. We use a sampling rate of 8 kHz and 32 ms

spectral analysis windows with 7/8th overlap to facilitate phase

estimation. To assess the speech quality, we employ PESQ as an

instrumental measure that was originally proposed for speech

coding applications but has been shown to correlate with subjective

listening tests also for enhanced speech signals. The results are

averaged over pink noise modulated at 0.5 Hz, stationary pink

noise, babble noise, and factory noise, where the latter three are

obtained from the NOISEX-92 database. To have a fair balance

between male and female speakers, per noise type, the first 100

male and the first 100 female utterances from dialect region 6 of

the Texas Instruments and Massachusetts Institute of Technology

(TIMIT) training database are employed. The initial phase estimate

is obtained based on a sinusoidal model [4], which only yields a

phase estimate in voiced speech. The fundamental frequency is

estimated using PEFAC from the voicebox toolkit (http://www.

[FIG5] The PESQ improvement over the noisy input. The results

are averaged over four noise types. Evaluated (a) on voiced

speech and (b) on the entire signal.

Legend: Phase-Blind Magnitude [30]; Phase-Aware Complex [12]; [12], Oracle f; Phase-Aware Magnitude [25]; [25], Oracle f.

Axes, panels (a) and (b): PESQ Improvement (MOS), 0.2 to 0.7, versus Global SNR (dB), –10 to 10.
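For orientation, the analysis settings of the experimental setup (8 kHz sampling, 32 ms windows, 7/8 overlap) and the decision-directed speech PSD estimation [7] can be sketched as below. This is a rough illustration under stated assumptions: the smoothing constant `alpha`, the floor `xi_min`, the zero initialization, and the Wiener gain used as the spectral gain are hypothetical choices, not the estimators of [30], [25], or [12].

```python
import numpy as np

# Frame parameters implied by the stated setup.
fs = 8000                            # sampling rate in Hz
win_len = int(round(0.032 * fs))     # 32 ms window in samples
hop = win_len // 8                   # 7/8 overlap, so the hop is 1/8 of the window

def decision_directed_wiener(Y_power, noise_psd, alpha=0.98, xi_min=1e-3):
    """Decision-directed a priori SNR estimate, followed by a Wiener gain.

    Y_power:   |Y|^2 with shape (frames, bins)
    noise_psd: noise PSD estimate with shape (bins,)
    alpha, xi_min: assumed values, not taken from the article
    """
    gain = np.empty_like(Y_power, dtype=float)
    prev_clean_power = np.zeros(Y_power.shape[1])  # assumed initialization
    for l in range(Y_power.shape[0]):
        gamma = Y_power[l] / noise_psd             # a posteriori SNR
        xi = (alpha * prev_clean_power / noise_psd
              + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))
        xi = np.maximum(xi, xi_min)                # a priori SNR with floor
        gain[l] = xi / (1.0 + xi)                  # Wiener gain as a stand-in
        prev_clean_power = gain[l] ** 2 * Y_power[l]
    return gain

# Toy input: noise-only everywhere, plus a strong component in bin 3
# from frame 10 onward.
Y_power = np.ones((20, 8))
Y_power[10:, 3] = 100.0
gain = decision_directed_wiener(Y_power, noise_psd=np.ones(8))
print(win_len, hop, gain[-1, 3], gain[-1, 0])
```

In this toy input the gain in the bin carrying the strong component rises toward one within a few frames, while the gain in noise-only bins stays near the floor set by `xi_min`.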

WHEN AN INITIAL PHASE

ESTIMATE IS ALSO EMPLOYED

AS UNCERTAIN PRIOR

INFORMATION WHEN IMPROVING

THE SPECTRAL PHASE AS PROPOSED

IN THE PHASE-AWARE COMPLEX

ESTIMATOR CUP, THE PERFORMANCE

CAN BE IMPROVED FURTHER.
