IEEE SIGNAL PROCESSING MAGAZINE [64] MARCH 2015
enhancement. As a classical Wiener
filter only changes the magnitudes in
the STFT domain, the modified spectrum $\tilde{X}$ is inconsistent, meaning
that $\mathrm{STFT}(\mathrm{iSTFT}(\tilde{X})) \neq \tilde{X}$. In contrast
to this, in [29] the relationship
between STFT coefficients across
time and frequency is taken into
account, leading to the consistent
Wiener filter [29], which modifies
both the magnitude and the phase of
the noisy observation to obtain the separated speech. Wiener filter
optimization is formulated as a maximum a posteriori problem
under Gaussian assumptions, and a consistency-enforcing term is
added either through a hard constraint or a soft penalty. Optimiza-
tion is respectively performed directly on the signal in the time
domain or jointly on phase and magnitude in the complex time-
frequency domain, through a conjugate gradient method with a
well-chosen preconditioner. Thanks to this joint optimization, the
consistent Wiener filter was shown to lead to an improved separ-
ation performance compared to the classical Wiener filter and
other methods that attempt to use phase information in combin-
ation with variance estimates [9], [21], [22], in an oracle scenario
as well as in a blind scenario where the speech spectrum is
obtained by spectral subtraction from a stationary estimate of the
noise spectrum.
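The inconsistency of a magnitude-only modification can be checked numerically. The following is a minimal sketch, not the method of [29]: the helper names `stft`, `istft`, and `inconsistency`, the Hamming window, the synthetic "speech-like" tone, and the oracle gain built from instantaneous periodograms are all assumptions made for illustration.

```python
import numpy as np

def stft(x, win, hop):
    """Windowed frames -> one-sided FFT, shape (frames, bins)."""
    n = len(win)
    starts = range(0, len(x) - n + 1, hop)
    return np.array([np.fft.rfft(win * x[s:s + n]) for s in starts])

def istft(X, win, hop):
    """Least-squares inverse STFT via weighted overlap-add."""
    n = len(win)
    length = n + hop * (X.shape[0] - 1)
    num = np.zeros(length)
    den = np.zeros(length)
    for m, frame in enumerate(X):
        s = m * hop
        num[s:s + n] += win * np.fft.irfft(frame, n)
        den[s:s + n] += win ** 2
    return num / den

def inconsistency(X, win, hop):
    """Relative distance between X and STFT(iSTFT(X))."""
    reprojected = stft(istft(X, win, hop), win, hop)
    return np.linalg.norm(reprojected - X) / np.linalg.norm(X)

# Illustrative setup (assumed): 8 kHz, 256-sample Hamming window, hop of 32.
rng = np.random.default_rng(0)
fs, n_win, hop = 8000, 256, 32
win = np.hamming(n_win)
n_samples = n_win + hop * 99          # fits exactly 100 frames
t = np.arange(n_samples) / fs
speech_like = np.sin(2 * np.pi * 200 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
noise = 0.5 * rng.standard_normal(n_samples)

S = stft(speech_like, win, hop)
N = stft(noise, win, hop)
Y = S + N                             # the STFT is linear, so this is the noisy STFT

# Real-valued gain in [0, 1]: changes magnitudes only, keeps the noisy phase.
gain = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)
X_tilde = gain * Y

print("noisy STFT:   ", inconsistency(Y, win, hop))
print("filtered STFT:", inconsistency(X_tilde, win, hop))
```

The unmodified noisy STFT comes from a time-domain signal and is therefore consistent up to rounding error, whereas the gain-modified spectrum generally is not, which is the gap the consistency-enforcing term is meant to close.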
To combine phase-sensitive magnitude estimation and iterative
approaches, Mowlaee and Saeidi [26] proposed placing the phase-
sensitive magnitude estimator into the loop of an iterative
approach that enforces consistency. Starting with an initial group-
delay-based phase estimate, they proposed to estimate the clean
speech spectral magnitude using a phase-sensitive magnitude estimator
similar to [25]. After computing the iSTFT and the STFT,
they reestimated the clean speech phase and, from this, reestimated
the magnitudes. With this approach,
convergence is reached after only
a few iterations.
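The structure of such a loop can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm of [26]: `consistency_loop` and its `magnitude_step` hook are hypothetical names, the noisy phase stands in for the group-delay-based initial estimate, and by default the magnitude is simply held fixed, which reduces the loop to a Griffin-Lim-style iteration. A phase-sensitive magnitude estimator would be passed in through `magnitude_step`.

```python
import numpy as np

def stft(x, win, hop):
    """Windowed frames -> one-sided FFT, shape (frames, bins)."""
    n = len(win)
    starts = range(0, len(x) - n + 1, hop)
    return np.array([np.fft.rfft(win * x[s:s + n]) for s in starts])

def istft(X, win, hop):
    """Least-squares inverse STFT via weighted overlap-add."""
    n = len(win)
    length = n + hop * (X.shape[0] - 1)
    num = np.zeros(length)
    den = np.zeros(length)
    for m, frame in enumerate(X):
        s = m * hop
        num[s:s + n] += win * np.fft.irfft(frame, n)
        den[s:s + n] += win ** 2
    return num / den

def consistency_loop(Y, mag0, win, hop, n_iter=5, magnitude_step=None):
    """Alternate consistency projection, phase update, and magnitude update.

    magnitude_step(Y, phase) is a hypothetical hook for a phase-sensitive
    magnitude estimator; if None, the initial magnitude is kept unchanged.
    """
    mag, phase = mag0, np.angle(Y)    # assumed initial phase: the noisy phase
    history = []
    for _ in range(n_iter):
        X = mag * np.exp(1j * phase)
        P = stft(istft(X, win, hop), win, hop)   # enforce consistency
        history.append(np.linalg.norm(P - X) / np.linalg.norm(X))
        phase = np.angle(P)                      # reestimate the phase
        if magnitude_step is not None:
            mag = magnitude_step(Y, phase)       # reestimate the magnitude
    return mag * np.exp(1j * phase), history

# Illustrative synthetic data (assumed, not from the article).
rng = np.random.default_rng(0)
fs, n_win, hop = 8000, 256, 32
win = np.hamming(n_win)
n_samples = n_win + hop * 99
t = np.arange(n_samples) / fs
clean = np.sin(2 * np.pi * 200 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
noise = 0.5 * rng.standard_normal(n_samples)
S, N = stft(clean, win, hop), stft(noise, win, hop)
Y = S + N
mag0 = np.abs(Y) * np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)

X_hat, history = consistency_loop(Y, mag0, win, hop, n_iter=5)
print(history)   # relative inconsistency measured at each iteration
```

The iteration count of five is an arbitrary choice for the sketch; the recorded `history` makes it possible to check how quickly the inconsistency shrinks for a given magnitude estimator.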
Another way to jointly estimate
magnitudes and phases is to derive
a joint MMSE estimator of magni-
tudes and phases directly in the
STFT domain when an uncertain
initial phase estimate is available.
This phase-aware complex estima-
tor is referred to as the complex
estimator with uncertain phase (CUP) [12]. The initial phase
estimate can be obtained by an estimator based on signal charac-
teristics, such as the sinusoidal model-based approach [4]. Using
this joint MMSE estimator [12], no STFT iterations are required.
The resulting magnitude estimate is a nonlinear tradeoff
between a phase-blind and a phase-aware magnitude estimator,
while the resulting phase is a tradeoff between the noisy phase
and the initial phase estimate. These tradeoffs are controlled by
the uncertainty of the initial phase estimate, avoid processing
artifacts, and lead to an improvement in predicted speech quality
[12]. Experimental results for the CUP estimator are given in the
following section.
EXPERIMENTAL RESULTS
In this section, we demonstrate the potential of phase processing
to improve speech enhancement algorithms. To focus only on the
differences due to the incorporation of the spectral phases, we
choose algorithms that employ the same statistical models and
PSD estimators: for the estimation of the noise PSD we choose the
speech presence probability-based estimator with fixed priors (see
[6, Sec. 6.3] and references therein) while for the speech PSD we
choose the decision-directed approach [7]. We assume a complex
Gaussian distribution for the noise STFT coefficients and a heavy-tailed
$\chi$-distribution for the speech magnitudes. Furthermore, we
use an MMSE estimate of the square root of the magnitudes to
incorporate the compressive character of the human auditory sys-
tem. These models are employed in the phase-blind magnitude
estimator [30], the phase-aware magnitude estimator [25], and the
phase-aware CUP [12]. We use a sampling rate of 8 kHz and 32 ms
spectral analysis windows with 7/8th overlap to facilitate phase
estimation. To assess the speech quality, we employ PESQ as an
instrumental measure that was originally proposed for speech
coding applications but has been shown to correlate with subjective
listening tests also for enhanced speech signals. The results are
averaged over pink noise modulated at 0.5 Hz, stationary pink
noise, babble noise, and factory noise, where the latter three are
obtained from the NOISEX-92 database. To have a fair balance
between male and female speakers, per noise type, the first 100
male and the first 100 female utterances from dialect region 6 of
the Texas Instruments and Massachusetts Institute of Technology
(TIMIT) training database are employed. The initial phase estimate
is obtained based on a sinusoidal model [4], which only yields a
phase estimate in voiced speech. The fundamental frequency is
estimated using PEFAC from the voicebox toolkit (http://www.
[FIG5] The PESQ improvement over the noisy input. The results
are averaged over four noise types. Evaluated (a) on voiced
speech and (b) on the entire signal. Both panels plot PESQ
improvement (MOS), from 0.2 to 0.7, against global SNR (dB),
from -10 to 15. Curves: Phase-Blind Magnitude [30]; Phase-Aware
Complex [12]; [12] with oracle f0; Phase-Aware Magnitude [25];
[25] with oracle f0.
WHEN AN INITIAL PHASE
ESTIMATE IS ALSO EMPLOYED
AS UNCERTAIN PRIOR
INFORMATION WHEN IMPROVING
THE SPECTRAL PHASE AS PROPOSED
IN THE PHASE-AWARE COMPLEX
ESTIMATOR CUP, THE PERFORMANCE
CAN BE IMPROVED FURTHER.
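The analysis settings quoted in the experimental setup (8 kHz sampling rate, 32 ms windows, 7/8 overlap) can be translated into frame parameters with a few lines of arithmetic. The sketch below assumes that the FFT length equals the window length, that is, no zero padding, which the text does not state; the variable names are illustrative.

```python
fs_hz = 8000          # sampling rate from the text
window_ms = 32        # analysis window duration from the text
overlap = 7 / 8       # fractional overlap from the text

window_len = int(fs_hz * window_ms / 1000)        # 256 samples per window
hop_len = int(round(window_len * (1 - overlap)))  # 32 samples between frames
hop_ms = 1000 * hop_len / fs_hz                   # 4.0 ms frame shift
n_bins = window_len // 2 + 1                      # 129 one-sided bins (assumes no zero padding)
bin_spacing_hz = fs_hz / window_len               # 31.25 Hz between bins
frames_per_second = fs_hz / hop_len               # 250.0 frames per second

print(window_len, hop_len, hop_ms, n_bins, bin_spacing_hz, frames_per_second)
```

The short 4 ms frame shift means that neighboring frames share most of their samples, which is the redundancy that phase estimation across frames relies on.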