IEEE SIGNAL PROCESSING MAGAZINE [63] MARCH 2015
is not processed alongside. To illustrate this, let us consider a
speech signal degraded by an impulse train with a period length of
$T_0$, which is nonzero every $N_0 = T_0 f_s$ samples. In Figure 4, the
noisy signal (a) is presented together with the result obtained
when combining the true clean speech STFT magnitudes with the
noisy phase (b). Even though the clean magnitude is employed,
which represents the best possible result for phase-blind magnitude
enhancement, the time-domain signal still exhibits residual
impulses, which are caused by the noisy phase. In regions where
the enhanced spectral magnitude is close to zero, i.e., in speech
absence, the phase is not relevant and the peaks are well sup-
pressed. During speech presence, however, the spectral magnitude
is nonzero and the phase becomes important. Accordingly, the
residual impulses are most prominent in regions with some speech
energy at low local SNRs, where the noisy phase is close to the
phase of the impulsive noise.
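The experiment described above can be approximated with a minimal sketch, given here only as an illustration and not as the authors' setup. It assumes a synthetic on/off harmonic tone as a stand-in for speech, a sampling rate of 16 kHz, an impulse period of 20 ms, and a hand-written STFT with a square-root Hann window and 50% overlap. All names (`stft`, `istft`, `hybrid`, `residual`) and all parameter values are assumptions.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """STFT with a periodic square-root Hann analysis window."""
    win = np.sqrt(np.hanning(n_fft + 1)[:-1])
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, n_fft=512, hop=256):
    """Overlap-add synthesis with the same square-root Hann window."""
    win = np.sqrt(np.hanning(n_fft + 1)[:-1])
    frames = np.fft.irfft(X, n=n_fft, axis=1) * win
    out = np.zeros(hop * (len(X) - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
    return out

fs = 16000                  # sampling rate in Hz (assumed value)
T0 = 0.02                   # impulse period in seconds (assumed value)
N0 = int(round(T0 * fs))    # N_0 = T_0 f_s samples between impulses

# Synthetic stand-in for speech: a harmonic tone switched on and off at 2 Hz,
# so the signal has "presence" and "absence" segments.
t = np.arange(fs) / fs
envelope = (np.sin(2 * np.pi * 2 * t) > 0).astype(float)
clean = 0.3 * envelope * sum(np.sin(2 * np.pi * 150 * k * t) / k for k in (1, 2, 3))

# Impulse train that is nonzero every N0 samples.
clicks = np.zeros_like(clean)
clicks[::N0] = 1.0
noisy = clean + clicks

S = stft(clean)
Y = stft(noisy)

# Clean magnitude combined with the noisy phase.
hybrid = np.abs(S) * np.exp(1j * np.angle(Y))
x_hat = istft(hybrid)

# Reference: the clean signal passed through the same analysis and synthesis.
reference = istft(S)
residual = x_hat - reference
```

Inspecting `residual` alongside `envelope` is one way to compare segments where the tone is present against segments where it is absent.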
Recently, Sugiyama and Miyahara proposed the concept of
phase randomization to overcome this issue; see, e.g., [27] and
references therein. First, time-frequency points that are domi-
nated by speech are identified by finding spectral peaks in the
noisy signal. These peaks are excluded from the phase randomi-
zation to avoid speech distortions. To further narrow down
time-frequency regions where randomization of the spectral
phase is sensible, phase-based transient detection can be
employed as well [27]. Then, the spectral phase in bins classified
as dominated by transient noise is randomized by adding a
phase term that is uniformly distributed between $-\pi$ and $\pi$. In
this way, the approximately linear phase of the dominant noise
component is neutralized. The effect of phase randomization is
depicted in Figure 4(c), where a perfect magnitude estimate is
combined with the modified phase for signal reconstruction. It
can be seen that the residual peaks that are present when the
noisy phase is employed are strongly attenuated, showing that
phase randomization can indeed lead to a considerable increase
in noise reduction, especially at low local SNRs. It is interesting
to note that while the previously described iterative and sinusoi-
dal model-based approaches aim at estimating the phase of the
clean speech signal, the phase randomization approach merely
aims at reducing the impact of the phase of the noise on the
enhanced speech signal. Although the presented example is just
a simple toy experiment, it still highlights the potential of phase
randomization toward an improved suppression of transient
noise, which has also been observed for real-world impulsive
noise, like tapping noise on a touchscreen [27].
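The randomization step itself can be sketched as follows. This is a hedged illustration, not the algorithm of [27]: the function names are hypothetical, the peak picker is a crude local-maximum search standing in for the speech-peak detection, and the per-frame transient flags are assumed to come from a detector that is not shown.

```python
import numpy as np

def spectral_peak_mask(Y):
    """Crude stand-in for speech-peak detection (assumption):
    mark local maxima of |Y| along the frequency axis."""
    mag = np.abs(Y)
    peaks = np.zeros(mag.shape, dtype=bool)
    peaks[:, 1:-1] = (mag[:, 1:-1] > mag[:, :-2]) & (mag[:, 1:-1] > mag[:, 2:])
    return peaks

def randomize_phase(Y, mask, rng=None):
    """Add a phase term drawn uniformly from [-pi, pi) to the masked bins.
    Magnitudes are left unchanged; unmasked bins get a zero phase offset."""
    rng = np.random.default_rng() if rng is None else rng
    offset = rng.uniform(-np.pi, np.pi, size=Y.shape)
    offset = np.where(mask, offset, 0.0)
    return Y * np.exp(1j * offset)

# Toy input: random complex values standing in for noisy STFT frames
# (rows are frames, columns are frequency bins).
rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 9)) + 1j * rng.standard_normal((4, 9))

# Hypothetical transient-detector output: frames 1 and 2 contain a click.
transient_frames = np.array([False, True, True, False])

# Randomize only bins in transient frames that are not spectral peaks.
mask = transient_frames[:, None] & ~spectral_peak_mask(Y)
Y_rand = randomize_phase(Y, mask, rng)
```

For reconstruction, an enhanced magnitude would be combined with `np.angle(Y_rand)` in place of the noisy phase. Excluding the detected peaks from `mask` mirrors the idea of leaving speech-dominated bins untouched.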
RELATION BETWEEN PHASE- AND
MAGNITUDE ESTIMATION
So far, we have discussed phase estimation using iterative
approaches, sinusoidal model-based approaches, and group
delay approaches; we now address the question of how STFT
phase estimation can best be employed to improve speech
enhancement. The most obvious way to do this is to combine
enhanced speech spectral magnitudes in the STFT domain with
the estimated or reconstructed STFT phases. It is interesting to
note that Wang and Lim [10] already stated that obtaining a
more accurate phase estimate than the noisy phase is not worth
the effort “if the estimate is used to reconstruct a signal by
combining it with an independently estimated magnitude [...].
However, if a significantly different approach is used to exploit
the phase information such as using the phase estimate to fur-
ther improve the magnitude estimate, then a more accurate
estimation of phase may be important” [10]. However, at that
point it was not clear how a phase estimate could be employed
to improve magnitude estimation.
Gerkmann and Krawczyk [25] derived an MMSE estimator of
the spectral magnitude when an estimate of the clean speech
phase is available, referred to as phase-sensitive or phase-aware
magnitude estimation. They were able to show that the informa-
tion of the speech spectral phase can be employed to derive an
improved magnitude estimator that is capable of reducing noise
outliers that are not tracked by the noise PSD estimator. In babble
noise, in a blind setup, the PESQ MOS can be improved by 0.25
points in voiced speech at 0 dB input SNR [25]. Further experi-
mental results are given in the following section.
Instead of estimating phase and magnitude separately, one may
argue that they should ideally be jointly estimated. The first step in
this direction was proposed by Le Roux and Vincent [29] and refer-
ences therein in the context of Wiener filtering for speech
[FIG4] (a) Speech degraded by a click train. (b) Signal obtained by combination of the clean speech spectral magnitude with the noisy
phase. (c) Signal after supplemental phase randomization. Samples that contain a click are highlighted in red.
[Panel titles: (a) Noisy Speech; (b) Enhanced Speech Before Phase Randomization; (c) Enhanced Speech After Phase Randomization. Horizontal axis of each panel: Time (s), ticks from 0.2 to 0.8; vertical axis ticks from –1 to 1.]