
IEEE SIGNAL PROCESSING MAGAZINE [58] MARCH 2015
such as nonnegative matrix factorization, hidden Markov models,
and discriminative approaches such as deep neural networks.
However, mainstream approaches have tended to ignore the
phase, mainly due to the difficulty of modeling it and the lack of
clarity about its importance, as discussed next.
RISE, DECLINE, AND RENAISSANCE OF PHASE
PROCESSING FOR SPEECH ENHANCEMENT
The first proposals for noise reduction in the STFT domain arose in
the late 1970s. While the spectral subtraction approaches only mod-
ified the spectral magnitudes, the role of the STFT phase was also
actively researched at the time. In particular, several authors inves-
tigated conditions under which a signal is uniquely specified by only
its phase or only its magnitude and proposed iterative algorithms
for signal reconstruction from either one or the other (e.g., [1], [8],
and references therein). For minimum or maximum phase systems,
log-magnitude and phase are related through the Hilbert trans-
form, meaning that only the spectral phase (or only the spectral
magnitude) is required to reconstruct the entire signal. But the
constraint of purely minimum or maximum phase is too restrictive
for real audio signals, and Quatieri [8] showed that more con-
straints are needed for mixed-phase signals. For instance, imposing
a causality or a finite-length constraint on the signal and specifying
a few samples of the phase or the signal itself is in some cases suffi-
cient to uniquely characterize the entire phase function from only
the magnitude. Quatieri [8] also showed how to exploit such con-
straints to estimate a signal from its spectral magnitude: assuming
some time-domain samples are known, and starting with an initial
phase estimate and the known spectral magnitude, the signal is
transformed to the time domain, where the given set of known
samples is used to replace the corresponding time-domain samples.
Then the time-domain signal is transformed back to the frequency
domain, where the resulting magnitude is replaced by the known
magnitude. This procedure is repeated for a certain number of iter-
ations. In the case of the STFT domain, the correlation between
overlapping short-time analysis segments can be exploited to derive
similar iterative algorithms that do not require time-domain sam-
ples to be known. A popular example of such methods is that of
Griffin and Lim (GL) [1], which we describe in more detail later
along with more recent approaches. While algorithms such as GL
can also be employed with magnitudes that are estimated rather
than measured from an actual signal, the quality of the synthesized
speech and the estimated phase strongly depends on the accuracy of
the estimated speech spectral magnitudes, and artifacts such as
echo, smearing, and modulations may occur [9].
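The GL-style iteration described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the STFT parameters, the random phase initialization, and the iteration count are our own assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(target_mag, n_iter=100, nperseg=256, noverlap=192, seed=0):
    """Iteratively estimate a time-domain signal whose STFT magnitude
    matches target_mag (shape: frequencies x frames), in the style of GL [1]."""
    rng = np.random.default_rng(seed)
    # Initialize with the known magnitude and a random phase guess.
    spec = target_mag * np.exp(2j * np.pi * rng.random(target_mag.shape))
    for _ in range(n_iter):
        # Inverse STFT with the current phase estimate; the overlap between
        # short-time segments couples the per-frame phases with each other.
        _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
        # Forward STFT: keep the resulting phase, restore the known magnitude.
        _, _, spec = stft(x, nperseg=nperseg, noverlap=noverlap)
        spec = target_mag * np.exp(1j * np.angle(spec))
    _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
    return x
```

Each pass alternately enforces the two constraints at play: the spectrogram must be the STFT of some time-domain signal, and it must have the prescribed magnitude.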
To explore the relevance of phase estimation for speech
enhancement, Wang and Lim [10] performed listening experi-
ments where the magnitude of a noisy speech signal at a certain
signal-to-noise ratio (SNR) was combined with the phase of the
same speech signal but distorted by noise at a different SNR. Lis-
teners were asked to compare this artificial test stimulus to a noisy
reference speech signal and to set the SNR of the reference such
that the perceived quality was the same for the reference and the
test stimulus. The result of this experiment was that mixing noisy
magnitudes with a less distorted phase yielded typical SNR
improvements of only 1 dB or less. Hence,
Wang and Lim concluded that improving phase was not critical in
speech enhancement [10]. Similarly, Vary [11] showed that a certain
roughness could be perceived only at local SNRs below 6 dB if the
noisy phase was kept unchanged. Finally, Ephraim and
Malah [7] investigated the role of phase improvement from a sta-
tistical perspective: they showed that, under a zero-mean circular
Gaussian speech and noise model and assuming that time-fre-
quency points are mutually independent given the speech and
noise PSDs, the MMSE estimate of the complex exponential of the
speech phase has an argument equal to the noisy phase. Also, for
more general models for the speech magnitudes with the same
circularity assumption, it has been shown that the noisy phase is
the ML-optimal estimator of the clean speech phase, e.g., [12].
Note, however, that the independence assumption does not hold in
general, and especially not for overlapping STFT frames, where
part of the relationship is actually deterministic.
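Cross-SNR test stimuli of the kind used by Wang and Lim [10] are straightforward to construct. The following sketch combines the STFT magnitude of one signal with the STFT phase of another; the function name and the STFT parameters are our own choices for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

def mix_mag_phase(mag_source, phase_source, nperseg=256, noverlap=192):
    """Build a test stimulus from the STFT magnitude of one signal and
    the STFT phase of another, as in cross-SNR listening tests."""
    _, _, A = stft(mag_source, nperseg=nperseg, noverlap=noverlap)
    _, _, B = stft(phase_source, nperseg=nperseg, noverlap=noverlap)
    # Magnitude from the first signal, phase from the second.
    _, y = istft(np.abs(A) * np.exp(1j * np.angle(B)),
                 nperseg=nperseg, noverlap=noverlap)
    return y
```

In the listening test, `mag_source` and `phase_source` would be the same speech signal corrupted by noise at two different SNRs; when both inputs are identical, the routine reduces to an ordinary STFT round trip.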
As a consequence of these observations, subsequent research in
speech enhancement focused mainly on improving magnitude
estimation, while phase estimation received far less attention for
the next two decades. Even methods that considered phase, either
by use of complex domain models, or by integrating out phase in
log-magnitude-based models in a sophisticated way [13], ultimately
used the noisy phase because of similar circularity assumptions.
However, as the performance of magnitude-only methods can
only go so far without considering phase, and with the increase in
computational power of assisted listening and speech communica-
tion devices, all options for improvements are back on the table.
Therefore, researchers started reinvestigating the role of the STFT
phase for speech intelligibility and quality [14], [15]. For instance,
Kazama et al. [14] investigated the influence of the STFT segment
length on the role of phase for speech intelligibility for a segment
overlap of 50%. They found that, while for signal segments
between 4 ms and 64 ms the STFT magnitude spectrum is more
important than the phase spectrum, for segments shorter than
2 ms and segments longer than 128 ms, the phase spectrum is
more important. These results are consistent with Wang and Lim’s
earlier conclusions [10]. To focus on practical applications, Paliwal
et al. [15] investigated signal segments of 32 ms length, but in
contrast to Wang and Lim [10] and Kazama et al. [14], they used a
segment overlap of 7/8 instead of 1/2 in the STFT analysis, and they
also zero-padded the time segments before computing the Fourier
transform. With this increased redundancy in the STFT, the perfor-
mance of existing magnitude-based speech enhancement can be
significantly improved [15] if combined with enhanced phases. For
instance, Paliwal et al. [15, case 4] report an improvement of 0.2
points of the mean opinion score (MOS) predicted by the instru-
mental “perceptual evaluation of speech quality” (PESQ) measure
for white Gaussian noise at an SNR of 0 dB when combining an
MMSE estimate of the clean speech magnitude with the oracle
clean speech phase in a perfectly reconstructing STFT framework.
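A comparable analysis-synthesis setup can be sketched as follows. The sampling rate, window, and zero-padding factor below are our assumptions for illustration, not necessarily the exact configuration of [15].

```python
import numpy as np
from scipy.signal import stft, istft

# Assumed setup: fs = 16 kHz, so a 32 ms segment is 512 samples; a 7/8
# overlap gives a hop of 64 samples; nfft = 1024 zero-pads each segment.
FS = 16000
NPERSEG = int(0.032 * FS)          # 512 samples = 32 ms
NOVERLAP = NPERSEG - NPERSEG // 8  # 7/8 overlap, i.e., hop of 64 samples
NFFT = 2 * NPERSEG                 # zero-padding before the FFT

def combine_oracle_phase(estimated_mag, clean):
    """Resynthesize speech from an estimated magnitude and the oracle
    (clean-speech) phase in a perfectly reconstructing STFT framework."""
    _, _, C = stft(clean, fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP, nfft=NFFT)
    S = estimated_mag * np.exp(1j * np.angle(C))
    _, y = istft(S, fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP, nfft=NFFT)
    return y
```

With a Hann window and an eighth-segment hop, the overlap-add conditions hold, so the framework reconstructs perfectly when both magnitude and phase are exact; in the experiment of [15], `estimated_mag` would instead come from an MMSE magnitude estimator.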
Paliwal et al.’s research confirmed the importance of develop-
ing and improving phase processing algorithms. This has recently
been the focus of research by multiple groups. We now survey the
main directions that have been investigated so far: better and