
IEEE SIGNAL PROCESSING MAGAZINE [58] MARCH 2015
such as nonnegative matrix factorization, hidden Markov models,
and discriminative approaches such as deep neural networks.
However, mainstream approaches have tended to ignore the
phase, mainly due to the difficulty of modeling it and the lack of
clarity about its importance, as discussed next.
RISE, DECLINE, AND RENAISSANCE OF PHASE
PROCESSING FOR SPEECH ENHANCEMENT
The first proposals for noise reduction in the STFT domain arose in
the late 1970s. While the spectral subtraction approaches only mod-
ified the spectral magnitudes, the role of the STFT phase was also
actively researched at the time. In particular, several authors inves-
tigated conditions under which a signal is uniquely specified by only
its phase or only its magnitude and proposed iterative algorithms
for signal reconstruction from either one or the other (e.g., [1], [8],
and references therein). For minimum or maximum phase systems,
log-magnitude and phase are related through the Hilbert trans-
form, meaning that only the spectral phase (or only the spectral
magnitude) is required to reconstruct the entire signal. But the
constraint of purely minimum or maximum phase is too restrictive
for real audio signals, and Quatieri [8] showed that more con-
straints are needed for mixed-phase signals. For instance, imposing
a causality or a finite-length constraint on the signal and specifying
a few samples of the phase or the signal itself is in some cases suffi-
cient to uniquely characterize the entire phase function from only
the magnitude. Quatieri [8] also showed how to exploit such con-
straints to estimate a signal from its spectral magnitude: assuming
some time-domain samples are known, and starting with an initial
phase estimate and the known spectral magnitude, the signal is
transformed to the time domain, where the given set of known
samples is used to replace the corresponding time-domain samples.
Then the time-domain signal is transformed back to the frequency
domain, where the resulting magnitude is replaced by the known
magnitude. This procedure is repeated for a certain number of iter-
ations. In the case of the STFT domain, the correlation between
overlapping short-time analysis segments can be exploited to derive
similar iterative algorithms that do not require time-domain sam-
ples to be known. A popular example of such methods is that of
Griffin and Lim (GL) [1], which we describe in more detail later
along with more recent approaches. While algorithms such as GL
can also be employed with magnitudes that are estimated rather
than measured from an actual signal, the quality of the synthesized
speech and the estimated phase strongly depends on the accuracy of
the estimated speech spectral magnitudes, and artifacts such as
echo, smearing, and modulations may occur [9].
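The GL-style iteration described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the STFT parameters, the random phase initialization, and the iteration count are our own assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(target_mag, n_iter=100, nperseg=256, noverlap=192, seed=0):
    """Iteratively estimate a time-domain signal whose STFT magnitude
    matches target_mag (shape: frequencies x frames), in the style of GL [1]."""
    rng = np.random.default_rng(seed)
    # Initialize with the known magnitude and a random phase guess.
    spec = target_mag * np.exp(2j * np.pi * rng.random(target_mag.shape))
    for _ in range(n_iter):
        # Inverse STFT with the current phase estimate; the overlap between
        # short-time segments couples the per-frame phases with each other.
        _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
        # Forward STFT: keep the resulting phase, restore the known magnitude.
        _, _, spec = stft(x, nperseg=nperseg, noverlap=noverlap)
        spec = target_mag * np.exp(1j * np.angle(spec))
    _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
    return x
```

Each pass alternately enforces the two constraints at play: the spectrogram must be the STFT of some time-domain signal, and it must have the prescribed magnitude.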
To explore the relevance of phase estimation for speech
enhancement, Wang and Lim [10] performed listening experi-
ments where the magnitude of a noisy speech signal at a certain
signal-to-noise ratio (SNR) was combined with the phase of the
same speech signal but distorted by noise at a different SNR. Lis-
teners were asked to compare this artificial test stimulus to a noisy
reference speech signal and to set the SNR of the reference such
that the perceived quality was the same for the reference and the
test stimulus. The result of this experiment was that mixing noisy
magnitudes with a less distorted phase yielded typical SNR
improvements of only 1 dB or less. Hence,
Wang and Lim concluded that improving phase was not critical in
speech enhancement [10]. Similarly, Vary [11] showed that a certain
roughness could be perceived only at local SNRs below 6 dB if the
noisy phase was kept unchanged. Finally, Ephraim and
Malah [7] investigated the role of phase improvement from a sta-
tistical perspective: they showed that, under a zero-mean circular
Gaussian speech and noise model and assuming that time-fre-
quency points are mutually independent given the speech and
noise PSDs, the MMSE estimate of the complex exponential of the
speech phase has an argument equal to the noisy phase. Also, for
more general models for the speech magnitudes with the same
circularity assumption, it has been shown that the noisy phase is
the ML-optimal estimator of the clean speech phase, e.g., [12].
Note, however, that the independence assumption does not hold in
general, and especially not for overlapping STFT frames, where
part of the relationship is actually deterministic.
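Cross-SNR test stimuli of the kind used by Wang and Lim [10] are straightforward to construct. The following sketch combines the STFT magnitude of one signal with the STFT phase of another; the function name and the STFT parameters are our own choices for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

def mix_mag_phase(mag_source, phase_source, nperseg=256, noverlap=192):
    """Build a test stimulus from the STFT magnitude of one signal and
    the STFT phase of another, as in cross-SNR listening tests."""
    _, _, A = stft(mag_source, nperseg=nperseg, noverlap=noverlap)
    _, _, B = stft(phase_source, nperseg=nperseg, noverlap=noverlap)
    # Magnitude from the first signal, phase from the second.
    _, y = istft(np.abs(A) * np.exp(1j * np.angle(B)),
                 nperseg=nperseg, noverlap=noverlap)
    return y
```

In the listening test, `mag_source` and `phase_source` would be the same speech signal corrupted by noise at two different SNRs; when both inputs are identical, the routine reduces to an ordinary STFT round trip.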
As a consequence of these observations, subsequent research in
speech enhancement focused mainly on improving magnitude
estimation, while phase estimation received far less attention for
the next two decades. Even methods that considered phase, either
by use of complex domain models, or by integrating out phase in
log-magnitude-based models in a sophisticated way [13], ultimately
used the noisy phase because of similar circularity assumptions.
However, as the performance of magnitude-only methods can
only go so far without considering phase, and with the increase in
computational power of assisted listening and speech communica-
tion devices, all options for improvements are back on the table.
Therefore, researchers started reinvestigating the role of the STFT
phase for speech intelligibility and quality [14], [15]. For instance,
Kazama et al. [14] investigated the influence of the STFT segment
length on the role of phase for speech intelligibility for a segment
overlap of 50%. They found that, while for signal segments
between 4 ms and 64 ms the STFT magnitude spectrum is more
important than the phase spectrum, for segments shorter than
2 ms and segments longer than 128 ms, the phase spectrum is
more important. These results are consistent with Wang and Lim’s
earlier conclusions [10]. To focus on practical applications, Paliwal
et al. [15] investigated signal segments of 32 ms length, but in
contrast to Wang and Lim [10] and Kazama et al. [14], they used a
segment overlap of 7/8 instead of 1/2 in the STFT analysis, and they
also zero-padded the time segments before computing the Fourier
transform. With this increased redundancy in the STFT, the perfor-
mance of existing magnitude-based speech enhancement can be
significantly improved [15] if combined with enhanced phases. For
instance, Paliwal et al. [15, case 4] report an improvement of 0.2
points of the mean opinion score (MOS) predicted by the instru-
mental “perceptual evaluation of speech quality” (PESQ) measure
for white Gaussian noise at an SNR of 0 dB when combining an
MMSE estimate of the clean speech magnitude with the oracle
clean speech phase in a perfectly reconstructing STFT framework.
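A comparable analysis-synthesis setup can be sketched as follows. The sampling rate, window, and zero-padding factor below are our assumptions for illustration, not necessarily the exact configuration of [15].

```python
import numpy as np
from scipy.signal import stft, istft

# Assumed setup: fs = 16 kHz, so a 32 ms segment is 512 samples; a 7/8
# overlap gives a hop of 64 samples; nfft = 1024 zero-pads each segment.
FS = 16000
NPERSEG = int(0.032 * FS)          # 512 samples = 32 ms
NOVERLAP = NPERSEG - NPERSEG // 8  # 7/8 overlap, i.e., hop of 64 samples
NFFT = 2 * NPERSEG                 # zero-padding before the FFT

def combine_oracle_phase(estimated_mag, clean):
    """Resynthesize speech from an estimated magnitude and the oracle
    (clean-speech) phase in a perfectly reconstructing STFT framework."""
    _, _, C = stft(clean, fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP, nfft=NFFT)
    S = estimated_mag * np.exp(1j * np.angle(C))
    _, y = istft(S, fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP, nfft=NFFT)
    return y
```

With a Hann window and an eighth-segment hop, the overlap-add conditions hold, so the framework reconstructs perfectly when both magnitude and phase are exact; in the experiment of [15], `estimated_mag` would instead come from an MMSE magnitude estimator.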
Paliwal et al.’s research confirmed the importance of develop-
ing and improving phase processing algorithms. This has recently
been the focus of research by multiple groups. We now survey the
main directions that have been investigated so far: better and