IEEE SIGNAL PROCESSING MAGAZINE [57] MARCH 2015

wrapped to its principal value, i.e., $-\pi \le \phi^X \le \pi$. To reveal these structures, alternative representations have been proposed, which consider phase relations between neighboring

time-frequency points instead of absolute phases. Two examples

of such representations are depicted in Figure 1(c) and (d). In

(c), the negative derivative of the phase along frequency, known

as the group delay, is shown. It has been shown to be a useful

tool for speech enhancement, e.g., by Yegnanarayana and Murthy [2]. Besides the group delay, the derivative of the phase

along time, i.e., the instantaneous frequency (IF), also unveils

structures in the spectral phase. For an improved visualization,

in (d), we do not show the IF, but rather its deviation from the

respective center frequency in Hz, which reduces wrapping

along frequency [3], [4]. It is interesting to remark that the tem-

poral as well as the spectral derivatives of the phase both reveal

structures similar to those in the magnitude spectrogram in (a).

Please note that both phase transformations are invertible and

thus carry the same information as the phase itself. No additional

prior knowledge has been injected.
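As a rough illustration of these two representations, the following Python sketch computes the group delay and the IF deviation from wrapped phase differences between neighboring time-frequency points. The helper names (`stft`, `princ`, `group_delay_samples`, `if_deviation_hz`), the Hann window, and the test tone are assumptions made for this example and are not taken from [2]-[4]; the derivatives are approximated by first-order differences.

```python
import numpy as np

def stft(x, win, hop):
    """Toy STFT: X[l, k] holds frame l and one-sided frequency bin k."""
    n = len(win)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[l * hop:l * hop + n] * win for l in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def princ(phi):
    """Wrap phase values to the principal interval [-pi, pi)."""
    return np.mod(phi + np.pi, 2 * np.pi) - np.pi

def group_delay_samples(X, n_fft):
    """Negative phase difference along frequency, expressed in samples."""
    dphi = princ(np.diff(np.angle(X), axis=1))
    return -dphi * n_fft / (2 * np.pi)

def if_deviation_hz(X, fs, hop, n_fft):
    """Phase difference along time, minus the advance expected at each
    bin's center frequency, converted to Hz."""
    k = np.arange(X.shape[1])
    expected = 2 * np.pi * k * hop / n_fft  # advance of a tone at the bin center
    dphi = princ(np.diff(np.angle(X), axis=0) - expected)
    return dphi * fs / (2 * np.pi * hop)

# Assumed example parameters: 16 kHz sampling, 512-sample Hann window, hop 128.
fs, n_fft, hop = 16000, 512, 128
win = np.hanning(n_fft)
t = np.arange(fs // 4) / fs
X = stft(np.sin(2 * np.pi * 1010.0 * t), win, hop)  # tone 10 Hz above bin 32
dev = if_deviation_hz(X, fs, hop, n_fft)
print(dev[:3, 32])  # bin 32 is centered at 1000 Hz
```

With these settings, the deviation in bin 32 should come out close to +10 Hz, since the tone lies 10 Hz above that bin's center frequency. Subtracting the expected phase advance before wrapping is what keeps the displayed values small and reduces wrapping along frequency.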

The observed structures in the spectral phase can be explained well by employing models of the underlying signal, e.g., by

sinusoidal models in the case of voiced speech [5]. Besides the

structures in the phase that are caused by signal characteristics,

neighboring time-frequency points also show dependencies due

to the STFT analysis: first, because of the finite length of the segments, neighboring frequency bands are not independent; second, successive segments overlap and hence partly share the same signal information. This introduces particular spectrotemporal relations between STFT coefficients within and across

frames of the spectrogram, regardless of the signal. If the spectrogram is modified, these relations are not guaranteed to be maintained and the modified spectrogram $X$ may not correspond to the STFT of any time-domain signal anymore. As a consequence, the resynthesized signal may have a spectrogram $\mathcal{G}(X)$, where

$$\mathcal{G}(X) := \mathrm{STFT}(\mathrm{iSTFT}(X)), \qquad (1)$$

which is different from the desired spectrogram $X$, as illustrated in Figure 2. Such spectrograms are called inconsistent, while consistent spectrograms verify $\mathcal{G}(X) = X$ and can be obtained from a time-domain signal.

Since the majority of speech enhancement approaches only

modify the magnitude, the mismatch between the enhanced

magnitude and the degraded phase will most likely lead to an

inconsistent spectrogram. This implies that even if the estimated magnitudes $|X|$ are optimal with respect to some objective function, the magnitude spectrogram of the synthesized time-domain signal is not, as $|\mathcal{G}(X)| \neq |X|$ (where $|\cdot|$ denotes the element-wise absolute value). To maintain consistency, and

thus also optimality, the STFT phase has to be taken into

account as well.
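The consistency operator in (1) and the effect of a magnitude-only modification can be sketched numerically as follows. This is a minimal example under stated assumptions: a toy STFT with a periodic Hann window, a least-squares inverse via weighted overlap-add, a synthetic two-tone signal standing in for speech, and white Gaussian noise. All function and variable names are hypothetical.

```python
import numpy as np

def stft(x, win, hop):
    """Toy STFT: X[l, k] holds frame l and one-sided frequency bin k."""
    n = len(win)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[l * hop:l * hop + n] * win for l in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, win, hop):
    """Least-squares inverse STFT via weighted overlap-add."""
    n = len(win)
    frames = np.fft.irfft(X, n=n, axis=1)
    length = n + hop * (X.shape[0] - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    for l, frame in enumerate(frames):
        x[l * hop:l * hop + n] += frame * win
        norm[l * hop:l * hop + n] += win ** 2
    return x / np.maximum(norm, 1e-12)  # guard samples the window never weights

def consistency_operator(X, win, hop):
    """G(X) = STFT(iSTFT(X)), following (1)."""
    return stft(istft(X, win, hop), win, hop)

def rel_dist(A, B):
    """Relative Euclidean distance between two spectrograms."""
    return np.linalg.norm(A - B) / np.linalg.norm(B)

# Assumed toy setup: two tones as the "clean" signal plus white noise.
rng = np.random.default_rng(0)
n_fft, hop, fs = 256, 64, 8000
win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)  # periodic Hann
t = np.arange(n_fft + hop * 60) / fs
clean = np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 1234.0 * t)
noisy = clean + rng.standard_normal(len(t))

S, Y = stft(clean, win, hop), stft(noisy, win, hop)
X_mod = np.abs(S) * np.exp(1j * np.angle(Y))  # oracle magnitude, noisy phase

G_mod = consistency_operator(X_mod, win, hop)
err_consistent = rel_dist(consistency_operator(S, win, hop), S)
err_modified = rel_dist(G_mod, X_mod)
err_magnitude = rel_dist(np.abs(G_mod), np.abs(X_mod))
print(err_consistent, err_modified, err_magnitude)
```

Under these assumptions, the first value should sit at the level of numerical precision, because the STFT of a time-domain signal is consistent. The other two should be clearly larger: even with the true clean magnitude, pairing it with the noisy phase yields a spectrogram that the resynthesis does not preserve, and the magnitude after resynthesis differs from the one that was specified.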

As a final illustration emphasizing the power of phase, it is

interesting to remark that, from a particular magnitude spectro-

gram, it is possible to reconstruct virtually any time-domain signal

with a carefully crafted phase. For instance, one can derive a

magnitude spectrogram from that of a speech signal such that it

yields either a speech signal similar to the original or a piece of

rock music, depending on the choice of the phase. The point here

is to exploit the inconsistency between magnitude and phase to

make contributions of neighboring frames cancel each other just

enough to reconstruct the energy profile of the target sound.

Reconstruction is thus done up to a scaling factor, and quality is

good albeit limited by dynamic range issues. An audio demonstration is available at http://www.jonathanleroux.org/research/LeRoux2011ASJ03_sound_transfer.html.

SPEECH ENHANCEMENT IN THE STFT DOMAIN

Speech enhancement is a field of research with a long-standing

history. In this section, we will summarize the different fields of

research that have led to remarkable progress over the years.

For a more detailed treatment and references to the original

publications, see [6].

In the STFT domain, noisy spectral coefficients can, for

instance, be improved using spectral subtraction or using mini-

mum mean squared error (MMSE) estimators of the clean

speech spectral coefficients [6, Ch. 4]. Examples of the latter are

the Wiener filter as an estimator of the complex speech coeffi-

cients and the short-time spectral amplitude estimator [7].

These MMSE estimators are driven by estimates of the speech

and noise power spectral densities (PSDs). The noise PSDs can

be estimated in speech pauses as signaled by a voice activity

detector, by searching for spectral minima in each subband, or

based on the speech presence probability [6, Ch. 6]. With the

noise PSD at hand, the speech PSD can be estimated by sub-

tracting the noise PSD from the periodogram of the noisy sig-

nal. This has been shown to be the maximum likelihood (ML)

optimal estimator of the clean speech PSD when considering

isolated and independent time-frequency points and complex

Gaussian distributed speech and noise coefficients [6, Sec. 4.2].

To reduce outliers, the ML speech PSD estimate is often

smoothed, for instance, using the decision-directed approach

[7] or more advanced smoothing techniques [6, Ch. 7].
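The chain from noise PSD to enhanced coefficients can be sketched as below, assuming the noise PSD has already been estimated by one of the methods mentioned above. The sketch combines the ML speech PSD estimate (periodogram minus noise PSD, floored at zero here as an added assumption), decision-directed smoothing of the a priori SNR, and the Wiener gain. The function name, the smoothing constant `alpha`, and the SNR floor `xi_min` are illustrative choices, not values prescribed by [6] or [7].

```python
import numpy as np

def wiener_decision_directed(Y, noise_psd, alpha=0.98, xi_min=10 ** (-25 / 10)):
    """Wiener filtering driven by a decision-directed a priori SNR estimate.

    Y         : noisy STFT, shape [frames, bins], complex
    noise_psd : noise PSD per bin, shape [bins], assumed known
    alpha     : smoothing constant (assumed illustrative value)
    xi_min    : lower bound on the a priori SNR (assumed illustrative value)
    """
    S_hat = np.zeros_like(Y)
    prev_speech_power = np.zeros(Y.shape[1])
    for l in range(Y.shape[0]):
        periodogram = np.abs(Y[l]) ** 2
        # ML speech PSD: noisy periodogram minus noise PSD (floored at zero).
        speech_psd_ml = np.maximum(periodogram - noise_psd, 0.0)
        # Decision-directed smoothing: mix the previous frame's speech
        # estimate with the current ML estimate.
        xi = (alpha * prev_speech_power + (1 - alpha) * speech_psd_ml) / noise_psd
        xi = np.maximum(xi, xi_min)
        gain = xi / (1.0 + xi)  # Wiener gain, real-valued and in (0, 1)
        S_hat[l] = gain * Y[l]
        prev_speech_power = np.abs(S_hat[l]) ** 2
    return S_hat

# Toy usage on random coefficients with unit noise PSD in every bin.
rng = np.random.default_rng(1)
Y = rng.standard_normal((20, 9)) + 1j * rng.standard_normal((20, 9))
S_hat = wiener_decision_directed(Y, np.ones(9))
print(np.allclose(np.angle(S_hat), np.angle(Y)))
```

Because the gain is real-valued and positive, the enhanced coefficients keep the phase of the noisy observation; only the magnitude is changed.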

Over the years, many improvements have been proposed

resulting in considerable progress thanks to better statistical models of speech and noise [6, Ch. 3], improved estimation of speech and noise PSDs [6, Ch. 6 and 7], combination with speech presence probability estimators [6, Ch. 5], and integration of perceptual models [6, Sec. 2.3.3]. Recent years have seen an explosion

of interest in data-driven methods, with model-based approaches

[FIG2] An illustration of the notion of consistency. (Diagram labels: time-domain signals, STFT spectrograms, STFT, iSTFT.)
