IEEE SIGNAL PROCESSING MAGAZINE [57] MARCH 2015
wrapped to its principal value, i.e., $-\pi \le \phi^X_{k,\ell} = \angle X_{k,\ell} \le \pi$. To
reveal these structures, alternative representations have been pro-
posed, which consider phase relations between neighboring
time-frequency points instead of absolute phases. Two examples
of such representations are depicted in Figure 1(c) and (d). In
(c), the negative derivative of the phase along frequency, known
as the group delay, is shown. It has been shown to be a useful
tool for speech enhancement, e.g., by Yegnanarayana and Murthy
[2]. Besides the group delay, the derivative of the phase
along time, i.e., the instantaneous frequency (IF), also unveils
structures in the spectral phase. For an improved visualization,
in (d), we do not show the IF, but rather its deviation from the
respective center frequency in Hz, which reduces wrapping
along frequency [3], [4]. It is interesting to remark that the tem-
poral as well as the spectral derivatives of the phase both reveal
structures similar to those in the magnitude spectrogram in (a).
Please note that both phase transformations are invertible and
thus carry the same information as the phase itself. No additional
prior knowledge has been injected.
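
As a rough illustration of the two representations described above, the following sketch computes a discrete group delay (negative phase difference along frequency) and the IF deviation from each band's center frequency (phase difference along time, minus the expected phase advance). It is only a sketch under stated assumptions: the function names, the Hann window, the frame length, the hop size, and the finite-difference approximation of the derivatives are illustrative choices, not taken from the article or from [2]-[4].

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Minimal STFT (assumed setup): Hann-windowed frames, one-sided FFT.
    Returns an array of shape (frequency bins, frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[l * hop : l * hop + frame_len] * window for l in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)

def principal_value(phase):
    """Wrap phase values to the interval [-pi, pi)."""
    return (phase + np.pi) % (2 * np.pi) - np.pi

def group_delay(X, frame_len):
    """Negative wrapped phase difference along frequency, in samples."""
    dphi = principal_value(np.diff(np.angle(X), axis=0))
    return -dphi * frame_len / (2 * np.pi)

def if_deviation_hz(X, frame_len, hop, fs):
    """Wrapped phase difference along time, expressed as the deviation
    of the instantaneous frequency from each band's center frequency (Hz)."""
    k = np.arange(X.shape[0])[:, None]
    expected_advance = 2 * np.pi * k * hop / frame_len  # center-frequency phase advance per hop
    dphi = np.diff(np.angle(X), axis=1)
    return principal_value(dphi - expected_advance) * fs / (2 * np.pi * hop)

# Usage on a synthetic sinusoid (a stand-in for one harmonic of voiced speech).
fs = 16000
n = np.arange(fs)
x = np.cos(2 * np.pi * 1010.0 * n / fs)
X = stft(x, frame_len=512, hop=128)
deviation = if_deviation_hz(X, frame_len=512, hop=128, fs=fs)
delay = group_delay(X, frame_len=512)
```

Subtracting the expected phase advance before wrapping is what keeps the IF deviation within a narrow range around zero near spectral peaks; the unambiguous range in this sketch is limited to plus or minus fs / (2 * hop) Hz.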
The observed structures in the spectral phase can well be
explained by employing models of the underlying signal, e.g., by
sinusoidal models in the case of voiced speech [5]. Besides the
structures in the phase that are caused by signal characteristics,
neighboring time-frequency points also show dependencies due
to the STFT analysis: first, because of the finite length of the seg-
ments, neighboring frequency bands are not independent; sec-
ond, successive segments overlap and hence share partly the
same signal information. This introduces particular spectrotem-
poral relations between STFT coefficients within and across
frames of the spectrogram, regardless of the signal. If the spectrogram
is modified, these relations are not guaranteed to be maintained,
and the modified spectrogram $\tilde{X}$ may not correspond to
the STFT of any time-domain signal anymore. As a consequence,
the resynthesized signal may have a spectrogram $G(\tilde{X})$, where

$$G(\tilde{X}) := \mathrm{STFT}(\mathrm{iSTFT}(\tilde{X})), \qquad (1)$$

which is different from the desired spectrogram $\tilde{X}$, as illustrated
in Figure 2. Such spectrograms are called inconsistent,
while consistent spectrograms satisfy $G(X) = X$ and can be
obtained from a time-domain signal.
Since the majority of speech enhancement approaches only
modify the magnitude, the mismatch between the enhanced
magnitude and the degraded phase will most likely lead to an
inconsistent spectrogram. This implies that even if the estimated
magnitudes $|\tilde{X}|$ are optimal with respect to some objective
function, the magnitude spectrogram of the synthesized
time-domain signal is not, as $|G(\tilde{X})| \neq |\tilde{X}|$ (where $|\cdot|$ denotes
the element-wise absolute value). To maintain consistency, and
thus also optimality, the STFT phase has to be taken into
account as well.
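
The consistency operator in (1) and the magnitude mismatch described above can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the STFT/iSTFT pair (Hann window, least-squares overlap-add synthesis), the helper names, the random test signal, and the random time-frequency gain standing in for an enhancement gain are all hypothetical choices made here, not the article's setup.

```python
import numpy as np

def stft(x, window, hop):
    """Minimal STFT: windowed frames, one-sided FFT. Shape (bins, frames)."""
    n = len(window)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack(
        [x[l * hop : l * hop + n] * window for l in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)

def istft(X, window, hop):
    """Weighted overlap-add synthesis, normalized by the summed squared window."""
    n = len(window)
    n_frames = X.shape[1]
    frames = np.fft.irfft(X, n=n, axis=0)
    length = n + (n_frames - 1) * hop
    x = np.zeros(length)
    norm = np.zeros(length)
    for l in range(n_frames):
        x[l * hop : l * hop + n] += window * frames[:, l]
        norm[l * hop : l * hop + n] += window ** 2
    return x / np.maximum(norm, 1e-12)  # guard samples where the window is zero

def consistency_operator(X, window, hop):
    """G(X) = STFT(iSTFT(X)), as in (1)."""
    return stft(istft(X, window, hop), window, hop)

def inconsistency(X, window, hop):
    """Relative distance between X and G(X); close to zero for consistent X."""
    return np.linalg.norm(consistency_operator(X, window, hop) - X) / np.linalg.norm(X)

rng = np.random.default_rng(0)
window = np.hanning(256)
hop = 64
x = rng.standard_normal(256 + 99 * hop)        # synthetic stand-in for a signal
X = stft(x, window, hop)                       # consistent by construction

gain = rng.uniform(0.0, 1.0, size=X.shape)     # hypothetical magnitude-only modification
X_mod = gain * X                               # modified magnitude, unchanged phase

G_mod = consistency_operator(X_mod, window, hop)
magnitude_mismatch = (
    np.linalg.norm(np.abs(G_mod) - np.abs(X_mod)) / np.linalg.norm(np.abs(X_mod))
)
print("inconsistency of X:     ", inconsistency(X, window, hop))
print("inconsistency of X_mod: ", inconsistency(X_mod, window, hop))
print("relative magnitude mismatch:", magnitude_mismatch)
```

In this sketch, the unmodified spectrogram is reproduced by G up to numerical precision, whereas the magnitude-modified one is not, so the magnitudes obtained after resynthesis differ from the ones that were imposed.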
As a final illustration emphasizing the power of phase, it is
interesting to remark that, from a particular magnitude spectro-
gram, it is possible to reconstruct virtually any time-domain signal
with a carefully crafted phase. For instance, one can derive a
magnitude spectrogram from that of a speech signal such that it
yields either a speech signal similar to the original or a piece of
rock music, depending on the choice of the phase. The point here
is to exploit the inconsistency between magnitude and phase to
make contributions of neighboring frames cancel each other just
enough to reconstruct the energy profile of the target sound.
Reconstruction is thus done up to a scaling factor, and quality is
good albeit limited by dynamic range issues. An audio demonstration
is available at http://www.jonathanleroux.org/research/LeRoux2011ASJ03_sound_transfer.html.
SPEECH ENHANCEMENT IN THE STFT DOMAIN
Speech enhancement is a field of research with a long-standing
history. In this section, we summarize the different lines of
research that have led to remarkable progress over the years.
For a more detailed treatment and references to the original
publications, see [6].
In the STFT domain, noisy spectral coefficients can, for
instance, be improved using spectral subtraction or using mini-
mum mean squared error (MMSE) estimators of the clean
speech spectral coefficients [6, Ch. 4]. Examples of the latter are
the Wiener filter as an estimator of the complex speech coeffi-
cients and the short-time spectral amplitude estimator [7].
These MMSE estimators are driven by estimates of the speech
and noise power spectral densities (PSDs). The noise PSDs can
be estimated in speech pauses as signaled by a voice activity
detector, by searching for spectral minima in each subband, or
based on the speech presence probability [6, Ch. 6]. With the
noise PSD at hand, the speech PSD can be estimated by sub-
tracting the noise PSD from the periodogram of the noisy sig-
nal. This has been shown to be the maximum likelihood (ML)
optimal estimator of the clean speech PSD when considering
isolated and independent time-frequency points and complex
Gaussian distributed speech and noise coefficients [6, Sec. 4.2].
To reduce outliers, the ML speech PSD estimate is often
smoothed, for instance, using the decision-directed approach
[7] or more advanced smoothing techniques [6, Ch. 7].
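
The chain described above (noise PSD, ML speech PSD estimate by subtraction, decision-directed smoothing, Wiener gain) can be sketched as follows. This is an illustrative sketch rather than the exact estimators of [6] or [7]: the function name, the smoothing factor `alpha = 0.98`, the lower limit `xi_min`, the flooring of the subtraction at zero, and the assumption of a known, time-invariant noise PSD are all choices made here for the example.

```python
import numpy as np

def wiener_decision_directed(Y, noise_psd, alpha=0.98, xi_min=1e-3):
    """Apply a Wiener gain driven by a decision-directed a priori SNR estimate.

    Y         : noisy STFT coefficients, shape (bins, frames)
    noise_psd : noise PSD estimate per frequency bin, shape (bins,)
    alpha     : decision-directed smoothing factor (assumed value)
    xi_min    : lower limit on the a priori SNR (assumed value)
    """
    n_bins, n_frames = Y.shape
    S_hat = np.zeros_like(Y)
    prev_speech_power = np.zeros(n_bins)
    for l in range(n_frames):
        periodogram = np.abs(Y[:, l]) ** 2
        # ML speech PSD estimate: periodogram minus noise PSD, floored at zero.
        speech_psd_ml = np.maximum(periodogram - noise_psd, 0.0)
        # Decision-directed smoothing: combine the previous frame's speech
        # power estimate with the current ML estimate, both normalized by
        # the noise PSD, to obtain the a priori SNR.
        xi = (alpha * prev_speech_power + (1.0 - alpha) * speech_psd_ml) / noise_psd
        xi = np.maximum(xi, xi_min)
        gain = xi / (1.0 + xi)            # Wiener gain, real-valued in (0, 1)
        S_hat[:, l] = gain * Y[:, l]      # magnitude is scaled, noisy phase is kept
        prev_speech_power = np.abs(S_hat[:, l]) ** 2
    return S_hat

# Usage on synthetic data: unit-variance complex Gaussian noise in every bin,
# plus a strong constant component in one bin standing in for speech energy.
rng = np.random.default_rng(0)
noise = (rng.standard_normal((65, 200)) + 1j * rng.standard_normal((65, 200))) / np.sqrt(2)
Y = noise.copy()
Y[10] += 10.0
S_hat = wiener_decision_directed(Y, noise_psd=np.ones(65))
```

Because the gain is real-valued and positive, only the magnitude of each coefficient changes and the noisy phase is passed through unchanged.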
Over the years, many improvements have been proposed,
resulting in considerable progress thanks to better statistical
models of speech and noise [6, Ch. 3], improved estimation of
speech and noise PSDs [6, Ch. 6 and 7], combination with speech
presence probability estimators [6, Ch. 5], and integration of perceptual
models [6, Sec. 2.3.3]. Recent years have seen an explosion
of interest in data-driven methods, with model-based approaches
[FIG2] An illustration of the notion of consistency. (Diagram: time-domain signals $x$ and $\tilde{x}$ and STFT spectrograms $X$ and $\tilde{X}$, linked by STFT and iSTFT arrows.)