IEEE SIGNAL PROCESSING MAGAZINE [62] MARCH 2015
harmonic are directly related to each
other through the phase response of
the analysis window
$\phi^{W}_{k}$; see, e.g., [4]
for more details. Accordingly, starting
from a phase estimate at a band that
contains a spectral harmonic, possi-
bly obtained using (7), the phase of
the surrounding bands can be inferred by accounting for the phase
shift introduced by the analysis window. For this, only the funda-
mental frequency and the phase response
$\phi^{W}_{k}$
are required, of
which the latter can be obtained offline either from the window’s
discrete-time Fourier transform (DTFT) or from its DFT with a
large amount of zero padding. The complete setup of [4] is illus-
trated in Figure 3.
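The reconstruction along frequency can be sketched numerically. The following is a minimal illustration, not the implementation of [4]: it assumes a single complex exponential in place of a real-valued harmonic (so that the relation holds exactly), a Hann window, and a known harmonic frequency. The names `window_dtft` and `phase_across_bands` are hypothetical, and the window's phase response is evaluated directly from its DTFT.

```python
import numpy as np

def window_dtft(w, omega):
    """DTFT of the window w at angular frequencies omega (rad/sample)."""
    n = np.arange(len(w))
    omega = np.atleast_1d(omega)
    return np.exp(-1j * np.outer(omega, n)) @ w

def phase_across_bands(phase_k0, k0, bands, omega_h, w):
    """Infer the phase in `bands` from the phase in band k0.

    Assumes band k0 and all requested bands are dominated by one
    harmonic at angular frequency omega_h (hypothetical helper).
    """
    N = len(w)
    omega_k = 2 * np.pi * np.asarray(bands) / N
    omega_k0 = 2 * np.pi * k0 / N
    # Phase shift introduced by the analysis window in each band
    phi_w_k0 = np.angle(window_dtft(w, omega_k0 - omega_h))[0]
    phi_w_k = np.angle(window_dtft(w, omega_k - omega_h))
    return phase_k0 - phi_w_k0 + phi_w_k

# Demo: one complex exponential between bins 10 and 11 (assumed values)
N = 64
w = np.hanning(N)
omega_h = 2 * np.pi * 10.3 / N
n = np.arange(N)
x = 0.8 * np.exp(1j * (omega_h * n + 0.7))
X = np.fft.fft(w * x)

k0 = 10
bands = np.array([8, 9, 11, 12])
est = phase_across_bands(np.angle(X[k0]), k0, bands, omega_h, w)
# Wrapped difference between inferred and actual phases
err = np.angle(np.exp(1j * (est - np.angle(X[bands]))))
print(np.max(np.abs(err)))
```

For a real-valued harmonic, the mirrored component at negative frequencies leaks into the same bands, so the relation holds only approximately.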
It can be argued that for speech enhancement, the phase recon-
struction across frequency bands between harmonics is more
important than the temporal reconstruction on the harmonics: on
the one hand, the local SNR in bands that directly contain har-
monics is rather large for many realistic SNR situations, i.e.,
$\phi^{Y}_{k,\ell} \approx \phi^{S}_{k,\ell}$.
Thus, the temporal alignment of the harmonic com-
ponents is maintained rather well in the noisy signal. Further, the
noisy phase
$\phi^{Y}_{k,\ell}$
in these bands typically yields a good starting
point for the phase reconstruction along frequency. On the other
hand, frequency bands between harmonics are likely to be domi-
nated by the noise, i.e., 
$\phi^{Y}_{k,\ell} \approx \phi^{N}_{k,\ell}$,
and the clean phase relations
across bands are strongly disturbed. Here, the possible benefit of
the phase reconstruction is much larger.
Even though the employed model is simple and limited to
purely voiced speech sounds, the obtained phase estimates yield
valuable information about the clean speech signal that can be
employed for advanced speech enhancement algorithms. Interest-
ingly, even the sole enhancement of the spectral phase can lead to a
considerable reduction of noise between harmonic components of
voiced speech after overlap-add [4]. This is because the speech
components of successive segments are adding up constructively
after the phase modifications, while the noise components suffer
from destructive interference, since the phase relations of the noise
have been destroyed. However, speech distortions are also intro-
duced, which are substantially reduced when the estimated phase
is combined with an enhanced magnitude, as, e.g., in [25]. Besides
its value for signal reconstruction, the estimated phase can also be
utilized as additional information for phase-aware magnitude esti-
mation [25] and even for the estimation of clean speech complex
coefficients [12], which will be discussed in more detail later.
GROUP DELAY AND TRANSIENT PROCESSING
Structures in the phase are not limited to voiced sounds, but are
also present for other sounds, like impulses or transients. These
structures are well captured by the group delay, which can be seen
in Figure1(c), rendering it a useful representation for phase pro-
cessing. For example, the group delay has been employed to facili-
tate clean speech phase estimation in phase-sensitive noise
reduction [26]. It can be shown geometrically that if the spectral
magnitudes of speech and noise are known, only two possible
combinations of phase values remain, both of which perfectly
explain the observed spectral coefficients of the mixture. In [26]
(and the references therein), Mowlaee and Saeidi proposed resolving
this ambiguity by choosing the phase combination that minimizes a
function of the group delay.
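The two-fold ambiguity follows from the law of cosines applied to the triangle formed by the mixture, speech, and noise coefficients in the complex plane. The sketch below only enumerates the two candidate phase pairs; it does not implement the group-delay criterion of [26]. The function name `phase_candidates` and the numerical values are assumptions for illustration.

```python
import numpy as np

def phase_candidates(Y, mag_s, mag_n):
    """Return the two (phi_S, phi_N) pairs consistent with Y, |S|, |N|."""
    mag_y, phi_y = np.abs(Y), np.angle(Y)
    # Law of cosines: |N|^2 = |Y|^2 + |S|^2 - 2|Y||S| cos(phi_Y - phi_S)
    cos_delta = (mag_y**2 + mag_s**2 - mag_n**2) / (2 * mag_y * mag_s)
    delta = np.arccos(np.clip(cos_delta, -1.0, 1.0))
    candidates = []
    for sign in (+1, -1):
        phi_s = phi_y + sign * delta
        # The noise phase follows from N = Y - S
        noise = Y - mag_s * np.exp(1j * phi_s)
        candidates.append((phi_s, np.angle(noise)))
    return candidates

# Demo with assumed speech and noise coefficients
S_true = 1.0 * np.exp(1j * 0.4)
N_true = 0.6 * np.exp(-1j * 1.1)
Y = S_true + N_true
cands = phase_candidates(Y, np.abs(S_true), np.abs(N_true))
for phi_s, phi_n in cands:
    rebuilt = 1.0 * np.exp(1j * phi_s) + 0.6 * np.exp(1j * phi_n)
    print(phi_s, phi_n, abs(rebuilt - Y))
```

Both candidates reproduce the mixture exactly, which is why additional information, such as the group delay, is needed to choose between them.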
Besides phase estimation, the group delay has successfully been
employed for the detection of transient sounds, such as sounds of
short duration and speech onsets. To illustrate the role of the phase
for transient sounds, let us consider a single impulse as the sim-
plest example. The DFT of such a pulse is
$A e^{-j 2\pi k n_0 / N}$, where $n_0$ is
the shift of the peak relative to the beginning of the current seg-
ment and
$A$ denotes the spectral magnitude. Hence, we observe a
linear phase with a constant slope of $-2\pi n_0 / N$. For impulsive
signals, we accordingly expect a phase difference across frequency
bands that is approximately constant, i.e., a constant group delay.
That this is the case also for real speech sounds can be seen in Fig-
ure 1(c), where transient sounds show vertical lines with almost
equal group delay.
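The linear phase of a single impulse is easy to verify numerically. The following minimal sketch uses assumed values for the DFT length, the impulse position, and the amplitude, and checks that the phase difference between neighboring bands equals $-2\pi n_0 / N$.

```python
import numpy as np

N, n0, A = 32, 5, 2.0          # assumed DFT length, impulse position, amplitude
x = np.zeros(N)
x[n0] = A
X = np.fft.fft(x)

# Wrapped phase difference between neighboring frequency bands
dphi = np.angle(X[1:] * np.conj(X[:-1]))
slope = -2 * np.pi * n0 / N

# Group delay in samples: negative phase slope scaled by N / (2 pi)
group_delay = -dphi * N / (2 * np.pi)
print(np.allclose(dphi, slope), group_delay[:3])
```

The wrapped difference matches the slope directly here because $n_0 < N/2$; for larger offsets the difference wraps around.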
For the detection of impulsive sounds, a linearity index
$\mathrm{LI}_{\phi}(k)$ is defined in [27], which measures the
deviation of the observed phase difference across frequencies from
the one that is expected for an impulse at $n_0$, i.e.,
$-2\pi n_0 / N$. The observed phase differences are weighted with the
spectral magnitude and averaged over frequency to obtain an estimate
of the time-domain offset $n_0$. An impulsive sound is detected only
if $\mathrm{LI}_{\phi}(k)$ is close to zero, i.e., if the observed
phase fits the expected linear phase well. The detection can be made
either at a segment level or for each time-frequency point
separately. While the former states whether an impulsive sound is
present in the current signal segment, the latter allows frequency
regions that are dominated by an impulsive sound, such as a
narrowband onset, to be localized.
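A rough sketch of such a linearity measure is given below. It is not the exact definition from [27]: the offset estimate uses a magnitude-weighted circular average of the phase differences, and the function name `linearity_index` and the test signals are assumptions for illustration.

```python
import numpy as np

def linearity_index(x):
    """Deviation of the phase difference across bands from a linear phase.

    Returns the per-band deviation (rad) and the estimated offset n0.
    Hypothetical helper, not the definition used in [27].
    """
    N = len(x)
    X = np.fft.rfft(x)
    # Product of neighboring bands: its angle is the phase difference,
    # its magnitude acts as a spectral weight
    cross = X[1:] * np.conj(X[:-1])
    # Magnitude-weighted (circular) average gives the offset estimate
    n0_est = -N / (2 * np.pi) * np.angle(np.sum(cross))
    expected = -2 * np.pi * n0_est / N
    # Wrapped deviation of the observed from the expected difference
    li = np.angle(cross * np.exp(-1j * expected))
    return li, n0_est

# Impulse at sample 40 versus white noise (assumed test signals)
impulse = np.zeros(256)
impulse[40] = 1.0
li_imp, n0_imp = linearity_index(impulse)

noise = np.random.default_rng(0).standard_normal(256)
li_noise, _ = linearity_index(noise)

print(n0_imp, np.max(np.abs(li_imp)), np.mean(np.abs(li_noise)))
```

For the impulse, the deviation is numerically zero in every band, whereas for noise it is spread over the full phase range. A detection threshold would have to be chosen for the application at hand.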
Apart from the group delay, the IF, which corresponds to the
temporal derivative of the phase, has also been employed for the
detection of transient sounds, e.g., in [28] and the references
therein. For steady-state signals, like voiced sounds, the IF is
changing only slowly over time, due to the temporal correlation of
the overlapping segments. When a transient is encountered, how-
ever, the most current segment differs significantly from previous
segments and thus the IF also changes abruptly. This can be
observed in Figure1(d), where at speech onsets thin vertical lines
appear in the IF deviation. Hence, the change of the IF from seg-
ment to segment—and its distribution—allow for the detection of
transient sounds, such asnote onsets [28].
The phase of transient sounds is not only relevant for detection,
but also for the reduction of transient noise. In low SNR time-fre-
quency regions, the observed noisy phase is close to the approxi-
mately linear phase of the transient noise. This can lead to artifacts
in the enhanced signal if only the spectral magnitude is improved
and the noisy phase is used for signal reconstruction: usage of the
phase of the transient noise reshapes the enhanced time-domain
signal in an uncontrolled way, such that it may again exhibit an
undesired transient behavior. Even for a perfect magnitude esti-
mate, the interfering noise is not perfectly suppressed if the phase