IEEE SIGNAL PROCESSING MAGAZINE [61] MARCH 2015
with real-valued amplitude $A_h$ and initial time-domain phase $\varphi_h$ for harmonic component $h$. Due to the fixed relation between the frequencies, (5) is also referred to as the harmonic model, which is a special case of the more general sinusoidal model. The harmonic frequencies and amplitudes are assumed to be slowly changing over time with respect to the length $N$ of an STFT signal segment, and we define $A_{h,\ell} = A_h(\ell R + N/2)$ and $\Omega_{h,\ell} = \Omega_h(\ell R + N/2)$ as the representative harmonic amplitudes and frequencies for the $\ell$th signal segment.
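As a concrete illustration of the harmonic model (5), the following minimal NumPy sketch synthesizes a sum of sinusoids whose frequencies are integer multiples of a fundamental; the function and variable names are ours, not the article's:

```python
import numpy as np

def harmonic_signal(n, A, phi, Omega0):
    """Sketch of the harmonic model (5): a sum of H sinusoids whose
    frequencies are integer multiples of the fundamental Omega0
    (normalized angular frequency, rad/sample). A[h-1] and phi[h-1]
    are the amplitude and initial time-domain phase of harmonic h."""
    n = np.asarray(n)
    s = np.zeros(n.shape, dtype=float)
    for h, (A_h, phi_h) in enumerate(zip(A, phi), start=1):
        s += A_h * np.cos(h * Omega0 * n + phi_h)
    return s

# Example (values are illustrative): three harmonics of a
# 200-Hz fundamental at a sampling rate of 8 kHz.
fs = 8000
Omega0 = 2 * np.pi * 200 / fs
n = np.arange(256)
s = harmonic_signal(n, A=[1.0, 0.5, 0.25], phi=[0.0, 0.3, -0.8], Omega0=Omega0)
```

In a real implementation the amplitudes and phases would be taken per segment, as in the definitions of $A_{h,\ell}$ and $\Omega_{h,\ell}$ above.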
In speech enhancement, the sinusoidal model has, for instance,
been employed in [23], where the model parameters are iteratively
estimated from a noisy observation in the STFT domain, and the
enhanced signal is synthesized using (5). In the absence of noise,
synthesis results are reported to be almost indistinguishable from
the clean speech signal, underlining the capability of (5) to accu-
rately model voiced human speech. In contrast to [23], we now dis-
cuss how the sinusoidal model (5) can be employed to directly
reconstruct the STFT phase. If the frequency resolution of the
STFT is high enough to resolve the harmonic frequencies $\Omega_h$ in (5), in each frequency band $k$ only a single harmonic component
is dominant. The normalized angular frequency $\Omega_{h,\ell}$ of the harmonic that dominates frequency band $k$ is denoted as

$$\bar{\Omega}_{k,\ell} = \operatorname*{argmin}_{\Omega_{h,\ell}} \left| \Omega_{h,\ell} - 2\pi k/N \right| \qquad (6)$$
i.e., the harmonic frequency that is closest to the center frequency $2\pi k/N$ of the $k$th frequency band. Interpreting the STFT of a signal as the output of a complex filter bank subsampled by the hop size $R$, the spectral phase $\phi^{S}_{k,\ell}$ changes from segment to segment according to

$$\phi^{S}_{k,\ell} = \mathrm{mod}_{2\pi}\!\left(\phi^{S}_{k,\ell-1} + \Delta\phi^{S}_{k,\ell}\right) = \mathrm{mod}_{2\pi}\!\left(\phi^{S}_{k,\ell-1} + R\,\bar{\Omega}_{k,\ell}\right) \qquad (7)$$
where the modulo operator $\mathrm{mod}_{2\pi}(\cdot)$ wraps the phase to values between 0 and $2\pi$.
When the clean signal $s(n)$ is deteriorated by noise, the spectral phases and thus the temporal phase differences $\Delta\phi^{S}_{k,\ell}$ are deteriorated as well. With an estimate of the fundamental frequency at
riorated as well. With an estimate of the fundamental frequency at
hand, however, the temporal phase relations in each band can be
restored using (7) recursively from segment to segment.
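The recursion in (6) and (7) can be sketched as follows; this is a minimal NumPy illustration under the stated assumptions (a known fundamental frequency per segment), and the function and variable names are ours:

```python
import numpy as np

def propagate_phase(phi_init, Omega0, N, R, num_segments):
    """Recursive phase estimation along time, following (6) and (7).

    phi_init : initial spectral phase per band (e.g., the noisy phase
               at a voiced onset), shape (N//2 + 1,)
    Omega0   : estimated fundamental frequency (rad/sample) for each
               segment, shape (num_segments,)
    N, R     : STFT segment length and hop size
    """
    K = N // 2 + 1
    phi = np.empty((num_segments, K))
    phi[0] = phi_init
    for ell in range(1, num_segments):
        # Harmonic frequencies Omega_{h,ell} below the Nyquist frequency.
        H = int(np.floor(np.pi / Omega0[ell]))
        Omega_h = Omega0[ell] * np.arange(1, H + 1)
        # (6): pick the harmonic closest to each band center 2*pi*k/N.
        centers = 2 * np.pi * np.arange(K) / N
        idx = np.argmin(np.abs(Omega_h[None, :] - centers[:, None]), axis=1)
        Omega_bar = Omega_h[idx]
        # (7): advance the phase by R * Omega_bar and wrap to [0, 2*pi).
        phi[ell] = np.mod(phi[ell - 1] + R * Omega_bar, 2 * np.pi)
    return phi

# Example: with Omega0 = pi/4, N = 32, and R = 8, every harmonic advances
# by an integer multiple of 2*pi per hop, so the phase stays constant.
phi = propagate_phase(np.zeros(17), np.full(5, np.pi / 4), N=32, R=8,
                      num_segments=5)
```

A full system would restart the recursion from the observed noisy phase at each voiced onset, as discussed below.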
Almost 50 years ago, a similar approach for the propagation of
the spectral phase along time was taken in the phase vocoder [5]
for time-scaling or pitch-shifting of acoustic signals. The temporal
STFT phase difference is modified according to
$$\hat{\phi}^{S}_{k,\ell} = \hat{\phi}^{S}_{k,\ell-1} + \alpha\,\Delta\phi^{S}_{k,\ell} \qquad (8)$$
where in this context, $\Delta\phi^{S}_{k,\ell}$ is often referred to as the IF. By scaling $\Delta\phi^{S}_{k,\ell}$ with the positive real-valued factor $\alpha$, the IF of the signal component is either increased ($\alpha > 1$) or decreased ($\alpha < 1$). Comparing (7) to (8), the phase estimation along time for speech enhancement can be expressed in terms of a phase vocoder with a scaling factor of $\alpha = 1$. However, the application is completely different: instead of deliberately modifying the original phase, the clean
speech phase is estimated from a noisy observation. It is worth not-
ing that for the original phase vocoder, in contrast to
phase estimation in speech enhancement, no fundamental frequency
estimate is needed, as the phase difference $\Delta\phi^{S}_{k,\ell} = \phi^{S}_{k,\ell} - \phi^{S}_{k,\ell-1}$ can be taken directly from the clean original signal.
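The phase-vocoder modification (8) can be sketched in a few lines of NumPy; names are ours, and the phase unwrapping a practical vocoder would apply to the differences is omitted for brevity:

```python
import numpy as np

def modify_phase(phi_S, alpha):
    """Phase-vocoder-style modification of the STFT phase along time,
    as in (8).

    phi_S : spectral phases of the original signal, shape (num_segments, K)
    alpha : IF scaling factor (alpha > 1 raises, alpha < 1 lowers the IF)
    Returns the modified phases phi_hat with the same shape.
    """
    phi_hat = np.empty_like(phi_S)
    phi_hat[0] = phi_S[0]
    for ell in range(1, phi_S.shape[0]):
        # IF term Delta phi^S_{k,ell}, taken from the clean original signal.
        delta = phi_S[ell] - phi_S[ell - 1]
        phi_hat[ell] = phi_hat[ell - 1] + alpha * delta
    return phi_hat

# Example with arbitrary phases: alpha = 2 doubles every temporal
# phase difference, i.e., it doubles the IF of each band.
rng = np.random.default_rng(0)
phi_S = rng.uniform(0, 2 * np.pi, size=(4, 5))
phi_hat = modify_phase(phi_S, alpha=2.0)
```

With $\alpha = 1$ the modified phase equals the original, which is why the time-recursive phase estimation above can be read as a phase vocoder with unit scaling factor.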
For an accurate estimation of the clean spectral phase along
segments using (7) a proper initialization is necessary [4]. In
voiced sounds, the bands between spectral harmonics contain only
little signal energy and, in the presence of noise, these bands are
likely to be dominated by the noise component, i.e., $\phi^{Y}_{k,\ell} \approx \phi^{N}_{k,\ell}$, where $\phi^{Y}_{k,\ell}$ and $\phi^{N}_{k,\ell}$ are the spectral phases of the noisy mixture and the noise, respectively. Even though the phase might be set consistently within each band, the spectral relations across frequency bands are distorted already at the initialization stage.
Directly applying (7) to every frequency band therefore does not
necessarily yield phase estimates that could be employed for phase-
based speech enhancement [4].
In the phase vocoder, this problem can be alleviated by aligning
phases of neighboring frequency bands relative to each other,
which is known as phase locking, e.g., [24]. There, the phase is
evolved along time only in frequency bands that directly contain
harmonic components. The phase in the surrounding bands,
which are dominated by the same harmonic, is then set relative to
the modified phase. For this, the spectral phase relations of the
original signal are imposed on the modified phase spectrum.
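The phase-locking principle can be sketched as follows: only the band containing the harmonic is propagated along time, and its neighbors then receive the original cross-band phase offsets on top of the modified peak phase. This is our own minimal illustration, not the algorithm of [24]:

```python
import numpy as np

def lock_phases(phi_orig, phi_mod_peak, peak, neighbors):
    """Phase locking: impose the original phase relations of the bands
    around a dominant harmonic onto the modified peak phase.

    phi_orig     : original spectral phases for one segment, shape (K,)
    phi_mod_peak : modified phase of the dominant band `peak`
    neighbors    : indices of bands dominated by the same harmonic
    """
    phi_locked = phi_orig.copy()
    phi_locked[peak] = phi_mod_peak
    # Keep each neighbor's original phase offset relative to the peak.
    phi_locked[neighbors] = phi_mod_peak + (phi_orig[neighbors] - phi_orig[peak])
    return phi_locked

# Example: band 1 carries the harmonic; bands 0 and 2 are locked to it.
phi_orig = np.array([0.1, 0.5, 0.9])
phi_locked = lock_phases(phi_orig, phi_mod_peak=1.5, peak=1, neighbors=[0, 2])
```

The offsets $\phi_k - \phi_{\text{peak}}$ are exactly the cross-band relations that, in the speech-enhancement setting discussed next, must instead be inferred from the sinusoidal model.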
In the context of speech enhancement, the same principle has
been incorporated to improve the estimation of the clean speech
spectral phase [4]. However, since only a noisy signal is observed,
the clean speech phase relations across frequency bands are not
readily available. To overcome this limitation, again the sinusoidal
model is employed. The spectrum of a harmonic signal segment is
given by the cyclic convolution of a comb-function with the trans-
fer function of the analysis window, which causes spectral leakage.
The spectral leakage induces relations not only between the ampli-
tudes, but also between the phases of neighboring bands. It can be
shown that phases of bands that are dominated by the same
[FIG3] Symbolic spectrogram illustrating the sinusoidal model-based phase estimation [4]. Starting from the noisy phase at the onset of a voiced sound in segment $\ell_0$, in bands containing harmonic components (red) the phase is estimated along segments. Based on the temporal estimates, the spectral phase of bands between the harmonics (blue) is then inferred across frequency.