IEEE SIGNAL PROCESSING MAGAZINE [60] MARCH 2015
contribution to the signal is obtained by the inverse DFT of the
phase $\phi_\ell^{(i)}$ combined with the target magnitude; frame
$\ell$’s contribution is then combined by overlap-add with the
contribution of the previous frames, leading to a signal estimate
for frame $\ell$; the phase $\phi_\ell^{(i+1)}$ is estimated as the
phase of this signal estimate to which the analysis window is applied.
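The per-frame update described above can be sketched in a few lines of NumPy. This is only an illustrative reading of the text, not the RTISI authors' implementation: the function name `rtisi_frame_update`, the buffer name `past_overlap`, the use of a single window for both analysis and synthesis, the default initialization from the previous frames, and the omission of any overlap-add normalization are all assumptions made for the example.

```python
import numpy as np

def rtisi_frame_update(target_mag, past_overlap, win, n_iter=4, init_phase=None):
    """Sketch of the per-frame phase refinement described in the text.

    target_mag   : target magnitudes of the current frame (rfft bins)
    past_overlap : overlap-add of the previous frames, restricted to the
                   span of the current frame (hypothetical buffer name)
    win          : window used for analysis and synthesis (assumption)
    init_phase   : optional starting phase; by default the phase of the
                   windowed past contribution is used (assumption)
    """
    n = len(win)
    if init_phase is None:
        phase = np.angle(np.fft.rfft(win * past_overlap))
    else:
        phase = np.asarray(init_phase, dtype=float)
    for _ in range(n_iter):
        # Inverse DFT of the target magnitude combined with the current phase.
        contribution = win * np.fft.irfft(target_mag * np.exp(1j * phase), n=n)
        # Overlap-add with what the previous frames already contributed.
        estimate = past_overlap + contribution
        # New phase: phase of the signal estimate after the analysis window.
        phase = np.angle(np.fft.rfft(win * estimate))
    return phase

# Toy usage with made-up data (no real audio involved).
rng = np.random.default_rng(0)
n = 64
win = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n))  # sqrt-Hann
frame = rng.standard_normal(n)        # "true" signal under the current frame
true_spec = np.fft.rfft(win * frame)
past = frame * (1.0 - win**2)         # toy stand-in for the previous frames
phase = rtisi_frame_update(np.abs(true_spec), past, win)
```

In this toy setup the stand-in for the previous frames is chosen so that the true phase is a fixed point of the update, which gives a simple sanity check; with real data the buffer would come from the frames already committed.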
RTISI does lead to better results than GL for the first few
iterations, but it quickly reaches a plateau and is ultimately
significantly outperformed by GL. This is mainly because RTISI does
not consider information from future frames at all, even though the
contribution of these future frames will later be added to that of
the past and current frames, effectively altering the estimation
performed earlier. Its authors thus proposed an extension to RTISI
that includes an $M$-frame look-ahead, RTISI-LA. Instead of
considering only the current frame as active, RTISI-LA performs
GL-type updates on the phases in a block of multiple frames. The
contribution of future frames outside the block is discarded during
the updates, because the absence of a reliable phase estimate for
them is regarded as likely to make their contribution more of a
disturbance than a useful clue. This creates an asymmetry, which
Zhu et al. [17] proposed to partially compensate for by using
asymmetric analysis windows with a reverse effect. Although the
procedure relies on heuristic considerations, the authors show that
it leads to much better performance than GL for a given number of
iterations per block.
While RTISI and RTISI-LA were successful in overcoming GL’s issues
regarding online processing and poor initialization, they did not
tackle the problems of heavy reliance on costly FFT computations and
lack of care for local regularities in the time-frequency domain.
Solving these problems was difficult in the context of classical
approaches relying on enforcing constraints both in the
time-frequency domain (to impose a given magnitude) and in the time
domain (to ensure that magnitude and phase are consistent), because
they inherently had to go back and forth between the two domains,
processing whole frames at a time. A solution was proposed by
Le Roux et al. [18], whose key idea was to bypass the time domain
altogether and reformulate the problem within the time-frequency
domain. The standard operation of classical iterative approaches,
i.e., computing the STFT of the signal obtained by iSTFT from a given
spectrogram, can indeed be considered as a linear operator in the
time-frequency domain. Le Roux et al. noticed that the result of that
operation at each time-frequency bin can be well approximated by a
local weighted sum (LWS) with complex coefficients on a small
neighborhood of that bin in the original spectrogram. While the very
small number of terms in the sum does not suffice to reduce the
complexity of the operation compared to using FFTs, the locality of
the sum opens the door to selectively updating certain time-frequency
bins, as well as to immediately propagating the updated value for a
bin in the computations of its neighbors’ updates. Taking advantage
of the sparseness of natural sound signals, Le Roux et al. showed in
particular that focusing first on updating only the bins with high
energy not only greatly reduced the complexity of each iteration, but
could also lead to better initializations, the high-energy regions
serving as anchors for lower-energy ones. While the LWS algorithm was
originally proposed as an extension to GL for batch-mode
computations, the authors later showed that it could also be used
effectively in online mode in combination with RTISI-LA [19].
Interestingly, a different prioritization of the updates based on
energy, at the frame level instead of the bin level, was also
successfully used by Gnann and Spiertz to improve RTISI-LA [20].
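The claim that computing the STFT of the iSTFT acts as a linear operator with a compact kernel can be checked numerically. The sketch below is not the LWS algorithm and does not reproduce the coefficients of [18]; it only builds the operator with a small hand-written STFT pair and looks at its response to a single time-frequency bin. The circular framing, the Hann window, the frame length of 64, the hop of 16, and the neighborhood of 3 frames by 2 bins are all assumptions chosen for the illustration.

```python
import numpy as np

def stft(x, win, hop):
    """STFT with circular framing (an assumption that removes edge effects)."""
    n = len(win)
    idx = (np.arange(0, len(x), hop)[:, None] + np.arange(n)) % len(x)
    return np.fft.fft(x[idx] * win, axis=1)              # (frames, bins)

def istft(X, win, hop):
    """Least-squares inverse STFT: windowed overlap-add over summed win**2."""
    n = len(win)
    length = X.shape[0] * hop
    idx = (np.arange(0, length, hop)[:, None] + np.arange(n)) % length
    num = np.zeros(length, dtype=complex)
    den = np.zeros(length)
    np.add.at(num, idx, np.fft.ifft(X, axis=1) * win)
    np.add.at(den, idx, np.broadcast_to(win**2, idx.shape))
    return num / den

def consistency(X, win, hop):
    """STFT of the iSTFT of X: the operator discussed in the text."""
    return stft(istft(X, win, hop), win, hop)

n, hop, frames = 64, 16, 32
win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)   # periodic Hann
rng = np.random.default_rng(0)
X = rng.standard_normal((frames, n)) + 1j * rng.standard_normal((frames, n))
FX = consistency(X, win, hop)

# Kernel of the operator: its response to a single active bin (t0, k0).
t0, k0 = 16, 20
E = np.zeros((frames, n), dtype=complex)
E[t0, k0] = 1.0
K = consistency(E, win, hop)

# Circular distances to (t0, k0) in frames and in frequency bins.
dt = np.abs(np.arange(frames) - t0)
dt = np.minimum(dt, frames - dt)
dk = np.abs(np.arange(n) - k0)
dk = np.minimum(dk, n - dk)

near = (dt[:, None] <= 3) & (dk[None, :] <= 2)
energy = np.abs(K) ** 2
local_fraction = energy[near].sum() / energy.sum()
print(f"kernel energy within 3 frames and 2 bins: {local_fraction:.4f}")
```

With this window and overlap, the kernel is exactly zero for frames that do not overlap the active one, and the bulk of its energy sits within a couple of frequency bins, which is the locality that makes bin-wise updates possible.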
Recently, several authors investigated signal reconstruction from
magnitudes with specific task-related side information. The methods
developed in the context of source separation are of particular
interest to this article. Gunawan and Sen [21] proposed the multiple
input spectrogram inversion (MISI) algorithm to reconstruct multiple
signals from their magnitude spectrograms and their mixture signal.
The phase of the mixture signal acts as very powerful side
information, which can be exploited by imposing that the
reconstructed complex spectrograms add up to the mixture complex
spectrogram when estimating their phases, leading to much better
reconstruction quality than in situations where the mixture signal
is not available. Sturmel and Daudet’s partitioned phase retrieval
(PPR) method [9] also handles the reconstruction of multiple sources.
Their proposal was to reconstruct the phase of the magnitude
spectrogram obtained by Wiener filtering by applying a GL-like
algorithm, which keeps the mixture phase in high-SNR regions as a
good estimate for the corresponding source and only updates the phase
in low- to mid-SNR regions. Both methods, however, only modify the
phase of the sources, and thus implicitly assume that the input
magnitude spectrograms are close to the true source spectrograms,
which is not realistic in general in the context of blind or
semiblind source separation. Sturmel and Daudet proposed to extend
MISI to allow for modifications of both the magnitude and phase,
leading to the informed source separation using iterative
reconstruction (ISSIR) method [22], and showed that it is efficient
in the context of informed source separation where a quantized
version of the oracle magnitude spectrograms is available. Methods to
jointly estimate phase and magnitude for blind source separation and
speech enhancement will be presented later.
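A minimal sketch of a MISI-style iteration may help make the mixture constraint concrete. It is written from the description above rather than from the code of [21]: the function name `misi`, the hand-written circular STFT pair, the initialization with the mixture phase, and the equal split of the residual among the sources are assumptions of this example, and the toy sources are synthetic.

```python
import numpy as np

def stft(x, win, hop):
    """Real-input STFT with circular framing (assumption, avoids edge effects)."""
    n = len(win)
    idx = (np.arange(0, len(x), hop)[:, None] + np.arange(n)) % len(x)
    return np.fft.rfft(x[idx] * win, axis=1)

def istft(X, win, hop):
    """Least-squares inverse STFT for the circular framing above."""
    n = len(win)
    length = X.shape[0] * hop
    idx = (np.arange(0, length, hop)[:, None] + np.arange(n)) % length
    num = np.zeros(length)
    den = np.zeros(length)
    np.add.at(num, idx, np.fft.irfft(X, n=n, axis=1) * win)
    np.add.at(den, idx, np.broadcast_to(win**2, idx.shape))
    return num / den

def misi(mixture, magnitudes, win, hop, n_iter=20):
    """Estimate source phases from target magnitudes and the mixture signal."""
    n_src = len(magnitudes)
    mix_phase = np.angle(stft(mixture, win, hop))
    phases = [mix_phase.copy() for _ in range(n_src)]    # start from mixture phase
    for _ in range(n_iter):
        signals = [istft(m * np.exp(1j * p), win, hop)
                   for m, p in zip(magnitudes, phases)]
        # Residual between the mixture and the sum of current estimates,
        # split equally so that the corrected estimates add up to the mixture.
        residual = mixture - sum(signals)
        phases = [np.angle(stft(s + residual / n_src, win, hop)) for s in signals]
    signals = [istft(m * np.exp(1j * p), win, hop)
               for m, p in zip(magnitudes, phases)]
    residual = mixture - sum(signals)
    return [s + residual / n_src for s in signals], phases

# Toy usage with synthetic sources (not data from the article).
rng = np.random.default_rng(0)
n, hop, length = 256, 64, 4096
win = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n))  # sqrt-Hann
t = np.arange(length)
src1 = np.sin(2 * np.pi * 0.01 * t) * (1 + 0.5 * np.sin(2 * np.pi * 0.0005 * t))
src2 = 0.5 * rng.standard_normal(length)
mixture = src1 + src2
mags = [np.abs(stft(s, win, hop)) for s in (src1, src2)]
estimates, phases = misi(mixture, mags, win, hop)
```

The returned time-domain estimates sum to the mixture by construction, while their magnitudes are only approximately equal to the targets; as noted in the text, the magnitudes themselves are never modified by the phase updates.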
SINUSOIDAL MODEL-BASED PHASE ESTIMATION
In contrast to the iterative approaches presented in the previous
section, sinusoidal model-based phase estimation [4] does not require
estimates of the clean speech spectral magnitudes. Instead, the clean
spectral phase is estimated using only an estimate of the fundamental
frequency, which can be obtained from the degraded signal. However,
since the sinusoidal model is reasonable only for voiced sounds,
these approaches do not provide valid spectral phase estimates for
unvoiced sounds, such as fricatives or plosives.
For a single sinusoid, $\sin(\Omega n + \varphi)$, with normalized
angular frequency $\Omega$, the phase difference between two samples
$n_2 = n_1 + R$ is given by
$\Delta\phi = \phi(n_2) - \phi(n_1) = \Omega R$. For a harmonic
signal, $H$ sinusoids at integer multiples of the normalized angular
fundamental frequency $\Omega_0$, i.e.,
$\Omega_h = (h+1)\Omega_0 \in [0, 2\pi)$, are present at the same
time:

$$s(n) = \sum_{h=0}^{H-1} A_h(n) \cos\bigl(\Omega_h(n) \cdot n + \varphi_h\bigr), \tag{5}$$
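The relation between frequency and phase advance can be verified numerically for a small harmonic signal. The sketch below assumes constant amplitudes and a constant fundamental frequency over the interval considered; the sampling rate, fundamental frequency, shift R, number of harmonics, amplitudes, and initial phases are made-up illustrative values, not parameters taken from the article.

```python
import numpy as np

fs = 16000.0          # sampling rate in Hz (illustrative)
f0 = 200.0            # fundamental frequency in Hz (illustrative)
R = 128               # shift in samples between the two time instants
H = 5                 # number of harmonics

omega0 = 2 * np.pi * f0 / fs                  # normalized angular fundamental
omega_h = (np.arange(H) + 1) * omega0         # harmonic h sits at (h + 1) * omega0
amps = 1.0 / (np.arange(H) + 1)               # made-up constant amplitudes
varphi = np.linspace(0.1, 1.3, H)             # made-up initial phases

# Harmonic signal as a sum of cosines, with constant parameters.
n = np.arange(1024)
s = np.sum(amps[:, None] * np.cos(omega_h[:, None] * n + varphi[:, None]), axis=0)

# Predicted phase advance of each harmonic over R samples.
delta_phi = omega_h * R

# Measured advance from the complex exponentials at n1 and n2 = n1 + R.
n1 = 300
n2 = n1 + R
z1 = np.exp(1j * (omega_h * n1 + varphi))
z2 = np.exp(1j * (omega_h * n2 + varphi))
measured = np.angle(z2 * np.conj(z1))         # wrapped to (-pi, pi]

print(np.round(np.angle(np.exp(1j * delta_phi)), 3))
print(np.round(measured, 3))
```

The two printed rows agree up to phase wrapping, which is why the comparison is best made modulo $2\pi$: for large shifts or high harmonics the unwrapped advance $\Omega_h R$ exceeds $2\pi$ many times over.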