
IEEE SIGNAL PROCESSING MAGAZINE [60] MARCH 2015

contribution to the signal is obtained by the inverse DFT of the phase $\phi^{(i)}$ combined with the target magnitude; frame $\ell$’s contribution is then combined by overlap-add with the contribution of the previous frames, leading to a signal estimate for frame $\ell$; the phase $\phi^{(i+1)}$ is estimated as the phase of this signal estimate to which the analysis window is applied.
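
The per-frame update described above can be sketched as follows. This is only an illustrative sketch, not the RTISI authors' implementation: the function name `rtisi_frame_update`, the buffer `prev_ola`, the use of one window for both analysis and synthesis, the initialization from the previous frames alone, and the omission of any window-sum normalization are all assumptions made to keep the example short.

```python
import numpy as np

def rtisi_frame_update(prev_ola, target_mag, win, n_iter=4):
    """Estimate the phase of the current frame (hypothetical helper).

    prev_ola   : overlap-add contribution of the already committed frames
                 over the span of the current frame (real, length N)
    target_mag : target magnitude of the current frame (length N // 2 + 1)
    win        : window of length N (assumed shared by analysis and synthesis)
    """
    n = len(win)
    # Assumed initialization: the phase predicted by the previous frames alone.
    phase = np.angle(np.fft.rfft(win * prev_ola))
    for _ in range(n_iter):
        # Contribution of the frame: inverse DFT of target magnitude + phase.
        contribution = win * np.fft.irfft(target_mag * np.exp(1j * phase), n=n)
        # Overlap-add with the previous frames gives the signal estimate.
        estimate = prev_ola + contribution
        # Next phase: phase of the signal estimate after the analysis window.
        phase = np.angle(np.fft.rfft(win * estimate))
    contribution = win * np.fft.irfft(target_mag * np.exp(1j * phase), n=n)
    return phase, contribution

# Toy usage with random data (not a meaningful audio example).
n = 256
win = np.hanning(n)
rng = np.random.default_rng(0)
prev_ola = rng.standard_normal(n)
target_mag = np.abs(np.fft.rfft(win * rng.standard_normal(n)))
phase, contribution = rtisi_frame_update(prev_ola, target_mag, win)
```

Once the iterations for a frame are finished, its contribution would be added to the running overlap-add buffer and the next frame processed in the same way; no future frame is ever consulted.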

RTISI does lead to better results than GL for the first few iterations, but it quickly reaches a plateau and is ultimately significantly outperformed by GL. This is mainly because RTISI does not consider information from future frames at all, even though the contribution of these future frames will later be added to that of the past and current frames, effectively altering the estimation performed earlier. Its authors thus proposed an extension to RTISI that includes an M-frame look-ahead, RTISI-LA. Instead of considering only the current frame as active, RTISI-LA performs GL-type updates on the phases in a block of multiple frames. The contribution of future frames outside the block is discarded during the updates, because the absence of a reliable phase estimate for them is regarded as likely to make their contribution more of a disturbance than a useful clue. This creates an asymmetry, which Zhu et al. [17] proposed to partially compensate for by using asymmetric analysis windows with a reverse effect. Although the procedure relies on heuristic considerations, the authors show that it leads to much better performance than GL for a given number of iterations per block.

While RTISI and RTISI-LA were successful in overcoming GL’s issues regarding online processing and poor initialization, they did not tackle the problems of heavy reliance on costly FFT computations and lack of care for local regularities in the time-frequency domain. Solving these problems was difficult in the context of classical approaches relying on enforcing constraints both in the time-frequency domain (to impose a given magnitude) and the time domain (to ensure that magnitude and phase are consistent), because they inherently had to go back and forth between the two domains, processing whole frames at a time. A solution was proposed by Le Roux et al. [18], whose key idea was to bypass the time domain altogether and reformulate the problem within the time-frequency domain. The standard operation of classical iterative approaches, i.e., computing the STFT of the signal obtained by iSTFT from a given spectrogram, can indeed be considered as a linear operator in the time-frequency domain. Le Roux et al. noticed that the result of that operation at each time-frequency bin can be well approximated by a local weighted sum (LWS) with complex coefficients on a small neighborhood of that bin in the original spectrogram. While the very small number of terms in the sum does not suffice to reduce the complexity of the operation compared to using FFTs, the locality of the sum opens the door to selectively updating certain time-frequency bins, as well as to immediately propagating the updated value for a bin in the computations of its neighbors’ updates. Taking advantage of the sparseness of natural sound signals, Le Roux et al. showed in particular that focusing first on updating only the bins with high energy not only greatly reduced the complexity of each iteration, but also could lead to better initializations, the high-energy regions serving as anchors for lower-energy ones. While the LWS algorithm was originally proposed as an extension to GL for batch-mode computations, the authors later showed that it could be effectively used in online mode as well, in combination with RTISI-LA [19]. Interestingly, a different prioritization of the updates based on energy, at the frame level instead of the bin level, was also successfully used by Gnann and Spiertz to improve RTISI-LA [20].
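
The locality that motivates LWS can be checked numerically. The sketch below is not the LWS algorithm itself; it only applies the operator "STFT of the iSTFT" to a spectrogram containing a single nonzero bin and measures how much of the response energy stays in a small neighborhood of that bin. The helper names (`stft`, `istft`, `project`), the periodic Hann window, the 75% overlap, and the 7-by-7 neighborhood are assumptions chosen for illustration.

```python
import numpy as np

def stft(x, win, hop):
    """Frame, window, and DFT a real signal; returns (frames, bins)."""
    n = len(win)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[t * hop:t * hop + n] * win for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, win, hop):
    """Least-squares overlap-add inverse of stft (same window for synthesis)."""
    n = len(win)
    frames = np.fft.irfft(spec, n=n, axis=1) * win
    x = np.zeros(n + (spec.shape[0] - 1) * hop)
    norm = np.zeros_like(x)
    for t, frame in enumerate(frames):
        x[t * hop:t * hop + n] += frame
        norm[t * hop:t * hop + n] += win ** 2
    return x / np.maximum(norm, 1e-12)

def project(spec, win, hop):
    """The operator discussed in the text: STFT of the iSTFT of a spectrogram."""
    return stft(istft(spec, win, hop), win, hop)

n, hop, n_frames = 256, 64, 41
win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)  # periodic Hann

# Spectrogram with a single nonzero bin, placed away from the borders.
t0, k0 = 20, 40
impulse = np.zeros((n_frames, n // 2 + 1), dtype=complex)
impulse[t0, k0] = 1.0
response = project(impulse, win, hop)

# Share of the response energy inside +/-3 frames and +/-3 bins of (t0, k0).
energy = np.abs(response) ** 2
local_share = energy[t0 - 3:t0 + 4, k0 - 3:k0 + 4].sum() / energy.sum()
print(f"energy share in the 7x7 neighborhood: {local_share:.4f}")
```

Because the response to one bin is confined to the frames that overlap it and concentrated in a few adjacent frequency bins, the value of the operator at a given bin depends mostly on its immediate neighbors, which is what makes selective, bin-by-bin updates possible.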

Recently, several authors investigated signal reconstruction from magnitudes with specific task-related side information. Those developed in the context of source separation are of particular interest to this article. Gunawan and Sen [21] proposed the multiple input spectrogram inversion (MISI) algorithm to reconstruct multiple signals from their magnitude spectrograms and their mixture signal. The phase of the mixture signal acts as very powerful side information, which can be exploited by imposing that the reconstructed complex spectrograms add up to the mixture complex spectrogram when estimating their phases, leading to much better reconstruction quality than in situations where the mixture signal is not available. Sturmel and Daudet’s partitioned phase retrieval (PPR) method [9] also handles the reconstruction of multiple sources. Their proposal was to reconstruct the phase of the magnitude spectrogram obtained by Wiener filtering by applying a GL-like algorithm, which keeps the mixture phase in high-SNR regions as a good estimate for the corresponding source and only updates the phase in low- to mid-SNR regions. Both methods, however, only modify the phase of the sources, and thus implicitly assume that the input magnitude spectrograms are close to the true source spectrograms, which is not realistic in general in the context of blind or semiblind source separation. Sturmel and Daudet proposed to extend MISI to allow for modifications of both the magnitude and phase, leading to the informed source separation using iterative reconstruction (ISSIR) method [22], and showed that it is efficient in the context of informed source separation where a quantized version of the oracle magnitude spectrograms is available. Methods to jointly estimate phase and magnitude for blind source separation and speech enhancement will be presented later.
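
The mixture constraint used by MISI can be illustrated with a short sketch. The version below follows the commonly described form of the iteration (resynthesize each source, spread the mixture residual equally over the sources, re-estimate the phases); the equal split of the residual, the initialization with the mixture phase, the helper names, and the window and hop values are assumptions for illustration rather than details taken from [21].

```python
import numpy as np

def stft(x, win, hop):
    """Frame, window, and DFT a real signal; returns (frames, bins)."""
    n = len(win)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[t * hop:t * hop + n] * win for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, win, hop):
    """Least-squares overlap-add inverse of stft (same window for synthesis)."""
    n = len(win)
    frames = np.fft.irfft(spec, n=n, axis=1) * win
    x = np.zeros(n + (spec.shape[0] - 1) * hop)
    norm = np.zeros_like(x)
    for t, frame in enumerate(frames):
        x[t * hop:t * hop + n] += frame
        norm[t * hop:t * hop + n] += win ** 2
    return x / np.maximum(norm, 1e-12)

def misi(mixture, magnitudes, win, hop, n_iter=20):
    """MISI-style phase estimation for several sources (illustrative sketch).

    mixture    : time-domain mixture, length len(win) + (frames - 1) * hop
    magnitudes : list of magnitude spectrograms, one per source
    """
    n_src = len(magnitudes)
    # Assumed initialization: every source starts with the mixture phase.
    mix_phase = np.angle(stft(mixture, win, hop))
    phases = [mix_phase.copy() for _ in range(n_src)]
    for _ in range(n_iter):
        signals = [istft(m * np.exp(1j * p), win, hop)
                   for m, p in zip(magnitudes, phases)]
        # Residual between the mixture and the sum of the current estimates.
        residual = mixture - np.sum(signals, axis=0)
        # Spread the residual equally so the corrected signals sum to the
        # mixture, then keep only the phase of each corrected signal.
        phases = [np.angle(stft(s + residual / n_src, win, hop))
                  for s in signals]
    return np.stack([istft(m * np.exp(1j * p), win, hop)
                     for m, p in zip(magnitudes, phases)])

# Toy usage: two random sources with oracle magnitudes.
n, hop = 256, 64
win = np.hamming(n)
rng = np.random.default_rng(0)
length = n + 30 * hop
sources = rng.standard_normal((2, length))
mixture = sources.sum(axis=0)
magnitudes = [np.abs(stft(s, win, hop)) for s in sources]
estimates = misi(mixture, magnitudes, win, hop)
```

Note that only the phases change from one iteration to the next; the magnitudes are reimposed every time, which is exactly the limitation the text points out when the input magnitudes are themselves inaccurate.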

SINUSOIDAL MODEL-BASED PHASE ESTIMATION

In contrast to the iterative approaches presented in the previous section, sinusoidal model-based phase estimation [4] does not require estimates of the clean speech spectral magnitudes. Instead, the clean spectral phase is estimated using only an estimate of the fundamental frequency, which can be obtained from the degraded signal. However, since usage of the sinusoidal model is reasonable only for voiced sounds, these approaches do not provide valid spectral phase estimates for unvoiced sounds, like fricatives or plosives.

For a single sinusoid, $\sin(\Omega n + \varphi)$ with normalized angular frequency $\Omega$, the phase difference between two samples $n$ and $n' = n + R$ is given by $\Delta\phi = \phi(n') - \phi(n) = \Omega R$. For a harmonic signal, $H$ sinusoids at integer multiples of the normalized angular fundamental frequency $\Omega_0$, i.e., $\Omega_h = (h+1)\Omega_0 \in [0, 2\pi)$, are present at the same time:

\[ s(n) \approx \sum_{h=0}^{H-1} A_h(n) \cos\bigl(\Omega_h(n)\, n + \varphi_h\bigr) \tag{5} \]
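
The phase-advance relation (a sinusoid with normalized angular frequency Omega gains a phase of Omega times R over a shift of R samples) can be checked on a synthetic harmonic signal. The sketch below is a sanity check under assumed values, not a phase estimator: the sampling rate, fundamental frequency, frame length, hop size, and the helper names `harmonic_frequencies` and `wrap` are all chosen for illustration.

```python
import numpy as np

def harmonic_frequencies(f0_hz, fs_hz, n_harmonics):
    """Normalized angular frequencies (h + 1) * Omega_0 for h = 0 .. H - 1."""
    omega0 = 2 * np.pi * f0_hz / fs_hz
    return (np.arange(n_harmonics) + 1) * omega0

def wrap(phase):
    """Wrap phase values to the interval (-pi, pi]."""
    return np.angle(np.exp(1j * phase))

# Assumed example values.
fs, f0, n, hop, n_harm = 16000, 220.0, 512, 128, 3
omegas = harmonic_frequencies(f0, fs, n_harm)

# Stationary harmonic signal with arbitrary initial phases.
t = np.arange(n + hop)
s = sum(np.cos(w * t + 0.3 * (h + 1)) for h, w in enumerate(omegas))

# Two analysis frames that are `hop` samples apart.
win = np.hanning(n)
frame0 = np.fft.rfft(win * s[:n])
frame1 = np.fft.rfft(win * s[hop:hop + n])

# Compare the measured phase advance at the bin nearest each harmonic
# with the predicted advance Omega_h * R.
bins = np.round(omegas * n / (2 * np.pi)).astype(int)
measured = np.angle(frame1[bins]) - np.angle(frame0[bins])
predicted = omegas * hop
error = wrap(measured - predicted)
print("wrapped prediction error per harmonic (rad):", error)
```

For a stationary signal the prediction error stays close to zero, up to a small leakage contribution from neighboring harmonics. This is why an estimate of the fundamental frequency alone suffices to propagate the phase from frame to frame in voiced segments.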
