IEEE SIGNAL PROCESSING MAGAZINE [61] MARCH 2015
with real-valued amplitude $A_h$ and initial time-domain phase $\varphi_h$ for harmonic component $h$. Due to the fixed relation between the frequencies, (5) is also referred to as the harmonic model, which is a special case of the more general sinusoidal model. The harmonic frequencies and amplitudes are assumed to be slowly changing over time with respect to the length $N$ of an STFT signal segment, and we define $A_{h,\ell} = A_h(\ell R + N/2)$ and $\Omega_{h,\ell} = \Omega_h(\ell R + N/2)$ as the representative harmonic amplitudes and frequencies for the $\ell$th signal segment.
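As a concrete illustration of the harmonic model (5), the following minimal NumPy sketch synthesizes a sum of sinusoids whose frequencies are integer multiples of a fundamental; the function and variable names are ours, not the article's:

```python
import numpy as np

def harmonic_signal(n, A, phi, Omega0):
    """Sketch of the harmonic model (5): a sum of H sinusoids whose
    frequencies are integer multiples of the fundamental Omega0
    (normalized angular frequency, rad/sample). A[h-1] and phi[h-1]
    are the amplitude and initial time-domain phase of harmonic h."""
    n = np.asarray(n)
    s = np.zeros(n.shape, dtype=float)
    for h, (A_h, phi_h) in enumerate(zip(A, phi), start=1):
        s += A_h * np.cos(h * Omega0 * n + phi_h)
    return s

# Example (values are illustrative): three harmonics of a
# 200-Hz fundamental at a sampling rate of 8 kHz.
fs = 8000
Omega0 = 2 * np.pi * 200 / fs
n = np.arange(256)
s = harmonic_signal(n, A=[1.0, 0.5, 0.25], phi=[0.0, 0.3, -0.8], Omega0=Omega0)
```

In a real implementation the amplitudes and phases would be taken per segment, as in the definitions of $A_{h,\ell}$ and $\Omega_{h,\ell}$ above.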
In speech enhancement, the sinusoidal model has, for instance,
been employed in [23], where the model parameters are iteratively
estimated from a noisy observation in the STFT domain, and the
enhanced signal is synthesized using (5). In the absence of noise,
synthesis results are reported to be almost indistinguishable from
the clean speech signal, underlining the capability of (5) to accu-
rately model voiced human speech. In contrast to [23], we now dis-
cuss how the sinusoidal model (5) can be employed to directly
reconstruct the STFT phase. If the frequency resolution of the
STFT is high enough to resolve the harmonic frequencies $\Omega_h$ in (5), in each frequency band $k$ only a single harmonic component
is dominant. The normalized angular frequency $\Omega_{h,\ell}$ of the harmonic that dominates frequency band $k$ is denoted as

$$\bar{\Omega}_{k,\ell} = \operatorname*{argmin}_{\Omega_{h,\ell}} \left| \Omega_{h,\ell} - 2\pi k/N \right| \qquad (6)$$
i.e., the harmonic frequency that is closest to the center frequency $2\pi k/N$ of the $k$th frequency band. Interpreting the STFT of a signal as the output of a complex filter bank subsampled by the hop size $R$, the spectral phase $\phi^{S}_{k,\ell}$ changes from segment to segment according to

$$\phi^{S}_{k,\ell} = \mathrm{mod}_{2\pi}\!\left(\phi^{S}_{k,\ell-1} + \Delta\phi^{S}_{k,\ell}\right) = \mathrm{mod}_{2\pi}\!\left(\phi^{S}_{k,\ell-1} + R\,\bar{\Omega}_{k,\ell}\right) \qquad (7)$$
where the modulo operator $\mathrm{mod}_{2\pi}(\cdot)$ wraps the phase to values between 0 and $2\pi$.
When the clean signal $s(n)$ is deteriorated by noise, the spectral phases and thus the temporal phase differences $\Delta\phi^{S}_{k,\ell}$ are deteriorated as well. With an estimate of the fundamental frequency at
riorated as well. With an estimate of the fundamental frequency at
hand, however, the temporal phase relations in each band can be
restored using (7) recursively from segment to segment.
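The recursion in (6) and (7) can be sketched as follows; this is a minimal NumPy illustration under the stated assumptions (a known fundamental frequency per segment), and the function and variable names are ours:

```python
import numpy as np

def propagate_phase(phi_init, Omega0, N, R, num_segments):
    """Recursive phase estimation along time, following (6) and (7).

    phi_init : initial spectral phase per band (e.g., the noisy phase
               at a voiced onset), shape (N//2 + 1,)
    Omega0   : estimated fundamental frequency (rad/sample) for each
               segment, shape (num_segments,)
    N, R     : STFT segment length and hop size
    """
    K = N // 2 + 1
    phi = np.empty((num_segments, K))
    phi[0] = phi_init
    for ell in range(1, num_segments):
        # Harmonic frequencies Omega_{h,ell} below the Nyquist frequency.
        H = int(np.floor(np.pi / Omega0[ell]))
        Omega_h = Omega0[ell] * np.arange(1, H + 1)
        # (6): pick the harmonic closest to each band center 2*pi*k/N.
        centers = 2 * np.pi * np.arange(K) / N
        idx = np.argmin(np.abs(Omega_h[None, :] - centers[:, None]), axis=1)
        Omega_bar = Omega_h[idx]
        # (7): advance the phase by R * Omega_bar and wrap to [0, 2*pi).
        phi[ell] = np.mod(phi[ell - 1] + R * Omega_bar, 2 * np.pi)
    return phi

# Example: with Omega0 = pi/4, N = 32, and R = 8, every harmonic advances
# by an integer multiple of 2*pi per hop, so the phase stays constant.
phi = propagate_phase(np.zeros(17), np.full(5, np.pi / 4), N=32, R=8,
                      num_segments=5)
```

A full system would restart the recursion from the observed noisy phase at each voiced onset, as discussed below.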
Almost 50 years ago, a similar approach for the propagation of
the spectral phase along time was taken in the phase vocoder [5]
for time-scaling or pitch-shifting of acoustic signals. The temporal
STFT phase difference is modified according to
$$\hat{\phi}^{S}_{k,\ell} = \hat{\phi}^{S}_{k,\ell-1} + \alpha\,\Delta\phi^{S}_{k,\ell} \qquad (8)$$
where in this context, $\Delta\phi^{S}_{k,\ell}$ is often referred to as the IF. By scaling $\Delta\phi^{S}_{k,\ell}$ with the positive real-valued factor $\alpha$, the IF of the signal component is either increased ($\alpha > 1$) or decreased ($\alpha < 1$). Comparing (7) to (8), the phase estimation along time for speech enhancement can be expressed in terms of a phase vocoder with a scaling factor of $\alpha = 1$. However, the application is completely different: instead of deliberately modifying the original phase, the clean
speech phase is estimated from a noisy observation. It is worth not-
ing that for the original phase vocoder, in contrast to
phase estimation in speech enhancement, no fundamental frequency
estimate is needed, as the phase difference $\Delta\phi^{S}_{k,\ell} = \phi^{S}_{k,\ell} - \phi^{S}_{k,\ell-1}$ can be taken directly from the clean original signal.
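The phase-vocoder modification (8) can be sketched in a few lines of NumPy; names are ours, and the phase unwrapping a practical vocoder would apply to the differences is omitted for brevity:

```python
import numpy as np

def modify_phase(phi_S, alpha):
    """Phase-vocoder-style modification of the STFT phase along time,
    as in (8).

    phi_S : spectral phases of the original signal, shape (num_segments, K)
    alpha : IF scaling factor (alpha > 1 raises, alpha < 1 lowers the IF)
    Returns the modified phases phi_hat with the same shape.
    """
    phi_hat = np.empty_like(phi_S)
    phi_hat[0] = phi_S[0]
    for ell in range(1, phi_S.shape[0]):
        # IF term Delta phi^S_{k,ell}, taken from the clean original signal.
        delta = phi_S[ell] - phi_S[ell - 1]
        phi_hat[ell] = phi_hat[ell - 1] + alpha * delta
    return phi_hat

# Example with arbitrary phases: alpha = 2 doubles every temporal
# phase difference, i.e., it doubles the IF of each band.
rng = np.random.default_rng(0)
phi_S = rng.uniform(0, 2 * np.pi, size=(4, 5))
phi_hat = modify_phase(phi_S, alpha=2.0)
```

With $\alpha = 1$ the modified phase equals the original, which is why the time-recursive phase estimation above can be read as a phase vocoder with unit scaling factor.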
For an accurate estimation of the clean spectral phase along
segments using (7) a proper initialization is necessary [4]. In
voiced sounds, the bands between spectral harmonics contain only
little signal energy and, in the presence of noise, these bands are
likely to be dominated by the noise component, i.e., $\phi^{Y}_{k,\ell} \approx \phi^{N}_{k,\ell}$, where $\phi^{Y}_{k,\ell}$ and $\phi^{N}_{k,\ell}$ are the spectral phases of the noisy mixture and the noise, respectively. Even though the phase might be set consistently within each band, the spectral relations across frequency bands are distorted already at the initialization stage.
Directly applying (7) to every frequency band therefore does not
necessarily yield phase estimates that could be employed for phase-
based speech enhancement [4].
In the phase vocoder, this problem can be alleviated by aligning
phases of neighboring frequency bands relative to each other,
which is known as phase locking, e.g., [24]. There, the phase is
evolved along time only in frequency bands that directly contain
harmonic components. The phase in the surrounding bands,
which are dominated by the same harmonic, is then set relative to
the modified phase. For this, the spectral phase relations of the
original signal are imposed on the modified phase spectrum.
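The phase-locking principle can be sketched as follows: only the band containing the harmonic is propagated along time, and its neighbors then receive the original cross-band phase offsets on top of the modified peak phase. This is our own minimal illustration, not the algorithm of [24]:

```python
import numpy as np

def lock_phases(phi_orig, phi_mod_peak, peak, neighbors):
    """Phase locking: impose the original phase relations of the bands
    around a dominant harmonic onto the modified peak phase.

    phi_orig     : original spectral phases for one segment, shape (K,)
    phi_mod_peak : modified phase of the dominant band `peak`
    neighbors    : indices of bands dominated by the same harmonic
    """
    phi_locked = phi_orig.copy()
    phi_locked[peak] = phi_mod_peak
    # Keep each neighbor's original phase offset relative to the peak.
    phi_locked[neighbors] = phi_mod_peak + (phi_orig[neighbors] - phi_orig[peak])
    return phi_locked

# Example: band 1 carries the harmonic; bands 0 and 2 are locked to it.
phi_orig = np.array([0.1, 0.5, 0.9])
phi_locked = lock_phases(phi_orig, phi_mod_peak=1.5, peak=1, neighbors=[0, 2])
```

The offsets $\phi_k - \phi_{\text{peak}}$ are exactly the cross-band relations that, in the speech-enhancement setting discussed next, must instead be inferred from the sinusoidal model.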
In the context of speech enhancement, the same principle has
been incorporated to improve the estimation of the clean speech
spectral phase [4]. However, since only a noisy signal is observed,
the clean speech phase relations across frequency bands are not
readily available. To overcome this limitation, again the sinusoidal
model is employed. The spectrum of a harmonic signal segment is
given by the cyclic convolution of a comb-function with the trans-
fer function of the analysis window, which causes spectral leakage.
The spectral leakage induces relations not only between the ampli-
tudes, but also between the phases of neighboring bands. It can be
shown that phases of bands that are dominated by the same
[FIG3] Symbolic spectrogram illustrating the sinusoidal model-based phase estimation [4]. Starting from the noisy phase at the onset of a voiced sound in segment $\ell_0$, in bands containing harmonic components (red) the phase is estimated along segments. Based on the temporal estimates, the spectral phase of bands between the harmonics (blue) is then inferred across frequency.