IEEE SIGNAL PROCESSING MAGAZINE [62] MARCH 2015

harmonic are directly related to each

other through the phase response of

the analysis window; see, e.g., [4]

for more details. Accordingly, starting

from a phase estimate at a band that

contains a spectral harmonic, possi-

bly obtained using (7), the phase of

the surrounding bands can be inferred by accounting for the phase

shift introduced by the analysis window. For this, only the funda-

mental frequency and the phase response

are required, of

which the latter can be obtained offline either from the window’s

discrete-time Fourier transform (DTFT) or from its DFT with a

large amount of zero padding. The complete setup of [4] is illus-

trated in Figure 3.
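The inference across bands described above can be sketched numerically. The following is a minimal Python sketch under stated assumptions, not the implementation of [4]: each segment contains a single complex exponential (so the mirrored negative-frequency component of a real harmonic is ignored), the analysis window is a Hann window, and the harmonic frequency is known. The names `window_dtft` and `infer_neighbor_phase` are hypothetical, and the window's phase response is evaluated directly from its DTFT sum.

```python
import numpy as np

def window_dtft(window, omega):
    """Evaluate the window's DTFT W(omega) = sum_n w[n] exp(-j*omega*n)."""
    n = np.arange(len(window))
    return np.sum(window * np.exp(-1j * omega * n))

def infer_neighbor_phase(phase_k, k, k_target, omega_h, window):
    """Infer the phase in band k_target from the phase in band k.

    Assumes both bands are dominated by one harmonic at frequency omega_h
    (radians per sample), so that they differ only by the window's phase
    response evaluated at the respective frequency offsets.
    """
    n_dft = len(window)
    shift_k = np.angle(window_dtft(window, 2.0 * np.pi * k / n_dft - omega_h))
    shift_t = np.angle(window_dtft(window, 2.0 * np.pi * k_target / n_dft - omega_h))
    return phase_k - shift_k + shift_t

# Toy check: one complex exponential between bands 10 and 11.
N = 64
n = np.arange(N)
window = np.hanning(N)                      # illustrative window choice
omega_h = 2.0 * np.pi * 10.3 / N            # assumed harmonic frequency
segment = np.exp(1j * (omega_h * n + 0.7))  # assumed initial phase of 0.7 rad
spectrum = np.fft.fft(window * segment)

k = 10                                      # band that contains the harmonic
predicted = infer_neighbor_phase(np.angle(spectrum[k]), k, 12, omega_h, window)
# Wrapped difference between the inferred and the actual phase in band 12.
error = np.angle(np.exp(1j * (predicted - np.angle(spectrum[12]))))
```

In this idealized setting `error` is zero up to numerical precision. In practice the window's phase response would typically be tabulated once offline, e.g., from a heavily zero-padded DFT as mentioned above, and then looked up at the required frequency offsets.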

It can be argued that for speech enhancement, the phase recon-

struction across frequency bands between harmonics is more

important than the temporal reconstruction on the harmonics: on

the one hand, the local SNR in bands that directly contain harmonics is rather large for many realistic SNR situations, i.e., the noisy phase is close to the clean speech phase, $\phi^Y_{k,\ell} \approx \phi^S_{k,\ell}$. Thus, the temporal alignment of the harmonic components is maintained rather well in the noisy signal. Further, the

noisy phase

in these bands typically yields a good starting

point for the phase reconstruction along frequency. On the other

hand, frequency bands between harmonics are likely to be dominated by the noise, i.e., the noisy phase is close to the noise phase, $\phi^Y_{k,\ell} \approx \phi^N_{k,\ell}$, and the clean phase relations

across bands are strongly disturbed. Here, the possible benefit of

the phase reconstruction is much larger.

Even though the employed model is simple and limited to

purely voiced speech sounds, the obtained phase estimates yield

valuable information about the clean speech signal that can be

employed for advanced speech enhancement algorithms. Interest-

ingly, even the sole enhancement of the spectral phase can lead to a

considerable reduction of noise between harmonic components of

voiced speech after overlap-add [4]. This is because the speech

components of successive segments are adding up constructively

after the phase modifications, while the noise components suffer

from destructive interference, since the phase relations of the noise

have been destroyed. However, speech distortions are also intro-

duced, which are substantially reduced when the estimated phase

is combined with an enhanced magnitude, as, e.g., in [25]. Besides

its value for signal reconstruction, the estimated phase can also be

utilized as additional information for phase-aware magnitude esti-

mation [25] and even for the estimation of clean speech complex

coefficients [12], which will be discussed in more detail later.

GROUP DELAY AND TRANSIENT PROCESSING

Structures in the phase are not limited to voiced sounds, but are

also present for other sounds, like impulses or transients. These

structures are well captured by the group delay, which can be seen

in Figure1(c), rendering it a useful representation for phase pro-

cessing. For example, the group delay has been employed to facili-

tate clean speech phase estimation in phase-sensitive noise

reduction [26]. It can be shown geometrically that if the spectral

magnitudes of speech and noise are known, only two possible

combinations of phase values remain, both of which perfectly

explain the observed spectral coeffi-

cients of the mixture. In [26] (and

the references therein), Mowlaee and

Saeidi proposed to resolve this ambiguity by choosing the phase combination that minimizes a function of

the group delay.
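The two-fold ambiguity can be illustrated with the law of cosines. The sketch below is a hypothetical helper (`phase_candidates` is not taken from [26]) that assumes the noisy coefficient is the sum of a speech and a noise coefficient with known magnitudes; it only enumerates the two candidate pairs and does not implement the group-delay-based selection.

```python
import numpy as np

def phase_candidates(noisy, mag_speech, mag_noise):
    """Return both (speech, noise) pairs with the given magnitudes that sum to `noisy`."""
    mag_noisy = np.abs(noisy)
    # Law of cosines on the triangle formed by noisy, speech, and noise.
    cos_delta = (mag_noisy**2 + mag_speech**2 - mag_noise**2) / (
        2.0 * mag_noisy * mag_speech
    )
    delta = np.arccos(np.clip(cos_delta, -1.0, 1.0))
    candidates = []
    for sign in (+1.0, -1.0):
        # The speech phase lies delta above or below the noisy phase.
        speech = mag_speech * np.exp(1j * (np.angle(noisy) + sign * delta))
        candidates.append((speech, noisy - speech))
    return candidates

# Toy example with assumed values for one time-frequency point.
S_true = 1.0 * np.exp(1j * 0.4)
N_true = 0.6 * np.exp(1j * 2.0)
Y = S_true + N_true
pairs = phase_candidates(Y, abs(S_true), abs(N_true))
```

Both entries of `pairs` reproduce `Y` exactly and have the prescribed magnitudes; one of them coincides with the true speech and noise coefficients, the other is its mirror image about the direction of `Y`.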

Besides phase estimation, the group delay has successfully been

employed for the detection of transient sounds, such as sounds of

short duration and speech onsets. To illustrate the role of the phase

for transient sounds, let us consider a single impulse as the simplest example. The DFT of such a pulse is $A e^{-j 2\pi k n_0 / N}$, where $n_0$ is the shift of the peak relative to the beginning of the current segment and $A$ denotes the spectral magnitude. Hence, we observe a linear phase with a constant slope of $-2\pi n_0 / N$. For impulsive

signals, we accordingly expect a phase difference across frequency

bands that is approximately constant, i.e., a constant group delay.

That this is the case also for real speech sounds can be seen in Fig-

ure 1(c), where transient sounds show vertical lines with almost

equal group delay.
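The linear-phase property of a single impulse is easy to verify numerically. The following minimal sketch uses assumed toy values (segment length 64, impulse at sample 5) and holds as written only while the offset is smaller than half the segment length, so that the wrapped phase difference is unambiguous.

```python
import numpy as np

N = 64           # segment length (assumed)
n0 = 5           # position of the impulse within the segment (assumed)
amplitude = 2.0  # impulse height, equal to the spectral magnitude A

x = np.zeros(N)
x[n0] = amplitude
X = np.fft.fft(x)

# Wrapped phase difference between adjacent frequency bands.
phase_diff = np.angle(X[1:] * np.conj(X[:-1]))
expected_slope = -2.0 * np.pi * n0 / N

# Group delay in samples: negative phase slope with respect to frequency.
group_delay = -phase_diff * N / (2.0 * np.pi)
```

Here every entry of `phase_diff` equals `expected_slope`, and the group delay is constant over frequency and equal to `n0`.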

For the detection of impulsive sounds, a linearity index $\mathrm{LI}_k$ is defined in [27], which measures the deviation of the observed phase difference across frequencies from the one that is expected for an impulse at sample offset $n_0$, i.e., $-2\pi n_0 / N$. The observed phase differences are weighted with the spectral magnitude and averaged over frequency to obtain an estimate of the time-domain offset $n_0$. An impulsive sound is detected only if $\mathrm{LI}_k$ is close to zero, i.e., if the observed phase fits the expected linear phase well. The detection can be made either at the segment level or for each time-frequency point separately. While the former states whether an impulsive sound is present in the current signal segment, the latter allows frequency regions that are dominated by an impulsive sound, such as a narrowband onset, to be localized.
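A detector of this kind can be sketched as follows. This is an illustrative approximation and not the exact definition from [27]: the function name `linearity_index`, the magnitude weighting by the lower band of each pair, the small regularization constant, and the threshold of 0.1 rad are all assumptions.

```python
import numpy as np

def linearity_index(segment):
    """Return (deviation per band pair, estimated offset) for one segment."""
    n_samples = len(segment)
    spectrum = np.fft.rfft(segment)
    # Wrapped phase difference between adjacent frequency bands.
    phase_diff = np.angle(spectrum[1:] * np.conj(spectrum[:-1]))
    # Magnitude-weighted average of the phase differences over frequency.
    weights = np.abs(spectrum[:-1])
    mean_diff = np.sum(weights * phase_diff) / (np.sum(weights) + 1e-12)
    offset_estimate = -mean_diff * n_samples / (2.0 * np.pi)
    # Wrapped deviation from the linear phase expected for that offset.
    expected_diff = -2.0 * np.pi * offset_estimate / n_samples
    deviation = np.angle(np.exp(1j * (phase_diff - expected_diff)))
    return deviation, offset_estimate

# Impulse at sample 5: the deviation should vanish in every band.
impulse = np.zeros(64)
impulse[5] = 1.0
li_impulse, n0_hat = linearity_index(impulse)

# White noise: the phase differences show no linear structure.
rng = np.random.default_rng(0)
li_noise, _ = linearity_index(rng.standard_normal(64))

# Segment-level decision with an assumed threshold in radians.
is_impulsive = np.mean(np.abs(li_impulse)) < 0.1
```

Averaging the absolute deviation over frequency gives a segment-level decision, whereas thresholding the individual entries of the returned deviation corresponds to a decision per frequency band.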

Apart from the group delay, the IF, which corresponds to the

temporal derivative of the phase, has also been employed for the

detection of transient sounds, e.g., in [28] and the references

therein. For steady-state signals, like voiced sounds, the IF is

changing only slowly over time, due to the temporal correlation of

the overlapping segments. When a transient is encountered, how-

ever, the most current segment differs significantly from previous

segments and thus the IF also changes abruptly. This can be

observed in Figure1(d), where at speech onsets thin vertical lines

appear in the IF deviation. Hence, the change of the IF from segment to segment, together with its distribution, allows for the detection of transient sounds, such as note onsets [28].

The phase of transient sounds is not only relevant for detection,

but also for the reduction of transient noise. In low SNR time-fre-

quency regions, the observed noisy phase is close to the approxi-

mately linear phase of the transient noise. This can lead to artifacts

in the enhanced signal if only the spectral magnitude is improved

and the noisy phase is used for signal reconstruction: usage of the

phase of the transient noise reshapes the enhanced time-domain

signal in an uncontrolled way, such that it may again exhibit undesired transient behavior. Even for a perfect magnitude estimate, the interfering noise is not perfectly suppressed if the phase
