
IEEE SIGNAL PROCESSING MAGAZINE [56] MARCH 2015

enhancement community. Furthermore, we review both early and recent methods for phase processing in speech enhancement. We aim to show that phase processing is an exciting field of research with the potential to make assisted listening and speech communication devices more robust in acoustically challenging environments.

INTRODUCTION

Let us first consider the common speech enhancement setup consisting of STFT analysis, spectral modification, and subsequent inverse STFT (iSTFT) resynthesis. The analyzed digital signal $x(n)$, with time index $n$, is chopped into $L$ segments with a length of $N$ samples, overlapping by $N - R$ samples, where $R$ denotes the segment shift. Each segment $\ell$ is multiplied with the appropriately shifted analysis window $w(n - \ell R)$ and transformed into the frequency domain by applying the discrete Fourier transform (DFT), yielding the complex-valued STFT coefficients $X_{k,\ell} \in \mathbb{C}$ for every segment $\ell$ and frequency band $k$. To compactly describe this procedure, we define the STFT operator: $X = \mathrm{STFT}(x)$. Here, $x$ is a vector containing the complete time-domain signal $x(n)$ and $X$ is an $N \times L$ matrix of all $X_{k,\ell}$, which we will refer to as the spectrogram. Since we are interested in real-valued acoustic signals, we consider only complex symmetric spectrograms $X \in \mathcal{S} \subset \mathbb{C}^{N \times L}$, where $\mathcal{S}$ denotes the subset of spectrograms for which $X_{N-k,\ell} = \overline{X}_{k,\ell}$ for all $\ell$ and $k$, with $\overline{X}$ being the complex conjugate of $X$.
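The analysis step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `stft`, the choice to drop incomplete trailing segments, and the test signal and parameter values are all assumptions made for the example. It builds the spectrogram segment by segment and then checks the conjugate symmetry that holds for real-valued input.

```python
import numpy as np

def stft(x, w, R):
    """Return the N x L spectrogram of a real signal x.

    w is the length-N analysis window, R the segment shift.
    Assumption of this sketch: only complete segments are analyzed,
    so no zero padding is applied at the signal edges.
    """
    N = len(w)
    L = 1 + (len(x) - N) // R                 # number of complete segments
    X = np.empty((N, L), dtype=complex)
    for l in range(L):
        segment = x[l * R : l * R + N] * w    # window the l-th segment
        X[:, l] = np.fft.fft(segment)         # N-point DFT of the segment
    return X

# Hypothetical example values: white noise, N = 256, R = 64 (75% overlap).
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
N, R = 256, 64
w = np.sqrt(np.hanning(N + 1)[:N])            # periodic square-root Hann window

X = stft(x, w, R)

# For real input, bin N - k is the complex conjugate of bin k (k = 1..N-1).
k = np.arange(1, N)
sym_err = np.max(np.abs(X[N - k, :] - np.conj(X[k, :])))
print(X.shape, sym_err)
```

Because of this symmetry, practical implementations often store only the first N/2 + 1 frequency bands; the full N x L matrix is kept here to match the notation in the text.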

After some processing, such as magnitude improvement, is applied on the STFT coefficients, a modified spectrogram $\tilde{X}$ is obtained. From $\tilde{X}$ a time-domain signal can be resynthesized through an iSTFT operation, denoted $\tilde{x} = \mathrm{iSTFT}(\tilde{X})$. For this, the inverse DFT of the STFT coefficients is computed and each segment is multiplied by a synthesis window $\tilde{w}(n - \ell R)$; the windowed segments are then overlapped and added to obtain the modified time-domain signal. A final renormalization step is performed to ensure that, if no processing is applied to the spectral coefficients, there is perfect reconstruction of the input signal, i.e., $\mathrm{iSTFT}(\mathrm{STFT}(x)) = x$. The renormalization term, equal to $\sum_{q} w(n + qR)\,\tilde{w}(n + qR)$, is $R$-periodic and can be included in the synthesis window. A common choice for both $w(n)$ and $\tilde{w}(n)$ is the square-root Hann window, which for overlaps such that $N/R \in \mathbb{N}$ (e.g., 50%, 75%, etc.) only requires normalization by a scalar. If the spectrogram is modified, using the same window for synthesis as for analysis can be shown to lead to a resynthesized signal whose spectrogram is closest to $\tilde{X}$ in the least-squares sense [1]. This fact will turn out to be important for the iterative phase estimation approaches discussed later.
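The resynthesis and renormalization steps can be illustrated with a short NumPy sketch. The function names `stft` and `istft`, the variable `w_syn` for the synthesis window, and the example signal are assumptions made for this illustration. The renormalization is done here by accumulating the product of analysis and synthesis windows alongside the overlap-add, which also handles the signal edges where fewer segments overlap.

```python
import numpy as np

def stft(x, w, R):
    """N x L spectrogram using complete segments only (sketch assumption)."""
    N = len(w)
    L = 1 + (len(x) - N) // R
    return np.stack([np.fft.fft(x[l * R : l * R + N] * w) for l in range(L)], axis=1)

def istft(X, w, w_syn, R):
    """Weighted overlap-add resynthesis with renormalization."""
    N, L = X.shape
    length = (L - 1) * R + N
    y = np.zeros(length)
    norm = np.zeros(length)
    for l in range(L):
        # For spectrograms with conjugate symmetry the inverse DFT is real.
        segment = np.fft.ifft(X[:, l]).real
        y[l * R : l * R + N] += segment * w_syn   # synthesis window, overlap-add
        norm[l * R : l * R + N] += w * w_syn      # accumulated window products
    covered = norm > 1e-12                        # skip samples no window reaches
    y[covered] /= norm[covered]
    return y

# Hypothetical example values: white noise, N = 256, R = 64 (75% overlap).
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
N, R = 256, 64
w = np.sqrt(np.hanning(N + 1)[:N])                # periodic square-root Hann window

X = stft(x, w, R)
x_rec = istft(X, w, w, R)                         # same window for both steps

# One period of the renormalization term away from the signal edges.
q_sum = sum(np.roll(w * w, q * R) for q in range(N // R))
print(np.max(np.abs(x_rec[1:] - x[1:])), q_sum.min(), q_sum.max())
```

With the periodic square-root Hann window and an integer ratio N/R of at least 2, `q_sum` is constant and equal to N/(2R), which is the scalar normalization mentioned in the text. The first sample is excluded from the comparison because this window is exactly zero there and only one segment covers it.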

Until recently, in STFT-based speech enhancement, the focus was on modifying only the magnitude of the STFT components, because it was generally considered that most of the insight about the structure of the signal could be obtained from the magnitude, while little information could be obtained from the phase component. This would seem to be substantiated by Figure 1 when considering only (a) and (b), where the STFT magnitude (a) and STFT phase (b) of a clean speech excerpt are depicted. In contrast to the magnitude spectrogram, the phase spectrogram appears to show little temporal or spectral regularity. There are nonetheless distinct structures inherent to the spectral phase, but they are hidden to a great extent because the phase is

[FIG1] (a) Magnitude spectrogram, (b) phase spectrogram, (c) group delay, and (d) IF deviation of the utterance "glowed jewel-bright" using a segment length of 32 ms and a shift of 4 ms.

[Figure panels: (a) Magnitude (dB), (b) Phase (rad), (c) Group Delay (ms), (d) IF Deviation (Hz); each panel plots Frequency (kHz) against Time (s).]

WITH THE ADVANCEMENT OF TECHNOLOGY, BOTH ASSISTED LISTENING DEVICES AND SPEECH COMMUNICATION DEVICES ARE BECOMING MORE PORTABLE AND ALSO MORE FREQUENTLY USED.
