\[
\hat{Y}_s = Y_{\mathrm{mix}} \odot \frac{A_s X_s}{A X},
\]
where the last term is the ratio of the contribution of the sth
source to all the sources in each time–frequency point. This filter
response is used by the well-known Wiener filter, and the recon-
struction is often referred to as the Wiener-style reconstruction.
If we wish to listen to these separated components, we need
to convert them back to the time domain. At this point, we only
have magnitude spectrogram representations
$\hat{Y}_s$,
so we need to
find a way to create some phase values to be able to invert them
back to a waveform. Although one can use magnitude inversion
techniques [36], [37], a simple approach that leads to a reason-
able quality is to use the phase of the original mixture. This
leads to the following estimate for the separated complex spec-
trogram, which can be reverted to a time-domain signal:
\[
\hat{\bar{Y}}_s = \bar{Y}_{\mathrm{mix}} \odot \frac{A_s X_s}{A X},
\]
where $\hat{\bar{Y}}_s$ and $\bar{Y}_{\mathrm{mix}}$ represent complex spectrograms.
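As a concrete illustration, here is a brief Python sketch of this reconstruction, assuming SciPy is available and that the dictionaries and activations describe magnitude spectrograms computed with the same STFT settings as the mixture; the function and variable names are ours, not the article's.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_style_reconstruction(y_mix, A_list, X_list, fs=16000, nperseg=1024):
    """Separate sources with Wiener-style masks, reusing the mixture phase.

    y_mix  : time-domain mixture signal
    A_list : per-source dictionaries A_s (frequency bins x atoms)
    X_list : per-source activations X_s (atoms x frames)
    Returns a list of time-domain source estimates.
    """
    # Complex mixture spectrogram; its phase is reused for every source.
    _, _, Y_mix = stft(y_mix, fs=fs, nperseg=nperseg)

    # Modeled magnitudes A_s X_s and their sum A X.
    V_s = [A @ X for A, X in zip(A_list, X_list)]
    V_sum = np.maximum(sum(V_s), 1e-12)              # avoid division by zero

    estimates = []
    for V in V_s:
        Y_hat = Y_mix * (V / V_sum)                  # Y_mix times (A_s X_s)/(A X), element-wise
        _, y_hat = istft(Y_hat, fs=fs, nperseg=nperseg)
        estimates.append(y_hat)
    return estimates
```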
Although we have assumed in this section that the dictionaries
for all sources are known, this is not essential. The technique may
also be employed if the dictionary for one of the sources is not
known. In this case, in addition to estimating the activation matrices,
we must also estimate the unknown dictionary. This is done simply
by using the same iterative updates as for NMF but with (7) only act-
ing on the atoms reserved for modeling the unknown source.
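The following is a minimal sketch of this semi-supervised case, using the standard multiplicative updates for the generalized KL divergence as one possible instantiation of the NMF updates (the exact update in (7) may differ in its details); the variable names, divergence, and iteration count are our illustrative choices.

```python
import numpy as np

def semi_supervised_nmf(Y, A_known, n_free_atoms, n_iter=200, eps=1e-12):
    """Factorize the mixture Y with a fixed, pretrained dictionary A_known
    plus extra atoms that are learned to model the unknown source.

    Y       : magnitude spectrogram of the mixture (frequency bins x frames)
    A_known : dictionary of the known source(s), kept fixed
    Returns the full dictionary [A_known, A_free] and the activations X.
    """
    n_freq, n_frames = Y.shape
    A = np.hstack([A_known, np.random.rand(n_freq, n_free_atoms) + eps])
    X = np.random.rand(A.shape[1], n_frames) + eps
    ones = np.ones_like(Y, dtype=float)

    for _ in range(n_iter):
        # Activation update (all atoms).
        R = np.maximum(A @ X, eps)
        X *= (A.T @ (Y / R)) / np.maximum(A.T @ ones, eps)

        # Dictionary update applied only to the atoms of the unknown source.
        R = np.maximum(A @ X, eps)
        numer = (Y / R) @ X.T
        denom = np.maximum(ones @ X.T, eps)
        A[:, A_known.shape[1]:] *= (numer / denom)[:, A_known.shape[1]:]
    return A, X
```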
DICTIONARY CREATION
The key to effective modeling and separation of sources is to have
accurate dictionaries of atoms for each of the sources. The basic
NMF (3) aims at estimating both the atoms and their activations
from mixed data. In contrast, in supervised processing,
source-specific dictionaries
$A_s$
are obtained in a training stage
from a source-specific data set, and combined to form the whole
dictionary. The dictionary is then kept fixed, and only the activa-
tions are estimated according to (4).
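As an illustration of this supervised setting, the sketch below concatenates pretrained per-source dictionaries, keeps them fixed, and estimates only the activations with the common KL-divergence multiplicative update (one realization of this step; the update implied by (4) may differ); all names are illustrative.

```python
import numpy as np

def estimate_activations(Y_mix, dictionaries, n_iter=200, eps=1e-12):
    """Supervised separation step: the dictionary is fixed, only X is estimated.

    Y_mix        : magnitude spectrogram of the mixture (frequency bins x frames)
    dictionaries : list of pretrained per-source dictionaries A_s
    Returns the concatenated dictionary A and the estimated activations X.
    """
    A = np.hstack(dictionaries)                       # A = [A_1, ..., A_S], kept fixed
    X = np.random.rand(A.shape[1], Y_mix.shape[1]) + eps
    ones = np.ones_like(Y_mix, dtype=float)
    for _ in range(n_iter):
        R = np.maximum(A @ X, eps)                    # current model A X
        X *= (A.T @ (Y_mix / R)) / np.maximum(A.T @ ones, eps)
    return A, X
```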
There are two main approaches for dictionary learning: the first attempts to learn dictionary atoms that jointly describe the training data [38], [39], whereas the second, sampling-based approach uses samples from the training data itself as the dictionary atoms [4], [35]. Good dictionaries have several properties. They should describe the source accurately and generalize well to unseen data. They should be kept
relatively small to reduce computational complexity. They should
be discriminative, meaning that sources cannot be well repre-
sented using a dictionary of another source. These requirements
can be at odds with each other, e.g., because small, accurate dic-
tionaries are often less discriminative. The various approaches for
dictionary creation each have their strengths and weaknesses.
Let us denote the training data of source $s$ as $D_s$, a matrix whose columns are the training samples. The prevailing technique for dictionary learning is to use unsupervised NMF: for each data set $s$, we write $D_s \approx A_s X_s$ and estimate the parameters using the optimization methods described in the previous sections. The activations $X_s$ are discarded, and the dictionaries $A_s$ of each source are concatenated as explained previously.
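As a minimal sketch of this training stage, the snippet below uses scikit-learn's KL-divergence NMF as one possible optimizer: it learns a source-specific dictionary from a matrix of isolated training material and discards the activations. The function name, the 16-atom size, and the column normalization are illustrative choices rather than prescriptions from the article.

```python
import numpy as np
from sklearn.decomposition import NMF

def learn_dictionary(D_s, n_atoms=16, n_iter=500):
    """Learn a source-specific dictionary A_s from training data D_s.

    D_s : nonnegative training matrix for source s (e.g., a magnitude
          spectrogram with one training sample per column).
    The activations found during training are discarded.
    """
    model = NMF(n_components=n_atoms, init='random', solver='mu',
                beta_loss='kullback-leibler', max_iter=n_iter)
    A_s = model.fit_transform(D_s)   # D_s ~= A_s X_s, with X_s = model.components_
    # Normalize atoms so that differences in scale are absorbed by the activations.
    return A_s / np.maximum(A_s.sum(axis=0, keepdims=True), 1e-12)

# The per-source dictionaries are then concatenated into the mixture dictionary,
# e.g., A = np.hstack([A_speech, A_noise]).
```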
To illustrate this, let us consider the piano and speech sounds described by the magnitude
spectrograms in the left plots of Figure 6(a) and (b). We use
unsupervised NMF on each individual sound to obtain a 16-atom
dictionary, visualized in the plots on the right-hand sides of
Figure 6(a) and (b). We can observe that the dictionaries capture
[FIG5 graphic: the noisy speech spectrogram is approximated by a weighted sum of clean speech exemplars $x_s^1, \ldots, x_s^J$ and noise exemplars $x_n^1, \ldots, x_n^K$; panels show the noisy speech, the underlying clean speech, and the estimated clean speech, with the five largest weights approximately 0.2, 0.1, 0.09, 0.08, and 0.08.]
[FIG5] An example of supervised separation of noisy speech. In the top left corner, we display the noisy spectrogram of the isolated
word zero corrupted with babble noise. In (a), we display parts of the speech and noise exemplar dictionaries. In (b), the five atoms
with the highest weight are shown. The bottom left spectrogram illustrates the underlying clean speech, whereas the bottom right
spectrogram shows the clean speech reconstruction.