\[
\hat{Y}_s = Y_{\mathrm{mix}} \odot \frac{A_s X_s}{A X},
\]
where the last term is the ratio of the contribution of the sth
source to all the sources in each time–frequency point. This filter
response is used by the well-known Wiener filter, and the recon-
struction is often referred to as the Wiener-style reconstruction.
If we wish to listen to these separated components, we need
to convert them back to the time domain. At this point, we only
have magnitude spectrogram representations
$\hat{Y}_s$,
so we need to
find a way to create some phase values to be able to invert them
back to a waveform. Although one can use magnitude inversion
techniques [36], [37], a simple approach that leads to a reason-
able quality is to use the phase of the original mixture. This
leads to the following estimate for the separated complex spec-
trogram, which can be reverted to a time-domain signal:
\[
\hat{\bar{Y}}_s = \bar{Y}_{\mathrm{mix}} \odot \frac{A_s X_s}{A X},
\]
where $\hat{\bar{Y}}_s$ and $\bar{Y}_{\mathrm{mix}}$ represent complex spectrograms.
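As a concrete illustration, here is a brief Python sketch of this reconstruction, assuming SciPy is available and that the dictionaries and activations describe magnitude spectrograms computed with the same STFT settings as the mixture; the function and variable names are ours, not the article's.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_style_reconstruction(y_mix, A_list, X_list, fs=16000, nperseg=1024):
    """Separate sources with Wiener-style masks, reusing the mixture phase.

    y_mix  : time-domain mixture signal
    A_list : per-source dictionaries A_s (frequency bins x atoms)
    X_list : per-source activations X_s (atoms x frames)
    Returns a list of time-domain source estimates.
    """
    # Complex mixture spectrogram; its phase is reused for every source.
    _, _, Y_mix = stft(y_mix, fs=fs, nperseg=nperseg)

    # Modeled magnitudes A_s X_s and their sum A X.
    V_s = [A @ X for A, X in zip(A_list, X_list)]
    V_sum = np.maximum(sum(V_s), 1e-12)              # avoid division by zero

    estimates = []
    for V in V_s:
        Y_hat = Y_mix * (V / V_sum)                  # Y_mix times (A_s X_s)/(A X), element-wise
        _, y_hat = istft(Y_hat, fs=fs, nperseg=nperseg)
        estimates.append(y_hat)
    return estimates
```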
Although we have assumed in this section that the dictionaries
for all sources are known, this is not essential. The technique may
also be employed if the dictionary for one of the sources is not
known. In this case, in addition to estimating the activation matrices,
we must also estimate the unknown dictionary. This is done simply
by using the same iterative updates as for NMF but with (7) only act-
ing on the atoms reserved for modeling the unknown source.
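The following is a minimal sketch of this semi-supervised case, using the standard multiplicative updates for the generalized KL divergence as one possible instantiation of the NMF updates (the exact update in (7) may differ in its details); the variable names, divergence, and iteration count are our illustrative choices.

```python
import numpy as np

def semi_supervised_nmf(Y, A_known, n_free_atoms, n_iter=200, eps=1e-12):
    """Factorize the mixture Y with a fixed, pretrained dictionary A_known
    plus extra atoms that are learned to model the unknown source.

    Y       : magnitude spectrogram of the mixture (frequency bins x frames)
    A_known : dictionary of the known source(s), kept fixed
    Returns the full dictionary [A_known, A_free] and the activations X.
    """
    n_freq, n_frames = Y.shape
    A = np.hstack([A_known, np.random.rand(n_freq, n_free_atoms) + eps])
    X = np.random.rand(A.shape[1], n_frames) + eps
    ones = np.ones_like(Y, dtype=float)

    for _ in range(n_iter):
        # Activation update (all atoms).
        R = np.maximum(A @ X, eps)
        X *= (A.T @ (Y / R)) / np.maximum(A.T @ ones, eps)

        # Dictionary update applied only to the atoms of the unknown source.
        R = np.maximum(A @ X, eps)
        numer = (Y / R) @ X.T
        denom = np.maximum(ones @ X.T, eps)
        A[:, A_known.shape[1]:] *= (numer / denom)[:, A_known.shape[1]:]
    return A, X
```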
DICTIONARY CREATION
The key to effective modeling and separation of sources is to have
accurate dictionaries of atoms for each of the sources. The basic
NMF (3) aims at estimating both the atoms and their activations
from mixed data. In contrast, in supervised processing,
source-specific dictionaries
$A_s$
are obtained in a training stage
from a source-specific data set, and combined to form the whole
dictionary. The dictionary is then kept fixed, and only the activa-
tions are estimated according to (4).
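As an illustration of this supervised setting, the sketch below concatenates pretrained per-source dictionaries, keeps them fixed, and estimates only the activations with the common KL-divergence multiplicative update (one realization of this step; the update implied by (4) may differ); all names are illustrative.

```python
import numpy as np

def estimate_activations(Y_mix, dictionaries, n_iter=200, eps=1e-12):
    """Supervised separation step: the dictionary is fixed, only X is estimated.

    Y_mix        : magnitude spectrogram of the mixture (frequency bins x frames)
    dictionaries : list of pretrained per-source dictionaries A_s
    Returns the concatenated dictionary A and the estimated activations X.
    """
    A = np.hstack(dictionaries)                       # A = [A_1, ..., A_S], kept fixed
    X = np.random.rand(A.shape[1], Y_mix.shape[1]) + eps
    ones = np.ones_like(Y_mix, dtype=float)
    for _ in range(n_iter):
        R = np.maximum(A @ X, eps)                    # current model A X
        X *= (A.T @ (Y_mix / R)) / np.maximum(A.T @ ones, eps)
    return A, X
```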
There are two main approaches for dictionary learning: the first attempts to learn dictionary atoms that jointly describe the training data [38], [39], whereas the second, sampling-based approach uses samples from the training data itself as the dictionary atoms [4], [35]. Good dictionaries have several properties. They should describe the source accurately and generalize well to unseen data. They should be kept
relatively small to reduce computational complexity. They should
be discriminative, meaning that sources cannot be well repre-
sented using a dictionary of another source. These requirements
can be at odds with each other, e.g., because small, accurate dic-
tionaries are often less discriminative. The various approaches for
dictionary creation each have their strengths and weaknesses.
Let us denote the training data of source $s$ as $D_s$, a matrix whose columns are the training samples. The prevailing technique for dictionary learning is to use unsupervised NMF: for each data set $s$, we write $D_s \approx A_s X_s$ and estimate the parameters using the optimization methods described in the previous sections. The activations $X_s$ are discarded, and the dictionaries $A_s$ of each source are concatenated as explained previously.
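As a minimal sketch of this training stage, the snippet below uses scikit-learn's KL-divergence NMF as one possible optimizer: it learns a source-specific dictionary from a matrix of isolated training material and discards the activations. The function name, the 16-atom size, and the column normalization are illustrative choices rather than prescriptions from the article.

```python
import numpy as np
from sklearn.decomposition import NMF

def learn_dictionary(D_s, n_atoms=16, n_iter=500):
    """Learn a source-specific dictionary A_s from training data D_s.

    D_s : nonnegative training matrix for source s (e.g., a magnitude
          spectrogram with one training sample per column).
    The activations found during training are discarded.
    """
    model = NMF(n_components=n_atoms, init='random', solver='mu',
                beta_loss='kullback-leibler', max_iter=n_iter)
    A_s = model.fit_transform(D_s)   # D_s ~= A_s X_s, with X_s = model.components_
    # Normalize atoms so that differences in scale are absorbed by the activations.
    return A_s / np.maximum(A_s.sum(axis=0, keepdims=True), 1e-12)

# The per-source dictionaries are then concatenated into the mixture dictionary,
# e.g., A = np.hstack([A_speech, A_noise]).
```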
To illustrate this, let us consider the piano and speech sounds described by the magnitude
spectrograms in the left plots of Figure 6(a) and (b). We use
unsupervised NMF on each individual sound to obtain a 16-atom
dictionary, visualized in the plots on the right-hand sides of
Figure 6(a) and (b). We can observe that the dictionaries capture
[FIG5 graphic: the noisy speech spectrogram is approximated by a weighted sum of clean speech exemplars $x_s^1, \ldots, x_s^J$ and noise exemplars $x_n^1, \ldots, x_n^K$; panels show the noisy speech, the underlying clean speech, and the estimated clean speech, with the five largest weights approximately 0.2, 0.1, 0.09, 0.08, and 0.08.]
[FIG5] An example of supervised separation of noisy speech. In the top left corner, we display the noisy spectrogram of the isolated
word zero corrupted with babble noise. In (a), we display parts of the speech and noise exemplar dictionaries. In (b), the five atoms
with the highest weight are shown. The bottom left spectrogram illustrates the underlying clean speech, whereas the bottom right
spectrogram shows the clean speech reconstruction.