IEEE SIGNAL PROCESSING MAGAZINE [140] MARCH 2015

activations that correspond to the missing frames.

As in the previous example, sounds typically have strong temporal and spectral dependencies. Temporal context can be included in a compositional model by simply concatenating a number of adjacent observations into one long observation vector [4]. However, the resulting increase in the dimensionality of the observations makes the inference of atoms more difficult; e.g., in the above example, we would need multiple atoms to represent all of the temporally shifted variants of the bird sounds.
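The concatenation of adjacent observations can be sketched as follows. This is a minimal illustration, not the procedure of [4]; the function name `stack_frames`, the `context` parameter, and the convention that the spectrogram is stored with one frame per column are assumptions made for the example.

```python
import numpy as np

def stack_frames(Y, context):
    """Concatenate `context` adjacent frames (columns) of Y into long vectors.

    Y has shape (F, T). The result has shape (F * context, T - context + 1);
    column t holds frames t, t+1, ..., t+context-1 stacked on top of each other.
    """
    F, T = Y.shape
    if not 1 <= context <= T:
        raise ValueError("context must be between 1 and the number of frames")
    width = T - context + 1
    return np.concatenate([Y[:, i:i + width] for i in range(context)], axis=0)

# Toy spectrogram: 3 frequency bins, 4 frames.
Y = np.arange(12, dtype=float).reshape(3, 4)
Z = stack_frames(Y, context=2)
print(Z.shape)
```

Each stacked vector is `context` times longer than a single frame, which is the dimensionality increase mentioned above: an event that starts one frame later yields a different stacked vector, so a separate atom is needed for each shift.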

The principles used to model reverberant spectrograms and estimate reverberation responses and dry signals can be extended to learn temporal and spectral patterns that span more than one frame or frequency bin, respectively. These nonnegative matrix deconvolution (NMD) [2], [33], [61] methods aim at modeling either temporal or spectral context.

When the model is used in the time domain, it represents a spectrogram as a sum of temporally shifted and scaled versions of atomic spectrogram segments $\mathbf{a}_{k,\tau}$. As before, the atom vectors are indexed by $k$, but now also with $\tau$, which is the frame index of the short-time spectrogram segment. An illustration of the model is given in Figure 12. Mathematically, the model for an individual mixture spectrogram frame $\mathbf{y}_t$ is given as

$$\mathbf{y}_t \approx \hat{\mathbf{y}}_t = \sum_{k=1}^{K} \sum_{\tau=1}^{L} \mathbf{a}_{k,\tau}\, x_k[t-\tau], \qquad (22)$$

where L is the length of atomic spectrogram events. NMD gets its name from this formulation, as the contribution of a single atom is the convolution between the atom vectors and the activations.
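The convolutive structure of the model can be made concrete with a short sketch that evaluates the model spectrogram from given atoms and activations. The function name `nmd_reconstruct` and the array layout are assumptions for illustration. Delays are indexed from zero in the code, so that an activation impulse at frame t places the first frame of the atom at frame t; a one-based delay convention shifts the result by one frame.

```python
import numpy as np

def nmd_reconstruct(A, X):
    """Evaluate the convolutive model for every frame.

    A: atoms, shape (K, L, F); A[k, tau] is frame tau of atom k.
    X: activations, shape (K, T).
    Returns the model spectrogram with shape (F, T).
    """
    K, L, F = A.shape
    T = X.shape[1]
    Y_hat = np.zeros((F, T))
    for tau in range(min(L, T)):
        # Delay the activations by tau frames, padding the start with zeros.
        X_shift = np.zeros((K, T))
        X_shift[:, tau:] = X[:, :T - tau]
        # Add the contribution of frame tau of every atom.
        Y_hat += A[:, tau, :].T @ X_shift
    return Y_hat

# One atom with two frames, activated once at frame 1.
A = np.array([[[1.0, 2.0],     # frame 0 of the atom
               [3.0, 4.0]]])   # frame 1 of the atom
X = np.array([[0.0, 1.0, 0.0, 0.0]])
Y_hat = nmd_reconstruct(A, X)
print(Y_hat)
```

In this toy case the two frames of the atom appear in columns 1 and 2 of the output, which is the convolution of the atom with the activation impulse.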

Again, the parameters of the model can be obtained by minimizing a divergence between observations and the model while constraining the model parameters to nonnegative values. In an unsupervised scenario where both the atom vectors and their activations are estimated, care must be taken to limit the number of atoms and the length of events to avoid overfitting.
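One possible way to carry out this estimation is sketched below. The text does not prescribe a divergence or an optimizer; the generalized Kullback-Leibler divergence and multiplicative updates used here are a common choice and are an assumption of this example, as are all function names, the zero-based delay indexing, and the array layout. The number of atoms `K` and the event length `L` are passed in explicitly, since these are the quantities that need to be limited.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)

def shift_right(M, tau):
    """Delay the columns of M by tau frames, zero-padding at the start."""
    out = np.zeros_like(M)
    if tau < M.shape[1]:
        out[:, tau:] = M[:, :M.shape[1] - tau]
    return out

def shift_left(M, tau):
    """Advance the columns of M by tau frames, zero-padding at the end."""
    out = np.zeros_like(M)
    if tau < M.shape[1]:
        out[:, :M.shape[1] - tau] = M[:, tau:]
    return out

def reconstruct(A, X):
    """Model spectrogram: sum over delays of A[tau] @ (X delayed by tau)."""
    return sum(A[tau] @ shift_right(X, tau) for tau in range(A.shape[0]))

def kl_divergence(Y, Y_hat):
    """Generalized Kullback-Leibler divergence between nonnegative arrays."""
    return float(np.sum(Y * np.log((Y + EPS) / (Y_hat + EPS)) - Y + Y_hat))

def fit_nmd(Y, K, L, n_iter=200, seed=0):
    """Estimate atoms A (L, F, K) and activations X (K, T) from Y (F, T)."""
    rng = np.random.default_rng(seed)
    F, T = Y.shape
    A = rng.random((L, F, K)) + 0.1   # positive initialization
    X = rng.random((K, T)) + 0.1
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        # Activation update: ratio of negative and positive gradient parts.
        R = Y / (reconstruct(A, X) + EPS)
        num = sum(A[tau].T @ shift_left(R, tau) for tau in range(L))
        den = sum(A[tau].T @ shift_left(ones, tau) for tau in range(L))
        X *= num / (den + EPS)
        # Atom update, one delay at a time, using the refreshed ratio.
        R = Y / (reconstruct(A, X) + EPS)
        for tau in range(L):
            Xs = shift_right(X, tau)
            A[tau] *= (R @ Xs.T) / (ones @ Xs.T + EPS)
    return A, X

# Synthetic mixture: two atoms of three frames each, a few impulse activations.
rng = np.random.default_rng(1)
A_true = rng.random((3, 5, 2)) + 0.1
X_true = np.zeros((2, 40))
X_true[0, [3, 20]] = 1.0
X_true[1, [10, 30]] = 0.5
Y = reconstruct(A_true, X_true)

A_fit, X_fit = fit_nmd(Y, K=2, L=3, n_iter=100)
print(kl_divergence(Y, reconstruct(A_fit, X_fit)))
```

Because the updates only multiply by nonnegative ratios, the parameters stay nonnegative as long as they are initialized with positive values, so no explicit projection step is needed in this sketch.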

Convolution in frequency can be used to model pitch shifting

of atoms. A limitation of the linear models, at least when a

[Figure 12 graphic omitted. Panel titles: Spectrogram Matrix Y with Missing Frames; Component a_{1,τ}; Component a_{2,τ}; Activations X. Axis labels: Frequency (kHz), Time (s), τ, Amplitude.]

[FIG12] An illustration of the NMD model. (a) The magnitude spectrogram of a signal consisting of three bird sounds (Friedmann’s lark) and background noises. The spectrogram is modeled using NMD to decompose the signal into bird sounds (component 1) and background noises (component 2). (b) The compositional model represents the spectrogram as the weighted and delayed sum of two short event spectrogram segments. (c) The curves show the weights for each delay. The impulses in the curves correspond to the start times of bird sound events in the mixture. The events have been correctly found even though some of the frames in the mixture signal are missing (black vertical bar). Since NMD models the mixture as a sum of segments longer than the missing-frame segment, the model parameters can be used to predict the missing frames.
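The prediction of missing frames mentioned in the caption can be illustrated with a small filling step. This is only a sketch under stated assumptions: the function name `fill_missing_frames` is hypothetical, the model spectrogram `Y_hat` is taken as given, and the numbers are made up for the example. When fitting such a model, the missing frames would presumably be excluded from the divergence so that they do not influence the estimated parameters.

```python
import numpy as np

def fill_missing_frames(Y, Y_hat, observed):
    """Keep the observed frames of Y and take the others from the model.

    Y, Y_hat: arrays of shape (F, T).
    observed: boolean array of shape (T,), True where a frame was observed.
    """
    observed = np.asarray(observed, dtype=bool)
    return np.where(observed[np.newaxis, :], Y, Y_hat)

# Toy data: the middle frame of Y is missing (stored as zeros).
Y = np.array([[1.0, 0.0, 3.0],
              [2.0, 0.0, 4.0]])
Y_hat = np.array([[1.1, 2.0, 2.9],      # hypothetical model output
                  [2.1, 3.0, 4.2]])
observed = np.array([True, False, True])

Y_filled = fill_missing_frames(Y, Y_hat, observed)
print(Y_filled)
```

Observed frames are passed through unchanged, and only the missing frame is replaced by the model's value.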

REVERBERATION CAN BE FORMULATED AS A COMPOSITIONAL PROCESS.
