$\mathbf{x}_t$ is maximally sparse (contains only one nonzero entry), (17) is in fact identical to nearest-neighbor classification. For less sparse solutions, the difference is that the compositional model represents an observation as a combination of atoms, whereas k-NN represents an observation as a collection of $k$ atoms that each individually are close to $\mathbf{y}_t$.
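To make the contrast concrete, the following minimal sketch (an illustration in NumPy, not the exact formulation behind (17)) compares selecting the single closest atom with estimating a full nonnegative combination of atoms; the random dictionary, the random observation, and the use of nonnegative least squares are assumptions made only for this example.

import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
A = np.abs(rng.standard_normal((40, 10)))   # illustrative dictionary: 40-dimensional spectra, 10 atoms
y = np.abs(rng.standard_normal(40))         # illustrative observation y_t

# Nearest-neighbor view: the single atom closest to y_t, which is
# (loosely) what a maximally sparse activation vector corresponds to.
nearest_atom = int(np.argmin(np.linalg.norm(A - y[:, None], axis=0)))

# Compositional view: y_t modeled as a nonnegative combination of atoms.
x, _ = nnls(A, y)        # activation vector x_t >= 0
approximation = A @ x    # compositional model of the observation

print(nearest_atom, np.count_nonzero(x > 1e-9))

When the activation vector is restricted to a single nonzero entry, the two views coincide; with more nonzero entries, the compositional model explains $\mathbf{y}_t$ jointly rather than atom by atom.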
In the literature, many different types of metainformation exist. In the music transcription example of Figure 7, dictionary atoms were associated with notes. Even in the previous application, source separation, we used metainformation by labeling atoms with a source identity. In speaker identification [44], atoms are associated with speaker identities. In simple speech processing tasks, such as phone classification [45] or word recognition [46], the associated labels are simply the phones or words themselves.

In these examples, the dictionary $\mathbf{A}$ is either constructed or sampled from training data, which makes it straightforward to associate labels with atoms. When the dictionary is learned from data, however, the appropriate mapping from atoms to labels is unclear. In this scenario, the mapping can be learned by first calculating atom activations on training data for which the associated labels are known, followed by NMF or multiple regression. In [47], this approach was shown to improve performance even with a sampled dictionary. Alternatively, we can treat either $\mathbf{g}_t$ or the activations $\mathbf{x}_t$ as features for a conventional supervised machine-learning technique such as GMMs [48] or a neural network [49].
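As one possible reading of the regression-based mapping described above, the sketch below learns a nonnegative matrix that maps atom activations to class scores from labeled training data. The function names, the one-hot label matrix, and the use of per-class nonnegative least squares are illustrative assumptions, not the exact procedure of [47].

import numpy as np
from scipy.optimize import nnls

def learn_label_map(X_train, L_train):
    # X_train: (n_atoms, n_frames) activations computed on labeled training data
    # L_train: (n_classes, n_frames) one-hot (or soft) label indicators
    # Returns nonnegative B such that L_train is approximately B @ X_train.
    n_classes, n_atoms = L_train.shape[0], X_train.shape[0]
    B = np.zeros((n_classes, n_atoms))
    for c in range(n_classes):
        B[c], _ = nnls(X_train.T, L_train[c])   # nonnegative regression per class
    return B

def predict_labels(B, X_test):
    # Frame-wise class scores for new activations; argmax gives the label.
    return (B @ X_test).argmax(axis=0)

Frame-level scores can also be pooled over time before taking the argmax, and, as noted above, the activations themselves can instead be passed directly to a GMM or neural-network classifier.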
Another powerful aspect of the compositional model is that dictionary atoms can just as easily be associated with other kinds of information, e.g., audio. Consider, for example, a bandwidth extension task [9], [50], where the goal is to estimate a full-spectrum audio signal given a bandwidth-limited audio signal. This is a useful operation to perform since, in many audio transmission cases, high-frequency information is removed to reduce the amount of data that needs to be transmitted.
[Figure 7 image: (a) dictionary matrix A (frequency in kHz vs. MIDI note number), (b) spectrogram matrix Y (frequency in kHz vs. time in s), (c) reference activations (MIDI note number vs. time in s), (d) activation matrix X (MIDI note number vs. time in s).]
[FIG7] A music analysis example where a polyphonic mixture spectrogram (b) is decomposed into a set of note activations (d) using a dictionary consisting of spectra of piano notes (a). Each atom in the dictionary is associated with a MIDI note number. The reference note activations are given in (c). This example is an excerpt from Beethoven's Moonlight Sonata. Even though the activations are rather noisy and do not exactly match the reference, the structure of the music is much more clearly visible in the activation plot than in the spectrogram of the mixture signal.
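For concreteness, the kind of decomposition shown in Figure 7 can be sketched as follows: the note dictionary is kept fixed and only the activations are estimated, here with the standard multiplicative update for the generalized Kullback-Leibler divergence. The update rule, iteration count, and initialization are illustrative choices, not necessarily those used to produce the figure.

import numpy as np

def estimate_activations(Y, A, n_iter=200, eps=1e-12):
    # Y: (n_freq, n_frames) nonnegative mixture spectrogram
    # A: (n_freq, n_notes) fixed dictionary, one spectrum per piano note
    # Returns X: (n_notes, n_frames) nonnegative note activations with Y ~= A @ X.
    X = np.ones((A.shape[1], Y.shape[1]))
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        V = A @ X + eps                             # current model of the spectrogram
        X *= (A.T @ (Y / V)) / (A.T @ ones + eps)   # multiplicative KL-divergence update
    return X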
INFORMATION EXTRACTION USING THE COMPOSITIONAL MODEL WORKS BY ASSOCIATING EACH ATOM IN THE DICTIONARY WITH METAINFORMATION.
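Returning to the bandwidth-extension task described above, one common compositional strategy (sketched here under assumptions, not necessarily the exact method of [9] or [50]) is to use a coupled dictionary: activations are estimated from the band-limited observation and then applied to full-band versions of the same atoms.

from scipy.optimize import nnls

def extend_bandwidth(y_low, A_low, A_full):
    # y_low: band-limited magnitude spectrum of one frame
    # A_low: band-limited versions of the dictionary atoms, shape (n_low, n_atoms)
    # A_full: the same atoms over the full frequency range, shape (n_full, n_atoms)
    x, _ = nnls(A_low, y_low)   # activations estimated on the observed band only
    return A_full @ x           # the same activations reconstruct the full band

Because the activations are shared between the two views of each atom, whatever information is attached to an atom, here its missing high-frequency content, is carried over to the reconstruction.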