
may produce. A source may produce any number of distinct
spectral structures. To accommodate all of them, the dictionary
must ideally be large. When we attempt to learn large dictionar-
ies, however, we run into a mathematical restriction:
K becomes
larger than F and, as a result, in the absence of other restric-
tions, trivial solutions for A can be obtained as explained earlier.
Consequently, a learned dictionary with F or more atoms will
generally be trivial and carry little information about the actual
signal itself. Even if the dictionary is
not learned through the decomposi-
tion but specified through other
means such as through random
draws from the training data, we run
into difficulties when we attempt to
explain any spectral vector in terms
of this dictionary. In the absence of
other restrictions, the decomposition of an F × 1 spectral vector
in terms of an F × K dictionary is not unique when K ≥ F, as
explained earlier.
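As a minimal illustration of this nonuniqueness (our own sketch, not material from the article), the following snippet builds a random nonnegative dictionary with K > F and constructs two different nonnegative activation vectors that reconstruct the same spectral vector exactly; all names and sizes are hypothetical.

```python
# Minimal sketch of the nonuniqueness argument: with K >= F atoms, a single
# F x 1 spectral vector generally admits many exact nonnegative decompositions.
import numpy as np

rng = np.random.default_rng(0)
F, K = 4, 8                            # K >= F: overcomplete dictionary
A = rng.random((F, K))                 # hypothetical nonnegative dictionary
x_true = rng.random(K)                 # one valid activation vector
y = A @ x_true                         # observed spectral vector

# Perturb the activations along a null-space direction of A; a small enough
# step keeps them nonnegative while leaving the reconstruction unchanged.
null_vec = np.linalg.svd(A)[2][F]      # a direction with A @ null_vec ~ 0
alpha = 0.5 * x_true.min() / np.abs(null_vec).max()
x_alt = x_true + alpha * null_vec

print(np.allclose(A @ x_true, y), np.allclose(A @ x_alt, y))  # True True
print(np.allclose(x_true, x_alt))                             # False
```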
To overcome the nonuniqueness, additional constraints must
be applied through appropriate regularization terms. The most
common constraint that is applied is that of sparsity. Sparsity is
most commonly applied to the activations, i.e., to the columns of
the activation matrix X. Intuitively, this is equivalent to the claim
that although a source may draw from a large dictionary of atoms,
any single spectral vector will only include a small number of
these. Other commonly applied constraints are group sparsity,
which promotes sparsity over groups of atoms [40], and temporal continuity,
which promotes smooth temporal variation of activations [3].
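As a rough sketch of how such a sparsity constraint can be imposed in practice (our own illustrative code, not an implementation from the article or its references), the multiplicative updates for KL-divergence NMF can be modified with an L1 penalty on the activations; the function name and the sparsity_weight parameter are assumptions.

```python
# Sketch of KL-divergence NMF with an L1 sparsity penalty on the activations X.
# V: F x T magnitude spectrogram, A: F x K dictionary, X: K x T activations.
import numpy as np

def sparse_nmf(V, K, n_iter=200, sparsity_weight=0.1, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    F, T = V.shape
    A = rng.random((F, K)) + eps
    X = rng.random((K, T)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # activation update: the sparsity weight enters the denominator,
        # shrinking activations toward zero
        X *= (A.T @ (V / (A @ X + eps))) / (A.T @ ones + sparsity_weight)
        # dictionary update, then renormalize atoms so the sparsity penalty
        # cannot be dodged by simply rescaling A
        A *= ((V / (A @ X + eps)) @ X.T) / (ones @ X.T + eps)
        scale = A.sum(axis=0, keepdims=True) + eps
        A /= scale
        X *= scale.T
    return A, X

# e.g., A, X = sparse_nmf(np.abs(spectrogram), K=50, sparsity_weight=0.5)
```

Larger values of sparsity_weight drive more activations to zero, encoding the assumption that any single spectral frame uses only a few atoms.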
The number of atoms in the dictionary has a great impact on
the decomposition, even when the number of atoms is fewer than
F. Ideally, the number of atoms should equal the number of latent
compositional units within the signal. In certain cases, we might
know exactly what this number might be (e.g., when learning a
dictionary for a synthetic sound with a discrete number of states),
but more often this information is not available and the number
of atoms in the dictionary must be determined in other ways. A
dictionary with too few elements will be unable to adequately
explain all sounds from a given source, whereas one with too
many elements may overgeneralize and explain unintended
sounds that do not belong to that source as well, rendering it inef-
fective for most processing purposes. Although, in principle, the
Bayesian information criterion can be employed to automatically
obtain the optimal dictionary size, it is generally not as useful in
this setting [41], and more sophisticated reasoning should be
used. Sparsity can be used for automatic estimation of the number
of atoms, e.g., by initializing the dictionary with a large number of
atoms, enforcing sparsity on the activations, and reducing diction-
ary size by eliminating all atoms that exhibit consistently low acti-
vations [42]. Another approach is to make use of Bayesian
formulations that allow for model selection in a natural way. For
example, the Markov chain Monte Carlo methodology has been
applied to estimate the size of a dictionary [41], [43].
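The sparsity-based strategy can be sketched as follows (an illustrative fragment of ours, not the procedure of [42]; the threshold rule and names are assumptions): factorize with a deliberately oversized dictionary, then discard atoms whose activations remain consistently low.

```python
# Sketch: prune atoms whose total activation energy is negligible compared to
# that of the most active atom, after a sparsity-regularized factorization.
import numpy as np

def prune_dictionary(A, X, rel_threshold=0.01):
    energy = X.sum(axis=1)                        # total activation per atom
    keep = energy >= rel_threshold * energy.max()
    return A[:, keep], X[keep, :]

# e.g., start large and let sparsity decide how many atoms survive:
# A, X = sparse_nmf(V, K=200, sparsity_weight=0.5)
# A_pruned, X_pruned = prune_dictionary(A, X)
```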
In general, the trend is that larger dictionaries lead to better
representations, and consequently superior signal processing,
e.g., in terms of the separation quality [25], provided that they
are appropriately acquired. The downside of larger dictionaries
is, of course, increased computational complexity.
ANALYZING THE SEMANTICS OF SOUND
One of the fundamental goals in audio processing is the extrac-
tion of semantics from audio signals, with applications
such as music analysis, speech recognition, speaker identifica-
tion, multimedia archive access,
and audio event detection. The
source separation applications
described in previous sections are
often used as a preprocessing step
for conventional machine-learning
techniques used in audio analysis,
such as Gaussian mixture models
(GMMs) and hidden Markov models. The compositional model
itself, however, is also a powerful technique to extract meaning
from audio signals and mixtures of audio signals.
As an example, let us consider a music transcription task.
The goal is to transcribe the score of a music piece, i.e., the pitch
and duration of the sounds (notes) that are played. Even when
considering a recording in which only a single instrument, such
as a piano, is playing, this is a challenging task since multiple
notes can be played at once. Moreover, although each note is
characterized by a single fundamental frequency, its energy
may span the complete harmonic spectrum. These two aspects
make music transcription difficult for conventional methods
based on sinusoidal modeling and STFT spectrum analysis, in
which notes are associated with a single frequency band, or for
machine-learning methods, which cannot model overlapping
notes. An example using NMF is shown in Figure 7.
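To give a concrete flavor of how such a transcription can be set up (a simplified sketch of ours, not the system behind Figure 7), suppose each dictionary atom is the spectrum of one piano note; keeping the dictionary fixed and estimating only the activations yields a piano-roll-like representation that can be thresholded into note events. The function and parameter names below are hypothetical.

```python
# Sketch of NMF-based transcription with a fixed, note-labeled dictionary.
# V: F x T magnitude spectrogram; note_dictionary: F x K, one atom per pitch.
import numpy as np

def transcribe(V, note_dictionary, n_iter=200, threshold=0.1, eps=1e-12):
    A = note_dictionary / (note_dictionary.sum(axis=0, keepdims=True) + eps)
    K, T = A.shape[1], V.shape[1]
    X = np.ones((K, T))
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # KL-divergence multiplicative update with the dictionary held fixed
        X *= (A.T @ (V / (A @ X + eps))) / (A.T @ ones + eps)
    piano_roll = X > threshold * X.max()     # which pitches are active when
    return X, piano_roll
```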
Information extraction using the compositional model
works by associating each atom in the dictionary with metain-
formation, e.g., class labels indicating notes. With the observa-
tion described as a linear combination of atoms, the activation
of these atoms then serves directly as evidence for the presence
of (multiple) associated class labels. Formally, let us define a
label matrix L, a binary matrix that associates each atom in A
with one or multiple class labels. The dimensions of L are
Q × K, where Q is the total number of classes. A nonzero entry
in the qth row of L indicates that the corresponding atoms are associated with
the label q. A straightforward method for classification is to cal-
culate the label activations as
$\mathbf{g} = \mathbf{L}\mathbf{x}_t$,  (17)

with $\mathbf{x}_t = [x_1[t], x_2[t], \ldots, x_K[t]]^\top$ the atom activations of an
observation $\mathbf{y}_t$. The entries of the Q-dimensional vector $\mathbf{g}$ are
unscaled scores proportional to the presence of class labels in
the observation. An example of this procedure is given in
Figure 8, where the dictionary atoms of the source separation
example of Figure 5 are now associated with word labels.
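A toy numerical example of (17) (ours, with made-up numbers) makes the mechanics explicit: each row of L collects the atoms belonging to one class, so multiplying by the atom activations sums the evidence per class.

```python
# Toy example of (17): label activations g = L @ x_t.
import numpy as np

L = np.array([[1, 1, 0, 0, 0, 0],      # class 0 owns atoms 0-1
              [0, 0, 1, 1, 0, 0],      # class 1 owns atoms 2-3
              [0, 0, 0, 0, 1, 1]])     # class 2 owns atoms 4-5
x_t = np.array([0.8, 0.4, 0.0, 0.1, 0.0, 0.0])   # hypothetical atom activations

g = L @ x_t            # unscaled evidence per class: [1.2, 0.1, 0.0]
label = g.argmax()     # a simple single-label decision; here class 0
```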
The formulation (17) is closely related to several other tech-
niques such as
k-nearest neighbor (k-NN) classification. When