
may produce. A source may produce any number of distinct
spectral structures. To accommodate all of them, the dictionary
must ideally be large. When we attempt to learn large dictionar-
ies, however, we run into a mathematical restriction:
K becomes
larger than F and, as a result, in the absence of other restric-
tions, trivial solutions for A can be obtained as explained earlier.
Consequently, a learned dictionary with F or more atoms will
generally be trivial and carry little information about the actual
signal itself. Even if the dictionary is
not learned through the decomposi-
tion but specified through other
means such as through random
draws from the training data, we run
into difficulties when we attempt to
explain any spectral vector in terms
of this dictionary. In the absence of
other restrictions, the decomposition of an F × 1 spectral vector
in terms of an F × K dictionary is not unique when K ≥ F, as
explained earlier.
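As a minimal illustration of this nonuniqueness (our own sketch, not material from the article), the following snippet builds a random nonnegative dictionary with K > F and constructs two different nonnegative activation vectors that reconstruct the same spectral vector exactly; all names and sizes are hypothetical.

```python
# Minimal sketch of the nonuniqueness argument: with K >= F atoms, a single
# F x 1 spectral vector generally admits many exact nonnegative decompositions.
import numpy as np

rng = np.random.default_rng(0)
F, K = 4, 8                            # K >= F: overcomplete dictionary
A = rng.random((F, K))                 # hypothetical nonnegative dictionary
x_true = rng.random(K)                 # one valid activation vector
y = A @ x_true                         # observed spectral vector

# Perturb the activations along a null-space direction of A; a small enough
# step keeps them nonnegative while leaving the reconstruction unchanged.
null_vec = np.linalg.svd(A)[2][F]      # a direction with A @ null_vec ~ 0
alpha = 0.5 * x_true.min() / np.abs(null_vec).max()
x_alt = x_true + alpha * null_vec

print(np.allclose(A @ x_true, y), np.allclose(A @ x_alt, y))  # True True
print(np.allclose(x_true, x_alt))                             # False
```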
To overcome the nonuniqueness, additional constraints must
be applied through appropriate regularization terms. The most
common constraint that is applied is that of sparsity. Sparsity is
most commonly applied to the activations, i.e., to the columns of
the activation matrix X. Intuitively, this is equivalent to the claim
that although a source may draw from a large dictionary of atoms,
any single spectral vector will only include a small number of
these. Other commonly applied constraints are group sparsity,
which promotes sparsity over groups of atoms [40], and temporal continuity,
which promotes smooth temporal variation of activations [3].
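As a rough sketch of how such a sparsity constraint can be imposed in practice (our own illustrative code, not an implementation from the article or its references), the multiplicative updates for KL-divergence NMF can be modified with an L1 penalty on the activations; the function name and the sparsity_weight parameter are assumptions.

```python
# Sketch of KL-divergence NMF with an L1 sparsity penalty on the activations X.
# V: F x T magnitude spectrogram, A: F x K dictionary, X: K x T activations.
import numpy as np

def sparse_nmf(V, K, n_iter=200, sparsity_weight=0.1, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    F, T = V.shape
    A = rng.random((F, K)) + eps
    X = rng.random((K, T)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # activation update: the sparsity weight enters the denominator,
        # shrinking activations toward zero
        X *= (A.T @ (V / (A @ X + eps))) / (A.T @ ones + sparsity_weight)
        # dictionary update, then renormalize atoms so the sparsity penalty
        # cannot be dodged by simply rescaling A
        A *= ((V / (A @ X + eps)) @ X.T) / (ones @ X.T + eps)
        scale = A.sum(axis=0, keepdims=True) + eps
        A /= scale
        X *= scale.T
    return A, X

# e.g., A, X = sparse_nmf(np.abs(spectrogram), K=50, sparsity_weight=0.5)
```

Larger values of sparsity_weight drive more activations to zero, encoding the assumption that any single spectral frame uses only a few atoms.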
The number of atoms in the dictionary has a great impact on
the decomposition, even when the number of atoms is fewer than
F. Ideally, the number of atoms should equal the number of latent
compositional units within the signal. In certain cases, we might
know exactly what this number might be (e.g., when learning a
dictionary for a synthetic sound with a discrete number of states),
but more often this information is not available and the number
of atoms in the dictionary must be determined in other ways. A
dictionary with too few elements will be unable to adequately
explain all sounds from a given source, whereas one with too
many elements may overgeneralize and explain unintended
sounds that do not belong to that source as well, rendering it inef-
fective for most processing purposes. Although, in principle, the
Bayesian information criterion can be employed to automatically
obtain the optimal dictionary size, it is generally not as useful in
this setting [41], and more sophisticated reasoning should be
used. Sparsity can be used for automatic estimation of the number
of atoms, e.g., by initializing the dictionary with a large number of
atoms, enforcing sparsity on the activations, and reducing diction-
ary size by eliminating all atoms that exhibit consistently low acti-
vations [42]. Another approach is to make use of Bayesian
formulations that allow for model selection in a natural way. For
example, the Markov chain Monte Carlo methodology has been
applied to estimate the size of a dictionary [41], [43].
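The sparsity-based strategy can be sketched as follows (an illustrative fragment of ours, not the procedure of [42]; the threshold rule and names are assumptions): factorize with a deliberately oversized dictionary, then discard atoms whose activations remain consistently low.

```python
# Sketch: prune atoms whose total activation energy is negligible compared to
# that of the most active atom, after a sparsity-regularized factorization.
import numpy as np

def prune_dictionary(A, X, rel_threshold=0.01):
    energy = X.sum(axis=1)                        # total activation per atom
    keep = energy >= rel_threshold * energy.max()
    return A[:, keep], X[keep, :]

# e.g., start large and let sparsity decide how many atoms survive:
# A, X = sparse_nmf(V, K=200, sparsity_weight=0.5)
# A_pruned, X_pruned = prune_dictionary(A, X)
```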
In general, the trend is that larger dictionaries lead to better
representations, and consequently superior signal processing,
e.g., in terms of the separation quality [25], provided that they
are appropriately acquired. The downside of larger dictionaries
is, of course, increased computational complexity.
ANALYZING THE SEMANTICS OF SOUND
One of the fundamental goals in audio processing is the extrac-
tion of semantics from audio signals, with applications
such as music analysis, speech recognition, speaker identifica-
tion, multimedia archive access,
and audio event detection. The
source separation applications
described in previous sections are
often used as a preprocessing step
for conventional machine-learning
techniques used in audio analysis,
such as Gaussian mixture models
(GMMs) and hidden Markov models. The compositional model
itself, however, is also a powerful technique to extract meaning
from audio signals and mixtures of audio signals.
As an example, let us consider a music transcription task.
The goal is to transcribe the score of a music piece, i.e., the pitch
and duration of the sounds (notes) that are played. Even when
considering a recording in which only a single instrument, such
as a piano, is playing, this is a challenging task since multiple
notes can be played at once. Moreover, although each note is
characterized by a single fundamental frequency, its energy
may span the complete harmonic spectrum. These two aspects
make music transcription difficult for conventional methods
based on sinusoidal modeling and STFT spectrum analysis, in
which notes are associated with a single frequency band, or for
machine-learning methods, which cannot model overlapping
notes. An example using NMF is shown in Figure 7.
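To give a concrete flavor of how such a transcription can be set up (a simplified sketch of ours, not the system behind Figure 7), suppose each dictionary atom is the spectrum of one piano note; keeping the dictionary fixed and estimating only the activations yields a piano-roll-like representation that can be thresholded into note events. The function and parameter names below are hypothetical.

```python
# Sketch of NMF-based transcription with a fixed, note-labeled dictionary.
# V: F x T magnitude spectrogram; note_dictionary: F x K, one atom per pitch.
import numpy as np

def transcribe(V, note_dictionary, n_iter=200, threshold=0.1, eps=1e-12):
    A = note_dictionary / (note_dictionary.sum(axis=0, keepdims=True) + eps)
    K, T = A.shape[1], V.shape[1]
    X = np.ones((K, T))
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # KL-divergence multiplicative update with the dictionary held fixed
        X *= (A.T @ (V / (A @ X + eps))) / (A.T @ ones + eps)
    piano_roll = X > threshold * X.max()     # which pitches are active when
    return X, piano_roll
```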
Information extraction using the compositional model
works by associating each atom in the dictionary with metain-
formation, e.g., class labels indicating notes. With the observa-
tion described as a linear combination of atoms, the activation
of these atoms then serves directly as evidence for the presence
of (multiple) associated class labels. Formally, let us define a
label matrix L, a binary matrix that associates each atom in A
with one or multiple class labels. The dimensions of L are
Q × K, where Q is the total number of classes. A nonzero entry
in the qth row of L indicates that the corresponding atoms are associated with
the label q. A straightforward method for classification is to cal-
culate the label activations as
$\mathbf{g} = \mathbf{L}\mathbf{x}_t$,  (17)

with $\mathbf{x}_t = [x_1[t], x_2[t], \ldots, x_K[t]]^\top$ the atom activations of an
observation $\mathbf{y}_t$. The entries of the Q-dimensional vector $\mathbf{g}$ are
unscaled scores proportional to the presence of class labels in
the observation. An example of this procedure is given in
Figure 8, where the dictionary atoms of the source separation
example of Figure 5 are now associated with word labels.
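A toy numerical example of (17) (ours, with made-up numbers) makes the mechanics explicit: each row of L collects the atoms belonging to one class, so multiplying by the atom activations sums the evidence per class.

```python
# Toy example of (17): label activations g = L @ x_t.
import numpy as np

L = np.array([[1, 1, 0, 0, 0, 0],      # class 0 owns atoms 0-1
              [0, 0, 1, 1, 0, 0],      # class 1 owns atoms 2-3
              [0, 0, 0, 0, 1, 1]])     # class 2 owns atoms 4-5
x_t = np.array([0.8, 0.4, 0.0, 0.1, 0.0, 0.0])   # hypothetical atom activations

g = L @ x_t            # unscaled evidence per class: [1.2, 0.1, 0.0]
label = g.argmax()     # a simple single-label decision; here class 0
```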
The formulation (17) is closely related to several other tech-
niques such as
k-nearest neighbor (k-NN) classification. When