
IEEE SIGNAL PROCESSING MAGAZINE [126] MARCH 2015
also the flexibility to use them in ways that are nonstandard in audio processing. In this article, we show how they can be powerful tools for processing audio data, providing highly interpretable audio representations and enabling diverse applications such as signal analysis and recognition [4], [7], [8], manipulation and enhancement [9], [10], and coding [11], [12].
The basic premise underlying the application of compositional models to audio processing is that sound, too, can be viewed as being compositional in nature. The premise has intuitive appeal: sound, as we experience it, does indeed have compositional character. The sounds we hear are usually a medley of component sounds that are all concurrently present. Although a sound may mask others by its greater prominence, the sounds themselves do not generally cancel one another, except in a few cases when it is done intentionally, e.g., in adaptive noise cancellers. Even sounds produced by a single source are often compositions of component sounds from the source: the sound produced by a machine combines sounds from all of its parts, and musical sounds are compositions of notes produced by various instruments.
The compositionality of sound is also evident in time–frequency characterizations of the signal, as illustrated by Figure 1. The figure shows a spectrogram—a visual representation of the magnitude of time–frequency components as a function of time and frequency—of a signal, which comprises two notes played individually at first and then played together. The spectral patterns characteristic of the individual notes are distinctly observable even when they are played together.
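As an aside, a magnitude spectrogram like the one in Figure 1 can be computed with a short-time Fourier transform. The following is a minimal NumPy sketch; the frame length, hop size, window, and the two-sinusoid "notes" are illustrative choices, not taken from the article:

```python
import numpy as np

def magnitude_spectrogram(x, frame_len=1024, hop=256):
    """Magnitude STFT of a 1-D signal x (minimal sketch)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Keep only nonnegative frequencies. Magnitudes are nonnegative,
    # which is exactly the property compositional models rely on.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq bins, frames)

# Two sinusoidal "notes": played separately, then together,
# mimicking the structure of the recording in Figure 1.
sr = 8000
t = np.arange(sr) / sr
note_a = np.sin(2 * np.pi * 440 * t)
note_b = np.sin(2 * np.pi * 660 * t)
signal = np.concatenate([note_a, note_b, note_a + note_b])
S = magnitude_spectrogram(signal)
```

In the final third of `S`, the harmonic patterns of both notes appear superimposed, without cancelling each other.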
The compositional framework for sound analysis builds upon these impressions: it characterizes the sounds from any source as a constructive composition of atomic sounds that are characteristic of the source, and postulates that the decomposition of the signal into its atomic parts may be achieved through the application of an appropriately constrained compositional model to an appropriate time–frequency representation of the signal. This, in turn, can be used to perform several of the tasks mentioned earlier.
The models themselves may take multiple forms. Nonnegative matrix factorization (NMF) models [3], [13] treat nonnegative time–frequency representations of the signal as matrices, which are decomposed into products of nonnegative component matrices. One of the matrices represents the spectral patterns of the atomic parts, and the other represents their activations over time.
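The factorization V ≈ WH, with nonnegative W holding spectral atoms in its columns and nonnegative H their activations in its rows, can be sketched with the standard multiplicative updates for the Frobenius-norm objective [13]. The dimensions, iteration count, and random initialization below are illustrative, not from the article:

```python
import numpy as np

def nmf(V, n_atoms, n_iter=200, eps=1e-9):
    """Decompose a nonnegative matrix V (freq x time) as V ~ W @ H
    using multiplicative updates; nonnegativity is preserved because
    every update multiplies by a ratio of nonnegative quantities."""
    rng = np.random.default_rng(0)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, n_atoms)) + eps   # spectral patterns of atoms
    H = rng.random((n_atoms, n_time)) + eps   # activations over time
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy check: an exactly rank-2 nonnegative matrix is recovered closely.
rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 30))
W, H = nmf(V, n_atoms=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

In audio use, `V` would be a magnitude spectrogram and each column of `W` would ideally capture the spectrum of one atomic sound, e.g., one piano note.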
Probabilistic latent component analysis (PLCA) models treat the nonnegative time–frequency representations as histograms drawn from a mixture of multivariate multinomial random variables representing the atomic parts [14]. The two approaches can be shown to be equivalent, and even arithmetically identical under some circumstances [15].
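A sketch may make the histogram view concrete. In the symmetric form of PLCA, the normalized spectrogram is modeled as P(f, t) = Σ_z P(z) P(f|z) P(t|z) and fit by expectation–maximization; the sketch below assumes this standard formulation, with toy dimensions and a fixed iteration count chosen for illustration:

```python
import numpy as np

def plca(V, n_comp, n_iter=200, eps=1e-12):
    """EM for symmetric PLCA: P(f,t) = sum_z P(z) P(f|z) P(t|z).
    V is a nonnegative (freq x time) matrix treated as a histogram."""
    rng = np.random.default_rng(0)
    n_f, n_t = V.shape
    Pz = np.full(n_comp, 1.0 / n_comp)
    Pf = rng.random((n_f, n_comp)); Pf /= Pf.sum(0)   # spectral atoms P(f|z)
    Pt = rng.random((n_t, n_comp)); Pt /= Pt.sum(0)   # activations P(t|z)
    for _ in range(n_iter):
        # E-step: posterior P(z | f, t), shape (f, t, z)
        joint = Pz[None, None, :] * Pf[:, None, :] * Pt[None, :, :]
        post = joint / (joint.sum(-1, keepdims=True) + eps)
        # M-step: reweight the posterior by the observed counts V
        weighted = V[:, :, None] * post
        Pf = weighted.sum(1); Pf /= Pf.sum(0) + eps
        Pt = weighted.sum(0); Pt /= Pt.sum(0) + eps
        Pz = weighted.sum((0, 1)); Pz /= Pz.sum() + eps
    return Pz, Pf, Pt

# Toy check against an exactly rank-2 nonnegative "histogram."
rng = np.random.default_rng(1)
V = rng.random((8, 2)) @ rng.random((2, 10))
Pz, Pf, Pt = plca(V, n_comp=2)
recon = (Pf * Pz) @ Pt.T              # model distribution P(f, t)
target = V / V.sum()                  # normalized observed histogram
err = np.linalg.norm(recon - target) / np.linalg.norm(target)
```

Comparing this with the NMF updates makes the equivalence noted above plausible: `Pf` plays the role of W (with columns normalized to sum to one) and `Pz * Pt.T` plays the role of H.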
The purpose of this article is to serve as an introduction to the application of compositional models to the analysis of sound. We first demonstrate, through an example, the limitations of related algorithms that allow for the cancellation of parts, and show how compositional models can circumvent them. We then continue with a brief exposition of the type of time–frequency representations to which compositional models may naturally be applied.
We subsequently explain the models themselves. Two of the most common formulations of compositional models are based on matrix factorization and PLCA. For brevity, we primarily present the matrix factorization perspective, although we also introduce the PLCA model briefly for completeness.
Within these frameworks, we address various issues, including how a given sound may be decomposed into the contributions of its atomic parts, how the parts themselves may be found, restrictions of the model vis-à-vis the number and nature of these parts and of the decomposition itself, and finally how the solutions to these problems make various applications possible.
WHY CONSTRUCTIVE COMPOSITION?
Before proceeding further, it may be useful to address a question that may already have struck you. Since the models themselves are effectively matrix decompositions, what makes the compositional model, with its constraints on purely constructive composition, different from other forms of matrix decomposition such as principal component analysis (PCA), independent component analysis (ICA), or other similar methods?
The answer is illustrated in Figure 2, which shows the outcome of PCA- and ICA-based decompositions of the spectrogram
[FIG1] A magnitude spectrogram (frequency in kHz versus time in seconds) of a simple piano recording. Two notes are played in succession and then again in unison. We can visually identify these notes using their unique harmonic structure.