IEEE SIGNAL PROCESSING MAGAZINE [128] MARCH 2015
known that the human auditory system effectively acts as a filter
bank [16] and that the amplitude of a signal is encoded by the
nonnegative number of the firings of neurons [17] (even though
neurons encode amplitudes in a nonlinear manner). Thus, the
signal representation used in com-
positional models has some similar-
ities to the representation used in
the human auditory system. For
simplicity, the specific filter bank
analysis we will use is the short-time
Fourier transform (STFT), although
other forms of time–frequency representations may also be used,
some of which we will invoke later in the article. More specifically,
we will work with the magnitude of these representations, i.e.,
with $|Y[t,f]|$.
There are three main reasons for using compositional models
on magnitudes of time–frequency representations. First, the
purely constructive composition required by the compositional
framework also requires the representations to be nonnegative.
Second, the phase spectra (and therefore also the
time-domain signals) of natural sounds are rather stochastic
and therefore difficult to model, whereas the magnitude spectra
are much more deterministic and can be modeled with a simple
linear model. Third, the squared magnitude of time–frequency
components of the signal represents the power in the various
frequency bands. As mentioned earlier, theory dictates that the
power in the sum of uncorrelated signals is the sum of the
power in the component signals. Hence, the power in a signal
composed from uncorrelated atomic units will be the sum of
the power in the units. In practice, however, the time–frequency
components of the signal are estimated from short-duration
windows in which the above relationship does not hold exactly.
Also, more than one component may be used to represent a sin-
gle source, in which case the phases of the components are
coherent. It has been empirically observed that the optimal
magnitude exponent depends on the task at hand and how the
performance is measured [18].
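The additivity of power, and its breakdown over short windows, can be illustrated numerically. The following is a minimal sketch, not taken from the article: it assumes two independent white Gaussian noise signals as the "uncorrelated" components, works in the time domain rather than on time–frequency components, and uses an arbitrary window of 64 samples. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
a = rng.standard_normal(n)        # first component (assumed white noise)
b = 0.5 * rng.standard_normal(n)  # second, independent component


def power(x, axis=None):
    """Mean squared value of x along the given axis."""
    return np.mean(x ** 2, axis=axis)


# Long-term estimate: power of the sum vs. sum of the powers.
long_ratio = power(a + b) / (power(a) + power(b))

# Short-window estimates: the same ratio, computed per 64-sample window.
win = 64
fa = a.reshape(-1, win)
fb = b.reshape(-1, win)
short_ratios = power(fa + fb, axis=1) / (power(fa, axis=1) + power(fb, axis=1))
short_spread = short_ratios.std()

print(f"long-term ratio: {long_ratio:.4f}")
print(f"short-window ratio spread (std): {short_spread:.4f}")
```

Over the full signal the ratio stays very close to one, whereas within individual short windows the cross term between the two components does not average out, so the ratio scatters around one.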
The original signal cannot be recovered directly from the mag-
nitudes of the filter bank output alone; the phase is also required.
This presents a problem since we often would like to reconstruct
the signal from the output of the compositional analysis. For exam-
ple, when a compositional model is used to separate out sources
from a mixed signal, it is often desired to recover the time–domain
signal from the separated time–frequency characterizations, which
comprise only magnitude terms. The missing phase terms must be
obtained through other means. As will be explained in the section
“Source Separation,” this can be accomplished, e.g., by using the
phase of the mixed signal. Thus, compositional models do not,
strictly speaking, perform signal separation but separation using a
midlevel representation that allows separating latent parts of a mix-
ture. Nevertheless, the separated midlevel representation, together
with mixture phases, allows for reconstruction of signals that are
close to source signals before mixing.
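Reconstruction with the mixture phase can be sketched as follows. This is an illustration under stated assumptions, not the article's implementation: the STFT uses a square-root Hann window with 50% overlap, the sources are two synthetic sinusoids, and the "separated" magnitudes are simply taken from the clean source as a stand-in for the output of a compositional model. The helper names `stft` and `istft` are hypothetical and defined here.

```python
import numpy as np

N_FFT, HOP = 512, 256  # assumed frame length and hop (50% overlap)


def _window():
    # Periodic square-root Hann: analysis * synthesis windows overlap-add to 1.
    return np.sqrt(np.hanning(N_FFT + 1)[:-1])


def stft(x):
    """Complex STFT with shape (frequency bins, frames)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack(
        [x[i * HOP: i * HOP + N_FFT] * _window() for i in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)


def istft(Z):
    """Overlap-add inverse of stft(); exact away from the signal edges."""
    frames = np.fft.irfft(Z, n=N_FFT, axis=0) * _window()[:, None]
    x = np.zeros(HOP * (Z.shape[1] - 1) + N_FFT)
    for i in range(Z.shape[1]):
        x[i * HOP: i * HOP + N_FFT] += frames[:, i]
    return x


fs = 16000
t = np.arange(fs) / fs
source = np.sin(2 * np.pi * 440 * t)             # source to recover
interferer = 0.5 * np.sin(2 * np.pi * 3000 * t)  # competing source
mixture = source + interferer

# Stand-in for the separated magnitudes (assumption: oracle magnitudes).
separated_mag = np.abs(stft(source))
# The missing phase is borrowed from the mixture.
mixture_phase = np.angle(stft(mixture))
estimate = istft(separated_mag * np.exp(1j * mixture_phase))

# Compare only where overlap-add is complete (skip one hop at each edge).
interior = slice(HOP, len(estimate) - HOP)
rel_error = (np.linalg.norm(estimate[interior] - source[interior])
             / np.linalg.norm(source[interior]))
print(f"relative reconstruction error: {rel_error:.4f}")
```

In this toy case the two sources occupy well-separated frequency bands, so the mixture phase is close to the source phase wherever the source has significant energy. With overlapping sources the reconstruction is expected to be less accurate.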
An important consideration in deriving time–frequency repre-
sentations is that of time- and frequency-analysis resolution.
Time–frequency representations have a fundamental limitation:
the bandwidth, $\Delta F$, of the filters, representing the minimum
difference in frequencies that can be resolved, is inversely
proportional to the time resolution, $\Delta T$, which represents the minimum
distance in time between two seg-
ments of the signal that can be dis-
tinctly resolved. In the case of the
STFT, in particular, $\Delta F$ is inversely
proportional to the length in samples
of the analysis window employed.
Increasing the length of the analysis
window increases the frequency resolution, but decreases the time
resolution. Low time resolution analysis may result in the tempo-
ral blurring of rapidly changing events, such as those that occur in
speech. On the other hand, low frequency resolution can result in
the obscuring of frequency structures in signals such as music.
Hence, the optimal time/frequency resolution will depend on the
type of the signals we wish to analyze. For instance, music pro-
cessing typically requires longer analysis frames (up to 100 ms),
whereas speech processing typically applies shorter windows (tens
of milliseconds).
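The tradeoff can be made concrete with a small calculation. The sketch below assumes a 16-kHz sampling rate and approximates the frequency resolution as the sampling rate divided by the window length in samples, ignoring the constant factor that depends on the window shape. The function name and the example window lengths are illustrative choices, not values prescribed by the article.

```python
def stft_resolution(window_ms, fs):
    """Return (window length in samples, time span in s, approx. bandwidth in Hz)."""
    n = int(round(window_ms * 1e-3 * fs))
    delta_t = n / fs   # time span covered by one analysis window
    delta_f = fs / n   # approximate spacing of resolvable frequencies
    return n, delta_t, delta_f


fs = 16000  # assumed sampling rate in Hz
for window_ms in (25, 100):
    n, delta_t, delta_f = stft_resolution(window_ms, fs)
    print(f"{window_ms:>4} ms window: {n} samples, "
          f"delta_t = {delta_t * 1e3:.0f} ms, delta_f = {delta_f:.1f} Hz")
```

Under this approximation the product of the two resolutions is constant, so quadrupling the window length sharpens the frequency resolution by a factor of four at the cost of a fourfold coarser time resolution.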
COMPOSITIONAL MODELING OF AUDIO
In the following, we represent the magnitude spectrogram (which
we will simply refer to as a spectrogram for brevity) as a matrix
$\mathbf{Y} \in \mathbb{R}_+^{F \times T}$ comprising magnitudes of time–frequency components
$Y[f,t]$.
Here, $\mathbb{R}_+$ represents the set of nonnegative real numbers.
Each column of the matrix $\mathbf{Y}$ is an $F$-component (magnitude)
spectral vector $\mathbf{y}_t \in \mathbb{R}_+^{F \times 1}$, representing the magnitude
spectrum of one slice or frame of the signal.
In alternate representation variants that attempt to explicitly
capture the temporal dynamics of signals, a single column,
$\mathbf{y}_t$, may represent multiple adjacent spectra concatenated into a
single vector [4]. In such cases, $\mathbf{Y} \in \mathbb{R}_+^{LF \times T}$, where $L$ is the
number of frames that are concatenated together.
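One way to build such a stacked representation is sketched below. It is a minimal illustration: the function name is hypothetical, and it assumes no padding at the end of the signal, so the result has T - L + 1 columns rather than T (padding the spectrogram with L - 1 extra frames would keep T columns).

```python
import numpy as np


def stack_frames(Y, L):
    """Stack L adjacent columns of an (F, T) spectrogram into (L*F, T-L+1)."""
    F, T = Y.shape
    n_cols = T - L + 1
    # Block l holds the spectra that lie l frames after each starting frame.
    return np.concatenate([Y[:, l:l + n_cols] for l in range(L)], axis=0)


Y = np.arange(12, dtype=float).reshape(3, 4)  # toy spectrogram, F = 3, T = 4
Y_stacked = stack_frames(Y, L=2)
print(Y_stacked.shape)   # (6, 3)
print(Y_stacked[:, 0])   # frame 0 followed by frame 1
```

Each column of the stacked matrix is then a single nonnegative vector covering L consecutive frames, so it can be treated like an ordinary spectral vector by the model.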
The compositional model represents the spectrogram as a non-
negative (purely constructive) linear combination of the contribu-
tions of atomic units (which we will simply refer to as atoms
throughout). In its simplest form, the atomic units themselves are
spectral vectors, representing steady-state sounds, and every spec-
tral vector in the spectrogram can be decomposed into a non-
negative linear combination of these atoms. We describe two
formalisms to achieve this decomposition.
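As a toy illustration of such a purely constructive combination, the sketch below composes one spectral vector from two atoms. The atom values and weights are made up for the example, and the variable names are hypothetical.

```python
import numpy as np

# Two hypothetical atoms (columns), each a spectrum over F = 3 frequency bins.
atoms = np.array([[1.0, 0.0],
                  [0.5, 0.2],
                  [0.0, 1.0]])
weights = np.array([2.0, 0.5])  # nonnegative weight of each atom

# Purely constructive composition: atoms are only added, never subtracted.
spectral_vector = atoms @ weights
print(spectral_vector)
```

Because both the atoms and the weights are nonnegative, no atom can cancel the contribution of another, and the composed vector is itself nonnegative.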
COMPOSITIONAL MODELS AS MATRIX FACTORIZATION
The matrix factorization approach to compositional modeling
treats the problem of decomposing a spectrogram into its
atomic units as nonnegative matrix decomposition.
Let $\mathbf{a}_k$ denote an atom, which is a spectral vector in
this context. In the matrix factorization approach, we will
represent the atoms as column vectors, i.e., $\mathbf{a}_k \in \mathbb{R}_+^{F \times 1}$. The atoms
are indexed by $k = 1, \ldots, K$, where $K$ is the total number of
atoms. Each spectral vector $\mathbf{y}_t$ is composed from all the atoms as
$\mathbf{y}_t = \sum_{k=1}^{K} \mathbf{a}_k x_k[t]$, where $x_k[t]$ is the nonnegative activation of
the $k$th atom in frame $t$. Thus, the spectrogram is modeled as