
IEEE SIGNAL PROCESSING MAGAZINE [141] MARCH 2015
high-frequency resolution feature representation is used, is that a
distinct atom is required for representing different pitches of a
sound. However, both in speech and music signal processing, the
sources to be modeled will be composed of spectra corresponding
to different pitch values. If a logarithmic frequency resolution is
used, a translation of a spectrum corresponds to a change in its
fundamental frequency. Thus, by shifting the entries of a har-
monic atom we can model different fundamental frequencies. In
the framework of compositional models, we typically do not constrain
ourselves to a single pitch shift, but define a set of allowed shifts
$L$ and estimate an activation $x_{k,\tau}[t]$ for each of the shifts in each
frame. The model can be expressed as
$$y_{f,t} = \sum_{k=1}^{K} \sum_{\tau \in L} a_{f+\tau,k}\, x_{k,\tau}[t]. \qquad (23)$$
Above, $a_{f+\tau,k}$ is the spectrum of the $k$th atom at frequency $f$,
shifted by $\tau$ frequency bins.
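The claim that a translation on a logarithmic frequency axis corresponds to a change of fundamental frequency can be checked with a short calculation. The mapping below, with $B$ bins per octave and a reference frequency $\nu_{\mathrm{ref}}$, is an assumed parameterization chosen for illustration; the article does not specify one.

```latex
\begin{align}
  % Assumed log-frequency index of a physical frequency \nu
  f(\nu) &= B \log_2\!\left(\frac{\nu}{\nu_{\mathrm{ref}}}\right), \\
  % Index of the h-th harmonic of a tone with fundamental \nu_0
  f(h\nu_0) &= B \log_2 h + B \log_2\!\left(\frac{\nu_0}{\nu_{\mathrm{ref}}}\right), \\
  % Changing the fundamental from \nu_0 to \nu_0' moves every harmonic equally
  f(h\nu_0') - f(h\nu_0) &= B \log_2\!\left(\frac{\nu_0'}{\nu_0}\right)
  \quad \text{for all } h .
\end{align}
```

Because the offset does not depend on the harmonic number $h$, the whole harmonic pattern moves rigidly along the axis. Under this assumed mapping, a semitone (frequency ratio $2^{1/12}$) corresponds to a shift of $B/12$ bins.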
Figure 13 illustrates this model by using a single component
to represent multiple pitches. The parameters of the model can
again be estimated using the aforementioned principles. The
plots illustrate that the resulting activations nicely represent the
activity of different pitches, which can be useful in music and
speech processing.
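As a rough illustration of the model in (23), the following NumPy sketch evaluates the double sum for given atoms and activations. The function name `shifted_model`, the array layouts, and the choice to treat atom entries outside the frequency range as zero are assumptions made for this example, not details from the article. It covers only the synthesis side of the model; parameter estimation is not shown.

```python
import numpy as np

def shifted_model(atoms, activations, shifts):
    """Evaluate y[f, t] = sum_k sum_tau atoms[f + tau, k] * activations[k, i, t].

    atoms:       (F, K) array, one spectrum per column (hypothetical layout)
    activations: (K, len(shifts), T) array, one gain per atom, shift, and frame
    shifts:      sequence of integer shifts tau, in frequency bins
    Atom entries that would fall outside 0..F-1 are treated as zero (assumption).
    """
    F, K = atoms.shape
    T = activations.shape[2]
    y = np.zeros((F, T))
    for i, tau in enumerate(shifts):
        lo, hi = max(0, -tau), min(F, F - tau)
        if hi <= lo:
            continue  # shift moves the atom entirely out of range
        shifted = np.zeros_like(atoms)
        # shifted[f, k] = atoms[f + tau, k] wherever f + tau is a valid bin
        shifted[lo:hi] = atoms[lo + tau:hi + tau]
        y += shifted @ activations[:, i, :]
    return y

# Toy example: one atom with a single peak at bin 3 and three allowed shifts.
a = np.zeros((6, 1))
a[3, 0] = 1.0
x = np.zeros((1, 3, 2))
x[0, 1, 0] = 2.0  # shift tau = 1 active in frame 0
x[0, 2, 1] = 1.0  # shift tau = 2 active in frame 1
y = shifted_model(a, x, shifts=[0, 1, 2])
print(y[:, 0])  # peak of height 2 at bin 2 (= 3 - 1)
print(y[:, 1])  # peak of height 1 at bin 1 (= 3 - 2)
```

With the index convention $a_{f+\tau,k}$ of (23), a positive $\tau$ moves the atom toward lower bins; a different sign convention would only relabel the shifts.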
MULTICHANNEL TENSOR FACTORIZATION
When multichannel audio recordings are to be processed, tensor
factorization of their spectrograms [62] has been found to be
effective in taking advantage of the spatial properties of sources. In
this framework, a spectrogram representation of each of the chan-
nels is calculated similarly to one-channel representations. The
two-dimensional spectrogram matrices $\mathbf{Y}_c$ of each channel $c$ are
concatenated to form a three-dimensional tensor $\mathbf{Y}$, whose
entries are indexed as $Y_{f,t,c}$, i.e., by frequency, time, and channel.
The basic tensor factorization model extends one-channel
models by associating each atom with a channel gain $g_{k,c}$, which
describes the amplitude of the $k$th atom in the $c$th channel.
The tensor factorization model is given as
$$Y_{f,t,c} \approx \sum_{k=1}^{K} a_{f,k}\, g_{k,c}\, x_{k}[t]. \qquad (24)$$
The model is equivalent to parallel factor analysis (PARAFAC)
or canonical polyadic decomposition [63], with the exception
that all the parameters of the model are constrained to non-
negative values. Figure 14 illustrates the model.
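The sum in (24) maps onto a single tensor contraction. The sketch below evaluates the model for given nonnegative factors; the names `tensor_model`, `A`, `G`, and `X`, the array shapes, and the random values standing in for estimated parameters are all assumptions made for illustration.

```python
import numpy as np

def tensor_model(A, G, X):
    """Evaluate Y_hat[f, t, c] = sum_k A[f, k] * G[k, c] * X[k, t].

    A: (F, K) atom spectra, G: (K, C) channel gains, X: (K, T) activations.
    """
    return np.einsum("fk,kc,kt->ftc", A, G, X)

# Random nonnegative factors as placeholders for estimated parameters.
rng = np.random.default_rng(0)
F, T, C, K = 5, 4, 2, 3
A = rng.random((F, K))
G = rng.random((K, C))
X = rng.random((K, T))
Y_hat = tensor_model(A, G, X)

# Each channel slice is a one-channel model whose atoms are scaled by that
# channel's gains, which is how the tensor model ties the channels together.
channel0 = (A * G[:, 0]) @ X
print(np.allclose(Y_hat[:, :, 0], channel0))
```

All channels share the same atoms and activations and differ only through the gains, so level differences between channels are what distinguish the components spatially.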
In comparison to one-channel modeling, the tensor model is
most effective in scenarios where the amplitudes of individual
sources are different in each channel. The level differences
depend on the way the signals are produced. For example, in
commercially produced music, especially in music produced
[Figure 13 panels: (a) Observed Spectrogram Y; (b) Activations x_{1,τ}[t]; (c) Atom a_1. Axes: Time (s), Frequency Shift τ, Log-Frequency Index f, Amplitude.]
[FIG13] An illustration of NMD in frequency. (a) A spectrogram of a violin passage with a logarithmic frequency resolution has been
decomposed into (c) a weighted sum of shifted versions of a single harmonic atom vector. (b) The activations for each pitch shift and
each frame are illustrated. The model allows for representing notes of different pitches with a single harmonic spectrum that is shifted
in frequency.