
IEEE SIGNAL PROCESSING MAGAZINE [141] MARCH 2015

high-frequency resolution feature representation is used, is that a

distinct atom is required for representing different pitches of a

sound. However, both in speech and music signal processing, the

sources to be modeled will be composed of spectra corresponding

to different pitch values. If a logarithmic frequency resolution is

used, a translation of a spectrum corresponds to a change in its

fundamental frequency. Thus, by shifting the entries of a harmonic atom, we can model different fundamental frequencies. In the framework of compositional models, we typically do not constrain ourselves to a single pitch shift, but define a set of allowed shifts $L$ and estimate an activation $x_{k,\tau}[t]$ for each of the shifts in each frame. The model can be expressed as

$$y_{f,t} \approx \sum_{k} \sum_{\tau \in L} a_{f+\tau,k}\, x_{k,\tau}[t]. \tag{23}$$

Above, $a_{f+\tau,k}$ is the spectrum of the $k$th atom at frequency $f$, shifted by $\tau$ frequency bins.
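As a rough illustration of the shifted-atom model, the following NumPy sketch evaluates the double sum over atoms and allowed shifts. The function name `nmd_frequency_model`, the array layouts, and the zero padding at the edges of the frequency axis are assumptions made for this example, not part of the article. It also assumes the atom is indexed as $a_{f+\tau,k}$; for the opposite sign convention, negate the shifts.

```python
import numpy as np

def nmd_frequency_model(atoms, activations, shifts):
    """Evaluate y[f, t] = sum_k sum_tau a[f + tau, k] * x[k, tau][t].

    atoms:       (F, K) nonnegative atom spectra on a log-frequency axis
    activations: (K, len(shifts), T) nonnegative activations x_{k,tau}[t]
    shifts:      sequence of allowed integer shifts (the set L)
    Atom entries indexed outside 0..F-1 are treated as zero (assumption).
    """
    F, K = atoms.shape
    T = activations.shape[2]
    y = np.zeros((F, T))
    for i, tau in enumerate(shifts):
        shifted = np.zeros_like(atoms)
        # shifted[f, k] = atoms[f + tau, k] wherever f + tau is in range
        lo = max(0, -tau)
        hi = min(F, F - tau)
        if lo < hi:
            shifted[lo:hi] = atoms[lo + tau:hi + tau]
        # Sum over atoms k for this shift and accumulate
        y += shifted @ activations[:, i, :]
    return y

# Toy example: one atom with a single peak, two allowed shifts
atoms = np.zeros((8, 1))
atoms[5, 0] = 1.0
x = np.zeros((1, 2, 3))
x[0, 0, 0] = 1.0   # shift 0 active in frame 0
x[0, 1, 1] = 2.0   # shift 2 active in frame 1
y = nmd_frequency_model(atoms, x, shifts=[0, 2])
```

In this toy case the same atom explains a peak at bin 5 in the first frame and at bin 3 in the second, so only the activations change between the two "pitches".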

Figure 13 illustrates this model by using a single component

to represent multiple pitches. The parameters of the model can

again be estimated using the aforementioned principles. The

plots illustrate that the resulting activations nicely represent the

activity of different pitches, which can be useful in music and

speech processing.

MULTICHANNEL TENSOR FACTORIZATION

When multichannel audio recordings are to be processed, tensor

factorization of their spectrograms [62] has been found to be

effective in taking advantage of the spatial properties of sources. In

this framework, a spectrogram representation of each of the channels is calculated similarly to one-channel representations. The two-dimensional spectrogram matrices of each channel $c$ are concatenated to form a three-dimensional tensor $\mathbf{Y}$, whose entries are indexed as $Y_{f,t,c}$, i.e., by frequency, time, and channel.
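A minimal sketch of how such a tensor could be assembled, assuming each channel's spectrogram is already available as a nonnegative array of shape (frequency, time). The helper name `stack_channels` is hypothetical.

```python
import numpy as np

def stack_channels(channel_spectrograms):
    """Stack per-channel (F, T) spectrograms into an (F, T, C) tensor."""
    return np.stack(channel_spectrograms, axis=-1)

# Toy example: two channels, 4 frequency bins, 3 frames
left = np.ones((4, 3))
right = 0.5 * np.ones((4, 3))
Y = stack_channels([left, right])
# Y[f, t, c] is the entry at frequency f, time t, channel c
```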

The basic tensor factorization model extends one-channel models by associating each atom with a channel gain $g_{k,c}$, which describes the amplitude of the $k$th atom in the $c$th channel. The tensor factorization model is given as

$$Y_{f,t,c} \approx \sum_{k} a_{f,k}\, g_{k,c}\, x_{k}[t]. \tag{24}$$

The model is equivalent to parallel factor analysis (PARAFAC) or canonical polyadic decomposition [63], with the exception that all the parameters of the model are constrained to nonnegative values. Figure 14 illustrates the model.
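The tensor model can be sketched in a few lines of NumPy. This only evaluates the model for given nonnegative factors and does not estimate them. The function name `tensor_model` and the array layouts are assumptions made for illustration.

```python
import numpy as np

def tensor_model(A, G, X):
    """Y_hat[f, t, c] = sum_k A[f, k] * G[k, c] * X[k, t].

    A: (F, K) atom spectra
    G: (K, C) channel gains
    X: (K, T) activations
    """
    return np.einsum("fk,kc,kt->ftc", A, G, X)

rng = np.random.default_rng(0)
A = rng.random((6, 2))   # atom spectra a_{f,k}
G = rng.random((2, 3))   # channel gains g_{k,c}
X = rng.random((2, 5))   # activations x_k[t]
Y_hat = tensor_model(A, G, X)
# Each channel slice is a one-channel model whose activations
# are scaled by that channel's gains.
```

Seen this way, the channels share the same atoms and activations and differ only in the per-atom gains, which is why level differences between channels help to tell sources apart.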

In comparison to one-channel modeling, the tensor model is

most effective in scenarios where the amplitudes of individual

sources are different in each channel. The level differences

depend on the way the signals are produced. For example, in

commercially produced music, especially in music produced

[Figure 13 panels: (a) Observed Spectrogram Y; (b) Activations $x_{1,\tau}[t]$; (c) Atom a. Axis labels: Time (s), Frequency Shift τ, Log-Frequency Index f, Amplitude.]

[FIG13] An illustration of NMD in frequency. (a) A spectrogram of a violin passage with a logarithmic frequency resolution has been

decomposed into (c) a weighted sum of shifted versions of a single harmonic atom vector. (b) The activations for each pitch shift and

each frame are illustrated. The model allows for representing notes of different pitches with a single harmonic spectrum that is shifted

in frequency.
