IEEE SIGNAL PROCESSING MAGAZINE [128] MARCH 2015

known that the human auditory system effectively acts as a filter bank [16] and that the amplitude of a signal is encoded by the nonnegative number of the firings of neurons [17] (even though neurons encode amplitudes in a nonlinear manner). Thus, the signal representation used in compositional models has some similarities to the representation used in the human auditory system. For simplicity, the specific filter bank analysis we will use is the short-time Fourier transform (STFT), although other forms of time–frequency representations may also be used, some of which we will invoke later in the article. More specifically, we will work with the magnitude of these representations, i.e., with $|Y[t, f]|$.
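
As a rough illustration of the magnitude representation described above, the following sketch computes STFT magnitudes with NumPy. The function name `magnitude_stft`, the Hann window, the frame length, the hop size, and the (frequency, time) array layout are assumptions made for this example, not choices stated in the article.

```python
import numpy as np

def magnitude_stft(signal, frame_len=512, hop=256):
    """Return an (F, T) array of STFT magnitudes, with F = frame_len // 2 + 1.

    Hypothetical helper: the window type and sizes are illustrative defaults.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping, windowed frames (one frame per column).
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    # One-sided FFT of every frame; keep only the magnitude, discard the phase.
    return np.abs(np.fft.rfft(frames, axis=0))

# Usage: one second of a 1 kHz sinusoid sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)
Y = magnitude_stft(x)
```

Every entry of `Y` is nonnegative by construction, which is the property the compositional framework relies on.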

There are three main reasons for using compositional models on magnitudes of time–frequency representations. First, the purely constructive composition at the heart of the compositional framework requires the representations to be nonnegative. Second, the phase spectra (and therefore also the time-domain signals) of natural sounds are rather stochastic and therefore difficult to model, whereas the magnitude spectra are much more deterministic and can be modeled with a simple linear model. Third, the squared magnitude of the time–frequency components of the signal represents the power in the various frequency bands. As mentioned earlier, theory dictates that the power in the sum of uncorrelated signals is the sum of the power in the component signals. Hence, the power in a signal composed from uncorrelated atomic units will be the sum of the power in the units. In practice, however, the time–frequency components of the signal are estimated from short-duration windows in which the above relationship does not hold exactly. Also, more than one component may be used to represent a single source, in which case the phases of the components are coherent and their powers do not simply add. The squared magnitude is therefore not always the best choice: it has been empirically observed that the optimal magnitude exponent depends on the task at hand and how the performance is measured [18].
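
The additivity argument, and the way it breaks down for coherent components, can be checked numerically. The sketch below uses synthetic Gaussian noise as a stand-in for uncorrelated signals; the signals, the seed, and the helper `power` are illustrative assumptions rather than anything taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a = rng.standard_normal(n)   # first synthetic signal
b = rng.standard_normal(n)   # second signal, drawn independently of the first

def power(x):
    """Average power of a real-valued signal."""
    return np.mean(x ** 2)

# Uncorrelated case: the cross term 2 * mean(a * b) is close to zero,
# so the power of the sum is close to the sum of the powers.
uncorrelated_sum = power(a + b)
sum_of_powers = power(a) + power(b)

# Coherent case: adding a signal to itself doubles the amplitude,
# so the power is four times, not two times, the original power.
coherent_sum = power(a + a)
```

With short analysis windows the cross term is averaged over far fewer samples than here, which is one way to see why the relationship holds only approximately in practice.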

The original signal cannot be recovered from the magnitudes of the filter bank output alone; the phase is also required. This presents a problem, since we would often like to reconstruct the signal from the output of the compositional analysis. For example, when a compositional model is used to separate sources from a mixed signal, it is often desirable to recover the time-domain signal from the separated time–frequency characterizations, which comprise only magnitude terms. The missing phase terms must be obtained through other means. As will be explained in the section “Source Separation,” this can be accomplished, e.g., by using the phase of the mixed signal. Thus, compositional models do not, strictly speaking, perform signal separation; rather, they perform separation in a midlevel representation that allows separating latent parts of a mixture. Nevertheless, the separated midlevel representation, together with the mixture phases, allows for reconstruction of signals that are close to the source signals before mixing.
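
A minimal sketch of borrowing the mixture phase is shown below. It only combines an estimated source magnitude with the phase of a complex mixture STFT; the function name `apply_mixture_phase` and the small arrays are hypothetical, and the inverse STFT that would turn the result into a time-domain signal is left out.

```python
import numpy as np

def apply_mixture_phase(source_magnitude, mixture_stft):
    """Attach the phase of the complex mixture STFT to an estimated magnitude.

    Hypothetical helper: both arguments are arrays of the same shape.
    """
    unit_phase = np.exp(1j * np.angle(mixture_stft))   # unit-modulus phase factors
    return source_magnitude * unit_phase

# Toy values standing in for a mixture STFT and a separated magnitude estimate.
mixture_stft = np.array([[1 + 1j, -2j],
                         [3 + 0j, -1 - 1j]])
source_magnitude = np.array([[0.5, 1.0],
                             [2.0, 0.0]])

source_stft = apply_mixture_phase(source_magnitude, mixture_stft)
```

The result keeps the estimated magnitudes exactly and inherits the mixture phase, so it is only an approximation of the true source STFT.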

An important consideration in deriving time–frequency representations is that of time- and frequency-analysis resolution. Time–frequency representations have a fundamental limitation: the bandwidth, $\Delta F$, of the filters, which represents the minimum difference in frequencies that can be resolved, is inversely proportional to the time resolution, $\Delta T$, which represents the minimum distance in time between two segments of the signal that can be distinctly resolved. In the case of the STFT, in particular, $\Delta F$ is inversely proportional to the length in samples of the analysis window employed. Increasing the length of the analysis window increases the frequency resolution but decreases the time resolution. Low time resolution analysis may result in the temporal blurring of rapidly changing events, such as those that occur in speech. On the other hand, low frequency resolution can obscure frequency structures in signals such as music. Hence, the optimal time/frequency resolution will depend on the type of signals we wish to analyze. For instance, music processing typically requires longer analysis frames (up to 100 ms), whereas speech processing typically applies shorter windows (tens of milliseconds).
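
The tradeoff can be made concrete with a little arithmetic. The sketch below takes the window duration as the time resolution and the DFT bin spacing as the frequency resolution; this is a simplifying assumption, since the true resolvable bandwidth is wider by a factor that depends on the window shape. The function name `stft_resolution` and the 16 kHz sample rate are illustrative.

```python
def stft_resolution(window_len, sample_rate):
    """Return (delta_t in seconds, delta_f in Hz) for a window of window_len samples.

    Assumes delta_t is the window duration and delta_f is the DFT bin spacing.
    """
    delta_t = window_len / sample_rate
    delta_f = sample_rate / window_len
    return delta_t, delta_f

# Compare a short and a long analysis window at an assumed 16 kHz sample rate.
for window_len in (256, 1024):
    delta_t, delta_f = stft_resolution(window_len, 16000)
    print(f"{window_len} samples: {delta_t * 1000:.1f} ms, {delta_f:.3f} Hz")
```

Under these assumptions the product of the two resolutions is always 1, so quadrupling the window from 256 to 1024 samples narrows the bin spacing from 62.5 Hz to 15.625 Hz while lengthening the frame from 16 ms to 64 ms.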

COMPOSITIONAL MODELING OF AUDIO

In the following, we represent the magnitude spectrogram (which we will simply refer to as a spectrogram for brevity) as a matrix $\mathbf{Y} \in \mathbb{R}_{+}^{F \times T}$ comprising magnitudes of time–frequency components $Y[f, t]$. Here, $\mathbb{R}_{+}$ represents the set of nonnegative real numbers. Each column of the matrix $\mathbf{Y}$ is an $F$-component (magnitude) spectral vector $\mathbf{y}_t \in \mathbb{R}_{+}^{F \times 1}$, representing the magnitude spectrum of one slice or frame of the signal.

In alternate representation variants that attempt to explicitly capture the temporal dynamics of signals, a single column of $\mathbf{Y}$ may represent multiple adjacent spectra concatenated into a single vector [4]. In such cases, $\mathbf{Y} \in \mathbb{R}_{+}^{LF \times T}$, where $L$ is the number of frames that are concatenated together.
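
One way to build such a stacked representation is sketched below. To keep $T$ columns, as in the dimensions given above, the sketch zero-pads the end of the spectrogram; that padding choice, the function name `stack_frames`, and the toy matrix are assumptions of this example and are not specified in the article.

```python
import numpy as np

def stack_frames(Y, L):
    """Stack L adjacent columns of an (F, T) spectrogram into an (L * F, T) matrix.

    Hypothetical helper: the last L - 1 columns are completed with zeros.
    """
    F, T = Y.shape
    padded = np.concatenate([Y, np.zeros((F, L - 1))], axis=1)
    # Column t of the result holds frames t, t + 1, ..., t + L - 1, top to bottom.
    return np.concatenate([padded[:, i : i + T] for i in range(L)], axis=0)

# Toy spectrogram with F = 3 frequency bins and T = 4 frames.
Y = np.arange(12, dtype=float).reshape(3, 4)
S = stack_frames(Y, L=2)
```

Here the first column of `S` is the first frame of `Y` followed by the second frame, so each column describes a short stretch of time rather than a single instant.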

The compositional model represents the spectrogram as a nonnegative (purely constructive) linear combination of the contributions of atomic units (which we will simply refer to as atoms throughout). In its simplest form, the atomic units themselves are spectral vectors, representing steady-state sounds, and every spectral vector in the spectrogram can be decomposed into a nonnegative linear combination of these atoms. We describe two formalisms to achieve this decomposition.
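
The idea of a purely constructive combination can be illustrated with a few numbers. In the sketch below, the two atoms, their weights, and the variable names `atoms` and `activations` are hypothetical values chosen for the example.

```python
import numpy as np

# Hypothetical atoms: two spectral vectors with four frequency bins each,
# stored as the columns of a (bins, atoms) array.
atoms = np.array([[1.0, 0.0],
                  [0.5, 0.0],
                  [0.0, 1.0],
                  [0.0, 2.0]])

# Nonnegative weights, one per atom, for a single frame.
activations = np.array([2.0, 0.5])

# The composed spectral vector is the weighted sum of the atoms.
y = atoms @ activations
```

Because both the atoms and the weights are nonnegative, atoms can only add energy to `y`; no atom can cancel the contribution of another.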

COMPOSITIONAL MODELS AS MATRIX FACTORIZATION

The matrix factorization approach to compositional modeling treats the problem of decomposing a spectrogram into its atomic units as nonnegative matrix decomposition.

Let $\mathbf{a}$ denote any atom, which in this context is a spectral vector. In the matrix factorization approach, we will represent atoms as column vectors, i.e., $\mathbf{a} \in \mathbb{R}_{+}^{F \times 1}$. The atoms are indexed by $k = 1, \ldots, K$, where $K$ is the total number of atoms. Each spectral vector $\mathbf{y}_t$ is composed from all the atoms as $\mathbf{y}_t = \sum_{k=1}^{K} \mathbf{a}_k x_k[t]$, where $x_k[t]$ is the nonnegative activation of the $k$th atom in frame $t$. Thus, the spectrogram is modeled as
