The various divergences scale differently with their argu-
ments. The squared error scales quadratically, meaning that
$D_{\mathrm{SQ}}(\alpha Y\,\|\,\alpha AX) = \alpha^{2} D_{\mathrm{SQ}}(Y\,\|\,AX)$; the IS divergence is scale invariant, i.e., $D_{\mathrm{IS}}(\alpha Y\,\|\,\alpha AX) = D_{\mathrm{IS}}(Y\,\|\,AX)$; while the KL divergence scales linearly: $D_{\mathrm{KL}}(\alpha Y\,\|\,\alpha AX) = \alpha D_{\mathrm{KL}}(Y\,\|\,AX)$. The relative merits of the divergences may be inferred from this property: the squared error divergence puts undue emphasis on high-energy components, and the IS divergence fails to distinguish between the noise floor and higher-energy speech components. The KL divergence provides a good compromise between the two [3], [19], [20]. A generalization of the divergences in (5) and (6) is the beta divergence [21], which defines a set of divergences that are a function of a parameter $\beta$.
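To make the scaling properties concrete, the following minimal Python/NumPy sketch computes the squared error, generalized KL, and IS divergences between two nonnegative arrays and checks how each scales when both arguments are multiplied by a constant. The function names and the small constant EPS are our own choices, and the KL and IS forms used here are the ones commonly employed in NMF, which we assume correspond to (5) and (6).

```python
import numpy as np

EPS = 1e-12  # guard against division by zero and log(0)

def d_sq(Y, Yhat):
    """Squared error divergence, summed over all time-frequency points."""
    return np.sum((Y - Yhat) ** 2)

def d_kl(Y, Yhat):
    """Generalized KL divergence, summed over all time-frequency points."""
    return np.sum(Y * np.log((Y + EPS) / (Yhat + EPS)) - Y + Yhat)

def d_is(Y, Yhat):
    """Itakura-Saito divergence, summed over all time-frequency points."""
    ratio = (Y + EPS) / (Yhat + EPS)
    return np.sum(ratio - np.log(ratio) - 1)

# Scaling both arguments by alpha reveals the different behaviors.
rng = np.random.default_rng(0)
Y, Yhat, alpha = rng.random((5, 4)), rng.random((5, 4)), 3.0
print(d_sq(alpha * Y, alpha * Yhat) / d_sq(Y, Yhat))  # ~alpha**2 (quadratic)
print(d_kl(alpha * Y, alpha * Yhat) / d_kl(Y, Yhat))  # ~alpha    (linear)
print(d_is(alpha * Y, alpha * Yhat) / d_is(Y, Yhat))  # ~1        (scale invariant)
```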
The divergences in (5) and (6) (KL, IS, or squared) can be
obtained from maximum likelihood estimation of the parameters,
when observed data is generated by a specific generative model
(Poisson distribution, multiplicative Gamma noise, or additive
Gaussian noise) independently at each time–frequency point [13].
Even though some of these models (e.g., the Poisson distribution)
do not match well with the distribution of natural sounds, the stat-
istical interpretation allows incorporating prior distributions for
the parameters.
The squared error and KL divergences are convex as a function of $\hat{y}$, and for these, the divergence $D(Y\,\|\,\hat{Y})$ is also convex in $\hat{Y}$. In this case, the optimization problem of (4) and its counterpart, where X is specified and A must be estimated, minimize a convex function and can be solved by any convex optimization technique.
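For example, under the squared error divergence with A fixed, each column of X can be estimated with nonnegative least squares. A minimal sketch using SciPy follows; the matrix sizes and random data are invented purely for illustration.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
A = np.abs(rng.standard_normal((20, 3)))   # fixed atoms: 20 frequency bins, 3 atoms
Y = np.abs(rng.standard_normal((20, 10)))  # observed magnitude spectrogram: 10 frames

# With A fixed, the squared error objective is convex in X: solve each frame
# min_x ||y_t - A x_t||^2 subject to x_t >= 0 via nonnegative least squares.
X = np.column_stack([nnls(A, Y[:, t])[0] for t in range(Y.shape[1])])
print(X.shape)  # (3, 10), all entries nonnegative
```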
When $\hat{Y}$ is itself a product of two matrices, e.g., $\hat{Y} = AX$, $D(Y\,\|\,\hat{Y}) = D(Y\,\|\,AX)$ becomes biconvex in $A$ and $X$. This means that it is not jointly convex in both of these variables, but if either of them is fixed it is convex in the other. Therefore, (3) is biconvex and cannot directly be solved through convex optimization methods. Nevertheless, convex optimization methods may still be employed by alternately estimating one of $A$ and $X$, holding the other fixed to its current estimate.
A commonly used solution to estimating nonnegative
decompositions is based on the so-called multiplicative updates.
The parameters to be estimated are first initialized to random
positive values and then iteratively updated by multiplying them
with correction terms. The strength of the method stems from
the ability of the updates to fulfill the nonnegativity constraints
easily: provided that both the previous estimate and the correc-
tion term are nonnegative, the updated term is guaranteed
to be nonnegative as well. The multiplicative updates that
decrease the KL divergence are given as
$$A \leftarrow A \otimes \frac{\dfrac{Y}{AX}\,X^{\top}}{\mathbf{1}\,X^{\top}} \qquad (7)$$

and

$$X \leftarrow X \otimes \frac{A^{\top}\,\dfrac{Y}{AX}}{A^{\top}\,\mathbf{1}}, \qquad (8)$$
where $\mathbf{1}$ is an all-one matrix of the same size as $Y$, $\otimes$ is an element-wise matrix product, and all the divisions are element wise.
It can be easily seen that if A and X are nonnegative, the terms
that are used to update them are also nonnegative. Thus, the
updates obey the nonnegativity constraints. If both
A and X must
be estimated, (7) and (8) must be alternately computed. If one of
the two is given and only the other must be estimated, then only
the update rule for the appropriate variable need be iterated. For
instance, if
A is given, X can be estimated by iterating (8). In all
cases, the KL divergence is guaranteed to be nonincreasing under
the updates. These multiplicative updates as well as rules for mini-
mizing the squared error were proposed by Lee and Seung [22].
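A compact NumPy sketch of these alternating updates is given below. The function name and the small constant eps, added only as a numerical safeguard against division by zero and not part of (7) and (8) themselves, are our own choices.

```python
import numpy as np

def nmf_kl(Y, n_atoms, n_iter=200, eps=1e-12, seed=None):
    """Estimate nonnegative A (atoms) and X (activations) with Y ~= A @ X
    by alternating the KL multiplicative updates (7) and (8)."""
    rng = np.random.default_rng(seed)
    F, T = Y.shape
    A = rng.random((F, n_atoms)) + eps   # random positive initialization
    X = rng.random((n_atoms, T)) + eps
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        # (7): A <- A * ((Y / (A X)) X^T) / (1 X^T)
        A *= ((Y / (A @ X + eps)) @ X.T) / (ones @ X.T + eps)
        # (8): X <- X * (A^T (Y / (A X))) / (A^T 1)
        X *= (A.T @ (Y / (A @ X + eps))) / (A.T @ ones + eps)
    return A, X

# If A is known (e.g., pre-learned atoms), hold it fixed and iterate only (8).
```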
In addition to multiplicative updates, various alternative
methods have been proposed, based on, e.g., second-order methods [23] and projected gradient [1, pp. 267–268]. The
methods can also be accelerated by active-set methods [24],
[25]. Some divergences, such as the IS divergence, are not con-
vex, and minimizing them requires more carefully designed optimization algorithms than the convex divergences do [13].
There also exist divergences that aim at optimizing the per-
ceptual quality of the representation [12], which are useful in
audio coding applications. However, in most of the other appli-
cations of compositional models, such as source separation and
signal analysis, the quality of the representation is affected more by its ability to isolate latent compositional units from a mixture signal than by its ability to accurately represent the observations. Therefore, simple divergences such as the KL or IS are
the most commonly used even in the applications where a mix-
ture is separated into parts for listening purposes.
COMPOSITIONAL MODELS AS PLCA
The PLCA approach to compositional models treats the spectro-
gram of the signal as a histogram drawn from a mixture multi-
nomial process, where the component multinomials in the
mixture represent the atoms that compose the signal [14]. This
model is an extension of probabilistic latent semantic indexing
and probabilistic latent semantic analysis techniques that have
been successfully used, e.g., for topic modeling of speech [26].
The generative model behind PLCA may be explained as follows.
A stochastic process draws frequency indices randomly from a col-
lection of multinomial distributions. In each draw, it first selects
one of these component multinomials according to some probabil-
ity distribution $P(k)$, where $k$ represents the selected multinomial. Subsequently, it draws the frequency $f$ from the selected multinomial $P(f\mid k)$. Thus, the probability that a frequency $f$ will be selected in any draw is given by $\sum_k P(k)P(f\mid k)$. To generate a spectral vector, the process draws frequencies several times. The histogram of the frequencies is the resulting spectral vector.
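As an illustration, this generative process can be simulated directly; in the following Python sketch the atom weights and the component multinomials are invented toy values, not quantities from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_atoms, n_draws = 6, 2, 1000

P_k = np.array([0.4, 0.6])                                    # P(k): atom weights
P_f_given_k = rng.dirichlet(np.ones(n_freq), size=n_atoms).T  # columns are P(f|k)

# Each draw: pick an atom k ~ P(k), then a frequency f ~ P(f|k).
ks = rng.choice(n_atoms, size=n_draws, p=P_k)
fs = np.array([rng.choice(n_freq, p=P_f_given_k[:, k]) for k in ks])

# The histogram of drawn frequencies is the generated spectral vector;
# its expected shape is n_draws * sum_k P(k) P(f|k).
spectrum = np.bincount(fs, minlength=n_freq)
expected = n_draws * (P_f_given_k @ P_k)
print(spectrum, expected.round(1))
```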
The mixture multinomial $\sum_k P(k)P(f\mid k)$ thus represents the
distribution underlying a single spectral vector—the vector itself
is obtained through several draws from this distribution. When we
employ the model to generate an entire spectrogram comprising
many spectral vectors, we make an additional assumption: that the
component multinomials $P(f\mid k)$ are characteristic of the source
that generates the sound, and represent the atomic units for the
source. Hence, the set of component multinomials is the same for
all vectors, and the only factor that changes from analysis frame to