analysis frame is the probability distribution over $k$, which specifies how the component multinomials are chosen in any draw. The overall mixture multinomial distribution model for the spectrum of the $t$th analysis frame in the signal is given by
$$P_t(f) = \sum_{k=1}^{K} P_t(k)\, P(f|k), \qquad (9)$$
where $P_t(k)$ represents the frame-specific a priori probability of $k$ in the $t$th frame and $P(f|k)$ represents the multinomial distribution of $f$ within the $k$th atom. Even though the formulation of the model is different from NMF, the models are conceptually similar: decomposition of a signal is equated to estimation of the atoms $P(f|k)$ and their activations $P_t(k)$ for each frame of the signal, given the spectrogram $Y[t,f]$.
The estimation can be performed using the expectation
maximization algorithm [27]. The various components of the
mixture multinomial distribution of (9) are initialized randomly
and re-estimated through iterations of the following equations:
$$P_t(k|f) = \frac{P_t(k)\, P(f|k)}{\sum_{k'} P_t(k')\, P(f|k')}, \qquad
P(f|k) = \frac{\sum_{t=1}^{T} P_t(k|f)\, Y[t,f]}{\sum_{f'=1}^{F} \sum_{t=1}^{T} P_t(k|f')\, Y[t,f']}, \qquad (10)$$
$$P_t(k) = \frac{\sum_{f=1}^{F} P_t(k|f)\, Y[t,f]}{\sum_{k'=1}^{K} \sum_{f=1}^{F} P_t(k'|f)\, Y[t,f]}. \qquad (11)$$
The contribution of the $k$th atom to the overall signal is the expected number of draws from the multinomial for the atom, given the observed spectrum, and is given by
$$Y_k[t,f] = Y[t,f]\, P_t(k|f) = Y[t,f]\, \frac{P_t(k)\, P(f|k)}{\sum_{k'=1}^{K} P_t(k')\, P(f|k')}.$$
This effectively distributes the intensity of $Y[t,f]$ using the posterior probability of the $k$th source at point $[t,f]$, and is equivalent to the Wiener-style reconstruction described in the next section.
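To make the EM recipe concrete, here is a minimal numpy sketch of iterations (10) and (11) and of the per-atom reconstruction above. The function names, the $(T, F)$ orientation of $Y$, and the smoothing constant `eps` are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def plca(Y, K, n_iter=200, eps=1e-12, rng=None):
    """EM estimation of the PLCA model in (9)-(11).

    Y : nonnegative magnitude spectrogram, shape (T, F)
    K : number of atoms (latent components)
    Returns P(f|k) as `atoms` (K, F) and P_t(k) as `acts` (T, K).
    """
    rng = np.random.default_rng(rng)
    T, F = Y.shape
    # Random initialization, normalized so each row is a distribution.
    atoms = rng.random((K, F)); atoms /= atoms.sum(axis=1, keepdims=True)
    acts  = rng.random((T, K)); acts  /= acts.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P_t(k|f), shape (T, K, F).
        joint = acts[:, :, None] * atoms[None, :, :]    # P_t(k) P(f|k)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # Weight the posterior by the observed intensities Y[t,f].
        w = post * Y[:, None, :]                        # P_t(k|f) Y[t,f]
        # M-step: (10) re-estimates P(f|k); (11) re-estimates P_t(k).
        atoms = w.sum(axis=0); atoms /= atoms.sum(axis=1, keepdims=True) + eps
        acts  = w.sum(axis=2); acts  /= acts.sum(axis=1, keepdims=True) + eps
    return atoms, acts

def contribution(Y, atoms, acts, k, eps=1e-12):
    """Y_k[t,f] = Y[t,f] P_t(k|f): the k-th atom's share of the signal."""
    joint = acts[:, :, None] * atoms[None, :, :]
    post = joint / (joint.sum(axis=1, keepdims=True) + eps)
    return Y * post[:, k, :]
```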
The rest of this article is presented primarily through the
matrix-factorization perspective for brevity. However, many of the
NMF extensions described below are also possible within the PLCA
framework, often in a manner that is more mathematically intuitive
than the matrix-factorization framework. These include, e.g., tensor
decompositions [27], convolutive representations, the imposition of
temporal constraints [28], joint recognition of mixed signals, and
the imputation of missing data [29]. We refer you to the studies
mentioned previously for additional details of these models.
UNIQUENESS, REGULARIZATION, AND SPARSITY
The solutions to (3) and (4) are not always unique. We have noted that the divergence $D(Y\|AX)$ is biconvex in $A$ and $X$. As a result, when both $A$ and $X$ are to be estimated, multiple solutions may be obtained that result in the same divergence. Specifically, for any $Y \in \mathbb{R}_+^{F \times T}$, if $(A, X) \in \mathbb{R}_+^{F \times K} \times \mathbb{R}_+^{K \times T}$ is a solution that minimizes the divergence, then any matrix pair $(\tilde{A}, \tilde{X}) \in \mathbb{R}_+^{F \times K} \times \mathbb{R}_+^{K \times T}$ such that $\tilde{A}\tilde{X} = AX$ is also a solution.
For $K \ge F$, in particular, trivial solutions also become possible. For $K = F$, $Y = AX$ can be made exact by simply setting $A = I$ and $X = Y$. For $K > F$, infinitely many exact decompositions may be found, for instance, simply by setting the first $F$ columns of $A$ to the identity matrix; the remaining dictionary atoms become irrelevant (and can be set to anything at all), as an exact decomposition can be obtained by setting their activations to zero.
Even if $A$ is specified and only $X$ must be estimated, the solution may not be unique, although $D(Y\|AX)$ is convex in $X$. This happens particularly when $A$ is overcomplete, i.e., when $K \ge F$. Any $F$ linearly independent columns of $A$ can potentially be used to represent an $F$-dimensional vector with zero error. We can choose $F$ linearly independent atoms from $A \in \mathbb{R}_+^{F \times K}$ in up to $\binom{K}{F}$ ways, potentially giving us at least that many ways of decomposing any vector in $Y$ in terms of the atoms in $A$. If we permit combinations of more than $F$ atoms, the number of minimum-divergence decompositions of a vector in terms of $A$ can be much greater. The exact conditions for the uniqueness of the decompositions are studied in more detail in [30].
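The non-uniqueness is easy to demonstrate numerically. The short sketch below (with illustrative variable names) shows the trivial $K = F$ solution and a diagonal rescaling, which is one simple instance of a pair $(\tilde{A}, \tilde{X})$ with $\tilde{A}\tilde{X} = AX$.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, K = 4, 6, 4
Y = rng.random((F, T))

# Trivial exact "decomposition" when K = F: A = I, X = Y.
A, X = np.eye(F), Y.copy()
print(np.allclose(A @ X, Y))          # True: zero divergence

# A positive diagonal rescaling (A D, D^{-1} X) reconstructs Y
# identically, so the divergence minimizer is not unique.
D = np.diag(rng.uniform(0.5, 2.0, K))
A2, X2 = A @ D, np.linalg.inv(D) @ X
print(np.allclose(A2 @ X2, A @ X))    # True
```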
To reduce the ambiguity in the solution, it is customary to impose additional constraints on the decomposition, which is typically done through regularization terms that are added to the divergence to be minimized. Within the NMF framework, this modifies the optimization problem of (3) to
$$A^*, X^* = \mathop{\mathrm{argmin}}_{A,X}\, D(Y\|AX) + \lambda\,\Phi(X), \qquad A \ge 0,\ X \ge 0, \qquad (12)$$
where $\Phi(X)$ is a differentiable, scalar function of $X$ whose value decreases as the conformance of $X$ to the desired constraint increases, and $\lambda$ is a positive weight given to the regularization term.
The introduction of a regularization term as given in (12) can nevertheless still result in trivial solutions. Two solutions, $(A, X)$ and $(\tilde{A}, \tilde{X})$, will result in identical divergence values if $\tilde{A} = \epsilon^{-1} A$ and $\tilde{X} = \epsilon X$, i.e., $D(Y\|AX) = D(Y\|\tilde{A}\tilde{X})$. Structurally, the two solutions are identical, since they are merely scaled versions of one another. On the other hand, the regularization terms for the two need not be identical: $\Phi(X) \ne \Phi(\tilde{X})$. For instance, with an $\ell_1$ penalty $\Phi(X) = \sum_{k,t} X[k,t]$, scaling $X$ by $\epsilon$ scales the penalty by $\epsilon$, which can be driven arbitrarily close to zero. As a result, the regularization term on the right-hand side of (12) can be minimized by simply scaling $X$ down by an appropriate $\epsilon$ and scaling $A$ up by $\epsilon^{-1}$, without actually modifying the decomposition obtained.
To avoid this problem, it becomes necessary to scale the atoms in $A$ to have a constant $\ell_2$ norm. Typically, this is done by normalizing every atom $a_i$ in $A$ such that $\|a_i\|_2 = 1$ after every update.
Assuming that all atoms are normalized to unit $\ell_2$ norm, for the KL divergence, the update rules from (8) are modified to
$$X \leftarrow X \otimes \frac{A^\top \dfrac{Y}{AX}}{A^\top \mathbf{1} + \lambda\,\Phi'(X)}, \qquad (13)$$
where $\Phi'(X)$ is the matrix derivative of $\Phi(X)$ with respect to $X$. The update rule for $A$ remains unchanged, besides the normalization of its atoms after every update.
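As an illustration, the following numpy sketch combines update (13), using the common $\ell_1$ penalty $\Phi(X) = \sum_{k,t} X[k,t]$ (so that $\Phi'(X)$ is a matrix of ones), with the standard multiplicative KL update for $A$ and the unit-norm rescaling described above. The function name and default parameters are our own assumptions; rescaling $X$ by the atom norms keeps the product $AX$ unchanged while ensuring the penalty cannot be gamed by shrinking $X$.

```python
import numpy as np

def sparse_kl_nmf(Y, K, lam=0.1, n_iter=200, eps=1e-12, rng=None):
    """Multiplicative KL-NMF with an l1 sparsity penalty Phi(X) = sum(X),
    whose derivative Phi'(X) is all ones, as in (13).
    Y is (F, T); A is (F, K) with unit-l2-norm columns; X is (K, T).
    """
    rng = np.random.default_rng(rng)
    F, T = Y.shape
    A = rng.random((F, K))
    X = rng.random((K, T))
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        # Regularized activation update (13): the penalty derivative
        # lam * Phi'(X) = lam enters the denominator elementwise.
        X *= (A.T @ (Y / (A @ X + eps))) / (A.T @ ones + lam + eps)
        # Standard multiplicative KL update for A.
        A *= ((Y / (A @ X + eps)) @ X.T) / (ones @ X.T + eps)
        # Normalize atoms to unit l2 norm; compensate in X so that the
        # reconstruction AX is unaffected by the rescaling.
        norms = np.linalg.norm(A, axis=0, keepdims=True) + eps
        A /= norms
        X *= norms.T
    return A, X
```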