analysis frame is the probability distribution over $k$, which specifies how the component multinomials are chosen in any draw. The overall mixture multinomial distribution model for the spectrum of the $t$th analysis frame in the signal is given by
$$P_t(f) = \sum_{k=1}^{K} P_t(k)\, P(f|k), \qquad (9)$$
where $P_t(k)$ represents the frame-specific a priori probability of $k$ in the $t$th frame and $P(f|k)$ represents the multinomial distribution of $f$ within the $k$th atom. Even though the formulation of the model is different from NMF, the models are conceptually similar: decomposition of a signal is equated to estimation of the atoms $P(f|k)$ and their activations $P_t(k)$ for each frame of the signal, given the spectrogram $Y[t,f]$.
The estimation can be performed using the expectation
maximization algorithm [27]. The various components of the
mixture multinomial distribution of (9) are initialized randomly
and re-estimated through iterations of the following equations:
$$P_t(k|f) = \frac{P_t(k)\, P(f|k)}{\sum_{k'} P_t(k')\, P(f|k')}, \qquad
P(f|k) = \frac{\sum_{t=1}^{T} P_t(k|f)\, Y[t,f]}{\sum_{f'=1}^{F} \sum_{t=1}^{T} P_t(k|f')\, Y[t,f']}, \qquad (10)$$
$$P_t(k) = \frac{\sum_{f=1}^{F} P_t(k|f)\, Y[t,f]}{\sum_{k'=1}^{K} \sum_{f=1}^{F} P_t(k'|f)\, Y[t,f]}. \qquad (11)$$
The contribution of the $k$th atom to the overall signal is the expected number of draws from the multinomial for the atom, given the observed spectrum, and is given by
$$Y_k[t,f] = Y[t,f]\, P_t(k|f) = Y[t,f]\, \frac{P_t(k)\, P(f|k)}{\sum_{k'=1}^{K} P_t(k')\, P(f|k')}.$$
This effectively distributes the intensity of $Y[t,f]$ using the posterior probability of the $k$th source at point $[t,f]$, and is equivalent to the Wiener-style reconstruction described in the next section.
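To make the EM recipe concrete, here is a minimal numpy sketch of iterations (10) and (11) and of the per-atom reconstruction above. The function names, the $(T, F)$ orientation of $Y$, and the smoothing constant `eps` are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def plca(Y, K, n_iter=200, eps=1e-12, rng=None):
    """EM estimation of the PLCA model in (9)-(11).

    Y : nonnegative magnitude spectrogram, shape (T, F)
    K : number of atoms (latent components)
    Returns P(f|k) as `atoms` (K, F) and P_t(k) as `acts` (T, K).
    """
    rng = np.random.default_rng(rng)
    T, F = Y.shape
    # Random initialization, normalized so each row is a distribution.
    atoms = rng.random((K, F)); atoms /= atoms.sum(axis=1, keepdims=True)
    acts  = rng.random((T, K)); acts  /= acts.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P_t(k|f), shape (T, K, F).
        joint = acts[:, :, None] * atoms[None, :, :]    # P_t(k) P(f|k)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # Weight the posterior by the observed intensities Y[t,f].
        w = post * Y[:, None, :]                        # P_t(k|f) Y[t,f]
        # M-step: (10) re-estimates P(f|k); (11) re-estimates P_t(k).
        atoms = w.sum(axis=0); atoms /= atoms.sum(axis=1, keepdims=True) + eps
        acts  = w.sum(axis=2); acts  /= acts.sum(axis=1, keepdims=True) + eps
    return atoms, acts

def contribution(Y, atoms, acts, k, eps=1e-12):
    """Y_k[t,f] = Y[t,f] P_t(k|f): the k-th atom's share of the signal."""
    joint = acts[:, :, None] * atoms[None, :, :]
    post = joint / (joint.sum(axis=1, keepdims=True) + eps)
    return Y * post[:, k, :]
```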
The rest of this article is presented primarily through the
matrix-factorization perspective for brevity. However, many of the
NMF extensions described below are also possible within the PLCA
framework, often in a manner that is more mathematically intuitive
than the matrix-factorization framework. These include, e.g., tensor
decompositions [27], convolutive representations, the imposition of
temporal constraints [28], joint recognition of mixed signals, and
the imputation of missing data [29]. We refer you to the studies
mentioned previously for additional details of these models.
UNIQUENESS, REGULARIZATION, AND SPARSITY
The solutions to (3) and (4) are not always unique. We have noted that the divergence $D(Y\|AX)$ is biconvex in $A$ and $X$. As a result, when both $A$ and $X$ are to be estimated, multiple solutions may be obtained that result in the same divergence. Specifically, for any $Y \in \mathbb{R}_+^{F \times T}$, if $(A, X) \in \mathbb{R}_+^{F \times K} \times \mathbb{R}_+^{K \times T}$ is a solution that minimizes the divergence, then any matrix pair $(\tilde{A}, \tilde{X}) \in \mathbb{R}_+^{F \times K} \times \mathbb{R}_+^{K \times T}$ such that $\tilde{A}\tilde{X} = AX$ is also a solution.
For $K \ge F$, in particular, trivial solutions also become possible. For $K = F$, $Y = AX$ can be made exact by simply setting $A = I$ and $X = Y$. For $K > F$, infinitely many exact decompositions may be found, for instance, simply by setting the first $F$ columns of $A$ to the identity matrix; the remaining dictionary atoms become irrelevant (and can be set to anything at all), as an exact decomposition can be obtained by setting their activations to zero.
Even if $A$ is specified and only $X$ must be estimated, the solution may not be unique, although $D(Y\|AX)$ is convex in $X$. This happens particularly when $A$ is overcomplete, i.e., when $K \ge F$. Any $F$ linearly independent columns of $A$ can potentially be used to represent an $F$-dimensional vector with zero error. We can choose $F$ linearly independent atoms from $A \in \mathbb{R}_+^{F \times K}$ in up to $\binom{K}{F}$ ways, potentially giving us at least that many ways of decomposing any vector in $Y$ in terms of the atoms in $A$. If we permit combinations of more than $F$ atoms, the number of minimum-divergence decompositions of a vector in terms of $A$ can be much greater. The exact conditions for the uniqueness of the decompositions are studied in more detail in [30].
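The non-uniqueness is easy to demonstrate numerically. The short sketch below (with illustrative variable names) shows the trivial $K = F$ solution and a diagonal rescaling, which is one simple instance of a pair $(\tilde{A}, \tilde{X})$ with $\tilde{A}\tilde{X} = AX$.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, K = 4, 6, 4
Y = rng.random((F, T))

# Trivial exact "decomposition" when K = F: A = I, X = Y.
A, X = np.eye(F), Y.copy()
print(np.allclose(A @ X, Y))          # True: zero divergence

# A positive diagonal rescaling (A D, D^{-1} X) reconstructs Y
# identically, so the divergence minimizer is not unique.
D = np.diag(rng.uniform(0.5, 2.0, K))
A2, X2 = A @ D, np.linalg.inv(D) @ X
print(np.allclose(A2 @ X2, A @ X))    # True
```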
To reduce the ambiguity in the solution, it is customary to impose additional constraints on the decomposition, which is typically done through regularization terms that are added to the divergence to be minimized. Within the NMF framework, this modifies the optimization problem of (3) to
$$A^*, X^* = \mathop{\mathrm{argmin}}_{A,X}\, D(Y\|AX) + \lambda\,\Phi(X), \qquad A \ge 0,\ X \ge 0, \qquad (12)$$
where $\Phi(X)$ is a differentiable, scalar function of $X$ whose value decreases as the conformance of $X$ to the desired constraint increases, and $\lambda$ is a positive weight given to the regularization term.
The introduction of a regularization term as given in (12) can nevertheless still result in trivial solutions. Two solutions, $(A, X)$ and $(\tilde{A}, \tilde{X})$, will result in identical divergence values if $\tilde{A} = \epsilon^{-1} A$ and $\tilde{X} = \epsilon X$, i.e., $D(Y\|AX) = D(Y\|\tilde{A}\tilde{X})$. Structurally, the two solutions are identical, since they are merely scaled versions of one another. On the other hand, the regularization terms for the two need not be identical: $\Phi(X) \ne \Phi(\tilde{X})$. For instance, with an $\ell_1$ penalty $\Phi(X) = \sum_{k,t} X[k,t]$, scaling $X$ by $\epsilon$ scales the penalty by $\epsilon$, which can be driven arbitrarily close to zero. As a result, the regularization term on the right-hand side of (12) can be minimized by simply scaling $X$ down by an appropriate $\epsilon$ and scaling $A$ up by $\epsilon^{-1}$, without actually modifying the decomposition obtained.
To avoid this problem, it becomes necessary to scale the atoms in $A$ to have a constant $\ell_2$ norm. Typically, this is done by normalizing every atom $a_i$ in $A$ such that $\|a_i\|_2 = 1$ after every update.
Assuming that all atoms are normalized to unit $\ell_2$ norm, for the KL divergence, the update rules from (8) are modified to
$$X \leftarrow X \otimes \frac{A^\top \dfrac{Y}{AX}}{A^\top \mathbf{1} + \lambda\,\Phi'(X)}, \qquad (13)$$
where $\Phi'(X)$ is the matrix derivative of $\Phi(X)$ with respect to $X$. The update rule for $A$ remains unchanged, besides the normalization of its atoms after every update.
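As an illustration, the following numpy sketch combines update (13), using the common $\ell_1$ penalty $\Phi(X) = \sum_{k,t} X[k,t]$ (so that $\Phi'(X)$ is a matrix of ones), with the standard multiplicative KL update for $A$ and the unit-norm rescaling described above. The function name and default parameters are our own assumptions; rescaling $X$ by the atom norms keeps the product $AX$ unchanged while ensuring the penalty cannot be gamed by shrinking $X$.

```python
import numpy as np

def sparse_kl_nmf(Y, K, lam=0.1, n_iter=200, eps=1e-12, rng=None):
    """Multiplicative KL-NMF with an l1 sparsity penalty Phi(X) = sum(X),
    whose derivative Phi'(X) is all ones, as in (13).
    Y is (F, T); A is (F, K) with unit-l2-norm columns; X is (K, T).
    """
    rng = np.random.default_rng(rng)
    F, T = Y.shape
    A = rng.random((F, K))
    X = rng.random((K, T))
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        # Regularized activation update (13): the penalty derivative
        # lam * Phi'(X) = lam enters the denominator elementwise.
        X *= (A.T @ (Y / (A @ X + eps))) / (A.T @ ones + lam + eps)
        # Standard multiplicative KL update for A.
        A *= ((Y / (A @ X + eps)) @ X.T) / (ones @ X.T + eps)
        # Normalize atoms to unit l2 norm; compensate in X so that the
        # reconstruction AX is unaffected by the rescaling.
        norms = np.linalg.norm(A, axis=0, keepdims=True) + eps
        A /= norms
        X *= norms.T
    return A, X
```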