The various divergences scale differently with their argu-
ments. The squared error scales quadratically, meaning that
$D_{\mathrm{SQ}}(\alpha Y\,\|\,\alpha AX) = \alpha^{2} D_{\mathrm{SQ}}(Y\,\|\,AX)$; the IS divergence is scale invariant, i.e., $D_{\mathrm{IS}}(\alpha Y\,\|\,\alpha AX) = D_{\mathrm{IS}}(Y\,\|\,AX)$; while the KL divergence scales linearly: $D_{\mathrm{KL}}(\alpha Y\,\|\,\alpha AX) = \alpha D_{\mathrm{KL}}(Y\,\|\,AX)$. The relative merits of the divergences may be inferred from this property: the squared error divergence puts undue emphasis on high-energy components, and the IS divergence fails to distinguish between the noise floor and higher-energy speech components. The KL divergence provides a good compromise between the two [3], [19], [20]. A generalization of the divergences in (5) and (6) is the beta divergence [21], which defines a set of divergences that are a function of a parameter $\beta$.
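To make the scaling properties concrete, the following minimal Python/NumPy sketch computes the squared error, generalized KL, and IS divergences between two nonnegative arrays and checks how each scales when both arguments are multiplied by a constant. The function names and the small constant EPS are our own choices, and the KL and IS forms used here are the ones commonly employed in NMF, which we assume correspond to (5) and (6).

```python
import numpy as np

EPS = 1e-12  # guard against division by zero and log(0)

def d_sq(Y, Yhat):
    """Squared error divergence, summed over all time-frequency points."""
    return np.sum((Y - Yhat) ** 2)

def d_kl(Y, Yhat):
    """Generalized KL divergence, summed over all time-frequency points."""
    return np.sum(Y * np.log((Y + EPS) / (Yhat + EPS)) - Y + Yhat)

def d_is(Y, Yhat):
    """Itakura-Saito divergence, summed over all time-frequency points."""
    ratio = (Y + EPS) / (Yhat + EPS)
    return np.sum(ratio - np.log(ratio) - 1)

# Scaling both arguments by alpha reveals the different behaviors.
rng = np.random.default_rng(0)
Y, Yhat, alpha = rng.random((5, 4)), rng.random((5, 4)), 3.0
print(d_sq(alpha * Y, alpha * Yhat) / d_sq(Y, Yhat))  # ~alpha**2 (quadratic)
print(d_kl(alpha * Y, alpha * Yhat) / d_kl(Y, Yhat))  # ~alpha    (linear)
print(d_is(alpha * Y, alpha * Yhat) / d_is(Y, Yhat))  # ~1        (scale invariant)
```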
The divergences in (5) and (6) (KL, IS, or squared) can be
obtained from maximum likelihood estimation of the parameters,
when observed data is generated by a specific generative model
(Poisson distribution, multiplicative Gamma noise, or additive
Gaussian noise) independently at each time–frequency point [13].
Even though some of these models (e.g., the Poisson distribution)
do not match well with the distribution of natural sounds, the stat-
istical interpretation allows incorporating prior distributions for
the parameters.
The squared error and KL divergences are convex as a function of $\hat{y}$, and for these, the divergence $D(Y\,\|\,\hat{Y})$ is also convex in $\hat{Y}$. In this case, the optimization problem of (4) and its counterpart, where X is specified and A must be estimated, minimize a convex function and can be solved by any convex optimization technique.
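For example, under the squared error divergence with A fixed, each column of X can be estimated with nonnegative least squares. A minimal sketch using SciPy follows; the matrix sizes and random data are invented purely for illustration.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
A = np.abs(rng.standard_normal((20, 3)))   # fixed atoms: 20 frequency bins, 3 atoms
Y = np.abs(rng.standard_normal((20, 10)))  # observed magnitude spectrogram: 10 frames

# With A fixed, the squared error objective is convex in X: solve each frame
# min_x ||y_t - A x_t||^2 subject to x_t >= 0 via nonnegative least squares.
X = np.column_stack([nnls(A, Y[:, t])[0] for t in range(Y.shape[1])])
print(X.shape)  # (3, 10), all entries nonnegative
```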
When $\hat{Y}$ is itself a product of two matrices, e.g., $\hat{Y} = AX$, $D(Y\,\|\,\hat{Y}) = D(Y\,\|\,AX)$ becomes biconvex in $A$ and $X$. This means that it is not jointly convex in both of these variables, but if either of them is fixed it is convex in the other. Therefore, (3) is biconvex and cannot directly be solved through convex optimization methods. Nevertheless, convex optimization methods may still be employed by alternately estimating one of $A$ and $X$, holding the other fixed to its current estimate.
A commonly used solution to estimating nonnegative
decompositions is based on the so-called multiplicative updates.
The parameters to be estimated are first initialized to random
positive values and then iteratively updated by multiplying them
with correction terms. The strength of the method stems from
the ability of the updates to fulfill the nonnegativity constraints
easily: provided that both the previous estimate and the correc-
tion term are nonnegative, the updated term is guaranteed
to be nonnegative as well. The multiplicative updates that
decrease the KL divergence are given as
$$A \leftarrow A \otimes \frac{\dfrac{Y}{AX}\,X^{\top}}{\mathbf{1}\,X^{\top}} \qquad (7)$$

and

$$X \leftarrow X \otimes \frac{A^{\top}\,\dfrac{Y}{AX}}{A^{\top}\,\mathbf{1}}, \qquad (8)$$
where $\mathbf{1}$ is an all-one matrix of the same size as $Y$, $\otimes$ is an element-wise matrix product, and all the divisions are element wise.
It can be easily seen that if A and X are nonnegative, the terms
that are used to update them are also nonnegative. Thus, the
updates obey the nonnegativity constraints. If both
A and X must
be estimated, (7) and (8) must be alternately computed. If one of
the two is given and only the other must be estimated, then only
the update rule for the appropriate variable need be iterated. For
instance, if
A is given, X can be estimated by iterating (8). In all
cases, the KL divergence is guaranteed to be nonincreasing under
the updates. These multiplicative updates as well as rules for mini-
mizing the squared error were proposed by Lee and Seung [22].
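A compact NumPy sketch of these alternating updates is given below. The function name and the small constant eps, added only as a numerical safeguard against division by zero and not part of (7) and (8) themselves, are our own choices.

```python
import numpy as np

def nmf_kl(Y, n_atoms, n_iter=200, eps=1e-12, seed=None):
    """Estimate nonnegative A (atoms) and X (activations) with Y ~= A @ X
    by alternating the KL multiplicative updates (7) and (8)."""
    rng = np.random.default_rng(seed)
    F, T = Y.shape
    A = rng.random((F, n_atoms)) + eps   # random positive initialization
    X = rng.random((n_atoms, T)) + eps
    ones = np.ones_like(Y)
    for _ in range(n_iter):
        # (7): A <- A * ((Y / (A X)) X^T) / (1 X^T)
        A *= ((Y / (A @ X + eps)) @ X.T) / (ones @ X.T + eps)
        # (8): X <- X * (A^T (Y / (A X))) / (A^T 1)
        X *= (A.T @ (Y / (A @ X + eps))) / (A.T @ ones + eps)
    return A, X

# If A is known (e.g., pre-learned atoms), hold it fixed and iterate only (8).
```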
In addition to multiplicative updates, various alternative
methods have been proposed, based on, e.g., second-order methods [23] and projected gradient [1, pp. 267–268]. The
methods can also be accelerated by active-set methods [24],
[25]. Some divergences, such as the IS divergence, are not con-
vex, and minimizing them requires more carefully designed optimization algorithms than the convex divergences do [13].
There also exist divergences that aim at optimizing the per-
ceptual quality of the representation [12], which are useful in
audio coding applications. However, in most of the other appli-
cations of compositional models, such as source separation and
signal analysis, the quality of the representation is affected more by its ability to isolate latent compositional units from a mixture signal than by its ability to accurately represent the observations. Therefore, simple divergences such as the KL or IS are
the most commonly used even in the applications where a mix-
ture is separated into parts for listening purposes.
COMPOSITIONAL MODELS AS PLCA
The PLCA approach to compositional models treats the spectro-
gram of the signal as a histogram drawn from a mixture multi-
nomial process, where the component multinomials in the
mixture represent the atoms that compose the signal [14]. This
model is an extension of probabilistic latent semantic indexing
and probabilistic latent semantic analysis techniques that have
been successfully used, e.g., for topic modeling of speech [26].
The generative model behind PLCA may be explained as follows.
A stochastic process draws frequency indices randomly from a col-
lection of multinomial distributions. In each draw, it first selects
one of these component multinomials according to some probabil-
ity distribution $P(k)$, where $k$ represents the selected multinomial. Subsequently, it draws the frequency $f$ from the selected multinomial $P(f\mid k)$. Thus, the probability that a frequency $f$ will be selected in any draw is given by $\sum_k P(k)P(f\mid k)$. To generate a spectral vector, the process draws frequencies several times. The histogram of the frequencies is the resulting spectral vector.
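As an illustration, this generative process can be simulated directly; in the following Python sketch the atom weights and the component multinomials are invented toy values, not quantities from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_atoms, n_draws = 6, 2, 1000

P_k = np.array([0.4, 0.6])                                    # P(k): atom weights
P_f_given_k = rng.dirichlet(np.ones(n_freq), size=n_atoms).T  # columns are P(f|k)

# Each draw: pick an atom k ~ P(k), then a frequency f ~ P(f|k).
ks = rng.choice(n_atoms, size=n_draws, p=P_k)
fs = np.array([rng.choice(n_freq, p=P_f_given_k[:, k]) for k in ks])

# The histogram of drawn frequencies is the generated spectral vector;
# its expected shape is n_draws * sum_k P(k) P(f|k).
spectrum = np.bincount(fs, minlength=n_freq)
expected = n_draws * (P_f_given_k @ P_k)
print(spectrum, expected.round(1))
```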
The mixture multinomial $\sum_k P(k)P(f\mid k)$ thus represents the
distribution underlying a single spectral vector—the vector itself
is obtained through several draws from this distribution. When we
employ the model to generate an entire spectrogram comprising
many spectral vectors, we make an additional assumption: that the
component multinomials $P(f\mid k)$ are characteristic of the source
that generates the sound, and represent the atomic units for the
source. Hence, the set of component multinomials is the same for
all vectors, and the only factor that changes from analysis frame to