additional requirement that atoms must be normalized after
every iteration. There also exist ways to take the normalization into account in the update, which guarantee that the update and normalization together decrease the value of the cost function [3], [31].
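To make the normalization step concrete, here is a minimal NumPy sketch of rescaling the atoms after an update while compensating in the activations so that the model output is unchanged; the toy dimensions and the choice of the $\ell_2$ norm are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))   # toy dictionary, one atom per column
X = rng.random((3, 8))   # toy activations
Y = A @ X

# Rescale each atom to unit l2 norm; absorb the scales into the
# corresponding activation rows so that the product AX is unchanged.
norms = np.linalg.norm(A, axis=0, keepdims=True)
A = A / norms
X = X * norms.T
assert np.allclose(A @ X, Y)  # model output preserved
```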
One of the most common constraints is that of sparsity, e.g., [4], [32], and [33]. A vector $\mathbf{x}$ is said to be sparse if the number of nonzero entries in it is fewer than the dimensionality of the vector itself, i.e., $\|\mathbf{x}\|_0 < F$. The fewer the nonzero elements, the sparser the vector is said to be. Sparsity is most commonly applied to the activations, i.e., the columns of the activation matrix $\mathbf{X}$. The sparsity constraint is typically imposed by employing the $\ell_1$ norm of the activation matrix as a regularizer, i.e., $\Phi(\mathbf{X}) = \|\mathbf{X}\|_1 = \sum_k \sum_t x_k[t]$.
This leads to the following update rule for the activations:
$$\mathbf{X} \leftarrow \mathbf{X} \otimes \frac{\mathbf{A}^{\top}\left(\dfrac{\mathbf{Y}}{\mathbf{A}\mathbf{X}}\right)}{\mathbf{A}^{\top}\mathbf{1} + \lambda}, \qquad (14)$$
where $\mathbf{1}$ is an all-ones matrix of the same size as $\mathbf{Y}$, and the multiplication $\otimes$ and the divisions are element-wise.
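As a concrete illustration of (14), a minimal NumPy sketch might look as follows; the function name, initialization, iteration count, and regularization weight lam are illustrative choices rather than anything prescribed by the article.

```python
import numpy as np

def sparse_activation_update(Y, A, n_iter=200, lam=0.1, eps=1e-12):
    """Estimate activations X for a fixed dictionary A by iterating the
    l1-regularized multiplicative update (14)."""
    F, T = Y.shape
    X = np.random.rand(A.shape[1], T)   # nonnegative initialization
    ones = np.ones((F, T))              # the all-ones matrix in (14)
    for _ in range(n_iter):
        R = Y / (A @ X + eps)           # element-wise ratio Y / (AX)
        X *= (A.T @ R) / (A.T @ ones + lam)
    return X
```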
Other constraints may be similarly applied by modifying the regularization function $\Phi(\mathbf{X})$ to favor the type of solutions desired. Similarly, regularization functions may be applied on the dictionary $\mathbf{A}$, in which case the update rule of $\mathbf{A}$ should be modified. In the context of compositional models for audio, the types of regularizations applied on the dictionary include sparsity [32] and dissimilarity between learned atoms and generic speech templates [34].
It must be noted that despite the introduction of regularization terms, both (3) and (12) are still typically biconvex, and no algorithm is guaranteed to reach the global minimum in practice. Different algorithms and initializations lead to different solutions, and any solution obtained will, at best, be a local optimum. In practice, this can result in some degree of variation in the signal processing outcomes obtained through these decompositions.
The entire discussion in this section also applies to the PLCA decompositions, although the manner in which the regularization terms are applied within the PLCA framework is different. We refer the reader to [14], [27], and [35] for additional discussion of this topic.
SOURCE SEPARATION
Sound source separation refers to the problem of extracting a single signal or several signals of interest from a mixture containing multiple signals. This operation is central to many signal processing applications because the fundamental algorithms are typically built under the assumption that we operate on a clean target signal with minimal interference. The ability to remove unwanted components from a recording allows us to perform subsequent operations that expect a clean input (e.g., speech recognition or pitch detection). We will predominantly focus on the case where we only observe a single-channel mixture and briefly discuss multichannel approaches later in the article.
The compositional model approach to the separation of signals from single-channel recordings addresses the problem in a rather simple manner. It assumes that any sound source can draw upon a characteristic set of atomic sounds to generate signals. Here, a source can refer to an actual sound source or to some other grouping of acoustic phenomena that should be jointly modeled, such as background noise or even a collection of sound classes that must be distinguished from a target class. A mixture of signals from distinct sources is composed of atoms from the individual sources. Hence, separating any particular component signal from a mixture only requires segregating the contribution of that source's atoms from the mixture.
Mathematically, we can explain this as follows. We use the NMF formulation in our explanation. Let matrix $\mathbf{A}_s$ represent the set of atoms employed by the $s$th source. We will refer to it as a dictionary of atoms for that source. Any spectrogram $\mathbf{Y}_s$ from the $s$th source is composed from the atoms in the dictionary $\mathbf{A}_s$ as $\mathbf{Y}_s = \mathbf{A}_s\mathbf{X}_s$. A mixed signal $\mathbf{Y}_{\mathrm{mix}}$ combining signals from several sources is given by
$$\mathbf{Y}_{\mathrm{mix}} = \sum_s \mathbf{Y}_s = \sum_s \mathbf{A}_s\mathbf{X}_s. \qquad (15)$$
Equation (15) can be written more compactly as follows. Let $\mathbf{A} = [\mathbf{A}_1\ \mathbf{A}_2\ \cdots]$ be a matrix composed by stacking the dictionaries for all the sources side by side. Let $\mathbf{X} = [\mathbf{X}_1^{\top}\ \mathbf{X}_2^{\top}\ \cdots]^{\top}$ be a matrix composed by stacking the activations for all the sources vertically. We can now express the mixed signal in compact form as $\mathbf{Y}_{\mathrm{mix}} = \mathbf{A}\mathbf{X}$.
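As a quick worked example of this stacking (toy dimensions and hypothetical data, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A1, A2 = rng.random((5, 3)), rng.random((5, 4))   # per-source dictionaries
X1, X2 = rng.random((3, 8)), rng.random((4, 8))   # per-source activations

A = np.hstack([A1, A2])   # dictionaries stacked side by side
X = np.vstack([X1, X2])   # activations stacked vertically

# The compact product reproduces the sum of per-source contributions (15).
assert np.allclose(A @ X, A1 @ X1 + A2 @ X2)
```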
The contribution of the $s$th source to $\mathbf{Y}_{\mathrm{mix}}$ is simply $\mathbf{Y}_s = \mathbf{A}_s\mathbf{X}_s$. In unsupervised source separation, both $\mathbf{A}$ and $\mathbf{X}$ are estimated from the observation $\mathbf{Y}_{\mathrm{mix}}$, followed by a process that identifies which source each atom is predominantly associated with. In a supervised scenario for separation, the dictionaries $\mathbf{A}_s$ for each of the sources are known a priori. We address the problem of creating these dictionaries in the next section. Thus, $\mathbf{A}_s\ \forall s$ are known, and thereby so is $\mathbf{A}$. The activation matrix $\mathbf{X}$ can now be estimated through iterations of (8).
The activations $\mathbf{X}_s^{*}$ of source $s$ can be extracted from the estimated activation matrix $\mathbf{X}^{*}$ by selecting the rows corresponding to the atoms from the $s$th source. The estimated spectrogram for the $s$th source is then simply computed as
$$\hat{\mathbf{Y}}_s = \mathbf{A}_s\mathbf{X}_s^{*}. \qquad (16)$$
An example of a source separation task using a dictionary representing isolated speech digits and a dictionary representing background noises is shown in Figure 5.
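Putting the pieces together, a sketch of supervised separation along the lines of (15) and (16) could look like the following; the unregularized KL multiplicative update plays the role of (8), and all names, dimensions, and parameter values are assumptions made for illustration.

```python
import numpy as np

def separate(Y_mix, dicts, n_iter=200, eps=1e-12):
    """Supervised separation with known per-source dictionaries.

    dicts is a list of nonnegative F x K_s dictionaries A_s; returns the
    per-source spectrogram estimates A_s X_s* of (16)."""
    A = np.hstack(dicts)                       # stacked dictionary
    X = np.random.rand(A.shape[1], Y_mix.shape[1])
    ones = np.ones_like(Y_mix)
    for _ in range(n_iter):                    # KL multiplicative updates
        X *= (A.T @ (Y_mix / (A @ X + eps))) / (A.T @ ones)
    # Split X row-wise into per-source activations and reconstruct (16).
    estimates, row = [], 0
    for A_s in dicts:
        X_s = X[row:row + A_s.shape[1]]
        estimates.append(A_s @ X_s)
        row += A_s.shape[1]
    return estimates
```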
In practice, the decomposition will not be exact and we will only achieve an approximate decomposition, i.e., $\mathbf{Y}_{\mathrm{mix}} \approx \mathbf{A}\mathbf{X}^{*}$, and, as a consequence, $\mathbf{Y}_{\mathrm{mix}}$ is not fully explained by the decomposition. Hence, the separated signal spectrograms given by (16) will not explain the mixed signal completely.
To be able to account for all the energy in the input signal, we can use an alternative method to extract the contributions of the individual sources. Although the separated signals do not completely explain the mixed signal, we assume that they do nevertheless successfully characterize the relative proportions of the individual signals in the mixture. This leads to the following estimate for the separated signals:
$$\hat{\mathbf{Y}}_s = \frac{\mathbf{A}_s\mathbf{X}_s^{*}}{\mathbf{A}\mathbf{X}^{*}} \otimes \mathbf{Y}_{\mathrm{mix}}, \qquad (17)$$
where the division is element-wise, so that each time–frequency bin of the mixture is apportioned to the sources in proportion to their modeled contributions.
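In code, this reapportioning amounts to a soft time–frequency mask applied to the mixture; a hedged sketch, reusing the stacked-dictionary conventions from the previous snippet:

```python
import numpy as np

def mask_reconstruct(Y_mix, dicts, X, eps=1e-12):
    """Reapportion the mixture among sources in proportion to each
    source's modeled contribution A_s X_s relative to the full model AX."""
    A = np.hstack(dicts)
    total = A @ X + eps                # full model AX*
    estimates, row = [], 0
    for A_s in dicts:
        X_s = X[row:row + A_s.shape[1]]
        estimates.append((A_s @ X_s) / total * Y_mix)  # soft mask
        row += A_s.shape[1]
    return estimates
```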