the spectral character of each sound. The speech dictionary contains some atoms that have the harmonic structure of vowels, and others that contain a lot of broadband energy at the high frequencies, representing consonant sounds. For the piano dictionary, we obtain atoms that have a harmonic structure, each predominantly describing a different note.
An alternative dictionary-learning technique is based on clustering. In clustering, data samples are first grouped based on some measure of similarity, after which a representation of each group (cluster) becomes a dictionary atom. A popular technique is the k-means clustering approach [25]. Another alternative is given by dictionary-learning techniques employed in the field of sparse representations and compressed sensing [39], which aim at finding dictionaries that can sparsely represent a data set. Although most of these methods do not conform to the nonnegativity constraints of the compositional models we discuss in this article, at least one popular method, K-singular value decomposition (K-SVD), which has a nonnegative variant, has been used for dictionary learning of audio signals [38].
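A minimal sketch of the clustering approach, assuming scikit-learn is available: spectrogram frames are clustered with k-means, and the cluster centroids become atoms. The function name and the normalization step are illustrative choices, not prescribed by the article.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_dictionary(D, n_atoms, seed=0):
    """Cluster spectrogram frames; use cluster centroids as dictionary atoms.

    D: nonnegative magnitude spectrogram, shape (n_freq, n_frames).
    Returns A with shape (n_freq, n_atoms), one normalized atom per column.
    """
    km = KMeans(n_clusters=n_atoms, n_init=10, random_state=seed).fit(D.T)
    A = np.maximum(km.cluster_centers_.T, 0.0)          # guard against tiny negatives
    return A / (A.sum(axis=0, keepdims=True) + 1e-12)   # unit-sum columns
```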
The advantage of dictionary learning is that it typically yields dictionaries that generalize well to unobserved data. The NMF- and sparsity-based methods both use the fact that atoms can linearly combine to model the training data, rather than having atoms that each individually need to model an observation as well as possible. This naturally leads to parts-based dictionaries, in which only parts of the spectra contain energy. This in turn leads to small dictionaries and very sparse representations, which may also be more interpretable for some phenomena. When the different sources are highly related, however, this may also be a disadvantage because a parts-based dictionary may no longer be discriminative with respect to other dictionaries. The clustering approach typically yields dictionaries that are larger but more discriminative.
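As a concrete illustration of NMF-based dictionary learning, the sketch below factorizes a magnitude spectrogram $\mathbf{D} \approx \mathbf{A}\mathbf{X}$ with nonnegative factors, using scikit-learn's NMF with the Kullback-Leibler divergence common in audio work. The parameter values are assumptions for illustration, not settings from the article.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_dictionary(D, n_atoms):
    """Learn a parts-based dictionary A with D ~= A @ X, all entries nonnegative.

    D: magnitude spectrogram, shape (n_freq, n_frames).
    Returns A (n_freq, n_atoms) with unit-sum columns, and activations X.
    """
    model = NMF(n_components=n_atoms, beta_loss='kullback-leibler',
                solver='mu', init='nndsvda', max_iter=500)
    A = model.fit_transform(D)        # (n_freq, n_atoms): dictionary atoms
    X = model.components_             # (n_atoms, n_frames): activations
    scale = A.sum(axis=0, keepdims=True) + 1e-12
    return A / scale, X * scale.T     # rescale so the product A @ X is unchanged
```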
While dictionary learning is a powerful method to create small dictionaries, it can be difficult to train overcomplete dictionaries, in which there are many more atoms than features. A large number of atoms would naturally increase the representation capability of the model, but learning overcomplete dictionaries from data then requires additional constraints such as sparsity and careful tuning, as will be discussed in the next section. As an alternative to learning dictionaries that represent the training data, dictionary atoms can also be sampled from the data. Given a training data set $\mathbf{D}_s$, the dictionary $\mathbf{A}_s$ is constructed as a subset of the columns of $\mathbf{D}_s$.
By far the simplest method is random sampling, where the dictionary is formed by a random subset of the columns of $\mathbf{D}_s$. Interestingly, dictionaries obtained with this approach yield results comparable, and often superior, to those of more complex dictionary-creation schemes [4]. The example in Figure 5 used randomly sampled atoms representing isolated speech digits and background noise.
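Random sampling amounts to a few lines of code. The following sketch, with an illustrative function name, draws atoms uniformly without replacement from the columns of a training spectrogram $\mathbf{D}_s$:

```python
import numpy as np

def sample_dictionary(D_s, n_atoms, seed=None):
    """Form a dictionary A_s from a random subset of the columns of D_s.

    D_s: training data for one source, shape (n_freq, n_frames).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(D_s.shape[1], size=n_atoms, replace=False)
    return D_s[:, idx]
```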
The sampling methods typically require little tuning and allow for the creation of large, overcomplete dictionaries. A disadvantage is that they may not generalize as well to unseen data, and that smaller dictionaries are often incapable of accurately modeling a source because they disregard the fact that atoms can linearly combine to model an observation.
An alternative approach to dictionary creation, which avoids the need for training data, is to create dictionaries by using prior knowledge of the structure of the signals. For example, in music transcription, harmonic atoms that represent different fundamental frequencies have been successfully used [8]. In the excitation-filter model [5] described later in this article, atoms can describe filter bank responses and excitations. This approach is only used in a small number of specialized applications because, while it yields small dictionaries that generalize well, they are typically not very discriminative.
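As a sketch of the prior-knowledge route, in the spirit of the harmonic atoms used for music transcription [8], the code below builds one atom per fundamental frequency by placing Gaussian peaks at its integer multiples. All parameter values (sample rate, bin count, note range, peak width) are illustrative assumptions.

```python
import numpy as np

def harmonic_atom(f0, n_freq=513, sr=16000, width_bins=2.0):
    """One harmonic atom: Gaussian peaks at multiples of f0 (Hz), up to Nyquist."""
    freqs = np.linspace(0.0, sr / 2, n_freq)        # bin center frequencies in Hz
    sigma = width_bins * (sr / 2) / (n_freq - 1)    # peak width converted to Hz
    atom = np.zeros(n_freq)
    for h in np.arange(f0, sr / 2, f0):             # every harmonic below Nyquist
        atom += np.exp(-0.5 * ((freqs - h) / sigma) ** 2)
    return atom / (atom.sum() + 1e-12)

# One atom per semitone over a piano-like range (MIDI notes 36-96):
A = np.stack([harmonic_atom(440.0 * 2 ** ((m - 69) / 12)) for m in range(36, 97)],
             axis=1)                                # shape (n_freq, n_atoms)
```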
THE NUMBER OF ATOMS IN THE DICTIONARY
Let us now more carefully consider the issue of the number of atoms in the dictionary. Dictionary atoms are assumed to represent basic atomic spectral structures that a sound source may produce.
[FIG6] Learning dictionaries from different sound classes. (a) An input magnitude spectrogram for a speech recording and the dictionary that was extracted from it. (b) A piano recording input and its corresponding dictionary. Note how both dictionaries capture salient spectral features from each sound. (Panels plot frequency in kHz against time in seconds for the spectrograms and against atom index for the dictionaries.)