the spectral character of each sound. The speech dictionary contains some atoms that have the harmonic structure of vowels, and others that contain a lot of broadband energy at the high frequencies, representing consonant sounds. For the piano dictionary, we obtain atoms that have a harmonic structure, each predominantly describing a different note.
An alternative dictionary-learning technique is based on clustering. In clustering, data samples are first grouped based on some measure of similarity, after which a representation of each group (cluster) becomes a dictionary atom. A popular technique is the k-means clustering approach [25]. Another alternative is given by dictionary-learning techniques employed in the field of sparse representations and compressed sensing [39], which aim at finding dictionaries that can sparsely represent a data set. Although most of these methods do not conform to the nonnegativity constraints of the compositional models we discuss in this article, at least one popular method, K-singular value decomposition (K-SVD), which has a nonnegative variant, has been used for dictionary learning of audio signals [38].
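A minimal sketch of the clustering approach, assuming scikit-learn is available: spectrogram frames are clustered with k-means, and the cluster centroids become atoms. The function name and the normalization step are illustrative choices, not prescribed by the article.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_dictionary(D, n_atoms, seed=0):
    """Cluster spectrogram frames; use cluster centroids as dictionary atoms.

    D: nonnegative magnitude spectrogram, shape (n_freq, n_frames).
    Returns A with shape (n_freq, n_atoms), one normalized atom per column.
    """
    km = KMeans(n_clusters=n_atoms, n_init=10, random_state=seed).fit(D.T)
    A = np.maximum(km.cluster_centers_.T, 0.0)          # guard against tiny negatives
    return A / (A.sum(axis=0, keepdims=True) + 1e-12)   # unit-sum columns
```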
The advantage of dictionary learning is that it typically yields dictionaries that generalize well to unobserved data. The NMF- and sparsity-based methods both use the fact that atoms can linearly combine to model the training data, rather than having atoms that each individually need to model an observation as well as possible. This naturally leads to parts-based dictionaries, in which only parts of the spectra contain energy. This in turn leads to small dictionaries and very sparse representations, which may also be more interpretable for some phenomena. When the different sources are highly related, however, this may also be a disadvantage because a parts-based dictionary may no longer be discriminative with respect to other dictionaries. The clustering approach typically yields dictionaries that are larger but more discriminative.
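As a concrete illustration of NMF-based dictionary learning, the sketch below factorizes a magnitude spectrogram $\mathbf{D} \approx \mathbf{A}\mathbf{X}$ with nonnegative factors, using scikit-learn's NMF with the Kullback-Leibler divergence common in audio work. The parameter values are assumptions for illustration, not settings from the article.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_dictionary(D, n_atoms):
    """Learn a parts-based dictionary A with D ~= A @ X, all entries nonnegative.

    D: magnitude spectrogram, shape (n_freq, n_frames).
    Returns A (n_freq, n_atoms) with unit-sum columns, and activations X.
    """
    model = NMF(n_components=n_atoms, beta_loss='kullback-leibler',
                solver='mu', init='nndsvda', max_iter=500)
    A = model.fit_transform(D)        # (n_freq, n_atoms): dictionary atoms
    X = model.components_             # (n_atoms, n_frames): activations
    scale = A.sum(axis=0, keepdims=True) + 1e-12
    return A / scale, X * scale.T     # rescale so the product A @ X is unchanged
```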
While dictionary learning is a powerful method to create small dictionaries, it can be difficult to train overcomplete dictionaries, in which there are many more atoms than features. A large number of atoms would naturally increase the representation capability of the model, but learning overcomplete dictionaries from data then requires additional constraints such as sparsity and careful tuning, as will be discussed in the next section. As an alternative to learning dictionaries that represent the training data, dictionary atoms can also be sampled from the data. Given a training data set $\mathbf{D}_s$, the dictionary $\mathbf{A}_s$ is constructed as a subset of the columns of $\mathbf{D}_s$.
By far the simplest method is random sampling, where the dictionary is formed by a random subset of the columns of $\mathbf{D}_s$. Interestingly, dictionaries obtained with this approach yield results comparable, and often superior, to those of more complex dictionary-creation schemes [4]. The example in Figure 5 used randomly sampled atoms representing isolated speech digits and background noise.
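Random sampling amounts to a few lines of code. The following sketch, with an illustrative function name, draws atoms uniformly without replacement from the columns of a training spectrogram $\mathbf{D}_s$:

```python
import numpy as np

def sample_dictionary(D_s, n_atoms, seed=None):
    """Form a dictionary A_s from a random subset of the columns of D_s.

    D_s: training data for one source, shape (n_freq, n_frames).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(D_s.shape[1], size=n_atoms, replace=False)
    return D_s[:, idx]
```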
The sampling methods typically require little tuning and allow for the creation of large, overcomplete dictionaries. A disadvantage is that they may not generalize as well to unseen data, and that smaller dictionaries are often incapable of accurately modeling a source because they disregard the fact that atoms can linearly combine to model an observation.
An alternative approach to dictionary creation, which avoids the need for training data, is to create dictionaries by using prior knowledge of the structure of the signals. For example, in music transcription, harmonic atoms that represent different fundamental frequencies have been successfully used [8]. In the excitation-filter model [5] described later in this article, atoms can describe filter bank responses and excitations. This approach is only used in a small number of specialized applications because, while it yields small dictionaries that generalize well, they are typically not very discriminative.
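As a sketch of the prior-knowledge route, in the spirit of the harmonic atoms used for music transcription [8], the code below builds one atom per fundamental frequency by placing Gaussian peaks at its integer multiples. All parameter values (sample rate, bin count, note range, peak width) are illustrative assumptions.

```python
import numpy as np

def harmonic_atom(f0, n_freq=513, sr=16000, width_bins=2.0):
    """One harmonic atom: Gaussian peaks at multiples of f0 (Hz), up to Nyquist."""
    freqs = np.linspace(0.0, sr / 2, n_freq)        # bin center frequencies in Hz
    sigma = width_bins * (sr / 2) / (n_freq - 1)    # peak width converted to Hz
    atom = np.zeros(n_freq)
    for h in np.arange(f0, sr / 2, f0):             # every harmonic below Nyquist
        atom += np.exp(-0.5 * ((freqs - h) / sigma) ** 2)
    return atom / (atom.sum() + 1e-12)

# One atom per semitone over a piano-like range (MIDI notes 36-96):
A = np.stack([harmonic_atom(440.0 * 2 ** ((m - 69) / 12)) for m in range(36, 97)],
             axis=1)                                # shape (n_freq, n_atoms)
```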
THE NUMBER OF ATOMS IN THE DICTIONARY
Let us now more carefully consider the issue of the number of atoms in the dictionary. Dictionary atoms are assumed to represent basic atomic spectral structures that a sound source may produce.
[FIG6] Learning dictionaries from different sound classes. (a) An input magnitude spectrogram for a speech recording and the dictionary that was extracted from it. (b) A piano recording input and its corresponding dictionary. Note how both dictionaries capture salient spectral features from each sound. (Panels plot frequency in kHz against time in seconds for the spectrograms and against atom index for the dictionaries.)