
information to transmit, which negatively impacts intelligibility
and the perception of quality. To use the compositional model
approach for this task, two dictionaries are first constructed: a
bandwidth-limited dictionary A and a full-bandwidth dictionary L.
The atoms in the dictionaries should be coupled, i.e., each atom in
A should represent a band-limited version of the corresponding
atom in L. This can be done through training on parallel corpora
of full-bandwidth and band-limited signals, or by calculating A
from L if the details of the band-limitation process are known
and can be modeled computationally. We then estimate the atom
activations x[t] using the limited-bandwidth observation y[t] and
the limited-bandwidth dictionary A. Finally, direct application of
(17) with the full-bandwidth dictionary L, i.e., computing Lx[t]
in place of the reconstruction Ax[t], yields a full-bandwidth
reconstruction. We illustrate this process in Figure 9. Very
similar principles underlie voice conversion, in which the
associated audio is that of another speaker [51], [52].
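As a rough illustration of this pipeline (not from the article; the dictionary contents and all sizes below are hypothetical placeholders), the following Python sketch estimates activations against a band-limited dictionary using standard KL-divergence multiplicative updates and then reconstructs with the coupled full-bandwidth dictionary:

```python
# A minimal sketch of bandwidth extension with coupled dictionaries.
# A (band-limited) and L (full-bandwidth) are assumed given, e.g.,
# learned from parallel corpora; random placeholders are used here.
import numpy as np

def estimate_activations(Y, A, n_iter=200, eps=1e-12):
    """Estimate nonnegative activations X such that Y ~ A @ X,
    keeping A fixed (KL-divergence multiplicative updates)."""
    K, T = A.shape[1], Y.shape[1]
    X = np.random.rand(K, T)
    for _ in range(n_iter):
        V = A @ X + eps                          # current approximation
        X *= (A.T @ (Y / V)) / (A.sum(axis=0)[:, None] + eps)
    return X

F_low, F_full, K, T = 64, 257, 40, 100           # hypothetical sizes
A = np.random.rand(F_low, K)                     # band-limited dictionary
L = np.random.rand(F_full, K)                    # coupled full-band dictionary
Y = np.random.rand(F_low, T)                     # band-limited magnitude spectra

X = estimate_activations(Y, A)                   # activations from narrow band
Y_full = L @ X                                   # full-bandwidth reconstruction
```

The key point is that the activations are estimated only from the band-limited observation; the coupling of the two dictionaries then carries them over to the full band.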
Missing data imputation [29], [53], [54] is closely related to
bandwidth extension in that the goal is to estimate a full-spectrum
audio signal, but with the difference that the missing data are not
a set of predetermined frequency bands but rather arbitrarily
located time–frequency entries of the spectrogram. Algorithms for
compositional models can easily be modified so that model
parameters are estimated using only the observed part of the data
(ignoring missing entries) [29], [54], while the model output can
still be calculated for the entries corresponding to the missing
data. Provided that there is a sufficient amount of observed data
to estimate the activations (and the atoms, in the case of
unsupervised processing), reasonable estimates of the missing
values can be obtained because of the dependencies between observed
and missing values. In general, the quality of a model can be
judged by its ability to make predictions, and the capability of
compositional models to predict missing data also illustrates
their effectiveness.
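Concretely, the modification amounts to weighting the estimation by a binary mask. The sketch below (again a hypothetical illustration with randomly generated data) runs KL-divergence NMF updates that use only observed spectrogram entries and then fills in the missing ones from the model output:

```python
# A minimal sketch of missing-data imputation with a compositional model.
# M is a binary mask (1 = observed, 0 = missing); parameters are estimated
# from observed entries only, and the model output A @ X fills the gaps.
import numpy as np

def masked_nmf(Y, M, K, n_iter=200, eps=1e-12):
    """KL-divergence NMF that ignores missing entries (M == 0)
    when estimating the atoms A and activations X."""
    F, T = Y.shape
    A = np.random.rand(F, K)
    X = np.random.rand(K, T)
    for _ in range(n_iter):
        R = M * (Y / (A @ X + eps))        # data/model ratio, observed only
        X *= (A.T @ R) / (A.T @ M + eps)   # update activations
        R = M * (Y / (A @ X + eps))
        A *= (R @ X.T) / (M @ X.T + eps)   # update atoms
    return A, X

Y = np.random.rand(257, 100)                        # spectrogram with gaps
M = (np.random.rand(*Y.shape) > 0.3).astype(float)  # 1 = observed, 0 = missing
A, X = masked_nmf(Y, M, K=40)
Y_imputed = np.where(M > 0, Y, A @ X)               # impute only missing entries
```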
EXCITATION-FILTER MODEL AND
CHANNEL COMPENSATION
Creating dictionaries from training data, as presented earlier,
yields accurate representations as long as the data from which the
dictionaries are learned match the observed data. In many practical
situations, this is not the case, and there is a need to adapt the
learned dictionaries. Moreover, we often have knowledge about the
types of sources to be modeled, e.g., that they are musical
instruments, but do not have suitable training data to estimate the
dictionaries in a supervised manner.
Natural sound sources can be modeled as an excitation signal being
filtered by an instrument body filter or a vocal tract filter.
These kinds of excitation-filter (or source-filter) models have
been very effective, e.g., in speech coding, where several codecs
use them. In addition to modeling the properties of a body filter,
the filter can also model the response from a source to a
microphone and can therefore also perform channel compensation.
In the context of compositional models, excitation-filter models
have been found useful in, e.g., music processing [55], [56], where
the excitations and the filters carry different types of information:
excitations typically consist of harmonic spectra with different
fundamental frequency values and are therefore useful for pitch
estimation, whereas the filters carry instrument-dependent
information that can be used for instrument recognition [5].
[Figure 8 plot: activation (0 to 0.7) for each digit label (0 to 9 and oh); the largest terms of the linear combination are 0.2 x zero + 0.1 x noise + 0.09 x zero + 0.08 x two + 0.08 x noise + ...]
[FIG8] By associating each dictionary atom from Figure 5 with a word label, the linear combination of speech atoms in Figure 5 serves
directly as evidence for the underlying word classes. We observe that the word zero, underlying the noisy observation of Figure 5,
does indeed obtain the highest score.