IEEE SIGNAL PROCESSING MAGAZINE [53] MARCH 2015
To demonstrate the use of the optimality conditions (19), let us consider the simple $\ell_2$ distortion measure given by

$$ d(a_T, a_L) = \| a_L - a_T \|^2, \tag{20} $$

where $\|\cdot\|$ is the $\ell_2$ norm. In this case, (18) is a convex optimization problem, so that (19) are also sufficient conditions. By
using the optimality conditions (19) under the assumption that $H_E$ and $v_E$ are uncorrelated, and including the hybrid deterministic-stochastic model for $H_E$ introduced in [16], where the early response is described solely by a deterministic direct path and the late response is modeled by an exponentially fading stochastic process, the preprocessing algorithm is derived as

$$ \tilde{a}_T = (D^H D + K)^{-1} D^H a_T, \tag{21} $$
where $D$ is a matrix collecting direct path responses of the channel, and $K$ is a diagonal matrix collecting diffuse reverberation response channel energies. Note that in the case of low reverberation, $K \to 0$, the scheme (21) reduces to a conventional acoustic cross-talk canceler [30], $\tilde{a}_T = D^{-1} a_T$, which, by compensating for the direct paths of the channel $H_E$, makes the cross-signals cancel out at the listeners. We thus conclude that optimization-based multipoint preprocessing enhancement as formulated in (18) leads to acoustic cross-talk cancelation, when applied to the $\ell_2$ distortion measure (20).
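The low-reverberation limit described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes that (21) reads $\tilde{a}_T = (D^H D + K)^{-1} D^H a_T$, that $D$ is square and invertible, and that the vectors describe a single frequency bin. The function names `preprocess` and `crosstalk_canceler`, the matrix size, and the random test data are all hypothetical choices made for illustration.

```python
import numpy as np


def preprocess(D, K_diag, a_T):
    """Assumed form of (21): solve (D^H D + K) x = D^H a_T for x.

    D      : (n, n) complex matrix of direct-path responses (assumed square)
    K_diag : (n,) nonnegative diffuse-reverberation energies (diagonal of K)
    a_T    : (n,) complex target vector
    """
    D = np.asarray(D, dtype=complex)
    K = np.diag(np.asarray(K_diag, dtype=float))
    gram = D.conj().T @ D + K
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(gram, D.conj().T @ a_T)


def crosstalk_canceler(D, a_T):
    """Conventional cross-talk canceler: x = D^{-1} a_T."""
    return np.linalg.solve(np.asarray(D, dtype=complex), a_T)


# Hypothetical test data: a well-conditioned 3x3 direct-path matrix.
rng = np.random.default_rng(0)
n = 3
D = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)) + 3 * np.eye(n)
a_T = rng.standard_normal(n) + 1j * rng.standard_normal(n)

x_low = preprocess(D, np.zeros(n), a_T)      # K = 0: no diffuse reverberation
x_ctc = crosstalk_canceler(D, a_T)           # plain inversion of D
x_rev = preprocess(D, np.full(n, 5.0), a_T)  # strong diffuse reverberation

print("matches cross-talk canceler for K = 0:", np.allclose(x_low, x_ctc))
print("output norm, K = 0:", np.linalg.norm(x_low))
print("output norm, K = 5:", np.linalg.norm(x_rev))
```

With $K = 0$ the two solutions coincide, since $(D^H D)^{-1} D^H = D^{-1}$ for invertible $D$. A larger $K$ acts like a regularization term and shrinks the preprocessed signal, which is one way to read the role of the diffuse reverberation energies in this sketch.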
CONCLUSIONS AND OPEN PROBLEMS
Modern speech communication often leads to the signal being rendered by a machine in a noisy environment. In these circumstances, communication benefits from methods that make speech more intelligible in noise, particularly if the enhancement can adapt to the scenario at hand. This requires quantitative models of the communication process and distortion measures.
The use of a distortion measure facilitates the formulation of convergent algorithms and generally reduces the need for ad hoc solutions. Measures formulated at a high level of abstraction, such as (1) and (3), apply, at least in principle, to all communication tasks. However, when these high-level measures are applied to specific tasks, assumptions must be made, either for the signal or for a model of the human cognitive system (e.g., by an ASR system), or both. Thus, optimization of any measure can never remove the need for extensive real-world testing to verify the performance of an intelligibility-enhancement system for the task at hand.
At first sight, the intelligibility-enhancement problem resembles the standard problem of transmission over a noisy channel. However, we have shown that the imprecise nature of human production and interpretation must be accounted for. When that is done, standardized measures for intelligibility, which have a long history and were derived heuristically, are found to be consistent with communication theory.
While the field of intelligibility enhancement has developed rapidly, opportunities for significant improvement remain. Careful accounting for time-domain masking may improve performance. Methods developed for scenarios with additive noise only must be extended to include reverberation. Refining methods that perform spectral shaping to include range compression may increase their performance. For methods based on mutual information, the effect of time and frequency dependencies must be considered. Studies to determine the best representation (e.g., cepstra or DFT coefficients) and the determination and usage of appropriate noise distributions for the model will likely lead to improvement. The determination of a word choice for a message that is more robust to noise is an essentially unsolved task.
Although major challenges remain, the field of intelligibility enhancement has made major strides in recent years. The technical outcomes will likely become an integral part of speech-rendering devices in the near future, leading to improved communication among humans and from machines to humans.
AUTHORS
W. Bastiaan Kleijn (bastiaan.kleijn@ecs.vuw.ac.nz) received the Ph.D. degree in electrical engineering from Delft University of Technology, The Netherlands (TU Delft); an M.S.E.E. degree from Stanford University; and a Ph.D. degree in soil science and an M.Sc. degree in physics from the University of California, Riverside. He is a professor at Victoria University of Wellington, New Zealand, and TU Delft, The Netherlands (part-time). He was a professor and head of the Sound and Image Processing Laboratory at The Royal Institute of Technology (KTH), Stockholm, Sweden, from 1996 until 2010 and a founder of Global IP Solutions, a company that provided the original audio technology to Skype and was later acquired by Google. Before 1996, he was with the Research Division of AT&T Bell Laboratories in Murray Hill, New Jersey. He is an IEEE Fellow.
João B. Crespo (j.b.farinhapereiracrespo@student.vu.nl) is a Ph.D. student in the Circuits and Systems Group of Delft University of Technology, The Netherlands. In 2009, he received his M.Sc. degree in electrical engineering from the Technical University of Lisbon, Portugal. During the last year of his M.Sc. studies, he was an exchange student at the Information and Communication Theory Group of Delft University of Technology. In 2010–2011, he worked at ExSilent B.V., The Netherlands, as a digital signal processing developer. His areas of interest include audio and speech processing, auditory perception, and information theory.
Richard C. Hendriks (R.C.Hendriks@tudelft.nl) obtained the M.Sc. and Ph.D. degrees (both cum laude) in electrical engineering from Delft University of Technology, The Netherlands,