
IEEE SIGNAL PROCESSING MAGAZINE [53] MARCH 2015

To demonstrate the use of the optimality conditions (19), let us consider the simple $\ell_2$ distortion measure given by

$$d(a_T, a_L) = \| a_L - a_T \|, \qquad (20)$$

where $\|\cdot\|$ is the $\ell_2$ norm. In this case, (18) is a convex optimization problem, so that (19) are also sufficient conditions. By using the optimality conditions (19) under the assumption that […] and v are uncorrelated, and including the hybrid deterministic-stochastic model for H introduced in [16], where the early response is described solely by a deterministic direct path and the late response is modeled by an exponentially fading stochastic process, the preprocessing algorithm is derived as

$$a \mapsto (D^{H} D + K)^{-1} D^{H} a, \qquad (21)$$

where D is a matrix collecting the direct-path responses of the channel, and K is a diagonal matrix collecting the energies of the diffuse reverberation response of the channel. Note that in the case of low reverberation, $K \to 0$, the scheme (21) reduces to a conventional acoustic cross-talk canceler [30], $a \mapsto D^{-1} a$, which, by compensating for the direct paths of the channel H, makes the cross-signals cancel out at the listeners. We thus conclude that optimization-based multipoint preprocessing enhancement as formulated in (18) leads to acoustic cross-talk cancelation when applied to the $\ell_2$ distortion measure (20).
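The low-reverberation limit described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a real-valued 2x2 system, a regularized-inverse map of the form (D^T D + K)^{-1} D^T (transpose rather than conjugate transpose, since the example is real-valued), and made-up values for D, K, and the input vector. The helper names (`matmul`, `inv2`, `preprocess`) are hypothetical.

```python
# Sketch of a regularized-inverse preprocessing map for a 2x2 system.
# Assumptions (not from the article): real-valued matrices, illustrative
# numbers for D and K, and the form (D^T D + K)^{-1} D^T.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]

def add(A, B):
    """Add two matrices of equal shape element by element."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def inv2(A):
    """Invert a 2x2 matrix with a nonzero determinant."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def preprocess(D, K, a):
    """Apply (D^T D + K)^{-1} D^T to the vector a."""
    Dt = transpose(D)
    W = matmul(inv2(add(matmul(Dt, D), K)), Dt)
    return [row[0] for row in matmul(W, [[x] for x in a])]

# Hypothetical direct-path matrix: strong direct paths, weaker cross paths.
D = [[1.0, 0.4], [0.3, 1.0]]
a = [1.0, -0.5]

# Low-reverberation limit K = 0: the map equals the plain inverse of D,
# so passing the result through the direct paths reproduces the input.
x0 = preprocess(D, [[0.0, 0.0], [0.0, 0.0]], a)
received0 = [row[0] for row in matmul(D, [[x] for x in x0])]

# Nonzero diffuse energies on the diagonal of K regularize the inverse
# and reduce the energy of the preprocessed signal.
x1 = preprocess(D, [[0.5, 0.0], [0.0, 0.5]], a)
```

With K set to zero, `received0` matches `a` up to rounding, which is the cross-talk cancelation behavior. With a positive diagonal K, the output `x1` has lower energy than `x0`, illustrating how the diffuse-energy term tempers the inversion.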

CONCLUSIONS AND OPEN PROBLEMS

Modern speech communication often leads to the signal being rendered by a machine in a noisy environment. In these circumstances, communication benefits from methods that make speech more intelligible in noise, particularly if the enhancement can adapt to the scenario at hand. This requires quantitative models of the communication process and distortion measures.

The use of a distortion measure facilitates the formulation of convergent algorithms and generally reduces the need for ad hoc solutions. Measures formulated at a high level of abstraction, such as (1) and (3), apply, at least in principle, to all communication tasks. However, when these high-level measures are applied to specific tasks, assumptions must be made, either for the signal or for a model of the human cognitive system (e.g., by an ASR system), or both. Thus, optimization of any measure can never replace the need for extensive real-world testing to verify the performance of an intelligibility-enhancement system for the task at hand.

At first sight, the intelligibility-enhancement problem resembles the standard problem of transmission over a noisy channel. However, we have shown that the imprecise nature of human production and interpretation must be accounted for. When that is done, standardized measures for intelligibility, which have a long history and were derived heuristically, are found to be consistent with communication theory.

While the field of intelligibility enhancement has developed rapidly, opportunities for significant improvement remain. Careful accounting for time-domain masking may improve performance. Methods developed for scenarios with additive noise only must be extended to include reverberation. Refining methods that perform spectral shaping to include range compression may increase their performance. For methods based on mutual information, the effect of time and frequency dependencies must be considered. Studies to determine the best representation (e.g., cepstra or DFT coefficients) and the determination and usage of appropriate noise distributions for the model will likely lead to improvement. Choosing the wording of a message so that it is more robust to noise remains an essentially unsolved task.

Although major challenges remain, the field of intelligibility enhancement has made major strides in recent years. The technical outcomes will likely become an integral part of speech-rendering devices in the near future, leading to improved communication among humans and from machines to humans.

AUTHORS

W. Bastiaan Kleijn (bastiaan.kleijn@ecs.vuw.ac.nz) received the Ph.D. degree in electrical engineering from Delft University of Technology, The Netherlands (TU Delft); an M.S.E.E. degree from Stanford University; and a Ph.D. degree in soil science and an M.Sc. degree in physics from the University of California, Riverside. He is a professor at Victoria University of Wellington, New Zealand, and TU Delft, The Netherlands (part-time). He was a professor and head of the Sound and Image Processing Laboratory at The Royal Institute of Technology (KTH), Stockholm, Sweden, from 1996 until 2010 and a founder of Global IP Solutions, a company that provided the original audio technology to Skype and was later acquired by Google. Before 1996, he was with the Research Division of AT&T Bell Laboratories in Murray Hill, New Jersey. He is an IEEE Fellow.

João B. Crespo (j.b.farinhapereiracrespo@student.vu.nl) is a Ph.D. student in the Circuits and Systems Group of Delft University of Technology, The Netherlands. In 2009, he received his M.Sc. degree in electrical engineering from the Technical University of Lisbon, Portugal. During the last year of his M.Sc. studies, he was an exchange student at the Information and Communication Theory Group of Delft University of Technology. In 2010–2011, he worked at ExSilent B.V., The Netherlands, as a digital signal processing developer. His areas of interest include audio and speech processing, auditory perception, and information theory.

Richard C. Hendriks (R.C.Hendriks@tudelft.nl) obtained the M.Sc. and Ph.D. degrees (both cum laude) in electrical engineering from Delft University of Technology, The Netherlands,
