desired and the undesired components vary in an unpredictable
way and call for instantaneous estimates.
In the spectrotemporal domain, voice activity detection and
speech presence probability estimation typically aim at identifying
regions in the STFT domain where only undesired components
are present, e.g., [6, ch. 5], [42]. Obviously, this is very difficult for
the given scenario with interfering speech sources that naturally
occupy the same frequency range and whose temporal activity pat-
tern is generally not known, especially if their signal level is com-
parable to the level of the target source in any of the microphone
signals. Therefore, the desired and undesired components can
usually only be separated along the time axis. For example, for
computing the MWF according to (19), it is typically assumed that
the interference and noise can be observed during noise-only peri-
ods, so that with the assumed uncorrelatedness of noise and
desired speech, the cross-power spectral density matrix of the
desired speech component, $\boldsymbol{\Phi}_{\mathbf{s}_0\mathbf{s}_0}$, can be estimated as
$$\hat{\boldsymbol{\Phi}}_{\mathbf{s}_0\mathbf{s}_0} = \hat{\boldsymbol{\Phi}}_{\mathbf{x}\mathbf{x}} - \hat{\boldsymbol{\Phi}}_{\mathbf{v}\mathbf{v}}, \qquad (24)$$
where $\hat{\boldsymbol{\Phi}}_{\mathbf{x}\mathbf{x}}$ is estimated continually and $\hat{\boldsymbol{\Phi}}_{\mathbf{v}\mathbf{v}}$ during periods of
interference and noise only. As a fundamental problem, however,
all these methods still suffer from the fact that the interference
and noise estimates cannot be updated while the target source is
active, so that they are prone to failure with nonstationary noise
and interference, such as human speakers.
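As an illustration of how (24) is typically realized, the following sketch uses exponentially weighted averaging of the PSD matrices in a single frequency bin. It is a minimal sketch: the variable names, the smoothing factor, and the toy activity pattern are assumptions for this example and are not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4                                   # number of microphones (assumed)

def update_psd(phi, x, alpha=0.95):
    """Exponentially weighted recursive update of a PSD matrix with the
    current STFT frame x (length-M complex vector)."""
    return alpha * phi + (1 - alpha) * np.outer(x, x.conj())

phi_xx = np.zeros((M, M), complex)      # observed-signal PSD, updated continually
phi_vv = np.zeros((M, M), complex)      # noise/interference PSD, noise-only frames

for frame in range(500):
    noise = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
    target_active = (frame % 100) < 50  # toy activity pattern (normally unknown)
    s = rng.standard_normal() + 1j * rng.standard_normal()
    x = noise + (2.0 * s * np.ones(M) if target_active else 0.0)

    phi_xx = update_psd(phi_xx, x)      # estimated continually
    if not target_active:               # requires a noise-only decision
        phi_vv = update_psd(phi_vv, x)

# subtraction step of (24): desired-speech PSD = observed PSD - noise PSD
phi_ss = phi_xx - phi_vv
```

In practice the difference in (24) is not guaranteed to be positive semidefinite, and implementations often project the result back onto the cone of positive semidefinite matrices before using it in the MWF.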
On the other hand, in the spatial domain, reference informa-
tion for all the interference and noise components can be obtained
by suppressing the target source. Here, the spatial selectivity
allowed by the microphone array topology constitutes the main
limitation. Exploiting the spatial domain for obtaining interfer-
ence and noise reference information is an inherent feature of the
GSC (cf. the section “MVDR Beamformer”), where the BM aims to
suppress the target source. For moving sources and multipath
propagation scenarios, robust adaptation schemes for the BM have
already been proposed, e.g., [23]. These concepts still require
knowledge about the activity of the target source, as the BM
should only be adapted when the target source is dominant. If the
DOA of the target source is known, its activity can be monitored
by directing both a delay-and-sum beamformer and a delay-and-
subtract beamformer in this direction and inferring the activity
from the ratio of their output powers; see, e.g., [23]. However, these
noise estimates will still be suboptimal if the BM cannot be
updated while the target source changes its position relative to
the microphones on the user's head or while the acoustic environment changes.
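To make the power-ratio criterion concrete, here is a minimal two-microphone, free-field sketch in the STFT domain; the function name, the decision threshold, and the example delay are assumptions for this illustration and are simplified compared to the robust schemes in [23].

```python
import numpy as np

def target_dominance_ratio(x1, x2, tau, freqs, eps=1e-12):
    """Ratio of the output powers of a delay-and-sum and a
    delay-and-subtract beamformer steered toward the target DOA.

    x1, x2 : STFT frames of the two microphones, shape (num_bins,)
    tau    : relative target delay between the microphones in seconds
    freqs  : center frequency of each STFT bin in Hz
    """
    phase = np.exp(1j * 2.0 * np.pi * freqs * tau)   # align mic 2 to mic 1
    y_sum = 0.5 * (x1 + phase * x2)                  # passes the target direction
    y_diff = 0.5 * (x1 - phase * x2)                 # places a null on the target
    return np.sum(np.abs(y_sum) ** 2) / (np.sum(np.abs(y_diff) ** 2) + eps)

# example: adapt the blocking matrix only when the target clearly dominates
# ratio = target_dominance_ratio(X_left, X_right, tau=0.0002, freqs=freq_axis)
# adapt_bm = ratio > 10.0   # threshold chosen for illustration only
```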
More recently, a constrained BSS scheme has been proposed
to identify the filters of two-channel blocking matrices [40],
which does not need source activity information and continu-
ously delivers up-to-date estimates for noise and interference.
For this, the cost function in (23) is complemented by a quad-
ratic constraint for one output (here $y_p$) steering a null toward
the target source:
$$J_{\mathrm{C}}(\mathbf{W}) = \left|\mathbf{w}_p^{H}\mathbf{d}\right|_2^2, \qquad (25)$$
where $\mathbf{w}_p$ denotes the vector of demixing filters in $\mathbf{W}$ which produce
the output $y_p$, and $\mathbf{d}$ denotes the steering vector corresponding to the
DOA of the direct path of the target source. This yields the constrained
ICA cost function
$$J_{\mathrm{C\text{-}ICA}}(\mathbf{W}) = J_{\mathrm{ICA}}(\mathbf{W}) + \eta\, J_{\mathrm{C}}(\mathbf{W}), \qquad (26)$$
whose minimization suppresses the target in one output channel
and thereby provides a reference for all other sources and noise of
unknown coherence. The weight $\eta$ is typically chosen as
$\eta \approx 0.5 \ldots 0.8$, with larger values required if interfering sources
are close to the target source. It should be noted that, although
the constraint captures only the direct path, constrained ICA will
intrinsically also aim at suppressing all correlated components,
i.e., reflections of the target source, in the same output, thereby
providing an advantage over a delay-and-subtract beamformer as
shown in Figure 6. Most importantly, however, the fundamental
concept of ICA ensures a continuous update of the noise estimate
without the need to estimate the activity of the
involved sources. Recently, it was also shown that this concept can
be generalized to identify all RTFs required for the BM of a GSC
with an arbitrary number of constraints [43].
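For readers who want to see how the constraint enters an update rule, the following per-frequency-bin sketch adds the gradient of the penalty in (25) to a given gradient of the unconstrained ICA cost. It is a sketch under stated assumptions: the names, the Wirtinger-gradient convention, and the default weight (chosen inside the 0.5 … 0.8 range quoted above) are illustrative, and the ICA cost from (23) itself is not reproduced here.

```python
import numpy as np

def constraint_penalty(W, d, p, eta=0.65):
    """Quadratic constraint (25): eta * |w_p^H d|^2. Driving it toward zero
    steers a spatial null toward the target DOA in output y_p."""
    w_p = W[p, :]                          # row p of W: demixing filters for y_p
    return eta * np.abs(np.vdot(w_p, d)) ** 2   # np.vdot conjugates w_p

def constrained_ica_gradient(grad_ica, W, d, p, eta=0.65):
    """Gradient (with respect to conj(W)) of the combined cost (26),
    J_C-ICA(W) = J_ICA(W) + eta * |w_p^H d|^2, given grad_ica, the gradient
    of the unconstrained ICA cost at W (assumed to use the same convention)."""
    grad = grad_ica.astype(complex).copy()
    w_p = W[p, :]
    grad[p, :] += eta * np.vdot(d, w_p) * d     # derivative of the penalty
    return grad

# a simple gradient-descent update for one frequency bin would then read
# W = W - mu * constrained_ica_gradient(grad_ica, W, d, p)
```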
PRESENTATION OF THE ENHANCED SIGNALS
After extracting the target source using data-independent beam-
forming or statistically optimum filtering (cf. the sections
“Data-Independent Beamforming” and “Statistically Optimum
Signal Extraction”), the enhanced signal needs to be presented
to the listener. While microphone placement is important to
maintain a close relationship to the individual HRTFs, we also
need to distinguish between a monaural system, i.e., a single
device on one ear, and a binaural system, i.e., a system jointly
processing the signals at both ears. While for a monaural system it
seems obvious to just feed the enhanced signal to the loud-
speaker of this device, for a binaural system different output
[FIG7] The general binaural processing scheme: left and right microphone signals $x_{L,1},\ldots,x_{L,M}$ and $x_{R,1},\ldots,x_{R,M}$, filters $w_L$ and $w_R$, and output signals $y_L$ and $y_R$.