desired and the undesired components vary in an unpredictable
way and call for instantaneous estimates.
In the spectrotemporal domain, voice activity detection and
speech presence probability estimation typically aim at identifying
regions in the STFT domain where only undesired components
are present, e.g., [6, ch. 5], [42]. Obviously, this is very difficult for
the given scenario with interfering speech sources that naturally
occupy the same frequency range and whose temporal activity pat-
tern is generally not known, especially if their signal level is com-
parable to the level of the target source in any of the microphone
signals. Therefore, the desired and undesired components can
usually only be separated along the time axis. For example, for
computing the MWF according to (19), it is typically assumed that
the interference and noise can be observed during noise-only peri-
ods, so that with the assumed uncorrelatedness of noise and
desired speech, the cross-power spectral density matrix of the
desired speech component, $\boldsymbol{\Phi}_{\mathbf{s}_0\mathbf{s}_0}$, can be estimated as
$$\hat{\boldsymbol{\Phi}}_{\mathbf{s}_0\mathbf{s}_0} = \hat{\boldsymbol{\Phi}}_{\mathbf{x}\mathbf{x}} - \hat{\boldsymbol{\Phi}}_{\mathbf{v}\mathbf{v}}, \qquad (24)$$
where $\hat{\boldsymbol{\Phi}}_{\mathbf{x}\mathbf{x}}$ is estimated continually and $\hat{\boldsymbol{\Phi}}_{\mathbf{v}\mathbf{v}}$ during periods of
interference and noise only. As a fundamental problem, however,
all these methods still suffer from the fact that the interference
and noise estimates cannot be updated while the target source is
active, so that they are prone to failure with nonstationary noise
and interference, such as human speakers.
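As an illustration of how (24) is typically realized, the following sketch uses exponentially weighted averaging of the PSD matrices in a single frequency bin. It is a minimal sketch: the variable names, the smoothing factor, and the toy activity pattern are assumptions for this example and are not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4                                   # number of microphones (assumed)

def update_psd(phi, x, alpha=0.95):
    """Exponentially weighted recursive update of a PSD matrix with the
    current STFT frame x (length-M complex vector)."""
    return alpha * phi + (1 - alpha) * np.outer(x, x.conj())

phi_xx = np.zeros((M, M), complex)      # observed-signal PSD, updated continually
phi_vv = np.zeros((M, M), complex)      # noise/interference PSD, noise-only frames

for frame in range(500):
    noise = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
    target_active = (frame % 100) < 50  # toy activity pattern (normally unknown)
    s = rng.standard_normal() + 1j * rng.standard_normal()
    x = noise + (2.0 * s * np.ones(M) if target_active else 0.0)

    phi_xx = update_psd(phi_xx, x)      # estimated continually
    if not target_active:               # requires a noise-only decision
        phi_vv = update_psd(phi_vv, x)

# subtraction step of (24): desired-speech PSD = observed PSD - noise PSD
phi_ss = phi_xx - phi_vv
```

In practice the difference in (24) is not guaranteed to be positive semidefinite, and implementations often project the result back onto the cone of positive semidefinite matrices before using it in the MWF.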
On the other hand, in the spatial domain, reference informa-
tion for all the interference and noise components can be obtained
by suppressing the target source. Here, the spatial selectivity
allowed by the microphone array topology constitutes the main
limitation. Exploiting the spatial domain for obtaining interfer-
ence and noise reference information is an inherent feature of the
GSC (cf. the section “MVDR Beamformer”), where the BM aims to
suppress the target source. For moving sources and multipath
propagation scenarios, robust adaptation schemes for the BM have
already been proposed, e.g., [23]. These concepts still require
knowledge about the activity of the target source, as the BM
should only be adapted when the target source is dominant. If the
DOA of the target source is known, its activity can be monitored
by directing both a delay-and-sum beamformer and a delay-and-
subtract beamformer in this direction and inferring the activity
from the ratio of their output powers; see, e.g., [23]. However, these
noise estimates will still be suboptimal if the BM cannot be
updated while the target source changes its position relative to
the microphones on the user's head or while the acoustic environment changes.
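To make the power-ratio criterion concrete, here is a minimal two-microphone, free-field sketch in the STFT domain; the function name, the decision threshold, and the example delay are assumptions for this illustration and are simplified compared to the robust schemes in [23].

```python
import numpy as np

def target_dominance_ratio(x1, x2, tau, freqs, eps=1e-12):
    """Ratio of the output powers of a delay-and-sum and a
    delay-and-subtract beamformer steered toward the target DOA.

    x1, x2 : STFT frames of the two microphones, shape (num_bins,)
    tau    : relative target delay between the microphones in seconds
    freqs  : center frequency of each STFT bin in Hz
    """
    phase = np.exp(1j * 2.0 * np.pi * freqs * tau)   # align mic 2 to mic 1
    y_sum = 0.5 * (x1 + phase * x2)                  # passes the target direction
    y_diff = 0.5 * (x1 - phase * x2)                 # places a null on the target
    return np.sum(np.abs(y_sum) ** 2) / (np.sum(np.abs(y_diff) ** 2) + eps)

# example: adapt the blocking matrix only when the target clearly dominates
# ratio = target_dominance_ratio(X_left, X_right, tau=0.0002, freqs=freq_axis)
# adapt_bm = ratio > 10.0   # threshold chosen for illustration only
```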
More recently, a constrained BSS scheme has been proposed
to identify the filters of two-channel blocking matrices [40],
which does not need source activity information and continu-
ously delivers up-to-date estimates for noise and interference.
For this, the cost function in (23) is complemented by a quad-
ratic constraint for one output (here $y_p$) steering a null toward
the target source:
$$J_{\mathrm{C}}(\mathbf{W}) = \left|\mathbf{w}_p^{H}\mathbf{d}\right|_2^2, \qquad (25)$$
where $\mathbf{w}_p$ denotes the vector of demixing filters in $\mathbf{W}$ which produce
the output $y_p$, and $\mathbf{d}$ denotes the steering vector corresponding to the
DOA of the direct path of the target source. This yields the constrained
ICA cost function
$$J_{\mathrm{C\text{-}ICA}}(\mathbf{W}) = J_{\mathrm{ICA}}(\mathbf{W}) + \eta\, J_{\mathrm{C}}(\mathbf{W}), \qquad (26)$$
whose minimization suppresses the target in one output channel
and thereby provides a reference for all other sources and noise of
unknown coherence. The weight $\eta$ is typically chosen as
$\eta \approx 0.5 \ldots 0.8$, with larger values required if interfering sources
are close to the target source. It should be noted that, although
the constraint captures only the direct path, constrained ICA will
intrinsically also aim at suppressing all correlated components,
i.e., reflections of the target source, in the same output, thereby
providing an advantage over a delay-and-subtract beamformer as
shown in Figure 6. Most importantly, however, the fundamental
concept of ICA ensures a continuous update of the noise estimate
without the need to estimate the activity of the
involved sources. Recently, it was also shown that this concept can
be generalized to identify all RTFs required for the BM of a GSC
with an arbitrary number of constraints [43].
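For readers who want to see how the constraint enters an update rule, the following per-frequency-bin sketch adds the gradient of the penalty in (25) to a given gradient of the unconstrained ICA cost. It is a sketch under stated assumptions: the names, the Wirtinger-gradient convention, and the default weight (chosen inside the 0.5 … 0.8 range quoted above) are illustrative, and the ICA cost from (23) itself is not reproduced here.

```python
import numpy as np

def constraint_penalty(W, d, p, eta=0.65):
    """Quadratic constraint (25): eta * |w_p^H d|^2. Driving it toward zero
    steers a spatial null toward the target DOA in output y_p."""
    w_p = W[p, :]                          # row p of W: demixing filters for y_p
    return eta * np.abs(np.vdot(w_p, d)) ** 2   # np.vdot conjugates w_p

def constrained_ica_gradient(grad_ica, W, d, p, eta=0.65):
    """Gradient (with respect to conj(W)) of the combined cost (26),
    J_C-ICA(W) = J_ICA(W) + eta * |w_p^H d|^2, given grad_ica, the gradient
    of the unconstrained ICA cost at W (assumed to use the same convention)."""
    grad = grad_ica.astype(complex).copy()
    w_p = W[p, :]
    grad[p, :] += eta * np.vdot(d, w_p) * d     # derivative of the penalty
    return grad

# a simple gradient-descent update for one frequency bin would then read
# W = W - mu * constrained_ica_gradient(grad_ica, W, d, p)
```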
PRESENTATION OF THE ENHANCED SIGNALS
After extracting the target source using data-independent beam-
forming or statistically optimum filtering (cf. the sections
“Data-Independent Beamforming” and “Statistically Optimum
Signal Extraction”), the enhanced signal needs to be presented
to the listener. While microphone placement is important to
maintain a close relationship to the individual HRTFs, we also
need to distinguish between a monaural system, i.e., a single
device on one ear, and a binaural system, i.e., a system jointly
processing the signals at both ears. While for a monaural system it
seems obvious to just feed the enhanced signal to the loud-
speaker of this device, for a binaural system different output
[FIG7] The general binaural processing scheme: left and right microphone signals $x_{L,1},\ldots,x_{L,M}$ and $x_{R,1},\ldots,x_{R,M}$, filters $w_L$ and $w_R$, and output signals $y_L$ and $y_R$.