
IEEE SIGNAL PROCESSING MAGAZINE [105] MARCH 2015
potential components of sources [18]. Another technique
applied in single-channel BSS is the computational auditory
scene analysis (CASA) that simulates the segregation and group-
ing mechanism of the human auditory system [19] on the model-
based representation (monaural case) of the auditory scenes. An
important aspect worth considering is the direction of each
extracted source, which usually comes as a by-product in
multichannel BSS. In single-channel BSS, the source
directions have to be provided separately.
DECOMPOSITION USING PAE
In most sound scenes, the mixture comprises not only the dry
sources but also the reverberation and ambient sound, which are
contributed by the acoustics of the surrounding space. Therefore,
the mixing model of the sources in BSS usually does not match
the actual sound scenes. In this article, we refer to the dominant
sources as primary (or direct) components, while the signals
contributed by the sound environment are referred to as ambient
(or diffuse) components. The primary and ambient components
are perceived to be directional and diffuse, respectively. Different
rendering methods should be applied to the primary and ambient
components [6], [7] due to their perceptual differences. Therefore,
rendering of natural sound scenes requires the decomposition of
the mixtures into primary and ambient components [6], [7], [9].
Since stereo is still the most widely used format for digital media
content, our discussion on the decomposition using PAE is
focused on stereo signals ($M = 2$).
In PAE, we often follow some intuitive signal models as
discussed in [3], [5], [7], [8], and [20]. In the $m$th channel, the
mixture $x_m(n)$ is assumed to be the sum of the primary
component $p_m(n)$ and the ambient component $a_m(n)$, i.e.,
$x_m(n) = p_m(n) + a_m(n)$.
The discrimination of directional
primary components and diffuse ambient components is mainly
based on their interchannel correlations, where the primary and
ambient components in the two channels are assumed to be
correlated and uncorrelated, respectively. In the basic mixing
model for PAE, the primary components are assumed to be
amplitude panned, while the ambient components are of
approximately equal levels in all channels.
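
The basic mixing model can be illustrated with a small synthetic example. The following is a minimal sketch, not taken from the cited works: it assumes a single white Gaussian source, a hypothetical panning factor `k`, and independent white Gaussian ambience scaled by a hypothetical `ambient_gain`. It only shows that the primary components are fully correlated across channels, the ambient components are nearly uncorrelated, and the mixtures fall in between.

```python
import math
import random


def correlation(u, v):
    """Zero-lag normalized correlation coefficient of two equal-length signals."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return num / den


def make_stereo_mixture(n_samples=20000, k=2.0, ambient_gain=0.5, seed=0):
    """Synthesize x_m(n) = p_m(n) + a_m(n) for channels m = 0, 1.

    Assumptions (illustrative only): one white Gaussian source, amplitude
    panned with factor k, plus independent white Gaussian ambience of equal
    level in both channels.
    """
    rng = random.Random(seed)
    source = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    # Amplitude-panned primary components: p_1(n) = k * p_0(n).
    p = [source, [k * s for s in source]]
    # Ambient components: independent noise with the same level per channel.
    a = [[ambient_gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]
         for _ in range(2)]
    # Mixtures: sum of primary and ambient components in each channel.
    x = [[pm + am for pm, am in zip(p[m], a[m])] for m in range(2)]
    return x, p, a


x, p, a = make_stereo_mixture()
rho_primary = correlation(p[0], p[1])  # primary components: fully correlated
rho_ambient = correlation(a[0], a[1])  # ambient components: nearly uncorrelated
rho_mixture = correlation(x[0], x[1])  # mixtures: somewhere in between
```

Lowering `ambient_gain` pushes `rho_mixture` toward 1, which corresponds to a signal dominated by its primary components.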
Based on these assumptions, various approaches are proposed
in PAE for stereo signals. Similar to BSS, time-frequency masking
approaches are introduced to extract ambient components
$\hat{A}_m(n, l)$ [7], [20], and these approaches can be generalized as

$$\hat{A}_m(n, l) = X_m(n, l)\,\Psi_A(n, l), \tag{5}$$

where $0 \le \Psi_A(n, l) \le 1$ is the real-valued ambient mask at the
time-frequency bin $(n, l)$.
Time-frequency bins having high inter-
channel correlation are considered to be primary components (or
mostly primary components in the soft masking case), whereas low
correlation bins are more likely to be ambient components.
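
One possible way to turn this idea into a soft ambient mask is sketched below. This is an assumption-laden illustration rather than the method of [7] or [20]: the mask rule (one minus a short-time interchannel coherence), the recursive smoothing constant `smooth`, and the function names `ambient_mask` and `extract_ambient` are all hypothetical choices. The STFT is taken as given (lists of frames of complex bins), and the toy input replaces a real STFT with random bins.

```python
import random


def ambient_mask(X0, X1, smooth=0.9):
    """Return a soft ambient mask, one value in [0, 1] per time-frequency bin.

    X0, X1: STFTs of the two channels as lists of frames (index n), each
    frame a list of complex bins (index l). The rule mask = 1 - coherence
    and the smoothing constant are assumptions made for this sketch.
    """
    n_bins = len(X0[0])
    phi00 = [0.0] * n_bins  # smoothed auto-spectrum of channel 0
    phi11 = [0.0] * n_bins  # smoothed auto-spectrum of channel 1
    phi01 = [0j] * n_bins   # smoothed cross-spectrum
    mask = []
    for f0, f1 in zip(X0, X1):
        row = []
        for l in range(n_bins):
            # Recursive averaging over time frames.
            phi00[l] = smooth * phi00[l] + (1 - smooth) * abs(f0[l]) ** 2
            phi11[l] = smooth * phi11[l] + (1 - smooth) * abs(f1[l]) ** 2
            phi01[l] = (smooth * phi01[l]
                        + (1 - smooth) * f0[l] * f1[l].conjugate())
            den = (phi00[l] * phi11[l]) ** 0.5
            coherence = min(1.0, abs(phi01[l]) / den) if den > 0 else 0.0
            # High interchannel correlation -> mostly primary -> small mask.
            row.append(1.0 - coherence)
        mask.append(row)
    return mask


def extract_ambient(X, mask):
    """Apply the masking in (5): scale every bin of one channel by the mask."""
    return [[m * v for m, v in zip(mrow, frame)]
            for mrow, frame in zip(mask, X)]


# Toy input: bin 0 is fully correlated across channels, bin 1 is independent.
rng = random.Random(1)


def cgauss():
    return complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))


X0 = [[cgauss(), cgauss()] for _ in range(200)]
X1 = [[2 * frame[0], cgauss()] for frame in X0]
mask = ambient_mask(X0, X1)
A0 = extract_ambient(X0, mask)
```

In the toy input the mask stays near zero for the correlated bin and is substantially larger for the independent bin once the recursive averages have settled. The first few frames are unreliable because the averages start from zero.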
Several linear estimation-based PAE approaches, which exploit the
differences between the two channels of the stereo signal, have also
been introduced [21], including PCA-based approaches [20] and LS-based
approaches. In these approaches, the extracted primary components
$\hat{p}_0(n)$, $\hat{p}_1(n)$ and ambient components $\hat{a}_0(n)$,
$\hat{a}_1(n)$ are expressed as weighted sums of the mixtures:
$$
\begin{bmatrix}
\hat{p}_0(n) \\ \hat{p}_1(n) \\ \hat{a}_0(n) \\ \hat{a}_1(n)
\end{bmatrix}
=
\begin{bmatrix}
w_{P0,0} & w_{P0,1} \\
w_{P1,0} & w_{P1,1} \\
w_{A0,0} & w_{A0,1} \\
w_{A1,0} & w_{A1,1}
\end{bmatrix}
\begin{bmatrix}
x_0(n) \\ x_1(n)
\end{bmatrix}.
\tag{6}
$$
The solutions for the weights in (6) are derived based on different
performance-related criteria [21]. More specifically, PCA extracts
the primary components having maximum variance and extracts
the ambient components having minimum variance with the con-
straint that the primary and ambient components are uncorrelated,
while LS extracts the components with minimum MSE. Based
on the study in [21], it is recommended that PCA-based approaches
should be used for signals that contain dominant primary compo-
nents (e.g., gaming), while LS-based approaches are preferred for
signals that contain a balanced mix of primary and ambient compo-
nents (e.g., movies). In addition, to deal with more complex types of
input signals that do not fit into the basic mixing model, other tech-
niques have also been introduced, such as time shifting to compen-
sate for time differences [22] and adaptive frequency bin
partitioning for multiple sources in primary components [23]. Fur-
thermore, though it is possible to extend the framework of PAE
from stereo signals to multichannel signals, e.g., [24], more com-
prehensive studies on PAE for multichannel signals are required.
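
The PCA criterion can be sketched for the stereo case as a projection onto the principal direction of the two mixture channels. The code below is a simplified, full-band, time-domain illustration under the basic mixing model, not the exact procedure of [20] or [21]. It assumes zero-mean signals, and the function names `pca_pae_weights` and `apply_weights` as well as the synthetic test signal (panning factor 2, ambient gain 0.3) are hypothetical. The two returned 2x2 matrices correspond to the primary and ambient rows of the weight matrix in (6).

```python
import math
import random


def pca_pae_weights(x0, x1):
    """Return (W_P, W_A), two 2x2 weight matrices for the sums in (6).

    W_P projects the mixture onto the principal eigenvector of the 2x2
    correlation matrix (maximum variance); W_A keeps the orthogonal rest.
    """
    r00 = sum(a * a for a in x0)
    r11 = sum(b * b for b in x1)
    r01 = sum(a * b for a, b in zip(x0, x1))
    # Largest eigenvalue of the symmetric matrix [[r00, r01], [r01, r11]].
    lam = 0.5 * (r00 + r11 + math.sqrt((r00 - r11) ** 2 + 4.0 * r01 ** 2))
    # Matching eigenvector; fall back to an axis if the channels are
    # exactly uncorrelated.
    if r01 != 0.0:
        v = (r01, lam - r00)
    else:
        v = (1.0, 0.0) if r00 >= r11 else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    v = (v[0] / norm, v[1] / norm)
    W_P = [[v[i] * v[j] for j in range(2)] for i in range(2)]
    W_A = [[(1.0 if i == j else 0.0) - W_P[i][j] for j in range(2)]
           for i in range(2)]
    return W_P, W_A


def apply_weights(W, x0, x1):
    """Form the weighted sums of the two mixtures for both output channels."""
    return [[W[m][0] * a + W[m][1] * b for a, b in zip(x0, x1)]
            for m in range(2)]


# Synthetic mixture (assumed values): panning factor 2, ambient gain 0.3.
rng = random.Random(0)
s = [rng.gauss(0.0, 1.0) for _ in range(20000)]
x0 = [v + 0.3 * rng.gauss(0.0, 1.0) for v in s]
x1 = [2.0 * v + 0.3 * rng.gauss(0.0, 1.0) for v in s]

W_P, W_A = pca_pae_weights(x0, x1)
p_hat = apply_weights(W_P, x0, x1)  # extracted primary components
a_hat = apply_weights(W_A, x0, x1)  # extracted ambient components
```

By construction the extracted primary and ambient components sum to the mixture in each channel and are uncorrelated with each other, which mirrors the PCA constraint stated above. An LS solution would use different weights in the same linear structure.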
A COMPARISON BETWEEN BSS AND PAE
Both BSS and PAE are extensively applied in sound scene
decomposition. A comparison between these approaches is summarized
in Table 2. The common objective of BSS and PAE is to extract
useful information (mainly the sound sources and their direc-
tions) about the original sound scene from the mixtures, and to
use this information to facilitate natural sound rendering. There
are three common characteristics in BSS and PAE. First, only the
mixtures are available and usually no other prior information is
given. Second, the extraction of the specific components from the
mixtures is based on certain signal models. Third, both techniques
require objective and subjective evaluation.
As discussed earlier, the applications of different signal mod-
els in BSS and PAE lead to different techniques. In BSS, the
mixtures are considered as the sums of multiple sources, and
the independence among the sources is one of the most impor-
tant characteristics. In contrast, the mixing model in PAE is
based on human perception of directional sources (primary
components) and diffuse sound environment (ambient compo-
nents). The perceptual difference between primary and ambient
components is due to the directivity of these components, which
can be characterized by their interchannel correlations. The
applications that adopt BSS and PAE also differ. BSS is commonly
used in speech and music applications, where the clarity
of the sources is usually more important than the effect of the
environment. On the other hand, PAE is more suited for the
reproduction of movie and gaming sound content, where the