
IEEE SIGNAL PROCESSING MAGAZINE [105] MARCH 2015

potential components of sources [18]. Another technique applied in single-channel BSS is computational auditory scene analysis (CASA), which simulates the segregation and grouping mechanisms of the human auditory system [19] on a model-based representation (monaural case) of the auditory scene. An important aspect to consider is the directions of the extracted sources, which usually come as a by-product in multichannel BSS. In single-channel BSS, this directional information has to be provided separately.

DECOMPOSITION USING PAE

In most sound scenes, the mixture comprises not only the dry sources but also the reverberation and ambient sound contributed by the acoustics of the surrounding space. Therefore, the mixing model of the sources in BSS usually does not match actual sound scenes. In this article, we refer to the dominant sources as primary (or direct) components, while the signals contributed by the sound environment are referred to as ambient (or diffuse) components. The primary and ambient components are perceived to be directional and diffuse, respectively. Because of these perceptual differences, different rendering methods should be applied to the primary and ambient components [6], [7]. Rendering natural sound scenes therefore requires decomposing the mixtures into primary and ambient components [6], [7], [9]. Since stereo is still the most widely used format for digital media content, our discussion on the decomposition using PAE is focused on stereo signals $(M = 2)$.

In PAE, we often follow some intuitive signal models as discussed in [3], [5], [7], [8], and [20]. In the $m$th channel, the mixture $x_m(n)$ is assumed to be the sum of the primary component $p_m(n)$ and the ambient component $a_m(n)$, i.e., $x_m(n) = p_m(n) + a_m(n)$. The discrimination of directional primary components and diffuse ambient components is mainly based on their interchannel correlations: the primary components in the two channels are assumed to be correlated, whereas the ambient components are assumed to be uncorrelated. In the basic mixing model for PAE, the primary components are assumed to be amplitude panned, while the ambient components are of approximately equal levels in all channels.
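As a minimal sketch of this basic mixing model, the following Python snippet synthesizes a stereo mixture from one amplitude-panned primary source and two independent ambient signals of equal level, then compares the interchannel correlations. The panning factor `k`, the noise levels, the sample count, and the helper `correlation` are illustrative assumptions, not values or functions taken from the cited works.

```python
import math
import random

def correlation(u, v):
    """Zero-lag normalized cross-correlation of two equal-length sequences."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return num / den

rng = random.Random(0)   # fixed seed so the sketch is repeatable
N = 20000                # number of samples (arbitrary choice)
k = 2.0                  # hypothetical panning factor: p_1(n) = k * p_0(n)

# Primary components: a single source, amplitude panned across the channels.
s = [rng.gauss(0.0, 1.0) for _ in range(N)]
p0 = s
p1 = [k * v for v in s]

# Ambient components: independent noise with equal level in both channels.
a0 = [rng.gauss(0.0, 0.5) for _ in range(N)]
a1 = [rng.gauss(0.0, 0.5) for _ in range(N)]

# Mixture in each channel: x_m(n) = p_m(n) + a_m(n).
x0 = [p + a for p, a in zip(p0, a0)]
x1 = [p + a for p, a in zip(p1, a1)]

corr_primary = correlation(p0, p1)   # close to 1: correlated
corr_ambient = correlation(a0, a1)   # close to 0: uncorrelated
corr_mixture = correlation(x0, x1)   # in between, depends on the power ratio
```

Under these assumptions the mixture correlation falls between the two extremes, which is the cue that PAE methods exploit to separate the two kinds of components.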

Based on these assumptions, various approaches have been proposed in PAE for stereo signals. Similar to BSS, time-frequency masking approaches are introduced to extract ambient components $\hat{A}_m(n,l)$ [7], [20], and these approaches can be generalized as

$$\hat{A}_m(n,l) = X_m(n,l)\,\Psi_A(n,l), \tag{5}$$

where $0 \le \Psi_A(n,l) \le 1$ is the real-valued ambient mask at the time-frequency bin $(n,l)$. Time-frequency bins having high interchannel correlation are considered to be primary components (or mostly primary components in the soft masking case), whereas low-correlation bins are more likely to be ambient components.
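The masking operation in (5) can be sketched in Python as follows. The mask used here, one minus the magnitude of the interchannel coherence averaged over all frames, is an illustrative assumption and not necessarily the mask of [7] or [20]; because of the averaging it is constant over the frame index. The function names `ambient_mask` and `extract_ambient` and the toy two-bin data are likewise hypothetical.

```python
import math
import random

def ambient_mask(X0, X1, eps=1e-12):
    """Soft ambient mask per frequency bin, assumed here to be 1 - |coherence|.

    X0, X1: lists of frames; each frame is a list of complex bins.
    The expectation is approximated by summing over all frames.
    """
    n_bins = len(X0[0])
    mask = []
    for l in range(n_bins):
        r01 = sum(f0[l] * f1[l].conjugate() for f0, f1 in zip(X0, X1))
        r00 = sum(abs(f0[l]) ** 2 for f0 in X0)
        r11 = sum(abs(f1[l]) ** 2 for f1 in X1)
        coherence = abs(r01) / math.sqrt(r00 * r11 + eps)
        mask.append(min(1.0, max(0.0, 1.0 - coherence)))  # clamp to [0, 1]
    return mask

def extract_ambient(X, mask):
    """Apply the real-valued mask to every frame of one channel, as in (5)."""
    return [[xb * m for xb, m in zip(frame, mask)] for frame in X]

rng = random.Random(1)

def cgauss():
    """Complex Gaussian sample standing in for one time-frequency bin."""
    return complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))

# Toy data: bin 0 holds a panned (fully correlated) primary component,
# bin 1 holds independent (uncorrelated) ambience in each channel.
X0, X1 = [], []
for _ in range(400):
    s = cgauss()
    X0.append([s, cgauss()])
    X1.append([0.5 * s, cgauss()])

mask = ambient_mask(X0, X1)
A0 = extract_ambient(X0, mask)   # extracted ambient component of channel 0
```

In this toy case the mask is near zero for the correlated bin and near one for the uncorrelated bin, so only the second bin survives in the extracted ambient signal. A practical system would more likely track the coherence with a recursive average so that the mask also varies over time.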

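Several linear estimation-based PAE approaches were also introduced [21]. These approaches exploit the differences between the two channels of the stereo signal to perform the PAE and include PCA-based approaches [20] and LS-based approaches. In these approaches, the extracted primary components $\hat{p}_0(n), \hat{p}_1(n)$ and ambient components $\hat{a}_0(n), \hat{a}_1(n)$ are expressed as weighted sums of the mixtures:

$$\begin{bmatrix} \hat{p}_0(n) \\ \hat{p}_1(n) \\ \hat{a}_0(n) \\ \hat{a}_1(n) \end{bmatrix} = \begin{bmatrix} w_{P0,0} & w_{P0,1} \\ w_{P1,0} & w_{P1,1} \\ w_{A0,0} & w_{A0,1} \\ w_{A1,0} & w_{A1,1} \end{bmatrix} \begin{bmatrix} x_0(n) \\ x_1(n) \end{bmatrix}. \tag{6}$$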

The solutions for the weights in (6) are derived based on different performance-related criteria [21]. More specifically, PCA extracts the primary components having maximum variance and the ambient components having minimum variance, under the constraint that the primary and ambient components are uncorrelated, while LS extracts the components having minimum MSE. Based on the study in [21], PCA-based approaches are recommended for signals that contain dominant primary components (e.g., gaming), while LS-based approaches are preferred for signals that contain a balanced mix of primary and ambient components (e.g., movies). In addition, to deal with more complex input signals that do not fit the basic mixing model, other techniques have been introduced, such as time shifting to compensate for time differences [22] and adaptive frequency bin partitioning for multiple sources in primary components [23]. Furthermore, although it is possible to extend the framework of PAE from stereo signals to multichannel signals, e.g., [24], more comprehensive studies on PAE for multichannel signals are required.
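To make the weighted-sum form concrete, here is a minimal Python sketch of a PCA-style extraction for one stereo frame. It assumes the common textbook construction: the primary estimate is the projection of the mixture onto the principal eigenvector of the 2x2 interchannel covariance matrix, and the ambient estimate is the residual. The function name `pca_pae`, the test signal, and its parameters are hypothetical, and the exact weights derived in [21] may differ in scaling.

```python
import math
import random

def pca_pae(x0, x1):
    """PCA-style primary-ambient extraction for a stereo frame (a sketch).

    Returns the four estimates and the 4x2 weight matrix W, with rows
    ordered as primary ch. 0, primary ch. 1, ambient ch. 0, ambient ch. 1.
    """
    # Zero-lag (co)variances of the two mixture channels.
    r00 = sum(v * v for v in x0)
    r11 = sum(v * v for v in x1)
    r01 = sum(a * b for a, b in zip(x0, x1))
    # Principal eigenvector (u0, u1) of [[r00, r01], [r01, r11]].
    theta = 0.5 * math.atan2(2.0 * r01, r00 - r11)
    u0, u1 = math.cos(theta), math.sin(theta)
    # Primary weights project onto u; ambient weights give the residual.
    W = [[u0 * u0, u0 * u1],
         [u0 * u1, u1 * u1],
         [1.0 - u0 * u0, -u0 * u1],
         [-u0 * u1, 1.0 - u1 * u1]]
    rows = [[w[0] * a + w[1] * b for a, b in zip(x0, x1)] for w in W]
    return rows[0], rows[1], rows[2], rows[3], W

# Hypothetical test mixture: panned primary (factor k) plus weak ambience.
rng = random.Random(0)
N, k = 20000, 2.0
s = [rng.gauss(0.0, 1.0) for _ in range(N)]
x0 = [v + rng.gauss(0.0, 0.5) for v in s]
x1 = [k * v + rng.gauss(0.0, 0.5) for v in s]

p0_hat, p1_hat, a0_hat, a1_hat, W = pca_pae(x0, x1)
k_hat = W[1][0] / W[0][0]   # estimated panning factor u1 / u0
cross = sum(p * a for p, a in zip(p0_hat, a0_hat)) / N  # primary-ambient correlation
```

With this construction the extracted primary and ambient components are uncorrelated by design, and `k_hat` approximates the panning factor when the ambient components are uncorrelated and of equal level. An LS-based solution would use the same weighted-sum structure with weights chosen to minimize the MSE instead.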

A COMPARISON BETWEEN BSS AND PAE

Both BSS and PAE are extensively applied in sound scene decomposition; a comparison between these approaches is summarized in Table 2. The common objective of BSS and PAE is to extract useful information (mainly the sound sources and their directions) about the original sound scene from the mixtures, and to use this information to facilitate natural sound rendering. BSS and PAE share three characteristics. First, only the mixtures are available and usually no other prior information is given. Second, the extraction of the specific components from the mixtures is based on certain signal models. Third, both techniques require objective and subjective evaluation.

As discussed earlier, the use of different signal models in BSS and PAE leads to different techniques. In BSS, the mixtures are considered as the sums of multiple sources, and the independence among the sources is one of the most important characteristics. In contrast, the mixing model in PAE is based on human perception of directional sources (primary components) and the diffuse sound environment (ambient components). The perceptual difference between primary and ambient components is due to the directivity of these components, which can be characterized by their correlations. The applications that adopt BSS and PAE also differ. BSS is commonly used in speech and music applications, where the clarity of the sources is usually more important than the effect of the environment. On the other hand, PAE is more suited for the reproduction of movie and gaming sound content, where the

