
further hinders the distance perception as it leads to inside-the-
head localization (IHL) of sound [1]. IHL of sound is caused by
several factors, such as the use of nonindividualized HRTFs,
absence of equalization, lack of reverberation, and impedance mis-
match due to the presence of headphones [1], [13]. The presence
of individualized HRTFs, equalization, and reverberation can
improve the externalization of sound but does not ensure accurate
distance perception [1]. The direct-to-reverberation energy ratio is
found to be the most critical cue for absolute distance perception,
even though the intensity, loudness, and binaural cues can provide
relative cues for distance perception [1]. Since reverberation is an
essential cue for both distance perception and perception of a real
environment context, a veridical simulation of the reverberation is
imperative for natural sound rendering [1]. However, accu-
rate simulation of distance perception is challenging since rever-
beration entirely depends on the room characteristics. The correct
amount of reverberation to be added to simulate distance percep-
tion in a particular room can be obtained only by carrying out
acoustical measurements.
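Because the direct-to-reverberation energy ratio (DRR) is defined on a room impulse response, it can be estimated from exactly the kind of acoustical measurement mentioned above. The following Python sketch only illustrates that computation; the function name, the 2.5-ms direct-sound window, and the peak-based detection of the direct path are our own assumptions rather than anything prescribed in the article.

```python
import numpy as np

def direct_to_reverberant_ratio(rir, fs, direct_window_ms=2.5):
    """Estimate the direct-to-reverberation energy ratio (dB) from a
    measured room impulse response `rir` sampled at `fs` Hz.

    The direct sound is taken as the energy in a short window around
    the strongest peak of the impulse response (an assumed convention);
    everything after that window is counted as reverberation.
    """
    rir = np.asarray(rir, dtype=float)
    peak = int(np.argmax(np.abs(rir)))            # assumed direct-path arrival
    half_win = int(direct_window_ms * 1e-3 * fs)
    start, end = max(peak - half_win, 0), peak + half_win
    direct_energy = np.sum(rir[start:end] ** 2)
    reverb_energy = np.sum(rir[end:] ** 2) + 1e-12
    return 10.0 * np.log10(direct_energy / reverb_energy)
```

A lower DRR (more reverberant energy relative to the direct path) is generally perceived as a more distant source, which is why this ratio serves as the absolute distance cue discussed above.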
SOUND SCENE DECOMPOSITION USING BSS AND PAE
To achieve natural sound rendering in headphones, two important
constituents of the sound scenes are required in the virtualization:
the individual sound sources and characteristics of the sound envi-
ronment. However, this information is usually not directly avail-
able to the end user. One has to work with the existing digital
media content that is available, i.e., the mastered mix distributed
in channel-based formats (e.g., stereo, 5.1 surround sound).
Therefore, to facilitate natural sound rendering, it is necessary to
extract the sound sources and/or sound environment from their
mixtures. In this section, we discuss two types of techniques
applied in sound scene decomposition: BSS and PAE.
DECOMPOSITION USING BSS
Extracting the sound sources from the mixtures, often referred to
as BSS, has been extensively studied in the last few decades. The
basic mixing model in BSS can be considered as anechoic mixing,
where the sources $s_k(n)$ in each mixture $x_m(n)$ have different
gains $g_{mk}$ and delays $\tau_{mk}$. Hence, the anechoic mixing is
formulated as follows:

$$x_m(n) = \sum_{k=1}^{K} g_{mk}\, s_k(n - \tau_{mk}) + e_m(n), \qquad \forall m \in \{1, 2, \ldots, M\}, \tag{3}$$
where $e_m(n)$
is the noise in each mixture, which is usually
neglected for most cases. Note that estimating the number of
sources is quite challenging and it is usually assumed to be
known in advance [14]. This formulation can be simplified to
represent instantaneous mixing by ignoring the delays, or can
be extended to reverberant mixing by including multiple paths
between each source and mixture. An overview of the typical
techniques applied in BSS is listed in Table 1.
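As a concrete reading of (3), the sketch below generates $M$ anechoic mixtures from $K$ sources given per-mixture gains and integer sample delays. The function name and array layout are our own choices; setting all delays to zero reduces it to the instantaneous mixing case mentioned above, while reverberant mixing would instead sum over several delayed and scaled paths per source.

```python
import numpy as np

def anechoic_mix(sources, gains, delays, noise=None):
    """Build M anechoic mixtures following (3):
    x_m(n) = sum_k g_mk * s_k(n - tau_mk) + e_m(n).

    sources : (K, N) array of source signals s_k(n)
    gains   : (M, K) array of gains g_mk
    delays  : (M, K) integer array of delays tau_mk in samples
    noise   : optional (M, N) array e_m(n), neglected when None
    """
    K, N = sources.shape
    M = gains.shape[0]
    mixtures = np.zeros((M, N))
    for m in range(M):
        for k in range(K):
            d = int(delays[m, k])
            delayed = np.roll(sources[k], d)
            delayed[:d] = 0.0                     # zero-fill instead of wrapping around
            mixtures[m] += gains[m, k] * delayed
    return mixtures if noise is None else mixtures + noise
```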
Based on the statistical independence and non-Gaussianity
of the sources, independent component analysis (ICA) algo-
rithms have been the most widely used techniques in BSS to
separate the sources from mixtures in the determined case,
where the numbers of mixtures and sources are equal [14]. In
the overdetermined case, where there are more mixtures than
sources, ICA is combined with principal component analysis
(PCA) to reduce the dimension of the mixtures, or combined
with least-squares (LS) to minimize the overall mean-square
error (MSE) [14]. In practice, the underdetermined case is the
most common, where there are fewer mixtures than sources.
For the underdetermined BSS, sparse representations of the
sources are usually employed to increase the likelihood that the
sources are disjoint [15]. The most challenging underdetermined
BSS arises when the number of mixtures is two or fewer, i.e.,
in stereo and mono signals.
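For the determined case, an off-the-shelf ICA implementation already provides a working baseline. The sketch below is a minimal illustration using scikit-learn's FastICA (the helper name is ours, and FastICA assumes instantaneous mixing, i.e., the delay-free special case of (3)); the trailing comment indicates how PCA would enter in the overdetermined case.

```python
import numpy as np
from sklearn.decomposition import FastICA

def separate_determined(mixtures, K):
    """Determined case (M = K): estimate K sources from K instantaneous
    mixtures with ICA. `mixtures` is an (M, N) array of signals x_m(n).
    """
    ica = FastICA(n_components=K, random_state=0)
    estimated = ica.fit_transform(mixtures.T)     # scikit-learn expects (samples, channels)
    return estimated.T                            # (K, N) source estimates

# Overdetermined case (M > K): first reduce the M mixtures to K dimensions,
# e.g. with sklearn.decomposition.PCA(n_components=K), then apply ICA as above,
# mirroring the ICA-with-PCA combination noted in the text.
```

As usual with ICA, the estimated sources are returned with arbitrary ordering and scaling.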
Stereo signals (i.e., $M = 2$), being one of the most widely
used audio formats, have been the focus in BSS. Many of these
BSS techniques can be considered as time-frequency masking
and usually assume one dominant source in one time-frequency
bin of the stereo signal [16]. In these time-frequency masking-
based approaches, a histogram for all possible directions of the
sources is constructed, based on the range of the bin-wise
amplitude and phase differences between the two channels. The
directions, which appear as peaks in the histogram, are selected
as source directions. These selected source directions are then
used to classify the time-frequency bins and to construct the
mask. For every time-frequency bin $(n, l)$, the $k$th source at the
$m$th channel, $\hat{S}_{mk}(n, l)$, is estimated as:

$$\hat{S}_{mk}(n, l) = W_{mk}(n, l)\, X_m(n, l), \tag{4}$$

where the mask and the $m$th mixture are represented by $W_{mk}(n, l)$
and $X_m(n, l)$, respectively.
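A compact sketch of this masking-based stereo separation is given below. It simplifies the scheme described above: instead of building a histogram of bin-wise amplitude and phase differences and picking its peaks, it clusters the same bin-wise features with k-means, which serves the same purpose of grouping time-frequency bins by source direction. Function and parameter names are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.cluster import KMeans

def stereo_tf_mask_separation(x_left, x_right, fs, K, nperseg=1024):
    """Separate K sources from a stereo mixture by time-frequency masking,
    applying a binary mask to each channel as in (4)."""
    _, _, X_l = stft(x_left, fs, nperseg=nperseg)
    _, _, X_r = stft(x_right, fs, nperseg=nperseg)
    eps = 1e-12

    # Bin-wise interchannel cues: level difference (dB) and phase difference
    level = 20.0 * np.log10((np.abs(X_r) + eps) / (np.abs(X_l) + eps))
    phase = np.angle(X_r * np.conj(X_l))
    features = np.stack([level.ravel(), phase.ravel()], axis=1)

    # Group bins by apparent source direction (stand-in for histogram peak picking)
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(features)
    labels = labels.reshape(X_l.shape)

    estimates = []
    for k in range(K):
        mask = (labels == k).astype(float)        # binary mask W_k(n, l)
        _, s_l = istft(mask * X_l, fs, nperseg=nperseg)
        _, s_r = istft(mask * X_r, fs, nperseg=nperseg)
        estimates.append(np.stack([s_l, s_r]))    # stereo image of source k
    return estimates
```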
In the case of single-channel (or mono) signals, the separa-
tion is even more challenging since there is no interchannel
information. Hence, there is a need to look into the inherent
physical or perceptual properties of the sound sources. Nonneg-
ative matrix factorization (NMF)-based approaches have been
extensively studied and applied in single-channel BSS in recent
years. The key idea of NMF is to formulate an atom-based repre-
sentation of the sound scene [17], where the atoms have repeti-
tive and nondestructive spectral structures. NMF usually
expresses the magnitude (or power) spectrogram of the mixture
as a product of the atoms and time-varying nonnegative weights
in an unsupervised manner. These atoms, after being multiplied
with their corresponding weights, can be considered as the estimates
of the individual sources.
[TABLE 1] AN OVERVIEW OF TYPICAL TECHNIQUES IN BSS.
OBJECTIVE: TO EXTRACT K (K ≥ 2) SOURCES FROM M MIXTURES

CASE                              TYPICAL TECHNIQUES
DETERMINED:      K = M            ICA [14]
OVERDETERMINED:  K < M            ICA WITH PCA OR LS [14]
UNDERDETERMINED: K > M,  M > 2    ICA WITH SPARSE SOLUTIONS [14], [15]
                         M = 2    TIME-FREQUENCY MASKING [16]
                         M = 1    NMF [17], [18]; CASA [19]
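To make the single-channel NMF decomposition described above concrete, the sketch below factorizes the magnitude spectrogram of a mono mixture into spectral atoms and nonnegative time-varying weights, then resynthesizes each atom's contribution with a soft mask and the mixture phase. The number of atoms, the masking step, and the function name are our own illustrative choices; grouping atoms into sources is left to the user, since unsupervised NMF does not label them.

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

def mono_nmf_decomposition(x, fs, n_atoms=8, nperseg=1024):
    """Unsupervised NMF of a mono mixture: V ~= W @ H, where the columns
    of W are spectral atoms and the rows of H are their time-varying
    nonnegative weights. Each atom is resynthesized via a soft mask."""
    _, _, X = stft(x, fs, nperseg=nperseg)
    V = np.abs(X)                                  # magnitude spectrogram
    eps = 1e-12

    model = NMF(n_components=n_atoms, init='nndsvda', max_iter=500)
    W = model.fit_transform(V)                     # (freq, atoms): spectral atoms
    H = model.components_                          # (atoms, time): weights

    atoms_in_time = []
    for k in range(n_atoms):
        V_k = np.outer(W[:, k], H[k])              # contribution of atom k
        mask = V_k / (W @ H + eps)                 # Wiener-like soft mask
        _, s_k = istft(mask * X, fs, nperseg=nperseg)
        atoms_in_time.append(s_k)
    return atoms_in_time                           # per-atom signals to be grouped into sources
```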