
further hinders the distance perception as it leads to inside-the-
head localization (IHL) of sound [1]. IHL of sound is caused by
several factors, such as the use of nonindividualized HRTFs,
absence of equalization, lack of reverberation, and impedance mis-
match due to the presence of headphones [1], [13]. The presence
of individualized HRTFs, equalization, and reverberation can
improve the externalization of sound but does not ensure accurate
distance perception [1]. The direct-to-reverberation energy ratio is
found to be the most critical cue for absolute distance perception,
even though the intensity, loudness, and binaural cues can provide
relative cues for distance perception [1]. Since reverberation is an
essential cue for both distance perception and perception of a real
environment context, a veridical simulation of the reverberation is
imperative for natural sound rendering [1]. However, accu-
rate simulation of distance perception is challenging since rever-
beration entirely depends on the room characteristics. The correct
amount of reverberation to be added to simulate distance percep-
tion in a particular room can be obtained only by carrying out
acoustical measurements.
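Because the direct-to-reverberation energy ratio (DRR) is defined on a room impulse response, it can be estimated from exactly the kind of acoustical measurement mentioned above. The following Python sketch only illustrates that computation; the function name, the 2.5-ms direct-sound window, and the peak-based detection of the direct path are our own assumptions rather than anything prescribed in the article.

```python
import numpy as np

def direct_to_reverberant_ratio(rir, fs, direct_window_ms=2.5):
    """Estimate the direct-to-reverberation energy ratio (dB) from a
    measured room impulse response `rir` sampled at `fs` Hz.

    The direct sound is taken as the energy in a short window around
    the strongest peak of the impulse response (an assumed convention);
    everything after that window is counted as reverberation.
    """
    rir = np.asarray(rir, dtype=float)
    peak = int(np.argmax(np.abs(rir)))            # assumed direct-path arrival
    half_win = int(direct_window_ms * 1e-3 * fs)
    start, end = max(peak - half_win, 0), peak + half_win
    direct_energy = np.sum(rir[start:end] ** 2)
    reverb_energy = np.sum(rir[end:] ** 2) + 1e-12
    return 10.0 * np.log10(direct_energy / reverb_energy)
```

A lower DRR (more reverberant energy relative to the direct path) is generally perceived as a more distant source, which is why this ratio serves as the absolute distance cue discussed above.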
SOUND SCENE DECOMPOSITION USING BSS AND PAE
To achieve natural sound rendering in headphones, two important
constituents of the sound scenes are required in the virtualization:
the individual sound sources and characteristics of the sound envi-
ronment. However, this information is usually not directly avail-
able to the end user. One has to work with the existing digital
media content that is available, i.e., the mastered mix distributed
in channel-based formats (e.g., stereo, 5.1 surround sound).
Therefore, to facilitate natural sound rendering, it is necessary to
extract the sound sources and/or sound environment from their
mixtures. In this section, we discuss two types of techniques
applied in sound scene decomposition: BSS and PAE.
DECOMPOSITION USING BSS
Extracting the sound sources from the mixtures, often referred to
as BSS, has been extensively studied in the last few decades. The
basic mixing model in BSS can be considered as anechoic mixing,
where the sources $s_k(n)$ in each mixture $x_m(n)$ have different
gains $g_{mk}$ and delays $\tau_{mk}$. Hence, the anechoic mixing is
formulated as follows:

$$x_m(n) = \sum_{k=1}^{K} g_{mk}\, s_k(n - \tau_{mk}) + e_m(n), \qquad \forall m \in \{1, 2, \ldots, M\}, \tag{3}$$
where $e_m(n)$
is the noise in each mixture, which is usually
neglected for most cases. Note that estimating the number of
sources is quite challenging and it is usually assumed to be
known in advance [14]. This formulation can be simplified to
represent instantaneous mixing by ignoring the delays, or can
be extended to reverberant mixing by including multiple paths
between each source and mixture. An overview of the typical
techniques applied in BSS is listed in Table 1.
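As a concrete reading of (3), the sketch below generates $M$ anechoic mixtures from $K$ sources given per-mixture gains and integer sample delays. The function name and array layout are our own choices; setting all delays to zero reduces it to the instantaneous mixing case mentioned above, while reverberant mixing would instead sum over several delayed and scaled paths per source.

```python
import numpy as np

def anechoic_mix(sources, gains, delays, noise=None):
    """Build M anechoic mixtures following (3):
    x_m(n) = sum_k g_mk * s_k(n - tau_mk) + e_m(n).

    sources : (K, N) array of source signals s_k(n)
    gains   : (M, K) array of gains g_mk
    delays  : (M, K) integer array of delays tau_mk in samples
    noise   : optional (M, N) array e_m(n), neglected when None
    """
    K, N = sources.shape
    M = gains.shape[0]
    mixtures = np.zeros((M, N))
    for m in range(M):
        for k in range(K):
            d = int(delays[m, k])
            delayed = np.roll(sources[k], d)
            delayed[:d] = 0.0                     # zero-fill instead of wrapping around
            mixtures[m] += gains[m, k] * delayed
    return mixtures if noise is None else mixtures + noise
```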
Based on the statistical independence and non-Gaussianity
of the sources, independent component analysis (ICA) algo-
rithms have been the most widely used techniques in BSS to
separate the sources from mixtures in the determined case,
where the numbers of mixtures and sources are equal [14]. In
the overdetermined case, where there are more mixtures than
sources, ICA is combined with principal component analysis
(PCA) to reduce the dimension of the mixtures, or combined
with least-squares (LS) to minimize the overall mean-square
error (MSE) [14]. In practice, the underdetermined case is the
most common, where there are fewer mixtures than sources.
For the underdetermined BSS, sparse representations of the
sources are usually employed to increase the likelihood that the
sources are disjoint [15]. The most challenging underdetermined
BSS arises when the number of mixtures is two or fewer, i.e.,
in stereo and mono signals.
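For the determined case, an off-the-shelf ICA implementation already provides a working baseline. The sketch below is a minimal illustration using scikit-learn's FastICA (the helper name is ours, and FastICA assumes instantaneous mixing, i.e., the delay-free special case of (3)); the trailing comment indicates how PCA would enter in the overdetermined case.

```python
import numpy as np
from sklearn.decomposition import FastICA

def separate_determined(mixtures, K):
    """Determined case (M = K): estimate K sources from K instantaneous
    mixtures with ICA. `mixtures` is an (M, N) array of signals x_m(n).
    """
    ica = FastICA(n_components=K, random_state=0)
    estimated = ica.fit_transform(mixtures.T)     # scikit-learn expects (samples, channels)
    return estimated.T                            # (K, N) source estimates

# Overdetermined case (M > K): first reduce the M mixtures to K dimensions,
# e.g. with sklearn.decomposition.PCA(n_components=K), then apply ICA as above,
# mirroring the ICA-with-PCA combination noted in the text.
```

As usual with ICA, the estimated sources are returned with arbitrary ordering and scaling.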
Stereo signals (i.e., $M = 2$), being one of the most widely
used audio formats, have been the focus in BSS. Many of these
BSS techniques can be considered as time-frequency masking
and usually assume one dominant source in one time-frequency
bin of the stereo signal [16]. In these time-frequency masking-
based approaches, a histogram for all possible directions of the
sources is constructed, based on the range of the bin-wise
amplitude and phase differences between the two channels. The
directions, which appear as peaks in the histogram, are selected
as source directions. These selected source directions are then
used to classify the time-frequency bins and to construct the
mask. For every time-frequency bin $(n, l)$, the $k$th source at the
$m$th channel, $\hat{S}_{mk}(n, l)$, is estimated as:

$$\hat{S}_{mk}(n, l) = W_{mk}(n, l)\, X_m(n, l), \tag{4}$$

where the mask and the $m$th mixture are represented by $W_{mk}(n, l)$
and $X_m(n, l)$, respectively.
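A compact sketch of this masking-based stereo separation is given below. It simplifies the scheme described above: instead of building a histogram of bin-wise amplitude and phase differences and picking its peaks, it clusters the same bin-wise features with k-means, which serves the same purpose of grouping time-frequency bins by source direction. Function and parameter names are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.cluster import KMeans

def stereo_tf_mask_separation(x_left, x_right, fs, K, nperseg=1024):
    """Separate K sources from a stereo mixture by time-frequency masking,
    applying a binary mask to each channel as in (4)."""
    _, _, X_l = stft(x_left, fs, nperseg=nperseg)
    _, _, X_r = stft(x_right, fs, nperseg=nperseg)
    eps = 1e-12

    # Bin-wise interchannel cues: level difference (dB) and phase difference
    level = 20.0 * np.log10((np.abs(X_r) + eps) / (np.abs(X_l) + eps))
    phase = np.angle(X_r * np.conj(X_l))
    features = np.stack([level.ravel(), phase.ravel()], axis=1)

    # Group bins by apparent source direction (stand-in for histogram peak picking)
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(features)
    labels = labels.reshape(X_l.shape)

    estimates = []
    for k in range(K):
        mask = (labels == k).astype(float)        # binary mask W_k(n, l)
        _, s_l = istft(mask * X_l, fs, nperseg=nperseg)
        _, s_r = istft(mask * X_r, fs, nperseg=nperseg)
        estimates.append(np.stack([s_l, s_r]))    # stereo image of source k
    return estimates
```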
In the case of single-channel (or mono) signals, the separa-
tion is even more challenging since there is no interchannel
information. Hence, there is a need to look into the inherent
physical or perceptual properties of the sound sources. Nonneg-
ative matrix factorization (NMF)-based approaches have been
extensively studied and applied in single-channel BSS in recent
years. The key idea of NMF is to formulate an atom-based repre-
sentation of the sound scene [17], where the atoms have repeti-
tive and nondestructive spectral structures. NMF usually
expresses the magnitude (or power) spectrogram of the mixture
as a product of the atoms and time-varying nonnegative weights
in an unsupervised manner. These atoms, after being multiplied
with their corresponding weights, can be considered as the estimates
of the individual sources.
[TABLE 1] AN OVERVIEW OF TYPICAL TECHNIQUES IN BSS.
OBJECTIVE: TO EXTRACT K (K ≥ 2) SOURCES FROM M MIXTURES

CASE                              TYPICAL TECHNIQUES
DETERMINED:      K = M            ICA [14]
OVERDETERMINED:  K < M            ICA WITH PCA OR LS [14]
UNDERDETERMINED: K > M,  M > 2    ICA WITH SPARSE SOLUTIONS [14], [15]
                         M = 2    TIME-FREQUENCY MASKING [16]
                         M = 1    NMF [17], [18]; CASA [19]
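To make the single-channel NMF decomposition described above concrete, the sketch below factorizes the magnitude spectrogram of a mono mixture into spectral atoms and nonnegative time-varying weights, then resynthesizes each atom's contribution with a soft mask and the mixture phase. The number of atoms, the masking step, and the function name are our own illustrative choices; grouping atoms into sources is left to the user, since unsupervised NMF does not label them.

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

def mono_nmf_decomposition(x, fs, n_atoms=8, nperseg=1024):
    """Unsupervised NMF of a mono mixture: V ~= W @ H, where the columns
    of W are spectral atoms and the rows of H are their time-varying
    nonnegative weights. Each atom is resynthesized via a soft mask."""
    _, _, X = stft(x, fs, nperseg=nperseg)
    V = np.abs(X)                                  # magnitude spectrogram
    eps = 1e-12

    model = NMF(n_components=n_atoms, init='nndsvda', max_iter=500)
    W = model.fit_transform(V)                     # (freq, atoms): spectral atoms
    H = model.components_                          # (atoms, time): weights

    atoms_in_time = []
    for k in range(n_atoms):
        V_k = np.outer(W[:, k], H[k])              # contribution of atom k
        mask = V_k / (W @ H + eps)                 # Wiener-like soft mask
        _, s_k = istft(mask * X, fs, nperseg=nperseg)
        atoms_in_time.append(s_k)
    return atoms_in_time                           # per-atom signals to be grouped into sources
```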