
IEEE SIGNAL PROCESSING MAGAZINE [106] MARCH 2015
ambient components also contribute significantly to the naturalness and immersiveness of the sound scenes. Subjective experiments revealed that BSS- and PAE-based headphone rendering can improve the externalization and enlarge the sound stage with minimal coloration [6].
Despite the recent advances in BSS and PAE, challenges arising from the complexity and uncertainty of sound scenes remain unresolved. One common challenge in both BSS and PAE is the growing number of audio sources in the sound scene, while only a limited number of mixtures (i.e., channels) are available. Sparse solutions in BSS and PAE require the sources to be sparse and disjoint in certain time-frequency representations [15]. Given the diversity of audio signals, finding a sparse representation that is robust across different types of audio signals is extremely difficult. The recorded or postprocessed source signals might even be filtered due to physical (or equivalently simulated) propagation and reflections. Moreover, adverse environmental conditions (including reverberation and strong ambient sound) usually degrade the performance of the decomposition. These difficulties can be addressed by studying the features of the resulting signals, by obtaining more prior information on the sources, the sound environment, and the mixing process [18], and by combining auditory with visual information of the scene.
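The sparsity and disjointness assumption mentioned above can be illustrated with a small synthetic example: when two sources rarely occupy the same time-frequency bin, an oracle binary mask recovers each source from a single mixture almost exactly, and the overlapping bins account for the residual error. The spectrogram grid and sources below are hypothetical, not drawn from this article.

```python
import numpy as np

# Toy illustration of the disjointness assumption behind time-frequency
# masking. The sources and grid are synthetic stand-ins.
rng = np.random.default_rng(0)

# Sparse synthetic magnitude "spectrograms" (freq bins x time frames):
# each bin is active with probability ~0.1.
s1 = rng.random((64, 100)) * (rng.random((64, 100)) > 0.9)
s2 = rng.random((64, 100)) * (rng.random((64, 100)) > 0.9)

mixture = s1 + s2

# Oracle binary mask: assign each bin to the dominant source.
mask = (s1 >= s2).astype(float)
est1 = mask * mixture
est2 = (1.0 - mask) * mixture

# Where the sources are truly disjoint, the estimates are exact; bins where
# both sources are active are the residual error the text warns about.
overlap = np.logical_and(s1 > 0, s2 > 0)
err = np.abs(est1 - s1)
print("fraction of overlapping bins:", overlap.mean())
print("max error outside overlap:", err[~overlap].max())
```

With real audio the active bins of different sources overlap far more often, which is why robust sparse representations are hard to find.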
INDIVIDUALIZATION OF HRTF
Binaural technology is the most promising solution for delivering spatial audio over headphones, as it is the closest to natural listening. Unlike conventional microphone recordings, which are meant for loudspeaker playback, binaural signals are recorded or synthesized at the ears of the listener. In a binaural audio system, the spatial encoding (i.e., the HRTFs) should encapsulate all the spectral features due to the interaction of the acoustic wave with the listener's morphology (torso, head, and pinna). The pinna, often regarded as an acoustic fingerprint, embeds the most idiosyncratic spectral features into the HRTFs, which are essential for accurate perception of the sound [Figure 3(a)]. Thus, HRTF features are highly individual, as shown in Figure 3(c). The HRTFs used for virtualization are often nonindividualized HRTFs, typically measured on a dummy head, since they are easily accessible.
However, the use of nonindividualized HRTFs leads to several artifacts, such as IHL, inaccurate perception of elevation, and front–back and up–down reversals. Additionally, subjects display poor angular resolution and sometimes find it difficult to pinpoint the exact location of the auditory image when nonindividualized HRTFs are used. Thus, individualization of the HRTFs [Figure 3(b)] plays a critical role in creating an immersive experience closest to natural listening. Individualized HRTFs can be obtained from acoustical measurements, from anthropometric features of the listener, or by customizing generic HRTFs with perceptual feedback or frontal projection of sound, as summarized in Table 3.
ACOUSTICAL MEASUREMENTS
The most straightforward individualization technique is to measure the individualized HRTFs for every listener at different sound positions [25], [26]. This is the ideal solution, but it is extremely tedious and involves highly precise measurements. These measurements also require the subjects to remain motionless for long periods, which may cause fatigue. Zotkin et al. developed a fast HRTF measurement system using the technique of reciprocity, where a microspeaker is placed into the ear and several microphones are placed around the listener [13]. Other researchers developed a continuous 3-D azimuth acquisition system that measures the HRTFs using a multichannel adaptive filtering technique [27]. However, all of these techniques for acoustically measuring individual HRTFs require a large amount of resources and expensive setups.
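As a rough illustration of how such acoustical measurements are typically turned into an HRTF, the sketch below deconvolves a hypothetical ear-microphone recording by a reference measurement via regularized spectral division; the excitation signal, the impulse response, and the regularization constant are all assumptions for this toy example, not details from the systems cited above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024

# Hypothetical "true" head-related impulse response (HRIR): a sparse FIR.
hrir_true = np.zeros(n)
hrir_true[[20, 35, 60]] = [1.0, -0.4, 0.15]

# Excitation: white noise stands in for the sweep/MLS used in practice.
excitation = rng.standard_normal(n)

# Ear recording = excitation filtered by the HRIR (circular for simplicity).
ear = np.fft.irfft(np.fft.rfft(excitation) * np.fft.rfft(hrir_true), n)

# Reference recording at the head position without the listener:
# here simply the raw excitation.
reference = excitation

# HRTF estimate by spectral division, lightly regularized to avoid
# dividing by near-zero bins.
eps = 1e-12
hrtf_est = np.fft.rfft(ear) / (np.fft.rfft(reference) + eps)
hrir_est = np.fft.irfft(hrtf_est, n)

print("max HRIR error:", np.max(np.abs(hrir_est - hrir_true)))
```

Real measurement systems additionally handle loudspeaker/microphone equalization, room reflections, and listener movement, which is exactly what makes them resource intensive.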
ANTHROPOMETRIC DATA
Individualized HRTFs can also be modeled as weighted sums of basis functions, either in the frequency domain or in the spatial domain. The basis functions are usually common to all individuals, and the individualization information is conveyed by the weights. The HRTFs are essentially expressed as weighted sums of a set of eigenvectors, which can be derived from PCA or ICA [26], [13]. The individual weights are derived from anthropometric parameters captured by optical descriptors, which can be obtained from direct measurements, pictures, or a 3-D mesh of the morphology [13]. Solving the problem of diffraction of an acoustic wave by the listener's body also results in individual HRTFs. This solution
[TABLE 2] COMPARISON BETWEEN BSS AND PAE IN SOUND SCENE DECOMPOSITION.

Objective (common to both): obtain useful information about the original
sound scene from the given mixtures and facilitate natural sound rendering.

Common characteristics: usually no prior information, only mixtures;
based on certain signal models; require objective as well as subjective
evaluation.

Basic mixing model:
  BSS: sums of multiple sources (independent, non-Gaussian, etc.)
  PAE: primary components (highly correlated) plus ambient components
       (uncorrelated)

Techniques:
  BSS: ICA [14], sparse solutions [15], time-frequency masking [16],
       NMF [17], [18], CASA [19], etc.
  PAE: PCA [20], LS [8], [21], time-frequency masking [7], [20],
       time/phase shifting [22], [23], etc.

Typical applications:
  BSS: speech, music
  PAE: movie, gaming

Related applications:
  BSS: speech enhancement, noise reduction, speech recognition,
       music classification
  PAE: sound reproduction, sound localization, coding

Limitations:
  BSS: small number of sources; sparseness/disjointness; no or simple
       environment
  PAE: small number of sources; sparseness/disjointness; low ambient
       power; primary and ambient components uncorrelated
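The weighted-sum-of-basis-functions model described under "Anthropometric Data" can be sketched with PCA on a synthetic HRTF database: a common set of eigenvectors is shared across the population, and each individual is summarized by a small weight vector. The population size, dimensions, and noise level below are assumptions standing in for a measured data set.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_freqs, n_basis = 50, 128, 8

# Hypothetical database: low-rank structure plus noise, mimicking the
# observation that a few components explain most inter-subject variance.
latent = rng.standard_normal((n_subjects, n_basis))
basis_true = rng.standard_normal((n_basis, n_freqs))
hrtfs = latent @ basis_true + 0.01 * rng.standard_normal((n_subjects, n_freqs))

# PCA via SVD of the mean-centered data.
mean = hrtfs.mean(axis=0)
u, s, vt = np.linalg.svd(hrtfs - mean, full_matrices=False)
eigvecs = vt[:n_basis]                 # common basis functions
weights = (hrtfs - mean) @ eigvecs.T   # individual weights, one row per subject

# Reconstruct one subject's HRTF from its low-dimensional weights.
recon = mean + weights[0] @ eigvecs
rel_err = np.linalg.norm(recon - hrtfs[0]) / np.linalg.norm(hrtfs[0])
print("relative reconstruction error:", rel_err)
```

In the individualization methods above, the weights would not come from measured HRTFs but would instead be regressed from anthropometric parameters, so that a new listener's HRTFs can be synthesized without an acoustic measurement.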