IEEE SIGNAL PROCESSING MAGAZINE [143] MARCH 2015
well suited for modeling nonlinear phenomena. Compositional models use iterative algorithms to estimate the model parameters, and their computational complexity becomes significant when large dictionaries are used; the accuracy of the models may therefore need to be compromised in real-time implementations. The optimization problems involved in compositional models are often nonconvex, so different algorithms and different initializations lead to different solutions, which must be taken into account when examining results obtained with the models. Even though designing algorithms for new compositional models is generally straightforward, the tendency of the algorithms to get stuck in a local minimum far from the global optimum increases as the structure of the model becomes more complex and the model order increases. Obtaining accurate solutions with complex models may therefore require carefully designed initializations or regularizations.
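The sensitivity to initialization described above can be illustrated with a minimal NumPy sketch of NMF with multiplicative updates minimizing the generalized KL divergence. This is a generic textbook formulation, not code from the article; the function name `nmf_kl` and the toy matrix are illustrative.

```python
import numpy as np

def nmf_kl(V, rank, n_iter=200, seed=0):
    """Multiplicative-update NMF minimizing generalized KL divergence.

    A minimal sketch: real toolkits add regularization, convergence
    checks, and normalization of the dictionary atoms.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + 1e-3   # dictionary (spectral atoms)
    H = rng.random((rank, T)) + 1e-3   # activations
    eps = 1e-12
    for _ in range(n_iter):
        WH = W @ H + eps
        # Update W; denominator is the row sums of H (Lee-Seung KL rule)
        W *= ((V / WH) @ H.T) / (H.sum(axis=1) + eps)
        WH = W @ H + eps
        # Update H; denominator is the column sums of W
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    WH = W @ H + eps
    kl = np.sum(V * np.log((V + eps) / WH) - V + WH)
    return W, H, kl

# Toy nonnegative "spectrogram"
rng = np.random.default_rng(42)
V = rng.random((20, 30))

# Different random initializations generally converge to different
# local optima of the nonconvex objective, hence different divergences.
_, _, kl_a = nmf_kl(V, rank=3, seed=1)
_, _, kl_b = nmf_kl(V, rank=3, seed=2)
```

Because the objective is nonconvex in `(W, H)` jointly, comparing `kl_a` and `kl_b` across seeds (or restarting from several initializations and keeping the best) is a common practical remedy, alongside the structured initializations mentioned above.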
Compositional models provide a single framework that enables modeling of several phenomena present in real-world audio: additive sources, sources consisting of multiple sound objects, convolutive noise, and reverberation. Frameworks that combine these in a systematic and flexible way have already been presented [57], [58]. Moreover, the ability of the models to couple acoustic and other types of information enables audio analysis and recognition directly using the model. To be able to handle all of this within a single framework is a great advantage in comparison to methods that tackle just a specific task, since it offers the potential of jointly modeling multiple effects that affect each other, such as reverberation and source mixing.
ACKNOWLEDGMENTS
Tuomas Virtanen was financially supported by the Academy of
Finland (grant 258708). The research of Jort F. Gemmeke was
funded by the IWT-SBO project ALADIN contract 100049.
AUTHORS
Tuomas Virtanen (tuomas.virtanen@tut.fi) received the M.S.
and doctor of science degrees in information technology from
the Tampere University of Technology (TUT), Finland, in 2001
and 2006, respectively. He is an academy research fellow and an
adjunct professor in the Department of Signal Processing, TUT.
He is also a research associate in the Department of Engineering, Cambridge University, United Kingdom. He is known for
his pioneering work on single-channel sound source separation
using nonnegative matrix factorization-based techniques and
their application to noise-robust speech recognition, music
content analysis, and audio classification. His other research
interests include the content analysis of audio signals and
machine learning.
Jort F. Gemmeke (jgemmeke@amadana.nl) received the M.S.
degree in physics from the Universiteit van Amsterdam in 2005.
In 2011, he received the Ph.D. degree from the University of Nijmegen on the subject of noise-robust automatic speech recognition (ASR) using missing data techniques. He is a postdoctoral researcher at KU Leuven, Belgium. He is known for pioneering the field of exemplar-based noise-robust ASR. His research interests are automatic speech recognition, source separation, noise robustness, and acoustic modeling, in particular, exemplar-based methods and methods using sparse representations.
Bhiksha Raj (bhiksha@cs.cmu.edu) received the Ph.D.
degree from Carnegie Mellon University (CMU) in 2000. From
2001 to 2008, he worked at Mitsubishi Electric Research Labs in
Cambridge, Massachusetts, where he led the research effort on
speech processing. He is an associate professor at the Language
Technologies Institute of CMU with additional affiliations to the
Machine Learning and Electrical and Computer Engineering
Departments of the university. He has been at CMU since 2008.
His research interests include speech and audio processing,
automatic speech recognition, natural language processing, and
machine learning.
Paris Smaragdis (paris@illinois.edu) received the Ph.D.
degree from the Massachusetts Institute of Technology in 2003.
He is an assistant professor in the Computer Science and Electrical Engineering Departments at the University of Illinois, Urbana-Champaign, and a research scientist at Adobe. He is the inventor of frequency-domain ICA and several of the approaches that are now common in compositional model-based signal enhancement. His research interests are in computer audition, machine learning, and speech recognition.
REFERENCES
[1] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Hoboken, NJ: Wiley, 2009.
[2] P. Smaragdis, “Convolutive speech bases and their application to supervised
speech separation,” IEEE Trans. Audio, Speech, Lang. Processing, vol. 15, no. 1, pp. 1–12, 2007.
[3] T. Virtanen, “Monaural sound source separation by nonnegative matrix fac-
torization with temporal continuity and sparseness criteria,” IEEE Trans. Audio,
Speech, Lang. Processing, vol. 15, no. 3, pp. 1066–1074, 2007.
[4] J. Gemmeke, T. Virtanen, and A. Hurmalainen, “Exemplar-based sparse rep-
resentations for noise robust automatic speech recognition,” IEEE Trans. Audio,
Speech, Lang. Processing, vol. 19, no. 7, pp. 2067–2080, 2011.
[5] T. Heittola, A. Klapuri, and T. Virtanen, “Musical instrument recognition in
polyphonic audio using source-filter model for sound separation,” in Proc. Int.
Conf. Music Information Retrieval, Kobe, Japan, 2009, pp. 327–332.
[6] J. T. Geiger, F. Weninger, A. Hurmalainen, J. F. Gemmeke, M. Wöllmer, B. Schuller, G. Rigoll, and T. Virtanen, “The TUM+TUT+KUL approach to the CHiME
Challenge 2013: Multi-stream ASR exploiting BLSTM networks and sparse NMF,”
in Proc. 2nd Int. Workshop on Machine Listening in Multisource Environments,
Vancouver, Canada, 2013, pp. 25–30.
[7] Y.-C. Cho and S. Choi, “Nonnegative features of spectro-temporal sounds for
classification,” Pattern Recognit. Lett., vol. 26, no. 9, pp. 1327–1336, 2005.
[8] N. Bertin, R. Badeau, and E. Vincent, “Enforcing harmonicity and smooth-
ness in Bayesian nonnegative matrix factorization applied to polyphonic music
transcription,” IEEE Trans. Audio, Speech, Lang. Processing, vol. 18, no. 3, pp. 538–549, 2010.
[9] D. Bansal, B. Raj, and P. Smaragdis, “Bandwidth expansion of narrowband
speech using non-negative matrix factorization,” in Proc. EUROSPEECH, Lisbon,
Portugal, 2005, pp. 1505–1508.
[10] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in
convolutive mixtures for audio source separation,” IEEE Trans. Audio, Speech,
Lang. Processing, vol. 18, no. 3, pp. 550–563, 2010.