IEEE SIGNAL PROCESSING MAGAZINE [143] MARCH 2015
well suited for modeling nonlinear phenomena. Compositional models use iterative algorithms to estimate the model parameters, and their computational complexity becomes significant when large dictionaries are used; the accuracy of the models may therefore need to be compromised in real-time implementations. The optimization problems involved in compositional models are often nonconvex, so different algorithms and different initializations lead to different solutions, which must be taken into account when examining results obtained with the models. Even though designing algorithms for new compositional models is generally straightforward, the tendency of the algorithms to get stuck in a local minimum far from the global optimum increases as the structure of the model becomes more complex and the model order increases. Obtaining accurate solutions with complex models may therefore require carefully designed initializations or regularizations.
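The sensitivity to initialization described above can be illustrated with a minimal NumPy sketch of NMF with multiplicative updates minimizing the generalized KL divergence. This is a generic textbook formulation, not code from the article; the function name `nmf_kl` and the toy matrix are illustrative.

```python
import numpy as np

def nmf_kl(V, rank, n_iter=200, seed=0):
    """Multiplicative-update NMF minimizing generalized KL divergence.

    A minimal sketch: real toolkits add regularization, convergence
    checks, and normalization of the dictionary atoms.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + 1e-3   # dictionary (spectral atoms)
    H = rng.random((rank, T)) + 1e-3   # activations
    eps = 1e-12
    for _ in range(n_iter):
        WH = W @ H + eps
        # Update W; denominator is the row sums of H (Lee-Seung KL rule)
        W *= ((V / WH) @ H.T) / (H.sum(axis=1) + eps)
        WH = W @ H + eps
        # Update H; denominator is the column sums of W
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    WH = W @ H + eps
    kl = np.sum(V * np.log((V + eps) / WH) - V + WH)
    return W, H, kl

# Toy nonnegative "spectrogram"
rng = np.random.default_rng(42)
V = rng.random((20, 30))

# Different random initializations generally converge to different
# local optima of the nonconvex objective, hence different divergences.
_, _, kl_a = nmf_kl(V, rank=3, seed=1)
_, _, kl_b = nmf_kl(V, rank=3, seed=2)
```

Because the objective is nonconvex in `(W, H)` jointly, comparing `kl_a` and `kl_b` across seeds (or restarting from several initializations and keeping the best) is a common practical remedy, alongside the structured initializations mentioned above.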
Compositional models provide a single framework that enables modeling of several phenomena present in real-world audio: additive sources, sources consisting of multiple sound objects, convolutive noise, and reverberation. Frameworks that combine these in a systematic and flexible way have already been presented [57], [58]. Moreover, the ability of the models to couple acoustic and other types of information enables audio analysis and recognition directly using the model. To be able to handle all of this within a single framework is a great advantage in comparison to methods that tackle just a specific task, since it offers the potential of jointly modeling multiple effects that affect each other, such as reverberation and source mixing.
ACKNOWLEDGMENTS
Tuomas Virtanen was financially supported by the Academy of
Finland (grant 258708). The research of Jort F. Gemmeke was
funded by the IWT-SBO project ALADIN contract 100049.
AUTHORS
Tuomas Virtanen (tuomas.virtanen@tut.fi) received the M.S.
and doctor of science degrees in information technology from
the Tampere University of Technology (TUT), Finland, in 2001
and 2006, respectively. He is an academy research fellow and an
adjunct professor in the Department of Signal Processing, TUT.
He is also a research associate in the Department of Engineering, Cambridge University, United Kingdom. He is known for
his pioneering work on single-channel sound source separation
using nonnegative matrix factorization-based techniques and
their application to noise-robust speech recognition, music
content analysis, and audio classification. His other research
interests include the content analysis of audio signals and
machine learning.
Jort F. Gemmeke (jgemmeke@amadana.nl) received the M.S.
degree in physics from the Universiteit van Amsterdam in 2005.
In 2011, he received the Ph.D. degree from the University of Nijmegen on the subject of noise-robust automatic speech recognition (ASR) using missing data techniques. He is a postdoctoral researcher at KU Leuven, Belgium. He is known for pioneering the field of exemplar-based noise-robust ASR. His research interests are automatic speech recognition, source separation, noise robustness, and acoustic modeling, in particular, exemplar-based methods and methods using sparse representations.
Bhiksha Raj (bhiksha@cs.cmu.edu) received the Ph.D.
degree from Carnegie Mellon University (CMU) in 2000. From
2001 to 2008, he worked at Mitsubishi Electric Research Labs in
Cambridge, Massachusetts, where he led the research effort on
speech processing. He is an associate professor at the Language
Technologies Institute of CMU with additional affiliations to the
Machine Learning and Electrical and Computer Engineering
Departments of the university. He has been at CMU since 2008.
His research interests include speech and audio processing,
automatic speech recognition, natural language processing, and
machine learning.
Paris Smaragdis (paris@illinois.edu) received the Ph.D.
degree from the Massachusetts Institute of Technology in 2003.
He is an assistant professor in the Computer Science and Electrical Engineering Departments at the University of Illinois, Urbana-Champaign, and a research scientist at Adobe. He is the inventor of frequency-domain ICA and several of the approaches that are now common in compositional model-based signal enhancement. His research interests are in computer audition, machine learning, and speech recognition.
REFERENCES
[1] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Hoboken, NJ: Wiley, 2009.
[2] P. Smaragdis, “Convolutive speech bases and their application to supervised
speech separation,” IEEE Trans. Audio, Speech, Lang. Processing, vol. 15, no. 1, pp. 1–12, 2007.
[3] T. Virtanen, “Monaural sound source separation by nonnegative matrix fac-
torization with temporal continuity and sparseness criteria,” IEEE Trans. Audio,
Speech, Lang. Processing, vol. 15, no. 3, pp. 1066–1074, 2007.
[4] J. Gemmeke, T. Virtanen, and A. Hurmalainen, “Exemplar-based sparse rep-
resentations for noise robust automatic speech recognition,” IEEE Trans. Audio,
Speech, Lang. Processing, vol. 19, no. 7, pp. 2067–2080, 2011.
[5] T. Heittola, A. Klapuri, and T. Virtanen, “Musical instrument recognition in
polyphonic audio using source-filter model for sound separation,” in Proc. Int.
Conf. Music Information Retrieval, Kobe, Japan, 2009, pp. 327–332.
[6] J. T. Geiger, F. Weninger, A. Hurmalainen, J. F. Gemmeke, M. Wöllmer, B. Schuller, G. Rigoll, and T. Virtanen, “The TUM+TUT+KUL approach to the CHiME
Challenge 2013: Multi-stream ASR exploiting BLSTM networks and sparse NMF,”
in Proc. 2nd Int. Workshop on Machine Listening in Multisource Environments,
Vancouver, Canada, 2013, pp. 25–30.
[7] Y.-C. Cho and S. Choi, “Nonnegative features of spectro-temporal sounds for
classification,” Pattern Recognit. Lett., vol. 26, no. 9, pp. 1327–1336, 2005.
[8] N. Bertin, R. Badeau, and E. Vincent, “Enforcing harmonicity and smooth-
ness in Bayesian nonnegative matrix factorization applied to polyphonic music
transcription,” IEEE Trans. Audio, Speech, Lang. Processing, vol. 18, no. 3, pp. 538–549, 2010.
[9] D. Bansal, B. Raj, and P. Smaragdis, “Bandwidth expansion of narrowband
speech using non-negative matrix factorization,” in Proc. EUROSPEECH, Lisbon,
Portugal, 2005, pp. 1505–1508.
[10] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in
convolutive mixtures for audio source separation,” IEEE Trans. Audio, Speech,
Lang. Processing, vol. 18, no. 3, pp. 550–563, 2010.