
IEEE SIGNAL PROCESSING MAGAZINE [51] MARCH 2015
straightforward optimization problems that can be solved using the
Karush–Kuhn–Tucker conditions. The resulting analytic solutions
are easy to implement. The later work of [8] models $A_i(E_i, D_i)$ more accurately at low SNR values and provides improved performance over the original work of [4] under low SNR conditions.
The discussion in this section assumed stationarity. Time variation can be accounted for by recursively updating the equivalent spectrum levels $E_i$ and $D_i$ and periodically recomputing the gains $g_i$ [4]. This is consistent with the SII update described in [27], which uses frequency-dependent temporal windows.
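As a hedged illustration, the recursive update could be realized with exponential smoothing of the per-band levels, followed by a periodic gain recomputation under a power constraint. The function names, the forgetting factor, and the simple SNR-based gain rule below are illustrative assumptions, not the closed-form solution of [4]:

```python
import numpy as np

def update_levels(E, D, speech_pow, noise_pow, alpha=0.9):
    """Recursively smooth the per-band equivalent speech (E) and
    disturbance (D) spectrum levels with a forgetting factor alpha."""
    E = alpha * E + (1.0 - alpha) * speech_pow
    D = alpha * D + (1.0 - alpha) * noise_pow
    return E, D

def recompute_gains(E, D, total_power):
    """Placeholder gain rule: apply more gain where the band SNR is
    low, then renormalize so the overall output power is unchanged."""
    snr = E / np.maximum(D, 1e-12)
    g = 1.0 / np.sqrt(np.maximum(snr, 1e-12))
    # renormalize to satisfy the power constraint sum(g^2 * E) = total_power
    g *= np.sqrt(total_power / np.sum(g**2 * E))
    return g
```

The smoothing step runs every frame, while the (costlier) gain recomputation can run at a slower rate, mirroring the periodic update described above.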
WORD-SEQUENCE PROBABILITY-BASED ENHANCEMENT
The section “Defining Intelligibility” identified the suitability of the
expected probability of correct message recognition as a measure for
optimizing intelligibility at a high level of abstraction. We noted in
the section “Measures Operating on a Word Sequence” that, under
an ergodicity assumption, the expectation over messages can be
approximated by averaging over time.
Optimizing a measure derived from
the probability of correct recognition
under a power constraint has been
shown to provide significant intelligi-
bility gain assuming that accurate
sound segmentation information and
an appropriate acoustic speech model
are available [11]. We emphasize that the method assumes that the ASR word probability tracks human recognition performance, which was found to be true in [11] but is not guaranteed. Here we provide more detail about this approach.
To make high-level machine-based optimization feasible in
practice, we can represent the message at the phoneme level. This
means we refine our Markov chain to include an intermediate level. The chain now becomes
$$M_T \rightarrow u_T \rightarrow a_T \rightarrow a_L \rightarrow u_L \rightarrow M_L,$$
where $u_T$ and $u_L$ denote the talker and listener phoneme sequences, respectively. By first performing time alignment of a sequence of acoustic feature vectors $a_T$ and a sequence of phonemes $u_T$ by means of an ASR engine, a practical intelligibility enhancement approach can be defined. The ASR speech model can then be used to provide the probability densities that characterize clean speech sounds in the acoustic feature space.
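As a rough sketch of how an acoustic model supplies such densities, the fragment below scores a time-aligned pairing of feature vectors and phoneme labels under hypothetical diagonal-Gaussian phoneme models. Real ASR engines use HMMs with mixture densities, so the function names and the single-Gaussian model form are illustrative assumptions:

```python
import numpy as np

def log_density(x, mean, var):
    """Log of a diagonal-Gaussian density evaluated at feature vector x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def alignment_log_likelihood(features, phonemes, models):
    """Sum of per-frame log-densities for a time-aligned pairing of
    acoustic feature vectors (a_T) and phoneme labels (u_T).
    `models` maps each phoneme to a (mean, var) pair."""
    return sum(log_density(x, *models[ph]) for x, ph in zip(features, phonemes))
```

A correct alignment should score higher than a shuffled one, which is the property an aligner exploits.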
To enhance intelligibility, we want to find the parameters $C^*$ of our speech modification scheme that maximize the average probability that the listener-interpreted phoneme sequence $u_L$ is the talker-generated sequence $u_T$:
$$C^* = \arg\max_C \, p_{u_L \mid u_T}(u_T \mid u_T, C), \qquad (14)$$
where the subscripts of the density label the density it represents. Note that the densities are consistent with the models shown in (4).
Simplifications were introduced in [11] to make the optimization tractable. It was tacitly assumed that the message is accurately represented by the phonemes, and production noise was not formally considered. It was also assumed that $v_E$ (the representation of the noise) can be approximated as deterministic, which is reasonable for typical acoustic signal representations and stationary noise. The only remaining uncertainty is due to the interpretation noise in the mapping from $a_L$ to $u_L$. In an ASR system based on an HMM, this is modeled by the observation noise.
Equation (14) can now be approximated by
$$C^* \approx \arg\max_C \, p_{u_L \mid a_L}\!\left(u_T \mid \hat{a}_L(u_T, C)\right) \qquad (15)$$
$$= \arg\max_C \, p_{a_L \mid u_L}\!\left(\hat{a}_L \mid u_T, C\right) p_{u_L}(u_T) \left( \sum_{u'} p_{a_L \mid u_L}\!\left(\hat{a}_L \mid u', C\right) p_{u_L}(u') \right)^{-1}, \qquad (16)$$
where we used Bayes’ rule and where $\hat{a}_L(u_T, C)$, abbreviated to $\hat{a}_L$, is the set of acoustic features observed by the listener, which is modeled as a deterministic function of the talker phoneme sequence $u_T$ and the speech modification parameters $C$. The first term of (16) is the likelihood of the talker phoneme sequence for the observed features $\hat{a}_L$, the second term is the a priori probability that the phoneme sequence $u_T$ is decoded by the listener, and the third term is the inverse a priori probability of the listener-observed features. Optimizing only the likelihood term reduces complexity and provides good results [11].
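A toy discrete sketch of this Bayes’-rule computation over a handful of candidate phoneme sequences might look as follows. The helper names, the brute-force search over a small set of parameter settings $C$, and the discrete likelihood tables are illustrative assumptions, not the ASR-based implementation of [11]:

```python
import numpy as np

def posterior(likelihoods, priors):
    """Bayes' rule as in (16): posterior of each candidate phoneme
    sequence given the observed features, from the per-candidate
    likelihoods p(a_L | u, C) and the priors p(u). The normalizing
    sum corresponds to the inverse third term of (16)."""
    joint = np.asarray(likelihoods) * np.asarray(priors)
    return joint / joint.sum()

def best_parameters(candidates, u_T, priors):
    """Pick the modification parameters C maximizing the posterior of
    the talker sequence (index u_T); `candidates` maps each C to the
    likelihoods of every candidate sequence under that modification."""
    return max(candidates, key=lambda C: posterior(candidates[C], priors)[u_T])
```

Restricting the objective to the first (likelihood) factor of (16) would amount to replacing `posterior(...)[u_T]` with `candidates[C][u_T]` in the `key` function.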
The theory is simplest to implement if the sequences are con-
sidered stationary. The averaging of (16) over long time intervals
(multiple sentences) is then preferred. In a practical implementa-
tion, shortcuts may have to be made due to requirements on delay
and complexity and because the stationarity assumption may not
be sufficiently accurate.
A system-level perspective of the proposed approach is
shown in Figure 3. In [11], the approach was validated for a
combination of two modifications: prosody-affecting phoneme
gain adjustment and a spectral modification redistributing the
signal energy across frequency bands. The method compared
favorably to a method based on the optimization of a measure
operating on a sequence of auditory states [14], discussed in the
section “Measures Operating on a Sequence of Auditory States.”
Results reported in [9] suggest that using the full Bayesian
approach rather than optimizing only the likelihood component
of (16) improves performance.
In text-to-speech applications it may be possible to select
from a set of phrases to convey a particular message. The mea-
sure given in (16) has also been used to determine the optimal
phrasing of utterances [19]. This study indicates that maximiz-
ing the probability of correct interpretation of the phoneme
sequence increases intelligibility. Considering prior information
on the predictability of various formulations is expected to fur-
ther enhance performance.
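Given any scoring function implementing such a measure, the phrase selection itself reduces to an argmax over the candidate formulations. The sketch below is a minimal hedged illustration, with the scoring function left abstract:

```python
def select_phrasing(phrases, intelligibility_score):
    """Among alternative phrasings of the same message, return the one
    whose predicted probability of correct interpretation (e.g., the
    measure in (16)) is highest. `intelligibility_score` maps a phrase
    to a scalar score and stands in for the full model-based measure."""
    return max(phrases, key=intelligibility_score)
```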