Basic Technique

Timescale modification of speech starts by dividing the speech into segments. Segments are then deleted to speed up the speaking rate or repeated to slow it down. The two key issues are how to segment the speech and which segments can be deleted or repeated without degrading intelligibility.
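As a rough illustration of the delete-or-repeat idea, here is a minimal Python sketch. It assumes fixed-length segments and a single `rate` factor, and it ignores the question of which segments are safe to touch. The function name `change_rate` and its parameters are hypothetical, chosen only for this example.

```python
def change_rate(samples, segment_len, rate):
    """Drop or repeat fixed-length segments to approximate a playback rate.

    rate > 1 skips segments (faster speech), rate < 1 repeats them (slower).
    Illustrative only: it treats every segment as safe to delete or repeat.
    """
    if segment_len <= 0 or rate <= 0:
        raise ValueError("segment_len and rate must be positive")
    # Split the signal into consecutive segments (the last one may be short).
    segments = [samples[i:i + segment_len]
                for i in range(0, len(samples), segment_len)]
    if not segments:
        return []
    out = []
    n_out = max(1, round(len(segments) / rate))
    for k in range(n_out):
        # Map each output slot back to a source segment.
        src = min(int(k * rate), len(segments) - 1)
        out.extend(segments[src])
    return out
```

With `rate=2.0` every other segment is dropped, and with `rate=0.5` every segment is played twice.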

Speech can be viewed as three broad classes of sounds. The first is voiced speech, a periodic signal with a specific pitch. The vowels are voiced speech. The second is unvoiced speech, a broadband, noise-like signal. The ‘s’ and ‘sh’ sounds are examples of unvoiced speech. The third class consists of transition sounds called stops and plosives. These sounds are a sudden transition to or from silence. The ‘t’ sound at the end of a word is a stop, while the ‘p’ sound at the beginning of a word is a plosive (Rabiner, Schafer: Introduction to Digital Speech Processing, Now Publishers Inc, 2007). Finally, in this application we are interested in a fourth category, silence. Each of these categories of sounds is handled differently when segmenting speech for timescale modification.
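One common way to tell some of these categories apart, frame by frame, is to look at short-time energy and zero-crossing rate. The sketch below is an assumption for illustration and is not the method used in my implementation: the function name `classify_frame` and both thresholds are hypothetical, real thresholds depend on the recording, and detecting stops and plosives needs more than this.

```python
def classify_frame(frame, energy_floor=1e-4, zcr_split=0.25):
    """Label a frame of samples as 'silence', 'voiced', or 'unvoiced'.

    energy_floor and zcr_split are placeholder thresholds, not tuned values.
    The frame must contain at least two samples.
    """
    if len(frame) < 2:
        raise ValueError("frame must contain at least two samples")
    # Mean squared amplitude: very low energy is treated as silence.
    energy = sum(x * x for x in frame) / len(frame)
    if energy < energy_floor:
        return "silence"
    # Fraction of adjacent sample pairs whose signs differ.
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    # Noise-like signals cross zero often; periodic voiced speech does not.
    return "unvoiced" if zcr > zcr_split else "voiced"
```

A low-frequency sine wave comes out as voiced, a rapidly alternating signal as unvoiced, and a run of zeros as silence.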

For voiced speech we must use pitch-synchronous segmentation to achieve a natural and intelligible sound. Since voiced speech is a periodic waveform, the dominant frequency, or pitch, of the signal can be detected. Voiced sounds are then segmented on period boundaries so that when segments are deleted or repeated, the periodic waveform is not distorted.
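The pitch period can be estimated in several ways. The sketch below uses plain autocorrelation, which is one standard choice and an assumption here rather than the specific detector I used. The names `estimate_period` and `pitch_synchronous_segments`, and the lag limits passed in, are hypothetical.

```python
def estimate_period(frame, min_lag, max_lag):
    """Return the lag, in samples, with the highest autocorrelation.

    min_lag and max_lag bound the search to plausible pitch periods.
    """
    if not 0 < min_lag <= max_lag < len(frame):
        raise ValueError("need 0 < min_lag <= max_lag < len(frame)")
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag`.
        score = sum(frame[i] * frame[i + lag]
                    for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def pitch_synchronous_segments(samples, period):
    """Cut the signal into whole segments of exactly one period each."""
    return [samples[i:i + period]
            for i in range(0, len(samples) - period + 1, period)]
```

Because every segment spans exactly one period, removing or duplicating a segment leaves the neighbouring periods lined up. Real speech drifts in pitch, so in practice the period would be re-estimated as the signal progresses.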

Unvoiced speech has no detectable period, so it can be segmented at arbitrary boundaries. Segments of unvoiced speech must be kept short enough that they do not contain parts of other types of speech signals. Periods of silence can be handled in a similar fashion, using short, arbitrary-length segments.

The transition sounds, stops and plosives, cannot be deleted or repeated. Deleting these important segments results in unintelligible speech, while repeating them can produce a stutter effect.

Another consideration is aligning the edges of the speech segments when deleting or repeating segments. When I first implemented a system for variable rate playback of speech in the 1980s, I simply placed the segments end to end, ignoring any discontinuities between the signal in one segment and the next. Though I had good results with this, the discontinuities add some noise to the speech. Since then a technique called Synchronized Overlap and Add (SOLA) has been developed. SOLA uses a windowing function on the segments, gradually ramping down one segment while ramping up the next segment. The overlapping portions of the windowed segments are added together at the edges (Coyle, Dorran, Lawlor: Audio Time-Scale Modification Using a Hybrid Time-Frequency Domain Approach).
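The ramp-and-add step can be sketched in a few lines of Python. This is a simplified illustration, assuming linear ramps and a fixed overlap length. It leaves out the synchronization part of SOLA, which picks the alignment between segments before they are added. The function name `crossfade_join` is hypothetical.

```python
def crossfade_join(left, right, overlap):
    """Join two segments, blending `overlap` samples with linear ramps.

    The tail of `left` ramps down while the head of `right` ramps up,
    and the two weighted signals are added in the overlap region.
    """
    if overlap <= 0:
        # No overlap requested: plain end-to-end placement.
        return list(left) + list(right)
    if overlap > min(len(left), len(right)):
        raise ValueError("overlap is longer than one of the segments")
    head = list(left[:-overlap])
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # ramp-up weight for `right`
        blended.append((1 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + blended + list(right[overlap:])
```

Joining `[0, 0, 0, 0]` and `[4, 4, 4, 4]` with an overlap of 3 gives a smooth step from 0 to 4 in place of an abrupt jump. The result is `overlap` samples shorter than simple concatenation, which has to be accounted for when computing the final playback rate.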
