The short time energy measurement of a speech signal can be used to determine voiced vs. unvoiced speech. Short time energy can also be used to detect the transition from unvoiced to voiced speech and vice versa. The energy of voiced speech is much greater than the energy of unvoiced speech.
Equation 1 defines the short time energy for a sampled signal, where h(n-m) is a windowing function. For simplicity, a rectangular windowing function is used, as defined in equation 2.
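The equations themselves are not reproduced in this text. As a reference point, the block below sketches the standard textbook form of the short time energy, which matches the h(n-m) notation used here. The symbols x(m) for the sampled signal and E_n for the energy at sample n are assumptions, and the original equations may have been written slightly differently.

```latex
% Equation 1 (assumed standard form): short time energy
E_n = \sum_{m=-\infty}^{\infty} x^2(m)\, h(n-m)

% Equation 2 (assumed): rectangular window of length N
h(n) =
\begin{cases}
1, & 0 \le n \le N-1 \\
0, & \text{otherwise}
\end{cases}
```

With this rectangular window the infinite sum collapses to the N most recent samples, that is, the sum of x^2(m) for m = n-N+1 through n.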
N in equation 2 is the length of the window in samples. The window must be long enough to encompass several pitch periods to produce a smooth representation of the amplitude of the signal. At the same time, the window must be short enough to reflect the rapid changes in amplitude that occur at the voiced/unvoiced boundaries. The selection of the window size is a compromise, since the pitch period at an 8 kHz sampling rate may be as short as 16 samples for a high pitched female or child's voice and as long as 200 samples for a low pitched male voice. A window size of 160 samples, or about 20 msec, is a good compromise. One of the advantages of using a tool like Octave to prototype algorithms is that it makes it easier to experiment with parameters like the window size.
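To make the window size trade-off concrete, here is a minimal sketch of the short time energy with a rectangular window. It is written in plain Python rather than Octave, and it is not the code from this series. The function name `short_time_energy` and the synthetic test signal (weak noise standing in for unvoiced speech, a 100 Hz tone standing in for voiced speech) are assumptions made purely for illustration.

```python
import math
import random

def short_time_energy(x, N):
    """E[n] = sum of x[m]**2 over the N most recent samples m = n-N+1 .. n.

    Samples before the start of the signal are treated as zero, so the
    first N-1 outputs cover a partially filled window.
    """
    energy = []
    running = 0.0
    for n, sample in enumerate(x):
        running += sample * sample        # sample entering the window
        if n >= N:
            running -= x[n - N] ** 2      # sample leaving the window
        energy.append(running)
    return energy

FS = 8000  # sampling rate in Hz, as in the text

# Hypothetical test signal: 0.1 s of weak noise, then 0.1 s of a 100 Hz tone.
random.seed(0)
unvoiced = [0.05 * random.uniform(-1.0, 1.0) for _ in range(800)]
voiced = [0.5 * math.sin(2 * math.pi * 100 * n / FS) for n in range(800)]
signal = unvoiced + voiced

# Compare a very short window, the 160 sample compromise, and a long window.
for N in (16, 160, 400):
    e = short_time_energy(signal, N)
    print(f"N={N:3d} ({1000 * N / FS:.0f} ms): "
          f"unvoiced E={e[799]:.4f}, voiced E={e[1599]:.4f}")
```

Plotting `e` for each value of N, rather than printing two points, should show the effect described above: a short window follows the individual pitch periods, while a long window smears the unvoiced to voiced transition.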
One problem with the short time energy function is that it is very sensitive to large signal levels, since the sample values are squared. This isn’t a problem in Octave, since Octave scales audio samples to +/- 1. In a production application using fixed point math, however, it is a disadvantage. In addition, a multiply operation is required for each sample. So instead of using the short time energy function, I’ll use a related function, the short time magnitude function shown in equation 3.
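The fixed point concern can be illustrated with a rough sketch. It assumes the short time magnitude simply replaces the square in the energy sum with an absolute value over the same rectangular window, and that samples are 16-bit signed integers. Both are assumptions, since neither the equation nor the production sample format is given here. The function name `short_time_magnitude` is likewise hypothetical, and the code is plain Python rather than Octave.

```python
def short_time_magnitude(x, N):
    """M[n] = sum of |x[m]| over the N most recent samples m = n-N+1 .. n."""
    magnitude = []
    running = 0
    for n, sample in enumerate(x):
        running += abs(sample)            # sample entering the window (no multiply)
        if n >= N:
            running -= abs(x[n - N])      # sample leaving the window
        magnitude.append(running)
    return magnitude

N = 160             # 20 msec window at 8 kHz, as in the text
FULL_SCALE = 32767  # largest positive 16-bit sample (assumed sample format)

# Worst case: every sample in the window sits at full scale.
worst_energy = N * FULL_SCALE ** 2
worst_magnitude = N * FULL_SCALE

print("bits needed for energy accumulator:   ", worst_energy.bit_length())
print("bits needed for magnitude accumulator:", worst_magnitude.bit_length())
```

Under these assumptions the worst case energy sum no longer fits in a 32-bit accumulator, while the magnitude sum fits comfortably and needs only an absolute value and an add per sample.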
In the next post I’ll show the short time magnitude calculation in Octave and a plot for our speech segment, “Mister Meryk”.










