Classifying speech segments as silence, voiced, or unvoiced is an important component of the speech timescale modification algorithm. This classification is accomplished using the short-time magnitude and zero crossing rate of the speech signal. Many published algorithms for classifying speech segments as voiced or unvoiced focus on determining the exact endpoints of these segments. For example, in Digital Processing of Speech Signals, Rabiner & Schafer describe an algorithm that uses short-time energy to set the inner bounds for the endpoints of an utterance. The zero crossing rate is then used, working outward from these bounds, to find the actual endpoints. Applications such as speech recognition or some types of speech compression require this level of precision. The speech timescale modification algorithm only needs to determine whether an endpoint of silence, voiced, or unvoiced speech occurred within a 10 to 20 millisecond speech segment. For this level of precision, the following simple logic is used:
| | Low zero crossing rate | High zero crossing rate |
| Low magnitude | silence | ? |
| High magnitude | voiced | unvoiced |
Theoretically, a low magnitude with a high zero crossing rate (marked ? in the table) should never occur. Any real-world application must nevertheless deal with such possibilities, so the Speed Reader application treats these segments as transitions to be on the safe side.
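The decision table can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Speed Reader source: the function names, the definition of zero crossing rate as the fraction of adjacent sample pairs that change sign, and the two threshold parameters are all hypothetical.

```python
from typing import Sequence


def average_magnitude(frame: Sequence[float]) -> float:
    """Short-time average magnitude: mean of |x[n]| over the frame."""
    if len(frame) < 2:
        raise ValueError("frame must contain at least two samples")
    return sum(abs(x) for x in frame) / len(frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ (assumed definition)."""
    if len(frame) < 2:
        raise ValueError("frame must contain at least two samples")
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def classify_segment(
    frame: Sequence[float],
    mag_threshold: float,   # hypothetical boundary between low and high magnitude
    zcr_threshold: float,   # hypothetical boundary between low and high crossing rate
) -> str:
    """Apply the two-by-two decision table to one 10 to 20 ms segment."""
    high_mag = average_magnitude(frame) > mag_threshold
    high_zcr = zero_crossing_rate(frame) > zcr_threshold
    if high_mag:
        return "unvoiced" if high_zcr else "voiced"
    # Low magnitude with a high crossing rate is the "?" cell of the table;
    # it is reported as a transition, as the text describes.
    return "transition" if high_zcr else "silence"
```

A caller would pass each 10 to 20 millisecond segment in turn, for example `classify_segment(segment, 0.1, 0.3)`, where both threshold values are placeholders chosen for illustration.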
The final problem is to determine the values that separate high from low levels of magnitude and zero crossing rate. The average magnitude is calculated for each 100 millisecond segment of the first second of speech. The 100 millisecond segment with the lowest average magnitude is considered silence. The mean and standard deviation of the magnitude and zero crossing rate are calculated for this segment to characterize the background noise level.
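The background noise measurement might look like the following sketch. The text does not say how the mean and standard deviation are taken within the quietest 100 millisecond segment, so this example assumes the segment is split into shorter frames (10 milliseconds by default) and the statistics are computed across those frames. All names here, including `characterize_background`, `NoiseStats`, and `frame_ms`, are hypothetical.

```python
import statistics
from typing import List, NamedTuple, Sequence


class NoiseStats(NamedTuple):
    mag_mean: float
    mag_std: float
    zcr_mean: float
    zcr_std: float


def average_magnitude(frame: Sequence[float]) -> float:
    """Mean of |x[n]| over the frame."""
    return sum(abs(x) for x in frame) / len(frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ (assumed definition)."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def split(samples: Sequence[float], size: int) -> List[Sequence[float]]:
    """Consecutive non-overlapping frames of `size` samples; a short tail is dropped."""
    if size < 2:
        raise ValueError("frame size must be at least two samples")
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def characterize_background(
    samples: Sequence[float], sample_rate: int, frame_ms: int = 10
) -> NoiseStats:
    """Find the quietest 100 ms window in the first second and summarize it."""
    first_second = samples[:sample_rate]
    windows = split(first_second, sample_rate // 10)  # 100 ms windows
    if not windows:
        raise ValueError("need at least 100 ms of audio")
    quietest = min(windows, key=average_magnitude)

    # Assumption: statistics are taken across short frames inside the window.
    frames = split(quietest, sample_rate * frame_ms // 1000)
    mags = [average_magnitude(f) for f in frames]
    zcrs = [zero_crossing_rate(f) for f in frames]
    return NoiseStats(
        statistics.mean(mags),
        statistics.pstdev(mags),
        statistics.mean(zcrs),
        statistics.pstdev(zcrs),
    )
```

The returned statistics describe the noise floor only; how they are turned into the high and low thresholds is not specified here and is left to the caller.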







