PESQ (Perceptual Evaluation of Speech Quality), defined in ITU-T Recommendation P.862, is a method of measuring audio quality that generates a Mean Opinion Score (MOS) for a narrowband (NB) audio codec. PESQ is best suited to G.711 A-law and μ-law codecs and to the narrowband voice band of 300 to 3400 Hz.
Below are the PESQ reference guidelines:
- Reference Audio:
- should be 8-30 seconds long with at least 3.2 seconds of speech.
Note: Reference Audio longer than 30 seconds is supported, but files of this length reduce the reliability of PESQ results because of a limitation of the PESQ algorithm. See the PESQ application guide linked at the end of this article for more information.
- should be made up of 40-80% speech
- should contain some silence
- should be made up of utterances separated by silent periods that represent natural pauses in speech (for example, two short sentences separated by a silent period of at least 1 second, with an overall length of 8 seconds)
- should include a few continuous utterances rather than many short utterances of speech such as rapid listing of digits
- must not include more than 50 utterances (segments)
- should not contain artificial voice signals (for example, TTS)
- must not contain music
- should contain leading and trailing silence of between 0.5 and 2.0 seconds
- should contain adequately low amounts of noise
- PESQ may give erroneous results if speech is missing, or if silence is added to or removed from the degraded signal. If the durations of speech in the reference and degraded signals differ by more than 25%, the effect may be large enough to significantly bias the result.
- Degraded audio should not have gone through noise reduction systems.
- The result for a Prompt that has been broken into multiple segments cannot simply be taken as the average of the segments' PESQ scores (you must segment with sufficient leading/trailing silence).
- PESQ cannot be used to evaluate the effects of the receiving/listening level (in other words, volume differences).
- If long pauses are included at the beginning and end of the degraded signal, then the level alignment process may be sub-optimal. This issue may become a problem if the reference and degraded signal durations differ by more than 20%.
- PESQ does not take into account any distortion in the degraded signal occurring before the start or after the end of the active speech signal.
- PESQ MOS results may depend on how the coding frame boundaries align with the input audio data; the result can vary by up to 0.25 depending on where the frame boundaries fall. The recommended way to obtain a stable result is to average the scores over each of the 80 possible alignments.
- PESQ results are 95% reliable and exhibit a known and controlled accuracy when the algorithm is used on the same types of applications as those on which the algorithm has been trained, tested, and validated. In other words, the measurement scenarios need to represent statistically the same type of sample population as the ones on which P.862/P.862.1 has been trained, tested, validated, and calibrated for the determined accuracy values to remain valid. The results' reliability and accuracy become unknown and uncontrolled once the algorithm is used to evaluate speech quality on new types of technologies and/or using other types of codecs and/or new live networks.
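Several of the numeric guidelines above can be checked automatically before a file is used for testing. The following Python sketch is one possible way to do that and is not part of PESQ or of any PESQ tool. It assumes the utterance boundaries are already known (for example, from a voice activity detector of your choice); the function names `check_reference`, `speech_duration_mismatch`, and `average_over_alignments`, and the `score_fn` callable, are hypothetical names introduced here for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# (start_s, end_s) of one utterance, sorted by start time.
# Assumed to come from a voice activity detector of your choice.
Segment = Tuple[float, float]


def check_reference(duration_s: float, segments: Sequence[Segment]) -> List[str]:
    """Return the reference-audio guidelines that a recording violates."""
    problems: List[str] = []
    speech_s = sum(end - start for start, end in segments)

    if not 8.0 <= duration_s <= 30.0:
        problems.append("duration outside 8-30 s")
    if speech_s < 3.2:
        problems.append("less than 3.2 s of speech")
    if duration_s > 0 and not 0.40 <= speech_s / duration_s <= 0.80:
        problems.append("speech activity outside 40-80%")
    if len(segments) > 50:
        problems.append("more than 50 utterances")
    if segments:
        leading = segments[0][0]
        trailing = duration_s - segments[-1][1]
        if not 0.5 <= leading <= 2.0:
            problems.append("leading silence outside 0.5-2.0 s")
        if not 0.5 <= trailing <= 2.0:
            problems.append("trailing silence outside 0.5-2.0 s")
    return problems


def speech_duration_mismatch(ref_speech_s: float, deg_speech_s: float) -> float:
    """Difference in speech duration as a fraction of the reference duration."""
    return abs(deg_speech_s - ref_speech_s) / ref_speech_s


def average_over_alignments(score_fn: Callable[[int], float],
                            n_alignments: int = 80) -> float:
    """Average a score over every frame-boundary offset.

    score_fn(offset) is a hypothetical callable that shifts the input by
    `offset` samples, passes it through the codec and PESQ, and returns
    the resulting MOS. It must be supplied by the caller.
    """
    scores = [score_fn(offset) for offset in range(n_alignments)]
    return sum(scores) / len(scores)
```

For example, an 8-second file containing two 3-second sentences separated by 1 second of silence, with 0.5 seconds of silence at each end, passes every check: `check_reference(8.0, [(0.5, 3.5), (4.5, 7.5)])` returns an empty list. A value above 0.25 from `speech_duration_mismatch` indicates that the degraded signal falls outside the 25% guideline. The remaining guidelines (no music, no synthetic voice, low noise) depend on the audio content and are not covered by this sketch.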
Please refer to the PESQ application guide for more details: http://www.itu.int/ITU-T/recommendations/rec.aspx?id=9274.