VoIP details: Protocols and Packets
Carry the speech frame inside an RTP packet
Typical packetization time of 10–20 ms per audio frame.
See https://datatracker.ietf.org/wg/avt/charter/ (concluded in 2011)
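The per-packet overhead can be worked out directly from the header sizes. A minimal sketch, assuming a GSM full-rate codec frame (33 octets per 20 ms, matching the transcript's figures) and no IP options, RTP CSRC list, or header extensions:

```python
# Per-packet efficiency of RTP voice. Codec choice (GSM full-rate,
# 33 octets per 20 ms frame) is an illustrative assumption.

IPV4_HDR = 20   # octets, no options
IPV6_HDR = 40   # octets, fixed header only
UDP_HDR = 8     # octets
RTP_HDR = 12    # octets, no CSRC list or extensions

PAYLOAD = 33    # one GSM full-rate frame = 33 octets per 20 ms

def efficiency(ip_hdr: int, payload: int = PAYLOAD) -> float:
    """Fraction of the packet that is actual voice data."""
    overhead = ip_hdr + UDP_HDR + RTP_HDR
    return payload / (payload + overhead)

print(f"IPv4: {IPV4_HDR + UDP_HDR + RTP_HDR} octets overhead, "
      f"efficiency {efficiency(IPV4_HDR):.1%}")
print(f"IPv6: {IPV6_HDR + UDP_HDR + RTP_HDR} octets overhead, "
      f"efficiency {efficiency(IPV6_HDR):.1%}")
```

This reproduces the 40–60 octets of overhead mentioned in the transcript, and shows why efficiency is so low: well under half of each IPv4 packet, and only about a third of each IPv6 packet, is voice data.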
This should be compared to the durations relevant to speech phenomena:
“10 μs: smallest difference detectable by auditory system (localization),
3 ms: shortest phoneme (plosive burst),
10 ms: glottal pulse period,
100 ms: average phoneme duration,
4 s: exhale period during speech.”
(from Mark D. Skowronski’s slide titled ‘What is a “short” window of time?’ [Skowronski 2003])
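These time scales can be put next to the packetization time to see what a single lost packet costs. A small sketch, assuming one 20 ms audio frame per packet (the durations are from Skowronski's slide above):

```python
# Compare one lost 20 ms audio frame (one packet) against the speech
# time scales from Skowronski's slide. 20 ms is one common choice of
# packetization time, assumed here for illustration.

FRAME_MS = 20.0  # packetization time per audio frame

durations_ms = {
    "plosive burst (shortest phoneme)": 3.0,
    "glottal pulse period": 10.0,
    "average phoneme": 100.0,
}

for name, d in durations_ms.items():
    if d < FRAME_MS:
        print(f"{name} ({d} ms): fits inside one frame; "
              "a lost packet can erase it completely")
    else:
        print(f"{name} ({d} ms): one lost frame removes "
              f"{FRAME_MS / d:.0%} of it")
```

The point the transcript makes below falls out of this arithmetic: an average phoneme spans several frames, so losing one packet removes only a fraction of it, while the short events (plosive bursts, glottal pulses) are the ones a single lost packet can wipe out.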
Slide Notes
IETF AVT Working Group Charter http://www.ietf.org/html.charters/avt-charter.html
[Skowronski 2003] Mark D. Skowronski, “Windows Lecture”, from the course EEL 6586: Automatic Speech Processing, University of Florida, Computational Neuro-Engineering Lab, 10 February 2003. http://www.cnel.ufl.edu/~markskow/papers/windows.ppt
Transcript
[slide77] So, if we look at a voice over IP packet, here's an example of one using the real time protocol. We see that we have an IPv4, v6 header, we have a UDP header. We have an RTP header of 12 octets, and then we have the encoded voice information. But it's only about 33 octets. We have 40 to 60 octets in overhead. Oops, right, so we have really low efficiency, at least for voice. And this is with a 10 to 20 millisecond time per audio frame. So we take 10 or 20 milliseconds of audio encoded and send it off in a packet.

Well, this encoding time is important here, the size of this audio frame. Because, and this is from Mark David Skowronski’s slide, pointing out that 10 microseconds, that's the smallest difference in time that we can perceive. Our shortest phoneme, a plosive burst as it's called, is 3 milliseconds long. A glottal pulse, the thing that makes the audio for most of the sounds we speak, is 10 milliseconds long. And the average phoneme is 100 milliseconds long. And the exhale period, for those of us who can't talk continuously, is about every four seconds. So this is about the human way that we produce speech.

So why are these numbers so important with respect to this 10 to 20 milliseconds audio frame time? Remember, that audio frame gets turned into an IP packet. Well, what happens if we lose an IP packet? Does that have a lot of significance on what the receiver's going to perceive? Well, for everything except the glottal pulses and these plosive bursts, it's too small. We can ignore the fact that we lost that packet. The receiver will still hear the same thing. And I'm told that in some languages, and Swahili is supposed to be one of them, that there are sounds that are in this range down here that might be important to the listener that could get lost. But for most of the languages that I'm aware of, this is why voice over IP works. Because the sounds that we're making are all longer than the time for an audio frame that we've encoded.
So even if we miss one of the audio frames, it doesn't make that big a difference. And even for some of the things, even if we lose several [audio frames], it doesn't make a difference to the user's perception, except for one small problem. Anyone know what that is? And that is phase discontinuities. Because if we go from one audio frame, and we're missing this one, to this audio frame, if we have a waveform that's going like that, and we're at this point in the phase here, but we're at that point in the phase over here, because we missed this transition here, what happens to what you hear? It turns out you'll perceive it as a click, because there's a phase discontinuity. And so you'll have this little bit of annoyance from it, even though you could still understand, and you got all the speech content basically, it would not sound right. Well, there are techniques to be able to hide that.
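The phase-discontinuity click described in the transcript can be sketched numerically. For a pure tone, dropping one frame and splicing the neighbouring frames together shifts the waveform by the phase the tone would have advanced during the missing frame. The tone frequency and 20 ms frame time below are illustrative assumptions:

```python
import math

# Phase jump at the splice when one audio frame of a pure tone is lost.
# For a tone of frequency f and frame time T, the missing frame spans
# f*T cycles, so the splice is off by 2*pi*f*T (mod 2*pi) radians.

FRAME_S = 0.020   # 20 ms packetization time (assumed)
FREQ_HZ = 125.0   # example tone: 125 Hz * 20 ms = 2.5 cycles per frame

jump = (2 * math.pi * FREQ_HZ * FRAME_S) % (2 * math.pi)
print(f"phase jump after one lost frame: {jump / math.pi:.2f} * pi rad")
```

Here the lost frame spans exactly 2.5 cycles, so the spliced waveform arrives half a cycle out of phase: the signal inverts at the seam, and that abrupt discontinuity is what the ear hears as a click. Packet loss concealment techniques hide it by, for example, repeating or interpolating neighbouring frames so the phase stays continuous.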