Playout delay

Playout delay should track the network delay as it varies during a session [Narbutt and Murphy 2001][Moon 1998].
This delay is computed for each talk spurt from the observed average delay and the deviation from that average; the computation is similar to TCP's estimates of RTT and its deviation.
The beginning of a talk spurt is identified by examining the timestamps and/or sequence numbers (if silence detection is being done at the source).
The intervals between talk spurts^† give you a chance to catch up
- Without this, if the sender’s clock were slightly faster than the receiver’s clock, the queue would build without limit! This matters because the 8 kHz sampling in a PC’s CODEC is rarely exactly 8 kHz (similar problems happen at other sampling rates^‡).
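
The per-talk-spurt computation described above can be sketched as follows. This is a minimal illustration, not the algorithm of any cited paper: the names `PlayoutEstimator`, `is_new_talk_spurt`, and `schedule` are hypothetical, the smoothing weight `alpha` and the deviation multiplier `margin` are assumed values chosen to echo TCP's RTT smoothing, and the packets are assumed to carry RTP-style 16-bit sequence numbers and 32-bit timestamps counted in samples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlayoutEstimator:
    """Running estimate of delay and its deviation (hypothetical helper)."""
    alpha: float = 0.875            # assumed smoothing weight
    margin: float = 4.0             # assumed multiplier on the deviation
    delay: Optional[float] = None   # smoothed delay, seconds
    deviation: float = 0.0          # smoothed absolute deviation, seconds

    def update(self, observed_delay: float) -> None:
        # The first observation seeds the average; later ones are smoothed in.
        if self.delay is None:
            self.delay = observed_delay
            return
        self.delay = self.alpha * self.delay + (1 - self.alpha) * observed_delay
        self.deviation = (self.alpha * self.deviation
                          + (1 - self.alpha) * abs(self.delay - observed_delay))

    def playout_delay(self) -> float:
        # Average delay plus a safety margin proportional to the deviation.
        return self.delay + self.margin * self.deviation


def is_new_talk_spurt(prev_seq, prev_ts, seq, ts, samples_per_packet):
    """Next sequence number but a timestamp jump: the source was silent."""
    consecutive = seq == (prev_seq + 1) % 65536   # 16-bit wraparound
    ts_gap = (ts - prev_ts) % 2**32               # 32-bit wraparound
    return consecutive and ts_gap > samples_per_packet


def schedule(packets, samples_per_packet=160, clock_rate=8000):
    """Map (seq, timestamp, arrival_s) tuples to playout times in seconds.

    The playout delay is re-read only at the start of a talk spurt, so the
    adjustment is absorbed by the silence between spurts.
    Timestamp wraparound in the send-time conversion is ignored here.
    """
    estimator = PlayoutEstimator()
    playout_times = []
    prev = None
    spurt_delay = None
    for seq, ts, arrival in packets:
        send_s = ts / clock_rate
        estimator.update(arrival - send_s)
        if prev is None or is_new_talk_spurt(prev[0], prev[1], seq, ts,
                                             samples_per_packet):
            spurt_delay = estimator.playout_delay()
        playout_times.append(send_s + spurt_delay)
        prev = (seq, ts)
    return playout_times
```

Within a talk spurt every packet is played at its send time plus the same fixed offset, so the spacing of the audio is preserved; only the length of the silence between spurts changes when the estimate moves.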

^† The average silence duration (~596 ms) combined with the average talk-spurt duration (227 ms) ⇒ a long-term speech activity factor of 27.6% [P.800].

^‡ A common approach is to sample at a high frequency, such as 48 K samples/second, then down sample (or up sample) digitally in software. You can then take advantage of the fact that you have multiple subsamples in the incoming speech (or outgoing speech) to do clever things to time expand or compress the audio. Additionally, by using a single high frequency for all of the audio that you are sending to (or receiving from) your audio interface, you can mix audio from different sources (for example, playing high quality music in the background while you listen to a G.711 call). For examples of this see [Pardo 2005].
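
To get a feel for how quickly a sampling-clock mismatch fills (or drains) the playout queue, the arithmetic can be sketched as below. The function name `drift_samples` and the 100 ppm sender error are illustrative assumptions, not measurements from any particular CODEC.

```python
def drift_samples(nominal_hz, sender_ppm, receiver_ppm, seconds):
    """Surplus samples produced by the sender over what the receiver plays.

    Clock errors are given in parts per million relative to the nominal rate.
    A positive result means the receiver's queue grows; a negative result
    means the receiver runs out of audio.
    """
    sender_hz = nominal_hz * (1 + sender_ppm / 1e6)
    receiver_hz = nominal_hz * (1 + receiver_ppm / 1e6)
    return (sender_hz - receiver_hz) * seconds


# Assumed example: sender 100 ppm fast, receiver exact, one hour of audio.
surplus = drift_samples(8000, 100, 0, 3600)
surplus_ms = surplus / 8000 * 1000
print(f"{surplus:.0f} surplus samples, about {surplus_ms:.0f} ms of audio")
```

With these assumed numbers the sender produces 0.8 extra samples per second, which adds up to about 2880 samples (360 ms of audio) per hour unless the receiver trims the silence between talk spurts or resamples.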

Slide Notes

CCITT Recommendation P.800, Methods for Subjective Determination of Transmission Quality, specifically Section 7: Subjective Opinion Tests, paragraph 3.1.2.3 Silence (gap) characteristics, CCITT, 1988. http://starlet.deltatel.ru/ccitt/1988/ascii/5_1_06.txt

ITU-T, Methods for Subjective Determination of Transmission Quality, Recommendation P.800, March 1993

Ignacio Sánchez Pardo, Spatial Audio for the Mobile User, M.Sc. Thesis, Royal Institute of Technology (KTH), School of Information and Communication Technology, Telecommunication Systems Lab, Stockholm, Sweden, IMIT/TSLab-2005-01, March 2005 http://web.it.kth.se/~maguire/DEGREE-PROJECT-REPORTS/050307-Ignacio_Sanchez_Pardo-with-cover.pdf

Miroslaw Narbutt and Liam Murphy, “Adaptive Playout Buffering for Audio/Video Transmission over the Internet”, University College Dublin, Department of Computer Science, Dublin, Ireland, 2001 http://www.eeng.dcu.ie/~narbutt/UKTS_2001.pdf

Sue B. Moon, Jim Kurose, and Don Towsley, “Packet audio playout delay adjustment: performance bounds and algorithms”, Multimedia Systems (1998) 6:17-28. http://www.cs.unc.edu/Courses/comp249-s02/readings/packet-audio-playout.pdf

Transcript

[slide85] So, the play-out delay. Ideally, you track what the delay is, and you adapt: Narbutt and Murphy [2001] and Moon [1998]. The delay is computed for every talk spurt, and what we try to do is use a computation similar to what we would do in TCP to estimate the round-trip time and its deviation. Right, because we're trying to estimate, when will the next packet containing audio likely arrive? That's what we're trying to track. The good thing is, the talk spurts are identified by, gee, we look at the timestamps and the sequence numbers. And we can see, if the sequence number is the next sequence number, but the timestamp is a large difference in time, what happened? We had an interval of silence. If we're a play-out buffer, what can we do? We can take the things that are in our play-out buffer, and we can be playing them in the silence interval, and all it sounds like is a shorter period of silence. And so we can use the intervals between those talk spurts to be able to deal with the fact that the clocks of the two systems may be very different.
So the actual very first thesis I had here at KTH was a young woman by the name of Li Wei, and she built a VoIP gateway between a PBX and an Ethernet LAN. And she did this work at Ericsson, and the vice president in charge of the PBX division was there, and he said, this can't possibly work. Because it's not isochronous. Right? It didn't have a single clock. It had to contend for each packet to get onto the Ethernet. So after a presentation, she took everyone to the lab and showed it working. He said, okay, it'll work for one call, but it won't work for multiple calls. So the second student I had, Thomas Matheson, took this gateway that she built down to the factory and tested it, putting more and more calls onto the system. And then looking at the jitter in the inter-arrival delay of the audio. And what he found is up to 30 calls, it was very, very flat. But after 30 calls, the performance was very poor.
Why should that be? Well, the interface on one side was an E1 connected to the PBX, and on the other side connected to the LAN. The magic is an E1 only has capacity for 30 calls. So she'd only ever dimensioned the gateway to handle 30 calls because you actually can't send more than 30 calls across one E1. So I went back to him and said, okay, it works and it scales. What do you have to say? He said, we're afraid. Because it was going to wipe out that entire business division. Because if you can do that, do you need the PBX? No.
So Thomas Matheson, however, ran into the problem. Now that he had multiple devices, one of the difficulties was the clocks of the two systems are not exactly the same. So if I have a system here and it has an internal little clock and a system here with its internal little clock, and I just send data from one to the other, even though we say we're sampling at 8 kilosamples per second, if this [the sender's] clock is faster than that, what happens? Oops. I send it more data than it has time to play out. If this [the receiver's] clock is faster than that one, what happens? I don't get data. I now have silence. What do I play out when I don't have anything to play? So he early on introduced, not surprisingly, comfort noise.