Media types

RFC 3551: RTP Profile for Audio and Video Conferences with Minimal Control defines the baseline RTP audio/video profile (RTP/AVP), which assigns static payload type numbers to common codecs; a minimal SDP sketch using it appears below.
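
As a hedged sketch of what the RTP/AVP profile looks like in use, the SDP fragment below offers an audio stream with two of the static payload types RFC 3551 assigns (0 = PCMU, 8 = PCMA); the address, port, and session identifiers are made-up illustration values:

    v=0
    o=alice 2890844526 2890844526 IN IP4 192.0.2.10
    s=Audio call
    c=IN IP4 192.0.2.10
    t=0 0
    m=audio 49170 RTP/AVP 0 8
    a=rtpmap:0 PCMU/8000
    a=rtpmap:8 PCMA/8000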

RFC 4267: The W3C Speech Interface Framework Media Types registers the media types for the languages of the W3C Speech Interface Framework (a minimal VoiceXML example follows the list):

  • Voice Extensible Markup Language (VoiceXML),
  • Speech Synthesis Markup Language (SSML),
  • Speech Recognition Grammar Specification (SRGS),
  • Call Control XML (CCXML), and
  • Pronunciation Lexicon Specification (PLS).
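
To make the media types concrete, here is a hedged sketch of a minimal VoiceXML 2.0 document; a server would deliver it with the Content-Type application/voicexml+xml registered by RFC 4267. The prompt embeds an SSML element (emphasis), and the grammar element points at colors.grxml, a hypothetical SRGS grammar file (a matching sketch appears after the transcript below):

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Served as: Content-Type: application/voicexml+xml -->
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form>
        <field name="color">
          <!-- Prompts may contain SSML markup such as <emphasis> -->
          <prompt>Please say <emphasis>red</emphasis> or <emphasis>blue</emphasis>.</prompt>
          <!-- Recognition is constrained by an SRGS grammar (hypothetical file) -->
          <grammar type="application/srgs+xml" src="colors.grxml"/>
          <filled>
            <prompt>You said <value expr="color"/>.</prompt>
          </filled>
        </field>
      </form>
    </vxml>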

Slide Notes

H. Schulzrinne and S. Casner, ‘RTP Profile for Audio and Video Conferences with Minimal Control’, Internet Request for Comments, RFC 3551 (Internet Standard), July 2003. Available at http://www.rfc-editor.org/rfc/rfc3551.txt

M. Froumentin, ‘The W3C Speech Interface Framework Media Types: application/voicexml+xml, application/ssml+xml, application/srgs, application/srgs+xml, application/ccxml+xml, and application/pls+xml’, Internet Request for Comments, RFC 4267, November 2005. Available at http://www.rfc-editor.org/rfc/rfc4267.txt


Transcript

[slide450] We can have loads of different media types. W3C, the World Wide Web Consortium, has a description for VoiceXML, so we can actually have an XML document that describes what to do with voice, a speech synthesis markup language, a speech recognition grammar, a call control XML, and even a pronunciation lexicon.

There's a company that I encountered in the UK a number of years ago whose service is simply to offer you the correct pronunciation of each person's name in the country. Why would that be a useful service? Well, when someone calls you and pronounces your name correctly, you're more likely to want to talk to them than if they mispronounce it and you think, oh, this is somebody who doesn't know me at all. You'll give up right away.

Speech synthesis has an interesting feature here. As we mentioned in class yesterday with this idea of personalized CODECs, you can build a model of someone's speech and then synthesize it, so you can have very high quality even though you actually have only very limited bandwidth. And if you combine speech synthesis with speech recognition and a voice extensible markup language, you can build incredibly interesting services. Today there are lots and lots of services built this way, where the user can interact with them via speech.

There's a research group at KTH that's built a system designed for the deaf: what you see is a face whose lips and facial expressions move based on a remote person who's speaking into the telephone. It produces a facial time-sequence behavior so that someone who's a lip reader can understand what the person who isn't deaf is saying to them over this conferencing system.

And this is very important. Why is it so important? We have an issue about inclusiveness in society: we should be able to help people participate, whether they're sighted, deaf, etc. That's one of the reasons. [student answer is inaudible] So even for people who are not deaf, it provides a better experience. Why else? Well, it turns out that in most countries there's a legal obligation for telecom operators to provide services to the deaf and blind. So as a telecom operator, you have an obligation to provide services to these customers, and that includes enabling someone who's deaf to communicate with someone who's not, in which case the operator has to provide the person or technology in the middle to enable these people to communicate. So there's quite a lot of interest in being able to support these sorts of things.
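
The transcript describes combining speech synthesis, speech recognition, and VoiceXML into a service. As a companion to the VoiceXML sketch above, here is a hedged sketch of the hypothetical colors.grxml grammar it references: a minimal SRGS grammar (XML form) that accepts the two words the prompt asks for, delivered with the RFC 4267 media type application/srgs+xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- colors.grxml (hypothetical), served as: Content-Type: application/srgs+xml -->
    <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
             xml:lang="en-US" root="color" mode="voice">
      <!-- The root rule accepts exactly one of the two prompted words -->
      <rule id="color" scope="public">
        <one-of>
          <item>red</item>
          <item>blue</item>
        </one-of>
      </rule>
    </grammar>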