Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2158
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Doug Shepherd Joe Finney Laurent Mathy Nicholas Race (Eds.)
Interactive Distributed Multimedia Systems 8th International Workshop, IDMS 2001 Lancaster, UK, September 4-7, 2001 Proceedings
Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands Volume Editors Doug Shepherd Joe Finney Laurent Mathy Nicholas Race Lancaster University, Computing Department Faculty of Applied Sciences Lancaster LA1 4YR, UK E-mail:
[email protected] {joe,laurent,race}@comp.lancs.ac.uk Cataloging-in-Publication Data applied for Die Deutsche Bibliothek - CIP-Einheitsaufnahme Interactive distributed multimedia systems : 8th international workshop ; proceedings / IDMS 2001, Lancaster, UK, September 4 - 7, 2001. Doug Shepherd . . . (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer, 2001 (Lecture notes in computer science ; Vol. 2158) ISBN 3-540-42530-6
CR Subject Classification (1998): H.5.1, C.2, H.4, H.5, H.3 ISSN 0302-9743 ISBN 3-540-42530-6 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2001 Printed in Germany Typesetting: Camera-ready by author, data conversion by Boller Mediendesign Printed on acid-free paper SPIN: 10840274 06/3142 543210
Preface
We are very happy to present the proceedings of the 8th International Workshop on Interactive Distributed Multimedia Systems – IDMS 2001, held in co-operation with ACM SIGCOMM and SIGMM. These proceedings contain the technical programme for IDMS 2001, held September 4–7, 2001 in Lancaster, UK. For the technical programme this year we received 48 research papers from both academic and industrial institutions all around the world. After the review process, 15 were accepted as full papers for publication, and a further 8 as short position papers, intended to provoke debate. The technical programme was complemented by three invited papers: “QoS for Multimedia – What's Going to Make It Pay?” by Derek McAuley, “Enabling the Internet to Provide Multimedia Services” by Markus Hofmann, and “The MPEG-21 Standard: Why an Open Multimedia Framework?” by Fernando Pereira. The organisers are very grateful for the help they received to make IDMS 2001 a successful event. In particular, we would like to thank the PC for their first-class reviews of papers, particularly considering the tight reviewing deadlines this year. Also, we would like to acknowledge the support from Agilent, BTexact Technologies, Hewlett Packard, Microsoft Research, Orange, and Sony Electronics – without whom IDMS 2001 would not have been such a memorable event. We hope that readers will find these proceedings helpful in their future research, and that IDMS will continue to be an active forum for the discussion of distributed multimedia research for years to come. June 2001
D. Shepherd J. Finney L. Mathy N. Race
Welcome from the Minister of State for Lifelong Learning and Higher Education
To the participants of the 8th International Workshop on Interactive Distributed Multimedia Systems: I am delighted to welcome you all to the IDMS 2001 workshop in Lancaster, and indeed welcome many of you to the UK. We see throughout the world that conferences are being held to try to educate people in the area of distributed computing as it becomes an increasingly important part of the information society we live in. I am happy to see that the University of Lancaster is part of this drive by hosting this international conference. I send you my best wishes for a successful workshop and hope that this opportunity to present interesting research results to a broad professional audience gives stimulus for collaboration and further progress in the field. I wish all participants a pleasant stay in Lancaster.
Margaret Hodge Minister of State for Lifelong Learning and Higher Education
Welcome to the Computing Department at Lancaster University
I'd like to extend a very warm welcome to all participants to IDMS 2001. We're particularly pleased that IDMS has come to Lancaster this year where the Distributed Multimedia Research Group has played such a prominent role in the distributed multimedia systems research community over the last 10 years. As a rather envious outsider it has struck me how successful the distributed multimedia systems community has been in tapping the pool of research talent. In recent years at Lancaster, we have enjoyed a very strong flow of students through our Doctoral programme, from our own undergraduate programme and from elsewhere. In large measure, this has been due to the buzz generated by research in distributed multimedia systems and I'm sure this is reflected in other research and development centres. However, we have seen this phenomenon before, where hot research areas soak up talent and resources. Thankfully, what distinguishes the distributed multimedia systems community is the coexistence of three key factors: it addresses a real economic and societal need; it has attracted the resources and backing needed to make it work; and, crucially, it is delivering. Computer Science needs challenging, high profile and successful areas like this to keep the discipline healthy. I hope you have a great time at IDMS 2001. Pete Sawyer Head of Computing Lancaster University
Organisation

Programme Chair
Doug Shepherd, Lancaster University, United Kingdom
Programme Committee
Arturo Azcorra, University Carlos III, Madrid, Spain
Gregor v. Bochmann, University of Ottawa, Canada
Jan Bormans, IMEC, Belgium
Berthold Butscher, GMD-FOKUS, Germany
Andrew Campbell, Columbia University, USA
Tim Chown, University of Southampton, United Kingdom
Luca Delgrossi, University of Piacenza, Italy
Michel Diaz, LAAS-CNRS, France
Jordi Domingo-Pascual, Polytechnic University of Catalonia, Spain
Wolfgang Effelsberg, University of Mannheim, Germany
Frank Eliassen, University of Oslo, Norway
Serge Fdida, LIP6, France
Joe Finney, Lancaster University, United Kingdom
Vera Goebel, University of Oslo, Norway
Tobias Helbig, Philips Research Laboratories, Germany
David Hutchison, Lancaster University, United Kingdom
Winfried Kalfa, TU Chemnitz, Germany
Guy Leduc, University of Liege, Belgium
Laurent Mathy, Lancaster University, United Kingdom
Ketan Mayer-Patel, University of North Carolina, USA
Jason Nieh, Columbia University, USA
Philippe Owezarski, LAAS-CNRS, France
Fernando Pereira, Instituto Superior Tecnico, Portugal
Thomas Plagemann, University of Oslo, Norway
Timothy Roscoe, Sprint Labs, USA
Hans Scholten, University of Twente, The Netherlands
Patrick Senac, ENSICA, France
Marten van Sinderen, University of Twente, The Netherlands
Greg O'Shea, Microsoft Research Cambridge, United Kingdom
Ralf Steinmetz, TU Darmstadt/GMD, Germany
Fred Stentiford, BTexact Research, United Kingdom
Burkhard Stiller, ETH Zurich, Switzerland
Giorgio Ventre, University Federico II, Napoli, Italy
Lars Wolf, University of Karlsruhe, Germany
Local Organisation
Joe Finney, Lancaster University, United Kingdom
Nicholas Race, Lancaster University, United Kingdom
Laurent Mathy, Lancaster University, United Kingdom
Barbara Hickson, Lancaster University, United Kingdom
Stefan Schmid, Lancaster University, United Kingdom
Daniel Prince, Lancaster University, United Kingdom
Supporting and Sponsoring Organisations (Alphabetically)
ACM SIGCOMM
ACM SIGMM
Agilent
BTexact Technologies
Hewlett Packard
Microsoft Research
Orange
Sony Electronics
Table of Contents
Invited Presentation QoS for Multimedia – What's Going to Make It Pay? ...................................................1 Derek McAuley Short Papers New Resource Control Issues in Shared Clusters ..........................................................2 Timothy Roscoe, Prashant Shenoy Transport-Level Protocol Coordination in Cluster-to-Cluster Applications ................10 David E. Ott, Ketan Mayer-Patel Data to the People – It’s a Matter of Control...............................................................23 Mauricio Cortes, J. Robert Ensor An Access Control Architecture for Metropolitan Area Wireless Networks ...............29 Stefan Schmid, Joe Finney, Maomao Wu, Adrian Friday, Andrew Scott, and Doug Shepherd
Media Distribution Design and Implementation of a QoS-Aware Replication Mechanism for a Distributed Multimedia System ...................................................................................38 Giwon On, Jens B. Schmitt, and Ralf Steinmetz Distribution of Video-on-Demand in Residential Networks ........................................50 Juan Segarra, Vicent Cholvi A QoS Negotiation Scheme for Efficient Failure Recovery in Multi-resolution Video Servers...............................................................................................................62 Minseok Song, Heonshik Shin, and Naehyuck Chang QoS Issues in Multimedia Tolerance of Highly Degraded Network Conditions for an H.323-Based VoIP Service................................................................................................................74 Peter Holmes, Lars Aarhus, and Eirik Maus
Conception, Implementation, and Evaluation of a QoS-Based Architecture for an IP Environment Supporting Differentiated Services..........................................86 Fabien Garcia, Christophe Chassot, Andre Lozes, Michel Diaz, Pascal Anelli, and Emmanuel Lochin A Service Differentiation Scheme for the End-System ................................................99 Domenico Cotroneo, Massimo Ficco, and Giorgio Ventre Invited Presentation Enabling the Internet to Provide Multimedia Services...............................................110 Markus Hofmann Multimedia Middleware Design and Application of TOAST: An Adaptive Distributed Multimedia Middleware Platform .................................................................................................111 Tom Fitzpatrick, Julian Gallop, Gordon Blair, Christopher Cooper, Geoff Coulson, David Duce, and Ian Johnson QoS Management Middleware: A Separable, Reusable Solution ..............................124 Denise Ecklund, Vera Goebel, Thomas Plagemann, Earl F. Ecklund Jr., Carsten Griwodz, Jan Øyvind Aagedal, Ketil Lund, and Arne-Jørgen Berre State Transmission Mechanisms for a Collaborative Virtual Environment Middleware Platform .................................................................................................138 João Orvalho, Pedro Ferreira, and Fernando Boavida Congestion Control and Adaptation A Stable and Flexible TCP-Friendly Congestion Control Protocol for Layered Multicast Transmission ................................................................................154 Ibtissam El Khayat, Guy Leduc Content-Aware Quality Adaptation for IP Sessions with Multiple Streams...............168 Dimitrios Miras, Richard J. Jacobs, and Vicky Hardman The Minimal Buffering Requirements of Congestion Controlled Interactive Multimedia Applications..........................................................................181 Kang Li, Charles Krasic, Jonathan Walpole, Molly H. Shor, and Calton Pu
Short Papers Video Content Management Using Logical Content .................................................193 Nozomu Takahashi, Yuki Wakita, Shigeki Ouchi, and Takayuki Kunieda Adaptive Clustering Using Mobile Agents in Wireless Ad-Hoc Networks ...............199 Robert Sugar, Sandor Imre Mobile 4-in-6: A Novel Mechanism for IPv4/v6 Transitioning.................................205 Joe Finney, Greg O’Shea The Case for Streaming Multimedia with TCP..........................................................213 Charles Krasic, Kang Li, and Jonathan Walpole Invited Presentation The MPEG-21 Standard: Why an Open Multimedia Framework? ............................219 Fernando Pereira Control of Multimedia Networks Selecting the QoS Parameters for Multicast Applications Based on User Profile and Device Capability ...........................................................................221 Khalil El-Khatib, Gregor v. Bochmann, and Yu Zhong A Novel Group Integrity Concept for Multimedia Multicasting................................233 Andreas Meissner, Lars Wolf, and Ralf Steinmetz Constraint-Based Configuration of Proxylets for Programmable Networks ..............245 Krish T. Krishnakumar, Morris Sloman
Author Index ............................................................................................................257
QoS for Multimedia – What’s Going to Make It Pay? Derek McAuley Marconi Research Centre, Cambridge, UK
[email protected] Abstract. Delivering the form of QoS we see in the form of Service Level Agreements is a big deal for ISPs; however, this is a long way from the traditional multimedia view of per application instance QoS. Likewise many OS and middleware platforms take one of the two views on QoS: machines are getting faster so who cares, or, we have a real time scheduling class. Are we ever going to drive QoS into the mainstream?
D. Shepherd et al. (Eds.): IDMS 2001, LNCS 2158, pp. 1-1, 2001. © Springer-Verlag Berlin Heidelberg 2001
New Resource Control Issues in Shared Clusters Timothy Roscoe1 and Prashant Shenoy2 1
Sprint Advanced Technology Labs, 1 Adrian Court, Burlingame, CA 94010, USA.
[email protected] 2 Department of Computer Science, University of Massachusetts, Amherst, MA 01003, USA.
[email protected] Abstract. We claim that the renting of machine resources on clusters of servers introduces new systems challenges which are different from those hitherto encountered, either in multimedia systems or cluster-based computing. We characterize the requirements for such “public computing platforms” and discuss both how the scenario differs from more traditional multimedia resource control situations, and how some ideas from multimedia systems work can be reapplied in this new context. Finally, we discuss our ongoing work building a prototype public computing platform.
1 Introduction and Motivation
This paper argues that the growth of shared computing platforms poses new problems in the field of resource control that are not addressed by the current state of the art, and consequently there exist important unresolved resource control issues of interest to the multimedia systems community. The scenario we examine in detail is that of a public computing platform. Such a platform consists of a cluster of processing nodes interconnected by a network of switches, and provides computational resources to a large number of small third-party service providers who pay the provider of the platform for the resources: CPU cycles, network bandwidth, storage space, storage bandwidth, etc. The platform provider offers service providers a platform which can be, for example, highly available, managed, and located in a geographically advantageous location such as a metropolitan area. In return, the platform provider can use economies of scale to offer service hosting at an attractive rate and still generate profit. Public computing platforms differ from current hosting solutions in that there are many more services than machines: lots of services share a relatively small number of nodes. The challenge for the platform provider is to be able to sell resources like processor cycles and predictable service to many service providers, who may be mutually antagonistic, in a cost-effective manner. Whereas systems for running one specialized class of application (e.g. web servers, some Application Service Providers) in this manner are already appearing in the marketplace,
Prashant Shenoy was supported in part by NSF grants CCR-9984030, EIA-0080119 and a gift from Sprint Corporation.
D. Shepherd et al. (Eds.): IDMS 2001, LNCS 2158, pp. 2–9, 2001. © Springer-Verlag Berlin Heidelberg 2001
the lack of solutions for the more general problem has prevented the range of services offered in this way from being widened. Two research areas feed directly into this area: both have much to offer, but do not address areas specific to the support of time- and resource-sensitive applications on public computing platforms.

Resource control in multimedia systems: Resource control has been a central question in multimedia systems research for at least the past decade. Control of resources within a machine is now relatively well understood: it has been addressed in completely new operating systems (e.g. [1]), modifications to existing operating systems (e.g. [2]), schedulers (e.g., [14]), and abstractions (e.g., [4]). Many of these advances were motivated by the desire to handle multimedia and other time-sensitive applications. Such mechanisms clearly have a place in a public computing platform designed to handle a diversity of services, not simply for multimedia applications but to provide performance isolation between services owned by providers who are paying for resources. Consequently, public computing platforms enable much of the past and on-going research on resource control for multimedia systems to be applied to a new and more general setting. The caveat, though, is that most of these techniques were developed for single-machine environments and do not directly generalize to multi-resource environments (multiprocessors, clusters); for example, see [5].

Cluster-based computing platforms: Much work has been performed recently on the use of clustered computing platforms for network services (see [6] for an example and convincing arguments in favor of the approach). This work aims at delivering high-capacity, scalable, highly available applications, usually web-based. Typically, a single application is supported, or else the applications are assumed to be mutually trusting—a reasonable assumption in the large enterprise case.
Consequently, little attention is paid to resource control, either for real-time guarantees to applications or performance isolation between them [7] ([9] is a notable exception). Similarly, intra-cluster security is relaxed as a simplifying assumption within the platform [8]. While the arguments for an approach based on clusters of commodity machines carry over into the public computing space, the assumptions about resource control and trust clearly do not: the applications running on such platforms will have diverse requirements, and the operators of such applications will be paying money to ensure that those requirements are met. In addition, they may be in competition with each other. Lack of trust between competing applications, as well as between applications and the platform provider, introduces new challenges in the design of cluster control systems.

1.1 What's Different about Public Computing Platforms
This paper argues that the systems problems of public computing platforms are conveniently similar to the two fields above, but have a specificity of their own.
They both present new challenges, but also have properties that help to ground and concretize general classes of solutions. The most significant property of such systems, setting them apart from traditional multimedia systems and cluster-based servers, is that resources are being sold. From a cluster architecture point of view this means that performance isolation becomes central: it is essential to provide some kind of quantitative resource guarantees, since this is what people are paying for. From a multimedia systems point of view this property has two effects. Firstly, resource allocation must extend over multiple machines running a large number of services. This amounts to a problem of placement: which components of which services are to share a machine? Secondly, the policies used to drive both this placement and the resource control mechanisms on the individual machines are now driven by a clear business case. Resource control research in the past has been marked by a lack of clear consensus over what is being optimized by the various mechanisms and policies: processor utilization, application predictability, application performance, etc. The notion of graceful degradation is also made more quantitative in this scenario: we can relate degradation of service to a change in platform revenue. This represents a significant advance over current so-called “economic” or “market-driven” resource allocation policies, since they can now be explicitly linked to a “real” market. We elaborate on these issues below.
2 Challenges in Designing a Public Computing Platform

2.1 Challenges for the Platform Provider: The Need for Yield Management
The primary goal for the operator of a public computing platform is to maximize revenues obtained from renting platform resources to service providers. A public computing platform services a wide variety of customers; depending on how much each customer pays for resources, not all users are treated equally. This process is known as yield management, and adds an important twist to the policy side of the resource control problem. Maximizing yield (revenue) requires that platform resources be overbooked. Overbooking of resources is typically based on an economic cost-benefit analysis, which explicitly links resource allocation not to closed market abstractions (e.g. [10]) but to a “real” commercial operation. Beyond this, resource policies will take into account such factors as demographics and psychometric models of client behavior in determining allocations and pricing. In other industries where similar challenges exist (for example, the airline industry [11]), much of this is in the domain of business decision-making and operations research models. The challenge for a public computing platform is to allow as much flexibility as possible in business decisions regarding its operation: it must not impose undue restrictions on business policies, but at the same time should facilitate their implementation. From a systems design point of view this has a number of implications. Firstly, business assessments and policies must be representable in the system, without
the system constraining this representation (in other words, without the system taking over too much of the decision-making process). Secondly, the system should aid the process of overbooking and reacting to the overloads resulting from overbooking. For instance, service providers that pay more should be better isolated from overloads than others; to achieve this goal, resource control policies should help determine (i) how to map individual applications to nodes in the cluster, (ii) the amount of overbooking on each individual node, depending on the yield from that node and the service guarantees that need to be provided to applications, and (iii) how to handle an overload scenario. Thirdly, the system should provide timely feedback into the business domain on the results of this process and the behavior of the other commercial parties involved (principally the service providers).
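As one concrete (and deliberately simplified) reading of the overbooking point above, the following sketch implements per-node statistical admission control: capsule demands are modelled as independent normal random variables, and a new reservation is admitted only if the node's overload probability stays below a threshold set by business policy. The function name, the normal-approximation model, and the per-node threshold are all our own assumptions, not the paper's mechanism.

```python
import math

def can_admit(reservations, capacity, new_res, overbook_eps):
    """Statistical admission control on one overbooked node (a sketch).

    Each reservation is a (mean, stddev) pair describing a capsule's
    expected demand for one resource, e.g. CPU share. The new capsule is
    admitted iff the probability that total demand exceeds `capacity`
    stays below `overbook_eps`, under a normal approximation with
    independent demands. `overbook_eps` is where business policy enters:
    nodes hosting higher-paying, better-isolated customers get a smaller
    threshold, cheaper best-effort nodes a larger one.
    """
    means = [m for m, _ in reservations] + [new_res[0]]
    variances = [s * s for _, s in reservations] + [new_res[1] ** 2]
    mu, sigma = sum(means), math.sqrt(sum(variances))
    if sigma == 0.0:                      # fully deterministic demands
        return mu <= capacity
    # P(X > capacity) for X ~ N(mu, sigma^2), via the complementary
    # error function: P = 0.5 * erfc((capacity - mu) / (sigma * sqrt(2)))
    p_overload = 0.5 * math.erfc((capacity - mu) / (sigma * math.sqrt(2.0)))
    return p_overload <= overbook_eps
```

With a tight threshold a node rejects reservations whose combined demand is likely to exceed capacity, while a loose threshold lets the operator overbook aggressively and accept occasional overloads as a business trade-off.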
2.2 End-User Challenges: The Need for Appropriate Abstractions
Applications running on a public computing platform will be inherently heterogeneous. One can expect such platforms to run a mix of applications such as streaming audio and video servers, real-time multiplayer game servers, vanilla web servers, and ecommerce applications. These applications have diverse performance requirements. For instance, game servers need good interactive performance and thus low average response times, ecommerce applications need high aggregate throughput (in terms of transactions per second), and streaming media servers require real-time performance guarantees. For each such application (or service), a service provider contracts with the platform provider for the desired performance requirements along various dimensions. Such requirements could include the desired reservation (or share) for each capsule as well as average response times, throughput or deadline guarantees. To effectively service such applications, the platform should support flexible abstractions that enable applications to specify the desired performance guarantees along a variety of dimensions. We propose the abstraction of a capsule to express these requirements. A capsule is defined to be that component of an application that runs on an individual node; each application can have one or more capsules, but not more than one per node. A capsule can have a number of attributes, such as the desired CPU, network and disk reservations, memory requirements, deadline guarantees, etc., that denote the performance requirements of that capsule. It’s important to note that capsules are a post facto abstraction: for reasons detailed in [12] we try not to mandate a programming model for service authors. Capsules are therefore an abstraction used by the platform for decomposing an existing service into resource principals.
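To make the capsule abstraction concrete, the sketch below records the kinds of attributes mentioned above as a plain data structure. The field names, types, and units are our own illustrative assumptions; the paper deliberately leaves the schema open, since capsules are a post facto abstraction rather than a programming model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Capsule:
    """One component of an application, running on a single node.

    The attributes below are illustrative: the paper lists the *kinds*
    of requirements a capsule may carry (CPU, network and disk
    reservations, memory, deadlines) without fixing a concrete schema.
    """
    app_id: str
    cpu_share: float                      # fraction of one node's CPU, in [0, 1]
    net_mbps: float                       # network bandwidth reservation
    disk_mbps: float                      # storage bandwidth reservation
    memory_mb: int
    deadline_ms: Optional[float] = None   # optional real-time deadline guarantee

@dataclass
class Application:
    """A hosted service: one or more capsules, at most one per node."""
    app_id: str
    capsules: List[Capsule] = field(default_factory=list)
```

The one-capsule-per-node rule means an application's resource footprint is simply the set of its capsules, which is what the placement and admission machinery below operates on.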
2.3 Implications for System Design
The above research challenges have the following implications for platform design.
Capsule Placement. A typical public computing platform will consist of tens or hundreds of nodes running thousands of third-party applications. Due to the large number of nodes and applications in the system, manual mapping of capsules to nodes in the platform is infeasible. Consequently, an automated capsule placement algorithm is a critical component of any public computing platform. The aim of such an algorithm is clearly to optimize revenue from the platform, and in general this coincides with maximizing resource usage. However, a number of critical factors and constraints modify this. Firstly, the algorithm must run incrementally: services come and go, nodes fail, and are added or upgraded, and all this must occur with minimal disruption to service. This means, for instance, that introducing a new capsule must have minimal impact on the placement of existing capsules, since moving a capsule is costly and may involve violating a resource guarantee. Secondly, capsule placement should take into account the issue of overbooking of resources to maximize yield; sophisticated statistical admission control algorithms are needed to achieve this objective. Much of the past work on statistical admission control has focused on single-node servers; extending these techniques to clustered environments is non-trivial. Thirdly, there are technological constraints on capsule placement: for example, capsules are generally tied to a particular operating system or execution environment which may not be present on all nodes. Finally, there are less tangible security and business-related constraints on capsule placement. For example, we might not wish to colocate capsules of rival customers on a single node. On the other hand, we might colocate a number of untrusted clients on a single node if the combined revenue from the clients is low.
To help make this last constraint tractable, and also to integrate notions of overbooking, we introduce the twin abstractions of trustworthiness and criticality, explored in more detail in [13]. These concepts allow us to represent business-level assessments of risk and cost-benefit at the system level. Trustworthiness is clearly an issue, since third-party applications will generally be untrusted and mutually antagonistic; isolating untrusted applications from one another by mapping them onto different nodes is desirable. Trustworthiness is a function of many factors outside the scope of the system (including legal and commercial considerations), but the placement of capsules must take trust relationships into account. The complementary notion of criticality is a measure of how important a capsule or an application is to the platform provider. For example, criticality could be a function of how much the service provider is paying for application hosting. Clearly, mapping capsules of critical applications and untrusted applications to the same node is problematic, since a denial of service attack by the untrusted application can result in revenue losses for the platform provider. In summary, capsule placement becomes a multi-dimensional constrained optimization problem—one that takes into account the trustworthiness of an application, its criticality, and its performance requirements.
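As a rough illustration of how these constraints interact, the sketch below renders incremental capsule placement as a greedy best-fit over a single capacity dimension (CPU), with a co-location check for the trust/criticality constraint. This is a deliberately simplified toy under our own assumptions, not the authors' algorithm; a real placement engine would also weigh overbooking, execution-environment constraints, and per-capsule revenue.

```python
def place_capsule(cap, free_cpu, placed):
    """Greedy best-fit placement of one capsule (a toy sketch).

    cap:      dict with keys "cpu" (share needed), "trusted", "critical".
    free_cpu: dict node_id -> residual CPU share; mutated on success.
    placed:   dict node_id -> capsules already on that node; mutated too.
    Returns the chosen node_id, or None if no feasible node exists
    (which in a real platform would trigger re-planning or rejection).
    """
    def compatible(existing, new):
        # Business constraint from the text: never co-locate an untrusted
        # capsule with a critical one, since a denial-of-service attack
        # translates directly into lost revenue.
        for other in existing:
            if (new["critical"] and not other["trusted"]) or \
               (other["critical"] and not new["trusted"]):
                return False
        return True

    # Best fit: prefer the node left with the least residual CPU, which
    # keeps large contiguous headroom available on other nodes.
    candidates = [(free - cap["cpu"], nid)
                  for nid, free in free_cpu.items()
                  if free >= cap["cpu"] and compatible(placed.get(nid, []), cap)]
    if not candidates:
        return None
    _, best = min(candidates)
    free_cpu[best] -= cap["cpu"]
    placed.setdefault(best, []).append(cap)
    return best
```

Because placement is incremental, each call touches only the new capsule; existing placements are never moved, matching the requirement that introducing a capsule should minimally disturb the rest of the cluster.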
Resource Control. A public computing platform should employ resource control mechanisms to enforce the performance guarantees provided to applications and their capsules. As argued earlier, these mechanisms should operate in multi-node environments, should isolate applications from one another, enforce resource reservations on a sufficiently fine time-scale, and meet requirements such as deadlines. These issues are well understood within the multimedia community for single-node environments. For instance, hierarchical schedulers [14] meet these requirements within a node. However, these techniques do not carry over to multi-resource (multi-node) environments. For instance, it was shown in [5] that uniprocessor proportional-share scheduling algorithms can cause starvation or unbounded unfairness when employed for multiprocessor or multi-node systems. Consequently, novel resource control techniques will need to be developed for these environments.

Failure Handling. Since high availability is critical to a public computing platform, the platform should handle failures in a graceful manner. In contrast to traditional clusters, the commercial nature of a public computing platform has an important effect on how failures are handled: we can classify failures according to whose responsibility it is to handle them, the platform provider's or a service provider's. We distinguish three kinds of failures in a public computing platform: (i) platform failures, (ii) application failures, and (iii) capsule failures. A platform failure occurs when a node fails or some platform-specific software on the node fails. Interestingly, resource exhaustion on a node also constitutes a platform failure—the failure to meet performance guarantees (since resources on each node of the platform may be overbooked to extract statistical multiplexing gains, resource exhaustion results in a violation of performance guarantees).
Platform failures must be dealt with by detecting them in a timely manner and recovering from them automatically (for instance, by restarting failed nodes or by offloading capsules from an overloaded node to another). An application failure occurs when an application running on the platform fails in a manner detectable by the platform. Depending on the application and the service contract between the platform provider and the service provider, handling application failures could be the responsibility of the platform provider or the service provider (or both). In the former scenario, application semantics that constitute a failure will need to be specified a priori to the platform provider and the platform will need to incorporate application-specific mechanisms to detect and recover from such failures. A capsule failure occurs when an application capsule fails in a way undetectable to the platform provider, for example an internal deadlock condition in an application. Capsule failures must be assumed to be the responsibility of the service provider and the platform itself does not provide any support for dealing with them. We have found this factorization of failure types highly useful in designing fault-tolerance mechanisms for a public computing platform.
Timothy Roscoe and Prashant Shenoy

3
Status of Ongoing Work and Concluding Remarks
We believe that there are compelling reasons to host large numbers of Internet services on a cluster-based platform. In particular, we are interested in the case where there are many more services than machines – this is a different space from current commercial hosting solutions, but one where we feel considerable innovation in applications is possible if the economic barrier to entry is very low. Facilitating this innovation requires support for highly diverse resource guarantees: current application-level connection scheduling work restricts applications to web-based or similar request-response systems, and consequently restricts the diversity of feasible services (and how cheap it is to offer them). Much research from the field of multimedia systems can be reapplied here – indeed this may be a more compelling case for resource control facilities in the real world than multimedia workstations. However, both the clustered environment and the business relationships involved in the design of public platforms add new challenges: (i) heterogeneity of applications, distributed application components, and processing nodes; (ii) placement of capsules within the platform; (iii) failure handling in a domain of split responsibility; and (iv) overbooking and yield management. Our current research focuses on these issues for a public computing platform. Specifically, we are investigating techniques for yield management [13], capsule placement, overbooking, and predictable resource allocation in such environments.
Acknowledgments

The authors would like to acknowledge the suggestions of Bryan Lyles in writing this paper.
References

[1] I. Leslie, D. McAuley, R. Black, T. Roscoe, P. Barham, D. Evers, and R. Fairbairns, “The design and implementation of an operating system to support distributed multimedia applications,” IEEE JSAC, vol. 14, no. 7, pp. 1280–1297, 1996.
[2] V. Sundaram, A. Chandra, P. Goyal, P. Shenoy, J. Sahni, and H. Vin, “Application Performance in the QLinux Multimedia Operating System,” in Proceedings of the Eighth ACM Conference on Multimedia, Los Angeles, CA, November 2000, pp. 127–136.
[3] J. Nieh and M. S. Lam, “The Design, Implementation and Evaluation of SMART: A Scheduler for Multimedia Applications,” in Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, Saint-Malo, France, October 1997.
[4] G. Banga, P. Druschel, and J. C. Mogul, “Resource Containers: A New Facility for Resource Management in Server Systems,” in Proceedings of the Third Symposium on Operating Systems Design and Implementation, New Orleans, Louisiana, March 1999, pp. 45–68.
New Resource Control Issues in Shared Clusters
[5] A. Chandra, M. Adler, P. Goyal, and P. Shenoy, “Surplus Fair Scheduling: A Proportional-Share CPU Scheduling Algorithm for Symmetric Multiprocessors,” in Proceedings of the Fourth Symposium on Operating System Design and Implementation (OSDI 2000), San Diego, CA, October 2000.
[6] A. Fox, S. D. Gribble, Y. Chawathe, E. A. Brewer, and P. Gauthier, “Cluster-Based Scalable Network Services,” in Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, Saint-Malo, France, October 1997.
[7] M. Litzkow, M. Livny, and M. Mutka, “Condor – A Hunter of Idle Workstations,” in Proceedings of the 8th International Conference on Distributed Computing Systems, June 1988, pp. 104–111.
[8] S. D. Gribble, M. Welsh, E. A. Brewer, and D. Culler, “The MultiSpace: An Evolutionary Platform for Infrastructural Services,” in Proceedings of the 1999 USENIX Annual Technical Conference, Monterey, California, June 1999.
[9] M. Aron, P. Druschel, and W. Zwaenepoel, “Cluster Reserves: A Mechanism for Resource Management in Cluster-based Network Servers,” in Proceedings of ACM SIGMETRICS 2000, Santa Clara, CA, June 2000.
[10] N. Stratford and R. Mortier, “An Economic Approach to Adaptive Resource Management,” in Proceedings of the 7th IEEE Workshop on Hot Topics in Operating Systems (HotOS VII), March 1999.
[11] B. C. Smith, J. F. Leimkuhler, and R. M. Darrow, “Yield Management at American Airlines,” Interfaces, vol. 22, no. 1, pp. 8–31, January–February 1992.
[12] T. Roscoe and B. Lyles, “Distributed Computing without DPEs: Design Considerations for Public Computing Platforms,” in Proceedings of the 9th ACM SIGOPS European Workshop, Kolding, Denmark, September 17–20, 2000.
[13] T. Roscoe, B. Lyles, and R. Isaacs, “The Case for Supporting Risk Assessment in Systems,” Sprint Labs Technical Report, May 2001.
[14] P. Goyal, X. Guo, and H. M. Vin, “A Hierarchical CPU Scheduler for Multimedia Operating Systems,” in Proceedings of the First USENIX Symposium on Operating System Design and Implementation (OSDI ’96), Seattle, October 1996, pp. 107–122.
Transport-Level Protocol Coordination in Cluster-to-Cluster Applications David E. Ott and Ketan Mayer-Patel Department of Computer Science University of North Carolina at Chapel Hill {ott, kmp}@cs.unc.edu
Abstract. Future Internet applications will increasingly use multiple communications and computing devices in a distributed fashion. In this paper, we identify an emerging and important application class comprised of a set of processes on a cluster of devices communicating to a remote set of processes on another cluster of devices across a common intermediary Internet path. We call applications of this type cluster-to-cluster (C-to-C) applications. The networking requirements of C-to-C applications present unique challenges that current transport-level protocols fail to address. In particular, these applications require aggregate measurement of network conditions across all associated flows and coordinated transport-level protocol behavior. A Coordination Protocol (CP) is proposed which allows a C-to-C application to coordinate flow behavior in the face of changing network conditions. CP provides cluster endpoints with a consistent view of network conditions, as well as cluster membership and bandwidth usage information. An application may use CP to define and implement a coordination scheme supporting particular flow priorities and other objectives.
1
Introduction
Future Internet applications will increasingly make use of multiple communication and computing devices in a distributed fashion. Examples of these applications include distributed sensor arrays, tele-immersion [9], computer-supported collaborative workspaces (CSCW) [4], ubiquitous computing environments [13], and complex multi-stream, multimedia presentations [16]. In many such applications, no one device or computer produces or manages all of the data streams transmitted. Instead, the endpoints of communication are collections of devices. We call applications of this type cluster-to-cluster (C-to-C) applications. In a C-to-C application, a set of processes distributed on a cluster of devices or computers communicates with another set of processes on a remote cluster of devices or computers. For example, a cluster may be comprised of a collection of capture devices (e.g., video cameras, microphones, etc.), each of which produces a stream of data. The other cluster might be a collection of display devices (e.g., digital light projectors, speakers, head-mounted displays, etc.) and computers that archive data in various ways. Processes are distributed on each device cluster to manage data streams and control the application.

D. Shepherd et al. (Eds.): IDMS 2001, LNCS 2158, pp. 10–22, 2001.
© Springer-Verlag Berlin Heidelberg 2001
C-to-C applications share two important properties. First, communication between endpoints on the same local cluster takes place with minimal delay and loss. We can make this assertion because the local intranet supporting the cluster can be provisioned to comfortably support application traffic loads. In other words, the local networking environment often will be engineered and implemented with the C-to-C application in mind. Second, while no two flows within the application share the exact same end-to-end path, all flows share a common Internet path between clusters. This shared common path represents the majority of the traversal path between endpoints on different clusters, and the region in which network congestion will most likely affect C-to-C application performance. Although the path between clusters is not guaranteed to be the same for all flows (or even all packets within a flow), in general we can expect that network conditions between clusters will be similar. Different flows within a C-to-C application may use more than one transport-level protocol to accomplish their communication objectives. Streamed audio and video data, for example, may use UDP or a UDP-based protocol like RTP, while control information is exchanged using TCP to ensure reliability. Application-specific protocols which handle media encoding, flow control, reliability, or congestion control in specialized ways may also be used. A fundamental problem with current transport-level protocols within the C-to-C application context, however, is that they lack coordination. Application streams share a common intermediary path between clusters, and yet operate in isolation from one another. As a result, flows may compete with one another when network resources become limited, instead of cooperating to use available bandwidth in application-controlled ways.
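As an illustration of such application-controlled sharing, a coordination scheme might divide an aggregate bandwidth estimate among flows by application-assigned weights (a minimal sketch; the function name, weights, and numbers are ours):

```python
def weighted_shares(total_bw, weights):
    """Split an aggregate bandwidth estimate (bits/s) among flows by weight."""
    total_weight = sum(weights.values())
    return {flow: total_bw * w / total_weight for flow, w in weights.items()}

# A video flow stipulated to consume twice the bandwidth of an audio flow,
# given a 3 Mbit/s estimate of available cluster-to-cluster bandwidth.
shares = weighted_shares(3_000_000, {"video": 2, "audio": 1})
```

The point is that the split is computed by the application from a shared view of conditions, rather than emerging from flows competing independently.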
For example, an application may wish to give certain streams priority over others or stipulate that different proportions of available bandwidth be used by specific streams. We are interested in C-to-C applications with rich semantic relationships between flows that can be exploited to improve application performance. In this position paper, we postulate that the C-to-C application architecture can significantly benefit from network mechanisms that allow transport-level protocol coordination of separate, but semantically related, flows of data. In the following sections of this paper we will:
– Motivate the C-to-C application architecture with a specific example.
– Identify a number of networking challenges posed by such an application.
– Provide a specific proposal for a transport-level coordination mechanism.
– Defend the design decisions made in our proposed mechanism, and discuss related work.

2
A Motivating Example
This section describes a specific application which motivates the need for transport-level protocol coordination and illustrates the networking challenges faced by C-to-C applications. The application is the Office of the Future. The Office
of the Future was conceived by Fuchs et al., and their experimental prototype of the application is described further in [9]. In the Office of the Future, tens of digital light projectors are used to make almost every surface of an office (walls, desktops, etc.) a display surface. Similarly, tens of video cameras are used to capture the office environment from a number of different angles. At real-time rates, the video streams are used as input to stereo correlation algorithms to extract 3D geometry information. Audio is also captured from a set of microphones. The video streams, geometry information, and audio streams are all transmitted to a remote Office of the Future environment. At the remote environment, the video and audio streams are warped using both local and remote geometry information and stereo views are mapped to the light projectors. Audio is spatialized and sent to a set of speakers. Users within each Office of the Future environment wear shutter glasses that are coordinated with the light projectors.
Fig. 1. The Office of the Future.
The result is an immersive 3D experience in which the walls of one office environment essentially disappear to reveal the remote environment and provide a tele-immersive collaborative space for the participants. Furthermore, synthetic 3D models may be rendered and incorporated into both display environments as part of the shared, collaborative experience. Figure 1 is an artistic illustration of the application. We see the Office of the Future as a concrete vision of a C-to-C application that progresses far beyond today’s relatively simple video applications. The Office of the Future uses scores of media streams which must be manipulated in complex ways. The computational complexity of the application requires the use of several computational resources (possibly including specialized graphics hardware).
The Office of the Future is a good example of a C-to-C application because the endpoints of the application are collections of devices. Two similarly equipped offices must exchange myriad data streams. Any specific data stream is transmitted from a specific device in one environment to one or more specific devices in the remote environment. Within the strict definition of end-to-end used by current protocols, few if any of these streams will share a complete path. If we relax the definition of end-to-end, however, we can see that all of the data streams will span a common shared communication path between the local networking environments of each Office of the Future. The local network environments are not likely to be the source of congestion, loss, or other dynamic network conditions because these environments are within the control of the local user and can be provisioned to support the Office of the Future application. The shared path between two Office of the Future environments, however, is not under local control and thus will be the source of dynamic network conditions. Current transport-level protocols fail to address the complex needs of an application like the Office of the Future. For example, this application requires dynamic interstream prioritization. Beyond just video information, the Office of the Future uses a number of other media types including 3D models and spatialized audio. Because these media types are integrated into a single immersive display environment, user interaction with any given media type may have implications for how other media types are encoded, transmitted, and displayed. For example, the orientation and position of the user’s head indicates a region of interest within the display. Media streams that are displayed within that region of interest should receive a larger share of available bandwidth and be displayed at higher resolutions and frame rates than media streams that are outside the region of interest. 
When congestion occurs, lower priority streams should react more strongly than higher priority streams. In this way, appropriate aggregate behavior is achieved and application-level tradeoffs are exploited.
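One way to realize this behavior is to scale each stream's congestion response by its priority, so that low-priority streams back off harder (a hedged sketch; the scaling rule, constants, and stream names are our own illustration, not taken from the paper):

```python
# Sketch: lower-priority streams react more strongly to congestion.
# The multiplicative-decrease rule below is illustrative only.

def adjust_rates(current_rates, priorities, congested):
    """On congestion, decrease each stream's rate; the decrease is
    steeper for low-priority streams, gentler for high-priority ones."""
    if not congested:
        return dict(current_rates)
    max_p = max(priorities.values())
    adjusted = {}
    for stream, rate in current_rates.items():
        # Highest priority keeps 90% of its rate; lower priorities keep less.
        factor = 0.5 + 0.4 * priorities[stream] / max_p
        adjusted[stream] = rate * factor
    return adjusted

# A region-of-interest video stream (priority 4) vs. background video
# (priority 1), both at 100 units before a congestion event.
rates = adjust_rates({"roi_video": 100.0, "bg_video": 100.0},
                     {"roi_video": 4, "bg_video": 1}, congested=True)
```

Under this rule the aggregate still backs off, but application-level tradeoffs decide where the reduction falls.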
3
Networking Challenges
To generalize the networking challenges faced by complex C-to-C applications like the Office of the Future, we first develop a generic model for C-to-C applications and subsequently characterize the networking requirements associated with this model.

3.1 C-to-C Application Model
We model a generic C-to-C application as two sets of processes executing on two sets of communication or computing devices. Figure 2 illustrates this model. A cluster is comprised of any number of endpoints and a single aggregation point, or AP. Each endpoint represents a process on some endpoint host (typically a networked computer) that sends and/or receives data from another
endpoint belonging to a remote cluster. The AP functions as a gateway node connecting cluster endpoints to the Internet. The common traversal path between aggregation points is known as the cluster-to-cluster data path. Endpoints within the same cluster are connected by an intranet provisioned with enough bandwidth to comfortably support the communication needs of the cluster. In general, latency between endpoints on the same cluster is small compared to latency between endpoints on different clusters. Our overall objective is to coordinate endpoint flows across the cluster-to-cluster data path, where available bandwidth is uncertain and constantly changing.

Fig. 2. C-to-C application model.

3.2 Networking Requirements
A useful metaphor for visualizing the networking requirements of C-to-C applications is to view the communication between clusters as a rope with frayed ends. The rope represents the aggregate data flows between clusters. Each strand represents one particular transport-level stream. At the ends of the rope, the strands may not share the same path. In the middle of the rope, however, the strands come together to form a single aggregate object. While each strand is a separate entity, they share a common fate and purpose when braided together as a rope. With this metaphor in mind, we identify several important networking requirements of C-to-C applications:
– Global measurements of congestion, delay, and loss. Although each stream of a C-to-C application is an independent transport-level flow of data, it is also semantically related to other streams in the same distributed application. As such, network events such as congestion, delay, and loss should be measured and reported for the flows as an aggregate.
– Preserved end-to-end semantics. The specific transport-level protocol (i.e., TCP, UDP, RTP, RAP, etc.) that is used by each flow will be specific to the communication requirements of the data within the flow and the role it plays within the larger application. Thus, each transport-level protocol used should still maintain the appropriate end-to-end semantics and mechanisms. For example, if a data flow contains control information that requires in-order, reliable delivery, then the transport-level protocol used to deliver this
specific flow (e.g., TCP) should provide these services on an end-to-end basis. Thus, although the flow is part of an aggregate set of flows, it should still maintain end-to-end mechanisms as appropriate.
– A coordinated view of current network conditions. Even though each flow maintains its own end-to-end semantics, the flows should receive a coordinated view of current network conditions like available bandwidth and global measures of aggregate delay, loss, and congestion. We need to separate the adaptive dynamic behavior of each transport-level protocol, which depends on end-to-end semantics of individual streams, from the mechanisms used to measure current network conditions, which are global across all streams of the application.
– Interstream relative bandwidth measurements. Individual streams within the C-to-C application may require knowledge about their bandwidth usage relative to the other streams of the same application. This knowledge can be used to determine the appropriate bandwidth level of a particular stream given application-level knowledge about interstream relationships. For example, an application may want to establish a relationship between two separate flows of data such that one flow consumes twice as much bandwidth as the other.
– Deployability. Finally, the mechanisms used to provide C-to-C applications with transport-level protocol coordination need to be reasonably deployable.
In the following section, we outline a design for a mechanism that meets these requirements.
4
The Coordination Protocol
Our approach to this problem is to propose a new protocol layer between the network layer (IP) and transport layer (TCP, UDP, etc.) that addresses the need for coordination in C-to-C application contexts. We call this protocol the Coordination Protocol (CP). The coordination function provided by CP is transport protocol independent. At the same time, CP is distinct from network-layer protocols like IP that play a more fundamental role in routing a packet to its destination. CP works by attaching probe information to packets transmitted from one cluster to another. As the information is acknowledged and returned by packets of the remote cluster, a picture of current network conditions is formed and shared among endpoints within the local cluster. A consistent view of network conditions across flows follows from the fact that the same information is shared among all endpoints. Figure 3 shows our proposed network architecture. CP will be implemented between the network layer (IP) and the transport layer (TCP/UDP). We call this layer the coordination layer. The coordination layer will exist on each device participating in the C-to-C application, as well as on the two aggregation points on either end of the cluster-to-cluster data path.
At the endpoints, CP can be implemented on top of IP in a straightforward manner, much as TCP and UDP are implemented on top of IP. At the aggregation points, CP must be part of the forwarding path. As such, forwarding behavior must be modified to include a CP handling routine before IP forwarding can subsequently take place. We acknowledge this departure from standard IP behavior by labeling the IP layer at the APs in Figure 3 as IP∗. Transport-level protocols will be built on top of CP in the same manner that TCP is built on top of IP. CP will provide these transport-level protocols with a consistent view of network conditions. These transport-level protocols will in turn use this information, along with various configuration parameters, to determine a data transmission rate and related send characteristics. In Figure 3, we show several possible transport-level protocols (C-TCP, C-UDP, and C-RTP), which are meant to represent coordinated counterparts to existing protocols.

4.1 Basic Operation
Figure 4 shows a CP data packet. CP encapsulates transport-level packets by prepending a small header. In turn, IP encapsulates CP packets, and the IP protocol header field indicates that CP is being used. Each CP header will contain an identifier associating the packet with a particular C-to-C application. The remaining contents vary according to the changing role played by the CP header as it traverses the network path from source endpoint to destination endpoint.
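A minimal sketch of this encapsulation (the field set follows the discussion around Fig. 4, but the exact layout, field widths, and byte order are our assumptions, not the paper's wire format):

```python
# Sketch of CP encapsulation: a small header prepended to the
# transport-level packet. Layout and widths are illustrative.
import struct

# cluster id, flow id, sequence number, echoed sequence number,
# flow priority weight, flags (network byte order)
CP_FMT = "!IIIIHH"

def cp_encapsulate(cluster_id, flow_id, seq, echo, priority, payload):
    header = struct.pack(CP_FMT, cluster_id, flow_id, seq, echo, priority, 0)
    return header + payload

def cp_decapsulate(packet):
    n = struct.calcsize(CP_FMT)
    return struct.unpack(CP_FMT, packet[:n]), packet[n:]

pkt = cp_encapsulate(7, 1, 100, 42, 3, b"transport segment")
fields, payload = cp_decapsulate(pkt)
```

The payload here stands in for the transport-level header plus data; IP encapsulation around the whole packet is omitted.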
Fig. 3. CP network architecture.
– As packets originate from source endpoints. The CP header may be used to communicate requests to the local AP. For instance, an endpoint may request to join or leave a cluster, have a membership or bandwidth usage report issued, or post a message to all endpoints within the local cluster. If no request is made, then cluster, flow, and protocol information are simply included in the CP header for identification.
– As packets arrive at the local AP. CP header and packet information is processed. Part of the CP header will then be overwritten, allowing the AP to communicate congestion probe information with the remote AP.
– As packets arrive at the remote AP. The CP header is processed and used to detect network conditions. Again, part of the CP header is overwritten, this time to communicate network condition information to the destination endpoint.
– As packets arrive at the destination endpoint. Network condition information is obtained from the CP header and passed on to the transport-level protocol and the application.

4.2 Detecting Network Conditions
A primary function of CP is to measure congestion and delay along the cluster-to-cluster data path. To accomplish this objective, each packet passing from one AP to another will have two numbers inserted into its CP header. The first number will be a sequence number that increases monotonically for every packet sent. The second number will be an echo of the last sequence number received from the other AP. By recording the time when a new sequence number is sent and the time when its echo is received, the AP can approximate the round trip time and infer network delay. Similarly, missing sequence numbers can be used to detect packet loss and infer congestion. Because sequence numbers in the CP header do not have any transport-level function, CP can use whatever packet is being transmitted next to carry this information as long as the packet is part of a flow associated with the C-to-C application. Since the packets of multiple flows are available for this purpose, this mechanism can be used for fine-grained detection of network conditions along the cluster-to-cluster data path.
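The probing mechanism can be sketched as follows (the timing bookkeeping and the loss formula are our own illustration of the scheme described above, not code from the paper):

```python
# Sketch: an AP stamps outgoing packets with a monotonically increasing
# sequence number plus an echo of the last number heard from the peer
# AP, then infers RTT from echoes and loss from sequence-number gaps.

class APProbe:
    def __init__(self):
        self.next_seq = 0
        self.sent_at = {}        # seq -> local send time
        self.last_heard = None   # last seq received from the peer AP
        self.rtts = []
        self.received = set()

    def stamp_outgoing(self, now):
        """Return (sequence number, echo) to insert into the next CP header."""
        seq = self.next_seq
        self.next_seq += 1
        self.sent_at[seq] = now
        return seq, self.last_heard

    def on_incoming(self, seq, echo, now):
        """Process the (seq, echo) carried by a packet from the peer AP."""
        self.last_heard = seq
        self.received.add(seq)
        if echo in self.sent_at:                     # our number came back
            self.rtts.append(now - self.sent_at.pop(echo))

    def loss_estimate(self):
        """Fraction of the peer's sequence numbers never seen."""
        if self.last_heard is None:
            return 0.0
        return 1.0 - len(self.received) / (self.last_heard + 1)

ap = APProbe()
s0, _ = ap.stamp_outgoing(now=0.0)
ap.on_incoming(seq=0, echo=s0, now=0.08)    # peer echoes our seq 0
ap.on_incoming(seq=2, echo=None, now=0.10)  # peer's seq 1 went missing
```

Because the stamps ride on whatever application packet is sent next, no extra probe traffic is needed; sampling density tracks the aggregate packet rate.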
A CP packet thus consists of an IP header, the CP header, the transport-level header, and the packet data. The CP header elements and the stage of communication in which each is used are:

Header element                             Stage of communication
Protocol Identifier, Cluster Identifier    Always present
Sequence Number, Echo Seq. Number          Used between aggregation points
Flow Identifier, Flow Priority Weight      Used from endpoint to aggregation point
Delay Report, Congestion Report            Used from aggregation point to endpoint

Fig. 4. CP packet structure.
4.3 Transport-Level Interface
An important aspect of CP is the design of its interface with transport-level protocols. CP provides transport-level protocols with information about network
conditions, including congestion and delay measurements. These measurements are delivered whenever packets pass from CP to the transport-level protocol through this interface. Variants of existing transport-level protocols will be developed on top of this interface. For example, a coordinated version of TCP (C-TCP) will consider acknowledgements only as an indicator of successful transfer. Congestion detection will be relegated entirely to CP. Because an application will need to provide CP-related information to the transport-level protocol, some aspects of the transport-level interface will be exposed to the networking API (i.e., the socket layer). For instance, an application will need to supply a cluster identifier that associates the flow with a given C-to-C application. In the reverse direction, some transport-level protocols will make delay, jitter, loss, or other CP-derived information available to the application layer.

4.4 Cluster Membership Services
An AP maintains a cluster membership table for each C-to-C application. This table is created when a CP packet for a C-to-C application with no current table is received and is maintained as soft state. Endpoints are dynamically added and deleted from this table. Storing the table as soft state avoids explicit configuration and makes the mechanism lightweight and robust. A cluster membership table performs two important functions. First, it maintains a list of known cluster endpoints. This allows the AP to broadcast network condition information to all C-to-C application participants. Endpoints may also request a membership report from the AP to receive a current listing of other endpoints. This list may be used in application-specific ways, for example to communicate point-to-point with cluster participants or to track cluster size. The second function is that of bandwidth monitoring. An AP will monitor bandwidth usage for each cluster endpoint and use the cluster membership table to store resulting statistics. An endpoint may request a bandwidth report to obtain information on bandwidth usage among other endpoints, as well as aggregate statistics on the cluster as a whole. This information can be used in application-specific ways to configure and manage flow coordination dynamically.
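A minimal sketch of such a soft-state table with per-endpoint bandwidth accounting (the timeout value, field names, and endpoint names are illustrative assumptions):

```python
# Sketch of the AP's soft-state cluster membership table: entries
# expire unless refreshed by traffic, and per-endpoint byte counts
# support bandwidth reports. Timeout is an assumed value.

SOFT_STATE_TIMEOUT = 30.0  # seconds (illustrative)

class MembershipTable:
    def __init__(self):
        self.entries = {}   # endpoint -> (last_seen, bytes_sent)

    def on_packet(self, endpoint, size, now):
        """Refresh (or implicitly add) an endpoint and account its bytes."""
        _, total = self.entries.get(endpoint, (now, 0))
        self.entries[endpoint] = (now, total + size)

    def expire(self, now):
        """Drop endpoints not heard from within the soft-state timeout."""
        self.entries = {e: v for e, v in self.entries.items()
                        if now - v[0] <= SOFT_STATE_TIMEOUT}

    def membership_report(self):
        return sorted(self.entries)

    def bandwidth_report(self):
        return {e: v[1] for e, v in self.entries.items()}

tbl = MembershipTable()
tbl.on_packet("cam-1", 1500, now=0.0)
tbl.on_packet("mic-1", 200, now=1.0)
tbl.on_packet("cam-1", 1500, now=2.0)
tbl.expire(now=32.0)   # mic-1, last seen at t=1.0, has timed out
```

Because membership is created and refreshed purely by observed packets, no explicit join/leave configuration is required, matching the lightweight soft-state design above.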
5 Design Rationale

5.1 Benefits of CP
We believe the benefits of our approach to be substantial and include: – Locally deployable. CP requires changes to protocol stacks within application clusters only. No changes are required of forwarding nodes along the Internet path between clusters. Furthermore, cluster endpoints can coexist with non-CP communication and computing devices.
– Available to all transport protocols. A consistent view of network conditions is accessible by all flow endpoints, regardless of the transport-level protocol used by a given flow.
– Flexible. Our approach does not dictate how an application handles flow coordination. Nor does it enforce coordination schemes through external traffic shaping or packet scheduling at switching points. Instead, flow endpoints are informed of network conditions and adjust in application-defined ways.
– Dynamic membership and configuration. The number of endpoints within a cluster may change dynamically, along with the traffic characteristics of any given flow endpoint.
– Sensitive to router performance issues. While the CP architecture does require per-flow and per-cluster state to be maintained at the aggregation points, the work involved is limited to simple accounting operations and does not involve packet scheduling, queue management, or other complex processing. Furthermore, the performance of routers along the Internet path between clusters remains entirely unaffected.
It is important to note that CP is only one possible implementation of a mechanism to satisfy the networking requirements of C-to-C applications outlined in Section 3. The need for a transport-level coordination mechanism is motivated by the needs of C-to-C applications like tele-immersion, distributed sensor arrays, and CSCW, which represent an important class of future Internet applications. The specific design of CP is our attempt to satisfy those needs.

5.2 Design Justification
We anticipate several questions about why we designed CP in the manner that we have and justify its design below.

5.3 Why Do This below the Transport-Level?
The primary reason for implementing CP below the transport-level is to preserve the end-to-end semantics of the transport-level. Another possible approach would be to deploy CP at the application level by having all streams of a cluster sent to a multiplexing agent, which then incorporates the data into a single stream sent to a demultiplexing agent on the remote cluster. This approach, however, has several drawbacks. By breaking the communication path into three stages, the end-to-end semantics of the individual transport-level protocols are severed. This approach also mandates that application-level control be centralized and integrated into the multiplexing agent. A secondary reason for implementing CP below the transport-level is that this is where the CP mechanisms logically belong. The transport-level is associated with the end-to-end semantics of individual streams. The network-level protocol (i.e., IP) is associated with the next-hop forwarding path of individual
packets. CP deals with streams that are associated together and share a significant number of hops along their forwarding paths, but do not share exact end-to-end paths. This relaxed notion of a stream bundle logically falls between the strict end-to-end notion of the transport-level and the independent packet notion of the network-level.

5.4 Why Do This at Packet Granularity?
By using every packet from every flow of data associated with a C-to-C application, we can achieve fine-grained sampling of current network conditions. Fine-grained sampling is important because Internet network conditions are highly dynamic at almost all time scales. Sampling across the flows as an aggregate allows each transport-level protocol to have a more complete picture of current network conditions. Also, since each packet payload needs to be delivered to its intended endpoint destination, each CP packet provides a convenient and existing vehicle for probing network conditions and communicating the resulting information to application endpoints. Endpoints are given the opportunity to react to changing conditions in a fine-grained manner.

5.5 Why Not Have the Endpoints Exchange Information?
Another possible architecture for a transport-level protocol coordination mechanism would be to exchange delay, loss, and bandwidth information among endpoints. This presupposes, however, that the endpoints are aware of each other. Our CP design, on the other hand, allows for loosely-coupled C-to-C applications in which this may not be the case. Endpoints of a C-to-C application are only required to use the same cluster identifier, which can be distributed through a naming service like DNS or some other application-specific mechanism. The aggregation points are a natural place for cluster-wide accounting of loss, delay, and bandwidth since all packets of the cluster will pass through them. The design of CP is cognizant of the additional processing load this places on the aggregation point, which is reflected in its relatively simple accounting mechanism.
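Such per-cluster accounting can stay simple precisely because it tracks the aggregate rather than individual flows. The sketch below is a hypothetical illustration of what lightweight accounting at an aggregation point might look like; the class, field names, and smoothing constant are our own assumptions, not part of the CP specification:

```python
import time
from dataclasses import dataclass, field

@dataclass
class ClusterStats:
    """Running per-cluster accounting kept at an aggregation point (illustrative)."""
    packets_seen: int = 0
    bytes_seen: int = 0
    losses: int = 0
    delay_ema_ms: float = 0.0
    window_start: float = field(default_factory=time.monotonic)

    def on_packet(self, size: int, one_way_delay_ms: float, lost_before: int = 0) -> None:
        # Cheap per-packet work: count packets and bytes, accumulate losses
        # inferred from sequence gaps, and keep an exponential moving
        # average of the observed one-way delay.
        self.packets_seen += 1
        self.bytes_seen += size
        self.losses += lost_before
        alpha = 0.1  # assumed smoothing constant
        self.delay_ema_ms = (1 - alpha) * self.delay_ema_ms + alpha * one_way_delay_ms

    def snapshot(self) -> dict:
        # Values an aggregation point might stamp into each outgoing CP header.
        elapsed = max(time.monotonic() - self.window_start, 1e-9)
        return {
            "loss_rate": self.losses / max(self.packets_seen + self.losses, 1),
            "delay_ms": self.delay_ema_ms,
            "throughput_bps": 8 * self.bytes_seen / elapsed,
        }
```

Because every cluster packet passes through the aggregation point, a few counters and one moving average suffice to give all endpoints a shared view of loss, delay, and bandwidth.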
5.6 Related Work
The ideas behind CP were primarily inspired by the Congestion Manager (CM) architecture developed by Balakrishnan et al. [2]. CM provides a framework for different protocols to share information concerning network conditions. The CM is an excellent example of a coordination mechanism, but operates only when the transport-level flows share the entire end-to-end path. In [6], Kung and Wang propose a scheme for aggregating traffic between two points within the backbone network and applying the TCP congestion control algorithm to the whole bundle. The mechanism is transparent to applications and does not provide a way for a particular application to make interstream tradeoffs.
Transport-Level Protocol Coordination in Cluster-to-Cluster Applications
Pradhan et al. propose a way of aggregating TCP connections sharing the same traversal path in order to share congestion control information [8]. Their scheme takes a TCP connection and divides it into two separate (“implicit”) TCP connections: a “local subconnection” and a “remote subconnection.” This scheme, however, breaks the end-to-end semantics of the transport protocol (i.e., TCP).

Active networking, first proposed in [11], allows custom programs to be installed within the network. Since its original conception, a variety of active networking systems have been built [14, 1, 3, 12, 15, 5, 7, 10]. They are often thought of as a way to implement new and customized network services. In fact, CP could be implemented within an active networking framework. Active networking, however, mandates changes to routers along the entire network path. This severely hinders deployment. CP requires changes only at the endpoints and at the aggregation points.
6 Summary

In this position paper, we motivated the need for transport-level protocol coordination for a particular class of distributed applications that we described as cluster-to-cluster applications. In particular, C-to-C applications are characterized by many independent, but semantically related, flows of data between clusters of computation and communication devices. This application architecture requires that end-to-end transport-level semantics be preserved, while at the same time, providing each stream with a consistent view of current networking conditions. To achieve this, each transport-level protocol needs measures of congestion, delay, and loss for the aggregate bundle of data flows. We described the design of the Coordination Protocol (CP) which provides for the unique networking requirements of C-to-C applications. The main features of CP are:

– Transport-level protocol independent.
– Locally deployable.
– Fine-grained sampling of network conditions.
– Flexible.
– Supports dynamic membership and configuration.
CP provides cluster endpoints with a consistent view of network conditions, as well as cluster membership and bandwidth usage information. With this information, a C-to-C application can make coordination decisions according to specific objectives and a privileged understanding of application state. CP will facilitate the development of next generation Internet applications.
References

[1] D.S. Alexander et al. Active bridging. Proceedings of SIGCOMM’97, pages 101–111, September 1997.
[2] Hari Balakrishnan, Hariharan S. Rahul, and Srinivasan Seshan. An integrated congestion management architecture for Internet hosts. Proceedings of ACM SIGCOMM, September 1999.
[3] D. Decasper et al. Router plugins: A software architecture for next generation routers. Proceedings of SIGCOMM’98, pages 229–240, September 1998.
[4] J. Grudin. Computer-supported cooperative work: its history and participation. Computer, 27(4):19–26, 1994.
[5] M. Hicks et al. PLANet: An active internetwork. Proceedings of INFOCOM’99, pages 1124–1133, March 1999.
[6] H.T. Kung and S.Y. Wang. TCP trunking: Design, implementation and performance. Proc. of ICNP ’99, November 1999.
[7] E. Nygren et al. PAN: A high-performance active network node supporting multiple code systems. Proceedings of OPENARCH’99, 1999.
[8] P. Pradhan, T. Chiueh, and A. Neogi. Aggregate TCP congestion control using multiple network probing. Proc. of IEEE ICDCS 2000, 2000.
[9] Ramesh Raskar, Greg Welch, Matt Cutts, Adam Lake, Lev Stesin, and Henry Fuchs. The office of the future: A unified approach to image-based modeling and spatially immersive displays. Proceedings of ACM SIGGRAPH 98, 1998.
[10] B. Schwartz et al. Smart packets for active networks. Proceedings of OPENARCH’99, 1999.
[11] D. L. Tennenhouse and D. Wetherall. Towards an active network architecture. Multimedia Computing and Networking, January 1996.
[12] J. van der Merwe et al. The Tempest - a practical framework for network programmability. IEEE Network Magazine, 12(3), May/June 1998.
[13] M. Weiser. Some computer science problems in ubiquitous computing. Communications of the ACM, 36(7):75–84, July 1993.
[14] David Wetherall. Active network vision and reality: lessons from a capsule-based system. Operating Systems Review, 34(5):64–79, December 1999.
[15] Y. Yemini and S. da Silva. Towards programmable networks. International Workshop on Distributed Systems Operations and Management, October 1996.
[16] T.-P. Yu, D. Wu, K. Mayer-Patel, and L.A. Rowe. dc: A live webcast control system. To appear in Proc. of SPIE Multimedia Computing and Networking, 2001.
Data to the People - It’s a Matter of Control

Mauricio Cortes and J. Robert Ensor

Bell Laboratories, Lucent Technologies
600 Mountain Avenue, Murray Hill, New Jersey 07974, USA
{mcortes, jre}@lucent.com
Abstract. This paper highlights some early results from our current research. We are building prototype multimedia services, e.g., digital television, video on demand, and teleconferencing, for metropolitan area networks. Our work illustrates that application-aware resource management plays a critical role in providing such services economically. The paper describes a middleware system—called Narnia—that supports the development and deployment of these application-aware resource controls. Using Narnia programming abstractions, application developers can create flexible, dynamic resource managers as part of multimedia service implementations. The paper also presents a brief summary of our experiences using Narnia to build applications.
1 Introduction

Our goal is to provide multimedia services to communities of people. More precisely, we intend to make multimedia services, e.g., digital television, video on demand, multimedia teleconferencing, and multi-party games, available through packet networks. Towards this goal, we are building a prototype data center, which is illustrated in Figure 1. It contains a switch/router complex that supports data distribution among its users’ endpoint devices, local application servers, and interfaces to external networks. Its application servers store and/or cache data as well as perform specialized manipulations. Its software manages loads on its network and computing resources.

Our experience building the data center has shown us that flexible, dynamic, application-aware controls of network and computing resource managers are necessary for economical deployment of multiple services to a large number of people. Without such controls, switches and routers can become overloaded as servers try to deliver large files or real-time data streams to users. Similarly, controls are needed to protect application servers from too many user requests. Simply adding switches and application servers to meet uncontrolled service loads would typically be too expensive. On the other hand, control software can effectively manage service loads. For example, a controller can balance the request load among a group of application servers. In addition, this controller could filter and/or aggregate requests, further reducing the load to any application server.
D. Shepherd et al. (Eds.): IDMS 2001, LNCS 2158, pp. 23-28, 2001. Springer-Verlag Berlin Heidelberg 2001
24
Mauricio Cortes and J. Robert Ensor
Figure 1: Data Center Components (a data center containing a switch/router complex; application controls including a directory server, DTV control, and VoD control; a network resource control; and DTV and video servers)
We have built middleware, which we call Narnia, to support development of application-aware controls for network and computing resources. According to the Narnia reference architecture, each resource control is built as a specialized application. As Figure 2 indicates, a typical application is built as a collection of controllers and gateways. Controllers receive requests from users and other controllers. Therefore, they are able to enforce rules (policies) concerning an application’s behavior. Controllers call upon gateways to manage specific resources, such as network elements or media servers. Gateways interact with these resources according to the resources’ command interfaces. A collection of controllers and gateways can run on one or more physical machines, according to specified deployment configurations. Different collections can be created to form different applications; thus, these components can be reused.
2 System Components

Narnia provides an extensible set of abstractions to support a variety of multimedia applications. Its base abstractions are sessions, events, event handlers, and shared resources. Controllers and gateways are built upon these abstractions.
Sessions. Each controller and gateway executes one or more sessions. Sessions are a mechanism for rendezvous and communication among users and servers. They maintain state, which provides contexts for these communications, and they have memory, which can be shared among communicating parties. A session also contains a set of event handlers and shared resources (see below).
Figure 2: Basic Components of a Control Application (controllers and gateways, each running sessions with event handlers; events flow to/from users and/or controllers, while gateways reach network devices and servers through device and server APIs)
Events. Events are messages sent to and received from sessions; they are Narnia’s communication primitives. Each event is a request sent to a session within a control application or a response sent to an earlier requestor’s site. Narnia connection managers and event readers/writers handle underlying communication details and provide each session with its own priority-ordered event queue. The application developer, therefore, deals with events, not their transmission.

Event Handlers. Event handlers process events; they give semantics to the events and implement service behaviors and policies. Each session defines a set of events and the handlers for those events; thus, a session is the basic context for a service (application). A service can be modified—perhaps during its execution—by creating a new session with updated event handlers.
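The session/event-handler pattern above reduces to a handler table plus a priority-ordered event queue. The sketch below is not Narnia code; every class, method, and event name is invented purely to illustrate the pattern:

```python
import heapq
import itertools
from typing import Any, Callable, Dict, List, Tuple

class Session:
    """Toy sketch of a session in the style described in the text:
    a priority-ordered event queue plus a table of event handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], Any]] = {}
        self._queue: List[Tuple[int, int, dict]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority

    def on(self, event_type: str, handler: Callable[[dict], Any]) -> None:
        # Registering a handler is what gives semantics to an event type.
        self._handlers[event_type] = handler

    def post(self, event: dict, priority: int = 10) -> None:
        # Lower number = higher priority, mimicking a priority-ordered queue.
        heapq.heappush(self._queue, (priority, next(self._seq), event))

    def dispatch_one(self) -> Any:
        # Pop the highest-priority event and run its handler, if any.
        _, _, event = heapq.heappop(self._queue)
        handler = self._handlers.get(event["type"])
        if handler is None:
            return None  # unknown events are simply ignored in this sketch
        return handler(event)
```

Updating a service at runtime then amounts to creating a new session whose handler table differs, which matches the modification mechanism described above.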
Shared Resources. Sessions can contain shared resources. Each shared resource represents a specific hardware or software entity. A resource is implemented as a named set of event handlers within a session. Edge resources control non-Narnia entities. Their event handlers manage each such entity through the entity’s own control interface, e.g., signal ports or API. Narnia gateways use edge resources to monitor and control non-Narnia based system components, such as network routers and video servers. Narnia controller sessions can also contain shared resources. These non-edge resources may refer, in turn, to edge and non-edge resources. Multiple non-edge resources can access a common (edge or non-edge) resource; however, an edge resource does not access other Narnia resources.

Service Management Modules. In addition to these basic building blocks, Narnia provides service management modules, including application and session directories and performance monitors for events and network loads. These monitors, which are built as controllers and gateways, provide data needed for adaptive, dynamic behavior of application-aware resource controls.
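The edge/non-edge distinction can be made concrete with a small sketch. The classes below, including the fake video-server API standing in for a non-Narnia entity's control interface, are illustrative assumptions rather than actual Narnia interfaces:

```python
class FakeVideoServer:
    """Stand-in for a non-Narnia entity's own control interface (its API)."""
    def __init__(self):
        self.log = []
    def start(self, stream):
        self.log.append(("start", stream))
        return "ok"
    def stop(self, stream):
        self.log.append(("stop", stream))
        return "ok"

class EdgeResource:
    """An edge resource: a named set of event handlers that drives a
    non-Narnia entity through that entity's own control interface."""
    def __init__(self, name, device):
        self.name = name
        self._handlers = {
            "start": lambda ev: device.start(ev["stream"]),
            "stop": lambda ev: device.stop(ev["stream"]),
        }
    def handle(self, event):
        return self._handlers[event["type"]](event)

class NonEdgeResource:
    """A non-edge resource refers to other (edge or non-edge) resources;
    several non-edge resources may share one underlying edge resource."""
    def __init__(self, name, backends):
        self.name = name
        self._backends = backends
    def handle(self, event):
        # Fan the event out to every underlying resource.
        return [b.handle(event) for b in self._backends]
```

In this arrangement a controller's non-edge resource (say, a chat room's shared video) delegates to a gateway's edge resource, while the edge resource alone talks to the real device.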
3 Experience

We have used Narnia to build two service prototypes: a digital television service and a video on demand service. Each of these services is based on a collection of controllers and gateways. The digital television service provides users with standard and premium (restricted access) channels. Programs are delivered as MPEG-2 streams at a nominal rate of 6 Mbps. The digital television server distributes only those programs that have been requested by viewers, and it sends each channel’s data to a multicast group representing the viewers of that channel. The digital television controller is responsible for defining and maintaining these multicast groups. It uses a network resource manager (a gateway) to control the relevant network elements (e.g., switches).

The video on demand service allows users to select movies for downloading. The service operates via a “controlled blast,” in which a movie is downloaded to a user’s local store at a rate that can be faster than the movie’s real-time streaming rate. Users and the video on demand control negotiate to find a suitable download rate (up to 80 Mbps). In our service, movie downloads begin at controller-determined intervals. This service definition allows the video on demand controller, like the digital television controller, to manage multicast groups for data distribution.

We are now using Narnia to build a multimedia chat room service, which gives users a means of exchanging text messages, talking, and viewing shared video during real-time interactions. The chat room controller coordinates the actions of other controllers to manage each chat room session. A text message controller manages posting, storing, and distribution of text messages. An audio controller manages an audio bridge (through an audio bridge gateway). The bridge is a shared resource that is responsible for mixing audio for chat room participants.
The video controller interacts with a video gateway to control distribution of videos from a standard server.
Our work with these various multimedia services has shown that Narnia provides a suitable foundation for building application-aware controls. The programming abstractions of Narnia offer a small, but powerful base for communication among users and service program elements. Furthermore, Narnia’s support for event handlers—event handler classes, factories, and loaders—creates a convenient platform for application development and execution. In fact, our directory server, performance monitor, and event logger controllers are the same except for the events and event handlers each one needs.

Furthermore, Narnia core components exhibit good performance. Using a Sun Enterprise 450 with two 450 MHz CPUs, we have run a controller that receives small (a hundred bytes or so) requests, writes them to a disk log, and sends similar-sized responses back to its clients. The controller can handle approximately 8000 events per second. This level of communication performance gives developers great flexibility in distributing their program modules over different machines.
4 Related Work

Several middleware platforms for the support of multimedia services have been described. Some of these platforms include communication primitives similar to those provided by Narnia, and we have selected representatives of these platforms for comparison to our present work.

CINEMA [4] defines sessions as the basis for access to remote services. A session represents the collection of resources needed to provide a service. To provide a service, a CINEMA session locates and reserves the service’s required resources. Then it establishes the needed flow of data among the resources. Narnia sessions, in contrast, do not have inherent flow-building capabilities (these are programmed as part of controller and gateway event handlers) and are not limited to a single user interaction (they can persist without active users).

Several event-based middleware platforms, such as JEDI [2], Siena [1], and CORBA events [3], decouple event producers and consumers. The latter must register with an event dispatcher, specifying the types of events they are interested in processing. The former send events to the dispatcher, which in turn forwards them to all the consumers registered for that event type. Narnia does not contain such an event dispatcher. Rather, programmers can build such infrastructure using Narnia primitives. One would simply develop a controller with the dispatcher’s functionality. Other controllers and gateways can register their interest in sets of events with the new dispatch controller. However, Narnia does not provide a specification language to describe how an event handler should forward events.

JAIN [6] is a collection of Java APIs to help developers build applications for multi-network (i.e., IP, ATM, and PSTN), multi-protocol and multi-party environments. JAIN is based on the JTAPI [5] call model. In contrast, our control applications are based on sessions that have no knowledge of the underlying networks and types of endpoint devices.
It is possible to develop a Narnia gateway that uses JAIN-based applications as resources.
5 Summary

This paper has presented a brief description of the Narnia middleware system, which supports both the development and the deployment of application-aware controls. Narnia’s basic abstractions—sessions, events, event handlers, and shared resources—provide the foundation for development of these control applications. Sessions form the basic context mechanism for applications, and events provide the basis for communication among system components. Event handlers act on events to provide service behaviors and policies. Shared resources represent underlying devices accessed by multiple users. The paper has also described our experiences using this middleware to build applications. By building these applications, we have learned that Narnia abstractions create a helpful base for programming controllers and gateways. Performance measurements show that Narnia communication primitives are sufficiently efficient to support distributed applications. Our planned future work includes further application development and performance tests.
References

1. Carzaniga, A., Rosenblum, D., Wolf, A., Challenges for Distributed Event Services: Scalability vs. Expressiveness, Engineering Distributed Objects '99, Los Angeles, CA, USA, May 1999.
2. Cugola, G., Di Nitto, E., Fuggetta, A., Exploiting an event-based infrastructure to develop complex distributed systems, Proceedings of the 1998 International Conference on Software Engineering, pp. 261-270, 1998.
3. Object Management Group, CORBAservices: Common Object Service Specification, Technical report, Object Management Group, July 1998.
4. Rothermel, K., Barth, I., and Helbig, T., CINEMA: An Architecture for Configurable Multimedia Applications, in Spaniol, O., Danthine, A., and Effelsberg, W. (eds.), Architecture and Protocols for High-Speed Networks, pp. 253-271, Kluwer Academic Publishers, 1994.
5. Sun Microsystems, Java Telephony Specification (JTAPI), ver. 1.3, June 1999.
6. Tait, D., de Keijzer, J., Goedman, R., JAIN: A New Approach to Services in Communication Networks, IEEE Communications Magazine, pages 94-99, Jan. 2000.
An Access Control Architecture for Metropolitan Area Wireless Networks

Stefan Schmid, Joe Finney, Maomao Wu, Adrian Friday, Andrew Scott, and Doug Shepherd

DMRG, Computing Department, Lancaster University, Lancaster LA1 4YR, U.K.
{sschmid,joe,maomao,adrian,acs,doug}@comp.lancs.ac.uk
Abstract. This paper introduces a novel access control architecture for publicly accessible, wireless networks. The architecture was designed to address the requirements obtained from a case study of ubiquitous Internet service provisioning within the city of Lancaster. The proposed access control mechanism is based on the concepts of secure user authentication, packet marking, and packet filtering at the access routers. The paper demonstrates to what extent this token-based, soft-state access control mechanism improves security and robustness, and offers improved performance over that provided by existing approaches within roaming networks. Early indications show the access control mechanism can better be implemented through the use of active routers, in order to facilitate dynamic rollout and configuration of the system. In addition, extensions to Mobile IPv6 are proposed, which provide support for roaming users at a fundamental level.
1 Introduction

A wireless network infrastructure, based on 802.11 wireless LAN technology, was installed throughout Lancaster city centre as part of the Guide project [1,2] with the aim of supporting tourists via an online tourist guide application. The second phase of the project, Guide II [3], is intended to interact with the recently formed Mobile IPv6 Testbed collaboration between Cisco Systems, Microsoft Research, Orange and Lancaster University [4]. The new challenge is to “open up” the network and provide general-purpose, ubiquitous services and applications to Lancaster citizens, including public Internet access. In order to control access to the publicly available wireless network, secure user authentication and fine-grained access control are essential.

The core of the wireless network is based on 11 Mbps IEEE 802.11 wireless LAN technology. The network infrastructure consists of numerous base stations distributed around Lancaster covering most of the city centre. Each base station is directly connected to a PC-based router, forming an access router. These access routers are linked back to the university campus backbone via DSL and SDH.

For the original tourist guide application (Guide I), the network was simply managed as an Intranet, without globally routable network addressing. However, as Guide
II aims to provide public Internet services to citizens, several crucial changes to the network implementation are required. (1) Globally routable network addressing must be introduced. (2) A secure access control mechanism that prevents unauthorised or even malicious users from using the network must be incorporated. (3) Further base stations need to be installed to improve coverage. And finally (4) the network protocol needs to incorporate support for host mobility to enable roaming between cells. While the original Guide architecture managed mobility within the Guide application, Guide II requires a transparent mobility mechanism to support general-purpose applications.
2 Requirements

This section summarises the key requirements for our access control architecture:

Secure Access Control – As the network is publicly available throughout the city centre, a secure mechanism is required to ensure authorised access only.

Transparent Access Control – The access control mechanisms must operate fully transparently to any correspondent node on the public Internet.

Minimum User Inconvenience – End users of the access network should, apart from the initial authentication at login time, not notice any changes when using the network (i.e., standard applications must be fully supported).

Support for Mobile Handoffs – Many of the applications that will be used within the testbed will be multimedia oriented, and may therefore have stringent quality of service constraints. In particular, these applications are likely to be sensitive to the packet loss and delay effects caused by mobile terminals performing handoffs between cells. Consequently, any access control mechanism introduced must not adversely affect the performance of such handoffs.

Protection of Access Network – Protection against internal and external security threats such as denial-of-service attacks must be considered.
3 Architecture

The access control approach proposed here is based on the concept of packet marking and packet filtering. Access routers are placed at the periphery of the network to enforce packet filtering policies on the data traffic traversing the nodes. As a result, data packets sent by authorised users are forwarded, while packets of unauthorised clients are simply discarded. Authorisation for access to the network is granted by the authentication server. Upon successful authentication of a user, the server issues a short-lived access token to the user’s end system and informs the relevant access router(s) about the approval. The access token is a transitory credential that grants data packets from a particular user node access to the network. It is a soft-state mechanism, which requires the end systems to periodically refresh their authorisations.
Fig. 1. The access control infrastructure proposed for our wireless network deployed around the City of Lancaster.

The key components of our access control architecture are described here. Figure 1 illustrates how this will be implemented for our wireless network at Lancaster:

• The Authentication Server is responsible for authenticating and authorising users of the access network. It provides a remote management interface for the set-up and configuration of user accounts.

• The Gateway connects the access network with the public Internet or a private Intranet (i.e., campus network). As an extension of the firewall concept, it is concerned with external security threats from arbitrary nodes on the Internet.

• The Access Routers (ARs) control access to the publicly accessible, wireless network. They block traffic originating from unauthorised end nodes based on network-level packet filtering. Co-locating the ARs directly with the base stations enables network-level protection to be applied on a per-cell basis.

• The End Systems (i.e., handheld devices, laptops, etc.) request periodic authentication on behalf of the users and perform packet marking for authorised users. While the latter is realised through minor modification of the network stack, the former can be accomplished either through extensions to the network stack or by means of a user-level system service.

3.1 User Authentication

User authentication is carried out between the user’s end system and the authentication server. As user authentication is needed on a periodic basis to refresh the access tokens, a lightweight mechanism is preferable. We therefore propose the use of a very simple request/response protocol (that consists of only one small message in each direction). The authentication request sent from a mobile node to the authentication server includes the user’s username and password as well as the node’s MAC address and IP
address. While the username and password are required to authenticate a client, the MAC and IP addresses are used to authorise the client node on the access routers. To prevent malicious users from spoofing the secret password of another user, the authentication protocol encrypts the user’s credentials, using a public key mechanism.
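The exchange is thus one small message in each direction: a request carrying the four fields named above, and a response carrying the short-lived token. The JSON framing and field names in the sketch below are our own assumptions (the paper does not specify a wire format), and the public-key encryption step is deliberately omitted:

```python
import json

def build_auth_request(username: str, password: str, mac: str, ip: str) -> bytes:
    # The four fields the request carries: username/password authenticate
    # the user; MAC and IP authorise the node at the access routers.
    # A real implementation would encrypt this blob with the server's
    # public key before sending it (omitted here).
    msg = {"user": username, "pass": password, "mac": mac, "ip": ip}
    return json.dumps(msg).encode()

def parse_auth_response(raw: bytes) -> dict:
    # The response either refuses access or grants a short-lived token
    # together with its configurable expiration interval.
    resp = json.loads(raw)
    if not resp.get("ok"):
        raise PermissionError("authentication refused")
    return {"token": resp["token"], "expires_in_s": resp["expires_in_s"]}
```

Keeping the protocol to a single round trip is what makes the periodic soft-state refresh cheap enough to run every few tens of seconds.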
3.2 Access Tokens

Access tokens are issued to users upon successful authentication and authorisation. The tokens are the credential that allows data packets originating from authorised user nodes to pass through the access router. A token is essentially a random value that is large enough to make it hard to guess or discover by a brute-force search within the lifetime of the token¹.

Upon successful authorisation, the authentication server generates an access token and sends it to the user for packet marking. However, before those packets can pass the access router, the router’s access control list must also be updated. Consequently, the authentication server will also send an access list update to the respective access router (and its neighbouring routers) upon generation of a new access token. The access list update includes the triple (access token, IP address, MAC address).

For security and robustness reasons, we chose to restrict the validity period of the access tokens to a configurable time interval (referred to as the expiration time). The soft-state authentication protocol takes care of the periodic refresh of the user authorisation. The refresh time of the soft-state protocol must therefore be sufficiently smaller than the expiration time. The main advantage of short-lived access tokens is that they are hard to spoof, especially in a roaming network, where a node’s physical location changes frequently.

3.3 Packet Filtering

As the access control mechanism is based on network-level packet filtering within the access routers, each access router maintains an access control list (ACL). This list accommodates the (access token, MAC address, IP address) filter information required to identify ‘authorised packets’ (packets that originate from authorised users). Access routers maintain separate ACLs for each network for which they are responsible.
When an access router receives an ‘upstream’ packet from a wireless device, it checks the relevant ACL to decide how to process the packet. If no ACL entry is present for the received packet, then the packet is simply discarded. There is, however, one exception to this rule: packets destined for selected well-known servers on the network are allowed to pass. This enables unauthenticated mobile nodes to access limited network services (for example, the authentication server). Upstream packets with MAC addresses matching an access list entry are further inspected. The router checks whether or not the IP address and access token included in the packet are valid. If so, the packet is granted access to the network.

¹ Since we envision a token’s lifetime on the order of a few minutes, we suggest a value of 32 bits for now.
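The upstream filtering rules just described condense into a single decision function. The addresses and token value below are invented for illustration; they are not from the Lancaster deployment:

```python
# Per-network ACL at an access router: MAC -> (IP address, access token).
ACL = {"00:11:22:33:44:55": ("10.0.0.7", 0xCAFE1234)}

# Exception list: unauthenticated nodes may still reach selected
# well-known servers, e.g. the authentication server (address invented).
WELL_KNOWN = {"10.0.0.1"}

def forward_upstream(src_mac: str, src_ip: str, token: int, dst_ip: str) -> bool:
    """Return True if the access router should forward the packet."""
    if dst_ip in WELL_KNOWN:
        return True                  # reachable even without authorisation
    entry = ACL.get(src_mac)
    if entry is None:
        return False                 # no ACL entry: silently discard
    ip, expected_token = entry
    # MAC matched; the packet passes only if IP and token are also valid.
    return src_ip == ip and token == expected_token
```

Downstream filtering is symmetric: a lookup on the destination address, with the same trusted-host exception.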
Downstream packets (destined for wireless devices) are also subject to packet filtering. Thus, packets routed to the wireless networks force an ACL lookup. If no entry for the destination address exists, then the packet is dropped. Again, there is an exception for packets sourced from trusted hosts, such as the authentication server.

3.4 Roaming Support

We believe that in order to manage a large-scale public access network, a set of boundaries must be in place between areas of different administration. For example, it may be necessary to prevent users from gaining access to a certain part of the network for security reasons, or it may be necessary to use different link layer wireless technologies in different scenarios. As a result, by far the easiest way to separate these administrative areas is by placing each of them into a different layer 3 network. This allows for much more fine-grained and scalable management of the network. As a result, however, the mobile terminals now require a layer 3 mobile routing protocol, such as Mobile IPv4/v6, to roam between administrative domains.

Therefore, upon a network handoff, mobile devices obtain a new care-of-address (CoA) for their new physical network location². Consequently, network access may well be controlled by a different access router, which will have no knowledge of the user’s previous authorisation. The access control scheme described so far, however, could take up to several minutes before the user regains network access (i.e., until the next authentication refresh is carried out), which would clearly be unacceptable. In order to overcome this weakness, we propose two special measures for roaming networks:

1. The mobile node must immediately perform an authentication refresh upon a network handoff.
2. When an access control triple is sent to an access router by the authentication server (in order to authorise a user), a copy of that triple is also sent to all neighbouring access routers³.
Consequently, a node's new access router will already have an entry in its ACL when the node moves into its coverage area. If the access router then receives a packet from the mobile node that matches a valid ACL entry on MAC address and access token, but not on IP address, it will still grant access for a short 'reprieve time'. However, if the router does not receive a fresh access list update for the node's new IP address before the reprieve time expires, traffic will be blocked. This technique preserves safety, since provisional access is granted only on the basis of a node's previous authorisation and only for a limited time. These extensions have the advantage that they do not interfere with or slow down network-level handoffs; the full user authentication required when entering a new area is simply carried out in the background to avoid extra latency.

2 This is usually achieved either through DHCP (v4/v6) or the auto-configuration mechanisms of IPv6.
3 Neighbouring routers are initially configured manually on the authentication server. However, an automated discovery protocol is subject to further research.
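The reprieve handling described above can be sketched in a few lines. This is a hypothetical illustration, not code from the paper's implementation: the `AccessRouter` class, its field names, and the 3-second default are our own assumptions.

```python
import time

REPRIEVE_TIME = 3.0  # seconds; the paper recommends roughly 2-5 s

class AccessRouter:
    def __init__(self):
        # ACL: maps MAC address -> (access_token, ip_address)
        self.acl = {}
        # nodes on provisional access: MAC -> reprieve deadline
        self.reprieve = {}

    def update_acl(self, mac, token, ip):
        """Install an access control triple (the authentication server also
        pushes a copy of each triple to neighbouring access routers)."""
        self.acl[mac] = (token, ip)
        self.reprieve.pop(mac, None)  # a fresh update ends any reprieve

    def permit(self, mac, token, ip, now=None):
        now = time.monotonic() if now is None else now
        entry = self.acl.get(mac)
        if entry is None or entry[0] != token:
            return False                 # unknown node or bad token: drop
        if entry[1] == ip:
            return True                  # fully authorised
        # MAC and token match but IP does not: the node has roamed in,
        # so grant provisional access until the reprieve time expires.
        deadline = self.reprieve.setdefault(mac, now + REPRIEVE_TIME)
        return now <= deadline
```

A roamed-in node is thus forwarded for a few seconds based on its previous authorisation, and blocked if no fresh ACL update for its new CoA arrives in time.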
Stefan Schmid et al.
The reprieve time must be chosen carefully. On the one hand, the interval should be minimal, since it grants provisional access to users based on their previous authorisation; on the other hand, it must be long enough to complete an authorisation cycle. We recommend a reprieve time of approximately 2-5 seconds, depending on network performance and the load on the authentication server.

3.5 Enhanced Security

The access control solution described so far provides effective access control, but is vulnerable to MAC address spoofing. Since the access token is transmitted in clear text, a malicious user in the same cell could snoop the network to obtain another user's "secret" credential. The combination of this credential and 'spoofing' the user terminal's MAC address would grant an attacker access to the network. Although the soft-state nature of the access control alleviates the problem, because access tokens are only a short-lived authorisation, an impostor could develop an automated mechanism to snoop periodically for valid credentials. Consequently, to fully secure the access control mechanism even against MAC address spoofing, the access token must be encrypted whenever it is sent across the public wireless network. This leaves two areas of vulnerability: firstly, the request/response protocol between a wireless device and the authentication server, and secondly, the tagged packets that the wireless terminal sends with the access token.

The interaction between a mobile device and the authentication server uses a combination of public and secret key mechanisms to avoid the need for a public key distribution infrastructure. The authentication server has a well-known public key. Upon initiating an authentication request, the mobile device chooses a random session key. This key is then encrypted (along with the rest of the authentication request) using the server's public key and sent to the authentication server. The response from the authentication server (which contains the mobile device's access token) is then encrypted using this shared session key (based on a symmetric encryption algorithm). This secures the conveyance of the access token.

To secure the second area of vulnerability (where the mobile node sends a packet), the access token is also encrypted using the session key. In order to verify the authorisation of a packet, the relevant access router therefore also needs to know this session key; it is conveyed to the access router by the authentication server as part of the ACL update message. To prevent replay attacks (whereby the encrypted access token could simply be copied by an attacker), the access token is encrypted together with the packet's checksum. The checksum is derived from a CRC of the dynamic fields in the packet, such as the length field in the IP header, the sequence number of the TCP header, etc.

In network environments where the communication channels between the authentication server and the access routers cannot be trusted, the access control list update, which also includes the access token, must be encrypted as well. Standard end-to-end encryption techniques such as IPsec are recommended.

In cases where users demand a high level of security (for example, full privacy), the architecture supports an additional layer of protection based on data encryption on the last hop. Thus, end users can choose to encrypt the whole data packet using the shared session key by tunnelling the data to the access router, which decrypts the data packets before forwarding them. Finally, it is worth noting that standard IPsec authentication and encryption are entirely complementary to our access control architecture and can be used in addition to achieve secure end-to-end communication.
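The replay protection just described (the access token encrypted together with a CRC over the packet's dynamic fields, and the result carried with the packet) can be sketched as follows. This is an illustrative sketch only: a SHA-256-based XOR keystream stands in for the symmetric cipher, and all function names and the exact layout are our own assumptions, not the paper's wire format.

```python
import hashlib
import struct
import zlib

def _keystream(session_key: bytes, nonce: bytes, n: int) -> bytes:
    """Toy keystream derived from the session key (stand-in for a real cipher)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(session_key + nonce
                              + struct.pack("!I", counter)).digest()
        counter += 1
    return out[:n]

def seal_credential(session_key: bytes, token: int, dynamic_fields: bytes) -> bytes:
    """Encrypt (32-bit access token, 32-bit CRC of the dynamic fields)."""
    crc = zlib.crc32(dynamic_fields) & 0xFFFFFFFF
    plain = struct.pack("!II", token, crc)
    ks = _keystream(session_key, dynamic_fields, len(plain))
    return bytes(a ^ b for a, b in zip(plain, ks))

def mark_packet(session_key: bytes, token: int, dynamic_fields: bytes,
                packet: bytes) -> bytes:
    """Prepend the 64-bit credential to the outgoing packet."""
    return seal_credential(session_key, token, dynamic_fields) + packet

def verify_packet(session_key: bytes, token: int, dynamic_fields: bytes,
                  marked: bytes) -> bool:
    """Access-router side: re-derive the credential and compare."""
    return marked[:8] == seal_credential(session_key, token, dynamic_fields)
```

Because the CRC covers fields that change from packet to packet, a credential copied from one packet does not verify for any other packet.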
4 Implementation

This section outlines the implementation of the key components of our access control architecture. Due to lack of space, we provide only a brief description here.

Client Software. We incorporate the network protocol extensions required for the access control mechanism within the Microsoft and Linux Mobile IPv6 stacks. We have chosen these particular protocol stacks since we implemented both within previous projects [5]. Secure user authentication is accomplished by means of encryption: a client node uses the public key of the authentication server (which can be loosely configured at the client to avoid the need for a key distribution service) to encrypt the secret data of the authentication request. The RSA [6] public-key encryption algorithm is used to securely propagate the secret key to the authentication server. For the encryption of the authentication response, we use symmetric encryption based on the shared session key, accomplished via the triple-DES [7] encryption algorithm.

The enhanced MIPv6 stack also carries out the packet marking: it adds the most recent access token to every outgoing packet. To prevent MAC address spoofing and replay attacks, the access token is encrypted using the shared session key along with a packet checksum. The 32-bit checksum is taken from dynamic protocol fields of the IPv6 header. In order to simplify the filtering in the access router, we place the 64-bit access credentials (access token, checksum) in front of the actual IPv6 header.

Authentication Server. A user-level application manages the user accounts and performs the actual authentication of the users on the client nodes. The user authentication protocol is a lightweight request/response protocol: the client node sends a request message (using UDP) to the authentication server, including the encrypted user password, its IPv6 address and its MAC address.
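The request/response exchange above can be illustrated with a minimal message encoding. The paper does not specify the wire format, so JSON is used here purely for illustration; in the real system the password field is encrypted with the server's public key before being sent over UDP, and all field names below are assumptions.

```python
import json

def make_auth_request(encrypted_password: str, ipv6_addr: str, mac_addr: str) -> bytes:
    """Client -> server: authentication request carrying the encrypted
    password plus the client's IPv6 and MAC addresses."""
    return json.dumps({"type": "auth-request",
                       "password": encrypted_password,
                       "ipv6": ipv6_addr,
                       "mac": mac_addr}).encode()

def make_acl_update(access_token: int, ipv6_addr: str, mac_addr: str) -> bytes:
    """Server -> client/access routers: access list update containing a
    freshly issued access token for the (MAC, IPv6) pair."""
    return json.dumps({"type": "acl-update",
                       "token": access_token,
                       "ipv6": ipv6_addr,
                       "mac": mac_addr}).encode()

def parse_message(raw: bytes) -> dict:
    return json.loads(raw.decode())
```

Either message could then be handed to `socket.sendto` on a UDP socket; only the encoding is shown here.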
Upon successful authentication, the server replies with an access list update (which includes a new access token).

Access Router. The access router is based on the LARA++ active router architecture [8], which supports dynamic programmatic extension of router functionality. The access control service is implemented as a collection of active components: the control component is responsible for handling ACL updates from the server and for configuring the packet filter component accordingly, while the filter component deals with the actual access control. It registers to receive all packets in order to enforce the access rules.
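The control/filter component split can be sketched as follows. Class and method names are our own; the real LARA++ component API differs, so this is only a structural illustration.

```python
class PacketFilterComponent:
    """Registers for all packets and enforces the access rules."""
    def __init__(self):
        self.rules = {}                      # MAC -> (token, ip)

    def on_packet(self, mac, token, ip, payload):
        if self.rules.get(mac) == (token, ip):
            return payload                   # authorised: forward
        return None                          # no matching rule: drop

class ControlComponent:
    """Handles ACL updates from the authentication server and
    configures the packet filter component accordingly."""
    def __init__(self, pkt_filter):
        self.pkt_filter = pkt_filter

    def on_acl_update(self, triples):
        # triples: (MAC, token, IP) entries pushed by the server
        for mac, token, ip in triples:
            self.pkt_filter.rules[mac] = (token, ip)
```

The design keeps policy (control) and enforcement (filtering) in separate components, so either can be replaced or extended on the active router without touching the other.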
Gateway. The gateway connecting the wireless network covering the city centre with the campus backbone is also based on LARA++. Several active components are responsible for securing the access network from malicious external nodes. The router, for example, tries to detect denial-of-service attacks by external nodes based on packet analysis.
5 Related Work

Due to the length restriction of this short paper, we merely mention directly related approaches here.

• Stanford's web-based access control solution, called SPINACH [9], is a simple approach that needs no specialised hardware or client software, but at the cost of inferior security (i.e., it is vulnerable to MAC address spoofing).
• Microsoft's CHOICE [10] is based on a proprietary Protocol for Authentication and Negotiation of Services (PANS). The architecture involves separate PANS Authoriser and PANS Verifier gateways in addition to a global authentication server. The authoriser gateway enables restricted access to the authenticator only. Upon successful authorisation of a client node, the verifier gateway grants full access to the network; it filters packets based on the PANS tags. A custom client module is used to attach the tag to the data packets.
• Carnegie Mellon's NetBar system [11] is based on a remotely configurable VLAN switch that is used to enable/disable individual ports depending on the authorisation of the connected machine.
• UC Berkeley has proposed a design [12] based on an enhanced DHCP client, an intelligent hub that supports enabling/disabling of individual ports, and a DHCP server that has been extended to remotely configure the hub based on the client's authorisation.
6 Conclusion

This paper has introduced a novel access control architecture for a publicly available wireless network. Our architecture distinguishes itself from other approaches through the following features:

• Soft-state based authorisation in conjunction with the periodic authentication protocol offers a high level of security, as secret credentials (i.e. access tokens, secret keys) are re-issued at a configurable period.
• Support for roaming users in a mobile environment is inherent to the architecture. A short reprieve time allows previously authorised client nodes to roam smoothly between networks without adding extra latency.
• Configurable levels of security allow network administrators and/or end users to decide whether no encryption, encryption of the access credentials only (prevents MAC address spoofing), or full encryption of the data (provides privacy) should be used on the wireless link.
The design of our access control implementation is unusual in that we integrate the client-side functionality (i.e. authentication protocol, packet tagging) into our Mobile IPv6 stack (in order to improve the performance of cell handoffs) and build the access router functionality on top of an active router (for reasons of flexibility and extensibility).
References

1. Davies, J., et al.: Caches in the Air: Disseminating Information in the Guide System. In: Proceedings of the IEEE Workshop on Mobile Computing Systems and Applications (WMCSA '99), New Orleans (1999)
2. Cheverst, K., et al.: Developing a Context-aware Electronic Tourist Guide: Some Issues and Experiences. In: Proceedings of CHI '00, Netherlands, pp. 17-24 (2000)
3. Guide II: Services for Citizens. Research Project, Lancaster University, EPSRC Grant GR/M82394 (2000)
4. Mobile IPv6 Research Collaboration with Cisco, Microsoft and Orange. Lancaster University, available at "http://www.mobileipv6.net/testbed" (2001)
5. The LandMARC Project. Research Collaboration with Microsoft Research, Cambridge, available at "http://www.LandMARC.net" (1999)
6. Kaliski, B., Staddon, J.: PKCS #1: RSA Cryptography Specifications. RFC 2437 (1998)
7. Schneier, B.: Applied Cryptography, Second Edition. John Wiley & Sons, New York, NY (1995). ISBN 0-471-12845-7
8. Schmid, S., Finney, J., Scott, A., Shepherd, D.: Component-based Active Network Architecture. In: Proceedings of the 6th IEEE Symposium on Computers and Communications (ISCC '01), Hammamet, Tunisia (2001)
9. Poger, E., Baker, M.: Secure Public Internet Access Handler (SPINACH). In: Proceedings of the USENIX Symposium on Internet Technologies and Systems (1997)
10. Miu, A., Bahl, P.: Dynamic Host Configuration for Managing Mobility between Public and Private Networks. In: Proceedings of the 3rd USENIX Internet Technical Symposium, San Francisco, California (2001)
11. Napjus, E. A.: NetBar – Carnegie Mellon's Solution to Authenticated Access for Mobile Machines. White Paper, available at "http://www.net.cmu.edu/docs/arch/netbar.html"
12. Wasley, D. L.: Authenticating Aperiodic Connections to the Campus Network. Available at "http://www.ucop.edu/irc/wp/wpReports/wpr005/wpr005_Wasley.html" (1996)
Design and Implementation of a QoS-Aware Replication Mechanism for a Distributed Multimedia System

Giwon On1, Jens B. Schmitt1 and Ralf Steinmetz1,2

1 Industrial Process and System Communications, Dept. of Electrical Eng. & Information Technology, Darmstadt University of Technology, Merckstr. 25, D-64283 Darmstadt, Germany
2 GMD IPSI, German National Research Center for Information Technology, Dolivostr. 15, D-64293 Darmstadt, Germany
{Giwon.On, Jens.Schmitt, Ralf.Steinmetz}@KOM.tu-darmstadt.de
Abstract. This paper presents the design and implementation architecture of a replication mechanism for the distributed multimedia system medianode, which is currently being developed as an infrastructure to share multimedia-enhanced teaching materials among lecture groups. The proposed replication mechanism takes into account the quality of service (QoS) characteristics of multimedia data and the availability of system resources. Each type of data handled and replicated is classified according to its QoS characteristics and replication requirements. The main contributions of this paper are the identification of new replication requirements in distributed multimedia systems and a multicast-based update propagation mechanism by which not only update events are signalled, but also the updated data themselves are exchanged between replication managers. By prototyping the proposed replication mechanism in medianode, we demonstrate the feasibility of combining the QoS concept with replication mechanisms.
1 Introduction

One major problem with using multimedia material in lecturing is the trade-off between the actuality of the content and the quality of the presentations. A frequent need for content refreshment exists, but high-quality presentations cannot be authored by the individual teacher alone at the required rate. Thus, it is desirable that teachers who present the same or at least similar topics but work at different locations can easily share their multimedia-enhanced lecture materials. The medianode project [1] is intended to provide such means of sharing to lecturers at the universities of the German state of Hessen. The design of the medianode system addresses issues of availability, quality of service, access control and distribution of data. To support teachers, it must allow for transparent access to shared content, and it must be able to operate in disconnected mode, since lecturers do not have access to the network at all times during their presentations.

The medianode system architecture is intended for de-centralized operation of a widely distributed system. Within this distributed system, each participating host is called a medianode and is conceptually equal to all other participating nodes, i.e. a medianode is not considered a client or a server. Client or server tasks are taken on by medianodes in the system depending on their resources and software modules.

Replication is the maintenance of on-line copies of data and other resources [2, 5]. Replication of presentation materials and meta-data is an important key to providing high availability, fault tolerance and quality of service (QoS) in distributed multimedia systems [21], and in particular in medianode. For example, when a user requires access (read/write) to a presentation which comprises audio/video data and some resources that are not available in the local medianode at this point in time, a local replication manager copies the required data from their original location and puts them into either a medianode located nearby or the local medianode, without requiring any user interaction (user transparency). This function enhances the total performance of medianode by reducing the response delay that is often caused by insufficient system resources, such as memory, CPU time and network bandwidth, at a given service time. Furthermore, because of the replica available in the local medianode, the assurance that users can continue their presentation in a situation of network disconnection is significantly higher than without a replica.

The structure of the paper is as follows. In Section 2, we give an overview of related work; the merits and limitations of existing replication mechanisms are discussed and a comparison of our approach with previous work is given. Section 3 presents our replication system model: we define the scope of our replication mechanism in medianode, present the characteristics of presentational media types, for which we identify a need for new replica units and granularity, and describe the proposed replica maintenance concept, e.g. how and when replicas are created and how updates are signalled and transported. In Section 4, we present our prototype implementation architecture and describe the operation flows in medianode with and without the replication system. We conclude the paper with a summary of our work and an outlook on possible future extensions of our replication mechanism in Section 5.

D. Shepherd et al. (Eds.): IDMS 2001, LNCS 2158, pp. 38-49, 2001. © Springer-Verlag Berlin Heidelberg 2001
2 Related Work

Several approaches to replication have already been proposed. The approaches for distributed file systems differ from those for Internet-based distributed web servers and those for transaction-based distributed database systems. Well-known replication systems for distributed file systems are AFS [6], Coda [7], Rumor [13], Roam [14] and Ficus [16], which keep the file service semantics of Unix and therefore make it easy to develop applications on top of them. They are based either on a client-server model or a peer-to-peer model. Often they use optimistic replication, which can hide the effects of network latencies. Their replication granularity is mostly the file system volume, with a large size and low number of replicas.

There is some work on optimizing these systems with respect to the update protocol and the replica unit. To keep delays small and thereby maintain real-time interaction, it is desirable to use an unreliable transport protocol such as UDP. In earlier phases, many approaches used unicast-based data exchange, by which the replication managers communicated with each other one-to-one; this caused large delays and prevented real-time interaction. To overcome this problem, multicast-based communication has been used recently [9, 11, 12, 17]. For Coda, the RPC2 protocol is used for multicast-based update exchange, which, via its Side Effect Descriptor, provides transmission of large files. For limiting the amount of storage used by a particular replica, Rumor and Roam developed the selective replication scheme [18]: a particular user who only needs a few of the files in a volume can control which files to store in his local replica. A disadvantage of selective replication is the 'full backstoring' mechanism: if a particular replica stores a particular file in a volume, all directories in the path of that file in the replicated volume must also be stored.

JetFile [9] is a prototype distributed file system which uses multicast communication and optimistic strategies for synchronization and distribution. The main merit of JetFile is its multicast-based callback mechanism, by which the components of JetFile, such as the file manager and the versioning manager, interact to exchange update information. Using multicast-based callbacks, JetFile distributes the centralized update information, which is normally kept by the server, over a number of multicast routers. However, the multicast callbacks in JetFile are not guaranteed to actually reach all replication peers, and the centralized versioning server, which is responsible for the serialization of all updates, can lead to an overloaded system state. Furthermore, none of the existing replication systems supports the quality of service (QoS) characteristics of the (file) data which they handle and replicate.
3 Replication System Model

3.1 Design Goals and Scope of Our Replication System

By analysing the service requirements of distributed multimedia systems, taking medianode as an example, we identified a number of issues that the design of our replication system needs to address:

• High availability: The replication system in medianode should enable data/service access in both connected and disconnected operation modes. Users can keep multiple copies of their files on different medianodes that are distributed geographically across several universities in the state of Hessen.
• Consistency: Concurrent updates and system failures can lead to replicas no longer being consistent, i.e. a stale state. The replication system should offer mechanisms for both resolving conflicts and keeping consistency between multiple replicas and their updates.
• Location and access transparency: Users do not need to know where presentation resources are physically located and how these resources are accessed.
• Cost-efficient update transport: Due to the limitation of system and network resources, the replication system should use a multicast-based transport mechanism for exchanging updates to reduce resource utilization.
• QoS support: The specific characteristics of presentational data, especially of multimedia data, should be supported by the proposed replication mechanism.
In medianode, we mainly focus on the replication service for accessing data 'inter-medianode', i.e. between medianodes, by providing replica maintenance in each medianode. Consequently, a replication manager can be implemented as one or a set of medianode's bow instances in each medianode. The replication managers communicate among each other to exchange update information across all medianodes.

3.2 Different Types of Presentation Data

Data organization comprises the storage of content data as well as meta-information about this content data in a structured way. The typical data types which can be identified in medianode are described in [3]. Table 1 shows an overview of these data types with their characteristics.

Table 1: Data categories and their characteristics in medianode

target data              | availability req. | consistency req. | persistency | update frequency | data size    | QoS playback
-------------------------|-------------------|------------------|-------------|------------------|--------------|-------------
presentation description | high              | middle (high)    | yes         | low              | small/middle | not required
organizational data      | high              | high             | yes         | low              | small        | not required
file/data description    | high              | middle           | yes         | middle           | small        | not required
multimedia resources     | high              | middle           | yes         | middle           | large        | required
system resources         | middle (low)      | middle           | no          | high             | small        | not required
user session/token       | high              | high             | no          | high             | small        | not required
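For concreteness, Table 1 can be rendered as a small data structure, together with the rule (motivated below in Section 3.3) that data with a high availability requirement is what gets replicated. The field names and the `replication_candidates` helper are our own, purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class DataCategory:
    name: str
    availability: str      # availability requirement
    consistency: str       # consistency requirement
    persistent: bool
    update_freq: str
    size: str
    qos_playback: bool     # QoS playback required?

# The rows of Table 1:
CATEGORIES = [
    DataCategory("presentation description", "high", "middle (high)", True, "low", "small/middle", False),
    DataCategory("organizational data", "high", "high", True, "low", "small", False),
    DataCategory("file/data description", "high", "middle", True, "middle", "small", False),
    DataCategory("multimedia resources", "high", "middle", True, "middle", "large", True),
    DataCategory("system resources", "middle (low)", "middle", False, "high", "small", False),
    DataCategory("user session/token", "high", "high", False, "high", "small", False),
]

def replication_candidates(categories):
    """Data with a high availability requirement should be replicated."""
    return [c.name for c in categories if c.availability.startswith("high")]
```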
3.3 Classification of Target Replicas

As argued in Subsection 3.1, the main goal of replication is to provide availability of medianode's services and to decrease the response time for accesses to data located on other medianodes. To meet this goal, data which is characterized by a high availability requirement (see Table 1) should be replicated among the running medianodes. We classify different types of target replicas according to their granularity (data size), their requirement for QoS support, their update frequency and whether their data type is 'persistent' or not ('volatile'). There are three classes of replicas in medianode:

• Truereplicas, which are persistent and of large size. Content files of any media type, which may also be parts of presentation files, are Truereplicas. Truereplicas are the only one of the three replica types to which end users have access for direct manipulation (updating). On the other hand, they are also the only replica type which requires the support of really high availability and QoS provision.
• Metareplicas (replicated metadata objects), which are persistent and of small size. An example would be a list of medianodes (sites) which currently contain an up-to-date copy of a certain file. This list itself is replicated to increase its availability and improve performance; a metareplica is a replica of this list.
• Softreplicas, which are non-persistent and of small size. This kind of replica can be used to reduce the number of messages exchanged between the local and remote medianodes, and thereby the total service response time. That is, if a local medianode knows about the available local system resources, the local replication manager can copy the desired data into the local storage bow, and a service requested by users which requires exactly this data can be processed with a shorter response time. Information about available system resources, user sessions and the validity of user tokens are replicas of this type.

All replicas which are created and maintained by our replication system are identical copies of the original media; replicas with errors (non-identical copies) are not allowed to be created. Furthermore, we do not support any replication service for function calls or elementary data types.

3.4 The Replication Mechanism

3.4.1 Replication Model

Basically, our replication system does not assume a client-server replication model, because there are no fixed clients and servers in the medianode architecture; every medianode may be client or server depending on its current operations. A peer-to-peer model with the following features is used for our replication system:

(a) Every replica manager keeps track of a local file table including replica information.
(b) Information on whether and how many replicas have been created is contained in every file table, i.e. each local replica manager keeps track of which remote replica managers (medianodes) are caching which replicas.
(c) Any access to the local replica for reading is allowed, and it is guaranteed that the locally cached replica is valid until notified otherwise.
(d) If any update happens, the corresponding replica manager sends a multicast-based update signal to the replica managers which hold a replica of the updated data and are therefore members of the multicast group.
(e) To prevent excessive usage of multicast addresses, the multicast IP addresses through which the replica managers communicate can be organized in small replica sub-groups. Examples of such sub-groups are file directories or a set of presentations about the same lecture topic.

3.4.2 Update Distribution and Transport Mechanism

The update distribution mechanism in medianode differs between the three replica types and their respective managers. This is due to the fact that the three replica types have different levels of requirements regarding high availability, update frequency and consistency (see Table 1). Experience from GLOVE [4] and [5] also shows that differentiating update distribution strategies makes sense for web and other distributed documents. The medianode replication system offers a unique interface to the individual update signalling and transport protocols, which are selectively and dynamically loaded and unloaded by the replica transport manager that is implemented as an instance of medianode's access bow. Possible update transport and signalling protocols are:

• The RPC protocol [2, 10] as a simple update distribution protocol. This mechanism is mainly used as the first step of our simple and fast implementation.
• A multicast-based RPC communication mechanism. In this case, the updates are propagated via multicast to the other replica managers which are members of the multicast group. RPC2 [7, 11] is a good candidate for the first implementation; RPC2 also offers the transmission of large files, such as updated AV content files or diff files, by using the Side Effect Descriptor. However, RPC2 with the Side Effect Descriptor does not guarantee reliable transport of updates.
• An LC-RTP based reliable multicast protocol: LC-RTP (Loss Collection Real-time Transport Protocol) [12] was originally developed as an extension of the RTP protocol to support reliable video streaming within the medianode project. We therefore adopt LC-RTP and check the usability of the protocol, depending on the degree of reliability required for the individual groups of replicas.

3.4.3 Approaches for Solving Conflicting Updates and for Resolving Conflicts

The possible conflicts that could appear during the shared use of presentational data and files are either (a) update conflicts, when two or more replicas of an existing file are concurrently updated, (b) naming conflicts, when two (or more) different files are concurrently given the same name, or (c) update/delete conflicts, which occur when one replica of a file is updated while another is deleted. In most existing replication systems, the problem of resolving update conflicts was treated as a minor one; it was argued that most files do not get any conflicting updates, because only one person tends to update them [9].
Depending on the replication model and policy used, there are different approaches, of which our replication system uses the following strategies [7, 8, 13, 14, 15]:

• Swapping – exchange the local peer's update with the other peers' updates;
• Dominating – ignore the updates of other peers and keep the local update as the final update;
• Merging – integrate two or more updates and build one new update table.
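The three strategies above can be sketched over per-file update tables. The table layout (file name mapped to a version/value pair) and the newest-version-wins tie-break in `merge` are our assumptions, not medianode's actual conflict-resolution code.

```python
def swap(local, remote):
    """Swapping: exchange the local peer's updates with the remote peer's."""
    return dict(remote), dict(local)

def dominate(local, remote):
    """Dominating: ignore remote updates and keep the local ones."""
    return dict(local)

def merge(local, remote):
    """Merging: integrate both tables into one new update table;
    on a per-file conflict, keep the entry with the higher version."""
    merged = dict(local)
    for name, (version, value) in remote.items():
        if name not in merged or version > merged[name][0]:
            merged[name] = (version, value)
    return merged
```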
4 Implementation

We have implemented a prototype of the proposed replication system model on the Linux platform (SuSE 7.0, Red Hat 6.2). Implemented are the media (file) and replica manager (ReplVerifierBow), the update transport manager (ReplTransportBow), the replica service APIs, which are Unix-like file operation functions such as open, create, read, write and close (ReplFileSysBow), and a session information storage bow (VolatileStorBow) which maintains the user's session and token information.

4.1 Bow Description

Table 2 gives a short description of the bows which implement medianode's basic functions and the services of our replication system.
Table 2: A summary of medianode's bows used for our replication system

MNBow
• addressable via a unique bow identifier and version number
• uses request and response queues and dispatcher threads
• defines the request processing routine

MNAccessBow
• child class of MNBow; implements the access bow API
• offers RPC server modules and enables RPC connections from the web server
• HTML-based presentation files are provided via this bow

ReplTransportBow
• child class of MNBow and a variant of the access bow
• implements the transport managers for the replication service
• offers transport protocol modules such as RPC, RPC2 and LC-RTP (LC-FTP)

(GUI)TelnetBow
• child class of MNBow and a variant of the access bow
• offers TCP server modules
• acts as a telnet server and provides information which medianode maintains, such as the list of bows loaded by the core, the memory usage of a running medianode process, etc.
• GUIBow implements a TelnetBow with a graphical user interface

MNVerifierBow
• child class of MNBow; implements the verifier bow API
• offers modules needed for user authentication

ReplVerifierBow
• child class of MNBow and a variant of the verifier bow
• implements the media (file) and replica managers for the replication service
• maintains the three replica tables

MNStorageBow
• child class of MNBow; implements the storage bow API

FileBow
• child class of MNStorageBow
• implements functions for local file operations
• no interface routine support for the replication service

XMLBow
• child class of MNStorageBow
• offers modules for dynamic generation of presentation files
• no interface routine support for the replication service

ReplFileSysBow
• child class of MNStorageBow
• implements functions for local file operations
• implements interface routines for the replication service

VolatileBow
• child class of MNStorageBow
• implements functions for maintaining volatile data such as memory usage information
• implements interface routines for the replication service
Design and Implementation of a QoS-Aware Replication Mechanism
4.2 Presentation Service without Replication Support

The interaction model for medianode's bows is based on a request-response communication mechanism. A bow which needs to access data or services creates a request packet and sends it to the core. Depending on the request type, the core either processes the request packet directly or forwards it to the responsible bow. The processing results are sent back to the originating bow in a response packet. The request and response packets contain all the information necessary for communication between bows as well as for processing the requests.
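The request-response interaction described above can be sketched as follows. This is an illustrative model only: the class names (Packet, Core, EchoBow) and the packet fields are our own inventions, not medianode's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical packet layout; the real medianode packets carry more fields.
@dataclass
class Packet:
    req_type: str          # e.g. "AUTH", "PRESENT", "FILE_OP"
    origin: str            # identifier of the bow that issued the request
    payload: dict = field(default_factory=dict)

class Core:
    """Routes request packets to the bow registered for each request type
    and returns the response to the originating bow."""
    def __init__(self):
        self.bows = {}                          # req_type -> handler bow

    def register(self, req_type, bow):
        self.bows[req_type] = bow

    def send_request(self, packet):
        handler = self.bows[packet.req_type]    # forward to responsible bow
        return handler.process_request(packet)  # response goes to packet.origin

class EchoBow:
    """Stand-in for a verifier/storage bow: answers every request."""
    def process_request(self, packet):
        return Packet("RESPONSE", origin=packet.origin, payload={"ok": True})

core = Core()
core.register("AUTH", EchoBow())
resp = core.send_request(Packet("AUTH", origin="MNAccessBow",
                                payload={"user": "alice"}))
```

In the real system the core uses request/response queues and dispatcher threads; the direct call here merely shows the routing logic.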
Figure 1: medianode: internal service flow without replication support
Based on this request-response mechanism, we experimented with presentation scenarios with and without the replication service. Figure 1 shows an example of a presentation service without the replication system. Upon receiving a presentation request from a user via the web server, MNAccessBow first creates a request packet to check the user's authentication (steps 1~4) and sends it via the core (steps 5~7) to MNVerifierBow, which puts the authentication result into a response packet and sends it back to the originating bow, MNAccessBow (steps 8~11). If authentication succeeds, MNAccessBow creates a request packet to obtain the required presentation data and sends it via the core to the corresponding storage bow (steps 12~15). Either MNFileBow or MNXMLBow, depending on the requested media type, checks whether the data exists locally and creates a response packet containing either a file handle or an error message, which it sends to MNAccessBow (steps 16~18). MNAccessBow then sends the web server a response which is either an authentication failure message or the presentation file.

4.3 Presentation Service with Replication Support

4.3.1 Initialization of the MediaList and Replica Tables

In this subsection, we describe medianode's operation flow with the replication service. The replication service begins by creating the media list and the replica tables for the three replica types in each medianode. As shown in Figure 2, ReplFileSysBow sends a request packet via the core to ReplVerifierBow to create a media list for the media data located in the local medianode's file system (steps 1~2). Upon receiving the request packet, ReplVerifierBow creates the media list, which is later used to check the local availability of any required media data (step 3). ReplVerifierBow then builds the local replica tables for the two replica types Truereplicas and Metareplicas if replica information already exists.
A medianode configuration file can specify the default location where replica information is stored. Every replica table contains a list of replicas with information about the organization, replica volume identifier, unique file name, file state, version number, number of replicas, the list of replicas, a multicast IP address, and some additional file attributes such as access rights, creation/modification time, size, owner, and file type. The third replica table, for the Softreplicas to which local system resources, user sessions, and token information belong, need only be created in memory, and its contents are partly filled as users request certain services. Once the replica tables are created, they are stored in the local file system and are persistently accessible.

4.3.2 Maintaining Replica Tables

In medianode, these three replica tables are maintained locally by the local replication manager. Hence, no update-related messages need to be exchanged for files of which no replica has been created. This approach improves the utilization of system resources, especially network resources, by decreasing the number of messages exchanged between the replication managers of the distributed medianodes. When a medianode obtains a replica from the local replica tables, however, the desired replica elements are copied to the target medianode, and the replication manager at the target medianode keeps these replica elements in a separate replica table which is used only for the management of remote replicas, i.e. replicas whose original files are stored in a remote medianode.

4.3.3 Acquiring a Replica from Remote Replication Managers

Upon receiving a service request (data access request) from a user, the local medianode attempts to access the required data in a local storage bow (ReplFileSysBow) (steps 4~5).
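As an illustration, the replica table entries described in Section 4.3.1 could be modelled as below. All class and field names are ours, derived only from the attribute list in the text, not from the medianode sources.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReplicaEntry:
    # Fields follow the attribute list of Section 4.3.1 (illustrative names).
    organization: str
    volume_id: str                # replica volume identifier
    file_name: str                # unique file name
    file_state: str               # e.g. "valid", "stale"
    version: int
    replica_hosts: List[str] = field(default_factory=list)
    multicast_ip: str = ""
    attributes: dict = field(default_factory=dict)  # rights, times, size, owner, type

    @property
    def num_replicas(self):
        return len(self.replica_hosts)

class ReplicaTable:
    """One table per replica type (Truereplicas, Metareplicas, Softreplicas)."""
    def __init__(self):
        self.entries = {}         # (volume_id, file_name) -> ReplicaEntry

    def add(self, entry):
        self.entries[(entry.volume_id, entry.file_name)] = entry

    def lookup(self, volume_id, file_name):
        return self.entries.get((volume_id, file_name))

table = ReplicaTable()
table.add(ReplicaEntry("httc", "vol1", "talk.mpg", "valid", 2,
                       replica_hosts=["mn-a", "mn-b"]))
entry = table.lookup("vol1", "talk.mpg")
```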
Figure 2: medianode: internal service flow with replication support
When the data is not available locally, the local ReplFileSysBow sends a request packet to ReplVerifierBow to obtain a replica of the data. ReplVerifierBow then starts the replica acquisition process by creating a corresponding request packet, which is passed to ReplTransportBow (steps 6~8). ReplTransportBow multicasts a data search request to all peer replication managers and waits for them to respond (step 9). The list of medianodes to which the multicast message is sent can be read from the medianode's configuration file. Whether ReplTransportBow waits for all responses or accepts the first one depends on the optimization policy, which is given as a configuration flag. After receiving the target replica, ReplTransportBow sends a response packet to ReplVerifierBow, which then updates the corresponding replica tables, i.e. it adds the new replica element to the Truereplicas table and its metadata to the Metareplicas table (steps 10~13). Finally, the local ReplFileSysBow, which originally issued the replica creation request, creates a response packet including the replica handle and sends it to MNAccessBow (steps 14~15).
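A minimal sketch of this acquisition step, with peers modelled as plain dictionaries instead of multicast endpoints (the function and variable names are ours, and the "highest version wins" rule is one plausible interpretation of the wait-for-all optimization policy):

```python
def acquire_replica(file_name, peers, wait_for_all=False):
    """Ask peer replication managers for a replica of file_name.

    wait_for_all=False returns the first positive answer (lower latency);
    wait_for_all=True collects every answer and picks the highest version,
    mimicking the configuration-flag choice described in the text.
    """
    answers = []
    for peer in peers:
        data = peer.get(file_name)      # None if this peer has no copy
        if data is None:
            continue
        if not wait_for_all:
            return data                 # first responder wins
        answers.append(data)
    if not answers:
        raise FileNotFoundError(file_name)
    return max(answers, key=lambda d: d["version"])

peers = [
    {},                                 # peer without a copy
    {"talk.mpg": {"version": 1}},
    {"talk.mpg": {"version": 3}},
]
first = acquire_replica("talk.mpg", peers)                     # version 1
best = acquire_replica("talk.mpg", peers, wait_for_all=True)   # version 3
```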
5 Summary and Future Work

In this paper, we presented the replication mechanism of our distributed media system medianode and described the design and implementation architecture of the prototyped replication system. The main contributions of this paper are (1) the identification of new replication requirements for distributed multimedia systems, and (2) a replication mechanism for distributed multimedia systems which supports the individual QoS characteristics of multimedia data and uses system resource usage information. To achieve these goals, we first studied the characteristics of the presentational media types handled in medianode and identified replica units and granularity; these have not been considered or supported in existing replication mechanisms. Based on the new requirements and the results of our feature survey, we then built a QoS-aware replication mechanism in which the decision whether and when a replica should be created from an original file is made by checking the QoS characteristics of the requested media and the current usage of available system resources. The next steps are to design further replication services, such as predictive replication, to increase access availability and reduce latency. Similar approaches are hoarding [7, 19] and prefetched caching with resource reservation in advance [20].
References

1. The medianode project. Web site: http://www.httc.de/medianode.
2. G. Coulouris, J. Dollimore, and T. Kindberg. Distributed Systems, 3rd Ed., Chapters 1, 8, 14, and 15, Addison-Wesley, 2001.
3. G. On, M. Zink, M. Liepert, C. Griwodz, J. Schmitt, and R. Steinmetz. Replication for a Distributed Multimedia System. In Proc. of ICPADS 2001, pp. 37-42, 2001.
4. G. Pierre, I. Kuz, M. van Steen, and A.S. Tanenbaum. Differentiated Strategies for Replicating Web Documents. In Proc. of 5th International Workshop on Web Caching and Content Delivery, Lisbon, May 2000.
5. P. Triantafillou and D.J. Taylor. Multiclass Replicated Data Management: Exploiting Replication to Improve Efficiency. In IEEE Trans. on Parallel and Distributed Systems, 5(2):121-138, Feb. 1994.
6. R. Campbell. Managing Andrew File System (AFS), Prentice Hall PTR, 1998.
7. M. Satyanarayanan, J.J. Kistler, P. Kumar, M.E. Okasaki, E.H. Siegel, and D.C. Steere. Coda: A Highly Available File System for a Distributed Workstation Environment. In IEEE Transactions on Computers, 39(4), April 1990.
8. J. Yin, L. Alvisi, and C. Lin. Volume Leases for Consistency in Large-Scale Systems. In IEEE Trans. on Knowledge and Data Engineering, 11(4), 1999.
9. B. Groenvall, A. Westerlund, and S. Pink. The Design of a Multicast-based Distributed File System. In Proc. of the Third Symposium on Operating Systems Design and Implementation (OSDI '99), New Orleans, Louisiana, pp. 251-264, February 1999.
10. K.P. Birman and B.B. Glade. Reliability Through Consistency. In IEEE Software, pp. 29-41, May 1995.
11. M. Satyanarayanan and E.H. Siegel. Parallel Communication in a Large Distributed Environment. In IEEE Trans. on Computers, 39(3):328-348, March 1990.
12. M. Zink, A. Jones, C. Griwodz, and R. Steinmetz. LC-RTP (Loss Collection RTP): Reliability for Video Caching in the Internet. In Proc. of ICPADS 2001 Workshops, pp. 281-286, IEEE, July 2000.
13. R. Guy, P. Reiher, D. Ratner, M. Gunter, W. Ma, and G. Popek. Rumor: Mobile Data Access Through Optimistic Peer-to-Peer Replication. In Workshop on Mobile Data Access, November 1998.
14. D. Ratner, P. Reiher, and G. Popek. Roam: A Scalable Replication System for Mobile Computing. In Workshop on Mobile Databases and Distributed Systems (MDDS), September 1999.
15. P. Triantafillou and C. Neilson. Achieving Strong Consistency in a Distributed File System. In IEEE Transactions on Software Engineering, 23(1):35-55, January 1997.
16. T.W. Page, Jr., R.G. Guy, G.J. Popek, and J.S. Heidemann. Architecture of the Ficus Scalable Replicated File System. Technical Report CSD-910005, UCLA, USA, March 1991.
17. M. Mauve and V. Hilt. An Application Developer's Perspective on Reliable Multicast for Distributed Interactive Media. In Computer Communication Review, 30(3):28-38, July 2000.
18. D.H. Ratner. Selective Replication: Fine-grain Control of Replicated Files. Master's thesis, UCLA, USA, 1995.
19. G.H. Kuenning. Seer: Predictive File Hoarding for Disconnected Mobile Operation. PhD dissertation, UCLA-CSD-970015, UCLA, USA, 1997.
20. C. Griwodz. Wide-Area True Video-on-Demand by a Decentralized Cache-based Distribution Infrastructure. PhD dissertation, TU Darmstadt, April 2000.
21. J. Chung-I and M.A. Sirbu. Distributed Network Storage with Quality-of-Service Guarantees. Web site: http://www.ini.cmu.edu/~sirbu/pubs/99251/chuang.h
Distribution of Video-on-Demand in Residential Networks

Juan Segarra and Vicent Cholvi
Departament de Llenguatges i Sistemes Informàtics
Universitat Jaume I, CP 12071 Castelló (Spain)
{jflor,vcholvi}@uji.es
Abstract. In this paper, we study how to distribute cache sizes in a tree-structured server system for transmitting video streams in a video-on-demand (VoD) fashion. We use off-line smoothing for videos, and our request rates are distributed according to a 24-hour audience curve. For this purpose we have designed a slotted-time bandwidth reservation algorithm, which we use to run our simulations. Our system measures quality of service (QoS) in terms of starting delay; once a transmission has started, the system guarantees that it will proceed without any delay or quality loss. We tested it for a wide range of users (from 800 to 240 000) and for different numbers of available videos. We show that a tree-structured system with uniform cache sizes performs better than the equivalent system with a proxy-like configuration. We also study the delay distribution and bandwidth usage of our system in a representative case.
1 Introduction
Home network connections are becoming ever faster, with more and more bandwidth. This enables multimedia applications over these connections, from on-line radio stations, which are widely available nowadays, to heavier applications such as video playing.

Previous Studies. Much work has already been done on the delivery of video-on-demand (VoD). Perhaps the simplest technique consists of using a centralized single server intended to deliver all video requests to users. However, it is known that this approach does not scale well as the number of users and videos grows. In order to solve this problem, Vassilakis et al. [13] use a hierarchical network of servers so that the most popular videos are located as close to the users as possible. They found that such an approach is preferable to using a centralized server. Dan and Sitaram presented a framework for dynamic caching [5]. Basically, this does not require the whole video to be stored on the same server, but rather
This study is partially supported by the CICYT under grant TEL99-0582 and Bancaixa under grant P1-1B2000-12.
D. Shepherd et al. (Eds.): IDMS 2001, LNCS 2158, pp. 50-61, 2001. © Springer-Verlag Berlin Heidelberg 2001
only parts of the video. They try to retain in a server the parts brought in by one request for reuse by a closely following request; those parts are subsequently discarded. However, this approach does not guarantee real video-on-demand delivery, only good overall system performance. New video compression formats such as MPEG-2 [8] also allow the storage and transmission of high-quality video streams at relatively low capacity. One major problem with compressed streams is their increased bit-rate variability. Transmitting real-time VBR flows is no trivial matter, because the amount of bandwidth needed varies during the transmission. Naive approaches to this problem make a pessimistic evaluation, reserving enough bandwidth for the stream to be transmitted as if it were always in its worst case. Obviously, this is quite inefficient. One method to improve these transmissions is to smooth [7, 12] the streams before transmission. This is done by transmitting frames into the client playback buffer in advance of each burst, so that the peak rate and the rate variability are minimized. Rexford and Towsley show how to compute an optimal transmission schedule for a sequence of nodes by characterizing how the transmission rate varies as a function of the playback delay and the buffer allocation at the server and client nodes [9]. Also based on smoothing, several techniques have been proposed which develop a transmission plan [12] consisting of time periods such that, in each period, the transmission may be performed using a constant-bit-rate (CBR) network service. In [14], the authors present a technique that makes use of this approach. They also develop a video delivery technique via intelligent utilization of the link bandwidth and the storage space available at the proxy servers (i.e., servers attached to the gateway router that connects the final users).
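To illustrate the shape of such a transmission plan, the sketch below splits a VBR frame sequence into fixed periods and sends each period at its average rate. Real smoothing algorithms [7, 12] choose the period boundaries and rates so as to minimize the peak rate subject to the client playback buffer; this toy version does not attempt that, it only shows how a plan of ⟨time, rate⟩ tuples arises.

```python
def cbr_plan(frame_sizes, fps, period_s):
    """Build a naive piecewise-CBR plan: one (duration, rate) tuple per
    fixed-length period, with rate = bytes in period / period duration."""
    frames_per_period = int(fps * period_s)
    plan = []
    for start in range(0, len(frame_sizes), frames_per_period):
        chunk = frame_sizes[start:start + frames_per_period]
        duration = len(chunk) / fps
        plan.append((duration, sum(chunk) / duration))   # (time, CBR rate)
    return plan

# 4 seconds of 25 fps video (bytes per frame) with a bursty second inside
frames = [1000] * 25 + [4000] * 25 + [1000] * 50
plan = cbr_plan(frames, fps=25, period_s=1.0)
# plan: [(1.0, 25000.0), (1.0, 100000.0), (1.0, 25000.0), (1.0, 25000.0)]
```

A real smoother would flatten the 100 000 B/s burst by pre-sending part of it during the quiet first second, at the cost of client buffer space.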
Our Study. In our study, we follow an approach similar to [14]. First, we consider that, for each video, we have a transmission plan. Additionally, we split the video into parts, each corresponding to a time period of the transmission plan. We also allow each part to be placed at different video servers located along the network. Unlike [14], we do not restrict these servers to being attached to the gateway router that connects the final users (i.e., they do not have to be proxies). Given the above scenario, this paper focuses on the analysis of both how to place video parts at the different video servers and how to distribute resources (storage space) throughout the whole system. We also analyze the system behavior when varying both the number of videos and the number of users. Finally, we examine the behavior of the system throughout the day. The rest of the paper is organized as follows. In Section 2 we describe the scenario we use. In Section 3 we present our video-on-demand delivery system, and in Section 4 we show our results. Finally, our conclusions are presented in Section 5.
2 Scenario
For our study, we use a hierarchical network architecture for the distribution network. This architecture is used nowadays in residential networks, so it meets our needs. Our tree structure is a binary tree with four depth levels. At the top of the tree there is a central server storing all videos, which range in number from 100 to 10 000. Furthermore, at each node of the tree there is a cache server, which acts as a client to the upper cache servers. These caches must be understood as static caches: their content is specified off-line and does not change over time. The cache servers at the leaves each serve the same number of users, ranging from 50 to 15 000 (so the total number of users ranges from 800 to 240 000). Links between nodes have a bandwidth of 1.5 Gbps downstream. A moderate bandwidth is also needed upstream in order to send user requests to the central server. Note that final users share a bandwidth of 1.5 Gbps because they share the link to their nearest node. Fig. 1 shows our tree structure.
Fig. 1. Tree structure of our system.
In our approach to the video transmission problem we study the performance over a 24-hour period. Previous studies on video demand rates [1, 6] have determined the behavior of these rates over such a period, so our simulations use a rate distribution according to those studies. We also consider that a user makes a request only if he is neither waiting for nor receiving a video (i.e., nobody can receive more than one transmission at a time). Fig. 2 shows the request rate distribution obtained in a simulation with one request per day per user.
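A request generator driven by a diurnal curve can be sketched as follows. The curve itself is our own crude evening-peak profile for illustration, not the audience data of [1, 6]; only the overall shape (per-user, per-minute probability varying over 24 hours) matches the setup described above.

```python
import random

def diurnal_rate(minute_of_day):
    """Illustrative per-user request probability per minute: between
    0.0004 and 0.002 requests/min, peaking at 21:00 (cf. Fig. 2)."""
    hour = minute_of_day / 60.0
    base, peak = 0.0004, 0.002
    bump = max(0.0, 1.0 - abs(hour - 21.0) / 6.0)   # triangular evening bump
    return base + (peak - base) * bump

def simulate_day(num_users, seed=1):
    """Count requests per minute over 24 hours; each user independently
    issues a request with the current per-minute probability."""
    rng = random.Random(seed)
    requests = []
    for minute in range(24 * 60):
        p = diurnal_rate(minute)
        requests.append(sum(1 for _ in range(num_users) if rng.random() < p))
    return requests

reqs = simulate_day(800)   # smallest user population of the paper
```

With 800 users this averages roughly one request per user per day, matching the simulation setting in the text.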
Fig. 2. Request rate distribution in a 24 hour period
3 Placement of Video Streams and the Reservation Algorithm
In this section we analyze how to distribute videos throughout the whole system. Since bandwidth usually becomes the bottleneck in stream transmissions, the placement policy is a major issue in the system we propose for the delivery of video on demand. In order to avoid the rate variability problem described in the introduction, we use smoothed versions of the video streams [12]. Thus, each video is characterized by a collection of tuples ⟨time, rate⟩, whose first parameter denotes a time period and whose second is the CBR rate associated with that period. Our distribution policy is applied off-line and acts on the different video parts corresponding to the tuples of each video. Basically, at start-up it first assigns a weight value to each video. This weight value is based on the video's popularity, which represents its probability of being requested according to Zipf's law [3]. This distribution has been found to statistically fit video program popularities estimated from video store statistics [4]. Once this assignment is done, the policy distributes the video parts in the tree, sorted by weight value: the higher the weight value, the nearer to the users they are placed. When video parts are stored in places other than the main server, they are replicated in all other caches at the same level. Thus, the video parts with the highest weight values are placed in all the caches next to the users. The proposed video stream placement policy requires a reservation algorithm. Such an algorithm manages video requests, guaranteeing
that, once a video is accepted (maybe after some start-up delay), it will be delivered without interruption to the final user(s). Our reservation algorithm is built around a bandwidth reservation table which records the bandwidth that will be used by each link in each time slot, the unit of time for bandwidth reservation (1 minute in our case). Initially, all values in this table are 0, since no bandwidth has been reserved. The table changes over time, as rows of time slots are added and removed when videos are requested and served. When a video request is received, we first map out the route which each video part will follow. Then we test, for each video part, whether it is already planned for transmission at the same time over the same links. If so, the request is added to the multicast transmission of that video part without any additional bandwidth requirements. Otherwise, we check whether there is enough bandwidth in each link of the route during the slots of the transmission; in that case, bandwidth is reserved by adding the new video part's rate to the bandwidth recorded in the table. In order to use a single video stream to serve several requests arriving close together, we intentionally delay playback by 2 minutes (this amount of time is called the batch time). Fig. 3 shows this algorithm.
Step #1: A user requests a video.
Step #2: If that video has already been requested and has not started yet, the start time of the new request is the same as that of the previous one. Otherwise, the new request will start after a batch time.
Step #3: The system calculates the transmission path of each video part and the exact time when it has to be transmitted.
Step #4: For each video part, and for each link in its transmission path: if there is a reservation for this video part in this link at the same time, add the request to the multicast transmission; otherwise, reserve the bandwidth needed for this video part in this link during the required time slots.
Step #5: Accept the video for transmission, or go to Step #3 in the next time slot (in which case all reservations and multicasts performed for this request in Step #4 are canceled).
Fig. 3. Algorithm for the acceptance of video requests.
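The reservation table and the admission test of Step #4 can be sketched as below. The link names and the single shared capacity are illustrative; the slot granularity corresponds to the 1-minute time slot described above, and the multicast bookkeeping models the "join an existing transmission" case.

```python
from collections import defaultdict

class ReservationTable:
    """Slotted-time bandwidth reservation (sketch of Fig. 3, Step #4)."""
    def __init__(self, capacity_bps):
        self.capacity = capacity_bps
        self.used = defaultdict(float)   # (link, slot) -> reserved rate
        self.multicast = set()           # (link, slot, part_id) already planned

    def try_reserve(self, part_id, route, slots, rate):
        """Reserve `rate` on every link of `route` in every slot, or join
        an existing multicast transmission of the same video part."""
        needed = [(link, slot) for link in route for slot in slots
                  if (link, slot, part_id) not in self.multicast]
        # Admission test: residual bandwidth on all still-needed (link, slot)s.
        if any(self.used[k] + rate > self.capacity for k in needed):
            return False                 # caller retries in the next slot (Step #5)
        for link, slot in needed:
            self.used[(link, slot)] += rate
            self.multicast.add((link, slot, part_id))
        return True

table = ReservationTable(capacity_bps=1.5e9)     # 1.5 Gbps links (Section 2)
ok1 = table.try_reserve("v1.part0", ["root-left"], slots=[0, 1], rate=6e6)
# A second request for the same part at the same time joins the multicast
# and consumes no additional bandwidth:
ok2 = table.try_reserve("v1.part0", ["root-left"], slots=[0, 1], rate=6e6)
```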
It must be pointed out that, in general, it may be necessary to synchronize the different video parts that make up a video during playback, because they may be placed at different servers throughout the system. Whereas this can
be done in a number of ways using protocols such as RTP [10] and RTSP [11], in this paper we will not go into the details of how to carry this out.
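The off-line placement policy of this section can be sketched as a greedy fill: assign Zipf weights, then place video parts most-popular-first starting at the cache level nearest the users, replicating across a level (modelled here as one capacity budget per level). Capacities, part sizes and level numbering are illustrative, not the paper's exact parameters.

```python
def zipf_weights(num_videos, alpha=1.0):
    """Normalized Zipf popularity: weight of rank r is proportional to 1/r^alpha."""
    raw = [1.0 / (rank ** alpha) for rank in range(1, num_videos + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def place(parts, level_capacities):
    """parts: (part_id, size, weight) tuples; level_capacities: per-cache
    capacity at each level, leaf level first. Returns part_id -> level,
    where len(level_capacities) means 'main server only'."""
    placement = {}
    free = list(level_capacities)
    for part_id, size, weight in sorted(parts, key=lambda p: -p[2]):
        for level, cap in enumerate(free):
            if size <= cap:              # nearest level with room wins
                free[level] -= size
                placement[part_id] = level
                break
        else:
            placement[part_id] = len(free)   # falls back to the main server
    return placement

weights = zipf_weights(4)
parts = [(f"v{i}", 1.5, weights[i]) for i in range(4)]   # 1.5 Gb per part
layout = place(parts, level_capacities=[3.0, 3.0, 3.0])
# layout: {'v0': 0, 'v1': 0, 'v2': 1, 'v3': 1}
```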
4 Results
In this section we present the results generated by our simulations. The performance of the system is measured by the starting delay (or serving time), that is, the time between a video request and the beginning of its transmission. Due to space limitations, we show only the mean value of this quantity (which follows a pattern similar to both the maximum and the 90th-percentile starting delays).

4.1 Storage Distribution in the Cache Servers
In these simulations we test the system performance against the number of users for four different cache sizes. All simulations use a request rate of 1 video/day per user. We have analyzed two possible configurations: uniform cache sizes throughout the system, and a proxy-like configuration. Fig. 4 shows how these configurations behave compared with a centralized system (i.e., a system without caches where all videos are stored in the main server).
Fig. 4. System performance without storage capacity in caches
Performance with Uniform Cache Sizes. Our first approach to the problem of distributing cache sizes consists of using the same size in all caches. Figures 4, 5, 6 and 7 show the system performance with 0, 100, 250 and 500 Gb cache sizes respectively. They show that, when the number of users is low, the system works without any congestion, which explains why its performance is then independent of the number of videos. However, when we increase the number of users, the system becomes unable to serve their requests without congestion. In any case, it can readily be seen that the bigger the cache size, the better the serving times offered by the system. Looking more closely, the change in serving times between Fig. 4 and Fig. 5 is the largest one. This indicates that having even a minimum cache size in each node greatly improves serving times. Thus, adding cache size always increases the system performance, but the improvement becomes smaller once the system already has a large cache size.

Fig. 5. System performance with 100 Gb cache size

Fig. 6. System performance with 250 Gb cache size

Fig. 7. System performance with 500 Gb cache size

Performance with a Proxy-like Configuration. In this section, we compare the approach followed in the previous sections with one using proxy servers (i.e., cache servers attached to the gateway router that connects the final users). In order to be fair, we compare equivalent systems (i.e., systems with the same global capacity). For instance, given a system with uniform cache sizes of 100 Gb in 14 cache servers, the proxy-like equivalent system would have 175 Gb in each of its 8 proxy servers. Thus, Fig. 8 is equivalent to Fig. 5, Fig. 9 to Fig. 6, and Fig. 10 to Fig. 7. As can be seen, in all cases the performance of proxy-like systems is worse than that of systems with uniform caches: starting delays with uniform caches are about 40% lower than with proxy-like caches. The main reason is that, without intermediate caches, a requested video which is not in the first cache level has to be transmitted from the main server, which congests the upper links faster. Moreover, because of the tree structure, adding storage to the leaf caches is always more expensive than adding the same amount to intermediate caches, so the added size does not compensate for the lack of intermediate caches.

4.2 System Behavior Through the Day
As previously pointed out (see Fig. 2), the request rate varies through the day. Consequently, it seems reasonable to expect a different system behavior over a 24-hour period. In this section, we analyze the system behavior through the day. For this task, we take a system with 100 Gb of storage capacity in each cache, 1000 different videos, and 10 000 users in each final branch (i.e., 160 000 users in total). In order to start this simulation with a non-empty system, we simulated a period of 48 hours and used the results from the last 24 hours. Fig. 11 shows the starting delay distribution of this system. It follows the same pattern as Fig. 2, which is what we expected (the local peaks are due to the multicast effect in transmission plans at high request rates). The first part of the figure corresponds to the end of the maximum audience peak of the previous day. Another interesting point is that, when the request rate is low, the starting delay is always below the batch time (2 minutes). This demonstrates that some requests are added to previous plans and that the system is not congested during this period.
Fig. 8. Starting delay in a proxy-like system equivalent to 100 Gb cache sizes

Fig. 9. Starting delay in a proxy-like system equivalent to 250 Gb cache sizes

Fig. 10. Starting delay in a proxy-like system equivalent to 500 Gb cache sizes

Fig. 11. Serving time distribution in a 24 hour period
Fig. 12 shows the percentage of bandwidth used in the links of each level. Level 4 denotes the links between the main server and the next two caches (2 links of 1.5 Gbps each), while level 1 denotes the links shared by each user branch (16 links of 1.5 Gbps each). It can be seen that the links are used below 100% during the low request rate period; for that reason the minimum delays in Fig. 11 are all around the batch time. The figure also shows that the links at level 4 constitute the bottleneck of the system, because level 4 is the first to reach a usage of 100%. This indicates that, although the intermediate caches serve many requests, 100 Gb per cache is not enough to avoid this bottleneck. This is quite logical: if each video is about 1.5 Gb in size, the 3 cache levels along a path (300 Gb) can only store 200 of the 1000 videos, so the other 800 videos must be transmitted from the main server. Note also that bandwidth usage in the lower levels may grow even when the upper ones have reached 100% usage; this indicates that the lower caches are serving the videos they have stored locally.
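The capacity figures quoted in Section 4 follow from simple arithmetic and can be checked directly (the 14-cache and 8-proxy counts come from the binary tree of Section 2: cache levels of 2, 4 and 8 nodes below the main server):

```python
video_size_gb = 1.5          # approximate size of one video
cache_gb = 100               # uniform per-cache capacity of Section 4.2

# Uniform system: 14 cache servers across the three cache levels.
uniform_caches = 2 + 4 + 8
total_cache_gb = uniform_caches * cache_gb        # 1400 Gb overall

# Proxy-like equivalent: the same global capacity in the 8 leaf proxies.
proxy_gb = total_cache_gb / 8                     # 175 Gb per proxy

# Videos storable along one root-to-leaf path of 3 cache levels.
videos_cached = int(3 * cache_gb / video_size_gb) # 200 of the 1000 videos
```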
5 Conclusions
In this paper we have studied the benefits of using caches in a tree-structured system for video-on-demand. We test two ways of distributing the cache size, and we compare our results with the centralized system as a reference. Our simulations demonstrate that a system with 100 Gb in each cache performs about 5 times better than one without caches, and about 10 times better with 500 Gb in each cache. We also demonstrate that a uniform distribution offers results about 40% better than a proxy-like configuration. However, there are many other ways of distributing this size, and the performance variation between configurations can be significant.

Some other issues need further research. As demonstrated above, the links at the main server are the bottleneck, so studies on adjusting the system characteristics to avoid this situation are needed. Furthermore, it will be necessary to add the possibility of changing the video locations on-line, thereby adding more videos or modifying their popularity values without shutting down the system. Finally, in this paper we have used the popularity value for linearly distributing the videos in the caches, but other methods should be tested as well.

Juan Segarra and Vicent Cholvi

Fig. 12. Bandwidth usage at each level of the tree. (Plot of bandwidth usage in percent vs. time of day in hours, one curve per tree level, levels 1 to 4.)
A QoS Negotiation Scheme for Efficient Failure Recovery in Multi-resolution Video Servers

Minseok Song, Heonshik Shin, and Naehyuck Chang

School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea
[email protected], {shinhs,naehyuck}@snu.ac.kr
Abstract. In this paper, we present a Quality of Service (QoS) negotiation scheme for efficient failure recovery in multi-resolution video servers with disk arrays. This scheme exploits the multi-resolution property of video streams by negotiating service resolutions in order to provide graceful QoS degradation when a disk fails. Using the proposed scheme, we can not only greatly increase the number of admitted clients when all disks are operational but also utilize server resources efficiently. Furthermore, the scheme can provide each client with acceptable QoS even in the presence of a disk failure while maximizing server-perceived rewards. The effectiveness of the proposed algorithms is evaluated through simulation-based experiments.
1 Introduction

One of the crucial requirements for a Video-on-Demand (VOD) service is the ability to support heterogeneous clients with different Quality of Service (QoS) parameters, such as display size, resolution level, and frame rate. As described in [3], [5], the most viable way to satisfy the various QoS requirements of clients is to use multi-resolution video streams, whose video sequences are encoded such that subsets of a full-resolution video bit stream can be decoded to extract lower-resolution streams.

Video data is inherently voluminous and requires high bandwidth. Therefore, most VOD servers employ a Redundant Array of Inexpensive Disks (RAID). A fundamental problem, though, is that large disk arrays are highly vulnerable to disk failures [8]. Since delivery and playback of video streams must be performed in real time, the VOD server must guarantee the playback rate of accepted video streams even in the presence of disk failures. To guarantee continuous delivery of video streams in the degraded mode (i.e., when a disk has failed), current approaches for fault-tolerant VOD servers tend to reserve contingency bandwidth or buffers [1], [7], [8], [11]. Thus, they admit too few streams, thereby under-utilizing the server resources in the normal mode (i.e., when all disks are operational). Furthermore, they do not consider the buffer and disk bandwidth constraints simultaneously, although both are important for admission control.

In this paper, we propose a QoS negotiation scheme for efficient failure recovery in multi-resolution video servers. The proposed scheme exploits the multi-resolution property of video streams and uses resources such as disk bandwidth and buffer efficiently. It

D. Shepherd et al. (Eds.): IDMS 2001, LNCS 2158, pp. 62-73, 2001. © Springer-Verlag Berlin Heidelberg 2001
defines a spectrum of resolution levels and permits acceptable performance when a disk fails by using a smaller amount of resources; thus, it can improve resource utilization in the normal mode. It also considers the priority of clients and thereby provides graceful degradation in the degraded mode.

The rest of this paper is organized as follows. In Section 2, we present system models to aid further description of our fault tolerance schemes. Next, we propose an admission control algorithm in Section 3 and resolution adjustment algorithms in Section 4. We validate the proposed schemes through simulation in Section 5 and conclude in Section 6.
2 Multi-resolution Video System Model

2.1 Multi-resolution Video Stream Model

An efficient method for extracting multi-resolution video streams is to use sub-band coding schemes [9]. Sub-band coding provides high scalability as well as compression ratios comparable to MPEG-2 compression. Further discussion of the video server will be based on this scheme.

Multi-resolution video streams can be modeled as follows [9]. We refer to the amount of data retrieved during one round as a video segment. Assume that a given video V has g segments, V = {S_0, ..., S_i, ..., S_{g-1}}, and that each segment S_i has up to Q display resolutions of data, S_i = {S_i^0, ..., S_i^j, ..., S_i^{Q-1}}, where S_i^j represents the j-th enhancement (0 < j ≤ Q-1) of the segment S_i. We refer to S_i^0 as the base resolution of the segment S_i. Displaying a video stream with resolution k requires reading the base resolution as well as all the data from the first enhancement to the k-th enhancement. For example, to view resolution k of S_i, we need to retrieve {S_i^0, S_i^1, ..., S_i^k}. Let each S_i^j consist of b_j bits; the total number of bits for resolution k is then Σ_{j=0}^{k} b_j. We define the Unit Request Size for resolution k, NB_k, to be

NB_k = (Σ_{j=0}^{k} b_j) / (min_{m ∈ [0,Q-1]} b_m).

2.2 Data Placement for Multi-resolution Video Streams

Two data placement schemes for multi-resolution video streams have been proposed in the literature: Chang et al. [3] propose a periodic interleaving placement scheme, and Paek et al. [12] define a balanced placement scheme. We assume that the disk array consists of D homogeneous disks. The periodic interleaving scheme accesses only one disk during a round for a segment, whereas in the balanced placement a segment is divided into D equal amounts of data placed across the D disks.
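The unit request size NB_k can be computed directly from the per-layer sizes b_j. The following sketch uses, purely as illustrative b_j values, the per-layer bit rates of Table 2; the function name is ours, not the paper's.

```python
# Unit request size NB_k from Section 2.1: cumulative layer sizes up to
# resolution k, normalised by the smallest layer. The b_j values below are
# illustrative, borrowed from the per-layer bit rates of Table 2.
b = [190, 63, 63, 64, 126, 127, 127, 127, 126, 127, 190]

def unit_request_size(k, layers=b):
    """NB_k = (sum of b_0..b_k) / (min over all layers b_m)."""
    return sum(layers[:k + 1]) / min(layers)

print(unit_request_size(0))   # base layer only: 190/63
print(unit_request_size(10))  # full resolution: 1330/63
```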
We employ the periodic interleaving scheme for our data placement policy because it achieves high disk throughput and low system delay time [5].
For data availability, the parity encoding technique of RAID 5 is also used. Since the stripe unit size equals the data size of the full-resolution video stream requested during a round, any partial-resolution stream as well as the full-resolution stream can be retrieved contiguously from the surviving disks in the degraded mode. Figure 1 shows our data placement for video streams with three display resolutions. The parity blocks are computed on a segment basis, e.g., P1 = S0 ⊕ S1 ⊕ S2 ⊕ S3.

Disk 1 | Disk 2 | Disk 3 | Disk 4 | Disk 5
S0     | S1     | S2     | S3     | P1
S4     | S5     | S6     | P2     | S7
S8     | S9     | P3     | S10    | S11
S12    | P4     | S13    | S14    | S15
(each cell is one stripe unit)

Fig. 1. RAID for periodic interleaving data placement

Let us consider the variation in the load on each disk in the periodic interleaving scheme. Since each access unit is completely stored on a single disk, each client retrieves data from a single disk during a round. Examining the load on the disks for any two consecutive rounds, we find that, due to the round-robin placement of access units, the clients retrieving data from the same disk will all move to the neighboring disk during the next round. To express this load-shifting property conveniently, we refer to the clients accessing the same disk during a round as a client group. We can therefore partition the clients into D client groups (say CG_1, CG_2, ..., CG_D).

2.3 QoS Model

The feasibility of multi-resolution presentation offers an opportunity for graceful degradation when there are not enough resources to support all video streams. QoS degradation, however, should be enforced judiciously, because some applications cannot accept video streams below certain resolution levels [2]. Furthermore, VOD servers should consider the priority requirements of the clients [6]. To express the priority of a client efficiently, we apply the notion of reward [6] to our QoS model. That is, every client is assigned several rewards indicating its value to the system (e.g., monetary value) when the video stream is serviced with the
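The stripe layout of Fig. 1 can be generated mechanically: segments fill each stripe left to right, and the parity block rotates one disk to the left on every stripe. This is a sketch under that reading of the figure; `layout` is an illustrative name, not the paper's code.

```python
# Generate the stripe layout of Fig. 1: each stripe holds D-1 data
# segments filled left to right plus one parity block, and the parity
# position rotates one disk to the left on every stripe (RAID 5).
def layout(num_segments, D):
    rows, seg, stripe = [], 0, 0
    while seg < num_segments:
        parity_pos = (D - 1 - stripe) % D  # parity rotates leftwards
        row = []
        for pos in range(D):
            if pos == parity_pos:
                row.append(f"P{stripe + 1}")
            else:
                row.append(f"S{seg}")
                seg += 1
        rows.append(row)
        stripe += 1
    return rows

for row in layout(16, 5):
    print(row)
```

Running it reproduces the four stripes of Fig. 1, parity moving from disk 5 to disk 2.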
assigned QoS level successfully. To describe the QoS requirements of clients explicitly, we assume that every client is characterized by the following information:

- A maximum resolution level is assigned to every client, and the video server retrieves a video stream with the maximum resolution level in the normal mode.
- A minimum resolution level is assigned to every client, and the video server can degrade the resolution to the minimum resolution level in the degraded mode; resolutions below the minimum resolution level are not allowed.
- A reward value is assigned to every resolution from the minimum resolution level to the maximum resolution level.

Let C_i^m denote the m-th client in the client group CG_i (i = 1, ..., D). Let max_i^m and min_i^m be the maximum and minimum resolution levels for the client C_i^m, respectively, where 0 ≤ min_i^m ≤ max_i^m ≤ Q - 1. Each client C_i^m has resolution levels min_i^m, ..., max_i^m with rewards RW_i^m(min_i^m), ..., RW_i^m(max_i^m), respectively, where 0 ≤ RW_i^m(min_i^m) ≤ ... ≤ RW_i^m(max_i^m). We limit the difference between the maximum and minimum resolution levels, and refer to it as diff. That is, for all C_i^m, if max_i^m ≥ diff, then min_i^m = max_i^m - diff; otherwise, min_i^m = 0.
3 Admission Control Algorithm

The proposed scheme guarantees video service above the minimum resolution levels in the degraded mode and at the maximum resolution level in the normal mode for all clients, while maximizing the sum of rewards in the degraded mode. To satisfy these objectives, we first describe an admission control algorithm in this section, and reward maximization algorithms in the next.

3.1 Admission Control Condition

We consider two constraints, disk bandwidth and buffer, for admission control.

Disk Bandwidth Constraint. Round-based scheduling is used for data retrieval: time is divided into equal-sized intervals, called rounds, and each admitted client is served once every round. If we ignore the degraded mode, the service time, which represents the total read and seek time requested by the clients in one round, must not exceed the round duration R. However, contingency disk bandwidth for reconstruction must be reserved to prepare for the degraded mode. Since the additional disk load equals the load of the failed disk, approximately twice the disk load is required of the surviving disks in the degraded mode [8]. Therefore, the service time for the clients in each client group must not exceed (1/2)R.

Let N_i be the number of clients to be serviced from the client group CG_i. Let us define ST_i^min and ST_i^max to be the service time for the client group CG_i when all the clients in the group retrieve the video streams at the minimum and the maximum resolution level, respectively. Let us assume that the server disk is characterized as shown in Table 1. We use the typical seek time model described in [4], in which every
Table 1. Notations and parameter values for a disk

Transfer rate                        r_t  80 Mbps
Typical disk seek and rotation time  T_s  14 ms
client requires a seek overhead (seek time + rotational delay) T_s for one read. The following inequality on ST_i^min must hold for every client group to guarantee that all the clients can retrieve their streams at the minimum resolution level even in the degraded mode:

ST_i^min = Σ_{m=1}^{N_i} (T_s + (Σ_{j=0}^{min_i^m} b_j) / r_t) ≤ (1/2) R    (1)
To service the maximum resolution levels in the normal mode, the following relationship must hold for every client group:

ST_i^max = Σ_{m=1}^{N_i} (T_s + (Σ_{j=0}^{max_i^m} b_j) / r_t) ≤ R    (2)
Buffer Constraint. To schedule the video streams in any order during a service round, we use a dual-buffer mechanism [5]. Let B be the total buffer size, and CG_mh be the most heavily loaded client group. In the degraded mode, the buffer requirement is greatest when the clients in CG_mh retrieve data from the failed disk. Therefore, the following inequality must hold to service streams at the minimum resolution levels even in the degraded mode:

2 (Σ_{i | 1≤i≤D, i≠mh} Σ_{m=1}^{N_i} Σ_{j=0}^{min_i^m} b_j) + (D - 1) (Σ_{m=1}^{N_mh} Σ_{j=0}^{min_mh^m} b_j) ≤ B    (3)

In the normal mode, there should be sufficient buffer to service the maximum resolution levels. Therefore, the following inequality must hold:

2 (Σ_{i=1}^{D} Σ_{m=1}^{N_i} Σ_{j=0}^{max_i^m} b_j) ≤ B    (4)
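Inequalities (1)-(4) translate directly into an admission test. The sketch below uses assumed data structures (each client is a (min level, max level) pair), and it takes the most heavily loaded group to be the one with the largest maximum-level data volume, which the paper does not spell out; none of these names come from the paper.

```python
# Admission test implementing Inequalities (1)-(4). Each client is a
# (min_level, max_level) pair; `groups` is a list of client groups; b
# holds the per-layer sizes in bits, Ts the seek overhead, rt the
# transfer rate, R the round length, D the disk count, B the buffer size.
def service_time(clients, level_of, b, Ts, rt):
    # seek overhead plus transfer time, summed over the group's clients
    return sum(Ts + sum(b[:level_of(c) + 1]) / rt for c in clients)

def admissible(groups, b, Ts, rt, R, D, B):
    bits = lambda g, level_of: sum(sum(b[:level_of(c) + 1]) for c in g)
    lo = lambda c: c[0]  # minimum resolution level
    hi = lambda c: c[1]  # maximum resolution level

    for g in groups:                                  # disk bandwidth
        if service_time(g, lo, b, Ts, rt) > R / 2:    # Inequality (1)
            return False
        if service_time(g, hi, b, Ts, rt) > R:        # Inequality (2)
            return False

    # most heavily loaded group, taken here as the one with the largest
    # maximum-level data volume (our assumption; the paper does not say)
    mh = max(range(len(groups)), key=lambda i: bits(groups[i], hi))
    others = [g for i, g in enumerate(groups) if i != mh]

    if (2 * sum(bits(g, lo) for g in others)
            + (D - 1) * bits(groups[mh], lo) > B):    # Inequality (3)
        return False
    if 2 * sum(bits(g, hi) for g in groups) > B:      # Inequality (4)
        return False
    return True

print(admissible([[(0, 2)], [(0, 1)]], b=[190, 63, 63],
                 Ts=0.014, rt=80_000, R=2.0, D=5, B=10_000))  # True
```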
3.2 Admission Control Decision

When a client requests a video stream, the server assigns the client to the client group accessing the disk on which the first segment requested by the client is stored. If admitting the client would violate the admission control conditions described in Inequalities (1), (2), (3), and (4), the server delays admitting the client, taking into account the client's acceptable initial delay. That is, if a suitable group can be found within the client's acceptable initial delay time, the new client is admitted and becomes a member of that client group; otherwise, it is rejected. After admission, the client remains a member of the client group until it closes the video stream.
4 Resolution Adjustment Algorithms

The selected resolutions for the admitted clients may be changed in the following three situations:

- when a disk fails;
- when admitting a new client in the degraded mode;
- when a client closes a video stream.

4.1 When a Disk Fails

We define FR_i^m to be the selected resolution for the client C_i^m in the degraded mode, and ST_i^FR to be the service time for the client group CG_i when all the clients in the group retrieve the video streams at resolution FR_i^m. The four inequalities below then state the conditions imposed on FR_i^m:

ST_i^max ≤ (1/2) R    (5)

ST_i^FR = Σ_{m=1}^{N_i} (T_s + (Σ_{j=0}^{FR_i^m} b_j) / r_t) ≤ (1/2) R    (6)

2 (Σ_{i | 1≤i≤D, i≠mh} Σ_{m=1}^{N_i} Σ_{j=0}^{max_i^m} b_j) + (D - 1) (Σ_{m=1}^{N_mh} Σ_{j=0}^{max_mh^m} b_j) ≤ B    (7)

2 (Σ_{i | 1≤i≤D, i≠mh} Σ_{m=1}^{N_i} Σ_{j=0}^{FR_i^m} b_j) + (D - 1) (Σ_{m=1}^{N_mh} Σ_{j=0}^{FR_mh^m} b_j) ≤ B    (8)
The degradation of resolutions is caused by a shortage of either buffer or disk bandwidth. When a disk fails, the server checks, for every client group, whether Inequality (5) is satisfied. Let Sd be the set of client groups that do not satisfy Inequality (5). The server degrades the resolution for the clients in each client group CG_i ∈ Sd on a per-client-group basis. The server also checks the buffer condition described in Inequality (7). If Inequality (7) is not satisfied, the server degrades the resolution by examining all the clients admitted by the server. The following cases of resolution degradation can occur:

1. Case 1: Sd ≠ ∅, and Inequality (7) is satisfied.
2. Case 2: Sd = ∅, and Inequality (7) is not satisfied.
3. Case 3: Sd ≠ ∅, and Inequality (7) is not satisfied.

We now define Resolution Degradation Problems (RDP) as follows.
Definition 1 (Resolution Degradation Problems, RDP).

1. In Case 1, the resolution degradation problem is to find FR_i^m for every client C_i^m in each client group CG_i ∈ Sd that maximizes Σ_{m=1}^{N_i} RW_i^m(FR_i^m) while satisfying Inequality (6).
2. In Case 2, the resolution degradation problem is to find FR_i^m for every client C_i^m that maximizes Σ_{i=1}^{D} Σ_{m=1}^{N_i} RW_i^m(FR_i^m) while satisfying Inequality (8).

In Case 3, the server first determines FR_i^m for every client C_i^m in each client group CG_i ∈ Sd that maximizes Σ_{m=1}^{N_i} RW_i^m(FR_i^m) while satisfying Inequality (6). Then, if Inequality (8) is still not satisfied, the server further degrades the resolutions by examining all the clients admitted by the server.

In the degraded mode, every client C_i^m should receive a stream at one resolution between min_i^m and max_i^m while maximizing the sum of rewards. RDP therefore reduces to a multiple-choice knapsack problem, which is known to be NP-complete [10]. For this reason, we present heuristic algorithms for RDP. To describe them, let us define the Impact Factors IF_i^m(k) (k = min_i^m, ..., max_i^m - 1) for the client C_i^m as

IF_i^m(k) = (RW_i^m(max_i^m) - RW_i^m(k)) / (NB_{max_i^m} - NB_k).

The denominator of IF_i^m(k) represents the difference in unit request sizes, while the numerator represents the difference in rewards when the resolution is degraded from the maximum resolution level to level k. A client with a smaller impact factor should therefore be degraded first, because it returns more disk bandwidth to the server while minimizing the reward loss.

Let AI_i be the set of impact factors for all the clients in the client group CG_i (i = 1, ..., D). Sets of impact factors DI_i for the client group CG_i (i = 1, ..., D) are maintained for later resolution enhancement and are initialized to ∅. The heuristic algorithm for resolution degradation when a disk fails, called RDF, is as follows:

Algorithm RDF: Resolution Degradation algorithm when a disk Fails

1. Set of impact factors: TI_all, TI_i (i = 1, ..., D);
2. for all the clients C_i^m in each client group CG_i (i = 1, ..., D)
3.   FR_i^m ← max_i^m;
4. for each client group CG_i (i = 1, ..., D)
5.   TI_i ← sort(AI_i); (∗ in increasing order of impact factors ∗)
6. if Sd ≠ ∅ and Inequality (7) is satisfied (∗ Case 1 ∗)
7.   for each client group CG_k ∈ Sd
8.     repeat
9.       Find the least impact factor IF_k^L(v) ∈ TI_k;
10.      if (v < FR_k^L)
11.        FR_k^L ← v;
12.      TI_k ← TI_k - {IF_k^L(v)};
13.      DI_k ← DI_k ∪ {IF_k^L(v)};
14.    until (Inequality (6) is satisfied)
15. if Sd = ∅ and Inequality (7) is not satisfied (∗ Case 2 ∗)
16.   TI_all = TI_1 ∪ ... ∪ TI_D;
17.   TI_all ← sort(TI_all); (∗ in increasing order of impact factors ∗)
18.   repeat
19.     Find the least impact factor IF_k^L(v) ∈ TI_all;
20.     if (v < FR_k^L)
21.       FR_k^L ← v;
22.     TI_all ← TI_all - {IF_k^L(v)};
23.     DI_k ← DI_k ∪ {IF_k^L(v)};
24.   until (Inequality (8) is satisfied)
When a disk fails, FR_i^m is initialized to max_i^m for every client C_i^m. In addition, the impact factors in AI_i are sorted in ascending order and added to TI_i (i = 1, ..., D). After initialization (lines 2-5), RDF determines whether Case 1 or Case 2 occurs (lines 6 and 15). If Case 1 occurs, then for each client group CG_k ∈ Sd, RDF chooses the least impact factor IF_k^L(v) ∈ TI_k and removes it from TI_k. Then, if FR_k^L > v, it degrades the resolution to v. These steps are repeated until Inequality (6) is satisfied (lines 7-14). If Case 2 occurs, the degradation is performed by examining all the clients admitted by the server (lines 16-24).
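The Case-1 branch of RDF amounts to a greedy loop over a min-heap of impact factors. The sketch below uses assumed client records ('max', 'min', 'fr', 'rewards'), a caller-supplied service-time function, and a `unit_bits` function standing in for NB_k; all names are illustrative, not the paper's code.

```python
import heapq

# Greedy Case-1 branch of algorithm RDF: repeatedly degrade the client
# with the smallest impact factor until the group's service time fits
# half a round.
def impact_factor(client, k, unit_bits):
    rw = client['rewards']
    return ((rw[client['max']] - rw[k])
            / (unit_bits(client['max']) - unit_bits(k)))

def degrade_group(clients, service_time, half_round, unit_bits):
    # one heap entry per (client, candidate level); smallest factor first
    heap = [(impact_factor(c, k, unit_bits), i, k)
            for i, c in enumerate(clients)
            for k in range(c['min'], c['max'])]
    heapq.heapify(heap)
    degraded = []  # plays the role of the DI sets kept for later REA
    while service_time(clients) > half_round and heap:
        _, i, k = heapq.heappop(heap)
        if k < clients[i]['fr']:   # only degrade, never raise (line 10)
            clients[i]['fr'] = k
        degraded.append((i, k))
    return degraded
```

As in RDF, an entry is consumed even when it does not lower the client's current level, mirroring lines 10-13 of the pseudocode.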
4.2 When Admitting a New Client in the Degraded Mode

Just after admitting a new client in the degraded mode, the resolutions of the already admitted clients may be degraded to accommodate the new client. We assume that the new client is serviced from the client group CG_w. If CG_w ∈ Sd, the server finds FR_w^m for the clients in CG_w by using RDF. If Case 2 occurs, the server applies the method described in lines 16-24 of RDF.

4.3 When a Client Closes a Video Stream in the Degraded Mode

When a client closes a video stream in the degraded mode, the returned disk bandwidth or buffer is reclaimed to increase the resolutions of the remaining clients. Assuming that the outgoing client C_u^p belongs to the client group CG_u, the heuristic algorithm for resolution enhancement, called REA, is as follows:

Algorithm REA: Resolution Enhancement Algorithm

1. Set of impact factors: DI_all;
2. Remove the impact factors for the client C_u^p from DI_u and AI_u;
3. Remove the client C_u^p from the client group CG_u and recalculate ST_u^FR;
4. if CG_u ∈ Sd and Inequality (7) is satisfied (∗ degradation of the clients in CG_u was due to Case 1 ∗)
5.   while (ST_u^FR ≤ (1/2) R)
6.     Find the greatest impact factor IF_u^G(v) ∈ DI_u;
7.     if (v > FR_u^G)
8.       FR_u^G ← v;
9.     DI_u ← DI_u - {IF_u^G(v)};
10. if CG_u ∉ Sd and Inequality (7) is not satisfied (∗ degradation was due to Case 2 ∗)
11.   DI_all = DI_1 ∪ ... ∪ DI_D;
12.   DI_all ← sort(DI_all); (∗ in decreasing order of impact factors ∗)
13.   while (Inequality (8) is satisfied)
14.     Find the greatest impact factor IF_k^G(v) ∈ DI_all;
15.     if (v > FR_k^G)
16.       FR_k^G ← v;
17.     DI_all ← DI_all - {IF_k^G(v)};
18.     DI_k ← DI_k - {IF_k^G(v)};
REA first determines the reason for the resolution degradation (lines 4 and 10). If the degradation of the clients in the client group CG_u was due to Case 1, REA chooses the greatest impact factor IF_u^G(v) in DI_u and removes it from DI_u. Then, if v > FR_u^G, it increases the resolution to v. These steps are repeated while ST_u^FR ≤ (1/2)R (lines 5-9). If the degradation was due to Case 2, resolution enhancement is performed by examining all the clients admitted by the server (lines 13-18).
5 Experimental Results

To evaluate the effectiveness of our schemes, we have performed simulation-based experiments. The server, consisting of 32 disks, stores 1000 video clips, each 50 minutes long. Client requests are assumed to arrive according to a Poisson distribution with a mean arrival rate of 20 arrivals/min, and video clips are chosen at random. Each segment has 11 display resolutions, whose bit rates are shown in Table 2 [5], and the maximum resolution level of each client is chosen at random. The objectives of the simulation are as follows:

1. To measure the number of admitted clients according to the degree of resolution degradation.
2. To evaluate the effectiveness of the heuristic algorithms.
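The workload just described is straightforward to reproduce. This sketch only mirrors the stated parameters (the paper's simulator is not available); all names and the fixed seed are ours, for illustration.

```python
import random

# Reproduce the request workload of the experiments: Poisson arrivals at
# 20 requests/min over 10 hours; each request picks one of 1000 clips and
# a random maximum resolution level among the 11 of Table 2.
def workload(hours=10, rate_per_min=20, num_videos=1000,
             levels=11, seed=1):
    random.seed(seed)
    t, requests = 0.0, []
    horizon = hours * 60.0  # simulation horizon in minutes
    while True:
        t += random.expovariate(rate_per_min)  # exponential inter-arrivals
        if t >= horizon:
            break
        requests.append((t, random.randrange(num_videos),
                         random.randrange(levels)))
    return requests

reqs = workload()
print(len(reqs))  # about 12000 on average (10 h x 60 min x 20/min)
```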
Table 2. Scalable video data rates

Resolution index          0    1    2    3    4    5    6    7    8     9    10
Current bit rate (kb/s)  190   63   63   64  126  127  127  127  126   127  190
Total bit rate (kb/s)    190  253  316  380  506  633  760  887  1013  1140 1330
5.1 Number of Admitted Clients According to the Degree of Resolution Degradation

Figure 2 shows the number of clients admitted over 10 hours as a function of the buffer size and diff, when R is 2 seconds. The number of admitted clients increases with the diff value, because a greater diff provides more opportunity for resolution degradation. The number of admitted clients also increases with the buffer size.
Fig. 2. Effects of buffer size and diff. (Plot of number of admitted clients vs. diff, one curve per buffer size.)

5.2 Effectiveness of Heuristic Algorithms

An optimal solution for RDP can be found by exhaustively enumerating the feasible solutions and selecting the one with the highest reward sum. However, its computational complexity is O((diff + 1)^{N_i}) for the client group CG_i. Since the number of admitted clients in a client group is approximately 35, finding the optimal solution when diff > 1 is practically impossible. Thus, we only compare our heuristic algorithm with the optimal solution when diff = 1. A reward value of 40 is assigned to the maximum resolution level, and the rewards are assumed to decrease by random values between 1 and 5 for each resolution degradation. The metric used here is the sum of rewards for a client group. Figure 3 shows the sum of rewards according to the round length; the sums of rewards obtained by our method are very close to the optimal sums.

To evaluate our heuristic algorithm for diff > 1, let us consider the following algorithm, called RDWF. For simplicity, we ignore Case 2.

Algorithm RDWF: Resolution Degradation algorithm considering reWard diFference

1. for all the clients C_i^m in each client group CG_i (i = 1, ..., D)
2.   FR_i^m ← max_i^m;
3. for each client group CG_k ∈ Sd
4.   repeat
5.     Among the clients whose FR_k^m is greater than min_k^m, find a client C_k^p whose RW_k^p(max_k^p) - RW_k^p(FR_k^p - 1) is minimum;
6.     FR_k^p ← FR_k^p - 1;
7.   until (Inequality (6) is satisfied)

Figure 4 shows the sum of rewards according to the number of admitted clients in a client group when diff = 4. RDF performs better than RDWF because
Fig. 3. Comparison between the optimal solution and RDF when diff = 1. (Plot of sum of rewards vs. round length in seconds, for RDF and the optimal solution.)

the RDF reflects both rewards and bandwidth in the degradation, whereas RDWF considers only the difference in rewards.
6 Conclusions

In this paper, we have proposed a QoS negotiation scheme for multi-resolution video servers to provide efficient failure recovery. The proposed scheme defines a QoS model which includes both a range of resolution levels and their respective rewards to the system in order to provide QoS degradation in the degraded mode. It also estimates the additional disk load and the buffer size in the degraded mode at admission time, and thus it can guarantee acceptable performance even in the degraded mode. To maximize the server-perceived rewards, we have presented heuristic algorithms for resolution degradation and enhancement. We have demonstrated the effectiveness of the scheme through simulations. Our simulation results show that it increases the admission ratio as well as server-perceived rewards. We are currently working on the QoS negotiation scheme for VBR video streams in multi-resolution video servers.
References

[1] S. Berson, L. Golubchik, and R. Muntz. Fault tolerant design of multimedia servers. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 364-375, May 1995.
[2] R. Boutaba and A. Hafid. A generic platform for scalable access to multimedia-on-demand systems. IEEE Journal on Selected Areas in Communications, 17(9):1599-1613, September 1999.
Fig. 4. Comparison between RDF and RDWF for R = 2 seconds. (Plot of sum of rewards vs. number of admitted clients in a client group, for RDF and RDWF.)

[3] E. Chang and A. Zakhor. Scalable video data placement on parallel disk arrays. In Proceedings of the IS&T/SPIE International Symposium on Electronic Imaging: Science and Technology, pages 208-221, December 1994.
[4] E. Chang and A. Zakhor. Cost analyses for VBR video servers. IEEE Multimedia, 3(4):56-71, 1996.
[5] E. Chang and A. Zakhor. Disk-based storage for scalable video. IEEE Transactions on Circuits and Systems for Video Technology, 7(5):758-770, October 1997.
[6] I. Chen and C. Chen. Threshold-based admission control policies for multimedia servers. The Computer Journal, 39(9):757-766, 1996.
[7] A. Cohen and W. Burkhard. Segmented information dispersal (SID) for efficient reconstruction in fault-tolerant video servers. In Proceedings of the ACM International Multimedia Conference, pages 277-286, November 1996.
[8] L. Golubchik, J. Lui, and M. Papadopouli. A survey of approaches to fault tolerant VOD storage servers: Techniques, analysis, and comparison. Parallel Computing, 24(1):123-155, January 1998.
[9] K. Law, J. Lui, and L. Golubchik. Efficient support for interactive service in multi-resolution VOD systems. The VLDB Journal, 24(1):133-153, January 1999.
[10] S. Martello and P. Toth. Knapsack Problems: Algorithms and Computer Implementations. John Wiley & Sons, 1990.
[11] B. Ozden, R. Rastogi, P. Shenoy, and A. Silberschatz. Fault-tolerant architectures for continuous media servers. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 79-90, 1996.
[12] S. Paek, P. Bocheck, and S. Chang. Scalable MPEG-2 video servers with heterogeneous QoS on parallel disk arrays. In Proceedings of the International Workshop on Network and Operating System Support for Digital Audio and Video, pages 363-374, April 1995.
Tolerance of Highly Degraded Network Conditions for an H.323-Based VoIP Service

Peter Holmes, Lars Aarhus, Eirik Maus

Norwegian Computing Center, P.O. Box 114 Blindern, Oslo, Norway
{peter.holmes, lars.aarhus, eirik.maus}@nr.no
Abstract. The empirical work presented here concerns the effects of very large packet loss and delay conditions upon the performance of an H.323-based VoIP service. Specifically, it examines call establishment performance and audio voice quality under these conditions. The work was performed as part of an investigation concerning the effects of GEO satellite environments upon H.323based VoIP, but its findings are relevant to all kinds of networks which may suffer or become subject to highly aberrant transmission conditions. The call establishment tests showed that the H.323 protocols could establish calls successfully under severely impaired conditions. The listening tests showed that uncontrolled jitter was the most destructive parameter to listening quality. Packet loss was also influential, though somewhat less dramatically.
1 Introduction
The empirical work presented here concerns the effects of very large packet loss and delay conditions upon the performance of an H.323-based VoIP system. It was performed during Q2-Q4 2000 within the IMMSAT project (“Internet Protocol based MultiMedia Services over a Satellite Network”) [1] [2], a project carried out by Norsk Regnesentral [14] and Nera SatCom AS [16]. Two primary types of VoIP tests were carried out: call establishment and audio voice quality. The main objectives of prototype testing were:
• To verify that the H.323 protocols could be used to establish calls with an acceptable voice quality, in an environment that approximates the characteristics of a GEO (Geosynchronous Earth Orbit) satellite-based IP network.
• To find the parametric limits for delay, jitter, and packet loss beyond which the H.323 protocols could no longer provide an acceptable VoIP service; that is, where call establishment becomes severely impaired or fails, and where audio voice quality degrades such that usability of the service is severely or wholly impaired.
Regarding other related work, Bem et al. [3] provide a thorough introduction to the area and issues of broadband satellite multimedia systems, clarifying the basic physical and architectural distinctions amongst satellite systems, as well as some of the technical and legal issues to be addressed. Farserotu and Prasad [5] provide a briefer survey, and include concise descriptions of the basic issues and references to the latest work in areas such as enhancements to TCP/IP, enhanced QoS awareness,
D. Shepherd et al. (Eds.): IDMS 2001, LNCS 2158, pp. 74-85, 2001. Springer-Verlag Berlin Heidelberg 2001
IP security over SATCOM, onboard processing, switching and routing, and service enabling platforms. Metz [7] offers another concise introduction. The IMMSAT results can perhaps be best contrasted to recent work by Nguyen et al. [8]. That work studied the effects of link errors and link loading upon link performance (e.g., packet and frame loss, delay, etc.) for an H.323-based VoIP service. The IMMSAT study focused upon identifying the levels of link delay, jitter and packet loss which could be sustained and tolerated.
2 H.323 Call Establishment
H.323 calls [9] can be established according to a variety of call models. A call model basically defines the system entities to and from which signals are exchanged. Readers can find a comprehensive H.323 tutorial at [10]. This section focuses upon signaling procedures and timers relevant to H.323 call establishment.
2.1 Basic Signaling Procedures for H.323 Call Establishment
The provision of H.323-based communication proceeds in 5 phases:
• phase A: call setup (RAS and H.225 messages [13]); connection of endpoints; H.245 call control channel establishment [11]
• phase B: initial communication and capability exchange (H.245 procedures); master/slave determination
• phase C: establishment of audio visual communication (H.245 procedures); media stream address distribution; correlation of media streams in multipoint conferences and communication mode command procedures
• phases D and E: call services and call termination, respectively [9].
In IMMSAT, call establishment testing concerned itself specifically with the success or failure of phases A-C. That is, call establishment was judged successful when voice could be transmitted between the endpoints.
2.2 Signaling and Timing Details within Call Setup
The discussion in this section specifically focuses upon call setup (phase A) within a Direct Endpoint Call Signaling via Gateway call model. This was the call model used during prototype testing. It is important to note that in this call model, the Gateway is both a callee endpoint and a calling endpoint. The RAS and H.225 signal exchanges in phase A for that call model are illustrated in Figure 1.
Timers. System entities start and stop various timers as they send and receive different signals. These include the T303, T310 and T301 timers, as specified for use within call setup by H.225 and Q.931¹. Each calling endpoint starts a T303 timer at the moment it issues a Setup signal. By default [13], each calling endpoint should receive an Alerting, Call Proceeding, Connect, Release Complete (or other message) from its respective called endpoint within 4 seconds. Should a calling endpoint receive a Call Proceeding signal before T303 timeout, it should start a T310 timer. Its T310 timer runs until it receives an Alerting, Connect or Release Complete signal. Typical timeout values for the T310 timer are about 10 seconds [15]. If a calling endpoint receives an Alerting signal before a T310 timeout, it should start a T301 timer. This timer runs until a Connect or Release Complete signal is received. Its timeout value should be 180 seconds or greater [13].
Mandatory vs. optional signaling. Not all signals appearing in Figure 1 are mandatory; certain are optional, and certain can be omitted in special cases. For instance, the Alerting ("terminal is ringing") signal can be omitted should the callee answer the phone (Connect) before it has time to ring. Otherwise, H.225 requires the callee endpoint to send an Alerting signal, regardless of call model. H.225 also states that a Gateway should forward an Alerting signal (as shown in Figure 1).
[Figure 1 in the original is a message sequence chart among Endpoint 1, Gatekeeper 1, Gateway 1 and Endpoint 2, showing the phase A RAS and call signalling messages: ARQ (1), ACF (2), Setup (3), Setup (4), Call Proceeding (5, forwarded to the caller), ARQ (6), ACF/ARJ (7), Alerting (8, forwarded), Connect (9), Connect (10).]
Fig. 1. Direct endpoint call signaling via Gateway: phase A (adapted from [9])
¹ It has been pointed out that certain equipment implementations use two timers in place of the three timers mentioned here (T303, T310 and T301) [15]; the discussion in this section accords with H.225 and Q.931.
When call setup signaling transpires via a Gateway, H.323 requires the Gateway to send a Call Proceeding signal to the calling endpoint, whenever the Gateway judges that it might require more than four seconds to respond to that endpoint. This is required in order to help prevent a T303 timeout (or its equivalent) within the calling endpoint. Perhaps surprisingly, H.225 does not require that the callee endpoint (e.g., endpoint 2) issue a Call Proceeding signal. In network environments with extraordinarily long delays, the absence of a Call Proceeding signal from a callee endpoint could yield a T303 timeout (or its equivalent) in the Gateway. In many implementations, a T303 timeout causes call setup to fail². In short, use of the Call Proceeding signal by the callee endpoint can be highly valuable when facing very large delay conditions. Use of this signal helps to alleviate, though not decouple, end-to-end call setup timing constraints. In satellite environments, issuance of this message from callee endpoints helps mitigate the effects of multiple satellite hops during H.323 call setup.
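The caller-side timer behaviour described above can be sketched as a small state machine. This is a hypothetical illustration (the event-list representation and function are not from the paper; timeout values are the defaults/typical values quoted in the text):

```python
# Sketch of the call-setup timers a calling endpoint runs, per the text above.
# events: (seconds since Setup was sent, signal name) as seen by the caller.
T303, T310, T301 = 4.0, 10.0, 180.0  # default / typical timeout values

def call_setup_outcome(events):
    deadline, timer = T303, "T303"  # T303 starts when Setup is issued
    for t, signal in sorted(events):
        if t > deadline:
            return timer + " timeout"
        if signal == "CallProceeding":      # switch from T303 to T310
            deadline, timer = t + T310, "T310"
        elif signal == "Alerting":          # switch to T301 (>= 180 s)
            deadline, timer = t + T301, "T301"
        elif signal == "Connect":
            return "connected"
        elif signal == "ReleaseComplete":
            return "released"
    return timer + " timeout"
```

For instance, a callee that never sends Call Proceeding over a path with more than four seconds of round-trip signalling delay exceeds T303, which is exactly the failure mode hypothesized later for the IPT test series.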
3 Prototype Testing
The satellite-based environment parameters in focus within IMMSAT were delay, jitter, packet loss and bit error rate (BER). For a typical GEO satellite-based environment, measured values³ for these parameters are:
• Delay: 550 ms one-way delay (includes processing delay)
• Jitter: ±15 ms (standard deviation)
• BER: 10⁻⁶ average bit error rate
• Packet loss: 1% average packet error (assuming a 10⁻⁶ BER and 100-byte packets on average).
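Under a simplifying assumption of independent, uniformly distributed bit errors (an assumption of this sketch, not of the paper), the packet error rate implied by a given BER can be estimated as:

```python
def packet_error_rate(ber, packet_bytes):
    """Probability that at least one of the packet's bits is corrupted,
    assuming independent random bit errors (an idealized model)."""
    return 1.0 - (1.0 - ber) ** (8 * packet_bytes)

# For BER = 1e-6 and 100-byte packets this model yields about 0.08 %;
# the 1 % figure quoted above for the reference system evidently also
# reflects loss mechanisms (e.g., error bursts) beyond this simple model.
```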
3.1 Prototype Configuration
The network configuration and prototype elements within the IMMSAT prototype are given in Figure 2. Details about the specific kind of equipment used are provided in the figure, and explicated in [6]. References to the equipment include [17][18][21][22]. In the figure, the H.323 client types are indicated with NM for the Microsoft NetMeeting 3.01 clients, IPT for Siemens LP5100 IP telephones [20], and ISDN for Siemens Profiset 30 ISDN telephones⁴ [19]. The configuration enabled testing the effects of both one and two hop calls⁵.
² H.225 and Q.931 allow the Setup signal to be retried. Since TCP underlies the H.225 call signaling messages, however, many implementations choose to abort and clear the call, rather than retry Setup [15].
³ Information as to which system these values were measured from is confidential.
⁴ These acronyms are followed by a number (1 or 2) which distinguishes different instances of the same kind of system entity.
⁵ Two hop calls can occur, for example, when two mobile satellite terminals require (e.g., audio) media transcoding in order to communicate. In such cases, the audio media is transmitted from the first terminal to the satellite gateway via the satellite, transcoded in the
All testing was performed using various combinations of H.323 clients, IP telephones, ISDN telephones, H.323 Gateway, and Gatekeeper (for most tests). A Direct Endpoint Call Signaling via Gateway call model was employed. Voice encoding and decoding were fixed to G.711 in the ISDN/H.320 network, and to G.723.1 (6.3 kb/s) in the IP/H.323 network. During testing, transcoding was always performed in the H.323 Gateway.
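The choice of 1 vs. 2 audio frames per packet (a variable in the listening tests of section 3.2) changes the header overhead of the G.723.1 stream considerably. A back-of-the-envelope estimate, assuming 24-byte G.723.1 (6.3 kb/s) frames every 30 ms and 40 bytes of IP/UDP/RTP headers (standard header sizes, not figures from the paper):

```python
def voip_bitrate_bps(frames_per_packet, frame_bytes=24, frame_ms=30, header_bytes=40):
    """IP-level bit rate of a packetized voice stream."""
    packet_bits = 8 * (header_bytes + frames_per_packet * frame_bytes)
    packet_interval_s = frames_per_packet * frame_ms / 1000.0
    return packet_bits / packet_interval_s

# 1 frame/packet: about 17.1 kb/s; 2 frames/packet: about 11.7 kb/s.
# Packing two frames trades 30 ms of extra delay for roughly 30 % less bandwidth.
```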
[Figure 2 in the original is a network diagram. Its principal elements: an ISDN network, simulated by Dialogic QuadSpan PRI E1 and BRI/80 SC cards in a PC, with two ISDN phones (ISDN1, ISDN2); the H.323 Gateway (a cPCI PC with a Dialogic DM3 IPLink card), attached via an E1 (2 Mbps) ISDN PRI; the Nera satellite gateway and satellite gateway IP network; "The Cloud" (Shunra), an impairment simulator configured for satellite levels of delay, BER, jitter, packet loss, etc.; an H.323 Gatekeeper; a RADCOM PrismLite protocol analyser; and, at the (simulated) satellite-connected terminals on a Nera LAN/Ethernet with hub, the H.323 clients and IP telephones: NM1 and NM2 (PCs with MS NetMeeting 3.01) and IPT1 and IPT2 (Siemens LP5100 IP telephones).]
Fig. 2. IMMSAT Prototype, including network and telephony configuration details
3.2 Test Strategy and Approach
A test strategy was chosen in which an effort was made to independently identify critical values for each individual parameter⁶. For call establishment tests, a critical value was a value for a single parameter (e.g., delay, packet loss, etc.) at which call establishment began to fail. Once a critical value was identified for a specific parameter during a test series, two or three calls were made in the parametric space neighboring that critical value. Unfortunately, the time available for testing yielded only very small sample sizes for each test series.
Early, yet significant, experimentation with the impact of varying BER upon call establishment indicated that it was necessary to raise this rate to values greater than 1/50000 in order that call establishment should occasionally fail. The same was true in regard to the perception of BER's effect upon audio voice quality⁷. Since such a rate is far outside the expected performance range for most networks, including GEO satellite, further independent testing of this variable was terminated.
⁵ (continued) gateway, then transmitted on to the second terminal, again via the satellite. In this example, this process transpires for audio media transmission in each direction.
⁶ Clearly, certain parameters (e.g., delay and jitter) are interdependent. In this document, the term 'individual parameter' is used to mean a parameter which could be independently adjusted within the satellite simulator.
Complete details about the approach and test process for call establishment and audio voice quality testing can be found in [6]. Some details about audio voice quality testing are included below, however, due to their relevance in this context.
The voice quality tests included two aspects: one-way listening quality and two-way conversational quality. To rate the subjective listening quality of audio voice, Recommendation P.800's Quantal-Response Detectability Testing scale was employed (see [12], Annex C). An abbreviated version of this scale is included in the legend for Figure 4. In order to employ a consistent audio reference during listening quality tests, playback of a pre-recorded audio source via the Gateway was used. It must be noted that the time and resources available for listening quality testing were so limited that only a single subject was employed to perform the listening tests and to rate performance. Testing procedures were made as "blind" as possible [6]. As part of the listening quality tests:
• silence suppression was turned off in all prototype elements (whenever and wherever possible)
• audio frame size was 30 msec
• the number of audio frames per packet was clearly noted as a significant variable (either 1 or 2 audio frames per packet)
• listening quality for both one and two hop calls was checked.
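The per-parameter critical-value search used for the call establishment tests can be sketched as follows (`run_call` is a hypothetical stand-in for placing an actual test call; all names are illustrative, not from the paper):

```python
def find_critical_value(run_call, values, trials=3):
    """Scan increasing parameter values (e.g., delay in msec) until call
    establishment begins to fail, then probe the neighbouring settings."""
    for i, value in enumerate(values):
        successes = sum(run_call(value) for _ in range(trials))
        if successes < trials:  # first failures: the critical value
            neighbourhood = values[max(0, i - 1): i + 2]
            return value, {v: sum(run_call(v) for _ in range(trials))
                           for v in neighbourhood}
    return None, {}
```

With, say, `run_call = lambda delay: delay < 1400` and the delay axis of Figure 3, the search returns 1400 as the critical value together with success counts for the neighbouring settings.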
4 Analysis of Results
Since the total number of trials in each test series was so limited, a judgement was made to analyze the results on the basis of apparent, rather than statistical, trends.
4.1 Call Establishment
Figure 3 summarizes two major sets of call establishment tests. Each major set included eight test series. These two major sets include:
• investigations of the effects of varying delay (test series A-H) and
• investigations of the effects of varying packet loss (test series J-R).
The relevant variables for each test series are: the type of client terminal initiating the call (the caller), the type of client terminal answering the call (the callee), whether the call involved one or two simulated satellite hops, and the settings for the satellite simulator. When reviewing Figure 3, it is important to note that the horizontal axes employed therein are not linear.
⁷ This is consistent with the findings of Nguyen et al. [8].
Effects of delay. In regard to the effects of delay upon call establishment, there were two major observations. The first was that in the face of delay, call establishment performance did not seem to be influenced by the addition of a second (simulated) satellite hop. This can be seen in Figure 3 by comparing: test series C to D, test series C to E, and test series F to H. It is expected that this behavior is related to the characteristics of a Direct Endpoint Call Signaling via Gateway model which fully employs the Call Proceeding signal, as explained in section 2.2. With respect to this first observation, the comparison of F to G is not conformant. It is presumed that this result is a consequence of the second observation described below.
The second major observation is that increases in delay seemed to impact calls employing IPT (as either caller or callee) more adversely than calls which did not involve IPT. Rather than including here a detailed clarification involving signal exchanges and timing, a higher-level explanation is offered instead.

[Figure 3 in the original is a pair of shaded results tables, reconstructed below. Entries read X/Y: X instances of success after Y trials; -- marks untested parameter settings. The shading of the original (success repeated or inferred, partial success, lack of success, and two hop calls) is not reproduced. The caller ISDN / callee ISDN combination was not tested.]

Call establishment: DELAY (normal distribution). Cloud settings: standard deviation = 15 msec; packet loss = 0; BER = 0. Columns give average delay in msec.

  Series  Caller  Callee |  550 1000 1200 1300 1400 1500 2000 2500 3000 3500 4000 4500 5000
  A       IPT     ISDN   |   --   --   --   --   --   --  1/1   --  3/3  0/2  0/2   --   --
  B       NM      ISDN   |   --   --   --   --   --   --   --   --  1/1   --  3/3  0/2  0/2
  C       ISDN    IPT    |   --   --   --   --   --  2/2  0/2   --   --   --   --   --   --
  D       IPT     IPT    |   --   --   --  2/2   --  3/3  1/2  0/2   --   --   --   --   --
  E       NM      IPT    |   --  2/2   --  2/2  0/3  0/3  0/3   --   --   --   --   --   --
  F       ISDN    NM     |   --   --   --   --   --   --   --   --   --  3/3  2/2  0/2   --
  G       IPT     NM     |   --  3/3  2/2  2/2  0/2  0/2  0/1   --  0/1   --   --   --   --
  H       NM      NM     |   --   --   --   --   --   --   --   --   --  2/2  1/3  0/1  0/1

Call establishment: PACKET LOSS (random). Cloud settings: delay = 550 msec; standard deviation = 15 msec; BER = 0. Columns give packet loss in percent.

  Series  Caller  Callee |    0    1    3   10   15   20   25   30
  J       IPT     ISDN   |   --   --   --   --   --  3/3  1/3  0/3
  K       NM      ISDN   |   --   --   --   --  3/3  2/3   --   --
  L       ISDN    IPT    |   --   --  1/1  3/3  1/2  0/1   --   --
  M       IPT     IPT    |   --   --   --  4/4  1/3  0/2   --   --
  N       NM      IPT    |   --   --   --  3/3  2/3  0/2   --   --
  P       ISDN    NM     |   --   --  2/2  3/3   --  1/1  3/3  0/2
  Q       IPT     NM     |   --   --  3/3  2/3  2/4  1/3  1/2  0/1
  R       NM      NM     |   --   --   --   --  3/3  1/2  0/1   --

Fig. 3. Results of call establishment testing for delay and packet loss
IPT as a caller endpoint: In test series F, G and H, the second leg of the call involves signaling between the Gateway and NetMeeting. Series F and H both demonstrate that the second call leg can tolerate up to 4000 ms delay. Series G begins to fail at 1400 ms, however, which implies that the cause of failure lies on the first leg
of the call. A reasonable hypothesis in this case is that timeouts occurred within the IPT terminal, the originating caller.
IPT as a callee endpoint: In test series B, E and H, the first leg of the call involves signaling between NetMeeting and the Gateway. Series B and H both demonstrate that the first call leg can tolerate up to 4000 ms delay. Series E begins to fail at 1400 ms, which implies that the cause of failure lies on the second leg of the call. It is also interesting to notice that the other two series having IPT as the callee endpoint (series C and D) also demonstrated failure at 2000 ms. A reasonable hypothesis in this case is that IPT did not issue a Call Proceeding signal to the Gateway, a situation which caused the Gateway to time out and clear the call. As mentioned earlier, H.225 does not require that callee endpoints issue the Call Proceeding signal. Still, these test series seem to demonstrate the value of that signal being issued by the callee endpoint. It must be mentioned here that though reasonable, the two hypotheses above could not be completely verified. The circumstances regarding this condition are explained further in [6].
Effects of packet loss. Analysis of the call establishment testing for packet loss did not reveal any apparent trends with respect to the variables under study (e.g., caller vs. callee type, hop count, etc.). The only apparent trend is that the success rate of call establishment seemed to diminish significantly when packet loss reached 15-20%. It should be mentioned here that some of the successful call establishment trials employing packet loss rates of 20-25% took one minute or more to establish. Despite the lack of usable audio quality at this level of loss (see below), successful call establishment could still be of value in other kinds of contexts (e.g., a “sign of life” in an emergency).
4.2 Listening Quality
Figure 4 summarizes six major cases of listening tests: one matrix for each major case (case matrices A-F). The approach used for testing is completely described in [6]. In the figure, each case matrix is also colored according to a “usability scale” defined for IMMSAT. The colors essentially map P.800's seven point quantal-response detectability scale ([12], Annex C) onto a three point scale; this was done in order to ease perception of the significance of the listening test results. The legend in Figure 4 explains this mapping. Note that in certain case matrices, parts of the matrix are colored despite the fact that no trial was performed. These instances of coloring were inferred, based upon the performance of other trials within the same matrix. The parameters which distinguish each major case are: the type of client terminal initiating the call (the caller), the type of client terminal answering the call (the callee), whether the call involved one or two (simulated) satellite hops, whether the listening test was performed using a headset or a handset, and the number of audio frames per packet. For the cases in which IPT was the callee, an additional parameter was the setting for the HiNet jitter buffer; this buffer could be set to either ‘Long’ or ‘Short’. The HiNet jitter buffer is a buffer internal to the Siemens IP telephones. The parameters which distinguish the trials within each major case are the settings for the
satellite simulator. It is important to note that the axes employed for each major case matrix are not linear.
The analysis here is discussed using comparisons amongst major cases A-F (i.e., “Case Comparisons I-V” in Figure 4's legend); these are depicted as directed arcs in the figure. The legend succinctly describes the variation in parameter values for each case comparison. Note further that the case comparisons can be viewed as a tree rooted at case B.

[Figure 4 in the original is a set of six shaded case matrices, reconstructed below. Cell values use P.800's quantal-response detectability scale for noise / distortion: 0 Inaudible, 1 Just audible, 2 Slight, 3 Moderate, 4 Rather loud, 5 Loud, 6 Intolerable; -- marks untested settings. The usability colouring of the original (very good - good: 0-2; good - fair: 3; poor - unusable: 4-6; colouring inferred from neighbouring trials in some cells) is not reproduced. Columns give jitter (standard deviation in msec); rows give random packet loss.]

Case A: GW > IPT (handset used; 2 frames / packet; one hop; HiNet Jitter Buffer: Long)
  loss \ jitter |    0   15   30   50   75  100  150
  0 %           |   --   --   --    0    2    5    6
  1 %           |   --   --   --   --   --   --   --
  3 %           |   --   --   --   --   --   --   --
  10 %          |   --   --   --   --   --   --   --

Case B: GW > NM (headset used; 2 frames / packet; one hop)
  loss \ jitter |    0   15   30   50   75  100  150
  0 %           |    0    0    0    0   --   --    1
  1 %           |   --    1   --    2   --   --    2
  3 %           |   --    3   --    3   --   --    2
  10 %          |    3    5    4    5   --   --    4

Case C: GW > NM (headset used; 1 frame / packet; one hop)
  loss \ jitter |    0   15   30   50   75  100  150
  0 %           |    0    0   --    0   --   --    1
  1 %           |   --    1   --    1   --   --    2
  3 %           |   --    3   --    3   --   --    3
  10 %          |   --    4   --    5   --   --    6

Case D: GW > IPT (handset used; 2 frames / packet; one hop; HiNet Jitter Buffer: Short)
  loss \ jitter |    0   15   30   50   75  100  150
  0 %           |    0    2    5    6   --   --    6
  1 %           |   --    1    5    6   --   --    6
  3 %           |   --    2    5   --   --   --    6
  10 %          |   --    4    6   --   --   --   --

Case E: NM > IPT (handset used; 2 frames / packet; two hops; HiNet Jitter Buffer: Long)
  loss \ jitter |    0   15   30   50   75  100  150
  0 %           |    0   --   --    1   --   --    6
  1 %           |    1   --    1   --   --   --   --
  3 %           |   --   --    1    2   --   --   --
  10 %          |   --   --   --   --   --   --   --

Case F: NM > NM (headset used; 1 frame / packet; two hops)
  loss \ jitter |    0   15   30   50   75  100  150
  0 %           |    0    1    1    0   --   --    3
  1 %           |    2    2    2    2   --   --    5
  3 %           |    4    3    4    5   --   --    4
  10 %          |   --    6   --    6   --   --    6

Case comparisons, with the difference(s) between the cases compared:
  I: B vs. C (1) 2 vs. 1 frames per packet
  II: C vs. F (1) 1 vs. 2 simulated hops; (2) caller terminal
  III: B vs. A (1) callee terminal; (2) headset vs. handset
  IV: A vs. D (1) HiNet jitter buffer: Long vs. Short
  V: A vs. E (1) 1 vs. 2 simulated hops; (2) caller terminal

Fig. 4. Results of listening tests for jitter and packet loss
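The mapping from P.800's seven-point detectability scale onto the three-point IMMSAT usability scale, as described in Figure 4's legend, can be written out as:

```python
def usability(detectability):
    """Map P.800 quantal-response detectability (0 = inaudible
    noise/distortion, 6 = intolerable) onto the three-point
    usability scale used to colour Figure 4."""
    if not 0 <= detectability <= 6:
        raise ValueError("detectability must be 0..6")
    if detectability <= 2:
        return "very good - good"
    if detectability == 3:
        return "good - fair"
    return "poor - unusable"
```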
Case B depicts a listening test where the configuration supported “good – very good” audio quality for jitter values as high as 150 msec and packet loss up to 1%. Audio quality began to noticeably degrade to “fair” when packet loss reached 3%. Comparison with case C (comparison I) indicates that use of two vs. one audio frames per packet does not appear to be of any significance. Differences are noticeable in comparison II, however; case F involves two hops, while case C involves only one. This comparison illustrates that audio quality degrades quite noticeably when a second hop is introduced, an expected result due to the cumulative effects of packet loss, delay and jitter across hops. Worth noting in case F is the near-usability of the trial having 150 msec jitter and 0% packet loss per hop. This result seems to indicate that NetMeeting is internally operating with a relatively “large” jitter buffer. No user-level access to this parameter was available via the NetMeeting application, however, in order to confirm this hypothesis. Consider now comparison III (case B vs. case A). In case A, the most significant difference from case B is that instead of NetMeeting, the Siemens LP5100 telephone
is used as the callee terminal. For the LP5100 telephone, jitter has a severely destructive impact upon audio quality as soon as it reaches 100 msec. This was the case even though the HiNet jitter buffer was set to ‘Long’. When that buffer was set to ‘Short’ (see case D, comparison IV), the effect of jitter was devastating.
In contrast to the LP5100 telephone's poor performance in the face of jitter, the results seem to indicate that its audio quality was somewhat better than NetMeeting's, when faced with the same levels of packet loss. Though this is not explicitly shown in comparison III, one may choose to infer this by considering how well the LP5100 performed with respect to packet loss in case E, a two hop call (comparison V).
In summary, the one-way listening tests showed that uncontrolled jitter was the most destructive parameter to audio quality. Packet loss was also influential, though somewhat less dramatically.
4.3 Conversational Quality
Even when the intelligibility of audio is unaffected by its transmission medium, the exclusive effect of delay can have severe consequences upon the character of a conversation (see e.g., [4]). Since this area has already been so well investigated, the tests for conversational quality were only quick, general checks performed using the typical values expected for delay, BER, packet loss, etc. In general, the conversational quality between clients at these levels was satisfactory. Of note, the two hop calls tended to include some (very) small amount of added noise / distortion, when compared to the one hop calls.
For the sake of experience, conversational quality was occasionally tested when delays were varied between 1500-4000 msec and other satellite parameters were held at their typical values. As expected, users could ultimately manage to communicate successfully. With such large delays, however, it was necessary for them to adopt an unnatural "single duplex" mode of conversation.
5 Conclusions
Generally, all tests performed using the IMMSAT prototype indicated that H.323-based VoIP services in a GEO satellite-based IP network are practical and usable. The effects of typical GEO satellite environment conditions upon H.323-based VoIP services are expected to be negligible. These results are consistent with those of Nguyen et al. [8]. Further remarks about call establishment and audio voice quality follow below.
5.1 Call Establishment
The call establishment tests showed that the H.323 protocols could establish calls successfully under severely impaired conditions: conditions which were at least an order of magnitude worse than in typical GEO satellite-based environments.
The call establishment tests also indicated that the client type on each end of the call can have great significance with respect to the amount of delay which can be tolerated during call establishment. For example, it appeared that one of the clients failed to issue the Call Proceeding signal. This signal is not required by the standard but, when facing extreme conditions, its use can yield greater tolerance of delay.
Lastly, it was interesting to see that both one and two hop call configurations performed equally well in the face of delay. It is expected that this behavior is due to the characteristics of a Direct Endpoint Call Signaling via Gateway model which fully employs the Call Proceeding signal.
5.2 Audio Voice Quality
The experiments showed that audio quality appreciably degrades when moving from a one hop call to a two hop call. This kind of result is expected, due to the cumulative effects across multiple hops. The experiments also showed that client type has an impact upon voice quality; use of a headset instead of a handset also makes a difference here. Still, jitter is the network characteristic that seemed to have the greatest impact on listening quality. Even though the highly adverse conditions tested in this investigation are not likely to be observed in most networks, the ability to handle jitter is crucial for offering an acceptable VoIP service. Using a larger jitter buffer in certain system entities may help alleviate the problem, though at the cost of an increase in overall delay.
The experience of large delays when trying to converse can be frustrating and confusing, even to users who have been informed of and prepared in advance for such conditions. Users practiced with large delays can adapt and reconcile themselves to the condition and ultimately use the service, but conversations must transpire in a “single duplex” mode in order that communication be achieved.
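The buffer-size vs. delay trade-off noted above can be illustrated with a toy model. The Gaussian jitter distribution and the treatment of late packets as lost are assumptions of this sketch, not measurements from the study:

```python
import random

def late_fraction(buffer_ms, jitter_sd_ms=50.0, packets=20000, seed=1):
    """Fraction of packets arriving too late to be played out: a packet
    is unusable when its (Gaussian) jitter exceeds the buffer depth."""
    rng = random.Random(seed)
    late = sum(rng.gauss(0.0, jitter_sd_ms) > buffer_ms
               for _ in range(packets))
    return late / packets

# A deeper buffer discards far fewer packets (late_fraction(150) is much
# smaller than late_fraction(30)) but adds its full depth to the
# mouth-to-ear delay.
```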
Acknowledgements The authors wish to thank the project team at Nera SatCom: Ronald Beachell, Harald Hoff and Paul Mackell. A special thanks as well to Ronald Beachell, for his diagram of the IMMSAT prototype configuration (Figure 2).
References
1. Beachell, R.L., Bøhler, T., Holmes, P., Ludvigsen, E., Maus, E., Aarhus, L., "1551-IMMSAT; IP Multimedia over Satellite Study Report", Copyright 2000, Nera SatCom AS / Norsk Romsenter. Also available as: "IP Multimedia over Satellite: Study Report", NR Report 967, December 2000, ISBN 82-539-0473-8. See: http://www.nr.no/publications/nrrapport967.pdf
2. Beachell, R.L., Holmes, P., Thomassen, J., Aarhus, L., "1523-IMMSAT-TR; IP Multimedia over Satellite: Test Report for the IMMSAT Prototype", Copyright 2000, Nera SatCom AS /
Norsk Romsenter. Also available as: "IP Multimedia over Satellite: Test Report", NR Technical Note IMEDIA/10/00 (Classified), December 2000.
3. Bem, D., Wieckowski, W., Zielinski, R., Broadband Satellite Systems, IEEE Communications Surveys & Tutorials, vol. 3, no. 1, 2000, pp. 2-15; see: http://www.comsoc.org/pubs/surveys/1q00issue/pdf/zielinski.pdf
4. Douskalis, B., IP Telephony - The Integration of Robust VoIP Services, Prentice-Hall, 2000.
5. Farserotu, J., Prasad, R., A Survey of Future Broadband Multimedia Satellite Systems, Issues and Trends, IEEE Communications, vol. 38, no. 6, June 2000, pp. 128-133.
6. Holmes, P., Aarhus, L., Maus, E., "An Investigation into the Effects of GEO Satellite Environments upon H.323-based Voice over IP", NR Report 973, May 2001, ISBN 82-539-0479-7. Available as: http://www.nr.no/publications/report973.pdf
7. Metz, C., IP-over-Satellite: Internet Connectivity Blasts Off, IEEE Internet Computing, July-August 2000, pp. 84-89.
8. Nguyen, T., Yegenoglu, F., Sciuto, A., Subbarayan, R., Voice over IP Service and Performance in Satellite Networks, IEEE Communications, vol. 39, no. 3, March 2001, pp. 164-171.
9. Packet-based multimedia communications systems, ITU-T Recommendation H.323 v3 (with change marks), ITU-T, October 1999.
10. From DataBeam: "A Primer on the H.323 Series Standard", see: http://www.h323analyser.co.uk/h323_primer-v2.pdf
11. Control protocol for multimedia communication, ITU-T Recommendation H.245 v5 (with change marks), ITU-T, 1999.
12. Methods for Subjective Determination of Transmission, ITU-T Recommendation P.800, ITU-T, August 1986: http://www.itu.int/itudoc/itu-t/rec/p/p800.html
13. Call signaling protocols and media stream packetization for packet based multimedia communication systems, ITU-T Recommendation H.225.0 draft v4 (with change marks), ITU-T, 1999.
14. Norsk Regnesentral (Norwegian Computing Center) WWW site; see: http://www.nr.no/ekstern/engelsk/
15. Personal communication from Espen Kjerrand, Ericsson AS Norway, May 2001.
16. Nera SatCom AS WWW site; see: http://www.nera.no/
17. Dialogic IPLink cPCI Solutions; see: http://www.dialogic.com/products/d_sheets/4798web.htm
18. Dialogic QuadSpan™ Voice Series; see: http://www.dialogic.com/products/d_sheets/3967web.htm
19. Siemens Profiset 30isdn; see: http://www.siemens.no/telefoni/isdn/profi30.htm
20. Siemens optiPoint 300 advance IP telephone (formerly LP5100 IP); see: http://www.ic.siemens.com/CDA/Site/pss/1,1294,208375-1-999,00.html
21. The Cloud, from Shunra Software Ltd.; see: http://www.shunra.com/products/thecloud.htm
22. RADCOM Prism and PrismLite protocol analyzers; see: http://www.radcom-inc.com/radcom/test/prism.htm
Conception, Implementation, and Evaluation of a QoS-Based Architecture for an IP Environment Supporting Differentiated Services

Fabien Garcia¹, Christophe Chassot¹, Andre Lozes¹, Michel Diaz¹, Pascal Anelli², Emmanuel Lochin²

¹ LAAS/CNRS, 7 avenue du Colonel Roche, 31077 Toulouse Cedex 04, France
{fgarcia, chassot, alozes, diaz}@laas.fr
² LIP6, 8 rue du Capitaine Scott, 75015 Paris, France
{pascal.anelli, emmanuel.lochin}@lip6.fr
Abstract. Research reported in this paper deals with the design of a communication architecture with guaranteed end-to-end quality of service (QoS) in an IPv6 environment providing differentiated services within a single Diff-Serv domain. The paper successively presents the design principles of the proposed architecture, the networking platform on which the architecture has been developed and the experimental measurements validating the IP level mechanisms providing the defined services. Results presented here have been obtained as part of the experiments in the national French project @IRS (Integrated Architecture of Networks and Services).
1 Introduction
Technical revolutions in computer science and telecommunications have led to the development of several new types of distributed applications: multimedia and cooperative applications, interactive simulation, etc. These new applications present challenging characteristics and constraints to network designers, such as higher bandwidths, the need for bounded delays, etc. As a result, the Internet community formed two research and development efforts (the IETF¹ Int-Serv [1] and Diff-Serv [2] working groups) whose goal is to develop a new generation of protocols within a revised architecture for the TCP/IP protocol suite. One of the key points of that architecture is that the new “differentiated” and “integrated” services respond to the needs of new applications. The work presented in this article, performed within the national French project @IRS², falls within this context. More precisely, it deals with the

¹ IETF: Internet Engineering Task Force
² The @IRS project (Integrated Networks and Services Architecture) is a national project of France's RNRT (Réseau National de la Recherche en Télécommunications), whose objective is the development and experimentation of innovative Internet mechanisms and protocols over a heterogeneous telecommunications infrastructure (ATM, satellite, wireless, LAN, etc.). Initiated in December 1998, the @IRS project ended in April 2001.
D. Shepherd et al. (Eds.): IDMS 2001, LNCS 2158, pp. 86-98, 2001. Springer-Verlag Berlin Heidelberg 2001
conception, implementation, and evaluation of a communication architecture providing a guaranteed end-to-end QoS in an IPv6 differentiated services environment constituting a single Diff-Serv domain. Several other efforts have targeted the QoS problem within the Internet; among them, we can cite the TF-TANT activity [3] and the GEANT, TEQUILA, CADENUS, AQUILA and GCAP IST projects [4,5,6,7,8]. The article is structured as follows. Section 2 presents the architecture principles. Section 3 describes the experimental platform over which the architecture has been developed. Section 4 details the experimental scenarios for the validation of the IP QoS mechanisms providing the defined services; results of the corresponding measurements are also provided and analysed. Conclusions and future work are presented in Section 5.
2 General Architecture

The following two major sections (2.1 and 2.2) successively present the architecture defined at the end-to-end level and then at the network level.

2.1 End-to-End Level
The basic underlying principle that supports the proposal of the @IRS end-to-end architecture is one of many dedicated to the transport of multimedia flows [9,10,11,12]. The idea is that the traffic exchanged within a distributed application can be decomposed into several data flows, each one requiring its own specific QoS (i.e. delay, reliability, order, …). That is, each application can request a specific QoS for each flow via a consistent API (Application Programming Interface) offering parameters and primitives for a diverse set of necessary services. By way of a session (see Figure 1), the application layer software is then allowed to establish one or many end-to-end communication channels, each being: (1) unicast or multicast, (2) dedicated to the transfer of a single flow of application data, and (3) able to offer a specific QoS. Besides the API, three other modules are defined:
• the first one provides multiple transport layer possibilities, such as TCP, UDP, or the partial-order, partial-reliable POC (Partial Order Connection [13,14]);
• the second one implements the mechanisms linked to the utilisation of QoS services at the IP layer (e.g., RSVP, RSVP light, manual configuration);
• the third one associates a given transport channel with a given IP QoS service.
[Figure 1 shows the layered structure of the end-to-end communication system: the application and its adaptation layer over the API; transport channels (UDP / TCP / POC); the association and IP QoS system management modules (RSVP, …); and IPv6 with traffic control.]
Figure 1. Architecture of the End-to-End Communication System

Two types of modifications have been defined for the applications: the translation of transport layer function calls (e.g. socket) to the new API, and the insertion of an adaptation layer managing the establishment of end-to-end QoS-based communication channels, along with translating the applications' needs into generic parameters understood by the API. Since the experimental results presented in the paper are “IP QoS” rather than “Transport QoS” oriented, we do not present the API in detail. However, we note that the QoS specified for each end-to-end channel is expressed by the following parameters:
• a partial order, expressed as both an intra- and inter-flow order, since a user may want a logical synchronisation service either within each flow (e.g., logical synchronisation of two media transported within the same end-to-end channel) or between flows (e.g., logical synchronisation of voice and video);
• a partial reliability defined, for example, by a maximum number of consecutive lost packets, and/or by a maximum running percentage of lost packets;
• a maximum end-to-end transmission delay.
Although the question is pertinent, the goal of this paper is not to describe and enforce ordering relationships within channels that may lose packets in order to tackle the synchronisation needs of multimedia applications, or to reduce the transit delay of application level data units. Likewise, the paper does not study the usefulness of RTP-based propositions. Finally, in addition to QoS parameters, an application must specify four service parameters:
• the first one characterises the traffic generated by the application sender; for this, the token bucket model is used;
• the second one designates which transport protocol to use (e.g. UDP, TCP or POC).
One of the proposed future research activities is to determine if it is possible and/or efficient to have the system automatically select the transport protocol. For now, this choice must be specified by the application;
• the third one designates the IP layer's QoS management desired by the application (e.g. Best Effort service, Int-Serv Guaranteed or Controlled Load services, Diff-Serv Premium or Assured services, etc.); here again, automatic selection is a proposed research activity outside the scope of the work presented here;
• the final parameter identifies the address, either unicast or multicast, of a set of destination applications.
Although the architecture is designed so as to allow several kinds of Transport protocols or IP level management systems, the architecture implemented within the project only includes UDP and TCP at the Transport level and a Diff-Serv proposition (described in section 2.2) at the IP level.

2.2 Network Level
QoS functions performed at the network level can be divided into two parts: those related to the data path and those related to the control path [15]. On the data path, QoS functions are applied by routers at the packet level in order to provide different levels of service. On the control path, QoS functions concern router configuration and act to enforce the QoS provided. While the studies performed during the @IRS project tackle both areas (data and control paths), only the data part has been implemented and evaluated over the experimental platform. In this section, we first describe the services defined at the IP level. Then we detail the different functions required for the implementation of these services, with a clear separation between the control path and the data path.

Services. Three services have been defined at the IP level:
• GS (Guaranteed Service) - analogous to the Premium Service [16] - is used for data flows having strong constraints in both delay and reliability. Applications targeted by GS are those which do not tolerate QoS variation;
• AS (Assured Service) is appropriate for responsive flows having no strong constraints in terms of delay, but requiring a minimum average bandwidth. More precisely, a flow served in AS has to be provided with an assured bandwidth for the part of its traffic (IN packets) respecting the characterisation profile specified for the flow. The part of the traffic exceeding the characterisation (OUT - or opportunistic - packets) is conveyed in AS as long as no congestion occurs in the network on the path used by the flow;
• BE: the Best Effort service offers no QoS guarantees.
Three markings have been defined to differentiate packets: EF (Expedited Forwarding), AF (Assured Forwarding) and DE (Discard Eligibility) [17,18], corresponding to the three services provided at the IP level.
Control Path QoS Functions. In order to implement the mentioned services in the testbed network (described in section 3), the two mechanisms involved in the control path are admission control and route change protection. Note that the multicast issue is not studied in this paper.

Admission Control. The admission control takes care of the acceptance of new flows in service classes. Its decisions are taken according to the TCA¹ contracted between the user domain and the service provider domain. Our proposition is different for AS and GS:
• for AS, a per-flow admission control is applied at the edge of the network (core routers are not involved); this control is based on the amount of AS traffic already authorised to enter the network by the considered edge router. This guarantees that at any time, the amount of in-profile AS packets (IN packets) in the network will be at most the sum of the AS traffic authorised at each edge router;
• for GS, as a delay guarantee is needed, the admission control involves all the routers on the data path (edge and core routers). As for AS, it is applied per flow, each flow being identified by the pair (flow_id field, source address).

Route Change Protection. This issue is raised for GS only, as a route change can have a heavy impact on the service and may even result in the service not being available anymore. As it has been decided in the @IRS project that QoS and routing issues would be considered separately, our solution is to use static routing only.

Data Path QoS Functions. QoS functions involved in the data path are policing, scheduling and congestion control. We first describe these three functions, then we detail their implementation within the routers' input and output interfaces.

Data Path Functions Description.

Policing.
Policing deals with the actions to be taken when out-of-profile traffic arrives in a given service class:
• for AS, the action is to mark the out-of-profile packets with a higher drop precedence than the IN traffic. OUT packets are called opportunistic packets because they are processed like the other (IN) AS packets as long as no congestion occurs in the network. If congestion occurs, the congestion control described hereafter is applied. Targeted applications are those whose traffic is elastic, that is, with a variable profile (a minimum still being assured);
• for GS, as a guarantee exists, we must be sure that enough resources are available; therefore the amount of GS traffic in the network must be strictly controlled. To enforce this, the chosen policing method is to shape the traffic at the edge router and to drop out-of-profile GS packets.
¹ TCA - Traffic Conditioning Agreement: the part of the SLA (Service Level Agreement [19]) that describes the amount of traffic a user can send to the service provider network.
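The per-flow AS admission control described above can be sketched in a few lines. This is an illustrative sketch, not the @IRS implementation: the class and method names are invented, and GS admission (which would query every router on the data path) is not shown.

```python
class EdgeAdmissionControl:
    """Per-flow AS admission control at an edge router (illustrative sketch).

    A new AS flow is accepted as long as the sum of the rates already
    authorised at this edge router stays within the amount contracted in
    the TCA; core routers are not consulted.
    """

    def __init__(self, as_capacity):
        self.as_capacity = as_capacity   # AS amount contracted in the TCA
        self.admitted = {}               # (flow_id, source address) -> rate

    def request_as(self, flow_id, src_addr, rate):
        """Admit the flow if the aggregate authorised AS rate stays in profile."""
        if sum(self.admitted.values()) + rate <= self.as_capacity:
            self.admitted[(flow_id, src_addr)] = rate
            return True
        return False

    def release(self, flow_id, src_addr):
        """Free the rate reserved for a terminated flow."""
        self.admitted.pop((flow_id, src_addr), None)
```

Because the decision only involves state local to the edge router, the invariant stated in the text holds by construction: the in-profile AS traffic in the network never exceeds the sum of the amounts authorised at the edges.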
Scheduling. Once again, the scheduling is different for AS and GS packets:
• GS scheduling is implemented by a Priority Queuing (PQ) mechanism. This choice is due to the fact that a PQ-based scheduler adds the smallest delay to packet forwarding due to the packetisation effect;
• the remaining bandwidth is shared by a Weighted Fair Queuing (WFQ) mechanism between AS and BE traffic. This ensures that there will not be service denial for BE.

Congestion Control. The congestion control issue is essential for QoS services, as a congestion can prevent the network from offering the contracted QoS to a flow:
• GS traffic does not need congestion control as it benefits from a priority queuing ensuring that all its packets are served up to the maximal capacity of a given router. Associated with a strong policing (drop of out-of-profile packets) at the network boundary, this guarantees that no congestion will occur in GS queues;
• for AS, as opportunistic traffic is authorised to be sent in the network, the amount of AS packets in any router cannot be known a priori. Therefore, a drop precedence system has been implemented; it allows the drop of opportunistic packets as soon as a congestion is about to occur in an AS queue. A Partial Buffer Sharing (PBS) scheme has been chosen on the AS queues rather than a Random Early Discard (RED) method. This choice comes from recent studies which show that simple systems such as PBS avoid the queue length oscillation problems raised by RED [20,21].
Let us now look at the implementation of these functions over the @IRS platform, through the input interface of the first router (called edge router, as opposed to core routers) and the output interface of all routers.

Data Path Functions Implementation.

Input Interface of Edge Router. This interface is the first encountered by a packet when it enters the network. Its logical structure is shown in Figure 2.
[Figure 2 layout: incoming IPv6 packets pass through a MultiField (MF) classifier; GS packets are metered and shaped, with out-of-profile packets rejected and in-profile packets marked EF; AS packets are metered and marked AF with an inside/outside-profile drop precedence; BE packets are marked DE.]
Figure 2. Input Interface Structure

This interface is in charge of:
• packet classification, which is based on information from the IPv6 header (source address and flow_id); this is done through a MultiField classifier;
• measuring AS and GS flows to determine whether they are in or out of profile;
• shaping GS packets and dropping them if necessary;
• marking AS and GS packets with the appropriate Diff-Serv CodePoint (DSCP);
• marking AS packets with the precedence due to their being in or out of profile;
• marking BE packets to prevent them from entering the network with the DSCP of another service class.
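The metering and marking steps of this input interface can be sketched with a token-bucket meter. The names below are illustrative (they are not the @IRS code), and the strings "EF", "AF-in", "AF-out" and "DE" stand in for the real DSCP values:

```python
import time

class TokenBucket:
    """Token-bucket meter: rate in bytes/s, depth in bytes (illustrative)."""

    def __init__(self, rate, depth):
        self.rate, self.depth = rate, depth
        self.tokens = depth
        self.last = time.monotonic()

    def conforms(self, nbytes, now=None):
        """True if an nbytes-long packet is in profile; consumes tokens if so."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

def mark(packet, service, meter, now=None):
    """Return the marking for a packet, or None when it must be dropped."""
    if service == "BE":
        return "DE"                            # BE is (re)marked DE unconditionally
    in_profile = meter.conforms(len(packet), now)
    if service == "GS":
        return "EF" if in_profile else None    # out-of-profile GS is dropped
    # AS: opportunistic (OUT) packets get a higher drop precedence
    return "AF-in" if in_profile else "AF-out"
```

The asymmetry in the text is visible here: a GS meter never lets excess traffic through (the shaper/dropper), while an AS meter lets it through with the AF-out precedence, leaving the drop decision to the congestion control further along the path.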
Output Interface of All Routers. In the Diff-Serv model, all routers must implement a set of forwarding behaviours called Per Hop Behaviours (PHB). In the @IRS architecture, these behaviours are implemented through the scheduling and AS congestion control described above. They are integrated in the output interface of each router (Figure 3).

[Figure 3 layout: a Behaviour Aggregate (BA) classifier directs EF packets to a FIFO queue and AF packets to a PBS-controlled queue; the AF queue and the BE FIFO queue are served by WFQ, which is in turn combined with the EF queue under PQ, followed by an additional rate control.]

Figure 3. Output Interface Structure

Two additional points must be noted: the Behaviour Aggregate classifier, which classifies packets according to their DSCP, and the rate control at the output of core routers. As will appear more clearly in the next section, this rate control is necessary to avoid congestion at the ATM level. Indeed, congestion at the ATM level is due to the fact that the ATM Virtual Paths (VP) through which the packets are sent have a smaller bandwidth capacity than the ATM cards of the routers. Combined with the limited buffer size in the ATM switches, this could create ATM cell losses if the throughput of the router were not controlled.
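A minimal sketch of such an output interface follows. It is illustrative only: the queue limits are invented, WFQ is approximated by a 4:1 weighted round robin (matching the 0.8/0.2 split used later in the experiments), and the BA classification is reduced to dispatching on a marking string.

```python
from collections import deque

class OutputInterface:
    """Sketch of a router output interface: strict priority (PQ) for EF,
    weighted round robin between AF and BE approximating the 0.8/0.2 WFQ
    split, and Partial Buffer Sharing (PBS) on the AF queue."""

    def __init__(self, pbs_threshold=32, queue_limit=64):
        self.ef, self.af, self.be = deque(), deque(), deque()
        self.pbs_threshold = pbs_threshold   # AF-out admitted only below this
        self.queue_limit = queue_limit       # hard limit of the AF and BE queues
        self.order = ["AF"] * 4 + ["BE"]     # 4:1 ~ WFQ weights 0.8 / 0.2
        self.rr = 0

    def enqueue(self, pkt, dscp):
        """BA classification + PBS; returns False when the packet is dropped."""
        if dscp == "EF":
            self.ef.append(pkt)              # policing keeps this queue short
        elif dscp in ("AF-in", "AF-out"):
            if len(self.af) >= self.queue_limit:
                return False
            if dscp == "AF-out" and len(self.af) >= self.pbs_threshold:
                return False                 # PBS: early drop of opportunistic packets
            self.af.append(pkt)
        else:
            if len(self.be) >= self.queue_limit:
                return False
            self.be.append(pkt)
        return True

    def dequeue(self):
        if self.ef:                          # PQ: EF always served first
            return self.ef.popleft()
        for _ in range(len(self.order)):     # then WRR over AF and BE
            cls = self.order[self.rr % len(self.order)]
            self.rr += 1
            q = self.af if cls == "AF" else self.be
            if q:
                return q.popleft()
        return None
```

Note how the PBS threshold realises the drop precedence described above: below the threshold, AF-in and AF-out packets are treated identically; above it, only in-profile packets may still enter the AS queue.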
3 Experimental Platform
The testing network environment used for the applications of the project is shown in Figure 4. Local equipment is connected by edge routers (Rb) to an Internet Service Provider (ISP) represented by the national ATM RENATER 2 platform. Four core routers (Rc) are introduced within the ISP; physically, they are located in the local platforms, but logically they are part of the ISP. By means of its edge router, each site is provided with an access point to the ISP (Rb). This access point is characterised by a statically established traffic contract called a Service Level Agreement (SLA) [19]. For each service, the SLA consists of:
• classification and packet (re)marking rules;
• a sending traffic profile called a Traffic Conditioning Agreement (TCA) [19] (for @IRS, the model chosen was the Token Bucket);
• actions the network has to perform when the application violates its traffic profile.
[Figure 4 layout: four sites (Site 1 to Site 4), each with hosts connected through an edge router (Rb) to the ISP, which is built on the RENATER 2 ATM platform and contains four core routers (Rc).]
Figure 4. Experimental Platform

It is the edge router's responsibility to implement the SLA, as it introduces application flows within the ISP. All the ATM VPs offer a 1 Mbit/s Constant Bit Rate (CBR) service.
4 Experimental Scenarios and Measurements
The purpose of the experimental scenarios described in this section is to validate the IP mechanisms developed to provide a slightly degraded QoS for AS flows and an excellent QoS for GS flows when the network load increases.

4.1 Studied Cases
Two cases have been considered:
• case 1: the edge router is not in a state of congestion and the AS (respectively GS) flow respects its traffic profile. This case is the reference one;
• case 2: the edge router is now in a state of congestion and the AS (resp. GS) flow still respects its traffic profile; the congestion is due to BE traffic. Here, we want to study the behaviour of the AS (resp. GS) flow when the network is overloaded.

4.2 Tools
The software tool used for the experiments is a traffic generator (named Debit6) developed by LIP6 and LAAS. Debit6 is able to:
• send/receive UDP/IPv6 flows under the FreeBSD or Windows NT operating systems;
• send traffic respecting a token-bucket-like profile;
• measure throughput and loss rate for a given experiment session;
• measure the transit delay of each packet.
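The rate control performed by such a generator essentially reduces to choosing the inter-packet delay for a given packet size. Debit6 itself is not publicly available, so the sketch below is illustrative; the 1024-byte packet size matches the one used in the measurements later in this section.

```python
PACKET_SIZE = 1024  # bytes; flows are sent as bursts of one 1024-byte UDP packet

def inter_packet_delay(target_rate, packet_size=PACKET_SIZE):
    """Inter-packet delay (seconds) yielding a given mean rate in bytes/s."""
    return packet_size / target_rate

def send_schedule(target_rate, duration_s, packet_size=PACKET_SIZE):
    """Timestamps (s) at which a Debit6-like generator would emit packets."""
    n = int(duration_s * target_rate / packet_size)   # packets in the session
    gap = inter_packet_delay(target_rate, packet_size)
    return [i * gap for i in range(n)]
```

For example, the maximal GS amount of 20 Kbytes/s used in the experiments corresponds to one 1024-byte packet every 50 ms.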
4.3 QoS Validation
Experiment Configuration. Measurements have been realised between LAAS (Toulouse) and LIP6 (Paris) over the IPv6 inter-network environment illustrated in Figure 5.

[Figure 5 layout: at LAAS (Toulouse), BSD hosts generating AS/GS traffic and a BSDexo host generating BE traffic sit on a 100 Mbit/s Ethernet behind an edge router (Rb), connected through an ATM switch to the RENATER 2 national ATM platform; a symmetric configuration (ATM switch, edge router, BSD host) sits at LIP6 (Paris); core routers (Rc) sit at the ISP boundaries.]
Figure 5. Inter-network Environment

The bandwidth of the link connecting LAAS (resp. LIP6) to the ISP (via an ATM VP) is such that the maximal throughput provided at the UDP level is 107 Kbytes/s. In the remainder of the paper, we use the term link bandwidth to refer to this throughput.

Hypotheses and Measured Parameters. Edge and core routers have been configured with the following hypotheses:
• the maximal amount of GS traffic that can be generated by the edge router (on average) has been fixed at 20 Kbytes/s, i.e. about 20% of the link bandwidth;
• the maximal amount of AS traffic that can be generated by the edge router (on average) has been fixed at 40 Kbytes/s, i.e. about 40% of the link bandwidth;
• the rate control applied by the core router is 107 Kbytes/s (i.e. the link bandwidth);
• the weights associated with AS and BE packet scheduling within the WFQ mechanism are respectively 0.8 and 0.2.
The measured parameters are the loss rate, the transit delay and the throughput; more precisely:
• the loss rate corresponds to the ratio between the number of packets not received and the number of packets sent over the complete experiment session;
• the minimal, maximal and average values of the transit delay are calculated over the complete experiment session (about 300 seconds);
• the throughput is measured both at the sending and receiving side and corresponds to the number of user data bytes sent or received over the complete experiment session divided by the communication duration (about 300 seconds).
Note that the hosts are synchronised by means of NTP (Network Time Protocol), inducing a +/- 5 milliseconds (ms) uncertainty on the transit delay measurements.
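The three measured parameters defined above can be computed from per-packet timestamps as follows. The helper names are illustrative; note that the +/- 5 ms NTP uncertainty applies to every delay figure and is not corrected here.

```python
def session_metrics(sent_times, recv_times):
    """Loss rate and transit-delay statistics for one experiment session.

    sent_times: {seq: t_send} for every packet sent;
    recv_times: {seq: t_recv} for the packets that actually arrived.
    """
    delays = [recv_times[s] - sent_times[s] for s in recv_times]
    return {
        # ratio of packets not received to packets sent
        "loss_rate": 1.0 - len(recv_times) / len(sent_times),
        "delay_min": min(delays),
        "delay_max": max(delays),
        "delay_avg": sum(delays) / len(delays),
    }

def throughput(n_packets, packet_size, duration_s):
    """User-data throughput in bytes/s over the whole session."""
    return n_packets * packet_size / duration_s
```

For instance, sending one 1024-byte packet every 50 ms for 300 s yields 6000 packets and a throughput of 20480 bytes/s, i.e. the 20 Kbytes/s GS maximum.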
Measurements Specification. Two kinds of measurements have been defined (see Table 1). In both cases, AS and GS flows have to respect their traffic profile and all flows (GS, AS and BE) are generated as bursts of 1 UDP packet whose size is 1024 bytes. The inter-packet delay is the variable parameter used to change the throughput of the generated flows.

1st kind of measurements:
  GS (% of the maximal amount of GS traffic): 50 or 100
  AS (% of the maximal amount of AS traffic): 50 or 100
  BE (% of the link bandwidth): 0, 25, 50, 75 or 100 (cases 1 and 2)

2nd kind of measurements:
  AS and GS together: 50 each
  BE (% of the link bandwidth): 100 (case 2)
Table 1. QoS Measurements Specification (case 1 = no congestion; case 2 = congestion)

The first kind of measurements is aimed at validating the protection of an AS (resp. a GS) flow in the presence of a BE flow whose load is increased from 0% to 100% of the link bandwidth. AS and GS flows are not generated together. For these measurements:
• the BE flow is sent from the BSDexo PC (LAAS); the values of its mean rate are successively 0%, 25%, 50%, 75% and 100% of the link bandwidth;
• AS and GS flows are transmitted using Debit6 from the LAAS BSD PC to the LIP6 BSD PC; two cases are considered:
  - a single AS (resp. GS) flow is generated with a mean rate corresponding to 50% of the maximal amount allowed by the edge router for AS (resp. GS) traffic, i.e. 20 Kbytes/s (resp. 10 Kbytes/s);
  - a single AS (resp. GS) flow is generated with a mean rate corresponding to 100% of the maximal amount, i.e. 40 Kbytes/s (resp. 20 Kbytes/s).
The second kind of measurements is aimed at validating the protection of an AS flow and a GS flow generated together, in the presence of a BE flow whose load corresponds to the totality of the link bandwidth (100%). For these measurements:
• the BE flow is sent from the BSDexo PC (LAAS); its mean rate is 107 Kbytes/s;
• AS and GS flows are transmitted using Debit6 from the LAAS BSD PC to the LIP6 BSD PC, each one with a mean rate corresponding to 50% of the maximal amount provided by the edge router (i.e. 20 Kbytes/s for AS and 10 Kbytes/s for GS).

Results and Analysis. For the first kind of measurements, due to space limitations, only the results for 100% of AS (respectively GS) are presented in this section.

Protection of the AS Flows (vs. BE). The results of the measurements are presented in Table 2. These results conform to the expected ones. When the network is not in a state of congestion (first three columns), one can conclude that:
• the average transit delay is almost constant (less than 1 ms gap);
• the loss rate is almost null;
• the receiving throughput is almost the same as the sending one;
• the maximum delay is not guaranteed.
Note that the second and third points are compliant with the expectations, as we only consider in-profile traffic; measurements with out-of-profile traffic would show a receiving throughput inferior to the sending throughput (and a loss rate higher than 0). When the network is in a state of congestion, one can see that the transit delay is slightly increased, which again is correct for the expected service.

AS - 100%                              BE (% of the link bandwidth)
                              0%      25%     50%     75%     100%
Transit delay (s)   - min     0.019   0.019   0.019   0.019   0.019
                    - max     0.045   0.032   0.030   0.033   0.036
                    - avg     0.020   0.021   0.021   0.024   0.024
Throughput (B/s)    sent      40765   40963   40964   40963   40964
                    received  40761   40963   40963   40963   40963
Loss rate (%)                 0.0     0.0     0.0     0.0     0.0
Table 2. AS - 100% vs. BE

Protection of the GS Flows (vs. BE). The results of the measurements are presented in Table 3. These results are similar to those presented for the AS flows, the main difference being that the maximum delay for GS packets seems slightly lower than the maximum for AS packets. This is normal, as we only compare AS vs. BE and GS vs. BE; indeed, in both cases, the flow that benefits from QoS goes through the network nearly as quickly as it can.

GS - 100%                              BE (% of the link bandwidth)
                              0%      25%     50%     75%     100%
Transit delay (s)   - min     0.018   0.019   0.019   0.019   0.019
                    - max     0.024   0.029   0.031   0.032   0.033
                    - avg     0.019   0.020   0.022   0.023   0.025
Throughput (B/s)    sent      18289   18289   18290   18289   18289
                    received  18289   18289   18289   18289   18289
Loss rate (%)                 0.0     0.0     0.0     0.0     0.0
Table 3. GS - 100% vs. BE

Protection of the AS and GS Flows (vs. BE). The results of the measurements are presented in Figure 6, which shows the percentage of packets received with a delay less than or equal to the value denoted on the x-axis. The difference between AS and GS is now apparent in the maximum delay, which is much higher for AS. Moreover, one can observe that the transit delay of the AS packets is more widely spread than that of the GS packets. Note that the BE curve is not shown because it is outside the scale of the graph.
[Figure 6 plots, for each flow, the percentage of packets received (y-axis, 0-100%) against the end-to-end delay (x-axis, 0.00-0.06 s), for BE (100 Kbytes/s), AS (20 Kbytes/s) and GS (10 Kbytes/s). An inset table gives the transit delays (s):

       Min     Avg     Max
AS     0.019   0.025   0.043
GS     0.019   0.024   0.033
BE     0.284   4.126   4.245]
Figure 6. AS - 50% vs. GS - 50% vs. BE - 100%
5 Conclusions and Future Work
The work presented in this paper has been realised within the @IRS French project. It consists of the design, implementation and validation of a communication architecture supporting differentiated IP-level services as well as a per-flow end-to-end QoS. The proposed architecture has been presented in section 2; the network over which it is deployed has been described in section 3; finally, an experimental evaluation of the QoS provided by the defined IP services (GS, AS and BE) has been presented in section 4. Several conclusions may be stated:
• the first one is that a differentiated services architecture may be easily deployed over a VPN (Virtual Private Network)-like environment such as the one described;
• the second conclusion is that, while the experimental results correspond to the expected ones (as far as the GS, AS and BE services are concerned), the effect of the IP-level parameters (router queue lengths, WFQ weights, PBS threshold, …) appears to be a crucial point in the implementation of the services. For instance, the size of the AS queues has an influence on the offered service, short queues inducing a delay priority, longer queues inducing a reliability priority. In the same way, the PBS threshold defines the way the network reacts to out-of-profile traffic, thus providing a service more or less adapted to applications needing a base assurance and willing to use additional available resources.
Three major perspectives of this work are currently under development:
• the first one is to extend the experiments presented in this paper so as to evaluate the IP QoS when several routers are overloaded with best-effort traffic;
• the second one is to evaluate the QoS at the application level when several flows requiring the same class of service are generated;
• finally, it is our purpose to formalise the semantics of the guarantees associated with the QoS parameters, and to develop a mechanism allowing the application to be dispensed from the explicit choice of the Transport and IP services to be used.
Finally, a long-term perspective of this work is the extension to a multi-domain environment, by using, for example, bandwidth brokering.
References

1. IETF IntServ WG: http://www.ietf.org/html.charters/intserv-charter.html
2. IETF DiffServ WG: http://www.ietf.org/html.charters/diffserv-charter.html
3. TF-TANT: http://www.dante.net/tf-tant
4. Campanella, M., Ferrari, T., et al.: Specification and Implementation Plan for a Premium IP Service. http://www.dante.org.uk/tf-ngn/GEA-01-032.pdf (2001)
5. TEQUILA: http://www.ist-tequila.org
6. CADENUS: http://www.cadenus.org
7. AQUILA: http://www-st.inf.tu-dresden.de/aquila
8. GCAP: http://www.laas.fr/GCAP/
9. Campbell, A., Coulson, G., Hutchison, D.: A Quality of Service Architecture. ACM Computer Communication Review (1994)
10. Nahrstedt, K., Smith, J.: Design, Implementation and Experiences of the OMEGA End-Point Architecture. IEEE JSAC, vol. 14 (1996)
11. Gopalakrishna, G., Parulkar, G.: A Framework for QoS Guarantees for Multimedia Applications within an End System. GI Jahrestagung, Zurich (1995)
12. Chassot, C., Diaz, M., Lozes, A.: From the Partial Order Concept to Partial Order Multimedia Connection. Journal of High Speed Networks, vol. 5, no. 2 (1996)
13. Amer, P., Chassot, C., Connolly, C., Conrad, P., Diaz, M.: Partial Order Transport Service for Multimedia and Other Applications. IEEE/ACM Transactions on Networking, vol. 2, no. 5 (1994)
14. Connolly, T., Amer, P., et al.: An Extension to TCP: Partial Order Service. RFC 1693
15. Zhao, W., Olshefski, D., Schulzrinne, H.: Internet Quality of Service: An Overview. Technical Report CUCS-003-00. http://www.cs.columbia.edu/~hgs/netbib/ (2000)
16. Nichols, K., Jacobson, V., Zhang, L.: A Two-bit Differentiated Services Architecture for the Internet. November 1997
17. Heinanen, J., Baker, F., Weiss, W., et al.: Assured Forwarding PHB Group. RFC 2597
18. Jacobson, V., Nichols, K., Poduri, K.: An Expedited Forwarding PHB. RFC 2598
19. Blake, S., Black, D., Carlson, M., et al.: An Architecture for Differentiated Services. RFC 2475
20. Bonald, T., May, M., Bolot, J.: Analytic Evaluation of RED Performance. Proceedings of INFOCOM 2000, Tel Aviv (2000)
21. Ziegler, T., Fdida, S., Brandauer, C.: Stability Criteria of RED with TCP Traffic. Internal report. http://www.newmedia.at/~tziegler/red_stab.pdf (2000)
A Service Differentiation Scheme for the End-System*

Domenico Cotroneo¹, Massimo Ficco², and Giorgio Ventre¹

¹ Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli “Federico II”, Via Claudio 121, 80125 Napoli, Italy
{cotroneo, giorgio}@unina.it
² ITEM - Laboratorio Nazionale di Informatica e Telematica Multimediali, Via Diocleziano 328, 80124 Naples, Italy
[email protected]

Abstract. A number of research studies show that the operating system has a substantial influence on communication delay in distributed environments. Thus, in order to provide applications with end-to-end QoS guarantees, network resource management alone is not sufficient. To date, several approaches have been proposed, addressing QoS issues from the end-system point of view. However, while network QoS provisioning has achieved a good level of standardization, no standard proposals exist for the end-systems. We claim that a single architectural model, taking into account both end-system and network resource management, is needed. In this paper we propose a QoS architecture which extends the concept of service differentiation inside the end-system. A system prototype has been developed and tested on a Diffserv network scenario. The prototype incorporates a priority-based communication mechanism inside the end-point operating system and a local marker, so as to implement an appropriate mapping between local classes of service and network QoS levels. Experimental results have also been provided in order to evaluate the impact of the proposed architecture on Diffserv routers.
1 Introduction
The problem of providing real-time communications between hosts over the Internet is now better understood. A number of research studies show that the operating system has a substantial influence on communication delay in distributed environments. These studies have demonstrated that, in order to provide applications with QoS guarantees, network resource management alone is not sufficient. End-system resources, such as processing capacity, also have to be taken into account, and they have to be managed in concert with the networking ones. With respect to the network, experience over the Internet has shown the lack of a fundamental technical element: real-time applications do not work well across the *
This work has been performed with the support of the Italian Ministry of Research and University (MURST) under the Progetti Cluster grant "Cluster 16 Multimedialità: LABNET2" and under the PRIN grant "MUSIQUE".
D. Shepherd et al. (Eds.): IDMS 2001, LNCS 2158, pp. 99-109, 2001. Springer-Verlag Berlin Heidelberg 2001
100
Domenico Cotroneo, Massimo Ficco, and Giorgio Ventre
network because of variable queueing delays and congestion losses. The Internet, as originally conceived, offers only a very simple quality of service: point-to-point best-effort data delivery. Thus, before real-time applications can be broadly used, the Internet infrastructure must be modified in order to support more stringent QoS guarantees. Based on these assumptions, the Internet Engineering Task Force (IETF) has defined two different models for network QoS provisioning: Integrated Services [1] and Differentiated Services [2]. These frameworks provide different, yet complementary, solutions to the issue of network support for quality of service. A number of researchers have dealt with the issue of providing quality of service guarantees inside network end-points. David K.Y. Yau and Simon S. Lam proposed an end-system architecture designed to support network QoS [3]. They implemented a framework, called Migrating Sockets, composed of a user-level protocol that minimizes hidden scheduling on the sender side, and an active demultiplexing mechanism with constant overhead on the receiver side. With Migrating Sockets, performance-critical protocol services such as send and receive are accessed as a user-level library linked with applications. An architectural model providing quality of service guarantees to delay-sensitive networked multimedia applications, including end-systems, has been proposed by K. Nahrstedt and J. Smith [4]. K. Nahrstedt and J. Smith also proposed a new model for resource management at the end-points, called the QoS Broker [5]. The broker orchestrates resources at the end-points, coordinating resource management across layer boundaries. As an intermediary, it hides implementation details from applications and per-layer resource managers. The QoS Broker manages communication among the entities to create the desired system configuration. Configuration is achieved via QoS negotiation resulting in one or more connections through the communication system.
While we recognize that the cited works represent fundamental milestones in the pursuit of quality of service mechanisms inside the end-systems, we claim that the time is ripe for taking an integrated view of QoS-enabled communication. A single architectural model is thus needed, integrating network-centric and end-system-centric approaches and taking into account the most relevant standards proposed by the research community. In this work we propose a QoS architecture which extends the concept of service differentiation inside the end-system. A system prototype, running on Sun Solaris 2.6, has been developed and tested in a Diffserv network scenario. The prototype incorporates a priority-based communication mechanism inside the end-point operating system, called the Priority Broker [6], and a local marker, so as to implement an appropriate mapping between local classes of service and network QoS levels. The rest of the paper is organized as follows. In section 2 we briefly introduce the architecture and the services provided by the Priority Broker. In section 3 we illustrate the design and implementation of an enhanced version of the Priority Broker, including a local marker and a more efficient buffer management scheme. In section 4 we present the DiffServ network testbed we adopted. In section 5 we provide a critical discussion of the experimental results, obtained from the execution of the Priority Broker (PB) on each end-system connected to the Diffserv network. Finally, conclusions and future work are described in section 6.
2 The Priority Broker (PB)
The PB, an earlier prototype of which is presented in [6], is an architectural model that supports communication flows with a guaranteed priority level. It is conceived as a daemon process running on the end-system, which provides an API (Broker Library Interface) enabling applications to request the following three classes of communication service:
1. Best-effort: no QoS guarantees;
2. Low priority: for adaptive applications;
3. High priority: for real-time applications.
It is implemented as a new software module residing between applications and the kernel and provides an additional level of abstraction, capable of interacting with the operating system in order to provide the applications with a well-defined set of differentiated services, while monitoring the quality of the communication by means of flow control techniques.
Fig. 1. Conceptual schema of the Priority Broker architecture.
The Priority Broker is composed of four components: the sender-broker, the receiver-broker, the communication subsystem, and the connection manager. The sender-broker is responsible for the transmission of packets with the priority required by end-users, while the receiver-broker manages incoming packets. The communication subsystem provides the I/O primitives that enable discrimination, in the kernel I/O subsystem, among outgoing packets. Finally, the connection manager is in charge of handling all service requests on behalf of the local and remote processes.
The PB, in turn, relies on the services made available by both the transport provider and the I/O mechanisms of the operating system upon which it resides. Further implementation details can be found in [6].
3 An Enhanced Version of the Priority Broker
As we mentioned, we envision a distributed environment where resources are explicitly requested by the applications and appropriately managed by the local OS, which is also in charge of mapping local requests onto the set of services made available by the network. More precisely, we enforce a correspondence between end-system priorities and network classes of service, thus obtaining an actual end-to-end Per Hop Behavior, where the first and last hops take into account transmission between end hosts and network routers. To this purpose, in this section we propose an extension of the architecture described in [6]. The enhancement we propose concerns the sender-broker: we introduce a new module, the Marker, in charge of mapping local requests onto the set of services made available by the Diffserv network. The overall architecture is depicted in Figure 2.
Fig. 2. End-to-end communication schema
As the figure shows, communication in this scenario takes place in the following steps:
1. A sender application selects its desired level of service among the set of available QoS classes and asks the local connection manager to appropriately set up the required resources.
2. The connection manager performs the local admission control strategy.
3. Once the application calls the BLI send primitive, data is queued in the part of shared memory assigned to the application.
4. The sender-broker transmits the packets with the required priority.
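The steps above can be sketched in a few lines of illustrative code. All class and method names here are hypothetical (the paper does not specify the BLI API at this level of detail); the sketch only shows the division of labour between admission control, shared-memory queuing, and priority transmission.

```python
# Illustrative sketch of the BLI send path: admission control at the
# connection manager, per-application queuing, and priority-ordered
# draining by the sender-broker. Names are assumptions, not the PB API.

BEST_EFFORT, LOW_PRIORITY, HIGH_PRIORITY = 0, 1, 2

class ConnectionManager:
    """Performs a (trivial) local admission control strategy."""
    def __init__(self, max_flows=8):
        self.max_flows = max_flows
        self.flows = {}                  # app_id -> service class

    def request_service(self, app_id, service_class):
        if len(self.flows) >= self.max_flows:
            return False                 # admission control rejects the flow
        self.flows[app_id] = service_class
        return True

class SenderBroker:
    """Queues messages per application and drains them by priority."""
    def __init__(self, manager):
        self.manager = manager
        self.queues = {}                 # app_id -> list of queued messages

    def bli_send(self, app_id, message):
        if app_id not in self.manager.flows:
            raise RuntimeError("flow was not admitted")
        self.queues.setdefault(app_id, []).append(message)

    def transmit_next(self):
        # Pick the non-empty queue whose flow has the highest priority.
        ready = [a for a, q in self.queues.items() if q]
        if not ready:
            return None
        app = max(ready, key=lambda a: self.manager.flows[a])
        return self.manager.flows[app], self.queues[app].pop(0)
```

In this model, an admitted high-priority flow is always drained before an admitted best-effort one, matching the service discrimination described in the text.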
The sender-broker is structured as a multithreaded process. Each thread is in charge of managing and transmitting data with a required priority level. The Marker is part of the sender-broker; it sets the DS codepoint corresponding to the required class of service. To this end, an appropriate mapping has been defined between local classes of service and network QoS levels. The solution we propose is shown in the following figure.
Fig. 3. Marker’s mapping schema
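The paper does not list the concrete codepoints used by the Marker, but a plausible mapping, assuming the standard DSCP values of RFC 2597 (AF11) and RFC 2598 (EF), can be sketched as follows; the IPv4 TOS byte carries the 6-bit DSCP in its upper bits.

```python
# Sketch of the Marker's mapping from local service classes to Diffserv
# codepoints. The concrete DSCP values are assumptions taken from
# RFC 2597 (AF11) and RFC 2598 (EF), not from the paper.

DSCP = {
    "best-effort":   0x00,   # default PHB
    "low-priority":  0x0A,   # AF11 (Assured Forwarding class 1, low drop)
    "high-priority": 0x2E,   # EF (Expedited Forwarding)
}

def tos_byte(service_class):
    """Value to write into the IPv4 TOS field: the 6-bit DSCP
    shifted into the upper six bits of the byte."""
    return DSCP[service_class] << 2
```

A high-priority flow would then be sent with TOS value 0xB8, the conventional on-the-wire byte for EF traffic.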
To transmit packets with the priority assigned by the user by means of the BLI, the sender-broker uses the following mechanisms, as shown in figure 4:
1. I/O primitives made available by the operating system, to assure flow control on the sender side when accessing the network;
2. Shared memory, to avoid multiple copies between applications and kernel when sending messages;
3. Synchronization mechanisms, to implement the communication protocol between user applications and the PB.
As far as I/O primitives are concerned, our PB runs on Solaris 2.6, which provides a powerful I/O mechanism: the STREAMS framework [10]. STREAMS uses message queuing priority. All messages have an associated priority field: default messages have a priority of zero, while priority messages have a priority band greater than zero. Non-priority, ordinary messages are placed at the end of the queue, following all other messages that may be waiting. Priority band messages are placed below all messages that have a priority greater than or equal to their own, but above any with a smaller priority. To discriminate among three different classes of service, we found, by trial and error, that the best choice was to use a priority of 0 for Best Effort, 1 for Low Priority and 255 for High Priority.
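The band-ordering rule described above can be simulated in a few lines; this is a sketch of the STREAMS queuing discipline, not the actual Solaris kernel code.

```python
# Simulation of STREAMS priority-band queuing: a new message is placed
# after all queued messages whose band is greater than or equal to its
# own, and before any message with a smaller band (FIFO within a band).

def enqueue(queue, band, payload):
    """Insert (band, payload) into queue per the STREAMS rule."""
    pos = len(queue)
    for i, (b, _) in enumerate(queue):
        if b < band:
            pos = i
            break
    queue.insert(pos, (band, payload))
    return queue

# The PB's chosen bands: 0 = Best Effort, 1 = Low Priority,
# 255 = High Priority.
```

Enqueuing a High Priority message (band 255) therefore places it ahead of all pending Low Priority and Best Effort messages, while two Best Effort messages keep their arrival order.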
Fig. 4. Sender-Broker architecture.
4 Experimental Testbed
We deployed the system over the heterogeneous platform illustrated in figure 5. The figure illustrates a Diffserv network, composed of three routers (Aphrodite, Gaia, Zeus) based on the Class Based Queuing (CBQ) algorithm [7]. The routers are low-end PCs running FreeBSD. In order to be able to take measurements, a monitor application is installed on each router.
Fig. 5. Testbed DiffServ architecture: end-systems Furore (Solaris 2.6), Minori (Solaris 2.5.1), Grid2k (Solaris 2.6), Cronus (FreeBSD 3.3.4) and Helios (Linux RedHat 5.1), interconnected through the CBQ routers Aphrodite, Gaia and Zeus (FreeBSD 3.3.4)
During the experiments, the following Sun workstations were involved as end-systems: Furore, Minori and Grid2k. Furore and Minori, based respectively on Solaris 2.6 and Solaris 2.5.1, are connected to Aphrodite by a Fast Ethernet switch, while Grid2k is directly wired to Zeus. The Priority Broker, which consists of a daemon process and the BLI user-level library, is installed on each end-system. Furore and Minori act as senders, while Grid2k acts as receiver. Several sender applications can be launched on these end-systems. On the sender side (Furore and Minori) the PB takes care of marking the outgoing IP packets, by setting the TOS field as described in section 3. The CBQ routers are configured as follows: 98% of the bandwidth is assigned to the Best Effort (BE) class; 96% of this is assigned to the Assured Forwarding (AF) class, and 95% of that to the Expedited Forwarding (EF) class. The resulting bandwidth allocation hierarchy is:
1. 2% of the total bandwidth to the BE traffic;
2. 1% of the total bandwidth to the AF traffic;
3. 95% of the total bandwidth to the EF traffic.
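Taking the leaf percentages listed above together with the 2 Mbit/s link capacity reported in section 5, the absolute per-class rates work out as follows (simple arithmetic on the stated figures, not measurements from the paper):

```python
# Absolute bandwidth shares implied by the CBQ leaf allocation,
# assuming the 2 Mbit/s router capacity reported in section 5.

LINK_KBPS = 2000                         # 2 Mbit/s expressed in kbit/s
SHARES = {"BE": 0.02, "AF": 0.01, "EF": 0.95}

def rate_kbps(cls):
    """Bandwidth in kbit/s guaranteed to a traffic class."""
    return LINK_KBPS * SHARES[cls]
```

That is, roughly 40 kbit/s for BE, 20 kbit/s for AF, and 1900 kbit/s for EF traffic.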
We chose this configuration in order to better discriminate among the flows observed by the monitors.
5 Experimental Results
In this section we present some of the experiments we performed in order to investigate the behaviour of the overall architecture. The experiments were performed on the testbed illustrated in section 4. The maximum bandwidth that can be managed by the routers is 2 Megabits per second. All the data presented in the following are measured by the monitor application running on the routers. The first test was run to compare the behaviour of the routing activity with and without the PB active. Three sender applications run concurrently on the workstation Furore. The senders, A, B and C, use different classes of service: BE, AF and EF respectively. The sender applications are launched in the following sequence: sender A first, sender B second, and sender C third. Each application transmits about 20000 messages of 1200 bytes.
Fig. 6. Test results without the Priority Broker
First, we performed measurements without executing the PB on the end-systems. Figure 6 shows that the differentiated services are scheduled by the routers according to the CBQ router configuration. The two graphs depicted in figure 7 report the same experiment with the PB running on the end-systems. The left side of figure 7 shows that sender A (BE traffic) is able to transmit until B (AF traffic) starts. As soon as sender B starts, the PB preempts the resources allocated to A, letting only sender B transmit. In a similar way, sender B is preempted as soon as C (EF traffic) starts. The right side of figure 7 shows that sender A resumes its transmission only when B and C end. Together, the two graphs show the impact of the PB on the router activity: lower-priority classes of service belonging to the same end-system are scheduled directly at the end-system, because of the PBs running there. We believe that such behaviour emphasizes the differentiation performed inside the Diffserv routers.
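The preemption timeline observed in figure 7 can be summarised with a small model: at any instant, only the highest-priority flow with queued data on an end-system is on the wire. This is an illustrative model of the observed behaviour, not the PB implementation.

```python
# Model of the first experiment: strict priority among senders A (BE),
# B (AF) and C (EF) on a single end-system. `active` is the set of
# sender classes that still have data; the PB lets only the highest
# class transmit at any given time.

PRIORITY = {"BE": 0, "AF": 1, "EF": 2}

def transmitting(active):
    """Return which sender class is on the wire, or None if idle."""
    if not active:
        return None
    return max(active, key=PRIORITY.get)
```

Replaying the experiment's timeline with this model reproduces figure 7: A transmits alone, is preempted when B starts, B is preempted when C starts, and A resumes only once both B and C have finished.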
Fig. 7. Test result with the Priority Broker
The second test was run to evaluate the combined effect of the PB and the Diffserv routers in a real-world scenario. In this case four sender applications run concurrently on two different workstations: two of them run on the workstation Furore and the others on the workstation Minori. The two senders running on Minori use the BE and AF priorities respectively, while the senders running on Furore use BE and EF. The senders start transmitting their flows in the following sequence:
1. BE flow on Furore;
2. BE flow on Minori;
3. AF flow on Minori;
4. EF flow on Furore.
Fig. 8. Test Results in a real world scenario with PB
108
Domenico Cotroneo, Massimo Ficco, and Giorgio Ventre
As soon as Minori and Furore start their AF and EF senders respectively, no more BE traffic is scheduled by the routers. This is due to the preemption performed by the PBs at the end-systems. The BE traffic resumes when the AF and EF flows end (second graph in figure 8). Figure 9 shows the same experiment when the PBs were not running.
Fig. 9. Test results in a real world scenario without PB
We believe that such behaviour emphasizes the differentiation performed inside the Diffserv routers. We are currently investigating the benefits of the PB from an end-to-end point of view.
6 Conclusion and Future Work
This work has presented a QoS architecture, called the Priority Broker, which extends the concept of service differentiation inside the end-system. A system prototype, running on Sun Solaris 2.6, has been developed and tested in a Diffserv network scenario. The prototype incorporates a priority-based communication mechanism inside the end-point operating system and a local marker, so as to implement an appropriate mapping between local classes of service and Diffserv classes of service. The PB architecture consists of a daemon process and a user-level library, called the Broker Library Interface (BLI). We integrated our PB into the Diffserv architecture, thus obtaining a single architectural model based on the central concept of service differentiation: the PB discriminates services inside the end-system, the Diffserv routers outside of it. Experiments conducted on the system prototype demonstrated the influence of the end-system on the router activity: PBs running on the end-systems "help" the router scheduler to better discriminate among its classes of service.
Future work will aim to:
1. Refine measurements in order to evaluate the end-to-end behaviour of applications using our architecture;
2. Investigate the porting of this architecture to different operating systems. We are currently implementing a new version of the PB running on Linux Real Time; in order to discriminate services inside the end-system, in this case the communication subsystem will exploit the real-time features instead of the I/O mechanisms.
References
1. R. Braden, D. Clark, and S. Shenker: "Integrated Services in the Internet Architecture: an Overview". RFC 1633, July 1994.
2. S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss: "An Architecture for Differentiated Services". RFC 2475, Dec. 1998.
3. D. K. Y. Yau and S. S. Lam: "Migrating Sockets - End System Support for Networking with Quality of Service Guarantees". IEEE/ACM Transactions on Networking, Vol. 6, December 1998.
4. K. Nahrstedt and J. Smith: "A Service Kernel for Multimedia Endstations". Technical Report, MS-CIS, University of Pennsylvania.
5. K. Nahrstedt and J. Smith: "The QoS Broker". IEEE Multimedia, Spring 1995, Vol. 2, No. 1, pp. 53-67.
6. D. Cotroneo, M. Ficco, S. P. Romano and G. Ventre: "Bringing Service Differentiation to the End System". IEEE Conference ICON 2000, Oct. 2000.
7. S. Floyd and V. Jacobson: "Link-sharing and resource management models for packet networks". IEEE/ACM Transactions on Networking, 3(4), August 1995.
8. G. Banga and P. Druschel: "Resource containers: A new facility for resource management in server systems". 3rd USENIX Symposium on Operating Systems Design and Implementation, New Orleans, Feb. 1999.
9. G. Banga: "Operating System Support for Server Applications". PhD Thesis, Rice University, May 1999.
10. U. Vahalia: UNIX Internals: The New Frontiers. Prentice Hall International, 1996.
Enabling the Internet to Provide Multimedia Services Markus Hofmann Bell Labs, Holmdel, New Jersey, USA
[email protected]
Abstract. The Internet as of today is still mostly governed by the end-to-end principle, which demands that the network itself be kept as simple as possible and that all intelligence reside at the end-systems. This principle has proved very successful and beneficial for the evolution of the Internet. Despite its success, we have recently seen more application-specific functionality moving into the network, in particular to the edges of the network. The deployment of network caches and content-aware switches is probably the most widely known example of this kind of functionality. It helps accelerate the delivery of static web pages by moving content closer to the user. However, margins for such basic delivery services are getting slimmer. Service providers have to take advantage of opportunities to provide new value-added content services for differentiation and additional revenue. Examples of such services include, but are not limited to, content filtering, content adaptation, dynamic and personalized content assembly, ad insertion and virus scanning. This talk outlines the evolution from traditional web caching towards a flexible and open architecture to support a variety of content-oriented multimedia services. Enhancements include added support for streaming media, a component for global traffic redirection, and a set of protocols and interfaces for value-added features, such as compression, filtering, or transformation. Additional functionality can be integrated in different components, or be available as plug-ins. The talk will conclude with service examples currently being implemented on top of such a platform.
D. Shepherd et al. (Eds.): IDMS 2001, LNCS 2158, pp. 110-110, 2001. Springer-Verlag Berlin Heidelberg 2001
Design and Application of TOAST: An Adaptive Distributed Multimedia Middleware Platform
Tom Fitzpatrick1, Julian Gallop2, Gordon Blair3, Christopher Cooper2, Geoff Coulson1, David Duce4 and Ian Johnson2
1 Computing Department, Lancaster University, Lancaster, LA1 4YR UK. {tf, geoff}@comp.lancs.ac.uk
2 CLRC Rutherford Appleton Laboratory, Chilton, Didcot, Oxfordshire, OX11 0QX UK. {J.R.Gallop, C.S.Cooper, I.J.Johnson}@rl.ac.uk
3 Department of Computer Science, University of Tromsø, 9037 Tromsø, Norway (on leave from Lancaster University). [email protected]
4 Oxford Brookes University, Gipsy Lane, Oxford, OX3 0BP UK. [email protected]
Abstract. The rise of mobile computing and wireless network technology means that, increasingly, applications must adapt to their environment, in particular network connectivity and resource availability. This paper outlines the TOAST middleware platform which provides component-oriented CORBA support for adaptive distributed multimedia applications. In particular, the paper examines how the areas of reflection and open implementation have shaped the approach to adaptation support in TOAST. The paper then discusses novel ongoing research which is investigating middleware support for distributed cooperative visualization using TOAST as a base.
1 Introduction
The modern computer user is increasingly likely to be mobile. With the advent of powerful laptop computers, personal digital assistants (PDAs) and, more importantly, wireless networking technology, mobile computing is a fast-growing area. Understandably, the mobile computing user expects applications to function regardless of mobility issues, in the same way that cellular voice or fax services are (generally) expected to be available regardless of location. To achieve this ideal, there are, unsurprisingly, many challenges to be overcome. Perhaps the greatest hurdle for mobile computing is that the Quality of Service (QoS) differences between mobile network types make transparent use of most networked applications impossible. Today, a mobile user may roam from a desktop scenario with a 10 or (more commonly) 100 Mbit/s Ethernet connection, around a local site with 2-10 Mbit/s Wireless LAN connectivity, or in the wide area with a low-bandwidth GSM dialup or TETRA wireless link. In addition to connectivity changes, resource availability may also vary. For example, a laptop computer may reduce its processor power to increase battery life when mobile. In addition, PC-CARD devices such as media capture or compression cards may need to be removed to make room for network interface cards, thereby reducing the capabilities of the system or placing a higher load on software. In almost all cases, conventional applications with fixed QoS constraints simply cannot function correctly in this dynamic environment. As a result, a new class of application called adaptive applications has emerged [3, 7]. As their name implies, such applications adapt to QoS changes so that they remain functional in a mobile or changing environment. The TOAST multimedia middleware platform, developed in the Adapt Project [1] and being extended in the Visual Beans Project [25, 13], provides a component-oriented framework that supports the development of adaptive distributed multimedia applications. This paper is structured as follows. Section 2 outlines the design of the TOAST middleware platform, while section 3 describes how TOAST supports adaptive applications. Section 4 presents some details about the current implementation of TOAST, while section 5 discusses how ongoing research is investigating the application of TOAST to the area of distributed cooperative visualization in the Visual Beans Project.
D. Shepherd et al. (Eds.): IDMS 2001, LNCS 2158, pp. 111-123, 2001. Springer-Verlag Berlin Heidelberg 2001
2 The Design of TOAST
2.1 Introduction
The Toolkit for Open Adaptive Streaming Technologies (TOAST) is a CORBA-based multimedia middleware platform. TOAST is an attempt not only to implement multimedia support in CORBA in an RM-ODP [2] inspired manner, but to do this in a way which provides the same plug-and-play component-based mechanism as common multimedia programming frameworks [16, 18, 5]. While basic CORBA multimedia streaming support exists [20], it does not truly follow the RM-ODP model, nor the concepts of component-orientation [24]. A key feature is that TOAST is entirely designed and specified in terms of CORBA IDL, allowing it to transcend language and platform boundaries. While this section presents a brief overview of the design of the TOAST platform, a full discussion of the design and implementation of TOAST can be found in [10].
2.2 Component Model
The main abstraction in TOAST is that of a multimedia processing component: the basic unit from which applications are built. Examples of components include audio input/output components, video grabbers, network sources/sinks and filters. Following a dataflow or ‘plug-n-play’ model, components exchange continuous media and other data through stream interfaces and flow interfaces. Flow interfaces are the fundamental endpoint for media interaction in TOAST and come in two varieties, input and output. Stream interfaces are a direct realization of the RM-ODP concept of a stream interface: an interface which groups one or more flow interfaces into a single composite interaction point, thus simplifying otherwise complicated connection processes. A commonly-used example of a stream interface is
a (bi-directional) telephone handset with two unidirectional audio flows. An IDL definition of such a stream interface in TOAST is shown below:

interface Handset: TOAST::Stream {
    INFLOW audioIn;
    OUTFLOW audioOut;
};
interface VideoHandset: Handset {
    INFLOW videoIn;
    OUTFLOW videoOut;
};
Components can possess both flow and stream interfaces, allowing more traditional RM-ODP compliance using only stream interfaces, or a more low-level ‘plug-n-play’ multimedia programming framework approach with flow interfaces. An IDL definition of a component with both flow and stream interfaces appears below:

interface Phone: TOAST::Component {
    STREAM handset;
    OUTFLOW recordChannelOut;
};
While components may statically declare their stream and flow interfaces as attributes of their main interface (INFLOW, OUTFLOW and STREAM are pre-processor macros), underlying mechanisms allow these to be discovered through a dynamic querying/enumeration process. Also, rather than use enriched IDL to (statically) declare a flow interface's media or information type, this information is always queried dynamically from the flow interface itself. In addition to stream and flow interfaces, TOAST provides components with event-based interaction. This form of interaction is more asynchronous, dynamic and decoupled than traditional CORBA interaction, and is ideal for the type of application that TOAST is designed for. The programmer can enumerate and query which events a component supports at run-time, and indeed even be notified when new event types are made available by a component. Event-based interaction in TOAST is less ambitious and consequently more lightweight than the CORBA Event Service [19], and is seen as complementary to that service rather than as a replacement. 2.3 Basic Component Interaction Local binding in TOAST is the process by which two collocated stream or flow interfaces are bound together to enable media data to flow between them. Since the local binding of stream interfaces is essentially a compound operation involving the local binding of all compatible pairs of flow interfaces, this section will only examine flow interface local binding. Flow interface local binding is strictly typed, requiring both flow interfaces to agree on a single, common mediatype which they will exchange. The process of mediatype negotiation is an important one, since without it the programmer must ensure that all connections are meaningful, reducing extensibility and allowing inconsistency. 
This process of type-checking connections between multimedia processing components is found in DirectShow [5] but not in systems such as MASH [18] or the Berkeley CMT [16] where, for example, a video capture component could be connected to an audio renderer – presumably with painful results.
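The mediatype negotiation described above might be sketched as follows. All names here are hypothetical, and TOAST's actual IDL-level negotiation protocol is more involved; the sketch only illustrates how strict typing rejects meaningless connections.

```python
# Sketch of strictly typed flow-interface local binding: the two
# endpoints must agree on a single common mediatype, so incompatible
# connections (e.g. video capture -> audio renderer) are rejected.

class FlowInterface:
    def __init__(self, name, mediatypes):
        self.name = name
        self.mediatypes = mediatypes    # in TOAST this is queried dynamically

def local_bind(outflow, inflow):
    """Return the negotiated mediatype, or raise on a type mismatch."""
    common = [m for m in outflow.mediatypes if m in inflow.mediatypes]
    if not common:
        raise TypeError(
            f"cannot bind {outflow.name} to {inflow.name}: no common mediatype")
    return common[0]                    # first mutually supported type wins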
2.4 Distributed Component Interaction
Whereas collocated flow and stream interfaces can be directly locally bound, distributed interaction requires a binding object, a concept defined by RM-ODP. Binding objects (actually standard TOAST components, albeit complex ones) are essential to allow components to interact over a network, or between incompatible address spaces. Binding objects, or more simply bindings, are responsible for the complete end-to-end communications path between the components that they bind.

Figure 1. Complex binding with a binding object
Binding objects are created by a binding factory and are locally bound to their client components in a semi-recursive fashion. Note that bindings can be arbitrarily complex and are in fact special types of TOAST component. Fig. 1 presents an example of how both local and complex binding is used to build distributed applications. Binding factories are a crucial part of the TOAST architecture, since they allow components to interact in a distributed manner by creating binding objects to fulfil a particular role. In addition, in combination with the provision of new component implementations via component factories, augmenting the range of available types of binding through the addition of new binding factories is the principal means of achieving extensibility in TOAST.

Figure 2. Hierarchy of binding factories for recursive binding
To encourage extensibility, it is intended that binding factories be available to the programmer as a set of distributed services on the network. Furthermore, the process of creating a binding object is intended to be recursive, whereby higher-level binding factories call upon lower-level binding factories to aid their construction efforts (see Fig. 2). This process of recursive binding is explained in [2].
2.5 Component Factories
The dynamic capability discovery features of TOAST components are complemented by component factories, which allow component instances to be created dynamically. Each TOAST application or process automatically possesses a component factory allowing components to be created in that process. A component factory exports a CORBA interface, thus allowing it to be accessed from anywhere in the network (permissions allowing). The TOAST API allows one to create components either in-process, on a given node or collocated with another component, with a single call. In addition, new component implementations may be added to factories at run-time.
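The recursive binding process of Fig. 2 can be sketched with a few hypothetical factory classes (the real TOAST factories are CORBA services; the class names and string "bindings" here are purely illustrative):

```python
# Sketch of recursive binding: each higher-level binding factory
# delegates to a lower-level factory for its transport, wrapping
# the result with its own components.

class UDPBindingFactory:
    def create_binding(self, src, sink):
        return f"udp({src}->{sink})"

class RTPBindingFactory:
    def __init__(self, transport_factory):
        self.transport = transport_factory

    def create_binding(self, src, sink):
        inner = self.transport.create_binding(src, sink)
        return f"rtp[{inner}]"

class VideoBindingFactory:
    def __init__(self, rtp_factory):
        self.rtp = rtp_factory

    def create_binding(self, src, sink):
        # Wrap the recursively created transport binding with codecs.
        inner = self.rtp.create_binding(src, sink)
        return f"encoder->{inner}->decoder"
```

A video binding factory thus never handles sockets itself: it composes an RTP binding, which in turn composes a UDP binding, mirroring the factory hierarchy of Fig. 2.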
3 Support for Adaptation in TOAST
3.1 Approach
TOAST adopts a framework approach to adaptation, in which specific algorithms and techniques are made possible through a generic support architecture. The benefits of this approach include extensibility with regard to adaptation techniques and mechanisms. This generic support architecture is heavily influenced by the concepts of open implementation and reflection.
3.2 Use of Open Implementation Techniques
Background. Open implementation [14] seeks to overcome the limitations of the ‘black box’ approach to software engineering by opening up key aspects of an implementation to the programmer. Reflection [22] is often used to provide a principled means of open implementation by separating the functionality provided by a system (its interface) from the underlying implementation. In a reflective system, a meta-interface allows one to manipulate a causally-connected self-representation of the system’s implementation through inspection and adaptation. Open implementation and reflection were applied initially to programming languages [14, 26] but have also been used in windowing systems [22], operating systems [29] and CORBA [17, 23]. Complementary work at Lancaster investigates the use of reflection techniques in more generalized middleware in the OpenORB project [32].
Open Bindings. TOAST applies reflection and open implementation to supporting adaptation, for example for mobile computing environments. Bindings are responsible for the entire end-to-end communications process, including resource management,
Tom Fitzpatrick et al.
signalling, flow filtering/compression and scheduling. To support mobile computing, it is crucial that the application be able to exert some control over binding objects [7]. However, providing this control via a control interface on a binding has several drawbacks. In particular, it is difficult, if not impossible, to design a general means of achieving adaptation, mainly due to the sheer proliferation of possible actions, for example in the mobile computing domain [12]. In addition, such an approach would be neither extensible nor dynamic, and would, for example, preclude the use of new adaptation techniques or components introduced during the lifetime of an application or binding. Instead of a control interface, TOAST allows bindings to offer a meta-interface which operates on a causally-connected self-representation of the implementation of the binding object. This self-representation is modelled as a component graph: the nodes of the graph are the TOAST components that make up the binding’s implementation, and the arcs between them represent the local bindings between these components. Adaptation is achieved by directly inspecting and adapting this self-representation, as will be discussed in Section 3.3.

[Fig. 3 depicts an MPEG Video Binding with a meta-interface, whose implementation graph comprises a Raw Video Source, MPEG Video Encoder, RTP Sender, RTP Receiver, MPEG Video Decoder and Raw Video Sink, with nested RTP and UDP bindings.]
Figure 3. An open binding and its implementation graph
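The self-representation just described can be sketched as a small graph structure. This is an illustrative model only, assuming hypothetical names (`ComponentGraph`, `add_node`, `add_arc`), not the TOAST meta-interface itself; the example populates it with the MPEG-over-RTP pipeline of Fig. 3.

```python
# Illustrative sketch (not the TOAST API) of a binding's causally-connected
# self-representation: nodes are components, arcs are local bindings.
class ComponentGraph:
    def __init__(self):
        self.nodes = {}   # opaque component id -> component object
        self.arcs = set() # (from_id, to_id) pairs: local bindings

    def add_node(self, cid, component):
        self.nodes[cid] = component

    def add_arc(self, src, dst):
        self.arcs.add((src, dst))

# The pipeline of Fig. 3, expressed as such a graph:
g = ComponentGraph()
chain = ["RawVideoSource", "MPEGVideoEncoder", "RTPSender",
         "RTPReceiver", "MPEGVideoDecoder", "RawVideoSink"]
for cid in chain:
    g.add_node(cid, object())
for a, b in zip(chain, chain[1:]):
    g.add_arc(a, b)
```

In a causally-connected system, mutating this graph would actually rewire the running binding; the sketch captures only the representation, not that causal link.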
Bindings which expose their implementation in this way are called open bindings. An example of an open binding is shown in Fig. 3. Note the use of hierarchical open bindings in the example, where the various subordinate bindings resulting from a recursive binding process can also have their implementations accessed. This recursion is terminated by primitive components which do not expose their implementation, for example standard media processing/filtering components or UDP network bindings. TOAST uses interface mapping to associate flow interfaces on internal implementation components with a binding’s externally visible interfaces in a principled manner.

3.3 Adaptation through Open Implementation

Introduction. As mentioned previously, adaptation is achieved in TOAST by inspecting and adapting the implementation of a binding via its meta-interface; effectively a procedural, rather than declarative, approach to QoS management [1]. This approach is the result of earlier work at Lancaster investigating the use of logic or QoS attributes to specify QoS requirements [6]. While valid for certain types of scenario, we have found that declarative techniques do not extend well to the mobile, adaptive environment targeted by TOAST, for example in the face of greatly varying
Design and Application of TOAST
network connectivity. Adaptation in TOAST falls into two categories, minor adaptation and major adaptation, which are discussed next.

Minor Adaptation. This form of adaptation is the most basic supported by TOAST. It allows fine-grained changes, or tweaks, to be made to the operation of a binding while it is running. The intended use is to allow the programmer or an adaptive agent (e.g. a control script or active object) to counteract slight changes in the environment, such as network QoS fluctuations or changes in end-system resource availability. The first step in minor adaptation is the identification and location of the appropriate components within a binding’s implementation graph using the binding’s meta-interface. Once located, these components may be interrogated, for example to determine the true state of QoS or performance. If desired, adaptation can then be achieved by accessing these components’ control interfaces to alter their operation. An example of minor adaptation would be locating a primitive network binding component that offers network QoS reports via its interface (e.g. through TOAST events), and using this information to change the encoding rate/quality of a video encoder component elsewhere in the implementation graph.

[Fig. 4 depicts an RTP Sender component exposing a Component interface (RTPSender: Component) together with PacketSizeSet, TransmissionStats and TransportDest interfaces.]
Figure 4. Restricting access to implementation-graph components
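The minor-adaptation example above, reading QoS reports from a network binding and tuning an encoder in response, might look like the following sketch. The component names, the `qos_report`/`set_rate` methods and the 5% loss threshold are all assumptions for illustration, not TOAST interfaces.

```python
# Sketch of minor adaptation (all names hypothetical): locate components in
# the implementation graph via the meta-interface, interrogate one, and use
# the result to adjust another through its control interface.
def minor_adapt(graph):
    net = graph["UDPBinding"]      # located via the binding's meta-interface
    enc = graph["VideoEncoder"]
    report = net.qos_report()      # e.g. delivered through TOAST events
    if report["loss"] > 0.05:      # sustained loss: back the encoder off
        enc.set_rate(enc.rate * 0.75)

# Stand-in components so the sketch is runnable:
class FakeNet:
    def qos_report(self):
        return {"loss": 0.10}      # 10% packet loss reported

class FakeEnc:
    def __init__(self):
        self.rate = 1_000_000      # bit/s
    def set_rate(self, r):
        self.rate = r

graph = {"UDPBinding": FakeNet(), "VideoEncoder": FakeEnc()}
minor_adapt(graph)
# encoder rate reduced from 1000000 to 750000.0 bit/s
```

Note that only control interfaces are touched; the structure of the binding's implementation graph is left unchanged, which is what distinguishes minor from major adaptation.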
While such techniques can be used to good effect, it is also important to allow a meta-interface to maintain consistency in the face of adaptation. For example, one could easily access a network source component in a binding and change its destination address, thereby destroying the operation of the binding. For this reason, the meta-interface and implementation component graph refer to components using opaque IDs; to access a component’s interface, one must request it specifically. This allows a meta-interface to restrict access to certain ‘safe’ components or, by way of TOAST’s COM-style multiple interface support, even to only certain interfaces on certain components (see Fig. 4).

Major Adaptation. The second, and most drastic, form of adaptation supported by TOAST is major adaptation. As with minor adaptation, one uses a binding’s meta-interface to inspect its implementation; however, major adaptation involves changing the structure of the implementation graph to alter the implementation of the binding. Since a binding’s implementation is represented as a graph, the meta-interface provides methods to add and remove nodes and arcs (components and local bindings) to and from this graph. In addition, a number of higher-level, or compound, adaptation operations are provided. These include operations for transparently hot-swapping one side of a local binding and for completely replacing one component with another ‘similar’ one.
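A compound operation of the kind just mentioned, replacing one component with a ‘similar’ one by rewiring the arcs of the implementation graph, can be sketched as below. The function name `replace_component` and the graph encoding are assumptions for illustration; a real hot-swap would also have to quiesce and reconnect the live media flow, which this sketch ignores.

```python
# Sketch of a compound major-adaptation operation (hypothetical name):
# replace one component with a 'similar' one by rewiring every local
# binding (arc) that touched it.
def replace_component(nodes, arcs, old, new):
    nodes.remove(old)
    nodes.add(new)
    # Reconnect each arc so it now terminates at the replacement component.
    return {(new if a == old else a, new if b == old else b)
            for a, b in arcs}

# Swapping an encoder, in the spirit of the H.263 -> MJPEG change of Fig. 5:
nodes = {"RawVideoSource", "H263Encoder", "TransportBinding"}
arcs = {("RawVideoSource", "H263Encoder"),
        ("H263Encoder", "TransportBinding")}
arcs = replace_component(nodes, arcs, "H263Encoder", "MJPEGEncoder")
```

A full reconfiguration such as Fig. 5 would apply the same operation at both ends of the transport binding, replacing the decoder to match the new encoder.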
[Fig. 5 depicts a unidirectional video binding (ports rawVideoIn and encVideoOut) being reconfigured: between a Raw Video Source and a Raw Video Sink, an H.263 Video Encoder/Decoder pair around a Transport Binding is replaced by an MJPEG Video Encoder/Decoder pair.]
Figure 5. Using major adaptation to reconfigure a binding
Perhaps the most common use for major adaptation is to alter the ‘big-picture’ strategy of a binding in response to an environmental change. This might involve replacing unsuitable components with newer ones, inserting extra components (e.g. QoS filters [28]) or removing redundant components. An example of major adaptation is shown in Fig. 5. Here, the meta-interface is being used to adapt a unidirectional video binding in the face of a network change from ~2Mbit/s WaveLAN to a