38• Networking
Contents of this section: Asynchronous Transfer Mode Networks; Channel Coding; Client–Server Systems; Code Division Multiple Access; Data Compression for Networking; Ethernet; Group Communication; High-Speed Protocols; Intelligent Networks; Internetworking; ISO OSI Layered Protocol Model; Local Area Networks; Metropolitan Area Networks; Mobile Network Objects; Multicast; Multiple Access Schemes;
Network Flow and Congestion Control; Network Management; Network Operating Systems; Network Performance and Queueing Models; Network Reliability and Fault-Tolerance; Network Routing Algorithms; Network Security Framework; Network Security Fundamentals; Remote Procedure Calls; Signaling; Telephone Networks; Token Ring Local Area Networks; Wireless Networks.
Wiley Encyclopedia of Electrical and Electronics Engineering
Asynchronous Transfer Mode Networks
Standard Article
Tatsuya Suda, University of California, Irvine, Irvine, CA
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5301
Online Posting Date: December 27, 1999
Abstract. The sections in this article are: ATM Standards; ATM Traffic Control; Hardware Switch Architectures for ATM Networks; Continuing Research in ATM Networks.
ASYNCHRONOUS TRANSFER MODE NETWORKS
Asynchronous transfer mode, or ATM, is a network transfer technique capable of supporting a wide variety of multimedia applications with diverse service and performance requirements. It supports traffic bandwidths ranging from a few kilobits per second (e.g., a text terminal) to several hundred megabits per second (e.g., high-definition video) and traffic types ranging from continuous, fixed-rate traffic (e.g., traditional telephony and file transfer) to highly bursty traffic (e.g., interactive data and video). Because of its support for such a wide range of traffic, ATM was designated by the telecommunication standardization sector of the International Telecommunications Union (ITU-T, formerly CCITT) as the multiplexing and switching technique for Broadband, or high-speed, ISDN (B-ISDN) (1).

ATM is a form of packet-switching technology. That is, ATM networks transmit their information in small, fixed-length packets called cells, each of which contains 48 octets (or bytes) of data and 5 octets of header information. The small, fixed cell size was chosen to facilitate the rapid processing of packets in hardware and to minimize the amount of time required to fill a single packet. This is particularly important for real-time applications such as voice and video that require short packetization delays.

ATM is also connection-oriented. In other words, a virtual circuit must be established before a call can take place, where a call is defined as the transfer of information between two or more endpoints. The establishment of a virtual circuit entails the initiation of a signaling process, during which a route is selected according to the call’s quality of service requirements, connection identifiers at each switch on the route are established, and network resources such as bandwidth and buffer space may be reserved for the connection.
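The 53-octet cell format described above can be sketched in a few lines of Python. This is an illustrative sketch only: the constants and the `make_cell` helper are names introduced here, not part of any ATM API, and the header is treated as an opaque 5-byte string.

```python
# Sketch of the fixed-size cell described above: 5 header octets plus 48
# payload octets, 53 bytes in total. Names are illustrative.
HEADER_SIZE, PAYLOAD_SIZE = 5, 48
CELL_SIZE = HEADER_SIZE + PAYLOAD_SIZE

def make_cell(header: bytes, payload: bytes) -> bytes:
    """Assemble one 53-byte ATM cell, zero-padding a short payload."""
    if len(header) != HEADER_SIZE:
        raise ValueError("ATM cell header must be exactly 5 octets")
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("ATM cell payload cannot exceed 48 octets")
    return header + payload.ljust(PAYLOAD_SIZE, b"\x00")

# Why the small cell matters for voice: a 64 kbit/s source needs only
# 48 * 8 / 64000 s = 6 ms to fill one payload, keeping packetization
# delay short.
fill_time_ms = PAYLOAD_SIZE * 8 / 64_000 * 1000
```

The 6 ms figure illustrates the packetization-delay argument: a larger packet would hold a real-time voice source hostage for proportionally longer before the first bit could be sent.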
Another important characteristic of ATM is that its network functions are typically implemented in hardware. With the introduction of high-speed fiber optic transmission lines, the communication bottleneck has shifted from the communication links to the processing at switching nodes and at terminal equipment. Hardware implementation is necessary to overcome this bottleneck because it minimizes the cell-processing overhead, thereby allowing the network to match link rates on the order of gigabits per second.

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

Finally, as its name indicates, ATM is asynchronous. Time is slotted into cell-sized intervals, and slots are assigned to
calls in an asynchronous, demand-based manner. Because slots are allocated to calls on demand, ATM can easily accommodate traffic whose bit rate fluctuates over time. Moreover, in ATM, no bandwidth is consumed unless information is actually transmitted. ATM also gains bandwidth efficiency by being able to multiplex bursty traffic sources statistically. Because bursty traffic does not require continuous allocation of the bandwidth at its peak rate, statistical multiplexing allows a large number of bursty sources to share the network’s bandwidth.

Since its birth in the mid-1980s, ATM has been fortified by a number of robust standards and realized by a significant number of network equipment manufacturers. International standards-making bodies such as the ITU and independent consortia like the ATM Forum have developed a significant body of standards and implementation agreements for ATM (1,4). As networks and network services continue to evolve toward greater speeds and diversities, ATM will undoubtedly continue to proliferate.
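The statistical-multiplexing argument can be made concrete with a toy model: assume n independent on/off sources, each active with some probability, on a link sized for fewer than n simultaneous peaks. The binomial model and the numbers chosen below are illustrative assumptions, not from the article.

```python
from math import comb

def overload_probability(n_sources, p_active, capacity_in_sources):
    """P(more than `capacity_in_sources` of n independent on/off sources
    burst simultaneously), under a simple binomial model of bursty traffic."""
    return sum(comb(n_sources, k) * p_active**k * (1 - p_active)**(n_sources - k)
               for k in range(capacity_in_sources + 1, n_sources + 1))

# 50 bursty sources, each active 10% of the time, on a link sized for
# only 15 simultaneous peaks: overload is already very unlikely, so the
# link can be sized well below the 50-peak worst case.
p_overload = overload_probability(50, 0.1, 15)
```

The design trade-off in the text is visible here: shrinking the reserved capacity raises the overload (and hence cell loss) probability, so the network balances bandwidth efficiency against loss.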
Figure 2. Functions of each layer in the protocol reference model. [Higher layers: higher layer functions. AAL: convergence sublayer (CS); segmentation and reassembly (SAR). ATM layer: generic flow control; cell header generation/extraction; cell VPI/VCI translation; cell multiplex and demultiplex. Physical layer, transmission convergence (TC) sublayer: cell rate decoupling; header error control (HEC); cell delineation; transmission frame adaptation; transmission frame generation/recovery. Physical layer, physical medium (PM) sublayer: bit timing; physical medium. Layer management spans all layers.]
ATM STANDARDS

The telecommunication standardization sector of the ITU, the international standards agency commissioned by the United Nations for the global standardization of telecommunications, has developed a number of standards for ATM networks. Other standards bodies and consortia (e.g., the ATM Forum, ANSI) have also contributed to the development of ATM standards. This section presents an overview of the standards, with particular emphasis on the protocol reference model used by ATM (2).

Protocol Reference Model

The B-ISDN protocol reference model, defined in ITU-T recommendation I.321, is shown in Fig. 1 (1). The purpose of the protocol reference model is to clarify the functions that ATM networks perform by grouping them into a set of interrelated, function-specific layers and planes. The reference model consists of a user plane, a control plane, and a management plane. Within the user and control planes is a hierarchical set of layers. The user plane defines a set of functions for the transfer of user information between communication endpoints; the control plane defines control functions such as call establishment, call maintenance, and call release; and the management plane defines the operations necessary to control information flow between planes and layers and to maintain accurate and fault-tolerant network operation.

Figure 1. Protocol reference model for ATM. [The user plane and control plane each contain higher layers above the ATM adaptation layer, the ATM layer, and the physical layer; the management plane comprises plane management and layer management.]

Within the user and control planes, there are three layers: the physical layer, the ATM layer, and the ATM adaptation layer (AAL). Figure 2 summarizes the functions of each layer (1). The physical layer performs primarily bit-level functions, the ATM layer is primarily responsible for the switching of ATM cells, and the ATM adaptation layer is responsible for the conversion of higher-layer protocol frames into ATM cells. The functions that the physical, ATM, and adaptation layers perform are described in more detail next.

Physical Layer

The physical layer is divided into two sublayers: the physical medium sublayer and the transmission convergence sublayer (1).
Physical Medium Sublayer. The physical medium (PM) sublayer performs medium-dependent functions. For example, it provides bit transmission capabilities including bit alignment, line coding, and electrical/optical conversion. The PM sublayer is also responsible for bit timing (i.e., the insertion and extraction of bit timing information). The PM sublayer currently supports two types of interface: optical and electrical.

Transmission Convergence Sublayer. Above the physical medium sublayer is the transmission convergence (TC) sublayer, which is primarily responsible for the framing of data transported over the physical medium. The ITU-T recommendation specifies two options for the TC sublayer transmission frame structure: cell-based and synchronous digital hierarchy (SDH). In the cell-based case, cells are transported continuously without any regular frame structure. Under SDH, cells are carried in a special frame structure based on the North American SONET (synchronous optical network) protocol (3). Regardless of which transmission frame structure is used, the TC sublayer is responsible for the following four functions: cell rate decoupling, header error control, cell delineation, and transmission frame adaptation. Cell rate decoupling is the insertion of idle cells at the sending side to adapt the ATM cell
Figure 3. ATM cell header structure. [UNI header, octets 1 to 5: GFC, VPI, VCI, PT, CLP, HEC. NNI header, octets 1 to 5: VPI, VCI, PT, CLP, HEC.]
stream’s rate to the rate of the transmission path. Header error control is the insertion of an 8-bit CRC in the ATM cell header to protect the contents of the ATM cell header. Cell delineation is the detection of cell boundaries. Transmission frame adaptation is the encapsulation of departing cells into an appropriate framing structure (either cell-based or SDH-based).

ATM Layer

The ATM layer lies atop the physical layer and specifies the functions required for the switching and flow control of ATM cells (1). There are two interfaces in an ATM network: the user-network interface (UNI) between the ATM endpoint and the ATM switch, and the network-network interface (NNI) between two ATM switches. Although a 48-octet cell payload is used at both interfaces, the 5-octet cell header differs slightly at these interfaces. Figure 3 shows the cell header structures used at the UNI and NNI (1). At the UNI, the header contains a 4-bit generic flow control (GFC) field, a 24-bit label field containing virtual path identifier (VPI) and virtual channel identifier (VCI) subfields (8 bits for the VPI and 16 bits for the VCI), a 3-bit payload type (PT) field, a 1-bit cell loss priority (CLP) field, and an 8-bit header error check (HEC) field. The cell header for an NNI cell is identical to that for the UNI cell, except that it lacks the GFC field; these four bits are used for an additional 4 VPI bits in the NNI cell header.

The VCI and VPI fields are identifier values for the virtual channel (VC) and virtual path (VP), respectively. A virtual channel connects two ATM communication endpoints. A virtual path connects two ATM devices, which can be switches or endpoints, and several virtual channels may be multiplexed onto the same virtual path. The 3-bit PT field identifies whether the cell payload contains data or control information. The CLP bit is used by the user for explicit indication of cell loss priority.
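The UNI header layout can be sketched as a bit-packing routine. The field widths follow the standard 5-octet header, in which the PT field occupies 3 bits so that the six fields total exactly 40 bits. The helper names are introduced here for illustration; the HEC computation uses the CRC-8 generator x^8 + x^2 + x + 1 with the 0x55 coset offset specified for ATM in ITU-T I.432.

```python
def crc8_hec(data: bytes) -> int:
    """CRC-8 over the first four header octets (generator x^8 + x^2 + x + 1),
    XORed with the 0x55 coset used by the ATM HEC."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc ^ 0x55

def pack_uni_header(gfc: int, vpi: int, vci: int, pt: int, clp: int) -> bytes:
    """Pack the 5-octet UNI cell header: GFC(4) VPI(8) VCI(16) PT(3) CLP(1) HEC(8)."""
    word = (gfc << 28) | (vpi << 20) | (vci << 4) | (pt << 1) | clp
    first_four = word.to_bytes(4, "big")
    return first_four + bytes([crc8_hec(first_four)])
```

For an NNI header the GFC bits would instead extend the VPI to 12 bits, but the overall 40-bit layout is otherwise the same.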
If the value of the CLP is 1, then the cell is subject to discarding in case of congestion. The HEC field is an 8-bit CRC that protects the contents of the cell header. The GFC field, which appears only at the UNI, is used to assist the customer premises network in controlling the traffic flow. At the time of writing, the exact procedures for use of this field have not been agreed upon.

ATM Layer Functions

The primary function of the ATM layer is VPI/VCI translation. As ATM cells arrive at ATM switches, the VPI and VCI values contained in their headers are examined by the switch to determine which output port should be used to forward the cell. In the process, the switch translates the cell’s original VPI and VCI values into new outgoing VPI and VCI values, which are used in turn by the next ATM switch to send the cell toward its intended destination. The table used to perform this translation is initialized during the establishment of the call. An ATM switch may either be a VP switch, in which case it translates only the VPI values contained in cell headers, or it may be a VP/VC switch, in which case it translates the incoming VPI/VCI value into an outgoing VPI/VCI pair. Because VPI and VCI values do not represent a unique end-to-end virtual connection, they can be reused at different switches through the network. This is important because the VPI and VCI fields are limited in length and would be quickly exhausted if they were used simply as destination addresses.

The ATM layer supports two types of virtual connections: switched virtual connections (SVC) and permanent, or semipermanent, virtual connections (PVC). Switched virtual connections are established and torn down dynamically by an ATM signaling procedure. That is, they exist only for the duration of a single call. Permanent virtual connections, on the other hand, are established by network administrators and continue to exist as long as the administrator leaves them up, even if they are not used to transmit data.

Other important functions of the ATM layer include cell multiplexing and demultiplexing, cell header creation and extraction, and generic flow control. Cell multiplexing is the merging of cells from several calls onto a single transmission path, cell header creation is the attachment of a 5-octet cell header to each 48-octet block of user payload, and generic flow control is used at the UNI to prevent short-term overload conditions from occurring within the network.

ATM Layer Service Categories

The ATM Forum and ITU-T have defined several distinct service categories at the ATM layer (1,4).
The categories defined by the ATM Forum include constant bit rate (CBR), real-time variable bit rate (VBR-rt), non-real-time variable bit rate (VBR-nrt), available bit rate (ABR), and unspecified bit rate (UBR). ITU-T defines four service categories, namely, deterministic bit rate (DBR), statistical bit rate (SBR), available bit rate (ABR), and ATM block transfer (ABT). The first three ITU-T service categories correspond roughly to the ATM Forum’s CBR, VBR, and ABR classifications, respectively. The fourth service category, ABT, is solely defined by ITU-T and is intended for bursty data applications. The UBR category defined by the ATM Forum is for calls that request no quality of service guarantees at all. Figure 4 lists the ATM service categories, their quality of service (QoS) parameters,
Figure 4. ATM layer service categories. [The figure pairs the ITU-T service categories (DBR, SBR, ABT, ABR) with the corresponding ATM Forum service categories (CBR, VBR-rt, VBR-nrt, ABR, UBR), and lists for each whether the cell loss rate, cell transfer delay, and cell delay variation are specified or unspecified, along with the required traffic descriptors (PCR/CDVT; SCR/BT; MCR/ACR). PCR = peak cell rate; SCR = sustained cell rate; CDVT = cell delay variation tolerance; BT = burst tolerance; MCR = minimum cell rate; ACR = allowed cell rate.]
and the traffic descriptors required by the service category during call establishment (1,4).

The constant bit rate (or deterministic bit rate) service category provides a very strict QoS guarantee. It is targeted at real-time applications, such as voice and raw video, which mandate severe restrictions on delay, delay variance (jitter), and cell loss rate. The only traffic descriptors required by the CBR service are the peak cell rate and the cell delay variation tolerance. A fixed amount of bandwidth, determined primarily by the call’s peak cell rate, is reserved for each CBR connection.

The real-time variable bit rate (or statistical bit rate) service category is intended for real-time bursty applications (e.g., compressed video), which also require strict QoS guarantees. The primary difference between CBR and VBR-rt is in the traffic descriptors they use. The VBR-rt service requires the specification of the sustained (or average) cell rate and burst tolerance (i.e., burst length) in addition to the peak cell rate and the cell delay variation tolerance. The ATM Forum also defines a VBR-nrt service category, in which cell delay variance is not guaranteed.

The available bit rate service category is defined to exploit the network’s unused bandwidth. It is intended for non-real-time data applications in which the source is amenable to enforced adjustment of its transmission rate. A minimum cell rate is reserved for the ABR connection and therefore guaranteed by the network. When the network has unused bandwidth, ABR sources are allowed to increase their cell rates up to an allowed cell rate (ACR), a value that is periodically updated by the ABR flow control mechanism (to be described in the section entitled ‘‘ATM Traffic Control’’). The value of ACR always falls between the minimum and the peak cell rate for the connection and is determined by the network.
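The ABR rate adjustment might be sketched as follows. The update rule shown (additive increase by a rate increase factor, multiplicative decrease by a rate decrease factor) is a simplified stand-in for the actual ABR flow control mechanism; the function name and parameters are illustrative, but the invariant matches the text: ACR always stays between MCR and PCR.

```python
def update_acr(acr, mcr, pcr, rif, rdf, congested):
    """One simplified ACR adjustment step: additive increase by a rate
    increase factor (RIF) when spare bandwidth is reported, multiplicative
    decrease by a rate decrease factor (RDF) under congestion, always
    clamped to the [MCR, PCR] contract."""
    acr = acr * (1 - rdf) if congested else acr + rif * pcr
    return max(mcr, min(pcr, acr))
```

Whatever the feedback says, the clamp guarantees the source never falls below its reserved minimum cell rate nor exceeds its declared peak.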
The ATM Forum defines another service category for non-real-time applications called the unspecified bit rate (UBR) service category. The UBR service is entirely best effort; the call is provided with no QoS guarantees.

The ITU-T also defines an additional service category for non-real-time data applications. The ATM block transfer service category is intended for the transmission of short bursts, or blocks, of data. Before transmitting a block, the source requests a reservation of bandwidth from the network. If the ABT service is being used with the immediate transmission option (ABT/IT), the
block of data is sent at the same time as the reservation request. If bandwidth is not available for transporting the block, then it is simply discarded, and the source must retransmit it. In the ABT service with delayed transmission (ABT/DT), the source waits for a confirmation from the network that enough bandwidth is available before transmitting the block of data. In both cases, the network temporarily reserves bandwidth according to the peak cell rate for each block. Immediately after transporting the block, the network releases the reserved bandwidth.

ATM Adaptation Layer

The ATM adaptation layer, which resides atop the ATM layer, is responsible for mapping the requirements of higher-layer protocols onto the ATM network (1). It operates in ATM devices at the edge of the ATM network and is totally absent in ATM switches. The adaptation layer is divided into two sublayers: the convergence sublayer (CS), which performs error detection and handling, timing, and clock recovery; and the segmentation and reassembly (SAR) sublayer, which performs segmentation of convergence sublayer protocol data units (PDUs) into ATM cell-sized SAR sublayer service data units (SDUs) and vice versa.

In order to support different service requirements, the ITU-T has proposed four AAL-specific service classes. Figure 5 depicts the four service classes defined in recommendation I.362 (1). Note that even though these AAL service classes are similar in many ways to the ATM layer service categories defined in the previous section, they are not the same; each exists at a different layer of the protocol reference model, and each requires a different set of functions. AAL service class A corresponds to constant bit rate services with a timing relation required between source and destination. The connection mode is connection-oriented. CBR audio and video belong to this class. Class B corresponds to variable bit rate (VBR) services.
This class also requires timing between source and destination, and its mode is connection-oriented. VBR audio and video are examples of class B services. Class C also corresponds to VBR connection-oriented services, but the timing between source and destination need not be related. Class C includes connection-oriented data transfer such as X.25, signaling, and future high-speed data services. Class D corresponds to connectionless services. Connectionless data services such as those supported by LANs and MANs are examples of class D services.

Figure 5. Service classification for AAL. [Class A: timing relation between source and destination required; constant bit rate; connection-oriented. Class B: timing required; variable bit rate; connection-oriented. Class C: timing not required; variable bit rate; connection-oriented. Class D: timing not required; variable bit rate; connectionless.]

Four AAL types (Types 1, 2, 3/4, and 5), each with a unique SAR sublayer and CS sublayer, are defined to support the four service classes. AAL Type 1 supports constant bit rate services (class A), and AAL Type 2 supports variable bit rate services with a timing relation between source and destination (class B). AAL Type 3/4 was originally specified as two different AAL types (Type 3 and Type 4), but because of their inherent similarities, they were eventually merged to support both class C and class D services. AAL Type 5 also supports class C and class D services.

AAL Type 5. Currently, the most widely used adaptation layer is AAL Type 5. AAL Type 5 supports connection-oriented and connectionless services in which there is no timing relation between source and destination (classes C and D). Its functionality was intentionally made simple in order to support high-speed data transfer. AAL Type 5 assumes that the layers above the ATM adaptation layer can perform error recovery, retransmission, and sequence numbering when required, and thus, it does not provide these functions. Therefore, only nonassured operation is provided; lost or corrupted AAL Type 5 packets will not be corrected by retransmission.

Figure 6 depicts the SAR-SDU format for AAL Type 5 (5,6). The SAR sublayer of AAL Type 5 performs segmentation of a CS-PDU into a size suitable for the SAR-SDU payload. Unlike other AAL types, Type 5 devotes the entire 48-octet payload of the ATM cell to the SAR-SDU; there is no overhead. An AAL-specific flag (end-of-frame) in the ATM PT field of the cell header is set when the last cell of a CS-PDU is sent. The reassembly of CS-PDU frames at the destination is controlled by using this flag.

Figure 6. SAR-SDU format for AAL Type 5. [Each cell carries a 5-octet cell header followed by a 48-octet SAR-SDU payload.]

Figure 7 depicts the CS-PDU format for AAL Type 5 (5,6). It contains the user data payload, along with any necessary padding bits (PAD) and a CS-PDU trailer, which are added by the CS sublayer when it receives the user information from the higher layer. The CS-PDU is padded using 0 to 47 bytes of PAD field to make the length of the CS-PDU an integral multiple of 48 bytes (the size of the SAR-SDU payload). At the receiving end, a reassembled PDU is passed to the CS sublayer from the SAR sublayer, and CRC values are then calculated and compared. If there is no error, the PAD field is removed by using the value of the length field (LF) in the CS-PDU trailer, and user data is passed to the higher layer. If an error is detected, the erroneous information is either delivered to the user or discarded according to the user’s choice. The use of the CF field is for further study.
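The padding, trailer, and segmentation steps can be sketched in Python. The trailer layout follows the CS-PDU description (CF, LF, CRC preceded by PAD); the helper name is introduced for illustration, and `zlib.crc32` stands in for the AAL5 CRC-32, whose exact bit ordering differs from zlib’s.

```python
import zlib

def aal5_cells(user_data: bytes) -> list:
    """Build an AAL5 CS-PDU (user data + PAD + 8-octet trailer) and segment
    it into 48-octet SAR payloads. Trailer: CF (2 octets), LF (2 octets),
    CRC (4 octets); zlib.crc32 is an illustrative stand-in for the real
    AAL5 CRC-32."""
    trailer_len = 8
    pad_len = (-(len(user_data) + trailer_len)) % 48   # 0 to 47 octets of PAD
    padded = user_data + b"\x00" * pad_len
    cf = b"\x00\x00"                        # control field: use is for further study
    lf = len(user_data).to_bytes(2, "big")  # length of the user data only
    body = padded + cf + lf
    crc = zlib.crc32(body).to_bytes(4, "big")
    cs_pdu = body + crc
    return [cs_pdu[i:i + 48] for i in range(0, len(cs_pdu), 48)]
```

Note how the PAD length is chosen so that user data plus the 8-octet trailer lands exactly on a 48-octet boundary, so every SAR payload is full; the receiver recovers the original data length from LF.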
Figure 7. CS-PDU format, segmentation and reassembly of AAL Type 5. [At the CS layer, PAD (0 to 47 bytes) and a trailer of CF (control field, 2 bytes), LF (length field, 2 bytes), and CRC (cyclic redundancy check, 4 bytes) are appended to the user data; at the SAR layer the CS-PDU is segmented into AAL5 cells, with an indication in the last cell.]

AAL Type 1. AAL Type 1 supports constant bit rate services with a fixed timing relation between source and destination users (class A). At the SAR sublayer, it defines a 48-octet service data unit (SDU), which contains 47 octets of user payload, 4 bits for a sequence number, and a 4-bit CRC value to detect errors in the sequence number field. AAL Type 1 performs the following services at the CS sublayer: forward error correction to ensure high quality of audio and video applications, clock recovery by monitoring the buffer filling, explicit time indication by inserting a time stamp in the CS-PDU, and handling of lost and misinserted cells that are recognized by the SAR. At the time of writing, the CS-PDU format has not been decided.

AAL Type 2. AAL Type 2 supports variable bit rate services with a timing relation between source and destination (class B). AAL Type 2 is nearly identical to AAL Type 1, except that it transfers service data units at a variable bit rate, not at a constant bit rate. Furthermore, AAL Type 2 accepts variable-length CS-PDUs, and thus, there may exist some SAR-SDUs that are not completely filled with user data. The CS sublayer for AAL Type 2 performs the following functions: forward error correction for audio and video services, clock recovery by inserting a time stamp in the CS-PDU, and handling of lost and misinserted cells. At the time of writing, both the SAR-SDU and CS-PDU formats for AAL Type 2 are still under discussion.

AAL Type 3/4. AAL Type 3/4 mainly supports services that require no timing relation between the source and destination (classes C and D). At the SAR sublayer, it defines a 48-octet service data unit, with 44 octets of user payload; a 2-bit payload type field to indicate whether the SDU is at the beginning, middle, or end of a CS-PDU; a 4-bit cell sequence number; a 10-bit multiplexing identifier that allows several CS-PDUs to be multiplexed over a single VC; a 6-bit cell payload length indicator; and a 10-bit CRC code that covers the payload. The CS-PDU format allows for up to 65,535 octets of user payload and contains a header and trailer to delineate the PDU. The functions that AAL Type 3/4 performs include segmentation and reassembly of variable-length user data and error handling. It supports message mode (for framed data transfer) as well as streaming mode (for streamed data transfer). Because Type 3/4 is mainly intended for data services, it provides a retransmission mechanism if necessary.

ATM Signaling
ATM follows the principle of out-of-band signaling that was established for N-ISDN. In other words, signaling and data channels are separate. The main purposes of signaling are (1) to establish, maintain, and release ATM virtual connections and (2) to negotiate (or renegotiate) the traffic parameters of new (or existing) connections (7). The ATM signaling standards support the creation of point-to-point as well as multicast connections. Typically, certain VCI and VPI values are reserved by ATM networks for signaling messages. If additional signaling VCs are required, they may be established through the process of metasignaling.
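The reserved-channel idea can be illustrated with a small lookup. The specific values below (on VPI 0: meta-signaling on VCI 1, general broadcast signaling on VCI 2, point-to-point signaling on VCI 5) are commonly cited UNI defaults and should be treated as assumptions here, since the article does not enumerate them; the helper name is illustrative.

```python
# Hypothetical table of well-known VCIs reserved for signaling on VPI 0.
RESERVED_SIGNALING_VCIS = {
    1: "meta-signaling",
    2: "general broadcast signaling",
    5: "point-to-point signaling",
}

def is_signaling_cell(vpi: int, vci: int) -> bool:
    """True when the cell arrives on one of the default reserved signaling
    channels; ordinary user data connections use other VPI/VCI values."""
    return vpi == 0 and vci in RESERVED_SIGNALING_VCIS
```

Additional signaling channels beyond these defaults would be set up through meta-signaling, as the text notes.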
ATM TRAFFIC CONTROL

The control of ATM traffic is complicated as a result of ATM’s high link speed and small cell size, the diverse service requirements of ATM applications, and the diverse characteristics of ATM traffic. Furthermore, the configuration and size of the ATM environment, either local or wide area, has a significant impact on the choice of traffic control mechanisms.

The factor that most complicates traffic control in ATM is its high link speed. Typical ATM link speeds are 155.52 Mbit/s and 622.08 Mbit/s. At these high link speeds, 53-byte ATM cells must be switched at rates greater than one cell per 2.726 µs or 0.682 µs, respectively. It is apparent that the cell processing required by traffic control must perform at speeds comparable to these cell-switching rates. Thus, traffic control should be simple and efficient, without excessive software processing.

Such high speeds render many traditional traffic control mechanisms inadequate for use in ATM because of their reactive nature. Traditional reactive traffic control mechanisms attempt to control network congestion by responding to it after it occurs and usually involve sending feedback to the source in the form of a choke packet. However, a large bandwidth-delay product (i.e., the amount of traffic that can be sent in a single propagation delay time) renders many reactive control schemes ineffective in high-speed networks. When a node receives feedback, it may have already transmitted a large amount of data. Consider a cross-continental 622 Mbit/s connection with a propagation delay of 20 ms (propagation-bandwidth product of 12.4 Mbit). If a node at one end of the connection experiences congestion and attempts to throttle the source at the other end by sending it a feedback packet, the source will already have transmitted over 12 Mbit of information before feedback arrives.
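The cell-switching rates and the propagation-bandwidth product quoted above follow directly from the numbers in the text, and can be checked with a few lines of arithmetic:

```python
CELL_BITS = 53 * 8  # 424 bits per 53-byte cell

# One cell time at each typical link speed: 424 / 155.52e6 s and
# 424 / 622.08e6 s, i.e. roughly 2.726 µs and 0.682 µs.
for link_bps in (155.52e6, 622.08e6):
    cell_time_us = CELL_BITS / link_bps * 1e6
    print(f"{link_bps / 1e6:.2f} Mbit/s -> one cell every {cell_time_us:.3f} µs")

# Cross-continental example from the text: 622.08 Mbit/s, 20 ms delay.
bdp_bits = 622.08e6 * 0.020
print(f"propagation-bandwidth product: {bdp_bits / 1e6:.2f} Mbit")
```

The computation makes the scale of the problem explicit: a feedback packet that takes one propagation delay to arrive is already more than twelve megabits too late.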
This example illustrates the ineffectiveness of traditional reactive traffic control mechanisms in high-speed networks and argues for novel mechanisms that take into account high propagation-bandwidth products.

Not only is traffic control complicated by high speeds, but it is also made more difficult by the diverse QoS requirements of ATM applications. For example, many applications have strict delay requirements, and their data must be delivered within a specified amount of time. Other applications have strict loss requirements, and their data must be delivered reliably without an inordinate amount of loss. Traffic controls must address the diverse requirements of such applications.

Another factor complicating traffic control in ATM networks is the diversity of ATM traffic characteristics. In ATM networks, continuous bit rate traffic is accompanied by bursty traffic. Bursty traffic generates cells at a peak rate for a very short period of time and then immediately becomes less active, generating fewer cells. To improve the efficiency of ATM network utilization, bursty calls should be allocated an amount of bandwidth that is less than their peak rate. This allows the network to multiplex more calls by taking advantage of the small probability that a large number of bursty calls will be simultaneously active. This type of multiplexing is referred to as statistical multiplexing. The problem then becomes one of determining how best to multiplex bursty calls statistically such that the number of cells dropped as a result of excessive burstiness is balanced with the number of bursty
traffic streams allowed. Addressing the unique demands of bursty traffic is an important function of ATM traffic control. For these reasons, many traffic control mechanisms developed for existing networks may not be applicable to ATM networks, and novel forms of traffic control are required (8,9). One class of novel mechanisms that works well in high-speed networks falls under the heading of preventive control. Preventive control attempts to manage congestion by preventing it before it occurs and is targeted primarily at real-time traffic. Another class of mechanisms, targeted at non-real-time data traffic, relies on novel reactive feedback mechanisms.

Preventive Traffic Control

Preventive control for ATM has two major components: call admission control and usage parameter control (8). Admission control determines whether to accept or reject a new call at the time of call set-up. This decision is based on the traffic characteristics of the new call and the current network load. Usage parameter control enforces the traffic parameters of the call after it has been accepted into the network. This enforcement is necessary to ensure that the call's actual traffic flow conforms with that reported during call admission.

Before describing call admission and usage parameter control in more detail, it is important to first discuss the nature of multimedia traffic. Most ATM traffic belongs to one of two general classes: continuous traffic and bursty traffic. Sources of continuous traffic (e.g., constant bit rate video, voice without silence detection) are easily handled because their resource utilization is predictable and they can be deterministically multiplexed. Bursty traffic (e.g., voice with silence detection, variable bit rate video), however, is characterized by its unpredictability, and this kind of traffic complicates preventive traffic control.
Burstiness is a parameter describing how densely or sparsely cell arrivals occur. There are a number of ways to express traffic burstiness, the most typical of which are the ratio of peak bit rate to average bit rate and the average burst length; several other measures of burstiness have also been proposed (8). Burstiness plays a critical role in determining network performance, and it is therefore critical for traffic control mechanisms to reduce the negative impact of bursty traffic.

Call Admission Control. Call admission control is the process by which the network decides whether to accept or reject a new call. When a new call requests access to the network, it provides a set of traffic descriptors (e.g., peak rate, average rate, average burst length) and a set of quality of service requirements (e.g., acceptable cell loss rate, acceptable cell delay variance, acceptable delay). The network then determines, through signaling, whether it has enough resources (e.g., bandwidth, buffer space) to support the new call's requirements. If it does, the call is accepted and allowed to transmit data into the network; otherwise it is rejected. Call admission control prevents network congestion by limiting the number of active connections to a level at which network resources are adequate to maintain quality of service guarantees.
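The accept/reject decision can be sketched in a few lines. This is an illustrative model only, not a real CAC algorithm: the class, its names, and the fixed-allocation policy are assumptions made for the example.

```python
# Hypothetical sketch of call admission control: accept a new call only if
# the link still has enough unallocated bandwidth for the requested amount.
# Names and the allocation policy are illustrative, not from any standard.
class Link:
    def __init__(self, capacity_bps: float):
        self.capacity_bps = capacity_bps
        self.allocated_bps = 0.0

    def admit(self, requested_bps: float) -> bool:
        """Reserve bandwidth for a new call, or reject it."""
        if self.allocated_bps + requested_bps <= self.capacity_bps:
            self.allocated_bps += requested_bps
            return True
        return False  # reject: QoS of existing calls would be at risk

link = Link(155.52e6)
print(link.admit(100e6))  # True: plenty of capacity remains
print(link.admit(100e6))  # False: only ~55 Mbit/s is left unallocated
```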
ASYNCHRONOUS TRANSFER MODE NETWORKS
One of the most common ways for an ATM network to make a call admission decision is to use the call's traffic descriptors and quality of service requirements to predict the "equivalent bandwidth" required by the call. The equivalent bandwidth determines how many resources the network must reserve to support the new call at its requested quality of service. For continuous, constant bit rate calls, determining the equivalent bandwidth is simple: it is merely the peak bit rate of the call. For bursty connections, however, the process should take into account such factors as a call's burstiness ratio (the ratio of peak bit rate to average bit rate), burst length, and burst interarrival time. The equivalent bandwidth for bursty connections must be chosen carefully to ameliorate congestion and cell loss while maximizing the number of connections that can be statistically multiplexed.

Usage Parameter Control. Call admission control is responsible for admitting or rejecting new calls. However, call admission by itself is ineffective if the call does not transmit data according to the traffic parameters it provided. Users may intentionally or accidentally exceed the traffic parameters declared during call admission, thereby overloading the network. To prevent users from violating their traffic contracts and driving the network into a congested state, each call's traffic flow is monitored and, if necessary, restricted. This is the purpose of usage parameter control. (Usage parameter control is also commonly referred to as policing, bandwidth enforcement, or flow enforcement.) To monitor a call's traffic efficiently, the usage parameter control function must be located as close as possible to the actual source of the traffic.
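Returning for a moment to the equivalent-bandwidth idea above, its qualitative behavior can be illustrated with a toy estimator. The interpolation formula below is invented for this sketch, not taken from the literature; it only captures the trend that stricter loss targets push a bursty call's allocation from its average rate toward its peak rate.

```python
import math

# Toy equivalent-bandwidth estimator. The interpolation weight is an invented
# illustration of the trend, not a published formula: stricter cell loss
# targets move the allocation from the average rate toward the peak rate.
def equivalent_bandwidth(peak_bps: float, avg_bps: float,
                         cell_loss_rate: float) -> float:
    if peak_bps == avg_bps:
        return peak_bps  # continuous, constant bit rate: peak-rate allocation
    w = min(1.0, -math.log10(cell_loss_rate) / 9.0)
    return avg_bps + w * (peak_bps - avg_bps)

print(equivalent_bandwidth(10e6, 10e6, 1e-6) / 1e6)  # CBR: 10.0 (peak rate)
print(equivalent_bandwidth(10e6, 2e6, 1e-3) / 1e6)   # lax target: between average and peak
print(equivalent_bandwidth(10e6, 2e6, 1e-9) / 1e6)   # strict target: at the peak
```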
An ideal usage parameter control mechanism should have the ability to detect parameter-violating cells, appear transparent to connections respecting their admission parameters, and respond rapidly to parameter violations. It should also be simple, fast, and cost effective to implement in hardware. To meet these requirements, several mechanisms have been proposed and implemented (8). The leaky bucket mechanism (originally proposed in Ref. 10) is a typical usage parameter control mechanism used for ATM networks. It can simultaneously enforce the average bandwidth and the burst factor of a traffic source. One possible implementation of the leaky bucket mechanism controls the traffic flow by means of tokens; a conceptual model is illustrated in Fig. 8. An arriving cell first enters a queue. If the queue is full, cells are simply discarded. To enter the network, a cell must first obtain a token from the token pool; if there is no token, the cell must wait in the queue until a new token is generated. Tokens are generated at a fixed rate corresponding to the average bit rate declared during call admission. If the
Figure 8. Leaky bucket mechanism (arriving cells enter a queue and must obtain a token from a token pool, fed by a token generator, before departing into the network).
number of tokens in the token pool exceeds some predefined threshold value, token generation stops. This threshold value corresponds to the burstiness of the transmission declared at call admission time; larger threshold values allow a greater degree of burstiness. This method enforces the average input rate while allowing for a certain degree of burstiness.

One disadvantage of the leaky bucket mechanism is that the bandwidth enforcement introduced by the token pool is in effect even when the network load is light and there is no need for enforcement. Another disadvantage is that the leaky bucket may mistake nonviolating cells for violating cells. When traffic is bursty, a large number of cells may be generated in a short period of time while still conforming to the traffic parameters claimed at the time of call admission. In such situations none of these cells should be considered violating cells, yet in practice the leaky bucket may erroneously identify them as violations of the admission parameters.

A virtual leaky bucket mechanism (also referred to as a marking method) alleviates these disadvantages (11). In this mechanism, violating cells, rather than being discarded or buffered, are permitted to enter the network at a lower priority (CLP = 1). These violating cells are discarded only when they arrive at a congested node; if there are no congested nodes along the routes to their destinations, the violating cells are transmitted without being discarded. The virtual leaky bucket mechanism can easily be implemented using the leaky bucket method described earlier: when the queue length exceeds a threshold, cells are marked as "droppable" instead of being discarded. The virtual leaky bucket method not only allows the user to take advantage of a light network load but also allows a larger margin of error in determining the token pool parameters.

Reactive Traffic Control

Preventive control is appropriate for most types of ATM traffic.
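Before turning to reactive mechanisms, the token-pool behavior just described can be sketched as a small simulation. This is an illustrative model (one token generated per time step; all names are invented), not a hardware design:

```python
from collections import deque

# Illustrative token-pool leaky bucket: tokens accrue at the declared average
# rate (one per tick here) up to a threshold that sets the allowed burstiness;
# a cell departs only by consuming a token, waits in the queue otherwise, and
# is dropped when the queue is full.
class LeakyBucket:
    def __init__(self, token_threshold: int, queue_limit: int):
        self.tokens = token_threshold           # pool starts full
        self.token_threshold = token_threshold  # burstiness allowance
        self.queue = deque()
        self.queue_limit = queue_limit
        self.dropped = 0

    def tick(self, arrivals: int) -> int:
        """One token-generation interval; returns cells sent this tick."""
        self.tokens = min(self.tokens + 1, self.token_threshold)
        for _ in range(arrivals):
            if len(self.queue) < self.queue_limit:
                self.queue.append("cell")
            else:
                self.dropped += 1
        sent = 0
        while self.queue and self.tokens > 0:
            self.queue.popleft()
            self.tokens -= 1
            sent += 1
        return sent

lb = LeakyBucket(token_threshold=3, queue_limit=4)
print(lb.tick(5))  # 3: a full pool lets part of the burst through at once
print(lb.tick(0))  # 1: the queue then drains at the average rate
```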
However, there are cases where reactive control is beneficial. For instance, reactive control is useful for service classes like ABR, which allow sources to use bandwidth not being used by calls in other service classes. Such a service would be impossible with preventive control because the amount of unused bandwidth in the network changes dynamically, and the sources can be made aware of the amount only through reactive feedback.

There are two major classes of reactive traffic control mechanisms: rate-based and credit-based (12,13). Most rate-based traffic control mechanisms establish a closed feedback loop in which the source periodically transmits special control cells, called resource management cells, to the destination (or destinations). The destination closes the feedback loop by returning the resource management cells to the source. As the feedback cells traverse the network, the intermediate switches examine their current congestion state and mark the feedback cells accordingly. When the source receives a returning feedback cell, it adjusts its rate, either decreasing it in the case of network congestion or increasing it in the case of network underuse. An example of a rate-based ABR algorithm is the Enhanced Proportional Rate Control Algorithm (EPRCA), which was proposed, developed, and tested through the course of ATM Forum activities (12).
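A greatly simplified sketch of this closed loop follows. It is not the actual EPRCA rule set; the additive-increase/multiplicative-decrease step and all parameter values are assumptions chosen for illustration:

```python
# Simplified rate-based feedback (illustrative, not the real EPRCA): on each
# returned resource-management cell the source decreases its rate
# multiplicatively if the cell was marked congested, and increases it
# additively otherwise, clamped between a minimum and the peak cell rate.
def adjust_rate(rate: float, congested: bool,
                peak: float = 300_000, minimum: float = 1_000,
                increase: float = 5_000,
                decrease_factor: float = 0.875) -> float:
    rate = rate * decrease_factor if congested else rate + increase
    return max(minimum, min(peak, rate))

rate = 100_000.0  # cells per second
for marked in (False, False, True, True, False):
    rate = adjust_rate(rate, marked)
print(round(rate))  # ramps up twice, backs off twice, then ramps again
```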
Credit-based mechanisms use link-by-link traffic control to eliminate loss and optimize utilization. Intermediate switches exchange resource management cells that contain "credits," which reflect the amount of buffer space available at the next downstream switch. A source cannot transmit a new data cell unless it has received at least one credit from its downstream neighbor. An example of a credit-based mechanism is the Quantum Flow Control (QFC) algorithm, developed by a consortium of researchers and ATM equipment manufacturers (13).
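The credit mechanism can be sketched as follows; the class is an invented illustration of per-link credits, not the actual QFC protocol:

```python
# Minimal sketch of credit-based, link-by-link flow control (illustrative):
# the upstream node may forward a cell only while it holds a credit, and the
# downstream node returns one credit each time it frees a buffer slot, so
# the downstream buffer can never overflow and no cell is lost.
class CreditLink:
    def __init__(self, downstream_buffers: int):
        self.credits = downstream_buffers  # one credit per free buffer slot
        self.downstream_queue = 0

    def send(self) -> bool:
        if self.credits == 0:
            return False                   # must wait: loss is impossible
        self.credits -= 1
        self.downstream_queue += 1
        return True

    def downstream_consumes(self) -> None:
        if self.downstream_queue > 0:
            self.downstream_queue -= 1
            self.credits += 1              # credit returned upstream

link = CreditLink(downstream_buffers=2)
print(link.send(), link.send(), link.send())  # True True False: no credit left
link.downstream_consumes()
print(link.send())                            # True: a credit came back
```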
HARDWARE SWITCH ARCHITECTURES FOR ATM NETWORKS

In ATM networks, information is segmented into fixed-length cells, and cells are asynchronously transmitted through the network. To match the transmission speed of the network links and to minimize protocol processing overhead, ATM performs the switching of cells in hardware switching fabrics, unlike traditional packet switching networks, where switching is largely performed in software. A large number of designs have been proposed and implemented for ATM switches (14). Although many differences exist, ATM switch architectures can be broadly classified into two categories: asynchronous time division (ATD) and space-division architectures.

Asynchronous Time Division Switches

ATD, or single-path, architectures provide a single, multiplexed path through the ATM switch for all cells; typically a bus or ring is used. Figure 9 shows the basic structure of the ATM switch proposed in (15): four input ports are connected to four output ports by a time-division multiplexing (TDM) bus. Each input port is allocated a fixed time slot on the TDM bus, and the bus is designed to operate at a speed equal to the sum of the incoming bit rates at all input ports. The TDM slot sizes are fixed and equal in length to the time it takes to transmit one ATM cell. Thus, during one TDM cycle, the four input ports can transfer four ATM cells to four output ports. In ATD switches, the maximum throughput is determined by the single, multiplexed path: a switch with N input ports and N output ports must run at a rate N times faster than the transmission links. Therefore, the total throughput of ATD switches is bounded by the current capabilities of device logic technology. Commercial examples of ATD switches are the Fore Systems ASX switch and Digital's VNswitch.
Figure 10. An 8 × 8 Banyan switch with binary switching elements.
Space-Division Switches

To eliminate the single-path limitation and increase total throughput, space-division ATM switches implement multiple paths through their switching fabrics. Most space-division switches are based on multistage interconnection networks, in which small switching elements (usually 2 × 2 cross-point switches) are organized into stages and provide multiple paths through the fabric. Rather than being multiplexed onto a single path, ATM cells are space-switched through the fabric. Three typical types of space-division switches are described next.

Banyan Switches. Banyan switches are examples of space-division switches. An N × N Banyan switch is constructed by arranging a number of binary switching elements into log2 N stages. Figure 10 depicts an 8 × 8 self-routing Banyan switch (14). The switch fabric is composed of twelve 2 × 2 switching elements assembled into three stages. From any of the eight input ports, it is possible to reach all eight output ports. One desirable characteristic of the Banyan switch is that it is self-routing. Because each cross-point switch has only two output lines, only one bit is required to specify the correct output path. Very simply, if the desired output address of an ATM cell is stored in the cell header in binary code, routing decisions for the cell can be made at each cross-point switch by examining the appropriate bit of the destination address. Although the Banyan switch is simple and possesses attractive features such as modularity, which makes it suitable for VLSI implementation, it also has some disadvantages. One of its disadvantages is that it is internally blocking; in other words, cells destined for different output ports may contend for a common link within the switch. This results in
Figure 9. A 4 × 4 asynchronous time division switch (four buffered input ports connected to four buffered output ports by a TDM bus).

Figure 11. Batcher–Banyan switch (input ports feed a Batcher sorting network followed by a Banyan routing network leading to the output ports).
Figure 12. A knockout (crossbar) switch (N input ports crossing bus interfaces at each of the N output ports).
blocking all cells that wish to use that link, except for one. Hence, the Banyan switch is referred to as a blocking switch. In Fig. 10, three cells are shown arriving on input ports 1, 3, and 4 with destination port addresses of 0, 1, and 5, respectively. The cell destined for output port 0 and the cell destined for output port 1 contend for the link between the second and third stages. As a result, only one of them (the cell from input port 1 in this example) actually reaches its destination (output port 0), while the other is blocked.

Batcher–Banyan Switches. Another example of a space-division switch is the Batcher–Banyan switch (14) (see Fig. 11). It consists of two multistage interconnection networks: a Batcher sorting network and a Banyan self-routing network. In the Batcher–Banyan switch, incoming cells first enter the sorting network, which sorts them into ascending order according to their output addresses. Cells then enter the Banyan network, which routes them to their correct output ports. As shown earlier, the Banyan switch is internally blocking. However, it possesses an interesting property: internal blocking can be avoided if the cells arriving at its input ports are sorted in ascending order by their destination addresses. The Batcher–Banyan switch takes advantage of this fact and uses the Batcher sorting network to sort the cells, thereby making the Batcher–Banyan switch internally nonblocking. The Starlite switch, designed at AT&T Bell Labs, is based on the Batcher–Banyan architecture (16).

Crossbar Switches. The crossbar switch interconnects N inputs and N outputs in a fully meshed topology; that is, there are N² cross points within the switch (14) (see Fig. 12). Because it is always possible to establish a connection between any arbitrary input and output pair, internal blocking is impossible in a crossbar switch.
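Both the Banyan blocking example and the Batcher remedy described above can be checked with a small model. Here the fabric is modeled as an 8 × 8 shuffle-exchange (omega) network, one common Banyan realization; the exact wiring of the figure may differ. Under this assumption, after stage k a cell occupies the internal link labeled by the first k destination bits followed by the remaining source bits, so two cells collide exactly when a label coincides.

```python
# Self-routing through a 3-stage (8 x 8) omega-type Banyan fabric.
# Assumption: omega wiring, so after stage k a (src -> dst) cell sits on the
# link labeled by the top k bits of dst followed by the bottom 3-k bits of src.
N_BITS = 3

def internal_links(src: int, dst: int):
    """(stage, link-label) pairs a cell occupies after each stage."""
    links = []
    for k in range(1, N_BITS + 1):
        dst_part = dst >> (N_BITS - k)
        src_part = src & ((1 << (N_BITS - k)) - 1)
        links.append((k, (dst_part << (N_BITS - k)) | src_part))
    return links

# The text's example: inputs 1, 3, and 4 sending to outputs 0, 1, and 5.
paths = {(s, d): internal_links(s, d) for s, d in [(1, 0), (3, 1), (4, 5)]}
print(paths[(1, 0)][1] == paths[(3, 1)][1])  # True: collision after stage 2

# Batcher fix: the same destinations, sorted and presented on inputs 0, 1, 2.
sorted_paths = [internal_links(s, d) for s, d in [(0, 0), (1, 1), (2, 5)]]
links_used = [link for path in sorted_paths for link in path]
print(len(links_used) == len(set(links_used)))  # True: no internal collisions
```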
The architecture of the crossbar switch has some advantages. First, it uses a simple two-state cross-point switch (open and connected state), which is easy to implement. Second, the modularity of the switch design allows simple expansion: one can build a larger switch by simply adding more cross-point switches. Lastly, compared to Banyan-based switches, the crossbar switch design results in low transfer latency, because it has the smallest number of connecting points between input and output ports. One disadvantage of this design, however, is that it uses the maximum number of cross points (cross-point switches) needed to implement an N × N switch. The knockout switch by AT&T Bell Labs is a nonblocking switch based on the crossbar design (17,18). It has N inputs and N outputs and consists of a crossbar-based switch with a bus interface module at each output (Fig. 12).

Nonblocking Buffered Switches

Although some switches, such as Batcher–Banyan and crossbar switches, are internally nonblocking, two or more cells may still contend for the same output port in a nonblocking switch, resulting in the dropping of all but one cell. To prevent such loss, the buffering of cells by the switch is necessary. Figure 13 illustrates that buffers may be placed (1) in the inputs to the switch, (2) in the outputs of the switch, or (3) within the switching fabric itself, as a shared buffer (14). Some switches put buffers in both the input and output ports.

The first approach to eliminating output contention is to place buffers in the output ports of the switch (14). In the worst case, cells arriving simultaneously at all input ports can be destined for a single output port. To ensure that no cells are lost in this case, cell transfer must be performed at N times the speed of the input links, and the switch must be able to write N cells into the output buffer during one cell transmission time. Examples of output-buffered switches include the knockout switch by AT&T Bell Labs, the Siemens & Newbridge MainStreetXpress switches, ATML's VIRATA switch, and Bay Networks' Lattis switch.
The second approach to buffering in ATM switches is to place the buffers in the input ports of the switch (14). Each input has a dedicated buffer, and cells that would otherwise be blocked at the output ports of the switch are stored in input buffers. Commercial examples of switches with input buffers as well as output buffers are IBM's 8285 Nways switches and Cisco's Lightstream 2020 switches.

A third approach is to use a shared buffer within the switch fabric. In a shared buffer switch, there are no buffers at the input or output ports (14). Arriving cells are immediately injected into the switch. When output contention occurs, the winning cell goes through the switch, while the losing cells are stored for later transmission in a shared buffer common to all of the input ports. Cells just arriving at the switch join buffered cells in competition for available outputs. Because
Figure 13. Nonblocking buffered switches.
more cells are available to select from, it is possible that fewer output ports will be idle when using the shared buffer scheme. Thus, the shared buffer switch can achieve high throughput. However, one drawback is that cells may be delivered out of sequence, because cells that arrived more recently may win over buffered cells during contention (19). Another drawback is the increase in the number of input and output ports internal to the switch. The Starlite switch with trap, from AT&T Bell Labs, is an example of the shared buffer architecture (16). Other examples of shared buffer switches include Cisco's Lightstream 1010 switches, IBM's Prizma switches, Hitachi's 5001 switches, and Lucent's ATM cell switches.

CONTINUING RESEARCH IN ATM NETWORKS

ATM is continuously evolving, and its attractive ability to support broadband integrated services with strict quality of service guarantees has motivated the integration of ATM with existing, widely deployed networks. Recent additions to ATM research and technology include, but are not limited to, seamless integration with existing LANs [e.g., LAN emulation (20)], efficient support for traditional Internet IP networking [e.g., IP over ATM (21), IP switching (22)], and further development of flow and congestion control algorithms to support existing data services [e.g., ABR flow control (12)]. Research on ATM networks is proceeding and will undoubtedly continue as the technology matures.
BIBLIOGRAPHY

1. CCITT Recommendation I-Series. Geneva: International Telegraph and Telephone Consultative Committee.
2. J. B. Kim, T. Suda, and M. Yoshimura, International standardization of B-ISDN, Comput. Networks ISDN Syst., 27: 1994.
3. CCITT Recommendation G-Series. Geneva: International Telegraph and Telephone Consultative Committee.
4. ATM Forum Technical Specifications [Online]. Available: www.atmforum.com
5. Report of ANSI T1S1.5/91-292, Simple and Efficient Adaptation Layer (SEAL), August 1991.
6. Report of ANSI T1S1.5/91-449, AAL5—A New High Speed Data Transfer, November 1991.
7. CCITT Recommendation Q-Series. Geneva: International Telegraph and Telephone Consultative Committee.
8. J. Bae and T. Suda, Survey of traffic control schemes and protocols in ATM networks, Proc. IEEE, 79: 1991.
9. B. J. Vickers et al., Congestion control and resource management in diverse ATM environments, IECEJ J., J76-B-I (11): 1993.
10. J. S. Turner, New directions in communications (or which way to the information age?), IEEE Commun. Mag., 25 (10): 1986.
11. G. Gallassi, G. Rigolio, and L. Fratta, ATM: Bandwidth assignment and bandwidth enforcement policies, Proc. GLOBECOM '89.
12. ATM Forum, Traffic Management Specification Version 4.0, af-tm-0056.000, Mountain View, CA: ATM Forum, April 1996.
13. Flow Control Consortium, Quantum Flow Control Version 2.0, FCC-SPEC-95-1 [Online], July 1995. http://www.qfc.org
14. Y. Oie et al., Survey of switching techniques in high-speed networks and their performance, Int. J. Satellite Commun., 9: 285–303, 1991.
15. M. De Prycker and M. De Somer, Performance of a service independent switching network with distributed control, IEEE J. Select. Areas Commun., 5: 1293–1301, 1987.
16. A. Huang and S. Knauer, Starlite: A wideband digital switch, Proc. IEEE GLOBECOM '84, 1984.
17. K. Y. Eng, A photonic knockout switch for high-speed packet networks, IEEE J. Select. Areas Commun., 6: 1107–1116, 1988.
18. Y. S. Yeh, M. G. Hluchyj, and A. S. Acampora, The knockout switch: A simple, modular architecture for high-performance packet switching, IEEE J. Select. Areas Commun., 5: 1274–1283, 1987.
19. J. Y. Hui and E. Arthurs, A broadband packet switch for integrated transport, IEEE J. Select. Areas Commun., 5: 1264–1273, 1987.
20. ATM Forum, LAN Emulation over ATM Version 1.0, AF-LANE-0021, Mountain View, CA: ATM Forum, 1995.
21. IETF, IP over ATM: A framework document, RFC-1932, 1996.
22. Ipsilon Corporation, IP switching: The intelligence of routing, the performance of switching [Online]. Available: www.ipsilon.com
TATSUYA SUDA University of California, Irvine
ATC. See AIR TRAFFIC CONTROL. ATM. See STATISTICAL MULTIPLEXING. ATM NETWORKS. See ASYNCHRONOUS TRANSFER MODE NETWORKS.
ATM NETWORKS, VIDEO ON. See VIDEO ON ATM NETWORKS.
ATMOSPHERICS. See WHISTLERS. ATTENUATION. See REFRACTION AND ATTENUATION IN THE TROPOSPHERE.
Wiley Encyclopedia of Electrical and Electronics Engineering

CHANNEL CODING

Standard Article. Irving S. Reed (University of Southern California, Los Angeles, CA) and Xuemin Chen (General Instrument Corporation, San Diego, CA). Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. DOI: 10.1002/047134608X.W5306. Article online posting date: December 27, 1999.
Abstract. The sections in this article are: Error-Handling Processes and Error-Control Strategies; Basic Principles of Error-Control Codes; Linear Block Codes; Cyclic Codes; BCH Codes; Reed–Solomon Code; Noisy Channel Coding Theorem; Coding Performance and Decoding Complexity.

Keywords: error correction and detection; channel capacity; error-control coding; block codes; convolutional codes; information theory; BCH codes; Reed–Solomon codes; Hamming codes; maximum likelihood decoding
CHANNEL CODING
The general term channel coding refers to techniques by which redundancy symbols are attached to the data by a channel encoder. These redundancy symbols are used to detect, correct, or interpolate erroneous data at the channel decoder. Channel encoding is achieved by imposing relations on the information data and redundancy symbols; these restricting relations make it possible for the decoder to correctly extract the original source signal, with high reliability and fidelity, from a possibly corrupted received or retrieved signal. Channel coding is used in digital communication systems for several possible reasons: (1) to increase the reliability of noisy data communications channels or data storage systems; (2) to control errors in such a manner that a faithful reproduction of the data can be obtained; (3) to increase the overall signal-to-noise energy ratio (SNR) of a system; (4) to reduce the noise effects within a system; and (5) to meet the commercial demands of efficiency, reliability, and high performance in an economically practical digital transmission and storage system. All of these objectives must be tailored to the particular application. Channel coding is therefore also called error-control coding, or error-correction and detection coding.

ERROR-HANDLING PROCESSES AND ERROR-CONTROL STRATEGIES

Figure 1 shows a physical layer coding model of a digital communication system. The same model can be used to describe an information storage system if the storage medium is considered to be the channel. The source information is usually composed of binary or alphanumeric symbols. The encoder converts the information messages into electrical signals acceptable to the channel. These signals are then sent over the channel (or storage medium), where they may be disturbed by noise. Next, the output of the channel is sent to the decoder, which makes a decision to determine which message was sent.
Finally, this message is delivered to the data recipient (sink). Typical transmission channels are twisted-pair telephone lines, coaxial cables, optical fibers, radio links, microwave links, satellite links, and so forth. Typical storage media are semiconductor memories, magnetic tapes and discs, compact discs (CDs), optical memory units, and digital video discs (DVDs). Each of these channels or media is subject to various types of noise disturbance. For example, the disturbance on a telephone line may come from impulsive circuit switching noise, thermal noise, crosstalk between lines, or a loss of synchronization. The disturbances on a CD often are caused by surface defects, dust, or mechanical failure. Therefore, the problems a digital communication system faces
Figure 1. A physical layer model of a communication or storage system (source, encoder, channel or storage medium, decoder, sink, with a retransmission request path from the decoder back to the encoder).
are the possible message errors that might be caused by these different disturbances. To overcome these problems, "good" encoders and decoders must be designed to improve the performance of these channels.

Figure 2 shows the block diagram of a typical error-handling system. In an ideal system, the symbols obtained from the channel (or storage medium) match the symbols that originally entered it. In any practical system there are occasional errors, and the purpose of channel coding is to detect and possibly correct such errors. The first stage in Fig. 2 is concerned with encoding for error avoidance and the use of redundancy. This includes, for example, such processes as precoding data for modulation, placing digital data at an appropriate position on the tape for certain digital formats, rewriting a read-after-write error in a computer tape, and error-correction and detection encoding. Following these steps, the encoded data are delivered to the modulator in the form of a signal vector or code. The modulator then transforms the signal vector into a waveform that matches the channel. After transmission through the channel, the waveform often is disturbed by noise. The demodulation of this waveform can produce corrupted signal vectors, which in turn cause possible errors in the data. On receipt of the data, errors are first detected. The detection of an error then requires some course of action; for example, in a bidirectional link a retransmission might be requested. Finally, correctable error patterns can be eliminated by an error-correction engine.

The error-control strategy for the error-handling system shown in Fig. 2 depends primarily on the application, that is, on the channel properties of the particular communication link and the type of error-control codes to be used. Without the feedback line shown in Fig.
2, communication channels are one-way channels. The codes in this case are designed mainly for error correction and/or error concealment. Error control for a one-way system is usually accomplished by forward error correction (FEC), that is, by employing error-correcting codes that automatically correct errors detected at the receiver. Communication systems frequently employ two-way channels, a fact that must be considered in the design of an error-control system. With a two-way channel, both error-correcting and error-detecting codes can be used. When an error-detecting code is used and an error is detected at one terminal, a request for a repeat is sent to the transmitting terminal.

There are real one-way channels in which the error probabilities can be reduced by the use of an error-correcting code but not by an error detection and retransmission system. For example, with a magnetic-tape storage system, usually too much time has passed to ask for a retransmission after the tape has been stored for any significant period, say a week or sometimes just a day. In such a case, errors are detected when the record is read. Encoding with FEC codes is usually no more complex than it is with error-detecting codes; it is the decoding that requires sophisticated digital equipment. On the other hand, there are good reasons for using both error detection and retransmission for some applications when possible. Error detection is by its nature a much simpler computational task than error correction and requires much
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright # 1999 John Wiley & Sons, Inc.
188
CHANNEL CODING Retransmission request
Figure 2. The major processes in an error-handling system.
Error avoidance and redundancy encoding
Coded modulation
less complex decoding equipment. Also, error detection with retransmission tends to be adaptive. In a retransmission system, redundant information is utilized only in the retransmitted data when errors occur. This makes it possible, under certain circumstances, to obtain better performance with a system of this kind than is theoreticaly possible over a oneway channel. Error control by the use of error detection and retransmission is called automatic repeat request (ARQ). In an ARQ system, when an error is detected at the receiver, a request is sent to the transmitter to repeat the message. This process continues until it is verified that the message was received correctly. Typical applications of ARQ are the protocols for many fax modems. There is a definite limit to the efficiency of a system that uses simple error detection and retransmission alone. First, short error-detecting codes are not efficient detectors of errors. On the other hand, if very long codes are used, retransmission must be done too frequently. It can be shown that a combination of both the correction of the most frequent error patterns along with detection and retransmission of the less frequent error patterns is not subject to such a limitation. Such a mixed error-control process is usually called a hybrid error-control (HEC) strategy. In fact, HEC is often more efficient than either a forward error correction system or a detection and retransmission system. Many present-day digital systems use a combination of forward error correction and detection with or without feedback. If the error rate demanded by the application cannot be met by the unaided channel or storage medium, some form of error handling may be necessary. BASIC PRINCIPLES OF ERROR-CONTROL CODES We have seen that the performance of an error-handling system relies on error-correction and/or detection codes that are designed for the given error-control strategy. 
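Before turning to specific code constructions, the ARQ strategy described above can be sketched in a few lines. This is a toy stop-and-wait scheme with a single even-parity check bit, not any standardized protocol; note that an even number of bit errors would pass the check undetected, which is why practical systems use stronger detecting codes such as the CRCs discussed later in this article.

```python
import random

def add_parity(bits):
    """Append one even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """The receiver's check: accept iff the count of 1s is even."""
    return sum(word) % 2 == 0

def channel(word, flip_prob, rng):
    """Binary symmetric channel: flip each bit with probability flip_prob."""
    return [b ^ (rng.random() < flip_prob) for b in word]

def send_arq(bits, flip_prob=0.2, max_tries=100, rng=None):
    """Stop-and-wait ARQ: retransmit until no error is detected.
    Returns the accepted word and the number of transmissions used."""
    rng = rng or random.Random(0)
    word = add_parity(bits)
    for tries in range(1, max_tries + 1):
        received = channel(word, flip_prob, rng)
        if parity_ok(received):
            return received, tries
    raise RuntimeError("retransmission limit reached")
```

The adaptivity noted above is visible here: redundancy beyond the single parity bit is spent only when a retransmission is actually triggered.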
There are two different types of codes that are commonly used today: block and convolutional codes. It is assumed for both types of codes that the information sequence is encoded using an alphabet set Q of q distinct symbols, called a q-ary set, where q is a positive integer. In general, a code is called a block code if the coded information sequence can be divided into blocks of n symbols and each block can be decoded independently. The encoder of a block code divides the information sequence into message blocks of k information symbols each. A message block, called the message word, is represented by the k-tuple m = (m0, m1, ..., mk−1) of symbols. Evidently, there are a total of q^k different possible message words. The encoder transforms each message word m independently into an n-symbol codeword c = (c0, c1, ..., cn−1). Therefore, corresponding with
the q^k different possible messages, there are q^k different possible codewords at the encoder output. This set of q^k codewords of length n is called an (n, k) block code. The code rate of an (n, k) block code is defined to be R = k/n. If q = 2, the codes are called binary block codes and can be implemented with a combinational logic circuit. The encoder for a convolutional code accepts k-bit blocks of an information sequence and produces an encoded sequence of n-bit blocks. (In convolutional coding, the symbols are used to denote a sequence of blocks rather than a single block.) However, each encoded block depends not only on the corresponding k-bit message block, but also on the m previous message blocks. Hence, the encoder is said to have a memory of order m. The set of encoded sequences produced by a k-input and n-output encoder of memory order m is called an (n, k, m) convolutional code. Again, the ratio R = k/n is called the code rate of the convolutional code. Since the encoder contains memory, it is implemented with a sequential logic circuit. The basic principle of error-control coding is to add redundancy to the message in such a way that the message and its redundancy are related by some set of algebraic equations. When a message is disturbed, the message with such constrained redundancy still can be decoded by the use of these relations. In other words, the error-control capability of a code comes from this relational redundancy, which is added to the message during the encoding process. To illustrate the principle of error-control coding, an example of a binary block code of length 3 is discussed. There are a total of eight different possible binary 3-tuples: (000), (001), (010), (011), (100), (101), (110), (111). First, if all of these 3-tuples are used to transmit messages, one has the example of a (3,3) binary block code of rate 1. In this case, if a one-bit error occurs in any codeword, the received word becomes another codeword.
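Such a sequential encoder can be sketched in a few lines. The example below is a hypothetical rate-1/2, memory-order-2 encoder whose tap connections happen to match the (2,1,2) example that appears later in this article; the shift register holds the m = 2 previous message bits.

```python
def conv_encode(msg):
    """Rate-1/2 convolutional encoder with memory order m = 2.
    Per input bit, output the pair c1 = m[i-2] + m[i-1] + m[i] and
    c2 = m[i-2] + m[i] (mod 2); the shift register starts at zero."""
    m2 = m1 = 0                      # the two memory cells of the register
    out = []
    for m0 in msg:
        out.append((m2 ^ m1 ^ m0, m2 ^ m0))
        m2, m1 = m1, m0              # shift: m[i-1] -> m[i-2], m[i] -> m[i-1]
    return out
```

Because each output pair depends on the register contents as well as the current bit, the encoder is a sequential circuit, exactly as described above.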
Since any particular codeword may be a transmitted message and there are no redundancy bits in a codeword, errors can neither be detected nor corrected. The error detection and correction processes are closely related and will be dealt with presently. The actual correction of an error is simplified tremendously by the adoption of binary codes. There are only two symbols, 0 and 1, in this case. Hence, to correct a symbol it is sufficient to know that the symbol is wrong. Figure 3 shows the minimal circuit needed for correction once the bit in error has been identified. The exclusive-OR (XOR) gate shows up extensively in error correction circuits, and the figure also demonstrates its truth table. One way to remember the characteristics of this useful device is that there always is an output ‘‘1’’ when the inputs are different. Inspection of the truth table shows that there is an even number of 1’s in each row and, as a consequence, the device is also called an even parity gate.
Figure 3. The exclusive-OR (XOR) gate used to correct an identified wrong bit (wrong bit in, corrected bit out), and its truth table:

    A  B | C
    -----+--
    0  0 | 0
    0  1 | 1
    1  0 | 1
    1  1 | 0

Parity is a fundamental concept in error detection. In the previous example, let only four of the 3-tuples, (000), (011), (101), (110), be chosen as codewords for transmission. These are equivalent to the four 2-bit messages, (00), (01), (10), (11), with the third bit in each 3-tuple equal to the XOR of its first and second bits. This is an example of a (3,2) binary block code of rate 2/3. If a received word is not a codeword, i.e., the third bit does not equal the XOR of the first and second bits, then an error is detected. However, this code cannot correct any error. To illustrate what can happen when there are errors, suppose that the received word is (010). Such an error cannot be corrected even if there is only one bit in error since, in this case, the transmitted codeword has three possibilities: (000), (011), (110). To achieve error correction, more redundancy bits need to be added to the message words for transmission. Suppose only the two 3-tuples (000), (111) are chosen as codewords. This is a (3,1) binary block code of rate 1/3. The codewords (000), (111) are encoded by duplicating the source bits 0, 1 two additional times, that is, with two redundancy bits in each codeword. If such a codeword is sent through the channel and a one- or two-bit error occurs, the received word is not a codeword, so the errors are detected. If the decision rule of the decoder is to take as the original source bit the bit that appears in the majority of the three positions of the received word, a one-bit error is corrected. For instance, if the received word is (010), this decoder would say that 0 was sent.

Consider next the example of a (2,1,2) binary convolutional code. Let the information sequence be m = (m0, m1, ..., m6) = (1011100) and the encoded sequence be c = (c0(1)c0(2), c1(1)c1(2), ..., c6(1)c6(2)) = (11,10,00,01,10,01,11). Also assume the relations between the components of the vectors m and c are given by

    ci(1) = mi−2 + mi−1 + mi
    ci(2) = mi−2 + mi

where m−2 = m−1 = 0 and ‘‘+’’ means sum modulo 2. Suppose the third digit of the received sequence is in error. That is, let the received sequence begin with (11,00,00, ...). The following are the eight possible beginning code sequences: (00,00,00, ...), (00,00,11, ...), (00,11,10, ...), (00,11,01, ...), (11,10,11, ...), (11,10,00, ...), (11,01,01, ...), (11,01,10, ...). Clearly, the sixth path, which differs from the received sequence in but a single position, is intuitively the best choice. Thus, a single error is corrected by this observation. Next, suppose digits 1, 2, and 3 were all erroneously received. For this case, the closest code sequence would be (00,00,00, ...) and the decoder would make an undetectable error. For further understanding of convolutional codes, readers may refer to Refs. 1 and 2.

LINEAR BLOCK CODES
In the previous section, it was shown for a binary block code that the positions of the failed bits can be determined by the use of more parity bits. If these parity bits are generated by a linear combination of message bits, the code is called a linear block code. Some important concepts of linear block codes are introduced next by the example of the Hamming code. Consider a binary linear block code of length 7 and rate R = 4/7. A four-bit message word m = (m0, m1, m2, m3) is used to compute three redundancy bits and to make a seven-bit codeword c = (c0, c1, ..., c6) from the following set of equations:
    ci = mi,    i = 0, 1, 2, 3
    c4 = m0 + m2 + m3
    c5 = m0 + m1 + m2
    c6 = m0 + m1 + m3

These equations provide the parity-check equations in the matrix form cH^T = 0, where

        | 1 0 1 1 1 0 0 |
    H = | 1 1 1 0 0 1 0 |
        | 1 1 0 1 0 0 1 |
is called the parity-check matrix of the code and ‘‘T’’ denotes matrix transpose. The codewords are generated by c = m · G, where

        | 1 0 0 0 1 1 1 |
    G = | 0 1 0 0 0 1 1 |
        | 0 0 1 0 1 1 0 |
        | 0 0 0 1 1 0 1 |
is called the generator matrix of the code. In the Hamming code, four message bits are examined in turn, and each bit that is a ‘‘1’’ causes the corresponding row of G to be added to an XOR sum. For example, if the message word is (1001), the top and bottom rows of G are componentwise XORed. The first four columns of G form a submatrix, which is known as an identity matrix. Therefore, the first four data bits in the codeword are identical to the message bits that were to be conveyed. This is useful because the original message bits are encoded in an unmodified form, and the check bits are simply attached to the end of the message to construct the so-called systematic codeword. Almost all channel block coding systems use systematic codes. The redundancy or parity bits are calculated in such a manner that they do not use every message bit. If a message bit is not included in a parity check, it can fail without affecting the outcome of that check. For example, if the second bit of a codeword fails, the outcome of the parity-check equation given by the first row of H is not affected. However, the outcomes of other parity-check equations, given by the second and third rows of H, are affected. The position of the error is deduced from the pattern of these successful and unsuccessful
checks in the parity-check matrix. This pattern is known as the syndrome, defined by the matrix equation s = r · H^T, where r = c + e and e is a seven-bit error vector. In the previous example of the Hamming code, let a failed second bit be assumed in a received word r, i.e., e1 = 1 and ei = 0 for i ≠ 1. Because this bit is included in only two of the parity-check equations, there are two 1’s in the failure pattern, namely, 011. Since considerable care was taken in the design of the matrix pattern for generating the check bits, the syndrome, 011, is actually the address (i.e., [011]^T is the second column of H) of the error bit. This is a fundamental feature of the original Hamming codes, due to Richard Hamming in 1950. It is useful at this point to introduce the concept of Hamming distance. This is the number of positions in which two sequences of length n differ. The minimum (Hamming) distance d of a block code is by definition the least Hamming distance between any two distinct codewords of the code. That is, the minimum distance of a binary code equals the minimum number of bits that need to be changed in order to change any codeword into any other codeword. A linear code of length n, dimension k, and minimum distance d is often denoted by the notation (n, k, d). If errors corrupt a codeword so that it is no longer a codeword, the error is definitely detectable and possibly correctable. If errors convert one codeword into another, they are impossible to detect. Therefore, the minimum distance d indicates the detection and correction capabilities of the code. For the Hamming code example, it can be found by direct verification over the 2^4 = 16 codewords that the minimum distance of the Hamming code is 3. This coincides with the fact that two or fewer bit errors in any codeword of the Hamming code produce a noncodeword. Hence two-bit errors are always detectable.
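These relationships can be checked numerically. The short sketch below rebuilds G and H from the parity-check equations given above, flips the second bit of a codeword, and confirms that the syndrome equals the corresponding column of H and that the minimum distance is 3.

```python
from itertools import product

# Generator and parity-check matrices of the (7,4) Hamming code, as read
# off the parity equations c4 = m0+m2+m3, c5 = m0+m1+m2, c6 = m0+m1+m3.
G = [[1, 0, 0, 0, 1, 1, 1],
     [0, 1, 0, 0, 0, 1, 1],
     [0, 0, 1, 0, 1, 1, 0],
     [0, 0, 0, 1, 1, 0, 1]]
H = [[1, 0, 1, 1, 1, 0, 0],
     [1, 1, 1, 0, 0, 1, 0],
     [1, 1, 0, 1, 0, 0, 1]]

def encode(m):
    """c = m . G over GF(2): XOR the rows of G picked out by 1-bits of m."""
    return [sum(mi * gij for mi, gij in zip(m, col)) % 2 for col in zip(*G)]

def syndrome(r):
    """s = r . H^T over GF(2): one bit per parity check (row of H)."""
    return [sum(ri * hij for ri, hij in zip(r, row)) % 2 for row in H]

codewords = [tuple(encode(m)) for m in product((0, 1), repeat=4)]
d_min = min(sum(a != b for a, b in zip(u, v))
            for u in codewords for v in codewords if u != v)

r = list(encode([1, 0, 0, 1]))
r[1] ^= 1                # fail the second bit
s = syndrome(r)          # equals the column of H for the failed bit
```

Running this gives d_min of 3 and a syndrome of (0, 1, 1) for the failed second bit, in agreement with the discussion above.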
Correction is also possible if the following minimum distance rule is used: correction (decoding) with the minimum distance rule decodes each received word r to the codeword that is closest to it in Hamming distance. For example, if the received word from the Hamming code is r = (1111100) and an error occurred in the second bit [i.e., e = (0100000)], the minimum distance rule correctly decodes r to the codeword c = (1011100). It can be shown that the Hamming code is able to correct all single-bit errors. Associated with this fact is the important theorem for a linear block code given next.

Theorem. A linear block code (n, k, d) has the following minimum distance decoding properties:

1. If d ≥ e + 1, then the code can detect e errors.
2. If d ≥ 2t + 1, then the code can correct t errors.
3. If d ≥ t + e + 1 for e ≥ t, then the code can correct t errors and simultaneously detect e errors.

Intuitively, item (2) of this theorem can be explained as follows: if a codeword c is transmitted and errors occur in at most t positions, then the received word r clearly resembles the transmitted codeword c more than any other codeword. It has been shown for the Hamming code that the syndromes s are the addresses of the error bits. This concept can be generalized to all linear block codes. That is, each syndrome corresponds to one and only one error vector if the number of errors satisfies e ≤ (d − 1)/2. Another, simpler decoding method, called syndrome decoding, can be shown to be equivalent to the minimum distance decoding rule:

1. Compute the syndrome s of the received word r.
2. Determine the error vector e that corresponds to the syndrome s by a lookup table.
3. Decode r by choosing the corrected codeword to be r − e.

CYCLIC CODES

The implementation of the encoder of a Hamming code can be made very fast by the use of the parity-check equations. Such an implementation is ideal for some applications, such as computer memory protection, which require short codes and a fast access time. However, in many other applications, the messages are transmitted and stored serially, and it is desirable to use relatively large data blocks to reduce the memory storage devoted to preambles, addressing, and synchronization. Where large data blocks are to be handled, the use of simple parity-check equations for encoding has to be abandoned because it would become impossibly complex. However, the principle of the generator and parity-check matrices can still be employed for the encoder, but now these matrices usually are generated algorithmically. For the decoder, the syndromes are used to find the bits in error not by using a simple lookup table, but by solving algebraic equations. A subclass of linear block codes, called cyclic codes, can provide long codes that have the required encoding and decoding structures. A linear code C of length n is said to be cyclic if every cyclic shift of a codeword c is also a codeword, that is,

    c = (c0, c1, ..., cn−1) ∈ C ⇒ cπ = (cn−1, c0, c1, ..., cn−2) ∈ C

When messages can be accessed serially, simple circuitry can be used for the encoder since the same gate can be used for many XOR operations.
Unfortunately, the reduction in complexity of the encoding process is paralleled by an increase in the difficulty of explaining what takes place. The methodology described so far about how an error-correction system works is mainly in engineering terms. However, it can be described more mathematically so that the encoding and decoding of long codes can be accomplished in a more efficient manner. Toward this end, codewords of a cyclic code can be represented by the use of polynomials in the following way:
    c = (c0, c1, ..., cn−1) ∈ C ⇒ c(x) = c0 + c1x + ··· + cn−1x^(n−1) ∈ C(x)

where C(x) represents the set of polynomials associated with the set of all codewords of C. The term c(x) is called a code polynomial. It is clear that cπ(x) = x·c(x) − cn−1·(x^n − 1) ∈ C(x), that is, cπ(x) ≡ x·c(x) mod (x^n − 1). Let Rn[x] = GF(2)[x]/(x^n − 1) denote the ring of polynomials of degree at most n − 1 over the finite (or Galois) field GF(2) of two elements. Let g(x) be the monic polynomial of smallest degree in C(x). Then the degree of g(x) equals n − k, and every polynomial c(x) ∈ C(x) can be represented as c(x) = m(x)g(x) for some m(x) ∈ Rn[x].
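The identity cπ(x) ≡ x·c(x) mod (x^n − 1) is easy to check directly: multiplying by x and reducing mod x^n − 1 simply rotates the coefficient vector. A minimal sketch (coefficient lists with index = degree are an implementation choice here, not part of the standard notation):

```python
def cyclic_shift(c):
    """One right cyclic shift of a coefficient vector (index = degree)."""
    return [c[-1]] + c[:-1]

def times_x_mod(c):
    """Multiply c(x) by x and reduce mod x^n - 1: the coefficient of x^n
    wraps around to the constant term, since x^n is congruent to 1."""
    n = len(c)
    shifted = [0] + c            # x * c(x), degree can reach n
    shifted[0] ^= shifted[n]     # fold the x^n term back onto x^0
    return shifted[:n]

c = [1, 1, 0, 1, 0, 0, 0]        # c(x) = 1 + x + x^3 as a length-7 vector
```

For any length-7 vector the two operations agree, which is exactly the statement that a cyclic shift of a codeword corresponds to multiplication by x in Rn[x].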
Figure 4. An encoder of the (7,4,3) cyclic Hamming code: a shift-register circuit with message input m(x) and codeword output c(x).
Therefore, a cyclic code encoder can be conceived of as a polynomial multiplier that can be implemented by the use of what is called a shift register. For example, the cyclic Hamming code has the generator polynomial g(x) = x^3 + x + 1. Hence, one implementation of the encoder of this code is the shift-register device shown in Fig. 4. The register of the encoder is initially set to zero. Then the message word, m(x) = m0 + m1x + m2x^2 + m3x^3, is input in sequence from m3 to m0. After shifting seven times, a codeword c(x) = c0 + c1x + ··· + c6x^6 is the sequential output of the bits c6 to c0. Many other methods for encoding cyclic codes can be implemented by the use of shift-register circuits. The most useful of these techniques is the systematic encoding method. Encoding of an (n, k) cyclic code in systematic form consists of three steps: (1) multiply the message polynomial m(x) by x^(n−k); (2) divide x^(n−k)m(x) by g(x) to obtain the remainder b(x); and (3) form the codeword c(x) = b(x) + x^(n−k)m(x).
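The three-step systematic procedure can be sketched directly with coefficient arithmetic over GF(2); bit lists with index = degree are an implementation choice here, not part of the standard.

```python
def poly_mod(dividend, divisor):
    """Remainder of GF(2) polynomial division; polynomials are coefficient
    lists, index = degree (so g(x) = 1 + x + x^3 is [1, 1, 0, 1])."""
    rem = list(dividend)
    for i in range(len(rem) - 1, len(divisor) - 2, -1):
        if rem[i]:
            for j, coef in enumerate(divisor):
                rem[i - len(divisor) + 1 + j] ^= coef
    return rem[:len(divisor) - 1]

def encode_systematic(m, g):
    """Systematic cyclic encoding: (1) shift m(x) by x^(n-k), where
    n - k = deg g(x); (2) take the remainder b(x) of the shifted message
    mod g(x); (3) form c(x) = b(x) + x^(n-k) m(x)."""
    n_k = len(g) - 1
    shifted = [0] * n_k + list(m)    # step (1): x^(n-k) * m(x)
    b = poly_mod(shifted, g)         # step (2): parity polynomial b(x)
    return b + list(m)               # step (3): parity bits, then message

g = [1, 1, 0, 1]                     # g(x) = 1 + x + x^3 for the (7,4) code
c = encode_systematic([1, 0, 0, 1], g)
```

By construction the resulting c(x) is divisible by g(x), and the message bits appear unmodified in the high-order positions of the codeword.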
In general, the decoding of cyclic codes for error correction consists of the same three steps used for decoding any linear code: syndrome computation, association of the syndrome with the error pattern, and error correction. However, the limit to this approach is the complexity of the decoding circuit that is needed to determine the error word from the syndrome. Such procedures tend to grow exponentially with code length and the number of errors that need to be corrected. Many cyclic codes have considerable algebraic and geometric properties. If these properties are properly used, then a simplification in the decoding process is usually possible. Cyclic codes are well suited to error detection, and several have been standardized for use in digital communications. The most common of these have the following generator polynomials:
    x^16 + x^15 + x^2 + 1     (CRC-16)
    x^16 + x^12 + x^5 + 1     (CRC-CCITT)
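As a sketch of such an error-detecting scheme in software (plain polynomial arithmetic; the byte and bit ordering conventions of production CRC implementations vary and are not modeled here):

```python
def poly_mod(dividend, divisor):
    """Remainder of GF(2) polynomial division (lists, index = degree)."""
    rem = list(dividend)
    for i in range(len(rem) - 1, len(divisor) - 2, -1):
        if rem[i]:
            for j, coef in enumerate(divisor):
                rem[i - len(divisor) + 1 + j] ^= coef
    return rem[:len(divisor) - 1]

# x^16 + x^12 + x^5 + 1 (CRC-CCITT) as a coefficient list.
CRC_CCITT = [1 if i in (0, 5, 12, 16) else 0 for i in range(17)]

def encode_crc(message, g=CRC_CCITT):
    """Append 16 check bits: remainder of x^16 * m(x) divided by g(x)."""
    shifted = [0] * (len(g) - 1) + list(message)
    return poly_mod(shifted, g) + list(message)

def detects(received, g=CRC_CCITT):
    """The check fails (error detected) iff g(x) does not divide r(x)."""
    return any(poly_mod(received, g))

word = encode_crc([1, 0, 1, 1, 0, 1])
corrupted = word.copy()
corrupted[3] ^= 1                    # any single-bit error is caught
```

A transmitted codeword passes the check because it is a multiple of g(x); flipping one bit adds a single power of x, which g(x) never divides, so every single-bit error is detected.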
These codes can detect many combinations of errors, and the implementation of both the encoding and error-detecting circuits is quite practical. Since every codeword of a cyclic code can be computed from its generator polynomial g(x) as the product c(x) = d(x)g(x), it is clear that s(x) ≡ 0 mod g(x) if and
only if the received polynomial r(x) is a code polynomial. This very useful fact is often employed to design the efficient error-detecting circuits of most cyclic codes. The results discussed in this section are easily generalized to codes constructed over any finite field GF(q), where q is some power of a prime number p.

BCH CODES

One class of cyclic codes was introduced in 1959 by Hocquenghem, and independently in 1960 by Bose and Ray-Chaudhuri. The codes are known as BCH codes and can be described by means of the roots of a polynomial g(x) with coefficients in a finite field. A cyclic code of length n over GF(2) is called a BCH code of designed distance δ if its generator g(x) is the least common multiple of the minimal polynomials of β^l, β^(l+1), ..., β^(l+δ−2) for some l, where β is a primitive nth root of unity. If n = 2^m − 1, i.e., β is a primitive element of GF(2^m), then the BCH code is called a primitive BCH code. The performance of a BCH code is specified by its designed distance using the following fact: the minimum distance of a BCH code with designed distance δ is at least δ. This fact is usually called the BCH bound. A primitive BCH code of designed distance δ has minimum distance d ≤ 2δ − 1. To decode BCH codes, consider once again a BCH code of length n over GF(2) with designed distance δ = 2t + 1, and let β be a primitive nth root of unity in GF(2^m). Consider a codeword c(x) and assume that the received word is

    r(x) = r0 + r1x + ··· + rn−1x^(n−1)

Let e(x) = r(x) − c(x) = e0 + e1x + ··· + en−1x^(n−1) be the error vector. Now, define the following: M = {i | ei ≠ 0} is the set of positions where errors occur, and e = |M| is the number of errors. The polynomial σ(z) = ∏_{i∈M} (1 − β^i z) in z is called the error-locator polynomial. Also, let ω(z) = Σ_{i∈M} ei β^i z ∏_{j∈M\{i}} (1 − β^j z) be what is known as the error-evaluator polynomial. It is clear that if one can find σ(z) and ω(z), then the errors can be corrected.
In fact, an error occurs in position i if and only if σ(β^−i) = 0, and in that case the error is given by ei = −β^i ω(β^−i)/σ′(β^−i), where σ′(·) denotes the derivative. Assume that the number of errors e ≤ t (if e > t, one does not expect to be able to correct the errors). Observe that
    ω(z)/σ(z) = Σ_{i∈M} ei β^i z / (1 − β^i z)
              = Σ_{i∈M} ei Σ_{l≥1} (β^i z)^l
              = Σ_{l≥1} z^l Σ_{i∈M} ei β^(li)
              = Σ_{l≥1} z^l e(β^l)
where all of these calculations use the operations of what is known as a formal power series over the finite field GF(2^m). For 1 ≤ l ≤ 2t, one gets e(β^l) = r(β^l), i.e., the receiver knows the first 2t coefficients on the right-hand side of the equation. Therefore, ω(z)/σ(z) is known mod z^(2t+1). It is claimed that the receiver must determine polynomials σ(z) and ω(z) in such
a manner that deg[ω(z)] ≤ deg[σ(z)], with deg[σ(z)] being as small as possible, under the condition

    ω(z) ≡ σ(z) Σ_{l=1..2t} z^l r(β^l)    (mod z^(2t+1))
In practice, it is very important to find a fast algorithm that actually determines σ(z) and ω(z) by solving these equations. Two commonly used algorithms are the Berlekamp–Massey decoding algorithm, introduced by E. R. Berlekamp and J. Massey, and the Euclidean algorithm. Interested readers may refer to (1,3–5).
REED–SOLOMON CODE

A very useful class of nonbinary cyclic codes is called the Reed–Solomon (RS) code. RS codes were first discovered by I. S. Reed and G. Solomon in 1958. An RS code is defined over GF(p^m) with length n = p^m − 1 and minimum distance d = n − k + 1. Its generator polynomial is g(x) = (x − α^u)(x − α^(u+1)) ··· (x − α^(u+d−2)), where α is a primitive element in GF(p^m) and where u is some integer. Since RS codes are cyclic, they can be encoded by the product of g(x) and the polynomial associated with the information vector, or by a systematic encoding. For example, the RS code of length n = 7, dimension k = 5, and minimum distance d = 3 with p = 2 is specified by the generator polynomial g(x) = (x − α^3)(x − α^4) = x^2 + α^6 x + 1, where α is a root of the irreducible polynomial x^3 + x + 1 and is a primitive element of the finite field GF(2^3). In RS codes, data bits are assembled into words, or symbols, which become elements of the Galois field upon which the code is based. The number of bits in the symbol determines the size of the Galois field, and hence the number of symbols in a codeword. A symbol length of eight bits is commonly used because it fits in conveniently with modern byte-oriented computers and processors. The Galois, or finite, field with eight-bit symbols is denoted by GF(2^8). Thus, the RS codes defined over GF(2^8) have a length of 2^8 − 1 = 255 symbols. As each symbol contains eight bits, the codeword is 255 × 8 = 2040 bits long. A primitive polynomial commonly used to generate GF(2^8) is g(x) = x^8 + x^4 + x^3 + x^2 + 1. The decoders of RS codes are usually implemented by the Euclidean algorithm and the Berlekamp–Massey algorithm.
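The small GF(2^3) example above can be verified with a few lines of field arithmetic. Storing field elements as 3-bit integers (bit i = coefficient of x^i) is an implementation choice, not part of the code's definition.

```python
def gf8_mul(a, b, prim=0b1011):
    """Multiply in GF(2^3) defined by the primitive polynomial x^3 + x + 1;
    elements are 3-bit integers with bit i = coefficient of x^i."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b1000:               # reduce as soon as degree reaches 3
            a ^= prim
    return r

alpha = 0b010                        # alpha = x, a primitive element
powers = {0: 1}
for i in range(1, 7):
    powers[i] = gf8_mul(powers[i - 1], alpha)

# Subtraction equals addition in characteristic 2, so
# g(x) = (x - alpha^3)(x - alpha^4) = x^2 + (alpha^3 + alpha^4)x + alpha^7.
coeff_x = powers[3] ^ powers[4]              # alpha^3 + alpha^4
coeff_const = gf8_mul(powers[3], powers[4])  # alpha^7 = 1
```

Expanding the product indeed gives the linear coefficient α^6 and constant term 1, matching g(x) = x^2 + α^6 x + 1 as stated above.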
NOISY CHANNEL CODING THEOREM Several important classes of codes have been discussed in the previous sections. In this section the performance to be expected from channel coding is discussed briefly. Some commonly used quantities for measuring performance improvement by channel coding include the error probabilities from the decoder, such as the bit-error rate of the system, the probability of an incorrect decoding of a codeword, and the probability of an undetected error. In the physical layers of a communication system, these error probabilities usually depend on the particular code, the decoder, and, more importantly, on the underlying channel/medium error probabilities.
Figure 5. Coded system on an additive noise channel: digital source → m → encoder → c → modulator → s(t) → additive noise channel (noise n(t)) → r(t) → demodulator → decoder → digital sink.
Figure 5 shows a block diagram of a coded system for an additive noise channel. In such a system, the source output m is encoded into a code sequence (codeword) c. Then c is modulated and sent to the channel. After demodulation the decoder receives a sequence r which satisfies r = c + e, where e is the error sequence and ‘‘+’’ usually denotes component-wise vector XOR addition. The final decoder output m̂ represents the recovered message. The primary purpose of a decoder is to produce an estimate m̂ of the transmitted information sequence m based on the received sequence r. Equivalently, since there is a one-to-one correspondence between the information sequence m and the codeword c, the decoder can produce an estimate ĉ of the codeword c. Clearly, m̂ = m if and only if ĉ = c. A decoding rule is a strategy for choosing an estimated codeword ĉ for each possible received sequence r. If the codeword c was transmitted, a decoding error occurs if and only if ĉ ≠ c. On the assumption that r is received, the conditional error probability of the decoder is defined by

    P(E|r) = P(ĉ ≠ c|r)    (1)

The error probability of the decoder is then given by

    P(E) = Σ_r P(E|r) P(r)    (2)
where P(r) denotes the probability of receiving the word r and the summation is over all possible received words. Evidently, P(r) is independent of the decoding rule used since r is produced prior to the decoding process. Hence, the optimum decoding rule must minimize P(E|r) = P(ĉ ≠ c|r) for all r. Since minimizing P(ĉ ≠ c|r) is equivalent to the maximization of P(ĉ = c|r), P(E|r) is minimized for a given r by choosing ĉ to be the codeword c that maximizes

    P(c|r) = P(r|c)P(c) / P(r)    (3)
That is, ĉ is chosen to be the most likely codeword, given that r is received. If all codewords are equally likely, i.e., P(c) is the same for all c, then maximizing Eq. (3) is equivalent to the maximization of the conditional probability P(r|c). If each received symbol in r depends only on the corresponding transmitted symbol, and not on any previously transmitted symbol, the channel is called a discrete memoryless channel (DMC). For a DMC, one obtains

    P(r|c) = ∏_i P(ri|ci)    (4)

since for a memoryless channel each received symbol depends only on the corresponding transmitted symbol. A decoder that chooses its estimate to maximize Eq. (4) is called a maximum likelihood decoder (MLD). One of the most interesting problems in channel coding is to determine, for a given channel, how small the probability of error can be made in a decoder by a code of rate R. A complete answer to this problem is provided to a large extent by a specialization of an important theorem, due to Claude Shannon in 1948, called the noisy channel coding theorem or the channel capacity theorem. Roughly speaking, Shannon’s noisy channel coding theorem states: For every memoryless channel of capacity C, there exists an error-correcting code of rate R < C such that the error probability P(E) of the maximum likelihood decoder for a power-constrained system can be made arbitrarily small. If the system operates at a rate R > C, the system has a high probability of error, regardless of the choice of the code or decoder. The capacity C of a channel defines the maximum number of bits that can be reliably sent per second over the channel.
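Both ideas can be made concrete on the binary symmetric channel (BSC). For crossover probability p < 1/2, Eq. (4) gives P(r|c) = p^d (1 − p)^(n−d), where d is the Hamming distance between r and c, so the ML decoder simply picks the codeword nearest to r. The capacity formula C = 1 − H(p) bits per channel use for the BSC is a standard result quoted here as an aside, not derived in this article.

```python
from math import log2

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def ml_decode_bsc(r, codewords):
    """ML decoding on a BSC with p < 1/2: the likelihood p^d (1-p)^(n-d)
    decreases in d, so choose the codeword at minimum Hamming distance."""
    return min(codewords, key=lambda c: hamming_distance(r, c))

def bsc_capacity(p):
    """C = 1 - H(p) bits per channel use, with H the binary entropy."""
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * log2(p) + (1 - p) * log2(1 - p)

# With the (3,1) repetition codebook, ML decoding is the majority vote.
codebook = [(0, 0, 0), (1, 1, 1)]
```

For the repetition codebook, ML decoding reproduces the majority-vote rule discussed earlier, and the capacity falls from 1 bit at p = 0 to 0 bits at p = 1/2.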
CODING PERFORMANCE AND DECODING COMPLEXITY

The noisy channel coding theorem states that there exist ‘‘good’’ error-correcting codes for any rate R < C such that the probability of error in an ML decoder is arbitrarily small. However, the proof of this theorem is nonconstructive, which leaves open the problem of the search for specific ‘‘good’’ codes. Also, Shannon assumed exhaustive ML decoding, which has a complexity proportional to the number of words in the code. It is clear that long codes are required to approach capacity and, therefore, that more practical decoding methods are needed. These problems, left by Shannon, have kept researchers searching for good codes for almost 50 years, up to the present time. Gallager (6) showed that the probability of error of a ‘‘good’’ block code of length n and rate R < C is bounded exponentially in the block length as follows:

    P(E) ≤ e^(−nE_b(R))    (5)

where what is known as the error exponent E_b(R) is greater than zero for all rates R < C. Like Shannon, Gallager continued to assume a randomly chosen code and an exhaustive ML decoding. The decoding complexity K̂ is then of the order of the number of codewords, i.e., K̂ ≅ e^(nR), and therefore the decrease of P(E) is bounded only algebraically in the decoding complexity:

    P(E) ≤ K̂^(−E_b(R)/R)    (6)
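The algebraic form of Eq. (6) is just Eq. (5) rewritten through the substitution K̂ = e^(nR); a quick numeric check with arbitrary illustrative values (not data from any real code):

```python
from math import exp

def exp_bound(n, Eb):
    """Right-hand side of Eq. (5): e^(-n * Eb)."""
    return exp(-n * Eb)

def complexity_bound(n, R, Eb):
    """Right-hand side of Eq. (6) with K = e^(nR): K^(-Eb/R)."""
    K = exp(n * R)
    return K ** (-Eb / R)

# The two expressions coincide: e^(-n*Eb) = (e^(nR))^(-Eb/R).
```

The check also makes the qualitative point of this section visible: the bound shrinks exponentially in n, but only polynomially in the decoding complexity K̂.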
The exponential error bound given in Eq. (5) for block codes also extends to convolutional codes of memory order m in the form

    P(E) ≤ e^(−(m+1)nE_c(R))    (7)

where (m + 1)n is called the constraint length of the convolutional code, and the convolutional error exponent E_c(R) is greater than zero for all rates R < C. Both E_b(R) and E_c(R) are positive functions of R for R < C and are completely determined by the channel characteristics. It is shown in (2) and (6) that the complexity of an implementation of the ML decoding algorithm called the Viterbi algorithm is exponential in the constraint length, i.e., K̂ ≅ e^((m+1)nR). Thus, the probability of error is again only an algebraic function of its complexity, as follows:

    P(E) ≤ K̂^(−E_c(R)/R)    (8)

Both of the bounds Eq. (5) and Eq. (7) imply that an arbitrarily small error probability is achievable for R < C either by increasing the code length n for block codes or by increasing the memory order m for convolutional codes. For codes to be very effective, they must be long in order to average the effects of noise over a large number of symbols. Such a code may have as many as 2^200 possible codewords and many times that number of possible received words. While an exhaustive ML decoding still conceptually exists, such a decoder is impossible to implement. It is very clear that the key obstacle to approaching channel capacity is not only the construction of specific ‘‘good’’ long codes, but also the problem of their decoding complexity. Certain simple mathematical constructs enable one to determine the most important properties of ‘‘good’’ codes. Even more importantly, such criteria often make it feasible for the encoding and decoding operations to be implemented in practical electronic equipment. Thus, there are three main aspects of the channel coding problem: (1) to find codes that have the required error-correcting ability (this usually demands that the codes be long); (2) a practical method of encoding; and (3) a practical method of making decisions at the receiver, that is, performing the error-correction process. Interested readers should refer to the literature (1–9).

BIBLIOGRAPHY

1. S. Lin and D. J. Costello, Jr., Error Control Coding: Fundamentals and Applications, Englewood Cliffs, NJ: Prentice-Hall, 1983.
2. A. J. Viterbi and J. K. Omura, Principles of Digital Communication and Coding, New York: McGraw-Hill, 1979.
3. W. W. Peterson and E. J. Weldon, Jr., Error-Correcting Codes, 2nd ed., Cambridge, MA: The MIT Press, 1972.
4. F. J. MacWilliams and N. J. A. Sloane, The Theory of Error-Correcting Codes, New York: North-Holland, 1977.
5. E. R. Berlekamp, Algebraic Coding Theory, New York: McGraw-Hill, 1968.
6. R. G. Gallager, Information Theory and Reliable Communication, New York: Wiley, 1968.
7. G. D. Forney, Jr., Concatenated Codes, Cambridge, MA: The MIT Press, 1966.
8. R. Blahut, Theory and Practice of Error Control Codes, Reading, MA: Addison-Wesley, 1983.
9. G. C. Clark and J. B. Cain, Error Correction Coding for Digital Communications, New York: Plenum, 1981.

IRVING S. REED
University of Southern California

XUEMIN CHEN
General Instrument Corporation
CHAOS. See CHAOS, BIFURCATIONS, AND THEIR CONTROL.
Wiley Encyclopedia of Electrical and Electronics Engineering

Client–Server Systems
Andrzej Goscinski and Wanlei Zhou, Deakin University, Geelong, Vic, Australia
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5303
Article Online Posting Date: December 27, 1999

Abstract. The sections in this article are: The Client-Server Model; Communication Between Clients and Servers; Sun's Network File System; The Development of the Rhodos; Building the Distributed Computing Environment on Top of Existing Operating Systems.
CLIENT–SERVER SYSTEMS
By amalgamating computers and networks into one single computing system, a distributed computing system has created the possibility of sharing information and peripheral resources. Furthermore, these systems improve the performance of a computing system and of individual users through parallel execution of programs, load balancing and sharing, and replication of programs and data. Distributed computing systems are also characterised by enhanced availability and increased reliability. However, the amalgamation process has also generated some serious challenges and problems. The most important, critical challenge was to synthesise a model of distributed computing to be used in the development of both application and system software. Another critical challenge was to develop ways to hide the distribution of resources and build relevant services upon them. The synthesis of a model of distributed computing has been influenced by a need to deal with the issues generated by distribution, such as

- Locating and accessing remote data, programs, and peripheral resources
- Coordinating distributed programs executing on different computers
- Maintaining the consistency of replicated data and programs
- Detecting and recovering from failures
- Protecting data and programs stored and in transit
- Authenticating users

The model that has been used to develop application and system software of distributed computing systems is the client-server model. Because of this, the current image of computing is client-server distributed computing. The goal of this article is to introduce and discuss the client-server model and the communication paradigm which supports it, and to show how this model has influenced the development of different systems and applications. This article contains three major parts. The first part introduces the client-server model and different concepts and extensions to this model.
The second part discusses communication supporting distributed computing systems built based on the client-server model. It contains a detailed discussion of two dimensions of the communication paradigm: the communication pattern, one-to-one and group communication; and the techniques, message passing and remote procedure call (RPC), which are used to design and build client-server based applications. The third part presents advanced applications developed based on the client-server model. The first and simplest of the presented applications of the client-server model is the network file system (NFS). It is an extension to centralised (local) operating systems (e.g., Unix, MS-DOS) which allows transparent remote file access. Subsequent sections show the RHODOS distributed operating system and the distributed computing environment (DCE), respectively. RHODOS has been built from scratch on top of a bare computer. It employs the concept of a microkernel, which is the cornerstone of the whole client-server-based operating system. It provides full transparency to the user. On the other hand, DCE is built on top of existing operating systems such as Unix and VMS, and hides differences among individual computers. However, it does not fully support transparency.

THE CLIENT-SERVER MODEL

The Client-Server Model in a Distributed Computing System

A distributed computing system is a set of application and system programs and data dispersed across a number of independent personal computers connected by a communication network. In order to provide requested services to users, the system and relevant application programs must be executed. Because services are provided as a result of executing programs on a number of computers, with data stored in one or more locations, the whole activity is called distributed computing. The problem is how to formalize the development of distributed computing. The main issue of distributed computing is programs in execution, which are called processes. The second issue is that these processes cooperate or compete in order to provide the requested services. The client-server model is a natural model of distributed computing: it is able to deal with the problems generated by distribution, can be used to describe these processes and their behavior when providing services to users, and allows the design of system and application software for distributed computing systems. According to this model there are two processes: the client, which requests a service from another process, and the server, which is the service provider.
The server performs the requested service and sends back a response. This response could be a processing result, a confirmation of completion of the requested operation, or even a notice about a failure of an operation. The client-server model, and the association between this model and the physical environment in which it is used, are illustrated in Fig. 1. The basic items of the model, the client and server and the request and response, and the elements of a distributed computing system are distinguished. This figure firstly shows that the user must send a request to an individual server in order to be provided with a given service; a need for another service requires the user to send a request to another server. Secondly, the client and server processes execute on two different computers. They communicate at the virtual (logical) level by exchanging requests and responses. In order to achieve this virtual communication, physical messages are sent between these two processes. This implies that the operating systems of the computers and the communication system of the distributed computing system are actively involved in the service provision. The most important features of the client-server model are simplicity, modularity, extensibility, and flexibility. Simplicity manifests itself by closely matching the flow of data with the control flow. Modularity is achieved by organizing and integrating a group of computer operations into a separate service. Also, any set of data with operations on this data can be organized as a separate service. The whole distributed computing system developed based on the client–server model can be easily extended by adding new services in the form of new servers. The servers which do not satisfy user requirements can be easily modified or even removed. Only the interfaces between the clients and servers must be maintained. From the user's point of view, a distributed computing system can provide services such as printing, electronic mail, file service, authentication, naming, database service, and computing service. These services are provided by appropriate servers.

[Figure 1. The client–server model and its association with operating systems and a communication facility. A client process and a server process, each supported by its local operating system, exchange a request and a response at the virtual level; the actual requesting and responding messages travel through the network communication facility.]

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

Cooperation between Clients and Servers in a Distributed Computing System

A system where there is only one server would not be able to provide high-performance, reliable, and cost-effective services to users. As was shown in the previous section, one server is used to provide services to more than one client. The simplest form of cooperation between clients and servers (based on sharing) allows for lowering the costs of the whole system and more effective use of resources. Examples of services based on this form of cooperation are a printing service and a file service. Processes can act as either clients or servers, depending on the context. A file server which receives a request to read a file from a user's client process must check the access rights of this user. For this purpose it sends a request to an authentication server and waits for a response. Its response to the client depends on the response from the authentication server; the file server acts as a client of the authentication server. Thus, a service provided to the user by a distributed computing system developed based on the client-server model can require a chain of cooperating servers.
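As a concrete illustration, the file-server/authentication-server chain just described can be sketched as follows. This is a minimal, single-process sketch with hypothetical names (AuthenticationServer, FileServer, and the request fields are all invented for illustration), not the interface of any real system; a real deployment would exchange these requests as network messages.

```python
# Sketch of a chain of cooperating servers: the file server acts as a
# client of the authentication server before answering the user's request.

class AuthenticationServer:
    def __init__(self, rights):
        self._rights = rights  # user -> set of files the user may read

    def handle(self, request):
        user, filename = request["user"], request["file"]
        return {"granted": filename in self._rights.get(user, set())}

class FileServer:
    def __init__(self, auth_server, files):
        self._auth = auth_server   # client-side reference to the
        self._files = files        # authentication server

    def handle(self, request):
        # Act as a client of the authentication server first.
        check = self._auth.handle({"user": request["user"],
                                   "file": request["file"]})
        if not check["granted"]:
            return {"status": "denied"}                    # failure notice
        return {"status": "ok",
                "data": self._files[request["file"]]}      # processing result

auth = AuthenticationServer({"alice": {"report.txt"}})
fs = FileServer(auth, {"report.txt": "quarterly figures"})

print(fs.handle({"user": "alice", "file": "report.txt"}))  # granted
print(fs.handle({"user": "bob", "file": "report.txt"}))    # denied
```

Note that the file server's response to its client depends entirely on the authentication server's response, exactly the chained behavior described in the text.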
Distributed computing systems provide the opportunity to improve performance through parallel execution of programs on a network (sometimes called clusters) of workstations, and decrease the response time of databases through data replication. Furthermore, they can be used to support synchronous distant meetings and cooperative workgroups. They also can increase reliability by service multiplication.
In these cases many servers must contribute to the overall application. Furthermore, in some cases this requires simultaneous requests to be sent to a number of servers. Different applications will require different semantics for the cooperation between clients and servers. Distributed computing systems have moved from the basic one-to-one client–server model to the one-to-many and chain models in order to improve performance and reliability. Furthermore, client and server cooperation can be strongly influenced and supported by some active entities which are extensions to the client-server model. In a distributed computing system there are two different forms of cooperation between clients and servers. The first form assumes that a client requests a temporary service. Another situation is generated by a client which wants to arrange for a number of calls to be directed to a particular serving process. This implies a need for establishing long-term bindings between a client and a server.

Groups in Distributed Computing Systems

A group is a collection of processes, in particular servers, which share common features (described by a set of attributes) or application semantics. In general, processes are grouped in order to

- deal with a set of processes as a single abstraction;
- form a set of servers which can provide an identical service (but not necessarily of the same quality);
- encapsulate the internal state and hide interactions among group members from the clients, and provide a uniform interface to the external world; and
- deliver a single message to multiple receivers, thereby reducing the sending and receiving overheads (1).

There are two types of groups: closed and open (2). In a closed group only the members of the group can send and receive messages to access the resources of the group. In an open group not only can the members of the group exchange messages and request services, but nonmembers of the group can also send messages to group members.
Importantly, the nonmembers of the group need not join the group nor have any knowledge that the requested service is provided by a group. Four group structures are often supported to provide the most appropriate policy for a wide range of user applications.
The peer group is composed of a set of member processes that cooperate for a particular purpose; fault-tolerant and load-sharing applications dominate this group style. The client-server group is made from a potentially large number of client processes with a peer group of server processes; it is an open group. The diffusion group is a special case of the client-server group, where a single request message is sent by a client process to all servers. The hierarchical group is an extension to the client–server group; in large applications with a need for sharing between large numbers of group members, it is important to localize interactions within smaller clusters of components in an effort to increase performance.

According to external behavior, groups can be classified into two major categories: deterministic and nondeterministic. A group is considered deterministic if each member must receive and act on a request. This requires coordination and synchronization between the members of the group. In a deterministic group, all members are considered equivalent. Nondeterministic groups assume their applications do not require consistency in group state and behavior, and they relax the deterministic coordination and synchronization. The members of a nondeterministic group are not equivalent, and each can provide a different response to a group request, or not respond at all, depending on the individual member's state and function. In order to act properly and efficiently, the members of a group must exchange messages amongst themselves (above normal application messages) to resolve the current status and membership of the group. Any change in group membership requires all members to be notified in order to satisfy the requested message requirements. Furthermore, users are provided with primitives to support group membership discovery (3) and group association operations (4). Group membership discovery allows a process to determine the state of the group and its membership.
However, as the requesting process has no knowledge of the group members' location, a network broadcast is required. There are four operations to support group association: create, destroy, join, and leave. Initially, a process requiring group communication creates the required group. A process is considered to be a group member after it has successfully issued a group join primitive, and it remains a member of the group until it issues a leave group primitive. When the last member of the group leaves, the group is destroyed.

Extensions to the Client-Server Model

A client and server can cooperate either directly or indirectly. In the former case there is no additional entity which participates in exchanging requests and responses between a client and a server. Indirect cooperation in the client-server model requires two additional entities, called agents, to request a service and to be provided with the requested service. The role of these agents can vary from a simple communication module which hides communication network details to an entity which is involved in mediating between clients and servers, resolving heterogeneity issues, and managing resources and cooperating servers. As was presented previously, a client can invoke desired servers explicitly by sending direct requests to these servers. In this case the programmer of a user application must concentrate both on the application and on managing server cooperation and communication. Writing resource management and communication software is expensive, time consuming, and error prone. The interface between the client and the server is complicated, differs from one application to another, and the whole service provided is not transparent to the client process. Clients can also request multiple services implicitly. This requires the client to send only one request to a general server. The requested service will be composed by this invoked server, cooperating with other servers based on information provided in the request. After completion of the necessary operations by the involved servers, the general server sends a response back to the client. This coordination can be performed by a properly designed agent. Despite the fact that such an agent is quite complicated, the cooperation between the client and the server is based on a single, well-defined interface. Furthermore, transparency is provided to the client, which reduces the complexity of the application. Cooperation between a client and multiple servers can be supported by a simple communication system which employs a one-to-one message protocol. Although this communication pattern is simple, its performance is poor, because each server involved must be invoked by sending a separate message. The overall performance of a communication system supporting message delivery in a client–server based distributed computing system can be dramatically improved if a one-to-many communication pattern is used. In this case a single request is sent by the client process to all servers, specified by a single group name. The use of multicast at the physical/data link layer improves this system even further.

The Three-Tier Client-Server Architecture

Agents and servers acting as clients can generate different architectures of distributed computing systems.
The three-tier client-server architecture extends the basic client-server model by adding a middle tier to support the application logic and common services. In this architecture, a distributed application consists of three components: a user interface and presentation processing component, responsible for accepting inputs and presenting the results (the client tier); a computational function processing component, responsible for providing transparent, reliable, secure, and efficient distributed computing, and for performing the processing necessary to solve a particular application problem (the application tier); and a data access processing component, responsible for accessing data stored on external storage devices, such as disk drives (the back-end tier). These components can be combined and distributed in various ways to create different configurations with varying complexity. Figure 2(a) shows a centralized configuration where all three types of components are located on a single computer. Figure 2(b) shows three two-tier configurations where the three types of components are distributed on two computers. Figure 2(c) shows a three-tier configuration where all three types of components are distributed on different computers.

[Figure 2. One- (a), two- (b), and three-tier (c) client–server configurations, showing how the user interface/presentation, computational function, and data access components are placed on client and server computers.]

Figure 3 illustrates an example implementation of the three-tier architecture. In this example, the upper tier consists of client computers that run user interface processing software. The middle tier consists of computers that run computational function processing software. The bottom tier consists of back-end data servers. In a three-tier client–server architecture, application clients usually do not interact directly with the data servers; instead, they interact with the middle-tier servers to obtain services. The middle-tier servers then either fulfil the requests themselves, sending the results back to the clients, or, more commonly, if additional resources are required, act (as clients themselves) on behalf of the application clients to interact with the data servers in the bottom tier or with other servers within the middle tier. Compared with a normal two-tier client–server architecture, the three-tier client–server architecture demonstrates: (1) better transparency, since the servers within the application tier allow an application to detach the user interface from back-end resources, and (2) better scalability, since servers as individual entities can be easily modified, added, or removed.

Service Discovery

To invoke a desired service a client must know whether there is a server which is able to provide this service, as well as its characteristics, name, and location. This is the issue of service discovery. In the case of a simple distributed computing system, where there are only a few servers, there is no need to identify the existence of a desired server; information about all available servers is available a priori. This implies that service discovery is restricted to locating the server which provides the desired service. On the other hand, in a large distributed computing system which is a federation of a set of distributed computing systems, with the potential for many service providers who offer and withdraw these services dynamically, there is a need to learn both whether a proper service (e.g., a very fast color printer of high quality) is available at a given time, and if so, its name and location. Service discovery is achieved through the following approaches.

Computer Address Is Hardwired into Client Code. This approach requires the location of the server, in the form of a computer address, to be provided. However, it is only applicable in very small and simple systems, where there is only one server process running on the destination computer. Another version of this approach is based on a more advanced naming system, where requests are sent to processes rather than to computers. In this case each process is located using a pair 〈computer_address, process_name〉. A client is provided with not only the name of a server, but also the address of the server's computer. This solution is not location transparent, as the user is aware of the location of the server.

Broadcast Is Used to Locate Servers. According to this approach each process has a unique name. In order to send a request a client must know the name of the server. However, this is not enough, because the client's operating system must also learn the address of the computer where the server runs. For this purpose the client's operating system broadcasts a special locate request containing the name of the server, which is received by all computers on the network. An operating system which finds the server's name in the list of its processes sends back a 'here I am' response containing its address (location). The client's operating system receives the response and can store (cache) the server's
computer address for future communication. This approach is transparent; however, the broadcast overhead is high, as all computers on the network are involved in the processing of the locate request.

[Figure 3. An application of the three-tier architecture in a distributed computing system: users interact with client computers running user interface and presentation software (client tier); these act as clients of server computers running computational functions and services (application tier), which in turn act as clients of fast parallel computers providing data services over data storage (back-end tier).]

Server Location Lookup Is Performed via a Name Server. This approach is very similar to the broadcast-based approach; however, it reduces the broadcast overhead. In order to learn the address of a desired server, the operating system of the client's computer sends a 'where is' request to a special system server, called a name server, asking for the address of the computer where the desired server runs. This means that the name and location (computer address) of the name server are known to all computers. The name server sends back a response containing the address of the desired server. The client's operating system receives the response and can cache the server's computer address for future communication. This approach is transparent and much more efficient than the broadcast-based approach. However, because the name server is centralized, the overall performance of the distributed computing system could be degraded, as the name server can become a bottleneck. Furthermore, the reliability of this approach is low: if the name server's computer crashes, the distributed computing system cannot work.

In a large distributed computing system there could be a large number of servers. Moreover, servers of the same type can be characterized by different attributes describing the services they provide (e.g., one laser printer is a color printer, another is a black and white printer). Furthermore, servers can be offered by some users and revoked dynamically. A user is not able to know the names and attributes of all these servers, nor their dynamically changing availability. There must be a server which can support users in dealing with these problems.

A Broker Is Employed. This approach is very similar to the server location lookup performed via a name server. However, there are real conceptual differences between a broker and a name server: a broker frees clients from remembering ASCII names or path names of all servers (and eventually the server locations), and allows clients to identify attributes of servers and learn about their availability. A broker is a server which (1) allows a client to identify available servers by a set of attributes which describe the properties of a desired service; (2) mediates cooperation between clients and servers; (3) allows service providers to register the services they support, by providing their names, locations, and features in the form of attributes; (4) advertises registered services and makes them available to clients; and (5) withdraws services dynamically. Thus, a broker is a server which embodies both service management and naming services.
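The broker just described can be sketched as a small registry. This is an illustrative sketch only: the names, attributes, and the register/withdraw/lookup interface below are invented for the example and do not correspond to any existing broker API; a real broker would also mediate the subsequent client–server exchange.

```python
# Sketch of a broker: servers register a name, a location, and descriptive
# attributes; clients look services up by attribute instead of remembering
# server names or locations; services can be withdrawn dynamically.

class Broker:
    def __init__(self):
        self._services = {}  # name -> {"location": ..., "attrs": {...}}

    def register(self, name, location, **attrs):
        self._services[name] = {"location": location, "attrs": attrs}

    def withdraw(self, name):
        self._services.pop(name, None)  # services can be revoked at any time

    def lookup(self, **wanted):
        # Return (name, location) pairs whose attributes match the request.
        return [(n, s["location"]) for n, s in self._services.items()
                if all(s["attrs"].get(k) == v for k, v in wanted.items())]

broker = Broker()
broker.register("lp1", "host-a:515", kind="printer", color=True)
broker.register("lp2", "host-b:515", kind="printer", color=False)

# The client asks for a service by its properties, not by name:
print(broker.lookup(kind="printer", color=True))  # [('lp1', 'host-a:515')]
```

Compared with the plain name server, the lookup key here is a set of attributes rather than a known server name, which is exactly the conceptual difference noted in the text.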
Client–Server Interoperability

Reusability of servers is a critical issue for both users and software manufacturers, due to the high cost of software writing. This issue can be easily resolved in a homogeneous environment, because the accessing mechanisms of clients may be made compatible with software interfaces, with static compatibility specified by types and dynamic compatibility by protocols. Cooperation between heterogeneous clients and servers is much more difficult, as they are not fully compatible. Thus, the issue is how to make them interoperable. Wegner (5) defines interoperability as the ability of two or more software components to cooperate despite differences in language, interface, and execution platform. There are two aspects of client–server interoperability: the unit of interoperation, and the interoperation mechanism. The basic unit of interoperation is a procedure (5). However, larger-granularity units of interoperation may be required by software components. Furthermore, preservation of temporal and functional properties may also be required. There are two major mechanisms for interoperation: interface standardization and bridging. The objective of the former is to map client and server interfaces to a common representation. The advantages of this mechanism are: (1) it separates the communication models of clients from those of servers, and (2) it provides scalability, since it only requires m + n maps, where m and n are the numbers of clients and servers, respectively. The disadvantage of this mechanism is that it is closed. The objective of the latter is to provide a two-way map between client and server. The advantages of this mechanism are: (1) openness, and (2) flexibility: it can be tailored to the requirements of a given client and server pair. However, this mechanism does not scale as well as the interface standardization mechanism, as it requires m × n maps.
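The scalability argument (m + n maps for interface standardization versus m × n maps for pairwise bridging) can be made concrete with a toy sketch. The request formats and field names below are invented for illustration: each of the m client formats is mapped once into a common representation, and each of the n server formats is mapped once from it, so any client can reach any server through only m + n converters.

```python
# Interface standardization: every format converts to/from one common
# representation, so adding a client or a server costs one map, not m or n.

# m client-side maps: native request -> common representation
to_common = {
    "clientA": lambda req: {"op": req["operation"], "arg": req["value"]},
    "clientB": lambda req: {"op": req[0], "arg": req[1]},
}

# n server-side maps: common representation -> native request
from_common = {
    "serverX": lambda c: ("do_" + c["op"], c["arg"]),
    "serverY": lambda c: {"cmd": c["op"], "param": c["arg"]},
}

def interoperate(client, server, request):
    # Route any client's request to any server via the common representation.
    return from_common[server](to_common[client](request))

print(interoperate("clientB", "serverX", ("read", 42)))  # ('do_read', 42)
print(interoperate("clientA", "serverY",
                   {"operation": "write", "value": 7}))
```

A bridging design would instead need a dedicated converter for every (client, server) pair, i.e., m × n converters, which is the scaling disadvantage noted in the text.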
Conclusions

In this section we introduced the client-server model and some concepts related to this model. Partitioning software into clients and servers allows us to place these components independently on computers in a distributed computing system. Furthermore, it allows these clients and servers to execute on different computers in order to complete the processing of an application in an integrated manner. This paves the way to achieving high productivity and high performance in distributed computing. The client-server model is becoming the predominant form of software application design and operation. However, to fully benefit from the client–server model, there is a need to employ an operating system and a communication network which links the computers on which these processes run. Furthermore, in order to locate a server, the operating system must be involved. The question is what class of operating system can be used. There are two classes of operating systems which could be employed to develop a distributed computing system: a network operating system and a distributed operating system. A network operating system is constructed by adding a module to the local centralized operating system of each computer which allows processes to access remote resources and services; however, in the majority of cases this solution does not fully support transparency. A distributed operating system is built from scratch and hides the distribution of resources and services; this solution, although futuristic from the current-practice point of view, provides location transparency. It is clear that the extensions to the basic client-server model, described in the previous sections, are achieved through an operating system. Furthermore, network communication services are invoked by an operating system on behalf of cooperating clients and servers.

COMMUNICATION BETWEEN CLIENTS AND SERVERS

Distributed computing systems must be fast in order to instil in users the feeling of a huge powerful computer sitting on their desks. This implies that communication between clients and servers must be fast. Furthermore, the speed of communication between remote client and server processes should not differ greatly from the speed between local processes. The issue is how to build a communication facility within a distributed computing system that achieves high communication performance. One of the strongest factors influencing the performance of a communication facility is the communication paradigm: that is, the communication model supporting cooperation between clients and servers, and the operating system support provided to deal with this cooperation. There are two issues in the communication paradigm. Firstly, a client can send a request either to a single server or to a group of servers. This leads to two patterns of communication: one-to-one and one-to-many, the latter also called group communication (both are operating system abstractions). Secondly, these two patterns of interprocess communication can be developed based on two different techniques: message passing, adopted for distributed computing systems in the late 1970s, and remote procedure call (RPC), adopted for distributed computing systems in the mid-1980s.
These two techniques are supported by two respective sets of primitives provided by an operating system. Furthermore, communication between processes on different computers can be given the same format as communication between processes on a single computer. The following topics are discussed in this section. Firstly, message passing is presented, including communication primitives and their semantics: direct and indirect communication; blocking and nonblocking primitives; buffered and unbuffered exchange of messages; and reliable and unreliable primitives. Secondly, RPC is discussed: the basic features of this technique; parameters, results, and their marshalling; client-server binding; and reliability issues. Thirdly, group communication is considered: in particular, the basic concepts of this communication pattern; group structures; different types of groups; group membership; message delivery and response semantics; and message ordering in group communication. Message Passing—Message-Oriented Communication We define message-oriented communication as a form of communication in which the user is explicitly aware of the messages used in communication and the mechanisms used to deliver and receive messages (6).
CLIENT–SERVER SYSTEMS
Basic Message Passing Primitives. A message is sent and received by executing the following two primitives: send(dest, src, buffer). The execution of this primitive sends the message stored in buffer to a server process named dest. The message contains the name of the client process, src, to be used by the server to send a response back. receive(client, buffer). The execution of this primitive causes the receiving server process to be blocked until a message arrives. The server process specifies the name of the client process from which a message is desired, and provides a buffer to store the incoming message. Note that the receive primitive must be issued before a message arrives; otherwise the request could be declared lost and would have to be retransmitted by the client. Of course, when the server process sends any message to the client process, it must also use these two primitives; the server sends a message by executing the primitive send and the client receives it by executing the primitive receive. There are several points that should be discussed at this stage. All of them are connected with one question: What semantics should these primitives have? The following alternatives are presented: direct or indirect communication via ports; blocking versus nonblocking primitives; buffered versus unbuffered primitives; reliable versus unreliable primitives; and structured forms of message-passing primitives. Direct and Indirect Communication via Ports. A very basic issue in message-based communication is where messages go. Message communication between processes uses one of two techniques: the sender designates either a fixed destination process or a fixed location for receipt of a message. The former technique is called direct communication—it uses direct names; the latter is called indirect communication, and it exploits the concept of a port.
In direct communication, each process that wants to send or receive a message must explicitly name the recipient or sender of the communication. In this case, the send and receive primitives have the following form: send(dest, src, buffer), receive(client, buffer). The dest and client are the names of the destination process (server) and the sending process (client) from which the server is prepared to receive a request. This scheme exhibits symmetry in naming: that is, both the sender and the receiver have to name one another in order to communicate. A variant of this scheme employs asymmetry in naming: only the client names the server, whereas the server is not required to name the client. Direct communication is easy to implement and to use. It enables a process to control the times at which it receives messages from each process. The disadvantage of the symmetric and asymmetric schemes is the limited modularity of the resulting process definitions. Changing the name of a process may necessitate examining all other process definitions: all references to the old name must be found in order to modify them to the new name. This is not desirable from the point of view of separate compilation. Moreover, the receive primitive in a server should allow receipt of a mes-
sage from any client to provide a service to whatever client process calls it. Direct communication does not allow more than one client. Similarly, direct communication does not make it possible to send one request to more than one identical server. This implies the need for a more sophisticated technique. Such a technique is based on ports. A port can be abstractly viewed as a protected kernel object into which messages may be placed by processes and from which messages can be removed: that is, messages are sent to and received from ports. Processes may have ownership, send, and receive rights on a port. Each port has a unique identification (name) that distinguishes it. A process may communicate with other processes through a number of different ports. In this case dest in the send primitive is the name of a port of the server the request is sent to. Logically associated with each port is a FIFO queue of finite length. Messages which have been sent to this port but which have not yet been removed from it by a process reside on this queue. Messages may be added to this queue by any process which can refer to the port via a local name (e.g., capability). A port should be declared. A port declaration serves to define a queuing point for messages. A process which wants to remove a message from a port must have the appropriate receive rights. Usually, only one process may have receive access to a port at a time. Messages sent to a port are normally queued in FIFO order. However, an emergency message can be sent to a port and receive special treatment with regard to queuing. Blocking versus Nonblocking Primitives. One of the most important properties of message passing primitives concerns whether their execution could cause delay. We distinguish blocking and nonblocking primitives. We say that a primitive has nonblocking semantics if its execution never delays its invoker; otherwise, a primitive is said to be blocking. In the former case, a message must be buffered.
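The port abstraction and the blocking/nonblocking send variants just described can be sketched in Python. This is a hypothetical illustration only: the names Port, send_blocking, and send_nonblocking are ours, not an operating system API, and a real port is a protected kernel object with ownership, send, and receive rights.

```python
import queue
import threading

class Port:
    """A toy 'port': a bounded FIFO queue of messages.
    (Illustrative only; real ports are kernel objects with access rights.)"""
    def __init__(self, capacity=8):
        self._queue = queue.Queue(maxsize=capacity)

    def send_blocking(self, message):
        # Blocking send: the caller waits until the message is queued.
        self._queue.put(message)

    def send_nonblocking(self, message):
        # Nonblocking send: return immediately; report failure if the queue is full.
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            return False

    def receive(self, timeout=None):
        # Blocking receive: the caller is suspended until a message arrives.
        return self._queue.get(timeout=timeout)

# Two-way communication uses the same primitives symmetrically:
# the server receives a request on one port and replies on another.
request_port = Port()
reply_port = Port()

def server():
    src, body = request_port.receive()      # blocks until a request arrives
    reply_port.send_blocking(("reply", body.upper()))

worker = threading.Thread(target=server)
worker.start()
request_port.send_blocking(("client_1", "hello"))
reply = reply_port.receive()
worker.join()
print(reply)   # -> ('reply', 'HELLO')
```

Note how the receiving server blocks in receive() until a message is placed on the port's queue, matching the blocking semantics described above.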
The previously described primitives have blocking semantics. It is necessary to distinguish two different forms of the blocking send primitive. These forms are generated by different criteria. The first criterion reflects the operating system design and addresses buffer management and message transmission. The blocking and nonblocking send primitives are illustrated in Fig. 4. If the blocking send primitive is used, the sending process (client) is blocked: that is, the instruction following the send primitive is not executed until the message has been completely sent. The blocking receive implies that the process which issued this primitive remains blocked (suspended) until a message arrives and is put into the buffer specified in the receive primitive. If the nonblocking send primitive is used, the sending process (client) is only blocked for the period of copying the message into a kernel buffer. This means that the instruction following the send primitive can be executed even before the message is sent. This allows parallel execution of the process and the message transmission. The second criterion reflects the client-server cooperation and the programming language approach to dealing with message communication. In this case the client is blocked until the server (receiver) has accepted the request message and
Figure 4. Operating-system-oriented blocking (a) and nonblocking (b) send primitives.
the result or acknowledgment has been received by the client, as illustrated in Fig. 5. There are three forms of the receive primitive. The blocking receive is the most common, since the receiving process often has nothing else to do while awaiting receipt of a message. There is also a nonblocking receive primitive, and a primitive for checking whether a message is available to receive. As a result, a process can receive all messages and then select one to process. Unbuffered versus Buffered Message Passing Primitives. In some message-based communication systems, messages are buffered between the time they are sent by a client and received by a server. If a buffer is full when a send is executed,
Figure 5. Client–server cooperation oriented blocking send primitive.
there are two possible solutions: the send may be delayed until there is space in the buffer for the message, or the send may return a code to the client, indicating that because the buffer is full the message could not be sent. The situation of the receiving server is different. The receive primitive informs the operating system about a buffer into which the server wishes to put an arrived message. A problem occurs when the receive primitive is issued after the message arrives. The question is what to do with the message. The first possible approach is to discard the message. The client could time out and re-send, and hopefully the receive primitive will be invoked in the meantime; otherwise, the client can give up. The second approach to dealing with this problem is to buffer the message in the operating system area for a specified period of time. If during this period the appropriate receive primitive is invoked, the message is copied to the invoking server's space. If the receive primitive is not invoked and the timeout expires, the message is discarded. Buffered message passing systems are more complex than unbuffered ones, since they require creation, destruction, and management of the buffers. They also generate protection problems, and cause catastrophic-event problems when a process owning a port dies or is killed. Unreliable versus Reliable Primitives. Different catastrophic events, such as a computer crash or a communication system failure, can happen in a distributed computing system. These can cause a request message to be lost in the network, a response message to be lost or delayed in transit, or the responding computer to "die" or become unreachable. Moreover, messages can be duplicated or delivered out of order. The primitives discussed previously cannot cope with these problems. They are called unreliable primitives. The unreliable primitive send merely puts a message on the network.
No guarantee of delivery is provided, and no automatic retransmission is carried out by the operating system when a message is lost. Dealing with failure requires providing reliable primitives. In reliable interprocess communication, the send primitive handles lost messages using internal retransmissions and acknowledgments on the basis of timeouts. This implies that when send terminates, the process is sure that the message was received and acknowledged. Reliable and unreliable receive differ in that the former automatically sends an acknowledgment confirming message reception, whereas the latter does not. Two-way communication requires the utilization of the basic message passing primitives in a symmetrical way. If the client requested any data, the server sends reply messages (responses) using the send primitive. For this reason the client has to issue the receive primitive to receive any message from the server. Reliable and unreliable primitives are contrasted in Fig. 6. Structured Forms of Message Passing Based Communication. A structured form of communication using message passing is achieved by distinguishing requests and replies and providing for bidirectional information flow. This means that the client sends a request message and waits for a response. The set of primitives is as follows.
Figure 6. Unreliable (a) and reliable (b) message passing primitives.

When remote procedure calls are used, a client interacts with a server by means of the call statement service_name(value_args, result_args).
send(dest, src, buffer). Sends a request and gets a response; it combines the previous client's send to the server with a receive to get the server's response. get_request(client, buffer). Executed by the receiver (server) to acquire a message containing work for it to do. send_response(src, dest, buffer). The receiver (server) uses this primitive to send a reply after completion of the work. It should be emphasised that the semantics described in the previous sections can also be applied to these primitives. The result of the send and receive combination in the structured form of the send primitive is one operation performed by the interprocess communication system. This implies that rescheduling overhead is reduced, buffering is simplified (because request data can be left in a client's buffer, and the response data can be stored directly in this buffer), and the transport-level protocol is simplified. Remote Procedure Call Message passing between remote and local processes is visible to the programmer. It is a completely untyped technique. Programming message-passing-based applications is difficult and error prone. An answer to these problems is the RPC technique, which is based on the fundamental linguistic concept known as the procedure call. The very general term remote procedure call means a type-checked mechanism that permits a language-level call on one computer to be automatically turned into a corresponding language-level call on another computer. The first and most complete description of the RPC concept was presented in Ref. 7. Basic Features of Remote Procedure Calls. The idea of remote procedure calls (RPC) is very simple and is based on the observation that a client sends a request and then blocks until a remote server sends a response. This approach is very similar to a well-known and well-understood mechanism referred to as a procedure call.
Thus, the goal of a remote procedure call is to allow distributed programs to be written in the same style as conventional programs for centralized computer systems. This implies that RPC must be transparent. This leads to one of the main advantages of this communication approach: the programmer does not have to know whether the called procedure is executing on a local or a remote computer.
To illustrate that both local and remote procedure calls look identical to the programmer, suppose that a client program requires some data from a file. For this purpose there is a read primitive in the program code. In a system supported by a classical procedure call, the read routine from the library is inserted into the program. This procedure, when executing, puts the parameters into registers, and then traps to the kernel as a result of issuing a READ system call. From the programmer's point of view there is nothing special; the read procedure is called by pushing the parameters onto the stack and is executed. In a system supported by RPC (Fig. 7), the read routine is a remote procedure which runs on a server computer. In this case, another procedure, called a client stub, from the library is inserted into the program. When executing, it also traps to the kernel. However, rather than placing the parameters into registers, it packs them into a message and issues the send primitive, which forces the operating system to send it to the server. Next, it calls the receive primitive and blocks itself until the response comes back. The server's operating system passes the arrived message to a server stub, which is bound to the server. The stub is blocked waiting for messages as a result of issuing the receive primitive. The parameters are unpacked from the received message and a procedure is called in a conventional manner. Thus, the parameters and return address are on the stack, and the server does not see that the original call was made on a remote client computer. The server executes the procedure call and returns the results to the virtual caller: that is, the server stub. The stub packs them into a message and issues send to return the results. The stub then returns to the beginning of its loop to issue the receive primitive, and blocks waiting for the next request message.
The result message on the client computer is copied to the client process's buffer (in practice, to the stub's part of the client). The message is unpacked, and the results are extracted and copied to the client in a conventional manner. As a result of calling read, the client process finds its data available. The client does not know that the procedure was executed remotely. It is evident that the semantics of remote procedure calls are analogous to those of local procedure calls: the client is suspended when waiting for results; the client can pass arguments to the remote procedure; and the called procedure can return results. However, since the client's and server's processes are on different computers (with disjoint address spaces), the remote procedure has no access to the data and variables of the client's environment. There is a difference between message passing and remote procedure calls. Whereas in message passing all required values must be explicitly assigned into the fields of a message before transmission, the remote procedure call provides marshalling of the parameters for message transmission: that is, the list of parameters is collected together by the system to form a message.
Figure 7. The sequence of operations in RPC.
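The stub sequence just described can be simulated in a few lines of Python. This is a sketch under simplifying assumptions: the names client_stub, server_stub, and transport are hypothetical, pickle stands in for a real marshalling format, and the "network" is a local function call rather than a send/receive pair.

```python
import pickle

# The 'remote' procedure, living on the server side.
def read(filename, nbytes):
    data = {"config.txt": b"hello world"}     # stand-in for a real file system
    return data[filename][:nbytes]

SERVER_PROCEDURES = {"read": read}

def client_stub(procedure, *args):
    # Marshalling: collect the procedure name and parameters into a message.
    request = pickle.dumps((procedure, args))
    response = transport(request)             # stands in for send()/receive()
    return pickle.loads(response)             # unmarshal the result

def server_stub(request):
    # Unmarshal the request, call the procedure conventionally,
    # then marshal the result for the trip back.
    procedure, args = pickle.loads(request)
    result = SERVER_PROCEDURES[procedure](*args)
    return pickle.dumps(result)

def transport(request):
    # In a real system the message crosses the network; here it is local.
    return server_stub(request)

# The client calls read() as if it were local; the stub hides the messages.
result = client_stub("read", "config.txt", 5)
print(result)   # -> b'hello'
```

The client code never touches a message: packing, transmission, and unpacking are entirely inside the stubs, which is exactly the transparency property RPC is meant to provide.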
Parameters and Results in RPCs. One of the most important problems of the remote procedure call is parameter passing and the representation of parameters and results in messages. Parameters can be passed by value or by reference. By-value message systems require that message data be physically copied. Thus, passing value parameters over the network is easy: the stub copies parameters into a message and transmits it. If the semantics of the communication primitives allow the client to be suspended until the message has been received, only one copy operation is necessary. Asynchronous message semantics often require that all message data be copied twice: once into a kernel buffer and again into the address space of the receiving process. Data copying costs can dominate the performance of by-value message systems. Moreover, by-value message systems often limit the maximum size of a message, forcing large data transfers to be performed in several message operations, reducing performance. Passing reference parameters (pointers) over a network is more complicated. In general, passing data by reference requires sharing of memory. Processes may share access to either specific memory areas or entire address spaces. As a result, messages are used only for synchronization and to transfer small amounts of data, such as pointers to shared memory. The main advantage of passing data by reference is that it is cheap: large messages need not be copied more than once. The disadvantages of this method are that the programming task becomes more difficult, and that it requires a combination of virtual memory management and interprocess communication, in the form of distributed shared memory. Marshalling Parameters and Results. Remote procedure calls require the transfer of language-level data structures between the two computers involved in the call. This is generally performed by packing the data into a network buffer on one computer and unpacking it at the other site.
This operation is called marshalling. More precisely, marshalling is the process (performed when sending the request as well as when sending the result back) in which three actions can be distinguished:
Extracting the parameters to be passed to the remote procedure and the results of executing the procedure; Assembling these two into a form suitable for transmission among the computers involved in the remote procedure call; and Disassembling them on arrival. The marshalling process must reflect the data structures of the language. Primitive types, structured types, and user-defined types must be considered. In the majority of cases, marshalling procedures for scalar data types, and procedures to marshal structured types built from the scalar ones, are provided as a part of the RPC software. Client-Server Binding. Usually, RPC hides all details of locating servers from clients. However, as we stated in a previous section, in a system with more than one server (e.g., file server, print server), knowledge of the location of a client's files or of a special type of printer is important. This implies the need for a mechanism to bind a client and a server, in particular, to bind an RPC stub to the right server and remote procedure. There are two aspects of binding: the way the client specifies what it wants to be bound to (this is the problem of naming), and the way the client locates the server and specifies the procedure to be invoked (this is the problem of addressing). In a distributed computing system there are two different forms of cooperation between clients and servers. The first form assumes that a client requests a temporary service. The other arises when a client wants to arrange for a number of calls to be directed to a particular serving process. This implies a need for a run-time mechanism for establishing long-term bindings between this client and a server. In the case of requests for a temporary service, the problem can be solved using broadcast and multicast messages to locate a server. In the case of long-term bindings, such a solution is not enough, because the process wants
to call the located server during a time horizon. This means that a special binding table should be created in which established long-term binding objects (i.e., a client name and a server name) are registered. The RPC run-time procedure for performing remote calls expects to be provided with a binding object as one of its arguments. This procedure directs the call to the binding address received. It should be possible to add new binding objects to the table, remove binding objects from the binding table (which in practice means breaking a binding), and update the binding table. In systems with name servers, broadcasting is replaced by the operation of sending a request to a name server asking for the location of a given server, and receiving a response with the address of this server. Binding can take place at compile time, link time, or call time. Error Recovery Issues. Because the client and server are separate processes which run on separate computers, they are prone to failures of themselves, their computers, or the communication system. The remote procedure may not complete successfully. For example, the result message is not returned to the client as a response to its call message, because one of four events may occur: the request message is lost; the result (response) message is lost; the server computer crashes and is restarted; or the client computer crashes and is restarted. These events form the basis for the design of RPC recovery mechanisms. Three different semantics of RPC, and their mechanisms, can be identified to deal with the problems generated by these four events: Maybe call semantics. Timeouts are used to prevent a client waiting indefinitely for a response message; At-least-once call semantics. This mechanism usually includes timeouts and a call retransmission procedure. The client tries to call the remote procedure until it gets a response or can tell that the server has failed; Exactly-once call semantics.
With at-least-once call semantics it can happen that the call is received by the server more than once, because of lost responses. This can have the wrong effect. To avoid this, the server responds to each retransmission with the result of the first execution of the called procedure. Thus, the mechanisms for exactly-once semantics include, in addition to those used in at-least-once call semantics (i.e., timeouts, retransmissions), call identifications and a server table of current calls. This table is used to store calls received for the first time and the procedure execution results for these calls. Message Passing versus Remote Procedure Calls. A problem arises in deciding which of the two interprocess communication techniques presented is better, if either, and whether there are any suggestions for when, and for what systems, these facilities should be used. First of all, the syntax and semantics of the remote procedure call are a function of the programming language being used. On the other hand, choosing a precise syntax and semantics for message passing is more difficult than for RPC because there are no standards for messages. Moreover, if the language aspects of RPC are neglected, then because of the variety
of message passing semantics, these two facilities can look very similar. Examples of message passing systems that look like RPC are message passing for the V system (which Ref. 8 now calls a remote procedure call system) and message passing for Amoeba (9) and RHODOS (10). Comparing the remote procedure call and message passing, the former has the important advantage that the interface of a remote service can be easily documented as a set of procedures with certain parameter and result types. Moreover, from the interface specification, it is possible to automatically generate code that hides all of the details of messages from a programmer. On the other hand, a message passing model provides flexibility not found in remote procedure call systems. However, this flexibility comes at the cost of difficulty in precisely documenting the behavior of a message passing interface. The question remains when these facilities should be used. The message passing approach appears preferable when serialization of request handling is required. The RPC approach appears preferable when there are significant performance benefits to concurrent request handling. RPC is particularly efficient for request–response transactions. Group Communication Distributed computing systems provide the opportunity to improve overall performance through parallel execution of programs on a network of workstations, decreasing the response time of databases using data replication, supporting synchronous distant meetings and cooperative workgroups, and increasing reliability by service multiplication. In these cases many servers must contribute to the overall application. This implies a need to invoke multiple services by sending a simultaneous request to a number of servers. This leads to group communication. The concept of a process group is not new.
The V-system (11), Amoeba (2), Chorus (12), and RHODOS (10) all support this basic abstraction, providing process groups to applications and operating system services with the use of group communication. Basic Concepts of Group Communication. Group communication is an operating system abstraction which supports the programmer by offering convenience and clarity. This operating system abstraction must be distinguished from message transmission mechanisms such as multicast (one-to-many physical entities connected by a network) or its special case broadcast (one-to-all physical entities connected by a network). A request is sent by a client called src to a group of servers providing the desired service, named group_name, by executing either send(group_name, src, buffer) when the message passing technique is used, or call service_name(value_args, result_args) when the RPC technique is used. This request is delivered following the semantics of the primitive used. The primitives should be constructed such that there is no difference between invoking a single server and invoking a group of servers. This means that communication pattern transparency is provided to the programmer. Thus, groups should be named in the same manner that single processes are named. Each group is treated as one sin-
gle entity; its internal structure and interactions are not shown to the users. The mapping of group names onto multicast addresses is performed by the interprocess communication facility of an operating system, supported by a naming server. However, if multicast or even broadcast is not provided, group communication can be supported by one-to-one communication at the network level. Communication groups are dynamic. This means that new groups can be created and existing groups can be destroyed. A process can be a member of more than one group at the same time. It can leave a group or join another one. In summary, group communication shares many design features with message passing and RPC. However, there are some issues which are very specific, and knowledge of them can be of great value to the application programmer.
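A minimal sketch of one-to-many communication can make these concepts concrete, assuming each group member owns a message queue and group names map to membership sets. The names group_send, join, and leave are hypothetical, not a real operating system interface; a real facility would map the group name to a multicast address in the kernel.

```python
import queue

# Each member of a group owns its own message queue (its 'port').
members = {
    "server_a": queue.Queue(),
    "server_b": queue.Queue(),
    "server_c": queue.Queue(),
}
# Group name -> current membership (the group's internal structure
# is hidden from the client, which only uses the group name).
groups = {"file_service": {"server_a", "server_b", "server_c"}}

def group_send(group_name, src, message):
    """One-to-many send: deliver the message to every current member.
    If no multicast support exists, this falls back to repeated
    one-to-one delivery, as noted in the text."""
    for member in groups[group_name]:
        members[member].put((src, message))

def join(group_name, member):
    # Groups are dynamic: processes may join...
    groups[group_name].add(member)

def leave(group_name, member):
    # ...and leave at run time.
    groups[group_name].discard(member)

group_send("file_service", "client_1", "read config.txt")
delivered = sum(not q.empty() for q in members.values())
print(delivered)   # -> 3
```

The client addresses only the group name; how many members exist, and how the message reaches each of them, stays inside the communication facility.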
Message Ordering in Group Communication. The semantics of message ordering are an important factor in providing good application performance and reduction in the complexity of distributed application programming. The order of message delivery to members of the group will dictate the type of group it is able to support. There are four possible message ordering semantics:
Message Delivery Semantics. Message delivery semantics of a group relates to the successful delivery of a message to processes in a group. There are four choices of delivery semantics:
Causal Ordering. The causal ordering semantic delivers request messages to all members of the current group such that the causal ordering of message delivery is preserved. This implies that if the sending of a message m⬘ causally follows the delivery of message m, then each process in the group receives m before m⬘.
Single Delivery. Single delivery semantics require that only one of the current group members needs to receive the message for the group communication to be successful. k-Delivery. In k-delivery semantics, at least k members of the current group will receive the message successfully. Quorum Delivery. With quorum delivery semantics, a majority of the current group members will receive the message successfully. Atomic Delivery. With atomic delivery all current members of the group successfully receive the message or none does. This delivery semantic is the most stringent as processes can and do fail and networks may also partition during the delivery process of the request messages, making some group members unreachable. Message Response Semantics. By providing a wide range of message response semantics the application programmer is capable of providing flexible group communication to a wider range of applications. The message response semantics specify the number and type of expected message responses. There are five broad categories for response semantics: No Responses. By providing no response to a delivered request message the group communication facility is only able to provide unreliable group communication. Single Response. The client process expects (for successful delivery of a message) a single response from one member of the group. k-Responses. The client process expects to obtain k responses for the delivered message from the members of the process group. By using k response semantics the groups resilience can be defined (13). The resilience of a group is based on the minimum number of processes that must receive and respond to a message. Majority Response. The client process expects to receive a majority of responses from the current members of the process group. Total Response. The client process requires all current members of the group to respond to the delivery of a request message.
No Ordering. This semantic implies that request messages are sent to the current group of processes in no particular order.

FIFO Ordering. This semantic implies that all request messages transmitted in first-in first-out (FIFO) order by a client process to the current members of the group will be delivered in FIFO order.
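FIFO ordering is commonly enforced with per-sender sequence numbers, as in this sketch (the mechanism is an assumption for illustration; the text names only the semantic). A receiver buffers any message that arrives out of order until the gap from its sender is filled.

```python
# Per-sender sequence numbers enforcing FIFO delivery at a receiver.

class FifoReceiver:
    def __init__(self):
        self.expected = {}   # sender -> next sequence number to deliver
        self.buffer = {}     # sender -> {seq: payload} held back

    def receive(self, sender, seq, payload):
        delivered = []
        self.buffer.setdefault(sender, {})[seq] = payload
        nxt = self.expected.get(sender, 0)
        while nxt in self.buffer[sender]:     # deliver any now-contiguous run
            delivered.append(self.buffer[sender].pop(nxt))
            nxt += 1
        self.expected[sender] = nxt
        return delivered
```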
Total Ordering. The total ordering semantic implies that all messages are reliably delivered in sequence to all current members of the group, or no member receives the message. The total ordering semantic also guarantees that all group members see the same order of messages. Total ordering is more stringent than FIFO ordering, as all message transfers between all current members of the group are in order. This implies that all processes within the current group perceive the same total ordering of messages. Causal ordering is concerned with the relationship between two messages, while total ordering is concerned with all group member processes seeing the same order of messages.

Conclusions

In this section we described two issues of the communication paradigm for client-server cooperation: first, the communication pattern, including one-to-one and one-to-many (group communication); second, two techniques, message passing and RPC, which are used to develop distributed computing systems. The message passing technique allows clients and servers to exchange messages explicitly using the send and receive primitives. Various semantics, such as direct and indirect, blocking and nonblocking, buffered and unbuffered, and reliable and unreliable, can be used in message passing. The RPC technique allows clients to request services from servers by following a well-defined procedure call interface. Various issues are important in RPC, such as marshalling and unmarshalling of parameters and results, binding a client to a particular server, and handling exceptions.
SUN’S NETWORK FILE SYSTEM

The first major step in the development of distributed software was made when inexpensive diskless personal computers were connected by inexpensive local networks in order to share a file service or a printer service.
CLIENT–SERVER SYSTEMS
Distributed File Systems

A distributed file system is a key component of any distributed computing system. The main function of such a system is to create a common file system that can be shared by all the clients running on autonomous computers in the distributed computing system. The common file system should store programs and data and make them available as needed. Since files can be stored anywhere in a distributed computing system, a distributed file system should provide location transparency. To achieve this goal a distributed file system usually follows the client-server model.

A distributed file system typically provides two types of services: the file service and the directory service, implemented by the file server and the directory server, respectively, distributed over the network. These two servers can also be implemented as a single server. The file server provides operations on the contents of files, such as read, write, and append. The directory server provides operations, such as directory and file creation and deletion, for manipulating directories and file names. The client application program interface (client API, usually in the form of a process or a group of processes) runs on each client computer and provides a uniform user-level interface for accessing file servers.

In this section we present one of the most important achievements of the 1980s, which is still in use now: the Network File System, known as NFS, developed based on the client-server model.

NFS Architecture

NFS was developed by Sun Microsystems and introduced in late 1984 (14). Since then it has been widely used in both industry and academia. NFS was originally developed for use on Unix workstations. Currently, many manufacturers support it for other operating systems (e.g., MS-DOS). Here, NFS is introduced based on the Unix system. To understand the architecture of NFS, we need to define the following terms:

INODE.
This is a data structure that represents either an open file or directory within the Unix file system. It is used to identify and locate a file or directory within the local file system.

RNODE. The remote file node is a data structure that represents either an open file or directory within a remote file system.

VNODE. The virtual file node is a data structure that represents either an open file or directory within the virtual file system (VFS).

VFS. The virtual file system is a data structure (linked lists of VNODEs) that contains all necessary information on a real file system that is managed by NFS. Each VNODE associated with a given file system is included in a linked list attached to the VFS for that file system.

The NFS server integrates the functions of both a file server and a directory server, and NFS clients use a uniform interface, the VFS/VNODE interface, to access the NFS server. The VFS/VNODE abstraction makes it possible to support multiple file system types in a generic fashion. The VFS and VNODE data structures provide the linkage between the abstract uniform file system interface and the real file system (such as a Unix or MS-DOS file system) that accesses the data. Further, the VFS/VNODE abstraction allows NFS to make remote files and local files appear identical to a client program.

In NFS, a client process transparently accesses files through the normal operating system interface. All operating system calls that manipulate files or file systems are modified to perform operations on VFSs/VNODEs. The VFS/VNODE interface hides the heterogeneity of underlying file systems and the location of these file systems. The steps of processing a user-level file system call can be described as follows (Fig. 8):

1. The user-level client process makes the file system call through the normal operating system interface.

2. The request is redirected to the VFS/VNODE interface. A VNODE is used to describe the file or directory accessed by the client process.

3. If the request is for accessing a file stored in the local file system, the INODE pointed to by the VNODE is used. The INODE interface is used and the request is served by the Unix file system interface.

4. If the request is for accessing a file stored locally in other types of file systems (e.g., the MS-DOS file system), a proper interface of that file system is used to serve the request.

5. If the request is for accessing a file stored remotely, the RNODE pointed to by the VNODE is used and the request is passed to the NFS client; RPC messages are then sent to the remote NFS server that stores the requested file.

6. The NFS server processes the request by using the VFS/VNODE interface to find the appropriate local file system to serve the request.

The Role of RPC

The communication between NFS clients and servers is implemented as a set of RPC procedures. The RPC interface provided by an NFS server includes operations for directory manipulation, file access, link manipulation, and file system access (15).
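The dispatch steps above can be sketched as a toy program (all class and method names are illustrative, not NFS source code): the client sees one uniform read interface, and the VNODE routes the call to a local file system or, through an RPC-like call, to a remote NFS server.

```python
# Toy VFS/VNODE dispatch: one interface, local or remote implementation.

class UnixFS:
    def __init__(self, files):
        self.files = files                 # inode number -> file contents

    def read(self, inode):                 # step 3: local Unix file system
        return self.files[inode]

class NFSServer:
    def __init__(self, fs):
        self.fs = fs

    def handle_read(self, rnode):          # step 6: server-side file system
        return self.fs.read(rnode)

class NFSClient:
    def __init__(self, server):
        self.server = server

    def read(self, rnode):                 # step 5: stands in for the RPC
        return self.server.handle_read(rnode)

class VNode:
    def __init__(self, backend, node_id):
        self.backend, self.node_id = backend, node_id

    def read(self):                        # step 2: uniform VNODE interface
        return self.backend.read(self.node_id)
```

A client program calling `VNode(...).read()` cannot tell whether the backing store is a local `UnixFS` or a remote `NFSClient`, which is the transparency the VFS/VNODE abstraction provides.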
The actual specifications for these remote procedures are defined in the RPC language, and the data structures used by the procedures are defined in the XDR format. The RPC language is a C-like language used as input to Sun's RPC Protocol Compiler utility, which outputs the actual C language source code.

NFS servers are designed to be stateless, meaning that there is no need to maintain information (such as whether a file is open or the position of the file pointer) about past requests. The client keeps track of all information required to send requests to the server. Therefore, NFS RPC requests are designed to completely describe the operation to be performed. Also, most NFS RPC requests are idempotent, meaning that an NFS client may send the same request one or more times without any harmful side effects; the net result of the duplicate requests is the same. NFS RPC requests are transported using the unreliable User Datagram Protocol (UDP). NFS servers notify clients when an RPC completes by sending the client an acknowledgment (also using UDP).
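The interplay of statelessness, idempotency, and retransmission can be sketched as follows. The request format and helper names are invented for the example, and message loss is simulated with a callable rather than real UDP, but the logic mirrors the text: a lost request or reply simply triggers a retransmission, and re-executing an idempotent request is harmless.

```python
# At-least-once retransmission over idempotent requests to a stateless server.

class StatelessServer:
    def handle(self, request):
        # Stateless: the request fully describes the operation; idempotent:
        # re-executing the same request yields the same result.
        op, path, offset, count = request
        data = {"/etc/motd": b"hello, nfs"}[path]
        return data[offset:offset + count]

def rpc_call(server, request, lost, max_retries=5):
    """Send until acknowledged; 'lost' simulates a dropped request or reply."""
    for attempt in range(max_retries):
        if lost():                 # timeout expires, client retransmits
            continue
        return server.handle(request)
    raise TimeoutError("server did not acknowledge the request")
```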
Figure 8. The NFS structure. See text for description.
An NFS client sends its RPC requests to an NFS server one at a time. Although a client computer may have several NFS RPC requests in progress at any time, each of these requests must come from a different client process. When a client makes an RPC request, it sets a timeout period during which the server must service and acknowledge it. If the server does not acknowledge the request during the timeout period, the client retransmits it. This may happen if the request is lost along the way or if the server is too slow because of overloading. Since the RPC requests are idempotent, there is no harm if the server executes the same request twice. If the client gets a second acknowledgment for the request, the client simply discards it.

Conclusions

In this section we showed an application of the client-server model in the development of a distributed file system based on the Network File System. The NFS server integrates the functions of both a file server and a directory server. It has been built as an extension module to a centralized operating system (e.g., Unix or MS-DOS). NFS clients use RPC to communicate with the NFS system. This system allows clients running on diskless computers to access and share files.

THE DEVELOPMENT OF RHODOS

The vast majority of design and implementation efforts in the area of distributed computing systems have concentrated on client-server-based applications running on centralized operating systems (e.g., Unix, VMS, OS/2). However, there have also been substantial research efforts on the development of operating systems built from scratch based on the client-server model (called distributed operating systems). These systems support distributed computing systems developed on a set of personal
homogeneous computers connected by local or fast wide area networks. The results achieved have changed, and are still changing, the operating systems of distributed computing systems and the development of applications supported by these systems. The following systems have been developed based on the client-server model: V (8), Amoeba (2), Chorus (12), and RHODOS (16).

Distributed Operating Systems

A distributed operating system is one that looks to its users like a centralized operating system but runs on multiple, independent computers connected by fast local or wide area networks. A distributed operating system has the following four major goals (the first three are also goals of a centralized operating system):

Hide details of hardware by creating abstractions: for example, software which provides a set of higher-level functions which form a virtual computer;

Manage resources to allow their use in the most effective way and support user processes in the most efficient way;

Create a pleasant user computational environment; and

Hide the distribution of resources (information, peripheral, and computational resources) in order to provide full transparency to users.

A generic architecture of a distributed operating system which allows these goals to be achieved has the following software levels. Software providing an abstraction sits on bare hardware and allows the handling of interrupts and context switching. The second level of a distributed operating system is formed by software which manages physical resources such
as processor time, memory, and input/output, and virtual resources such as processes, remote communication, communication ports, and network protocols. It depends on the support provided by functions of the software abstraction level. The second level provides its services to the system services level. This third software level allows the management of files and object (services, resources) names, and creates a human user interface formed by graphics terminals, command interpreters, and authentication systems. This level creates an image of a computer system for users. User processes form the software level sitting on the system services level.

In a client-server based distributed operating system, all management functions and services provided to user processes are modelled and developed as individual cooperating server processes. User processes act as clients. However, because servers cooperate in order to achieve the goals of a distributed operating system, they also act as clients. As physical memory in a distributed computing system is not shared, remote processes communicate using messages. In order to have a uniform communication model, local processes also communicate using messages. This provides communication transparency in a natural manner.

In this section we will use RHODOS to illustrate the application of the client-server model in the development of a new class of distributed operating systems. For this reason we will mainly concentrate on the kernel servers and microkernel, as they form a new image of operating systems for distributed computing systems.

The RHODOS Architecture

RHODOS (research oriented distributed operating system) is a microkernel and message passing based system developed using the client-server model. This operating system is capable of supporting parallel processing on a network of workstations and providing load sharing and balancing in order to provide high-performance services to users (10). There are three layers of cooperating processes in RHODOS: user processes, system servers, and kernel servers (Fig. 9). Each process executes in user mode and is confined to an individual address space which is controlled and maintained by the RHODOS microkernel.

In RHODOS, software creating abstractions forms a microkernel. The microkernel provides the following functions: context switching, interrupt handling, basic operations on memory pages relating to the hardware, and local interprocess communication. Furthermore, the microkernel is responsible for storing and managing basic data structures.

Kernel servers implement the mechanisms of the RHODOS functionality. Two groups of kernel servers can be distinguished. To the first group belong those servers which provide services which could be identified in any distributed or network operating system: process management, memory management, remote IPC management, communication protocols, and I/O management (drivers in RHODOS have also been developed as individual servers). The second group encompasses servers which provide the advanced services necessary to support parallel processing on a network of workstations, and load sharing and balancing. These services are process migration, remote process creation, and data collection.

System servers implement the policy of the RHODOS functionality. They provide services such as naming, file accessing and manipulation (in basic and transaction modes), two-way and m-way authentication, and global scheduling. A broker service has also been developed and will be installed shortly. In order to provide these services, system servers act as clients and invoke relevant kernel servers and the nucleus using standard system calls.
Figure 9. The logical architecture of RHODOS.
Figure 10. RHODOS interprocess communication facility.
User applications and processes are those developed and allocated to perform tasks for users. These processes have no special privileges and obtain services through calls to the microkernel and system servers.

Communication in RHODOS

In RHODOS, access to local and remote services is achieved in the same transparent manner, via a system name of that service and uniform interprocess communication, which is provided by the Interprocess Communication (IPC) facility. The facility provides three basic communication primitives: send(), recv(), and call(). Both send() and recv() provide the basic message passing semantics, while call(), recv(), and send() in combination provide synchronous RPC. By providing both message passing and RPC semantics, the programmer is able to select the most appropriate communication technique for a given application.

The functioning of the IPC facility is divided into three sections: the local IPC module, the IPC manager, and the network manager (Fig. 10). The local IPC module is an integral part of the RHODOS microkernel and provides local communication between processes on the same personal computer. If the destination process exists on the local computer, the module will complete the transfer. Otherwise, the IPC module sends a request to the IPC manager to provide a remote communication service.

The primary responsibility of the IPC manager is the receiving and transmitting of remote messages for all processes within the RHODOS distributed computing system. It also supports group communication. This service is achieved with the cooperation of the name server by assigning a single name to a group of names. Furthermore, in order to support one-to-one and group communication, the IPC manager is responsible for address resolution.
In particular, a message that is sent to an individual process or a group requires the IPC manager to resolve the destination processes’ (servers’) location and provide the mechanism for the transport of the message to the desired process or group of processes. In order to deliver a message to a remote process (server), the IPC manager invokes a delivery server, called the network manager. This server consists of a protocol stack employing transport, network, and data link layer protocols.
Currently, the transport service is provided by a fast specialized RHODOS Reliable Datagram Protocol (RRDP). Network and data link layer protocols are provided by the IP/Ethernet suite.

RHODOS Kernel Servers and Services

One of the basic features of the RHODOS design is that each resource is managed by a relevant server: the process manager is responsible for processes and basic operations on processes, the space manager for memory, and the IPC manager for remote and group communication and address resolution. A process is a very special resource, because it is constructed from some basic resources such as spaces, data structures usually called process control blocks, communication ports, and buffers. Thus, in RHODOS advanced operations on processes, such as process migration and remote process creation, are provided by separate servers: the migration manager and the REX manager.

Process Manager. The job of the process manager is to manage the processes that are created in RHODOS. The process manager manipulates the process queues and deals with parent processes waiting for child processes to exit. It cooperates with other kernel servers, for instance with the migration manager to transfer a process's state during migration, and with the remote execution manager to set up a process's state when a process is created.

Space Manager. One of the goals of RHODOS is portability across hardware platforms. Thus, RHODOS memory management has been separated into two sections: hardware dependent and hardware independent. The small hardware-dependent section is found in the microkernel, and the larger hardware-independent section comprises the RHODOS space manager. This server deals with spaces: logical units of memory, independent of physical units (e.g., pages), which are mapped to the physical memory.
The space manager supports two types of page operations: copy_on_write, which allows twin processes to share pages while they are reading them but makes separate copies when either process attempts to write to a page; and copy_on_reference, which is used in process migration, where only referenced pages are transferred from the source computer to the destination computer to which the process has been migrated. Handling exceptions, creating spaces, and transferring pages have been extended with additional functions in order to provide operating system built-in support for Distributed Shared Memory (DSM). Two consistency models are supported in the RHODOS DSM: invalidation based and update based.

Device Manager. Transparency is an important feature of RHODOS. This includes not only interprocess communication between remote hosts, but also a transparent unified interface to physical devices such as serial ports, keyboards, video screens, and disks. Device drivers provide this interface. Device drivers in RHODOS are processes in their own right, with the privilege and status of kernel servers. The benefits obtained from implementing device drivers as processes include the ability to enable and disable new drivers dynamically, as well as to use normal process debugging tools whilst the device driver is active. The device manager is the controlling entity that allows users to access a requested physical device.

Migration Manager. The process migration manager is responsible for the migration of running processes from the home computer to a remote computer. Migrating a process in RHODOS involves migrating the process state, address space, communication state, file state, and other resources. Thus, process migration requires the cooperation of all the servers managing these resources: the process, space, and IPC managers, and the file server, respectively. The process migration manager only coordinates these servers, and all of them cooperate following the client-server model. Process migration in RHODOS is a transaction-based operation performed on processes. Thus, the initial request from the source process migration manager to the destination process migration manager to migrate a selected process starts the transaction. The destination process migration manager commits this transaction by sending a response back if all operations of installation of resources on the destination computer by the individual servers (the process, space, and IPC managers, and the file server) have been completed successfully. Otherwise, an abort response is sent back.

Remote Execution Manager. The function of the remote execution (REX) manager is to provide coordination for the creation of processes on local and remote computers. If a process is
created on a local computer, only the local REX manager is involved. If processes are created on remote computers, the home REX manager cooperates with remote REX managers to ensure processes are created correctly whilst maintaining the link with the process that issued the request. The generic cooperation of the servers is shown in Fig. 11.

Figure 11. The generic cooperation of servers involved in local or remote process creation.

Data Collection Manager. The RHODOS Data Collection System is responsible for the collection and dissemination of the operational statistics of processes and exchanged messages in the RHODOS environment. The Data Collection System consists of a data collection manager (server) and stubs of code within the microkernel and other servers. The data collection manager is designed to be activated periodically and when special events occur (e.g., a new process was created, a process was killed), and to provide a central repository for the accumulation of statistics. It provides accurate process statistics to the global scheduler. These statistics permit the global scheduler to make the most appropriate decisions concerning process placement within the RHODOS environment.

RHODOS System Servers

The RHODOS system provides direct services to users by employing the following servers: the naming server, file server, authentication server, and the broker server, called the trader. Furthermore, RHODOS provides a special service which improves the overall performance of all services by employing the global scheduler. The utilization of the client-server model in the development of user-oriented services of a distributed operating system is presented here based on the global scheduler.

RHODOS provides global scheduling services in order to allocate or migrate processes to idle or lightly loaded computers, to share computational resources, and to balance load. Global scheduling employs both static allocation and load balancing. Static allocation is employed when the system load remains steadily high and new processes have to be created; it decides where to create new processes. Load balancing is employed to react to large fluctuations in system load; it decides when to migrate a process, which process to migrate, and where to migrate it. These servers make these decisions based on information about the current load of the personal computers participating in global scheduling, their load trends, and the process communication pattern.

Conclusions

In this section we showed an application of the client-server model in the development of an advanced distributed operating system, RHODOS. RHODOS consists of a microkernel and two layers of cooperating servers, called kernel servers and system servers. Generally speaking, kernel servers implement the mechanism of the RHODOS functionality, whereas system servers implement the policy of the RHODOS functionality. User processes, sitting on top of the RHODOS software, obtain services from RHODOS servers. When a RHODOS server receives a service request, it may serve the request directly, or it may contact other servers if services from these servers are required.

BUILDING THE DISTRIBUTED COMPUTING ENVIRONMENT ON TOP OF EXISTING OPERATING SYSTEMS

The previous section contained a presentation of RHODOS, an example of a distributed operating system developed based on the client-server model and the concept of a microkernel.
The whole system has been built from scratch on bare hardware. There is another approach to building a distributed computing environment by putting it on top of existing operating systems. Such a software layer hides the differences among the individual computers, and forms a single computing system. The Role of the Client-Server Model in Building a Distributed Computing Environment Open Software Foundation’s Distributed Computing Environment (DCE) (17) is a vendor-neutral platform for supporting distributed applications. DCE is a standard software structure for distributed computing that is designed to operate across a range of standard Unix, VMS, OS/2, and other operating systems. It includes standards for RPC, name, time, security, and thread services—all sufficient for client–server computing across heterogeneous architectures. DCE uses the client–server model to support its infrastructure and transparent services. All DCE services are provided through servers. By using DCE, application programmers can avoid considerable work in creating supporting
services, such as creating communication protocols for various parts of a distributed program, building a directory service for locating those pieces, and maintaining a service for providing security in their own programs.

In the previous section we mainly addressed the kernel servers and microkernel of RHODOS, as they are the result of the new approach to building distributed computing systems based on the client-server model and the concept of a microkernel. Here, since DCE is a complete extension of centralized operating systems to form a distributed computing system, we mainly concentrate on the servers which directly provide services to users.

The Architecture of DCE

The architecture of DCE masks the physical complexity of the networked environment by providing a layer of logical simplicity, composed of a set of services that can be used separately or in combination to form a comprehensive distributed computing system. Servers that provide DCE services usually run on different computers; so do clients and servers of a distributed application program that uses DCE. DCE is based on a layered model which integrates a set of fundamental technologies (Fig. 12).

Figure 12. The logical architecture of DCE.

To applications, DCE appears to be a single logical system with two broad categories of services (18):

The DCE Core Services. These provide tools with which software developers can create end-user applications and system software products for distributed computing:

Threads. DCE supports multithreaded applications;

RPC. The fundamental communication mechanism, which is used in building all other services and applications;

Security Service. Provides the mechanism for writing applications that support secure communication between clients and servers;

Cell Directory Service (CDS). Provides a mechanism for logically naming objects within a DCE cell (a group of client and server computers);

Distributed Time Service (DTS). Provides a way to synchronize the clocks on different computers in a distributed computing system.
DCE Data-Sharing Services. In addition to the core services, DCE provides important data-sharing services, which require no programming on the part of the end user and which facilitate better use of shared information:

Distributed File Service (DFS). Provides a high-performance, scalable, secure method for sharing remote files;

Enhanced File Service (EFS). Provides features which greatly increase the availability and further simplify the administration of DFS.

In a typical distributed environment, most clients perform their communication with only a small set of servers. In DCE, computers that communicate frequently are placed in a single cell. Cell size and geographical location are determined by the people administering the cell. Cells may exist along social, political, or organizational boundaries and may contain up to several thousand computers. Although DCE allows clients and servers to communicate across different cells, it optimizes for the more common case of intra-cell communication. One computer can belong to only one cell at a time.

The Role of RPC

DCE RPC is based on the Apollo Network Computing System (NCA/RPC). The components of DCE RPC can be split into the following two groups, according to the stage of their usage:

Used in Development. This includes IDL (Interface Definition Language) and the idl compiler. IDL is a language used to define the data types and operations applicable to each interface in a platform-independent manner. The idl compiler is the tool used to translate IDL definitions into code which can be used in a distributed application;

Used at Runtime. This includes the RPC runtime library, rpcd (the RPC daemon), and rpccp (the RPC control program).

To build a basic DCE application, the programmer has to supply the following three files:

The Interface Definition File. It defines the interfaces (data structures, procedure names, and parameters) of the remote procedures that are offered by the server;

The Client Program.
It defines the user interfaces, the calls to the remote procedures of the server, and the client side processing functions; The Server Program. It implements the calls offered by the server. DCE uses threads to improve the efficiency of RPCs. A thread is a lightweight process that executes a portion of a program, cooperating with other threads concurrently executing in the same address space of a process. Most of the information that is a part of a process can then be shared by all threads executing within the process address space. Sharing reduces significantly the overhead incurred in creating and maintaining the information, and the amount of information that needs to be saved when switching between threads of the same program.
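The shared-address-space point can be illustrated with a short sketch (Python threads stand in for DCE threads; all names here are illustrative, not the DCE API):

```python
import threading

# Shared state lives in the process address space; every thread sees it.
request_log = []
log_lock = threading.Lock()

def handle_call(call_id):
    # Each "RPC" handler runs as a lightweight thread; no per-request
    # process creation, and the log is shared rather than copied.
    with log_lock:
        request_log.append(call_id)

threads = [threading.Thread(target=handle_call, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(request_log))  # all 8 calls were recorded in the shared structure
```

Because the log is shared rather than duplicated per process, the only per-request cost is a thread switch plus a brief lock acquisition.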
The Servers of DCE

All the higher-level DCE services, such as the directory services, security service, time service, and distributed file service, are provided by the relevant servers.

Directory Services. The main job of the directory services is to help clients find the locations of appropriate servers. To let clients access the services offered by a server, the server has to place some binding information into the directory. A directory is a hierarchically structured database which stores dynamic system configuration information; the directory is a realization of the naming system. Each name has attributes associated with it, which can be obtained via a query using the name. Each cell in a DCE distributed computing system has its own directory service, called the Cell Directory Service (CDS), that stores the directory service information for the cell (18). It is optimized for intra-cell access, since most clients communicate with servers in the same cell. Each CDS consists of CDS servers and CDS clerks. A CDS server runs on a computer containing a database of directory information (called a clearinghouse). Each clearinghouse contains some number of directories, analogous to, but not the same as, directories in a file system. Each directory, in turn, can logically contain other directories, object entries, or soft links (aliases that point to something else in CDS). Each cell may have multiple CDS servers. A node which does not run a CDS server must run a CDS clerk, which acts as an intermediary between a distributed application on that node and a CDS server. When a server wishes to make its binding information available to clients, it exports that information to one of its cell's CDS servers. When a client wishes to locate a server within its own cell, it imports that information from the appropriate CDS server by calling on the CDS clerk on its computer.
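The export/import flow can be mimicked with a toy clearinghouse (a plain dictionary stands in for the CDS database; the function names and the example entry name are illustrative, not the DCE API):

```python
# Toy cell directory: servers export binding information under a name,
# clients import it by name (a dict stands in for the clearinghouse).
clearinghouse = {}

def export_binding(name, host, port):
    # A server advertises where it can be reached.
    clearinghouse[name] = {"host": host, "port": port}

def import_binding(name):
    # A client (via its CDS clerk) looks the server up by name.
    return clearinghouse.get(name)

export_binding("/.:/subsys/printsrv", "node7.example.com", 5000)
binding = import_binding("/.:/subsys/printsrv")
print(binding["host"])  # the imported binding names the server's host
```

The real CDS adds replication, access control, and hierarchical directories on top of this lookup idea.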
DCE uses the Domain Name System (DNS) or the Global Directory Service (GDS, based on the X.500 standard) to enable clients to access servers in foreign cells. To access a server in a foreign cell, a client gives the cell's name and the name of the desired server. A CDS component called the Global Directory Agent (GDA) extracts the location of the named cell's CDS server from DNS or GDS, and a query is then sent directly to this foreign server.

Security Service. DCE provides four security services: authentication, authorization, data integrity, and data privacy. A security server (which may be replicated) is responsible for providing these services within a cell. The security server has three components:

Registry Service. A database of principal (a user of the cell), group, and organization accounts, their associated secret keys, and administration policies.

Key Distribution Service. Provides tickets to clients. A ticket is a specially encrypted object that contains a conversation key and an identifier that can be presented by one principal to another as proof of identity.

Privilege Service. Supplies the privileges of a particular principal; it is used in authorization.
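The ticket idea can be sketched with a toy key distribution service (an HMAC seal stands in for encryption; this illustrates the concept only and is not the Kerberos protocol or any DCE API):

```python
import hmac, hashlib, os, json

# Toy key distribution service: it shares a secret key with each principal
# (the "registry") and issues tickets binding a conversation key to a name.
registry = {"alice": os.urandom(16), "fileserver": os.urandom(16)}

def issue_ticket(client, server):
    conversation_key = os.urandom(16).hex()
    body = json.dumps({"client": client, "key": conversation_key}).encode()
    # Sealed with the *server's* secret, so only the server can verify it.
    seal = hmac.new(registry[server], body, hashlib.sha256).hexdigest()
    return body, seal, conversation_key

def verify_ticket(server, body, seal):
    expected = hmac.new(registry[server], body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, seal)

body, seal, key = issue_ticket("alice", "fileserver")
print(verify_ticket("fileserver", body, seal))  # an untampered ticket verifies
```

A tampered body fails verification, which is the property a ticket needs: the client cannot forge or alter what the key distribution service issued.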
The security server must run on a secure computer, since the registry on which it relies contains a secret key, generated from a password, for every principal in the cell. The DCE security services are based on Kerberos V5, created by MIT's Project Athena; DCE extends Kerberos version 5 by providing authorization services.

Time Service. The Distributed Time Service (DTS) of DCE is designed to keep a set of clocks on different computers synchronized. DTS uses the usual client-server structure: DTS clients, daemon processes called clerks, request the correct time from some number of servers, receive responses, and then reset their clocks as necessary to reflect this new knowledge. The DCE DTS comprises several components:

Time Clerk. The client side of DTS. It runs on a client computer and keeps the computer's local time synchronized by asking a time server for the correct time and adjusting the local time accordingly.

Time Servers. There are three types of time servers. The local time server maintains time synchronization within a given LAN. The global time server and courier time servers synchronize time among interconnected LANs. A time server synchronizes with other time servers by asking them for correct times and adjusting its own time accordingly.

DTS API. An interface through which application programs can access the time information provided by DTS.

Distributed File Services. DCE uses its distributed file service (DFS) to join the file systems of individual computers within a cell into a single file space. A uniform and transparent interface is provided for applications to access files located anywhere in the network. DFS is derived from the Andrew File System. It uses RPC for client-server communication and threads to enhance parallelism; it relies on the DCE directory to locate servers; and it uses the DCE security services for protection against attackers. DFS is based on the client-server model.
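A clerk's adjustment step can be sketched as follows (a plain average of server readings; actual DTS computes fault-tolerant time intervals, so this is only a schematic):

```python
def clerk_adjustment(local_time, server_times):
    # The clerk polls several time servers and nudges its clock toward
    # their consensus; "consensus" here is a plain average for brevity.
    estimate = sum(server_times) / len(server_times)
    return estimate - local_time  # correction to apply to the local clock

correction = clerk_adjustment(100.0, [103.0, 101.0, 102.0])
print(correction)  # 2.0
```

In practice the correction is applied gradually (clock slewing) rather than as a jump, so that time never runs backward for applications.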
DFS clients, called cache managers, communicate with DFS servers using RPC on behalf of user applications. There are two types of DFS servers: a fileset location server, which stores the locations of system and user files in DFS, and a file server, which manages files. A typical interaction between the various components of DFS is shown in Fig. 13. First, the application issues a file request call to the cache manager on its computer. If the requested file is in the local cache, the request is served using the local copy of the file. Otherwise, the cache manager locates the fileset location server through the CDS server, and the location of the file server that stores the requested file is then found through the fileset location server. Finally, the cache manager calls the file server and the file data are accessed.

Figure 13. Interactions between DFS components. 1: file request from an application; 2: locate the fileset location server; 3: locate the file server that stores the requested file; 4: access the requested file.

Conclusions

In this section we described an application of the client-server model in the development of an advanced distributed computing environment, DCE. DCE is built on top of existing operating systems, and it hides the heterogeneity of the underlying computers by providing an integrated environment for distributed computing. DCE consists of many integrated services, such as thread and RPC services, security service, directory service, time service, and distributed file service, that are necessary in performing client–server computing in a heterogeneous environment. Most of these services are implemented as individual servers or groups of cooperating servers. Application processes act as clients of DCE servers. Now in its fifth year (DCE 1.0 was announced in 1991), DCE has gone through several major stages of evolution and enhancement (through DCE 1.1 and DCE 1.2). Because of its operating system independence, DCE has gained significant support from user and vendor communities.

BIBLIOGRAPHY

1. L. Liang, S. T. Chanson, and G. W. Neufeld, Process groups and group communications: Classifications and requirements, IEEE Computer, 23 (2): 56–66, 1990.
2. A. S. Tanenbaum, Experiences with the Amoeba distributed operating system, Commun. ACM, 33 (12): 46–63, 1990.
3. F. Cristian, Understanding fault tolerant distributed systems, Commun. ACM, 34 (2): 56–78, 1991.
4. K. P. Birman and T. A. Joseph, Reliable communication in the presence of failures, ACM Trans. Comp. Sys., 5 (1): 47–76, 1987.
5. P. Wegner, Interoperability, ACM Comp. Surv., 28 (1): 285–287, 1996.
6. A. Goscinski, Distributed Operating Systems: The Logical Design, Reading, MA: Addison-Wesley, 1991.
7. A. D. Birrell and B. J. Nelson, Implementing remote procedure calls, ACM Trans. Comp. Sys., 2 (1): 39–59, 1984.
8. D. R. Cheriton, The V distributed system, Commun. ACM, 31 (3): 314–333, 1988.
9. A. S. Tanenbaum and R. van Renesse, Distributed operating systems, ACM Comp. Surv., 17 (4): 419–470, 1985.
10. D. De Paoli et al., Microkernel and kernel server support for parallel execution and global scheduling on a distributed system, Proc. IEEE First Int. Conf. Algorithms Architectures Parallel Process., Brisbane, April 1995.
11. D. R. Cheriton and W. Zwaenepoel, Distributed process groups in the V distributed system, ACM Trans. Comp. Sys., 3 (2): 77–107, 1985.
12. M. Rozier et al., Chorus distributed operating system, Comput. Syst., 1: 305–379, 1988.
13. M. F. Kaashoek and A. S. Tanenbaum, Efficient reliable group communication for distributed systems, Department of Mathematics and Computer Science Technical Report, Vrije Universiteit, Amsterdam, 1994.
14. R. Sandberg et al., Design and implementation of the Sun Network File System, Proc. Summer USENIX Conf., 119–130, 1985.
15. Sun Microsystems, NFS Version 3 Protocol Specification (RFC 1813), Internet Network Working Group Request for Comments, No. 1813, Network Information Center, SRI International, June 1995.
16. G. Gerrity et al., Can we study design issues of distributed operating systems in a generalized way?—RHODOS, Proc. 2nd Symp. Experiences Distributed Multiprocessor Syst. (SEDMS II), Atlanta, March 1991.
17. Distributed Computing Environment Rationale, Open Software Foundation, 1990.
18. The OSF Distributed Computing Environment, Open Software Foundation, 1992.
ANDRZEJ GOSCINSKI
WANLEI ZHOU
Deakin University
Wiley Encyclopedia of Electrical and Electronics Engineering
Code Division Multiple Access (Standard Article)
Kamran Kiasaleh, University of Texas at Dallas, Richardson, TX
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5304
Online Posting Date: December 27, 1999
Abstract. The sections in this article are: Signal Generation and Mathematical Modeling; Despreading and Detection; Synchronization; Interference; Near-Far Problem and Power Control; Channel Effects; Interference-Dispersive Channel; Rake Receiver.
CODE DIVISION MULTIPLE ACCESS
With the advent of personal wireless communication systems in recent years, the need for instantaneous, seamless personal communication has grown. Unfortunately, this increase in demand strains a natural resource, the radio frequency (RF) spectrum. It is therefore imperative that any communication system intended for use in a personal communication domain be as bandwidth-efficient as possible. In other words, one has to design communication systems for a bandlimited scenario and attempt to maximize the information throughput for the allotted bandwidth. There are two basic means by which the RF spectrum can be shared among many users: collision-free and collision-impaired multiple access. In a collision-impaired scheme, a protocol is used
(which is typically common to all users) by each user to obtain access to the available spectrum. Collisions (events in which two or more users attempt to use the same common resource) are possible in this scenario, and hence one must accommodate such events (e.g., by retransmission). In the collision-free scenario, it is assumed that a user is able (at least in theory) to obtain access to the channel upon request (perhaps with some delay) and that no form of collision is possible. In practice, a hybrid of the two scenarios is often used to provide access to the RF medium. Although there are numerous forms of collision-free multiple access, three means of sharing the RF spectrum have received the most attention: time-division multiple access (TDMA), frequency-division multiple access (FDMA), and code-division multiple access (CDMA). The concepts of TDMA and FDMA may be explained as follows. In the TDMA scenario, access to the RF spectrum is rather implicit, via time-slot allocation. Namely, no single portion of the allotted frequency spectrum is assigned to an individual user. Instead, users occupy the entire allotted frequency spectrum and are assigned nonoverlapping time slots for communication. In contrast, FDMA operates on the assumption that nonoverlapping portions of the RF spectrum can be allocated to individual users, and communication for each user can proceed in a continuous fashion in time. The CDMA approach differs from TDMA and FDMA in two important respects. First, explicit frequency assignments are not necessary. More important, communication can be initiated at any time, and hence no explicit time-slot assignment must be performed prior to communication. User discrimination is instead achieved by exploiting the correlation properties of the binary (or perhaps higher-order) codes used to form CDMA signals. To illustrate this point, let us consider the following.
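As a first numerical preview of code-based user discrimination (ahead of the formal development below), two users can share the band by spreading with different ±1 codes; length-4 Walsh rows are used here purely for clarity, whereas commercial systems use long PN codes:

```python
# Two users transmit simultaneously in the same band, each spreading
# one data symbol with its own ±1 code (orthogonal Walsh rows here).
w1 = [1, 1, 1, 1]
w2 = [1, -1, 1, -1]
d1, d2 = 1, -1                      # each user's data symbol

tx = [d1 * a + d2 * b for a, b in zip(w1, w2)]   # signals overlap on air

# Each receiver despreads with its own code; the other user drops out
# because the two codes are uncorrelated over the symbol.
r1 = sum(x * c for x, c in zip(tx, w1)) / len(w1)
r2 = sum(x * c for x, c in zip(tx, w2)) / len(w2)
print(r1, r2)  # 1.0 -1.0: both symbols recovered despite the overlap
```

With truly orthogonal codes the interference cancels exactly; with practical PN codes a small correlation residue remains, which is the theme of the discussion that follows.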
In most CDMA systems, the information provided by a user is of a bandwidth much smaller than the bandwidth allocated for CDMA communication. First, via a bandwidth-spreading tactic, the information provided by a user is expanded in bandwidth to the maximum allowable bandwidth for CDMA communication. This procedure is repeated for all users involved, with each user taking advantage of a bandwidth-spreading strategy that is independent of those used by the others. At the receiver, a reverse operation (i.e., a bandwidth-despreading operation) is performed. Obviously, if this operation is performed successfully, the original signal is recovered. However, since all the users involved occupy the entire allotted frequency band and are allowed to communicate at all times, the bandwidth-despreading operation performed on an intended signal is also affected by the presence of other, interfering signals. The key assumption in CDMA is that the independence of the bandwidth spreading and despreading operations guarantees that the bandwidth despreading of an interfering signal leads to a signal whose bandwidth remains identical to the bandwidth allowed for CDMA communication. Namely, only the intended signal is transformed to its original shape, while the other signals remain wideband. Considering that a digital receiver attempts to measure the useful band-limited energy of a signal, the detection of the desired signal is hampered by only a small fraction of the energies of the interfering signals, which appears as spectrally flat noise. To elaborate, one can view a digital receiver as a narrowband filter that is designed to capture the energy (the area beneath the power spectrum) of the desired signal. Such a filter will have a bandwidth proportional to the bandwidth of the desired signal after the bandwidth-despreading operation. Since the undesired signals remain wideband after such an operation, their contribution to the detected energy in the desired frequency band will be small compared with the detected energy of the desired signal. This, in turn, leads to a successful recovery of the desired signal and the rejection of a large portion of the unwanted energy. To gain further insight, we proceed to formulate this problem in the next section.

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

SIGNAL GENERATION AND MATHEMATICAL MODELING

The preceding operation can take on a mathematical form. First, we assume that there are N CDMA users that can be active at any point in time. That is, we assume that the allotted frequency spectrum is accessible to N CDMA signals at all times. Let us begin by describing a direct-sequence CDMA signal. In particular, we are interested in representing the jth CDMA signal. For all intents and purposes, one can describe the jth CDMA signal at the transmitter as

x_j(t) = Re{x̃_j(t) e^{iω_c t}} = Re{d_j(t) PN_j(t) e^{iω_c t}}    (1)
where Re{x} denotes the real part of x, x̃_j(t) is the complex envelope of the CDMA signal, i = √−1, ω_c denotes the carrier frequency in rad/s, and d_j(t) = Σ_{n=−∞}^{∞} d_n^{(j)} P_d(t − nT_s) is the data-bearing portion of the jth signal, with d_n^{(j)} and P_d(t) denoting, respectively, the complex data symbol for the jth transmitted signal in the nth signaling interval, taken from an M-ary phase-shift-keying (MPSK) signaling constellation, and a unit-amplitude nonreturn-to-zero (NRZ) pulse shape of duration T_s seconds. Moreover, PN_j(t) denotes the jth complex PN signal, defined as

PN_j(t) = Σ_{n=−∞}^{∞} s_{n,I}^{(j)} P_c(t − nT_c) + i Σ_{n=−∞}^{∞} s_{n,Q}^{(j)} P_c(t − nT_c)    (2)
where s_{n,I}^{(j)} and s_{n,Q}^{(j)} are the in-phase (I) and quadrature (Q) pseudorandom real spreading sequences for the nth chip interval of the jth user, taking on values in {−1, +1} according to a PN code generating device (a PN code generator typically consists of one or a combination of a number of linear feedback shift registers); P_c(t) is the chip pulse shape, typically assumed to be a square-root raised-cosine pulse shape; and T_c is the chip interval, given by

T_c = T_s / P_g    (3)
where P_g ≫ 1 denotes the processing gain of the CDMA system. This parameter will be explained in a different context in the ensuing discussion. We also assume that PN_j(t ± kPT_c) = PN_j(t) for k = 1, 2, 3, . . .. This implies that the PN codes here are assumed to be periodic with a period of PT_c seconds and, hence, that the PN sequences have a period of P chips.
The preceding formulation implies that

x̃_j(t) = d_j(t) PN_j(t)    (4)
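The spreading of Eq. (4), and the correlation-based despreading described later in the article, can be exercised with a small numerical sketch (pure Python; the chip pattern and processing gain below are arbitrary illustrative choices):

```python
# Direct-sequence spreading of one data symbol d by a ±1 chip sequence,
# followed by despreading (correlation with the same sequence).
pn = [1, -1, 1, 1, -1, -1, 1, -1]     # one symbol's worth of chips (Pg = 8)
d = -1                                 # the data symbol for this interval

tx = [d * c for c in pn]               # spreading: chip-rate multiplication
recovered = sum(x * c for x, c in zip(tx, pn)) / len(pn)  # despreading
print(recovered)                       # -1.0: the symbol is restored

# A signal spread with an *independent* code stays spread: its correlation
# with our code is small, which is the basis of user discrimination.
other_pn = [1, 1, -1, 1, 1, -1, -1, -1]
interferer = [1 * c for c in other_pn]
residue = sum(x * c for x, c in zip(interferer, pn)) / len(pn)
print(abs(residue) < 1)                # only a partial-correlation residue
```

The larger the processing gain, the smaller the relative residue left by an interfering code, which is the quantitative content of Eq. (5) below.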
From Eq. (4), one can observe that a CDMA signal is obtained via PN code multiplication, justifying the name direct sequence. Before going any further, let us observe the impact of the spreading operation on the spectrum of a narrowband signal. Equation (4) sheds light on the means by which direct-sequence spreading expands the bandwidth of d_j(t). It can easily be inferred from Eq. (4) that the outcome of the multiplication is a signal whose bandwidth is identical to that of the PN code PN_j(t). Since the PN code's bandwidth is far greater than that of d_j(t) (i.e., T_c^{−1} ≫ T_s^{−1}), a bandwidth-spreading operation is realized. We also note that

P_g = B_CDMA / B_Data    (5)
where B_CDMA and B_Data denote the bandwidths of the CDMA and data signals, respectively. This can easily be verified by noting that the bandwidth of a direct-sequence CDMA signal may be shown to be α_1 T_c^{−1} for some α_1, while the bandwidth of the data signal is α_2 T_s^{−1}. Since the CDMA and data signals possess identical characteristics, α_1 = α_2; using Eq. (3), we arrive at Eq. (5). This result indicates that the processing gain for a CDMA signal is identical to the bandwidth-spreading factor P_g. In the remainder of this article, for the sake of simplicity, we deal with x̃_j(t), the complex envelope of the jth CDMA signal. In the ensuing analysis, the correlation properties of the PN codes are needed to understand the means by which CDMA receivers function. For this reason, let us define

R_a^{(j)}(n, τ, τ̂) = (1/2) ⟨PN_j(t − τ) PN_j*(t − τ̂)⟩_{n,τ̂}    (6)
as the partial autocorrelation function of the jth PN code observed over P_g chip symbols, with

⟨f(t)⟩_{n,τ̂} = (1/T_s) ∫_{(n−1)T_s+τ̂}^{nT_s+τ̂} f(t) dt
denoting a time-averaging operation over the interval [(n − 1)T_s + τ̂, nT_s + τ̂]. This function will be used in the subsequent analysis to discuss the characteristics of PN code acquisition and tracking systems. It is important to note that, in commercial CDMA systems, the period of the PN code (i.e., PT_c) is substantially greater than the processing gain, resulting in an R_a^{(j)}(n, τ, τ̂) that is a function of n and represents the partial autocorrelation function of the jth PN code. (Since PN codes are often generated using linear feedback shift registers, one may assume that the resulting codes are periodic with periods that depend on the structural properties of the generating shift registers.) In fact, due to the pseudorandom nature of the PN code, R_a^{(j)}(n, τ, τ̂) may be viewed as a random sequence. However, if one assumes a large processing gain (a large number of chip symbols per integration interval), R_a^{(j)}(n, τ, τ̂) does not vary substantially with n, and hence R_a^{(j)}(n, τ, τ̂) may be approximated by R_a^{(j)}(τ, τ̂). For the scenario where P_g = P (i.e., the PN code is repeated every symbol
interval), R_a^{(j)}(n, τ, τ̂) is not a function of n and reduces to the autocorrelation function of the jth PN code. It is also important to note that R_a^{(j)}(τ, τ̂), as defined previously, is a complex function. In practice, however, the complex PN codes are designed so that the I and Q PN codes (hereafter, Re{PN_j(t)} and Im{PN_j(t)} are referred to as the I and Q PN codes, respectively) are a pair of uncorrelated sequences. That is,

⟨Re{PN_j(t)} Im{PN_j(t)}⟩_{n,0} ≈ 0  for all j    (7)
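The sharp autocorrelation peak that the following analysis relies on can be verified for a short m-sequence (a 3-stage LFSR; the tap positions below are one standard choice, and the period-7 length is purely for illustration):

```python
def lfsr_msequence(length=7):
    # 3-stage linear feedback shift register; the taps are chosen so the
    # state cycles through all 7 nonzero values (an m-sequence of period 7).
    reg = [1, 0, 0]
    bits = []
    for _ in range(length):
        bits.append(reg[2])
        fb = reg[2] ^ reg[1]
        reg = [fb, reg[0], reg[1]]
    return [1 - 2 * b for b in bits]   # map {0,1} -> {+1,-1}

s = lfsr_msequence()
P = len(s)
R = [sum(s[k] * s[(k + n) % P] for k in range(P)) for n in range(P)]
print(R)  # [7, -1, -1, -1, -1, -1, -1]: sharp peak only at zero lag
```

This two-valued correlation (P at zero offset, −1 elsewhere) is what makes the epoch of a PN code detectable by correlation, as exploited in the synchronization discussion below.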
In that event, R_a^{(j)}(τ, τ̂) is a real function that can be expressed as

R_a^{(j)}(τ, τ̂) = (1/2) ⟨Re{PN_j(t − τ)} Re{PN_j(t − τ̂)}⟩_{n,τ̂} + (1/2) ⟨Im{PN_j(t − τ)} Im{PN_j(t − τ̂)}⟩_{n,τ̂}    (8)

If one assumes that the I and Q PN codes possess identical partial autocorrelation properties (a situation where this condition is not satisfied is of little practical interest), then

R_a^{(j)}(τ, τ̂) = ⟨Re{PN_j(t − τ)} Re{PN_j(t − τ̂)}⟩_{n,τ̂}    (9)

Hence, R_a^{(j)}(τ, τ̂) may be viewed as the partial autocorrelation function of the real PN sequences that form the complex PN signal. Using the preceding notation, the despreading operation may also be explained.

DESPREADING AND DETECTION

Since a binary PN spreading is used, it is fairly easy to see that

(1/2) ⟨x̃_j(t) PN_j*(t)⟩_{n,0} = (1/2T_s) ∫_{(n−1)T_s}^{nT_s} x̃_j(t) PN_j*(t) dt = (1/2T_s) ∫_{(n−1)T_s}^{nT_s} d_j(t) |PN_j(t)|² dt = d_n^{(j)}    (10)

where PN_j*(t) is the complex conjugate of PN_j(t) and it is assumed that (1/2)|PN_j(t)|² = 1. The factor of 1/2 is included to account for the fact that the PN code consists of real and imaginary spreading sequences. For complex spreading signals, we also observe that

(1/2) |PN_j(t)|² = (1/2) (Re{PN_j(t)})² + (1/2) (Im{PN_j(t)})²    (11)

where Im{x} is the imaginary part of x. Since the real and imaginary parts of PN_j(t) are also binary PN codes with unit amplitudes,

(Re{PN_j(t)})² = (Im{PN_j(t)})² = (1/2) |PN_j(t)|² = 1    (12)

When PN_j(t) is assumed to be real, it is fairly easy to see that PN_j²(t) = 1. Hereafter, the signal processing defined by Eq. (10) is referred to as a matched filtering (MF), or despreading, operation. To elaborate, as can be seen in Eq. (10), the outcome of the correlation operation is the original narrowband signal d_j(t).

At this juncture, we need to consider the received signal when the signals described previously have been subjected to an imperfect channel. In particular, we need to be concerned with the case where the channel coherence bandwidth is smaller than the bandwidth of the CDMA signal. (The coherence bandwidth of a dispersive channel may be viewed as the maximum bandwidth that a signal can take on without being distorted by the characteristics of the channel.) This implies frequency-selective operation for most practical applications and, in particular, for wireless communication scenarios. Hence, we need to examine the impact of the channel on a CDMA signal. This also plays a critical role in selecting a detection mechanism for the problem at hand. Before doing so, however, the important problem of synchronization is addressed.

SYNCHRONIZATION

In virtually any form of digital communication, synchronization in time (symbol clock recovery) precedes communication. CDMA systems are not exempt from this requirement. Synchronization in a CDMA system, however, is somewhat different from its TDMA counterpart. In TDMA systems, one requires synchronization in frequency (and, in some cases, in phase) before a data clock can be recovered. Often, a dotting sequence (1010101. . .) is included in the preamble of a TDMA frame to provide the clock synchronization subsystem with the necessary signal to lock onto. In a CDMA scenario, since the desired signal is spread in frequency over the entire allotted CDMA band, the acquisition of the PN code clock, which for most practical systems also implies data clock acquisition, must be achieved in the absence of phase and frequency synchronization. (Here, we are interested in the scenario where the PN code clock and data symbol clock are derived from a common source; hence, acquisition of the PN code clock leads to data symbol clock recovery.) This is due to the fact that if one chooses to achieve phase and frequency estimation in the absence of PN code acquisition, the phase and frequency synchronizers must extract synchronization information from a wideband signal. This, in general, is a formidable task due to the large bandwidth of typical CDMA signals. Hence, in a CDMA system, PN code timing acquisition precedes any other form of synchronization. Upon the recovery of the PN code phase, the CDMA signal is despread, and then an accurate estimate of frequency or phase is obtained. We are then faced with a situation where the PN code clock must be recovered in a noncoherent fashion. Before describing the mechanism by which PN code clock synchronization can be acquired, we need to point out that synchronization here is achieved in two phases.
In phase I, an initial synchronization of the PN code phase is established via acquiring the epoch of the received PN code to within a fraction of a chip interval. This problem is identical to the estimation of the propagation delay between a transmitter/receiver pair when the propagation delay is less than the period of the PN code. In the event that the propagation delay is greater than the period of the PN code, the synchronization procedure yields an estimate of the propagation delay reduced mod PTc. Phase I of the PN code synchronization is equivalent to estimating
the state of the shift register that generates the desired PN code. In phase II, the epoch of the desired PN code is tracked so that a real-time estimate of the PN code phase can be maintained at the receiver. As noted earlier, PN code estimation is accomplished in the face of unknown channel phase and frequency. In general, one can assume that the partial autocorrelation of a PN code satisfies the following property:

|R_a^{(j)}(τ, τ̂)| ≪ R_a^{(j)}(τ, τ)  for |τ − τ̂| > T_c    (13)
This property is critical to a successful PN code acquisition, since it can be exploited to realize a PN code acquisition model. To gain further insight, let us assume that the communication channel is such that transmission through the channel introduces a delay of τ, a phase offset of θ, a frequency error or offset of Δω rad/s, and an amplitude distortion of A(t). That is, let

r̃(t) = A(t) e^{i[Δωt+θ]} x̃(t − τ)    (14)

denote the complex envelope of the received signal at the input of a CDMA receiver in the absence of additive noise. (Since the inclusion of additive noise results only in the presence of a noisy term at the output of the correlation operation, and hence does not provide any further insight, we may proceed with a noiseless model to illustrate the function of the PN code acquisition model.) A noncoherent PN code acquisition model computes g(τ, τ̂), given by

g(τ, τ̂) = Σ_{n=1}^{L} |⟨r̃(t) PN_j*(t − τ̂)⟩_{n,τ̂}|²    (15)

where we have assumed that τ̂ denotes an estimate of τ, the propagation delay between transmitter and receiver, and we have collected energy over L symbol intervals. As noted earlier, if τ > PT_c, then τ̂ is the estimate of τ reduced mod PT_c. Obviously, the objective of a PN code acquisition model is to bring τ̂ to within a fraction of T_c of τ. Namely, we are interested in acquiring an estimate τ̂ where |τ − τ̂| ≤ T_c/N_s for N_s ≥ 2. In the ensuing discussion, we further assume that modulation is absent. This assumption is motivated by the fact that in commercial CDMA systems a pilot signal (a CDMA signal without the modulating signal) is provided by the transmitter to aid synchronization. The presence of modulation further complicates the model without adding any further insight. For this reason, we proceed with a pilot-signal-aided synchronization model. We further assume that an initial frequency estimate is obtained, and hence Δω is assumed to be relatively small compared with 1/T_s. If one assumes that the jth PN signal (without modulation) is used to generate the CDMA signal and that the amplitude distortion in the channel remains relatively constant for a symbol time, i.e., A(t) ≈ A (for most channels of interest, this assumption is valid), then

g(τ, τ̂) = Σ_{n=1}^{L} |⟨A e^{i[Δωt+θ]} PN_j(t − τ) PN_j*(t − τ̂)⟩_{n,τ̂}|²
        = A² D²(ΔωT_s) Σ_{n=1}^{L} |e^{iθ} ⟨PN_j(t − τ) PN_j*(t − τ̂)⟩_{n,τ̂}|²
        = A² D²(ΔωT_s) Σ_{n=1}^{L} |⟨PN_j(t − τ) PN_j*(t − τ̂)⟩_{n,τ̂}|²    (16)

where D(ΔωT_s) accounts for the distortion caused by the presence of the frequency error. D(ΔωT_s) is a decreasing function of ΔωT_s, and hence for ΔωT_s ≪ 1 one can expect a small level of distortion. As can be seen, the phase error is eliminated with the aid of the absolute-value function. Now, let us consider the case where the I and Q PN codes are nearly orthogonal. That is, let

⟨Re{PN_j(t)} Im{PN_j(t)}⟩_{n,0} ≈ 0    (17)

With some effort, it can be shown that

g(τ, τ̂) ≈ 4L A² D²(ΔωT_s) (R_a^{(j)}(τ, τ̂))²    (18)

Hence, the function of the absolute-value operation is to eliminate any phase error that may be present at the receiver, while the integration operation is intended to yield R_a^{(j)}(τ, τ̂). Due to Eq. (13), it is relatively easy to observe that g(τ, τ̂) may be used to launch a search for the correct epoch of the code. The function of L is to provide confidence in declaring whether or not the correct epoch of the code has been acquired when additive noise (or interference) is present. Before discussing the acquisition model based on the above observation, let us consider the case where additive noise is present. In the presence of noise, additional terms in g(τ, τ̂) that are dependent on noise must be accounted for. In that case, one can argue that

E{g(τ, τ̂)} = 4L A² D²(ΔωT_s) (R_a^{(j)}(τ, τ̂))²    (19)

where E{·} is the ensemble average of the enclosed. That is, the operation described by Eq. (15) yields an output whose average value provides one with the necessary function to carry out PN code acquisition. Hence g(τ, τ̂) may be used as an indicator of the PN code acquisition state. The search mechanism then consists of a chip-by-chip search that is carried out in a serial fashion. In this scheme, g(τ, τ̂) is obtained for a given τ̂. In the event that g(τ, τ̂) falls below a predefined threshold, τ̂ is increased by T_c/N_s (N_s is the number of steps per chip interval). Once the local PN code epoch is within a chip interval of the received PN code, the output of the correlator will exceed the threshold, which is chosen to yield optimum performance. At this stage, the synchronizer declares PN code acquisition and proceeds with PN code tracking. Since noise can impair this process, the performance of this acquisition model is characterized in terms of the statistics of the acquisition time (mean and standard deviation of the acquisition time), the probability of acquisition, and the probability of false acquisition. Phase II of synchronization involves the tracking of the PN code. This process involves maintaining a local PN code signal whose epoch is different from the epoch of the received signal
CODE DIVISION MULTIPLE ACCESS
by no more than a fraction of the chip time Tc. This objective is achieved via a PN code tracking loop that generates a pair of PN signals that are delayed and advanced by a fraction of chip time with respect to the local PN code. More specifically, the following signal is formed:
S(τ_e) = g(τ_e − T_c/N_s) − g(τ_e + T_c/N_s)   (20)
where τ_e = τ − τ̂. In arriving at Eq. (20), it is assumed that when |τ − τ̂| < T_c, g(τ, τ̂) = g(τ − τ̂). This signal is then used as an error signal to adjust τ̂. Since a voltage-controlled oscillator (VCO) provides the clock signal for the generation of the local PN code, the adjustment of τ̂ can be achieved using S(τ_e). The expected value of S(τ_e), i.e., E{S(τ_e)}, is often referred to as the "S-curve" of the tracking loop. This function determines the tracking behavior of the loop. In particular, the variance of the steady-state timing error as well as the mean time to loss of lock depend on this function. Using Eq. (19), we have
E{S(τ_e)} = 4LA²D²(ωT_s) [(R_a^(j)(τ_e − T_c/N_s))² − (R_a^(j)(τ_e + T_c/N_s))²]   (21)

To gain an insight into the operation of this loop, let us consider a scenario where the I and Q PN sequences are uncorrelated and possess identical autocorrelation functions. As noted earlier, this assumption leads to

R_a^(j)(τ_e) = ⟨Re{PN_j(t − τ)} Re{PN_j(t − τ − τ_e)}⟩_{n,τ+τ_e}  for |τ − τ̂| < T_c   (22)

When P_c(t) is an NRZ pulse,

⟨Re{PN_j(t − τ)} Re{PN_j(t − τ − τ_e)}⟩_{n,τ+τ_e} ≈ P_g T_c (1 − |τ_e|/T_c) rect(τ_e/2T_c);  |τ_e| < T_p   (23)

and

R_a^(j)(τ + nT_p) = R_a^(j)(τ)  for all integer n   (24)

In Eq. (23),

rect(x) = 1 for |x| < 0.5, and 0 otherwise

This situation is commonly referred to as the time-limited case, since the chip pulse shape extends over a finite time interval, and consequently its spectrum extends over a large frequency range. Equation (24) implies that the autocorrelation function of the PN code is a periodic function. This property is a direct consequence of the fact that PN codes are periodic functions with period T_p = PT_c. T_p here, then, denotes the period of the PN code. Note that Eq. (24) is a property common to all PN codes, whereas Eq. (23) is obtained when the I and Q PN sequences that make up the jth PN code satisfy the following properties:

Σ_{n_1=1}^{P_g} s_{n_1}^{(j),I} s_{n_1+n}^{(j),I} = Σ_{n_1=1}^{P_g} s_{n_1}^{(j),Q} s_{n_1+n}^{(j),Q} = P_g for n = 0, and ≤ λ_a otherwise   (25)

and

Σ_{n_1=1}^{P_g} s_{n_1}^{(j),I} s_{n_1+n}^{(j),Q} ≤ λ_c  for all n   (26)

with λ_a ≪ P_G and λ_c ≪ P_G denoting the peak out-of-phase autocorrelation and the peak cross-correlation, respectively, of the I and Q PN sequences. For λ_a = 0 and λ_c = 0, a pair of distinct phase shifts of the I or Q PN sequences are uncorrelated. Moreover, the I and Q PN sequences may then be viewed as uncorrelated sequences as well. In that event, Eq. (23) is an exact, and not an approximate, expression. In practice, however, one encounters λ_a and λ_c that are nonzero, and hence Eq. (23) must be used as an approximate partial autocorrelation function. The approximation due to the λ_a ≪ P_g and λ_c ≪ P_g conditions, however, is a good one. We note that although one requires that λ_a ≪ λ_c, the critical assumption for detection is that both λ_c and λ_a remain significantly smaller than P_G.

Given the assumptions stated above, we arrive at an S-curve for the above tracking loop that is approximately a linear function of τ_e over the range [−T_c/N_s, T_c/N_s]. More important, the slope of the S-curve remains positive over this range. There are several aspects of this function that are of interest. First, when the timing error is zero, which implies that perfect synchronization has been achieved, E{S(τ_e)} = 0. In this case, the input to the VCO is reduced to zero. Second, as the timing error begins to depart from 0, the signal at the input of the VCO has a magnitude that is proportional to τ_e. Hence, S(τ_e) provides the VCO with a signal that is an odd and monotonic function of the timing error, and thus can be used to adjust τ̂. As noted above, the other feature of the above S-curve is that it is nearly a linear function of τ_e in the vicinity of τ_e = 0. This is an important property, since the initial synchronization yields an estimate of the PN code epoch that is within ±T_c/N_s of the received PN code epoch. In this case, one can assume that the loop provides us with an error signal that is directly proportional to the timing error, and hence a linear tracking loop results.

Finally, in practice, P_c(t) is chosen to be a square-root raised-cosine pulse shape. In that case, a somewhat different result emerges. That is,

E{S(τ_e)} = 4LA²D²(ωT_s) [(P_RC(τ_e − T_c/N_s))² − (P_RC(τ_e + T_c/N_s))²]   (27)

where P_RC(t) = P_c(t) ⊛ P_c(t) (⊛ denotes a convolution operation) is a raised-cosine pulse shape given by Eq. (28) (note that the square-root raised-cosine pulse P_c(t) is implicitly defined in terms of P_RC(t)):

P_RC(t) = [sin(πt/T_c)/(πt/T_c)] · [cos(παt/T_c)/(1 − (2αt/T_c)²)]   (28)
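The shape of this S-curve is easy to check numerically. The sketch below evaluates Eq. (28) and the bracketed difference in Eq. (27) (dropping the constant 4LA²D²(ωT_s)); the roll-off factor α = 0.25 and the step N_s = 2 are illustrative choices, not values fixed by the article.

```python
import numpy as np

def p_rc(t, Tc=1.0, alpha=0.25):
    """Raised-cosine pulse of Eq. (28); np.sinc(x) = sin(pi*x)/(pi*x)."""
    x = t / Tc
    return np.sinc(x) * np.cos(np.pi * alpha * x) / (1.0 - (2.0 * alpha * x) ** 2)

def s_curve(tau_e, Tc=1.0, Ns=2, alpha=0.25):
    """E{S(tau_e)} of Eq. (27), with the constant factor dropped."""
    delta = Tc / Ns
    return p_rc(tau_e - delta, Tc, alpha) ** 2 - p_rc(tau_e + delta, Tc, alpha) ** 2

# Perfect timing gives a zero error signal; small offsets give a restoring
# signal whose sign matches the sign of the timing error.
print(round(s_curve(0.0), 12))              # 0.0
print(s_curve(0.1) > 0, s_curve(-0.1) < 0)  # True True
```

The even symmetry of P_RC(t) makes S(0) vanish exactly, and the positive slope near zero is what drives the VCO toward lock.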
This case is referred to as the bandwidth-limited case. Note that P_c(t) extends over several chip intervals, leading to a spectrum that is limited in bandwidth. Although E{S(τ_e)} does not yield a linear S-curve over the entire interval [−T_c/N_s, T_c/N_s], it provides all the necessary conditions for successful PN code tracking. That is, E{S(τ_e)} is a linear function of τ_e in the vicinity of τ_e = 0. Also, E{S(τ_e)} possesses a positive slope in the range [−T_c/N_s, T_c/N_s]. Hence, one may expect a tracking performance similar to that of the NRZ chip pulse shape case.

INTERFERENCE

Now let us consider the received signal at the input of a CDMA receiver when other CDMA signals are present. We consider two possibilities. First, the channel is taken to be nondispersive, so that no multipath components are present. In the second case, a more general scenario with multipath scattering is considered. In the event that the channel is nondispersive,
r̃(t) = A_j e^{iθ_j(t)} x̃_j(t − τ_j) + Σ_{l=1; l≠j}^{N} A_l e^{iθ_l(t)} x̃_l(t − τ_l) + z̃(t)   (29)
where now N − 1 other CDMA signals are present. It is assumed that the lth signal encounters τ_l seconds of propagation delay, an amplitude scaling of A_l, and a random phase shift of θ_l(t). Note that any frequency error caused by the channel is represented by dθ_l(t)/dt. As can be seen, the received signal is corrupted by many interfering signals and an additive noise z̃(t). The term z̃(t) is a complex white Gaussian noise whose real and imaginary parts are a pair of independent white Gaussian noise processes with a two-sided power spectral density of N_0 W/Hz over the frequency range of interest. For the additive noise, we have

E{z̃(t) z̃*(t − s)} = 2E{z̃_r(t) z̃_r(t − s)} = 2E{z̃_i(t) z̃_i(t − s)} = 2N_0 δ(s)

where δ(·) is a Dirac delta function and z̃_r(t) and z̃_i(t) denote the real and imaginary parts of z̃(t), respectively. Note that Re{z̃(t)e^{iω_c t}} may now be considered as a band-limited Gaussian noise whose power spectrum remains flat at N_0/2 W/Hz over the frequency range of interest about ω_c rad/s.

To gain an insight into the means by which CDMA receivers overcome interference, let us consider the outcome of a bandwidth despreading operation. Furthermore, let us assume a scenario where we are interested in recovering the jth signal. Obviously, one requires that the receiver acquire an estimate of τ_j. This task remains with the PN code acquisition subsystem discussed previously. Assuming that a successful delay estimation is performed, an estimate of τ_j that is within ±T_c/N_s (N_s ≥ 2) of τ_j can be obtained. Let such an estimate be τ̂_j. Also, let us assume that the frequency shift in the signal caused by the channel is compensated for and that the residual frequency error caused by the estimation process is small enough that θ_l(t) ≈ θ_l for the observation interval. That is, θ_l now denotes the residual phase error at the receiver caused by the channel phase shift and imperfect estimation and compensation of frequency. This condition is typically satisfied in practice by acquiring the PN code, despreading the signal, and, with the aid of a frequency estimator, acquiring an estimate of the frequency. Then the outcome of the bandwidth despreading operation (when the nth symbol is of interest) after frequency compensation and delay estimation is

(1/2)⟨r̃(t) PN_j*(t − τ̂_j)⟩_{n,τ̂_j}
  = z_n + (A_j e^{iθ_j}/2T_s) ∫_{(n−1)T_s+τ̂_j}^{nT_s+τ̂_j} d_j(t − τ_j) PN_j(t − τ_j) PN_j*(t − τ̂_j) dt
  + Σ_{l=1; l≠j}^{N} (A_l e^{iθ_l}/2T_s) ∫_{(n−1)T_s+τ̂_j}^{nT_s+τ̂_j} d_l(t − τ_l) PN_l(t − τ_l) PN_j*(t − τ̂_j) dt   (30)

It is not immediately obvious whether or not the nth symbol can be recovered using this operation. Depending on the type of detection used to recover the transmitted data symbol, an estimate of θ_j may be needed at the receiver. To go further, without loss of generality, let us assume that τ̂_j ≥ τ_j with |τ̂_j − τ_j| ≤ T_c. In that event, Eq. (30) reduces to

(1/2)⟨r̃(t) PN_j*(t − τ̂_j)⟩_{n,τ̂_j}
  = z_n + (A_j e^{iθ_j}/2T_s) ∫_{(n−1)T_s+τ̂_j}^{nT_s+τ_j} d_j(t − τ_j) PN_j(t − τ_j) PN_j*(t − τ̂_j) dt
  + (A_j e^{iθ_j}/2T_s) ∫_{nT_s+τ_j}^{nT_s+τ̂_j} d_j(t − τ_j) PN_j(t − τ_j) PN_j*(t − τ̂_j) dt
  + Σ_{l=1; l≠j}^{N} (A_l e^{iθ_l}/2T_s) ∫_{(n−1)T_s+τ̂_j}^{nT_s+τ̂_j} d_l(t − τ_l) PN_l(t − τ_l) PN_j*(t − τ̂_j) dt   (31)

where

z_n = (1/2T_s) ∫_{(n−1)T_s+τ̂_j}^{nT_s+τ̂_j} z̃(t) PN_j*(t − τ̂_j) dt

denotes a zero-mean Gaussian random variable. This equation then leads to

(1/2)⟨x̃(t) PN_j*(t − τ̂_j)⟩_{n,τ̂_j}
  = z_n + A_j e^{iθ_j} [R_j^(1)(τ_j, τ̂_j) d_n^(j) + R_j^(2)(τ_j, τ̂_j) d_{n+1}^(j)]
  + Σ_{l=1; l≠j}^{N} A_l e^{iθ_l} [R_{j,l}^(1)(τ_l, τ̂_j) d_{p_l}^(l) + R_{j,l}^(2)(τ_l, τ̂_j) d_{p_l+1}^(l)]   (32)
where

R_j^(1)(t_1, t_2) = (1/2T_s) ∫_{(n−1)T_s+t_2}^{nT_s+t_1} PN_j(t − t_1) PN_j*(t − t_2) dt

and

R_j^(2)(t_1, t_2) = (1/2T_s) ∫_{nT_s+t_1}^{nT_s+t_2} PN_j(t − t_1) PN_j*(t − t_2) dt

are partial autocorrelation functions of the jth PN code. In general, the PN codes are selected from a family of codes with
identical autocorrelation properties, and hence the subscript j may be dropped. Moreover,
R_{j,k}^(1)(t_1, t_2) = (1/2T_s) ∫_{(n−1)T_s+t_2}^{nT_s+t_1} PN_k(t − t_1) PN_j*(t − t_2) dt
and

R_{j,k}^(2)(t_1, t_2) = (1/2T_s) ∫_{nT_s+t_1}^{nT_s+t_2} PN_k(t − t_1) PN_j*(t − t_2) dt

denote the partial cross-correlation functions of the jth and kth PN codes. In arriving at Eq. (32), we have assumed that the integration interval coincides with the p_l th and (p_l + 1)th signaling intervals of the lth interfering signal. From Eq. (32), it is rather obvious that the desired symbol d_n^(j) is recovered. This recovery method, however, has yielded a number of undesirable terms. First, the presence of timing error, as in other digital receivers, has resulted in the introduction of intersymbol interference in the detection process [note the term involving d_{n+1}^(j)]. Furthermore, the detection process is now corrupted by interfering signals, even in the absence of additive noise. To estimate the impact of interference, the properties of the partial autocorrelation and cross-correlation functions of the PN codes must be evaluated. Before doing so, let us examine the preceding result more carefully. First, it is obvious from the definition of R_j^(2)(t_1, t_2) that when τ̂_j = τ_j, the intersymbol interference is reduced to zero. That is, R_j^(2)(τ̂_j, τ̂_j) = 0. Since |τ̂_j − τ_j| ≤ T_c/N_s for some N_s ≥ 2, for a typical PN code with large processing gain, R_j^(1)(τ_j, τ̂_j) ≫ R_j^(2)(τ_j, τ̂_j) for all j. That is, the intersymbol interference may be viewed as negligible for most practical cases. The other interfering terms, however, depend on the partial cross-correlation functions of the PN codes and thus cannot be suppressed readily. If one assumes that the product of two PN codes results in yet another wideband code, then the bandwidth of the interfering signal remains unchanged, yielding a wideband interfering signal. That is, the despreading operation manages to despread the bandwidth of the desired signal while yielding a wideband interference. Since the integration over a symbol time is equivalent to a filtering operation over a bandwidth of 1/T_s Hz and the despread interference signal possesses a bandwidth proportional to 1/T_c, the contribution of the interference to the detection process is reduced by a factor proportional to the spreading gain. In other words, the interference contributes only 1/P_G of its total power to the detection of d_n^(j). Stated differently, the key assumption of a CDMA receiver is that

|R_j^(1)(τ_j, τ̂_j)| / (|R_{j,l}^(1)(τ_l, τ̂_j)| + |R_{j,l}^(2)(τ_l, τ̂_j)|) ≈ P_G

for all l ≠ j. Hence, for a large processing gain, one can expect a significant reduction in the interference level at the output of a CDMA receiver. Note that there are N − 1 interferers present, and hence one must consider a large enough P_G that the total interference level remains small. Finally, note that d_n^(j) is scaled by an unknown coefficient A_j e^{iθ_j} in Eq. (32). If a phase modulation is used, then θ_j must be estimated at the receiver. In the event that θ_j remains constant over two consecutive time slots and a differential phase modulation is used, an estimate of θ_j is not required at the receiver. For other scenarios, the output of the despreader is fed to a channel estimation system so that an estimate of θ_j (and A_j in some cases) can be obtained and compensated for.
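The suppression by a factor of roughly P_G can be seen in a toy simulation. The sketch below uses randomly drawn ±1 spreading codes (an assumption made for illustration; real systems use structured PN code families) and shows the interferer surviving despreading at roughly 1/P_G of its power.

```python
import numpy as np

rng = np.random.default_rng(1)
PG = 128        # processing gain (chips per symbol)
N_SYM = 200     # number of data symbols

# Hypothetical +/-1 spreading codes for the desired (j) and interfering (l) users.
pn_j = rng.choice([-1.0, 1.0], size=PG)
pn_l = rng.choice([-1.0, 1.0], size=PG)

d_j = rng.choice([-1.0, 1.0], size=N_SYM)   # desired symbols
d_l = rng.choice([-1.0, 1.0], size=N_SYM)   # interfering symbols

# Chip-rate received signal: both users spread and summed (no noise, equal power).
r = np.repeat(d_j, PG) * np.tile(pn_j, N_SYM) + np.repeat(d_l, PG) * np.tile(pn_l, N_SYM)

# Despread with the desired user's code and integrate over each symbol.
y = (r.reshape(N_SYM, PG) * pn_j).sum(axis=1) / PG

# The desired symbols survive with unit amplitude; the interferer leaks in
# only through the code cross-correlation, of order 1/sqrt(PG) in amplitude.
print(np.array_equal(np.sign(y), d_j))   # True: decisions unaffected
print(float(np.max(np.abs(y - d_j))) < 0.5)
```

The residual term is d_l scaled by the normalized cross-correlation of the two codes, which for random ±1 sequences has standard deviation 1/√P_G, i.e., power 1/P_G, mirroring the ratio above.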
NEAR-FAR PROBLEM AND POWER CONTROL

Another important fact revealed by Eq. (32) is that the interfering signals' power levels differ from that of the desired signal. Obviously, if A_j = max{A_l; l = 1, . . ., N}, a favorable outcome results. That is, the channel has caused an attenuation in the desired signal that is smaller than those experienced by the interfering signals. Since this condition cannot be guaranteed in a mobile communication environment, the interfering signals can take on relatively large amplitudes compared with the desired signal. In this case, the interfering signals can completely suppress the desired signal, resulting in unacceptable performance for a CDMA system. Since no fading is considered here, and assuming that all the CDMA signals originate at their transmitters at identical power levels, the aforementioned scenario is encountered only when the distance between the desired user and the receiver is larger than all or some of the distances between the interfering users and the receiver. This problem is commonly referred to as the "near-far" problem in CDMA receivers. Considering the wide range of distances a mobile user can take on, this problem can severely hamper the performance of a wireless CDMA system. In theory, this problem can be circumvented by regulating the power levels of all CDMA transmitters so that the received signals possess identical power levels (i.e., A_j = A for all j). The mechanism by which this goal may be achieved is known as power control. Power control, in practice, is accomplished using either an open-loop or a closed-loop mechanism. For the sake of discussion, let us consider a mobile CDMA scenario. Furthermore, let r̃(t) denote the received signal at a CDMA base station. Hence, N denotes the number of active mobile transmitters. In an open-loop mechanism, the base station sends a signal (pilot signal) with known power level to all the mobile units (forward link).
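The open-loop rule described next simply mirrors the measured pilot level. A toy sketch (the nominal reference level is a hypothetical parameter introduced here, not from the article):

```python
def open_loop_tx_power_dB(pilot_offset_dB, nominal_tx_dB=0.0):
    """Open-loop rule: if the pilot arrives x dB below its known level
    (pilot_offset_dB = -x), raise transmit power by x dB; if it arrives
    x dB above, lower it by x dB."""
    return nominal_tx_dB - pilot_offset_dB

print(open_loop_tx_power_dB(-7.0))  # pilot 7 dB low  -> 7.0
print(open_loop_tx_power_dB(4.0))   # pilot 4 dB high -> -4.0
```

In linear units this amounts to transmitting at a power inversely proportional to the measured pilot power, so that the base station sees roughly the same level from every mobile.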
Mobile units measure the received power levels and, in turn, set their transmitter power levels for the reverse link in accordance with the received power level. (Typically, the power level of a mobile transmitter is increased by x dB if the pilot signal is received at the mobile at −x dB power level. For the case where the pilot is received at +x dB, the transmitter power level is reduced by x dB.) If the communication channel remains the same for all users and fast fading can be ignored, this mechanism can yield favorable results. Although reciprocity exists between the reverse and forward links of a wireless channel when log-normal fading (slow fading) is of concern (the log-normal shadowing effect is due to obstruction of the direct path of communication), the forward and reverse links experience different fast fading effects. That is, the information obtained about the channel condition by observing the forward channel's power level (pilot power) may not be used to estimate the channel characteristics of the reverse link. For this reason, after initial power
level setting using the open-loop mechanism, a closed-loop procedure is followed to overcome the near-far problem. In this case, A_j is a function not only of propagation distance but also of channel fading characteristics. The base station makes a measurement of the power levels of the received signals from individual mobile units. This information is reported back to the mobile units using what is known as a power control bit, which indicates whether the mobile should boost or reduce its power in some fixed dB increment. This process is repeated up to 2000 times per second in some modern systems. Given the fast rate of updates, this procedure can overcome the impact of rapid fluctuations in the power level. Note that the power level adjustments of the mobile units are based on information regarding the reverse link, and hence one can expect a more effective means of circumventing the near-far problem using the closed-loop power control mechanism.

CHANNEL EFFECTS

So far, we have considered a perfect communication channel. That is, we have assumed that the bandwidth despreading operation is performed on an exact replica of the transmitted signal at the receiver. As noted earlier, we are concerned with a dispersive channel. Let the impulse response of the channel be
h̃(t) = Σ_{l=1}^{N_p} c̃_l(t) δ(t − τ_l(t))   (33)
where h̃(t) is the complex impulse response of the channel, δ(t) is the Dirac delta function, τ_l(t) denotes the propagation delay of the lth multipath between the transmitter and the receiver, and c̃_l(t) is a complex multiplicative distortion (MD) denoting the channel fading effect for the lth resolvable path of the multipath channel. The term c̃_l(t) is often modeled as a low-pass complex Gaussian process. Moreover, N_p denotes the total number of resolvable multipaths. The set of multipath delays encountered in a channel is often referred to as the delay profile of a scattering channel. For most channels of interest, and when the observation interval is short enough to render a constant delay profile, one may assume that τ_l(t) ≈ τ_l. N_p and τ_l are determined by the multipath profile of the channel, whereas the characteristics of c̃_l(t) are a function of the Doppler spectrum of the channel. Due to the Gaussian property, one can fully characterize the statistics of c̃_l(t) using only the second-order statistics of the process. It can be shown that the MD processes have autocorrelation functions that satisfy (assuming no log-normal shadowing)

E{c̃_l(t) c̃_l*(t − τ) | σ_l²} = σ_l² J_0(2π f_d^(l) τ) e^{i2π f_e τ}   (34)
with σ_l², f_d^(l), and f_e denoting the mean-square value of the MD for the lth path of the signal, the maximum Doppler spread experienced by the lth path of the signal, and the residual frequency error in hertz at the receiver, respectively. Moreover, E{· | σ_l²} denotes the expected value of the enclosed conditioned on σ_l². Note that we have kept the discussion as general as possible to entertain the possibility of including a scenario where the desired and interfering users may be at different
Doppler rates. When log-normal shadowing is present, we have

σ_l² = P_l 10^{ζ/10}

where ζ is a normal random variable (log-normal shadowing) with zero mean and a standard deviation of σ_ζ (many field trials have shown σ_ζ to be in the 4 dB to 8 dB range for microcellular urban environments) and P_l is the received power in the absence of shadowing for the lth path of the signal. Hence, the average power can be calculated using E{σ_l²} = ηP_l, where

η = E{10^{ζ/10}} = exp[(ln(10)/10)² σ_ζ²/2]   (35)

and E{·} denotes the expected value of the enclosed with respect to ζ. Hence,

E{c̃_l(t) c̃_l*(t − τ)} = R_c^(l)(τ) = ηP_l J_0(2π f_d^(l) τ) e^{i2π f_e τ}   (36)
Also, since uncorrelated fading is considered,

E{c̃_l(t) c̃_n*(t − τ)} = R_c^(l)(τ) δ[l − n]   (37)

where

δ[x] = 1 for x = 0, and 0 otherwise
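Equations (34)–(36) are straightforward to evaluate. A sketch, with J_0 computed from its integral representation so that no special-function library is needed (the parameter values are illustrative, not from the article):

```python
import numpy as np

def bessel_j0(x):
    """J0(x) = (1/pi) * integral_0^pi cos(x*sin(u)) du, by numerical averaging."""
    u = np.linspace(0.0, np.pi, 4001)
    return float(np.mean(np.cos(x * np.sin(u))))

def eta(sigma_zeta_dB):
    """Log-normal shadowing scale factor of Eq. (35)."""
    return float(np.exp((np.log(10.0) / 10.0) ** 2 * sigma_zeta_dB ** 2 / 2.0))

def r_c(tau, P_l=1.0, f_d=100.0, f_e=0.0, sigma_zeta_dB=6.0):
    """MD autocorrelation of Eq. (36): eta * P_l * J0(2*pi*f_d*tau) * exp(i*2*pi*f_e*tau)."""
    return eta(sigma_zeta_dB) * P_l * bessel_j0(2 * np.pi * f_d * tau) * np.exp(1j * 2 * np.pi * f_e * tau)

print(eta(0.0))                          # no shadowing spread -> 1.0
print(abs(r_c(0.0, sigma_zeta_dB=0.0)))  # R_c(0) = P_l -> 1.0
```

At τ = 0 the autocorrelation reduces to the mean path power ηP_l, and the J_0 factor describes how quickly the fading decorrelates as τ grows relative to the Doppler spread.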
Since the c̃_l(t) are all Gaussian, {c̃_l(t); for all l} is a set of mutually independent Gaussian random processes. Finally, suppose that the channel, in addition to causing a multipath effect, adds an additive noise. That is, the complex envelope of the jth received signal in the absence of user-induced interference may now be approximated as
r̃(t) = Σ_{l=1}^{N_p} c̃_l(t) x̃_j(t − τ_l(t)) + z̃(t)   (38)
Therefore, a CDMA receiver must estimate some or all of the τ_l before any form of communication can take place. Due to the unique properties of the PN codes, an estimate of τ_l is acquired by establishing PN code acquisition for each path.

INTERFERENCE-DISPERSIVE CHANNEL

Now let us consider a more realistic scenario where other CDMA signals are present and the channel suffers from multipath scattering. In that event,
r̃(t) = Σ_{l=1}^{N_p} c̃_{l,j}(t) x̃_j(t − τ_{l,j}(t)) + Σ_{k=1; k≠j}^{N} Σ_{l=1}^{N_p} c̃_{l,k}(t) x̃_k(t − τ_{l,k}(t)) + z̃(t)   (39)
where now N − 1 other CDMA signals and their respective multipath components are considered. Note that we have introduced c̃_{l,j}(t) as the MD for the lth path of the jth signal. All the properties of the MD processes discussed previously extend to this scenario as well. That is, we consider the c̃_{l,j}(t) as independent, baseband complex Gaussian processes for all l and j. Moreover, τ_{l,j}(t) now denotes the delay encountered by the lth path of the jth CDMA signal and is assumed to be slowly varying. The preceding may also be presented as

r̃(t) = Σ_{l=1}^{N_p} c̃_{l,j}(t) d_j(t − τ_{l,j}(t)) PN_j(t − τ_{l,j}(t)) + Σ_{k=1; k≠j}^{N} Σ_{l=1}^{N_p} c̃_{l,k}(t) d_k(t − τ_{l,k}(t)) PN_k(t − τ_{l,k}(t)) + z̃(t)   (40)

As can be seen, the received signal is corrupted by many interfering signals. Considering a situation where path delays are slowly varying, we have a simplified model given by

r̃(t) = Σ_{l=1}^{N_p} c̃_{l,j}(t) d_j(t − τ_{l,j}) PN_j(t − τ_{l,j}) + Σ_{k=1; k≠j}^{N} Σ_{l=1}^{N_p} c̃_{l,k}(t) d_k(t − τ_{l,k}) PN_k(t − τ_{l,k}) + z̃(t)   (41)

At this stage, we assume that the observation interval (symbol time) is short enough that the delay profile of the channel remains unchanged. That is, τ_{m,j}(t) ≈ τ_{m,j} for the observation interval. This condition is satisfied for most practical applications. To gain an insight into the means by which CDMA receivers overcome interference in the presence of the multipath effect, let us consider the outcome of a bandwidth despreading operation. Furthermore, let us assume a scenario where we are interested in recovering the mth path of the jth signal. This situation is encountered in practice where the strongest paths of the desired signal are acquired by the PN code acquisition subsystem. More precisely, we require that the receiver acquire an estimate of τ_{m,j}. Assuming that a successful delay estimation is performed, an estimate of τ_{m,j} that is within ±T_c/N_s of τ_{m,j} can be obtained. Let such an estimate be τ̂_{m,j}. Then the outcome of the bandwidth despreading operation (when the nth symbol is of interest) is

(1/2)⟨x̃(t) PN_j*(t − τ̂_{m,j})⟩_{n,τ̂_{m,j}}
  = z_n + (1/2)⟨c̃_{m,j}(t) d_j(t − τ_{m,j}) PN_j(t − τ_{m,j}) PN_j*(t − τ̂_{m,j})⟩_{n,τ̂_{m,j}}
  + (1/2) Σ_{l=1; l≠m}^{N_p} ⟨c̃_{l,j}(t) d_j(t − τ_{l,j}) PN_j(t − τ_{l,j}) PN_j*(t − τ̂_{m,j})⟩_{n,τ̂_{m,j}}
  + (1/2) Σ_{k=1; k≠j}^{N} Σ_{l=1}^{N_p} ⟨c̃_{l,k}(t) d_k(t − τ_{l,k}) PN_k(t − τ_{l,k}) PN_j*(t − τ̂_{m,j})⟩_{n,τ̂_{m,j}}   (42)

As seen before, it is not immediately obvious whether the desired symbol can be recovered in this case. Similar to the previous case, where fading was absent, and without loss of generality, let us assume that τ̂_{m,j} ≥ τ_{m,j}. In that event, Eq. (42) reduces to

(1/2)⟨x̃(t) PN_j*(t − τ̂_{m,j})⟩_{n,τ̂_{m,j}}
  = z_n + (1/2T_s) ∫_{(n−1)T_s+τ̂_{m,j}}^{nT_s+τ_{m,j}} c̃_{m,j}(t) d_j(t − τ_{m,j}) PN_j(t − τ_{m,j}) PN_j*(t − τ̂_{m,j}) dt
  + (1/2T_s) ∫_{nT_s+τ_{m,j}}^{nT_s+τ̂_{m,j}} c̃_{m,j}(t) d_j(t − τ_{m,j}) PN_j(t − τ_{m,j}) PN_j*(t − τ̂_{m,j}) dt
  + (1/2T_s) Σ_{l=1; l≠m}^{N_p} ∫_{(n−1)T_s+τ̂_{m,j}}^{nT_s+τ̂_{m,j}} c̃_{l,j}(t) d_j(t − τ_{l,j}) PN_j(t − τ_{l,j}) PN_j*(t − τ̂_{m,j}) dt
  + (1/2T_s) Σ_{k=1; k≠j}^{N} Σ_{l=1}^{N_p} ∫_{(n−1)T_s+τ̂_{m,j}}^{nT_s+τ̂_{m,j}} c̃_{l,k}(t) d_k(t − τ_{l,k}) PN_k(t − τ_{l,k}) PN_j*(t − τ̂_{m,j}) dt   (43)

This equation then leads to

(1/2)⟨x̃(t) PN_j*(t − τ̂_{m,j})⟩_{n,τ̂_{m,j}}
  = z_n + c̃_{m,j} [R_j^(1)(τ_{m,j}, τ̂_{m,j}) d_n^(j) + R_j^(2)(τ_{m,j}, τ̂_{m,j}) d_{n+1}^(j)]
  + Σ_{l=1; l≠m}^{N_p} c̃_{l,j} [R_j^(1)(τ_{l,j}, τ̂_{m,j}) d_{q_{l,j}}^(j) + R_j^(2)(τ_{l,j}, τ̂_{m,j}) d_{q_{l,j}+1}^(j)]
  + Σ_{k=1; k≠j}^{N} Σ_{l=1}^{N_p} c̃_{l,k} [R_{j,k}^(1)(τ_{l,k}, τ̂_{m,j}) d_{p_{l,k}}^(k) + R_{j,k}^(2)(τ_{l,k}, τ̂_{m,j}) d_{p_{l,k}+1}^(k)]   (44)

where we have assumed that the channel remains constant over a symbol time [and hence c̃_{m,j}(t) is replaced with c̃_{m,j}]. This assumption is satisfied in many practical communication systems. Moreover, we note that the integration interval coincides with the q_{l,j}th and (q_{l,j} + 1)th signaling intervals of the lth (l ≠ m) multipath of the desired ( jth) signal. Similarly, we have assumed that the integration interval includes the p_{l,k}th and (p_{l,k} + 1)th signaling intervals of the lth path of the kth interfering signal. Obviously, q_{l,j} and p_{l,k} depend on the delay profile of the channel and on the relative delays encountered by the various users, respectively. It is rather obvious that the desired symbol d_n^(j) is recovered. Considering the conditions imposed on the cross-correlation and autocorrelation functions of the PN codes in the previous sections, it is rather easy to see that the interfering signals and their respective multipath components are detected at a power level that is approximately P_G times smaller than that of the desired signal.
RAKE RECEIVER

In the previous section, we introduced CDMA signaling and its properties. We also demonstrated that the received signal at a receiver often comprises delayed and attenuated versions of the desired signal (reflections) due to the multipath effect. It was also demonstrated that the interference due to the
other active CDMA users adversely affects the received signal in a typical CDMA receiver. For the sake of clarity, in what follows we consider a scenario in which only the desired signal and its multipath components are present. From Eq. (44), it is obvious that if one despreader is used to extract the mth path of the jth signal, the other multipath components appear as interference to this form of detection. Note that, with the exception of the amplitude and phase distortion effects, the spreading signal for all the reflected CDMA signals is known to the receiver, and hence one can capture this useful energy using an arrangement that is analogous to a garden rake. To elaborate, without loss of generality, let us consider a single-user scenario where the strongest multipath component of the received signal is c̃_1(t) x̃(t − τ_1(t)). That is, the delay associated with the strongest component of the multipath signal is τ_1(t). Moreover, let us assume that the PN code acquisition and tracking subsystem has locked onto this component of the received signal. The other components of the multipath signal [i.e., Σ_{j=2}^{N_p} c̃_j(t) x̃(t − τ_j(t))] may now be regarded as interference. It is obvious that, with the exception of c̃_j(t) and τ_j(t), the received multipath components contain the useful modulation. Let us now suppose that from the N_p possible multipaths, we are interested in capturing only N_f signals. N_f is commonly referred to as the number of "fingers" of a rake receiver. Also, let τ = [τ_1, τ_2, . . ., τ_{N_f}] denote a vector containing the N_f significant multipath delays in ascending order. Moreover, let τ̂ = [τ̂_1, τ̂_2, . . ., τ̂_{N_f}] be a vector of delay estimates obtained by the PN code acquisition system. A rake receiver, after acquiring the N_f possible delays, performs an MF operation. The MF operation involves correlation with PN_j*(t − τ̂_j); j = 1, 2, . . ., N_f and integration over a symbol time. That is, the receiver forms the following set of variables:

Y_n^(j) = (1/2)⟨r̃(t) PN_j*(t − τ̂_j)⟩_{n,τ̂_j};  j = 1, 2, . . ., N_f   (45)
At this stage, there are two possibilities. In one scenario, it is possible to estimate the channel MD, and hence ĉ_j(t) [an estimate of c̃_j(t)]; j = 1, 2, . . ., N_f can be obtained at the receiver. This scenario is of critical importance in the coherent modulation case, where knowledge of the channel phase is necessary for successful demodulation. In the other scenario, such estimates are not available and the modulation scheme used allows for noncoherent detection.

We consider the coherent demodulation case first. In this case, since an MF operation is required, one must recompute Y_n^(j) as follows:

Y_{n,c}^(j) = (1/2)⟨r̃(t) ĉ_j*(t) PN_j*(t − τ̂_j)⟩_{n,τ̂_j};  j = 1, 2, . . ., N_f

Then a decision variable is formed for the nth transmitted data symbol as follows:

D_n^(c) = Σ_{j=1}^{N_f} Y_{n,c}^(j)

This variable is passed on to a coherent demodulator for further processing. Since an MF operation is performed for each path, in the absence of timing and channel estimation errors and in the face of additive white Gaussian noise (AWGN), no other receiver yields an energy level higher than that produced by the arrangement just suggested. For this reason, this receiver is referred to as a maximal ratio combiner (MRC).

In the other scenario, the receiver uses Eq. (45) to compute Y_n^(j). At this stage, one must remove channel phase ambiguities before a decision can be rendered on the transmitted symbol. If a frequency-shift-keying (FSK) modulation is used, then

D_n = Σ_{j=1}^{N_f} |Y_n^(j)|²

Note that although phase ambiguities have been removed, the amplitude fluctuations due to c̃_j(t) have not been compensated for. This deficiency can seriously impair the performance of the subsequent demodulation process. For yet another modulation scheme, known as differential phase-shift keying (DPSK), the desired information is stored in the difference (reduced mod 2π) between two consecutive phases of the received signal. It is also assumed that the channel remains stationary for at least two consecutive symbol intervals. To recover the desired symbol, the receiver forms the following decision variables:

D_n = Σ_{j=1}^{N_f} Y_n^(j) (Y_{n−1}^(j))*

Similar to its FSK counterpart, this receiver is impaired by changes in the channel amplitude due to c̃_j(t). Additionally, any phase changes across two consecutive symbol intervals can produce unfavorable results in this case.

KAMRAN KIASALEH
University of Texas at Dallas
Wiley Encyclopedia of Electrical and Electronics Engineering

Data Compression for Networking

Standard Article
Wade Wan and Xuemin Chen, Broadcom Corporation, Irvine, CA
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5305
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: Basic Terminology and Methods for Data Coding; Fundamental Compression Algorithms; JPEG; H.261 and H.263; MPEG; H.264/AVC/JVT.
Keywords: Nyquist's theorem; analog-to-digital converter; pulse code modulation (PCM); differential PCM (DPCM); adaptive DPCM (ADPCM); delta modulation (DM); Huffman coding; run-length coding; arithmetic coding; transform coding; subband coding; vector quantization; image and audio coding (JPEG, H.261, H.263, MPEG)
DATA COMPRESSION FOR NETWORKING

DATA COMPRESSION; VIDEO CODEC; TRANSFORM CODING; IMAGE CODING; IMAGE PROCESSING; SIGNAL REPRESENTATION; VIDEO SIGNAL PROCESSING; VIDEOTELEPHONY; DIGITAL TELEVISION; AUDIO CODING; SPEECH CODING

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 2007 John Wiley & Sons, Inc.

There has been an explosive growth of multimedia communication over networks during the past two decades. Video, audio, and other continuous media data, as well as additional discrete media such as graphics, are parts of integrated network applications. For these applications, the traditional media (e.g., text, images), as well as the continuous media (e.g., video, audio), must be processed. Such processing, referred to as "data coding," often yields better and more efficient representations of text, image, graphics, audio, and video signals. The uncompressed media data often require very high transmission bandwidth and considerable storage capacity. To provide feasible and cost-effective solutions at the current quality requirements, compressed text, image, graphics, audio, and video streams are transmitted over networks. As shown in Refs. 1–6, there exist many data coding and compression techniques that are, in part, competitive and, in part, complementary. Most of these techniques are already used in today's products, while other methods are still undergoing development or are only partly realized. Today and in the near future, the major coding schemes are linear predictive coding, layered coding, and transform coding. The most important compression techniques are entropy coding (e.g., run-length coding, Huffman coding, and arithmetic coding), source coding (e.g., vector quantization, subsampling, and interpolation), hybrid coding (e.g., JPEG, MPEG-1, MPEG-2, MPEG-4, H.261, H.263, and H.264/AVC/MPEG-4 Part 10), and other proprietary coding techniques (e.g., Intel's Indeo, Microsoft's Windows Media Audio and Video, General Instrument's DigiCipher, IBM's Ultimotion, and Apple's QuickTime).

The purpose of this article is to provide the reader with a basic understanding of the principles and techniques of data coding and compression. Various compression schemes are discussed for transforming audio, image, and video signals into compressed digital representations for efficient transmission or storage. Before embarking on this venture, it is appropriate to first introduce and clarify the basic terminology and methods for signal coding and compression.

BASIC TERMINOLOGY AND METHODS FOR DATA CODING

The word signal originally referred to a continuous-time and continuous-amplitude waveform, called an analog signal. In a general sense, people now view a signal as a function of time, where time may be continuous or discrete, where the amplitude or values of the function may be continuous or discrete, and where those values may be scalar or vector-valued. Thus, a signal is meant to represent a sequence or a waveform whose value at any time is a real number or real vector. In many applications, a signal also refers to an image, whose amplitude depends on two spatial coordinates instead of one time variable; or it can refer to a video (moving images), whose amplitude is a function of two spatial variables and a time variable. The word data is sometimes used as a synonym for signal, but more often it refers to a sequence of numbers or, more generally, vectors. Thus, data can often be viewed as a discrete-time signal. In recent years, however, the word data has increasingly been associated in most literature with the discrete or digital case, that is, with discrete time and discrete amplitude, what is called a digital signal.

Speech, audio, images, video, and all observable electrical waveforms are analog and continuous-time in nature. The first step in converting analog signals to digital form is sampling. A continuously fluctuating analog waveform can usually be characterized completely from the knowledge of its amplitude values at a countable set of points in time, so that, in effect, one can "throw away" the rest of the signal. One does not need to observe how it behaves between any two isolated instances of observation. This is at the same time remarkable and intuitively obvious. It is remarkable that one can discard so much of the waveform and still be able to accurately recover the missing parts. The intuitive idea is that, if one samples periodically at regularly spaced intervals, and the signal does not fluctuate too quickly, so that no unexpected wiggles can appear between two consecutive sampling instants, then one can expect to recover the complete waveform by a simple process of interpolation or smoothing, where a smooth curve is drawn that passes through the known amplitude values at the sampling instants.

When watching a movie, one is actually seeing 24 still pictures flashed on the screen every second. (Actually, each picture is flashed twice.) The movie camera that produced these pictures was photographing a scene by taking one still picture every 1/24th of a second. Yet, one has the illusion of seeing continuous motion. In this case, the cinematic process works because the brain is somehow doing the interpolation. This is an example of sampling in action in daily life.

For an electrical waveform, or any other one-dimensional signal, the samples can be carried as amplitudes on a periodic train of narrow pulses. Consider a scalar time function x(t), which has a Fourier transform X(f). Assume there is a finite upper limit on how fast x(t) can vary with time. Specifically, assume that X(f) = 0 for |f| ≥ W. Thus, the signal has a strictly low-pass spectrum with cutoff frequency W hertz (Hz). To sample this signal, one can periodically observe the amplitude at isolated time instants t = kT for k = . . ., −2, −1, 0, 1, 2, . . . . The sample rate is fs = 1/T, and T is the sampling period or sampling interval in seconds. The idealized case of the sampling model is impulse sampling, with a perfect ability to observe isolated amplitude values at the sampling instants kT. The effect of such a sampling model is seen as the process of multiplying the original signal x(t) by a sampling function, s(t), which is the periodic train of impulses (Dirac delta functions in the ideal case) given by

s(t) = T Σ_{k=−∞}^{∞} δ(t − kT)
where the amplitude scale is normalized to T so that the average value of s(t) is unity. In the time domain, the effect of this multiplication operation is to generate a new impulse train whose amplitudes are samples of the waveform x(t). Thus

y(t) = x(t) s(t) = T Σ_{k=−∞}^{∞} x(kT) δ(t − kT)
Therefore, one now has a signal y(t) which contains only the sample values of x(t); all values in between the sampling instants have been discarded. Figure 1 shows an example of a continuous signal waveform and its sampled waveform. The complete recovery of x(t) from the sampled signal y(t) can be achieved if the sampling process satisfies the following fundamental theorem:
Nyquist Sampling Theorem. A signal x(t) that is bandlimited to W (Hz) can be exactly reconstructed from its samples y(t) when it is periodically sampled at a rate fs ≥ 2W. This minimum sampling frequency of 2W (Hz) is called the Nyquist frequency or Nyquist rate.

If the condition of the sampling theorem is violated, that is, if the sampling rate is less than twice the maximum frequency component in the spectrum of the signal being sampled, then the recovered signal will be the original signal plus an additional undesired waveform whose spectrum overlaps with the high-frequency components of the original signal. This
Figure 1. An example of sampling process, which shows (a) the original analog waveform and (b) its corresponding sampled waveform.
undesired component is called aliasing noise, and the overall effect is referred to as aliasing, since the noise introduced here is actually a part of the signal itself but with its frequency components shifted to a new frequency. The rate at which a signal is sampled usually determines the amount of processing, transmission, or storage that will subsequently be required. Hence, it is desirable to use the lowest possible sampling rate that will satisfy a given application. On the other hand, most physical signals are not strictly bandlimited. Typically, however, the contribution of the higher-frequency signal components diminishes in importance as frequency increases beyond a certain value. For example, music often does not have a well-defined cutoff frequency, below which significant power density exists and above which no signal power is present. But human ears are not sensitive to very-high-frequency sound. So how does one choose a meaningful sampling rate that is not higher than necessary and yet does not violate the sampling theorem? The answer is to first decide how much of the original signal spectrum really needs to be retained. Analog lowpass filtering is then performed on the analog signal before sampling, so that the "needless" high-frequency components are suppressed. This analog prefiltering is often called antialias filtering. For example, in digital telephony, the standard antialias filter has a cutoff of 3.4 kHz, although the speech signal contains frequency components extending well beyond this frequency. This cutoff allows the moderate sampling rate of 8 kHz to be used and retains the voice fidelity of analog telephone circuits, which were already limited to roughly 3.4 kHz. In summary, analog prefiltering is needed to prevent aliasing of the signal and noise components that lie outside of the frequency band that must be preserved and reproduced.
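The aliasing effect can be checked numerically. The following is a minimal sketch (plain Python; the 8 kHz sample rate and the 5 kHz/3 kHz tone pair are illustrative choices): a 5 kHz cosine lies above the 4 kHz Nyquist frequency of an 8 kHz sampler, so its samples coincide exactly with those of a 3 kHz cosine.

```python
import math

fs = 8000.0            # sampling rate (Hz); Nyquist frequency is fs/2 = 4 kHz
f_high = 5000.0        # tone above the Nyquist frequency
f_alias = fs - f_high  # its alias at 3000 Hz

# Sample both tones at the same instants t = k/fs
n = 64
x_high = [math.cos(2 * math.pi * f_high * k / fs) for k in range(n)]
x_alias = [math.cos(2 * math.pi * f_alias * k / fs) for k in range(n)]

# The two sample sequences are numerically indistinguishable
max_diff = max(abs(a - b) for a, b in zip(x_high, x_alias))
```

Since the sampler cannot tell the two tones apart, the 5 kHz component must be removed by an antialias filter before sampling, exactly as described above.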
Just as a waveform is sampled at discrete times, the value of the sampled waveform at a given time is converted to a discrete value. Such a conversion process is called quantization, which introduces loss into the sampled waveform. The resolution of quantization depends on the number of bits used in measuring the height of the waveform. For example, an 8-bit quantization yields 256 possible values. Lower resolutions of quantization result in higher losses in the digital signal. The electronic device that converts a signal waveform into digital samples is called an analog-to-digital converter (ADC). The reverse conversion is performed by a digital-to-analog converter (DAC).

Figure 2. PCM coded signal of sampled waveform in Fig. 1(b).

The first method to sample analog signals and then quantize the sample values was pulse code modulation (PCM). PCM was invented in the 1930s, but only became prevalent in the 1960s, when transistors and integrated circuits became available. Figure 2 depicts the steps involved in PCM at a high level. PCM does not require sophisticated signal processing techniques and related circuitry. Hence, it was the first method to be employed, and it remains the prevalent method used today in telephone plants. PCM provides excellent quality. PCM for voice coding was specified by the International Telegraph and Telephone Consultative Committee (CCITT), now the International Telecommunication Union (ITU), in Recommendation G.711. A problem with PCM is that it requires a fairly high bit rate (e.g., 64 kbit/s for voice coding). PCM has been around for a long time, and newer technologies are beginning to demand attention. Of the available schemes emerging from the laboratory, differential pulse code modulation (DPCM) and adaptive DPCM (ADPCM) are among the most promising techniques. If a signal has a high correlation between adjacent samples, the variance of the difference between adjacent samples is smaller than the variance of the original signal. If this difference is coded, rather than the original signal, fewer bits are needed for the same desired accuracy.
That is, it is sufficient to represent only the first PCM-coded sample as a whole, and all following samples as the difference from the previous one. This is the idea behind DPCM. In general, fewer bits are needed for DPCM than for PCM. In a typical DPCM system, the input signal is bandlimited, and an estimate of the previous sample (or a prediction of the current signal value) is subtracted from the input. The difference is then sampled and coded. In the simplest case, the estimate of the previous sample is formed by taking the sum of the decoded values of all the past differences (which ideally differ from the previous sample only by a quantizing error). DPCM exhibits a significant improvement over PCM when the signal spectrum is peaked
at the lower frequencies and rolls off toward the higher frequencies.

A modification of DPCM is delta modulation (DM). When coding the differences, it uses exactly one bit, which indicates whether the signal increases or decreases. This leads to inaccurate coding of steep edges. This technique is particularly profitable if the coding does not depend on 8-bit grid units. If differences are small, a smaller number of bits is sufficient.

A prominent adaptive coding technique is ADPCM, a further development of DPCM. Here, differences are encoded by the use of only a small number of bits (e.g., 4 bits). Therefore, either sharp transitions are coded correctly (these bits represent bits with a higher significance), or small changes are coded exactly (DPCM-encoded values are the less-significant bits). In the second case, a loss of high frequencies would occur. ADPCM adapts to this "significance" for a particular data stream as follows: the coder divides the value of the DPCM samples by a suitable coefficient and the decoder multiplies the compressed data by the same coefficient; that is, the step size of the signal changes. The value of the coefficient is adapted to the DPCM-encoded signal by the coder. In the case of a high-frequency signal, large DPCM values occur, and the coder determines a high value for the coefficient. The result is a very coarse quantization of the DPCM signal in passages with steep edges. Low-frequency portions of such passages are hardly considered at all. For a signal with permanently relatively small DPCM values, the coder will determine a small coefficient, thereby guaranteeing a fine resolution of the dominant low-frequency signal portions. If high-frequency portions of the signal suddenly occur in such a passage, a signal distortion, in the form of slope overload, arises.
Given the currently defined step size, the greatest possible change representable with the existing number of bits may not be large enough to represent the DPCM value with an ADPCM value; the transition in the PCM signal is then faded. It is possible to explicitly transmit the coefficient that is adaptively adjusted to the data in the coding process. Alternatively, the decoder is able to calculate the coefficient itself from the ADPCM-encoded data stream. In ADPCM, the coder can be made to adapt to DPCM value changes by increasing or decreasing the range represented by the encoded bits. In principle, the range of bits can be increased or decreased to match different situations. In practice, the ADPCM coding device accepts the PCM-coded signal and then applies a special algorithm to reduce the 8-bit samples to 4-bit words using only 15 quantization levels. These 4-bit words no longer represent sample amplitudes; instead, they contain only enough information to reconstruct the amplitude at the distant end. The adaptive predictor predicts the value of the next signal based on the level of the previously sampled signal. A feedback loop ensures that signal variations are followed with minimal deviation. The deviation of the predicted value, measured against the actual signal, tends to be small, and can be encoded with 4 bits.
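The DPCM principle above can be sketched in a few lines of plain Python. The previous-sample predictor and the fixed quantizer step of 0.05 are illustrative assumptions; a real ADPCM coder would additionally adapt the step size to the signal, as just described.

```python
import math

STEP = 0.05  # fixed quantizer step size (an ADPCM coder would adapt this)

def dpcm_encode(samples, step=STEP):
    """Quantize the difference between each sample and the predicted value."""
    pred, codes = 0.0, []
    for s in samples:
        code = round((s - pred) / step)   # quantized prediction residual
        codes.append(code)
        pred += code * step               # track the decoder's reconstruction
    return codes

def dpcm_decode(codes, step=STEP):
    """Accumulate the quantized differences to rebuild the waveform."""
    pred, out = 0.0, []
    for code in codes:
        pred += code * step
        out.append(pred)
    return out

samples = [0.8 * math.sin(2 * math.pi * k / 40) for k in range(80)]
decoded = dpcm_decode(dpcm_encode(samples))
# Because the encoder predicts from its own reconstruction, the error
# stays within half a quantizer step and does not accumulate.
worst = max(abs(a - b) for a, b in zip(samples, decoded))
```

Note the design choice of predicting from the decoded values rather than the original samples: this keeps encoder and decoder in lockstep, so quantization errors do not build up over time.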
FUNDAMENTAL COMPRESSION ALGORITHMS

The purpose of compression is to reduce the amount of data for multimedia communication. The amount of compression that an encoder achieves can be measured in two different ways. Sometimes the parameter of interest is the compression ratio: the ratio between the original source data size and the compressed data size. For continuous-tone images, however, another measure, the average number of compressed bits per pixel, is sometimes a more useful parameter for judging the performance of an encoding system. For a given image, the two are simply different ways of expressing the same compression. Compression in multimedia systems is subject to certain constraints. The quality of the coded and, later on, decoded data should be as good as possible. To make a cost-effective implementation possible, the complexity of the technique should be minimal. The processing time of the algorithm cannot exceed certain bounds. A natural measure of quality in a data coding and compression system is a quantitative measure of distortion. Among the quantitative measures, a class of criteria often used is the mean square criterion. It refers to some type of average or sum (or integral) of squares of the error between the sampled data y(t) and the decoded or decompressed data ŷ(t). For data sequences y(t) and ŷ(t) of N samples, the quantity
ALSE = (1/N) Σ_{t=0}^{N−1} [y(t) − ŷ(t)]²

is called the average least squares error (ALSE). The quantity

MSE = E{[y(t) − ŷ(t)]²}
is called the mean square error (MSE), where E represents the mathematical expectation. Often ALSE is used as an estimate of MSE. In many applications, the (mean square) error is expressed in terms of a signal-to-noise ratio (SNR), which is defined in decibels (dB) as

SNR = 10 log₁₀ (σ² / MSE)
where σ² is the variance of the original sampled data sequence. Another definition of SNR, used commonly in image and video coding applications, is the peak signal-to-noise ratio

PSNR = 10 log₁₀ (255² / MSE)

where 255 is the peak value of 8-bit data.
The PSNR value is roughly 12 to 15 dB above the value of SNR. Another commonly used method for measuring the performance of a data coding and compression system is rate distortion theory. Rate distortion theory provides some useful results, which tell us the minimum number of bits required to encode the data while admitting a certain level of distortion, and vice versa.
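These error measures are straightforward to compute. A minimal sketch in plain Python (the peak value of 255 assumes 8-bit pixel data):

```python
import math

def mse(original, reconstructed):
    """Mean square error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

def snr_db(original, reconstructed):
    """SNR in dB: signal variance over mean square error."""
    n = len(original)
    mean = sum(original) / n
    var = sum((a - mean) ** 2 for a in original) / n
    return 10 * math.log10(var / mse(original, reconstructed))

def psnr_db(original, reconstructed, peak=255.0):
    """PSNR in dB: squared peak value over mean square error."""
    return 10 * math.log10(peak * peak / mse(original, reconstructed))
```

For example, 8-bit pixels that each differ from the original by at most one level give an MSE of 1.0 and a PSNR of roughly 48 dB.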
The rate distortion function of a random variable x gives the minimum average rate R_D (in bits per sample) required to represent (or code) it while allowing a fixed distortion D in its reproduced value. If x is a Gaussian random variable of variance σ², y is its reproduced value, and the distortion is measured by the mean square value of the difference (x − y), that is, D = E[(x − y)²], then the rate distortion function of x is defined as

R_D = (1/2) log₂ (σ² / D),  0 ≤ D ≤ σ²
Data coding and compression systems are considered optimal if they maximize the amount of compression subject to an average or maximum distortion. As shown in Table 1, compression techniques fit into different categories. For their use in multimedia systems, one can distinguish among entropy, source, and hybrid coding. Entropy coding is a lossless process, while source encoding is a lossy process. Most multimedia systems use hybrid techniques, which are a combination of the two coding techniques. Entropy coding is used independently of the media's specific characteristics. Any input data sequence is considered to be a simple digital sequence, and the semantics of the data is ignored. Entropy encoding reduces the size of the data sequence by focusing on the statistical characteristics of the encoded data series to allocate efficient codes, independent of the characteristics of the data. Entropy encoding is an example of lossless encoding, as the decompression process regenerates the data completely. The concept of entropy is derived from classical 19th century thermodynamics. The basic ideas of entropy coding are as follows: First, one defines the term information by using video signals as examples. Consider a video sequence in which each pixel takes on one of K values. If the spatial correlation has been removed from the video signal, the probability that a particular level i appears will be P_i, independent of the spatial position. When such a video signal is transmitted, the information I imparted to the receiver by knowing which of the K levels is the value of a particular pixel is −log₂ P_i bits. This value, averaged over an image, is referred to as the average information of the image, or the entropy. The entropy can therefore be expressed as

H = −Σ_{i=1}^{K} P_i log₂ P_i
The entropy is also extremely useful for measuring the performance of a coding system. In "stationary" systems—systems where the probabilities are fixed—it provides a fundamental lower bound, called the entropy limit, for the compression that can be achieved with a given alphabet of symbols. Entropy encoding attempts to perform efficient code allocation (without increasing the entropy) for a signal. Run-length encoding, Huffman encoding, and arithmetic encoding are well-known entropy coding methods (7) for efficient code allocation, and are commonly used in actual encoders.
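The entropy limit for a given symbol distribution is straightforward to evaluate; a minimal sketch in plain Python:

```python
import math

def entropy(probs):
    """Average information H = -sum(P_i * log2(P_i)) in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform 256-level source carries the full 8 bits per symbol ...
uniform_bits = entropy([1.0 / 256] * 256)
# ... while a heavily skewed distribution can be coded well below that
skewed_bits = entropy([0.9] + [0.1 / 255] * 255)
```

The skewed case illustrates why entropy coding pays off: when a few levels dominate, the average information per symbol falls far below the raw bit width.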
Run-length coding is the simplest entropy coding. Data streams often contain sequences of the same bytes or symbols. By replacing these repeated byte or symbol sequences with the number of occurrences, introduced by a special flag that does not occur in the data stream itself, a substantial reduction of data can be achieved. For example, the data sequence GISSSSSSSGIXXXXXX can be run-length coded as GIS#7GIX#6, where # is the indicator flag. The character "S" occurs 7 consecutive times and is "compressed" to the 3 characters "S#7"; similarly, the character "X" occurs 6 consecutive times and is "compressed" to the 3 characters "X#6". Run-length coding is a generalization of zero suppression, which assumes that just one symbol appears particularly often in sequences; the coding focuses on uninterrupted sequences, or runs, of zeros or ones to produce an efficient encoding.

Huffman coding is an optimal way of coding with integer-length code words. Huffman coding produces a "compact" code: for a particular set of symbols and probabilities, no other integer-length code can be found that gives better coding performance. Consider the example given in Table 2. The entropy, the average ideal code length required to transmit the weather, is given by

H = (3/4) log₂(4/3) + (1/8)(3) + (1/16)(4) + (1/16)(4) = 1.186 bits/symbol
However, fractional-bit code lengths are not allowed, so the lengths of the codes listed in the right column do not match the ideal information. Since an integer code always needs at least one bit, increasing the ideal code length for the symbol "00" to one bit seems logical. The Huffman code assignment procedure is based on a coding "tree" structure. This tree is developed by a sequence of pairing operations, in which the two least probable symbols are joined at a "node" to form two "branches" of the tree. As the tree is constructed, each node at which two branches meet is treated as a single symbol with a combined probability that is the sum of the probabilities of all symbols combined at that node. Figure 3 shows a Huffman code pairing sequence for the four-symbol case in Table 2. In Fig. 3, the four symbols are placed on the number line from 0 to 1, in order of increasing probability. The cumulative sum of the symbol probabilities is shown at the left. The two smallest probability intervals are paired, leaving three probability intervals of size 1/8, 1/8, and 3/4. We establish the next branch in the tree by again pairing the two smallest probability intervals, 1/8 and 1/8, leaving two probability intervals, 1/4 and 3/4. Finally, the tree is completed by pairing the 1/4 and 3/4 intervals. To create the code word for each symbol, one assigns a 0 and 1, respectively (the order is arbitrary), to each branch of the tree, and then concatenates the bits assigned to these branches, starting at the "root" (at the right of the tree) and following the branches back to the "leaf" for each symbol (at the far left). Notice that each node in this tree requires a binary decision, a choice between two possibilities, and therefore appends one bit to the code word.
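The pairing procedure can be sketched with a priority queue. This is a minimal plain-Python sketch; the probabilities {3/4, 1/8, 1/16, 1/16} are the values implied for Table 2 by the pairing steps described above (they are an assumption, since the table itself is not reproduced here).

```python
import heapq
import itertools

def huffman_lengths(probs):
    """Code length per symbol, built by repeatedly pairing the two
    least probable nodes; each pairing adds one bit to its members."""
    tie = itertools.count()  # tie-breaker so heap entries always compare
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for i in syms1 + syms2:
            lengths[i] += 1  # one more branch on the path back to the root
        heapq.heappush(heap, (p1 + p2, next(tie), syms1 + syms2))
    return lengths

probs = [3 / 4, 1 / 8, 1 / 16, 1 / 16]
lengths = huffman_lengths(probs)                    # code lengths 1, 2, 3, 3
rate = sum(p, l) if False else sum(p * l for p, l in zip(probs, lengths))
```

The resulting average rate matches the 1.375 bits/symbol figure discussed for Table 2 below.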
Figure 3. Huffman coding tree for the sequence symbols given in Table 2. It demonstrates the Huffman code assignment process.
Figure 4. A process of partitioning the number line into subintervals for arithmetic coding. It illustrates a possible ordering for the symbol probabilities in Table 2.
One of the problems with Huffman coding is that symbols with probabilities greater than 0.5 still require a code word of length one. This leads to less efficient coding, as can be seen for the codes in Table 2. The coding rate R achieved with Huffman codes in this case is as follows:

R = (3/4)(1) + (1/8)(2) + (1/16)(3) + (1/16)(3) = 1.375 bits/symbol
This rate, when compared with the entropy limit of 1.186 bit/pixel, represents an efficiency of 86 percent. Arithmetic coding is an optimal coding procedure that is not constrained to integer-length codes. In arithmetic coding, the symbols are ordered on the number line in the probability interval from 0 to 1, in a sequence that is known to both encoder and decoder. Each symbol is assigned a subinterval equal to its probability. Note that, since the symbol probabilities sum to one, the subintervals precisely fill the interval from 0 to 1. Figure 4 illustrates a possible ordering for the symbol probabilities in Table 2. The objective in arithmetic coding is to create a code stream that is a binary fraction pointing to the interval for the symbol being coded. Thus, if the symbol is "00", the code stream is a binary fraction greater than or equal to binary 0.01 (decimal 0.25), but less than binary 1.0. If the symbol
is "01", the code stream is greater than or equal to binary 0.001, but less than binary 0.01. If the symbol is "10", the code stream is greater than or equal to binary 0.0001, but less than binary 0.001. Finally, if the symbol is "11", the code stream is greater than or equal to binary 0, but less than binary 0.0001. If the code stream follows these rules, a decoder can see which subinterval is pointed to by the code stream and decode the appropriate symbol. Coding additional symbols is a matter of subdividing the probability interval into smaller and smaller subintervals, always in proportion to the probability of the particular symbol sequence. As long as one follows the rule that the code stream is never allowed to point outside the subinterval assigned to the sequence of symbols, the decoder will decode that sequence. For a detailed discussion of Huffman coding and arithmetic coding, interested readers should refer to (7).

Source coding takes into account the semantics of the data. The degree of compression that can be reached by source coding depends on the data contents. In the case of lossy compression techniques, a one-way relation between the original sequence and the encoded data stream exists; the decoded data are similar to, but not identical with, the original data. Different source coding techniques make extensive use of the characteristics of the specific medium. An example is sound source coding, where the sound is transformed from the time domain to the frequency domain before encoding. This transformation, followed by encoding, substantially reduces the amount of data.
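Returning to arithmetic coding, the interval-subdivision rule can be sketched directly in plain Python. The subinterval table follows the ordering described above for the Table 2 symbols, which is one possible ordering (the exact layout of Fig. 4 is an assumption).

```python
# Subintervals [low, high) on the number line, known to encoder and decoder
INTERVALS = {
    "11": (0.0, 0.0625),    # probability 1/16
    "10": (0.0625, 0.125),  # probability 1/16
    "01": (0.125, 0.25),    # probability 1/8
    "00": (0.25, 1.0),      # probability 3/4
}

def arith_interval(symbols, intervals=INTERVALS):
    """Narrow [0, 1) down to the subinterval for a symbol sequence.

    Any binary fraction inside the returned interval decodes to this
    sequence, so about -log2(width) bits suffice to point into it."""
    low, width = 0.0, 1.0
    for s in symbols:
        s_low, s_high = intervals[s]
        low += width * s_low          # offset within the current interval
        width *= s_high - s_low       # shrink in proportion to probability
    return low, low + width

low, high = arith_interval(["00", "01"])  # width 3/4 * 1/8 = 3/32
```

Note how each symbol shrinks the interval in proportion to its probability, which is exactly why likely sequences end up with wide intervals and hence short code streams.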
Predictive Coding. Prediction is the most fundamental aspect of source coding. The basis of predictive encoding is to reduce the number of bits used to represent information by taking advantage of correlation in the input signal. DPCM and ADPCM, discussed above, are among the simplest predictive coding methods. Digital video signals exhibit correlation both between pixels within a frame (spatial correlation) and between pixels in differing frames (temporal correlation). Video compression techniques typically fall into two main types: (1) interframe prediction, which uses a combination of motion prediction and interpolated frames to achieve high compression ratios; and (2) intraframe coding, which compresses every frame of video individually. Interframe prediction techniques take advantage of the temporal correlation, while the spatial correlation is exploited by intraframe coding methods. For interlaced video, which scans alternate lines to distribute the pixels of a single frame across two fields, intra- and interfield prediction methods can also be used. Motion compensation (MC), one of the most complex prediction methods, reduces the prediction error by predicting the motion of the imaged objects. The basic idea of MC arises from a commonsense observation: in a video sequence, successive frames (or fields) are likely to represent the same details, with little difference between one frame and the next. A sequence showing moving objects over a still background is a good example. Data compression can be effective if each component of a frame is represented by its difference with the most similar component— the predictor—in the previous frame, and by a vector—the
motion vector—expressing the relative position of the two components. Even if actual motion exists between the two frames, the motion-compensated difference may be null or very small. The original component can be reconstructed from the difference, the motion vector, and the previous frame. A weakness of prediction-based encoding is that the influence of any errors during data transmission affects all subsequent data. In particular, when interframe prediction is used, the influence of transmission errors is quite noticeable. Since predictive encoding schemes are often used in combination with other schemes, such as transform-based schemes, the influence of transmission errors must be given due consideration.

Transform Coding. If we consider the frequency distribution of signals containing strong correlation, it appears that the signal power is concentrated in the low-frequency region. In general, it is possible to exploit for compression any systematic bias in components of the signal. The key idea behind transform coding is to transform the original signal in such a way as to emphasize the bias, making it more amenable to techniques that remove redundancy. One optimal transform is the Karhunen–Loeve (KL) transform. The KL transform can completely remove the statistical correlation of image data and provides a minimum mean-square error (3). In applying the KL transform to images, there are dimensionality difficulties. The KL transform depends on the statistics as well as the size of the image, and fast KL transform algorithms exist only for certain statistical image models. A number of orthogonal transforms, including the discrete Fourier transform (DFT) and the discrete cosine transform (DCT), have been used in various compression algorithms. Of these transforms, the DCT is the most widely used for video compression, because the power of the transformed signal is well concentrated in the low frequencies, and it can be computed rapidly.
The following expresses a two-dimensional DCT for an N × N pixel block:

F(u, v) = (2/N) C(u) C(v) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} f(x, y) cos[(2x + 1)uπ/(2N)] cos[(2y + 1)vπ/(2N)]
where f(x, y) is the pixel value at position (x, y), F(u, v) is the transform coefficient at frequency (u, v), and C(k) = 1/√2 for k = 0 and C(k) = 1 for k > 0.
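A direct, unoptimized plain-Python implementation of this transform follows (the 8 × 8 block is an illustrative choice; real codecs use fast factorizations rather than the O(N⁴) definition):

```python
import math

def dct2(block):
    """Two-dimensional DCT of an N x N block, computed directly from
    the definition (O(N^4); for illustration only)."""
    n = len(block)

    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0.0
            for x in range(n):
                for y in range(n):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = (2.0 / n) * c(u) * c(v) * acc
    return coeffs

# A flat block concentrates all of its energy in the dc coefficient F(0, 0)
flat = [[1.0] * 8 for _ in range(8)]
coeffs = dct2(flat)
```

The flat-block check mirrors the point made in the text: for strongly correlated (here, constant) data the transform packs the signal power into the low-frequency corner, leaving the remaining coefficients near zero.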
After the transformation, DCT coefficients are quantized by levels specified in a quantization table. Usually, larger values of N improve the SNR, but the effect saturates above a certain block size. Further, increasing the block size increases the total computation cost required. The value of
Figure 5. A block diagram of the MC + DCT coding scheme, which shows the basic function blocks such as motion estimation, motion prediction, DCT, variable-length coding, and so on.
N is thus chosen to balance the efficiency of the transform and its computation cost, block sizes of 8 and 16 are common. For large quantization, DCT using block sizes of 8 and 16 often lead to “blocking artifacts”—visible discontinuities between adjacent blocks. In practice, DCT is used in conjunction with other techniques, such as prediction and entropy coding. The Motion Compensation Plus Discrete Cosine Transform (MC + DCT) scheme, which will repeatedly be referred to, is a prime example of such a combination. MC + DCT. Suppose that the video to be encoded consists of digital television or teleconferencing services. For this type of video, MC carried out on the basis of frame differences is quite effective. MC can be combined with the DCT for even more effective compression. The overall configuration of MC + DCT is illustrated in Fig. 5. The selection of block size compares its input signal with that of the previous frame (generally in units of 8 × 8 pixel blocks) and selects those that exhibit motion. MC operates by comparing the input signal in units of blocks against a locally decoded copy of the previous frame, extracting a motion vector, and using the motion vector to calculate the frame difference. The motion vector is extracted by, for example, shifting vertically or horizontally a region several pixels on a side and performing matching within the block or the macroblock (a 16 × 16 pixel segment in a frame) (8). The motion-compensated frame-difference signal is then discrete cosine transformed, in order to remove spatial redundancy. A variety of compression techniques are applied in quantizing the DCT coefficients; the reader is directed to the references for details (8). A leading method is zig-zag scan, which has been standardized in JPEG, H.261, H.263, MPEG-1, -2, and -4, for video transmission encoding (8). Zig-zag scan, which transforms two-dimensional data into one dimension, is illustrated in Fig. 6. 
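The zig-zag order of Fig. 6 can be generated rather than tabulated. The Python sketch below (an illustration, not taken from any standard's reference code) sorts block positions by anti-diagonal, alternating the traversal direction on successive diagonals:

```python
def zigzag_order(n=8):
    """Zig-zag scan order for an n x n block: walk the anti-diagonals,
    alternating direction so the path snakes from DC to the highest frequency."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],                  # which diagonal
                                  rc[0] if (rc[0] + rc[1]) % 2    # odd: downward
                                  else rc[1]))                    # even: upward

def zigzag_scan(block):
    """Flatten a 2-D coefficient block into the 1-D zig-zag sequence."""
    return [block[r][c] for r, c in zigzag_order(len(block))]
```

For an 8 × 8 block this yields the familiar sequence (0,0), (0,1), (1,0), (2,0), (1,1), (0,2), ....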
Because the dc component of the coefficients is of critical importance, ordinary linear quantization is employed for it. The other components are scanned, for example, in zig-zag fashion from low to high frequency, linearly quantized, and variable-length-encoded by the use of run-length and Huffman coding.

Figure 6. The zig-zag scan pattern for an 8 × 8 block.

Subband Coding.

Subband coding (5) refers to compression methods that divide the signal into multiple bands to take advantage of a bias in the frequency spectrum of the video signal. That is, efficient encoding is performed by partitioning the signal into multiple bands and taking into account the statistical characteristics and visual significance of each band. The general form of a subband coding system is shown in Fig. 7. In the encoder, the analyzing filters partition the input signal into bands; this process is called subband decomposition. Each band is separately encoded, and the encoded bands are multiplexed and transmitted. The decoder reverses this process. Subband encoding offers several advantages. Unlike DCT-based compression techniques, it is not prone to blocking artifacts. Furthermore, subband encoding is the most natural coding scheme when hierarchical processing is needed for video coding. The main technological features to be determined in subband encoding are the subband analysis method (two- or three-dimensional), the structure of the analyzing filters, the bit allocation method, and the compression method within each band. In particular, there are quite a number of candidates for the form of the
Figure 7. A simplified block diagram of a subband coding scheme.
analysis and the structure of the filters. The filters must not introduce aliasing distortion in band analysis and synthesis. Figure 8 shows a two-band analysis and synthesis system. Consider the following analyzing filters as an example:

H1(z) = H0(−z)

where H0(z) is a low-pass prototype filter. For these analyzing filters, the characteristics of the synthesizing filters are

G0(z) = H0(z),    G1(z) = −H1(z) = −H0(−z)

The relationship between the input X(z) and the reconstructed output X̂(z) is then

X̂(z) = (1/2)[H0(z)G0(z) + H1(z)G1(z)]X(z) + (1/2)[H0(−z)G0(z) + H1(−z)G1(z)]X(−z)
     = (1/2)[H0(z)^2 − H0(−z)^2]X(z)
Clearly, the aliasing components (the terms in X(−z)) completely cancel. The basic principles illustrated hold unchanged when two-dimensional filtering is used in a practical application. Figure 9 illustrates how the two-dimensional frequency domain may be partitioned either uniformly or in an octave pattern. If one recalls that signal power will be concentrated in the low-frequency components, then the octave method seems the most natural. Since this corresponds to constructing the analyzing filters in a tree structure, it lends itself well to implementation with filter banks. In practical applications, one of the most important decomposition filters is what is called the discrete wavelet transform (DWT). Wavelet theory provides a unified framework
for multiresolution image compression. DWT-based compression enables the coding of still-image textures with high coding efficiency, as well as scalable spatial resolutions at fine granularity. The organization of a subband codec is similar to that of a DCT-based codec; the principal difference is that encoding and decoding are each broken out into a number of independent bands. Quality can be fixed at any desired value by adjusting the compression and quantization parameters of the encoders for each band. Entropy coding and predictive coding are often used in conjunction with subband coding to achieve high compression performance. If one considers quality from the point of view of the rate-distortion curve, then, at any given bit rate, the quality can be maximized by distributing the bits such that the distortion is constant across all bands. In fixed bit allocation, a fixed number of bits is allocated, in advance, to each band's quantizer, based on the statistical characteristics of the band's signal. In contrast, adaptive bit allocation adjusts the bit count of each band according to the power of the band's signal. In this case, either the decoder of each subband must determine the bit count for inverse quantization using the same criterion as the encoder, or the bit count information must be transmitted along with the quantized signal; the method is therefore somewhat lacking in robustness.

Vector Quantization.

As opposed to scalar quantization, in which sample values are independently quantized one at a time, vector quantization (VQ) attempts to remove redundancy between sample values by collecting several sample values and quantizing them as a single vector. Since the input to a scalar quantizer consists of individual sample values, the signal space is a finite interval of the real number line. This interval is divided into several regions, and each region is represented in the quantized outputs by a
Figure 8. A two-band subband coding system, detailing the band-encoding and band-decoding blocks of Fig. 7.
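The alias-cancellation property above can be checked numerically. The sketch below uses the two-tap Haar filter pair, an assumed simplest-possible choice (practical codecs use longer filters), and shows that two-band analysis followed by synthesis reconstructs the input exactly:

```python
import math

def haar_analyze(x):
    """Split an even-length signal into low- and high-band halves (Haar pair)."""
    s = math.sqrt(2.0)
    low = [(x[2 * k] + x[2 * k + 1]) / s for k in range(len(x) // 2)]
    high = [(x[2 * k] - x[2 * k + 1]) / s for k in range(len(x) // 2)]
    return low, high

def haar_synthesize(low, high):
    """Invert haar_analyze exactly: the aliasing terms cancel."""
    s = math.sqrt(2.0)
    x = []
    for l, h in zip(low, high):
        x.append((l + h) / s)   # even sample
        x.append((l - h) / s)   # odd sample
    return x
```

For a smooth input, most of the energy lands in the low band, which is what makes the octave splitting of Fig. 9(b) attractive.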
Figure 9. Subband splitting patterns in two-dimensional frequency domain: (a) uniform split (8 × 8); (b) octave split.
single value. The input to a vector quantizer is typically an n-dimensional vector, and the signal space is likewise an n-dimensional space. To simplify the discussion, consider only the case where n = 2. In this case, the input to the quantizer is the vector xj , which corresponds to the pair of samples (s1j , s2j ). To perform vector quantization, the signal space is divided into a finite number of nonoverlapping regions, and a single vector to represent each region is determined. When the vector xj is input, the region containing xj is determined, and the representative vector for that region, yj , is output. This concept is shown in Fig. 10. If we phrase the explanation explicitly in terms of encoding and decoding, the encoder determines the region to which the input xj belongs and outputs j, the index value which represents the region. The decoder receives this value j, extracts the corresponding vector yj from the representative vector set, and outputs it. The set of representative vectors is called the codebook. The performance of vector quantization is evaluated in the same manner as for other schemes, that is, by the relationship between the encoding rate and the distortion. The encoding rate R per sample is given by the following equation
R = ⌈log2 N⌉ / K

where K is the vector dimensionality and N is the number of quantization levels; the notation ⌈x⌉ represents the smallest integer greater than or equal to x (the "ceiling" of x). We define the distortion as the distance between the input vector xj and the output vector yj. In video encoding, the square of the Euclidean distance is generally used as a distortion measure, because it makes analytic design of the vector quantizer for minimal distortion more tractable. However, it is not necessarily the case that the subjective distortion perceived by a human observer coincides with the squared distortion.

Figure 10. An example of VQ with two-dimensional vectors. To perform VQ, the signal space is divided into a finite number of nonoverlapping regions, and a single vector is used to represent all vectors in each region.

To design a high-performance vector quantizer, the representative vectors and the regions they cover must be chosen to minimize the total distortion. If the input vector probability density function is known in advance, and the vector dimensionality is low, it is possible to perform an exact optimization. In an actual application, however, it is rare for the input vector probability density to be known in advance. The well-known LBG algorithm is widely used for designing vector quantizers adaptively in this situation (9). LBG is a practical algorithm that starts out with some reasonable codebook and, by iterating the determination of regions and representative vectors, converges on a better codebook.

Figure 11 shows the basic structure of an image codec based on vector quantization. The image is partitioned into M-pixel blocks, which are presented, one at a time, to the VQ encoder as the M-dimensional vector xj. The encoder locates the closest representative vector in its prepared codebook and transmits that vector's index. The decoder, which need only perform a simple table lookup in the codebook to output the representative vector, is an extremely simple device. The simplicity of the decoder makes VQ coding very attractive for distribution-type video services. VQ coding, combined with other coding methods, has been adopted in many high-performance compression systems.

Table 1 shows examples of coding and compression techniques that are applicable in multimedia applications, classified as entropy, source, and hybrid coding. Hybrid compression techniques combine well-known algorithms and transformation techniques that can be applied to multimedia systems. To clarify how the entropy, source, and hybrid schemes fit together, consider the typical sequence of operations, shown in Fig. 5, that is performed in the compression of still images and video sequences. The following four steps describe the compression of one image:

1. Preparation includes analog-to-digital conversion and generating an appropriate digital representation of the information. An image is divided into blocks of 8 × 8 pixels and represented by a fixed number of bits per pixel.

2. Processing is the first step of the compression process that makes use of sophisticated algorithms. A transformation from the time domain to the frequency domain can be performed by use of the DCT.
In the case of motion video compression, interframe coding uses a motion vector for each 16 × 16 macroblock or 8 × 8 block.

3. Quantization processes the results of the previous step. It specifies the granularity of the mapping of real numbers onto integers, which results in a reduction of precision. In a transformed domain, the coefficients are distinguished according to their significance; for example, they could be quantized using a different number of bits per coefficient.

4. Entropy encoding is usually the last step. It compresses a sequential digital data stream without loss. For example, a sequence of zeros in a data stream can be compressed by specifying the number of occurrences followed by the zero itself.

In the case of vector quantization, a data stream is divided into blocks of n bytes each. A predefined table contains a set of patterns; for each block, the table entry with the most similar pattern is identified. Each pattern in the table is associated with an index. Such a table can be multidimensional, in which case the index will be a vector. A decoder uses the same table to generate an approximation of the original data stream.

In the following sections, the most relevant work of the standardization bodies concerning image and video coding is outlined. In the framework of the International Organization for Standardization (ISO/IEC JTC1), three subgroups were established in May 1988: the Joint Photographic Experts Group (JPEG), working on coding algorithms for still images; the Joint Bilevel Image Experts Group (JBIG), working on progressive bilevel coding algorithms; and the Moving Picture Experts Group (MPEG), working on the representation of motion video. In the International Telecommunication Union (ITU), H.261 and H.263 were developed for video conference and telephone applications. The results of these standardization activities are presented next.
JPEG

The ISO 10918-1 JPEG International Standard (1992), also published as ITU-T Recommendation T.81, standardizes the compression and decompression of still natural images (4). JPEG provides the following important features:
JPEG implementation is independent of image size and is applicable to any image and pixel aspect ratio.

Color representation is independent of the particular implementation.

JPEG is intended for natural images, but the image content can be of any complexity, with any statistical characteristics.

The encoding and decoding complexities of JPEG are balanced, and both can be implemented in software.

Sequential decoding (slice-by-slice) and progressive decoding (refinement of the whole image) are possible.

A lossless, hierarchical coding of the same image at different resolutions is supported.

The user can select the quality of the reproduced image, the compression processing time, and the size of the compressed image by choosing appropriate individual parameters.

The key steps of JPEG compression are the DCT (8 × 8), quantization, zig-zag scan, and entropy coding. Both Huffman coding and arithmetic coding are entropy-coding options in JPEG. JPEG decompression simply reverses the compression process. A fast coding and decoding of still images, also usable for video sequences, is known as Motion JPEG. Today, JPEG software packages, together with specific hardware support, are available in many products.

ISO 11544 JBIG is specified for lossless compression of binary and limited bits/pixel images (4). The basic structure of the JBIG compression system is an adaptive binary arithmetic coder. The arithmetic coder defined for JBIG is identical to the arithmetic-coder option in JPEG.
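Two of these JPEG steps, table-driven quantization and run-length coding of zero runs, are small enough to sketch in Python. The table values below are arbitrary illustrations, not the default JPEG luminance table:

```python
def quantize(coeffs, qtable):
    """Divide each DCT coefficient by its table entry and round (the lossy step)."""
    return [[round(c / q) for c, q in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, qtable)]

def run_length(seq):
    """Encode a scanned coefficient sequence as (zero_run, value) pairs."""
    pairs, run = [], 0
    for v in seq:
        if v == 0:
            run += 1            # count a run of zeros
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run, 0))  # trailing zeros, akin to an end-of-block code
    return pairs
```

After quantization, most high-frequency coefficients become zero, which is precisely what makes the zig-zag scan followed by run-length coding effective.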
Figure 11. The basic structure of a VQ codec.
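A compact sketch of the VQ encoder/decoder loop and an LBG-style training iteration follows (Python; initializing the codebook from the first training vectors and running a fixed iteration count are simplifications of the algorithm in Ref. 9):

```python
def nearest(codebook, v):
    """Encoder: index of the representative vector closest to v (squared error)."""
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(codebook[j], v)))

def lbg(vectors, levels, iters=20):
    """LBG-style training: alternate region assignment and centroid update."""
    codebook = [tuple(v) for v in vectors[:levels]]  # simplistic initialization
    for _ in range(iters):
        groups = [[] for _ in range(levels)]
        for v in vectors:                    # assign each vector to its region
            groups[nearest(codebook, v)].append(v)
        codebook = [tuple(sum(c) / len(g) for c in zip(*g)) if g else cb
                    for g, cb in zip(groups, codebook)]  # centroid update
    return codebook

# Encoding transmits only nearest(codebook, x); decoding is codebook[index].
```

On training data with two well-separated clusters, the iteration converges to the two cluster centroids, illustrating how regions and representative vectors co-adapt.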
H.261 AND H.263

ITU Recommendations H.261 and H.263 (6) are digital video compression standards, developed for video conferencing and videophone applications, respectively. Both were developed for real-time encoding and decoding; for example, the maximum signal delay of compression plus decompression for H.261 is specified to be 150 ms, to meet the end-to-end delay of the targeted applications. Unlike JPEG, H.261 specifies a very precise image format. Two resolution formats, each with an aspect ratio of 4:3, are specified. The so-called Common Intermediate Format (CIF) defines a luminance component (Y) of 288 lines, each with 352 pixels. The chrominance components (Cb and Cr) each have a resolution of 144 lines and 176 pixels per line to fulfill the 2:1:1 requirement. Quarter-CIF (QCIF) has exactly half the CIF resolution, that is, 176 × 144 pixels for the luminance and 88 × 72 pixels for the other components. All H.261 implementations must be able to encode and decode QCIF.

In H.261 and H.263, data units of size 8 × 8 pixels are used for the representation of the Y as well as the Cb and Cr components. A macroblock is the result of combining four Y blocks with one block each of the Cb and Cr components. A group of blocks is defined to consist of 33 macroblocks; therefore, a QCIF image consists of three groups of blocks, and a CIF image comprises twelve groups of blocks.

Two types of pictures are considered in H.261 coding: I-pictures (or intraframes) and P-pictures (or interframes). For I-picture encoding, each macroblock is intracoded; that is, each block of 8 × 8 pixels in a macroblock is transformed into 64 coefficients by use of the DCT and then quantized. The quantization of dc coefficients differs from that of ac coefficients. The next step is to apply entropy encoding to the dc and ac parameters, resulting in a variable-length encoded word. For P-picture encoding, the macroblocks are either MC + DCT coded or intracoded. The prediction for MC + DCT coded macroblocks is determined by comparing macroblocks of the previous image with those of the current image. Subsequently, the components of the motion vector are entropy encoded by use of a lossless variable-length coding system. To improve the coding efficiency for low bit-rate applications, several new coding tools are included in H.263; among them are the PB-picture type and overlapped motion compensation.
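The motion-vector extraction by block matching used in MC + DCT coding and in H.261 P-pictures can be sketched as an exhaustive search (Python; the ±2-pixel window and the sum-of-absolute-differences cost are illustrative choices, not mandated by any standard):

```python
def sad(a, b):
    """Sum of absolute differences between two equal-size pixel blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def block(frame, r, c, n):
    """Extract the n x n block whose top-left corner is at (r, c)."""
    return [row[c:c + n] for row in frame[r:r + n]]

def motion_search(prev, cur, r, c, n=8, w=2):
    """Full search in a +/-w window: best motion vector for the block at (r, c)."""
    best, best_cost = (0, 0), sad(block(cur, r, c, n), block(prev, r, c, n))
    for dr in range(-w, w + 1):
        for dc in range(-w, w + 1):
            pr, pc = r + dr, c + dc
            if 0 <= pr <= len(prev) - n and 0 <= pc <= len(prev[0]) - n:
                cost = sad(block(cur, r, c, n), block(prev, pr, pc, n))
                if cost < best_cost:
                    best, best_cost = (dr, dc), cost
    return best, best_cost
```

When the current frame is simply the previous frame shifted, the search finds that shift and a zero-cost difference, which is why the motion-compensated residual compresses so well.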
MPEG

The ISO/IEC JTC1/SC29/WG11 MPEG working group has produced three specifications, ISO 11172 MPEG-1, ISO 13818 MPEG-2, and ISO 14496 MPEG-4 (8), for the coding of combined video and audio information. MPEG-1 is intended for image resolutions of approximately CIF or SIF (360 × 240) and bit rates of about 1.5 Mbit/s for both video and audio. MPEG-2 is specified for higher resolutions (including interlaced video) and higher bit rates (4 Mbit/s to 15 Mbit/s, or more). MPEG-4 was originally targeted at very low bit-rate coding applications; the targeted applications were broadened after MPEG-4 compression was found to be effective over a wide range of bit rates. In addition, a completely new concept of encoding a scene as separate "AV" objects was developed in MPEG-4.

The MPEG-1, -2, and -4 specifications each comprise three major parts: Part 1, Systems; Part 2, Video; and Part 3, Audio. The Systems part specifies a system coding layer for combining coded video and audio, and it also provides the capability of combining private data streams and streams that may be defined at a later date. The specification describes the syntax and semantic rules of the coded data stream. MPEG's system coding layer specifies a multiplex of elementary streams, such as audio and video, with a syntax that includes data fields directly supporting synchronization of the elementary streams. The system data fields also assist in the following tasks:

1. Parsing the multiplexed stream after a random access
2. Managing coded information buffers in the decoders
3. Identifying the absolute time of the coded information

The system semantic rules impose some requirements on the decoders; however, the encoding process is not specified in the ISO documents and can be implemented in a variety of ways, as long as the resulting data stream meets the system requirements. MPEG-1, -2, and -4 video use three types of frames (or pictures): Intra (I) frames, Predicted (P) frames, and Bidirectional (B) frames.
Similar to H.261, I-type frames are compressed using only information from within the frame, by the DCT algorithm. P frames are derived from the preceding I frames (or from other P frames) by using MC (predicting motion forward in time) plus DCT; P frames are compressed to approximately 60:1. Bidirectional, interpolated B frames are derived from the previous I or P frame and the future I or P frame; B frames are required to achieve the low average data rate. Field-block-based DCT and MC were developed in MPEG-2 for the efficient coding of interlaced video. MPEG-1 and -2 video can yield compression ratios of 50:1 to 200:1; for example, 50:1 compression yields broadcast quality at 6 Mbit/s, and 200:1 compression yields VHS quality at 1.2 Mbit/s to 1.5 Mbit/s. MPEG-2 can also provide high-quality video for High Definition Television at about 18 Mbit/s. Note that the MPEG video coding algorithms are asymmetrical: in general, it requires more computation to compress full-motion video than to decompress it. This is useful for applications where the signal is produced at one source but is distributed to many.

The MPEG standards also specify efficient compression algorithms for high-performance audio (8). For example, MPEG-1 audio coding uses the same sampling frequencies as Compact Disc Digital Audio and Digital Audio Tape, that is, 44.1 kHz and 48 kHz; additionally, 32 kHz is available, all at 16 bits. The three layers of the encoder are shown in Fig. 12; an implementation of a higher layer must be able to decode the MPEG audio signals of lower layers. Similar to the use of the two-dimensional DCT for video, a transformation into the frequency domain is applied for audio. The Fast Fourier Transform (FFT) is suitable for audio coding, and the spectrum is split into 32 noninterleaved subbands. For each subband, the amplitude of the audio signal is calculated; also, for each subband, the noise level is determined, simultaneously with the actual FFT, by using a psychoacoustic model. At a higher noise level a coarser quantization is performed, and at a lower noise level a finer quantization is applied. The quantized spectral portions of layers one and two are PCM-encoded, and those of layer three are Huffman-encoded. The audio coding can be performed with a single channel, two independent channels, or one stereo signal. In the definition of MPEG, there are two different stereo modes: two channels that are processed either independently or as joint stereo; in the case of joint stereo, MPEG exploits the redundancy between the two channels to achieve a higher compression ratio. Each layer specifies 14 fixed bit rates for the encoded audio data stream, which, in MPEG, are addressed by a bit rate index; the minimal value is always 32 kbit/s. The layers support different maximal bit rates: layer one allows a maximal bit rate of 448 kbit/s, layer two 384 kbit/s, and layer three 320 kbit/s. For layers one and two, a decoder is not required to support a variable bit rate.
In layer three, a variable bit rate is specified by switching the bit rate index. For layer two, not all combinations of bit rate and mode are allowed:
32 kbit/s, 48 kbit/s, 56 kbit/s, and 80 kbit/s are allowed only for a single channel.

64 kbit/s, 96 kbit/s, 112 kbit/s, 128 kbit/s, 160 kbit/s, and 192 kbit/s are allowed for all modes.

224 kbit/s, 256 kbit/s, 320 kbit/s, and 384 kbit/s are allowed for the stereo, joint stereo, and dual channel modes.

Figure 12. The key functional blocks of audio encoding in MPEG.

H.264/AVC/JVT

The latest video codec was developed as a joint effort between the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) (10, 11). The intent of this effort was to create a standard that would produce good video quality at half the bit rates that previous video standards, such as MPEG-2 and H.263, required. The techniques used in this new standard were constructed so that the standard would be applicable over a very wide range of bit rates and resolutions. The tradeoff compared with previous standards was an increase in complexity, which could be eased by Moore's law and other technological advances. The first version of this standard was completed in 2003. Additional extensions, known as the Fidelity Range Extensions (FRExt), were finished in 2004 to support higher-fidelity video coding beyond 8-bit 4:2:0 video (e.g., 10-bit, 12-bit, 4:2:2, and 4:4:4 video). The same syntax has been published by both organizations: the ITU-T H.264 standard and the ISO/IEC MPEG-4 Part 10 standard. Note that MPEG-4 Part 10 is not the same as MPEG-4 Part 2, the original video codec in the MPEG-4 suite of standards. This codec may also be referred to as the AVC (Advanced Video Coding) standard or the JVT standard, after the joint partnership between VCEG and MPEG known as the Joint Video Team. H.264/AVC contains many new features for more effective video compression than older standards. Some of these features include:
Multi-picture inter-picture prediction. Previous standards had limited inter-picture prediction to one (for P-pictures) or two (for B-pictures) reference pictures. H.264/AVC allows up to 16 reference pictures to be used. In addition, there are fewer restrictions on the pictures that can be used for prediction. For example, in MPEG-2, B-pictures were not allowed to be used as reference pictures for the prediction of other pictures; this restriction is not present in H.264/AVC.

Variable block-size motion compensation. Block sizes ranging from 4 × 4 pixels to 16 × 16 pixels can be chosen to match the size of objects and regions in the video content.

Motion compensation using increased fractional-pixel precision. While half-sample precision was used in MPEG-1, MPEG-2, and H.263, quarter-sample precision is used for luma and eighth-sample precision for chroma in H.264/AVC.

Spatial prediction from neighboring blocks for intracoding. For example, only DC coefficients were predicted in MPEG-2; in H.264/AVC, spatial prediction using neighboring blocks is performed for AC coefficients as well.

An exact integer transform, similar to the DCT, specified to allow exact decoding. Previous standards specified approximations to the ideal DCT, which could result in drift when the encoder and decoder implementations differed.

An in-loop deblocking filter to reduce the blocking artifacts common to DCT-based compression algorithms.

Context-adaptive binary arithmetic coding and context-adaptive variable-length coding, which are more efficient than previous entropy coding methods.

For the evaluation of video, image, and audio quality, subjective criteria are often used. The subjective criteria employ rating scales such as goodness scales and impairment scales. A goodness scale may be a global scale or a group scale. The overall goodness criterion rates perceptual quality on a scale ranging from excellent to unsatisfactory; a training set is used to calibrate such a scale. The group goodness scale is based on comparisons within a set of data. The impairment scale rates an image, video, or audio sequence on the basis of the level of degradation present when compared with a reference image, video, or audio sequence. It is useful in applications such as video coding, where the encoding process might introduce degradation in the output images.

BIBLIOGRAPHY

1. N. S. Jayant and P. Noll, Digital Coding of Waveforms, Englewood Cliffs, NJ: Prentice-Hall, 1984.
2. K. R. Rao and P. Yip, Discrete Cosine Transform, San Diego: Academic Press, 1990.
3. A. K. Jain, Fundamentals of Digital Image Processing, Englewood Cliffs, NJ: Prentice-Hall, 1989.
4. W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard, New York: Van Nostrand Reinhold, 1993.
5. J. W. Woods (ed.), Subband Image Coding, Boston: Kluwer, 1991.
6. K. Jack, Video Demystified, 2nd ed., San Diego: HighText Interactive, 1996.
7. T. M. Cover and J. A. Thomas, Elements of Information Theory, New York: Wiley, 1991.
8. B. G. Haskell, A. Puri, and A. N. Netravali, Digital Video: An Introduction to MPEG-2, New York: Chapman & Hall, 1997.
9. A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Boston: Kluwer, 1992.
10. G. J. Sullivan and T. Wiegand, "Video Compression: From Concepts to the H.264/AVC Standard," Proceedings of the IEEE, vol. 93, no. 1, December 2004, pp. 18–31.
11. A. Puri, X. Chen, and A. Luthra, "Video Coding Using the H.264/MPEG-4 AVC Compression Standard," Signal Processing: Image Communication, vol. 19, no. 9, October 2004, pp. 793–849.
WADE WAN
XUEMIN CHEN
Broadcom Corporation, Irvine, CA
Wiley Encyclopedia of Electrical and Electronics Engineering

Ethernet

Standard Article
Mart L. Molle, University of California, Riverside, Riverside, CA
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5307
Article Online Posting Date: December 27, 1999
Abstract

The sections in this article are: Ethernet Components; Ethernet Operation; Ethernet System Design Issues; IEEE 802.3 Ethernet Standard History.

Keywords: IEEE 802.3 Ethernet standard; 10BASE-T; fast Ethernet; gigabit Ethernet; Manchester, 4B/5B, and 8B/10B encoding schemes; Ethernet frame format; CSMA/CD; binary exponential backoff; link management and data encapsulation; Ethernet components (e.g., transceiver, repeater, network interface); autonegotiation of link capabilities; full-duplex operation and flow control
ETHERNET
Ethernet is a widely used local area network (LAN) technology that allows multiple end stations (such as desktop computers, servers, printers, and gateways to other networks) to exchange data among themselves within a single building or campus environment. The sending station segments the data into a sequence of frames, each of which is sent independently through the network to its destination(s). Every frame carries globally unique 48-bit source and destination addresses and other information, laid out according to a standard format; the length of a frame can vary between a minimum of 64 bytes and a maximum of 1518 bytes. By design, Ethernet provides only a "best effort" delivery service: the network will not reorder or duplicate frames, but there is no guarantee that a particular frame will reach its destination. Applications must run a reliable transport protocol, such as TCP, on top of the Ethernet service to guarantee delivery.

ETHERNET COMPONENTS

A typical Ethernet system is shown in Fig. 1. Each end station contains a network interface, which provides temporary storage for frames being sent to or received from the network, along with logic for executing the medium access control (MAC) algorithm, calculating the cyclic redundancy code (CRC) for error detection, and performing related functions. In many cases, the network interface is on a small card or printed circuit board that can be added to an end station when network connectivity is required. The network interface uses a transceiver to perform the actual data transmission and reception over the physical link. Initially, Ethernet used external transceivers, attached to the network interface by an attachment unit interface (AUI) cable. Today, however, the transceiver is often integrated with the network interface card. In some cases, interchangeable transceiver types may be plugged into the card through a medium independent interface (MII).
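The frame layout rules just described can be sketched as follows (Python; this builds only the DIX-style header plus padded payload, and omits the preamble, start-frame delimiter, and 4-byte CRC that complete the 64-byte minimum):

```python
def ethernet_frame(dst, src, ethertype, payload):
    """Frame body: 6-byte destination, 6-byte source, 2-byte type, padded data."""
    assert len(dst) == 6 and len(src) == 6      # 48-bit MAC addresses
    if len(payload) < 46:                       # pad so 14 + 46 + 4 (CRC) = 64
        payload = payload + bytes(46 - len(payload))
    assert len(payload) <= 1500                 # 14 + 1500 + 4 = 1518-byte max
    return dst + src + ethertype.to_bytes(2, "big") + payload
```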
A variety of link types have been defined, including coaxial cable, unshielded twisted pair (UTP) cabling, and both multimode and single-mode optical fiber. Coaxial cable is restricted to 10 Mbit/s operation. If UTP cabling meets Category 5 requirements, then operation at 100 Mbit/s can be supported via 100BASE-TX and it is expected that operation at
1 Gbit/s will be supported in the future via the 1000BASE-T standard now being developed. Optical fiber can support speeds up to 1 Gbit/s. Multiple transceivers can be connected to a single coaxial cable segment (up to 100 stations per segment of ‘‘thick’’ cable in 10BASE5, and up to 30 stations per segment of 50 Ω ‘‘thin’’ RG58 cable in 10BASE2). Coaxial cable segments are inherently half duplex because the electrical signals travel in both directions away from the transmitters along a single metallic conductor, passing the transceivers belonging to all other stations before being absorbed by terminating resistors at the ends of the segment. On the other hand, UTP and fiber-optic segments can support full-duplex transmission because each segment uses a separate signaling path to carry data in each direction between two transceivers at its endpoints. Larger networks are constructed by joining multiple segments together using active electronic devices that relay data from one of the attached segments to the other(s). A repeater (or ‘‘hub’’) immediately copies all bits arriving on each segment to all other segments, whether or not they are part of a valid frame. Segments joined together by repeaters form a single collision domain. If more than one end station within the given collision domain transmits frames at the same time, the data will get garbled together to form a collision, which cannot be understood by any of the receivers. A bridge (or ‘‘switch’’) copies frames that arrive on each segment to those segments that might contain their destination(s). Multicast, broadcast, and frames addressed to an unrecognized destination are copied to all other segments, while the rest are only copied to the segment that contains the destination station. Segments joined together by bridges form a single broadcast domain.
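The forwarding rule that distinguishes a bridge from a repeater can be sketched as follows. This is a simplified illustration: the port numbers and the learned-address table are hypothetical, and a real bridge also learns source addresses and ages out stale entries.

```python
def bridge_forward(dst, arrival_port, learned, ports):
    """Select output ports for a frame arriving on `arrival_port`.

    Known unicast destinations are copied only to the segment that
    contains the destination; multicast, broadcast, and unknown
    destinations are flooded to all other segments.
    """
    if not (dst[0] & 0x01) and dst in learned:   # I/G bit clear: unicast
        out = learned[dst]
        return [] if out == arrival_port else [out]
    return [p for p in ports if p != arrival_port]
```

A repeater, by contrast, would unconditionally copy every arriving bit to all other ports, which is why only bridges can isolate traffic between segments.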
ETHERNET OPERATION High-Level Service Interface Ethernet provides a service interface to each end station that consists of independent, asynchronously operating frame transmitter and receiver functions. These functions are invoked by the higher-layer protocols (such as TCP/IP) in the end station. To send some data, the station creates a higher-layer datagram and passes it to the Ethernet transmit function. After the transmit function returns the outcome (success or failure) of this request, the station can call the transmit function again with another frame. The transmit function begins by converting the higher-layer datagram to an Ethernet frame by adding a 64-bit preamble and start-frame delimiter,
[Figure 1. Typical Ethernet system: end stations with network interfaces connected by links to a repeater and a bridge; the segments joined by the repeater form a collision domain, and the larger network joined by the bridge forms a broadcast domain.]
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
the 48-bit source and destination addresses, and a 16-bit length/type value to the beginning and then adding a 32-bit frame check sequence computed with the CRC-32 polynomial to the end. Then the function attempts to transmit the frame over the outgoing link according to the rules of the medium access control (MAC) protocol. The transmit function reports successful delivery if it is able to transmit the entire frame without ever detecting a collision, and it reports failure if every one of the 16 allowable attempts to transmit the frame resulted in collisions. To receive some data, the station calls the Ethernet frame receiver function, and then it waits until the function returns with the next incoming datagram. Once activated, the receive function scans the incoming bit stream until it finds a valid preamble and start-frame delimiter, and then it gathers the rest of the incoming bits until the end of the transmission to form a candidate frame. If the candidate is shorter than the minimum frame length, then it is deemed to be a collision fragment and is discarded. If the candidate does not have a valid CRC, then it is discarded because of bit errors. Once a candidate frame has passed all the validation tests, its destination address is compared with the address of this end station and a list of recognized multicast and broadcast addresses. If there is an address match, then the receive function strips off the Ethernet encapsulation and returns the enclosed datagram to the station. Otherwise, the frame is discarded by the address filter and the receive function resumes looking for another frame in the incoming bit stream. Medium Access Control Ethernet uses a MAC algorithm called Carrier Sense Multiple Access with Collision Detection (CSMA/CD) to control the transmission of frames. CSMA/CD is a distributed algorithm for serializing the transmissions by multiple end stations over a shared channel.
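The encapsulation and receive-path validation steps described above can be sketched as follows. This is a simplified illustration: Python's zlib.crc32 uses the same CRC-32 polynomial as Ethernet, but the bit ordering of a real frame check sequence on the wire is not modeled, and the preamble and start-frame delimiter are omitted.

```python
import struct
import zlib

MIN_FRAME = 64     # bytes, minimum frame length including the FCS
MAX_FRAME = 1518   # bytes, maximum frame length including the FCS

def build_frame(dst, src, length_type, payload):
    """Encapsulate a higher-layer datagram (preamble/SFD not modeled)."""
    body = dst + src + struct.pack("!H", length_type) + payload
    fcs = zlib.crc32(body) & 0xFFFFFFFF   # same polynomial as the Ethernet CRC-32
    return body + struct.pack("<I", fcs)

def receive_frame(frame, my_addr, group_addrs):
    """Receive-path validation: length check, then CRC, then address filter."""
    if not MIN_FRAME <= len(frame) <= MAX_FRAME:
        return None                          # collision fragment or jabber
    body, (fcs,) = frame[:-4], struct.unpack("<I", frame[-4:])
    if zlib.crc32(body) & 0xFFFFFFFF != fcs:
        return None                          # bit errors: discard
    if body[:6] not in {my_addr, b"\xff" * 6} | set(group_addrs):
        return None                          # address filter: not for us
    return body[14:]                         # strip encapsulation, return datagram
```

Note that the order of the checks mirrors the text: a too-short candidate is rejected as a collision fragment before the CRC is even examined, and the address filter runs only on frames that are already known to be intact.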
When the end station requests the transmission of a frame, the MAC layer frame transmitter starts executing a sequence of trial-and-error steps, as determined by network activity that is reported by its transceiver. In particular, the transceiver sets the carrierSense control signal whenever there are any data present on the link, and it sets the collisionDetect control signal if it determines that the data originated from more than one transmitter. Originally, for the case of coaxial cables, both carrierSense and collisionDetect were obtained by analog logic. For example, in 10BASE5 networks, the signal levels used by the transceiver are offset from zero, so that each transmitter acts as a constant 41 mA current source acting on the two 50 Ω termination resistors connected in parallel. In this case, an analog voltmeter will read approximately 1 V when there are data present on the link, and a voltage threshold of approximately 1.5 V can be used to identify a collision. Thus, coaxial cable segments can support receive mode collision detection, which means that a transceiver can report collisions among third-party end stations. However, the Ethernet MAC algorithm does not use this feature. For other network types, such as UTP cabling and optical fiber, data are carried unidirectionally over a pair of physical links and both carrierSense and collisionDetect can be obtained using digital logic. A transceiver sets carrierSense if there are data present on the transmit or receive links, and it sets collisionDetect if data are present on both links.
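The analog collision-detection arithmetic for 10BASE5 works out as follows. This is an illustrative sketch: the 41 mA current and the 1.5 V collision threshold come from the text above, while the carrier threshold shown is a hypothetical placeholder.

```python
def coax_thresholds(num_transmitters, i_tx=0.041, r_term=50.0):
    """Each transmitter drives ~41 mA into the two 50-ohm terminating
    resistors in parallel (25 ohms), adding ~1 V per active transmitter."""
    volts = num_transmitters * i_tx * (r_term / 2.0)
    carrier_sense = volts > 0.4        # hypothetical carrier threshold
    collision_detect = volts > 1.5     # between one and two transmitters
    return round(volts, 3), carrier_sense, collision_detect
```

With one transmitter the voltage sits near 1 V (carrier present, no collision); with two it doubles to about 2 V, crossing the 1.5 V collision threshold.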
Each transmitter is required to leave a 96-bit interframe gap between the last previous data on the link and the start of its own transmission. This provides some time for the receivers to handle one frame before the arrival of the next. The interframe gap is controlled by the deferring control signal, which becomes true when start-of-carrier is detected but becomes false 96 bit-times after end-of-carrier is detected. If carrierSense returns during the interframe gap part 1, then it is assumed to be caused by analog effects inside the transceiver or the arrival of additional fragments within the same collision event, so the 96-bit interframe gap timer is restarted at the next end-of-carrier. However, if carrierSense returns during the interframe gap part 2, then it is ignored and the interframe gap timer continues to run. The switch from part 1 to part 2 can occur at any time during the first 64 bit-times of the interframe gap. The reason for having part 2 is to prevent a station whose interframe gap timer runs too fast from enjoying an unfair advantage over the other stations. Suppose station A’s clock ran 1% faster than station B’s clock. If station A were to transmit a burst of consecutive frames, then station B’s interframe gap timer would never expire and station B would be blocked from accessing the channel until station A had transmitted its entire burst. However, if station B were to transmit the same burst, then station A’s interframe gap time would expire after every frame, allowing it to compete with B for access to the channel. Under half-duplex mode, the MAC layer transmitter schedules the first attempt to transmit a frame immediately, if deferring is false, or as soon as the deferring control signal becomes false, otherwise. Obviously, this persistent strategy minimizes the delay between the request by the higher-layer protocol and the first attempt. 
However, it also means that the transmission of large frames is often followed by collisions on a busy network, as multiple stations wait for the same end-of-carrier event to trigger their respective attempts. If the entire frame is transmitted without triggering the collisionDetect control signal, then it is assumed that the frame reached its destination and its successful delivery is reported to the higher protocols at the station. Otherwise, the unsuccessful attempt must be aborted, and possibly rescheduled at a later time. When an attempt is aborted, the transmitter first finishes the 64-bit preamble and start-frame delimiter if necessary, and then it substitutes a 32-bit jam sequence in place of the remaining bits in the frame. In general, the jam can be any bit sequence as long as it is not intentionally chosen to be a valid CRC. Thus, in general a collision fragment starts with a normal preamble and start-frame delimiter (part of which may be garbled), followed by a string of bits that is at least as large as a 32-bit jam sequence and shorter than the minimum frame transmission time, so it can be easily discarded by the receivers using a length threshold. The slot time, which is defined as 512 bit times for networks operating at speeds below 1 Gbit/s and 4096 bit times (or 512 bytes) for Gigabit Ethernet, is a key parameter for half-duplex operation. In order for each transmitter to reliably detect collisions, the minimum transmission time for a complete frame must be at least a slot time, whereas the round-trip propagation delay (including both logic delays in all electronic components and the propagation delay in all links) must be less than a slot time. Thus, none of the affected stations could have finished its transmission before detecting the collision. Similarly, the receivers must be able to identify
(and discard) incoming collision fragments using the fact that their length, after removing the preamble and start-frame delimiter, is less than a slot time. As a result of all these requirements, it can be shown that the round-trip delay in a half-duplex Ethernet collision domain must be less than a slot time minus the jam length. To see this, suppose the round-trip delay between stations A and B is τ bit times, and station A starts transmitting at time 0, while station B waits until time τ/2 (when the data from A are about to arrive) before starting its own transmission. In this case, station B detects the collision immediately, but must still transmit a 96-bit minimum size collision fragment from τ/2 to τ/2 + 96. Meanwhile, station A detects the collision at τ, sends its jam signal, and stops transmitting at τ + 32. In this case, a receiver adjacent to station A would have received a total of τ + 96 bits between the start of A’s preamble and the end of B’s jam. Thus, after removing the 64-bit preamble and start-frame delimiter, the receiver would be left with a (τ + 32)-bit collision fragment. After each collision, the binary exponential backoff (BEB) algorithm is used to schedule the next retransmission (if any) of the affected frame. Let attempts be the number of times this frame has already been transmitted. If attempts = 16, then the frame is dropped with an excessive collision error. Otherwise, the BEB algorithm generates a random integer r in the range [0, 2^min(attempts, 10) − 1] and instructs the transmitter to sleep for r slot times before making its next attempt. By selecting backoff delays that are multiples of the slot time, colliding stations that pick different delays will not collide with each other again. BEB adjusts the range after each attempt, in an effort to provide an average of one distinct backoff slot per transmitter. Initially, the BEB algorithm only knows that its own station has a frame, so it (greedily) selects a zero backoff before the first attempt.
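The BEB schedule just described can be sketched directly. This is a minimal illustration; a real implementation counts the backoff in slot times tied to the hardware clock rather than returning a number.

```python
import random

MAX_ATTEMPTS = 16   # frame dropped after 16 failed transmissions
BACKOFF_CAP = 10    # range stops doubling at 2**10 = 1024 slots

def beb_backoff_slots(attempts):
    """Slot times to wait before the next attempt, given the number of
    transmissions already made (0 means first attempt: no backoff)."""
    if attempts >= MAX_ATTEMPTS:
        raise RuntimeError("excessive collision error: frame dropped")
    return random.randint(0, 2 ** min(attempts, BACKOFF_CAP) - 1)
```

Note that with attempts = 0 the range collapses to {0}, reproducing the greedy zero backoff before the first attempt.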
Thereafter, if a collision occurs after randomly selecting one of N slots, BEB raises its estimate to 2N active stations, based on the number of transmitters that selected its own slot. The doubling stops when N = 1024, since that is the maximum number of stations allowed in a single collision domain according to the Ethernet standard. It is important to note that each station’s transmit function runs an independent copy of the BEB algorithm, which it restarts from the beginning for each new frame. Consequently, the backoff delays selected by different stations following the same collision can become very lopsided, resulting in an unfairness problem known as the capture effect. For example, suppose stations A and B collide at time 0, and it is the first attempt by A to transmit its packet and the Kth attempt by B to transmit the other packet. In this case, A’s range of backoff delays is 2^(K−1) times smaller than B’s range, so A is very likely to retransmit in an earlier slot. Moreover, if A decides to transmit more frames, A’s first attempt to transmit a new frame will collide with B’s (K + 1)st attempt to transmit the same frame, so it is even more likely that A will retransmit its packet in an earlier slot. Because of the capture effect, a single station on a busy network can transmit large numbers of consecutive frames while many other stations are unable to transmit any packets. Half-duplex operation in Gigabit Ethernet is more complex to allow the slot time to be increased to 4096 bit times (because higher-speed operation increases the bandwidth–delay product on a fixed diameter network), without changing the existing 512-bit minimum frame size or one-frame-at-a-time
service interface. This was accomplished by introducing two new features. First, the minimum transmission time for a short frame was increased to 4096 bit times by appending extended carrier symbols to the end of the frame, if necessary. Extended carrier is a new code word that is neither a data bit nor an idle symbol. If a collision occurs before the end of the extended carrier has been sent, then the transmitter treats the attempt like a normal collision and retransmits the frame after a random backoff delay. Second, a technique called frame bursting was introduced to improve efficiency with short frames. In this case, once a station has successfully transmitted one frame, it is permitted to maintain control of the channel while it sends some additional frames by filling the interframe gap with extended carrier symbols instead of idle symbols. The station can keep adding more frames to the burst until it either runs out of frames or exceeds a total burst length of 65,536 bit times. Full-duplex mode can be used on point-to-point segments that use separate signaling paths for each direction, such as UTP cabling and optical fiber. Under full-duplex operation, the MAC layer transmitter and receiver functions operate independently of each other. Thus, the collisionDetect control signal is never true, and the carrierSense control signal is split into two signals: receiveDataValid indicates that data are present on the incoming link, while carrierSense indicates that data are present on the outgoing link and is used only to calculate the interframe gap. Full-duplex operation also includes an optional flow control method using pause frames. One end station can temporarily stop all traffic from the other end station (except control frames) by sending a pause frame. The duration of the pause (in multiples of a 512-bit time delay quantum) is controlled by a 16-bit parameter. Traffic resumes when the specified number of bit times has elapsed.
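The pause arithmetic is straightforward. This sketch assumes only what the text states (a 16-bit parameter counted in 512-bit quanta); the helper name is ours.

```python
PAUSE_QUANTUM_BITS = 512   # one pause quantum is a 512-bit time

def pause_seconds(pause_time, bit_rate_bps):
    """Convert a pause frame's 16-bit parameter into seconds of silence."""
    if not 0 <= pause_time <= 0xFFFF:
        raise ValueError("pause_time is a 16-bit field")
    return pause_time * PAUSE_QUANTUM_BITS / bit_rate_bps
```

For example, the maximum parameter (0xFFFF) on a 100 Mbit/s link pauses traffic for roughly a third of a second, while a parameter of zero resumes traffic immediately.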
If an additional pause frame arrives before the current pause time has expired, its parameter replaces the current pause time, so a pause frame with parameter zero allows traffic to resume immediately. Repeater Operation Repeaters may be attached to the same link types as end stations, but do not contain a MAC layer entity. Instead, a simple finite-state machine is used to control the forwarding of bits among its ports. If one port has incoming data, the data are sent to all other ports. If more than one port has incoming data, a jam signal is sent to all ports. Although some of the preamble bits may be lost while a transceiver is synchronizing with an incoming frame, the repeater is required to transmit the full 64-bit preamble and start-frame delimiter on each output port. Thus, the interframe gap between two consecutive frames can change every time they pass through a repeater. To limit the amount by which the interframe gap can shrink, a maximum of four repeaters is permitted in the path between any pair of end stations in the same collision domain. Repeaters also play a role in improving the robustness of large networks by automatically partitioning misbehaving ports from the rest of the network. For example, port partitioning will be triggered if an incoming transmission from a segment continues well beyond the maximum frame length (a condition known as ‘‘jabber’’), or if many consecutive transmissions to that segment result in collisions (an indication
that a single frame might be colliding with itself after traveling around a loop). Physical Layer Data Encoding Ethernet uses a variety of physical layer data encoding schemes, depending on the link speed and type. In particular, 10 Mbit/s Ethernet uses Manchester encoding, which is a two-level encoding scheme using a baud rate of twice the bit rate, to distribute both data bits and the clock from the sender to the receiver(s). Each data bit is represented by a pair of channel symbols: either ‘‘HI’’ followed by ‘‘LO’’ (i.e., nominally 0 V followed by −2.05 V on a coaxial cable link) to send a logical ‘‘0’’ data bit, or ‘‘LO’’ followed by ‘‘HI’’ to send a logical ‘‘1’’ data bit. In this way, the receiver(s) can easily synchronize with the data stream and recover the incoming data based on the direction of the transition at the midpoint of each bit. A transceiver with no data to send is required to generate an idle pattern consisting of a constant string of ‘‘HI’’ symbols (i.e., 0 V), which allows multiple transceivers to be connected to a single link without interfering with each other (except through collisions). However, this approach also means that the receivers cannot distinguish between an idle link and a broken link, which limits the fault detection capabilities of the system. Thus, the 10 Mbit/s fiber-optic inter-repeater link (FOIRL) introduced a Link Integrity Test to provide some fault detection on each of its dedicated signaling paths. The same Link Integrity Test was also used in 10BASE-T. An idle transmitter must send a short burst of energy called a link test pulse once every 16 ms. If the corresponding receiver has not seen either data or a link test pulse for at least 50 ms, then it declares the link to be broken. Manchester encoding is very simple and robust, but it is unsuitable for higher-speed operation because its high baud rate means that the link must carry frequencies much higher than the bit rate.
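The Manchester scheme just described can be illustrated in a few lines, with channel symbols shown as 'HI'/'LO' strings rather than voltages:

```python
def manchester_encode(bits):
    """Each data bit becomes two channel symbols: HI,LO for a 0 bit
    and LO,HI for a 1 bit, guaranteeing a mid-bit transition."""
    out = []
    for b in bits:
        out += ["HI", "LO"] if b == 0 else ["LO", "HI"]
    return out

def manchester_decode(symbols):
    """Recover bits from the direction of each mid-bit transition."""
    return [0 if pair == ("HI", "LO") else 1
            for pair in zip(symbols[0::2], symbols[1::2])]
```

The doubling is visible immediately: n data bits always produce 2n channel symbols, which is exactly why the baud rate is twice the bit rate.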
Thus, Fast Ethernet adopted the same 4B/5B encoding used by the Fiber Distributed Data Interface (FDDI) for transmission over Category 5 UTP (via the 100BASE-TX standard) and optical fiber (via the 100BASE-FX standard). Under 4B/5B, each 4-bit ‘‘nibble’’ of data sent over the MII is converted into a 5-bit code word for transmission over the physical link. Two signaling levels are used, and because five code bits carry every four data bits, the baud rate is 25% higher than the bit rate. Since there are twice as many 5-bit code words available compared to the number of distinct 4-bit data nibbles, the code words can be chosen in such a way that the encoder never outputs more than three consecutive logical ‘‘0’’ bits. Thus, since the transceiver indicates logical ‘‘0’’ and ‘‘1’’ bits by generating no change or a reversal of the current signal level, respectively, the receiver(s) will see at least one transition every 3 bit times, allowing them to synchronize with the incoming signal and recover both data bits and the clock. (100BASE-TX also randomizes the output of the encoder, so particular data sequences don’t create repeating signaling patterns that might cause high levels of electromagnetic interference.) Fast Ethernet also differs from the earlier designs by taking advantage (at the physical layer) of the fact that only point-to-point transmission over dedicated links will be used. In particular, the Link Integrity Test from 10BASE-T is not needed because all transmitters are always on. As soon as a device is turned on, its transceiver establishes a low-level connection with its peer at the other end of the link. If it has no data to
send, the transmitter sends one of the unused 4B/5B code words, which has been reserved to indicate that the link is in the idle state, in order to maintain clock synchronization. Additional code words are used as delimiters to mark the start and end of each MAC frame, so the preamble and start-frame delimiter are reduced to framing overhead that no longer serves any real purpose. Gigabit Ethernet yet again introduces some different encoding schemes. For transmission over optical fiber (via the 1000BASE-SX and 1000BASE-LX standards), the same 8B/10B encoding used in Fibre Channel is used. In this case, every 8-bit data byte transferred across the GMII is mapped into a 10-bit code word for transmission over the physical channel. Since there are four code words available for each data byte, only those code words with sufficient transitions to permit clock recovery at the receiver are used. Moreover, the 8B/10B code also maintains direct current (dc) balance over the long term in the following way. The running disparity of the data stream is defined as the difference between the total number of logical ‘‘1’’ bits and the number of logical ‘‘0’’ bits transmitted. If the running disparity is positive, then one set of code words will be used; otherwise, another set of code words will be used. At least half the bits in every code word belonging to the first set are logical ‘‘0’’ bits, while at least half the bits in every code word belonging to the second set are logical ‘‘1’’ bits. At the time of this writing, the details of the encoding scheme for 1000 Mbit/s operation over Category 5 UTP cabling (via the proposed 1000BASE-T standard) were still under development. It is expected that a full-duplex link will be created by transmitting simultaneously and in both directions over all four pairs in a UTP cable.
Each combination of an 8-bit data byte and an alternating clock bit is mapped into a code word that assigns one of five possible voltage levels to each of the four pairs in the UTP cable, giving a total of 5^4 = 625 possible code words to represent 512 different combinations. Some of the unused code words are used for ‘‘nondata’’ control symbols, such as idle, extended carrier, and start-frame or end-frame delimiters. Notice that each pair in the UTP cable need only carry symbols at the same baud rate as Fast Ethernet (i.e., 125 Mbaud). However, since a five-level encoding is more prone to errors than a two-level encoding, a trellis decoder is used to reduce the bit error rate. Autonegotiation of Link Capabilities Over time, the Ethernet standard has been updated many times to support advances in technology. Initially, these changes were made to allow new media types (e.g., ‘‘thin’’ coaxial cable, optical fiber, UTP cabling, etc.) to be used with the existing Ethernet standard for 10 Mbit/s operation. However, following the introduction of Fast Ethernet, the same Category 5 UTP cabling and RJ-45 connectors could be used for transmitting at 10 Mbit/s (according to the 10BASE-T standard) or at 100 Mbit/s (using any one of the 100BASE-TX, 100BASE-T4, or 100BASE-T2 standards). Eventually, transmission at 1 Gbit/s will also be possible, when the proposed 1000BASE-T standard is completed. In addition, UTP cabling is also compatible with full-duplex operation and its optional flow control scheme. Therefore, it is now possible to design a standards-compliant Ethernet device that plugs into Category 5 UTP cabling and operates according to one of more
than a dozen different ‘‘modes,’’ and many vendors have designed products that support several of these modes (e.g., 10/100 Mbit/s network interface cards). Unfortunately, this variety of operating modes also means that two standards-compliant Ethernet devices that are designed to use the same UTP cabling can’t communicate unless they also use the same mode. Since manual configuration is tedious and error-prone, the Ethernet standard includes a method that allows the attached devices to automatically select their highest common operating mode; this method is called Autonegotiation. When a device that supports autonegotiation is initialized, it transmits a fast link pulse (FLP) over the attached link every 16 ms. Each FLP looks like a normal link test pulse to existing 10BASE-T devices that do not support autonegotiation. However, the FLP is actually a burst of much shorter pulses that encodes a 16-bit ‘‘page’’ of information about the capabilities of the sending device. The encoding for a page consists of 33 pulse positions, each 125 µs apart. The odd pulse positions, which are always present, are used for clock recovery. The data are carried by the even pulse positions, where the presence of a pulse indicates a logical ‘‘1’’ data bit while absence of a pulse indicates a logical ‘‘0’’ data bit. The data in the page are used to advertise the set of capabilities supported by the sending device, and they also include an acknowledgment bit to indicate successful reception of the page being sent in the opposite direction (i.e., the data contained in at least three incoming FLPs were the same) and a next page bit to indicate there are more data to come after this page has been received.
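The FLP page encoding can be sketched as a mapping from the 16 data bits to the set of occupied pulse positions (positions numbered 1 through 33). This is a simplified model that ignores the pulse timing within the burst.

```python
def flp_pulse_positions(page_bits):
    """Odd positions 1, 3, ..., 33 always carry a clock pulse; a pulse
    in even position 2k means bit k of the 16-bit page is a logical 1."""
    if len(page_bits) != 16:
        raise ValueError("an autonegotiation page is 16 bits")
    pulses = set(range(1, 34, 2))                  # 17 clock pulses
    pulses.update(2 * (k + 1) for k, bit in enumerate(page_bits) if bit)
    return sorted(pulses)
```

An all-zero page therefore yields only the 17 clock pulses, which is also why an FLP burst degrades gracefully into something a plain 10BASE-T receiver can treat as link pulses.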
Once both devices have finished exchanging their respective pages of data, the link is established using their highest common operating mode (if one exists) using a fixed priority list that gives preference to full-duplex operation over half-duplex operation and also gives preference to higher speeds over lower speeds. After the link has been established, no more FLPs are sent: Higher-speed operation uses an active idle pattern without any link test pulses, and if 10BASE-T is selected, then conventional link test pulses will be used.
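Priority resolution can be sketched as a walk down an ordered list of modes. The list here is illustrative and abbreviated; the standard defines the full priority table, and the mode names are our own labels.

```python
PRIORITY = [                       # higher speed first; at each speed,
    "100BASE-TX full-duplex",      # full duplex beats half duplex
    "100BASE-TX half-duplex",
    "10BASE-T full-duplex",
    "10BASE-T half-duplex",
]

def resolve_mode(local, remote):
    """Return the highest common operating mode, or None if disjoint."""
    common = set(local) & set(remote)
    for mode in PRIORITY:
        if mode in common:
            return mode
    return None
```

Because both devices walk the same fixed list, they are guaranteed to converge on the same mode without any further negotiation.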
ETHERNET SYSTEM DESIGN ISSUES The design of Ethernet systems is limited by several factors. First, there is a maximum distance for each link type, which is determined by the given combination of transceiver type and medium. This distance limit is determined by the physical characteristics of the channel, as well as by how the signal quality changes as a function of distance due to such factors as attenuation (i.e., the signal level at the receiver is too low to be distinguished from background noise) and dispersion (i.e., successive code words blend together because of variability in the signal velocity and cannot be distinguished by the receiver). These factors determine the maximum length of the coaxial cable segments in 10BASE5 and 10BASE2, and they limit the maximum length of a UTP cable segment to 100 m regardless of the data rate. For optical fiber, these limits are quite generous for 10 Mbit/s and 100 Mbit/s operation, but become a significant constraint for 1 Gbit/s operation. Second, when half-duplex operation is used, the worst-case round-trip propagation delay must be restricted to less than one slot time in order for the CSMA/CD algorithm to function properly. This is why a 10 Mbit/s collision domain
can span a maximum diameter of approximately 2.5 km whereas a 100 Mbit/s collision domain is limited to only 205 m. When full-duplex operation is used, then the propagation delay need not be related to the slot time in any way. Third, when half-duplex operation is used, there can be at most four repeaters in the path between any pair of stations. This restriction comes about to prevent excessive shrinkage of the interframe gap, as explained above. And, finally, the network must be loop-free, whether it is constructed using repeaters and half-duplex links or using bridges (switches) and full-duplex links.
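The relationship between slot time and network diameter can be checked with a rough calculation. This gives an optimistic upper bound that ignores logic delays in repeaters and transceivers (which is why the practical limits quoted above are considerably smaller); the 2 × 10^8 m/s signal velocity is a typical assumption, not a figure from the text.

```python
def diameter_upper_bound_m(rate_mbps, signal_velocity=2.0e8):
    """Half-duplex constraint: the round-trip propagation delay must fit
    within one slot time, so diameter < slot_seconds * velocity / 2."""
    slot_bits = 4096 if rate_mbps >= 1000 else 512
    slot_seconds = slot_bits / (rate_mbps * 1e6)
    return slot_seconds * signal_velocity / 2.0
```

At 10 Mbit/s this bound is about 5 km of cable budget, roughly half of which is consumed by logic delays to give the practical 2.5 km limit; at 100 Mbit/s the bound shrinks to about 512 m (205 m in practice), and Gigabit Ethernet recovers diameter only by enlarging the slot time.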
IEEE 802.3 ETHERNET STANDARD The Institute of Electrical and Electronics Engineers (IEEE) Working Group 802.3 is responsible for defining an open vendor-independent standard for Ethernet. The Ethernet standard covers most of the functions of Layers 1 and 2 from the OSI reference model. However, these functions are divided into many sublayers by the Ethernet Reference Model, which is shown in Fig. 2 (1). The data link layer in the OSI reference model is divided into (1) logical link control, which is outside the scope of the Ethernet standard, and (2) medium access control (along with an optional MAC control sublayer to manage flow control on full-duplex links), which is the top sublayer within the Ethernet standard. All of the functions of the OSI physical layer are included in the Ethernet standard. The physical layer has been partitioned into sublayers in several ways, depending on the particular physical medium and/or data rate. Initially, the physical layer functions were separated into (1) physical layer signaling (PLS), which takes care of encoding and decoding the bit stream inside the end station, and (2) an external transceiver known as the medium attachment unit (MAU), which handles the actual transmission and reception of data over the link and generates the carrierSense and collisionDetect control signals. The communication between the PLS and MAU was defined by the attachment unit interface (AUI). However, as higher-speed operation was being developed, a different functional partitioning was adopted. First, Fast Ethernet introduced a new optional 4-bit-wide media independent interface (MII) to replace the bit-serial AUI, which defines a standard way to connect a removable transceiver. The MII can also be used for 10 Mbit/s operation, to simplify the design of equipment that can run at more than one speed. For Gigabit Ethernet, this interface was further changed into an 8-bit-wide Gigabit Media Independent Interface (GMII).
The MII or GMII sits between (a) a small reconciliation sublayer, which manages the interface on behalf of the MAC sublayer, and (b) the physical coding sublayer (PCS), which handles encoding and decoding of the data stream and generation of the carrierSense and collisionDetect control signals. Below the PCS is the physical medium attachment (PMA) sublayer, which contains Ethernet-specific functions for managing the physical transceiver, such as autonegotiation of speed and duplex settings and the operational status of the link. Finally, the lowest level functions were put into the physical-medium-dependent (PMD) sublayer so that the existing methods for transmission over optical fiber and UTP cable in the FDDI standard could be reused in Fast Ethernet. A similar partitioning was used
[Figure 2. Ethernet reference model. (From Ref. 1, with permission.) The figure maps the OSI reference model layers onto the CSMA/CD LAN sublayers: the higher layers and LLC, with the optional MAC control sublayer, sit above the MAC sublayer; below it, 1 and 10 Mbit/s operation uses PLS, AUI, and MAU, while 100 Mbit/s and 1000 Mbit/s operation use a reconciliation sublayer, the MII or GMII, and the PCS, PMA, and PMD, each stack terminating at the MDI and the medium. Key: AUI = attachment unit interface; MDI = medium dependent interface; MII = media independent interface; GMII = gigabit media independent interface; MAU = medium attachment unit; PLS = physical layer signaling; PCS = physical coding sublayer; PMA = physical medium attachment; PHY = physical layer device; PMD = physical medium dependent.]
in Gigabit Ethernet, which uses a PMD derived from Fibre Channel. In addition to the many sublayers that define the Ethernet functional specifications, the Ethernet standard also defines several groups of managed objects that provide a uniform set of attributes for getting information about the current status of the device (e.g., whether it is currently operating in half-duplex or full-duplex mode), retrieving data from cumulative activity counters (e.g., the number of octets of data sent or received), or setting operating parameters (e.g., whether or not it is set to promiscuously receive all incoming frames). These managed objects can be accessed through the Simple Network Management Protocol (SNMP), once the appropriate objects have been defined in the Management Information Base (MIB) associated with the device. HISTORY Development of Ethernet began at the Xerox Palo Alto Research Center in 1973 with a prototype that operated at a data rate of 3.94 Mbit/s. An overview of this system was published by Metcalfe and Boggs (2) in 1976. By 1980, a commercial version of Ethernet, known as Ethernet version 2, had been jointly developed by Digital Equipment Corporation, Intel, and Xerox. Ethernet version 2 introduced a number of changes, including (1) larger values for the minimum frame size and slot time, and (2) an increase in the data rate to 10 Mbit/s. The Ethernet version 2 ‘‘blue book’’ specification (3) formed the basis of the original IEEE 802.3 standard for Ethernet, published in 1983. However, the 802.3 standard introduced some technical changes, notably the replacement of the 16-bit ‘‘type field’’ by a 16-bit ‘‘length field’’ in the frame header. This distinction was eventually removed when a frame type was defined for the flow control pause frame in 1996. Because support for new media types was added to the 802.3 standard, a new naming convention was adopted (4).
PHY = Physical layer device
PLS = Physical layer signaling
PCS = Physical coding sublayer
PMA = Physical medium attachment
PMD = Physical medium dependent
The original version, designed to operate over ‘‘thick’’ coaxial cable, became 10BASE5, indicating that it operated at 10 Mbit/s, employed baseband signaling, and had a maximum segment length of 500 m. In 1985, the standards for 10BASE2 (which defines Ethernet operation over ‘‘thin’’ RG-58 coaxial cable segments up to 185 m long) and 10BROAD36 (which defines Ethernet operation over broadband CATV systems in which the distance between the head end and the stations is at most 1.8 km) were approved. An even lower-cost option called 1BASE5, based on the 1 Mbit/s AT&T Starlan design, was approved in 1987 but never became very popular. In 1987, 10 Mbit/s fiber-optic transceivers appeared in a limited way when a fiber-optic inter-repeater link (FOIRL) was approved, and they appeared more generally in 1993. 10BASE-T, which defines 10 Mbit/s operation over UTP cabling, was approved in 1990.

Fast Ethernet, which operates at a data rate of 100 Mbit/s, includes a variety of transceiver types. 100BASE-TX (for operation over two pairs of Category 5 UTP cabling), 100BASE-FX (for operation over optical fiber), and 100BASE-T4 (for operation over four pairs of Category 3 UTP cabling) were approved in 1995 and published as 802.3u (5). In 1996, 100BASE-T2 (for operation over two pairs of Category 3 UTP cabling) was published as 802.3y. However, neither 100BASE-T4 nor 100BASE-T2 has achieved widespread popularity. At the same time, full-duplex operation was defined in 802.3x, which was approved in 1996. Gigabit Ethernet, which operates at a data rate of 1000 Mbit/s, also includes a number of subtypes. The 802.3z standard, expected to be approved in 1998, includes 1000BASE-SX (a short-wavelength laser, suitable for limited distances over multimode fiber), 1000BASE-LX (a long-wavelength laser, suitable for moderate distances over multimode fiber and much longer distances over single-mode fiber), and 1000BASE-CX (a short-haul copper jumper cable) (6).
In addition, the development of 1000BASE-T (for transmission over distances of up to 100 m using four pairs of Category 5 UTP
cable) is well underway and should be published as 802.3ab sometime in 1999.

BIBLIOGRAPHY

1. IEEE 802.3 CSMA/CD (ETHERNET) Working Group Web Site [Online]. Available: http://grouper.ieee.org/groups/802/3/
2. R. M. Metcalfe and D. R. Boggs, Ethernet: Distributed packet switching for local computer networks, Commun. ACM, 19 (7): 395–404, 1976.
3. Digital Equipment Corp., Intel Corp., and Xerox Corp., The Ethernet: A local area network data link layer and physical layer specifications, September 30, 1980.
4. ANSI/IEEE Std 802.3, Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications, 5th ed., 1996.
5. IEEE Std 802.3u-1995, Media access control (MAC) parameters, physical layer, medium attachment units, and repeater for 100 Mb/s operation, type 100BASE-T, 1995.
6. IEEE Draft P802.3z/D4, Media access control (MAC) parameters, physical layer, repeater and management parameters for 1000 Mb/s operation, December 1997.
MART L. MOLLE University of California, Riverside
ETHERNET. See LOCAL AREA NETWORKS.
Wiley Encyclopedia of Electrical and Electronics Engineering

Group Communication

Standard Article
P. M. Melliar-Smith and L. E. Moser, University of California, Santa Barbara, CA
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5335
Article Online Posting Date: December 27, 1999
Abstract | Full Text: HTML PDF (172K)
Abstract

The sections in this article are: Formal Models and Definitions; Message Delivery Algorithms; Flow Control Algorithms; Fault Detectors; Membership Algorithms; Future Directions.
GROUP COMMUNICATION

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

In distributed computer systems, several or many computers cooperate to perform an application, and information may be replicated on several computers. Replication of information may be required to provide fault tolerance or increased availability after a fault. Replication may also be used to reduce the time required to access the information by providing local or nearby copies of it. Much of the difficulty of programming distributed applications derives from the need to maintain the consistency of replicated information in the presence of asynchrony and faults. When the replicated information has become inconsistent, the application programming and/or human intervention required to restore consistency can be quite difficult. In general, it is easier to maintain consistency than to restore it.

Inconsistency can arise, for example, if a process holding one replica of the data fails to receive a message and thus does not apply an update to that replica. This is an example of an omission inconsistency, as shown in Fig. 1(a).

Inconsistency can also arise, as shown in Fig. 1(b), if one process creates a new database entry and transmits an instruction in a message broadcast to the other processes that they should also create that entry. A second process receives the message and generates an update for the new entry, communicating the update in a message broadcast to the other processes. If one of the processes holding a replica of the database receives the second message before the first, it may be unable to handle the message. This is an example of a causal ordering inconsistency.

Inconsistency can also arise when two or more processes try to claim a resource, as shown in Fig. 1(c). The requests for the resource are sent in broadcast messages to the multiple processes holding the resources, with the first claimant getting the resource. Multiple processes are needed to manage the resource to ensure continued operation if a process should fail. If those processes receive the messages in different orders, they may grant the resource to different requesters. This is an example of a total ordering inconsistency.

Figure 1. Examples of (a) omission inconsistency, (b) causal ordering inconsistency, and (c) total ordering inconsistency.

Group communication systems provide message ordering and delivery services that assist the application programmer in avoiding these inconsistencies.

• Reliable delivery of messages ensures that all messages broadcast to a group of processes are delivered to all members of the group, thus precluding omission inconsistencies. This also requires that messages are delivered exactly once.
• Causally ordered delivery of messages ensures that, if a message depends on a prior message, then no process receives that message before it receives the prior message. This precludes causal ordering inconsistency.
• Totally ordered delivery of messages ensures that any pair of messages delivered by two or more processes are delivered in the same order by those processes, thus precluding total ordering inconsistency.

These message delivery services are discussed further in the next section, titled ‘‘Formal Models and Definitions,’’ and algorithms for achieving these services are given in the section titled ‘‘Message Delivery Algorithms.’’

Group communication systems also provide membership services that maintain the membership of the group, add new processes to the group, remove departing or faulty processes from the group, and report changes in the group membership to the application. If the network partitions into multiple components with no communication among them, membership algorithms are confronted with conflicting objectives. If the application must maintain a single consistent state of its data, then the membership algorithm must form a single primary membership component. If, however, all processes must continue operation even when disconnected, then the membership algorithm must form multiple disconnected memberships. Group membership algorithms for primary component membership and for partitionable membership are described in the section titled ‘‘Membership Algorithms.’’

Group membership services depend on the detection of faulty processes. Unfortunately, in an asynchronous model of computation, it is impossible to distinguish between a process that has crashed and one that is merely slow.
Consequently, most group communication systems depend on unreliable fault detectors (1) that detect faulty processes but that are unreliable because they might also suspect a slow process to be faulty. Unreliable fault detectors are discussed in the section titled ‘‘Fault Detectors.’’ To enable a new member of the group to participate meaningfully in the group, it is necessary for a process that is already a member of the group to assemble and transfer state information to the new member. If a message has been processed before the state information is transferred, and if that same message is processed by the new process after it receives the state, an incorrect state can result. An incorrect state can also result if a message has not been processed before the state is assembled and transferred, but the new process determines that the message is an old message and thus ignores it. Consequently, processes must agree on which messages are delivered and processed before the membership change so that their effects are included in the transferred state, and which messages are delivered after the membership change so that they are processed by the new process. Virtual synchrony (2), discussed in the section titled ‘‘Formal Models and Definitions,’’ ensures that the processes agree on which messages precede a membership change and which follow it. Group communication systems can achieve high performance by exploiting available broadcast mechanisms at the physical and network layers in local- and wide-area networks.
Implementing broadcasts by multiple point-to-point messages results in poor performance. To achieve high performance, group communication systems must address the issues of buffer management and flow control. When several, or many, processes broadcast messages almost simultaneously, the communication medium and the buffers can quickly become exhausted, resulting in lost messages. Broadcasts to many destinations can also result in large numbers of acknowledgments returning to the source, again exhausting the available resources. The best group communication protocols carefully manage the flow of messages to reduce network contention and ensure that buffers are available at the destinations. Such systems deliver messages reliably to many destinations with performance as good as that of point-to-point protocols operating between a single source and a single destination. Fully integrated group communication protocols (3) address these issues in a unified fashion. Such systems are highly efficient and provide a comprehensive set of services at little or no extra cost. A standard set of services reduces the amount of skill required to exploit the protocol effectively. On the other hand, group communication toolkits (4) provide microprotocols that the user assembles into a protocol that is customized for the application. A custom protocol avoids the costs of services not needed by the application but, instead, incurs the costs of interfaces and mechanisms that are designed to work independently. Furthermore, the user may find it difficult to choose an appropriate set of microprotocols for the particular application.
FORMAL MODELS AND DEFINITIONS The models, definitions, and algorithms described here can be used for sets of processors or for sets of processes executing on multiple processors. In this article, we refer to processes but processors can be substituted throughout. A distributed system is a collection of processes that communicate by messages. In a group communication system, the processes are organized into groups and messages are broadcast to all members of the group. In contrast, messages that are multicast may be sent to some of the processes but not necessarily all of them. We distinguish between the terms receive and deliver. A process receives messages that were broadcast by the processes, and a process delivers messages to the application. Distributed systems can be classified as synchronous or asynchronous. In an asynchronous system, it is not possible to place a bound on the time required for a computation or for the communication of a message. Even though processes may have access to local clocks, those clocks are used only for local activities; they are not synchronized and are not used for global coordination. The advantage of the asynchronous model is that the algorithms can be designed to operate correctly without regard to the number of individual processes or the timing characteristics of the processes or of the communication medium. The disadvantage is that performance characteristics, such as message delivery latency, are necessarily probabilistic, and performance analysis and prediction are difficult. In contrast, the synchronous model requires that processes have access to local clocks that are synchronized across the system with a known bound on the skew between clocks, and
that computation and communication operations complete within specified periods of time. The synchronous model is particularly suited to hard real-time systems and has the advantage that algorithms are deterministic and less complex. The disadvantage is that conservative assumptions are required to approximate synchronous operation in the real world and the resulting system may be inefficient. The timed asynchronous model (5) closely resembles the asynchronous model but includes, in addition, a requirement that eventually there will exist an interval of stability within which computation and communication operations complete successfully within a specified time bound. For this model, the algorithms are generally similar to those for the asynchronous model but are simpler because they need to guarantee termination only within the stability interval. The delay until an interval of stability is reached will, however, probably be longer than the delay until termination of the asynchronous algorithm. The timed asynchronous model trades simpler algorithms against longer termination times.
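The unreliable fault detectors that these models rely on can be approximated by a simple heartbeat timeout, which illustrates why such a detector is inherently unreliable: a slow process is indistinguishable from a crashed one. A minimal sketch, with all class and method names hypothetical:

```python
class HeartbeatFaultDetector:
    """Unreliable fault detector: suspects any process whose last
    heartbeat is older than `timeout` seconds. A merely slow process
    may be suspected even though it has not crashed."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}   # process id -> time of last heartbeat

    def heartbeat(self, pid, now):
        # Called whenever a heartbeat message from `pid` is received.
        self.last_heard[pid] = now

    def suspects(self, now):
        # Processes not heard from within the timeout are suspected.
        return {p for p, t in self.last_heard.items()
                if now - t > self.timeout}

fd = HeartbeatFaultDetector(timeout=2.0)
fd.heartbeat("P1", now=0.0)
fd.heartbeat("P2", now=1.5)
print(fd.suspects(now=3.0))   # P1 exceeded the timeout; P2 has not
```

Shortening the timeout detects crashes faster but increases the chance of wrongly suspecting a slow process; this trade-off is exactly the unreliability discussed in the section titled ‘‘Fault Detectors.’’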
Fault Models

In any distributed system model, a process that is nonfaulty (correct) performs the steps of the algorithms according to the specifications. The behavior of a faulty process depends on the particular fault model adopted. In the fail-stop and crash models, a faulty process takes no actions (i.e., it sends no messages and ignores all messages that it receives). These two models differ in that a fail-stop process reports that it has failed, whereas a crash process does not. Both models typically assume that a faulty process never recovers or, if it recovers, is regarded as a new process. Other models allow a faulty process to be repaired and to be reconfigured into the system. When a process recovers, the process knows its process identifier and can retrieve the data it had written to persistent storage, such as disk, before it failed. In the more general Byzantine fault model, processes can exhibit arbitrary and even malicious behavior, such as generating incorrect messages or sending different messages to different processes that purport to be the same message. In general, it is more difficult to develop algorithms that are resilient to Byzantine faults than to fail-stop or crash faults.

Most fault models admit communication faults in the form of message loss, which is caused by corruption by the medium or buffer overflow in the intermediate switches or at the destinations. Some fault models also admit network partitioning faults, which split the system into two or more components so that processes in the disconnected components cannot communicate with one another. The synchronous model augments this fault model in that any computation or communication operation that does not complete within its specified time bound constitutes a fault. Similarly, any excessive skew between clocks constitutes a fault.

Impossibility Results

The problem of maintaining consistent message delivery and membership in a system subject to faults is related to the problem of achieving consensus in such a system. Fischer, Lynch, and Paterson (6) have shown that it is impossible to devise a deterministic algorithm that can guarantee to achieve consensus in finite time in an asynchronous distributed system, even if communication is reliable and only one process can crash. Chandra and Toueg (1) have shown, however, that consensus is possible in an asynchronous distributed system that is subject to crash faults if an unreliable fault detector is provided. Randomized algorithms can also be used to achieve consensus in an asynchronous distributed system that is subject to faults (7,8).

In practical systems with unreliable communication, even the reliable delivery of a single message cannot be guaranteed. There is also a nonzero probability that all of the processes will fail. Impossibility results should not, however, be regarded as a proof that asynchronous distributed systems cannot be built. Rather, they are a reminder that the algorithms must be robust against unfortunate sequences of asynchronous events and faults, so that message ordering and membership decisions can be reached in a reasonable time with a high probability.

Message Delivery Services
We now consider various message delivery services. The most basic type of message delivery service is unreliable message delivery, which provides a best-effort service with no guarantees of message delivery or of the order of message delivery. Unreliable message delivery is used for applications such as audio and video streaming. Many other applications, however, require one of the more stringent message delivery services, which are described here and shown in Fig. 2.

Reliable Delivery. Reliable message delivery requires that, if any nonfaulty (correct) process receives a message, then all nonfaulty processes eventually receive that message, possibly after multiple retransmissions. Reliable delivery can be achieved only probabilistically in the presence of an unreliable communication medium that repeatedly loses messages.

Source Ordered Delivery. Source ordered delivery, or First-In First-Out (FIFO) delivery, requires that messages from a particular source are delivered in the order in which they were sent. Source ordered delivery is appropriate for multimedia streaming and data distribution applications.

Causally Ordered Delivery. Causally ordered delivery (9) satisfies the following two properties:

• If process P sends message M1 before it sends message M2 and process Q delivers both messages, then Q delivers M1 before it delivers M2.
• If process P receives message M1 before it sends message M2 and process Q delivers both messages, then Q delivers M1 before it delivers M2.

Taking the transitive closure of this ‘‘delivers before’’ relation yields a partial order on messages. A partial order, denoted by ≺, satisfies the following properties:

• Antireflexive: M ⊀ M.
• Antisymmetric: If M1 ≺ M2, then M2 ⊀ M1.
• Transitive: If M1 ≺ M2 and M2 ≺ M3, then M1 ≺ M3.

Delivery of messages in causal order precludes causal ordering inconsistency and prevents anomalies in the processing of
data contained in the messages, but it does not alone suffice to maintain the consistency of replicated data.

Figure 2. Examples of four types of message delivery services: unreliable source ordered delivery, reliable source ordered delivery, reliable causally ordered delivery, and reliable group ordered delivery.

Group Ordered Delivery. Group ordered delivery requires that, if processes P and Q are members of a group G, then P and Q deliver the messages originated by the processes in G in the same total order. A reliable group ordered message delivery service helps to maintain the consistency of replicated data, but inconsistencies can still arise in interactions between groups.

Totally Ordered Delivery. Totally ordered delivery, also called atomic delivery, subsumes partially ordered delivery but requires in addition the property:

• Comparable: M1 ≺ M2 or M2 ≺ M1.

Thus, totally ordered delivery requires that, if process P delivers message M1 before it delivers message M2 and another process Q delivers both messages, then Q delivers M1 before it delivers M2. If P and Q are both members of the same process group and M1 and M2 are both sent to that group, then group ordered delivery would have sufficed to ensure this property. Typical applications, however, contain hundreds of process groups, and many messages are sent to multiple groups. Totally ordered delivery precludes total ordering inconsistency and is important where systemwide consistency across many
groups is required. Thus, more generally, totally ordered delivery implies that there are no cycles in the total order, because the total order is a partial order.

Message Stability. A message is stable at a process when that process has determined that all of the other processes in its current membership have received the message. This determination is typically based on acknowledgments of messages. When a process determines that a message has become stable, it can reclaim the buffer space used by the message because it will never need to retransmit that message again. The concept of stability of messages is quite distinct from stability of data, which requires that the data have been written to nonvolatile storage, such as a disk. Delivery of messages only when the messages have become stable is useful, for example, in transaction-processing systems where a transaction must be committed by all of the processes or none of them.

Membership Services

Maintaining the membership of the groups is an important part of group communication systems because algorithms may block if processes are faulty or cannot communicate with one another. An unreliable fault detector can be used to detect apparently faulty processes, to trigger the membership algorithm, and to ensure that the algorithm satisfies liveness requirements (i.e., that decisions will be made and that the system will continue to make progress). The membership algorithm removes apparently faulty processes from the membership and adds new or recovered processes into the membership.

Different group communication systems have adopted different formulations of the membership problem. In the primary component model (2), for each process group a single sequence of memberships must be maintained across the distributed system over time, as shown at the left of Fig. 3. In this model, the membership algorithm, upon successive invocations, yields a sequence of memberships over time. In contrast, the partitionable membership model (10) allows multiple disjoint memberships to exist concurrently.

Figure 3. The primary component membership model allows only a single component of a partitioned system to continue to operate, whereas the partitionable membership model allows continued operation in all components.

At opposite ends of the spectrum are two approaches to the partitionable membership problem: the maximal memberships approach and the disjoint memberships approach. In the maximal memberships approach, the memberships are precisely the maximal cliques, and a nonfaulty process may belong to several (perhaps many) concurrent memberships. In the disjoint memberships approach, concurrent memberships do not intersect (i.e., each nonfaulty process is a member of exactly one membership at a time, and any pair of processes in a membership can communicate). Thus, each membership is a clique, although not necessarily a maximal clique. The partitionable membership model with disjoint memberships is shown at the right of Fig. 3. Neither the disjoint memberships approach nor the maximal cliques approach is ideal, but it is not obvious how an intermediate approach would define the collection of memberships.

An algorithm that solves these membership problems must ensure that the processes in a membership reach agreement on the membership in a finite amount of time.
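The clique requirement of the disjoint memberships approach reduces to a simple pairwise check over a proposed membership. A minimal sketch, with the connectivity representation chosen for illustration:

```python
def is_valid_membership(members, can_communicate):
    """A membership is valid under the disjoint-memberships approach
    only if it is a clique: every pair of members can communicate.
    `can_communicate` is a symmetric predicate on process ids."""
    members = list(members)
    return all(can_communicate(p, q)
               for i, p in enumerate(members)
               for q in members[i + 1:])

# Partitioned network: P1 and P2 can talk; P3 is disconnected.
links = {("P1", "P2"), ("P2", "P1")}
reachable = lambda p, q: (p, q) in links
print(is_valid_membership({"P1", "P2"}, reachable))        # True
print(is_valid_membership({"P1", "P2", "P3"}, reachable))  # False
```

A membership algorithm would use such a check to reject candidate memberships that span a partition; finding *maximal* cliques, as the other approach requires, is a much harder combinatorial problem.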
The algorithm should also ensure that faulty processes are eventually removed from the membership and that nonfaulty processes are not removed capriciously so that a trivial membership is not installed when a larger membership could have been installed. Thus, the membership algorithms must ensure the following properties:
• Agreement: All processes in the membership agree on the membership set.
• Termination: A new membership must be formed within a finite amount of time.
• Nontriviality: The agreed upon membership should be appropriate, and nondegenerate if possible. The appropriateness of the membership may need to be determined by heuristics.

For the partitionable membership problem, several existing specifications admit algorithms that yield degenerate memberships by partitioning the membership into singletons even when larger memberships would have been possible. Specification of the partitionable membership problem is an open research topic.

Virtual Synchrony and Extended Virtual Synchrony

Virtual synchrony (2) ensures that view (configuration) changes occur at the same point in the message delivery history for all operational processes, as shown in Fig. 4. Processes that are members of two successive views must deliver exactly the same set of messages in the first view. A failed process that recovers can be readmitted to the system only as a new process. Thus, failed processes are not constrained as to the messages they deliver or their order, and messages delivered by a failed process have no effect on the system. If the system partitions, only processes in one component, the primary component, continue to operate; all of the other processes are deemed to have failed.

Extended virtual synchrony (11) extends the concept of virtual synchrony to systems in which all components of a partitioned system continue to operate and can subsequently remerge, and to systems in which failed processes can be repaired and can rejoin the system with stable storage intact. Two processes may deliver different sets of messages, when one of them has failed or when they are members of different components, but they must not deliver messages inconsistently.
In particular, if process P delivers message M1 before P delivers message M2, then process Q must not deliver message M2 before Q delivers message M1, even if the system has partitioned and P and Q can no longer communicate. Extended virtual synchrony eliminates gratuitous inconsistencies between processes that become disconnected by a partitioning fault. Interestingly, extended virtual synchrony can be guaranteed only if messages are born ordered, meaning that the relative order of any two messages is determined directly from the messages, as broadcast by their sources.

Figure 4. When a membership change brings a new process into the group, the current state of the existing members must be transferred to the new process. Virtual synchrony ensures that all processes agree on which messages precede the transfer of state to a new process and which follow that transfer.

MESSAGE DELIVERY ALGORITHMS

We now consider algorithms that provide different types of message delivery, as defined in the section titled ‘‘Formal Models and Definitions.’’

Reliable Delivery Algorithms

Reliable delivery algorithms typically depend on underlying physical or network layer broadcast or multicast mechanisms that provide only an unreliable best-effort service in which messages may be lost. Algorithms that provide a reliable delivery service aim to ensure that every message is delivered to all of the intended destinations. Error detection and retransmission are most often used to provide reliable delivery.

Traditional broadcast and multicast algorithms exploit a positive acknowledgment strategy to provide reliable delivery. On receipt of a message, a destination transmits a positive acknowledgment to the source. The source retransmits the message repeatedly until it has received a positive acknowledgment from every destination. Positive acknowledgment algorithms are effective in improving reliability, but they suffer from two problems. First, large numbers of acknowledgments must be transmitted, even when the underlying mechanisms are quite reliable and few messages need to be retransmitted. Second, if there are many destinations, the source must receive and process many acknowledgments for each message that it transmits, resulting in substantial processing overhead, as shown in Fig. 5. This is called the ack implosion problem.

Figure 5. If a process transmits a message to many destinations, it may suffer from an implosion of acknowledgments.

Consequently, most reliable broadcast and multicast algorithms use negative acknowledgments to achieve reliable delivery. The source transmits messages with sequence numbers. Destinations detect missing messages by gaps in the sequence numbers and transmit, to the source, negative acknowledgments that list the missing messages.
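Gap detection at a receiver can be sketched as follows; this is a minimal model of one source-to-destination flow, with the message representation chosen for illustration:

```python
class NackReceiver:
    """Detects missing messages by gaps in per-source sequence numbers
    and reports them as a negative acknowledgment (a list of missing
    sequence numbers), as in NACK-based reliable delivery."""

    def __init__(self):
        self.next_expected = 0   # next in-order sequence number
        self.out_of_order = {}   # buffered messages awaiting a gap fill
        self.delivered = []

    def receive(self, seq, payload):
        # Returns the list of sequence numbers to NACK (may be empty).
        if seq < self.next_expected:
            return []            # duplicate or already-delivered copy
        self.out_of_order[seq] = payload
        # Deliver any now-contiguous prefix to the application.
        while self.next_expected in self.out_of_order:
            self.delivered.append(self.out_of_order.pop(self.next_expected))
            self.next_expected += 1
        # Anything between next_expected and the highest buffered
        # sequence number is missing and must be requested.
        highest = max(self.out_of_order, default=self.next_expected - 1)
        return [s for s in range(self.next_expected, highest + 1)
                if s not in self.out_of_order]

r = NackReceiver()
r.receive(0, "m0")          # delivered immediately, nothing missing
nack = r.receive(3, "m3")   # gap: messages 1 and 2 were lost
print(nack)                 # [1, 2]
```

Note that the receiver buffers message 3 rather than discarding it, so a single retransmission of the missing messages restores in-order delivery.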
On receipt of a negative acknowledgment, the source retransmits the requested messages. Negative acknowledgment algorithms can
be used to achieve reliable delivery, even though they use fewer acknowledgment messages. Reliable delivery algorithms based on negative acknowledgments suffer from two problems. First, if a message is not delivered to several destinations (e.g., because it was lost as a result of buffer overflow at an intermediate switch), all of those destinations will transmit a negative acknowledgment when one would have sufficed. This is called the nack implosion problem. As shown in Fig. 6, this waste can be reduced if a destination suppresses its own negative acknowledgment if it has received a negative acknowledgment that some other destination transmitted (12,13). Suppression of negative acknowledgments is combined with a carefully chosen delay before the negative acknowledgment is transmitted to minimize the probability that multiple negative acknowledgments are transmitted. The second problem with negative acknowledgments is that they provide no indication to the source that all, or even any, destinations have received the messages. To ensure that the source can retransmit any message for which it might receive a negative acknowledgment, the source would need to retain every message indefinitely. Consequently, negative acknowledgments are typically used in combination with positive acknowledgments. The positive acknowledgments confirm that messages have been received by every destination and will not subsequently need to be retransmitted, thus allowing the source to recover the buffer space used by those messages. The use of acknowledgments and retransmissions is ineffective for synchronous systems because it introduces arbitrary delays into the delivery of messages, delays that might exceed the specified bounds. Consequently, in many synchronous designs, processes transmit messages multiple times, typically over multiple communication paths and possibly multiple times on each path. 
With a proper design and a high-quality communication medium, the probability that no copy of the message reaches the destination is negligible (14).

Causal Order Algorithms

To determine causal dependencies between messages and to deliver messages in causal order, additional information must be included in the messages to indicate their causal predecessors. A naive strategy would require a process to include in every message it transmits a list of all messages it has received since the previous message it transmitted.
Figure 6. Excessive numbers of negative acknowledgments and retransmissions can be avoided if each process suppresses its negative acknowledgment or retransmission on receiving a similar transmission from another process.
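The suppression scheme of Fig. 6 can be sketched as follows. This is an illustrative sketch only, not the algorithm of Refs. 12 or 13; the class and method names (NackSuppressor, detect_gap, and so on) are invented for the example.

```python
import random

class NackSuppressor:
    """Sketch of negative-acknowledgment suppression. On detecting a missing
    message, a destination waits a random delay before multicasting its nack;
    if it overhears another destination's nack for the same message first,
    it suppresses its own."""

    def __init__(self, max_delay=0.5):
        self.max_delay = max_delay   # upper bound on the random wait (seconds)
        self.pending = {}            # msg_id -> scheduled nack time

    def detect_gap(self, msg_id, now):
        # Schedule a nack at a uniformly random point within max_delay.
        self.pending[msg_id] = now + random.uniform(0.0, self.max_delay)

    def overheard_nack(self, msg_id):
        # Another destination already requested retransmission: suppress ours.
        self.pending.pop(msg_id, None)

    def due_nacks(self, now):
        # Return nacks whose timers expired without being suppressed.
        due = [m for m, t in self.pending.items() if t <= now]
        for m in due:
            del self.pending[m]
        return due
```

The random delay spreads the destinations' timers out so that, with high probability, only the first expiring nack is transmitted and the others are suppressed.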
GROUP COMMUNICATION
Figure 7. The transitivity of acknowledgments, piggybacked on regular messages, can be used to derive a causal order while requiring little additional information to be transmitted.
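The transitive-acknowledgment idea of Fig. 7 can be illustrated with a small sketch, assuming each message carries only the identifiers it directly acknowledges; the function name and data layout are invented for illustration.

```python
def causal_predecessors(acks, msg):
    """Derive the causal predecessors of `msg` from per-message
    acknowledgment lists: acks[m] holds the messages that m directly
    acknowledges, and transitivity yields the full set."""
    seen, stack = set(), list(acks.get(msg, []))
    while stack:
        m = stack.pop()
        if m not in seen:
            seen.add(m)
            stack.extend(acks.get(m, []))  # follow acknowledgments transitively
    return seen

# The scenario of Fig. 7: M3 acknowledges M1 and M2; M4 acknowledges M3;
# M5 acknowledges M4.
acks = {"M3": ["M1", "M2"], "M4": ["M3"], "M5": ["M4"]}
```

Here M5's single acknowledgment of M4 implicitly acknowledges M1, M2, and M3 as well, which is exactly the saving that piggybacked acknowledgments provide.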
A more sophisticated and efficient algorithm exploits transitivity of positive acknowledgments (15,16). As shown in Fig. 7, message M3 transmitted by process P1 contains positive acknowledgments of messages M1 and M2. If process P2 now transmits message M4 containing a positive acknowledgment of M3, P2’s message also implicitly acknowledges messages M1 and M2 and indicates that M4 causally follows M1, M2 and M3. If process P3 has received messages M1, M3 and M4 but has not received message M2, then P3 transmits message M5 containing a positive acknowledgment of M4 and a negative acknowledgment of M2. The positive acknowledgment of M4 implicitly acknowledges M1 and M3 and indicates that M5 causally follows M1, M3 and M4. The negative acknowledgment of M2 serves to trigger a retransmission of M2 so that M2 can be delivered before M5. Because maintaining the graph structure used by this strategy to determine the causal dependencies is computationally expensive, a variation (17) on this strategy requires a process to receive all of the predecessors of a message before it issues a positive acknowledgment of the message. Thus, the positive acknowledgments directly yield the causal dependencies, whereas the negative acknowledgments trigger retransmissions. This reduces the computational cost of deriving the causal dependencies, with a small cost in increased latency. Another strategy commonly used to determine a causal order on messages exploits a vector clock (2). Each process maintains a local clock that can be either a real-time clock or a logical Lamport clock (9). A process maintains a logical Lamport clock as follows. When the process receives a message, it compares its local clock with the timestamp in the message. If the value of the message timestamp is greater than the value of its local clock, the process advances its local clock to match the timestamp. 
When the process transmits a message, it first increments its local clock and then uses that value to timestamp the message. As shown in Fig. 8, each process also maintains a local vector clock that contains one entry in the vector for each process in the group. The process’s own entry in the vector is its own Lamport clock. When a process transmits a message, it includes the vector clock in the message as a vector timestamp. When a process receives a message, it compares every entry in the message’s vector timestamp with the corresponding entry in its local vector clock. If a value in the message’s vector timestamp is greater than the corresponding value in its local clock, the process advances the entry in its local clock to the corresponding value in the message.

To determine the causal order between two messages, a process compares the corresponding entries in the vector timestamps of the messages. If every entry in one message’s vector timestamp is greater than or equal to the corresponding entry in the other message’s vector timestamp, then that message causally follows the other message. If both vector timestamps contain an entry that is greater than the corresponding entry in the other message’s vector timestamp, then the two messages are concurrent and neither follows the other. The vector clock strategy is effective only if the number of processes is small. As the number of processes increases, the transmission cost for the vector timestamp, and the computational cost of maintaining the vector clock, increase proportionately.

Some group communication systems are based entirely on a causal order on the messages (18). Other group communication systems do not construct a causal order on the messages but rather impose a total order directly, where the total order satisfies the causal order requirement. In general, such total order algorithms are as efficient as causally ordered algorithms within a local area but incur higher latency over wide areas. Most synchronous systems do not construct explicit causal or total orders on messages during system operation. Rather, any causal or total order dependencies are considered in advance during the design of the system and the development of the preplanned schedule of operations (14).

Total Order Algorithms

Total order algorithms can be classified as symmetric or asymmetric, depending on whether all processes play the same role
Figure 8. Vector clocks, maintained by the processes and included in each message, allow the causal order to be derived.
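The vector timestamp comparison and merge rules described above can be sketched directly; this is a minimal illustration, with invented function names, assuming fixed-length vectors with one entry per process.

```python
def vt_leq(a, b):
    """a <= b componentwise: a happened at or before b."""
    return all(x <= y for x, y in zip(a, b))

def compare(a, b):
    """Classify two vector timestamps as 'before', 'after',
    'concurrent', or 'equal'."""
    if a == b:
        return "equal"
    if vt_leq(a, b):
        return "before"      # every entry of a <= b: a causally precedes b
    if vt_leq(b, a):
        return "after"
    return "concurrent"      # each has an entry exceeding the other

def merge(local, incoming):
    """On receipt of a message, advance each local entry to the maximum
    of the local and incoming values."""
    return [max(x, y) for x, y in zip(local, incoming)]
```

The "concurrent" case is the one highlighted in Fig. 8: each timestamp contains an entry greater than the corresponding entry in the other, so neither message follows the other.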
Figure 9. Acknowledgment messages, broadcast by the sequencer process, impose the total order on messages broadcast by the other processes.
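The scheme of Fig. 9 can be sketched as follows; all class and method names are invented for illustration. Originators broadcast their own messages, the sequencer broadcasts ordering acknowledgments, and a receiver delivers a message only once it holds both the message and its position in the total order.

```python
class Sequencer:
    """Collects identifiers of broadcast messages and periodically emits
    an acknowledgment message listing them in the total order."""
    def __init__(self):
        self.unordered = []
    def receive(self, msg_id):
        self.unordered.append(msg_id)
    def make_ack(self):
        ack, self.unordered = list(self.unordered), []
        return ack              # the total order for this batch

class Receiver:
    """Delivers a message only when both the message itself and the
    sequencer's ordering acknowledgment have arrived."""
    def __init__(self):
        self.have = set()       # messages received so far
        self.order = []         # ordering information not yet consumed
        self.delivered = []
    def receive_msg(self, msg_id):
        self.have.add(msg_id)
        self._try_deliver()
    def receive_ack(self, ack):
        self.order.extend(ack)
        self._try_deliver()
    def _try_deliver(self):
        # Deliver strictly in the sequencer's order; a missing message
        # blocks delivery until it is received (or retransmitted).
        while self.order and self.order[0] in self.have:
            self.delivered.append(self.order.pop(0))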
or some processes are distinguished from others. Typical asymmetric algorithms are sequencer algorithms in which one process determines the ordering of messages broadcast by the other processes, and also token algorithms in which a process can broadcast only when it holds a token that rotates through the set of processes. Asymmetric algorithms are quite efficient, but they are vulnerable to a single point of failure. Algorithms based on timestamping messages are more symmetric and are highly efficient, but they may exhibit high latency in wide-area networks. Intermediate between the symmetric and asymmetric algorithms are the hybrid algorithms in which a central core of processes executes a symmetric total order algorithm and other processes transmit their own messages to one of the core processes for ordering and broadcasting.

None of the preceding algorithms is fault-tolerant. When a process becomes faulty, the total order algorithm blocks temporarily until the membership algorithm has detected and diagnosed the fault and has formed a new membership that excludes the faulty process. A completely different class of algorithms contains fault-tolerant total ordering algorithms, based on voting (15,19). Such algorithms continue to order messages even though some processes are faulty. They are, however, quite sophisticated and computationally expensive and, thus, have not been widely used.

Sequencer Algorithms.

In the sequencer algorithms, one process is responsible for determining the total order on messages broadcast by all processes. In the Amoeba system (20), every process transmits its messages over a point-to-point connection to the sequencer process. The sequencer then determines the message order and broadcasts the messages. In alternative sequencer algorithms (21–24), shown in Fig. 9, the originators of the messages broadcast their messages.
The messages are received by the sequencer, which then determines the total order and broadcasts an acknowledgment message that lists the various broadcast messages in the total order. Other processes cannot deliver a message until they have received both the message and the acknowledgment message from the sequencer containing the ordering information. If a process does not find the message it broadcast listed
in the acknowledgment message, it rebroadcasts the message. If a process finds a message listed in the acknowledgment message but has not received the message, it requests a retransmission with a negative acknowledgment. If the sequencer broadcasts the messages, each message is transmitted twice, which increases the load on the communication medium, and the sequencer may become a bottleneck. If the originator broadcasts its own messages, the load on the communication medium and on the sequencer is reduced, but the processes then receive two transmissions, the broadcast message and the acknowledgment message from the sequencer, which increases the load on the processes.

Because the sequencer is a single point of failure and a processing bottleneck, and also to avoid the need for positive acknowledgments, most sequencer algorithms rotate the responsibility for sequencing through the processes in the group. In Ref. (21), the acknowledgment message not only orders the message but also transfers the responsibility for sequencing the next batch of messages to the next process. If the sequencer process fails, this rotation stops, as does the delivery of messages, until the membership algorithm has removed the faulty process from the membership.

Token Algorithms

Another strategy for totally ordering messages exploits a token rotating around a logical ring (3,25–27). Only the holder of the token can broadcast messages. The token contains a sequence number that is incremented every time a message is broadcast, which imposes a total order on the messages broadcast in the group, as shown in Fig. 10. The token also contains additional information, including positive and negative acknowledgments and also flow control information. To avoid the overhead of circulating the token when there are no messages to broadcast, the algorithm may contain a mechanism for stopping and restarting the circulation of the token.
Token algorithms can be very efficient for small groups with a heavy load of message communication. The processing required is simple and the flow control information in the token is effective for ensuring that the buffers at the destinations do not overflow, even under high load. For large groups under low load, the delay waiting for the token to arrive causes a higher latency than for a sequencer algorithm. This
Figure 10. In a token algorithm, the token contains a sequence number that is incremented for every message broadcast, imposing a total order on the messages broadcast in the group.
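The core of the token mechanism of Fig. 10 is very small and can be sketched as follows; the names are invented for illustration, and the real token would also carry acknowledgment and flow control fields.

```python
class Token:
    """Sketch of the token: it carries a sequence number that is
    incremented for every message broadcast."""
    def __init__(self):
        self.seq = 0

def broadcast_with_token(token, pending, log):
    """The token holder stamps and 'broadcasts' (here, appends to a log)
    its pending messages, then implicitly releases the token to the next
    process on the logical ring."""
    for msg in pending:
        token.seq += 1
        log.append((token.seq, msg))
```

Because every broadcast increments the single shared sequence number, the sequence numbers in the log form the total order on all messages in the group, regardless of which process broadcast them.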
latency can be minimized by passing the token only to the processes that have messages to broadcast and that have requested the token (4,28). If, however, the token does not visit all processes, alternative arrangements must be made for collecting the positive acknowledgments that must be obtained from all of the processes in order to manage buffer space efficiently. A faulty process causes loss of the token and stops message transmission until a membership algorithm has removed the faulty process from the membership. On the other hand, the continuously circulating token allows rapid detection of faulty processes.

Timestamp Algorithms.

An elegant strategy for totally ordering messages involves timestamping the messages and delivering them in timestamp order (3,29). The timestamps can be derived from either a logical Lamport clock or, alternatively, from synchronized physical clocks. In order that a process can deliver the message with the lowest timestamp, it must know that it will not subsequently receive a message with a lower timestamp from any other process in the group. This can be guaranteed if it receives the messages in reliable FIFO order and if it has received a message with a higher timestamp from every other process in the group. Some processes may need to transmit null messages to ensure that a message from them is always available, to allow messages from other processes to be delivered promptly.

Timestamp algorithms involve simple program code and, consequently, can be very efficient. They also have the advantage that messages can be ordered within small groups with the confidence that the local total and causal order is consistent with a system-wide total and causal order, precluding subtle ordering anomalies. The disadvantage of timestamp algorithms, particularly in large groups where many processes transmit infrequently, is that large numbers of null messages may be required.
Algorithms have been devised to combine null messages from many processes, thereby reducing their number. As in the sequencer and token algorithms, a faulty process causes message ordering to stop until the membership algorithm has removed the faulty process from the membership.

Hybrid Algorithms.

Hybrid algorithms (30,31) for total ordering messages provide efficient operation in large systems where many processes have messages to transmit only occasionally. Certain processes, typically those with high transmission rates and also high bandwidth communication links, are designated to be core processes, as shown in Fig. 11. The
core processes broadcast and deliver messages using one of the other total ordering algorithms. Other processes transmit their messages, point-to-point, to any core process, which then orders and broadcasts those messages. Effective operation of a hybrid algorithm depends on an appropriate choice of processes for the core. Algorithms have been developed to determine that choice dynamically, adding processes to the core, or removing them, as their message transmission rates change.

Hybrid algorithms are particularly important when the group size is large (thousands of processes) but only a few processes transmit frequently, as may occur in Internet applications. If, however, the listen-only processes require reliable delivery of messages, they must still transmit positive and negative acknowledgments, and care is required to avoid ack implosion (13).

Voting Algorithms.

The voting algorithms that produce a total order on messages (15,19) are completely different from the algorithms described earlier. They start from a causal order derived from acknowledgments, as is shown in Fig. 7. Candidate messages that have not yet been ordered but do not follow any other unordered message are selected. Such messages are candidates for immediate advancement into the total order. A voting strategy is used in which messages vote for messages that precede them in the causal order but that have not yet been advanced to the total order. If the causal order is narrow with few concurrent messages, so that it is almost a total order, the voting algorithm is likely to terminate in the first round. If the causal order is broad, several rounds of voting may be required, and termination of the voting algorithm depends on randomness properties of the causal order. Unfortunately, space does not permit a full description of the rather subtle voting algorithm or of the intricate proof of correctness (19).
The most interesting feature of the voting algorithms is that, unlike other total ordering algorithms, they are fault-tolerant and do not stop ordering messages in the presence of a faulty process. The absence of a hiatus in ordering messages is important for some applications. Moreover, unlike the other total ordering algorithms, the membership algorithm can be mounted above the total ordering algorithm (32), which allows the membership algorithm to be simpler and more robust. The disadvantage of the voting algorithm is its computational cost. The complexity of the algorithm is also a disadvantage because few developers want to use an algorithm if they do not understand why it works.

FLOW CONTROL ALGORITHMS
Figure 11. In a hybrid message-ordering algorithm, the core processes order and broadcast messages sent to them by the other processes.
Group communication systems incur particularly severe flow control problems because, to achieve high performance, any one process must be able to transmit messages up to the capacity of the network and of the destinations. If, however, several processes transmit messages simultaneously at that rate, saturation of the communication medium can occur, resulting in message loss and retransmission. Moreover, several senders can transmit messages substantially faster than any destination can handle them. This causes messages to accumulate in the input buffers at the destinations until they
overflow and message loss occurs. In a local area, the high bandwidth of the communication medium may allow even a single sender to overwhelm the destinations. In a wide area, the critical resource is the available bandwidth of the network, which is often much lower than in a local area and is potentially highly variable because of contention with unrelated traffic. Experience demonstrates that message loss in modern communication networks is caused mainly by flow control and buffering problems. The most effective flow control algorithms currently available for a local area are those used by token-based protocols (3,26). Only one process can broadcast or multicast at a time and the token carries flow control information from one process to the next around the ring. If the number of messages transmitted in one token rotation is restricted to the buffer capacity of the receivers, and if each process empties its buffer before releasing the token, buffer overflow is avoided. The token also carries information about the backlog of messages that could not be sent because of flow control, ensuring that all processes receive a fair share of the medium. Sequencer and timestamp algorithms use a window flow control strategy in the style of that used by TCP/IP (33,34). When a process broadcasts a message, it reduces the remaining window space, restoring the window space when it has received acknowledgments for that message from all members of the group (and, thus, no longer needs to buffer the message for possible retransmission). If each process in a group is provided with its own window then, given finite resources, those windows must be smaller than what would have been possible had the processes shared a window. Thus, the transmission rate of a process will be restricted because some of the resources have been allocated for other processes. 
If all processes share a window, then a process must reduce the space in the window for each message it receives as well as for each message it transmits, again restoring the window space when it has received acknowledgments from all members of the group. However, with a shared window and without control over multiple concurrent transmissions, several processes may transmit messages that attempt to utilize the same residual window space, leading to buffer overflow and message loss. For wide-area group communication systems operating over the Internet, window flow control is essential to achieve good performance. Internet switches use a flow control strategy, Random Early Drop (RED) (35), closely matched to TCP/IP. In contrast, wide-area group communication systems operating over ATM must accommodate the rate-based quality of service mechanisms of ATM (34), defined for each transmitter separately. In both cases, the relatively long delay until acknowledgments are received, which is inevitable in wide-area networks, can severely degrade the performance.
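The shared-window bookkeeping described above can be sketched as follows. This is an illustrative fragment with invented names, not any particular system's flow control: window space is consumed for every message transmitted or received, and restored only when the message has been acknowledged by every member of the group.

```python
class SharedWindow:
    """Sketch of shared-window flow control for a group."""
    def __init__(self, size, group):
        self.space = size                # remaining window space
        self.group = set(group)
        self.pending = {}                # msg_id -> members yet to ack

    def can_send(self):
        return self.space > 0

    def on_message(self, msg_id):
        # A message sent or received consumes window space until it has
        # been acknowledged by all members.
        self.space -= 1
        self.pending[msg_id] = set(self.group)

    def on_ack(self, msg_id, member):
        waiting = self.pending.get(msg_id)
        if waiting is not None:
            waiting.discard(member)
            if not waiting:              # all members have acked: free space
                del self.pending[msg_id]
                self.space += 1
```

As the text notes, without coordination several processes may try to consume the same residual window space concurrently; a real protocol must prevent that, for example by ordering transmissions, which this sketch does not attempt.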
FAULT DETECTORS

A fault detector is a distributed algorithm such that each process has a local fault detector module that reports the processes it currently suspects as being faulty (1). For fail-stop and crash faults, fault detectors are typically based on timeouts that are local to the process, with no communication between processes. If a process has not received a message from another process within a certain period of time, its fault detector adds that process to the list of those suspected of being faulty. This includes processes that have not acknowledged receipt of a message within a reasonable amount of time. Failure to acknowledge receipt of a message forces other processes to retain the message in their buffers for possible retransmission and could exhaust that buffer space, causing the system to stop. For Byzantine faults, fault detectors must rely on costly techniques such as reliable broadcast or diffusion algorithms and message signatures. Even in models that admit only fail-stop and crash faults, fault detectors are inherently unreliable because processes that are nonfaulty but excessively slow or processes that fail to receive a message an excessive number of times may be suspected, whereas processes that are faulty may not be suspected immediately.
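A timeout-based local fault detector of the kind described above can be sketched in a few lines; the class name and interface are invented for illustration, and the unreliability noted in the text is visible here, since a merely slow process is suspected exactly as a crashed one would be.

```python
class TimeoutFaultDetector:
    """Sketch of a local, timeout-based fault detector: a peer is suspected
    if nothing has been heard from it within `timeout` time units."""
    def __init__(self, peers, timeout):
        self.timeout = timeout
        self.last_heard = {p: 0.0 for p in peers}  # time of last message

    def heard_from(self, peer, now):
        # Any message (data or acknowledgment) refreshes the peer's timer.
        self.last_heard[peer] = now

    def suspects(self, now):
        # Peers silent for longer than the timeout are (unreliably) suspected.
        return {p for p, t in self.last_heard.items()
                if now - t > self.timeout}
```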
MEMBERSHIP ALGORITHMS

The two types of membership, primary component membership and partitionable membership, shown in Fig. 3, satisfy different application objectives. A primary component membership is most useful when the application must maintain a single consistent state for its data in the primary component, at the cost of suspending the operation of processes in the nonprimary components, for example, in banking. A partitionable membership is appropriate when all processes must continue operation, with the cost of reconciling inconsistent data when communication is reestablished between disconnected components, for example, in industrial control.

Algorithms exist for both types of membership (10,26,32,36–39), and significant problems exist for both (40,41). For primary component membership, it is possible that no membership satisfies the requirements for being the primary membership (such as a majority of the processes in the group). In practice, however, membership algorithms almost always find primary components quite quickly. For partitionable membership, the algorithm may form a trivial or inappropriate membership, such as allowing every process to form an isolated singleton membership. In practice, however, partitionable membership algorithms do not choose such memberships in preference to other more appropriate memberships.

Robust membership algorithms are difficult to program because they must operate under uncertain conditions and must handle additional faults that occur during their operation. Implementation details, such as the relative lengths of timeouts, are very important for robust operation and depend on the underlying platform on which the algorithms operate. We provide next a broad outline of the strategies used by typical membership algorithms. More details can be found in Refs. 26 and 36.

Typical membership algorithms involve four phases—initiation, discovery, agreement, and recovery—as shown in Fig. 12.
Initiation of the membership algorithm may result from an explicit request by a process to join or leave the group, a suspicion by a fault detector, or reception of a message from a foreign process (not in this membership but in a concurrent membership within a partitioned system) after remerging of a partitioned system. In the discovery phase, all processes broadcast messages inviting responses from other processes. Each such process broadcasts responses that enumerate all processes from
Figure 12. The four phases of a membership algorithm.
which they have received messages, the known set, and all processes that they suspect as having failed, the fail set. On receipt of such a message, a process merges all of the processes in the known set of the message into its own known set. Similarly, a process merges all of the processes in the fail set of the message into its own fail set. If a process has not received a response from a process in its known set within a timeout, it also adds that process to its fail set. The discovery phase ends either by a timeout or by agreement on a membership. The discovery phase can, however, be reentered at any time if further processes are suspected, if agreement cannot be reached, or if one or more processes do not install the new membership. The agreement phase seeks to find a set of processes such that every process in that set agrees that its proposed new membership is that set. The proposed membership is typically the difference of the known set and the fail set. If the processes are only partially connected, so that some processes cannot communicate with other processes, a heuristic algorithm may be used to choose an appropriate membership. For a primary component membership, the proposed membership must also satisfy some size constraint or other criterion for being a primary component. The agreement is then confirmed, typically by some variation of two-phase commit. The proposer, usually the process having the lowest identifier, broadcasts a proposal message. The other members then respond with a commit message. The proposer then broadcasts an install message to begin the recovery phase and install the new membership. If any process rejects the membership or does not respond, the proposer returns to the discovery phase. Similarly, if a process does not receive a propose or install message, it returns to the discovery phase. In the recovery phase, the processes first complete the delivery of messages from the old membership and then install the new membership. 
The processes in the new membership and the same old membership exchange information regarding messages that they have received from the old membership and then retransmit those messages so that all members have them. The messages are then ordered and delivered to ensure virtual synchrony. When all messages of the old membership are delivered, the algorithm delivers a membership change message announcing the membership change, enumerating the new membership, and starting normal operation with the new membership. Even before the delivery of some of the messages of the old membership, the membership algorithm may have delivered additional membership change messages reporting the loss of processes. Such additional messages are necessary to achieve extended virtual synchrony. If a process determines that any member of its new membership has returned to the discovery phase, it first completes its installation of the new membership and then reinvokes the membership algorithm.

For a primary component membership algorithm, it is essential that only a single sequence of memberships exists over time. If the proposer becomes faulty at a critical moment, it may be impossible for the remaining processes to determine whether the proposer has installed the new membership. The system must then stop until the proposer has recovered. This risk of a hiatus can be reduced, but not eliminated, by using three-phase commit in place of two-phase commit.

For a partitionable membership algorithm, termination is easy to demonstrate. The known set and the fail set are monotonically increasing and bounded above by the finite number of potential members. Each attempt to form a membership can be defeated by a process that was previously unknown, causing an increase in the known set, or by a process that does not respond, causing an increase in the fail set. Because both sets increase monotonically and are bounded above by the set of potential members, the algorithm terminates, possibly in a singleton membership containing only the process itself.

Membership algorithms for operation on top of a fault-tolerant total ordering algorithm (32) are often simpler and more elegant than the algorithm outlined earlier. This simplicity comes at the expense of greater complexity in the fault-tolerant total ordering algorithm.
For synchronous systems, membership algorithms are much simpler than for asynchronous systems (14,42). Typically, at the end of each prescheduled sequence of message exchanges, each process reports the set of processes from which it received messages during that sequence. This set of processes constitutes its proposed membership. If a process receives a membership that differs from its own, it can choose either to exclude that process from its membership or to exclude other processes from its membership so as to bring its membership into agreement with that of the other process. In principle, these choices are heuristic choices for synchronous systems as they are for asynchronous systems. Practical synchronous systems, however, typically have simpler and more robust communication media and, thus, incur fewer problems in reaching agreement on a membership quickly.
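The set-merging at the heart of the discovery phase can be sketched as a single step; this is an illustrative fragment with an invented function name, not the algorithm of Refs. 26 or 36. Each response carries a known set and a fail set, both are merged into the local sets, and the proposed membership is their difference.

```python
def discovery_round(local_known, local_fail, responses):
    """Merge the known and fail sets carried by discovery responses into
    the local sets. Each response is a (known_set, fail_set) pair. Returns
    the merged sets and the proposed membership (known minus fail)."""
    known, fail = set(local_known), set(local_fail)
    for resp_known, resp_fail in responses:
        known |= set(resp_known)   # learn of processes others have heard from
        fail |= set(resp_fail)     # adopt others' suspicions
    return known, fail, known - fail
```

Because merging only ever adds elements, both sets grow monotonically across rounds, which is exactly the property used in the termination argument for partitionable membership.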
FUTURE DIRECTIONS

Much research remains to be undertaken in the area of fault-tolerant distributed systems. An important research topic is the integration of group communication protocols with protocols for real-time, multimedia, and data transfer. Real-time protocols, such as are used for instrumentation and control, typically seek to provide low latency and low jitter (variance in latency) but not reliable delivery because new data are
transmitted more or less continuously. Multimedia protocols that provide broadcasting or multicasting of audio and video need low latency and low jitter but not reliable delivery. Data transfer protocols provide reliable delivery and may provide broadcasting or multicasting but usually do not need message ordering between multiple sources. The use of these protocols in a distributed system depends on group communication for overall coordination. It is essential to establish a causal order between the control information transmitted through the group communication protocol and the start or end of real-time, multimedia, or data transmission.

As group communication protocols become more established, they will be used in larger systems and over wider areas. Over wide areas, with high data rates, existing flow control algorithms are ineffective. To preclude overwhelming the buffers in the intermediate switches, a relatively small window is needed, but messages in the window are quickly transmitted and the source then remains idle for a long time until the acknowledgments return. New flow control algorithms will be required. With existing protocols, the latency to message ordering and delivery can increase substantially over a wide area. Some increase in latency is inevitable because of the propagation delay through the network, but new protocols that can order messages with a latency that is close to this minimum will be required (27).

Wide-area systems are also subject to network partitioning; however, many applications require all components of a partitioned system to continue operation. Existing group communication systems provide message delivery and membership algorithms that continue to operate in all components of a partitioned system. Even though the system is partitioned, the disconnected components can perform operations that are inconsistent with those performed in other components.
When communication is eventually restored, these inconsistencies must be reconciled. The programming required to achieve such reconciliation is currently quite difficult, and expensive manual intervention may be required. Proposals have been made (44–46) to simplify this programming, although human insight is still required to establish the application requirements on which such programming depends. The development of strategies for preventing or reconciling inconsistencies in partitioned systems is an important topic of research. Group communication is in the middle of a range of approaches to the development of fault-tolerant distributed systems. One end of that range is focused on efficiency, while the other end is focused on simplification of the application programming. When communication networks and group communication protocols were slow, a strong emphasis on efficiency was appropriate (47). A similar concern has led to the development of microprotocol toolkits (4) from which a custom group communication protocol can be constructed, optimized specifically for the particular application. With increasing network performance and more efficient protocols, some of that efficiency can be sacrificed for simpler application programming. The group communication protocols described in this article employ a message-passing application programmer interface. This message-passing interface necessarily exposes to the application programmer the problems of distribution, replication, consistency, and fault tolerance. Correct solutions to these problems require considerable skill and experience, and
typical application programmers are not well-trained to solve those problems. Consequently, fault-tolerant distributed systems are still quite difficult and expensive to program. New approaches to building fault-tolerant distributed systems are being investigated. Using the Common Object Request Broker Architecture (CORBA) (48,49), such systems (46,50) provide transparent object replication and fault tolerance. This allows the application programmer to write a distributed object program as though it were to operate unreplicated, without affecting the application programming or the functional behavior of the application. The approach still employs group communication protocols such as those described here, but does not expose those protocols to the application programmer. Such an approach will make the benefits of fault-tolerant distributed systems available to a wider range of applications.
BIBLIOGRAPHY

1. T. D. Chandra and S. Toueg, Unreliable failure detectors for reliable distributed systems, J. ACM, 43 (2): 225–267, 1996.
2. K. P. Birman and R. van Renesse, Reliable Distributed Computing with the Isis Toolkit, Los Alamitos, CA: IEEE Comput. Soc. Press, 1994.
3. L. E. Moser et al., Totem: A fault-tolerant multicast group communication system, Commun. ACM, 39 (4): 54–63, 1996.
4. R. van Renesse, K. P. Birman, and S. Maffeis, Horus: A flexible group communication system, Commun. ACM, 39 (4): 76–83, 1996.
5. F. Cristian, Synchronous and asynchronous group communication, Commun. ACM, 39 (4): 88–97, 1996.
6. M. J. Fischer, N. A. Lynch, and M. S. Paterson, Impossibility of distributed consensus with one faulty process, J. ACM, 32 (2): 374–382, 1985.
7. M. Ben-Or, Randomized agreement protocols, in B. Simons and A. Spector (eds.), Fault-Tolerant Distributed Computing, Berlin: Springer-Verlag, 1990, pp. 72–83.
8. G. Bracha and S. Toueg, Asynchronous consensus and broadcast protocols, J. ACM, 32 (4): 824–840, 1985.
9. L. Lamport, Time, clocks, and the ordering of events in a distributed system, Commun. ACM, 21 (7): 558–565, 1978.
10. D. Dolev, D. Malki, and R. Strong, A framework for partitionable membership service, Tech. Rep. CS95-4, Inst. Comput. Sci., Hebrew Univ., Jerusalem, Israel, 1995.
11. L. E. Moser et al., Extended virtual synchrony, Proc. 14th IEEE Int. Conf. Distrib. Comput. Syst., Poznan, Poland, 1994, pp. 56–65.
12. S. Floyd et al., A reliable multicast framework for light-weight sessions and application level framing, IEEE/ACM Trans. Netw., 5: 784–803, 1997.
13. K. Berket, L. E. Moser, and P. M. Melliar-Smith, The InterGroup protocols: Scalable group communication for the Internet, IEEE GLOBECOM ’98: 3rd Global Internet Mini-Conf., Sydney, Australia, 1998.
14. H. Kopetz and G. Grunsteidl, TTP—A protocol for fault-tolerant real-time systems, IEEE Comput., 27 (1): 14–23, 1994.
15. P. M. Melliar-Smith, L. E. Moser, and V. Agrawala, Broadcast protocols for distributed systems, IEEE Trans. Parallel Distrib. Syst., 1: 17–25, 1990.
16. P. M. Melliar-Smith and L. E. Moser, Trans: A reliable broadcast protocol, IEE Proc. I Trans. Commun., 140: 481–492, 1993.
17. Y. Amir et al., Transis: A communication sub-system for high availability, Proc. 22nd IEEE Int. Symp. Fault-Tolerant Comput., Boston, MA, 1992, pp. 76–84.
18. S. Mishra, L. L. Peterson, and R. D. Schlichting, Consul: A communication substrate for fault-tolerant distributed programs, Distrib. Syst. Eng., 1 (2): 87–103, 1993.
19. L. E. Moser, P. M. Melliar-Smith, and V. Agrawala, Asynchronous fault-tolerant total ordering algorithms, SIAM J. Comput., 22 (4): 727–750, 1993.
20. M. F. Kaashoek and A. S. Tanenbaum, Group communication in the Amoeba distributed operating system, Proc. 11th IEEE Int. Conf. Distrib. Comput. Syst., Arlington, TX, 1991, pp. 222–230.
21. J. M. Chang and N. F. Maxemchuk, Reliable broadcast protocols, ACM Trans. Comput. Syst., 2: 251–273, 1984.
22. F. Cristian and S. Mishra, The pinwheel asynchronous atomic broadcast protocols, Proc. 2nd Int. Symp. Autonomous Decentralized Syst., Phoenix, AZ, 1995, pp. 215–221.
23. W. Jia, J. Kaiser, and E. Nett, RMP: Fault-tolerant group communication, IEEE Micro, 16 (2): 59–67, 1996.
24. B. Whetten and S. Kaplan, A high performance totally ordered multicast protocol, Proc. Int. Workshop Theory and Practice Distrib. Syst., Dagstuhl Castle, Germany, Berlin: Springer-Verlag, 1994, pp. 33–57.
25. T. Abdelzaher et al., RTCAST: Lightweight multicast for real-time process groups, Proc. 1996 IEEE Real-Time Technology and Applications Symp., Brookline, MA, 1996, pp. 250–259.
26. Y. Amir et al., The Totem single-ring ordering and membership protocol, ACM Trans. Comput. Syst., 13: 311–342, 1995.
27. B. Rajagopalan and P. K. McKinley, A token-based protocol for reliable, ordered multi-cast communication, Proc. 8th IEEE Symp. Reliable Distrib. Syst., Seattle, WA, 1989, pp. 84–93.
28. G. A. Alvarez, F. Cristian, and S. Mishra, On-demand asynchronous atomic broadcast, Proc. 5th IFIP Int. Working Conf. Dependable Computing for Critical Applications, Urbana-Champaign, IL, 1995, pp. 119–137.
29. D. A. Agarwal et al., The Totem multiple-ring ordering and topology maintenance protocol, ACM Trans. Comput. Syst., 16: 93–132, 1998.
30. P. D. Ezhilchelvan, R. A. Macedo, and S. K. Shrivastava, Newtop: A fault-tolerant group communication protocol, Proc. 15th Int. Conf. Distrib. Computing Syst., Vancouver, BC, Canada, 1995, pp. 296–306.
31. L. E. T. Rodrigues, H. Fonseca, and P. Verissimo, Totally ordered multicast in large-scale systems, Proc. 16th IEEE Int. Conf. Distrib. Comput. Syst., Hong Kong, 1996, pp. 503–510.
32. L. E. Moser, P. M. Melliar-Smith, and V. Agrawala, Processor membership in asynchronous distributed systems, IEEE Trans. Parallel Distrib. Syst., 5: 459–473, 1994.
33. D. E. Comer, Internetworking with TCP/IP, Englewood Cliffs, NJ: Prentice-Hall, 1995.
34. S. Floyd and V. Jacobson, Random early detection gateways for congestion avoidance, IEEE/ACM Trans. Netw., 1: 397–413, 1993.
35. Y. Amir et al., Membership algorithms for multicast communication groups, Proc. 6th Int. Workshop Distrib. Algorithms, Haifa, Israel, 1992, pp. 292–312.
36. F. Cristian, Reaching agreement on processor-group membership in synchronous distributed systems, Distrib. Comput., 4 (4): 175–187, 1991.
37. M. A. Hiltunen and R. D. Schlichting, A configurable membership service, IEEE Trans. Comput., 47 (5): 573–586, 1998.
38. F. Jahanian, S. Fakhouri, and R. Rajkumar, Processor group membership protocols: Specification, design and implementation, Proc. 12th Symp. Reliable Distrib. Syst., Princeton, NJ, 1993, pp. 2–11.
39. A. M. Ricciardi and K. P. Birman, Process membership in asynchronous environments, Tech. Rep. TR 93-1328, Dept. of Computer Science, Cornell Univ., Ithaca, NY, 1993.
40. E. Anceaume et al., On the formal specification of group membership services, Tech. Rep. 95-1534, Dept. of Computer Science, Cornell Univ., Ithaca, NY, 1995.
41. T. D. Chandra et al., On the impossibility of group membership, Tech. Rep. 95-1548, Dept. of Computer Science, Cornell Univ., Ithaca, NY, 1995.
42. A. S. Tanenbaum, Computer Networks, Upper Saddle River, NJ: Prentice-Hall, 1996.
43. R. Koch, L. E. Moser, and P. M. Melliar-Smith, Global causal ordering with minimal latency, Tech. Rep. 98-08, Dept. Electr. Comput. Eng., Univ. California, Santa Barbara, 1998.
44. O. Babaoglu, A. Bartoli, and G. Dini, Enriched view synchrony: A programming paradigm for partitionable asynchronous distributed systems, IEEE Trans. Comput., 46: 642–658, 1997.
45. P. M. Melliar-Smith and L. E. Moser, Surviving network partitioning, IEEE Comput., 31 (3): 62–69, 1998.
46. P. Narasimhan, L. E. Moser, and P. M. Melliar-Smith, Replica consistency of CORBA objects in partitionable distributed systems, Distrib. Syst. Eng., 4: 139–150, 1997.
47. D. R. Cheriton and D. Skeen, Understanding the limitations of causally and totally ordered communication, Proc. 14th ACM Symp. Operating Systems Principles, Asheville, NC, 1993; Operating Syst. Rev., 27 (5): 44–57, 1993.
48. Object Management Group, The Common Object Request Broker: Architecture and Specification, Rev. 2.1, OMG Tech. Doc. PTC/97-09-01, 1997.
49. R. M. Soley, Object Management Architecture Guide, Object Management Group, OMG Tech. Doc. 92-11-1, 1992.
50. L. E. Moser, P. M. Melliar-Smith, and P. Narasimhan, Consistent object replication in the Eternal system, Theory Practice Object Syst., 4 (2): 81–92, 1998.
P. M. MELLIAR-SMITH
L. E. MOSER
University of California
Wiley Encyclopedia of Electrical and Electronics Engineering
High-Speed Protocols
Standard Article
Martina Zitterbart, TU Braunschweig, 38106 Braunschweig, Germany
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5308
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: Characteristics of High-Speed Networks; Light-Weight Transport Protocols; Evolution of TCP; Implementation Techniques; Summary.
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
HIGH-SPEED PROTOCOLS

Buzzwords such as data highways and information society are becoming ubiquitous and are no longer specific to the research environment. In this context, many emerging applications are pushing the communication world toward drastic changes. Currently, the most prominent example is the World Wide Web (WWW). Furthermore, the popularity of networked multimedia applications, such as teleconferencing and telecollaboration, is constantly increasing. As a result, many new opportunities for network users appear in the public sector as well as in the business and commercial sectors. However, in order to serve all these applications, suitable communication systems are required, including the underlying network as well as network and transport layer protocols. Together, all components need to provide high performance with respect to throughput and latency. Moreover, the integration of multiple services, as is typical for multimedia applications (e.g., audio, video, and data streams), forms a key requirement.

Due to the emergence of fiber-based technology, high-speed networks are being established that enable data rates well over the megabit, and even the gigabit, per second threshold. In addition, ATM (asynchronous transfer mode) is under development and will be capable of integrating various services within a single network, the B-ISDN (broadband integrated services digital network). Moreover, high-speed protocols and efficient implementation techniques have been developed during the last couple of years. They specifically address the characteristics of high-speed networks and new application requirements.

This article focuses on issues related to high-speed networks and is structured as follows. Important characteristics of high-speed networks are presented first, followed by information on light-weight transport protocols. Protocol mechanisms as well as the most popular light-weight protocols are discussed.
A section is devoted to the evolution of the widely used Internet protocol TCP (Transmission Control Protocol). Following this, implementation techniques including parallel protocol processing and dedicated hardware support are presented. Finally, some conclusions and perspectives on future trends are given.
Characteristics of High-Speed Networks

High-speed networks are characterized by a high data rate, typically well into the hundreds of Mbit/s or even the gigabit per second range and above. However, there is no specific data rate that qualifies a network as a high-speed network. Compared to traditional low-speed networks (e.g., Ethernet), the data rate is higher by several orders of magnitude. This leads to some very different characteristics of such networks, especially with respect to end-to-end latency. In low-speed networks, end-to-end latency is dominated by the data rate of the link. In high-speed networks, in contrast, the speed of signal propagation clearly dominates the end-to-end latency. As a result, the so-called bandwidth-delay product increases rapidly in high-speed networks. This means that a large amount of data can be buffered within the network. This buffering capability is commonly referred to as path capacity.
Fig. 1. End systems communicating over a high-speed network.
The basic scenario depicted in Fig. 1 is used to clarify the importance of the path capacity. Two end systems are interconnected via a communication link of length l = 5000 km. The data rate on the link is r = 1.2 Gbit/s. Furthermore, we assume the communication link to be a fiber link. The speed of light in a fiber is approximately v = 2 ∗ 105 km/s. Thus, the signal propagation delay per kilometer is τ = 1/v = 5 µs/km. The end-to-end transmission delay d of the communication link is d = l ∗ τ = 25 ms; that is, the round-trip time for end-to-end communication is 50 ms. The path capacity p of a link can be calculated as follows: p = r ∗ l/v. For the example, the path capacity is p = 30 Mbit; that is, 30 Mbit of data are stored on the transmission link. Table 1 presents path capacities for various networks with different data rates.

The drastic increase in path capacity compared to low-speed networks has a major impact on higher-layer protocols. Several protocol mechanisms that regulate the data flow between end systems are affected, especially at the transport layer. Taking the numbers above, a sending station has to wait 50 ms before it can expect to receive an indication from the receiving station about the transmitted data. During that time, however, the sender can already transmit 60 Mbit (i.e., 7.5 Mbyte) of data. Thus, it can send a complete file without receiving any indication of proper reception or of any errors. This situation is very different from low-speed networks, where the first feedback information usually arrives at the sender after the transmission of a few bits (e.g., 4 bits if ISDN is used in the example). This drastic change of behavior requires enhanced protocol mechanisms.

Moreover, due to the increased speed within the network, the time in which a data unit must be handled in the attached systems (end systems, routers, etc.) decreases dramatically.
Given the previous example, data units of 8 kbytes need to be processed in about 55 µs. However, if the length of the data unit is only 53 bytes (e.g., an ATM cell), it needs to be received and processed in less than 0.4 µs. Comparable numbers in a 10 Mbit/s Ethernet are 6.5 ms and 42 µs, respectively. These numbers underline that requirements on protocol
processing and memory access speed are other significant factors that need to be addressed with the advent of high-speed networks.
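The arithmetic above can be collected into a short sketch. This is plain Python reproducing the article's worked example; the function and constant names are my own, not part of any protocol or library.

```python
# Bandwidth-delay arithmetic for the example link in the text:
# a 5000 km fiber link running at 1.2 Gbit/s.

def propagation_delay_s(length_km, v_km_per_s=2e5):
    """One-way propagation delay d = l / v over a fiber link."""
    return length_km / v_km_per_s

def path_capacity_bits(rate_bps, length_km, v_km_per_s=2e5):
    """Bits 'stored' on the link: p = r * l / v."""
    return rate_bps * length_km / v_km_per_s

RATE = 1.2e9      # link data rate r, bit/s
LENGTH = 5000     # link length l, km

d = propagation_delay_s(LENGTH)        # 0.025 s, i.e. 25 ms one way
rtt = 2 * d                            # 50 ms round trip
p = path_capacity_bits(RATE, LENGTH)   # 30e6 bits = 30 Mbit in flight

# Per-unit processing budget at line rate: time = bits / rate.
budget_8kB = 8 * 1024 * 8 / RATE       # about 55 microseconds per 8 kbyte unit
budget_atm = 53 * 8 / RATE             # well under 0.4 microseconds per ATM cell

print(d, rtt, p, budget_8kB, budget_atm)
```

The same two formulas reproduce the Ethernet comparison in the text by substituting RATE = 1e7.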
Light-Weight Transport Protocols

Protocol mechanisms at the transport layer must address the increasing path capacity in order to provide efficient communication services to the application. In the following, basic protocol mechanisms are discussed first, followed by the presentation of selected light-weight protocols that introduced and applied novel protocol mechanisms.

Basic Protocol Mechanisms. Several protocol mechanisms are typically part of connection-oriented transport protocols that provide a reliable service. Among them are mechanisms for:

• connection establishment and termination
• error control
• flow control
Connection establishment can be seen as a performance-critical protocol function, especially with the advent of applications that are based on the client/server paradigm. Typically, a handshake-based mechanism is used to establish a connection, as depicted in Fig. 2(a). The sender issues a connect request message and must wait for a connect indication message before it is allowed to transfer user data to the peer entity. During this handshake procedure, some parameters, such as data rate and window size, can be negotiated among the sender, the receiver, and the service provider. This includes QoS (quality of service) parameters for multimedia services. However, the handshake procedure leads to a latency of at least one round-trip time before the first byte of user data can be sent, that is, a delay of 50 ms with respect to the example presented previously. In this time, 60 Mbit of user data could have been transmitted. Many client/server-based applications do not require a large amount of data to be sent, and thus the time needed for connection establishment, tconn, can easily dominate the time needed for user data transfer, tdata, that is, tconn ≥ tdata.

In order to increase the efficiency of connection establishment, implicit mechanisms have been developed. User data can be transmitted immediately with or after the connection establishment message; see Fig. 2(b). Implicit mechanisms drastically decrease the connection set-up latency. Therefore, transactions can be finalized much faster. For example, the transaction time ttrans can be reduced to ttrans ≈ 50 ms instead of ttrans > 100 ms with the handshake-based mechanism. However, guarantees with respect to quality of service cannot be given. Therefore, such mechanisms are not targeted at multimedia applications. Protocols serving multimedia applications typically use a separate signaling protocol in order to establish a connection with dedicated QoS requirements.
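The latency arithmetic for the two establishment styles can be sketched as follows. A minimal model in Python: it counts only round trips, ignoring transmission and processing time as the text does, and the function names are illustrative.

```python
# Transaction time: explicit handshake vs. implicit connection establishment
# over the article's example link (50 ms round trip).

RTT = 0.050  # seconds, round trip over the 5000 km fiber link

def t_handshake(data_rtts=1):
    """Connect request/indication costs one full RTT before any user data;
    the request/response exchange itself costs another RTT."""
    return RTT + data_rtts * RTT

def t_implicit(data_rtts=1):
    """User data rides with the connection establishment message,
    so only the request/response RTT remains."""
    return data_rtts * RTT

print(t_handshake(), t_implicit())
```

For a single request/response transaction this reproduces the figures in the text: roughly 100 ms with a handshake against roughly 50 ms with implicit establishment.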
Connection termination also needs special attention in the environment of client/server applications and high-speed networks. If handshake-based mechanisms are used for the establishment and termination of connections, the number of packets and the time consumed for connection management may be higher than the number of packets and the time needed to transmit the user data. Therefore, the number of connection management packets should be minimized. In addition to implicit connection establishment, so-called timer-based mechanisms have been introduced in order to avoid connection termination messages. Timers are used at both sides to determine the point in time at which the connection is terminated. The timers are updated each time user data are either received or sent. The value of the timer must account for the round-trip time of the connection and possible data retransmissions. The mechanism is very sensitive to proper dimensioning of the timer. A disadvantage of timer-based connection handling is that connection state must be held longer, since the timer needs to be set to a sufficiently high value. This can be a burden at servers that are frequently requested.
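A minimal sketch of such timer-based teardown, assuming a timeout dimensioned from the round-trip time and a retransmission allowance. The class name, timeout formula, and method names are illustrative, not taken from any particular protocol specification.

```python
# Timer-based connection teardown: no termination handshake; the connection
# closes once no data has been sent or received for a quiet period safely
# above the RTT plus retransmission allowance.

class TimerBasedConnection:
    def __init__(self, rtt, max_retransmissions=3):
        # The timeout must outlive the worst-case exchange, so connection
        # state is held for a long time -- the server burden noted in the text.
        self.timeout = (1 + max_retransmissions) * rtt
        self.last_activity = 0.0
        self.open = True

    def on_data(self, now):
        """Sending or receiving user data refreshes the timer."""
        self.last_activity = now

    def poll(self, now):
        """Close the connection once the quiet period exceeds the timeout."""
        if self.open and now - self.last_activity > self.timeout:
            self.open = False
        return self.open

conn = TimerBasedConnection(rtt=0.050)   # 50 ms RTT -> 200 ms timeout
conn.on_data(now=0.0)
assert conn.poll(now=0.1)        # still within the quiet period
assert not conn.poll(now=0.3)    # quiet too long: connection closed
```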
Fig. 2. Connection establishment.
Protocol mechanisms that implement error control also need to be adapted for use in networks with high path capacity. Typically, very simple error control mechanisms are implemented within transport protocols. Acknowledgments are used by the receiver to indicate the correct reception of data units that have been sent. If a data unit is not acknowledged, it is retransmitted by the sender after the timeout of a related timer. The simplest acknowledgment mechanism uses cumulative acknowledgments. All correctly received data units in sequence are acknowledged. All data received subsequent to a corrupted data unit are discarded and not acknowledged. The sender needs to wait at least a round-trip time in order to know whether the data unit has been received correctly. For the retransmission of erroneous data, the so-called go-back-N mechanism is often applied, especially in conjunction with cumulative acknowledgments. With this mechanism, all data following the corrupted data unit are retransmitted. Due to the potentially large amount of data in transit, this can dramatically increase the load on the network.

With selective acknowledgments, more advanced retransmission mechanisms can be implemented that reduce the amount of data to be retransmitted. Selective acknowledgments allow, in contrast to cumulative mechanisms, the acknowledgment of data that has been correctly received subsequent to a corrupted data unit. The disadvantage is increased memory requirements at the receiving system. However, latency can be significantly reduced, which is highly desirable for many applications, especially interactive ones (e.g., teleconferencing). Selective acknowledgments are very attractive in high-speed networks, since they do not require the retransmission of all data in transit.
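The difference in retransmission volume can be illustrated by counting data units rather than implementing full ARQ machinery. The functions below are a sketch over unit indices; the window size and loss pattern are illustrative.

```python
# Retransmission-volume comparison: go-back-N resends everything from the
# first corrupted unit onward, while selective retransmission resends only
# the lost units themselves.

def go_back_n_retransmits(window, lost):
    """Units resent under go-back-N: from the first loss to the window end."""
    if not lost:
        return 0
    return window - min(lost)

def selective_retransmits(lost):
    """Units resent under selective retransmission: only the losses."""
    return len(lost)

# 1000 units in transit (a large high-speed window); only unit 3 is lost.
print(go_back_n_retransmits(1000, {3}))   # 997 units resent
print(selective_retransmits({3}))         # 1 unit resent
```

With a large amount of data in transit, a single early loss forces go-back-N to resend almost the whole window, which is exactly the extra network load the text describes.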
Furthermore, forward error correction (FEC) appears to be a useful alternative in high-speed networks, since it completely avoids the round-trip time needed to correct at least part of the errors. The basic principle is that redundant data are transmitted. Thus, the receiver may be capable of reconstructing the original data even in the case of lost or corrupted data. However, this does not guarantee complete reliability. Therefore, an additional option to use retransmissions is required. This is needed if too much data is corrupted or lost and, thus, the original data cannot be reconstructed. The potential disadvantages of FEC are the increased load on the network as well as the considerable processing power needed for good FEC mechanisms.

The task of flow control is to regulate the flow of data between two communicating protocol entities. The resources of the receiving system (buffer, processing power) are protected against overload conditions. The basic mechanism used in various transport protocols is a credit-based sliding window mechanism. The sender has a certain credit, the window, that is, an amount of data (measured in bytes or data units) that it is allowed to send before receiving an acknowledgment. This credit basically reflects the buffer capability of
the receiver. With increasing path capacity, the utilization of the communication link by two communicating stations decreases under such an approach. Typical window sizes are in the range of kbytes and, thus, smaller than the path capacity of high-speed networks. Therefore, a new mechanism called rate control has been developed during the last couple of years. It is applied in various light-weight protocols. With rate control, traffic is controlled via the rate of the sending station, in contrast to the credit used by the sliding window. The sending and receiving systems as well as the network agree on this rate; that is, the receiving system and the network assert that they can process data received at the agreed rate.

Light-Weight Transport Protocols. During the past twenty years, various transport protocols have been developed. The Internet protocol TCP is the most prominent example. It was one of the earliest transport protocols and is currently implemented on almost all computers. Since many changes in networks can be observed, especially with the advent of high-speed networks, the requirements on transport protocols have changed. Furthermore, application requirements have changed dramatically. The client/server paradigm is penetrating the communication area today, and service integration is becoming increasingly important with respect to multimedia applications. In order to address the changes related to high-speed networks, various so-called light-weight transport protocols have been developed. Light-weight protocols shorten the regular data path and, generally, minimize protocol overhead. The most prominent ones are briefly described in the following with respect to their individual contributions. The protocols are:

• Delta-t (1,2)
• NETBLT (3,4)
• VMTP (5–7)
• URP (8)
• XTP (9,10)
TCP can be seen as the “father” of transport protocols. Each of the protocols listed above inherits some mechanisms from TCP and replaces or improves others. An excellent overview of the correlation among these protocols can be found in (11). The transport protocol of the OSI protocol suite, transport protocol class 4 (TP4), is not discussed further, since it does not introduce any new protocol mechanisms and is not of practical relevance today. TCP itself is constantly undergoing changes. Those relevant to high-speed networks are summarized in the section on TCP. The light-weight transport protocols presented here have been designed with a connectionless network service in mind, such as IP (Internet Protocol). The only exception is URP, which is based on the virtual-circuit-oriented network Datakit.

Delta-t. In the late 1970s, the Lawrence Livermore Laboratory designed a new transport protocol called Delta-t. The development was driven by the idea of providing high-performance communication support for client/server applications. Delta-t is one of the earliest protocols to address the specific requirements of client/server applications. It provides some trail-blazing approaches that are more or less standard today. One of its main characteristics is that it minimized the latency between initiating the connection establishment and the actual start of the data transfer phase, as well as the overall number of data units exchanged. The novelty introduced by Delta-t is an implicit connection establishment mechanism. It minimizes the number of data units exchanged and, thus, the delay until user data transfer can start. Handshake procedures are completely avoided. Data can be transferred in Delta-t immediately after issuing the connection establishment. Another novelty of Delta-t is the definition of a timer-based protocol mechanism for connection termination.
It avoids a handshake procedure for connection termination and, thus, further reduces the overhead with respect to the number of data units exchanged. For such a mechanism, unambiguous connection identifiers are needed. Furthermore, connection identifiers need to be frozen after connection termination, that
is, they cannot be used immediately by a newly established connection, in order to avoid problems with delayed duplicates from an already closed connection. Although Delta-t is not widely used today, it provided important insights into the efficient support of client/server applications. Its connection handling mechanisms have been adopted by subsequent protocols, even by TCP (see the section on TCP).

Network Block Transfer Protocol. The protocol NETBLT (network block transfer protocol) was developed at MIT. It directly targets transmission links with high path capacity. However, it addresses links with a low-speed data rate: satellite links. The extreme length of satellite links (72000 km) leads to a high path capacity without the necessity of a very high data rate on the link. In such environments, window-based flow control cannot be applied efficiently: it leads either to large windows or to underutilized communication links. NETBLT specifically addresses this problem and presents a novel mechanism that combines window-based flow control with a newly introduced rate-based flow control. Furthermore, in NETBLT, error control and flow control are decoupled, so that neither mechanism is overloaded with both tasks.

NETBLT introduces so-called bursts for the implementation of rate-based flow control. Two burst parameters are defined to regulate data transmission: burst size and burst rate. The burst size limits the length of a burst, that is, the amount of data that can be sent continuously. The burst rate defines the minimum time interval between two subsequent bursts. Consequently, the data rate on the link is controlled by these two parameters. Therefore, larger windows can be used without causing an overflow at the receiver. Burst size and burst rate are negotiated among the involved parties.

With respect to error control, NETBLT also introduces novel mechanisms. It uses a combination of cumulative and selective acknowledgments.
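The contrast between window-based and rate-based control can be made concrete with two small formulas. This sketch uses the article's example link; the window and burst parameter values are illustrative, not taken from the NETBLT specification.

```python
# Window-based vs. NETBLT-style rate-based flow control on a link with
# large path capacity. Under a sliding window, utilization is roughly
# window / (rate * RTT), i.e. window / bandwidth-delay product. Under
# burst-based rate control, the rate follows directly from burst size
# and the minimum interval between bursts.

def window_utilization(window_bits, rate_bps, rtt_s):
    """Fraction of the link a credit window can keep busy."""
    return min(1.0, window_bits / (rate_bps * rtt_s))

def effective_rate_bps(burst_size_bits, burst_interval_s):
    """Average rate produced by one burst per minimum burst interval."""
    return burst_size_bits / burst_interval_s

RATE, RTT = 1.2e9, 0.050      # the article's 1.2 Gbit/s, 50 ms link

# A 64 kbyte credit window (kbyte-range, as the text notes) leaves the
# high-speed link almost idle:
u = window_utilization(64 * 1024 * 8, RATE, RTT)     # below 1%

# Burst-based rate control: 16 kbyte bursts, one per millisecond,
# yields a controlled rate independent of the round-trip time.
rate = effective_rate_bps(16 * 1024 * 8, 0.001)      # about 131 Mbit/s

print(u, rate)
```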
Selective acknowledgments signal that certain data units have been lost or corrupted. Cumulative acknowledgments signal the correct reception of a sequence of data units. Since selective acknowledgments are used, NETBLT can apply selective retransmissions as well. In order to reduce the overhead introduced by acknowledgments, so-called buffers are introduced as the basic units for the handling of acknowledgments. Buffers contain a number of data units that are logically treated as a single unit. Thus, an acknowledgment is needed per buffer rather than per data unit exchanged. The size of the buffer is negotiated between sender and receiver.

Versatile Message Transaction Protocol. The protocol VMTP (versatile message transaction protocol) was developed at Stanford University for use with the distributed operating system V. The main purpose was the efficient support of remote procedure calls, that is, of transaction-based applications. The goal is somewhat comparable to that of Delta-t. VMTP introduces so-called message transactions that model remote procedure calls. A message transaction comprises a request and one or multiple optional response messages. Message transactions can be forwarded directly from one server to another without involvement of the client. The design of VMTP incorporates implicit and timer-based connection handling derived from Delta-t. Furthermore, VMTP supports multicasting, that is, a transaction request can be sent to multiple servers. It is among the first protocols to address the need for specific multicasting support. A novelty of VMTP is its 64-bit unambiguous connection identifiers, which are independent of the network layer address. This enables the migration of VMTP entities in the network, for example, to implement load balancing in distributed systems. VMTP further applies a rate control mechanism in order to control the data flow between communicating entities.
The mechanism is comparable to that of NETBLT but differs in one respect: VMTP uses so-called interpacket gaps, that is, it controls the gaps between two consecutive data units. Therefore, a finer timer granularity is required compared to the interburst gaps controlled in NETBLT. However, VMTP was developed from the beginning with a hardware-supported implementation in mind. Thus, a higher timer resolution could be integrated more easily than within a software implementation.
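The timer-granularity difference is easy to quantify: the idle interval that holds the link at a target rate is simply the unit size divided by the rate. The packet and burst sizes below are illustrative choices, not values from either protocol specification.

```python
# Timer granularity: VMTP-style interpacket gaps vs. NETBLT-style
# interburst gaps at the same target rate.

def gap_s(size_bits, rate_bps):
    """Interval between unit starts needed to hold the link at the rate."""
    return size_bits / rate_bps

TARGET_RATE = 1.2e9                            # 1.2 Gbit/s target rate

packet_gap = gap_s(1024 * 8, TARGET_RATE)      # per 1 kbyte packet: ~6.8 us
burst_gap = gap_s(64 * 1024 * 8, TARGET_RATE)  # per 64 kbyte burst: ~437 us

# Pacing individual packets makes the timer fire 64x more often than
# pacing whole bursts, which is why a hardware-supported implementation
# made VMTP's finer granularity tractable.
print(packet_gap, burst_gap)
```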
The idea of a hardware implementation also influenced the packet format defined in VMTP. Pipelined processing of different header fields (e.g., encryption, checksum) is directly supported.

Universal Receiver Protocol. The protocol URP (universal receiver protocol) was developed in the mid-1980s at AT&T in order to provide universal data transport across their Datakit network. The communication structure of URP is based on simplex transmissions; thus, a full-duplex connection is composed of two simplex connections, one per direction. Data are handled on the basis of 8-bit units. Two operating modes are distinguished for the receiver: block mode and character mode. In block mode, sequence-numbered blocks are used for data transmission. Block retransmission is available as an option in case of error. In character mode, data are transmitted as a character stream without the option of retransmission. The service provided by character mode, and by block mode without retransmission, is very different from the services discussed previously. Data are delivered error-free and in sequence, but some data in between may be lost. Moreover, in character mode, data losses cannot be detected. Thus, character mode reflects a very simple service that can be implemented with low overhead. Generally, URP is one of the first protocols to provide some flexibility with respect to the provided service. This is very important, especially with the advent of multimedia applications. The grade of URP service can be selected by the transmitter.

Since URP was designed for high-speed networks, optimizations were integrated into the protocol. One of the most interesting design issues is the relocation of some processing-intensive tasks from the receiver to the sender. This is motivated by the fact that the receiver usually represents the performance bottleneck. Protocol mechanisms that follow such an approach are called sender-driven mechanisms.
This is reflected mainly in the acknowledgment procedure of URP. The sender explicitly asks for an acknowledgment from the receiver, for example, after having sent a full window of data. Thus, the decision and the processing related to it are located at the sender side rather than at the receiver side. Furthermore, in block mode a so-called reject mechanism is implemented that basically reflects a selective acknowledgment. This speeds up retransmissions. The data format used in URP's block mode is very different from those of other protocols. It uses a trailer of 4 bytes that follows a block of data; no header is involved. This simplifies processing in hardware. A hardware-based implementation of URP was a vision of the protocol designers. URP is not widely used today. However, many of its concepts were inherited and enhanced in XTP, which has been among the most prominent protocols of the last several years.

Xpress Transport Protocol. The protocol XTP (Xpress transport protocol) was developed with the goal of a VLSI implementation. Moreover, it was targeted toward efficient support for real-time applications. The idea of a hardware implementation influenced the data format of XTP, similarly to VMTP and URP. However, it needs to be stated that the current data format of XTP 4.0 is very different from that described in the first specifications of XTP. For example, the control fields moved entirely from the packet trailer to the header. XTP introduces out-of-band signaling at the transport layer. It is the first protocol that uses out-of-band signaling over connectionless network services. The exchange of control information between XTP entities and the transfer of user data are clearly separated. Moreover, the protocol was designed as a set of orthogonal protocol functions that can be mapped onto a multiprocessor platform. Such a design also forms a sound basis for a protocol that provides multiple services. This idea was inherited from URP.
XTP provides many different mechanisms and is thus capable of supporting different services for a large variety of applications. Different strategies and options of protocol mechanisms can be selected by individual applications. Generally, XTP uses many of the protocol mechanisms introduced by earlier protocols, for example, selective acknowledgments and rate-based flow control. Moreover, the sender-driven mechanisms of URP, which relax processing requirements at the receiver, are applied. In some sense, XTP collected from earlier protocols the mechanisms that are suited to high-speed networks. XTP introduces a flexible addressing mechanism, that is, different addressing schemes can be applied (e.g., Internet addressing or OSI addressing). With this flexibility, XTP can be used in different networking environments, such as the Internet or OSI. However, today it is mainly used over the Internet, that is, over IP.
Furthermore, XTP provides multicasting. Different mechanisms have been supported over time. Multicast support forms an important issue for many emerging applications. Proper support is still under discussion today. Version 4.0 provides a reliable multicast service.
Evolution of TCP

As discussed in the previous section, many new transport protocols have been designed with respect to the characteristics of high-speed networks and to application requirements. However, none of these protocols is widely used today. TCP, which was developed over twenty years ago, clearly dominates the market. However, TCP has also changed over time; mechanisms have been integrated that solve some of its problems and support its use over high-speed networks. Some prominent examples are:

• avoidance of the Silly Window Syndrome
• window scaling
• connection count
TCP uses a sliding window mechanism for flow control and cumulative acknowledgments with go-back-N retransmission. In this context, one of the first performance problems observed was the so-called Silly Window Syndrome. It is characterized by a situation in which the receiver advertises only small windows and, thus, the sender transmits data in small segments (12). Such a situation is initially caused by a sender that, in the case of a push flag, has only a small amount of data to send, which is acknowledged immediately by the receiver. During long data transmissions (e.g., large file transfers) the sender cannot recover from such a situation, and the performance of the data transfer can decrease drastically. However, the Silly Window Syndrome can be avoided by using the following simple algorithms at the sender and receiver, respectively. The sender should refrain from sending small segments; as a general rule, the sender should only send if the data fill more than 25% of the window size. At the receiver side, two complementary algorithms should be implemented. First, the receiver should avoid advertising small windows: it should only offer a new window if at least some fraction of the overall window size (e.g., 50%) can be offered. Furthermore, the receiver should refrain from sending an acknowledgment at all if the push flag was not set in the received segment and no data are flowing in the reverse direction. This decreases the processing requirements of the sender, since the number of acknowledgments that it needs to process is reduced. These algorithms are suggested for use with TCP in the host requirements document (13). TCP uses a 16-bit field for window advertisements, and the basic unit for advertisements is a byte. Thus, the maximum window size wmax = 2^16 − 1 bytes = 65,535 bytes (about 64 kbyte). With this limited window size, high-speed networks cannot be highly utilized.
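The two avoidance rules can be sketched as simple predicates. The 25% and 50% thresholds are the ones quoted above; the full-segment (MSS) condition in the sender rule is the companion condition from RFC 813, and the function names are illustrative:

```python
def sender_may_send(usable_window, max_window, mss):
    """Sender-side SWS avoidance: transmit only if a full segment
    (one MSS) fits, or the usable window exceeds 25% of the largest
    window the receiver has advertised."""
    return usable_window >= mss or usable_window > 0.25 * max_window

def receiver_should_advertise(free_buffer, max_window, fraction=0.5):
    """Receiver-side SWS avoidance: offer a new window only when at
    least the given fraction of the overall window can be offered."""
    return free_buffer >= fraction * max_window
```

With these checks in place, neither side ever commits to a tiny segment, so a transfer that momentarily degenerates into small windows recovers on its own.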
In order to overcome this problem, a new option has been introduced in TCP that allows the use of other basic units for window advertisements, known as window scaling. The basic unit is advertised within a SYN segment during connection establishment (see Fig. 3). The basic units are multiples of a byte, and they are coded logarithmically; for example, the shift count 3 leads to a scaling factor S = 2^3 = 8. There exists an upper limit of S = 2^14 for the scaling factor in order to bound the highest usable sequence number. Proper support of client/server applications led to another new TCP option, the so-called connection count. Connection counts are used to avoid the handshake procedure for connection establishment. This version of TCP is called T/TCP (transaction TCP), and the bypass mechanism is called TAO (TCP accelerated open). The connection count is monotonically increased every time a new connection is established. The server stores the last connection count received from a client. If a SYN segment is received by the server, the server compares the received connection count with the stored value. If the received connection count is larger, the connection can be established immediately. If it is smaller, the normal handshake procedure is used.
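Both options can be illustrated in a few lines; the function names are illustrative:

```python
def scale_factor(shift):
    """TCP window scale option: the SYN carries a logarithmically
    coded shift count; the scaling factor is S = 2**shift, capped at
    S = 2**14 to bound the highest usable sequence number."""
    if not 0 <= shift <= 14:
        raise ValueError("window scale shift must be in 0..14")
    return 1 << shift

def scaled_window(advertised_16bit, shift):
    """Effective window: the 16-bit advertisement times the factor."""
    return advertised_16bit * scale_factor(shift)

def tao_accepts(received_cc, cached_cc):
    """T/TCP accelerated open: a SYN whose connection count exceeds
    the server's cached value for this client may skip the three-way
    handshake; otherwise the normal handshake is used."""
    return received_cc > cached_cc
```

With the maximum shift of 14, the advertisable window grows from 65,535 bytes to roughly 1 Gbyte, enough to fill high bandwidth-delay-product paths.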
Fig. 3. Transaction support of T/TCP.
Implementation Techniques

During the past few years, implementation techniques for communication systems have been discussed for various reasons. The two main reasons are efficiency and flexibility. The latter arises especially in environments with networked multimedia applications and is not discussed further here. Given that currently available transmission technology can deliver data across networks at rates exceeding the gigabit per second range, one challenge in building high-performance communication systems is the design of efficient protocol implementations. Considering the increasing gap between the growth of physical network bandwidth and the growth in processing power available for protocol processing, multiprocessor platforms were considered to overcome performance bottlenecks. The communication system then has to be adequately partitioned in order to make use of such architectures. Besides processing requirements, data manipulation functions (i.e., memory accesses and data copies) play a crucial role in high-performance communication systems. The technique of integrated layer processing has been introduced to overcome this bottleneck. Due to the many difficulties with this technique, no major breakthrough has been achieved so far.

Parallel Protocol Processing. In this section, the different types and levels of parallelism that can be applied in communication systems are introduced first. Then, some of the most prominent projects in parallel protocol processing are briefly summarized and their results are evaluated.

Types of Parallelism. Two types of parallelism can be distinguished:

• spatial parallelism
• temporal parallelism
Spatial parallelism is based on the mapping of the processing task onto multiple concurrently operating processing units. Two types of spatial parallelism can be distinguished:

• SIMD-like organization
• MISD-like organization
Spatial parallelism based on an SIMD (single instruction, multiple data) organization reflects parallelism among identical tasks independently and concurrently operating on different data units. A scheduling discipline, for example, round robin, may be used to allocate data to processing units. An SIMD organization requires only a minimum of synchronization among the processing units. It does not decrease the processing time for a single data unit; rather, it increases the number of data units processed during a certain time interval. The performance benefits are limited by the maximum number of concurrently utilizable
processing units, which can be approximately calculated by multiplying the mean packet processing time by the target throughput. Spatial parallelism based on an MISD (multiple instruction, single data) organization is associated with concurrent processing of different tasks on the same data. Applying an MISD organization reduces the processing time required for a single data unit, since multiple tasks are processed concurrently. However, a high degree of synchronization among the processing units may be required. Careful balancing of the system is needed to ensure that synchronization overhead does not paralyze system performance. Temporal parallelism is provided by the concept of pipelining, which is similar to assembly lines in industrial plants. To achieve pipelining, the processing task has to be subdivided into a sequence of subtasks, each mapped onto a different pipeline stage. Thereby, each pipeline stage processes different data at the same point in time. The protocol VMTP especially addressed this issue in its design. Pipelining does not decrease the processing time needed for a single data unit; it increases the completion rate of packets and, consequently, the system throughput. Since the performance of a pipeline is limited by its slowest stage, stage balancing is an important issue.

Levels of Parallelism. In communication subsystems, mainly four different levels of parallelism can be distinguished, based on the granularity of an atomic unit:

• stack level
• entity level
• function level
• intra-function level
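The sizing rule for an SIMD organization quoted above (required units ≈ mean packet processing time × target packet rate) can be sketched as follows; the packet size and per-packet processing time are assumed, illustrative values:

```python
import math

def required_units(mean_processing_time_s, target_pps):
    """Approximate count of concurrently utilizable SIMD processing
    units: mean per-packet processing time times target packet rate."""
    return math.ceil(mean_processing_time_s * target_pps)

# Assumed illustrative workload: a 1 Gbit/s link carrying 1500-byte
# packets (about 83,333 packets/s) at 50 us of processing per packet.
pps = 1_000_000_000 / (1500 * 8)
units = required_units(50e-6, pps)  # about 5 units
```

Beyond that unit count, additional processors sit idle because the packet arrival rate, not the processing capacity, becomes the limit.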
The stack level forms the most coarse-grained level of parallelism applied to communication subsystems. An atomic unit comprises a complete protocol stack consisting of different protocol layers. Mainly spatial parallelism with an SIMD organization can be implemented at the stack level. Each stack can be associated with a separate connection or a dedicated application data stream (e.g., in the case of connectionless services). In this case, packets are scheduled on a per-connection basis, and parallelism takes place among concurrent connections. However, this approach does not improve the performance of a single stack; moreover, load balancing between the stacks cannot be influenced, since the relation of packets to corresponding connections dictates the data distribution over the processing units. As a consequence, some processing units may be heavily loaded, requiring additional processing power, whereas others are lightly loaded or even idle. An alternative approach using per-packet scheduling may lead to a more beneficial load distribution. Packets are scheduled based on the availability of processing units, independent of their relation to a connection or an application data stream. The performance of a single connection can be increased by concurrently processing multiple data units associated with a single connection. However, this solution requires synchronization among processing units concurrently operating on different packets belonging to the same connection. The entity level provides a finer granularity. An atomic unit is either associated with a complete protocol or, more modularly, with the receive entity formed by the receive part of the protocol and the send entity formed by the send part. Spatial parallelism can be applied, resulting in the same trade-offs as at the stack level. Entities belonging to different protocol layers can be implemented in a layer pipeline using temporal parallelism to form a complete protocol stack.
The layer pipeline can further be subdivided into a receive pipeline and a send pipeline, which are mutually independent to a large extent. They may coordinate their operation on data units belonging to the same connection by using common connection control information. Furthermore, synchronization between the send and receive pipelines may be required. At the function level, protocol functions are used as atomic units (e.g., connection establishment, flow control). A first and necessary step in that direction is the analysis of protocols with a view to extracting the intrinsic parallelism among protocol functions. The result of this analysis can be given in the form of a dependency graph representing the relationships among the protocol functions, mainly in terms of data dependency.
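Such a dependency analysis can be sketched with Python's standard topological sorter. The protocol functions and edges below are hypothetical, not taken from an analysis of any real protocol; the point is only that functions in the same stage have no mutual data dependency and may run concurrently:

```python
from graphlib import TopologicalSorter

# Hypothetical data dependencies among protocol functions: each entry
# maps a function to the set of functions whose output it needs.
deps = {
    "header_parse": set(),
    "checksum": {"header_parse"},
    "reassembly": {"header_parse"},
    "flow_control": {"header_parse"},
    "ack_processing": {"checksum"},
    "delivery": {"reassembly", "ack_processing"},
}

ts = TopologicalSorter(deps)
ts.prepare()
stages = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # functions with no unmet dependencies
    stages.append(ready)            # each stage can run concurrently
    ts.done(*ready)
```

Each entry of `stages` is one wave of functions that the dependency graph permits to execute in parallel.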
At the intra-function level, concurrent processing is applied in order to increase the performance of a single protocol function. Examples include computation-intensive functions, such as checksum processing, encoding, or encryption, and functions operating on large data bases, for example, the routing function.

Experiences in Parallel Protocol Processing. Many projects on parallel protocol processing were started toward the end of the 1980s. In this article, only a few are selected that provide dedicated experience and cover the main areas just presented. Most projects experimented with the OSI protocol stack; few addressed new protocols, such as XTP. One project intensively investigated parallelism at the stack level (18). Important issues that need to be considered at that level are packet ordering, handling of segmented data, access to connection control information, and shared protocol data (e.g., the IP routing table). Packets are scheduled on a packet-by-packet basis among the involved stacks. In (19) a centralized internal sequence number scheme is applied to packets as well as to control information. Based on this, processing of packets or control information arriving out of order is postponed until a processor identifies them as the next to be processed. To avoid any synchronization problems associated with concurrent processing of connection control information, a separate disjoint processing stage is implemented. This is a proper design in cases where packet processing dominates control processing, for example, in the presentation layer. Since scheduling is performed on a per-packet basis, segments of the same data unit may be processed by different processing units. The reason is that during reception from the network it cannot be determined whether a segment of some higher layer packet is received. This problem of segmented data units can be avoided by using per-connection scheduling at the stack level.
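A centralized internal sequence number scheme of the kind described in (19) can be sketched as follows; the class name and interface are illustrative:

```python
import heapq

class Resequencer:
    """Tag packets with an internal sequence number at ingress and
    release finished packets strictly in that order; out-of-order
    completions are postponed until they are next to be processed."""

    def __init__(self):
        self.next_expected = 0
        self.pending = []  # min-heap of (seq, packet)

    def submit(self, seq, packet):
        """Hand in a finished packet; return all packets that are now
        releasable in order (possibly none)."""
        heapq.heappush(self.pending, (seq, packet))
        released = []
        while self.pending and self.pending[0][0] == self.next_expected:
            released.append(heapq.heappop(self.pending)[1])
            self.next_expected += 1
        return released
```

A processor that finishes packet 1 before packet 0 simply parks its result in the heap; the result is released as soon as packet 0 arrives.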
However, this requires an unambiguous identification of higher layer connections during packet reception from the network, that is, intelligent packet filters are needed. The results reported in (18) were achieved with up to four processors: a performance improvement of up to 3.5 was achieved with four processors. The results were highly dependent on the size of a data unit. Large data units allow for a higher degree of parallelism and, thus, lead to better performance. Few projects addressed function-level parallelism, some of them with dedicated hardware support (20,21), some using transputer networks (22,23). The XTP project even tried to directly couple protocol design and VLSI (very large scale integration) implementation (24). It failed with respect to the VLSI implementation. However, XTP became a very prominent protocol that highly influenced the development of communication systems over recent years, especially with respect to integrated services and multicasting. The projects that used transputer networks for parallel protocol processing applied either function-level parallelism (15,22) or entity-level parallelism (25). All of them needed to overcome the distributed memory architecture associated with transputer networks. In (25) a global memory for several transputers was implemented. Generally, these projects achieved some performance advantage compared to traditional implementations. However, with respect to the increased cost due to multiprocessing, these numbers are not very promising. This mismatch between cost increase and performance speed-up can be seen as one of the main reasons why parallel protocol processing has not succeeded so far. In (25) a speed-up of 1.55 is reported with two transputers and a speed-up of 2.17 with four transputers. The reason for the low speed-up numbers can be found in the uneven load balancing that occurs because the protocols were not designed for parallel processing.
In (22) a speed-up of 3.73 was reported with the use of eight transputers. Again, uneven load balancing is the reason for this comparatively low speed-up. In Ref. 26 a protocol processor consisting of multiple processors and memory modules was applied for the implementation of a light-weight transport protocol. The main characteristic of the protocol is the exchange of complete state information instead of state updates. Processing of 20,000 headers/s is expected to be feasible. The operating system was identified as the performance bottleneck. Located between entity-level parallelism and function-level parallelism are approaches that accelerate the processing of an entity by using dedicated hardware for some performance-critical functions, such as buffer management, timer management, and the like. In (27) a dedicated software process was in charge of buffer management. The XTP project, for example, targeted dedicated VLSI support of buffer management.
Furthermore, the usage of specialized memory chips, such as CAMs (content-addressable memories), was addressed in (20) in order to speed up timer management. For the implementation of the retransmission timer, CAMs were used. However, the usage of special memory did not succeed, mainly for cost reasons. Checksumming is a typical protocol function that can be implemented in hardware. In transport protocols it usually requires read access to the user data along with some arithmetic calculations. Depending on the location of the checksum in the data unit, this function can be processed on the fly during data movements, applying pipelined parallelism and reducing the number of memory accesses needed. Advanced network interfaces provide support for on-the-fly processing of transport layer checksums on the network adapter. Pipelining is usually applied at the interface between the network adapter and host processing. Efficient implementation techniques are required to avoid this interface becoming the system bottleneck. Mapping the adapter memory into the host's virtual memory is a possible way of avoiding data movements between host and adapter. Examples of memory-mapped network interfaces are (28) and (29).
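The idea of folding checksum computation into the data movement itself, so that the user data is read from memory only once, can be sketched in software. The 16-bit one's-complement sum is the style used by the Internet transport protocols (RFC 1071); hardware does the same accumulation while the bytes stream past:

```python
def copy_with_checksum(src: bytes):
    """Copy src and accumulate the 16-bit one's-complement sum in the
    same pass, instead of one copy loop plus one checksum loop."""
    dst = bytearray(len(src))
    total = 0
    for i, byte in enumerate(src):
        dst[i] = byte                                  # the data movement
        total += byte << 8 if i % 2 == 0 else byte     # high or low byte of the 16-bit word
        total = (total & 0xFFFF) + (total >> 16)       # end-around carry
    return bytes(dst), (~total) & 0xFFFF
```

The single pass halves the number of memory reads compared to copying first and checksumming afterwards, which is exactly the saving that integrated layer processing and on-adapter checksumming aim for.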
Summary

The advent of high-speed networks with a high path capacity stimulated the development of so-called light-weight transport protocols. This article provided an overview of the most important light-weight protocols and the novelties that were introduced by these protocols. Moreover, the evolution of TCP was briefly summarized with respect to high-speed networks. It can be observed that some of the mechanisms introduced within the context of light-weight protocols are now being used in TCP. Examples are implicit connection establishment and timer-based connection termination. Thus, it can be expected that TCP will further adopt promising mechanisms if they are required by dedicated applications. An example that is under discussion is the support of selective acknowledgments. Furthermore, it can be expected that TCP will stay among the most popular protocols; it will not be replaced by a light-weight transport protocol. However, with respect to multicasting, it might be replaced by transport protocols that have dedicated multicasting support. An excellent investigation of protocol mechanisms for reliable multicast protocols can be found in Ref. 30. The increased network speed also imposes higher processing requirements on communication systems. Parallel processing can be seen as one approach to increase processing power. The possibilities of applying parallelism to protocol processing have been briefly summarized, and some of the most interesting projects have been selected and presented with their main achievements. However, it must be stated that, generally, parallel protocol processing has not succeeded, for several reasons. Protocols are not designed for parallel processing, which leads to problems in load balancing among the processors involved. Thus, no linear speed-up can be achieved. This leads to the problem that the increased cost due to parallel processing cannot be justified.
The most promising approach can be seen in entity acceleration by implementing dedicated functions in hardware. Implementations compliant with this approach can be found quite often (e.g., hardware-based on-the-fly checksumming). Finally, it needs to be stated that even entity acceleration may become less important for certain protocol functions in the future, because processors provide more and more dedicated instructions for performance-intensive functions related to multimedia (e.g., video coding (31)).
BIBLIOGRAPHY

1. R. W. Watson, Timer-based mechanisms in reliable transport protocol connection management, Computer Networks, 5: 47–56, 1981.
2. R. W. Watson, The Delta-t transport protocol: features and experience, in H. Rudin, R. Williamson (eds.), Protocols for High-Speed Networks, Elsevier (North-Holland), 1989, pp. 3–18.
3. D. Clark, M. Lambert, L. Zhang, NETBLT: a bulk data transfer protocol, Request for Comments RFC 998, March 1987.
4. D. Clark, M. Lambert, L. Zhang, NETBLT: a high throughput transport protocol, Proceedings of the ACM SIGCOMM ’86, Stowe, VT, 1986, pp. 353–359.
5. D. Cheriton, VMTP: a protocol for the next generation of communication systems, ACM SIGCOMM ’86, Stowe, VT, pp. 406–415.
6. D. Cheriton, C. Williamson, VMTP as the transport layer for high-performance distributed systems, IEEE Commun. Mag., 27: 37–44, June 1989.
7. D. Cheriton, C. Williamson, VMTP: versatile message transaction protocol – protocol specification, Request for Comments RFC 1045, February 1988.
8. A. G. Fraser, W. T. Marshall, Data transport in a byte-stream network, IEEE J. Selected Areas Commun., 7: 1020–1033, September 1989.
9. W. Strayer, B. Dempsey, A. Weaver, XTP: The Xpress Transfer Protocol, Reading, MA: Addison-Wesley, 1992.
10. XTP Forum, Xpress transport protocol specification revision 4.0, Technical Report, March 1995.
11. W. Doeringer et al., A survey of light-weight transport protocols for high-speed networks, IEEE Trans. Commun., 38: 2025–2039, 1990.
12. D. Clark, Window and acknowledgement strategy in TCP, Request for Comments RFC 813, July 1982.
13. R. Braden (ed.), Requirements for internet hosts – communication layers, Request for Comments RFC 1122, October 1989.
14. W. Stevens, TCP/IP Illustrated, Volume 3, Reading, MA: Addison-Wesley, 1996.
15. M. Zitterbart, High speed transport components, IEEE Network Magazine, 54–61, January 1991.
16. D. Clark, D. Tennenhouse, Architectural considerations for a new generation of protocols, ACM SIGCOMM ’90, 200–208, September 1990.
17. B. Ahlgren, M. Bjorkman, P. Gunningberg, Integrated layer processing can be hazardous to your performance, in W. Dabbous, C. Diot (eds.), Protocols for High-Speed Networks, V, London: Chapman & Hall, 1996.
18. R. Lam, M. R. Ito, On the parallel implementation of OSI protocols processing systems, Proceedings IPPS 94, Cancun, Mexico, April 1994.
19. M. Goldberg, G. Neufeld, M. Ito, A parallel approach to OSI connection-oriented protocols, in B. Pehrson, P. Gunningberg, S. Pink (eds.), Protocols for High-Speed Networks, III, Elsevier (North-Holland), 1992, pp. 219–232.
20. D. Feldmeier, An overview of the TP++ transport protocol project, in A. N. Tantawy (ed.), High Performance Networks: Frontiers and Experience, Norwell, MA: Kluwer Academic Publishers, 1993.
21. T. F. La Porta, M. Schwartz, Performance analysis of MSP: a feature-rich high-speed transport protocol, IEEE/ACM Trans. Networking, 1 (6): 740–753, 1993.
22. M. Zitterbart, Parallelism in communication subsystems, in A. N. Tantawy (ed.), High Performance Communications, Norwell, MA: Kluwer Academic Publishers, 1994, pp. 177–194.
23. T. Braun, M. Zitterbart, A parallel implementation of XTP on transputers, 16th IEEE Conference on Local Computer Networks, Minneapolis, MN, October 1991, pp. 91–96.
24. Protocol Engines, PE-1000 Series, Protocol Engine Chipset, 1992.
25. M. Kaiserswerth, The parallel protocol engine, IEEE/ACM Trans. Networking, 1: 650–663, 1993.
26. A. N. Netravali, W. D. Roome, K. Sabnani, Design and implementation of a high-speed transport protocol, IEEE Trans. Commun., 38 (11): 2010–2024, 1990.
27. M. Kaiserswerth, A parallel implementation of the ISO 8802-2.2 protocol, IEEE TRICOMM ’91, Chapel Hill, NC, April 1991.
28. M. Blumrich et al., Virtual-memory-mapped network interface, IEEE Micro, February 1995.
29. R. Minnich, D. Burns, F. Hady, The memory-integrated network interface, IEEE Micro, February 1995.
30. D. Towsley, J. Kurose, S. Pingali, A comparison of sender-initiated and receiver-initiated reliable multicast protocols, IEEE J. Selected Areas Commun., 15 (3): 398–406, 1997.
31. A. Peleg, S. Wilkie, U. Weiser, Intel MMX for multimedia PCs, Communications of the ACM, 40 (1), January 1997.
MARTINA ZITTERBART
TU Braunschweig
Wiley Encyclopedia of Electrical and Electronics Engineering
Intelligent Networks
Standard Article
J. Place and J. Stach, University of Missouri—Kansas City, Kansas City, MO
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5310
Article Online Posting Date: December 27, 1999
Abstract

The sections in this article are: IN Telecommunications Standards Bodies; Transaction-Based Call Processing; Intelligent Network Conceptual Model; IN Services; New Telecommunications Services and the IN; IN Evolution; Conclusion.

Keywords: telecommunications; advanced intelligent network; AIN architecture
INTELLIGENT NETWORKS

Telecommunications systems around the world are facing dramatic changes. Customers are clamoring for new services, and technology is progressing at a disconcertingly fast pace. Concurrently, telecommunications service providers face many mandates from regulatory agencies, such as the maintenance of low-cost universal service. The only constant for both telecommunications providers and subscribers is rapid change. At the heart of these rapid changes is the evolution of a concept called the intelligent network (IN). The intelligent network has evolved as a way to speed up the development and introduction of telecommunications services and to provide those services in an efficient and cost-effective manner. The intelligent network is a concept designed to extend the capabilities of telecommunications systems. Our current telecommunications system includes both wireline and wireless networks, and both are heavily dependent on the IN. The IN was designed to provide telecommunications services independent of call connection services, i.e., switching processes. Additionally, the IN was designed to be independent of hardware, service providers, and network protocols and to be an overarching construct spanning all service providers. For example, prior to the deployment of the IN, 800 services required every switch to have the capability to translate the 800 number to the actual called number and to bill correctly. Additionally, every switch had to have the correct number translation table to connect the call correctly, and when new 800 numbers were added, the translation tables had to be updated in all switches. The IN moved the 800 service from the call connection system to the call services system; thus new numbers are added to an IN 800 database, which can be accessed by many switches. The IN separates call connection from service provision.
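The 800-service example above can be sketched as a single shared lookup replacing per-switch translation tables; all numbers and routing values here are invented for illustration:

```python
# Centralized IN service logic: one shared 800 database queried by
# every switch, instead of a translation table copied into each one.
IN_800_DATABASE = {
    "8005551234": "8165550100",  # 800 number -> actual routing number
}

def route_800_call(dialed):
    """A switch recognizes the 800 prefix and queries the shared IN
    database; adding a new 800 number then means one database update
    rather than an update in every switch. Returns None for an
    unknown 800 number."""
    if dialed.startswith("800"):
        return IN_800_DATABASE.get(dialed)
    return dialed  # ordinary numbers are connected directly
```

The separation of call connection (the `route_800_call` path in each switch) from service provision (the shared database) is exactly the split the IN introduces.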
IN TELECOMMUNICATIONS STANDARDS BODIES The International Telecommunication Union is the international body developing IN standards. The ITU is an agency of the United Nations. The European Telecommunications Standardization Institute (ETSI) recommends standards for Europe and Bellcore develops North American IN standards. International Telecommunication Union (ITU) The ITU was reorganized in 1993 into three sections: • Radio Communications (ITU-R). This section was formed by combining the International Radio Consultative Committee (CCIR) and the International Frequency Board (IFRB). The ITU-R is concerned with the allocation of the RF spectrum, and its regulations are distributed through the Radio Regulations Board. • Telecommunications Standards (ITU-T). This was formed from the International Consultative Committee
on Telegraphy and Telephony (CCITT). ITU-T recommendations work to standardize global telecommunications and are usually distributed through conferences and workshops. • Telecommunications Development (ITU-D). This was formed from the Bureau of Telecommunications Development (BTD) and develops technical specifications for international public telephone networks. ITU standards for IN are divided into capability sets, e.g., CS-0, CS-1, CS-2, and are the standard for Europe and Asia. The ITU documents are generally referred to as the ‘‘Q.12xy’’ series, where the ‘‘x’’ digit identifies the capability set. ‘‘Recommendation Q.1200’’ is the general CS-0 IN structure document (1). The ‘‘Q.1201’’ series includes recommendations for IN architecture principles (2), IN service plane architecture (3), global functional plane architecture (4), distributed functional plane architecture (5), physical plane architecture (6), and other key CS-1 interface components (7–12). Recommendations for IN CS-2 may be found in the ‘‘Q.122y’’ series, which includes an introduction to IN CS-2 (13), the service plane for IN CS-2 (14), and other CS-2 recommended standards (14–17). See the ITU World Wide Web site at http://www.itu.ch/ for more information. European Telecommunications Standardization Institute (ETSI) The ETSI defines technical standards that are generally consistent with ITU recommendations but do not fully adopt the ITU recommendation structure. Some ETSI standards documents are the IN user’s guide for CS-1 (18), IN ETSI:61010 (19), IN distributed functional plane (20), and others (21,22). See the ETSI World Wide Web site at http://www.etsi.org/sitemap/ for more information. Bell Communications Research (Bellcore) Standards for the IN in North America are called Advanced Intelligent Network (AIN) releases and are produced by Bellcore, e.g., AIN 0, 0.1, 0.2. See the Bellcore World Wide Web site at http://www.bellcore.com/ for more information.
The evolution of IN standards from both Bellcore and the ITU continues. Telecommunications Information Networking Architecture (TINA) and Information Networking Architecture (INA), which are evolutionary offshoots, are discussed separately later. The two IN standards from the ITU and Bellcore, never very far apart, are converging, and it is widely believed that the ITU standards eventually will prevail because of the necessity of producing global telecommunications standards that ease worldwide integration and interoperability. For more detail about telecommunications standards, see Refs. 23 and 24.

TRANSACTION-BASED CALL PROCESSING

New telecommunications services have one thing in common: They require extensive software and hardware support from the underlying telecommunications network. To better understand the current status of telecommunications, we need to look back in time. IN as an architecture is a natural evolution of the basic telecommunication network.
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright # 1999 John Wiley & Sons, Inc.
INTELLIGENT NETWORKS
The public switched telephone network (PSTN) in North America, prior to the breakup of AT&T, was served by a hierarchy of switches with five levels. Class 5 switches, or end-offices, terminated the subscriber local loop. Calls were routed from end-office to end-office over interoffice trunks, or they were directed to class 4 or higher switches for further routing depending on the called party and traffic congestion. The class 4 switch was called a toll center. Above it were three higher switch classes: class 3, class 2, and class 1. Call routing was a function of the called party location and the traffic volume. Obviously, the shortest route was always the best, but it was not always available, and thus the call was occasionally directed to higher-level switches for routing. With the breakup of AT&T and the subsequent separation of local and long-distance service, the upper layers of the switch hierarchy were abandoned for a fully interconnected network of digital switches. The end-office still terminates the subscriber loop but has become much more than a termination center. The end-office terminates traditional analog voice circuits, but it also has a connection to the mobile telephone switching offices (MTSOs) serving the wireless carriers for the area. The end-office also terminates T1 and higher-speed lines that bypass the switch and are connected to a local connection matrix as well as to the LEC's access tandem for inter-LATA and intra-LATA calls. Until direct distance dialing (DDD) was deployed in the early 1950s, all information needed to create connections was stored on individual telephone switches. Stored program control (SPC), which made telephone switches specialized computers, was implemented in network switches such as AT&T's No. 1 ESS, which was deployed in 1965 (25). No. 1 ESS provided residential services such as "call waiting" through software resident in the switch.
As SPC logic became more complex, it took longer to develop and test new services, and thus it became increasingly expensive and time-consuming to deploy each new innovation because each of the 15,000+ switches in the country had to be loaded with the new software. It took 3 to 4 years from conception to delivery for a new service. It was not uncommon to see switch logic exceed a million lines of program code. Many existing switches had to be extended by adjunct processors—special-purpose computers—to correctly interpret dialed numbers and execute their associated services. It became increasingly evident that call connection and call service activities had to be separated. AT&T first accomplished this when it introduced centralized databases to support "Calling Card" and "800 Service." These facilities were implemented at a network control point (NCP) and were accessed by a specialized signaling network called the common channel interoffice signaling (CCIS) network. In the 1980s the ITU defined common channel signaling system no. 7 (CCS7), which became the intelligent network interconnection mechanism. CCS7 is a packet network that is used for call setup (i.e., out-of-band signaling) and is separate from the resources used to connect the call. After the forced divestiture of the Bell System in the United States, the RBOCs also deployed centralized databases to support 800 service and alternate billing services. The switches, databases, and operations systems that formed these services were collectively referred to as Intelligent Network 1 (IN/1) (26,27). Bellcore recognized that the separation of interconnection and service provision facilities offered huge
potential for service development and began development of an expanded IN architecture called IN/2. RBOCs realized that IN/2 could not be implemented in a timely manner and focused on a subset of IN/2 functionality that could be deployed in stages over several years. At about the same time, switch vendors did not believe that they could deliver sufficient performance to support all of the specified services. Bellcore called a multivendor forum to resolve the concerns of all parties, and the results of this forum were published in March 1990 (28). The next stage of IN development was called the Advanced Intelligent Network (AIN) and was defined by a series of numbered releases starting with release 0. AIN 0.1 has been deployed, and AIN 0.2 enhancements are currently being deployed.

In the section entitled "Intelligent Network Conceptual Model," we describe the IN conceptual model. In the section entitled "IN Services," we discuss the IN service structure. In the section entitled "New Telecommunications Services and the IN," we discuss some new telecommunications services and their dependence on the IN. In the section entitled "IN Evolution," we describe the evolution of IN, and we conclude the chapter in the section entitled "Conclusion."

INTELLIGENT NETWORK CONCEPTUAL MODEL

Organization and Interface. The overarching purpose of the IN is to provide a framework to deploy advanced telecommunications services to subscribers. We describe IN components and their general function within the framework of providing these subscriber services. Although implementations differ slightly from vendor to vendor and from service provider to service provider, intelligent network services are delivered by well-defined interactions between switching systems and intelligent network service logic.
A key justification for the deployment of the intelligent network is to provide a telecommunications infrastructure that facilitates the implementation of sophisticated subscriber services. The best way to view the IN is as a reference model called the intelligent network conceptual model (INCM). The INCM is the basis for development of ITU IN standards (29). The INCM is a four-layer model which, from the bottom up, consists of the physical plane, the distributed functional plane, the global functional plane, and the service plane. The INCM is shown in Fig. 1. INCM layer descriptions follow.

IN Physical Plane. Entities in the physical plane perform the "real" functions that implement the IN—for example, they access and maintain support databases, route control packets, or make call-setup decisions. For more detail see Refs. 15, 24, and 30–32. The architecture of the physical plane is shown in Fig. 2.

• Service Switching Point (SSP). The SSP provides the access point to intelligent network services for the subscriber and executes the call model that describes actions to be taken by the IN service layer. The SSP detects IN service requests and formats requests for call implementation instructions from the AIN service logic. The SSP is a logical entity that coexists with the switch; that is, the SSP function has been embedded into switching points and is the access point into the IN.
• Service Control Point (SCP). The SCP provides the service control function (SCF) and the service data function (SDF). The SCP is responsible for responding to SSP queries resulting from interaction with the call model. The software modules that provide IN services are implemented in the SCP. The SCP provides instructions to the SSP on how to continue with call setup. The SCPs are duplicated—mated—for redundancy and to ensure proper response to service requests. The SCP is both a database and a processing environment. The processing environment contains service logic programs (SLPs) that provide specific services to requesting SSPs. The database portion provides information that is processed by the SLPs. For example, an SCP is responsible for number translation in an 800/888 service. The specific 800 SLP uses the dialed 800 number as a lookup argument to query its database for the real number of the called party. The SLP may also use time of day and caller location as additional criteria in selecting the phone number of the called party, e.g., dialing a regional number for a pizza delivery service could trigger an SCP query which would route the call to the location closest to the caller for faster delivery service. Modern SCPs are designed to handle thousands of transactions a second. As IN services expand, SCPs will become more narrowly specialized because it would be impossible for them to handle the volume of transactions that will be generated otherwise. Many researchers believe that almost all calls eventually will require an SCP transaction and many calls will generate several. With mandated local number portability service (LNP), sometimes referred to as the subscriber’s universal personal number (UPN), every call may require an SCP database lookup to find the real phone number to be used to route the call in the same way that 800/888 numbers require an SCP translation.
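The 800/888 translation flow just described can be sketched as a small lookup routine. The sketch below is illustrative only: the database contents, function name, and the day/evening rule are invented, not taken from any standard.

```python
from datetime import time

# Hypothetical SCP database for one 800 number: routing rules keyed by
# caller region, each with day and evening destinations.
SCP_DB = {
    "800-555-0100": {
        "east": {"day": "212-555-0111", "evening": "212-555-0199"},
        "west": {"day": "415-555-0122", "evening": "415-555-0188"},
    }
}

def slp_translate(dialed, caller_region, now):
    """Service logic program sketch: map a dialed 800 number to a real
    called-party number using caller location and time of day."""
    rules = SCP_DB[dialed][caller_region]
    # Treat 08:00-17:59 as the business day, everything else as evening.
    period = "day" if time(8, 0) <= now < time(18, 0) else "evening"
    return rules[period]

# The SSP would carry a query like this over CCS7 and resume call setup
# with the routing number the SLP returns.
print(slp_translate("800-555-0100", "west", time(14, 30)))  # → 415-555-0122
```

The essential point is that the translation table lives only on the SCP; the SSP merely sends the query and acts on the response.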
Figure 1. IN conceptual model.
• Service Data Point (SDP). The SDP is the platform for the standalone service data function (SDF). The SDP contains both customer and network data that is accessed as part of the execution of an IN service.
Figure 2. IN physical plane architecture.
• Intelligent Peripheral (IP). The IP is the platform that supports the specialized resource function (SRF). The IP responds to directives issued by the SCP or SSP and is used to play out announcements, synthesize speech, provide voice messaging, and perform speech recognition. The IP is accessed through an Integrated Services Digital Network (ISDN) link for better performance, or it may be connected to an SCP through the signaling subnetwork. Since the IP may be responsible for time-sensitive actions such as playing out announcements to the calling or called party, e.g., "Please enter your PIN now," a time-sensitive link is essential.

• Adjunct (ADJ). The ADJ provides the same service functions as the SCP and is considered functionally equivalent. The ADJ differs from the SCP only in the interface. The ADJ is connected to SSPs by high-speed links rather than the CCS7 network. Thus, the adjunct may be more suitable to support services that require very fast response. The adjunct usually provides specialized services and may be directly programmable by the service subscriber.
• Service Node (SN). The SN provides the CCF/SSF of the SSP, the SDF and SCF of an SCP, and the SRF of an IP in one physical entity. The SN allows highly specialized functions to be implemented in one physical device, thus reducing the CCS7 network latency and other interdevice communications overhead.

• Service Management Point (SMP). The SMP provides the service management function for all physical entities in the IN. The SMP allows maintenance and testing, and it provides the management interface between IN entities.

IN Distributed Functional Plane. Objects in the distributed functional plane (DFP) are called functional elements and are the service logic functions (software) associated with the hardware elements in the physical plane. Some of the functional elements in the DFP and their associated physical processors follow.

• SSF: Service switching function, which is associated with the SSP.
• SCF: Service control function, which is associated with the SCP.
• SDF: Service data function, which is associated with the SCP. In some implementations, the SDF can co-reside with the SCF in the SCP.
• SRF: Specialized resource function, which is associated with the IP.
• SMF: Service management function, which supports service creation, deployment, and maintenance, and is associated with the SMP.
• SCEF: Service creation environment function, which allows the specification and testing of IN services.

The DFP allows a software function to be viewed independently of the IN physical architecture. Obviously, the software function must at some time be physically implemented on a specific hardware platform once the function has been well-defined; however, this physical association becomes an engineering task.

IN Global Functional Plane. The components of the global functional plane (GFP) are called service-independent building blocks (SIBs). Subscriber services are defined and deployed by combining SIBs.
SIBs are standardized, architecture-independent functions that expect certain standard input arguments, such as calling number and dialed number, and produce certain standard output arguments, such as called party number. SIBs perform basic network functions such as collecting digits, verifying an ID number, or translating inputs. The abstractions of the GFP allow a service function definition that is independent of hardware. Thus IN engineers have some flexibility during function implementation. In Fig. 1, several SIBs (Xlate, Billing, Routing) are shown in the functional plane. These SIBs are used as building blocks to create features (800/888 service, call forwarding) in the service plane. A single SIB may be used to create several service features. For example, the three SIBs Xlate, Billing, and Routing are used to make up the 800/888 service feature. The Xlate SIB translates the 800/888 number into the actual called party number; the Billing SIB determines who should pay for call charges; and the Routing SIB uses time of day to determine the actual called party number. The same Routing SIB is used in the implementation of the call forwarding service feature.

IN Service Plane. Subscriber services are called features. Magedanz and Popescu-Zeletin (24) further divide features into "call-related" and "management-service-related" features. Call-related features include call waiting, speed dialing, and call forwarding, while management-related features deal with billing, network management, and service deployment.

IN SERVICES

The concept of IN service must be viewed from two perspectives. The first perspective is the definition of service as the addition of functionality for subscribers. Magedanz and Popescu-Zeletin (24) call this added functionality "value added" services. The second perspective involves video, voice, and signaling services and is called "bearer services." The IN service concept does not include specific subscriber services; rather, the IN is a platform for service development independent of a specific service definition. Included in each new IN version is an expanded set of components that are used to define and deploy specific subscriber services.

Basic Call Model

A special SIB called the basic call model (BCM) (24) coordinates the telephony and call processing portions of a call. The call model consists of a series of common subscriber actions such as "off-hook" and "digits dialed." A BCM is shown in Fig. 3. Figure 3 shows the call origination and termination BCM. Points in call (PICs) are checkpoints in call processing used by the SSP to determine if outside services are required (i.e., SCF services). The trigger control points (TCPs) in the trigger table indicate the specific service logic programs (SLPs) that diagnose and provide the requested service. SLPs make up the service control function and reside in the SCP. These SLPs are called to provide the service and guide the SSP in further processing the call.
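As a rough illustration of how SIBs compose into a feature, the sketch below chains hypothetical Xlate, Billing, and Routing functions into an 800-service feature. All names, data structures, and table contents are invented for illustration; real SIBs are specified abstractly by the standards, not as code.

```python
# Hypothetical SIBs: each takes and returns a small "call context" dict.
def xlate_sib(ctx):
    # Translate the dialed 800 number to a routable subscriber number.
    table = {"800-555-0100": "415-555-0122"}
    ctx["routing_number"] = table[ctx["dialed"]]
    return ctx

def billing_sib(ctx):
    # For 800 service the called party pays.
    ctx["billed_party"] = "called"
    return ctx

def routing_sib(ctx):
    # Route using the translated number (time-of-day logic omitted).
    ctx["route"] = ctx["routing_number"]
    return ctx

def make_feature(*sibs):
    """Global service logic sketch: chain SIBs into a service feature."""
    def feature(ctx):
        for sib in sibs:
            ctx = sib(ctx)
        return ctx
    return feature

# The same building blocks could be recombined into other features,
# e.g., call forwarding reusing routing_sib.
service_800 = make_feature(xlate_sib, billing_sib, routing_sib)
result = service_800({"dialed": "800-555-0100", "caller": "303-555-0177"})
print(result["route"], result["billed_party"])  # → 415-555-0122 called
```

The design point mirrored here is reuse: a single SIB appears unchanged in several feature chains.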
At each PIC, a decision is made in the switch about the need for AIN services. Figure 4 shows the relationship among CCF, SSF, and SCF logic. If services are required, call processing is suspended at the associated TCP. The trigger table indicates the service logic in the SCP required to progress the call, and the SCF returns call processing to the CCF logic after providing the requested service. At each PIC in the originating and terminating call model, the switch makes decisions about processing the call on a step-by-step basis. Actions taken by the calling or called party are noted by the SSP: "off hook" means the phone has been lifted out of its cradle, and "digits dialed" means the called party number has been dialed. The SSP determines at each PIC if additional outside services are necessary to complete the call. The specific action taken by the SSP depends on its SLP and on the specific PIC where the event took place. The critical point is that although the SSP still must recognize events at PICs, such as a subscriber dialing an 800/888 number, the SSP does not have to take action itself to provide the requested service; rather, the SSP off-loads the service request to another AIN component—an SCP or adjunct. The
Figure 4. IN service process.
actual service provision takes place outside of the SSP, thus freeing it for other tasks related more closely to routing calls. The PIC/TCP mechanism in the IN switch functions like a remote procedure call. Call processing in the switch is suspended while a service request message is transmitted across the CCS7 network to the appropriate service processor—either an SCP or an adjunct, depending on the type of service requested and the speed of the connection. The SLP in the service processor responds to the service request and creates a response message, which is formatted and transmitted across the CCS7 network back to the AIN switch. The call resumes processing in accordance with directions provided by the SCP or adjunct. A single PIC/TCP can be used to support a variety of services depending on the context of the call. At the PIC the SSP analyzes the data provided about the call, and the switch requests service from a specific SLP. The SLP may request additional information from the subscriber, such as a personal identification number. The SLP may invoke other processes, such as those that reside on an IP. The IP may be asked to play out a specific message or to collect digits from the subscriber. The notion of "backroom" processing by a network of service processors invisible to the subscriber allows the function of providing telecommunications services to be separated from the function of connecting calls because the service provision function has been off-loaded to other processors. Aside from greatly reducing the work required in the switch and allowing switch cycles to be more tightly allocated to call connection processing, off-loading the service function makes it easier to define new services, and the new services can be deployed much faster. The new service is deployed once on the SCP rather than in each switch in the PSTN.
The catch is, of course, that there must be a substantial infrastructure that allows global signaling among IN components; but once the infrastructure is in place, the deployment of new national telecommunications services is much easier. However, although service deployment is now easier, we are required to manage an increasingly complex signaling infrastructure.
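The suspend-query-resume behavior at a trigger control point can be sketched as below. The PIC names, trigger table layout, and message format are hypothetical stand-ins for what a real switch and SCP would exchange over CCS7.

```python
# Hypothetical trigger table: PIC name -> SLP armed at that point in call.
TRIGGER_TABLE = {
    "info_analyzed": "slp_800_translation",
}

def scp_query(slp_name, call_data):
    # Stand-in for the CCS7 round trip to the SCP: the named SLP
    # returns instructions for resuming call processing.
    if slp_name == "slp_800_translation":
        return {"resume_with": "415-555-0122"}
    return {"resume_with": call_data["dialed"]}

def process_pic(pic, call_data):
    """SSP logic at one point in call: fire a trigger if one is armed,
    otherwise continue basic call processing unchanged."""
    slp = TRIGGER_TABLE.get(pic)
    if slp is None:
        return call_data["dialed"]  # no trigger: route as dialed
    # Trigger armed: suspend, query the SCP, resume with its answer.
    instructions = scp_query(slp, call_data)
    return instructions["resume_with"]

print(process_pic("info_analyzed", {"dialed": "800-555-0100"}))
print(process_pic("collecting_info", {"dialed": "303-555-0177"}))
```

Note how the switch code never contains service logic; it only knows how to suspend at a PIC, ask, and resume, which is why new services need not touch every switch.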
Figure 3. Basic call model. (a) Originating call model. (b) Terminating call model.
Using the 800/888 number translation example, when number translation is done at an SCP connected to the AIN switch over a CCS7 network, the switch and the SCP work in parallel. In addition, new numbers can be added much more rapidly because all that is required is to update the SCP database instead of transmitting a new number translation table to all switches that perform number translation. For a new service to be deployed, its SLP is installed only on the SCPs that provide the service—not on every switch in the PSTN. Thus new services can be defined and deployed in an AIN environment in months instead of years.

Feature Interaction

A major IN problem is the unintended consequences caused by several features active together during one call. This is called feature interaction and refers to the problems caused when parties to a call have different feature sets active. For example, suppose a subscriber makes a call to the universal personal number (UPN) of another subscriber. Further, suppose the UPN translates to a number that is a toll call for the caller. Who pays the charges? The calling party expects to reach a local number and thus does not expect to be charged for the call. The called party is not interested in paying toll charges for unwanted calls, such as telemarketing calls, to his/her UPN number. How are these feature interactions to be handled? When a feature is introduced, its interaction with all existing features must be clearly defined. Thus when the UPN feature is introduced, the added complexity of possible toll charges must be included in the design of the feature. For example, if the called party UPN translates to a number resulting in toll charges, the UPN feature might notify the calling party that the call is a toll call and ask for billing acceptance. Additionally, the UPN feature may include a list of numbers for which the subscriber is willing to accept toll charges without notifying the calling party.
Clearly, the creation of features from SIBs is greatly complicated by the requirement to deal effectively with feature interaction.
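One way to encode the UPN toll-charge resolution described above is sketched below. The rule, the function name, and the data shapes are invented for illustration; a deployed feature would have to reconcile this rule against every other active feature.

```python
def resolve_upn_billing(caller, translated_is_toll, accept_list):
    """Hypothetical UPN feature rule: decide who pays and whether the
    caller must be asked for billing acceptance before a toll call."""
    if not translated_is_toll:
        # UPN resolved to a local number: normal billing, no prompt.
        return {"billed": "caller", "prompt_caller": False}
    if caller in accept_list:
        # Called party pre-accepted toll charges from this number.
        return {"billed": "called", "prompt_caller": False}
    # Otherwise notify the caller that the call is toll and ask.
    return {"billed": "caller", "prompt_caller": True}

accept = {"303-555-0177"}
print(resolve_upn_billing("303-555-0177", True, accept))
print(resolve_upn_billing("212-555-0111", True, accept))
```

Even this tiny rule shows why feature interaction is hard: the billing decision now depends on state owned by a different feature (the called party's acceptance list).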
Universal Personal Number

Personal number service, or universal personal number (UPN), is expected to be a big component of PCS, with 4.5 million personal numbers projected for the year 2000 (43). While the implementation of UPN differs from vendor to vendor (44), the single number applies regardless of the type of call—that is, FAX, voice, or e-mail. UPN allows a subscriber to be reached (conditionally) at a single telephone number regardless of his physical location. UPN also implies that wireless, voice, data, and FAX calls be automatically routed to the appropriate device. Clearly, the IN is critical to the development and deployment of UPN because as the service becomes available, an increasing percentage of called numbers require treatment. UPN will generally allow the subscriber to determine how he is to be reached and who is allowed access. Clearly, UPN depends on extensive and fast SCP processing.

IN EVOLUTION

The IN is a powerful mechanism for deploying telecommunications services. However, with this power comes great complexity. IN management is a critical component of the overall IN strategy. Management consists of operations, administration, maintenance, and provisioning (OAM&P). There are many IN component vendors, each with its own operations systems (OSs), that want to place their equipment in the Bellcore version of IN in North America and the ITU version of IN in Europe and Asia. Clearly, for the sake of interoperability, there must be movement to a common IN reference model, and there are several activities in this IN movement. These activities include the notion of an international telecommunications management network (TMN) (45) and an open distributed processing architecture (46). Bellcore is also developing a long-range view of IN in its information networking architecture (INA) (47,48) and in the telecommunications information networking architecture (TINA) consortium (49–54).

Telecommunications Management Network
NEW TELECOMMUNICATIONS SERVICES AND THE IN Personal Communications Services Personal communications services (PCS) is a good example of an IN transaction-intensive service (33–41). There are several generic functions necessary to support PCS. Several of those SCP functions are as follows: • Analyze time-of-day, calling number, and called number for routing or access directions to the SSP. • Collect and analyze user information such as personal identification number (PIN) for billing and call routing. • Database access and billing control functions are used to verify and update user database information and to create and store billing information. Bray (42) estimates that services, excluding PCS, will require SCPs to support query rates in the range of 1000 transactions per second.
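The PIN collection and verification function listed above could look roughly like the following on an SCP. The subscriber records, field names, and routing behavior are invented for illustration.

```python
# Hypothetical subscriber database held at the SCP.
SUBSCRIBERS = {
    "303-555-0177": {"pin": "4312", "roaming_number": "720-555-0144"},
}

def verify_and_route(upn, entered_pin):
    """SCP service logic sketch for a PCS call: verify the caller-entered
    PIN, then return the subscriber's current routing number."""
    record = SUBSCRIBERS.get(upn)
    if record is None or record["pin"] != entered_pin:
        # The SSP would direct an IP to play a denial announcement.
        return {"status": "denied"}
    return {"status": "ok", "route": record["roaming_number"]}

print(verify_and_route("303-555-0177", "4312"))
print(verify_and_route("303-555-0177", "0000"))
```

Since a lookup like this runs on every PCS call attempt, it makes concrete why the quoted query rates of roughly 1000 transactions per second matter for SCP sizing.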
One approach to consistent management across network components within a network and networks themselves is called Telecommunications Management Network (TMN) (55,60). TMN emerged in the early 1980s as a mechanism to effectively manage diverse operations systems developed by component vendors to support OAM&P efforts. TMN included a set of standards to ensure component and network interoperability. Conceptually, TMN produces a network of management systems. This overarching software system monitors and tunes entire telecommunications networks. Interfaces were standardized so that introduction of new components into the network was eased from the OAM&P perspective. While some success in this grand vision has been achieved, TMN has not yet been fully realized. There are several very difficult issues with the TMN concept that have not yet been resolved. Development stumbling blocks include the following: • TMN Complexity. Open Systems Interconnection (OSI) system management technology was selected as the basis
for TMN interfaces; and while these systems are very powerful, they are also quite complex. Thus, TMN interfaces are being deployed slowly.

• Legacy Systems. Developing TMN interfaces is very expensive, and thus developers require a strong incentive to deploy TMN. Since there are few TMN systems currently deployed on legacy systems, the incentive to develop TMN interfaces for new systems is weak.

• Alternative Protocols. TMN relies on OSI management standards, while the TCP/IP protocol suite uses the Simple Network Management Protocol (SNMP). Because TCP/IP is so widely available, there is pressure to use SNMP. However, since SNMP is simpler than OSI management systems, it is perceived to be less powerful.

TMN concepts are critical to the effective management of complex INs. The ability to automate these functions is essential to smooth network interoperability. See Magedanz and Popescu-Zeletin (24, Chapter 4) and other articles referenced here for more details about TMN concepts.

Information Networking Architecture

In 1990 Bellcore started working on a concept meant to be the successor to its AIN. The successor network concept was called the information networking architecture (INA). Basic INA concepts were specified by Bellcore in 1992 and 1993 and are described in Ref. 53. INA concepts require management software modules and service software modules to be separated and capable of working correctly anywhere in a distributed network of telecommunications service processors. The distributed processing environment of the INA concept has a kernel that will be present in every node and a set of transaction servers that provide the telecommunications service delivery function. See also Ref. 24, Chapter 5, for an INA overview.
Telecommunications Information Networking Architecture

At about the same time that Bellcore was developing the INA concepts, a group of network equipment vendors and network operators formed the Telecommunications Information Networking Architecture Consortium (TINA-C) (61) to specify an architecture that can support all network applications across all network types. Bellcore's INA concepts influenced the TINA-C, but the consortium took the evolution of the IN further. Specifically, TINA-C focused on four areas (24):

• Computing architecture concepts for designing and implementing a distributed computing environment based on the open distributed processing reference model.
• Service architecture concepts for designing and implementing the delivery of telecommunications services.
• Network architecture concepts for designing and implementing a transport network.
• Management architecture concepts for designing and implementing an OAM&P system across the distributed architecture.

The TINA initiative is well underway. There exist several proposals for TINA trials, most scheduled for mid-1998 (62).
CONCLUSION

Telecommunications service providers have dramatically expanded the services they offer to their subscribers. Also, the environment in which they operate has become much more complex, with competitive long distance, competitive local service, and competitive wireless service. Couple this dramatic increase in complexity with an increasingly demanding subscriber and we have a situation that forces us to carefully examine the service provision architecture. The evolution of the IN is in response to this growing complexity. The bright side of the IN is the speedy development of telecommunications services for sophisticated subscribers. The dark side of the IN is an increasingly complex entity that must be maintained and must evolve.

BIBLIOGRAPHY

1. ITU, Recommendation Q.1200: General series intelligent networks recommendations structure, Int. Telecommun. Union, Geneva, September 1997.
2. ITU, Recommendation Q.1201/I.312: Principles of intelligent network architecture, Int. Telecommun. Union, Geneva, October 1992.
3. ITU, Recommendation Q.1202/I.328: Intelligent network service plane architecture, Int. Telecommun. Union, Geneva, September 1997.
4. ITU, Recommendation Q.1203/I.329: Intelligent network global functional plane architecture, Int. Telecommun. Union, Geneva, September 1997.
5. ITU, Recommendation Q.1204: Intelligent network distributed functional plane architecture, Int. Telecommun. Union, Geneva, March 1993.
6. ITU, Recommendation Q.1205: Intelligent network physical plane architecture, Int. Telecommun. Union, Geneva, March 1993.
7. ITU, Recommendation Q.1208: Intelligent network interface recommendations for CS-1, Int. Telecommun. Union, Geneva, October 1995.
8. ITU, Recommendation Q.1211: Introduction to intelligent network capability set 1, Int. Telecommun. Union, Geneva, March 1993.
9. ITU, Recommendation Q.1213: Global functional plane for intelligent network CS-1, Int. Telecommun. Union, Geneva, October 1995.
10.
J. PLACE J. STACH
University of Missouri—Kansas City
Wiley Encyclopedia of Electrical and Electronics Engineering
Internetworking, Standard Article
Carl A. Sunshine, Aerospace Corporation, Los Angeles, CA
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5312
Article Online Posting Date: December 27, 1999
Abstract: The sections in this article are History of Computer Internetworking, Major Technical Issues, Major Internet Examples, and Future Directions.
INTERNETWORKING
The terms internetworking or network interconnection refer broadly to the techniques that enable computer systems on one network to communicate with systems on another network. The set of interconnected networks may be called an internet. (We shall use the proper name Internet for the particularly well known global internet that has come to dominate internetworking in the 1990s.) A major challenge for internetworking is to allow different types of networks to participate. A variety of network technologies and products have been devised to provide efficient data communication through different media (twisted pair copper wires, optical fibers, coaxial cable) and over various distances, such as within a building, across a campus, and between widely separated locations. Recently, wireless data communications networks (ground and satellite based) have become more prevalent to support mobile users or remote locations. Providing for all types of networks to be interconnected so that users on one network can effectively communicate with users on other networks adds great value to the system.

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
However, each network technology comes with its own characteristics for speed, format, reliability, and protocols which define the format and procedures for data exchange (1). There are good technical and marketing reasons for these different solutions, so diversity in network technologies is likely to persist. This suggests that for a network interconnection strategy to succeed, it must accommodate the autonomy and differences of individual networks to the greatest extent possible. On the other hand, some commonality of services must be supported if communication between users on different networks is to succeed. These two requirements represent a tension, within which a variety of interconnection approaches has been devised (2–5). Typically, some additional equipment is required to interconnect two different networks, by connecting to both networks through appropriate interfaces and implementing any necessary additional protocols (see Fig. 1). These intermediate devices that create the internet from its component networks may be called gateways, or routers, since one of their key functions is to forward incoming data in the proper direction to reach its ultimate destination, possibly many networks away. To accomplish this, a higher-level internet addressing scheme must be provided that can identify destinations across all of the networks in the internet. The routers must then determine from this internet address where to send the data next and how to package data in the local protocol used within the next individual network. The basic operation of an internet is much like that of the postal service. The sender of a letter places it in an envelope with the address of the destination and drops it in the mail. The local postal service then reads the address and delivers the letter to an appropriate forwarding office, using whatever transport mechanism is most suitable (bicycles, trucks, planes). 
At the forwarding office, the letter is sorted and forwarded again, until it reaches the final post office, which can deliver it to the destination. The postal service is not concerned with the contents of the letter, although it must conform to certain maximum size and weight limits (which may vary in different postal systems). Thanks to certain international agreements, there is enough commonality in mail services and the languages used for addresses that the basic mail forwarding service can be provided successfully, even if the contents might not be understood. Similarly, in an internet, the data to be sent are bundled into packets, with an ‘‘envelope’’ of header and trailer information including the source and destination internet addresses. Each network delivers these to an appropriate router, which uses the header information to determine how to forward the packet onward. In Fig. 1, the sending host A
[Figure 1. Routers interconnect networks to form an internet. (Hosts A and C sit on LAN X, host E on LAN Y, and hosts B and D on LAN Z; router R joins LAN X to WAN W, router S joins WAN W to LAN Z, and router T joins LAN X to LAN Y.)]
sends packets to destination B via local area network (LAN) X to router R, which in turn forwards the packet through WAN W to router S, which finally forwards the packet via LAN Z to host B. To support all types of data communication applications, the internet must be able to forward arbitrary data inside the packet, so long as the size is acceptable and the "envelope" information is properly formed. If the end host systems and the intermediate routers all implement a common internet protocol to handle the basic addressing and routing functions, the data can reach their destination anywhere in the system. In practice, additional issues, such as congestion control, fragmentation, and multiplexing, must also be dealt with (3). We first summarize how network interconnection has developed historically. We then review the major technical problems of network interconnection, including stepwise versus endpoint services, level of interconnection, addressing, routing, fragmentation, and congestion control, ending with a summary of functions performed by a router. Next we present several important examples of internet systems that illustrate the technical alternatives, and we conclude with some directions for further research.
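The hop-by-hop forwarding model just described can be sketched in a few lines. This is an illustrative sketch only: the packet fields, table contents, and topology below are assumptions modeled loosely on Fig. 1, not the actual IP header format.

```python
# Sketch of internet-style forwarding (hypothetical names, not real IP).
# A packet carries an "envelope" (header) with source and destination
# internet addresses; each router forwards it one hop at a time using
# its own table until the destination's network is reached.
from dataclasses import dataclass

@dataclass
class Packet:
    src: str        # source internet address, e.g. "A"
    dst: str        # destination internet address, e.g. "B"
    payload: bytes  # arbitrary application data, opaque to the routers

# Hypothetical per-router next-hop tables for a topology like Fig. 1:
# router R reaches B and D via S across the WAN; S delivers them directly.
NEXT_HOP = {
    "R": {"B": "S", "D": "S", "E": "T"},
    "S": {"B": "B", "D": "D"},
    "T": {"E": "E"},
}

def forward(router: str, pkt: Packet) -> str:
    """Return the next hop to which this router sends the packet."""
    return NEXT_HOP[router][pkt.dst]

pkt = Packet(src="A", dst="B", payload=b"hello")
hop = forward("R", pkt)    # router R sends the packet across the WAN to S
final = forward(hop, pkt)  # router S delivers it to host B on its LAN
```

Note that no router needs to know the full path; each one only decides the next hop, which is what makes the scheme scale across many networks.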
HISTORY OF COMPUTER INTERNETWORKING

Computer networking as we know it today may be said to have gotten its start with the ARPANET development in the late 1960s and early 1970s under the sponsorship of the Advanced Research Projects Agency (ARPA) in the United States. Prior to that time there were computer vendor "networks" designed primarily to connect terminals and remote job entry stations to a mainframe. But the notion of networking between computers viewing each other as equal peers to achieve "resource sharing" was fundamental to the ARPANET design (6). The other strong emphasis of the ARPANET work was its reliance on the then novel technique of packet switching to share communication resources efficiently among users transmitting intermittent bursts of information, instead of the more traditional dedicated links or circuit switching which supported steady rate transmission well. Although the term network architecture was not yet widely used, the initial ARPANET design did have a definite structure and introduced another key concept: protocol layering, or the idea that the total communications functions could be divided into several layers, each building on the services of the one below. The original design had three major layers, a network layer that included the network access and switch-to-switch (IMP-to-IMP) protocols, a host-to-host layer (the Network Control Protocol, or NCP), and a function-oriented protocol layer, where specific applications such as file transfer, mail, speech, and remote terminal support were provided (7). Similar ideas were being pursued in several other research projects around the world, including the Cyclades network in France (5), the National Physical Laboratory network in England (8), and the Ethernet system (9) at Xerox Palo Alto Research Center in the United States. Some of these projects focused more heavily on the potential for high-speed local networks such as the early 3 Mbps Ethernet.
Satellite and radio channels for mobile users were also a topic of growing research interest.
Creation of the Internet Protocol

By 1973 it was clear to the networking vanguard that another protocol layer needed to be inserted into the protocol hierarchy to accommodate the interconnection of diverse types of individual networks. Cerf and Kahn published their seminal paper describing such a scheme (10), and development of the new Internet Protocol (IP) and Transmission Control Protocol (TCP) to jointly replace the NCP began. Similar work was being pursued by other groups meeting in the newly formed International Federation of Information Processing (IFIP) Working Group 6.1, called the Internetwork Working Group (11). The basis for the network interconnection approach developing in this community was to make use of a variety of individual networks, each providing only a simple "best effort" or datagram transmission service. Reliable virtual circuit services would then be provided on an end-to-end basis with the TCP (or similar protocol) in the hosts. ARPA sponsored an effort to form a national internet based on TCP/IP protocols and connecting research groups with local networks via the ARPANET. This system gradually grew international extensions and eventually became the Internet.

Other Internetworking Approaches

During the same time period, public data networks (PDN) were emerging under the auspices of what is now the International Telecommunications Union (ITU), then known as CCITT. The newly defined X.25 protocol aimed at providing more traditional virtual circuit types of network service that guaranteed reliable end-to-end delivery (1). The PDNs devised an interconnection scheme based on concatenating virtual circuits across each network (12). The middle and late 1970s saw networking conferences dominated by heated debates over the relative merits of circuit versus packet switching and datagrams versus X.25 virtual circuits (13).
The mainframe computer vendors continued to offer their proprietary networks, gradually supporting the new X.25 service as links under their own protocols. Digital Equipment (DEC) was the notable exception, adopting the research community approach of peer-to-peer networking at an early date and coming out with its own new suite of protocols (DECNET). By the late 1970s, a new major influence was emerging in the computer networking community. The computer manufacturers realized that multivendor systems could no longer be avoided and began to take action to satisfy the growing user demand for interoperability. Working through their traditional standards body, the International Standards Organization (ISO), a new group (Study Committee 16) was created to develop standards in the networking area. Their initial charter was to define an explicit architecture or "reference model" for Open Systems Interconnection (OSI) (1). They formalized the concept of protocol layering to facilitate the design of increasingly complex communications software. In a layered architecture, the communications functions in each system are partitioned into a set of layers, with each layer making use of the functions provided by the layer beneath. This allows modifying the protocol within a layer so long as the functions provided upward and used below are maintained.

Interconnection of LANs

Another force contributing to the growth of internetworking was the introduction of personal computer networks, initially
for business purposes. Both Apple Computer and Novell introduced networking software in the mid-1980s that allowed multiple LANs to be interconnected, with sharing of files and printers. The work on Ethernet at Xerox was extended to allow interconnection of LANs over long-distance links (14). The breakup of the long-distance phone monopoly in the United States in 1984 provided competition and a rapid drop in prices for higher-speed links to interconnect the growing business LANs at various sites. Such links also provided greater bandwidth for interconnection of the growing number of TCP/IP networks at university and research sites. This led to formation of the first high-speed national TCP/IP network by the National Science Foundation and a further growth of TCP/IP systems to include commercial sites. The Internet Engineering Task Force (IETF) was formed to guide the further evolution of the TCP/IP internet, later known as the Internet. Meanwhile, the CCITT and ISO camps aligned their efforts, with OSI adding an internet sublayer within the network layer to accommodate the datagram internetworking approach beside the virtual circuit approach. This new OSI protocol family functioned much like the TCP/IP suite. Many proponents of the OSI stack expected it to succeed the TCP/IP suite, and it enjoyed considerable acceptance in Europe and the Far East. The United States government mandated its inclusion in all network purchases through the Government Open Systems Interconnect Profile (GOSIP).

Dominance of the Internet

In the mid-1990s, several factors contributed to the growing dominance of the TCP/IP system, which came to be called simply the Internet. Free software for the TCP/IP suite was widely available. The invention of hypertext browser software (the original was called Mosaic) made hypermedia information throughout the Internet easily accessible.
With the tremendous growth in PCs and the discovery of the Internet for personal and general business use, demand for connectivity accelerated dramatically, and by the late 1990s there were millions of connections to the Internet. The protocol suite developed by the researchers in the 1970s is now an essential basis for a vast array of personal and enterprise information exchange. In the process of growth, some modifications to the original internet protocols have been proposed, but the fundamental principles remain valid.

MAJOR TECHNICAL ISSUES

As noted previously, an internet must deal with basic issues common to any switching system, such as addressing, routing, congestion control, fragmentation, and multiplexing. The following sections focus on the extra concerns that are important at the internet level in each of these areas, along with a discussion of alternatives for the level at which to interconnect networks.

Naming, Addressing, and Routing

To understand the problem of delivering data to the correct destination in an internet, a clear distinction must be drawn among names, addresses, and routes (15). Although these concepts are applicable at each protocol level, we shall be primarily concerned with the network level, where hosts or end systems and routers are the relevant objects. A name serves
to identify the host "logically," independent of its point(s) of attachment to the network(s). The same host may have several names to provide for convenient "nicknames" or aliases. An address identifies a point of attachment for purposes of delivering data to the host; since the same host may have multiple network interfaces, it may have multiple addresses. Finally, a route is the path taken from source to destination host (the sequence of intermediate nodes that the packet traverses), and there are typically multiple routes available to the same destination. The process of sending data to a destination generally involves first determining its address from its name using a directory service, and then determining the best route to that address. In large systems, this name lookup function is typically implemented in a distributed fashion, with a hierarchical name space where subdirectories are responsible for their portion of the name space (16).

[Figure 2. Hierarchical addresses simplify global addressing and routing: an internet address is formed from a network prefix plus a local suffix.]

Figure 3. Routing table for Router R in Fig. 1. Table is larger with flat internet addressing (a) and smaller with hierarchical addressing (b).

(a) Flat internet addressing
Dest.   Next Address   Port
A       A              X
B       S              W
C       C              X
D       S              W
E       T              X

(b) Hierarchical internet addressing
Dest. Net   Next Address   Port
X           Local          X
Y           T              X
Z           S              W

Addressing. As noted earlier, packets traversing an internet include a header specifying the internet address of the destination host. Internet addresses must provide a unique identifier for each network interface in the internet system. In small internet systems with broadcast media (such as Ethernet or token rings) it may be sufficient to use a "flat" address format where the addresses provide no indication of the location of the host's interface. In large internet systems, it is essential to introduce a hierarchical internet address format where an explicit "network" prefix is combined with a "local" suffix to form a complete address (Fig. 2). The network prefix identifies the destination network (or closely related set of networks), and the local suffix identifies the destination host interface within that network. In the Internet, the original internet address format was 32 bits, with either 8, 16, or 24 bits allocated for the network prefix. This allowed for a small, medium, or large number of nets which each contained a large, medium, or small number of hosts, respectively. As the Internet expanded, a more flexible scheme was developed called subnetting (17) which allows a locally interconnected group of networks (such as LANs on a campus) to appear as a single network to the rest of the Internet. Local routers in the group then use the first few bits of the "local" address to properly distinguish between the different "internal" networks. While subnetting has allowed successful expansion of the Internet for many years, the newly adopted Internet Protocol Version 6 provides for longer 128-bit addresses to facilitate future growth. Another addressing issue in forming successful internets is how to create the network level addresses needed to transmit a packet through each individual network along its path. As described later, routing tables in each host or router provide the internet address to which each incoming packet must be forwarded, but this must be translated to a network level address that can be "understood" by the next network. In some cases the local portion of the internet address can be
used or translated directly into a network address (e.g., the IMP and port numbers in the original ARPANET). In other cases, the local portion must be determined or "resolved" using tables created by an address resolution protocol (ARP) (17). The ARP dynamically discovers the network addresses (and internet addresses) of hosts and routers connected to a particular network, and maintains a table giving the correspondence between network and internet addresses on a given network.

Routing. As a packet traverses the internet, the source node and each router must take the destination internet address and determine the next place to send the packet. Normally this is done using a routing table, or data structure containing destination addresses and how to reach them. With a flat internet address space, routers must maintain information on how to reach each destination individually, and hence have large routing tables. This approach is usable in smaller internet systems, such as interconnected LANs. Figure 3(a) shows a routing table for Router R in Fig. 1, containing a list of destination addresses and the address of the next hop. The next hop specifies the internet address of the next router to use when forwarding the packet, which in turn determines which "port" or network interface to use. The routing subroutines usually index entries with a hash, tree, or other efficient lookup mechanism to speed the process and may contain additional information for each entry, like age and frequency of use. Routing tables may be constructed according to many requirements, such as lowest delay or cost or highest availability. In some cases they are best created statically when the end system or router is configured. For example, host A in Fig. 1 need only create a single default route to Router R, since that is the only path into the internet.
Though static tables for singly connected end systems are commonplace, other routing tables are commonly altered dynamically to represent current link and router availability. Accomplishing this task in large networks or internets with high reliability, efficiency, and timeliness has been a very challenging problem that has led to the development of many routing information exchange protocols that balance complexity, optimality, and required processing speed (1,18). For large internets, a hierarchical address is often used in the internet protocol, so that routing can be done in steps. First the gateways route packets to the final network (ignoring the local suffix), and then within the final network to the local address. With this approach the routing table contains an entry for each net, rather than for every destination host.
This reduces the size of routing tables, as shown in Fig. 3(b), with the potential for some loss in optimality. Despite considerable recent progress in routing algorithms, computer and human error conditions can still occur that create routing loops, a condition among a set of routing tables in which packets repeatedly traverse the same set of intermediate systems, never reaching their ultimate destination. To prevent these conditions from congesting the network indefinitely, internetwork protocols specify a hop count or time to live field that is decremented by each router. If the hop count ever reaches zero, the packet is discarded. Senders normally set this field equal to or greater than the longest normal path through the internet. Another design choice concerns the frequency of routing decisions. For maximum robustness, each packet may cause a best route selection process to be carried out, as in the original ARPA internet (19). Other systems choose to perform the best route determination process only for the initial packet to a destination. This route is then remembered in the routers, and subsequent packets to the same destination follow the same route. Often this type of path setup is accompanied by an abbreviated addressing convention, where only the first packet must carry the full destination address and subsequent packets carry only a shorter path identifier. The CCITT X.75 and the new IPv6 use this approach. A mechanism for timing out such routes and recovering from changes in the internet topology must be provided. Yet another approach employs flooding to avoid the need for intelligence in packet forwarders. Since flooding is expensive in its use of network resources, it is typically used only for control purposes or for initially establishing a path that later packets to the same destination will follow.
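The hop-count safeguard described above can be sketched in a few lines. The field names are illustrative, not the actual IP header layout.

```python
# Minimal sketch of the hop-count (time-to-live) safeguard: each router
# decrements the field and discards the packet when it reaches zero, so a
# packet caught in a routing loop cannot circulate indefinitely.
def forward_with_ttl(packet):
    """Decrement the hop count; return None when the packet is discarded."""
    packet["ttl"] -= 1
    if packet["ttl"] <= 0:
        return None          # discarded: likely trapped in a routing loop
    return packet            # otherwise keep forwarding toward the dest.

# A packet trapped in a loop is dropped after at most `ttl` hops:
pkt = {"dst": "B", "ttl": 3, "payload": b"data"}
hops = 0
while pkt is not None:
    pkt = forward_with_ttl(pkt)
    hops += 1
# hops == 3: forwarding terminated even though the route never converged
```

This is why senders set the initial value to at least the longest normal path: too small a value would discard packets on legitimate long routes.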
Another method of routing called source routing allows the sender to avoid the need for intelligent routers or to force a specific path to be used by providing a route in the packets it sends (20).

Congestion Control

The problems of congestion control in an internet system are much like those of individual networks. Speed mismatches are likely to be more severe between LANs and slower wide area networks (although recent advances in high-speed WAN service should reduce this). In some cases, the individual network procedures may be adequate [e.g., Asynchronous Transfer Mode (ATM) quality-of-service parameters]. In others, some form of explicit internet-level control may be needed. Questions have been raised about the ability of connectionless systems to provide effective congestion control. This is a particular concern when connectionless or datagram internet service is used to support higher-level connection-oriented services. Several techniques have been proposed in this area, including input buffer limits, buffer classes, fair queuing, slow start, and choke packets (1,21). Once the sender has determined that congestion has occurred (by receiving an explicit signal from a host or router or by timing out waiting for an acknowledgment), it must reduce its transmission rate for a while and then try to increase it again. Various specific algorithms for this purpose have been proposed, and this is an active area of research.

Fragmentation and Reassembly

When networks with differing maximum packet size limits are interconnected, the need to fragment large packets for
traversal through networks with smaller size limits must be considered. The original packet is broken into two or more new packets, each small enough to transmit over the next network. These fragments can be reassembled at the exit from the individual small packet network or allowed to propagate all the way to the final destination. Mechanisms to support such fragmentation typically include some sort of additional sequencing information in the packet header. The most robust mechanisms allow further fragmentation of already created fragments and proper reassembly of fragments at the final destination that may have followed different paths. In general, fragmentation is undesirable because of the processing burden placed on routers and because of the possibility of inefficient link utilization. For example, a fragment that fills one network packet may have to be fragmented at a subsequent router into one large and one very small piece. The very small piece has a large "overhead" (its header is large relative to the data it carries), which uses resources inefficiently. To help alleviate this problem, the internet protocol suite may provide for an advisory message to be transmitted back to the source of large packets, indicating that they are too big for the router to forward without fragmentation.

Level of Interconnection

The previous discussion has assumed that networks are interconnected at the network level of the protocol hierarchy, since this is the dominant approach in use today. However, other levels of interconnection may also be chosen, from the lowest (physical) level to the highest (application) level. In general, the lower the level of interconnection, the more similar the networks to be connected must be, while high-level interconnections support more specialized services. When different networks and protocols are involved, the interconnection involves a conversion process between the services provided for comparable functions in each network (22).
The complexity of this process and the quality of the resulting end-to-end services are largely determined by the level of interconnection chosen. The following sections summarize the key features of each major alternative.

Physical Level. The physical level deals with serial transmission of bits over a physical medium. Interconnection devices operating at the physical level are generally called repeaters. They forward individual bits of the packet as they arrive, perhaps translating from one medium to another (e.g., baseband coaxial cable to optical fiber). The resulting interconnected system functions essentially as a single network at the data link level, and hence all networks to be so connected must have identical data rates and link protocols. This approach is typically used to interconnect several physically separate segments of a LAN system, perhaps separated by a point-to-point link. A disadvantage is that repeaters propagate noise and interference as well as valid data.

Link Level. The link level deals with transmission of frames over a link, which may be shared by multiple users. Interconnection devices operating at the link level receive entire frames from one link, examine the link-level protocol header, and possibly forward the frame onto another link. They are typically called bridges (20). As with repeaters, they may interconnect two or more local LAN segments or may interconnect remote segments over a long-distance link. Major motivations for their use are to interconnect LAN segments with different speeds and/or protocols, or to increase network capacity by "filtering" incoming packets and forwarding only those whose link-level destination is on another segment. Bridges thus accommodate parallelism by permitting simultaneous use of both segments. Moreover, bridges transparently support systems with multiple network-level protocols in use.

Network Level. The network level deals with transmission of packets over a network that may include intermediate switches. Traditionally, interconnection at the network protocol level has been a WAN problem, where different networks had independently developed different protocol mechanisms for the variety of network-level functions, such as routing, congestion control, error handling, and segmenting. If the networks are identical, then the problem becomes largely one of routing, as with the X.25/X.75 approach in public data networks. When the networks differ, the complexity of protocols at the network level (e.g., X.25 versus ARPANET 1822) makes a translation approach difficult. There has been some success in one vendor emulating another vendor's network behavior [e.g., IBM Systems Network Architecture (SNA) gateways]. The approach that has gained wide acceptance in the Internet places a common IP sublayer on top of the different network protocols. As noted previously, this has particular benefits for supporting the sophisticated routing procedures needed for large internet systems, and devices operating at this level are often called IP routers. Choosing this level for interconnection makes available the general-purpose services of the network level and allows the router implementor to take advantage of what is normally a well-documented interface with many implementations.
It allows each network to function autonomously with its own procedures internally, while requiring some standard "internet" procedures to be used on top of the normal network access for individual networks.

Transport Level. The transport layer is intended to provide general-purpose data transfer between end users. In the OSI architecture, the transport service is supposed to be an end-to-end service, so transport-level gateways are, strictly speaking, a violation of the architecture. Nevertheless, they may be of practical benefit when common upper-level protocols are in use but different transport protocols are available. Early experiments with the competing protocol hierarchies demonstrated connections of this nature (for example, concatenating TCP and ISO TP4 connections to each other).

Higher Level. Many application-level gateways have been implemented to support specific services found at the application level. This type of gateway is essentially a "Janus host" that implements two (or more) full protocol suites. Common examples have been interconnecting terminal concentrators or CCITT packet assemblers/disassemblers (PADs) to provide an interactive terminal service, or electronic mail servers to form a mail forwarding service. Where only a specific application service is wanted and the desired application services on
each net match closely, this type of gateway may be easy to set up with existing equipment. However, the service provided is clearly not general purpose, and the limitations imposed by providing only those service elements common to the interconnected systems are often more irksome than anticipated (23).
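The link-level "filtering" performed by bridges, described earlier, can be sketched as a small forwarding decision. This is a hedged illustration rather than any particular bridge implementation; the two-port topology and the frame representation are assumptions:

```python
# Sketch of a learning bridge: it learns which segment (port) each source
# address lives on, and forwards a frame to the other segment only when the
# destination is not known to be local. Two ports (0 and 1) are assumed.

class LearningBridge:
    def __init__(self):
        self.table = {}                 # MAC address -> port it was last seen on

    def handle(self, in_port: int, src: str, dst: str):
        """Return the port to forward on, or None to filter (drop) the frame."""
        self.table[src] = in_port       # learn the sender's location
        out_port = self.table.get(dst)
        if out_port == in_port:
            return None                 # destination is on the same segment: filter
        if out_port is not None:
            return out_port             # known destination on the other segment
        return 1 - in_port              # unknown destination: flood to the other port
```

Because frames whose destination is local are filtered rather than forwarded, both segments can carry independent traffic at the same time, which is the parallelism noted above.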
MAJOR INTERNET EXAMPLES

The following sections illustrate the application of the technical issues discussed previously in several widely used internet systems.

The Internet

One of the first major internet systems was developed by ARPA in the United States (24,25). This system included the original ARPANET, packet radio nets, satellite networks, and various LANs. The system was subsequently split into separate systems for research users and for operational military users and eventually evolved into the Internet. Networks in the Internet are interconnected by routers that implement a connectionless or datagram IP (19,26,27) to provide maximum robustness and routing flexibility. The system originally employed dedicated router machines based on general-purpose 16-bit minicomputers, but special-purpose high-speed routers are now manufactured by a variety of vendors. Each datagram is analyzed by the routers and routed based on its destination address. The Internet uses hierarchical 32-bit addresses, with routers designed to route to the network portion of the address first, and then the local portion once the correct net is reached. As described earlier, subnetting has been introduced to allow more efficient and flexible use of address space. Host name to address lookup was initially supported by a single flat directory, but as the number of hosts grew, a hierarchical distributed directory service [the domain name system (DNS)] was adopted (17), which now can access millions of names throughout the world within a few seconds. Most of the individual networks in the Internet provide connectionless service, although there is a provision for running IP over connection-oriented network services such as X.25 and ATM. The major transport service is connection oriented, implemented by a common protocol called the transmission control protocol (TCP) that must be present in the end systems (not in routers).
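The two-stage address handling described above (match the network portion of the address first, then deliver within that network) can be sketched with Python's standard ipaddress module. The forwarding table and next-hop names here are illustrative assumptions:

```python
# Sketch of hierarchical IPv4 forwarding: a router matches the network
# portion of the 32-bit destination address (explicit prefixes/masks here),
# and only the final network delivers on the local portion.
import ipaddress

# Hypothetical forwarding table: network prefix -> next hop.
routes = {
    ipaddress.ip_network("10.1.0.0/16"): "router-A",
    ipaddress.ip_network("10.2.0.0/16"): "router-B",
}

def next_hop(dst: str):
    addr = ipaddress.ip_address(dst)
    for net, hop in routes.items():
        if addr in net:            # mask the address and compare network bits
            return hop
    return "default-gateway"
```

Subnetting fits the same scheme: a site advertises one prefix externally while its interior routers match longer prefixes within it.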
IP also supports other types of transport protocols, including datagram and "stream" modes (for packetized voice or video). The Internet IP provides for fragmentation at routers, with reassembly at the final destination, so that individual fragments may follow different routes. A time-to-live or hop limit field is included to limit the maximum lifetime of packets in the system, providing an essential part of the overall routing system. Options are defined to allow inclusion of source routes, security markings, timestamps, and so on. There is a separate Internet Control Message Protocol (ICMP) used for signaling errors and diagnostic information. This includes destination unreachable, congestion control (choke packets), packet too big, echo request/reply, and redirect indications (giving a better route for a specific destination).
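The fragmentation-and-reassembly behavior described above (fragments carry sequencing information and may arrive out of order after following different routes) can be sketched as follows. The tuple-based fragment format is an assumption for illustration, not the actual IP header layout:

```python
# Sketch of IP-style fragmentation: each fragment carries an offset and a
# "more fragments" flag so the destination can reassemble, even when
# fragments arrive out of order.

def fragment(payload: bytes, mtu: int):
    """Split payload into (offset, more_fragments, data) pieces of at most mtu bytes."""
    frags = []
    for offset in range(0, len(payload), mtu):
        chunk = payload[offset:offset + mtu]
        more = (offset + mtu) < len(payload)   # True unless this is the last piece
        frags.append((offset, more, chunk))
    return frags

def reassemble(frags):
    """Rebuild the original payload from fragments, in any arrival order."""
    return b"".join(chunk for _, _, chunk in sorted(frags, key=lambda f: f[0]))
```

Because each fragment records its own offset, an already created fragment can itself be fragmented at a later router and the destination can still reassemble correctly.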
Internet routing information exchange was originally handled by a gateway-to-gateway protocol that required interaction between all "neighboring" gateways. As the Internet grew, a more hierarchical scheme called the Exterior Gateway Protocol (EGP) (24) was developed to reduce the amount of routing traffic. In EGP, each autonomous system (typically a campus or corporate internet) elects one gateway to exchange routing data with a neighbor gateway in an adjacent autonomous system, and the systems then propagate the information to all their other gateways through an internal procedure. EGP evolved further to become the Interdomain Routing Protocol (IDRP) (28).

The version of IP developed in the 1970s (IPv4) has been adapted to work on a wide variety of subnetwork technologies and is used at very high speeds, but its deployment in large-scale networks has revealed opportunities for enhancement. In particular, a larger address space is needed. A new version, IPv6, which supports 128-bit addresses, eliminates in-network fragmentation, and improves route lookup times, has been defined by the IETF and is being cautiously deployed.

International Standards Organization

The ISO extended its original seven-layer OSI architecture to define three sublayers within the network layer. The topmost sublayer corresponds to the internet protocol, and the middle sublayer is intended to adapt ("converge") specific network services to those required by the internet sublayer. One example would be use of a connectionless internet protocol over a connection-oriented network, requiring a connection management intermediate protocol to set up and terminate connections as needed in order to send internet-level datagrams. ISO has defined a connectionless internet sublayer protocol (1,29,30) much like the Internet IP. Although the format of the packet header is different, most fields have a one-to-one correspondence with the Internet IP.
However, the ISO IP does not include a field to specify the upper-layer protocol being carried, since this is viewed as part of the address information. The ISO IP includes an error reporting capability, while the Internet IP provides this through the separate ICMP protocol. The fragmentation (segmentation) fields are different, with the ISO IP including in each fragment a field giving the total length of the original packet, to aid in assigning reassembly buffers. The final major difference concerns the format of addresses at the network level, which is not part of the ISO IP itself but is covered in a separate document. The ISO format is a variable-length string that is intended to cover the requirements of both public and private, local and wide area networks for the foreseeable future. It allows a maximum of 20 octets of binary data, which may alternatively be coded as 40 binary-coded decimal digits. The first octet is an authority/format code indicating what format the following data are in. Provision has been made to identify all the major address formats as alternatives (X.121, F.69 [telex], E.163 [telephone], E.164 [ISDN], ISO 6523). The address is assumed to be hierarchical, with each domain responsible for defining the meaning of the suffix portion of the address under its control.

Appletalk

Appletalk was developed in the mid-1980s and is primarily used on Apple computers. It has several innovations that make it efficient and easy to configure. There are two header format options: a 5-byte short form for packets that do not exit a single LAN, and a 13-byte long form for routed packets. The former is quite compact, containing 6 reserved bits, 10 length bits, the source and destination sockets, and the Datagram Delivery Protocol (DDP) type, indicating the application. The sockets identify the particular application to receive the data, as is usually done in the transport protocol; the packets do not need internetwork addresses because the link layer delivers them to the correct destination. Zero to 586 data bytes follow the header. The first two bytes of the extended header are the same as for the short form, except for a 4-bit hop count field that limits the maximum network diameter to 15. Bytes 3 and 4 are an optional header checksum. Following are two destination and two source network bytes, allowing over 65,000 networks (addresses FF00 through FFFE are reserved). Each network may have 254 nodes, as indicated in the following two bytes. Addressing is handled in a "plug and play" fashion. End systems arbitrate (using a broadcast protocol) to obtain an unused node ID when they are initialized and learn the network number (if any) from their nearest router. This eliminates the need to configure end nodes with unique addresses. Routers are manually configured with network numbers.

Novell IPX

Novell began selling its distributed system in the mid-1980s and made rapid inroads in the office automation market. Novell's Internet Packet Exchange (IPX) protocol has a fixed 30-byte header. The first 2 bytes contain an optional checksum, followed by 2 bytes of length (excluding LAN overhead). Next is a 1-byte time-to-live field that starts at 16, and a packet type that indicates which transport protocol is used. The destination and source node addresses have a 4-byte network part and a 6-byte node identifier that is the same as the physical address for Ethernet.
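A rough sketch of packing and unpacking the 5-byte AppleTalk DDP short header described above: 6 reserved bits and 10 length bits in the first two bytes, then one byte each for the sockets and the DDP type. The exact field order and bit layout used here are assumptions for illustration, not the published AppleTalk encoding:

```python
# Hypothetical pack/unpack of a DDP-style short header: a 16-bit word whose
# low 10 bits hold the datagram length (top 6 bits reserved), followed by
# destination socket, source socket, and DDP type bytes.
import struct

def pack_short_ddp(length: int, dst_sock: int, src_sock: int, ddp_type: int) -> bytes:
    assert length < 1024                      # length must fit in 10 bits
    return struct.pack("!HBBB", length & 0x03FF, dst_sock, src_sock, ddp_type)

def unpack_short_ddp(hdr: bytes):
    first, dst_sock, src_sock, ddp_type = struct.unpack("!HBBB", hdr)
    return first & 0x03FF, dst_sock, src_sock, ddp_type
```

The 586-byte payload limit quoted above is consistent with a 10-bit length field, which can count up to 1023 bytes including the header.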
Since the physical address is expected to be unique, this simplifies manual configuration and eliminates the need to discover physical addresses dynamically. The source and destination sockets identify particular applications, like transport sockets.

FUTURE DIRECTIONS

The variety of individual network technologies is likely to continue increasing. Fortunately, by introducing standards at the internetwork level, it is possible to interconnect diverse networks while preserving their individual autonomy to a large degree. The success of the Internet in working with new network technologies such as FDDI and ATM indicates the validity of its basic architecture. To cope with the tremendous growth in end systems, broader addressing and routing schemes are now emerging from IETF work (28). Research is also underway on improved methods of congestion control and routing protocols. With the high packet rates now flowing in the Internet, some provisions for streamlining packet processing are needed. A new version of the Internet IP, IPv6, provides for 128-bit addresses, with a flow label in each packet to allow routers to cache routes and avoid a full destination lookup on each packet. These cached routes are timed out every few seconds to ensure
responsiveness to changing conditions. IPv6 allows fragmentation only at the source host, to streamline packet processing in intermediate routers. Inclusion of high-latency links, such as satellite hops, and high-error-rate links (mobile users) also provides new challenges for the Internet. Greater demand for broadcast service (the same data going to multiple users), constant-rate data (audio and video), and asymmetric-rate links (fast data retrieval, slow requests, as provided in some broadband cable systems and Asymmetric Digital Subscriber Line technology) are other directions for expansion.

BIBLIOGRAPHY

1. C. Sunshine (ed.), Computer Network Architectures and Their Protocols, 2nd ed., New York: Plenum, 1989.
2. V. Cerf and P. Kirstein, Issues in packet network interconnection, Proc. IEEE, 66: 1386–1408, 1978.
3. M. Gien and H. Zimmermann, Design principles for network interconnection, Proc. 6th Data Commun. Symp., Pacific Grove, CA, 1979, ACM/IEEE, pp. 109–119.
4. J. Postel, Internetwork protocol approaches, IEEE Trans. Commun., COM-28: 604–611, 1980.
5. L. Pouzin, A proposal for interconnecting packet switching networks, Proc. Eurocomp, 1974.
6. L. Roberts and B. Wessler, Computer network development to achieve resource sharing, AFIPS Conf. Proc. (SJCC), 36: 543–549, 1970.
7. V. Cerf, The DoD internet architecture model, Comput. Netw., 7: 307–318, 1983.
8. R. Scantlebury and P. Wilkinson, The National Physical Laboratory data communication network, Proc. ICCC, Stockholm, 1974.
9. R. Metcalfe and D. Boggs, Ethernet: Distributed packet switching for local computer networks, Commun. ACM, 19: 395–404, 1976.
10. V. Cerf and R. Kahn, A protocol for packet network intercommunication, IEEE Trans. Commun., COM-22: 637–648, 1974.
11. V. Cerf et al., Proposal for an international end-to-end protocol, Comput. Commun. Rev., 6: 68–89, 1974.
12. A. Rybczynski, J. Palframan, and A. Thomas, Design of the Datapac X.75 internetworking capability, Proc. 5th Int. Conf. Comput. Commun., 1980, pp. 735–740.
13. B. Meister, P. Janson, and L. Svobodova, Connection-oriented versus connectionless protocols: A performance study, IEEE Trans. Comput., C-34: 1164–1173, 1985.
14. D. Boggs et al., PUP: An internetwork architecture, IEEE Trans. Commun., COM-28: 612–624, 1980.
15. J. Shoch, Internetwork naming, addressing, and routing, Proc. IEEE COMPCON, 1978, pp. 72–79.
16. P. Mockapetris and K. Dunlap, Development of the domain name system, Proc. ACM SIGCOMM Symp., 1988, pp. 123–133.
17. D. Comer, Computer Networks and Internets, Englewood Cliffs, NJ: Prentice-Hall, 1997.
18. C. Huitema, Routing in the Internet, Upper Saddle River, NJ: Prentice-Hall, 1995.
19. J. Postel, C. Sunshine, and D. Cohen, The ARPA internet protocol, Comput. Netw., 5 (4): 261–271, 1981.
20. R. Dixon and D. Pitt, Addressing, bridging, and source routing, IEEE Netw., 2 (1): 25–32, 1988.
21. V. Jacobson, Congestion avoidance and control, Proc. ACM SIGCOMM Symp., 1988, pp. 314–329.
22. P. Green, Jr., Protocol conversion, IEEE Trans. Commun., COM-34: 257–268, 1986.
23. M. Padlipsky, Gateways, architectures, and heffalumps, in The Elements of Networking Style, Englewood Cliffs, NJ: Prentice-Hall, 1985, pp. 167–176.
24. R. Hinden, J. Haverty, and A. Sheltzer, The DARPA internet: Interconnection of heterogeneous computer networks with gateways, IEEE Comput., 16 (9): 38–48, 1983.
25. J. Postel, C. Sunshine, and D. Cohen, Recent developments in the DARPA internet program, Proc. 6th Int. Conf. Comput. Commun., London, UK, 1982, pp. 975–979.
26. Department of Defense, Internet protocol, MIL-STD-1777, 1983.
27. D. Clark, The design philosophy of the DARPA internet protocols, Proc. ACM SIGCOMM Symp., 1988, pp. 106–114.
28. S. Thomas, IPng and the TCP/IP Protocols: Implementing the Next Generation Internet, New York: Wiley, 1996.
29. R. Callon, Internetwork protocol, Proc. IEEE, 71: 1388–1393, 1983.
30. International Standards Organization (ISO), Protocol for providing the connectionless network service, IS 8473, March 1986.

CARL A. SUNSHINE
Aerospace Corporation
INTERPRETERS, PROGRAM. See PROGRAM INTERPRETERS.
INTERPROCESS COMMUNICATION. See APPLICATION PROGRAM INTERFACES.
Wiley Encyclopedia of Electrical and Electronics Engineering

ISO OSI Layered Protocol Model

Standard Article
Adrian Tang, University of Missouri-Kansas City, Kansas City, MO
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5313
Online Posting Date: December 27, 1999
Abstract. The sections in this article are Interoperability, OSI Reference Model, OSI Application Standards, OSI Concepts, OSI Layers, and Conclusions.
ISO OSI LAYERED PROTOCOL MODEL
In recent years, information technology has become a major part of human civilization. The sheer variety of information technology has created a significant problem when interconnecting computer systems. The Open Systems Interconnection (OSI) standards are a set of international standards prescribing how multivendor computer systems can be interconnected in a consistent fashion. Among these standards, the standard on the OSI Reference Model provides a framework for the development of protocols to interconnect open systems. This article introduces the OSI Reference Model and the OSI protocols.

INTEROPERABILITY

The purpose of the OSI interconnection standards is to promote interoperability, which enables open systems to communicate with each other. To appreciate the role of the OSI interconnection standards in promoting interoperability, a three-step strategy to achieve interoperability is described in the following.

The first step is to develop the OSI standards. The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) jointly formed a committee, ISO/IEC JTC (Joint Technical Committee) 1, which is responsible for solving problems that can occur at the interconnection and information processing levels. Within JTC 1 are subcommittees (SCs). For example, SC 6 is involved in lower-layer information exchange standards and SC 18 in information processing standards. The OSI Reference Model is covered in ISO/IEC 7498, which is an SC 21 standard. Another important OSI standards organization is the International Telecommunication Union (ITU). The ITU recommendations provide end-to-end compatibility for international telecommunications.
Although the ITU has been more concerned with issues addressing the lower three layers of the OSI Reference Model and with applications using the telecommunications capabilities, there is a great deal of overlap between its work and that of ISO. Indeed, ISO and the ITU publish nearly identical texts as both an International Standard and a Recommendation.

The second step is to establish functional profiles. The difficulty in implementing the open systems standards is the abstract, often informal, nature of the standards documents, as well as the wide variety of options that can be implemented for a protocol. Furthermore, in a given layer of the model, there may be a variety of similar protocols. To overcome such difficulties, OSI implementors define functional profiles, which identify specific protocols, specific choices of permitted options, and specific values for parameters in the standards. Functional profiles have been defined primarily in three OSI regional workshops: the OSI Implementation Workshop (OIW) in the United States, the European Workshop for Open Systems (EWOS) in Europe, and the Asian and Oceanic Workshop (AOW) for Japan, Australia, and the Pacific Rim countries. Because functional profiles are developed in three different workshops, there is potential for interoperability problems between systems implemented in different parts of the world. A subcommittee of ISO/IEC JTC 1 was formed in 1987 to define International Standardized Profiles (ISPs), which harmonize potentially divergent regional efforts into common, internationally recognized functional profiles.

The third step is to test. There are two kinds of OSI tests: conformance testing and interoperability testing. Conformance testing examines a product to determine whether it meets the standard and profile requirements. Passing conformance testing does not ensure that two products will interoperate with each other, for example, when they choose incompatible option values. Interoperability testing determines whether two products can actually work with each other. This process is time consuming because a total of n(n-1)/2 interoperability tests are needed to demonstrate full interoperability in an environment consisting of n implementations. Test strategies, concepts, and scripts have been developed by ISO and specific industry consortia.
Testing houses have been established to conduct manufacturer-neutral conformance and interoperability test campaigns.
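The quadratic cost of full interoperability testing mentioned above is easy to check directly; every implementation must be paired with every other one:

```python
# Full interoperability testing pairs every implementation with every other,
# giving n*(n-1)/2 distinct test campaigns for n implementations.
from itertools import combinations

def interop_test_pairs(implementations):
    """All distinct pairs of implementations that must be tested together."""
    return list(combinations(implementations, 2))

n = 8
pairs = interop_test_pairs(range(n))
assert len(pairs) == n * (n - 1) // 2   # 28 pairwise campaigns for 8 products
```

This quadratic growth is precisely why conformance testing against a common profile is preferred as a first filter before pairwise interoperability testing.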
OSI REFERENCE MODEL

In 1978, ISO proposed the establishment of a framework for developing standards for the interconnection of heterogeneous computer systems, potentially via a variety of intermediary networking devices. The resulting framework, known as the OSI Reference Model (ISO/IEC 7498 or ITU X.200), was published in the spring of 1983. Since then, it has been extended to cover the connectionless communication mode, security, naming/addressing, and management.

The interconnection requirements can be divided into internetworking requirements and interworking requirements. To meet these requirements, the OSI Reference Model is divided into seven layers (Fig. 1). Layers 1 through 4 deal with the internetworking requirements; Layers 5 through 7 deal with the interworking requirements.

[Figure 1. OSI Reference Model: two open systems, each with seven layers (application, presentation, session, transport, network, data link, and physical), with the physical layers joined by the physical media.]

The objective of the internetworking requirements is to provide physical connectivity between systems in different networks in an internetworking environment. Conceptually, the internetwork environment consists of subnetworks that are connected by intermediate systems (ISs) and populated by end systems (ESs). Applications are found in ESs, and ISs provide the glue to interconnect ESs. The objectives behind the internetworking requirements are twofold: transparency over the topology of the internetwork, and transparency over the transmission media used in each subnetwork.

The internetworking requirements are met by the lower four layers in the following way. Layer 1, the physical layer, is responsible for providing physical connectivity among adjacent systems. The bits in a physical connection may occasionally experience corruption. Thus Layer 2, the data link layer, takes care of error handling as well as flow control for physical connections. Because not all ESs are necessarily adjacent to each other, Layer 3, the network layer, provides network connectivity among nonadjacent ESs. Although a network connection may require one or more connections through ISs, the network layer ensures that the connectivity details are transparent to the two ESs. Layer 4, the transport layer, provides the final touch in meeting the internetworking requirements. Although the lower three layers provide the means for moving data from one ES to another, reliability is still an issue. For example, the aggregated bit-error rate over all the underlying subnetworks may not meet the performance requirements of the application. The purpose of the transport layer is to enhance the transmission quality of the network layer. As a result, users of the transport service are presented with a reliable end-to-end transport pipe.

Even with the internetworking service in place, ESs may not be able to interwork significantly with each other. For instance, an ES may not be able to interpret received bits even though the bits are received correctly. The interworking service is provided by Layers 5 through 7. Layer 5, the session layer, is responsible for dialogue control as well as synchronization in the case of bulk transfer. It achieves its functionality by adding structure to the transport pipe. Layer 6, the presentation layer, provides the environment for applications to determine the application context. The application context, which determines the scope of communication, should identify the syntaxes of the application information exchanged. These syntaxes are known as abstract syntaxes because they do not depend on how their values are represented locally. Values of an abstract syntax are represented in transfer by a set of rules known as the transfer syntax. The presentation layer provides the capabilities for applications to negotiate such transfer syntaxes. Applications can also rely on the presentation layer to perform encryption for secure communication. Layer 7, the application layer, is responsible for meeting all the interworking requirements that are not met by the lower layers. The application layer model provides a structured approach to building objects in the application layer, given the large number of communication requirements.

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

OSI APPLICATION STANDARDS

Among the OSI application standards, the following are introduced below: Common Management Information, X.500, X.400, File Transfer, Access and Management, and Remote Database Access.

Common Management Information Protocol
The OSI management standards provide uniformity in the specification of a management communication protocol, in the definition of managed objects, and in the definition of systems management functions. The Common Management Information Protocol (CMIP) is the OSI protocol used to support the exchange of systems management messages between a manager and an agent (Fig. 2).

[Figure 2. CMIP functional model: the manager sends management operations to the agent and receives notifications from it; the agent performs the operations on managed objects, which emit notifications.]

Managed objects are management views of resources such as network elements. The manager invokes remote management operations, such as finding the status of a managed object. After the agent successfully completes these operations, the results are returned to the manager. The agent can also send a notification to the manager as the result of a notification (e.g., an alarm) received from a managed object. Unlike some other management protocols, CMIP is connection-oriented. The management operations, which are defined in ISO/IEC 9595, can be summarized as follows:

• M-GET: This operation is invoked to retrieve attribute values from one or more managed objects.
• M-CREATE: This operation is invoked to create a managed object.
• M-SET: This operation is invoked to modify attribute values of one or more managed objects.
• M-ACTION: This operation is invoked to have a defined, ad hoc action performed on a managed object, e.g., resetting a network connection.
• M-DELETE: This operation is invoked to delete a managed object (e.g., deleting a log record).
• M-CANCEL-GET: This operation is invoked to request the cancellation of an outstanding invocation of the M-GET operation.
• M-EVENT-REPORT: This operation is invoked by the agent to report an event to the manager.

Operations such as M-GET, M-SET, M-ACTION, and M-DELETE contain scoping and filtering parameters that permit the agent to operate only on the managed objects that have passed the filter. Only intelligent agents with the ability to evaluate filters can perform such operations. Therefore, it is not practical to run OSI agents on dumb devices.

To provide uniformity in the definition of managed objects, the OSI management standards use the object-oriented approach and provide Guidelines for the Definition of Managed Objects (GDMO) templates to define managed object classes. A managed object class is a set of managed objects with similar characteristics, and a managed object is an instance of a managed object class. For example, the transport-connection managed object class is an encapsulation of the common characteristics of transport connections. When a transport connection is established, a transport-connection managed object can be created to model the management behavior of that connection. The managed object class template forms the basis of the definition of a managed object class. It supports inheritance by identifying the inheritance relationships between the class and other managed object classes. Within the template are placeholders for the specification of behavior, attributes, notifications, and actions. Attributes are used to specify properties such as states, and notifications are used to specify unsolicited messages that can be emitted by a managed object. To facilitate rapid development of managed object classes, standards and profile groups have registered managed object classes for common resources. For example, ITU M.3100 provides definitions of managed object classes for generic telecommunications resources. These definitions can be used to derive (via inheritance) definitions of managed object classes for specific telecommunications resources.

System management functions are required to address a variety of management needs for the entire managed system.
Management functions can be classified into five categories: fault management, configuration management, performance management, accounting management and security management. The OSI standards have defined systems management functions to address common problems in these five categories. Among these functions, the object management function, for example, provides services for reporting the creation/deletion of managed objects and changes to their attribute values; the event reporting management function provides services for delivering event reports to managers. The OSI management standards have found their greatest success in telecommunications applications. ITU M.3010 defined a five-layer model for the Telecommunications Management Network (TMN) and recommended the use of the OSI management standards. Given the huge number of OSI management standards, it is not easy for a telecommunications manager to identify the standards that can be used to solve a specific management problem. To address this difficulty, the Network Management Forum (NMF) came up with the idea of a solution set. Similar to the notion of a functional profile, a solution set provides the solution to a specific telecommunications management problem. It identifies the managed object classes and systems management functions which are needed to solve the problem. Through scenarios, it explains how the functions can be applied to meet the requirements of the problem. The NMF has already defined a number of useful solution sets for practical telecommunications problems. A majority of telecommunications managers have committed to deploying NMF solution sets to solve telecommunications problems.
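The managed-object and filtering ideas above can be illustrated with a minimal sketch. The class and attribute names below are hypothetical, and the filter grammar is greatly simplified relative to the real CMIS filter syntax:

```python
# Minimal sketch of the managed-object idea (hypothetical names, not GDMO syntax).
# A managed object class groups objects with similar characteristics; instances
# carry attributes and can emit notifications that an agent filters for a manager.

class ManagedObject:
    def __init__(self, object_class, **attributes):
        self.object_class = object_class
        self.attributes = dict(attributes)

    def set_attribute(self, name, value):
        old = self.attributes.get(name)
        self.attributes[name] = value
        # An attribute value change would normally trigger a notification.
        return {"event": "attributeValueChange", "attribute": name,
                "old": old, "new": value}

def evaluate_filter(obj, filter_expr):
    """Evaluate a CMIS-style filter, simplified here to (attribute, operator,
    value) triples implicitly joined by 'and'."""
    for name, op, value in filter_expr:
        actual = obj.attributes.get(name)
        if op == "equal" and actual != value:
            return False
        if op == "greaterOrEqual" and not (actual is not None and actual >= value):
            return False
    return True

# A transport-connection managed object, as in the text's example:
tc = ManagedObject("transportConnection", state="open", octetsSent=1200)
print(evaluate_filter(tc, [("state", "equal", "open"),
                           ("octetsSent", "greaterOrEqual", 1000)]))  # True
```

An agent evaluating such filters on behalf of a manager is exactly the capability the text notes a dumb device cannot provide.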
X.500 In a distributed computing environment, the X.500 standard defines a directory service to manage information which is not likely to change often (e.g., e-mail addresses and network addresses). The directory information base (DIB) is the conceptual repository of such directory information. Structured as a directory information tree (DIT), it represents each piece of directory information (known as a directory entry) by a tree node. In the X.500 Functional Model (Fig. 3), the directory user agent (DUA) invokes directory access operations to request directory services on behalf of a directory user. The outcome of the invocation is either a result, a referral or an error. A result is returned if the request has been carried out successfully. A referral is returned if the service is unobtainable at a service access point or more easily reached at another service access point. An error is returned if the request cannot be carried out. Directory access operations fall into three categories. The read category consists of operations to retrieve the information of a directory entry, to compare the value of a directory entry with a submitted value, or to cancel an outstanding interrogation. The search category consists of operations to list the "child" subordinates of a directory entry or to search for directory entries satisfying an input filter. The modify category consists of operations to add/remove a directory entry, to modify a directory entry, or to change a directory name. The directory access operations are described in the Directory Access Protocol (DAP). In a large network, the directory services may be distributed among multiple directory service agents (DSAs). The DSAs collaborate to provide the distributed directory service in three modes: chaining, referral and multicasting. In the chaining mode, a DSA continues passing a request to another DSA until one is found that can provide the requested information.
In the multicasting mode, a DSA which does not have the requested information chains an identical request in parallel to multiple DSAs. In the referral mode, a DSA which does not have the requested information refers another DSA to the DUA. The directory service protocol (DSP) is used by the DSAs to provide the distributed directory service. The DSP operations are quite similar to the DAP operations. In the X.500 information model, a directory entry is either an object entry or an alias entry. An object entry is the primary collection of information about a real world object. On the other hand, an alias entry, which points to another directory entry, is used primarily to provide a user-friendly name to the referenced entry. Directory entries of similar characteristics are grouped into directory object classes. Examples of directory object classes are country, organization, person and device. The X.500 standard provides templates for the speci-
Figure 3. The X.500 Functional Model: DUAs reach DSAs through the DAP, and DSAs cooperate with one another through the DSP.
fication of directory object classes as well as attributes used in the definition of directory object classes. The naming of a directory entry is straightforward. As a tree node, the directory entry is labeled by a set of naming attributes known as the relative distinguished name (RDN). There is a unique path from the node to the root of the DIT. The concatenation of the RDNs of nodes in this path gives the distinguished name (DN) of the directory entry. The X.400 standard relies on DNs for the naming of X.400 users. X.400 The current trends towards electronic office automation have created a demand for universal message communication. The X.400 standard defines a store-and-forward infrastructure for the transport of interpersonal messages and business documents. Figure 4 shows the X.400 functional model. An X.400 user is a person or an application process that originates and receives messages. A user agent (UA) is a process which acts on behalf of an X.400 user. It is responsible for the submission and the delivery of messages. The message transfer system (MTS) provides the store-and-forward transfer service for the X.400 systems, and it transfers messages regardless of their content types. However, it does not examine or modify the message content unless content conversion is explicitly requested by the originator. A message store (MS) provides a secure and continuously available storage mechanism on behalf of a UA. By serving as an intermediary entity between the UA and the MTS, the MS can store incoming messages from the MTS until the UA is ready to process the messages. It can perform autoforwarding and provide a summary of the stored messages. Not every UA has an associated MS. If it does have one, all incoming messages for the UA are delivered first to the associated MS. The value of the X.400 system can be enhanced if it can be connected to other non-X.400 systems such as postal systems. 
An access unit (AU) is a functional object to link a non-X.400 system to the X.400 system. There are a number of AU types, for example, telex, teletex and facsimile. The functionality of the MTS is distributed among a set of message transfer agents (MTAs). To illustrate how a UA interacts with an MTA, suppose that the originator UA submits a user message to an originator MTA. First, the originator MTA validates the submission envelope and records appropriate administrative information on the envelope. Next, the MTA attempts to deliver the message to the recipient. If the recipient is local to the originator MTA, the delivery is straightforward because it does not involve another MTA. If not, the
Figure 4. The X.400 Functional Model: UAs act on behalf of users, the MTS transfers messages, an MS may stand between a UA and the MTS, and AUs connect non-X.400 users.
originator MTA needs to relay the message to another MTA. When there is more than one recipient, the originator MTA needs to create a copy for each MTA to which the message is relayed. An MTA, on receiving the relayed message from the originating MTA, may be held responsible for progressive delivery of the message to the intended recipient(s). An MTA, if held responsible, can discharge its responsibility by either delivering the message if the recipient is local to the MTA or by transferring the message to other MTAs that are closer (measured according to a metric, such as distance) to the recipient. If delivery is unsuccessful, its responsibility ends with the generation of a nondelivery report. The X.400 protocols are designed to reflect two major levels (i.e., the MTS and the MTS-user). The P1 protocol addresses how an MTA interacts with other MTAs to provide the distributed MTS service. The P3 protocol addresses how a UA accesses the MTS service. The P7 protocol addresses how an MS accesses the MTS service. Similar to the X.500 protocols, each X.400 protocol is modeled by a set of operations. For example, the P3 protocol specifies operations for an originator UA to submit/receive a message, and to manage messages in a mailbox. The X.400 information model deals with how a message is structured. An interpersonal message (IPM) consists of an envelope and an IPM content. The IPM content consists of a header and one or more body parts. Each body part has an encoded information type (EIT), e.g., text, facsimile or telex. If conversion is explicitly requested by the originator UA, an MTA can transform an EIT into another EIT. In addition to IPMs, there are also interpersonal notifications (IPNs). An IPN conveys a receipt/nonreceipt notification. If the originator UA requests a receipt notification, an IPN signifying receipt is sent by the recipient UA to the originator UA. The X.400 standard relies on the X.500 standard in a number of ways.
In naming, for example, an X.400 user is named by an O/R (i.e., Originator/Recipient) name, which is a two-slot data structure: a DN and an O/R address. The DN provides a user-friendly name for the X.400 user. For example, the following DN may be assigned to Adrian Tang: {C = US, ORG = UMKC, PersonalName = Adrian Tang}. The O/R address component identifies the management domains (MDs) to which an X.400 user belongs. The X.400 standard defines a management domain to be a set of messaging systems owned by an administration or an organization. An MD managed by an administration (e.g., a public carrier) is called an administration management domain (ADMD), while an MD managed by an organization (e.g., a private company) other than an administration is called a private management domain (PRMD). For example, the following O/R address may be assigned to Adrian Tang: {C = US, ADMD = ATTMAIL, PRMD = UMKC, PersonalName = Adrian Tang}. File Transfer Access Management Existing file protocols are concerned primarily with moving complete files. The file transfer access management (FTAM) standard (ISO/IEC 8571) has broadened the scope of these protocols by offering three modes of file manipulation, i.e., file transfer, file access and file/filestore management. File transfer is the movement of a complete file between two filestores in different open systems. File access enables reading, writing or deletion of selected parts of a remote file. File/filestore
management refers to the management of a remote file or filestore. The FTAM standard has found popularity in the banking industry. Figure 5 shows the FTAM functional model. Initiators and responders correspond to file clients and file servers, respectively. A virtual file/filestore is the abstraction of a real file/filestore presented by the responder to an initiator. Before the initiator can read, write, access or manage a virtual file/filestore, it must establish an FTAM dialogue with the responder. A virtual file is characterized by a document type which specifies the structure and content of the file. In the FTAM information model, the file structure is a hierarchical access structure. Each node of the structure may contain structural information (such as its name and level) and content information called a data unit (DU). File access is directed to a file access data unit (FADU), which is a subtree of the structure. As special cases of the hierarchical access structure, the unstructured file structure is a one-level structure consisting of only one node, and the flat file structure is a two-level structure in which the nodes are either named (ordered flat structure) or unnamed (sequential flat structure). A virtual file is described by two classes of attributes: file attributes and activity attributes. The values of a file attribute are supposed to remain constant throughout the lifetime of a file unless specifically modified. For example, file name, date/time of creation, document type and access control are file attributes. An activity attribute describes a file relative to a particular FTAM dialogue in progress. It is dynamic in nature and has no meaning outside the dialogue. For example, current file position, current access request and initiator identity are activity attributes. The FTAM standard introduces many service elements. FTAM functional units are used to group functionally related FTAM service elements.

Figure 5. The FTAM Functional Model: through an FTAM dialogue over the Presentation Layer, an initiator accesses the virtual filestore that a responder presents as an abstraction of its real filestore.

Some examples of FTAM functional units are given in the following.

• Kernel: This supports the basic functions for establishing and releasing an FTAM association, and selecting and deselecting a file for further processing.
• Read: This supports data transfer from a responder to an initiator.
• Write: This supports data transfer from an initiator to a responder.
• File access: This is used to locate a specific FADU for subsequent file operations.
• Limited file management: This supports file management operations such as creating and deleting files and interrogating the attributes of files.
• FADU locking: This supports concurrency control on either a file basis or an FADU basis.
• Recovery: This allows an initiator to perform recovery actions when a failure occurs after a file is opened.
• Restart: This allows a data transfer to be interrupted and restarted at some checkpoint.

Remote Database Access
Today, multivendor database interoperability is limited. A database client can only access a limited set of remote database servers. The remote database access (RDA) standard enables multivendor database interoperability. The RDA is premised on a client-server relationship between communicating end systems (Fig. 6). The RDA client is a process desiring the remote database access service supported by RDA. The RDA server is a process running on a remote end system that provides the RDA-based remote database access service to RDA clients. The RDA standard is composed of a Generic Standard (ISO/IEC 9579-1), which defines the common aspects of the RDA protocol independent of specific database models and query languages, and a Specialization Standard, which specifies the part of the RDA protocol specific to a particular database model and query language. The SQL (Structured Query Language) Specialization is specified in ISO/IEC 9579-2. The RDA standard defines two modes of operation—basic and transaction processing. The RDA basic mode is intended to be used for basic tasks such as data retrievals from a centralized database. The transaction processing mode is intended to be used for complex actions such as updates to a distributed database. The RDA standard defines services for managing an RDA dialogue, managing a transaction, controlling outstanding operations, controlling the availability of database resources, and executing database language commands. For example, the Database Language group enables an RDA client to execute database language commands which can be either executed as soon as it is issued, or defined first and then later invoked. The latter mechanism would typically be employed for efficiency purposes in the case where the same command is used repeatedly within a given RDA dialogue. The RDA standard describes procedures for handling errors and recovering from failures occurring during an RDA dialogue. 
It specifies how an RDA server must react to the failure of an RDA dialogue. When an RDA dialogue failure occurs, the RDA server deletes all state information for that
Figure 6. The RDA Functional Model: an RDA client accesses a remote database through the RDA client and server interfaces.
dialogue and rolls back all uncommitted transactions. The actions of the RDA client following an RDA dialogue failure are not addressed in the RDA standard. The RDA standard has been found useful in manufacturing and telecommunications environments. In the telecommunications environment, for instance, the Revenue Accounting Office can act as an RDA client to retrieve telecommunications usage information from a remote RDA server before applying a rate algorithm to the retrieved usage information. OSI CONCEPTS Layering The basic structuring technique used by the OSI Reference Model is layering. Layering divides the overall communication functions of an open system into a succession of smaller subsystems. Subsystems of the same rank (N) form the (N)-layer of the OSI reference model. Objects in the (N)-layer are called (N)-entities. Collectively, they provide the (N)-service that is specified in terms of service elements. The (N)-service is always an enhancement of the (N-1)-service. Before an (N+1)-entity acquires any service from the (N)-layer, it must be bound to one or more (N)-service-access-points (SAPs). At any time, no more than one (N+1)-entity can be bound to the same (N)-SAP. The service at an (N)-SAP is supported by a unique (N)-entity.
Communication Modes OSI offers three communication modes: connection-oriented, connectionless, and store-and-forward. Connection-oriented communication requires an (N+1)-association to be set up between the two communicating (N+1)-entities. The establishment of the (N+1)-association in turn requires an (N)-connection to be set up between two (N)-SAPs, i.e., the (N)-SAPs to which the (N+1)-entities are bound. The lifetime of a connection has three distinct phases: connection establishment, data transfer, and connection release. Once the connection is established, a connection endpoint identifier is assigned at each end to its local service user. During the data transfer phase, data transfer requests are logically related within the context of the connection addressed by the two connection endpoint identifiers. The establishment of an (N)-connection requires the availability of an (N-1)-connection. This means that, in the worst case, when none of the lower layers has an established connection, a connection request from a higher layer would trigger connection establishments in the lower layers. As a result, connection establishment can be time-consuming. There are at least two methods of reducing the cost incurred at connection establishment. One method is to assign reasonably permanent connections at the layers where the cost is low, regardless of whether a higher-layer connection request has been issued. The other method, embedding, attempts to establish multiple connections simultaneously. Embedding is used for connection establishment in Layers 5 and 6. The release of an (N)-connection is initiated by either an (N+1)-entity associated with the connection or the (N)-layer supporting the connection. There are three kinds of release: orderly, negotiated and destructive. In an orderly release, there is no loss in transit for data sent before the release request. In a negotiated release, which is a special case of orderly release, an (N+1)-entity can reject a release service request issued by its peer. In a destructive release, the release of an (N)-connection may disrupt the procedures of service requests issued earlier, implying potential loss of data in transit in both directions. In the connection-oriented mode, the mapping of (N+1)-connections to (N)-connections can be one-to-one, many-to-one, or one-to-many; (N+1)-multiplexing means that more than one (N+1)-connection is mapped to the same (N)-connection; (N+1)-splitting means that an (N+1)-connection is split into several (N)-connections, e.g., when the bandwidth of an (N)-connection is less than that of the (N+1)-connection. In connectionless communication, there is neither connection establishment nor connection release. The transmitted data units contain all the control information (such as addresses and quality of service) necessary for the transfer. Furthermore, they are transmitted independently of each other, meaning that there is no defined context. Data unit independence has the advantage that the data transfer service can be robust. Store-and-forward communication is a mixture of connection-oriented communication and connectionless communication. A connection between the two communicating end systems is not required, although connections are established with an intermediate system on a hop-by-hop basis. Communication between UAs in an X.400 system uses this mode.
Data Units While providing the (N)-service, the cooperating (N)-entities exchange (N)-protocol-data-units (PDUs) with each other. An (N)-PDU has two components: data and control. The control component, which is known as the (N)-protocol control information (N-PCI), contains control information such as name/version of the (N)-protocol, type of the (N)-PDU, and addresses of the communicating (N+1)-entities. The data component, which is known as the (N)-service-data-unit (SDU), contains user information which is meant to be interpreted by the receiving (N)-entity. When an (N)-entity sends an (N)-PDU to its peer, the (N)-PDU is first transported to the lower layers in the source system through encapsulation. Each time the PDU is passed to the layer below, the layer below adds a PCI prefix to the PDU. Thus, when the PDU reaches the lowest layer (i.e., the Physical Layer), it has been encapsulated with one or more PCIs. Next, the encapsulated PDU at the Physical Layer is transmitted across one or more transmission media to the Physical Layer of the target system. In the target system, the encapsulated PDU is transported to the receiving (N)-entity through decapsulation. Each time the encapsulated PDU is passed to the layer above, the layer above strips off a PCI from the PDU. A layer in the target system always strips off the PCI added earlier by the same layer in the source system. When the receiving (N)-entity receives the PDU, all the PCIs which were added by the source system should have been removed. While (N)-PDUs are messages passed between two open systems, (N)-interface-data-units (IDUs) are messages passed between two adjacent layers in the same open system, i.e., the (N+1)-layer and the (N)-layer. An (N)-IDU is passed either because of a local management need or because an (N+1)-
entity wants to send an (N+1)-PDU to its peer. In the latter case, the (N)-IDU is constructed with two components: an (N+1)-PDU (which is treated as an (N)-SDU by the (N)-layer) and an (N)-interface-control-information (ICI). The (N)-ICI contains control information (e.g., the address of the sending (N+1)-entity), which is to be interpreted by the (N)-layer in the source system. In some cases, the sending (N+1)-entity may compose more than one (N)-IDU in order to send an (N+1)-PDU to its peer. This happens when the (N)-layer imposes a size constraint on an (N)-IDU. Since (N)-IDUs are messages passed between adjacent layers in an open system, the design of their structures is a local implementation issue. OSI standards, which are only concerned with interconnection matters, do not define (N)-IDUs. When the (N)-layer receives an (N)-IDU, it will separate the (N)-ICI from the (N)-SDU. From the control information in the (N)-ICI, an (N)-PCI is built which is then concatenated with the (N)-SDU to form an (N)-PDU (Fig. 7). The (N)-PDU is subsequently passed to the (N-1)-layer through the use of one or more (N-1)-IDUs. Naming and Addressing The OSI environment is rich in objects. Some OSI objects require a global identification so that they can be referenced unambiguously by applications. They include abstract/transfer syntax, application contexts, FTAM document types and managed object classes. Attribute-based names and object-identifier-based names are found in the OSI environment. An attribute-based name is made up of a set of naming attributes, where each attribute can be represented by a pair, such as (STATE, Missouri). For example, X.500 names are attribute-based names. Object-identifier-based names are given by object identifiers. To explain object identifiers, we first describe the object identifier tree (OIT), which is defined to provide global naming. In the OIT, leaf nodes represent objects or object classes, and nonleaf nodes represent administrative authorities.
Each arc of the OIT is labeled by an integer and, occasionally, a mnemonic name for descriptive purposes. An object identifier is the ordered sequence of integer labels on the unique path from the root to a node in the OIT. Let us examine the arcs of the OIT in more detail. The OIT begins with three numbered arcs emanating from the root:
Figure 7. Building an (N)-PDU and an (N-1)-IDU: the (N)-SDU received from the (N+1)-layer is combined with an (N)-PCI to form the (N)-PDU, which is passed down together with an ICI as an (N-1)-IDU.
arc 0 for ITU; arc 1 for ISO; and arc 2 for joint ISO-ITU. Below ITU there are arcs leading to recommendations (0), questions (1), administrations (2), and network operators (3). Below ISO there are arcs leading to standard (0), registration-authority (1), member-body (2) and identified-organization (3). The arcs below standard (0) shall have the number of an International Standard. Consider the FTAM standard, i.e., ISO/IEC 8571. An arc with the arc number 8571 can be created for the FTAM standard. In this way, the object identifier, {1(ISO) 0(standard) 8571}, can be used to name the FTAM standard. An address is used to locate an object, e.g., an (N+1)-entity. Suppose that an (N+1)-entity is bound to one or more (N)-SAPs. Any one of these (N)-SAPs can be used to locate the (N+1)-entity. Hence, an address of the (N+1)-entity can be given by a name identifying the set of (N)-SAPs to which the (N+1)-entity is bound. Such an address is called an (N)-address, or an (N)-SAP address when there is only one (N)-SAP in the set. To locate an (N+1)-entity, we need lower-layer addressing information to identify a path from the lower layers all the way up to the target (N+1)-entity. An (N)-address (for N greater than 3) has two components, i.e., an (N-1)-address of a supporting (N)-entity, and an (N)-suffix, known as an (N)-selector. The (N)-selector is used to identify the set of (N)-SAPs to which the (N+1)-entity is bound. Accordingly, a presentation (i.e., Layer 6) address is given by a session address and a presentation selector, a session address is given by a transport address and a session selector, and a transport address is given by a network address and a transport selector. In short, a presentation address can be represented by a quadruple consisting of a presentation selector, a session selector, a transport selector and a network address.
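The object-identifier naming just described, in which an OID is simply the path of integer arc labels from the OIT root, can be sketched as follows. The arc table below covers only the labels quoted above; everything else is illustrative:

```python
# Sketch of object-identifier naming. An OID is the sequence of integer arc
# labels on the unique path from the OIT root to a node; mnemonic names are
# attached to some arcs for descriptive purposes only.

OIT_ARCS = {
    (): {0: "itu", 1: "iso", 2: "joint-iso-itu"},
    (1,): {0: "standard", 1: "registration-authority",
           2: "member-body", 3: "identified-organization"},
}

def oid_to_string(arcs):
    """Render an OID in the conventional dotted form, e.g. (1, 0, 8571) -> '1.0.8571'."""
    return ".".join(str(a) for a in arcs)

def oid_with_names(arcs):
    """Render with mnemonic names where registered, e.g. '{iso(1) standard(0) 8571}'."""
    parts, path = [], ()
    for a in arcs:
        name = OIT_ARCS.get(path, {}).get(a)
        parts.append(f"{name}({a})" if name else str(a))
        path = path + (a,)
    return "{" + " ".join(parts) + "}"

ftam_oid = (1, 0, 8571)   # the FTAM standard, ISO/IEC 8571
print(oid_to_string(ftam_oid))    # 1.0.8571
print(oid_with_names(ftam_oid))   # {iso(1) standard(0) 8571}
```

The registration tree guarantees global uniqueness: no two authorities can assign the same label sequence, because each authority controls only the arcs below its own node.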
OSI LAYERS Physical Layer The purpose of the physical layer is to hide the nature of the physical media from the data link layer in order to maximize the transportability of higher-layer protocols. It provides mechanical, electrical, functional, and procedural means to activate, maintain, and deactivate physical connections for serial bit streams between data link entities. The common functions found in the physical layer are synchronization and multiplexing. The physical layer standards should be distinguished from the physical interface standards (e.g., X.21) which define the boundary or interface between the physical layer and the physical transmission medium. Data Link Layer The data link layer is responsible for error-free data transmission over a data link. It creates data packets, synchronizes the data packets, detects and corrects errors, and controls the flow of the packet stream. The data link service definition of the ISO high-level data link control (HDLC) protocol covers both connection mode operation and connectionless mode operation. While HDLC is seen as a superset of data link procedures, many interesting subsets are defined out of it. For example, the HDLC LAP B subset is adopted by ITU as part of the X.25 packet-switched
network standard; HDLC LAP (Link Access Procedure) D, a data link standard developed as part of the ISDN standardization, is a subset of LAP B. Network Layer The Network Layer provides the service for network service users to exchange information without being concerned with the topology of the network or the characteristics of each constituent subnetwork. It is perhaps the most complex of the seven OSI layers because many existing subnetwork types use different network addressing schemes, network protocols and communication modes. There are two modes of network service: connection-oriented network service (CONS) and connectionless network service (CLNS). The specification of CONS has six service elements: N-CONNECT to set up a network connection; N-DATA to send normal user data; N-DATA-ACKNOWLEDGE to acknowledge the receipt of normal user data; N-EXPEDITED to send expedited data; N-RESET to reset a network connection; and N-DISCONNECT to release a network connection. In CLNS, there is only one service element, N-UNITDATA, which is used to send data. In the Network Layer, there are two kinds of network protocols: routing protocols and interconnection protocols. Routing protocols are used to maintain a routing information base, to collect routing information, to distribute routing information to other nodes and to calculate the metrics of a route. Interconnection protocols are used to support the integration of subnetworks of different types. A routing framework should be in place before routing protocols are introduced. ISO/IEC TR (Technical Report) 9575 defines a routing framework, essentially partitioning the global network into administrative domains and routing domains. An administrative domain is an autonomous set of intermediate systems (ISs) and end systems (ESs) running under a single administration. Within an administrative domain, there may be one or more routing domains. Each routing domain runs the same IS-IS routing protocol among its ISs.
Of all the OSI routing protocols, only the IS-IS routing protocol used within a routing domain is discussed here. This protocol has heavily influenced the design of the Open Shortest Path First (OSPF) protocol, an Internet routing protocol. In the past, intra-domain routing protocols were primarily based on the distance vector algorithm, which requires an IS to periodically send its routing table to its adjacent neighbors. Because the routing information propagates from one link to another, it may take considerable time before a remote IS receives an update. Thus, the major drawback of the distance vector algorithm is slow convergence, resulting in the receipt of occasional stale updates. The link state algorithm, which is adopted by ISO for the IS-IS routing protocol, requires each IS to maintain a complete topology map. Instead of sending a global routing table to its adjacent neighbors, an IS periodically broadcasts the status of its adjacent links. On receipt of link information from other ISs, an IS can build an up-to-date topology map, modeled as a weighted graph. Using this graph, the IS can apply Dijkstra's Shortest Path First (SPF) algorithm to compute the shortest distance to any destination; the algorithm converts the weighted graph into a shortest-path tree rooted at the computing IS. Because the link state algorithm requires each IS to
broadcast its link state status, every IS can maintain a consistent view of the entire topology. The slow convergence drawback of the distance vector algorithm is thus avoided. Interconnection protocols are used to interconnect subnetworks, which may use different network address schemes, network protocols and communication modes. The current solution is to use interworking units (IWUs) which perform relaying. Relaying involves the use of convergence protocols which can adapt non-OSI network protocols to OSI protocols. ISO/IEC 8648 defines a framework of the Network Layer for the introduction of interconnection protocols. In the Internal Organization of the Network Layer (IONL) model (Fig. 8), there are three sublayers of the Network Layer.

• Subnetwork access sublayer. This sublayer provides the attachment point of a subnetwork. A subnetwork access protocol (SNAcP) operating at this sublayer is a protocol associated with an underlying subnetwork. This protocol may or may not conform to the OSI network service requirement.
• Subnetwork dependent sublayer. This sublayer is responsible for augmenting the service offered by a subnetwork technology into something close to the OSI network service. A subnetwork-dependent convergence protocol (SNDCP) is used for this sublayer. The operation of an SNDCP depends on the network service of a particular subnetwork. A common function of an SNDCP is to map between OSI network addresses and addresses specific to the subnetwork.
• Subnetwork independent sublayer. This sublayer provides the OSI network service over a well-defined set of underlying capabilities, which need not be based on the characteristics of any particular subnetwork. When a subnetwork-independent convergence protocol (SNICP) is used, it is defined to require a minimal set of services from a subnetwork. Connectionless Network Protocol (CLNP), which is the OSI interconnection protocol providing CLNS, is an example of an SNICP.
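Stepping back to the link state routing discussion above: the SPF computation that each IS performs over its topology map can be sketched with Dijkstra's algorithm. The topology and link metrics below are hypothetical:

```python
import heapq

# Dijkstra's SPF over a link-state topology map (hypothetical ISs A-E).
# Each IS holds the full weighted graph built from received link-state broadcasts
# and computes shortest distances from itself to every other IS.

def spf(graph, source):
    """Return the shortest distance from `source` to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, metric in graph.get(node, {}).items():
            nd = d + metric
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1, "E": 3},
    "E": {"D": 3},
}
print(spf(topology, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4, 'E': 7}
```

Because every IS runs this computation on the same topology map, the forwarding decisions are mutually consistent, which is why the link state approach avoids the stale-update problem of distance vector routing.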
Using the IONL model, there are three basic strategies to interconnect subnetworks.

• Interconnection of subnetworks which support OSI network services. In this strategy, all the subnetworks involved fully support the OSI network service. There is no need for an enhancement protocol.

• Hop-by-hop enhancement. This approach is used in an environment containing at least one subnetwork type,
[Figure 8 shows two protocol stacks side by side, each divided into three sublayers: a subnetwork independent sublayer (SNICP1, SNICP2), a subnetwork dependent sublayer (SNDCP1, SNDCP2) and a subnetwork access sublayer (SNAcP1, SNAcP2), with routing and relaying performed above the sublayers.]

Figure 8. The IONL Model.
ISO OSI LAYERED PROTOCOL MODEL
which does not provide the OSI network service. It takes each of these subnetworks individually and enhances its subnetwork service to the level of the OSI network service. Different SNDCPs may be required on different subnetworks.

• Internet approach. This approach is used in an environment containing at least one subnetwork type which does not provide the OSI network service. The SNICP (e.g., CLNP), which assumes a minimal set of network services from the underlying subnetworks, operates on top of the SNAcP in all the systems attached to the subnetworks.

The last topic in this section is OSI network addressing. The Network Layer must provide a global addressing scheme so that ESs in different subnetworks can be addressed unambiguously. An NSAP address is an address of an SAP of the Network Layer. To cope with the multitude of NSAP addresses in the global environment, NSAP addresses are partitioned into network addressing domains in a hierarchical fashion. Each network addressing domain has its own network addressing format and is administered by an address registration authority. An address registration authority may further suballocate its addressing space to another address registration authority. On the whole, the network addressing domains are structured as a tree whose root has seven top-level network addressing domains (e.g., X.121, E.164 and ISO 6523-International Code Designator) as children. An NSAP address consists of an initial domain part (IDP) and a domain specific part (DSP). The IDP, in turn, consists of an authority and format identifier (AFI) and an initial domain identifier (IDI). The AFI specifies one of the seven top-level addressing domains as well as the abstract syntax of the DSP (e.g., binary octets, decimal digits, characters). For example, an AFI value of 47 implies that the format is ISO 6523-International Code Designator (ICD) and the DSP abstract syntax is binary.
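The AFI/IDI/DSP split can be illustrated with a small helper. This is a simplified sketch: it handles only the AFI 47 (ISO 6523-ICD) example from the text, assumes the IDI is a four-digit International Code Designator, and the sample address is invented; real NSAP parsing depends on the AFI-specific IDI length and syntax.

```python
def split_nsap(addr_hex):
    """Split an NSAP address (given as a hex string) into AFI, IDI and DSP.

    Simplified illustration: only AFI 47 (ISO 6523-ICD, binary DSP) is
    handled, and its IDI is assumed to be a four-digit ICD.
    """
    afi = addr_hex[0:2]
    if afi != "47":
        raise ValueError("only the AFI 47 example from the text is handled")
    idi = addr_hex[2:6]   # e.g. 0005 = US Government OSI Profile
    dsp = addr_hex[6:]    # assigned by the authority named in the IDI
    return afi, idi, dsp

print(split_nsap("470005AABBCCDD"))  # ('47', '0005', 'AABBCCDD')
```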
The IDI component is used to specify an addressing registration authority, e.g., the US Government OSI Profile (0005). The DSP is the part of an NSAP address assigned by the addressing registration authority specified in the IDI component. This is the place where one can put information on the routing domain and an ES.

Transport Layer

The transmission quality of the network service may not meet the requirements of an application because of possible signaled and residual errors. By performing transmission quality enhancements such as error detection and recovery, the transport layer (ISO/IEC 8072/8073) provides a reliable end-to-end service. There are two modes of transport service: connection-oriented transport service (COTS) and connectionless transport service (CLTS). When operating in COTS, the transport layer provides full-duplex transmission between the communicating transport service users. The following discussion focuses on COTS. The transport functions invoked by the transport layer depend on the underlying network type. If the underlying network is reliable, only a simple transport protocol is needed. On the other hand, if the underlying network is unreliable, a sophisticated transport protocol involving elaborate transport
mechanisms is needed. For this reason, the following five connection-oriented transport protocol classes have been defined.

• TP 0: This is designed to operate over a reliable network. It is a simple protocol.

• TP 1: This is designed to operate over an X.25-like network, which may have an unacceptable signaled error rate. It is capable of resynchronization upon network-signaled resets, and it can reassign the transport connection to a new network connection in the event of a network failure.

• TP 2: Similar to TP 0, this is designed to operate over a reliable network. It adds a multiplexing capability to TP 0 so that multiple transport connections can share a single network connection.

• TP 3: Designed to operate over an X.25-like network, this is basically a combination of TP 1 and TP 2.

• TP 4: This is designed to operate over an unreliable network. It is the most sophisticated transport protocol.

The TP 4 protocol procedures are complicated. To set up a transport connection, a three-way exchange of protocol messages is needed to avoid processing connection establishment PDUs belonging to transport connections which have already been released. Despite the complexity of the transport protocols, the specification of COTS is straightforward. There are four service elements: T-CONNECT to set up a full-duplex transport connection; T-DATA and T-EXPEDITED-DATA to deliver normal data and expedited data, respectively; and T-DISCONNECT to release a transport connection. The T-DISCONNECT service is disruptive because data sent before the service request may be lost.

Session Layer

The transport layer provides a full-duplex, unstructured pipe between two communicating application processes. In many cases, this pipe is sufficient. In other cases, where bulk data transfer is necessary, it is desirable to add structure to the transport pipe so that the application processes can synchronize the data stream and perform recovery after a failure.
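The choice among the five transport classes turns on whether the network is reliable, whether it is X.25-like (signaled errors and resets), and whether multiplexing is wanted. The function below is an illustrative summary of the list above, not part of any standard's text.

```python
def transport_class(reliable, x25_like, need_multiplexing):
    """Illustrative mapping from network properties to an OSI TP class (0-4)."""
    if x25_like:
        # Signaled errors/resets: TP 1, or TP 3 when multiplexing is also wanted.
        return 3 if need_multiplexing else 1
    if reliable:
        # Reliable network: TP 0, or TP 2 when multiplexing is wanted.
        return 2 if need_multiplexing else 0
    return 4  # unreliable network: full error detection and recovery

print(transport_class(reliable=True, x25_like=False, need_multiplexing=True))   # 2
print(transport_class(reliable=False, x25_like=False, need_multiplexing=False)) # 4
```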
The session layer (ISO/IEC 8326/8327) enhances the services of the transport layer by enabling application processes to synchronize their dialogues and to manage their data transfer. There are two methods to structure the transport pipe as a session dialogue: one uses the notion of an activity and the other does not. In the latter method, major synchronization points are inserted into the transport pipe to subdivide the pipe into dialogue units. A major synchronization point marks the end of a dialogue unit, and no recovery is permitted back beyond that point. Within each direction of a dialogue unit, minor synchronization points can also be added to facilitate recovery. Whenever resynchronization is needed during recovery, a session service user can resynchronize the dialogue to the previous confirmed minor synchronization point within the current dialogue unit. The other method of organizing a session dialogue involves the use of activities. Conceptually, an activity represents a logical piece of work such as a file. It can be dynamically activated, interrupted, resumed or even discarded, thereby supporting parallel tasking and data recovery. The use of activities gives a three-level structure to a session dialogue. At the
topmost level, the dialogue is structured into activities. At the second level, each activity is structured into dialogue units. At the third level, each dialogue unit is structured using minor synchronization points. Note that an activity is a logical concept. As such, an activity may span several session connections, and several overlapping activities may be contained in a session connection.

A session connection has four token attributes: the data token; the release token; the minor synchronize token; and the major synchronize/activity token. Through negotiation, these tokens are assigned to one of the two session service users during either the connection establishment phase or the data transfer phase. The owner of the data token can send data to its peer during a half-duplex data transfer. The use of the release token allows a session service user to refuse the release of a session connection (e.g., if it has data to send) when its peer, the owner of the release token, requests the release. The owner of the minor synchronize token (e.g., the sender of a file) can insert minor synchronization points. Finally, the owner of the major synchronize/activity token can insert major synchronization points to mark the beginning of a dialogue unit or an activity.

The S-CONNECT service element is used to establish a session connection. When the session connection is established, either an existing transport connection is used or a new transport connection is established. When the session connection is released, the associated transport connection need not be released and thus can be kept in a reservation pool for future session connections. The session service elements available for the data transfer phase are used for dialogue control, synchronization and resynchronization. The full-duplex mode, which permits both session service users to send data simultaneously, does not require dialogue control and hence no data token is needed.
The half-duplex mode, which enables only one session service user at a time to send data, requires dialogue control and hence the use of the token management capability. There are three session service elements for token management: S-TOKEN-GIVE to surrender a token; S-CONTROL-GIVE to surrender the entire set of available tokens; and S-TOKEN-PLEASE to request a peer to relinquish the ownership of one or more tokens. Whereas the transport layer provides only two types of data transfer facility, the session layer provides four. Normal data are sent using the S-DATA service element. When the data token is available, only the owner can invoke S-DATA. Expedited data are sent using the S-EXPEDITED-DATA service element. Due to limitations of the underlying expedited transport data facility, a maximum of 14 octets of expedited data can be transferred. The S-TYPED-DATA service element allows a session service user to send data outside the normal data stream, independent of the availability and the assignment of the data token. When used properly, this facility gives the session service users a mixed half/full-duplex mode which is useful in many applications. The S-CAPABILITY-DATA service element is used to send limited data between two activities. At the end of an activity, for example, the session service users can use capability data to decide which activity to start next. Synchronized data transfer is intended for bulk data transfer, to facilitate error or crash recovery. It is achieved by
inserting minor or major synchronization points. The owner of the minor synchronize token can use the S-SYNC-MINOR service element to insert a minor synchronization point. In practice, the two session service users agree on a window size during initialization, with the understanding that the receiver should acknowledge all previously unconfirmed minor synchronization points before the two users exceed the window size. Unlike S-SYNC-MINOR, S-SYNC-MAJOR is a confirmed service element. Once a session service user invokes an S-SYNC-MAJOR request, a confirmation must be received. The receipt of a confirmation acknowledges not only the major synchronization point but also all previously unconfirmed minor synchronization points. It marks the end of the current dialogue unit; thus the sender can discard all the data associated with the dialogue unit.

Resynchronization is normally triggered by a notification which signals a possible failure. The notification is initiated by either a session service user or a session entity. The S-U-EXCEPTION-REPORT service element is used by a session service user to report a user error (e.g., failure to hand over the data token) to its peer. The S-P-EXCEPTION-REPORT service element is used by a session entity to indicate an internal error (e.g., a session protocol error) to the session service users. Typically, following an S-U-EXCEPTION-REPORT or S-P-EXCEPTION-REPORT indication, a session service user would initiate resynchronization by invoking the S-RESYNCHRONIZE service element. In the request, it must specify the resynchronize type, such as abandoning the current dialogue unit or resynchronizing the session to an unacknowledged checkpoint within the current dialogue unit. If necessary, the available tokens may be reassigned. Five session service elements are available for the management of activities.
They are: S-ACTIVITY-START to initiate a new activity; S-ACTIVITY-INTERRUPT to interrupt an activity; S-ACTIVITY-RESUME to resume an activity; S-ACTIVITY-DISCARD to discard an activity; and S-ACTIVITY-END to end an activity.

Two kinds of orderly release are made available to session service users: negotiated release and nonnegotiated release. The distinction between the two forms of release is based on the use of the release token. If this token is not available, the release cannot be negotiated. If the release token is available and the owner invokes the S-RELEASE service, the owner's peer may choose to reply negatively to the release request when it still has data to send. For destructive release, the session layer provides an abortive service which can be either user-initiated (S-U-ABORT) or provider-initiated (S-P-ABORT).

Presentation Layer

The presentation layer (ISO/IEC 8822/8823) provides a pass-through capability that makes the entire set of session services visible to application processes as presentation services. In addition, it is responsible for handling the representation of the application information exchanged during communication. While the representation of information in an end system is a local issue, the two communicating application processes must agree upon what type of information is exchanged and how such information is represented during the transfer. During presentation connection establishment, the two application processes must identify the abstract syntax for the information to be exchanged over the connection. An
abstract syntax can be represented by one or more transfer syntaxes. Therefore, the two application processes must also agree upon a common transfer syntax for every abstract syntax. Once the abstract syntax and the corresponding transfer syntax have been established, mapping between the local syntax (i.e., the syntax used for the local representation of information) and the transfer syntax can begin. A presentation context is a pair consisting of an abstract syntax and a transfer syntax. The objective of the presentation layer is to establish a set of presentation contexts for the communication. This set is known as the defined context set (DCS). Each presentation context in the DCS is named by an integer-valued presentation context identifier (PCI). The DCS can be the empty set if the two application processes have previously agreed upon a default context. A presentation data value, which is passed to the presentation layer by an application process, may be composed of values from one or more abstract syntaxes. There are two ways for the presentation layer to encode a presentation data value: full encoding and simple encoding. Full encoding encodes a presentation data value as a presentation data value (PDV) list. Each component in the list is a PCI followed by a value encoded using the appropriate transfer syntax. The PCI value is always encoded using the basic encoding rules (BER), one of the transfer syntaxes defined by ISO/IEC 8825. Simple encoding is used when the DCS is empty or when the DCS contains only one presentation context. In simple encoding, the presentation data value is given simply by the encoded value; the PCIs are omitted since they are not necessary. As far as the presentation services are concerned, the presentation layer makes the session services directly accessible to the application processes; it offers only a few services of its own. In fact, it offers only one service element which is not related to any session service element.
This service element, P-ALTER-CONTEXT, provides the capability for the presentation service users to modify the DCS such as adding/deleting a presentation context. For example, an FTAM initiator, which may not know the abstract syntax of a file that it wants to open when the FTAM dialogue is established, can use this service element to add a presentation context for the abstract syntax of the file once it is known. A presentation connection is established using the P-CONNECT service element. The following explains how the initial DCS is established during connection establishment. When a presentation service user makes a P-CONNECT request, it passes the presentation context definition parameter. This parameter specifies a partially filled DCS where each item in the list contains two components—PCI and name of an abstract syntax. On receiving the P-CONNECT request, the local presentation entity first determines which transfer syntax it can use to represent each abstract syntax. It then creates a presentation context definition result list, where each abstract syntax is mapped to the set of transfer syntax, which the local presentation entity can support. The presentation context definition result list is passed to the peer presentation entity by means of a presentation CP (Connect Presentation) PDU. The peer presentation entity passes the presentation context definition result list to the called presentation service user. If the called presentation service user accepts the request, a possibly modified result list is returned. On receiving the response, the local presentation entity has a chance to modify the presentation context definition result list before it
returns the modified result list to the presentation entity of the initiating presentation service user. At this stage, the initial DCS is established. The remaining presentation service elements are almost identical to the session service elements. Since the presentation layer reproduces the services of the session layer in a pass-through manner, it is more efficient to implement the two layers in a single implementation module. By sharing the global data structures defined for the two layers, expensive copying can be avoided.

For the rest of this section, examples are given as a brief introduction to Abstract Syntax Notation One (ASN.1) and BER. ASN.1 is similar to the data declaration part of a high-level programming language. It provides language constructs to define types and values. Types correspond to structures and values correspond to content. Unlike the types of a programming language, ASN.1 types, which are meant to be machine-independent, need not be implemented on any machine. For example, the ASN.1 INTEGER type allows all integers as values. An abstract syntax is a named group of ASN.1 types and values. It can be defined by a standards group, a profile group or a user group. One of the reasons why types are grouped into an abstract syntax is that values of these types are meant to be encoded by the same transfer syntax. Thus an abstract syntax can be viewed as a unit for transfer encoding. An ASN.1 module is the ASN.1 notation used to define an abstract syntax:
ModuleExample DEFINITIONS ::= BEGIN
  TypeA ::= INTEGER
  TypeB ::= BOOLEAN
  valueA TypeA ::= 10
  valueB TypeB ::= TRUE
END

The foregoing module is named ModuleExample. It has two ASN.1 types and two values. An ASN.1 type is either simple or structured. Simple ASN.1 types include INTEGER, REAL, BOOLEAN, CHARACTER STRING, BIT STRING, OCTET STRING, NULL and OBJECT IDENTIFIER. Structured types are built from simple types:
Person ::= SEQUENCE {
  name IA5String (SIZE (0..64)),
  phone IA5String (SIZE (0..64)) OPTIONAL,
  email SET OF IA5String OPTIONAL
}

The foregoing ASN.1 type can be used to represent a person. The following observations about the type are made.

• The keyword OPTIONAL means that the corresponding component of the sequence can be omitted when a value is specified.

• The IA5String type is a CHARACTER STRING type consisting of characters taken from IA5 (International Alphabet No. 5).

• IA5String (SIZE (0..64)) is a subtype of IA5String whose allowed strings have a maximum size of 64 characters.

• SET OF is a structured type representing an unordered list.
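Values of a type such as Person are transferred using a transfer syntax such as BER, which wraps every value in a type-length-value triple. The helper below is a minimal sketch assuming only the short definite-length form; the name value is invented for the example.

```python
def tlv(tag, content):
    """BER Type-Length-Value using the short definite-length form
    (content shorter than 128 octets)."""
    if len(content) >= 128:
        raise ValueError("long-form lengths are not handled in this sketch")
    return bytes([tag, len(content)]) + content

# Universal tags: 0x02 INTEGER, 0x16 IA5String, 0x30 SEQUENCE (constructed).
name = tlv(0x16, b"Alice")   # IA5String "Alice"
person = tlv(0x30, name)     # a Person value with phone and email omitted
print(person.hex())          # 30071605416c696365
```

Because SEQUENCE is a constructed type, its content field is simply the concatenation of the already-encoded TLV triples of its present components, which is why optional components can be omitted without ambiguity.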
ISO has defined a number of transfer syntaxes, including the Basic Encoding Rules, the Distinguished Encoding Rules and the Packed Encoding Rules. The following gives a brief introduction to BER, which is by far the most popular transfer syntax. Every BER-encoded value has three fields: a tag (identifier) field that conveys information on the type and the encoding form; a length field that defines the size of the content in octets; and a content field that conveys the actual value. A BER-encoded value is therefore sometimes called a Type-Length-Value (TLV) triple. Each ASN.1 type has an associated encoded tag which is used for the tag field in a TLV triple. The following gives an instance of a SEQUENCE type and its BER encoding:
Constructed ::= SEQUENCE {
  name OCTET STRING,
  place INTEGER { room1(0), room2(1), room3(2) },
  persons INTEGER OPTIONAL
}

meeting Constructed ::= {
  name '1AA2FFEE'H,
  place room3
}

The TLV encoding (in hex) of meeting, which is a constructed value, is

30 09 04 04 1A A2 FF EE 02 01 02

The following explains how the encoded value is derived:

• The 30 is the encoded tag value for SEQUENCE.

• The 09 means that the length of the content field is 9 octets.

• The 04 04 1A A2 FF EE portion is the encoding of the octet string '1AA2FFEE'H, where the encoded tag for OCTET STRING is 04 and the length is 4 octets.

• The 02 01 02 portion is the encoding of the integer 2 (the value denoted by room3).

Application Layer

The application layer provides all the communication support to application processes. Hence, if the lower six layers do not provide the required communication support, the application layer has to provide it. A framework to build the objects in the application layer is needed. The application layer structure standard (ISO/IEC 9545) defines a framework around which application standards can be developed. Conceptually, an application process can be divided into communication objects and noncommunication objects. The communication objects (i.e., objects which provide communication capabilities to the application process) are called application entities (AEs). An application process may have one or more AEs. For example, a business application may consist of an AE containing X.400 capabilities and an AE containing FTAM capabilities. The division is only conceptual, so an actual implementation of an application process may not follow such a division. The structure of an AE can be complex. To understand how one AE communicates with another AE, it is necessary to refine the AE into granular components and analyze how these components communicate with their peers. In this way, the
design of an application protocol between two communicating AEs can be reduced to the design of an application protocol between two communicating components of less complexity. The application layer structure standard proposes structuring an AE in a recursive manner, starting with atomic components called application service elements (ASEs). One or more ASEs can be combined to form an application service object (ASO). An ASO can be combined with one or more ASEs or ASOs to form another ASO. Continuing this recursion, the outermost ASO, which is the AE, is derived. Every ASO contains a control function (CF). The CF acts as a traffic cop to coordinate the activities of the ASEs and ASOs within the outermost ASO. In particular, the CF unifies the services of the various components of an ASO, and it may add temporal constraints on the use of the combined service.

Before two AEs can communicate with each other, they must first establish an application association, which is an association between two AE-invocations (i.e., invocations of an AE). An application context, which is the most important attribute of an application association, defines the rules to be enforced during the lifetime of the application association. In particular, it specifies the required ASOs and ASEs, the abstract syntaxes that may be referenced, and the binding/unbinding information that needs to be exchanged before an application association is established/released. In short, an application context defines the working environment or knowledge that is shared by the AEs for the duration of the application association. The ASEs and ASOs can be viewed as workers operating within the constraints of the application context. Each worker communicates with a peer worker (of the same type) using a specialized protocol (e.g., an application protocol of an ASE). The decomposition of an AE into ASOs and ASEs is only static; it does not mean that all the ASOs and ASEs in an AE are always involved in an AE-invocation.
For example, an AE may have five ASEs and three ASOs, but a particular AE-invocation may involve only three ASEs and two ASOs in an application association. To understand the dynamic behavior of an AE, one should examine the structure of an AE-invocation. By interacting with its peers, an AE-invocation can be involved in multiple application associations. Conceivably, the application contexts for these application associations may differ from each other. Therefore, one can refine an AE-invocation into components, with one component for each application association. These component objects are called single association objects (SAOs). They are active objects since they maintain state. Every SAO contains a single association control function (SACF) which acts as a coordinator of the ASEs and ASOs involved in an application association. When the AE-invocation contains several SAOs, there may be a need for a multiple association control function (MACF) to coordinate the SAOs. An MACF is similar to an executive manager. In some cases, an MACF is not needed because the SAOs do not need coordination.

The structure of the application layer thus reduces the design of an application protocol between two AEs to that of an application protocol between two ASEs. Since ASEs are the basic building blocks of an AE, the common ASEs and their associated application protocols should be standardized first. Common ASEs provide generic communication capabilities to a number of applications. Examples include the Application
Control Service Element (ACSE), the Remote Operation Service Element (ROSE) and the Reliable Transfer Service Element (RTSE). The use of common ASEs ensures that applications can be built in a consistent manner. In addition to the common ASEs, there are specific ASEs that provide specific capabilities to applications. The FTAM ASE defined in the FTAM standard is an example of a specific ASE.

The ACSE standard is defined for the purpose of establishing and releasing application associations. An application association is a presentation connection with additional application layer semantics, e.g., application context negotiation and peer-to-peer authentication. Currently, there is a one-to-one mapping between application associations and presentation connections. Future versions of the ACSE standard might permit a presentation connection to be reused for a new application association, or multiple application associations to be interleaved onto a single presentation connection. There are four ACSE service elements. The A-ASSOCIATE service element is used to establish an application association between two AE-invocations. The A-RELEASE service element is used to release an application association in an orderly manner. The A-ABORT service element is used by an AE-invocation to abort an application association with possible loss of transit data. The A-P-ABORT service element is used by the ACSE service provider to signal that an application association has been aborted. The ACSE service elements are mapped onto the presentation service elements in a straightforward manner. Because all the presentation parameters are supplied by the ACSE users, there are over 30 A-ASSOCIATE parameters. Additional A-ASSOCIATE parameters may be added in the future whenever there is a need to provide additional semantics for an application association.
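The four ACSE service elements drive a simple association life cycle. The class below is a hypothetical sketch of that life cycle, not an ACSE implementation; only the application context parameter is modeled, and the state names and context value are invented.

```python
class Association:
    """Toy life cycle of an application association under the four
    ACSE service elements (illustrative sketch only)."""

    def __init__(self):
        self.state = "idle"
        self.context = None

    def a_associate(self, application_context_name):
        # A-ASSOCIATE: establish the association under a given context.
        assert self.state == "idle"
        self.context = application_context_name
        self.state = "associated"

    def a_release(self):
        # A-RELEASE: orderly release; no transit data are lost.
        assert self.state == "associated"
        self.state = "released"

    def a_abort(self):
        # A-ABORT: user-initiated abort; transit data may be lost.
        self.state = "aborted"

assoc = Association()
assoc.a_associate("hypothetical-ftam-context")
assoc.a_release()
print(assoc.state)  # released
```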
Of the A-ASSOCIATE parameters which are application-specific (i.e., not specific to the presentation layer), only the application_context_name parameter, which specifies the application context, is mandatory.

In a typical interactive environment, an AE-invocation requests a remote AE-invocation to perform an operation; the remote AE-invocation executes the operation and returns either an outcome or an error. Because many distributed applications are built around this kind of interaction, it is useful to have an ASE that provides such interactive communication support. The Remote Operation Service Element (ROSE) standard is written for this purpose. It is used by application protocols such as CMIP and DAP/DSP. The ROSE standard defines a model for remote operations. A remote operation is requested by an invoker. The performer attempts to execute the operation and reports the outcome, which is either a normal outcome or an exception. Every invocation is identified by an invocation identifier, which differentiates it from other invocations of the same operation. In addition, it may have a linked invocation identifier, indicating that the operation is part of a group of linked operations formed by a parent operation and one or more child operations. The performer of the parent operation may invoke zero or more child operations to be performed by the invoker of the parent operation. There are five ROSE service elements. An invoker uses RO-INVOKE to request that a remote operation be performed. After execution, a positive result is returned using RO-RESULT, while a negative result is returned using RO-ERROR. When a ROSE user detects a problem with the invocation,
it can use RO-REJECT-U to reject the request or the reply. The ROSE service provider uses RO-REJECT-P to inform ROSE users of problems such as a badly structured application PDU. ROSE is not meant to be used as a standalone ASE in an application context. In any application context using ROSE, there must be one or more ASEs which supply the remote operations for the application's needs. ROSE only acts as a courier for such remote operations. To facilitate the specification of remote operations by ROSE users, the ROSE standard provides templates for the definition of remote operations.

CONCLUSIONS

The OSI Reference Model is a carefully designed model that provides the framework for the development of protocols to interconnect open systems. By providing a rich set of communication functionalities, it meets all the conceivable interconnection requirements. A seven-layer implementation necessitates good software engineering techniques and a sound understanding of OSI concepts. Independently manufactured OSI implementations exist and have proven interoperability. Most of them are based on a profile that meets the requirements of a specific application. A reduced profile, known as Minimal Open System Interconnection (MOSI), was proposed by the OSI Regional Workshop; this profile would meet the requirements of most of the networking applications that exist today. Opponents of OSI believe that the functionalities of the OSI Reference Model are overkill. For instance, very few applications require most of the session functions. Instead of requiring every open system to implement the session layer, the session functions could have been embedded in only those applications that require them. This concern has been addressed by the MOSI profile, the performance of which is comparable with that of existing non-OSI stacks. The presentation layer addresses the situation where a negotiation of transfer syntax is necessary.
Such situations arise in wireless communications when there is a need for encryption and compression. The growth of personal communication systems (PCS) should increase appreciation for the incorporation of the presentation layer in the communication stack. The success of the OSI Reference Model is clearly illustrated by the deployment of OSI application protocols such as CMIP, X.400, and X.500; CMIP, for example, has been unanimously chosen by telecommunications managers as the management protocol in the lower layers of the TMN model. The OSI Reference Model is appreciated by practitioners who need the rich functions it provides. It is appreciated by protocol designers (such as the present author) who can learn good protocol design principles from reading the OSI standards. It is certainly a sound protocol model to guide the development of a protocol stack.

ADRIAN TANG
University of Missouri-Kansas City
IT INDUSTRY. See INFORMATION TECHNOLOGY INDUSTRY.
Wiley Encyclopedia of Electrical and Electronics Engineering

Local Area Networks

Standard Article
Joseph B. Evans, University of Kansas, Lawrence, KS
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5314
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: LAN Topologies; IEEE 802 LAN Standards; Protocol Layering; Ethernet (IEEE 802.3); Token Passing Bus (IEEE 802.4); MAP/TOP; Token Ring (IEEE 802.5); Other Token Rings; HYPERchannel and HIPPI; Other LAN Protocols; Wireless LANs
LOCAL AREA NETWORKS
Local area networks (LAN) are data communications networks that are restricted in extent to an office, home, building, or, in some cases, areas as large as a campus. Due to the spectacular growth in networking, LANs can be found deployed in almost every organization. A variety of established and evolving technologies are used in LANs, based on physical facilities ranging from copper and optical fiber to radio. The characteristics of commonly used LAN technologies are discussed in this article.

LAN TOPOLOGIES

LANs can be logically organized in several topologies, the most popular of which are the bus, star, and ring. In the bus structure, illustrated in Fig. 1, nodes (computers, printers, or similar devices) are interconnected by a common, shared physical resource, typically a wire or cable. This topology is inexpensive, since wiring expenses are shared among the nodes. Unfortunately, this scheme involves sharing the limited bandwidth resources of the bus and can also be somewhat unreliable, as bus failures in the vicinity of one node can affect the others on the same bus. IEEE 802.3 10base5 and 10base2 Ethernet are examples of networking standards based on a bus topology at the physical layer. This remains the most common topology in use, however, due to its simplicity of deployment.

An alternative is the star topology, shown in Fig. 2, in which each node has dedicated resources to some central switching site. This has the advantage of dedicated bandwidth to the interconnection point, but the attendant cabling costs are often higher than in bus topologies. Asynchronous transfer mode (ATM) is an example of a networking standard based on a star topology. There is increasing interest in star topologies (switched Ethernet, for another example) because the limited bandwidth on a cable is not shared and traffic is

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
Figure 1. Bus topology. The transmission medium is shared among stations in this configuration.
not subject to internode arbitration delays for access to the medium.

Another option is the ring topology, shown in Fig. 3, in which each node is interconnected to its neighbor. The IEEE 802.5 Token Ring and the Fiber Distributed Data Interface (FDDI) are examples of networking standards based on a ring topology. This topology shares many of the advantages and disadvantages of the bus topology: inexpensive wiring, but with reliability problems if the ring should be broken. Ring-based LANs have been designed to overcome the reliability issues by using counter-rotating rings (FDDI, for example). Depending on the protocols in use, bandwidth in a ring-based network can be reused (since the ring is not physically contiguous), and hence such a topology can have a capacity greater than that of the equivalent bus.

IEEE 802 LAN STANDARDS

Much of the growth in deployment of LAN technology can be attributed to the standardization of selected technology options, which has enabled multivendor interoperability and has spawned a highly competitive market. The IEEE 802 LAN standards are among the most widely used data protocols yet developed.

The IEEE 802.2 standard specifies the Logical Link Control (LLC) protocols used by the other IEEE LAN standards. IEEE 802.2 allows the lower-level protocols to interface with higher-level protocols in a consistent manner. Using this approach, for example, the Internet Protocol (IP) need not know the type of underlying hardware being used on a particular host, which implies that software can be simplified and made
Figure 2. Star topology. This scheme is based on a central interconnection point for the transmission medium.
more reliable. Note that certain other protocol suites (IP over ATM, for example) use the IEEE LLC SAP (service access point) codes for protocol multiplexing and demultiplexing, so that similar benefits can be obtained. IEEE 802.2 provides several services; which services are used, and the extent to which they are used, depends on the needs of the other protocols involved.

The IEEE 802.3 standard has been one of the most successful in the IEEE LAN suite. This standard describes the Carrier Sense Multiple Access with Collision Detection (CSMA/CD) protocol, which forms the basis for the Ethernet family (note that the 802.3 standard and Ethernet differ slightly but can be made to interoperate). The IEEE 802.3 standard comprises several related protocols for different physical media. Included are the original 10base5 standard for CSMA/CD on 50 Ω thick coaxial cable, the 10base2 standard for lighter 50 Ω coaxial cable, and the 10baseT standard for unshielded twisted pair cables. Less commonly used today are the 1base5 StarLAN standard and the 10broad36 standard for more widely dispersed networks. In addition, fiber extension options are available for distributed site interconnection (within protocol distance limits). Ethernets can be found in almost all corporate data networks. The primary data rate is 10 Mbits/s, although higher-rate Ethernet protocols are becoming available, particularly 100 Mbits/s Fast Ethernet and the ongoing work on Gigabit Ethernet.

The IEEE 802.4 standard specifies the Token Bus protocol. This protocol has been the basis for several networking technologies, including the MAP/TOP (Manufacturing Automation Protocol/Technical and Office Protocol) suite. Multiple physical layers are defined for token bus on 75 Ω coaxial cable, including systems at 1 Mbit/s, 5 Mbits/s, and 10 Mbits/s. These are all broadband systems. The original 1 Mbit/s system has been quite popular due to its low cost and relative simplicity.

The IEEE 802.5 standard specifies the Token Ring protocol. This standard has been widely deployed in PC-based networks and is second only to Ethernet in ubiquity. It uses shielded or unshielded twisted pair cabling, with data rates of 4 Mbits/s and 16 Mbits/s. It has several very desirable features, including robust behavior in the presence of high traffic loads and bounded delay (to transmit) times.

Figure 3. Ring topology. The ring is based on a loop configuration for the medium.

PROTOCOL LAYERING

For standardization purposes, networking protocols are most often conceptually partitioned into several layers. In the case of LAN technologies, the physical layer (PHY), media access layer (MAC), and logical link control layer (LLC) are commonly specified. The latter two are often grouped together to form the data link layer in standard layering schemes.

PHY Layer

The PHY, or physical layer, is the lowest layer of a protocol stack. The standards for this layer typically describe the medium to be used (e.g., cable, fiber, wireless), the modulation schemes, and the encoding schemes used to transmit information across the medium. The PHY layers of LAN protocols generally fall into two categories, baseband and broadband.

A baseband PHY layer is one in which the information-bearing signals are digital signals, typically encoded using simple level-based keying, Manchester encoding, or differential Manchester encoding. This is the most common type of PHY layer in current LANs, being relatively inexpensive and sufficiently robust for most local environments. The disadvantages are distance limitations, typically 100 m to at most 1000 m on copper, and bandwidth, no more than about 155 Mbits/s over copper using current technologies.
Baseband techniques may be used over optical fiber at much greater distances and rates, but with the attendant installation and network equipment costs. For typical LAN installations, however, baseband systems on copper are sufficient.

Encoding schemes are another key element of the PHY layer. A variety of schemes, tailored to the physical medium of a given protocol, have been developed. Some typical encoding schemes are depicted in Fig. 4. These can be broadly classed as non-return-to-zero (NRZ) techniques and biphase techniques. The conceptually simplest schemes are the NRZ methods. In the NRZ-level approach, for example, zeros are encoded as a low voltage level, and ones are encoded as a high voltage level. In optical fiber systems, the corresponding scheme may be that ones are the presence of optical power and zeros the absence of light. In the NRZI (NRZ with invert on ones) approach, a transition (either falling or rising edge) denotes a one, and the lack of a transition signifies a zero.

While simple, the NRZ schemes have several shortcomings. Most significantly, recovery of bit timing at the receiver can be difficult: the moment in time at which to sample a bit
Figure 4. Baseband PHY encoding schemes (the data bit pattern 0 0 0 1 0 1 0 1 1 is shown encoded as NRZ level, NRZI, Manchester, and differential Manchester waveforms). This illustrates the relationship between data bits and the signal (optical or electrical) sent across the physical medium over time.
to determine if it is a zero or one is often not apparent in the presence of noise and other such impairments. A technique that provides an unambiguous timing reference is highly desirable. Furthermore, the occurrence of a long string of zeros or ones can result in an undesirable dc voltage bias on the transmission medium, which may cause threshold-related errors and problems with the use of transformers.

There are several approaches to resolving these related problems, which center around the need for signal transitions. The 4B/5B and related techniques (4B/6B and 8B/10B are also common) involve guaranteeing sufficient transitions by inserting extra bits into the signal stream. Data symbols, 4 bits in this case, are mapped into a 5-bit code, which is then transmitted using NRZI, for example. This is illustrated in Table 1. An inspection of this table will show that strings with at most three consecutive zeros are possible, even when code words are concatenated. Multiple ones are not an issue if NRZI is used for transmission, as ones force a transition to occur. The cost of a 4B/5B mapping, of course, is that only 80% efficiency is possible.

Table 1. 4B/5B Encoding

Data Symbol   Code Word
0000          11110
0001          01001
0010          10100
0011          10101
0100          01010
0101          01011
0110          01110
0111          01111
1000          10010
1001          10011
1010          10110
1011          10111
1100          11010
1101          11011
1110          11100
1111          11101
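The mapping in Table 1 and the three-consecutive-zeros claim can be checked mechanically. A sketch in Python (the function names are mine, not from any standard library):

```python
from itertools import product

# The Table 1 mapping, keyed by 4-bit data symbol.
FOUR_B_FIVE_B = {
    0b0000: "11110", 0b0001: "01001", 0b0010: "10100", 0b0011: "10101",
    0b0100: "01010", 0b0101: "01011", 0b0110: "01110", 0b0111: "01111",
    0b1000: "10010", 0b1001: "10011", 0b1010: "10110", 0b1011: "10111",
    0b1100: "11010", 0b1101: "11011", 0b1110: "11100", 0b1111: "11101",
}

def encode_4b5b(data: bytes) -> str:
    """Encode each byte as two 5-bit code words (high nibble first)."""
    bits = []
    for byte in data:
        bits.append(FOUR_B_FIVE_B[byte >> 4])
        bits.append(FOUR_B_FIVE_B[byte & 0x0F])
    return "".join(bits)

def max_zero_run(bits: str) -> int:
    """Length of the longest run of consecutive zeros."""
    return max(len(run) for run in bits.split("1"))

# Check the claim in the text: no concatenation of two code words
# produces more than three consecutive zeros.
worst = max(max_zero_run(a + b)
            for a, b in product(FOUR_B_FIVE_B.values(), repeat=2))
print(worst)  # 3
```

Note that each 10-bit output per input byte makes the 80% efficiency figure cited above concrete: 8 data bits are carried in 10 transmitted bits.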
Biphase encodings are beneficial with respect to maintaining signal balance, and bit timing recovery is particularly easy to implement with them. They are based on signal transitions at a rate double that of the bit rate: a transition (rising edge or falling edge) is guaranteed to occur at the center of each bit period. The absence of such a transition can be used as an error detection mechanism. In Manchester encoding, a zero is encoded as a rising edge at the center of the bit period, and a one as a falling edge at that time. The encoding mechanism can be implemented as an exclusive-or operation between the data and the clock. This is the encoding used for most of the common IEEE 802.3 protocols (10base5, 10base2, 10baseT). Differential Manchester encoding uses the midperiod transition for a clocking reference only, and uses the presence (denoting a zero) or absence (denoting a one) of a transition at the beginning of the bit period to encode the information. This is the method used for the IEEE 802.5 token ring standard. The primary disadvantage of biphase signaling is that transitions occur at twice the data rate, which means that the bandwidth required is greater than that of the equivalent NRZ system and the hardware must operate twice as fast. The former is particularly critical in wireless systems.

A broadband PHY layer is one in which the information is coupled into the medium as analog signals modulated onto a carrier and encoded using frequency shift keying (FSK), amplitude shift keying (ASK), phase shift keying (PSK), or some similar scheme. This type of PHY layer is most often used where longer distances must be served or additional bandwidth is required. Much greater bandwidths may be supported on one cable using broadband schemes, as multiple frequencies can be used. The primary disadvantage of the broadband approach is the cost of the modulators, demodulators, and associated analog hardware.
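The baseband encodings of Fig. 4 can be sketched concretely. In the toy model below (the function names and the +1/-1 level convention are mine), each bit period is represented by two half-period signal levels, which makes the mid-bit transitions of the biphase codes explicit:

```python
def nrz_level(bits):
    # 1 -> high for the whole bit period, 0 -> low.
    out = []
    for b in bits:
        level = 1 if b else -1
        out += [level, level]
    return out

def nrzi(bits, level=-1):
    # A 1 toggles the line level; a 0 leaves it unchanged.
    out = []
    for b in bits:
        if b:
            level = -level
        out += [level, level]
    return out

def manchester(bits):
    # 0 -> rising edge at mid-bit (low then high), 1 -> falling edge,
    # matching the convention in the text (XOR of data and clock).
    out = []
    for b in bits:
        out += ([1, -1] if b else [-1, 1])
    return out

def diff_manchester(bits, level=-1):
    # Always a mid-bit transition; a transition at the *start* of the
    # bit period encodes a 0, its absence encodes a 1.
    out = []
    for b in bits:
        if b == 0:
            level = -level          # transition at start of bit
        out += [level, -level]      # guaranteed mid-bit transition
        level = -level
    return out

print(manchester([0, 1]))  # [-1, 1, 1, -1]
```

The doubled list length per bit mirrors the bandwidth penalty noted above: biphase codes signal at twice the data rate.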
MAC Layer

The MAC, or media access layer, is used to arbitrate access to the PHY layer. For example, in the case of Ethernet, there is a shared medium (cable) that must be used by several nodes, and only one of the nodes can be permitted to access the cable at a particular time. The MAC layer influences the effective throughput over a given physical layer and should be efficient in its use of the available bandwidth. This includes minimizing the overhead due to factors such as protocol headers and dead time between transmissions, while at the same time maximizing the successful transmissions on a busy shared-medium network. In addition, the MAC layer is often designed to ensure that errors are not propagated to the higher-layer protocols.

Various MAC schemes have been developed for the LAN protocols. The three most common are the Carrier Sense Multiple Access with Collision Detection (CSMA/CD) protocol used in Ethernet, the token ring protocols, and the token bus protocol.

The CSMA/CD MAC protocol involves detecting the use of the medium by another station by checking the state of the carrier. If a station has data to transmit, it first attempts to verify that the medium is unused. If it is available, the station transmits. If the medium is not available, the station waits until the medium goes idle and then immediately begins to transmit (note that this is the IEEE 802.3 solution, but other
options are possible in the general case of CSMA). The success or failure of transmissions is monitored on the shared medium, and if a transmission is unsuccessful (that is, a collision is detected), the station waits a prescribed random amount of time (binary exponential back-off) and attempts to transmit again. This procedure is repeated until the transmission is successful, or the limit to the number of transmission attempts (16 in IEEE 802.3) is reached. CSMA/CD is simple, inexpensive, and performs well under light loads. Unfortunately, it can perform poorly under heavy loads and be sensitive to physical layer errors.

Token ring protocols use a "token" to arbitrate access to the transmission medium. A token is a small frame that is exchanged between stations to gain the right to transmit. If a station has data to transmit, it waits until a token is seen on the medium. This station then modifies the token and appends the necessary fields as well as its data. When this frame returns around the ring to the originating station, it is purged from the medium. When data transmission is complete, the station inserts a new token onto the ring. Token rings support fair, controlled access to the medium and perform well under heavy load conditions. A disadvantage is the need for careful token maintenance, particularly in the presence of errors. Several varieties of token ring exist; some of these will be discussed in subsequent sections.

The token bus protocol is closely related to the token ring, but with an underlying physical bus topology. The token exchange mechanism, however, does in fact use a logical ring for token passing. This logical ring is simply an ordering of stations on the bus. Once the logical ring is in place, token passing can proceed as in a ring-based system. This system provides controlled access to the bus and is robust under heavy loads.
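The CSMA/CD retransmission rule described above (binary exponential back-off with the 16-attempt limit cited for IEEE 802.3) can be sketched as follows. The slot-time value is the usual 10 Mbit/s Ethernet figure; the function names and the callback interface are illustrative, not from the standard:

```python
import random

SLOT_TIME_US = 51.2   # 10 Mbit/s Ethernet slot time (512 bit times)
MAX_ATTEMPTS = 16     # attempt limit cited in the text for IEEE 802.3
BACKOFF_CAP = 10      # the backoff exponent stops growing after 10 collisions

def backoff_delay_us(collisions: int) -> float:
    """Truncated binary exponential backoff: wait a random number of
    slot times drawn uniformly from [0, 2**min(collisions, 10) - 1]."""
    k = random.randint(0, 2 ** min(collisions, BACKOFF_CAP) - 1)
    return k * SLOT_TIME_US

def transmit(channel_busy, send) -> bool:
    """Sketch of the MAC loop: defer while carrier is sensed, transmit,
    back off on collision, and give up after MAX_ATTEMPTS."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        while channel_busy():       # carrier sense: defer
            pass
        if send():                  # True -> no collision detected
            return True
        backoff_delay_us(attempt)   # wait before the next attempt
    return False                    # excessive collisions: report failure
```

The cap on the exponent bounds the maximum wait, while the random draw spreads contending stations apart in time.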
One of the disadvantages of this approach is that ring initialization and maintenance are more complex than in a physical ring: the ordering of stations must be determined through some algorithm, and station additions and deletions must be managed.

LLC Layer

The LLC, or logical link control layer, can be viewed as the upper part of the data link layer. It is used to provide data services to the higher layers. In particular, two types of services, connectionless and connection-oriented, are defined. In LANs supporting complex higher-layer protocols, such as TCP/IP, only the simplest LLC services are commonly used. An example of an LLC protocol is the IEEE 802.2 layer. This provides both connectionless and connection-oriented services. The unacknowledged connectionless service provides simple datagram support for the multiplexing and demultiplexing of higher-layer protocols. In addition, a connectionless service with acknowledgments (for monitoring systems, for
Figure 5. IEEE 802.2 LLC frame format: DSAP (8 bits) | SSAP (8 bits) | Control (8-16 bits) | Data (n bits). The SAP (service access point) fields are used to select the appropriate protocol handler on reception of a packet.

Figure 6. Typical Ethernet installation (labeled elements include a router, an upstream link, hubs, a bridge, and 10base5, 10baseT, and 10base2 segments). This illustrates the interconnection of various physical and protocol devices in a typical LAN. Hubs are devices used to concentrate the physical media from several Ethernet stations, and are often used as physical layer translation devices (10baseT to 10base2, for example). Bridges provide isolation between Ethernets and allow more complex LANs to be built.
example) is supported, as well as a connection-oriented service that furnishes flow control and error recovery capabilities based on the lower-layer CRC and a "go-back-N" strategy.

The IEEE 802.2 LLC frame format is depicted in Fig. 5. The destination service access point (DSAP) and source service access point (SSAP) fields are used to indicate the service type (IP or IPX, for example) to higher layers. The control field is used for LLC service support, including indication of the type of service.

ETHERNET (IEEE 802.3)

The Ethernet, and the closely related IEEE 802.3 standard, has been one of the most successful LAN protocols developed to date. This technology is based on CSMA/CD and takes a variety of forms at the PHY layer. A typical Ethernet installation is depicted in Fig. 6.

The Ethernet frame format is illustrated in Fig. 7. The preamble is used for frame delineation. The destination and source address fields (48 bits each) are globally unique identifiers for each Ethernet adapter and are used for station-to-station communication, as well as broadcast (all ones) and multicast (first bit is one). It should be noted that the 48-bit addresses used in Ethernet have become a common feature in IEEE 802-based LANs. The type field (Ethernet) can be used for higher-layer demultiplexing, as in an LLC protocol. The length field (IEEE 802.3) is used to aid in end-of-frame detection. A 32-bit CRC is used for error detection and is followed by a postamble for end-of-frame detection.

TOKEN PASSING BUS (IEEE 802.4)

The IEEE 802.4 Token Bus standard has been widely used in manufacturing systems and early office automation products.
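The Ethernet frame fields described above can be inspected programmatically. The sketch below (the helper names are mine) also applies the conventional rule implementations use to tell the Ethernet type field apart from the IEEE 802.3 length field: values of 1536 (0x0600) and above are treated as type codes, smaller values as lengths.

```python
import struct

def parse_header(frame: bytes):
    """Decode the 14-byte header (preamble/postamble omitted, since
    adapters strip them before delivery)."""
    dst, src, type_or_len = struct.unpack_from("!6s6sH", frame)
    return {
        "dst": dst.hex(":"),
        "src": src.hex(":"),
        # Values >= 0x0600 are Ethernet type codes; smaller values
        # are IEEE 802.3 length fields.
        "ethertype": type_or_len if type_or_len >= 0x0600 else None,
        "length": type_or_len if type_or_len < 0x0600 else None,
        "broadcast": dst == b"\xff" * 6,
        # Multicast: the first address bit on the wire is 1, i.e. the
        # least significant bit of the first destination byte.
        "multicast": bool(dst[0] & 0x01),
    }

hdr = parse_header(bytes.fromhex("ffffffffffff" "020000000001" "0800")
                   + b"\x00" * 46)
print(hdr["ethertype"], hdr["broadcast"])  # 2048 True
```

Here 0x0800 (IP) exceeds the threshold, so the frame is interpreted in the Ethernet style, and the all-ones destination is recognized as broadcast.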
Figure 7 (field layout): Preamble (64 bits) | Destination address (48) | Source address (48) | Type or length (16) | Data | CRC (32) | Postamble (8).
Because it is based on a broadband physical medium, it is somewhat more resistant to the low-frequency electromagnetic (EM) noise that might arise on a factory floor. A token passing bus is a LAN with a bus topology that operates on the principle that a token will be received prior to the transmission of data by a station. The token bus format includes a preamble, frame control byte for denoting whether a particular frame is a token or data, the destination and source addresses (48 bits, as in 802.3), the data (an LLC frame), an error detection field (CRC-32, as in 802.3), and the postamble. The token bus operates by first establishing a logical ring that overlays the physical bus topology. Station additions and deletions require reconfiguration of the logical ring. When a token is received, a station is permitted to transmit multiple packets, until its token holding time has expired. The token bus offers optional support for multiple classes of service through the use of complex timer specifications that enable per-class bandwidth guarantees. Support for simpler nontoken stations is included to allow low-cost devices to respond to polling requests using this medium.
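The logical ring that overlays the physical bus is simply an ordering of station addresses with wraparound. A toy sketch (a descending-address ordering, as commonly described for 802.4-style rings, is used here; the helper names are mine):

```python
# Toy model of a token bus logical ring. Stations on the bus are ordered
# by address, each station's successor is the next address in the ring,
# and additions or deletions just rebuild the ordering.

def build_ring(stations):
    """Order the stations into a logical ring (descending address)."""
    return sorted(stations, reverse=True)

def successor(ring, station):
    """The station to which the token is passed next, with wraparound."""
    i = ring.index(station)
    return ring[(i + 1) % len(ring)]

ring = build_ring({12, 7, 30, 19})
print(ring)                 # [30, 19, 12, 7]
print(successor(ring, 7))   # 30  (the token wraps to the highest address)

# A station joining the bus triggers reconfiguration of the logical ring.
ring = build_ring(set(ring) | {25})
print(successor(ring, 30))  # 25
```

The reconfiguration step is exactly the maintenance cost noted in the text: unlike a physical ring, the ordering must be recomputed whenever membership changes.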
MAP/TOP

MAP is the Manufacturing Automation Protocol developed by General Motors Corporation for communication among automated manufacturing devices, including robotic equipment and the associated controllers. It was primarily designed to support communication between very different sorts of devices, in real time with low, predictable delays. It supports applications as varied as word processing and equipment telemetry (temperature measurement, for example). TOP is the Technical and Office Protocol developed by Boeing Corporation for communication between office automation
Figure 7. Ethernet frame format; the frame ends with the data field, a 32-bit CRC, and an 8-bit postamble. Ethernet and IEEE 802.3 differ in the fourth field, which is a type field in Ethernet and a length field in IEEE 802.3.
devices such as word processing systems and printers. Interoperability between devices from a variety of manufacturers was a key design goal of this protocol. The MAP/TOP protocol suite is based on the IEEE 802.4 Token Bus protocols. As such, MAP/TOP networks are often interconnected with some variety of token passing network for ease of interface design.

TOKEN RING (IEEE 802.5)

The Token Ring protocol has been widely deployed in networks based on PCs. Token Ring operates on the principle of the exchange of a "token" to a station before it is permitted to transmit. Only one token is allowed on the ring at one time. The IEEE 802.5 Token Ring frame formats are illustrated in Fig. 8. The first format is used for token frames and only includes the start and end delimiters and the access control field, with priority and reservation information. The second format includes start and end delimiters, a frame control word for optional LLC support, source and destination addresses (in the 48-bit 802.3 format), the LLC (data) frame, a CRC-32, and a frame status word used by transmitting stations to verify reception.

Figure 8. Token ring frame format. The different formats used for the control token and data frames are depicted.

OTHER TOKEN RINGS

Another example of token ring technology in wide use today is the Fiber Distributed Data Interface (FDDI) standard. This technology supports multiple packets on the ring at one time, with rates of 100 Mbits/s. Provisions are made for multiple service classes (synchronous and asynchronous) with differing throughput and delay requirements. Further, reliability support is provided through optional redundant counter-rotating rings, which can mask a station or fiber failure.

Slotted Ring

Slotted ring technology uses multiple "slots" that rotate around the ring to arbitrate access. Each slot is a small frame that can be marked empty or full. When an empty slot arrives at a station with data to transmit, the slot is marked full and data is injected. The slot is marked empty when it returns around the ring to its source. A given station cannot transmit again while it has an outstanding slot. The provision for multiple packets from different sources on the ring at one time assists in fair utilization and quality of service support. The Cambridge Ring is an early example of such technology (some claim it is the ancestor of the ATM protocols, also based on small fixed frame sizes). A slot contains one octet each for the source and destination addresses, five control bits, and two, four, six, or eight data octets, and thus slot sizes are extremely small. This implies that higher-layer packet data is almost always segmented into small units prior to transmission. Stations could choose not to receive packets from particular sources; some of the control bits support this through response codes. The Cambridge Ring was simple to implement, but was somewhat wasteful of bandwidth due to the header overhead in such small datagrams.

Register Insertion Ring

Register insertion rings are a common LAN technology and can be used to provide high performance through their support for multiple packets on the ring at one time. The register insertion ring uses a small shift register at each station to control forwarding and insertion onto the ring. The shift register is at least as large as the maximum frame size. This allows a station to store a frame as it passes. If the station has no data to send, a passing frame is buffered long enough to determine whether it is destined for the local station. If it is destined locally, a typical implementation will both copy the frame into adapter memory and forward the frame back around the ring to support acknowledgments. Transmission when the medium is available is handled by simply copying the data onto the ring. If a frame arrives during this time, it is buffered in the insertion register. The register insertion method provides excellent ring utilization due to the multiple simultaneous packets on the ring, without the overhead penalty of the slotted ring. The disadvantage of this technology is that the purge mechanism (that is, the technique used to remove problematic packets from the ring) is generally more complex than in other systems.

HYPERchannel AND HIPPI

A number of LAN protocols are designed for very high-speed interconnection of computers and their peripherals. HYPERchannel, developed by Network Systems Corporation, is one of these. This protocol was developed in the mid-1980s for the interconnection of supercomputers and high-performance peripherals, and has been used with Cray and Amdahl systems, among others. It supports data rates of up to 275 Mbits/s over a variety of physical layers.
Figure 8 (field layout). Token frame: Start delimiter (8 bits) | Access control (8) | End delimiter (8). Data frame: Start delimiter (8) | Access control (8) | Frame control (8) | Destination address (48) | Source address (48) | Data | CRC (32) | End delimiter (8) | Frame status (8).
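Token circulation as described in the Token Ring section can be modeled with a toy round-robin scheduler. The structure below is illustrative only: it ignores priorities, reservations, and the token maintenance issues noted earlier.

```python
from collections import deque

def run_ring(queues, rounds=10):
    """Simulate token passing: queues is a list of per-station deques of
    frames awaiting transmission. A station holding the token sends one
    frame (which circulates and is purged by its originator), then the
    token passes to the next neighbor."""
    delivered = []
    holder = 0                                 # station seeing the token
    for _ in range(rounds):
        if queues[holder]:
            frame = queues[holder].popleft()
            delivered.append((holder, frame))  # frame makes one circuit
        holder = (holder + 1) % len(queues)    # release token to neighbor
    return delivered

qs = [deque(["a1", "a2"]), deque(), deque(["c1"])]
print(run_ring(qs, rounds=4))  # [(0, 'a1'), (2, 'c1'), (0, 'a2')]
```

Even this toy model exhibits the fairness property cited in the text: station 0 cannot send its second frame until every other station has had a chance to transmit.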
HIPPI, or High Performance Parallel Interface, is another of the protocols developed primarily for interconnection of supercomputers. This protocol supports 800 Mbits/s or 1.6 gigabits/s links over a large parallel cable, which is either 32 lines or 64 lines wide and runs at 25 MHz. The distances over which HIPPI can be used are quite limited, but large enough for a typical supercomputer center equipment floor. Interconnection of sites can be accomplished using fiber extension options. Simple flow control features are provided to lessen problems with computers and peripherals of widely different I/O (input/output) bandwidth. Although simple and effective, this flow control scheme does contribute to the problems of extending HIPPI networks over larger distances while maintaining high throughput. To build HIPPI networks of nontrivial size, simple switches are used to interconnect devices. These switches are typically not designed to switch between sources and destinations at high rates, as with routers and packet switches, but rather act as interconnection panels that may be reconfigured at reasonable rates for sharing peripherals.
OTHER LAN PROTOCOLS

A number of new, higher-performance LAN protocols have been developed in recent years. Fibre Channel is a LAN protocol suite designed for high-speed communication between nodes using optical fiber. Rates of up to 800 Mbits/s are supported, with systems up to 4 Gbits/s under design. Other developments include FireWire (IEEE 1394) and the universal serial bus (USB), high-speed protocol suites based on serial interconnection technology. FireWire, for example, supports bandwidths of up to 400 Mbits/s with up to 63 devices (with no more than 16 cable hops) per bus. USB, a 12 Mbits/s serial protocol with chaining support, is designed primarily as an improvement over traditional serial port technologies.

Asynchronous transfer mode (ATM) networks are also being widely deployed in LANs. ATM is a switch-based technology that uses small packets (53 bytes) called cells. Interconnection of nodes is through virtual circuits, which are analogous to circuits in voice telephony. Multiple physical layers are supported, including both copper and fiber infrastructure options. Although ATM is often viewed as a wide area networking technology, it provides support for features that are not available in other technologies. For example, ATM allows the definition of virtual LANs, which provide network administrators with options that are not available in less sophisticated technologies. Virtual or emulated LANs are interconnections of LANs, perhaps widely separated, which are configured to emulate a single local area network. Furthermore, multiple logical local area networks can be supported over a single physical infrastructure using this capability.
WIRELESS LANS

Wireless LANs use radio or infrared as the transmission medium, as opposed to traditional wire or fiber. This has significant advantages, particularly for deployment in older buildings where wiring costs are high, as well as in environments in which workers may be moving frequently.
Many of the initial wireless LANs in the United States have used radio frequencies in one of the ISM (industrial, scientific, and medical) bands, which generally may be used without individual site licensing, subject to restrictions on power output. The data rates of these systems range from tens of kilobits per second to a few megabits per second, with typical ranges of a few hundred meters. Some early European products in this area were based on the digital enhanced cordless telephony (DECT) standard for digital telephony. These systems used multiple channels to provide data rates of hundreds of kilobits per second.

Wireless LANs are still evolving, but implementations of several standards are now appearing as products. The most significant development in this area is the IEEE 802.11 standard, which will provide data rates of 1 Mbit/s to 2 Mbits/s over a range of approximately 100 m in typical radio configurations. It is based on a CSMA/CA (CSMA with collision avoidance) MAC layer with multiple physical layers: direct sequence spread spectrum, frequency hopping spread spectrum, and infrared. Work on indoor wireless ATM and the European ETSI RES10 standards is focusing on systems with data rates of up to 25 Mbits/s and ranges on the order of 100 m to 200 m. While still early in the development cycle, these systems promise to deliver multimedia services over wireless links with quality of service support.

BIBLIOGRAPHY

P. T. Davis and C. R. McGuffin, Wireless Local Area Networks, New York: McGraw-Hill, 1995.
L. L. Peterson and B. S. Davie, Computer Networks: A Systems Approach, San Francisco: Morgan Kaufmann, 1996.
S. Saunders, The McGraw-Hill High-Speed LANs Handbook, New York: McGraw-Hill, 1996.
W. Stallings, Local Networks, 5th ed., New York: Macmillan, 1996.
J. Walrand and P. Varaiya, High Performance Communication Networks, San Francisco: Morgan Kaufmann, 1996.
J. Wobus, LAN Technology Scorecard [Online], 1996.
Available: http:// web.syr.edu/앑jmwobus/comfaqs/lantechnology.html
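The collision-avoidance idea behind the 802.11 MAC can be made concrete with a toy simulation. Everything below (function names, window sizes, the statistics) is my own simplification, not the 802.11 state machine; it only illustrates why deferring a random number of idle slots, and widening the contention window after a collision, resolves contention among many stations.

```python
# Toy CSMA/CA model (my own sketch, not the 802.11 standard): each
# station picks a random backoff slot; the earliest unique pick wins.
# A tie means a collision, so the contention window is doubled.
import random

def contend(n_stations, cw=8, rng=random.Random(0)):
    """Return the number of contention rounds until one station wins."""
    rounds = 1
    while True:
        picks = [rng.randrange(cw) for _ in range(n_stations)]
        first = min(picks)
        if picks.count(first) == 1:   # a single earliest slot: success
            return rounds
        cw = min(2 * cw, 256)         # collision: widen the window
        rounds += 1

# More contenders mean more collision rounds on average:
avg = lambda n: sum(contend(n) for _ in range(500)) / 500
print(avg(2), avg(10))
```

With two stations a first-round tie is rare (1 chance in 8 here), so the average is close to one round; with ten stations crowded into the same small window, extra rounds are almost always needed.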
JOSEPH B. EVANS
University of Kansas
LOCAL AREA NETWORKS. See ETHERNET.
LOCATION SYSTEMS, VEHICLE. See VEHICLE NAVIGATION AND INFORMATION SYSTEMS.
Wiley Encyclopedia of Electrical and Electronics Engineering
Metropolitan Area Networks
Standard Article
N. F. Maxemchuk, AT&T Labs–Research, Murray Hill, NJ
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5315
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are:
History
Differences Among LANs, MANs, and WANs
The MAN Anomaly
The Fiber Distributed Data Interface
The Distributed Queue, Dual Bus Protocol
The Manhattan Street Network
Comparison of FDDI, DQDB, and the MSN
CATV
Conclusion
METROPOLITAN AREA NETWORKS
Metropolitan area networks (MANs) have been studied, standardized, and constructed for less than 20 years. During that time the capabilities of the telecommunications network and the requirements of users have changed rapidly. What a MAN is supposed to do has changed as quickly as MANs have been designed. Recent changes in user requirements, resulting from the growing use of the Internet at home, are likely to redefine MANs once again. To understand the evolution and predict the future of MANs, one must consider the applications and alternative technologies. There are also inherent differences in the capabilities of local, metropolitan, regional, and wide area networks.

HISTORY

The first mention of MANs, that I am aware of, occurred at a workshop on local area networks (LANs) in North Carolina in the late 1970s. One session at the workshop was dedicated to customer experience with LANs. One of the customers, from a New York bank, described a successful application of LANs but complained about the difficulty he had transferring data between branches of the bank in the same city. The feeling among the workshop participants was that we could do better connecting sites in the same city than using technology that was designed for a national network. In the 1970s telephone modems were expensive, about a dollar per bit per second, and the highest rate modem that was generally available was 9.6 kbit/s. High-rate private lines, such as the current T-carrier system, were not widely deployed. Using the available technology was expensive and created a bottleneck between LANs. With the customer and application identified, work on MANs began.

The first MANs were designed to interconnect LANs. They evolved from LANs and looked very much like LANs. Two MAN standards that are clearly related to LANs, the fiber distributed data interface (FDDI) and the distributed queue, dual bus (DQDB), are described later in this article. A third network, which is based on a mesh structure, the Manhattan street network (MSN), is also described. The MSN is a network of two-by-two switches that operate on fixed-size cells. The MSN straddles the middle ground between a LAN and a centralized asynchronous transfer mode (ATM) switch (1).

Interest in MANs waned as the useful functions performed by MANs were subsumed by wide area network (WAN) technology. WANs had a much larger customer base than MANs. It became more economical to interconnect LANs in a city with routers and private lines than to deploy new, special-purpose networks. There is a resurgence of interest in MANs because of the World Wide Web (Web). The time required to download Web pages using WAN technologies is frustrating many users and constraining the growth of this service. As more and more individual homes are connected to the Internet, there is a rapidly growing demand for bursty, high-rate data at a large number of locations in a metropolitan area. Therefore, there is renewed interest in MANs, although the requirements and customer set are completely different from those of the earlier MANs.

DIFFERENCES AMONG LANS, MANS, AND WANS

The distance spanned by WANs is greater than that by MANs, and the distance spanned by MANs is greater than that by LANs. It is useful to define the maximum distance spanned by the various network technologies in increments of an order of magnitude. LANs span distances up to 3 miles and include most networks that are installed in a building or on a campus. MANs span distances up to 30 miles (50 km according to the standards committees) and can cover most cities. RANs (regional area networks) span distances up to 300 miles, the area serviced by the telephone operating companies in the United States. And WANs span distances up to 3000 miles, the distance across the United States. The next order-of-magnitude increase covers international networks.

The distances spanned by networks affect the transmission costs, access protocols, ownership of the facilities, and the other users who share the network. Transmission costs usually increase with distance. This cost affects both the applications that are economically viable and the protocols that are used to transfer data. For instance, access protocols have been designed for LANs that trade efficiency for reduced processing complexity. Carrier sense multiple access/collision detection (CSMA/CD) protocols, in which many users share a channel by continuing to try until the data successfully get through, are used on LANs. WANs use reservation mechanisms that require more processing but pack the transmission facility as fully as possible. With CSMA/CD protocols the propagation delay across the network must be much less than a packet transmission time, which precludes using these protocols in WANs.

Traditionally, LANs are networks that are owned and installed by a single company or organization. An organization can choose to try new technologies. MANs are less expensive to install than WANs and may not be interconnected. There is more freedom to experiment with new technologies on MANs than on WANs. The expense of installing MANs relative to installing LANs has resulted in far fewer experimental MANs than LANs.

In a LAN the other network users are generally more trusted than the users in a more open environment. Traditionally, MANs, such as CATV networks, and RANs and WANs, such as the telephone network, serve an unrelated community of users. The users in these networks do not trust one another as much as the users on a LAN, and greater measures must be taken to protect data. There are increasing numbers of wide area networks that are owned and controlled by a single organization. Corporate networks and intranets, which use Internet technology within a corporate network, are becoming common. These networks have trust structures and a flexibility that are more closely related to those of LANs than of WANs. The differences between general WANs and intranets are reflected in the applications of the networks and are leading to different implementations.

Many of the economic tradeoffs that are related to the distance spanned by networks change with time. However, the difference in propagation delay can never change. As the size of the network increases, the maximum useful transmission rate that a user can access to transfer a particular size packet decreases. This phenomenon is demonstrated in Fig. 1. The lines in this figure show when the propagation delay and transmission time are equal in LANs, MANs, RANs, and WANs. The calculations are performed assuming that the propagation speed in the medium is 80% of the speed of light in free space, which is common for optical fibers.

[Figure 1. Equal-delay lines (message size in bits versus transmission rate) when the distance between the source and destination is 3, 30, 300, and 3000 miles. Along each line the propagation delay and the transmission time of the message are equal.]

To the right of these lines, the time it takes to get the message from the source to the destination is dominated by the propagation delay rather than the transmission time. Increasing the transmission rate when operating to the right of a line does not bring a commensurate decrease in the time it takes to deliver a message. For instance, on a 3000 mile WAN the equilibrium point on a 1.5 Mbit/s T1 circuit is about 30.3 kbit. For a message of this size, the propagation delay and transmission time are equal. If the user's rate is increased to 45 Mbit/s, a T3 circuit, which is 30
times faster, the time to transmit the message decreases by a factor of 30, but the propagation time remains the same. The time it takes the message to get to the destination decreases by less than a factor of 2 as the transmission rate increases by a factor of 30. At T3 rates the message delivery time is almost entirely due to the propagation delay, and if the user's rate increases to 155 Mbit/s, an ATM circuit, there is virtually no decrease in the time it takes to receive the message. If the user sends the same message on a MAN and increases the rate from T1 to T3, the time to deliver the message decreases by almost a factor of 30, and increasing to ATM rates decreases the delivery time by another factor of 2. Therefore, ATM rates may be used to obtain faster delivery of this size message in a MAN, but not in a WAN.

THE MAN ANOMALY

Generally, transmission links cost more as the distance increases. At present, high-rate channels are readily available in LANs and WANs, but not in MANs. The use of computers in offices has become ubiquitous and has resulted in most office buildings being wired for high-speed communications. The backbone of the wide area telephone network is shared by a large number of users. Even though only a fraction of the users require high-rate facilities, the number is large enough to warrant providing those facilities between central offices. Fiber to the curb and other methods to provide high data rates in a MAN exist, but most users do not require these rates. The lines running down a street in a MAN are not shared by as large a number of users as the lines between central offices. As a result, when high-rate channels are installed in an office, the line that spans the final mile or two from the central office to the office building is frequently an expensive, custom installation. The current networks must be modified to provide high data rates to homes until the demand increases to the point where new facilities are justified.
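As an aside, the equal-delay arithmetic of Fig. 1 is easy to reproduce. The sketch below is mine (constants and function names are not from the article); it takes the propagation speed as 80% of the speed of light, as in the figure, and checks the T1-versus-T3 comparison for a 3000 mile WAN and a 30 mile MAN.

```python
# Sketch (my own constants/names): delivery time = propagation delay
# plus transmission time, with propagation at 80% of the speed of light.
MILE_M = 1609.344          # meters per mile
C = 299_792_458.0          # speed of light in free space, m/s

def delivery_time(bits, rate_bps, miles):
    """Propagation delay plus transmission time for one message."""
    prop = (miles * MILE_M) / (0.8 * C)
    return prop + bits / rate_bps

# Equilibrium message size on a T1 (1.5 Mbit/s) circuit over 3000 miles:
prop_3000 = (3000 * MILE_M) / (0.8 * C)
eq_bits = prop_3000 * 1.5e6
print(f"equilibrium size ~ {eq_bits/1e3:.1f} kbit")   # about 30 kbit

# Speeding this message up from T1 to T3 (30x) barely helps on a WAN:
t1 = delivery_time(eq_bits, 1.5e6, 3000)
t3 = delivery_time(eq_bits, 45e6, 3000)
print(f"WAN speedup T1 -> T3: {t1/t3:.2f}x")          # close to 2, not 30

# ...but on a 30 mile MAN the same upgrade recovers most of the 30x:
t1_man = delivery_time(eq_bits, 1.5e6, 30)
t3_man = delivery_time(eq_bits, 45e6, 30)
print(f"MAN speedup T1 -> T3: {t1_man/t3_man:.1f}x")
```

The small differences from the article's 30.3 kbit figure come only from the rounding of the constants assumed here.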
Two possible technologies are digital subscriber loops (DSL) and the CATV network. DSL and ADSL (asymmetric DSL) use adaptive equalizers to transmit between 1.5 and 6.3 Mbit/s over current local loops in the telephone network. ADSL provides higher rates in one direction than in the other. To date, the only really successful MAN for distributing information to a large number of homes has been the CATV network. CATV networks are mainly used to distribute entertainment video; however, experimental networks are being deployed to deliver data to homes. Standards organizations and working groups are actively considering these networks. The multimedia requirements of the Web may well make CATV technology the correct solution for the next MAN. Several of the early proposals for MANs used CATV networks to deliver point-to-point voice and data services, as well as broadcast TV. Later we describe one of these techniques, which is still one of the most forward-looking CATV solutions.

THE FIBER DISTRIBUTED DATA INTERFACE

FDDI (2) is a token passing loop network that operates at 100 Mbit/s. It is the American National Standards Institute (ANSI) X3T9 standard and was initially proposed as the successor to an earlier generation of LANs. FDDI started as a LAN and has been primarily used as a LAN; however, it is capable of transmitting at the rates and spanning the distances required in a MAN. Therefore, it has become common to discuss FDDI in the context of MANs.

Baseband Transmission

FDDI uses a baseband transmission system. Baseband systems transmit symbols, ones and zeros, on the medium rather than modulating the symbols on a carrier, as in a radio network. Baseband systems are simpler to implement than carrier systems; however, the signal does not provide timing and there may be a dc component that is incompatible with some system components. For instance, a natural string of data may have a long sequence of ones or zeros. If the medium stays at the same level for a long period, it is difficult to decide how many ones or zeros were in the string, and the dc level of the system will drift toward the value of that symbol. To tailor the signal to have desirable characteristics, the data are mapped into a longer sequence of bits. A common code for transmitting baseband data on early twisted pair networks is a Manchester code. Each data bit is mapped into a 2-bit sequence: a one is mapped into +1,−1 and a zero into −1,+1. There is at least one transition per bit, which provides a strong timing signal. There is no dc component in this code. Twisted pairs are connected to receivers and transmitters by transformers to protect the electronics from energy picked up by the wires during lightning storms, and transformers do not pass dc. A framing signal is needed to identify bit boundaries and the beginning of a sequence of bits. Framing signals occur infrequently and are sequences that do not occur in the data. Framing can be obtained by alternately transmitting −1,−1 and +1,+1 every n bits. With a Manchester code the bit rate on the medium is twice the bit rate from the source; however, this is a very simple coding system to implement.
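The Manchester mapping just described is easy to make concrete. The sketch below is my own illustration of the mapping (FDDI itself uses the rate 4/5 code described next, not Manchester):

```python
# Sketch (names mine): Manchester coding as described above.
# A one maps to (+1, -1), a zero to (-1, +1), so every bit has a
# mid-bit transition (strong timing) and the code has no dc component.

def manchester_encode(bits):
    """Map each data bit to a 2-symbol sequence on the medium."""
    out = []
    for b in bits:
        out.extend((+1, -1) if b else (-1, +1))
    return out

def manchester_decode(symbols):
    """Invert the mapping, two line symbols per data bit."""
    return [1 if pair == (+1, -1) else 0
            for pair in zip(symbols[0::2], symbols[1::2])]

data = [1, 1, 1, 0, 0, 0, 1]       # long runs are harmless after coding
line = manchester_encode(data)
assert manchester_decode(line) == data
assert sum(line) == 0              # no dc component
assert len(line) == 2 * len(data)  # line rate is twice the source rate
```

The last assertion is the cost the text mentions: the medium must run at twice the source bit rate.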
The FDDI standard uses a rate 4/5 code that maps 4 data bits into 5 transmitted bits. The constraints on codes used for optical fibers are different from those on twisted pairs. Fibers do not act as antennas during lightning storms, are not coupled to electronics through transformers, and can tolerate a dc component. In addition, logic costs have decreased since the early twisted pair networks were designed, so that it is now reasonable to implement more complex bit mappings and to design signal extraction circuits that work with fewer transitions. In the FDDI code there are 16 possible data patterns per symbol and 32 possible transmitted patterns. The 16 patterns are selected to guarantee at least one transition every 3 bits. Some of the remaining patterns serve control and framing functions. The use of the transmitted patterns is listed in Ref. 3.

Token Passing Protocol

A token is a unique sequence of bits following a framing sequence. When a station on the loop receives the token, it may transmit data after changing the token to a different pattern of bits. When the station has completed its data transmission, it transmits a framing sequence and the token so that the next station has a chance to transmit. When a station does not have the token, it forwards the data it receives on the loop. A station may remove the data destined for itself. When a station has the token and is transmitting, it discards any
data it receives on the loop. The discarded data either passed this station prior to the token and has circulated around the loop, or was transmitted by this station after accepting the token. In either case, the data have circulated around the loop at least once and every station has had a chance to receive the data. In FDDI there is a target time for the token to circulate around the loop, the target token rotation time (TTRT). In a simple token passing protocol one station can hold the token for a very long period of time. That station can obtain a disproportionate fraction of the bandwidth and delay other stations for long periods. In the FDDI protocol, station i is entitled to send S(i) bits each time it receives the token. The TTRT is set so that every station can send at least the bits it is entitled to send in a single rotation.
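The entitlement rule can be sketched with a toy model. Everything here (names, numbers) is my own simplification of the mechanism, not the standard's state machine; it only shows that when TTRT exceeds the sum of the entitlements plus the loop overhead, even a fully loaded loop meets the rotation target.

```python
# Toy model (all names mine) of the timed-token entitlement: each
# station i may send at most S(i) bits per token visit, so one full
# rotation costs the loop overhead D plus everyone's transmissions.

def rotation_time(entitlements, queued, overhead):
    """Time (in bit-times) for one token rotation when each station
    sends at most its entitlement S(i) from its queue."""
    t = overhead
    for s_i, q_i in zip(entitlements, queued):
        t += min(s_i, q_i)      # a station never exceeds S(i) here
    return t

S = [1000, 4000, 2000]          # entitlements S(i), in bits
D = 500                         # loop delay plus framing overhead
TTRT = sum(S) + D + 1           # chosen so that sum(S) + D < TTRT

# Even with every queue saturated, the rotation meets the target:
worst = rotation_time(S, [10**6] * 3, D)
assert worst < TTRT
# A lightly loaded loop rotates faster, leaving surplus time that
# stations may spend on lower-priority (asynchronous) traffic:
light = rotation_time(S, [0, 300, 0], D)
assert light < worst
```

The surplus (TTRT minus the actual rotation time) is exactly what the per-station timer described below meters out to lower-priority traffic.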
This requirement can be written as

Σi S(i) + Δ < TTRT

where Δ is the propagation delay around the loop plus the maximum time that is added by the transmission format. The TTRT provides guaranteed bandwidth and delay for each station. After a station i transmits S(i), it can continue to transmit data if the time since it last forwarded the token is less than TTRT, which indicates that the token is circulating more quickly than required. How long a station holds the token depends on the priority of the data it is transmitting. If the data are high priority, the station can hold the token until all of the surplus time in the token rotation has been used. If the data are lower priority, the station may leave surplus time on the token to give stations with higher-priority data a chance to acquire the surplus time.

The TTRT is set individually for each FDDI system depending on the requirements of the stations on that system. The maximum delay that can occur is 2*TTRT, and the average token rotation time is less than TTRT (4). Therefore, TTRT can be set to provide the guaranteed bits, S(i), and an upper bound on delay for synchronous applications. When best-effort traffic, rather than traffic that requires service guarantees, is the dominant traffic type, TTRT is set to trade access delay against efficiency. As TTRT is made smaller, the time delay until a token arrives decreases. As TTRT is made larger, the amount of data transmitted before passing the token increases, less time is spent passing the token, and the efficiency of the system increases. The increase in efficiency is greatest when there is only one active station.

The operation of the token protocol is depicted in Fig. 2. When a station transfers the token, it sets the local counter TRT(i) to TTRT. The station decrements TRT(i) every unit of time, whether or not it has the token. If TRT(i) reaches −TTRT before the token is received, then the token has not visited this station in 2*TTRT and is presumed to be lost.

When station i receives the token it has Wj(i) ≥ 0 units of data waiting to be transmitted at priority level j, where jmax is the highest level and Wjmax(i) ≤ S(i). A station transmits Wjmax no matter how much time is left on TRT(i). For the priority levels j < jmax, a station transmits a data unit as long as TRT(i) is greater than the threshold Tj. T(j−1) < Tj, so that we never transmit data at level j − 1 if we cannot transmit data at level j. We process priority jmax data in the same structure as the other levels by setting Tjmax < −(2*TTRT + S(i)), so that data at that level are never inhibited by the threshold.

Tokens that circulate the loop and can be used by any station are called unrestricted tokens. The FDDI standard also supports a mode of operation in which all of the capacity that is not being used by synchronous traffic is assigned to a single station, possibly for a large file transfer. To support this mode of operation, a restricted token is defined. A station that enters this mode forwards the restricted token rather than the standard token. Another station that receives a restricted token transmits its synchronous traffic, but does not transmit traffic at level j < jmax. Therefore, all of the capacity available for asynchronous traffic is given to a single station until that station forwards an unrestricted token.

Isochronous Traffic
FDDI-II adds the ability to send isochronous traffic to FDDI. An isochronous channel provides a regularly occurring slot. The channel is assigned to a specific station on a circuit-switched basis. FDDI-II is implemented by periodically switching the network between a circuit-switched and a packet-switched mode. There is a central station that sends out a framing signal every 125 µs. A portion of the interval following a frame signal is assigned to circuit-switched traffic, the isochronous mode, and the remaining time in the frame is assigned to the token passing protocol. An isochronous station that is assigned one byte per frame has a 64 kbit/s channel. This channel is adequate for telephone-quality voice. In FDDI-II the stations that implement the token passing protocol must switch between the two modes of operation when they receive framing signals. A station that enters the circuit-switched mode must forward whatever bits it receives. When the circuit-switched mode ends, the station must resume the token passing protocol where it left off.

Architecture

A single failure of a node or a link disconnects the stations on the loop. In a LAN, loops are made more reliable by using normally closed relays to bypass individual stations that lose power or fail in an obvious manner. It is also common to arrange stations in subloops that are chained together at a central location. Subloops with failures are removed from the network (5), so that the stations on the other subloops can continue to communicate. Poor reliability prevents loop networks from spanning the distances and connecting the number of users associated with MANs. In FDDI the reliability is improved with a second loop. The second loop does not carry data during normal operation, but is available when failures occur. Figure 3 shows three components that are used in FDDI networks: (A) units that implement the token rotation protocol, (B) units that manage the reliability, and (C) a unit that is responsible for signal timing and framing.

Type A units connect user devices to the primary loop. There can be more than one type A unit attached to a type B unit. The type A units do not have to be collocated with the type B
[Figure 2. The flow diagram for the timed token rotation protocol in FDDI. The diagonal boxes are decision points where a terminal decides whether or not to transmit when a token is received, dependent upon a local timer.]

[Figure 3. The topology of an FDDI loop, showing the application of the three types of units that are defined for an FDDI in unidirectional and bidirectional loops. The dashed lines show how the loops are reconfigured after failures.]
unit, and they can be daisy-chained together to form a subloop. Type B units are connected to both loops. These units monitor the signal returning from type A units and bypass type A units that have stopped operating. Type B units also monitor the signal on the two loops and bypass failed loop components. The secondary loop is used to bypass failed links or failed type B units. The signal on the secondary loop is transmitted in the opposite direction from the primary loop. Normally type B units patch through the signal that they receive on the secondary loop. When a primary loop failure is detected, by a loss of received signal on that loop, a type B unit replaces the lost signal with the signal it receives from the secondary loop and stops transmitting on the secondary loop. When a type B unit stops receiving the signal on the secondary loop, it replaces that signal with the signal it would have transmitted on the primary loop and stops transmitting on the primary loop. As an example, in Fig. 3 the X signifies a link failure and the horizontal bar indicates the link on which the type B unit has stopped transmitting. The entire secondary loop replaces the single failed link on the primary loop.

The configuration of type A and type B units reflects the way loop networks are installed. Loop networks are installed by running wires or fibers from an office to a wiring cabinet. The type A units are located in offices and the type B units are located in the wiring cabinets. Physically, the topology is a star, but the connection of wires in the wiring cabinet forms a logical loop. The star topology makes it possible for stations to be added to a loop, or moved from one loop to another, without rewiring a building.

THE DISTRIBUTED QUEUE, DUAL BUS PROTOCOL

DQDB (6,7) is the IEEE 802.6 standard for MANs. It uses two buses that pass each station. The stations use directional taps to read and write data on a bus without breaking the bus.
Directional taps transmit or receive data from one, rather than both, directions at the point of connection, and are common components in both CATV and fiber optic networks. DQDB transmits information in fixed-size slots and uses the distributed queue protocol to provide fair access to all of the stations. Signals on the two buses propagate in opposite directions. A station selects one bus to communicate with another specific station and uses the other to place reservations for that bus. The DQDB standard was preceded by two earlier protocols, Express-Net (8) and Fasnet (9), that used directional taps. In the two earlier systems there was a single bus that passed each station twice. On the first pass each station could insert signals, and on the second pass a station could receive signals from all other stations. Both of the earlier protocols provided fair access by guaranteeing that every station had the opportunity to transmit one slot before any station could transmit a second slot.

Passive Taps

Passive taps distinguish directional buses from loop networks, which use signal regenerators. In loop networks, there is a point-to-point transmission link between each station.
Each station receives the signal on one link and transmits on the next link. A station can add or remove the signal on the loop. A failure in the electronics in a station breaks the communications path. By contrast, the stations on a directional bus network do not interrupt the signal flow. A passive tap reads the signal as it passes the station, and another tap adds signal to the bus. If a station with passive taps stops working, the rest of the network is not affected. The protocols that can be used on bus networks are a subset of the protocols that can be used on loop networks, since the stations can add but not remove signals. The inability to remove signals makes it necessary for the bus to have a break in the communications path, where signals can leave the system. The protocols for directional buses can be implemented using regenerators rather than passive taps when it is advantageous. Passive taps remove energy from the signal path, and the signal must be restored to its full strength after passing several stations. In addition, by removing information from the transmission medium after it is received, the medium can be reused to support more communications. The DQDB standard provides for erasure nodes (10,11), which remove information that has already been received. Erasure nodes are regenerators that read slots and, depending on the location of the destination, regenerate the slot or leave the bus empty.

Architecture

The dual bus in a DQDB network is configured as a bidirectional loop, as shown in Fig. 4. The signal on the outer bus propagates clockwise around the loop, and the signal on the inner bus propagates counterclockwise. The signal does not circulate around the entire loop, but starts at a head-end on each bus and is dropped off the loop before reaching the head-end. To communicate, a station must know the location of the destination and the head-ends and transmit on the proper bus.
For instance, station A transmits on the outer bus to communicate with station B, and station B transmits on the inner bus to communicate with station A. The dual bus is configured as a loop so that the head-end can be repositioned to form a contiguous bus after a failure occurs. The head-end for each bus is moved so that the signal is inserted immediately after the failure and drops off at the failure. This system continues to operate after any single failure. The ability to heal failures increases the complexity of stations on the DQDB network. To heal failures, the station that assumes the responsibility of the head-end must be able to generate clock and framing signals. In addition, after a failure each station must determine the new direction of every other station. For instance, after the failure in Fig. 4 is repaired, station A must use the inner bus, rather than the outer bus, to transmit to station B.

The Access Protocol

In a DQDB network transmission time is divided into fixed-size slots that stations acquire to transmit data. A station at the beginning of the bus periodically transmits a sync signal that each station uses to determine slot boundaries. The first bit in the slot is a "busy" bit. It is one when the slot is being
[Figure 4. The topology of the DQDB network, showing the dual bus configured as a dual loop in order to survive single failures.]
used, and zero when it is empty. When a station has data to send, it writes a one into the busy bit. A read tap precedes the write tap at each station. When a station writes a one into the busy bit, it also reads the bit from upstream to determine if it was already set. If the busy bit was zero, the station transmits data. If the busy bit was one, then the slot is full, but there is no harm in overwriting the busy bit. The problem with this type of protocol is that stations that are closer to the head-end see more empty slots than stations that are farther away. To prevent unfair access to the medium, stations send reservations to the stations that can acquire slots before they have a chance. Reservation requests are transmitted to the upstream stations on the opposite bus from the data. There are two separate reservation systems, one for transmitting data on each bus. In each system, the bus that is used to transmit data is referred to as the data bus, and the other bus is the reservation bus. Reservations are used to prevent upstream stations from acquiring all of the empty slots when downstream stations are waiting. A station places reservations in a queue with its own requests and services the entries in this queue whenever an empty slot passes. If the entry is a reservation, the busy bit in the slot is left unchanged so that the slot is available for a downstream station. If the entry is from the local source, the busy bit is set and data are inserted in the slot. A station with a very large message may still dominate the bus by requesting a large number of slots. To prevent this, a station is only allowed to have one outstanding request at a time. When a message requires several slots, one request is transmitted on the reservation bus and one data slot is placed in the local queue. When the data slot is removed from the queue, a second request is transmitted on the reservation bus and the second data slot is placed in the queue.
The process continues until the entire message is transmitted. When there is more than one active user, the active users are given one slot at a time and serviced in a round-robin order. The queue in each reservation system is implemented with two counters that track the number of reservations. Counter CB counts the reservations that precede this station’s data slot, and counter CA counts the reservations that follow this station’s slot. When a station is not actively transmitting, it
increments CB whenever a reservation is received and decrements it whenever an empty slot passes. When a station has a data slot to transmit, it increments CA whenever a reservation is received. When CB is zero and an empty slot appears on the data bus, the station transmits its data slot and then transfers the count from CA to CB. If the station has another data slot, transferring the count from CA to CB gives downstream stations, which placed reservations while this data slot was queued, a chance to acquire the data bus before this station transmits a second data slot. In a DQDB network with multiple priority levels for data, there is one reservation bit and two counters for each priority level. When empty slots are received, the counters for the higher priority levels are emptied first.

Protocol Example

Figure 5 shows the operation of five stations in the middle of a DQDB bus as data arrive and are then transmitted. The data bus propagates from left to right and the reservation bus from right to left. The figure shows the value of the busy bit on the data bus and the reservation bit on the reservation bus for each time slot, as the bus passes each station. The values in counters CA and CB, and whether or not a station is waiting to transmit data, are shown in between the data and request bus at each station. The operation is simplified and ignores questions of relative timing and propagation delay. Starting at slot time 1, a single data slot arrives at station 5, followed by two data slots at station 2, and another single data slot at station 3. When the data slots arrive, the network is busy transmitting other data slots, so the data is queued rather than being transmitted immediately. The service order for the messages is station 5, station 2 slot 1, station 3, and station 2 slot 2. The protocol operates as a first-in first-out (FIFO) queue with a round-robin strategy for multiple slot messages.
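The counter rules above can be sketched in a few lines of Python. This is an illustration only, not the state machine from the IEEE 802.6 standard: the class and method names are invented, it models a single priority level on one bus, and it assumes at most one queued segment per station.

```python
class DqdbStation:
    """Sketch of the distributed-queue counters for one bus at one station.

    CB counts requests that must be served before this station's queued
    segment; CA counts requests that arrive while the segment is waiting.
    """

    def __init__(self):
        self.cb = 0
        self.ca = 0
        self.have_data = False

    def on_request_bit(self):
        """A request bit observed on the reservation bus."""
        if self.have_data:
            self.ca += 1     # downstream request queued behind our segment
        else:
            self.cb += 1     # downstream request queued ahead of us

    def on_empty_slot(self):
        """An empty slot passes on the data bus; returns True if we fill it."""
        if self.have_data and self.cb == 0:
            self.have_data = False
            # Requests that arrived while we waited now go ahead of any
            # later segment of ours, giving round-robin fairness.
            self.cb, self.ca = self.ca, 0
            return True
        if self.cb > 0:
            self.cb -= 1     # let the slot pass to a downstream station
        return False
```

For example, a station that saw one downstream request before queueing its own segment lets the first empty slot pass and transmits in the second, exactly the behavior traced in Figure 5.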
In the first three slots, while the reservations arrive, the data bus is busy carrying slots from stations that are closer to the head-end, which were previously queued. Since the stations are upstream, the reservations are not in the queues at these five stations. In the first slot, station 5 inserts a one on
METROPOLITAN AREA NETWORKS
the reservation bus that is entered in the queues at stations 1 to 4. In the second slot station 2 places a reservation that is only entered in the queue at station 1. In the third slot station 3 places a reservation that is entered in the queues at stations 1 and 2. After three time slots, station 1 has three reservations and station 2 has two reservations, one that will be serviced before its own request and one after. At station 3 there is only one reservation, since station 3 did not receive the reservation from station 2. The fact that station 3 cannot contribute to seeing that station 2 is serviced in its fair turn will not matter, since station 2 has earlier access to the data bus. Stations 4 and 5 have one and zero reservations, respectively. The reservations are serviced starting in slot 4. Since CB is greater than zero in stations 1 to 4, these stations let the empty slot pass by, and each removes one reservation by decrementing CB. Station 5 transmits in slot 4. In slot 5 both stations 2 and 3 are poised to transmit, with CB at zero and a data slot waiting. Station 2 has the first access to the slot and acquires it, demonstrating that it is unimportant that station 3 did not have an entry for station 2. Station 2 has a second slot that it must transmit, but CA indicates that one other downstream station has not been serviced. Station 2 moves the count from CA to CB before entering its next request in the queue. Slot 6 is acquired by station 3, since CB is greater than
Figure 5. The simplified operation of the DQDB protocol. This shows the operation of the counters at five stations on the bus as data packets arrive at the stations and are transmitted on the bus. (CB and CA are the counters at each station; X = 1 when the station has data to transmit.)
zero at station 2, and slot 7 is acquired by the second slot from station 2.

Isochronous Traffic

There is no mechanism in the distributed queueing protocol to provide service guarantees on delay or bit rate. This problem has been sidestepped in the standard by creating a separate protocol that shares the bus with the DQDB protocol, as in FDDI. The slots leaving the head-end are grouped into 125 µs frames. In some slots the busy bit is zero and the slots are available for the data transfer protocol. In other slots the busy bit is one so that stations using the data transfer protocol do not try to access these slots. The busy slots that are generated by the head-end occur at regular intervals and contain a unique header so that they are recognized by stations that require guaranteed rates. These slots are partitioned into octets (bytes) that can be reserved. A station that reserves a single octet in a frame acquires a guaranteed 64 kbit/s channel with at most a 125 µs delay. This is the same guarantee provided by the telephone system. As in FDDI, the channels are circuit switched and referred to as isochronous channels.

Protocol Unfairness

Soon after the IEEE 802.6 standard was passed, it was noted that because of the distance-bandwidth product of the network, there was a potential for gross unfairness (12,13). In this section we explain the source of the problem, and a particularly simple solution, called bandwidth balancing (BWB) (14), which eliminated most of the problem and was added to the standard. The IEEE 802.6 standard is designed to operate at 155 Mbit/s, with 53 byte slots, and is compatible with ATM. At these rates, a cell is only about 0.4 miles long. The standard spans up to 30 miles. Therefore, there may be 75 cells simultaneously on the bus. Assume that a station near the head-end of the bus has a long file transfer in progress when a station 50 cells away requests a slot.
In the time it takes the request to propagate to the upstream station, that station transmits 50 slots. When the request arrives, the upstream station lets an empty cell pass and then resumes transmission. An additional 50 slots are transmitted before the empty cell arrives at the downstream station. When the empty slot arrives, the downstream station transmits one slot and submits a request for another. The round trip for this request to get to the upstream station and return an empty slot is another 100 slots. As a result, the upstream station obtains 100 times the throughput of the downstream station. Although the reason is less obvious, a similar imbalance can occur in favor of the downstream station when that station starts transmitting first. While the downstream station is the only source, it transmits in every cell, while placing a reservation in every slot. When the upstream station begins transmitting, there are no reservations in its counter, but there are 50 reservations on the bus. While the upstream source transmits a slot, a reservation is received. Therefore, the upstream station must allow one slot to pass before transmitting its second slot. During the time it takes to service the reservation and the upstream station's next slot, two more reservations arrive. Therefore, the upstream station lets two empty slots pass before transmitting its third slot. The reservation queue at the upstream station continues to build up each time it transmits a slot, and the upstream station takes fewer of the available slots. The imbalance between the upstream and downstream station is sustained after the 50 reservations pass the upstream station because, at the other end of the bus, the downstream station places a reservation on the bus for each of the empty slots that the upstream station releases. The imbalance is not as pronounced as when the upstream station starts first, but it is considerable. The exact imbalance depends on the distance between the two stations and the time that they start transmitting relative to one another (14).

Bandwidth Balancing

The bandwidth balancing (BWB) mechanism, which was added to the standard to overcome the unfairness, is based on two observations:

1. Each station can calculate the exact number of slots that are used, whether or not the data physically pass the station.

2. It is possible to exchange information between stations by controlling the fraction of the slots that are not used.

A station sees a busy bit for every slot transmitted by an upstream station and a reservation for every slot transmitted by a downstream station. By summing the busy bits and reservation bits that apply to a data bus and adding the number of slots that the station transmits, the station calculates the total number of slots transmitted on the bus. Table 1 shows an example of how stations can communicate and achieve a fair utilization by using the average number of unused slots. Two stations, station A and station B, each try to acquire 90% of the unused bandwidth on a channel. Station A starts first and acquires 90% of the total slots. When station B arrives, only 10% of the slots are available, but station B does not know if the slots are being used by a single station taking its allowed maximum share, or by many stations transmitting sporadically. Station B uses 90% of the available slots, or 9% of the slots in the system.
Station A now has 91% of the slots available. When station A adjusts its rate to 90% of 91% of the slots, it uses 82% of the slots, making 18% of the slots available to station B. Station B adjusts its rate up to 90% of 18%, which causes station A to adjust its rate down, and so on until both stations arrive at a rate of 47.4%. Note that this mode of communications can
Table 1. Convergence of Rates When Two Stations Use 90% of the Slots Available to Them

              Station A                            Station B
  Measure                                Measure
  (Bsy + Rqst)   Take                    (Bsy + Rqst)   Take
  0              0.9 * 1     = 0.9       --             --
  0              0.9 * 1     = 0.9       0.9            0.9 * 0.1   = 0.09
  0.09           0.9 * 0.91  = 0.82      0.82           0.9 * 0.18  = 0.16
  0.16           0.9 * 0.84  = 0.76      0.76           0.9 * 0.24  = 0.22
  0.22           0.9 * 0.78  = 0.70      0.70           0.9 * 0.3   = 0.27
  0.27           0.9 * 0.73  = 0.66      0.66           0.9 * 0.34  = 0.31
  0.31           0.9 * 0.69  = 0.62      0.62           0.9 * 0.38  = 0.34
  ...            ...                     ...            ...
  0.474          0.9 * 0.526 = 0.474     0.474          0.9 * 0.526 = 0.474
only be used when stations try to acquire less than 100% of the slots. The implementation of BWB in the standard is particularly simple. A station acquires a fraction of the slots available by counting the slots it transmits and placing extraneous reservations in the local reservation queue when the count reaches certain values. In this way, a station lets slots pass that it would have acquired. For instance, if stations agree to take 90% of the slots that are available, they count the slots that they transmit and insert an extra reservation in CA after every ninth slot that they transmit. As a result, every tenth slot that the station would have taken passes unused. With BWB, the fraction of the throughput that station i acquires, T_i, is a fraction α of the throughput left behind by the other stations:
T_i = α (1 − Σ_{j≠i} T_j)

When N stations contend for the channel, they each acquire a throughput:

T = α / (1 + α(N − 1))
The total throughput of the system increases as α approaches one or as the number of users sharing the facility becomes large. The disadvantage with letting α approach one is that it takes the network longer to stabilize. We can see from the example in Table 1 that the network converges exponentially toward the stable state. However, as α → 1 the time for convergence goes to infinity. When α = 1, BWB is removed from the network and the original DQDB protocol is implemented.
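The convergence in Table 1 and the closed-form rate can be checked numerically. The sketch below (function and variable names are our own) simply applies the rule T_i = α(1 − Σ_{j≠i} T_j) to each station in turn until the rates settle:

```python
def bwb_rates(n_stations, alpha, sweeps=200):
    """Iterate the bandwidth-balancing rule until the rates settle.

    Each station repeatedly sets its throughput to alpha times the
    capacity left behind by all of the other stations:
        T_i = alpha * (1 - sum_{j != i} T_j)
    """
    t = [0.0] * n_stations
    for _ in range(sweeps):
        for i in range(n_stations):
            others = sum(t) - t[i]
            t[i] = alpha * (1.0 - others)
    return t

# Two stations at alpha = 0.9 settle at alpha / (1 + alpha*(N-1)) = 0.9/1.9,
# the 47.4% rate that Table 1 converges to.
rates = bwb_rates(2, 0.9)
```

The first few sweeps reproduce the rows of Table 1 (0.9 and 0.09, then 0.82 and 0.16, and so on), and the error shrinks by roughly α² per sweep, which is the exponential convergence noted above.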
THE MANHATTAN STREET NETWORK

The Manhattan Street Network (MSN) (15) is a network of two-by-two switches. A source and destination may be attached to each switching node. The logical topology of the network resembles the grid of one-way streets and avenues in Manhattan. Fixed-size cells are switched between the two inputs and outputs using a strategy called deflection routing. The MSN resembles a distributed ATM switch. The two-by-two switching elements may be in a large number of wiring centers rather than a central location. However, the same interconnection structure is used for switching elements in different wiring centers as for elements in the same location. Routing is simpler in the structured MSN than in a general network of small switches. In deflection routing, packets can be forced to take an available path rather than waiting for a specific path. Each packet between a source and destination is routed individually and may take different paths. The packets may arrive at the destination out of order and have to be resequenced. Deflection routing has several advantages over virtual circuit routing. The overhead associated with establishing and maintaining circuits is eliminated, and the capacity is shared between bursty sources without large buffers and without losing packets because of buffer overflows. Deflection routing is also being used for routing inside some ATM switches (16,17).
Topology

The MSN is a two-connected topology. In two-connected networks, other than linear structures like the dual bus or dual loop, a path must be chosen at every intermediate node. Two two-connected topologies that have simple routing rules are the MSN and the shuffle-exchange network (15), shown in Fig. 6. The MSN is a grid of one-way streets and avenues. The directions of the streets alternate. By numbering the streets and avenues properly, it is possible to get to any destination without asking directions or having a complete map. The grid is logically constructed on the surface of a torus instead of a flat plane. The wraparound links on the torus decrease the distance between the nodes on the edges of the flat plane and eliminate congestion in the corners. In the shuffle-exchange network node i is connected to nodes 2i and 2i + 1, modulo the number of nodes in the network. In the figure for the shuffle-exchange network each node appears in both the left- and right-hand columns in order to make it easier to draw the connections. The two links leaving each 2 × 2 switching element are shown in the left-hand column and the two links arriving at each switching element are shown in the right-hand column. The shuffle-exchange network has a simple routing rule, based on the address of the destination. The simple routing rule is one of the reasons why this structure is used in most ATM switches.

Figure 6. Regular mesh topologies. This figure shows the connectivity of the nodes in the Manhattan Street Network and the shuffle-exchange network. The same nodes appear in the left and right columns of the shuffle-exchange network in order to make the regular structure easier to see.

Deflection Routing

Deflection routing is a rule for selecting paths for fixed-size cells at network nodes with the same number of inputs and outputs, as shown in Fig. 7. The cells are aligned at a switching point. If both cells select the same output, and the output buffer is full, one cell is selected at random and forced to take the other link. The cell that takes the alternate path is deflected. Deflection routing can operate without any buffers in the nodes. Deflection routing gives priority to cells passing through the node. Cells are only accepted from the local source when there are empty cells arriving at the switch. Cells are never dropped due to insufficient buffering because the number of cells arriving at the node never exceeds the number of cells that can be stored or transmitted. There is an implicit assumption that the source can be controlled. This assumption is common in data networks with variable throughputs, such as the Ethernet.

The MSN is well suited for deflection routing for three reasons:

1. At any node many of the destinations are equidistant on both output links. Cells headed for these destinations have no preference for an output link and do not force other cells to be deflected.

2. When a cell is deflected, only four links are added to the path length. The worst that happens is that the cell must travel around the block.

3. Deflection routing can guarantee that cells are never lost, but cannot guarantee that they will not be deflected indefinitely and never reach their destination. It has been proven that this type of livelock will never occur in the MSN (18).

Deflection routing is similar to the earlier hot potato routing (19), which operated with variable-size packets, on a general topology, with no buffers. Fixed-size cells, the MSN topology, and two or three cells of buffering converted the earlier routing strategy, which had very low throughputs, to a strategy that can operate at levels exceeding 90% of the throughput that is achieved with infinite buffering.

Figure 7. A block diagram of a node in a deflection-routing network. The arriving packets are delayed so that they are aligned. The packets can be exchanged and placed in their respective output buffers.

Reliability

The MSN topology has several paths between each source and destination. The alternate paths can be used to communicate after nodes or links have failed. There are two simple mechanisms to survive failures in the MSN. Both mechanisms, shown in Fig. 8, are adopted from loop networks. Node failures are bypassed by two normally closed relays that connect the rows and columns through. The missing node in the grid in Fig. 8 has failed. Link failures are detected by a loss of signal, as in loop networks. Nodes respond to the loss of signal by not transmitting on the link at right angles to the link that has stopped. When one link fails, three other links are removed from service and the node at the input to the failed link stops transmitting on it. The dotted link in Fig. 8 has failed and nodes stop transmitting on the dashed links. This link removal procedure works with any number of link failures. Since the number of input and output links are equal at all of the operating nodes, deflection routing continues to operate without losing cells. In addition, it has been found that the simple routing rules that are designed for complete MSNs continue to work on networks with failures.

COMPARISON OF FDDI, DQDB, AND THE MSN

The FDDI, DQDB, and MSN networks are all two-connected topologies.
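The output-selection step of deflection routing can be sketched as follows. This is a simplified illustration with invented names: it assumes each cell arrives with a single preferred output already computed (by the MSN addressing rule, for example) and that there are no output buffers, so every contested output is resolved by a random toss and the loser is deflected.

```python
import random

def route_slot(cells, n_outputs=2):
    """Assign each incoming cell of one slot time to an output link.

    `cells` is a list of (cell_id, preferred_output) pairs, one per
    occupied input link.  Every cell is forwarded in the same slot:
    a cell that loses the random toss for a contested output is
    deflected onto a free output instead of being buffered or dropped.
    """
    assignment = {}                      # output link -> cell_id
    contenders = list(cells)
    random.shuffle(contenders)           # random winner on a conflict
    for cell_id, preferred in contenders:
        if preferred not in assignment:
            assignment[preferred] = cell_id
        else:
            free = next(o for o in range(n_outputs) if o not in assignment)
            assignment[free] = cell_id   # this cell is deflected
    return assignment
```

With two cells both preferring output 0, one wins the toss and the other leaves on output 1; in the MSN grid that deflection costs at most four extra links, a trip around the block.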
There are two links entering and leaving each node. Some units in the FDDI network may not be two-connected, but the main part of the network is a dual loop. Two-connectivity distinguishes these networks from the earlier loop and bus networks, used in LANs, which were one-connected. The increased connectivity makes these networks better suited for the increased number of users and longer distances spanned in a MAN.

Figure 8. Failure recovery mechanisms in the MSN. Node failures are survived with bypass relays at the failed node. Link failures are survived by eliminating a circuit that includes the failed link, thus preserving the criterion necessary to perform deflection routing at every node.

Like the earlier LANs, DQDB and FDDI are linear topologies. Logically, the nodes are arranged in a one-dimensional line. In linear topologies the throughput per user and reliability are not as high as they are in other two-connected topologies, which have shorter distances and more paths between nodes. There were early proposals (20) to connect the second paths in loop networks to achieve higher throughputs and reliability than the bidirectional loop, but these more complicated topologies lost some of the more desirable characteristics of loops and buses. The MSN and shuffle-exchange topology achieve both goals while approaching the ease of routing and growth that is associated with linear topologies.

Throughput

A disadvantage with linear topologies is that the average throughput that each user can obtain decreases linearly with the number of users. In linear topologies, the average distance between nodes increases linearly with the number of nodes in the network. With uniform traffic, the number of users sharing a link increases linearly with the number of users in the network and the average rate per user decreases accordingly. The protocols used in FDDI and some DQDB networks do not reuse slots. The throughput in these networks is reduced for all traffic distributions. By contrast, in the MSN the distance between nodes increases as the square root of the number of nodes in the network, and in the shuffle-exchange network the distance increases as the log of the number of nodes. As a result, the reduction in the throughput per user, which occurs as networks become large, is much less in the MSN and the shuffle-exchange networks than in the FDDI or DQDB network. In the DQDB network, the penalty for large networks can be reduced by breaking the network into segments and erasing data that have already been received when they reach the end of a segment.
This strategy works particularly well when the traffic requirements are nonuniform. When there are communities of users that communicate frequently, if those users are placed on the same segment of the bus, the traffic between them does not propagate outside the segment and interfere with users in other segments. A similar strategy for reusing capacity does not exist for FDDI. In addition, the concept of communities is not as meaningful in FDDI. FDDI operates as a single loop. When user A has a short path to user B, user B
must traverse the rest of the loop to get to user A, and therefore has a long path. In the MSN and shuffle-exchange network special erasure nodes are not needed because the protocol removes cells that reach their destination. In the MSN, communities of users are supported in a very natural way. If nodes that communicate frequently are located within a few blocks, they only traverse the paths in those few blocks and do not affect the rest of the network. The concept of communities does not exist in shuffle-exchange networks. There is less interference between neighborhoods in the MSN than in DQDB networks. In the MSN, when there is heavy traffic within a neighborhood, communications between other neighborhoods can continue without passing through that neighborhood. In deflection routing, because of the random selection process when there are equal length paths or oversubscribed nodes, cells naturally avoid passing through congested neighborhoods. By contrast, when a community in the middle of the bus becomes congested in a DQDB network, communications between nodes at opposite edges of the bus must still pass through that community.

Reliability

Both DQDB and FDDI have a structure that enables them to survive single failures without losing service. FDDI has a redundant loop that is pressed into service to bypass single failures, and DQDB repositions the head-end of the bus. In both of these networks if two or more failures occur in adjacent components, the mechanism bypasses those components without affecting the operating components. However, in the more likely event that the multiple failures occur in components that are separated from one another, the network is partitioned into islands of nodes that cannot communicate with one another. Nodes in the MSN are not cut off from one another until at least four failures occur. When four failures have occurred, the likelihood of nodes being disconnected, and the number that are actually disconnected, is small.
A quantitative comparison of the MSN, DQDB, and FDDI networks is presented in Ref. 21.

Routing Complexity

An advantage of linear topologies is that routing is relatively simple. All of the data that enter an FDDI system are transmitted on a single path, and there is only one path to select at any intermediate node. In a DQDB network the source must decide which of the two paths leads to the destination, but once the data are in the network there are no choices to make. The MSN and shuffle-exchange networks have simple rules to select a path, but a choice must be made at each node. In a linear topology everything that is destined for an output link either originates at the source at the node or is received from a single input link. When the data that arrive on the input link have precedence over the data from the source, there is no need to delay or buffer the data from the input link. Deflection routing has the same result in any network where the in-degree equals the out-degree at each node.

Growth

An important consideration in any large network is how easily it can be modified to add or delete users. In the early days
of LANs the correspondence between the topology of the network and the physical distribution of users was considered important. A loop network was supposed to be a daisy chain between adjacent offices and a bus network was supposed to pass down a hallway. With experience we realized that it is more difficult to change the wiring between offices as users move than to have all offices connected to a wiring cabinet and change the interconnection in that cabinet. As a result, most LANs are physically a star network, or a cluster of stars, of connections between users and a wiring cabinet. The users are connected together inside the wiring cabinet to form a logical loop or bus or mesh network. The number of wires that must be changed in the wiring cabinet determines how difficult it is to add or delete users from a network. In a bidirectional loop or bus network, adding or deleting a user is a relatively simple operation. To add a user, the connection between two users is broken and the new user is inserted between them. In the wiring cabinet, two wires are deleted and four are added. In the FDDI loop, there is only one path to the new user. In the DQDB network, every user must determine which bus to use to transmit to the new user. In the shuffle-exchange network, adding or deleting nodes is very complex. Complete networks are only defined for certain numbers of users. The shuffle-exchange network shown in Fig. 6 is only defined when the number of users is 2^n. When a network is replaced by the next larger network, virtually every wire must be removed and the network reconfigured from scratch. In a complete MSN, two complete rows or columns must be added to retain the grid structure. There are, however, partial MSNs in which rows or columns do not span the entire grid, and a technique is known for adding one node at a time to a partial MSN to eventually construct a complete MSN (22).
With this technique the number of links that must be changed in the wiring cabinet is the same as in the loop or bus network. In addition, an addressing scheme is known that allows new partial rows or columns to be added without changing the addresses of existing nodes or the decisions made in the routing strategy.

Multimedia Traffic

Multimedia traffic is playing an increasingly important role in networks. Non-real-time video or audio, as is currently delivered by Web sites on the Internet, is adequately handled by the data modes on any of the MANs that have been described. However, the data mode of operation cannot provide the guarantees on throughput or delay that are required for real-time voice or video applications; nor can it guarantee the sustained throughput that is needed to view movies while they are being received. DQDB and FDDI have an isochronous mode of operation that provides dedicated circuits to support real-time traffic. The isochronous mode is well integrated into the DQDB protocol. Nodes that only require the data mode do not have to change any protocols or hardware when isochronous traffic is added to the network. The only change that these nodes notice is that some slots seem to enter the network in a busy state. By contrast, when isochronous traffic is added to an FDDI network, every node must be able to perform context switches to move between the data and circuit modes. The MSN does not have an isochronous mode of operation. It is
unlikely that dedicated circuits can be added to the mesh structure without constructing a separate circuit switch at every node in the network.

CATV

The current CATV network is an existing MAN that delivers TV programs to a large number of homes. The network is designed for unidirectional delivery of the same signal to a large number of receivers. The channels have a very wide bandwidth relative to telephone channels. In bidirectional CATV systems there are many more channels from the head-end to the home than in the opposite direction. The growth of the use of the Web in homes has created the need for wide-band channels to homes. The traffic on the Web is predominantly from servers to clients, and most homes are clients rather than servers. Increasingly, the Internet is being used for multicast communications (23) of video or audio programming, in which the same signal is received by more than one receiver. CATV networks are naturally suited for the Web and multicast communications. The simplest CATV systems are hybrid networks that use the CATV network to send addressed packets to the home, and the telephone network to receive data from the home. The CATV network provides a means of quickly getting a large amount of data to the home. The home terminals share a CATV channel. They receive all of the packets and filter out the packets that are intended for others. If the data are considered sensitive, encryption is used to keep them from being intercepted. Shared CATV channels are particularly useful when high data rates are needed for short periods, as when receiving new pages from Web servers. The telephone channel is a relatively low bandwidth channel that adequately handles the traffic to Web servers. The IEEE 802.14 working group is currently considering standards for hybrid networks. The applications of the Internet are not stationary.
Packet telephony or an increase in the number of servers in homes for publishing or small businesses can quickly change the current unidirectional traffic demands. Data MANs that are implemented on CATV networks should be flexible so that they can track changes in the applications. In the early MANs, experimental CATV systems were used for two-way voice and data (24). There was not sufficient demand for data to homes at that time, and work on these networks stopped. The renewed interest in data applications of CATV networks makes it reasonable to reconsider the earlier work. In this section, we describe one such network, the Homenet (25).
The Homenet transmission strategy partitions the CATV network into smaller areas, called Homenets. Stations contend only with stations in their own Homenet to gain access to the network. Because of the size of the Homenets, the contending stations satisfy the distance constraints imposed on the CSMA/CD protocol used in Ethernet LANs. Any station on the CATV network can receive the signal on any Homenet. Two stations communicate by transmitting on their own Homenets and receiving on the other station's Homenet. The Homenet strategy is readily tailored to load imbalances, such as the directional imbalance in Web traffic. There is no relationship between the bandwidth that is available on a station's transmit and receive channels. Bandwidth is assigned to transmitters by adjusting the number of stations that contend for a Homenet. For instance, an Internet service provider (ISP) can be given one or more Homenets, so that traffic from its servers can access the network without contention. Many clients, in homes, may share the same Homenet, since the traffic level from a client to the servers is much lower. A convenient characteristic of the Homenet strategy, in comparison with hybrid networks, is that the network can be modified easily to match changes in load imbalances. If the traffic distribution changes and more traffic originates in some homes, those homes can be placed on less heavily populated Homenets to increase the amount that they can transmit.

Access Strategy. The stations in a Homenet access the channel using Movable Slot Time Division Multiplexing (MSTDM) (26), a variation on the CSMA/CD (Carrier Sense Multiple Access/Collision Detection) protocol used in Ethernet. MSTDM is implemented by a very small change in the standard Ethernet access unit and sets up telephone-quality voice connections on an Ethernet. Figure 9 shows the transmission strategy that is used within a Homenet. The CATV taps are directional, so that a station only receives signals from downstream and can only transmit upstream. The stations in a Homenet transmit upstream in an assigned frequency band. At a reflection point the upstream frequency band is received and retransmitted downstream in a different frequency band. Before transmitting, a station listens to the downstream channel to determine if it is busy (carrier sense), then transmits on the upstream channel. When a station receives the signal that it transmitted, it knows that it has not collided with any other station (collision detection) and that any station that receives this channel can receive its data.

[Figure 9. Transmission strategy within a Homenet. Stations transmit upstream on the CATV tree using one channel. At the root of the Homenet, the signal that is received on this channel is retransmitted downstream on a different channel.]
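The access rule can be sketched as a toy model in which a station defers while the downstream channel is busy, transmits upstream, and declares success only when it hears its own frame echoed back intact. The classes and single-shot timing model below are illustrative only; they are not part of the Homenet or MSTDM specification:

```python
class Channel:
    """Toy shared medium: remembers concurrent transmissions."""
    def __init__(self):
        self.active = []
    def busy(self):
        return bool(self.active)
    def transmit(self, ident, frame):
        self.active.append((ident, frame))
    def echo(self):
        # The reflection point echoes a lone transmission intact;
        # overlapping transmissions garble one another.
        return self.active[0] if len(self.active) == 1 else None

class HomenetStation:
    """MSTDM-style access rule: carrier sense on the downstream
    channel, collision detection by listening for one's own frame."""
    def __init__(self, ident):
        self.ident = ident
    def try_send(self, channel, frame):
        if channel.busy():                    # carrier sense (downstream)
            return "defer"
        channel.transmit(self.ident, frame)   # transmit (upstream)
        if channel.echo() == (self.ident, frame):
            return "success"                  # heard own frame: no collision
        return "collision"                    # back off and retry later

print(HomenetStation("A").try_send(Channel(), "hello"))  # success
```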
METROPOLITAN AREA NETWORKS
[Figure 10. Transmission plan for a CATV network with 4 Homenets. Each segment carries the Homenet transmit channel (XH), the signaling transmit and receive channels (XS, RS), the Homenet receive channels (R1–R4), and additional downstream channels (D2–D4). The X's are upstream, transmit channels and the R's are downstream, receive channels. The cross-hatched channels are filtered out at the edge of the Homenets.]
Interconnection Strategy. Figure 10 shows the boundaries of Homenets, the assignment and filtering of frequencies, and the interconnections for a four-Homenet system. There are many fewer upstream channels in a CATV network than downstream channels. The same upstream data channel is reused in every Homenet by filtering that channel at the boundaries between Homenets. The stations in every Homenet contend for the same upstream channel, but the transmissions from stations in different Homenets do not interfere with one another. Each Homenet is assigned a unique receive channel. At the Homenet's reflection point the upstream signal on the transmit channel is received and retransmitted on the Homenet's receive channel. The signal is also carried by a point-to-point link to the head-end of the CATV network, where it is again transmitted on the Homenet's receive channel. A Homenet's receive channel is filtered at the reflection point for the Homenet so that the signal from the head-end does not interfere with the signal that is inserted. The reflection point of each Homenet appears to be the root of the entire CATV tree for that Homenet's receive channel, and any station on the CATV network can receive the signal that a station transmits in its Homenet. To see how the transmission strategy works, consider a station in Homenet 2. The station transmits on the common upstream channel. This transmission may collide with transmissions from other stations in Homenet 2. The transmission will not collide with transmissions from stations in Homenet 4 because the transmit channel is filtered at the boundary between Homenets 2 and 4. At the reflection point for Homenet 2, the signal on the transmit channel is placed in downstream channel 2. A station that has transmitted detects collisions by listening to the downstream signal on channel 2.
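The effect of the boundary filters can be summarized in a few lines: a sender's collision domain is its own Homenet, while its audience is the whole network. The function and variable names below are invented for illustration:

```python
# Toy model of channel reuse in Fig. 10: the common transmit channel
# is filtered at Homenet boundaries, while each Homenet's receive
# channel is re-broadcast network-wide via the reflection point and
# the head-end. The four-Homenet layout is taken from the text.

HOMENETS = {1, 2, 3, 4}

def contention_set(stations, homenet):
    """Stations whose transmissions can collide with a sender in
    `homenet`: only those sharing the same filtered transmit channel,
    i.e. stations in the same Homenet."""
    return {s for s, h in stations.items() if h == homenet}

def audience(homenet):
    """Any station on the CATV network can tune to receive channel
    R<homenet>, so a successful packet reaches every Homenet."""
    assert homenet in HOMENETS
    return HOMENETS

stations = {"a": 2, "b": 2, "c": 4, "d": 1}
print(sorted(contention_set(stations, 2)))  # ['a', 'b'], not 'c' or 'd'
print(sorted(audience(2)))                  # [1, 2, 3, 4]
```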
A successfully transmitted packet on Homenet 2 can be received by any station in Homenets 2 and 4. The signal on the transmit channel is also transferred to the head-end of the CATV network, where it is placed in channel 2, so that stations in Homenets 1 and 3 can also receive the signal from the station in Homenet 2. The signal from the head-end does not interfere with the signal inserted in channel 2 at the reflection point because the signal from the head-end is removed at the entry to Homenet 2. In Fig. 10 there is a second upstream and downstream channel that is used by all of the stations on the CATV network. This is a signaling channel and is used for communications between stations during call setup. A station places a call by transmitting on the upstream signaling channel. At the head-end of the CATV network this signal is placed on the downstream signaling channel. The CSMA/CD contention rule cannot be used on this channel because of the larger distances involved, but the low utilization of the channel makes it reasonable to use a less efficient Aloha (27) contention rule. A station does not listen to the channel before transmitting, but does listen to the receive channel to determine if its signal collided with the signal from another station. If a station can receive its own signal, so can all other stations. When a station is not busy, it listens to the downstream signaling channel and responds to a connect request.
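This retry discipline differs from the Homenet access rule in that a station transmits without first listening. A sketch of such a pure-Aloha loop follows; the backoff constants and callback interface are illustrative, not taken from the article:

```python
import random

def aloha_send(transmit, heard_own_signal, max_attempts=8, seed=0):
    """Pure Aloha on the signaling channel: transmit without carrier
    sense, then listen to the downstream echo; on collision wait a
    random backoff and retry. Returns the number of attempts used,
    or None if every attempt collided."""
    rng = random.Random(seed)
    for attempt in range(max_attempts):
        transmit()
        if heard_own_signal():        # own signal came back intact
            return attempt + 1
        rng.uniform(0, 2 ** attempt)  # backoff delay (not simulated here)
    return None

# Toy medium: the first two attempts collide, the third gets through.
echoes = iter([False, False, True])
print(aloha_send(lambda: None, lambda: next(echoes)))  # 3
```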
CONCLUSION

The principal reasons for using different technologies in different area networks are economic. As MANs have developed, the economics have changed, and continue to change. The initial application of MANs was interconnecting LANs in a city.
When the economics changed, WAN technology was used for this function. At present there is a growing demand for wider bandwidth channels to individual homes in a MAN. The bandwidth is not available. In the short term we should expect the bandwidth to be made available using existing networks, either the CATV infrastructure or ADSL technology over the telephone local loop. As the demand continues to grow, additional capacity will be installed. Until the demand reaches a level where individual fibers to a home are justified, channel sharing techniques, such as FDDI, DQDB, and the MSN, are likely to be used to share the added capacity. Just as the capabilities of WANs have affected MANs, the capabilities of MANs can affect LANs. An interesting question is whether or not the future growth of MANs will lead to the end of LANs. When inexpensive, wide-bandwidth channels are available from every desk to the telephone central office or an ISP, will it still be economical for companies to install and maintain private LANs?

BIBLIOGRAPHY

1. E. W. Zegura, Architectures for ATM switching systems, IEEE Comm. Mag., 31 (2): 28–37, 1993.
2. F. E. Ross, An overview of FDDI: The fiber distributed data interface, IEEE JSAC, 7 (7): 1043–1051, 1989.
3. W. Stallings, Local and Metropolitan Area Networks, Upper Saddle River, NJ: Prentice Hall, 1997.
4. M. J. Johnson, Proof that timing requirements of the FDDI token ring protocol are satisfied, IEEE Trans. Comm., COM-35: 620–625, 1987.
5. H. E. White and N. F. Maxemchuk, An experimental TDM data loop exchange, Proc. ICC '74, June 17–19, 1974, Minneapolis, MN, pp. 7A-1–7A-4.
6. R. M. Newman, Z. L. Budrikis, and J. L. Hullett, The QPSX MAN, IEEE Comm. Mag., 26 (4): 20–28, 1988.
7. R. M. Newman and J. L. Hullett, Distributed queueing: A fast and efficient packet access protocol for QPSX, Proc. 8th Int. Conf. on Comp. Comm., Munich, F.R.G., Sept. 15–19, 1986, North-Holland, pp. 294–299.
8. L. Fratta, F. Borgonovo, and F. A. Tobagi, The Express-Net: A local area communication network integrating voice and data, Proc. Int. Conf. Perf. Data Commun. Syst., Paris, Sept. 1981, pp. 77–88.
9. J. O. Limb and C. Flores, Description of Fasnet—A unidirectional local area communications network, BSTJ, 61 (7): 1413–1440, 1982.
10. M. Zukerman and P. G. Potter, A protocol for eraser node implementation within the DQDB framework, Proc. IEEE GLOBECOM '90, San Diego, CA, Dec. 1990, pp. 1400–1404.
11. M. W. Garrett and S.-Q. Li, A study of slot reuse in dual bus multiple access networks, IEEE JSAC, 9 (2): 248–256, 1991.
12. J. W. Wong, Throughput of DQDB networks under heavy load, EFOC/LAN-89, Amsterdam, The Netherlands, June 14–16, 1989, pp. 146–151.
13. J. Filipiak, Access protection for fairness in a distributed queue dual bus metropolitan area network, ICC '89, Boston, June 1989, pp. 635–639.
14. E. L. Hahne, A. K. Choudhury, and N. F. Maxemchuk, Improving the fairness of distributed-queue dual-bus networks, INFOCOM '90, San Francisco, June 5–7, 1990, pp. 175–184.
15. N. F. Maxemchuk, Regular mesh topologies in local and metropolitan area networks, AT&T Tech. J., 64 (7): 1659–1686, 1985.
16. S. Bassi et al., Multistage shuffle networks with shortest path and deflection routing for high performance ATM switching: The open loop shuffleout, IEEE Trans. Commun., 42: 2881–2889, 1994.
17. A. Krishna and B. Hajek, Performance of shuffle-like switching networks with deflection, Proc. INFOCOM '90, June 1990, pp. 473–480.
18. N. F. Maxemchuk, Problems arising from deflection routing: Live-lock, lockout, congestion and message reassembly, in G. Pujolle (ed.), High Capacity Local and Metropolitan Area Networks, Springer-Verlag, 1991, pp. 209–233.
19. P. Baran, On distributed communications networks, IEEE Trans. Comm. Sys., CS-12: 1–9, 1964.
20. C. S. Raghavendra and M. Gerla, Optimal loop topologies for distributed systems, Proc. Data Commun. Symp., 1981, pp. 218–223.
21. J. T. Brassil, A. K. Choudhury, and N. F. Maxemchuk, The Manhattan Street Network: A high performance, highly reliable metropolitan area network, Computer Networks and ISDN Systems, 1994.
22. N. F. Maxemchuk, Routing in the Manhattan Street Network, IEEE Trans. Commun., COM-35: 503–512, 1987.
23. S. Deering, Multicast routing in internetworks and extended LANs, Proc. ACM SIGCOMM '88, Aug. 1988, Stanford, CA, pp. 55–64.
24. A. I. Karchmer and J. N. Thomas, Computer networking on CATV plants, IEEE Network Mag., pp. 32–40, 1992.
25. N. F. Maxemchuk and A. N. Netravali, Voice and data on a CATV network, IEEE J. Sel. Areas Commun., SAC-3 (2): 300–311, 1985.
26. N. F. Maxemchuk, A variation on CSMA/CD that yields movable TDM slots in integrated voice/data local networks, BSTJ, 61 (7): 1527–1550, 1982.
27. N. Abramson, The Aloha system—Another alternative for computer communications, Fall Joint Computer Conference, AFIPS Conference Proceedings, 37: 281–285, 1970.

N. F. MAXEMCHUK
AT&T Labs–Research
MHDCT. See HADAMARD TRANSFORMS. MHD POWER PLANTS. See MAGNETOHYDRODYNAMIC POWER PLANTS.
MICROBALANCES. See BALANCES. MICROCOMPUTER. See MICROPROCESSORS.
Wiley Encyclopedia of Electrical and Electronics Engineering

Mobile Network Objects

Standard Article

Lubomir F. Bic, Michael B. Dillencourt, Munehiro Fukuda, University of California, Irvine

Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.

DOI: 10.1002/047134608X.W5336

Article Online Posting Date: December 27, 1999
Abstract. The sections in this article are: Basic Infrastructure; Programming Language; Mobility; Agent Interactions; Protection and Security; Utility and Applications.
MOBILE NETWORK OBJECTS
Computer networks are collections of computing nodes interconnected by communication channels. They have experienced explosive growth recently, primarily due to the steadily decreasing cost of hardware, and have become an integral part of daily life for most businesses, government institutions, and individuals. One of the main objectives of interconnecting individual computers into networks is to permit them to exchange information or to share resources. This can take on a number of different forms, including electronic mail messages exchanged among individuals, downloading or uploading of files, access to databases and other information sources, or the use of a variety of services. It also permits the utilization of remote computational resources, such as specialized processors or supercomputers necessary to accomplish a certain task, the utilization of multiple interconnected computers to solve a problem through parallel processing, or simply the utilization of unused processing or storage capacity available on remote network nodes. The recent explosion in the use of portable devices, such as laptop computers or various communication devices used by "nomadic" users, has opened new opportunities but has also created new technological challenges. The main problem is that such devices are connected to the network intermittently and typically for only brief periods of time; they use low-bandwidth, high-latency, and low-reliability connections; and they may be connected to different points of the network each time. The exchange of data among processor nodes in a network—whether connected permanently or temporarily—occurs via communication channels, which are physical or virtual connections established between the nodes.
To use a channel, the communicating parties need to obey a certain communication protocol, which is a set of rules and conventions regarding the format of the transmitted data and its processing at the sending and receiving ends. There is a wide range of different communication protocols to serve different needs, and they are usually structured hierarchically such that each layer can take advantage of the properties provided at the lower level. Despite the great variety of communication protocols, they all embody the same fundamental communication paradigm. Namely, they assume the existence of two concurrent entities (processes or users) and a set of send/receive primitives that permit a piece of data (a bit, a packet, a message) to be sent by one of the active entities and received by the other. The specific protocol used only determines various aspects of the transmission, such as the size and format of the transmitted data, the speed of transmission, or its reliability. This leads to a great variety of send/receive primitives, but the underlying principle remains the same. From the programming point of view, this form of communication is referred to as message passing and is the most common paradigm used in parallel or distributed computing today. Its main limitation is that it views communication as
[Figure 1. Levels of computation migration. Transferring control only corresponds to RPC; transferring control plus code before execution corresponds to remote execution and code import; migrating a partial computation during execution corresponds to object migration (passive) or thread migration (active); migrating a self-contained computation corresponds to process migration (passive) or agent migration (active).]
a low-level activity and thus is difficult to program, analyze, and debug. To alleviate these problems, higher-level programming constructs have been developed. The best-known representative is the concept of a remote procedure call (RPC) (1). As the name suggests, this extends the basic idea of a procedure call to permit the invocation of a procedure residing on a remote computer. At the implementation level, the RPC must be translated into two pairs of send/receive primitives. The first transmits the necessary parameters to the remote site, while the second carries the results back to the caller once the procedure has terminated. Several issues must be handled, including the translation of data formats between different machine architectures and the handling of failures during the RPC. Nevertheless, the RPC mechanism hides the details of the message-passing communication inside the well-known and well-understood procedure-calling abstraction. The popularity of the RPC mechanism is due to its affinity with the popular client-server paradigm, where a distributed application is structured as a continuously operating process, termed a server (or collection of servers), that may be contacted by a client process to utilize the provided service. For example, a name server would return the address of a particular host computer given its name. When a client-server application is implemented using RPCs, the client simply invokes the appropriate (remote) procedure with the name as the argument and waits for it to return the result. RPCs require that the procedure to be invoked be preinstalled (compiled) on the remote host node before it can be utilized by a client. Hence, as indicated in Fig. 1, only control is transferred between the client and the server during the RPC.
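The two pairs of send/receive primitives underlying an RPC can be made explicit with a toy in-process transport. The marshalling format, queue-based "network," and the name-server example (including the address returned) are all invented for illustration:

```python
import json
import queue

request_q, reply_q = queue.Queue(), queue.Queue()

PROCEDURES = {  # procedures preinstalled (compiled) on the "server"
    "lookup_host": lambda name: {"ocean": "128.195.1.1"}.get(name),
}

def serve_one():
    """Server side: receive the parameters, run the preinstalled
    procedure, and send the result back."""
    msg = json.loads(request_q.get())              # receive #1
    result = PROCEDURES[msg["proc"]](*msg["args"])
    reply_q.put(json.dumps(result))                # send #2

def rpc_call(proc, *args):
    """Client stub: the first send/receive pair ships the parameters;
    the second carries the result back once the procedure returns."""
    request_q.put(json.dumps({"proc": proc, "args": list(args)}))  # send #1
    serve_one()               # stand-in for the remote server's loop
    return json.loads(reply_q.get())               # receive #2

print(rpc_call("lookup_host", "ocean"))  # 128.195.1.1
```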
The natural extension to RPCs is remote execution, also referred to as remote programming (2), which permits not only control but also the code that is to be executed remotely to be carried as part of the request. Alternatively, a client could contact a server and download a program to accomplish a certain task (i.e., copy the program from the server to the client's host). In both cases we are addressing the problem of code mobility. Since both of these scenarios require that the code be carried before it starts executing, the code must be made portable between the different machines. This can be accomplished either by carrying the source code and recompiling it on the target host or by providing an interpreter for the given language on the target host. In both cases, translating the original source code into a more compact and easier-to-process intermediate code generally yields much better performance in both transmission and processing. One of the most popular systems today is Java (3), which uses a stream-oriented intermediate representation, referred to as byte code, that lends itself well to interpretation as well as to on-the-fly compilation into native code. The necessary machine independence is achieved by establishing a common standard for the various aspects of code generation, such as byte ordering, data alignment, calling conventions, and data layout. This accounts for Java's popularity as a language to develop highly portable Internet-based applications. The main motivation for moving code between machines is to make it available for execution on demand (by downloading it when needed), to perform load leveling (i.e., to invoke particular subcomputations on remote nodes to take advantage of their computational capacity), or to reduce communication latency by moving the execution to the data or the service that it needs to access. Another dimension of complexity is added when we permit code to migrate after it has started executing. In this case we are moving the state of the ongoing computation in addition to the code itself. This can be subdivided further along two axes, as shown in Fig. 1. The first axis captures the distinction in granularity—that is, whether the entire computation can be moved or whether it is possible to move only some portion of it while other parts remain at their original sites. The second, orthogonal axis divides the space based on who initiates and performs the migration. If this is done by an activity other than the moving one, we call the migration passive. If a computation can effect its own migration, we call it active. There are representatives in each of the four areas resulting from the preceding subdivision. An example of passive migration of parts of an ongoing computation is the movement of objects of an object-oriented language, as pioneered in the Emerald system (4). Emerald provides several primitives by which one object can cause another object to change its location, which then automatically moves all the threads currently executing as part of the moved object to the new host.
Active migration of parts of an ongoing computation can be accomplished by permitting a thread to send and invoke a given procedure on a remote host. Unlike the simple remote execution discussed earlier, this transfer does not imply a return of control to the caller site once the procedure terminates. Rather, the migrating procedure carries with it the state of its invoking thread and thus retains its semantics regardless of its current location. One approach for accomplishing this was pioneered in the Obliq language (5). In this approach, backward network references to the originating site are maintained, and parts of the state are copied as necessary. The ongoing copying is transparent to the user. From the user’s viewpoint, the entire thread has been transferred to the new location, where it continues executing until it again decides to move.
The last column of Fig. 1 represents systems where computations are relocated in their entirety. Under passive migration, this is typically done at the process level, where the operating system captures the complete execution state of a process and relocates it, including its entire address space, to another machine. When processes can actively decide if and where to move—that is, perform self-migration—we refer to them as mobile network objects, or mobile agents. These are the main subject of this article. Self-migration requires that a basic infrastructure be established, consisting of some form of servers running on the physical nodes, which accept the mobile agents, provide them with an execution environment and an interface to the host environment, enforce some level of protection of the agent and the host, and permit them to move on. The remainder of this article explores these issues in more detail. The utility of mobile agents and some of their applications are discussed later.

BASIC INFRASTRUCTURE

To give mobile agents their autonomy in moving through the underlying network and performing the necessary tasks at the nodes they visit, a basic infrastructure needs to be established. Figure 2 shows the generic concept of such an infrastructure. The lowest level consists of a physical network of nodes. This is typically a WAN (wide area network), such as the Internet, which is a large heterogeneous collection of different computers ranging from PCs to supercomputers, interconnected by a variety of different links and subnetworks. For some applications (notably, general-purpose parallel/distributed computing), the physical network could also be a LAN (local area network), consisting of a relatively small number of computers interconnected by an Ethernet or a token ring–based network. The mobile agent infrastructure is established on a subset of the physical nodes.
This may be viewed as a virtual network, where each node is a software environment that enables the agents to operate in the physical node. In the simplest case, a single virtual node is mapped onto a physical
node and there are no specific virtual connections. That is, the virtual network is a strict subset of the physical network and the virtual connections are simply implied by (identical to) the existing physical links. Additional flexibility is attained by permitting more than one virtual node to share a physical node. For example, the node ocean.ics.uci.edu in Fig. 2 shows two virtual nodes mapped to it. Since the physical resources are multiplexed between the virtual nodes, there is no performance benefit, but the resulting logical concurrency provides for more flexibility in the design of applications. Some systems, such as UCI MESSENGERS (6) and WAVE (12), support a separate logical network, implemented on top of the virtual network. Logical links are used by the mobile agents to navigate through the network. Having logical links that represent virtual connections provides for greater flexibility in navigation. Each virtual node consists of several components that enable the mobile agents' operation. The most important is a "processing engine" that gives the mobile agents their autonomy. This can be subdivided further into a communication module, whose task is to receive and send mobile agents, and an execution engine, responsible for an agent's execution while it resides on the current node. Depending on the language used to write the code for mobile agents, this engine can be either a self-contained interpreter or a form of manager that creates a new process or thread for each new incoming agent and then supervises its execution. Each virtual node also typically provides a local communication facility, such as a shared data area, that can be used by the mobile agents currently on that node to communicate with one another (that is, to exchange data or synchronize their operations).
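The generic virtual node of Fig. 2 (a communication module that receives agents, an execution engine that runs them, and a shared data area for local communication) might be sketched as follows; all class and method names are invented for illustration:

```python
class VirtualNode:
    """Sketch of one virtual node: a communication module (receive),
    an execution engine (run_all), and a local shared data area."""

    def __init__(self, name):
        self.name = name
        self.inbox = []      # communication module's receive queue
        self.shared = {}     # local communication facility

    def receive(self, agent):
        """Communication module: accept an incoming mobile agent."""
        self.inbox.append(agent)

    def run_all(self):
        """Execution engine: run each queued agent against this node."""
        while self.inbox:
            agent = self.inbox.pop(0)
            agent(self)

def counting_agent(node):
    # Agents on the same node communicate via the shared data area.
    node.shared["visits"] = node.shared.get("visits", 0) + 1

node = VirtualNode("ocean.ics.uci.edu")
node.receive(counting_agent)
node.receive(counting_agent)
node.run_all()
print(node.shared["visits"])  # 2
```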
In the remaining sections, we will elaborate on the various aspects of the supporting infrastructure, the various capabilities of mobile agents as supported by different approaches, and their main benefits and applications.

[Figure 2. Infrastructure of mobile agents. Virtual nodes, each consisting of a communication module (C), an execution engine (E), and a local communication facility (L), are mapped onto hosts of the physical network, such as wave.ics.uci.edu, cress.ics.uci.edu, and ocean.ics.uci.edu.]

PROGRAMMING LANGUAGE

There is a wide range of programming languages used to write mobile agent programs (i.e., to describe an agent's behavior). We can loosely classify them along two orthogonal axes: general purpose versus special purpose and conventional versus object oriented. Within each class we can further distinguish interpreted versus compiled languages, and combinations thereof. The most popular general-purpose programming languages used for mobile agents are C and Java. C is an imperative language (that is, based on sequential flow of assignment, control, and function invocation statements) and is one of the most widely used programming languages today. Java is object oriented, which implies a hierarchical structure of objects derived from common classes and interacting with one another by invoking procedures defined as part of each object. One of the main strengths of Java code is that it is based on a structured bytecode and is thus highly portable between heterogeneous computers. To use a general-purpose programming language like C or Java for mobile agents, it must be extended to handle the specific requirements of self-migrating code. The most important extension is to support mobility (that is, some set of commands that an agent can use to cause its migration to another computer). Another important area of concern is to provide protection mechanisms to permit the safe operation of mobile agent applications. The main advantage of using existing general-purpose languages is that programmers do not need to learn yet another language but only extend their knowledge to integrate aspects of mobility. Hence it is easier to make a transition into the new paradigm of mobile agents. New languages have also been developed specifically for the purpose of writing mobile agent code. One such language is Telescript (7), pioneered by General Magic, Inc. This is a high-level object-oriented language designed for mobile agents for the rapidly expanding electronic marketplace on the Internet. A number of other languages, both object oriented and conventional, have also been developed.
Another approach has been to adapt existing special-purpose languages. One example is Agent Tcl (8), which is built on top of an extended version of Tcl, a scripting language originally intended for the composition of high-level program scripts to coordinate lower-level computations. Another example is the system developed by researchers at the Johann Wolfgang Goethe University (9), which is built on top of a customized Hypertext Transfer Protocol (HTTP) server. One of the main distinguishing features of all of the preceding languages is whether they are compiled and executed as native code of the host computer or interpreted. This represents a tradeoff between performance, which is degraded due to interpretation, and security, which is improved due to the interpreter's tight control over the agent's behavior. There are three general options. First, the agent can carry code that is fully interpreted by the execution engine of the host. This is the safest but also the slowest approach and is typically used with scripting languages. Second, the agent's code could be an intermediate machine-independent program representation, like the Java bytecode, which can be interpreted more efficiently than source code or can be compiled on the fly into directly executable native code. Finally, the agent could carry native code precompiled for the target host. This is the fastest but also the least secure and least flexible approach, since the agent would have to carry different code versions for every machine architecture it may visit. A compromise between the preceding approaches has been adopted by UCI MESSENGERS (6). The agent's code is written in a subset of C, which is translated into a more efficient yet machine-independent form (similar to bytecode). This is carried by the agent and is interpreted by the execution engine of each host. In addition, the agent can dynamically load and invoke arbitrary C functions, resident on a given host and compiled into the host's native code. Hence it can alternate between interpreted and compiled code at the programmer's discretion. Another consideration is whether an agent is executed as a separate process or as a thread within the same address space of the execution engine. This decision represents a tradeoff between security and performance. Starting a new thread is more efficient than starting a new process, but allowing multiple agents to run in the same address space represents a potential security risk.

MOBILITY

The ability to spread or move computation among different nodes at runtime is perhaps the most important characteristic of mobile agents. We can distinguish three aspects of mobility: addressing, the mechanisms to effect the movement of an activity, and high-level support mechanisms.

Addressing

For an activity to move, a destination must first be specified. This destination, also referred to as a place, a location, or a logical node by different systems, is some form of execution environment capable of supporting the mobile agent's functionality. Depending on how the logical nodes are mapped onto the physical network, different forms of addressing are possible. In the simplest form, the networkwide unique names or addresses of the physical nodes are used to specify a destination. This implies that only a single copy of a logical node can be mapped onto any one physical node. To achieve location transparency, logical names are used, whose mapping to the physical nodes may be changed as necessary (for example, to reflect changes in the physical network topology).
This frees the application from having to know anything about the physical network and thus also facilitates its portability. This also permits more than one logical node to be mapped onto a physical node, thus providing better structuring capabilities and facilitating load balancing. As already discussed, another degree of flexibility is achieved by permitting not only logical nodes but also logical links. These are mapped onto paths of zero or more physical links, thus providing virtual connections that can be used for navigation by agents. Addressing can further be subdivided into explicit and implicit. Explicit addressing implies that an agent specifies the exact destination node where it wishes to travel or where a new agent should be spawned. Some systems support itinerary-based addressing, where an agent carries a list of destinations. Each time it issues a migration command, it moves to the next destination on its list. Implicit addressing means that an agent specifies the set of destinations indirectly using an expression that selects zero or more target nodes. The agent is then replicated and a copy sent to all the nodes that meet the selection criteria. The UCI MESSENGERS system defines an elaborate navigational calculus where, given a logical node and an expression involving various combinations of link and node names (including "wild cards"), a set of target nodes
MOBILE NETWORK OBJECTS
relative to the current node is specified. The agent issuing this statement is then replicated and a separate copy sent to each of the selected nodes. For example, an agent could decide to replicate itself along all outgoing links with a specific name and/or orientation, or along links connected to specific neighboring nodes.

Mechanisms for Mobility

Mobility can be achieved in one of two ways. The first is remote execution, which permits a new activity to be spawned on a remote node. The second is migration, which permits an agent to move itself to another node. The boundary between these two approaches, however, is not crisp, since an agent could spawn a copy of itself on a remote node and then terminate on the current node, thus effectively migrating itself. The main question is the level of support provided by the system to extract the current state at the source node and restore it at the destination node. This may range from no support to fully transparent migration. In the case of remote execution, the commands that achieve mobility typically take a passive form, such as "spawn" or "dispatch," implying that it is not the currently executing agent itself that moves; rather, it causes the creation of another agent on a remote node. To achieve active mobility using this approach, it is not sufficient simply to spawn a copy of the current agent, since the new agent would start executing from the beginning. Rather, the sending agent must extract its current state, transmit it along with the code, and cause the new instance to continue executing from the code that follows the migration statement. If the agent code is compiled, its state consists of the activation stack, the CPU (central processing unit) registers, any dynamically allocated heap memory, and open I/O (input/output) connections (file and communication descriptors). If the code is interpreted, the agent's state is typically maintained by the interpreter.
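The itinerary-based addressing and migration commands described above can be combined in a small sketch. This is a toy model, not taken from any of the systems discussed; the names `ItineraryAgent` and `hop` are illustrative.

```python
class ItineraryAgent:
    """Toy agent that carries a list of destinations (its itinerary).

    Each call to hop() stands in for one migration command: the agent
    "moves" to the next destination on its list, as in itinerary-based
    addressing.
    """

    def __init__(self, itinerary):
        self.itinerary = list(itinerary)  # destinations still to visit
        self.location = "origin"
        self.visited = []

    def hop(self):
        """Migrate to the next destination; return False when the list is exhausted."""
        if not self.itinerary:
            return False
        self.location = self.itinerary.pop(0)
        self.visited.append(self.location)
        return True

agent = ItineraryAgent(["node-a", "node-b", "node-c"])
while agent.hop():
    pass  # at each stop the agent would execute its task here
```

At each stop a real system would run the agent's code before the next `hop`; here the loop body is left empty to keep the migration pattern visible.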
In either case, the system must provide support for extracting as much of the agent's state as possible in order to permit its migration. Unfortunately, some parts of the state, notably the I/O connections, may be machine dependent and thus cannot be moved. Hence a completely transparent migration may not always be possible. In the case of self-migration, the commands typically take an active form, such as "go" or "hop," indicating that it is the agent issuing these commands that is being moved. The problems of state capture are similar to those described previously: the system must provide support for extracting the agent's current state and reinstating it at the new destination. This, as well as the creation of the new instance and the destruction of the original one, is usually done automatically by the system as part of the migration operation, which may then be viewed as a high-level construct that transparently achieves self-migration of an agent. Given the difficulty of extracting and restoring an agent's state at an arbitrary point in its execution, some systems limit migration to the top level of execution (i.e., the equivalent of the main program). This is the case with the UCI MESSENGERS system, which also prohibits the use of pointers at that level. This eliminates the need to extract/restore the activation stack as well as any data in heap storage; hence only the agent's local variables and its program counter need to be sent along with the code during migration, making the operation very efficient.
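The top-level-only restriction just described can be sketched as follows. This is a hypothetical illustration (the function names and the serialization format are not from any of the systems discussed): the migratable state is reduced to the agent's local variables and its program counter, and execution resumes at the statement after the migration command.

```python
def capture_state(variables, program_counter):
    """Capture the minimal migratable state of a top-level agent:
    its local variables and its program counter (no activation stack,
    no heap data, no pointers -- mirroring the restriction in the text)."""
    # Refuse anything that is not plainly serializable by value.
    for name, value in variables.items():
        if not isinstance(value, (int, float, str, bool, list, dict)):
            raise TypeError(f"cannot migrate non-value state: {name}")
    return {"vars": dict(variables), "pc": program_counter}

def restore_state(snapshot):
    """Reinstate the agent at the destination: same variables, with
    execution resuming at the statement after the migration command."""
    return dict(snapshot["vars"]), snapshot["pc"] + 1

# The snapshot travels with the agent's code to the destination node.
snap = capture_state({"total": 42, "name": "probe"}, program_counter=7)
vars_, pc = restore_state(snap)
```

Because only a flat variable dictionary and an integer are transmitted, the migration message stays small, which is why the text calls this operation very efficient.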
High-Level Support

The third aspect of navigation concerns the high-level tools and mechanisms that make the migration of agents more powerful or more user friendly. These fall into two categories: the first deals with finding agents or services on the net, the second with finding the best ways to reach the corresponding remote sites. There is no general conceptual framework for either problem, and hence we only mention a few approaches that have been used by various systems. The Agent Tcl project (8) addresses both areas. To locate services, it provides a hierarchy of specialized navigation agents, which are stationary and which maintain a database of service locations. Services are registered with these navigation agents. A mobile agent looking for a service may query a navigation agent, which suggests a list of possible services based on a keyword search, and possibly other navigation agents, which may be more specialized in maintaining services on the requested topic. Later, mobile agents may provide feedback about which services were useful, thus improving the navigation agent's ability to provide information in the future. The CUI MESSENGERS system (10) also provides extensive support for publicizing and discovering services on the net. One of the main issues is to ensure protection of the service provider. Unlike Agent Tcl, which uses active navigation agents, CUI MESSENGERS uses specialized dictionaries in each logical node, which can be consulted by potential clients. Each service is publicized with its operational interface, which, using a specialized interface-description language, specifies the conventions necessary to interact with the service. The second category of high-level support generally includes network sensing and monitoring tools.
The complexity and sophistication of these tools range from very simple (for example, determining whether a particular computer is "alive" and connected to the current node) to continuous network monitoring services that provide estimates of the latency and bandwidth of various connections in the network.

AGENT INTERACTIONS

Agents need to interact with one another at runtime, either to exchange information (i.e., communicate) or to synchronize their actions. Different systems support several forms of interagent communication. The simplest is based on shared data. That is, a logical node contains some agreed-upon variables or data structures, which may be accessed by agents currently executing on that node. The access can be either by location (i.e., reading from or writing to a specific location specified by name or address) or associative, by content (i.e., specifying a part of the data item to be accessed and letting the system find all data items that match the given value). In the case of object-oriented systems, another form of interagent communication is possible. Each such agent consists of one or more objects, where objects encapsulate both data and the functions (called methods) that may operate on the data. Two agents operating on the same node may establish a connection that permits them to invoke each other's methods, thus passing information to each other or otherwise manipulating each other's internal state. This is analogous to performing remote procedure calls in conventional client-server applications and thus can be extended to communication with stationary agents or other services on the same or even remote nodes. Since communication via shared variables or method invocation requires both agents to be on the same node, some systems permit agents to establish connections across different nodes and to communicate with each other by messages. This includes connections that may be established between an agent and its owner (user). The send/receive primitives supported by the system may be either synchronous or asynchronous, depending on the system's intended application domain. To find a particular mobile agent on the net, a "paging" service may also be provided, which returns the location of the sought-after agent. In Agent Tcl, for example, this service relies on each mobile agent registering its position with its "home" machine after each jump, which permits the user or source agent to find its current location. Synchronization may be required for agents operating on different nodes or on the same node. Synchronizing activities on different nodes is a general problem in distributed coordination for which various solutions exist, including distributed semaphores, a central server/manager, distributed voting, and token-based schemes. For this reason, few mechanisms specific to mobile agents have been proposed. Similarly, synchronization of mobile agents on the same node is generally achieved by adapting classical methods. The solutions incorporated in different systems vary greatly in their sophistication. The simplest way to achieve synchronization is by busy waiting (also called spin lock), where an agent continuously reads a given variable until it has been set to the desired value by another agent. The main disadvantage of this scheme is the wasted CPU time, which can be eliminated by implementing a more sophisticated form of locks (or semaphores), where the waiting agent is blocked (sleeping) while the desired condition is false.
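The contrast between busy waiting and blocking can be illustrated with a standard condition variable; this is a generic sketch using Python's threading primitives, not the mechanism of any particular mobile-agent system.

```python
import threading

# Busy waiting (spin lock): the waiting agent burns CPU rechecking a flag.
def spin_wait(flag):
    while not flag["set"]:
        pass  # wasted CPU time, as noted in the text
    return "proceeded"

# Blocking wait: the agent sleeps until another agent signals the condition.
cond = threading.Condition()
state = {"ready": False, "result": None}

def blocked_agent():
    with cond:
        while not state["ready"]:
            cond.wait()          # sleeps; consumes no CPU while the condition is false
        state["result"] = "proceeded"

def signalling_agent():
    with cond:
        state["ready"] = True
        cond.notify_all()        # wakes the blocked agent

t = threading.Thread(target=blocked_agent)
t.start()
signalling_agent()
t.join()
```

The `spin_wait` function is shown only for contrast and is never called here; in practice it would monopolize a CPU until another agent set the flag.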
The CUI MESSENGERS system (10) provides a novel synchronization mechanism based on the notion of synchronization queues. Agents can create/destroy queues as needed and can use specialized primitives to enter/exit a particular queue. The basic principle is that only those agents that are at the head of a queue (or on no queue) are running. An agent that is currently running may also block/unblock a given queue, thus preventing all agents on that queue from proceeding. Another synchronization mechanism is based on events. These are arbitrary user-defined conditions that can be set and tested at runtime. An agent may indicate that it is interested in certain types of events, in which case it will be notified whenever an event of the desired type occurs. The notification is in the form of an ‘‘interrupt,’’ which executes a specific function (an event handler) provided by the agent.
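The synchronization-queue principle described above can be sketched as follows. This is a toy, single-threaded model of the idea (only head-of-queue agents run; a running agent may block a whole queue); the class and method names are illustrative, not the CUI MESSENGERS primitives.

```python
class SyncQueue:
    """Toy synchronization queue: only the agent at the head of the
    queue is runnable, and a running agent may block or unblock the
    entire queue, stalling every agent on it."""

    def __init__(self):
        self.agents = []
        self.blocked = False

    def enter(self, agent):
        self.agents.append(agent)

    def exit(self, agent):
        self.agents.remove(agent)

    def runnable(self):
        """Return the one agent currently allowed to run, if any."""
        if self.blocked or not self.agents:
            return None
        return self.agents[0]

q = SyncQueue()
q.enter("a1")
q.enter("a2")
head = q.runnable()        # only a1 may run; a2 waits behind it
q.blocked = True           # some running agent blocks the whole queue
stalled = q.runnable()     # nobody on this queue may proceed
q.blocked = False
q.exit("a1")               # a1 leaves; a2 moves to the head
```

Agents on no queue at all would simply bypass this structure and remain runnable, matching the "or on no queue" clause in the text.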
PROTECTION AND SECURITY

The autonomous mobility of agents creates the potential for security violations that would otherwise not be possible. With traditional approaches, the outside world can interact with a computer only through well-defined interfaces and through the fixed set of programs installed on the computer. These restrictions provide a barrier that allows the computer to protect itself from external attack. Mobile agents eliminate this barrier, since the code that the computer runs is provided by the
very external agents from whom the computer should be protected. Without proper safeguards, the computer may accept unsafe code and permit it to run. In so doing, the computer opens itself and its other users to abuse or misuse of its resources. For example, an entering agent may consume excessive amounts of memory or CPU time; access memory, disk files, or services for which it has no authorization; leak sensitive information to the outside world; or destroy information or services. Mobile agents also open up the possibility of attacks on the agents themselves. For example, an agent might attempt to steal sensitive information that another agent is carrying. A host in the system might try to modify an agent by changing its data (e.g., the maximum price it is prepared to pay for a service offered by the host) or its instructions (e.g., by altering the agent so that it works on behalf of this host rather than on behalf of its original owner). Thus threats to agents can come either from other agents or from hosts on the network. Hence there are three kinds of protection to address: protecting the system from an agent, protecting an agent from another agent, and protecting an agent from a host.

Protecting the System from an Agent

Nodes in a system can be protected from agents by using a combination of authentication, restriction of access to potentially dangerous operations, and resource limits. An agent typically carries with it certain identifying information, such as its owner and its origin. Authentication mechanisms check that this information is correct; this can be done using public-key encryption protocols. An agent can then be permitted to perform or forbidden from performing certain operations, depending on its status.
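The authenticate-then-restrict pattern just described can be sketched as follows. This is a hypothetical illustration, not the mechanism of Agent Tcl or Telescript: the status string, the set of "dangerous" operations, and the CPU budget are all invented for the example.

```python
class AgentSandbox:
    """Toy sketch of the safeguards described above: an agent's
    authenticated status decides which operations it may perform,
    and a resource budget caps what it can consume on this host."""

    DANGEROUS = {"open_file", "open_socket", "exec"}  # illustrative set

    def __init__(self, status, cpu_budget):
        self.status = status          # "trusted" or "untrusted"
        self.cpu_budget = cpu_budget  # resource limit on this host

    def may_perform(self, operation):
        """Trusted agents run freely; untrusted ones lose dangerous ops."""
        if self.status == "trusted":
            return True
        return operation not in self.DANGEROUS

    def charge(self, cpu_units):
        """Account for resource use; refuse once the limit is reached."""
        if cpu_units > self.cpu_budget:
            return False
        self.cpu_budget -= cpu_units
        return True

unt = AgentSandbox("untrusted", cpu_budget=100)
allowed = unt.may_perform("read_clock")   # harmless, permitted
blocked = unt.may_perform("open_file")    # dangerous, forbidden
ok = unt.charge(80)                       # within budget
refused = unt.charge(30)                  # would exceed the remaining budget
```

Real systems check the parameters of each dangerous operation rather than forbidding it wholesale, but the gating structure is the same.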
In Agent Tcl (8), an agent is assigned a status of "trusted" or "untrusted." An untrusted agent is run with an interpreter that limits its ability to perform potentially dangerous operations, either by forbidding such operations entirely or by carefully checking the parameters of each such operation before allowing it to proceed. In Telescript, an agent carries with it "permits," each of which allows it to perform certain operations that are otherwise forbidden (11). Resource limits prevent a single agent from consuming excessive amounts of resources on a single host. Agents could abuse the system in more subtle ways. For example, a malicious agent could simply hop to a new host selected at random, make two copies of itself, and stop. Such an agent could ultimately paralyze the entire network. Agent Tcl and Telescript propose addressing this problem by introducing an analog of a cash economy. Each agent carries with it a certain amount of "currency," which it must spend in order to use resources. Every time an agent creates a new agent, it must give some of its currency to the child agent; otherwise, the child agent will not be able to use any resources. This mechanism limits the total amount of network resources that can be consumed by an agent and its descendants.

Protecting an Agent from Other Agents

Once the system has been protected from the agents running on it, the problem of protecting an agent from other agents is quite similar to the classical security problem of protecting a program from other programs on multiuser machines. One approach is to have each agent run in a separate address
space, so that an agent cannot be affected by another agent unless it chooses to communicate with it.

Protecting an Agent from a Node

This is the most difficult of the three types of protection. It is virtually impossible for an agent to prevent itself from being tampered with by a malicious or faulty host. Nevertheless, it is usually possible to detect whether specific sensitive areas of an agent arriving at a node have been tampered with at the previous node. This is being implemented in Agent Tcl using digital signatures. Once an agent migrates to a new host, it cannot prevent the host from examining its contents and possibly stealing sensitive information that it contains. The damage from such a theft can be limited if an agent makes sure that the sensitive information is stored in a form that is not useful without cooperation from a trusted network node (e.g., by keeping it encrypted). It is possible for an agent to build an audit trail that includes a list of the nodes it has visited. This does not prevent theft, but it can be used after the fact to help identify nodes that might be stealing data. It is also possible for an agent to be sent out with a list of trusted nodes and with the restriction that it visit only trusted nodes. However, this approach represents a significant restriction, since one of the most attractive features of the mobile agent paradigm is the notion of autonomous agents that can freely roam the network.

UTILITY AND APPLICATIONS

Mobile agents have several advantages over distributed computing using conventional message-passing approaches. These advantages can be roughly divided into two groups: software engineering advantages and performance advantages. The ability to move computations at runtime between different nodes makes applications functionally open ended and thus arbitrarily extensible. Notably, a server is not linked to a fixed set of predefined functions.
Rather, each incoming request can carry with it the necessary code for its processing, thus making the server's capabilities virtually unlimited. The same principle applies to communication: mobile agents provide a mechanism for dynamic protocols. Without mobile agents, a computer supports a finite set of protocols for moving data to and from other computers, each of which can manipulate the data only in a fixed number of ways, determined by the computer's current software capabilities. If the computer lacks a particular application needed to access, view, or process some received data properly, the needed application must be installed manually to extend the machine's capabilities. Mobile agents permit new protocols to be installed automatically, and only as needed for a particular interaction. Another software engineering advantage of mobile agents is ease of programming for certain kinds of applications. Conventional distributed programming requires viewing the application as a global collection of concurrent activities interacting with each other via message passing. Each program must anticipate in advance all the possible messages it can receive from other programs and be ready to respond to them. Programming with mobile agents is more like driving a car through the network: the programmer's task is to guide the
agent on its journey through the network, describing the computation to be performed at stops along the way. One class of applications that is particularly well suited to implementation using mobile agents is individual-based simulation, in which agents representing individual entities coordinate their activities to model complex collective behavior in a spatial domain. Examples of such applications include interactive battle simulations, particle-level physics simulations, traffic modeling, and ecological studies. All of the above software engineering advantages stem from the fact that the mobile agent paradigm better fits certain types of distributed applications, which reduces the amount of programming necessary. In terms of performance, the ability of mobile agents to move through the network offers a considerable potential reduction in communication cost. Suppose, for example, that a program wishes to process a large amount of data at a remote site. One approach would be for the remote site to send all the data to the local site. This is likely to incur considerably more communication overhead than dispatching an agent to the remote site that processes the data and then returns. If the connection is slow or unreliable, there is a further advantage to the mobile agent approach. If the remote site is sending a large stream of data to the local site and the connection is lost, then the stream may have to be resent in its entirety or a restart protocol may have to be run. With mobile agents, a steady connection is not necessary: once the agent has arrived at the remote site, there is no reason for the remote site and the local site to have any mutual contact until the agent is ready to return to the local site. A number of applications using mobile agents have been proposed or actually developed. A few are briefly described next. For more details, see the reading list at the end of this article.
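The communication-cost argument above can be made concrete with a back-of-the-envelope comparison. The sizes below are purely illustrative (a 50 MB remote data set, a 100 kB agent, 10 kB of state, a 1 kB result); the point is the ratio, not the particular numbers.

```python
def bytes_moved_data_shipping(data_bytes):
    """Conventional approach: all the remote data crosses the network."""
    return data_bytes

def bytes_moved_agent(code_bytes, state_bytes, result_bytes):
    """Mobile-agent approach: only the agent travels out (code + state)
    and back (state + result); the bulk data never leaves the remote site."""
    return (code_bytes + state_bytes) + (state_bytes + result_bytes)

data = 50_000_000                                        # 50 MB of remote data
agent_total = bytes_moved_agent(100_000, 10_000, 1_000)  # 121 kB in total
savings = bytes_moved_data_shipping(data) / agent_total  # roughly 400x fewer bytes
```

The same calculation also suggests when agents lose: if the result is nearly as large as the raw data, or the agent's code dwarfs the data, shipping the data is cheaper.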
Information Retrieval

One obvious application of mobile agents is accessing and retrieving data at remote sites on a network. If the volume of information is large, it is clearly more efficient to dispatch an agent to the remote site and have it filter the data than to ship all the data over the network and then process it. Servers can support search without providing any specific software capabilities other than permitting mobile agents to enter and execute at their site. The agents bring with them all the code and "intelligence" needed to carry out the searches, supplied by the user originating the request. The data at the remote site may contain references to other useful data at other remote sites, in which case the agent may move, or send copies of itself, to these other sites and access the data there as well.

Electronic Commerce

As commerce on the Internet becomes a reality, the potential uses for mobile agents are almost unlimited. Many of the references at the end of this article address possible uses of mobile agents in the electronic marketplace. Mobile agents can search the Internet to find the best price on a particular item, make reservations or purchases on behalf of their owner (e.g., airplane tickets, hotel reservations), or search repeatedly to see whether a currently unavailable item (e.g., a ticket to a sold-out concert) becomes available.
More complex mobile agents could perform more difficult tasks, such as negotiating deals or closing out business transactions on behalf of their owners. One important related problem is the implementation and use of electronic cash.

Intelligent Agents and Personal Assistants

The term intelligent agent is used in two different contexts. One use refers to artificial intelligence (AI) systems in which the intelligence stems from the behavior and interaction of individual entities or agents within the system. Generally these agents do not migrate and hence do not fall within the scope of this article. The term is also used to describe agents that act as personal assistants to the user. Some of these are mobile and some are not. Examples of the latter include interfaces for e-mail and news filtering systems. An example of intelligent agents that are also mobile agents is software for scheduling meetings (interacting with users and/or their calendars at distributed locations).

Mobile Computing

This application was alluded to at the beginning of this article. The user of a portable computer can submit a mobile agent that contains a program to be run and then sign off. When the agent is finished computing, it waits and jumps back to the user's computer after the user signs back on and requests it to do so.

Network Management

Mobile agents can be used to perform various administrative and maintenance functions in networks. For example, agents can be dispatched to monitor links and nodes, diagnose faults, identify areas of congestion, and so on. As another example, one of the stated goals of the CUI MESSENGERS project is developing a distributed operating system based on mobile agents.

General-Purpose Computing

Mobile agents can be used as the basis for general-purpose distributed computing (6,12).
If the communication overhead is reasonably low compared with the amount of computation required, distributed solutions using mobile agents are competitive in performance with distributed solutions using traditional message-passing approaches. Many algorithms are more naturally implemented using the metaphor of navigation through a network than using message passing, so the mobile agent approach often yields a smaller semantic gap between the abstract specification of the algorithm and the actual implementation. Mobile agents also provide a useful way of coordinating the behavior of functions and data in a distributed application such as a distributed simulation. The use of mobile agents as a coordination paradigm is particularly well suited to systems that permit calls into native-mode code. The coordination functions are performed by services provided by the interpreter, while the actual computation can be done in native mode, so the computational cost due to interpretive overhead is minimized.

BIBLIOGRAPHY

1. A. D. Birrell and B. J. Nelson, Implementing remote procedure calls, ACM Trans. Comput. Syst., 2: 39–59, 1984.
2. J. W. Stamos and D. K. Gifford, Remote evaluation, ACM TOPLAS, 12 (4): 537–565, 1990.
3. J. Gosling and H. McGilton, The Java Language Environment, Sun Microsystems, Inc., Mountain View, CA, 1995. http://java.sun.com
4. E. Jul et al., Fine-grained mobility in the Emerald system, ACM Trans. Comput. Syst., 6 (1): 109–133, 1988.
5. L. Cardelli, Obliq: A language with distributed scope, Comput. Syst., 8 (1): 27–59, 1995.
6. L. F. Bic, M. Fukuda, and M. Dillencourt, Distributed computing using autonomous objects, IEEE Comput., 29 (8): 55–61, 1996.
7. The Telescript reference manual, Technical report, General Magic, Inc., Mountain View, CA, June 1996. http://www.genmagic.com
8. R. S. Gray, Agent Tcl: A flexible and secure mobile-agent system, in Proc. 4th Annu. Tcl/Tk Workshop (TCL 96), Monterey, CA, July 1996. http://www.cs.dartmouth.edu/~agent/papers/index.html
9. A. Lingnau, O. Drobnik, and P. Dömel, An HTTP-based infrastructure for mobile agents, in 4th Int. World Wide Web Conf. Proc., pp. 461–471, Sebastopol, CA, December 1995, O'Reilly and Associates.
10. C. F. Tschudin, On the Structuring of Computer Communications, Ph.D. thesis, University of Geneva, Centre Universitaire d'Informatique, Geneva, Switzerland, 1993. http://cuiwww.unige.ch/tios/msgr/home.html
11. J. White, Mobile agents white paper, Technical report, General Magic, Inc., Mountain View, CA, 1996. http://www.genmagic.com
12. P. S. Sapaty and P. M. Borst, An overview of the WAVE language and system for distributed processing of open networks, Technical report, University of Surrey, UK, 1994.
Reading List

J. Baumann, Mobile agents: A triptychon of problems, in 1st ECOOP Workshop Mobile Object Systems, 1995. http://www.informatik.uni-stuttgart.de/ipvr/vs/projekte/mole/agents.html
D. Johansen, R. van Renesse, and F. B. Schneider, An introduction to the TACOMA distributed system version 1.0, Technical Report 05-23, Department of Computer Science, University of Tromsø, June 1995. http://www.cs.uit.no/DOS/Tacoma/index.html
D. B. Lange and M. Oshima, Programming mobile agents in Java with the Java Aglet API. http://www.trl.ibm.co.jp/aglets/
T. Magedanz and T. Eckardt, Mobile software agents: A new paradigm for telecommunications management, in IEEE/IFIP Network Operations and Management Symposium (NOMS), Kyoto, Japan, April 1996. http://www.fokus.gmd.de/oks/research/magna_g.html#dokumente
H. Peine, An introduction to mobile agent programming and the Ara system, ZRI Technical Report 1/97, Dept. of Computer Science, University of Kaiserslautern, January 1998. http://www.uni-kl.de/AG-Nehmer/Ara/ara.html
C. F. Tschudin, On the Structuring of Computer Communications, Ph.D. thesis, University of Geneva, Centre Universitaire d'Informatique, Geneva, Switzerland, 1993. http://cuiwww.unige.ch/tios/msgr/home.html
D. Wong et al. (Mitsubishi Horizon Systems Lab, USA), Concordia: An infrastructure for collaborating mobile agents, in Proc. 1st Int. Workshop Mobile Agents, Berlin, Germany, April 7–8, 1997. http://www.meitca.com/HSL/Projects/Concordia/MobileAgentConf_for_web.html
M. Condict et al., Towards a world-wide civilization of objects, in Proc. 7th ACM SIGOPS Eur. Workshop, Connemara, Ireland, September 1996. http://www.opengroup.org/RI/java/moa/WebOS.ps
General Magic, Inc., Odyssey, 1997. http://www.genmagic.com/agents/odyssey.html
LUBOMIR F. BIC
MICHAEL B. DILLENCOURT
MUNEHIRO FUKUDA
University of California
Wiley Encyclopedia of Electrical and Electronics Engineering
Multicast
Standard Article
George C. Polyzos and K. Katsaros, Athens University of Economics and Business, Athens, Greece
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5302
Article Online Posting Date: December 27, 1999
Abstract. The sections in this article are: Introduction; Multipoint Routing and Multicast Tree Algorithms; Feedback Control and Reliable Multicast; Multicasting Real-Time Continuous Media; Multicasting on the Internet; Multicast in ATM and Other Technologies; Application-Level Multicast.
MULTICAST
INTRODUCTION

One of the ways communication can be characterized is by the number of parties involved. Traditional communication modes have been unicast, i.e., one-to-one, and broadcast, i.e., one-to-all. Between these two extremes we find multicast, the transmission of a single message or data stream to a set of receivers. Thus, multicast is a generalization of both unicast and broadcast and a unifying communication mode. For this reason it is receiving increasing attention in modern networking architectures. The above definition of multicast is still traditional in the sense that it is sender-centric and unidirectional. An even more general term would be multipoint communication. Note also that multicast is considered mostly in the context of digital and, particularly, packet-switching networks. Multicast is examined on its own because the specification of receivers through a set introduces features and complications that are not present in traditional communication modes. We can distinguish two cases of increasing complexity with respect to the set of receivers: (1) it is fixed and known (e.g., to the sender), or (2) it is unknown and/or dynamic. The multicast model of communication supports applications where data and control are partitioned over multiple actors, such as updating replicated databases, contacting any one of a group of distributed servers whose composition is unknown (more appropriately termed anycast), and interprocess communication among cooperating processes, e.g., distributing intermediate computational results from one processor to others in parallel computers. A demanding application of multicast is Distributed Interactive Simulation (DIS).
Targeted news and information distribution in near real time has potential global impact and would normally be less demanding, even though specific applications, such as stock-quote distribution, might have very stringent requirements (e.g., needing atomic multicast, the semantics of which imply that a message should be received by all receivers in the group or by none at all). However, the prototypical multipoint communication application is probably real-time interactive multimedia teleconferencing, possibly including shared workspaces, falling under the umbrella of Computer Supported Collaborative Work (CSCW). Efficient multicast is a fundamental issue for the success of group applications. Here the selective multicast service takes the place of indiscriminate broadcasting so as to reduce the waste of resources caused by transmitting all the information or channels to all receivers. The basic means of conserving resources via multicast is sharing: instead of transmitting information from a sender to each receiver separately, we can arrange for routes that share links to carry the information only once over the shared links. We can picture a multicast route as a tree rooted at the sender with a receiver at each leaf and possibly some receivers on internal nodes. The tree can be designed so
as to maximize shared links and thus minimize resource consumption. While multicast is increasingly recognized as a valuable service for packet networks, many architectures still do not support it directly. Many shared-medium networks such as Ethernet support options for broadcast and multicast packets and the corresponding addressing mechanisms. However, processors are often required to perform extra processing when receiving a multicast packet. When switches are used in point-to-point networks, we would like the hardware to automatically recognize multicast addresses as such and transmit multicast packets through multiple links copying packets on the fly, as required. ATM switch designs increasingly support parallel transmission of multicast cells over multiple links in hardware, increasing peak switching speeds. Another set of issues is concerned with extending the feedback mechanisms employed by unicast-oriented protocols to deal with flow, congestion and error control. For example, transport-layer protocols such as TCP adapt their behavior according to the prevailing network conditions at any given point in time by measuring loss rates as experienced by receivers. When extending these protocols for multicast, there is the possibility of feedback implosion when many receivers send such reports towards the sender, thus swamping the network and the source with control information. Apart from the obvious scalability problems of such schemes, there is also the issue of how to adapt the sender’s behavior when conflicting reports arrive from the various receivers. 
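The link-sharing benefit of a multicast tree described above can be quantified on a toy topology. The topology below is hypothetical: a sender S reaches three receivers through a single router R, so the S-R link is shared by all three paths.

```python
# Toy topology: sender S reaches receivers r1..r3 via router R.
# Each receiver's unicast path is listed as the links it traverses.
unicast_paths = {
    "r1": ["S-R", "R-r1"],
    "r2": ["S-R", "R-r2"],
    "r3": ["S-R", "R-r3"],
}

def unicast_link_transmissions(paths):
    """Separate unicasts: the shared S-R link carries one copy per receiver,
    so every link of every path is counted."""
    return sum(len(p) for p in paths.values())

def multicast_link_transmissions(paths):
    """Shared tree: each link in the union of the paths carries one copy,
    and the packet is duplicated at the branching router."""
    tree_links = set()
    for p in paths.values():
        tree_links.update(p)
    return len(tree_links)

u = unicast_link_transmissions(unicast_paths)    # 6 link transmissions
m = multicast_link_transmissions(unicast_paths)  # 4 link transmissions
```

With more receivers behind R the gap widens: unicast cost grows by two links per receiver, while the shared tree grows by only one.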
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 2007 John Wiley & Sons, Inc.

Multicast

Multicast Groups and their Dynamics

The difference between multicasting and separately unicasting data to several destinations is best captured by the Internet host group model: a host group is a set of network entities sharing a common identifying multicast address, all receiving (traditionally, via best-effort service) any data packets addressed to this multicast address by senders that may (closed group) or may not be members of the group (open group) and that have no knowledge of the group's membership. This definition leaves the behavior of the group over time unrestricted in multiple dimensions: it may have local (LAN) or global (WAN) membership, be transient or persistent in time, and have constant or varying membership. From the sender's point of view, this model reduces the multicast service interface to a unicast one. The network software is accordingly burdened with the task of managing multicasts in a manner transparent to the users. From the network designer's point of view, this extra work is expected to result in more efficient usage of resources, which is the primary motive for network providers to support multicast in the first place. These goals impose specific requirements on the network implementation. First, there must be a means for routing packets from a sender to all group members whenever the destination address of a packet is a multicast one, which implies that the network must locate all members of the relevant group and make routing arrangements. Second, since group membership is dynamic, the network must also continuously track current membership during a session's lifetime, which can range from a short to a very long period of time. Tracking is required both to start forwarding data to new group members and to stop the wasteful transmission of packets to destinations that have left the group. Both tasks must be carried out without assistance from the sending entity, as the host-group model dictates. The dynamic nature of multicast groups has important implications for multicast routing. A very different model is adopted by the Asynchronous Transfer Mode (ATM) technology. The first and currently the only supported model is that of a point-to-multipoint Virtual Channel (VC), where the source signaling agent knows the addresses of all destinations included in the VC exactly, typically because they are added one by one during the initial connection set-up. Proposals for receiver-initiated dynamic modifications to VCs are being investigated.
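Returning to the Internet host group model, its service interface can be illustrated with a toy sketch (the functions, addresses, and delivery model below are invented): senders address the group, never its members.

```python
# Toy illustration of the host-group model (names are invented):
# senders address a multicast group, not its members, and membership
# changes without the sender's knowledge or assistance.

groups = {}  # multicast address -> set of member hosts

def join(addr, host):
    groups.setdefault(addr, set()).add(host)

def leave(addr, host):
    groups.get(addr, set()).discard(host)

def send(addr, data, sender=None):
    """Best-effort delivery to whoever is currently a member.
    In an open group the sender need not be a member itself."""
    return {(host, data) for host in groups.get(addr, set())}

join("224.0.1.1", "h1")
join("224.0.1.1", "h2")
delivered = send("224.0.1.1", "pkt", sender="h3")  # h3 is not a member
leave("224.0.1.1", "h2")
```

Note that the sender h3 needs no knowledge of the membership, and joins and leaves take effect without its involvement, which is exactly the open-group behavior described above.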
MULTIPOINT ROUTING AND MULTICAST TREE ALGORITHMS

Unicast routing aims to minimize transmission cost or delay, depending on the metric used for the optimization. These two apparently different goals are equivalent from an algorithmic point of view, both leading to the use of shortest-path algorithms, with Dijkstra's and the Bellman-Ford algorithm the two common cases. These algorithms find optimal routes between one node (the sender) and all other nodes in the network (including all receivers) in the form of shortest-path trees. Thus, a straightforward (but not optimal) solution to the multicast routing problem can be based on the shortest-path trees produced by these algorithms by pruning off any branches that do not lead to receivers in the group. Although details vary according to the base algorithm, some observations generally apply. On the up side, these algorithms are easy to implement, as direct extensions of existing ones, and thus fast to deploy. Additionally, each path is optimal by definition, regardless of changes in group membership, and this optimality comes essentially for free, since shortest paths need to be computed for unicast routing anyway. On the down side, these algorithms optimize the wrong metric: they minimize each individual path rather than the total cost of the tree. Also, for large internetworks with widely dispersed groups, either the scale of the network or continuous network changes restrict their use to subnetworks that already employ their unicast counterparts. Similar problems (e.g., processing complexity for Dijkstra and instability for Bellman-Ford) have also forced unicast routing algorithms to rely on hierarchical routing techniques for large networks. Cost optimization in multicast can be viewed from another angle: overall cost optimization for the distribution tree. The shortest-path algorithms concentrate on pairwise optimizations between the source and each destination and only conserve resources as a side effect, when paths overlap.
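The pruning construction just described can be sketched as follows (the sample graph is illustrative): run Dijkstra from the sender, then keep only the tree edges that lie on a path to some group member.

```python
import heapq

# Sketch of pruning a shortest-path tree into a multicast tree.
# The weighted graph below is invented for illustration.

def dijkstra_tree(graph, src):
    """Return a predecessor map encoding the shortest-path tree."""
    dist, pred = {src: 0}, {}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], pred[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    return pred

def prune(pred, receivers):
    """Keep only tree edges that lie on a path to some receiver."""
    keep = set()
    for r in receivers:
        while r in pred and (pred[r], r) not in keep:
            keep.add((pred[r], r))
            r = pred[r]
    return keep

graph = {
    "S": {"A": 1, "B": 4},
    "A": {"S": 1, "B": 1, "C": 2, "E": 5},
    "B": {"S": 4, "A": 1, "D": 1},
    "C": {"A": 2},
    "D": {"B": 1},
    "E": {"A": 5},
}
tree = prune(dijkstra_tree(graph, "S"), receivers={"C", "D"})
```

The branch towards E, which leads to no receiver, is pruned away; every surviving path is a shortest path from S, but the total tree cost is not optimized.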
We can instead try to build a tree that exploits link sharing as much as possible, and by duplicating packets only when paths diverge, minimize total distribution cost,
even at the expense of serving some receivers over longer paths. What we need is a tree that reaches all receivers and may use any additional network nodes on the way. This is the Steiner tree problem: given a cost-labeled graph and a set of required nodes (here the sender and the receivers), find a minimal-cost tree connecting all of them, possibly through additional (Steiner) nodes. Note that if all the nodes in the graph were required, the problem would coincide with finding a minimum spanning tree for the graph, for which efficient algorithms are known. The Steiner tree problem, in contrast, is intractable: Garey et al. (10) have shown it to be NP-complete. However, approximation algorithms exist with proven constant worst-case bounds, and implementations of such algorithms have been shown to produce low-cost multicast trees with very good average behavior. As an example, trees built with the heuristic by Kou et al. (15) have at most twice the cost of the optimal Steiner tree, while simulations over realistic network topologies have shown their cost to be within 5% of the optimum. The advantage of this approach is its overall optimality with respect to a single cost metric, such as transmission cost. However, the disadvantages are also important: the algorithm needs to run in addition to the unicast algorithms, and it will itself have scaling problems for large networks. Furthermore, optimality is generally lost after group membership changes and network reconfigurations unless the tree is recomputed from scratch. Thus, Steiner tree algorithms are best suited to static or slowly changing environments, since changes lead to expensive recalculations to regain optimality. Both approaches discussed above suffer from an inability to maintain their measure of optimality in large and dynamic networks.
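As a sketch of how such an approximation works, the following implements the main steps of a Kou-et-al.-style heuristic: build the metric closure over the terminals, take a spanning tree of the closure, and expand each of its edges back into a real shortest path. The graph is illustrative, and the refinement passes of the full algorithm are omitted.

```python
import heapq

# Simplified Kou-Markowsky-Berman style Steiner heuristic (sketch):
# the resulting tree costs at most twice the optimal Steiner tree.

def shortest_paths(graph, src):
    """Dijkstra: (distances, predecessors) from `src`."""
    dist, pred, pq = {src: 0}, {}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], pred[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    return dist, pred

def kmb(graph, terminals):
    closure = {t: shortest_paths(graph, t) for t in terminals}
    terminals = list(terminals)
    # Prim's MST over the metric closure of the terminals.
    in_tree, mst_edges = {terminals[0]}, []
    while len(in_tree) < len(terminals):
        u, v = min(((a, b) for a in in_tree for b in terminals
                    if b not in in_tree),
                   key=lambda e: closure[e[0]][0][e[1]])
        in_tree.add(v)
        mst_edges.append((u, v))
    # Expand each closure edge into the underlying shortest path.
    tree_edges = set()
    for u, v in mst_edges:
        pred = closure[u][1]
        while v != u:
            tree_edges.add(tuple(sorted((pred[v], v))))
            v = pred[v]
    return tree_edges

graph = {"A": {"H": 1}, "B": {"H": 1}, "C": {"H": 1},
         "H": {"A": 1, "B": 1, "C": 1}}
steiner = kmb(graph, {"A", "B", "C"})   # star through the Steiner node H
```

In the example the tree routes all terminals through the non-terminal hub H at total cost 3; in richer graphs the heuristic's cost remains provably within a factor of two of the optimum.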
Approaches for extending these algorithms to deal with changes in group membership without complete tree reconfiguration include extending an existing tree in the cheapest way possible to support a new group member and pruning the redundant branches of the tree when a group member departs. The quality of the trees will deteriorate after several local modifications of this sort, eventually leading to a need for global tree reconfiguration. A different approach to the routing problem opts for a solution in realistic settings by adopting the practical goal of finding good, rather than optimal, trees that can also be easily maintained. The departure point for this approach is the center-based tree, an optimal-cost tree that, instead of being rooted at the sender, is rooted at the topological center of the receivers. Even though such a tree may not be optimal for any one sender, it can be proven to be an adequate approximation for all of them together. The implication is that one basic tree can serve as a common infrastructure for all senders. Thus, maintenance of the tree is greatly simplified, and nodes on the tree need only maintain state for one shared tree rather than many source-rooted trees. Since this method was developed for broadcasting rather than multicasting, however, the theoretical results do not carry over when we prune the broadcast trees to get multicast ones. In addition, the topological center, apart from being hard to find (the problem being NP-complete), will not even remain useful for long in a dynamic multicast environment.
Practical proposals for multicast routing abandon the concrete optimality claims discussed above, but keep the basic idea of having a single shared multicast tree for all senders to a group. This is a departure from approaches that build one tree for each sender. Routing is then performed by defining one or more core or rendez-vous points to serve as the basis for tree construction and adding branches by separately routing packets optimally (in the unicast sense) from the senders to these fixed points and then from there to the receivers. Again, merging of paths is exploited whenever possible, but it is not an explicit goal of the routing calculations. Instead, because of the concentration of paths around the fixed points, common paths are expected to arise. A single shared multicast tree is not optimal in any strict sense since no attempt is made to find the topological center of the tree, both due to its computational cost and the limited lifetime of any topological center for a dynamic environment. But, the advantages of shared multicast trees are numerous. First, a shared tree for the whole group means that this approach scales well in terms of maintenance costs as the number of senders increases. Actually, there is still a tree emanating from each sender, but all these trees merge near the fixed points and the distribution mesh is common from there on to the receivers. Second, the trees can be made quite efficient by clever choice of the fixed points. Third, routing is performed independently for each sender and receiver, with entering and departing receivers influencing only their own path to the fixed points of the single shared tree, employing any underlying mechanism available for unicast routing. This last property means that network and group membership dynamics can be dealt with without global recalculations and by using available mechanisms. 
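A minimal sketch of shared-tree construction around a core (topology and names are invented): each member grafts itself onto the tree along its unicast shortest path towards the core, stopping at the first node that is already on the tree.

```python
import heapq

# Sketch of core-based shared-tree construction: joins travel towards
# the core over unicast routes; paths merge where they meet the tree.
# The graph below is invented for illustration.

def shortest_path(graph, src, dst):
    """Plain Dijkstra returning the node sequence src..dst."""
    dist, pred, pq = {src: 0}, {}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], pred[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    path = [dst]
    while path[-1] != src:
        path.append(pred[path[-1]])
    return path[::-1]

def build_shared_tree(graph, core, members):
    on_tree, edges = {core}, set()
    for m in members:
        path = shortest_path(graph, m, core)
        for u, v in zip(path, path[1:]):
            edges.add(tuple(sorted((u, v))))
            on_tree.add(u)
            if v in on_tree:   # reached the existing tree: stop grafting
                break
    return edges

graph = {"C0": {"X": 1}, "X": {"C0": 1, "R1": 1, "R2": 1},
         "R1": {"X": 1}, "R2": {"X": 1}}
shared = build_shared_tree(graph, core="C0", members=["R1", "R2"])
```

In the example, R2's join stops at X, which R1's path already placed on the tree, so the X-C0 link is shared; only local state changes when members come and go.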
In practice, these multicast algorithms are expected to use the underlying unicast algorithms, but are independent of them. Interoperability with different unicast schemes, coupled with the scalability of shared trees, makes these algorithms ideal for use on very large-scale heterogeneous networks. The fixed points can also be selected so as to facilitate hierarchical routing for very large internetworks, further enhancing scalability. Group dynamics are an obstacle to maintaining optimality, whatever the method of constructing the initial trees. Since repeating all routing computations whenever members join or leave the group may be prohibitively expensive, an alternative is to prune extraneous links when a member leaves the group, and to add the most economical extension path towards a new member, either from a fixed point or from the best attachment point in the existing tree. Rather than making modifications blindly, the most advanced algorithms store some of the state accumulated during tree construction and make only local calculations that still satisfy the requirements of the application. However, simulations have shown that even simple multicast routing using the shortest-path tree is not significantly worse in terms of total tree cost than the optimal solutions or the near-optimal heuristics. For realistic network topologies, Doar and Leslie (9) have found that the cost of a shortest-path tree is less than 50% larger than that of a near-optimal heuristic tree, while path delays for
heuristic trees are 30% to 70% larger than shortest-path delays. Shortest-path trees are easily built and modified using the underlying unicast routing, and they never deteriorate in terms of delay; they simply vary in their inefficiency in terms of total cost. An application prepared to accept this overhead can therefore avoid special multicast-tree construction and maintenance methods by employing the shortest delay paths. A similar cost versus simplicity trade-off is involved when using shared trees for all senders to a group. For shared trees, optimality is hard to achieve and even harder to maintain, as discussed earlier, but a simple approach is to choose the center from among the group members, so that only as many candidate trees as there are group members need to be considered. For these trees, when path delay is optimized, simulations show that delays are close to 20% larger than the shortest paths, while tree cost is about 10% lower than that of shortest-path trees. Furthermore, a single tree constructed using the underlying unicast routing mechanisms minimizes state and maintenance overhead. Unfortunately, apart from their moderate sub-optimality, shared trees also suffer from traffic concentration, since they route data from all senders through the same links. Simulations show that delay-optimal member-centered shared trees can cause maximum link loads up to 30% larger than in a shortest-path tree. For these reasons, recent proposals try to combine shared trees and shortest paths by starting each group connection in shared-tree mode and then switching individual paths to shortest-delay ones upon receiver requests. This approach can also support traditional source-rooted trees. Additional complications arise when links and paths are asymmetric; most of the algorithms and approaches discussed above need modifications, and in some cases they do not apply at all.
An interesting problem arising from resource sharing in multicast is how to split the total distribution costs among the receivers, or how to allocate the savings relative to using separate unicasts. This issue is orthogonal to the question of what the total costs are, which also arises in the unicast case. Whether these costs are used for pricing or for informational purposes, they are a primary incentive to use multicast.

Multicast Routing with Quality-of-Service Constraints

The motivation for routing multicast traffic along trees rather than along arbitrary paths is to minimize transmission cost through link sharing. For continuous media, the volume of data transferred makes this goal even more important. However, for real-time multimedia applications we must take into account two additional factors: delay constraints, particularly for interactive applications, and media heterogeneity. Separate handling of media streams is useful in order to use the most effective coding techniques for each stream. The question then arises whether we should use the same or separate distribution trees for each stream. Considering the load that continuous media put on network links and the interaction between admission control and routing, it seems better to use separate trees. Each media stream could then ask for the appropriate Quality-of-Service (QoS) parameters and get routed accordingly, with receivers choosing to connect to any subset of the trees. On the other hand, the management overhead of multiple trees per source may be prohibitive. In addition, routing each media stream separately exacerbates the inter-media synchronization problem. Turning to delay requirements, if we use delay as the link metric during routing, we can easily see that the shortest-delay tree, made up of the shortest paths from the sender to each receiver, is not the same as the tree of minimal total cost, which maximizes link sharing at the expense of individual path delays. We then have a global tree metric (tree cost) and many individual receiver-oriented metrics (path delays) that are potentially in conflict. Since we cannot hope to optimize on all fronts, we can try to optimize cost subject to the constraint that delay remains tolerable. Interactive applications can be characterized by upper bounds on end-to-end delay and/or limits on jitter. In this sense, it is reasonable to design the tree so as to optimize total cost while keeping individual paths within their respective bounds. Normally, all receivers would be satisfied by the same limits, as these are determined by human perception properties. This new problem is essentially a version of the Steiner tree problem with additional constraints on the paths. Even though it is NP-complete, fast heuristic algorithms that are nearly optimal have been developed. Almost identical formulations are obtained when the constraint is delay jitter or a probabilistic reliability constraint. For example, the latter can be modeled, in the case of independent link losses, by a loss probability assigned to each link. Then, taking logarithms, the reliability metric can be expressed in additive form between a source and each destination by summing the logarithms along the path.
This maps the problem to the previous one, since the goal is again tree cost minimization under a constraint on an additive path-based metric. Finally, a similar formulation can be used when the constraint is link capacity, which must not be exceeded, instead of a delay bound. Again, heuristics exist to solve this variant of the problem.
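The logarithm transformation mentioned above is simple enough to state directly (the link success probabilities below are invented for illustration):

```python
import math

# Sketch of the logarithm trick: multiplicative link success
# probabilities become an additive path metric, so reliability
# constraints fit the same constrained-tree formulation as delay bounds.

links = {("S", "A"): 0.99, ("A", "R"): 0.98}

# Additive cost per link: -log(success probability).
log_cost = {e: -math.log(p) for e, p in links.items()}

path = [("S", "A"), ("A", "R")]
path_cost = sum(log_cost[e] for e in path)

# Recovering the end-to-end success probability from the additive metric:
end_to_end = math.exp(-path_cost)   # equals 0.99 * 0.98
```

Minimizing the additive -log metric along a path is thus equivalent to maximizing the end-to-end success probability, which is what lets reliability constraints reuse the delay-bound formulation.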
FEEDBACK CONTROL AND RELIABLE MULTICAST

Whether a network provides a simple connectionless service or a complicated connection-oriented service for unicast, generalizing it for multicast is not trivial. Flow, congestion, and error control depend on feedback to the sender, triggered by network and receiver events. For simple network services, no such information is provided by the network itself; instead, end-to-end reports must be exchanged. Error control ensures that packets transmitted by the sender are received correctly. Packets may be received corrupted (detected by error-detection codes) or they may be lost (detected by missing sequence numbers). Flow control ensures that the sender does not swamp the receiver with data that cannot be consumed in time. Congestion control deals again with the problem of insufficient resources, but this time at network elements between sender and receiver. Although packets may be dropped at intermediate
nodes, in many networks this loss can be detected only by the receiver, resulting in confusion between errors and congestion. In the unicast case, lost or corrupted packets are retransmitted based on feedback received from the network or the receiver. When packets are multicast, simple feedback schemes face the feedback-implosion problem: all receivers respond with status information, swamping the sender with possibly conflicting reports. Ideally, senders would like to deal with the multicast group as a whole and not on an individual receiver basis, following the host-group model. However, the sender cannot simply treat all receivers identically, because this would lead either to ignoring the retransmission requests of some receivers or to wasting resources by retransmitting to all of them. Since there is no evident solution that satisfies all requirements, several approaches exist, emphasizing different goals. The simplest approach of all is to ignore the problem at the network layer and provide a best-effort connectionless service. Delegating the resolution of transmission problems to the higher layers may be an adequate solution in many cases, since they may have additional information about the application requirements and can thus implement more appropriate mechanisms than what is possible at this layer. A second solution sacrifices the host-group model's simplicity by keeping per-receiver state during multicasts. After transmitting a multicast packet, the sender waits until a stable state is reached before sending the next one. For flow control, this slows down the sender enough so as not to swamp the slowest receiver. For error control, retransmissions are made until all receivers receive the data. This may not be possible even after multiple retransmissions, so the sender may have to take special action, e.g., removing some receivers from the group. Retransmissions may be multicast when many receivers lose a packet, or unicast when few do.
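The retransmission decision in this per-receiver-state scheme can be sketched as follows (the threshold and names are assumptions for illustration, not from any particular protocol):

```python
# Sketch of a sender keeping per-receiver ACK state: retransmit until
# everyone has the packet, multicasting when many receivers are missing
# it and unicasting when few are.

MULTICAST_THRESHOLD = 3  # assumed cut-off, not from the text

def retransmit_plan(group, acked):
    """Decide how to retransmit given which receivers have ACKed."""
    missing = group - acked
    if not missing:
        return ("done", set())
    if len(missing) >= MULTICAST_THRESHOLD:
        return ("multicast", group)      # one copy reaches everyone
    return ("unicast", missing)          # individual copies to the few

group = {"r1", "r2", "r3", "r4"}
action, targets = retransmit_plan(group, acked={"r1"})            # many missing
action2, targets2 = retransmit_plan(group, acked={"r1", "r2", "r3"})  # one missing
```

A real sender would loop, re-invoking such a decision after each retransmission round until the plan is "done" or it gives up and removes unresponsive receivers from the group.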
Since feedback implosion is always a possibility, all such schemes should use negative rather than positive acknowledgments, i.e., send responses when problems occur rather than confirming that packets are received correctly and in time. In a negative acknowledgment scheme, some responsibilities are moved to the receivers, complicating their operation. However, additional opportunities arise, such as multicasting the negative acknowledgments to all receivers after random periods of time to minimize the number of negative acknowledgments returned to the sender. Assigning such responsibilities to receivers can lead to higher throughput. However, the scalability of such schemes is doubtful, even with very reliable links and rare congestion or overflow problems. The problem is that the sender is still the control center, and as the number of group members grows, receivers and network paths become more heterogeneous. With these essentially symmetric schemes, the service provided to a group member is the lowest common denominator, which may be determined by the slowest or most overloaded receiver, or the slowest or most congested network link. Sophisticated approaches exist that follow these general directions, but their complexity and inefficiency make them appropriate only for applications that require very high reliability and uniform member treatment. Note that such reliable
solutions can be implemented as transport services over a simple connectionless network service. A third solution is to distribute the feedback control mechanism over the entire multicast tree, following a hierarchical scheme. A receiver's feedback need not propagate all the way to the sender. Instead, intermediate nodes may either respond directly or merge the feedback from many downstream receivers into a summary message and then recursively propagate it upwards. In this case, feedback implosion is avoided in terms of messages, but the problem of dealing with possibly conflicting requests remains. If the added complexity of making local decisions at each network node (not only group members) is acceptable, we can narrow down the impact of problems to specific parts of the tree, relieving the sender from dealing with individual receivers. A non-hierarchical method for distributed feedback control, targeted at recovery of lost messages, is to let all receivers and senders cooperate in handling losses, thus extending the sender-oriented model. When receivers discover a loss, they multicast a retransmission request, and anyone that has the message can multicast it again. To avoid feedback implosion, these requests and replies are sent after a fixed delay based on the distance from the source of the message or the source of the request, respectively, plus a (bounded) randomized delay. The result is that most duplicate requests and replies are suppressed by the reception of the first multicasts. By varying the random-delay intervals, the desired balance between recovery delay and duplicates can be achieved. In contrast to hierarchical schemes, and because location-independent multicasts are used, only group members participate, but recovery cannot be localized without additional mechanisms. A scalable feedback mechanism that can be used to estimate network conditions without creating implosion problems has been proposed by Bolot et al.
(4): it first estimates the number of receivers in a group and then the average quality of reception (the averaging depends on the application), using probabilistic techniques. This method has been used in applications to let senders detect congestion problems and adapt their output rates (to relieve congestion) and error redundancy factors (to increase the chances of error recovery). Cheung and Ammar (5) proposed a further enhancement to scalable feedback control: splitting the receivers into groups according to their reception status and capabilities, and sending each group only the data it can handle. This avoids the problems created by very slow or very fast machines dragging the whole group towards one extreme. Finally, another approach (mostly orthogonal to the above) tries to minimize the need for feedback by taking preventive rather than corrective action. For error control, this is achieved by using Forward Error Correction (FEC) rather than simple error-detection codes. For flow and congestion control, it is achieved by reserving resources so that both receivers and intermediate network nodes are able to support the sender's data rate. The cost of these techniques is increased overhead and network complexity. FEC imposes processing and transmission overhead, but requires no additional mechanisms in the network. Resource reservation, on the other hand, needs additional control mechanisms to set up and maintain the resources for a session.

Message Ordering and Atomic Multicast

Some applications require delivery of messages in order. In some cases this requirement is expressed across sources, leading to a synchronization problem. The required ordering of messages can be causal or total. Causal ordering is based on the "happens before" relationship and might not be total (i.e., messages can be concurrent). A multicast protocol that ensures reliability and total ordering is called atomic. Such protocols may be necessary for secure distributed computing in the presence of failures and malicious agents, while causal ordering is sufficient to ensure consistency of updates to replicated databases.

MULTICASTING REAL-TIME CONTINUOUS MEDIA

Host and Network Heterogeneity

Several representational formats for the various media types coexist. This is a problem even with traditional data communications, but it is more of an issue with images, audio, and video. Such issues are typically addressed at the presentation layer in the OSI model. Translation between formats can be provided at three points: at the transmitter, at the receiver, or inside the network. In the latter case, format converters must be deployed in the network. This may be appropriate for converting protocols or text encodings between autonomous systems with different standards (placing the converters in the gateways), but it is not effective when the terminals themselves can use different encodings within the same area. Thus, it is more realistic to move the translation services to the hosts. With unicast, translation can be done effectively at either the sender or the receiver. However, heterogeneity problems are aggravated with multicast. For example, translation at the sender requires the stream to be duplicated and translated for each different type of receiver, precluding link sharing over common paths.
This approach also does not scale for large heterogeneous groups, since the sender's resources are limited. Finally, it requires the sender to be aware of the receivers' capabilities, which is incompatible with the host-group model. The sender may use a different multicast group for each encoding to avoid this, but the other problems remain. Translation at the receiver is the most economical and scalable approach in this case, since it fully exploits sharing and moves responsibilities away from the sender. Since continuous media impose heavy demands on both networks and hosts, it is likely that not all receivers will be able to receive all of a sender's traffic. This argues in favor of prioritizing the generated traffic through hierarchical coding. Hierarchical or layered coding techniques decompose a signal into independent or hierarchically dependent components, specific subsets of which can be used to partially reconstruct the signal. In this case receivers can choose to get only those parts of the media that they can use or that are most important to them. Thus, appropriate hierarchical coding combines easily with, and facilitates, translation and reconstruction of the signal at the receivers, according to their needs and abilities. For example, a high-resolution component of a video could be dropped from a congested subnetwork, allowing the low-resolution components to be received and displayed in that subnetwork, without impacting other subnetworks that are not congested.

Resource Reservations

Resource reservations at the network switches are needed if any service guarantees are to be provided. The exact nature of these reservations differs according to the required service guarantees and the approach taken towards satisfying them, so resource reservation along transmission paths can be viewed as a subset of the general switch-state establishment mechanisms. An alternative to reserving resources for an indefinite period of time during connection establishment is to make advance reservations for a future connection with a given lifetime. This allows more sessions to be admitted (due to their deterministic timing) and also permits negative responses to reservation requests to be dealt with more gracefully. The first component of a resource reservation scheme is a specification model for describing flow characteristics, which depends heavily on the model of service guarantees supported by the network. An appropriate protocol is then required to communicate these specifications to the receivers and to reserve resources on the transmission path so that the requested service parameters can be supported. Simple unicast approaches to resource reservation are generally source-based. A set-up message containing the flow specification is sent to the destination, with the intermediate nodes committing adequate resources for the connection, if available. Resources are normally over-allocated early in the path, so that even if switches encountered further along the path are short of resources, the connection can still be set up.
After the set-up message reaches its destination, assuming the connection can be admitted along the path, a response message is returned on the reverse path, allowing the intermediate switches to relax commitments in some cases. Similarly, for multicast, there must be a way for senders to notify receivers of their properties, so that appropriate reservations can be made. In a perfectly homogeneous environment, the reservations will be made once on each outgoing link of a switch for all downstream receivers, so that resource usage can be minimized. Reserved resources can also be shared among data transmitted from multiple senders to the same group (e.g., in applications such as conferencing where the number of simultaneous senders is much smaller than the total). However, receiver and network heterogeneity often prohibits use of this simplistic scheme. One approach is to allocate resources as before during the first message’s trip and then have all receivers send back their relaxation (or rejection) messages. Each switch that acts as a junction will only propagate towards the source the most restrictive relaxation among all those received. However, since paths from such junctions towards receivers may have committed more resources than are now needed, additional passes will be required
for convergence, or resources will be wasted. To handle dynamic groups without constant source intervention, this model can be augmented with receiver-initiated reservations that propagate upstream along an already established distribution tree. An alternative approach is to abandon reservations during the sender's multicast set-up message and instead reserve resources based on the modified specifications with which the receivers respond to the initial message. Again, resource reservations are merged at junction points, but since the (now upstream) requests are expected to be heterogeneous, each junction reserves adequate resources for the most demanding receivers and reuses them to support the less demanding ones. Even though it is still unclear how aggregation of reservations should be performed, this approach has the potential to support both heterogeneous requests and resource conservation, possibly without over-committing resources, thus maximizing the chance that a new session can be admitted. Since this mechanism converges in one pass rather than in multiple passes, the reservation state in the switches can be refreshed periodically, turning the fixed hard state of a static connection into adaptive soft state suitable for a dynamic environment. In this way the mechanism can accommodate both group membership changes and routing modifications without involving the sender. The interaction of routing and resource reservations further complicates matters. Even in the simple case of static routing, success in building a multicast tree depends on the adequacy of resources at each switch. We would like to construct the tree using the switches that pass the admissibility tests, thus favoring the sender-initiated reservation approach. On the other hand, we do not want the construction to fail due to over-allocation, so receiver-initiated reservations are preferable because they may avoid overcommitting resources and converge in one pass.
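The merging behavior at junctions described above can be sketched as a simple fold over the distribution tree (RSVP-style maximum merging is assumed; the topology and numbers are invented):

```python
# Sketch of receiver-initiated reservation merging: each junction
# forwards upstream a single request large enough for its most
# demanding downstream branch, so less demanding receivers reuse it.

def merged_request(node, children, demand):
    """Fold receiver demands upwards: a leaf asks for its own demand,
    a junction for the maximum over its branches."""
    kids = children.get(node, [])
    if not kids:
        return demand[node]
    return max(merged_request(c, children, demand) for c in kids)

# Two receivers behind junction J ask for different bandwidths (Mb/s);
# J reserves once, for the larger request.
children = {"J": ["r1", "r2"]}
demand = {"r1": 1.5, "r2": 0.5}
upstream = merged_request("J", children, demand)   # 1.5
```

With requests of 1.5 and 0.5 Mb/s behind the same junction, a single 1.5 Mb/s reservation travels upstream and the smaller request is absorbed by it, which is how the scheme avoids over-committing resources in one pass.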
Now, however, the tree constructed by the routing algorithm may be inadequate to support the reservations, again rejecting a session that could in principle be set up.
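The junction behavior described above, reserving once per outgoing link for the most demanding downstream receiver and letting less demanding receivers share that reservation, can be sketched as follows. The function name and the bandwidth units are illustrative only, not part of any reservation standard.

```python
def merge_reservations(downstream_requests):
    """At a junction, reserve enough bandwidth for the most demanding
    downstream branch and forward only that single request upstream.
    `downstream_requests` maps branch id -> requested bandwidth (e.g., Mb/s)."""
    if not downstream_requests:
        return 0
    return max(downstream_requests.values())

# Three receivers behind one junction ask for 2, 5, and 1 Mb/s; the
# junction reserves 5 Mb/s on its upstream link, and the two smaller
# requests are served out of the same reservation.
print(merge_reservations({"r1": 2, "r2": 5, "r3": 1}))  # 5
```

The same merge rule applied recursively at every junction is what lets the aggregate request seen near the source stay bounded by the single most demanding receiver rather than growing with group size.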
MULTICASTING ON THE INTERNET

The Internet has been extensively used as a testbed for algorithms and protocols supporting multicast. The extensions of the IP model to support multicast are the provision of special (class D) multicast addresses and IGMP (the Internet Group Management Protocol), which supports the host-group model. Multicast-aware routers periodically multicast, on a well-known address, membership queries on their LANs and gather replies from interested hosts in order to discover which groups have members present in their area. To achieve multicasting in a wide area network, we need a mechanism to keep track of the dynamic membership of each group and another mechanism to route the multicast datagrams from a sender to these group members without unnecessary duplication of traffic. IP multicasting implements these mechanisms in two parts: local mechanisms track group membership and deliver multicasts to the correct hosts within a local network, and global mechanisms
route datagrams between local networks. Distinguishing local from global mechanisms is appropriate for IP since it is an internetworking protocol: each local network can use mechanisms appropriate to its technology, while cooperation among networks is achieved by hiding local differences behind a common interface. In each local network, at least one router acts as a multicast router. A multicast router keeps track of local group membership and is responsible for forwarding multicasts originating from its network towards other networks, and for delivering multicasts originating elsewhere to the local network. Multicast delivery of either externally or locally originated datagrams to local receivers, as well as reception of local multicasts by the router for subsequent propagation to other networks, depend on the underlying network technology. Accordingly, the information needed within the local network regarding group membership in order to achieve local multicast delivery may vary. In contrast, cooperation among multicast routers with the purpose of delivering multicast datagrams between networks is based on a network independent interface between each local network and the outside world. The information needed in order to decide if multicasts should be delivered to target networks is whether at least one group member for a destination group is present there. A multicast router uses the information for each of its attached local networks along with information exchanged with its neighboring routers to support wide area multicasting. Irrespective of the group membership information tracked by a multicast router for local purposes, the interface between local information and global routing is a list of groups present at each attached network. Based on this interface, alternative algorithms can be used for routing among networks, without affecting local mechanisms. 
Conversely, as long as this interface is provided by the local mechanisms, they can be modified without affecting routing. A variety of global, wide-area multicast routing mechanisms exist, the earliest and most widespread being the Distance Vector Multicast Routing Protocol (DVMRP). DVMRP v.1 is a variant of Truncated Reverse Path Broadcasting. Routers construct distribution trees for each source sending to a group, so that datagrams from the source (root) are duplicated only when tree branches diverge towards destination networks (leaves). Each router identifies the first link on the shortest path from itself to the source, i.e., on the shortest reverse path, using a distance vector algorithm. Datagrams arriving from this link are forwarded towards downstream multicast routers, i.e., those routers that depend on the present one for multicasts from that source. A broadcast distribution tree is thus formed, with datagrams reaching all routers. Since each router knows which groups are present in its local networks, redundant datagrams are not forwarded; the tree is truncated. DVMRP v.3 implements the improved Reverse Path Multicasting mechanism, which prunes tree branches leading to networks that have no members and grafts them back when members appear, thus turning the group distribution tree into a true multicast tree. Another protocol, discussed by Moy (12), the Multicast Open Shortest Path First (MOSPF), uses a link state algorithm: routers flood their membership lists among them,
so that each one has complete topological information concerning group membership. Shortest-path multicast distribution trees from a source to all destinations are computed on demand as datagrams arrive. These trees are real multicast ones (i.e., not broadcast), but the flooding algorithm introduces considerable overhead. A radically different proposal for multicast routing is the Core Based Trees (CBT) protocol, which employs a single tree for each group, shared among all sources. The tree is rooted at at least one arbitrarily chosen router, called the core, and extends towards all networks containing group members. It is constructed starting from leaf network routers towards the core as group members appear; thus it is a multicast tree composed of shortest reverse paths. Sending to the group is accomplished by sending towards the core; when the datagram reaches any router on the tree, it is relayed towards tree leaves. Routing is thus a two-stage process which can be sub-optimal. The first stage may propagate datagrams away from their destinations until the tree is reached, thus increasing delay, and in addition, traffic tends to concentrate on the single tree rather than being spread throughout the network. Finally, the Protocol Independent Multicast (PIM) protocol by Deering et al. (7) employs either shared or per-source trees, depending on application requirements. There are two main modes of operation of the PIM protocol, depending on the distribution of the multicast group members throughout the network. In the Sparse Mode (PIM-SM), receivers are assumed to be sparsely distributed throughout the network, and therefore any router with downstream group members must explicitly inform its upstream multicast routers of its interest in joining a multicast group. The resulting shared tree is rooted at a group-specific Rendezvous Point (RP). Routers can later join a source-specific shortest-path distribution tree and prune themselves off the shared tree.
In the Dense Mode (PIM-DM), the opposite assumption is made, i.e., it is assumed that most of the routers are interested in receiving multicast traffic. Therefore, each router forwards multicast data to all of its neighboring routers. Routers not interested in joining the multicast group explicitly prune themselves off the constructed source-rooted multicast tree. Networks supporting IP multicasting may be separated by multicast-unaware routers. To connect such networks, tunnels are used: tunnels are virtual links between two endpoints, composed of a possibly varying sequence of physical links. Multicasts are relayed between routers by encapsulating multicast datagrams within unicast datagrams at the sending end of the tunnel and decapsulating them at the other end. The MBone is a virtual network composed of multicast-aware networks bridged by such tunnels. Multicast routers may choose to forward through the tunnels only datagrams that have Time-to-Live (TTL) values above a threshold, to limit multicast propagation. In contrast to global mechanisms, only a single set of local mechanisms exists. These local multicasting and group management mechanisms were based on shared-medium broadcast networks such as Ethernet, and this is evident in some of the design decisions made. Delivery is straightforward on these LANs, as all hosts can listen to all datagrams and select the correct ones. If a LAN supports multicasting
as a native service, class D IP addresses may be mapped to LAN multicast addresses to filter datagrams in hardware rather than in software. Multicasts with local scope do not require any intervention by the multicast router, while externally originated multicasts are delivered to the LAN by the router. The router also monitors all multicasts so that it can forward to the outside world those for which receivers exist elsewhere. Both unicasts and multicasts are physically broadcast on these LANs, so the only issue for the router when delivering externally originated multicasts is whether at least one member of the destination group exists in the network. The router only has to keep internally a local group membership list, which coincides exactly with the information on which global multicast routing is based. Both versions of the Internet Group Management Protocol (IGMP) provide a mechanism for group management well suited to broadcast LANs, since only group presence or absence is tracked for each group. In IGMP v.1 the multicast router periodically sends a query message to a multicast address to which all local receivers listen. Each host, on reception of the query, schedules a reply, to be sent after a random delay, for each group in which it participates. Replies are sent to the address of the group being reported, so that the first reply will be heard by all group members and will suppress their own pending transmissions. The router monitors all multicast addresses, so that it can update its membership list after receiving each reply. If no reply is received for a previously present group for a number of queries, the group is assumed absent. In steady state, in each query interval the router sends one query and receives one reply for each present group. When a host joins a group, it sends a number of unsolicited reports to reduce join latency for the case where it is the first local member of the group.
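The randomized report suppression just described can be illustrated with a toy simulation. The host names and delay values are hypothetical, and real IGMP timers and message formats are more involved; the point is only that one report per group suffices, regardless of group size.

```python
import random

def reports_after_query(member_delays):
    """IGMP-style report suppression for one group: each member
    schedules its report after a random delay; the earliest report is
    multicast to the group address, so members still waiting cancel
    theirs.  Returns the hosts that actually transmit."""
    if not member_delays:
        return []
    first = min(member_delays, key=member_delays.get)
    return [first]

# However many members the group has, the router hears one report:
delays = {h: random.uniform(0, 10) for h in ("a", "b", "c", "d")}
print(len(reports_after_query(delays)))  # 1
```

This suppression is why the router's steady-state load is one query and one reply per present group per interval, independent of how many hosts belong to each group.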
No explicit action is required when a host leaves a group, as group presence times out when appropriate. In IGMP v.2 a host must send a leave message when abandoning a group, but only if it was the last host to send a report for that group. However, since this last report may have suppressed other reports, the router must explicitly probe for group members by sending a group-specific query to trigger membership reports for the group in question. It can only assume the group absent if no reports arrive after a number of queries. All IGMP v.2 queries include a time interval within which replies must be sent: general queries may use a long interval to avoid concentrating reports for all groups, while group-specific queries may use a short interval to speed up group status detection. The time between the last host leaving a group and the router stopping multicast delivery for that group is called the leave latency.

Other Internet Protocols and Services

The Resource ReSerVation Protocol (RSVP), designed by Zhang et al. (21), acts as an overlay on routing protocols, supporting receiver-initiated resource reservations over any available multicast routing scheme. In addition, RSVP supports dynamic reservation modifications and network reconfigurations. A transport protocol supporting continuous media has been developed by Schulzrinne et al. (18): RTP (the Real-time Transport Protocol). It provides support for timing information, packet sequence numbers, and option specification, without imposing any additional error control or sequencing mechanisms. An application can use this basic framework adapted to its requirements to add whatever mechanisms seem appropriate, such as error control based on loss detection using sequence numbers, or intra-media and inter-media synchronization based on timing information. A companion control protocol, RTCP (the Real Time Control Protocol), can be used for gathering feedback from the receivers, again according to the application's needs. For example, an application can use RTP for transport and RTCP adapted for scalable feedback control, along with appropriate FEC and adaptation mechanisms. Another relevant protocol is SDP (the Session Description Protocol), which provides a mechanism for applications to learn what streams are carried in the network, describing them in adequate detail so that anyone interested can launch the appropriate receiver applications.
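As an example of the application-added mechanisms RTP permits, loss detection from sequence numbers might look like the following sketch. It assumes 16-bit RTP-style sequence numbers that wrap around, and it ignores reordering for simplicity.

```python
def detect_losses(seq_numbers, mod=2**16):
    """Infer packet loss from RTP-style sequence numbers (16-bit,
    wrapping).  Counts the packets missing between consecutively
    received ones; reordered packets are not handled here."""
    lost = 0
    for prev, cur in zip(seq_numbers, seq_numbers[1:]):
        gap = (cur - prev) % mod  # modular arithmetic handles wraparound
        if gap > 1:
            lost += gap - 1
    return lost

# Sequence wraps from 65535 to 0; packets 65535 and 0 never arrived:
print(detect_losses([65533, 65534, 1]))  # 2
```

A receiver can feed such counts back to the sender via RTCP reports, which is exactly the kind of scalable feedback loop the text mentions.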
MULTICAST IN ATM AND OTHER TECHNOLOGIES

ATM technology supports point-to-multipoint VCs, with the source responsible for setting up the VC. Two basic models have been proposed to support multicast in ATM. The first is based on a mesh of point-to-multipoint VCs from each source to the destinations. The second resembles the center-based trees and uses a multicast server (MCS) with point-to-multipoint VCs to all destinations; sources then establish and use point-to-point VCs to forward their data to the MCS, which then forwards them to the destinations. Both models have advantages and disadvantages. Group management and dynamic join-leave is probably more complex and slower with the mesh, but throughput and delay should be better. Also, the MCS is a single point of failure and concentrates traffic, not only at the particular server, but also in the subnetwork surrounding it. The LAN emulation service on top of ATM offers a solution very similar to that of the MCS for multicast and broadcast. A third, multipoint-to-multipoint solution has also been suggested, based on the shared tree approach and an access control protocol that allows sources to alternate in using the common infrastructure. However, this last proposal is less compatible with the various ATM protocols and techniques currently adopted. Finally, an important problem for ATM is the mapping of high-level multicast addresses (or group names) to specific destination end-points and point-to-multipoint VCs. A key solution is based on a Multicast Address Resolution Server (MARS), based on the notion of an ATM ARP server. This service is obviously necessary for a full implementation of IP over ATM. Similarly, when multiple hosts and applications are communicating, as in a multi-party conference, there is usually a need to mediate transmission and reception of data among participants at the application layer.
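The join-leave complexity trade-off between the mesh and MCS models can be made concrete with a rough count of the VC leaf endpoints that must be set up and maintained. This is a back-of-envelope sketch, not a statement about any particular ATM signaling implementation.

```python
def vc_endpoints(num_sources, num_receivers):
    """Rough count of VC leaf endpoints to maintain under each model.
    Mesh: every source keeps a point-to-multipoint VC with one leaf
    per receiver.  MCS: each source keeps one point-to-point VC into
    the server, which keeps a single point-to-multipoint VC out."""
    mesh = num_sources * num_receivers
    mcs = num_sources + num_receivers
    return mesh, mcs

# With 10 senders and 50 receivers, a membership change touches every
# source's VC under the mesh, but only the server's VC under an MCS:
print(vc_endpoints(10, 50))  # (500, 60)
```

The quadratic-versus-linear growth is why dynamic join-leave is slower with the mesh, while the extra hop through the server is why the MCS sacrifices some throughput and delay.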
As the specific needs of each application and conference setting may vary, one way to support multiple control policies is to use either a specialized server or a logical conference control channel as a shared mechanism through which control messages are exchanged. Floor control and session management applications can then employ this mechanism for
their needs. Many other applications and networking technologies have also had to confront the multicast problem. For example, multi-hop lightwave technologies, which assign multiple wavelengths to source-destination pairs in the case of unicast, have to modify their architectures for multicast. For Mobile IP and IP Multicast, even though straightforward solutions do exist, they suffer from various problems that are still being investigated. Finally, in some cases multicast is proposed as a solution to other problems. For example, in order to minimize delay and loss during hand-offs in mobile packet communications, it has been proposed to multicast the packets to all base stations near the mobile so that the information will be immediately available in case of hand-off.

Multicast Security Issues

The traditional security issues for communications, which are typically thought of in the context of unicast (i.e., point-to-point communications), also exist for multicast. They typically relate to data confidentiality and integrity and service availability. However, multicast amplifies the existing problems and poses new ones. In addition, straightforward extensions of unicast either do not apply or are uninteresting. For example, using O(n²) independent end-to-end unicast secure channels between all pairs of participants can provide secure group communication. However, no benefits from multicasting can be drawn in this case; in particular, neither efficient transport nor flexible membership management is obtained. With the original Internet protocols, session membership is typically not known (except perhaps in an indirect way at the application level) and cannot be controlled. This makes the problems of eavesdropping, unauthorized injection of messages, and even service denial to authorized members even more important. However, research into these issues is just beginning, and experience with real systems is almost non-existent.
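The O(n²) observation above is easy to quantify: securing a group with independent pairwise channels requires one secure channel per pair of participants.

```python
def pairwise_channels(n):
    """Number of independent end-to-end secure channels needed to
    cover every pair in an n-member group: n*(n-1)/2, i.e., O(n^2)."""
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(n, pairwise_channels(n))
# 45, 4950, and 499500 channels, respectively; a single shared group
# key (with some rekeying scheme) scales far better but raises the
# membership-management problems discussed in the text.
```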
Some obvious but central requirements for approaches to secure multicast, in particular in the context of global networks such as the Internet and ATM, are: (1) compatibility with existing network protocols, (2) scalability, and (3) transparency to higher level services and applications.

APPLICATION-LEVEL MULTICAST

The scalability problems faced in the deployment of IP multicast in wide area networks have led to an interest in application-level multicast over peer-to-peer overlay networks. As suggested by Ratnasamy et al. (26), the main target is to eliminate the need for a multicast routing algorithm to construct distribution trees. To this end, peer-to-peer overlay networks provide a scalable, fault-tolerant, self-organizing routing substrate. Application-level multicast aims at leveraging these underlying overlay routing properties. Peer-to-peer overlay routing is performed over an abstract namespace. A randomly chosen portion of the namespace is assigned to each participating node. Each node of the overlay network holds routing information only for a small subset of nodes, those whose namespace portions neighbor its own in the global namespace. In this way, routing information is distributed among all nodes, yielding a scalable routing infrastructure. For logical namespace proximity to reflect actual networking proximity (in terms, for example, of round-trip time), among all neighboring nodes in the logical namespace, a node holds routing information only for those closest to it in the actual networking topology. Rowstron et al. (23) and Zhao et al. (24) show that overlay routes are approximately 30% longer than the routes followed in the case of direct IP routing based on complete routing tables. This is considered an acceptable cost in view of the fact that each node holds routing information only for a small subset of the overall topology. Messages are destined to points in the namespace (also termed keys). The overlay network routes a message to the node that has been assigned the portion of the namespace containing the destination point, i.e., the owner of the key. This is accomplished by each node forwarding the message to the node whose namespace portion is closest to the key. The average number of overlay hops required for a message to reach its destination is a logarithmic function of the number of nodes constituting the overlay network. Two main approaches have been followed towards application-level multicast. The first aims at building the multicast distribution tree on top of the overlay network. Zhuang et al. (25) propose the creation of source-specific trees on top of a Tapestry (24) overlay network. The construction of the multicast tree follows the hierarchical character of the underlying routing mechanism, i.e., closely neighboring nodes in the namespace belong to the same tree level. However, each join message reaches the source node, so that the construction of the multicast tree is coordinated there, weakening the scalability of the proposed scheme. Castro et al. (27) overcome this limitation by handling group joins locally.
They propose the creation of source-rooted trees, or trees rooted at a randomly chosen rendezvous node, on top of a Pastry (23) overlay network. In this case, a join message is not propagated all the way to the root of the tree, but is suppressed by an intermediate node that has already joined the group. Both approaches result in the creation of well-balanced source-specific trees due to the randomization of the overlay addresses. However, these trees may contain nodes not belonging to the multicast group. The second approach, followed by Ratnasamy et al. (26), aims at the creation of an overlay network for each multicast group, avoiding the construction of a multicast tree. Multicast data is then broadcast in the overlay network. Contrary to the first approach, there is no restriction on the number of sources, i.e., multiple nodes may broadcast in the overlay network, resulting in a multipoint-to-multipoint communication model. Moreover, the dissemination of data is performed only by nodes actually belonging to the multicast group. Nevertheless, Castro et al. (28) show that the tree-building approach achieves lower delay and signaling overhead than broadcasting over per-group overlays, due to the significant delay overhead incurred by the routing state establishment during the construction of the overlay network in the latter case.
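The key-based overlay routing that both approaches build on can be illustrated with a toy greedy model over a numeric namespace. Real substrates such as Pastry and Tapestry instead match key digits prefix by prefix, which is what bounds the hop count by O(log N); the node identifiers and routing tables below are invented for illustration.

```python
def route(key, node, routing_tables):
    """Toy greedy overlay routing: forward to the known neighbor that
    is numerically closest to the key, stopping when no neighbor is
    strictly closer than the current node (which then owns the key)."""
    path = [node]
    while True:
        best = min(routing_tables[node], key=lambda n: abs(n - key))
        if abs(best - key) >= abs(node - key):
            return path  # current node is closest: it owns the key
        node = best
        path.append(node)

# Invented 5-node overlay; each node knows only a few others:
tables = {0: [4, 8], 4: [0, 6, 8], 6: [4, 7], 7: [6, 8], 8: [0, 4, 7]}
print(route(7, 0, tables))  # [0, 8, 7]
```

Because every hop strictly decreases the distance to the key, the message converges on the key's owner even though no node holds a complete routing table.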
Streaming Applications

Streaming applications, such as live audio and video delivery, pose additional requirements for efficient point-to-multipoint data distribution. The bandwidth requirements of these applications are significant, imposing the need for load-balancing measures during multicast tree construction. The end-to-end delay from the source to the receivers may be high if the content traverses long paths of nodes until it reaches the leaf nodes of the multicast tree. Hence, the need for a small tree height is apparent. Furthermore, traffic bottlenecks may appear in the tree topology if non-leaf nodes are required to forward the multicast content to a large number of descendant nodes. In effect, the fan-out degree of each node in the tree must be bounded. Overall, bandwidth availability is an essential criterion for the construction of efficient multicast trees and the adequate placement of each node in the hierarchical topology. A significant approach addressing these issues, followed by Castro et al. (30) and Padmanabhan et al. (31), is based on the creation of multiple multicast trees per group. The multicast content is encoded in several separately decodable streams (stripes) of lower quality, and each stream is transmitted over a separate tree. All trees share the same root (source) and leaf nodes, but consist of disjoint sets of intermediate-level nodes. The main goal is to distribute the forwarding load among the participating nodes, and this is achieved by each interior node bearing the burden of forwarding a single, lighter stream. Each node participates in all trees, either as an interior or a leaf node, in order to receive all stripes and reconstruct the original content. It is noted that the reception of all stripes is necessary only for the reconstruction of the content in its initial quality. The reception of fewer stripes typically still results in the reconstruction of the content, but at a lower quality.
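The role assignment underlying the multiple-tree idea, with every node forwarding in only one stripe, can be sketched with a simple round-robin assignment. This is a toy model of the load-spreading property, not the actual tree-construction algorithm of the cited systems.

```python
def assign_interior_roles(nodes, num_stripes):
    """Round-robin sketch of stripe load balancing: node i acts as an
    interior (forwarding) node only in stripe i mod k, and receives as
    a leaf in all other stripes, so each node forwards at most one
    stripe's worth of bandwidth."""
    return {n: i % num_stripes for i, n in enumerate(nodes)}

roles = assign_interior_roles(["a", "b", "c", "d", "e", "f"], 3)
print(roles)  # {'a': 0, 'b': 1, 'c': 2, 'd': 0, 'e': 1, 'f': 2}
# Every stripe has some forwarders, and no node forwards more than one.
```

The failure-robustness claim follows from the same assignment: losing one node removes forwarders from only one stripe, so downstream nodes lose quality rather than the whole stream.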
In addition to the balancing of the forwarding load, this approach may also achieve robustness to node failures. A node failure will only result in the loss of a single stripe by the leaf nodes served by the failing node. In effect, the leaf nodes will only experience a degradation of the quality of the received content rather than an interruption of the streaming service. In addition to the aforementioned approach, Tran et al. (29) propose the construction of the multicast tree based on a multilayer hierarchy of bounded-size clusters of nodes. In each cluster, a head node is responsible for monitoring membership in the cluster, and an associate head node is responsible for transmitting the content to cluster members. Thus, membership management is distributed, relieving the source node of the burden of overall tree management. The height of the resulting tree is at most logarithmic in the size of the node population, and the fan-out degree of each node is bounded by a constant.

BIBLIOGRAPHY

1. M. Ahamad, Multicast Communication in Distributed Systems. IEEE Computer Society Press, Los Alamitos, CA, 1990.
2. M. H. Ammar, G. C. Polyzos, and S. Tripathi (eds.), Special Issue on "Network Support for Multipoint Communication," IEEE Journal on Selected Areas in Communications, 15: 273–588, 1997.
3. K. P. Birman, A. Schiper, and P. Stephenson, Lightweight Causal and Atomic Group Multicast. National Aeronautics and Space Administration, Washington, D.C., 1991.
4. J. C. Bolot, T. Turletti, and I. Wakeman, Scalable feedback control for multicast video distribution in the Internet. Computer Communications Review, 24(4): 58–67, 1994.
5. S. Y. Cheung and M. H. Ammar, Using destination set grouping to improve the performance of window-controlled multipoint connections. Computer Communications, 19: 723–736, 1996.
6. S. E. Deering and D. R. Cheriton, Multicast routing in datagram internetworks and extended LANs. ACM Transactions on Computer Systems, 8(2): 85–110, 1990.
7. S. Deering, D. Estrin, D. Farinacci, V. Jacobson, C. Liu, and L. Wei, The PIM architecture for wide-area multicast routing. IEEE/ACM Transactions on Networking, 4: 153–162, 1996.
8. C. Diot, W. Dabbous, and J. Crowcroft, Multipoint communication: a survey of protocols, functions, and mechanisms. IEEE Journal on Selected Areas in Communications, 15: 277–290, 1997.
9. M. Doar and I. Leslie, How bad is naive multicast routing? Proc. IEEE INFOCOM '93: 82–89, 1993.
10. M. R. Garey, R. L. Graham, and D. S. Johnson, The complexity of computing Steiner minimal trees. SIAM Journal on Applied Mathematics, 34: 477–495, 1978.
11. S. L. Hakimi, Steiner's problem in graphs and its implications. Networks, 1: 113–133, 1971.
12. J. Moy, Multicast routing extensions for OSPF. Communications of the ACM, 37(8): 61–66, 1994.
13. B. K. Kabada and J. M. Jaffe, Routing to multiple destinations in computer networks. IEEE Transactions on Communications, 31: 343–351, 1983.
14. V. P. Kompella, J. C. Pasquale, and G. C. Polyzos, Multicast routing for multimedia communication. IEEE/ACM Transactions on Networking, 1(3): 286–292, 1993.
15. L. Kou, G. Markowsky, and L. Berman, A fast algorithm for Steiner trees. Acta Informatica, 15: 141–145, 1981.
16. J. C. Pasquale, G. C. Polyzos, E. W. Anderson, and V. P. Kompella, The multimedia multicast channel. Internetworking: Research and Experience, 5(4): 151–162, 1994.
17. J. C. Pasquale, G. C. Polyzos, and G. Xylomenos, The multimedia multicast problem. Multimedia Systems, 6(1): 43–59, 1998.
18. H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, RTP: A Transport Protocol for Real-Time Applications. Internet Request For Comments, RFC 1889, 1996.
19. D. Towsley, J. Kurose, and S. Pingali, A comparison of sender-initiated and receiver-initiated reliable multicast protocols. IEEE Journal on Selected Areas in Communications, 15: 398–406, 1997.
20. G. Xylomenos and G. C. Polyzos, IP multicast for mobile hosts. IEEE Communications Magazine, 35(1): 54–58, 1997.
21. L. Zhang, S. Deering, D. Estrin, S. Shenker, and D. Zappala, RSVP: a new resource ReSerVation Protocol. IEEE Network, 7(5): 8–18, 1993.
22. W. D. Zong, Y. Onozato, and J. Kaniyil, A copy network with shared buffers for large-scale multicast ATM switching. IEEE/ACM Transactions on Networking, 1(2): 157–165, 1993.
23. A. Rowstron and P. Druschel, Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. Proc. IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany, pp. 329–350, November 2001.
24. B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz, Tapestry: A resilient global-scale overlay for service deployment. IEEE Journal on Selected Areas in Communications, 22(1), 2004.
25. S. Zhuang, B. Zhao, A. Joseph, R. Katz, and J. Kubiatowicz, Bayeux: An architecture for scalable and fault-tolerant wide-area data dissemination. Proc. NOSSDAV, pp. 11–20, 2001.
26. S. Ratnasamy, M. Handley, R. Karp, and S. Shenker, Application-level multicast using content-addressable networks. Proc. International Workshop on Networked Group Communication (NGC), November 2001.
27. M. Castro, P. Druschel, A.-M. Kermarrec, and A. Rowstron, Scribe: A large-scale and decentralized application-level multicast infrastructure. IEEE Journal on Selected Areas in Communications, 20(8), 2002.
28. M. Castro, M. B. Jones, A.-M. Kermarrec, A. Rowstron, M. Theimer, H. Wang, and A. Wolman, An evaluation of scalable application-level multicast built using peer-to-peer overlays. Proc. IEEE INFOCOM, 2: 1510–1520, 2003.
29. D. A. Tran, K. Hua, and T. Do, A peer-to-peer architecture for media streaming. IEEE Journal on Selected Areas in Communications, 22(1): 121–133, 2003.
30. M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron, and A. Singh, SplitStream: High-bandwidth multicast in cooperative environments. Proc. ACM Symposium on Operating Systems Principles, 2003.
31. V. N. Padmanabhan, J. J. Wang, P. A. Chou, and K. Sripanidkulchai, Distributing streaming media content using cooperative networking. Proc. NOSSDAV, pp. 177–186, 2002.
GEORGE C. POLYZOS
K. KATSAROS
Athens University of Economics and Business, Athens, Greece
11
Wiley Encyclopedia of Electrical and Electronics Engineering

Multiple Access Schemes
Moshe Sidi, Technion—Israel Institute of Technology, Haifa, Israel
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5317
Article Online Posting Date: December 27, 1999
Abstract: The sections in this article are Basic Model, Conflict-Free Schemes, Contention-Based Schemes, and Collision Resolution Schemes.
MULTIPLE ACCESS SCHEMES
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

Communication channels are major components of computer communication networks. They provide the physical media over which signals representing data are transmitted from one node of the network to another. Communication channels can be classified into two main categories: point-to-point channels and shared channels. Typically, the backbone
of wide area networks (WAN) consists of point-to-point channels, whereas local area networks (LAN) use shared channels. Point-to-point channels are dedicated to connecting a pair of nodes of the network. They are usually used in fixed-topology networks, and their cost depends on many parameters such as distance and bandwidth. An important characteristic of these channels is that nodes do not interfere with each other; in other words, transmissions between a pair of nodes have no effect on the transmissions between another pair of nodes, even if a node is common to the two pairs. Shared channels are used when point-to-point channels are not economical or not available, or when dynamic topologies are preferable. In a shared channel, also called a broadcast channel, several nodes can potentially transmit and/or receive messages at the same time. Shared channels appear naturally in radio networks, satellite networks, and some local area networks (e.g., Ethernet). Their deployment is usually easier than that of point-to-point channels. An important characteristic of shared channels is that transmissions of different nodes interfere with each other; specifically, one transmission coinciding in time with another may cause neither to be received. This means that the success of a transmission between a pair of nodes is no longer independent of other transmissions. To have successful transmissions in shared channels, interference must be avoided or at least controlled. The channel allocation among the competing nodes is critical for proper operation of the network. This article focuses on access schemes to such channels, known as multiple access schemes. These schemes are nothing more than channel allocation rules that determine who goes next on the channel, aiming at some desirable network performance characteristics. Multiple access schemes belong to a sublayer of the data link layer called the medium access control (MAC) layer, which is especially important in LANs.
Multiple access schemes are natural not only in communication systems but also in many other systems, such as computer systems, storage facilities, or servers of any kind, where resources are shared by a number of nodes. In this article we mainly address shared communication channels. One way to classify multiple access schemes is according to the level of contention that is allowed among the nodes of the network. On the one hand, there are the conflict-free schemes that ensure that each transmission is successful, namely, it will not be interfered with by any other transmission. On the other hand, there are the contention-based schemes that do not guarantee that a transmission will be successful, namely, it might be interfered with by another transmission. Conflict-free transmissions can be achieved by allocating the shared channel in an adaptive or nonadaptive (static) manner. Two common static allocations are time division multiple access (TDMA), where the entire available bandwidth is allocated to a single node for a fraction of the time, and frequency division multiple access (FDMA), where a fraction of the available bandwidth is allocated to a single node for all of the time. Adaptive allocations are usually based on demands, so that nodes that are idle use only a little of the shared channel, leaving the majority of their share to other, more active nodes. Adaptive allocations can be done by various reservation schemes using either central or distributed network control. Polling algorithms illustrate central control,
whereas ring networks generally use distributed control based on token-passing mechanisms. It is important to note that idle nodes consume their portion of the shared channel when conflict-free schemes are used. The aggregate channel portion of idle nodes becomes significant when the number of potential nodes in the system is very large, to the extent that conflict-free schemes might become impractical. When contention-based schemes are used, it is essential to devise algorithms that resolve conflicts when they occur, so that messages are eventually transmitted successfully. Conflict-resolution algorithms can be either adaptive or nonadaptive (static). Static resolution can be deterministic, using some fixed priority that is assigned to the nodes, or it can be probabilistic, when the transmission schedule for interfered nodes is chosen from a fixed distribution, as is done in Aloha-type schemes and the various versions of carrier-sensing multiple access (CSMA) schemes. Adaptive resolutions attempt to track the system evolution and exploit the available information. For example, resolution can be based on time of arrival, giving highest (or lowest) priority to the oldest message in the system, as is done in some tree-based algorithms. Alternatively, resolution can be probabilistic but such that the statistics change dynamically according to the extent of the interference. This category includes estimating the multiplicity of the interfering nodes and the exponential back-off scheme of the Ethernet standard. Note that when the population of potential nodes in the system grows beyond a certain point, conflict-free schemes become impractical, and contention-based protocols are then the only possible solution. The goal of this article is to survey typical examples of multiple access schemes. These examples include TDMA, FDMA, Aloha, polling, and tree-based schemes.
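The exponential back-off rule mentioned above can be sketched in a few lines. The truncated doubling below follows the classic Ethernet convention of at most 10 doublings; the function name and parameters are illustrative, not taken from any standard text.

```python
import random

def backoff_slots(collisions, max_doublings=10, rng=None):
    """Truncated binary exponential back-off: after the k-th collision,
    wait a whole number of slots drawn uniformly from 0 .. 2^min(k, 10) - 1."""
    rng = rng or random
    k = min(collisions, max_doublings)
    return rng.randint(0, 2 ** k - 1)

# After each successive collision the expected wait roughly doubles,
# adapting the retransmission rate to the unknown number of colliders.
waits = [backoff_slots(k) for k in range(1, 6)]
```

Because the mean wait doubles with every collision, the aggregate retransmission rate adapts downward as the (unobserved) number of colliding nodes grows, which is exactly the dynamic behavior described above.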
The space allocated to the topic of multiple access schemes in the encyclopedia (which is yet another shared resource) is just too small to include all the ingenious multiple access schemes that have been designed by researchers over the years. Interested readers should refer to books on the subject (e.g., Rom and Sidi (23) and Hammond and O'Reilly (22)) and to the international journals that have published papers on the subject.
BASIC MODEL

When multiple access schemes are devised, a collection of nodes that communicate with each other or with a central node via a single shared channel is considered. In general, the ability of a node to hear the transmission of another node depends on the transmission power used, on the distance between the two nodes, and on the sensitivity of the receiver at the receiving node. We assume single-hop topologies in which all nodes hear one another, and whenever messages are transmitted successfully they arrive at their destinations. The shared channel is the medium through which data are transferred from their sources to their destinations. The total transmission rate possible in the channel is C bits/s. We consider an errorless collision channel. A collision is a situation in which, at the receiver, two or more transmissions overlap in time, wholly or partially. A collision channel is one in which none of the colliding transmissions is received correctly; they must be retransmitted until they are received correctly. We assume that nodes can detect collisions. The channel is errorless in the sense that a single transmission heard at a node
is always received correctly. Other possible channels include the noisy channel, in which errors may occur even if only a single transmission is heard at a node; furthermore, the channel may be such that errors between successive transmissions are not independent. Another channel type is the capture channel, in which one of the colliding transmissions captures the receiver and can be received correctly. Yet another case is a channel in which coding is used, so that even if transmissions collide the receiver can still decode some or all of the transmitted information. The basic unit of data generated by a node is a message. It is possible, though, that because of its length a message cannot be transmitted in a single transmission and must therefore be broken into smaller units called packets, each of which can be transmitted in a single channel access. A message consists of an integral number of packets, although the number of packets in a message can vary randomly. Packet size is measured by the time required to transmit the packet after access to the channel has been granted. Typically, all packets are of equal size, say L bits. The number of nodes that share the channel is denoted by M. When M becomes very large, the population of nodes is referred to as an infinite population. Only contention-based schemes can cope with an infinite node population. The aggregate arrival process of new packets is assumed to be Poisson with rate Λ packets/s. When the population is finite, the arrival rate to each node is λ = Λ/M packets/s. Nodes are generally not assumed to be synchronized and are capable of accessing and transmitting their messages on the shared channel at any time. Another important class of systems is that of slotted systems, in which there is a global clock that marks discrete intervals of time called slots, whose length is usually the time required to transmit a packet (i.e., T = L/C s). In these systems, transmissions of packets start only at slot starts.
The slot length is therefore T = L/C s. Other operations, such as determining activities on the channel, can be done at any time. In some models, nodes can tell whether the shared channel is in use before trying to use it. If the channel is sensed as busy, no node will attempt to use it until it goes idle, in order to reduce interference. Naturally, additional hardware is required at each node to implement the sensing ability. In other models, nodes cannot sense the channel before trying to use it. They just go ahead and transmit according to their access scheme. Only later can they determine, via the feedback mechanism, whether or not the transmission was successful. Feedback in general is the information available to the nodes regarding activities on the shared channel at prior times. This information can be obtained by listening to the channel or by explicit acknowledgment messages sent by the receiving node. For every scheme, there exist some instants of time (typically slot boundaries or ends of transmissions) at which feedback information is available. Common feedback information indicates whether a message was successfully transmitted, a collision took place, or the channel was idle. Feedback mechanisms do not consume the shared channel resources because they usually use a different channel or are able to determine the feedback locally. Other feedback variations include indication of the exact or the estimated number of colliding transmissions, or uncertain feedback (e.g., in the case of a noisy channel).
The important performance measures of multiple access schemes are their throughput and delay. The throughput of the channel is the aggregate average amount of data transported successfully through the channel in a unit of time. The throughput equals the fraction of time in which the channel is engaged in the successful transmission of node data; it will be denoted by S, and obviously S ≤ 1. In conflict-free access schemes, the throughput is also the total or offered load on the shared channel. However, in contention-based access schemes, the offered load on the shared channel includes transmissions of new packets as well as retransmissions of packets that collide with each other. The offered load is denoted by g (measured in packets per second) and, obviously, g ≥ Λ. The normalized offered load (i.e., the rate at which packets are transmitted on the channel, per packet transmission time) is denoted by G = g·T and, obviously, G ≥ S. Delay is the time from the moment a message is generated until it arrives successfully across the shared channel. Here one must distinguish between the node and the system measures, because the average delay measured for the entire system does not necessarily reflect the average delay experienced by any of the nodes. In "fair" or homogeneous systems, we expect these to be almost identical. The average delay is denoted by D seconds, and its normalized version, measured in units of packet transmission times, is denoted by D̂ (i.e., D̂ = D/T = D·C/L). Another important performance criterion is system stability. Unfortunately, some schemes' characteristics may be such that some message-generation rates, even smaller than the maximal transmission rate in the channel, cannot be sustained by the system for a long time. Evaluation of those input rates for which the system remains stable is therefore essential.
Ideal Access Scheme

Before introducing the various multiple access schemes, let us consider an ideal scheme to use the shared channel. Ideally, transfer of the channel from one node to another can be accomplished instantaneously, without cost. Furthermore, whenever a node has data to transmit, some ingenious central controller knows this instantaneously and assigns the channel to that node in case the channel is idle. If the channel is busy, packets that arrive at the nodes are queued. For our purposes, the order in which packets of different nodes are served is not important. The performance of the ideal scheme serves as a bound to what can be expected from any practical access scheme. The way the ideal scheme operates is identical to the operation of a single queue that is served by a single server, because packets do not interfere and because no time is wasted in transferring the channel use from one node to another. Because arrivals of new packets are according to a Poisson process and time is slotted, the performance of the ideal scheme is that of an M/D/1 queue. The throughput of an M/D/1 queue is just the utilization factor of the server as long as S < 1 (the stability condition), and it equals the offered load; in other words,

S = G = ΛT = λ·M·L/C    (1)
The normalized average delay of an M/D/1 queue is given by (as long as S < 1)

D̂ = 1 + S/(2(1 − S)) = (2 − S)/(2(1 − S))    (2)
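Equations (1) and (2) are easy to evaluate numerically. The sketch below does so; the parameter values (λ = 10 packets/s per node, M = 50 nodes, L = 1000 bits, C = 1 Mbit/s) are arbitrary examples, not values taken from the text.

```python
def ideal_throughput(lam, M, L, C):
    """Eq. (1): S = G = Lambda*T, with Lambda = lam*M packets/s and T = L/C s."""
    return lam * M * L / C

def ideal_delay(S):
    """Eq. (2): normalized M/D/1 delay, valid only for S < 1."""
    assert 0 <= S < 1, "stable operation requires S < 1"
    return 1 + S / (2 * (1 - S))

S = ideal_throughput(lam=10.0, M=50, L=1000, C=1_000_000)  # S = 0.5
D_hat = ideal_delay(S)                                     # 1 + 0.5/1 = 1.5
```

Note how D̂ blows up as S approaches 1; this 1/(1 − S) behavior dominates every scheme analyzed in the sequel.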
The unit term in the expression for D̂ is the normalized transmission time of a packet, whereas S/[2(1 − S)] is the normalized waiting time of a packet until being transmitted. No access scheme can achieve throughput higher than the S given in Eq. (1), and no access scheme can provide normalized average delays lower than the D̂ given in Eq. (2). These quantities will serve as yardsticks in the sequel.

CONFLICT-FREE SCHEMES

Conflict-free schemes are designed to ensure that a transmission, whenever made, is not interfered with by any other transmission and is therefore successful. This is achieved by allocating the channel to the nodes without any overlap between the portions of the channel allocated to different nodes. An important advantage of conflict-free access protocols is the ability to ensure fairness among nodes and the ability to control the packet delay, a feature that may be essential in real-time applications. We consider both fixed-assignment schemes and dynamic schemes that guarantee no conflicts. In fixed-assignment schemes the channel allocation is predetermined (typically at network design time) and is independent of the demands of the nodes in the network. The most well-known fixed-assignment schemes are frequency division multiple access and time division multiple access. For both FDMA and TDMA, no overhead, in the form of control messages, is incurred. However, because of the static and fixed assignment, parts of the channel might be idle even though some nodes have data to transmit. Dynamic channel allocation schemes attempt to overcome this drawback by changing the channel allocation based on the current demands of the nodes. These schemes use some kind of reservation strategy based on either centralized or distributed polling.

Fixed Assignment

Both FDMA and TDMA are among the oldest and best understood access schemes, widely used in practice. They are the most common implementations of fixed-assignment schemes.
With FDMA the entire available frequency band is divided into bands, each of which is used by a single node. Every node is therefore equipped with a transmitter for a given, predetermined frequency band and a receiver for each band (which can be implemented as a single receiver for the entire range with a bank of band-pass filters for the individual bands). With TDMA the time axis is divided into time slots, preassigned to the different nodes. Every node is allowed to transmit freely during the slot assigned to it; that is, during the assigned slot the entire shared channel is devoted to that node. The slot assignments follow a predetermined pattern that repeats itself periodically; each such period is called a frame. In most TDMA implementations, every node has exactly one slot in every frame. The main advantage of both FDMA and TDMA is that each transmission is guaranteed to be successful and no control
messages are required. An additional advantage of FDMA is its simplicity: it does not require any coordination or synchronization among the nodes, because each can use its own frequency band without interference. However, both FDMA and TDMA are wasteful, especially when the load is momentarily uneven, because when one node is idle its share of the channel cannot be used by other nodes. Another drawback of FDMA and TDMA is that they are not flexible; adding a new node to the network requires equipment or software modification in every other node. In addition, both waste some portion of the channel to ensure no overlap (either in time or in bandwidth) between the transmissions of different nodes. FDMA uses guard bands between the subchannels, and TDMA uses guard times to separate the nodes. Neglecting the channel waste resulting from guard bands or times, the throughput of FDMA and TDMA is identical to that of the idealized scheme, because packets are never transmitted more than once. Therefore, we have for both

S = G = ΛT = λ·M·L/C
The delay characteristics of FDMA and TDMA are different. With FDMA the transmission rate of each node is C/M bits/s; therefore, the time to transmit a packet is M·L/C seconds. Each node can be modeled as an M/D/1 queue with arrival rate λ = Λ/M and service time M·L/C. The normalized average delay is, therefore,

D̂ = M(1 + S/(2(1 − S))) = M(2 − S)/(2(1 − S))

which is M times larger than the normalized average delay of the ideal scheme. With TDMA the transmission rate of each node is C bits/s, and the time to transmit a packet is L/C seconds. Each node can be modeled as an M/D/1 queue with arrival rate λ = Λ/M, but service is granted to the node only once per frame, namely every M·L/C seconds. The normalized average delay is therefore

D̂ = 1 + M/(2(1 − S))
Comparing the throughput-delay characteristics of FDMA and TDMA, we note that

D̂_FDMA = D̂_TDMA + M/2 − 1
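A quick numerical check of the two delay formulas (a sketch; the function names are ours) confirms that the gap between them is M/2 − 1 at every load:

```python
def fdma_delay(S, M):
    """Normalized FDMA delay: M * (2 - S) / (2 * (1 - S))."""
    return M * (2 - S) / (2 * (1 - S))

def tdma_delay(S, M):
    """Normalized TDMA delay: 1 + M / (2 * (1 - S))."""
    return 1 + M / (2 * (1 - S))

M = 15  # as in Fig. 1
gaps = [fdma_delay(S, M) - tdma_delay(S, M) for S in (0.1, 0.5, 0.9)]
# Every gap equals M/2 - 1 = 6.5, independent of the load S.
```

At high load both delays are dominated by the 1/(1 − S) factor, so their ratio tends to 1 even though the absolute gap stays fixed.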
We thus conclude that for any reasonable parameters the TDMA normalized average delay is always less than that of FDMA; the difference grows linearly with the number of nodes and is independent of the load. The difference stems from the fact that the actual transmission of a packet in TDMA takes only a single slot, whereas in FDMA it lasts the equivalent of an entire frame. This difference is somewhat offset by the fact that a packet arriving at an empty node may need to wait until the proper slot when a TDMA scheme is employed, whereas in FDMA transmission starts right away. It must be remembered, though, that at high throughput the dominant factor in the normalized average delay is inversely proportional to (1 − S) in both TDMA and FDMA; therefore,
Figure 1. TDMA and FDMA performance: normalized delay versus throughput for the ideal scheme, TDMA, and FDMA (M = 15).
the ratio of the normalized average delays of the two schemes approaches unity as the load increases. Figure 1 depicts the delay-throughput characteristics of TDMA, FDMA, and the ideal access scheme for M = 15 users. Further Reading. Many texts treating FDMA and TDMA are available [e.g., Martin (1) and Stallings (2)]. A good analysis of TDMA and FDMA can be found in Ref. 3. A sample-path comparison between the FDMA and TDMA schemes is carried out in Ref. 4, where it is shown that TDMA is better than FDMA not just on the average. A TDMA scheme in which the packets of each node are serviced according to a priority rule is analyzed by De Moraes and Rubin (5). The question of optimal allocation of slots to the nodes in generalized TDMA (in which a node can have more than one slot in a frame) is addressed in Itai and Rosberg (6), where the throughput of the network is maximized (assuming a single buffer for each node), and in Hofri and Rosberg (7), where the expected packet delay in the network is minimized. Message delay (as opposed to packet delay) for generalized TDMA is analyzed by Rom and Sidi (8).

Dynamic Assignment

Static conflict-free protocols such as the FDMA and TDMA schemes do not use the shared channel very efficiently, especially when the network is lightly loaded or when the loads of different nodes are asymmetric. The static and fixed assignment in these schemes causes portions of the channel to remain idle even though some nodes have data to transmit. Dynamic channel allocation schemes are designed to overcome this drawback. With dynamic allocation strategies, the channel allocation changes with time and is based on the current (and possibly changing) demands of the various nodes. The better and more responsive use of the shared channel achieved with dynamic schemes does not come for free; it requires control overhead that is unnecessary with fixed-assignment schemes and consumes a portion of the channel.
To ensure conflict-free operation, it is necessary to reach an agreement among the nodes on who transmits in a given slot. This agreement entails collecting information as to which nodes have packets to transmit and an arbitration
scheme that selects one of these nodes to transmit in the slot. Both the information collection and the arbitration can be achieved using centralized control or distributed control. A representative example of schemes that use centralized control are polling schemes. The basic feature of polling schemes is the operation of a central controller that polls the nodes of the network in some predetermined order (the most common being round-robin) to provide access to the shared channel. When a node is polled and has packets to transmit, it uses the whole shared channel to transmit its backlogged packets. With an exhaustive policy, the node empties its backlog completely, whereas with a gated policy it transmits only those packets that reside in its queue at the polling instant. The last transmitted packet contains an indication that the central controller can poll the next node. If a polled node does not have packets to transmit, the next node is polled. In between polls, nodes accumulate the arriving packets in their queues and do not transmit until polled. The control overhead of polling schemes is a result of the time required to switch from one node to the next. The switching time, denoted by w, includes all the time necessary to transfer the poll (channel propagation delay, transmission time of polling and response packets, etc.). We let ŵ = w/T denote the normalized switching time. The throughput of a polling scheme is identical to that of an ideal scheme and is given by Eq. (1). The normalized average delay is given by

D̂ = 1 + S/(2(1 − S)) + Mŵ(1 − S/M)/(2(1 − S))
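The polling-delay expression above is straightforward to evaluate; this is a sketch with an illustrative function name:

```python
def polling_delay(S, M, w_hat):
    """Normalized average delay of round-robin polling:
    the ideal M/D/1 delay plus a term for the switching overhead."""
    return 1 + S / (2 * (1 - S)) + M * w_hat * (1 - S / M) / (2 * (1 - S))

# With zero switching time (w_hat = 0) the expression collapses to the
# ideal bound of Eq. (2); any w_hat > 0 adds delay at every load.
```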
We note that the first two terms are just the normalized average delay of the ideal scheme, and the third term reflects the overhead resulting from the switching times from one node to the next. As an example of a distributed dynamic conflict-free scheme we use the minislotted alternating priority (MSAP) scheme (9). The MSAP scheme allows the nodes to determine in a distributed manner the order in which they will use the shared channel, assuming the nodes are ordered according to some priority rule. The priority rule can either be static or change in a round-robin manner in each slot. MSAP is based on distributed reservations. To describe its operation, we need to define the slot structure. Let τ (seconds) denote the maximum system propagation delay, that is, the longest time it takes for a signal emitted at one end of the network to reach the other end. The quantity τ plays a crucial role in multiple access schemes. Its normalized version is denoted by a = τ/T. Let every slot consist of an initial M − 1 reservation minislots, each of duration τ, followed by a data transmission period of duration T, followed by another minislot. Only those nodes wishing to transmit in a slot take any action: a node that does not wish to transmit in a given slot remains quiet for the entire slot duration. Given that every node wishing to transmit knows its own priority, the nodes behave as follows. If the node of the highest priority wishes to transmit in this slot, then it starts immediately. Its transmission consists of an unmodulated carrier for a duration of M − 1 minislots followed by a packet of duration T. A node of the ith priority (2 ≤ i ≤ M) wishing to transmit in this slot will do so only if the first i − 1 minislots are idle. In this case, it will transmit M − i minislots of unmodulated carrier followed
Figure 2. Dynamic access: normalized delay versus throughput for the ideal and dynamic (MSAP) schemes (M = 15, a = 0.01).
by a packet of duration T. The specific choice of the minislot duration ensures that when a given node transmits in a minislot, all other nodes know it by the end of that minislot, allowing them to react appropriately. The additional minislot at the end allows the data signals to reach every node of the network. This is needed to ensure that all nodes start synchronized in the next slot, as required by the reservation scheme. The fraction of slots in which transmissions take place is ΛT. Because a fraction Mτ/(T + Mτ) of every slot is overhead, we conclude that the throughput of this scheme is

S = ΛT · T/(T + Mτ) = ΛT/(1 + Ma)
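The MSAP arbitration rule (the highest-priority ready node wins, because every lower-priority node sees a busy minislot and defers) and the resulting capacity can be sketched as follows; the helper names are hypothetical:

```python
def msap_winner(ready, M):
    """Among nodes 0 (highest priority) .. M-1 (lowest) that are ready to
    transmit, return the one that wins the slot: the first node whose
    preceding minislots are all idle. Returns None for an idle slot."""
    for node in range(M):
        if node in ready:
            return node
    return None

def msap_capacity(M, a):
    """Maximum throughput: only T out of every slot of length T + M*tau
    carries data, so the capacity is 1/(1 + M*a) with a = tau/T."""
    return 1.0 / (1 + M * a)
```

For M = 15 and a = 0.01 (as in Fig. 2) the capacity is 1/1.15, about 0.87, which is where the dynamic-access delay curve diverges.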
The normalized average delay is obtained by using standard analysis of priority queues, and it is given by

D̂ = (1 + Ma)(1 + 1/(2[1 − (1 + Ma)S]))

Figure 2 depicts the delay-throughput characteristics of the dynamic-access schemes for M = 15 users. Further Reading. The variants of polling schemes are numerous. Reference 10 contains the analysis of most of the basic schemes, with a long list of references that is complemented in Ref. 11. In Ref. 12 more advanced schemes are described, along with some optimization considerations in the operation of polling schemes, such as the determination of the poll order of the nodes. The MSAP scheme described previously represents an entire family of schemes that guarantee conflict-free transmissions using distributed reservations. All these schemes have a sequence of preceding bits serving to reserve or announce upcoming transmissions (this is known as the reservation preamble). In MSAP there are M − 1 such bits for every transmitted packet. An improvement to the MSAP scheme is the bit-map protocol described by Tanenbaum (13). The idea is to use a single reservation preamble to schedule more than a single transmission, using the fact that all participating nodes are aware of the reservations made in the preamble.
The bit-map scheme requires synchronization among the nodes that is somewhat more sophisticated than that of the MSAP scheme, but the overhead paid per transmitted packet is less than the overhead for MSAP. Another variation of a reservation scheme has been described by Roberts (14). There, every node can make a reservation in every minislot of the reservation preamble, and if the reservation remains uncontested, the reserving node will transmit. If there is a collision in the reservation minislot, all nodes but the "owner" of that minislot will abstain from transmission. Altogether, this is a standard TDMA with idle slots made available to be grabbed by others. Several additional reservation and TDMA schemes are also analyzed by Rubin (4). One of the most efficient reservation schemes is the broadcast recognition access method (BRAM) (15). This is essentially a combination of the bit-map and MSAP schemes. As with MSAP, a reservation preamble serves to reserve the channel for a single node, but unlike MSAP the reservation preamble does not necessarily contain all M − 1 minislots. The idea is that nodes start their transmission with a staggered delay, only after ensuring that no other transmission is ongoing [Kleinrock and Scholl (9) also refer to a similar scheme]. Under heavy load BRAM reduces to regular TDMA.

CONTENTION-BASED SCHEMES

With the conflict-free schemes discussed earlier, every scheduled transmission is guaranteed to succeed. With contention-based schemes, success of a transmission is not guaranteed in advance, because whenever two or more nodes transmit on the shared channel simultaneously, a collision occurs and the data cannot be received correctly. This being the case, packets may have to be transmitted and retransmitted until eventually they are correctly received. Transmission scheduling is therefore the focal concern of contention-based schemes.
Pure and Slotted Aloha

The Aloha family of schemes is probably the richest family of multiple access protocols. First, its popularity is a result of seniority: it was the first contention-based scheme to be introduced (16). Second, many of these schemes are so simple that their implementation is straightforward. Many local area networks of today implement some sophisticated variants of this family of schemes. The pure Aloha scheme is the basic scheme in the family, and it is very simple (16): a newly generated packet is transmitted immediately upon generation, in the hope that no other node interferes. If two or more nodes transmit so that their packets overlap (even partially) in time, interference results and the transmissions are unsuccessful. In this case every colliding node, independently of the others, schedules its retransmission to a random time in the future. This randomness is required to ensure that the same set of packets does not continue to collide indefinitely. The Aloha scheme is very well suited to bursty traffic, because a node does not hold the shared channel when it has no packets to transmit. The drawback of this scheme is that network performance deteriorates significantly as a result of excessive collisions at medium and high traffic intensities. The Aloha scheme is a completely distributed scheme that allows every node to operate independently of the others.
The exact characterization of the offered load to the channel for the pure Aloha scheme is extremely complicated. To overcome this complexity, it is standard to assume that the offered load forms a Poisson process (with rate g, of course). This flawed assumption is an approximation (as has been shown by simulation) that simplifies the analysis of Alohatype schemes considerably and provides some initial intuitive understanding of the ALOHA scheme. Consider a packet (new or retransmitted) whose transmission starts at time t. This packet will be successful if no other packet is transmitted in the interval (t ⫺ T, t ⫹ T) (this period of duration 2T is called the vulnerable period). The probability of this happening, that is, the probability of success Ps is the probability that no packet is transmitted in an interval of length 2T. Because the transmission points correspond to a Poisson process, we have Ps = e−2gT Now, packets are scheduled at a rate of g per second, of which only a fraction Ps are successful. Thus, the rate of successfully transmitted packets is gPs. When a packet is successful, the channel carries useful information for a period of T seconds; in any other case, it carries no useful information at all. Because the throughput is the fraction of time that useful information is carried on the shared channel, we have
stability requires S ⫽ ⌳T. Larger values of ⌳ clearly cannot result in stable operation. Note, however, that even for smaller values of ⌳ there are two values of G to which it corresponds—one larger and one smaller than . The smaller one is (conditionally) stable, whereas the other one is conditionally unstable, meaning that if the offered load increases beyond that point the system will continue to drift to higher load and lower throughput. Thus, without additional measures of control, the stable throughput of pure Aloha is 0 (17). It is appropriate to note that this theoretical instability is rarely a severe problem in real systems, where the long-term load, including, of course, the ‘‘off-hours’’ load, is fairly small, although temporary problems may occur. The delay characteristic of the Aloha scheme can be approximated as follows. For each packet, the average number of transmission attempts until the packet is transmitted successfully is G/S ⫽ e2G. Thus, the average number of unsuccessful transmission attempts is G/S ⫺ 1 ⫽ e2G ⫺ 1. If a collision occurs, the node reschedules the colliding packet for some random time in the future. Let the average rescheduling time be B (seconds). Each successful transmission attempt requires T seconds and each unsuccessful transmission attempt requires T ⫹ B seconds on the average. Therefore, the average delay is given by D = T + (G/S − 1)(T + B) = T + (e2G − 1)(T + B)
S = gTe−2gT = Ge−2G This relation between S and G is typical to many Aloha-type schemes. For small values of G (light load), the throughput is approximately the offered load. For large values of G (heavy load), the throughput decreases rapidly because of excessive amount of collisions. For pure Aloha we note that for G ⫽ , S takes on its maximal value of 1/2e 앒 0.18. This value is referred to as the capacity of the pure Aloha channel. Figure 3 depicts the load-throughput characteristics for the Alohatype schemes. We recall that for a system to be stable the long-term rate of input must equal the long-term rate of output meaning that
and in a normalized form

D = 1 + (e^(2G) − 1)(1 + B/T)

With pure Aloha, even if the overlap in time between two transmitted packets is very small, both packets are destroyed. The slotted Aloha variation overcomes this drawback; it is simply pure Aloha with a slotted channel. Thus, two (or more) packets either overlap completely or do not overlap at all, and the vulnerable period is reduced to a single slot. In other words, a slot is successful if and only if exactly one packet is transmitted in that slot. Therefore,

S = gT e^(−gT) = G e^(−G)
This relation is very similar to that of pure Aloha, except for the increased throughput. The channel capacity is 1/e ≈ 0.36 and is achieved at G = 1. These results were first derived by Roberts (14). As with the pure Aloha scheme, the normalized average delay for the slotted Aloha scheme is
D = 1 + (e^G − 1)(1 + B/T)

Carrier-Sensing Protocols
Figure 3. Throughput of Aloha and slotted Aloha.
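The Aloha throughput and delay expressions above can be checked numerically; a minimal sketch reproducing the curves of Figure 3 (function names are mine):

```python
import math

def pure_aloha_throughput(G):
    """Pure Aloha: S = G * exp(-2G)."""
    return G * math.exp(-2 * G)

def slotted_aloha_throughput(G):
    """Slotted Aloha: S = G * exp(-G)."""
    return G * math.exp(-G)

def pure_aloha_delay(G, B_over_T):
    """Normalized average delay: D = 1 + (exp(2G) - 1) * (1 + B/T)."""
    return 1 + (math.exp(2 * G) - 1) * (1 + B_over_T)

def slotted_aloha_delay(G, B_over_T):
    """Normalized average delay: D = 1 + (exp(G) - 1) * (1 + B/T)."""
    return 1 + (math.exp(G) - 1) * (1 + B_over_T)

# Capacities quoted in the text: pure Aloha peaks at G = 1/2,
# slotted Aloha at G = 1.
print(pure_aloha_throughput(0.5))    # 1/(2e), about 0.184
print(slotted_aloha_throughput(1.0)) # 1/e, about 0.368
```

At every load the slotted variant also yields a smaller normalized delay, since each unsuccessful attempt is half as likely per unit of offered load.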
The Aloha schemes exhibit fairly poor performance, which can be attributed to the "impolite" behavior of the nodes, namely, whenever one has a packet to transmit it does so without consideration of others. It is clear that in a shared environment even a little consideration can benefit all. Consider a listen-before-talk behavior wherein every node, before attempting any transmission, listens to determine whether somebody else is already using the channel. If this is the case, the node will refrain from transmission to the benefit of all; its packet would clearly not be successful if transmitted; furthermore, disturbing another
MULTIPLE ACCESS SCHEMES
node will cause the currently transmitted packet to be retransmitted, possibly disturbing yet another packet. The process of listening to the shared channel is not that demanding. Every node is equipped with a receiver anyway, and every node can monitor the channel because it is shared. Moreover, detecting another node's transmission does not require receiving the information; it suffices to sense the carrier that is present when signals are transmitted. The carrier-sensing family of schemes is characterized by sensing the carrier and deciding accordingly whether another transmission is ongoing. Carrier sensing does not yield conflict-free operation. Suppose that the channel has been idle for a while and that two nodes concurrently generate a packet. Each will sense the channel, discover that it is idle, and transmit the packet, resulting in a collision. "Concurrently" here does not really mean at the very same time; if one node starts transmitting, it takes some time for the signal to propagate and arrive at the other node. Hence "concurrently" actually means within a time window of duration equal to the signal propagation time. The maximum propagation time in the network is denoted τ, and its normalized version a = τ/T is an important parameter that affects the performance of carrier-sensing schemes. The larger this quantity, the more likely collisions are and the worse the performance becomes. All the carrier-sensing multiple access schemes share the same philosophy: when a node generates a new packet, the channel is sensed, and if found idle the packet is transmitted without further ado. When a collision takes place, every transmitting node reschedules a retransmission of the collided packet to some other time in the future (chosen with some randomization to avoid repeated collisions), at which time the same operation is repeated. The variations on the CSMA scheme are caused by the behavior of nodes that wish to transmit and find (by sensing) the channel busy.
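As a worked example with assumed numbers (none taken from the text): a 1 km cable with propagation speed 2 × 10^8 m/s gives τ = 5 µs; at 10 Mb/s with 1000-bit packets, T = 100 µs, so a = τ/T = 0.05:

```python
# Assumed example parameters (not from the article).
cable_length_m = 1_000
propagation_speed = 2e8      # m/s, typical for copper or fiber
bit_rate = 10e6              # 10 Mb/s
packet_bits = 1_000

tau = cable_length_m / propagation_speed   # max propagation time (s)
T = packet_bits / bit_rate                 # packet transmission time (s)
a = tau / T                                # normalized propagation time
print(tau, T, a)  # 5e-06 0.0001 0.05
```

Longer links, faster bit rates, or shorter packets all increase a and hence the collision window.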
Most of the basic variations were introduced and analyzed by Kleinrock and Tobagi (18–20). In the nonpersistent versions of CSMA (NP-CSMA), a node that generated a packet and found the channel busy refrains from transmitting the packet and behaves exactly as if its packet collided [i.e., it schedules (randomly) the retransmission of the packet to some time in the future]. With NP-CSMA, there are situations in which the channel is idle although one or more nodes have packets to transmit. The 1-persistent CSMA (1P-CSMA) is an alternative to NP-CSMA because it avoids such situations by being a bit more greedy. This is achieved by applying the following rule: a node that senses the channel and finds it busy persists, waiting and transmitting as soon as the channel becomes idle. Consequently, the channel is always used if there is a node with a packet. With the 1-persistent scheme, a collision may occur not only because of nonzero propagation delays but also when two nodes become ready to transmit in the middle of another node's transmission. In this case, both nodes wait until that transmission ends and then begin transmitting simultaneously, resulting in a collision. For slotted operation, CSMA schemes use time slots of duration τ seconds, which is usually much smaller than the slot size of duration T seconds used with slotted Aloha. However, as in slotted Aloha, all nodes using slotted CSMA schemes are forced to start transmission at the beginning of a slot.
Besides the ability to sense the carrier, some local area networks (such as Ethernet) have an additional feature, namely, that nodes can detect interference among several transmissions (including their own) while transmission is in progress and abort transmission of their collided packets. If this can be done sufficiently fast, then the duration of an unsuccessful transmission is shorter than that of a successful one, thus improving the performance of the scheme. Together with carrier sensing, this produces a variation of CSMA known as CSMA/CD (carrier sensing multiple access with collision detection). The operation of all CSMA/CD schemes is identical to the operation of the corresponding CSMA schemes, except that if a collision is detected during transmission, the transmission is aborted and the packet is scheduled for transmission at some later time. For Ethernet networks this random delay is doubled (at most 16 times) each time the packet collides, a scheme known as binary exponential backoff. To ensure that all network nodes indeed detect a collision when it occurs, a consensus reenforcement procedure is used. This procedure is manifested by jamming the channel with a collision signal for a duration of τ_cr seconds, which is usually much larger than the time necessary to detect a collision. We let γ = τ_cr/τ. The analysis of the throughput of CSMA schemes is rather complicated. It is based on computations of the average lengths of idle and transmission periods. For NP-CSMA we have
S = gT e^(−gτ) / [g(T + 2τ) + e^(−gτ)] = G e^(−aG) / [G(1 + 2a) + e^(−aG)]
For slotted NP-CSMA, we have

S = aG e^(−aG) / (1 − e^(−aG) + a)
For 1P-CSMA, we have

S = gT e^(−g(T+2τ)) [1 + gT + gτ(1 + gT + gτ/2)] / [g(T + 2τ) − (1 − e^(−gτ)) + (1 + gτ) e^(−g(T+τ))]
  = G e^(−G(1+2a)) [1 + G + aG(1 + G + aG/2)] / [G(1 + 2a) − (1 − e^(−aG)) + (1 + aG) e^(−G(1+a))]
For slotted 1P-CSMA, we have

S = G e^(−G(1+a)) [1 + a − e^(−aG)] / [(1 + a)(1 − e^(−aG)) + a e^(−G(1+a))]
For nonpersistent CSMA/CD, we have

S = G e^(−aG) / [G e^(−aG) + γ aG(1 − e^(−aG)) + 2aG(1 − e^(−aG)) + 2 − e^(−aG)]
For slotted nonpersistent CSMA/CD, we have

S = G e^(−aG) / [G e^(−aG) + γ aG(1 − e^(−aG) − aG e^(−aG)) + (2 − e^(−aG) − aG e^(−aG))]
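The CSMA throughput expressions can be evaluated directly; a sketch for the four CSMA variants at a = 0.01, as in Figure 4 (function names are mine):

```python
import math

def np_csma(G, a):
    """Unslotted nonpersistent CSMA."""
    e = math.exp(-a * G)
    return G * e / (G * (1 + 2 * a) + e)

def slotted_np_csma(G, a):
    """Slotted nonpersistent CSMA."""
    e = math.exp(-a * G)
    return a * G * e / (1 - e + a)

def one_p_csma(G, a):
    """Unslotted 1-persistent CSMA (Kleinrock-Tobagi)."""
    e = math.exp(-a * G)
    num = G * math.exp(-G * (1 + 2 * a)) * (1 + G + a * G * (1 + G + a * G / 2))
    den = G * (1 + 2 * a) - (1 - e) + (1 + a * G) * math.exp(-G * (1 + a))
    return num / den

def slotted_one_p_csma(G, a):
    """Slotted 1-persistent CSMA."""
    e = math.exp(-a * G)
    num = G * math.exp(-G * (1 + a)) * (1 + a - e)
    den = (1 + a) * (1 - e) + a * math.exp(-G * (1 + a))
    return num / den

a = 0.01
for G in (0.1, 1.0, 10.0):
    print(G, np_csma(G, a), slotted_np_csma(G, a),
          one_p_csma(G, a), slotted_one_p_csma(G, a))
```

Note how the nonpersistent variants keep improving at high load (near 0.82 at G = 10 for a = 0.01) while the 1-persistent variants peak near G = 1 and then collapse, exactly the qualitative behavior shown in Figure 4.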
Figure 4. Throughput of CSMA versions (NP-CSMA, 1P-CSMA, slotted NP-CSMA, and slotted 1P-CSMA) for a = 0.01.
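Returning to the collision-detection mechanism described earlier: Ethernet's binary exponential backoff can be sketched as follows. The truncation at 10 doublings and the inclusive slot range follow classic Ethernet conventions and are assumptions here, not details from the text:

```python
import random

def backoff_slots(collision_count, max_doublings=10):
    """Truncated binary exponential backoff: after the i-th successive
    collision of a packet, wait a uniformly chosen number of slots in
    [0, 2^min(i, max_doublings) - 1]."""
    k = min(collision_count, max_doublings)
    return random.randint(0, 2 ** k - 1)

# The expected wait roughly doubles with each successive collision.
print([backoff_slots(i) for i in range(1, 6)])
```

Doubling the expected wait spreads the retransmissions of repeatedly colliding nodes over an ever larger window, which is what lets the contending nodes eventually separate.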
Figure 4 depicts the load-throughput characteristics for the CSMA-type schemes.

Further Reading. Numerous variations on the environment under which the Aloha and CSMA schemes operate have been addressed in the literature (see, e.g., Refs. 3, 13, and 21–23). For instance, various packet length distributions were considered by Abramson (24) and Ferguson (25) for Aloha and by Tobagi and Hunt (26) for CSMA. The assumption that, whenever two or more packets overlap at the receiver, all packets are lost is overly pessimistic. In radio networks the receiver might correctly receive a packet despite the fact that it overlaps in time with other transmitted packets. This phenomenon is known as capture, and it can happen as a result of various characteristics of radio systems. Most studies (27,28) considered power capture (the phenomenon whereby the strongest of several transmitted signals is correctly received at the receiver). Thus, if a single high-powered packet is transmitted, then it is correctly received regardless of other transmissions. Hence, channel use increases. Reservation schemes that allow contentions are designed to have the advantages of both the Aloha and the TDMA approaches. Examples of reservation schemes appear in Ref. 29, where knowledge of the number of users is needed, and in Refs. 14 and 30, where this knowledge is not required. Approximate analysis of a reservation Aloha protocol can be found in Lam (31). Approximate analysis of the delay was presented by Ferguson (32) for Aloha and by Beuerman and Coyle (33) for CSMA schemes. Instability issues of the Aloha protocol were first identified by Carleial and Hellman (34) and Lam and Kleinrock (35). Later, similar issues were identified for the CSMA family of protocols by Tobagi and Kleinrock (20).

COLLISION RESOLUTION SCHEMES

The original Aloha scheme and its CSMA derivatives are inherently unstable in the absence of some external control.
Looking into the philosophy behind these schemes, it is apparent that there is no serious attempt to resolve collisions among packets as soon as they occur. Instead, the attempts to resolve collisions are always deferred to the future, with the hope that things will then work out somehow, but they never do. Another type of contention-based scheme, with a different philosophy, is the family of collision resolution schemes (CRS). In these schemes the efforts are concentrated on resolving collisions as soon as they occur. Moreover, in most versions of these schemes, new packets that arrive to the network are inhibited from being transmitted while the resolution of collisions is in progress. This ensures that if the rate of arrival of new packets to the system is smaller than the rate at which collisions can be resolved (the maximal rate of departing packets, i.e., the throughput), then the system is stable. The basic idea behind these schemes is to exploit in a more sophisticated manner the feedback information available to the nodes in order to control the retransmission process so that collisions are resolved more efficiently. The most basic collision resolution scheme is called the binary-tree CRS (or binary-tree scheme) and was proposed by Capetanakis (36), Hayes (37), and Tsybakov and Mikhailov (38). According to this scheme, when a collision occurs, say in slot k, all nodes that are not involved in the collision wait until the collision is resolved. The nodes involved in the collision split randomly into two subsets, by (for instance) each flipping a coin. The nodes in the first subset, those that flipped 0, retransmit in slot k + 1, whereas those that flipped 1 wait until all those that flipped 0 transmit their packets successfully. If slot k + 1 is either idle or contains a successful transmission, the nodes of the second subset (those that flipped 1) retransmit in slot k + 2.
If slot k + 1 contains another collision, then the procedure is repeated (i.e., the nodes whose packets collided in slot k + 1 flip a coin again and operate according to the outcome of the coin flipping, and so on). A node having a packet that collided (at least once) is said to be backlogged. The operation of the scheme can also be described by a binary tree in which every vertex corresponds to a time slot. The root of the tree corresponds to the slot of the original collision. Each vertex of the tree also designates a subset (perhaps empty) of backlogged nodes. Vertices whose subsets contain at least two nodes indicate collisions and have two outgoing branches, corresponding to the splitting of the subset into two new subsets. Vertices corresponding to empty subsets or subsets containing one node are leaves of the tree and indicate an idle and a successful slot, respectively. For instance, consider a collision that occurs in slot 1. At this point it is known neither how many nodes nor which nodes collided in this slot. Each of the colliding nodes flips a coin, and those that flipped 0 transmit in slot 2. By the rules of the scheme, no newly arrived packet is transmitted while the resolution of a collision is in progress, so only nodes that collided in slot 1 and flipped 0 transmit in slot 2. Another collision occurs in slot 2, and the nodes involved in that collision flip a coin again. In this example, all the colliding nodes of slot 2 flipped 1, and therefore slot 3 is idle. The nodes that flipped 1 in slot 2 transmit again in slot 4, resulting in another collision and forcing the nodes involved in it to flip a coin once more. One node flips 0 and transmits (successfully) in slot 5, causing all nodes that flipped 1 in slot 4 to transmit in slot 6. In this example, there is one such node, and therefore slot 6 is a successful one. Now that the collision among all nodes that flipped 0 in slot 1 has been resolved, the nodes that flipped 1 in that slot transmit (in slot 7). Another collision occurs, and the nodes involved in it flip a coin. Another collision is observed in slot 8, meaning that at least two nodes flipped 0 in slot 7. The nodes that collided in slot 8 flip a coin and, as it happens, there is a single node that flipped 0, and it transmits (successfully) in slot 9. Then, in slot 10, the nodes that flipped 1 in slot 8 transmit. There is only one such node, and its transmission is, of course, successful. Finally, the nodes that flipped 1 in slot 7 must transmit in slot 11. In this example, there is no such node; hence slot 11 is idle, completing the resolution of the collision that occurred in slot 7 and, at the same time, the one in the first slot. It is clear from this example that each node, including those not involved in the collision, can construct the binary tree by following the feedback signals corresponding to each slot, thus knowing exactly when the collision is resolved. A collision is resolved when the nodes of the network know that all packets involved in the collision have been transmitted successfully. The time interval starting with the original collision (if any) and ending when this collision is resolved is called a collision resolution interval (CRI). In the preceding example the length of the CRI is 11 slots. The binary-tree protocol dictates how to resolve collisions after they occur. To complete the description of the protocol, one must specify when newly generated packets are transmitted for the first time. One alternative, which is assumed all along (known as the obvious-access scheme), is that new packets are inhibited from being transmitted while a resolution of a collision is in progress.
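The binary-tree scheme as described can be both simulated and analyzed with a few lines of code; a sketch (function names are mine). The recursion follows from the scheme's rules: an empty or singleton subset takes one slot, and a collision takes one slot plus the cost of resolving the two random subsets:

```python
import random
from math import comb

def cri_length(n, rng):
    """Simulated number of slots to resolve a collision among n
    packets (one slot if the subset is empty or a singleton)."""
    if n <= 1:
        return 1
    zeros = sum(rng.random() < 0.5 for _ in range(n))   # fair-coin split
    return 1 + cri_length(zeros, rng) + cri_length(n - zeros, rng)

def expected_cri_lengths(n_max):
    """Exact expected CRI lengths: L_0 = L_1 = 1 and, for n >= 2,
    L_n = (1 + 2 * sum_{k<n} C(n,k) 2^-n L_k) / (1 - 2^(1-n)),
    obtained by conditioning on the binomial split and isolating L_n."""
    L = [1.0, 1.0]
    for n in range(2, n_max + 1):
        s = sum(comb(n, k) * 2.0 ** (-n) * L[k] for k in range(n))
        L.append((1 + 2 * s) / (1 - 2.0 ** (1 - n)))
    return L

rng = random.Random(1)
avg = sum(cri_length(2, rng) for _ in range(20000)) / 20000
L = expected_cri_lengths(60)
print(L[2])        # 5.0 (exact expected CRI length for a 2-packet collision)
print(avg)         # simulation estimate, close to 5
print(L[60] / 60)  # per-packet slot cost, approaching roughly 2.89
```

The growing ratio L_n/n is what underlies the stability threshold of the scheme discussed below in the text.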
That is, packets that arrive to the system while a resolution of a collision is in progress wait until the collision is resolved, at which time they are transmitted. In the example above, all new packets arriving to the system during slots 1 through 11 are transmitted for the first time in slot 12. Let L_n be the expected length of a CRI that starts with the transmission of n packets. From the operation of the scheme, it is clear that as long as the arrival rate of new packets into the system is smaller than the ratio n/L_n (for large n), the system is stable. When fair coins are used for splitting the users upon collisions, one can show that for every n,

L_n ≤ 2.886n + 1

yielding a stable system for arrival rates smaller than 0.346. The performance of the binary-tree protocol can be improved in two ways. The first is to speed up the collision resolution process by avoiding certain avoidable collisions. The second is based on the observation that collisions among a small number of packets are resolved more efficiently than collisions among a large number of packets. Therefore, if most CRIs start with a small number of packets, the performance of the protocol is expected to improve. Consider again the example above. In slots 2 and 3 a collision is followed by an idle slot. This implies that in slot 2 all users (and there were at least two of them) flipped 1. The binary-tree protocol dictates that these users must transmit in slot 4, although it is obvious that this will generate a collision that can be avoided. The modified binary-tree protocol was suggested by Massey (39), and it eliminates such avoidable collisions by letting the users that flipped 1 in slot 2 in the preceding example flip coins again before transmitting in slot 4. Consequently, the slot in which an avoidable collision would occur is saved. In this case, fair coins yield a stable system for arrival rates smaller than 0.375, and biased coins increase this number up to 0.381. When obvious access is employed, it is very likely that a CRI will start with a collision among a large number of packets when the previous CRI was long. When the system operates near its maximal throughput, most CRIs are long; hence collisions among a large number of packets must be resolved frequently, yielding inefficient operation. Ideally, if it were possible to start each CRI with the transmission of exactly one packet, the throughput of the system would be 1. Because this is not possible, one should try to design the system so that in most cases a CRI starts with the transmission of about one packet. There are several ways to achieve this goal by determining a first-time transmission rule (i.e., when packets are transmitted for the first time). One way, suggested by Capetanakis (36), is to obtain an estimate of the number of packets that arrived in the previous CRI, divide them into smaller groups, each having an expected number of packets on the order of one, and handle each group separately. Another way, known as the epoch mechanism, has been suggested by Gallager (40) and Tsybakov and Mikhailov (41). According to this mechanism, time is divided into consecutive epochs, each of length Δ slots. The ith arrival epoch is the time interval [iΔ, (i + 1)Δ]. Packets that arrive during the ith arrival epoch are transmitted for the first time in the first slot after the collision among packets that arrived during the (i − 1)st arrival epoch is resolved. The parameter Δ is chosen to optimize the performance of the system.
When Δ = 2.68, the system is stable for arrival rates up to 0.429 if slots of sure collisions are not saved, and up to 0.462 if they are. A final enhancement of the epoch mechanism is to start a new epoch each time a collision is followed by two successful transmissions. This guarantees that each CRI will start with an optimal number of packets, and it yields the highest stable throughput known for multiple access systems: 0.487.

Further Reading. Numerous variations of the environment under which collision resolution protocols operate have been addressed in the literature, and excellent surveys on the subject appear in Refs. 42 and 43. Books by Bertsekas and Gallager (21) and Rom and Sidi (23) are also excellent sources on collision resolution protocols. Considerable effort has been spent on finding upper bounds on the maximum throughput that can be achieved in an infinite population model with Poisson arrivals and ternary feedback. The best upper bound known to date is 0.568 and is the work of Tsybakov and Likhanov (44). Practical multiple access communication systems are prone to various types of errors. Collision resolution protocols that operate in the presence of noise errors, erasures, and captures have been studied in Refs. 45–49. Collision resolution protocols yielding high throughputs for general arrival processes (even if their statistics are unknown) were developed by Cidon and Sidi (50) and Greenberg et al. (51). The expected packet delay of the binary-tree protocol has been derived by Fayolle et al. (52) and Tsybakov and Mikhailov (38). Bounds on the expected packet delay of the algorithm with the epoch mechanism have been obtained in Refs. 41 and 53,
and bounds on the packet delay distribution have been obtained in Refs. 54 and 55.
BIBLIOGRAPHY

1. J. Martin, Communication Satellite Systems, Englewood Cliffs, NJ: Prentice-Hall, 1978.
2. W. Stallings, Data and Computer Communications, New York: Macmillan, 1985.
3. J. F. Hayes, Modeling and Analysis of Computer Communications Networks, New York: Plenum Press, 1984.
4. I. Rubin, Access control disciplines for multi-access communications channels: Reservation and TDMA schemes, IEEE Trans. Inf. Theory, IT-25: 516–536, 1979.
5. L. F. M. De Moraes and I. Rubin, Message delays for a TDMA scheme under a nonpreemptive priority discipline, IEEE Trans. Commun., COM-32: 583–588, 1984.
6. A. Itai and Z. Rosberg, A golden ratio control policy for a multiple-access channel, IEEE Trans. Autom. Control, AC-29: 712–718, 1984.
7. M. Hofri and Z. Rosberg, Packet delay under the golden ratio weighted TDM policy in a multiple access channel, IEEE Trans. Inf. Theory, IT-33: 341–349, 1987.
8. R. Rom and M. Sidi, Message delay distribution in generalized time division multiple access (TDMA), Probability Eng. Inf. Sci., 4: 187–202, 1990.
9. L. Kleinrock and M. Scholl, Packet switching in radio channels: New conflict-free multiple access schemes, IEEE Trans. Commun., COM-28: 1015–1029, 1980.
10. H. Takagi, Analysis of Polling Systems, Cambridge, MA: MIT Press, 1986.
11. H. Takagi, Queueing analysis of polling models, ACM Comput. Surv., 20 (1): 5–28, 1988.
12. H. Levy and M. Sidi, Polling systems: Applications, modeling and optimization, IEEE Trans. Commun., 38: 1750–1760, 1990.
13. A. S. Tanenbaum, Computer Networks, 3rd ed., Englewood Cliffs, NJ: Prentice-Hall International Editions, 1996.
14. L. G. Roberts, ALOHA packet system with and without slots and capture, Comput. Commun. Rev., 5 (2): 28–42, 1975.
15. I. Chlamtac, W. R. Franta, and K. D. Levin, BRAM: The broadcast recognizing access mode, IEEE Trans. Commun., COM-27: 1183–1189, 1979.
16. N. Abramson, The ALOHA system—another alternative for computer communications, Proc. Fall Joint Comput. Conf., pp. 281–285, 1970.
17. G. Fayolle et al., The stability problem of broadcast packet switching computer networks, Acta Informatica, 4 (1): 49–53, 1974.
18. L. Kleinrock and F. A. Tobagi, Packet switching in radio channels: Part I—Carrier sense multiple-access modes and their throughput delay characteristics, IEEE Trans. Commun., 23: 1400–1416, 1975.
19. F. A. Tobagi and L. Kleinrock, Packet switching in radio channels: Part II—The hidden terminal problem in carrier sense multiple-access and the busy tone solution, IEEE Trans. Commun., 23: 1417–1433, 1975.
20. F. A. Tobagi and L. Kleinrock, Packet switching in radio channels: Part IV—Stability considerations and dynamic control in carrier sense multiple-access, IEEE Trans. Commun., 25: 1103–1119, 1977.
21. D. Bertsekas and R. Gallager, Data Networks, 2nd ed., Englewood Cliffs, NJ: Prentice-Hall International Editions, 1992.
22. J. L. Hammond and P. J. P. O'Reilly, Performance Analysis of Local Computer Networks, Reading, MA: Addison-Wesley, 1986.
23. R. Rom and M. Sidi, Multiple Access Protocols: Performance and Analysis, New York: Springer-Verlag, 1990.
24. N. Abramson, The throughput of packet broadcasting channels, IEEE Trans. Commun., 25: 117–128, 1977.
25. M. J. Ferguson, An approximate analysis of delay for fixed and variable length packets in an unslotted Aloha channel, IEEE Trans. Commun., 25: 644–654, 1977.
26. F. A. Tobagi and V. B. Hunt, Performance analysis of carrier sense multiple access with collision detection, Comput. Netw., 4 (5): 245–259, 1980.
27. J. J. Metzner, On improving utilization in Aloha networks, IEEE Trans. Commun., 24: 447–448, 1976.
28. N. Shacham, Throughput-delay performance of packet-switching multiple-access channel with power capture, Performance Evaluation, 4 (3): 153–170, 1984.
29. R. Binder, A dynamic packet switching system for satellite broadcast channels, Proc. ICC '75, pp. 41.1–41.5, 1975.
30. W. Crowther et al., A system for broadcast communication: Reservation-ALOHA, Proc. Int. Conf. Syst. Sci., pp. 371–374, 1973.
31. S. S. Lam, Packet broadcast networks—a performance analysis of the R-ALOHA protocol, IEEE Trans. Comput., 29: 596–603, 1980.
32. M. J. Ferguson, On the control, stability, and waiting time in a slotted Aloha, IEEE Trans. Commun., 23: 1306–1311, 1975.
33. S. L. Beuerman and E. J. Coyle, The delay characteristics of CSMA/CD networks, IEEE Trans. Commun., 36: 553–563, 1988.
34. A. B. Carleial and M. E. Hellman, Bistable behavior of ALOHA-type systems, IEEE Trans. Commun., 23: 401–410, 1975.
35. S. S. Lam and L. Kleinrock, Packet switching in a multiaccess broadcast channel: Dynamic control procedures, IEEE Trans. Commun., 23: 891–904, 1975.
36. J. I. Capetanakis, Tree algorithm for packet broadcast channels, IEEE Trans. Inf. Theory, 25: 505–515, 1979.
37. J. F. Hayes, An adaptive technique for local distribution, IEEE Trans. Commun., 26: 1178–1186, 1978.
38. B. S. Tsybakov and V. A. Mikhailov, Free synchronous packet access in a broadcast channel with feedback, Prob. Inf. Trans., 14 (4): 259–280, 1978.
39. J. L. Massey, Collision resolution algorithms and random-access communications, in Multi-User Communication Systems, CISM Courses and Lectures Series (G. Longo, ed.), New York: Springer-Verlag, pp. 73–137, 1981 (also in UCLA Technical Report UCLA-ENG-8016, April 1980).
40. R. G. Gallager, Conflict resolution in random access broadcast networks, Proc. AFOSR Workshop Commun. Theory Appl., Provincetown, pp. 74–76, September 1978.
41. B. S. Tsybakov and V. A. Mikhailov, Random multiple packet access: Part-and-try algorithm, Prob. Inf. Trans., 16: 305–317, 1980.
42. R. G. Gallager, A perspective on multiaccess channels, IEEE Trans. Inf. Theory, 31: 124–142, 1985.
43. B. S. Tsybakov, Survey of USSR contributions to multiple-access communications, IEEE Trans. Inf. Theory, 31: 143–165, 1985.
44. B. S. Tsybakov and N. B. Likhanov, Upper bound on the capacity of a random multiple access system, Prob. Inf. Trans., 23 (3): 224–236, 1988.
45. I. Cidon and M. Sidi, The effect of capture on collision-resolution algorithms, IEEE Trans. Commun., 33: 317–324, 1985.
46. I. Cidon and M. Sidi, Erasures and noise in multiple access algorithms, IEEE Trans. Inf. Theory, 33: 132–143, 1987.
47. I. Cidon, H. Kodesh, and M. Sidi, Erasure, capture and random power level selection in multiple-access systems, IEEE Trans. Commun., 36: 263–271, 1988.
48. M. Sidi and I. Cidon, Splitting protocols in presence of capture, IEEE Trans. Inf. Theory, 31: 295–301, 1985.
49. N. D. Vvedenskaya and B. S. Tsybakov, Random multiple access of packets to a channel with errors, Prob. Inf. Trans., 19 (2): 131–147, 1983.
50. I. Cidon and M. Sidi, Conflict multiplicity estimation and batch resolution algorithms, IEEE Trans. Inf. Theory, 34: 101–110, 1988.
51. A. G. Greenberg, P. Flajolet, and R. E. Ladner, Estimating the multiplicities of conflicts to speed their resolution in multiple access channels, J. ACM, 34 (2): 289–325, 1987.
52. G. Fayolle et al., Analysis of a stack algorithm for random multiple-access communication, IEEE Trans. Inf. Theory, 31: 244–254, 1985.
53. L. Georgiadis, L. F. Merakos, and P. Papantoni-Kazakos, A method for the delay analysis of random multiple-access algorithms whose delay process is regenerative, IEEE J. Sel. Areas Commun., 5 (6): 1051–1062, 1987.
54. L. Georgiadis and M. Paterakis, Bounds on the delay distribution of window random-access algorithms, IEEE Trans. Commun., COM-41: 683–693, 1993.
55. G. Polyzos and M. Molle, A queuing theoretic approach to the delay analysis for the FCFS 0.487 conflict resolution algorithm, IEEE Trans. Inf. Theory, IT-39: 1887–1906, 1993.
MOSHE SIDI Technion—Israel Institute of Technology
MULTIPLIER. See ANALOG MOS MULTIPLIER.
Wiley Encyclopedia of Electrical and Electronics Engineering
Network Flow and Congestion Control
Standard Article
Eytan Modiano, MIT Lincoln Laboratory
Kai-Yeung Siu, MIT d'Arbeloff Laboratory for Information Systems and Technology
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5333
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: Issues and Mechanisms for Congestion Control; Flow Control in Practice; Advanced Issues.
NETWORK FLOW AND CONGESTION CONTROL
Figure 1. A hub network connecting users U1, ..., UN through an N × N switch.
Modern computer networks connect geographically dispersed nodes using switches and routers, with transmission lines between them. In this way, bursty and random traffic streams can be statistically multiplexed to make more efficient use of resources. For example, the hub network in Fig. 1 connects N users with N shared links and an N × N switch. Communication between pairs of users is accomplished by going through the hub. If, instead, all nodes were connected using dedicated links, N(N − 1)/2 links would be required. However, at any point in time communication usually takes place only between a small fraction of the users. Hence, providing full and unshared connectivity between all users would be wasteful of resources. The functionality of computer networks in connecting users is quite similar, in many respects, to that of highways and local streets in connecting households. In both cases, effective traffic control mechanisms are needed to regulate the flow of traffic and ensure high throughput. Unlike traditional voice communications, where an active call requires a constant bit rate from the network, data communication is bursty in nature. A typical data session may require very low data rates during periods of inactivity and much higher rates at other times. Consequently, there may be times when incoming traffic to a network exceeds its capacity. Flow and congestion control are mechanisms used in computer networks to prevent users from overwhelming the network with more data than the network can handle. The simplest way to handle network congestion is to temporarily buffer the excess traffic at a congested switch until it can all be transmitted. Yet, since switch buffers are limited in size, there may be times when sustained excessive demand on parts of the network causes buffers to fill up, so that excess packets can no longer be buffered and must be discarded.
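To make the N(N − 1)/2 dedicated-link count concrete, a one-line calculation (example numbers are mine):

```python
def dedicated_links(n):
    """Point-to-point links needed to fully connect n users: n(n-1)/2."""
    return n * (n - 1) // 2

# A hub needs only N links; full connectivity grows quadratically.
print(dedicated_links(8), dedicated_links(100))  # 28 4950
```

Even at N = 100, the hub saves nearly 50 times as many links, and the gap widens linearly with N.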
When packets are discarded, it is typically left up to higher-layer protocols to recover the lost packets using an appropriate retransmission mechanism. For example, the Transmission Control Protocol (TCP) recovers from such buffer overflows by using a timed acknowledgment mechanism and retransmitting packets for which an acknowledgment does not arrive in time. Consequently, at times of congestion, packets may be retransmitted not only because of buffer overflows but also because of the increased delay that is due to the congestion. In the absence of flow control, this sometimes unnecessary retransmission of packets can lead to instability, where little if any new traffic can flow through the network. A well-designed flow-control mechanism should keep the traffic levels in the network low enough to prevent buffers from overflowing and maintain relatively low end-to-end delays. Furthermore, in the event of congestion, the flow-control mechanism should allow the network to stabilize. Figure 2 illustrates the benefits of an effective flow-control mechanism.

Figure 2. An effective flow-control mechanism can yield both higher throughputs and decreased delays.

In addition to the obvious objectives of limiting delays and buffer overflow, a good flow-control scheme should also treat all sessions fairly. One notion of "fairness" is to treat all sessions in the network equally. However, this notion is not appropriate for networks that attempt to provide Quality-of-Service (QoS) guarantees. In some networks users may be offered service contracts guaranteeing minimum data rates, maximum packet delays, and packet discard rates, as well as other performance measures. In such networks, it is up to the flow-control mechanism to make sure that these guarantees are met. Clearly, in this case, sessions cannot be treated equally, and a different notion of fairness, related to the service agreements of the users, must be used. A more detailed discussion of fairness and how flow-control mechanisms attempt to provide fairness will be given in the next section.
Typically, call-admission mechanisms are used in circuit-switched networks (e.g., the telephone network); however, with the recent emergence of packet network services offering QoS guarantees, call blocking may also play a role in data networks, in conjunction with additional mechanisms to regulate traffic among active sessions. This article will focus on active flow-control mechanisms that attempt to regulate the traffic flow among active sessions. A comprehensive discussion of flow control in data networks can be found in Refs. 1 and 2. As described in Ref. 3, one way to classify flow-control mechanisms is based on the layer of the ISO/OSI reference model at which the mechanism operates. For example, there are data link, network, and transport layer congestion-control schemes. Typically, a combination of such mechanisms is used; the selection depends upon the severity and duration of congestion. Figure 3 shows how the duration of congestion affects the choice of method. In general, the longer the duration, the higher the layer at which control should be exercised. For example, if the congestion is permanent, the installation of additional links is required. If the congestion lasts for the duration of the connection, admission control (e.g., use of a busy signal) or dynamic routing (i.e., rerouting of traffic onto another, less congested path) is more appropriate. If the congestion lasts for several round-trip delays, transport-level control with end-to-end feedback is more effective. If the congestion is of short duration (less than a round-trip delay), link-by-link feedback or sufficient buffering should be used. Since every network can have overloads of all durations, every network needs a combination of control mechanisms at various levels; no single scheme can solve all congestion problems. The rest of this article will focus on mechanisms that deal with congestion that lasts only a few round-trip delays.
Figure 3. Control mechanisms based on congestion duration, from long to short: capacity planning and network design; admission control; dynamic routing; end-to-end feedback; link-by-link feedback; buffering.
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
NETWORK FLOW AND CONGESTION CONTROL
ISSUES AND MECHANISMS FOR CONGESTION CONTROL

Buffer Implementation and Management

As discussed earlier, a call-admission-control mechanism alone is appropriate only for regulating traffic with steady or predictable bandwidth requirements (e.g., voice); it is not effective for dealing with unpredictable, bursty traffic (e.g., data). For efficient utilization of network bandwidth, it is often necessary to buffer traffic when the incoming traffic to a node temporarily exceeds the capacity of its outgoing link. Flow control thus involves managing the buffer at a node in such a way that the service requirements of the connections traversing the node can be satisfied. In general, packet loss at a node will occur less frequently with a larger buffer than with a smaller one. However, a larger buffer may also lead to larger packet delays. For most data applications, an excessive packet delay has the same effect as a packet loss and will trigger a retransmission of the delayed packet. Thus, there is a tradeoff between throughput and delay in regulating network traffic. A key challenge in flow control is to achieve a good delay-throughput tradeoff among connections with possibly different service requirements competing for network resources. In addition to buffer size, another issue in buffer management for flow control has to do with the order in which packets of various connections are stored into and transmitted out of the buffer. The simplest way to buffer packets is to implement a single first-in-first-out (FIFO) queueing structure, in which buffered packets of all connections are transmitted on a first-come-first-served (FCFS) basis. In other words, when a packet arrives at a node and needs to be buffered, it is put at the end of a single queue, regardless of which connection it belongs to. The packet at the front of the queue is always the first to be transmitted. Since the order of packet arrivals
determines the order of packet departures, it is difficult to support different service requirements for connections with FIFO queueing. Moreover, since traffic from various connections is mixed into the same queue, some connections can overutilize the buffer space, thereby preventing other connections from using it. A more sophisticated approach is to implement a separate queue for each connection, that is, per-connection queueing, so that buffered packets of different connections are isolated from one another. With per-connection queueing, a scheduling mechanism is used to decide, at any instant, which connection can transmit its packet. Compared with FIFO queueing, per-connection queueing is more expensive to implement, but offers greater flexibility in exercising flow control. For traffic with minimal and similar service requirements, FIFO queueing is usually sufficient. When different classes of traffic with different levels of service requirements are mixed together onto the same link, per-connection queueing may be necessary (4). In the following sections, the problems of packet scheduling and packet discarding, which are closely related to buffer management for flow control, will be discussed.
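The contrast between a shared FIFO and per-connection queueing can be illustrated with a small round-robin scheduler sketch (class and variable names are illustrative, not from any standard):

```python
# Sketch of per-connection queueing with round-robin service.
# Unlike a single shared FIFO, one connection's burst cannot
# delay every other connection's packets behind it.

from collections import deque

class RoundRobinScheduler:
    def __init__(self):
        self.queues = {}        # one queue per connection
        self.order = deque()    # round-robin rotation of active connections

    def enqueue(self, conn, packet):
        if conn not in self.queues:
            self.queues[conn] = deque()
        if conn not in self.order:
            self.order.append(conn)
        self.queues[conn].append(packet)

    def dequeue(self):
        # Serve the next connection that has a packet waiting; a
        # connection that floods its own queue cannot starve the others.
        while self.order:
            conn = self.order.popleft()
            q = self.queues[conn]
            if q:
                packet = q.popleft()
                self.order.append(conn)   # back to the end of the rotation
                return conn, packet
        return None

sched = RoundRobinScheduler()
for p in range(3):
    sched.enqueue("A", f"A{p}")      # connection A bursts three packets
sched.enqueue("B", "B0")             # connection B sends one
# Round-robin interleaves service: B0 is not stuck behind A's burst.
served = [sched.dequeue()[1] for _ in range(4)]
assert served == ["A0", "B0", "A1", "A2"]
```

With a shared FIFO the service order would have been A0, A1, A2, B0; the per-connection rotation is what makes differentiated treatment possible.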
Packet Scheduling

In general, a network node can have multiple incoming and outgoing links. Packets can be buffered at either the entrance or the exit interface of a node; the former is called input buffering and the latter output buffering. With input buffering, packets of different connections from the same incoming link of a node are first buffered before they are transmitted to different outgoing links. To resolve possible contention caused by packets from different incoming links transmitting to the same outgoing link, a scheduling mechanism is required to determine which packet should be transmitted at any instant. Moreover, when input buffering is implemented using FIFO queueing, packets of slow connections at the front of the queue will block packets of fast connections that follow. This is usually called the head-of-line (HOL) blocking problem. In fact, it is known that under certain traffic assumptions (e.g., a packet from each incoming link is equally likely to be transmitted to any outgoing link of the node), with input FIFO buffering, at most 58% of the maximum possible throughput can be achieved (5). On the other hand, with output buffering, a packet arriving at a node is immediately transferred to the interface of its destined outgoing link, and is buffered there before it is transmitted. In this case, there is no HOL blocking problem, and no scheduling mechanism is needed to resolve transmission contention among packets from different incoming links. However, since packets from all incoming links can potentially go to the same outgoing link at any instant, a node with output buffering needs to transfer packets at a rate equal to the aggregate speed of all incoming links. This also means that faster (and thus more expensive) switching hardware is usually required for nodes with output buffering than with input buffering.
The problem of packet scheduling is further complicated by the fact that different connections may have different bandwidth or service requirements. Thus, a scheduling mechanism is needed to selectively expedite or transmit the buffered packets of various connections. For example, if each connection should share the bandwidth of an outgoing link equally, a node can transmit buffered packets of each connection in a round-robin fashion. Similarly, when a particular connection has a minimum bandwidth requirement of R packets/second, the node should schedule transmissions so that, on average, at least one packet of that connection is transmitted every 1/R s. Furthermore, a node may wish to delay the transmission of the packets of some connections in order to avoid or relieve congestion further along the paths used by those connections, when such congestion information is available at the node. Simple FIFO queueing at a node often cannot support such scheduling mechanisms, and more expensive per-connection queueing is necessary when the scheduling constraints are stringent. More information on packet scheduling can be found in Refs. 6 and 7.

Packet Discarding

Since there is only a finite amount of available buffer space at a node, packets will be discarded if congestion persists. When packets of a connection are discarded, whether they will be retransmitted depends on the service requirements of that connection. For example, when the connection is a file-transfer application, where each packet carries essential information, discarded packets need to be retransmitted by the source. The retransmission is usually performed if the receipt of a packet has not been acknowledged within a time-out period. In the case of a TCP connection, the destination returns to the source an acknowledgment packet corresponding to each data packet received, and the retransmission time-out is determined dynamically (8). On the other hand, for real-time traffic such as voice or video, discarded packets are usually not retransmitted, because delayed information is useless in such cases.
A common approach to flow control for real-time traffic is to assign different priority levels to packets so that packets of highest priority will be discarded least often, if at all. Before a connection is established, the network may exercise a call-admission mechanism to ensure that the transmission of these highest priority packets can be maintained above a certain rate in order to support a minimum acceptable level of quality of service. This approach is also applicable to data traffic, where each discarded packet needs to be retransmitted. In this case, a network may offer several different classes of service with different priority levels in terms of packet discarding. When a connection is established, it can negotiate with the network to which service it wants to subscribe and, subsequently, during periods of congestion its packets will be discarded, based on their priority level. It is sometimes desirable to discard packets even when buffer space is still available, particularly if FIFO queueing is used. This is because a connection overutilizing the buffer space in a FIFO queue will cause packets of other connections sharing the FIFO queue to be discarded. Thus, packets should be discarded if they belong to connections that utilize more than their fair share of buffer space, or if they may cause packets of higher priority to be discarded. Furthermore, if packets are to be discarded further down the path, because of congestion there, they should be discarded as early as possible, to avoid wasting additional network resources unnecessarily (9,10).
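The priority-based discarding described above can be sketched as a bounded buffer that, when full, discards the lowest-priority packet first (a simplified illustration; real switches use per-class queues and discard thresholds, and the names here are invented):

```python
# Sketch of priority-based packet discarding at a congested node.
# Higher numeric priority = more important = discarded last.

import heapq

class PriorityDropBuffer:
    """Bounded buffer: when full, an arriving packet displaces the
    lowest-priority buffered packet, or is itself dropped if every
    buffered packet already has higher priority."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []   # min-heap ordered by priority (lowest dropped first)
        self.seq = 0     # tie-breaker preserving arrival order

    def enqueue(self, priority, packet):
        entry = (priority, self.seq, packet)
        self.seq += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
            return None                      # accepted, nothing discarded
        lowest = self.heap[0]
        if lowest[0] < priority:
            heapq.heapreplace(self.heap, entry)
            return lowest[2]                 # accepted; low-priority packet discarded
        return packet                        # buffer full of higher-priority traffic

buf = PriorityDropBuffer(capacity=2)
assert buf.enqueue(1, "low") is None
assert buf.enqueue(3, "high") is None
assert buf.enqueue(2, "mid") == "low"    # lowest-priority packet is pushed out
assert buf.enqueue(0, "bulk") == "bulk"  # arriving packet itself is dropped
```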
Figure 4. Fair bandwidth allocation: connection A traverses links 1 and 2, connection B uses link 1, and connection C uses link 2.
Fair Allocation of Bandwidth

In addition to limiting delay and buffer overflow, fairness in network use is another objective of flow control. It is difficult to define a simple notion of fairness when connections with different service requirements are present in the network. Here, only a particular notion of fairness in bandwidth allocation will be discussed. First consider a simple network with two links and three connections, as shown in Fig. 4. For the present, assume that each link has equal capacity, supporting 1 unit/s of traffic. If the fairness criterion is to allocate an equal rate to each connection, then each connection should get a rate of 1/2 unit/s, and the total network throughput in this case would be 3/2 units/s. Note, however, that the maximum network throughput is 2 units/s, which can be achieved by shutting off connection A and allowing connections B and C each to transmit 1 unit/s of traffic. This example shows that fairness and throughput are two independent (and sometimes conflicting) objectives of flow control. Now suppose the capacity of link 1 is reduced, say to 1/2 unit/s. In this case, connections A and B can share the bandwidth of link 1 equally, resulting in a throughput of 1/4 unit/s for each. However, it would be a waste of bandwidth in link 2 if connection C were allocated less than 3/4 unit/s of bandwidth; it would be unfair if the bandwidth allocated to connection C were more, since that would further restrict the bandwidth allocated to connection A. This example motivates the notion of max–min fairness, which refers to maximizing the bandwidth allocated to the connections with the minimum allocation. More formally, a set of connections has a max–min fair bandwidth allocation if the bandwidth allocated to any connection C′ cannot be increased without decreasing the bandwidth allocated to some other connection whose allocation is already no larger than that of C′. For example, in Fig. 4 with the capacity of link 1 being 1/2 unit/s, one cannot increase the bandwidth allocated to connection C above 3/4 unit/s without making the bandwidth of connection A smaller than 1/4 unit/s. Max–min fairness can also be defined in terms of the notion of a bottleneck link. With respect to some bandwidth allocation, a particular link L is a bottleneck link for a connection C′ that traverses L if the bandwidth of L is fully utilized and the bandwidth allocated to C′ is no less than the bandwidth allocated to any other connection traversing L. A max–min fair bandwidth allocation can then be shown to be equivalent to the condition that each connection has a bottleneck link with respect to that allocation. The notion of max–min fairness must be modified if each connection requires a minimum guaranteed data rate. One possible way is first to define the excess capacity of each
link L to be the bandwidth of L minus the aggregate guaranteed rates of all the connections that traverse L. Then a set of connections has a max–min fair bandwidth allocation if the excess capacity of each link is shared in a max–min fair manner (according to the notion defined earlier). More information on fair queueing algorithms and their performance can be found in Refs. 11–14.

Window Flow Control

The oldest and most common flow-control mechanism used in networks is window flow control. Window flow control has been used since the inception of packet-switched data networks, and it appears in X.25, SNA, and TCP/IP networks (1,2). Window flow control regulates the rate at which sessions can insert packets into the network with a simple acknowledgment mechanism. Within a given session, the destination sends an acknowledgment to the source for every packet that it receives. With a window size of W, the source is limited to having W outstanding packets for which an acknowledgment has not been received. Hence, the window scheme limits the number of packets that a given session can have inside the network to the window size, W. These packets can be either in buffers throughout the network or propagating on transmission lines. This strategy is typically implemented using a sliding transmission window, where the start of the window corresponds to the oldest packet for which an acknowledgment has not yet been received. Only packets from within the window can be transmitted, and the window is advanced as acknowledgments for earlier packets are received. An example with W = 4 is shown in Fig. 5. One reason that this strategy is very popular is its similarity to window-based retransmission mechanisms (e.g., Go Back N or SRP), which are used for error control in data networks, making it easy to implement in conjunction with the error-control scheme.
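The sliding-window bookkeeping described above can be sketched as a minimal simulation (class and variable names are illustrative, not from any protocol standard):

```python
# Minimal sketch of sliding-window flow control.
# A source may have at most W unacknowledged packets outstanding;
# the window advances as acknowledgments for the oldest packets arrive.

from collections import deque

class SlidingWindowSender:
    def __init__(self, window_size):
        self.W = window_size
        self.next_seq = 0           # next sequence number to send
        self.outstanding = deque()  # sent but not yet acknowledged

    def can_send(self):
        return len(self.outstanding) < self.W

    def send(self):
        assert self.can_send(), "window is full"
        seq = self.next_seq
        self.outstanding.append(seq)
        self.next_seq += 1
        return seq

    def ack(self, seq):
        # Cumulative acknowledgment: everything up to `seq` is confirmed,
        # so the start of the window slides forward.
        while self.outstanding and self.outstanding[0] <= seq:
            self.outstanding.popleft()

sender = SlidingWindowSender(window_size=4)
sent = [sender.send() for _ in range(4)]  # packets 0..3 fill the window
assert not sender.can_send()              # no ACKs yet: source must stall
sender.ack(0)                             # ACK for packet 0 arrives
assert sender.can_send()                  # window slides; packet 4 may go
```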
While window flow control, in effect, limits the number of packets that a given session can have in the network, it also indirectly regulates the rate of the session. Suppose that the round-trip delay for transmitting a packet and receiving its acknowledgment is D seconds. Then with a window of size W, a session can transmit at most r = W/D packets per second. This is because, after sending a full window of packets, the sender must wait for the acknowledgment of the first packet before it can send any new packets. As delays in the network increase (i.e., D increases), the maximum session rate r is forced to decrease, producing the desired effect of slowing transmissions down at times of congestion. As congestion is alleviated, D decreases, allowing sessions to increase their transmission rates. One problem with window flow control is that it cannot be used for sessions that require guaranteed
data rates, such as real-time traffic, because as delays through the network vary, the rate of the session is forced to vary as well. Another problem is the choice of window size. On the one hand, one would like to keep window sizes small, in order to limit the number of packets in the network and prevent congestion. On the other hand, one would also want to allow sessions to transmit at the maximum rate at times when there is no congestion in the network. Consider a network where the transmission time for a packet is X. In order to allow unimpeded transmission, the window size W must be greater than D/X. That is, the window size must be large enough to allow a session to transmit packets continuously while waiting for acknowledgments to return. Clearly, when W is greater than D/X, flow control is not active (i.e., the session can transmit at the maximum rate of 1/X packets per second), and when W is smaller than D/X, flow control is active and the session transmits at a rate of W/D < 1/X packets per second. The problem is in choosing a window size that both allows sessions unimpeded transmission when there is no congestion and also prevents congestion from building up in the network. When there is no congestion in the network, the primary source of delay is propagation delay. Since propagation delay is present regardless of congestion, the window size should be big enough to allow unimpeded transmission when propagation is the only source of delay in the network. Hence, if the propagation delay is equal to Dp, then the window size should be at least Dp/X, allowing transmission at a rate of 1/X packets per second when the only source of delay is propagation. This is particularly needed in high-speed networks, where propagation delays can be relatively large. However, this can lead to the use of very large windows.
Consider, for example, transmission over a satellite, where the round-trip propagation and signal-processing delays can be on the order of a second. Suppose that the transmission rate is 10^6 bits/s and that the packet size is 1000 bits. In order to allow sessions to transmit at the full rate of 10^6 bits/s, the window size must be at least 1000 packets. Hence, as many as 1000 packets can be in the network at once for each session. With so many packets in flight simultaneously, attempting to control congestion in the network becomes very difficult. First, the window mechanism becomes somewhat ineffective, because delays that are due to congestion are likely to be relatively small compared with the propagation delay. Recalling that when flow control becomes active the allowable session rate is r = W/D, and since the increase in delay due to congestion is small compared with the overall delay, the result is only a small decrease in the session rate. Also, with very large windows, sufficient buffering must be present throughout the network, to prevent buffer overflows
Figure 5. Sliding window mechanism with W = 4.
in the event that congestion sets in. Clearly, a mechanism is needed to dynamically alter the window size allocated to a session, based on estimated traffic conditions in the network. In this way, when the network is not congested the window size can be increased, to allow unimpeded transmission, but, as congestion begins to set in, the window size can be reduced to yield a more effective control of the allowed session rate. An example of a dynamic window adjustment mechanism is given in Ref. 15. In order to be able to adjust the window size in response to congestion, a mechanism must exist to provide feedback to the source nodes, regarding the status of congestion in the network. There are many ways in which this can be done, and finding the best such method is an area of active research. One approach, for example, would require nodes in the network, upon experiencing congestion, to send special packets (sometimes called choke packets) to the source nodes, notifying them of the congestion. In response to these choke packets, the source nodes would reduce their window size. Other mechanisms attempt to measure congestion in the network, by observing the delay experienced by packets and reducing the window size as delay increases. Yet another mechanism used by the Transmission Control Protocol (TCP) reduces the window size, in response to lost packets (packets for which an acknowledgment was not received). This is done based on the assumption that lost packets are due to buffer overflows and are a result of congestion. The flow-control mechanism used by TCP, and some of the problems associated with it, will be discussed in more detail in the next section. One problem with using end-to-end windows for flow control is that, when congestion sets in on some link in the network, the node preceding that link will have to buffer a large number of packets. 
Consider, for example, a session operating over a multi-hop network, and suppose that congestion sets in at one of the links along its path. With a window size of W, as many as W packets can be sent by the session into the network without receiving an acknowledgment. When a link becomes congested, all W packets associated with that session will arrive at the congested link and have to be buffered at the node preceding it. With many simultaneous sessions, this can lead to significant buffering requirements at every node. An alternative, known as link-by-link window flow control, establishes windows for a session along every link between the source and the destination. These windows can be tailored to the specific link, so that a long-delay link (e.g., a satellite link) would have a large window, while a short-delay link would have a smaller window. These link-by-link windows can be much smaller than end-to-end windows and, as a result, the amount of buffering at each node can be significantly reduced. In effect, link-by-link windows distribute the buffering in the network evenly among all of the nodes, rather than requiring the congested nodes to handle all of the packets. Of course, link-by-link windows are not always possible; in networks that use datagram routing, for example, sessions do not use a fixed path between the source and destination, so windows cannot be set up on a link-by-link basis.
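The window-sizing arithmetic above can be checked against the satellite numbers used earlier (the rates and sizes come from the example in the text):

```python
# Window sizing for unimpeded transmission: the window must cover the
# bandwidth-delay product, W >= D / X, where X is the per-packet
# transmission time and D the round-trip delay.

link_rate_bps = 1_000_000        # 10^6 bits/s
packet_bits = 1000
round_trip_s = 1.0               # satellite round-trip delay, about a second

X = packet_bits / link_rate_bps  # transmission time per packet: 1 ms
W_min = round_trip_s * link_rate_bps / packet_bits  # = D / X

assert W_min == 1000             # 1000 outstanding packets required

# With a smaller window the session rate is throttled to W / D:
W = 100
rate_pkts_per_s = W / round_trip_s
assert rate_pkts_per_s == 100    # one-tenth of the link's 1000 pkt/s capacity
```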
Rate Flow Control

One problem with window flow control is that in very-high-speed networks, where propagation delays are relatively large, very large windows are required, making window flow control ineffective. Another problem is that window flow control cannot be used for sessions that require guaranteed data rates, such as real-time traffic, because as delays through the network vary, the rate of the session is forced to vary as well. An alternative mechanism, which is more appropriate for high-speed networks and real-time traffic, is based on explicitly controlling the rate at which users are allowed to transmit. For a session that requires an average data rate of r packets per second, a strict implementation of a rate-control scheme would allow the session to transmit exactly one packet every 1/r s. Such an implementation would amount to time-division multiplexing (TDM), which is appropriate for constant-rate traffic but inefficient for bursty data traffic. Data sessions typically do not demand a constant transmission rate but are rather bursty, so that, at times, little if any transmission is required, and at other times much higher rates are required. A more appropriate mechanism for supporting a bursty data session with an average rate of r packets per second is to allow the transmission of B packets every B/r seconds. In this way, bursts of up to B packets can be accommodated. A common method for accomplishing this form of flow control is the leaky bucket method, shown in Fig. 6. In this scheme, a session of rate r has a ''bucket'' of permits for its use. The bucket is constantly fed new permits at a rate of one every 1/r s, and it can hold at most B permits. In order for a packet to enter the network, it must first obtain a permit from the bucket. If the bucket has no more permits, the packet must wait until a permit becomes available. It is easy to see that, in this way, up to B packets can burst into the network all at once. An important parameter in the design of a leaky bucket rate-control scheme is the bucket size B. Clearly, a small bucket size would result in a strict rate-control scheme and would be ineffective for bursty traffic. However, too large a bucket would be ineffective in controlling congestion. Again, as with the dynamic adjustment of window size in the window flow-control scheme, it is sometimes desirable to dynamically alter the bucket size and rate given to a session based on traffic conditions in the network.

Figure 6. Leaky bucket flow control. Permits arrive at the bucket one every 1/r s, and a packet must obtain a permit before entering the network.

FLOW CONTROL IN PRACTICE

TCP Flow Control

The transmission control protocol (TCP) is the most commonly used transport-layer protocol in today's Internet. Virtually all session-based traffic in the Internet uses TCP. Among other things, TCP is responsible for flow control. There are a number of different TCP implementations (16–18), the details of which vary slightly from one another. The details of a particular standard are not emphasized here; rather, the general concepts that guide TCP flow control (19) are described. TCP controls the flow of traffic in a session using end-to-end windows. The key to TCP flow control is the window size allocated for a given connection. For each connection, TCP determines a maximum allowable window size, Wmax. The value of Wmax is typically a function of the particular TCP implementation; most TCP implementations use a value of Wmax that is somewhere between 4 kbytes and 16 kbytes (20). Upon connection setup, the value of Wmax is determined, based on the version of TCP used by the end stations. Once the maximum window size is determined, communication can begin. However, in order to prevent a new connection from overwhelming the system, communication does not begin with the maximum window size. Rather, communication starts with a window size of W = 1 packet, typically around 512 bytes, and the window size is gradually increased, in what is known as a slow-start phase. During the slow-start phase, the window size is increased by one packet for every acknowledgment that returns from the destination. Therefore, the window size is doubled with every successful transmission of a complete window. The slow-start phase continues until the window size reaches half of the maximum window size, at which point the communication enters what is known as the congestion-avoidance phase. During the congestion-avoidance phase, the window size is increased by one packet for every successful transmission of a full window. Hence, during the congestion-avoidance phase, the window size is increased much more slowly than during slow start.
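The slow-start and congestion-avoidance growth rules can be sketched at per-round-trip granularity (a simplification: real TCP counts bytes and individual acknowledgments, and the W_max value here is illustrative):

```python
# Sketch of TCP-style window evolution: exponential growth (slow start)
# up to half of W_max, then linear growth (congestion avoidance),
# one step per round-trip in which a full window is acknowledged.

def next_window(w, w_max):
    ssthresh = w_max // 2
    if w < ssthresh:
        return min(2 * w, ssthresh)   # slow start: window doubles
    return min(w + 1, w_max)          # congestion avoidance: +1 packet

w, w_max = 1, 32                      # start with a single-packet window
trace = [w]
for _ in range(20):
    w = next_window(w, w_max)
    trace.append(w)

assert trace[:5] == [1, 2, 4, 8, 16]  # exponential up to ssthresh = 16
assert trace[5:9] == [17, 18, 19, 20] # then linear growth
assert max(trace) <= w_max            # never exceeds the maximum window

# On a detected loss, the implementations described in the text
# collapse the window back to one packet and repeat the cycle:
w_after_loss = 1
```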
The window size continues to increase in this way until it reaches its maximum value of Wmax. The above discussion described how TCP sets its initial window size. In addition, TCP adjusts the window size in response to congestion in the network. TCP assumes that any packets lost in the network (e.g., packets that are not acknowledged in time) are lost due to buffer overflows resulting from congestion. In response, upon detecting a lost packet, TCP reduces the window size. Most TCP implementations (20) reduce the window size to one packet, after which the window size is increased gradually back toward the maximum value, in accordance with the slow-start and congestion-avoidance algorithms described above. While TCP has been used effectively for many years, there are many shortcomings to its flow-control mechanism that make it ineffective for networks of the future. First, as discussed for general window flow control, it is not an effective mechanism for supporting sessions that require guaranteed data rates, such as real-time traffic. Second, as network transmission speeds increase, the window size needed to maintain unimpeded transmission must be very large, especially over long-delay links. With most versions of TCP having a maximum allowable window of around 16 kbytes, this is much too small for future high-speed networks (21,22). Furthermore, TCP's treatment of lost packets as if they were due to congestion may be appropriate for networks that experience very little loss due to transmission errors. However, in wireless or satellite networks, lost packets are likely to be due to transmission errors; hence, the TCP response of closing the window is not appropriate and results in significant performance degradation (23). Finally, the TCP slow-start mechanism, which gives a session a small window and gradually increases the window size with time, prevents TCP from taking full advantage of the high transmission capacity offered by networks of the future.

Flow Control in ATM Networks

Asynchronous transfer mode (ATM) is a network technology developed to carry integrated traffic, including data, voice, images, and video. ATM carries all traffic on a stream of fixed-size packets (cells), each comprising 5 bytes of header information and a 48-byte information field (payload). The reason for choosing a fixed-size packet is to ensure that the switching and multiplexing functions can be carried out quickly and easily. ATM is a connection-oriented technology, in the sense that, before two systems on the network can communicate, they must inform all intermediate switches about their service requirements and traffic parameters. This is similar to telephone networks, where a fixed path is set up from the calling party to the receiving party. In ATM networks, each connection is called a virtual circuit or virtual channel (VC), because it also allows the capacity of each link to be shared by the connections using that link on a demand basis, rather than by fixed allocations. The connections allow the network to guarantee quality of service (QoS) by limiting the number of VCs. Typically, a user declares key service requirements at the time of connection setup, declares the traffic parameters, and may agree to control these parameters dynamically as demanded by the network. The available bit rate (ABR) service is one of the services in ATM developed to support data traffic. In other ATM services, network resources are allocated during connection establishment, and sources are not controllable by feedback after a connection is established.
ABR service, on the other hand, performs end-to-end rate-based flow control, requiring data sources to adapt their rates to their proper share of the available bandwidth on the basis of feedback information obtained from the network. The feedback information is carried in the resource management (RM) cells of each connection. These RM cells are generated by the source between blocks of data cells and returned by the receiver in the backward direction. The feedback control operates in one of two modes: explicit binary indication or explicit rate indication. The explicit binary indication mode assumes that a congested network node will mark a specific field (equivalent to a binary bit) in the header of any passing data cell. The receiver monitors the fields of the data cells it receives and sets the congestion fields in the backward RM cells appropriately. The data sources can then increase, decrease, or hold their current rates, based on the congestion information contained in the backward RM cells they receive. The explicit rate indication mode assumes that network nodes are capable of computing the proper share of the available bandwidth for each source and writing this amount into a specific field in passing RM cells. The source then adjusts its rate to no more than the amount indicated in the RM cells. While the standards for the ABR service specify the general behavior of the source and the receiver, the specific mechanism that governs when each network node should set the congestion field, or how it should compute the explicit rate, is
NETWORK FLOW AND CONGESTION CONTROL
left to the discretion of network equipment vendors. Design objectives for such a mechanism include maximal utilization of network bandwidth, fairness in network use, and low cost, in terms of algorithm complexity and buffer space. For a good historical account of the development of the ABR service in ATM, see Ref. 24. Detailed descriptions of various approaches and mechanisms for flow control in ATM networks can also be found in Refs. 24–28.

ADVANCED ISSUES

Because most Internet applications are currently supported using TCP, much research in the networking community has focused on improving TCP performance. It has been recognized that, in a high-latency network environment, the window flow-control mechanism of TCP may not be very effective, because it relies on packet loss to signal congestion, instead of avoiding congestion and buffer overflow (29). For bulk data connections, the arrival time of the last packet of data is of primary concern to the users, whereas the delays of individual packets are not important. However, for some interactive applications, such as Telnet, the user is sensitive to the delay of individual packets. For such low-bandwidth, delay-sensitive TCP traffic, unnecessary packet drops and packet retransmissions lead to significant delays perceived by the users. It is suggested in some recent work (30) that the performance of TCP can be significantly improved if intermediate routers can detect incipient congestion and explicitly inform the TCP source to throttle its data rate before any packet loss occurs. This explicit congestion notification (ECN) mechanism would require modifications of existing TCP protocols. For example, a new ECN field can be implemented in the packet header; an IP router monitors its queue size and, during congestion, marks the ECN field of passing packets, and the mark is echoed back to the source in an acknowledgment packet. The TCP source will then slow down after receiving an acknowledgment with the ECN field marked.
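The ECN idea described above can be illustrated with a small sketch. The names, the marking threshold, and the window parameters below are hypothetical, and the scheme is deliberately simplified from the proposal of Ref. 30: the receiver's echo of the mark is folded directly into the acknowledgment, and a hard queue threshold stands in for a probabilistic marking rule.

```python
from collections import deque

QUEUE_THRESHOLD = 8   # hypothetical marking threshold (packets)
W_MAX = 32            # hypothetical maximum window (packets)

class EcnRouter:
    """Router that marks the ECN field of packets arriving at a congested queue."""
    def __init__(self, service_rate):
        self.queue = deque()
        self.service_rate = service_rate  # packets forwarded per time step

    def enqueue(self, packet):
        # Incipient congestion: mark the packet instead of waiting for overflow.
        if len(self.queue) >= QUEUE_THRESHOLD:
            packet["ecn"] = True
        self.queue.append(packet)

    def serve(self):
        n = min(self.service_rate, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]

class EcnSource:
    """TCP-like source: additive increase; halve the window on a marked ack."""
    def __init__(self):
        self.window = 1.0

    def on_ack(self, marked):
        if marked:
            self.window = max(1.0, self.window / 2)   # back off before any loss
        else:
            self.window = min(W_MAX, self.window + 1.0 / self.window)

source, router = EcnSource(), EcnRouter(service_rate=4)
marks = 0
for step in range(60):
    for _ in range(int(source.window)):   # send one window of packets
        router.enqueue({"ecn": False})
    for pkt in router.serve():            # ack echoes the congestion mark
        marks += pkt["ecn"]
        source.on_ack(pkt["ecn"])

print(f"marked acks: {marks}, final window: {source.window:.1f}")
```

Because the source's window grows past the router's service rate, the queue builds until the threshold is crossed, marks begin to flow back, and the window oscillates below its maximum without a single packet being dropped.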
An additional motivation for using ECN mechanisms in TCP/IP networks concerns the possibility of TCP/IP traffic traversing networks that have their own congestion-control mechanisms (e.g., the ABR service in ATM). Figure 7 shows a typical network scenario in which TCP traffic generated from a source connected to a LAN (e.g., Ethernet) is aggregated through an edge router onto an ATM network. Congestion occurs at the edge router when the bandwidth available in the ATM network cannot support the aggregated traffic generated from the LAN. Existing implementations of TCP rely only on packet drops as an indication of congestion, to throttle
Figure 7. A typical network scenario where TCP traffic flows over an ATM network.
the source rates. By incorporating ECN mechanisms in TCP protocols, TCP sources can be informed of congestion at the network edges and will reduce their rates before any packet loss occurs. The use of such ECN mechanisms to inform TCP sources of congestion would be independent of the congestion-control mechanisms within the ATM networks. Instead of incorporating ECN mechanisms in TCP, which requires modifications of TCP, it is proposed in Ref. 31 that congestion can be controlled by withholding, at the network edges, the acknowledgments returned to the TCP sources. Such a mechanism has the effect of translating the available bandwidth in the ATM network into an appropriately timed sequence of acknowledgments. A key advantage of this mechanism is that it does not require any changes in the TCP end-system software.

There are also research efforts aimed at improving TCP performance in a wireless network environment. The main problem here is that noise in the wireless transmission medium can often corrupt TCP packets. Such corrupted packets are usually discarded at the destination, and the flow-control mechanism of TCP will mistakenly treat the loss as an indication of congestion and throttle the source rate. Several solutions have been proposed to address this problem, most of which modify TCP by decoupling the flow-control loop in the wireless medium from that in the wireline network. Readers are referred to Ref. 32 for a detailed description of these mechanisms.

BIBLIOGRAPHY

1. D. P. Bertsekas and R. Gallager, Data Networks, Englewood Cliffs, NJ: Prentice-Hall, 1987.
2. M. Schwartz, Telecommunication Networks: Protocols, Modeling and Analysis, Reading, MA: Addison-Wesley, 1987.
3. R. Jain, Myths about congestion management in high speed networks, Internetwork.: Res. Exp., 3 (3): 101–113, 1992.
4. N. McKeown, P. Varaiya, and J. Walrand, Scheduling calls in an input-queued switch, Electron. Lett., 29 (25): 2174–2175, 1993.
5. M. J. Karol, M. G. Hluchyj, and S. P. Morgan, Input versus output queueing in a space-division packet switch, IEEE Trans. Commun., 35: 1347–1356, 1987.
6. J. Rexford et al., Scalable architectures for integrated traffic shaping and link scheduling in high-speed ATM switches, IEEE J. Sel. Areas Commun., 15 (5): 938–950, 1997.
7. Hui Zhang, Service disciplines for guaranteed performance service in packet-switching networks, Proc. IEEE, 83: 1374–1396, 1995.
8. A. Romanow and S. Floyd, Dynamics of TCP traffic over ATM networks, IEEE J. Sel. Areas Commun., 13 (4): 633–641, 1995.
9. S. Floyd and V. Jacobson, Random early detection gateways for congestion avoidance, IEEE/ACM Trans. Network., 1 (4): 397–413, 1993.
10. H. Li et al., On TCP performance in ATM networks with per-VC early packet discard mechanisms, Comput. Commun., 19 (13): 1065–1076, 1996.
11. A. Demers, S. Keshav, and S. Shenker, Analysis and simulation of a fair queueing algorithm, Proc. ACM SIGCOMM, 19 (4): 1–12, 1989.
12. J. C. R. Bennett and Hui Zhang, Hierarchical packet fair queueing algorithms, IEEE/ACM Trans. Network., 5 (5): 875–889, 1997.
13. A. K. Parekh and R. G. Gallager, A generalized processor sharing approach to flow control in integrated services networks: The single-node case, IEEE/ACM Trans. Network., 1 (3): 344–357, 1993.
14. S. J. Golestani, Network delay analysis of a class of fair queueing algorithms, IEEE J. Sel. Areas Commun., 13 (6): 1057–1070, 1995.
15. D. Mitra and J. B. Seery, Dynamic adaptive windows for high-speed data networks with multiple paths and propagation delays, Comput. Networks ISDN Syst., 25 (6): 663–679, 1993.
16. V. Jacobson, Berkeley TCP evolution from 4.3-Tahoe to 4.3-Reno, Proc. 18th Internet Eng. Task Force, Vancouver, 1990.
17. L. S. Brakmo and L. Peterson, TCP Vegas: End-to-end congestion avoidance on a global internet, IEEE J. Sel. Areas Commun., 13 (8): 1465–1480, 1995.
18. L. S. Brakmo, S. W. O'Malley, and L. Peterson, TCP Vegas: New techniques for congestion detection and avoidance, Comput. Commun. Rev., 24 (4): 24–35, 1994.
19. V. Jacobson, Congestion avoidance and control, Proc. ACM SIGCOMM, 18: 314–329, 1988.
20. W. R. Stevens, TCP/IP Illustrated, Vol. 1: The Protocols, Reading, MA: Addison-Wesley, 1994.
21. V. Jacobson, R. Braden, and D. Borman, TCP extensions for high performance, Internet Eng. Task Force, 1992, RFC 1323.
22. R. C. Durst, G. J. Miller, and E. J. Travis, TCP extensions for space communications, Wireless Networks, 3 (5): 389–403, 1997.
23. T. V. Lakshman and U. Madhow, The performance of TCP/IP for networks with high bandwidth-delay products and random loss, IEEE/ACM Trans. Network., 5 (3): 336–350, 1997.
24. H. Ohsaki et al., Rate-based congestion control for ATM networks, Comput. Commun. Rev., 25 (2): 60–72, 1995.
25. F. Bonomi and K. W. Fendick, The rate-based flow control framework for the available bit rate ATM service, IEEE Network, 9 (2): 25–39, 1995.
26. R. Jain, S. Kalyanaraman, and R. Viswanathan, The OSU scheme for congestion avoidance in ATM networks: Lessons learned and extensions, Perform. Eval., 31 (1–2): 67–88, 1997.
27. P. Narvaez and K.-Y. Siu, Optimal feedback control for ABR service in ATM, Proc. Int. Conf. Network Protocols, Atlanta, GA, 1997, pp. 32–41.
28. K.-Y. Siu and H.-Y. Tzeng, Intelligent congestion control for ABR service in ATM networks, Comput. Commun. Rev., 24 (5): 81–106, 1994.
29. K. Fall and S. Floyd, Simulation-based comparisons of Tahoe, Reno, and SACK TCP, Comput. Commun. Rev., 26 (3): 5–21, 1996.
30. S. Floyd, TCP and explicit congestion notification, Comput. Commun. Rev., 24 (5): 8–23, 1994.
31. P. Narvaez and K.-Y. Siu, An acknowledgment bucket scheme to regulate TCP flow over ATM, Proc. IEEE Globecom, 3: 1838–1844, 1997.
32. H. Balakrishnan et al., A comparison of mechanisms for improving TCP performance over wireless links, IEEE/ACM Trans. Network., 5 (6): 756–769, 1997.

EYTAN MODIANO
MIT Lincoln Laboratory

KAI-YEUNG SIU
MIT d'Arbeloff Laboratory for Information Systems and Technology
NETWORK INTERCONNECTION. See INTERNETWORKING.
Wiley Encyclopedia of Electrical and Electronics Engineering

Network Management
Standard Article
Yechiam Yemini, Columbia University, New York, NY
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5319
Article Online Posting Date: December 27, 1999
Abstract | Full Text: HTML PDF (120K)
The sections in this article are: Challenges and Problems; Architecture of Network Management Systems
NETWORK MANAGEMENT
The term network management is often used in an imprecise way to capture multiple meanings. The first part, network, can mean the entire range of network communications and computing systems and services, or just the subset of these associated with the physical and network layers; in the latter case one distinguishes network management from system management. Management means both a collection of operations tasks handled by network and system administrators and support staff, and the technologies and software tools intended to simplify these tasks. This article uses the term network management in its broadest sense. Network here means