38• Networking
Contents of this section: Asynchronous Transfer Mode Networks; Channel Coding; Client–Server Systems; Code Division Multiple Access; Data Compression for Networking; Ethernet; Group Communication; High-Speed Protocols; Intelligent Networks; Internetworking; ISO OSI Layered Protocol Model; Local Area Networks; Metropolitan Area Networks; Mobile Network Objects; Multicast; Multiple Access Schemes;
Network Flow and Congestion Control; Network Management; Network Operating Systems; Network Performance and Queueing Models; Network Reliability and Fault-Tolerance; Network Routing Algorithms; Network Security Framework; Network Security Fundamentals; Remote Procedure Calls; Signaling; Telephone Networks; Token Ring Local Area Networks; Wireless Networks.
Wiley Encyclopedia of Electrical and Electronics Engineering
Asynchronous Transfer Mode Networks
Standard Article
Tatsuya Suda, University of California, Irvine, Irvine, CA
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5301
Online Posting Date: December 27, 1999
Abstract. The sections in this article are: ATM Standards; ATM Traffic Control; Hardware Switch Architectures for ATM Networks; Continuing Research in ATM Networks.
ASYNCHRONOUS TRANSFER MODE NETWORKS
Asynchronous transfer mode, or ATM, is a network transfer technique capable of supporting a wide variety of multimedia applications with diverse service and performance requirements. It supports traffic bandwidths ranging from a few kilobits per second (e.g., a text terminal) to several hundred megabits per second (e.g., high-definition video) and traffic types ranging from continuous, fixed-rate traffic (e.g., traditional telephony and file transfer) to highly bursty traffic (e.g., interactive data and video). Because of its support for such a wide range of traffic, ATM was designated by the telecommunication standardization sector of the International Telecommunications Union (ITU-T, formerly CCITT) as the multiplexing and switching technique for Broadband, or high-speed, ISDN (B-ISDN) (1).

ATM is a form of packet-switching technology. That is, ATM networks transmit their information in small, fixed-length packets called cells, each of which contains 48 octets (or bytes) of data and 5 octets of header information. The small, fixed cell size was chosen to facilitate the rapid processing of packets in hardware and to minimize the amount of time required to fill a single packet. This is particularly important for real-time applications such as voice and video that require short packetization delays.

ATM is also connection-oriented. In other words, a virtual circuit must be established before a call can take place, where a call is defined as the transfer of information between two or more endpoints. The establishment of a virtual circuit entails the initiation of a signaling process, during which a route is selected according to the call’s quality of service requirements, connection identifiers at each switch on the route are established, and network resources such as bandwidth and buffer space may be reserved for the connection.
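The 53-octet cell format described above can be sketched in a few lines of Python. This is an illustrative sketch only: the constants and the `make_cell` helper are names introduced here, not part of any ATM API, and the header is treated as an opaque 5-byte string.

```python
# Sketch of the fixed-size cell described above: 5 header octets plus 48
# payload octets, 53 bytes in total. Names are illustrative.
HEADER_SIZE, PAYLOAD_SIZE = 5, 48
CELL_SIZE = HEADER_SIZE + PAYLOAD_SIZE

def make_cell(header: bytes, payload: bytes) -> bytes:
    """Assemble one 53-byte ATM cell, zero-padding a short payload."""
    if len(header) != HEADER_SIZE:
        raise ValueError("ATM cell header must be exactly 5 octets")
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("ATM cell payload cannot exceed 48 octets")
    return header + payload.ljust(PAYLOAD_SIZE, b"\x00")

# Why the small cell matters for voice: a 64 kbit/s source needs only
# 48 * 8 / 64000 s = 6 ms to fill one payload, keeping packetization
# delay short.
fill_time_ms = PAYLOAD_SIZE * 8 / 64_000 * 1000
```

The 6 ms figure illustrates the packetization-delay argument: a larger packet would hold a real-time voice source hostage for proportionally longer before the first bit could be sent.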
Another important characteristic of ATM is that its network functions are typically implemented in hardware. With the introduction of high-speed fiber optic transmission lines, the communication bottleneck has shifted from the communication links to the processing at switching nodes and at terminal equipment. Hardware implementation is necessary to overcome this bottleneck because it minimizes the cell-processing overhead, thereby allowing the network to match link rates on the order of gigabits per second.

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

Finally, as its name indicates, ATM is asynchronous. Time is slotted into cell-sized intervals, and slots are assigned to
calls in an asynchronous, demand-based manner. Because slots are allocated to calls on demand, ATM can easily accommodate traffic whose bit rate fluctuates over time. Moreover, in ATM, no bandwidth is consumed unless information is actually transmitted. ATM also gains bandwidth efficiency by being able to multiplex bursty traffic sources statistically. Because bursty traffic does not require continuous allocation of the bandwidth at its peak rate, statistical multiplexing allows a large number of bursty sources to share the network’s bandwidth.

Since its birth in the mid-1980s, ATM has been fortified by a number of robust standards and realized by a significant number of network equipment manufacturers. International standards-making bodies such as the ITU and independent consortia like the ATM Forum have developed a significant body of standards and implementation agreements for ATM (1,4). As networks and network services continue to evolve toward greater speeds and diversities, ATM will undoubtedly continue to proliferate.
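The statistical-multiplexing argument can be made concrete with a toy model: assume n independent on/off sources, each active with some probability, on a link sized for fewer than n simultaneous peaks. The binomial model and the numbers chosen below are illustrative assumptions, not from the article.

```python
from math import comb

def overload_probability(n_sources, p_active, capacity_in_sources):
    """P(more than `capacity_in_sources` of n independent on/off sources
    burst simultaneously), under a simple binomial model of bursty traffic."""
    return sum(comb(n_sources, k) * p_active**k * (1 - p_active)**(n_sources - k)
               for k in range(capacity_in_sources + 1, n_sources + 1))

# 50 bursty sources, each active 10% of the time, on a link sized for
# only 15 simultaneous peaks: overload is already very unlikely, so the
# link can be sized well below the 50-peak worst case.
p_overload = overload_probability(50, 0.1, 15)
```

The design trade-off in the text is visible here: shrinking the reserved capacity raises the overload (and hence cell loss) probability, so the network balances bandwidth efficiency against loss.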
Figure 2. Functions of each layer in the protocol reference model. [Higher layers: higher layer functions. AAL: convergence sublayer (CS); segmentation and reassembly (SAR). ATM layer: generic flow control; cell header generation/extraction; cell VPI/VCI translation; cell multiplex and demultiplex. Physical layer, transmission convergence (TC) sublayer: cell rate decoupling; header error control (HEC); cell delineation; transmission frame adaptation; transmission frame generation/recovery. Physical layer, physical medium (PM) sublayer: bit timing; physical medium. Layer management spans all layers.]
ATM STANDARDS

The telecommunication standardization sector of the ITU, the international standards agency commissioned by the United Nations for the global standardization of telecommunications, has developed a number of standards for ATM networks. Other standards bodies and consortia (e.g., the ATM Forum, ANSI) have also contributed to the development of ATM standards. This section presents an overview of the standards, with particular emphasis on the protocol reference model used by ATM (2).

Protocol Reference Model

The B-ISDN protocol reference model, defined in ITU-T recommendation I.321, is shown in Fig. 1 (1). The purpose of the protocol reference model is to clarify the functions that ATM networks perform by grouping them into a set of interrelated, function-specific layers and planes. The reference model consists of a user plane, a control plane, and a management plane. Within the user and control planes is a hierarchical set of layers. The user plane defines a set of functions for the transfer of user information between communication endpoints; the control plane defines control functions such as call establishment, call maintenance, and call release; and the management plane defines the operations necessary to control information flow between planes and layers and to maintain accurate and fault-tolerant network operation.

Figure 1. Protocol reference model for ATM. [The user plane and control plane each contain higher layers above the ATM adaptation layer, the ATM layer, and the physical layer; the management plane comprises plane management and layer management.]

Within the user and control planes, there are three layers: the physical layer, the ATM layer, and the ATM adaptation layer (AAL). Figure 2 summarizes the functions of each layer (1). The physical layer performs primarily bit-level functions, the ATM layer is primarily responsible for the switching of ATM cells, and the ATM adaptation layer is responsible for the conversion of higher-layer protocol frames into ATM cells. The functions that the physical, ATM, and adaptation layers perform are described in more detail next.

Physical Layer

The physical layer is divided into two sublayers: the physical medium sublayer and the transmission convergence sublayer (1).
Physical Medium Sublayer. The physical medium (PM) sublayer performs medium-dependent functions. For example, it provides bit transmission capabilities including bit alignment, line coding, and electrical/optical conversion. The PM sublayer is also responsible for bit timing (i.e., the insertion and extraction of bit timing information). The PM sublayer currently supports two types of interface: optical and electrical.

Transmission Convergence Sublayer. Above the physical medium sublayer is the transmission convergence (TC) sublayer, which is primarily responsible for the framing of data transported over the physical medium. The ITU-T recommendation specifies two options for the TC sublayer transmission frame structure: cell-based and synchronous digital hierarchy (SDH). In the cell-based case, cells are transported continuously without any regular frame structure. Under SDH, cells are carried in a special frame structure based on the North American SONET (synchronous optical network) protocol (3). Regardless of which transmission frame structure is used, the TC sublayer is responsible for the following four functions: cell rate decoupling, header error control, cell delineation, and transmission frame adaptation. Cell rate decoupling is the insertion of idle cells at the sending side to adapt the ATM cell
Figure 3. ATM cell header structure. [UNI header, octets 1 to 5: GFC, VPI, VCI, PT, CLP, HEC. NNI header, octets 1 to 5: VPI, VCI, PT, CLP, HEC.]
stream’s rate to the rate of the transmission path. Header error control is the insertion of an 8-bit CRC in the ATM cell header to protect the contents of the ATM cell header. Cell delineation is the detection of cell boundaries. Transmission frame adaptation is the encapsulation of departing cells into an appropriate framing structure (either cell-based or SDH-based).

ATM Layer

The ATM layer lies atop the physical layer and specifies the functions required for the switching and flow control of ATM cells (1). There are two interfaces in an ATM network: the user-network interface (UNI) between the ATM endpoint and the ATM switch, and the network-network interface (NNI) between two ATM switches. Although a 48-octet cell payload is used at both interfaces, the 5-octet cell header differs slightly at these interfaces. Figure 3 shows the cell header structures used at the UNI and NNI (1). At the UNI, the header contains a 4-bit generic flow control (GFC) field, a 24-bit label field containing virtual path identifier (VPI) and virtual channel identifier (VCI) subfields (8 bits for the VPI and 16 bits for the VCI), a 3-bit payload type (PT) field, a 1-bit cell loss priority (CLP) field, and an 8-bit header error check (HEC) field. The cell header for an NNI cell is identical to that for the UNI cell, except that it lacks the GFC field; these four bits are used for an additional 4 VPI bits in the NNI cell header.

The VCI and VPI fields are identifier values for the virtual channel (VC) and virtual path (VP), respectively. A virtual channel connects two ATM communication endpoints. A virtual path connects two ATM devices, which can be switches or endpoints, and several virtual channels may be multiplexed onto the same virtual path. The 3-bit PT field identifies whether the cell payload contains data or control information. The CLP bit is used by the user for explicit indication of cell loss priority.
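The UNI header layout can be sketched as a bit-packing routine. The field widths follow the standard 5-octet header, in which the PT field occupies 3 bits so that the six fields total exactly 40 bits. The helper names are introduced here for illustration; the HEC computation uses the CRC-8 generator x^8 + x^2 + x + 1 with the 0x55 coset offset specified for ATM in ITU-T I.432.

```python
def crc8_hec(data: bytes) -> int:
    """CRC-8 over the first four header octets (generator x^8 + x^2 + x + 1),
    XORed with the 0x55 coset used by the ATM HEC."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc ^ 0x55

def pack_uni_header(gfc: int, vpi: int, vci: int, pt: int, clp: int) -> bytes:
    """Pack the 5-octet UNI cell header: GFC(4) VPI(8) VCI(16) PT(3) CLP(1) HEC(8)."""
    word = (gfc << 28) | (vpi << 20) | (vci << 4) | (pt << 1) | clp
    first_four = word.to_bytes(4, "big")
    return first_four + bytes([crc8_hec(first_four)])
```

For an NNI header the GFC bits would instead extend the VPI to 12 bits, but the overall 40-bit layout is otherwise the same.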
If the value of the CLP is 1, then the cell is subject to discarding in case of congestion. The HEC field is an 8-bit CRC that protects the contents of the cell header. The GFC field, which appears only at the UNI, is used to assist the customer premises network in controlling the traffic flow. At the time of writing, the exact procedures for use of this field have not been agreed upon.

ATM Layer Functions

The primary function of the ATM layer is VPI/VCI translation. As ATM cells arrive at ATM switches, the VPI and VCI values contained in their headers are examined by the switch to determine which output port should be used to forward the cell. In the process, the switch translates the cell’s original VPI and VCI values into new outgoing VPI and VCI values, which are used in turn by the next ATM switch to send the cell toward its intended destination. The table used to perform this translation is initialized during the establishment of the call. An ATM switch may either be a VP switch, in which case it translates only the VPI values contained in cell headers, or it may be a VP/VC switch, in which case it translates the incoming VPI/VCI value into an outgoing VPI/VCI pair. Because VPI and VCI values do not represent a unique end-to-end virtual connection, they can be reused at different switches through the network. This is important because the VPI and VCI fields are limited in length and would be quickly exhausted if they were used simply as destination addresses.

The ATM layer supports two types of virtual connections: switched virtual connections (SVC) and permanent, or semipermanent, virtual connections (PVC). Switched virtual connections are established and torn down dynamically by an ATM signaling procedure. That is, they exist only for the duration of a single call. Permanent virtual connections, on the other hand, are established by network administrators and continue to exist as long as the administrator leaves them up, even if they are not used to transmit data.

Other important functions of the ATM layer include cell multiplexing and demultiplexing, cell header creation and extraction, and generic flow control. Cell multiplexing is the merging of cells from several calls onto a single transmission path, cell header creation is the attachment of a 5-octet cell header to each 48-octet block of user payload, and generic flow control is used at the UNI to prevent short-term overload conditions from occurring within the network.

ATM Layer Service Categories

The ATM Forum and ITU-T have defined several distinct service categories at the ATM layer (1,4).
The categories defined by the ATM Forum include constant bit rate (CBR), real-time variable bit rate (VBR-rt), non-real-time variable bit rate (VBR-nrt), available bit rate (ABR), and unspecified bit rate (UBR). ITU-T defines four service categories, namely, deterministic bit rate (DBR), statistical bit rate (SBR), available bit rate (ABR), and ATM block transfer (ABT). The first three ITU-T service categories correspond roughly to the ATM Forum’s CBR, VBR, and ABR classifications, respectively. The fourth service category, ABT, is solely defined by ITU-T and is intended for bursty data applications. The UBR category defined by the ATM Forum is for calls that request no quality of service guarantees at all. Figure 4 lists the ATM service categories, their quality of service (QoS) parameters,
Figure 4. ATM layer service categories. [The figure pairs the ITU-T service categories (DBR, SBR, ABT, ABR) with the corresponding ATM Forum service categories (CBR, VBR-rt, VBR-nrt, ABR, UBR), and lists for each whether the cell loss rate, cell transfer delay, and cell delay variation are specified or unspecified, along with the required traffic descriptors (PCR/CDVT; SCR/BT; MCR/ACR). PCR = peak cell rate; SCR = sustained cell rate; CDVT = cell delay variation tolerance; BT = burst tolerance; MCR = minimum cell rate; ACR = allowed cell rate.]
and the traffic descriptors required by the service category during call establishment (1,4).

The constant bit rate (or deterministic bit rate) service category provides a very strict QoS guarantee. It is targeted at real-time applications, such as voice and raw video, which mandate severe restrictions on delay, delay variance (jitter), and cell loss rate. The only traffic descriptors required by the CBR service are the peak cell rate and the cell delay variation tolerance. A fixed amount of bandwidth, determined primarily by the call’s peak cell rate, is reserved for each CBR connection.

The real-time variable bit rate (or statistical bit rate) service category is intended for real-time bursty applications (e.g., compressed video), which also require strict QoS guarantees. The primary difference between CBR and VBR-rt is in the traffic descriptors they use. The VBR-rt service requires the specification of the sustained (or average) cell rate and burst tolerance (i.e., burst length) in addition to the peak cell rate and the cell delay variation tolerance. The ATM Forum also defines a VBR-nrt service category, in which cell delay variance is not guaranteed.

The available bit rate service category is defined to exploit the network’s unused bandwidth. It is intended for non-real-time data applications in which the source is amenable to enforced adjustment of its transmission rate. A minimum cell rate is reserved for the ABR connection and therefore guaranteed by the network. When the network has unused bandwidth, ABR sources are allowed to increase their cell rates up to an allowed cell rate (ACR), a value that is periodically updated by the ABR flow control mechanism (to be described in the section entitled ‘‘ATM Traffic Control’’). The value of ACR always falls between the minimum and the peak cell rate for the connection and is determined by the network.
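The ABR rate adjustment might be sketched as follows. The update rule shown (additive increase by a rate increase factor, multiplicative decrease by a rate decrease factor) is a simplified stand-in for the actual ABR flow control mechanism; the function name and parameters are illustrative, but the invariant matches the text: ACR always stays between MCR and PCR.

```python
def update_acr(acr, mcr, pcr, rif, rdf, congested):
    """One simplified ACR adjustment step: additive increase by a rate
    increase factor (RIF) when spare bandwidth is reported, multiplicative
    decrease by a rate decrease factor (RDF) under congestion, always
    clamped to the [MCR, PCR] contract."""
    acr = acr * (1 - rdf) if congested else acr + rif * pcr
    return max(mcr, min(pcr, acr))
```

Whatever the feedback says, the clamp guarantees the source never falls below its reserved minimum cell rate nor exceeds its declared peak.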
The ATM Forum defines another service category for non-real-time applications called the unspecified bit rate (UBR) service category. The UBR service is entirely best effort; the call is provided with no QoS guarantees.

The ITU-T also defines an additional service category for non-real-time data applications. The ATM block transfer service category is intended for the transmission of short bursts, or blocks, of data. Before transmitting a block, the source requests a reservation of bandwidth from the network. If the ABT service is being used with the immediate transmission option (ABT/IT), the
block of data is sent at the same time as the reservation request. If bandwidth is not available for transporting the block, then it is simply discarded, and the source must retransmit it. In the ABT service with delayed transmission (ABT/DT), the source waits for a confirmation from the network that enough bandwidth is available before transmitting the block of data. In both cases, the network temporarily reserves bandwidth according to the peak cell rate for each block. Immediately after transporting the block, the network releases the reserved bandwidth.

ATM Adaptation Layer

The ATM adaptation layer, which resides atop the ATM layer, is responsible for mapping the requirements of higher-layer protocols onto the ATM network (1). It operates in ATM devices at the edge of the ATM network and is totally absent in ATM switches. The adaptation layer is divided into two sublayers: the convergence sublayer (CS), which performs error detection and handling, timing, and clock recovery; and the segmentation and reassembly (SAR) sublayer, which performs segmentation of convergence sublayer protocol data units (PDUs) into ATM cell-sized SAR sublayer service data units (SDUs) and vice versa.

In order to support different service requirements, the ITU-T has proposed four AAL-specific service classes. Figure 5 depicts the four service classes defined in recommendation I.362 (1). Note that even though these AAL service classes are similar in many ways to the ATM layer service categories defined in the previous section, they are not the same; each exists at a different layer of the protocol reference model, and each requires a different set of functions. AAL service class A corresponds to constant bit rate services with a timing relation required between source and destination. The connection mode is connection-oriented. CBR audio and video belong to this class. Class B corresponds to variable bit rate (VBR) services.
This class also requires timing between source and destination, and its mode is connection-oriented. VBR audio and video are examples of class B services. Class C also corresponds to VBR connection-oriented services, but the timing between source and destination need not be related. Class C includes connection-oriented data transfer such as X.25, signaling, and future high-speed data services. Class D corresponds to connectionless services. Connectionless data services such as those supported by LANs and MANs are examples of class D services.

Figure 5. Service classification for AAL. [Class A: timing relation between source and destination required; constant bit rate; connection-oriented. Class B: timing required; variable bit rate; connection-oriented. Class C: timing not required; variable bit rate; connection-oriented. Class D: timing not required; variable bit rate; connectionless.]

Four AAL types (Types 1, 2, 3/4, and 5), each with a unique SAR sublayer and CS sublayer, are defined to support the four service classes. AAL Type 1 supports constant bit rate services (class A), and AAL Type 2 supports variable bit rate services with a timing relation between source and destination (class B). AAL Type 3/4 was originally specified as two different AAL types (Type 3 and Type 4), but because of their inherent similarities, they were eventually merged to support both class C and class D services. AAL Type 5 also supports class C and class D services.

AAL Type 5. Currently, the most widely used adaptation layer is AAL Type 5. AAL Type 5 supports connection-oriented and connectionless services in which there is no timing relation between source and destination (classes C and D). Its functionality was intentionally made simple in order to support high-speed data transfer. AAL Type 5 assumes that the layers above the ATM adaptation layer can perform error recovery, retransmission, and sequence numbering when required, and thus, it does not provide these functions. Therefore, only nonassured operation is provided; lost or corrupted AAL Type 5 packets will not be corrected by retransmission.

Figure 6 depicts the SAR-SDU format for AAL Type 5 (5,6). The SAR sublayer of AAL Type 5 performs segmentation of a CS-PDU into a size suitable for the SAR-SDU payload. Unlike other AAL types, Type 5 devotes the entire 48-octet payload of the ATM cell to the SAR-SDU; there is no overhead. An AAL-specific flag (end-of-frame) in the ATM PT field of the cell header is set when the last cell of a CS-PDU is sent. The reassembly of CS-PDU frames at the destination is controlled by using this flag.

Figure 6. SAR-SDU format for AAL Type 5. [Each cell carries a 5-octet cell header followed by a 48-octet SAR-SDU payload.]

Figure 7 depicts the CS-PDU format for AAL Type 5 (5,6). It contains the user data payload, along with any necessary padding bits (PAD) and a CS-PDU trailer, which are added by the CS sublayer when it receives the user information from the higher layer. The CS-PDU is padded using 0 to 47 bytes of PAD field to make the length of the CS-PDU an integral multiple of 48 bytes (the size of the SAR-SDU payload). At the receiving end, a reassembled PDU is passed to the CS sublayer from the SAR sublayer, and CRC values are then calculated and compared. If there is no error, the PAD field is removed by using the value of the length field (LF) in the CS-PDU trailer, and user data is passed to the higher layer. If an error is detected, the erroneous information is either delivered to the user or discarded according to the user’s choice. The use of the CF field is for further study.
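The padding, trailer, and segmentation steps can be sketched in Python. The trailer layout follows the CS-PDU description (CF, LF, CRC preceded by PAD); the helper name is introduced for illustration, and `zlib.crc32` stands in for the AAL5 CRC-32, whose exact bit ordering differs from zlib’s.

```python
import zlib

def aal5_cells(user_data: bytes) -> list:
    """Build an AAL5 CS-PDU (user data + PAD + 8-octet trailer) and segment
    it into 48-octet SAR payloads. Trailer: CF (2 octets), LF (2 octets),
    CRC (4 octets); zlib.crc32 is an illustrative stand-in for the real
    AAL5 CRC-32."""
    trailer_len = 8
    pad_len = (-(len(user_data) + trailer_len)) % 48   # 0 to 47 octets of PAD
    padded = user_data + b"\x00" * pad_len
    cf = b"\x00\x00"                        # control field: use is for further study
    lf = len(user_data).to_bytes(2, "big")  # length of the user data only
    body = padded + cf + lf
    crc = zlib.crc32(body).to_bytes(4, "big")
    cs_pdu = body + crc
    return [cs_pdu[i:i + 48] for i in range(0, len(cs_pdu), 48)]
```

Note how the PAD length is chosen so that user data plus the 8-octet trailer lands exactly on a 48-octet boundary, so every SAR payload is full; the receiver recovers the original data length from LF.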
Figure 7. CS-PDU format, segmentation and reassembly of AAL Type 5. [At the CS layer, PAD (0 to 47 bytes) and a trailer of CF (control field, 2 bytes), LF (length field, 2 bytes), and CRC (cyclic redundancy check, 4 bytes) are appended to the user data; at the SAR layer the CS-PDU is segmented into AAL5 cells, with an indication in the last cell.]

AAL Type 1. AAL Type 1 supports constant bit rate services with a fixed timing relation between source and destination users (class A). At the SAR sublayer, it defines a 48-octet service data unit (SDU), which contains 47 octets of user payload, 4 bits for a sequence number, and a 4-bit CRC value to detect errors in the sequence number field. AAL Type 1 performs the following services at the CS sublayer: forward error correction to ensure high quality of audio and video applications, clock recovery by monitoring the buffer filling, explicit time indication by inserting a time stamp in the CS-PDU, and handling of lost and misinserted cells that are recognized by the SAR. At the time of writing, the CS-PDU format has not been decided.

AAL Type 2. AAL Type 2 supports variable bit rate services with a timing relation between source and destination (class B). AAL Type 2 is nearly identical to AAL Type 1, except that it transfers service data units at a variable bit rate, not at a constant bit rate. Furthermore, AAL Type 2 accepts variable-length CS-PDUs, and thus, there may exist some SAR-SDUs that are not completely filled with user data. The CS sublayer for AAL Type 2 performs the following functions: forward error correction for audio and video services, clock recovery by inserting a time stamp in the CS-PDU, and handling of lost and misinserted cells. At the time of writing, both the SAR-SDU and CS-PDU formats for AAL Type 2 are still under discussion.

AAL Type 3/4. AAL Type 3/4 mainly supports services that require no timing relation between the source and destination (classes C and D). At the SAR sublayer, it defines a 48-octet service data unit, with 44 octets of user payload; a 2-bit payload type field to indicate whether the SDU is at the beginning, middle, or end of a CS-PDU; a 4-bit cell sequence number; a 10-bit multiplexing identifier that allows several CS-PDUs to be multiplexed over a single VC; a 6-bit cell payload length indicator; and a 10-bit CRC code that covers the payload. The CS-PDU format allows for up to 65,535 octets of user payload and contains a header and trailer to delineate the PDU. The functions that AAL Type 3/4 performs include segmentation and reassembly of variable-length user data and error handling. It supports message mode (for framed data transfer) as well as streaming mode (for streamed data transfer). Because Type 3/4 is mainly intended for data services, it provides a retransmission mechanism if necessary.

ATM Signaling
ATM follows the principle of out-of-band signaling that was established for N-ISDN. In other words, signaling and data channels are separate. The main purposes of signaling are (1) to establish, maintain, and release ATM virtual connections and (2) to negotiate (or renegotiate) the traffic parameters of new (or existing) connections (7). The ATM signaling standards support the creation of point-to-point as well as multicast connections. Typically, certain VCI and VPI values are reserved by ATM networks for signaling messages. If additional signaling VCs are required, they may be established through the process of metasignaling.
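The reserved-channel idea can be illustrated with a small lookup. The specific values below (on VPI 0: meta-signaling on VCI 1, general broadcast signaling on VCI 2, point-to-point signaling on VCI 5) are commonly cited UNI defaults and should be treated as assumptions here, since the article does not enumerate them; the helper name is illustrative.

```python
# Hypothetical table of well-known VCIs reserved for signaling on VPI 0.
RESERVED_SIGNALING_VCIS = {
    1: "meta-signaling",
    2: "general broadcast signaling",
    5: "point-to-point signaling",
}

def is_signaling_cell(vpi: int, vci: int) -> bool:
    """True when the cell arrives on one of the default reserved signaling
    channels; ordinary user data connections use other VPI/VCI values."""
    return vpi == 0 and vci in RESERVED_SIGNALING_VCIS
```

Additional signaling channels beyond these defaults would be set up through meta-signaling, as the text notes.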
ATM TRAFFIC CONTROL

The control of ATM traffic is complicated as a result of ATM’s high link speed and small cell size, the diverse service requirements of ATM applications, and the diverse characteristics of ATM traffic. Furthermore, the configuration and size of the ATM environment, either local or wide area, has a significant impact on the choice of traffic control mechanisms.

The factor that most complicates traffic control in ATM is its high link speed. Typical ATM link speeds are 155.52 Mbit/s and 622.08 Mbit/s. At these high link speeds, 53-byte ATM cells must be switched at rates greater than one cell per 2.726 µs or 0.682 µs, respectively. It is apparent that the cell processing required by traffic control must perform at speeds comparable to these cell-switching rates. Thus, traffic control should be simple and efficient, without excessive software processing.

Such high speeds render many traditional traffic control mechanisms inadequate for use in ATM because of their reactive nature. Traditional reactive traffic control mechanisms attempt to control network congestion by responding to it after it occurs and usually involve sending feedback to the source in the form of a choke packet. However, a large bandwidth-delay product (i.e., the amount of traffic that can be sent in a single propagation delay time) renders many reactive control schemes ineffective in high-speed networks. When a node receives feedback, it may have already transmitted a large amount of data. Consider a cross-continental 622 Mbit/s connection with a propagation delay of 20 ms (propagation-bandwidth product of 12.4 Mbit). If a node at one end of the connection experiences congestion and attempts to throttle the source at the other end by sending it a feedback packet, the source will already have transmitted over 12 Mbit of information before feedback arrives.
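The cell-switching rates and the propagation-bandwidth product quoted above follow directly from the numbers in the text, and can be checked with a few lines of arithmetic:

```python
CELL_BITS = 53 * 8  # 424 bits per 53-byte cell

# One cell time at each typical link speed: 424 / 155.52e6 s and
# 424 / 622.08e6 s, i.e. roughly 2.726 µs and 0.682 µs.
for link_bps in (155.52e6, 622.08e6):
    cell_time_us = CELL_BITS / link_bps * 1e6
    print(f"{link_bps / 1e6:.2f} Mbit/s -> one cell every {cell_time_us:.3f} µs")

# Cross-continental example from the text: 622.08 Mbit/s, 20 ms delay.
bdp_bits = 622.08e6 * 0.020
print(f"propagation-bandwidth product: {bdp_bits / 1e6:.2f} Mbit")
```

The computation makes the scale of the problem explicit: a feedback packet that takes one propagation delay to arrive is already more than twelve megabits too late.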
This example illustrates the ineffectiveness of traditional reactive traffic control mechanisms in high-speed networks and argues for novel mechanisms that take into account high propagation-bandwidth products.

Not only is traffic control complicated by high speeds, but it is also made more difficult by the diverse QoS requirements of ATM applications. For example, many applications have strict delay requirements, and their data must be delivered within a specified amount of time. Other applications have strict loss requirements, and their data must be delivered reliably without an inordinate amount of loss. Traffic controls must address the diverse requirements of such applications.

Another factor complicating traffic control in ATM networks is the diversity of ATM traffic characteristics. In ATM networks, continuous bit rate traffic is accompanied by bursty traffic. Bursty traffic generates cells at a peak rate for a very short period of time and then immediately becomes less active, generating fewer cells. To improve the efficiency of ATM network utilization, bursty calls should be allocated an amount of bandwidth that is less than their peak rate. This allows the network to multiplex more calls by taking advantage of the small probability that a large number of bursty calls will be simultaneously active. This type of multiplexing is referred to as statistical multiplexing. The problem then becomes one of determining how best to multiplex bursty calls statistically such that the number of cells dropped as a result of excessive burstiness is balanced with the number of bursty
traffic streams allowed. Addressing the unique demands of bursty traffic is an important function of ATM traffic control. For these reasons, many traffic control mechanisms developed for existing networks may not be applicable to ATM networks, and novel forms of traffic control are required (8,9). One class of novel mechanisms that works well in high-speed networks falls under the heading of preventive control. Preventive control attempts to manage congestion by preventing it before it occurs and is targeted primarily at real-time traffic. Another class of mechanisms, targeted at non-real-time data traffic, relies on novel reactive feedback mechanisms.

Preventive Traffic Control

Preventive control for ATM has two major components: call admission control and usage parameter control (8). Admission control determines whether to accept or reject a new call at the time of call set-up. This decision is based on the traffic characteristics of the new call and the current network load. Usage parameter control enforces the traffic parameters of the call after it has been accepted into the network. This enforcement is necessary to ensure that the call's actual traffic flow conforms with that reported during call admission.

Before describing call admission and usage parameter control in more detail, it is important to first discuss the nature of multimedia traffic. Most ATM traffic belongs to one of two general classes: continuous traffic and bursty traffic. Sources of continuous traffic (e.g., constant bit rate video, voice without silence detection) are easily handled because their resource utilization is predictable and they can be deterministically multiplexed. Bursty traffic (e.g., voice with silence detection, variable bit rate video), however, is characterized by its unpredictability, and this kind of traffic complicates preventive traffic control.
Burstiness is a parameter describing how densely or sparsely cell arrivals occur. There are a number of ways to express traffic burstiness, the most typical of which are the ratio of peak bit rate to average bit rate and the average burst length; several other measures of burstiness have also been proposed (8). Burstiness plays a critical role in determining network performance, and it is therefore critical for traffic control mechanisms to reduce the negative impact of bursty traffic.

Call Admission Control. Call admission control is the process by which the network decides whether to accept or reject a new call. When a new call requests access to the network, it provides a set of traffic descriptors (e.g., peak rate, average rate, average burst length) and a set of quality of service requirements (e.g., acceptable cell loss rate, acceptable cell delay variance, acceptable delay). The network then determines, through signaling, whether it has enough resources (e.g., bandwidth, buffer space) to support the new call's requirements. If it does, the call is accepted and allowed to transmit data into the network; otherwise it is rejected. Call admission control prevents network congestion by limiting the number of active connections to a level at which network resources are adequate to maintain quality of service guarantees.
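The accept/reject decision can be sketched in a few lines. This is an illustrative model only, not a real CAC algorithm: the class, its names, and the fixed-allocation policy are assumptions made for the example.

```python
# Hypothetical sketch of call admission control: accept a new call only if
# the link still has enough unallocated bandwidth for the requested amount.
# Names and the allocation policy are illustrative, not from any standard.
class Link:
    def __init__(self, capacity_bps: float):
        self.capacity_bps = capacity_bps
        self.allocated_bps = 0.0

    def admit(self, requested_bps: float) -> bool:
        """Reserve bandwidth for a new call, or reject it."""
        if self.allocated_bps + requested_bps <= self.capacity_bps:
            self.allocated_bps += requested_bps
            return True
        return False  # reject: QoS of existing calls would be at risk

link = Link(155.52e6)
print(link.admit(100e6))  # True: plenty of capacity remains
print(link.admit(100e6))  # False: only ~55 Mbit/s is left unallocated
```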
ASYNCHRONOUS TRANSFER MODE NETWORKS
One of the most common ways for an ATM network to make a call admission decision is to use the call's traffic descriptors and quality of service requirements to predict the "equivalent bandwidth" required by the call. The equivalent bandwidth determines how many resources the network must reserve to support the new call at its requested quality of service. For continuous, constant bit rate calls, determining the equivalent bandwidth is simple: it is merely the peak bit rate of the call. For bursty connections, however, the process should take into account such factors as a call's burstiness ratio (the ratio of peak bit rate to average bit rate), burst length, and burst interarrival time. The equivalent bandwidth for bursty connections must be chosen carefully to ameliorate congestion and cell loss while maximizing the number of connections that can be statistically multiplexed.

Usage Parameter Control. Call admission control is responsible for admitting or rejecting new calls. However, call admission by itself is ineffective if the call does not transmit data according to the traffic parameters it provided. Users may intentionally or accidentally exceed the traffic parameters declared during call admission, thereby overloading the network. To prevent users from violating their traffic contracts and driving the network into a congested state, each call's traffic flow is monitored and, if necessary, restricted. This is the purpose of usage parameter control. (Usage parameter control is also commonly referred to as policing, bandwidth enforcement, or flow enforcement.) To monitor a call's traffic efficiently, the usage parameter control function must be located as close as possible to the actual source of the traffic.
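Returning for a moment to the equivalent-bandwidth idea above, its qualitative behavior can be illustrated with a toy estimator. The interpolation formula below is invented for this sketch, not taken from the literature; it only captures the trend that stricter loss targets push a bursty call's allocation from its average rate toward its peak rate.

```python
import math

# Toy equivalent-bandwidth estimator. The interpolation weight is an invented
# illustration of the trend, not a published formula: stricter cell loss
# targets move the allocation from the average rate toward the peak rate.
def equivalent_bandwidth(peak_bps: float, avg_bps: float,
                         cell_loss_rate: float) -> float:
    if peak_bps == avg_bps:
        return peak_bps  # continuous, constant bit rate: peak-rate allocation
    w = min(1.0, -math.log10(cell_loss_rate) / 9.0)
    return avg_bps + w * (peak_bps - avg_bps)

print(equivalent_bandwidth(10e6, 10e6, 1e-6) / 1e6)  # CBR: 10.0 (peak rate)
print(equivalent_bandwidth(10e6, 2e6, 1e-3) / 1e6)   # lax target: between average and peak
print(equivalent_bandwidth(10e6, 2e6, 1e-9) / 1e6)   # strict target: at the peak
```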
An ideal usage parameter control mechanism should have the ability to detect parameter-violating cells, appear transparent to connections respecting their admission parameters, and respond rapidly to parameter violations. It should also be simple, fast, and cost effective to implement in hardware. To meet these requirements, several mechanisms have been proposed and implemented (8). The leaky bucket mechanism (originally proposed in Ref. 10) is a typical usage parameter control mechanism used for ATM networks. It can simultaneously enforce the average bandwidth and the burst factor of a traffic source. One possible implementation of the leaky bucket mechanism controls the traffic flow by means of tokens; a conceptual model is illustrated in Fig. 8. An arriving cell first enters a queue. If the queue is full, cells are simply discarded. To enter the network, a cell must first obtain a token from the token pool; if there is no token, the cell must wait in the queue until a new token is generated. Tokens are generated at a fixed rate corresponding to the average bit rate declared during call admission. If the
Figure 8. Leaky bucket mechanism (arriving cells enter a queue and must obtain a token from a token pool, fed by a token generator, before departing into the network).
number of tokens in the token pool exceeds some predefined threshold value, token generation stops. This threshold value corresponds to the burstiness of the transmission declared at call admission time; larger threshold values allow a greater degree of burstiness. This method enforces the average input rate while allowing for a certain degree of burstiness.

One disadvantage of the leaky bucket mechanism is that the bandwidth enforcement introduced by the token pool is in effect even when the network load is light and there is no need for enforcement. Another disadvantage is that the leaky bucket may mistake nonviolating cells for violating cells. When traffic is bursty, a large number of cells may be generated in a short period of time while still conforming to the traffic parameters claimed at the time of call admission. In such situations none of these cells should be considered violating cells, yet in practice the leaky bucket may erroneously identify them as violations of the admission parameters.

A virtual leaky bucket mechanism (also referred to as a marking method) alleviates these disadvantages (11). In this mechanism, violating cells, rather than being discarded or buffered, are permitted to enter the network at a lower priority (CLP = 1). These violating cells are discarded only when they arrive at a congested node; if there are no congested nodes along the routes to their destinations, the violating cells are transmitted without being discarded. The virtual leaky bucket mechanism can easily be implemented using the leaky bucket method described earlier: when the queue length exceeds a threshold, cells are marked as "droppable" instead of being discarded. The virtual leaky bucket method not only allows the user to take advantage of a light network load but also allows a larger margin of error in determining the token pool parameters.

Reactive Traffic Control

Preventive control is appropriate for most types of ATM traffic.
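Before turning to reactive mechanisms, the token-pool behavior just described can be sketched as a small simulation. This is an illustrative model (one token generated per time step; all names are invented), not a hardware design:

```python
from collections import deque

# Illustrative token-pool leaky bucket: tokens accrue at the declared average
# rate (one per tick here) up to a threshold that sets the allowed burstiness;
# a cell departs only by consuming a token, waits in the queue otherwise, and
# is dropped when the queue is full.
class LeakyBucket:
    def __init__(self, token_threshold: int, queue_limit: int):
        self.tokens = token_threshold           # pool starts full
        self.token_threshold = token_threshold  # burstiness allowance
        self.queue = deque()
        self.queue_limit = queue_limit
        self.dropped = 0

    def tick(self, arrivals: int) -> int:
        """One token-generation interval; returns cells sent this tick."""
        self.tokens = min(self.tokens + 1, self.token_threshold)
        for _ in range(arrivals):
            if len(self.queue) < self.queue_limit:
                self.queue.append("cell")
            else:
                self.dropped += 1
        sent = 0
        while self.queue and self.tokens > 0:
            self.queue.popleft()
            self.tokens -= 1
            sent += 1
        return sent

lb = LeakyBucket(token_threshold=3, queue_limit=4)
print(lb.tick(5))  # 3: a full pool lets part of the burst through at once
print(lb.tick(0))  # 1: the queue then drains at the average rate
```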
However, there are cases where reactive control is beneficial. For instance, reactive control is useful for service classes like ABR, which allow sources to use bandwidth not being used by calls in other service classes. Such a service would be impossible with preventive control because the amount of unused bandwidth in the network changes dynamically, and the sources can be made aware of the amount only through reactive feedback.

There are two major classes of reactive traffic control mechanisms: rate-based and credit-based (12,13). Most rate-based traffic control mechanisms establish a closed feedback loop in which the source periodically transmits special control cells, called resource management cells, to the destination (or destinations). The destination closes the feedback loop by returning the resource management cells to the source. As the feedback cells traverse the network, the intermediate switches examine their current congestion state and mark the feedback cells accordingly. When the source receives a returning feedback cell, it adjusts its rate, either decreasing it in the case of network congestion or increasing it in the case of network underuse. An example of a rate-based ABR algorithm is the Enhanced Proportional Rate Control Algorithm (EPRCA), which was proposed, developed, and tested through the course of ATM Forum activities (12).
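A greatly simplified sketch of this closed loop follows. It is not the actual EPRCA rule set; the additive-increase/multiplicative-decrease step and all parameter values are assumptions chosen for illustration:

```python
# Simplified rate-based feedback (illustrative, not the real EPRCA): on each
# returned resource-management cell the source decreases its rate
# multiplicatively if the cell was marked congested, and increases it
# additively otherwise, clamped between a minimum and the peak cell rate.
def adjust_rate(rate: float, congested: bool,
                peak: float = 300_000, minimum: float = 1_000,
                increase: float = 5_000,
                decrease_factor: float = 0.875) -> float:
    rate = rate * decrease_factor if congested else rate + increase
    return max(minimum, min(peak, rate))

rate = 100_000.0  # cells per second
for marked in (False, False, True, True, False):
    rate = adjust_rate(rate, marked)
print(round(rate))  # ramps up twice, backs off twice, then ramps again
```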
Credit-based mechanisms use link-by-link traffic control to eliminate loss and optimize utilization. Intermediate switches exchange resource management cells that contain "credits," which reflect the amount of buffer space available at the next downstream switch. A source cannot transmit a new data cell unless it has received at least one credit from its downstream neighbor. An example of a credit-based mechanism is the Quantum Flow Control (QFC) algorithm, developed by a consortium of researchers and ATM equipment manufacturers (13).
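The credit mechanism can be sketched as follows; the class is an invented illustration of per-link credits, not the actual QFC protocol:

```python
# Minimal sketch of credit-based, link-by-link flow control (illustrative):
# the upstream node may forward a cell only while it holds a credit, and the
# downstream node returns one credit each time it frees a buffer slot, so
# the downstream buffer can never overflow and no cell is lost.
class CreditLink:
    def __init__(self, downstream_buffers: int):
        self.credits = downstream_buffers  # one credit per free buffer slot
        self.downstream_queue = 0

    def send(self) -> bool:
        if self.credits == 0:
            return False                   # must wait: loss is impossible
        self.credits -= 1
        self.downstream_queue += 1
        return True

    def downstream_consumes(self) -> None:
        if self.downstream_queue > 0:
            self.downstream_queue -= 1
            self.credits += 1              # credit returned upstream

link = CreditLink(downstream_buffers=2)
print(link.send(), link.send(), link.send())  # True True False: no credit left
link.downstream_consumes()
print(link.send())                            # True: a credit came back
```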
HARDWARE SWITCH ARCHITECTURES FOR ATM NETWORKS

In ATM networks, information is segmented into fixed-length cells, and cells are asynchronously transmitted through the network. To match the transmission speed of the network links and to minimize protocol processing overhead, ATM performs the switching of cells in hardware switching fabrics, unlike traditional packet switching networks, where switching is largely performed in software. A large number of designs have been proposed and implemented for ATM switches (14). Although many differences exist, ATM switch architectures can be broadly classified into two categories: asynchronous time division (ATD) and space-division architectures.

Asynchronous Time Division Switches

ATD, or single-path, architectures provide a single, multiplexed path through the ATM switch for all cells; typically a bus or ring is used. Figure 9 shows the basic structure of the ATM switch proposed in (15): four input ports are connected to four output ports by a time-division multiplexing (TDM) bus. Each input port is allocated a fixed time slot on the TDM bus, and the bus is designed to operate at a speed equal to the sum of the incoming bit rates at all input ports. The TDM slot sizes are fixed and equal in length to the time it takes to transmit one ATM cell. Thus, during one TDM cycle, the four input ports can transfer four ATM cells to four output ports. In ATD switches, the maximum throughput is determined by the single, multiplexed path: a switch with N input ports and N output ports must run at a rate N times faster than the transmission links. Therefore, the total throughput of ATD switches is bounded by the current capabilities of device logic technology. Commercial examples of ATD switches are the Fore Systems ASX switch and Digital's VNswitch.
Figure 10. An 8 × 8 Banyan switch with binary switching elements.
Space-Division Switches

To eliminate the single-path limitation and increase total throughput, space-division ATM switches implement multiple paths through their switching fabrics. Most space-division switches are based on multistage interconnection networks, in which small switching elements (usually 2 × 2 cross-point switches) are organized into stages and provide multiple paths through the fabric. Rather than being multiplexed onto a single path, ATM cells are space-switched through the fabric. Three typical types of space-division switches are described next.

Banyan Switches. Banyan switches are examples of space-division switches. An N × N Banyan switch is constructed by arranging a number of binary switching elements into log2 N stages. Figure 10 depicts an 8 × 8 self-routing Banyan switch (14). The switch fabric is composed of twelve 2 × 2 switching elements assembled into three stages. From any of the eight input ports, it is possible to reach all eight output ports. One desirable characteristic of the Banyan switch is that it is self-routing. Because each cross-point switch has only two output lines, only one bit is required to specify the correct output path. Very simply, if the desired output address of an ATM cell is stored in the cell header in binary code, routing decisions for the cell can be made at each cross-point switch by examining the appropriate bit of the destination address. Although the Banyan switch is simple and possesses attractive features such as modularity, which makes it suitable for VLSI implementation, it also has some disadvantages. One of its disadvantages is that it is internally blocking; in other words, cells destined for different output ports may contend for a common link within the switch. This results in
Figure 9. A 4 × 4 asynchronous time division switch (four buffered input ports connected to four buffered output ports by a TDM bus).

Figure 11. Batcher–Banyan switch (input ports feed a Batcher sorting network followed by a Banyan routing network leading to the output ports).
Figure 12. A knockout (crossbar) switch (N input ports crossing bus interfaces at each of the N output ports).
blocking all cells that wish to use that link, except for one. Hence, the Banyan switch is referred to as a blocking switch. In Fig. 10, three cells are shown arriving on input ports 1, 3, and 4 with destination port addresses of 0, 1, and 5, respectively. The cell destined for output port 0 and the cell destined for output port 1 contend for the link between the second and third stages. As a result, only one of them (the cell from input port 1 in this example) actually reaches its destination (output port 0), while the other is blocked.

Batcher–Banyan Switches. Another example of a space-division switch is the Batcher–Banyan switch (14) (see Fig. 11). It consists of two multistage interconnection networks: a Batcher sorting network and a Banyan self-routing network. In the Batcher–Banyan switch, incoming cells first enter the sorting network, which sorts them into ascending order according to their output addresses. Cells then enter the Banyan network, which routes them to their correct output ports. As shown earlier, the Banyan switch is internally blocking. However, it possesses an interesting property: internal blocking can be avoided if the cells arriving at its input ports are sorted in ascending order by their destination addresses. The Batcher–Banyan switch takes advantage of this fact and uses the Batcher sorting network to sort the cells, thereby making the Batcher–Banyan switch internally nonblocking. The Starlite switch, designed at AT&T Bell Labs, is based on the Batcher–Banyan architecture (16).

Crossbar Switches. The crossbar switch interconnects N inputs and N outputs in a fully meshed topology; that is, there are N² cross points within the switch (14) (see Fig. 12). Because it is always possible to establish a connection between any arbitrary input and output pair, internal blocking is impossible in a crossbar switch.
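Both the Banyan blocking example and the Batcher remedy described above can be checked with a small model. Here the fabric is modeled as an 8 × 8 shuffle-exchange (omega) network, one common Banyan realization; the exact wiring of the figure may differ. Under this assumption, after stage k a cell occupies the internal link labeled by the first k destination bits followed by the remaining source bits, so two cells collide exactly when a label coincides.

```python
# Self-routing through a 3-stage (8 x 8) omega-type Banyan fabric.
# Assumption: omega wiring, so after stage k a (src -> dst) cell sits on the
# link labeled by the top k bits of dst followed by the bottom 3-k bits of src.
N_BITS = 3

def internal_links(src: int, dst: int):
    """(stage, link-label) pairs a cell occupies after each stage."""
    links = []
    for k in range(1, N_BITS + 1):
        dst_part = dst >> (N_BITS - k)
        src_part = src & ((1 << (N_BITS - k)) - 1)
        links.append((k, (dst_part << (N_BITS - k)) | src_part))
    return links

# The text's example: inputs 1, 3, and 4 sending to outputs 0, 1, and 5.
paths = {(s, d): internal_links(s, d) for s, d in [(1, 0), (3, 1), (4, 5)]}
print(paths[(1, 0)][1] == paths[(3, 1)][1])  # True: collision after stage 2

# Batcher fix: the same destinations, sorted and presented on inputs 0, 1, 2.
sorted_paths = [internal_links(s, d) for s, d in [(0, 0), (1, 1), (2, 5)]]
links_used = [link for path in sorted_paths for link in path]
print(len(links_used) == len(set(links_used)))  # True: no internal collisions
```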
The architecture of the crossbar switch has some advantages. First, it uses a simple two-state cross-point switch (open and connected state), which is easy to implement. Second, the modularity of the switch design allows simple expansion: one can build a larger switch by simply adding more cross-point switches. Lastly, compared to Banyan-based switches, the crossbar switch design results in low transfer latency, because it has the smallest number of connecting points between input and output ports. One disadvantage of this design, however, is that it uses the maximum number of cross points (cross-point switches) needed to implement an N × N switch. The knockout switch by AT&T Bell Labs is a nonblocking switch based on the crossbar design (17,18). It has N inputs and N outputs and consists of a crossbar-based switch with a bus interface module at each output (Fig. 12).

Nonblocking Buffered Switches

Although some switches, such as Batcher–Banyan and crossbar switches, are internally nonblocking, two or more cells may still contend for the same output port in a nonblocking switch, resulting in the dropping of all but one cell. To prevent such loss, the buffering of cells by the switch is necessary. Figure 13 illustrates that buffers may be placed (1) in the inputs to the switch, (2) in the outputs of the switch, or (3) within the switching fabric itself, as a shared buffer (14). Some switches put buffers in both the input and output ports.

The first approach to eliminating output contention is to place buffers in the output ports of the switch (14). In the worst case, cells arriving simultaneously at all input ports can be destined for a single output port. To ensure that no cells are lost in this case, cell transfer must be performed at N times the speed of the input links, and the switch must be able to write N cells into the output buffer during one cell transmission time. Examples of output-buffered switches include the knockout switch by AT&T Bell Labs, the Siemens & Newbridge MainStreetXpress switches, ATML's VIRATA switch, and Bay Networks' Lattis switch.
The second approach to buffering in ATM switches is to place the buffers in the input ports of the switch (14). Each input has a dedicated buffer, and cells that would otherwise be blocked at the output ports of the switch are stored in input buffers. Commercial examples of switches with input buffers as well as output buffers are IBM's 8285 Nways switches and Cisco's Lightstream 2020 switches.

A third approach is to use a shared buffer within the switch fabric. In a shared buffer switch, there are no buffers at the input or output ports (14). Arriving cells are immediately injected into the switch. When output contention occurs, the winning cell goes through the switch, while the losing cells are stored for later transmission in a shared buffer common to all of the input ports. Cells just arriving at the switch join buffered cells in competition for available outputs. Because
Figure 13. Nonblocking buffered switches.
more cells are available to select from, it is possible that fewer output ports will be idle when using the shared buffer scheme. Thus, the shared buffer switch can achieve high throughput. However, one drawback is that cells may be delivered out of sequence, because cells that arrived more recently may win over buffered cells during contention (19). Another drawback is the increase in the number of input and output ports internal to the switch. The Starlite switch with trap, from AT&T Bell Labs, is an example of the shared buffer architecture (16). Other examples of shared buffer switches include Cisco's Lightstream 1010 switches, IBM's Prizma switches, Hitachi's 5001 switches, and Lucent's ATM cell switches.

CONTINUING RESEARCH IN ATM NETWORKS

ATM is continuously evolving, and its attractive ability to support broadband integrated services with strict quality of service guarantees has motivated the integration of ATM with existing, widely deployed networks. Recent additions to ATM research and technology include, but are not limited to, seamless integration with existing LANs [e.g., LAN emulation (20)], efficient support for traditional Internet IP networking [e.g., IP over ATM (21), IP switching (22)], and further development of flow and congestion control algorithms to support existing data services [e.g., ABR flow control (12)]. Research on ATM networks is proceeding and will undoubtedly continue as the technology matures.
BIBLIOGRAPHY

1. CCITT Recommendation I-Series. Geneva: International Telegraph and Telephone Consultative Committee.
2. J. B. Kim, T. Suda, and M. Yoshimura, International standardization of B-ISDN, Comput. Networks ISDN Syst., 27: 1994.
3. CCITT Recommendation G-Series. Geneva: International Telegraph and Telephone Consultative Committee.
4. ATM Forum Technical Specifications [Online]. Available: www.atmforum.com
5. Report of ANSI T1S1.5/91-292, Simple and Efficient Adaptation Layer (SEAL), August 1991.
6. Report of ANSI T1S1.5/91-449, AAL5—A New High Speed Data Transfer, November 1991.
7. CCITT Recommendation Q-Series. Geneva: International Telegraph and Telephone Consultative Committee.
8. J. Bae and T. Suda, Survey of traffic control schemes and protocols in ATM networks, Proc. IEEE, 79: 1991.
9. B. J. Vickers et al., Congestion control and resource management in diverse ATM environments, IECEJ J., J76-B-I (11): 1993.
10. J. S. Turner, New directions in communications (or which way to the information age?), IEEE Commun. Mag., 25 (10): 1986.
11. G. Gallassi, G. Rigolio, and L. Fratta, ATM: Bandwidth assignment and bandwidth enforcement policies, Proc. GLOBECOM '89.
12. ATM Forum, Traffic Management Specification Version 4.0, af-tm-0056.000, Mountain View, CA: ATM Forum, April 1996.
13. Flow Control Consortium, Quantum Flow Control Version 2.0, FCC-SPEC-95-1 [Online], July 1995. http://www.qfc.org
14. Y. Oie et al., Survey of switching techniques in high-speed networks and their performance, Int. J. Satellite Commun., 9: 285–303, 1991.
15. M. De Prycker and M. De Somer, Performance of a service independent switching network with distributed control, IEEE J. Select. Areas Commun., 5: 1293–1301, 1987.
16. A. Huang and S. Knauer, Starlite: A wideband digital switch, Proc. IEEE GLOBECOM '84, 1984.
17. K. Y. Eng, A photonic knockout switch for high-speed packet networks, IEEE J. Select. Areas Commun., 6: 1107–1116, 1988.
18. Y. S. Yeh, M. G. Hluchyj, and A. S. Acampora, The knockout switch: A simple, modular architecture for high-performance packet switching, IEEE J. Select. Areas Commun., 5: 1274–1283, 1987.
19. J. Y. Hui and E. Arthurs, A broadband packet switch for integrated transport, IEEE J. Select. Areas Commun., 5: 1264–1273, 1987.
20. ATM Forum, LAN Emulation over ATM Version 1.0, AF-LANE-0021, Mountain View, CA: ATM Forum, 1995.
21. IETF, IP over ATM: A framework document, RFC-1932, 1996.
22. Ipsilon Corporation, IP switching: The intelligence of routing, the performance of switching [Online]. Available: www.ipsilon.com
TATSUYA SUDA University of California, Irvine
ATC. See AIR TRAFFIC CONTROL. ATM. See STATISTICAL MULTIPLEXING. ATM NETWORKS. See ASYNCHRONOUS TRANSFER MODE NETWORKS.
ATM NETWORKS, VIDEO ON. See VIDEO ON ATM NETWORKS.
ATMOSPHERICS. See WHISTLERS. ATTENUATION. See REFRACTION AND ATTENUATION IN THE TROPOSPHERE.
Wiley Encyclopedia of Electrical and Electronics Engineering

CHANNEL CODING

Standard Article. Irving S. Reed (University of Southern California, Los Angeles, CA) and Xuemin Chen (General Instrument Corporation, San Diego, CA). Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. DOI: 10.1002/047134608X.W5306. Article online posting date: December 27, 1999.
Abstract. The sections in this article are: Error-Handling Processes and Error-Control Strategies; Basic Principles of Error-Control Codes; Linear Block Codes; Cyclic Codes; BCH Codes; Reed–Solomon Code; Noisy Channel Coding Theorem; Coding Performance and Decoding Complexity.

Keywords: error correction and detection; channel capacity; error-control coding; block codes; convolutional codes; information theory; BCH codes; Reed–Solomon codes; Hamming codes; maximum likelihood decoding
CHANNEL CODING
The general term channel coding refers to techniques by which redundancy symbols are attached to the data by a channel encoder. These redundancy symbols are used to detect, correct, or interpolate erroneous data at the channel decoder. Channel encoding is achieved by imposing relations on the information data and redundancy symbols; these restricting relations make it possible for the decoder to correctly extract the original source signal, with high reliability and fidelity, from a possibly corrupted received or retrieved signal. Channel coding is used in digital communication systems for several possible reasons: (1) to increase the reliability of noisy data communications channels or data storage systems; (2) to control errors in such a manner that a faithful reproduction of the data can be obtained; (3) to increase the overall signal-to-noise energy ratio (SNR) of a system; (4) to reduce the noise effects within a system; and (5) to meet the commercial demands of efficiency, reliability, and high performance in an economically practical digital transmission and storage system. All of these objectives must be tailored to the particular application. Channel coding is therefore also called error-control coding, or error-correction and detection coding.

ERROR-HANDLING PROCESSES AND ERROR-CONTROL STRATEGIES

Figure 1 shows a physical layer coding model of a digital communication system. The same model can be used to describe an information storage system if the storage medium is considered to be the channel. The source information is usually composed of binary or alphanumeric symbols. The encoder converts the information messages into electrical signals acceptable to the channel. These signals are then sent over the channel (or storage medium), where they may be disturbed by noise. Next, the output of the channel is sent to the decoder, which makes a decision to determine which message was sent.
Finally, this message is delivered to the data recipient (sink). Typical transmission channels are twisted-pair telephone lines, coaxial cables, optical fibers, radio links, microwave links, satellite links, and so forth. Typical storage media are semiconductor memories, magnetic tapes and discs, compact discs (CDs), optical memory units, and digital video discs (DVDs). Each of these channels or media is subject to various types of noise disturbance. For example, the disturbance on a telephone line may come from impulsive circuit switching noise, thermal noise, crosstalk between lines, or a loss of synchronization. The disturbances on a CD often are caused by surface defects, dust, or mechanical failure. Therefore, the problems a digital communication system faces
Figure 1. A physical layer model of a communication or storage system (source, encoder, channel or storage medium, decoder, sink, with a retransmission request path from the decoder back to the encoder).
are the possible message errors that might be caused by these different disturbances. To overcome these problems, "good" encoders and decoders must be designed to improve the performance of these channels.

Figure 2 shows the block diagram of a typical error-handling system. In an ideal system, the symbols obtained from the channel (or storage medium) match the symbols that originally entered it. In any practical system there are occasional errors, and the purpose of channel coding is to detect and possibly correct such errors. The first stage in Fig. 2 is concerned with encoding for error avoidance and the use of redundancy. This includes, for example, such processes as precoding data for modulation, placing digital data at an appropriate position on the tape for certain digital formats, rewriting a read-after-write error in a computer tape, and error-correction and detection encoding. Following these steps, the encoded data are delivered to the modulator in the form of a signal vector or code. The modulator then transforms the signal vector into a waveform that matches the channel. After transmission through the channel, the waveform often is disturbed by noise. The demodulation of this waveform can produce corrupted signal vectors, which in turn cause possible errors in the data. On receipt of the data, errors are first detected. The detection of an error then requires some course of action; for example, in a bidirectional link a retransmission might be requested. Finally, correctable error patterns can be eliminated by an error-correction engine.

The error-control strategy for the error-handling system shown in Fig. 2 depends primarily on the application, that is, on the channel properties of the particular communication link and the type of error-control codes to be used. Without the feedback line shown in Fig.
2, communication channels are one-way channels. The codes in this case are designed mainly for error correction and/or error concealment. Error control for a one-way system is usually accomplished by forward error correction (FEC), that is, by employing error-correcting codes that automatically correct errors detected at the receiver. Communication systems frequently employ two-way channels, a fact that must be considered in the design of an error-control system. With a two-way channel, both error-correcting and error-detecting codes can be used. When an error-detecting code is used and an error is detected at one terminal, a request for a repeat is sent to the transmitting terminal.

There are real one-way channels in which the error probabilities can be reduced by the use of an error-correcting code but not by an error detection and retransmission system. For example, with a magnetic-tape storage system, usually too much time has passed to ask for a retransmission after the tape has been stored for any significant period, say a week or sometimes just a day. In such a case, errors are detected when the record is read. Encoding with FEC codes is usually no more complex than it is with error-detecting codes; it is the decoding that requires sophisticated digital equipment. On the other hand, there are good reasons for using both error detection and retransmission for some applications when possible. Error detection is by its nature a much simpler computational task than error correction and requires much
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright # 1999 John Wiley & Sons, Inc.
188
CHANNEL CODING Retransmission request
Figure 2. The major processes in an error-handling system.
Error avoidance and redundancy encoding
Coded modulation
less complex decoding equipment. Also, error detection with retransmission tends to be adaptive. In a retransmission system, redundant information is utilized only in the retransmitted data when errors occur. This makes it possible, under certain circumstances, to obtain better performance with a system of this kind than is theoreticaly possible over a oneway channel. Error control by the use of error detection and retransmission is called automatic repeat request (ARQ). In an ARQ system, when an error is detected at the receiver, a request is sent to the transmitter to repeat the message. This process continues until it is verified that the message was received correctly. Typical applications of ARQ are the protocols for many fax modems. There is a definite limit to the efficiency of a system that uses simple error detection and retransmission alone. First, short error-detecting codes are not efficient detectors of errors. On the other hand, if very long codes are used, retransmission must be done too frequently. It can be shown that a combination of both the correction of the most frequent error patterns along with detection and retransmission of the less frequent error patterns is not subject to such a limitation. Such a mixed error-control process is usually called a hybrid error-control (HEC) strategy. In fact, HEC is often more efficient than either a forward error correction system or a detection and retransmission system. Many present-day digital systems use a combination of forward error correction and detection with or without feedback. If the error rate demanded by the application cannot be met by the unaided channel or storage medium, some form of error handling may be necessary. BASIC PRINCIPLES OF ERROR-CONTROL CODES We have seen that the performance of an error-handling system relies on error-correction and/or detection codes that are designed for the given error-control strategy. 
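Before turning to specific code constructions, the ARQ strategy described above can be sketched in a few lines. This is a toy stop-and-wait scheme with a single even-parity check bit, not any standardized protocol; note that an even number of bit errors would pass the check undetected, which is why practical systems use stronger detecting codes such as the CRCs discussed later in this article.

```python
import random

def add_parity(bits):
    """Append one even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """The receiver's check: accept iff the count of 1s is even."""
    return sum(word) % 2 == 0

def channel(word, flip_prob, rng):
    """Binary symmetric channel: flip each bit with probability flip_prob."""
    return [b ^ (rng.random() < flip_prob) for b in word]

def send_arq(bits, flip_prob=0.2, max_tries=100, rng=None):
    """Stop-and-wait ARQ: retransmit until no error is detected.
    Returns the accepted word and the number of transmissions used."""
    rng = rng or random.Random(0)
    word = add_parity(bits)
    for tries in range(1, max_tries + 1):
        received = channel(word, flip_prob, rng)
        if parity_ok(received):
            return received, tries
    raise RuntimeError("retransmission limit reached")
```

The adaptivity noted above is visible here: redundancy beyond the single parity bit is spent only when a retransmission is actually triggered.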
There are two different types of codes that are commonly used today: block and convolutional codes. It is assumed for both types of codes that the information sequence is encoded using an alphabet set Q of q distinct symbols, called a q-ary set, where q is a positive integer. In general, a code is called a block code if the coded information sequence can be divided into blocks of n symbols and each block can be decoded independently. The encoder of a block code divides the information sequence into message blocks of k information symbols each. A message block, called the message word, is represented by the k-tuple m = (m0, m1, ..., mk−1) of symbols. Evidently, there are a total of q^k different possible message words. The encoder transforms each message word m independently into an n-symbol codeword c = (c0, c1, ..., cn−1). Therefore, corresponding with
the q^k different possible messages, there are q^k different possible codewords at the encoder output. This set of q^k codewords of length n is called an (n, k) block code. The code rate of an (n, k) block code is defined to be R = k/n. If q = 2, the codes are called binary block codes and can be implemented with a combinational logic circuit. The encoder for a convolutional code accepts k-bit blocks of an information sequence and produces an encoded sequence of n-bit blocks. (In convolutional coding, the symbols are used to denote a sequence of blocks rather than a single block.) However, each encoded block depends not only on the corresponding k-bit message block, but also on the m previous message blocks. Hence, the encoder is said to have a memory of order m. The set of encoded sequences produced by a k-input and n-output encoder of memory order m is called an (n, k, m) convolutional code. Again, the ratio R = k/n is called the code rate of the convolutional code. Since the encoder contains memory, it is implemented with a sequential logic circuit. The basic principle of error-control coding is to add redundancy to the message in such a way that the message and its redundancy are related by some set of algebraic equations. When a message is disturbed, the message with such constrained redundancy still can be decoded by the use of these relations. In other words, the error-control capability of a code comes from this relational redundancy, which is added to the message during the encoding process. To illustrate the principle of error-control coding, an example of a binary block code of length 3 is discussed. There are a total of eight different possible binary 3-tuples: (000), (001), (010), (011), (100), (101), (110), (111). First, if all of these 3-tuples are used to transmit messages, one has the example of a (3,3) binary block code of rate 1. In this case, if a one-bit error occurs in any codeword, the received word becomes another codeword.
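Such a sequential encoder can be sketched in a few lines. The example below is a hypothetical rate-1/2, memory-order-2 encoder whose tap connections happen to match the (2,1,2) example that appears later in this article; the shift register holds the m = 2 previous message bits.

```python
def conv_encode(msg):
    """Rate-1/2 convolutional encoder with memory order m = 2.
    Per input bit, output the pair c1 = m[i-2] + m[i-1] + m[i] and
    c2 = m[i-2] + m[i] (mod 2); the shift register starts at zero."""
    m2 = m1 = 0                      # the two memory cells of the register
    out = []
    for m0 in msg:
        out.append((m2 ^ m1 ^ m0, m2 ^ m0))
        m2, m1 = m1, m0              # shift: m[i-1] -> m[i-2], m[i] -> m[i-1]
    return out
```

Because each output pair depends on the register contents as well as the current bit, the encoder is a sequential circuit, exactly as described above.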
Since any particular codeword may be a transmitted message and there are no redundancy bits in a codeword, errors can neither be detected nor corrected. The error detection and correction processes are closely related and will be dealt with presently. The actual correction of an error is simplified tremendously by the adoption of binary codes. There are only two symbols, 0 and 1, in this case. Hence, to correct a symbol it is sufficient to know that the symbol is wrong. Figure 3 shows the minimal circuit needed for correction once the bit in error has been identified. The exclusive-OR (XOR) gate shows up extensively in error correction circuits, and the figure also demonstrates its truth table. One way to remember the characteristics of this useful device is that there always is an output ‘‘1’’ when the inputs are different. Inspection of the truth table shows that there is an even number of 1’s in each row and, as a consequence, the device is also called an even parity gate.
Figure 3. The exclusive-OR (XOR) gate used to correct an identified wrong bit (wrong bit in, corrected bit out), and its truth table:

    A  B | C
    -----+--
    0  0 | 0
    0  1 | 1
    1  0 | 1
    1  1 | 0

Parity is a fundamental concept in error detection. In the previous example, let only four of the 3-tuples, (000), (011), (101), (110), be chosen as codewords for transmission. These are equivalent to the four 2-bit messages, (00), (01), (10), (11), with the third bit in each 3-tuple equal to the XOR of its first and second bits. This is an example of a (3,2) binary block code of rate 2/3. If a received word is not a codeword, i.e., the third bit does not equal the XOR of the first and second bits, then an error is detected. However, this code cannot correct any error. To illustrate what can happen when there are errors, suppose that the received word is (010). Such an error cannot be corrected even if there is only one bit in error since, in this case, the transmitted codeword has three possibilities: (000), (011), (110). To achieve error correction, more redundancy bits need to be added to the message words for transmission. Suppose only the two 3-tuples (000), (111) are chosen as codewords. This is a (3,1) binary block code of rate 1/3. The codewords (000), (111) are encoded by duplicating the source bits 0, 1 two additional times, that is, with two redundancy bits in each codeword. If such a codeword is sent through the channel and a one- or two-bit error occurs, the received word is not a codeword, so the errors are detected. If the decision rule of the decoder is to take as the original source bit the bit that appears in the majority of the three positions of the received word, a one-bit error is corrected. For instance, if the received word is (010), this decoder would say that 0 was sent.

Consider next the example of a (2,1,2) binary convolutional code. Let the information sequence be m = (m0, m1, ..., m6) = (1011100) and the encoded sequence be c = (c0(1)c0(2), c1(1)c1(2), ..., c6(1)c6(2)) = (11,10,00,01,10,01,11). Also assume the relations between the components of the vectors m and c are given by

    ci(1) = mi−2 + mi−1 + mi
    ci(2) = mi−2 + mi

where m−2 = m−1 = 0 and ‘‘+’’ means sum modulo 2. Suppose the third digit of the received sequence is in error. That is, let the received sequence begin with (11,00,00, ...). The following are the eight possible beginning code sequences: (00,00,00, ...), (00,00,11, ...), (00,11,10, ...), (00,11,01, ...), (11,10,11, ...), (11,10,00, ...), (11,01,01, ...), (11,01,10, ...). Clearly, the sixth path, which differs from the received sequence in but a single position, is intuitively the best choice. Thus, a single error is corrected by this observation. Next, suppose digits 1, 2, and 3 were all erroneously received. For this case, the closest code sequence would be (00,00,00, ...) and the decoder would make an undetectable error. For further understanding of convolutional codes, readers may refer to Refs. 1 and 2.

LINEAR BLOCK CODES
In the previous section, it was shown for a binary block code that the positions of the failed bits can be determined by the use of more parity bits. If these parity bits are generated by a linear combination of message bits, the code is called a linear block code. Some important concepts of linear block codes are introduced next by the example of the Hamming code. Consider a binary linear block code of length 7 and rate R = 4/7. A four-bit message word m = (m0, m1, m2, m3) is used to compute three redundancy bits and to make a seven-bit codeword c = (c0, c1, ..., c6) from the following set of equations:
    ci = mi,    i = 0, 1, 2, 3
    c4 = m0 + m2 + m3
    c5 = m0 + m1 + m2
    c6 = m0 + m1 + m3

These equations provide the parity-check equations in the matrix form cH^T = 0, where

        | 1 0 1 1 1 0 0 |
    H = | 1 1 1 0 0 1 0 |
        | 1 1 0 1 0 0 1 |
is called the parity-check matrix of the code and ‘‘T’’ denotes matrix transpose. The codewords are generated by c = m · G, where

        | 1 0 0 0 1 1 1 |
    G = | 0 1 0 0 0 1 1 |
        | 0 0 1 0 1 1 0 |
        | 0 0 0 1 1 0 1 |
is called the generator matrix of the code. In the Hamming code, four message bits are examined in turn, and each bit that is a ‘‘1’’ causes the corresponding row of G to be added to an XOR sum. For example, if the message word is (1001), the top and bottom rows of G are componentwise XORed. The first four columns of G form a submatrix, which is known as an identity matrix. Therefore, the first four data bits in the codeword are identical to the message bits that were to be conveyed. This is useful because the original message bits are encoded in an unmodified form, and the check bits are simply attached to the end of the message to construct the so-called systematic codeword. Almost all channel block coding systems use systematic codes. The redundancy or parity bits are calculated in such a manner that they do not use every message bit. If a message bit is not included in a parity check, it can fail without affecting the outcome of that check. For example, if the second bit of a codeword fails, the outcome of the parity-check equation given by the first row of H is not affected. However, the outcomes of other parity-check equations, given by the second and third rows of H, are affected. The position of the error is deduced from the pattern of these successful and unsuccessful
checks in the parity-check matrix. This pattern is known as the syndrome, defined by the matrix equation s = r · H^T, where r = c + e and e is a seven-bit error vector. In the previous example of the Hamming code, let a failed second bit be assumed in a received word r, i.e., e1 = 1 and ei = 0 for i ≠ 1. Because this bit is included in only two of the parity-check equations, there are two 1’s in the failure pattern, namely, 011. Since considerable care was taken in the design of the matrix pattern for generating the check bits, the syndrome, 011, is actually the address (i.e., [011]^T is the second column of H) of the error bit. This is a fundamental feature of the original Hamming codes, due to Richard Hamming in 1950. It is useful at this point to introduce the concept of Hamming distance. This is the number of positions in which two sequences of length n differ. The minimum (Hamming) distance d of a block code is by definition the least Hamming distance between any two distinct codewords of the code. That is, the minimum distance of a binary code equals the minimum number of bits that need to be changed in order to change any codeword into any other codeword. A linear code of length n, dimension k, and minimum distance d is often denoted by the notation (n, k, d). If errors corrupt a codeword so that it is no longer a codeword, the error is definitely detectable and possibly correctable. If errors convert one codeword into another, they are impossible to detect. Therefore, the minimum distance d indicates the detection and correction capabilities of the code. For the Hamming code example, it can be found by direct verification over the 2^4 = 16 codewords that the minimum distance of the Hamming code is 3. This coincides with the fact that two or fewer bit errors in any codeword of the Hamming code produce a noncodeword. Hence two-bit errors are always detectable.
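These relationships can be checked numerically. The short sketch below rebuilds G and H from the parity-check equations given above, flips the second bit of a codeword, and confirms that the syndrome equals the corresponding column of H and that the minimum distance is 3.

```python
from itertools import product

# Generator and parity-check matrices of the (7,4) Hamming code, as read
# off the parity equations c4 = m0+m2+m3, c5 = m0+m1+m2, c6 = m0+m1+m3.
G = [[1, 0, 0, 0, 1, 1, 1],
     [0, 1, 0, 0, 0, 1, 1],
     [0, 0, 1, 0, 1, 1, 0],
     [0, 0, 0, 1, 1, 0, 1]]
H = [[1, 0, 1, 1, 1, 0, 0],
     [1, 1, 1, 0, 0, 1, 0],
     [1, 1, 0, 1, 0, 0, 1]]

def encode(m):
    """c = m . G over GF(2): XOR the rows of G picked out by 1-bits of m."""
    return [sum(mi * gij for mi, gij in zip(m, col)) % 2 for col in zip(*G)]

def syndrome(r):
    """s = r . H^T over GF(2): one bit per parity check (row of H)."""
    return [sum(ri * hij for ri, hij in zip(r, row)) % 2 for row in H]

codewords = [tuple(encode(m)) for m in product((0, 1), repeat=4)]
d_min = min(sum(a != b for a, b in zip(u, v))
            for u in codewords for v in codewords if u != v)

r = list(encode([1, 0, 0, 1]))
r[1] ^= 1                # fail the second bit
s = syndrome(r)          # equals the column of H for the failed bit
```

Running this gives d_min of 3 and a syndrome of (0, 1, 1) for the failed second bit, in agreement with the discussion above.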
Correction is also possible if the following minimum distance rule is used: correction (decoding) with the minimum distance rule decodes each received word r to the codeword that is closest to it in Hamming distance. For example, if the received word from the Hamming code is r = (1111100) and an error occurred in the second bit [i.e., e = (0100000)], the minimum distance rule correctly decodes r to the codeword c = (1011100). It can be shown that the Hamming code is able to correct all single-bit errors. Associated with this fact is the important theorem for a linear block code given next.

Theorem. A linear block code (n, k, d) has the following minimum distance decoding properties:

1. If d ≥ e + 1, then the code can detect e errors.
2. If d ≥ 2t + 1, then the code can correct t errors.
3. If d ≥ t + e + 1 for e ≥ t, then the code can correct t errors and simultaneously detect e errors.

Intuitively, item (2) of this theorem can be explained as follows: if a codeword c is transmitted and errors occur in at most t positions, then the received word r clearly resembles the transmitted codeword c more than any other codeword. It has been shown for the Hamming code that the syndromes s are the addresses of the error bits. This concept can be generalized to all linear block codes. That is, each syndrome corresponds to one and only one error vector if the number of errors satisfies e ≤ (d − 1)/2. Another, simpler decoding method, called syndrome decoding, can be shown to be equivalent to the minimum distance decoding rule:

1. Compute the syndrome s of the received word r.
2. Determine the error vector e that corresponds to the syndrome s by a lookup table.
3. Decode r by choosing the corrected codeword to be r − e.

CYCLIC CODES

The implementation of the encoder of a Hamming code can be made very fast by the use of the parity-check equations. Such an implementation is ideal for some applications, such as computer memory protection, which require short codes and a fast access time. However, in many other applications, the messages are transmitted and stored serially, and it is desirable to use relatively large data blocks to reduce the memory storage devoted to preambles, addressing, and synchronization. Where large data blocks are to be handled, the use of simple parity-check equations for encoding has to be abandoned because it would become impossibly complex. However, the principle of the generator and parity-check matrices can still be employed for the encoder, but now these matrices usually are generated algorithmically. For the decoder, the syndromes are used to find the bits in error not by using a simple lookup table, but by solving algebraic equations. A subclass of linear block codes, called cyclic codes, can provide long codes that have the required encoding and decoding structures. A linear code C of length n is said to be cyclic if every cyclic shift of a codeword c is also a codeword, that is,

    c = (c0, c1, ..., cn−1) ∈ C ⇒ cπ = (cn−1, c0, c1, ..., cn−2) ∈ C

When messages can be accessed serially, simple circuitry can be used for the encoder since the same gate can be used for many XOR operations.
Unfortunately, the reduction in complexity of the encoding process is paralleled by an increase in the difficulty of explaining what takes place. The methodology described so far about how an error-correction system works is mainly in engineering terms. However, it can be described more mathematically so that the encoding and decoding of long codes can be accomplished in a more efficient manner. Toward this end, codewords of a cyclic code can be represented by the use of polynomials in the following way:
    c = (c0, c1, ..., cn−1) ∈ C ⇒ c(x) = c0 + c1x + ··· + cn−1x^(n−1) ∈ C(x)

where C(x) represents the set of polynomials associated with the set of all codewords of C. The term c(x) is called a code polynomial. It is clear that cπ(x) = x·c(x) − cn−1·(x^n − 1) ∈ C(x), that is, cπ(x) ≡ x·c(x) mod (x^n − 1). Let Rn[x] = GF(2)[x]/(x^n − 1) denote the ring of polynomials of degree at most n − 1 over the finite (or Galois) field GF(2) of two elements. Let g(x) be the monic polynomial of smallest degree in C(x). Then the degree of g(x) equals n − k, and every polynomial c(x) ∈ C(x) can be represented as c(x) = m(x)g(x) for some m(x) ∈ Rn[x].
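The identity cπ(x) ≡ x·c(x) mod (x^n − 1) is easy to check directly: multiplying by x and reducing mod x^n − 1 simply rotates the coefficient vector. A minimal sketch (coefficient lists with index = degree are an implementation choice here, not part of the standard notation):

```python
def cyclic_shift(c):
    """One right cyclic shift of a coefficient vector (index = degree)."""
    return [c[-1]] + c[:-1]

def times_x_mod(c):
    """Multiply c(x) by x and reduce mod x^n - 1: the coefficient of x^n
    wraps around to the constant term, since x^n is congruent to 1."""
    n = len(c)
    shifted = [0] + c            # x * c(x), degree can reach n
    shifted[0] ^= shifted[n]     # fold the x^n term back onto x^0
    return shifted[:n]

c = [1, 1, 0, 1, 0, 0, 0]        # c(x) = 1 + x + x^3 as a length-7 vector
```

For any length-7 vector the two operations agree, which is exactly the statement that a cyclic shift of a codeword corresponds to multiplication by x in Rn[x].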
Figure 4. An encoder of the (7,4,3) cyclic Hamming code: a shift-register circuit with message input m(x) and codeword output c(x).
Therefore, a cyclic code encoder can be conceived of as a polynomial multiplier that can be implemented by the use of what is called a shift register. For example, the cyclic Hamming code has the generator polynomial g(x) = x^3 + x + 1. Hence, one implementation of the encoder of this code is the shift-register device shown in Fig. 4. The register of the encoder is initially set to zero. Then the message word, m(x) = m0 + m1x + m2x^2 + m3x^3, is input in sequence from m3 to m0. After shifting seven times, a codeword c(x) = c0 + c1x + ··· + c6x^6 is the sequential output of the bits c6 to c0. Many other methods for encoding cyclic codes can be implemented by the use of shift-register circuits. The most useful of these techniques is the systematic encoding method. Encoding of an (n, k) cyclic code in systematic form consists of three steps: (1) multiply the message polynomial m(x) by x^(n−k); (2) divide x^(n−k)m(x) by g(x) to obtain the remainder b(x); and (3) form the codeword c(x) = b(x) + x^(n−k)m(x).
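The three-step systematic procedure can be sketched directly with coefficient arithmetic over GF(2); bit lists with index = degree are an implementation choice here, not part of the standard.

```python
def poly_mod(dividend, divisor):
    """Remainder of GF(2) polynomial division; polynomials are coefficient
    lists, index = degree (so g(x) = 1 + x + x^3 is [1, 1, 0, 1])."""
    rem = list(dividend)
    for i in range(len(rem) - 1, len(divisor) - 2, -1):
        if rem[i]:
            for j, coef in enumerate(divisor):
                rem[i - len(divisor) + 1 + j] ^= coef
    return rem[:len(divisor) - 1]

def encode_systematic(m, g):
    """Systematic cyclic encoding: (1) shift m(x) by x^(n-k), where
    n - k = deg g(x); (2) take the remainder b(x) of the shifted message
    mod g(x); (3) form c(x) = b(x) + x^(n-k) m(x)."""
    n_k = len(g) - 1
    shifted = [0] * n_k + list(m)    # step (1): x^(n-k) * m(x)
    b = poly_mod(shifted, g)         # step (2): parity polynomial b(x)
    return b + list(m)               # step (3): parity bits, then message

g = [1, 1, 0, 1]                     # g(x) = 1 + x + x^3 for the (7,4) code
c = encode_systematic([1, 0, 0, 1], g)
```

By construction the resulting c(x) is divisible by g(x), and the message bits appear unmodified in the high-order positions of the codeword.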
In general, the decoding of cyclic codes for error correction consists of the same three steps used for decoding any linear code: syndrome computation, association of the syndrome with the error pattern, and error correction. However, the limit to this approach is the complexity of the decoding circuit that is needed to determine the error word from the syndrome. Such procedures tend to grow exponentially with code length and the number of errors that need to be corrected. Many cyclic codes have considerable algebraic and geometric properties. If these properties are properly used, then a simplification in the decoding process is usually possible. Cyclic codes are well suited to error detection, and several have been standardized for use in digital communications. The most common of these have the following generator polynomials:
    x^16 + x^15 + x^2 + 1     (CRC-16)
    x^16 + x^12 + x^5 + 1     (CRC-CCITT)
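As a sketch of such an error-detecting scheme in software (plain polynomial arithmetic; the byte and bit ordering conventions of production CRC implementations vary and are not modeled here):

```python
def poly_mod(dividend, divisor):
    """Remainder of GF(2) polynomial division (lists, index = degree)."""
    rem = list(dividend)
    for i in range(len(rem) - 1, len(divisor) - 2, -1):
        if rem[i]:
            for j, coef in enumerate(divisor):
                rem[i - len(divisor) + 1 + j] ^= coef
    return rem[:len(divisor) - 1]

# x^16 + x^12 + x^5 + 1 (CRC-CCITT) as a coefficient list.
CRC_CCITT = [1 if i in (0, 5, 12, 16) else 0 for i in range(17)]

def encode_crc(message, g=CRC_CCITT):
    """Append 16 check bits: remainder of x^16 * m(x) divided by g(x)."""
    shifted = [0] * (len(g) - 1) + list(message)
    return poly_mod(shifted, g) + list(message)

def detects(received, g=CRC_CCITT):
    """The check fails (error detected) iff g(x) does not divide r(x)."""
    return any(poly_mod(received, g))

word = encode_crc([1, 0, 1, 1, 0, 1])
corrupted = word.copy()
corrupted[3] ^= 1                    # any single-bit error is caught
```

A transmitted codeword passes the check because it is a multiple of g(x); flipping one bit adds a single power of x, which g(x) never divides, so every single-bit error is detected.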
These codes can detect many combinations of errors, and the implementation of both the encoding and error-detecting circuits is quite practical. Since every codeword of a cyclic code can be computed from its generator polynomial g(x) as the product c(x) = d(x)g(x), it is clear that s(x) ≡ 0 mod g(x) if and
only if the received polynomial r(x) is a code polynomial. This very useful fact is often employed to design the efficient error-detecting circuits of most cyclic codes. The results discussed in this section are easily generalized to codes constructed over any finite field GF(q), where q is some power of a prime number p.

BCH CODES

One class of cyclic codes was introduced in 1959 by Hocquenghem, and independently in 1960 by Bose and Ray-Chaudhuri. The codes are known as BCH codes and can be described by means of the roots of a polynomial g(x) with coefficients in a finite field. A cyclic code of length n over GF(2) is called a BCH code of designed distance δ if its generator g(x) is the least common multiple of the minimal polynomials of β^l, β^(l+1), ..., β^(l+δ−2) for some l, where β is a primitive nth root of unity. If n = 2^m − 1, i.e., β is a primitive element of GF(2^m), then the BCH code is called a primitive BCH code. The performance of a BCH code is specified by its designed distance using the following fact: the minimum distance of a BCH code with designed distance δ is at least δ. This fact is usually called the BCH bound. A primitive BCH code of designed distance δ has minimum distance d ≤ 2δ − 1. To decode BCH codes, consider once again a BCH code of length n over GF(2) with designed distance δ = 2t + 1, and let β be a primitive nth root of unity in GF(2^m). Consider a codeword c(x) and assume that the received word is

    r(x) = r0 + r1x + ··· + rn−1x^(n−1)

Let e(x) = r(x) − c(x) = e0 + e1x + ··· + en−1x^(n−1) be the error vector. Now, define the following: M = {i | ei ≠ 0} is the set of positions where errors occur, and e = |M| is the number of errors. The polynomial σ(z) = ∏_{i∈M} (1 − β^i z) in z is called the error-locator polynomial. Also, let ω(z) = Σ_{i∈M} ei β^i z ∏_{j∈M\{i}} (1 − β^j z) be what is known as the error-evaluator polynomial. It is clear that if one can find σ(z) and ω(z), then the errors can be corrected.
In fact, an error occurs in position i if and only if σ(β^−i) = 0, and in that case the error is given by ei = −β^i ω(β^−i)/σ′(β^−i), where σ′(·) denotes the derivative. Assume that the number of errors e ≤ t (if e > t, one does not expect to be able to correct the errors). Observe that
    ω(z)/σ(z) = Σ_{i∈M} ei β^i z / (1 − β^i z)
              = Σ_{i∈M} ei Σ_{l≥1} (β^i z)^l
              = Σ_{l≥1} z^l Σ_{i∈M} ei β^(li)
              = Σ_{l≥1} z^l e(β^l)
where all of these calculations use the operations of what is known as a formal power series over the finite field GF(2^m). For 1 ≤ l ≤ 2t, one gets e(β^l) = r(β^l), i.e., the receiver knows the first 2t coefficients on the right-hand side of the equation. Therefore, ω(z)/σ(z) is known mod z^(2t+1). It is claimed that the receiver must determine polynomials σ(z) and ω(z) in such
a manner that deg[ω(z)] ≤ deg[σ(z)], with deg[σ(z)] being as small as possible, under the condition

    ω(z) ≡ σ(z) Σ_{l=1..2t} z^l r(β^l)    (mod z^(2t+1))
In practice, it is very important to find a fast algorithm that actually determines σ(z) and ω(z) by solving these equations. Two commonly used algorithms are the Berlekamp–Massey decoding algorithm, introduced by E. R. Berlekamp and J. Massey, and the Euclidean algorithm. Interested readers may refer to (1,3–5).
REED–SOLOMON CODE

A very useful class of nonbinary cyclic codes is called the Reed–Solomon (RS) code. RS codes were first discovered by I. S. Reed and G. Solomon in 1958. An RS code is defined over GF(p^m) with length n = p^m − 1 and minimum distance d = n − k + 1. Its generator polynomial is g(x) = (x − α^u)(x − α^(u+1)) ··· (x − α^(u+d−2)), where α is a primitive element in GF(p^m) and where u is some integer. Since RS codes are cyclic, they can be encoded by the product of g(x) and the polynomial associated with the information vector, or by a systematic encoding. For example, the RS code of length n = 7, dimension k = 5, and minimum distance d = 3 with p = 2 is specified by the generator polynomial g(x) = (x − α^3)(x − α^4) = x^2 + α^6 x + 1, where α is a root of the irreducible polynomial x^3 + x + 1 and is a primitive element of the finite field GF(2^3). In RS codes, data bits are assembled into words, or symbols, which become elements of the Galois field upon which the code is based. The number of bits in the symbol determines the size of the Galois field, and hence the number of symbols in a codeword. A symbol length of eight bits is commonly used because it fits in conveniently with modern byte-oriented computers and processors. The Galois, or finite, field with eight-bit symbols is denoted by GF(2^8). Thus, the RS codes defined over GF(2^8) have a length of 2^8 − 1 = 255 symbols. As each symbol contains eight bits, the codeword is 255 × 8 = 2040 bits long. A primitive polynomial commonly used to generate GF(2^8) is g(x) = x^8 + x^4 + x^3 + x^2 + 1. The decoders of RS codes are usually implemented by the Euclidean algorithm and the Berlekamp–Massey algorithm.
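The small GF(2^3) example above can be verified with a few lines of field arithmetic. Storing field elements as 3-bit integers (bit i = coefficient of x^i) is an implementation choice, not part of the code's definition.

```python
def gf8_mul(a, b, prim=0b1011):
    """Multiply in GF(2^3) defined by the primitive polynomial x^3 + x + 1;
    elements are 3-bit integers with bit i = coefficient of x^i."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b1000:               # reduce as soon as degree reaches 3
            a ^= prim
    return r

alpha = 0b010                        # alpha = x, a primitive element
powers = {0: 1}
for i in range(1, 7):
    powers[i] = gf8_mul(powers[i - 1], alpha)

# Subtraction equals addition in characteristic 2, so
# g(x) = (x - alpha^3)(x - alpha^4) = x^2 + (alpha^3 + alpha^4)x + alpha^7.
coeff_x = powers[3] ^ powers[4]              # alpha^3 + alpha^4
coeff_const = gf8_mul(powers[3], powers[4])  # alpha^7 = 1
```

Expanding the product indeed gives the linear coefficient α^6 and constant term 1, matching g(x) = x^2 + α^6 x + 1 as stated above.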
NOISY CHANNEL CODING THEOREM Several important classes of codes have been discussed in the previous sections. In this section the performance to be expected from channel coding is discussed briefly. Some commonly used quantities for measuring performance improvement by channel coding include the error probabilities from the decoder, such as the bit-error rate of the system, the probability of an incorrect decoding of a codeword, and the probability of an undetected error. In the physical layers of a communication system, these error probabilities usually depend on the particular code, the decoder, and, more importantly, on the underlying channel/medium error probabilities.
Figure 5. Coded system on an additive noise channel: digital source → m → encoder → c → modulator → s(t) → additive noise channel (noise n(t)) → r(t) → demodulator → decoder → digital sink.
Figure 5 shows a block diagram of a coded system for an additive noise channel. In such a system, the source output m is encoded into a code sequence (codeword) c. Then c is modulated and sent to the channel. After demodulation the decoder receives a sequence r which satisfies r = c + e, where e is the error sequence and ‘‘+’’ usually denotes component-wise vector XOR addition. The final decoder output m̂ represents the recovered message. The primary purpose of a decoder is to produce an estimate m̂ of the transmitted information sequence m based on the received sequence r. Equivalently, since there is a one-to-one correspondence between the information sequence m and the codeword c, the decoder can produce an estimate ĉ of the codeword c. Clearly, m̂ = m if and only if ĉ = c. A decoding rule is a strategy for choosing an estimated codeword ĉ for each possible received sequence r. If the codeword c was transmitted, a decoding error occurs if and only if ĉ ≠ c. On the assumption that r is received, the conditional error probability of the decoder is defined by

    P(E|r) = P(ĉ ≠ c|r)    (1)

The error probability of the decoder is then given by

    P(E) = Σ_r P(E|r) P(r)    (2)
where P(r) denotes the probability of receiving the word r and the summation is over all possible received words. Evidently, P(r) is independent of the decoding rule used since r is produced prior to the decoding process. Hence, the optimum decoding rule must minimize P(E|r) = P(ĉ ≠ c|r) for all r. Since minimizing P(ĉ ≠ c|r) is equivalent to the maximization of P(ĉ = c|r), P(E|r) is minimized for a given r by choosing ĉ to be the codeword c that maximizes

    P(c|r) = P(r|c)P(c) / P(r)    (3)
That is, ĉ is chosen to be the most likely codeword, given that r is received. If all codewords are equally likely, i.e., P(c) is the same for all c, then maximizing Eq. (3) is equivalent to the maximization of the conditional probability P(r|c). If each received symbol in r depends only on the corresponding transmitted symbol, and not on any previously transmitted symbol, the channel is called a discrete memoryless channel (DMC). For a DMC, one obtains

    P(r|c) = ∏_i P(ri|ci)    (4)

since for a memoryless channel each received symbol depends only on the corresponding transmitted symbol. A decoder that chooses its estimate to maximize Eq. (4) is called a maximum likelihood decoder (MLD). One of the most interesting problems in channel coding is to determine, for a given channel, how small the probability of error can be made in a decoder by a code of rate R. A complete answer to this problem is provided to a large extent by a specialization of an important theorem, due to Claude Shannon in 1948, called the noisy channel coding theorem or the channel capacity theorem. Roughly speaking, Shannon’s noisy channel coding theorem states: For every memoryless channel of capacity C, there exists an error-correcting code of rate R < C such that the error probability P(E) of the maximum likelihood decoder for a power-constrained system can be made arbitrarily small. If the system operates at a rate R > C, the system has a high probability of error, regardless of the choice of the code or decoder. The capacity C of a channel defines the maximum number of bits that can be reliably sent per second over the channel.
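Both ideas can be made concrete on the binary symmetric channel (BSC). For crossover probability p < 1/2, Eq. (4) gives P(r|c) = p^d (1 − p)^(n−d), where d is the Hamming distance between r and c, so the ML decoder simply picks the codeword nearest to r. The capacity formula C = 1 − H(p) bits per channel use for the BSC is a standard result quoted here as an aside, not derived in this article.

```python
from math import log2

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def ml_decode_bsc(r, codewords):
    """ML decoding on a BSC with p < 1/2: the likelihood p^d (1-p)^(n-d)
    decreases in d, so choose the codeword at minimum Hamming distance."""
    return min(codewords, key=lambda c: hamming_distance(r, c))

def bsc_capacity(p):
    """C = 1 - H(p) bits per channel use, with H the binary entropy."""
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * log2(p) + (1 - p) * log2(1 - p)

# With the (3,1) repetition codebook, ML decoding is the majority vote.
codebook = [(0, 0, 0), (1, 1, 1)]
```

For the repetition codebook, ML decoding reproduces the majority-vote rule discussed earlier, and the capacity falls from 1 bit at p = 0 to 0 bits at p = 1/2.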
CODING PERFORMANCE AND DECODING COMPLEXITY

The noisy channel coding theorem states that there exist ‘‘good’’ error-correcting codes for any rate R < C such that the probability of error in an ML decoder is arbitrarily small. However, the proof of this theorem is nonconstructive, which leaves open the problem of the search for specific ‘‘good’’ codes. Also, Shannon assumed exhaustive ML decoding, which has a complexity proportional to the number of words in the code. It is clear that long codes are required to approach capacity and, therefore, that more practical decoding methods are needed. These problems, left by Shannon, have kept researchers searching for good codes for almost 50 years, up to the present time. Gallager (6) showed that the probability of error of a ‘‘good’’ block code of length n and rate R < C is bounded exponentially in the block length as follows:

    P(E) ≤ e^(−nE_b(R))    (5)

where what is known as the error exponent E_b(R) is greater than zero for all rates R < C. Like Shannon, Gallager continued to assume a randomly chosen code and an exhaustive ML decoding. The decoding complexity K̂ is then of the order of the number of codewords, i.e., K̂ ≅ e^(nR), and therefore the decrease of P(E) is bounded only algebraically in the decoding complexity:

    P(E) ≤ K̂^(−E_b(R)/R)    (6)
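The algebraic form of Eq. (6) is just Eq. (5) rewritten through the substitution K̂ = e^(nR); a quick numeric check with arbitrary illustrative values (not data from any real code):

```python
from math import exp

def exp_bound(n, Eb):
    """Right-hand side of Eq. (5): e^(-n * Eb)."""
    return exp(-n * Eb)

def complexity_bound(n, R, Eb):
    """Right-hand side of Eq. (6) with K = e^(nR): K^(-Eb/R)."""
    K = exp(n * R)
    return K ** (-Eb / R)

# The two expressions coincide: e^(-n*Eb) = (e^(nR))^(-Eb/R).
```

The check also makes the qualitative point of this section visible: the bound shrinks exponentially in n, but only polynomially in the decoding complexity K̂.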
The exponential error bound given in Eq. (5) for block codes also extends to convolutional codes of memory order m in the form

    P(E) ≤ e^(−(m+1)nE_c(R))    (7)

where (m + 1)n is called the constraint length of the convolutional code, and the convolutional error exponent E_c(R) is greater than zero for all rates R < C. Both E_b(R) and E_c(R) are positive functions of R for R < C and are completely determined by the channel characteristics. It is shown in (2) and (6) that the complexity of an implementation of the ML decoding algorithm called the Viterbi algorithm is exponential in the constraint length, i.e., K̂ ≅ e^((m+1)nR). Thus, the probability of error is again only an algebraic function of its complexity, as follows:

    P(E) ≤ K̂^(−E_c(R)/R)    (8)

Both of the bounds Eq. (5) and Eq. (7) imply that an arbitrarily small error probability is achievable for R < C either by increasing the code length n for block codes or by increasing the memory order m for convolutional codes. For codes to be very effective, they must be long in order to average the effects of noise over a large number of symbols. Such a code may have as many as 2^200 possible codewords and many times that number of possible received words. While an exhaustive ML decoding still conceptually exists, such a decoder is impossible to implement. It is very clear that the key obstacle to approaching channel capacity is not only the construction of specific ‘‘good’’ long codes, but also the problem of their decoding complexity. Certain simple mathematical constructs enable one to determine the most important properties of ‘‘good’’ codes. Even more importantly, such criteria often make it feasible for the encoding and decoding operations to be implemented in practical electronic equipment. Thus, there are three main aspects of the channel coding problem: (1) to find codes that have the required error-correcting ability (this usually demands that the codes be long); (2) a practical method of encoding; and (3) a practical method of making decisions at the receiver, that is, performing the error-correction process. Interested readers should refer to the literature (1–9).

BIBLIOGRAPHY

1. S. Lin and D. J. Costello, Jr., Error Control Coding: Fundamentals and Applications, Englewood Cliffs, NJ: Prentice-Hall, 1983.
2. A. J. Viterbi and J. K. Omura, Principles of Digital Communication and Coding, New York: McGraw-Hill, 1979.
3. W. W. Peterson and E. J. Weldon, Jr., Error-Correcting Codes, 2nd ed., Cambridge, MA: The MIT Press, 1972.
4. F. J. MacWilliams and N. J. A. Sloane, The Theory of Error-Correcting Codes, New York: North-Holland, 1977.
5. E. R. Berlekamp, Algebraic Coding Theory, New York: McGraw-Hill, 1968.
6. R. G. Gallager, Information Theory and Reliable Communication, New York: Wiley, 1968.
7. G. D. Forney, Jr., Concatenated Codes, Cambridge, MA: The MIT Press, 1966.
8. R. Blahut, Theory and Practice of Error Control Codes, Reading, MA: Addison-Wesley, 1983.
9. G. C. Clark and J. B. Cain, Error Correction Coding for Digital Communications, New York: Plenum, 1981.

IRVING S. REED
University of Southern California

XUEMIN CHEN
General Instrument Corporation
CHAOS. See CHAOS, BIFURCATIONS, AND THEIR CONTROL.
Wiley Encyclopedia of Electrical and Electronics Engineering

Client–Server Systems
Andrzej Goscinski and Wanlei Zhou, Deakin University, Geelong, Vic, Australia
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5303
Article Online Posting Date: December 27, 1999

Abstract. The sections in this article are: The Client-Server Model; Communication Between Clients and Servers; Sun's Network File System; The Development of the Rhodos; Building the Distributed Computing Environment on Top of Existing Operating Systems.
CLIENT–SERVER SYSTEMS
By amalgamating computers and networks into one single computing system, a distributed computing system has created the possibility of sharing information and peripheral resources. Furthermore, these systems improve the performance of a computing system and of individual users through parallel execution of programs, load balancing and sharing, and replication of programs and data. Distributed computing systems are also characterised by enhanced availability and increased reliability. However, the amalgamation process has also generated some serious challenges and problems. The most important, critical challenge was to synthesise a model of distributed computing to be used in the development of both application and system software. Another critical challenge was to develop ways to hide the distribution of resources and build relevant services upon them. The synthesis of a model of distributed computing has been influenced by a need to deal with the issues generated by distribution, such as

- Locating and accessing remote data, programs, and peripheral resources
- Coordinating distributed programs executing on different computers
- Maintaining the consistency of replicated data and programs
- Detecting and recovering from failures
- Protecting data and programs stored and in transit
- Authenticating users

The model that has been used to develop application and system software of distributed computing systems is the client-server model. Because of this, the current image of computing is client-server distributed computing. The goal of this article is to introduce and discuss the client-server model and the communication paradigm which supports it, and to show how this model has influenced the development of different systems and applications. This article contains three major parts. The first part introduces the client-server model and different concepts and extensions to this model.
The second part discusses communication supporting distributed computing systems built based on the client-server model. It contains a detailed discussion of two dimensions of the communication paradigm: the communication pattern, one-to-one and group communication; and the techniques, message passing and remote procedure call (RPC), which are used to design and build client-server based applications. The third part presents advanced applications developed based on the client-server model. The first and simplest of the presented applications of the client-server model is the network file system (NFS). It is an extension to centralised (local) operating systems (e.g., Unix, MS-DOS) which allows transparent remote file access. Subsequent sections show the RHODOS distributed operating system and the distributed computing environment (DCE), respectively. RHODOS has been built from scratch on top of a bare computer. It employs the concept of a microkernel, which is the cornerstone of the whole client-server-based operating system. It provides full transparency to the user. On the other hand, DCE is built on top of existing operating systems such as Unix and VMS, and hides differences among individual computers. However, it does not fully support transparency.

THE CLIENT-SERVER MODEL

The Client-Server Model in a Distributed Computing System

A distributed computing system is a set of application and system programs and data dispersed across a number of independent personal computers connected by a communication network. In order to provide requested services to users, the system and relevant application programs must be executed. Because services are provided as a result of executing programs on a number of computers, with data stored in one or more locations, the whole activity is called distributed computing. The problem is how to formalize the development of distributed computing. The main issue of distributed computing is programs in execution, which are called processes. The second issue is that these processes cooperate or compete in order to provide the requested services. The client-server model is a natural model of distributed computing: it is able to deal with the problems generated by distribution, can be used to describe these processes and their behavior when providing services to users, and allows the design of system and application software for distributed computing systems. According to this model there are two processes: the client, which requests a service from another process, and the server, which is the service provider.
The server performs the requested service and sends back a response. This response could be a processing result, a confirmation of completion of the requested operation, or even a notice about a failure of an operation. The client-server model, and the association between this model and the physical environment in which it is used, are illustrated in Fig. 1. The basic items of the model, the client and server and the request and response, and the elements of a distributed computing system are distinguished. This figure firstly shows that the user must send a request to an individual server in order to be provided with a given service; a need for another service requires the user to send a request to another server. Secondly, the client and server processes execute on two different computers. They communicate at the virtual (logical) level by exchanging requests and responses. In order to achieve this virtual communication, physical messages are sent between these two processes. This implies that the operating systems of the computers and the communication system of the distributed computing system are actively involved in the service provision. The most important features of the client-server model are simplicity, modularity, extensibility, and flexibility. Simplicity manifests itself by closely matching the flow of data with the control flow. Modularity is achieved by organizing and integrating a group of computer operations into a separate service. Also, any set of data with operations on this data can be organized as a separate service. The whole distributed computing system developed based on the client–server model can be easily extended by adding new services in the form of new servers. The servers which do not satisfy user requirements can be easily modified or even removed. Only the interfaces between the clients and servers must be maintained. From the user's point of view, a distributed computing system can provide services such as printing, electronic mail, file service, authentication, naming, database service, and computing service. These services are provided by appropriate servers.

[Figure 1. The client–server model and its association with operating systems and a communication facility. A client process and a server process, each supported by its local operating system, exchange a request and a response at the virtual level; the actual requesting and responding messages travel through the network communication facility.]

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

Cooperation between Clients and Servers in a Distributed Computing System

A system where there is only one server would not be able to provide high-performance, reliable, and cost-effective services to users. As was shown in the previous section, one server is used to provide services to more than one client. The simplest form of cooperation between clients and servers (based on sharing) allows for lowering the costs of the whole system and more effective use of resources. Examples of services based on this form of cooperation are a printing service and a file service. Processes can act as either clients or servers, depending on the context. A file server which receives a request to read a file from a user's client process must check the access rights of this user. For this purpose it sends a request to an authentication server and waits for a response. Its response to the client depends on the response from the authentication server; the file server acts as a client of the authentication server. Thus, a service provided to the user by a distributed computing system developed based on the client-server model can require a chain of cooperating servers.
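As a concrete illustration, the file-server/authentication-server chain just described can be sketched as follows. This is a minimal, single-process sketch with hypothetical names (AuthenticationServer, FileServer, and the request fields are all invented for illustration), not the interface of any real system; a real deployment would exchange these requests as network messages.

```python
# Sketch of a chain of cooperating servers: the file server acts as a
# client of the authentication server before answering the user's request.

class AuthenticationServer:
    def __init__(self, rights):
        self._rights = rights  # user -> set of files the user may read

    def handle(self, request):
        user, filename = request["user"], request["file"]
        return {"granted": filename in self._rights.get(user, set())}

class FileServer:
    def __init__(self, auth_server, files):
        self._auth = auth_server   # client-side reference to the
        self._files = files        # authentication server

    def handle(self, request):
        # Act as a client of the authentication server first.
        check = self._auth.handle({"user": request["user"],
                                   "file": request["file"]})
        if not check["granted"]:
            return {"status": "denied"}                    # failure notice
        return {"status": "ok",
                "data": self._files[request["file"]]}      # processing result

auth = AuthenticationServer({"alice": {"report.txt"}})
fs = FileServer(auth, {"report.txt": "quarterly figures"})

print(fs.handle({"user": "alice", "file": "report.txt"}))  # granted
print(fs.handle({"user": "bob", "file": "report.txt"}))    # denied
```

Note that the file server's response to its client depends entirely on the authentication server's response, exactly the chained behavior described in the text.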
Distributed computing systems provide the opportunity to improve performance through parallel execution of programs on a network (sometimes called clusters) of workstations, and decrease the response time of databases through data replication. Furthermore, they can be used to support synchronous distant meetings and cooperative workgroups. They also can increase reliability by service multiplication.
In these cases many servers must contribute to the overall application. Furthermore, in some cases this requires simultaneous requests to be sent to a number of servers. Different applications will require different semantics for the cooperation between clients and servers. Distributed computing systems have moved from the basic one-to-one client–server model to the one-to-many and chain models in order to improve performance and reliability. Furthermore, client and server cooperation can be strongly influenced and supported by some active entities which are extensions to the client-server model. In a distributed computing system there are two different forms of cooperation between clients and servers. The first form assumes that a client requests a temporary service. Another situation is generated by a client which wants to arrange for a number of calls to be directed to a particular serving process. This implies a need for establishing long-term bindings between a client and a server.

Groups in Distributed Computing Systems

A group is a collection of processes, in particular servers, which share common features (described by a set of attributes) or application semantics. In general, processes are grouped in order to

- deal with a set of processes as a single abstraction;
- form a set of servers which can provide an identical service (but not necessarily of the same quality);
- encapsulate the internal state and hide interactions among group members from the clients, and provide a uniform interface to the external world; and
- deliver a single message to multiple receivers, thereby reducing the sending and receiving overheads (1).

There are two types of groups: closed and open (2). In a closed group only the members of the group can send and receive messages to access the resources of the group. In an open group not only can the members of the group exchange messages and request services, but nonmembers of the group can also send messages to group members.
Importantly, the nonmembers of the group need not join the group nor have any knowledge that the requested service is provided by a group. Four group structures are often supported to provide the most appropriate policy for a wide range of user applications.
The peer group is composed of a set of member processes that cooperate for a particular purpose; fault-tolerant and load-sharing applications dominate this group style. The client-server group is made from a potentially large number of client processes with a peer group of server processes; it is an open group. The diffusion group is a special case of the client-server group, where a single request message is sent by a client process to all servers. The hierarchical group is an extension to the client–server group; in large applications with a need for sharing between large numbers of group members, it is important to localize interactions within smaller clusters of components in an effort to increase performance.

According to external behavior, groups can be classified into two major categories: deterministic and nondeterministic. A group is considered deterministic if each member must receive and act on a request. This requires coordination and synchronization between the members of the group. In a deterministic group, all members are considered equivalent. Nondeterministic groups assume their applications do not require consistency in group state and behavior, and they relax the deterministic coordination and synchronization. The members of a nondeterministic group are not equivalent, and each can provide a different response to a group request, or not respond at all, depending on the individual member's state and function. In order to act properly and efficiently, the members of a group must exchange messages amongst themselves (above normal application messages) to resolve the current status and membership of the group. Any change in group membership requires all members to be notified in order to satisfy the requested message requirements. Furthermore, users are provided with primitives to support group membership discovery (3) and group association operations (4). Group membership discovery allows a process to determine the state of the group and its membership.
However, as the requesting process has no knowledge of the group members' location, a network broadcast is required. There are four operations to support group association: create, destroy, join, and leave. Initially, a process requiring group communication creates the required group. A process is considered to be a group member after it has successfully issued a group join primitive, and it remains a member of the group until it issues a leave group primitive. When the last member of the group leaves, the group is destroyed.

Extensions to the Client-Server Model

A client and server can cooperate either directly or indirectly. In the former case there is no additional entity which participates in exchanging requests and responses between a client and a server. Indirect cooperation in the client-server model requires two additional entities, called agents, to request a service and to be provided with the requested service. The role of these agents can vary from a simple communication module which hides communication network details to an entity which is involved in mediating between clients and servers, resolving heterogeneity issues, and managing resources and cooperating servers. As was presented previously, a client can invoke desired servers explicitly by sending direct requests to these servers. In this case the programmer of a user application must concentrate both on the application and on managing server cooperation and communication. Writing resource management and communication software is expensive, time consuming, and error prone. The interface between the client and the server is complicated, differs from one application to another, and the whole service provided is not transparent to the client process. Clients can also request multiple services implicitly. This requires the client to send only one request to a general server. The requested service will be composed by this invoked server, cooperating with other servers based on information provided in the request. After completion of the necessary operations by the involved servers, the general server sends a response back to the client. This coordination can be performed by a properly designed agent. Despite the fact that such an agent is quite complicated, the cooperation between the client and the server is based on a single, well-defined interface. Furthermore, transparency is provided to the client, which reduces the complexity of the application. Cooperation between a client and multiple servers can be supported by a simple communication system which employs a one-to-one message protocol. Although this communication pattern is simple, its performance is poor, because each server involved must be invoked by sending a separate message. The overall performance of a communication system supporting message delivery in a client–server based distributed computing system can be dramatically improved if a one-to-many communication pattern is used. In this case a single request is sent by the client process to all servers, specified by a single group name. The use of multicast at the physical/data link layer improves this system even further.

The Three-Tier Client-Server Architecture

Agents and servers acting as clients can generate different architectures of distributed computing systems.
The three-tier client-server architecture extends the basic client-server model by adding a middle tier to support the application logic and common services. In this architecture, a distributed application consists of three components: a user interface and presentation processing component, responsible for accepting inputs and presenting the results (the client tier); a computational function processing component, responsible for providing transparent, reliable, secure, and efficient distributed computing, and for performing the processing necessary to solve a particular application problem (the application tier); and a data access processing component, responsible for accessing data stored on external storage devices, such as disk drives (the back-end tier). These components can be combined and distributed in various ways to create different configurations with varying complexity. Figure 2(a) shows a centralized configuration where all three types of components are located on a single computer. Figure 2(b) shows three two-tier configurations where the three types of components are distributed on two computers. Figure 2(c) shows a three-tier configuration where all three types of components are distributed on different computers.

[Figure 2. One- (a), two- (b), and three-tier (c) client–server configurations, showing how the user interface/presentation, computational function, and data access components are placed on client and server computers.]

Figure 3 illustrates an example implementation of the three-tier architecture. In this example, the upper tier consists of client computers that run user interface processing software. The middle tier consists of computers that run computational function processing software. The bottom tier consists of back-end data servers. In a three-tier client–server architecture, application clients usually do not interact directly with the data servers; instead, they interact with the middle-tier servers to obtain services. The middle-tier servers then either fulfil the requests themselves, sending the results back to the clients, or, more commonly, if additional resources are required, act (as clients themselves) on behalf of the application clients to interact with the data servers in the bottom tier or with other servers within the middle tier. Compared with a normal two-tier client–server architecture, the three-tier client–server architecture demonstrates: (1) better transparency, since the servers within the application tier allow an application to detach the user interface from back-end resources, and (2) better scalability, since servers as individual entities can be easily modified, added, or removed.

Service Discovery

To invoke a desired service a client must know whether there is a server which is able to provide this service, as well as its characteristics, name, and location. This is the issue of service discovery. In the case of a simple distributed computing system, where there are only a few servers, there is no need to identify the existence of a desired server; information about all available servers is available a priori. This implies that service discovery is restricted to locating the server which provides the desired service. On the other hand, in a large distributed computing system which is a federation of a set of distributed computing systems, with the potential for many service providers who offer and withdraw these services dynamically, there is a need to learn both whether a proper service (e.g., a very fast color printer of high quality) is available at a given time, and if so, its name and location. Service discovery is achieved through the following approaches.

Computer Address Is Hardwired into Client Code. This approach requires the location of the server, in the form of a computer address, to be provided. However, it is only applicable in very small and simple systems, where there is only one server process running on the destination computer. Another version of this approach is based on a more advanced naming system, where requests are sent to processes rather than to computers. In this case each process is located using a pair 〈computer_address, process_name〉. A client is provided with not only the name of a server, but also the address of the server's computer. This solution is not location transparent, as the user is aware of the location of the server.

Broadcast Is Used to Locate Servers. According to this approach each process has a unique name. In order to send a request a client must know the name of the server. However, this is not enough, because the client's operating system must also learn the address of the computer where the server runs. For this purpose the client's operating system broadcasts a special locate request containing the name of the server, which is received by all computers on the network. An operating system which finds the server's name in the list of its processes sends back a 'here I am' response containing its address (location). The client's operating system receives the response and can store (cache) the server's
computer address for future communication. This approach is transparent; however, the broadcast overhead is high, as all computers on the network are involved in the processing of the locate request.

[Figure 3. An application of the three-tier architecture in a distributed computing system: users interact with client computers running user interface and presentation software (client tier); these act as clients of server computers running computational functions and services (application tier), which in turn act as clients of fast parallel computers providing data services over data storage (back-end tier).]

Server Location Lookup Is Performed via a Name Server. This approach is very similar to the broadcast-based approach; however, it reduces the broadcast overhead. In order to learn the address of a desired server, the operating system of the client's computer sends a 'where is' request to a special system server, called a name server, asking for the address of the computer where the desired server runs. This means that the name and location (computer address) of the name server are known to all computers. The name server sends back a response containing the address of the desired server. The client's operating system receives the response and can cache the server's computer address for future communication. This approach is transparent and much more efficient than the broadcast-based approach. However, because the name server is centralized, the overall performance of the distributed computing system could be degraded, as the name server can become a bottleneck. Furthermore, the reliability of this approach is low: if the name server's computer crashes, the distributed computing system cannot work.

In a large distributed computing system there could be a large number of servers. Moreover, servers of the same type can be characterized by different attributes describing the services they provide (e.g., one laser printer is a color printer, another is a black and white printer). Furthermore, servers can be offered by some users and revoked dynamically. A user is not able to know the names and attributes of all these servers, nor their dynamically changing availability. There must be a server which can support users in dealing with these problems.

A Broker Is Employed. This approach is very similar to the server location lookup performed via a name server. However, there are real conceptual differences between a broker and a name server: a broker frees clients from remembering ASCII names or path names of all servers (and eventually the server locations), and allows clients to identify attributes of servers and learn about their availability. A broker is a server which (1) allows a client to identify available servers by a set of attributes which describe the properties of a desired service; (2) mediates cooperation between clients and servers; (3) allows service providers to register the services they support, by providing their names, locations, and features in the form of attributes; (4) advertises registered services and makes them available to clients; and (5) withdraws services dynamically. Thus, a broker is a server which embodies both service management and naming services.
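The broker just described can be sketched as a small registry. This is an illustrative sketch only: the names, attributes, and the register/withdraw/lookup interface below are invented for the example and do not correspond to any existing broker API; a real broker would also mediate the subsequent client–server exchange.

```python
# Sketch of a broker: servers register a name, a location, and descriptive
# attributes; clients look services up by attribute instead of remembering
# server names or locations; services can be withdrawn dynamically.

class Broker:
    def __init__(self):
        self._services = {}  # name -> {"location": ..., "attrs": {...}}

    def register(self, name, location, **attrs):
        self._services[name] = {"location": location, "attrs": attrs}

    def withdraw(self, name):
        self._services.pop(name, None)  # services can be revoked at any time

    def lookup(self, **wanted):
        # Return (name, location) pairs whose attributes match the request.
        return [(n, s["location"]) for n, s in self._services.items()
                if all(s["attrs"].get(k) == v for k, v in wanted.items())]

broker = Broker()
broker.register("lp1", "host-a:515", kind="printer", color=True)
broker.register("lp2", "host-b:515", kind="printer", color=False)

# The client asks for a service by its properties, not by name:
print(broker.lookup(kind="printer", color=True))  # [('lp1', 'host-a:515')]
```

Compared with the plain name server, the lookup key here is a set of attributes rather than a known server name, which is exactly the conceptual difference noted in the text.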
Client–Server Interoperability

Reusability of servers is a critical issue for both users and software manufacturers, due to the high cost of software writing. This issue can be easily resolved in a homogeneous environment, because the accessing mechanisms of clients may be made compatible with software interfaces, with static compatibility specified by types and dynamic compatibility by protocols. Cooperation between heterogeneous clients and servers is much more difficult, as they are not fully compatible. Thus, the issue is how to make them interoperable. Wegner (5) defines interoperability as the ability of two or more software components to cooperate despite differences in language, interface, and execution platform. There are two aspects of client–server interoperability: the unit of interoperation, and the interoperation mechanism. The basic unit of interoperation is a procedure (5). However, larger-granularity units of interoperation may be required by software components. Furthermore, preservation of temporal and functional properties may also be required. There are two major mechanisms for interoperation: interface standardization and bridging. The objective of the former is to map client and server interfaces to a common representation. The advantages of this mechanism are: (1) it separates the communication models of clients from those of servers, and (2) it provides scalability, since it only requires m + n maps, where m and n are the numbers of clients and servers, respectively. The disadvantage of this mechanism is that it is closed. The objective of the latter is to provide a two-way map between client and server. The advantages of this mechanism are: (1) openness, and (2) flexibility: it can be tailored to the requirements of a given client and server pair. However, this mechanism does not scale as well as the interface standardization mechanism, as it requires m × n maps.
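The scalability argument (m + n maps for interface standardization versus m × n maps for pairwise bridging) can be made concrete with a toy sketch. The request formats and field names below are invented for illustration: each of the m client formats is mapped once into a common representation, and each of the n server formats is mapped once from it, so any client can reach any server through only m + n converters.

```python
# Interface standardization: every format converts to/from one common
# representation, so adding a client or a server costs one map, not m or n.

# m client-side maps: native request -> common representation
to_common = {
    "clientA": lambda req: {"op": req["operation"], "arg": req["value"]},
    "clientB": lambda req: {"op": req[0], "arg": req[1]},
}

# n server-side maps: common representation -> native request
from_common = {
    "serverX": lambda c: ("do_" + c["op"], c["arg"]),
    "serverY": lambda c: {"cmd": c["op"], "param": c["arg"]},
}

def interoperate(client, server, request):
    # Route any client's request to any server via the common representation.
    return from_common[server](to_common[client](request))

print(interoperate("clientB", "serverX", ("read", 42)))  # ('do_read', 42)
print(interoperate("clientA", "serverY",
                   {"operation": "write", "value": 7}))
```

A bridging design would instead need a dedicated converter for every (client, server) pair, i.e., m × n converters, which is the scaling disadvantage noted in the text.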
Conclusions

In this section we introduced the client-server model and some concepts related to this model. Partitioning software into clients and servers allows us to place these components independently on computers in a distributed computing system. Furthermore, it allows these clients and servers to execute on different computers in order to complete the processing of an application in an integrated manner. This paves the way to achieving high productivity and high performance in distributed computing. The client-server model is becoming the predominant form of software application design and operation. However, to fully benefit from the client–server model, there is a need to employ an operating system and a communication network which links the computers on which these processes run. Furthermore, in order to locate a server, the operating system must be involved. The question is what class of operating system can be used. There are two classes of operating systems which could be employed to develop a distributed computing system: a network operating system and a distributed operating system. A network operating system is constructed by adding a module to the local centralized operating system of each computer which allows processes to access remote resources and services; however, in the majority of cases this solution does not fully support transparency. A distributed operating system is built from scratch and hides the distribution of resources and services; this solution, although futuristic from the current-practice point of view, provides location transparency. It is clear that the extensions to the basic client-server model, described in the previous sections, are achieved through an operating system. Furthermore, network communication services are invoked by an operating system on behalf of cooperating clients and servers.

COMMUNICATION BETWEEN CLIENTS AND SERVERS

Distributed computing systems must be fast in order to instil in users the feeling of a huge powerful computer sitting on their desks. This implies that communication between clients and servers must be fast. Furthermore, the speed of communication between remote client and server processes should not differ greatly from the speed between local processes. The issue is how to build a communication facility within a distributed computing system that achieves high communication performance. One of the strongest factors influencing the performance of a communication facility is the communication paradigm: that is, the communication model supporting cooperation between clients and servers, and the operating system support provided to deal with this cooperation. There are two issues in the communication paradigm. Firstly, a client can send a request either to a single server or to a group of servers. This leads to two patterns of communication: one-to-one and one-to-many, the latter also called group communication (both are operating system abstractions). Secondly, these two patterns of interprocess communication can be developed based on two different techniques: message passing, adopted for distributed computing systems in the late 1970s, and remote procedure call (RPC), adopted for distributed computing systems in the mid-1980s.
These two techniques are supported by two respective sets of primitives provided by an operating system. Furthermore, communication between processes on different computers can be given the same format as communication between processes on a single computer. The following topics are discussed in this section. Firstly, message passing is presented, including communication primitives and their semantics: direct and indirect communication; blocking and nonblocking primitives; buffered and unbuffered exchange of messages; and reliable and unreliable primitives. Secondly, RPC is discussed: the basic features of this technique; parameters, results, and their marshalling; client-server binding; and reliability issues. Thirdly, group communication is considered: in particular, the basic concepts of this communication pattern; group structures; different types of groups; group membership; message delivery and response semantics; and message ordering in group communication. Message Passing—Message-Oriented Communication We define message-oriented communication as a form of communication in which the user is explicitly aware of the messages used in communication and the mechanisms used to deliver and receive messages (6).
CLIENT–SERVER SYSTEMS
Basic Message Passing Primitives. A message is sent and received by executing the following two primitives: send(dest, src, buffer). The execution of this primitive sends the message stored in buffer to a server process named dest. The message contains the name of the client process, src, to be used by the server to send a response back. receive(client, buffer). The execution of this primitive causes the receiving server process to be blocked until a message arrives. The server process specifies the name of the client process from which a message is desired, and provides a buffer to store the incoming message. Note that the receive primitive must be issued before a message arrives; otherwise the request could be declared lost and would have to be retransmitted by the client. Of course, when the server process sends any message to the client process, it must also use these two primitives; the server sends a message by executing the primitive send and the client receives it by executing the primitive receive. There are several points that should be discussed at this stage. All of them are connected with one question: What semantics should these primitives have? The following alternatives are presented: direct or indirect communication via ports; blocking versus nonblocking primitives; buffered versus unbuffered primitives; reliable versus unreliable primitives; and structured forms of message-passing primitives. Direct and Indirect Communication via Ports. A very basic issue in message-based communication is where messages go. Message communication between processes uses one of two techniques: the sender designates either a fixed destination process or a fixed location for receipt of a message. The former technique is called direct communication—it uses direct names; the latter is called indirect communication, and it exploits the concept of a port.
In direct communication, each process that wants to send or receive a message must explicitly name the recipient or sender of the communication. In this case, the send and receive primitives have the following form: send(dest, src, buffer), receive(client, buffer). The dest and client are the names of the destination process (server) and the sending process (client) from which the server is prepared to receive a request. This scheme exhibits symmetry in naming: that is, both the sender and the receiver have to name one another in order to communicate. A variant of this scheme employs asymmetry in naming: only the client names the server, whereas the server is not required to name the client. Direct communication is easy to implement and to use. It enables a process to control the times at which it receives messages from each process. The disadvantage of the symmetric and asymmetric schemes is the limited modularity of the resulting process definitions. Changing the name of a process may necessitate examining all other process definitions: all references to the old name must be found in order to modify them to the new name. This is not desirable from the point of view of separate compilation. Moreover, the receive primitive in a server should allow receipt of a mes-
sage from any client to provide a service to whatever client process calls it. Direct communication does not allow more than one client. Similarly, direct communication does not make it possible to send one request to more than one identical server. This implies the need for a more sophisticated technique. Such a technique is based on ports. A port can be abstractly viewed as a protected kernel object into which messages may be placed by processes and from which messages can be removed: that is, messages are sent to and received from ports. Processes may have ownership, send, and receive rights on a port. Each port has a unique identification (name) that distinguishes it. A process may communicate with other processes through a number of different ports. In this case dest in the send primitive is the name of a port of the server the request is sent to. Logically associated with each port is a FIFO queue of finite length. Messages which have been sent to this port but which have not yet been removed from it by a process reside on this queue. Messages may be added to this queue by any process which can refer to the port via a local name (e.g., capability). A port should be declared. A port declaration serves to define a queuing point for messages. A process which wants to remove a message from a port must have the appropriate receive rights. Usually, only one process may have receive access to a port at a time. Messages sent to a port are normally queued in FIFO order. However, an emergency message can be sent to a port and receive special treatment with regard to queuing. Blocking versus Nonblocking Primitives. One of the most important properties of message passing primitives concerns whether their execution could cause delay. We distinguish blocking and nonblocking primitives. We say that a primitive has nonblocking semantics if its execution never delays its invoker; otherwise, a primitive is said to be blocking. In the former case, a message must be buffered.
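The port abstraction and the blocking/nonblocking send variants just described can be sketched in Python. This is a hypothetical illustration only: the names Port, send_blocking, and send_nonblocking are ours, not an operating system API, and a real port is a protected kernel object with ownership, send, and receive rights.

```python
import queue
import threading

class Port:
    """A toy 'port': a bounded FIFO queue of messages.
    (Illustrative only; real ports are kernel objects with access rights.)"""
    def __init__(self, capacity=8):
        self._queue = queue.Queue(maxsize=capacity)

    def send_blocking(self, message):
        # Blocking send: the caller waits until the message is queued.
        self._queue.put(message)

    def send_nonblocking(self, message):
        # Nonblocking send: return immediately; report failure if the queue is full.
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            return False

    def receive(self, timeout=None):
        # Blocking receive: the caller is suspended until a message arrives.
        return self._queue.get(timeout=timeout)

# Two-way communication uses the same primitives symmetrically:
# the server receives a request on one port and replies on another.
request_port = Port()
reply_port = Port()

def server():
    src, body = request_port.receive()      # blocks until a request arrives
    reply_port.send_blocking(("reply", body.upper()))

worker = threading.Thread(target=server)
worker.start()
request_port.send_blocking(("client_1", "hello"))
reply = reply_port.receive()
worker.join()
print(reply)   # -> ('reply', 'HELLO')
```

Note how the receiving server blocks in receive() until a message is placed on the port's queue, matching the blocking semantics described above.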
The previously described primitives have blocking semantics. It is necessary to distinguish two different forms of the blocking send primitive. These forms are generated by different criteria. The first criterion reflects the operating system design and addresses buffer management and message transmission. The blocking and nonblocking send primitives are illustrated in Fig. 4. If the blocking send primitive is used, the sending process (client) is blocked: that is, the instruction following the send primitive is not executed until the message has been completely sent. The blocking receive implies that the process which issued this primitive remains blocked (suspended) until a message arrives and is put into the buffer specified in the receive primitive. If the nonblocking send primitive is used, the sending process (client) is only blocked for the period of copying the message into a kernel buffer. This means that the instruction following the send primitive can be executed even before the message is sent. This allows parallel execution of the process and the message transmission. The second criterion reflects the client-server cooperation and the programming language approach to dealing with message communication. In this case the client is blocked until the server (receiver) has accepted the request message and
Figure 4. Operating-system-oriented blocking (a) and nonblocking (b) send primitives.
the result or acknowledgment has been received by the client, as illustrated in Fig. 5. There are three forms of the receive primitive. The blocking receive is the most common, since the receiving process often has nothing else to do while awaiting receipt of a message. There is also a nonblocking receive primitive, and a primitive for checking whether a message is available to receive. As a result, a process can receive all messages and then select one to process. Unbuffered versus Buffered Message Passing Primitives. In some message-based communication systems, messages are buffered between the time they are sent by a client and received by a server. If a buffer is full when a send is executed,
Figure 5. Client–server cooperation oriented blocking send primitive.
there are two possible solutions: the send may be delayed until there is space in the buffer for the message, or the send may return a code to the client, indicating that because the buffer is full the message could not be sent. The situation of the receiving server is different. The receive primitive informs the operating system about a buffer into which the server wishes to put an arrived message. A problem occurs when the receive primitive is issued after the message arrives. The question is what to do with the message. The first possible approach is to discard the message. The client could time out and re-send, and hopefully the receive primitive will be invoked in the meantime; otherwise, the client can give up. The second approach to dealing with this problem is to buffer the message in the operating system area for a specified period of time. If during this period the appropriate receive primitive is invoked, the message is copied to the invoking server's space. If the receive primitive is not invoked and the timeout expires, the message is discarded. Buffered message passing systems are more complex than unbuffered ones, since they require creation, destruction, and management of the buffers. They also generate protection problems, and cause catastrophic-event problems when a process owning a port dies or is killed. Unreliable versus Reliable Primitives. Different catastrophic events, such as a computer crash or a communication system failure, can happen in a distributed computing system. These can cause a request message to be lost in the network, a response message to be lost or delayed in transit, or the responding computer to "die" or become unreachable. Moreover, messages can be duplicated or delivered out of order. The primitives discussed previously cannot cope with these problems. They are called unreliable primitives. The unreliable primitive send merely puts a message on the network.
No guarantee of delivery is provided, and no automatic retransmission is carried out by the operating system when a message is lost. Dealing with failure requires providing reliable primitives. In reliable interprocess communication, the send primitive handles lost messages using internal retransmissions and acknowledgments on the basis of timeouts. This implies that when send terminates, the process is sure that the message was received and acknowledged. Reliable and unreliable receive differ in that the former automatically sends an acknowledgment confirming message reception, whereas the latter does not. Two-way communication requires the utilization of the basic message passing primitives in a symmetrical way. If the client requested any data, the server sends reply messages (responses) using the send primitive. For this reason the client has to issue the receive primitive to receive any message from the server. Reliable and unreliable primitives are contrasted in Fig. 6. Structured Forms of Message Passing Based Communication. A structured form of communication using message passing is achieved by distinguishing requests and replies and providing for bidirectional information flow. This means that the client sends a request message and waits for a response. The set of primitives is as follows.
Figure 6. Unreliable (a) and reliable (b) message passing primitives.

When remote procedure calls are used, a client interacts with a server by means of the call statement service_name(value_args, result_args).
send(dest, src, buffer). Sends a request and gets a response; it combines the previous client's send to the server with a receive to get the server's response. get_request(client, buffer). Executed by the receiver (server) to acquire a message containing work for it to do. send_response(src, dest, buffer). The receiver (server) uses this primitive to send a reply after completion of the work. It should be emphasised that the semantics described in the previous sections can also be applied to these primitives. The result of the send and receive combination in the structured form of the send primitive is one operation performed by the interprocess communication system. This implies that rescheduling overhead is reduced, buffering is simplified (because request data can be left in a client's buffer, and the response data can be stored directly in this buffer), and the transport-level protocol is simplified. Remote Procedure Call Message passing between remote and local processes is visible to the programmer. It is a completely untyped technique. Programming message-passing-based applications is difficult and error prone. An answer to these problems is the RPC technique, which is based on the fundamental linguistic concept known as the procedure call. The very general term remote procedure call means a type-checked mechanism that permits a language-level call on one computer to be automatically turned into a corresponding language-level call on another computer. The first and most complete description of the RPC concept was presented in Ref. 7. Basic Features of Remote Procedure Calls. The idea of remote procedure calls (RPC) is very simple and is based on the observation that a client sends a request and then blocks until a remote server sends a response. This approach is very similar to a well-known and well-understood mechanism referred to as a procedure call.
Thus, the goal of a remote procedure call is to allow distributed programs to be written in the same style as conventional programs for centralized computer systems. This implies that RPC must be transparent. This leads to one of the main advantages of this communication approach: the programmer does not have to know whether the called procedure is executing on a local or a remote computer.
To illustrate that both local and remote procedure calls look identical to the programmer, suppose that a client program requires some data from a file. For this purpose there is a read primitive in the program code. In a system supported by a classical procedure call, the read routine from the library is inserted into the program. This procedure, when executing, puts the parameters into registers, and then traps to the kernel as a result of issuing a READ system call. From the programmer's point of view there is nothing special; the read procedure is called by pushing the parameters onto the stack and is executed. In a system supported by RPC (Fig. 7), the read routine is a remote procedure which runs on a server computer. In this case, another procedure, called a client stub, from the library is inserted into the program. When executing, it also traps to the kernel. However, rather than placing the parameters into registers, it packs them into a message and issues the send primitive, which forces the operating system to send it to the server. Next, it calls the receive primitive and blocks itself until the response comes back. The server's operating system passes the arrived message to a server stub, which is bound to the server. The stub is blocked waiting for messages as a result of issuing the receive primitive. The parameters are unpacked from the received message and a procedure is called in a conventional manner. Thus, the parameters and return address are on the stack, and the server does not see that the original call was made on a remote client computer. The server executes the procedure call and returns the results to the virtual caller: that is, the server stub. The stub packs them into a message and issues send to return the results. The stub then returns to the beginning of its loop to issue the receive primitive, and blocks waiting for the next request message.
The result message on the client computer is copied to the client process's buffer (in practice, to the stub's part of the client). The message is unpacked, and the results are extracted and copied to the client in a conventional manner. As a result of calling read, the client process finds its data available. The client does not know that the procedure was executed remotely. It is evident that the semantics of remote procedure calls are analogous to those of local procedure calls: the client is suspended when waiting for results; the client can pass arguments to the remote procedure; and the called procedure can return results. However, since the client's and server's processes are on different computers (with disjoint address spaces), the remote procedure has no access to the data and variables of the client's environment. There is a difference between message passing and remote procedure calls. Whereas in message passing all required values must be explicitly assigned into the fields of a message before transmission, the remote procedure call provides marshalling of the parameters for message transmission: that is, the list of parameters is collected together by the system to form a message.
Figure 7. The sequence of operations in RPC.
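The stub sequence just described can be simulated in a few lines of Python. This is a sketch under simplifying assumptions: the names client_stub, server_stub, and transport are hypothetical, pickle stands in for a real marshalling format, and the "network" is a local function call rather than a send/receive pair.

```python
import pickle

# The 'remote' procedure, living on the server side.
def read(filename, nbytes):
    data = {"config.txt": b"hello world"}     # stand-in for a real file system
    return data[filename][:nbytes]

SERVER_PROCEDURES = {"read": read}

def client_stub(procedure, *args):
    # Marshalling: collect the procedure name and parameters into a message.
    request = pickle.dumps((procedure, args))
    response = transport(request)             # stands in for send()/receive()
    return pickle.loads(response)             # unmarshal the result

def server_stub(request):
    # Unmarshal the request, call the procedure conventionally,
    # then marshal the result for the trip back.
    procedure, args = pickle.loads(request)
    result = SERVER_PROCEDURES[procedure](*args)
    return pickle.dumps(result)

def transport(request):
    # In a real system the message crosses the network; here it is local.
    return server_stub(request)

# The client calls read() as if it were local; the stub hides the messages.
result = client_stub("read", "config.txt", 5)
print(result)   # -> b'hello'
```

The client code never touches a message: packing, transmission, and unpacking are entirely inside the stubs, which is exactly the transparency property RPC is meant to provide.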
Parameters and Results in RPCs. One of the most important problems of the remote procedure call is parameter passing and the representation of parameters and results in messages. Parameters can be passed by value or by reference. By-value message systems require that message data be physically copied. Thus, passing value parameters over the network is easy: the stub copies parameters into a message and transmits it. If the semantics of the communication primitives allow the client to be suspended until the message has been received, only one copy operation is necessary. Asynchronous message semantics often require that all message data be copied twice: once into a kernel buffer and again into the address space of the receiving process. Data copying costs can dominate the performance of by-value message systems. Moreover, by-value message systems often limit the maximum size of a message, forcing large data transfers to be performed in several message operations, reducing performance. Passing reference parameters (pointers) over a network is more complicated. In general, passing data by reference requires sharing of memory. Processes may share access to either specific memory areas or entire address spaces. As a result, messages are used only for synchronization and to transfer small amounts of data, such as pointers to shared memory. The main advantage of passing data by reference is that it is cheap: large messages need not be copied more than once. The disadvantages of this method are that the programming task becomes more difficult, and that it requires a combination of virtual memory management and interprocess communication, in the form of distributed shared memory. Marshalling Parameters and Results. Remote procedure calls require the transfer of language-level data structures between the two computers involved in the call. This is generally performed by packing the data into a network buffer on one computer and unpacking it at the other site.
This operation is called marshalling. More precisely, marshalling is the process (performed when sending the request as well as when sending the result back) in which three actions can be distinguished:
Extracting the parameters to be passed to the remote procedure and the results of executing the procedure; Assembling these two into a form suitable for transmission among the computers involved in the remote procedure call; and Disassembling them on arrival. The marshalling process must reflect the data structures of the language. Primitive types, structured types, and user-defined types must be considered. In the majority of cases, marshalling procedures for scalar data types, and procedures to marshal structured types built from the scalar ones, are provided as a part of the RPC software. Client-Server Binding. Usually, RPC hides all details of locating servers from clients. However, as we stated in a previous section, in a system with more than one server (e.g., file server, print server), knowledge of the location of a client's files or of a special type of printer is important. This implies the need for a mechanism to bind a client and a server, in particular, to bind an RPC stub to the right server and remote procedure. There are two aspects of binding: the way the client specifies what it wants to be bound to (this is the problem of naming), and the way the client locates the server and specifies the procedure to be invoked (this is the problem of addressing). In a distributed computing system there are two different forms of cooperation between clients and servers. The first form assumes that a client requests a temporary service. The other arises when a client wants to arrange for a number of calls to be directed to a particular serving process. This implies a need for a run-time mechanism for establishing long-term bindings between this client and a server. In the case of requests for a temporary service, the problem can be solved using broadcast and multicast messages to locate a server. In the case of long-term bindings, such a solution is not enough, because the process wants
to call the located server during a time horizon. This means that a special binding table should be created in which established long-term binding objects (i.e., a client name and a server name) are registered. The RPC run-time procedure for performing remote calls expects to be provided with a binding object as one of its arguments. This procedure directs the call to the binding address received. It should be possible to add new binding objects to the table, remove binding objects from the binding table (which in practice means breaking a binding), and update the binding table. In systems with name servers, broadcasting is replaced by the operation of sending a request to a name server asking for the location of a given server, and receiving a response with the address of this server. Binding can take place at compile time, link time, or call time. Error Recovery Issues. Because the client and server are separate processes which run on separate computers, they are prone to failures of themselves, their computers, or the communication system. The remote procedure may not complete successfully. For example, the result message is not returned to the client as a response to its call message, because one of four events may occur: the request message is lost; the result (response) message is lost; the server computer crashes and is restarted; or the client computer crashes and is restarted. These events form the basis for the design of RPC recovery mechanisms. Three different semantics of RPC, and their mechanisms, can be identified to deal with the problems generated by these four events: Maybe call semantics. Timeouts are used to prevent a client waiting indefinitely for a response message; At-least-once call semantics. This mechanism usually includes timeouts and a call retransmission procedure. The client tries to call the remote procedure until it gets a response or can tell that the server has failed; Exactly-once call semantics.
With at-least-once call semantics it can happen that the call is received by the server more than once, because of lost responses. This can have the wrong effect. To avoid this, the server responds to each retransmission with the result of the first execution of the called procedure. Thus, the mechanisms for exactly-once semantics include, in addition to those used in at-least-once call semantics (i.e., timeouts, retransmissions), call identifications and a server table of current calls. This table is used to store calls received for the first time and the procedure execution results for these calls. Message Passing versus Remote Procedure Calls. A problem arises in deciding which of the two interprocess communication techniques presented is better, if either, and whether there are any suggestions for when, and for what systems, these facilities should be used. First of all, the syntax and semantics of the remote procedure call are a function of the programming language being used. On the other hand, choosing a precise syntax and semantics for message passing is more difficult than for RPC because there are no standards for messages. Moreover, if the language aspects of RPC are neglected, then because of the variety
of message passing semantics, these two facilities can look very similar. Examples of message passing systems that look like RPC are message passing for the V system (which Ref. 8 now calls a remote procedure call system) and message passing for Amoeba (9) and RHODOS (10). Comparing the remote procedure call and message passing, the former has the important advantage that the interface of a remote service can be easily documented as a set of procedures with certain parameter and result types. Moreover, from the interface specification, it is possible to automatically generate code that hides all of the details of messages from a programmer. On the other hand, a message passing model provides flexibility not found in remote procedure call systems. However, this flexibility comes at the cost of difficulty in precisely documenting the behavior of a message passing interface. The question remains when these facilities should be used. The message passing approach appears preferable when serialization of request handling is required. The RPC approach appears preferable when there are significant performance benefits to concurrent request handling. RPC is particularly efficient for request–response transactions. Group Communication Distributed computing systems provide the opportunity to improve overall performance through parallel execution of programs on a network of workstations, decreasing the response time of databases using data replication, supporting synchronous distant meetings and cooperative workgroups, and increasing reliability by service multiplication. In these cases many servers must contribute to the overall application. This implies a need to invoke multiple services by sending a simultaneous request to a number of servers. This leads to group communication. The concept of a process group is not new.
The V-system (11), Amoeba (2), Chorus (12), and RHODOS (10) all support this basic abstraction, providing process groups to applications and operating system services with the use of group communication. Basic Concepts of Group Communication. Group communication is an operating system abstraction which supports the programmer by offering convenience and clarity. This operating system abstraction must be distinguished from message transmission mechanisms such as multicast (one-to-many physical entities connected by a network) or its special case broadcast (one-to-all physical entities connected by a network). A request is sent by a client called src to a group of servers providing the desired service, named group_name, by executing either send(group_name, src, buffer) when the message passing technique is used, or call service_name(value_args, result_args) when the RPC technique is used. This request is delivered following the semantics of the primitive used. The primitives should be constructed such that there is no difference between invoking a single server and invoking a group of servers. This means that communication pattern transparency is provided to the programmer. Thus, groups should be named in the same manner that single processes are named. Each group is treated as one sin-
gle entity; its internal structure and interactions are not shown to the users. The mapping of group names onto multicast addresses is performed by the interprocess communication facility of an operating system, supported by a naming server. However, if multicast or even broadcast is not provided, group communication can be supported by one-to-one communication at the network level. Communication groups are dynamic. This means that new groups can be created and existing groups can be destroyed. A process can be a member of more than one group at the same time. It can leave a group or join another one. In summary, group communication shares many design features with message passing and RPC. However, there are some issues which are very specific, and knowledge of them can be of great value to the application programmer.
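A minimal sketch of one-to-many communication can make these concepts concrete, assuming each group member owns a message queue and group names map to membership sets. The names group_send, join, and leave are hypothetical, not a real operating system interface; a real facility would map the group name to a multicast address in the kernel.

```python
import queue

# Each member of a group owns its own message queue (its 'port').
members = {
    "server_a": queue.Queue(),
    "server_b": queue.Queue(),
    "server_c": queue.Queue(),
}
# Group name -> current membership (the group's internal structure
# is hidden from the client, which only uses the group name).
groups = {"file_service": {"server_a", "server_b", "server_c"}}

def group_send(group_name, src, message):
    """One-to-many send: deliver the message to every current member.
    If no multicast support exists, this falls back to repeated
    one-to-one delivery, as noted in the text."""
    for member in groups[group_name]:
        members[member].put((src, message))

def join(group_name, member):
    # Groups are dynamic: processes may join...
    groups[group_name].add(member)

def leave(group_name, member):
    # ...and leave at run time.
    groups[group_name].discard(member)

group_send("file_service", "client_1", "read config.txt")
delivered = sum(not q.empty() for q in members.values())
print(delivered)   # -> 3
```

The client addresses only the group name; how many members exist, and how the message reaches each of them, stays inside the communication facility.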
Message Ordering in Group Communication. The semantics of message ordering are an important factor in providing good application performance and reduction in the complexity of distributed application programming. The order of message delivery to members of the group will dictate the type of group it is able to support. There are four possible message ordering semantics:
Message Delivery Semantics. Message delivery semantics of a group relates to the successful delivery of a message to processes in a group. There are four choices of delivery semantics:
Causal Ordering. The causal ordering semantic delivers request messages to all members of the current group such that the causal ordering of message delivery is preserved. This implies that if the sending of a message m⬘ causally follows the delivery of message m, then each process in the group receives m before m⬘.
Single Delivery. Single delivery semantics require that only one of the current group members needs to receive the message for the group communication to be successful. k-Delivery. In k-delivery semantics, at least k members of the current group will receive the message successfully. Quorum Delivery. With quorum delivery semantics, a majority of the current group members will receive the message successfully. Atomic Delivery. With atomic delivery all current members of the group successfully receive the message or none does. This delivery semantic is the most stringent as processes can and do fail and networks may also partition during the delivery process of the request messages, making some group members unreachable. Message Response Semantics. By providing a wide range of message response semantics the application programmer is capable of providing flexible group communication to a wider range of applications. The message response semantics specify the number and type of expected message responses. There are five broad categories for response semantics: No Responses. By providing no response to a delivered request message the group communication facility is only able to provide unreliable group communication. Single Response. The client process expects (for successful delivery of a message) a single response from one member of the group. k-Responses. The client process expects to obtain k responses for the delivered message from the members of the process group. By using k response semantics the groups resilience can be defined (13). The resilience of a group is based on the minimum number of processes that must receive and respond to a message. Majority Response. The client process expects to receive a majority of responses from the current members of the process group. Total Response. The client process requires all current members of the group to respond to the delivery of a request message.
No Ordering. This semantic implies that request messages are sent to the current group of processes in no particular order.

FIFO Ordering. This semantic implies that all request messages transmitted in first-in first-out (FIFO) order by a client process to the current members of the group will be delivered in FIFO order.
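FIFO ordering is commonly enforced with per-sender sequence numbers, as in this sketch (the mechanism is an assumption for illustration; the text names only the semantic). A receiver buffers any message that arrives out of order until the gap from its sender is filled.

```python
# Per-sender sequence numbers enforcing FIFO delivery at a receiver.

class FifoReceiver:
    def __init__(self):
        self.expected = {}   # sender -> next sequence number to deliver
        self.buffer = {}     # sender -> {seq: payload} held back

    def receive(self, sender, seq, payload):
        delivered = []
        self.buffer.setdefault(sender, {})[seq] = payload
        nxt = self.expected.get(sender, 0)
        while nxt in self.buffer[sender]:     # deliver any now-contiguous run
            delivered.append(self.buffer[sender].pop(nxt))
            nxt += 1
        self.expected[sender] = nxt
        return delivered
```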
Total Ordering. The total ordering semantic implies that all messages are reliably delivered in sequence to all current members of the group, or no member receives the message. The total ordering semantic also guarantees that all group members see the same order of messages. Total ordering is more stringent than FIFO ordering, as all message transfers between all current members of the group are in order. This implies that all processes within the current group perceive the same total ordering of messages. Causal ordering is concerned with the relationship between two messages, while total ordering is concerned with all group member processes seeing the same order of messages.

Conclusions

In this section we described two issues of the communication paradigm for client-server cooperation: first, the communication pattern, including one-to-one and one-to-many (group communication); second, two techniques, message passing and RPC, which are used to develop distributed computing systems. The message passing technique allows clients and servers to exchange messages explicitly using the send and receive primitives. Various semantics, such as direct and indirect, blocking and nonblocking, buffered and unbuffered, and reliable and unreliable, can be used in message passing. The RPC technique allows clients to request services from servers by following a well-defined procedure call interface. Various issues are important in RPC, such as marshalling and unmarshalling of parameters and results, binding a client to a particular server, and handling exceptions.
SUN’S NETWORK FILE SYSTEM

The first major step in the development of distributed software was made when inexpensive diskless personal computers were connected by inexpensive local networks in order to share a file service or a printer service.
CLIENT–SERVER SYSTEMS
Distributed File Systems

A distributed file system is a key component of any distributed computing system. The main function of such a system is to create a common file system that can be shared by all the clients running on autonomous computers in the distributed computing system. The common file system should store programs and data and make them available as needed. Since files can be stored anywhere in a distributed computing system, a distributed file system should provide location transparency. To achieve this goal a distributed file system usually follows the client-server model.

A distributed file system typically provides two types of services: the file service and the directory service, implemented by the file server and the directory server, respectively, distributed over the network. These two servers can also be implemented as a single server. The file server provides operations on the contents of files, such as read, write, and append. The directory server provides operations, such as directory and file creation and deletion, for manipulating directories and file names. The client application program interface (client API, usually in the form of a process or a group of processes) runs on each client computer and provides a uniform user-level interface for accessing file servers.

In this section we present one of the most important achievements of the 1980s, which is still in use now: the Network File System, known as NFS, developed based on the client-server model.

NFS Architecture

NFS was developed by Sun Microsystems and introduced in late 1984 (14). Since then it has been widely used in both industry and academia. NFS was originally developed for use on Unix workstations. Currently, many manufacturers support it for other operating systems (e.g., MS-DOS). Here, NFS is introduced based on the Unix system. To understand the architecture of NFS, we need to define the following terms:

INODE.
This is a data structure that represents either an open file or directory within the Unix file system. It is used to identify and locate a file or directory within the local file system.

RNODE. The remote file node is a data structure that represents either an open file or directory within a remote file system.

VNODE. The virtual file node is a data structure that represents either an open file or directory within the virtual file system (VFS).

VFS. The virtual file system is a data structure (linked lists of VNODEs) that contains all necessary information on a real file system that is managed by NFS. Each VNODE associated with a given file system is included in a linked list attached to the VFS for that file system.

The NFS server integrates the functions of both a file server and a directory server, and NFS clients use a uniform interface, the VFS/VNODE interface, to access the NFS server. The VFS/VNODE abstraction makes it possible to support multiple file system types in a generic fashion. The VFS and VNODE data structures provide the linkage between the abstract uniform file system interface and the real file system (such as a Unix or MS-DOS file system) that accesses the data. Further, the VFS/VNODE abstraction allows NFS to make remote files and local files appear identical to a client program.

In NFS, a client process transparently accesses files through the normal operating system interface. All operating system calls that manipulate files or file systems are modified to perform operations on VFSs/VNODEs. The VFS/VNODE interface hides the heterogeneity of underlying file systems and the location of these file systems. The steps of processing a user-level file system call can be described as follows (Fig. 8):

1. The user-level client process makes the file system call through the normal operating system interface.

2. The request is redirected to the VFS/VNODE interface. A VNODE is used to describe the file or directory accessed by the client process.

3. If the request is for accessing a file stored in the local file system, the INODE pointed to by the VNODE is used. The INODE interface is used and the request is served by the Unix file system interface.

4. If the request is for accessing a file stored locally in other types of file systems (e.g., the MS-DOS file system), a proper interface of that file system is used to serve the request.

5. If the request is for accessing a file stored remotely, the RNODE pointed to by the VNODE is used and the request is passed to the NFS client; RPC messages are then sent to the remote NFS server that stores the requested file.

6. The NFS server processes the request by using the VFS/VNODE interface to find the appropriate local file system to serve the request.

The Role of RPC

The communication between NFS clients and servers is implemented as a set of RPC procedures. The RPC interface provided by an NFS server includes operations for directory manipulation, file access, link manipulation, and file system access (15).
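The dispatch steps above can be sketched as a toy program (all class and method names are illustrative, not NFS source code): the client sees one uniform read interface, and the VNODE routes the call to a local file system or, through an RPC-like call, to a remote NFS server.

```python
# Toy VFS/VNODE dispatch: one interface, local or remote implementation.

class UnixFS:
    def __init__(self, files):
        self.files = files                 # inode number -> file contents

    def read(self, inode):                 # step 3: local Unix file system
        return self.files[inode]

class NFSServer:
    def __init__(self, fs):
        self.fs = fs

    def handle_read(self, rnode):          # step 6: server-side file system
        return self.fs.read(rnode)

class NFSClient:
    def __init__(self, server):
        self.server = server

    def read(self, rnode):                 # step 5: stands in for the RPC
        return self.server.handle_read(rnode)

class VNode:
    def __init__(self, backend, node_id):
        self.backend, self.node_id = backend, node_id

    def read(self):                        # step 2: uniform VNODE interface
        return self.backend.read(self.node_id)
```

A client program calling `VNode(...).read()` cannot tell whether the backing store is a local `UnixFS` or a remote `NFSClient`, which is the transparency the VFS/VNODE abstraction provides.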
The actual specifications for these remote procedures are defined in the RPC language, and the data structures used by the procedures are defined in the XDR format. The RPC language is a C-like language used as input to Sun's RPC Protocol Compiler utility, which outputs the actual C language source code.

NFS servers are designed to be stateless, meaning that there is no need to maintain information (such as whether a file is open or the position of the file pointer) about past requests. The client keeps track of all information required to send requests to the server. Therefore, NFS RPC requests are designed to completely describe the operation to be performed. Also, most NFS RPC requests are idempotent, meaning that an NFS client may send the same request one or more times without any harmful side effects; the net result of the duplicate requests is the same. NFS RPC requests are transported using the unreliable User Datagram Protocol (UDP). NFS servers notify clients when an RPC completes by sending the client an acknowledgment (also using UDP).
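The interplay of statelessness, idempotency, and retransmission can be sketched as follows. The request format and helper names are invented for the example, and message loss is simulated with a callable rather than real UDP, but the logic mirrors the text: a lost request or reply simply triggers a retransmission, and re-executing an idempotent request is harmless.

```python
# At-least-once retransmission over idempotent requests to a stateless server.

class StatelessServer:
    def handle(self, request):
        # Stateless: the request fully describes the operation; idempotent:
        # re-executing the same request yields the same result.
        op, path, offset, count = request
        data = {"/etc/motd": b"hello, nfs"}[path]
        return data[offset:offset + count]

def rpc_call(server, request, lost, max_retries=5):
    """Send until acknowledged; 'lost' simulates a dropped request or reply."""
    for attempt in range(max_retries):
        if lost():                 # timeout expires, client retransmits
            continue
        return server.handle(request)
    raise TimeoutError("server did not acknowledge the request")
```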
Figure 8. The NFS structure. See text for description.
An NFS client sends its RPC requests to an NFS server one at a time. Although a client computer may have several NFS RPC requests in progress at any time, each of these requests must come from a different client process. When a client makes an RPC request, it sets a timeout period during which the server must service and acknowledge it. If the server does not acknowledge the request during the timeout period, the client retransmits it. This may happen if the request is lost along the way or if the server is too slow because of overloading. Since the RPC requests are idempotent, there is no harm if the server executes the same request twice. If the client gets a second acknowledgment for the request, the client simply discards it.

Conclusions

In this section we showed an application of the client-server model in the development of a distributed file system based on the Network File System. The NFS server integrates the functions of both a file server and a directory server. It has been built as an extension module to a centralized operating system (e.g., Unix or MS-DOS). NFS clients use RPC to communicate with the NFS system. This system allows clients running on diskless computers to access and share files.

THE DEVELOPMENT OF RHODOS

The vast majority of design and implementation efforts in the area of distributed computing systems have concentrated on client-server-based applications running on centralized operating systems (e.g., Unix, VMS, OS/2). However, there have also been substantial research efforts on the development of operating systems built from scratch based on the client-server model (called distributed operating systems). These systems support distributed computing systems developed on a set of personal
homogeneous computers connected by local or fast wide area networks. The results achieved have changed, and are still changing, the operating systems of distributed computing systems and the development of applications supported by these systems. The following systems have been developed based on the client-server model: V (8), Amoeba (2), Chorus (12), and RHODOS (16).

Distributed Operating Systems

A distributed operating system is one that looks to its users like a centralized operating system but runs on multiple, independent computers connected by fast local or wide area networks. A distributed operating system has the following four major goals (the first three are also goals of a centralized operating system):

Hide details of hardware by creating abstractions: for example, software which provides a set of higher-level functions which form a virtual computer;

Manage resources to allow their use in the most effective way and support user processes in the most efficient way;

Create a pleasant user computational environment; and

Hide the distribution of resources (information, peripheral, and computational resources) in order to provide full transparency to users.

A generic architecture of a distributed operating system which allows these goals to be achieved has the following software levels. Software providing an abstraction sits on bare hardware and allows the handling of interrupts and context switching. The second level of a distributed operating system is formed by software which manages physical resources such
as processor time, memory, and input/output, and virtual resources such as processes, remote communication, communication ports, and network protocols. It depends on the support provided by functions of the software abstraction level. The second level provides its services to the system services level. This third software level allows the management of files and object (services, resources) names, and creates a human user interface formed by graphics terminals, command interpreters, and authentication systems. This level creates an image of a computer system for users. User processes form the software level sitting on the system services level.

In a client-server based distributed operating system, all management functions and services provided to user processes are modelled and developed as individual cooperating server processes. User processes act as clients. However, because servers cooperate in order to achieve the goals of a distributed operating system, they also act as clients. As physical memory in a distributed computing system is not shared, remote processes communicate using messages. In order to have a uniform communication model, local processes also communicate using messages. This provides communication transparency in a natural manner.

In this section we will use RHODOS to illustrate the application of the client-server model in the development of a new class of distributed operating systems. For this reason we will mainly concentrate on the kernel servers and microkernel, as they form a new image of operating systems for distributed computing systems.

The RHODOS Architecture

RHODOS (research oriented distributed operating system) is a microkernel and message passing based system developed using the client-server model. This operating system is capable of supporting parallel processing on a network of workstations and providing load sharing and balancing in order to provide high-performance services to users (10). There are three layers of cooperating processes in RHODOS: user processes, system servers, and kernel servers (Fig. 9). Each process executes in user mode and is confined to an individual address space which is controlled and maintained by the RHODOS microkernel.

In RHODOS, software creating abstractions forms a microkernel. The microkernel provides the following functions: context switching, interrupt handling, basic operations on memory pages relating to the hardware, and local interprocess communication. Furthermore, the microkernel is responsible for storing and managing basic data structures.

Kernel servers implement the mechanisms of the RHODOS functionality. Two groups of kernel servers can be distinguished. To the first group belong those servers which provide services which could be identified in any distributed or network operating system: process management, memory management, remote IPC management, communication protocols, and I/O management (drivers in RHODOS have also been developed as individual servers). The second group encompasses servers which provide the advanced services necessary to support parallel processing on a network of workstations, and load sharing and balancing. These services are process migration, remote process creation, and data collection.

System servers implement the policy of the RHODOS functionality. They provide services such as naming, file accessing and manipulation (in basic and transaction modes), two-way and m-way authentication, and global scheduling. A broker service has also been developed and will be installed shortly. In order to provide these services, system servers act as clients and invoke relevant kernel servers and the nucleus using standard system calls.
Figure 9. The logical architecture of RHODOS.
Figure 10. RHODOS interprocess communication facility.
User applications and processes are those developed and allocated to perform tasks for users. These processes have no special privileges and obtain services through calls to the microkernel and system servers.

Communication in RHODOS

In RHODOS, access to local and remote services is achieved in the same transparent manner, via a system name of that service and uniform interprocess communication, which is provided by the Interprocess Communication (IPC) facility. The facility provides three basic communication primitives: send(), recv(), and call(). Both send() and recv() provide the basic message passing semantics, while call(), recv(), and send() in combination provide synchronous RPC. By providing both message passing and RPC semantics, the programmer is able to select the most appropriate communication technique for a given application.

The functioning of the IPC facility is divided into three sections: the local IPC module, the IPC manager, and the network manager (Fig. 10). The local IPC module is an integral part of the RHODOS microkernel and provides local communication between processes on the same personal computer. If the destination process exists on the local computer, the module will complete the transfer. Otherwise, the IPC module sends a request to the IPC manager to provide a remote communication service.

The primary responsibility of the IPC manager is the receiving and transmitting of remote messages for all processes within the RHODOS distributed computing system. It also supports group communication. This service is achieved with the cooperation of the name server by assigning a single name to a group of names. Furthermore, in order to support one-to-one and group communication, the IPC manager is responsible for address resolution.
In particular, a message that is sent to an individual process or a group requires the IPC manager to resolve the destination processes’ (servers’) location and provide the mechanism for the transport of the message to the desired process or group of processes. In order to deliver a message to a remote process (server), the IPC manager invokes a delivery server, called the network manager. This server consists of a protocol stack employing transport, network, and data link layer protocols.
Currently, the transport service is provided by a fast specialized RHODOS Reliable Datagram Protocol (RRDP). Network and data link layer protocols are provided by the IP/Ethernet suite.

RHODOS Kernel Servers and Services

One of the basic features of the RHODOS design is that each resource is managed by a relevant server: the process manager is responsible for processes and basic operations on processes, the space manager for memory, and the IPC manager for remote and group communication and address resolution. A process is a very special resource, because it is constructed from some basic resources such as spaces, data structures usually called process control blocks, communication ports, and buffers. Thus, in RHODOS advanced operations on processes, such as process migration and remote process creation, are provided by separate servers: the migration manager and the REX manager.

Process Manager. The job of the process manager is to manage the processes that are created in RHODOS. The process manager manipulates the process queues and deals with parent processes waiting for child processes to exit. It cooperates with other kernel servers, for instance with the migration manager to transfer a process's state during migration, and with the remote execution manager to set up a process's state when a process is created.

Space Manager. One of the goals of RHODOS is portability across hardware platforms. Thus, RHODOS memory management has been separated into two sections: hardware dependent and hardware independent. The small hardware-dependent section is found in the microkernel, and the larger hardware-independent section comprises the RHODOS space manager. This server deals with spaces: logical units of memory, independent of physical units (e.g., pages), which are mapped to the physical memory.
The space manager supports two types of page operations: copy_on_write, which allows twin processes to share pages while they are reading them but makes separate copies when either process attempts to write to a page; and copy_on_reference, which is used in process migration, where only referenced pages are transferred from the source computer to the destination computer to which the process has been migrated. Handling exceptions, creating spaces, and transferring pages have been extended with additional functions in order to provide operating system built-in support for Distributed Shared Memory (DSM). Two consistency models are supported in the RHODOS DSM: invalidation based and update based.

Device Manager. Transparency is an important feature of RHODOS. This includes not only interprocess communication between remote hosts, but also a transparent unified interface to physical devices such as serial ports, keyboards, video screens, and disks. Device drivers provide this interface. Device drivers in RHODOS are processes in their own right, with the privilege and status of kernel servers. The benefits obtained from implementing device drivers as processes include the ability to enable and disable new drivers dynamically, as well as to use normal process debugging tools whilst the device driver is active. The device manager is the controlling entity that allows users to access a requested physical device.

Migration Manager. The process migration manager is responsible for the migration of running processes from the home computer to a remote computer. Migrating a process in RHODOS involves migrating the process state, address space, communication state, file state, and other resources. Thus, process migration requires the cooperation of all the servers managing these resources: the process, space, and IPC managers, and the file server, respectively. The process migration manager only coordinates these servers, and all of them cooperate following the client-server model. Process migration in RHODOS is a transaction-based operation performed on processes. Thus, the initial request from the source process migration manager to the destination process migration manager to migrate a selected process starts the transaction. The destination process migration manager commits this transaction by sending a response back if all operations of installation of resources on the destination computer by the individual servers (the process, space, and IPC managers, and the file server) have been completed successfully. Otherwise, an abort response is sent back.

Remote Execution Manager. The function of the remote execution (REX) manager is to provide coordination for the creation of processes on local and remote computers. If a process is
created on a local computer, only the local REX manager is involved. If processes are created on remote computers, the home REX manager cooperates with remote REX managers to ensure processes are created correctly whilst maintaining the link with the process that issued the request. The generic cooperation of the servers is shown in Fig. 11.

Figure 11. The generic cooperation of servers involved in local or remote process creation.

Data Collection Manager. The RHODOS Data Collection System is responsible for the collection and dissemination of the operational statistics of processes and exchanged messages in the RHODOS environment. The Data Collection System consists of a data collection manager (server) and stubs of code within the microkernel and other servers. The data collection manager is designed to be activated periodically and when special events occur (e.g., a new process was created, a process was killed), and to provide a central repository for the accumulation of statistics. It provides accurate process statistics to the global scheduler. These statistics permit the global scheduler to make the most appropriate decisions concerning process placement within the RHODOS environment.

RHODOS System Servers

The RHODOS system provides direct services to users by employing the following servers: the naming server, file server, authentication server, and the broker server, called the trader. Furthermore, RHODOS provides a special service which improves the overall performance of all services by employing the global scheduler. The utilization of the client-server model in the development of user-oriented services of a distributed operating system is presented here based on the global scheduler.

RHODOS provides global scheduling services in order to allocate or migrate processes to idle or lightly loaded computers, to share computational resources, and to balance load. Global scheduling employs both static allocation and load balancing. Static allocation is employed when the system load remains steadily high and new processes have to be created; it decides where to create new processes. Load balancing is employed to react to large fluctuations in system load; it decides when to migrate a process, which process to migrate, and where to migrate it. These servers make these decisions based on information about the current load of the personal computers participating in global scheduling, their load trends, and the process communication pattern.

Conclusions

In this section we showed an application of the client-server model in the development of an advanced distributed operating system, RHODOS. RHODOS consists of a microkernel and two layers of cooperating servers, called kernel servers and system servers. Generally speaking, kernel servers implement the mechanism of the RHODOS functionality, whereas system servers implement the policy of the RHODOS functionality. User processes, sitting on top of the RHODOS software, obtain services from RHODOS servers. When a RHODOS server receives a service request, it may serve the request directly, or it may contact other servers if services from these servers are required.

BUILDING THE DISTRIBUTED COMPUTING ENVIRONMENT ON TOP OF EXISTING OPERATING SYSTEMS

The previous section contained a presentation of RHODOS, an example of a distributed operating system developed based on the client-server model and the concept of a microkernel.
The whole system has been built from scratch on bare hardware. There is another approach to building a distributed computing environment by putting it on top of existing operating systems. Such a software layer hides the differences among the individual computers, and forms a single computing system. The Role of the Client-Server Model in Building a Distributed Computing Environment Open Software Foundation’s Distributed Computing Environment (DCE) (17) is a vendor-neutral platform for supporting distributed applications. DCE is a standard software structure for distributed computing that is designed to operate across a range of standard Unix, VMS, OS/2, and other operating systems. It includes standards for RPC, name, time, security, and thread services—all sufficient for client–server computing across heterogeneous architectures. DCE uses the client–server model to support its infrastructure and transparent services. All DCE services are provided through servers. By using DCE, application programmers can avoid considerable work in creating supporting
services, such as creating communication protocols for various parts of a distributed program, building a directory service for locating those pieces, and maintaining a service for providing security in their own programs.

In the previous section we mainly addressed the kernel servers and microkernel of RHODOS, as they are the result of the new approach to building distributed computing systems based on the client-server model and the concept of a microkernel. Here, since DCE is a complete extension of centralized operating systems to form a distributed computing system, we mainly concentrate on the servers which directly provide services to users.

The Architecture of DCE

The architecture of DCE masks the physical complexity of the networked environment by providing a layer of logical simplicity, composed of a set of services that can be used separately or in combination to form a comprehensive distributed computing system. Servers that provide DCE services usually run on different computers; so do clients and servers of a distributed application program that uses DCE. DCE is based on a layered model which integrates a set of fundamental technologies (Fig. 12).

Figure 12. The logical architecture of DCE.

To applications, DCE appears to be a single logical system with two broad categories of services (18):

The DCE Core Services. These provide tools with which software developers can create end-user applications and system software products for distributed computing:

Threads. DCE supports multithreaded applications;

RPC. The fundamental communication mechanism, which is used in building all other services and applications;

Security Service. Provides the mechanism for writing applications that support secure communication between clients and servers;

Cell Directory Service (CDS). Provides a mechanism for logically naming objects within a DCE cell (a group of client and server computers);

Distributed Time Service (DTS). Provides a way to synchronize the clocks on different computers in a distributed computing system.
DCE Data-Sharing Services. In addition to the core services, DCE provides important data-sharing services, which require no programming on the part of the end user and which facilitate better use of shared information:

Distributed File Service (DFS). Provides a high-performance, scalable, secure method for sharing remote files;

Enhanced File Service (EFS). Provides features which greatly increase the availability and further simplify the administration of DFS.

In a typical distributed environment, most clients perform their communication with only a small set of servers. In DCE, computers that communicate frequently are placed in a single cell. Cell size and geographical location are determined by the people administering the cell. Cells may exist along social, political, or organizational boundaries and may contain up to several thousand computers. Although DCE allows clients and servers to communicate across different cells, it optimizes for the more common case of intra-cell communication. One computer can belong to only one cell at a time.

The Role of RPC

DCE RPC is based on the Apollo Network Computing System (NCA/RPC). The components of DCE RPC can be split into the following two groups, according to the stage of their usage:

Used in Development. This includes IDL (Interface Definition Language) and the idl compiler. IDL is a language used to define the data types and operations applicable to each interface in a platform-independent manner. The idl compiler is the tool used to translate IDL definitions into code which can be used in a distributed application;

Used at Runtime. This includes the RPC runtime library, rpcd (the RPC daemon), and rpccp (the RPC control program).

To build a basic DCE application, the programmer has to supply the following three files:

The Interface Definition File. It defines the interfaces (data structures, procedure names, and parameters) of the remote procedures that are offered by the server;

The Client Program.
It defines the user interfaces, the calls to the remote procedures of the server, and the client side processing functions; The Server Program. It implements the calls offered by the server. DCE uses threads to improve the efficiency of RPCs. A thread is a lightweight process that executes a portion of a program, cooperating with other threads concurrently executing in the same address space of a process. Most of the information that is a part of a process can then be shared by all threads executing within the process address space. Sharing reduces significantly the overhead incurred in creating and maintaining the information, and the amount of information that needs to be saved when switching between threads of the same program.
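The shared-address-space point can be illustrated with a short sketch (Python threads stand in for DCE threads; all names here are illustrative, not the DCE API):

```python
import threading

# Shared state lives in the process address space; every thread sees it.
request_log = []
log_lock = threading.Lock()

def handle_call(call_id):
    # Each "RPC" handler runs as a lightweight thread; no per-request
    # process creation, and the log is shared rather than copied.
    with log_lock:
        request_log.append(call_id)

threads = [threading.Thread(target=handle_call, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(request_log))  # all 8 calls were recorded in the shared structure
```

Because the log is shared rather than duplicated per process, the only per-request cost is a thread switch plus a brief lock acquisition.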
The Servers of DCE

All the higher-level DCE services, such as the directory services, security service, time service, and distributed file service, are provided by the relevant servers.

Directory Services. The main job of the directory services is to help clients find the locations of appropriate servers. To let clients access the services offered by a server, the server has to place some binding information into the directory. A directory is a hierarchically structured database which stores dynamic system configuration information; the directory is a realization of the naming system. Each name has attributes associated with it, which can be obtained via a query using the name. Each cell in a DCE distributed computing system has its own directory service, called the Cell Directory Service (CDS), that stores the directory service information for the cell (18). It is optimized for intra-cell access, since most clients communicate with servers in the same cell. Each CDS consists of CDS servers and CDS clerks. A CDS server runs on a computer containing a database of directory information (called a clearinghouse). Each clearinghouse contains some number of directories, analogous to, but not the same as, directories in a file system. Each directory, in turn, can logically contain other directories, object entries, or soft links (aliases that point to something else in CDS). Each cell may have multiple CDS servers. A node which does not run a CDS server must run a CDS clerk, which acts as an intermediary between a distributed application on that node and a CDS server. When a server wishes to make its binding information available to clients, it exports that information to one of its cell's CDS servers. When a client wishes to locate a server within its own cell, it imports that information from the appropriate CDS server by calling on the CDS clerk on its computer.
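The export/import flow can be mimicked with a toy clearinghouse (a plain dictionary stands in for the CDS database; the function names and the example entry name are illustrative, not the DCE API):

```python
# Toy cell directory: servers export binding information under a name,
# clients import it by name (a dict stands in for the clearinghouse).
clearinghouse = {}

def export_binding(name, host, port):
    # A server advertises where it can be reached.
    clearinghouse[name] = {"host": host, "port": port}

def import_binding(name):
    # A client (via its CDS clerk) looks the server up by name.
    return clearinghouse.get(name)

export_binding("/.:/subsys/printsrv", "node7.example.com", 5000)
binding = import_binding("/.:/subsys/printsrv")
print(binding["host"])  # the imported binding names the server's host
```

The real CDS adds replication, access control, and hierarchical directories on top of this lookup idea.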
DCE uses the Domain Name System (DNS) or the Global Directory Service (GDS, based on the X.500 standard) to enable clients to access servers in foreign cells. To access a server in a foreign cell, a client gives the cell's name and the name of the desired server. A CDS component called the Global Directory Agent (GDA) extracts the location of the named cell's CDS server from DNS or GDS, and a query is then sent directly to this foreign server.

Security Service. DCE provides four security services: authentication, authorization, data integrity, and data privacy. A security server (which may be replicated) is responsible for providing these services within a cell. The security server has three components:

Registry Service. A database of principal (a user of the cell), group, and organization accounts, their associated secret keys, and administration policies.

Key Distribution Service. Provides tickets to clients. A ticket is a specially encrypted object that contains a conversation key and an identifier that can be presented by one principal to another as proof of identity.

Privilege Service. Supplies the privileges of a particular principal; it is used in authorization.
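The ticket idea can be sketched with a toy key distribution service (an HMAC seal stands in for encryption; this illustrates the concept only and is not the Kerberos protocol or any DCE API):

```python
import hmac, hashlib, os, json

# Toy key distribution service: it shares a secret key with each principal
# (the "registry") and issues tickets binding a conversation key to a name.
registry = {"alice": os.urandom(16), "fileserver": os.urandom(16)}

def issue_ticket(client, server):
    conversation_key = os.urandom(16).hex()
    body = json.dumps({"client": client, "key": conversation_key}).encode()
    # Sealed with the *server's* secret, so only the server can verify it.
    seal = hmac.new(registry[server], body, hashlib.sha256).hexdigest()
    return body, seal, conversation_key

def verify_ticket(server, body, seal):
    expected = hmac.new(registry[server], body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, seal)

body, seal, key = issue_ticket("alice", "fileserver")
print(verify_ticket("fileserver", body, seal))  # an untampered ticket verifies
```

A tampered body fails verification, which is the property a ticket needs: the client cannot forge or alter what the key distribution service issued.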
The security server must run on a secure computer, since the registry on which it relies contains a secret key, generated from a password, for every principal in the cell. The DCE security services are based on Kerberos V5, created by MIT's Project Athena; DCE extends Kerberos version 5 by providing authorization services.

Time Service. The Distributed Time Service (DTS) of DCE is designed to keep a set of clocks on different computers synchronized. DTS uses the usual client-server structure: DTS clients, daemon processes called clerks, request the correct time from some number of servers, receive responses, and then reset their clocks as necessary to reflect this new knowledge. The DCE DTS comprises several components:

Time Clerk. The client side of DTS. It runs on a client computer and keeps the computer's local time synchronized by asking a time server for the correct time and adjusting the local time accordingly.

Time Servers. There are three types of time servers. The local time server maintains time synchronization within a given LAN. The global time server and courier time servers synchronize time among interconnected LANs. A time server synchronizes with other time servers by asking them for correct times and adjusting its own time accordingly.

DTS API. An interface through which application programs can access the time information provided by DTS.

Distributed File Services. DCE uses its distributed file service (DFS) to join the file systems of individual computers within a cell into a single file space. A uniform and transparent interface is provided for applications to access files located anywhere in the network. DFS is derived from the Andrew File System. It uses RPC for client-server communication and threads to enhance parallelism; it relies on the DCE directory to locate servers; and it uses the DCE security services for protection against attackers. DFS is based on the client-server model.
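A clerk's adjustment step can be sketched as follows (a plain average of server readings; actual DTS computes fault-tolerant time intervals, so this is only a schematic):

```python
def clerk_adjustment(local_time, server_times):
    # The clerk polls several time servers and nudges its clock toward
    # their consensus; "consensus" here is a plain average for brevity.
    estimate = sum(server_times) / len(server_times)
    return estimate - local_time  # correction to apply to the local clock

correction = clerk_adjustment(100.0, [103.0, 101.0, 102.0])
print(correction)  # 2.0
```

In practice the correction is applied gradually (clock slewing) rather than as a jump, so that time never runs backward for applications.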
DFS clients, called cache managers, communicate with DFS servers using RPC on behalf of user applications. There are two types of DFS servers: a fileset location server, which stores the locations of system and user files in DFS, and a file server, which manages files. A typical interaction between the various components of DFS is shown in Fig. 13. First, the application issues a file request call to the cache manager on its computer. If the requested file is in the local cache, the request is served using the local copy of the file. Otherwise, the cache manager locates the fileset location server through the CDS server, and the location of the file server that stores the requested file is then found through the fileset location server. Finally, the cache manager calls the file server and the file data are accessed.

Figure 13. Interactions between DFS components. 1: file request from an application; 2: locate the fileset location server; 3: locate the file server that stores the requested file; 4: access the requested file.

Conclusions

In this section we described an application of the client-server model in the development of an advanced distributed computing environment, DCE. DCE is built on top of existing operating systems, and it hides the heterogeneity of the underlying computers by providing an integrated environment for distributed computing. DCE consists of many integrated services, such as thread and RPC services, security service, directory service, time service, and distributed file service, that are necessary in performing client–server computing in a heterogeneous environment. Most of these services are implemented as individual servers or groups of cooperating servers. Application processes act as clients of DCE servers. Now in its fifth year (DCE 1.0 was announced in 1991), DCE has gone through several major stages of evolution and enhancement (through DCE 1.1 and DCE 1.2). Because of its operating system independence, DCE has gained significant support from user and vendor communities.

BIBLIOGRAPHY

1. L. Liang, S. T. Chanson, and G. W. Neufeld, Process groups and group communications: Classifications and requirements, IEEE Computer, 23 (2): 56–66, 1990.
2. A. S. Tanenbaum, Experiences with the Amoeba distributed operating system, Commun. ACM, 33 (12): 46–63, 1990.
3. F. Cristian, Understanding fault tolerant distributed systems, Commun. ACM, 34 (2): 56–78, 1991.
4. K. P. Birman and T. A. Joseph, Reliable communication in the presence of failures, ACM Trans. Comp. Sys., 5 (1): 47–76, 1987.
5. P. Wegner, Interoperability, ACM Comp. Surv., 28 (1): 285–287, 1996.
6. A. Goscinski, Distributed Operating Systems: The Logical Design, Reading, MA: Addison-Wesley, 1991.
7. A. D. Birrell and B. J. Nelson, Implementing remote procedure calls, ACM Trans. Comp. Sys., 2 (1): 39–59, 1984.
8. D. R. Cheriton, The V distributed system, Commun. ACM, 31 (3): 314–333, 1988.
9. A. S. Tanenbaum and R. van Renesse, Distributed operating systems, ACM Comp. Surv., 17 (4): 419–470, 1985.
10. D. De Paoli et al., Microkernel and kernel server support for parallel execution and global scheduling on a distributed system, Proc. IEEE First Int. Conf. Algorithms Architectures Parallel Process., Brisbane, April 1995.
11. D. R. Cheriton and W. Zwaenepoel, Distributed process groups in the V distributed system, ACM Trans. Comp. Sys., 3 (2): 77–107, 1985.
12. M. Rozier et al., Chorus distributed operating system, Comput. Syst., 1: 305–379, 1988.
13. M. F. Kaashoek and A. S. Tanenbaum, Efficient reliable group communication for distributed systems, Department of Mathematics and Computer Science Technical Report, Vrije Universiteit, Amsterdam, 1994.
14. R. Sandberg et al., Design and implementation of the Sun Network File System, Proc. Summer USENIX Conf., 119–130, 1985.
15. Sun Microsystems, NFS Version 3 Protocol Specification (RFC 1813), Internet Network Working Group Request for Comments, No. 1813, Network Information Center, SRI International, June 1995.
16. G. Gerrity et al., Can we study design issues of distributed operating systems in a generalized way?—RHODOS, Proc. 2nd Symp. Experiences Distributed Multiprocessor Syst. (SEDMS II), Atlanta, March 1991.
17. Distributed Computing Environment Rationale, Open Software Foundation, 1990.
18. The OSF Distributed Computing Environment, Open Software Foundation, 1992.
ANDRZEJ GOSCINSKI
WANLEI ZHOU
Deakin University
Wiley Encyclopedia of Electrical and Electronics Engineering
Code Division Multiple Access (Standard Article)
Kamran Kiasaleh, University of Texas at Dallas, Richardson, TX
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5304
Online Posting Date: December 27, 1999
Abstract. The sections in this article are: Signal Generation and Mathematical Modeling; Despreading and Detection; Synchronization; Interference; Near-Far Problem and Power Control; Channel Effects; Interference-Dispersive Channel; Rake Receiver.
CODE DIVISION MULTIPLE ACCESS
With the advent of personal wireless communication systems in recent years, the need for instantaneous, seamless personal communication has grown. Unfortunately, this increase in demand strains a natural resource, the radio frequency (RF) spectrum. It is therefore imperative that any communication system intended for use in a personal communication domain be as bandwidth-efficient as possible. In other words, one has to design communication systems for a bandlimited scenario and attempt to maximize the information throughput for the allotted bandwidth. There are two basic means by which the RF spectrum can be shared among many users: collision-free and collision-impaired multiple access. In a collision-impaired scheme, a protocol is used
(which is typically common to all users) by each user to obtain access to the available spectrum. Collisions (events in which two or more users attempt to use the same common resource) are possible in this scenario, and hence one must accommodate such events (e.g., by retransmission). In the collision-free scenario, it is assumed that a user is able (at least in theory) to obtain access to the channel upon request (perhaps with some delay) and that no form of collision is possible. In practice, a hybrid of the two scenarios is often used to provide access to the RF medium. Although there are numerous forms of collision-free multiple access, three means of sharing the RF spectrum have received the most attention: time-division multiple access (TDMA), frequency-division multiple access (FDMA), and code-division multiple access (CDMA). The concepts of TDMA and FDMA may be explained as follows. In the TDMA scenario, access to the RF spectrum is rather implicit, via time-slot allocation. Namely, no single portion of the allotted frequency spectrum is assigned to an individual user. Instead, users occupy the entire allotted frequency spectrum and are assigned nonoverlapping time slots for communication. In contrast, FDMA operates on the assumption that nonoverlapping portions of the RF spectrum can be allocated to individual users, and communication for each user can proceed in a continuous fashion in time. The CDMA approach differs from TDMA and FDMA in two important respects. First, explicit frequency assignments are not necessary. More important, communication can be initiated at any time, and hence no explicit time-slot assignment must be performed prior to communication. User discrimination is instead achieved by exploiting the correlation properties of the binary (or perhaps higher-order) codes used to form CDMA signals. To illustrate this point, let us consider the following.
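As a first numerical preview of code-based user discrimination (ahead of the formal development below), two users can share the band by spreading with different ±1 codes; length-4 Walsh rows are used here purely for clarity, whereas commercial systems use long PN codes:

```python
# Two users transmit simultaneously in the same band, each spreading
# one data symbol with its own ±1 code (orthogonal Walsh rows here).
w1 = [1, 1, 1, 1]
w2 = [1, -1, 1, -1]
d1, d2 = 1, -1                      # each user's data symbol

tx = [d1 * a + d2 * b for a, b in zip(w1, w2)]   # signals overlap on air

# Each receiver despreads with its own code; the other user drops out
# because the two codes are uncorrelated over the symbol.
r1 = sum(x * c for x, c in zip(tx, w1)) / len(w1)
r2 = sum(x * c for x, c in zip(tx, w2)) / len(w2)
print(r1, r2)  # 1.0 -1.0: both symbols recovered despite the overlap
```

With truly orthogonal codes the interference cancels exactly; with practical PN codes a small correlation residue remains, which is the theme of the discussion that follows.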
In most CDMA systems, the information provided by a user is of a bandwidth much smaller than the bandwidth allocated for CDMA communication. First, via a bandwidth-spreading tactic, the information provided by a user is expanded in bandwidth to the maximum allowable bandwidth for CDMA communication. This procedure is repeated for all users involved, with each user taking advantage of a bandwidth-spreading strategy that is independent of those used by the others. At the receiver, a reverse operation (i.e., a bandwidth-despreading operation) is performed. Obviously, if this operation is performed successfully, the original signal is recovered. However, since all the users involved occupy the entire allotted frequency band and are allowed to communicate at all times, the bandwidth-despreading operation performed on an intended signal is also affected by the presence of other, interfering signals. The key assumption in CDMA is that the independence of the bandwidth spreading and despreading operations guarantees that the bandwidth despreading of an interfering signal leads to a signal whose bandwidth remains identical to the bandwidth allowed for CDMA communication. Namely, only the intended signal is transformed to its original shape, while the other signals remain wideband. Considering that a digital receiver attempts to measure the useful band-limited energy of a signal, the detection of the desired signal is hampered by only a small fraction of the energies of the interfering signals, which appears as spectrally flat noise. To elaborate, one can view a digital receiver as a narrowband filter that is designed to capture the energy (the area beneath the power spectrum) of the desired signal. Such a filter will have a bandwidth proportional to the bandwidth of the desired signal after the bandwidth-despreading operation. Since the undesired signals remain wideband after such an operation, their contribution to the detected energy in the desired frequency band will be small compared with the detected energy of the desired signal. This, in turn, leads to a successful recovery of the desired signal and the rejection of a large portion of the unwanted energy. To gain further insight, we proceed to formulate this problem in the next section.

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

SIGNAL GENERATION AND MATHEMATICAL MODELING

The preceding operation can take on a mathematical form. First, we assume that there are N CDMA users that can be active at any point in time. That is, we assume that the allotted frequency spectrum is accessible to N CDMA signals at all times. Let us begin by describing a direct-sequence CDMA signal. In particular, we are interested in representing the jth CDMA signal. For all intents and purposes, one can describe the jth CDMA signal at the transmitter as

x_j(t) = Re{x̃_j(t) e^{iω_c t}} = Re{d_j(t) PN_j(t) e^{iω_c t}}    (1)
where Re{x} denotes the real part of x, x̃_j(t) is the complex envelope of the CDMA signal, i = √−1, ω_c denotes the carrier frequency in rad/s, and d_j(t) = Σ_{n=−∞}^{∞} d_n^{(j)} P_d(t − nT_s) is the data-bearing portion of the jth signal, with d_n^{(j)} and P_d(t) denoting, respectively, the complex data symbol for the jth transmitted signal in the nth signaling interval, taken from an M-ary phase-shift-keying (MPSK) signaling constellation, and a unit-amplitude nonreturn-to-zero (NRZ) pulse shape of duration T_s seconds. Moreover, PN_j(t) denotes the jth complex PN signal, defined as

PN_j(t) = Σ_{n=−∞}^{∞} s_{n,I}^{(j)} P_c(t − nT_c) + i Σ_{n=−∞}^{∞} s_{n,Q}^{(j)} P_c(t − nT_c)    (2)
where s_{n,I}^{(j)} and s_{n,Q}^{(j)} are the in-phase (I) and quadrature (Q) pseudorandom real spreading sequences for the nth chip interval of the jth user, taking on values in {−1, +1} according to a PN code generating device (a PN code generator typically consists of one or a combination of a number of linear feedback shift registers); P_c(t) is the chip pulse shape, typically assumed to be a square-root raised-cosine pulse shape; and T_c is the chip interval, given by

T_c = T_s / P_g    (3)
where P_g ≫ 1 denotes the processing gain of the CDMA system. This parameter will be explained in a different context in the ensuing discussion. We also assume that PN_j(t ± kPT_c) = PN_j(t) for k = 1, 2, 3, . . .. This implies that the PN codes here are assumed to be periodic with a period of PT_c seconds and, hence, that the PN sequences have a period of P chips.
The preceding formulation implies that

x̃_j(t) = d_j(t) PN_j(t)    (4)
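The spreading of Eq. (4), and the correlation-based despreading described later in the article, can be exercised with a small numerical sketch (pure Python; the chip pattern and processing gain below are arbitrary illustrative choices):

```python
# Direct-sequence spreading of one data symbol d by a ±1 chip sequence,
# followed by despreading (correlation with the same sequence).
pn = [1, -1, 1, 1, -1, -1, 1, -1]     # one symbol's worth of chips (Pg = 8)
d = -1                                 # the data symbol for this interval

tx = [d * c for c in pn]               # spreading: chip-rate multiplication
recovered = sum(x * c for x, c in zip(tx, pn)) / len(pn)  # despreading
print(recovered)                       # -1.0: the symbol is restored

# A signal spread with an *independent* code stays spread: its correlation
# with our code is small, which is the basis of user discrimination.
other_pn = [1, 1, -1, 1, 1, -1, -1, -1]
interferer = [1 * c for c in other_pn]
residue = sum(x * c for x, c in zip(interferer, pn)) / len(pn)
print(abs(residue) < 1)                # only a partial-correlation residue
```

The larger the processing gain, the smaller the relative residue left by an interfering code, which is the quantitative content of Eq. (5) below.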
From Eq. (4), one can observe that a CDMA signal is obtained via PN code multiplication, justifying the name direct sequence. Before going any further, let us observe the impact of the spreading operation on the spectrum of a narrowband signal. Equation (4) sheds light on the means by which direct-sequence spreading expands the bandwidth of d_j(t). It can easily be inferred from Eq. (4) that the outcome of the multiplication is a signal whose bandwidth is identical to that of the PN code PN_j(t). Since the PN code's bandwidth is far greater than that of d_j(t) (i.e., T_c^{−1} ≫ T_s^{−1}), a bandwidth-spreading operation is realized. We also note that

P_g = B_CDMA / B_Data    (5)
where B_CDMA and B_Data denote the bandwidths of the CDMA and data signals, respectively. This can easily be verified by noting that the bandwidth of a direct-sequence CDMA signal may be shown to be α_1 T_c^{−1} for some α_1, while the bandwidth of the data signal is α_2 T_s^{−1}. Since the CDMA and data signals possess identical characteristics, α_1 = α_2; using Eq. (3), we arrive at Eq. (5). This result indicates that the processing gain for a CDMA signal is identical to the bandwidth-spreading factor P_g. In the remainder of this article, for the sake of simplicity, we deal with x̃_j(t), the complex envelope of the jth CDMA signal. In the ensuing analysis, the correlation properties of the PN codes are needed to understand the means by which CDMA receivers function. For this reason, let us define

R_a^{(j)}(n, τ, τ̂) = (1/2) ⟨PN_j(t − τ) PN_j*(t − τ̂)⟩_{n,τ̂}    (6)
as the partial autocorrelation function of the jth PN code observed over P_g chip symbols, with

⟨f(t)⟩_{n,τ̂} = (1/T_s) ∫_{(n−1)T_s+τ̂}^{nT_s+τ̂} f(t) dt
denoting a time-averaging operation over the interval [(n − 1)T_s + τ̂, nT_s + τ̂]. This function will be used in the subsequent analysis to discuss the characteristics of PN code acquisition and tracking systems. It is important to note that, in commercial CDMA systems, the period of the PN code (i.e., PT_c) is substantially greater than the processing gain, resulting in an R_a^{(j)}(n, τ, τ̂) that is a function of n and represents the partial autocorrelation function of the jth PN code. (Since PN codes are often generated using linear feedback shift registers, one may assume that the resulting codes are periodic with periods that depend on the structural properties of the generating shift registers.) In fact, due to the pseudorandom nature of the PN code, R_a^{(j)}(n, τ, τ̂) may be viewed as a random sequence. However, if one assumes a large processing gain (a large number of chip symbols per integration interval), R_a^{(j)}(n, τ, τ̂) does not vary substantially with n, and hence R_a^{(j)}(n, τ, τ̂) may be approximated by R_a^{(j)}(τ, τ̂). For the scenario where P_g = P (i.e., the PN code is repeated every symbol
interval), R_a^{(j)}(n, τ, τ̂) is not a function of n and reduces to the autocorrelation function of the jth PN code. It is also important to note that R_a^{(j)}(τ, τ̂), as defined previously, is a complex function. In practice, however, the complex PN codes are designed so that the I and Q PN codes (hereafter, Re{PN_j(t)} and Im{PN_j(t)} are referred to as the I and Q PN codes, respectively) are a pair of uncorrelated sequences. That is,

⟨Re{PN_j(t)} Im{PN_j(t)}⟩_{n,0} ≈ 0  for all j    (7)
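The sharp autocorrelation peak that the following analysis relies on can be verified for a short m-sequence (a 3-stage LFSR; the tap positions below are one standard choice, and the period-7 length is purely for illustration):

```python
def lfsr_msequence(length=7):
    # 3-stage linear feedback shift register; the taps are chosen so the
    # state cycles through all 7 nonzero values (an m-sequence of period 7).
    reg = [1, 0, 0]
    bits = []
    for _ in range(length):
        bits.append(reg[2])
        fb = reg[2] ^ reg[1]
        reg = [fb, reg[0], reg[1]]
    return [1 - 2 * b for b in bits]   # map {0,1} -> {+1,-1}

s = lfsr_msequence()
P = len(s)
R = [sum(s[k] * s[(k + n) % P] for k in range(P)) for n in range(P)]
print(R)  # [7, -1, -1, -1, -1, -1, -1]: sharp peak only at zero lag
```

This two-valued correlation (P at zero offset, −1 elsewhere) is what makes the epoch of a PN code detectable by correlation, as exploited in the synchronization discussion below.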
In that event, R_a^{(j)}(τ, τ̂) is a real function that can be expressed as

R_a^{(j)}(τ, τ̂) = (1/2) ⟨Re{PN_j(t − τ)} Re{PN_j(t − τ̂)}⟩_{n,τ̂} + (1/2) ⟨Im{PN_j(t − τ)} Im{PN_j(t − τ̂)}⟩_{n,τ̂}    (8)

If one assumes that the I and Q PN codes possess identical partial autocorrelation properties (a situation where this condition is not satisfied is of little practical interest), then

R_a^{(j)}(τ, τ̂) = ⟨Re{PN_j(t − τ)} Re{PN_j(t − τ̂)}⟩_{n,τ̂}    (9)

Hence, R_a^{(j)}(τ, τ̂) may be viewed as the partial autocorrelation function of the real PN sequences that form the complex PN signal. Using the preceding notation, the despreading operation may also be explained.

DESPREADING AND DETECTION

Since a binary PN spreading is used, it is fairly easy to see that

(1/2) ⟨x̃_j(t) PN_j*(t)⟩_{n,0} = (1/2T_s) ∫_{(n−1)T_s}^{nT_s} x̃_j(t) PN_j*(t) dt = (1/2T_s) ∫_{(n−1)T_s}^{nT_s} d_j(t) |PN_j(t)|² dt = d_n^{(j)}    (10)

where PN_j*(t) is the complex conjugate of PN_j(t) and it is assumed that (1/2)|PN_j(t)|² = 1. The factor of 1/2 is included to account for the fact that the PN code consists of real and imaginary spreading sequences. For complex spreading signals, we also observe that

(1/2) |PN_j(t)|² = (1/2) (Re{PN_j(t)})² + (1/2) (Im{PN_j(t)})²    (11)

where Im{x} is the imaginary part of x. Since the real and imaginary parts of PN_j(t) are also binary PN codes with unit amplitudes,

(Re{PN_j(t)})² = (Im{PN_j(t)})² = (1/2) |PN_j(t)|² = 1    (12)

When PN_j(t) is assumed to be real, it is fairly easy to see that PN_j²(t) = 1. Hereafter, the signal processing defined by Eq. (10) is referred to as a matched filtering (MF), or despreading, operation. To elaborate, as can be seen in Eq. (10), the outcome of the correlation operation is the original narrowband signal d_j(t).

At this juncture, we need to consider the received signal when the signals described previously have been subjected to an imperfect channel. In particular, we need to be concerned with the case where the channel coherence bandwidth is smaller than the bandwidth of the CDMA signal. (The coherence bandwidth of a dispersive channel may be viewed as the maximum bandwidth that a signal can take on without being distorted by the characteristics of the channel.) This implies frequency-selective operation for most practical applications and, in particular, for wireless communication scenarios. Hence, we need to examine the impact of the channel on a CDMA signal. This also plays a critical role in selecting a detection mechanism for the problem at hand. Before doing so, however, the important problem of synchronization is addressed.

SYNCHRONIZATION

In virtually any form of digital communication, synchronization in time (symbol clock recovery) precedes communication. CDMA systems are not exempt from this requirement. Synchronization in a CDMA system, however, is somewhat different from its TDMA counterpart. In TDMA systems, one requires synchronization in frequency (and, in some cases, in phase) before a data clock can be recovered. Often, a dotting sequence (1010101. . .) is included in the preamble of a TDMA frame to provide the clock synchronization subsystem with the necessary signal to lock onto. In a CDMA scenario, since the desired signal is spread in frequency over the entire allotted CDMA band, the acquisition of the PN code clock, which for most practical systems also implies data clock acquisition, must be achieved in the absence of phase and frequency synchronization. (Here, we are interested in the scenario where the PN code clock and data symbol clock are derived from a common source; hence, acquisition of the PN code clock leads to data symbol clock recovery.) This is due to the fact that if one chooses to achieve phase and frequency estimation in the absence of PN code acquisition, the phase and frequency synchronizers must extract synchronization information from a wideband signal. This, in general, is a formidable task due to the large bandwidth of typical CDMA signals. Hence, in a CDMA system, PN code timing acquisition precedes any other form of synchronization. Upon the recovery of the PN code phase, the CDMA signal is despread, and then an accurate estimate of frequency or phase is obtained. We are then faced with a situation where the PN code clock must be recovered in a noncoherent fashion. Before describing the mechanism by which PN code clock synchronization can be acquired, we need to point out that synchronization here is achieved in two phases.
In phase I, an initial synchronization of the PN code phase is established via acquiring the epoch of the received PN code to within a fraction of a chip interval. This problem is identical to the estimation of the propagation delay between a transmitter/receiver pair when the propagation delay is less than the period of the PN code. In the event that the propagation delay is greater than the period of the PN code, the synchronization procedure yields an estimate of the propagation delay reduced mod PTc. Phase I of the PN code synchronization is equivalent to estimating
the state of the shift register that generates the desired PN code. In phase II, the epoch of the desired PN code is tracked so that a real-time estimate of the PN code phase can be maintained at the receiver. As noted earlier, PN code estimation is accomplished in the face of unknown channel phase and frequency. In general, one can assume that the partial autocorrelation of a PN code satisfies the following property:

|R_a^{(j)}(τ, τ̂)| ≪ R_a^{(j)}(τ, τ)  for |τ − τ̂| > T_c    (13)
This property is critical to a successful PN code acquisition, since it can be exploited to realize a PN code acquisition model. To gain further insight, let us assume that the communication channel is such that transmission through the channel introduces a delay of τ, a phase offset of θ, a frequency error or offset of Δω rad/s, and an amplitude distortion of A(t). That is, let

r̃(t) = A(t) e^{i[Δωt+θ]} x̃(t − τ)    (14)

denote the complex envelope of the received signal at the input of a CDMA receiver in the absence of additive noise. (Since the inclusion of additive noise results only in the presence of a noisy term at the output of the correlation operation, and hence does not provide any further insight, we may proceed with a noiseless model to illustrate the function of the PN code acquisition model.) A noncoherent PN code acquisition model computes g(τ, τ̂), given by

g(τ, τ̂) = Σ_{n=1}^{L} |⟨r̃(t) PN_j*(t − τ̂)⟩_{n,τ̂}|²    (15)

where we have assumed that τ̂ denotes an estimate of τ, the propagation delay between transmitter and receiver, and we have collected energy over L symbol intervals. As noted earlier, if τ > PT_c, then τ̂ is the estimate of τ reduced mod PT_c. Obviously, the objective of a PN code acquisition model is to bring τ̂ to within a fraction of T_c of τ. Namely, we are interested in acquiring an estimate τ̂ where |τ − τ̂| ≤ T_c/N_s for N_s ≥ 2. In the ensuing discussion, we further assume that modulation is absent. This assumption is motivated by the fact that in commercial CDMA systems a pilot signal (a CDMA signal without the modulating signal) is provided by the transmitter to aid synchronization. The presence of modulation further complicates the model without adding any further insight. For this reason, we proceed with a pilot-signal-aided synchronization model. We further assume that an initial frequency estimate is obtained, and hence Δω is assumed to be relatively small compared with 1/T_s. If one assumes that the jth PN signal (without modulation) is used to generate the CDMA signal and that the amplitude distortion in the channel remains relatively constant for a symbol time, i.e., A(t) ≈ A (for most channels of interest, this assumption is valid), then

g(τ, τ̂) = Σ_{n=1}^{L} |⟨A e^{i[Δωt+θ]} PN_j(t − τ) PN_j*(t − τ̂)⟩_{n,τ̂}|²
        = A² D²(ΔωT_s) Σ_{n=1}^{L} |e^{iθ} ⟨PN_j(t − τ) PN_j*(t − τ̂)⟩_{n,τ̂}|²
        = A² D²(ΔωT_s) Σ_{n=1}^{L} |⟨PN_j(t − τ) PN_j*(t − τ̂)⟩_{n,τ̂}|²    (16)

where D(ΔωT_s) accounts for the distortion caused by the presence of the frequency error. D(ΔωT_s) is a decreasing function of ΔωT_s, and hence for ΔωT_s ≪ 1 one can expect a small level of distortion. As can be seen, the phase error is eliminated with the aid of the absolute-value function. Now, let us consider the case where the I and Q PN codes are nearly orthogonal. That is, let

⟨Re{PN_j(t)} Im{PN_j(t)}⟩_{n,0} ≈ 0    (17)

With some effort, it can be shown that

g(τ, τ̂) ≈ 4L A² D²(ΔωT_s) (R_a^{(j)}(τ, τ̂))²    (18)

Hence, the function of the absolute-value operation is to eliminate any phase error that may be present at the receiver, while the integration operation is intended to yield R_a^{(j)}(τ, τ̂). Due to Eq. (13), it is relatively easy to observe that g(τ, τ̂) may be used to launch a search for the correct epoch of the code. The function of L is to provide confidence in declaring whether or not the correct epoch of the code has been acquired when additive noise (or interference) is present. Before discussing the acquisition model based on the above observation, let us consider the case where additive noise is present. In the presence of noise, additional terms in g(τ, τ̂) that are dependent on noise must be accounted for. In that case, one can argue that

E{g(τ, τ̂)} = 4L A² D²(ΔωT_s) (R_a^{(j)}(τ, τ̂))²    (19)

where E{·} is the ensemble average of the enclosed. That is, the operation described by Eq. (15) yields an output whose average value provides one with the necessary function to carry out PN code acquisition. Hence g(τ, τ̂) may be used as an indicator of the PN code acquisition state. The search mechanism then consists of a chip-by-chip search that is carried out in a serial fashion. In this scheme, g(τ, τ̂) is obtained for a given τ̂. In the event that g(τ, τ̂) falls below a predefined threshold, τ̂ is increased by T_c/N_s (N_s is the number of steps per chip interval). Once the local PN code epoch is within a chip interval of the received PN code, the output of the correlator will exceed the threshold, which is chosen to yield optimum performance. At this stage, the synchronizer declares PN code acquisition and proceeds with PN code tracking. Since noise can impair this process, the performance of this acquisition model is characterized in terms of the statistics of the acquisition time (mean and standard deviation of the acquisition time), the probability of acquisition, and the probability of false acquisition. Phase II of synchronization involves the tracking of the PN code. This process involves maintaining a local PN code signal whose epoch is different from the epoch of the received signal
CODE DIVISION MULTIPLE ACCESS
by no more than a fraction of the chip time Tc. This objective is achieved via a PN code tracking loop that generates a pair of PN signals that are delayed and advanced by a fraction of chip time with respect to the local PN code. More specifically, the following signal is formed:
S(τ_e) = g(τ_e − T_c/N_s) − g(τ_e + T_c/N_s)   (20)
where τ_e = τ − τ̂. In arriving at Eq. (20), it is assumed that when |τ − τ̂| < T_c, g(τ, τ̂) = g(τ − τ̂). This signal is then used as an error signal to adjust τ̂. Since a voltage-controlled oscillator (VCO) provides the clock signal for the generation of the local PN code, the adjustment of τ̂ can be achieved using S(τ_e). The expected value of S(τ_e), i.e., E{S(τ_e)}, is often referred to as the "S-curve" of the tracking loop. This function determines the tracking behavior of the loop. In particular, the variance of the steady-state timing error as well as the mean time to loss of lock depend on this function. Using Eq. (19), we have
E{S(τ_e)} = 4LA²D²(ωT_s) [(R_a^(j)(τ_e − T_c/N_s))² − (R_a^(j)(τ_e + T_c/N_s))²]   (21)

To gain an insight into the operation of this loop, let us consider a scenario where the I and Q PN sequences are uncorrelated and possess identical autocorrelation functions. As noted earlier, this assumption leads to

R_a^(j)(τ_e) = ⟨Re{PN_j(t − τ)} Re{PN_j(t − τ − τ_e)}⟩_{n,τ+τ_e}  for |τ − τ̂| < T_c   (22)

When P_c(t) is an NRZ pulse,

⟨Re{PN_j(t − τ)} Re{PN_j(t − τ − τ_e)}⟩_{n,τ+τ_e} ≈ P_g T_c (1 − |τ_e|/T_c) rect(τ_e/2T_c);  |τ_e| < T_p   (23)

and

R_a^(j)(τ + nT_p) = R_a^(j)(τ)  for all integer n   (24)

In Eq. (23),

rect(x) = 1 for |x| < 0.5, and 0 otherwise

This situation is commonly referred to as the time-limited case, since the chip pulse shape extends over a finite time interval, and consequently its spectrum extends over a large frequency range. Equation (24) implies that the autocorrelation function of the PN code is a periodic function. This property is a direct consequence of the fact that PN codes are periodic functions with period T_p = PT_c. T_p here, then, denotes the period of the PN code. Note that Eq. (24) is a property common to all PN codes, whereas Eq. (23) is obtained when the I and Q PN sequences that make up the jth PN code satisfy the following properties:

Σ_{n_1=1}^{P_g} s_{n_1}^{(j),I} s_{n_1+n}^{(j),I} = Σ_{n_1=1}^{P_g} s_{n_1}^{(j),Q} s_{n_1+n}^{(j),Q} = P_g for n = 0, and ≤ λ_a otherwise   (25)

and

Σ_{n_1=1}^{P_g} s_{n_1}^{(j),I} s_{n_1+n}^{(j),Q} ≤ λ_c  for all n   (26)

with λ_a ≪ P_G and λ_c ≪ P_G denoting the peak out-of-phase autocorrelation and the peak cross-correlation, respectively, of the I and Q PN sequences. For λ_a = 0 and λ_c = 0, a pair of distinct phase shifts of the I or Q PN sequences are uncorrelated. Moreover, the I and Q PN sequences may then be viewed as uncorrelated sequences as well. In that event, Eq. (23) is an exact, and not an approximate, expression. In practice, however, one encounters λ_a and λ_c that are nonzero, and hence Eq. (23) must be used as an approximate partial autocorrelation function. The approximation due to the λ_a ≪ P_g and λ_c ≪ P_g conditions, however, is a good one. We note that although one requires that λ_a ≪ λ_c, the critical assumption for detection is that both λ_c and λ_a remain significantly smaller than P_G.

Given the assumptions stated above, we arrive at an S-curve for the above tracking loop that is approximately a linear function of τ_e over the range [−T_c/N_s, T_c/N_s]. More important, the slope of the S-curve remains positive over this range. There are several aspects of this function that are of interest. First, when the timing error is zero, which implies that perfect synchronization has been achieved, E{S(τ_e)} = 0. In this case, the input to the VCO is reduced to zero. Second, as the timing error begins to depart from 0, the signal at the input of the VCO has a magnitude that is proportional to τ_e. Hence, S(τ_e) provides the VCO with a signal that is an odd and monotonic function of the timing error, and thus can be used to adjust τ̂. As noted above, the other feature of the above S-curve is that it is nearly a linear function of τ_e in the vicinity of τ_e = 0. This is an important property, since the initial synchronization yields an estimate of the PN code epoch that is within ±T_c/N_s of the received PN code epoch. In this case, one can assume that the loop provides us with an error signal that is directly proportional to the timing error, and hence a linear tracking loop results.

Finally, in practice, P_c(t) is chosen to be a square-root raised-cosine pulse shape. In that case, a somewhat different result emerges. That is,

E{S(τ_e)} = 4LA²D²(ωT_s) [(P_RC(τ_e − T_c/N_s))² − (P_RC(τ_e + T_c/N_s))²]   (27)

where P_RC(t) = P_c(t) ⊛ P_c(t) (⊛ denotes a convolution operation) is a raised-cosine pulse shape given by Eq. (28) (note that the square-root raised-cosine pulse P_c(t) is implicitly defined in terms of P_RC(t)):

P_RC(t) = [sin(πt/T_c)/(πt/T_c)] · [cos(παt/T_c)/(1 − (2αt/T_c)²)]   (28)
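The shape of this S-curve is easy to check numerically. The sketch below evaluates Eq. (28) and the bracketed difference in Eq. (27) (dropping the constant 4LA²D²(ωT_s)); the roll-off factor α = 0.25 and the step N_s = 2 are illustrative choices, not values fixed by the article.

```python
import numpy as np

def p_rc(t, Tc=1.0, alpha=0.25):
    """Raised-cosine pulse of Eq. (28); np.sinc(x) = sin(pi*x)/(pi*x)."""
    x = t / Tc
    return np.sinc(x) * np.cos(np.pi * alpha * x) / (1.0 - (2.0 * alpha * x) ** 2)

def s_curve(tau_e, Tc=1.0, Ns=2, alpha=0.25):
    """E{S(tau_e)} of Eq. (27), with the constant factor dropped."""
    delta = Tc / Ns
    return p_rc(tau_e - delta, Tc, alpha) ** 2 - p_rc(tau_e + delta, Tc, alpha) ** 2

# Perfect timing gives a zero error signal; small offsets give a restoring
# signal whose sign matches the sign of the timing error.
print(round(s_curve(0.0), 12))              # 0.0
print(s_curve(0.1) > 0, s_curve(-0.1) < 0)  # True True
```

The even symmetry of P_RC(t) makes S(0) vanish exactly, and the positive slope near zero is what drives the VCO toward lock.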
This case is referred to as the bandwidth-limited case. Note that P_c(t) extends over several chip intervals, leading to a spectrum that is limited in bandwidth. Although E{S(τ_e)} does not yield a linear S-curve over the entire interval [−T_c/N_s, T_c/N_s], it provides all the necessary conditions for successful PN code tracking. That is, E{S(τ_e)} is a linear function of τ_e in the vicinity of τ_e = 0. Also, E{S(τ_e)} possesses a positive slope in the range [−T_c/N_s, T_c/N_s]. Hence, one may expect a tracking performance similar to that of the NRZ chip pulse shape case.

INTERFERENCE

Now let us consider the received signal at the input of a CDMA receiver when other CDMA signals are present. We consider two possibilities. First, the channel is taken to be nondispersive, so that no multipath components are present. In the second case, a more general scenario with multipath scattering is considered. In the event that the channel is nondispersive,
r̃(t) = A_j e^{iθ_j(t)} x̃_j(t − τ_j) + Σ_{l=1; l≠j}^{N} A_l e^{iθ_l(t)} x̃_l(t − τ_l) + z̃(t)   (29)
where now N − 1 other CDMA signals are present. It is assumed that the lth signal encounters τ_l seconds of propagation delay, an amplitude scaling of A_l, and a random phase shift of θ_l(t). Note that any frequency error caused by the channel is represented by dθ_l(t)/dt. As can be seen, the received signal is corrupted by many interfering signals and an additive noise z̃(t). The term z̃(t) is a complex white Gaussian noise whose real and imaginary parts are a pair of independent white Gaussian noise processes with a two-sided power spectral density of N_0 W/Hz over the frequency range of interest. For the additive noise, we have

E{z̃(t) z̃*(t − s)} = 2E{z̃_r(t) z̃_r(t − s)} = 2E{z̃_i(t) z̃_i(t − s)} = 2N_0 δ(s)

where δ(·) is a Dirac delta function and z̃_r(t) and z̃_i(t) denote the real and imaginary parts of z̃(t), respectively. Note that Re{z̃(t)e^{iω_c t}} may now be considered as a band-limited Gaussian noise whose power spectrum remains flat at N_0/2 W/Hz over the frequency range of interest about ω_c rad/s.

To gain an insight into the means by which CDMA receivers overcome interference, let us consider the outcome of a bandwidth despreading operation. Furthermore, let us assume a scenario where we are interested in recovering the jth signal. Obviously, one requires that the receiver acquire an estimate of τ_j. This task remains with the PN code acquisition subsystem discussed previously. Assuming that a successful delay estimation is performed, an estimate of τ_j that is within ±T_c/N_s (N_s ≥ 2) of τ_j can be obtained. Let such an estimate be τ̂_j. Also, let us assume that the frequency shift in the signal caused by the channel is compensated for and that the residual frequency error caused by the estimation process is small enough that θ_l(t) ≈ θ_l for the observation interval. That is, θ_l now denotes the residual phase error at the receiver caused by the channel phase shift and imperfect estimation and compensation of frequency. This condition is typically satisfied in practice by acquiring the PN code, despreading the signal, and, with the aid of a frequency estimator, acquiring an estimate of the frequency. Then the outcome of the bandwidth despreading operation (when the nth symbol is of interest) after frequency compensation and delay estimation is

(1/2)⟨r̃(t) PN_j*(t − τ̂_j)⟩_{n,τ̂_j}
  = z_n + (A_j e^{iθ_j}/2T_s) ∫_{(n−1)T_s+τ̂_j}^{nT_s+τ̂_j} d_j(t − τ_j) PN_j(t − τ_j) PN_j*(t − τ̂_j) dt
  + Σ_{l=1; l≠j}^{N} (A_l e^{iθ_l}/2T_s) ∫_{(n−1)T_s+τ̂_j}^{nT_s+τ̂_j} d_l(t − τ_l) PN_l(t − τ_l) PN_j*(t − τ̂_j) dt   (30)

It is not immediately obvious whether or not the nth symbol can be recovered using this operation. Depending on the type of detection used to recover the transmitted data symbol, an estimate of θ_j may be needed at the receiver. To go further, without loss of generality, let us assume that τ̂_j ≥ τ_j with |τ̂_j − τ_j| ≤ T_c. In that event, Eq. (30) reduces to

(1/2)⟨r̃(t) PN_j*(t − τ̂_j)⟩_{n,τ̂_j}
  = z_n + (A_j e^{iθ_j}/2T_s) ∫_{(n−1)T_s+τ̂_j}^{nT_s+τ_j} d_j(t − τ_j) PN_j(t − τ_j) PN_j*(t − τ̂_j) dt
  + (A_j e^{iθ_j}/2T_s) ∫_{nT_s+τ_j}^{nT_s+τ̂_j} d_j(t − τ_j) PN_j(t − τ_j) PN_j*(t − τ̂_j) dt
  + Σ_{l=1; l≠j}^{N} (A_l e^{iθ_l}/2T_s) ∫_{(n−1)T_s+τ̂_j}^{nT_s+τ̂_j} d_l(t − τ_l) PN_l(t − τ_l) PN_j*(t − τ̂_j) dt   (31)

where

z_n = (1/2T_s) ∫_{(n−1)T_s+τ̂_j}^{nT_s+τ̂_j} z̃(t) PN_j*(t − τ̂_j) dt

denotes a zero-mean Gaussian random variable. This equation then leads to

(1/2)⟨x̃(t) PN_j*(t − τ̂_j)⟩_{n,τ̂_j}
  = z_n + A_j e^{iθ_j} [R_j^(1)(τ_j, τ̂_j) d_n^(j) + R_j^(2)(τ_j, τ̂_j) d_{n+1}^(j)]
  + Σ_{l=1; l≠j}^{N} A_l e^{iθ_l} [R_{j,l}^(1)(τ_l, τ̂_j) d_{p_l}^(l) + R_{j,l}^(2)(τ_l, τ̂_j) d_{p_l+1}^(l)]   (32)
where

R_j^(1)(t_1, t_2) = (1/2T_s) ∫_{(n−1)T_s+t_2}^{nT_s+t_1} PN_j(t − t_1) PN_j*(t − t_2) dt

and

R_j^(2)(t_1, t_2) = (1/2T_s) ∫_{nT_s+t_1}^{nT_s+t_2} PN_j(t − t_1) PN_j*(t − t_2) dt

are partial autocorrelation functions of the jth PN code. In general, the PN codes are selected from a family of codes with
identical autocorrelation properties, and hence the subscript j may be dropped. Moreover,
R_{j,k}^(1)(t_1, t_2) = (1/2T_s) ∫_{(n−1)T_s+t_2}^{nT_s+t_1} PN_k(t − t_1) PN_j*(t − t_2) dt
and

R_{j,k}^(2)(t_1, t_2) = (1/2T_s) ∫_{nT_s+t_1}^{nT_s+t_2} PN_k(t − t_1) PN_j*(t − t_2) dt

denote the partial cross-correlation functions of the jth and kth PN codes. In arriving at Eq. (32), we have assumed that the integration interval coincides with the p_l th and (p_l + 1)th signaling intervals of the lth interfering signal. From Eq. (32), it is rather obvious that the desired symbol d_n^(j) is recovered. This recovery method, however, has yielded a number of undesirable terms. First, the presence of timing error, as in other digital receivers, has resulted in the introduction of intersymbol interference in the detection process [note the term involving d_{n+1}^(j)]. Furthermore, the detection process is now corrupted by interfering signals, even in the absence of additive noise. To estimate the impact of interference, the properties of the partial autocorrelation and cross-correlation functions of the PN codes must be evaluated. Before doing so, let us examine the preceding result more carefully. First, it is obvious from the definition of R_j^(2)(t_1, t_2) that when τ̂_j = τ_j, the intersymbol interference is reduced to zero. That is, R_j^(2)(τ̂_j, τ̂_j) = 0. Since |τ̂_j − τ_j| ≤ T_c/N_s for some N_s ≥ 2, for a typical PN code with large processing gain, R_j^(1)(τ_j, τ̂_j) ≫ R_j^(2)(τ_j, τ̂_j) for all j. That is, the intersymbol interference may be viewed as negligible for most practical cases. The other interfering terms, however, depend on the partial cross-correlation functions of the PN codes and thus cannot be suppressed readily. If one assumes that the product of two PN codes results in yet another wideband code, then the bandwidth of the interfering signal remains unchanged, yielding a wideband interfering signal. That is, the despreading operation manages to despread the bandwidth of the desired signal while yielding a wideband interference. Since the integration over a symbol time is equivalent to a filtering operation over a bandwidth of 1/T_s Hz and the despread interference signal possesses a bandwidth proportional to 1/T_c, the contribution of the interference to the detection process is reduced by a factor proportional to the spreading gain. In other words, the interference contributes only 1/P_G of its total power to the detection of d_n^(j). Stated differently, the key assumption of a CDMA receiver is that

|R_j^(1)(τ_j, τ̂_j)| / (|R_{j,l}^(1)(τ_l, τ̂_j)| + |R_{j,l}^(2)(τ_l, τ̂_j)|) ≈ P_G

for all l ≠ j. Hence, for a large processing gain, one can expect a significant reduction in the interference level at the output of a CDMA receiver. Note that there are N − 1 interferers present, and hence one must consider a large enough P_G that the total interference level remains small. Finally, note that d_n^(j) is scaled by an unknown coefficient A_j e^{iθ_j} in Eq. (32). If a phase modulation is used, then θ_j must be estimated at the receiver. In the event that θ_j remains constant over two consecutive time slots and a differential phase modulation is used, an estimate of θ_j is not required at the receiver. For other scenarios, the output of the despreader is fed to a channel estimation system so that an estimate of θ_j (and A_j in some cases) can be obtained and compensated for.
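The suppression by a factor of roughly P_G can be seen in a toy simulation. The sketch below uses randomly drawn ±1 spreading codes (an assumption made for illustration; real systems use structured PN code families) and shows the interferer surviving despreading at roughly 1/P_G of its power.

```python
import numpy as np

rng = np.random.default_rng(1)
PG = 128        # processing gain (chips per symbol)
N_SYM = 200     # number of data symbols

# Hypothetical +/-1 spreading codes for the desired (j) and interfering (l) users.
pn_j = rng.choice([-1.0, 1.0], size=PG)
pn_l = rng.choice([-1.0, 1.0], size=PG)

d_j = rng.choice([-1.0, 1.0], size=N_SYM)   # desired symbols
d_l = rng.choice([-1.0, 1.0], size=N_SYM)   # interfering symbols

# Chip-rate received signal: both users spread and summed (no noise, equal power).
r = np.repeat(d_j, PG) * np.tile(pn_j, N_SYM) + np.repeat(d_l, PG) * np.tile(pn_l, N_SYM)

# Despread with the desired user's code and integrate over each symbol.
y = (r.reshape(N_SYM, PG) * pn_j).sum(axis=1) / PG

# The desired symbols survive with unit amplitude; the interferer leaks in
# only through the code cross-correlation, of order 1/sqrt(PG) in amplitude.
print(np.array_equal(np.sign(y), d_j))   # True: decisions unaffected
print(float(np.max(np.abs(y - d_j))) < 0.5)
```

The residual term is d_l scaled by the normalized cross-correlation of the two codes, which for random ±1 sequences has standard deviation 1/√P_G, i.e., power 1/P_G, mirroring the ratio above.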
NEAR-FAR PROBLEM AND POWER CONTROL

Another important fact revealed by Eq. (32) is that the interfering signals' power levels differ from that of the desired signal. Obviously, if A_j = max{A_l; l = 1, . . ., N}, a favorable outcome results. That is, the channel has caused an attenuation in the desired signal that is smaller than those experienced by the interfering signals. Since this condition cannot be guaranteed in a mobile communication environment, the interfering signals can take on relatively large amplitudes compared with the desired signal. In this case, the interfering signals can completely suppress the desired signal, resulting in unacceptable performance for a CDMA system. Since no fading is considered here, and assuming that all the CDMA signals originate at their transmitters at identical power levels, the aforementioned scenario is encountered only when the distance between the desired user and the receiver is larger than all or some of the distances between the interfering users and the receiver. This problem is commonly referred to as the "near-far" problem in CDMA receivers. Considering the wide range of distances a mobile user can take on, this problem can severely hamper the performance of a wireless CDMA system. In theory, this problem can be circumvented by regulating the power levels of all CDMA transmitters so that the received signals possess identical power levels (i.e., A_j = A for all j). The mechanism by which this goal may be achieved is known as power control. Power control, in practice, is accomplished using either an open-loop or a closed-loop mechanism. For the sake of discussion, let us consider a mobile CDMA scenario. Furthermore, let r̃(t) denote the received signal at a CDMA base station. Hence, N denotes the number of active mobile transmitters. In an open-loop mechanism, the base station sends a signal (pilot signal) with known power level to all the mobile units (forward link).
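The open-loop rule described next simply mirrors the measured pilot level. A toy sketch (the nominal reference level is a hypothetical parameter introduced here, not from the article):

```python
def open_loop_tx_power_dB(pilot_offset_dB, nominal_tx_dB=0.0):
    """Open-loop rule: if the pilot arrives x dB below its known level
    (pilot_offset_dB = -x), raise transmit power by x dB; if it arrives
    x dB above, lower it by x dB."""
    return nominal_tx_dB - pilot_offset_dB

print(open_loop_tx_power_dB(-7.0))  # pilot 7 dB low  -> 7.0
print(open_loop_tx_power_dB(4.0))   # pilot 4 dB high -> -4.0
```

In linear units this amounts to transmitting at a power inversely proportional to the measured pilot power, so that the base station sees roughly the same level from every mobile.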
Mobile units measure the received power levels and, in turn, set their transmitter power levels for the reverse link in accordance with the received power level. (Typically, the power level of a mobile transmitter is increased by x dB if the pilot signal is received at the mobile at −x dB power level. For the case where the pilot is received at +x dB, the transmitter power level is reduced by x dB.) If the communication channel remains the same for all users and fast fading can be ignored, this mechanism can yield favorable results. Although reciprocity exists between the reverse and forward links of a wireless channel when log-normal fading (slow fading) is of concern (the log-normal shadowing effect is due to obstruction of the direct path of communication), the forward and reverse links experience different fast fading effects. That is, the information obtained about the channel condition by observing the forward channel's power level (pilot power) may not be used to estimate the channel characteristics of the reverse link. For this reason, after initial power
level setting using the open-loop mechanism, a closed-loop procedure is followed to overcome the near-far problem. In this case, A_j is a function not only of propagation distance but also of channel fading characteristics. The base station makes a measurement of the power levels of the received signals from individual mobile units. This information is reported back to the mobile units using what is known as a power control bit, which indicates whether the mobile should boost or reduce its power in some fixed dB increment. This process is repeated up to 2000 times per second in some modern systems. Given the fast rate of updates, this procedure can overcome the impact of rapid fluctuations in the power level. Note that the power level adjustments of the mobile units are based on information regarding the reverse link, and hence one can expect a more effective means of circumventing the near-far problem using the closed-loop power control mechanism.

CHANNEL EFFECTS

So far, we have considered a perfect communication channel. That is, we have assumed that the bandwidth despreading operation is performed on an exact replica of the transmitted signal at the receiver. As noted earlier, we are concerned with a dispersive channel. Let the impulse response of the channel be
h̃(t) = Σ_{l=1}^{N_p} c̃_l(t) δ(t − τ_l(t))   (33)
where h̃(t) is the complex impulse response of the channel, δ(t) is the Dirac delta function, τ_l(t) denotes the propagation delay of the lth multipath between the transmitter and the receiver, and c̃_l(t) is a complex multiplicative distortion (MD) denoting the channel fading effect for the lth resolvable path of the multipath channel. The term c̃_l(t) is often modeled as a low-pass complex Gaussian process. Moreover, N_p denotes the total number of resolvable multipaths. The set of multipath delays encountered in a channel is often referred to as the delay profile of a scattering channel. For most channels of interest, and when the observation interval is short enough to render a constant delay profile, one may assume that τ_l(t) ≈ τ_l. N_p and τ_l are determined by the multipath profile of the channel, whereas the characteristics of c̃_l(t) are a function of the Doppler spectrum of the channel. Due to the Gaussian property, one can fully characterize the statistics of c̃_l(t) using only the second-order statistics of the process. It can be shown that the MD processes have autocorrelation functions that satisfy (assuming no log-normal shadowing)

E{c̃_l(t) c̃_l*(t − τ) | σ_l²} = σ_l² J_0(2π f_d^(l) τ) e^{i2π f_e τ}   (34)
with σ_l², f_d^(l), and f_e denoting the mean-square value of the MD for the lth path of the signal, the maximum Doppler spread experienced by the lth path of the signal, and the residual frequency error in hertz at the receiver, respectively. Moreover, E{· | σ_l²} denotes the expected value of the enclosed conditioned on σ_l². Note that we have kept the discussion as general as possible to entertain the possibility of including a scenario where the desired and interfering users may be at different
Doppler rates. When log-normal shadowing is present, we have

σ_l² = P_l 10^{ζ/10}

where ζ is a normal random variable (log-normal shadowing) with zero mean and a standard deviation of σ_ζ (many field trials have shown σ_ζ to be in the 4 dB to 8 dB range for microcellular urban environments) and P_l is the received power in the absence of shadowing for the lth path of the signal. Hence, the average power can be calculated using E{σ_l²} = ηP_l, where

η = E{10^{ζ/10}} = exp[(ln(10)/10)² σ_ζ²/2]   (35)

and E{·} denotes the expected value of the enclosed with respect to ζ. Hence,

E{c̃_l(t) c̃_l*(t − τ)} = R_c^(l)(τ) = ηP_l J_0(2π f_d^(l) τ) e^{i2π f_e τ}   (36)
Also, since uncorrelated fading is considered,

E{c̃_l(t) c̃_n*(t − τ)} = R_c^(l)(τ) δ[l − n]   (37)

where

δ[x] = 1 for x = 0, and 0 otherwise
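Equations (34)–(36) are straightforward to evaluate. A sketch, with J_0 computed from its integral representation so that no special-function library is needed (the parameter values are illustrative, not from the article):

```python
import numpy as np

def bessel_j0(x):
    """J0(x) = (1/pi) * integral_0^pi cos(x*sin(u)) du, by numerical averaging."""
    u = np.linspace(0.0, np.pi, 4001)
    return float(np.mean(np.cos(x * np.sin(u))))

def eta(sigma_zeta_dB):
    """Log-normal shadowing scale factor of Eq. (35)."""
    return float(np.exp((np.log(10.0) / 10.0) ** 2 * sigma_zeta_dB ** 2 / 2.0))

def r_c(tau, P_l=1.0, f_d=100.0, f_e=0.0, sigma_zeta_dB=6.0):
    """MD autocorrelation of Eq. (36): eta * P_l * J0(2*pi*f_d*tau) * exp(i*2*pi*f_e*tau)."""
    return eta(sigma_zeta_dB) * P_l * bessel_j0(2 * np.pi * f_d * tau) * np.exp(1j * 2 * np.pi * f_e * tau)

print(eta(0.0))                          # no shadowing spread -> 1.0
print(abs(r_c(0.0, sigma_zeta_dB=0.0)))  # R_c(0) = P_l -> 1.0
```

At τ = 0 the autocorrelation reduces to the mean path power ηP_l, and the J_0 factor describes how quickly the fading decorrelates as τ grows relative to the Doppler spread.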
Since the c̃_l(t) are all Gaussian, {c̃_l(t); for all l} is a set of mutually independent Gaussian random processes. Finally, suppose that the channel, in addition to causing a multipath effect, adds an additive noise. That is, the complex envelope of the jth received signal in the absence of user-induced interference may now be approximated as
r̃(t) = Σ_{l=1}^{N_p} c̃_l(t) x̃_j(t − τ_l(t)) + z̃(t)   (38)
Therefore, a CDMA receiver must estimate some or all of the τ_l before any form of communication can take place. Due to the unique properties of the PN codes, an estimate of τ_l is acquired by establishing PN code acquisition for each path.

INTERFERENCE-DISPERSIVE CHANNEL

Now let us consider a more realistic scenario where other CDMA signals are present and the channel suffers from multipath scattering. In that event,
r̃(t) = Σ_{l=1}^{N_p} c̃_{l,j}(t) x̃_j(t − τ_{l,j}(t)) + Σ_{k=1; k≠j}^{N} Σ_{l=1}^{N_p} c̃_{l,k}(t) x̃_k(t − τ_{l,k}(t)) + z̃(t)   (39)
where now N − 1 other CDMA signals and their respective multipath components are considered. Note that we have introduced c̃_{l,j}(t) as the MD for the lth path of the jth signal. All the properties of the MD processes discussed previously extend to this scenario as well. That is, we consider the c̃_{l,j}(t) as independent, baseband complex Gaussian processes for all l and j. Moreover, τ_{l,j}(t) now denotes the delay encountered by the lth path of the jth CDMA signal and is assumed to be slowly varying. The preceding may also be presented as

r̃(t) = Σ_{l=1}^{N_p} c̃_{l,j}(t) d_j(t − τ_{l,j}(t)) PN_j(t − τ_{l,j}(t)) + Σ_{k=1; k≠j}^{N} Σ_{l=1}^{N_p} c̃_{l,k}(t) d_k(t − τ_{l,k}(t)) PN_k(t − τ_{l,k}(t)) + z̃(t)   (40)

As can be seen, the received signal is corrupted by many interfering signals. Considering a situation where path delays are slowly varying, we have a simplified model given by

r̃(t) = Σ_{l=1}^{N_p} c̃_{l,j}(t) d_j(t − τ_{l,j}) PN_j(t − τ_{l,j}) + Σ_{k=1; k≠j}^{N} Σ_{l=1}^{N_p} c̃_{l,k}(t) d_k(t − τ_{l,k}) PN_k(t − τ_{l,k}) + z̃(t)   (41)

At this stage, we assume that the observation interval (symbol time) is short enough that the delay profile of the channel remains unchanged. That is, τ_{m,j}(t) ≈ τ_{m,j} for the observation interval. This condition is satisfied for most practical applications. To gain an insight into the means by which CDMA receivers overcome interference in the presence of the multipath effect, let us consider the outcome of a bandwidth despreading operation. Furthermore, let us assume a scenario where we are interested in recovering the mth path of the jth signal. This situation is encountered in practice where the strongest paths of the desired signal are acquired by the PN code acquisition subsystem. More precisely, we require that the receiver acquire an estimate of τ_{m,j}. Assuming that a successful delay estimation is performed, an estimate of τ_{m,j} that is within ±T_c/N_s of τ_{m,j} can be obtained. Let such an estimate be τ̂_{m,j}. Then the outcome of the bandwidth despreading operation (when the nth symbol is of interest) is

(1/2)⟨x̃(t) PN_j*(t − τ̂_{m,j})⟩_{n,τ̂_{m,j}}
  = z_n + (1/2)⟨c̃_{m,j}(t) d_j(t − τ_{m,j}) PN_j(t − τ_{m,j}) PN_j*(t − τ̂_{m,j})⟩_{n,τ̂_{m,j}}
  + (1/2) Σ_{l=1; l≠m}^{N_p} ⟨c̃_{l,j}(t) d_j(t − τ_{l,j}) PN_j(t − τ_{l,j}) PN_j*(t − τ̂_{m,j})⟩_{n,τ̂_{m,j}}
  + (1/2) Σ_{k=1; k≠j}^{N} Σ_{l=1}^{N_p} ⟨c̃_{l,k}(t) d_k(t − τ_{l,k}) PN_k(t − τ_{l,k}) PN_j*(t − τ̂_{m,j})⟩_{n,τ̂_{m,j}}   (42)

As seen before, it is not immediately obvious whether the desired symbol can be recovered in this case. Similar to the previous case, where fading was absent, and without loss of generality, let us assume that τ̂_{m,j} ≥ τ_{m,j}. In that event, Eq. (42) reduces to

(1/2)⟨x̃(t) PN_j*(t − τ̂_{m,j})⟩_{n,τ̂_{m,j}}
  = z_n + (1/2T_s) ∫_{(n−1)T_s+τ̂_{m,j}}^{nT_s+τ_{m,j}} c̃_{m,j}(t) d_j(t − τ_{m,j}) PN_j(t − τ_{m,j}) PN_j*(t − τ̂_{m,j}) dt
  + (1/2T_s) ∫_{nT_s+τ_{m,j}}^{nT_s+τ̂_{m,j}} c̃_{m,j}(t) d_j(t − τ_{m,j}) PN_j(t − τ_{m,j}) PN_j*(t − τ̂_{m,j}) dt
  + (1/2T_s) Σ_{l=1; l≠m}^{N_p} ∫_{(n−1)T_s+τ̂_{m,j}}^{nT_s+τ̂_{m,j}} c̃_{l,j}(t) d_j(t − τ_{l,j}) PN_j(t − τ_{l,j}) PN_j*(t − τ̂_{m,j}) dt
  + (1/2T_s) Σ_{k=1; k≠j}^{N} Σ_{l=1}^{N_p} ∫_{(n−1)T_s+τ̂_{m,j}}^{nT_s+τ̂_{m,j}} c̃_{l,k}(t) d_k(t − τ_{l,k}) PN_k(t − τ_{l,k}) PN_j*(t − τ̂_{m,j}) dt   (43)

This equation then leads to

(1/2)⟨x̃(t) PN_j*(t − τ̂_{m,j})⟩_{n,τ̂_{m,j}}
  = z_n + c̃_{m,j} [R_j^(1)(τ_{m,j}, τ̂_{m,j}) d_n^(j) + R_j^(2)(τ_{m,j}, τ̂_{m,j}) d_{n+1}^(j)]
  + Σ_{l=1; l≠m}^{N_p} c̃_{l,j} [R_j^(1)(τ_{l,j}, τ̂_{m,j}) d_{q_{l,j}}^(j) + R_j^(2)(τ_{l,j}, τ̂_{m,j}) d_{q_{l,j}+1}^(j)]
  + Σ_{k=1; k≠j}^{N} Σ_{l=1}^{N_p} c̃_{l,k} [R_{j,k}^(1)(τ_{l,k}, τ̂_{m,j}) d_{p_{l,k}}^(k) + R_{j,k}^(2)(τ_{l,k}, τ̂_{m,j}) d_{p_{l,k}+1}^(k)]   (44)

where we have assumed that the channel remains constant over a symbol time [and hence c̃_{m,j}(t) is replaced with c̃_{m,j}]. This assumption is satisfied in many practical communication systems. Moreover, we note that the integration interval coincides with the q_{l,j}th and (q_{l,j} + 1)th signaling intervals of the lth (l ≠ m) multipath of the desired ( jth) signal. Similarly, we have assumed that the integration interval includes the p_{l,k}th and (p_{l,k} + 1)th signaling intervals of the lth path of the kth interfering signal. Obviously, q_{l,j} and p_{l,k} depend on the delay profile of the channel and on the relative delays encountered by the various users, respectively. It is rather obvious that the desired symbol d_n^(j) is recovered. Considering the conditions imposed on the cross-correlation and autocorrelation functions of the PN codes in the previous sections, it is rather easy to see that the interfering signals and their respective multipath components are detected at a power level that is approximately P_G times smaller than that of the desired signal.
RAKE RECEIVER

In the previous section, we introduced CDMA signaling and its properties. We also demonstrated that the received signal at a receiver often comprises delayed and attenuated versions of the desired signal (reflections) due to the multipath effect. It was also demonstrated that the interference due to the
other active CDMA users adversely affects the received signal in a typical CDMA receiver. For the sake of clarity, in what follows we consider a scenario in which only the desired signal and its multipath components are present. From Eq. (44), it is obvious that if one despreader is used to extract the mth path of the jth signal, the other multipath components appear as interference to this form of detection. Note that, with the exception of the amplitude and phase distortion effects, the spreading signal for all the reflected CDMA signals is known to the receiver, and hence one can capture this useful energy using an arrangement that is analogous to a garden rake. To elaborate, without loss of generality, let us consider a single-user scenario where the strongest multipath component of the received signal is c̃_1(t) x̃(t − τ_1(t)). That is, the delay associated with the strongest component of the multipath signal is τ_1(t). Moreover, let us assume that the PN code acquisition and tracking subsystem has locked onto this component of the received signal. The other components of the multipath signal [i.e., Σ_{j=2}^{N_p} c̃_j(t) x̃(t − τ_j(t))] may now be regarded as interference. It is obvious that, with the exception of c̃_j(t) and τ_j(t), the received multipath components contain the useful modulation. Let us now suppose that from the N_p possible multipaths, we are interested in capturing only N_f signals. N_f is commonly referred to as the number of "fingers" of a rake receiver. Also, let τ = [τ_1, τ_2, . . ., τ_{N_f}] denote a vector containing the N_f significant multipath delays in ascending order. Moreover, let τ̂ = [τ̂_1, τ̂_2, . . ., τ̂_{N_f}] be a vector of delay estimates obtained by the PN code acquisition system. A rake receiver, after acquiring the N_f possible delays, performs an MF operation. The MF operation involves correlation with PN_j*(t − τ̂_j); j = 1, 2, . . ., N_f and integration over a symbol time. That is, the receiver forms the following set of variables:

Y_n^(j) = (1/2)⟨r̃(t) PN_j*(t − τ̂_j)⟩_{n,τ̂_j};  j = 1, 2, . . ., N_f   (45)
At this stage, there are two possibilities. In one scenario, it is possible to estimate the channel MD, and hence ĉ_j(t) [an estimate of c̃_j(t)]; j = 1, 2, . . ., N_f can be obtained at the receiver. This scenario is of critical importance in the coherent modulation case, where knowledge of the channel phase is necessary for successful demodulation. In the other scenario, such estimates are not available and the modulation scheme used allows for noncoherent detection.

We consider the coherent demodulation case first. In this case, since an MF operation is required, one must recompute Y_n^(j) as follows:

Y_{n,c}^(j) = (1/2)⟨r̃(t) ĉ_j*(t) PN_j*(t − τ̂_j)⟩_{n,τ̂_j};  j = 1, 2, . . ., N_f

Then a decision variable is formed for the nth transmitted data symbol as follows:

D_n^(c) = Σ_{j=1}^{N_f} Y_{n,c}^(j)

This variable is passed on to a coherent demodulator for further processing. Since an MF operation is performed for each path, in the absence of timing and channel estimation errors and in the face of additive white Gaussian noise (AWGN), no other receiver yields an energy level higher than that produced by the arrangement just suggested. For this reason, this receiver is referred to as a maximal ratio combiner (MRC).

In the other scenario, the receiver uses Eq. (45) to compute Y_n^(j). At this stage, one must remove channel phase ambiguities before a decision can be rendered on the transmitted symbol. If a frequency-shift-keying (FSK) modulation is used, then

D_n = Σ_{j=1}^{N_f} |Y_n^(j)|²

Note that although phase ambiguities have been removed, the amplitude fluctuations due to c̃_j(t) have not been compensated for. This deficiency can seriously impair the performance of the subsequent demodulation process. For yet another modulation scheme, known as differential phase-shift keying (DPSK), the desired information is stored in the difference (reduced mod 2π) between two consecutive phases of the received signal. It is also assumed that the channel remains stationary for at least two consecutive symbol intervals. To recover the desired symbol, the receiver forms the following decision variables:

D_n = Σ_{j=1}^{N_f} Y_n^(j) (Y_{n−1}^(j))*

Similar to its FSK counterpart, this receiver is impaired by changes in the channel amplitude due to c̃_j(t). Additionally, any phase changes across two consecutive symbol intervals can produce unfavorable results in this case.

KAMRAN KIASALEH
University of Texas at Dallas
Wiley Encyclopedia of Electrical and Electronics Engineering

Data Compression for Networking

Standard Article
Wade Wan and Xuemin Chen, Broadcom Corporation, Irvine, CA
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5305
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: Basic Terminology and Methods for Data Coding; Fundamental Compression Algorithms; JPEG; H.261 and H.263; MPEG; H.264/AVC/JVT.
Keywords: Nyquist's theorem; analog-to-digital converter; pulse code modulation (PCM); differential PCM (DPCM); adaptive DPCM (ADPCM); delta modulation (DM); Huffman coding; run-length coding; arithmetic coding; transform coding; subband coding; vector quantization; image and audio coding (JPEG, H.261, H.263, MPEG)
DATA COMPRESSION FOR NETWORKING

DATA COMPRESSION; VIDEO CODEC; TRANSFORM CODING; IMAGE CODING; IMAGE PROCESSING; SIGNAL REPRESENTATION; VIDEO SIGNAL PROCESSING; VIDEOTELEPHONY; DIGITAL TELEVISION; AUDIO CODING; SPEECH CODING

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 2007 John Wiley & Sons, Inc.

There has been an explosive growth of multimedia communication over networks during the past two decades. Video, audio, and other continuous media data, as well as additional discrete media such as graphics, are parts of integrated network applications. For these applications, the traditional media (e.g., text, images), as well as the continuous media (e.g., video, audio), must be processed. Such processing, referred to as "data coding," often yields better and more efficient representations of text, image, graphics, audio, and video signals. The uncompressed media data often require very high transmission bandwidth and considerable storage capacity. To provide feasible and cost-effective solutions at the current quality requirements, compressed text, image, graphics, audio, and video streams are transmitted over networks. As shown in Refs. 1–6, there exist many data coding and compression techniques that are, in part, competitive and, in part, complementary. Most of these techniques are already used in today's products, while other methods are still undergoing development or are only partly realized. Today and in the near future, the major coding schemes are linear predictive coding, layered coding, and transform coding. The most important compression techniques are entropy coding (e.g., run-length coding, Huffman coding, and arithmetic coding), source coding (e.g., vector quantization, subsampling, and interpolation), hybrid coding (e.g., JPEG, MPEG-1, MPEG-2, MPEG-4, H.261, H.263, and H.264/AVC/MPEG-4 Part 10), and other proprietary coding techniques (e.g., Intel's Indeo, Microsoft's Windows Media Audio and Video, General Instrument's DigiCipher, IBM's Ultimotion, and Apple's QuickTime).

The purpose of this article is to provide the reader with a basic understanding of the principles and techniques of data coding and compression. Various compression schemes are discussed for transforming audio, image, and video signals into compressed digital representations for efficient transmission or storage. Before embarking on this venture, it is appropriate to first introduce and clarify the basic terminology and methods for signal coding and compression.

BASIC TERMINOLOGY AND METHODS FOR DATA CODING

The word signal originally referred to a continuous-time and continuous-amplitude waveform, called an analog signal. In a general sense, people now view a signal as a function of time, where time may be continuous or discrete, where the amplitude or values of the function may be continuous or discrete, and where those values may be scalar or vector-valued. Thus, a signal is meant to represent a sequence or a waveform whose value at any time is a real number or real vector. In many applications, a signal also refers to an image, whose amplitude depends on two spatial coordinates instead of one time variable; or it can refer to a video (moving images), whose amplitude is a function of two spatial variables and a time variable. The word data is sometimes used as a synonym for signal, but more often it refers to a sequence of numbers or, more generally, vectors. Thus, data can often be viewed as a discrete-time signal. In recent years, however, the word data has increasingly been associated in most literature with the discrete or digital case, that is, with discrete time and discrete amplitude, what is called a digital signal.

Speech, audio, images, video, and all observable electrical waveforms are analog and continuous-time in nature. The first step in converting analog signals to digital form is sampling. A continuously fluctuating analog waveform can usually be characterized completely from the knowledge of its amplitude values at a countable set of points in time, so that, in effect, one can "throw away" the rest of the signal. One does not need to observe how it behaves between any two isolated instances of observation. This is at the same time remarkable and intuitively obvious. It is remarkable that one can discard so much of the waveform and still be able to accurately recover the missing parts. The intuitive idea is that, if one samples periodically at regularly spaced intervals, and the signal does not fluctuate too quickly, so that no unexpected wiggles can appear between two consecutive sampling instants, then one can expect to recover the complete waveform by a simple process of interpolation or smoothing, where a smooth curve is drawn that passes through the known amplitude values at the sampling instants.

When watching a movie, one is actually seeing 24 still pictures flashed on the screen every second. (Actually, each picture is flashed twice.) The movie camera that produced these pictures was photographing a scene by taking one still picture every 1/24th of a second. Yet, one has the illusion of seeing continuous motion. In this case, the cinematic process works because the brain is somehow doing the interpolation. This is an example of sampling in action in daily life.

For an electrical waveform, or any other one-dimensional signal, the samples can be carried as amplitudes on a periodic train of narrow pulses. Consider a scalar time function x(t), which has a Fourier transform X(f). Assume there is a finite upper limit on how fast x(t) can vary with time. Specifically, assume that X(f) = 0 for |f| ≥ W. Thus, the signal has a strictly low-pass spectrum with cutoff frequency W hertz (Hz). To sample this signal, one can periodically observe the amplitude at isolated time instants t = kT for k = . . ., −2, −1, 0, 1, 2, . . . . The sample rate is fs = 1/T, and T is the sampling period or sampling interval in seconds. The idealized case of the sampling model is impulse sampling, with a perfect ability to observe isolated amplitude values at the sampling instants kT. The effect of such a sampling model is seen as the process of multiplying the original signal x(t) by a sampling function, s(t), which is the periodic train of impulses (Dirac delta functions in the ideal case) given by

s(t) = T Σ_{k=−∞}^{∞} δ(t − kT)
where the amplitude scale is normalized to T so that the average value of s(t) is unity. In the time domain, the effect of this multiplication operation is to generate a new impulse train whose amplitudes are samples of the waveform x(t). Thus

y(t) = x(t) s(t) = T Σ_{k=−∞}^{∞} x(kT) δ(t − kT)
Therefore, one now has a signal y(t) which contains only the sample values of x(t); all values in between the sampling instants have been discarded. Figure 1 shows an example of a continuous signal waveform and its sampled waveform. The complete recovery of x(t) from the sampled signal y(t) can be achieved if the sampling process satisfies the following fundamental theorem:
Nyquist Sampling Theorem. A signal x(t) that is bandlimited to W (Hz) can be exactly reconstructed from its samples y(t) when it is periodically sampled at a rate fs ≥ 2W. This minimum sampling frequency of 2W (Hz) is called the Nyquist frequency or Nyquist rate.

If the condition of the sampling theorem is violated, that is, if the sampling rate is less than twice the maximum frequency component in the spectrum of the signal being sampled, then the recovered signal will be the original signal plus an additional undesired waveform whose spectrum overlaps with the high-frequency components of the original signal. This
Figure 1. An example of sampling process, which shows (a) the original analog waveform and (b) its corresponding sampled waveform.
undesired component is called aliasing noise, and the overall effect is referred to as aliasing, since the noise introduced here is actually a part of the signal itself but with its frequency components shifted to a new frequency. The rate at which a signal is sampled usually determines the amount of processing, transmission, or storage that will subsequently be required. Hence, it is desirable to use the lowest possible sampling rate that will satisfy a given application. On the other hand, most physical signals are not strictly bandlimited. Typically, however, the contribution of the higher-frequency signal components diminishes in importance as frequency increases beyond a certain value. For example, music often does not have a well-defined cutoff frequency, below which significant power density exists and above which no signal power is present. But human ears are not sensitive to very-high-frequency sound. So how does one choose a meaningful sampling rate that is not higher than necessary and yet does not violate the sampling theorem? The answer is to first decide how much of the original signal spectrum really needs to be retained. Analog lowpass filtering is then performed on the analog signal before sampling, so that the "needless" high-frequency components are suppressed. This analog prefiltering is often called antialias filtering. For example, in digital telephony, the standard antialias filter has a cutoff of 3.4 kHz, although the speech signal contains frequency components extending well beyond this frequency. This cutoff allows the moderate sampling rate of 8 kHz to be used and retains the voice fidelity of analog telephone circuits, which were already limited to roughly 3.4 kHz. In summary, analog prefiltering is needed to prevent aliasing of the signal and noise components that lie outside of the frequency band that must be preserved and reproduced.
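The aliasing effect can be checked numerically. The following is a minimal sketch (plain Python; the 8 kHz sample rate and the 5 kHz/3 kHz tone pair are illustrative choices): a 5 kHz cosine lies above the 4 kHz Nyquist frequency of an 8 kHz sampler, so its samples coincide exactly with those of a 3 kHz cosine.

```python
import math

fs = 8000.0            # sampling rate (Hz); Nyquist frequency is fs/2 = 4 kHz
f_high = 5000.0        # tone above the Nyquist frequency
f_alias = fs - f_high  # its alias at 3000 Hz

# Sample both tones at the same instants t = k/fs
n = 64
x_high = [math.cos(2 * math.pi * f_high * k / fs) for k in range(n)]
x_alias = [math.cos(2 * math.pi * f_alias * k / fs) for k in range(n)]

# The two sample sequences are numerically indistinguishable
max_diff = max(abs(a - b) for a, b in zip(x_high, x_alias))
```

Since the sampler cannot tell the two tones apart, the 5 kHz component must be removed by an antialias filter before sampling, exactly as described above.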
Just as a waveform is sampled at discrete times, the value of the sampled waveform at a given time is converted to a discrete value. Such a conversion process is called quantization, which introduces loss into the sampled waveform. The resolution of quantization depends on the number of bits used in measuring the height of the waveform. For example, an 8-bit quantization yields 256 possible values. Lower resolutions of quantization result in higher losses in the digital signal. The electronic device that converts a signal waveform into digital samples is called an analog-to-digital converter (ADC). The reverse conversion is performed by a digital-to-analog converter (DAC).

Figure 2. PCM coded signal of sampled waveform in Fig. 1(b).

The first method to sample analog signals and then quantize the sample values was pulse code modulation (PCM). PCM was invented in the 1930s, but only became prevalent in the 1960s, when transistors and integrated circuits became available. Figure 2 depicts the steps involved in PCM at a high level. PCM does not require sophisticated signal processing techniques and related circuitry. Hence, it was the first method to be employed, and it remains the prevalent method used today in telephone plants. PCM provides excellent quality. PCM for voice coding was specified by the International Telegraph and Telephone Consultative Committee (CCITT), now the International Telecommunication Union (ITU), in Recommendation G.711. A problem with PCM is that it requires a fairly high bit rate (e.g., 64 kbit/s for voice coding). PCM has been around for a long time, and newer technologies are beginning to demand attention. Of the available schemes emerging from the laboratory, differential pulse code modulation (DPCM) and adaptive DPCM (ADPCM) are among the most promising techniques. If a signal has a high correlation between adjacent samples, the variance of the difference between adjacent samples is smaller than the variance of the original signal. If this difference is coded, rather than the original signal, fewer bits are needed for the same desired accuracy.
That is, it is sufficient to represent only the first PCM-coded sample as a whole, and all following samples as the difference from the previous one. This is the idea behind DPCM. In general, fewer bits are needed for DPCM than for PCM. In a typical DPCM system, the input signal is bandlimited, and an estimate of the previous sample (or a prediction of the current signal value) is subtracted from the input. The difference is then sampled and coded. In the simplest case, the estimate of the previous sample is formed by taking the sum of the decoded values of all the past differences (which ideally differ from the previous sample only by a quantizing error). DPCM exhibits a significant improvement over PCM when the signal spectrum is peaked
at the lower frequencies and rolls off toward the higher frequencies.

A modification of DPCM is delta modulation (DM). When coding the differences, it uses exactly one bit, which indicates whether the signal increases or decreases. This leads to inaccurate coding of steep edges. This technique is particularly profitable if the coding does not depend on 8-bit grid units. If differences are small, a smaller number of bits is sufficient.

A prominent adaptive coding technique is ADPCM, a further development of DPCM. Here, differences are encoded by the use of only a small number of bits (e.g., 4 bits). Therefore, either sharp transitions are coded correctly (these bits represent bits with a higher significance), or small changes are coded exactly (DPCM-encoded values are the less-significant bits). In the second case, a loss of high frequencies would occur. ADPCM adapts to this "significance" for a particular data stream as follows: the coder divides the value of the DPCM samples by a suitable coefficient and the decoder multiplies the compressed data by the same coefficient; that is, the step size of the signal changes. The value of the coefficient is adapted to the DPCM-encoded signal by the coder. In the case of a high-frequency signal, large DPCM values occur, and the coder determines a high value for the coefficient. The result is a very coarse quantization of the DPCM signal in passages with steep edges. Low-frequency portions of such passages are hardly considered at all. For a signal with permanently relatively small DPCM values, the coder will determine a small coefficient, thereby guaranteeing a fine resolution of the dominant low-frequency signal portions. If high-frequency portions of the signal suddenly occur in such a passage, a signal distortion, in the form of slope overload, arises.
Given the currently defined step size, the greatest possible change representable with the existing number of bits may not be large enough to represent the DPCM value with an ADPCM value; the transition in the PCM signal is then faded. It is possible to explicitly transmit the coefficient that is adaptively adjusted to the data in the coding process. Alternatively, the decoder is able to calculate the coefficient itself from the ADPCM-encoded data stream. In ADPCM, the coder can be made to adapt to DPCM value changes by increasing or decreasing the range represented by the encoded bits. In principle, the range of bits can be increased or decreased to match different situations. In practice, the ADPCM coding device accepts the PCM-coded signal and then applies a special algorithm to reduce the 8-bit samples to 4-bit words using only 15 quantization levels. These 4-bit words no longer represent sample amplitudes; instead, they contain only enough information to reconstruct the amplitude at the distant end. The adaptive predictor predicts the value of the next signal based on the level of the previously sampled signal. A feedback loop ensures that signal variations are followed with minimal deviation. The deviation of the predicted value, measured against the actual signal, tends to be small, and can be encoded with 4 bits.
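The DPCM principle above can be sketched in a few lines of plain Python. The previous-sample predictor and the fixed quantizer step of 0.05 are illustrative assumptions; a real ADPCM coder would additionally adapt the step size to the signal, as just described.

```python
import math

STEP = 0.05  # fixed quantizer step size (an ADPCM coder would adapt this)

def dpcm_encode(samples, step=STEP):
    """Quantize the difference between each sample and the predicted value."""
    pred, codes = 0.0, []
    for s in samples:
        code = round((s - pred) / step)   # quantized prediction residual
        codes.append(code)
        pred += code * step               # track the decoder's reconstruction
    return codes

def dpcm_decode(codes, step=STEP):
    """Accumulate the quantized differences to rebuild the waveform."""
    pred, out = 0.0, []
    for code in codes:
        pred += code * step
        out.append(pred)
    return out

samples = [0.8 * math.sin(2 * math.pi * k / 40) for k in range(80)]
decoded = dpcm_decode(dpcm_encode(samples))
# Because the encoder predicts from its own reconstruction, the error
# stays within half a quantizer step and does not accumulate.
worst = max(abs(a - b) for a, b in zip(samples, decoded))
```

Note the design choice of predicting from the decoded values rather than the original samples: this keeps encoder and decoder in lockstep, so quantization errors do not build up over time.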
FUNDAMENTAL COMPRESSION ALGORITHMS

The purpose of compression is to reduce the amount of data for multimedia communication. The amount of compression that an encoder achieves can be measured in two different ways. Sometimes the parameter of interest is the compression ratio: the ratio between the original source data size and the compressed data size. For continuous-tone images, however, another measure, the average number of compressed bits per pixel, is sometimes a more useful parameter for judging the performance of an encoding system. For a given image, the two are simply different ways of expressing the same compression. Compression in multimedia systems is subject to certain constraints. The quality of the coded and, later on, decoded data should be as good as possible. To make a cost-effective implementation possible, the complexity of the technique should be minimal. The processing time of the algorithm cannot exceed certain bounds. A natural measure of quality in a data coding and compression system is a quantitative measure of distortion. Among the quantitative measures, a class of criteria often used is the mean square criterion. It refers to some type of average or sum (or integral) of squares of the error between the sampled data y(t) and the decoded or decompressed data ŷ(t). For data sequences y(t) and ŷ(t) of N samples, the quantity
ALSE = (1/N) Σ_{t=0}^{N−1} [y(t) − ŷ(t)]²

is called the average least squares error (ALSE). The quantity

MSE = E{[y(t) − ŷ(t)]²}
is called the mean square error (MSE), where E represents the mathematical expectation. Often ALSE is used as an estimate of MSE. In many applications, the (mean square) error is expressed in terms of a signal-to-noise ratio (SNR), which is defined in decibels (dB) as

SNR = 10 log₁₀ (σ² / MSE)
where σ² is the variance of the original sampled data sequence. Another definition of SNR, used commonly in image and video coding applications, is the peak signal-to-noise ratio

PSNR = 10 log₁₀ (255² / MSE)

where 255 is the peak value of 8-bit data.
The PSNR value is roughly 12 to 15 dB above the value of SNR. Another commonly used method for measuring the performance of a data coding and compression system is rate distortion theory. Rate distortion theory provides some useful results, which tell us the minimum number of bits required to encode the data while admitting a certain level of distortion, and vice versa.
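These error measures are straightforward to compute. A minimal sketch in plain Python (the peak value of 255 assumes 8-bit pixel data):

```python
import math

def mse(original, reconstructed):
    """Mean square error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

def snr_db(original, reconstructed):
    """SNR in dB: signal variance over mean square error."""
    n = len(original)
    mean = sum(original) / n
    var = sum((a - mean) ** 2 for a in original) / n
    return 10 * math.log10(var / mse(original, reconstructed))

def psnr_db(original, reconstructed, peak=255.0):
    """PSNR in dB: squared peak value over mean square error."""
    return 10 * math.log10(peak * peak / mse(original, reconstructed))
```

For example, 8-bit pixels that each differ from the original by at most one level give an MSE of 1.0 and a PSNR of roughly 48 dB.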
The rate distortion function of a random variable x gives the minimum average rate R_D (in bits per sample) required to represent (or code) it while allowing a fixed distortion D in its reproduced value. If x is a Gaussian random variable of variance σ², y is its reproduced value, and the distortion is measured by the mean square value of the difference (x − y), that is, D = E[(x − y)²], then the rate distortion function of x is defined as

R_D = (1/2) log₂ (σ² / D),  0 ≤ D ≤ σ²
Data coding and compression systems are considered optimal if they maximize the amount of compression subject to an average or maximum distortion. As shown in Table 1, compression techniques fit into different categories. For their use in multimedia systems, one can distinguish among entropy, source, and hybrid coding. Entropy coding is a lossless process, while source encoding is a lossy process. Most multimedia systems use hybrid techniques, which are a combination of the two coding techniques. Entropy coding is used independently of the media's specific characteristics. Any input data sequence is considered to be a simple digital sequence, and the semantics of the data is ignored. Entropy encoding reduces the size of the data sequence by focusing on the statistical characteristics of the encoded data series to allocate efficient codes, independent of the characteristics of the data. Entropy encoding is an example of lossless encoding, as the decompression process regenerates the data completely. The concept of entropy is derived from classical 19th century thermodynamics. The basic ideas of entropy coding are as follows: First, one defines the term information by using video signals as examples. Consider a video sequence in which each pixel takes on one of K values. If the spatial correlation has been removed from the video signal, the probability that a particular level i appears will be P_i, independent of the spatial position. When such a video signal is transmitted, the information I imparted to the receiver by knowing which of the K levels is the value of a particular pixel is −log₂ P_i bits. This value, averaged over an image, is referred to as the average information of the image, or the entropy. The entropy can therefore be expressed as

H = −Σ_{i=1}^{K} P_i log₂ P_i
The entropy is also extremely useful for measuring the performance of a coding system. In "stationary" systems—systems where the probabilities are fixed—it provides a fundamental lower bound, called the entropy limit, for the compression that can be achieved with a given alphabet of symbols. Entropy encoding attempts to perform efficient code allocation (without increasing the entropy) for a signal. Run-length encoding, Huffman encoding, and arithmetic encoding are well-known entropy coding methods (7) for efficient code allocation, and are commonly used in actual encoders.
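The entropy limit for a given symbol distribution is straightforward to evaluate; a minimal sketch in plain Python:

```python
import math

def entropy(probs):
    """Average information H = -sum(P_i * log2(P_i)) in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform 256-level source carries the full 8 bits per symbol ...
uniform_bits = entropy([1.0 / 256] * 256)
# ... while a heavily skewed distribution can be coded well below that
skewed_bits = entropy([0.9] + [0.1 / 255] * 255)
```

The skewed case illustrates why entropy coding pays off: when a few levels dominate, the average information per symbol falls far below the raw bit width.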
Run-length coding is the simplest entropy coding. Data streams often contain sequences of the same bytes or symbols. By replacing these repeated byte or symbol sequences with the number of occurrences, introduced by a special flag that does not occur in the data stream itself, a substantial reduction of data can be achieved. For example, the data sequence GISSSSSSSGIXXXXXX can be run-length coded as GIS#7GIX#6, where # is the indicator flag. The character "S" occurs 7 consecutive times and is "compressed" to the 3 characters "S#7"; similarly, the character "X" occurs 6 consecutive times and is "compressed" to the 3 characters "X#6". Run-length coding is a generalization of zero suppression, which assumes that just one symbol appears particularly often in sequences; the coding focuses on uninterrupted sequences, or runs, of zeros or ones to produce an efficient encoding.

Huffman coding is an optimal way of coding with integer-length code words. Huffman coding produces a "compact" code: for a particular set of symbols and probabilities, no other integer-length code can be found that gives better coding performance. Consider the example given in Table 2. The entropy, the average ideal code length required to transmit the weather, is given by

H = (3/4) log₂(4/3) + (1/8)(3) + (1/16)(4) + (1/16)(4) = 1.186 bits/symbol
However, fractional-bit code lengths are not allowed, so the lengths of the codes listed in the right column do not match the ideal information. Since an integer code always needs at least one bit, increasing the ideal code length for the symbol "00" to one bit seems logical. The Huffman code assignment procedure is based on a coding "tree" structure. This tree is developed by a sequence of pairing operations, in which the two least probable symbols are joined at a "node" to form two "branches" of the tree. As the tree is constructed, each node at which two branches meet is treated as a single symbol with a combined probability that is the sum of the probabilities of all symbols combined at that node. Figure 3 shows a Huffman code pairing sequence for the four-symbol case in Table 2. In Fig. 3, the four symbols are placed on the number line from 0 to 1, in order of increasing probability. The cumulative sum of the symbol probabilities is shown at the left. The two smallest probability intervals are paired, leaving three probability intervals of size 1/8, 1/8, and 3/4. We establish the next branch in the tree by again pairing the two smallest probability intervals, 1/8 and 1/8, leaving two probability intervals, 1/4 and 3/4. Finally, the tree is completed by pairing the 1/4 and 3/4 intervals. To create the code word for each symbol, one assigns a 0 and 1, respectively (the order is arbitrary), to each branch of the tree, and then concatenates the bits assigned to these branches, starting at the "root" (at the right of the tree) and following the branches back to the "leaf" for each symbol (at the far left). Notice that each node in this tree requires a binary decision, a choice between two possibilities, and therefore appends one bit to the code word.
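The pairing procedure can be sketched with a priority queue. This is a minimal plain-Python sketch; the probabilities {3/4, 1/8, 1/16, 1/16} are the values implied for Table 2 by the pairing steps described above (they are an assumption, since the table itself is not reproduced here).

```python
import heapq
import itertools

def huffman_lengths(probs):
    """Code length per symbol, built by repeatedly pairing the two
    least probable nodes; each pairing adds one bit to its members."""
    tie = itertools.count()  # tie-breaker so heap entries always compare
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for i in syms1 + syms2:
            lengths[i] += 1  # one more branch on the path back to the root
        heapq.heappush(heap, (p1 + p2, next(tie), syms1 + syms2))
    return lengths

probs = [3 / 4, 1 / 8, 1 / 16, 1 / 16]
lengths = huffman_lengths(probs)                    # code lengths 1, 2, 3, 3
rate = sum(p, l) if False else sum(p * l for p, l in zip(probs, lengths))
```

The resulting average rate matches the 1.375 bits/symbol figure discussed for Table 2 below.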
Figure 3. Huffman coding tree for the sequence symbols given in Table 2. It demonstrates the Huffman code assignment process.
Figure 4. A process of partitioning the number line into subintervals for arithmetic coding. It illustrates a possible ordering for the symbol probabilities in Table 2.
One of the problems with Huffman coding is that symbols with probabilities greater than 0.5 still require a code word of length one. This leads to less efficient coding, as can be seen for the codes in Table 2. The coding rate R achieved with Huffman codes in this case is as follows:

R = (3/4)(1) + (1/8)(2) + (1/16)(3) + (1/16)(3) = 1.375 bits/symbol
This rate, when compared with the entropy limit of 1.186 bit/pixel, represents an efficiency of 86 percent. Arithmetic coding is an optimal coding procedure that is not constrained to integer-length codes. In arithmetic coding, the symbols are ordered on the number line in the probability interval from 0 to 1, in a sequence that is known to both encoder and decoder. Each symbol is assigned a subinterval equal to its probability. Note that, since the symbol probabilities sum to one, the subintervals precisely fill the interval from 0 to 1. Figure 4 illustrates a possible ordering for the symbol probabilities in Table 2. The objective in arithmetic coding is to create a code stream that is a binary fraction pointing to the interval for the symbol being coded. Thus, if the symbol is "00", the code stream is a binary fraction greater than or equal to binary 0.01 (decimal 0.25), but less than binary 1.0. If the symbol
is "01", the code stream is greater than or equal to binary 0.001, but less than binary 0.01. If the symbol is "10", the code stream is greater than or equal to binary 0.0001, but less than binary 0.001. Finally, if the symbol is "11", the code stream is greater than or equal to binary 0, but less than binary 0.0001. If the code stream follows these rules, a decoder can see which subinterval is pointed to by the code stream and decode the appropriate symbol. Coding additional symbols is a matter of subdividing the probability interval into smaller and smaller subintervals, always in proportion to the probability of the particular symbol sequence. As long as one follows the rule that the code stream is never allowed to point outside the subinterval assigned to the sequence of symbols, the decoder will decode that sequence. For a detailed discussion of Huffman coding and arithmetic coding, interested readers should refer to (7).

Source coding takes into account the semantics of the data. The degree of compression that can be reached by source coding depends on the data contents. In the case of lossy compression techniques, a one-way relation between the original sequence and the encoded data stream exists; the decoded data are similar to, but not identical with, the original data. Different source coding techniques make extensive use of the characteristics of the specific medium. An example is sound source coding, where the sound is transformed from the time domain to the frequency domain before encoding. This transformation, followed by encoding, substantially reduces the amount of data.
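Returning to arithmetic coding, the interval-subdivision rule can be sketched directly in plain Python. The subinterval table follows the ordering described above for the Table 2 symbols, which is one possible ordering (the exact layout of Fig. 4 is an assumption).

```python
# Subintervals [low, high) on the number line, known to encoder and decoder
INTERVALS = {
    "11": (0.0, 0.0625),    # probability 1/16
    "10": (0.0625, 0.125),  # probability 1/16
    "01": (0.125, 0.25),    # probability 1/8
    "00": (0.25, 1.0),      # probability 3/4
}

def arith_interval(symbols, intervals=INTERVALS):
    """Narrow [0, 1) down to the subinterval for a symbol sequence.

    Any binary fraction inside the returned interval decodes to this
    sequence, so about -log2(width) bits suffice to point into it."""
    low, width = 0.0, 1.0
    for s in symbols:
        s_low, s_high = intervals[s]
        low += width * s_low          # offset within the current interval
        width *= s_high - s_low       # shrink in proportion to probability
    return low, low + width

low, high = arith_interval(["00", "01"])  # width 3/4 * 1/8 = 3/32
```

Note how each symbol shrinks the interval in proportion to its probability, which is exactly why likely sequences end up with wide intervals and hence short code streams.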
Predictive Coding. Prediction is the most fundamental aspect of source coding. The basis of predictive encoding is to reduce the number of bits used to represent information by taking advantage of correlation in the input signal. DPCM and ADPCM, discussed above, are among the simplest predictive coding methods. Digital video signals exhibit correlation both between pixels within a frame (spatial correlation) and between pixels in differing frames (temporal correlation). Video compression techniques typically fall into two main types: (1) interframe prediction, which uses a combination of motion prediction and interpolated frames to achieve high compression ratios; and (2) intraframe coding, which compresses every frame of video individually. Interframe prediction techniques take advantage of the temporal correlation, while the spatial correlation is exploited by intraframe coding methods. For interlaced video, which scans alternate lines to distribute the pixels of a single frame across two fields, intra- and interfield prediction methods can also be used. Motion compensation (MC), one of the most complex prediction methods, reduces the prediction error by predicting the motion of the imaged objects. The basic idea of MC arises from a commonsense observation: in a video sequence, successive frames (or fields) are likely to represent the same details, with little difference between one frame and the next. A sequence showing moving objects over a still background is a good example. Data compression can be effective if each component of a frame is represented by its difference with the most similar component— the predictor—in the previous frame, and by a vector—the
motion vector—expressing the relative position of the two components. Even if actual motion exists between the two frames, the motion-compensated difference may be null or very small. The original component can be reconstructed from the difference, the motion vector, and the previous frame. A weakness of prediction-based encoding is that the influence of any errors during data transmission affects all subsequent data. In particular, when interframe prediction is used, the influence of transmission errors is quite noticeable. Since predictive encoding schemes are often used in combination with other schemes, such as transform-based schemes, the influence of transmission errors must be given due consideration.

Transform Coding. If we consider the frequency distribution of signals containing strong correlation, it appears that the signal power is concentrated in the low-frequency region. In general, it is possible to exploit for compression any systematic bias in components of the signal. The key idea behind transform coding is to transform the original signal in such a way as to emphasize the bias, making it more amenable to techniques that remove redundancy. One optimal transform is the Karhunen–Loeve (KL) transform. The KL transform can completely remove the statistical correlation of image data and provides a minimum mean-square error (3). In applying the KL transform to images, there are dimensionality difficulties. The KL transform depends on the statistics as well as the size of the image, and fast KL transform algorithms exist only for certain statistical image models. A number of orthogonal transforms, including the discrete Fourier transform (DFT) and the discrete cosine transform (DCT), have been used in various compression algorithms. Of these transforms, the DCT is the most widely used for video compression, because the power of the transformed signal is well concentrated in the low frequencies, and it can be computed rapidly.
The following expresses a two-dimensional DCT for an N × N pixel block:

F(u, v) = (2/N) C(u) C(v) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} f(x, y) cos[(2x + 1)uπ/(2N)] cos[(2y + 1)vπ/(2N)]
where f(x, y) is the pixel value at position (x, y), F(u, v) is the transform coefficient at frequency (u, v), and C(k) = 1/√2 for k = 0 and C(k) = 1 for k > 0.
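A direct, unoptimized plain-Python implementation of this transform follows (the 8 × 8 block is an illustrative choice; real codecs use fast factorizations rather than the O(N⁴) definition):

```python
import math

def dct2(block):
    """Two-dimensional DCT of an N x N block, computed directly from
    the definition (O(N^4); for illustration only)."""
    n = len(block)

    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0.0
            for x in range(n):
                for y in range(n):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = (2.0 / n) * c(u) * c(v) * acc
    return coeffs

# A flat block concentrates all of its energy in the dc coefficient F(0, 0)
flat = [[1.0] * 8 for _ in range(8)]
coeffs = dct2(flat)
```

The flat-block check mirrors the point made in the text: for strongly correlated (here, constant) data the transform packs the signal power into the low-frequency corner, leaving the remaining coefficients near zero.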
After the transformation, DCT coefficients are quantized by levels specified in a quantization table. Usually, larger values of N improve the SNR, but the effect saturates above a certain block size. Further, increasing the block size increases the total computation cost required. The value of
Figure 5. A block diagram of the MC + DCT coding scheme, which shows the basic function blocks such as motion estimation, motion prediction, DCT, variable-length coding, and so on.
N is thus chosen to balance the efficiency of the transform and its computation cost, block sizes of 8 and 16 are common. For large quantization, DCT using block sizes of 8 and 16 often lead to “blocking artifacts”—visible discontinuities between adjacent blocks. In practice, DCT is used in conjunction with other techniques, such as prediction and entropy coding. The Motion Compensation Plus Discrete Cosine Transform (MC + DCT) scheme, which will repeatedly be referred to, is a prime example of such a combination. MC + DCT. Suppose that the video to be encoded consists of digital television or teleconferencing services. For this type of video, MC carried out on the basis of frame differences is quite effective. MC can be combined with the DCT for even more effective compression. The overall configuration of MC + DCT is illustrated in Fig. 5. The selection of block size compares its input signal with that of the previous frame (generally in units of 8 × 8 pixel blocks) and selects those that exhibit motion. MC operates by comparing the input signal in units of blocks against a locally decoded copy of the previous frame, extracting a motion vector, and using the motion vector to calculate the frame difference. The motion vector is extracted by, for example, shifting vertically or horizontally a region several pixels on a side and performing matching within the block or the macroblock (a 16 × 16 pixel segment in a frame) (8). The motion-compensated frame-difference signal is then discrete cosine transformed, in order to remove spatial redundancy. A variety of compression techniques are applied in quantizing the DCT coefficients; the reader is directed to the references for details (8). A leading method is zig-zag scan, which has been standardized in JPEG, H.261, H.263, MPEG-1, -2, and -4, for video transmission encoding (8). Zig-zag scan, which transforms two-dimensional data into one dimension, is illustrated in Fig. 6. 
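The zig-zag order of Fig. 6 can be generated rather than tabulated. The Python sketch below (an illustration, not taken from any standard's reference code) sorts block positions by anti-diagonal, alternating the traversal direction on successive diagonals:

```python
def zigzag_order(n=8):
    """Zig-zag scan order for an n x n block: walk the anti-diagonals,
    alternating direction so the path snakes from DC to the highest frequency."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],                  # which diagonal
                                  rc[0] if (rc[0] + rc[1]) % 2    # odd: downward
                                  else rc[1]))                    # even: upward

def zigzag_scan(block):
    """Flatten a 2-D coefficient block into the 1-D zig-zag sequence."""
    return [block[r][c] for r, c in zigzag_order(len(block))]
```

For an 8 × 8 block this yields the familiar sequence (0,0), (0,1), (1,0), (2,0), (1,1), (0,2), ....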
Because the dc component of the coefficients is of critical importance, ordinary linear quantization is employed for it. The other components are scanned, for example, in zig-zag fashion from low to high frequency, linearly quantized, and variable-length-encoded by the use of run-length and Huffman coding.

Figure 6. The zig-zag scan pattern for an 8 × 8 block.

Subband Coding.

Subband coding (5) refers to compression methods that divide the signal into multiple bands to take advantage of a bias in the frequency spectrum of the video signal. That is, efficient encoding is performed by partitioning the signal into multiple bands and taking into account the statistical characteristics and visual significance of each band. The general form of a subband coding system is shown in Fig. 7. In the encoder, the analyzing filters partition the input signal into bands; this process is called subband decomposition. Each band is separately encoded, and the encoded bands are multiplexed and transmitted. The decoder reverses this process. Subband encoding offers several advantages. Unlike DCT-based compression techniques, it is not prone to blocking artifacts. Furthermore, subband encoding is the most natural coding scheme when hierarchical processing is needed for video coding. The main technological features to be determined in subband encoding are the subband analysis method (two- or three-dimensional), the structure of the analyzing filters, the bit allocation method, and the compression method within each band. In particular, there are quite a number of candidates for the form of the
Figure 7. A simplified block diagram of a subband coding scheme.
analysis and the structure of the filters. The filters must not introduce aliasing distortion in band analysis and synthesis. Figure 8 shows a two-band analysis and synthesis system. Consider the following analyzing filters as an example:

H1(z) = H0(−z)

where H0(z) is a low-pass prototype filter. For these analyzing filters, the characteristics of the synthesizing filters are

G0(z) = H0(z),    G1(z) = −H1(z) = −H0(−z)

The relationship between the input X(z) and the reconstructed output X̂(z) is then

X̂(z) = (1/2)[H0(z)G0(z) + H1(z)G1(z)]X(z) + (1/2)[H0(−z)G0(z) + H1(−z)G1(z)]X(−z)
     = (1/2)[H0(z)^2 − H0(−z)^2]X(z)
Clearly, the aliasing components (the terms in X(−z)) completely cancel. The basic principles illustrated hold unchanged when two-dimensional filtering is used in a practical application. Figure 9 illustrates how the two-dimensional frequency domain may be partitioned either uniformly or in an octave pattern. If one recalls that signal power will be concentrated in the low-frequency components, then the octave method seems the most natural. Since this corresponds to constructing the analyzing filters in a tree structure, it lends itself well to implementation with filter banks. In practical applications, one of the most important decomposition filters is what is called the discrete wavelet transform (DWT). Wavelet theory provides a unified framework
for multiresolution image compression. DWT-based compression enables the coding of still-image textures with high coding efficiency, as well as scalable spatial resolutions at fine granularity. The organization of a subband codec is similar to that of a DCT-based codec; the principal difference is that encoding and decoding are each broken out into a number of independent bands. Quality can be fixed at any desired value by adjusting the compression and quantization parameters of the encoders for each band. Entropy coding and predictive coding are often used in conjunction with subband coding to achieve high compression performance. If one considers quality from the point of view of the rate-distortion curve, then, at any given bit rate, the quality can be maximized by distributing the bits such that the distortion is constant across all bands. In fixed bit allocation, a fixed number of bits is allocated, in advance, to each band's quantizer, based on the statistical characteristics of the band's signal. In contrast, adaptive bit allocation adjusts the bit count of each band according to the power of the band's signal. In this case, either the decoder of each subband must determine the bit count for inverse quantization using the same criterion as the encoder, or the bit count information must be transmitted along with the quantized signal; the method is therefore somewhat lacking in robustness.

Vector Quantization.

As opposed to scalar quantization, in which sample values are independently quantized one at a time, vector quantization (VQ) attempts to remove redundancy between sample values by collecting several sample values and quantizing them as a single vector. Since the input to a scalar quantizer consists of individual sample values, the signal space is a finite interval of the real number line. This interval is divided into several regions, and each region is represented in the quantized outputs by a
Figure 8. A two-band subband coding system, detailing the band-encoding and band-decoding blocks of Fig. 7.
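The alias-cancellation property above can be checked numerically. The sketch below uses the two-tap Haar filter pair, an assumed simplest-possible choice (practical codecs use longer filters), and shows that two-band analysis followed by synthesis reconstructs the input exactly:

```python
import math

def haar_analyze(x):
    """Split an even-length signal into low- and high-band halves (Haar pair)."""
    s = math.sqrt(2.0)
    low = [(x[2 * k] + x[2 * k + 1]) / s for k in range(len(x) // 2)]
    high = [(x[2 * k] - x[2 * k + 1]) / s for k in range(len(x) // 2)]
    return low, high

def haar_synthesize(low, high):
    """Invert haar_analyze exactly: the aliasing terms cancel."""
    s = math.sqrt(2.0)
    x = []
    for l, h in zip(low, high):
        x.append((l + h) / s)   # even sample
        x.append((l - h) / s)   # odd sample
    return x
```

For a smooth input, most of the energy lands in the low band, which is what makes the octave splitting of Fig. 9(b) attractive.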
Figure 9. Subband splitting patterns in two-dimensional frequency domain: (a) uniform split (8 × 8); (b) octave split.
single value. The input to a vector quantizer is typically an n-dimensional vector, and the signal space is likewise an n-dimensional space. To simplify the discussion, consider only the case where n = 2. In this case, the input to the quantizer is the vector xj , which corresponds to the pair of samples (s1j , s2j ). To perform vector quantization, the signal space is divided into a finite number of nonoverlapping regions, and a single vector to represent each region is determined. When the vector xj is input, the region containing xj is determined, and the representative vector for that region, yj , is output. This concept is shown in Fig. 10. If we phrase the explanation explicitly in terms of encoding and decoding, the encoder determines the region to which the input xj belongs and outputs j, the index value which represents the region. The decoder receives this value j, extracts the corresponding vector yj from the representative vector set, and outputs it. The set of representative vectors is called the codebook. The performance of vector quantization is evaluated in the same manner as for other schemes, that is, by the relationship between the encoding rate and the distortion. The encoding rate R per sample is given by the following equation
R = ⌈log2 N⌉ / K

where K is the vector dimensionality and N is the number of quantization levels; the notation ⌈x⌉ represents the smallest integer greater than or equal to x (the "ceiling" of x). We define the distortion as the distance between the input vector xj and the output vector yj. In video encoding, the square of the Euclidean distance is generally used as a distortion measure, because it makes analytic design of the vector quantizer for minimal distortion more tractable. However, it is not necessarily the case that the subjective distortion perceived by a human observer coincides with the squared distortion.

Figure 10. An example of VQ with two-dimensional vectors. To perform VQ, the signal space is divided into a finite number of nonoverlapping regions, and a single vector is used to represent all vectors in each region.

To design a high-performance vector quantizer, the representative vectors and the regions they cover must be chosen to minimize the total distortion. If the input vector probability density function is known in advance, and the vector dimensionality is low, it is possible to perform an exact optimization. In an actual application, however, it is rare for the input vector probability density to be known in advance. The well-known LBG algorithm is widely used for designing vector quantizers adaptively in this situation (9). LBG is a practical algorithm that starts out with some reasonable codebook and, by iterating the determination of regions and representative vectors, converges on a better codebook.

Figure 11 shows the basic structure of an image codec based on vector quantization. The image is partitioned into M-pixel blocks, which are presented, one at a time, to the VQ encoder as the M-dimensional vector xj. The encoder locates the closest representative vector in its prepared codebook and transmits that vector's index. The decoder, which need only perform a simple table lookup in the codebook to output the representative vector, is an extremely simple device. The simplicity of the decoder makes VQ coding very attractive for distribution-type video services. VQ coding, combined with other coding methods, has been adopted in many high-performance compression systems.

Table 1 shows examples of coding and compression techniques that are applicable in multimedia applications, classified as entropy, source, and hybrid coding. Hybrid compression techniques combine well-known algorithms and transformation techniques that can be applied to multimedia systems. To clarify how the entropy, source, and hybrid schemes fit together, consider the typical sequence of operations, shown in Fig. 5, that is performed in the compression of still images and video sequences. The following four steps describe the compression of one image:

1. Preparation includes analog-to-digital conversion and generating an appropriate digital representation of the information. An image is divided into blocks of 8 × 8 pixels and represented by a fixed number of bits per pixel.

2. Processing is the first step of the compression process that makes use of sophisticated algorithms. A transformation from the time domain to the frequency domain can be performed by use of the DCT.
In the case of motion video compression, interframe coding uses a motion vector for each 16 × 16 macroblock or 8 × 8 block.

3. Quantization processes the results of the previous step. It specifies the granularity of the mapping of real numbers onto integers, which results in a reduction of precision. In a transformed domain, the coefficients are distinguished according to their significance; for example, they could be quantized using a different number of bits per coefficient.

4. Entropy encoding is usually the last step. It compresses a sequential digital data stream without loss. For example, a sequence of zeros in a data stream can be compressed by specifying the number of occurrences followed by the zero itself.

In the case of vector quantization, a data stream is divided into blocks of n bytes each. A predefined table contains a set of patterns; for each block, the table entry with the most similar pattern is identified. Each pattern in the table is associated with an index. Such a table can be multidimensional, in which case the index will be a vector. A decoder uses the same table to generate an approximation of the original data stream.

In the following sections, the most relevant work of the standardization bodies concerning image and video coding is outlined. In the framework of the International Organization for Standardization (ISO/IEC JTC1), three subgroups were established in May 1988: the Joint Photographic Experts Group (JPEG), working on coding algorithms for still images; the Joint Bilevel Image Experts Group (JBIG), working on progressive bilevel coding algorithms; and the Moving Picture Experts Group (MPEG), working on the representation of motion video. In the International Telecommunication Union (ITU), H.261 and H.263 were developed for video conference and telephone applications. The results of these standardization activities are presented next.
JPEG

The ISO 10918-1 JPEG International Standard (1992), also published as ITU-T Recommendation T.81, standardizes the compression and decompression of still natural images (4). JPEG provides the following important features:
JPEG implementation is independent of image size and is applicable to any image and pixel aspect ratio.

Color representation is independent of the particular implementation.

JPEG is intended for natural images, but the image content can be of any complexity, with any statistical characteristics.

The encoding and decoding complexities of JPEG are balanced, and both can be implemented in software.

Sequential decoding (slice-by-slice) and progressive decoding (refinement of the whole image) are possible.

A lossless, hierarchical coding of the same image at different resolutions is supported.

The user can select the quality of the reproduced image, the compression processing time, and the size of the compressed image by choosing appropriate individual parameters.

The key steps of JPEG compression are the DCT (8 × 8), quantization, zig-zag scan, and entropy coding. Both Huffman coding and arithmetic coding are entropy-coding options in JPEG. JPEG decompression simply reverses the compression process. A fast coding and decoding of still images, also usable for video sequences, is known as Motion JPEG. Today, JPEG software packages, together with specific hardware support, are available in many products.

ISO 11544 JBIG is specified for lossless compression of binary and limited bits/pixel images (4). The basic structure of the JBIG compression system is an adaptive binary arithmetic coder. The arithmetic coder defined for JBIG is identical to the arithmetic-coder option in JPEG.
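Two of these JPEG steps, table-driven quantization and run-length coding of zero runs, are small enough to sketch in Python. The table values below are arbitrary illustrations, not the default JPEG luminance table:

```python
def quantize(coeffs, qtable):
    """Divide each DCT coefficient by its table entry and round (the lossy step)."""
    return [[round(c / q) for c, q in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, qtable)]

def run_length(seq):
    """Encode a scanned coefficient sequence as (zero_run, value) pairs."""
    pairs, run = [], 0
    for v in seq:
        if v == 0:
            run += 1            # count a run of zeros
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run, 0))  # trailing zeros, akin to an end-of-block code
    return pairs
```

After quantization, most high-frequency coefficients become zero, which is precisely what makes the zig-zag scan followed by run-length coding effective.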
Figure 11. The basic structure of a VQ codec.
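A compact sketch of the VQ encoder/decoder loop and an LBG-style training iteration follows (Python; initializing the codebook from the first training vectors and running a fixed iteration count are simplifications of the algorithm in Ref. 9):

```python
def nearest(codebook, v):
    """Encoder: index of the representative vector closest to v (squared error)."""
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(codebook[j], v)))

def lbg(vectors, levels, iters=20):
    """LBG-style training: alternate region assignment and centroid update."""
    codebook = [tuple(v) for v in vectors[:levels]]  # simplistic initialization
    for _ in range(iters):
        groups = [[] for _ in range(levels)]
        for v in vectors:                    # assign each vector to its region
            groups[nearest(codebook, v)].append(v)
        codebook = [tuple(sum(c) / len(g) for c in zip(*g)) if g else cb
                    for g, cb in zip(groups, codebook)]  # centroid update
    return codebook

# Encoding transmits only nearest(codebook, x); decoding is codebook[index].
```

On training data with two well-separated clusters, the iteration converges to the two cluster centroids, illustrating how regions and representative vectors co-adapt.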
H.261 AND H.263

ITU Recommendations H.261 and H.263 (6) are digital video compression standards, developed for video conferencing and videophone applications, respectively. Both were developed for real-time encoding and decoding; for example, the maximum signal delay of compression plus decompression for H.261 is specified to be 150 ms, to meet the end-to-end delay of the targeted applications. Unlike JPEG, H.261 specifies a very precise image format. Two resolution formats, each with an aspect ratio of 4:3, are specified. The so-called Common Intermediate Format (CIF) defines a luminance component (Y) of 288 lines, each with 352 pixels. The chrominance components (Cb and Cr) each have a resolution of 144 lines and 176 pixels per line to fulfill the 2:1:1 requirement. Quarter-CIF (QCIF) has exactly half the CIF resolution, that is, 176 × 144 pixels for the luminance and 88 × 72 pixels for the other components. All H.261 implementations must be able to encode and decode QCIF.

In H.261 and H.263, data units of size 8 × 8 pixels are used for the representation of the Y as well as the Cb and Cr components. A macroblock is the result of combining four Y blocks with one block each of the Cb and Cr components. A group of blocks is defined to consist of 33 macroblocks; therefore, a QCIF image consists of three groups of blocks, and a CIF image comprises twelve groups of blocks.

Two types of pictures are considered in H.261 coding: I-pictures (or intraframes) and P-pictures (or interframes). For I-picture encoding, each macroblock is intracoded; that is, each block of 8 × 8 pixels in a macroblock is transformed into 64 coefficients by use of the DCT and then quantized. The quantization of dc coefficients differs from that of ac coefficients. The next step is to apply entropy encoding to the dc and ac parameters, resulting in a variable-length encoded word. For P-picture encoding, the macroblocks are either MC + DCT coded or intracoded. The prediction for MC + DCT coded macroblocks is determined by comparing macroblocks of the previous image with those of the current image. Subsequently, the components of the motion vector are entropy encoded by use of a lossless variable-length coding system. To improve the coding efficiency for low bit-rate applications, several new coding tools are included in H.263; among them are the PB-picture type and overlapped motion compensation.
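The motion-vector extraction by block matching used in MC + DCT coding and in H.261 P-pictures can be sketched as an exhaustive search (Python; the ±2-pixel window and the sum-of-absolute-differences cost are illustrative choices, not mandated by any standard):

```python
def sad(a, b):
    """Sum of absolute differences between two equal-size pixel blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def block(frame, r, c, n):
    """Extract the n x n block whose top-left corner is at (r, c)."""
    return [row[c:c + n] for row in frame[r:r + n]]

def motion_search(prev, cur, r, c, n=8, w=2):
    """Full search in a +/-w window: best motion vector for the block at (r, c)."""
    best, best_cost = (0, 0), sad(block(cur, r, c, n), block(prev, r, c, n))
    for dr in range(-w, w + 1):
        for dc in range(-w, w + 1):
            pr, pc = r + dr, c + dc
            if 0 <= pr <= len(prev) - n and 0 <= pc <= len(prev[0]) - n:
                cost = sad(block(cur, r, c, n), block(prev, pr, pc, n))
                if cost < best_cost:
                    best, best_cost = (dr, dc), cost
    return best, best_cost
```

When the current frame is simply the previous frame shifted, the search finds that shift and a zero-cost difference, which is why the motion-compensated residual compresses so well.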
MPEG

The ISO/IEC JTC1/SC29/WG11 MPEG working group has produced three specifications, ISO 11172 MPEG-1, ISO 13818 MPEG-2, and ISO 14496 MPEG-4 (8), for the coding of combined video and audio information. MPEG-1 is intended for image resolutions of approximately CIF or SIF (360 × 240) and bit rates of about 1.5 Mbit/s for both video and audio. MPEG-2 is specified for higher resolutions (including interlaced video) and higher bit rates (4 Mbit/s to 15 Mbit/s, or more). MPEG-4 was originally targeted at very low bit-rate coding applications; the targeted applications were broadened after MPEG-4 compression was found to be effective over a wide range of bit rates. In addition, a completely new concept of encoding a scene as separate "AV" objects was developed in MPEG-4.

The MPEG-1, -2, and -4 specifications each comprise three major parts: Part 1, Systems; Part 2, Video; and Part 3, Audio. The Systems part specifies a system coding layer for combining coded video and audio, and it also provides the capability of combining private data streams and streams that may be defined at a later date. The specification describes the syntax and semantic rules of the coded data stream. MPEG's system coding layer specifies a multiplex of elementary streams, such as audio and video, with a syntax that includes data fields directly supporting synchronization of the elementary streams. The system data fields also assist in the following tasks:

1. Parsing the multiplexed stream after a random access
2. Managing coded information buffers in the decoders
3. Identifying the absolute time of the coded information

The system semantic rules impose some requirements on the decoders; however, the encoding process is not specified in the ISO documents and can be implemented in a variety of ways, as long as the resulting data stream meets the system requirements. MPEG-1, -2, and -4 video use three types of frames (or pictures): Intra (I) frames, Predicted (P) frames, and Bidirectional (B) frames.
Similar to H.261, I-type frames are compressed using only information from within the frame, by the DCT algorithm. P frames are derived from the preceding I frames (or from other P frames) by using MC (predicting motion forward in time) plus DCT; P frames are compressed to approximately 60:1. Bidirectional, interpolated B frames are derived from the previous I or P frame and the future I or P frame; B frames are required to achieve the low average data rate. Field-block-based DCT and MC were developed in MPEG-2 for the efficient coding of interlaced video. MPEG-1 and -2 video can yield compression ratios of 50:1 to 200:1; for example, 50:1 compression yields broadcast quality at 6 Mbit/s, and 200:1 compression yields VHS quality at 1.2 Mbit/s to 1.5 Mbit/s. MPEG-2 can also provide high-quality video for High Definition Television at about 18 Mbit/s. Note that the MPEG video coding algorithms are asymmetrical: in general, it requires more computation to compress full-motion video than to decompress it. This is useful for applications where the signal is produced at one source but is distributed to many.

The MPEG standards also specify efficient compression algorithms for high-performance audio (8). For example, MPEG-1 audio coding uses the same sampling frequencies as Compact Disc Digital Audio and Digital Audio Tape, that is, 44.1 kHz and 48 kHz; additionally, 32 kHz is available, all at 16 bits. The three layers of the encoder are shown in Fig. 12; an implementation of a higher layer must be able to decode the MPEG audio signals of lower layers. Similar to the use of the two-dimensional DCT for video, a transformation into the frequency domain is applied for audio. The Fast Fourier Transform (FFT) is suitable for audio coding, and the spectrum is split into 32 noninterleaved subbands. For each subband, the amplitude of the audio signal is calculated; also, for each subband, the noise level is determined, simultaneously with the actual FFT, by using a psychoacoustic model. At a higher noise level a coarser quantization is performed, and at a lower noise level a finer quantization is applied. The quantized spectral portions of layers one and two are PCM-encoded, and those of layer three are Huffman-encoded. The audio coding can be performed with a single channel, two independent channels, or one stereo signal. In the definition of MPEG, there are two different stereo modes: two channels that are processed either independently or as joint stereo; in the case of joint stereo, MPEG exploits the redundancy between the two channels to achieve a higher compression ratio. Each layer specifies 14 fixed bit rates for the encoded audio data stream, which, in MPEG, are addressed by a bit rate index; the minimal value is always 32 kbit/s. The layers support different maximal bit rates: layer one allows a maximal bit rate of 448 kbit/s, layer two 384 kbit/s, and layer three 320 kbit/s. For layers one and two, a decoder is not required to support a variable bit rate.
In layer three, a variable bit rate is specified by switching the bit rate index. For layer two, not all combinations of bit rate and mode are allowed:
32 kbit/s, 48 kbit/s, 56 kbit/s, and 80 kbit/s are allowed only for a single channel.

64 kbit/s, 96 kbit/s, 112 kbit/s, 128 kbit/s, 160 kbit/s, and 192 kbit/s are allowed for all modes.

224 kbit/s, 256 kbit/s, 320 kbit/s, and 384 kbit/s are allowed for the stereo, joint stereo, and dual channel modes.

Figure 12. The key functional blocks of audio encoding in MPEG.

H.264/AVC/JVT

The latest video codec was developed as a joint effort between the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) (10, 11). The intent of this effort was to create a standard that would produce good video quality at half the bit rates that previous video standards, such as MPEG-2 and H.263, required. The techniques used in this new standard were constructed so that the standard would be applicable over a very wide range of bit rates and resolutions. The tradeoff compared with previous standards was an increase in complexity, which could be eased by Moore's law and other technological advances. The first version of this standard was completed in 2003. Additional extensions, known as the Fidelity Range Extensions (FRExt), were finished in 2004 to support higher-fidelity video coding beyond 8-bit 4:2:0 video (e.g., 10-bit, 12-bit, 4:2:2, and 4:4:4 video). The same syntax has been published by both organizations: the ITU-T H.264 standard and the ISO/IEC MPEG-4 Part 10 standard. Note that MPEG-4 Part 10 is not the same as MPEG-4 Part 2, the original video codec in the MPEG-4 suite of standards. This codec may also be referred to as the AVC (Advanced Video Coding) standard or the JVT standard, after the joint partnership between VCEG and MPEG known as the Joint Video Team. H.264/AVC contains many new features for more effective video compression than older standards. Some of these features include:
Multi-picture inter-picture prediction. Previous standards had limited inter-picture prediction to one (for P-pictures) or two (for B-pictures) reference pictures. H.264/AVC allows up to 16 reference pictures to be used. In addition, there are fewer restrictions on the pictures that can be used for prediction. For example, in MPEG-2, B-pictures were not allowed to be used as reference pictures for the prediction of other pictures; this restriction is not present in H.264/AVC.

Variable block-size motion compensation. Block sizes ranging from 4 × 4 pixels to 16 × 16 pixels can be chosen to match the size of objects and regions in the video content.

Motion compensation using increased fractional-pixel precision. While half-sample precision was used in MPEG-1, MPEG-2, and H.263, quarter-sample precision is used for luma and eighth-sample precision for chroma in H.264/AVC.

Spatial prediction from neighboring blocks for intracoding. For example, only DC coefficients were predicted in MPEG-2; in H.264/AVC, spatial prediction using neighboring blocks is performed for AC coefficients as well.

An exact integer transform, similar to the DCT, specified to allow exact decoding. Previous standards specified approximations to the ideal DCT, which could result in drift when the encoder and decoder implementations differed.

An in-loop deblocking filter to reduce the blocking artifacts common to DCT-based compression algorithms.

Context-adaptive binary arithmetic coding and context-adaptive variable-length coding, which are more efficient than previous entropy coding methods.

For the evaluation of video, image, and audio quality, subjective criteria are often used. The subjective criteria employ rating scales such as goodness scales and impairment scales. A goodness scale may be a global scale or a group scale. The overall goodness criterion rates perceptual quality on a scale ranging from excellent to unsatisfactory; a training set is used to calibrate such a scale. The group goodness scale is based on comparisons within a set of data. The impairment scale rates an image, video, or audio sequence on the basis of the level of degradation present when compared with a reference image, video, or audio sequence. It is useful in applications such as video coding, where the encoding process might introduce degradation in the output images.

BIBLIOGRAPHY

1. N. S. Jayant and P. Noll, Digital Coding of Waveforms, Englewood Cliffs, NJ: Prentice-Hall, 1984.
2. K. R. Rao and P. Yip, Discrete Cosine Transform, San Diego: Academic Press, 1990.
3. A. K. Jain, Fundamentals of Digital Image Processing, Englewood Cliffs, NJ: Prentice-Hall, 1989.
4. W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard, New York: Van Nostrand Reinhold, 1993.
5. J. W. Woods (ed.), Subband Image Coding, Boston: Kluwer, 1991.
6. K. Jack, Video Demystified, 2nd ed., San Diego: HighText Interactive, 1996.
7. T. M. Cover and J. A. Thomas, Elements of Information Theory, New York: Wiley, 1991.
8. B. G. Haskell, A. Puri, and A. N. Netravali, Digital Video: An Introduction to MPEG-2, New York: Chapman & Hall, 1997.
9. A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Boston: Kluwer, 1992.
10. G. J. Sullivan and T. Wiegand, "Video Compression: From Concepts to the H.264/AVC Standard," Proceedings of the IEEE, vol. 93, no. 1, December 2004, pp. 18–31.
11. A. Puri, X. Chen, and A. Luthra, "Video Coding Using the H.264/MPEG-4 AVC Compression Standard," Signal Processing: Image Communication, vol. 19, no. 9, October 2004, pp. 793–849.
WADE WAN
XUEMIN CHEN
Broadcom Corporation, Irvine, CA
Wiley Encyclopedia of Electrical and Electronics Engineering

Ethernet

Standard Article
Mart L. Molle, University of California, Riverside, Riverside, CA
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5307
Article Online Posting Date: December 27, 1999
Abstract

The sections in this article are: Ethernet Components; Ethernet Operation; Ethernet System Design Issues; IEEE 802.3 Ethernet Standard History.

Keywords: IEEE 802.3 Ethernet standard; 10BASE-T; fast Ethernet; gigabit Ethernet; Manchester, 4B/5B, and 8B/10B encoding schemes; Ethernet frame format; CSMA/CD; binary exponential backoff; link management and data encapsulation; Ethernet components (e.g., transceiver, repeater, network interface); autonegotiation of link capabilities; full-duplex operation and flow control
ETHERNET
Ethernet is a widely used local area network (LAN) technology that allows multiple end stations (such as desktop computers, servers, printers, and gateways to other networks) to exchange data among themselves within a single building or campus environment. The sending station segments the data into a sequence of frames, each of which is sent independently through the network to its destination(s). Every frame carries globally unique 48-bit source and destination addresses and other information, laid out according to a standard format; the length of a frame can vary between a minimum of 64 bytes and a maximum of 1518 bytes. By design, Ethernet provides only a "best effort" delivery service: the network will not reorder or duplicate frames, but there is no guarantee that a particular frame will reach its destination. Applications must run a reliable transport protocol, such as TCP, on top of the Ethernet service to guarantee delivery.

ETHERNET COMPONENTS

A typical Ethernet system is shown in Fig. 1. Each end station contains a network interface, which provides temporary storage for frames being sent to or received from the network, along with logic for executing the medium access control (MAC) algorithm, calculating the cyclic redundancy code (CRC) for error detection, and performing related functions. In many cases, the network interface is on a small card or printed circuit board that can be added to an end station when network connectivity is required. The network interface uses a transceiver to perform the actual data transmission and reception over the physical link. Initially, Ethernet used external transceivers, attached to the network interface by an attachment unit interface (AUI) cable. Today, however, the transceiver is often integrated with the network interface card. In some cases, interchangeable transceiver types may be plugged into the card through a medium independent interface (MII).
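The frame layout rules just described can be sketched as follows (Python; this builds only the DIX-style header plus padded payload, and omits the preamble, start-frame delimiter, and 4-byte CRC that complete the 64-byte minimum):

```python
def ethernet_frame(dst, src, ethertype, payload):
    """Frame body: 6-byte destination, 6-byte source, 2-byte type, padded data."""
    assert len(dst) == 6 and len(src) == 6      # 48-bit MAC addresses
    if len(payload) < 46:                       # pad so 14 + 46 + 4 (CRC) = 64
        payload = payload + bytes(46 - len(payload))
    assert len(payload) <= 1500                 # 14 + 1500 + 4 = 1518-byte max
    return dst + src + ethertype.to_bytes(2, "big") + payload
```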
A variety of link types have been defined, including coaxial cable, unshielded twisted pair (UTP) cabling, and both multimode and single-mode optical fiber. Coaxial cable is restricted to 10 Mbit/s operation. If UTP cabling meets Category 5 requirements, then operation at 100 Mbit/s can be supported via 100BASE-TX and it is expected that operation at
1 Gbit/s will be supported in the future via the 1000BASE-T standard now being developed. Optical fiber can support speeds up to 1 Gbit/s. Multiple transceivers can be connected to a single coaxial cable segment (up to 100 stations per segment of ‘‘thick’’ cable in 10BASE5, and up to 30 stations per segment of 50 Ω ‘‘thin’’ RG58 cable in 10BASE2). Coaxial cable segments are inherently half duplex because the electrical signals travel in both directions away from the transmitters along a single metallic conductor, passing the transceivers belonging to all other stations before being absorbed by terminating resistors at the ends of the segment. On the other hand, UTP and fiber-optic segments can support full-duplex transmission because each segment uses a separate signaling path to carry data in each direction between two transceivers at its endpoints. Larger networks are constructed by joining multiple segments together using active electronic devices that relay data from one of the attached segments to the other(s). A repeater (or ‘‘hub’’) immediately copies all bits arriving on each segment to all other segments, whether or not they are part of a valid frame. Segments joined together by repeaters form a single collision domain. If more than one end station within the given collision domain transmits frames at the same time, the data will get garbled together to form a collision, which cannot be understood by any of the receivers. A bridge (or ‘‘switch’’) copies frames that arrive on each segment to those segments that might contain their destination(s). Multicast, broadcast, and frames addressed to an unrecognized destination are copied to all other segments, while the rest are only copied to the segment that contains the destination station. Segments joined together by bridges form a single broadcast domain.
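The forwarding rule that distinguishes a bridge from a repeater can be sketched as follows. This is a simplified illustration: the port numbers and the learned-address table are hypothetical, and a real bridge also learns source addresses and ages out stale entries.

```python
def bridge_forward(dst, arrival_port, learned, ports):
    """Select output ports for a frame arriving on `arrival_port`.

    Known unicast destinations are copied only to the segment that
    contains the destination; multicast, broadcast, and unknown
    destinations are flooded to all other segments.
    """
    if not (dst[0] & 0x01) and dst in learned:   # I/G bit clear: unicast
        out = learned[dst]
        return [] if out == arrival_port else [out]
    return [p for p in ports if p != arrival_port]
```

A repeater, by contrast, would unconditionally copy every arriving bit to all other ports, which is why only bridges can isolate traffic between segments.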
ETHERNET OPERATION High-Level Service Interface Ethernet provides a service interface to each end station that consists of independent, asynchronously operating frame transmitter and receiver functions. These functions are invoked by the higher-layer protocols (such as TCP/IP) in the end station. To send some data, the station creates a higher-layer datagram and passes it to the Ethernet transmit function. After the transmit function returns the outcome (success or failure) of this request, the station can call the transmit function again with another frame. The transmit function begins by converting the higher-layer datagram to an Ethernet frame by adding a 64-bit preamble and start-frame delimiter,
[Figure 1. Typical Ethernet system: end stations with network interfaces connected by links to a repeater and a bridge; the segments joined by the repeater form a collision domain, and the larger network joined by the bridge forms a broadcast domain.]
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
the 48-bit source and destination addresses, and a 16-bit length/type value to the beginning and then adding a 32-bit frame check sequence computed with the CRC-32 polynomial to the end. Then the function attempts to transmit the frame over the outgoing link according to the rules of the medium access control (MAC) protocol. The transmit function reports successful delivery if it is able to transmit the entire frame without ever detecting a collision, and it reports failure if every one of the 16 allowable attempts to transmit the frame resulted in collisions. To receive some data, the station calls the Ethernet frame receiver function, and then it waits until the function returns with the next incoming datagram. Once activated, the receive function scans the incoming bit stream until it finds a valid preamble and start-frame delimiter, and then it gathers the rest of the incoming bits until the end of the transmission to form a candidate frame. If the candidate is shorter than the minimum frame length, then it is deemed to be a collision fragment and is discarded. If the candidate does not have a valid CRC, then it is discarded because of bit errors. Once a candidate frame has passed all the validation tests, its destination address is compared with the address of this end station and a list of recognized multicast and broadcast addresses. If there is an address match, then the receive function strips off the Ethernet encapsulation and returns the enclosed datagram to the station. Otherwise, the frame is discarded by the address filter and the receive function resumes looking for another frame in the incoming bit stream. Medium Access Control Ethernet uses a MAC algorithm called Carrier Sense Multiple Access with Collision Detection (CSMA/CD) to control the transmission of frames. CSMA/CD is a distributed algorithm for serializing the transmissions by multiple end stations over a shared channel.
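The encapsulation and receive-path validation steps described above can be sketched as follows. This is a simplified illustration: Python's zlib.crc32 uses the same CRC-32 polynomial as Ethernet, but the bit ordering of a real frame check sequence on the wire is not modeled, and the preamble and start-frame delimiter are omitted.

```python
import struct
import zlib

MIN_FRAME = 64     # bytes, minimum frame length including the FCS
MAX_FRAME = 1518   # bytes, maximum frame length including the FCS

def build_frame(dst, src, length_type, payload):
    """Encapsulate a higher-layer datagram (preamble/SFD not modeled)."""
    body = dst + src + struct.pack("!H", length_type) + payload
    fcs = zlib.crc32(body) & 0xFFFFFFFF   # same polynomial as the Ethernet CRC-32
    return body + struct.pack("<I", fcs)

def receive_frame(frame, my_addr, group_addrs):
    """Receive-path validation: length check, then CRC, then address filter."""
    if not MIN_FRAME <= len(frame) <= MAX_FRAME:
        return None                          # collision fragment or jabber
    body, (fcs,) = frame[:-4], struct.unpack("<I", frame[-4:])
    if zlib.crc32(body) & 0xFFFFFFFF != fcs:
        return None                          # bit errors: discard
    if body[:6] not in {my_addr, b"\xff" * 6} | set(group_addrs):
        return None                          # address filter: not for us
    return body[14:]                         # strip encapsulation, return datagram
```

Note that the order of the checks mirrors the text: a too-short candidate is rejected as a collision fragment before the CRC is even examined, and the address filter runs only on frames that are already known to be intact.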
When the end station requests the transmission of a frame, the MAC layer frame transmitter starts executing a sequence of trial-and-error steps, as determined by network activity that is reported by its transceiver. In particular, the transceiver sets the carrierSense control signal whenever there are any data present on the link, and it sets the collisionDetect control signal if it determines that the data originated from more than one transmitter. Originally, for the case of coaxial cables, both carrierSense and collisionDetect were obtained by analog logic. For example, in 10BASE5 networks, the signal levels used by the transceiver are offset from zero, so that each transmitter acts as a constant 41 mA current source acting on the two 50 Ω termination resistors connected in parallel. In this case, an analog voltmeter will read approximately 1 V when there are data present on the link, and a voltage threshold of approximately 1.5 V can be used to identify a collision. Thus, coaxial cable segments can support receive mode collision detection, which means that a transceiver can report collisions among third-party end stations. However, the Ethernet MAC algorithm does not use this feature. For other network types, such as UTP cabling and optical fiber, data are carried unidirectionally over a pair of physical links and both carrierSense and collisionDetect can be obtained using digital logic. A transceiver sets carrierSense if there are data present on the transmit or receive links, and it sets collisionDetect if data are present on both links.
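The analog collision-detection arithmetic for 10BASE5 works out as follows. This is an illustrative sketch: the 41 mA current and the 1.5 V collision threshold come from the text above, while the carrier threshold shown is a hypothetical placeholder.

```python
def coax_thresholds(num_transmitters, i_tx=0.041, r_term=50.0):
    """Each transmitter drives ~41 mA into the two 50-ohm terminating
    resistors in parallel (25 ohms), adding ~1 V per active transmitter."""
    volts = num_transmitters * i_tx * (r_term / 2.0)
    carrier_sense = volts > 0.4        # hypothetical carrier threshold
    collision_detect = volts > 1.5     # between one and two transmitters
    return round(volts, 3), carrier_sense, collision_detect
```

With one transmitter the voltage sits near 1 V (carrier present, no collision); with two it doubles to about 2 V, crossing the 1.5 V collision threshold.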
Each transmitter is required to leave a 96-bit interframe gap between the last previous data on the link and the start of its own transmission. This provides some time for the receivers to handle one frame before the arrival of the next. The interframe gap is controlled by the deferring control signal, which becomes true when start-of-carrier is detected but becomes false 96 bit-times after end-of-carrier is detected. If carrierSense returns during the interframe gap part 1, then it is assumed to be caused by analog effects inside the transceiver or the arrival of additional fragments within the same collision event, so the 96-bit interframe gap timer is restarted at the next end-of-carrier. However, if carrierSense returns during the interframe gap part 2, then it is ignored and the interframe gap timer continues to run. The switch from part 1 to part 2 can occur at any time during the first 64 bit-times of the interframe gap. The reason for having part 2 is to prevent a station whose interframe gap timer runs too fast from enjoying an unfair advantage over the other stations. Suppose station A’s clock ran 1% faster than station B’s clock. If station A were to transmit a burst of consecutive frames, then station B’s interframe gap timer would never expire and station B would be blocked from accessing the channel until station A had transmitted its entire burst. However, if station B were to transmit the same burst, then station A’s interframe gap time would expire after every frame, allowing it to compete with B for access to the channel. Under half-duplex mode, the MAC layer transmitter schedules the first attempt to transmit a frame immediately, if deferring is false, or as soon as the deferring control signal becomes false, otherwise. Obviously, this persistent strategy minimizes the delay between the request by the higher-layer protocol and the first attempt. 
However, it also means that the transmission of large frames is often followed by collisions on a busy network, as multiple stations wait for the same end-of-carrier event to trigger their respective attempts. If the entire frame is transmitted without triggering the collisionDetect control signal, then it is assumed that the frame reached its destination and its successful delivery is reported to the higher protocols at the station. Otherwise, the unsuccessful attempt must be aborted, and possibly rescheduled at a later time. When an attempt is aborted, the transmitter first finishes the 64-bit preamble and start-frame delimiter if necessary, and then it substitutes a 32-bit jam sequence in place of the remaining bits in the frame. In general, the jam can be any bit sequence as long as it is not intentionally chosen to be a valid CRC. Thus, in general a collision fragment starts with a normal preamble and start-frame delimiter (part of which may be garbled), followed by a string of bits that is at least as large as a 32-bit jam sequence and shorter than the minimum frame transmission time, so it can be easily discarded by the receivers using a length threshold. The slot time, which is defined as 512 bit times for networks operating at speeds below 1 Gbit/s and 4096 bit times (or 512 bytes) for Gigabit Ethernet, is a key parameter for half-duplex operation. In order for each transmitter to reliably detect collisions, the minimum transmission time for a complete frame must be at least a slot time, whereas the round-trip propagation delay (including both logic delays in all electronic components and the propagation delay in all links) must be less than a slot time. Thus, none of the affected stations could have finished its transmission before detecting the collision. Similarly, the receivers must be able to identify
(and discard) incoming collision fragments using the fact that their length, after removing the preamble and start-frame delimiter, is less than a slot time. As a result of all these requirements, it can be shown that the round-trip delay in a half-duplex Ethernet collision domain must be less than a slot time minus the jam length. To see this, suppose the round-trip delay between stations A and B is τ bit times, and station A starts transmitting at time 0, while station B waits until time τ/2 (when the data from A are about to arrive) before starting its own transmission. In this case, station B detects the collision immediately, but must still transmit a 96-bit minimum size collision fragment from τ/2 to τ/2 + 96. Meanwhile, station A detects the collision at τ, sends its jam signal, and stops transmitting at τ + 32. In this case, a receiver adjacent to station A would have received a total of τ + 96 bits between the start of A’s preamble and the end of B’s jam. Thus, after removing the 64-bit preamble and start-frame delimiter, the receiver would be left with a (τ + 32)-bit collision fragment. After each collision, the binary exponential backoff (BEB) algorithm is used to schedule the next retransmission (if any) of the affected frame. Let attempts be the number of times this frame has already been transmitted. If attempts = 16, then the frame is dropped with an excessive collision error. Otherwise, the BEB algorithm generates a random integer r in the range [0, 2^min(attempts, 10) − 1] and instructs the transmitter to sleep for r slot times before making its next attempt. By selecting backoff delays that are multiples of the slot time, colliding stations that pick different delays will not collide with each other again. BEB adjusts the range after each attempt, in an effort to provide an average of one distinct backoff slot per transmitter. Initially, the BEB algorithm only knows that its own station has a frame, so it (greedily) selects a zero backoff before the first attempt.
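The BEB schedule just described can be sketched directly. This is a minimal illustration; a real implementation counts the backoff in slot times tied to the hardware clock rather than returning a number.

```python
import random

MAX_ATTEMPTS = 16   # frame dropped after 16 failed transmissions
BACKOFF_CAP = 10    # range stops doubling at 2**10 = 1024 slots

def beb_backoff_slots(attempts):
    """Slot times to wait before the next attempt, given the number of
    transmissions already made (0 means first attempt: no backoff)."""
    if attempts >= MAX_ATTEMPTS:
        raise RuntimeError("excessive collision error: frame dropped")
    return random.randint(0, 2 ** min(attempts, BACKOFF_CAP) - 1)
```

Note that with attempts = 0 the range collapses to {0}, reproducing the greedy zero backoff before the first attempt.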
Thereafter, if a collision occurs after randomly selecting one of N slots, BEB raises its estimate to 2N active stations, based on the number of transmitters that selected its own slot. The doubling stops when N = 1024, since that is the maximum number of stations allowed in a single collision domain according to the Ethernet standard. It is important to note that each station’s transmit function runs an independent copy of the BEB algorithm, which it restarts from the beginning for each new frame. Consequently, the backoff delays selected by different stations following the same collision can become very lopsided, resulting in an unfairness problem known as the capture effect. For example, suppose stations A and B collide at time 0, and it is the first attempt by A to transmit its packet and the Kth attempt by B to transmit the other packet. In this case, A’s range of backoff delays is 2^(K−1) times smaller than B’s range, so A is very likely to retransmit in an earlier slot. Moreover, if A decides to transmit more frames, A’s first attempt to transmit a new frame will collide with B’s (K + 1)st attempt to transmit the same frame, so it is even more likely that A will retransmit its packet in an earlier slot. Because of the capture effect, a single station on a busy network can transmit large numbers of consecutive frames while many other stations are unable to transmit any packets. Half-duplex operation in Gigabit Ethernet is more complex to allow the slot time to be increased to 4096 bit times (because higher-speed operation increases the bandwidth–delay product on a fixed diameter network), without changing the existing 512-bit minimum frame size or one-frame-at-a-time
service interface. This was accomplished by introducing two new features. First, the minimum transmission time for a short frame was increased to 4096 bit times by appending extended carrier symbols to the end of the frame, if necessary. Extended carrier is a new code word that is neither a data bit nor an idle symbol. If a collision occurs before the end of the extended carrier has been sent, then the transmitter treats the attempt like a normal collision and retransmits the frame after a random backoff delay. Second, a technique called frame bursting was introduced to improve efficiency with short frames. In this case, once a station has successfully transmitted one frame, it is permitted to maintain control of the channel while it sends some additional frames by filling the interframe gap with extended carrier symbols instead of idle symbols. The station can keep adding more frames to the burst until it either runs out of frames or exceeds a total burst length of 65,536 bit times. Full-duplex mode can be used on point-to-point segments that use separate signaling paths for each direction, such as UTP cabling and optical fiber. Under full-duplex operation, the MAC layer transmitter and receiver functions operate independently of each other. Thus, the collisionDetect control signal is never true, and the carrierSense control signal is split into two signals: receiveDataValid indicates that data are present on the incoming link, while carrierSense indicates that data are present on the outgoing link and is used only to calculate the interframe gap. Full-duplex operation also includes an optional flow control method using pause frames. One end station can temporarily stop all traffic from the other end station (except control frames) by sending a pause frame. The duration of the pause (in multiples of a 512-bit time delay quantum) is controlled by a 16-bit parameter. Traffic resumes when the specified number of bit times has elapsed.
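The pause arithmetic is straightforward. This sketch assumes only what the text states (a 16-bit parameter counted in 512-bit quanta); the helper name is ours.

```python
PAUSE_QUANTUM_BITS = 512   # one pause quantum is a 512-bit time

def pause_seconds(pause_time, bit_rate_bps):
    """Convert a pause frame's 16-bit parameter into seconds of silence."""
    if not 0 <= pause_time <= 0xFFFF:
        raise ValueError("pause_time is a 16-bit field")
    return pause_time * PAUSE_QUANTUM_BITS / bit_rate_bps
```

For example, the maximum parameter (0xFFFF) on a 100 Mbit/s link pauses traffic for roughly a third of a second, while a parameter of zero resumes traffic immediately.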
If an additional pause frame arrives before the current pause time has expired, its parameter replaces the current pause time, so a pause frame with parameter zero allows traffic to resume immediately. Repeater Operation Repeaters may be attached to the same link types as end stations, but do not contain a MAC layer entity. Instead, a simple finite-state machine is used to control the forwarding of bits among its ports. If one port has incoming data, the data are sent to all other ports. If more than one port has incoming data, a jam signal is sent to all ports. Although some of the preamble bits may be lost while a transceiver is synchronizing with an incoming frame, the repeater is required to transmit the full 64-bit preamble and start-frame delimiter on each output port. Thus, the interframe gap between two consecutive frames can change every time they pass through a repeater. To limit the amount by which the interframe gap can shrink, a maximum of four repeaters is permitted in the path between any pair of end stations in the same collision domain. Repeaters also play a role in improving the robustness of large networks by automatically partitioning misbehaving ports from the rest of the network. For example, port partitioning will be triggered if an incoming transmission from a segment continues well beyond the maximum frame length (a condition known as ‘‘jabber’’), or if many consecutive transmissions to that segment result in collisions (an indication
that a single frame might be colliding with itself after traveling around a loop). Physical Layer Data Encoding Ethernet uses a variety of physical layer data encoding schemes, depending on the link speed and type. In particular, 10 Mbit/s Ethernet uses Manchester encoding, which is a two-level encoding scheme using a baud rate of twice the bit rate, to distribute both data bits and the clock from the sender to the receiver(s). Each data bit is represented by a pair of channel symbols: either ‘‘HI’’ followed by ‘‘LO’’ (i.e., nominally 0 V followed by −2.05 V on a coaxial cable link) to send a logical ‘‘0’’ data bit, or ‘‘LO’’ followed by ‘‘HI’’ to send a logical ‘‘1’’ data bit. In this way, the receiver(s) can easily synchronize with the data stream and recover the incoming data based on the direction of the transition at the midpoint of each bit. A transceiver with no data to send is required to generate an idle pattern consisting of a constant string of ‘‘HI’’ symbols (i.e., 0 V), which allows multiple transceivers to be connected to a single link without interfering with each other (except through collisions). However, this approach also means that the receivers cannot distinguish between an idle link and a broken link, which limits the fault detection capabilities of the system. Thus, the 10 Mbit/s fiber-optic inter-repeater link (FOIRL) introduced a Link Integrity Test to provide some fault detection on each of its dedicated signaling paths. The same Link Integrity Test was also used in 10BASE-T. An idle transmitter must send a short burst of energy called a link test pulse once every 16 ms. If the corresponding receiver has not seen either data or a link test pulse for at least 50 ms, then it declares the link to be broken. Manchester encoding is very simple and robust, but it is unsuitable for higher-speed operation because its high baud rate means that the link must carry frequencies much higher than the bit rate.
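The Manchester scheme just described can be illustrated in a few lines, with channel symbols shown as 'HI'/'LO' strings rather than voltages:

```python
def manchester_encode(bits):
    """Each data bit becomes two channel symbols: HI,LO for a 0 bit
    and LO,HI for a 1 bit, guaranteeing a mid-bit transition."""
    out = []
    for b in bits:
        out += ["HI", "LO"] if b == 0 else ["LO", "HI"]
    return out

def manchester_decode(symbols):
    """Recover bits from the direction of each mid-bit transition."""
    return [0 if pair == ("HI", "LO") else 1
            for pair in zip(symbols[0::2], symbols[1::2])]
```

The doubling is visible immediately: n data bits always produce 2n channel symbols, which is exactly why the baud rate is twice the bit rate.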
Thus, Fast Ethernet adopted the same 4B/5B encoding used by the Fiber Distributed Data Interface (FDDI) for transmission over Category 5 UTP (via the 100BASE-TX standard) and optical fiber (via the 100BASE-FX standard). Under 4B/5B, each 4-bit ‘‘nibble’’ of data sent over the MII is converted into a 5-bit code word for transmission over the physical link. Two signaling levels are used, and because five code bits carry every four data bits, the baud rate is 25% higher than the bit rate. Since there are twice as many 5-bit code words available compared to the number of distinct 4-bit data nibbles, the code words can be chosen in such a way that the encoder never outputs more than three consecutive logical ‘‘0’’ bits. Thus, since the transceiver indicates logical ‘‘0’’ and ‘‘1’’ bits by generating no change or a reversal of the current signal level, respectively, the receiver(s) will see at least one transition every 3 bit times, allowing them to synchronize with the incoming signal and recover both data bits and the clock. (100BASE-TX also randomizes the output of the encoder, so particular data sequences don’t create repeating signaling patterns that might cause high levels of electromagnetic interference.) Fast Ethernet also differs from the earlier designs by taking advantage (at the physical layer) of the fact that only point-to-point transmission over dedicated links will be used. In particular, the Link Integrity Test from 10BASE-T is not needed because all transmitters are always on. As soon as a device is turned on, its transceiver establishes a low-level connection with its peer at the other end of the link. If it has no data to
send, the transmitter sends one of the unused 4B/5B code words, which has been reserved to indicate that the link is in the idle state, in order to maintain clock synchronization. Additional code words are used as delimiters to mark the start and end of each MAC frame, so the preamble and start-frame delimiter are reduced to framing overhead that no longer serves any real purpose. Gigabit Ethernet yet again introduces some different encoding schemes. For transmission over optical fiber (via the 1000BASE-SX and 1000BASE-LX standards), the same 8B/10B encoding used in Fibre Channel is used. In this case, every 8-bit data byte transferred across the GMII is mapped into a 10-bit code word for transmission over the physical channel. Since there are four code words available for each data byte, only those code words with sufficient transitions to permit clock recovery at the receiver are used. Moreover, the 8B/10B code also maintains direct current (dc) balance over the long term in the following way. The running disparity of the data stream is defined as the difference between the total number of logical ‘‘1’’ bits and the number of logical ‘‘0’’ bits transmitted. If the running disparity is positive, then one set of code words will be used; otherwise, another set of code words will be used. At least half the bits in every code word belonging to the first set are logical ‘‘0’’ bits, while at least half the bits in every code word belonging to the second set are logical ‘‘1’’ bits. At the time of this writing, the details of the encoding scheme for 1000 Mbit/s operation over Category 5 UTP cabling (via the proposed 1000BASE-T standard) were still under development. It is expected that a full-duplex link will be created by transmitting simultaneously and in both directions over all four pairs in a UTP cable.
Each combination of an 8-bit data byte and an alternating clock bit is mapped into a code word that assigns one of five possible voltage levels to each of the four pairs in the UTP cable, giving a total of 5^4 = 625 possible code words to represent 512 different combinations. Some of the unused code words are used for ‘‘nondata’’ control symbols, such as idle, extended carrier, and start-frame or end-frame delimiters. Notice that each pair in the UTP cable need only carry symbols at the same baud rate as Fast Ethernet (i.e., 125 Mbaud). However, since a five-level encoding is more prone to errors than a two-level encoding, a trellis decoder is used to reduce the bit error rate. Autonegotiation of Link Capabilities Over time, the Ethernet standard has been updated many times to support advances in technology. Initially, these changes were made to allow new media types (e.g., ‘‘thin’’ coaxial cable, optical fiber, UTP cabling, etc.) to be used with the existing Ethernet standard for 10 Mbit/s operation. However, following the introduction of Fast Ethernet, the same Category 5 UTP cabling and RJ-45 connectors could be used for transmitting at 10 Mbit/s (according to the 10BASE-T standard) or at 100 Mbit/s (using any one of the 100BASE-TX, 100BASE-T4, or 100BASE-T2 standards). Eventually, transmission at 1 Gbit/s will also be possible, when the proposed 1000BASE-T standard is completed. In addition, UTP cabling is also compatible with full-duplex operation and its optional flow control scheme. Therefore, it is now possible to design a standards-compliant Ethernet device that plugs into Category 5 UTP cabling and operates according to one of more
than a dozen different ‘‘modes,’’ and many vendors have designed products that support several of these modes (e.g., 10/100 Mbit/s network interface cards). Unfortunately, this variety of operating modes also means that two standards-compliant Ethernet devices that are designed to use the same UTP cabling can’t communicate unless they also use the same mode. Since manual configuration is tedious and error-prone, the Ethernet standard includes a method that allows the attached devices to automatically select their highest common operating mode; this method is called Autonegotiation. When a device that supports autonegotiation is initialized, it transmits a fast link pulse (FLP) over the attached link every 16 ms. Each FLP looks like a normal link test pulse to existing 10BASE-T devices that do not support autonegotiation. However, the FLP is actually a burst of much shorter pulses that encodes a 16-bit ‘‘page’’ of information about the capabilities of the sending device. The encoding for a page consists of 33 pulse positions, each 125 µs apart. The odd pulse positions, which are always present, are used for clock recovery. The data are carried by the even pulse positions, where the presence of a pulse indicates a logical ‘‘1’’ data bit while absence of a pulse indicates a logical ‘‘0’’ data bit. The data in the page are used to advertise the set of capabilities supported by the sending device, and they also include an acknowledgment bit to indicate successful reception of the page being sent in the opposite direction (i.e., the data contained in at least three incoming FLPs were the same) and a next page bit to indicate there are more data to come after this page has been received.
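The FLP page encoding can be sketched as a mapping from the 16 data bits to the set of occupied pulse positions (positions numbered 1 through 33). This is a simplified model that ignores the pulse timing within the burst.

```python
def flp_pulse_positions(page_bits):
    """Odd positions 1, 3, ..., 33 always carry a clock pulse; a pulse
    in even position 2k means bit k of the 16-bit page is a logical 1."""
    if len(page_bits) != 16:
        raise ValueError("an autonegotiation page is 16 bits")
    pulses = set(range(1, 34, 2))                  # 17 clock pulses
    pulses.update(2 * (k + 1) for k, bit in enumerate(page_bits) if bit)
    return sorted(pulses)
```

An all-zero page therefore yields only the 17 clock pulses, which is also why an FLP burst degrades gracefully into something a plain 10BASE-T receiver can treat as link pulses.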
Once both devices have finished exchanging their respective pages of data, the link is established using their highest common operating mode (if one exists) using a fixed priority list that gives preference to full-duplex operation over half-duplex operation and also gives preference to higher speeds over lower speeds. After the link has been established, no more FLPs are sent: Higher-speed operation uses an active idle pattern without any link test pulses, and if 10BASE-T is selected, then conventional link test pulses will be used.
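Priority resolution can be sketched as a walk down an ordered list of modes. The list here is illustrative and abbreviated; the standard defines the full priority table, and the mode names are our own labels.

```python
PRIORITY = [                       # higher speed first; at each speed,
    "100BASE-TX full-duplex",      # full duplex beats half duplex
    "100BASE-TX half-duplex",
    "10BASE-T full-duplex",
    "10BASE-T half-duplex",
]

def resolve_mode(local, remote):
    """Return the highest common operating mode, or None if disjoint."""
    common = set(local) & set(remote)
    for mode in PRIORITY:
        if mode in common:
            return mode
    return None
```

Because both devices walk the same fixed list, they are guaranteed to converge on the same mode without any further negotiation.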
ETHERNET SYSTEM DESIGN ISSUES The design of Ethernet systems is limited by several factors. First, there is a maximum distance for each link type, which is determined by the given combination of transceiver type and medium. This distance limit is determined by the physical characteristics of the channel, as well as by how the signal quality changes as a function of distance due to such factors as attenuation (i.e., the signal level at the receiver is too low to be distinguished from background noise) and dispersion (i.e., successive code words blend together because of variability in the signal velocity and cannot be distinguished by the receiver). These factors determine the maximum length of the coaxial cable segments in 10BASE5 and 10BASE2, and they limit the maximum length of a UTP cable segment to 100 m regardless of the data rate. For optical fiber, these limits are quite generous for 10 Mbit/s and 100 Mbit/s operation, but become a significant constraint for 1 Gbit/s operation. Second, when half-duplex operation is used, the worst-case round-trip propagation delay must be restricted to less than one slot time in order for the CSMA/CD algorithm to function properly. This is why a 10 Mbit/s collision domain
can span a maximum diameter of approximately 2.5 km whereas a 100 Mbit/s collision domain is limited to only 205 m. When full-duplex operation is used, then the propagation delay need not be related to the slot time in any way. Third, when half-duplex operation is used, there can be at most four repeaters in the path between any pair of stations. This restriction comes about to prevent excessive shrinkage of the interframe gap, as explained above. And, finally, the network must be loop-free, whether it is constructed using repeaters and half-duplex links or using bridges (switches) and full-duplex links.
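The relationship between slot time and network diameter can be checked with a rough calculation. This gives an optimistic upper bound that ignores logic delays in repeaters and transceivers (which is why the practical limits quoted above are considerably smaller); the 2 × 10^8 m/s signal velocity is a typical assumption, not a figure from the text.

```python
def diameter_upper_bound_m(rate_mbps, signal_velocity=2.0e8):
    """Half-duplex constraint: the round-trip propagation delay must fit
    within one slot time, so diameter < slot_seconds * velocity / 2."""
    slot_bits = 4096 if rate_mbps >= 1000 else 512
    slot_seconds = slot_bits / (rate_mbps * 1e6)
    return slot_seconds * signal_velocity / 2.0
```

At 10 Mbit/s this bound is about 5 km of cable budget, roughly half of which is consumed by logic delays to give the practical 2.5 km limit; at 100 Mbit/s the bound shrinks to about 512 m (205 m in practice), and Gigabit Ethernet recovers diameter only by enlarging the slot time.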
IEEE 802.3 ETHERNET STANDARD The Institute of Electrical and Electronics Engineers (IEEE) Working Group 802.3 is responsible for defining an open vendor-independent standard for Ethernet. The Ethernet standard covers most of the functions of Layers 1 and 2 from the OSI reference model. However, these functions are divided into many sublayers by the Ethernet Reference Model, which is shown in Fig. 2 (1). The data link layer in the OSI reference model is divided into (1) logical link control, which is outside the scope of the Ethernet standard, and (2) medium access control (along with an optional MAC control sublayer to manage flow control on full-duplex links), which is the top sublayer within the Ethernet standard. All of the functions of the OSI physical layer are included in the Ethernet standard. The physical layer has been partitioned into sublayers in several ways, depending on the particular physical medium and/or data rate. Initially, the physical layer functions were separated into (1) physical layer signaling (PLS), which takes care of encoding and decoding the bit stream inside the end station, and (2) an external transceiver known as the medium attachment unit (MAU), which handles the actual transmission and reception of data over the link and generates the carrierSense and collisionDetect control signals. The communication between the PLS and MAU was defined by the attachment unit interface (AUI). However, as higher-speed operation was being developed, a different functional partitioning was adopted. First, Fast Ethernet introduced a new optional 4-bit-wide media independent interface (MII) to replace the bit-serial AUI, which defines a standard way to connect a removable transceiver. The MII can also be used for 10 Mbit/s operation, to simplify the design of equipment that can run at more than one speed. For Gigabit Ethernet, this interface was further changed into an 8-bit-wide Gigabit Media Independent Interface (GMII).
The MII or GMII sits between (a) a small reconciliation sublayer, which manages the interface on behalf of the MAC sublayer, and (b) the physical coding sublayer (PCS), which handles encoding and decoding of the data stream and generation of the carrierSense and collisionDetect control signals. Below the PCS is the physical medium attachment (PMA) sublayer, which contains Ethernet-specific functions for managing the physical transceiver, such as autonegotiation of speed and duplex settings and the operational status of the link. Finally, the lowest level functions were put into the physical-medium-dependent (PMD) sublayer so that the existing methods for transmission over optical fiber and UTP cable in the FDDI standard could be reused in Fast Ethernet. A similar partitioning was used
[Figure 2. Ethernet reference model. (From Ref. 1, with permission.) The figure maps the OSI reference model layers onto the CSMA/CD LAN sublayers: the higher layers and LLC, with the optional MAC control sublayer, sit above the MAC sublayer; below it, 1 and 10 Mbit/s operation uses PLS, AUI, and MAU, while 100 Mbit/s and 1000 Mbit/s operation use a reconciliation sublayer, the MII or GMII, and the PCS, PMA, and PMD, each stack terminating at the MDI and the medium. Key: AUI = attachment unit interface; MDI = medium dependent interface; MII = media independent interface; GMII = gigabit media independent interface; MAU = medium attachment unit; PLS = physical layer signaling; PCS = physical coding sublayer; PMA = physical medium attachment; PHY = physical layer device; PMD = physical medium dependent.]
in Gigabit Ethernet, which uses a PMD derived from Fibre Channel. In addition to the many sublayers that define the Ethernet functional specifications, the Ethernet standard also defines several groups of managed objects that provide a uniform set of attributes for getting information about the current status of the device (e.g., whether it is currently operating in half-duplex or full-duplex mode), retrieving data from cumulative activity counters (e.g., the number of octets of data sent or received), or setting operating parameters (e.g., whether or not it is set to promiscuously receive all incoming frames). These managed objects can be accessed through the Simple Network Management Protocol (SNMP), once the appropriate objects have been defined in the Management Information Base (MIB) associated with the device. HISTORY Development of Ethernet began at the Xerox Palo Alto Research Center in 1973 with a prototype that operated at a data rate of 3.94 Mbit/s. An overview of this system was published by Metcalfe and Boggs (2) in 1976. By 1980, a commercial version of Ethernet, known as Ethernet version 2, had been jointly developed by Digital Equipment Corporation, Intel, and Xerox. Ethernet version 2 introduced a number of changes, including (1) larger values for the minimum frame size and slot time, and (2) an increase in the data rate to 10 Mbit/s. The Ethernet version 2 ‘‘blue book’’ specification (3) formed the basis of the original IEEE 802.3 standard for Ethernet, published in 1983. However, the 802.3 standard introduced some technical changes, notably the replacement of the 16-bit ‘‘type field’’ by a 16-bit ‘‘length field’’ in the frame header. This distinction was eventually removed when a frame type was defined for the flow control pause frame in 1996. Because support for new media types was added to the 802.3 standard, a new naming convention was adopted (4).
PHY = Physical layer device
PLS = Physical layer signaling
PCS = Physical coding sublayer
PMA = Physical medium attachment
PMD = Physical medium dependent
The original version, designed to operate over ‘‘thick’’ coaxial cable, became 10BASE5, indicating that it operated at 10 Mbit/s, employed baseband signaling, and had a maximum segment length of 500 m. In 1985, the standards for 10BASE2 (which defines Ethernet operation over ‘‘thin’’ RG-58 coaxial cable segments up to 185 m long) and 10BROAD36 (which defines Ethernet operation over broadband CATV systems in which the distance between the head end and the stations is at most 1.8 km) were approved. An even lower-cost option called 1BASE5, based on the 1 Mbit/s AT&T Starlan design, was approved in 1987 but never became very popular. In 1987, 10 Mbit/s fiber-optic transceivers appeared in a limited way when a fiber-optic inter-repeater link (FOIRL) was approved, and they appeared more generally in 1993. 10BASE-T, which defines 10 Mbit/s operation over UTP cabling, was approved in 1990.

Fast Ethernet, which operates at a data rate of 100 Mbit/s, includes a variety of transceiver types. 100BASE-TX (for operation over two pairs of Category 5 UTP cabling), 100BASE-FX (for operation over optical fiber), and 100BASE-T4 (for operation over four pairs of Category 3 UTP cabling) were approved in 1995 and published as 802.3u (5). In 1996, 100BASE-T2 (for operation over two pairs of Category 3 UTP cabling) was published as 802.3y. However, neither 100BASE-T4 nor 100BASE-T2 has achieved widespread popularity. At the same time, full-duplex operation was defined in 802.3x, which was approved in 1996. Gigabit Ethernet, which operates at a data rate of 1000 Mbit/s, also includes a number of subtypes. The 802.3z standard, expected to be approved in 1998, includes 1000BASE-SX (a short-wavelength laser, suitable for limited distances over multimode fiber), 1000BASE-LX (a long-wavelength laser, suitable for moderate distances over multimode fiber and much longer distances over single-mode fiber), and 1000BASE-CX (a short-haul copper jumper cable) (6).
In addition, the development of 1000BASE-T (for transmission over distances of up to 100 m using four pairs of Category 5 UTP
cable) is well underway and should be published as 802.3ab sometime in 1999.

BIBLIOGRAPHY

1. IEEE 802.3 CSMA/CD (ETHERNET) Working Group Web Site [Online]. Available: http://grouper.ieee.org/groups/802/3/
2. R. M. Metcalfe and D. R. Boggs, Ethernet: Distributed packet switching for local computer networks, Commun. ACM, 19 (7): 395–404, 1976.
3. Digital Equipment Corp., Intel Corp., and Xerox Corp., The Ethernet: A local area network data link layer and physical layer specifications, September 30, 1980.
4. ANSI/IEEE Std 802.3, Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications, 5th ed., 1996.
5. IEEE Std 802.3u-1995, Media access control (MAC) parameters, physical layer, medium attachment units, and repeater for 100 Mb/s operation, type 100BASE-T, 1995.
6. IEEE Draft P802.3z/D4, Media access control (MAC) parameters, physical layer, repeater and management parameters for 1000 Mb/s operation, December 1997.
MART L. MOLLE University of California, Riverside
ETHERNET. See LOCAL AREA NETWORKS.
Wiley Encyclopedia of Electrical and Electronics Engineering

Group Communication

Standard Article
P. M. Melliar-Smith and L. E. Moser, University of California, Santa Barbara, CA
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5335
Article Online Posting Date: December 27, 1999
Abstract | Full Text: HTML PDF (172K)
Abstract

The sections in this article are: Formal Models and Definitions; Message Delivery Algorithms; Flow Control Algorithms; Fault Detectors; Membership Algorithms; Future Directions.
GROUP COMMUNICATION

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

In distributed computer systems, several or many computers cooperate to perform an application, and information may be replicated on several computers. Replication of information may be required to provide fault tolerance or increased availability after a fault. Replication may also be used to reduce the time required to access the information by providing local or nearby copies of it. Much of the difficulty of programming distributed applications derives from the need to maintain the consistency of replicated information in the presence of asynchrony and faults. When the replicated information has become inconsistent, the application programming and/or human intervention required to restore consistency can be quite difficult. In general, it is easier to maintain consistency than to restore it.

Inconsistency can arise, for example, if a process holding one replica of the data fails to receive a message and thus does not apply an update to that replica. This is an example of an omission inconsistency, as shown in Fig. 1(a).

Inconsistency can also arise, as shown in Fig. 1(b), if one process creates a new database entry and transmits an instruction in a message broadcast to the other processes that they should also create that entry. A second process receives the message and generates an update for the new entry, communicating the update in a message broadcast to the other processes. If one of the processes holding a replica of the database receives the second message before the first, it may be unable to handle the message. This is an example of a causal ordering inconsistency.

Inconsistency can also arise when two or more processes try to claim a resource, as shown in Fig. 1(c). The requests for the resource are sent in broadcast messages to the multiple processes holding the resources, with the first claimant getting the resource. Multiple processes are needed to manage the resource to ensure continued operation if a process should fail. If those processes receive the messages in different orders, they may grant the resource to different requesters. This is an example of a total ordering inconsistency.

Figure 1. Examples of (a) omission inconsistency, (b) causal ordering inconsistency, and (c) total ordering inconsistency.

Group communication systems provide message ordering and delivery services that assist the application programmer in avoiding these inconsistencies.

• Reliable delivery of messages ensures that all messages broadcast to a group of processes are delivered to all members of the group, thus precluding omission inconsistencies. This also requires that messages are delivered exactly once.
• Causally ordered delivery of messages ensures that, if a message depends on a prior message, then no process receives that message before it receives the prior message. This precludes causal ordering inconsistency.
• Totally ordered delivery of messages ensures that any pair of messages delivered by two or more processes are delivered in the same order by those processes, thus precluding total ordering inconsistency.

These message delivery services are discussed further in the next section, titled ‘‘Formal Models and Definitions,’’ and algorithms for achieving these services are given in the section titled ‘‘Message Delivery Algorithms.’’

Group communication systems also provide membership services that maintain the membership of the group, add new processes to the group, remove departing or faulty processes from the group, and report changes in the group membership to the application. If the network partitions into multiple components with no communication among them, membership algorithms are confronted with conflicting objectives. If the application must maintain a single consistent state of its data, then the membership algorithm must form a single primary membership component. If, however, all processes must continue operation even when disconnected, then the membership algorithm must form multiple disconnected memberships. Group membership algorithms for primary component membership and for partitionable membership are described in the section titled ‘‘Membership Algorithms.’’

Group membership services depend on the detection of faulty processes. Unfortunately, in an asynchronous model of computation, it is impossible to distinguish between a process that has crashed and one that is merely slow.
Consequently, most group communication systems depend on unreliable fault detectors (1) that detect faulty processes but that are unreliable because they might also suspect a slow process to be faulty. Unreliable fault detectors are discussed in the section titled ‘‘Fault Detectors.’’ To enable a new member of the group to participate meaningfully in the group, it is necessary for a process that is already a member of the group to assemble and transfer state information to the new member. If a message has been processed before the state information is transferred, and if that same message is processed by the new process after it receives the state, an incorrect state can result. An incorrect state can also result if a message has not been processed before the state is assembled and transferred, but the new process determines that the message is an old message and thus ignores it. Consequently, processes must agree on which messages are delivered and processed before the membership change so that their effects are included in the transferred state, and which messages are delivered after the membership change so that they are processed by the new process. Virtual synchrony (2), discussed in the section titled ‘‘Formal Models and Definitions,’’ ensures that the processes agree on which messages precede a membership change and which follow it. Group communication systems can achieve high performance by exploiting available broadcast mechanisms at the physical and network layers in local- and wide-area networks.
Implementing broadcasts by multiple point-to-point messages results in poor performance. To achieve high performance, group communication systems must address the issues of buffer management and flow control. When several, or many, processes broadcast messages almost simultaneously, the communication medium and the buffers can quickly become exhausted, resulting in lost messages. Broadcasts to many destinations can also result in large numbers of acknowledgments returning to the source, again exhausting the available resources. The best group communication protocols carefully manage the flow of messages to reduce network contention and ensure that buffers are available at the destinations. Such systems deliver messages reliably to many destinations with performance as good as that of point-to-point protocols operating between a single source and a single destination. Fully integrated group communication protocols (3) address these issues in a unified fashion. Such systems are highly efficient and provide a comprehensive set of services at little or no extra cost. A standard set of services reduces the amount of skill required to exploit the protocol effectively. On the other hand, group communication toolkits (4) provide microprotocols that the user assembles into a protocol that is customized for the application. A custom protocol avoids the costs of services not needed by the application but, instead, incurs the costs of interfaces and mechanisms that are designed to work independently. Furthermore, the user may find it difficult to choose an appropriate set of microprotocols for the particular application.
FORMAL MODELS AND DEFINITIONS The models, definitions, and algorithms described here can be used for sets of processors or for sets of processes executing on multiple processors. In this article, we refer to processes but processors can be substituted throughout. A distributed system is a collection of processes that communicate by messages. In a group communication system, the processes are organized into groups and messages are broadcast to all members of the group. In contrast, messages that are multicast may be sent to some of the processes but not necessarily all of them. We distinguish between the terms receive and deliver. A process receives messages that were broadcast by the processes, and a process delivers messages to the application. Distributed systems can be classified as synchronous or asynchronous. In an asynchronous system, it is not possible to place a bound on the time required for a computation or for the communication of a message. Even though processes may have access to local clocks, those clocks are used only for local activities; they are not synchronized and are not used for global coordination. The advantage of the asynchronous model is that the algorithms can be designed to operate correctly without regard to the number of individual processes or the timing characteristics of the processes or of the communication medium. The disadvantage is that performance characteristics, such as message delivery latency, are necessarily probabilistic, and performance analysis and prediction are difficult. In contrast, the synchronous model requires that processes have access to local clocks that are synchronized across the system with a known bound on the skew between clocks, and
that computation and communication operations complete within specified periods of time. The synchronous model is particularly suited to hard real-time systems and has the advantage that algorithms are deterministic and less complex. The disadvantage is that conservative assumptions are required to approximate synchronous operation in the real world and the resulting system may be inefficient. The timed asynchronous model (5) closely resembles the asynchronous model but includes, in addition, a requirement that eventually there will exist an interval of stability within which computation and communication operations complete successfully within a specified time bound. For this model, the algorithms are generally similar to those for the asynchronous model but are simpler because they need to guarantee termination only within the stability interval. The delay until an interval of stability is reached will, however, probably be longer than the delay until termination of the asynchronous algorithm. The timed asynchronous model trades simpler algorithms against longer termination times.
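The unreliable fault detectors that these models rely on can be approximated by a simple heartbeat timeout, which illustrates why such a detector is inherently unreliable: a slow process is indistinguishable from a crashed one. A minimal sketch, with all class and method names hypothetical:

```python
class HeartbeatFaultDetector:
    """Unreliable fault detector: suspects any process whose last
    heartbeat is older than `timeout` seconds. A merely slow process
    may be suspected even though it has not crashed."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}   # process id -> time of last heartbeat

    def heartbeat(self, pid, now):
        # Called whenever a heartbeat message from `pid` is received.
        self.last_heard[pid] = now

    def suspects(self, now):
        # Processes not heard from within the timeout are suspected.
        return {p for p, t in self.last_heard.items()
                if now - t > self.timeout}

fd = HeartbeatFaultDetector(timeout=2.0)
fd.heartbeat("P1", now=0.0)
fd.heartbeat("P2", now=1.5)
print(fd.suspects(now=3.0))   # P1 exceeded the timeout; P2 has not
```

Shortening the timeout detects crashes faster but increases the chance of wrongly suspecting a slow process; this trade-off is exactly the unreliability discussed in the section titled ‘‘Fault Detectors.’’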
Fault Models

In any distributed system model, a process that is nonfaulty (correct) performs the steps of the algorithms according to the specifications. The behavior of a faulty process depends on the particular fault model adopted. In the fail-stop and crash models, a faulty process takes no actions (i.e., it sends no messages and ignores all messages that it receives). These two models differ in that a fail-stop process reports that it has failed, whereas a crash process does not. Both models typically assume that a faulty process never recovers or, if it recovers, is regarded as a new process. Other models allow a faulty process to be repaired and to be reconfigured into the system. When a process recovers, the process knows its process identifier and can retrieve the data it had written to persistent storage, such as disk, before it failed. In the more general Byzantine fault model, processes can exhibit arbitrary and even malicious behavior, such as generating incorrect messages or sending different messages to different processes that purport to be the same message. In general, it is more difficult to develop algorithms that are resilient to Byzantine faults than to fail-stop or crash faults.

Most fault models admit communication faults in the form of message loss, which is caused by corruption by the medium or buffer overflow in the intermediate switches or at the destinations. Some fault models also admit network partitioning faults, which split the system into two or more components so that processes in the disconnected components cannot communicate with one another. The synchronous model augments this fault model in that any computation or communication operation that does not complete within its specified time bound constitutes a fault. Similarly, any excessive skew between clocks constitutes a fault.

Impossibility Results

The problem of maintaining consistent message delivery and membership in a system subject to faults is related to the problem of achieving consensus in such a system. Fischer, Lynch, and Paterson (6) have shown that it is impossible to devise a deterministic algorithm that can guarantee to achieve consensus in finite time in an asynchronous distributed system, even if communication is reliable and only one process can crash. Chandra and Toueg (1) have shown, however, that consensus is possible in an asynchronous distributed system that is subject to crash faults if an unreliable fault detector is provided. Randomized algorithms can also be used to achieve consensus in an asynchronous distributed system that is subject to faults (7,8).

In practical systems with unreliable communication, even the reliable delivery of a single message cannot be guaranteed. There is also a nonzero probability that all of the processes will fail. Impossibility results should not, however, be regarded as a proof that asynchronous distributed systems cannot be built. Rather, they are a reminder that the algorithms must be robust against unfortunate sequences of asynchronous events and faults, so that message ordering and membership decisions can be reached in a reasonable time with a high probability.

Message Delivery Services
We now consider various message delivery services. The most basic type of message delivery service is unreliable message delivery, which provides a best-effort service with no guarantees of message delivery or of the order of message delivery. Unreliable message delivery is used for applications such as audio and video streaming. Many other applications, however, require one of the more stringent message delivery services, which are described here and shown in Fig. 2.

Reliable Delivery. Reliable message delivery requires that, if any nonfaulty (correct) process receives a message, then all nonfaulty processes eventually receive that message, possibly after multiple retransmissions. Reliable delivery can be achieved only probabilistically in the presence of an unreliable communication medium that repeatedly loses messages.

Source Ordered Delivery. Source ordered delivery, or First-In First-Out (FIFO) delivery, requires that messages from a particular source are delivered in the order in which they were sent. Source ordered delivery is appropriate for multimedia streaming and data distribution applications.

Causally Ordered Delivery. Causally ordered delivery (9) satisfies the following two properties:

• If process P sends message M1 before it sends message M2 and process Q delivers both messages, then Q delivers M1 before it delivers M2.
• If process P receives message M1 before it sends message M2 and process Q delivers both messages, then Q delivers M1 before it delivers M2.

Taking the transitive closure of this ‘‘delivers before’’ relation yields a partial order on messages. A partial order, denoted by ≺, satisfies the following properties:

• Antireflexive: M ⊀ M.
• Antisymmetric: If M1 ≺ M2, then M2 ⊀ M1.
• Transitive: If M1 ≺ M2 and M2 ≺ M3, then M1 ≺ M3.

Delivery of messages in causal order precludes causal ordering inconsistency and prevents anomalies in the processing of
data contained in the messages, but it does not alone suffice to maintain the consistency of replicated data.

Figure 2. Examples of four types of message delivery services: unreliable source ordered delivery, reliable source ordered delivery, reliable causally ordered delivery, and reliable group ordered delivery.

Group Ordered Delivery. Group ordered delivery requires that, if processes P and Q are members of a group G, then P and Q deliver the messages originated by the processes in G in the same total order. A reliable group ordered message delivery service helps to maintain the consistency of replicated data, but inconsistencies can still arise in interactions between groups.

Totally Ordered Delivery. Totally ordered delivery, also called atomic delivery, subsumes partially ordered delivery but requires in addition the property:

• Comparable: M1 ≺ M2 or M2 ≺ M1.

Thus, totally ordered delivery requires that, if process P delivers message M1 before it delivers message M2 and another process Q delivers both messages, then Q delivers M1 before it delivers M2. If P and Q are both members of the same process group and M1 and M2 are both sent to that group, then group ordered delivery would have sufficed to ensure this property. Typical applications, however, contain hundreds of process groups, and many messages are sent to multiple groups. Totally ordered delivery precludes total ordering inconsistency and is important where systemwide consistency across many
groups is required. Thus, more generally, totally ordered delivery implies that there are no cycles in the total order, because the total order is a partial order.

Message Stability. A message is stable at a process when that process has determined that all of the other processes in its current membership have received the message. This determination is typically based on acknowledgments of messages. When a process determines that a message has become stable, it can reclaim the buffer space used by the message because it will never need to retransmit that message again. The concept of stability of messages is quite distinct from stability of data, which requires that the data have been written to nonvolatile storage, such as a disk. Delivery of messages only when the messages have become stable is useful, for example, in transaction-processing systems where a transaction must be committed by all of the processes or none of them.

Membership Services

Maintaining the membership of the groups is an important part of group communication systems because algorithms may block if processes are faulty or cannot communicate with one another. An unreliable fault detector can be used to detect apparently faulty processes, to trigger the membership algorithm, and to ensure that the algorithm satisfies liveness requirements (i.e., that decisions will be made and that the system will continue to make progress). The membership algorithm removes apparently faulty processes from the membership and adds new or recovered processes into the membership.

Different group communication systems have adopted different formulations of the membership problem. In the primary component model (2), for each process group a single sequence of memberships must be maintained across the distributed system over time, as shown at the left of Fig. 3. In this model, the membership algorithm, upon successive invocations, yields a sequence of memberships over time. In contrast, the partitionable membership model (10) allows multiple disjoint memberships to exist concurrently.

Figure 3. The primary component membership model allows only a single component of a partitioned system to continue to operate, whereas the partitionable membership model allows continued operation in all components.

At opposite ends of the spectrum are two approaches to the partitionable membership problem: the maximal memberships approach and the disjoint memberships approach. In the maximal memberships approach, the memberships are precisely the maximal cliques, and a nonfaulty process may belong to several (perhaps many) concurrent memberships. In the disjoint memberships approach, concurrent memberships do not intersect (i.e., each nonfaulty process is a member of exactly one membership at a time, and any pair of processes in a membership can communicate). Thus, each membership is a clique, although not necessarily a maximal clique. The partitionable membership model with disjoint memberships is shown at the right of Fig. 3. Neither the disjoint memberships approach nor the maximal cliques approach is ideal, but it is not obvious how an intermediate approach would define the collection of memberships.

An algorithm that solves these membership problems must ensure that the processes in a membership reach agreement on the membership in a finite amount of time.
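The clique requirement of the disjoint memberships approach reduces to a simple pairwise check over a proposed membership. A minimal sketch, with the connectivity representation chosen for illustration:

```python
def is_valid_membership(members, can_communicate):
    """A membership is valid under the disjoint-memberships approach
    only if it is a clique: every pair of members can communicate.
    `can_communicate` is a symmetric predicate on process ids."""
    members = list(members)
    return all(can_communicate(p, q)
               for i, p in enumerate(members)
               for q in members[i + 1:])

# Partitioned network: P1 and P2 can talk; P3 is disconnected.
links = {("P1", "P2"), ("P2", "P1")}
reachable = lambda p, q: (p, q) in links
print(is_valid_membership({"P1", "P2"}, reachable))        # True
print(is_valid_membership({"P1", "P2", "P3"}, reachable))  # False
```

A membership algorithm would use such a check to reject candidate memberships that span a partition; finding *maximal* cliques, as the other approach requires, is a much harder combinatorial problem.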
The algorithm should also ensure that faulty processes are eventually removed from the membership and that nonfaulty processes are not removed capriciously so that a trivial membership is not installed when a larger membership could have been installed. Thus, the membership algorithms must ensure the following properties:
• Agreement: All processes in the membership agree on the membership set.
• Termination: A new membership must be formed within a finite amount of time.
• Nontriviality: The agreed upon membership should be appropriate, and nondegenerate if possible. The appropriateness of the membership may need to be determined by heuristics.

For the partitionable membership problem, several existing specifications admit algorithms that yield degenerate memberships by partitioning the membership into singletons even when larger memberships would have been possible. Specification of the partitionable membership problem is an open research topic.

Virtual Synchrony and Extended Virtual Synchrony

Virtual synchrony (2) ensures that view (configuration) changes occur at the same point in the message delivery history for all operational processes, as shown in Fig. 4. Processes that are members of two successive views must deliver exactly the same set of messages in the first view. A failed process that recovers can be readmitted to the system only as a new process. Thus, failed processes are not constrained as to the messages they deliver or their order, and messages delivered by a failed process have no effect on the system. If the system partitions, only processes in one component, the primary component, continue to operate; all of the other processes are deemed to have failed.

Extended virtual synchrony (11) extends the concept of virtual synchrony to systems in which all components of a partitioned system continue to operate and can subsequently remerge, and to systems in which failed processes can be repaired and can rejoin the system with stable storage intact. Two processes may deliver different sets of messages, when one of them has failed or when they are members of different components, but they must not deliver messages inconsistently.
In particular, if process P delivers message M1 before P delivers message M2, then process Q must not deliver message M2 before Q delivers message M1, even if the system has partitioned and P and Q can no longer communicate. Extended virtual synchrony eliminates gratuitous inconsistencies between processes that become disconnected by a partitioning fault. Interestingly, extended virtual synchrony can be guaranteed only if messages are born ordered, meaning that the relative order of any two messages is determined directly from the messages, as broadcast by their sources.

Figure 4. When a membership change brings a new process into the group, the current state of the existing members must be transferred to the new process. Virtual synchrony ensures that all processes agree on which messages precede the transfer of state to a new process and which follow that transfer.

MESSAGE DELIVERY ALGORITHMS

We now consider algorithms that provide different types of message delivery, as defined in the section titled ‘‘Formal Models and Definitions.’’

Reliable Delivery Algorithms

Reliable delivery algorithms typically depend on underlying physical or network layer broadcast or multicast mechanisms that provide only an unreliable best-effort service in which messages may be lost. Algorithms that provide a reliable delivery service aim to ensure that every message is delivered to all of the intended destinations. Error detection and retransmission are most often used to provide reliable delivery.

Traditional broadcast and multicast algorithms exploit a positive acknowledgment strategy to provide reliable delivery. On receipt of a message, a destination transmits a positive acknowledgment to the source. The source retransmits the message repeatedly until it has received a positive acknowledgment from every destination. Positive acknowledgment algorithms are effective in improving reliability, but they suffer from two problems. First, large numbers of acknowledgments must be transmitted, even when the underlying mechanisms are quite reliable and few messages need to be retransmitted. Second, if there are many destinations, the source must receive and process many acknowledgments for each message that it transmits, resulting in substantial processing overhead, as shown in Fig. 5. This is called the ack implosion problem.

Figure 5. If a process transmits a message to many destinations, it may suffer from an implosion of acknowledgments.

Consequently, most reliable broadcast and multicast algorithms use negative acknowledgments to achieve reliable delivery. The source transmits messages with sequence numbers. Destinations detect missing messages by gaps in the sequence numbers and transmit, to the source, negative acknowledgments that list the missing messages.
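Gap detection at a receiver can be sketched as follows; this is a minimal model of one source-to-destination flow, with the message representation chosen for illustration:

```python
class NackReceiver:
    """Detects missing messages by gaps in per-source sequence numbers
    and reports them as a negative acknowledgment (a list of missing
    sequence numbers), as in NACK-based reliable delivery."""

    def __init__(self):
        self.next_expected = 0   # next in-order sequence number
        self.out_of_order = {}   # buffered messages awaiting a gap fill
        self.delivered = []

    def receive(self, seq, payload):
        # Returns the list of sequence numbers to NACK (may be empty).
        if seq < self.next_expected:
            return []            # duplicate or already-delivered copy
        self.out_of_order[seq] = payload
        # Deliver any now-contiguous prefix to the application.
        while self.next_expected in self.out_of_order:
            self.delivered.append(self.out_of_order.pop(self.next_expected))
            self.next_expected += 1
        # Anything between next_expected and the highest buffered
        # sequence number is missing and must be requested.
        highest = max(self.out_of_order, default=self.next_expected - 1)
        return [s for s in range(self.next_expected, highest + 1)
                if s not in self.out_of_order]

r = NackReceiver()
r.receive(0, "m0")          # delivered immediately, nothing missing
nack = r.receive(3, "m3")   # gap: messages 1 and 2 were lost
print(nack)                 # [1, 2]
```

Note that the receiver buffers message 3 rather than discarding it, so a single retransmission of the missing messages restores in-order delivery.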
On receipt of a negative acknowledgment, the source retransmits the requested messages. Negative acknowledgment algorithms can
be used to achieve reliable delivery, even though they use fewer acknowledgment messages. Reliable delivery algorithms based on negative acknowledgments suffer from two problems. First, if a message is not delivered to several destinations (e.g., because it was lost as a result of buffer overflow at an intermediate switch), all of those destinations will transmit a negative acknowledgment when one would have sufficed. This is called the nack implosion problem. As shown in Fig. 6, this waste can be reduced if a destination suppresses its own negative acknowledgment if it has received a negative acknowledgment that some other destination transmitted (12,13). Suppression of negative acknowledgments is combined with a carefully chosen delay before the negative acknowledgment is transmitted to minimize the probability that multiple negative acknowledgments are transmitted. The second problem with negative acknowledgments is that they provide no indication to the source that all, or even any, destinations have received the messages. To ensure that the source can retransmit any message for which it might receive a negative acknowledgment, the source would need to retain every message indefinitely. Consequently, negative acknowledgments are typically used in combination with positive acknowledgments. The positive acknowledgments confirm that messages have been received by every destination and will not subsequently need to be retransmitted, thus allowing the source to recover the buffer space used by those messages. The use of acknowledgments and retransmissions is ineffective for synchronous systems because it introduces arbitrary delays into the delivery of messages, delays that might exceed the specified bounds. Consequently, in many synchronous designs, processes transmit messages multiple times, typically over multiple communication paths and possibly multiple times on each path. 
With a proper design and a high-quality communication medium, the probability that no copy of the message reaches the destination is negligible (14).

Causal Order Algorithms

To determine causal dependencies between messages and to deliver messages in causal order, additional information must be included in the messages to indicate their causal predecessors. A naive strategy would require a process to include in every message it transmits a list of all messages it has received since the previous message it transmitted.
Figure 6. Excessive numbers of negative acknowledgments and retransmissions can be avoided if each process suppresses its negative acknowledgment or retransmission on receiving a similar transmission from another process.
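The suppression scheme of Fig. 6 can be sketched as follows. This is an illustrative sketch only, not the algorithm of Refs. 12 or 13; the class and method names (NackSuppressor, detect_gap, and so on) are invented for the example.

```python
import random

class NackSuppressor:
    """Sketch of negative-acknowledgment suppression. On detecting a missing
    message, a destination waits a random delay before multicasting its nack;
    if it overhears another destination's nack for the same message first,
    it suppresses its own."""

    def __init__(self, max_delay=0.5):
        self.max_delay = max_delay   # upper bound on the random wait (seconds)
        self.pending = {}            # msg_id -> scheduled nack time

    def detect_gap(self, msg_id, now):
        # Schedule a nack at a uniformly random point within max_delay.
        self.pending[msg_id] = now + random.uniform(0.0, self.max_delay)

    def overheard_nack(self, msg_id):
        # Another destination already requested retransmission: suppress ours.
        self.pending.pop(msg_id, None)

    def due_nacks(self, now):
        # Return nacks whose timers expired without being suppressed.
        due = [m for m, t in self.pending.items() if t <= now]
        for m in due:
            del self.pending[m]
        return due
```

The random delay spreads the destinations' timers out so that, with high probability, only the first expiring nack is transmitted and the others are suppressed.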
GROUP COMMUNICATION
Figure 7. The transitivity of acknowledgments, piggybacked on regular messages, can be used to derive a causal order while requiring little additional information to be transmitted.
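The transitive-acknowledgment idea of Fig. 7 can be illustrated with a small sketch, assuming each message carries only the identifiers it directly acknowledges; the function name and data layout are invented for illustration.

```python
def causal_predecessors(acks, msg):
    """Derive the causal predecessors of `msg` from per-message
    acknowledgment lists: acks[m] holds the messages that m directly
    acknowledges, and transitivity yields the full set."""
    seen, stack = set(), list(acks.get(msg, []))
    while stack:
        m = stack.pop()
        if m not in seen:
            seen.add(m)
            stack.extend(acks.get(m, []))  # follow acknowledgments transitively
    return seen

# The scenario of Fig. 7: M3 acknowledges M1 and M2; M4 acknowledges M3;
# M5 acknowledges M4.
acks = {"M3": ["M1", "M2"], "M4": ["M3"], "M5": ["M4"]}
```

Here M5's single acknowledgment of M4 implicitly acknowledges M1, M2, and M3 as well, which is exactly the saving that piggybacked acknowledgments provide.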
A more sophisticated and efficient algorithm exploits transitivity of positive acknowledgments (15,16). As shown in Fig. 7, message M3 transmitted by process P1 contains positive acknowledgments of messages M1 and M2. If process P2 now transmits message M4 containing a positive acknowledgment of M3, P2’s message also implicitly acknowledges messages M1 and M2 and indicates that M4 causally follows M1, M2 and M3. If process P3 has received messages M1, M3 and M4 but has not received message M2, then P3 transmits message M5 containing a positive acknowledgment of M4 and a negative acknowledgment of M2. The positive acknowledgment of M4 implicitly acknowledges M1 and M3 and indicates that M5 causally follows M1, M3 and M4. The negative acknowledgment of M2 serves to trigger a retransmission of M2 so that M2 can be delivered before M5. Because maintaining the graph structure used by this strategy to determine the causal dependencies is computationally expensive, a variation (17) on this strategy requires a process to receive all of the predecessors of a message before it issues a positive acknowledgment of the message. Thus, the positive acknowledgments directly yield the causal dependencies, whereas the negative acknowledgments trigger retransmissions. This reduces the computational cost of deriving the causal dependencies, with a small cost in increased latency. Another strategy commonly used to determine a causal order on messages exploits a vector clock (2). Each process maintains a local clock that can be either a real-time clock or a logical Lamport clock (9). A process maintains a logical Lamport clock as follows. When the process receives a message, it compares its local clock with the timestamp in the message. If the value of the message timestamp is greater than the value of its local clock, the process advances its local clock to match the timestamp. 
When the process transmits a message, it first increments its local clock and then uses that value to timestamp the message. As shown in Fig. 8, each process also maintains a local vector clock that contains one entry in the vector for each process in the group. The process’s own entry in the vector is its own Lamport clock. When a process transmits a message, it includes the vector clock in the message as a vector timestamp. When a process receives a message, it compares every entry in the message’s vector timestamp with the corresponding entry in its local vector clock. If a value in the message’s vector timestamp is greater than the corresponding value in its local clock, the process advances the entry in its local clock to the corresponding value in the message.

To determine the causal order between two messages, a process compares the corresponding entries in the vector timestamps of the messages. If every entry in one message’s vector timestamp is greater than or equal to the corresponding entry in the other message’s vector timestamp, then that message causally follows the other message. If both vector timestamps contain an entry that is greater than the corresponding entry in the other message’s vector timestamp, then the two messages are concurrent and neither follows the other. The vector clock strategy is effective only if the number of processes is small. As the number of processes increases, the transmission cost for the vector timestamp, and the computational cost of maintaining the vector clock, increase proportionately.

Some group communication systems are based entirely on a causal order on the messages (18). Other group communication systems do not construct a causal order on the messages but rather impose a total order directly, where the total order satisfies the causal order requirement. In general, such total order algorithms are as efficient as causally ordered algorithms within a local area but incur higher latency over wide areas. Most synchronous systems do not construct explicit causal or total orders on messages during system operation. Rather, any causal or total order dependencies are considered in advance during the design of the system and the development of the preplanned schedule of operations (14).

Total Order Algorithms

Total order algorithms can be classified as symmetric or asymmetric, depending on whether all processes play the same role
Figure 8. Vector clocks, maintained by the processes and included in each message, allow the causal order to be derived.
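The vector timestamp comparison and merge rules described above can be sketched directly; this is a minimal illustration, with invented function names, assuming fixed-length vectors with one entry per process.

```python
def vt_leq(a, b):
    """a <= b componentwise: a happened at or before b."""
    return all(x <= y for x, y in zip(a, b))

def compare(a, b):
    """Classify two vector timestamps as 'before', 'after',
    'concurrent', or 'equal'."""
    if a == b:
        return "equal"
    if vt_leq(a, b):
        return "before"      # every entry of a <= b: a causally precedes b
    if vt_leq(b, a):
        return "after"
    return "concurrent"      # each has an entry exceeding the other

def merge(local, incoming):
    """On receipt of a message, advance each local entry to the maximum
    of the local and incoming values."""
    return [max(x, y) for x, y in zip(local, incoming)]
```

The "concurrent" case is the one highlighted in Fig. 8: each timestamp contains an entry greater than the corresponding entry in the other, so neither message follows the other.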
Figure 9. Acknowledgment messages, broadcast by the sequencer process, impose the total order on messages broadcast by the other processes.
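The scheme of Fig. 9 can be sketched as follows; all class and method names are invented for illustration. Originators broadcast their own messages, the sequencer broadcasts ordering acknowledgments, and a receiver delivers a message only once it holds both the message and its position in the total order.

```python
class Sequencer:
    """Collects identifiers of broadcast messages and periodically emits
    an acknowledgment message listing them in the total order."""
    def __init__(self):
        self.unordered = []
    def receive(self, msg_id):
        self.unordered.append(msg_id)
    def make_ack(self):
        ack, self.unordered = list(self.unordered), []
        return ack              # the total order for this batch

class Receiver:
    """Delivers a message only when both the message itself and the
    sequencer's ordering acknowledgment have arrived."""
    def __init__(self):
        self.have = set()       # messages received so far
        self.order = []         # ordering information not yet consumed
        self.delivered = []
    def receive_msg(self, msg_id):
        self.have.add(msg_id)
        self._try_deliver()
    def receive_ack(self, ack):
        self.order.extend(ack)
        self._try_deliver()
    def _try_deliver(self):
        # Deliver strictly in the sequencer's order; a missing message
        # blocks delivery until it is received (or retransmitted).
        while self.order and self.order[0] in self.have:
            self.delivered.append(self.order.pop(0))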
or some processes are distinguished from others. Typical asymmetric algorithms are sequencer algorithms in which one process determines the ordering of messages broadcast by the other processes, and also token algorithms in which a process can broadcast only when it holds a token that rotates through the set of processes. Asymmetric algorithms are quite efficient, but they are vulnerable to a single point of failure. Algorithms based on timestamping messages are more symmetric and are highly efficient, but they may exhibit high latency in wide-area networks. Intermediate between the symmetric and asymmetric algorithms are the hybrid algorithms in which a central core of processes executes a symmetric total order algorithm and other processes transmit their own messages to one of the core processes for ordering and broadcasting.

None of the preceding algorithms is fault-tolerant. When a process becomes faulty, the total order algorithm blocks temporarily until the membership algorithm has detected and diagnosed the fault and has formed a new membership that excludes the faulty process. A completely different class of algorithms contains fault-tolerant total ordering algorithms, based on voting (15,19). Such algorithms continue to order messages even though some processes are faulty. They are, however, quite sophisticated and computationally expensive and, thus, have not been widely used.

Sequencer Algorithms.

In the sequencer algorithms, one process is responsible for determining the total order on messages broadcast by all processes. In the Amoeba system (20), every process transmits its messages over a point-to-point connection to the sequencer process. The sequencer then determines the message order and broadcasts the messages. In alternative sequencer algorithms (21–24), shown in Fig. 9, the originators of the messages broadcast their messages.
The messages are received by the sequencer, which then determines the total order and broadcasts an acknowledgment message that lists the various broadcast messages in the total order. Other processes cannot deliver a message until they have received both the message and the acknowledgment message from the sequencer containing the ordering information. If a process does not find the message it broadcast listed
in the acknowledgment message, it rebroadcasts the message. If a process finds a message listed in the acknowledgment message but has not received the message, it requests a retransmission with a negative acknowledgment. If the sequencer broadcasts the messages, each message is transmitted twice, which increases the load on the communication medium, and the sequencer may become a bottleneck. If the originator broadcasts its own messages, the load on the communication medium and on the sequencer is reduced, but the processes then receive two transmissions, the broadcast message and the acknowledgment message from the sequencer, which increases the load on the processes.

Because the sequencer is a single point of failure and a processing bottleneck, and also to avoid the need for positive acknowledgments, most sequencer algorithms rotate the responsibility for sequencing through the processes in the group. In Ref. (21), the acknowledgment message not only orders the message but also transfers the responsibility for sequencing the next batch of messages to the next process. If the sequencer process fails, this rotation stops, as does the delivery of messages, until the membership algorithm has removed the faulty process from the membership.

Token Algorithms

Another strategy for totally ordering messages exploits a token rotating around a logical ring (3,25–27). Only the holder of the token can broadcast messages. The token contains a sequence number that is incremented every time a message is broadcast, which imposes a total order on the messages broadcast in the group, as shown in Fig. 10. The token also contains additional information, including positive and negative acknowledgments and also flow control information. To avoid the overhead of circulating the token when there are no messages to broadcast, the algorithm may contain a mechanism for stopping and restarting the circulation of the token.
Token algorithms can be very efficient for small groups with a heavy load of message communication. The processing required is simple and the flow control information in the token is effective for ensuring that the buffers at the destinations do not overflow, even under high load. For large groups under low load, the delay waiting for the token to arrive causes a higher latency than for a sequencer algorithm. This
Figure 10. In a token algorithm, the token contains a sequence number that is incremented for every message broadcast, imposing a total order on the messages broadcast in the group.
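The core of the token mechanism of Fig. 10 is very small and can be sketched as follows; the names are invented for illustration, and the real token would also carry acknowledgment and flow control fields.

```python
class Token:
    """Sketch of the token: it carries a sequence number that is
    incremented for every message broadcast."""
    def __init__(self):
        self.seq = 0

def broadcast_with_token(token, pending, log):
    """The token holder stamps and 'broadcasts' (here, appends to a log)
    its pending messages, then implicitly releases the token to the next
    process on the logical ring."""
    for msg in pending:
        token.seq += 1
        log.append((token.seq, msg))
```

Because every broadcast increments the single shared sequence number, the sequence numbers in the log form the total order on all messages in the group, regardless of which process broadcast them.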
latency can be minimized by passing the token only to the processes that have messages to broadcast and that have requested the token (4,28). If, however, the token does not visit all processes, alternative arrangements must be made for collecting the positive acknowledgments that must be obtained from all of the processes in order to manage buffer space efficiently. A faulty process causes loss of the token and stops message transmission until a membership algorithm has removed the faulty process from the membership. On the other hand, the continuously circulating token allows rapid detection of faulty processes.

Timestamp Algorithms.

An elegant strategy for totally ordering messages involves timestamping the messages and delivering them in timestamp order (3,29). The timestamps can be derived from either a logical Lamport clock or, alternatively, from synchronized physical clocks. In order that a process can deliver the message with the lowest timestamp, it must know that it will not subsequently receive a message with a lower timestamp from any other process in the group. This can be guaranteed if it receives the messages in reliable FIFO order and if it has received a message with a higher timestamp from every other process in the group. Some processes may need to transmit null messages to ensure that a message from them is always available, to allow messages from other processes to be delivered promptly.

Timestamp algorithms involve simple program code and, consequently, can be very efficient. They also have the advantage that messages can be ordered within small groups with the confidence that the local total and causal order is consistent with a system-wide total and causal order, precluding subtle ordering anomalies. The disadvantage of timestamp algorithms, particularly in large groups where many processes transmit infrequently, is that large numbers of null messages may be required.
Algorithms have been devised to combine null messages from many processes, thereby reducing their number. As in the sequencer and token algorithms, a faulty process causes message ordering to stop until the membership algorithm has removed the faulty process from the membership.

Hybrid Algorithms.

Hybrid algorithms (30,31) for total ordering messages provide efficient operation in large systems where many processes have messages to transmit only occasionally. Certain processes, typically those with high transmission rates and also high bandwidth communication links, are designated to be core processes, as shown in Fig. 11. The
core processes broadcast and deliver messages using one of the other total ordering algorithms. Other processes transmit their messages, point-to-point, to any core process, which then orders and broadcasts those messages. Effective operation of a hybrid algorithm depends on an appropriate choice of processes for the core. Algorithms have been developed to determine that choice dynamically, adding processes to the core, or removing them, as their message transmission rates change.

Hybrid algorithms are particularly important when the group size is large (thousands of processes) but only a few processes transmit frequently, as may occur in Internet applications. If, however, the listen-only processes require reliable delivery of messages, they must still transmit positive and negative acknowledgments, and care is required to avoid ack implosion (13).

Voting Algorithms.

The voting algorithms that produce a total order on messages (15,19) are completely different from the algorithms described earlier. They start from a causal order derived from acknowledgments, as is shown in Fig. 7. Candidate messages that have not yet been ordered but do not follow any other unordered message are selected. Such messages are candidates for immediate advancement into the total order. A voting strategy is used in which messages vote for messages that precede them in the causal order but that have not yet been advanced to the total order. If the causal order is narrow with few concurrent messages, so that it is almost a total order, the voting algorithm is likely to terminate in the first round. If the causal order is broad, several rounds of voting may be required, and termination of the voting algorithm depends on randomness properties of the causal order. Unfortunately, space does not permit a full description of the rather subtle voting algorithm or of the intricate proof of correctness (19).
The most interesting feature of the voting algorithms is that, unlike other total ordering algorithms, they are fault-tolerant and do not stop ordering messages in the presence of a faulty process. The absence of a hiatus in ordering messages is important for some applications. Moreover, unlike the other total ordering algorithms, the membership algorithm can be mounted above the total ordering algorithm (32), which allows the membership algorithm to be simpler and more robust. The disadvantage of the voting algorithm is its computational cost. The complexity of the algorithm is also a disadvantage because few developers want to use an algorithm if they do not understand why it works.

FLOW CONTROL ALGORITHMS
Figure 11. In a hybrid message-ordering algorithm, the core processes order and broadcast messages sent to them by the other processes.
Group communication systems incur particularly severe flow control problems because, to achieve high performance, any one process must be able to transmit messages up to the capacity of the network and of the destinations. If, however, several processes transmit messages simultaneously at that rate, saturation of the communication medium can occur, resulting in message loss and retransmission. Moreover, several senders can transmit messages substantially faster than any destination can handle them. This causes messages to accumulate in the input buffers at the destinations until they
overflow and message loss occurs. In a local area, the high bandwidth of the communication medium may allow even a single sender to overwhelm the destinations. In a wide area, the critical resource is the available bandwidth of the network, which is often much lower than in a local area and is potentially highly variable because of contention with unrelated traffic. Experience demonstrates that message loss in modern communication networks is caused mainly by flow control and buffering problems. The most effective flow control algorithms currently available for a local area are those used by token-based protocols (3,26). Only one process can broadcast or multicast at a time and the token carries flow control information from one process to the next around the ring. If the number of messages transmitted in one token rotation is restricted to the buffer capacity of the receivers, and if each process empties its buffer before releasing the token, buffer overflow is avoided. The token also carries information about the backlog of messages that could not be sent because of flow control, ensuring that all processes receive a fair share of the medium. Sequencer and timestamp algorithms use a window flow control strategy in the style of that used by TCP/IP (33,34). When a process broadcasts a message, it reduces the remaining window space, restoring the window space when it has received acknowledgments for that message from all members of the group (and, thus, no longer needs to buffer the message for possible retransmission). If each process in a group is provided with its own window then, given finite resources, those windows must be smaller than what would have been possible had the processes shared a window. Thus, the transmission rate of a process will be restricted because some of the resources have been allocated for other processes. 
If all processes share a window, then a process must reduce the space in the window for each message it receives as well as for each message it transmits, again restoring the window space when it has received acknowledgments from all members of the group. However, with a shared window and without control over multiple concurrent transmissions, several processes may transmit messages that attempt to utilize the same residual window space, leading to buffer overflow and message loss. For wide-area group communication systems operating over the Internet, window flow control is essential to achieve good performance. Internet switches use a flow control strategy, Random Early Drop (RED) (35), closely matched to TCP/IP. In contrast, wide-area group communication systems operating over ATM must accommodate the rate-based quality of service mechanisms of ATM (34), defined for each transmitter separately. In both cases, the relatively long delay until acknowledgments are received, which is inevitable in wide-area networks, can severely degrade the performance.
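The shared-window bookkeeping described above can be sketched as follows. This is an illustrative fragment with invented names, not any particular system's flow control: window space is consumed for every message transmitted or received, and restored only when the message has been acknowledged by every member of the group.

```python
class SharedWindow:
    """Sketch of shared-window flow control for a group."""
    def __init__(self, size, group):
        self.space = size                # remaining window space
        self.group = set(group)
        self.pending = {}                # msg_id -> members yet to ack

    def can_send(self):
        return self.space > 0

    def on_message(self, msg_id):
        # A message sent or received consumes window space until it has
        # been acknowledged by all members.
        self.space -= 1
        self.pending[msg_id] = set(self.group)

    def on_ack(self, msg_id, member):
        waiting = self.pending.get(msg_id)
        if waiting is not None:
            waiting.discard(member)
            if not waiting:              # all members have acked: free space
                del self.pending[msg_id]
                self.space += 1
```

As the text notes, without coordination several processes may try to consume the same residual window space concurrently; a real protocol must prevent that, for example by ordering transmissions, which this sketch does not attempt.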
FAULT DETECTORS

A fault detector is a distributed algorithm such that each process has a local fault detector module that reports the processes it currently suspects as being faulty (1). For fail-stop and crash faults, fault detectors are typically based on timeouts that are local to the process, with no communication between processes. If a process has not received a message from another process within a certain period of time, its fault detector adds that process to the list of those suspected of being faulty. This includes processes that have not acknowledged receipt of a message within a reasonable amount of time. Failure to acknowledge receipt of a message forces other processes to retain the message in their buffers for possible retransmission and could exhaust that buffer space, causing the system to stop. For Byzantine faults, fault detectors must rely on costly techniques such as reliable broadcast or diffusion algorithms and message signatures. Even in models that admit only fail-stop and crash faults, fault detectors are inherently unreliable because processes that are nonfaulty but excessively slow or processes that fail to receive a message an excessive number of times may be suspected, whereas processes that are faulty may not be suspected immediately.
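A timeout-based local fault detector of the kind described above can be sketched in a few lines; the class name and interface are invented for illustration, and the unreliability noted in the text is visible here, since a merely slow process is suspected exactly as a crashed one would be.

```python
class TimeoutFaultDetector:
    """Sketch of a local, timeout-based fault detector: a peer is suspected
    if nothing has been heard from it within `timeout` time units."""
    def __init__(self, peers, timeout):
        self.timeout = timeout
        self.last_heard = {p: 0.0 for p in peers}  # time of last message

    def heard_from(self, peer, now):
        # Any message (data or acknowledgment) refreshes the peer's timer.
        self.last_heard[peer] = now

    def suspects(self, now):
        # Peers silent for longer than the timeout are (unreliably) suspected.
        return {p for p, t in self.last_heard.items()
                if now - t > self.timeout}
```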
MEMBERSHIP ALGORITHMS

The two types of membership, primary component membership and partitionable membership, shown in Fig. 3, satisfy different application objectives. A primary component membership is most useful when the application must maintain a single consistent state for its data in the primary component, at the cost of suspending the operation of processes in the nonprimary components, for example, in banking. A partitionable membership is appropriate when all processes must continue operation, with the cost of reconciling inconsistent data when communication is reestablished between disconnected components, for example, in industrial control.

Algorithms exist for both types of membership (10,26,32,36–39), and significant problems exist for both (40,41). For primary component membership, it is possible that no membership satisfies the requirements for being the primary membership (such as a majority of the processes in the group). In practice, however, membership algorithms almost always find primary components quite quickly. For partitionable membership, the algorithm may form a trivial or inappropriate membership, such as allowing every process to form an isolated singleton membership. In practice, however, partitionable membership algorithms do not choose such memberships in preference to other more appropriate memberships.

Robust membership algorithms are difficult to program because they must operate under uncertain conditions and must handle additional faults that occur during their operation. Implementation details, such as the relative lengths of timeouts, are very important for robust operation and depend on the underlying platform on which the algorithms operate. We provide next a broad outline of the strategies used by typical membership algorithms. More details can be found in Refs. 26 and 36.

Typical membership algorithms involve four phases—initiation, discovery, agreement, and recovery—as shown in Fig. 12.
Initiation of the membership algorithm may result from an explicit request by a process to join or leave the group, a suspicion by a fault detector, or reception of a message from a foreign process (not in this membership but in a concurrent membership within a partitioned system) after remerging of a partitioned system. In the discovery phase, all processes broadcast messages inviting responses from other processes. Each such process broadcasts responses that enumerate all processes from
Figure 12. The four phases of a membership algorithm.
which they have received messages, the known set, and all processes that they suspect as having failed, the fail set. On receipt of such a message, a process merges all of the processes in the known set of the message into its own known set. Similarly, a process merges all of the processes in the fail set of the message into its own fail set. If a process has not received a response from a process in its known set within a timeout, it also adds that process to its fail set. The discovery phase ends either by a timeout or by agreement on a membership. The discovery phase can, however, be reentered at any time if further processes are suspected, if agreement cannot be reached, or if one or more processes do not install the new membership. The agreement phase seeks to find a set of processes such that every process in that set agrees that its proposed new membership is that set. The proposed membership is typically the difference of the known set and the fail set. If the processes are only partially connected, so that some processes cannot communicate with other processes, a heuristic algorithm may be used to choose an appropriate membership. For a primary component membership, the proposed membership must also satisfy some size constraint or other criterion for being a primary component. The agreement is then confirmed, typically by some variation of two-phase commit. The proposer, usually the process having the lowest identifier, broadcasts a proposal message. The other members then respond with a commit message. The proposer then broadcasts an install message to begin the recovery phase and install the new membership. If any process rejects the membership or does not respond, the proposer returns to the discovery phase. Similarly, if a process does not receive a propose or install message, it returns to the discovery phase. In the recovery phase, the processes first complete the delivery of messages from the old membership and then install the new membership. 
The processes in the new membership and the same old membership exchange information regarding messages that they have received from the old membership and then retransmit those messages so that all members have them. The messages are then ordered and delivered to ensure virtual synchrony. When all messages of the old membership are delivered, the algorithm delivers a membership change message announcing the membership change, enumerating the new membership, and starting normal operation with the new membership. Even before the delivery of some of the messages of the old membership, the membership algorithm may have delivered additional membership change messages reporting the loss of processes. Such additional messages are necessary to achieve extended virtual synchrony. If a process determines that any member of its new membership has returned to the discovery phase, it first completes its installation of the new membership and then reinvokes the membership algorithm.

For a primary component membership algorithm, it is essential that only a single sequence of memberships exists over time. If the proposer becomes faulty at a critical moment, it may be impossible for the remaining processes to determine whether the proposer has installed the new membership. The system must then stop until the proposer has recovered. This risk of a hiatus can be reduced, but not eliminated, by using three-phase commit in place of two-phase commit.

For a partitionable membership algorithm, termination is easy to demonstrate. The known set and the fail set are monotonically increasing and bounded above by the finite number of potential members. Each attempt to form a membership can be defeated by a process that was previously unknown, causing an increase in the known set, or by a process that does not respond, causing an increase in the fail set. Because both sets increase monotonically and are bounded above by the set of potential members, the algorithm terminates, possibly in a singleton membership containing only the process itself.

Membership algorithms for operation on top of a fault-tolerant total ordering algorithm (32) are often simpler and more elegant than the algorithm outlined earlier. This simplicity comes at the expense of greater complexity in the fault-tolerant total ordering algorithm.
For synchronous systems, membership algorithms are much simpler than for asynchronous systems (14,42). Typically, at the end of each prescheduled sequence of message exchanges, each process reports the set of processes from which it received messages during that sequence. This set of processes constitutes its proposed membership. If a process receives a membership that differs from its own, it can choose either to exclude that process from its membership or to exclude other processes from its membership so as to bring its membership into agreement with that of the other process. In principle, these choices are heuristic choices for synchronous systems as they are for asynchronous systems. Practical synchronous systems, however, typically have simpler and more robust communication media and, thus, incur fewer problems in reaching agreement on a membership quickly.
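The set-merging at the heart of the discovery phase can be sketched as a single step; this is an illustrative fragment with an invented function name, not the algorithm of Refs. 26 or 36. Each response carries a known set and a fail set, both are merged into the local sets, and the proposed membership is their difference.

```python
def discovery_round(local_known, local_fail, responses):
    """Merge the known and fail sets carried by discovery responses into
    the local sets. Each response is a (known_set, fail_set) pair. Returns
    the merged sets and the proposed membership (known minus fail)."""
    known, fail = set(local_known), set(local_fail)
    for resp_known, resp_fail in responses:
        known |= set(resp_known)   # learn of processes others have heard from
        fail |= set(resp_fail)     # adopt others' suspicions
    return known, fail, known - fail
```

Because merging only ever adds elements, both sets grow monotonically across rounds, which is exactly the property used in the termination argument for partitionable membership.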
FUTURE DIRECTIONS

Much research remains to be undertaken in the area of fault-tolerant distributed systems. An important research topic is the integration of group communication protocols with protocols for real-time, multimedia, and data transfer. Real-time protocols, such as are used for instrumentation and control, typically seek to provide low latency and low jitter (variance in latency) but not reliable delivery because new data are
transmitted more or less continuously. Multimedia protocols that provide broadcasting or multicasting of audio and video need low latency and low jitter but not reliable delivery. Data transfer protocols provide reliable delivery and may provide broadcasting or multicasting but usually do not need message ordering between multiple sources. The use of these protocols in a distributed system depends on group communication for overall coordination. It is essential to establish a causal order between the control information transmitted through the group communication protocol and the start or end of real-time, multimedia, or data transmission.

As group communication protocols become more established, they will be used in larger systems and over wider areas. Over wide areas, with high data rates, existing flow control algorithms are ineffective. To preclude overwhelming the buffers in the intermediate switches, a relatively small window is needed, but messages in the window are quickly transmitted and the source then remains idle for a long time until the acknowledgments return. New flow control algorithms will be required. With existing protocols, the latency to message ordering and delivery can increase substantially over a wide area. Some increase in latency is inevitable because of the propagation delay through the network, but new protocols that can order messages with a latency that is close to this minimum will be required (27).

Wide-area systems are also subject to network partitioning; however, many applications require all components of a partitioned system to continue operation. Existing group communication systems provide message delivery and membership algorithms that continue to operate in all components of a partitioned system. Even though the system is partitioned, the disconnected components can perform operations that are inconsistent with those performed in other components.
When communication is eventually restored, these inconsistencies must be reconciled. The programming required to achieve such reconciliation is currently quite difficult, and expensive manual intervention may be required. Proposals have been made (44–46) to simplify this programming, although human insight is still required to establish the application requirements on which such programming depends. The development of strategies for preventing or reconciling inconsistencies in partitioned systems is an important topic of research. Group communication is in the middle of a range of approaches to the development of fault-tolerant distributed systems. One end of that range is focused on efficiency, while the other end is focused on simplification of the application programming. When communication networks and group communication protocols were slow, a strong emphasis on efficiency was appropriate (47). A similar concern has led to the development of microprotocol toolkits (4) from which a custom group communication protocol can be constructed, optimized specifically for the particular application. With increasing network performance and more efficient protocols, some of that efficiency can be sacrificed for simpler application programming. The group communication protocols described in this article employ a message-passing application programmer interface. This message-passing interface necessarily exposes to the application programmer the problems of distribution, replication, consistency, and fault tolerance. Correct solutions to these problems require considerable skill and experience, and
typical application programmers are not well-trained to solve those problems. Consequently, fault-tolerant distributed systems are still quite difficult and expensive to program. New approaches to building fault-tolerant distributed systems are being investigated. Using the Common Object Request Broker Architecture (CORBA) (48,49), such systems (46,50) provide transparent object replication and fault tolerance. This allows the application programmer to write a distributed object program as though it were to operate unreplicated, without affecting the application programming or the functional behavior of the application. The approach still employs group communication protocols such as those described here, but does not expose those protocols to the application programmer. Such an approach will make the benefits of fault-tolerant distributed systems available to a wider range of applications.
BIBLIOGRAPHY

1. T. D. Chandra and S. Toueg, Unreliable failure detectors for reliable distributed systems, J. ACM, 43 (2): 225–267, 1996.
2. K. P. Birman and R. van Renesse, Reliable Distributed Computing with the Isis Toolkit, Los Alamitos, CA: IEEE Comput. Soc. Press, 1994.
3. L. E. Moser et al., Totem: A fault-tolerant multicast group communication system, Commun. ACM, 39 (4): 54–63, 1996.
4. R. van Renesse, K. P. Birman, and S. Maffeis, Horus: A flexible group communication system, Commun. ACM, 39 (4): 76–83, 1996.
5. F. Cristian, Synchronous and asynchronous group communication, Commun. ACM, 39 (4): 88–97, 1996.
6. M. J. Fischer, N. A. Lynch, and M. S. Paterson, Impossibility of distributed consensus with one faulty process, J. ACM, 32 (2): 374–382, 1985.
7. M. Ben-Or, Randomized agreement protocols, in B. Simons and A. Spector (eds.), Fault-Tolerant Distributed Computing, Berlin: Springer-Verlag, 1990, pp. 72–83.
8. G. Bracha and S. Toueg, Asynchronous consensus and broadcast protocols, J. ACM, 32 (4): 824–840, 1985.
9. L. Lamport, Time, clocks, and the ordering of events in a distributed system, Commun. ACM, 21 (7): 558–565, 1978.
10. D. Dolev, D. Malki, and R. Strong, A framework for partitionable membership service, Tech. Rep. CS95-4, Inst. Comput. Sci., Hebrew Univ., Jerusalem, Israel, 1995.
11. L. E. Moser et al., Extended virtual synchrony, Proc. 14th IEEE Int. Conf. Distrib. Comput. Syst., Poznan, Poland, 1994, pp. 56–65.
12. S. Floyd et al., A reliable multicast framework for light-weight sessions and application level framing, IEEE/ACM Trans. Netw., 5: 784–803, 1997.
13. K. Berket, L. E. Moser, and P. M. Melliar-Smith, The InterGroup protocols: Scalable group communication for the Internet, IEEE GLOBECOM ’98: 3rd Global Internet Mini-Conf., Sydney, Australia, 1998.
14. H. Kopetz and G. Grunsteidl, TTP—A protocol for fault-tolerant real-time systems, IEEE Comput., 27 (1): 14–23, 1994.
15. P. M. Melliar-Smith, L. E. Moser, and V. Agrawala, Broadcast protocols for distributed systems, IEEE Trans. Parallel Distrib. Syst., 1: 17–25, 1990.
16. P. M. Melliar-Smith and L. E. Moser, Trans: A reliable broadcast protocol, IEE Proc. I Trans. Commun., 140: 481–492, 1993.
17. Y. Amir et al., Transis: A communication sub-system for high availability, Proc. 22nd IEEE Int. Symp. Fault-Tolerant Comput., Boston, MA, 1992, pp. 76–84.
18. S. Mishra, L. L. Peterson, and R. D. Schlichting, Consul: A communication substrate for fault-tolerant distributed programs, Distrib. Syst. Eng., 1 (2): 87–103, 1993.
19. L. E. Moser, P. M. Melliar-Smith, and V. Agrawala, Asynchronous fault-tolerant total ordering algorithms, SIAM J. Comput., 22 (4): 727–750, 1993.
20. M. F. Kaashoek and A. S. Tanenbaum, Group communication in the Amoeba distributed operating system, Proc. 11th IEEE Int. Conf. Distrib. Comput. Syst., Arlington, TX, 1991, pp. 222–230.
21. J. M. Chang and N. F. Maxemchuk, Reliable broadcast protocols, ACM Trans. Comput. Syst., 2: 251–273, 1984.
22. F. Cristian and S. Mishra, The pinwheel asynchronous atomic broadcast protocols, Proc. 2nd Int. Symp. Autonomous Decentralized Syst., Phoenix, AZ, 1995, pp. 215–221.
23. W. Jia, J. Kaiser, and E. Nett, RMP: Fault-tolerant group communication, IEEE Micro, 16 (2): 59–67, 1996.
24. B. Whetten and S. Kaplan, A high performance totally ordered multicast protocol, Proc. Int. Workshop Theory and Practice Distrib. Syst., Dagstuhl Castle, Germany, Berlin: Springer-Verlag, 1994, pp. 33–57.
25. T. Abdelzaher et al., RTCAST: Lightweight multicast for real-time process groups, Proc. 1996 IEEE Real-Time Technology and Applications Symp., Brookline, MA, 1996, pp. 250–259.
26. Y. Amir et al., The Totem single-ring ordering and membership protocol, ACM Trans. Comput. Syst., 13: 311–342, 1995.
27. B. Rajagopalan and P. K. McKinley, A token-based protocol for reliable, ordered multi-cast communication, Proc. 8th IEEE Symp. Reliable Distrib. Syst., Seattle, WA, 1989, pp. 84–93.
28. G. A. Alvarez, F. Cristian, and S. Mishra, On-demand asynchronous atomic broadcast, Proc. 5th IFIP Int. Working Conf. Dependable Computing for Critical Applications, Urbana-Champaign, IL, 1995, pp. 119–137.
29. D. A. Agarwal et al., The Totem multiple-ring ordering and topology maintenance protocol, ACM Trans. Comput. Syst., 16: 93–132, 1998.
30. P. D. Ezhilchelvan, R. A. Macedo, and S. K. Shrivastava, Newtop: A fault-tolerant group communication protocol, Proc. 15th Int. Conf. Distrib. Computing Syst., Vancouver, BC, Canada, 1995, pp. 296–306.
31. L. E. T. Rodrigues, H. Fonseca, and P. Verissimo, Totally ordered multicast in large-scale systems, Proc. 16th IEEE Int. Conf. Distrib. Comput. Syst., Hong Kong, 1996, pp. 503–510.
32. L. E. Moser, P. M. Melliar-Smith, and V. Agrawala, Processor membership in asynchronous distributed systems, IEEE Trans. Parallel Distrib. Syst., 5: 459–473, 1994.
33. D. E. Comer, Internetworking with TCP/IP, Englewood Cliffs, NJ: Prentice-Hall, 1995.
34. S. Floyd and V. Jacobson, Random early detection gateways for congestion avoidance, IEEE/ACM Trans. Netw., 1: 397–413, 1993.
35. Y. Amir et al., Membership algorithms for multicast communication groups, Proc. 6th Int. Workshop Distrib. Algorithms, Haifa, Israel, 1992, pp. 292–312.
36. F. Cristian, Reaching agreement on processor-group membership in synchronous distributed systems, Distrib. Comput., 4 (4): 175–187, 1991.
37. M. A. Hiltunen and R. D. Schlichting, A configurable membership service, IEEE Trans. Comput., 47 (5): 573–586, 1998.
38. F. Jahanian, S. Fakhouri, and R. Rajkumar, Processor group membership protocols: Specification, design and implementation, Proc. 12th Symp. Reliable Distrib. Syst., Princeton, NJ, 1993, pp. 2–11.
39. A. M. Ricciardi and K. P. Birman, Process membership in asynchronous environments, Tech. Rep. TR 93-1328, Dept. of Computer Science, Cornell Univ., Ithaca, NY, 1993.
40. E. Anceaume et al., On the formal specification of group membership services, Tech. Rep. 95-1534, Dept. of Computer Science, Cornell Univ., Ithaca, NY, 1995.
41. T. D. Chandra et al., On the impossibility of group membership, Tech. Rep. 95-1548, Dept. of Computer Science, Cornell Univ., Ithaca, NY, 1995.
42. A. S. Tanenbaum, Computer Networks, Upper Saddle River, NJ: Prentice-Hall, 1996.
43. R. Koch, L. E. Moser, and P. M. Melliar-Smith, Global causal ordering with minimal latency, Tech. Rep. 98-08, Dept. Electr. Comput. Eng., Univ. California, Santa Barbara, 1998.
44. O. Babaoglu, A. Bartoli, and G. Dini, Enriched view synchrony: A programming paradigm for partitionable asynchronous distributed systems, IEEE Trans. Comput., 46: 642–658, 1997.
45. P. M. Melliar-Smith and L. E. Moser, Surviving network partitioning, IEEE Comput., 31 (3): 62–69, 1998.
46. P. Narasimhan, L. E. Moser, and P. M. Melliar-Smith, Replica consistency of CORBA objects in partitionable distributed systems, Distrib. Syst. Eng., 4: 139–150, 1997.
47. D. R. Cheriton and D. Skeen, Understanding the limitations of causally and totally ordered communication, Proc. 14th ACM Symp. Operating Systems Principles, Asheville, NC, 1993; Operating Syst. Rev., 27 (5): 44–57, 1993.
48. Object Management Group, The Common Object Request Broker: Architecture and Specification, Rev. 2.1, OMG Tech. Doc. PTC/97-09-01, 1997.
49. R. M. Soley, Object Management Architecture Guide, Object Management Group, OMG Tech. Doc. 92-11-1, 1992.
50. L. E. Moser, P. M. Melliar-Smith, and P. Narasimhan, Consistent object replication in the Eternal system, Theory Practice Object Syst., 4 (2): 81–92, 1998.
P. M. MELLIAR-SMITH
L. E. MOSER
University of California
Wiley Encyclopedia of Electrical and Electronics Engineering
High-Speed Protocols
Standard Article
Martina Zitterbart, TU Braunschweig, 38106 Braunschweig, Germany
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5308
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: Characteristics of High-Speed Networks; Light-Weight Transport Protocols; Evolution of TCP; Implementation Techniques; Summary.
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
HIGH-SPEED PROTOCOLS

Buzzwords such as data highways and information society are becoming ubiquitous and are no longer specific to the research environment. In this context, many emerging applications are pushing the communication world toward drastic changes. Currently, the most prominent example is the World Wide Web (WWW). Furthermore, the popularity of networked multimedia applications, such as teleconferencing and telecollaboration, is constantly increasing. As a result, many new opportunities for network users appear in the public sector as well as in the business and commercial sectors. However, in order to serve all these applications, suitable communication systems are required, including the underlying network as well as network and transport layer protocols. Together, all components need to provide high performance with respect to throughput and latency. Moreover, the integration of multiple services, as is typical for multimedia applications (e.g., audio, video, and data streams), forms a key requirement.

Due to the emergence of fiber-based technology, high-speed networks are being established that enable data rates well over the megabit, and even the gigabit, per second threshold. In addition, ATM (asynchronous transfer mode) is under development and will be capable of integrating various services within a single network, the B-ISDN (broadband integrated services digital network). Moreover, high-speed protocols and efficient implementation techniques have been developed during the last couple of years. They specifically address the characteristics of high-speed networks and new application requirements.

This article focuses on issues related to high-speed networks and is structured as follows. Important characteristics of high-speed networks are presented first, followed by information on light-weight transport protocols. Protocol mechanisms as well as the most popular light-weight protocols are discussed.
A section is devoted to the evolution of the widely used Internet protocol TCP (Transmission Control Protocol). Following this, implementation techniques including parallel protocol processing and dedicated hardware support are presented. Finally, some conclusions and perspectives on future trends are given.
Characteristics of High-Speed Networks

High-speed networks are characterized by a high data rate, typically well into the hundreds of Mbit/s or even the gigabit per second range and above. However, there is no specific data rate that qualifies a network as a high-speed network. Compared to traditional low-speed networks (e.g., Ethernet), the data rate is higher by several orders of magnitude. This leads to some very different characteristics of such networks, especially with respect to end-to-end latency. In low-speed networks, end-to-end latency is dominated by the data rate of the link. In high-speed networks, in contrast, the speed of signal propagation clearly dominates the end-to-end latency. As a result, the so-called bandwidth-delay product increases rapidly in high-speed networks. This means that a large amount of data can be buffered within the network. This buffering capability is commonly referred to as path capacity.
Fig. 1. End systems communicating over a high-speed network.
The basic scenario depicted in Fig. 1 is used to clarify the importance of the path capacity. Two end systems are interconnected via a communication link of length l = 5000 km. The data rate on the link is r = 1.2 Gbit/s. Furthermore, we assume the communication link to be a fiber link. The speed of light in a fiber is approximately v = 2 ∗ 105 km/s. Thus, the signal propagation delay per kilometer is τ = 1/v = 5 µs/km. The end-to-end transmission delay d of the communication link is d = l ∗ τ = 25 ms; that is, the round-trip time for end-to-end communication is 50 ms. The path capacity p of a link can be calculated as follows: p = r ∗ l/v. For the example, the path capacity is p = 30 Mbit; that is, 30 Mbit of data are stored on the transmission link. Table 1 presents path capacities for various networks with different data rates.

The drastic increase in path capacity compared to low-speed networks has a major impact on higher-layer protocols. Several protocol mechanisms that regulate the data flow between end systems are affected, especially at the transport layer. Taking the numbers above, a sending station has to wait 50 ms before it can expect to receive an indication from the receiving station about the transmitted data. During that time, however, the sender can already transmit 60 Mbit (i.e., 7.5 Mbyte) of data. Thus, it can send a complete file without receiving any indication of proper reception or of any errors. This situation is very different from low-speed networks, where the first feedback information usually arrives at the sender after the transmission of a few bits (e.g., 4 bits if ISDN is used in the example). This drastic change of behavior requires enhanced protocol mechanisms.

Moreover, due to the increased speed within the network, the time in which a data unit must be handled in the attached systems (end systems, routers, etc.) decreases dramatically.
Given the previous example, data units of 8 kbytes need to be processed in about 55 µs. However, if the length of the data unit is only 53 bytes (e.g., an ATM cell), it needs to be received and processed in less than 0.4 µs. Comparable numbers in a 10 Mbit/s Ethernet are 6.5 ms and 42 µs, respectively. These numbers underline that requirements on protocol
processing and memory access speed are other significant factors that need to be addressed with the advent of high-speed networks.
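The arithmetic above can be collected into a short sketch. This is plain Python reproducing the article's worked example; the function and constant names are my own, not part of any protocol or library.

```python
# Bandwidth-delay arithmetic for the example link in the text:
# a 5000 km fiber link running at 1.2 Gbit/s.

def propagation_delay_s(length_km, v_km_per_s=2e5):
    """One-way propagation delay d = l / v over a fiber link."""
    return length_km / v_km_per_s

def path_capacity_bits(rate_bps, length_km, v_km_per_s=2e5):
    """Bits 'stored' on the link: p = r * l / v."""
    return rate_bps * length_km / v_km_per_s

RATE = 1.2e9      # link data rate r, bit/s
LENGTH = 5000     # link length l, km

d = propagation_delay_s(LENGTH)        # 0.025 s, i.e. 25 ms one way
rtt = 2 * d                            # 50 ms round trip
p = path_capacity_bits(RATE, LENGTH)   # 30e6 bits = 30 Mbit in flight

# Per-unit processing budget at line rate: time = bits / rate.
budget_8kB = 8 * 1024 * 8 / RATE       # about 55 microseconds per 8 kbyte unit
budget_atm = 53 * 8 / RATE             # well under 0.4 microseconds per ATM cell

print(d, rtt, p, budget_8kB, budget_atm)
```

The same two formulas reproduce the Ethernet comparison in the text by substituting RATE = 1e7.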
Light-Weight Transport Protocols

Protocol mechanisms at the transport layer must address the increasing path capacity in order to provide efficient communication services to the application. In the following, basic protocol mechanisms are discussed first, followed by the presentation of selected light-weight protocols that introduced and applied novel protocol mechanisms.

Basic Protocol Mechanisms. Several protocol mechanisms are typically part of connection-oriented transport protocols that provide a reliable service. Among them are mechanisms for:

• connection establishment and termination
• error control
• flow control
Connection establishment can be seen as a performance-critical protocol function, especially with the advent of applications that are based on the client/server paradigm. Typically, a handshake-based mechanism is used to establish a connection, as depicted in Fig. 2(a). The sender issues a connect request message and must wait for a connect indication message before it is allowed to transfer user data to the peer entity. During this handshake procedure, some parameters, such as data rate and window size, can be negotiated among the sender, the receiver, and the service provider. This includes QoS (quality of service) parameters for multimedia services. However, the handshake procedure leads to a latency of at least one round-trip time before the first byte of user data can be sent, that is, a delay of 50 ms with respect to the example presented previously. In this time, 60 Mbit of user data could have been transmitted. Many client/server-based applications do not require a large amount of data to be sent, and thus the time needed for connection establishment, tconn, can easily dominate the time needed for user data transfer, tdata, that is, tconn ≥ tdata.

In order to increase the efficiency of connection establishment, implicit mechanisms have been developed. User data can be transmitted immediately with or after the connection establishment message; see Fig. 2(b). Implicit mechanisms drastically decrease the connection set-up latency. Therefore, transactions can be finalized much faster. For example, the transaction time ttrans can be reduced to ttrans ≈ 50 ms instead of ttrans > 100 ms with the handshake-based mechanism. However, guarantees with respect to quality of service cannot be given. Therefore, such mechanisms are not targeted at multimedia applications. Protocols serving multimedia applications typically use a separate signaling protocol in order to establish a connection with dedicated QoS requirements.
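The latency arithmetic for the two establishment styles can be sketched as follows. A minimal model in Python: it counts only round trips, ignoring transmission and processing time as the text does, and the function names are illustrative.

```python
# Transaction time: explicit handshake vs. implicit connection establishment
# over the article's example link (50 ms round trip).

RTT = 0.050  # seconds, round trip over the 5000 km fiber link

def t_handshake(data_rtts=1):
    """Connect request/indication costs one full RTT before any user data;
    the request/response exchange itself costs another RTT."""
    return RTT + data_rtts * RTT

def t_implicit(data_rtts=1):
    """User data rides with the connection establishment message,
    so only the request/response RTT remains."""
    return data_rtts * RTT

print(t_handshake(), t_implicit())
```

For a single request/response transaction this reproduces the figures in the text: roughly 100 ms with a handshake against roughly 50 ms with implicit establishment.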
Connection termination also needs special attention in the environment of client/server applications and high-speed networks. If handshake-based mechanisms are used for the establishment and termination of connections, the number of packets and the time consumed for connection management may be higher than the number of packets and the time needed to transmit the user data. Therefore, the number of connection management packets should be minimized. In addition to implicit connection establishment, so-called timer-based mechanisms have been introduced in order to avoid connection termination messages. Timers are used at both sides to determine the point in time at which the connection is terminated. The timers are updated each time user data are either received or sent. The value of the timer must account for the round-trip time of the connection and possible data retransmissions. The mechanism is very sensitive to proper dimensioning of the timer. A disadvantage of timer-based connection handling is that connection state must be held longer, since the timer needs to be set to a sufficiently high value. This can be a burden at servers that are frequently requested.
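A minimal sketch of such timer-based teardown, assuming a timeout dimensioned from the round-trip time and a retransmission allowance. The class name, timeout formula, and method names are illustrative, not taken from any particular protocol specification.

```python
# Timer-based connection teardown: no termination handshake; the connection
# closes once no data has been sent or received for a quiet period safely
# above the RTT plus retransmission allowance.

class TimerBasedConnection:
    def __init__(self, rtt, max_retransmissions=3):
        # The timeout must outlive the worst-case exchange, so connection
        # state is held for a long time -- the server burden noted in the text.
        self.timeout = (1 + max_retransmissions) * rtt
        self.last_activity = 0.0
        self.open = True

    def on_data(self, now):
        """Sending or receiving user data refreshes the timer."""
        self.last_activity = now

    def poll(self, now):
        """Close the connection once the quiet period exceeds the timeout."""
        if self.open and now - self.last_activity > self.timeout:
            self.open = False
        return self.open

conn = TimerBasedConnection(rtt=0.050)   # 50 ms RTT -> 200 ms timeout
conn.on_data(now=0.0)
assert conn.poll(now=0.1)        # still within the quiet period
assert not conn.poll(now=0.3)    # quiet too long: connection closed
```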
Fig. 2. Connection establishment.
Protocol mechanisms that implement error control also need to be adapted for use in networks with high path capacity. Typically, very simple error control mechanisms are implemented within transport protocols. Acknowledgments are used by the receiver to indicate the correct reception of data units that have been sent. If a data unit is not acknowledged, it is retransmitted by the sender after the timeout of a related timer. The simplest acknowledgment mechanism uses cumulative acknowledgments. All correctly received data units in sequence are acknowledged. All data received subsequent to a corrupted data unit are discarded and not acknowledged. The sender needs to wait at least a round-trip time in order to know whether the data unit has been received correctly. For the retransmission of erroneous data, the so-called go-back-N mechanism is often applied, especially in conjunction with cumulative acknowledgments. With this mechanism, all data following the corrupted data unit are retransmitted. Due to the potentially large amount of data in transit, this can dramatically increase the load on the network.

With selective acknowledgments, more advanced retransmission mechanisms can be implemented that reduce the amount of data to be retransmitted. Selective acknowledgments allow, in contrast to cumulative mechanisms, the acknowledgment of data that has been correctly received subsequent to a corrupted data unit. The disadvantage is increased memory requirements at the receiving system. However, latency can be significantly reduced, which is highly desirable for many applications, especially interactive ones (e.g., teleconferencing). Selective acknowledgments are very attractive in high-speed networks, since they do not require the retransmission of all data in transit.
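The difference in retransmission volume can be illustrated by counting data units rather than implementing full ARQ machinery. The functions below are a sketch over unit indices; the window size and loss pattern are illustrative.

```python
# Retransmission-volume comparison: go-back-N resends everything from the
# first corrupted unit onward, while selective retransmission resends only
# the lost units themselves.

def go_back_n_retransmits(window, lost):
    """Units resent under go-back-N: from the first loss to the window end."""
    if not lost:
        return 0
    return window - min(lost)

def selective_retransmits(lost):
    """Units resent under selective retransmission: only the losses."""
    return len(lost)

# 1000 units in transit (a large high-speed window); only unit 3 is lost.
print(go_back_n_retransmits(1000, {3}))   # 997 units resent
print(selective_retransmits({3}))         # 1 unit resent
```

With a large amount of data in transit, a single early loss forces go-back-N to resend almost the whole window, which is exactly the extra network load the text describes.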
Furthermore, forward error correction (FEC) appears to be a useful alternative in high-speed networks, since it completely avoids the round-trip time needed to correct at least part of the errors. The basic principle is that redundant data are transmitted. Thus, the receiver may be capable of reconstructing the original data even in the case of lost or corrupted data. However, this does not guarantee complete reliability. Therefore, an additional option to use retransmissions is required. This is needed if too much data is corrupted or lost and, thus, the original data cannot be reconstructed. The potential disadvantages of FEC are the increased load on the network as well as the considerable processing power needed for good FEC mechanisms.

The task of flow control is to regulate the flow of data between two communicating protocol entities. The resources of the receiving system (buffer, processing power) are protected against overload conditions. The basic mechanism used in various transport protocols is a credit-based sliding window mechanism. The sender has a certain credit, the window, that is, an amount of data (measured in bytes or data units) that it is allowed to send before receiving an acknowledgment. This credit basically reflects the buffer capability of
the receiver. With increasing path capacity, the utilization of the communication link by two communicating stations decreases under such an approach. Typical window sizes are in the range of kbytes and, thus, smaller than the path capacity of high-speed networks. Therefore, a new mechanism called rate control has been developed during the last couple of years. It is applied in various light-weight protocols. With rate control, traffic is controlled via the rate of the sending station, in contrast to the credit used by the sliding window. The sending and receiving systems as well as the network agree on this rate; that is, the receiving system and the network assert that they can process data received at the agreed rate.

Light-Weight Transport Protocols. During the past twenty years, various transport protocols have been developed. The Internet protocol TCP is the most prominent example. It was one of the earliest transport protocols and is currently implemented on almost all computers. Since many changes in networks can be observed, especially with the advent of high-speed networks, the requirements on transport protocols have changed. Furthermore, application requirements have changed dramatically. The client/server paradigm is penetrating the communication area today, and service integration is becoming increasingly important with respect to multimedia applications. In order to address the changes related to high-speed networks, various so-called light-weight transport protocols have been developed. Light-weight protocols shorten the regular data path and, generally, minimize protocol overhead. The most prominent ones are briefly described in the following with respect to their individual contributions. The protocols are:

• Delta-t (1,2)
• NETBLT (3,4)
• VMTP (5–7)
• URP (8)
• XTP (9,10)
TCP can be seen as the “father” of transport protocols. Each of the protocols listed above inherits some mechanisms from TCP and replaces or improves others. An excellent overview of the correlation among these protocols can be found in (11). The transport protocol of the OSI protocol suite, transport protocol class 4 (TP4), is not discussed further, since it does not introduce any new protocol mechanisms and is not of practical relevance today. TCP itself is constantly undergoing changes. Those relevant to high-speed networks are summarized in the section on TCP. The light-weight transport protocols presented here have been designed with a connectionless network service in mind, such as IP (Internet Protocol). The only exception is URP, which is based on the virtual-circuit-oriented network Datakit.

Delta-t. In the late 1970s, the Lawrence Livermore Laboratory designed a new transport protocol called Delta-t. The development was driven by the idea of providing high-performance communication support for client/server applications. Delta-t is one of the earliest protocols to address the specific requirements of client/server applications. It provides some trail-blazing approaches that are more or less standard today. One of its main characteristics is that it minimized the latency between initiating the connection establishment and the actual start of the data transfer phase, as well as the overall number of data units exchanged. The novelty introduced by Delta-t is an implicit connection establishment mechanism. It minimizes the number of data units exchanged and, thus, the delay until user data transfer can start. Handshake procedures are completely avoided. Data can be transferred in Delta-t immediately after issuing the connection establishment. Another novelty of Delta-t is the definition of a timer-based protocol mechanism for connection termination.
It avoids a handshake procedure for connection termination and, thus, further reduces the overhead with respect to the number of data units exchanged. For such a mechanism, unambiguous connection identifiers are needed. Furthermore, connection identifiers need to be frozen after connection termination, that
is, they cannot be used immediately by a newly established connection, in order to avoid problems with delayed duplicates from an already closed connection. Although Delta-t is not widely used today, it provided important insights into the efficient support of client/server applications. Its connection handling mechanisms have been adopted by subsequent protocols, even by TCP (see the section on TCP).

Network Block Transfer Protocol. The protocol NETBLT (network block transfer protocol) was developed at MIT. It directly targets transmission links with high path capacity. However, it addresses links with a low-speed data rate: satellite links. The extreme length of satellite links (72000 km) leads to a high path capacity without the necessity of a very high data rate on the link. In such environments, window-based flow control cannot be applied efficiently: it leads either to large windows or to underutilized communication links. NETBLT specifically addresses this problem and presents a novel mechanism that combines window-based flow control with a newly introduced rate-based flow control. Furthermore, in NETBLT, error control and flow control are decoupled, so that neither mechanism is overloaded with both tasks.

NETBLT introduces so-called bursts for the implementation of rate-based flow control. Two burst parameters are defined to regulate data transmission: burst size and burst rate. The burst size limits the length of a burst, that is, the amount of data that can be sent continuously. The burst rate defines the minimum time interval between two subsequent bursts. Consequently, the data rate on the link is controlled by these two parameters. Therefore, larger windows can be used without causing an overflow at the receiver. Burst size and burst rate are negotiated among the involved parties.

With respect to error control, NETBLT also introduces novel mechanisms. It uses a combination of cumulative and selective acknowledgments.
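The contrast between window-based and rate-based control can be made concrete with two small formulas. This sketch uses the article's example link; the window and burst parameter values are illustrative, not taken from the NETBLT specification.

```python
# Window-based vs. NETBLT-style rate-based flow control on a link with
# large path capacity. Under a sliding window, utilization is roughly
# window / (rate * RTT), i.e. window / bandwidth-delay product. Under
# burst-based rate control, the rate follows directly from burst size
# and the minimum interval between bursts.

def window_utilization(window_bits, rate_bps, rtt_s):
    """Fraction of the link a credit window can keep busy."""
    return min(1.0, window_bits / (rate_bps * rtt_s))

def effective_rate_bps(burst_size_bits, burst_interval_s):
    """Average rate produced by one burst per minimum burst interval."""
    return burst_size_bits / burst_interval_s

RATE, RTT = 1.2e9, 0.050      # the article's 1.2 Gbit/s, 50 ms link

# A 64 kbyte credit window (kbyte-range, as the text notes) leaves the
# high-speed link almost idle:
u = window_utilization(64 * 1024 * 8, RATE, RTT)     # below 1%

# Burst-based rate control: 16 kbyte bursts, one per millisecond,
# yields a controlled rate independent of the round-trip time.
rate = effective_rate_bps(16 * 1024 * 8, 0.001)      # about 131 Mbit/s

print(u, rate)
```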
Selective acknowledgments signal that certain data units have been lost or corrupted. Cumulative acknowledgments signal the correct reception of a sequence of data units. Since selective acknowledgments are used, NETBLT can apply selective retransmissions as well. In order to reduce the overhead introduced by acknowledgments, so-called buffers are introduced as the basic units for the handling of acknowledgments. Buffers contain a number of data units that are logically treated as a single unit. Thus, an acknowledgment is needed per buffer rather than per data unit exchanged. The size of the buffer is negotiated between sender and receiver.

Versatile Message Transaction Protocol. The protocol VMTP (versatile message transaction protocol) was developed at Stanford University for use with the distributed operating system V. The main purpose was the efficient support of remote procedure calls, that is, of transaction-based applications. The goal is somewhat comparable to that of Delta-t. VMTP introduces so-called message transactions that model remote procedure calls. A message transaction comprises a request and one or multiple optional response messages. Message transactions can be forwarded directly from one server to another without involvement of the client. The design of VMTP incorporates implicit and timer-based connection handling derived from Delta-t. Furthermore, VMTP supports multicasting, that is, a transaction request can be sent to multiple servers. It is among the first protocols to address the need for specific multicasting support. A novelty of VMTP is its 64-bit unambiguous connection identifiers, which are independent of the network layer address. This enables the migration of VMTP entities in the network, for example, to implement load balancing in distributed systems. VMTP further applies a rate control mechanism in order to control the data flow between communicating entities.
The mechanism is comparable to that of NETBLT but differs in one respect: VMTP uses so-called interpacket gaps, that is, it controls the gaps between two consecutive data units. Therefore, a finer timer granularity is required compared to the interburst gaps controlled in NETBLT. However, VMTP was developed from the beginning with a hardware-supported implementation in mind. Thus, a higher timer resolution could be integrated more easily than within a software implementation.
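The timer-granularity difference is easy to quantify: the idle interval that holds the link at a target rate is simply the unit size divided by the rate. The packet and burst sizes below are illustrative choices, not values from either protocol specification.

```python
# Timer granularity: VMTP-style interpacket gaps vs. NETBLT-style
# interburst gaps at the same target rate.

def gap_s(size_bits, rate_bps):
    """Interval between unit starts needed to hold the link at the rate."""
    return size_bits / rate_bps

TARGET_RATE = 1.2e9                            # 1.2 Gbit/s target rate

packet_gap = gap_s(1024 * 8, TARGET_RATE)      # per 1 kbyte packet: ~6.8 us
burst_gap = gap_s(64 * 1024 * 8, TARGET_RATE)  # per 64 kbyte burst: ~437 us

# Pacing individual packets makes the timer fire 64x more often than
# pacing whole bursts, which is why a hardware-supported implementation
# made VMTP's finer granularity tractable.
print(packet_gap, burst_gap)
```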
The idea of a hardware implementation also influenced the packet format defined in VMTP. Pipelined processing of different header fields (e.g., encryption, checksum) is directly supported.

Universal Receiver Protocol. The protocol URP (universal receiver protocol) was developed in the mid-1980s at AT&T in order to provide universal data transport across their Datakit network. The communication structure of URP is based on simplex transmissions; thus, a full-duplex connection is composed of two simplex connections, one per direction. Data are handled on the basis of 8-bit units. Two operating modes are distinguished for the receiver: block mode and character mode. In block mode, sequence-numbered blocks are used for data transmission. Block retransmission is available as an option in case of error. In character mode, data are transmitted as a character stream without the option of retransmission. The service provided by character mode, and by block mode without retransmission, is very different from the services discussed previously. Data are delivered error-free and in sequence, but some data in between may be lost. Moreover, in character mode, data losses cannot be detected. Thus, character mode reflects a very simple service that can be implemented with low overhead. Generally, URP is one of the first protocols to provide some flexibility with respect to the provided service. This is very important, especially with the advent of multimedia applications. The grade of URP service can be selected by the transmitter.

Since URP was designed for high-speed networks, optimizations were integrated into the protocol. One of the most interesting design issues is the relocation of some processing-intensive tasks from the receiver to the sender. This is motivated by the fact that the receiver usually represents the performance bottleneck. Protocol mechanisms that follow such an approach are called sender-driven mechanisms.
This is reflected mainly in the acknowledgment procedure of URP. The sender explicitly asks for an acknowledgment from the receiver, for example, after having sent a full window of data. Thus, the decision and the processing related to it are located at the sender side rather than at the receiver side. Furthermore, in block mode a so-called reject mechanism is implemented that basically reflects a selective acknowledgment. This speeds up retransmissions. The data format used in URP's block mode is very different from those of other protocols. It uses a trailer of 4 bytes that follows a block of data; no header is involved. This simplifies processing in hardware. A hardware-based implementation of URP was a vision of the protocol designers. URP is not widely used today. However, many of its concepts were inherited and enhanced in XTP, which has been among the most prominent protocols of the last several years.

Xpress Transport Protocol. The protocol XTP (Xpress transport protocol) was developed with the goal of a VLSI implementation. Moreover, it was targeted toward efficient support for real-time applications. The idea of a hardware implementation influenced the data format of XTP, similarly to VMTP and URP. However, it needs to be stated that the current data format of XTP 4.0 is very different from that described in the first specifications of XTP. For example, the control fields moved entirely from the packet trailer to the header. XTP introduces out-of-band signaling at the transport layer. It is the first protocol that uses out-of-band signaling over connectionless network services. The exchange of control information between XTP entities and the transfer of user data are clearly separated. Moreover, the protocol was designed as a set of orthogonal protocol functions that can be mapped onto a multiprocessor platform. Such a design also forms a sound basis for a protocol that provides multiple services. This idea was inherited from URP.
XTP provides many different mechanisms and is thus capable of supporting different services for a large variety of applications. Different strategies and options of protocol mechanisms can be selected by individual applications. Generally, XTP uses many of the protocol mechanisms introduced by earlier protocols, for example, selective acknowledgments and rate-based flow control. Moreover, the sender-driven mechanisms of URP, which relax processing requirements at the receiver, are applied. In some sense, XTP collected from earlier protocols the mechanisms that are suited to high-speed networks. XTP introduces a flexible addressing mechanism, that is, different addressing schemes can be applied (e.g., Internet addressing or OSI addressing). With this flexibility, XTP can be used in different networking environments, such as the Internet or OSI. However, today it is mainly used over the Internet, that is, over IP.
Furthermore, XTP provides multicasting. Different mechanisms have been supported over time. Multicast support forms an important issue for many emerging applications. Proper support is still under discussion today. Version 4.0 provides a reliable multicast service.
Evolution of TCP

As discussed in the previous section, many new transport protocols have been designed with respect to the characteristics of high-speed networks and to application requirements. However, none of these protocols is widely used today. TCP, which was developed over twenty years ago, clearly dominates the market. However, TCP has also changed over time; mechanisms have been integrated that solve some of its problems and support its use over high-speed networks. Some prominent examples are:

• avoidance of the Silly Window Syndrome
• window scaling
• connection count
TCP uses a sliding window mechanism for flow control and cumulative acknowledgments with go-back-N retransmission. In this context, one of the first performance problems observed was the so-called Silly Window Syndrome. It is characterized by a situation in which the receiver advertises only small windows and, thus, the sender transmits data in small segments (12). Such a situation is initially caused by a sender that, in the case of a push flag, has only a small amount of data to send, which is acknowledged immediately by the receiver. During long data transmissions (e.g., large file transfers) the sender cannot recover from such a situation, and the performance of the data transfer can decrease drastically. However, the Silly Window Syndrome can be avoided by using the following simple algorithms at the sender and receiver, respectively. The sender should refrain from sending small segments; as a general rule, the sender should only send if the data fill more than 25% of the window size. At the receiver side, two complementary algorithms should be implemented. First, the receiver should avoid advertising small windows: it should only offer a new window if at least some fraction of the overall window size (e.g., 50%) can be offered. Furthermore, the receiver should refrain from sending an acknowledgment at all if the push flag was not set in the received segment and no data are flowing in the reverse direction. This decreases the processing requirements of the sender, since the number of acknowledgments that it needs to process is reduced. These algorithms are suggested for use with TCP in the host requirements document (13). TCP uses a 16-bit field for window advertisements, and the basic unit for advertisements is a byte. Thus, the maximum window size wmax = 2^16 − 1 bytes = 65,535 bytes (about 64 kbyte). With this limited window size, high-speed networks cannot be highly utilized.
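The two avoidance rules can be sketched as simple predicates. The 25% and 50% thresholds are the ones quoted above; the full-segment (MSS) condition in the sender rule is the companion condition from RFC 813, and the function names are illustrative:

```python
def sender_may_send(usable_window, max_window, mss):
    """Sender-side SWS avoidance: transmit only if a full segment
    (one MSS) fits, or the usable window exceeds 25% of the largest
    window the receiver has advertised."""
    return usable_window >= mss or usable_window > 0.25 * max_window

def receiver_should_advertise(free_buffer, max_window, fraction=0.5):
    """Receiver-side SWS avoidance: offer a new window only when at
    least the given fraction of the overall window can be offered."""
    return free_buffer >= fraction * max_window
```

With these checks in place, neither side ever commits to a tiny segment, so a transfer that momentarily degenerates into small windows recovers on its own.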
In order to overcome this problem, a new option has been introduced in TCP that allows the use of other basic units for window advertisements, known as window scaling. The basic unit is advertised within a SYN segment during connection establishment (see Fig. 3). The basic units are multiples of a byte, and they are coded logarithmically; for example, the shift count 3 leads to a scaling factor S = 2^3 = 8. There exists an upper limit of S = 2^14 for the scaling factor in order to bound the highest usable sequence number. Proper support of client/server applications led to another new TCP option, the so-called connection count. Connection counts are used to avoid the handshake procedure for connection establishment. This version of TCP is called T/TCP (transaction TCP), and the bypass mechanism is called TAO (TCP accelerated open). The connection count is monotonically increased every time a new connection is established. The server stores the last connection count received from a client. If a SYN segment is received by the server, the server compares the received connection count with the stored value. If the received connection count is larger, the connection can be established immediately. If it is smaller, the normal handshake procedure is used.
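Both options can be illustrated in a few lines; the function names are illustrative:

```python
def scale_factor(shift):
    """TCP window scale option: the SYN carries a logarithmically
    coded shift count; the scaling factor is S = 2**shift, capped at
    S = 2**14 to bound the highest usable sequence number."""
    if not 0 <= shift <= 14:
        raise ValueError("window scale shift must be in 0..14")
    return 1 << shift

def scaled_window(advertised_16bit, shift):
    """Effective window: the 16-bit advertisement times the factor."""
    return advertised_16bit * scale_factor(shift)

def tao_accepts(received_cc, cached_cc):
    """T/TCP accelerated open: a SYN whose connection count exceeds
    the server's cached value for this client may skip the three-way
    handshake; otherwise the normal handshake is used."""
    return received_cc > cached_cc
```

With the maximum shift of 14, the advertisable window grows from 65,535 bytes to roughly 1 Gbyte, enough to fill high bandwidth-delay-product paths.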
Fig. 3. Transaction support of T/TCP.
Implementation Techniques

During the past few years, implementation techniques for communication systems have been discussed for various reasons. The two main reasons are efficiency and flexibility. The latter arises especially in environments with networked multimedia applications and is not discussed further here. Given that currently available transmission technology can deliver data across networks at rates exceeding the gigabit per second range, one challenge in building high-performance communication systems is the design of efficient protocol implementations. Considering the increasing gap between the growth of physical network bandwidth and the growth in processing power available for protocol processing, multiprocessor platforms were considered to overcome performance bottlenecks. The communication system then has to be adequately partitioned in order to make use of such architectures. Besides processing requirements, data manipulation functions (i.e., memory accesses and data copies) play a crucial role in high-performance communication systems. The technique of integrated layer processing has been introduced to overcome this bottleneck. Due to the many difficulties with this technique, no major breakthrough has been achieved so far.

Parallel Protocol Processing. In this section, the different types and levels of parallelism that can be applied in communication systems are introduced first. Then, some of the most prominent projects in parallel protocol processing are briefly summarized and their results are evaluated.

Types of Parallelism. Two types of parallelism can be distinguished:

• spatial parallelism
• temporal parallelism
Spatial parallelism is based on the mapping of the processing task onto multiple concurrently operating processing units. Two types of spatial parallelism can be distinguished:

• SIMD-like organization
• MISD-like organization
Spatial parallelism based on an SIMD (single instruction, multiple data) organization reflects parallelism among identical tasks independently and concurrently operating on different data units. A scheduling discipline, for example, round robin, may be used to allocate data to processing units. An SIMD organization requires only a minimum of synchronization among the processing units. It does not decrease the processing time for a single data unit; rather, it increases the number of data units processed during a certain time interval. The performance benefits are limited by the maximum number of concurrently utilizable
processing units, which can be approximately calculated by multiplying the mean packet processing time by the target throughput. Spatial parallelism based on an MISD (multiple instruction, single data) organization is associated with concurrent processing of different tasks on the same data. Applying an MISD organization reduces the processing time required for a single data unit, since multiple tasks are processed concurrently. However, a high degree of synchronization among the processing units may be required. Careful balancing of the system is needed to ensure that synchronization overhead does not paralyze system performance. Temporal parallelism is provided by the concept of pipelining, which is similar to assembly lines in industrial plants. To achieve pipelining, the processing task has to be subdivided into a sequence of subtasks, each mapped onto a different pipeline stage. Thereby, each pipeline stage processes different data at the same point in time. The protocol VMTP especially addressed this issue in its design. Pipelining does not decrease the processing time needed for a single data unit; it increases the completion rate of packets and, consequently, the system throughput. Since the performance of a pipeline is limited by its slowest stage, stage balancing is an important issue.

Levels of Parallelism. In communication subsystems, mainly four different levels of parallelism can be distinguished, based on the granularity of an atomic unit:

• stack level
• entity level
• function level
• intra-function level
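The sizing rule for an SIMD organization quoted above (required units ≈ mean packet processing time × target packet rate) can be sketched as follows; the packet size and per-packet processing time are assumed, illustrative values:

```python
import math

def required_units(mean_processing_time_s, target_pps):
    """Approximate count of concurrently utilizable SIMD processing
    units: mean per-packet processing time times target packet rate."""
    return math.ceil(mean_processing_time_s * target_pps)

# Assumed illustrative workload: a 1 Gbit/s link carrying 1500-byte
# packets (about 83,333 packets/s) at 50 us of processing per packet.
pps = 1_000_000_000 / (1500 * 8)
units = required_units(50e-6, pps)  # about 5 units
```

Beyond that unit count, additional processors sit idle because the packet arrival rate, not the processing capacity, becomes the limit.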
The stack level forms the most coarse-grained level of parallelism applied to communication subsystems. An atomic unit comprises a complete protocol stack consisting of different protocol layers. Mainly spatial parallelism with an SIMD organization can be implemented at the stack level. Each stack can be associated with a separate connection or a dedicated application data stream (e.g., in the case of connectionless services). In this case, packets are scheduled on a per-connection basis, and parallelism takes place among concurrent connections. However, this approach does not improve the performance of a single stack; moreover, load balancing between the stacks cannot be influenced, since the relation of packets to corresponding connections dictates the data distribution over the processing units. As a consequence, some processing units may be heavily loaded, requiring additional processing power, whereas others are lightly loaded or even idle. An alternative approach using per-packet scheduling may lead to a more beneficial load distribution. Packets are scheduled based on the availability of processing units, independent of their relation to a connection or an application data stream. The performance of a single connection can be increased by concurrently processing multiple data units associated with a single connection. However, this solution requires synchronization among processing units concurrently operating on different packets belonging to the same connection. The entity level provides a finer granularity. An atomic unit is either associated with a complete protocol or, more modularly, with the receive entity formed by the receive part of the protocol and the send entity formed by the send part. Spatial parallelism can be applied, resulting in the same trade-offs as at the stack level. Entities belonging to different protocol layers can be implemented in a layer pipeline using temporal parallelism to form a complete protocol stack.
The layer pipeline can further be subdivided into a receive pipeline and a send pipeline, which are mutually independent to a large extent. They may coordinate their operation on data units belonging to the same connection by using common connection control information. Furthermore, synchronization between the send and receive pipelines may be required. At the function level, protocol functions are used as atomic units (e.g., connection establishment, flow control). A first and necessary step in that direction is the analysis of protocols with a view to extracting the intrinsic parallelism among protocol functions. The result of this analysis can be given in the form of a dependency graph representing the relationships among the protocol functions, mainly in terms of data dependency.
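Such a dependency analysis can be sketched with Python's standard topological sorter. The protocol functions and edges below are hypothetical, not taken from an analysis of any real protocol; the point is only that functions in the same stage have no mutual data dependency and may run concurrently:

```python
from graphlib import TopologicalSorter

# Hypothetical data dependencies among protocol functions: each entry
# maps a function to the set of functions whose output it needs.
deps = {
    "header_parse": set(),
    "checksum": {"header_parse"},
    "reassembly": {"header_parse"},
    "flow_control": {"header_parse"},
    "ack_processing": {"checksum"},
    "delivery": {"reassembly", "ack_processing"},
}

ts = TopologicalSorter(deps)
ts.prepare()
stages = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # functions with no unmet dependencies
    stages.append(ready)            # each stage can run concurrently
    ts.done(*ready)
```

Each entry of `stages` is one wave of functions that the dependency graph permits to execute in parallel.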
At the intra-function level, concurrent processing is applied in order to increase the performance of a single protocol function. Examples include computation-intensive functions, such as checksum processing, encoding, or encryption, and functions operating on large data bases, for example, the routing function.

Experiences in Parallel Protocol Processing. Many projects on parallel protocol processing were started toward the end of the 1980s. In this article, only a few are selected that provide dedicated experience and cover the main areas just presented. Most projects experimented with the OSI protocol stack; few addressed new protocols, such as XTP. One project intensively investigated parallelism at the stack level (18). Important issues that need to be considered at that level are packet ordering, handling of segmented data, access to connection control information, and shared protocol data (e.g., the IP routing table). Packets are scheduled on a packet-by-packet basis among the involved stacks. In (19) a centralized internal sequence number scheme is applied to packets as well as to control information. Based on this, processing of packets or control information arriving out of order is postponed until a processor identifies them as the next to be processed. To avoid any synchronization problems associated with concurrent processing of connection control information, a separate disjoint processing stage is implemented. This is a proper design in cases where packet processing dominates control processing, for example, in the presentation layer. Since scheduling is performed on a per-packet basis, segments of the same data unit may be processed by different processing units. The reason is that during reception from the network it cannot be determined whether a segment of some higher layer packet is received. This problem of segmented data units can be avoided by using per-connection scheduling at the stack level.
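A centralized internal sequence number scheme of the kind described in (19) can be sketched as follows; the class name and interface are illustrative:

```python
import heapq

class Resequencer:
    """Tag packets with an internal sequence number at ingress and
    release finished packets strictly in that order; out-of-order
    completions are postponed until they are next to be processed."""

    def __init__(self):
        self.next_expected = 0
        self.pending = []  # min-heap of (seq, packet)

    def submit(self, seq, packet):
        """Hand in a finished packet; return all packets that are now
        releasable in order (possibly none)."""
        heapq.heappush(self.pending, (seq, packet))
        released = []
        while self.pending and self.pending[0][0] == self.next_expected:
            released.append(heapq.heappop(self.pending)[1])
            self.next_expected += 1
        return released
```

A processor that finishes packet 1 before packet 0 simply parks its result in the heap; the result is released as soon as packet 0 arrives.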
However, this requires an unambiguous identification of higher layer connections during packet reception from the network, that is, intelligent packet filters are needed. The results reported in (18) were achieved with up to four processors: a performance improvement of up to 3.5 was achieved with four processors. The results were highly dependent on the size of a data unit. Large data units allow for a higher degree of parallelism and, thus, lead to better performance. Few projects addressed function-level parallelism, some of them with dedicated hardware support (20,21), some using transputer networks (22,23). The XTP project even tried to directly couple protocol design and VLSI (very large scale integration) implementation (24). It failed with respect to the VLSI implementation. However, XTP became a very prominent protocol that highly influenced the development of communication systems over recent years, especially with respect to integrated services and multicasting. The projects that used transputer networks for parallel protocol processing applied either function-level parallelism (15,22) or entity-level parallelism (25). All of them needed to overcome the distributed memory architecture associated with transputer networks. In (25) a global memory for several transputers was implemented. Generally, these projects achieved some performance advantage compared to traditional implementations. However, with respect to the increased cost due to multiprocessing, these numbers are not very promising. This mismatch between cost increase and performance speed-up can be seen as one of the main reasons why parallel protocol processing has not succeeded so far. In (25) a speed-up of 1.55 is reported with two transputers and a speed-up of 2.17 with four transputers. The reason for the low speed-up numbers can be found in the uneven load balancing that occurs because the protocols were not designed for parallel processing.
In (22) a speed-up of 3.73 was reported with the use of eight transputers. Again, uneven load balancing is the reason for this comparatively low speed-up. In Ref. 26 a protocol processor consisting of multiple processors and memory modules was applied for the implementation of a light-weight transport protocol. The main characteristic of the protocol is the exchange of complete state information instead of state updates. Processing of 20,000 headers/s is expected to be feasible. The operating system was identified as the performance bottleneck. Located between entity-level parallelism and function-level parallelism are approaches that accelerate the processing of an entity by using dedicated hardware for some performance-critical functions, such as buffer management, timer management, and the like. In (27) a dedicated software process was in charge of buffer management. The XTP project, for example, targeted dedicated VLSI support of buffer management.
Furthermore, the usage of specialized memory chips, such as CAMs (content-addressable memories), was addressed in (20) in order to speed up timer management. For the implementation of the retransmission timer, CAMs were used. However, the usage of special memory did not succeed, mainly for cost reasons. Checksumming is a typical protocol function that can be implemented in hardware. In transport protocols it usually requires read access to the user data along with some arithmetic calculations. Depending on the location of the checksum in the data unit, this function can be processed on the fly during data movements, applying pipelined parallelism and reducing the number of memory accesses needed. Advanced network interfaces provide support for on-the-fly processing of transport layer checksums on the network adapter. Pipelining is usually applied at the interface between the network adapter and host processing. Efficient implementation techniques are required to avoid this interface becoming the system bottleneck. Mapping the adapter memory into the host's virtual memory is a possible way of avoiding data movements between host and adapter. Examples of memory-mapped network interfaces are (28) and (29).
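The idea of folding checksum computation into the data movement itself, so that the user data is read from memory only once, can be sketched in software. The 16-bit one's-complement sum is the style used by the Internet transport protocols (RFC 1071); hardware does the same accumulation while the bytes stream past:

```python
def copy_with_checksum(src: bytes):
    """Copy src and accumulate the 16-bit one's-complement sum in the
    same pass, instead of one copy loop plus one checksum loop."""
    dst = bytearray(len(src))
    total = 0
    for i, byte in enumerate(src):
        dst[i] = byte                                  # the data movement
        total += byte << 8 if i % 2 == 0 else byte     # high or low byte of the 16-bit word
        total = (total & 0xFFFF) + (total >> 16)       # end-around carry
    return bytes(dst), (~total) & 0xFFFF
```

The single pass halves the number of memory reads compared to copying first and checksumming afterwards, which is exactly the saving that integrated layer processing and on-adapter checksumming aim for.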
Summary

The advent of high-speed networks with a high path capacity stimulated the development of so-called light-weight transport protocols. This article provided an overview of the most important light-weight protocols and the novelties that were introduced by these protocols. Moreover, the evolution of TCP was briefly summarized with respect to high-speed networks. It can be observed that some of the mechanisms introduced within the context of light-weight protocols are now being used in TCP. Examples are implicit connection establishment and timer-based connection termination. Thus, it can be expected that TCP will further adopt promising mechanisms if they are required by dedicated applications. An example that is under discussion is the support of selective acknowledgments. Furthermore, it can be expected that TCP will stay among the most popular protocols; it will not be replaced by a light-weight transport protocol. However, with respect to multicasting, it might be replaced by transport protocols that have dedicated multicasting support. An excellent investigation of protocol mechanisms for reliable multicast protocols can be found in Ref. 30. The increased network speed also imposes higher processing requirements on communication systems. Parallel processing can be seen as one approach to increase processing power. The possibilities of applying parallelism to protocol processing have been briefly summarized, and some of the most interesting projects have been selected and presented with their main achievements. However, it must be stated that, generally, parallel protocol processing has not succeeded, for several reasons. Protocols are not designed for parallel processing, which leads to problems in load balancing among the processors involved. Thus, no linear speed-up can be achieved. This leads to the problem that the increased cost due to parallel processing cannot be justified.
The most promising approach can be seen in entity acceleration by implementing dedicated functions in hardware. Implementations compliant with this approach can be found quite often (e.g., hardware-based on-the-fly checksumming). Finally, it needs to be stated that even entity acceleration may become less important for certain protocol functions in the future, because processors provide more and more dedicated instructions for performance-intensive functions related to multimedia (e.g., video coding (31)).
BIBLIOGRAPHY

1. R. W. Watson, Timer-based mechanisms in reliable transport protocol connection management, Computer Networks, 5: 47–56, 1981.
2. R. W. Watson, The Delta-t transport protocol: features and experience, in H. Rudin, R. Williamson (eds.), Protocols for High-Speed Networks, Elsevier (North-Holland), 1989, pp. 3–18.
3. D. Clark, M. Lambert, L. Zhang, NETBLT: a bulk data transfer protocol, Request for Comments RFC 998, March 1987.
4. D. Clark, M. Lambert, L. Zhang, NETBLT: a high throughput transport protocol, Proceedings of the ACM SIGCOMM ’86, Stowe, VT, 1986, pp. 353–359.
5. D. Cheriton, VMTP: a protocol for the next generation of communication systems, ACM SIGCOMM ’86, Stowe, VT, pp. 406–415.
6. D. Cheriton, C. Williamson, VMTP as the transport layer for high-performance distributed systems, IEEE Commun. Mag., 27: 37–44, June 1989.
7. D. Cheriton, C. Williamson, VMTP: versatile message transaction protocol – protocol specification, Request for Comments RFC 1045, February 1988.
8. A. G. Fraser, W. T. Marshall, Data transport in a byte-stream network, IEEE J. Selected Areas Commun., 7: 1020–1033, September 1989.
9. W. Strayer, B. Dempsey, A. Weaver, XTP: The Xpress Transfer Protocol, Reading, MA: Addison-Wesley, 1992.
10. XTP Forum, Xpress transport protocol specification revision 4.0, Technical Report, March 1995.
11. W. Doeringer et al., A survey of light-weight transport protocols for high-speed networks, IEEE Trans. Commun., 38: 2025–2039, 1990.
12. D. Clark, Window and acknowledgement strategy in TCP, Request for Comments RFC 813, July 1982.
13. R. Braden (ed.), Requirements for internet hosts – communication layers, Request for Comments RFC 1122, October 1989.
14. W. Stevens, TCP/IP Illustrated, Volume 3, Reading, MA: Addison-Wesley, 1996.
15. M. Zitterbart, High speed transport components, IEEE Network Magazine, 54–61, January 1991.
16. D. Clark, D. Tennenhouse, Architectural considerations for a new generation of protocols, ACM SIGCOMM ’90, 200–208, September 1990.
17. B. Ahlgren, M. Bjorkman, P. Gunningberg, Integrated layer processing can be hazardous to your performance, in W. Dabbous, C. Diot (eds.), Protocols for High-Speed Networks, V, London: Chapman & Hall, 1996.
18. R. Lam, M. R. Ito, On the parallel implementation of OSI protocols processing systems, Proceedings IPPS 94, Cancun, Mexico, April 1994.
19. M. Goldberg, G. Neufeld, M. Ito, A parallel approach to OSI connection-oriented protocols, in B. Pehrson, P. Gunningberg, S. Pink (eds.), Protocols for High-Speed Networks, III, Elsevier (North-Holland), 1992, pp. 219–232.
20. D. Feldmeier, An overview of the TP++ transport protocol project, in A. N. Tantawy (ed.), High Performance Networks: Frontiers and Experience, Norwell, MA: Kluwer Academic Publishers, 1993.
21. T. F. La Porta, M. Schwartz, Performance analysis of MSP: a feature-rich high-speed transport protocol, IEEE/ACM Trans. Networking, 1 (6): 740–753, 1993.
22. M. Zitterbart, Parallelism in communication subsystems, in A. N. Tantawy (ed.), High Performance Communications, Norwell, MA: Kluwer Academic Publishers, 1994, pp. 177–194.
23. T. Braun, M. Zitterbart, A parallel implementation of XTP on transputers, 16th IEEE Conference on Local Computer Networks, Minneapolis, MN, October 1991, pp. 91–96.
24. Protocol Engines, PE-1000 Series, Protocol Engine Chipset, 1992.
25. M. Kaiserswerth, The parallel protocol engine, IEEE/ACM Trans. Networking, 1: 650–663, 1993.
26. A. N. Netravali, W. D. Roome, K. Sabnani, Design and implementation of a high-speed transport protocol, IEEE Trans. Commun., 38 (11): 2010–2024, 1990.
27. M. Kaiserswerth, A parallel implementation of the ISO 8802-2.2 protocol, IEEE TRICOMM ’91, Chapel Hill, NC, April 1991.
28. M. Blumrich et al., Virtual-memory-mapped network interface, IEEE Micro, February 1995.
29. R. Minnich, D. Burns, F. Hady, The memory-integrated network interface, IEEE Micro, February 1995.
30. D. Towsley, J. Kurose, S. Pingali, A comparison of sender-initiated and receiver-initiated reliable multicast protocols, IEEE J. Selected Areas Commun., 15 (3): 398–406, 1997.
31. A. Peleg, S. Wilkie, U. Weiser, Intel MMX for multimedia PCs, Communications of the ACM, 40 (1), January 1997.
MARTINA ZITTERBART
TU Braunschweig
Wiley Encyclopedia of Electrical and Electronics Engineering
Intelligent Networks
Standard Article
J. Place and J. Stach, University of Missouri—Kansas City, Kansas City, MO
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5310
Article Online Posting Date: December 27, 1999
Abstract

The sections in this article are: IN Telecommunications Standards Bodies; Transaction-Based Call Processing; Intelligent Network Conceptual Model; IN Services; New Telecommunications Services and the IN; IN Evolution; Conclusion.

Keywords: telecommunications; advanced intelligent network; AIN architecture
INTELLIGENT NETWORKS

Telecommunications systems around the world are facing dramatic changes. Customers are clamoring for new services, and technology is progressing at a disconcertingly fast pace. Concurrently, telecommunications service providers face many mandates from regulatory agencies, such as the maintenance of low-cost universal service. The only constant for both telecommunications providers and subscribers is rapid change. At the heart of these rapid changes is the evolution of a concept called the intelligent network (IN). The intelligent network has evolved as a way to speed up the development and introduction of telecommunications services and to provide those services in an efficient and cost-effective manner. The intelligent network is a concept designed to extend the capabilities of telecommunications systems. Our current telecommunications system includes both wireline and wireless networks, and both are heavily dependent on the IN. The IN was designed to provide telecommunications services independent of call connection services, i.e., switching processes. Additionally, the IN was designed to be independent of hardware, service providers, and network protocols and to be an overarching construct spanning all service providers. For example, prior to the deployment of the IN, 800 services required every switch to have the capability to translate the 800 number to the actual called number and to bill correctly. Additionally, every switch had to have the correct number translation table to connect the call correctly, and when new 800 numbers were added, the translation tables had to be updated in all switches. The IN moved the 800 service from the call connection system to the call services system; thus new numbers are added to an IN 800 database, which can be accessed by many switches. The IN separates call connection from service provision.
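The 800-service example above can be sketched as a single shared lookup replacing per-switch translation tables; all numbers and routing values here are invented for illustration:

```python
# Centralized IN service logic: one shared 800 database queried by
# every switch, instead of a translation table copied into each one.
IN_800_DATABASE = {
    "8005551234": "8165550100",  # 800 number -> actual routing number
}

def route_800_call(dialed):
    """A switch recognizes the 800 prefix and queries the shared IN
    database; adding a new 800 number then means one database update
    rather than an update in every switch. Returns None for an
    unknown 800 number."""
    if dialed.startswith("800"):
        return IN_800_DATABASE.get(dialed)
    return dialed  # ordinary numbers are connected directly
```

The separation of call connection (the `route_800_call` path in each switch) from service provision (the shared database) is exactly the split the IN introduces.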
IN TELECOMMUNICATIONS STANDARDS BODIES The International Telecommunication Union is the international body developing IN standards. The ITU is an agency of the United Nations. The European Telecommunications Standardization Institute (ETSI) recommends standards for Europe and Bellcore develops North American IN standards. International Telecommunication Union (ITU) The ITU was reorganized in 1993 into three sections: • Radio Communications (ITU-R). This section was formed by combining the International Radio Consultative Committee (CCIR) and the International Frequency Board (IFRB). The ITU-R is concerned with the allocation of the RF spectrum, and its regulations are distributed through the Radio Regulations Board. • Telecommunications Standards (ITU-T). This was formed from the International Consultative Committee
on Telegraphy and Telephony (CCITT). ITU-T recommendations work to standardize global telecommunications and are usually distributed through conferences and workshops. • Telecommunications Development (ITU-D). This was formed from the Bureau of Telecommunications Development (BTD) and develops technical specifications for international public telephone networks. ITU standards for IN are divided into capability sets, e.g., CS-0, CS-1, CS-2, and are the standard for Europe and Asia. The ITU documents are generally referred to as the ‘‘Q.12xy’’ series, where the ‘‘x’’ digit identifies the capability set. ‘‘Recommendation Q.1200’’ is the general CS-0 IN structure document (1). The ‘‘Q.1201’’ series includes recommendations for IN architecture principles (2), IN service plane architecture (3), global functional plane architecture (4), distributed functional plane architecture (5), physical plane architecture (6), and other key CS-1 interface components (7–12). Recommendations for IN CS-2 may be found in the ‘‘Q.122y’’ series, which includes an introduction to IN CS-2 (13), the service plane for IN CS-2 (14), and other CS-2 recommended standards (14–17). See the ITU World Wide Web site at http://www.itu.ch/ for more information. European Telecommunications Standardization Institute (ETSI) The ETSI defines technical standards that are generally consistent with ITU recommendations but do not fully adopt the ITU recommendation structure. Some ETSI standards documents are the IN user’s guide for CS-1 (18), IN ETSI:61010 (19), IN distributed functional plane (20), and others (21,22). See the ETSI World Wide Web site at http://www.etsi.org/sitemap/ for more information. Bell Communications Research (Bellcore) Standards for the IN in North America are called Advanced Intelligent Network (AIN) releases and are produced by Bellcore, e.g., AIN 0, 0.1, 0.2. See the Bellcore World Wide Web site at http://www.bellcore.com/ for more information.
The evolution of IN standards from both Bellcore and the ITU continues. Telecommunications Information Networking Architecture (TINA) and Information Networking Architecture (INA), which are evolutionary offshoots, are discussed separately later. The two IN standards from the ITU and Bellcore, never very far apart, are converging, and it is widely believed that the ITU standards eventually will prevail because of the necessity of producing global telecommunications standards that ease worldwide integration and interoperability. For more detail about telecommunications standards, see Refs. 23 and 24.

TRANSACTION-BASED CALL PROCESSING

New telecommunications services have one thing in common: They require extensive software and hardware support from the underlying telecommunications network. To better understand the current status of telecommunications, we need to look back in time. IN as an architecture is a natural evolution of the basic telecommunication network.
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright # 1999 John Wiley & Sons, Inc.
INTELLIGENT NETWORKS
The public switched telephone network (PSTN) in North America, prior to the breakup of AT&T, was served by a hierarchy of switches with five levels. Class 5 switches, or end-offices, terminated the subscriber local loop. Calls were routed from end-office to end-office over interoffice trunks, or they were directed to class 4 or higher switches for further routing depending on the called party and traffic congestion. The class 4 switch was called a toll center. Above it were three higher switch classes: class 3, class 2, and class 1. Call routing was a function of the called party location and the traffic volume. Obviously, the shortest route was always the best, but it was not always available, and thus the call was occasionally directed to higher-level switches for routing. With the breakup of AT&T and the subsequent separation of local and long-distance service, the upper layers of the switch hierarchy were abandoned for a fully interconnected network of digital switches. The end-office still terminates the subscriber loop but has become much more than a termination center. The end-office terminates traditional analog voice circuits, but it also has a connection to the mobile telephone switching offices (MTSOs) serving the wireless carriers for the area. The end-office also terminates T1 and higher-speed lines that bypass the switch and are connected to a local connection matrix as well as to the LEC's access tandem for inter-LATA and intra-LATA calls. Until direct distance dialing (DDD) was deployed in the early 1950s, all information needed to create connections was stored on individual telephone switches. Stored program control (SPC), which made telephone switches specialized computers, was implemented in network switches such as AT&T's No. 1 ESS, which was deployed in 1965 (25). No. 1 ESS provided residential services such as "call waiting" through software resident in the switch.
As SPC logic became more complex, it took longer to develop and test new services, and thus it became increasingly expensive and time-consuming to deploy each new innovation because each of the 15,000+ switches in the country had to be loaded with the new software. It took 3 to 4 years from conception to delivery for a new service. It was not uncommon to see switch logic exceed a million lines of program code. Many existing switches had to be extended by adjunct processors—special-purpose computers—to correctly interpret dialed numbers and execute their associated services. It became increasingly evident that call connection and call service activities had to be separated. AT&T first accomplished this when it introduced centralized databases to support "Calling Card" and "800 Service." These facilities were implemented at a network control point (NCP) and were accessed by a specialized signaling network called the common channel interoffice signaling (CCIS) network. In the 1980s the ITU defined common channel signaling system no. 7 (CCS7), which became the intelligent network interconnection mechanism. CCS7 is a packet network that is used for call setup (i.e., out-of-band signaling) and is separate from the resources used to connect the call. After the forced divestiture of the Bell System in the United States, the RBOCs also deployed centralized databases to support 800 service and alternate billing services. The switches, databases, and operations systems that formed these services were collectively referred to as Intelligent Network 1 (IN/1) (26,27). Bellcore recognized that the separation of interconnection and service provision facilities offered huge
potential for service development and began development of an expanded IN architecture called IN/2. RBOCs realized that IN/2 could not be implemented in a timely manner and focused on a subset of IN/2 functionality that could be deployed in stages over several years. At about the same time, switch vendors did not believe that they could deliver sufficient performance to support all of the specified services. Bellcore called a multivendor forum to resolve the concerns of all parties, and the results of this forum were published in March 1990 (28). The next stage of IN development was called the Advanced Intelligent Network (AIN) and was defined by a series of numbered releases starting with release 0. AIN 0.1 has been deployed, and AIN 0.2 enhancements are currently being deployed.

In the section entitled "Intelligent Network Conceptual Model," we describe the IN conceptual model. In the section entitled "IN Services," we discuss the IN service structure. In the section entitled "New Telecommunications Services and the IN," we discuss some new telecommunications services and their dependence on the IN. In the section entitled "IN Evolution," we describe the evolution of IN, and we conclude the chapter in the section entitled "Conclusion."

INTELLIGENT NETWORK CONCEPTUAL MODEL

Organization and Interface. The overarching purpose of the IN is to provide a framework to deploy advanced telecommunications services to subscribers. We describe IN components and their general function within the framework of providing these subscriber services. Although implementations differ slightly from vendor to vendor and from service provider to service provider, intelligent network services are delivered by well-defined interactions between switching systems and intelligent network service logic.
A key justification for the deployment of the intelligent network is to provide a telecommunications infrastructure that facilitates the implementation of sophisticated subscriber services. The best way to view the IN is as a reference model called the intelligent network conceptual model (INCM). The INCM is the basis for development of ITU IN standards (29). The INCM is a four-layer model which, from the bottom up, consists of the physical plane, the distributed functional plane, the global functional plane, and the service plane. The INCM is shown in Fig. 1. INCM layer descriptions follow.

IN Physical Plane. Entities in the physical plane perform the "real" functions that implement the IN—for example, they access and maintain support databases, route control packets, or make call-setup decisions. For more detail see Refs. 15, 24, and 30–32. The architecture of the physical plane is shown in Fig. 2.

• Service Switching Point (SSP). The SSP provides the access point to intelligent network services for the subscriber and executes the call model that describes actions to be taken by the IN service layer. The SSP detects IN service requests and formats requests for call implementation instructions from the AIN service logic. The SSP is a logical entity that coexists with the switch; that is, the SSP function has been embedded into switching points and is the access point into the IN.
• Service Control Point (SCP). The SCP provides the service control function (SCF) and the service data function (SDF). The SCP is responsible for responding to SSP queries resulting from interaction with the call model. The software modules that provide IN services are implemented in the SCP. The SCP provides instructions to the SSP on how to continue with call setup. The SCPs are duplicated—mated—for redundancy and to ensure proper response to service requests. The SCP is both a database and a processing environment. The processing environment contains service logic programs (SLPs) that provide specific services to requesting SSPs. The database portion provides information that is processed by the SLPs. For example, an SCP is responsible for number translation in an 800/888 service. The specific 800 SLP uses the dialed 800 number as a lookup argument to query its database for the real number of the called party. The SLP may also use time of day and caller location as additional criteria in selecting the phone number of the called party, e.g., dialing a regional number for a pizza delivery service could trigger an SCP query which would route the call to the location closest to the caller for faster delivery service. Modern SCPs are designed to handle thousands of transactions a second. As IN services expand, SCPs will become more narrowly specialized because it would be impossible for them to handle the volume of transactions that will be generated otherwise. Many researchers believe that almost all calls eventually will require an SCP transaction and many calls will generate several. With mandated local number portability service (LNP), sometimes referred to as the subscriber’s universal personal number (UPN), every call may require an SCP database lookup to find the real phone number to be used to route the call in the same way that 800/888 numbers require an SCP translation.
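The 800/888 translation flow just described can be sketched as a small lookup routine. The sketch below is illustrative only: the database contents, function name, and the day/evening rule are invented, not taken from any standard.

```python
from datetime import time

# Hypothetical SCP database for one 800 number: routing rules keyed by
# caller region, each with day and evening destinations.
SCP_DB = {
    "800-555-0100": {
        "east": {"day": "212-555-0111", "evening": "212-555-0199"},
        "west": {"day": "415-555-0122", "evening": "415-555-0188"},
    }
}

def slp_translate(dialed, caller_region, now):
    """Service logic program sketch: map a dialed 800 number to a real
    called-party number using caller location and time of day."""
    rules = SCP_DB[dialed][caller_region]
    # Treat 08:00-17:59 as the business day, everything else as evening.
    period = "day" if time(8, 0) <= now < time(18, 0) else "evening"
    return rules[period]

# The SSP would carry a query like this over CCS7 and resume call setup
# with the routing number the SLP returns.
print(slp_translate("800-555-0100", "west", time(14, 30)))  # → 415-555-0122
```

The essential point is that the translation table lives only on the SCP; the SSP merely sends the query and acts on the response.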
Figure 1. IN conceptual model.
• Service Data Point (SDP). The SDP is the platform for the standalone service data function (SDF). The SDP contains both customer and network data that is accessed as part of the execution of an IN service.
Figure 2. IN physical plane architecture.
• Intelligent Peripheral (IP). The IP is the platform that supports the specialized resource function (SRF). The IP responds to directives issued by the SCP or SSP and is used to play out announcements, synthesize speech, provide voice messaging, and perform speech recognition. The IP is accessed through an Integrated Services Digital Network (ISDN) link for better performance, or it may be connected to an SCP through the signaling subnetwork. Since the IP may be responsible for time-sensitive actions such as playing out announcements to the calling or called party, e.g., "Please enter your PIN now," a time-sensitive link is essential.

• Adjunct (ADJ). The ADJ provides the same service functions as the SCP and is considered functionally equivalent. The ADJ differs from the SCP only in the interface. The ADJ is connected to SSPs by high-speed links rather than the CCS7 network. Thus, the adjunct may be more suitable to support services that require very fast response. The adjunct usually provides specialized services and may be directly programmable by the service subscriber.
• Service Node (SN). The SN provides the CCF/SSF of the SSP, the SDF and SCF of an SCP, and the SRF of an IP in one physical entity. The SN allows highly specialized functions to be implemented in one physical device, thus reducing the CCS7 network latency and other interdevice communications overhead.

• Service Management Point (SMP). The SMP provides the service management function for all physical entities in the IN. The SMP allows maintenance and testing, and it provides the management interface between IN entities.

IN Distributed Functional Plane. Objects in the distributed functional plane (DFP) are called functional elements and are the service logic functions (software) associated with the hardware elements in the physical plane. Some of the functional elements in the DFP and their associated physical processors follow.

• SSF: Service switching function, which is associated with the SSP.
• SCF: Service control function, which is associated with the SCP.
• SDF: Service data function, which is associated with the SCP. In some implementations, the SDF can co-reside with the SCF in the SCP.
• SRF: Specialized resource function, which is associated with the IP.
• SMF: Service management function, which supports service creation, deployment, and maintenance, and is associated with the SMP.
• SCEF: Service creation environment function, which allows the specification and testing of IN services.

The DFP allows a software function to be viewed independently of the IN physical architecture. Obviously, the software function must at some time be physically implemented on a specific hardware platform once the function has been well-defined; however, this physical association becomes an engineering task.

IN Global Functional Plane. The components of the global functional plane (GFP) are called service-independent building blocks (SIBs). Subscriber services are defined and deployed by combining SIBs.
SIBs are standardized, architecture-independent functions that expect certain standard input arguments, such as calling number and dialed number, and produce certain standard output arguments, such as called party number. SIBs perform basic network functions such as collecting digits, verifying an ID number, or translating inputs. The abstractions of the GFP allow a service function definition that is independent of hardware. Thus IN engineers have some flexibility during function implementation. In Fig. 1, several SIBs (Xlate, Billing, Routing) are shown in the functional plane. These SIBs are used as building blocks to create features (800/888 service, call forwarding) in the service plane. A single SIB may be used to create several service features. For example, the three SIBs Xlate, Billing, and Routing are used to make up the 800/888 service feature. The Xlate SIB translates the 800/888 number into the actual called party number; the Billing SIB determines who should pay for call charges; and the Routing SIB uses time of day to determine the actual called party number. The same Routing SIB is used in the implementation of the call forwarding service feature.

IN Service Plane. Subscriber services are called features. Magedanz and Popescu-Zeletin (24) further divide features into "call-related" and "management-service-related" features. Call-related features include call waiting, speed dialing, and call forwarding, while management-related features deal with billing, network management, and service deployment.

IN SERVICES

The concept of IN service must be viewed from two perspectives. The first perspective is the definition of service as the addition of functionality for subscribers. Magedanz and Popescu-Zeletin (24) call this added functionality "value added" services. The second perspective involves video, voice, and signaling services and is called "bearer services." The IN service concept does not include specific subscriber services; rather, the IN is a platform for service development independent of a specific service definition. Included in each new IN version is an expanded set of components that are used to define and deploy specific subscriber services.

Basic Call Model

A special SIB called the basic call model (BCM) (24) coordinates the telephony and call processing portions of a call. The call model consists of a series of common subscriber actions such as "off-hook" and "digits dialed." A BCM is shown in Fig. 3. Figure 3 shows the call origination and termination BCM. Points in call (PICs) are checkpoints in call processing used by the SSP to determine if outside services are required (i.e., SCF services). The trigger control points (TCPs) in the trigger table indicate the specific service logic programs (SLPs) that diagnose and provide the requested service. SLPs make up the service control function and reside in the SCP. These SLPs are called to provide the service and guide the SSP in further processing the call.
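As a rough illustration of how SIBs compose into a feature, the sketch below chains hypothetical Xlate, Billing, and Routing functions into an 800-service feature. All names, data structures, and table contents are invented for illustration; real SIBs are specified abstractly by the standards, not as code.

```python
# Hypothetical SIBs: each takes and returns a small "call context" dict.
def xlate_sib(ctx):
    # Translate the dialed 800 number to a routable subscriber number.
    table = {"800-555-0100": "415-555-0122"}
    ctx["routing_number"] = table[ctx["dialed"]]
    return ctx

def billing_sib(ctx):
    # For 800 service the called party pays.
    ctx["billed_party"] = "called"
    return ctx

def routing_sib(ctx):
    # Route using the translated number (time-of-day logic omitted).
    ctx["route"] = ctx["routing_number"]
    return ctx

def make_feature(*sibs):
    """Global service logic sketch: chain SIBs into a service feature."""
    def feature(ctx):
        for sib in sibs:
            ctx = sib(ctx)
        return ctx
    return feature

# The same building blocks could be recombined into other features,
# e.g., call forwarding reusing routing_sib.
service_800 = make_feature(xlate_sib, billing_sib, routing_sib)
result = service_800({"dialed": "800-555-0100", "caller": "303-555-0177"})
print(result["route"], result["billed_party"])  # → 415-555-0122 called
```

The design point mirrored here is reuse: a single SIB appears unchanged in several feature chains.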
At each PIC, a decision is made in the switch about the need for AIN services. Figure 4 shows the relationship among CCF, SSF, and SCF logic. If services are required, call processing is suspended at the associated TCP. The trigger table indicates the service logic in the SCP required to progress the call, and the SCF returns call processing to the CCF logic after providing the requested service. At each PIC in the originating and terminating call model, the switch makes decisions about processing the call on a step-by-step basis. Actions taken by the calling or called party are noted by the SSP: "off hook" means the phone has been lifted out of its cradle, and "digits dialed" means the called party number has been dialed. The SSP determines at each PIC if additional outside services are necessary to complete the call. The specific action taken by the SSP depends on its SLP and on the specific PIC where the event took place. The critical point is that although the SSP still must recognize events at PICs, such as a subscriber dialing an 800/888 number, the SSP does not have to take action itself to provide the requested service; rather, the SSP off-loads the service request to another AIN component—an SCP or adjunct. The
Figure 4. IN service process.
actual service provision takes place outside of the SSP, thus freeing it for other tasks related more closely to routing calls. The PIC/TCP mechanism in the IN switch functions like a remote procedure call. Call processing in the switch is suspended while a service request message is transmitted across the CCS7 network to the appropriate service processor—either an SCP or an adjunct, depending on the type of service requested and the speed of the connection. The SLP in the service processor responds to the service request and creates a response message, which is formatted and transmitted across the CCS7 network back to the AIN switch. The call resumes processing in accordance with directions provided by the SCP or adjunct. A single PIC/TCP can be used to support a variety of services depending on the context of the call. At the PIC the SSP analyzes the data provided about the call, and the switch requests service from a specific SLP. The SLP may request additional information from the subscriber, such as a personal identification number. The SLP may invoke other processes, such as those that reside on an IP. The IP may be asked to play out a specific message or to collect digits from the subscriber. The notion of "backroom" processing by a network of service processors invisible to the subscriber allows the function of providing telecommunications services to be separated from the function of connecting calls because the service provision function has been off-loaded to other processors. Aside from greatly reducing the work required in the switch and allowing switch cycles to be more tightly allocated to call connection processing, off-loading the service function makes it easier to define new services, and the new services can be deployed much faster. The new service is deployed once on the SCP rather than in each switch in the PSTN.
The catch is, of course, that there must be a substantial infrastructure that allows global signaling among IN components; but once the infrastructure is in place, the deployment of new national telecommunications services is much easier. However, although service deployment is now easier, we are required to manage an increasingly complex signaling infrastructure.
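The suspend-query-resume behavior at a trigger control point can be sketched as below. The PIC names, trigger table layout, and message format are hypothetical stand-ins for what a real switch and SCP would exchange over CCS7.

```python
# Hypothetical trigger table: PIC name -> SLP armed at that point in call.
TRIGGER_TABLE = {
    "info_analyzed": "slp_800_translation",
}

def scp_query(slp_name, call_data):
    # Stand-in for the CCS7 round trip to the SCP: the named SLP
    # returns instructions for resuming call processing.
    if slp_name == "slp_800_translation":
        return {"resume_with": "415-555-0122"}
    return {"resume_with": call_data["dialed"]}

def process_pic(pic, call_data):
    """SSP logic at one point in call: fire a trigger if one is armed,
    otherwise continue basic call processing unchanged."""
    slp = TRIGGER_TABLE.get(pic)
    if slp is None:
        return call_data["dialed"]  # no trigger: route as dialed
    # Trigger armed: suspend, query the SCP, resume with its answer.
    instructions = scp_query(slp, call_data)
    return instructions["resume_with"]

print(process_pic("info_analyzed", {"dialed": "800-555-0100"}))
print(process_pic("collecting_info", {"dialed": "303-555-0177"}))
```

Note how the switch code never contains service logic; it only knows how to suspend at a PIC, ask, and resume, which is why new services need not touch every switch.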
Figure 3. Basic call model. (a) Originating call model. (b) Terminating call model.
Using the 800/888 number translation example, when number translation is done at an SCP connected to the AIN switch over a CCS7 network, the switch and the SCP work in parallel. In addition, new numbers can be added much more rapidly because all that is required is to update the SCP database instead of transmitting a new number translation table to all switches that perform number translation. For a new service to be deployed, its SLP is installed only on the SCPs that provide the service—not on every switch in the PSTN. Thus new services can be defined and deployed in an AIN environment in months instead of years.

Feature Interaction

A major IN problem is the unintended consequences caused by several features active together during one call. This is called feature interaction and refers to the problems caused when parties to a call have different feature sets active. For example, suppose a subscriber makes a call to the universal personal number (UPN) of another subscriber. Further, suppose the UPN translates to a number that is a toll call for the caller. Who pays the charges? The calling party expects to reach a local number and thus does not expect to be charged for the call. The called party is not interested in paying toll charges for unwanted calls, such as telemarketing calls, to his/her UPN number. How are these feature interactions to be handled? When a feature is introduced, its interaction with all existing features must be clearly defined. Thus when the UPN feature is introduced, the added complexity of possible toll charges must be included in the design of the feature. For example, if the called party UPN translates to a number resulting in toll charges, the UPN feature might notify the calling party that the call is a toll call and ask for billing acceptance. Additionally, the UPN feature may include a list of numbers for which the subscriber is willing to accept toll charges without notifying the calling party.
Clearly, the creation of features from SIBs is greatly complicated by the requirement to deal effectively with feature interaction.
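One way to encode the UPN toll-charge resolution described above is sketched below. The rule, the function name, and the data shapes are invented for illustration; a deployed feature would have to reconcile this rule against every other active feature.

```python
def resolve_upn_billing(caller, translated_is_toll, accept_list):
    """Hypothetical UPN feature rule: decide who pays and whether the
    caller must be asked for billing acceptance before a toll call."""
    if not translated_is_toll:
        # UPN resolved to a local number: normal billing, no prompt.
        return {"billed": "caller", "prompt_caller": False}
    if caller in accept_list:
        # Called party pre-accepted toll charges from this number.
        return {"billed": "called", "prompt_caller": False}
    # Otherwise notify the caller that the call is toll and ask.
    return {"billed": "caller", "prompt_caller": True}

accept = {"303-555-0177"}
print(resolve_upn_billing("303-555-0177", True, accept))
print(resolve_upn_billing("212-555-0111", True, accept))
```

Even this tiny rule shows why feature interaction is hard: the billing decision now depends on state owned by a different feature (the called party's acceptance list).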
Universal Personal Number

Personal number service, or universal personal number (UPN), is expected to be a big component of PCS, with 4.5 million personal numbers projected for the year 2000 (43). While the implementation of UPN differs from vendor to vendor (44), the single number applies regardless of the type of call—that is, FAX, voice, or e-mail. UPN allows a subscriber to be reached (conditionally) at a single telephone number regardless of his physical location. UPN also implies that wireless, voice, data, and FAX calls be automatically routed to the appropriate device. Clearly, the IN is critical to the development and deployment of UPN because as the service becomes available, an increasing percentage of called numbers require treatment. UPN will generally allow the subscriber to determine how he is to be reached and who is allowed access. Clearly, UPN depends on extensive and fast SCP processing.

IN EVOLUTION

The IN is a powerful mechanism for deploying telecommunications services. However, with this power comes great complexity. IN management is a critical component of the overall IN strategy. Management consists of operations, administration, maintenance, and provisioning (OAM&P). There are many IN component vendors, each with its own operations systems (OSs), that want to place their equipment in the Bellcore version of IN in North America and the ITU version of IN in Europe and Asia. Clearly, for the sake of interoperability, there must be movement to a common IN reference model, and there are several activities in this IN movement. These activities include the notion of an international telecommunications management network (TMN) (45) and an open distributed processing architecture (46). Bellcore is also developing a long-range view of IN in its information networking architecture (INA) (47,48) and in the telecommunications information networking architecture (TINA) consortium (49–54).

Telecommunications Management Network
NEW TELECOMMUNICATIONS SERVICES AND THE IN Personal Communications Services Personal communications services (PCS) is a good example of an IN transaction-intensive service (33–41). There are several generic functions necessary to support PCS. Several of those SCP functions are as follows: • Analyze time-of-day, calling number, and called number for routing or access directions to the SSP. • Collect and analyze user information such as personal identification number (PIN) for billing and call routing. • Database access and billing control functions are used to verify and update user database information and to create and store billing information. Bray (42) estimates that services, excluding PCS, will require SCPs to support query rates in the range of 1000 transactions per second.
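The PIN collection and verification function listed above could look roughly like the following on an SCP. The subscriber records, field names, and routing behavior are invented for illustration.

```python
# Hypothetical subscriber database held at the SCP.
SUBSCRIBERS = {
    "303-555-0177": {"pin": "4312", "roaming_number": "720-555-0144"},
}

def verify_and_route(upn, entered_pin):
    """SCP service logic sketch for a PCS call: verify the caller-entered
    PIN, then return the subscriber's current routing number."""
    record = SUBSCRIBERS.get(upn)
    if record is None or record["pin"] != entered_pin:
        # The SSP would direct an IP to play a denial announcement.
        return {"status": "denied"}
    return {"status": "ok", "route": record["roaming_number"]}

print(verify_and_route("303-555-0177", "4312"))
print(verify_and_route("303-555-0177", "0000"))
```

Since a lookup like this runs on every PCS call attempt, it makes concrete why the quoted query rates of roughly 1000 transactions per second matter for SCP sizing.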
One approach to consistent management across network components within a network and networks themselves is called Telecommunications Management Network (TMN) (55,60). TMN emerged in the early 1980s as a mechanism to effectively manage diverse operations systems developed by component vendors to support OAM&P efforts. TMN included a set of standards to ensure component and network interoperability. Conceptually, TMN produces a network of management systems. This overarching software system monitors and tunes entire telecommunications networks. Interfaces were standardized so that introduction of new components into the network was eased from the OAM&P perspective. While some success in this grand vision has been achieved, TMN has not yet been fully realized. There are several very difficult issues with the TMN concept that have not yet been resolved. Development stumbling blocks include the following: • TMN Complexity. Open Systems Interconnection (OSI) system management technology was selected as the basis
for TMN interfaces; and while these systems are very powerful, they are also quite complex. Thus, TMN interfaces are being deployed slowly.

• Legacy Systems. Developing TMN interfaces is very expensive, and thus developers require a strong incentive to deploy TMN. Since there are few TMN systems currently deployed on legacy systems, the incentive to develop TMN interfaces for new systems is weak.

• Alternative Protocols. TMN relies on OSI management standards, while the TCP/IP protocol suite uses the Simple Network Management Protocol (SNMP). Because TCP/IP is so widely available, there is pressure to use SNMP. However, since SNMP is simpler than OSI management systems, it is perceived to be less powerful.

TMN concepts are critical to the effective management of complex INs. The ability to automate these functions is essential to smooth network interoperability. See Magedanz and Popescu-Zeletin (24, Chapter 4) and other articles referenced here for more details about TMN concepts.

Information Networking Architecture

In 1990 Bellcore started working on a concept meant to be the successor to its AIN. The successor network concept was called the information networking architecture (INA). Basic INA concepts were specified by Bellcore in 1992 and 1993 and are described in Ref. 53. INA concepts require management software modules and service software modules to be separated and capable of working correctly anywhere in a distributed network of telecommunications service processors. The distributed processing environment of the INA concept has a kernel that will be present in every node and a set of transaction servers that provide the telecommunications service delivery function. See also Ref. 24, Chapter 5, for an INA overview.
Telecommunications Information Networking Architecture

At about the same time that Bellcore was developing the INA concepts, a group of network equipment vendors and network operators formed the Telecommunications Information Networking Architecture Consortium (TINA-C) (61) to specify an architecture that can support all network applications across all network types. Bellcore's INA concepts influenced the TINA-C, but the consortium took the evolution of the IN further. Specifically, TINA-C focused on four areas (24):

• Computing architecture concepts for designing and implementing a distributed computing environment based on the open distributed processing reference model.
• Service architecture concepts for designing and implementing the delivery of telecommunications services.
• Network architecture concepts for designing and implementing a transport network.
• Management architecture concepts for designing and implementing an OAM&P system across the distributed architecture.

The TINA initiative is well underway. There exist several proposals for TINA trials, most scheduled for mid-1998 (62).
CONCLUSION

Telecommunications service providers have dramatically expanded the services they offer to their subscribers. Also, the environment in which they operate has become much more complex, with competitive long distance, competitive local service, and competitive wireless service. Couple this dramatic increase in complexity with an increasingly demanding subscriber and we have a situation that forces us to carefully examine the service provision architecture. The evolution of the IN is in response to this growing complexity. The bright side of the IN is the speedy development of telecommunications services for sophisticated subscribers. The dark side of the IN is an increasingly complex entity that must be maintained and must evolve.

BIBLIOGRAPHY

1. ITU, Recommendation Q.1200: General series intelligent networks recommendations structure, Int. Telecommun. Union, Geneva, September 1997.
2. ITU, Recommendation Q.1201/I.312: Principles of intelligent network architecture, Int. Telecommun. Union, Geneva, October 1992.
3. ITU, Recommendation Q.1202/I.328: Intelligent network service plane architecture, Int. Telecommun. Union, Geneva, September 1997.
4. ITU, Recommendation Q.1203/I.329: Intelligent network global functional plane architecture, Int. Telecommun. Union, Geneva, September 1997.
5. ITU, Recommendation Q.1204: Intelligent network distributed functional plane architecture, Int. Telecommun. Union, Geneva, March 1993.
6. ITU, Recommendation Q.1205: Intelligent network physical plane architecture, Int. Telecommun. Union, Geneva, March 1993.
7. ITU, Recommendation Q.1208: Intelligent network interface recommendations for CS-1, Int. Telecommun. Union, Geneva, October 1995.
8. ITU, Recommendation Q.1211: Introduction to intelligent network capability set 1, Int. Telecommun. Union, Geneva, March 1993.
9. ITU, Recommendation Q.1213: Global functional plane for intelligent network CS-1, Int. Telecommun. Union, Geneva, October 1995.
10.
J. PLACE J. STACH
University of Missouri—Kansas City
Wiley Encyclopedia of Electrical and Electronics Engineering
Internetworking, Standard Article
Carl A. Sunshine, Aerospace Corporation, Los Angeles, CA
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5312
Article Online Posting Date: December 27, 1999
Abstract: The sections in this article are History of Computer Internetworking, Major Technical Issues, Major Internet Examples, and Future Directions.
INTERNETWORKING
The terms internetworking or network interconnection refer broadly to the techniques that enable computer systems on one network to communicate with systems on another network. The set of interconnected networks may be called an internet. (We shall use the proper name Internet for the particularly well known global internet that has come to dominate internetworking in the 1990s.) A major challenge for internetworking is to allow different types of networks to participate. A variety of network technologies and products have been devised to provide efficient data communication through different media (twisted pair copper wires, optical fibers, coaxial cable) and over various distances, such as within a building, across a campus, and between widely separated locations. Recently, wireless data communications networks (ground and satellite based) have become more prevalent to support mobile users or remote locations. Providing for all types of networks to be interconnected so that users on one network can effectively communicate with users on other networks adds great value to the system.

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
However, each network technology comes with its own characteristics for speed, format, reliability, and protocols which define the format and procedures for data exchange (1). There are good technical and marketing reasons for these different solutions, so diversity in network technologies is likely to persist. This suggests that for a network interconnection strategy to succeed, it must accommodate the autonomy and differences of individual networks to the greatest extent possible. On the other hand, some commonality of services must be supported if communication between users on different networks is to succeed. These two requirements represent a tension, within which a variety of interconnection approaches has been devised (2–5). Typically, some additional equipment is required to interconnect two different networks, by connecting to both networks through appropriate interfaces and implementing any necessary additional protocols (see Fig. 1). These intermediate devices that create the internet from its component networks may be called gateways, or routers, since one of their key functions is to forward incoming data in the proper direction to reach its ultimate destination, possibly many networks away. To accomplish this, a higher-level internet addressing scheme must be provided that can identify destinations across all of the networks in the internet. The routers must then determine from this internet address where to send the data next and how to package data in the local protocol used within the next individual network. The basic operation of an internet is much like that of the postal service. The sender of a letter places it in an envelope with the address of the destination and drops it in the mail. The local postal service then reads the address and delivers the letter to an appropriate forwarding office, using whatever transport mechanism is most suitable (bicycles, trucks, planes). 
At the forwarding office, the letter is sorted and forwarded again, until it reaches the final post office, which can deliver it to the destination. The postal service is not concerned with the contents of the letter, although it must conform to certain maximum size and weight limits (which may vary in different postal systems). Thanks to certain international agreements, there is enough commonality in mail services and the languages used for addresses that the basic mail forwarding service can be provided successfully, even if the contents might not be understood. Similarly, in an internet, the data to be sent are bundled into packets, with an ‘‘envelope’’ of header and trailer information including the source and destination internet addresses. Each network delivers these to an appropriate router, which uses the header information to determine how to forward the packet onward. In Fig. 1, the sending host A
[Figure 1. Routers interconnect networks to form an internet. (Hosts A and C sit on LAN X, host E on LAN Y, and hosts B and D on LAN Z; router R joins LAN X to WAN W, router S joins WAN W to LAN Z, and router T joins LAN X to LAN Y.)]
sends packets to destination B via local area network (LAN) X to router R, which in turn forwards the packet through WAN W to router S, which finally forwards the packet via LAN Z to host B. To support all types of data communication applications, the internet must be able to forward arbitrary data inside the packet, so long as the size is acceptable and the "envelope" information is properly formed. If the end host systems and the intermediate routers all implement a common internet protocol to handle the basic addressing and routing functions, the data can reach their destination anywhere in the system. In practice, additional issues, such as congestion control, fragmentation, and multiplexing, must also be dealt with (3). We first summarize how network interconnection has developed historically. We then review the major technical problems of network interconnection, including stepwise versus endpoint services, level of interconnection, addressing, routing, fragmentation, and congestion control, ending with a summary of functions performed by a router. Next we present several important examples of internet systems that illustrate the technical alternatives, and we conclude with some directions for further research.
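The hop-by-hop forwarding model just described can be sketched in a few lines. This is an illustrative sketch only: the packet fields, table contents, and topology below are assumptions modeled loosely on Fig. 1, not the actual IP header format.

```python
# Sketch of internet-style forwarding (hypothetical names, not real IP).
# A packet carries an "envelope" (header) with source and destination
# internet addresses; each router forwards it one hop at a time using
# its own table until the destination's network is reached.
from dataclasses import dataclass

@dataclass
class Packet:
    src: str        # source internet address, e.g. "A"
    dst: str        # destination internet address, e.g. "B"
    payload: bytes  # arbitrary application data, opaque to the routers

# Hypothetical per-router next-hop tables for a topology like Fig. 1:
# router R reaches B and D via S across the WAN; S delivers them directly.
NEXT_HOP = {
    "R": {"B": "S", "D": "S", "E": "T"},
    "S": {"B": "B", "D": "D"},
    "T": {"E": "E"},
}

def forward(router: str, pkt: Packet) -> str:
    """Return the next hop to which this router sends the packet."""
    return NEXT_HOP[router][pkt.dst]

pkt = Packet(src="A", dst="B", payload=b"hello")
hop = forward("R", pkt)    # router R sends the packet across the WAN to S
final = forward(hop, pkt)  # router S delivers it to host B on its LAN
```

Note that no router needs to know the full path; each one only decides the next hop, which is what makes the scheme scale across many networks.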
HISTORY OF COMPUTER INTERNETWORKING

Computer networking as we know it today may be said to have gotten its start with the ARPANET development in the late 1960s and early 1970s under the sponsorship of the Advanced Research Projects Agency (ARPA) in the United States. Prior to that time there were computer vendor "networks" designed primarily to connect terminals and remote job entry stations to a mainframe. But the notion of networking between computers viewing each other as equal peers to achieve "resource sharing" was fundamental to the ARPANET design (6). The other strong emphasis of the ARPANET work was its reliance on the then novel technique of packet switching to share communication resources efficiently among users transmitting intermittent bursts of information, instead of the more traditional dedicated links or circuit switching which supported steady rate transmission well. Although the term network architecture was not yet widely used, the initial ARPANET design did have a definite structure and introduced another key concept: protocol layering, or the idea that the total communications functions could be divided into several layers, each building on the services of the one below. The original design had three major layers, a network layer that included the network access and switch-to-switch (IMP-to-IMP) protocols, a host-to-host layer (the Network Control Protocol, or NCP), and a function-oriented protocol layer, where specific applications such as file transfer, mail, speech, and remote terminal support were provided (7). Similar ideas were being pursued in several other research projects around the world, including the Cyclades network in France (5), the National Physical Laboratory network in England (8), and the Ethernet system (9) at Xerox Palo Alto Research Center in the United States. Some of these projects focused more heavily on the potential for high-speed local networks such as the early 3 Mbps Ethernet.
Satellite and radio channels for mobile users were also a topic of growing research interest.
Creation of the Internet Protocol

By 1973 it was clear to the networking vanguard that another protocol layer needed to be inserted into the protocol hierarchy to accommodate the interconnection of diverse types of individual networks. Cerf and Kahn published their seminal paper describing such a scheme (10), and development of the new Internet Protocol (IP) and Transmission Control Protocol (TCP) to jointly replace the NCP began. Similar work was being pursued by other groups meeting in the newly formed International Federation of Information Processing (IFIP) Working Group 6.1, called the Internetwork Working Group (11). The basis for the network interconnection approach developing in this community was to make use of a variety of individual networks, each providing only a simple "best effort" or datagram transmission service. Reliable virtual circuit services would then be provided on an end-to-end basis with the TCP (or similar protocol) in the hosts. ARPA sponsored an effort to form a national internet based on TCP/IP protocols and connecting research groups with local networks via the ARPANET. This system gradually grew international extensions and eventually became the Internet.

Other Internetworking Approaches

During the same time period, public data networks (PDN) were emerging under the auspices of what is now the International Telecommunications Union (ITU), then known as CCITT. The newly defined X.25 protocol aimed at providing more traditional virtual circuit types of network service that guaranteed reliable end-to-end delivery (1). The PDNs devised an interconnection scheme based on concatenating virtual circuits across each network (12). The middle and late 1970s saw networking conferences dominated by heated debates over the relative merits of circuit versus packet switching and datagrams versus X.25 virtual circuits (13).
The mainframe computer vendors continued to offer their proprietary networks, gradually supporting the new X.25 service as links under their own protocols. Digital Equipment (DEC) was the notable exception, adopting the research community approach of peer-to-peer networking at an early date and coming out with its own new suite of protocols (DECNET). By the late 1970s, a new major influence was emerging in the computer networking community. The computer manufacturers realized that multivendor systems could no longer be avoided and began to take action to satisfy the growing user demand for interoperability. Working through their traditional standards body, the International Standards Organization (ISO), a new group (Study Committee 16) was created to develop standards in the networking area. Their initial charter was to define an explicit architecture or "reference model" for Open Systems Interconnection (OSI) (1). They formalized the concept of protocol layering to facilitate the design of increasingly complex communications software. In a layered architecture, the communications functions in each system are partitioned into a set of layers, with each layer making use of the functions provided by the layer beneath. This allows modifying the protocol within a layer so long as the functions provided upward and used below are maintained.

Interconnection of LANs

Another force contributing to the growth of internetworking was the introduction of personal computer networks, initially
for business purposes. Both Apple Computer and Novell introduced networking software in the mid-1980s that allowed multiple LANs to be interconnected, with sharing of files and printers. The work on Ethernet at Xerox was extended to allow interconnection of LANs over long-distance links (14). The breakup of the long-distance phone monopoly in the United States in 1984 provided competition and a rapid drop in prices for higher-speed links to interconnect the growing business LANs at various sites. Such links also provided greater bandwidth for interconnection of the growing number of TCP/IP networks at university and research sites. This led to formation of the first high-speed national TCP/IP network by the National Science Foundation and a further growth of TCP/IP systems to include commercial sites. The Internet Engineering Task Force (IETF) was formed to guide the further evolution of the TCP/IP internet, later known as the Internet. Meanwhile, the CCITT and ISO camps aligned their efforts, with OSI adding an internet sublayer within the network layer to accommodate the datagram internetworking approach beside the virtual circuit approach. This new OSI protocol family functioned much like the TCP/IP suite. Many proponents of the OSI stack expected it to succeed the TCP/IP suite, and it enjoyed considerable acceptance in Europe and the Far East. The United States government mandated its inclusion in all network purchases through the Government Open Systems Interconnect Profile (GOSIP).

Dominance of the Internet

In the mid-1990s, several factors contributed to the growing dominance of the TCP/IP system, which came to be called simply the Internet. Free software for the TCP/IP suite was widely available. The invention of hypertext browser software (the original was called Mosaic) made hypermedia information throughout the Internet easily accessible.
With the tremendous growth in PCs and the discovery of the Internet for personal and general business use, demand for connectivity accelerated dramatically, and by the late 1990s there were millions of connections to the Internet. The protocol suite developed by the researchers in the 1970s is now an essential basis for a vast array of personal and enterprise information exchange. In the process of growth, some modifications to the original internet protocols have been proposed, but the fundamental principles remain valid.

MAJOR TECHNICAL ISSUES

As noted previously, an internet must deal with basic issues common to any switching system, such as addressing, routing, congestion control, fragmentation, and multiplexing. The following sections focus on the extra concerns that are important at the internet level in each of these areas, along with a discussion of alternatives for the level at which to interconnect networks.

Naming, Addressing, and Routing

To understand the problem of delivering data to the correct destination in an internet, a clear distinction must be drawn among names, addresses, and routes (15). Although these concepts are applicable at each protocol level, we shall be primarily concerned with the network level, where hosts or end systems and routers are the relevant objects. A name serves
to identify the host "logically," independent of its point(s) of attachment to the network(s). The same host may have several names to provide for convenient "nicknames" or aliases. An address identifies a point of attachment for purposes of delivering data to the host; since the same host may have multiple network interfaces, it may have multiple addresses. Finally, a route is the path taken from source to destination host (the sequence of intermediate nodes that the packet traverses), and there are typically multiple routes available to the same destination. The process of sending data to a destination generally involves first determining its address from its name using a directory service, and then determining the best route to that address. In large systems, this name lookup function is typically implemented in a distributed fashion, with a hierarchical name space where subdirectories are responsible for their portion of the name space (16).

[Figure 2. Hierarchical addresses simplify global addressing and routing: an internet address is formed from a network prefix plus a local suffix.]

Figure 3. Routing table for Router R in Fig. 1. Table is larger with flat internet addressing (a) and smaller with hierarchical addressing (b).

(a) Flat internet addressing
Dest.   Next Address   Port
A       A              X
B       S              W
C       C              X
D       S              W
E       T              X

(b) Hierarchical internet addressing
Dest. Net   Next Address   Port
X           Local          X
Y           T              X
Z           S              W

Addressing. As noted earlier, packets traversing an internet include a header specifying the internet address of the destination host. Internet addresses must provide a unique identifier for each network interface in the internet system. In small internet systems with broadcast media (such as Ethernet or token rings) it may be sufficient to use a "flat" address format where the addresses provide no indication of the location of the host's interface. In large internet systems, it is essential to introduce a hierarchical internet address format where an explicit "network" prefix is combined with a "local" suffix to form a complete address (Fig. 2). The network prefix identifies the destination network (or closely related set of networks), and the local suffix identifies the destination host interface within that network. In the Internet, the original internet address format was 32 bits, with either 8, 16, or 24 bits allocated for the network prefix. This allowed for a small, medium, or large number of nets which each contained a large, medium, or small number of hosts, respectively. As the Internet expanded, a more flexible scheme was developed called subnetting (17) which allows a locally interconnected group of networks (such as LANs on a campus) to appear as a single network to the rest of the Internet. Local routers in the group then use the first few bits of the "local" address to properly distinguish between the different "internal" networks. While subnetting has allowed successful expansion of the Internet for many years, the newly adopted Internet Protocol Version 6 provides for longer 128-bit addresses to facilitate future growth. Another addressing issue in forming successful internets is how to create the network level addresses needed to transmit a packet through each individual network along its path. As described later, routing tables in each host or router provide the internet address to which each incoming packet must be forwarded, but this must be translated to a network level address that can be "understood" by the next network. In some cases the local portion of the internet address can be
used or translated directly into a network address (e.g., the IMP and port numbers in the original ARPANET). In other cases, the local portion must be determined or "resolved" using tables created by an address resolution protocol (ARP) (17). The ARP dynamically discovers the network addresses (and internet addresses) of hosts and routers connected to a particular network, and maintains a table giving the correspondence between network and internet addresses on a given network.

Routing. As a packet traverses the internet, the source node and each router must take the destination internet address and determine the next place to send the packet. Normally this is done using a routing table, or data structure containing destination addresses and how to reach them. With a flat internet address space, routers must maintain information on how to reach each destination individually, and hence have large routing tables. This approach is usable in smaller internet systems, such as interconnected LANs. Figure 3(a) shows a routing table for Router R in Fig. 1, containing a list of destination addresses and the address of the next hop. The next hop specifies the internet address of the next router to use when forwarding the packet, which in turn determines which "port" or network interface to use. The routing subroutines usually index entries with a hash, tree, or other efficient lookup mechanism to speed the process and may contain additional information for each entry, like age and frequency of use. Routing tables may be constructed according to many requirements, such as lowest delay or cost or highest availability. In some cases they are best created statically when the end system or router is configured. For example, host A in Fig. 1 need only create a single default route to Router R, since that is the only path into the internet.
Though static tables for singly connected end systems are commonplace, other routing tables are commonly altered dynamically to represent current link and router availability. Accomplishing this task in large networks or internets with high reliability, efficiency, and timeliness has been a very challenging problem that has led to the development of many routing information exchange protocols that balance complexity, optimality, and required processing speed (1,18). For large internets, a hierarchical address is often used in the internet protocol, so that routing can be done in steps. First the gateways route packets to the final network (ignoring the local suffix), and then within the final network to the local address. With this approach the routing table contains an entry for each net, rather than for every destination host.
This reduces the size of routing tables, as shown in Fig. 3(b), with the potential for some loss in optimality. Despite considerable recent progress in routing algorithms, computer and human error conditions can still occur that create routing loops, a condition among a set of routing tables in which packets repeatedly traverse the same set of intermediate systems, never reaching their ultimate destination. To prevent these conditions from congesting the network indefinitely, internetwork protocols specify a hop count or time to live field that is decremented by each router. If the hop count ever reaches zero, the packet is discarded. Senders normally set this field equal to or greater than the longest normal path through the internet. Another design choice concerns the frequency of routing decisions. For maximum robustness, each packet may cause a best route selection process to be carried out, as in the original ARPA internet (19). Other systems choose to perform the best route determination process only for the initial packet to a destination. This route is then remembered in the routers, and subsequent packets to the same destination follow the same route. Often this type of path setup is accompanied by an abbreviated addressing convention, where only the first packet must carry the full destination address and subsequent packets carry only a shorter path identifier. The CCITT X.75 and the new IPv6 use this approach. A mechanism for timing out such routes and recovering from changes in the internet topology must be provided. Yet another approach employs flooding to avoid the need for intelligence in packet forwarders. Since flooding is expensive in its use of network resources, it is typically used only for control purposes or for initially establishing a path that later packets to the same destination will follow.
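The hop-count safeguard described above can be sketched in a few lines. The field names are illustrative, not the actual IP header layout.

```python
# Minimal sketch of the hop-count (time-to-live) safeguard: each router
# decrements the field and discards the packet when it reaches zero, so a
# packet caught in a routing loop cannot circulate indefinitely.
def forward_with_ttl(packet):
    """Decrement the hop count; return None when the packet is discarded."""
    packet["ttl"] -= 1
    if packet["ttl"] <= 0:
        return None          # discarded: likely trapped in a routing loop
    return packet            # otherwise keep forwarding toward the dest.

# A packet trapped in a loop is dropped after at most `ttl` hops:
pkt = {"dst": "B", "ttl": 3, "payload": b"data"}
hops = 0
while pkt is not None:
    pkt = forward_with_ttl(pkt)
    hops += 1
# hops == 3: forwarding terminated even though the route never converged
```

This is why senders set the initial value to at least the longest normal path: too small a value would discard packets on legitimate long routes.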
Another method of routing called source routing allows the sender to avoid the need for intelligent routers or to force a specific path to be used by providing a route in the packets it sends (20).

Congestion Control

The problems of congestion control in an internet system are much like those of individual networks. Speed mismatches are likely to be more severe between LANs and slower wide area networks (although recent advances in high-speed WAN service should reduce this). In some cases, the individual network procedures may be adequate [e.g., Asynchronous Transfer Mode (ATM) quality-of-service parameters]. In others, some form of explicit internet-level control may be needed. Questions have been raised about the ability of connectionless systems to provide effective congestion control. This is a particular concern when connectionless or datagram internet service is used to support higher-level connection-oriented services. Several techniques have been proposed in this area, including input buffer limits, buffer classes, fair queuing, slow start, and choke packets (1,21). Once the sender has determined that congestion has occurred (by receiving an explicit signal from a host or router or by timing out waiting for an acknowledgment), it must reduce its transmission rate for a while and then try to increase it again. Various specific algorithms for this purpose have been proposed, and this is an active area of research.

Fragmentation and Reassembly

When networks with differing maximum packet size limits are interconnected, the need to fragment large packets for
traversal through networks with smaller size limits must be considered. The original packet is broken into two or more new packets, each small enough to transmit over the next network. These fragments can be reassembled at the exit from the individual small packet network or allowed to propagate all the way to the final destination. Mechanisms to support such fragmentation typically include some sort of additional sequencing information in the packet header. The most robust mechanisms allow further fragmentation of already created fragments and proper reassembly of fragments at the final destination that may have followed different paths. In general, fragmentation is undesirable because of the processing burden placed on routers and because of the possibility of inefficient link utilization. For example, a fragment that fills one network packet may have to be fragmented at a subsequent router into one large and one very small piece. The very small piece has a large "overhead" (its header is large relative to the data it carries), which uses resources inefficiently. To help alleviate this problem, the internet protocol suite may provide for an advisory message to be transmitted back to the source of large packets, indicating that they are too big for the router to forward without fragmentation.

Level of Interconnection

The previous discussion has assumed that networks are interconnected at the network level of the protocol hierarchy, since this is the dominant approach in use today. However, other levels of interconnection may also be chosen, from the lowest (physical) level to the highest (application) level. In general, the lower the level of interconnection, the more similar the networks to be connected must be, while high-level interconnections support more specialized services. When different networks and protocols are involved, the interconnection involves a conversion process between the services provided for comparable functions in each network (22).
The complexity of this process and the quality of the resulting end-to-end services are largely determined by the level of interconnection chosen. The following sections summarize the key features of each major alternative.

Physical Level. The physical level deals with serial transmission of bits over a physical medium. Interconnection devices operating at the physical level are generally called repeaters. They forward individual bits of the packet as they arrive, perhaps translating from one medium to another (e.g., baseband coaxial cable to optical fiber). The resulting interconnected system functions essentially as a single network at the data link level, and hence all networks to be so connected must have identical data rates and link protocols. This approach is typically used to interconnect several physically separate segments of a LAN system, perhaps separated by a point-to-point link. A disadvantage is that repeaters propagate noise and interference as well as valid data.

Link Level. The link level deals with transmission of frames over a link, which may be shared by multiple users. Interconnection devices operating at the link level receive entire frames from one link, examine the link-level protocol header, and possibly forward the frame onto another link. They are typically called bridges (20). As with repeaters, they may interconnect two or more local LAN segments or may interconnect remote segments over a long-distance link. Major motivations for their use are to interconnect LAN segments with different speeds and/or protocols, or to increase network capacity by "filtering" incoming packets and forwarding only those whose link-level destination is on another segment. Bridges thus accommodate parallelism by permitting simultaneous use of both segments. Moreover, bridges transparently support systems with multiple network-level protocols in use.

Network Level. The network level deals with transmission of packets over a network that may include intermediate switches. Traditionally, interconnection at the network protocol level has been a WAN problem, where different networks had independently developed different protocol mechanisms for the variety of network-level functions, such as routing, congestion control, error handling, and segmenting. If the networks are identical, then the problem becomes largely one of routing, as with the X.25/X.75 approach in public data networks. When the networks differ, the complexity of protocols at the network level (e.g., X.25 versus ARPANET 1822) makes a translation approach difficult. There has been some success in one vendor emulating another vendor's network behavior [e.g., IBM Systems Network Architecture (SNA) gateways]. The approach that has gained wide acceptance in the Internet places a common IP sublayer on top of the different network protocols. As noted previously, this has particular benefits for supporting the sophisticated routing procedures needed for large internet systems, and devices operating at this level are often called IP routers. Choosing this level for interconnection makes available the general-purpose services of the network level and allows the router implementor to take advantage of what is normally a well-documented interface with many implementations.
It allows each network to function autonomously with its own procedures internally, while requiring some standard "internet" procedures to be used on top of the normal network access for individual networks.

Transport Level. The transport layer is intended to provide general-purpose data transfer between end users. In the OSI architecture, the transport service is supposed to be an end-to-end service, so transport-level gateways are, strictly speaking, a violation of the architecture. Nevertheless, they may be of practical benefit when common upper-level protocols are in use but different transport protocols are available. Early experiments with the competing protocol hierarchies demonstrated connections of this nature (for example, concatenating TCP and ISO TP4 connections to each other).

Higher Level. Many application-level gateways have been implemented to support specific services found at the application level. This type of gateway is essentially a "Janus host" that implements two (or more) full protocol suites. Common examples have been interconnecting terminal concentrators or CCITT packet assemblers/disassemblers (PADs) to provide an interactive terminal service, or electronic mail servers to form a mail forwarding service. Where only a specific application service is wanted and the desired application services on
each net match closely, this type of gateway may be easy to set up with existing equipment. However, the service provided is clearly not general purpose, and the limitations imposed by providing only those service elements common to the interconnected systems are often more irksome than anticipated (23).
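The link-level "filtering" performed by bridges, described earlier, can be sketched as a small forwarding decision. This is a hedged illustration rather than any particular bridge implementation; the two-port topology and the frame representation are assumptions:

```python
# Sketch of a learning bridge: it learns which segment (port) each source
# address lives on, and forwards a frame to the other segment only when the
# destination is not known to be local. Two ports (0 and 1) are assumed.

class LearningBridge:
    def __init__(self):
        self.table = {}                 # MAC address -> port it was last seen on

    def handle(self, in_port: int, src: str, dst: str):
        """Return the port to forward on, or None to filter (drop) the frame."""
        self.table[src] = in_port       # learn the sender's location
        out_port = self.table.get(dst)
        if out_port == in_port:
            return None                 # destination is on the same segment: filter
        if out_port is not None:
            return out_port             # known destination on the other segment
        return 1 - in_port              # unknown destination: flood to the other port
```

Because frames whose destination is local are filtered rather than forwarded, both segments can carry independent traffic at the same time, which is the parallelism noted above.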
MAJOR INTERNET EXAMPLES

The following sections illustrate the application of the technical issues discussed previously in several widely used internet systems.

The Internet

One of the first major internet systems was developed by ARPA in the United States (24,25). This system included the original ARPANET, packet radio nets, satellite networks, and various LANs. The system was subsequently split into separate systems for research users and for operational military users and eventually evolved into the Internet. Networks in the Internet are interconnected by routers that implement a connectionless or datagram IP (19,26,27) to provide maximum robustness and routing flexibility. The system originally employed dedicated router machines based on general-purpose 16-bit minicomputers, but special-purpose high-speed routers are now manufactured by a variety of vendors. Each datagram is analyzed by the routers and routed based on its destination address. The Internet uses hierarchical 32-bit addresses, with routers designed to route to the network portion of the address first, and then the local portion once the correct net is reached. As described earlier, subnetting has been introduced to allow more efficient and flexible use of address space. Host name to address lookup was initially supported by a single flat directory, but as the number of hosts grew, a hierarchical distributed directory service [the domain name system (DNS)] was adopted (17), which now can access millions of names throughout the world within a few seconds. Most of the individual networks in the Internet provide connectionless service, although there is a provision for running IP over connection-oriented network services such as X.25 and ATM. The major transport service is connection oriented, implemented by a common protocol called the transmission control protocol (TCP) that must be present in the end systems (not in routers).
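The two-stage address handling described above (match the network portion of the address first, then deliver within that network) can be sketched with Python's standard ipaddress module. The forwarding table and next-hop names here are illustrative assumptions:

```python
# Sketch of hierarchical IPv4 forwarding: a router matches the network
# portion of the 32-bit destination address (explicit prefixes/masks here),
# and only the final network delivers on the local portion.
import ipaddress

# Hypothetical forwarding table: network prefix -> next hop.
routes = {
    ipaddress.ip_network("10.1.0.0/16"): "router-A",
    ipaddress.ip_network("10.2.0.0/16"): "router-B",
}

def next_hop(dst: str):
    addr = ipaddress.ip_address(dst)
    for net, hop in routes.items():
        if addr in net:            # mask the address and compare network bits
            return hop
    return "default-gateway"
```

Subnetting fits the same scheme: a site advertises one prefix externally while its interior routers match longer prefixes within it.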
IP also supports other types of transport protocols, including datagram and "stream" modes (for packetized voice or video). The Internet IP provides for fragmentation at routers, with reassembly at the final destination, so that individual fragments may follow different routes. A time-to-live or hop limit field is included to limit the maximum lifetime of packets in the system, providing an essential part of the overall routing system. Options are defined to allow inclusion of source routes, security markings, timestamps, and so on. There is a separate Internet Control Message Protocol (ICMP) used for signaling errors and diagnostic information. This includes destination unreachable, congestion control (choke packets), packet too big, echo request/reply, and redirect indications (giving a better route for a specific destination).
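The fragmentation-and-reassembly behavior described above (fragments carry sequencing information and may arrive out of order after following different routes) can be sketched as follows. The tuple-based fragment format is an assumption for illustration, not the actual IP header layout:

```python
# Sketch of IP-style fragmentation: each fragment carries an offset and a
# "more fragments" flag so the destination can reassemble, even when
# fragments arrive out of order.

def fragment(payload: bytes, mtu: int):
    """Split payload into (offset, more_fragments, data) pieces of at most mtu bytes."""
    frags = []
    for offset in range(0, len(payload), mtu):
        chunk = payload[offset:offset + mtu]
        more = (offset + mtu) < len(payload)   # True unless this is the last piece
        frags.append((offset, more, chunk))
    return frags

def reassemble(frags):
    """Rebuild the original payload from fragments, in any arrival order."""
    return b"".join(chunk for _, _, chunk in sorted(frags, key=lambda f: f[0]))
```

Because each fragment records its own offset, an already created fragment can itself be fragmented at a later router and the destination can still reassemble correctly.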
Internet routing information exchange was originally handled by a gateway-to-gateway protocol that required interaction between all "neighboring" gateways. As the Internet grew, a more hierarchical scheme called the Exterior Gateway Protocol (EGP) (24) was developed to reduce the amount of routing traffic. In EGP, each autonomous system (typically a campus or corporate internet) elects one gateway to exchange routing data with a neighbor gateway in an adjacent autonomous system, and the systems then propagate the information to all their other gateways through an internal procedure. EGP evolved further to become the Interdomain Routing Protocol (IDRP) (28).

The version of IP developed in the 1970s (IPv4) has been adapted to work on a wide variety of subnetwork technologies and is used at very high speeds, but its deployment in large-scale networks has revealed opportunities for enhancement. In particular, a larger address space is needed. A new version, IPv6, which supports 128-bit addresses, eliminates in-network fragmentation, and improves route lookup times, has been defined by the IETF and is being cautiously deployed.

International Standards Organization

The ISO extended its original seven-layer OSI architecture to define three sublayers within the network layer. The topmost sublayer corresponds to the internet protocol, and the middle sublayer is intended to adapt ("converge") specific network services to those required by the internet sublayer. One example would be use of a connectionless internet protocol over a connection-oriented network, requiring a connection management intermediate protocol to set up and terminate connections as needed in order to send internet-level datagrams. ISO has defined a connectionless internet sublayer protocol (1,29,30) much like the Internet IP. Although the format of the packet header is different, most fields have a one-to-one correspondence with the Internet IP.
However, the ISO IP does not include a field to specify the upper-layer protocol being carried, since this is viewed as part of the address information. The ISO IP includes an error reporting capability, while the Internet IP provides this through the separate ICMP protocol. The fragmentation (segmentation) fields are different, with the ISO IP including in each fragment a field giving the total length of the original packet, to aid in assigning reassembly buffers. The final major difference concerns the format of addresses at the network level, which is not part of the ISO IP itself but is covered in a separate document. The ISO format is a variable-length string that is intended to cover the requirements of both public and private, local and wide area networks for the foreseeable future. It allows a maximum of 20 octets of binary data, which may alternatively be coded as 40 binary-coded decimal digits. The first octet is an authority/format code indicating what format the following data are in. Provision has been made to identify all the major address formats as alternatives (X.121, F.69 [telex], E.163 [telephone], E.164 [ISDN], ISO 6523). The address is assumed to be hierarchical, with each domain responsible for defining the meaning of the suffix portion of the address under its control.

Appletalk

Appletalk was developed in the mid-1980s and is primarily used on Apple computers. It has several innovations that make it efficient and easy to configure. There are two header format options: a 5-byte short form for packets that do not exit a single LAN, and a 13-byte long form for routed packets. The former is quite compact, containing 6 reserved bits, 10 length bits, the source and destination sockets, and the Datagram Delivery Protocol (DDP) type, indicating the application. The sockets identify the particular application to receive the data, as is usually done in the transport protocol; the packets do not need internetwork addresses because the link layer delivers them to the correct destination. Zero to 586 data bytes follow the header. The first two bytes of the extended header are the same as for the short form, except for a 4-bit hop count field that limits the maximum network diameter to 15. Bytes 3 and 4 are an optional header checksum. Following are two destination and two source network bytes, allowing over 65,000 networks (addresses FF00 through FFFE are reserved). Each network may have 254 nodes, as indicated in the following two bytes. Addressing is handled in a "plug and play" fashion. End systems arbitrate (using a broadcast protocol) to obtain an unused node ID when they are initialized and learn the network number (if any) from their nearest router. This eliminates the need to configure end nodes with unique addresses. Routers are manually configured with network numbers.

Novell IPX

Novell began selling its distributed system in the mid-1980s and made rapid inroads in the office automation market. Novell's Internet Packet Exchange (IPX) protocol has a fixed 30-byte header. The first 2 bytes contain an optional checksum, followed by 2 bytes of length (excluding LAN overhead). Next is a 1-byte time-to-live field that starts at 16, and a packet type that indicates which transport protocol is used. The destination and source node addresses have a 4-byte network part and a 6-byte node identifier that is the same as the physical address for Ethernet.
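A rough sketch of packing and unpacking the 5-byte AppleTalk DDP short header described above: 6 reserved bits and 10 length bits in the first two bytes, then one byte each for the sockets and the DDP type. The exact field order and bit layout used here are assumptions for illustration, not the published AppleTalk encoding:

```python
# Hypothetical pack/unpack of a DDP-style short header: a 16-bit word whose
# low 10 bits hold the datagram length (top 6 bits reserved), followed by
# destination socket, source socket, and DDP type bytes.
import struct

def pack_short_ddp(length: int, dst_sock: int, src_sock: int, ddp_type: int) -> bytes:
    assert length < 1024                      # length must fit in 10 bits
    return struct.pack("!HBBB", length & 0x03FF, dst_sock, src_sock, ddp_type)

def unpack_short_ddp(hdr: bytes):
    first, dst_sock, src_sock, ddp_type = struct.unpack("!HBBB", hdr)
    return first & 0x03FF, dst_sock, src_sock, ddp_type
```

The 586-byte payload limit quoted above is consistent with a 10-bit length field, which can count up to 1023 bytes including the header.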
Since the physical address is expected to be unique, this simplifies manual configuration and eliminates the need to discover physical addresses dynamically. The source and destination sockets identify particular applications, like transport sockets.

FUTURE DIRECTIONS

The variety of individual network technologies is likely to continue increasing. Fortunately, by introducing standards at the internetwork level, it is possible to interconnect diverse networks while preserving their individual autonomy to a large degree. The success of the Internet in working with new network technologies such as FDDI and ATM indicates the validity of its basic architecture. To cope with the tremendous growth in end systems, broader addressing and routing schemes are now emerging from IETF work (28). Research is also underway on improved methods of congestion control and routing protocols. With the high packet rates now flowing in the Internet, some provisions for streamlining packet processing are needed. A new version of the Internet IP, IPv6, provides for 128-bit addresses, with a flow label in each packet to allow routers to cache routes and avoid a full destination lookup on each packet. These cached routes are timed out every few seconds to ensure
responsiveness to changing conditions. IPv6 allows fragmentation only at the source host, to streamline packet processing in intermediate routers. Inclusion of high-latency links, such as satellite hops, and high-error-rate links (mobile users) also provides new challenges for the Internet. Greater demand for broadcast service (the same data going to multiple users), constant-rate data (audio and video), and asymmetric-rate links (fast data retrieval, slow requests, as provided in some broadband cable systems and Asymmetric Digital Subscriber Line technology) are other directions for expansion.

BIBLIOGRAPHY

1. C. Sunshine (ed.), Computer Network Architectures and Their Protocols, 2nd ed., New York: Plenum, 1989.
2. V. Cerf and P. Kirstein, Issues in packet network interconnection, Proc. IEEE, 66: 1386–1408, 1978.
3. M. Gien and H. Zimmermann, Design principles for network interconnection, Proc. 6th Data Commun. Symp., Pacific Grove, CA, 1979, ACM/IEEE, pp. 109–119.
4. J. Postel, Internetwork protocol approaches, IEEE Trans. Commun., COM-28: 604–611, 1980.
5. L. Pouzin, A proposal for interconnecting packet switching networks, Proc. Eurocomp, 1974.
6. L. Roberts and B. Wessler, Computer network development to achieve resource sharing, AFIPS Conf. Proc. (SJCC), 36: 543–549, 1970.
7. V. Cerf, The DoD internet architecture model, Comput. Netw., 7: 307–318, 1983.
8. R. Scantlebury and P. Wilkinson, The National Physical Laboratory data communication network, Proc. ICCC, Stockholm, 1974.
9. R. Metcalfe and D. Boggs, Ethernet: Distributed packet switching for local computer networks, Commun. ACM, 19: 395–404, 1976.
10. V. Cerf and R. Kahn, A protocol for packet network intercommunication, IEEE Trans. Commun., COM-22: 637–648, 1974.
11. V. Cerf et al., Proposal for an international end-to-end protocol, Comput. Commun. Rev., 6: 68–89, 1974.
12. A. Rybczynski, J. Palframan, and A. Thomas, Design of the Datapac X.75 internetworking capability, Proc. 5th Int. Conf. Comput. Commun., 1980, pp. 735–740.
13. B. Meister, P. Janson, and L. Svobodova, Connection-oriented versus connectionless protocols: A performance study, IEEE Trans. Comput., C-34: 1164–1173, 1985.
14. D. Boggs et al., PUP: An internetwork architecture, IEEE Trans. Commun., COM-28: 612–624, 1980.
15. J. Shoch, Internetwork naming, addressing, and routing, Proc. IEEE COMPCON, 1978, pp. 72–79.
16. P. Mockapetris and K. Dunlap, Development of the domain name system, Proc. ACM SIGCOMM Symp., 1988, pp. 123–133.
17. D. Comer, Computer Networks and Internets, Englewood Cliffs, NJ: Prentice-Hall, 1997.
18. C. Huitema, Routing in the Internet, Upper Saddle River, NJ: Prentice-Hall, 1995.
19. J. Postel, C. Sunshine, and D. Cohen, The ARPA internet protocol, Comput. Netw., 5 (4): 261–271, 1981.
20. R. Dixon and D. Pitt, Addressing, bridging, and source routing, IEEE Netw., 2 (1): 25–32, 1988.
21. V. Jacobson, Congestion avoidance and control, Proc. ACM SIGCOMM Symp., 1988, pp. 314–329.
22. P. Green, Jr., Protocol conversion, IEEE Trans. Commun., COM-34: 257–268, 1986.
23. M. Padlipsky, Gateways, architectures, and heffalumps, in The Elements of Networking Style, Englewood Cliffs, NJ: Prentice-Hall, 1985, pp. 167–176.
24. R. Hinden, J. Haverty, and A. Sheltzer, The DARPA internet: Interconnection of heterogeneous computer networks with gateways, IEEE Comput., 16 (9): 38–48, 1983.
25. J. Postel, C. Sunshine, and D. Cohen, Recent developments in the DARPA internet program, Proc. 6th Int. Conf. Comput. Commun., London, UK, 1982, pp. 975–979.
26. Department of Defense, Internet protocol, MIL-STD-1777, 1983.
27. D. Clark, The design philosophy of the DARPA internet protocols, Proc. ACM SIGCOMM Symp., 1988, pp. 106–114.
28. S. Thomas, IPng and the TCP/IP Protocols: Implementing the Next Generation Internet, New York: Wiley, 1996.
29. R. Callon, Internetwork protocol, Proc. IEEE, 71: 1388–1393, 1983.
30. International Standards Organization (ISO), Protocol for providing the connectionless network service, IS 8473, March 1986.

CARL A. SUNSHINE
Aerospace Corporation
INTERPRETERS, PROGRAM. See PROGRAM INTERPRETERS.
INTERPROCESS COMMUNICATION. See APPLICATION PROGRAM INTERFACES.
Wiley Encyclopedia of Electrical and Electronics Engineering

ISO OSI Layered Protocol Model

Standard Article
Adrian Tang, University of Missouri-Kansas City, Kansas City, MO
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5313
Online Posting Date: December 27, 1999
Abstract. The sections in this article are Interoperability, OSI Reference Model, OSI Application Standards, OSI Concepts, OSI Layers, and Conclusions.
ISO OSI LAYERED PROTOCOL MODEL
In recent years, information technology has become a major part of human civilization. The sheer variety of information technology has created a significant problem when interconnecting computer systems. The Open Systems Interconnection (OSI) standards are a set of international standards prescribing how multivendor computer systems can be interconnected in a consistent fashion. Among these standards, the standard on the OSI Reference Model provides a framework for the development of protocols to interconnect open systems. This article introduces the OSI Reference Model and the OSI protocols.

INTEROPERABILITY

The purpose of the OSI interconnection standards is to promote interoperability, which enables open systems to communicate with each other. To appreciate the role of the OSI interconnection standards in promoting interoperability, a three-step strategy to achieve interoperability is described in the following.

The first step is to develop the OSI standards. The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) jointly formed a committee, ISO/IEC JTC (Joint Technical Committee) 1, which is responsible for solving problems that can occur at the interconnection and information processing levels. Within JTC 1 are subcommittees (SCs). For example, SC 6 is involved in lower-layer information exchange standards and SC 18 in information processing standards. The OSI Reference Model is covered in ISO/IEC 7498, which is an SC 21 standard. Another important OSI standards organization is the International Telecommunication Union (ITU). The ITU recommendations provide end-to-end compatibility for international telecommunications.
Although the ITU has been more concerned with issues addressing the lower three layers of the OSI Reference Model and with applications using the telecommunications capabilities, there is a great deal of overlap between its work and that of ISO. Indeed, ISO and the ITU publish nearly identical texts as both an International Standard and a Recommendation.

The second step is to establish functional profiles. The difficulty in implementing the open systems standards is the abstract, often informal, nature of the standards documents, as well as the wide variety of options that can be implemented for a protocol. Furthermore, in a given layer of the model, there may be a variety of similar protocols. To overcome such difficulties, OSI implementors define functional profiles, which identify specific protocols, specific choices of permitted options, and specific values for parameters in the standards. Functional profiles have been defined primarily in three OSI regional workshops: the OSI Implementation Workshop (OIW) in the United States, the European Workshop for Open Systems (EWOS) in Europe, and the Asian and Oceanic Workshop (AOW) for Japan, Australia, and the Pacific Rim countries. Because functional profiles are developed in three different workshops, there is potential for interoperability problems between systems implemented in different parts of the world. A subcommittee of ISO/IEC JTC 1 was formed in 1987 to define International Standardized Profiles (ISPs), which harmonize potentially divergent regional efforts into common, internationally recognized functional profiles.

The third step is to test. There are two kinds of OSI tests: conformance testing and interoperability testing. Conformance testing examines a product to determine whether it meets the standard and profile requirements. Passing conformance testing does not ensure that two products will interoperate with each other, for example, when they choose incompatible option values. Interoperability testing determines whether two products can actually work with each other. This process is time consuming because a total of n(n-1)/2 interoperability tests are needed to demonstrate full interoperability in an environment consisting of n implementations. Test strategies, concepts, and scripts have been developed by ISO and specific industry consortia.
Testing houses have been established to conduct manufacturer-neutral conformance and interoperability test campaigns.
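The quadratic cost of full interoperability testing mentioned above is easy to check directly; every implementation must be paired with every other one:

```python
# Full interoperability testing pairs every implementation with every other,
# giving n*(n-1)/2 distinct test campaigns for n implementations.
from itertools import combinations

def interop_test_pairs(implementations):
    """All distinct pairs of implementations that must be tested together."""
    return list(combinations(implementations, 2))

n = 8
pairs = interop_test_pairs(range(n))
assert len(pairs) == n * (n - 1) // 2   # 28 pairwise campaigns for 8 products
```

This quadratic growth is precisely why conformance testing against a common profile is preferred as a first filter before pairwise interoperability testing.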
OSI REFERENCE MODEL

In 1978, ISO proposed the establishment of a framework for developing standards for the interconnection of heterogeneous computer systems, potentially via a variety of intermediary networking devices. The resulting framework, known as the OSI Reference Model (ISO/IEC 7498 or ITU X.200), was published in the spring of 1983. Since then, it has been extended to cover the connectionless communication mode, security, naming/addressing, and management.

The interconnection requirements can be divided into internetworking requirements and interworking requirements. To meet these requirements, the OSI Reference Model is divided into seven layers (Fig. 1). Layers 1 through 4 deal with the internetworking requirements; Layers 5 through 7 deal with the interworking requirements.

[Figure 1. OSI Reference Model: two open systems, each with seven layers (application, presentation, session, transport, network, data link, and physical), with the physical layers joined by the physical media.]

The objective of the internetworking requirements is to provide physical connectivity between systems in different networks in an internetworking environment. Conceptually, the internetwork environment consists of subnetworks that are connected by intermediate systems (ISs) and populated by end systems (ESs). Applications are found in ESs, and ISs provide the glue to interconnect ESs. The objectives behind the internetworking requirements are twofold: transparency over the topology of the internetwork, and transparency over the transmission media used in each subnetwork.

The internetworking requirements are met by the lower four layers in the following way. Layer 1, the physical layer, is responsible for providing physical connectivity among adjacent systems. The bits in a physical connection may occasionally experience corruption. Thus Layer 2, the data link layer, takes care of error handling as well as flow control for physical connections. Because not all ESs are necessarily adjacent to each other, Layer 3, the network layer, provides network connectivity among nonadjacent ESs. Although a network connection may require one or more connections through ISs, the network layer ensures that the connectivity details are transparent to the two ESs. Layer 4, the transport layer, provides the final touch in meeting the internetworking requirements. Although the lower three layers provide the means for moving data from one ES to another, reliability is still an issue. For example, the aggregated bit-error rate over all the underlying subnetworks may not meet the performance requirements of the application. The purpose of the transport layer is to enhance the transmission quality of the network layer. As a result, users of the transport service are presented with a reliable end-to-end transport pipe.

Even with the internetworking service in place, ESs may not be able to interwork significantly with each other. For instance, an ES may not be able to interpret received bits even though the bits are received correctly. The interworking service is provided by Layers 5 through 7. Layer 5, the session layer, is responsible for dialogue control as well as synchronization in the case of bulk transfer. It achieves its functionality by adding structure to the transport pipe. Layer 6, the presentation layer, provides the environment for applications to determine the application context. The application context, which determines the scope of communication, should identify the syntaxes of the application information exchanged. These syntaxes are known as abstract syntaxes because they do not depend on how their values are represented locally. Values of an abstract syntax are represented in transfer by a set of rules known as the transfer syntax. The presentation layer provides the capabilities for applications to negotiate such transfer syntaxes. Applications can also rely on the presentation layer to perform encryption for secure communication. Layer 7, the application layer, is responsible for meeting all the interworking requirements that are not met by the lower layers. The application layer model provides a structured approach to building objects in the application layer, given the large number of communication requirements.

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

OSI APPLICATION STANDARDS

Among the OSI application standards, the following are introduced below: Common Management Information, X.500, X.400, File Transfer, Access and Management, and Remote Database Access.

Common Management Information Protocol
The OSI management standards provide uniformity in the specification of a management communication protocol, in the definition of managed objects, and in the definition of systems management functions. The Common Management Information Protocol (CMIP) is the OSI protocol used to support the exchange of systems management messages between a manager and an agent (Fig. 2).

[Figure 2. CMIP functional model: the manager sends management operations to the agent and receives notifications from it; the agent performs the operations on managed objects, which emit notifications.]

Managed objects are management views of resources such as network elements. The manager invokes remote management operations, such as finding the status of a managed object. After the agent successfully completes these operations, the results are returned to the manager. The agent can also send a notification to the manager as the result of a notification (e.g., an alarm) received from a managed object. Unlike some other management protocols, CMIP is connection-oriented. The management operations, which are defined in ISO/IEC 9595, can be summarized as follows:

• M-GET: This operation is invoked to retrieve attribute values from one or more managed objects.
• M-CREATE: This operation is invoked to create a managed object.
• M-SET: This operation is invoked to modify attribute values of one or more managed objects.
• M-ACTION: This operation is invoked to have a defined, ad hoc action performed on a managed object, e.g., resetting a network connection.
• M-DELETE: This operation is invoked to delete a managed object (e.g., deleting a log record).
• M-CANCEL-GET: This operation is invoked to request the cancellation of an outstanding invocation of the M-GET operation.
• M-EVENT-REPORT: This operation is invoked by the agent to report an event to the manager.

Operations such as M-GET, M-SET, M-ACTION, and M-DELETE contain scoping and filtering parameters that permit the agent to operate only on the managed objects that have passed the filter. Only intelligent agents with the ability to evaluate filters can perform such operations. Therefore, it is not practical to run OSI agents on dumb devices.

To provide uniformity in the definition of managed objects, the OSI management standards use the object-oriented approach and provide Guidelines for the Definition of Managed Objects (GDMO) templates to define managed object classes. A managed object class is a set of managed objects with similar characteristics, and a managed object is an instance of a managed object class. For example, the transport-connection managed object class is an encapsulation of the common characteristics of transport connections. When a transport connection is established, a transport-connection managed object can be created to model the management behavior of that connection. The managed object class template forms the basis of the definition of a managed object class. It supports inheritance by identifying the inheritance relationships between the class and other managed object classes. Within the template are placeholders for the specification of behavior, attributes, notifications, and actions. Attributes are used to specify properties such as states, and notifications are used to specify unsolicited messages that can be emitted by a managed object. To facilitate rapid development of managed object classes, standards and profile groups have registered managed object classes for common resources. For example, ITU M.3100 provides definitions of managed object classes for generic telecommunications resources. These definitions can be used to derive (via inheritance) definitions of managed object classes for specific telecommunications resources.

System management functions are required to address a variety of management needs for the entire managed system.
Management functions can be classified into five categories: fault management, configuration management, performance management, accounting management and security management. The OSI standards have defined systems management functions to address common problems in these five categories. Among these functions, the object management function, for example, provides services for reporting the creation/deletion of managed objects and changes to their attribute values; the event reporting management function provides services for delivering event reports to managers. The OSI management standards have found their greatest success in telecommunications applications. ITU M.3010 defined a five-layer model for the Telecommunications Management Network (TMN) and recommended the use of the OSI management standards. Given the huge number of OSI management standards, it is not easy for a telecommunications manager to identify the standards that can be used to solve a specific management problem. To address this difficulty, the Network Management Forum (NMF) came up with the idea of a solution set. Similar to the notion of a functional profile, a solution set provides the solution to a specific telecommunications management problem. It identifies the managed object classes and systems management functions which are needed to solve the problem. Through scenarios, it explains how the functions can be applied to meet the requirements of the problem. The NMF has already defined a number of useful solution sets for practical telecommunications problems. A majority of telecommunications managers have committed to deploying NMF solution sets to solve telecommunications problems.
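The managed-object and filtering ideas above can be illustrated with a minimal sketch. The class and attribute names below are hypothetical, and the filter grammar is greatly simplified relative to the real CMIS filter syntax:

```python
# Minimal sketch of the managed-object idea (hypothetical names, not GDMO syntax).
# A managed object class groups objects with similar characteristics; instances
# carry attributes and can emit notifications that an agent filters for a manager.

class ManagedObject:
    def __init__(self, object_class, **attributes):
        self.object_class = object_class
        self.attributes = dict(attributes)

    def set_attribute(self, name, value):
        old = self.attributes.get(name)
        self.attributes[name] = value
        # An attribute value change would normally trigger a notification.
        return {"event": "attributeValueChange", "attribute": name,
                "old": old, "new": value}

def evaluate_filter(obj, filter_expr):
    """Evaluate a CMIS-style filter, simplified here to (attribute, operator,
    value) triples implicitly joined by 'and'."""
    for name, op, value in filter_expr:
        actual = obj.attributes.get(name)
        if op == "equal" and actual != value:
            return False
        if op == "greaterOrEqual" and not (actual is not None and actual >= value):
            return False
    return True

# A transport-connection managed object, as in the text's example:
tc = ManagedObject("transportConnection", state="open", octetsSent=1200)
print(evaluate_filter(tc, [("state", "equal", "open"),
                           ("octetsSent", "greaterOrEqual", 1000)]))  # True
```

An agent evaluating such filters on behalf of a manager is exactly the capability the text notes a dumb device cannot provide.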
X.500 In a distributed computing environment, the X.500 standard defines a directory service to manage information which is not likely to change often (e.g., e-mail addresses and network addresses). The directory information base (DIB) is the conceptual repository of such directory information. Structured as a directory information tree (DIT), it represents each piece of directory information (known as a directory entry) by a tree node. In the X.500 Functional Model (Fig. 3), the directory user agent (DUA) invokes directory access operations to request directory services on behalf of a directory user. The outcome of the invocation is either a result, a referral or an error. A result is returned if the request has been carried out successfully. A referral is returned if the service is unobtainable at a service access point or more easily reached at another service access point. An error is returned if the request cannot be carried out. Directory access operations fall into three categories. The read category consists of operations to retrieve the information of a directory entry, to compare the value of a directory entry with a submitted value, or to cancel an outstanding interrogation. The search category consists of operations to list the "child" subordinates of a directory entry or to search for directory entries satisfying an input filter. The modify category consists of operations to add/remove a directory entry, to modify a directory entry, or to change a directory name. The directory access operations are described in the Directory Access Protocol (DAP). In a large network, the directory services may be distributed among multiple directory service agents (DSAs). The DSAs collaborate to provide the distributed directory service in three modes: chaining, referral and multicasting. In the chaining mode, a DSA continues passing a request to another DSA until one is found that can provide the requested information.
In the multicasting mode, a DSA which does not have the requested information chains an identical request in parallel to multiple DSAs. In the referral mode, a DSA which does not have the requested information refers another DSA to the DUA. The directory service protocol (DSP) is used by the DSAs to provide the distributed directory service. The DSP operations are quite similar to the DAP operations. In the X.500 information model, a directory entry is either an object entry or an alias entry. An object entry is the primary collection of information about a real world object. On the other hand, an alias entry, which points to another directory entry, is used primarily to provide a user-friendly name to the referenced entry. Directory entries of similar characteristics are grouped into directory object classes. Examples of directory object classes are country, organization, person and device. The X.500 standard provides templates for the speci-
Figure 3. The X.500 Functional Model: DUAs reach DSAs through the DAP, and DSAs cooperate with one another through the DSP.
fication of directory object classes as well as attributes used in the definition of directory object classes. The naming of a directory entry is straightforward. As a tree node, the directory entry is labeled by a set of naming attributes known as the relative distinguished name (RDN). There is a unique path from the node to the root of the DIT. The concatenation of the RDNs of nodes in this path gives the distinguished name (DN) of the directory entry. The X.400 standard relies on DNs for the naming of X.400 users. X.400 The current trends towards electronic office automation have created a demand for universal message communication. The X.400 standard defines a store-and-forward infrastructure for the transport of interpersonal messages and business documents. Figure 4 shows the X.400 functional model. An X.400 user is a person or an application process that originates and receives messages. A user agent (UA) is a process which acts on behalf of an X.400 user. It is responsible for the submission and the delivery of messages. The message transfer system (MTS) provides the store-and-forward transfer service for the X.400 systems, and it transfers messages regardless of their content types. However, it does not examine or modify the message content unless content conversion is explicitly requested by the originator. A message store (MS) provides a secure and continuously available storage mechanism on behalf of a UA. By serving as an intermediary entity between the UA and the MTS, the MS can store incoming messages from the MTS until the UA is ready to process the messages. It can perform autoforwarding and provide a summary of the stored messages. Not every UA has an associated MS. If it does have one, all incoming messages for the UA are delivered first to the associated MS. The value of the X.400 system can be enhanced if it can be connected to other non-X.400 systems such as postal systems. 
An access unit (AU) is a functional object to link a non-X.400 system to the X.400 system. There are a number of AU types, for example, telex, teletex and facsimile. The functionality of the MTS is distributed among a set of message transfer agents (MTAs). To illustrate how a UA interacts with an MTA, suppose that the originator UA submits a user message to an originator MTA. First, the originator MTA validates the submission envelope and records appropriate administrative information on the envelope. Next, the MTA attempts to deliver the message to the recipient. If the recipient is local to the originator MTA, the delivery is straightforward because it does not involve another MTA. If not, the
Figure 4. The X.400 Functional Model: UAs act on behalf of users, the MTS transfers messages, an MS may stand between a UA and the MTS, and AUs connect non-X.400 users.
originator MTA needs to relay the message to another MTA. When there is more than one recipient, the originator MTA needs to create a copy for each MTA to which the message is relayed. An MTA, on receiving the relayed message from the originating MTA, may be held responsible for progressive delivery of the message to the intended recipient(s). An MTA, if held responsible, can discharge its responsibility by either delivering the message if the recipient is local to the MTA or by transferring the message to other MTAs that are closer (measured according to a metric, such as distance) to the recipient. If delivery is unsuccessful, its responsibility ends with the generation of a nondelivery report. The X.400 protocols are designed to reflect two major levels (i.e., the MTS and the MTS-user). The P1 protocol addresses how an MTA interacts with other MTAs to provide the distributed MTS service. The P3 protocol addresses how a UA accesses the MTS service. The P7 protocol addresses how an MS accesses the MTS service. Similar to the X.500 protocols, each X.400 protocol is modeled by a set of operations. For example, the P3 protocol specifies operations for an originator UA to submit/receive a message, and to manage messages in a mailbox. The X.400 information model deals with how a message is structured. An interpersonal message (IPM) consists of an envelope and an IPM content. The IPM content consists of a header and one or more body parts. Each body part has an encoded information type (EIT), e.g., text, facsimile or telex. If conversion is explicitly requested by the originator UA, an MTA can transform an EIT into another EIT. In addition to IPMs, there are also interpersonal notifications (IPNs). An IPN conveys a receipt/nonreceipt notification. If the originator UA requests a receipt notification, an IPN signifying receipt is sent by the recipient UA to the originator UA. The X.400 standard relies on the X.500 standard in a number of ways.
In naming, for example, an X.400 user is named by an O/R (i.e., Originator/Recipient) name, which is a two-slot data structure: a DN and an O/R address. The DN provides a user-friendly name for the X.400 user. For example, the following DN may be assigned to Adrian Tang: {C = US, ORG = UMKC, PersonalName = Adrian Tang}. The O/R address component identifies the management domains (MDs) to which an X.400 user belongs. The X.400 standard defines a management domain to be a set of messaging systems owned by an administration or an organization. An MD managed by an administration (e.g., a public carrier) is called an administration management domain (ADMD), while an MD managed by an organization (e.g., a private company) other than an administration is called a private management domain (PRMD). For example, the following O/R address may be assigned to Adrian Tang: {C = US, ADMD = ATTMAIL, PRMD = UMKC, PersonalName = Adrian Tang}. File Transfer Access Management Existing file protocols are concerned primarily with moving complete files. The file transfer access management (FTAM) standard (ISO/IEC 8571) has broadened the scope of these protocols by offering three modes of file manipulation, i.e., file transfer, file access and file/filestore management. File transfer is the movement of a complete file between two filestores in different open systems. File access enables reading, writing or deletion of selected parts of a remote file. File/filestore
management refers to the management of a remote file or filestore. The FTAM standard has found popularity in the banking industry. Figure 5 shows the FTAM functional model. Initiators and responders correspond to file clients and file servers, respectively. A virtual file/filestore is the abstraction of a real file/filestore presented by the responder to an initiator. Before the initiator can read, write, access or manage a virtual file/filestore, it must establish an FTAM dialogue with the responder. A virtual file is characterized by a document type which specifies the structure and content of the file. In the FTAM information model, the file structure is a hierarchical access structure. Each node of the structure may contain structural information (such as its name and level) and content information called a data unit (DU). File access is directed to a file access data unit (FADU), which is a subtree of the structure. As special cases of the hierarchical access structure, the unstructured file structure is a one-level structure consisting of only one node, and the flat file structure is a two-level structure in which the nodes are either named (ordered flat structure) or unnamed (sequential flat structure). A virtual file is described by two classes of attributes: file attributes and activity attributes. The values of a file attribute are supposed to remain constant throughout the lifetime of a file unless specifically modified. For example, file name, date/time of creation, document type and access control are file attributes. An activity attribute describes a file relative to a particular FTAM dialogue in progress. It is dynamic in nature and has no meaning outside the dialogue. For example, current file position, current access request and initiator identity are activity attributes. The FTAM standard introduces many service elements. FTAM functional units are used to group functionally related FTAM service elements.

Figure 5. The FTAM Functional Model: through an FTAM dialogue over the Presentation Layer, an initiator accesses the virtual filestore that a responder presents as an abstraction of its real filestore.

Some examples of FTAM functional units are given in the following.

• Kernel: This supports the basic functions for establishing and releasing an FTAM association, and selecting and deselecting a file for further processing.
• Read: This supports data transfer from a responder to an initiator.
• Write: This supports data transfer from an initiator to a responder.
• File access: This is used to locate a specific FADU for subsequent file operations.
• Limited file management: This supports file management operations such as creating and deleting files and interrogating the attributes of files.
• FADU locking: This supports concurrency control on either a file basis or an FADU basis.
• Recovery: This allows an initiator to perform recovery actions when a failure occurs after a file is opened.
• Restart: This allows a data transfer to be interrupted and restarted at some checkpoint.

Remote Database Access
Today, multivendor database interoperability is limited. A database client can only access a limited set of remote database servers. The remote database access (RDA) standard enables multivendor database interoperability. The RDA is premised on a client-server relationship between communicating end systems (Fig. 6). The RDA client is a process desiring the remote database access service supported by RDA. The RDA server is a process running on a remote end system that provides the RDA-based remote database access service to RDA clients. The RDA standard is composed of a Generic Standard (ISO/IEC 9579-1), which defines the common aspects of the RDA protocol independent of specific database models and query languages, and a Specialization Standard, which specifies the part of the RDA protocol specific to a particular database model and query language. The SQL (Structured Query Language) Specialization is specified in ISO/IEC 9579-2. The RDA standard defines two modes of operation—basic and transaction processing. The RDA basic mode is intended to be used for basic tasks such as data retrievals from a centralized database. The transaction processing mode is intended to be used for complex actions such as updates to a distributed database. The RDA standard defines services for managing an RDA dialogue, managing a transaction, controlling outstanding operations, controlling the availability of database resources, and executing database language commands. For example, the Database Language group enables an RDA client to execute database language commands which can be either executed as soon as it is issued, or defined first and then later invoked. The latter mechanism would typically be employed for efficiency purposes in the case where the same command is used repeatedly within a given RDA dialogue. The RDA standard describes procedures for handling errors and recovering from failures occurring during an RDA dialogue. 
It specifies how an RDA server must react to the failure of an RDA dialogue. When an RDA dialogue failure occurs, the RDA server deletes all state information for that
Figure 6. The RDA Functional Model: an RDA client accesses a remote database through the RDA client and server interfaces.
dialogue and rolls back all uncommitted transactions. The actions of the RDA client following an RDA dialogue failure are not addressed in the RDA standard. The RDA standard has been found useful in manufacturing and telecommunications environments. In the telecommunications environment, for instance, the Revenue Accounting Office can act as an RDA client to retrieve telecommunications usage information from a remote RDA server before applying a rate algorithm to the retrieved usage information. OSI CONCEPTS Layering The basic structuring technique used by the OSI Reference Model is layering. Layering divides the overall communication functions of an open system into a succession of smaller subsystems. Subsystems of the same rank (N) form the (N)-layer of the OSI reference model. Objects in the (N)-layer are called (N)-entities. Collectively, they provide the (N)-service that is specified in terms of service elements. The (N)-service is always an enhancement of the (N-1)-service. Before an (N+1)-entity acquires any service from the (N)-layer, it must be bound to one or more (N)-service-access-points (SAPs). At any time, no more than one (N+1)-entity can be bound to the same (N)-SAP. The service at an (N)-SAP is supported by a unique (N)-entity.
Communication Modes OSI offers three communication modes: connection-oriented, connectionless, and store-and-forward. Connection-oriented communication requires an (N+1)-association to be set up between the two communicating (N+1)-entities. The establishment of the (N+1)-association in turn requires an (N)-connection to be set up between two (N)-SAPs, i.e., the (N)-SAPs to which the (N+1)-entities are bound. The lifetime of a connection has three distinct phases: connection establishment, data transfer, and connection release. Once the connection is established, a connection endpoint identifier is assigned at each end to its local service user. During the data transfer phase, data transfer requests are logically related within the context of the connection addressed by the two connection endpoint identifiers. The establishment of an (N)-connection requires the availability of an (N-1)-connection. This means that, in the worst case, when none of the lower layers has an established connection, a connection request from a higher layer would trigger connection establishments in the lower layers. As a result, connection establishment can be time-consuming. There are at least two methods of reducing the cost incurred at connection establishment. One method is to assign reasonably permanent connections at the layers where the cost is low, regardless of whether a higher-layer connection request has been issued. The other method, embedding, attempts to establish multiple connections simultaneously. Embedding is used for connection establishment in Layers 5 and 6. The release of an (N)-connection is initiated by either an (N+1)-entity associated with the connection or the (N)-layer supporting the connection. There are three kinds of release: orderly, negotiated and destructive. In an orderly release, there is no loss in transit for data sent before the release request. In a negotiated release, which is a special case of orderly release, an (N+1)-entity can reject a release service request issued by its peer. In a destructive release, the release of an (N)-connection may disrupt the procedures of service requests issued earlier, implying potential loss of data in transit in both directions. In the connection-oriented mode, the mapping of (N+1)-connections to (N)-connections can be one-to-one, many-to-one, or one-to-many; (N+1)-multiplexing means that more than one (N+1)-connection is mapped to the same (N)-connection; (N+1)-splitting means that an (N+1)-connection is split into several (N)-connections, e.g., when the bandwidth of an (N)-connection is less than that of the (N+1)-connection. In connectionless communication, there is neither connection establishment nor connection release. The transmitted data units contain all the control information (such as addresses and quality of service) necessary for the transfer. Furthermore, they are transmitted independently of each other, meaning that there is no defined context. Data unit independence has the advantage that the data transfer service can be robust. Store-and-forward communication is a mixture of connection-oriented communication and connectionless communication. A connection between the two communicating end systems is not required, although connections are established with an intermediate system on a hop-by-hop basis. Communication between UAs in an X.400 system uses this mode.
Data Units While providing the (N)-service, the cooperating (N)-entities exchange (N)-protocol-data-units (PDUs) with each other. An (N)-PDU has two components: data and control. The control component, which is known as the (N)-protocol control information (N-PCI), contains control information such as name/version of the (N)-protocol, type of the (N)-PDU, and addresses of the communicating (N+1)-entities. The data component, which is known as the (N)-service-data-unit (SDU), contains user information which is meant to be interpreted by the receiving (N)-entity. When an (N)-entity sends an (N)-PDU to its peer, the (N)-PDU is first transported to the lower layers in the source system through encapsulation. Each time the PDU is passed to the layer below, the layer below adds a PCI prefix to the PDU. Thus, when the PDU reaches the lowest layer (i.e., the Physical Layer), it has been encapsulated with one or more PCIs. Next, the encapsulated PDU at the Physical Layer is transmitted across one or more transmission media to the Physical Layer of the target system. In the target system, the encapsulated PDU is transported to the receiving (N)-entity through decapsulation. Each time the encapsulated PDU is passed to the layer above, the layer above strips off a PCI from the PDU. A layer in the target system always strips off the PCI added earlier by the same layer in the source system. When the receiving (N)-entity receives the PDU, all the PCIs which were added by the source system should have been removed. While (N)-PDUs are messages passed between two open systems, (N)-interface-data-units (IDUs) are messages passed between two adjacent layers in the same open system, i.e., the (N+1)-layer and the (N)-layer. An (N)-IDU is passed either because of a local management need or because an (N+1)-
entity wants to send an (N+1)-PDU to its peer. In the latter case, the (N)-IDU is constructed with two components: an (N+1)-PDU (which is treated as an (N)-SDU by the (N)-layer) and an (N)-interface-control-information (ICI). The (N)-ICI contains control information (e.g., the address of the sending (N+1)-entity), which is to be interpreted by the (N)-layer in the source system. In some cases, the sending (N+1)-entity may compose more than one (N)-IDU in order to send an (N+1)-PDU to its peer. This happens when the (N)-layer imposes a size constraint on an (N)-IDU. Since (N)-IDUs are messages passed between adjacent layers in an open system, the design of their structures is a local implementation issue. OSI standards, which are only concerned with interconnection matters, do not define (N)-IDUs. When the (N)-layer receives an (N)-IDU, it will separate the (N)-ICI from the (N)-SDU. From the control information in the (N)-ICI, an (N)-PCI is built which is then concatenated with the (N)-SDU to form an (N)-PDU (Fig. 7). The (N)-PDU is subsequently passed to the (N-1)-layer through the use of one or more (N-1)-IDUs. Naming and Addressing The OSI environment is rich in objects. Some OSI objects require a global identification so that they can be referenced unambiguously by applications. They include abstract/transfer syntax, application contexts, FTAM document types and managed object classes. Attribute-based names and object-identifier-based names are found in the OSI environment. An attribute-based name is made up of a set of naming attributes, where each attribute can be represented by a pair, such as (STATE, Missouri). For example, X.500 names are attribute-based names. Object-identifier-based names are given by object identifiers. To explain object identifiers, we first describe the object identifier tree (OIT), which is defined to provide global naming. In the OIT, leaf nodes represent objects or object classes, and nonleaf nodes represent administrative authorities.
Each arc of the OIT is labeled by an integer and, occasionally, a mnemonic name for descriptive purposes. An object identifier is the ordered sequence of integer labels on the unique path from the root to a node in the OIT. Let us examine the arcs of the OIT in more detail. The OIT begins with three numbered arcs emanating from the root:
Figure 7. Building an (N)-PDU and an (N-1)-IDU: the (N)-SDU received from the (N+1)-layer is combined with an (N)-PCI to form the (N)-PDU, which is passed down together with an ICI as an (N-1)-IDU.
arc 0 for ITU; arc 1 for ISO; and arc 2 for joint ISO-ITU. Below ITU there are arcs leading to recommendations (0), questions (1), administrations (2), and network operators (3). Below ISO there are arcs leading to standard (0), registration-authority (1), member-body (2) and identified-organization (3). The arcs below standard (0) shall have the number of an International Standard. Consider the FTAM standard, i.e., ISO/IEC 8571. An arc with the arc number 8571 can be created for the FTAM standard. In this way, the object identifier, {1(ISO) 0(standard) 8571}, can be used to name the FTAM standard. An address is used to locate an object, e.g., an (N+1)-entity. Suppose that an (N+1)-entity is bound to one or more (N)-SAPs. Any one of these (N)-SAPs can be used to locate the (N+1)-entity. Hence, an address of the (N+1)-entity can be given by a name identifying the set of (N)-SAPs to which the (N+1)-entity is bound. Such an address is called an (N)-address, or an (N)-SAP address when there is only one (N)-SAP in the set. To locate an (N+1)-entity, we need lower-layer addressing information to identify a path from the lower layers all the way up to the target (N+1)-entity. An (N)-address (for N greater than 3) has two components, i.e., an (N-1)-address of a supporting (N)-entity, and an (N)-suffix, known as an (N)-selector. The (N)-selector is used to identify the set of (N)-SAPs to which the (N+1)-entity is bound. Accordingly, a presentation (i.e., Layer 6) address is given by a session address and a presentation selector, a session address is given by a transport address and a session selector, and a transport address is given by a network address and a transport selector. In short, a presentation address can be represented by a quadruple consisting of a presentation selector, a session selector, a transport selector and a network address.
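The object-identifier naming just described, in which an OID is simply the path of integer arc labels from the OIT root, can be sketched as follows. The arc table below covers only the labels quoted above; everything else is illustrative:

```python
# Sketch of object-identifier naming. An OID is the sequence of integer arc
# labels on the unique path from the OIT root to a node; mnemonic names are
# attached to some arcs for descriptive purposes only.

OIT_ARCS = {
    (): {0: "itu", 1: "iso", 2: "joint-iso-itu"},
    (1,): {0: "standard", 1: "registration-authority",
           2: "member-body", 3: "identified-organization"},
}

def oid_to_string(arcs):
    """Render an OID in the conventional dotted form, e.g. (1, 0, 8571) -> '1.0.8571'."""
    return ".".join(str(a) for a in arcs)

def oid_with_names(arcs):
    """Render with mnemonic names where registered, e.g. '{iso(1) standard(0) 8571}'."""
    parts, path = [], ()
    for a in arcs:
        name = OIT_ARCS.get(path, {}).get(a)
        parts.append(f"{name}({a})" if name else str(a))
        path = path + (a,)
    return "{" + " ".join(parts) + "}"

ftam_oid = (1, 0, 8571)   # the FTAM standard, ISO/IEC 8571
print(oid_to_string(ftam_oid))    # 1.0.8571
print(oid_with_names(ftam_oid))   # {iso(1) standard(0) 8571}
```

The registration tree guarantees global uniqueness: no two authorities can assign the same label sequence, because each authority controls only the arcs below its own node.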
OSI LAYERS Physical Layer The purpose of the physical layer is to hide the nature of the physical media from the data link layer in order to maximize the transportability of higher-layer protocols. It provides mechanical, electrical, functional, and procedural means to activate, maintain, and deactivate physical connections for serial bit streams between data link entities. The common functions found in the physical layer are synchronization and multiplexing. The physical layer standards should be distinguished from the physical interface standards (e.g., X.21) which define the boundary or interface between the physical layer and the physical transmission medium. Data Link Layer The data link layer is responsible for error-free data transmission over a data link. It creates data packets, synchronizes the data packets, detects and corrects errors, and controls the flow of the packet stream. The data link service definition of the ISO high-level data link control (HDLC) protocol covers both connection mode operation and connectionless mode operation. While HDLC is seen as a superset of data link procedures, many interesting subsets are defined out of it. For example, the HDLC LAP B subset is adopted by ITU as part of the X.25 packet-switched
network standard; HDLC LAP (Link Access Procedure) D, a data link standard developed as part of the ISDN standardization, is a subset of LAP B. Network Layer The Network Layer provides the service for network service users to exchange information without being concerned with the topology of the network or the characteristics of each constituent subnetwork. It is perhaps the most complex of the seven OSI layers because many existing subnetwork types use different network addressing schemes, network protocols and communication modes. There are two modes of network service: connection-oriented network service (CONS) and connectionless network service (CLNS). The specification of CONS has six service elements: N-CONNECT to set up a network connection; N-DATA to send normal user data; N-DATA-ACKNOWLEDGE to acknowledge the receipt of normal user data; N-EXPEDITED to send expedited data; N-RESET to reset a network connection; and N-DISCONNECT to release a network connection. In CLNS, there is only one service element, N-UNITDATA, which is used to send data. In the Network Layer, there are two kinds of network protocols: routing protocols and interconnection protocols. Routing protocols are used to maintain a routing information base, to collect routing information, to distribute routing information to other nodes and to calculate the metrics of a route. Interconnection protocols are used to support the integration of subnetworks of different types. A routing framework should be in place before routing protocols are introduced. ISO/IEC TR (Technical Report) 9575 defines a routing framework, essentially partitioning the global network into administrative domains and routing domains. An administrative domain is an autonomous set of intermediate systems (ISs) and end systems (ESs) running under a single administration. Within an administrative domain, there may be one or more routing domains. Each routing domain runs the same IS-IS routing protocol among its ISs.
Of all the OSI routing protocols, only the IS-IS routing protocol used within a routing domain is discussed here. This protocol has heavily influenced the design of the Open Shortest Path First (OSPF) protocol, an Internet routing protocol. In the past, intra-domain routing protocols were primarily based on the distance vector algorithm, which requires an IS to periodically send its routing table to its adjacent neighbors. Because the routing information propagates from one link to another, it may take considerable time before a remote IS receives an update. Thus, the major drawback of the distance vector algorithm is slow convergence, resulting in the receipt of occasional stale updates. The link state algorithm, which is adopted by ISO for the IS-IS routing protocol, requires each IS to maintain a complete topology map. Instead of sending a global routing table to its adjacent neighbors, an IS periodically broadcasts the status of its adjacent links. On receipt of link information from other ISs, an IS can build an up-to-date topology map, modeled as a weighted graph. Using this graph, the IS can apply Dijkstra's Shortest Path First (SPF) algorithm to compute the shortest distance to any destination; the algorithm converts the weighted graph into a shortest-path tree rooted at the computing IS. Because the link state algorithm requires each IS to
broadcast its link state status, every IS can maintain a consistent view of the entire topology. The slow convergence drawback of the distance vector algorithm is thus avoided. Interconnection protocols are used to interconnect subnetworks, which may use different network address schemes, network protocols and communication modes. The current solution is to use interworking units (IWUs) which perform relaying. Relaying involves the use of convergence protocols which can adapt non-OSI network protocols to OSI protocols. ISO/IEC 8648 defines a framework of the Network Layer for the introduction of interconnection protocols. In the Internal Organization of the Network Layer (IONL) model (Fig. 8), there are three sublayers of the Network Layer.

• Subnetwork access sublayer. This sublayer provides the attachment point of a subnetwork. A subnetwork access protocol (SNAcP) operating at this sublayer is a protocol associated with an underlying subnetwork. This protocol may or may not conform to the OSI network service requirement.
• Subnetwork dependent sublayer. This sublayer is responsible for augmenting the service offered by a subnetwork technology into something close to the OSI network service. A subnetwork-dependent convergence protocol (SNDCP) is used for this sublayer. The operation of an SNDCP depends on the network service of a particular subnetwork. A common function of an SNDCP is to map between OSI network addresses and addresses specific to the subnetwork.
• Subnetwork independent sublayer. This sublayer provides the OSI network service over a well-defined set of underlying capabilities, which need not be based on the characteristics of any particular subnetwork. When a subnetwork-independent convergence protocol (SNICP) is used, it is defined to require a minimal set of services from a subnetwork. Connectionless Network Protocol (CLNP), which is the OSI interconnection protocol providing CLNS, is an example of an SNICP.
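Stepping back to the link state routing discussion above: the SPF computation that each IS performs over its topology map can be sketched with Dijkstra's algorithm. The topology and link metrics below are hypothetical:

```python
import heapq

# Dijkstra's SPF over a link-state topology map (hypothetical ISs A-E).
# Each IS holds the full weighted graph built from received link-state broadcasts
# and computes shortest distances from itself to every other IS.

def spf(graph, source):
    """Return the shortest distance from `source` to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, metric in graph.get(node, {}).items():
            nd = d + metric
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1, "E": 3},
    "E": {"D": 3},
}
print(spf(topology, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4, 'E': 7}
```

Because every IS runs this computation on the same topology map, the forwarding decisions are mutually consistent, which is why the link state approach avoids the stale-update problem of distance vector routing.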
Using the IONL model, there are three basic strategies to interconnect subnetworks.

• Interconnection of subnetworks which support OSI network services. In this strategy, all the subnetworks involved fully support the OSI network service. There is no need for an enhancement protocol.

• Hop-by-hop enhancement. This approach is used in an environment containing at least one subnetwork type,
[Figure 8 shows two protocol stacks side by side, each divided into three sublayers: a subnetwork independent sublayer (SNICP1, SNICP2), a subnetwork dependent sublayer (SNDCP1, SNDCP2) and a subnetwork access sublayer (SNAcP1, SNAcP2), with routing and relaying performed above the sublayers.]

Figure 8. The IONL Model.
ISO OSI LAYERED PROTOCOL MODEL
which does not provide the OSI network service. It takes each of these subnetworks individually and enhances its subnetwork service to the level of the OSI network service. Different SNDCPs may be required on different subnetworks.

• Internet approach. This approach is used in an environment containing at least one subnetwork type which does not provide the OSI network service. The SNICP (e.g., CLNP), which assumes a minimal set of network services from the underlying subnetworks, operates on top of the SNAcP in all the systems attached to the subnetworks.

The last topic in this section is OSI network addressing. The Network Layer must provide a global addressing scheme so that ESs in different subnetworks can be addressed unambiguously. An NSAP address is an address of an SAP of the Network Layer. To cope with the multitude of NSAP addresses in the global environment, NSAP addresses are partitioned into network addressing domains in a hierarchical fashion. Each network addressing domain has its own network addressing format and is administered by an address registration authority. An address registration authority may further suballocate its addressing space to another address registration authority. On the whole, the network addressing domains are structured as a tree whose root has seven top-level network addressing domains (e.g., X.121, E.164 and ISO 6523-International Code Designator) as children. An NSAP address consists of an initial domain part (IDP) and a domain specific part (DSP). The IDP, in turn, consists of an authority and format identifier (AFI) and an initial domain identifier (IDI). The AFI specifies one of the seven top-level addressing domains as well as the abstract syntax of the DSP (e.g., binary octets, decimal digits, characters). For example, an AFI value of 47 implies that the format is ISO 6523-International Code Designator (ICD) and the DSP abstract syntax is binary.
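The AFI/IDI/DSP split can be illustrated with a small helper. This is a simplified sketch: it handles only the AFI 47 (ISO 6523-ICD) example from the text, assumes the IDI is a four-digit International Code Designator, and the sample address is invented; real NSAP parsing depends on the AFI-specific IDI length and syntax.

```python
def split_nsap(addr_hex):
    """Split an NSAP address (given as a hex string) into AFI, IDI and DSP.

    Simplified illustration: only AFI 47 (ISO 6523-ICD, binary DSP) is
    handled, and its IDI is assumed to be a four-digit ICD.
    """
    afi = addr_hex[0:2]
    if afi != "47":
        raise ValueError("only the AFI 47 example from the text is handled")
    idi = addr_hex[2:6]   # e.g. 0005 = US Government OSI Profile
    dsp = addr_hex[6:]    # assigned by the authority named in the IDI
    return afi, idi, dsp

print(split_nsap("470005AABBCCDD"))  # ('47', '0005', 'AABBCCDD')
```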
The IDI component is used to specify an addressing registration authority, e.g., the US Government OSI Profile (0005). The DSP is the part of an NSAP address assigned by the addressing registration authority specified in the IDI component. This is the place where one can put information on the routing domain and an ES.

Transport Layer

The transmission quality of the network service may not meet the requirements of an application because of possible signaled and residual errors. By performing transmission quality enhancements such as error detection and recovery, the transport layer (ISO/IEC 8072/8073) provides a reliable end-to-end service. There are two modes of transport service: connection-oriented transport service (COTS) and connectionless transport service (CLTS). When operating in COTS, the transport layer provides full-duplex transmission between the communicating transport service users. The following discussion focuses on COTS. The transport functions invoked by the transport layer depend on the underlying network type. If the underlying network is reliable, only a simple transport protocol is needed. On the other hand, if the underlying network is unreliable, a sophisticated transport protocol involving elaborate transport
mechanisms is needed. For this reason, the following five connection-oriented transport protocol classes have been defined.

• TP 0: This is designed to operate over a reliable network. It is a simple protocol.

• TP 1: This is designed to operate over an X.25-like network, which may have an unacceptable signaled error rate. It is capable of resynchronization upon network-signaled resets, and it can reassign the transport connection to a new network connection in the event of a network failure.

• TP 2: Similar to TP 0, this is designed to operate over a reliable network. It adds a multiplexing capability to TP 0 so that multiple transport connections can share a single network connection.

• TP 3: Designed to operate over an X.25-like network, this is basically a combination of TP 1 and TP 2.

• TP 4: This is designed to operate over an unreliable network. It is the most sophisticated transport protocol.

The TP 4 protocol procedures are complicated. To set up a transport connection, a three-way exchange of protocol messages is needed to avoid processing connection establishment PDUs belonging to transport connections which have already been released. Despite the complexity of the transport protocols, the specification of COTS is straightforward. There are four service elements: T-CONNECT to set up a full-duplex transport connection; T-DATA and T-EXPEDITED-DATA to deliver normal data and expedited data, respectively; and T-DISCONNECT to release a transport connection. The T-DISCONNECT service is disruptive because data sent before the service request may be lost.

Session Layer

The transport layer provides a full-duplex, unstructured pipe between two communicating application processes. In many cases, this pipe is sufficient. In other cases, where bulk data transfer is necessary, it is desirable to add structure to the transport pipe so that the application processes can synchronize the data stream and perform recovery after a failure.
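The choice among the five transport classes turns on whether the network is reliable, whether it is X.25-like (signaled errors and resets), and whether multiplexing is wanted. The function below is an illustrative summary of the list above, not part of any standard's text.

```python
def transport_class(reliable, x25_like, need_multiplexing):
    """Illustrative mapping from network properties to an OSI TP class (0-4)."""
    if x25_like:
        # Signaled errors/resets: TP 1, or TP 3 when multiplexing is also wanted.
        return 3 if need_multiplexing else 1
    if reliable:
        # Reliable network: TP 0, or TP 2 when multiplexing is wanted.
        return 2 if need_multiplexing else 0
    return 4  # unreliable network: full error detection and recovery

print(transport_class(reliable=True, x25_like=False, need_multiplexing=True))   # 2
print(transport_class(reliable=False, x25_like=False, need_multiplexing=False)) # 4
```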
The session layer (ISO/IEC 8326/8327) enhances the services of the transport layer by enabling application processes to synchronize their dialogues and to manage their data transfer. There are two methods to structure the transport pipe as a session dialogue: one uses the notion of an activity and the other does not. In the latter method, major synchronization points are inserted into the transport pipe to subdivide the pipe into dialogue units. A major synchronization point marks the end of a dialogue unit, and no recovery is permitted back beyond that point. Within each direction of a dialogue unit, minor synchronization points can also be added to facilitate recovery. Whenever resynchronization is needed during recovery, a session service user can resynchronize the dialogue to the previous confirmed minor synchronization point within the current dialogue unit. The other method of organizing a session dialogue involves the use of activities. Conceptually, an activity represents a logical piece of work such as a file. It can be dynamically activated, interrupted, resumed or even discarded, thereby supporting parallel tasking and data recovery. The use of activities gives a three-level structure to a session dialogue. At the
topmost level, the dialogue is structured into activities. At the second level, each activity is structured into dialogue units. At the third level, each dialogue unit is structured using minor synchronization points. Note that an activity is a logical concept. As such, an activity may span several session connections, and several overlapping activities may be contained in a session connection.

A session connection has four token attributes: the data token; the release token; the minor synchronize token; and the major synchronize/activity token. Through negotiation, these tokens are assigned to one of the two session service users during either the connection establishment phase or the data transfer phase. The owner of the data token can send data to its peer during a half-duplex data transfer. The use of the release token allows a session service user to refuse the release of a session connection (e.g., if it has data to send) when its peer, the owner of the release token, requests the release. The owner of the minor synchronize token (e.g., the sender of a file) can insert minor synchronization points. Finally, the owner of the major synchronize/activity token can insert major synchronization points to mark the beginning of a dialogue unit or an activity.

The S-CONNECT service element is used to establish a session connection. When the session connection is established, either an existing transport connection is used or a new transport connection is established. When the session connection is released, the associated transport connection need not be released and thus can be kept in a reservation pool for future session connections. The session service elements available for the data transfer phase are used for dialogue control, synchronization and resynchronization. The full-duplex mode, which permits both session service users to send data simultaneously, does not require dialogue control and hence no data token is needed.
The half-duplex mode, which enables only one session service user at a time to send data, requires dialogue control and hence the use of the token management capability. There are three session service elements for token management: S-TOKEN-GIVE to surrender a token; S-CONTROL-GIVE to surrender the entire set of available tokens; and S-TOKEN-PLEASE to request a peer to relinquish the ownership of one or more tokens. Whereas the transport layer provides only two types of data transfer facility, the session layer provides four. Normal data are sent using the S-DATA service element. When the data token is available, only the owner can invoke S-DATA. Expedited data are sent using the S-EXPEDITED-DATA service element. Due to limitations of the underlying expedited transport data facility, a maximum of 14 octets of expedited data can be transferred. The S-TYPED-DATA service element allows a session service user to send data outside the normal data stream, independent of the availability and the assignment of the data token. When used properly, this facility gives the session service users a mixed half/full-duplex mode which is useful in many applications. The S-CAPABILITY-DATA service element is used to send limited data between two activities. At the end of an activity, for example, the session service users can use capability data to decide which activity to start next. Synchronized data transfer is intended for bulk data transfer, to facilitate error or crash recovery. It is achieved by
inserting minor or major synchronization points. The owner of the minor synchronize token can use the S-SYNC-MINOR service element to insert a minor synchronization point. In practice, the two session service users agree on a window size during initialization, with the understanding that the receiver should acknowledge all previously unconfirmed minor synchronization points before the two users exceed the window size. Unlike S-SYNC-MINOR, S-SYNC-MAJOR is a confirmed service element. Once a session service user invokes an S-SYNC-MAJOR request, a confirmation must be received. The receipt of a confirmation acknowledges not only the major synchronization point but also all previously unconfirmed minor synchronization points. It marks the end of the current dialogue unit; thus the sender can discard all the data associated with the dialogue unit.

Resynchronization is normally triggered by a notification which signals a possible failure. The notification is initiated by either a session service user or a session entity. The S-U-EXCEPTION-REPORT service element is used by a session service user to report a user error (e.g., failure to hand over the data token) to its peer. The S-P-EXCEPTION-REPORT service element is used by a session entity to indicate an internal error (e.g., a session protocol error) to the session service users. Typically, following an S-U-EXCEPTION-REPORT or S-P-EXCEPTION-REPORT indication, a session service user would initiate resynchronization by invoking the S-RESYNCHRONIZE service element. In the request, it must specify the resynchronize type, such as abandoning the current dialogue unit or resynchronizing the session to an unacknowledged checkpoint within the current dialogue unit. If necessary, the available tokens may be reassigned. Five session service elements are available for the management of activities.
They are: S-ACTIVITY-START to initiate a new activity; S-ACTIVITY-INTERRUPT to interrupt an activity; S-ACTIVITY-RESUME to resume an activity; S-ACTIVITY-DISCARD to discard an activity; and S-ACTIVITY-END to end an activity.

Two kinds of orderly release are made available to session service users: negotiated release and nonnegotiated release. The distinction between the two forms of release is based on the use of the release token. If this token is not available, the release cannot be negotiated. If the release token is available and the owner invokes the S-RELEASE service, the owner's peer may choose to reply negatively to the release request when it still has data to send. For destructive release, the session layer provides an abortive service which can be either user-initiated (S-U-ABORT) or provider-initiated (S-P-ABORT).

Presentation Layer

The presentation layer (ISO/IEC 8822/8823) provides a pass-through capability that makes the entire set of session services visible to application processes as presentation services. In addition, it is responsible for handling the representation of the application information exchanged during communication. While the representation of information in an end system is a local issue, the two communicating application processes must agree upon what type of information is exchanged and how such information is represented during the transfer. During presentation connection establishment, the two application processes must identify the abstract syntax for the information to be exchanged over the connection. An
abstract syntax can be represented by one or more transfer syntaxes. Therefore, the two application processes must also agree upon a common transfer syntax for every abstract syntax. Once the abstract syntax and the corresponding transfer syntax have been established, mapping between the local syntax (i.e., the syntax used for the local representation of information) and the transfer syntax can begin. A presentation context is a pair consisting of an abstract syntax and a transfer syntax. The objective of the presentation layer is to establish a set of presentation contexts for the communication. This set is known as the defined context set (DCS). Each presentation context in the DCS is named by an integer-valued presentation context identifier (PCI). The DCS can be the empty set if the two application processes have previously agreed upon a default context. A presentation data value, which is passed to the presentation layer by an application process, may be composed of values from one or more abstract syntaxes. There are two ways for the presentation layer to encode a presentation data value: full encoding and simple encoding. Full encoding encodes a presentation data value as a presentation data value (PDV) list. Each component in the list is a PCI followed by a value encoded using the appropriate transfer syntax. The PCI value is always encoded using the basic encoding rules (BER), one of the transfer syntaxes defined by ISO/IEC 8825. Simple encoding is used when the DCS is empty or when the DCS contains only one presentation context. In simple encoding, the presentation data value is given simply by the encoded value; the PCIs are omitted since they are not necessary. As far as the presentation services are concerned, the presentation layer makes the session services directly accessible to the application processes; it offers only a few services of its own. In fact, it offers only one service element which is not related to any session service element.
This service element, P-ALTER-CONTEXT, provides the capability for the presentation service users to modify the DCS such as adding/deleting a presentation context. For example, an FTAM initiator, which may not know the abstract syntax of a file that it wants to open when the FTAM dialogue is established, can use this service element to add a presentation context for the abstract syntax of the file once it is known. A presentation connection is established using the P-CONNECT service element. The following explains how the initial DCS is established during connection establishment. When a presentation service user makes a P-CONNECT request, it passes the presentation context definition parameter. This parameter specifies a partially filled DCS where each item in the list contains two components—PCI and name of an abstract syntax. On receiving the P-CONNECT request, the local presentation entity first determines which transfer syntax it can use to represent each abstract syntax. It then creates a presentation context definition result list, where each abstract syntax is mapped to the set of transfer syntax, which the local presentation entity can support. The presentation context definition result list is passed to the peer presentation entity by means of a presentation CP (Connect Presentation) PDU. The peer presentation entity passes the presentation context definition result list to the called presentation service user. If the called presentation service user accepts the request, a possibly modified result list is returned. On receiving the response, the local presentation entity has a chance to modify the presentation context definition result list before it
returns the modified result list to the presentation entity of the initiating presentation service user. At this stage, the initial DCS is established. The remaining presentation service elements are almost identical to the session service elements. Since the presentation layer reproduces the services of the session layer in a pass-through manner, it is more efficient to implement the two layers in a single implementation module. By sharing the global data structures defined for the two layers, expensive copying can be avoided.

For the rest of this section, examples are given as a brief introduction to Abstract Syntax Notation One (ASN.1) and BER. ASN.1 is similar to the data declaration part of a high-level programming language. It provides language constructs to define types and values. Types correspond to structures and values correspond to content. Unlike the types of a programming language, ASN.1 types, which are meant to be machine-independent, need not be implemented on any machine. For example, the ASN.1 INTEGER type allows all integers as values. An abstract syntax is a named group of ASN.1 types and values. It can be defined by a standards group, a profile group or a user group. One of the reasons why types are grouped into an abstract syntax is that values of these types are meant to be encoded by the same transfer syntax. Thus an abstract syntax can be viewed as a unit for transfer encoding. An ASN.1 module is the ASN.1 notation used to define an abstract syntax:
ModuleExample DEFINITIONS ::= BEGIN
  TypeA ::= INTEGER
  TypeB ::= BOOLEAN
  valueA TypeA ::= 10
  valueB TypeB ::= TRUE
END

The foregoing module is named ModuleExample. It has two ASN.1 types and two values. An ASN.1 type is either simple or structured. Simple ASN.1 types include INTEGER, REAL, BOOLEAN, CHARACTER STRING, BIT STRING, OCTET STRING, NULL and OBJECT IDENTIFIER. Structured types are built from simple types:
Person ::= SEQUENCE {
  name IA5String (SIZE (0..64)),
  phone IA5String (SIZE (0..64)) OPTIONAL,
  email SET OF IA5String OPTIONAL
}

The foregoing ASN.1 type can be used to represent a person. The following observations about the type are made.

• The keyword OPTIONAL means that the corresponding component of the sequence can be omitted when a value is specified.

• The IA5String type is a CHARACTER STRING type consisting of characters taken from IA5 (International Alphabet No. 5).

• IA5String (SIZE (0..64)) is a subtype of IA5String whose allowed strings have a maximum size of 64 characters.

• SET OF is a structured type representing an unordered list.
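Values of a type such as Person are transferred using a transfer syntax such as BER, which wraps every value in a type-length-value triple. The helper below is a minimal sketch assuming only the short definite-length form; the name value is invented for the example.

```python
def tlv(tag, content):
    """BER Type-Length-Value using the short definite-length form
    (content shorter than 128 octets)."""
    if len(content) >= 128:
        raise ValueError("long-form lengths are not handled in this sketch")
    return bytes([tag, len(content)]) + content

# Universal tags: 0x02 INTEGER, 0x16 IA5String, 0x30 SEQUENCE (constructed).
name = tlv(0x16, b"Alice")   # IA5String "Alice"
person = tlv(0x30, name)     # a Person value with phone and email omitted
print(person.hex())          # 30071605416c696365
```

Because SEQUENCE is a constructed type, its content field is simply the concatenation of the already-encoded TLV triples of its present components, which is why optional components can be omitted without ambiguity.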
ISO has defined a number of transfer syntaxes, including the Basic Encoding Rules, the Distinguished Encoding Rules and the Packed Encoding Rules. The following gives a brief introduction to BER, which is by far the most popular transfer syntax. Every BER-encoded value has three fields: a tag (identifier) field that conveys information on the type and the encoding form; a length field that defines the size of the content in octets; and a content field that conveys the actual value. A BER-encoded value is therefore sometimes called a Type-Length-Value (TLV) triple. Each ASN.1 type has an associated encoded tag which is used for the tag field in a TLV triple. The following gives an instance of a SEQUENCE type and its BER encoding:
Constructed ::= SEQUENCE {
  name OCTET STRING,
  place INTEGER { room1(0), room2(1), room3(2) },
  persons INTEGER OPTIONAL
}

meeting Constructed ::= {
  name '1AA2FFEE'H,
  place room3
}

The TLV encoding (in hex) of meeting, which is a constructed value, is

30 09 04 04 1A A2 FF EE 02 01 02

The following explains how the encoded value is derived:

• The 30 is the encoded tag value for SEQUENCE.

• The 09 means that the length of the content field is 9 octets.

• The 04 04 1A A2 FF EE portion is the encoding of the octet string '1AA2FFEE'H, where the encoded tag for OCTET STRING is 04 and the length is 4 octets.

• The 02 01 02 portion is the encoding of the integer 2 (the value denoted by room3).

Application Layer

The application layer provides all the communication support to application processes. Hence, if the lower six layers do not provide the required communication support, the application layer has to provide it. A framework to build the objects in the application layer is needed. The application layer structure standard (ISO/IEC 9545) defines a framework around which application standards can be developed. Conceptually, an application process can be divided into communication objects and noncommunication objects. The communication objects (i.e., objects which provide communication capabilities to the application process) are called application entities (AEs). An application process may have one or more AEs. For example, a business application may consist of an AE containing X.400 capabilities and an AE containing FTAM capabilities. The division is only conceptual, so an actual implementation of an application process may not follow such a division. The structure of an AE can be complex. To understand how one AE communicates with another AE, it is necessary to refine the AE into granular components and analyze how these components communicate with their peers. In this way, the
design of an application protocol between two communicating AEs can be reduced to the design of an application protocol between two communicating components of less complexity. The application layer structure standard proposes structuring an AE in a recursive manner, starting with atomic components called application service elements (ASEs). One or more ASEs can be combined to form an application service object (ASO). An ASO can be combined with one or more ASEs or ASOs to form another ASO. Continuing this recursion, the outermost ASO, which is the AE, is derived. Every ASO contains a control function (CF). The CF acts as a traffic cop to coordinate the activities of the ASEs and ASOs within the outermost ASO. In particular, the CF unifies the services of the various components of an ASO, and it may add temporal constraints on the use of the combined service.

Before two AEs can communicate with each other, they must first establish an application association, which is an association between two AE-invocations (i.e., invocations of an AE). An application context, which is the most important attribute of an application association, defines the rules to be enforced during the lifetime of the application association. In particular, it specifies the required ASOs and ASEs, the abstract syntaxes that may be referenced, and the binding/unbinding information that needs to be exchanged before an application association is established/released. In short, an application context defines the working environment or knowledge that is shared by the AEs for the duration of the application association. The ASEs and ASOs can be viewed as workers operating within the constraints of the application context. Each worker communicates with a peer worker (of the same type) using a specialized protocol (e.g., an application protocol of an ASE). The decomposition of an AE into ASOs and ASEs is only static; it does not mean that all the ASOs and ASEs in an AE are always involved in an AE-invocation.
For example, an AE may have five ASEs and three ASOs, but a particular AE-invocation may involve only three ASEs and two ASOs in an application association. To understand the dynamic behavior of an AE, one should examine the structure of an AE-invocation. By interacting with its peers, an AE-invocation can be involved in multiple application associations. Conceivably, the application contexts for these application associations may differ from each other. Therefore, one can refine an AE-invocation into components, with one component for each application association. These component objects are called single association objects (SAOs). They are active objects since they maintain state. Every SAO contains a single association control function (SACF) which acts as a coordinator of the ASEs and ASOs involved in an application association. When the AE-invocation contains several SAOs, there may be a need for a multiple association control function (MACF) to coordinate the SAOs. An MACF is similar to an executive manager. In some cases, an MACF is not needed because the SAOs do not need coordination.

The structure of the application layer thus reduces the design of an application protocol between two AEs to that of an application protocol between two ASEs. Since ASEs are the basic building blocks of an AE, the common ASEs and their associated application protocols should be standardized first. Common ASEs provide generic communication capabilities to a number of applications. Examples include the Application
Control Service Element (ACSE), the Remote Operation Service Element (ROSE) and the Reliable Transfer Service Element (RTSE). The use of common ASEs ensures that applications can be built in a consistent manner. In addition to the common ASEs, there are specific ASEs that provide specific capabilities to applications. The FTAM ASE defined in the FTAM standard is an example of a specific ASE.

The ACSE standard is defined for the purpose of establishing and releasing application associations. An application association is a presentation connection with additional application layer semantics, e.g., application context negotiation and peer-to-peer authentication. Currently, there is a one-to-one mapping between application associations and presentation connections. Future versions of the ACSE standard might permit a presentation connection to be reused for a new application association, or multiple application associations to be interleaved onto a single presentation connection. There are four ACSE service elements. The A-ASSOCIATE service element is used to establish an application association between two AE-invocations. The A-RELEASE service element is used to release an application association in an orderly manner. The A-ABORT service element is used by an AE-invocation to abort an application association with possible loss of transit data. The A-P-ABORT service element is used by the ACSE service provider to signal that an application association has been aborted. The ACSE service elements are mapped onto the presentation service elements in a straightforward manner. Because all the presentation parameters are supplied by the ACSE users, there are over 30 A-ASSOCIATE parameters. Additional A-ASSOCIATE parameters may be added in the future whenever there is a need to provide additional semantics for an application association.
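The four ACSE service elements drive a simple association life cycle. The class below is a hypothetical sketch of that life cycle, not an ACSE implementation; only the application context parameter is modeled, and the state names and context value are invented.

```python
class Association:
    """Toy life cycle of an application association under the four
    ACSE service elements (illustrative sketch only)."""

    def __init__(self):
        self.state = "idle"
        self.context = None

    def a_associate(self, application_context_name):
        # A-ASSOCIATE: establish the association under a given context.
        assert self.state == "idle"
        self.context = application_context_name
        self.state = "associated"

    def a_release(self):
        # A-RELEASE: orderly release; no transit data are lost.
        assert self.state == "associated"
        self.state = "released"

    def a_abort(self):
        # A-ABORT: user-initiated abort; transit data may be lost.
        self.state = "aborted"

assoc = Association()
assoc.a_associate("hypothetical-ftam-context")
assoc.a_release()
print(assoc.state)  # released
```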
Of the A-ASSOCIATE parameters which are application-specific (i.e., not specific to the presentation layer), only the application_context_name parameter, which specifies the application context, is mandatory.

In a typical interactive environment, an AE-invocation requests a remote AE-invocation to perform an operation; the remote AE-invocation executes the operation and returns either an outcome or an error. Because many distributed applications are built around this kind of interaction, it is useful to have an ASE that provides such interactive communication support. The Remote Operation Service Element (ROSE) standard is written for this purpose. It is used by application protocols such as CMIP and DAP/DSP. The ROSE standard defines a model for remote operations. A remote operation is requested by an invoker. The performer attempts to execute the operation and reports the outcome, which is either a normal outcome or an exception. Every invocation is identified by an invocation identifier, which differentiates it from other invocations of the same operation. In addition, it may have a linked invocation identifier, indicating that the operation is part of a group of linked operations formed by a parent operation and one or more child operations. The performer of the parent operation may invoke zero or more child operations to be performed by the invoker of the parent operation. There are five ROSE service elements. An invoker uses RO-INVOKE to request that a remote operation be performed. After execution, a positive result is returned using RO-RESULT, while a negative result is returned using RO-ERROR. When a ROSE user detects a problem with the invocation,
it can use RO-REJECT-U to reject the request or the reply. The ROSE service provider uses RO-REJECT-P to inform ROSE users of problems such as a badly structured application PDU. ROSE is not meant to be used as a standalone ASE in an application context. In any application context using ROSE, there must be one or more ASEs which supply the remote operations for the application's needs. ROSE only acts as a courier for such remote operations. To facilitate the specification of remote operations by ROSE users, the ROSE standard provides templates for the definition of remote operations.

CONCLUSIONS

The OSI Reference Model is a carefully designed model that provides the framework for the development of protocols to interconnect open systems. By providing a rich set of communication functionalities, it meets all the conceivable interconnection requirements. A seven-layer implementation necessitates good software engineering techniques and a sound understanding of OSI concepts. Independently manufactured OSI implementations exist and have proven interoperability. Most of them are based on a profile that meets the requirements of a specific application. A reduced profile, known as Minimal Open System Interconnection (MOSI), was proposed by the OSI Regional Workshop; this profile would meet the requirements of most of the networking applications that exist today. Opponents of OSI believe that the functionalities of the OSI Reference Model are overkill. For instance, very few applications require most of the session functions. Instead of requiring every open system to implement the session layer, the session functions could have been embedded in only those applications that require them. This concern has been addressed by the MOSI profile, the performance of which is comparable with that of existing non-OSI stacks. The presentation layer addresses the situation where a negotiation of transfer syntax is necessary.
Such situations arise in wireless communications when there is a need for encryption and compression. The growth of personal communication systems (PCS) should increase appreciation for the incorporation of the presentation layer in the communication stack. The success of the OSI Reference Model is clearly illustrated by the deployment of OSI application protocols such as CMIP, X.400, and X.500; CMIP, for example, has been unanimously chosen by telecommunications managers as the management protocol in the lower layers of the TMN model. The OSI Reference Model is appreciated by practitioners who need the rich functions it provides. It is appreciated by protocol designers (such as the present author) who can learn good protocol design principles from reading the OSI standards. It is certainly a sound protocol model to guide the development of a protocol stack.

ADRIAN TANG
University of Missouri-Kansas City
IT INDUSTRY. See INFORMATION TECHNOLOGY INDUSTRY.
Wiley Encyclopedia of Electrical and Electronics Engineering

Local Area Networks

Standard Article
Joseph B. Evans, University of Kansas, Lawrence, KS
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5314
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: LAN Topologies; IEEE 802 LAN Standards; Protocol Layering; Ethernet (IEEE 802.3); Token Passing Bus (IEEE 802.4); MAP/TOP; Token Ring (IEEE 802.5); Other Token Rings; HYPERchannel and HIPPI; Other LAN Protocols; Wireless LANs
LOCAL AREA NETWORKS
Local area networks (LAN) are data communications networks that are restricted in extent to an office, home, building, or, in some cases, areas as large as a campus. Due to the spectacular growth in networking, LANs can be found deployed in almost every organization. A variety of established and evolving technologies are used in LANs, based on physical facilities ranging from copper and optical fiber to radio. The characteristics of commonly used LAN technologies are discussed in this article.

LAN TOPOLOGIES

LANs can be logically organized in several topologies, the most popular of which are the bus, star, and ring. In the bus structure, illustrated in Fig. 1, nodes (computers, printers, or similar devices) are interconnected by a common, shared physical resource, typically a wire or cable. This topology is inexpensive, since wiring expenses are shared among the nodes. Unfortunately, this scheme involves sharing the limited bandwidth resources of the bus and can also be somewhat unreliable, as bus failures in the vicinity of one node can affect the others on the same bus. IEEE 802.3 10base5 and 10base2 Ethernet are examples of networking standards based on a bus topology at the physical layer. This remains the most common topology in use, however, due to its simplicity of deployment.

An alternative is the star topology, shown in Fig. 2, in which each node has dedicated resources to some central switching site. This has the advantage of dedicated bandwidth to the interconnection point, but the attendant cabling costs are often higher than in bus topologies. Asynchronous transfer mode (ATM) is an example of a networking standard based on a star topology. There is increasing interest in star topologies (switched Ethernet, for another example) because the limited bandwidth on a cable is not shared and traffic is

J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
Figure 1. Bus topology. The transmission medium is shared among stations in this configuration.
not subject to internode arbitration delays for access to the medium.

Another option is the ring topology, shown in Fig. 3, in which each node is interconnected to its neighbor. The IEEE 802.5 Token Ring and the Fiber Distributed Data Interface (FDDI) are examples of networking standards based on a ring topology. This topology shares many of the advantages and disadvantages of the bus topology: inexpensive wiring, but with reliability problems if the ring should be broken. Ring-based LANs have been designed to overcome the reliability issues by using counter-rotating rings (FDDI, for example). Depending on the protocols in use, bandwidth in a ring-based network can be reused (since the ring is not physically contiguous), and hence such a topology can have a capacity greater than that of the equivalent bus.

IEEE 802 LAN STANDARDS

Much of the growth in deployment of LAN technology can be attributed to the standardization of selected technology options, which has enabled multivendor interoperability and has spawned a highly competitive market. The IEEE 802 LAN standards are among the most widely used data protocols yet developed.

The IEEE 802.2 standard specifies the Logical Link Control (LLC) protocols used by the other IEEE LAN standards. IEEE 802.2 allows the lower-level protocols to interface with higher-level protocols in a consistent manner. Using this approach, for example, the Internet Protocol (IP) need not know the type of underlying hardware being used on a particular host, which implies that software can be simplified and made
Figure 2. Star topology. This scheme is based on a central interconnection point for the transmission medium.
more reliable. Note that certain other protocol suites (IP over ATM, for example) use the IEEE LLC SAP (service access point) codes for protocol multiplexing and demultiplexing, so that similar benefits can be obtained. IEEE 802.2 provides several services; which services are used, and the extent to which they are used, depends on the needs of the other protocols involved.

The IEEE 802.3 standard has been one of the most successful in the IEEE LAN suite. This standard describes the Carrier Sense Multiple Access with Collision Detection (CSMA/CD) protocol, which forms the basis for the Ethernet family (note that the 802.3 standard and Ethernet differ slightly but can be made to interoperate). The IEEE 802.3 standard comprises several related protocols for different physical media. Included are the original 10base5 standard for CSMA/CD on 50 Ω thick coaxial cable, the 10base2 standard for lighter 50 Ω coaxial cable, and the 10baseT standard for unshielded twisted pair cables. Less commonly used today are the 1base5 StarLAN standard and the 10broad36 standard for more widely dispersed networks. In addition, fiber extension options are available for distributed site interconnection (within protocol distance limits). Ethernets can be found in almost all corporate data networks. The primary data rate is 10 Mbits/s, although higher-rate Ethernet protocols are becoming available, particularly 100 Mbits/s Fast Ethernet and the ongoing work on Gigabit Ethernet.

The IEEE 802.4 standard specifies the Token Bus protocol. This protocol has been the basis for several networking technologies, including the MAP/TOP (Manufacturing Automation Protocol/Technical and Office Protocol) suite. Multiple physical layers are defined for token bus on 75 Ω coaxial cable, including systems at 1 Mbit/s, 5 Mbits/s, and 10 Mbits/s. These are all broadband systems. The original 1 Mbit/s system has been quite popular due to its low cost and relative simplicity.

The IEEE 802.5 standard specifies the Token Ring protocol. This standard has been widely deployed in PC-based networks and is second only to Ethernet in ubiquity. It uses shielded or unshielded twisted pair cabling, with data rates of 4 Mbits/s and 16 Mbits/s. It has several very desirable features, including robust behavior in the presence of high traffic loads and bounded delay (to transmit) times.

Figure 3. Ring topology. The ring is based on a loop configuration for the medium.

PROTOCOL LAYERING

For standardization purposes, networking protocols are most often conceptually partitioned into several layers. In the case of LAN technologies, the physical layer (PHY), media access layer (MAC), and logical link control layer (LLC) are commonly specified. The latter two are often grouped together to form the data link layer in standard layering schemes.

PHY Layer

The PHY, or physical layer, is the lowest layer of a protocol stack. The standards for this layer typically describe the medium to be used (e.g., cable, fiber, wireless), the modulation schemes, and the encoding schemes used to transmit information across the medium. The PHY layers of LAN protocols generally fall into two categories, baseband and broadband.

A baseband PHY layer is one in which the information-bearing signals are digital signals, typically encoded using simple level-based keying, Manchester encoding, or differential Manchester encoding. This is the most common type of PHY layer in current LANs, being relatively inexpensive and sufficiently robust for most local environments. The disadvantages are distance limitations, typically 100 m to at most 1000 m on copper, and bandwidth, no more than about 155 Mbits/s over copper using current technologies.
Baseband techniques may be used over optical fiber at much greater distances and rates, but with the attendant installation and network equipment costs. For typical LAN installations, however, baseband systems on copper are sufficient.

Encoding schemes are another key element of the PHY layer. A variety of schemes, tailored to the physical medium of a given protocol, have been developed. Some typical encoding schemes are depicted in Fig. 4. These can be broadly classed as non-return-to-zero (NRZ) techniques and biphase techniques. The conceptually simplest schemes are the NRZ methods. In the NRZ-level approach, for example, zeros are encoded as a low voltage level, and ones are encoded as a high voltage level. In optical fiber systems, the corresponding scheme may be that ones are the presence of optical power and zeros the absence of light. In the NRZI (NRZ with invert on ones) approach, a transition (either falling or rising edge) denotes a one, and the lack of a transition signifies a zero.

While simple, the NRZ schemes have several shortcomings. Most significantly, recovery of bit timing at the receiver can be difficult: the moment in time at which to sample a bit
Figure 4. Baseband PHY encoding schemes (the data bit pattern 0 0 0 1 0 1 0 1 1 is shown encoded as NRZ level, NRZI, Manchester, and differential Manchester waveforms). This illustrates the relationship between data bits and the signal (optical or electrical) sent across the physical medium over time.
to determine if it is a zero or one is often not apparent in the presence of noise and other such impairments. A technique that provides an unambiguous timing reference is highly desirable. Furthermore, the occurrence of a long string of zeros or ones can result in an undesirable dc voltage bias on the transmission medium, which may cause threshold-related errors and problems with the use of transformers.

There are several approaches to resolving these related problems, which center around the need for signal transitions. The 4B/5B and related techniques (4B/6B and 8B/10B are also common) involve guaranteeing sufficient transitions by inserting extra bits into the signal stream. Data symbols, 4 bits in this case, are mapped into a 5-bit code, which is then transmitted using NRZI, for example. This is illustrated in Table 1. An inspection of this table will show that strings with at most three consecutive zeros are possible, even when code words are concatenated. Multiple ones are not an issue if NRZI is used for transmission, as ones force a transition to occur. The cost of a 4B/5B mapping, of course, is that only 80% efficiency is possible.

Table 1. 4B/5B Encoding

Data Symbol   Code Word
0000          11110
0001          01001
0010          10100
0011          10101
0100          01010
0101          01011
0110          01110
0111          01111
1000          10010
1001          10011
1010          10110
1011          10111
1100          11010
1101          11011
1110          11100
1111          11101
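The mapping in Table 1 and the three-consecutive-zeros claim can be checked mechanically. A sketch in Python (the function names are mine, not from any standard library):

```python
from itertools import product

# The Table 1 mapping, keyed by 4-bit data symbol.
FOUR_B_FIVE_B = {
    0b0000: "11110", 0b0001: "01001", 0b0010: "10100", 0b0011: "10101",
    0b0100: "01010", 0b0101: "01011", 0b0110: "01110", 0b0111: "01111",
    0b1000: "10010", 0b1001: "10011", 0b1010: "10110", 0b1011: "10111",
    0b1100: "11010", 0b1101: "11011", 0b1110: "11100", 0b1111: "11101",
}

def encode_4b5b(data: bytes) -> str:
    """Encode each byte as two 5-bit code words (high nibble first)."""
    bits = []
    for byte in data:
        bits.append(FOUR_B_FIVE_B[byte >> 4])
        bits.append(FOUR_B_FIVE_B[byte & 0x0F])
    return "".join(bits)

def max_zero_run(bits: str) -> int:
    """Length of the longest run of consecutive zeros."""
    return max(len(run) for run in bits.split("1"))

# Check the claim in the text: no concatenation of two code words
# produces more than three consecutive zeros.
worst = max(max_zero_run(a + b)
            for a, b in product(FOUR_B_FIVE_B.values(), repeat=2))
print(worst)  # 3
```

Note that each 10-bit output per input byte makes the 80% efficiency figure cited above concrete: 8 data bits are carried in 10 transmitted bits.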
Biphase encodings are beneficial with respect to maintaining signal balance, and bit timing recovery is particularly easy to implement with them. They are based on signal transitions at a rate double that of the bit rate: a transition (rising edge or falling edge) is guaranteed to occur at the center of each bit period. The absence of such a transition can be used as an error detection mechanism. In Manchester encoding, a zero is encoded as a rising edge at the center of the bit period, and a one as a falling edge at that time. The encoding mechanism can be implemented as an exclusive-or operation between the data and the clock. This is the encoding used for most of the common IEEE 802.3 protocols (10base5, 10base2, 10baseT). Differential Manchester encoding uses the midperiod transition for a clocking reference only, and uses the presence (denoting a zero) or absence (denoting a one) of a transition at the beginning of the bit period to encode the information. This is the method used for the IEEE 802.5 token ring standard. The primary disadvantage of biphase signaling is that transitions occur at twice the data rate, which means that the bandwidth required is greater than that of the equivalent NRZ system and the hardware must operate twice as fast. The former is particularly critical in wireless systems.

A broadband PHY layer is one in which the information is coupled into the medium as analog signals modulated onto a carrier and encoded using frequency shift keying (FSK), amplitude shift keying (ASK), phase shift keying (PSK), or some similar scheme. This type of PHY layer is most often used where longer distances must be served or additional bandwidth is required. Much greater bandwidths may be supported on one cable using broadband schemes, as multiple frequencies can be used. The primary disadvantage of the broadband approach is the cost of the modulators, demodulators, and associated analog hardware.
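The baseband encodings of Fig. 4 can be sketched concretely. In the toy model below (the function names and the +1/-1 level convention are mine), each bit period is represented by two half-period signal levels, which makes the mid-bit transitions of the biphase codes explicit:

```python
def nrz_level(bits):
    # 1 -> high for the whole bit period, 0 -> low.
    out = []
    for b in bits:
        level = 1 if b else -1
        out += [level, level]
    return out

def nrzi(bits, level=-1):
    # A 1 toggles the line level; a 0 leaves it unchanged.
    out = []
    for b in bits:
        if b:
            level = -level
        out += [level, level]
    return out

def manchester(bits):
    # 0 -> rising edge at mid-bit (low then high), 1 -> falling edge,
    # matching the convention in the text (XOR of data and clock).
    out = []
    for b in bits:
        out += ([1, -1] if b else [-1, 1])
    return out

def diff_manchester(bits, level=-1):
    # Always a mid-bit transition; a transition at the *start* of the
    # bit period encodes a 0, its absence encodes a 1.
    out = []
    for b in bits:
        if b == 0:
            level = -level          # transition at start of bit
        out += [level, -level]      # guaranteed mid-bit transition
        level = -level
    return out

print(manchester([0, 1]))  # [-1, 1, 1, -1]
```

The doubled list length per bit mirrors the bandwidth penalty noted above: biphase codes signal at twice the data rate.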
MAC Layer

The MAC, or media access layer, is used to arbitrate access to the PHY layer. For example, in the case of Ethernet, there is a shared medium (cable) that must be used by several nodes, and only one of the nodes can be permitted to access the cable at a particular time. The MAC layer influences the effective throughput over a given physical layer and should be efficient in its use of the available bandwidth. This includes minimizing the overhead due to factors such as protocol headers and dead time between transmissions, while at the same time maximizing the successful transmissions on a busy shared-medium network. In addition, the MAC layer is often designed to ensure that errors are not propagated to the higher-layer protocols.

Various MAC schemes have been developed for the LAN protocols. The three most common are the Carrier Sense Multiple Access with Collision Detection (CSMA/CD) protocol used in Ethernet, the token ring protocols, and the token bus protocol.

The CSMA/CD MAC protocol involves detecting the use of the medium by another station by checking the state of the carrier. If a station has data to transmit, it first attempts to verify that the medium is unused. If it is available, the station transmits. If the medium is not available, the station waits until the medium goes idle and then immediately begins to transmit (note that this is the IEEE 802.3 solution, but other
options are possible in the general case of CSMA). The success or failure of transmissions is monitored on the shared medium, and if a transmission is unsuccessful (that is, a collision is detected), the station waits a prescribed random amount of time (binary exponential back-off) and attempts to transmit again. This procedure is repeated until the transmission is successful, or the limit to the number of transmission attempts (16 in IEEE 802.3) is reached. CSMA/CD is simple, inexpensive, and performs well under light loads. Unfortunately, it can perform poorly under heavy loads and be sensitive to physical layer errors.

Token ring protocols use a "token" to arbitrate access to the transmission medium. A token is a small frame that is exchanged between stations to gain the right to transmit. If a station has data to transmit, it waits until a token is seen on the medium. This station then modifies the token and appends the necessary fields as well as its data. When this frame returns around the ring to the originating station, it is purged from the medium. When data transmission is complete, the station inserts a new token onto the ring. Token rings support fair, controlled access to the medium and perform well under heavy load conditions. A disadvantage is the need for careful token maintenance, particularly in the presence of errors. Several varieties of token ring exist; some of these will be discussed in subsequent sections.

The token bus protocol is closely related to the token ring, but with an underlying physical bus topology. The token exchange mechanism, however, does in fact use a logical ring for token passing. This logical ring is simply an ordering of stations on the bus. Once the logical ring is in place, token passing can proceed as in a ring-based system. This system provides controlled access to the bus and is robust under heavy loads.
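The CSMA/CD retransmission rule described above (binary exponential back-off with the 16-attempt limit cited for IEEE 802.3) can be sketched as follows. The slot-time value is the usual 10 Mbit/s Ethernet figure; the function names and the callback interface are illustrative, not from the standard:

```python
import random

SLOT_TIME_US = 51.2   # 10 Mbit/s Ethernet slot time (512 bit times)
MAX_ATTEMPTS = 16     # attempt limit cited in the text for IEEE 802.3
BACKOFF_CAP = 10      # the backoff exponent stops growing after 10 collisions

def backoff_delay_us(collisions: int) -> float:
    """Truncated binary exponential backoff: wait a random number of
    slot times drawn uniformly from [0, 2**min(collisions, 10) - 1]."""
    k = random.randint(0, 2 ** min(collisions, BACKOFF_CAP) - 1)
    return k * SLOT_TIME_US

def transmit(channel_busy, send) -> bool:
    """Sketch of the MAC loop: defer while carrier is sensed, transmit,
    back off on collision, and give up after MAX_ATTEMPTS."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        while channel_busy():       # carrier sense: defer
            pass
        if send():                  # True -> no collision detected
            return True
        backoff_delay_us(attempt)   # wait before the next attempt
    return False                    # excessive collisions: report failure
```

The cap on the exponent bounds the maximum wait, while the random draw spreads contending stations apart in time.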
One of the disadvantages of this approach is that ring initialization and maintenance are more complex than in a physical ring: the ordering of stations must be determined through some algorithm, and station additions and deletions must be managed.

LLC Layer

The LLC, or logical link control layer, can be viewed as the upper part of the data link layer. It is used to provide data services to the higher layers. In particular, two types of services, connectionless and connection-oriented, are defined. In LANs supporting complex higher-layer protocols, such as TCP/IP, only the simplest LLC services are commonly used. An example of an LLC protocol is the IEEE 802.2 layer. This provides both connectionless and connection-oriented services. The unacknowledged connectionless service provides simple datagram support for the multiplexing and demultiplexing of higher-layer protocols. In addition, a connectionless service with acknowledgments (for monitoring systems, for
Figure 5. IEEE 802.2 LLC frame format: DSAP (8 bits) | SSAP (8 bits) | Control (8-16 bits) | Data (n bits). The SAP (service access point) fields are used to select the appropriate protocol handler on reception of a packet.

Figure 6. Typical Ethernet installation (labeled elements include a router, an upstream link, hubs, a bridge, and 10base5, 10baseT, and 10base2 segments). This illustrates the interconnection of various physical and protocol devices in a typical LAN. Hubs are devices used to concentrate the physical media from several Ethernet stations, and are often used as physical layer translation devices (10baseT to 10base2, for example). Bridges provide isolation between Ethernets and allow more complex LANs to be built.
example) is supported, as well as a connection-oriented service that furnishes flow control and error recovery capabilities based on the lower-layer CRC and a "go-back-N" strategy.

The IEEE 802.2 LLC frame format is depicted in Fig. 5. The destination service access point (DSAP) and source service access point (SSAP) fields are used to indicate the service type (IP or IPX, for example) to higher layers. The control field is used for LLC service support, including indication of the type of service.

ETHERNET (IEEE 802.3)

The Ethernet, and the closely related IEEE 802.3 standard, has been one of the most successful LAN protocols developed to date. This technology is based on CSMA/CD and takes a variety of forms at the PHY layer. A typical Ethernet installation is depicted in Fig. 6.

The Ethernet frame format is illustrated in Fig. 7. The preamble is used for frame delineation. The destination and source address fields (48 bits each) are globally unique identifiers for each Ethernet adapter and are used for station-to-station communication, as well as broadcast (all ones) and multicast (first bit is one). It should be noted that the 48-bit addresses used in Ethernet have become a common feature in IEEE 802-based LANs. The type field (Ethernet) can be used for higher-layer demultiplexing, as in an LLC protocol. The length field (IEEE 802.3) is used to aid in end-of-frame detection. A 32-bit CRC is used for error detection and is followed by a postamble for end-of-frame detection.

TOKEN PASSING BUS (IEEE 802.4)

The IEEE 802.4 Token Bus standard has been widely used in manufacturing systems and early office automation products.
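The Ethernet frame fields described above can be inspected programmatically. The sketch below (the helper names are mine) also applies the conventional rule implementations use to tell the Ethernet type field apart from the IEEE 802.3 length field: values of 1536 (0x0600) and above are treated as type codes, smaller values as lengths.

```python
import struct

def parse_header(frame: bytes):
    """Decode the 14-byte header (preamble/postamble omitted, since
    adapters strip them before delivery)."""
    dst, src, type_or_len = struct.unpack_from("!6s6sH", frame)
    return {
        "dst": dst.hex(":"),
        "src": src.hex(":"),
        # Values >= 0x0600 are Ethernet type codes; smaller values
        # are IEEE 802.3 length fields.
        "ethertype": type_or_len if type_or_len >= 0x0600 else None,
        "length": type_or_len if type_or_len < 0x0600 else None,
        "broadcast": dst == b"\xff" * 6,
        # Multicast: the first address bit on the wire is 1, i.e. the
        # least significant bit of the first destination byte.
        "multicast": bool(dst[0] & 0x01),
    }

hdr = parse_header(bytes.fromhex("ffffffffffff" "020000000001" "0800")
                   + b"\x00" * 46)
print(hdr["ethertype"], hdr["broadcast"])  # 2048 True
```

Here 0x0800 (IP) exceeds the threshold, so the frame is interpreted in the Ethernet style, and the all-ones destination is recognized as broadcast.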
Figure 7 (field layout): Preamble (64 bits) | Destination address (48) | Source address (48) | Type or length (16) | Data | CRC (32) | Postamble (8).
Because it is based on a broadband physical medium, it is somewhat more resistant to the low-frequency electromagnetic (EM) noise that might arise on a factory floor. A token passing bus is a LAN with a bus topology that operates on the principle that a token will be received prior to the transmission of data by a station. The token bus format includes a preamble, frame control byte for denoting whether a particular frame is a token or data, the destination and source addresses (48 bits, as in 802.3), the data (an LLC frame), an error detection field (CRC-32, as in 802.3), and the postamble. The token bus operates by first establishing a logical ring that overlays the physical bus topology. Station additions and deletions require reconfiguration of the logical ring. When a token is received, a station is permitted to transmit multiple packets, until its token holding time has expired. The token bus offers optional support for multiple classes of service through the use of complex timer specifications that enable per-class bandwidth guarantees. Support for simpler nontoken stations is included to allow low-cost devices to respond to polling requests using this medium.
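The logical ring that overlays the physical bus is simply an ordering of station addresses with wraparound. A toy sketch (a descending-address ordering, as commonly described for 802.4-style rings, is used here; the helper names are mine):

```python
# Toy model of a token bus logical ring. Stations on the bus are ordered
# by address, each station's successor is the next address in the ring,
# and additions or deletions just rebuild the ordering.

def build_ring(stations):
    """Order the stations into a logical ring (descending address)."""
    return sorted(stations, reverse=True)

def successor(ring, station):
    """The station to which the token is passed next, with wraparound."""
    i = ring.index(station)
    return ring[(i + 1) % len(ring)]

ring = build_ring({12, 7, 30, 19})
print(ring)                 # [30, 19, 12, 7]
print(successor(ring, 7))   # 30  (the token wraps to the highest address)

# A station joining the bus triggers reconfiguration of the logical ring.
ring = build_ring(set(ring) | {25})
print(successor(ring, 30))  # 25
```

The reconfiguration step is exactly the maintenance cost noted in the text: unlike a physical ring, the ordering must be recomputed whenever membership changes.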
MAP/TOP

MAP is the Manufacturing Automation Protocol developed by General Motors Corporation for communication among automated manufacturing devices, including robotic equipment and the associated controllers. It was primarily designed to support communication between very different sorts of devices, in real time with low, predictable delays. It supports applications as varied as word processing and equipment telemetry (temperature measurement, for example). TOP is the Technical and Office Protocol developed by Boeing Corporation for communication between office automation
Figure 7. Ethernet frame format; the frame ends with the data field, a 32-bit CRC, and an 8-bit postamble. Ethernet and IEEE 802.3 differ in the fourth field, which is a type field in Ethernet and a length field in IEEE 802.3.
devices such as word processing systems and printers. Interoperability between devices from a variety of manufacturers was a key design goal of this protocol. The MAP/TOP protocol suite is based on the IEEE 802.4 Token Bus protocols. As such, MAP/TOP networks are often interconnected with some variety of token passing network for ease of interface design.

TOKEN RING (IEEE 802.5)

The Token Ring protocol has been widely deployed in networks based on PCs. Token Ring operates on the principle of the exchange of a "token" to a station before it is permitted to transmit. Only one token is allowed on the ring at one time. The IEEE 802.5 Token Ring frame formats are illustrated in Fig. 8. The first format is used for token frames and only includes the start and end delimiters and the access control field, with priority and reservation information. The second format includes start and end delimiters, a frame control word for optional LLC support, source and destination addresses (in the 48-bit 802.3 format), the LLC (data) frame, a CRC-32, and a frame status word used by transmitting stations to verify reception.

Figure 8. Token ring frame format. The different formats used for the control token and data frames are depicted.

OTHER TOKEN RINGS

Another example of token ring technology in wide use today is the Fiber Distributed Data Interface (FDDI) standard. This technology supports multiple packets on the ring at one time, with rates of 100 Mbits/s. Provisions are made for multiple service classes (synchronous and asynchronous) with differing throughput and delay requirements. Further, reliability support is provided through optional redundant counter-rotating rings, which can mask a station or fiber failure.

Slotted Ring

Slotted ring technology uses multiple "slots" that rotate around the ring to arbitrate access. Each slot is a small frame that can be marked empty or full. When an empty slot arrives at a station with data to transmit, the slot is marked full and data is injected. The slot is marked empty when it returns around the ring to its source. A given station cannot transmit again while it has an outstanding slot. The provision for multiple packets from different sources on the ring at one time assists in fair utilization and quality of service support. The Cambridge Ring is an early example of such technology (some claim it is the ancestor of the ATM protocols, also based on small fixed frame sizes). A slot contains one octet each for the source and destination addresses, five control bits, and two, four, six, or eight data octets, and thus slot sizes are extremely small. This implies that higher-layer packet data is almost always segmented into small units prior to transmission. Stations could choose not to receive packets from particular sources; some of the control bits support this through response codes. The Cambridge Ring was simple to implement, but was somewhat wasteful of bandwidth due to the header overhead in such small datagrams.

Register Insertion Ring

Register insertion rings are a common LAN technology and can be used to provide high performance through their support for multiple packets on the ring at one time. The register insertion ring uses a small shift register at each station to control forwarding and insertion onto the ring. The shift register is at least as large as the maximum frame size. This allows a station to store a frame as it passes. If the station has no data to send, a passing frame is buffered long enough to determine whether it is destined for the local station. If it is destined locally, a typical implementation will both copy the frame into adapter memory and forward the frame back around the ring to support acknowledgments. Transmission when the medium is available is handled by simply copying the data onto the ring. If a frame arrives during this time, it is buffered in the insertion register. The register insertion method provides excellent ring utilization due to the multiple simultaneous packets on the ring, without the overhead penalty of the slotted ring. The disadvantage of this technology is that the purge mechanism (that is, the technique used to remove problematic packets from the ring) is generally more complex than in other systems.

HYPERchannel AND HIPPI

A number of LAN protocols are designed for very high-speed interconnection of computers and their peripherals. HYPERchannel, developed by Network Systems Corporation, is one of these. This protocol was developed in the mid-1980s for the interconnection of supercomputers and high-performance peripherals, and has been used with Cray and Amdahl systems, among others. It supports data rates of up to 275 Mbits/s over a variety of physical layers.
Figure 8 (field layout). Token frame: Start delimiter (8 bits) | Access control (8) | End delimiter (8). Data frame: Start delimiter (8) | Access control (8) | Frame control (8) | Destination address (48) | Source address (48) | Data | CRC (32) | End delimiter (8) | Frame status (8).
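Token circulation as described in the Token Ring section can be modeled with a toy round-robin scheduler. The structure below is illustrative only: it ignores priorities, reservations, and the token maintenance issues noted earlier.

```python
from collections import deque

def run_ring(queues, rounds=10):
    """Simulate token passing: queues is a list of per-station deques of
    frames awaiting transmission. A station holding the token sends one
    frame (which circulates and is purged by its originator), then the
    token passes to the next neighbor."""
    delivered = []
    holder = 0                                 # station seeing the token
    for _ in range(rounds):
        if queues[holder]:
            frame = queues[holder].popleft()
            delivered.append((holder, frame))  # frame makes one circuit
        holder = (holder + 1) % len(queues)    # release token to neighbor
    return delivered

qs = [deque(["a1", "a2"]), deque(), deque(["c1"])]
print(run_ring(qs, rounds=4))  # [(0, 'a1'), (2, 'c1'), (0, 'a2')]
```

Even this toy model exhibits the fairness property cited in the text: station 0 cannot send its second frame until every other station has had a chance to transmit.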
HIPPI, or High Performance Parallel Interface, is another of the protocols developed primarily for interconnection of supercomputers. This protocol supports 800 Mbits/s or 1.6 gigabits/s links over a large parallel cable, which is either 32 lines or 64 lines wide and runs at 25 MHz. The distances over which HIPPI can be used are quite limited, but large enough for a typical supercomputer center equipment floor. Interconnection of sites can be accomplished using fiber extension options. Simple flow control features are provided to lessen problems with computers and peripherals of widely different I/O (input/output) bandwidth. Although simple and effective, this flow control scheme does contribute to the problems of extending HIPPI networks over larger distances while maintaining high throughput. To build HIPPI networks of nontrivial size, simple switches are used to interconnect devices. These switches are typically not designed to switch between sources and destinations at high rates, as with routers and packet switches, but rather act as interconnection panels that may be reconfigured at reasonable rates for sharing peripherals.
OTHER LAN PROTOCOLS

A number of new, higher-performance LAN protocols have been developed in recent years. Fibre Channel is a LAN protocol suite designed for high-speed communication between nodes using optical fiber. Rates of up to 800 Mbits/s are supported, with systems up to 4 Gbits/s under design. Other developments include FireWire (IEEE 1394) and the universal serial bus (USB), high-speed protocol suites based on serial interconnection technology. FireWire, for example, supports bandwidths of up to 400 Mbits/s with up to 63 devices (with no more than 16 cable hops) per bus. USB, a 12 Mbits/s serial protocol with chaining support, is designed primarily as an improvement over traditional serial port technologies.

Asynchronous transfer mode (ATM) networks are also being widely deployed in LANs. ATM is a switch-based technology that uses small packets (53 bytes) called cells. Interconnection of nodes is through virtual circuits, which are analogous to circuits in voice telephony. Multiple physical layers are supported, including both copper and fiber infrastructure options. Although ATM is often viewed as a wide area networking technology, it provides support for features that are not available in other technologies. For example, ATM allows the definition of virtual LANs, which provide network administrators with options that are not available in less sophisticated technologies. Virtual or emulated LANs are interconnections of LANs, perhaps widely separated, which are configured to emulate a single local area network. Furthermore, multiple logical local area networks can be supported over a single physical infrastructure using this capability.
WIRELESS LANS

Wireless LANs use radio or infrared as the transmission medium, as opposed to traditional wire or fiber. This has significant advantages, particularly for deployment in older buildings where wiring costs are high, as well as in environments in which workers may be moving frequently.
Many of the initial wireless LANs in the United States have used radio frequencies in one of the ISM (industrial, scientific, and medical) bands, which generally may be used without individual site licensing, subject to restrictions on power output. The data rates of these systems range from tens of kilobits per second to a few megabits per second, with typical ranges of a few hundred meters. Some early European products in this area were based on the digital enhanced cordless telephony (DECT) standard for digital telephony. These systems used multiple channels to provide data rates of hundreds of kilobits per second.

Wireless LANs are still evolving, but implementations of several standards are now appearing as products. The most significant development in this area is the IEEE 802.11 standard, which will provide data rates of 1 Mbit/s to 2 Mbits/s over a range of approximately 100 m in typical radio configurations. It is based on a CSMA/CA (CSMA with collision avoidance) MAC layer with multiple physical layers: direct sequence spread spectrum, frequency hopping spread spectrum, and infrared. Work on indoor wireless ATM and the European ETSI RES10 standards is focusing on systems with data rates of up to 25 Mbits/s and ranges on the order of 100 m to 200 m. While still early in the development cycle, these systems promise to deliver multimedia services over wireless links with quality of service support.

BIBLIOGRAPHY

P. T. Davis and C. R. McGuffin, Wireless Local Area Networks, New York: McGraw-Hill, 1995.
L. L. Peterson and B. S. Davie, Computer Networks: A Systems Approach, San Francisco: Morgan Kaufmann, 1996.
S. Saunders, The McGraw-Hill High-Speed LANs Handbook, New York: McGraw-Hill, 1996.
W. Stallings, Local Networks, 5th ed., New York: Macmillan, 1996.
J. Walrand and P. Varaiya, High Performance Communication Networks, San Francisco: Morgan Kaufmann, 1996.
J. Wobus, LAN Technology Scorecard [Online], 1996.
Available: http:// web.syr.edu/앑jmwobus/comfaqs/lantechnology.html
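The collision-avoidance idea behind the 802.11 MAC can be made concrete with a toy simulation. Everything below (function names, window sizes, the statistics) is my own simplification, not the 802.11 state machine; it only illustrates why deferring a random number of idle slots, and widening the contention window after a collision, resolves contention among many stations.

```python
# Toy CSMA/CA model (my own sketch, not the 802.11 standard): each
# station picks a random backoff slot; the earliest unique pick wins.
# A tie means a collision, so the contention window is doubled.
import random

def contend(n_stations, cw=8, rng=random.Random(0)):
    """Return the number of contention rounds until one station wins."""
    rounds = 1
    while True:
        picks = [rng.randrange(cw) for _ in range(n_stations)]
        first = min(picks)
        if picks.count(first) == 1:   # a single earliest slot: success
            return rounds
        cw = min(2 * cw, 256)         # collision: widen the window
        rounds += 1

# More contenders mean more collision rounds on average:
avg = lambda n: sum(contend(n) for _ in range(500)) / 500
print(avg(2), avg(10))
```

With two stations a first-round tie is rare (1 chance in 8 here), so the average is close to one round; with ten stations crowded into the same small window, extra rounds are almost always needed.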
JOSEPH B. EVANS
University of Kansas
LOCAL AREA NETWORKS. See ETHERNET.
LOCATION SYSTEMS, VEHICLE. See VEHICLE NAVIGATION AND INFORMATION SYSTEMS.
Wiley Encyclopedia of Electrical and Electronics Engineering
Metropolitan Area Networks
Standard Article
N. F. Maxemchuk, AT&T Labs–Research, Murray Hill, NJ
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5315
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are:
History
Differences Among LANs, MANs, and WANs
The MAN Anomaly
The Fiber Distributed Data Interface
The Distributed Queue, Dual Bus Protocol
The Manhattan Street Network
Comparison of FDDI, DQDB, and the MSN
CATV
Conclusion
METROPOLITAN AREA NETWORKS
Metropolitan area networks (MANs) have been studied, standardized, and constructed for less than 20 years. During that time the capabilities of the telecommunications network and the requirements of users have changed rapidly. What a MAN is supposed to do has changed as quickly as MANs have been designed. Recent changes in user requirements, resulting from the growing use of the Internet at home, are likely to redefine MANs once again. To understand the evolution and predict the future of MANs, one must consider the applications and alternative technologies. There are also inherent differences in the capabilities of local, metropolitan, regional, and wide area networks.

HISTORY

The first mention of MANs, that I am aware of, occurred at a workshop on local area networks (LANs) in North Carolina in the late 1970s. One session at the workshop was dedicated to customer experience with LANs. One of the customers, from a New York bank, described a successful application of LANs but complained about the difficulty he had transferring data between branches of the bank in the same city. The feeling among the workshop participants was that we could do better connecting sites in the same city than using technology that was designed for a national network. In the 1970s telephone modems were expensive, about a dollar per bit per second, and the highest rate modem that was generally available was 9.6 kbit/s. High-rate private lines, such as the current T-carrier system, were not widely deployed. Using the available technology was expensive and created a bottleneck between LANs. With the customer and application identified, work on MANs began.

The first MANs were designed to interconnect LANs. They evolved from LANs and looked very much like LANs. Two MAN standards that are clearly related to LANs, the fiber distributed data interface (FDDI) and the distributed queue, dual bus (DQDB), are described later in this article. A third network, which is based on a mesh structure, the Manhattan street network (MSN), is also described. The MSN is a network of two-by-two switches that operate on fixed-size cells. The MSN straddles the middle ground between a LAN and a centralized asynchronous transfer mode (ATM) switch (1).

Interest in MANs waned as the useful functions performed by MANs were subsumed by wide area network (WAN) technology. WANs had a much larger customer base than MANs. It became more economical to interconnect LANs in a city with routers and private lines than to deploy new, special-purpose networks. There is a resurgence of interest in MANs because of the World Wide Web (Web). The time required to download Web pages using WAN technologies is frustrating many users and constraining the growth of this service. As more and more individual homes are connected to the Internet, there is a rapidly growing demand for bursty, high-rate data at a large number of locations in a metropolitan area. Therefore, there is renewed interest in MANs, although the requirements and customer set are completely different from those of the earlier MANs.

DIFFERENCES AMONG LANS, MANS, AND WANS

The distance spanned by WANs is greater than that by MANs, and the distance spanned by MANs is greater than that by LANs. It is useful to define the maximum distance spanned by the various network technologies in increments of an order of magnitude. LANs span distances up to 3 miles and include most networks that are installed in a building or on a campus. MANs span distances up to 30 miles (50 km according to the standards committees) and can cover most cities. RANs (regional area networks) span distances up to 300 miles, the area serviced by the telephone operating companies in the United States. And WANs span distances up to 3000 miles, the distance across the United States. The next order-of-magnitude increase covers international networks.

The distances spanned by networks affect the transmission costs, access protocols, ownership of the facilities, and the other users who share the network. Transmission costs usually increase with distance. This cost affects both the applications that are economically viable and the protocols that are used to transfer data. For instance, access protocols have been designed for LANs that trade efficiency for reduced processing complexity. Carrier sense multiple access/collision detection (CSMA/CD) protocols, in which many users share a channel by continuing to try until the data successfully get through, are used on LANs. WANs use reservation mechanisms that require more processing but pack the transmission facility as fully as possible. With CSMA/CD protocols the propagation delay across the network must be much less than a packet transmission time, which precludes using these protocols in WANs.

Traditionally, LANs are networks that are owned and installed by a single company or organization. An organization can choose to try new technologies. MANs are less expensive to install than WANs and may not be interconnected. There is more freedom to experiment with new technologies on MANs than on WANs. The expense of installing MANs relative to installing LANs has resulted in far fewer experimental MANs than LANs.

In a LAN the other network users are generally more trusted than the users in a more open environment. Traditionally, MANs, such as CATV networks, and RANs and WANs, such as the telephone network, serve an unrelated community of users. The users in these networks do not trust one another as much as the users on a LAN, and greater measures must be taken to protect data. There are increasing numbers of wide area networks that are owned and controlled by a single organization. Corporate networks and intranets, which use Internet technology within a corporate network, are becoming common. These networks have trust structures and a flexibility that are more closely related to those of LANs than of WANs. The differences between general WANs and intranets are reflected in the applications of the networks and are leading to different implementations.

Many of the economic tradeoffs that are related to the distance spanned by networks change with time. However, the difference in propagation delay can never change. As the size of the network increases, the maximum useful transmission rate that a user can access to transfer a particular size packet decreases. This phenomenon is demonstrated in Fig. 1. The lines in this figure show when the propagation delay and transmission time are equal in LANs, MANs, RANs, and WANs. The calculations are performed assuming that the propagation speed in the medium is 80% of the speed of light in free space, which is common for optical fibers.

[Figure 1. Equal-delay lines (message size in bits versus transmission rate) when the distance between the source and destination is 3, 30, 300, and 3000 miles. Along each line the propagation delay and the transmission time of the message are equal.]

To the right of these lines, the time it takes to get the message from the source to the destination is dominated by the propagation delay rather than the transmission time. Increasing the transmission rate when operating to the right of a line does not bring a commensurate decrease in the time it takes to deliver a message. For instance, on a 3000 mile WAN the equilibrium point on a 1.5 Mbit/s T1 circuit is about 30.3 kbit. For a message of this size, the propagation delay and transmission time are equal. If the user's rate is increased to 45 Mbit/s, a T3 circuit, which is 30
times faster, the time to transmit the message decreases by a factor of 30, but the propagation time remains the same. The time it takes the message to get to the destination decreases by less than a factor of 2 as the transmission rate increases by a factor of 30. At T3 rates the message delivery time is almost entirely due to the propagation delay, and if the user's rate increases to 155 Mbit/s, an ATM circuit, there is virtually no decrease in the time it takes to receive the message. If the user sends the same message on a MAN and increases the rate from T1 to T3, the time to deliver the message decreases by almost a factor of 30, and increasing to ATM rates decreases the delivery time by another factor of 2. Therefore, ATM rates may be used to obtain faster delivery of this size message in a MAN, but not in a WAN.

THE MAN ANOMALY

Generally, transmission links cost more as the distance increases. At present, high-rate channels are readily available in LANs and WANs, but not in MANs. The use of computers in offices has become ubiquitous and has resulted in most office buildings being wired for high-speed communications. The backbone of the wide area telephone network is shared by a large number of users. Even though only a fraction of the users require high-rate facilities, the number is large enough to warrant providing those facilities between central offices. Fiber to the curb and other methods to provide high data rates in a MAN exist, but most users do not require these rates. The lines running down a street in a MAN are not shared by as large a number of users as the lines between central offices. As a result, when high-rate channels are installed in an office, the line that spans the final mile or two from the central office to the office building is frequently an expensive, custom installation. The current networks must be modified to provide high data rates to homes until the demand increases to the point where new facilities are justified.
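As an aside, the equal-delay arithmetic of Fig. 1 is easy to reproduce. The sketch below is mine (constants and function names are not from the article); it takes the propagation speed as 80% of the speed of light, as in the figure, and checks the T1-versus-T3 comparison for a 3000 mile WAN and a 30 mile MAN.

```python
# Sketch (my own constants/names): delivery time = propagation delay
# plus transmission time, with propagation at 80% of the speed of light.
MILE_M = 1609.344          # meters per mile
C = 299_792_458.0          # speed of light in free space, m/s

def delivery_time(bits, rate_bps, miles):
    """Propagation delay plus transmission time for one message."""
    prop = (miles * MILE_M) / (0.8 * C)
    return prop + bits / rate_bps

# Equilibrium message size on a T1 (1.5 Mbit/s) circuit over 3000 miles:
prop_3000 = (3000 * MILE_M) / (0.8 * C)
eq_bits = prop_3000 * 1.5e6
print(f"equilibrium size ~ {eq_bits/1e3:.1f} kbit")   # about 30 kbit

# Speeding this message up from T1 to T3 (30x) barely helps on a WAN:
t1 = delivery_time(eq_bits, 1.5e6, 3000)
t3 = delivery_time(eq_bits, 45e6, 3000)
print(f"WAN speedup T1 -> T3: {t1/t3:.2f}x")          # close to 2, not 30

# ...but on a 30 mile MAN the same upgrade recovers most of the 30x:
t1_man = delivery_time(eq_bits, 1.5e6, 30)
t3_man = delivery_time(eq_bits, 45e6, 30)
print(f"MAN speedup T1 -> T3: {t1_man/t3_man:.1f}x")
```

The small differences from the article's 30.3 kbit figure come only from the rounding of the constants assumed here.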
Two possible technologies are digital subscriber loops (DSL) and the CATV network. DSL and ADSL (asymmetric DSL) use adaptive equalizers to transmit between 1.5 and 6.3 Mbit/s over current local loops in the telephone network. ADSL provides higher rates in one direction than in the other. To date, the only really successful MAN for distributing information to a large number of homes has been the CATV network. CATV networks are mainly used to distribute entertainment video; however, experimental networks are being deployed to deliver data to homes. Standards organizations and working groups are actively considering these networks. The multimedia requirements of the Web may well make CATV technology the correct solution for the next MAN. Several of the early proposals for MANs used CATV networks to deliver point-to-point voice and data services, as well as broadcast TV. Later we describe one of these techniques, which is still one of the most forward-looking CATV solutions.

THE FIBER DISTRIBUTED DATA INTERFACE

FDDI (2) is a token passing loop network that operates at 100 Mbit/s. It is the American National Standards Institute (ANSI) X3T9 standard and was initially proposed as the successor to an earlier generation of LANs. FDDI started as a LAN and has been primarily used as a LAN; however, it is capable of transmitting at the rates and spanning the distances required in a MAN. Therefore, it has become common to discuss FDDI in the context of MANs.

Baseband Transmission

FDDI uses a baseband transmission system. Baseband systems transmit symbols, ones and zeros, on the medium rather than modulating the symbols on a carrier, as in a radio network. Baseband systems are simpler to implement than carrier systems; however, the signal does not provide timing and there may be a dc component that is incompatible with some system components. For instance, a natural string of data may have a long sequence of ones or zeros. If the medium stays at the same level for a long period, it is difficult to decide how many ones or zeros were in the string, and the dc level of the system will drift toward the value of that symbol. To tailor the signal to have desirable characteristics, the data are mapped into a longer sequence of bits. A common code for transmitting baseband data on early twisted pair networks is a Manchester code. Each data bit is mapped into a 2-bit sequence: a one is mapped into +1,−1 and a zero into −1,+1. There is at least one transition per bit, which provides a strong timing signal. There is no dc component in this code. Twisted pairs are connected to receivers and transmitters by transformers to protect the electronics from energy picked up by the wires during lightning storms, and transformers do not pass dc. A framing signal is needed to identify bit boundaries and the beginning of a sequence of bits. Framing signals occur infrequently and are sequences that do not occur in the data. Framing can be obtained by alternately transmitting −1,−1 and +1,+1 every n bits. With a Manchester code the bit rate on the medium is twice the bit rate from the source; however, this is a very simple coding system to implement.
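The Manchester mapping just described is easy to make concrete. The sketch below is my own illustration of the mapping (FDDI itself uses the rate 4/5 code described next, not Manchester):

```python
# Sketch (names mine): Manchester coding as described above.
# A one maps to (+1, -1), a zero to (-1, +1), so every bit has a
# mid-bit transition (strong timing) and the code has no dc component.

def manchester_encode(bits):
    """Map each data bit to a 2-symbol sequence on the medium."""
    out = []
    for b in bits:
        out.extend((+1, -1) if b else (-1, +1))
    return out

def manchester_decode(symbols):
    """Invert the mapping, two line symbols per data bit."""
    return [1 if pair == (+1, -1) else 0
            for pair in zip(symbols[0::2], symbols[1::2])]

data = [1, 1, 1, 0, 0, 0, 1]       # long runs are harmless after coding
line = manchester_encode(data)
assert manchester_decode(line) == data
assert sum(line) == 0              # no dc component
assert len(line) == 2 * len(data)  # line rate is twice the source rate
```

The last assertion is the cost the text mentions: the medium must run at twice the source bit rate.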
The FDDI standard uses a rate 4/5 code that maps 4 data bits into 5 transmitted bits. The constraints on codes used for optical fibers are different from those on twisted pairs. Fibers do not act as antennas during lightning storms, are not coupled to electronics through transformers, and can tolerate a dc component. In addition, logic costs have decreased since the early twisted pair networks were designed, so that it is now reasonable to implement more complex bit mappings and to design signal extraction circuits that work with fewer transitions. In the FDDI code there are 16 possible data patterns per symbol and 32 possible transmitted patterns. The 16 patterns are selected to guarantee at least one transition every 3 bits. Some of the remaining patterns serve control and framing functions. The use of the transmitted patterns is listed in Ref. 3.

Token Passing Protocol

A token is a unique sequence of bits following a framing sequence. When a station on the loop receives the token, it may transmit data after changing the token to a different pattern of bits. When the station has completed its data transmission, it transmits a framing sequence and the token so that the next station has a chance to transmit. When a station does not have the token, it forwards the data it receives on the loop. A station may remove the data destined for itself. When a station has the token and is transmitting, it discards any
data it receives on the loop. The discarded data either passed this station prior to the token and has circulated around the loop, or was transmitted by this station after accepting the token. In either case, the data have circulated around the loop at least once and every station has had a chance to receive the data. In FDDI there is a target time for the token to circulate around the loop, the target token rotation time (TTRT). In a simple token passing protocol one station can hold the token for a very long period of time. That station can obtain a disproportionate fraction of the bandwidth and delay other stations for long periods. In the FDDI protocol, station i is entitled to send S(i) bits each time it receives the token. The TTRT is set so that every station can send at least the bits it is entitled to send in a single rotation.
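The entitlement rule can be sketched with a toy model. Everything here (names, numbers) is my own simplification of the mechanism, not the standard's state machine; it only shows that when TTRT exceeds the sum of the entitlements plus the loop overhead, even a fully loaded loop meets the rotation target.

```python
# Toy model (all names mine) of the timed-token entitlement: each
# station i may send at most S(i) bits per token visit, so one full
# rotation costs the loop overhead D plus everyone's transmissions.

def rotation_time(entitlements, queued, overhead):
    """Time (in bit-times) for one token rotation when each station
    sends at most its entitlement S(i) from its queue."""
    t = overhead
    for s_i, q_i in zip(entitlements, queued):
        t += min(s_i, q_i)      # a station never exceeds S(i) here
    return t

S = [1000, 4000, 2000]          # entitlements S(i), in bits
D = 500                         # loop delay plus framing overhead
TTRT = sum(S) + D + 1           # chosen so that sum(S) + D < TTRT

# Even with every queue saturated, the rotation meets the target:
worst = rotation_time(S, [10**6] * 3, D)
assert worst < TTRT
# A lightly loaded loop rotates faster, leaving surplus time that
# stations may spend on lower-priority (asynchronous) traffic:
light = rotation_time(S, [0, 300, 0], D)
assert light < worst
```

The surplus (TTRT minus the actual rotation time) is exactly what the per-station timer described below meters out to lower-priority traffic.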
This requirement can be written as

Σi S(i) + Δ < TTRT

where Δ is the propagation delay around the loop plus the maximum time that is added by the transmission format. The TTRT provides guaranteed bandwidth and delay for each station. After a station i transmits S(i), it can continue to transmit data if the time since it last forwarded the token is less than TTRT, which indicates that the token is circulating more quickly than required. How long a station holds the token depends on the priority of the data it is transmitting. If the data are high priority, the station can hold the token until all of the surplus time in the token rotation has been used. If the data are lower priority, the station may leave surplus time on the token to give stations with higher-priority data a chance to acquire the surplus time.

The TTRT is set individually for each FDDI system depending on the requirements of the stations on that system. The maximum delay that can occur is 2*TTRT, and the average token rotation time is less than TTRT (4). Therefore, TTRT can be set to provide the guaranteed bits, S(i), and an upper bound on delay for synchronous applications. When best-effort traffic, rather than traffic that requires service guarantees, is the dominant traffic type, TTRT is set to trade access delay against efficiency. As TTRT is made smaller, the time delay until a token arrives decreases. As TTRT is made larger, the amount of data transmitted before passing the token increases, less time is spent passing the token, and the efficiency of the system increases. The increase in efficiency is greatest when there is only one active station.

The operation of the token protocol is depicted in Fig. 2. When a station transfers the token, it sets the local counter TRT(i) to TTRT. The station decrements TRT(i) every unit of time, whether or not it has the token. If TRT(i) reaches −TTRT before the token is received, then the token has not visited this station in 2*TTRT and is presumed to be lost.

When station i receives the token it has Wj(i) ≥ 0 units of data waiting to be transmitted at priority level j, where jmax is the highest level and Wjmax(i) ≤ S(i). A station transmits Wjmax no matter how much time is left on TRT(i). For the priority levels j < jmax, a station transmits a data unit as long as TRT(i) is greater than the threshold Tj. T(j−1) < Tj, so that we never transmit data at level j − 1 if we cannot transmit data at level j. We process priority jmax data in the same structure as the other levels by setting Tjmax < −(2*TTRT + S(i)), so that data at that level are never inhibited by the threshold.

Tokens that circulate the loop and can be used by any station are called unrestricted tokens. The FDDI standard also supports a mode of operation in which all of the capacity that is not being used by synchronous traffic is assigned to a single station, possibly for a large file transfer. To support this mode of operation, a restricted token is defined. A station that enters this mode forwards the restricted token rather than the standard token. Another station that receives a restricted token transmits its synchronous traffic, but does not transmit traffic at level j < jmax. Therefore, all of the capacity available for asynchronous traffic is given to a single station until that station forwards an unrestricted token.

Isochronous Traffic
FDDI-II adds the ability to send isochronous traffic to FDDI. An isochronous channel provides a regularly occurring slot. The channel is assigned to a specific station on a circuit-switched basis. FDDI-II is implemented by periodically switching the network between a circuit-switched and a packet-switched mode. There is a central station that sends out a framing signal every 125 µs. A portion of the interval following a frame signal is assigned to circuit-switched traffic, the isochronous mode, and the remaining time in the frame is assigned to the token passing protocol. An isochronous station that is assigned one byte per frame has a 64 kbit/s channel. This channel is adequate for telephone-quality voice. In FDDI-II the stations that implement the token passing protocol must switch between the two modes of operation when they receive framing signals. A station that enters the circuit-switched mode must forward whatever bits it receives. When the circuit-switched mode ends, the station must resume the token passing protocol where it left off.

Architecture

A single failure of a node or a link disconnects the stations on the loop. In a LAN, loops are made more reliable by using normally closed relays to bypass individual stations that lose power or fail in an obvious manner. It is also common to arrange stations in subloops that are chained together at a central location. Subloops with failures are removed from the network (5), so that the stations on the other subloops can continue to communicate. Poor reliability prevents loop networks from spanning the distances and connecting the number of users associated with MANs. In FDDI the reliability is improved with a second loop. The second loop does not carry data during normal operation, but is available when failures occur. Figure 3 shows three components that are used in FDDI networks: (A) units that implement the token rotation protocol, (B) units that manage the reliability, and (C) a unit that is responsible for signal timing and framing.

Type A units connect user devices to the primary loop. There can be more than one type A unit attached to a type B unit. The type A units do not have to be collocated with the type B
[Figure 2. The flow diagram for the timed token rotation protocol in FDDI. The diagonal boxes are decision points where a terminal decides whether or not to transmit when a token is received, dependent upon a local timer.]

[Figure 3. The topology of an FDDI loop, showing the application of the three types of units that are defined for an FDDI in unidirectional and bidirectional loops. The dashed lines show how the loops are reconfigured after failures.]
unit, and they can be daisy-chained together to form a subloop. Type B units are connected to both loops. These units monitor the signal returning from type A units and bypass type A units that have stopped operating. Type B units also monitor the signal on the two loops and bypass failed loop components. The secondary loop is used to bypass failed links or failed type B units. The signal on the secondary loop is transmitted in the opposite direction from the primary loop. Normally type B units patch through the signal that they receive on the secondary loop. When a primary loop failure is detected, by a loss of received signal on that loop, a type B unit replaces the lost signal with the signal it receives from the secondary loop and stops transmitting on the secondary loop. When a type B unit stops receiving the signal on the secondary loop, it replaces that signal with the signal it would have transmitted on the primary loop and stops transmitting on the primary loop. As an example, in Fig. 3 the X signifies a link failure and the horizontal bar indicates the link on which the type B unit has stopped transmitting. The entire secondary loop replaces the single failed link on the primary loop.

The configuration of type A and type B units reflects the way loop networks are installed. Loop networks are installed by running wires or fibers from an office to a wiring cabinet. The type A units are located in offices and the type B units are located in the wiring cabinets. Physically, the topology is a star, but the connection of wires in the wiring cabinet forms a logical loop. The star topology makes it possible for stations to be added to a loop, or moved from one loop to another, without rewiring a building.

THE DISTRIBUTED QUEUE, DUAL BUS PROTOCOL

DQDB (6,7) is the IEEE 802.6 standard for MANs. It uses two buses that pass each station. The stations use directional taps to read and write data on a bus without breaking the bus.
Directional taps transmit or receive data from one, rather than both, directions at the point of connection, and are common components in both CATV and fiber optic networks. DQDB transmits information in fixed-size slots and uses the distributed queue protocol to provide fair access to all of the stations. Signals on the two buses propagate in opposite directions. A station selects one bus to communicate with another specific station and uses the other to place reservations for that bus. The DQDB standard was preceded by two earlier protocols, Express-Net (8) and Fasnet (9), that used directional taps. In the two earlier systems there was a single bus that passed each station twice. On the first pass each station could insert signals, and on the second pass a station could receive signals from all other stations. Both of the earlier protocols provided fair access by guaranteeing that every station had the opportunity to transmit one slot before any station could transmit a second slot.

Passive Taps

Passive taps distinguish directional buses from loop networks, which use signal regenerators. In loop networks, there is a point-to-point transmission link between each station.
Each station receives the signal on one link and transmits on the next link. A station can add or remove the signal on the loop. A failure in the electronics in a station breaks the communications path. By contrast, the stations on a directional bus network do not interrupt the signal flow. A passive tap reads the signal as it passes the station, and another tap adds signal to the bus. If a station with passive taps stops working, the rest of the network is not affected. The protocols that can be used on bus networks are a subset of the protocols that can be used on loop networks, since the stations can add but not remove signals. The inability to remove signals makes it necessary for the bus to have a break in the communications path, where signals can leave the system. The protocols for directional buses can be implemented using regenerators rather than passive taps when it is advantageous. Passive taps remove energy from the signal path, and the signal must be restored to its full strength after passing several stations. In addition, by removing information from the transmission medium after it is received, the medium can be reused to support more communications. The DQDB standard provides for erasure nodes (10,11), which remove information that has already been received. Erasure nodes are regenerators that read slots and, depending on the location of the destination, regenerate the slot or leave the bus empty.

Architecture

The dual bus in a DQDB network is configured as a bidirectional loop, as shown in Fig. 4. The signal on the outer bus propagates clockwise around the loop, and the signal on the inner bus propagates counterclockwise. The signal does not circulate around the entire loop, but starts at a head-end on each bus and is dropped off the loop before reaching the head-end. To communicate, a station must know the location of the destination and the head-ends and transmit on the proper bus.
For instance, station A transmits on the outer bus to communicate with station B, and station B transmits on the inner bus to communicate with station A. The dual bus is configured as a loop so that the head-end can be repositioned to form a contiguous bus after a failure occurs. The head-end for each bus is moved so that the signal is inserted immediately after the failure and drops off at the failure. This system continues to operate after any single failure. The ability to heal failures increases the complexity of stations on the DQDB network. To heal failures, the station that assumes the responsibility of the head-end must be able to generate clock and framing signals. In addition, after a failure each station must determine the new direction of every other station. For instance, after the failure in Fig. 4 is repaired, station A must use the inner bus, rather than the outer bus, to transmit to station B.

The Access Protocol

In a DQDB network transmission time is divided into fixed-size slots that stations acquire to transmit data. A station at the beginning of the bus periodically transmits a sync signal that each station uses to determine slot boundaries. The first bit in the slot is a "busy" bit. It is one when the slot is being
[Figure 4. The topology of the DQDB network, showing the dual bus configured as a dual loop in order to survive single failures.]
used, and zero when it is empty. When a station has data to send, it writes a one into the busy bit. A read tap precedes the write tap at each station. When a station writes a one into the busy bit, it also reads the bit from upstream to determine if it was already set. If the busy bit was zero, the station transmits data. If the busy bit was one, then the slot is full, but there is no harm in overwriting the busy bit. The problem with this type of protocol is that stations that are closer to the head-end see more empty slots than stations that are farther away. To prevent unfair access to the medium, stations send reservations to the stations that can acquire slots before they have a chance. Reservation requests are transmitted to the upstream stations on the opposite bus from the data. There are two separate reservation systems, one for transmitting data on each bus. In each system, the bus that is used to transmit data is referred to as the data bus, and the other bus is the reservation bus. Reservations are used to prevent upstream stations from acquiring all of the empty slots when downstream stations are waiting. A station places reservations in a queue with its own requests and services the entries in this queue whenever an empty slot passes. If the entry is a reservation, the busy bit in the slot is left unchanged so that the slot is available for a downstream station. If the entry is from the local source, the busy bit is set and data are inserted in the slot. A station with a very large message may still dominate the bus by requesting a large number of slots. To prevent this, a station is only allowed to have one outstanding request at a time. When a message requires several slots, one request is transmitted on the reservation bus and one data slot is placed in the local queue. When the data slot is removed from the queue, a second request is transmitted on the reservation bus and the second data slot is placed in the queue.
The process continues until the entire message is transmitted. When there is more than one active user, the active users are given one slot at a time and serviced in a round-robin order. The queue in each reservation system is implemented with two counters that track the number of reservations. Counter CB counts the reservations that precede this station’s data slot, and counter CA counts the reservations that follow this station’s slot. When a station is not actively transmitting, it
increments CB whenever a reservation is received and decrements it whenever an empty slot passes. When a station has a data slot to transmit, it increments CA whenever a reservation is received. When CB is zero and an empty slot appears on the data bus, the station transmits its data slot and then transfers the count from CA to CB. If the station has another data slot, transferring the count from CA to CB gives downstream stations, which placed reservations while this data slot was queued, a chance to acquire the data bus before this station transmits a second data slot. In a DQDB network with multiple priority levels for data, there is one reservation bit and two counters for each priority level. When empty slots are received, the counters for the higher priority levels are emptied first.

Protocol Example

Figure 5 shows the operation of five stations in the middle of a DQDB bus as data arrive and are then transmitted. The data bus propagates from left to right and the reservation bus from right to left. The figure shows the value of the busy bit on the data bus and the reservation bit on the reservation bus for each time slot, as the bus passes each station. The values in counters CA and CB, and whether or not a station is waiting to transmit data, are shown in between the data and request bus at each station. The operation is simplified and ignores questions of relative timing and propagation delay. Starting at slot time 1, a single data slot arrives at station 5, followed by two data slots at station 2, and another single data slot at station 3. When the data slots arrive, the network is busy transmitting other data slots, so the data is queued rather than being transmitted immediately. The service order for the messages is station 5, station 2 slot 1, station 3, and station 2 slot 2. The protocol operates as a first-in first-out (FIFO) queue with a round-robin strategy for multiple slot messages.
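The counter rules above can be sketched in a few lines of Python. This is an illustration only, not the state machine from the IEEE 802.6 standard: the class and method names are invented, it models a single priority level on one bus, and it assumes at most one queued segment per station.

```python
class DqdbStation:
    """Sketch of the distributed-queue counters for one bus at one station.

    CB counts requests that must be served before this station's queued
    segment; CA counts requests that arrive while the segment is waiting.
    """

    def __init__(self):
        self.cb = 0
        self.ca = 0
        self.have_data = False

    def on_request_bit(self):
        """A request bit observed on the reservation bus."""
        if self.have_data:
            self.ca += 1     # downstream request queued behind our segment
        else:
            self.cb += 1     # downstream request queued ahead of us

    def on_empty_slot(self):
        """An empty slot passes on the data bus; returns True if we fill it."""
        if self.have_data and self.cb == 0:
            self.have_data = False
            # Requests that arrived while we waited now go ahead of any
            # later segment of ours, giving round-robin fairness.
            self.cb, self.ca = self.ca, 0
            return True
        if self.cb > 0:
            self.cb -= 1     # let the slot pass to a downstream station
        return False
```

For example, a station that saw one downstream request before queueing its own segment lets the first empty slot pass and transmits in the second, exactly the behavior traced in Figure 5.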
In the first three slots, while the reservations arrive, the data bus is busy carrying slots from stations that are closer to the head-end, which were previously queued. Since the stations are upstream, the reservations are not in the queues at these five stations. In the first slot, station 5 inserts a one on
METROPOLITAN AREA NETWORKS
the reservation bus that is entered in the queues at stations 1 to 4. In the second slot station 2 places a reservation that is only entered in the queue at station 1. In the third slot station 3 places a reservation that is entered in the queues at stations 1 and 2. After three time slots, station 1 has three reservations and station 2 has two reservations, one that will be serviced before its own request and one after. At station 3 there is only one reservation, since station 3 did not receive the reservation from station 2. The fact that station 3 cannot contribute to seeing that station 2 is serviced in its fair turn will not matter, since station 2 has earlier access to the data bus. Stations 4 and 5 have one and zero reservations, respectively. The reservations are serviced starting in slot 4. Since CB is greater than zero in stations 1 to 4, these stations let the empty slot pass by, and each removes one reservation by decrementing CB. Station 5 transmits in slot 4. In slot 5 both stations 2 and 3 are poised to transmit, with CB at zero and a data slot waiting. Station 2 has the first access to the slot and acquires it, demonstrating that it is unimportant that station 3 did not have an entry for station 2. Station 2 has a second slot that it must transmit, but CA indicates that one other downstream station has not been serviced. Station 2 moves the count from CA to CB before entering its next request in the queue. Slot 6 is acquired by station 3, since CB is greater than
Figure 5. The simplified operation of the DQDB protocol. This shows the operation of the counters at five stations on the bus as data packets arrive at the stations and are transmitted on the bus. (CB and CA are the counters at each station; X = 1 when the station has data to transmit.)
zero at station 2, and slot 7 is acquired by the second slot from station 2.

Isochronous Traffic

There is no mechanism in the distributed queueing protocol to provide service guarantees on delay or bit rate. This problem has been sidestepped in the standard by creating a separate protocol that shares the bus with the DQDB protocol, as in FDDI. The slots leaving the head-end are grouped into 125 µs frames. In some slots the busy bit is zero and the slots are available for the data transfer protocol. In other slots the busy bit is one so that stations using the data transfer protocol do not try to access these slots. The busy slots that are generated by the head-end occur at regular intervals and contain a unique header so that they are recognized by stations that require guaranteed rates. These slots are partitioned into octets (bytes) that can be reserved. A station that reserves a single octet in a frame acquires a guaranteed 64 kbit/s channel with at most a 125 µs delay. This is the same guarantee provided by the telephone system. As in FDDI, the channels are circuit switched and referred to as isochronous channels.

Protocol Unfairness

Soon after the IEEE 802.6 standard was passed, it was noted that because of the distance-bandwidth product of the network, there was a potential for gross unfairness (12,13). In this section we explain the source of the problem, and a particularly simple solution, called bandwidth balancing (BWB) (14), which eliminated most of the problem and was added to the standard. The IEEE 802.6 standard is designed to operate at 155 Mbit/s, with 53 byte slots, and is compatible with ATM. At these rates, a cell is only about 0.4 miles long. The standard spans up to 30 miles. Therefore, there may be 75 cells simultaneously on the bus. Assume that a station near the head-end of the bus has a long file transfer in progress when a station 50 cells away requests a slot.
In the time it takes the request to propagate to the upstream station, that station transmits 50 slots. When the request arrives, the upstream station lets an empty cell pass and then resumes transmission. An additional 50 slots are transmitted before the empty cell arrives at the downstream station. When the empty slot arrives, the downstream station transmits one slot and submits a request for another. The round trip for this request to get to the upstream station and return an empty slot is another 100 slots. As a result, the upstream station obtains 100 times the throughput of the downstream station. Although the reason is less obvious, a similar imbalance can occur in favor of the downstream station when that station starts transmitting first. While the downstream station is the only source, it transmits in every cell, while placing a reservation in every slot. When the upstream station begins transmitting, there are no reservations in its counter, but there are 50 reservations on the bus. While the upstream source transmits a slot, a reservation is received. Therefore, the upstream station must allow one slot to pass before transmitting its second slot. During the time it takes to service the reservation and the upstream station's next slot, two more reservations arrive. Therefore, the upstream station lets two empty slots pass before transmitting its third slot. The reservation queue at the upstream station continues to build up each time it transmits a slot, and the upstream station takes fewer of the available slots. The imbalance between the upstream and downstream station is sustained after the 50 reservations pass the upstream station because, at the other end of the bus, the downstream station places a reservation on the bus for each of the empty slots that the upstream station releases. The imbalance is not as pronounced as when the upstream station starts first, but it is considerable. The exact imbalance depends on the distance between the two stations and the time that they start transmitting relative to one another (14).

Bandwidth Balancing

The bandwidth balancing (BWB) mechanism, which was added to the standard to overcome the unfairness, is based on two observations:

1. Each station can calculate the exact number of slots that are used, whether or not the data physically pass the station.

2. It is possible to exchange information between stations by controlling the fraction of the slots that are not used.

A station sees a busy bit for every slot transmitted by an upstream station and a reservation for every slot transmitted by a downstream station. By summing the busy bits and reservation bits that apply to a data bus and adding the number of slots that the station transmits, the station calculates the total number of slots transmitted on the bus. Table 1 shows an example of how stations can communicate and achieve a fair utilization by using the average number of unused slots. Two stations, station A and station B, each try to acquire 90% of the unused bandwidth on a channel. Station A starts first and acquires 90% of the total slots. When station B arrives, only 10% of the slots are available, but station B does not know if the slots are being used by a single station taking its allowed maximum share, or by many stations transmitting sporadically. Station B uses 90% of the available slots, or 9% of the slots in the system.
Station A now has 91% of the slots available. When station A adjusts its rate to 90% of 91% of the slots, it uses 82% of the slots, making 18% of the slots available to station B. Station B adjusts its rate up to 90% of 18%, which causes station A to adjust its rate down, and so on until both stations arrive at a rate of 47.4%. Note that this mode of communications can
Table 1. Convergence of Rates When Two Stations Use 90% of the Slots Available to Them

              Station A                            Station B
  Measure                                Measure
  (Bsy + Rqst)   Take                    (Bsy + Rqst)   Take
  0              0.9 * 1     = 0.9       --             --
  0              0.9 * 1     = 0.9       0.9            0.9 * 0.1   = 0.09
  0.09           0.9 * 0.91  = 0.82      0.82           0.9 * 0.18  = 0.16
  0.16           0.9 * 0.84  = 0.76      0.76           0.9 * 0.24  = 0.22
  0.22           0.9 * 0.78  = 0.70      0.70           0.9 * 0.3   = 0.27
  0.27           0.9 * 0.73  = 0.66      0.66           0.9 * 0.34  = 0.31
  0.31           0.9 * 0.69  = 0.62      0.62           0.9 * 0.38  = 0.34
  ...            ...                     ...            ...
  0.474          0.9 * 0.526 = 0.474     0.474          0.9 * 0.526 = 0.474
only be used when stations try to acquire less than 100% of the slots. The implementation of BWB in the standard is particularly simple. A station acquires a fraction of the slots available by counting the slots it transmits and placing extraneous reservations in the local reservation queue when the count reaches certain values. In this way, a station lets slots pass that it would have acquired. For instance, if stations agree to take 90% of the slots that are available, they count the slots that they transmit and insert an extra reservation in CA after every ninth slot that they transmit. As a result, every tenth slot that the station would have taken passes unused. With BWB, the fraction of the throughput that station i acquires, T_i, is a fraction α of the throughput left behind by the other stations:
T_i = α (1 − Σ_{j≠i} T_j)

When N stations contend for the channel, they each acquire a throughput:

T = α / (1 + α(N − 1))
The total throughput of the system increases as α approaches one or as the number of users sharing the facility becomes large. The disadvantage with letting α approach one is that it takes the network longer to stabilize. We can see from the example in Table 1 that the network converges exponentially toward the stable state. However, as α → 1 the time for convergence goes to infinity. When α = 1, BWB is removed from the network and the original DQDB protocol is implemented.
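The convergence in Table 1 and the closed-form rate can be checked numerically. The sketch below (function and variable names are our own) simply applies the rule T_i = α(1 − Σ_{j≠i} T_j) to each station in turn until the rates settle:

```python
def bwb_rates(n_stations, alpha, sweeps=200):
    """Iterate the bandwidth-balancing rule until the rates settle.

    Each station repeatedly sets its throughput to alpha times the
    capacity left behind by all of the other stations:
        T_i = alpha * (1 - sum_{j != i} T_j)
    """
    t = [0.0] * n_stations
    for _ in range(sweeps):
        for i in range(n_stations):
            others = sum(t) - t[i]
            t[i] = alpha * (1.0 - others)
    return t

# Two stations at alpha = 0.9 settle at alpha / (1 + alpha*(N-1)) = 0.9/1.9,
# the 47.4% rate that Table 1 converges to.
rates = bwb_rates(2, 0.9)
```

The first few sweeps reproduce the rows of Table 1 (0.9 and 0.09, then 0.82 and 0.16, and so on), and the error shrinks by roughly α² per sweep, which is the exponential convergence noted above.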
THE MANHATTAN STREET NETWORK

The Manhattan Street Network (MSN) (15) is a network of two-by-two switches. A source and destination may be attached to each switching node. The logical topology of the network resembles the grid of one-way streets and avenues in Manhattan. Fixed-size cells are switched between the two inputs and outputs using a strategy called deflection routing. The MSN resembles a distributed ATM switch. The two-by-two switching elements may be in a large number of wiring centers rather than a central location. However, the same interconnection structure is used for switching elements in different wiring centers as for elements in the same location. Routing is simpler in the structured MSN than in a general network of small switches. In deflection routing, packets can be forced to take an available path rather than waiting for a specific path. Each packet between a source and destination is routed individually and may take different paths. The packets may arrive at the destination out of order and have to be resequenced. Deflection routing has several advantages over virtual circuit routing. The overhead associated with establishing and maintaining circuits is eliminated, and the capacity is shared between bursty sources without large buffers and without losing packets because of buffer overflows. Deflection routing is also being used for routing inside some ATM switches (16,17).
Topology

The MSN is a two-connected topology. In two-connected networks, other than linear structures like the dual bus or dual loop, a path must be chosen at every intermediate node. Two two-connected topologies that have simple routing rules are the MSN and the shuffle-exchange network (15), shown in Fig. 6. The MSN is a grid of one-way streets and avenues. The directions of the streets alternate. By numbering the streets and avenues properly, it is possible to get to any destination without asking directions or having a complete map. The grid is logically constructed on the surface of a torus instead of a flat plane. The wraparound links on the torus decrease the distance between the nodes on the edges of the flat plane and eliminate congestion in the corners. In the shuffle-exchange network node i is connected to nodes 2i and 2i + 1, modulo the number of nodes in the network. In the figure for the shuffle-exchange network each node appears in both the left- and right-hand columns in order to make it easier to draw the connections. The two links leaving each 2 × 2 switching element are shown in the left-hand column and the two links arriving at each switching element are shown in the right-hand column. The shuffle-exchange network has a simple routing rule, based on the address of the destination. The simple routing rule is one of the reasons why this structure is used in most ATM switches.

Figure 6. Regular mesh topologies. This figure shows the connectivity of the nodes in the Manhattan Street Network and the shuffle-exchange network. The same nodes appear in the left and right columns of the shuffle-exchange network in order to make the regular structure easier to see.

Deflection Routing

Deflection routing is a rule for selecting paths for fixed-size cells at network nodes with the same number of inputs and outputs, as shown in Fig. 7. The cells are aligned at a switching point. If both cells select the same output, and the output buffer is full, one cell is selected at random and forced to take the other link. The cell that takes the alternate path is deflected. Deflection routing can operate without any buffers in the nodes. Deflection routing gives priority to cells passing through the node. Cells are only accepted from the local source when there are empty cells arriving at the switch. Cells are never dropped due to insufficient buffering because the number of cells arriving at the node never exceeds the number of cells that can be stored or transmitted. There is an implicit assumption that the source can be controlled. This assumption is common in data networks with variable throughputs, such as the Ethernet.

The MSN is well suited for deflection routing for three reasons:

1. At any node many of the destinations are equidistant on both output links. Cells headed for these destinations have no preference for an output link and do not force other cells to be deflected.

2. When a cell is deflected, only four links are added to the path length. The worst that happens is that the cell must travel around the block.

3. Deflection routing can guarantee that cells are never lost, but cannot guarantee that they will not be deflected indefinitely and never reach their destination. It has been proven that this type of livelock will never occur in the MSN (18).

Deflection routing is similar to the earlier hot potato routing (19), which operated with variable-size packets, on a general topology, with no buffers. Fixed-size cells, the MSN topology, and two or three cells of buffering converted the earlier routing strategy, which had very low throughputs, to a strategy that can operate at levels exceeding 90% of the throughput that is achieved with infinite buffering.

Figure 7. A block diagram of a node in a deflection-routing network. The arriving packets are delayed so that they are aligned. The packets can be exchanged and placed in their respective output buffers.

Reliability

The MSN topology has several paths between each source and destination. The alternate paths can be used to communicate after nodes or links have failed. There are two simple mechanisms to survive failures in the MSN. Both mechanisms, shown in Fig. 8, are adopted from loop networks. Node failures are bypassed by two normally closed relays that connect the rows and columns through. The missing node in the grid in Fig. 8 has failed. Link failures are detected by a loss of signal, as in loop networks. Nodes respond to the loss of signal by not transmitting on the link at right angles to the link that has stopped. When one link fails, three other links are removed from service and the node at the input to the failed link stops transmitting on it. The dotted link in Fig. 8 has failed and nodes stop transmitting on the dashed links. This link removal procedure works with any number of link failures. Since the number of input and output links are equal at all of the operating nodes, deflection routing continues to operate without losing cells. In addition, it has been found that the simple routing rules that are designed for complete MSNs continue to work on networks with failures.

COMPARISON OF FDDI, DQDB, AND THE MSN

The FDDI, DQDB, and MSN networks are all two-connected topologies.
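The output-selection step of deflection routing can be sketched as follows. This is a simplified illustration with invented names: it assumes each cell arrives with a single preferred output already computed (by the MSN addressing rule, for example) and that there are no output buffers, so every contested output is resolved by a random toss and the loser is deflected.

```python
import random

def route_slot(cells, n_outputs=2):
    """Assign each incoming cell of one slot time to an output link.

    `cells` is a list of (cell_id, preferred_output) pairs, one per
    occupied input link.  Every cell is forwarded in the same slot:
    a cell that loses the random toss for a contested output is
    deflected onto a free output instead of being buffered or dropped.
    """
    assignment = {}                      # output link -> cell_id
    contenders = list(cells)
    random.shuffle(contenders)           # random winner on a conflict
    for cell_id, preferred in contenders:
        if preferred not in assignment:
            assignment[preferred] = cell_id
        else:
            free = next(o for o in range(n_outputs) if o not in assignment)
            assignment[free] = cell_id   # this cell is deflected
    return assignment
```

With two cells both preferring output 0, one wins the toss and the other leaves on output 1; in the MSN grid that deflection costs at most four extra links, a trip around the block.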
There are two links entering and leaving each node. Some units in the FDDI network may not be two-connected, but the main part of the network is a dual loop. Two-connectivity distinguishes these networks from the earlier loop and bus networks, used in LANs, which were one-connected. The increased connectivity makes these networks better suited for the increased number of users and longer distances spanned in a MAN.

Figure 8. Failure recovery mechanisms in the MSN. Node failures are survived with bypass relays at the failed node. Link failures are survived by eliminating a circuit that includes the failed link, thus preserving the criterion necessary to perform deflection routing at every node.

Like the earlier LANs, DQDB and FDDI are linear topologies. Logically, the nodes are arranged in a one-dimensional line. In linear topologies the throughput per user and reliability are not as high as they are in other two-connected topologies, which have shorter distances and more paths between nodes. There were early proposals (20) to connect the second paths in loop networks to achieve higher throughputs and reliability than the bidirectional loop, but these more complicated topologies lost some of the more desirable characteristics of loops and buses. The MSN and shuffle-exchange topology achieve both goals while approaching the ease of routing and growth that is associated with linear topologies.

Throughput

A disadvantage with linear topologies is that the average throughput that each user can obtain decreases linearly with the number of users. In linear topologies, the average distance between nodes increases linearly with the number of nodes in the network. With uniform traffic, the number of users sharing a link increases linearly with the number of users in the network and the average rate per user decreases accordingly. The protocols used in FDDI and some DQDB networks do not reuse slots. The throughput in these networks is reduced for all traffic distributions. By contrast, in the MSN the distance between nodes increases as the square root of the number of nodes in the network, and in the shuffle-exchange network the distance increases as the log of the number of nodes. As a result, the reduction in the throughput per user, which occurs as networks become large, is much less in the MSN and the shuffle-exchange networks than in the FDDI or DQDB network. In the DQDB network, the penalty for large networks can be reduced by breaking the network into segments and erasing data that have already been received when they reach the end of a segment.
This strategy works particularly well when the traffic requirements are nonuniform. When there are communities of users that communicate frequently, if those users are placed on the same segment of the bus, the traffic between them does not propagate outside the segment and interfere with users in other segments. A similar strategy for reusing capacity does not exist for FDDI. In addition, the concept of communities is not as meaningful in FDDI. FDDI operates as a single loop. When user A has a short path to user B, user B
must traverse the rest of the loop to get to user A, and therefore has a long path. In the MSN and shuffle-exchange network special erasure nodes are not needed because the protocol removes cells that reach their destination. In the MSN, communities of users are supported in a very natural way. If nodes that communicate frequently are located within a few blocks, they only traverse the paths in those few blocks and do not affect the rest of the network. The concept of communities does not exist in shuffle-exchange networks. There is less interference between neighborhoods in the MSN than in DQDB networks. In the MSN, when there is heavy traffic within a neighborhood, communications between other neighborhoods can continue without passing through that neighborhood. In deflection routing, because of the random selection process when there are equal length paths or oversubscribed nodes, cells naturally avoid passing through congested neighborhoods. By contrast, when a community in the middle of the bus becomes congested in a DQDB network, communications between nodes at opposite edges of the bus must still pass through that community.

Reliability

Both DQDB and FDDI have a structure that enables them to survive single failures without losing service. FDDI has a redundant loop that is pressed into service to bypass single failures, and DQDB repositions the head-end of the bus. In both of these networks if two or more failures occur in adjacent components, the mechanism bypasses those components without affecting the operating components. However, in the more likely event that the multiple failures occur in components that are separated from one another, the network is partitioned into islands of nodes that cannot communicate with one another. Nodes in the MSN are not cut off from one another until at least four failures occur. When four failures have occurred, the likelihood of nodes being disconnected, and the number that are actually disconnected, is small.
A quantitative comparison of the MSN, DQDB, and FDDI networks is presented in Ref. 21.

Routing Complexity

An advantage of linear topologies is that routing is relatively simple. All of the data that enter an FDDI system are transmitted on a single path, and there is only one path to select at any intermediate node. In a DQDB network the source must decide which of the two paths leads to the destination, but once the data are in the network there are no choices to make. The MSN and shuffle-exchange networks have simple rules to select a path, but a choice must be made at each node. In a linear topology everything that is destined for an output link either originates at the source at the node or is received from a single input link. When the data that arrive on the input link have precedence over the data from the source, there is no need to delay or buffer the data from the input link. Deflection routing has the same result in any network where the in-degree equals the out-degree at each node.

Growth

An important consideration in any large network is how easily it can be modified to add or delete users. In the early days
of LANs the correspondence between the topology of the network and the physical distribution of users was considered important. A loop network was supposed to be a daisy chain between adjacent offices and a bus network was supposed to pass down a hallway. With experience we realized that it is more difficult to change the wiring between offices as users move than to have all offices connected to a wiring cabinet and change the interconnection in that cabinet. As a result, most LANs are physically a star network, or a cluster of stars, of connections between users and a wiring cabinet. The users are connected together inside the wiring cabinet to form a logical loop or bus or mesh network. The number of wires that must be changed in the wiring cabinet determines how difficult it is to add or delete users from a network. In a bidirectional loop or bus network, adding or deleting a user is a relatively simple operation. To add a user, the connection between two users is broken and the new user is inserted between them. In the wiring cabinet, two wires are deleted and four are added. In the FDDI loop, there is only one path to the new user. In the DQDB network, every user must determine which bus to use to transmit to the new user. In the shuffle-exchange network, adding or deleting nodes is very complex. Complete networks are only defined for certain numbers of users. The shuffle-exchange network shown in Fig. 6 is only defined when the number of users is 2^n. When a network is replaced by the next larger network, virtually every wire must be removed and the network reconfigured from scratch. In a complete MSN, two complete rows or columns must be added to retain the grid structure. There are, however, partial MSNs in which rows or columns do not span the entire grid, and a technique is known for adding one node at a time to a partial MSN to eventually construct a complete MSN (22).
With this technique the number of links that must be changed in the wiring cabinet is the same as in the loop or bus network. In addition, an addressing scheme is known that allows new partial rows or columns to be added without changing the addresses of existing nodes or the decisions made in the routing strategy.

Multimedia Traffic

Multimedia traffic is playing an increasingly important role in networks. Non-real-time video or audio, as is currently delivered by Web sites on the Internet, is adequately handled by the data modes on any of the MANs that have been described. However, the data mode of operation cannot provide the guarantees on throughput or delay that are required for real-time voice or video applications; nor can it guarantee the sustained throughput that is needed to view movies while they are being received. DQDB and FDDI have an isochronous mode of operation that provides dedicated circuits to support real-time traffic. The isochronous mode is well integrated into the DQDB protocol. Nodes that only require the data mode do not have to change any protocols or hardware when isochronous traffic is added to the network. The only change that these nodes notice is that some slots seem to enter the network in a busy state. By contrast, when isochronous traffic is added to an FDDI network, every node must be able to perform context switches to move between the data and circuit modes. The MSN does not have an isochronous mode of operation. It is
unlikely that dedicated circuits can be added to the mesh structure without constructing a separate circuit switch at every node in the network.

CATV

The current CATV network is an existing MAN that delivers TV programs to a large number of homes. The network is designed for unidirectional delivery of the same signal to a large number of receivers. The channels have a very wide bandwidth relative to telephone channels. In bidirectional CATV systems there are many more channels from the head-end to the home than in the opposite direction. The growth of the use of the Web in homes has created the need for wide-band channels to homes. The traffic on the Web is predominantly from servers to clients, and most homes are clients rather than servers. Increasingly, the Internet is being used for multicast communications (23) of video or audio programming, in which the same signal is received by more than one receiver. CATV networks are naturally suited for the Web and multicast communications. The simplest CATV systems are hybrid networks that use the CATV network to send addressed packets to the home, and the telephone network to receive data from the home. The CATV network provides a means of quickly getting a large amount of data to the home. The home terminals share a CATV channel. They receive all of the packets and filter out the packets that are intended for others. If the data are considered sensitive, encryption is used to keep them from being intercepted. Shared CATV channels are particularly useful when high data rates are needed for short periods, as when receiving new pages from Web servers. The telephone channel is a relatively low bandwidth channel that adequately handles the traffic to Web servers. The IEEE 802.14 working group is currently considering standards for hybrid networks. The applications of the Internet are not stationary.
Packet telephony or an increase in the number of servers in homes for publishing or small businesses can quickly change the current unidirectional traffic demands. Data MANs that are implemented on CATV networks should be flexible so that they can track changes in the applications. In the early MANs, experimental CATV systems were used for two-way voice and data (24). There was not sufficient demand for data to homes at that time, and work on these networks stopped. The renewed interest in data applications of CATV networks makes it reasonable to reconsider the earlier work. In this section, we describe one such network, the Homenet (25).
The Homenet transmission strategy partitions the CATV network into smaller areas, called Homenets. Stations contend only with stations in their own Homenet to gain access to the network. Because of the size of the Homenets, the contending stations satisfy the distance constraints imposed on the CSMA/CD protocol used in Ethernet LANs. Any station on the CATV network can receive the signal on any Homenet. Two stations communicate by transmitting on their own Homenets and receiving on the other station's Homenet. The Homenet strategy is readily tailored to load imbalances, such as the directional imbalance in Web traffic. There is no relationship between the bandwidth that is available on a station's transmit and receive channels. Bandwidth is assigned to transmitters by adjusting the number of stations that contend for a Homenet. For instance, an Internet service provider (ISP) can be given one or more Homenets, so that traffic from its servers can access the network without contention. Many clients, in homes, may share the same Homenet, since the traffic level from a client to the servers is much lower. A convenient characteristic of the Homenet strategy, in comparison with hybrid networks, is that the network can be modified easily to match changes in load imbalances. If the traffic distribution changes and more traffic originates in some homes, those homes can be placed on less heavily populated Homenets to increase the amount that they can transmit.

Access Strategy. The stations in a Homenet access the channel using Movable Slot Time Division Multiplexing (MSTDM) (26), a variation on the CSMA/CD (Carrier Sense Multiple Access/Collision Detection) protocol used in Ethernet. MSTDM is implemented by a very small change in the standard Ethernet access unit and sets up telephone-quality voice connections on an Ethernet. Figure 9 shows the transmission strategy that is used within a Homenet. The CATV taps are directional, so that a station only receives signals from downstream and can only transmit upstream. The stations in a Homenet transmit upstream in an assigned frequency band. At a reflection point the upstream frequency band is received and retransmitted downstream in a different frequency band. Before transmitting, a station listens to the downstream channel to determine if it is busy (carrier sense), then transmits on the upstream channel. When a station receives the signal that it transmitted, it knows that it has not collided with any other station (collision detection) and that any station that receives this channel can receive its data.

[Figure 9. Transmission strategy within a Homenet. Stations transmit upstream on the CATV tree using one channel. At the root of the Homenet, the signal that is received on this channel is retransmitted downstream on a different channel.]
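The access rule can be sketched as a toy model in which a station defers while the downstream channel is busy, transmits upstream, and declares success only when it hears its own frame echoed back intact. The classes and single-shot timing model below are illustrative only; they are not part of the Homenet or MSTDM specification:

```python
class Channel:
    """Toy shared medium: remembers concurrent transmissions."""
    def __init__(self):
        self.active = []
    def busy(self):
        return bool(self.active)
    def transmit(self, ident, frame):
        self.active.append((ident, frame))
    def echo(self):
        # The reflection point echoes a lone transmission intact;
        # overlapping transmissions garble one another.
        return self.active[0] if len(self.active) == 1 else None

class HomenetStation:
    """MSTDM-style access rule: carrier sense on the downstream
    channel, collision detection by listening for one's own frame."""
    def __init__(self, ident):
        self.ident = ident
    def try_send(self, channel, frame):
        if channel.busy():                    # carrier sense (downstream)
            return "defer"
        channel.transmit(self.ident, frame)   # transmit (upstream)
        if channel.echo() == (self.ident, frame):
            return "success"                  # heard own frame: no collision
        return "collision"                    # back off and retry later

print(HomenetStation("A").try_send(Channel(), "hello"))  # success
```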
METROPOLITAN AREA NETWORKS
[Figure 10. Transmission plan for a CATV network with 4 Homenets. Each segment carries the Homenet transmit channel (XH), the signaling transmit and receive channels (XS, RS), the Homenet receive channels (R1–R4), and additional downstream channels (D2–D4). The X's are upstream, transmit channels and the R's are downstream, receive channels. The cross-hatched channels are filtered out at the edge of the Homenets.]
Interconnection Strategy. Figure 10 shows the boundaries of Homenets, the assignment and filtering of frequencies, and the interconnections for a four-Homenet system. There are many fewer upstream channels in a CATV network than downstream channels. The same upstream data channel is reused in every Homenet by filtering that channel at the boundaries between Homenets. The stations in every Homenet contend for the same upstream channel, but the transmissions from stations in different Homenets do not interfere with one another. Each Homenet is assigned a unique receive channel. At the Homenet's reflection point the upstream signal on the transmit channel is received and retransmitted on the Homenet's receive channel. The signal is also carried by a point-to-point link to the head-end of the CATV network, where it is again transmitted on the Homenet's receive channel. A Homenet's receive channel is filtered at the reflection point for the Homenet so that the signal from the head-end does not interfere with the signal that is inserted. The reflection point of each Homenet appears to be the root of the entire CATV tree for that Homenet's receive channel, and any station on the CATV network can receive the signal that a station transmits in its Homenet. To see how the transmission strategy works, consider a station in Homenet 2. The station transmits on the common upstream channel. This transmission may collide with transmissions from other stations in Homenet 2. The transmission will not collide with transmissions from stations in Homenet 4 because the transmit channel is filtered at the boundary between Homenets 2 and 4. At the reflection point for Homenet 2, the signal on the transmit channel is placed in downstream channel 2. A station that has transmitted detects collisions by listening to the downstream signal on channel 2.
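The effect of the boundary filters can be summarized in a few lines: a sender's collision domain is its own Homenet, while its audience is the whole network. The function and variable names below are invented for illustration:

```python
# Toy model of channel reuse in Fig. 10: the common transmit channel
# is filtered at Homenet boundaries, while each Homenet's receive
# channel is re-broadcast network-wide via the reflection point and
# the head-end. The four-Homenet layout is taken from the text.

HOMENETS = {1, 2, 3, 4}

def contention_set(stations, homenet):
    """Stations whose transmissions can collide with a sender in
    `homenet`: only those sharing the same filtered transmit channel,
    i.e. stations in the same Homenet."""
    return {s for s, h in stations.items() if h == homenet}

def audience(homenet):
    """Any station on the CATV network can tune to receive channel
    R<homenet>, so a successful packet reaches every Homenet."""
    assert homenet in HOMENETS
    return HOMENETS

stations = {"a": 2, "b": 2, "c": 4, "d": 1}
print(sorted(contention_set(stations, 2)))  # ['a', 'b'], not 'c' or 'd'
print(sorted(audience(2)))                  # [1, 2, 3, 4]
```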
A successfully transmitted packet on Homenet 2 can be received by any station in Homenets 2 and 4. The signal on the transmit channel is also transferred to the head-end of the CATV network, where it is placed in channel 2, so that stations in Homenets 1 and 3 can also receive the signal from the station in Homenet 2. The signal from the head-end does not interfere with the signal inserted in channel 2 at the reflection point because the signal from the head-end is removed at the entry to Homenet 2. In Fig. 10 there is a second upstream and downstream channel that is used by all of the stations on the CATV network. This is a signaling channel and is used for communications between stations during call setup. A station places a call by transmitting on the upstream signaling channel. At the head-end of the CATV network this signal is placed on the downstream signaling channel. The CSMA/CD contention rule cannot be used on this channel because of the larger distances involved, but the low utilization of the channel makes it reasonable to use a less efficient Aloha (27) contention rule. A station does not listen to the channel before transmitting, but does listen to the receive channel to determine if its signal collided with the signal from another station. If a station can receive its own signal, so can all other stations. When a station is not busy, it listens to the downstream signaling channel and responds to a connect request.
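This retry discipline differs from the Homenet access rule in that a station transmits without first listening. A sketch of such a pure-Aloha loop follows; the backoff constants and callback interface are illustrative, not taken from the article:

```python
import random

def aloha_send(transmit, heard_own_signal, max_attempts=8, seed=0):
    """Pure Aloha on the signaling channel: transmit without carrier
    sense, then listen to the downstream echo; on collision wait a
    random backoff and retry. Returns the number of attempts used,
    or None if every attempt collided."""
    rng = random.Random(seed)
    for attempt in range(max_attempts):
        transmit()
        if heard_own_signal():        # own signal came back intact
            return attempt + 1
        rng.uniform(0, 2 ** attempt)  # backoff delay (not simulated here)
    return None

# Toy medium: the first two attempts collide, the third gets through.
echoes = iter([False, False, True])
print(aloha_send(lambda: None, lambda: next(echoes)))  # 3
```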
CONCLUSION

The principal reasons for using different technologies in different area networks are economic. As MANs have developed, the economics have changed, and continue to change. The initial application of MANs was interconnecting LANs in a city.
When the economics changed, WAN technology was used for this function. At present there is a growing demand for wider bandwidth channels to individual homes in a MAN. The bandwidth is not available. In the short term we should expect the bandwidth to be made available using existing networks, either the CATV infrastructure or ADSL technology over the telephone local loop. As the demand continues to grow, additional capacity will be installed. Until the demand reaches a level where individual fibers to a home are justified, channel sharing techniques, such as FDDI, DQDB, and the MSN, are likely to be used to share the added capacity. Just as the capabilities of WANs have affected MANs, the capabilities of MANs can affect LANs. An interesting question is whether or not the future growth of MANs will lead to the end of LANs. When inexpensive, wide-bandwidth channels are available from every desk to the telephone central office or an ISP, will it still be economical for companies to install and maintain private LANs?

BIBLIOGRAPHY

1. E. W. Zegura, Architectures for ATM switching systems, IEEE Comm. Mag., 31 (2): 28–37, 1993.
2. F. E. Ross, An overview of FDDI: The fiber distributed data interface, IEEE JSAC, 7 (7): 1043–1051, 1989.
3. W. Stallings, Local and Metropolitan Area Networks, Upper Saddle River, NJ: Prentice Hall, 1997.
4. M. J. Johnson, Proof that timing requirements of the FDDI token ring protocol are satisfied, IEEE Trans. Comm., COM-35: 620–625, 1987.
5. H. E. White and N. F. Maxemchuk, An experimental TDM data loop exchange, Proc. ICC '74, June 17–19, 1974, Minneapolis, MN, pp. 7A-1–7A-4.
6. R. M. Newman, Z. L. Budrikis, and J. L. Hullett, The QPSX MAN, IEEE Comm. Mag., 26 (4): 20–28, 1988.
7. R. M. Newman and J. L. Hullett, Distributed queueing: A fast and efficient packet access protocol for QPSX, Proc. 8th Int. Conf. on Comp. Comm., Munich, F.R.G., Sept. 15–19, 1986, North-Holland, pp. 294–299.
8. L. Fratta, F. Borgonovo, and F. A. Tobagi, The Express-Net: A local area communication network integrating voice and data, Proc. Int. Conf. Perf. Data Commun. Syst., Paris, Sept. 1981, pp. 77–88.
9. J. O. Limb and C. Flores, Description of Fasnet—A unidirectional local area communications network, BSTJ, 61 (7): 1413–1440, 1982.
10. M. Zukerman and P. G. Potter, A protocol for eraser node implementation within the DQDB framework, Proc. IEEE GLOBECOM '90, San Diego, CA, Dec. 1990, pp. 1400–1404.
11. M. W. Garrett and S.-Q. Li, A study of slot reuse in dual bus multiple access networks, IEEE JSAC, 9 (2): 248–256, 1991.
12. J. W. Wong, Throughput of DQDB networks under heavy load, EFOC/LAN-89, Amsterdam, The Netherlands, June 14–16, 1989, pp. 146–151.
13. J. Filipiak, Access protection for fairness in a distributed queue dual bus metropolitan area network, ICC '89, Boston, June 1989, pp. 635–639.
14. E. L. Hahne, A. K. Choudhury, and N. F. Maxemchuk, Improving the fairness of distributed-queue dual-bus networks, INFOCOM '90, San Francisco, June 5–7, 1990, pp. 175–184.
15. N. F. Maxemchuk, Regular mesh topologies in local and metropolitan area networks, AT&T Tech. J., 64 (7): 1659–1686, 1985.
16. S. Bassi et al., Multistage shuffle networks with shortest path and deflection routing for high performance ATM switching: The open loop shuffleout, IEEE Trans. Commun., 42: 2881–2889, 1994.
17. A. Krishna and B. Hajek, Performance of shuffle-like switching networks with deflection, Proc. INFOCOM '90, June 1990, pp. 473–480.
18. N. F. Maxemchuk, Problems arising from deflection routing: Live-lock, lockout, congestion and message reassembly, in G. Pujolle (ed.), High Capacity Local and Metropolitan Area Networks, Springer-Verlag, 1991, pp. 209–233.
19. P. Baran, On distributed communications networks, IEEE Trans. Comm. Sys., CS-12: 1–9, 1964.
20. C. S. Raghavendra and M. Gerla, Optimal loop topologies for distributed systems, Proc. Data Commun. Symp., 1981, pp. 218–223.
21. J. T. Brassil, A. K. Choudhury, and N. F. Maxemchuk, The Manhattan Street Network: A high performance, highly reliable metropolitan area network, Computer Networks and ISDN Systems, 1994.
22. N. F. Maxemchuk, Routing in the Manhattan Street Network, IEEE Trans. Commun., COM-35: 503–512, 1987.
23. S. Deering, Multicast routing in internetworks and extended LANs, Proc. ACM SIGCOMM '88, Aug. 1988, Stanford, CA, pp. 55–64.
24. A. I. Karchmer and J. N. Thomas, Computer networking on CATV plants, IEEE Network Mag., pp. 32–40, 1992.
25. N. F. Maxemchuk and A. N. Netravali, Voice and data on a CATV network, IEEE J. Sel. Areas Commun., SAC-3 (2): 300–311, 1985.
26. N. F. Maxemchuk, A variation on CSMA/CD that yields movable TDM slots in integrated voice/data local networks, BSTJ, 61 (7): 1527–1550, 1982.
27. N. Abramson, The Aloha system—Another alternative for computer communications, Fall Joint Computer Conference, AFIPS Conference Proceedings, 37: 281–285, 1970.

N. F. MAXEMCHUK
AT&T Labs–Research
MHDCT. See HADAMARD TRANSFORMS. MHD POWER PLANTS. See MAGNETOHYDRODYNAMIC POWER PLANTS.
MICROBALANCES. See BALANCES. MICROCOMPUTER. See MICROPROCESSORS.
Wiley Encyclopedia of Electrical and Electronics Engineering

Mobile Network Objects

Standard Article

Lubomir F. Bic, Michael B. Dillencourt, Munehiro Fukuda, University of California, Irvine

Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.

DOI: 10.1002/047134608X.W5336

Article Online Posting Date: December 27, 1999
Abstract. The sections in this article are: Basic Infrastructure; Programming Language; Mobility; Agent Interactions; Protection and Security; Utility and Applications.
MOBILE NETWORK OBJECTS
Computer networks are collections of computing nodes interconnected by communication channels. They have experienced explosive growth recently, primarily due to the steadily decreasing cost of hardware, and have become an integral part of daily life for most businesses, government institutions, and individuals. One of the main objectives of interconnecting individual computers into networks is to permit them to exchange information or to share resources. This can take on a number of different forms, including electronic mail messages exchanged among individuals, downloading or uploading of files, access to databases and other information sources, or the use of a variety of services. It also permits the utilization of remote computational resources, such as specialized processors or supercomputers necessary to accomplish a certain task, the utilization of multiple interconnected computers to solve a problem through parallel processing, or simply the utilization of unused processing or storage capacity available on remote network nodes. The recent explosion in the use of portable devices, such as laptop computers or various communication devices used by "nomadic" users, has opened new opportunities but has also created new technological challenges. The main problem is that such devices are connected to the network intermittently and typically for only brief periods of time; they use low-bandwidth, high-latency, and low-reliability connections; and they may be connected to different points of the network each time. The exchange of data among processor nodes in a network—whether connected permanently or temporarily—occurs via communication channels, which are physical or virtual connections established between the nodes.
To use a channel, the communicating parties need to obey a certain communication protocol, which is a set of rules and conventions regarding the format of the transmitted data and its processing at the sending and receiving ends. There is a wide range of different communication protocols to serve different needs, and they are usually structured hierarchically such that each layer can take advantage of the properties provided at the lower level. Despite the great variety of communication protocols, they all embody the same fundamental communication paradigm. Namely, they assume the existence of two concurrent entities (processes or users) and a set of send/receive primitives that permit a piece of data (a bit, a packet, a message) to be sent by one of the active entities and received by the other. The specific protocol used only determines various aspects of the transmission, such as the size and format of the transmitted data, the speed of transmission, or its reliability. This leads to a great variety of send/receive primitives, but the underlying principle remains the same. From the programming point of view, this form of communication is referred to as message passing and is the most common paradigm used in parallel or distributed computing today. Its main limitation is that it views communication as
[Figure 1. Levels of computation migration. Transferring control only corresponds to RPC; transferring control plus code before execution corresponds to remote execution and code import; migrating a partial computation during execution corresponds to object migration (passive) or thread migration (active); migrating a self-contained computation corresponds to process migration (passive) or agent migration (active).]
a low-level activity and thus is difficult to program, analyze, and debug. To alleviate these problems, higher-level programming constructs have been developed. The best-known representative is the concept of a remote procedure call (RPC) (1). As the name suggests, this extends the basic idea of a procedure call to permit the invocation of a procedure residing on a remote computer. At the implementation level, the RPC must be translated into two pairs of send/receive primitives. The first transmits the necessary parameters to the remote site, while the second carries the results back to the caller once the procedure has terminated. Several issues must be handled, including the translation of data formats between different machine architectures and the handling of failures during the RPC. Nevertheless, the RPC mechanism hides the details of the message-passing communication inside the well-known and well-understood procedure-calling abstraction. The popularity of the RPC mechanism is due to its affinity with the popular client-server paradigm, where a distributed application is structured as a continuously operating process, termed a server (or collection of servers), that may be contacted by a client process to utilize the provided service. For example, a name server would return the address of a particular host computer given its name. When a client-server application is implemented using RPCs, the client simply invokes the appropriate (remote) procedure with the name as the argument and waits for it to return the result. RPCs require that the procedure to be invoked be preinstalled (compiled) on the remote host node before it can be utilized by a client. Hence, as indicated in Fig. 1, only control is transferred between the client and the server during the RPC.
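The two pairs of send/receive primitives underlying an RPC can be made explicit with a toy in-process transport. The marshalling format, queue-based "network," and the name-server example (including the address returned) are all invented for illustration:

```python
import json
import queue

request_q, reply_q = queue.Queue(), queue.Queue()

PROCEDURES = {  # procedures preinstalled (compiled) on the "server"
    "lookup_host": lambda name: {"ocean": "128.195.1.1"}.get(name),
}

def serve_one():
    """Server side: receive the parameters, run the preinstalled
    procedure, and send the result back."""
    msg = json.loads(request_q.get())              # receive #1
    result = PROCEDURES[msg["proc"]](*msg["args"])
    reply_q.put(json.dumps(result))                # send #2

def rpc_call(proc, *args):
    """Client stub: the first send/receive pair ships the parameters;
    the second carries the result back once the procedure returns."""
    request_q.put(json.dumps({"proc": proc, "args": list(args)}))  # send #1
    serve_one()               # stand-in for the remote server's loop
    return json.loads(reply_q.get())               # receive #2

print(rpc_call("lookup_host", "ocean"))  # 128.195.1.1
```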
The natural extension to RPCs is remote execution, also referred to as remote programming (2), which permits not only control but also the code that is to be executed remotely to be carried as part of the request. Alternatively, a client could contact a server and download a program to accomplish a certain task (i.e., copy the program from the server to the client's host). In both cases we are addressing the problem of code mobility. Since both of these scenarios require that the code be carried before it starts executing, the code must be made portable between the different machines. This can be accomplished either by carrying the source code and recompiling it on the target host or by providing an interpreter for the given language on the target host. In both cases, translating the original source code into a more compact and easier-to-process intermediate code generally yields much better performance in both transmission and processing. One of the most popular systems today is Java (3), which uses a stream-oriented intermediate representation, referred to as byte code, that lends itself well to interpretation as well as to on-the-fly compilation into native code. The necessary machine independence is achieved by establishing a common standard for the various aspects of code generation, such as byte ordering, data alignment, calling conventions, and data layout. This accounts for Java's popularity as a language to develop highly portable Internet-based applications. The main motivation for moving code between machines is to make it available for execution on demand (by downloading it when needed), to perform load leveling (i.e., to invoke particular subcomputations on remote nodes to take advantage of their computational capacity), or to reduce communication latency by moving the execution to the data or the service that it needs to access. Another dimension of complexity is added when we permit code to migrate after it has started executing. In this case we are moving the state of the ongoing computation in addition to the code itself. This can be subdivided further along two axes, as shown in Fig. 1. The first axis captures the distinction in granularity—that is, whether the entire computation can be moved or whether it is possible to move only some portion of it while other parts remain at their original sites. The second, orthogonal axis divides the space based on who initiates and performs the migration. If this is done by an activity other than the moving one, we call the migration passive. If a computation can effect its own migration, we call it active. There are representatives in each of the four areas resulting from the preceding subdivision. An example of passive migration of parts of an ongoing computation is the movement of objects of an object-oriented language, as pioneered in the Emerald system (4). Emerald provides several primitives by which one object can cause another object to change its location, which then automatically moves all the threads currently executing as part of the moved object to the new host.
Active migration of parts of an ongoing computation can be accomplished by permitting a thread to send and invoke a given procedure on a remote host. Unlike the simple remote execution discussed earlier, this transfer does not imply a return of control to the caller site once the procedure terminates. Rather, the migrating procedure carries with it the state of its invoking thread and thus retains its semantics regardless of its current location. One approach for accomplishing this was pioneered in the Obliq language (5). In this approach, backward network references to the originating site are maintained, and parts of the state are copied as necessary. The ongoing copying is transparent to the user. From the user’s viewpoint, the entire thread has been transferred to the new location, where it continues executing until it again decides to move.
The last column of Fig. 1 represents systems where computations are relocated in their entirety. Under passive migration, this is typically done at the process level, where the operating system captures the complete execution state of a process and relocates it, including its entire address space, to another machine. When processes can actively decide if and where to move—that is, perform self-migration—we refer to them as mobile network objects, or mobile agents. These are the main subject of this article. Self-migration requires that a basic infrastructure be established, consisting of some form of servers running on the physical nodes, which accept the mobile agents, provide them with an execution environment and an interface to the host environment, enforce some level of protection of the agent and the host, and permit them to move on. The remainder of this article explores these issues in more detail. The utility of mobile agents and some of their applications are discussed later.

BASIC INFRASTRUCTURE

To give mobile agents their autonomy in moving through the underlying network and performing the necessary tasks at the nodes they visit, a basic infrastructure needs to be established. Figure 2 shows the generic concept of such an infrastructure. The lowest level consists of a physical network of nodes. This is typically a WAN (wide area network), such as the Internet, which is a large heterogeneous collection of different computers ranging from PCs to supercomputers, interconnected by a variety of different links and subnetworks. For some applications (notably, general-purpose parallel/distributed computing), the physical network could also be a LAN (local area network), consisting of a relatively small number of computers interconnected by an Ethernet or a token ring–based network. The mobile agent infrastructure is established on a subset of the physical nodes.
This may be viewed as a virtual network, where each node is a software environment that enables the agents to operate in the physical node. In the simplest case, a single virtual node is mapped onto a physical
node and there are no specific virtual connections. That is, the virtual network is a strict subset of the physical network and the virtual connections are simply implied by (identical to) the existing physical links. Additional flexibility is attained by permitting more than one virtual node to share a physical node. For example, the node ocean.ics.uci.edu in Fig. 2 shows two virtual nodes mapped to it. Since the physical resources are multiplexed between the virtual nodes, there is no performance benefit, but the resulting logical concurrency provides for more flexibility in the design of applications. Some systems, such as UCI MESSENGERS (6) and WAVE (12), support a separate logical network, implemented on top of the virtual network. Logical links are used by the mobile agents to navigate through the network. Having logical links that represent virtual connections provides for greater flexibility in navigation. Each virtual node consists of several components that enable the mobile agents' operation. The most important is a "processing engine" that gives the mobile agents their autonomy. This can be subdivided further into a communication module, whose task is to receive and send mobile agents, and an execution engine, responsible for an agent's execution while it resides on the current node. Depending on the language used to write the code for mobile agents, this engine can be either a self-contained interpreter or a form of manager that creates a new process or thread for each new incoming agent and then supervises its execution. Each virtual node also typically provides a local communication facility, such as a shared data area, that can be used by the mobile agents currently on that node to communicate with one another (that is, to exchange data or synchronize their operations).
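The generic virtual node of Fig. 2 (a communication module that receives agents, an execution engine that runs them, and a shared data area for local communication) might be sketched as follows; all class and method names are invented for illustration:

```python
class VirtualNode:
    """Sketch of one virtual node: a communication module (receive),
    an execution engine (run_all), and a local shared data area."""

    def __init__(self, name):
        self.name = name
        self.inbox = []      # communication module's receive queue
        self.shared = {}     # local communication facility

    def receive(self, agent):
        """Communication module: accept an incoming mobile agent."""
        self.inbox.append(agent)

    def run_all(self):
        """Execution engine: run each queued agent against this node."""
        while self.inbox:
            agent = self.inbox.pop(0)
            agent(self)

def counting_agent(node):
    # Agents on the same node communicate via the shared data area.
    node.shared["visits"] = node.shared.get("visits", 0) + 1

node = VirtualNode("ocean.ics.uci.edu")
node.receive(counting_agent)
node.receive(counting_agent)
node.run_all()
print(node.shared["visits"])  # 2
```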
In the remaining sections, we will elaborate on the various aspects of the supporting infrastructure, the various capabilities of mobile agents as supported by different approaches, and their main benefits and applications.

[Figure 2. Infrastructure of mobile agents. Virtual nodes, each consisting of a communication module (C), an execution engine (E), and a local communication facility (L), are mapped onto hosts of the physical network, such as wave.ics.uci.edu, cress.ics.uci.edu, and ocean.ics.uci.edu.]

PROGRAMMING LANGUAGE

There is a wide range of programming languages used to write mobile agent programs (i.e., to describe an agent's behavior). We can loosely classify them along two orthogonal axes: general purpose versus special purpose and conventional versus object oriented. Within each class we can further distinguish interpreted versus compiled languages, and combinations thereof. The most popular general-purpose programming languages used for mobile agents are C and Java. C is an imperative language (that is, based on sequential flow of assignment, control, and function invocation statements) and is one of the most widely used programming languages today. Java is object oriented, which implies a hierarchical structure of objects derived from common classes and interacting with one another by invoking procedures defined as part of each object. One of the main strengths of Java code is that it is based on a structured bytecode and is thus highly portable between heterogeneous computers. To use a general-purpose programming language like C or Java for mobile agents, it must be extended to handle the specific requirements of self-migrating code. The most important extension is to support mobility (that is, some set of commands that an agent can use to cause its migration to another computer). Another important area of concern is to provide protection mechanisms to permit the safe operation of mobile agent applications. The main advantage of using existing general-purpose languages is that programmers do not need to learn yet another language but only extend their knowledge to integrate aspects of mobility. Hence it is easier to make a transition into the new paradigm of mobile agents. New languages have also been developed specifically for the purpose of writing mobile agent code. One such language is Telescript (7), pioneered by General Magic, Inc. This is a high-level object-oriented language designed for mobile agents for the rapidly expanding electronic marketplace on the Internet. A number of other languages, both object oriented and conventional, have also been developed.
Another approach has been to adapt existing special-purpose languages. One example is Agent Tcl (8), which is built on top of an extended version of Tcl, a scripting language originally intended for the composition of high-level program scripts to coordinate lower-level computations. Another example is the system developed by researchers at the Johann Wolfgang Goethe University (9), which is built on top of a customized Hypertext Transfer Protocol (HTTP) server. One of the main distinguishing features of all of the preceding languages is whether they are compiled and executed as native code of the host computer or interpreted. This represents a tradeoff between performance, which is degraded due to interpretation, and security, which is improved due to the interpreter's tight control over the agent's behavior. There are three general options. First, the agent can carry code that is fully interpreted by the execution engine of the host. This is the safest but also the slowest approach and is typically used with scripting languages. Second, the agent's code could be an intermediate machine-independent program representation, like the Java bytecode, which can be interpreted more efficiently than source code or can be compiled on the fly into directly executable native code. Finally, the agent could carry native code precompiled for the target host. This is the fastest but also the least secure and least flexible approach, since the agent would have to carry different code versions for every machine architecture it may visit. A compromise between the preceding approaches has been adopted by UCI MESSENGERS (6). The agent's code is written in a subset of C, which is translated into a more efficient yet machine-independent form (similar to bytecode). This is carried by the agent and is interpreted by the execution engine of each host. In addition, the agent can dynamically load and invoke arbitrary C functions, resident on a given host and compiled into the host's native code. Hence it can alternate between interpreted and compiled code at the programmer's discretion. Another consideration is whether an agent is executed as a separate process or as a thread within the same address space of the execution engine. This decision represents a tradeoff between security and performance. Starting a new thread is more efficient than starting a new process, but allowing multiple agents to run in the same address space represents a potential security risk.

MOBILITY

The ability to spread or move computation among different nodes at runtime is perhaps the most important characteristic of mobile agents. We can distinguish three aspects of mobility: addressing, the mechanisms to effect the movement of an activity, and high-level support mechanisms.

Addressing

For an activity to move, a destination must first be specified. This destination, also referred to as a place, a location, or a logical node by different systems, is some form of execution environment capable of supporting the mobile agent's functionality. Depending on how the logical nodes are mapped onto the physical network, different forms of addressing are possible. In the simplest form, the networkwide unique names or addresses of the physical nodes are used to specify a destination. This implies that only a single copy of a logical node can be mapped onto any one physical node. To achieve location transparency, logical names are used, whose mapping to the physical nodes may be changed as necessary (for example, to reflect changes in the physical network topology).
This frees the application from having to know anything about the physical network and thus also facilitates its portability. This also permits more than one logical node to be mapped onto a physical node, thus providing better structuring capabilities and facilitating load balancing. As already discussed, another degree of flexibility is achieved by permitting not only logical nodes but also logical links. These are mapped onto paths of zero or more physical links, thus providing virtual connections that can be used for navigation by agents. Addressing can further be subdivided into explicit and implicit. Explicit addressing implies that an agent specifies the exact destination node where it wishes to travel or where a new agent should be spawned. Some systems support itinerary-based addressing, where an agent carries a list of destinations. Each time it issues a migration command, it moves to the next destination on its list. Implicit addressing means that an agent specifies the set of destinations indirectly using an expression that selects zero or more target nodes. The agent is then replicated and a copy sent to all the nodes that meet the selection criteria. The UCI MESSENGERS system defines an elaborate navigational calculus where, given a logical node and an expression involving various combinations of link and node names (including "wild cards"), a set of target nodes
MOBILE NETWORK OBJECTS
relative to the current node is specified. The agent issuing this statement is then replicated and a separate copy sent to each of the selected nodes. For example, an agent could decide to replicate itself along all outgoing links with a specific name and/or orientation, or along links connected to specific neighboring nodes.

Mechanisms for Mobility

Mobility can be achieved in one of two ways. The first is remote execution, which permits a new activity to be spawned on a remote node. The second is migration, which permits an agent to move itself to another node. The boundary between these two approaches, however, is not crisp, since an agent could spawn a copy of itself on a remote node and then terminate on the current node, thus effectively migrating itself. The main question is the level of support provided by the system to extract the current state at the source node and restore it at the destination node. This may range from no support to fully transparent migration. In the case of remote execution, the commands that achieve mobility typically take a passive form, such as "spawn" or "dispatch," implying that it is not the currently executing agent itself that moves; rather, it causes the creation of another agent on a remote node. To achieve active mobility using this approach, it is not sufficient simply to spawn a copy of the current agent, since the new agent would start executing from the beginning. Rather, the sending agent must extract its current state, transmit it along with the code, and cause the new instance to continue executing from the code that follows the migration statement. If the agent code is compiled, its state consists of the activation stack, the CPU (central processing unit) registers, any dynamically allocated heap memory, and open I/O (input/output) connections (file and communication descriptors). If the code is interpreted, the agent's state is typically maintained by the interpreter.
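The itinerary-based addressing and migration commands described above can be combined in a small sketch. This is a toy model, not taken from any of the systems discussed; the names `ItineraryAgent` and `hop` are illustrative.

```python
class ItineraryAgent:
    """Toy agent that carries a list of destinations (its itinerary).

    Each call to hop() stands in for one migration command: the agent
    "moves" to the next destination on its list, as in itinerary-based
    addressing.
    """

    def __init__(self, itinerary):
        self.itinerary = list(itinerary)  # destinations still to visit
        self.location = "origin"
        self.visited = []

    def hop(self):
        """Migrate to the next destination; return False when the list is exhausted."""
        if not self.itinerary:
            return False
        self.location = self.itinerary.pop(0)
        self.visited.append(self.location)
        return True

agent = ItineraryAgent(["node-a", "node-b", "node-c"])
while agent.hop():
    pass  # at each stop the agent would execute its task here
```

At each stop a real system would run the agent's code before the next `hop`; here the loop body is left empty to keep the migration pattern visible.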
In either case, the system must provide support for extracting as much of the agent's state as possible in order to permit its migration. Unfortunately, some parts of the state, notably the I/O connections, may be machine dependent and thus cannot be moved. Hence a completely transparent migration may not always be possible. In the case of self-migration, the commands typically take an active form, such as "go" or "hop," indicating that it is the agent issuing these commands that is being moved. The problems of state capture are similar to those described previously: the system must provide support for extracting the agent's current state and reinstating it at the new destination. This, as well as the creation of the new instance and the destruction of the original one, is usually done automatically by the system as part of the migration operation, which may then be viewed as a high-level construct that transparently achieves self-migration of an agent. Given the difficulty of extracting and restoring an agent's state at an arbitrary point in its execution, some systems limit migration to the top level of execution (i.e., the equivalent of the main program). This is the case with the UCI MESSENGERS system, which also prohibits the use of pointers at that level. This eliminates the need to extract/restore the activation stack as well as any data in heap storage; hence only the agent's local variables and its program counter need to be sent along with the code during migration, making the operation very efficient.
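The top-level-only restriction just described can be sketched as follows. This is a hypothetical illustration (the function names and the serialization format are not from any of the systems discussed): the migratable state is reduced to the agent's local variables and its program counter, and execution resumes at the statement after the migration command.

```python
def capture_state(variables, program_counter):
    """Capture the minimal migratable state of a top-level agent:
    its local variables and its program counter (no activation stack,
    no heap data, no pointers -- mirroring the restriction in the text)."""
    # Refuse anything that is not plainly serializable by value.
    for name, value in variables.items():
        if not isinstance(value, (int, float, str, bool, list, dict)):
            raise TypeError(f"cannot migrate non-value state: {name}")
    return {"vars": dict(variables), "pc": program_counter}

def restore_state(snapshot):
    """Reinstate the agent at the destination: same variables, with
    execution resuming at the statement after the migration command."""
    return dict(snapshot["vars"]), snapshot["pc"] + 1

# The snapshot travels with the agent's code to the destination node.
snap = capture_state({"total": 42, "name": "probe"}, program_counter=7)
vars_, pc = restore_state(snap)
```

Because only a flat variable dictionary and an integer are transmitted, the migration message stays small, which is why the text calls this operation very efficient.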
High-Level Support

The third aspect of navigation concerns the high-level tools and mechanisms that make the migration of agents more powerful or more user friendly. These fall into two categories: the first deals with finding agents or services on the net, the second with finding the best ways to reach the corresponding remote sites. There is no general conceptual framework for either problem, and hence we only mention a few approaches that have been used by various systems. The Agent Tcl project (8) addresses both areas. To locate services, it provides a hierarchy of specialized navigation agents, which are stationary and which maintain a database of service locations. Services are registered with these navigation agents. A mobile agent looking for a service may query a navigation agent, which suggests a list of possible services based on a keyword search, and possibly other navigation agents, which may be more specialized in maintaining services on the requested topic. Later, mobile agents may provide feedback about which services were useful, thus improving the navigation agent's ability to provide information in the future. The CUI MESSENGERS system (10) also provides extensive support for publicizing and discovering services on the net. One of the main issues is to ensure protection of the service provider. Unlike Agent Tcl, which uses active navigation agents, CUI MESSENGERS uses specialized dictionaries in each logical node, which can be consulted by potential clients. Each service is publicized with its operational interface, which, using a specialized interface-description language, specifies the conventions necessary to interact with the service. The second category of high-level support generally includes network sensing and monitoring tools.
The complexity and sophistication of these tools range from very simple (for example, determining whether a particular computer is "alive" and connected to the current node) to continuous network monitoring services that provide estimates of the latency and bandwidth of various connections in the network.

AGENT INTERACTIONS

Agents need to interact with one another at runtime, either to exchange information (i.e., communicate) or to synchronize their actions. Different systems support several forms of interagent communication. The simplest is based on shared data. That is, a logical node contains some agreed-upon variables or data structures, which may be accessed by agents currently executing on that node. The access can be either by location (i.e., reading from or writing to a specific location specified by name or address) or associative, by content (i.e., specifying a part of the data item to be accessed and letting the system find all data items that match the given value). In the case of object-oriented systems, another form of interagent communication is possible. Each such agent consists of one or more objects, where objects encapsulate both data and the functions (called methods) that may operate on the data. Two agents operating on the same node may establish a connection that permits them to invoke each other's methods, thus passing information to each other or otherwise manipulating each other's internal state. This is analogous to performing remote procedure calls in conventional client-server applications and thus can be extended to communication with stationary agents or other services on the same or even remote nodes. Since communication via shared variables or method invocation requires both agents to be on the same node, some systems permit agents to establish connections across different nodes and to communicate with each other by messages. This includes connections that may be established between an agent and its owner (user). The send/receive primitives supported by the system may be either synchronous or asynchronous, depending on the system's intended application domain. To find a particular mobile agent on the net, a "paging" service may also be provided, which returns the location of the sought-after agent. In Agent Tcl, for example, this service relies on each mobile agent registering its position with its "home" machine after each jump, which permits the user or source agent to find its current location. Synchronization may be required for agents operating on different nodes or on the same node. Synchronizing activities on different nodes is a general problem in distributed coordination for which various solutions exist, including distributed semaphores, a central server/manager, distributed voting, and token-based schemes. For this reason, few mechanisms specific to mobile agents have been proposed. Similarly, synchronization of mobile agents on the same node is generally achieved by adapting classical methods. The solutions incorporated in different systems vary greatly in their sophistication. The simplest way to achieve synchronization is by busy waiting (also called spin lock), where an agent continuously reads a given variable until it has been set to the desired value by another agent. The main disadvantage of this scheme is the wasted CPU time, which can be eliminated by implementing a more sophisticated form of locks (or semaphores), where the waiting agent is blocked (sleeping) while the desired condition is false.
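The contrast between busy waiting and blocking can be illustrated with a standard condition variable; this is a generic sketch using Python's threading primitives, not the mechanism of any particular mobile-agent system.

```python
import threading

# Busy waiting (spin lock): the waiting agent burns CPU rechecking a flag.
def spin_wait(flag):
    while not flag["set"]:
        pass  # wasted CPU time, as noted in the text
    return "proceeded"

# Blocking wait: the agent sleeps until another agent signals the condition.
cond = threading.Condition()
state = {"ready": False, "result": None}

def blocked_agent():
    with cond:
        while not state["ready"]:
            cond.wait()          # sleeps; consumes no CPU while the condition is false
        state["result"] = "proceeded"

def signalling_agent():
    with cond:
        state["ready"] = True
        cond.notify_all()        # wakes the blocked agent

t = threading.Thread(target=blocked_agent)
t.start()
signalling_agent()
t.join()
```

The `spin_wait` function is shown only for contrast and is never called here; in practice it would monopolize a CPU until another agent set the flag.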
The CUI MESSENGERS system (10) provides a novel synchronization mechanism based on the notion of synchronization queues. Agents can create/destroy queues as needed and can use specialized primitives to enter/exit a particular queue. The basic principle is that only those agents that are at the head of a queue (or on no queue) are running. An agent that is currently running may also block/unblock a given queue, thus preventing all agents on that queue from proceeding. Another synchronization mechanism is based on events. These are arbitrary user-defined conditions that can be set and tested at runtime. An agent may indicate that it is interested in certain types of events, in which case it will be notified whenever an event of the desired type occurs. The notification is in the form of an ‘‘interrupt,’’ which executes a specific function (an event handler) provided by the agent.
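The synchronization-queue principle described above can be sketched as follows. This is a toy, single-threaded model of the idea (only head-of-queue agents run; a running agent may block a whole queue); the class and method names are illustrative, not the CUI MESSENGERS primitives.

```python
class SyncQueue:
    """Toy synchronization queue: only the agent at the head of the
    queue is runnable, and a running agent may block or unblock the
    entire queue, stalling every agent on it."""

    def __init__(self):
        self.agents = []
        self.blocked = False

    def enter(self, agent):
        self.agents.append(agent)

    def exit(self, agent):
        self.agents.remove(agent)

    def runnable(self):
        """Return the one agent currently allowed to run, if any."""
        if self.blocked or not self.agents:
            return None
        return self.agents[0]

q = SyncQueue()
q.enter("a1")
q.enter("a2")
head = q.runnable()        # only a1 may run; a2 waits behind it
q.blocked = True           # some running agent blocks the whole queue
stalled = q.runnable()     # nobody on this queue may proceed
q.blocked = False
q.exit("a1")               # a1 leaves; a2 moves to the head
```

Agents on no queue at all would simply bypass this structure and remain runnable, matching the "or on no queue" clause in the text.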
PROTECTION AND SECURITY

The autonomous mobility of agents creates the potential for security violations that would otherwise not be possible. With traditional approaches, the outside world can interact with a computer only through well-defined interfaces and through the fixed set of programs installed on the computer. These restrictions provide a barrier that allows the computer to protect itself from external attack. Mobile agents eliminate this barrier, since the code that the computer runs is provided by the
very external agents from whom the computer should be protected. Without proper safeguards, the computer may accept unsafe code and permit it to run. In so doing, the computer opens itself and its other users to abuse or misuse of its resources. For example, an entering agent may consume excessive amounts of memory or CPU time; access memory, disk files, or services for which it has no authorization; leak sensitive information to the outside world; or destroy information or services. Mobile agents also open up the possibility of attacks on the agents themselves. For example, an agent might attempt to steal sensitive information that another agent is carrying. A host in the system might try to modify an agent by changing its data (e.g., the maximum price it is prepared to pay for a service offered by the host) or its instructions (e.g., by altering the agent so that it works on behalf of this host rather than on behalf of its original owner). Thus threats to agents can come either from other agents or from hosts on the network. Hence there are three kinds of protection to address: protecting the system from an agent, protecting an agent from another agent, and protecting an agent from a host.

Protecting the System from an Agent

Nodes in a system can be protected from agents by using a combination of authentication, restriction of access to potentially dangerous operations, and resource limits. An agent typically carries with it certain identifying information, such as its owner and its origin. Authentication mechanisms check that this information is correct; this can be done using public-key encryption protocols. An agent can then be permitted to perform or forbidden from performing certain operations, depending on its status.
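The authenticate-then-restrict pattern just described can be sketched as follows. This is a hypothetical illustration, not the mechanism of Agent Tcl or Telescript: the status string, the set of "dangerous" operations, and the CPU budget are all invented for the example.

```python
class AgentSandbox:
    """Toy sketch of the safeguards described above: an agent's
    authenticated status decides which operations it may perform,
    and a resource budget caps what it can consume on this host."""

    DANGEROUS = {"open_file", "open_socket", "exec"}  # illustrative set

    def __init__(self, status, cpu_budget):
        self.status = status          # "trusted" or "untrusted"
        self.cpu_budget = cpu_budget  # resource limit on this host

    def may_perform(self, operation):
        """Trusted agents run freely; untrusted ones lose dangerous ops."""
        if self.status == "trusted":
            return True
        return operation not in self.DANGEROUS

    def charge(self, cpu_units):
        """Account for resource use; refuse once the limit is reached."""
        if cpu_units > self.cpu_budget:
            return False
        self.cpu_budget -= cpu_units
        return True

unt = AgentSandbox("untrusted", cpu_budget=100)
allowed = unt.may_perform("read_clock")   # harmless, permitted
blocked = unt.may_perform("open_file")    # dangerous, forbidden
ok = unt.charge(80)                       # within budget
refused = unt.charge(30)                  # would exceed the remaining budget
```

Real systems check the parameters of each dangerous operation rather than forbidding it wholesale, but the gating structure is the same.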
In Agent Tcl (8), an agent is assigned a status of "trusted" or "untrusted." An untrusted agent is run with an interpreter that limits its ability to perform potentially dangerous operations, either by forbidding such operations entirely or by carefully checking the parameters of each such operation before allowing it to proceed. In Telescript, an agent carries with it "permits," each of which allows it to perform certain operations that are otherwise forbidden (11). Resource limits prevent a single agent from consuming excessive amounts of resources on a single host. Agents could abuse the system in more subtle ways. For example, a malicious agent could simply hop to a new host selected at random, make two copies of itself, and stop. Such an agent could ultimately paralyze the entire network. Agent Tcl and Telescript propose addressing this problem by introducing an analog of a cash economy. Each agent carries with it a certain amount of "currency," which it must spend in order to use resources. Every time an agent creates a new agent, it must give some of its currency to the child agent; otherwise, the child agent will not be able to use any resources. This mechanism limits the total amount of network resources that can be consumed by an agent and its descendants.

Protecting an Agent from Other Agents

Once the system has been protected from the agents running on it, the problem of protecting an agent from other agents is quite similar to the classical security problem of protecting a program from other programs on multiuser machines. One approach is to have each agent run in a separate address
space, so that an agent cannot be affected by another agent unless it chooses to communicate with it.

Protecting an Agent from a Node

This is the most difficult of the three types of protection. It is virtually impossible for an agent to prevent itself from being tampered with by a malicious or faulty host. Nevertheless, it is usually possible to detect whether specific sensitive areas of an agent arriving at a node have been tampered with at the previous node. This is being implemented in Agent Tcl using digital signatures. Once an agent migrates to a new host, it cannot prevent the host from examining its contents and possibly stealing sensitive information that it contains. The damage from such a theft can be limited if an agent makes sure that the sensitive information is stored in a form that is not useful without cooperation from a trusted network node (e.g., by keeping it encrypted). It is possible for an agent to build an audit trail that includes a list of the nodes it has visited. This does not prevent theft, but it can be used after the fact to help identify nodes that might be stealing data. It is also possible for an agent to be sent out with a list of trusted nodes and with the restriction that it visit only trusted nodes. However, this approach represents a significant restriction, since one of the most attractive features of the mobile agent paradigm is the notion of autonomous agents that can freely roam the network.

UTILITY AND APPLICATIONS

Mobile agents have several advantages over distributed computing using conventional message-passing approaches. These advantages can be roughly divided into two groups: software engineering advantages and performance advantages. The ability to move computations at runtime between different nodes makes applications functionally open ended and thus arbitrarily extensible. Notably, a server is not linked to a fixed set of predefined functions.
Rather, each incoming request can carry with it the necessary code for its processing, thus making the server's capabilities virtually unlimited. The same principle applies to communication: mobile agents provide a mechanism for dynamic protocols. Without mobile agents, a computer supports a finite set of protocols for moving data to and from other computers, each of which can manipulate the data only in a fixed number of ways, determined by the computer's current software capabilities. If the computer lacks a particular application needed to access, view, or process some received data properly, the needed application must be installed manually to extend the machine's capabilities. Mobile agents permit new protocols to be installed automatically, and only as needed for a particular interaction. Another software engineering advantage of mobile agents is ease of programming for certain kinds of applications. Conventional distributed programming requires viewing the application as a global collection of concurrent activities interacting with each other via message passing. Each program must anticipate in advance all the possible messages it can receive from other programs and be ready to respond to them. Programming with mobile agents is more like driving a car through the network: the programmer's task is to guide the
agent on its journey through the network, describing the computation to be performed at stops along the way. One class of applications that is particularly well suited to implementation using mobile agents is individual-based simulation, in which agents representing individual entities coordinate their activities to model complex collective behavior in a spatial domain. Examples of such applications include interactive battle simulations, particle-level physics simulations, traffic modeling, and ecological studies. All of the above software engineering advantages stem from the fact that the mobile agent paradigm better fits certain types of distributed applications, which reduces the amount of programming necessary. In terms of performance, the ability of mobile agents to move through the network offers a considerable potential reduction in communication cost. Suppose, for example, that a program wishes to process a large amount of data at a remote site. One approach would be for the remote site to send all the data to the local site. This is likely to incur considerably more communication overhead than dispatching an agent to the remote site that processes the data and then returns. If the connection is slow or unreliable, there is a further advantage to the mobile agent approach. If the remote site is sending a large stream of data to the local site and the connection is lost, then the stream may have to be resent in its entirety or a restart protocol may have to be run. With mobile agents, a steady connection is not necessary: once the agent has arrived at the remote site, there is no reason for the remote site and the local site to have any mutual contact until the agent is ready to return to the local site. A number of applications using mobile agents have been proposed or actually developed. A few are briefly described next. For more details, see the reading list at the end of this article.
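The communication-cost argument above can be made concrete with a back-of-the-envelope comparison. The sizes below are purely illustrative (a 50 MB remote data set, a 100 kB agent, 10 kB of state, a 1 kB result); the point is the ratio, not the particular numbers.

```python
def bytes_moved_data_shipping(data_bytes):
    """Conventional approach: all the remote data crosses the network."""
    return data_bytes

def bytes_moved_agent(code_bytes, state_bytes, result_bytes):
    """Mobile-agent approach: only the agent travels out (code + state)
    and back (state + result); the bulk data never leaves the remote site."""
    return (code_bytes + state_bytes) + (state_bytes + result_bytes)

data = 50_000_000                                        # 50 MB of remote data
agent_total = bytes_moved_agent(100_000, 10_000, 1_000)  # 121 kB in total
savings = bytes_moved_data_shipping(data) / agent_total  # roughly 400x fewer bytes
```

The same calculation also suggests when agents lose: if the result is nearly as large as the raw data, or the agent's code dwarfs the data, shipping the data is cheaper.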
Information Retrieval

One obvious application of mobile agents is accessing and retrieving data at remote sites on a network. If the volume of information is large, it is clearly more efficient to dispatch an agent to the remote site and have it filter the data than to ship all the data over the network and then process it. Servers can support search without providing any specific software capabilities other than permitting mobile agents to enter and execute at their site. The agents bring with them all the code and "intelligence" needed to carry out the searches, supplied by the user originating the request. The data at the remote site may contain references to other useful data at other remote sites, in which case the agent may move, or send copies of itself, to these other sites and access the data there as well.

Electronic Commerce

As commerce on the Internet becomes a reality, the potential uses for mobile agents are almost unlimited. Many of the references at the end of this article address possible uses of mobile agents in the electronic marketplace. Mobile agents can search the Internet to find the best price on a particular item, make reservations or purchases on behalf of their owner (e.g., airplane tickets, hotel reservations), or search repeatedly to see whether a currently unavailable item (e.g., a ticket to a sold-out concert) becomes available.
More complex mobile agents could perform more difficult tasks, such as negotiating deals or closing out business transactions on behalf of their owners. One important related problem is the implementation and use of electronic cash.

Intelligent Agents and Personal Assistants

The term intelligent agent is used in two different contexts. One use refers to artificial intelligence (AI) systems in which the intelligence stems from the behavior and interaction of individual entities or agents within the system. Generally these agents do not migrate and hence do not fall within the scope of this article. The term is also used to describe agents that act as personal assistants to the user. Some of these are mobile and some are not. Examples of the latter include interfaces for e-mail and news filtering systems. An example of intelligent agents that are also mobile agents is software for scheduling meetings (interacting with users and/or their calendars at distributed locations).

Mobile Computing

This application was alluded to at the beginning of this article. The user of a portable computer can submit a mobile agent that contains a program to be run and then sign off. When the agent is finished computing, it waits and jumps back to the user's computer after the user signs back on and requests it to do so.

Network Management

Mobile agents can be used to perform various administrative and maintenance functions in networks. For example, agents can be dispatched to monitor links and nodes, diagnose faults, identify areas of congestion, and so on. As another example, one of the stated goals of the CUI MESSENGERS project is developing a distributed operating system based on mobile agents.

General-Purpose Computing

Mobile agents can be used as the basis for general-purpose distributed computing (6,12).
If the communication overhead is reasonably low compared with the amount of computation required, distributed solutions using mobile agents are competitive in performance with distributed solutions using traditional message-passing approaches. Many algorithms are more naturally implemented using the metaphor of navigation through a network than using message passing, so the mobile agent approach often yields a smaller semantic gap between the abstract specification of the algorithm and the actual implementation. Mobile agents also provide a useful way of coordinating the behavior of functions and data in a distributed application such as a distributed simulation. The use of mobile agents as a coordination paradigm is particularly well suited to systems that permit calls into native-mode code. The coordination functions are performed by services provided by the interpreter, while the actual computation can be done in native mode, so the computational cost due to interpretive overhead is minimized.

BIBLIOGRAPHY

1. A. D. Birrell and B. J. Nelson, Implementing remote procedure calls, ACM Trans. Comput. Syst., 2: 39–59, 1984.
2. J. W. Stamos and D. K. Gifford, Remote evaluation, ACM TOPLAS, 12 (4): 537–565, 1990.
3. J. Gosling and H. McGilton, The Java Language Environment, Sun Microsystems, Inc., Mountain View, CA, 1995. http://java.sun.com
4. E. Jul et al., Fine-grained mobility in the Emerald system, ACM Trans. Comput. Syst., 6 (1): 109–133, 1988.
5. L. Cardelli, Obliq: A language with distributed scope, Comput. Syst., 8 (1): 27–59, 1995.
6. L. F. Bic, M. Fukuda, and M. Dillencourt, Distributed computing using autonomous objects, IEEE Comput., 29 (8): 55–61, 1996.
7. The Telescript reference manual, Technical report, General Magic, Inc., Mountain View, CA, June 1996. http://www.genmagic.com
8. R. S. Gray, Agent Tcl: A flexible and secure mobile-agent system, in Proc. 4th Annu. Tcl/Tk Workshop (TCL 96), Monterey, CA, July 1996. http://www.cs.dartmouth.edu/~agent/papers/index.html
9. A. Lingnau, O. Drobnik, and P. Dömel, An HTTP-based infrastructure for mobile agents, in 4th Int. World Wide Web Conf. Proc., pp. 461–471, Sebastopol, CA, December 1995, O'Reilly and Associates.
10. C. F. Tschudin, On the Structuring of Computer Communications, Ph.D. thesis, University of Geneva, Centre Universitaire d'Informatique, Geneva, Switzerland, 1993. http://cuiwww.unige.ch/tios/msgr/home.html
11. J. White, Mobile agents white paper, Technical report, General Magic, Inc., Mountain View, CA, 1996. http://www.genmagic.com
12. P. S. Sapaty and P. M. Borst, An overview of the WAVE language and system for distributed processing of open networks, Technical report, University of Surrey, UK, 1994.
Reading List

J. Baumann, Mobile agents: A triptychon of problems, in 1st ECOOP Workshop Mobile Object Systems, 1995. http://www.informatik.uni-stuttgart.de/ipvr/vs/projekte/mole/agents.html
D. Johansen, R. van Renesse, and F. B. Schneider, An introduction to the TACOMA distributed system version 1.0, Technical Report 05-23, Department of Computer Science, University of Tromsø, June 1995. http://www.cs.uit.no/DOS/Tacoma/index.html
D. B. Lange and M. Oshima, Programming mobile agents in Java with the Java Aglet API. http://www.trl.ibm.co.jp/aglets/
T. Magedanz and T. Eckardt, Mobile software agents: A new paradigm for telecommunications management, in IEEE/IFIP Network Operations and Management Symposium (NOMS), Kyoto, Japan, April 1996. http://www.fokus.gmd.de/oks/research/magna_g.html#dokumente
H. Peine, An introduction to mobile agent programming and the Ara system, ZRI Technical Report 1/97, Dept. of Computer Science, University of Kaiserslautern, January 1998. http://www.uni-kl.de/AG-Nehmer/Ara/ara.html
C. F. Tschudin, On the Structuring of Computer Communications, Ph.D. thesis, University of Geneva, Centre Universitaire d'Informatique, Geneva, Switzerland, 1993. http://cuiwww.unige.ch/tios/msgr/home.html
D. Wong et al. (Mitsubishi Horizon Systems Lab, USA), Concordia: An infrastructure for collaborating mobile agents, in Proc. 1st Int. Workshop Mobile Agents, Berlin, Germany, April 7–8, 1997. http://www.meitca.com/HSL/Projects/Concordia/MobileAgentConf_for_web.html
M. Condict et al., Towards a world-wide civilization of objects, in Proc. 7th ACM SIGOPS Eur. Workshop, Connemara, Ireland, September 1996. http://www.opengroup.org/RI/java/moa/WebOS.ps
General Magic, Inc., Odyssey, 1997. http://www.genmagic.com/agents/odyssey.html
LUBOMIR F. BIC
MICHAEL B. DILLENCOURT
MUNEHIRO FUKUDA
University of California
Wiley Encyclopedia of Electrical and Electronics Engineering
Multicast
Standard Article
George C. Polyzos and K. Katsaros, Athens University of Economics and Business, Athens, Greece
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5302
Article Online Posting Date: December 27, 1999
Abstract. The sections in this article are: Introduction; Multipoint Routing and Multicast Tree Algorithms; Feedback Control and Reliable Multicast; Multicasting Real-Time Continuous Media; Multicasting on the Internet; Multicast in ATM and Other Technologies; Application-Level Multicast.
MULTICAST
INTRODUCTION

One of the ways communication can be characterized is by the number of parties involved. Traditional communication modes have been unicast, i.e., one-to-one, and broadcast, i.e., one-to-all. Between these two extremes we find multicast, the transmission of a single message or data stream to a set of receivers. Thus, multicast is a generalization of both unicast and broadcast and a unifying communication mode. For this reason it is receiving increasing attention in modern networking architectures. The above definition of multicast is still traditional in the sense that it is sender-centric and unidirectional. An even more general term would be multipoint communication. Note also that multicast is considered mostly in the context of digital and, particularly, packet-switching networks. Multicast is examined on its own because the specification of receivers through a set introduces features and complications that are not present in traditional communication modes. We can distinguish two cases of increasing complexity with respect to the set of receivers: (1) it is fixed and known (e.g., to the sender), or (2) it is unknown and/or dynamic. The multicast model of communication supports applications where data and control are partitioned over multiple actors, such as updating replicated databases, contacting any one of a group of distributed servers whose composition is unknown (more appropriately termed anycast), and interprocess communication among cooperating processes, e.g., distributing intermediate computational results from one processor to others in parallel computers. A demanding application of multicast is Distributed Interactive Simulation (DIS).
Targeted news and information distribution in near real time has potential global impact and would normally be less demanding, even though specific applications, such as stock-quote distribution, might have very stringent requirements (e.g., needing atomic multicast, the semantics of which imply that a message should be received by all receivers in the group or by none at all). However, the prototypical multipoint communication application is probably real-time interactive multimedia teleconferencing, possibly including shared workspaces, falling under the umbrella of Computer Supported Collaborative Work (CSCW). Efficient multicast is a fundamental issue for the success of group applications. Here the selective multicast service takes the place of indiscriminate broadcasting so as to reduce the waste of resources caused by transmitting all the information or channels to all receivers. The basic means of conserving resources via multicast is sharing: instead of transmitting information from a sender to each receiver separately, we can arrange for routes that share links to carry the information only once over the shared links. We can picture a multicast route as a tree rooted at the sender with a receiver at each leaf and possibly some receivers on internal nodes. The tree can be designed so
as to maximize shared links and thus minimize resource consumption. While multicast is increasingly recognized as a valuable service for packet networks, many architectures still do not support it directly. Many shared-medium networks such as Ethernet support options for broadcast and multicast packets and the corresponding addressing mechanisms. However, processors are often required to perform extra processing when receiving a multicast packet. When switches are used in point-to-point networks, we would like the hardware to automatically recognize multicast addresses as such and transmit multicast packets through multiple links copying packets on the fly, as required. ATM switch designs increasingly support parallel transmission of multicast cells over multiple links in hardware, increasing peak switching speeds. Another set of issues is concerned with extending the feedback mechanisms employed by unicast-oriented protocols to deal with flow, congestion and error control. For example, transport-layer protocols such as TCP adapt their behavior according to the prevailing network conditions at any given point in time by measuring loss rates as experienced by receivers. When extending these protocols for multicast, there is the possibility of feedback implosion when many receivers send such reports towards the sender, thus swamping the network and the source with control information. Apart from the obvious scalability problems of such schemes, there is also the issue of how to adapt the sender’s behavior when conflicting reports arrive from the various receivers. 
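The link-sharing benefit of a multicast tree described above can be quantified on a toy topology. The topology below is hypothetical: a sender S reaches three receivers through a single router R, so the S-R link is shared by all three paths.

```python
# Toy topology: sender S reaches receivers r1..r3 via router R.
# Each receiver's unicast path is listed as the links it traverses.
unicast_paths = {
    "r1": ["S-R", "R-r1"],
    "r2": ["S-R", "R-r2"],
    "r3": ["S-R", "R-r3"],
}

def unicast_link_transmissions(paths):
    """Separate unicasts: the shared S-R link carries one copy per receiver,
    so every link of every path is counted."""
    return sum(len(p) for p in paths.values())

def multicast_link_transmissions(paths):
    """Shared tree: each link in the union of the paths carries one copy,
    and the packet is duplicated at the branching router."""
    tree_links = set()
    for p in paths.values():
        tree_links.update(p)
    return len(tree_links)

u = unicast_link_transmissions(unicast_paths)    # 6 link transmissions
m = multicast_link_transmissions(unicast_paths)  # 4 link transmissions
```

With more receivers behind R the gap widens: unicast cost grows by two links per receiver, while the shared tree grows by only one.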
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 2007 John Wiley & Sons, Inc.

Multicast

Multicast Groups and their Dynamics

The difference between multicasting and separately unicasting data to several destinations is best captured by the Internet host group model: a host group is a set of network entities sharing a common identifying multicast address, all receiving (traditionally, via best-effort service) any data packets addressed to this multicast address by senders that may (closed group) or may not be members of the group (open group) and that have no knowledge of the group's membership. This definition leaves the behavior of the group over time unrestricted in multiple dimensions: it may have local (LAN) or global (WAN) membership, be transient or persistent in time, and have constant or varying membership. From the sender's point of view, this model reduces the multicast service interface to a unicast one. The network software is accordingly burdened with the task of managing multicasts in a manner transparent to the users. From the network designer's point of view, this extra work is expected to result in more efficient usage of resources, which is the primary motive for network providers to support multicast in the first place. These goals impose specific requirements on the network implementation. First, there must be a means for routing packets from a sender to all group members whenever the destination address of a packet is a multicast one, which implies that the network must locate all members of the relevant group and make routing arrangements. Second, since group membership is dynamic, the network must also continuously track current membership during a session's lifetime, which can range from a short to a very long period of time. Tracking is required both to start forwarding data to new group members and to stop the wasteful transmission of packets to destinations that have left the group. Both tasks must be carried out without assistance from the sending entity, as the host-group model dictates. The dynamic nature of multicast groups has important implications for multicast routing. A very different model is adopted by the Asynchronous Transfer Mode (ATM) technology. The first and currently the only supported model is that of a point-to-multipoint Virtual Channel (VC), where the source signaling agent knows the addresses of all destinations included in the VC exactly, typically because they are added one by one during the initial connection set-up. Proposals for receiver-initiated dynamic modifications to VCs are being investigated.
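Returning to the Internet host group model, its service interface can be illustrated with a toy sketch (the functions, addresses, and delivery model below are invented): senders address the group, never its members.

```python
# Toy illustration of the host-group model (names are invented):
# senders address a multicast group, not its members, and membership
# changes without the sender's knowledge or assistance.

groups = {}  # multicast address -> set of member hosts

def join(addr, host):
    groups.setdefault(addr, set()).add(host)

def leave(addr, host):
    groups.get(addr, set()).discard(host)

def send(addr, data, sender=None):
    """Best-effort delivery to whoever is currently a member.
    In an open group the sender need not be a member itself."""
    return {(host, data) for host in groups.get(addr, set())}

join("224.0.1.1", "h1")
join("224.0.1.1", "h2")
delivered = send("224.0.1.1", "pkt", sender="h3")  # h3 is not a member
leave("224.0.1.1", "h2")
```

Note that the sender h3 needs no knowledge of the membership, and joins and leaves take effect without its involvement, which is exactly the open-group behavior described above.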
MULTIPOINT ROUTING AND MULTICAST TREE ALGORITHMS

Unicast routing aims to minimize transmission cost or delay, depending on the metric used for the optimization. These two apparently different goals are equivalent from an algorithmic point of view, both leading to the use of shortest-path algorithms, with Dijkstra's and the Bellman-Ford algorithm the two common cases. These algorithms find optimal routes between one node (the sender) and all other nodes in the network (including all receivers) in the form of shortest-path trees. Thus, a straightforward (but not optimal) solution to the multicast routing problem can be based on the shortest-path trees produced by these algorithms by pruning off any branches that do not lead to receivers in the group. Although details vary according to the base algorithm, some observations generally apply. On the up side, these algorithms are easy to implement, as direct extensions of existing ones, and thus fast to deploy. Additionally, each path is optimal by definition, regardless of changes in group membership, and this optimality comes essentially for free, since shortest paths need to be computed for unicast routing anyway. On the down side, these algorithms optimize the wrong metric: they minimize each individual path rather than the total cost of the tree. Also, for large internetworks with widely dispersed groups, either the scale of the network or continuous network changes restrict their use to subnetworks that already employ their unicast counterparts. Similar problems (e.g., processing complexity for Dijkstra and instability for Bellman-Ford) have also forced unicast routing algorithms to rely on hierarchical routing techniques for large networks. Cost optimization in multicast can be viewed from another angle: overall cost optimization for the distribution tree. The shortest-path algorithms concentrate on pairwise optimizations between the source and each destination and only conserve resources as a side effect, when paths overlap.
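The pruning construction just described can be sketched as follows (the sample graph is illustrative): run Dijkstra from the sender, then keep only the tree edges that lie on a path to some group member.

```python
import heapq

# Sketch of pruning a shortest-path tree into a multicast tree.
# The weighted graph below is invented for illustration.

def dijkstra_tree(graph, src):
    """Return a predecessor map encoding the shortest-path tree."""
    dist, pred = {src: 0}, {}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], pred[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    return pred

def prune(pred, receivers):
    """Keep only tree edges that lie on a path to some receiver."""
    keep = set()
    for r in receivers:
        while r in pred and (pred[r], r) not in keep:
            keep.add((pred[r], r))
            r = pred[r]
    return keep

graph = {
    "S": {"A": 1, "B": 4},
    "A": {"S": 1, "B": 1, "C": 2, "E": 5},
    "B": {"S": 4, "A": 1, "D": 1},
    "C": {"A": 2},
    "D": {"B": 1},
    "E": {"A": 5},
}
tree = prune(dijkstra_tree(graph, "S"), receivers={"C", "D"})
```

The branch towards E, which leads to no receiver, is pruned away; every surviving path is a shortest path from S, but the total tree cost is not optimized.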
We can instead try to build a tree that exploits link sharing as much as possible, and by duplicating packets only when paths diverge, minimize total distribution cost,
even at the expense of serving some receivers over longer paths. What we need is a tree that reaches all receivers and may use any additional network nodes on the way. This is the Steiner tree problem: given a cost-labeled graph and a set of required nodes (here the sender and the receivers), find a minimal-cost tree connecting all of them, possibly through additional (Steiner) nodes. Note that if all the nodes in the graph were required, the problem would coincide with finding a minimum spanning tree for the graph, for which efficient algorithms are known. The Steiner tree problem, in contrast, is intractable: Garey et al. (10) have shown it to be NP-complete. However, approximation algorithms exist with proven constant worst-case bounds, and implementations of such algorithms have been shown to produce low-cost multicast trees with very good average behavior. As an example, trees built with the heuristic by Kou et al. (15) have at most twice the cost of the optimal Steiner tree, while simulations over realistic network topologies have shown their cost to be within 5% of the optimum. The advantage of this approach is its overall optimality with respect to a single cost metric, such as transmission cost. However, the disadvantages are also important: the algorithm needs to run in addition to the unicast algorithms, and it will itself have scaling problems for large networks. Furthermore, optimality is generally lost after group membership changes and network reconfigurations unless the tree is recomputed from scratch. Thus, Steiner tree algorithms are best suited to static or slowly changing environments, since changes lead to expensive recalculations to regain optimality. Both approaches discussed above suffer from an inability to maintain their measure of optimality in large and dynamic networks.
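As a sketch of how such an approximation works, the following implements the main steps of a Kou-et-al.-style heuristic: build the metric closure over the terminals, take a spanning tree of the closure, and expand each of its edges back into a real shortest path. The graph is illustrative, and the refinement passes of the full algorithm are omitted.

```python
import heapq

# Simplified Kou-Markowsky-Berman style Steiner heuristic (sketch):
# the resulting tree costs at most twice the optimal Steiner tree.

def shortest_paths(graph, src):
    """Dijkstra: (distances, predecessors) from `src`."""
    dist, pred, pq = {src: 0}, {}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], pred[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    return dist, pred

def kmb(graph, terminals):
    closure = {t: shortest_paths(graph, t) for t in terminals}
    terminals = list(terminals)
    # Prim's MST over the metric closure of the terminals.
    in_tree, mst_edges = {terminals[0]}, []
    while len(in_tree) < len(terminals):
        u, v = min(((a, b) for a in in_tree for b in terminals
                    if b not in in_tree),
                   key=lambda e: closure[e[0]][0][e[1]])
        in_tree.add(v)
        mst_edges.append((u, v))
    # Expand each closure edge into the underlying shortest path.
    tree_edges = set()
    for u, v in mst_edges:
        pred = closure[u][1]
        while v != u:
            tree_edges.add(tuple(sorted((pred[v], v))))
            v = pred[v]
    return tree_edges

graph = {"A": {"H": 1}, "B": {"H": 1}, "C": {"H": 1},
         "H": {"A": 1, "B": 1, "C": 1}}
steiner = kmb(graph, {"A", "B", "C"})   # star through the Steiner node H
```

In the example the tree routes all terminals through the non-terminal hub H at total cost 3; in richer graphs the heuristic's cost remains provably within a factor of two of the optimum.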
Approaches for extending these algorithms to deal with changes in group membership without complete tree reconfiguration include extending an existing tree in the cheapest way possible to support a new group member and pruning the redundant branches of the tree when a group member departs. The quality of the trees will deteriorate after several local modifications of this sort, eventually leading to a need for global tree reconfiguration. A different approach to the routing problem opts for a solution in realistic settings by adopting the practical goal of finding good, rather than optimal, trees that can also be easily maintained. The departure point for this approach is the center-based tree, an optimal-cost tree that, instead of being rooted at the sender, is rooted at the topological center of the receivers. Even though such a tree may not be optimal for any one sender, it can be proven to be an adequate approximation for all of them together. The implication is that one basic tree can serve as a common infrastructure for all senders. Thus, maintenance of the tree is greatly simplified, and nodes on the tree need only maintain state for one shared tree rather than many source-rooted trees. Since this method was developed for broadcasting rather than multicasting, however, the theoretical results do not carry over when we prune the broadcast trees to get multicast ones. In addition, the topological center, apart from being hard to find (the problem being NP-complete), will not even remain useful for long in a dynamic multicast environment.
Practical proposals for multicast routing abandon the concrete optimality claims discussed above, but keep the basic idea of having a single shared multicast tree for all senders to a group. This is a departure from approaches that build one tree for each sender. Routing is then performed by defining one or more core or rendez-vous points to serve as the basis for tree construction and adding branches by separately routing packets optimally (in the unicast sense) from the senders to these fixed points and then from there to the receivers. Again, merging of paths is exploited whenever possible, but it is not an explicit goal of the routing calculations. Instead, because of the concentration of paths around the fixed points, common paths are expected to arise. A single shared multicast tree is not optimal in any strict sense since no attempt is made to find the topological center of the tree, both due to its computational cost and the limited lifetime of any topological center for a dynamic environment. But, the advantages of shared multicast trees are numerous. First, a shared tree for the whole group means that this approach scales well in terms of maintenance costs as the number of senders increases. Actually, there is still a tree emanating from each sender, but all these trees merge near the fixed points and the distribution mesh is common from there on to the receivers. Second, the trees can be made quite efficient by clever choice of the fixed points. Third, routing is performed independently for each sender and receiver, with entering and departing receivers influencing only their own path to the fixed points of the single shared tree, employing any underlying mechanism available for unicast routing. This last property means that network and group membership dynamics can be dealt with without global recalculations and by using available mechanisms. 
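A minimal sketch of shared-tree construction around a core (topology and names are invented): each member grafts itself onto the tree along its unicast shortest path towards the core, stopping at the first node that is already on the tree.

```python
import heapq

# Sketch of core-based shared-tree construction: joins travel towards
# the core over unicast routes; paths merge where they meet the tree.
# The graph below is invented for illustration.

def shortest_path(graph, src, dst):
    """Plain Dijkstra returning the node sequence src..dst."""
    dist, pred, pq = {src: 0}, {}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], pred[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    path = [dst]
    while path[-1] != src:
        path.append(pred[path[-1]])
    return path[::-1]

def build_shared_tree(graph, core, members):
    on_tree, edges = {core}, set()
    for m in members:
        path = shortest_path(graph, m, core)
        for u, v in zip(path, path[1:]):
            edges.add(tuple(sorted((u, v))))
            on_tree.add(u)
            if v in on_tree:   # reached the existing tree: stop grafting
                break
    return edges

graph = {"C0": {"X": 1}, "X": {"C0": 1, "R1": 1, "R2": 1},
         "R1": {"X": 1}, "R2": {"X": 1}}
shared = build_shared_tree(graph, core="C0", members=["R1", "R2"])
```

In the example, R2's join stops at X, which R1's path already placed on the tree, so the X-C0 link is shared; only local state changes when members come and go.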
In practice, these multicast algorithms are expected to use the underlying unicast algorithms, but are independent of them. Interoperability with different unicast schemes, coupled with the scalability of shared trees, makes these algorithms ideal for use on very large-scale heterogeneous networks. The fixed points can also be selected so as to facilitate hierarchical routing for very large internetworks, further enhancing scalability. Group dynamics are an obstacle to maintaining optimality, whatever the method of constructing the initial trees. Since repeating all routing computations whenever members join or leave the group may be prohibitively expensive, an alternative is to prune extraneous links when a member leaves the group, and to add the most economical extension path towards a new member, either from a fixed point or from the best attachment point in the existing tree. Rather than making modifications blindly, the most advanced algorithms store some of the state accumulated during tree construction and make only local calculations that still satisfy the requirements of the application. However, simulations have shown that even simple multicast routing using the shortest-path tree is not significantly worse in terms of total tree cost than the optimal solutions or the near-optimal heuristics. For realistic network topologies, Doar and Leslie (9) have found that the cost of a shortest-path tree is less than 50% larger than that of a near-optimal heuristic tree, while path delays for
heuristic trees are 30% to 70% larger than shortest-path delays. Shortest-path trees are easily built and modified using the underlying unicast routing, and they never deteriorate in terms of delay; they simply vary in their inefficiency in terms of total cost. An application prepared to accept this overhead can therefore avoid special multicast-tree construction and maintenance methods by employing the shortest delay paths. A similar cost versus simplicity trade-off is involved when using shared trees for all senders to a group. For shared trees, optimality is hard to achieve and even harder to maintain, as discussed earlier, but a simple approach is to choose the center from among the group members, so that only as many candidate trees as there are group members need to be considered. For these trees, when path delay is optimized, simulations show that delays are close to 20% larger than the shortest paths, while tree cost is about 10% lower than that of shortest-path trees. Furthermore, a single tree constructed using the underlying unicast routing mechanisms minimizes state and maintenance overhead. Unfortunately, apart from their moderate sub-optimality, shared trees also suffer from traffic concentration, since they route data from all senders through the same links. Simulations show that delay-optimal member-centered shared trees can cause maximum link loads up to 30% larger than in a shortest-path tree. For these reasons, recent proposals try to combine shared trees and shortest paths by starting each group connection in shared-tree mode and then switching individual paths to shortest-delay ones upon receiver requests. This approach can also support traditional source-rooted trees. Additional complications arise when links and paths are asymmetric; most of the algorithms and approaches discussed above need modifications, and in some cases they do not apply at all.
An interesting problem arising from resource sharing in multicast is how to split the total distribution costs among the receivers, or how to allocate the savings relative to using separate unicasts. This issue is orthogonal to the question of what the total costs are, which also arises in the unicast case. Whether these costs are used for pricing or for informational purposes, they are a primary incentive to use multicast.

Multicast Routing with Quality-of-Service Constraints

The motivation for routing multicast traffic along trees rather than along arbitrary paths is to minimize transmission cost through link sharing. For continuous media, the volume of data transferred makes this goal even more important. However, for real-time multimedia applications we must take into account two additional factors: delay constraints, particularly for interactive applications, and media heterogeneity. Separate handling of media streams is useful in order to use the most effective coding techniques for each stream. The question then arises whether we should use the same or separate distribution trees for each stream. Considering the load that continuous media put on network links and the interaction between admission control and routing, it seems better to use separate trees. Each media stream could then ask for the appropriate Quality-of-Service (QoS) parameters and get routed accordingly, with receivers choosing to connect to any subset of the trees. On the other hand, the management overhead of multiple trees per source may be prohibitive. In addition, routing each media stream separately exacerbates the inter-media synchronization problem. Turning to delay requirements, if we use delay as the link metric during routing, we can easily see that the shortest-delay tree, made up of the shortest paths from the sender to each receiver, is not the same as the tree of minimal total cost, which maximizes link sharing at the expense of individual path delays. We then have a global tree metric (tree cost) and many individual receiver-oriented metrics (path delays) that are potentially in conflict. Since we cannot hope to optimize on all fronts, we can try to optimize cost subject to the constraint that delay remains tolerable. Interactive applications can be characterized by upper bounds on end-to-end delay and/or limits on jitter. In this sense, it is reasonable to design the tree so as to optimize total cost while keeping individual paths within their respective bounds. Normally, all receivers would be satisfied by the same limits, as these are determined by human perception properties. This new problem is essentially a version of the Steiner tree problem with additional constraints on the paths. Even though it is NP-complete, fast heuristic algorithms that are nearly optimal have been developed. Almost identical formulations are obtained when the constraint is delay jitter or a probabilistic reliability constraint. For example, the latter can be modeled, in the case of independent link losses, by a loss probability assigned to each link. Then, taking logarithms, the reliability metric can be expressed in additive form between a source and each destination by summing the logarithms along the path.
This maps the problem to the previous one, since the goal is again tree cost minimization under a constraint on an additive path-based metric. Finally, a similar formulation can be used when the constraint is link capacity, which must not be exceeded, instead of a delay bound. Again, heuristics exist to solve this variant of the problem.
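The logarithm transformation mentioned above is simple enough to state directly (the link success probabilities below are invented for illustration):

```python
import math

# Sketch of the logarithm trick: multiplicative link success
# probabilities become an additive path metric, so reliability
# constraints fit the same constrained-tree formulation as delay bounds.

links = {("S", "A"): 0.99, ("A", "R"): 0.98}

# Additive cost per link: -log(success probability).
log_cost = {e: -math.log(p) for e, p in links.items()}

path = [("S", "A"), ("A", "R")]
path_cost = sum(log_cost[e] for e in path)

# Recovering the end-to-end success probability from the additive metric:
end_to_end = math.exp(-path_cost)   # equals 0.99 * 0.98
```

Minimizing the additive -log metric along a path is thus equivalent to maximizing the end-to-end success probability, which is what lets reliability constraints reuse the delay-bound formulation.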
FEEDBACK CONTROL AND RELIABLE MULTICAST

Whether a network provides a simple connectionless service or a complicated connection-oriented service for unicast, generalizing it for multicast is not trivial. Flow, congestion, and error control depend on feedback to the sender, triggered by network and receiver events. For simple network services, no such information is provided by the network itself; instead, end-to-end reports must be exchanged. Error control ensures that packets transmitted by the sender are received correctly. Packets may be received corrupted (detected by error-detection codes) or they may be lost (detected by missing sequence numbers). Flow control ensures that the sender does not swamp the receiver with data that cannot be consumed in time. Congestion control deals again with the problem of insufficient resources, but this time at network elements between sender and receiver. Although packets may be dropped at intermediate
nodes, in many networks this loss can be detected only by the receiver, resulting in confusion between errors and congestion. In the unicast case, lost or corrupted packets are retransmitted based on feedback received from the network or the receiver. When packets are multicast, simple feedback schemes face the feedback-implosion problem: all receivers respond with status information, swamping the sender with possibly conflicting reports. Ideally, senders would like to deal with the multicast group as a whole and not on an individual receiver basis, following the host-group model. However, the sender cannot simply treat all receivers identically, because this would lead either to ignoring the retransmission requests of some receivers or to wasting resources by retransmitting to all of them. Since there is no evident solution that satisfies all requirements, several approaches exist, emphasizing different goals. The simplest approach of all is to ignore the problem at the network layer and provide a best-effort connectionless service. Delegating the resolution of transmission problems to the higher layers may be an adequate solution in many cases, since they may have additional information about the application requirements and can thus implement more appropriate mechanisms than what is possible at this layer. A second solution sacrifices the host-group model's simplicity by keeping per-receiver state during multicasts. After transmitting a multicast packet, the sender waits until a stable state is reached before sending the next one. For flow control, this slows down the sender enough so as not to swamp the slowest receiver. For error control, retransmissions are made until all receivers receive the data. This may not be possible even after multiple retransmissions, so the sender may have to take special action, e.g., removing some receivers from the group. Retransmissions may be multicast when many receivers lose a packet, or unicast when few do.
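The retransmission decision in this per-receiver-state scheme can be sketched as follows (the threshold and names are assumptions for illustration, not from any particular protocol):

```python
# Sketch of a sender keeping per-receiver ACK state: retransmit until
# everyone has the packet, multicasting when many receivers are missing
# it and unicasting when few are.

MULTICAST_THRESHOLD = 3  # assumed cut-off, not from the text

def retransmit_plan(group, acked):
    """Decide how to retransmit given which receivers have ACKed."""
    missing = group - acked
    if not missing:
        return ("done", set())
    if len(missing) >= MULTICAST_THRESHOLD:
        return ("multicast", group)      # one copy reaches everyone
    return ("unicast", missing)          # individual copies to the few

group = {"r1", "r2", "r3", "r4"}
action, targets = retransmit_plan(group, acked={"r1"})            # many missing
action2, targets2 = retransmit_plan(group, acked={"r1", "r2", "r3"})  # one missing
```

A real sender would loop, re-invoking such a decision after each retransmission round until the plan is "done" or it gives up and removes unresponsive receivers from the group.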
Since feedback implosion is always a possibility, all such schemes should use negative rather than positive acknowledgments, i.e., send responses when problems occur rather than confirming that packets are received correctly and in time. In a negative acknowledgment scheme, some responsibilities are moved to the receivers, complicating their operation. However, additional opportunities arise, such as multicasting the negative acknowledgments to all receivers after random periods of time to minimize the number of negative acknowledgments returned to the sender. Assigning such responsibilities to receivers can lead to higher throughput. However, the scalability of such schemes is doubtful, even with very reliable links and rare congestion or overflow problems. The problem is that the sender is still the control center, and as the number of group members grows, receivers and network paths become more heterogeneous. With these essentially symmetric schemes, the service provided to a group member is the lowest common denominator, which may be determined by the slowest or most overloaded receiver, or the slowest or most congested network link. Sophisticated approaches exist that follow these general directions, but their complexity and inefficiency make them appropriate only for applications that require very high reliability and uniform member treatment. Note that such reliable
solutions can be implemented as transport services over a simple connectionless network service. A third solution is to distribute the feedback control mechanism over the entire multicast tree, following a hierarchical scheme. A receiver's feedback need not propagate all the way to the sender. Instead, intermediate nodes may either respond directly or merge the feedback from many downstream receivers into a summary message and then recursively propagate it upwards. In this case, feedback implosion is avoided in terms of messages, but the problem of dealing with possibly conflicting requests remains. If the added complexity of making local decisions at each network node (not only group members) is acceptable, we can narrow down the impact of problems to specific parts of the tree, relieving the sender from dealing with individual receivers. A non-hierarchical method for distributed feedback control, targeted at recovery of lost messages, is to let all receivers and senders cooperate in handling losses, thus extending the sender-oriented model. When receivers discover a loss, they multicast a retransmission request, and anyone that has the message can multicast it again. To avoid feedback implosion, these requests and replies are sent after a fixed delay based on the distance from the source of the message or the source of the request, respectively, plus a (bounded) randomized delay. The result is that most duplicate requests and replies are suppressed by the reception of the first multicasts. By varying the random-delay intervals, the desired balance between recovery delay and duplicates can be achieved. In contrast to hierarchical schemes, and because location-independent multicasts are used, only group members participate, but recovery cannot be localized without additional mechanisms. A scalable feedback mechanism that can be used to estimate network conditions without creating implosion problems has been proposed by Bolot et al.
(4): it first estimates the number of receivers in a group and then the average quality of reception (the averaging depends on the application), using probabilistic techniques. This method has been used in applications to let senders detect congestion problems and adapt their output rates (to relieve congestion) and error redundancy factors (to increase the chances of error recovery). Cheung and Ammar (5) proposed a further enhancement to scalable feedback control: splitting the receivers into groups according to their reception status and capabilities, and sending each group only the data it can handle. This avoids the problems created by very slow or very fast machines dragging the whole group towards one extreme. Finally, another approach (mostly orthogonal to the above) tries to minimize the need for feedback by taking preventive rather than corrective action. For error control, this is achieved by using Forward Error Correction (FEC) rather than simple error-detection codes. For flow and congestion control, it is achieved by reserving resources so that both receivers and intermediate network nodes are able to support the sender's data rate. The cost of these techniques is increased overhead and network complexity. FEC imposes processing and transmission overhead, but requires no additional mechanisms in the network. Resource reservation, on the other hand, needs additional control mechanisms to set up and maintain the resources for a session.

Message Ordering and Atomic Multicast

Some applications require delivery of messages in order. In some cases this requirement is expressed across sources, leading to a synchronization problem. The required ordering of messages can be causal or total. Causal ordering is based on the "happens before" relationship and might not be total (i.e., messages can be concurrent). A multicast protocol that ensures reliability and total ordering is called atomic. Such protocols may be necessary for secure distributed computing in the presence of failures and malicious agents, while causal ordering is sufficient to ensure consistency of updates to replicated databases.

MULTICASTING REAL-TIME CONTINUOUS MEDIA

Host and Network Heterogeneity

Several representational formats for the various media types coexist. This is a problem even with traditional data communications, but it is more of an issue with images, audio, and video. Such issues are typically addressed at the presentation layer in the OSI model. Translation between formats can be provided at three points: at the transmitter, at the receiver, or inside the network. In the latter case, format converters must be deployed in the network. This may be appropriate for converting protocols or text encodings between autonomous systems with different standards (placing the converters in the gateways), but it is not effective when the terminals themselves can use different encodings within the same area. Thus, it is more realistic to move the translation services to the hosts. With unicast, translation can be done effectively at either the sender or the receiver. However, heterogeneity problems are aggravated with multicast. For example, translation at the sender requires the stream to be duplicated and translated for each different type of receiver, precluding link sharing over common paths.
This approach also does not scale for large heterogeneous groups, since the sender's resources are limited. Finally, it requires the sender to be aware of the receivers' capabilities, which is incompatible with the host-group model. The sender may use a different multicast group for each encoding to avoid this, but the other problems remain. Translation at the receiver is the most economical and scalable approach in this case, since it fully exploits sharing and moves responsibilities away from the sender. Since continuous media impose heavy demands on both networks and hosts, it is likely that not all receivers will be able to receive all of a sender's traffic. This argues in favor of prioritizing the generated traffic through hierarchical coding. Hierarchical or layered coding techniques decompose a signal into independent or hierarchically dependent components, specific subsets of which can be used to partially reconstruct the signal. In this case receivers can choose to get only those parts of the media that they can use or that are most important to them. Thus, appropriate hierarchical coding combines easily with, and facilitates, translation and reconstruction of the signal at the receivers, according to their needs and abilities. For example, a high-resolution component of a video could be dropped from a congested subnetwork, allowing the low-resolution components to be received and displayed in that subnetwork, without impacting other subnetworks that are not congested.

Resource Reservations

Resource reservations at the network switches are needed if any service guarantees are to be provided. The exact nature of these reservations differs according to the required service guarantees and the approach taken towards satisfying them, so resource reservation along transmission paths can be viewed as a subset of the general switch-state establishment mechanisms. An alternative to reserving resources for an indefinite period of time during connection establishment is to make advance reservations for a future connection with a given lifetime. This allows more sessions to be admitted (due to their deterministic timing) and also permits negative responses to reservation requests to be dealt with more gracefully. The first component of a resource reservation scheme is a specification model for describing flow characteristics, which depends heavily on the model of service guarantees supported by the network. An appropriate protocol is then required to communicate these specifications to the receivers and to reserve resources on the transmission path so that the requested service parameters can be supported. Simple unicast approaches to resource reservation are generally source-based. A set-up message containing the flow specification is sent to the destination, with the intermediate nodes committing adequate resources for the connection, if available. Resources are normally over-allocated early in the path, so that even if switches encountered further along the path are short of resources, the connection can still be set up.
After the set-up message reaches its destination, assuming the connection can be admitted along the path, a response message is returned on the reverse path, allowing the intermediate switches to relax commitments in some cases. Similarly, for multicast, there must be a way for senders to notify receivers of their properties, so that appropriate reservations can be made. In a perfectly homogeneous environment, the reservations will be made once on each outgoing link of a switch for all downstream receivers, so that resource usage can be minimized. Reserved resources can also be shared among data transmitted from multiple senders to the same group (e.g., in applications such as conferencing where the number of simultaneous senders is much smaller than the total). However, receiver and network heterogeneity often prohibits use of this simplistic scheme. One approach is to allocate resources as before during the first message’s trip and then have all receivers send back their relaxation (or rejection) messages. Each switch that acts as a junction will only propagate towards the source the most restrictive relaxation among all those received. However, since paths from such junctions towards receivers may have committed more resources than are now needed, additional passes will be required
for convergence, or resources will be wasted. To handle dynamic groups without constant source intervention, this model can be augmented with receiver-initiated reservations that propagate upstream along an already established distribution tree. An alternative approach is to abandon reservations during the sender's multicast set-up message and instead reserve resources based on the modified specifications with which the receivers respond to the initial message. Again, resource reservations are merged at junction points, but since the (now upstream) requests are expected to be heterogeneous, each junction reserves adequate resources for the most demanding receivers and reuses them to support the less demanding ones. Even though it is still unclear how aggregation of reservations should be performed, this approach has the potential to support both heterogeneous requests and resource conservation, possibly without over-committing resources, thus maximizing the chance that a new session can be admitted. Since this mechanism converges in one pass rather than in multiple passes, the reservation state in the switches can be refreshed periodically, turning the fixed hard state of a static connection into adaptive soft state suitable for a dynamic environment. In this way the mechanism can accommodate both group membership changes and routing modifications without involving the sender. The interaction of routing and resource reservations further complicates matters. Even in the simple case of static routing, success in building a multicast tree depends on the adequacy of resources at each switch. We would like to construct the tree using the switches that pass the admissibility tests, thus favoring the sender-initiated reservation approach. On the other hand, we do not want the construction to fail due to over-allocation, so receiver-initiated reservations are preferable because they may avoid overcommitting resources and converge in one pass.
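The merging behavior at junctions described above can be sketched as a simple fold over the distribution tree (RSVP-style maximum merging is assumed; the topology and numbers are invented):

```python
# Sketch of receiver-initiated reservation merging: each junction
# forwards upstream a single request large enough for its most
# demanding downstream branch, so less demanding receivers reuse it.

def merged_request(node, children, demand):
    """Fold receiver demands upwards: a leaf asks for its own demand,
    a junction for the maximum over its branches."""
    kids = children.get(node, [])
    if not kids:
        return demand[node]
    return max(merged_request(c, children, demand) for c in kids)

# Two receivers behind junction J ask for different bandwidths (Mb/s);
# J reserves once, for the larger request.
children = {"J": ["r1", "r2"]}
demand = {"r1": 1.5, "r2": 0.5}
upstream = merged_request("J", children, demand)   # 1.5
```

With requests of 1.5 and 0.5 Mb/s behind the same junction, a single 1.5 Mb/s reservation travels upstream and the smaller request is absorbed by it, which is how the scheme avoids over-committing resources in one pass.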
Now, however, the tree constructed by the routing algorithm may be inadequate to support the reservations, again rejecting a session that could in principle be set up.
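The junction behavior described above, reserving once per outgoing link for the most demanding downstream receiver and letting less demanding receivers share that reservation, can be sketched as follows. The function name and the bandwidth units are illustrative only, not part of any reservation standard.

```python
def merge_reservations(downstream_requests):
    """At a junction, reserve enough bandwidth for the most demanding
    downstream branch and forward only that single request upstream.
    `downstream_requests` maps branch id -> requested bandwidth (e.g., Mb/s)."""
    if not downstream_requests:
        return 0
    return max(downstream_requests.values())

# Three receivers behind one junction ask for 2, 5, and 1 Mb/s; the
# junction reserves 5 Mb/s on its upstream link, and the two smaller
# requests are served out of the same reservation.
print(merge_reservations({"r1": 2, "r2": 5, "r3": 1}))  # 5
```

The same merge rule applied recursively at every junction is what lets the aggregate request seen near the source stay bounded by the single most demanding receiver rather than growing with group size.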
MULTICASTING ON THE INTERNET

The Internet has been extensively used as a testbed for algorithms and protocols supporting multicast. The extensions of the IP model to support multicast are the provision of special (class D) multicast addresses and IGMP (the Internet Group Management Protocol), which supports the host-group model. Multicast-aware routers periodically multicast, on a well-known address, membership queries on their LANs and gather replies from interested hosts in order to discover which groups have members present in their area. To achieve multicasting in a wide area network, we need a mechanism to keep track of the dynamic membership of each group and another mechanism to route the multicast datagrams from a sender to these group members without unnecessary duplication of traffic. IP multicasting implements these mechanisms in two parts: local mechanisms track group membership and deliver multicasts to the correct hosts within a local network, and global mechanisms
route datagrams between local networks. Distinguishing local from global mechanisms is appropriate for IP since it is an internetworking protocol: each local network can use mechanisms appropriate to its technology, while cooperation among networks is achieved by hiding local differences behind a common interface. In each local network, at least one router acts as a multicast router. A multicast router keeps track of local group membership and is responsible for forwarding multicasts originating from its network towards other networks, and for delivering multicasts originating elsewhere to the local network. Multicast delivery of either externally or locally originated datagrams to local receivers, as well as reception of local multicasts by the router for subsequent propagation to other networks, depend on the underlying network technology. Accordingly, the information needed within the local network regarding group membership in order to achieve local multicast delivery may vary. In contrast, cooperation among multicast routers with the purpose of delivering multicast datagrams between networks is based on a network independent interface between each local network and the outside world. The information needed in order to decide if multicasts should be delivered to target networks is whether at least one group member for a destination group is present there. A multicast router uses the information for each of its attached local networks along with information exchanged with its neighboring routers to support wide area multicasting. Irrespective of the group membership information tracked by a multicast router for local purposes, the interface between local information and global routing is a list of groups present at each attached network. Based on this interface, alternative algorithms can be used for routing among networks, without affecting local mechanisms. 
Conversely, as long as this interface is provided by the local mechanisms, they can be modified without affecting routing. A variety of global, wide-area multicast routing mechanisms exist, the earliest and most widespread being the Distance Vector Multicast Routing Protocol (DVMRP). DVMRP v.1 is a variant of Truncated Reverse Path Broadcasting. Routers construct distribution trees for each source sending to a group, so that datagrams from the source (root) are duplicated only when tree branches diverge towards destination networks (leaves). Each router identifies the first link on the shortest path from itself to the source, i.e., on the shortest reverse path, using a distance vector algorithm. Datagrams arriving from this link are forwarded towards downstream multicast routers, i.e., those routers that depend on the present one for multicasts from that source. A broadcast distribution tree is thus formed, with datagrams reaching all routers. Since each router knows which groups are present in its local networks, redundant datagrams are not forwarded; the tree is truncated. DVMRP v.3 implements the improved Reverse Path Multicasting mechanism, which prunes tree branches leading to networks that have no members and grafts them back when members appear, thus turning the group distribution tree into a true multicast tree. Another protocol, discussed by Moy (12), the Multicast Open Shortest Path First (MOSPF), uses a link state algorithm: routers flood their membership lists among them,
so that each one has complete topological information concerning group membership. Shortest-path multicast distribution trees from a source to all destinations are computed on demand as datagrams arrive. These trees are real multicast ones (i.e., not broadcast), but the flooding algorithm introduces considerable overhead. A radically different proposal for multicast routing is the Core Based Trees (CBT) protocol, which employs a single tree for each group, shared among all sources. The tree is rooted at at least one arbitrarily chosen router, called the core, and extends towards all networks containing group members. It is constructed starting from leaf network routers towards the core as group members appear; thus it is a multicast tree composed of shortest reverse paths. Sending to the group is accomplished by sending towards the core; when the datagram reaches any router on the tree, it is relayed towards tree leaves. Routing is thus a two-stage process which can be sub-optimal. The first stage may propagate datagrams away from their destinations until the tree is reached, thus increasing delay, and in addition, traffic tends to concentrate on the single tree rather than being spread throughout the network. Finally, the Protocol Independent Multicast (PIM) protocol by Deering et al. (7) employs either shared or per-source trees, depending on application requirements. There are two main modes of operation of the PIM protocol, depending on the distribution of the multicast group members throughout the network. In the Sparse Mode (PIM-SM), receivers are assumed to be sparsely distributed throughout the network, and therefore any router with downstream group members must explicitly inform its upstream multicast routers of its interest in joining a multicast group. The resulting shared tree is rooted at a group-specific Rendezvous Point (RP). Routers can later join a source-specific shortest-path distribution tree and prune themselves off the shared tree.
In the Dense Mode (PIM-DM), the opposite assumption is made, i.e., it is assumed that most of the routers are interested in receiving multicast traffic. Therefore, each router forwards multicast data to all of its neighboring routers. Routers not interested in joining the multicast group explicitly prune themselves off the constructed source-rooted multicast tree. Networks supporting IP multicasting may be separated by multicast-unaware routers. To connect such networks, tunnels are used: tunnels are virtual links between two endpoints, composed of a possibly varying sequence of physical links. Multicasts are relayed between routers by encapsulating multicast datagrams within unicast datagrams at the sending end of the tunnel and decapsulating them at the other end. The MBone is a virtual network composed of multicast-aware networks bridged by such tunnels. Multicast routers may choose to forward through the tunnels only datagrams that have Time-to-Live (TTL) values above a threshold, to limit multicast propagation. In contrast to global mechanisms, only a single set of local mechanisms exists. These local multicasting and group management mechanisms were based on shared-medium broadcast networks such as Ethernet, and this is evident in some of the design decisions made. Delivery is straightforward on these LANs, as all hosts can listen to all datagrams and select the correct ones. If a LAN supports multicasting
as a native service, class D IP addresses may be mapped to LAN multicast addresses to filter datagrams in hardware rather than in software. Multicasts with local scope do not require any intervention by the multicast router, while externally originated multicasts are delivered to the LAN by the router. The router also monitors all multicasts so that it can forward to the outside world those for which receivers exist elsewhere. Both unicasts and multicasts are physically broadcast on these LANs, so the only issue for the router when delivering externally originated multicasts is whether at least one member of the destination group exists in the network. The router only has to keep internally a local group membership list, which coincides exactly with the information on which global multicast routing is based. Both versions of the Internet Group Management Protocol (IGMP) provide a mechanism for group management well suited to broadcast LANs, since only group presence or absence is tracked for each group. In IGMP v.1 the multicast router periodically sends a query message to a multicast address to which all local receivers listen. Each host, on reception of the query, schedules a reply, to be sent after a random delay, for each group in which it participates. Replies are sent to the address of the group being reported, so that the first reply will be heard by all group members and will suppress their own pending transmissions. The router monitors all multicast addresses, so that it can update its membership list after receiving each reply. If no reply is received for a previously present group for a number of queries, the group is assumed absent. In steady state, in each query interval the router sends one query and receives one reply for each present group. When a host joins a group, it sends a number of unsolicited reports to reduce join latency for the case where it is the first local member of the group.
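The randomized report suppression just described can be illustrated with a toy simulation. The host names and delay values are hypothetical, and real IGMP timers and message formats are more involved; the point is only that one report per group suffices, regardless of group size.

```python
import random

def reports_after_query(member_delays):
    """IGMP-style report suppression for one group: each member
    schedules its report after a random delay; the earliest report is
    multicast to the group address, so members still waiting cancel
    theirs.  Returns the hosts that actually transmit."""
    if not member_delays:
        return []
    first = min(member_delays, key=member_delays.get)
    return [first]

# However many members the group has, the router hears one report:
delays = {h: random.uniform(0, 10) for h in ("a", "b", "c", "d")}
print(len(reports_after_query(delays)))  # 1
```

This suppression is why the router's steady-state load is one query and one reply per present group per interval, independent of how many hosts belong to each group.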
No explicit action is required when a host leaves a group, as group presence times out when appropriate. In IGMP v.2 a host must send a leave message when abandoning a group, but only if it was the last host to send a report for that group. However, since this last report may have suppressed other reports, the router must explicitly probe for group members by sending a group-specific query to trigger membership reports for the group in question. It can only assume the group absent if no reports arrive after a number of queries. All IGMP v.2 queries include a time interval within which replies must be sent: general queries may use a long interval to avoid concentrating reports for all groups, while group-specific queries may use a short interval to speed up group status detection. The time between the last host leaving a group and the router stopping multicast delivery for that group is called the leave latency.

Other Internet Protocols and Services

The Resource ReSerVation Protocol (RSVP), designed by Zhang et al. (21), acts as an overlay on routing protocols, supporting receiver-initiated resource reservations over any available multicast routing scheme. In addition, RSVP supports dynamic reservation modifications and network reconfigurations. A transport protocol supporting continuous media has been developed by Schulzrinne et al. (18): RTP (the Real-time Transport Protocol). It provides support for timing information, packet sequence numbers, and option specification, without imposing any additional error control or sequencing mechanisms. An application can use this basic framework adapted to its requirements to add whatever mechanisms seem appropriate, such as error control based on loss detection using sequence numbers, or intra-media and inter-media synchronization based on timing information. A companion control protocol, RTCP (the Real Time Control Protocol), can be used for gathering feedback from the receivers, again according to the application's needs. For example, an application can use RTP for transport and RTCP adapted for scalable feedback control, along with appropriate FEC and adaptation mechanisms. Another relevant protocol is SDP (the Session Description Protocol), which provides a mechanism for applications to learn what streams are carried in the network, describing them in adequate detail so that anyone interested can launch the appropriate receiver applications.
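As an example of the application-added mechanisms RTP permits, loss detection from sequence numbers might look like the following sketch. It assumes 16-bit RTP-style sequence numbers that wrap around, and it ignores reordering for simplicity.

```python
def detect_losses(seq_numbers, mod=2**16):
    """Infer packet loss from RTP-style sequence numbers (16-bit,
    wrapping).  Counts the packets missing between consecutively
    received ones; reordered packets are not handled here."""
    lost = 0
    for prev, cur in zip(seq_numbers, seq_numbers[1:]):
        gap = (cur - prev) % mod  # modular arithmetic handles wraparound
        if gap > 1:
            lost += gap - 1
    return lost

# Sequence wraps from 65535 to 0; packets 65535 and 0 never arrived:
print(detect_losses([65533, 65534, 1]))  # 2
```

A receiver can feed such counts back to the sender via RTCP reports, which is exactly the kind of scalable feedback loop the text mentions.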
MULTICAST IN ATM AND OTHER TECHNOLOGIES

ATM technology supports point-to-multipoint VCs, with the source responsible for setting up the VC. Two basic models have been proposed to support multicast in ATM. The first is based on a mesh of point-to-multipoint VCs from each source to the destinations. The second resembles the center-based trees and uses a multicast server (MCS) with point-to-multipoint VCs to all destinations; sources then establish and use point-to-point VCs to forward their data to the MCS, which then forwards them to the destinations. Both models have advantages and disadvantages. Group management and dynamic join-leave is probably more complex and slower with the mesh, but throughput and delay should be better. Also, the MCS is a single point of failure and concentrates traffic, not only at the particular server, but also in the subnetwork surrounding it. The LAN emulation service on top of ATM offers a solution very similar to that of the MCS for multicast and broadcast. A third, multipoint-to-multipoint solution has also been suggested, based on the shared tree approach and an access control protocol that allows sources to alternate in using the common infrastructure. However, this last proposal is less compatible with the various ATM protocols and techniques currently adopted. Finally, an important problem for ATM is the mapping of high-level multicast addresses (or group names) to specific destination end-points and point-to-multipoint VCs. A key solution is based on a Multicast Address Resolution Server (MARS), based on the notion of an ATM ARP server. This service is obviously necessary for a full implementation of IP over ATM. Similarly, when multiple hosts and applications are communicating, as in a multi-party conference, there is usually a need to mediate transmission and reception of data among participants at the application layer.
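The join-leave complexity trade-off between the mesh and MCS models can be made concrete with a rough count of the VC leaf endpoints that must be set up and maintained. This is a back-of-envelope sketch, not a statement about any particular ATM signaling implementation.

```python
def vc_endpoints(num_sources, num_receivers):
    """Rough count of VC leaf endpoints to maintain under each model.
    Mesh: every source keeps a point-to-multipoint VC with one leaf
    per receiver.  MCS: each source keeps one point-to-point VC into
    the server, which keeps a single point-to-multipoint VC out."""
    mesh = num_sources * num_receivers
    mcs = num_sources + num_receivers
    return mesh, mcs

# With 10 senders and 50 receivers, a membership change touches every
# source's VC under the mesh, but only the server's VC under an MCS:
print(vc_endpoints(10, 50))  # (500, 60)
```

The quadratic-versus-linear growth is why dynamic join-leave is slower with the mesh, while the extra hop through the server is why the MCS sacrifices some throughput and delay.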
As the specific needs of each application and conference setting may vary, one way to support multiple control policies is to use either a specialized server or a logical conference control channel as a shared mechanism through which control messages are exchanged. Floor control and session management applications can then employ this mechanism for
their needs. Many other applications and networking technologies have also had to confront the multicast problem. For example, multi-hop lightwave technologies, which assign multiple wavelengths to source-destination pairs in the case of unicast, have to modify their architectures for multicast. For Mobile IP and IP Multicast, even though straightforward solutions do exist, they suffer from various problems that are still being investigated. Finally, in some cases multicast is proposed as a solution to other problems. For example, in order to minimize delay and loss during hand-offs in mobile packet communications, it has been proposed to multicast the packets to all base stations near the mobile so that the information will be immediately available in case of hand-off.

Multicast Security Issues

The traditional security issues for communications, which are typically thought of in the context of unicast (i.e., point-to-point communications), also exist for multicast. They typically relate to data confidentiality and integrity and service availability. However, multicast amplifies the existing problems and poses new ones. In addition, straightforward extensions of unicast either do not apply or are uninteresting. For example, using O(n²) independent end-to-end unicast secure channels between all pairs of participants can provide secure group communication. However, no benefits from multicasting can be drawn in this case; in particular, neither efficient transport nor flexible membership management is obtained. With the original Internet protocols, session membership is typically not known (except perhaps in an indirect way at the application level) and cannot be controlled. This makes the problems of eavesdropping, unauthorized injection of messages, and even service denial to authorized members even more important. However, research into these issues is just beginning, and experience with real systems is almost non-existent.
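The O(n²) observation above is easy to quantify: securing a group with independent pairwise channels requires one secure channel per pair of participants.

```python
def pairwise_channels(n):
    """Number of independent end-to-end secure channels needed to
    cover every pair in an n-member group: n*(n-1)/2, i.e., O(n^2)."""
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(n, pairwise_channels(n))
# 45, 4950, and 499500 channels, respectively; a single shared group
# key (with some rekeying scheme) scales far better but raises the
# membership-management problems discussed in the text.
```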
Some obvious but central requirements for approaches to secure multicast, in particular in the context of global networks such as the Internet and ATM, are: (1) compatibility with existing network protocols, (2) scalability, and (3) transparency to higher level services and applications.

APPLICATION-LEVEL MULTICAST

The scalability problems faced in the deployment of IP multicast in wide area networks have led to an interest in application-level multicast over peer-to-peer overlay networks. As suggested by Ratnasamy et al. (26), the main target is to eliminate the need for a multicast routing algorithm to construct distribution trees. To this end, peer-to-peer overlay networks provide a scalable, fault-tolerant, self-organizing routing substrate. Application-level multicast aims at leveraging these underlying overlay routing properties. Peer-to-peer overlay routing is performed over an abstract namespace. A randomly chosen portion of the namespace is assigned to each participating node. Each node of the overlay network holds routing information only for a small subset of nodes, those whose namespace portions neighbor its own in the global namespace. In this way, routing information is distributed among all nodes, yielding a scalable routing infrastructure. For logical namespace proximity to reflect actual networking proximity (in terms, for example, of round-trip time), among all neighboring nodes in the logical namespace, a node holds routing information only for those closest to it in the actual networking topology. Rowstron et al. (23) and Zhao et al. (24) show that overlay routes are approximately 30% longer than the routes followed in the case of direct IP routing based on complete routing tables. This is considered an acceptable cost in view of the fact that each node holds routing information only for a small subset of the overall topology. Messages are destined to points in the namespace (also termed keys). The overlay network routes a message to the node that has been assigned the portion of the namespace containing the destination point, i.e., the owner of the key. This is accomplished by each node forwarding the message to the node whose namespace portion is closest to the key. The average number of overlay hops required for a message to reach its destination is a logarithmic function of the number of nodes constituting the overlay network. Two main approaches have been followed towards application-level multicast. The first aims at building the multicast distribution tree on top of the overlay network. Zhuang et al. (25) propose the creation of source-specific trees on top of a Tapestry (24) overlay network. The construction of the multicast tree follows the hierarchical character of the underlying routing mechanism, i.e., closely neighboring nodes in the namespace belong to the same tree level. However, each join message reaches the source node, so that the construction of the multicast tree is coordinated there, weakening the scalability of the proposed scheme. Castro et al. (27) overcome this limitation by handling group joins locally.
They propose the creation of source-rooted trees, or trees rooted at a randomly chosen rendezvous node, on top of a Pastry (23) overlay network. In this case, a join message is not propagated all the way to the root of the tree, but is suppressed by an intermediate node that has already joined the group. Both approaches result in the creation of well-balanced source-specific trees due to the randomization of the overlay addresses. However, these trees may contain nodes not belonging to the multicast group. The second approach, followed by Ratnasamy et al. (26), aims at the creation of an overlay network for each multicast group, avoiding the construction of a multicast tree. Multicast data is then broadcast in the overlay network. Contrary to the first approach, there is no restriction on the number of sources, i.e., multiple nodes may broadcast in the overlay network, resulting in a multipoint-to-multipoint communication model. Moreover, the dissemination of data is performed only by nodes actually belonging to the multicast group. Nevertheless, Castro et al. (28) show that the tree-building approach achieves lower delay and signaling overhead than broadcasting over per-group overlays, due to the significant delay overhead incurred by the routing state establishment during the construction of the overlay network in the latter case.
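The key-based overlay routing that both approaches build on can be illustrated with a toy greedy model over a numeric namespace. Real substrates such as Pastry and Tapestry instead match key digits prefix by prefix, which is what bounds the hop count by O(log N); the node identifiers and routing tables below are invented for illustration.

```python
def route(key, node, routing_tables):
    """Toy greedy overlay routing: forward to the known neighbor that
    is numerically closest to the key, stopping when no neighbor is
    strictly closer than the current node (which then owns the key)."""
    path = [node]
    while True:
        best = min(routing_tables[node], key=lambda n: abs(n - key))
        if abs(best - key) >= abs(node - key):
            return path  # current node is closest: it owns the key
        node = best
        path.append(node)

# Invented 5-node overlay; each node knows only a few others:
tables = {0: [4, 8], 4: [0, 6, 8], 6: [4, 7], 7: [6, 8], 8: [0, 4, 7]}
print(route(7, 0, tables))  # [0, 8, 7]
```

Because every hop strictly decreases the distance to the key, the message converges on the key's owner even though no node holds a complete routing table.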
Streaming Applications

Streaming applications, such as live audio and video delivery, pose additional requirements for efficient point-to-multipoint data distribution. The bandwidth requirements of these applications are significant, imposing the need for load-balancing measures during multicast tree construction. The end-to-end delay from the source to the receivers may be high if the content traverses long paths of nodes until it reaches the leaf nodes of the multicast tree. Hence, the need for a small tree height is apparent. Furthermore, traffic bottlenecks may appear in the tree topology if non-leaf nodes are required to forward the multicast content to a large number of descendant nodes. In effect, the fan-out degree of each node in the tree must be bounded. Overall, bandwidth availability is an essential criterion for the construction of efficient multicast trees and the adequate placement of each node in the hierarchical topology. A significant approach addressing these issues, followed by Castro et al. (30) and Padmanabhan et al. (31), is based on the creation of multiple multicast trees per group. The multicast content is encoded in several separately decodable streams (stripes) of lower quality, and each stream is transmitted over a separate tree. All trees share the same root (source) and leaf nodes, but consist of disjoint sets of intermediate-level nodes. The main goal is to distribute the forwarding load among the participating nodes, and this is achieved by each interior node bearing the burden of forwarding a single, lighter stream. Each node participates in all trees, either as an interior or a leaf node, in order to receive all stripes and reconstruct the original content. It is noted that the reception of all stripes is necessary only for the reconstruction of the content in its initial quality. The reception of fewer stripes typically still results in the reconstruction of the content, but at a lower quality.
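The role assignment underlying the multiple-tree idea, with every node forwarding in only one stripe, can be sketched with a simple round-robin assignment. This is a toy model of the load-spreading property, not the actual tree-construction algorithm of the cited systems.

```python
def assign_interior_roles(nodes, num_stripes):
    """Round-robin sketch of stripe load balancing: node i acts as an
    interior (forwarding) node only in stripe i mod k, and receives as
    a leaf in all other stripes, so each node forwards at most one
    stripe's worth of bandwidth."""
    return {n: i % num_stripes for i, n in enumerate(nodes)}

roles = assign_interior_roles(["a", "b", "c", "d", "e", "f"], 3)
print(roles)  # {'a': 0, 'b': 1, 'c': 2, 'd': 0, 'e': 1, 'f': 2}
# Every stripe has some forwarders, and no node forwards more than one.
```

The failure-robustness claim follows from the same assignment: losing one node removes forwarders from only one stripe, so downstream nodes lose quality rather than the whole stream.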
In addition to the balancing of the forwarding load, this approach may also achieve robustness to node failures. A node failure will only result in the loss of a single stripe by the leaf nodes served by the failing node. In effect, the leaf nodes will only experience a degradation of the quality of the received content rather than an interruption of the streaming service. In addition to the aforementioned approach, Tran et al. (29) propose the construction of the multicast tree based on a multilayer hierarchy of bounded-size clusters of nodes. In each cluster, a head node is responsible for monitoring membership in the cluster, and an associate head node is responsible for transmitting the content to cluster members. Thus, membership management is distributed, relieving the source node of the burden of overall tree management. The height of the resulting tree is at most logarithmic in the size of the node population, and the fan-out degree of each node is bounded by a constant.

BIBLIOGRAPHY

1. M. Ahamad, Multicast Communication in Distributed Systems. IEEE Computer Society Press, Los Alamitos, CA, 1990.
2. M. H. Ammar, G. C. Polyzos, and S. Tripathi (eds.), Special Issue on "Network Support for Multipoint Communication," IEEE Journal on Selected Areas in Communications, 15: 273–588, 1997.
3. K. P. Birman, A. Schiper, and P. Stephenson, Lightweight Causal and Atomic Group Multicast. National Aeronautics and Space Administration, Washington, D.C., 1991.
4. J. C. Bolot, T. Turletti, and I. Wakeman, Scalable feedback control for multicast video distribution in the Internet. Computer Communications Review, 24(4): 58–67, 1994.
5. S. Y. Cheung and M. H. Ammar, Using destination set grouping to improve the performance of window-controlled multipoint connections. Computer Communications, 19: 723–736, 1996.
6. S. E. Deering and D. R. Cheriton, Multicast routing in datagram internetworks and extended LANs. ACM Transactions on Computer Systems, 8(2): 85–110, 1990.
7. S. Deering, D. Estrin, D. Farinacci, V. Jacobson, C. Liu, and L. Wei, The PIM architecture for wide-area multicast routing. IEEE/ACM Transactions on Networking, 4: 153–162, 1996.
8. C. Diot, W. Dabbous, and J. Crowcroft, Multipoint communication: a survey of protocols, functions, and mechanisms. IEEE Journal on Selected Areas in Communications, 15: 277–290, 1997.
9. M. Doar and I. Leslie, How bad is naive multicast routing? Proc. IEEE INFOCOM '93: 82–89, 1993.
10. M. R. Garey, R. L. Graham, and D. S. Johnson, The complexity of computing Steiner minimal trees. SIAM Journal on Applied Mathematics, 34: 477–495, 1978.
11. S. L. Hakimi, Steiner's problem in graphs and its implications. Networks, 1: 113–133, 1971.
12. J. Moy, Multicast routing extensions for OSPF. Communications of the ACM, 37(8): 61–66, 1994.
13. B. K. Kabada and J. M. Jaffe, Routing to multiple destinations in computer networks. IEEE Transactions on Communications, 31: 343–351, 1983.
14. V. P. Kompella, J. C. Pasquale, and G. C. Polyzos, Multicast routing for multimedia communication. IEEE/ACM Transactions on Networking, 1(3): 286–292, 1993.
15. L. Kou, G. Markowsky, and L. Berman, A fast algorithm for Steiner trees. Acta Informatica, 15: 141–145, 1981.
16. J. C. Pasquale, G. C. Polyzos, E. W. Anderson, and V. P. Kompella, The multimedia multicast channel. Internetworking: Research and Experience, 5(4): 151–162, 1994.
17. J. C. Pasquale, G. C. Polyzos, and G. Xylomenos, The multimedia multicast problem. Multimedia Systems, 6(1): 43–59, 1998.
18. H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, RTP: A Transport Protocol for Real-Time Applications. Internet Request For Comments, RFC 1889, 1996.
19. D. Towsley, J. Kurose, and S. Pingali, A comparison of sender-initiated and receiver-initiated reliable multicast protocols. IEEE Journal on Selected Areas in Communications, 15: 398–406, 1997.
20. G. Xylomenos and G. C. Polyzos, IP multicast for mobile hosts. IEEE Communications Magazine, 35(1): 54–58, 1997.
21. L. Zhang, S. Deering, D. Estrin, S. Shenker, and D. Zappala, RSVP: a new resource ReSerVation Protocol. IEEE Network, 7(5): 8–18, 1993.
22. W. D. Zong, Y. Onozato, and J. Kaniyil, A copy network with shared buffers for large-scale multicast ATM switching. IEEE/ACM Transactions on Networking, 1(2): 157–165, 1993.
23. A. Rowstron and P. Druschel, Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. Proc. IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany, pp. 329–350, November 2001.
24. B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz, Tapestry: A resilient global-scale overlay for service deployment. IEEE Journal on Selected Areas in Communications, 22(1), 2004.
25. S. Zhuang, B. Zhao, A. Joseph, R. Katz, and J. Kubiatowicz, Bayeux: An architecture for scalable and fault-tolerant wide-area data dissemination. Proc. NOSSDAV, pp. 11–20, 2001.
26. S. Ratnasamy, M. Handley, R. Karp, and S. Shenker, Application-level multicast using content-addressable networks. Proc. International Workshop on Networked Group Communication (NGC), November 2001.
27. M. Castro, P. Druschel, A.-M. Kermarrec, and A. Rowstron, Scribe: A large-scale and decentralized application-level multicast infrastructure. IEEE Journal on Selected Areas in Communications, 20(8), 2002.
28. M. Castro, M. B. Jones, A.-M. Kermarrec, A. Rowstron, M. Theimer, H. Wang, and A. Wolman, An evaluation of scalable application-level multicast built using peer-to-peer overlays. Proc. IEEE INFOCOM, 2: 1510–1520, 2003.
29. D. A. Tran, K. Hua, and T. Do, A peer-to-peer architecture for media streaming. IEEE Journal on Selected Areas in Communications, 22(1): 121–133, 2003.
30. M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron, and A. Singh, SplitStream: High-bandwidth multicast in cooperative environments. Proc. ACM Symposium on Operating Systems Principles, 2003.
31. V. N. Padmanabhan, J. J. Wang, P. A. Chou, and K. Sripanidkulchai, Distributing streaming media content using cooperative networking. Proc. NOSSDAV, pp. 177–186, 2002.
GEORGE C. POLYZOS
K. KATSAROS
Athens University of Economics and Business, Athens, Greece
11
Wiley Encyclopedia of Electrical and Electronics Engineering

Multiple Access Schemes
Moshe Sidi, Technion—Israel Institute of Technology, Haifa, Israel
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5317
Article Online Posting Date: December 27, 1999
Abstract: The sections in this article are Basic Model, Conflict-Free Schemes, Contention-Based Schemes, and Collision Resolution Schemes.
MULTIPLE ACCESS SCHEMES
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.

Communication channels are major components of computer communication networks. They provide the physical media over which signals representing data are transmitted from one node of the network to another. Communication channels can be classified into two main categories: point-to-point channels and shared channels. Typically, the backbone
of wide area networks (WAN) consists of point-to-point channels, whereas local area networks (LAN) use shared channels. Point-to-point channels are dedicated to connecting a pair of nodes of the network. They are usually used in fixed-topology networks, and their cost depends on many parameters such as distance and bandwidth. An important characteristic of these channels is that nodes do not interfere with each other; in other words, transmissions between a pair of nodes have no effect on the transmissions between another pair of nodes, even if a node is common to the two pairs. Shared channels are used when point-to-point channels are not economical or not available, or when dynamic topologies are preferable. In a shared channel, also called a broadcast channel, several nodes can potentially transmit and/or receive messages at the same time. Shared channels appear naturally in radio networks, satellite networks, and some local area networks (e.g., Ethernet). Their deployment is usually easier than that of point-to-point channels. An important characteristic of shared channels is that transmissions of different nodes interfere with each other; specifically, one transmission coinciding in time with another may cause neither to be received. This means that the success of a transmission between a pair of nodes is no longer independent of other transmissions. To have successful transmissions in shared channels, interference must be avoided or at least controlled. The channel allocation among the competing nodes is critical for proper operation of the network. This article focuses on access schemes to such channels, known as multiple access schemes. These schemes are nothing more than channel allocation rules that determine who goes next on the channel, aiming at some desirable network performance characteristics. Multiple access schemes belong to a sublayer of the data link layer called the medium access control (MAC) layer, which is especially important in LANs.
Multiple access schemes are natural not only in communication systems but also in many other systems, such as computer systems, storage facilities, or servers of any kind, where resources are shared by a number of nodes. In this article we mainly address shared communication channels. One way to classify multiple access schemes is according to the level of contention that is allowed among the nodes of the network. On the one hand, there are the conflict-free schemes that ensure that each transmission is successful, namely, it will not be interfered with by any other transmission. On the other hand, there are the contention-based schemes that do not guarantee that a transmission will be successful, namely, it might be interfered with by another transmission. Conflict-free transmissions can be achieved by allocating the shared channel in an adaptive or nonadaptive (static) manner. Two common static allocations are time division multiple access (TDMA), where the entire available bandwidth is allocated to a single node for a fraction of the time, and frequency division multiple access (FDMA), where a fraction of the available bandwidth is allocated to a single node for all of the time. Adaptive allocations are usually based on demands, so that nodes that are idle use only a little of the shared channel, leaving the majority of their share to other, more active nodes. Adaptive allocations can be done by various reservation schemes using either central or distributed network control. Polling algorithms illustrate central control,
whereas ring networks generally use distributed control based on token-passing mechanisms. It is important to note that idle nodes consume their portion of the shared channel when conflict-free schemes are used. The aggregate channel portion of idle nodes becomes significant when the number of potential nodes in the system is very large, to the extent that conflict-free schemes might become impractical. When contention-based schemes are used, it is essential to devise algorithms that resolve conflicts when they occur, so that messages are eventually transmitted successfully. Conflict-resolution algorithms can be either adaptive or nonadaptive (static). Static resolution can be deterministic, using some fixed priority that is assigned to the nodes, or it can be probabilistic, when the transmission schedule for interfered nodes is chosen from a fixed distribution, as is done in Aloha-type schemes and the various versions of carrier-sensing multiple access (CSMA) schemes. Adaptive resolutions attempt to track the system evolution and exploit the available information. For example, resolution can be based on time of arrival, giving highest (or lowest) priority to the oldest message in the system, as is done in some tree-based algorithms. Alternatively, resolution can be probabilistic but such that the statistics change dynamically according to the extent of the interference. This category includes estimating the multiplicity of the interfering nodes and the exponential back-off scheme of the Ethernet standard. Note that when the population of potential nodes in the system grows beyond a certain point, conflict-free schemes become impractical, and contention-based protocols are then the only possible solution. The goal of this article is to survey typical examples of multiple access schemes. These examples include TDMA, FDMA, Aloha, polling, and tree-based schemes.
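The exponential back-off rule mentioned above can be sketched in a few lines. The truncated doubling below follows the classic Ethernet convention of at most 10 doublings; the function name and parameters are illustrative, not taken from any standard text.

```python
import random

def backoff_slots(collisions, max_doublings=10, rng=None):
    """Truncated binary exponential back-off: after the k-th collision,
    wait a whole number of slots drawn uniformly from 0 .. 2^min(k, 10) - 1."""
    rng = rng or random
    k = min(collisions, max_doublings)
    return rng.randint(0, 2 ** k - 1)

# After each successive collision the expected wait roughly doubles,
# adapting the retransmission rate to the unknown number of colliders.
waits = [backoff_slots(k) for k in range(1, 6)]
```

Because the mean wait doubles with every collision, the aggregate retransmission rate adapts downward as the (unobserved) number of colliding nodes grows, which is exactly the dynamic behavior described above.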
The space allocated to the topic of multiple access schemes in the encyclopedia (which is yet another shared resource) is just too small to include all the ingenious multiple access schemes that have been designed by researchers over the years. Interested readers should refer to books on the subject (e.g., Rom and Sidi (23) and Hammond and O'Reilly (22)) and to the international journals that have published papers on the subject.
BASIC MODEL

When multiple access schemes are devised, a collection of nodes that communicate with each other or with a central node via a single shared channel is considered. In general, the ability of a node to hear the transmission of another node depends on the transmission power used, on the distance between the two nodes, and on the sensitivity of the receiver at the receiving node. We assume single-hop topologies in which all nodes hear one another, and whenever messages are transmitted successfully they arrive at their destinations. The shared channel is the medium through which data are transferred from their sources to their destinations. The total transmission rate possible in the channel is C bits/s. We consider an errorless collision channel. A collision is a situation in which, at the receiver, two or more transmissions overlap in time, wholly or partially. A collision channel is one in which none of the colliding transmissions is received correctly; they must be retransmitted until they are received correctly. We assume that nodes can detect collisions. The channel is errorless in the sense that a single transmission heard at a node
is always received correctly. Other possible channels include the noisy channel, in which errors may occur even if only a single transmission is heard at a node; furthermore, the channel may be such that errors between successive transmissions are not independent. Another channel type is the capture channel, in which one of the colliding transmissions captures the receiver and can be received correctly. Yet another case is a channel in which coding is used, so that even if transmissions collide the receiver can still decode some or all of the transmitted information. The basic unit of data generated by a node is a message. It is possible, though, that because of its length a message cannot be transmitted in a single transmission and must therefore be broken into smaller units called packets, each of which can be transmitted in a single channel access. A message consists of an integral number of packets, although the number of packets in a message can vary randomly. Packet size is measured by the time required to transmit the packet after access to the channel has been granted. Typically, all packets are of equal size, say L bits. The number of nodes that share the channel is denoted by M. When M becomes very large, the population of nodes is referred to as an infinite population. Only contention-based schemes can cope with an infinite node population. The aggregate arrival process of new packets is assumed to be Poisson with rate Λ packets/s. When the population is finite, the arrival rate to each node is λ = Λ/M packets/s. Nodes are generally not assumed to be synchronized and are capable of accessing and transmitting their messages on the shared channel at any time. Another important class of systems is that of slotted systems, in which there is a global clock that marks discrete intervals of time called slots, whose length is usually the time required to transmit a packet (i.e., T = L/C s). In these systems, transmissions of packets start only at slot starts.
The slot length is therefore T = L/C s. Other operations, such as determining activities on the channel, can be done at any time. In some models, nodes can tell whether the shared channel is in use before trying to use it. If the channel is sensed as busy, no node will attempt to use it until it goes idle, in order to reduce interference. Naturally, additional hardware is required at each node to implement the sensing ability. In other models, nodes cannot sense the channel before trying to use it. They just go ahead and transmit according to their access scheme. Only later can they determine, via the feedback mechanism, whether or not the transmission was successful. Feedback in general is the information available to the nodes regarding activities on the shared channel at prior times. This information can be obtained by listening to the channel or by explicit acknowledgment messages sent by the receiving node. For every scheme, there exist some instants of time (typically slot boundaries or ends of transmissions) at which feedback information is available. Common feedback information indicates whether a message was successfully transmitted, a collision took place, or the channel was idle. Feedback mechanisms do not consume the shared channel resources because they usually use a different channel or are able to determine the feedback locally. Other feedback variations include indication of the exact or the estimated number of colliding transmissions, or uncertain feedback (e.g., in the case of a noisy channel).
The important performance measures of multiple access schemes are their throughput and delay. The throughput of the channel is the aggregate average amount of data transported successfully through the channel in a unit of time. The throughput equals the fraction of time in which the channel is engaged in the successful transmission of node data; it will be denoted by S, and obviously S ≤ 1. In conflict-free access schemes, the throughput is also the total or offered load on the shared channel. However, in contention-based access schemes, the offered load on the shared channel includes transmissions of new packets as well as retransmissions of packets that collide with each other. The offered load is denoted by g (measured in packets per second) and, obviously, g ≥ Λ. The normalized offered load (i.e., the rate at which packets are transmitted on the channel, per packet transmission time) is denoted by G = g·T and, obviously, G ≥ S. Delay is the time from the moment a message is generated until it arrives successfully across the shared channel. Here one must distinguish between the node and the system measures, because the average delay measured for the entire system does not necessarily reflect the average delay experienced by any of the nodes. In "fair" or homogeneous systems, we expect these to be almost identical. The average delay is denoted by D seconds, and its normalized version, measured in units of packet transmission times, is denoted by D̂ (i.e., D̂ = D/T = D·C/L). Another important performance criterion is system stability. Unfortunately, some schemes' characteristics may be such that some message-generation rates, even smaller than the maximal transmission rate in the channel, cannot be sustained by the system for a long time. Evaluation of those input rates for which the system remains stable is therefore essential.
Ideal Access Scheme

Before introducing the various multiple access schemes, let us consider an ideal scheme to use the shared channel. Ideally, transfer of the channel from one node to another can be accomplished instantaneously, without cost. Furthermore, whenever a node has data to transmit, some ingenious central controller knows this instantaneously and assigns the channel to that node in case the channel is idle. If the channel is busy, packets that arrive at the nodes are queued. For our purposes, the order in which packets of different nodes are served is not important. The performance of the ideal scheme serves as a bound to what can be expected from any practical access scheme. The way the ideal scheme operates is identical to the operation of a single queue that is served by a single server, because packets do not interfere and because no time is wasted in transferring the channel use from one node to another. Because arrivals of new packets are according to a Poisson process and time is slotted, the performance of the ideal scheme is that of an M/D/1 queue. The throughput of an M/D/1 queue is just the utilization factor of the server as long as S < 1 (the stability condition), and it equals the offered load; in other words,

S = G = ΛT = λ·M·L/C    (1)
The normalized average delay of an M/D/1 queue is given by (as long as S < 1)

D̂ = 1 + S/(2(1 − S)) = (2 − S)/(2(1 − S))    (2)
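Equations (1) and (2) are easy to evaluate numerically. The sketch below does so; the parameter values (λ = 10 packets/s per node, M = 50 nodes, L = 1000 bits, C = 1 Mbit/s) are arbitrary examples, not values taken from the text.

```python
def ideal_throughput(lam, M, L, C):
    """Eq. (1): S = G = Lambda*T, with Lambda = lam*M packets/s and T = L/C s."""
    return lam * M * L / C

def ideal_delay(S):
    """Eq. (2): normalized M/D/1 delay, valid only for S < 1."""
    assert 0 <= S < 1, "stable operation requires S < 1"
    return 1 + S / (2 * (1 - S))

S = ideal_throughput(lam=10.0, M=50, L=1000, C=1_000_000)  # S = 0.5
D_hat = ideal_delay(S)                                     # 1 + 0.5/1 = 1.5
```

Note how D̂ blows up as S approaches 1; this 1/(1 − S) behavior dominates every scheme analyzed in the sequel.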
The unit term in the expression for D̂ is the normalized transmission time of a packet, whereas S/[2(1 − S)] is the normalized waiting time of a packet until being transmitted. No access scheme can achieve throughput higher than the S given in Eq. (1), and no access scheme can provide normalized average delays lower than the D̂ given in Eq. (2). These quantities will serve as yardsticks in the sequel.

CONFLICT-FREE SCHEMES

Conflict-free schemes are designed to ensure that a transmission, whenever made, is not interfered with by any other transmission and is therefore successful. This is achieved by allocating the channel to the nodes without any overlap between the portions of the channel allocated to different nodes. An important advantage of conflict-free access protocols is the ability to ensure fairness among nodes and the ability to control the packet delay, a feature that may be essential in real-time applications. We consider both fixed-assignment schemes and dynamic schemes that guarantee no conflicts. In fixed-assignment schemes the channel allocation is predetermined (typically at network design time) and is independent of the demands of the nodes in the network. The most well-known fixed-assignment schemes are frequency division multiple access and time division multiple access. For both FDMA and TDMA, no overhead, in the form of control messages, is incurred. However, because of the static and fixed assignment, parts of the channel might be idle even though some nodes have data to transmit. Dynamic channel allocation schemes attempt to overcome this drawback by changing the channel allocation based on the current demands of the nodes. These schemes use some kind of reservation strategy based on either centralized or distributed polling.

Fixed Assignment

Both FDMA and TDMA are among the oldest and best understood access schemes, widely used in practice. They are the most common implementations of fixed-assignment schemes.
With FDMA the entire available frequency band is divided into bands, each of which is used by a single node. Every node is therefore equipped with a transmitter for a given, predetermined frequency band and a receiver for each band (which can be implemented as a single receiver for the entire range with a bank of band-pass filters for the individual bands). With TDMA the time axis is divided into time slots, preassigned to the different nodes. Every node is allowed to transmit freely during the slot assigned to it; that is, during the assigned slot the entire shared channel is devoted to that node. The slot assignments follow a predetermined pattern that repeats itself periodically; each such period is called a frame. In most TDMA implementations, every node has exactly one slot in every frame. The main advantage of both FDMA and TDMA is that each transmission is guaranteed to be successful and no control
messages are required. An additional advantage of FDMA is its simplicity: it does not require any coordination or synchronization among the nodes, because each can use its own frequency band without interference. However, both FDMA and TDMA are wasteful, especially when the load is momentarily uneven, because when one node is idle its share of the channel cannot be used by other nodes. Another drawback of FDMA and TDMA is that they are not flexible; adding a new node to the network requires equipment or software modification in every other node. In addition, both waste some portion of the channel to ensure no overlap (either in time or in bandwidth) between the transmissions of different nodes. FDMA uses guard bands between the subchannels, and TDMA uses guard times to separate the nodes. Neglecting the channel waste resulting from guard bands or times, the throughput of FDMA and TDMA is identical to that of the idealized scheme, because packets are never transmitted more than once. Therefore, we have for both

S = G = ΛT = λ·M·L/C
The delay characteristics of FDMA and TDMA are different. With FDMA the transmission rate of each node is C/M bits/s; therefore, the time to transmit a packet is M·L/C seconds. Each node can be modeled as an M/D/1 queue with arrival rate λ = Λ/M and service time M·L/C. The normalized average delay is, therefore,

D̂ = M(1 + S/(2(1 − S))) = M(2 − S)/(2(1 − S))

which is M times larger than the normalized average delay of the ideal scheme. With TDMA the transmission rate of each node is C bits/s, and the time to transmit a packet is L/C seconds. Each node can be modeled as an M/D/1 queue with arrival rate λ = Λ/M, but service is granted to the node only once per frame, namely every M·L/C seconds. The normalized average delay is therefore

D̂ = 1 + M/(2(1 − S))
Comparing the throughput-delay characteristics of FDMA and TDMA, we note that

D̂_FDMA = D̂_TDMA + M/2 − 1
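A quick numerical check of the two delay formulas (a sketch; the function names are ours) confirms that the gap between them is M/2 − 1 at every load:

```python
def fdma_delay(S, M):
    """Normalized FDMA delay: M * (2 - S) / (2 * (1 - S))."""
    return M * (2 - S) / (2 * (1 - S))

def tdma_delay(S, M):
    """Normalized TDMA delay: 1 + M / (2 * (1 - S))."""
    return 1 + M / (2 * (1 - S))

M = 15  # as in Fig. 1
gaps = [fdma_delay(S, M) - tdma_delay(S, M) for S in (0.1, 0.5, 0.9)]
# Every gap equals M/2 - 1 = 6.5, independent of the load S.
```

At high load both delays are dominated by the 1/(1 − S) factor, so their ratio tends to 1 even though the absolute gap stays fixed.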
We thus conclude that for any reasonable parameters the TDMA normalized average delay is always less than that of FDMA; the difference grows linearly with the number of nodes and is independent of the load. The difference stems from the fact that the actual transmission of a packet in TDMA takes only a single slot, whereas in FDMA it lasts the equivalent of an entire frame. This difference is somewhat offset by the fact that a packet arriving at an empty node may need to wait until the proper slot when a TDMA scheme is employed, whereas in FDMA transmission starts right away. It must be remembered, though, that at high throughput the dominant factor in the normalized average delay is inversely proportional to (1 − S) in both TDMA and FDMA; therefore,
Figure 1. TDMA and FDMA performance: normalized delay versus throughput for the ideal scheme, TDMA, and FDMA (M = 15).
the ratio of the normalized average delays of the two schemes approaches unity as the load increases. Figure 1 depicts the delay-throughput characteristics of TDMA, FDMA, and the ideal access scheme for M = 15 users. Further Reading. Many texts treating FDMA and TDMA are available [e.g., Martin (1) and Stallings (2)]. A good analysis of TDMA and FDMA can be found in Ref. 3. A sample-path comparison between the FDMA and TDMA schemes is carried out in Ref. 4, where it is shown that TDMA is better than FDMA not just on the average. A TDMA scheme in which the packets of each node are serviced according to a priority rule is analyzed by De Moraes and Rubin (5). The question of optimal allocation of slots to the nodes in generalized TDMA (in which a node can have more than one slot in a frame) is addressed in Itai and Rosberg (6), where the throughput of the network is maximized (assuming a single buffer for each node), and in Hofri and Rosberg (7), where the expected packet delay in the network is minimized. Message delay (as opposed to packet delay) for generalized TDMA is analyzed by Rom and Sidi (8).

Dynamic Assignment

Static conflict-free protocols such as the FDMA and TDMA schemes do not use the shared channel very efficiently, especially when the network is lightly loaded or when the loads of different nodes are asymmetric. The static and fixed assignment in these schemes causes portions of the channel to remain idle even though some nodes have data to transmit. Dynamic channel allocation schemes are designed to overcome this drawback. With dynamic allocation strategies, the channel allocation changes with time and is based on the current (and possibly changing) demands of the various nodes. The better and more responsive use of the shared channel achieved with dynamic schemes does not come for free; it requires control overhead that is unnecessary with fixed-assignment schemes and consumes a portion of the channel.
To ensure conflict-free operation, it is necessary to reach an agreement among the nodes on who transmits in a given slot. This agreement entails collecting information as to which nodes have packets to transmit and an arbitration
scheme that selects one of these nodes to transmit in the slot. Both the information collection and the arbitration can be achieved using centralized control or distributed control. A representative example of schemes that use centralized control are polling schemes. The basic feature of polling schemes is the operation of a central controller that polls the nodes of the network in some predetermined order (the most common being round-robin) to provide access to the shared channel. When a node is polled and has packets to transmit, it uses the whole shared channel to transmit its backlogged packets. With an exhaustive policy, the node empties its backlog completely, whereas with a gated policy it transmits only those packets that reside in its queue at the polling instant. The last transmitted packet contains an indication that the central controller can poll the next node. If a polled node does not have packets to transmit, the next node is polled. In between polls, nodes accumulate the arriving packets in their queues and do not transmit until polled. The control overhead of polling schemes is a result of the time required to switch from one node to the next. The switching time, denoted by w, includes all the time necessary to transfer the poll (channel propagation delay, transmission time of polling and response packets, etc.). We let ŵ = w/T denote the normalized switching time. The throughput of a polling scheme is identical to that of an ideal scheme and is given by Eq. (1). The normalized average delay is given by

D̂ = 1 + S/(2(1 − S)) + Mŵ(1 − S/M)/(2(1 − S))
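The polling-delay expression above is straightforward to evaluate; this is a sketch with an illustrative function name:

```python
def polling_delay(S, M, w_hat):
    """Normalized average delay of round-robin polling:
    the ideal M/D/1 delay plus a term for the switching overhead."""
    return 1 + S / (2 * (1 - S)) + M * w_hat * (1 - S / M) / (2 * (1 - S))

# With zero switching time (w_hat = 0) the expression collapses to the
# ideal bound of Eq. (2); any w_hat > 0 adds delay at every load.
```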
We note that the first two terms are just the normalized average delay of the ideal scheme, and the third term reflects the overhead resulting from the switching times from one node to the next. As an example of a distributed dynamic conflict-free scheme we use the minislotted alternating priority (MSAP) scheme (9). The MSAP scheme allows the nodes to determine in a distributed manner the order in which they will use the shared channel, assuming the nodes are ordered according to some priority rule. The priority rule can either be static or change in a round-robin manner in each slot. MSAP is based on distributed reservations. To describe its operation, we need to define the slot structure. Let τ (seconds) denote the maximum system propagation delay, that is, the longest time it takes for a signal emitted at one end of the network to reach the other end. The quantity τ plays a crucial role in multiple access schemes. Its normalized version is denoted by a = τ/T. Let every slot consist of an initial M − 1 reservation minislots, each of duration τ, followed by a data transmission period of duration T, followed by another minislot. Only those nodes wishing to transmit in a slot take any action: a node that does not wish to transmit in a given slot remains quiet for the entire slot duration. Given that every node wishing to transmit knows its own priority, the nodes behave as follows. If the node of the highest priority wishes to transmit in this slot, then it starts immediately. Its transmission consists of an unmodulated carrier for a duration of M − 1 minislots followed by a packet of duration T. A node of the ith priority (2 ≤ i ≤ M) wishing to transmit in this slot will do so only if the first i − 1 minislots are idle. In this case, it will transmit M − i minislots of unmodulated carrier followed
Figure 2. Dynamic access: normalized delay versus throughput for the ideal and dynamic (MSAP) schemes (M = 15, a = 0.01).
by a packet of duration T. The specific choice of the minislot duration ensures that when a given node transmits in a minislot, all other nodes know it by the end of that minislot, allowing them to react appropriately. The additional minislot at the end allows the data signals to reach every node of the network. This is needed to ensure that all nodes start synchronized in the next slot, as required by the reservation scheme. The fraction of slots in which transmissions take place is ΛT. Because a fraction Mτ/(T + Mτ) of every slot is overhead, we conclude that the throughput of this scheme is

S = ΛT · T/(T + Mτ) = ΛT/(1 + Ma)
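The MSAP arbitration rule (the highest-priority ready node wins, because every lower-priority node sees a busy minislot and defers) and the resulting capacity can be sketched as follows; the helper names are hypothetical:

```python
def msap_winner(ready, M):
    """Among nodes 0 (highest priority) .. M-1 (lowest) that are ready to
    transmit, return the one that wins the slot: the first node whose
    preceding minislots are all idle. Returns None for an idle slot."""
    for node in range(M):
        if node in ready:
            return node
    return None

def msap_capacity(M, a):
    """Maximum throughput: only T out of every slot of length T + M*tau
    carries data, so the capacity is 1/(1 + M*a) with a = tau/T."""
    return 1.0 / (1 + M * a)
```

For M = 15 and a = 0.01 (as in Fig. 2) the capacity is 1/1.15, about 0.87, which is where the dynamic-access delay curve diverges.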
The normalized average delay is obtained by using standard analysis of priority queues, and it is given by

D̂ = (1 + Ma)(1 + 1/(2[1 − (1 + Ma)S]))

Figure 2 depicts the delay-throughput characteristics of the dynamic-access schemes for M = 15 users. Further Reading. The variants of polling schemes are numerous. Reference 10 contains the analysis of most of the basic schemes, with a long list of references that is complemented in Ref. 11. In Ref. 12 more advanced schemes are described, along with some optimization considerations in the operation of polling schemes, such as the determination of the poll order of the nodes. The MSAP scheme described previously represents an entire family of schemes that guarantee conflict-free transmissions using distributed reservations. All these schemes have a sequence of preceding bits serving to reserve or announce upcoming transmissions (this is known as the reservation preamble). In MSAP there are M − 1 such bits for every transmitted packet. An improvement to the MSAP scheme is the bit-map protocol described by Tanenbaum (13). The idea is to use a single reservation preamble to schedule more than a single transmission, using the fact that all participating nodes are aware of the reservations made in the preamble.
The bit-map scheme requires synchronization among the nodes that is somewhat more sophisticated than that of the MSAP scheme, but the overhead paid per transmitted packet is less than the overhead for MSAP. Another variation of a reservation scheme has been described by Roberts (14). There, every node can make a reservation in every minislot of the reservation preamble, and if the reservation remains uncontested, the reserving node will transmit. If there is a collision in the reservation minislot, all nodes but the "owner" of that minislot will abstain from transmission. Altogether, this is a standard TDMA with idle slots made available to be grabbed by others. Several additional reservation and TDMA schemes are also analyzed by Rubin (4). One of the most efficient reservation schemes is the broadcast recognition access method (BRAM) (15). This is essentially a combination of the bit-map and MSAP schemes. As with MSAP, a reservation preamble serves to reserve the channel for a single node, but unlike MSAP the reservation preamble does not necessarily contain all M − 1 minislots. The idea is that nodes start their transmission with a staggered delay, only after ensuring that no other transmission is ongoing [Kleinrock and Scholl (9) also refer to a similar scheme]. Under heavy load BRAM reduces to regular TDMA.

CONTENTION-BASED SCHEMES

With the conflict-free schemes discussed earlier, every scheduled transmission is guaranteed to succeed. With contention-based schemes, success of a transmission is not guaranteed in advance, because whenever two or more nodes transmit on the shared channel simultaneously, a collision occurs and the data cannot be received correctly. This being the case, packets may have to be transmitted and retransmitted until eventually they are correctly received. Transmission scheduling is therefore the focal concern of contention-based schemes.
Pure and Slotted Aloha

The Aloha family of schemes is probably the richest family of multiple access protocols. First, its popularity is a result of seniority: it was the first contention-based scheme to be introduced (16). Second, many of these schemes are so simple that their implementation is straightforward. Many local area networks of today implement some sophisticated variants of this family of schemes. The pure Aloha scheme is the basic scheme in the family, and it is very simple (16): a newly generated packet is transmitted immediately upon generation, in the hope that no other node interferes. If two or more nodes transmit so that their packets overlap (even partially) in time, interference results and the transmissions are unsuccessful. In this case every colliding node, independently of the others, schedules its retransmission to a random time in the future. This randomness is required to ensure that the same set of packets does not continue to collide indefinitely. The Aloha scheme is very well suited to bursty traffic, because a node does not hold the shared channel when it has no packets to transmit. The drawback of this scheme is that network performance deteriorates significantly as a result of excessive collisions at medium and high traffic intensities. The Aloha scheme is a completely distributed scheme that allows every node to operate independently of the others.
The exact characterization of the offered load to the channel for the pure Aloha scheme is extremely complicated. To overcome this complexity, it is standard to assume that the offered load forms a Poisson process (with rate g, of course). This flawed assumption is an approximation (as has been shown by simulation) that simplifies the analysis of Alohatype schemes considerably and provides some initial intuitive understanding of the ALOHA scheme. Consider a packet (new or retransmitted) whose transmission starts at time t. This packet will be successful if no other packet is transmitted in the interval (t ⫺ T, t ⫹ T) (this period of duration 2T is called the vulnerable period). The probability of this happening, that is, the probability of success Ps is the probability that no packet is transmitted in an interval of length 2T. Because the transmission points correspond to a Poisson process, we have Ps = e−2gT Now, packets are scheduled at a rate of g per second, of which only a fraction Ps are successful. Thus, the rate of successfully transmitted packets is gPs. When a packet is successful, the channel carries useful information for a period of T seconds; in any other case, it carries no useful information at all. Because the throughput is the fraction of time that useful information is carried on the shared channel, we have
stability requires S ⫽ ⌳T. Larger values of ⌳ clearly cannot result in stable operation. Note, however, that even for smaller values of ⌳ there are two values of G to which it corresponds—one larger and one smaller than . The smaller one is (conditionally) stable, whereas the other one is conditionally unstable, meaning that if the offered load increases beyond that point the system will continue to drift to higher load and lower throughput. Thus, without additional measures of control, the stable throughput of pure Aloha is 0 (17). It is appropriate to note that this theoretical instability is rarely a severe problem in real systems, where the long-term load, including, of course, the ‘‘off-hours’’ load, is fairly small, although temporary problems may occur. The delay characteristic of the Aloha scheme can be approximated as follows. For each packet, the average number of transmission attempts until the packet is transmitted successfully is G/S ⫽ e2G. Thus, the average number of unsuccessful transmission attempts is G/S ⫺ 1 ⫽ e2G ⫺ 1. If a collision occurs, the node reschedules the colliding packet for some random time in the future. Let the average rescheduling time be B (seconds). Each successful transmission attempt requires T seconds and each unsuccessful transmission attempt requires T ⫹ B seconds on the average. Therefore, the average delay is given by D = T + (G/S − 1)(T + B) = T + (e2G − 1)(T + B)
S = gTe−2gT = Ge−2G This relation between S and G is typical to many Aloha-type schemes. For small values of G (light load), the throughput is approximately the offered load. For large values of G (heavy load), the throughput decreases rapidly because of excessive amount of collisions. For pure Aloha we note that for G ⫽ , S takes on its maximal value of 1/2e 앒 0.18. This value is referred to as the capacity of the pure Aloha channel. Figure 3 depicts the load-throughput characteristics for the Alohatype schemes. We recall that for a system to be stable the long-term rate of input must equal the long-term rate of output meaning that
and in a normalized form

D = 1 + (e^(2G) − 1)(1 + B/T)

With pure Aloha, even if the overlap in time between two transmitted packets is very small, both packets are destroyed. The slotted Aloha variation overcomes this drawback; it is simply pure Aloha with a slotted channel. Thus, two (or more) packets either overlap completely or do not overlap at all, and the vulnerable period is reduced to a single slot. In other words, a slot is successful if and only if exactly one packet is transmitted in that slot. Therefore,

S = gT e^(−gT) = G e^(−G)
This relation is very similar to that of pure Aloha, except for the increased throughput. The channel capacity is 1/e ≈ 0.36 and is achieved at G = 1. These results were first derived by Roberts (14). As with the pure Aloha scheme, the normalized average delay for the slotted Aloha scheme is
D = 1 + (e^G − 1)(1 + B/T)

Carrier-Sensing Protocols
Figure 3. Throughput of Aloha and slotted Aloha.
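The Aloha throughput and delay expressions above can be checked numerically; a minimal sketch reproducing the curves of Figure 3 (function names are mine):

```python
import math

def pure_aloha_throughput(G):
    """Pure Aloha: S = G * exp(-2G)."""
    return G * math.exp(-2 * G)

def slotted_aloha_throughput(G):
    """Slotted Aloha: S = G * exp(-G)."""
    return G * math.exp(-G)

def pure_aloha_delay(G, B_over_T):
    """Normalized average delay: D = 1 + (exp(2G) - 1) * (1 + B/T)."""
    return 1 + (math.exp(2 * G) - 1) * (1 + B_over_T)

def slotted_aloha_delay(G, B_over_T):
    """Normalized average delay: D = 1 + (exp(G) - 1) * (1 + B/T)."""
    return 1 + (math.exp(G) - 1) * (1 + B_over_T)

# Capacities quoted in the text: pure Aloha peaks at G = 1/2,
# slotted Aloha at G = 1.
print(pure_aloha_throughput(0.5))    # 1/(2e), about 0.184
print(slotted_aloha_throughput(1.0)) # 1/e, about 0.368
```

At every load the slotted variant also yields a smaller normalized delay, since each unsuccessful attempt is half as likely per unit of offered load.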
The Aloha schemes exhibit fairly poor performance, which can be attributed to the "impolite" behavior of the nodes, namely, whenever one has a packet to transmit it does so without consideration of others. It is clear that in a shared environment even a little consideration can benefit all. Consider a listen-before-talk behavior wherein every node, before attempting any transmission, listens to determine whether somebody else is already using the channel. If this is the case, the node will refrain from transmission to the benefit of all; its packet would clearly not be successful if transmitted; furthermore, disturbing another
MULTIPLE ACCESS SCHEMES
node will cause the currently transmitted packet to be retransmitted, possibly disturbing yet another packet. The process of listening to the shared channel is not that demanding. Every node is equipped with a receiver anyway, and every node can monitor the channel because it is shared. Moreover, detecting another node's transmission does not require receiving the information; it suffices to sense the carrier that is present when signals are transmitted. The carrier-sensing family of schemes is characterized by sensing the carrier and deciding accordingly whether another transmission is ongoing. Carrier sensing does not yield conflict-free operation. Suppose that the channel has been idle for a while and that two nodes concurrently generate a packet. Each will sense the channel, discover that it is idle, and transmit the packet, resulting in a collision. "Concurrently" here does not really mean at the very same time; if one node starts transmitting, it takes some time for the signal to propagate and arrive at the other node. Hence "concurrently" actually means within a time window of duration equal to the signal propagation time. The maximum propagation time in the network is denoted τ, and its normalized version a = τ/T is an important parameter that affects the performance of carrier-sensing schemes. The larger this quantity, the more likely collisions are and the worse the performance becomes. All the carrier-sensing multiple access schemes share the same philosophy: when a node generates a new packet, the channel is sensed, and if found idle the packet is transmitted without further ado. When a collision takes place, every transmitting node reschedules a retransmission of the collided packet to some other time in the future (chosen with some randomization to avoid repeated collisions), at which time the same operation is repeated. The variations on the CSMA scheme are caused by the behavior of nodes that wish to transmit and find (by sensing) the channel busy.
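As a worked example with assumed numbers (none taken from the text): a 1 km cable with propagation speed 2 × 10^8 m/s gives τ = 5 µs; at 10 Mb/s with 1000-bit packets, T = 100 µs, so a = τ/T = 0.05:

```python
# Assumed example parameters (not from the article).
cable_length_m = 1_000
propagation_speed = 2e8      # m/s, typical for copper or fiber
bit_rate = 10e6              # 10 Mb/s
packet_bits = 1_000

tau = cable_length_m / propagation_speed   # max propagation time (s)
T = packet_bits / bit_rate                 # packet transmission time (s)
a = tau / T                                # normalized propagation time
print(tau, T, a)  # 5e-06 0.0001 0.05
```

Longer links, faster bit rates, or shorter packets all increase a and hence the collision window.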
Most of the basic variations were introduced and analyzed by Kleinrock and Tobagi (18–20). In the nonpersistent versions of CSMA (NP-CSMA), a node that generated a packet and found the channel busy refrains from transmitting the packet and behaves exactly as if its packet collided [i.e., it schedules (randomly) the retransmission of the packet to some time in the future]. With NP-CSMA, there are situations in which the channel is idle although one or more nodes have packets to transmit. The 1-persistent CSMA (1P-CSMA) is an alternative to NP-CSMA because it avoids such situations by being a bit more greedy. This is achieved by applying the following rule: a node that senses the channel and finds it busy persists, waiting and transmitting as soon as the channel becomes idle. Consequently, the channel is always used if there is a node with a packet. With the 1-persistent scheme, a collision may occur not only because of nonzero propagation delays but also when two nodes become ready to transmit in the middle of another node's transmission. In this case, both nodes wait until that transmission ends and then begin transmitting simultaneously, resulting in a collision. For slotted operation, CSMA schemes use time slots of duration τ seconds, which is usually much smaller than the slot size of duration T seconds used with slotted Aloha. However, as in slotted Aloha, all nodes using slotted CSMA schemes are forced to start transmission at the beginning of a slot.
Besides the ability to sense the carrier, some local area networks (such as Ethernet) have an additional feature, namely, that nodes can detect interference among several transmissions (including their own) while transmission is in progress and abort transmission of their collided packets. If this can be done sufficiently fast, then the duration of an unsuccessful transmission is shorter than that of a successful one, thus improving the performance of the scheme. Together with carrier sensing, this produces a variation of CSMA known as CSMA/CD (carrier sensing multiple access with collision detection). The operation of all CSMA/CD schemes is identical to the operation of the corresponding CSMA schemes, except that if a collision is detected during transmission, the transmission is aborted and the packet is scheduled for transmission at some later time. For Ethernet networks this random delay is doubled (at most 16 times) each time the packet collides, a scheme known as binary exponential backoff. To ensure that all network nodes indeed detect a collision when it occurs, a consensus reenforcement procedure is used. This procedure is manifested by jamming the channel with a collision signal for a duration of τ_cr seconds, which is usually much larger than the time necessary to detect a collision. We let γ = τ_cr/τ. The analysis of the throughput of CSMA schemes is rather complicated. It is based on computations of the average lengths of idle and transmission periods. For NP-CSMA we have
S = gT e^(−gτ) / [g(T + 2τ) + e^(−gτ)] = G e^(−aG) / [G(1 + 2a) + e^(−aG)]
For slotted NP-CSMA, we have

S = aG e^(−aG) / (1 − e^(−aG) + a)
For 1P-CSMA, we have

S = gT e^(−g(T+2τ)) [1 + gT + gτ(1 + gT + gτ/2)] / [g(T + 2τ) − (1 − e^(−gτ)) + (1 + gτ) e^(−g(T+τ))]
  = G e^(−G(1+2a)) [1 + G + aG(1 + G + aG/2)] / [G(1 + 2a) − (1 − e^(−aG)) + (1 + aG) e^(−G(1+a))]
For slotted 1P-CSMA, we have

S = G e^(−G(1+a)) [1 + a − e^(−aG)] / [(1 + a)(1 − e^(−aG)) + a e^(−G(1+a))]
For nonpersistent CSMA/CD, we have

S = G e^(−aG) / [G e^(−aG) + γ aG(1 − e^(−aG)) + 2aG(1 − e^(−aG)) + 2 − e^(−aG)]
For slotted nonpersistent CSMA/CD, we have

S = G e^(−aG) / [G e^(−aG) + γ aG(1 − e^(−aG) − aG e^(−aG)) + (2 − e^(−aG) − aG e^(−aG))]
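The CSMA throughput expressions can be evaluated directly; a sketch for the four CSMA variants at a = 0.01, as in Figure 4 (function names are mine):

```python
import math

def np_csma(G, a):
    """Unslotted nonpersistent CSMA."""
    e = math.exp(-a * G)
    return G * e / (G * (1 + 2 * a) + e)

def slotted_np_csma(G, a):
    """Slotted nonpersistent CSMA."""
    e = math.exp(-a * G)
    return a * G * e / (1 - e + a)

def one_p_csma(G, a):
    """Unslotted 1-persistent CSMA (Kleinrock-Tobagi)."""
    e = math.exp(-a * G)
    num = G * math.exp(-G * (1 + 2 * a)) * (1 + G + a * G * (1 + G + a * G / 2))
    den = G * (1 + 2 * a) - (1 - e) + (1 + a * G) * math.exp(-G * (1 + a))
    return num / den

def slotted_one_p_csma(G, a):
    """Slotted 1-persistent CSMA."""
    e = math.exp(-a * G)
    num = G * math.exp(-G * (1 + a)) * (1 + a - e)
    den = (1 + a) * (1 - e) + a * math.exp(-G * (1 + a))
    return num / den

a = 0.01
for G in (0.1, 1.0, 10.0):
    print(G, np_csma(G, a), slotted_np_csma(G, a),
          one_p_csma(G, a), slotted_one_p_csma(G, a))
```

Note how the nonpersistent variants keep improving at high load (near 0.82 at G = 10 for a = 0.01) while the 1-persistent variants peak near G = 1 and then collapse, exactly the qualitative behavior shown in Figure 4.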
Figure 4. Throughput of CSMA versions (NP-CSMA, 1P-CSMA, slotted NP-CSMA, and slotted 1P-CSMA) for a = 0.01.
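Returning to the collision-detection mechanism described earlier: Ethernet's binary exponential backoff can be sketched as follows. The truncation at 10 doublings and the inclusive slot range follow classic Ethernet conventions and are assumptions here, not details from the text:

```python
import random

def backoff_slots(collision_count, max_doublings=10):
    """Truncated binary exponential backoff: after the i-th successive
    collision of a packet, wait a uniformly chosen number of slots in
    [0, 2^min(i, max_doublings) - 1]."""
    k = min(collision_count, max_doublings)
    return random.randint(0, 2 ** k - 1)

# The expected wait roughly doubles with each successive collision.
print([backoff_slots(i) for i in range(1, 6)])
```

Doubling the expected wait spreads the retransmissions of repeatedly colliding nodes over an ever larger window, which is what lets the contending nodes eventually separate.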
Figure 4 depicts the load-throughput characteristics for the CSMA-type schemes.

Further Reading. Numerous variations on the environment under which the Aloha and CSMA schemes operate have been addressed in the literature (see, e.g., Refs. 3, 13, and 21–23). For instance, various packet length distributions were considered by Abramson (24) and Ferguson (25) for Aloha and by Tobagi and Hunt (26) for CSMA. The assumption that, whenever two or more packets overlap at the receiver, all packets are lost is overly pessimistic. In radio networks the receiver might correctly receive a packet despite the fact that it overlaps in time with other transmitted packets. This phenomenon is known as capture, and it can happen as a result of various characteristics of radio systems. Most studies (27,28) considered power capture (the phenomenon whereby the strongest of several transmitted signals is correctly received at the receiver). Thus, if a single high-powered packet is transmitted, then it is correctly received regardless of other transmissions. Hence, channel use increases. Reservation schemes that allow contentions are designed to have the advantages of both the Aloha and the TDMA approaches. Examples of reservation schemes appear in Ref. 29, where knowledge of the number of users is needed, and in Refs. 14 and 30, where this knowledge is not required. Approximate analysis of a reservation Aloha protocol can be found in Lam (31). Approximate analysis of the delay was presented by Ferguson (32) for Aloha and by Beuerman and Coyle (33) for CSMA schemes. Instability issues of the Aloha protocol were first identified by Carleial and Hellman (34) and Lam and Kleinrock (35). Later, similar issues were identified for the CSMA family of protocols by Tobagi and Kleinrock (20).

COLLISION RESOLUTION SCHEMES

The original Aloha scheme and its CSMA derivatives are inherently unstable in the absence of some external control.
Looking into the philosophy behind these schemes, it is apparent that there is no serious attempt to resolve collisions among packets as soon as they occur. Instead, the attempts to resolve collisions are always deferred to the future, with the hope that things will then work out somehow, but they never do. Another type of contention-based scheme, with a different philosophy, is the family of collision resolution schemes (CRS). In these schemes the efforts are concentrated on resolving collisions as soon as they occur. Moreover, in most versions of these schemes, new packets that arrive to the network are inhibited from being transmitted while the resolution of collisions is in progress. This ensures that if the rate of arrival of new packets to the system is smaller than the rate at which collisions can be resolved (the maximal rate of departing packets, i.e., the throughput), then the system is stable. The basic idea behind these schemes is to exploit in a more sophisticated manner the feedback information available to the nodes in order to control the retransmission process so that collisions are resolved more efficiently. The most basic collision resolution scheme is called the binary-tree CRS (or binary-tree scheme) and was proposed by Capetanakis (36), Hayes (37), and Tsybakov and Mikhailov (38). According to this scheme, when a collision occurs, say in slot k, all nodes that are not involved in the collision wait until the collision is resolved. The nodes involved in the collision split randomly into two subsets, by (for instance) each flipping a coin. The nodes in the first subset, those that flipped 0, retransmit in slot k + 1, whereas those that flipped 1 wait until all those that flipped 0 transmit their packets successfully. If slot k + 1 is either idle or contains a successful transmission, the nodes of the second subset (those that flipped 1) retransmit in slot k + 2.
If slot k + 1 contains another collision, then the procedure is repeated (i.e., the nodes whose packets collided in slot k + 1 flip a coin again and operate according to the outcome of the coin flipping, and so on). A node having a packet that collided (at least once) is said to be backlogged. The operation of the scheme can also be described by a binary tree in which every vertex corresponds to a time slot. The root of the tree corresponds to the slot of the original collision. Each vertex of the tree also designates a subset (perhaps empty) of backlogged nodes. Vertices whose subsets contain at least two nodes indicate collisions and have two outgoing branches, corresponding to the splitting of the subset into two new subsets. Vertices corresponding to empty subsets or subsets containing one node are leaves of the tree and indicate an idle and a successful slot, respectively. For instance, consider a collision that occurs in slot 1. At this point it is known neither how many nodes nor which nodes collided in this slot. Each of the colliding nodes flips a coin, and those that flipped 0 transmit in slot 2. By the rules of the scheme, no newly arrived packet is transmitted while the resolution of a collision is in progress, so only nodes that collided in slot 1 and flipped 0 transmit in slot 2. Another collision occurs in slot 2, and the nodes involved in that collision flip a coin again. In this example, all the colliding nodes of slot 2 flipped 1, and therefore slot 3 is idle. The nodes that flipped 1 in slot 2 transmit again in slot 4, resulting in another collision and forcing the nodes involved in it to flip a coin once more. One node flips 0 and transmits (successfully) in slot 5, causing all nodes that flipped 1 in slot 4 to transmit in slot 6. In this example, there is one such node, and therefore slot 6 is a successful one. Now that the collision among all nodes that flipped 0 in slot 1 has been resolved, the nodes that flipped 1 in that slot transmit (in slot 7). Another collision occurs, and the nodes involved in it flip a coin. Another collision is observed in slot 8, meaning that at least two nodes flipped 0 in slot 7. The nodes that collided in slot 8 flip a coin and, as it happens, there is a single node that flipped 0, and it transmits (successfully) in slot 9. Then, in slot 10, the nodes that flipped 1 in slot 8 transmit. There is only one such node, and its transmission is, of course, successful. Finally, the nodes that flipped 1 in slot 7 must transmit in slot 11. In this example, there is no such node; hence slot 11 is idle, completing the resolution of the collision that occurred in slot 7 and, at the same time, the one in the first slot. It is clear from this example that each node, including those not involved in the collision, can construct the binary tree by following the feedback signals corresponding to each slot, thus knowing exactly when the collision is resolved. A collision is resolved when the nodes of the network know that all packets involved in the collision have been transmitted successfully. The time interval starting with the original collision (if any) and ending when this collision is resolved is called a collision resolution interval (CRI). In the preceding example the length of the CRI is 11 slots. The binary-tree protocol dictates how to resolve collisions after they occur. To complete the description of the protocol, one must specify when newly generated packets are transmitted for the first time. One alternative, which is assumed all along (known as the obvious-access scheme), is that new packets are inhibited from being transmitted while a resolution of a collision is in progress.
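The binary-tree scheme as described can be both simulated and analyzed with a few lines of code; a sketch (function names are mine). The recursion follows from the scheme's rules: an empty or singleton subset takes one slot, and a collision takes one slot plus the cost of resolving the two random subsets:

```python
import random
from math import comb

def cri_length(n, rng):
    """Simulated number of slots to resolve a collision among n
    packets (one slot if the subset is empty or a singleton)."""
    if n <= 1:
        return 1
    zeros = sum(rng.random() < 0.5 for _ in range(n))   # fair-coin split
    return 1 + cri_length(zeros, rng) + cri_length(n - zeros, rng)

def expected_cri_lengths(n_max):
    """Exact expected CRI lengths: L_0 = L_1 = 1 and, for n >= 2,
    L_n = (1 + 2 * sum_{k<n} C(n,k) 2^-n L_k) / (1 - 2^(1-n)),
    obtained by conditioning on the binomial split and isolating L_n."""
    L = [1.0, 1.0]
    for n in range(2, n_max + 1):
        s = sum(comb(n, k) * 2.0 ** (-n) * L[k] for k in range(n))
        L.append((1 + 2 * s) / (1 - 2.0 ** (1 - n)))
    return L

rng = random.Random(1)
avg = sum(cri_length(2, rng) for _ in range(20000)) / 20000
L = expected_cri_lengths(60)
print(L[2])        # 5.0 (exact expected CRI length for a 2-packet collision)
print(avg)         # simulation estimate, close to 5
print(L[60] / 60)  # per-packet slot cost, approaching roughly 2.89
```

The growing ratio L_n/n is what underlies the stability threshold of the scheme discussed below in the text.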
That is, packets that arrive to the system while a resolution of a collision is in progress wait until the collision is resolved, at which time they are transmitted. In the example above, all new packets arriving to the system during slots 1 through 11 are transmitted for the first time in slot 12. Let L_n be the expected length of a CRI that starts with the transmission of n packets. From the operation of the scheme, it is clear that as long as the arrival rate of new packets into the system is smaller than the ratio n/L_n (for large n), the system is stable. When fair coins are used for splitting the users upon collisions, one can show that for every n,

L_n ≤ 2.886n + 1

yielding a stable system for arrival rates smaller than 0.346. The performance of the binary-tree protocol can be improved in two ways. The first is to speed up the collision resolution process by avoiding certain avoidable collisions. The second is based on the observation that collisions among a small number of packets are resolved more efficiently than collisions among a large number of packets. Therefore, if most CRIs start with a small number of packets, the performance of the protocol is expected to improve. Consider again the example above. In slots 2 and 3 a collision is followed by an idle slot. This implies that in slot 2 all users (and there were at least two of them) flipped 1. The binary-tree protocol dictates that these users must transmit in slot 4, although it is obvious that this will generate a collision that can be avoided. The modified binary-tree protocol was suggested by Massey (39), and it eliminates such avoidable collisions by letting the users that flipped 1 in slot 2 in the preceding example flip coins again before transmitting in slot 4. Consequently, the slot in which an avoidable collision would occur is saved. In this case, fair coins yield a stable system for arrival rates smaller than 0.375, and biased coins increase this number up to 0.381. When obvious access is employed, it is very likely that a CRI will start with a collision among a large number of packets when the previous CRI was long. When the system operates near its maximal throughput, most CRIs are long; hence collisions among a large number of packets must be resolved frequently, yielding inefficient operation. Ideally, if it were possible to start each CRI with the transmission of exactly one packet, the throughput of the system would be 1. Because this is not possible, one should try to design the system so that in most cases a CRI starts with the transmission of about one packet. There are several ways to achieve this goal by determining a first-time transmission rule (i.e., when packets are transmitted for the first time). One way, suggested by Capetanakis (36), is to obtain an estimate of the number of packets that arrived in the previous CRI, divide them into smaller groups, each having an expected number of packets on the order of one, and handle each group separately. Another way, known as the epoch mechanism, has been suggested by Gallager (40) and Tsybakov and Mikhailov (41). According to this mechanism, time is divided into consecutive epochs, each of length Δ slots. The ith arrival epoch is the time interval [iΔ, (i + 1)Δ]. Packets that arrive during the ith arrival epoch are transmitted for the first time in the first slot after the collision among packets that arrived during the (i − 1)st arrival epoch is resolved. The parameter Δ is chosen to optimize the performance of the system.
When Δ = 2.68, the system is stable for arrival rates up to 0.429 if slots of sure collisions are not saved, and up to 0.462 if they are. A final enhancement of the epoch mechanism is to start a new epoch each time a collision is followed by two successful transmissions. This guarantees that each CRI will start with an optimal number of packets, and it yields the highest stable throughput known for multiple access systems: 0.487.

Further Reading. Numerous variations of the environment under which collision resolution protocols operate have been addressed in the literature, and excellent surveys on the subject appear in Refs. 42 and 43. Books by Bertsekas and Gallager (21) and Rom and Sidi (23) are also excellent sources on collision resolution protocols. Considerable effort has been spent on finding upper bounds on the maximum throughput that can be achieved in an infinite population model with Poisson arrivals and ternary feedback. The best upper bound known to date is 0.568 and is the work of Tsybakov and Likhanov (44). Practical multiple access communication systems are prone to various types of errors. Collision resolution protocols that operate in the presence of noise errors, erasures, and captures have been studied in Refs. 45–49. Collision resolution protocols yielding high throughputs for general arrival processes (even if their statistics are unknown) were developed by Cidon and Sidi (50) and Greenberg et al. (51). The expected packet delay of the binary-tree protocol has been derived by Fayolle et al. (52) and Tsybakov and Mikhailov (38). Bounds on the expected packet delay of the algorithm with the epoch mechanism have been obtained in Refs. 41 and 53,
and bounds on the packet delay distribution have been obtained in Refs. 54 and 55.
BIBLIOGRAPHY

1. J. Martin, Communication Satellite Systems, Englewood Cliffs, NJ: Prentice-Hall, 1978.
2. W. Stallings, Data and Computer Communications, New York: Macmillan, 1985.
3. J. F. Hayes, Modeling and Analysis of Computer Communications Networks, New York: Plenum Press, 1984.
4. I. Rubin, Access control disciplines for multi-access communications channels: Reservation and TDMA schemes, IEEE Trans. Inf. Theory, IT-25: 516–536, 1979.
5. L. F. M. De Moraes and I. Rubin, Message delays for a TDMA scheme under a nonpreemptive priority discipline, IEEE Trans. Commun., COM-32: 583–588, 1984.
6. A. Itai and Z. Rosberg, A golden ratio control policy for a multiple-access channel, IEEE Trans. Autom. Control, AC-29: 712–718, 1984.
7. M. Hofri and Z. Rosberg, Packet delay under the golden ratio weighted TDM policy in a multiple access channel, IEEE Trans. Inf. Theory, IT-33: 341–349, 1987.
8. R. Rom and M. Sidi, Message delay distribution in generalized time division multiple access (TDMA), Probability Eng. Inf. Sci., 4: 187–202, 1990.
9. L. Kleinrock and M. Scholl, Packet switching in radio channels: New conflict-free multiple access schemes, IEEE Trans. Commun., COM-28: 1015–1029, 1980.
10. H. Takagi, Analysis of Polling Systems, Cambridge, MA: MIT Press, 1986.
11. H. Takagi, Queueing analysis of polling models, ACM Comput. Surv., 20 (1): 5–28, 1988.
12. H. Levy and M. Sidi, Polling systems: Applications, modeling and optimization, IEEE Trans. Commun., 38: 1750–1760, 1990.
13. A. S. Tanenbaum, Computer Networks, 3rd ed., Englewood Cliffs, NJ: Prentice-Hall International Editions, 1996.
14. L. G. Roberts, ALOHA packet system with and without slots and capture, Comput. Commun. Rev., 5 (2): 28–42, 1975.
15. I. Chlamtac, W. R. Franta, and K. D. Levin, BRAM: The broadcast recognizing access mode, IEEE Trans. Commun., COM-27: 1183–1189, 1979.
16. N. Abramson, The ALOHA system—another alternative for computer communications, Proc. Fall Joint Comput. Conf., pp. 281–285, 1970.
17. G. Fayolle et al., The stability problem of broadcast packet switching computer networks, Acta Informatica, 4 (1): 49–53, 1974.
18. L. Kleinrock and F. A. Tobagi, Packet switching in radio channels: Part I—Carrier sense multiple-access modes and their throughput delay characteristics, IEEE Trans. Commun., 23: 1400–1416, 1975.
19. F. A. Tobagi and L. Kleinrock, Packet switching in radio channels: Part II—The hidden terminal problem in carrier sense multiple-access and the busy tone solution, IEEE Trans. Commun., 23: 1417–1433, 1975.
20. F. A. Tobagi and L. Kleinrock, Packet switching in radio channels: Part IV—Stability considerations and dynamic control in carrier sense multiple-access, IEEE Trans. Commun., 25: 1103–1119, 1977.
21. D. Bertsekas and R. Gallager, Data Networks, 2nd ed., Englewood Cliffs, NJ: Prentice-Hall International Editions, 1992.
22. J. L. Hammond and P. J. P. O'Reilly, Performance Analysis of Local Computer Networks, Reading, MA: Addison-Wesley, 1986.
23. R. Rom and M. Sidi, Multiple Access Protocols: Performance and Analysis, New York: Springer-Verlag, 1990.
24. N. Abramson, The throughput of packet broadcasting channels, IEEE Trans. Commun., 25: 117–128, 1977.
25. M. J. Ferguson, An approximate analysis of delay for fixed and variable length packets in an unslotted Aloha channel, IEEE Trans. Commun., 25: 644–654, 1977.
26. F. A. Tobagi and V. B. Hunt, Performance analysis of carrier sense multiple access with collision detection, Comput. Netw., 4 (5): 245–259, 1980.
27. J. J. Metzner, On improving utilization in Aloha networks, IEEE Trans. Commun., 24: 447–448, 1976.
28. N. Shacham, Throughput-delay performance of packet-switching multiple-access channel with power capture, Performance Evaluation, 4 (3): 153–170, 1984.
29. R. Binder, A dynamic packet switching system for satellite broadcast channels, Proc. ICC '75, pp. 41.1–41.5, 1975.
30. W. Crowther et al., A system for broadcast communication: Reservation-ALOHA, Proc. Int. Conf. Syst. Sci., pp. 371–374, 1973.
31. S. S. Lam, Packet broadcast networks—a performance analysis of the R-ALOHA protocol, IEEE Trans. Comput., 29: 596–603, 1980.
32. M. J. Ferguson, On the control, stability, and waiting time in a slotted Aloha, IEEE Trans. Commun., 23: 1306–1311, 1975.
33. S. L. Beuerman and E. J. Coyle, The delay characteristics of CSMA/CD networks, IEEE Trans. Commun., 36: 553–563, 1988.
34. A. B. Carleial and M. E. Hellman, Bistable behavior of ALOHA-type systems, IEEE Trans. Commun., 23: 401–410, 1975.
35. S. S. Lam and L. Kleinrock, Packet switching in a multiaccess broadcast channel: Dynamic control procedures, IEEE Trans. Commun., 23: 891–904, 1975.
36. J. I. Capetanakis, Tree algorithm for packet broadcast channels, IEEE Trans. Inf. Theory, 25: 505–515, 1979.
37. J. F. Hayes, An adaptive technique for local distribution, IEEE Trans. Commun., 26: 1178–1186, 1978.
38. B. S. Tsybakov and V. A. Mikhailov, Free synchronous packet access in a broadcast channel with feedback, Prob. Inf. Trans., 14 (4): 259–280, 1978.
39. J. L. Massey, Collision resolution algorithms and random-access communications, in Multi-User Communication Systems, CISM Courses and Lectures Series (G. Longo, ed.), New York: Springer-Verlag, pp. 73–137, 1981 (also in UCLA Technical Report UCLA-ENG-8016, April 1980).
40. R. G. Gallager, Conflict resolution in random access broadcast networks, Proc. AFOSR Workshop Commun. Theory Appl., Provincetown, pp. 74–76, September 1978.
41. B. S. Tsybakov and V. A. Mikhailov, Random multiple packet access: Part-and-try algorithm, Prob. Inf. Trans., 16: 305–317, 1980.
42. R. G. Gallager, A perspective on multiaccess channels, IEEE Trans. Inf. Theory, 31: 124–142, 1985.
43. B. S. Tsybakov, Survey of USSR contributions to multiple-access communications, IEEE Trans. Inf. Theory, 31: 143–165, 1985.
44. B. S. Tsybakov and N. B. Likhanov, Upper bound on the capacity of a random multiple access system, Prob. Inf. Trans., 23 (3): 224–236, 1988.
45. I. Cidon and M. Sidi, The effect of capture on collision-resolution algorithms, IEEE Trans. Commun., 33: 317–324, 1985.
46. I. Cidon and M. Sidi, Erasures and noise in multiple access algorithms, IEEE Trans. Inf. Theory, 33: 132–143, 1987.
47. I. Cidon, H. Kodesh, and M. Sidi, Erasure, capture and random power level selection in multiple-access systems, IEEE Trans. Commun., 36: 263–271, 1988.
48. M. Sidi and I. Cidon, Splitting protocols in presence of capture, IEEE Trans. Inf. Theory, 31: 295–301, 1985.
49. N. D. Vvedenskaya and B. S. Tsybakov, Random multiple access of packets to a channel with errors, Prob. Inf. Trans., 19 (2): 131–147, 1983.
50. I. Cidon and M. Sidi, Conflict multiplicity estimation and batch resolution algorithms, IEEE Trans. Inf. Theory, 34: 101–110, 1988.
51. A. G. Greenberg, P. Flajolet, and R. E. Ladner, Estimating the multiplicities of conflicts to speed their resolution in multiple access channels, J. ACM, 34 (2): 289–325, 1987.
52. G. Fayolle et al., Analysis of a stack algorithm for random multiple-access communication, IEEE Trans. Inf. Theory, 31: 244–254, 1985.
53. L. Georgiadis, L. F. Merakos, and P. Papantoni-Kazakos, A method for the delay analysis of random multiple-access algorithms whose delay process is regenerative, IEEE J. Sel. Areas Commun., 5 (6): 1051–1062, 1987.
54. L. Georgiadis and M. Paterakis, Bounds on the delay distribution of window random-access algorithms, IEEE Trans. Commun., COM-41: 683–693, 1993.
55. G. Polyzos and M. Molle, A queuing theoretic approach to the delay analysis for the FCFS 0.487 conflict resolution algorithm, IEEE Trans. Inf. Theory, IT-39: 1887–1906, 1993.
MOSHE SIDI Technion—Israel Institute of Technology
MULTIPLIER. See ANALOG MOS MULTIPLIER.
Wiley Encyclopedia of Electrical and Electronics Engineering
Network Flow and Congestion Control
Standard Article
Eytan Modiano, MIT Lincoln Laboratory
Kai-Yeung Siu, MIT d'Arbeloff Laboratory for Information Systems and Technology
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5333
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: Issues and Mechanisms for Congestion Control; Flow Control in Practice; Advanced Issues.
NETWORK FLOW AND CONGESTION CONTROL
Figure 1. A hub network connecting users U1, ..., UN through an N × N switch.
Modern computer networks connect geographically dispersed nodes using switches and routers, with transmission lines between them. In this way, bursty and random traffic streams can be statistically multiplexed to make more efficient use of resources. For example, the hub network in Fig. 1 connects N users with N shared links and an N × N switch. Communication between pairs of users is accomplished by going through the hub. If, instead, all nodes were connected using dedicated links, N(N − 1)/2 links would be required. However, at any point in time communication usually takes place only between a small fraction of the users. Hence, providing full and unshared connectivity between all users would be wasteful of resources. The functionality of computer networks in connecting users is quite similar, in many respects, to that of highways and local streets in connecting households. In both cases, effective traffic control mechanisms are needed to regulate the flow of traffic and ensure high throughput. Unlike traditional voice communications, where an active call requires a constant bit rate from the network, data communication is bursty in nature. A typical data session may require very low data rates during periods of inactivity and much higher rates at other times. Consequently, there may be times when incoming traffic to a network exceeds its capacity. Flow and congestion control are mechanisms used in computer networks to prevent users from overwhelming the network with more data than the network can handle. The simplest way to handle network congestion is to temporarily buffer the excess traffic at a congested switch until it can all be transmitted. Yet, since switch buffers are limited in size, there may be times when sustained excessive demand on parts of the network causes buffers to fill up, so that excess packets can no longer be buffered and must be discarded.
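To make the N(N − 1)/2 dedicated-link count concrete, a one-line calculation (example numbers are mine):

```python
def dedicated_links(n):
    """Point-to-point links needed to fully connect n users: n(n-1)/2."""
    return n * (n - 1) // 2

# A hub needs only N links; full connectivity grows quadratically.
print(dedicated_links(8), dedicated_links(100))  # 28 4950
```

Even at N = 100, the hub saves nearly 50 times as many links, and the gap widens linearly with N.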
When packets are discarded, it is typically left up to higher-layer protocols to recover the lost packets using an appropriate retransmission mechanism. For example, the Transmission Control Protocol (TCP) recovers from such buffer overflows by using a timed acknowledgment mechanism and retransmitting packets for which an acknowledgment does not arrive in time. Consequently, at times of congestion, packets may be retransmitted not only because of buffer overflows but also because of the increased delay that is due to the congestion. In the absence of flow control, this sometimes unnecessary retransmission of packets can lead to instability, where little if any new traffic can flow through the network. A well-designed flow-control mechanism should keep the traffic levels in the network low enough to prevent buffers from overflowing and maintain relatively low end-to-end delays. Furthermore, in the event of congestion, the flow-control mechanism should allow the network to stabilize. Figure 2 illustrates the benefits of an effective flow-control mechanism.

Figure 2. An effective flow-control mechanism can yield both higher throughputs and decreased delays.

In addition to the obvious objectives of limiting delays and buffer overflow, a good flow-control scheme should also treat all sessions fairly. One notion of "fairness" is to treat all sessions in the network equally. However, this notion is not appropriate for networks that attempt to provide Quality-of-Service (QoS) guarantees. In some networks users may be offered service contracts guaranteeing minimum data rates, maximum packet delays, and packet discard rates, as well as other performance measures. In such networks, it is up to the flow-control mechanism to make sure that these guarantees are met. Clearly, in this case, sessions cannot be treated equally, and a different notion of fairness, related to the service agreements of the users, must be used. A more detailed discussion of fairness and how flow-control mechanisms attempt to provide fairness will be given in the next section.
Typically, call-admission mechanisms are used in circuit-switched networks (e.g., the telephone network); however, with the recent emergence of packet network services offering QoS guarantees, call blocking may also play a role in data networks, in conjunction with additional mechanisms to regulate traffic among active sessions. This article will focus on active flow-control mechanisms that attempt to regulate the traffic flow among active sessions. A comprehensive discussion of flow control in data networks can be found in Refs. 1 and 2. As described in Ref. 3, one way to classify flow-control mechanisms is based on the layer of the ISO/OSI reference model at which the mechanism operates. For example, there are data link, network, and transport layer congestion-control schemes. Typically, a combination of such mechanisms is used; the selection depends upon the severity and duration of congestion. Figure 3 shows how the duration of congestion affects the choice of method. In general, the longer the duration, the higher the layer at which control should be exercised. For example, if the congestion is permanent, the installation of additional links is required. If the congestion lasts for the duration of the connection, admission control (e.g., use of a busy signal) or dynamic routing (i.e., rerouting of traffic onto another, less congested path) is more appropriate. If the congestion lasts for several round-trip delays, transport-level control with end-to-end feedback is more effective. If the congestion is of short duration (less than a round-trip delay), link-by-link feedback or sufficient buffering should be used. Since every network can have overloads of all durations, every network needs a combination of control mechanisms at various levels; no single scheme can solve all congestion problems. The rest of this article will focus on mechanisms that deal with congestion that lasts only a few round-trip delays.
Figure 3. Control mechanisms based on congestion duration, from long to short: capacity planning and network design; admission control; dynamic routing; end-to-end feedback; link-by-link feedback; buffering.
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
NETWORK FLOW AND CONGESTION CONTROL
ISSUES AND MECHANISMS FOR CONGESTION CONTROL

Buffer Implementation and Management

As discussed earlier, a call-admission-control mechanism alone is appropriate only for regulating traffic with steady or predictable bandwidth requirements (e.g., voice); it is not effective for dealing with unpredictable, bursty traffic (e.g., data). For efficient utilization of network bandwidth, it is often necessary to buffer traffic when the incoming traffic to a node temporarily exceeds the capacity of its outgoing link. Flow control thus involves managing the buffer at a node in such a way that the service requirements of the connections traversing the node can be satisfied. In general, packet loss at a node will occur less frequently with a larger buffer than with a smaller one. However, a larger buffer may also lead to larger packet delays. For most data applications, an excessive packet delay has the same effect as a packet loss and will trigger a retransmission of the delayed packet. Thus, there is a tradeoff between throughput and delay in regulating network traffic. A key challenge in flow control is to achieve a good delay-throughput tradeoff among connections with possibly different service requirements competing for network resources. In addition to buffer size, another issue in buffer management for flow control has to do with the order in which packets of various connections are stored into and transmitted out of the buffer. The simplest way to buffer packets is to implement a single first-in-first-out (FIFO) queueing structure, in which buffered packets of all connections are transmitted on a first-come-first-served (FCFS) basis. In other words, when a packet arrives at a node and needs to be buffered, it is put at the end of a single queue, regardless of which connection it belongs to. The packet at the front of the queue is always the first to be transmitted. Since the order of packet arrivals
determines the order of packet departures, it is difficult to support different service requirements for connections with FIFO queueing. Moreover, since traffic from various connections is mixed into the same queue, some connections can overutilize the buffer space, thereby preventing other connections from using it. A more sophisticated approach is to implement a separate queue for each connection, that is, per-connection queueing, so that buffered packets of different connections are isolated from one another. With per-connection queueing, a scheduling mechanism is used to decide, at any instant, which connection can transmit its packet. Compared with FIFO queueing, per-connection queueing is more expensive to implement, but offers greater flexibility in exercising flow control. For traffic with minimal and similar service requirements, FIFO queueing is usually sufficient. When different classes of traffic with different levels of service requirements are mixed together onto the same link, per-connection queueing may be necessary (4). In the following sections, the problems of packet scheduling and packet discarding, which are closely related to buffer management for flow control, will be discussed.
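The contrast between a shared FIFO and per-connection queueing can be illustrated with a small round-robin scheduler sketch (class and variable names are illustrative, not from any standard):

```python
# Sketch of per-connection queueing with round-robin service.
# Unlike a single shared FIFO, one connection's burst cannot
# delay every other connection's packets behind it.

from collections import deque

class RoundRobinScheduler:
    def __init__(self):
        self.queues = {}        # one queue per connection
        self.order = deque()    # round-robin rotation of active connections

    def enqueue(self, conn, packet):
        if conn not in self.queues:
            self.queues[conn] = deque()
        if conn not in self.order:
            self.order.append(conn)
        self.queues[conn].append(packet)

    def dequeue(self):
        # Serve the next connection that has a packet waiting; a
        # connection that floods its own queue cannot starve the others.
        while self.order:
            conn = self.order.popleft()
            q = self.queues[conn]
            if q:
                packet = q.popleft()
                self.order.append(conn)   # back to the end of the rotation
                return conn, packet
        return None

sched = RoundRobinScheduler()
for p in range(3):
    sched.enqueue("A", f"A{p}")      # connection A bursts three packets
sched.enqueue("B", "B0")             # connection B sends one
# Round-robin interleaves service: B0 is not stuck behind A's burst.
served = [sched.dequeue()[1] for _ in range(4)]
assert served == ["A0", "B0", "A1", "A2"]
```

With a shared FIFO the service order would have been A0, A1, A2, B0; the per-connection rotation is what makes differentiated treatment possible.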
Packet Scheduling

In general, a network node can have multiple incoming and outgoing links. Packets can be buffered at either the entrance or the exit interface of a node; the former is called input buffering and the latter output buffering. With input buffering, packets of different connections from the same incoming link of a node are first buffered before they are transmitted to different outgoing links. To resolve possible contention caused by packets from different incoming links transmitting to the same outgoing link, a scheduling mechanism is required to determine which packet should be transmitted at any instant. Moreover, when input buffering is implemented using FIFO queueing, packets of slow connections at the front of the queue will block packets of fast connections that follow. This is usually called the head-of-line (HOL) blocking problem. In fact, it is known that under certain traffic assumptions (e.g., a packet from each incoming link is equally likely to be transmitted to any outgoing link of the node), with input FIFO buffering, at most 58% of the maximum possible throughput can be achieved (5). On the other hand, with output buffering, a packet arriving at a node is immediately transferred to the interface of its destined outgoing link, and is buffered there before it is transmitted. In this case, there is no HOL blocking problem, and no scheduling mechanism is needed to resolve transmission contention among packets from different incoming links. However, since packets from all incoming links can potentially go to the same outgoing link at any instant, a node with output buffering needs to transfer packets at a rate equal to the aggregate speed of all incoming links. This also means that faster (and thus more expensive) switching hardware is usually required for nodes with output buffering than with input buffering.
The problem of packet scheduling is further complicated by the fact that different connections may have different bandwidth or service requirements. Thus, a scheduling mechanism is needed to selectively expedite or transmit the buffered packets of various connections. For example, if each connection should share the bandwidth of an outgoing link equally, a node can transmit buffered packets of each connection in a round-robin fashion. Similarly, when a particular connection has a minimum bandwidth requirement of R packets/second, the node should schedule transmissions so that, on average, at least one packet of that connection is transmitted every 1/R s. Furthermore, a node may wish to delay the transmission of the packets of some connections in order to avoid or relieve congestion further along the paths used by those connections, when such congestion information is available at the node. Simple FIFO queueing at a node often cannot support such scheduling mechanisms, and more expensive per-connection queueing is necessary when the scheduling constraints are stringent. More information on packet scheduling can be found in Refs. 6 and 7.

Packet Discarding

Since there is only a finite amount of available buffer space at a node, packets will be discarded if congestion persists. When packets of a connection are discarded, whether they will be retransmitted depends on the service requirements of that connection. For example, when the connection is a file-transfer application, where each packet carries essential information, discarded packets need to be retransmitted by the source. The retransmission is usually performed if the receipt of a packet has not been acknowledged within a time-out period. In the case of a TCP connection, the destination returns to the source an acknowledgment packet corresponding to each data packet received, and the retransmission time-out is determined dynamically (8). On the other hand, for real-time traffic such as voice or video, discarded packets are usually not retransmitted, because delayed information is useless in such cases.
A common approach to flow control for real-time traffic is to assign different priority levels to packets so that packets of highest priority will be discarded least often, if at all. Before a connection is established, the network may exercise a call-admission mechanism to ensure that the transmission of these highest priority packets can be maintained above a certain rate in order to support a minimum acceptable level of quality of service. This approach is also applicable to data traffic, where each discarded packet needs to be retransmitted. In this case, a network may offer several different classes of service with different priority levels in terms of packet discarding. When a connection is established, it can negotiate with the network to which service it wants to subscribe and, subsequently, during periods of congestion its packets will be discarded, based on their priority level. It is sometimes desirable to discard packets even when buffer space is still available, particularly if FIFO queueing is used. This is because a connection overutilizing the buffer space in a FIFO queue will cause packets of other connections sharing the FIFO queue to be discarded. Thus, packets should be discarded if they belong to connections that utilize more than their fair share of buffer space, or if they may cause packets of higher priority to be discarded. Furthermore, if packets are to be discarded further down the path, because of congestion there, they should be discarded as early as possible, to avoid wasting additional network resources unnecessarily (9,10).
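The priority-based discarding described above can be sketched as a bounded buffer that, when full, discards the lowest-priority packet first (a simplified illustration; real switches use per-class queues and discard thresholds, and the names here are invented):

```python
# Sketch of priority-based packet discarding at a congested node.
# Higher numeric priority = more important = discarded last.

import heapq

class PriorityDropBuffer:
    """Bounded buffer: when full, an arriving packet displaces the
    lowest-priority buffered packet, or is itself dropped if every
    buffered packet already has higher priority."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []   # min-heap ordered by priority (lowest dropped first)
        self.seq = 0     # tie-breaker preserving arrival order

    def enqueue(self, priority, packet):
        entry = (priority, self.seq, packet)
        self.seq += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
            return None                      # accepted, nothing discarded
        lowest = self.heap[0]
        if lowest[0] < priority:
            heapq.heapreplace(self.heap, entry)
            return lowest[2]                 # accepted; low-priority packet discarded
        return packet                        # buffer full of higher-priority traffic

buf = PriorityDropBuffer(capacity=2)
assert buf.enqueue(1, "low") is None
assert buf.enqueue(3, "high") is None
assert buf.enqueue(2, "mid") == "low"    # lowest-priority packet is pushed out
assert buf.enqueue(0, "bulk") == "bulk"  # arriving packet itself is dropped
```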
Figure 4. Fair bandwidth allocation: connection A traverses links 1 and 2, connection B uses link 1, and connection C uses link 2.
Fair Allocation of Bandwidth

In addition to limiting delay and buffer overflow, fairness in network use is another objective of flow control. It is difficult to define a simple notion of fairness when connections with different service requirements are present in the network. Here, only a particular notion of fairness in bandwidth allocation will be discussed. First consider a simple network with two links and three connections, as shown in Fig. 4. For the present, assume that each link has equal capacity, supporting 1 unit/s of traffic. If the fairness criterion is to allocate an equal rate to each connection, then each connection should get a rate of 1/2 unit/s, and the total network throughput in this case would be 3/2 units/s. Note, however, that the maximum network throughput is 2 units/s, which can be achieved by shutting off connection A and allowing connections B and C each to transmit 1 unit/s of traffic. This example shows that fairness and throughput are two independent (and sometimes conflicting) objectives of flow control. Now suppose the capacity of link 1 is reduced, say to 1/2 unit/s. In this case, connections A and B can share the bandwidth of link 1 equally, resulting in a throughput of 1/4 unit/s for each. However, it would be a waste of bandwidth in link 2 if connection C were allocated less than 3/4 unit/s of bandwidth; it would be unfair if the bandwidth allocated to connection C were more, since that would further restrict the bandwidth allocated to connection A. This example motivates the notion of max–min fairness, which refers to maximizing the bandwidth allocated to the connections with the minimum allocation. More formally, a set of connections has a max–min fair bandwidth allocation if the bandwidth allocated to any connection C′ cannot be increased without decreasing the bandwidth allocated to some other connection whose allocation is already no larger than that of C′. For example, in Fig. 4 with the capacity of link 1 being 1/2 unit/s, one cannot increase the bandwidth allocated to connection C above 3/4 unit/s without making the bandwidth of connection A smaller than 1/4 unit/s. Max–min fairness can also be defined in terms of the notion of a bottleneck link. With respect to some bandwidth allocation, a particular link L is a bottleneck link for a connection C′ that traverses L if the bandwidth of L is fully utilized and the bandwidth allocated to C′ is no less than the bandwidth allocated to any other connection traversing L. A max–min fair bandwidth allocation can then be shown to be equivalent to the condition that each connection has a bottleneck link with respect to that allocation. The notion of max–min fairness must be modified if each connection requires a minimum guaranteed data rate. One possible way is first to define the excess capacity of each
link L to be the bandwidth of L minus the aggregate guaranteed rates of all the connections that traverse L. Then a set of connections has a max–min fair bandwidth allocation if the excess capacity of each link is shared in a max–min fair manner (according to the notion defined earlier). More information on fair queueing algorithms and their performance can be found in Refs. 11–14.

Window Flow Control

The oldest and most common flow-control mechanism used in networks is window flow control. Window flow control has been used since the inception of packet-switched data networks, and it appears in X.25, SNA, and TCP/IP networks (1,2). Window flow control regulates the rate at which sessions can insert packets into the network with a simple acknowledgment mechanism. Within a given session, the destination sends an acknowledgment to the source for every packet that it receives. With a window size of W, the source is limited to having W outstanding packets for which an acknowledgment has not been received. Hence, the window scheme limits the number of packets that a given session can have inside the network to the window size, W. These packets can be either in buffers throughout the network or propagating on transmission lines. This strategy is typically implemented using a sliding transmission window, where the start of the window corresponds to the oldest packet for which an acknowledgment has not yet been received. Only packets from within the window can be transmitted, and the window is advanced as acknowledgments for earlier packets are received. An example with W = 4 is shown in Fig. 5. One reason that this strategy is very popular is its similarity to window-based retransmission mechanisms (e.g., Go Back N or SRP), which are used for error control in data networks, making it easy to implement in conjunction with the error-control scheme.
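The sliding-window bookkeeping described above can be sketched as a minimal simulation (class and variable names are illustrative, not from any protocol standard):

```python
# Minimal sketch of sliding-window flow control.
# A source may have at most W unacknowledged packets outstanding;
# the window advances as acknowledgments for the oldest packets arrive.

from collections import deque

class SlidingWindowSender:
    def __init__(self, window_size):
        self.W = window_size
        self.next_seq = 0           # next sequence number to send
        self.outstanding = deque()  # sent but not yet acknowledged

    def can_send(self):
        return len(self.outstanding) < self.W

    def send(self):
        assert self.can_send(), "window is full"
        seq = self.next_seq
        self.outstanding.append(seq)
        self.next_seq += 1
        return seq

    def ack(self, seq):
        # Cumulative acknowledgment: everything up to `seq` is confirmed,
        # so the start of the window slides forward.
        while self.outstanding and self.outstanding[0] <= seq:
            self.outstanding.popleft()

sender = SlidingWindowSender(window_size=4)
sent = [sender.send() for _ in range(4)]  # packets 0..3 fill the window
assert not sender.can_send()              # no ACKs yet: source must stall
sender.ack(0)                             # ACK for packet 0 arrives
assert sender.can_send()                  # window slides; packet 4 may go
```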
While window flow control, in effect, limits the number of packets that a given session can have in the network, it also indirectly regulates the rate of the session. Suppose that the round-trip delay for transmitting a packet and receiving its acknowledgment is D seconds. Then with a window of size W, a session can transmit at most r = W/D packets per second. This is because, after sending a full window of packets, the sender must wait for the acknowledgment of the first packet before it can send any new packets. As delays in the network increase (i.e., D increases), the maximum session rate r is forced to decrease, producing the desired effect of slowing transmissions down at times of congestion. As congestion is alleviated, D decreases, allowing sessions to increase their transmission rates. One problem with window flow control is that it cannot be used for sessions that require guaranteed
data rates, such as real-time traffic, because as delays through the network vary, the rate of the session is forced to vary as well. Another problem is the choice of window size. On the one hand, one would like to keep window sizes small, in order to limit the number of packets in the network and prevent congestion. On the other hand, one would also want to allow sessions to transmit at the maximum rate at times when there is no congestion in the network. Consider a network where the transmission time for a packet is X. In order to allow unimpeded transmission, the window size W must be greater than D/X. That is, the window size must be large enough to allow a session to transmit packets continuously while waiting for acknowledgments to return. Clearly, when W is greater than D/X, flow control is not active (i.e., the session can transmit at the maximum rate of 1/X packets per second), and when W is smaller than D/X, flow control is active and the session transmits at a rate of W/D < 1/X packets per second. The problem is in choosing a window size that both allows sessions unimpeded transmission when there is no congestion and also prevents congestion from building up in the network. When there is no congestion in the network, the primary source of delay is propagation delay. Since propagation delay is present regardless of congestion, the window size should be big enough to allow unimpeded transmission when propagation is the only source of delay in the network. Hence, if the propagation delay is equal to Dp, then the window size should be at least Dp/X, allowing transmission at a rate of 1/X packets per second when the only source of delay is propagation. This is particularly needed in high-speed networks, where propagation delays can be relatively large. However, this can lead to the use of very large windows.
Consider, for example, transmission over a satellite, where the round-trip propagation and signal-processing delays can be on the order of a second. Suppose that the transmission rate is 10^6 bits/s and that the packet size is 1000 bits. In order to allow sessions to transmit at the full rate of 10^6 bits/s, the window size must be at least 1000 packets. Hence, as many as 1000 packets can be in the network at once for each session. With so many packets in flight simultaneously, attempting to control congestion in the network becomes very difficult. First, the window mechanism becomes somewhat ineffective, because delays that are due to congestion are likely to be relatively small compared with the propagation delay. Recalling that when flow control becomes active the allowable session rate is r = W/D, and since the increase in delay due to congestion is small compared with the overall delay, the result is only a small decrease in the session rate. Also, with very large windows, sufficient buffering must be present throughout the network, to prevent buffer overflows
Figure 5. Sliding window mechanism with W = 4.
in the event that congestion sets in. Clearly, a mechanism is needed to dynamically alter the window size allocated to a session, based on estimated traffic conditions in the network. In this way, when the network is not congested the window size can be increased, to allow unimpeded transmission, but, as congestion begins to set in, the window size can be reduced to yield a more effective control of the allowed session rate. An example of a dynamic window adjustment mechanism is given in Ref. 15. In order to be able to adjust the window size in response to congestion, a mechanism must exist to provide feedback to the source nodes, regarding the status of congestion in the network. There are many ways in which this can be done, and finding the best such method is an area of active research. One approach, for example, would require nodes in the network, upon experiencing congestion, to send special packets (sometimes called choke packets) to the source nodes, notifying them of the congestion. In response to these choke packets, the source nodes would reduce their window size. Other mechanisms attempt to measure congestion in the network, by observing the delay experienced by packets and reducing the window size as delay increases. Yet another mechanism used by the Transmission Control Protocol (TCP) reduces the window size, in response to lost packets (packets for which an acknowledgment was not received). This is done based on the assumption that lost packets are due to buffer overflows and are a result of congestion. The flow-control mechanism used by TCP, and some of the problems associated with it, will be discussed in more detail in the next section. One problem with using end-to-end windows for flow control is that, when congestion sets in on some link in the network, the node preceding that link will have to buffer a large number of packets. 
Consider, for example, a session operating over a multi-hop network, and suppose that congestion sets in at one of the links along its path. With a window size of W, as many as W packets can be sent by the session into the network without receiving an acknowledgment. When a link becomes congested, all W packets associated with that session will arrive at the congested link and have to be buffered at the node preceding it. With many simultaneous sessions, this can lead to significant buffering requirements at every node. An alternative, known as link-by-link window flow control, establishes windows for a session along every link between the source and the destination. These windows can be tailored to the specific link, so that a long-delay link (e.g., a satellite link) would have a large window, while a short-delay link would have a smaller window. These link-by-link windows can be much smaller than end-to-end windows and, as a result, the amount of buffering at each node can be significantly reduced. In effect, link-by-link windows distribute the buffering in the network evenly among all of the nodes, rather than requiring the congested nodes to handle all of the packets. Of course, link-by-link windows are not always possible; in networks that use datagram routing, for example, sessions do not use a fixed path between the source and destination, so windows cannot be set up on a link-by-link basis.
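The window-sizing arithmetic above can be checked against the satellite numbers used earlier (the rates and sizes come from the example in the text):

```python
# Window sizing for unimpeded transmission: the window must cover the
# bandwidth-delay product, W >= D / X, where X is the per-packet
# transmission time and D the round-trip delay.

link_rate_bps = 1_000_000        # 10^6 bits/s
packet_bits = 1000
round_trip_s = 1.0               # satellite round-trip delay, about a second

X = packet_bits / link_rate_bps  # transmission time per packet: 1 ms
W_min = round_trip_s * link_rate_bps / packet_bits  # = D / X

assert W_min == 1000             # 1000 outstanding packets required

# With a smaller window the session rate is throttled to W / D:
W = 100
rate_pkts_per_s = W / round_trip_s
assert rate_pkts_per_s == 100    # one-tenth of the link's 1000 pkt/s capacity
```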
Rate Flow Control

One problem with window flow control is that in very-high-speed networks, where propagation delays are relatively large, very large windows are required, making window flow control ineffective. Another problem is that window flow control cannot be used for sessions that require guaranteed data rates, such as real-time traffic, because as delays through the network vary, the rate of the session is forced to vary as well. An alternative mechanism, which is more appropriate for high-speed networks and real-time traffic, is based on explicitly controlling the rate at which users are allowed to transmit. For a session that requires an average data rate of r packets per second, a strict implementation of a rate-control scheme would allow the session to transmit exactly one packet every 1/r s. Such an implementation would amount to time-division multiplexing (TDM), which is appropriate for constant-rate traffic but inefficient for bursty data traffic. Data sessions typically do not demand a constant transmission rate but are rather bursty, so that, at times, little if any transmission is required, and at other times much higher rates are required. A more appropriate mechanism for supporting a bursty data session with an average rate of r packets per second is to allow the transmission of B packets every B/r seconds. In this way, bursts of up to B packets can be accommodated. A common method for accomplishing this form of flow control is the leaky bucket method, shown in Fig. 6. In this scheme, a session of rate r has a ''bucket'' of permits for its use. The bucket is constantly fed new permits at a rate of one every 1/r s, and it can hold at most B permits. In order for a packet to enter the network, it must first obtain a permit from the bucket. If the bucket has no more permits, the packet must wait until a permit becomes available. It is easy to see that, in this way, up to B packets can burst into the network all at once. An important parameter in the design of a leaky bucket rate-control scheme is the bucket size B. Clearly, a small bucket size would result in a strict rate-control scheme and would be ineffective for bursty traffic. However, too large a bucket would be ineffective in controlling congestion. Again, as with the dynamic adjustment of window size in the window flow-control scheme, it is sometimes desirable to dynamically alter the bucket size and rate given to a session based on traffic conditions in the network.

Figure 6. Leaky bucket flow control. Permits arrive at the bucket one every 1/r s, and a packet must obtain a permit before entering the network.

FLOW CONTROL IN PRACTICE

TCP Flow Control

The transmission control protocol (TCP) is the most commonly used transport-layer protocol in today's Internet. Virtually all session-based traffic in the Internet uses TCP. Among other things, TCP is responsible for flow control. There are a number of different TCP implementations (16–18), the details of which vary slightly from one another. The details of a particular standard are not emphasized here; rather, the general concepts that guide TCP flow control (19) are described. TCP controls the flow of traffic in a session using end-to-end windows. The key to TCP flow control is the window size allocated for a given connection. For each connection, TCP determines a maximum allowable window size, Wmax. The value of Wmax is typically a function of the particular TCP implementation; most TCP implementations use a value of Wmax that is somewhere between 4 kbytes and 16 kbytes (20). Upon connection setup, the value of Wmax is determined, based on the version of TCP used by the end stations. Once the maximum window size is determined, communication can begin. However, in order to prevent a new connection from overwhelming the system, communication does not begin with the maximum window size. Rather, communication starts with a window size of W = 1 packet, typically around 512 bytes, and the window size is gradually increased, in what is known as a slow-start phase. During the slow-start phase, the window size is increased by one packet for every acknowledgment that returns from the destination. Therefore, the window size is doubled with every successful transmission of a complete window. The slow-start phase continues until the window size reaches half of the maximum window size, at which point the communication enters what is known as the congestion-avoidance phase. During the congestion-avoidance phase, the window size is increased by one packet for every successful transmission of a full window. Hence, during the congestion-avoidance phase, the window size is increased much more slowly than during slow start.
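The slow-start and congestion-avoidance growth rules can be sketched at per-round-trip granularity (a simplification: real TCP counts bytes and individual acknowledgments, and the W_max value here is illustrative):

```python
# Sketch of TCP-style window evolution: exponential growth (slow start)
# up to half of W_max, then linear growth (congestion avoidance),
# one step per round-trip in which a full window is acknowledged.

def next_window(w, w_max):
    ssthresh = w_max // 2
    if w < ssthresh:
        return min(2 * w, ssthresh)   # slow start: window doubles
    return min(w + 1, w_max)          # congestion avoidance: +1 packet

w, w_max = 1, 32                      # start with a single-packet window
trace = [w]
for _ in range(20):
    w = next_window(w, w_max)
    trace.append(w)

assert trace[:5] == [1, 2, 4, 8, 16]  # exponential up to ssthresh = 16
assert trace[5:9] == [17, 18, 19, 20] # then linear growth
assert max(trace) <= w_max            # never exceeds the maximum window

# On a detected loss, the implementations described in the text
# collapse the window back to one packet and repeat the cycle:
w_after_loss = 1
```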
The window size continues to increase in this way until it reaches its maximum value of Wmax. The above discussion described how TCP sets its initial window size. In addition, TCP adjusts the window size in response to congestion in the network. TCP assumes that any packets lost in the network (e.g., packets that are not acknowledged in time) are lost due to buffer overflows resulting from congestion. In response, upon detecting a lost packet, TCP reduces the window size. Most TCP implementations (20) reduce the window size to one packet, after which the window size is increased gradually back toward the maximum value, in accordance with the slow-start and congestion-avoidance algorithms described above. While TCP has been used effectively for many years, there are many shortcomings to its flow-control mechanism that make it ineffective for networks of the future. First, as discussed for general window flow control, it is not an effective mechanism for supporting sessions that require guaranteed data rates, such as real-time traffic. Second, as network transmission speeds increase, the window size needed to maintain unimpeded transmission must be very large, especially over long-delay links. With most versions of TCP having a maximum allowable window of around 16 kbytes, this is much too small for future high-speed networks (21,22). Furthermore, TCP's treatment of lost packets as if they were due to congestion may be appropriate for networks that experience very little loss due to transmission errors. However, in wireless or satellite networks, lost packets are likely to be due to transmission errors; hence, the TCP response of closing the window is not appropriate and results in significant performance degradation (23). Finally, the TCP slow-start mechanism, which gives a session a small window and gradually increases the window size with time, prevents TCP from taking full advantage of the high transmission capacity offered by networks of the future.

Flow Control in ATM Networks

Asynchronous transfer mode (ATM) is a network technology developed to carry integrated traffic, including data, voice, images, and video. ATM carries all traffic on a stream of fixed-size packets (cells), each comprising 5 bytes of header information and a 48-byte information field (payload). The reason for choosing a fixed-size packet is to ensure that the switching and multiplexing functions can be carried out quickly and easily. ATM is a connection-oriented technology, in the sense that, before two systems on the network can communicate, they must inform all intermediate switches about their service requirements and traffic parameters. This is similar to telephone networks, where a fixed path is set up from the calling party to the receiving party. In ATM networks, each connection is called a virtual circuit or virtual channel (VC), because it also allows the capacity of each link to be shared by the connections using that link on a demand basis, rather than by fixed allocations. The connections allow the network to guarantee quality of service (QoS) by limiting the number of VCs. Typically, a user declares key service requirements at the time of connection setup, declares the traffic parameters, and may agree to control these parameters dynamically as demanded by the network. The available bit rate (ABR) service is one of the services in ATM developed to support data traffic. In other ATM services, network resources are allocated during connection establishment, and sources are not controllable by feedback after a connection is established.
ABR service, on the other hand, performs end-to-end rate-based flow control, requiring data sources to adapt their rates to their proper share of the available bandwidth on the basis of feedback information obtained from the network. The feedback information is carried in the resource management (RM) cells of each connection. These RM cells are generated by the source between blocks of data cells and returned by the receiver in the backward direction. The feedback control operates in one of two modes: explicit binary indication or explicit rate indication. The explicit binary indication mode assumes that a congested network node will mark a specific field (equivalent to a binary bit) in the header of any passing data cell. The receiver monitors the fields of the data cells it receives and sets the congestion fields in the backward RM cells appropriately. The data sources can then increase, decrease, or hold their current rates, based on the congestion information contained in the backward RM cells they receive. The explicit rate indication mode assumes that network nodes are capable of computing the proper share of the available bandwidth for each source and writing this amount into a specific field in passing RM cells. The source then adjusts its rate to no more than the amount indicated in the RM cells. While the standards for the ABR service specify the general behavior of the source and the receiver, the specific mechanism that governs when each network node should set the congestion field, or how it should compute the explicit rate, is
NETWORK FLOW AND CONGESTION CONTROL
left to the discretion of network equipment vendors. Design objectives for such a mechanism include maximal utilization of network bandwidth, fairness in network use, and low cost, in terms of algorithm complexity and buffer space. For a good historical account of the development of the ABR service in ATM, see Ref. 24. Detailed descriptions of various approaches and mechanisms for flow control in ATM networks can also be found in Refs. 24–28.

ADVANCED ISSUES

Because most Internet applications are currently supported using TCP, much research in the networking community has focused on improving TCP performance. It has been recognized that, in a high-latency network environment, the window flow-control mechanism of TCP may not be very effective, because it relies on packet loss to signal congestion, instead of avoiding congestion and buffer overflow (29). For bulk data connections, the arrival time of the last packet of data is of primary concern to the users, whereas the delays of individual packets are not important. However, for some interactive applications, such as Telnet, the user is sensitive to the delay of individual packets. For such low-bandwidth, delay-sensitive TCP traffic, unnecessary packet drops and packet retransmissions lead to significant delays perceived by the users. It is suggested in some recent work (30) that the performance of TCP can be significantly improved if intermediate routers can detect incipient congestion and explicitly inform the TCP source to throttle its data rate before any packet loss occurs. This explicit congestion notification (ECN) mechanism would require modifications of existing TCP protocols. For example, a new ECN field can be implemented in the packet header; an IP router monitors its queue size and, during congestion, marks the ECN field of passing packets, and the mark is echoed back to the source in an acknowledgment packet. The TCP source will then slow down after receiving an acknowledgment with the ECN field marked.
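The ECN idea described above can be illustrated with a small sketch. The names, the marking threshold, and the window parameters below are hypothetical, and the scheme is deliberately simplified from the proposal of Ref. 30: the receiver's echo of the mark is folded directly into the acknowledgment, and a hard queue threshold stands in for a probabilistic marking rule.

```python
from collections import deque

QUEUE_THRESHOLD = 8   # hypothetical marking threshold (packets)
W_MAX = 32            # hypothetical maximum window (packets)

class EcnRouter:
    """Router that marks the ECN field of packets arriving at a congested queue."""
    def __init__(self, service_rate):
        self.queue = deque()
        self.service_rate = service_rate  # packets forwarded per time step

    def enqueue(self, packet):
        # Incipient congestion: mark the packet instead of waiting for overflow.
        if len(self.queue) >= QUEUE_THRESHOLD:
            packet["ecn"] = True
        self.queue.append(packet)

    def serve(self):
        n = min(self.service_rate, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]

class EcnSource:
    """TCP-like source: additive increase; halve the window on a marked ack."""
    def __init__(self):
        self.window = 1.0

    def on_ack(self, marked):
        if marked:
            self.window = max(1.0, self.window / 2)   # back off before any loss
        else:
            self.window = min(W_MAX, self.window + 1.0 / self.window)

source, router = EcnSource(), EcnRouter(service_rate=4)
marks = 0
for step in range(60):
    for _ in range(int(source.window)):   # send one window of packets
        router.enqueue({"ecn": False})
    for pkt in router.serve():            # ack echoes the congestion mark
        marks += pkt["ecn"]
        source.on_ack(pkt["ecn"])

print(f"marked acks: {marks}, final window: {source.window:.1f}")
```

Because the source's window grows past the router's service rate, the queue builds until the threshold is crossed, marks begin to flow back, and the window oscillates below its maximum without a single packet being dropped.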
An additional motivation for using ECN mechanisms in TCP/IP networks concerns the possibility of TCP/IP traffic traversing networks that have their own congestion-control mechanisms (e.g., the ABR service in ATM). Figure 7 shows a typical network scenario in which TCP traffic generated from a source connected to a LAN (e.g., Ethernet) is aggregated through an edge router onto an ATM network. Congestion occurs at the edge router when the bandwidth available in the ATM network cannot support the aggregated traffic generated from the LAN. Existing implementations of TCP rely only on packet drops as an indication of congestion, to throttle
Figure 7. A typical network scenario where TCP traffic flows over an ATM network.
the source rates. By incorporating ECN mechanisms in TCP protocols, TCP sources can be informed of congestion at the network edges and will reduce their rates before any packet loss occurs. The use of such ECN mechanisms to inform TCP sources of congestion would be independent of the congestion-control mechanisms within the ATM networks. Instead of incorporating ECN mechanisms in TCP, which requires modifications of TCP, it is proposed in Ref. 31 that congestion can be controlled by withholding, at the network edges, the acknowledgments returned to the TCP sources. Such a mechanism has the effect of translating the available bandwidth in the ATM network into an appropriately timed sequence of acknowledgments. A key advantage of this mechanism is that it does not require any changes in the TCP end-system software.

There are also research efforts aimed at improving TCP performance in a wireless network environment. The main problem here is that noise in the wireless transmission medium can often corrupt TCP packets. Such corrupted packets are usually discarded at the destination, and the flow-control mechanism of TCP will mistakenly treat the loss as an indication of congestion and throttle the source rate. Several solutions have been proposed to address this problem, most of which modify TCP by decoupling the flow-control loop in the wireless medium from that in the wireline network. Readers are referred to Ref. 32 for a detailed description of these mechanisms.

BIBLIOGRAPHY

1. D. P. Bertsekas and R. Gallager, Data Networks, Englewood Cliffs, NJ: Prentice-Hall, 1987.
2. M. Schwartz, Telecommunication Networks: Protocols, Modeling and Analysis, Reading, MA: Addison-Wesley, 1987.
3. R. Jain, Myths about congestion management in high speed networks, Internetwork.: Res. Exp., 3 (3): 101–113, 1992.
4. N. McKeown, P. Varaiya, and J. Walrand, Scheduling calls in an input-queued switch, Electron. Lett., 29 (25): 2174–2175, 1993.
5. M. J. Karol, M. G. Hluchyj, and S. P. Morgan, Input versus output queueing in a space-division packet switch, IEEE Trans. Commun., 35: 1347–1356, 1987.
6. J. Rexford et al., Scalable architectures for integrated traffic shaping and link scheduling in high-speed ATM switches, IEEE J. Sel. Areas Commun., 15 (5): 938–950, 1997.
7. Hui Zhang, Service disciplines for guaranteed performance service in packet-switching networks, Proc. IEEE, 83: 1374–1396, 1995.
8. A. Romanow and S. Floyd, Dynamics of TCP traffic over ATM networks, IEEE J. Sel. Areas Commun., 13 (4): 633–641, 1995.
9. S. Floyd and V. Jacobson, Random early detection gateways for congestion avoidance, IEEE/ACM Trans. Network., 1 (4): 397–413, 1993.
10. H. Li et al., On TCP performance in ATM networks with per-VC early packet discard mechanisms, Comput. Commun., 19 (13): 1065–1076, 1996.
11. A. Demers, S. Keshav, and S. Shenker, Analysis and simulation of a fair queueing algorithm, Proc. ACM SIGCOMM, 19 (4): 1–12, 1989.
12. J. C. R. Bennett and Hui Zhang, Hierarchical packet fair queueing algorithms, IEEE/ACM Trans. Network., 5 (5): 875–889, 1997.
13. A. K. Parekh and R. G. Gallager, A generalized processor sharing approach to flow control in integrated services networks: The single-node case, IEEE/ACM Trans. Network., 1 (3): 344–357, 1993.
14. S. J. Golestani, Network delay analysis of a class of fair queueing algorithms, IEEE J. Sel. Areas Commun., 13 (6): 1057–1070, 1995.
15. D. Mitra and J. B. Seery, Dynamic adaptive windows for high-speed data networks with multiple paths and propagation delays, Comput. Networks ISDN Syst., 25 (6): 663–679, 1993.
16. V. Jacobson, Berkeley TCP evolution from 4.3-Tahoe to 4.3-Reno, Proc. 18th Internet Eng. Task Force, Vancouver, 1990.
17. L. S. Brakmo and L. Peterson, TCP Vegas: End-to-end congestion avoidance on a global internet, IEEE J. Sel. Areas Commun., 13 (8): 1465–1480, 1995.
18. L. S. Brakmo, S. W. O'Malley, and L. Peterson, TCP Vegas: New techniques for congestion detection and avoidance, Comput. Commun. Rev., 24 (4): 24–35, 1994.
19. V. Jacobson, Congestion avoidance and control, Proc. ACM SIGCOMM, 18: 314–329, 1988.
20. W. R. Stevens, TCP/IP Illustrated, Vol. 1: The Protocols, Reading, MA: Addison-Wesley, 1994.
21. V. Jacobson, R. Braden, and D. Borman, TCP extensions for high performance, Internet Eng. Task Force, 1992, RFC 1323.
22. R. C. Durst, G. J. Miller, and E. J. Travis, TCP extensions for space communications, Wireless Networks, 3 (5): 389–403, 1997.
23. T. V. Lakshman and U. Madhow, The performance of TCP/IP for networks with high bandwidth-delay products and random loss, IEEE/ACM Trans. Network., 5 (3): 336–350, 1997.
24. H. Ohsaki et al., Rate-based congestion control for ATM networks, Comput. Commun. Rev., 25 (2): 60–72, 1995.
25. F. Bonomi and K. W. Fendick, The rate-based flow control framework for the available bit rate ATM service, IEEE Network, 9 (2): 25–39, 1995.
26. R. Jain, S. Kalyanaraman, and R. Viswanathan, The OSU scheme for congestion avoidance in ATM networks: Lessons learned and extensions, Perform. Eval., 31 (1–2): 67–88, 1997.
27. P. Narvaez and K.-Y. Siu, Optimal feedback control for ABR service in ATM, Proc. Int. Conf. Network Protocols, Atlanta, GA, 1997, pp. 32–41.
28. K.-Y. Siu and H.-Y. Tzeng, Intelligent congestion control for ABR service in ATM networks, Comput. Commun. Rev., 24 (5): 81–106, 1994.
29. K. Fall and S. Floyd, Simulation-based comparisons of Tahoe, Reno, and SACK TCP, Comput. Commun. Rev., 26 (3): 5–21, 1996.
30. S. Floyd, TCP and explicit congestion notification, Comput. Commun. Rev., 24 (5): 8–23, 1994.
31. P. Narvaez and K.-Y. Siu, An acknowledgment bucket scheme to regulate TCP flow over ATM, Proc. IEEE Globecom, 3: 1838–1844, 1997.
32. H. Balakrishnan et al., A comparison of mechanisms for improving TCP performance over wireless links, IEEE/ACM Trans. Network., 5 (6): 756–769, 1997.

EYTAN MODIANO
MIT Lincoln Laboratory

KAI-YEUNG SIU
MIT d'Arbeloff Laboratory for Information Systems and Technology
NETWORK INTERCONNECTION. See INTERNETWORKING.
Wiley Encyclopedia of Electrical and Electronics Engineering

Network Management
Standard Article
Yechiam Yemini, Columbia University, New York, NY
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W5319
Article Online Posting Date: December 27, 1999
Abstract | Full Text: HTML PDF (120K)
The sections in this article are: Challenges and Problems; Architecture of Network Management Systems
NETWORK MANAGEMENT
The term network management is often used in an imprecise way to capture multiple meanings. The first part, network, can mean the entire range of network communications and computing systems and services, or just the subset of these associated with the physical and network layers; in the latter case one distinguishes network management from system management. Management means both a collection of operations tasks handled by network and system administrators and support staff, and the technologies and software tools intended to simplify these tasks. This article uses the term network management in its broadest sense. Network here means