Principles of Computer System Design

Principles of Computer System Design An Introduction

Part II Chapters 7–11

Jerome H. Saltzer M. Frans Kaashoek Massachusetts Institute of Technology

Version 5.0

Saltzer & Kaashoek Ch. Part II, p. i

June 24, 2009 12:14 am

Copyright © 2009 by Jerome H. Saltzer and M. Frans Kaashoek. Some Rights Reserved.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. For more information on what this license means, visit

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which the authors are aware of a claim, the product names appear in initial capital or all capital letters. All trademarks that appear or are otherwise referred to in this work belong to their respective owners.

Suggestions, Comments, Corrections, and Requests to waive license restrictions: Please send correspondence by electronic mail to: [email protected] and [email protected]




PART I [In Printed Textbook] List of Sidebars. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xix

Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxvii

Where to Find Part II and other On-line Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxxvii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xxxix

Computer System Design Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xliii

CHAPTER 1 Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1

Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1. Systems and Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.1 Common Problems of Systems in Many Fields . . . . . . . . . . . . . . . . . . 3

1.1.2 Systems, Components, Interfaces and Environments . . . . . . . . . . . . . 8

1.1.3 Complexity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.2. Sources of Complexity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.2.1 Cascading and Interacting Requirements . . . . . . . . . . . . . . . . . . . . . 13

1.2.2 Maintaining High Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.3. Coping with Complexity I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1.3.1 Modularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1.3.2 Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

1.3.3 Layering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

1.3.4 Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.3.5 Putting it Back Together: Names Make Connections . . . . . . . . . . . . 26

1.4. Computer Systems are the Same but Different . . . . . . . . . . . . . . . . . . . . 27

1.4.1 Computer Systems Have no Nearby Bounds on Composition . . . . . 28

1.4.2 d(technology)/dt is Unprecedented. . . . . . . . . . . . . . . . . . . . . . . . . . 31

1.5. Coping with Complexity II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

1.5.1 Why Modularity, Abstraction, Layering, and Hierarchy aren’t Enough . . . . . . . . . . 36

1.5.2 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

1.5.3 Keep it Simple . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

What the Rest of this Book is about . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41




CHAPTER 2 Elements of Computer System Organization . . . . . . . . . . . . . . 43

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44

2.1. The Three Fundamental Abstractions . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2.1.1 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .45

2.1.2 Interpreters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .53

2.1.3 Communication Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59

2.2. Naming in Computer Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

2.2.1 The Naming Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .61

2.2.2 Default and Explicit Context References . . . . . . . . . . . . . . . . . . . . . .66

2.2.3 Path Names, Naming Networks, and Recursive Name Resolution . . .71

2.2.4 Multiple Lookup: Searching through Layered Contexts . . . . . . . . . . .73

2.2.5 Comparing Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .75

2.2.6 Name Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .76

2.3. Organizing Computer Systems with Names and Layers . . . . . . . . . . . . . . 78

2.3.1 A Hardware Layer: The Bus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80

2.3.2 A Software Layer: The File Abstraction . . . . . . . . . . . . . . . . . . . . . . .87

2.4. Looking Back and Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

2.5. Case Study: UNIX® File System Layering and Naming . . . . . . . . . . . . . . 91

2.5.1 Application Programming Interface for the UNIX File System . . . . . .91

2.5.2 The Block Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93

2.5.3 The File Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .95

2.5.4 The Inode Number Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .96

2.5.5 The File Name Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .96

2.5.6 The Path Name Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .98

2.5.7 Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .99

2.5.8 Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .101

2.5.9 The Absolute Path Name Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . .102

2.5.10 The Symbolic Link Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .104

2.5.11 Implementing the File System API . . . . . . . . . . . . . . . . . . . . . . . .106

2.5.12 The Shell, Implied Contexts, Search Paths, and Name Discovery .110

2.5.13 Suggestions for Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . .112

Exercises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .112

CHAPTER 3 The Design of Naming Schemes . . . . . . . . . . . . . . . . . . . . . . . 115

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .115

3.1. Considerations in the Design of Naming Schemes . . . . . . . . . . . . . . . . 116

3.1.1 Modular Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .116




3.1.2 Metadata and Name Overloading . . . . . . . . . . . . . . . . . . . . . . . . . . 120

3.1.3 Addresses: Names that Locate Objects . . . . . . . . . . . . . . . . . . . . . . 122

3.1.4 Generating Unique Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

3.1.5 Intended Audience and User-Friendly Names. . . . . . . . . . . . . . . . . 127

3.1.6 Relative Lifetimes of Names, Values, and Bindings . . . . . . . . . . . . . 129

3.1.7 Looking Back and Ahead: Names are a Basic System Component . 131

3.2. Case Study: The Uniform Resource Locator (URL) . . . . . . . . . . . . . . . 132

3.2.1 Surfing as a Referential Experience; Name Discovery . . . . . . . . . . . 132

3.2.2 Interpretation of the URL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

3.2.3 URL Case Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

3.2.4 Wrong Context References for a Partial URL . . . . . . . . . . . . . . . . . 135

3.2.5 Overloading of Names in URLs . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

3.3. War Stories: Pathologies in the Use of Names. . . . . . . . . . . . . . . . . . . . 138

3.3.1 A Name Collision Eliminates Smiling Faces . . . . . . . . . . . . . . . . . . 139

3.3.2 Fragile Names from Overloading, and a Market Solution . . . . . . . . 139

3.3.3 More Fragile Names from Overloading, with Market Disruption . . 140

3.3.4 Case-Sensitivity in User-Friendly Names . . . . . . . . . . . . . . . . . . . . 141

3.3.5 Running Out of Telephone Numbers . . . . . . . . . . . . . . . . . . . . . . . 142

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

CHAPTER 4 Enforcing Modularity with Clients and Services . . . . . . . . . . .147

Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

4.1. Client/service organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

4.1.1 From soft modularity to enforced modularity . . . . . . . . . . . . . . . . . 149

4.1.2 Client/service organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

4.1.3 Multiple clients and services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

4.1.4 Trusted intermediaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

4.1.5 A simple example service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

4.2. Communication between client and service . . . . . . . . . . . . . . . . . . . . . 167

4.2.1 Remote procedure call (RPC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

4.2.2 RPCs are not identical to procedure calls . . . . . . . . . . . . . . . . . . . . 169

4.2.3 Communicating through an intermediary . . . . . . . . . . . . . . . . . . . 172

4.3. Summary and the road ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

4.4. Case study: The Internet Domain Name System (DNS) . . . . . . . . . . . 175

4.4.1 Name resolution in DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

4.4.2 Hierarchical name management . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

4.4.3 Other features of DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181




4.4.4 Name discovery in DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .183

4.4.5 Trustworthiness of DNS responses . . . . . . . . . . . . . . . . . . . . . . . . .184

4.5. Case study: The Network File System (NFS). . . . . . . . . . . . . . . . . . . . . 184

4.5.1 Naming remote files and directories. . . . . . . . . . . . . . . . . . . . . . . . .185

4.5.2 The NFS remote procedure calls . . . . . . . . . . . . . . . . . . . . . . . . . . .187

4.5.3 Extending the UNIX file system to support NFS. . . . . . . . . . . . . . . .190

4.5.4 Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .192

4.5.5 NFS version 3 and beyond . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .194

Exercises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .195

CHAPTER 5 Enforcing Modularity with Virtualization . . . . . . . . . . . . . . . . . 199

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .200

5.1. Client/Service Organization within a Computer using Virtualization 201

5.1.1 Abstractions for Virtualizing Computers . . . . . . . . . . 203

Threads . . . . . . . . . . 204

Virtual Memory . . . . . . . . . . 206

Bounded Buffer . . . . . . . . . . 206

Operating System Interface . . . . . . . . . . 207

5.1.2 Emulation and Virtual Machines. . . . . . . . . . . . . . . . . . . . . . . . . . .208

5.1.3 Roadmap: Step-by-Step Virtualization. . . . . . . . . . . . . . . . . . . . . . .208

5.2. Virtual Links using SEND, RECEIVE, and a Bounded Buffer . . . . . . . . . . . . 210

5.2.1 An Interface for SEND and RECEIVE with Bounded Buffers. . . . . . . . . .210

5.2.2 Sequence Coordination with a Bounded Buffer . . . . . . . . . . . . . . . .211

5.2.3 Race Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .214

5.2.4 Locks and Before-or-After Actions. . . . . . . . . . . . . . . . . . . . . . . . . .218

5.2.5 Deadlock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .221

5.2.6 Implementing ACQUIRE and RELEASE . . . . . . . . . . . . . . . . . . . . . . . . . .222

5.2.7 Implementing a Before-or-After Action Using the One-Writer Principle . . . . . . . . . . 225

5.2.8 Coordination between Synchronous Islands with Asynchronous Connections . . . . . . . . . . 228

5.3. Enforcing Modularity in Memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

5.3.1 Enforcing Modularity with Domains. . . . . . . . . . . . . . . . . . . . . . . .230

5.3.2 Controlled Sharing using Several Domains . . . . . . . . . . . . . . . . . . .231

5.3.3 More Enforced Modularity with Kernel and User Mode . . . . . . . . .234

5.3.4 Gates and Changing Modes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .235

5.3.5 Enforcing Modularity for Bounded Buffers . . . . . . . . . . . . . . . . . . .237




5.3.6 The Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

5.4. Virtualizing Memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

5.4.1 Virtualizing Addresses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

5.4.2 Translating Addresses using a Page Map . . . . . . . . . . . . . . . . . . . . . 245

5.4.3 Virtual Address Spaces . . . . . . . . . . 248

Primitives for Virtual Address Spaces . . . . . . . . . . 248

The Kernel and Address Spaces . . . . . . . . . . 250

Discussion . . . . . . . . . . 251

5.4.4 Hardware versus Software and the Translation Look-Aside Buffer. . 252

5.4.5 Segments (Advanced Topic) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253

5.5. Virtualizing Processors using Threads . . . . . . . . . . . . . . . . . . . . . . . . . 255

5.5.1 Sharing a processor among multiple threads . . . . . . . . . . . . . . . . . . 255

5.5.2 Implementing YIELD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

5.5.3 Creating and Terminating Threads . . . . . . . . . . . . . . . . . . . . . . . . . 264

5.5.4 Enforcing Modularity with Threads: Preemptive Scheduling . . . . . 269

5.5.5 Enforcing Modularity with Threads and Address Spaces . . . . . . . . . 271

5.5.6 Layering Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271

5.6. Thread Primitives for Sequence Coordination . . . . . . . . . . . . . . . . . . . 273

5.6.1 The Lost Notification Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

5.6.2 Avoiding the Lost Notification Problem with Eventcounts and Sequencers . . . . . . . . . . 275

5.6.3 Implementing AWAIT, ADVANCE, TICKET, and READ (Advanced Topic) . . . . . . . . . . 280

5.6.4 Polling, Interrupts, and Sequence coordination. . . . . . . . . . . . . . . . 282

5.7. Case study: Evolution of Enforced Modularity in the Intel x86 . . . . . . 284

5.7.1 The early designs: no support for enforced modularity . . . . . . . . . . 285

5.7.2 Enforcing Modularity using Segmentation . . . . . . . . . . . . . . . . . . . 286

5.7.3 Page-Based Virtual Address Spaces . . . . . . . . . . . . . . . . . . . . . . . . . 287

5.7.4 Summary: more evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288

5.8. Application: Enforcing Modularity using Virtual Machines . . . . . . . . . 290

5.8.1 Virtual Machine Uses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290

5.8.2 Implementing Virtual Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 291

5.8.3 Virtualizing Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294

CHAPTER 6 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .299

Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300




6.1. Designing for Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300

6.1.1 Performance Metrics . . . . . . . . . . 302

Capacity, Utilization, Overhead, and Useful Work . . . . . . . . . . 302

Latency . . . . . . . . . . 302

Throughput . . . . . . . . . . 303

6.1.2 A Systems Approach to Designing for Performance . . . . . . . . . . . . .304

6.1.3 Reducing latency by exploiting workload properties . . . . . . . . . . . .306

6.1.4 Reducing Latency Using Concurrency. . . . . . . . . . . . . . . . . . . . . . .307

6.1.5 Improving Throughput: Concurrency . . . . . . . . . . . . . . . . . . . . . . .309

6.1.6 Queuing and Overload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .311

6.1.7 Fighting Bottlenecks . . . . . . . . . . 313

Batching . . . . . . . . . . 314

Dallying . . . . . . . . . . 314

Speculation . . . . . . . . . . 314

Challenges with Batching, Dallying, and Speculation . . . . . . . . . . 315

6.1.8 An Example: the I/O bottleneck . . . . . . . . . . . . . . . . . . . . . . . . . . .316

6.2. Multilevel Memories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321

6.2.1 Memory Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .322

6.2.2 Multilevel Memory Management using Virtual Memory. . . . . . . . .323

6.2.3 Adding multilevel memory management to a virtual memory . . . . .327

6.2.4 Analyzing Multilevel Memory Systems . . . . . . . . . . . . . . . . . . . . . .331

6.2.5 Locality of reference and working sets . . . . . . . . . . . . . . . . . . . . . . .333

6.2.6 Multilevel Memory Management Policies . . . . . . . . . . . . . . . . . . . .335

6.2.7 Comparative analysis of different policies . . . . . . . . . . . . . . . . . . . .340

6.2.8 Other Page-Removal Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . .344

6.2.9 Other aspects of multilevel memory management . . . . . . . . . . . . . .346

6.3. Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347

6.3.1 Scheduling Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .348

6.3.2 Scheduling metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .349

6.3.3 Scheduling Policies . . . . . . . . . . 352

First-Come, First-Served . . . . . . . . . . 353

Shortest-Job-First . . . . . . . . . . 354

Round-Robin . . . . . . . . . . 355

Priority Scheduling . . . . . . . . . . 357

Real-Time Schedulers . . . . . . . . . . 359




6.3.4 Case study: Scheduling the Disk Arm. . . . . . . . . . . . . . . . . . . . . . . 360

Exercises . . . . . . . . . . 362

About Part II . . . . . . . . . . 369

Appendix A: The Binary Classification Trade-off . . . . . . . . . . 371

Suggestions for Further Reading . . . . . . . . . . 375

Problem Sets for Part I . . . . . . . . . . 425

Glossary . . . . . . . . . . 475

Index of Concepts . . . . . . . . . . 513

PART II [On-Line]

CHAPTER 7 The Network as a System and as a System Component . . . . . . . . . . 7–1

Overview . . . . . . . . . . 7–2

7.1. Interesting Properties of Networks . . . . . . . . . . 7–3

7.1.1 Isochronous and Asynchronous Multiplexing . . . . . . . . . . . . . . . . . 7–5

7.1.2 Packet Forwarding; Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–9

7.1.3 Buffer Overflow and Discarded Packets . . . . . . . . . . . . . . . . . . . . 7–12

7.1.4 Duplicate Packets and Duplicate Suppression . . . . . . . . . . . . . . . . 7–15

7.1.5 Damaged Packets and Broken Links . . . . . . . . . . . . . . . . . . . . . . . 7–18

7.1.6 Reordered Delivery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–19

7.1.7 Summary of Interesting Properties and the Best-Effort Contract . 7–20

7.2. Getting Organized: Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–20

7.2.1 Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–23

7.2.2 The Link Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–25

7.2.3 The Network Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–27

7.2.4 The End-to-End Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–28

7.2.5 Additional Layers and the End-to-End Argument. . . . . . . . . . . . . 7–30

7.2.6 Mapped and Recursive Applications of the Layered Model . . . . . . 7–32

7.3. The Link Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–34

7.3.1 Transmitting Digital Data in an Analog World . . . . . . . . . . . . . . . 7–34

7.3.2 Framing Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–38

7.3.3 Error Handling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–40

7.3.4 The Link Layer Interface: Link Protocols and Multiplexing . . . . . 7–41

7.3.5 Link Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–44




7.4. The Network Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–46

7.4.1 Addressing Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7–46

7.4.2 Managing the Forwarding Table: Routing . . . . . . . . . . . . . . . . . . .7–48

7.4.3 Hierarchical Address Assignment and Hierarchical Routing. . . . . .7–56

7.4.4 Reporting Network Layer Errors . . . . . . . . . . . . . . . . . . . . . . . . . .7–59

7.4.5 Network Address Translation (An Idea That Almost Works) . . . . .7–61

7.5. The End-to-End Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–62

7.5.1 Transport Protocols and Protocol Multiplexing . . . . . . . . . . . . . . .7–63

7.5.2 Assurance of At-Least-Once Delivery; the Role of Timers . . . . . . .7–67

7.5.3 Assurance of At-Most-Once Delivery: Duplicate Suppression . . . .7–71

7.5.4 Division into Segments and Reassembly of Long Messages . . . . . .7–73

7.5.5 Assurance of Data Integrity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7–73

7.5.6 End-to-End Performance: Overlapping and Flow Control . . . . . . . . . . 7–75

Overlapping Transmissions . . . . . . . . . . 7–75

Bottlenecks, Flow Control, and Fixed Windows . . . . . . . . . . 7–77

Sliding Windows and Self-Pacing . . . . . . . . . . 7–79

Recovery of Lost Data Segments with Windows . . . . . . . . . . 7–81

7.5.7 Assurance of Stream Order, and Closing of Connections . . . . . . . .7–82

7.5.8 Assurance of Jitter Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7–84

7.5.9 Assurance of Authenticity and Privacy . . . . . . . . . . . . . . . . . . . . . .7–85

7.6. A Network System Design Issue: Congestion Control . . . . . . . . . . . . 7–86

7.6.1 Managing Shared Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7–86

7.6.2 Resource Management in Networks . . . . . . . . . . . . . . . . . . . . . . .7–89

7.6.3 Cross-layer Cooperation: Feedback . . . . . . . . . . . . . . . . . . . . . . . .7–91

7.6.4 Cross-layer Cooperation: Control . . . . . . . . . . . . . . . . . . . . . . . . .7–93

7.6.5 Other Ways of Controlling Congestion in Networks . . . . . . . . . . .7–94

7.6.6 Delay Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7–98

7.7. Wrapping up Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–99

7.8. Case Study: Mapping the Internet to the Ethernet. . . . . . . . . . . . . . 7–100

7.8.1 A Brief Overview of Ethernet . . . . . . . . . . . . . . . . . . . . . . . . . . .7–100

7.8.2 Broadcast Aspects of Ethernet . . . . . . . . . . . . . . . . . . . . . . . . . . .7–101

7.8.3 Layer Mapping: Attaching Ethernet to a Forwarding Network . .7–103

7.8.4 The Address Resolution Protocol. . . . . . . . . . . . . . . . . . . . . . . . .7–105

7.9. War Stories: Surprises in Protocol Design . . . . . . . . . . . . . . . . . . . . 7–107

7.9.1 Fixed Timers Lead to Congestion Collapse in NFS . . . . . . . . . . .7–107

7.9.2 Autonet Broadcast Storms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7–108

7.9.3 Emergent Phase Synchronization of Periodic Protocols . . . . . . . .7–108




7.9.4 Wisconsin Time Server Meltdown . . . . . . . . . . . . . . . . . . . . . . . 7–109

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–111

CHAPTER 8 Fault Tolerance: Reliable Systems from Unreliable Components . . . . . . . . . . 8–1

Overview . . . . . . . . . . 8–2

8.1. Faults, Failures, and Fault Tolerant Design . . . . . . . . . . 8–3

8.1.1 Faults, Failures, and Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–3

8.1.2 The Fault-Tolerance Design Process . . . . . . . . . . . . . . . . . . . . . . . . 8–6

8.2. Measures of Reliability and Failure Tolerance. . . . . . . . . . . . . . . . . . . 8–8

8.2.1 Availability and Mean Time to Failure . . . . . . . . . . . . . . . . . . . . . . 8–8

8.2.2 Reliability Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–13

8.2.3 Measuring Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–16

8.3. Tolerating Active Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–16

8.3.1 Responding to Active Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–16

8.3.2 Fault Tolerance Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–18

8.4. Systematically Applying Redundancy . . . . . . . . . . . . . . . . . . . . . . . . 8–20

8.4.1 Coding: Incremental Redundancy . . . . . . . . . . . . . . . . . . . . . . . . 8–21

8.4.2 Replication: Massive Redundancy. . . . . . . . . . . . . . . . . . . . . . . . . 8–25

8.4.3 Voting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–26

8.4.4 Repair. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–31

8.5. Applying Redundancy to Software and Data . . . . . . . . . . . . . . . . . . 8–36

8.5.1 Tolerating Software Faults. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–36

8.5.2 Tolerating Software (and other) Faults by Separating State . . . . . . 8–37

8.5.3 Durability and Durable Storage . . . . . . . . . . . . . . . . . . . . . . . . . . 8–39

8.5.4 Magnetic Disk Fault Tolerance . . . . . . . . . . 8–40

Magnetic Disk Fault Modes . . . . . . . . . . 8–41

System Faults . . . . . . . . . . 8–42

Raw Disk Storage . . . . . . . . . . 8–43

Fail-Fast Disk Storage . . . . . . . . . . 8–43

Careful Disk Storage . . . . . . . . . . 8–45

Durable Storage: RAID 1 . . . . . . . . . . 8–46

Improving on RAID 1 . . . . . . . . . . 8–47

Detecting Errors Caused by System Crashes . . . . . . . . . . 8–49

Still More Threats to Durability . . . . . . . . . . 8–49




8.6. Wrapping up Reliability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–51

8.6.1 Design Strategies and Design Principles. . . . . . . . . . . . . . . . . . . . .8–51

8.6.2 How about the End-to-End Argument?. . . . . . . . . . . . . . . . . . . . .8–52

8.6.3 A Caution on the Use of Reliability Calculations. . . . . . . . . . . . . .8–53

8.6.4 Where to Learn More about Reliable Systems . . . . . . . . . . . . . . . .8–53

8.7. Application: A Fault Tolerance Model for CMOS RAM . . . . . . . . . . 8–55

8.8. War Stories: Fault Tolerant Systems that Failed . . . . . . . . . . . . . . . . . 8–57

8.8.1 Adventures with Error Correction . . . . . . . . . . . . . . . . . . . . . . . . .8–57

8.8.2 Risks of Rarely-Used Procedures: The National Archives . . . . . . . .8–59

8.8.3 Non-independent Replicas and Backhoe Fade . . . . . . . . . . . . . . . .8–60

8.8.4 Human Error May Be the Biggest Risk . . . . . . . . . . . . . . . . . . . . .8–61

8.8.5 Introducing a Single Point of Failure . . . . . . . . . . . . . . . . . . . . . . .8–63

8.8.6 Multiple Failures: The SOHO Mission Interruption . . . . . . . . . . .8–63

Exercises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8–64

CHAPTER 9 Atomicity: All-or-Nothing and Before-or-After . . . . . . . . . . . . 9–1
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9–2
9.1. Atomicity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–4

9.1.1 All-or-Nothing Atomicity in a Database . . . . . . . . . . . . . . . . . . . . .9–5

9.1.2 All-or-Nothing Atomicity in the Interrupt Interface . . . . . . . . . . . .9–6

9.1.3 All-or-Nothing Atomicity in a Layered Application . . . . . . . . . . . . .9–8

9.1.4 Some Actions With and Without the All-or-Nothing Property . . .9–10

9.1.5 Before-or-After Atomicity: Coordinating Concurrent Threads. . . .9–13

9.1.6 Correctness and Serialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9–16

9.1.7 All-or-Nothing and Before-or-After Atomicity. . . . . . . . . . . . . . . .9–19

9.2. All-or-Nothing Atomicity I: Concepts . . . . . . . . . . . . . . . . . . . . . . . . 9–21

9.2.1 Achieving All-or-Nothing Atomicity: ALL_OR_NOTHING_PUT . . .9–21

9.2.2 Systematic Atomicity: Commit and the Golden Rule . . . . . . . . . .9–27

9.2.3 Systematic All-or-Nothing Atomicity: Version Histories . . . . . . . .9–30

9.2.4 How Version Histories are Used . . . . . . . . . . . . . . . . . . . . . . . . . .9–37

9.3. All-or-Nothing Atomicity II: Pragmatics . . . . . . . . . . . . . . . . . . . . . . 9–38

9.3.1 Atomicity Logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9–39

9.3.2 Logging Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9–42

9.3.3 Recovery Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9–45

9.3.4 Other Logging Configurations: Non-Volatile Cell Storage. . . . . . .9–47

9.3.5 Checkpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9–51

9.3.6 What if the Cache is not Write-Through? (Advanced Topic) . . . . .9–53




9.4. Before-or-After Atomicity I: Concepts . . . . . . . . . . . . . . . . . . . . . . . 9–54

9.4.1 Achieving Before-or-After Atomicity: Simple Serialization . . . . . . 9–54

9.4.2 The Mark-Point Discipline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–58

9.4.3 Optimistic Atomicity: Read-Capture (Advanced Topic) . . . . . . . . 9–63

9.4.4 Does Anyone Actually Use Version Histories for Before-or-After Atomicity? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–67

9.5. Before-or-After Atomicity II: Pragmatics . . . . . . . . . . . . . . . . . . . . . 9–69

9.5.1 Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–70

9.5.2 Simple Locking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–72

9.5.3 Two-Phase Locking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–73

9.5.4 Performance Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–75

9.5.5 Deadlock; Making Progress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–76

9.6. Atomicity across Layers and Multiple Sites . . . . . . . . . . . . . . . . . . . . 9–79

9.6.1 Hierarchical Composition of Transactions . . . . . . . . . . . . . . . . . . 9–80

9.6.2 Two-Phase Commit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–84

9.6.3 Multiple-Site Atomicity: Distributed Two-Phase Commit . . . . . . 9–85

9.6.4 The Dilemma of the Two Generals . . . . . . . . . . . . . . . . . . . . . . . . 9–90

9.7. A More Complete Model of Disk Failure (Advanced Topic) . . . . . . . 9–92

9.7.1 Storage that is Both All-or-Nothing and Durable . . . . . . . . . . . . . 9–92

9.8. Case Studies: Machine Language Atomicity . . . . . . . . . . . . . . . . . . . 9–95

9.8.1 Complex Instruction Sets: The General Electric 600 Line. . . . . . . 9–95

9.8.2 More Elaborate Instruction Sets: The IBM System/370 . . . . . . . . 9–96

9.8.3 The Apollo Desktop Computer and the Motorola M68000 Microprocessor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–97

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–98

CHAPTER 10 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–1
Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–2
10.1. Constraints and Interface Consistency . . . . . . . . . . . . . . . . . . . . . . . 10–2

10.2. Cache Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–4

10.2.1 Coherence, Replication, and Consistency in a Cache . . . . . . . . . 10–4

10.2.2 Eventual Consistency with Timer Expiration . . . . . . . . . . . . . . . 10–5

10.2.3 Obtaining Strict Consistency with a Fluorescent Marking Pen . . 10–7

10.2.4 Obtaining Strict Consistency with the Snoopy Cache. . . . . . . . . 10–7

10.3. Durable Storage Revisited: Widely Separated Replicas . . . . . . . . . . 10–9

10.3.1 Durable Storage and the Durability Mantra . . . . . . . . . . . . . . . . 10–9

10.3.2 Replicated State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–11




10.3.3 Shortcuts to Meet more Modest Requirements . . . . . . . . . . . . .10–13

10.3.4 Maintaining Data Integrity . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–15

10.3.5 Replica Reading and Majorities . . . . . . . . . . . . . . . . . . . . . . . . .10–16

10.3.6 Backup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–17

10.3.7 Partitioning Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–18

10.4. Reconciliation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–19

10.4.1 Occasionally Connected Operation . . . . . . . . . . . . . . . . . . . . . .10–20

10.4.2 A Reconciliation Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–22

10.4.3 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–25

10.4.4 Clock Coordination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–26

10.5. Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–26

10.5.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–27

10.5.2 Trade-Offs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–28

10.5.3 Directions for Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . .10–31

Exercises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–32

CHAPTER 11 Information Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–1
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–4
11.1. Introduction to Secure Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–5

11.1.1 Threat Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–7

11.1.2 Security is a Negative Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–9

11.1.3 The Safety Net Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–10

11.1.4 Design Principles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–13

11.1.5 A High d(technology)/dt Poses Challenges For Security . . . . . .11–17

11.1.6 Security Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–18

11.1.7 Trusted Computing Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–26

11.1.8 The Road Map for this Chapter . . . . . . . . . . . . . . . . . . . . . . . .11–28

11.2. Authenticating Principals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–28

11.2.1 Separating Trust from Authenticating Principals . . . . . . . . . . . .11–29

11.2.2 Authenticating Principals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–30

11.2.3 Cryptographic Hash Functions, Computationally Secure, Window of Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–32

11.2.4 Using Cryptographic Hash Functions to Protect Passwords . . . .11–34

11.3. Authenticating Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–36

11.3.1 Message Authentication is Different from Confidentiality . . . . .11–37

11.3.2 Closed versus Open Designs and Cryptography . . . . . . . . . . . .11–38

11.3.3 Key-Based Authentication Model . . . . . . . . . . . . . . . . . . . . . . .11–41




11.3.4 Properties of SIGN and VERIFY . . . . . . . . . . . . . . . . . . . . . . . . . 11–41

11.3.5 Public-key versus Shared-Secret Authentication . . . . . . . . . . . . 11–44

11.3.6 Key Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–45

11.3.7 Long-Term Data Integrity with Witnesses . . . . . . . . . . . . . . . . 11–48

11.4. Message Confidentiality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–49

11.4.1 Message Confidentiality Using Encryption . . . . . . . . . . . . . . . . 11–49

11.4.2 Properties of ENCRYPT and DECRYPT . . . . . . . . . . . . . . . . . . . . 11–50

11.4.3 Achieving both Confidentiality and Authentication . . . . . . . . . 11–52

11.4.4 Can Encryption be Used for Authentication? . . . . . . . . . . . . . . 11–53

11.5. Security Protocols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–54

11.5.1 Example: Key Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–54

11.5.2 Designing Security Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . 11–60

11.5.3 Authentication Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–63

11.5.4 An Incorrect Key Exchange Protocol . . . . . . . . . . . . . . . . . . . . 11–66

11.5.5 Diffie-Hellman Key Exchange Protocol . . . . . . . . . . . . . . . . . . 11–68

11.5.6 A Key Exchange Protocol Using a Public-Key System . . . . . . . . 11–69

11.5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–71

11.6. Authorization: Controlled Sharing . . . . . . . . . . . . . . . . . . . . . . . . 11–72

11.6.1 Authorization Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–73

11.6.2 The Simple Guard Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–73
      The Ticket System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–74
      The List System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–74
      Tickets Versus Lists, and Agencies . . . . . . . . . . . . . . . . . 11–75
      Protection Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–76

11.6.3 Example: Access Control in UNIX . . . . . . . . . . . . . . . . . . . . . 11–76
      Principals in UNIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–76
      ACLs in UNIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–77
      The Default Principal and Permissions of a Process . . . . 11–78
      Authenticating Users . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–79
      Access Control Check. . . . . . . . . . . . . . . . . . . . . . . . . . . 11–79
      Running Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–80
      Summary of UNIX Access Control . . . . . . . . . . . . . . . . . 11–80

11.6.4 The Caretaker Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–80

11.6.5 Non-Discretionary Access and Information Flow Control . . . 11–81
      Information Flow Control Example . . . . . . . . . . . . . . . . 11–83
      Covert Channels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–84




11.7. Advanced Topic: Reasoning about Authentication. . . . . . . . . . . . . 11–85

11.7.1 Authentication Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–86
      Hard-wired Approach . . . . . . . . . . . . . . . . . . . . . . . . . . .11–88
      Internet Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–88

11.7.2 Authentication in Distributed Systems . . . . . . . . . . . . . . . . . . .11–89

11.7.3 Authentication across Administrative Realms. . . . . . . . . . . . . . .11–90

11.7.4 Authenticating Public Keys . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–92

11.7.5 Authenticating Certificates . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–94

11.7.6 Certificate Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–97
      Hierarchy of Central Certificate Authorities . . . . . . . . . .11–97
      Web of Trust . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–98

11.8. Cryptography as a Building Block (Advanced Topic). . . . . . . . . . . 11–99

11.8.1 Unbreakable Cipher for Confidentiality (One-Time Pad) . . . . . .11–99

11.8.2 Pseudorandom Number Generators. . . . . . . . . . . . . . . . . . .11–101
      RC4: A Pseudorandom Generator and its Use . . . . . . . .11–101
      Confidentiality using RC4. . . . . . . . . . . . . . . . . . . . . . .11–102

11.8.3 Block Ciphers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–103
      Advanced Encryption Standard (AES). . . . . . . . . . . . . .11–103
      Cipher-Block Chaining . . . . . . . . . . . . . . . . . . . . . . . . .11–105

11.8.4 Computing a Message Authentication Code . . . . . . . . . . . .11–106
      MACs Using Block Cipher or Stream Cipher . . . . . . . .11–107
      MACs Using a Cryptographic Hash Function. . . . . . . .11–107

11.8.5 A Public-Key Cipher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–109
      Rivest-Shamir-Adleman (RSA) Cipher . . . . . . . . . . . . .11–109
      Computing a Digital Signature . . . . . . . . . . . . . . . . . . .11–111
      A Public-Key Encrypting System. . . . . . . . . . . . . . . . . .11–112

11.9. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–112

11.10. Case Study: Transport Layer Security (TLS) for the Web. . . . . . 11–116

11.10.1 The TLS Handshake . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–117

11.10.2 Evolution of TLS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–120

11.10.3 Authenticating Services with TLS . . . . . . . . . . . . . . . . . . . . .11–121

11.10.4 User Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–123

11.11. War Stories: Security System Breaches . . . . . . . . . . . . . . . . . . . . 11–125

11.11.1 Residues: Profitable Garbage . . . . . . . . . . . . . . . . . . . . . . .11–126
      1963: Residues in CTSS . . . . . . . . . . . . . . . . . . . . . . . .11–126



      1997: Residues in Network Packets . . . . . . . . . . . . . . . 11–127
      2000: Residues in HTTP . . . . . . . . . . . . . . . . . . . . . . . 11–127
      Residues on Removed Disks . . . . . . . . . . . . . . . . . . . . . 11–128
      Residues in Backup Copies. . . . . . . . . . . . . . . . . . . . . . 11–128
      Magnetic Residues: High-Tech Garbage Analysis . . . . . 11–129
      2001 and 2002: More Low-tech Garbage Analysis . . . . 11–129

11.11.2 Plaintext Passwords Lead to Two Breaches . . . . . . . . . . . . . . 11–130

11.11.3 The Multiply Buggy Password Transformation . . . . . . . . . . . 11–131

11.11.4 Controlling the Configuration . . . . . . . . . . . . . . . . . . . . . 11–131
      Authorized People Sometimes do Unauthorized Things 11–132
      The System Release Trick . . . . . . . . . . . . . . . . . . . . . . . 11–132
      The Slammer Worm. . . . . . . . . . . . . . . . . . . . . . . . . . . 11–132

11.11.5 The Kernel Trusts the User . . . . . . . . . . . . . . . . . . . . . . . . 11–135
      Obvious Trust. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–135
      Nonobvious Trust (Tocttou) . . . . . . . . . . . . . . . . . . . . . 11–136
      Tocttou 2: Virtualizing the DMA Channel. . . . . . . . . . 11–136

11.11.6 Technology Defeats Economic Barriers. . . . . . . . . . . . . . . 11–137
      An Attack on Our System Would be Too Expensive . . . 11–137
      Well, it Used to be Too Expensive . . . . . . . . . . . . . . . . 11–137

11.11.7 Mere Mortals Must be Able to Figure Out How to Use it . . . 11–138

11.11.8 The Web can be a Dangerous Place . . . . . . . . . . . . . . . . . . . 11–139

11.11.9 The Reused Password . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–140

11.11.10 Signaling with Clandestine Channels . . . . . . . . . . . . . . . 11–141
      Intentionally I: Banging on the Walls . . . . . . . . . . . . . . 11–141
      Intentionally II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–141
      Unintentionally . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–142

11.11.11 It Seems to be Working Just Fine . . . . . . . . . . . . . . . . . . 11–142
      I Thought it was Secure . . . . . . . . . . . . . . . . . . . . . . . . 11–143
      How Large is the Key Space…Really?. . . . . . . . . . . . . . 11–144
      How Long are the Keys? . . . . . . . . . . . . . . . . . . . . . . . . 11–145

11.11.12 Injection For Fun and Profit . . . . . . . . . . . . . . . . . . . . . . 11–145
      Injecting a Bogus Alert Message to the Operator . . . . . 11–146
      CardSystems Exposes 40,000,000 Credit Card Records to SQL Injection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–146

11.11.13 Hazards of Rarely-Used Components . . . . . . . . . . . . . . . . . 11–148




11.11.14 A Thorough System Penetration Job . . . . . . . . . . . . . . . . 11–148
11.11.15 Framing Enigma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–149
Exercises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–151
Suggestions for Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . SR–1
Problem Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . PS–1
Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . GL–1
Complete Index of Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . INDEX–1


List of Sidebars


PART I [In Printed Textbook]

CHAPTER 1 Systems

Sidebar 1.1: Stopping a Supertanker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6
Sidebar 1.2: Why Airplanes can’t Fly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7
Sidebar 1.3: Terminology: Words used to Describe System Composition . . . . . . . .9
Sidebar 1.4: The Cast of Characters and Organizations . . . . . . . . . . . . . . . . . . . . .14
Sidebar 1.5: How Modularity Reshaped the Computer Industry. . . . . . . . . . . . . . .21
Sidebar 1.6: Why Computer Technology has Improved Exponentially with Time. .32

CHAPTER 2 Elements of Computer System Organization

Sidebar 2.1: Terminology: durability, stability, and persistence . . . . . . . . . . . . . . .46
Sidebar 2.2: How magnetic disks work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .49
Sidebar 2.3: Representation: pseudocode and messages . . . . . . . . . . . . . . . . . . . . .54
Sidebar 2.4: What is an operating system?. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .79
Sidebar 2.5: Human engineering and the principle of least astonishment . . . . . . . .85

CHAPTER 3 The Design of Naming Schemes

Sidebar 3.1: Generating a unique name from a timestamp . . . . . . . . . . . . . . . . . .125
Sidebar 3.2: Hypertext links in the Shakespeare Electronic Archive. . . . . . . . . . . .129

CHAPTER 4 Enforcing Modularity with Clients and Services

Sidebar 4.1: Enforcing modularity with a high-level language . . . . . . . . . . . . . . .154
Sidebar 4.2: Representation: Timing diagrams . . . . . . . . . . . . . . . . . . . . . . . . . .156
Sidebar 4.3: Representation: Big-Endian or Little-Endian? . . . . . . . . . . . . . . . . . .158
Sidebar 4.4: The X Window System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .162
Sidebar 4.5: Peer-to-peer: computing without trusted intermediaries . . . . . . . . . .164

CHAPTER 5 Enforcing Modularity with Virtualization

Sidebar 5.1: RSM, test-and-set and avoiding locks . . . . . . . . . . . . . . . . . . . . . . . .224
Sidebar 5.2: Constructing a before-or-after action without special instructions . . .226
Sidebar 5.3: Bootstrapping an operating system . . . . . . . . . . . . . . . . . . . . . . . . . .239
Sidebar 5.4: Process, thread, and address space . . . . . . . . . . . . . . . . . . . . . . . . . .249
Sidebar 5.5: Position-independent programs . . . . . . . . . . . . . . . . . . . . . . . . . . . .251
Sidebar 5.6: Interrupts, exceptions, faults, traps, and signals. . . . . . . . . . . . . . . . .259
Sidebar 5.7: Avoiding the lost notification problem with semaphores . . . . . . . . . .277

CHAPTER 6 Performance

Sidebar 6.1: Design hint: When in doubt use brute force . . . . . . . . . . . . . . . . . . .301
Sidebar 6.2: Design hint: Optimize for the common case . . . . . . . . . . . . . . . . . . .307
Sidebar 6.3: Design hint: Instead of reducing latency, hide it . . . . . . . . . . . . . . . .310
Sidebar 6.4: RAM latency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .323
Sidebar 6.5: Design hint: Separate mechanism from policy. . . . . . . . . . . . . . . . . .330
Sidebar 6.6: OPT is a stack algorithm and optimal. . . . . . . . . . . . . . . . . . . . . . . .343
Sidebar 6.7: Receive livelock. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .350
Sidebar 6.8: Priority inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .358

Part II [On-Line]

CHAPTER 7 The Network as a System and as a System Component

Sidebar 7.1: Error detection, checksums, and witnesses . . . . . . . . . . . . . . . . . . 7–10
Sidebar 7.2: The Internet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–32
Sidebar 7.3: Framing phase-encoded bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–37
Sidebar 7.4: Shannon’s capacity theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–37
Sidebar 7.5: Other end-to-end transport protocol interfaces . . . . . . . . . . . . . . . 7–66
Sidebar 7.6: Exponentially weighted moving averages. . . . . . . . . . . . . . . . . . . . 7–70
Sidebar 7.7: What does an acknowledgment really mean?. . . . . . . . . . . . . . . . . 7–77
Sidebar 7.8: The tragedy of the commons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–93
Sidebar 7.9: Retrofitting TCP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–95
Sidebar 7.10: The invisible hand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–98

CHAPTER 8 Fault Tolerance: Reliable Systems from Unreliable Components

Sidebar 8.1: Reliability functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–14
Sidebar 8.2: Risks of manipulating MTTFs . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–30
Sidebar 8.3: Are disk system checksums a wasted effort? . . . . . . . . . . . . . . . . . . 8–49
Sidebar 8.4: Detecting failures with heartbeats. . . . . . . . . . . . . . . . . . . . . . . . . 8–54

CHAPTER 9 Atomicity: All-or-Nothing and Before-or-After

Sidebar 9.1: Actions and transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–4
Sidebar 9.2: Events that might lead to invoking an exception handler . . . . . . . . 9–7
Sidebar 9.3: Cascaded aborts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–29
Sidebar 9.4: The many uses of logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–40




CHAPTER 10 Consistency

CHAPTER 11 Information Security

Sidebar 11.1: Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–7
Sidebar 11.2: Should designs and vulnerabilities be public? . . . . . . . . . . . . . . .11–14
Sidebar 11.3: Malware: viruses, worms, trojan horses, logic bombs, bots, etc. . . .11–19
Sidebar 11.4: Why are buffer overrun bugs so common? . . . . . . . . . . . . . . . . .11–23
Sidebar 11.5: Authenticating personal devices: the resurrecting duckling policy .11–47
Sidebar 11.6: The Kerberos authentication system . . . . . . . . . . . . . . . . . . . . . .11–58
Sidebar 11.7: Secure Hash Algorithm (SHA) . . . . . . . . . . . . . . . . . . . . . . . . .11–108
Sidebar 11.8: Economics of computer security . . . . . . . . . . . . . . . . . . . . . . . .11–115


Preface to Part II


This textbook, Principles of Computer System Design: An Introduction, is an introduction to the principles and abstractions used in the design of computer systems. It is an outgrowth of notes written by the authors for the M.I.T. Electrical Engineering and Computer Science course 6.033, Computer System Engineering, over a period of 40-plus years.

The book is published in two parts:

• Part I, containing chapters 1–6 and supporting materials for those chapters, is a traditional printed textbook published by Morgan Kaufmann, an imprint of Elsevier. (ISBN: 978–012374957–4)

• Part II, consisting of Chapters 7–11 and supporting materials for those chapters, is made available on-line by M.I.T. OpenCourseWare and the authors as an open educational resource.

Availability of the two parts and various supporting materials is described in the section with that title below.

Part II of the textbook continues a main theme of Part I—enforcing modularity—by introducing still stronger forms of modularity. Part I introduces methods that help prevent accidental errors in one module from propagating to another. Part II introduces stronger forms of modularity that can help protect against component and system failures and against malicious attacks. Part II explores communication networks, constructing reliable systems from unreliable components, creating all-or-nothing and before-or-after transactions, and implementing security. In doing so, Part II also continues a second main theme of Part I by introducing several additional design principles related to stronger forms of modularity.

A detailed description of the contents of the chapters of Part II can be found in Part I, in the section “About Part II” on page 369. Part II also includes a table of contents for both Parts I and II, copies of the Suggested Additional Readings and Glossary, Problem Sets for both Parts I and II, and a comprehensive Index of Concepts with page numbers for both Parts I and II in a single alphabetic list.





Availability

The authors and MIT OpenCourseWare provide, free of charge, on-line versions of Chapters 7 through 11, the problem sets, the glossary, and a comprehensive index. Those materials can be found at

in the form of a series of PDF files (requires Adobe Reader), one per chapter or major supporting section, as well as a single PDF file containing the entire set. The publisher of the printed book also maintains a set of on-line resources at

Click on the link “Companion Materials”, where you will find Part II of the book as well as other resources, including figures from the text in several formats. Additional materials for instructors (registration required) can be found by clicking the “Manual” link. There are two additional sources of supporting material related to the teaching of course 6.033 Computer Systems Engineering, at M.I.T. The first source is an OpenCourseWare site containing materials from the teaching of the class in 2005: a class description; lecture, reading, and assignment schedule; board layouts; and many lecture videos. These materials are at

The second source is a Web site for the current 6.033 class. This site contains the current lecture schedule, which includes assignments, lecturer notes, and slides. There is also a thirteen-year archive of class assignments, design projects, and quizzes. These materials are all at

(Some copyrighted or privacy-sensitive materials on that Web site are restricted to current MIT students.)




Acknowledgments

This textbook began as a set of notes for the advanced undergraduate course Engineering of Computer Systems (6.033, originally 6.233), offered by the Department of Electrical Engineering and Computer Science of the Massachusetts Institute of Technology starting in 1968. The text has benefited from some four decades of comments and suggestions by many faculty members, visitors, recitation instructors, teaching assistants, and students. Over 5,000 students have used (and suffered through) draft versions, and observations of their learning experiences (as well as frequent confusion caused by the text) have informed the writing. We are grateful for those many contributions. In addition, certain aspects deserve specific acknowledgment.

1. Naming (Section 2.2 and Chapter 3)

The concept and organization of the materials on naming grew out of extensive discussions with Michael D. Schroeder. The naming model (and part of our development) follows closely the one developed by D. Austin Henderson in his Ph.D. thesis. Stephen A. Ward suggested some useful generalizations of the naming model, and Roger Needham suggested several concepts in response to an earlier version of this material. That earlier version, including in-depth examples of the naming model applied to addressing architectures and file systems, and an historical bibliography, was published as Chapter 3 in Rudolf Bayer et al., editors, Operating Systems: An Advanced Course, Lecture Notes in Computer Science 60, pages 99–208. Springer-Verlag, 1978, reprinted 1984. Additional ideas have been contributed by many others, including Ion Stoica, Karen Sollins, Daniel Jackson, Butler Lampson, David Karger, and Hari Balakrishnan.

2. Enforced Modularity and Virtualization (Chapters 4 and 5)

Chapter 4 was heavily influenced by lectures on the same topic by David L. Tennenhouse. Both chapters have been improved by substantial feedback from Hari Balakrishnan, Russ Cox, Michael Ernst, Eddie Kohler, Chris Laas, Barbara H. Liskov, Nancy Lynch, Samuel Madden, Robert T. Morris, Max Poletto, Martin Rinard, Susan Ruff, Gerald Jay Sussman, Julie Sussman, and Michael Walfish.

3. Networks (Chapter 7[on-line])

Conversations with David D. Clark and David L. Tennenhouse were instrumental in laying out the organization of this chapter, and lectures by Clark were the basis for part of the presentation. Robert H. Halstead Jr. wrote an early draft set of notes about networking, and some of his ideas have also been borrowed. Hari Balakrishnan provided many suggestions and corrections and helped sort out muddled explanations, and Julie Sussman and Susan Ruff pointed out many opportunities to improve the presentation. The material on congestion control was developed with the help of extensive discussions




with Hari Balakrishnan and Robert T. Morris, and is based in part on ideas from Raj Jain.

4. Fault Tolerance (Chapter 8[on-line])

Most of the concepts and examples in this chapter were originally articulated by Claude Shannon, Edward F. Moore, David Huffman, Edward J. McCluskey, Butler W. Lampson, Daniel P. Siewiorek, and Jim N. Gray.

5. Transactions and Consistency (Chapters 9[on-line] and 10[on-line])

The material of the transactions and consistency chapters has been developed over the course of four decades with aid and ideas from many sources. The concept of version histories is due to Jack Dennis, and the particular form of all-or-nothing and before-or-after atomicity with version histories developed here is due to David P. Reed. Jim N. Gray not only came up with many of the ideas described in these two chapters, he also provided extensive comments. (That doesn’t imply endorsement—he disagreed strongly about the importance of some of the ideas!) Other helpful comments and suggestions were made by Hari Balakrishnan, Andrew Herbert, Butler W. Lampson, Barbara H. Liskov, Samuel R. Madden, Larry Rudolph, Gerald Jay Sussman, and Julie Sussman.

6. Computer Security (Chapter 11[on-line])

Sections 11.1 and 11.6 draw heavily from the paper “The Protection of Information in Computer Systems” by Jerome H. Saltzer and Michael D. Schroeder, Proceedings of the IEEE 63, 9 (September, 1975), pages 1278–1308. Ronald Rivest, David Mazières, and Robert T. Morris made significant contributions to material presented throughout the chapter. Brad Chen, Michael Ernst, Kevin Fu, Charles Leiserson, Susan Ruff, and Seth Teller made numerous suggestions for improving the text.

7. Suggested Outside Readings

Ideas for suggested readings have come from many sources. Particular thanks must go to Michael D. Schroeder, who uncovered several of the classic systems papers in places outside computer science where nobody else would have thought to look; Edward D. Lazowska, who provided an extensive reading list used at the University of Washington; and Butler W. Lampson, who provided a thoughtful review of the list.

8. The Exercises and Problem Sets

The exercises at the end of each chapter and the problem sets at the end of the book have been collected, suggested, tried, debugged, and revised by many different faculty members, instructors, teaching assistants, and undergraduate students over a period of 40 years in the process of constructing quizzes and examinations while teaching the material of the text.




Certain of the longer exercises and most of the problem sets, which are based on lead-in stories and include several related questions, represent a substantial effort by a single individual. For those problem sets not developed by one of the authors, a credit line appears in a footnote on the first page of the problem set.

Following each problem or problem set is an identifier of the form “1978–3–14”. This identifier reports the year, examination number, and problem number of the examination in which some version of that problem first appeared.

Jerome H. Saltzer
M. Frans Kaashoek
2009





Computer System Design Principles


Throughout the text, the description of a design principle presents its name in a boldfaced display, and each place that the principle is used highlights it in underlined italics.

Design principles applicable to many areas of computer systems

• Adopt sweeping simplifications So you can see what you are doing.

• Avoid excessive generality If it is good for everything, it is good for nothing.

• Avoid rarely used components Deterioration and corruption accumulate unnoticed—until the next use.

• Be explicit Get all of the assumptions out on the table.

• Decouple modules with indirection Indirection supports replaceability.

• Design for iteration You won't get it right the first time, so make it easy to change.

• End-to-end argument The application knows best.

• Escalating complexity principle Adding a feature increases complexity out of proportion.

• Incommensurate scaling rule Changing a parameter by a factor of ten requires a new design.

• Keep digging principle Complex systems fail for complex reasons.

• Law of diminishing returns The more one improves some measure of goodness, the more effort the next improvement will require.

• Open design principle Let anyone comment on the design; you need all the help you can get.

• Principle of least astonishment People are part of the system. Choose interfaces that match the user’s experience, expectations, and mental models.

• Robustness principle Be tolerant of inputs, strict on outputs.

• Safety margin principle Keep track of the distance to the edge of the cliff or you may fall over the edge.

• Unyielding foundations rule It is easier to change a module than to change the modularity.

Design principles applicable to specific areas of computer systems

• Atomicity: Golden rule of atomicity Never modify the only copy!

• Coordination: One-writer principle If each variable has only one writer, coordination is simpler.

• Durability: The durability mantra Multiple copies, widely separated and independently administered.

• Security: Minimize secrets Because they probably won’t remain secret for long.

• Security: Complete mediation Check every operation for authenticity, integrity, and authorization.

• Security: Fail-safe defaults Most users won’t change them, so set defaults to do something safe.

• Security: Least privilege principle Don’t store lunch in the safe with the jewels.

• Security: Economy of mechanism The less there is, the more likely you will get it right.

• Security: Minimize common mechanism Shared mechanisms provide unwanted communication paths.

Design Hints (useful but not as compelling as design principles)

• Exploit brute force
• Instead of reducing latency, hide it
• Optimize for the common case
• Separate mechanism from policy



The Network as a System and as a System Component


CHAPTER CONTENTS

Overview..........................................................................................7–2

7.1 Interesting Properties of Networks ...........................................7–3

7.1.1 Isochronous and Asynchronous Multiplexing ............................ 7–5

7.1.2 Packet Forwarding; Delay ...................................................... 7–9

7.1.3 Buffer Overflow and Discarded Packets ................................. 7–12

7.1.4 Duplicate Packets and Duplicate Suppression ......................... 7–15

7.1.5 Damaged Packets and Broken Links ...................................... 7–18

7.1.6 Reordered Delivery ............................................................. 7–19

7.1.7 Summary of Interesting Properties and the Best-Effort Contract 7–20

7.2 Getting Organized: Layers .......................................................7–20

7.2.1 Layers .............................................................................. 7–23

7.2.2 The Link Layer ................................................................... 7–25

7.2.3 The Network Layer ............................................................. 7–27

7.2.4 The End-to-End Layer ......................................................... 7–28

7.2.5 Additional Layers and the End-to-End Argument ..................... 7–30

7.2.6 Mapped and Recursive Applications of the Layered Model ........ 7–32

7.3 The Link Layer.........................................................................7–34

7.3.1 Transmitting Digital Data in an Analog World ......................... 7–34

7.3.2 Framing Frames ................................................................. 7–38

7.3.3 Error Handling ................................................................... 7–40

7.3.4 The Link Layer Interface: Link Protocols and Multiplexing ........ 7–41

7.3.5 Link Properties .................................................................. 7–44

7.4 The Network Layer ..................................................................7–46

7.4.1 Addressing Interface .......................................................... 7–46

7.4.2 Managing the Forwarding Table: Routing ............................... 7–48

7.4.3 Hierarchical Address Assignment and Hierarchical Routing ....... 7–56

7.4.4 Reporting Network Layer Errors ........................................... 7–59

7.4.5 Network Address Translation (An Idea That Almost Works) ...... 7–61

7.5 The End-to-End Layer ..............................................................7–62

7.5.1 Transport Protocols and Protocol Multiplexing ......................... 7–63

7.5.2 Assurance of At-Least-Once Delivery; the Role of Timers ......... 7–67

7.5.3 Assurance of At-Most-Once Delivery: Duplicate Suppression .... 7–71


7.5.4 Division into Segments and Reassembly of Long Messages ...... 7–73

7.5.5 Assurance of Data Integrity ................................................. 7–73

7.5.6 End-to-End Performance: Overlapping and Flow Control .......... 7–75
          Overlapping Transmissions ............................................ 7–75
          Bottlenecks, Flow Control, and Fixed Windows ................. 7–77
          Sliding Windows and Self-Pacing .................................... 7–79
          Recovery of Lost Data Segments with Windows................ 7–81

7.5.7 Assurance of Stream Order, and Closing of Connections .......... 7–82

7.5.8 Assurance of Jitter Control .................................................. 7–84

7.5.9 Assurance of Authenticity and Privacy ................................... 7–85

7.6 A Network System Design Issue: Congestion Control..............7–86

7.6.1 Managing Shared Resources ................................................ 7–86

7.6.2 Resource Management in Networks ...................................... 7–89

7.6.3 Cross-layer Cooperation: Feedback ....................................... 7–91

7.6.4 Cross-layer Cooperation: Control ......................................... 7–93

7.6.5 Other Ways of Controlling Congestion in Networks .................. 7–94

7.6.6 Delay Revisited .................................................................. 7–98

7.7 Wrapping up Networks............................................................7–99

7.8 Case Study: Mapping the Internet to the Ethernet ................7–100

7.8.1 A Brief Overview of Ethernet ...............................................7–100

7.8.2 Broadcast Aspects of Ethernet ............................................7–101

7.8.3 Layer Mapping: Attaching Ethernet to a Forwarding Network ...7–103

7.8.4 The Address Resolution Protocol ..........................................7–105

7.9 War Stories: Surprises in Protocol Design .............................7–107

7.9.1 Fixed Timers Lead to Congestion Collapse in NFS ..................7–107

7.9.2 Autonet Broadcast Storms ..................................................7–108

7.9.3 Emergent Phase Synchronization of Periodic Protocols ............7–108

7.9.4 Wisconsin Time Server Meltdown ........................................7–109

Exercises......................................................................................7–111
Glossary for Chapter 7 .................................................................7–125
Index of Chapter 7 .......................................................................7–135
Last chapter page 7–139

Overview

Almost every computer system includes one or more communication links, and these communication links are usually organized to form a network, which can be loosely defined as a communication system that interconnects several entities. The basic abstraction remains SEND (message) and RECEIVE (message), so we can view a network as an elaboration of a communication link. Networks have several interesting properties—interface style, interface timing, latency, failure modes, and parameter ranges—that require careful design attention. Although many of these properties appear in latent form in other system components, they become important or even dominate when the design includes communication.

Our study of networks begins, in Section 7.1, by identifying and investigating the interesting properties just mentioned, as well as methods of coping with those properties. Section 7.2 describes a three-layer reference model for a data communication network that is based on a best-effort contract, and Sections 7.3, 7.4, and 7.5 then explore more carefully a number of implementation issues and techniques for each of the three layers. Finally, Section 7.6 examines the problem of controlling network congestion.

A data communication network is an interesting example of a system itself. Most network designs make extensive use of layering as a modularization technique. Networks also provide in-depth examples of the issues involved in naming objects, in achieving fault tolerance, and in protecting information. (This chapter mentions fault tolerance and protection only in passing. Later chapters will return to these topics in proper depth.) In addition to layering, this chapter identifies several techniques that have wide applicability both within computer networks and elsewhere in networked computer systems—framing, multiplexing, exponential backoff, best-effort contracts, latency masking, error control, and the end-to-end argument. A glance at the glossary will show that the chapter defines a large number of concepts. A particular network design is not likely to require them all, and in some contexts some of the ideas would be overkill. The engineering of a network as a system component requires trade-offs and careful judgement.

It is easy to be diverted into an in-depth study of networks because they are a fascinating topic in their own right. However, we will limit our exploration to their uses as system components and as a case study of system issues.
If this treatment sparks a deeper interest in the topic, the Suggestions for Further Reading at the end of this book include several good books and papers that provide wide-ranging treatments of all aspects of networks.

7.1 Interesting Properties of Networks

The design of communication networks is dominated by three intertwined considerations: (1) a trio of fundamental physical properties, (2) the mechanics of sharing, and (3) a remarkably wide range of parameter values. The first dominating consideration is the trio of fundamental physical properties:

1. The speed of light is finite. Using the most direct route, and accounting for the velocity of propagation in real-world communication media, it takes about 20 milliseconds to transmit a signal across the 2,600 miles from Boston to Los Angeles. This time is known as the propagation delay, and there is no way to avoid it without moving the two cities closer together. If the signal travels via a geostationary satellite perched 22,400 miles above the equator and at a longitude halfway between those two cities, the propagation delay jumps to 244 milliseconds, a latency large enough that a human, not just a computer, will notice.


But communication between two computers in the same room may have a propagation delay of only 10 nanoseconds. That shorter latency makes some things easier to do, but the important implication is that network systems may have to accommodate a range of delay that spans seven orders of magnitude.

2. Communication environments are hostile. Computers are usually constructed of incredibly reliable components, and they are usually operated in relatively benign environments. But communication is carried out using wires, glass fibers, or radio signals that must traverse far more hostile environments ranging from under the floor to deep in the ocean. These environments endanger communication. Threats range from a burst of noise that wipes out individual bits to careless backhoe operators who sever cables that can require days to repair.

3. Communication media have limited bandwidth. Every transmission medium has a maximum rate at which one can transmit distinct signals. This maximum rate is determined by its physical properties, such as the distance between transmitter and receiver and the attenuation characteristics of the medium. Signals can be multilevel, not just binary, so the data rate can be greater than the signaling rate. However, noise limits the ability of a receiver to distinguish one signal level from another. The combination of limited signaling rate, finite signal power, and the existence of noise limits the rate at which data can be sent over a communication link.* Different network links may thus have radically different data rates, ranging from a few kilobits per second over a long-distance telephone line to several tens of gigabits per second over an optical fiber. Available data rate thus represents a second network parameter that may range over seven orders of magnitude.

The second dominating consideration of communication networks is that they are nearly always shared. Sharing arises for two distinct reasons.

1. Any-to-any connection.
Any communication system that connects more than two things intrinsically involves an element of sharing. If you have three computers, you usually discover quickly that there are times when you want to communicate between any pair. You can start by building a separate communication path between each pair, but this approach runs out of steam quickly because the number of paths required grows with the square of the number of communicating entities. Even in a small network, a shared communication system is usually much more practical—it is more economical and it is easier to manage. When the number of entities that need to communicate begins to grow, as suggested in Figure 7.1, there is little choice. A closely related observation is that networks may connect three entities or 300 million entities. The number of connected entities is

* The formula that relates signaling rate, signal power, noise level, and maximum data rate, known as Shannon’s capacity theorem, appears on page 7–37.
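The capacity theorem mentioned in this footnote can be previewed with a quick calculation. The sketch below states the standard form of Shannon's formula, C = B log2(1 + S/N); the 3,000-hertz bandwidth and 30-decibel signal-to-noise ratio used as inputs are illustrative assumptions (roughly a voice-grade telephone line), not figures from the text.

```python
import math

def shannon_capacity(bandwidth_hz, signal_to_noise):
    """Shannon's capacity theorem: the maximum data rate, in bits per
    second, of a channel with bandwidth B hertz and signal-to-noise
    power ratio S/N is C = B * log2(1 + S/N)."""
    return bandwidth_hz * math.log2(1 + signal_to_noise)

# Assumed example: a 3,000-hertz line with a 30-decibel signal-to-noise
# ratio (a power ratio of 1,000) can carry at most about 30 kilobits
# per second, no matter how clever the modulation.
capacity = shannon_capacity(3000, 1000)
```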



thus a third network parameter with a wide range, in this case covering eight orders of magnitude.

2. Sharing of communication costs. Some parts of a communication system follow the same technological trends as do processors, memory, and disk: things made of silicon chips seem to fall in price every year. Other parts, such as digging up streets to lay wire or fiber, launching a satellite, or bidding to displace an existing radio-based service, are not getting any cheaper. Worse, when communication links leave a building, they require right-of-way, which usually subjects them to some form of regulation. Regulation operates on a majestic time scale, with procedures that involve courts and attorneys, legislative action, long-term policies, political pressures, and expediency. These procedures can eventually produce useful results, but on time scales measured in decades, whereas technological change makes new things feasible every year. This incommensurate rate of change means that communication costs rarely fall as fast as technology would permit, so sharing of those costs between otherwise independent users persists even in situations where the technology might allow them to avoid it.

The third dominating consideration of network design is the wide range of parameter values. We have already seen that propagation times, data rates, and the number of communicating computers can each vary by seven or more orders of magnitude. There is a fourth such wide-ranging parameter: a single computer may at different times present a network with widely differing loads, ranging from transmitting a file at 30 megabytes per second to interactive typing at a rate of one byte per second.

These three considerations, unyielding physical limits, sharing of facilities, and existence of four different parameters that can each range over seven or more orders of magnitude, intrude on every level of network design, and even carefully thought-out modularity cannot completely mask them.
As a result, systems that use networks as a component must take them into account.
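The first of these wide-ranging parameters, delay, can be made concrete with a back-of-the-envelope calculation that reproduces the chapter's own figures. The propagation velocity used here, about two-thirds the speed of light, is an assumed typical value for signals in copper or glass fiber, not a number given by the text.

```python
def propagation_delay(distance_meters, velocity=2e8):
    """Propagation delay in seconds over a path of the given length.

    The default velocity, 2 x 10^8 meters/second (roughly two-thirds
    of c), is an assumption for illustration; real media vary.
    """
    return distance_meters / velocity

# The 2,600 miles from Boston to Los Angeles: about 20 milliseconds.
cross_country = propagation_delay(2600 * 1609)

# Two computers in the same room, 2 meters apart: about 10 nanoseconds.
same_room = propagation_delay(2.0)

# The ratio spans roughly six to seven orders of magnitude, which is
# why a network system must tolerate an enormous range of delays.
delay_ratio = cross_country / same_room
```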

7.1.1 Isochronous and Asynchronous Multiplexing

Sharing has significant consequences. Consider the simplified (and gradually becoming obsolescent) telephone network of Figure 7.1, which allows telephones in Boston to talk with telephones in Los Angeles. There are three shared components in this picture: a switch in Boston, a switch in Los Angeles, and an electrical circuit acting as a communication link between the two switches. The communication link is multiplexed, which means simply that it is used for several different communications at the same time. Let’s focus on the multiplexed link. Suppose that there is an earthquake in Los Angeles, and many people in Boston simultaneously try to call their relatives in Los Angeles to find out what happened. The multiplexed link has a limited capacity, and at some point the next caller will be told the “network is busy.” (In the U.S. telephone network this event is usually signaled with “fast busy,” a series of beeps repeated at twice the speed of a usual busy signal.)


FIGURE 7.1 A simple telephone network. Telephones in Boston connect to a shared Boston switch, telephones in Los Angeles to a shared Los Angeles switch, and a multiplexed link joins the two switches.

This “network busy” phenomenon strikes rather abruptly because the telephone system traditionally uses a line multiplexing technique known as isochronous (from Greek roots meaning “equally timed”) communication. Suppose that the telephones are all digital, operating at 64 kilobits per second, and the multiplexed link runs at 45 megabits per second. If we look for the bits that represent the conversation between B2 and L3, we will find them on the wire as shown in Figure 7.2: at regular intervals we will find 8-bit blocks (called frames) carrying data from B2 to L3. To maintain the required data rate of 64 kilobits per second, another B2-to-L3 frame comes by every 5,624 bit times or 125 microseconds, producing a rate of 8,000 frames per second. In between each pair of B2-to-L3 frames there is room for 702 other frames, which may be carrying bits belonging to other telephone conversations. A 45 megabits/second link can thus carry up to 703 simultaneous conversations, but if a 704th person tries to initiate a call, that person will receive the “network busy” signal. Such a capacity-limiting scheme is sometimes called hard-edged, meaning in this case that it offers no resistance to the first 703 calls, but it absolutely refuses to accept the 704th one.

This scheme of dividing up the data into equal-size frames and transmitting the frames at equal intervals—known in the communications literature as time-division multiplexing (TDM)—is especially suited to telephony because, from the point of view of any one telephone conversation, it provides a constant rate of data flow and the delay from one end to the other is the same for every frame.


FIGURE 7.2 Data flow on an isochronous multiplexed link.
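The channel arithmetic above can be checked directly. This sketch uses only the numbers the text supplies: 64 kilobit/second telephones, 8-bit frames, and a 45 megabit/second link. Each 125-microsecond period spans 5,625 bit times on the link, of which 703 eight-bit frames occupy 5,624.

```python
LINK_RATE = 45_000_000   # bits/second on the multiplexed link
VOICE_RATE = 64_000      # bits/second for one digital telephone
FRAME_BITS = 8           # each frame carries 8 bits of one conversation

# Each conversation needs one 8-bit frame every 125 microseconds.
frames_per_second = VOICE_RATE // FRAME_BITS          # 8,000 frames/s
frame_period_us = 1_000_000 // frames_per_second      # 125 microseconds

# How many 8-bit frames fit on the link in one 125-microsecond period?
bit_times_per_period = LINK_RATE // frames_per_second # 5,625 bit times
max_conversations = bit_times_per_period // FRAME_BITS
```

With these constants, `max_conversations` comes out to 703, matching the text: the 704th caller hears “network busy.”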



One prerequisite to using isochronous communication is that there must be some prior arrangement between the sending switch and the receiving switch: an agreement that this periodic series of frames should be sent along to L3. This agreement is an example of a connection, and it requires some previous communication between the two switches to set up the connection, storage for remembered state at both ends of the link, and some method to discard (tear down) that remembered state when the conversation between B2 and L3 is complete.

Data communication networks usually use a strategy different from telephony for multiplexing shared links. The starting point for this different strategy is to examine the data rate and latency requirements when one computer sends data to another. Usually, computer-related activities send data on an irregular basis—in bursts called messages—as compared with the continuous stream of bits that flows out of a simple digital telephone. Bursty traffic is particularly ill-suited to fixed size and spacing of isochronous frames. During those times when B2 has nothing to send to L3, the frames allocated to that connection go unused. Yet when B2 does have something to send, it may be larger than one frame in size, in which case the message may take a long time to send because of the rigidly fixed spacing between frames. Even if intervening frames belonging to other connections are unfilled, they can’t be used by the connection from B2 to L3. When communicating data between two computers, a system designer is usually willing to forgo the guarantee of uniform data rate and uniform latency if in return an entire message can get through more quickly. Data communication networks achieve this trade-off by using what is called asynchronous (from Greek roots meaning “untimed”) multiplexing. For example, in Figure 7.3, a network connects several personal computers and a service.
In the middle of the network is a 45 megabits/second multiplexed link, shared by many network users. But, unlike the telephone example, this link is multiplexed asynchronously.

FIGURE 7.3 A simple data communication network. Personal computers A, B, and C and service D share a multiplexed link; data crosses this link in bursts and can tolerate variable delay.


FIGURE 7.4 Data flow on an asynchronous multiplexed link. Two frames of different lengths (4,000 bits and 750 bits) travel down the link, each preceded by its guidance information.

On an asynchronous link, a frame can be of any convenient length, and can be carried at any time that the link is not being used for another frame. Thus in the time sequence shown in Figure 7.4 we see two frames, the first going to B and the second going to D. Since the receiver can no longer figure out where the message in the frame is destined by simply counting bits, each frame must include a few extra bits that provide guidance about where to deliver it. A variable-length frame together with its guidance information is called a packet. The guidance information can take any of several forms. A common form is to provide the destination address of the message: the name of the place to which the message should be delivered. In addition to delivery guidance information, asynchronous data transmission requires some way of figuring out where each frame starts and ends, a process known as framing. In contrast, both addressing and framing with isochronous communication are done implicitly, by watching the clock.

Since a packet carries its own destination guidance, there is no need for any prior agreement between the ends of the multiplexed link. Asynchronous communication thus offers the possibility of connectionless transmission, in which the switches do not need to maintain state about particular end-user communications.*

An additional complication arises because most links place a limit on the maximum size of a frame. When a message is larger than this maximum size, it is necessary for the sender to break it up into segments, each of which the network carries in a separate packet, and include enough information with each segment to allow the original message to be reassembled at the other end. Asynchronous transmission can also be used for continuous streams of data such as from a digital telephone, by breaking the stream up into segments. Doing so does create a problem that the segments may not arrive at the other end at a uniform rate or with a uniform delay.
* Network experts make a subtle distinction among different kinds of packets by using the word datagram to describe a packet that carries all of the state information (for example, its destination address) needed to guide the packet through a network of packet forwarders that do not themselves maintain any state about particular end-to-end connections.

On the other hand, if the variations in rate and delay are small enough,





Packet Switch

Workstation at network attachment point A


1 Packet Switch 2 3 Service at network attachment point B

Packet Switch

Packet Switch


FIGURE 7.5 A packet forwarding network.

or the application can tolerate occasional missing segments of data, the method is still effective. In the case of telephony, the technique is called “packet voice” and it is gradually replacing many parts of the traditional isochronous voice network.
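The segmentation and reassembly just described can be sketched in a few lines. Everything here (the field names, the 512-byte payload limit, the dictionary representation of a packet) is a hypothetical illustration of a packet carrying its own guidance information, not a protocol the text defines.

```python
MAX_PAYLOAD = 512   # assumed maximum frame payload, in bytes

def segment(message: bytes, destination: str) -> list:
    """Break a long message into packets, each carrying its own
    guidance information: the destination address, plus a sequence
    number and segment count so the receiver can reassemble."""
    chunks = [message[i:i + MAX_PAYLOAD]
              for i in range(0, len(message), MAX_PAYLOAD)]
    return [{"destination": destination,  # delivery guidance
             "seq": seq,                  # position within the message
             "total": len(chunks),        # segments the receiver expects
             "payload": chunk}
            for seq, chunk in enumerate(chunks)]

def reassemble(packets: list) -> bytes:
    """Rebuild the original message, even if packets arrive reordered."""
    ordered = sorted(packets, key=lambda p: p["seq"])
    if len(ordered) != ordered[0]["total"]:
        raise ValueError("some segments are missing")
    return b"".join(p["payload"] for p in ordered)
```

Because each packet names its own destination and position, no per-connection state is needed in the switches, which is the essence of connectionless transmission.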

7.1.2 Packet Forwarding; Delay

Asynchronous communication links are usually organized in a communication structure known as a packet forwarding network. In this organization, a number of slightly specialized computers known as packet switches (in contrast with the circuit switches of Figure 7.1) are placed at convenient locations and interconnected with asynchronous links. Asynchronous links may also connect customers of the network to network attachment points, as in Figure 7.5. This figure shows two attachment points, named A and B, and it is evident that a packet going from A to B may follow any of several different paths, called routes, through the network. Choosing a particular path for a packet is known as routing. The upper right packet switch has three numbered links connecting it to three other packet switches. The packet coming in on its link #1, which originated at the workstation at attachment point A and is destined for the service at attachment point B, contains the address of its destination. By studying this address, the packet switch will be able to figure out that it should send the packet on its way via its link #3. Choosing an outgoing link is known as forwarding, and is usually done by table lookup. The construction of the forwarding tables is one of several methods of routing, so packet switches are also called forwarders or routers. The resulting organization resembles that of the postal service.

A forwarding network imposes a delay (known as its transit time) in sending something from A to B. There are four contributions to transit time, several of which may be different from one packet to the next.

Saltzer & Kaashoek Ch. 7, p. 9

June 25, 2009 8:22 am


CHAPTER 7 The Network as a System and as a System Component

1. Propagation delay. The time required for the signal to travel across a link is determined by the speed of light in the transmission medium connecting the packet switches and the physical distance the signals travel. Although it does vary slightly with temperature, from the point of view of a network designer propagation delay for any given link can be considered constant. (Propagation delay also applies to the isochronous network.)

2. Transmission delay. Since the frame that carries the packet may be long or short, the time required to send the frame at one switch—and receive it at the next switch—depends on the data rate of the link and the length of the frame. This time is known as transmission delay. Although some packet switches are clever enough to begin sending a packet out before completely receiving it (a trick known as cut-through), error recovery is simpler if the switch does not forward a packet until the entire packet is present and has passed some validity checks. Each time the packet is transmitted over another link, there is another transmission delay. A packet going from A to B via the dark links in Figure 7.5 will thus be subject to four transmission delays, one when A sends it to the first packet switch, one at each forwarding step, and finally one to transmit it to B.

3. Processing delay. Each packet switch will have to examine the guidance information in the packet to decide to which outgoing link to send it. The time required to figure this out, together with any other work performed on the packet, such as calculating a checksum (see Sidebar 7.1) to allow error detection or copying it to an output buffer that is somewhere else in memory, is known as processing delay.

Sidebar 7.1: Error detection, checksums, and witnesses

A checksum on a block of data is a stylized kind of error-detection code in which redundant error-detecting information, rather than being encoded into the data itself (as Chapter 8[on-line] will explain), is placed in a separate field. A typical simple checksum algorithm breaks the data block up into k-bit chunks and performs an exclusive OR on the chunks to produce a k-bit result. (When k = 1, this procedure is called a parity check.) That simple k-bit checksum would catch any one-bit error, but it would miss some two-bit errors, and it would not detect that two chunks of the block have been interchanged. Much more sophisticated checksum algorithms have been devised that can detect multiple-bit errors or that are good at detecting particular kinds of expected errors. As will be seen in Chapter 11[on-line], by using cryptographic techniques it is possible to construct a high-quality checksum with the property that it can detect all changes—even changes that have been intentionally introduced by a malefactor—with near certainty. Such a checksum is called a witness, or fingerprint, and is useful for ensuring long-term integrity of stored data. The trade-off is that more elaborate checksums usually require more time to calculate and thus add to processing delay. For that reason, communication systems typically use the simplest checksum algorithm that has a reasonable chance of detecting the expected errors.
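As a concrete illustration, the sidebar’s simple XOR checksum can be sketched in a few lines (here operating on k-byte rather than k-bit chunks for convenience; the function name and chunking are our own choices, not any particular network’s algorithm):

```python
def xor_checksum(data: bytes, k: int = 1) -> bytes:
    """XOR the k-byte chunks of data together into a k-byte checksum.

    When k == 1 this is a longitudinal parity byte.  The data is
    zero-padded to a multiple of k bytes.
    """
    if len(data) % k:
        data = data + bytes(k - len(data) % k)
    result = bytearray(k)
    for i in range(0, len(data), k):
        for j in range(k):
            result[j] ^= data[i + j]
    return bytes(result)

block = b"hello, network"
good = xor_checksum(block)

# A single-bit error changes the checksum...
damaged = bytes([block[0] ^ 0x01]) + block[1:]
assert xor_checksum(damaged) != good

# ...but, as the sidebar warns, interchanging two chunks goes undetected.
swapped = block[1:2] + block[0:1] + block[2:]
assert xor_checksum(swapped) == good
```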

Saltzer & Kaashoek Ch. 7, p. 10

June 25, 2009 8:22 am

7.1 Interesting Properties of Networks


This delay typically has one part that is relatively constant from one packet to the next and a second part that is proportional to the length of the packet.

4. Queuing delay. When the packet from A to B arrives at the upper right packet switch, link #3 may already be transmitting another packet, perhaps one that arrived from link #2, and there may also be other packets queued up waiting to use link #3. If so, the packet switch will hold the arriving packet in a queue in memory until it has finished transmitting the earlier packets. The duration of this delay depends on the amount of other traffic passing through that packet switch, so it can be quite variable. Queuing delay can sometimes be estimated with queuing theory, using the queuing theory formula in Section 6.1.6. If packets arrive according to a random, memoryless process and have randomly distributed service times (technically, a Poisson distribution in which for this case the service time is the transmission delay of the outgoing link), the average queuing delay, measured in units of the packet service time and including the service time of this packet, will be 1/(1 − ρ). Here ρ is the utilization of the outgoing line, which can range from 0 to 1.

When we plot this result in Figure 7.6 we notice a typical system phenomenon: delay rises rapidly as the line utilization approaches 100%. This plot tells us that the asynchronous system has introduced a trade-off: if we wish to limit the average queuing delay, for example to the amount labeled in the figure “maximum tolerable delay,” it will be necessary to leave unused, on average, some of the capacity of each link; in the example this maximum utilization is labeled ρmax. Alternatively, if we allow the utilization to approach 100%, delays will grow without bound.
The asynchronous system seems to have replaced the abrupt appearance of the busy sig­ nal of the isochronous network with a gradual trade-off: as the system becomes busier, the delays increase. However, as we will see in Section 7.1.3, below, the replacement is actually more subtle than that.
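The 1/(1 − ρ) relationship is simple enough to experiment with directly (a sketch under the memoryless model described above; the function names are ours):

```python
def avg_queuing_delay(rho):
    # Average delay in units of the packet service time, including this
    # packet's own service time, under the memoryless (M/M/1) model.
    assert 0 <= rho < 1, "utilization must be below 100%"
    return 1.0 / (1.0 - rho)

def max_utilization(tolerable_delay):
    # Invert delay = 1/(1 - rho) to find the rho_max of Figure 7.6: the
    # largest utilization whose *average* delay meets the tolerance.
    return 1.0 - 1.0 / tolerable_delay

assert avg_queuing_delay(0.0) == 1.0             # idle link: just the service time
assert abs(avg_queuing_delay(0.8) - 5.0) < 1e-9  # at 80% load, 5x the service time
assert abs(max_utilization(10.0) - 0.9) < 1e-9   # tolerate 10x: keep load below 90%
```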

[Figure: average queuing delay, 1/(1 − ρ), plotted against utilization ρ from 0 to 100%; the curve rises rapidly as utilization approaches 100%, and a horizontal “maximum tolerable delay” line crosses it at utilization ρmax.]

FIGURE 7.6 Queuing delay as a function of utilization.




The formula and accompanying graph tell us only the average delay. If we try to load up a link so that its utilization is ρmax, the actual delay will exceed our tolerance threshold about as often as it is below that threshold. If we are serious about keeping the maximum delay almost always below a given value, we must prepare for occasional worse peaks by holding utilization below the level of ρmax suggested by the figure. If packets do not obey memoryless arrival statistics (for example, they arrive in long convoys, and all are the same, maximum size), the model no longer applies, and we need a better understanding of the arrival process before we can say anything about delays. This same utilization versus delay trade-off also applies to non-network components of a computer system that have queues, for example scheduling the processor or reading and writing a magnetic disk.

We have talked about queuing theory as if it might be useful in predicting the behavior of a network. It is not. In practice, network systems put a bound on link queuing delays by limiting the size of queues and by exerting control on arrivals. These mechanisms allow individual links to achieve high utilization levels, while shifting delays to other places in the network. The next section explains how, and it also explains just what happened to the isochronous network’s hard-edged busy signal. Later, in Section 7.6 of this chapter we will see how the delays can be shifted all the way back to the entry point of the network.
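Pulling the four contributions of Section 7.1.2 together, a back-of-the-envelope transit-time estimate might look like this (the link parameters, the fixed per-switch processing delay, and the zero queuing delay are invented for illustration):

```python
def transit_time(distance_m, rate_bps, frame_bits, hops,
                 processing_s=50e-6, queuing_s=0.0,
                 propagation_mps=2e8):
    # Transit time = propagation delay + one transmission delay per link
    # + processing delay at each intermediate switch + queuing delay.
    propagation = distance_m / propagation_mps
    transmission = hops * frame_bits / rate_bps
    processing = (hops - 1) * processing_s
    return propagation + transmission + processing + queuing_s

# A 1,000 km path over four 10 Mb/s links (as from A to B along the dark
# links of Figure 7.5), carrying a 12,000-bit frame:
t = transit_time(1_000_000, 10e6, 12_000, hops=4)
assert abs(t - (0.005 + 0.0048 + 0.00015)) < 1e-9   # just under 10 ms
```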

7.1.3 Buffer Overflow and Discarded Packets

Continuing for a moment to apply queuing theory, queuing has an implication: buffer space is needed to hold the queue of packets waiting for transmission. How large a buffer should the designer allocate? Under the memoryless arrival interval assumption, the average number of packets awaiting transmission (including the one currently being transmitted) is 1/(1 − ρ). As with queuing delay, that number is only the average—queuing theory tells us that the variance of the queue length is also 1/(1 − ρ). For a ρ of 0.8 the average queue length and the variance are both 5, so if one wishes to allow enough buffers to handle peaks that are, say, three standard deviations above the average, one must be prepared to buffer not only the 5 packets predicted as the average but also (3 × √5 ≅ 7) more, a total of 12 packets. Worse, in many real networks packets don’t actually arrive independently at random; they come in buffer-bursting batches.

At this point, we can imagine three quite different strategies for choosing a buffer size:

1. Plan for the worst case. Examine the network traffic carefully, figure out what the worst-case traffic situation will be, and allocate enough buffers to handle it.

2. Plan for the usual case and fight back. Based on a calculation such as the one above, choose a buffer size that will work most of the time, and if the buffers fill up send messages back through the network asking someone to stop sending.

3. Plan for the usual case and discard overflow. Again, choose a buffer size that will work most of the time, and ruthlessly discard packets when the buffers are full.
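The buffer-sizing arithmetic above can be checked with a few lines (assuming, as in the text, the memoryless model in which the mean and the variance of the queue length are both 1/(1 − ρ)):

```python
import math

def buffers_needed(rho, n_sigmas=3.0):
    # Provision for the average queue length plus n_sigmas standard
    # deviations; the standard deviation is the square root of the variance.
    mean = 1.0 / (1.0 - rho)
    sigma = math.sqrt(mean)
    return math.ceil(mean + n_sigmas * sigma)

# The example in the text: rho = 0.8 gives a mean of 5, plus
# 3 x sqrt(5), or about 7, more, for a total of 12 buffers.
assert buffers_needed(0.8) == 12
```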




Let’s explore these three possibilities in turn. Buffer memory is usually low in cost, so planning for the worst case seems like an attractive idea, but it is actually much harder than it sounds. For one thing, in a large network, it may be impossible to figure out what the worst case is—there just isn’t enough information available about what can happen. Even if one can estimate the worst case, the estimate may not be useful. Consider, for example, the Hypothetical Bank of Canada, which has 21,000 tellers scattered across the country. The branch at Moose Jaw, Saskatchewan, has one teller and usually is the target of only three transactions a day. Although it has never happened, and almost certainly never will, the worst case is that every one of the 20,999 other tellers simultaneously posts a withdrawal against a Moose Jaw account. Thus a worst-case design would require that there be enough buffers in the packet switch leading to Moose Jaw to handle 20,999 simultaneous messages. The problem with worst-case analysis is that the worst case can be many orders of magnitude larger than the average case, as well as extremely unlikely. Moreover, even if one decided to buy that large a buffer, the resulting queue to process all the transactions would be so long that many of the other tellers would give up in disgust and abort their transactions, so the large buffer wouldn’t really help.

This observation makes it sound attractive to choose a buffer size based on typical, rather than worst-case, loads. But then there is always going to be a chance that traffic will exceed the average for long enough to run out of buffer space. This situation is called congestion. What to do then? One idea is to push back. If buffer space begins to run low, send a message back along an incoming link saying “please don’t send any more until you hear from me”.
This message (called a quench request) may go to the packet switch at the other end of that link, or it may go all the way back to the original source that introduced the data into the network. Either way, pushing back is also harder than it sounds. If a packet switch is experiencing congestion, there is a good chance that the adjacent switch is also congested (if it is not already congested, it soon will be if it is told to stop sending data over the link to this switch), and sending an extra message is adding to the congestion. Worse, a set of packet switches configured in a cycle like that of Figure 7.5 can easily end up in a form of deadlock (called gridlock when it happens to automobile traffic), with all buffers filled and each switch waiting for the next switch to say that it is OK to start sending again.

One way to avoid deadlock among the packet switches is to send the quench request all the way back to the source. This method is hard too, for at least three reasons. First, it may not be clear to which source to send the quench. In our Moose Jaw example, there are 21,000 different sources, no one of which is, by itself, the cause of (nor capable of doing much about) the problem. Second, such a request may not have any effect because the source you choose to quench is no longer sending anyway. Again in our example, by the time the packet switch on the way to Moose Jaw detects the overload, all of the 21,000 tellers may have already sent their transaction requests, so asking them not to send anything else would accomplish nothing. Third, assuming that the quench message is itself forwarded back through the packet-switched network, it may run into congestion and be subject to queuing delays. The busier the network, the longer it will take to exert




control. We are proposing to create a feedback system with delay and should expect to see oscillations. Even if all the data is coming from one source, by the time the quench gets back and the source acts on it, the packets already in the pipeline may exceed the buffer capacity. Controlling congestion by quenching either the adjacent switch or the source is used in various special situations, but as a general technique it is currently an unsolved problem.

The remaining possibility is what most packet networks actually do in the face of congestion: when the buffers fill up, they start throwing packets away. This seems like a somewhat startling thing for a communication system to do because it will disrupt the communication, and eventually each discarded packet will have to be sent again, so the effort to send the packet this far will have been wasted. Nevertheless, this is an action that every packet switching network that is not configured for the worst case must be prepared to take.

Overflowing buffers and discarded packets lead to two remarkable consequences. First, the sender of a packet can interpret the lack of its acknowledgment as a sign that the network is congested, and can in turn reduce the rate at which it introduces new packets into the network. This idea, called automatic rate adaptation, is explored in depth in Section 7.6 of this chapter. The combination of discarded packets and automatic rate adaptation in turn produces the second consequence: simple theoretical models of network behavior based on standard queuing theory do not apply when a service may serve some requests and may discard others. Modeling of networks that have rate adaptation requires a much deeper understanding of the specific algorithms used not just by the network but also by network applications.
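As a preview of Section 7.6, one widely used form of automatic rate adaptation reacts to acknowledgments and their absence roughly like this (an additive-increase/multiplicative-decrease sketch; the policy and parameters here are illustrative, not a description of any particular network’s algorithm):

```python
def adapt_rate(rate, acked, min_rate=1.0, increase=1.0, decrease=0.5):
    # Creep the sending rate up while acknowledgments arrive; cut it
    # sharply when a missing acknowledgment suggests that a congested
    # forwarder discarded a packet.
    if acked:
        return rate + increase
    return max(min_rate, rate * decrease)

rate = 10.0
rate = adapt_rate(rate, acked=True)    # acknowledgment arrived: 11.0
rate = adapt_rate(rate, acked=False)   # packet apparently lost: back off to 5.5
assert rate == 5.5
```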
In the final analysis, the asynchronous network replaces the hard-edged blocking of the isochronous network with a variable transmission rate that depends on the instantaneous network load. Which scheme (asynchronous or isochronous) for dealing with overload is preferable depends on the application. For some applications it may be better to be told at the outset of a communications attempt to come back later, rather than to be allowed to start work only to encounter such variations in available capacity that it is hard to do anything useful. In other applications it may be more helpful to have some work done, slowly or at variable rates, rather than none at all.

The possibility that a network may actually discard packets to cope with congestion leads to a useful distinction between two kinds of forwarding networks. So far, we have been discussing what is usually described as a best-effort network, which, if it cannot dispatch a packet soon after receipt, may discard it. The alternative design is the guaranteed-delivery network (sometimes called a store-and-forward network, although that term is often applied to all forwarding networks), which takes heroic measures to avoid ever discarding payload data. Guaranteed delivery networks usually are designed to work with complete messages rather than packets. Typically, a guaranteed delivery network uses non-volatile storage such as a magnetic disk for buffering, so that it can handle large peaks of message load and can be confident that messages will not be lost even if there is a power failure or the forwarding computer crashes. Also, a guaranteed delivery network usually, when faced with the prospect of being completely unable to deliver a message




(perhaps because the intended recipient has vanished), explicitly returns the message to its originator along with an explanation of why delivery failed. Finally, in keeping with the spirit of not losing a message, a guaranteed delivery switch usually tracks individual messages carefully to make sure that none are lost or damaged during transmission, for example by a burst of noise. A switch of a best-effort network can be quite a bit simpler than a switch of a guaranteed-delivery network. Since the best-effort network may casually discard packets anyway, it does not need to make any special provisions for retransmitting damaged packets, for preserving packets in transit when the switch crashes and restarts, or for worrying about the case when the link to a destination node suddenly stops accepting data.

The best-effort network is said to provide a best-effort contract to its customers (this contract is defined more carefully in Section 7.1.7, below), rather than a guarantee of delivery. Of course, in the real world there are no absolute guarantees—the real distinction between the two designs is that there is intended to be a significant difference in the probability of undetected loss. When we examine network layering in Section 7.2 of this chapter, it will become apparent that these differences can be characterized another way: guaranteed-delivery networks are usually implemented in a higher network layer, best-effort networks in a lower network layer.

In these terms, the U.S. Postal Service operates a guaranteed delivery system for first-class mail, but a best-effort system for third-class (junk) mail, because postal regulations allow it to discard third-class mail that is misaddressed or when congestion gets out of hand. The Internet is organized as a best-effort system, but the Internet mechanisms for handling e-mail are designed as a guaranteed delivery system.
The Western Union company has always prided itself on operating a true guaranteed-delivery system, to the extent that when it decommissions an office it normally disassembles the site completely in a search for misplaced telegrams. There is a (possibly apocryphal) tale that such a disassembly once discovered a 75-year-old telegram that had fallen behind a water pipe. The company promptly delivered it to the astonished heirs of the original addressee.

7.1.4 Duplicate Packets and Duplicate Suppression

As it turns out, discarded packets are not as much of a problem to the higher-level application as one might expect because when a client sends a request to a service, it is always possible that the service is not available, or the service crashed just after receiving the request. So unanswered requests are actually a routine occurrence, and many network protocols include some kind of timer expiration and resend mechanism to recover from such failures. The timing diagram of Figure 7.7* illustrates the situation, showing a first packet carrying a request, followed by a packet going the other way carrying the response to the first request. A has set a timer, indicated by a vertical line, but the arrival of response 1 before the expiration of the timer causes A to switch off the timer, indicated by the small X. The packet carrying the second request is lost in transit (as indicated by

* The conventions for representation of timing diagrams were described in Sidebar 4.2.




[Figure: a timing diagram between client A and service B. A sends request 1 and sets a timer; response 1 arrives before the timer expires, so A resets the timer. A then sends request 2 and sets a timer, but an overloaded forwarder discards the request packet; the timer expires, A resends the request as request 2’, and when response 2 arrives A resets the timer.]

FIGURE 7.7 Lost packet recovery.

the large X), perhaps having been damaged or discarded by an overloaded forwarder, the timer expires, and A resends request 2 in the packet labeled request 2’.

When a congested forwarder discards a packet, there are two important consequences. First, the client doesn’t receive a response as quickly as originally hoped because a timer expiration period has been added to the overall response time. This extra delay can have a significant impact on performance. Second, users of the network must be prepared for duplicate requests and responses. The reason lies in the recovery mechanism just described. Suppose a network packet switch gets overloaded and must discard a response packet, as in Figure 7.8. Client A can’t tell the difference between this case and the case of Figure 7.7, so it resends its request. The service sees this resent request as a duplicate. Suppose B does not realize this is a duplicate, does what is requested, and sends back a response. Client A receives the response and assumes that everything is OK. That may be a correct assumption, or it may not, depending on whether or not the first arrival of request 3 changed B’s state. If B is a spelling checker, it will probably give the same response to both copies of the request. But if B is a bank and the request is to transfer funds, doing the request twice would be a mistake. So detecting duplicates may or may not be important, depending on the particular application.

For another example, if for some reason the network delays pile up and exceed the resend timer expiration period, the client may resend a request even though the original




[Figure: a timing diagram. A sends request 3 and sets a timer; B sends response 3, but an overloaded forwarder discards it. A’s timer expires, so A resends the request as request 3’ and sets a new timer; the duplicate arrives at B, which sends response 3’; A receives the response and resets the timer.]

FIGURE 7.8 Lost packet recovery leading to duplicate request.

response is still in transit. Since B can’t tell any difference between this case and the previous one, it responds in the same way, by doing what is requested. But now A receives a duplicate response, as in Figure 7.9. Again, this duplicate may or may not matter to A, but at minimum A must take steps not to be confused by the arrival of a duplicate response.

What if the arrival of a request from A causes B to change state, as in the bank transfer example? If so, it is usually important to detect and suppress duplicates generated by the lost packet recovery mechanism. The general procedure to suppress duplicates has two components. The first component is hinted at by the request and response numbers used in the illustrations: each request includes a nonce, which is a unique identifier that will

[Figure: a timing diagram. A sends request 4 and sets a timer; the packet containing response 4 gets delayed in the network, so A’s timer expires and it resends the request as request 4’. The duplicate arrives at B, which sends response 4’. A receives the delayed response 4, resets its timer, and then receives the duplicate response 4’.]

FIGURE 7.9 Network delay combined with recovery leading to duplicate response.




never be reused by A when sending requests to B. The illustration uses monotonically increasing serial numbers as nonces, but any unique identifier will do. The second duplicate suppression component is that B must maintain a list of nonces on which it has taken action or is still working, and whenever a request arrives B should look through this list to see whether or not this apparently new request is actually a duplicate of one previously received. If it is a duplicate B must not perform the action requested. On the other hand, B should not simply ignore the request, either, because the reason for the duplicate may be that A never received B’s response. So B needs some way of reconstructing and resending that previous response. The simplest way of doing this is usually for B to add to its list of previously handled nonces a copy of the corresponding responses so that it can easily resend them. Thus in Figure 7.9, the last action of B should be replaced with “B resends response 4”.

In some network designs, A may even receive duplicate responses to a single, unrepeated request. The reason is that a forwarding link deep inside the network may be using a timer expiration and resend protocol similar to the one above. For this reason, most protocols that are concerned about duplicate suppression include a copy of the nonce in the response, and the originator, A, maintains a list of nonces used in its outstanding requests. When a response comes back, A can check for the nonce in the list and delete that list entry or, if there is no list entry, assume it is a duplicate of a previously received response and ignore it. The procedure we have just described allows A to keep its list of nonces short, but B might have to maintain an ever-growing list of nonces and responses to be certain that it never accidentally processes a request twice.
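The two components of duplicate suppression can be sketched together (a toy illustration with invented names; a real service would also have to bound the nonce list, the problem the next paragraphs take up):

```python
class DedupService:
    """Suppress duplicates: remember each nonce acted on, along with the
    response sent, so a duplicate request causes a resend of the saved
    response rather than a second execution of the action."""

    def __init__(self, do_action):
        self.do_action = do_action         # the state-changing operation
        self.responses = {}                # nonce -> saved response

    def handle(self, nonce, request):
        if nonce in self.responses:        # duplicate: do NOT redo the action,
            return self.responses[nonce]   # just resend the earlier response
        response = self.do_action(request)
        self.responses[nonce] = response
        return response

# A toy "bank account": doing a transfer twice would be visible.
balance = [100]
def withdraw(amount):
    balance[0] -= amount
    return balance[0]

service = DedupService(withdraw)
assert service.handle(nonce=7, request=10) == 90
assert service.handle(nonce=7, request=10) == 90   # duplicate suppressed
assert balance[0] == 90                            # action happened only once
```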
A related problem concerns what happens if either participant crashes and restarts, losing its volatile memory, which is probably where it is keeping its list of nonces. Refinements to cope with these problems will be explored in detail when we revisit the topic of duplicate suppression on page 7–71 of this chapter.

Ensuring suppression of duplicates is a significant complication so, if possible, it is wise to design the service and its protocol in such a way that suppression is not required. Recall that the reason that duplicate suppression became important was that a request changed the state of the service. It is often possible to design a service interface so that it is idempotent, which for a network request means that repeating the same request or sequence of requests several times has the same effect as doing it just once. This design approach is explored in depth in the discussion of atomicity and error recovery in Chapter 9[on-line].
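From the client’s side, the timer-expiration-and-resend mechanism of this section can be sketched as follows (the send/recv interface, in which recv returns None when the timer expires, is hypothetical):

```python
def request_with_resend(send, recv, request, timeout=1.0, max_tries=5):
    # Send the request and set a timer (modeled by the timeout passed to
    # recv); on expiration, resend and set a new timer, up to max_tries.
    for attempt in range(max_tries):
        send(request)
        response = recv(timeout)
        if response is not None:
            return response
    raise TimeoutError("no response after %d attempts" % max_tries)

# Simulate the scenario of Figure 7.7: the first copy is discarded by an
# overloaded forwarder; the resent copy gets through.
sent = []
replies = iter([None, "response 2"])
result = request_with_resend(sent.append, lambda t: next(replies), "request 2")
assert result == "response 2"
assert len(sent) == 2    # the request was transmitted twice
```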

7.1.5 Damaged Packets and Broken Links

At the beginning of the chapter we noted that noise is one of the fundamental considerations that dominates the design of data communication. Data can be damaged during transmission, during transit through a switch, or in the memory of a forwarding node. Noise, transmission errors, and techniques for detecting and correcting errors are fascinating topics in their own right, explored in some depth in Chapter 8[on-line]. As a




general rule it is possible to sub-contract this area to a specialist in the theory of error detection and correction, with one requirement in the contract: when we receive data, we want to know whether or not it is correct. That is, we require that a reliable error detection mechanism be part of any underlying data transmission system. Section 7.3.3 of this chapter expands a bit on this error detection requirement.

Once we have contracted for data transmission with an error detection mechanism in which we have confidence, intermediate packet switches can then handle noise-damaged packets by simply discarding them. This approach changes the noise problem into one for which there is already a recovery procedure. Put another way, this approach transforms data loss into performance degradation.

Finally, because transmission links traverse hostile environments and must be considered fragile, a packet network usually has multiple interconnection paths, as in Figure 7.5. Links can go down while transmitting a frame; they may stay down briefly, e.g. because of a power interruption, or for long periods of time while waiting for someone to dig up a street or launch a replacement satellite. Flexibility in routing is an important property of a network of any size. We will return to the implications of broken links in the discussion of the network layer, in Section 7.4 of this chapter.

7.1.6 Reordered Delivery

When a packet-forwarding network has an interconnection topology like that of Figure 7.5, in which there is more than one path that a packet can follow from A to B, there is a possibility that a series of packets departing from A in sequential order may arrive at B in a different order. Some networks take special precautions to avoid this possibility by forcing all packets between the same two points to take the same path or by delaying delivery at the destination until all earlier packets have arrived. Both of these techniques introduce additional delay, and there are applications for which reducing delay is more important than receiving the segments of a message in the order in which they were transmitted.

Recalling that a message may have been divided into segments, the possibility of reordered delivery means that reassembly of the original message requires close attention. We have here a model of communication much like when a friend is touring on holiday by car, stopping each night in a different motel, and sending a motel postcard with an account of the day’s adventures. Whenever a day’s story doesn’t fit on one card, your friend uses two or three postcards, as necessary. The Post Office may deliver these cards to you in almost any order, and something on the postcard—probably the date—will be needed to enable you to read them in the proper order. Even when two cards are mailed at the same time from the same motel (as indicated by the motel photograph on the front) the Post Office may deliver them to you on different days, so there must be further information on the postcard to allow you to realize that the sender broke the original message into segments and you may need to wait for the next delivery before starting to read.
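A sketch of reassembly in the face of reordered delivery (the per-segment fields, a sequence number, a segment count, and the data, play the role of the date and the segment information on the postcards; the format is invented for illustration):

```python
def reassemble(segments):
    # Each segment carries (sequence_number, total_segments, data) and may
    # arrive in any order.  Return the whole message once every segment is
    # present; return None while we must wait for another delivery.
    if not segments:
        return None
    total = segments[0][1]
    by_seq = {seq: data for seq, _, data in segments}
    if len(by_seq) < total:
        return None
    return b"".join(by_seq[i] for i in range(total))

arrivals = [(2, 3, b"day 3"), (0, 3, b"day 1 "), (1, 3, b"day 2 ")]
assert reassemble(arrivals) == b"day 1 day 2 day 3"   # order restored
assert reassemble(arrivals[:2]) is None               # a segment is missing
```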




7.1.7 Summary of Interesting Properties and the Best-Effort Contract

Most of the ideas introduced in this section can be captured in just two illustrations. Figure 7.10 summarizes the differences in application characteristics and in response to overload between isochronous and asynchronous multiplexing. Similarly, Figure 7.11 briefly summarizes the interesting (the term “challenging” may also come to mind) properties of computer networks that we have encountered. The “best-effort contract” of the caption means that when a network accepts a segment, it offers the expectation that it will usually deliver the segment to its destination, but it does not guarantee success, and the client of the network is expected to be sophisticated enough to take in stride the possibility that segments may be lost, duplicated, variably delayed, or delivered out of order.

7.2 Getting Organized: Layers

To deal with the interesting properties of networks that we identified in Section 7.1, it is necessary to get organized. The primary organizing tool for networks is an example of the design principle adopt sweeping simplifications. All networks use the divide-and-conquer technique known as layering of protocols. But before we come to layers, we must establish what a protocol is.

                        Application characteristics

Network Type            Continuous stream         Bursts of data           Response to
                        (e.g., interactive        (most computer-to-       load variations
                        voice)                    computer data)

isochronous             good match                wastes capacity          (hard-edged)
(e.g., telephone                                                           either accepts
network)                                                                   or blocks call

asynchronous            variable latency          good match               (gradual)
(e.g., Internet)        upsets application                                 1. variable delay
                                                                           2. discards data
                                                                           3. rate adaptation

FIGURE 7.10 Isochronous versus asynchronous multiplexing.




Suppose we are examining the set of programs used by a defense contractor who is retooling for a new business, video games. In the main program we find the procedure call

FIRE (#_of_missiles, target, action_if_defended)

and elsewhere we find the corresponding procedure, which begins

procedure FIRE (nmissiles, where, reaction)

These constructs are interpreted at two levels. First, the system matches the name FIRE in the main program with the program that exports a procedure of the same name, and it arranges to transfer control from the main program to that procedure. The procedure, in turn, matches the arguments of the calling program, position by position, with its own parameters. Thus, in this example, the second argument, target, of the calling program is matched with the second parameter, where, of the called procedure. Beyond this mechanical matching, there is an implicit agreement between the programmer of the main program and the programmer of the procedure that this second argument is to be interpreted as the location that the missiles are intended to hit.

This set of agreements on how to interpret both the order and the meaning of the arguments stands as a kind of contract between the two programs. In programming languages, such contracts are called “specifications”; in networks, such contracts are called protocols. More generally, a protocol goes beyond just the interpretation of the arguments; it encompasses everything that either of the two parties can depend on about how

1. Networks encounter a vast range of
   • Data rates
   • Propagation, transmission, queuing, and processing delays
   • Loads
   • Numbers of users

2. Networks traverse hostile environments
   • Noise damages data
   • Links stop working

3. Best-effort networks have
   • Variable delays
   • Variable transmission rates
   • Discarded packets
   • Duplicate packets
   • Maximum packet length
   • Reordered delivery

FIGURE 7.11 A summary of the “interesting” properties of computer networks. The last group of bullets defines what is called the best-effort contract.




[Figure 7.12 (diagram): in the client, the call result ← FIRE (#, target, action) enters the client stub, which prepares a request message, sends it to the service, and waits for a response. The request message encodes — proc: FIRE, args: 3; type: integer, value: 2; type: string, value: “Lucifer”; type: procedure, value: EVADE. The service stub receives the request message, calls the requested procedure FIRE (nmiss, where, react), prepares a response message (an acknowledgment and the result — type: string, value: “destroyed”), and sends it back to the client.]

FIGURE 7.12 A remote procedure call.

the other will act or react. For example, in a client/service system, a request/response protocol might specify that the service send an immediate acknowledgment when it gets a request, so that the client knows that the service is there, and send the eventual response as a third message. An example of a protocol that we have already seen is that of the Network File System shown in Figure 4.10.

Let us suppose that our defense contractor wishes to further convert the software from a single-user game to a multiuser game, using a client/service organization. The main program will run as a client and the FIRE program will now run in a multiclient, game-coordinating service. To simplify the conversion, the contractor has chosen to use the remote procedure call (RPC) protocol illustrated in Figure 7.12. As described in Chapter 4, a stub procedure that runs in the client machine exports the name FIRE so that when the main program calls FIRE, control actually passes to the stub with that name. The stub collects the arguments, marshals them into a request message, and sends them over the network to the game-coordinating service. At the service, a corresponding stub waits for such a request to arrive, unmarshals the arguments in the request message, and uses them to perform a call to the real FIRE procedure. When FIRE completes its operation and returns, the service stub marshals any output value into a response message and sends it to the client. The client stub waits for this response message, and when it arrives, it unmarshals the return value in the response message and returns it as its own value to the main program. The procedure call protocol has been honored and the main program continues as if the procedure named FIRE had executed locally.

Figure 7.12 also illustrates a second, somewhat different, protocol between the client stub and the service stub, as compared with the protocol between the main program and the procedure it calls.
Between the two stubs, the request message spells out the name of the procedure to be called, the number of arguments, and the type of each argument.
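The marshaling that the stubs perform can be sketched as follows. The wire encoding here is hypothetical (JSON, chosen purely for readability; real RPC systems use compact binary formats), and the field names are illustrative, not those of any actual RPC protocol.

```python
import json

# Sketch of the request message of Figure 7.12: the client stub records
# the procedure name, the argument count, and each argument's type and
# value. Python type names stand in for the figure's "integer"/"string".

def marshal_request(proc_name, *args):
    return json.dumps({
        "proc": proc_name,
        "args": [{"type": type(a).__name__, "value": a} for a in args],
    }).encode()

def unmarshal_request(message):
    req = json.loads(message.decode())
    return req["proc"], [a["value"] for a in req["args"]]

request = marshal_request("FIRE", 2, "Lucifer", "EVADE")
assert unmarshal_request(request) == ("FIRE", [2, "Lucifer", "EVADE"])
```

Note that the procedure-valued argument EVADE is carried here as a name; how to transmit something like a procedure reference is itself a presentation-protocol decision.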



[Figure 7.13 (diagram): the main program talks to the called procedure via the application protocol; beneath them, the RPC client stub talks to the RPC service stub via the presentation protocol.]

FIGURE 7.13 Two protocol layers

The details of the protocol between the RPC stubs need have little in common with the corresponding details of the protocol between the original main program and the procedure it calls.

7.2.1 Layers

In that example, the independence of the MAIN-to-FIRE procedure call protocol from the RPC stub-to-stub protocol is characteristic of a layered design. We can make those layers explicit by redrawing our picture as in Figure 7.13. The contract between the main program and the procedure it calls is called the application protocol. The contract between the client-side and service-side RPC stubs is known as a presentation protocol because it translates data formats and semantics to and from locally preferred forms.

The request message must get from the client RPC stub to the service RPC stub. To communicate, the client stub calls some network procedure, using an elaboration of the SEND abstraction:

SEND_MESSAGE (request_message, service_name)

specifying in a second argument the identity of the service that should receive this request message. The service stub invokes a similar procedure that provides the RECEIVE abstraction to pick up the message. These two procedures represent a third layer, which provides a transport protocol, and we can extend our layered protocol picture as in Figure 7.14.

This figure makes apparent an important property of layering as used in network designs: every module has not two, but three interfaces. In the usual layered organization, a module has just two interfaces, an interface to the layer above, which hides a second interface to the layer below. But as used in a network, layering involves a third interface. Consider, for example, the RPC client stub in the figure. As expected, it provides an interface that the main program can use, and it uses an interface of the client network package below. But the whole point of the RPC client stub is to construct a request message that convinces its correspondent stub at the service to do something. The presentation protocol thus represents a third interface of the presentation layer module. The presentation module thus hides both the lower layer interface and the presentation protocol from the layer above. This observation is a general one—each layer in a network




[Figure 7.14 (diagram): the main program calls FIRE in the RPC client stub and exchanges the application protocol with the called procedure; the RPC client stub and RPC service stub exchange the presentation protocol; each stub calls SEND_MESSAGE and RECEIVE_MESSAGE in its network package (the client network package and the service network package), and the two network packages exchange the transport protocol.]

FIGURE 7.14 Three protocol layers

implementation provides an interface to the layer above, and it hides the interface to the layer below as well as the protocol interface to the correspondent with which it communicates.

Layered design has proven to be especially effective, and it is used in some form in virtually every network implementation. The primary idea of layers is that each layer hides the operation of the layer below from the layer above, and instead provides its own interpretation of all the important features of the lower layer. Every module is assigned to some layer, and interconnections are restricted to go between modules in adjacent layers. Thus in the three-layer system of Figure 7.15, module A may call any of the modules J, K, or L, but A doesn’t even know of the existence of X, Y, and Z. The figure shows A using module K. Module K, in turn, may call any of X, Y, or Z.

[Figure 7.15 (diagram): Layer One contains module A; Layer Two contains modules J, K, and L; Layer Three contains modules X, Y, and Z. A is shown calling K, and K may call X, Y, or Z.]

FIGURE 7.15 A layered system.

Different network designs, of course, will have different layering strategies. The particular layers we have discussed are only an illustration—as we investigate the design of the transport protocol of Figure 7.14 in more detail, we will find it useful to impose further layers, using a three-layer reference model that provides quite a bit of insight into how networks are organized. Our choice strongly resembles the layering that is used in the design of the Internet. The three layers we choose divide the problem of implementing a network as follows (from the bottom up):

• The link layer: moving data directly from one point to another.
• The network layer: forwarding data through intermediate points to move it to the place it is wanted.
• The end-to-end layer: everything else required to provide a comfortable application interface.

The application itself can be thought of as a fourth, highest layer, not part of the network. On the other hand, some applications intertwine themselves so thoroughly with the end-to-end layer that it is hard to make a distinction.

The terms frame, packet, segment, message, and stream that were introduced in Section 7.1 can now be identified with these layers. Each is the unit of transmission of one of the protocol layers. Working from the top down, an application starts by asking the end-to-end layer to transmit a message or a stream of data to a correspondent. The end-to-end layer splits long messages and streams into segments, it copes with lost or duplicated segments, it places arriving segments in proper order, it enforces specific communication semantics, it performs presentation transformations, and it calls on the network layer to transmit each segment. The network layer accepts segments from the end-to-end layer, constructs packets, and transmits those packets across the network, choosing which links to follow to move a given packet from its origin to its destination. The link layer accepts packets from the network layer, and constructs and transmits frames across a single link between two forwarders or between a forwarder and a customer of the network.
Some network designs attempt to impose a strict layering among various parts of what we call the end-to-end layer, but it is often such a hodgepodge of function that no single layering can describe it in a useful way. On the other hand, the network and link layers are encountered frequently enough in data communication networks that one can almost consider them universal. With this high-level model in mind, we next sketch the basic contracts for each of the three layers and show how they relate to one another. Later, we examine in much more depth how each of the three layers is actually implemented.
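The correspondence between transmission units and layers can be sketched as a chain of encapsulations. The byte-string headers and trailers below are placeholders, not real formats; the point is only the nesting of units.

```python
# Toy illustration of message -> segments -> packets -> frames.
# "NH"/"NT" and "LH"/"LT" stand in for real network- and link-layer
# headers and trailers, which are binary structures in practice.

def end_to_end_send(message, max_segment):
    """End-to-end layer: split a long message into segments."""
    return [message[i:i + max_segment]
            for i in range(0, len(message), max_segment)]

def network_send(segment):
    """Network layer: wrap a segment into a packet."""
    return b"NH" + segment + b"NT"

def link_send(packet):
    """Link layer: wrap a packet into a frame for one link."""
    return b"LH" + packet + b"LT"

frames = [link_send(network_send(s))
          for s in end_to_end_send(b"hello world", 4)]
assert frames[0] == b"LHNHhellNTLT"
```

Each layer treats what it was handed as an opaque payload: the link layer neither knows nor cares that its payload contains a network-layer header.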

7.2.2 The Link Layer

At the bottom of a packet-switched network there must be some underlying communication mechanism that connects one packet switch with another or a packet switch to a customer of the network. The link layer is responsible for managing this low-level communication. The goal of the link layer is to move the bits of the packet across one (usually, but not necessarily, physical) link, hiding the particular mechanics of data transmission that are involved.





[Figure 7.16 (diagram): packet switch B has link 1 to switch A and link 2 to switch C; each end of each link has a link layer module, and each pair exchanges a link protocol. In B, the call LINK_SEND (pkt, link2) hands a packet to the link layer, which transmits the frame LH | DATA | LT over link 2 to C.]

FIGURE 7.16 A link layer in a packet switch that has two physical links

A typical, somewhat simplified, interface to the link layer looks something like this:

LINK_SEND (data_buffer, link_identifier)

where data_buffer names a place in memory that contains a packet of information ready to be transmitted, and link_identifier names, in a local address space, one of possibly several links to use. Figure 7.16 illustrates the link layer in packet switch B, which has links to two other packet switches, A and C. The call to the link layer identifies a packet buffer named pkt and specifies that the link layer should place the packet in a frame suitable for transmission over link2, the link to packet switch C. Switches B and C both have implementations of the link layer, a program that knows the particular protocol used to send and receive frames on this link. The link layer may use a different protocol when sending a frame to switch A using link number 1. Nevertheless, the link layer typically presents a uniform interface (LINK_SEND) to higher layers.

Packet switch B and packet switch C may use different labels for the link that connects them. If packet switch C has four links, the frame may arrive on what C considers to be its link number 3. The link identifier is thus a name whose scope is limited to one packet switch.

The data that actually appears on the physical wire is usually somewhat different from the data that appeared in the packet buffer at the interface to the link layer. The link layer is responsible for taking into account any special properties of the underlying physical channel, so it may, for example, encode the data in a way that is less fragile in the local noise environment, it may fragment the data because the link protocol requires shorter frames, and it may repeatedly resend the data until the other end of the link acknowledges that it has received it. These channel-specific measures generally require that the link layer add information to the data provided by the network layer. In a layered communication system, the data passed from an upper layer to a lower layer for transmission is known as the payload.
When a lower layer adds to the front of the payload some data intended only for the use of the corresponding lower layer at the other end, the addition is called a header, and when the lower layer adds something to the end, the addition is called a trailer. In Figure




7.16, the link layer has added a link layer header LH (perhaps indicating which network layer program to deliver the packet to) and a link layer trailer LT (perhaps containing a checksum for error detection). The combination of the header, payload, and trailer becomes the link-layer frame. The receiving link layer module will, after establishing that the frame has been correctly received, remove the link layer header and trailer before passing the payload to the network layer.

The particular method of waiting for a frame, packet, or message to arrive and transferring payload data and control from a lower layer to an upper layer depends on the available thread coordination procedures. Throughout this chapter, rather than having an upper layer call down to a lower-layer procedure named RECEIVE (as Section 2.1.3 suggested), we use upcalls, which means that when data arrives, the lower layer makes a procedure call up to an entry point in the higher layer. Thus in Figure 7.16 the link layer calls a procedure named NETWORK_HANDLE in the layer above.
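A minimal sketch of such a link layer shows both the framing and the upcall. The one-byte header and CRC-32 trailer are illustrative choices of our own; the book does not specify a format.

```python
import zlib

def link_send(payload, link_id):
    """Wrap a payload in a frame: header LH + payload + checksum trailer LT."""
    header = bytes([link_id])                         # LH (assumed: 1 byte)
    trailer = zlib.crc32(payload).to_bytes(4, "big")  # LT (assumed: CRC-32)
    return header + payload + trailer                 # frame for the wire

def link_receive(frame, network_handle):
    """Verify the checksum, strip LH and LT, then upcall to the network layer."""
    payload, trailer = frame[1:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == trailer:
        network_handle(payload)     # upcall: lower layer calls the layer above

received = []
frame = link_send(b"a packet", link_id=2)
link_receive(frame, received.append)
assert received == [b"a packet"]
```

A damaged frame simply fails the checksum comparison and is discarded, leaving any recovery to a retry mechanism or to the end-to-end layer.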

7.2.3 The Network Layer

A segment enters a forwarding network at one of its network attachment points (the source), accompanied by instructions to deliver it to another network attachment point (the destination). To reach the destination it will probably have to traverse several links. Providing a systematic naming scheme for network attachment points, determining which links to traverse, creating a packet that contains the segment, and forwarding the packet along the intended path are the jobs of the network layer. The interface to the network layer, again somewhat simplified, resembles that of the link layer:

NETWORK_SEND (segment_buffer, network_identifier, destination)

The NETWORK_SEND procedure transmits the segment found in segment_buffer (the payload, from the point of view of the network layer), using the network named in network_identifier (a single computer may participate in more than one network), to destination (the address within that network that names the network attachment point to which the segment should be delivered).

The network layer, upon receiving this call, creates a network-layer header, labeled NH in Figure 7.17, and/or trailer, labeled NT, to accompany the segment as it traverses the network named “IP”, and it assembles these components into a packet. The key item of information in the network-layer header is the address of the destination, for use by the next packet switch in the forwarding chain. Next, the network layer consults its tables to choose the most appropriate link over which to send this packet with the goal of getting it closer to its destination. Finally, the network layer calls the link layer asking it to send the packet over the chosen link.

When the frame containing the packet arrives at the other end of the link, the receiving link layer strips off the link layer header and trailer (LH and LT in the figure) and hands the packet to its network layer by an upcall to NETWORK_HANDLE. This network layer module examines the network layer header and trailer to determine the intended destination of the packet. It consults its own tables to decide on which outgoing link to forward the





(segment, “IP”, nap_1197) network

Network Layer

Network Layer

protocol NT DATA NH

lINK_SEND (packet, link2)

Link Layer


link 2

link protocol


(packet, link5)


Link Layer

Link Layer

link 5

FIGURE 7.17 Relation between the network layer and the link layer.

packet, and it calls the link layer to send the packet on its way. The network layer of each packet switch along the way repeats this procedure, until the packet traverses the link to its destination. The network layer at the end of that link recognizes that the packet is now at its destination, it extracts the data segment from the packet, and passes that segment to the end-to-end layer, with another upcall.
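The per-switch forwarding step can be sketched as follows. The table contents, attachment-point names, and dictionary representation of a packet are hypothetical stand-ins for the structures a real network layer maintains.

```python
# Sketch of NETWORK_HANDLE at one packet switch, assuming a static
# forwarding table; real networks build such tables with routing
# protocols. The "NH" key stands in for the network-layer header.

forwarding_table = {          # destination address -> outgoing link
    "nap_1197": 2,
    "nap_0433": 5,
}

def network_handle(packet, local_address, link_send):
    destination, segment = packet["NH"], packet["payload"]
    if destination == local_address:
        return segment        # at destination: pass segment up
    link_send(packet, forwarding_table[destination])  # forward it
    return None

sent_on = []
network_handle({"NH": "nap_0433", "payload": b"seg"}, "nap_1197",
               lambda pkt, link: sent_on.append(link))
assert sent_on == [5]         # forwarded over link 5, not delivered here
```

Every switch along the path repeats exactly this decision, which is why only the destination address in NH, not the full route, needs to travel with the packet.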

7.2.4 The End-to-End Layer

We can now put the whole picture together. The network and link layers together provide a best-effort network, which has the “interesting” properties that were listed in Figure 7.11 on page 7–21. These properties may be problematic to an application, and the function of the end-to-end layer is to create a less “interesting” and thus easier to use interface for the application. For example, Figure 7.18 shows the remote procedure call of Figure 7.12 from a different perspective. Here the RPC protocol is viewed as an end-to-end layer of a complete network implementation. As with the lower layers, the end-to-end layer has added a header and a trailer to the data that the application gave it, and inspecting the bits on the wire we now see three distinct headers and trailers, corresponding to the three layers of the network implementation. The RPC implementation in the end-to-end layer provides several distinct end-to-end services, each intended to hide some aspect of the underlying network from its application:




• Presentation services. Translating data formats and emulating the semantics of a procedure call. For this purpose the end-to-end header might contain, for example, a count of the number of arguments in the procedure call.
• Transport services. Dividing streams and messages into segments and dealing with lost, duplicated, and out-of-order segments. For this purpose, the end-to-end header might contain serial numbers of the segments.
• Session services. Negotiating a search, handshake, and binding sequence to locate and prepare to use a service that knows how to perform the requested procedure. For this purpose, the end-to-end header might contain a unique identifier that tells the service which client application is making this call.

Depending on the requirements of the application, different end-to-end layer implementations may provide all, some, or none of these services, and the end-to-end header and trailer may contain various different bits of information.

There is one other important property of this layering that becomes evident in examining Figure 7.18. Each layer considers the payload transmitted by the layer above to be information that it is not expected, or even permitted, to interpret. Thus the end-to-end layer constructs a segment with an end-to-end header and trailer that it hands to the network layer, with the expectation that the network layer will not look inside or perform any actions that require interpretation of the segment. The network layer, in turn, adds a network-layer header and trailer and hands the resulting packet to the link layer, again with the expectation that the link layer will consider this packet to be an opaque string of bits, a payload to be carried in a link-layer frame. Violation of this rule would lead to interdependence across layers and consequent loss of modularity of the system.

[Figure 7.18 (diagram): on the client side, the call FIRE (7, “Lucifer”, evade) enters the end-to-end layer (RPC), which wraps the data in an end-to-end header EH and trailer ET and exchanges the end-to-end protocol with the service’s end-to-end layer (RPC), where the same call is delivered. Below, three network layer modules and four link layer modules carry the packet across two links, through one intermediate packet switch.]

FIGURE 7.18 Three network layers in action. The arguments of the procedure call become the payload of the end-to-end segment. The network layer forwards the packet across two links on the way from the client to the service. The frame on the wire contains the headers and trailers of three layers.
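A hypothetical end-to-end header carrying one field for each of the three kinds of service just listed might look like this. The field names and the JSON encoding are illustrative only, not any real protocol’s format.

```python
import json

# Sketch of an end-to-end segment: a 2-byte header length, the header
# EH, and the opaque payload. Lower layers never parse any of this.

def make_segment(payload, arg_count, serial, client_id):
    header = json.dumps({
        "nargs": arg_count,    # presentation: how to parse the call
        "serial": serial,      # transport: ordering, duplicate detection
        "client": client_id,   # session: which client is calling
    }).encode()
    return len(header).to_bytes(2, "big") + header + payload

segment = make_segment(b"marshaled args...", arg_count=3,
                       serial=17, client_id="c42")
hlen = int.from_bytes(segment[:2], "big")
assert segment[2 + hlen:] == b"marshaled args..."
```

To the network layer, this entire segment is just an opaque payload to be wrapped in NH and NT, exactly as the opacity rule above requires.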

7.2.5 Additional Layers and the End-to-End Argument

To this point, we have suggested that a three-layer reference model is both necessary and sufficient to provide insight into how networks operate. Standard textbooks on network design and implementation mention a reference model from the International Organization for Standardization, known as “Open Systems Interconnect”, or OSI. The OSI reference model has not three, but seven layers. What is the difference?

There are several differences. Some are trivial; for example, the OSI reference model divides the link layer into a strategy layer (known as the “data link layer”) and a physical layer, recognizing that many different kinds of physical links can be managed with a small number of management strategies.

There is a much more significant difference between our reference model and the OSI reference model in the upper layers. The OSI reference model systematically divides our end-to-end layer into four distinct layers. Three of these layers directly correspond, in the RPC example, to the layers of Figure 7.14: an application layer, a presentation layer, and a transport layer. In addition, just above the transport layer the ISO model inserts a layer that provides the session services mentioned just above. We have avoided this approach for the simple reason that different applications have radically different requirements for transport, session, and presentation services—even to the extent that the order in which they should be applied may be different. This situation makes it difficult to propose any single layering, since a layering implies an ordering.

For example, an application that consists of sending a file to a printer would find most useful a transport service that guarantees to deliver to the printer a stream of bytes in the same order in which they were sent, with none missing and none duplicated.
But a file transfer application might not care in what order different blocks of the file are delivered, so long as they all eventually arrive at the destination. A digital telephone application would like to see a stream of bits representing successive samples of the sound waveform delivered in proper order, but here and there a few samples can be missing without interfering with the intelligibility of the conversation. This rather wide range of application requirements suggests that any implementation decisions that a lower layer makes (for example, to wait for out-of-order segments to arrive so that data can be delivered in the correct order to the next higher layer) may be counterproductive for at least some applications. Instead, it is likely to be more effective to provide a library of service modules that can be selected and organized by the programmer of a specific application. Thus, our end-to-end layer is an unstructured library of service modules, of which the RPC protocol is an example.




This argument against additional layers is an example of a design principle known as

The end-to-end argument

The application knows best.

In this case, the basic thrust of the end-to-end argument is that the application knows best what its real communication requirements are, and for a lower network layer to try to implement any feature other than transporting the data risks implementing something that isn’t quite what the application needed. Moreover, if it isn’t exactly what is needed, the application will probably have to reimplement that function on its own. The end-to-end argument can thus be paraphrased as: don’t bury it in a lower layer, let the end points deal with it because they know best what they need.

A simple example of this phenomenon is file transfer. To transfer a file carefully, the appropriate method is to calculate a checksum from the contents of the file as it is stored in the file system of the originating site. Then, after the file has been transferred and written to the new file system, the receiving site should read the file back out of its file system, recalculate the checksum anew, and compare it with the original checksum. If the two checksums are the same, the file transfer application has quite a bit of confidence that the new site has a correct copy; if they are different, something went wrong and recovery is needed.

Given this end-to-end approach to checking the accuracy of the file transfer, one can question whether or not there is any value in, for example, having the link layer protocol add a frame checksum to the link layer trailer. This link layer checksum takes time to calculate, it adds to the data to be sent, and it verifies the correctness of the data only while it is being transmitted across that link. Despite this protection, the data may still be damaged while it is being passed through the network layer, or while it is buffered by the receiving part of the file transfer application, or while it is being written to the disk.
Because of those threats, the careful file transfer application cannot avoid calculating its end-to-end checksum, despite the protection provided by the link layer checksum.

This is not to say that the link layer checksum is worthless. If the link layer provides a checksum, that layer will discover data transmission errors at a time when they can be easily corrected by resending just one frame. Absent this link-layer checksum, a transmission error will not be discovered until the end-to-end layer verifies its checksum, by which point it may be necessary to redo the entire file transfer. So there may be a significant performance gain in having this feature in a lower-level layer. The interesting observation is that a lower-layer checksum does not eliminate the need for the application layer to implement the function, and it is thus not required for application correctness. It is just a performance enhancement.

The end-to-end argument can be applied to a variety of system design issues in addition to network design. It does not provide an absolute decision technique, but rather a useful argument that should be weighed against other arguments in deciding where to place function.




7.2.6 Mapped and Recursive Applications of the Layered Model

When one begins decomposing a particular existing network into link, network, and end-to-end layers, it sometimes becomes apparent that some of the layers of the network are themselves composed of what are obviously link, network, or end-to-end layers. These compositions come in two forms: mapped and recursive.

Mapped composition occurs when a network layer is built directly on another network layer by mapping higher-layer network addresses to lower-layer network addresses. A typical application for mapping arises when a better or more popular network technology comes along, yet it is desirable to keep running applications that are designed for the old network. For example, Apple designed a network called Appletalk that was used for many years, and then later mapped the Appletalk network layer to the Ethernet, which, as described in Section 7.8, has a network and link layer of its own but uses a somewhat different scheme for its network layer addresses.

Another application for mapped composition is to interconnect several independently designed network layers, a scheme called internetworking. Probably the best example of internetworking is the Internet itself (described in Sidebar 7.2), which links together many different network layers by mapping them all to a universal network layer that uses a protocol known as Internet protocol (IP). Section 7.8 explains how the network

Sidebar 7.2: The Internet

The Internet provides examples of nearly every concept in this chapter. Much of the Internet is a network layer that is mapped onto some other network layer such as a satellite network, a wireless network, or an Ethernet. Internet protocol (IP) is the primary network layer protocol, but it is not the only network layer protocol used in the Internet. There is a network layer protocol for managing the Internet, known as ICMP. There are also several different network layer routing protocols, some providing routing within small parts of the Internet, others providing routing between major regions. But every point that can be reached via the Internet implements IP.

The link layer of the Internet includes all of the link layers of the networks that the Internet maps onto and it also includes many separate, specialized links: a wire, a dial-up telephone line, a dedicated line provided by the telephone company, a microwave link, a digital subscriber line (DSL), a free-space optical link, etc. Almost anything that carries bits has been used somewhere as a link in the Internet.

The end-to-end protocols used on the Internet are many and varied. The primary transport protocols are TCP, UDP, and RTP, described briefly on page 7–65. Built on these transport protocols are hundreds of application protocols. A short list of some of the most widely used application protocols would include file transfer (FTP), the World Wide Web (HTTP), mail dispatch and pickup (SMTP and POP), text messaging (IRC), telephone (VoIP), and file exchange (Gnutella, bittorrent, etc.).

The current chapter presents a general model of networks, rather than a description of the Internet. To learn more about the Internet, see the books and papers listed in Section 7 of the Suggestions for Further Reading.

Saltzer & Kaashoek Ch. 7, p. 32

June 25, 2009 8:22 am

7.2 Getting Organized: Layers


layer addresses of the Ethernet are mapped to and from the IP addresses of the Internet using what is known as an Address Resolution Protocol. The Internet also maps the internal network addresses of many other networks—wireless networks, satellite net­ works, cable TV networks, etc.—into IP addresses. Recursive composition occurs when a network layer rests on a link layer that itself is a complete three-layer network. Recursive composition is not a general property of layers, but rather it is a specific property of layered communication systems: The send/receive semantics of an end-to-end connection through a network can be designed to be have the same semantics as a single link, so such an end-to-end connection can be used as a link in a higher-level network. That property facilitates recursive composition, as well as the implementation of various interesting and useful network structures. Here are some examples of recursive composition: • A dial-up telephone line is often used as a link to an attachment point of the Internet. This dial-up line goes through a telephone network that has its own link, network, and end-to-end layers. • An overlay network is a network layer structure that uses as links the end-to-end layer of an existing network. Gnutella (see problem set 20) is an example of an overlay network that uses the end-to-end layer of the Internet for its links. • With the advance of “voice over IP” (VoIP), the traditional voice telephone network is gradually converting to become an overlay on the Internet. • A tunnel is a structure that uses the end-to-end layer of an existing network as a link between a local network-layer attachment point and a distant one to make it appear that the attachment is at the distant point. Tunnels, combined with the encryption techniques described in Chapter 11, are used to implement what is commonly called a “virtual private network” (VPN). Recursive composition need not be limited to two levels. 
Figure 7.19 illustrates the case of Gnutella overlaying the Internet, with a dial-up telephone connection being used as the Internet link layer.

The primary concern when one is dealing with a link layer that is actually an end-to-end connection through another network is that discussion can become confusing unless one is careful to identify which level of decomposition is under discussion. Fortunately our terminology helps keep track of the distinctions among the various layers of a network, so it is worth briefly reviewing that terminology. At the interface between the application and the end-to-end layer, data is identified as a stream or message. The end-to-end layer divides the stream or message up into a series of segments and hands them to the network layer for delivery. The network layer encapsulates each segment in a packet which it forwards through the network with the help of the link layer. The link layer transmits the packet in a frame. If the link layer is itself a network, then this frame is a message as viewed by the underlying network.

FIGURE 7.19 A typical recursive network composition. [The figure shows three stacked systems: a file transfer system, in which the File Transfer Program (end-to-end layer) runs over Gnutella (network layer); the Internet, in which a transport protocol (end-to-end layer) runs over the Internet Protocol (network layer), with the Internet serving as Gnutella's link layer; and a dial-up telephone network, in which a dialed connection (end-to-end layer) runs over telephone switches (network layer) and physical wires (link layer), with the dialed connection serving as one of the Internet's links.] The overlay network Gnutella uses for its link layer an end-to-end transport protocol of the Internet. The Internet uses for one of its links an end-to-end transport protocol of the dial-up telephone system.

This discussion of layered network organization has been both general and abstract. In the next three sections we investigate in more depth the usual functions and some typical implementation techniques of each of the three layers of our reference model. However, as the introduction pointed out, what follows is not a comprehensive treatment of networking. Instead it identifies many of the major issues and for each issue exhibits one or two examples of how that issue is typically handled in a real network design. For readers who have a goal of becoming network engineers, and who therefore would like to learn the whole remarkable range of implementation strategies that have been used in networks, the Suggestions for Further Reading list several comprehensive books on the subject.

7.3 The Link Layer

The link layer is the bottom-most of the three layers of our reference model. The link layer is responsible for moving data directly from one physical location to another. It thus gets involved in several distinct issues: physical transmission, framing bits and bit sequences, detecting transmission errors, multiplexing the link, and providing a useful interface to the network layer above.

7.3.1 Transmitting Digital Data in an Analog World

The purpose of the link layer is to move bits from one place to another. If we are talking about moving a bit from one register to another on the same chip, the mechanism is fairly simple: run a wire that connects the output of the first register to the input of the next. Wait until the first register’s output has settled and the signal has propagated to the input of the second; the next clock tick reads the data into the second register. If all of the voltages are within their specified tolerances, the clock ticks are separated enough in time to allow for the propagation, and there is no electrical interference, then that is all there is to it.

Maintaining those three assumptions is relatively easy within a single chip, and even between chips on the same printed circuit board. However, as we begin to consider sending bits between boards, across the room, or across the country, these assumptions become less and less plausible, and they must be replaced with explicit measures to ensure that data is transmitted accurately. In particular, when the sender and receiver are in separate systems, providing a correctly timed clock signal becomes a challenge.

A simple method for getting data from one module to another module that does not share the same clock is with a three-wire (plus common ground) ready/acknowledge protocol, as shown in Figure 7.20. Module A, when it has a bit ready to send, places the bit on the data line, and then changes the steady-state value on the ready line. When B sees the ready line change, it acquires the value of the bit on the data line, and then changes the acknowledge line to tell A that the bit has been safely received. The reason that the ready and acknowledge lines are needed is that, in the absence of any other synchronizing scheme, B needs to know when it is appropriate to look at the data line, and A needs to know when it is safe to stop holding the bit value on the data line. The signals on the ready and acknowledge lines frame the bit. If the propagation time from A to B is Δt, then this protocol would allow A to send one bit to B every 2Δt plus the time required for A to set up its output and for B to acquire its input, so the maximum data rate would be a little less than 1/(2Δt).

FIGURE 7.20 A simple protocol for data communication.
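The rate limit just derived is easy to play with numerically. Here is a small Python sketch (the function name and the example delay are ours, chosen for illustration only):

```python
def ready_ack_rate(delta_t, n_lines=1):
    """Maximum data rate, in bits per second, of the ready/acknowledge
    protocol: each bit costs a round trip of 2 * delta_t seconds (ready out,
    acknowledge back), ignoring setup and acquisition time.  With n_lines
    parallel data lines framed by one ready/acknowledge pair, n_lines bits
    move per round trip."""
    return n_lines / (2.0 * delta_t)

# Signals propagate at roughly 2e8 meters/second in a cable, so a 30 km link
# has a one-way delay near 150 microseconds, which caps a single data line
# at only a few thousand bits per second.
long_link = ready_ack_rate(150e-6)
```

The sketch makes the scaling in the text concrete: doubling the distance doubles Δt and halves the achievable rate.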
Over short distances, one can replace the single data line with N parallel data lines, all of which are framed by the same pair of ready/acknowledge lines, and thereby increase the data rate to N/(2Δt). Many backplane bus designs as well as peripheral attachment systems such as SCSI and personal computer printer interfaces use this technique, known as parallel transmission, along with some variant of a ready/acknowledge protocol, to achieve a higher data rate. However, as the distance between A and B grows, Δt also grows, and the maximum data rate declines in proportion, so the ready/acknowledge technique rapidly breaks down.

The usual requirement is to send data at higher rates over longer distances with fewer wires, and this requirement leads to employment of a different system called serial transmission. The idea is to send a stream of bits down a single transmission line, without waiting for any response from the receiver and with the expectation that the receiver will somehow recover those bits at the other end with no additional signaling. Thus the output at the transmitting end of the link looks as in Figure 7.21. Unfortunately, because the underlying transmission line is analog, the farther these bits travel down the line, the more attenuation, noise, and line-charging effects they suffer. By the time they arrive at the receiver they will be little more than pulses with exponential leading and trailing edges, as suggested by Figure 7.22. The receiving module, B, now has a significant problem in understanding this transmission: Because it does not have a copy of the clock that A used to create the bits, it does not know exactly when to sample the incoming line.

A typical solution involves having the two ends agree on an approximate data rate, so that the receiver can run a voltage-controlled oscillator (VCO) at about that same data rate. The output of the VCO is multiplied by the voltage of the incoming signal and the product suitably filtered and sent back to adjust the VCO. If this circuit is designed correctly, it will lock the VCO to both the frequency and phase of the arriving signal. (This device is commonly known as a phase-locked loop.) The VCO, once locked, then becomes a clock source that a receiver can use to sample the incoming data.

One complication is that with certain patterns of data (for example, a long string of zeros) there may be no transitions in the data stream, in which case the phase-locked loop will not be able to synchronize. To deal with this problem, the transmitter usually encodes the data in a way that ensures that no matter what pattern of bits is sent, there will be some transitions on the transmission line. A frequently used method is called phase encoding, in which there is at least one level transition associated with every data bit. A common phase encoding is the Manchester code, in which the transmitter encodes each bit as two bits: a zero is encoded as a zero followed by a one, while a one is encoded as a one followed by a zero. This encoding guarantees that there is a level transition in the center of every transmitted bit, thus supplying the receiver with plenty of clocking information.
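The encoding rule just described is simple enough to state as runnable code. This Python sketch (the function names are our own) encodes and decodes the Manchester code:

```python
def manchester_encode(bits):
    """Encode each data bit as two code bits: 0 -> 01, 1 -> 10, guaranteeing
    a level transition in the center of every transmitted bit."""
    out = []
    for b in bits:
        out.extend([0, 1] if b == 0 else [1, 0])
    return out

def manchester_decode(code):
    """Reverse the encoding, assuming code-bit framing is already known."""
    if len(code) % 2 != 0:
        raise ValueError("odd number of code bits: lost framing")
    bits = []
    for i in range(0, len(code), 2):
        pair = (code[i], code[i + 1])
        if pair == (0, 1):
            bits.append(0)
        elif pair == (1, 0):
            bits.append(1)
        else:
            raise ValueError("no mid-bit transition: line error or lost framing")
    return bits
```

An illegal pair (two equal code bits) means either that the line glitched or that the decoder paired the wrong two code bits, which is the framing problem that Sidebar 7.3 addresses.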
The Manchester code has the disadvantage that the maximum data rate of the communication channel is effectively cut in half, but the resulting simplicity of both the transmitter and the receiver is often worth this price. Other, more elaborate, encoding schemes can ensure that there is at least one transition for every few data bits. These schemes don’t reduce the maximum data rate as much, but they complicate encoding, decoding, and synchronization.

FIGURE 7.21 Serial transmission.

FIGURE 7.22 Bit shape deterioration with distance.

The usual goal for the design space of a physical communication link is to achieve the highest possible data rate for the encoding method being used. That highest possible data rate will occur exactly at the point where the arriving data signal is just on the ragged edge of being correctly decodable, and any noise on the line will show up in the form of clock jitter or signals that just miss expected thresholds, either of which will lead to decoding errors.

The data rate of a digital link is conventionally measured in bits per second. Since digital data is ultimately carried using an analog channel, the question arises of what might be the maximum digital carrying capacity of a specified analog channel. A perfect analog channel would have an infinite capacity for digital data because one could both set and measure a transmitted signal level with infinite precision, and then change that setting infinitely often. In the real world, noise limits the precision with which a receiver can measure the signal value, and physical limitations of the analog channel such as chromatic dispersion (in an optical fiber), charging capacitance (in a copper wire), or spectrum availability (in a wireless signal) put a ceiling on the rate at which a receiver can detect a change in value of a signal. These physical limitations are summed up in a single measure known as the bandwidth of the analog channel. To be more precise, the number of different signal values that a receiver can distinguish is proportional to the logarithm of the ratio of the signal power to the noise power, and the maximum rate at which a receiver can distinguish changes in the signal value is proportional to the analog bandwidth.

These two parameters (signal-to-noise ratio and analog bandwidth) allow one to calculate a theoretical maximum possible channel capacity (that is, data transmission rate) using Shannon’s capacity theorem (see Sidebar 7.4).* Although this formula adopts a particular definition of bandwidth, assumes a particular randomness for the noise, and says nothing about the delay that might be encountered if one tries to operate near the channel capacity, it turns out to be surprisingly useful for estimating capacities in the real world.

Sidebar 7.4: Shannon’s capacity theorem

    C ≤ W · log2(1 + S/(N · W))

where:
    C = channel capacity, in bits per second
    W = channel bandwidth, in hertz
    S = maximum allowable signal power, as seen by the receiver
    N = noise power per unit of bandwidth

Sidebar 7.3: Framing phase-encoded bits The astute reader may have spotted a puzzling gap in the brief description of the Manchester code: while it is intended as a way of framing bits as they appear on the transmission line, it is also necessary to frame the data bits themselves, in order to know whether a data bit is encoded as bits (n, n + 1) or bits (n + 1, n + 2). A typical approach is to combine code bit framing with data bit framing (and even provide some help in higher-level framing) by specifying that every transmission must begin with a standard pattern, such as some minimum number of coded one-bits followed by a coded zero. The series of consecutive ones gives the phase-locked loop something to synchronize on, and at the same time provides examples of the positions of known data bits. The zero frames the end of the framing sequence.

Since some methods of digital transmission come much closer to Shannon’s theoretical capacity than others, it is customary to use as a measure of goodness of a digital transmission system the number of bits per second that the system can transmit per hertz of bandwidth. Setting W = 1, the capacity theorem says that the maximum bits per second per hertz is log2(1 + S/N). An elementary signalling system in a low-noise environment can easily achieve 1 bit per second per hertz. On the other hand, for a 28 kilobits per second modem to operate over the 2.4 kilohertz telephone network, it must transmit about 12 bits per second per hertz. The capacity theorem says that the logarithm must be at least 12, so the signal-to-noise ratio must be at least 2^12, or using a more traditional analog measure, 36 decibels, which is just about typical for the signal-to-noise ratio of a properly working telephone connection. The copper-pair link between a telephone handset and the telephone office does not go through any switching equipment, so it actually has a bandwidth closer to 100 kilohertz and a much better signal-to-noise ratio than the telephone system as a whole; these combine to make possible “digital subscriber line” (DSL) modems that operate at 1.5 megabits/second—and even up to 50 megabits/second over short distances—using a physical link that was originally designed to carry just voice.

One other parameter is often mentioned in characterizing a digital transmission system: the bit error rate, abbreviated BER and measured as a ratio to the transmission rate. For a transmission system to be useful, the bit error rate must be quite low; it is typically reported with numbers such as one error in 10^6, 10^7, or 10^8 transmitted bits.
Even the best of those rates is not good enough for digital systems; higher levels of the system must be prepared to detect and compensate for errors.
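The modem arithmetic above can be checked directly against the sidebar’s formula; here is a small Python sketch (the function name is ours):

```python
import math

def shannon_capacity(W, S, N):
    """Sidebar 7.4: C <= W * log2(1 + S / (N * W)), with W in hertz,
    S in watts, and N in watts per hertz of bandwidth."""
    return W * math.log2(1.0 + S / (N * W))

# The telephone-modem example from the text: about 12 bits per second per
# hertz requires a signal-to-noise ratio of at least 2**12, which in the
# traditional analog measure is about 36 decibels.
bits_per_hz = 12                              # approximately 28000 / 2400
required_snr = 2 ** bits_per_hz               # 4096
required_db = 10 * math.log10(required_snr)   # roughly 36 dB
```

Setting W = 1 in `shannon_capacity` reproduces the bits-per-second-per-hertz figure of merit used in the text.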

* The derivation of this theorem is beyond the scope of this textbook. The capacity theorem was originally proposed by Claude E. Shannon in the paper “A mathematical theory of communication,” Bell System Technical Journal 27 (1948), pages 379–423 and 623–656. Most modern texts on information theory explore it in depth.

7.3.2 Framing Frames

The previous section explained how to obtain a stream of neatly framed bits, but because the job of the link layer is to deliver frames across the link, it must also be able to figure out where in this stream of bits each frame begins and ends. Framing frames is a distinct, and quite independent, requirement from framing bits, and it is one of the reasons that some network models divide the link layer into two layers, a lower layer that manages physical aspects of sending and receiving individual bits and an upper layer that implements the strategy of transporting entire frames.

There are many ways to frame frames. One simple method is to choose some pattern of bits, for example, seven one-bits in a row, as a frame-separator mark. The sender simply inserts this mark into the bit stream at the end of each frame. Whenever this pattern



appears in the received data, the receiver takes it to mark the end of the previous frame, and assumes that any bits that follow belong to the next frame. This scheme works nicely, as long as the payload data stream never contains the chosen pattern of bits.

Rather than explaining to the higher layers of the network that they cannot transmit certain bit patterns, the link layer implements a technique known as bit stuffing. The transmitting end of the link layer, in addition to inserting the frame-separator mark between frames, examines the data stream itself, and if it discovers six ones in a row it stuffs an extra bit into the stream, a zero. The receiver, in turn, watches the incoming bit stream for long strings of ones. When it sees six one-bits in a row it examines the next bit to decide what to do. If the seventh bit is a zero, the receiver discards the zero bit, thus reversing the stuffing done by the sender. If the seventh bit is a one, the receiver takes the seven ones as the frame separator. Figure 7.23 shows a simple pseudocode implementation of the procedure to send a frame with bit stuffing, and Figure 7.24 shows the corresponding procedure on the receiving side of the link. (For simplicity, the illustrated receive procedure ignores two important considerations. First, the receiver uses only one frame buffer. A better implementation would have multiple buffers to allow it to receive the next frame while processing the current one. Second, the same thread that acquires a bit also runs the network level protocol by calling LINK_RECEIVE. A better implementation would probably NOTIFY a separate thread that would then call the higher-level protocol, and this thread could continue processing more incoming bits.)

Bit stuffing is one of many ways to frame frames.
There is little need to explore all the possible alternatives because frame framing is easily specified and subcontracted to the implementer of the link layer—the entire link layer, along with bit framing, is often done in the hardware—so we now move on to other issues.
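The stuffing and unstuffing just described can be exercised end to end in a runnable Python sketch (the function names are ours; like the simplified pseudocode of Figures 7.23 and 7.24, this version assumes each frame’s final bit is a zero, so that a frame’s own trailing one-bits never run into the separator mark):

```python
FLAG_LEN = 7  # frame-separator mark: seven one-bits in a row

def stuff_and_mark(frame_bits):
    """Sender side: stuff a zero after every six consecutive ones,
    then append the seven-ones separator mark."""
    out, run = [], 0
    for b in frame_bits:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 6:
            out.append(0)   # stuffed zero, so data never shows seven ones
            run = 0
    return out + [1] * FLAG_LEN

def unstuff(line_bits):
    """Receiver side: discard stuffed zeros and split the stream into
    frames at each run of seven ones."""
    frames, cur, run = [], [], 0
    for b in line_bits:
        if run == 6:                     # six ones so far: this bit decides
            if b == 0:
                run = 0                  # stuffed bit, don't use it
            else:
                frames.append(cur[:-6])  # marker: drop its first six ones
                cur, run = [], 0
        else:
            cur.append(b)
            run = run + 1 if b == 1 else 0
    return frames

payload = [0, 1, 1, 1, 1, 1, 1, 1, 0]    # six ones trigger one stuffed zero
assert unstuff(stuff_and_mark(payload)) == [payload]
```

Running several stuffed frames back to back through `unstuff` recovers each frame, which is exactly the separation job the frame-separator mark exists to do.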

procedure FRAME_TO_BIT (frame_data, length)
    ones_in_a_row ← 0
    for i from 1 to length do              // First send frame contents.
        SEND_BIT (frame_data[i])
        if frame_data[i] = 1 then
            ones_in_a_row ← ones_in_a_row + 1
            if ones_in_a_row = 6 then
                SEND_BIT (0)               // Stuff a zero so that data doesn’t
                ones_in_a_row ← 0          // look like a framing marker.
        else ones_in_a_row ← 0
    for i from 1 to 7 do                   // Now send framing marker.
        SEND_BIT (1)


FIGURE 7.23 Sending a frame with bit stuffing.


7.3.3 Error Handling

An important issue is what the receiving side of the link layer should do about bits that arrive with doubtful values. Since the usual design pushes the data rate of a transmission link up until the receiver can barely tell the ones from the zeros, even a small amount of extra noise can cause errors in the received bit stream.

The first and perhaps most important line of defense in dealing with transmission errors is to require that the design of the link be good at detecting such errors when they occur. The usual method is to encode the data with an error detection code, which entails adding a small amount of redundancy. A simple form of such a code is to have the transmitter calculate a checksum and place the checksum at the end of each frame. As soon as the receiver has acquired a complete frame, it recalculates the checksum and compares its result with the copy that came with the frame. By carefully designing the checksum algorithm and making the number of bits in the checksum large enough, one can make the probability of not detecting an error as low as desired.

The more interesting issue is what to do when an error is detected. There are three alternatives:

1. Have the sender encode the transmission using an error correction code, which is a code that has enough redundancy to allow the receiver to identify the particular bits that have errors and correct them. This technique is widely used in situations where the noise behavior of the transmission channel is well understood and the redundancy can be targeted to correct the most likely errors. For example, compact disks are recorded with a burst error-correction code designed to cope particularly well with dust and scratches. Error correction is one of the topics of Chapter 8 [on-line].

procedure BIT_TO_FRAME (rcvd_bit)
    ones_in_a_row integer initially 0
    if ones_in_a_row < 6 then
        bits_in_frame ← bits_in_frame + 1
        frame_data[bits_in_frame] ← rcvd_bit
        if rcvd_bit = 1 then ones_in_a_row ← ones_in_a_row + 1
        else ones_in_a_row ← 0
    else                                   // This may be a seventh one-bit in a row, check it out.
        if rcvd_bit = 0 then
            ones_in_a_row ← 0              // Stuffed bit, don't use it.
        else                               // This is the end-of-frame marker.
            LINK_RECEIVE (frame_data, (bits_in_frame - 6), link_id)
            bits_in_frame ← 0
            ones_in_a_row ← 0

FIGURE 7.24 Receiving a frame with bit stuffing.



2. Ask the sender to retransmit the frame that contained an error. This alternative requires that the sender hold the frame in a buffer until the receiver has had a chance to recalculate and compare its checksum. The sender needs to know when it is safe to reuse this buffer for another frame. In most such designs the receiver explicitly acknowledges the correct (or incorrect) receipt of every frame. If the propagation time from sender to receiver is long compared with the time required to send a single frame, there may be several frames in flight, and acknowledgments (especially the ones that ask for retransmission) are disruptive. On a high-performance link an explicit acknowledgment system can be surprisingly complex.

3. Let the receiver discard the frame. This alternative is a reasonable choice in light of our previous observation (see page 7–12) that congestion in higher network levels must be handled by discarding packets anyway. Whatever higher-level protocol is used to deal with those discarded packets will also take care of any frames that are discarded because they contained errors.

Real-world designs often involve blending these techniques, for example by having the sender apply a simple error-correction code that catches and repairs the most common errors and that reliably detects and reports any more complex irreparable errors, and then by having the receiver discard the frames that the error-correction code could not repair.
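A toy checksum (a single-byte sum, far weaker than the codes a real link design would use, but enough to show the mechanics) illustrates detection followed by the discard alternative:

```python
def checksum(data):
    """Toy error-detection code: sum of the payload bytes, modulo 256.
    A real design would use a stronger code, such as a CRC."""
    return sum(data) % 256

def make_frame(payload):
    """Sender side: append the checksum to the frame."""
    return payload + bytes([checksum(payload)])

def frame_ok(frame):
    """Receiver side: recompute and compare; a mismatch means discard."""
    return checksum(frame[:-1]) == frame[-1]

frame = make_frame(b"some payload")
damaged = bytes([frame[0] ^ 0x01]) + frame[1:]    # flip one bit in transit
assert frame_ok(frame) and not frame_ok(damaged)  # receiver discards 'damaged'
```

Note that this particular code misses many error patterns (for example, two flipped bits that cancel in the sum), which is why the number and arrangement of checksum bits matter, as the text observes.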

7.3.4 The Link Layer Interface: Link Protocols and Multiplexing

The link layer, in addition to sending bits and frames at one end and receiving them at the other end, also has interfaces to the network layer above, as illustrated in Figure 7.16 on page 7–26. As described so far, the interface consists of an ordinary procedure call (to LINK_SEND) that the network layer uses to tell the link layer to send a packet, and an upcall (to NETWORK_HANDLE) from the link layer to the network layer at the other end to alert the network layer that a packet arrived. To be practical, this interface between the network layer and the link layer needs to be expanded slightly to incorporate two additional features not previously mentioned: multiple lower-layer protocols, and higher-layer protocol multiplexing. To support these two functions we add two arguments to LINK_SEND, named link_protocol and network_protocol:

    LINK_SEND (data_buffer, link_identifier, link_protocol, network_protocol)

Over any given link, it is sometimes appropriate to use different protocols at different times. For example, a wireless link may occasionally encounter a high noise level and need to switch from the usual link protocol to a “robustness” link protocol that employs a more expensive form of error detection with repeated retry, but runs more slowly. At other times it may want to try out a new, experimental link protocol. The third argument to LINK_SEND, link_protocol, tells LINK_SEND which link protocol to use for this data, and its addition leads to the protocol layering illustrated in Figure 7.25.



CHAPTER 7 The Network as a System and as a System Component

Network Layer

Network protocol

Standard protocol

High robustness protocol

Experimental protocol

Link Layer

FIGURE 7.25 Layer composition with multiple link protocols.

Internet Protocol

Standard protocol

Address Resolution Protocol

Appletalk Protocol

High robustness protocol

Path Vector Exchange Protocol

Experimental protocol

Network Layer

Link Layer

FIGURE 7.26 Layer composition with multiple link protocols and link layer multiplexing to support multiple network layer protocols.

The second feature of the interface to the link layer is more involved: the interface should support protocol multiplexing. Multiplexing allows several different network layer protocols to use the same link. For example, Internet Protocol, Appletalk Protocol, and Address Resolution Protocol (we will talk about some of these protocols later in this chapter) might all be using the same link.

Several steps are required. First, the network layer protocol on the sending side needs to specify which protocol handler should be invoked on the receiving side, so one more argument, network_protocol, is needed in the interface to LINK_SEND. Second, the value of network_protocol needs to be transmitted to the receiving side, for example by adding it to the link-level packet header. Finally, the link layer on the receiving side needs to examine this new header field to decide to which of the various network layer implementations it should deliver the packet. Our protocol layering organization is now as illustrated in Figure 7.26. This figure demonstrates the real power of the layered organization: any of the four network layer protocols in the figure may use any of the three link layer protocols.



With the addition of multiple link protocols and link multiplexing, we can summa­ rize the discussion of the link layer in the form of pseudocode for the procedures LINK_SEND and LINK_RECEIVE, together with a structure describing the frame that passes between them, as in Figure 7.27. In procedure LINK_SEND, the procedure variable send­ proc is selected from an array of link layer protocols; the value found in that array might be, for example, a version of the procedure PACKET_TO_BIT of Figure 7.24 that has been extended with a third argument that identifies which link to use. The procedures CHECK­ SUM and LENGTH are programs we assume are found in the library. Procedure LINK_RECEIVE might be called, for example, by procedure BIT_TO_FRAME of Figure 7.24. The procedure structure frame structure checked_contents bit_string net_protocol bit_string payload bit_string checksum

// multiplexing parameter // payload data

procedure LINK_SEND (data_buffer, link_identifier, link_protocol, network_protocol) frame instance outgoing_frame outgoing_frame.checked_contents.payload ← data_buffer outgoing_frame.checked_contents.net_protocol ← data_buffer.network_protocol frame_length ← LENGTH (data_buffer) + header_length outgoing_frame.checksum ← CHECKSUM (frame.checked_contents, frame_length) sendproc ← link_protocol[that_link.protocol] // Select link protocol. sendproc (outgoing_frame, frame_length, link_identifier) // Send frame. procedure LINK_RECEIVE (received_frame, length, link_id)

frame instance received_frame

if CHECKSUM (received_frame.checked_contents, length) =

received_frame.checksum then // Pass good packets up to next layer. good_frame_count ← good_frame_count + 1; GIVE_TO_NETWORK_HANDLER (received_frame.checked_contents.payload, received_frame.checked_contents.net_protocol); else bad_frame_count ← bad_frame_count + 1 // Just count damaged frame. // Each network layer protocol handler must call SET_HANDLER before the first packet // for that protocol arrives… procedure SET_HANDLER (handler_procedure, handler_protocol)

net_handler[handler_protocol] ← handler_procedure

procedure GIVE_TO_NETWORK_HANDLER (received_packet, network_protocol)

handler ← net_handler[network_protocol]

if (handler ≠ NULL) call handler(received_packet, network_protocol)

else unexpected_protocol_count ← unexpected_protocol_count + 1

FIGURE 7.27 The LINK_SEND and LINK_RECEIVE procedures, together with the structure of the frame transmit­ ted over the link and a dispatching procedure for the network layer.

Saltzer & Kaashoek Ch. 7, p. 43

June 25, 2009 8:22 am


CHAPTER 7 The Network as a System and as a System Component

verifies the checksum, and then extracts net_data and net_protocol from the frame and passes them to the procedure that calls the network handler together with the identifier of the link over which the packet arrived. These procedures also illustrate an important property of layering that was discussed on page 7–29. The link layer handles its argument data_buffer as an unstructured string of bits. When we examine the network layer in the next section of the chapter, we will see that data_buffer contains a network-layer packet, which has its own internal struc­ ture. The point is that as we pass from an upper layer to a lower layer, the content and structure of the payload data is not supposed to be any concern of the lower layer. As an aside, the division we have chosen for our sample implementation of a link layer, with one program doing framing and another program verifying checksums, cor­ responds to the OSI reference model division of the link layer into physical and strategy layers, as was mentioned in Section 7.2.5. Since the link is now multiplexed among several network-layer protocols, when a frame arrives, the link layer must dispatch the packet contained in that frame to the proper network layer protocol handler. Figure 7.27 shows a handler dispatcher named GIVE_TO_NETWORK_HANDLER. Each of several different network-layer protocol-implement­ ing programs specifies the protocol it knows how to handle, through arguments in a call to SET_HANDLER. Control then passes to a particular network-layer handler only on arrival of a frame containing a packet of the protocol it specified. With some additional effort (not illustrated—the reader can explore this idea as an exercise), one could also make this dispatcher multithreaded, so that as it passes a packet up to the network layer a new thread takes over and the link layer thread returns to work on the next arriving frame. 
With or without threads, the network_protocol field of a frame indicates to whom in the network layer the packet contained in the frame should be delivered. From a more general point of view, we are multiplexing the lower-layer protocol among several higher-layer protocols. This notion of multiplexing, together with an identification field to support it, generally appears in every protocol layer, and in every layer-to-layer interface, of a network architecture.

An interesting challenge is that the multiplexing field of a layer names the protocols of the next higher layer, so some method is needed to assign those names. Since higher-layer protocols are likely to be defined and implemented by different organizations, the usual solution is to hand the name conflict avoidance problem to some national or international standard-setting body. For example, the names of the protocols of the Internet are assigned by an outfit called ICANN, which stands for the Internet Corporation for Assigned Names and Numbers.
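The registration-and-dispatch idea behind SET_HANDLER and GIVE_TO_NETWORK_HANDLER can be sketched in a few lines. This is a Python sketch, not the book's pseudocode; the function and variable names are lower-case renderings of the procedure names in Figure 7.27, and the protocol number and packet contents are invented for illustration.

```python
# A table mapping each network-layer protocol number to its registered handler.
handler_table = {}

def set_handler(handler_procedure, net_protocol):
    """Register a network-layer handler for one protocol number."""
    handler_table[net_protocol] = handler_procedure

def give_to_network_handler(net_protocol, packet, link_id):
    """Dispatch an arriving packet to whichever handler registered
    for its protocol; discard the packet if no one did."""
    handler = handler_table.get(net_protocol)
    if handler is None:
        return None                      # no handler registered: discard
    return handler(packet, link_id)

# Example: one network-layer protocol registers as (invented) protocol number 1.
received = []
set_handler(lambda packet, link_id: received.append((packet, link_id)), 1)
give_to_network_handler(1, b"arriving packet", 2)
```

A multithreaded version would differ only in that `give_to_network_handler` would hand the packet to a new thread rather than calling the handler directly.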

7.3.5 Link Properties

Some final details complete our tour of the link layer. First, links come in several flavors, for which there is some standard terminology: A point-to-point link directly connects exactly two communicating entities. A simplex link has a transmitter at one end and a receiver at the other; two-way communication

requires installing two such links, one going in each direction. A duplex link has both a transmitter and a receiver at each end, allowing the same link to be used in both directions. A half-duplex link is a duplex link in which transmission can take place in only one direction at a time, whereas a full-duplex link allows transmission in both directions at the same time over the same physical medium.

A broadcast link is a shared transmission medium in which there can be several transmitters and several receivers. Anything sent by any transmitter can be received by many—perhaps all—receivers. Depending on the physical design details, a broadcast link may limit use to one transmitter at a time, or it may allow several distinct transmissions to be in progress at the same time over the same physical medium. This design choice is analogous to the distinction between half duplex and full duplex, but there is no standard terminology for it. The link layers of the standard Ethernet and the popular wireless system known as Wi-Fi are one-transmitter-at-a-time broadcast links. The link layer of a CDMA Personal Communication System (such as ANSI–J–STD–008, which is used by cellular providers Verizon and Sprint PCS) is a broadcast link that permits many transmitters to operate simultaneously.

Finally, most link layers impose a maximum frame size, known as the maximum transmission unit (MTU). The reasons for limiting the size of a frame are several:

1. The MTU puts an upper bound on link commitment time, which is the length of time that a link will be tied up once it begins to transmit the frame. This consideration is more important for slow links than for fast ones.

2. For a given bit error rate, the longer a frame the greater the chance of an uncorrectable error in that frame. Since the frame is usually also the unit of error control, an uncorrectable error generally means loss of the entire frame, so as the frame length increases not only does the probability of loss increase, but the cost of the loss increases because the entire frame will probably have to be retransmitted. The MTU puts a ceiling on both of these costs.

3. If congestion leads a forwarder to discard a packet, the MTU limits the amount of transmission capacity required to retransmit the packet.

4. There may be mechanical limits on the maximum length of a frame. A hardware interface may have a small buffer or a short counter register tracking the number of bits in the frame. Similar limits sometimes are imposed by software that was originally designed for another application or to comply with some interoperability standard.

Whatever the reason for the MTU, when an application needs to send a message that does not fit in a maximum-sized frame, it becomes the job of some end-to-end protocol to divide the message into segments for transmission and to reassemble the segments into the complete message at the other end. The way in which the end-to-end protocol discovers the value of the MTU is complicated—it needs to know not just the MTU of the link it is about to use, but the smallest MTU that the segment will encounter on the path


through the network to its destination. For this purpose, it needs some help from the network layer, which is our next topic.
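The divide-and-reassemble job just described can be sketched simply once the path MTU is known. This Python sketch assumes in-order, lossless delivery of segments; a real end-to-end protocol must also number the segments and handle loss and reordering (the subject of the end-to-end layer), and the MTU value of 8 bytes is artificially small for illustration.

```python
def segment_message(message, mtu):
    """Split a message into segments that each fit in one maximum-sized frame."""
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]

def reassemble(segments):
    """Reassemble the original message, assuming segments arrive
    complete and in order (a real protocol cannot assume this)."""
    return b"".join(segments)

parts = segment_message(b"a message larger than one frame", mtu=8)
```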

7.4 The Network Layer

The network layer is the middle layer of our three-layer reference model. The network layer moves a packet across a series of links. While conceptually quite simple, the challenges in implementation of this layer are probably the most difficult in network design because there is usually a requirement that a single design span a wide range of performance, traffic load, and number of attachment points. In this section we develop a simple model of the network layer and explore some of the challenges.

7.4.1 Addressing Interface

The conceptual model of a network is a cloud bristling with network attachment points identified by numbers known as network addresses, as in Figure 7.28. A segment enters the network at one attachment point, known as the source. The network layer wraps the segment in a packet and carries the packet across the network to another attachment point, known as the destination, where it unwraps the original segment and delivers it.

[Figure: a cloud labeled “Network” ringed by network attachment points, each identified by a two-digit network address such as 35, 01, 07, 24, 33, 11, 40, 41, 16, 39, and 42.]

FIGURE 7.28 The network layer.

The model in the figure is misleading in one important way: it suggests that delivery of a segment is accomplished by sending it over one final, physical link. A network attachment point is actually a virtual concept rather than a physical concept. Every network participant, whether a packet forwarder or a client computer system, contains an implementation of the network layer, and when a packet finally reaches the network layer of its destination, rather than forwarding it further, the network layer unwraps the segment contained in the packet and passes that segment to the end-to-end layer inside the system that contains the network attachment point. In addition, a single system may have several network attachment points, each with its own address, all of which result in delivery to the same end-to-end layer; such a system is said to be multihomed. Even packet forwarders need network attachment points with their own addresses, so that a network manager can send them instructions about their configuration and maintenance.



Since a network has many attachment points, the end-to-end layer must specify to the network layer not only a data segment to transmit but also its intended destination. Further, there may be several available networks and protocols, and several end-to-end protocol handlers, so the interface from the end-to-end layer to the network layer is parallel to the one between the network layer and the link layer:

NETWORK_SEND (segment_buffer, destination, network_protocol, end_layer_protocol)

The argument network_protocol allows the end-to-end layer to select a network and protocol with which to send the current segment, and the argument end_layer_protocol allows for multiplexing, this time of the network layer by the end-to-end layer. The value of end_layer_protocol tells the network layer at the destination to which end-to-end protocol handler the segment should be delivered.

The network layer also has a link-layer interface, across which it receives packets. Following the upcall style of the link layer of Section 7.3, this interface would be

NETWORK_HANDLE (packet, network_protocol)

and this procedure would be the handler_procedure argument of a call to SET_HANDLER in Figure 7.27. Thus whenever the link layer has a packet to deliver to the network layer, it does so by calling NETWORK_HANDLE.

The pseudocode of Figure 7.29 describes a model network layer in detail, starting with the structure of a packet, and followed by implementations of the procedures NETWORK_HANDLE and NETWORK_SEND. NETWORK_SEND creates a packet, starting with the segment provided by the end-to-end layer and adding a network-layer header, which here comprises three fields: source, destination, and end_layer_protocol. It fills in the destination and end_layer_protocol fields from the corresponding arguments, and it fills in the source field with the address of its own network attachment point. Figure 7.30 shows this latest addition to the overhead of a packet.

Procedure NETWORK_HANDLE may do one of two rather different things with a packet, distinguished by the test on line 11. If the packet is not at its destination, NETWORK_HANDLE looks up the packet’s destination in forwarding_table to determine the best link on which to forward it, and then it calls the link layer to send the packet on its way. On the other hand, if the received packet is at its destination, the network layer passes its payload up to the end-to-end layer rather than sending the packet out over another link. As in the case of the interface between the link layer and the network layer, the interface to the end-to-end layer is another upcall that is intended to go through a handler dispatcher similar to that of the link layer dispatcher of Figure 7.27. Because in a network, any network attachment point can send a packet to any other, the last argument of GIVE_TO_END_LAYER, the source of the packet, is a piece of information that the end-layer recipient generally finds useful in deciding how to handle the packet.
One might wonder what led to naming the procedure NETWORK_HANDLE rather than NETWORK_RECEIVE. The insight in choosing that name is that forwarding a packet is always done in exactly the same way, whether the packet comes from the layer above or from the layer below. Thus, when we consider the steps to be taken by NETWORK_SEND, the straightforward implementation is simply to place the data in a packet, add a network


layer header, and hand the packet to NETWORK_HANDLE. As an extra feature, this architecture allows a source to send a packet to itself without creating a special case. Just as the link layer used the net_protocol field to decide which of several possible network handlers to give the packet to, NETWORK_SEND can use the net_protocol argument for the same purpose. That is, rather than calling NETWORK_HANDLE directly, it could call the procedure GIVE_TO_NETWORK_HANDLER of Figure 7.27.

7.4.2 Managing the Forwarding Table: Routing

The primary challenge in a packet forwarding network is to set up and manage the forwarding tables, which generally must be different for each network-layer participant. Constructing these tables requires first figuring out appropriate paths (sometimes called routes) to follow from each source to each destination, so the exercise is variously known as path-finding or routing. In a small network, one might set these tables up by hand. As the scale of a network grows, this approach becomes impractical, for several reasons:

structure packet
	bit_string source
	bit_string destination
	bit_string end_protocol
	bit_string payload

 1	procedure NETWORK_SEND (segment_buffer, destination,
 2			network_protocol, end_protocol)
 3		packet instance outgoing_packet
 4		outgoing_packet.payload ← segment_buffer
 5		outgoing_packet.end_protocol ← end_protocol
 6		outgoing_packet.source ← MY_NETWORK_ADDRESS
 7		outgoing_packet.destination ← destination
 8		NETWORK_HANDLE (outgoing_packet, net_protocol)

 9	procedure NETWORK_HANDLE (net_packet, net_protocol)
10		packet instance net_packet
11		if net_packet.destination ≠ MY_NETWORK_ADDRESS then
12			next_hop ← LOOKUP (net_packet.destination, forwarding_table)
13			LINK_SEND (net_packet, next_hop, link_protocol, net_protocol)
14		else
15			GIVE_TO_END_LAYER (net_packet.payload,
16				net_packet.end_protocol, net_packet.source)

FIGURE 7.29 Model implementation of a network layer. The procedure NETWORK_SEND originates packets, while NETWORK_HANDLE receives packets and either forwards them or passes them to the local end-to-end layer.
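The model of Figure 7.29 can be rendered in executable form. In this Python sketch the stand-ins for LINK_SEND, LOOKUP, and GIVE_TO_END_LAYER, as well as the sample forwarding table entries and protocol names, are invented for illustration; only the two-way branch on the destination test is taken from the figure.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    source: str
    destination: str
    end_protocol: str
    payload: bytes

MY_NETWORK_ADDRESS = "G"
forwarding_table = {"D": 3, "A": 1}    # destination -> outgoing link (sample values)
sent_frames = []                       # records calls to the LINK_SEND stand-in
delivered = []                         # records calls to the GIVE_TO_END_LAYER stand-in

def link_send(packet, next_hop, link_protocol, net_protocol):
    sent_frames.append((packet, next_hop))

def give_to_end_layer(payload, end_protocol, source):
    delivered.append((payload, end_protocol, source))

def network_handle(net_packet, net_protocol="nl"):
    """Forward the packet, or deliver it locally if it has arrived."""
    if net_packet.destination != MY_NETWORK_ADDRESS:
        next_hop = forwarding_table[net_packet.destination]   # LOOKUP
        link_send(net_packet, next_hop, "lp", net_protocol)
    else:
        give_to_end_layer(net_packet.payload,
                          net_packet.end_protocol, net_packet.source)

def network_send(segment_buffer, destination,
                 network_protocol="nl", end_protocol="rpc"):
    """Wrap the segment in a packet, then hand it to network_handle."""
    outgoing = Packet(MY_NETWORK_ADDRESS, destination, end_protocol, segment_buffer)
    network_handle(outgoing, network_protocol)

network_send(b"data", "D")        # not at its destination: forwarded over link 3
network_send(b"to myself", "G")   # at its destination: delivered locally, no special case
```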



1. The amount of calculation required to determine the best paths grows combinatorially with the number of nodes in the network.

2. Whenever a link is added or removed, the forwarding tables must be recalculated. As a network grows in size, the frequency of links being added and removed will probably grow in proportion, so the combinatorially growing routing calculation will have to be performed more and more frequently.

3. Whenever a link fails or is repaired, the forwarding tables must be recalculated. For a given link failure rate, the number of such failures will be proportional to the number of links, so for a second reason the combinatorially growing routing calculation will have to be performed an increasing number of times.

4. There are usually several possible paths available, and if traffic suddenly causes the originally planned path to become congested, it would be nice if the forwarding tables could automatically adapt to the new situation.

All four of these reasons encourage the development of automatic routing algorithms. If reasons 1 and 2 are the only concerns, one can leave the resulting forwarding tables in place for an indefinite period, a technique known as static routing. The on-the-fly recalculation called for by reasons 3 and 4 is known as adaptive routing, and because this feature is vitally important in many networks, routing algorithms that allow for easy update when things change are almost always used. A packet forwarder that also participates in a routing algorithm is usually called a router.

[Figure: a segment presented to the network layer is wrapped in a packet (adding source & destination and end protocol fields) and then in a frame (adding frame marks, a network protocol field, and a checksum).]

FIGURE 7.30 A typical accumulation of network layer and link layer headers and trailers. The additional information added at each layer can come from control information passed from the higher layer as arguments (for example, the end protocol type and the destination are arguments in the call to the network layer). In other cases they are added by the lower layer (for example, the link layer adds the frame marks and checksum).

An adaptive routing algorithm requires exchange of current reachability information. Typically, the routers exchange this information using a network-layer routing protocol transmitted over the network itself.

[Figure: a modest-sized network of routers G, H, J, and K (drawn as rectangles) and workstations and services A through F (drawn as circles); each link carries two one-digit link identifiers, one from the point of view of each end.]

FIGURE 7.31 Routing example.

To see how adaptive routing algorithms might work, consider the modest-sized network of Figure 7.31. To minimize confusion in interpreting this figure, each network address is lettered, rather than numbered, while each link is assigned two one-digit link identifiers, one from the point of view of each of the stations it connects. In this figure, routers are rectangular while workstations and services are round, but all have network addresses and all have network layer implementations.

Suppose now that the source A sends a packet addressed to destination D. Since A has only one outbound link, its forwarding table is short and simple:

	destination	link
	A		end-layer
	all other	1
so the packet departs from A by way of link 1, going to router G for its next stop. However, the forwarding table at G must be considerably more complicated. It might contain, for example, the following values:

	destination	link
	A		1
	B		2
	C		2
	D		3
	E		4
	F		4
	G		end-layer
	H		2
	J		3
	K		4
This is not the only possible forwarding table for G. Since there are several possible paths to most destinations, there are several possible values for some of the table entries. In addition, it is essential that the forwarding tables in the other routers be coordinated with this forwarding table. If they are not, when router G sends a packet destined for E to router K, router K might send it back to G, and the packet could loop forever.

The interesting question is how to construct a consistent, efficient set of forwarding tables. Many algorithms that sound promising have been proposed and tried; few work well. One that works moderately well for small networks is known as path vector exchange. Each participant maintains, in addition to its forwarding table, a path vector, each element of which is a complete path to some destination. Initially, the only path it knows about is the zero-length path to itself, but as the algorithm proceeds it gradually learns about other paths. Eventually its path vector accumulates paths to every point in the network. After each step of the algorithm it can construct a new forwarding table from its new path vector, so the forwarding table gradually becomes more and more complete. The algorithm involves two steps that every participant repeats over and over: path advertising and path selection.

To illustrate the algorithm, suppose participant G starts with a path vector that contains just one item, an entry for itself, as in Figure 7.32.

	to	path
	G	< >

FIGURE 7.32 Initial state of path vector for G. < > is an empty path.

In the advertising step, each participant sends its own network address and a copy of its path vector down every attached link to its immediate neighbors, specifying the network-layer protocol PATH_EXCHANGE. The routing algorithm of G would thus receive from its four neighbors the four path vectors of Figure 7.33. This advertisement allows G to discover the names, which are in this case network addresses, of each of its neighbors.
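The path selection step that G applies to each received advertisement can be sketched compactly. This Python sketch uses invented names; paths are lists of addresses, the metric is hop count, and the loop check (reject any path containing this router's own address) follows the rule described in the text.

```python
def merge_advertisement(my_addr, my_vector, neighbor_addr, neighbor_vector):
    """One path-selection step: prepend the advertising neighbor's address
    to each offered path, then adopt it if it is new or shorter, rejecting
    any path that already contains this router's own address (a loop)."""
    for dest, path in neighbor_vector.items():
        new_path = [neighbor_addr] + path
        if my_addr in new_path:
            continue                       # would loop through ourselves
        if dest not in my_vector or len(new_path) < len(my_vector[dest]):
            my_vector[dest] = new_path

# First round at G: each neighbor advertises only the empty path to itself.
my_vector = {"G": []}
for neighbor in ("A", "H", "J", "K"):
    merge_advertisement("G", my_vector, neighbor, {neighbor: []})
```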

From A, via link 1:
	to	path
	A	< >

From H, via link 2:
	to	path
	H	< >

From J, via link 3:
	to	path
	J	< >

From K, via link 4:
	to	path
	K	< >

FIGURE 7.33 Path vectors received by G in the first round.

path vector:
	to	path
	A	<A>
	G	< >
	H	<H>
	J	<J>
	K	<K>

forwarding table:
	destination	link
	A		1
	G		end-layer
	H		2
	J		3
	K		4

FIGURE 7.34 First-round path vector and forwarding table for G.
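The step from path vector to forwarding table is mechanical: the outgoing link for a destination is the link that leads to the first hop of its path, and a zero-length path means local delivery. A Python sketch, using G's link numbering from the example (A via link 1, H via 2, J via 3, K via 4):

```python
def forwarding_table_from(path_vector, link_to_neighbor):
    """Derive a forwarding table from a path vector: route each destination
    over the link that reaches the first hop of its path."""
    table = {}
    for dest, path in path_vector.items():
        if not path:
            table[dest] = "end-layer"        # zero-length path: deliver locally
        else:
            table[dest] = link_to_neighbor[path[0]]
    return table

# G's first-round path vector and its links to its neighbors.
first_round = {"G": [], "A": ["A"], "H": ["H"], "J": ["J"], "K": ["K"]}
table = forwarding_table_from(first_round, {"A": 1, "H": 2, "J": 3, "K": 4})
```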

[Figure: the four path vectors advertised to G by A (via link 1), H (via link 2), J (via link 3), and K (via link 4).]

FIGURE 7.35 Path vectors received by G in the second round.

path vector:
	to	path
	A	<A>
	B	<H,B>
	C	<H,C>
	D	<J,D>
	E	<J,E>
	F	<K,F>
	G	< >
	H	<H>
	J	<J>
	K	<K>

forwarding table:
	destination	link
	A		1
	B		2
	C		2
	D		3
	E		3
	F		4
	G		end-layer
	H		2
	J		3
	K		4

FIGURE 7.36 Second-round path vector and forwarding table for G.



G now performs the path selection step by merging the information received from its neighbors with that already in its own previous path vector. To do this merge, G takes each received path, prepends the network address of the neighbor that supplied it, and then decides whether or not to use this path in its own path vector. Since on the first round in our example all of the information from neighbors gives paths to previously unknown destinations, G adds all of them to its path vector, as in Figure 7.34. G can also now construct a forwarding table for use by NET_HANDLE that allows NET_HANDLE to forward packets to destinations A, H, J, and K as well as to the end-to-end layer of G itself. In a similar way, each of the other participants has also constructed a better path vector and forwarding table.

Now, each participant advertises its new path vector. This time, G receives the four path vectors of Figure 7.35, which contain information about several participants of which G was previously unaware. Following the same procedure again, G prepends to each element of each received path vector the identity of the router that provided it, and then considers whether or not to use this path in its own path vector. For previously unknown destinations, the answer is yes. For previously known destinations, G compares the paths that its neighbors have provided with the path it already had in its table to see if the neighbor has a better path.

This comparison raises the question of what metric to use for “better”. One simple answer is to count the number of hops. More elaborate schemes might evaluate the data rate of each link along the way or even try to keep track of the load on each link of the path by measuring and reporting queue lengths. Assuming G is simply counting hops, G looks at the path that A has offered to reach G, namely

	to G: <A, G>

and notices that G’s own path vector already contains a zero-length path to G, so it ignores A’s offering. A second reason to ignore this offering is that its own name, G, is in the path, which means that this path would involve a loop. To ensure loop-free forwarding, the algorithm always ignores any offered path that includes this router’s own name.

When it is finished with the second round of path selection, G will have constructed the second-round path vector and forwarding table of Figure 7.36. On the next round G will begin receiving longer paths. For example, it will learn that H offers the path to D:

	to D: <H, J, D>

Since this path is longer than the one that G already has in its own path vector for D, G will ignore the offer. If the participants continue to alternate advertising and path selection steps, this algorithm ensures that eventually every participant will have in its own path vector the best (in this case, shortest) path to every other participant and there will be no loops.

If static routing would suffice, the path vector construction procedure described above could stop once everyone’s tables had stabilized. But a nice feature of this algorithm is that it is easily extended to provide adaptive routing. One method of extension would be, on learning of a change in topology, to redo the entire procedure, starting


again with path vectors containing just the path to the local end layer. A more efficient approach is to use the existing path vectors as a first approximation. The one or two participants who, for example, discover that a link is no longer working simply adjust their own path vectors to stop using that link and then advertise their new path vectors to the neighbors they can still reach. Once we realize that readvertising is a way to adjust to topology change, it is apparent that the straightforward way to achieve adaptive routing is simply to have every router occasionally repeat the path vector exchange algorithm.

If someone adds a new link to the network, on the next iteration of the exchange algorithm, the routers at each end of the new link will discover it and propagate the discovery throughout the network. On the other hand, if a link goes down, an additional step is needed to ensure that paths that traversed that link are discarded: each router discards any paths that a neighbor stops advertising. When a link goes down, the routers on each end of that link stop receiving advertisements; as soon as they notice this lack they discard all paths that went through that link. Those paths will be missing from their own next advertisements, which will cause any neighbors using those paths to discard them in turn; in this way the fact of a down link retraces each path that contains the link, thereby propagating through the network to every router that had a path that traversed the link. A model implementation of all of the parts of this path vector algorithm appears in Figure 7.37.

When designing a routing algorithm, there are a number of questions that one should ask. Does the algorithm converge? (Because it selects the shortest path, this algorithm will converge, assuming that the topology remains constant.) How rapidly does it converge? (If the shortest path from a router to some participant is N steps, then this algorithm will insert that shortest path in that router’s table after N advertising/path-selection exchanges.) Does it respond equally well to link deletions? (No, it can take longer to convince all participants of deletions. On the other hand, there are other algorithms—such as distance vector, which passes around just the lengths of paths rather than the paths themselves—that are much worse.) Is it safe to send traffic before the algorithm converges? (If a link has gone down, some packets may loop for a while until everyone agrees on the new forwarding tables. This problem is serious, but in the next paragraph we will see how to fix it by discarding packets that have been forwarded too many times.) How many destinations can it reasonably handle? (The Border Gateway Protocol, which uses a path vector algorithm similar to the one described above, has been used in the Internet to exchange information concerning 100,000 or so routes.)

The possibility of temporary loops in the forwarding tables, as well as more general routing table inconsistencies, buggy routing algorithms, or misconfigurations, can be dealt with by a network layer mechanism known as the hop limit. The idea is to add a field to the network-layer header containing a hop limit counter. The originator of the packet initializes the hop limit. Each router that handles the packet decrements the hop limit by one as the packet goes by. If a router finds that the resulting value is zero, it discards the packet. The hop limit is thus a safety net that ensures that no packet continues bouncing around the network forever.
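The hop limit mechanism is simple enough to sketch directly. In this Python sketch the initial value of 16 and the field names are invented for illustration; the point is only that a packet caught in a forwarding loop is discarded after a bounded number of transits.

```python
HOP_LIMIT = 16   # chosen by the packet's originator; 16 is an illustrative value

def forward_once(packet):
    """One router's hop-limit processing: decrement the counter as the
    packet goes by, and discard the packet when the counter reaches zero."""
    packet["hop_limit"] -= 1
    return "discarded" if packet["hop_limit"] == 0 else "forwarded"

# A packet caught in a forwarding loop bounces only a bounded number of times.
looping_packet = {"destination": "E", "hop_limit": HOP_LIMIT}
hops_taken = 0
while forward_once(looping_packet) == "forwarded":
    hops_taken += 1
```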



// Maintain routing and forwarding tables.
vector associative array	// vector[d_addr] contains path to destination d_addr
neighbor_vector instance of vector	// A path vector received from some neighbor.
my_vector instance of vector	// My current path vector.
addr associative array	// addr[j] is the address of the network attachment
	// point at the other end of link j.
	// my_addr is address of my network attachment point.
	// A path is a parsable list of addresses, e.g. {a,b,c,d}

procedure main()	// Initialize, then start advertising.
	SET_TYPE_HANDLER (HANDLE_ADVERTISEMENT, exchange_protocol)
	clear my_vector	// Listen for advertisements
	do occasionally	// and advertise my paths
		for each j in link_ids do	// to all of my neighbors.
			status ← SEND_PATH_VECTOR (j, my_addr, my_vector, exch_protocol)
			if status ≠ 0 then	// If the link was down,
				clear new_vector	// forget about any paths
				FLUSH_AND_REBUILD (j)	// that start with that link.

procedure HANDLE_ADVERTISEMENT (advt, link_id)	// Called when an advt arrives.
	addr[link_id] ← GET_SOURCE (advt)	// Extract neighbor’s address
	neighbor_vector ← GET_PATH_VECTOR (advt)	// and path vector.
	for each neighbor_vector.d_addr do	// Look for better paths.
		new_path ← {addr[link_id], neighbor_vector[d_addr]}	// Build potential path.
		if my_addr is not in new_path then	// Skip it if I’m in it.
			if my_vector[d_addr] = NULL then	// Is it a new destination?
				my_vector[d_addr] ← new_path	// Yes, add this one.
			else	// Not new; if better, use it.
				my_vector[d_addr] ← SELECT_PATH (new_path, my_vector[d_addr])
	FLUSH_AND_REBUILD (link_id)

procedure SELECT_PATH (new, old)	// Decide if new path is better than old one.
	if first_hop(new) = first_hop(old) then return new	// Update any path we were already using.
	else if length(new) ≥ length(old) then return old	// We know a shorter path, keep it.
	else return new	// OK, the new one looks better.

procedure FLUSH_AND_REBUILD (link_id)	// Flush out stale paths from this neighbor.
	for each d_addr in my_vector
		if first_hop(my_vector[d_addr]) = addr[link_id]
				and new_vector[d_addr] = NULL then
			delete my_vector[d_addr]	// Delete paths that are no longer advertised.
	REBUILD_FORWARDING_TABLE (my_vector, addr)	// Pass info to forwarder.

FIGURE 7.37 Model implementation of a path vector exchange routing algorithm. These procedures run in every participating router. They assume that the link layer discards damaged packets. If an advertisement is lost, it is of little consequence because the next advertisement will replace it. The procedure REBUILD_FORWARDING_TABLE is not shown; it simply constructs a new forwarding table for use by this router, using the latest path vector information.
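The repeated advertise/select rounds of the algorithm can be simulated end to end on a toy topology. This Python sketch uses an invented four-node chain (not the network of Figure 7.31) and a simultaneous-round model in which every node advertises its previous vector to all neighbors; the loop check and shorter-path rule are the ones from the text, but the synchronous rounds are a simplification of the asynchronous exchanges a real router performs.

```python
# Hypothetical topology: adjacency lists for a small chain A - G - H - D.
links = {
    "A": ["G"],
    "G": ["A", "H"],
    "H": ["G", "D"],
    "D": ["H"],
}

# Each node starts knowing only the zero-length path to itself.
vectors = {node: {node: []} for node in links}

def advertise_round():
    """One synchronous round: every node advertises its previous path vector
    to each neighbor, which prepends the sender's address and keeps any
    new or shorter loop-free path."""
    snapshot = {n: dict(v) for n, v in vectors.items()}   # advertise old state
    for sender, vector in snapshot.items():
        for receiver in links[sender]:
            for dest, path in vector.items():
                candidate = [sender] + path
                if receiver in candidate:
                    continue                  # reject paths through myself
                best = vectors[receiver].get(dest)
                if best is None or len(candidate) < len(best):
                    vectors[receiver][dest] = candidate

# A shortest path of N hops appears after N rounds; 3 suffice for this chain.
for _ in range(3):
    advertise_round()
```

The simulation illustrates the convergence claim made above: after as many rounds as the longest shortest path, every node holds a shortest loop-free path to every other node.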


There are some obvious refinements that can be made to the path vector algorithm. For example, since nodes such as A, B, C, D, and F are connected by only one link to the rest of the network, they can skip the path selection step and just assume that all destinations are reachable via their one link—but when they first join the network they must do an advertising step, to ensure that the rest of the network knows how to reach them (and it would be wise to occasionally repeat the advertising step, to make sure that link failures and router restarts don’t cause them to be forgotten). A service node such as E, which has two links to the network but is not intended to be used for transit traffic, may decide never to advertise anything more than the path to itself. Because each participant can independently decide which paths it advertises, path vector exchange is sometimes used to implement restrictive routing policies. For example, a country might decide that packets that both originate and terminate domestically should not be allowed to transit another country, even if that country advertises a shorter path.

The exchange of data among routers is just another example of a network layer protocol. Since the link layer already provides network layer protocol multiplexing, no extra effort is needed to add a routing protocol to the layered system. Further, there is nothing preventing different groups of routers from choosing to use different routing protocols among themselves. In the Internet, there are many different routing protocols simultaneously in use, and it is common for a single router to use different routing protocols over different links.

7.4.3 Hierarchical Address Assignment and Hierarchical Routing

The system for identifying attachment points of a network as described so far is workable, but does not scale up well to large numbers of attachment points. There are two immediate problems:

1. Every attachment point must have a unique address. If there are just ten attachment points, all located in the same room, coming up with a unique identifier for an eleventh is not difficult. But if there are several hundred million attachment points in locations around the world, as in the Internet, it is hard to maintain a complete and accurate list of addresses already assigned.

2. The path vector grows in size with the number of attachment points. Again, for routers to exchange a path vector with ten entries is not a problem; a path vector with 100 million entries could be a hassle.

The usual way to tackle these two problems is to introduce hierarchy: invent some scheme by which network addresses have a hierarchical structure that we can take advantage of, both for decentralizing address assignments and for reducing the size of forwarding tables and path vectors.

For example, consider again the abstract network of Figure 7.28, in which we arbitrarily assigned two-digit numbers as network addresses. Suppose we instead adopt a more structured network address consisting, say, of two parts, which we might call

Saltzer & Kaashoek Ch. 7, p. 56

June 25, 2009 8:22 am

7.4 The Network Layer


“region” and “station”. Thus in Figure 7.31 we might assign to A the network address “11,75” where 11 is a region identifier and 75 is a station identifier.

By itself, this change merely complicates things. However, if we also adopt a policy that regions must correspond to the set of network attachment points served by an identifiable group of closely-connected routers, we have a lever that we can use to reduce the size of forwarding tables and path vectors. Whenever a router for region 11 gets ready to advertise its path vector to a router that serves region 12, it can condense all of the paths for the region 11 network destinations it knows about into a single path, and simply advertise that it knows how to forward things to any region 11 network destination. The routers that serve region 11 must, of course, still maintain complete path vectors for every region 11 station, and exchange those vectors among themselves, but these vectors are now proportional in size to the number of attachment points in region 11, rather than to the number of attachment points in the whole network.

When a network uses hierarchical addresses, the operation of forwarding involves the same steps as before, but the table lookup process is slightly more complicated: The forwarder must first extract the region component of the destination address and look that up in its forwarding table. This lookup has two possible outcomes: either the forwarding table contains an entry showing a link over which to send the packet to that region, or the forwarding table contains an entry saying that this forwarder is already in the destination region, in which case it is necessary to extract the station identifier from the destination address and look that up in a distinct part of the forwarding table. In most implementations, the structure of the forwarding table reflects the hierarchical structure of network addresses.
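The condensation step at a region boundary can be sketched as follows. The dotted "region.station" address format, the function names, and the sample paths are all illustrative assumptions, not notation from the text.

```python
# Sketch of condensing a path vector at a region boundary: before advertising
# to a router in another region, every destination inside the local region
# collapses into a single entry for the whole region.

def condense(path_vector, local_region):
    advert = {}
    for dest, path in path_vector.items():
        region = dest.split(".")[0]
        if region == local_region:
            # one entry stands in for every station in the local region
            advert.setdefault(local_region, path)
        else:
            advert[dest] = path
    return advert

paths = {"11.75": ["r1"], "11.76": ["r1", "r2"], "12.20": ["r3"]}
print(condense(paths, "11"))
# -> {'11': ['r1'], '12.20': ['r3']}
```

The advertisement shrinks from one entry per station to one entry per region, which is what keeps cross-region path vectors small.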
Figure 7.38 illustrates the use of a forwarding table for hierarchical addresses that is constructed of two sections.

Hierarchical addresses also offer an opportunity to grapple with the problem of assigning unique addresses in a large network because the station part of a network address needs to be unique only within its region. A central authority can assign region identifiers, while different local authorities can assign the station identifiers within each region, without consulting other regional authorities. For this decentralization to work, the boundaries of each local administrative authority must coincide with the boundaries of the regions served by the packet forwarders. While this seems like a simple thing to arrange, it can actually be problematic. One easy way to define regions of closely connected packet forwarders is to do it geographically. However, administrative authority is often not organized on a strictly geographic basis. So there may be a significant tension between the needs of address assignment and the needs of packet forwarding.

Hierarchical network addresses are not a panacea—in addition to complexity, they introduce at least two new problems. With the non-hierarchical scheme, the geographical location of a network attachment point did not matter, so a portable computer could, for example, connect to the network in either Boston or San Francisco, announce its network address, and after the routers have exchanged path vectors a few times, expect to communicate with its peers. But with hierarchical routing, this feature stops working. When a portable computer attaches to the network in a different region, it cannot simply advertise the same network address that it had in its old region. It will instead have to



CHAPTER 7 The Network as a System and as a System Component

first acquire a network address within the region to which it is attaching. In addition, unless some provision has been made at the old address for forwarding, other stations in the network that remember the old network address will find that they receive no-answer responses when they try to contact this station, even though it is again attached to the network.

The second complication is that paths may no longer be the shortest possible because the path vector algorithm is working with less detailed information. If there are two different routers in region 5 that have paths leading to region 7, the algorithm will choose the path to the nearest of those two routers, even though the other router may be much closer to the actual destination inside region 7.

We have used in this example a network address with two hierarchical levels, but the same principle can be extended to as many levels as are needed to manage the network. In fact, any region can do hierarchical addressing within just the part of the address space that it controls, so the number of hierarchical levels can be different in different places. The public Internet uses just two hierarchical addressing levels, but some large subnetworks of the Internet implement the second level internally as a two-level hierarchy. Similarly, North American telephone providers have created a four-level hierarchy for telephone numbers: country code, area code, exchange, and line number, for exactly the same reasons: to reduce the size of the tables used in routing calls, and to allow local administration of line numbers. Other countries agree on the country codes but internally may have a different number of hierarchical levels.

[Figure 7.38 shows four regions, R1 through R4, with node R1.B in region R1 attached to links 1, 2, and 3, and with stations R1.A, R1.C, and R1.D elsewhere in region R1. The forwarding table in R1.B, reconstructed from the figure:

    region forwarding section        local forwarding section
      region   link                    station   to
      R1       local                   R1.A      1
      R2       1                       R1.B      end-layer
      R3       1                       R1.C      2
      R4       3                       R1.D      3          ]

FIGURE 7.38 Example of a forwarding table with regional addressing in network node R1.B. The forwarder first looks up the region identifier in the region forwarding section of the table. If the target address is R3.C, the region identifier is R3, so the table tells it that it should forward the packet on link 1. If the target address is R1.C, which is in its own region R1, the region forwarding table tells it that R1 is the local region, so it then looks up R1.C in the local forwarding section of the table. There may be hundreds of network attachment points in region R3, but just one entry is needed in the forwarding table at node R1.B.
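The two-step lookup can be sketched directly from the forwarding table of Figure 7.38. The table contents below come from the figure; the function itself and its return conventions are an illustrative sketch, not code from the text.

```python
# Sketch of two-step forwarding with hierarchical addresses, using the
# forwarding table of node R1.B from Figure 7.38.

REGION_SECTION = {"R1": "local", "R2": 1, "R3": 1, "R4": 3}
LOCAL_SECTION = {"R1.A": 1, "R1.B": "end-layer", "R1.C": 2, "R1.D": 3}

def forward(address):
    """Return the link on which to forward a packet, or 'end-layer'."""
    region = address.split(".")[0]
    outcome = REGION_SECTION[region]
    if outcome != "local":
        return outcome             # one entry covers the whole remote region
    return LOCAL_SECTION[address]  # local region: look up the station

print(forward("R3.C"))  # -> 1 (any of the many stations in R3 uses one entry)
print(forward("R1.C"))  # -> 2 (local region, so the station lookup decides)
```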



7.4.4 Reporting Network Layer Errors

The network layer can encounter trouble when trying to forward a packet, so it needs a way of reporting that trouble. The network layer is in a uniquely awkward position when this happens because the usual reporting method (return a status value to the higher-layer program that asked for this operation) may not be available. An intermediate router receives a packet from a link layer below, and it is expected to forward that packet via another link layer. Even if there is a higher layer in the router, that layer probably has no interest in this packet. Instead, the entity that needs to hear about the problem is more likely to be the upper layer program that originated the packet, and that program may be located several hops away in another computer. Even the network layer at the destination address may need to report something to the original sender such as the lack of an upper-layer handler for the end-to-end type that the sender specified.

The obvious thing to do is send a message to the entity that needs to know about the problem. The usual method is that the network layer of the router creates a new packet on the spot and sends it back to the source address shown in the problem packet. The message in this new packet reports details of the problem using some standard error reporting protocol. With this design, the original higher-layer sender of a packet is expected to listen not only for replies but also for messages of the error reporting protocol. Here are some typical error reports:

• The buffers of the router were full, so the packet had to be discarded.
• The buffers of the router are getting full—please stop sending so many packets.
• The region identifier part of the target address does not exist.
• The station identifier part of the target address does not exist.
• The end type identifier was not recognized.
• The packet is larger than the maximum transmission unit of the next link.
• The packet hop limit has been exceeded.

In addition, a copy of the header of the doomed packet goes into a data field of the error message, so that the recipient can match it with an outstanding SEND request.

One might suggest that a router send an error report when discarding a packet that is received with a wrong checksum. This idea is not as good as it sounds because a damaged packet may have garbled header information, in which case the error message might be sent to a wrong—or even nonexistent—place. Once a packet has been identified as containing unknown damage, it is not a good idea to take any action that depends on its contents.

A network-layer error reporting protocol is a bit unusual. An error message originates in the network layer, but is delivered to the end-to-end layer. Since it crosses layers, it can be seen as violating (in a minor way) the usual separation of layers: we have a network layer program preparing an end-to-end header and inserting end-to-end data; a strict layer doctrine would insist that the network layer not touch anything but network layer headers.


An error reporting protocol is usually specified to be a best-effort protocol, rather than one that takes heroic efforts to get the message through. There are two reasons why this design decision makes sense. First, as will be seen in Section 7.5 of this chapter, implementing a more reliable protocol adds a fair amount of machinery: timers, keeping copies of messages in case they need to be retransmitted, and watching for receipt acknowledgments. The network layer is not usually equipped to do any of these functions, and not implementing them minimizes the violation of layer separation. Second, error messages can be thought of as hints that allow the originator of a packet to more quickly discover a problem. If an error message gets lost, the originator should, one way or another, eventually discover the problem in some other way, perhaps after timing out, resending the original packet, and getting an error message on the retry.

A good example of the best-effort nature of an error reporting protocol is that it is common to not send an error message about every discarded packet; if congestion is causing the discard rate to climb, that is exactly the wrong time to increase the network load by sending many “I discarded your packet” notices. But sending a few such notices can help alert sources who are flooding the network that they need to back off—this topic is explored in more depth in Section 7.6.

The basic idea of an error reporting protocol can be used for other communications to and from the network layer of any participant in the network. For example, the Internet has a protocol named internet control message protocol (ICMP) that includes an echo request message (also known as a “ping,” from an analogy with submarine active sonar systems).
If an end node sends an echo request to any network participant, whether a packet forwarder or another end node, the network layer in that participant is expected to respond by immediately sending the data of the message back to the sender in an echo reply message. Echo request/reply messages are widely used to determine whether or not a participant is actually up and running. They are also sometimes used to assess network congestion by measuring the time until the reply comes back.

Another useful network error report is “hop limit exceeded”. Recall from page 7–54 that to provide a safety net against the possibility of forwarding loops, a packet may contain a hop limit field, which a router decrements in each packet that it forwards. If a router finds that the hop limit field contains zero, it discards the packet and it also sends back a message containing the error report. The “hop limit exceeded” error message provides feedback to the originator, which, for example, may have chosen a hop limit that is too small for the network configuration.

The “hop limit exceeded” error message can also be used in an interesting way to help locate network problems: send a test message (usually called a probe) to some distant destination address, but with the hop limit set to 1. This probe will cause the first router that sees it to send back a “hop limit exceeded” message whose source address identifies that first router. Repeat the experiment, sending probes with hop limits set to 2, 3,…, etc. Each response will reveal the network address of the next router along the current path between the source and the destination. In addition, the time required for the response to return gives a rough indication of the network load between the source and that router. In this way one can trace the current path through the network to the destination address, and identify points of congestion.
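The probing procedure can be sketched against a simulated path. Here send_probe stands in for launching a real probe and collecting the resulting ICMP report; the router names and the path itself are invented for illustration.

```python
# Sketch of tracing a path with hop-limit probes. send_probe(h) returns the
# name of the router that reported "hop limit exceeded" for a probe sent
# with hop limit h, or None if the probe reached the destination.

ROUTERS_ON_PATH = ["router-a", "router-b", "router-c"]  # simulated path

def send_probe(hop_limit):
    if hop_limit <= len(ROUTERS_ON_PATH):
        # the router at which the hop limit reaches zero sends the report
        return ROUTERS_ON_PATH[hop_limit - 1]
    return None  # probe made it all the way to the destination

def trace_path():
    path, hop_limit = [], 1
    while True:
        reporter = send_probe(hop_limit)
        if reporter is None:
            return path
        path.append(reporter)
        hop_limit += 1

print(trace_path())  # -> ['router-a', 'router-b', 'router-c']
```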



Another way to use an error reporting protocol is for the end-to-end layer to send a series of probes to learn the smallest maximum transmission unit (MTU) that lies on the current path between it and another network attachment point. It first sends a packet of the largest size the application has in mind. If this probe results in an “MTU exceeded” error response, it halves the packet size and tries again. A continued binary search will quickly home in on the smallest MTU along the path. This procedure is known as MTU discovery.
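The search for the path MTU can be sketched as a binary search over packet sizes. Here probe stands in for sending a real packet and noting whether an "MTU exceeded" error report comes back; the path MTU of 1500 and the search bounds are assumed values for illustration.

```python
# Sketch of MTU discovery by binary search over packet sizes.

PATH_MTU = 1500  # the value the search should find (simulated)

def probe(size):
    return size <= PATH_MTU  # True means the probe got through

def discover_mtu(low, high):
    """Largest size in [low, high] that gets through, assuming probe(low)."""
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid        # mid fits, so the path MTU is at least mid
        else:
            high = mid - 1   # mid drew an error report: too big
    return low

print(discover_mtu(576, 9000))  # -> 1500
```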

7.4.5 Network Address Translation (An Idea That Almost Works)

From a naming point of view, the Internet provides a layered naming environment with two contexts for its network attachment points, known as “Internet addresses”. An Internet address has two components, a network number and a host number. Most network numbers are global names, but a few, such as network 10, are designated for use in private networks. These network numbers can be used either completely privately, or in conjunction with the public Internet. Completely private use involves setting up an independent private network, and assigning host addresses using the network number 10. Routers within this network advertise and forward just as in the public Internet. Routers on the public Internet follow the convention that they do not accept routes to network 10, so if this private network is also directly attached to the public Internet, there is no confusion. Assuming that the private network accepts routes to globally named networks, a host inside the private network could send a message to a host on the public Internet, but a host on the public Internet cannot send a response back because of the routing convention. Thus any number of private networks can each independently assign numbers using network number 10—but hosts on different private networks cannot talk to one another and hosts on the public Internet cannot talk to them.

Network Address Translation (NAT) is a scheme to bridge this gap. The idea is that a specialized translating router (known informally as a “NAT box”) stands at the border between a private network and the public Internet. When a host inside the private network wishes to communicate with a service on the public Internet, it first makes a request to the translating router. The translator sets up a binding between that host’s private address and a temporarily assigned public address, which the translator advertises to the public Internet.
The private host then launches a packet that has a destination address in the public Internet, and its own private network source address. As this packet passes through the translating router, the translator modifies the source address by replacing it with the temporarily assigned public address. It then sends the packet on its way into the public Internet. When a response from the service on the public Internet comes back to the translating router, the translator extracts the destination address from the response, looks it up in its table of temporarily assigned public addresses, finds the internal address to which it corresponds, modifies the destination address in the packet, and sends the packet on its way on the internal network, where it finds its way to the private host that initiated the communication.
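The translator's bookkeeping can be sketched as follows. Packets are modeled as dictionaries with source and destination fields, and the addresses and the pool of temporary public addresses are illustrative assumptions.

```python
# Sketch of a NAT translator's address rewriting in both directions.

class NatBox:
    def __init__(self, public_pool):
        self.pool = list(public_pool)  # unused temporary public addresses
        self.out_map = {}              # private -> public binding
        self.in_map = {}               # public  -> private binding

    def outbound(self, packet):
        """Rewrite a private source address to a temporary public one."""
        private = packet["source"]
        if private not in self.out_map:
            public = self.pool.pop(0)  # set up a new binding
            self.out_map[private] = public
            self.in_map[public] = private
        return dict(packet, source=self.out_map[private])

    def inbound(self, packet):
        """Rewrite the public destination address back to the private one."""
        return dict(packet, destination=self.in_map[packet["destination"]])

nat = NatBox(["128.30.0.7"])
out = nat.outbound({"source": "10.0.0.5", "destination": "18.9.22.69"})
print(out["source"])         # -> 128.30.0.7
reply = nat.inbound({"source": "18.9.22.69", "destination": "128.30.0.7"})
print(reply["destination"])  # -> 10.0.0.5
```

Note that only the network-layer header fields are rewritten here, which is exactly why addresses buried in payloads (the limitation discussed next) escape translation.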


The scheme works, after a fashion, but it has a number of limitations. The most severe limitation is that some end-to-end network protocols place Internet addresses in fields buried in their payloads; there is nothing restricting Internet addresses to packet source and destination fields of the network layer header. For example, some protocols between two parties start by mentioning the Internet address of a third party, such as a bank, that must also participate in the protocol. If the Internet address of the third party is on the public Internet, there may be no problem, but if it is an address on the private network, the translator needs to translate it as it goes by. The trouble is that translation requires that the translator peer into the payload data of the packet and understand the format of the higher-layer protocol. The result is that NAT works only for those protocols that the translator is programmed to understand. Some protocols may present great difficulties. For example, if a secure protocol uses key-driven cryptographic transformations for either privacy or authentication, the NAT gateway would need to have a copy of the keys, but giving it the keys may defeat the purpose of the secure protocol. (This concern will become clearer after reading Chapter 11 [on-line].)

A second problem is that all of the packets between the public Internet and the private network must pass through the translating router, since it is the only place that knows how to do the address translation. The translator thus introduces both a potential bottleneck and a potential single point of failure, and NAT becomes a constraint on routing policy.

A third problem arises if two such organizations later merge. Each organization will have assigned addresses in network 10, but since their assignments were not coordinated, some addresses will probably have been assigned in both organizations, and all of the colliding addresses must be discovered and changed.
Although originally devised as a scheme to interconnect private networks to the public Internet, NAT has become popular as a technique to beef up the security of computer systems that have insecure operating system or network implementations. In this application, the NAT translator inspects every packet coming from the public Internet and refuses to pass along any whose origin seems suspicious or that try to invoke services that are not intended for public use. The scheme does not in itself provide much security, but in conjunction with other security mechanisms described in Chapter 11 [on-line], it can help create what that chapter describes as “defense in depth”.

7.5 The End-to-End Layer

The network layer provides a useful but not completely dependable best-effort communication environment that will deliver data segments to any destination, but with no guarantees about delay, order of arrival, certainty of arrival, accuracy of content, or even of delivery to the right place. This environment is too hostile for most applications, and the job of the end-to-end layer is to create a more comfortable communication environment that has the features of performance, reliability, and certainty that an application needs. The complication is that different applications can have quite different communication needs, so no single end-to-end design is likely to suffice. At the same time, applications tend to fall in classes all of whose members have somewhat similar requirements. For each such class it is usually possible to design a broadly useful protocol, known as a transport protocol, for use by all the members of the class.

7.5.1 Transport Protocols and Protocol Multiplexing

A transport protocol operates between two attachment points of a network, with the goal of moving either messages or a stream of data between those points while providing a particular set of specified assurances. As was explained in Chapter 4, it is convenient to distinguish the two attachment points by referring to the application program that initiates action as the client and the application program that responds as the service. At the same time, data may flow either from client to service, from service to client, or both, so we will need to refer to the sending and receiving sides for each message or stream. Transport protocols almost always include multiplexing, to tell the receiving side to which application it should deliver the message or direct the stream. Because the mechanics of application multiplexing can be more intricate than in lower layers, we first describe a transport protocol interface that omits multiplexing, and then add multiplexing to the interface.

In contrast with the network layer, where an important feature is a uniform application programming interface, the interface to an end-to-end transport protocol varies with the particular end-to-end semantics that the protocol provides. Thus a simple message-sending protocol that is intended to be used by only one application might have a first-version interface such as:

v.1: SEND_MESSAGE (destination, message)

in which, in addition to supplying the content of the message, the sender specifies in destination the network attachment point to which the message should be delivered. The sender of a message needs to know both the message format that the recipient expects and the destination address. Chapter 3 described several methods of discovering destination addresses, any of which might be used.

The prospective receiver must provide an interface by which the transport protocol delivers the message to the application. Just as in the link and network layers, receiving a message can’t happen until the message arrives, so receiving involves waiting and the corresponding receive-side interface depends on the system mechanisms that are available for waiting and for thread or event coordination. For illustration, we again use an upcall: when a message arrives, the message transport protocol delivers it by calling an application-provided procedure entry point:

v.1: DELIVER_MESSAGE (message)

This first version of an upcall interface omits not only multiplexing but another important requirement: When sending a message, the sender usually expects a reply. While a programmer may be able to ask someone down the hall the appropriate destination address to use for some service, it is usually the case that a service has many clients. Thus


the service needs to know where each message came from so that it can send a reply. A message transport protocol usually provides this information, for example by including a second argument in the upcall interface:

v.2: DELIVER_MESSAGE (source, message)

In this second (but not quite final) version of the upcall, the transport protocol sets the value of source to the address from which this message originated. The transport protocol obtains the value of source as an argument of an upcall from the network layer.

Since the reason for designing a message transport protocol is that it is expected to be useful to several applications, the interface needs additional information to allow the protocol to know which messages belong to which application. End-to-end layer multiplexing is generally a bit more complicated than that of lower layers because not only can there be multiple applications, there can be multiple instances of the same application using the same transport protocol. Rather than assigning a single multiplexing identifier to an application, each instance of an application receives a distinct multiplexing identifier, usually known as a port. In a client/service situation, most application services advertise one of these identifiers, called that application’s well-known port. Thus the second (and again not final) version of the send interface is

v.2: SEND_MESSAGE (destination, service_port, message)

where service_port identifies the well-known port of the application service to which the sender wants to have the message delivered.

At the receiving side each application that expects to receive messages needs to tell the message transport protocol what port it expects clients to use, and it must also tell the protocol what program to call to deliver messages. The application can provide both pieces of information by invoking the transport protocol procedure

LISTEN_FOR_MESSAGES (service_port, message_handler)

which alerts the transport protocol implementation that whenever a message arrives at this destination carrying the port identifier service_port, the protocol should deliver it by calling the procedure named in the second argument (that is, the procedure message_handler). LISTEN_FOR_MESSAGES enters its two arguments in a transport layer table for future reference. Later, when the transport protocol receives a message and is ready to deliver it, it invokes a dispatcher similar to that of Figure 7.27, on page 7–43. The dispatcher looks in the table for the service port that came with the message, identifies the associated message_handler procedure, and calls it, giving as arguments the source and the message.

One might expect that the service might send replies back to the client using the same application port number, but since one service might have several clients at the same network attachment point, each client instance will typically choose a distinct port number for its own replies, and the service needs to know to which port to send the reply. So the



SEND_MESSAGE interface must be extended one final time to allow the sender to specify a port number to use for reply:

v.3: SEND_MESSAGE (destination, service_port, reply_port, message)

where reply_port is the identifier that the service can use to send a message back to this particular client. When the service does send its reply message, it may similarly specify a reply_port that is different from its well-known port if it expects that same client to send further, related messages. The reply_port arguments in the two directions thus allow a series of messages between a client and a service to be associated with one another.

Having added the port number to SEND_MESSAGE, we must communicate that port number to the recipient by adding an argument to the upcall by the message transport protocol when it delivers a message to the recipient:

v.3: DELIVER_MESSAGE (source, reply_port, message)

This third and final version of DELIVER_MESSAGE is the handler that the application designated when it called LISTEN_FOR_MESSAGES. The three arguments tell the handler (1) who sent the message (source), (2) the port on which that sender said it will listen for a possible reply (reply_port) and (3) the content of the message itself (message).

The interface set {LISTEN_FOR_MESSAGES, SEND_MESSAGE, DELIVER_MESSAGE} is specialized to end-to-end transport of discrete messages. Sidebar 7.5 illustrates two other, somewhat different, end-to-end transport protocol interfaces, one for a request/response protocol and the second for streams. Each different transport protocol can be thought of as a prepackaged set of improvements on the best-effort contract of the network layer. Here are three examples of transport protocols used widely in the Internet, and the assurances they provide:

1. User datagram protocol (UDP). This protocol adds ports for multiple applications and a checksum for data integrity to the network-layer packet. Although UDP is used directly for some simple request/reply applications such as asking for the time of day or looking up the network address of a service, its primary use is as a component of other message transport protocols, to provide end-to-end multiplexing and data integrity. [For details, see Internet standard STD0006 or Internet request for comments RFC–768.]

2. Transmission control protocol (TCP). Provides a stream of bytes with the assurances that data is delivered in the order it was originally sent, nothing is missing, nothing is duplicated, and the data has a modest (but not terribly high) probability of integrity. There is also provision for flow control, which means that the sender takes care not to overrun the ability of the receiver to accept data, and TCP cooperates with the network layer to avoid congestion.
This protocol is used for applications such as interactive typing that require a telephone-like connection in which the order of delivery of data is important. (It is also used in many bulk transfer applications that do not require delivery order, but that do want to take advantage of its data integrity, flow control, and congestion avoidance assurances.)


Sidebar 7.5: Other end-to-end transport protocol interfaces

Since there are many different combinations of services that an end-to-end transport protocol might provide, there are equally many transport protocol interfaces. Here are two more examples:

1. A request/response protocol sends a request message and waits for a response to that message before returning to the application. Since an interface that waits for a response ensures that there can be only one such call per thread outstanding, neither an explicit multiplexing parameter nor an upcall are necessary. A typical client interface to a request/response transport protocol is

response ← SEND_REQUEST (service_identifier, request)

where service_identifier is a name used by the transport protocol to locate the service destination and service port. It then sends a message, waits for a matching response, and delivers the result. The corresponding application programming interface at the service side of a request/response protocol may be equally simple or it can be quite complex, depending on the performance requirements.

2. A reliable message stream protocol sends several messages to the same destination with the intent that they be delivered reliably and in the order in which they were sent. There are many ways of defining a stream protocol interface. In the following example, an application client begins by creating a stream:

client_stream_id ← OPEN_STREAM (destination, service_port, reply_port)

followed by several invocations of:

WRITE_STREAM (client_stream_id, message)

and finally ends with:

CLOSE_STREAM (client_stream_id)


The service-side programming interface allows for several streams to be coming in to an application at the same time. The application starts by calling a LISTEN_FOR_STREAMS procedure to post a listener on the service port, just as with the message interface. When a client opens a new stream, the service’s network layer, upon receiving the open request, upcalls to the stream listener that the application posted:

OPEN_STREAM_REQUEST (source, reply_port)

and upon receiving such an upcall OPEN_STREAM_REQUEST assigns a stream identifier for use within the service and invokes a transport layer dispatcher with

ACCEPT_STREAM (service_stream_id, next_message_handler)

The arrival of each message on the stream then leads the dispatcher to perform an upcall to the program identified in the variable next_message_handler:

HANDLE_NEXT_MESSAGE (stream_id, message)

With this design, a message value of NULL might signal that the client has closed the stream.
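The client-side stream calls above (OPEN_STREAM, WRITE_STREAM, CLOSE_STREAM) can be rendered as a small Python sketch. This is purely illustrative: the class name, the queue standing in for a real network, and the use of None as the end-of-stream sentinel (mirroring the NULL convention mentioned above) are all assumptions, not part of any real transport implementation.

```python
# Hypothetical sketch of the client-side stream interface described above.
# A queue.Queue stands in for the transport machinery that would deliver
# messages reliably and in order.
import queue

class StreamClient:
    def __init__(self):
        self._next_id = 0
        self._streams = {}          # client_stream_id -> outgoing messages

    def open_stream(self, destination, service_port, reply_port):
        """OPEN_STREAM: create a stream and return its identifier."""
        self._next_id += 1
        self._streams[self._next_id] = queue.Queue()
        return self._next_id

    def write_stream(self, client_stream_id, message):
        """WRITE_STREAM: queue one message for in-order delivery."""
        self._streams[client_stream_id].put(message)

    def close_stream(self, client_stream_id):
        """CLOSE_STREAM: signal end of stream (here, a None sentinel)."""
        self._streams[client_stream_id].put(None)

# Usage: open a stream, write two messages, close it.
c = StreamClient()
sid = c.open_stream("service.example", 7, 5)
c.write_stream(sid, "hello")
c.write_stream(sid, "world")
c.close_stream(sid)
```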

Saltzer & Kaashoek Ch. 7, p. 66

June 25, 2009 8:22 am

7.5 The End-to-End Layer


[For details, see Internet standard STD0007 or Internet request for comments RFC–793.]

3. Real-time transport protocol (RTP). Built on UDP (but with checksums switched off), RTP provides a stream of time-stamped packets with no other integrity guarantee. This kind of protocol is useful for applications such as streaming video or voice, where order and stream timing are important, but an occasional lost packet is not a catastrophe, so out-of-order packets can be discarded, and packets with bits in error may still contain useful data. [For details, see Internet request for comments RFC–1889.]

There have, over the years, been several other transport protocols designed for use with the Internet, but they have not found enough application to be widely implemented. There are also several end-to-end protocols that provide services in addition to message transport, such as file transfer, file access, remote procedure call, and remote system management, and that are built using UDP or TCP as their underlying transport mechanism. These protocols are usually classified as presentation protocols because the primary additional service they provide is translating data formats between different computer platforms. This collection of protocols illustrates that the end-to-end layer is itself sometimes layered and sometimes not, depending on the requirements of the application.

Finally, end-to-end protocols can be multipoint, which means they involve more than two players. For example, to complete a purchase transaction, there may be a buyer, a seller, and one or more banks, each of which needs various end-to-end assurances about agreement, order of delivery, and data integrity.

In the next several sections, we explore techniques for providing various kinds of end-to-end assurances. Any of these techniques may be applied in the design of a message transport protocol, a presentation protocol, or by the application itself.

7.5.2 Assurance of At-Least-Once Delivery; the Role of Timers

A property of a best-effort network is that it may lose packets, so a goal of many end-to-end transport protocols is to eliminate the resulting uncertainty about delivery. A persistent sender is a protocol participant that tries to ensure that at least one copy of each data segment is delivered, by sending it repeatedly until it receives an acknowledgment. The usual implementation of a persistent sender is to add to the application data a header containing a nonce and to set a timer that the designer estimates will expire in a little more than one network round-trip time, which is the sum of the network transit time for the outbound segment, the time the receiver spends absorbing the segment and preparing an acknowledgment, and the network transit time for the acknowledgment. Having set the timer, the sender passes the segment to the network layer for delivery, taking care to keep a copy. The receiving side of the protocol strips off the end-to-end header, passes the application data along to the application, and in addition sends back an acknowledgment that contains the nonce. When the acknowledgment gets back to the sender, the



CHAPTER 7 The Network as a System and as a System Component

sender uses the nonce to identify which previously-sent segment is being acknowledged. It then turns off the corresponding timer and discards its copy of that segment. If the timer expires before the acknowledgment returns, the sender restarts the timer and resends the segment, repeating this sequence indefinitely, until it receives an acknowledgment. For its part, the receiver sends back an acknowledgment every time it receives a segment, thereby extending the persistence in the reverse direction, thus covering the possibility that the best-effort network has lost one or more acknowledgments.

A protocol that includes a persistent sender does its best to provide an assurance of at-least-once delivery, which has semantics similar to the at-least-once RPC introduced in Section 4.2.2. The nonce, timer, retry, and acknowledgment together act to ensure that the data segment will eventually get through. As long as there is a non-zero probability of a message getting through, this protocol will eventually succeed. On the other hand, the probability may actually be zero, either for an indefinite time (perhaps the network is partitioned or the destination is not currently listening) or permanently (perhaps the destination is on a ship that has sunk). Because of the possibility that there will not be an acknowledgment forthcoming soon, or perhaps ever, a practical sender is not infinitely persistent. The sender limits the number of retries, and if the number exceeds the limit, the sender returns error status to the application that asked to send the message. The application must interpret this error status with some understanding of network communications. The lack of an acknowledgment means that one of two significantly different events has occurred:

1. The data segment was not delivered.

2. The data segment was delivered, but the acknowledgment never returned.

The good news is that the application is now aware that there is a problem.
The bad news is that there is no way to determine which of the two problems occurred. This dilemma is intrinsic to communication systems, and the appropriate response depends on the particular application. Some applications will respond to this dilemma by making a note to later ask the other side whether or not it got the message; other applications may just ignore the problem. Chapter 10 [on-line] investigates this issue further.

In summary, just as with at-least-once RPC, the at-least-once delivery protocol does not provide the absolute assurance that its name implies; it instead provides the assurance that if it is possible to get through, the message will get through, and if it is not possible to confirm delivery, the application will know about it.

The at-least-once delivery protocol provides no assurance about duplicates; in fact, it tends to generate duplicates. Furthermore, the assurance of delivery is weaker than appears on the surface: the data may have been corrupted along the way, or it may have been delivered to the wrong destination (and acknowledged) by mistake. Assurances on any of those points require additional techniques. Finally, the at-least-once delivery protocol ensures only that the message was delivered, not that the application actually acted on it: the receiving system may have been so overloaded that it ignored the message, or it may have crashed an instant after acknowledging the message. When examining end-to-end assurances, it is important to identify the end points. In this case,



the receiving end point is the place in the protocol code that sends the acknowledgment of message receipt.

This protocol requires the sender to choose a value for the retry timer at the time it sends a packet. One possibility would be to choose in advance a timer value to be used for every packet: a fixed timer. But using a timer value fixed in advance is problematic because there is no good way to make that choice. To detect a lost packet by noticing that no acknowledgment has returned, the appropriate timer interval would be the expected network round-trip time plus some allowance for unusual queuing delays. But even the expected round-trip time between two given points can vary by quite a bit when routes change. In fact, one can argue that since the path to be followed and the amount of queuing to be tolerated are up to the network layer, and the individual transit times of links are properties of the link layer, for the end-to-end layer to choose a fixed value for the timer interval would violate the layering abstraction: it would require that the end-to-end layer know something about the internal implementation of the link and network layers.

Even if we are willing to ignore the abstraction concern, the end-to-end transport protocol designer has a dilemma in choosing a fixed timer interval. If the designer chooses too short an interval, there is a risk that the protocol will resend packets unnecessarily, which wastes network capacity as well as resources at both the sending and receiving ends. But if the designer sets the timer too long, then genuinely lost packets will take a long time to discover, so recovery will be delayed and overall performance will decline. Worse, setting a fixed value for a timer will not only force the designer to choose between these two evils, it will also embed in the system a lurking surprise that may emerge long in the future when someone else changes the system, for example to use a faster network connection.
Going over old code to understand the rationale for setting the timers and choosing new values for them is a dismal activity that one would prefer to avoid by better design. There are two common ways to minimize the use of fixed timers, both of which are applicable only when a transport protocol sends a stream of data segments to the same destination: adaptive timers and negative acknowledgments.

An adaptive timer is one whose setting dynamically adjusts to currently observed conditions. A common implementation scheme is to observe the round-trip times for each data segment and its corresponding response and calculate an exponentially weighted moving average of those measurements (Sidebar 7.6 explains the method). The protocol then sets its timers to, say, 150% of that estimate, with the intent that minor variations in queuing delay should rarely cause the timer to expire. Keeping an estimate of the round-trip time turns out to be useful for other purposes, too. An example appears in the discussion of flow control in Section 7.5.6, below.

A refinement for an adaptive timer is to assume that duplicate acknowledgments mean that the timer setting is too small, and immediately increase it. (A too-small timer setting would expire before the first acknowledgment returns, causing the sender to resend the original data segment, which would trigger the duplicate acknowledgment.) It is usually a good idea to make any increase a big one, for example by doubling


Sidebar 7.6: Exponentially weighted moving averages One way of keeping a running average, A, of a series of measurements, Mᵢ, is to calculate an exponentially weighted moving average, defined as

    A = (M₀ + M₁ × α + M₂ × α² + M₃ × α³ + …) × (1 – α)

where α < 1 and the subscript indicates the age of the measurement, the most recent being M₀. The multiplier (1 – α) at the end normalizes the result.

This scheme has two advantages over a simple average. First, it gives more weight to recent measurements. The multiplier, α, is known as the decay factor. A smaller value for the decay factor means that older measurements lose weight more rapidly as succeeding measurements are added into the average. The second advantage is that it can be easily calculated as new measurements become available, using the recurrence relation:

    A_new ← α × A_old + (1 – α) × M_new

where M_new is the latest measurement. In a high-performance environment where measurements arrive frequently and calculation time must be minimized, one can instead calculate

    A_new ⁄ (1 – α) ← α × (A_old ⁄ (1 – α)) + M_new

which requires only one multiplication and one addition. Furthermore, if (1 – α) is chosen to be a fractional power of two (e.g., 1/8), the multiplication can be done with one register shift and one addition. Calculated this way, the result is too large by the constant factor 1 ⁄ (1 – α), but it may be possible to take a constant factor into account at the time the average is used.

In both computer systems and networks there are many situations in which it is useful to know the average value of an endless series of observations. Exponentially weighted moving averages are probably the most frequently used method.
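The sidebar’s fixed-point trick can be sketched in a few lines. With (1 – α) = 1/8, the scaled value S = A ⁄ (1 – α) is maintained with one shift and two additions per measurement, and the true average recovered as S >> 3. The initialization from the first measurement is an illustrative choice, not from the text.

```python
# Sketch of the sidebar's shift-and-add update: S_new = alpha * S_old + M_new,
# with alpha = 7/8 computed as S - (S >> 3). The true average is S >> 3.

def ewma_scaled_update(s_old, m_new):
    """One EWMA step on the scaled value, using only a shift and adds."""
    return s_old - (s_old >> 3) + m_new

# Seed the scaled value from the first measurement (an assumed convention),
# then feed in steady round-trip measurements of 96 ms.
s = 96 * 8
for rtt in [96, 96, 96, 96]:
    s = ewma_scaled_update(s, rtt)

print(s >> 3)                       # -> 96: steady input leaves the average put
print(ewma_scaled_update(s, 160) >> 3)   # -> 104: (7/8)*96 + (1/8)*160
```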

the value previously used to set the timer. Repeatedly increasing a timer setting by multiplying its previous value by a constant on each retry (thus succeeding timer values might be, say, 1, 2, 4, 8, 16, … seconds) is known as exponential backoff, a technique that we will see again in other, quite different system applications. Doubling the value, rather than multiplying by, say, ten, is a good choice because it gets within a factor of two of the “right” value quickly without overshooting too much.

Adaptive techniques are not a panacea: the protocol must still select a timer value for the first data segment, and it can be a challenge to choose a value for the decay factor (in the sidebar, the constant α) that both keeps the estimate stable and also quickly responds to changes in network conditions. The advantage of an adaptive timer comes from being



able to amortize the cost of an uninformed choice on that first data segment over the ensuing several segments.

A different method for minimizing use of fixed timers is for the receiving side of a stream of data segments to infer from the arrival of later data segments the loss of earlier ones and request their retransmission by sending a negative acknowledgment, or NAK. A NAK is simply a message that lists missing items. Since data segments may be delivered out of order, the recipient needs some way of knowing which segment is missing. For example, the sender might assign sequential numbers as nonces, so arrival of segments #13 and #14 without having previously received segment #12 might cause the recipient to send a NAK requesting retransmission of segment #12. To distinguish transmission delays from lost segments, the recipient must decide how long to wait before sending a NAK, but that decision can be made by counting later-arriving segments rather than by measuring a time interval.

Since the recipient reports lost packets, the sender does not need to be persistent, so it does not need to use a timer at all, at least until it sends the last segment of a stream. Because the recipient can’t depend on later segment arrivals to discover that the last segment has been lost, that discovery still requires the help of a timer. With NAKs, the persistent-sender strategy with a timer is needed only once per stream, so the penalty for choosing a timer setting that is too long (or too short) is just one excessive delay (or one risk of an unnecessary duplicate transmission) on the last segment of the stream. Compared with using an adaptive timer on every segment of the stream, this is probably an improvement.
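The receiver-side NAK decision described above can be sketched as a small function: infer which sequence numbers are missing and report only those for which enough later segments have already arrived. The threshold of two later arrivals, and scanning from the smallest number seen (rather than tracking the next expected number, as a real receiver would), are illustrative assumptions.

```python
# Sketch of NAK gap detection: a missing segment is reported once at least
# `threshold` later-numbered segments have arrived, so the decision counts
# arrivals instead of measuring a time interval.

def naks_to_send(arrived, threshold=2):
    """Given the set of sequence numbers received so far, return the
    missing numbers that at least `threshold` later arrivals expose."""
    naks = []
    for seq in range(min(arrived), max(arrived) + 1):
        if seq not in arrived:
            later = sum(1 for a in arrived if a > seq)
            if later >= threshold:
                naks.append(seq)
    return naks

# Segments #13 and #14 arrived without #12: request #12 again.
print(naks_to_send({10, 11, 13, 14}))   # -> [12]
```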
Adaptive timers work better, but they add complexity and require careful thought to make them stable. Avoidance and minimization of timers are the better strategies, but it is usually impossible to eliminate them completely. Where timers must be used, they should be designed with care, and the designer should clearly document them as potential trouble spots.
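The two timer ideas from this section, an adaptive setting derived from a round-trip-time estimate and exponential backoff on retry, can be combined in one small sketch. The 150% margin, the decay factor of 7/8, and doubling on timeout follow the text; the class itself and its method names are illustrative assumptions.

```python
# Sketch of an adaptive retry timer with exponential backoff. The timer is
# set to 150% of an EWMA round-trip-time estimate; each expiry without an
# acknowledgment doubles the interval, and each acknowledgment folds the
# measured round trip into the estimate and resets the interval.

class RetryTimer:
    def __init__(self, rtt_estimate, margin=1.5, alpha=0.875):
        self.rtt = rtt_estimate
        self.margin = margin
        self.alpha = alpha
        self.interval = margin * rtt_estimate

    def on_ack(self, measured_rtt):
        # Acknowledgment arrived: update the EWMA estimate, reset the timer.
        self.rtt = self.alpha * self.rtt + (1 - self.alpha) * measured_rtt
        self.interval = self.margin * self.rtt

    def on_timeout(self):
        # No acknowledgment yet: exponential backoff by doubling.
        self.interval *= 2

t = RetryTimer(rtt_estimate=100)   # milliseconds
t.on_timeout()
t.on_timeout()
print(t.interval)                  # -> 600.0 after two doublings of 150.0
```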

7.5.3 Assurance of At-Most-Once Delivery: Duplicate Suppression

At-least-once delivery assurance was accomplished by remembering state at the sending side of the transport protocol: a copy of the data segment, its nonce, and a flag indicating that an acknowledgment is still needed. But a side effect of at-least-once delivery is that it tends to generate duplicates. To ensure at-most-once delivery, it is necessary to suppress these duplicates, as well as any other duplicates created elsewhere within the network, perhaps by a persistent sender in some link-layer protocol.

The mechanism of suppressing duplicates is a mirror image of the mechanism of at-least-once delivery: add state at the receiving side. We saw a preview of this mechanism in Section 7.1 of this chapter: the receiving side maintains a table of previously-seen nonces. Whenever a data segment arrives, the transport layer implementation checks the nonce of the incoming segment against the list of previously-seen nonces. If this nonce


is new, it adds the nonce to the list, delivers the data segment to the application, and sends an acknowledgment back to the sender. If the nonce is already in its list, it discards the data segment, but it resends the acknowledgment, in case the sender did not receive the previous one. If, in addition, the application has already sent a response to the original request, the transport protocol also resends that response.

The main problem with this technique is that the list of nonces maintained at the receiving side of the transport protocol may grow indefinitely, taking up space and, whenever a data segment arrives, taking time to search. Because they may have to be kept indefinitely, these nonces are described colorfully as tombstones. A challenge in designing a duplicate-suppression technique is to avoid accumulating an unlimited number of tombstones.

One possibility is for the sending side to use monotonically increasing sequence numbers for nonces, and include as an additional field in the end-to-end header of every data segment the highest sequence number for which it has received an acknowledgment. The receiving side can then discard that nonce and any others from that sender that are smaller, but it must continue to hold a nonce for the most recently-received data segment. This technique reduces the magnitude of the problem, but it leaves a dawning realization that it may never be possible to discard the last nonce, which threatens to become a genuine tombstone, one per sender. Two pragmatic responses to the tombstone problem are:

1. Move the problem somewhere else. For example, change the port number on which the protocol accepts new requests. The protocol should never reuse the old port number (the old port number becomes the tombstone), but if the port number space is large then it doesn’t matter.

2. Accept the possibility of making a mistake, but make its probability vanishingly small.
If the sending side of the transport protocol always gives up and stops resending requests after, say, five retries, then the receiving side can safely discard nonces that are older than five network round-trip times plus some allowance for unusually large delays. This approach requires keeping track of the age of each nonce in the table, and it has some chance of failing if a packet that the network delayed a long time finally shows up. A simple defense against this form of failure is to wait a long time before discarding a tombstone.

Another form of the same problem concerns what to do when the computer at the receiving side crashes and restarts, losing its volatile memory. If the receiving side stores the list of previously handled nonces in volatile memory, following a crash it will not be able to recognize duplicates of packets that it handled before the crash. But if it stores that list in a non-volatile storage device such as a hard disk, it will have to do one write to that storage device for every message received. Writes to non-volatile media tend to be slow, so this approach may introduce a significant performance loss. To solve the problem without giving up performance, techniques parallel to the last two above are typically employed. For example, one can use a new port number each time the system restarts.



This technique requires remembering which port number was last used, but that number can be stored on a disk without hurting performance because it changes only once per restart. Or, if we know that the sending side of the transport protocol always gives up after some number of retries, whenever the receiving side restarts, it can simply ignore all packets until that number of round-trip times has passed since restarting. Either procedure may force the sending side to report delivery failure to its application, but that may be better than taking the risk of accepting duplicate data.

When techniques for at-least-once delivery (the persistent sender) and at-most-once delivery (duplicate detection) are combined, they produce an assurance that is called exactly-once delivery. This assurance is the one that would probably be wanted in an implementation of the Remote Procedure Call protocol of Chapter 4. Despite its name, and even if the sender is prepared to be infinitely persistent, exactly-once delivery is not a guarantee that the message will eventually be delivered. Instead, it ensures that if the message is delivered, it will be delivered only once, and if delivery fails, the sender will learn, by lack of acknowledgment despite repeated requests, that delivery probably failed. However, even if no acknowledgment returns, there is still a possibility that the message was delivered. Section 9.6.2 [on-line] introduces a protocol known as two-phase commit that can reduce the uncertainty by adding a persistent sender of the acknowledgment. Unfortunately, there is no way to completely eliminate the uncertainty.
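The receiver-side duplicate suppression of Section 7.5.3 can be sketched as a small filter. It follows the scheme described there: sequence numbers serve as nonces, each arriving segment reports the highest sequence number its sender knows to have been acknowledged, and the receiver prunes everything at or below that mark except the newest nonce, which remains as the per-sender tombstone. The class shape and the pruning rule’s details are illustrative assumptions.

```python
# Sketch of duplicate suppression at the receiving side: keep a set of
# previously-seen nonces, resend the acknowledgment (not the delivery) for
# duplicates, and prune nonces the sender has confirmed as acknowledged.

class DuplicateFilter:
    def __init__(self):
        self.seen = set()        # nonces not yet safe to discard

    def accept(self, nonce, highest_acked):
        """Return True if the segment is new (deliver it and acknowledge);
        False if it is a duplicate (resend the acknowledgment only)."""
        is_new = nonce not in self.seen
        if is_new:
            self.seen.add(nonce)
        # Discard everything at or below the sender's high-water mark,
        # but always keep the newest nonce as the tombstone.
        newest = max(self.seen)
        self.seen = {n for n in self.seen if n > highest_acked or n == newest}
        return is_new

f = DuplicateFilter()
print(f.accept(1, 0))   # -> True: new segment, deliver and acknowledge
print(f.accept(1, 0))   # -> False: duplicate, resend acknowledgment only
print(f.accept(2, 1))   # -> True, and nonce 1 is pruned from the table
```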

7.5.4 Division into Segments and Reassembly of Long Messages

Recall that the requirements of the application determine the length of a message, but the network sets a maximum transmission unit, arising from limits on the length of a frame at the link layer. One of the jobs of the end-to-end transport protocol is to bridge this difference. Division of messages that are too long to fit in a single packet is relatively straightforward: each resulting data segment must contain, in its end-to-end header, an identifier to show to which message this segment belongs and a segment number indicating where in the message the segment fits (e.g., “message 914, segment 3 of 7”). The message identifier and segment number together can also serve as the nonce used to ensure at-least-once and at-most-once delivery.

Reassembly is slightly more complicated because segments of the same message may arrive at the receiving side in any order, and may be mingled with segments from other messages. The reassembly process typically consists of allocating a buffer large enough to hold the entire message, placing the segments in the proper position within that buffer as they arrive, and keeping a checklist of which segments have not yet arrived. Once the message has been completely reassembled, the receiving side of the transport protocol can deliver the message to the application and discard the checklist. Message division and reassembly is a special case of stream division and reassembly, the topic of Section 7.5.7, below.
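The reassembly process just described can be sketched directly: a buffer with one slot per segment, a checklist of missing segment numbers, and delivery only when the checklist empties. The class name and single-message scope are illustrative; a real transport would keep one such structure per in-flight message identifier.

```python
# Sketch of segment reassembly: place out-of-order segments in position,
# track the missing ones in a checklist, deliver when complete.

class Reassembler:
    def __init__(self, message_id, n_segments):
        self.message_id = message_id
        self.slots = [None] * n_segments        # buffer, one slot per segment
        self.missing = set(range(n_segments))   # the checklist

    def add(self, segment_number, data):
        """Accept one segment; return the whole message once complete,
        or None while segments are still outstanding. Re-arrivals of a
        segment already placed are ignored."""
        if segment_number in self.missing:
            self.slots[segment_number] = data
            self.missing.discard(segment_number)
        if not self.missing:
            return b"".join(self.slots)
        return None

# "Message 914" arrives as three segments, out of order:
r = Reassembler(914, 3)
print(r.add(2, b"C"))   # -> None, still waiting for segments 0 and 1
print(r.add(0, b"A"))   # -> None, still waiting for segment 1
print(r.add(1, b"B"))   # -> b'ABC', the last gap is filled
```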

7.5.5 Assurance of Data Integrity

Data integrity is the assurance that when a message is delivered, its contents are the same as when they left the sender. Adding data integrity to a protocol with a persistent sender


creates a reliable delivery protocol. Two additions are required, one at the sending side and one at the receiving side. The sending side of the protocol adds a field to the end-to-end header or trailer containing a checksum of the contents of the application message. The receiving side recalculates the checksum from the received version of the reassembled message and compares it with the checksum that came with the message. Only if the two checksums match does the transport protocol deliver the reassembled message to the application and send an acknowledgment. If the checksums do not match, the receiver discards the message and waits for the sending side to resend it. (One might suggest immediately sending a NAK, to alert the sending side to resend the data identified with that nonce, rather than waiting for timers to expire. This idea has the hazard that the source address that accompanies the data may have been corrupted along with the data. For this reason, sending a NAK on a checksum error isn’t usually done in end-to-end protocols. However, as was described in Section 7.3.3, requesting retransmission as soon as an error is detected is useful at the link layer, where the other end of a point-to-point link is the only possible source.)

It might seem redundant for the transport protocol to provide a checksum, given that link layer protocols often also provide checksums. The reason the transport protocol might do so is an end-to-end argument: the link layer checksums protect the data only while it is in transit on the link. During the time the data is in the memory of a forwarding node, while being divided into multiple segments, being reassembled at the receiving end, or while being copied to the destination application buffer, it is still vulnerable to undetected accidents. An end-to-end transport checksum can help defend against those threats.
On the other hand, reapplying the end-to-end argument suggests that an even better place for this checksum would be in the application program. But in the real world, many applications assume that a transport-protocol checksum covers enough of the threats to integrity that they don’t bother to apply their own checksum. Transport protocol checksums cater to this assumption.

As with all checksums, the assurance is not absolute. Its quality depends on the number of bits in the checksum, the structure of the checksum algorithm, and the nature of the likely errors. In addition, there remains a threat that someone has maliciously modified both the data and its checksum to match while en route; this threat is explored briefly in Section 7.5.9, below, and in more depth in Chapter 11 [on-line].

A related integrity concern is that a packet might be misdelivered, perhaps because its address field has been corrupted. Worse, the unintended recipient may even acknowledge receipt of the segment in the packet, leading the sender to believe that it was correctly delivered. The transport protocol can guard against this possibility by, on the sending side, including a copy of the destination address in the end-to-end segment header, and, on the receiving side, verifying that the address is the recipient’s own before delivering the packet to the application and sending an acknowledgment back.
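The two integrity checks of this section, recomputing the end-to-end checksum and verifying the destination address, can be sketched together. Using `zlib.crc32` and a dictionary as the segment header are illustrative assumptions; a real transport protocol would choose its own checksum algorithm and wire format.

```python
# Sketch of end-to-end integrity checking: the sender covers the destination
# address and the message with a checksum; the receiver recomputes it and
# also verifies that the address is its own before delivering.
import zlib

def make_segment(dest, message):
    """Sending side: attach destination address and checksum."""
    return {"dest": dest, "data": message,
            "checksum": zlib.crc32(dest + message)}

def deliver(segment, my_address):
    """Receiving side: return the message if both checks pass, else None
    (discard silently and let the persistent sender resend)."""
    if segment["dest"] != my_address:
        return None                                     # misdelivered
    if zlib.crc32(segment["dest"] + segment["data"]) != segment["checksum"]:
        return None                                     # corrupted in transit
    return segment["data"]

seg = make_segment(b"host-B", b"hello")
print(deliver(seg, b"host-B"))    # -> b'hello', both checks pass
seg["data"] = b"hellp"            # simulate a bit error in transit
print(deliver(seg, b"host-B"))    # -> None, checksum mismatch
```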



7.5.6 End-to-End Performance: Overlapping and Flow Control

End-to-end transport of a multisegment message raises some questions of strategy for the transport protocol, including an interesting trade-off between complexity and performance. The simplest method of sending a multisegment message is to send one segment, wait for the receiving side to acknowledge that segment, then send the second segment, and so on. This protocol, known as lock-step, is illustrated in Figure 7.39. An important virtue of the lock-step protocol is that it is easy to see how to apply each of the previous end-to-end assurance techniques to one segment at a time. The downside is that transmitting a message that occupies N segments will take N network round-trip times. If the network transit time is large, both ends may spend most of their time waiting.

Overlapping Transmissions

To avoid the wait times, we can employ a pipelining technique related to the pipelining described in Section 6.1.5: as soon as the first segment has been sent, immediately send the second one, then the third one, and so on, without waiting for acknowledgments. This technique allows both close spacing of transmissions and overlapping of transmissions with their corresponding acknowledgments. If nothing goes wrong, the technique leads to a timing diagram such as that of Figure 7.40. When the pipeline is completely filled, there may be several segments “in the net” traveling in both directions down transmission lines or sitting in the buffers of intermediate packet forwarders.


[Figure 7.39 is a timing diagram: the sender sends the first segment, waits for acknowledgment 1 from the receiver, then sends the second segment, and so on, repeating N times until segment N is accepted and acknowledged.]

FIGURE 7.39 Lock-step transmission of multiple segments.


This diagram shows a small time interval between the sending of segment 1 and the sending of segment 2. This interval accounts for the time to generate and transmit the next segment. It also shows a small time interval at the receiving side that accounts for the time required for the recipient to accept the segment and prepare the acknowledgment. Depending on the details of the protocol, it may also include the time the receiver spends acting on the segment (see Sidebar 7.7). With this approach, the total time to send N segments has dropped to N packet transmission times plus one round-trip time for the last segment and its acknowledgment, if nothing goes wrong.

Unfortunately, several things can go wrong, and taking care of them can add quite a bit of complexity to the picture. First, one or more packets or acknowledgments may be lost along the way. The first step in coping with this problem is for the sender to maintain a list of segments sent. As each acknowledgment comes back, the sender checks that segment off its list. Then, after sending the last segment, the sender sets a timer to expire a little more than one network round-trip time in the future. If, upon receiving an acknowledgment, the list of missing acknowledgments becomes empty, the sender can turn off the timer, confident that the entire message has been delivered. If, on the other hand, the timer expires and there is still a list of unacknowledged segments, the sender resends each one in the list, starts another timer, and continues checking off acknowledgments. The sender repeats this sequence until either every segment is acknowledged or the sender exceeds its retry limit, in which case it reports a failure to the application that initiated this message. Each timer expiration at the sending side adds one more round-trip time of delay in completing the transmission, but if packets get through at all, the process should eventually converge.
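The sender-side bookkeeping just described can be sketched as a loop: send every outstanding segment without waiting between sends, collect acknowledgments for one timer interval, check them off, and resend the stragglers. The retry limit of 5, the callback-based shape, and the toy lossy network in the usage below are illustrative assumptions.

```python
# Sketch of overlapped transmission with a list of unacknowledged segments:
# resend only what is still missing after each timer interval.

def send_overlapped(segments, network_send, wait_for_acks, retry_limit=5):
    """Returns True once every segment is acknowledged, or False after
    retry_limit rounds (report failure to the application)."""
    outstanding = set(range(len(segments)))
    for _ in range(retry_limit):
        for i in sorted(outstanding):
            network_send(i, segments[i])     # no waiting between sends
        outstanding -= wait_for_acks()       # check off what came back
        if not outstanding:
            return True
    return False

# Toy network that loses the first copy of segment 1:
sent = []
acked = set()

def network_send(i, segment):
    sent.append(i)
    if not (i == 1 and sent.count(1) == 1):  # drop segment 1's first copy
        acked.add(i)

def wait_for_acks():
    got = set(acked)
    acked.clear()
    return got

ok = send_overlapped([b"a", b"b", b"c"], network_send, wait_for_acks)
print(ok, sent)   # -> True [0, 1, 2, 1]  (only segment 1 was resent)
```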

[Figure 7.40 is a timing diagram: the sender sends segments 1, 2, 3, … in quick succession without waiting; acknowledgments 1 through N return while later segments are still in flight, and the sender is done when it receives ACK N.]

FIGURE 7.40 Overlapped transmission of multiple segments.



Sidebar 7.7: What does an acknowledgment really mean? An end-to-end acknowledgment is a widely used technique for the receiving side to tell the sending side something of importance, but since there are usually several different things going on in the end-to-end layer, there can also be several different purposes for acknowledgments. Some possibilities include • • • • •

it is OK to stop the timer associated with the acknowledged data segment it is OK to release the buffer holding a copy of the acknowledged segment it is OK to send another segment the acknowledged segment has been accepted for consideration the work requested in the acknowledged segment has been completed.

In some protocols, a single acknowledgment serves several of those purposes, while in other protocols a different form of acknowledgment may be used for each one; there are endless combinations. As a result, whenever the word acknowledgment is used in the discussion of a protocol, it is a good idea to establish exactly what the acknowledgment really means. This understanding is especially important if one is trying to estimate round-trip times by measuring the time for an acknowledgment to return; in some protocols such a measurement would include time spent doing processing in the receiving application, while in other cases it would not. If there really are five different kinds of acknowledgments, there is a concern that for every outgoing packet there might be five different packets returning with acknowledgments. In practice this is rarely the case because acknowledgments can be implemented as data items in the end-to-end header of any packet that happens to be going in the reverse direction. A single packet may thus carry any number of different kinds of acknowledgments and acknowledgments for a range of received packets, in addition to application data that may be flowing in the reverse direction. The technique of placing one or more acknowledgments in the header of the next packet that happens to be going in the reverse direction is known as piggybacking.

Bottlenecks, Flow Control, and Fixed Windows

A second set of issues has to do with the relative speeds of the sender in generating segments, the entry point to the network in accepting them, any bottleneck inside the network in transmitting them, and the receiver in consuming them. The timing diagram and analysis above assumed that the bottleneck was at the sending side, either in the rate at which the sender generates segments or the rate at which the first network link can transmit them.
A more interesting case is when the sender generates data, and the network transmits it, faster than the receiver can accept it, perhaps because the receiver has a slow processor and eventually runs out of buffer space to hold not-yet-processed data. When this is a possibility, the transport protocol needs to include some method of controlling the rate at which the sender generates data. This mechanism is called flow control. The basic concept involved is that the sender starts by asking the receiver how much data the receiver can handle. The response from the receiver, which may be measured in bits, bytes, or segments, is known as a window. The sender asks permission to send, and the receiver responds by quoting a window size, as illustrated in Figure 7.41. The sender then sends that much data and waits until it receives permission to send more. Any intermediate acknowledgments from the receiver allow the sender to stop the associated timer and release the send buffer, but they cannot be used as permission to send more data because the receiver is only acknowledging data arrival, not data consumption. Once the receiver has actually consumed the data in its buffers, it sends permission for another window’s worth of data. One complication is that the implementation must guard against both missing permission messages that could leave the sender with a zero-sized window and also duplicated permission messages that could increase the window size more than the receiver intends: messages carrying window-granting permission require exactly-once delivery.

The window provided by the scheme of Figure 7.41 is called a fixed window. The lock-step protocol described earlier is a flow control scheme with a window that is one data segment in size. With any window scheme, one network round-trip time elapses between the receiver’s sending of a window-opening message and the arrival of the first data that takes advantage of the new window. Unless we are careful, this time will be pure delay experienced by both parties. A clever receiver could anticipate this delay, and send the window-opening message one round-trip time before it expects to be ready for more data. This form of prediction is still using a fixed window, but it keeps data flowing more smoothly. Unfortunately, it requires knowing the network round-trip time which, as the discussion of timers explained, is a hard thing to estimate. Exercises 7.13, on page 7–114, and 7.16, on page 7–115, explore the bang-bang protocol and pacing, two more variants on the fixed window idea.

[FIGURE 7.41 Flow control with a fixed window. Timing diagram: the sender asks permission to send, the receiver opens a 4-segment window, the sender sends segments 1 through 4 and collects their acknowledgments, and once the receiver has finished processing segments 1 through 4 it reopens the window, allowing segments 5 and 6 to flow.]

Sliding Windows and Self-Pacing

An even more clever scheme is the following: as soon as it has freed up a segment buffer, the receiver could immediately send permission for a window that is one segment larger (either by sending a separate message or, if there happens to be an ACK ready to go, piggybacking on that ACK). The sender keeps track of how much window space is left, and increases that number whenever additional permission arrives. When a window can have space added to it on the fly it is called a sliding window. The advantage of a sliding window is that it can automatically keep the pipeline filled, without need to guess when it is safe to send permission-granting messages. The sliding window appears to eliminate the need to know the network round-trip time, but this appearance is an illusion. The real challenge in flow control design is to develop a single flow control algorithm that works well under all conditions, whether the bottleneck is the sender’s rate of generating data, the network transmission capacity, or the rate at which the receiver can accept data. When the receiver is the bottleneck, the goal is to ensure that the receiver never waits. Similarly, when the sender is the bottleneck, the goal is to ensure that the sender never waits. When the network is the bottleneck, the goal is to keep the network moving data at its maximum rate. The question is what window size will achieve these goals.
The answer, no matter where the bottleneck is located, is determined by the bottleneck data rate and the round-trip time of the network. If we multiply these two quantities, the product tells us the amount of buffering, and thus the minimum window size, needed to ensure a continuous flow of data. That is,

	window size ≥ round-trip time × bottleneck data rate
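Applied numerically, the inequality is a one-line computation. The sketch below (function and variable names are ours) uses the Boston to San Francisco figures that the text works through shortly:

```python
def min_window_bytes(round_trip_ms, bottleneck_bytes_per_ms):
    """Minimum window for continuous flow:
    window size >= round-trip time x bottleneck data rate."""
    return round_trip_ms * bottleneck_bytes_per_ms

# Boston to San Francisco: 70 ms round trip; the receiver, which can
# absorb 500 kilobytes/second (500 bytes/ms), is the bottleneck.
coast_to_coast = min_window_bytes(70, 500)    # 35000 bytes (35 kilobytes)

# Same building: 1 ms round trip, same receiver.
in_building = min_window_bytes(1, 500)        # 500 bytes

# With 512-byte segments, roughly 68 full segments can be en route at
# once (which the text rounds to "as many as 70").
segments_in_flight = coast_to_coast // 512
```

Any smaller window forces the sender to idle while waiting for acknowledgments; any larger window adds queueing without raising the realized data rate.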

To see why, imagine for a moment that we are operating with a sliding window one segment in size. As we saw before, this window size creates a lock-step protocol with one segment delivered each round-trip time, so the realized data rate will be the window size divided by the round-trip time. Now imagine operating with a window of two segments. The network will then deliver two segments each round-trip time. The realized data rate is still the window size divided by the round-trip time, but the window size is twice as large. Now, continue to try larger window sizes until the realized data rate just equals the bottleneck data rate. At that point the window size divided by the round-trip time still tells us the realized data rate, so we have equality in the formula above. Any window size less than this will produce a realized data rate less than the bottleneck. The window size can be larger than this minimum, but since the realized data rate cannot exceed the bottleneck, there is no advantage. There is actually a disadvantage to a larger window size: if something goes wrong that requires draining the pipeline, it will take longer to do so. Further, a larger window puts a larger load on the network, and thereby contributes to congestion and discarded packets in the network routers.

The most interesting feature of a sliding window whose size satisfies the inequality is that, although the sender does not know the bottleneck data rate, it is sending at exactly that rate. Once the sender fills a sliding window, it cannot send the next data element until the acknowledgment of the oldest data element in the window returns. At the same time, the receiver cannot generate acknowledgments any faster than the network can deliver data elements. Because of these two considerations, the rate at which the window slides adjusts itself automatically to be equal to the bottleneck data rate, a property known as self-pacing. Self-pacing provides the needed mechanism to adjust the sender’s data rate to exactly equal the data rate that the connection can sustain.

Let us consider what the window-size formula means in practice. Suppose a client computer in Boston that can absorb data at 500 kilobytes per second wants to download a file from a service in San Francisco that can send at a rate of 1 megabyte per second, and the network is not a bottleneck. The round-trip time for the Internet over this distance is about 70 milliseconds,* so the minimum window size would be

	70 milliseconds × 500 kilobytes/second = 35 kilobytes

and if each segment carries 512 bytes, there could be as many as 70 such segments enroute at once. If, instead, the two computers were in the same building, with a 1 millisecond round-trip time separating them, the minimum window size would be 500 bytes. Over this short distance a lock-step protocol would work equally well.

* Measurements of round-trip time from Boston to San Francisco over the Internet in 2005 typically show a minimum of about 70 milliseconds. A typical route might take a packet via New York, Cleveland, Indianapolis, Kansas City, Denver, and Sacramento, a distance of 11,400 kilometers, and through 15 packet forwarders in each direction. The propagation delay over that distance, assuming a velocity of propagation in optical fiber of 66% of the speed of light, would be about 57 milliseconds. Thus the 30 packet forwarders apparently introduce about another 13 milliseconds of processing and transmission delay, roughly 430 microseconds per forwarder.

So, despite the effort to choose the appropriate window size, we still need an estimate of the round-trip time of the network, with all the hazards of making an accurate estimate. The protocol may be able to use the same round-trip time estimate that it used in setting its timers, but there is a catch. To keep from unnecessarily retransmitting packets that are just delayed in transit, an estimate that is used in timer setting should err by being too large. But if a too-large round-trip time estimate is used in window setting, the resulting excessive window size will simply increase the length of packet forwarding queues within the network; those longer queues will increase the transit time, in turn leading the sender to think it needs a still larger window. To avoid this positive feedback, a round-trip time estimator that is to be used for window size adjustment needs to err on the side of being too small, and be designed not to react too quickly to an apparent increase in round-trip time—exactly the opposite of the desiderata for an estimate used for setting timers.

Once the window size has been established, there is still a question of how big to make the buffer at the receiving side of the transport protocol. The simplest way to ensure that there is always space available for arriving data is to allocate a buffer that is at least as large as the window size.

Recovery of Lost Data Segments with Windows

While the sliding window may have addressed the performance problem, it has complicated the problem of recovering lost data segments. The sender can still maintain a checklist of expected acknowledgments, but the question is when to take action on this list. One strategy is to associate with each data segment in the list a timestamp indicating when that segment was sent. When the clock indicates that more than one round-trip time has passed, it is time for a resend. Or, assuming that the sender is numbering the segments for reassembly, the receiver might send a NAK when it notices that several segments with higher numbers have arrived. Either approach raises a question of how resent segments should count against the available window. There are two cases: either the original segment never made it to the receiver, or the receiver got it but the acknowledgment was lost. In the first case, the sender has already counted the lost segment, so there is no reason to count its replacement again. In the second case, presumably the receiver will immediately discard the duplicate segment. Since it will not occupy the recipient’s attention or buffers for long, there is no need to include it in the window accounting. So in both cases the answer is the same: do not count a resent segment against the available window. (This conclusion is fortunate because the sender can’t tell the difference between the two cases.)

We should also consider what might go wrong if a window-increase permission message is lost.
The receiver will eventually notice that no data is forthcoming, and may suspect the loss. But simply resending permission to send more data carries the risk that the original permission message has simply been delayed and may still be delivered, in which case the sender may conclude that it can send twice as much data as the receiver intended. For this reason, sending a window-increasing message as an incremental value is fragile. Even resending the current permitted window size can lead to confusion if window-opening messages happen to be delivered out of order. A more robust approach is for the receiver to always send the cumulative total of all permissions granted since transmission of this message or stream began. (A cumulative total may grow large, but a field size of 64 bits can handle window sizes of about 10^19 transmission units, which probably is sufficient for most applications.) This approach makes it easy to discover and ignore an out-of-order total because a cumulative total should never decrease. Sending a cumulative total also simplifies the sender’s algorithm, which now merely maintains the cumulative total of all permissions it has used since the transmission began. The difference between the total used so far and the largest received total of permissions granted is a self-correcting, robust measure of the current window size. This model is familiar. A sliding window is an example of the producer–consumer problem described in Chapter 5, and the cumulative total window sizes granted and used are examples of eventcounts.

Sending of a message that contains the cumulative permission count can be repeated any number of times without affecting the correctness of the result. Thus a persistent sender (in this case the receiver of the data is the persistent sender of the permission message) is sufficient to ensure exactly-once delivery of a permission increase. With this design, the sender’s permission receiver is an example of an idempotent service interface, as suggested in the last paragraph of Section 7.1.4.

There is yet one more rate-matching problem: the blizzard of packets arising from a newly-opened flow control window may encounter or even aggravate congestion somewhere within the network, resulting in packets being dropped. Avoiding this situation requires some cooperation between the end-to-end protocol and the network forwarders, so we defer its discussion to Section 7.6 of this chapter.
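The cumulative-permission bookkeeping described above can be sketched as follows (a hypothetical illustration; the class and method names are ours). The point is that duplicate or reordered grant messages are harmless, because only the largest cumulative total ever seen matters:

```python
class SenderWindow:
    """Sender-side flow control using cumulative permission totals."""
    def __init__(self):
        self.granted = 0    # largest cumulative grant received so far
        self.used = 0       # cumulative permissions consumed so far

    def receive_grant(self, cumulative_total):
        # A cumulative total should never decrease, so a smaller or
        # equal value is a duplicate or out-of-order message: ignore it.
        self.granted = max(self.granted, cumulative_total)

    def window(self):
        # Self-correcting measure of the current window size.
        return self.granted - self.used

    def try_send(self):
        if self.window() <= 0:
            return False     # must wait for more permission
        self.used += 1
        return True

w = SenderWindow()
w.receive_grant(4)                           # receiver grants 4 units
sent = sum(w.try_send() for _ in range(6))   # only 4 of 6 sends allowed
w.receive_grant(4)                           # duplicate grant: no effect
w.receive_grant(8)                           # next grant: window reopens to 4
```

Because `receive_grant` is idempotent, the receiver can resend its permission message persistently without risk, which is exactly the property the text attributes to the eventcount design.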

7.5.7 Assurance of Stream Order, and Closing of Connections

A stream transport protocol transports a related series of elements, which may be bits, bytes, segments, or messages, from one point to another with the assurance that they will be delivered to the recipient in the order in which the sender dispatched them. A stream protocol usually—but not always—provides additional assurances, such as no missing elements, no duplicate elements, and data integrity. Because a telephone circuit has some of these same properties, a stream protocol is sometimes said to create a virtual circuit.

The simple-minded way to deliver things in order is to use the lock-step transmission protocol described in Section 7.5.3, in which the sending side does not send the next element until the receiving side acknowledges that the previous one has arrived safely. But applications often choose stream protocols to send large quantities of data, and the round-trip delays associated with a lock-step transmission protocol are enough of a problem that stream protocols nearly always employ some form of overlapped transmission. When overlapped transmission is added, the several elements that are simultaneously enroute can arrive at the receiving side out of order. Two quite different events can lead to elements arriving out of order: different packets may follow different paths that have different transit times, or a packet may be discarded if it traverses a congested part of the network or is damaged by noise. A discarded packet will have to be retransmitted, so its replacement will almost certainly arrive much later than its adjacent companions.

The transport protocol can ensure that the data elements are delivered in the proper order by adding to the transport-layer header a serial number that indicates the position in the stream where the element or elements in the current data segment belong.
At the receiving side, the protocol delivers elements to the application and sends acknowledgments back to the sender as long as they arrive in order. When elements arrive out of order, the protocol can follow one of two strategies:

1. Acknowledge only when the element that arrives is the next element expected or a duplicate of a previously received element. Discard any others. This strategy is simple, but it forces a capacity-wasting retransmission of elements that arrive before their predecessors.

2. Acknowledge every element as it arrives, and hold in buffers any elements that arrive before their predecessors. When the predecessors finally arrive, the protocol can then deliver the elements to the application in order and release the buffers. This technique is more efficient in its use of network resources, but it requires some care to avoid using up a large number of buffers while waiting for an earlier element that was in a packet that was discarded or damaged.

The two strategies can be combined by acknowledging an early-arriving element only if there is a buffer available to hold it, and discarding any others. This approach raises the question of how much buffer space to allocate. One simple answer is to provide at least enough buffer space to hold all of the elements that would be expected to arrive during the time it takes to sort out an out-of-order condition. This question is closely related to the one explored earlier of how many buffers to provide to go with a given size of sliding window. A requirement of delivery in order is one of the reasons why it is useful to make a clear distinction between acknowledging receipt of data and opening a window that allows the sending of more data.

It may be possible to speed up the resending of lost packets by taking advantage of the additional information implied by arrival of numbered stream elements. If stream elements have been arriving quite regularly, but one element of the stream is missing, rather than waiting for the sender to time out and resend, the receiver can send an explicit negative acknowledgment (NAK) for the missing element. If the usual reason for an element to appear to be missing is that it has been lost, sending NAKs can produce a useful performance enhancement.
On the other hand, if the usual reason is that the missing element has merely suffered a bit of extra delay along the way, then sending NAKs may lead to unnecessary retransmissions, which waste network capacity and can degrade performance. The decision whether or not to use this technique depends on the specific current conditions of the network. One might try to devise an algorithm that figures out what is going on (e.g., if NAKs are causing duplicates, stop sending NAKs) but it may not be worth the added complexity.

As the interface described in Section 7.5.1 above suggests, using a stream transport protocol involves a call to open the stream, a series of calls to write to or read from the stream, and a call to close the stream. Opening a stream involves creating a record at each end of the connection. This record keeps track of which elements have been sent, which have been received, and which have been acknowledged. Closing a stream involves two additional considerations. First and simplest, after the receiving side of the transport protocol delivers the last element of the stream to the receiving application, it then needs to report an end-of-stream indication to that application. Second, both ends of the connection need to agree that the network has delivered the last element and the stream should be closed. This agreement requires some care to reach.

A simple protocol that ensures agreement is the following: Suppose that Alice has opened a stream to Bob, and has now decided that the stream is no longer needed. She begins persistently sending a close request to Bob, specifying the stream identifier. Bob, upon receiving a close request, checks to see if he agrees that the stream is no longer needed. If he does agree, he begins persistently sending a close acknowledgment, again specifying the stream identifier. Alice, upon receiving the close acknowledgment, can turn off her persistent sender and discard her record of the stream, confident that Bob has received all elements of the stream and will not be making any requests for retransmissions. In addition, she sends Bob a single “all done” message, containing the stream identifier. If she receives a duplicate of the close acknowledgment, her record of the stream will already be discarded, but it doesn’t matter; she can assume that this is a duplicate close acknowledgment from some previously closed stream and, from the information in the close acknowledgment, she can fabricate an “all done” message and send it to Bob. When Bob receives the “all done” message he can turn off his persistent sender and, confident that Alice agrees that there is no further use for the stream, discard his copy of the record of the stream. Alice and Bob can in the future safely discard any late duplicates that mention a stream for which they have no record. (The tombstone problem still exists for the stream itself. It would be a good idea for Bob to delay deletion of his record until there is no chance that a long-delayed duplicate of Alice’s original request to open the stream will arrive.)
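The Alice-and-Bob close handshake can be sketched as follows (an illustrative model with message delivery and persistent resending abstracted away; all names are ours). Note why a late duplicate close acknowledgment is harmless: Alice can fabricate the “all done” reply from the stream identifier alone.

```python
class Alice:
    def __init__(self):
        self.streams = set()        # records of streams Alice opened

    def start_close(self, stream_id):
        return ("close_request", stream_id)    # sent persistently

    def on_close_ack(self, stream_id):
        # Turn off the persistent sender and discard the record. If the
        # record is already gone (duplicate ack), fabricate the "all
        # done" reply from the stream identifier in the ack anyway.
        self.streams.discard(stream_id)
        return ("all_done", stream_id)

class Bob:
    def __init__(self):
        self.streams = set()        # records of streams opened to Bob

    def on_close_request(self, stream_id):
        return ("close_ack", stream_id)        # sent persistently

    def on_all_done(self, stream_id):
        self.streams.discard(stream_id)        # turn off persistent sender

alice, bob = Alice(), Bob()
alice.streams.add(7); bob.streams.add(7)       # stream 7 is open

req = alice.start_close(7)
ack = bob.on_close_request(req[1])
done = alice.on_close_ack(ack[1])              # Alice discards her record
bob.on_all_done(done[1])                       # Bob discards his record
late = alice.on_close_ack(7)                   # duplicate ack still answered
```

Both records end up discarded, and any further late duplicates mention a stream for which neither party has a record, so they can safely be ignored.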

7.5.8 Assurance of Jitter Control

Some applications, such as delivering sound or video to a person listening or watching on the spot, are known as real-time. For real-time applications, reliability, in the sense of never delivering an incorrect bit of data, is often less important than timely delivery. High reliability can actually be counter-productive if the transport protocol achieves it by requesting retransmission of a damaged data element, and then holds up delivery of the remainder of the stream until the corrected data arrives. What the application wants is continuous delivery of data, even if the data is not completely perfect. For example, if a few bits are wrong in one frame of a movie (note that this video use of the term “frame” has a meaning similar but not identical to the “frame” used in data communications), it probably won’t be noticed. In fact, if one video frame is completely lost in transit, the application program can probably get away with repeating the previous video frame while waiting for the following one to be delivered. The most important assurance that an end-to-end stream protocol can provide to a real-time application is that delivery of successive data elements be on a regular schedule. For example, a standard North American television set consumes one video frame every 33.37 milliseconds and the next video frame must be presented on that schedule.

Transmission across a forwarding network can produce varying transit times from one data segment to the next. In real-time applications, this variability in delivery time is known as jitter, and the requirement is to control the amount of jitter. The basic strategy is for the receiving side of the transport protocol to delay all arriving segments to make it look as though they had encountered the worst allowable amount of delay. One can in principle estimate an appropriate amount of extra buffering for the delayed segments as follows (assume for the television example that there is one video frame in each segment):

1. Measure the distribution of segment delivery delays between sending and receiving points and plot that distribution in a chart showing delay time versus frequency of that delay.

2. Choose an acceptable frequency of delivery failure. For a television application one might decide that missing 1 out of 100 video frames is acceptable.

3. From the distribution, determine a delay time large enough to ensure that 99 out of 100 segments will be delivered in less than that delay time. Call this delay Dlong.

4. From the distribution determine the shortest delay time that is observed in practice. Call this value Dshort.

5. Now, provide enough buffering to delay every arriving segment so that it appears to have arrived with delay Dlong. The largest number of segments that would need to be buffered is

	number of segment buffers = (Dlong − Dshort) / Dheadway

where Dheadway is the average time between arriving segments. With this much buffering, we would expect that about one out of every 100 segments will arrive too late; when that occurs, the transport protocol simply reports “missing data” to the application and discards that segment if it finally does arrive. In practice, there is no easy way to measure one-way segment delivery delay, so a common strategy is simply to set the buffer size by trial and error.

Although the goal of this technique is to keep the rate of missing video frames below the level of human perceptibility, you can sometimes see the technique fail when watching a television program that has been transmitted by satellite or via the Internet. Occasionally there may be a freeze-frame that persists long enough that you can see it, but that doesn’t seem to be one that the director intended. This event probably indicates that the transmission path was disrupted for a longer time than the available buffers were prepared to handle.
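The buffer-count formula is a one-liner to apply. In this sketch the delay figures are made-up numbers for the television example, not measurements from the text:

```python
import math

def jitter_buffers(d_long_ms, d_short_ms, d_headway_ms):
    """Largest number of segments that need buffering:
    (Dlong - Dshort) / Dheadway, rounded up to whole buffers."""
    return math.ceil((d_long_ms - d_short_ms) / d_headway_ms)

# Hypothetical measurements: 99th-percentile delay 250 ms, shortest
# observed delay 50 ms, and one video frame arriving every 33.37 ms.
buffers = jitter_buffers(250, 50, 33.37)   # 6 segment buffers
```

With these numbers, six buffers make every frame appear to have taken the 99th-percentile delay, and roughly one frame in a hundred arrives too late to be shown.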

7.5.9 Assurance of Authenticity and Privacy

Most of the assurance-providing techniques described above are intended to operate in a benign environment, in which the designer assumes that errors can occur but that the errors are not maliciously constructed to frustrate the intended assurances. In many real-world environments, the situation is worse than that: one must defend against the threat that someone hostile intercepts and maliciously modifies packets, or that some end-to-end layer participants violate a protocol with malicious intent. To counter these threats, the end-to-end layer can apply two kinds of key-based mathematical transformations to the data:




1. sign and verify, to establish the authenticity of the source and the integrity of the contents of a message, and

2. encrypt and decrypt, to maintain the privacy of the contents of a message.

These two techniques can, if applied properly, be effective, but they require great care in design and implementation. Without such care, they may not work, but because they were applied the user may believe that they do, and thus have a false sense of security. A false assurance can be worse than no assurance at all. The issues involved in providing security assurances are a whole subject in themselves, and they apply to many system components in addition to networks, so we defer them to Chapter 11 [on-line], which provides an in-depth discussion of protecting information in computer systems.

With this examination of end-to-end topics, we have worked our way through the highest layer that we identify as part of the network. The next section of this chapter, on congestion control, is a step sideways, to explore a topic that requires cooperation of more than one layer.

7.6 A Network System Design Issue: Congestion Control

7.6.1 Managing Shared Resources

Chapters 5 and 6 discussed shared resources and their management: a thread manager creates many virtual processors from a few real, shared processors that must then be scheduled, and a multilevel memory manager creates the illusion of large, fast virtual memories for several clients by combining a small and fast shared memory with large and slow storage devices. In both cases we looked at relatively simple management mechanisms because more complex mechanisms aren’t usually needed. In the network context, the resource that is shared is a set of communication links and the supporting packet forwarders. The geographically and administratively distributed nature of those components and their users adds delay and complication to resource management, so we need to revisit the topic.

In Section 7.1.2 of this chapter we saw how queues manage the problem that packets may arrive at a packet switch at a time when the outgoing link is already busy transmitting another packet, and Figure 7.6 showed the way that queues grow with increased utilization of the link. This same phenomenon applies to processor scheduling and supermarket checkout lines: any time there is a shared resource, and the demand for that resource comes from several statistically independent sources, there will be fluctuations in the arrival of load, and thus in the length of the queue and the time spent waiting for service. Whenever the offered load (in the case of a packet switch, that is the rate at which packets arrive and need to be forwarded) is greater than the capacity (the rate at which the switch can forward packets) of a resource for some duration, the resource is overloaded for that time period.



When sources are statistically independent of one another, occasional overload is inevitable but its significance depends critically on how long it lasts. If the duration is comparable to the service time, which is the typical time for the resource to handle one customer (in a supermarket), one thread (in a processor manager), or one packet (in a packet forwarder), then a queue is simply an orderly way to delay some requests for service until a later time when the offered load drops below the capacity of the resource. Put another way, a queue handles short bursts of too much demand by time-averaging with adjacent periods when there is excess capacity.

If, on the other hand, overload persists for a time significantly longer than the service time, there begins to develop a risk that the system will fail to meet some specification such as maximum delay or acceptable response time. When this occurs, the resource is said to be congested. Congestion is not a precisely defined concept. The duration of overload that is required to classify a resource as congested is a matter of judgement, and different systems (and observers) will use different thresholds.

Congestion may be temporary, in which case clever resource management schemes may be able to rescue the situation, or it may be chronic, meaning that the demand for service continually exceeds the capacity of the resource. If the congestion is chronic, the length of the queue will grow without bound until something breaks: the space allocated for the queue may be exceeded, the system may fail completely, or customers may go elsewhere in disgust.

The stability of the offered load is another factor in the frequency and duration of congestion. When the load on a resource is aggregated from a large number of statistically independent small sources, averaging can reduce the frequency and duration of load peaks.
On the other hand, if the load comes from a small number of large sources, even if the sources are independent, the probability that they all demand service at about the same time can be high enough that congestion can be frequent or long-lasting.

A counter-intuitive concern of shared resource management is that competition for a resource sometimes leads to wasting of that resource. For example, in a grocery store, customers who are tired of waiting in the checkout line may just walk out of the store, leaving filled shopping carts behind. Someone has to put the goods from the abandoned carts back on the shelves. Suppose that one or two of the checkout clerks leave their registers to take care of the accumulating abandoned carts. The rate of sales being rung up drops while they are away from their registers, so the queues at the remaining registers grow longer, causing more people to abandon their carts, and more clerks will have to turn their attention to restocking. Eventually, the clerks will be doing nothing but restocking and the number of sales rung up will drop to zero. This regenerative overload phenomenon is called congestion collapse. Figure 7.42 plots the useful work getting done as the offered load increases, for three different cases of resource limitation and waste, including one that illustrates collapse.

Congestion collapse is dangerous because it can be self-sustaining. Once temporary congestion induces a collapse, even if the offered load drops back to a level that the resource could handle, the already-induced waste rate can continue to exceed the capacity of the resource, causing it to continue to waste the resource and thus remain congested indefinitely.


When developing or evaluating a resource management scheme, it is important to keep in mind that you can’t squeeze blood out of a turnip: if a resource is congested, either temporarily or chronically, delays in receiving service are inevitable. The best a management scheme can do is redistribute the total amount of delay among waiting customers. The primary goal of resource management is usually quite simple: to avoid congestion collapse. Occasionally other goals, such as enforcing a policy about who gets delayed, are suggested, but these goals are often hard to define and harder to achieve. (Doling out delays is a tricky business; overall satisfaction may be higher if a resource serves a few customers well and completely discourages the remainder, rather than leaving all equally disappointed.)

Chapter 6 suggested two general approaches to managing congestion. Either:

• increase the capacity of the resource, or
• reduce the offered load.

In both cases the goal is to move quickly to a state in which the load is less than the capacity of the resource. When measures are taken to reduce offered load, it is useful to separately identify the intended load, which would have been offered in the absence of

FIGURE 7.42 [Figure: a plot of useful work done versus offered load, with three curves: an unlimited resource, a limited resource with no waste, and congestion collapse.] Offered load versus useful work done. The more work offered to an ideal unlimited resource, the more work gets done, as indicated by the 45-degree unlimited resource line. Real resources are limited, but in the case with no waste, useful work asymptotically approaches the capacity of the resource. On the other hand, if overloading the resource also wastes it, useful work can decline when offered load increases, as shown by the congestion collapse line.



control. Of course, in reducing offered load, the amount by which it is reduced doesn’t really go away, it is just deferred to a later time. Reducing offered load acts by averaging periods of overload with periods of excess capacity, just like queuing, but with involvement of the source of the load, and typically over a longer period of time.

To increase capacity or to reduce offered load it is necessary to provide feedback to one or more control points. A control point is an entity that determines, in the first case, the amount of resource that is available and, in the second, the load being offered. A congestion control system is thus a feedback system, and delay in the feedback path can lead to oscillations in load and in useful work done.

For example, in a supermarket, a common strategy is for the store manager to watch the queues at the checkout lines; whenever there are more than two or three customers in any line the manager calls for staff elsewhere in the store to drop what they are doing and temporarily take stations as checkout clerks, thereby increasing capacity. In contrast, when you call a customer support telephone line you may hear an automatic response message that says something such as, “Your call is important to us. It will be approximately 21 minutes till we are able to answer it.” That message will probably lead some callers to hang up and try again at a different time, thereby decreasing (actually deferring) the offered load.

In both the supermarket and the telephone customer service system, it is easy to create oscillations. By the time the fourth supermarket clerk stops stacking dog biscuits and gets to the front of the store, the lines may have vanished, and if too many callers decide to hang up, the customer service representatives may find there is no one left to talk to.
In the commercial world, the choice between these strategies is a complex trade-off involving economics, physical limitations, reputation, and customer satisfaction. The same thing is true inside a computer system or network.

7.6.2 Resource Management in Networks

In a computer network, the shared resources are the communication links and the processing and buffering capacity of the packet forwarders. There are several things that make this resource management problem more difficult than, say, scheduling a processor among competing threads.

1. There is more than one resource. Even a small number of resources can be used up in an alarmingly large number of different ways, and the mechanisms needed to keep track of the situation can rapidly escalate in complexity. In addition, there can be dynamic interactions among different resources—as one nears capacity it may push back on another, which may push back on yet another, which may push back on the first one. These interactions can create either deadlock or livelock, depending on the details.

2. It is easy to induce congestion collapse. The usually beneficial independence of the layers of a packet forwarding network contributes to the ease of inducing congestion collapse. As queues for a particular communication link grow, delays


grow. When queuing delays become too long, the timers of higher layer protocols begin to expire and trigger retransmissions of the delayed packets. The retransmitted packets join the long queues but, since they are duplicates that will eventually be discarded, they just waste capacity of the link.

Designers sometimes suggest that an answer to congestion is to buy more or bigger buffers. As memory gets cheaper, this idea is tempting, but it doesn’t work. To see why, suppose memory is so cheap that a packet forwarder can be equipped with an infinite number of packet buffers. That many buffers can absorb an unlimited amount of overload, but as more buffers are used, the queuing delay grows. At some point the queuing delay exceeds the time-outs of the end-to-end protocols and the end-to-end protocols begin retransmitting packets. The offered load is now larger, perhaps twice as large as it would have been in the absence of congestion, so the queues grow even longer. After a while the retransmissions cause the queues to become long enough that end-to-end protocols retransmit yet again, and packets begin to appear in the queue three times, and then four times, etc. Once this phenomenon begins, it is self-sustaining until the real traffic drops to less than half (or 1/3 or 1/4, depending on how bad things got) of the capacity of the resource. The conclusion is that the infinite buffers did not solve the problem, they made it worse. Instead, it may be better to discard old packets than to let them use up scarce transmission capacity.

3. There are limited options to expand capacity. In a network there may not be many options to raise capacity to deal with temporary overload. Capacity is generally determined by physical facilities: optical fibers, coaxial cables, wireless spectrum availability, and transceiver technology. Each of these things can be augmented, but not quickly enough to deal with temporary congestion.
If the network is mesh-connected, one might consider sending some of the queued packets via an alternate path. That can be a good response, but doing it on a fast enough timescale to overcome temporary congestion requires knowing the instantaneous state of queues throughout the network. Strategies to do that have been tried; they are complex and haven’t worked well. It is usually the case that the only realistic strategy is to reduce demand.

4. The options to reduce load are awkward. The alternative to increasing capacity is to reduce the offered load. Unfortunately, the control point for the offered load is distant and probably administered independently of the congested packet forwarder. As a result, there are at least three problems:

• The feedback path to a distant control point may be long. By the time the feedback signal gets there the sender may have stopped sending (but all the previously sent packets are still on their way to join the queue) or the congestion may have disappeared and the sender no longer needs to hold back. Worse, if we use the network to send the signal, the delay will be variable, and any congestion on the



path back may mean that the signal gets lost. The feedback system must be robust to deal with all these eventualities.

• The control point (in this case, an end-to-end protocol or application) must be capable of reducing its offered load. Some end-to-end protocols can do this quite easily, but others may not be able to. For example, a stream protocol that is being used to send files can probably reduce its average data rate on short notice. On the other hand, a real-time video transmission protocol may have a commitment to deliver a certain number of bits every second. A single-packet request/response protocol will have no control at all over the way it loads the network; control must be exerted by the application, which means there must be some way of asking the application to cooperate—if it can.

• The control point must be willing to cooperate. If the congestion is discovered by the network layer of a packet forwarder, but the control point is in the end-to-end layer of a leaf node, there is a good chance these two entities are under the responsibility of different administrations. In that case, obtaining cooperation can be problematic; the administration of the control point may be more interested in keeping its offered load equal to its intended load in the hope of capturing more of the capacity in the face of competition.

These problems make it hard to see how to apply a central planning approach such as the one that worked in the grocery store. Decentralized schemes seem more promising. Many mechanisms have been devised to try to manage network congestion. Sections 7.6.3 and 7.6.4 describe the design considerations surrounding one set of decentralized mechanisms, similar to the ones that are currently used in the public Internet. These mechanisms are not especially well understood, but they not only seem to work, they have allowed the Internet to operate over an astonishing range of capacity.
In fact, the Internet is probably the best existing counterexample of the incommensurate scaling rule. Recall that the rule suggests that a system needs to be redesigned whenever any important parameter changes by a factor of ten. The Internet has increased in scale from a few hundred attachment points to a few hundred million attachment points with only modest adjustments to its underlying design.
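The infinite-buffer argument in point 2 above can be sketched numerically. The following toy model is hypothetical, with invented parameters; it assumes each packet is retransmitted one additional time for every timeout interval of queuing delay it experiences:

```python
def collapse_demo(loads, capacity=1.0, timeout=20.0):
    """Toy model of a forwarder with unlimited buffers.  Queuing delay is
    queue/capacity; each additional timeout interval of delay adds one
    retransmitted copy of every offered packet."""
    queue, history = 0.0, []
    for load in loads:
        copies = 1 + int((queue / capacity) // timeout)   # retransmit multiplier
        queue = max(queue + load * copies - capacity, 0.0)
        history.append(queue)
    return history

# Overload at 120% of capacity, then back off to 60% of capacity.
h = collapse_demo([1.2] * 200 + [0.6] * 200)
print(h[199] < h[399])   # True: the queue keeps growing after the real load drops
```

Even after the real traffic falls well below capacity, the retransmit multiplier keeps the offered load above capacity, which matches the text’s observation that collapse persists until the real traffic drops below half (or 1/3, or 1/4) of the capacity of the resource.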

7.6.3 Cross-layer Cooperation: Feedback

If the designer can arrange for cross-layer cooperation, then one way to attack congestion would be for the packet forwarder that notices congestion to provide feedback to one or more end-to-end layer sources, and for the end-to-end source to respond by reducing its offered load. Several mechanisms have been suggested for providing feedback.

One of the first ideas that was tried is for the congested packet forwarder to send a control message, called a source quench, to one or more of the source addresses that seems to be filling the queue. Unfortunately, preparing a control message distracts the packet forwarder at a time when


it least needs extra distractions. Moreover, transmitting the control packet adds load to an already-overloaded network. Since the control protocol is best-effort, the chance that the control message will itself be discarded increases as the network load increases, so when the network most needs congestion control the control messages are most likely to be lost.

A second feedback idea is for a packet forwarder that is experiencing congestion to set a flag on each forwarded packet. When the packet arrives at its destination, the end-to-end transport protocol is expected to notice the congestion flag and in the next packet that it sends back it should include a “slow down!” request to alert the other end about the congestion. This technique has the advantage that no extra packets are needed. Instead, all communication is piggybacked on packets that were going to be sent anyway. But the feedback path is even more hazardous than with a source quench—not only does the signal have to first reach the destination, the next response packet of the end-to-end protocol may not go out immediately.

Both of these feedback ideas would require that the feedback originate at the packet forwarding layer of the network. But it is also possible for congestion to be discovered in the link layer, especially when a link is, recursively, another network. For these reasons, Internet designers converged on a third method of communicating feedback about congestion: a congested packet forwarder just discards a packet. This method does not require interpretation of packet contents and can be implemented simply in any component in any layer that notices congestion. The hope is that the source of that packet will eventually notice a lack of response (or perhaps receive a NAK). This scheme is not a panacea because the end-to-end layer has to assume that every packet loss is caused by congestion, and the speed with which the end-to-end layer responds depends on its timer settings.
But it is simple and reliable.

This scheme leaves a question about which packet to discard. The choice is not obvious; one might prefer to identify the sources that are contributing most to the congestion and signal them, but a congested packet forwarder has better things to do than extensive analysis of its queues. The simplest method, known as tail drop, is to limit the size of the queue; any packet that arrives when the queue is full gets discarded. A better technique (random drop) may be to choose a victim from the queue at random. This approach has the virtue that the sources that are contributing most to the congestion are the most likely to receive the feedback. One can even make a plausible argument to discard the packet at the front of the queue, on the basis that of all the packets in the queue, the one at the front has been in the network the longest, and thus is the one whose associated timer is most likely to have already expired.

Another refinement (early drop) is to begin dropping packets before the queue is completely full, in the hope of alerting the source sooner. The goal of early drop is to start reducing the offered load as soon as the possibility of congestion is detected, rather than waiting until congestion is confirmed, so it can be viewed as a strategy of avoidance rather than of recovery. Random drop and early drop are combined in a scheme known as RED, for random early detection.
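A minimal sketch of the RED idea follows, with invented threshold and probability parameters; real implementations time-average the queue length more carefully than this:

```python
import random

class RedQueue:
    """Sketch of random early detection (RED).  The drop probability rises
    linearly from 0 to max_p as the averaged queue length moves between
    min_th and max_th; above max_th every arriving packet is dropped."""
    def __init__(self, min_th=5, max_th=15, max_p=0.1, weight=0.2):
        self.min_th, self.max_th, self.max_p = min_th, max_th, max_p
        self.weight = weight   # gain of the moving average of queue length
        self.avg = 0.0
        self.queue = []

    def enqueue(self, packet):
        """Returns True if the packet was queued, False if it was dropped."""
        self.avg += self.weight * (len(self.queue) - self.avg)
        if self.avg >= self.max_th:
            return False       # behaves like tail drop: queue effectively full
        if self.avg >= self.min_th:
            drop_p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
            if random.random() < drop_p:
                return False   # early, random drop: signal the source now
        self.queue.append(packet)
        return True
```

Because victims are chosen at random among arrivals, a source sending more packets is proportionally more likely to lose one, so the heaviest contributors to the congestion tend to receive the feedback first.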



7.6.4 Cross-layer Cooperation: Control

Suppose that the end-to-end protocol implementation learns of a lost packet. What then? One possibility is that it just drives forward, retransmitting the lost packet and continuing to send more data as rapidly as its application supplies it. The end-to-end protocol implementation is in control, and there is nothing compelling it to cooperate. Indeed, it may discover that by sending packets at the greatest rate it can sustain, it will push more data through the congested packet forwarder than it would otherwise. The problem, of course, is that if this is the standard mode of operation of every client, congestion will set in and all clients of the network will suffer, as predicted by the tragedy of the commons (see Sidebar 7.8).

Sidebar 7.8: The tragedy of the commons “Picture a pasture open to all…As a rational being, each herdsman seeks to maximize his gain…he asks, ‘What is the utility to me of adding one more animal to my herd?’ This utility has one negative and one positive component…Since the herdsman receives all the proceeds from the sale of the additional animal, the positive utility is nearly +1. Since, however, the effects of overgrazing are shared by all the herdsmen, the negative utility for any particular decision-making herdsman is only a fraction of –1.

“Adding together the component partial utilities, the rational herdsman concludes that the only sensible course for him to pursue is to add another animal to his herd. And another…. But this is the conclusion reached by each and every rational herdsman sharing a commons. Therein is the tragedy. Each man is locked into a system that compels him to increase his herd without limit—in a world that is limited…Freedom in a commons brings ruin to all.” — Garrett Hardin, Science 162, 3859 [Suggestions for Further Reading 1.4.5]

There are at least two things that the end-to-end protocol can do to cooperate. The first is to be careful about its use of timers, and the second is to pace the rate at which it sends data, a technique known as automatic rate adaptation. Both these things require having an estimate of the round-trip time between the two ends of the protocol. The usual way of detecting a lost packet in a best-effort network is to set a timer to expire after a little more than one round-trip time, and assume that if an acknowledgment has not been received by then the packet is lost. In Section 7.5 of this chapter we introduced timers as a way of ensuring at-least-once delivery via a best-effort network, expecting that lost packets had encountered mishaps such as misrouting, damage in transmission, or an overflowing packet buffer. With congestion management in operation, the dominant reason for timer expiration is probably that either a queue in the network has grown too long or a packet forwarder has intentionally discarded the packet. The designer needs to take this additional consideration into account when choosing a value for a retransmit timer. As described in Section 7.5.6, a protocol can develop an estimate of the round trip time by directly measuring it for the first packet exchange and then continuing to update that estimate as additional packets flow back and forth.
Then, if congestion develops, queuing delays will increase the observed round-trip times for individual packets, and


those observations will increase the round-trip estimate used for setting future retransmit timers. In addition, when a timer does expire, the algorithm for timer setting should use exponential backoff for successive retransmissions of the same packet (exponential backoff was described in Section 7.5.2). It does not matter whether the reason for expiration is that the packet was delayed in a growing queue or it was discarded as part of congestion control. Either way, exponential backoff immediately reduces the retransmission rate, which helps ease the congestion problem. Exponential backoff has been demonstrated to be quite effective as a way to avoid contributing to congestion collapse. Once acknowledgments begin to confirm that packets are actually getting through, the sender can again allow timer settings to be controlled by the round-trip time estimate.

The second cooperation strategy involves managing the flow control window. Recall from the discussion of flow control in Section 7.5.6 that to keep the flow of data moving as rapidly as possible without overrunning the receiving application, the flow control window and the receiver’s buffer should both be at least as large as the bottleneck data rate multiplied by the round trip time. Anything larger than that will work equally well for end-to-end flow control. Unfortunately, when the bottleneck is a congested link inside the network, a larger than necessary window will simply result in more packets piling up in the queue for that link. The additional cooperation strategy, then, is to ensure that the flow control window is no larger than necessary. Even if the receiver has buffers large enough to justify a larger flow control window, the sender should restrain itself and set the flow control window to the smallest size that keeps the connection running at the data rate that the bottleneck permits. In other words, the sender should force equality in the expression on page 7–79.
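The timer discipline described above can be sketched as a small class. The structure (an exponentially weighted round-trip estimate plus a doubling backoff factor) follows the text; the constants and names are invented for illustration:

```python
class RetransmitTimer:
    """Adaptive retransmit timer: the timeout tracks a smoothed round-trip
    time estimate, doubles on every expiration (exponential backoff), and
    returns to estimate-based control once acknowledgments resume."""
    def __init__(self, initial_rtt=1.0, gain=0.125, safety=2.0):
        self.srtt = initial_rtt   # smoothed round-trip time estimate
        self.gain = gain          # weight given to each new measurement
        self.safety = safety      # timeout is a bit more than one round trip
        self.backoff = 1

    def sample(self, measured_rtt):
        """Called when a (non-retransmitted) packet is acknowledged."""
        self.srtt += self.gain * (measured_rtt - self.srtt)
        self.backoff = 1          # packets are getting through again

    def expired(self):
        """Called when the timer fires before the acknowledgment arrives."""
        self.backoff *= 2

    def timeout(self):
        return self.safety * self.srtt * self.backoff

t = RetransmitTimer(initial_rtt=1.0)
t.expired(); t.expired()          # two successive losses of the same packet
print(t.timeout())                # 8.0: the timer has backed off 4x
t.sample(1.0)                     # an acknowledgment confirms progress
print(t.timeout())                # 2.0: back to estimate-based control
```

Note how an acknowledgment both updates the round-trip estimate and cancels the backoff, matching the text: once packets are confirmed to be getting through, the estimate alone controls the timer again.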
Relatively early in the history of the Internet, it was realized (and verified in the field) that congestion collapse was not only a possibility, but that some of the original Internet protocols had unexpectedly strong congestion-inducing properties. Since then, almost all implementations of TCP, the most widely used end-to-end Internet transport protocol, have been significantly modified to reduce the risk, as described in Sidebar 7.9.

While having a widely-deployed, cooperative strategy for controlling congestion reduces both congestion and the chance of congestion collapse, there is one unfortunate consequence: since every client that cooperates may be offering a load that is less than its intended load, there is no longer any way to estimate the size of that intended load. Intermediate packet forwarders know that if they are regularly discarding some packets, they need more capacity, but they have no clue how much more capacity they really need.

7.6.5 Other Ways of Controlling Congestion in Networks

Overprovisioning: Configure each link of the network to have 125% (or 150% or 200%) as much capacity as the offered load at the busiest minute (or five minutes or hour) of the day. This technique works best on interior links of a large network, where no individual client represents more than a tiny fraction of the load. When that is the case, the average load offered by the large number of statistically independent sources is relatively



Sidebar 7.9: Retrofitting TCP

The Transmission Control Protocol (TCP), probably the most widely used end-to-end transport protocol of the Internet, was designed in 1974, at a time when previous experience was limited to lock-step protocols on networks with no more than a few hundred nodes. As a result, avoiding congestion collapse was not in its list of requirements. About a decade later, when the Internet first began to expand rapidly, this omission was noticed, and a particular collapse-inducing feature of its design drew attention.

The only form of acknowledgment in the original TCP was “I have received all the bytes up to X”. There was no way for a receiver to say, for example, “I am missing bytes Y through Z”. In consequence, when a timer expired because some packet or its acknowledgment was lost, as soon as the sender retransmitted that packet the timer of the next packet expired, causing its retransmission. This process would repeat until the next acknowledgment finally returned, a full round trip (and full flow control window) later. On long-haul routes, where flow control windows might be fairly large, if an overloaded packet forwarder responded to congestion by discarding a few packets (each perhaps from a different TCP connection), each discarded packet would trigger retransmission of a window full of packets, and the ensuing blizzard of retransmitted packets could immediately induce congestion collapse. In addition, an insufficiently adaptive time-out scheme ensured that the problem would occur frequently.

By the time this effect was recognized, TCP was widely deployed, so changes to the protocol were severely constrained. The designers found a way to change the implementation without changing the data formats. The goal was to allow new and old implementations to interoperate, so new implementations could gradually replace the old.
The new implementation works by having the sender tinker with the size of the flow control window (Warning: this explanation is somewhat oversimplified!):

1. Slow start. When starting a new connection, send just one packet, and wait for its acknowledgment. Then, for each acknowledged packet, add one to the window size and send two packets. The result is that in each round trip time, the number of packets that the sender dispatches doubles. This doubling procedure continues until one of three things happens: (1) the sender reaches the window size suggested by the receiver, in which case the network is not the bottleneck, and the sender maintains the window at that size; (2) all the available data has been dispatched; or (3) the sender detects that a packet it sent has been discarded, as described in step 2.

2. Duplicate acknowledgment. The receiving TCP implementation is modified very slightly: whenever it receives an out-of-order packet, it sends back a duplicate of its latest acknowledgment. The idea is that a duplicate acknowledgment can be interpreted by the sender as a negative acknowledgment for the next unacknowledged packet.

3. Equilibrium. Upon duplicate acknowledgment, the sender retransmits just the first unacknowledged packet and also drops its window size to some fixed fraction (for example, 1/2) of its previous size. From then on it operates in an equilibrium mode in which it continues to watch for duplicate acknowledgments but it also probes gently to see if more capacity might be available. The equilibrium mode has two components:


• Additive increase: Whenever all of the packets in a round trip time are successfully acknowledged, the sender increases the size of the window by one.

• Multiplicative decrease: Whenever a duplicate acknowledgment arrives, the sender decreases the size of the window by the fixed fraction.

4. Restart. If the sender’s retransmission timer expires, self-pacing based on ACKs has been disrupted, perhaps because something in the network has radically changed. So the sender waits a short time to allow things to settle down, and then goes back to slow start, to allow assessment of the new condition of the network.

By interpreting a duplicate acknowledgment as a negative acknowledgment for a single packet, TCP eliminates the massive retransmission blizzard, and by reinitiating slow start on each timer expiration, it avoids contributing to congestion collapse.

The figure below illustrates the evolution of the TCP window size with time in the case where the bottleneck is inside the network. TCP begins with one packet and slow start, until it detects the first packet loss. The sender immediately reduces the window size by half and then begins gradually increasing it by one for each round trip time until detecting another lost packet. This sawtooth behavior may continue indefinitely, unless the retransmission timer expires. The sender pauses and then enters another slow start phase, this time switching to additive increase as soon as it reaches the window size it would have used previously, which is half the window size that was in effect before it encountered the latest round of congestion.

This cooperative scheme has not been systematically analyzed, but it seems to work in practice, even though not all of the traffic on the Internet uses TCP as its end-to-end transport protocol.
The long and variable feedback delays that inevitably accompany lost packet detection by the use of duplicate acknowledgments induce oscillations (as evidenced by the sawteeth) but the additive increase—multiplicative decrease algorithms strongly damp those oscillations. Exercise 7.12 compares slow start with “fast start”, another scheme for establishing an initial estimate of the window size. There have been dozens (perhaps hundreds) of other proposals for fixing both real and imaginary problems in TCP. The interested reader should consult Section 7.4 in the Suggestions for Further Reading.

[Figure: window size versus time, showing slow start, a duplicate acknowledgment triggering multiplicative decrease, an additive increase sawtooth, a timer expiration during which the sender stops sending, and then slow start again.]
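The sawtooth can be reproduced with a few lines of simulation. This sketch simplifies the sidebar’s rules to one event per round trip (the function name, event encoding, and window limit are invented):

```python
def tcp_window_trace(events, receiver_window=64):
    """Window size after each round-trip event: 'ack' means every packet in
    the round trip was acknowledged, 'dupack' means a duplicate
    acknowledgment arrived, 'timeout' means the retransmit timer expired."""
    window, threshold, slow_start = 1, receiver_window, True
    trace = []
    for event in events:
        if event == 'ack':
            if slow_start:
                window = min(window * 2, receiver_window)   # doubling phase
                if window >= threshold:
                    slow_start = False                      # switch to equilibrium
            else:
                window = min(window + 1, receiver_window)   # additive increase
        elif event == 'dupack':
            window = max(window // 2, 1)                    # multiplicative decrease
            threshold, slow_start = window, False
        elif event == 'timeout':
            threshold = max(window // 2, 1)   # remember half the old window
            window, slow_start = 1, True      # pause, then slow start again
        trace.append(window)
    return trace

print(tcp_window_trace(['ack'] * 5 + ['dupack'] + ['ack'] * 3))
# → [2, 4, 8, 16, 32, 16, 17, 18, 19]: slow start, halving, then the sawtooth
```

After a timeout, slow start runs only until the window reaches half its previous size, then additive increase takes over, as the sidebar describes.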



stable and predictable. Internet backbone providers generally use overprovisioning to avoid congestion. The problems with this technique are:

• Odd events can disrupt statistical independence. An earthquake in California or a hurricane in Florida typically clogs up all the telephone trunks leading to and from the affected state, even if the trunks themselves haven’t been damaged. Everyone tries to place a call at once.

• Overprovisioning on one link typically just moves the congestion to a different link. So every link in a network must be overprovisioned, and the amount of overprovisioning has to be greater on links that are shared by fewer customers because statistical averaging is not as effective in limiting the duration of load peaks.

• At the edge of the network, statistical averaging across customers stops working completely. The link to an individual customer may become congested if the customer’s Web service is featured in Newsweek—a phenomenon known as a “flash crowd”. Permanently increasing the capacity of that link to handle what is probably a temporary but large overload may not make economic sense.

• Adaptive behavior of users can interfere with the plan. In Los Angeles, the opening of a new freeway initially provides additional traffic capacity, but new traffic soon appears and absorbs the new capacity, as people realize that they can conveniently live in places that are farther from where they work. Because of this effect, it does not appear to be physically possible to use overprovisioning as a strategy in the freeway system—the load always increases to match (or exceed) the capacity. Anecdotally, similar effects seem to occur in the Internet, although they have not yet been documented.

Over the life of the Internet there have been major changes in both telecommunications regulation and fiber optic technology that between them have transformed the Internet’s central core from capacity-scarce to capacity-rich.
As a result, the locations at which congestion occurs have moved as rapidly as techniques to deal with it have been invented. But so far congestion hasn’t gone away.

Pricing: Another approach to congestion control is to rearrange the rules so that the interest of an individual client coincides with the interest of the network community, and let the invisible hand take over, as explained in Sidebar 7.10. Since network resources are just another commodity, it should be possible to use pricing as a congestion control mechanism. The idea is that, if demand for a resource temporarily exceeds its capacity, clients will bid up the price. The increased price will cause some clients to defer their use of the resource until a time when it is cheaper, thereby reducing offered load; it will also induce additional suppliers to provide more capacity. There is a challenge in trying to make pricing mechanisms work on the short timescales associated with network congestion; in addition, there is a countervailing need for predictability of costs in the short term that may make the idea unworkable. However,

Saltzer & Kaashoek Ch. 7, p. 97

June 25, 2009 8:22 am


CHAPTER 7 The Network as a System and as a System Component

Sidebar 7.10: The invisible hand

Economics 101: In a free market, buyers have the option of buying a good or walking away, and sellers similarly have the option of offering a good or leaving the market. The higher the price, the more sellers will be attracted to the profit opportunity, and collectively they will make additional quantities of the good available. At the same time, the higher the price, the more buyers will balk, and collectively they will reduce their demand for the good. These two effects act to create an equilibrium in which the supply of the good exactly matches the demand for the good. Every buyer is satisfied with the price paid and every seller with the price received. When the market is allowed to set the price, surpluses and shortages are systematically driven out by this equilibrium-seeking mechanism.

“Every individual necessarily labors to render the annual revenue of the society as great as he can. He generally indeed neither intends to promote the public interest, nor knows how much he is promoting it. He intends only his own gain, and he is in this, as in many other cases, led by an invisible hand to promote an end which was no part of his intention. By pursuing his own interest he frequently promotes that of the society more effectually than when he really intends to promote it.”*

* Adam Smith (1723–1790). The Wealth of Nations, Book 4, Chapter 2 (1776).

as a long-term strategy, pricing can be quite an effective mechanism to match the supply of network resources with demand. Even in the long term, the invisible hand generally requires that there be minimal barriers to entry by alternate suppliers; this is a hard condition to maintain when installing new communication links involves digging up streets, erecting microwave towers, or launching satellites.

Congestion control in networks is by no means a solved problem—it is an active research area. This discussion has just touched the highlights; there are many more design considerations and ideas that must be assimilated before one can claim to understand this topic.

7.6.6 Delay Revisited

Section 7.1.2 of this chapter identified four sources of delay in networks: propagation delay, processing delay, transmission delay, and queuing delay. Congestion control and flow control both might seem to add a fifth source of delay, in which the sender waits for permission from the receiver to launch a message into the network. In fact this delay is not of a new kind; it is actually an example of a transmission delay arising in a different protocol layer. At the time we identified the four kinds of delay, we had not yet discussed protocol layers, so this subtlety did not appear.

Each protocol layer of a network can impose any or all of the four kinds of delay. For example, what Section 7.1.2 identified as processing delay is actually composed of processing delay in the link layer (e.g., time spent bit-stuffing and calculating checksums),



processing delay in the network layer (e.g., time spent looking up addresses in forwarding tables), and processing delay in the end-to-end layer (e.g., time spent compressing data, dividing a long message into segments and later reassembling it, and encrypting or decrypting message contents).

Similarly, transmission delay can also arise in each layer. At the link layer, transmission delay is measured from when the first bit of a frame enters a link until the last bit of that same frame enters the link. The length of the frame and the data rate of the link together determine its magnitude. The network layer does not usually impose any additional transmission delays of its own, but in choosing a route (and thus the number of hops) it helps determine the number of link-layer transmission delays. The end-to-end layer imposes an additional transmission delay whenever the pacing effect of either congestion control or flow control causes it to wait for permission to send. The data rate of the bottleneck in the end-to-end path, the round-trip time, and the size of the flow-control window together determine the magnitude of the end-to-end transmission delay. The end-to-end layer may also delay delivering a message to its client while waiting for an out-of-order segment of that message to arrive, and it may delay delivery in order to reduce jitter. These delivery delays are another component of end-to-end transmission delay.

Any layer that imposes either processing or transmission delays can also cause queuing delays for subsequent packets. The transmission delays of the link layer can thus create queues, where packets wait for the link to become available. The network layer can impose queuing delays if several packets arrive at a router during the time it spends figuring out how to forward a packet. Finally, the end-to-end layer can also queue up packets waiting for flow control or congestion control permission to enter the network.
Propagation delay might seem to be unique to the link layer, but a careful accounting will reveal small propagation delays contributed by the network and end-to-end layers as messages are moved around inside a router or end-node computer. Because the distances involved in a network link are usually several orders of magnitude larger than those inside a computer, the propagation delays of the network and end-to-end layers can usually be ignored.
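The per-layer accounting above can be made concrete with a little arithmetic. The following Python sketch (the function names and the numeric values are illustrative, not from the text) tallies the contributions of propagation, transmission, processing, and queuing delay for a single link:

```python
# Sketch: tallying the four kinds of delay for one hop of a network path.
# All names and parameter values here are illustrative assumptions.

SPEED_IN_FIBER = 2.0e8  # meters/second, roughly two-thirds of c in glass


def propagation_delay(distance_m, speed=SPEED_IN_FIBER):
    # time for one bit to travel the length of the link
    return distance_m / speed


def transmission_delay(frame_bits, link_rate_bps):
    # time from the first bit of a frame entering the link
    # until the last bit of that same frame enters the link
    return frame_bits / link_rate_bps


def one_hop_delay(frame_bits, link_rate_bps, distance_m,
                  processing_s=0.0, queuing_s=0.0):
    # total delay for this hop is the sum of all four kinds
    return (propagation_delay(distance_m)
            + transmission_delay(frame_bits, link_rate_bps)
            + processing_s + queuing_s)


# A 12,000-bit frame over a 10 megabit/second link 1 km long:
total = one_hop_delay(12_000, 10_000_000, 1_000)
```

In this example the transmission delay (1.2 ms) dominates the propagation delay (5 µs); on a long-haul link the balance reverses, which is why the text treats the two separately.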

7.7 Wrapping up Networks

This chapter has introduced a lot of concepts and techniques for designing and dealing with data communication networks. A natural question arises: “Is all of this stuff really needed?” The answer, of course, is “It depends.” It obviously depends on the application, which may not require all of the features that the various network layers provide. It also depends on several lower-layer aspects.

For example, if at the link layer the entire network consists of just a single point-to-point link, there is no need for a network layer at all. There may still be a requirement to multiplex the link, but multiplexing does not require any of the routing function of a


network layer because everything that goes in one end of the link is destined for whatever is attached at the other end. In addition, there is probably no need for some of the transport services of the end-to-end layer, because frames, segments, streams, or messages come out of the link in the same order they went in. A short link is sometimes quite reliable, in which case the end-to-end layer may not need to provide a duplicate-generating resend mechanism and in turn can omit duplicate suppression. What remains in the end-to-end function is session services (such as authenticating the identity of the user and encrypting the communication for privacy) and presentation services (marshaling application data into a form that can be transmitted as a message or a stream).

Similarly, if at the link layer the entire network consists of just a single broadcast link, a network layer is needed, but it is vestigial: it consists of just enough intelligence at each receiver to discard packets addressed to different targets. For example, the backplane bus described in Chapter 3 is a reliable broadcast network with an end-to-end layer that provides only presentation services. For another example, an Ethernet, which is less reliable, needs a healthier set of end-to-end services because it exhibits greater variations in delay. On the other hand, packet loss is still rare enough that it may be possible to ignore it, and reordered packet delivery is not a problem.

As with all aspects of computer system design, good judgement and careful consideration of trade-offs are required for a design that works well and also is economical. This summary completes our conceptual material about networks. The remaining sections of this chapter contain a case study of a popular network design, the Ethernet, and a collection of network-related war stories.

7.8 Case Study: Mapping the Internet to the Ethernet

This case study begins with a brief description of Ethernet, using the terminology and network model of this chapter. It then explores the routing issues that arise when one maps a packet-forwarding network such as the Internet to an Ethernet.

7.8.1 A Brief Overview of Ethernet

Ethernet is the generic name for a family of local area networks based on broadcast over a shared wire or fiber link on which all participants can hear one another’s transmissions. Ethernet uses a listen-before-sending rule (known as “carrier sense”) to control access, and it uses a listen-while-sending rule to minimize wasted transmission time if two stations happen to start transmitting at the same time, an error known as a collision. This protocol is named Carrier Sense Multiple Access with Collision Detection, abbreviated CSMA/CD. Ethernet was demonstrated in 1974 and documented in a 1976 paper by Metcalfe and Boggs [see Suggestions for Further Reading 7.1.2]. Since that time several successively higher-speed versions have evolved. Originally designed as a half duplex system, a full duplex, point-to-point specification that relaxes length restrictions was a later



development. The primary forms of Ethernet that one encounters either in the literature or in the field are the following: • Experimental Ethernet, a long obsolete 3 megabit per second network that was

used only in laboratory settings. The 1976 paper describes this version.

• Standard Ethernet, a 10 megabit per second version. • Fast Ethernet, a 100 megabit per second version. • Gigabit Ethernet, which operates at the eponymous speed. Standard, fast, and gigabit Ethernet all share the same basic protocol design and for­ mat. The format of an Ethernet frame (with some subfield details omitted) is: leader 64 bits

destination 48 bits

source 48 bits

type 16 bits

data 368 to 12,000 bits

checksum 32 bits

The leader field contains a standard bit pattern that frames the payload and also provides an opportunity for the receiver’s phase-locked loop to synchronize. The destination and source fields identify specific stations on the Ethernet. The type field is used for protocol multiplexing in some applications and to contain the length of the data field in others. (The format diagram does not show that each frame is followed by 96 bit times of silence, which allows finding the end of the frame when the length field is absent.)

The maximum extent of a half duplex Ethernet is determined by its propagation time; the controlling requirement is that the maximum two-way propagation time between the two most distant stations on the network be less than the 576 bit times required to transmit the shortest allowable packet. This restriction guarantees that if a collision occurs, both colliding parties are certain to detect it. When a sending station does detect a collision, it waits a random time before trying again; when there are repeated collisions it uses exponential backoff to increase the interval from which it randomly chooses the time to wait. In a full duplex, point-to-point Ethernet there are no collisions, and the maximum length of the link is determined by the physical medium.

There are many fascinating aspects of Ethernet design and implementation, ranging from debates about its probabilistic character to issues of electrical grounding; we omit all of them here. For more information, a good place to start is with the paper by Metcalfe and Boggs. The Ethernet is completely specified in a series of IEEE standards numbered 802.3, and it is described in great detail in most books devoted to networking.
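The two CSMA/CD rules just described can be captured in a few lines. In this sketch, the 576-bit-time minimum comes from the text; the backoff cap and the use of slot counts are illustrative assumptions, not the exact parameters of the IEEE standard:

```python
import random

# Sketch of the CSMA/CD constraints described above.
MIN_FRAME_BIT_TIMES = 576  # shortest allowable packet, from the text


def collision_detectable(round_trip_bit_times):
    # A collision is guaranteed to be detected only if the two-way
    # propagation time between the most distant stations is less than
    # the time needed to transmit the shortest allowable packet.
    return round_trip_bit_times < MIN_FRAME_BIT_TIMES


def backoff_slots(collision_count, max_exponent=10):
    # Binary exponential backoff: after the n-th consecutive collision,
    # wait a random number of slot times chosen from [0, 2**n - 1].
    # The cap on the exponent is an illustrative assumption.
    exponent = min(collision_count, max_exponent)
    return random.randint(0, 2 ** exponent - 1)
```

A network whose diameter makes the round trip 576 bit times or more would allow undetected collisions, which is exactly why the length restriction exists for half duplex operation.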

7.8.2 Broadcast Aspects of Ethernet

Section 7.3.5 of this chapter mentioned Ethernet as an example of a network that uses a broadcast link. As illustrated in Figure 7.43, the Ethernet link layer is quite simple: every frame is delivered to every station. At its network layer, each Ethernet station has a 48-bit address, which to avoid confusion with other addresses we will call a station identifier. (To help reduce ambiguity in the examples that follow, station identifiers will be the only two-digit numbers.)







FIGURE 7.43 An Ethernet. (Several stations attach to a single shared link; each attachment point is labeled with its station identifier, the Ethernet address.)

The network layer of Ethernet is quite simple. On the sending side, ETHERNET_SEND does nothing but pass the call along to the link layer. On the receiving side, the network handler procedure of the Ethernet network layer is straightforward:

procedure ETHERNET_HANDLE (net_packet, length)
    destination ← net_packet.target_id
    if destination = my_station_id then
        pass the packet payload to the upper layer
    else ignore packet

There are two differences between this network-layer handler and the network-layer handler of a packet-forwarding network:

• Because the underlying physical link is a broadcast link, it is up to the network layer of the station to figure out that it should ignore packets not addressed specifically to it.
• Because every packet is delivered to every Ethernet station, there is no need to do any forwarding.

Most Ethernet implementations actually place ETHERNET_HANDLE completely in hardware. One consequence is that the hardware of each station must know its own station identifier, so it can ignore packets addressed to other stations. This identifier is wired in at manufacturing time, but most implementations also provide a programmable identifier register that overrides the wired-in identifier.

Since the link layer of Ethernet is a broadcast link, it offers a convenient additional opportunity for the network layer to create a broadcast network. For this purpose, Ethernet reserves one station identifier as a broadcast address, and the network handler procedure acquires one additional test:

procedure ETHERNET_HANDLE (net_packet, length)
    destination ← net_packet.target_id
    if destination = my_station_id or destination = BROADCAST_ID then
        pass the packet payload to the upper layer
    else ignore packet
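The handler with the broadcast test can be rendered as a runnable Python sketch. The packet layout (a dictionary) and the all-ones broadcast identifier are stand-ins for illustration, not the real Ethernet frame format:

```python
# Python rendering of the ETHERNET_HANDLE pseudocode above,
# including the broadcast test. Packet layout is illustrative.

BROADCAST_ID = 0xFFFFFFFFFFFF  # an all-ones 48-bit station identifier


def ethernet_handle(net_packet, my_station_id, deliver):
    # deliver() stands for handing the payload to the upper layer
    destination = net_packet["target_id"]
    if destination == my_station_id or destination == BROADCAST_ID:
        deliver(net_packet["data"])
    # else: ignore the packet; it was addressed to some other station
```

A station at identifier 17 accepts packets addressed to 17 or to the broadcast identifier and silently drops everything else, which is the whole of Ethernet's vestigial network layer.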



The Ethernet broadcast feature is seductive. It has led people to propose also adding broadcast features to packet-forwarding networks. It is possible to develop broadcast algorithms for a forwarding network, but it is a much trickier business. Even in Ethernet, broadcast must be used judiciously. Reliable transport protocols that require every receiving station to send back an acknowledgment lead to a problematic flood of acknowledgment packets. In addition, broadcast mechanisms are too easily triggered by mistake. For example, if a request is accidentally sent with its source address set to the broadcast address, the response will be broadcast to all network attachment points. The worst case is a broadcast sent from the broadcast address, which can lead to a flood of broadcasts. Such mechanisms make a good target for malicious attack on a network, so it is usually thought to be preferable not to implement them at all.

7.8.3 Layer Mapping: Attaching Ethernet to a Forwarding Network

Suppose we have several workstations and perhaps a few servers in one building, all connected using an Ethernet, and we would like to attach this Ethernet to the packet-forwarding network illustrated in Figure 7.31 on page 7–50, by making the Ethernet a sixth link on router K in that figure. This connection produces the configuration of Figure 7.44.

There are three kinds of network-related labels in the figure. First, each link is numbered with a local single-digit link identifier (in italics), as viewed from within the station that attaches that link. Second, as in Figure 7.43, each Ethernet attachment point has a two-digit Ethernet station identifier. Finally, each station has a one-letter name, just as in the packet-forwarding network in the figure on page 7–50. With this configuration, workstation L sends a remote procedure call to server N by sending one or more packets to station 18 of the Ethernet attached to it as link number 1.

FIGURE 7.44 Connecting an Ethernet to a packet-forwarding network. (Workstations L, M, P, and Q and server N attach to the Ethernet, each with its own Ethernet station identifier and a local link identifier; router K attaches to the Ethernet at station 19 and connects it, as link 6, to the packet-forwarding network.)


Workstation L might also want to send a request to the computer connected to destination E, which requires that L actually send the request packet to router K at Ethernet station 19 for forwarding to destination E. The complication is that E may be at address 15 of the packet-forwarding network, while workstation M is at station 15 of the Ethernet. Since Ethernet station identifiers may be wired into the hardware interface, we probably can’t set them to suit our needs, and it might be a major hassle to go around changing addresses on the original packet-forwarding network. The bottom line here is that we can’t simply use Ethernet station identifiers as the network addresses in our packet-forwarding network. But this conclusion seems to leave station L with no way of expressing the idea that it wants to send a packet to address E.

We were able to express this idea in words because in the two figures we assigned a unique letter identifier to every station. What our design needs is a more universal concept of network—a cloud that encompasses every station in both the Ethernet and the packet-forwarding network and assigns each station a unique network address. Recall that the letter identifiers originally stood for addresses in the packet-forwarding network; they may even be hierarchical identifiers. We can simply extend that concept and assign identifiers from that same numbering plan to each Ethernet station, in addition to the wired-in Ethernet station identifiers.

What we are doing here is mapping the letter identifiers of the packet-forwarding network to the station identifiers of the Ethernet. Since the Ethernet is itself decomposable into a network layer and a link layer, we can describe this situation, as was suggested on page 7–34, as a mapping composition—an upper-level network layer is being mapped to a lower-level network layer.
The upper network layer is a simplified version of the Internet, so we will label it with the name “internet,” using a lower-case initial letter as a reminder that it is simplified. Our internet provides us with a language in which workstation L can express the idea that it wants to send an RPC request to server E, which is located somewhere beyond the router:

NETWORK_SEND (data, length, RPC, INTERNET, E)

where E is the internet address of the server, and the fourth argument selects our internet forwarding protocol from among the various available network protocols. With this scheme, station A also uses the same network address E to send a request to that server. In other words, this internet provides a universal name space.

Our new, expanded, internet network layer must now map its addresses into the Ethernet station identifiers required by the Ethernet network layer. For example, when workstation L sends a remote procedure call to server N by

NETWORK_SEND (data, length, RPC, INTERNET, N)

the internet network layer must turn this into the Ethernet network-layer call

NETWORK_SEND (data, length, RPC, ENET, 18)

in which we have named the Ethernet network-layer protocol ENET.


For this purpose, L must maintain a table such as that of Figure 7.45, in which each internet address maps to an Ethernet station identifier. This table maps, for example, address N to ENET, station 18, as required for the NETWORK_SEND call above. Since our internet is a forwarding network, our table also indicates that for address E the thing to do is send the packet on ENET to station 19, in the hope that the station there (a router in our diagram) will be well enough connected to pass the packet along to its destination. This table is just another example of a forwarding table like the ones in Section 7.4 of this chapter.

FIGURE 7.45 Forwarding table connecting upper-layer (internet) addresses to lower-layer (Ethernet) station identifiers. (For example, address M maps to enet/15, N to enet/18, and E to enet/19; the full table also contains entries for enet/14, enet/22, and enet/19.)
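The layer mapping amounts to a table lookup followed by a re-send at the lower layer. In this Python sketch the table rows follow the examples confirmed in the text (M at station 15, N at 18, E via the router at 19); the function names and simplified argument lists are stand-ins for the chapter's NETWORK_SEND calls, not a real API:

```python
# Sketch of the internet-to-Ethernet layer mapping described above.
# Table contents follow the text's examples; names are illustrative.

forwarding_table = {
    "M": ("enet", 15),
    "N": ("enet", 18),
    "E": ("enet", 19),  # reached via router K at station 19
}


def internet_send(data, protocol, dest_address, enet_send):
    # Look up the upper-layer (internet) address and re-send via the
    # Ethernet network layer using the lower-layer station identifier.
    net, station = forwarding_table[dest_address]
    return enet_send(data, protocol, station)
```

Sending to N hands the packet directly to station 18, while sending to E hands it to station 19 and relies on the router there to forward it onward.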

7.8.4 The Address Resolution Protocol

The forwarding table could simply be filled in by hand, by a network administrator who, every time a new station is added to an Ethernet, visits every station already on that Ethernet and adds an entry to its forwarding table. But the charm of manual network management quickly wears thin as the network grows in number of stations, so a more automatic procedure is usually implemented.

An elegant scheme, known as the address resolution protocol (ARP), takes advantage of the broadcast feature of Ethernet to dynamically fill in the forwarding table as it is needed. Suppose we start with an empty forwarding table and that an application calls the internet NETWORK_SEND interface in L, asking that a packet be sent to internet address M. The internet network layer in L looks in its local forwarding table and, finding nothing there that helps, asks the Ethernet network layer to send a query such as the following:

NETWORK_SEND (“where is M?”, 11, ARP, ENET, BROADCAST)

where 11 is the number of bytes in the query, ARP is the network-layer protocol we are using rather than INTERNET, and BROADCAST is the station identifier that is reserved for broadcast on this Ethernet.

Since this query uses the broadcast address, it will be received by the Ethernet network layer of every station on the attached Ethernet. Each station notices the ARP protocol type and passes the query to the ARP handler in its upper network layer. Each ARP handler checks the query and, if it discovers its own internet address in the inquiry, sends a response:

NETWORK_SEND (“M is at station 15”, 18, ARP, ENET, BROADCAST)


At most one station—the one whose internet address is named by the ARP request—will respond; all the others will ignore the ARP request. When the ARP response arrives at station 17, that station’s Ethernet network layer will pass it up to the ARP handler in its upper network layer, which will immediately add an entry relating address M to station 15 to its forwarding table:

internet address    Ethernet/station
M                   enet/15

The internet network handler of station 17 can now proceed with its originally requested send operation.

Suppose now that station L tries to send a packet to server E, which is on the internet but not directly attached to the Ethernet. In that case, server E does not hear the broadcast, but the router at station 19 does, and it sends a suitable ARP response instead. The forwarding table then has a second entry:

internet address    Ethernet/station
M                   enet/15
E                   enet/19

Station L can now send the packet to the router, which presumably knows how to forward the packet to its intended destination.

One more step is required—the server at E will not be able to reply to station L unless L is in its own forwarding table. This step is easy to arrange: whenever router K hears, via ARP, of the existence of a station on its attached Ethernet, it simply adds that internet address to the list of addresses that it advertises, and whatever routing protocol it is using will propagate that information throughout the internet. If hierarchical addresses are in use, the region designer might assign a region number to be used exclusively for all the stations on one Ethernet, to simplify routing.

Mappings from Ethernet station identifiers to the addresses of the higher network level are thus dynamically built up, and eventually station L will have the full table shown in Figure 7.45.
Typical systems deployed in the field have developed and refined this basic set of dynamic mapping ideas in many directions: the forwarding table is usually managed as a cache, with entries that time out or can be explicitly updated, to allow stations to change their station identifiers; the ARP response may also be noted by stations that didn’t send the original ARP request, for their own future reference; a newly attached station may, without being asked, broadcast what appears to be an ARP response simply to make itself known to existing stations (advertising); and there is even a reverse version of the ARP protocol that a station can use to ask if anyone knows its own higher-level network address, or to ask that a higher-level address be assigned to it. These refinements are not important to our case study, but many of them are essential to smooth network management.
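The lazy, broadcast-driven table fill that ARP performs can be sketched as a small cache. The class and message strings here are illustrative stand-ins, not the real ARP packet format:

```python
# Sketch of the ARP mechanism described above: a forwarding-table cache
# that is filled lazily by broadcasting a query when a lookup misses.
# Names and message formats are illustrative assumptions.

class ArpCache:
    def __init__(self, broadcast):
        self.table = {}             # internet address -> station identifier
        self.broadcast = broadcast  # sends an ARP query to every station

    def handle_response(self, internet_addr, station_id):
        # Invoked when an ARP response arrives; refinements in the text
        # also note responses to queries that other stations sent.
        self.table[internet_addr] = station_id

    def resolve(self, internet_addr):
        # On a miss, broadcast a query; the entry appears only once a
        # response has been handled, so the first lookup returns None.
        if internet_addr not in self.table:
            self.broadcast(f"where is {internet_addr}?")
        return self.table.get(internet_addr)
```

A production version would add the refinements listed above: entry timeouts, noting responses to other stations' queries, and unsolicited advertising by newly attached stations.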



7.9 War Stories: Surprises in Protocol Design

7.9.1 Fixed Timers Lead to Congestion Collapse in NFS

A classic example of congestion collapse appeared in early releases of the Sun Network File System (NFS), described in the case study in Section 4.5. The NFS server implemented at-least-once semantics with an idempotent stateless interface. The NFS client was programmed to be persistent: if it did not receive a response after some fixed number of seconds, it would resend its request, repeating forever if necessary. The server simply ran a first-in, first-out queue, so if several NFS clients happened to make requests of the server at about the same time, the server would handle the requests one at a time in the order that they arrived. These apparently plausible arrangements on the parts of the client and the server, respectively, set the stage for the problem.

As the number of clients increased, the length of the queue increased accordingly. With enough clients, the queue would grow long enough that some requests would time out before the server got to them. Those clients, upon timing out, would repeat their requests. In due course, the server would handle the original request of a client that had timed out, send a response, and that client would go away happy. But that client’s duplicate request was still in the server’s queue. The stateless NFS server had no way to tell that it had already handled the duplicate request, so when it got to the duplicate it would go ahead and handle it again, taking the same time as before and sending an unneeded response. The client ignored this response, but the time spent by the server handling the duplicate request was wasted, and the waste occurred at a time when the server could least afford it—it was already so heavily loaded that at least one client had timed out.
Once the server began wasting time handling duplicate requests, the queue grew still longer, causing more clients to time out, leading to more duplicate requests. The observed effect was that a steady increase of load would result in a steady increase of satisfied requests, up to the point that the server was near full capacity. If the load ever exceeded the capacity, even for a short time, every request from then on would time out and be duplicated, resulting in a doubling of the load on the server. That wasn’t the end—with a doubled load, clients would begin to time out a second time, send their requests yet again, thus tripling the load. From there, things would continue to deteriorate, with no way to recover. From the NFS server’s point of view, it was just doing what its clients were asking, but from the point of view of the clients the useful throughput had dropped to zero.

The solution to this problem was for the clients to switch to an exponential backoff algorithm in their choice of timer setting: each time a client timed out, it would double the size of its timer setting for the next repetition of the request.

Lesson: Fixed timers are always a source of trouble, sometimes catastrophic trouble.
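The client-side fix is small. This sketch shows the timeout schedule an exponentially backing-off client would use; the initial value and the cap are illustrative, not Sun's actual parameters:

```python
# Sketch of the exponential backoff fix described above: the timeout
# doubles after every unanswered request. Initial value and cap are
# illustrative assumptions, not the real NFS client parameters.

def retry_timeouts(initial_s, attempts, cap_s=60.0):
    # Returns the timeout used for the 1st, 2nd, ... attempt.
    timeout = initial_s
    schedule = []
    for _ in range(attempts):
        schedule.append(timeout)
        timeout = min(timeout * 2, cap_s)
    return schedule
```

Under overload, each retry now waits twice as long as the last, so the offered load from timed-out clients shrinks instead of doubling, giving the server's queue a chance to drain.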


7.9.2 Autonet Broadcast Storms

Autonet, an experimental local area network designed at the Digital Equipment Corporation Systems Research Center, handled broadcast in an elegant way. The network was organized as a tree. When a node sent a packet to the broadcast address, the network first routed the packet up to the root of the tree. The root turned the packet around and sent it down every path of the tree. Nodes accepted only packets going downward, so this procedure ensured that a broadcast packet would reach every connected node, but no more than once. But every once in a while, the network collapsed with a storm of repeated broadcast packets. Analysis of the software revealed no possible source of the problem. It took a hardware expert to figure it out.

The physical layer of the Autonet consisted of point-to-point coaxial cables. An interesting property of an unterminated coaxial cable is that it will almost perfectly reflect any signal sent down the cable. The reflection is known as an “echo”; echoes are one of the causes of ghosts in analog cable television systems. In the case of the Autonet, the network card in each node properly terminated the cable, eliminating echoes. But if someone disconnected a computer from the network and left the cable dangling, that cable would echo everything back to its source.

Suppose someone disconnects a cable, and someone else in the network sends a packet to the broadcast address. The network routes the packet up to the root of the tree; the root turns the packet around and sends it down the tree. When the packet hits the end of the unterminated cable, it reflects and returns to the other end of the cable looking like a new upward-bound packet with the broadcast address. The node at that end dutifully forwards the packet toward the root node, which, upon receipt, turns it around and sends it again. And again, and again, as fast as the network can carry the packet.
Lesson: Emergent properties often arise from the interaction of apparently unrelated system features operating at different system layers, in this case link-layer reflections and network-layer broadcasts.

7.9.3 Emergent Phase Synchronization of Periodic Protocols

Some network protocols involve periodic polling. Examples include picking up mail, checking for chat buddies, and sending “are-you-there?” inquiries for reassurance that a co-worker hasn’t crashed. For a specific example, a workstation might send a broadcast packet every five minutes to announce that it is still available for conversations. If there are dozens of such workstations on the same local area network, the designer would prefer that they not all broadcast simultaneously. One might assume that, even if they all broadcast with the same period, if they start at random their broadcasts would be out of phase, and it would take a special effort to synchronize their phases and keep them that way. Unfortunately, it is common to discover that they have somehow synchronized themselves and are all trying to broadcast at the same time.

How can this be? Suppose, for example, that each one of a group of workstations sends a broadcast and then sets a timer for a fixed interval. When the timer expires, it
sends another broadcast and, after sending, it again sets the timer. During the time that it is sending the broadcast message, the timer is not running. If a second workstation happens to send a broadcast during that time, both workstations take a network interrupt, and each accepts the other station's broadcast and makes a note of it, as might be expected. But the time required to handle the incoming broadcast interrupt slightly delays the start of the next timing cycle for both of the workstations, whereas broadcasts that arrive while a workstation's timer is running don't affect the timer. Although the delay is small, it shifts the timing of these two workstations' broadcasts relative to all of the other workstations. The next time this workstation's timer expires, it will again be interrupted by the other workstation, since they are both using the same timer value, and both of their timing cycles will again be slightly lengthened. The two workstations have formed a phase-locked group, and will remain that way indefinitely.

More important, the two workstations that were accidentally synchronized are now polling with a period that is slightly longer than that of all the other workstations. As a result, their broadcasts precess relative to the others, and eventually will overlap the time of broadcast of a third workstation. That workstation will then join the phase-locked group, increasing the rate of precession, and things continue from there. The problem is that the system design unintentionally includes an emergent phase-locked loop, similar to the one described on page 7–36.

The generic mechanism is that the supposedly "fixed" interval does not count the running time of the periodic program, and that running time is different when two or more participants happen to run concurrently. In a network, it is quite common to find that unsynchronized activities with identical timing periods become synchronized.

Lesson: Fixed timers have many evils.
Don't assume that unsynchronized periodic activities will stay that way.
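The phase-locking mechanism lends itself to a toy simulation. The code below is a sketch with invented parameters (a 300-second period, a 0.5-second interrupt-handling delay): only broadcasts that coincide with a station's own send delay the start of its next timer, so coinciding stations stay locked together and slowly precess relative to everyone else.

```python
# Minimal sketch (hypothetical model and parameters, not any real protocol):
# each station broadcasts, then sets a fixed timer. Handling a broadcast that
# arrives *while sending* (here: within EPS of our own send time) delays the
# start of our next timing cycle; interrupts during the timer's run do not.

PERIOD = 300.0   # seconds between broadcasts
DELAY  = 0.5     # time spent handling a concurrent broadcast
EPS    = 1.0     # "concurrent" means send times within this window

def step(next_send):
    """Advance every station past its next broadcast; return new send times."""
    out = []
    for i, t in enumerate(next_send):
        # count other stations broadcasting at (nearly) the same moment
        concurrent = sum(1 for j, u in enumerate(next_send)
                         if j != i and abs(u - t) < EPS)
        # interrupt handling delays the start of the next fixed timer
        out.append(t + DELAY * concurrent + PERIOD)
    return out

times = [0.0, 0.4, 150.0]          # stations 0 and 1 start almost in phase
for _ in range(10):
    times = step(times)
print([round(t % PERIOD, 1) for t in times])   # phase of each station
```

After ten cycles the phases are [5.0, 5.4, 150.0]: stations 0 and 1 keep their 0.4-second relative offset forever (they are phase-locked) while their common phase has drifted 5 seconds, precessing toward station 2's slot.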

7.9.4 Wisconsin Time Server Meltdown

NETGEAR®, a manufacturer of Ethernet and wireless equipment, added a feature to four of its low-cost wireless routers intended for home use: a log of packets that traverse the router. To be useful in debugging, the designers realized, the log needed to timestamp each entry, but adding timestamps required that the router know the current date and time. Since the router would be attached to the Internet, the designers added a few lines of code that invoked a simple network time service protocol known as SNTP. Since SNTP requires that the client invoke a specific time service, there remained a name discovery problem. They solved it by configuring the firmware with a fixed Internet address: the network address of one of the time servers operated by the University of Wisconsin. The designers surrounded this code with a persistent sender that would retry the protocol once per second until it received a response. Upon receiving a response, it refreshed the clock with another invocation of SNTP, using the same persistent sender, on a schedule ranging from once per minute to once per day, depending on the firmware version.


On May 14, 2003, at about 8:00 a.m. local time, the network staff at the University of Wisconsin noticed an abrupt increase in the rate of inbound Internet traffic at their connection to the Internet: the rate jumped from 20,000 packets per second to 60,000 packets per second. All of the extra traffic seemed to be SNTP packets targeting one of their time servers, and specifying the same UDP response port, port 23457. To prevent disruption to university network access, the staff installed a temporary filter at their border routers that discarded all incoming SNTP request packets that specified a response port of 23457. They also tried invoking an SNTP protocol access control feature in which the service can send a response saying, in effect, "go away", but it had no effect on the incoming packet flood.

Over the course of the next few weeks, SNTP packets continued to arrive at an increasing rate, soon reaching around 270,000 packets per second and consuming about 150 megabits per second of Internet connection capacity. Analysis of the traffic showed that the source addresses seemed to be legitimate and that any single source was sending a packet about once per second. A modest amount of sleuthing identified the NETGEAR routers as the source of the packets and the firmware as containing the target address and response port numbers. Deeper analysis established that the immediate difficulty was congestion collapse. NETGEAR had sold over 700,000 routers containing this code world-wide. As the number in operation increased, the load on the Wisconsin time service grew gradually until one day the response latency of the server exceeded one second. At that point, the NETGEAR router that made that request timed out and retried, thereby increasing its load on the time service, which increased the time service response latency for future requesters.
After a few such events, essentially all of the NETGEAR routers would start to time out, thereby multiplying the load they presented by a factor of 60 or more, which ensured that the server latency would continue to exceed their one-second timer. How Wisconsin and NETGEAR solved this problem, and at whose expense, is a whole separate tale.*

Lesson(s): There are several. (1) Fixed timers were once again found at the scene of an accident. (2) Configuring a fixed Internet address, which is overloaded with routing information, is a bad idea. In this case, the wired-in address made it difficult to repair the problem by rerouting requests to a different time service, such as one provided by NETGEAR. The address should have been a variable, preferably one that could be hidden with indirection (decouple modules with indirection). (3) There is a reason for features such as the "go away" response in SNTP; it is risky for a client to implement only part of a protocol.
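A back-of-envelope model (numbers invented, not Wisconsin's actual traffic) shows why the collapse is self-sustaining once latency crosses the one-second timer: each client retries every second for as long as it waits, so the offered load grows with latency, which grows with load.

```python
# Sketch of the meltdown dynamic: clients poll once per second, but any client
# whose request is not answered within its one-second fixed timer retries once
# per second until it is, multiplying the load. All numbers are hypothetical.

def offered_load(clients, capacity):
    """Requests/second arriving at the server after the feedback settles."""
    load = clients * 1.0            # base rate: one request per client-second
    for _ in range(100):            # iterate the feedback loop to a fixed point
        latency = max(1.0, load / capacity)   # seconds to get an answer
        # a client sends one request per second for the whole time it waits
        load = clients * latency
    return load

print(offered_load(clients=500_000, capacity=600_000))  # below capacity: stable
print(offered_load(clients=700_000, capacity=600_000))  # above capacity: load explodes
```

Below capacity the fixed point is just one request per client per second; above it, every pass through the loop multiplies the load, so the overload is not self-limiting. Exponential backoff, by contrast, would reduce the retry rate as latency grows.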

* For that story, see Dave Plonka's Web account, "Flawed Routers Flood University of Wisconsin Internet Time Server". This incident is also described in David Mills, Judah Levine, Richard Schmidt, and David Plonka. "Coping with overload on the Network Time Protocol public servers." Proceedings of the Precision Time and Time Interval (PTTI) Applications and Planning Meeting (Washington, D.C., December 2004), pages 5–16.

Saltzer & Kaashoek Ch. 7, p. 110

June 25, 2009 8:22 am



Exercises

7.1 Chapter 1 discussed four general methods for coping with complexity: modularity, abstraction, hierarchy, and layering. Which of those four methods does a protocol stack use as its primary organizing scheme? 1996–1–1e

7.2 The end-to-end argument
A. is a guideline for placing functions in a computer system;
B. is a rule for placing functions in a computer system;
C. is a debate about where to place functions in a computer system;
D. is a debate about anonymity in computer networks.
1999–2–03

7.3 Of the following, the best example of an end-to-end argument is:
A. If you laid all the Web hackers in the world end to end, they would reach from Cambridge to CERN.
B. Every byte going into the write end of a UNIX pipe eventually emerges from the pipe's read end.
C. Even if a chain manufacturer tests each link before assembly, he'd better test the completed chain.
D. Per-packet checksums must be augmented by a parity bit for each byte.
E. All important network communication functions should be moved to the application layer.
1998–2–01

7.4 Give two scenarios in the form of timing diagrams showing how a duplicate request might end up at a service. 1995–1–5a

7.5 After sending a frame, a certain piece of network software waits one second for an acknowledgment before retransmitting the frame. After each retransmission, it cuts delay in half, so after the first retransmission the wait is 1/2 second, after the second retransmission the wait is 1/4 second, etc. If it has reduced the delay to 1/1024
second without receiving an acknowledgment, the software gives up and reports to its caller that it was not able to deliver the frame.
7.5a. Is this a good way to manage retransmission delays for Ethernet? Why or why not? 1987–1–2a
7.5b. Is this a good way to manage retransmission delays for a receive-and-forward network? Why or why not? 1987–1–2b

7.6 Variable delay is an intrinsic problem of isochronous networks. True or False? 1995–1–1f

7.7 Host A is sending frames to host B over a noisy communication link. The median transit time over the communication link is 100 milliseconds. The probability of a frame being damaged en route in either direction across the communication link is α, and B can reliably detect the damage. When B gets a damaged frame it simply discards it. To ensure that frames arrive safely, B sends an acknowledgment back to A for every frame received intact.
7.7a. How long should A wait for a frame to be acknowledged before retransmitting it? 1987–1–3a
7.7b. What is the average number of times that A will have to send each frame? 1987–1–3b

7.8 Consider the protocol reference model of this chapter with the link, network, and end-to-end layers. Which of the following is a behavior of the reference model?
A. An end-to-end layer at an end host tells its network layer which network layer protocol to use to reach a destination.
B. The network layer at a router maintains a separate queue of packets for each end-to-end protocol.
C. The network layer at an end host looks at the end-to-end type field in the network header to decide which end-to-end layer protocol handler to invoke.
D. The link layer retransmits packets based on the end-to-end type of the packets: if the end-to-end protocol is reliable, then a link-layer retransmission occurs when a loss is detected at the link layer, otherwise not.
2000–2–02




7.9 Congestion is said to occur in a receive-and-forward network when
A. Communication stalls because of cycles in the flow-control dependencies.
B. The throughput demanded of a network link exceeds its capacity.
C. The volume of e-mail received by each user exceeds the rate at which users can read e-mail.
D. The load presented to a network link persistently exceeds its capacity.
E. The amount of space required to store routing tables at each node becomes burdensome.
1997–1–1e

7.10 Alice has arranged to send a stream of data to Bob using the following protocol:
• Each message segment has a block number attached to it; block numbers are consecutive starting with 1.
• Whenever Bob receives a segment of data with the number N he sends back an acknowledgment saying "OK to send block N + 1".
• Whenever Alice receives an "OK to send block K" she sends block K.
Alice initiates the protocol by sending a block numbered 1; she terminates the protocol by ignoring any "OK to send block K" for which K is larger than the number on the last block she wants to send. The network has been observed to never lose message segments, so Bob and Alice have made no provision for timer expirations and retries. They have also made no provision for deduplication. Unfortunately, the network systematically delivers every segment twice. Alice starts the protocol, planning to send a three-block stream. How many "OK to send block 4" responses does she ignore at the end? 1994–2–6

7.11 A and B agree to use a simple window protocol for flow control for data going from A to B: When the connection is first established, B tells A how many message segments B can accept, and as B consumes the segments it occasionally sends a message to A saying “you can send M more”. In operation, B notices that occasionally A sends more segments than it was supposed to. Explain. 1980–3–3

7.12 Assume a client and a service are directly connected by a private, 800,000 bytes per second link. Also assume that the client and the service produce and consume
message segments at the same rate. Using acknowledgments, the client measures the round-trip time between itself and the service to be 10 milliseconds.
7.12a. If the client is sending message segments that require 1000-byte frames, what is the smallest window size that allows the client to achieve 800,000 bytes per second throughput? 1995–2–2a
7.12b. One scheme for establishing the window size is similar to the slow start congestion control mechanism. The idea is that the client starts with a window size of one. For every segment received, the service responds with an acknowledgment telling the client to double the window size. The client does so until it realizes that there is no point in increasing it further. For the same parameters as in part 7.12a, how long would it take for the client to realize it has reached the maximum throughput? 1995–2–2b
7.12c. Another scheme for establishing the window size is called fast start. In (an oversimplified version of) fast start, the client simply starts sending segments as fast as it can, and watches to see when the first acknowledgment returns. At that point, it counts the number of outstanding segments in the pipeline, and sets the window size to that number. Again using the same parameters as in part 7.12a, how long will it take for the client to know it has achieved the maximum throughput? 1995–2–2c

7.13 A satellite in stationary orbit has a two-way data channel that can send frames containing up to 1000 data bytes in a millisecond. Frames are received without error after 249 milliseconds of propagation delay. A transmitter T frequently has a data file that takes 1000 of these maximal-length frames to send to a receiver R. T and R start using lock-step flow control. R allocates a buffer which can hold one message segment. As soon as the buffered segment is used and the buffer is available to hold new data, R sends an acknowledgment of the same length. T sends the next segment as soon as it sees the acknowledgment for the last one.
7.13a. What is the minimum time required to send the file? 1988–2–2a
7.13b. T and R decide that lock-step is too slow, so they change to a bang-bang protocol. A bang-bang protocol means that R sends explicit messages to T saying "go ahead" or "pause". The idea is that R will allocate a receive buffer of some size B and send a go-ahead message when it is ready to receive data. T then sends data segments as fast as the channel can absorb them. R sends a pause message at just the right time so that its buffer will not overflow even if R stops consuming message segments.




Suppose that R sends a go-ahead, and as soon as it sees the first data arrive it sends a pause. What is the minimum buffer size Bmin that it needs? 1988–2–2b
7.13c. What now is the minimum time required to send the file? 1988–2–2c

7.14 Some end-to-end protocols include a destination field in the end-to-end header. Why?
A. So the protocol can check that the network layer routed the packet containing the message segment correctly.
B. Because an end-to-end argument tells us that routing should be performed at the end-to-end layer.
C. Because the network layer uses the end-to-end header to route the packet.
D. Because the end-to-end layer at the sender needs it to decide which network protocol to use.
2000–2–09

7.15 One value of hierarchical naming of network attachment points is that it allows a reduction in the size of routing tables used by packet forwarders. Do the packet forwarders themselves have to be organized hierarchically to take advantage of this space reduction? 1994–2–5

7.16 The System Network Architecture (SNA) protocol family developed by IBM uses a flow control mechanism called pacing. With pacing, a sender may transmit a fixed number of message segments, and then must pause. When the receiver has accepted all of these segments, it can return a pacing response to the sender, which can then send another burst of message segments. Suppose that this scheme is being used over a satellite link, with a delay from earth station to earth station of 250 milliseconds. The frame size on the link is 1000 bits, four segments are sent before pausing for a pacing response, and the satellite channel has a data rate of one megabit per second.
7.16a. The timing diagram below illustrates the frame carrying the first segment. Fill in the diagram to show the next six frames exchanged in the pacing system. Assume no frames are lost, delays are uniform, and sender and receiver have no internal
delays (for example, the first bit of the second frame may immediately follow the last bit of the first).

[Timing diagram: one vertical time line for the sender and one for the receiver, labeled in milliseconds. At sender time 0 the first bit of the first frame leaves the sender; at time 1 the last bit leaves the sender. At time 250 the first bit of the first frame arrives at the receiver; at time 251 the last bit arrives at the receiver.]

7.16b. What is the maximum fraction of the available satellite capacity that can be used by this pacing scheme?
7.16c. We would like to increase the utilization of the channel to 50%, but we can't increase the frame size. How many message segments would have to be sent between pacing responses to achieve this capacity?
1982–3–4

7.17 Which are true statements about network address translators as described in Section 7.4.5?
A. NATs break the universal addressing scheme of the Internet.
B. NATs break the layering abstraction of the network model of Chapter 7.
C. NATs increase the consumption of Internet addresses.
D. NATs address the problem that the Internet has a shortage of Internet addresses.
E. NATs constrain the design of new end-to-end protocols.
F. When a NAT translates the Internet address of a packet, it must also modify the Ethernet checksum, to ensure that the packet is not discarded by the next router that handles it.
G. When a packet from the public Internet arrives at a NAT box for delivery to a host behind the NAT, the NAT must examine the payload and translate any Internet addresses found therein. (The client application might be sending its Internet address in the TCP payload to the server.)
H. Clients behind a NAT cannot communicate with servers that are behind the same NAT because the NAT does not know how to forward those packets.
2001–2–01, 2002–2–02, and 2004–2–2




7.18 Some network protocols deal with both big-endian and little-endian clients by providing two different network ports. Big-endian clients send requests and data to one port, while little-endian clients send requests and data to the other. The service may, of course, be implemented on either a big-endian or a little-endian machine. This approach is unusual: most Internet protocols call for just one network port, and require that all data be presented at that port in "network standard form", which is big-endian. Explain the advantage of the two-port structure as compared with the usual structure. 1994–1–2

7.19 Ethernet cannot scale to large sizes because a centralized mechanism is used to control network contention. True or False? 1994–1–3b

7.20 Ethernet
A. uses luminiferous ether to carry packets.
B. uses Manchester encoding to frame bits.
C. uses exponential back-off to resolve repeated conflicts between multiple senders.
D. uses retransmissions to avoid congestion.
E. delegates arbitration of conflicting transmissions to each station.
F. always guarantees the delivery of packets.
G. can support an unbounded number of computers.
H. has limited physical range.
1999–2–01, 2000–1–04

7.21 Ethernet cards have unique addresses built into them. What role do these unique addresses play in the Internet?
A. None. They are there for Macintosh compatibility only.
B. A portion of the Ethernet address is used as the Internet address of the computer using the card.
C. They provide routing information for packets destined to non-local subnets.
D. They are used as private keys in the Security Layer of the ISO protocol.
E. They provide addressing within each subnet for an Internet address resolution protocol.
F. They provide secure identification for warranty service.
1998–2–02

7.22 If eight stations on an Ethernet all want to transmit one packet, which of the following statements is true?


A. It is guaranteed that all transmissions will succeed.
B. With high probability all stations will eventually end up being able to transmit their data successfully.
C. Some of the transmissions may eventually succeed, but it is likely some may not.
D. It is likely that none of the transmissions will eventually succeed.
2004–1–3

7.23 Ben Bitdiddle has been thinking about remote procedure call. He remembers that one of the problems with RPC is the difficulty of passing pointers: since pointers are really just addresses, if the service dereferences a client pointer, it’ll get some value from its address space, rather than the intended value in the client’s address space. Ben decides to redesign his RPC system to always pass, in the place of a bare pointer, a structure consisting of the original pointer plus a context reference. Louis Reasoner, excited by Ben’s insight, decides to change all end-to-end protocols along the same lines. Argue for or against Louis’s decision. 1996–2–1a

7.24 Alyssa's mobiles:* Alyssa P. Protocol-Hacker is designing an end-to-end protocol for locating mobile hosts. A mobile host is a computer that plugs into the network at different places at different times, and gets assigned a new network address at each place. The system she starts with assigns each host a home location, which can be found simply by looking the user up in a name service. Her end-to-end protocol will use a network that can reorder packets, but doesn't ever lose or duplicate them.
Her first protocol is simple: every time a user moves, store a forwarding pointer at the previous location, pointing to the new location. This creates a chain of forwarding pointers with the permanent home location at the beginning and the mobile host at the end. Packets meant for the mobile host are sent to the home location, which forwards them along the chain until they reach the mobile host itself. (The chain is truncated when a mobile host returns to a previously visited location.) Alyssa notices that because of the long chains of forwarding pointers, performance generally gets worse each time she moves her mobile host.
Alyssa's first try at fixing the problem works like this: each time a mobile host moves, it sends a message to its home location indicating its new location. The home location maintains a pointer to the new location. With this protocol, there are no chains at all. Places other than the home location do not maintain forwarding information.
7.24a. When this protocol is implemented, Alyssa notices that packets regularly get lost when she moves from one location to another. Explain why or give an example.

Alyssa is disappointed with her first attempt, and decides to start over. In her new scheme, no forwarding pointers are maintained anywhere, not even at the home

* Credit for developing exercise 7.24 goes to Anant Agarwal.




node. Say a packet destined for a mobile host A arrives at a node N. If N can directly communicate with A, then N sends the packet to A, and we're done. Otherwise, N broadcasts a search request for A to all the other fixed nodes in the network. If A is near a different fixed node N', then N' responds to the search request. On receiving this response, N forwards the packet for A to N'.
7.24b. Will packets get lost with this protocol, even if A moves before the packet gets to N'? Explain.

Unfortunately the network doesn't support broadcast efficiently, so Alyssa goes back to the keyboard and tries again. Her third protocol works like this: each time a mobile host moves, say from N to N', a forwarding pointer is stored at N pointing to N'. Every so often, the mobile host sends a message to its permanent home node with its current location. Then, the home node propagates a message down the forwarding chain, asking the intermediate nodes to delete their forwarding state.
7.24c. Can Alyssa ever lose packets with this protocol? Explain. (Hint: think about the properties of the underlying network.)
7.24d. What additional steps can the home node take to ensure that the scheme in question 7.24c never loses packets?
1996–2–2

7.25 ByteStream Inc. sells three data-transfer products: Send-and-wait, Blast, and Flow-control. Mike R. Kernel is deciding which product to use. The protocols work as follows:
• Send-and-wait sends one segment of a message and then waits for an acknowledgment before sending the next segment.
• Flow-control uses a sliding window of 8 segments. The sender sends until the window closes (i.e., until there are 8 unacknowledged segments). The receiver sends an acknowledgment as soon as it receives a segment. Each acknowledgment opens the sender's window by one segment.
• Blast uses only one acknowledgment. The sender blasts all the segments of a message to the receiver as fast as the network layer can accept them. The last segment of the blast contains a bit indicating that it is the last segment of the message. After sending all segments in a single blast, the sender waits for one acknowledgment from the receiver. The receiver sends an acknowledgment as soon as it receives the last segment.
Mike asks you to help him compute for each protocol its maximum throughput. He is planning to use a 1,000,000 bytes per second network that has a packet size of 1,000 bytes. The propagation time from the sender to the receiver is 500 microseconds. To simplify the calculation, Mike suggests making the following approximations: (1) there is no processing time at the sender and the receiver; (2) the time to send an acknowledgment is just the propagation time (the number of data
bytes in an ACK is zero); (3) the data segments are always 1,000 bytes; and (4) all headers are zero-length. He also assumes that the underlying communication medium is perfect (frames are not lost, frames are not duplicated, etc.) and that the receiver has unlimited buffering.
7.25a. What is the maximum throughput for Send-and-wait?
7.25b. What is the maximum throughput for Flow-control?
7.25c. What is the maximum throughput for Blast?

Mike needs to choose one of the three protocols for an application which periodically sends arbitrary-sized messages. He has a reliable network, but his application involves unpredictable computation times at both the sender and the receiver. And this time the receiver has a 20,000-byte receive buffer.
7.25d. Which product should he choose for maximum reliable operation?
A. Send-and-wait, the others might hang.
B. Blast, which outperforms the others.
C. Flow-control, since Blast will be unreliable and Send-and-wait is slower.
D. There is no way to tell from the information given.
1997–2–2

7.26 Suppose the longest packet you can transmit across the Internet can contain 480 bytes of useful data, you are using a lock-step end-to-end protocol, and you are sending data from Boston to California. You have measured the round-trip time and found that it is about 100 milliseconds.
7.26a. If there are no lost packets, estimate the maximum data rate you can achieve.
7.26b. Unfortunately, 1% of the packets are getting lost. So you install a resend timer, set to 1000 milliseconds. Estimate the data rate you now expect to achieve.
7.26c. On Tuesdays the phone company routes some westward-bound packets via satellite link, and we notice that 50% of the round trips now take exactly 100 extra milliseconds. What effect does this delay have on the overall data rate when the resend timer is not in use? (Assume the network does not lose any packets.)
7.26d. Ben turns on the resend timer, but since he hadn't heard about the satellite delays he sets it to 150 milliseconds. What now is the data rate on Tuesdays? (Again, assume the network does not lose any packets.)
7.26e. Usually, when discussing end-to-end data rate across a network, the first parameter one hears is the data rate of the slowest link in the network. Why wasn't that parameter needed to answer any of the previous parts of this question?
1994–1–5




7.27 Ben Bitdiddle is called in to consult for Microhard. Bill Doors, the CEO, has set up an application to control the Justice department in Washington, D.C. The client running on the TNT operating system makes RPC calls from Seattle to the server running in Washington, D.C. The server also runs on TNT (surprise!). Each RPC call instructs the Justice department on how to behave; the response acknowledges the request but contains no data (the Justice department always complies with requests from Microhard). Bill Doors, however, is unhappy with the number of requests that he can send to the Justice department. He therefore wants to improve TNT's communication facilities.
Ben observes that the Microhard application runs in a single thread and uses RPC. He also notices that the link between Seattle and Washington, D.C. is reliable. He then proposes that Microhard enhance TNT with a new communication primitive, pipe calls. Like RPCs, pipe calls initiate remote computation on the server. Unlike RPCs, however, pipe calls return immediately to the caller and execute asynchronously on the server. TNT packs multiple pipe calls into request messages that are 1000 bytes long. TNT sends the request message to the server as soon as one of the following two conditions becomes true: 1) the message is full, or 2) the message contains at least 1 pipe call and it has been 1 second since the client last performed a pipe call. Pipe calls have no acknowledgments. Pipe calls are not synchronized with respect to RPC calls.
Ben quickly settles down to work and measures the network traffic between Seattle and Washington. Here is what he observes:

Seattle to D.C. transit time: 12.5 x 10^-3 seconds
D.C. to Seattle transit time: 12.5 x 10^-3 seconds
Channel bandwidth in each direction: 1.5 x 10^6 bits per second
RPC or pipe data per call: 10 bytes
Network overhead per message: 40 bytes
Size of RPC request message (per call): 50 bytes = 10 bytes data + 40 bytes overhead
Size of pipe request message: 1000 bytes (96 pipe calls per message)
Size of RPC reply message (no data): 50 bytes
Client computation time per request: 100 x 10^-6 seconds
Server computation time per request: 50 x 10^-6 seconds



The Microhard application is the only one sending messages on the link.
7.27a. What is the transmission delay the client thread observes in sending an RPC request message?
7.27b. Assuming that only RPCs are used for remote requests, what is the maximum number of RPCs per second that will be executed by this application?
7.27c. Assuming that all RPC calls are changed to pipe calls, what is the maximum number of pipe calls per second that will be executed by this application?
7.27d. Assuming that every pipe call includes a serial number argument, and serial numbers increase by one with every pipe call, how could you know the last pipe call was executed?
A. Ensure that serial numbers are synchronized to the time of day clock, and wait at the client until the time of the last serial number.
B. Call an RPC both before and after the pipe call, and wait for both calls to return.
C. Call an RPC passing as an argument the serial number that was sent on the last pipe call, and design the remote procedure called to not return until a pipe call with the given serial number has been processed.
D. Stop making pipe calls for twice the maximum network delay, and reset the serial number counter to zero.
1998–1–2a…d

7.28 Alyssa P. Hacker is implementing a client/service spell checker in which a network will stand between the client and the service. The client scans an ASCII file, sending each word to the service in a separate message. The service checks each word against its database of correctly spelled words and returns a one-bit answer. The client displays the list of incorrectly spelled words.

7.28a. The client’s cost for preparing a message to be sent is 1 millisecond, regardless of length. The network transit time is 10 milliseconds, and the network data rate is infinite. The service can look up a word and determine whether or not it is misspelled in 100 microseconds. Since the service runs on a supercomputer, its cost for preparing a message to be sent is zero milliseconds. Both the client and the service can receive messages with no overhead. How long will Alyssa’s design take to spell check a 1,000-word file if she uses RPC for communication (ignore acknowledgments to requests and replies, and assume that messages are not lost or reordered)?

7.28b. Alyssa does the same computations that you did and decides that the design is too slow. She decides to group several words into each request. If she packs 10 words in each request, how long will it take to spell check the same file?

7.28c. Alyssa decides that grouping words still isn’t fast enough, so she wants to know how long it would take if she used an asynchronous message protocol (with




grouping words) instead of RPC. How long will it take to spell check the same file? (For this calculation, assume that messages are not lost or reordered.)

7.28d. Alyssa is so pleased with the performance of this last design that she decides to use it (without grouping) for a banking system. The service maintains a set of accounts and processes requests to debit and credit accounts (i.e., modify account balances). One day Alyssa deposits $10,000 and transfers it to Ben’s account immediately afterwards. The transfer fails with a reply saying she is overdrawn. But when she checks her balance afterwards, the $10,000 is there! Draw a time diagram explaining these events.

1996–1–4a…d
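A small cost model helps check the arithmetic behind this exercise. The functions below are our own sketch, not the book's solution; the exercise expects you to derive the numbers by hand, and the asynchronous model in particular rests on an overlap assumption stated in its comment.

```python
# Costs from the problem statement of exercise 7.28.
CLIENT_PREP_S = 1e-3   # client cost to prepare one message, any length
TRANSIT_S = 10e-3      # one-way network transit time
LOOKUP_S = 100e-6      # service time to check one word

def rpc_total(words, words_per_msg=1):
    """Synchronous RPC: the client waits out a full round trip per message."""
    msgs = words // words_per_msg
    per_msg = CLIENT_PREP_S + TRANSIT_S + words_per_msg * LOOKUP_S + TRANSIT_S
    return msgs * per_msg

def async_total(words, words_per_msg=1):
    """Asynchronous sends: the client streams requests without waiting for
    replies. Assuming the service keeps up (its per-message lookup time does
    not exceed the client's preparation time), only the last message's
    transits and lookups extend past the client's own preparation work."""
    msgs = words // words_per_msg
    return msgs * CLIENT_PREP_S + 2 * TRANSIT_S + words_per_msg * LOOKUP_S
```

Trying `rpc_total(1000)`, `rpc_total(1000, 10)`, and `async_total(1000, 10)` shows how dramatically grouping and asynchrony shrink the total under this model; whether these match the intended answers depends on which costs the exercise means to include.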

Additional exercises relating to Chapter 7 can be found in problem sets 17 through 25.



CHAPTER 8
Fault Tolerance: Reliable Systems from Unreliable Components



CHAPTER CONTENTS
Overview..........................................................................................8–2

8.1 Faults, Failures, and Fault Tolerant Design................................8–3

8.1.1 Faults, Failures, and Modules ................................................. 8–3

8.1.2 The Fault-Tolerance Design Process ........................................ 8–6

8.2 Measures of Reliability and Failure Tolerance............................8–8

8.2.1 Availability and Mean Time to Failure ...................................... 8–8

8.2.2 Reliability Functions ............................................................ 8–13

8.2.3 Measuring Fault Tolerance ................................................... 8–16

8.3 Tolerating Active Faults...........................................................8–16

8.3.1 Responding to Active Faults ................................................. 8–16

8.3.2 Fault Tolerance Models ........................................................ 8–18

8.4 Systematically Applying Redundancy ......................................8–20

8.4.1 Coding: Incremental Redundancy ......................................... 8–21

8.4.2 Replication: Massive Redundancy ......................................... 8–25

8.4.3 Voting .............................................................................. 8–26

8.4.4 Repair .............................................................................. 8–31

8.5 Applying Redundancy to Software and Data ............................8–36

8.5.1 Tolerating Software Faults ................................................... 8–36

8.5.2 Tolerating Software (and other) Faults by Separating State ...... 8–37

8.5.3 Durability and Durable Storage ............................................ 8–39

8.5.4 Magnetic Disk Fault Tolerance .............................................. 8–40
      Magnetic Disk Fault Modes ............................................ 8–41
      System Faults ............................................................. 8–42
      Raw Disk Storage ........................................................ 8–43
      Fail-Fast Disk Storage................................................... 8–43
      Careful Disk Storage .................................................... 8–45
      Durable Storage: RAID 1 .............................................. 8–46
      Improving on RAID 1 ................................................... 8–47
      Detecting Errors Caused by System Crashes.................... 8–49
      Still More Threats to Durability ...................................... 8–49

8.6 Wrapping up Reliability ...........................................................8–51


8.6.1 Design Strategies and Design Principles ................................ 8–51

8.6.2 How about the End-to-End Argument? .................................. 8–52

8.6.3 A Caution on the Use of Reliability Calculations ...................... 8–53

8.6.4 Where to Learn More about Reliable Systems ......................... 8–53

8.7 Application: A Fault Tolerance Model for CMOS RAM ...............8–55

8.8 War Stories: Fault Tolerant Systems that Failed......................8–57

8.8.1 Adventures with Error Correction ......................................... 8–57

8.8.2 Risks of Rarely-Used Procedures: The National Archives .......... 8–59

8.8.3 Non-independent Replicas and Backhoe Fade ......................... 8–60

8.8.4 Human Error May Be the Biggest Risk ................................... 8–61

8.8.5 Introducing a Single Point of Failure ..................................... 8–63

8.8.6 Multiple Failures: The SOHO Mission Interruption ................... 8–63

Exercises........................................................................................8–64
Glossary for Chapter 8 ...................................................................8–69
Index of Chapter 8 .........................................................................8–75
Last chapter page 8–77

Overview

Construction of reliable systems from unreliable components is one of the most important applications of modularity. There are, in principle, three basic steps to building reliable systems:

1. Error detection: discovering that there is an error in a data value or control signal. Error detection is accomplished with the help of redundancy, extra information that can verify correctness.

2. Error containment: limiting how far the effects of an error propagate. Error containment comes from careful application of modularity. When discussing reliability, a module is usually taken to be the unit that fails independently of other such units. It is also usually the unit of repair and replacement.

3. Error masking: ensuring correct operation despite the error. Error masking is accomplished by providing enough additional redundancy that it is possible to discover correct, or at least acceptably close, values of the erroneous data or control signal. When masking involves changing incorrect values to correct ones, it is usually called error correction.

Since these three steps can overlap in practice, one sometimes finds a single error-handling mechanism that merges two or even all three of the steps. In earlier chapters each of these ideas has already appeared in specialized forms:

• A primary purpose of enforced modularity, as provided by client/server architecture, virtual memory, and threads, is error containment.



• Network links typically use error detection to identify and discard damaged frames.
• Some end-to-end protocols time out and resend lost data segments, thus masking the loss.
• Routing algorithms find their way around links that fail, masking those failures.
• Some real-time applications fill in missing data by interpolation or repetition, thus masking loss.

and, as we will see in Chapter 11 [on-line], secure systems use a technique called defense in depth both to contain and to mask errors in individual protection mechanisms. In this chapter we explore systematic application of these techniques to more general problems, as well as learn about both their power and their limitations.

8.1 Faults, Failures, and Fault Tolerant Design

8.1.1 Faults, Failures, and Modules

Before getting into the techniques of constructing reliable systems, let us distinguish between concepts and give them separate labels. In ordinary English discourse, the three words “fault,” “failure,” and “error” are used more or less interchangeably or at least with strongly overlapping meanings. In discussing reliable systems, we assign these terms to distinct formal concepts. The distinction involves modularity. Although common English usage occasionally intrudes, the distinctions are worth maintaining in technical settings.

A fault is an underlying defect, imperfection, or flaw that has the potential to cause problems, whether it actually has, has not, or ever will. A weak area in the casing of a tire is an example of a fault. Even though the casing has not actually cracked yet, the fault is lurking. If the casing cracks, the tire blows out, and the car careens off a cliff, the resulting crash is a failure. (That definition of the term “failure” by example is too informal; we will give a more careful definition in a moment.) One fault that underlies the failure is the weak spot in the tire casing. Other faults, such as an inattentive driver and lack of a guard rail, may also contribute to the failure.

Experience suggests that faults are commonplace in computer systems. Faults come from many different sources: software, hardware, design, implementation, operations, and the environment of the system. Here are some typical examples:

• Software fault: A programming mistake, such as placing a less-than sign where there should be a less-than-or-equal sign. This fault may never have caused any trouble because the combination of events that requires the equality case to be handled correctly has not yet occurred. Or, perhaps it is the reason that the system crashes twice a day. If so, those crashes are failures.


• Hardware fault: A gate whose output is stuck at the value ZERO. Until something depends on the gate correctly producing the output value ONE, nothing goes wrong. If you publish a paper with an incorrect sum that was calculated by this gate, a failure has occurred. Furthermore, the paper now contains a fault that may lead some reader to do something that causes a failure elsewhere.

• Design fault: A miscalculation that has led to installing too little memory in a telephone switch. It may be months or years until the first time that the presented load is great enough that the switch actually begins failing to accept calls that its specification says it should be able to handle.

• Implementation fault: Installing less memory than the design called for. In this case the failure may be identical to the one in the previous example of a design fault, but the fault itself is different.

• Operations fault: The operator responsible for running the weekly payroll ran the payroll program twice last Friday. Even though the operator shredded the extra checks, this fault has probably filled the payroll database with errors such as wrong values for year-to-date tax payments.

• Environment fault: Lightning strikes a power line, causing a voltage surge. The computer is still running, but a register that was being updated at that instant now has several bits in error. Environment faults come in all sizes, from bacteria contaminating ink-jet printer cartridges to a storm surge washing an entire building out to sea.

Some of these examples suggest that a fault may either be latent, meaning that it isn’t affecting anything right now, or active. When a fault is active, wrong results appear in data values or control signals. These wrong results are errors. If one has a formal specification for the design of a module, an error would show up as a violation of some assertion or invariant of the specification.
The violation means that either the formal specification is wrong (for example, someone didn’t articulate all of the assumptions) or a module that this component depends on did not meet its own specification. Unfortunately, formal specifications are rare in practice, so discovery of errors is more likely to be somewhat ad hoc.

If an error is not detected and masked, the module probably does not perform to its specification. Not producing the intended result at an interface is the formal definition of a failure. Thus, the distinction between fault and failure is closely tied to modularity and the building of systems out of well-defined subsystems. In a system built of subsystems, the failure of a subsystem is a fault from the point of view of the larger subsystem that contains it. That fault may cause an error that leads to the failure of the larger subsystem, unless the larger subsystem anticipates the possibility of the first one failing, detects the resulting error, and masks it. Thus, if you notice that you have a flat tire, you have detected an error caused by failure of a subsystem you depend on. If you miss an appointment because of the flat tire, the person you intended to meet notices a failure of a larger subsystem. If you change to a spare tire in time to get to the appointment, you have masked the error within your subsystem. Fault tolerance thus consists of noticing active faults and component subsystem failures and doing something helpful in response.

One such helpful response is error containment, which is another close relative of modularity and the building of systems out of subsystems. When an active fault causes an error in a subsystem, it may be difficult to confine the effects of that error to just a portion of the subsystem. On the other hand, one should expect that, as seen from outside that subsystem, the only effects will be at the specified interfaces of the subsystem. In consequence, the boundary adopted for error containment is usually the boundary of the smallest subsystem inside which the error occurred. From the point of view of the next higher-level subsystem, the subsystem with the error may contain the error in one of four ways:

1. Mask the error, so the higher-level subsystem does not realize that anything went wrong. One can think of failure as falling off a cliff and masking as a way of providing some separation from the edge.

2. Detect and report the error at its interface, producing what is called a fail-fast design. Fail-fast subsystems simplify the job of detection and masking for the next higher-level subsystem. If a fail-fast module correctly reports that its output is questionable, it has actually met its specification, so it has not failed. (Fail-fast modules can still fail, for example by not noticing their own errors.)

3. Immediately stop dead, thereby hoping to limit propagation of bad values, a technique known as fail-stop. Fail-stop subsystems require that the higher-level subsystem take some additional measure to discover the failure, for example by setting a timer and responding to its expiration.
A problem with fail-stop design is that it can be difficult to distinguish a stopped subsystem from one that is merely running more slowly than expected. This problem is particularly acute in asynchronous systems.

4. Do nothing, simply failing without warning. At the interface, the error may have contaminated any or all output values. (Informally called a “crash” or perhaps “fail-thud”.)

Another useful distinction is that of transient versus persistent faults. A transient fault, also known as a single-event upset, is temporary, triggered by some passing external event such as lightning striking a power line or a cosmic ray passing through a chip. It is usually possible to mask an error caused by a transient fault by trying the operation again. An error that is successfully masked by retry is known as a soft error. A persistent fault continues to produce errors, no matter how many times one retries, and the corresponding errors are called hard errors. An intermittent fault is a persistent fault that is active only occasionally, for example, when the noise level is higher than usual but still within specifications.

Finally, it is sometimes useful to talk about latency, which in reliability terminology is the time between when a fault causes an error and when the error is detected or causes the module to fail. Latency can be an important parameter because some error-detection and error-masking mechanisms depend on there being at most a small fixed number of errors—often just one—at a time. If the error latency is large, there may be time for a second error to occur before the first one is detected and masked, in which case masking of the first error may not succeed. Also, a large error latency gives time for the error to propagate and may thus complicate containment.

Using this terminology, an improperly fabricated stuck-at-ZERO bit in a memory chip is a persistent fault: whenever the bit should contain a ONE the fault is active and the value of the bit is in error; at times when the bit is supposed to contain a ZERO, the fault is latent. If the chip is a component of a fault tolerant memory module, the module design probably includes an error-correction code that prevents that error from turning into a failure of the module. If a passing cosmic ray flips another bit in the same chip, a transient fault has caused that bit also to be in error, but the same error-correction code may still be able to prevent this error from turning into a module failure. On the other hand, if the error-correction code can handle only single-bit errors, the combination of the persistent and the transient fault might lead the module to produce wrong data across its interface, a failure of the module. If someone were then to test the module by storing new data in it and reading it back, the test would probably not reveal a failure because the transient fault does not affect the new data. Because simple input/output testing does not reveal successfully masked errors, a fault tolerant module design should always include some way to report that the module masked an error. If it does not, the user of the module may not realize that persistent errors are accumulating but hidden.
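The interaction just described, a single-error-correcting mechanism defeated by two simultaneous errors, can be illustrated with the simplest masking scheme: triple replication with majority voting. This is a generic sketch of ours (Section 8.4.3 treats voting properly); note that the function also reports when it masks an error, following the advice above.

```python
def vote(bits):
    """Majority vote over three replicas of one bit. Returns the voted
    value plus a flag reporting whether an error was masked, since a
    fault tolerant module should report the errors it masks."""
    assert len(bits) == 3
    value = 1 if sum(bits) >= 2 else 0
    masked = any(b != value for b in bits)
    return value, masked

# One replica stuck at ZERO while storing a ONE: the single error is
# masked, and the masking is reported.
print(vote([0, 1, 1]))   # (1, True)

# A cosmic-ray bit flip in a second replica on top of the stuck bit:
# the vote now produces the wrong value -- a failure of the module.
print(vote([0, 0, 1]))   # (0, True)
```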

8.1.2 The Fault-Tolerance Design Process

One way to design a reliable system would be to build it entirely of components that are individually so reliable that their chance of failure can be neglected. This technique is known as fault avoidance. Unfortunately, it is hard to apply this technique to every component of a large system. In addition, the sheer number of components may defeat the strategy. If all N of the components of a system must work, the probability of any one component failing is p, and component failures are independent of one another, then the probability that the system works is (1 – p)^N. No matter how small p may be, there is some value of N beyond which this probability becomes too small for the system to be useful.

The alternative is to apply various techniques that are known collectively by the name fault tolerance. The remainder of this chapter describes several such techniques that are the elements of an overall design process for building reliable systems from unreliable components. Here is an overview of the fault-tolerance design process:

1. Begin to develop a fault-tolerance model, as described in Section 8.3:
   • Identify every potential fault.
   • Estimate the risk of each fault, as described in Section 8.2.
   • Where the risk is too high, design methods to detect the resulting errors.



2. Apply modularity to contain the damage from the high-risk errors.

3. Design and implement procedures that can mask the detected errors, using the techniques described in Section 8.4:
   • Temporal redundancy. Retry the operation, using the same components.
   • Spatial redundancy. Have different components do the operation.

4. Update the fault-tolerance model to account for those improvements.

5. Iterate the design and the model until the probability of untolerated faults is low enough that it is acceptable.

6. Observe the system in the field:
   • Check logs of how many errors the system is successfully masking. (Always keep track of the distance to the edge of the cliff.)
   • Perform postmortems on failures and identify all of the reasons for each failure.

7. Use the logs of masked faults and the postmortem reports about failures to revise and improve the fault-tolerance model and reiterate the design.
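The fault-avoidance arithmetic that motivates this whole process, the probability (1 – p)^N that all N components work, is easy to explore numerically. A quick sketch (the failure probabilities are illustrative values of our choosing):

```python
def p_system_works(p_fail, n):
    """Probability that a system of n components works when every
    component must work and each fails independently with
    probability p_fail."""
    return (1.0 - p_fail) ** n

# Even excellent components defeat a large enough system:
print(p_system_works(1e-6, 1_000))       # roughly 0.999 -- fine
print(p_system_works(1e-6, 10_000_000))  # roughly 0.000045 -- useless
```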

The fault-tolerance design process includes some subjective steps, for example, deciding that a risk of failure is “unacceptably high” or that the “probability of an untolerated fault is low enough that it is acceptable.” It is at these points that different application requirements can lead to radically different approaches to achieving reliability. A personal computer may be designed with no redundant components, the computer system for a small business is likely to make periodic backup copies of most of its data and store the backup copies at another site, and some space-flight guidance systems use five completely redundant computers designed by at least two independent vendors. The decisions required involve trade-offs between the cost of failure and the cost of implementing fault tolerance. These decisions can blend into decisions involving business models and risk management. In some cases it may be appropriate to opt for a nontechnical solution, for example, deliberately accepting an increased risk of failure and covering that risk with insurance.

The fault-tolerance design process can be described as a safety-net approach to system design. The safety-net approach involves application of some familiar design principles and also some not previously encountered. It starts with a new design principle:

Be explicit
Get all of the assumptions out on the table.

The primary purpose of creating a fault-tolerance model is to expose and document the assumptions and articulate them explicitly. The designer needs to have these assumptions not only for the initial design, but also in order to respond to field reports of unexpected failures. Unexpected failures represent omissions or violations of the assumptions.

Assuming that you won’t get it right the first time, the second design principle of the safety-net approach is the familiar design for iteration. It is difficult or impossible to anticipate all of the ways that things can go wrong. Moreover, when working with a fast-changing technology it can be hard to estimate probabilities of failure in components and in their organization, especially when the organization is controlled by software. For these reasons, a fault tolerant design must include feedback about actual error rates, evaluation of that feedback, and update of the design as field experience is gained. These two principles interact: to act on the feedback requires having a fault tolerance model that is explicit about reliability assumptions.

The third design principle of the safety-net approach is also familiar: the safety margin principle, described near the end of Section 1.3.2. An essential part of a fault tolerant design is to monitor how often errors are masked. When fault tolerant systems fail, it is usually not because they had inadequate fault tolerance, but because the number of failures grew unnoticed until the fault tolerance of the design was exceeded. The key requirement is that the system log all failures and that someone pay attention to the logs. The biggest difficulty to overcome in applying this principle is that it is hard to motivate people to expend effort checking something that seems to be working.

The fourth design principle of the safety-net approach came up in the introduction to the study of systems; it shows up here in the instruction to identify all of the causes of each failure: keep digging. Complex systems fail for complex reasons. When a failure of a system that is supposed to be reliable does occur, always look beyond the first, obvious cause.
It is nearly always the case that there are actually several contributing causes and that there was something about the mind set of the designer that allowed each of those causes to creep in to the design.

Finally, complexity increases the chances of mistakes, so it is an enemy of reliability. The fifth design principle embodied in the safety-net approach is to adopt sweeping simplifications. This principle does not show up explicitly in the description of the fault-tolerance design process, but it will appear several times as we go into more detail.

The safety-net approach is applicable not just to fault tolerant design. Chapter 11 [on-line] will show that the safety-net approach is used in an even more rigorous form in designing systems that must protect information from malicious actions.

8.2 Measures of Reliability and Failure Tolerance

8.2.1 Availability and Mean Time to Failure

A useful model of a system or a system component, from a reliability point of view, is that it operates correctly for some period of time and then it fails. The time to failure (TTF) is thus a measure of interest, and it is something that we would like to be able to predict. If a higher-level module does not mask the failure and the failure is persistent,


the system cannot be used until it is repaired, perhaps by replacing the failed component, so we are equally interested in the time to repair (TTR). If we observe a system through N run–fail–repair cycles and observe in each cycle i the values of TTFi and TTRi, we can calculate the fraction of time it operated properly, a useful measure known as availability:

    Availability = time system was running / time system should have been running

                 = ( Σ i=1..N TTFi ) / ( Σ i=1..N (TTFi + TTRi) )                 Eq. 8–1

By separating the denominator of the availability expression into two sums and dividing each by N (the number of observed failures) we obtain two time averages that are frequently reported as operational statistics: the mean time to failure (MTTF) and the mean time to repair (MTTR):

    MTTF = (1/N) Σ i=1..N TTFi          MTTR = (1/N) Σ i=1..N TTRi                Eq. 8–2

The sum of these two statistics is usually called the mean time between failures (MTBF). Thus availability can be variously described as

    Availability = MTTF / MTBF = MTTF / (MTTF + MTTR) = (MTBF – MTTR) / MTBF      Eq. 8–3

In some situations, it is more useful to measure the fraction of time that the system is not working, known as its down time:

    Down time = (1 – Availability) = MTTR / MTBF                                  Eq. 8–4
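Equations 8–1 through 8–4 translate directly into a few lines of code. The following sketch (our own helper, with made-up observation data in the usage example) computes the statistics from a set of observed run–fail–repair cycles:

```python
def reliability_stats(ttf, ttr):
    """Given matched lists of observed times to failure and times to
    repair, return (MTTF, MTTR, MTBF, availability, down time) per
    equations 8-1 through 8-4."""
    n = len(ttf)
    assert n == len(ttr) and n > 0
    mttf = sum(ttf) / n          # equation 8-2
    mttr = sum(ttr) / n          # equation 8-2
    mtbf = mttf + mttr           # mean time between failures
    availability = mttf / mtbf   # equation 8-3
    down_time = mttr / mtbf      # equation 8-4
    return mttf, mttr, mtbf, availability, down_time

# Two observed cycles (hours): runs of 900 and 1100, repairs of 50 each.
mttf, mttr, mtbf, avail, down = reliability_stats([900, 1100], [50, 50])
print(mtbf, round(avail, 4))   # 1050.0 0.9524
```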

One thing that the definition of down time makes clear is that MTTR and MTBF are in some sense equally important. One can reduce down time either by reducing MTTR or by increasing MTBF.

Components are often repaired by simply replacing them with new ones. When failed components are discarded rather than fixed and returned to service, it is common to use a slightly different method to measure MTTF. The method is to place a batch of N components in service in different systems (or in what is hoped to be an equivalent test environment), run them until they have all failed, and use the set of failure times as the TTFi in equation 8–2. This procedure substitutes an ensemble average for the time average. We could use this same procedure on components that are not usually discarded when they fail, in the hope of determining their MTTF more quickly, but we might obtain a different value for the MTTF. Some failure processes do have the property that the ensemble average is the same as the time average (processes with this property are called ergodic), but other failure processes do not. For example, the repair itself may cause wear, tear, and disruption to other parts of the system, in which case each successive system failure might on average occur sooner than did the previous one. If that is the case, an MTTF calculated from an ensemble-average measurement might be too optimistic.

As we have defined them, availability, MTTF, MTTR, and MTBF are backward-looking measures. They are used for two distinct purposes: (1) for evaluating how the system is doing (compared, for example, with predictions made when the system was designed) and (2) for predicting how the system will behave in the future. The first purpose is concrete and well defined. The second requires that one take on faith that samples from the past provide an adequate predictor of the future, which can be a risky assumption.

There are other problems associated with these measures. While MTTR can usually be measured in the field, the more reliable a component or system the longer it takes to evaluate its MTTF, so that measure is often not directly available. Instead, it is common to use and measure proxies to estimate its value. The quality of the resulting estimate of availability then depends on the quality of the proxy.

A typical 3.5-inch magnetic disk comes with a reliability specification of 300,000 hours “MTTF”, which is about 34 years. Since the company quoting this number has probably not been in business that long, it is apparent that whatever they are calling “MTTF” is not the same as either the time-average or the ensemble-average MTTF that we just defined. It is actually a quite different statistic, which is why we put quotes around its name. Sometimes this “MTTF” is a theoretical prediction obtained by modeling the ways that the components of the disk might be expected to fail and calculating an expected time to failure.
A more likely possibility is that the manufacturer measured this “MTTF” by running an array of disks simultaneously for a much shorter time and counting the number of failures. For example, suppose the manufacturer ran 1,000 disks for 3,000 hours (about four months) each, and during that time 10 of the disks failed. The observed failure rate of this sample is 1 failure for every 300,000 hours of operation. The next step is to invert the failure rate to obtain 300,000 hours of operation per failure and then quote this number as the “MTTF”. But the relation between this sample observation of failure rate and the real MTTF is problematic. If the failure process were memoryless (meaning that the failure rate is independent of time; Section 8.2.2, below, explores this idea more thoroughly), we would have the special case in which the MTTF really is the inverse of the failure rate. A good clue that the disk failure process is not memoryless is that the disk specification may also mention an “expected operational lifetime” of only 5 years. That statistic is probably the real MTTF—though even that may be a prediction based on modeling rather than a measured ensemble average. An appropriate re-interpretation of the 34-year “MTTF” statistic is to invert it and identify the result as a short-term failure rate that applies only within the expected operational lifetime. The paragraph discussing equation 8–9 on page 8–13 describes a fallacy that sometimes leads to miscalculation of statistics such as the MTTF.

Magnetic disks, light bulbs, and many other components exhibit a time-varying statistical failure rate known as a bathtub curve, illustrated in Figure 8.1 and defined more

Saltzer & Kaashoek Ch. 8, p. 10

June 24, 2009 12:24 am

8.2 Measures of Reliability and Failure Tolerance


carefully in Section 8.2.2, below. When components come off the production line, a certain fraction fail almost immediately because of gross manufacturing defects. Those components that survive this initial period usually run for a long time with a relatively uniform failure rate. Eventually, accumulated wear and tear cause the failure rate to increase again, often quite rapidly, producing a failure rate plot that resembles the shape of a bathtub.

Several other suggestive and colorful terms describe these phenomena. Components that fail early are said to be subject to infant mortality, and those that fail near the end of their expected lifetimes are said to burn out. Manufacturers sometimes burn in such components by running them for a while before shipping, with the intent of identifying and discarding the ones that would otherwise fail immediately upon being placed in service. When a vendor quotes an “expected operational lifetime,” it is probably the mean time to failure of those components that survive burn in, while the much larger “MTTF” number is probably the inverse of the observed failure rate at the lowest point of the bathtub. (The published numbers also sometimes depend on the outcome of a debate between the legal department and the marketing department, but that gets us into a different topic.) A chip manufacturer describes the fraction of components that survive the burn-in period as the yield of the production line. Component manufacturers usually exhibit a phenomenon known informally as a learning curve, which simply means that the first components coming out of a new production line tend to have more failures than later ones. The reason is that manufacturers design for iteration: upon seeing and analyzing failures in the early production batches, the production line designer figures out how to refine the manufacturing process to reduce the infant mortality rate.

One job of the system designer is to exploit the nonuniform failure rates predicted by the bathtub and learning curves. For example, a conservative designer exploits the learning curve by avoiding the latest generation of hard disks in favor of slightly older designs that have accumulated more field experience. One can usually rely on other designers who may be concerned more about cost or performance than availability to shake out the bugs in the newest generation of disks.
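The failure-rate arithmetic in the 1,000-disk example above is simple enough to check directly. Here is a minimal sketch (ours, not from the text; the numbers are the hypothetical ones from the example, not a real drive specification):

```python
# Hypothetical numbers from the example in the text: 1,000 disks run for
# 3,000 hours each, with 10 failures observed during the test.
disks = 1_000
hours_each = 3_000
failures = 10

device_hours = disks * hours_each              # 3,000,000 device-hours of operation
failure_rate = failures / device_hours         # 1 failure per 300,000 hours

# Inverting the observed failure rate gives the quoted "MTTF" ...
quoted_mttf_hours = 1 / failure_rate           # 300,000 hours
quoted_mttf_years = quoted_mttf_hours / (24 * 365)

# ... but that inversion is valid only for a memoryless failure process.
# The 5-year "expected operational lifetime" is a better estimate of the
# real MTTF of a disk.
print(round(quoted_mttf_hours), round(quoted_mttf_years, 1))   # 300000 34.2
```

The point of the sketch is that the quoted figure is a short-term failure rate in disguise, not an expected lifetime.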

[Figure: plot of conditional failure rate h(t) versus time t.]

FIGURE 8.1 A bathtub curve, showing how the conditional failure rate of a component changes with time.

CHAPTER 8 Fault Tolerance: Reliable Systems from Unreliable

The 34-year “MTTF” disk drive specification may seem like public relations puffery in the face of the specification of a 5-year expected operational lifetime, but these two numbers actually are useful as a measure of the nonuniformity of the failure rate. This nonuniformity is also susceptible to exploitation, depending on the operation plan. If the operation plan puts the component in a system such as a satellite, in which it will run until it fails, the designer would base system availability and reliability estimates on the 5-year figure. On the other hand, the designer of a ground-based storage system, mindful that the 5-year operational lifetime identifies the point where the conditional failure rate starts to climb rapidly at the far end of the bathtub curve, might include a plan to replace perfectly good hard disks before burn-out begins to dominate the failure rate—in this case, perhaps every 3 years. Since one can arrange to do scheduled replacement at convenient times, for example, when the system is down for another reason, or perhaps even without bringing the system down, the designer can minimize the effect on system availability. The manufacturer’s 34-year “MTTF”, which is probably the inverse of the observed failure rate at the lowest point of the bathtub curve, then can be used as an estimate of the expected rate of unplanned replacements, although experience suggests that this specification may be a bit optimistic. Scheduled replacements are an example of preventive maintenance, which is active intervention intended to increase the mean time to failure of a module or system and thus improve availability.

For some components, observed failure rates are so low that MTTF is estimated by accelerated aging. This technique involves making an educated guess about what the dominant underlying cause of failure will be and then amplifying that cause. For example, it is conjectured that failures in recordable Compact Disks are heat-related. A typical test scenario is to store batches of recorded CDs at various elevated temperatures for several months, periodically bringing them out to test them and count how many have failed. One then plots these failure rates versus temperature and extrapolates to estimate what the failure rate would have been at room temperature. Again making the assumption that the failure process is memoryless, that failure rate is then inverted to produce an MTTF. Published MTTFs of 100 years or more have been obtained this way. If the dominant fault mechanism turns out to be something else (such as bacteria munching on the plastic coating) or if after 50 years the failure process turns out not to be memoryless after all, an estimate from an accelerated aging study may be far wide of the mark. A designer must use such estimates with caution and understanding of the assumptions that went into them.

Availability is sometimes discussed by counting the number of nines in the numerical representation of the availability measure. Thus a system that is up and running 99.9% of the time is said to have 3-nines availability. Measuring by nines is often used in marketing because it sounds impressive. A more meaningful number is usually obtained by calculating the corresponding down time. A 3-nines system can be down nearly 1.5 minutes per day or 8 hours per year, a 5-nines system 5 minutes per year, and a 7-nines system only 3 seconds per year. Another problem with measuring by nines is that it tells only about availability, without any information about MTTF. One 3-nines system may have a brief failure every day, while a different 3-nines system may have a single eight-

hour outage once a year. Depending on the application, the difference between those two systems could be important. Any single measure should always be suspect.

Finally, availability can be a more fine-grained concept. Some systems are designed so that when they fail, some functions (for example, the ability to read data) remain available, while others (the ability to make changes to the data) are not. Systems that continue to provide partial service in the face of failure are called fail-soft, a concept defined more carefully in Section 8.3.
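The downtime equivalents quoted earlier for 3-, 5-, and 7-nines availability are easy to recompute. A quick sketch (the function name is our own):

```python
def downtime_per_year(nines: int) -> float:
    """Seconds of downtime per year allowed by n-nines availability."""
    unavailability = 10.0 ** -nines
    return unavailability * 365 * 24 * 3600

# 3 nines -> ~31,536 s/year (about 8.8 hours, or ~1.5 minutes per day)
# 5 nines -> ~315 s/year (about 5.3 minutes)
# 7 nines -> ~3.2 s/year
for n in (3, 5, 7):
    print(n, round(downtime_per_year(n), 1))
```

As the text notes, two systems with identical nines can still differ greatly in MTTF, so this conversion is a sanity check, not a complete characterization.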

8.2.2 Reliability Functions

The bathtub curve expresses the conditional failure rate h(t) of a module, defined to be the probability that the module fails between time t and time t + dt, given that the component is still working at time t. The conditional failure rate is only one of several closely related ways of describing the failure characteristics of a component, module, or system. The reliability, R, of a module is defined to be

R(t) = Pr(the module has not yet failed at time t, given that the module was operating at time 0)    Eq. 8–5

and the unconditional failure rate f(t) is defined to be

f(t) = Pr(module fails between t and t + dt)    Eq. 8–6

(The bathtub curve and these two reliability functions are three ways of presenting the same information. If you are rusty on probability, a brief reminder of how they are related appears in Sidebar 8.1.) Once f(t) is at hand, one can directly calculate the MTTF:

MTTF = ∫₀^∞ t ⋅ f(t) dt    Eq. 8–7

One must keep in mind that this MTTF is predicted from the failure rate function f(t), in contrast to the MTTF of eq. 8–2, which is the result of a field measurement. The two MTTFs will be the same only if the failure model embodied in f(t) is accurate.

Some components exhibit relatively uniform failure rates, at least for the lifetime of the system of which they are a part. For these components the conditional failure rate, rather than resembling a bathtub, is a straight horizontal line, and the reliability function becomes a simple declining exponential:

R(t) = e^(−t ⁄ MTTF)    Eq. 8–8

This reliability function is said to be memoryless, which simply means that the conditional failure rate is independent of how long the component has been operating. Memoryless failure processes have the nice property that the conditional failure rate is the inverse of the MTTF. Unfortunately, as we saw in the case of the disks with the 34-year “MTTF”, this property is sometimes misappropriated to quote an MTTF for a component whose

Sidebar 8.1: Reliability functions The failure rate function, the reliability function, and the bathtub curve (which in probability texts is called the conditional failure rate function, and which in operations research texts is called the hazard function) are actually three mathematically related ways of describing the same information. The failure rate function, f(t) as defined in equation 8–6, is a probability density function, which is everywhere non-negative and whose integral over all time is 1. Integrating the failure rate function from the time the component was created (conventionally taken to be t = 0) to the present time yields

F(t) = ∫₀^t f(t) dt

F(t) is the cumulative probability that the component has failed by time t. The cumulative probability that the component has not failed is the probability that it is still operating at time t given that it was operating at time 0, which is exactly the definition of the reliability function, R(t). That is,

R(t) = 1 – F(t)

The bathtub curve of Figure 8.1 reports the conditional probability h(t) that a failure occurs between t and t + dt, given that the component was operating at time t. By the definition of conditional probability, the conditional failure rate function is thus

h(t) = f(t) ⁄ R(t)
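The relationships among f(t), F(t), R(t), h(t), and the MTTF can be verified numerically for the memoryless case of eq. 8–8. This sketch is ours, with an arbitrarily chosen MTTF of 100 time units, and uses crude Riemann-sum integration:

```python
import math

MTTF = 100.0                     # arbitrary time units (our choice for the check)

def f(t):
    """Failure density for the memoryless case: f(t) = e^(-t/MTTF) / MTTF."""
    return math.exp(-t / MTTF) / MTTF

dt = 0.01
ts = [i * dt for i in range(int(20 * MTTF / dt))]   # grid reaching far into the tail

# Eq. 8-7: the MTTF is the expected value of the time to failure.
mttf_est = sum(t * f(t) * dt for t in ts)

# Sidebar 8.1: F(t) is the integral of f, R(t) = 1 - F(t), h(t) = f(t)/R(t).
t_check = 37.0
F = sum(f(t) * dt for t in ts if t < t_check)
R = 1.0 - F
h = f(t_check) / R

print(round(mttf_est, 1))    # close to 100, recovering the MTTF from f(t)
print(round(h, 4))           # close to 1/MTTF = 0.01, independent of t_check
```

Repeating the last computation at a different t_check gives essentially the same h, which is exactly the memoryless property: the conditional failure rate is constant and equal to 1 ⁄ MTTF.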

conditional failure rate does change with time. This misappropriation starts with a fallacy: an assumption that the MTTF, as defined in eq. 8–7, can be calculated by inverting the measured failure rate. The fallacy arises because in general,

E(1 ⁄ t) ≠ 1 ⁄ E(t)    Eq. 8–9

That is, the expected value of the inverse is not equal to the inverse of the expected value, except in certain special cases. The important special case in which they are equal is the memoryless distribution of eq. 8–8. When a random process is memoryless, calculations and measurements are so much simpler that designers sometimes forget that the same simplicity does not apply everywhere.

Just as availability is sometimes expressed in an oversimplified way by counting the number of nines in its numerical representation, reliability in component manufacturing is sometimes expressed in an oversimplified way by counting standard deviations in the observed distribution of some component parameter, such as the maximum propagation time of a gate. The usual symbol for standard deviation is the Greek letter σ (sigma), and the standard normal distribution has a standard deviation of 1.0, so saying that a component has “4.5 σ reliability” is a shorthand way of saying that the production line controls variations in that parameter well enough that the specified tolerance is 4.5 standard deviations away from the mean value, as illustrated in Figure 8.2. Suppose, for example, that a production line is manufacturing gates that are specified to have a mean propagation time of 10 nanoseconds and a maximum propagation time of 11.8 nanoseconds with 4.5 σ reliability. The difference between the mean and the maximum, 1.8 nanoseconds, is the tolerance. For that tolerance to be 4.5 σ, σ would have to be no more than 0.4 nanoseconds. To meet the specification, the production line designer would measure the actual propagation times of production line samples and, if the observed standard deviation is greater than 0.4 ns, look for ways to reduce it to that level.

Another way of interpreting “4.5 σ reliability” is to calculate the expected fraction of components that are outside the specified tolerance. That fraction is the integral of one tail of the normal distribution from 4.5 σ to ∞, which is about 3.4 × 10⁻⁶, so in our example no more than 3.4 out of each million gates manufactured would have delays greater than 11.8 nanoseconds. Unfortunately, this measure describes only the failure rate of the production line; it does not say anything about the failure rate of the component after it is installed in a system.

A currently popular quality control method, known as “Six Sigma”, is an application of two of our design principles to the manufacturing process. The idea is to use measurement, feedback, and iteration (design for iteration: “you won’t get it right the first time”) to reduce the variance (the robustness principle: “be strict on outputs”) of production-line manufacturing. The “Six Sigma” label is somewhat misleading because in the application of the method, the number 6 is allocated to deal with two quite different effects. The method sets a target of controlling the production line variance to the level of 4.5 σ, just as in the gate example of Figure 8.2. The remaining 1.5 σ is the amount that the mean output value is allowed to drift away from its original specification over the life of the

[Figure: the normal density curve, with horizontal-axis ticks at −1 σ (9.6 ns), 4.5 σ (11.8 ns), and 7 σ (12.8 ns) relative to the 10 ns mean.]

FIGURE 8.2 The normal probability density function applied to production of gates that are specified to have mean propagation time of 10 nanoseconds and maximum propagation time of 11.8 nanoseconds. The upper numbers on the horizontal axis measure the distance from the mean in units of the standard deviation, σ. The lower numbers depict the corresponding propagation times. The integral of the tail from 4.5 σ to ∞ is so small that it is not visible in this figure.
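The 3.4-per-million figure for a 4.5 σ tolerance is the one-sided tail of the standard normal distribution, which can be computed from the complementary error function. A sketch (ours):

```python
import math

def normal_tail(sigmas: float) -> float:
    """One-sided tail of the standard normal distribution: the probability
    of exceeding the mean by more than `sigmas` standard deviations."""
    return 0.5 * math.erfc(sigmas / math.sqrt(2.0))

# Fraction of gates expected to fall outside a 4.5-sigma tolerance:
print(normal_tail(4.5))   # about 3.4e-06, i.e., roughly 3.4 gates per million
```

The same function evaluated at 6 rather than 4.5 shows why the drift allowance matters: the nominal six-sigma tail is about a thousand times smaller than the 4.5 σ tail that the method actually guarantees.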

production line. So even though the production line may start 6 σ away from the tolerance limit, after it has been operating for a while one may find that the failure rate has drifted upward to the same 3.4 in a million calculated for the 4.5 σ case.

In manufacturing quality control literature, these applications of the two design principles are known as Taguchi methods, after their popularizer, Genichi Taguchi.

8.2.3 Measuring Fault Tolerance

It is sometimes useful to have a quantitative measure of the fault tolerance of a system. One common measure, sometimes called the failure tolerance, is the number of failures of its components that a system can tolerate without itself failing. Although this label could be ambiguous, it is usually clear from context that a measure is being discussed. Thus a memory system that includes single-error correction (Section 8.4 describes how error correction works) has a failure tolerance of one bit.

When a failure occurs, the remaining failure tolerance of the system goes down. The remaining failure tolerance is an important thing to monitor during operation of the system because it shows how close the system as a whole is to failure. One of the most common system design mistakes is to add fault tolerance but not include any monitoring to see how much of the fault tolerance has been used up, thus ignoring the safety margin principle. When systems that are nominally fault tolerant do fail, later analysis invariably discloses that there were several failures that the system successfully masked but that somehow were never reported and thus were never repaired. Eventually, the total number of failures exceeded the designed failure tolerance of the system.

Failure tolerance is actually a single number in only the simplest situations. Sometimes it is better described as a vector, or even as a matrix showing the specific combinations of different kinds of failures that the system is designed to tolerate. For example, an electric power company might say that it can tolerate the failure of up to 15% of its generating capacity, at the same time as the downing of up to two of its main transmission lines.

8.3 Tolerating Active Faults

8.3.1 Responding to Active Faults

In dealing with active faults, the designer of a module can provide one of several responses:

• Do nothing. The error becomes a failure of the module, and the larger system or subsystem of which it is a component inherits the responsibilities both of discovering and of handling the problem. The designer of the larger subsystem then must choose which of these responses to provide. In a system with several layers of modules, failures may be passed up through more than one layer before

being discovered and handled. As the number of do-nothing layers increases, containment generally becomes more and more difficult.

• Be fail-fast. The module reports at its interface that something has gone wrong. This response also turns the problem over to the designer of the next higher-level system, but in a more graceful way. Example: when an Ethernet transceiver detects a collision on a frame it is sending, it stops sending as quickly as possible, broadcasts a brief jamming signal to ensure that all network participants quickly realize that there was a collision, and it reports the collision to the next higher level, usually a hardware module of which the transceiver is a component, so that the higher level can consider resending that frame.

• Be fail-safe. The module transforms any value or values that are incorrect to values that are known to be acceptable, even if not right or optimal. An example is a digital traffic light controller that, when it detects a failure in its sequencer, switches to a blinking red light in all directions. Chapter 11 [on-line] discusses systems that provide security. In the event of a failure in a secure system, the safest thing to do is usually to block all access. A fail-safe module designed to do that is said to be fail-secure.

• Be fail-soft. The system continues to operate correctly with respect to some predictably degraded subset of its specifications, perhaps with some features missing or with lower performance. For example, an airplane with three engines can continue to fly safely, albeit more slowly and with less maneuverability, if one engine fails. A file system that is partitioned into five parts, stored on five different small hard disks, can continue to provide access to 80% of the data when one of the disks fails, in contrast to a file system that employs a single disk five times as large.

• Mask the error. Any value or values that are incorrect are made right and the module meets its specification as if the error had not occurred.

We will concentrate on masking errors because the techniques used for that purpose can be applied, often in simpler form, to achieving a fail-fast, fail-safe, or fail-soft system.

As a general rule, one can design algorithms and procedures to cope only with specific, anticipated faults. Further, an algorithm or procedure can be expected to cope only with faults that are actually detected. In most cases, the only workable way to detect a fault is by noticing an incorrect value or control signal; that is, by detecting an error. Thus when trying to determine if a system design has adequate fault tolerance, it is helpful to classify errors as follows:

• A detectable error is one that can be detected reliably. If a detection procedure is in place and the error occurs, the system discovers it with near certainty and it becomes a detected error.

• A maskable error is one for which it is possible to devise a procedure to recover correctness. If a masking procedure is in place and the error occurs, is detected, and is masked, the error is said to be tolerated.

• Conversely, an untolerated error is one that is undetectable, undetected, unmaskable, or unmasked.

An untolerated error usually leads to a failure of the system. (“Usually,” because we could get lucky and still produce a correct output, either because the error values didn’t actually matter under the current conditions, or some measure intended to mask a different error incidentally masks this one, too.) This classification of errors is illustrated in Figure 8.3.

A subtle consequence of the concept of a maskable error is that there must be a well-defined boundary around that part of the system state that might be in error. The masking procedure must restore all of that erroneous state to correctness, using information that has not been corrupted by the error. The real meaning of detectable, then, is that the error is discovered before its consequences have propagated beyond some specified boundary. The designer usually chooses this boundary to coincide with that of some module and designs that module to be fail-fast (that is, it detects and reports its own errors). The system of which the module is a component then becomes responsible for masking the failure of the module.

[Figure: a tree classifying errors into detectable/undetectable, detected/undetected, maskable/unmaskable, masked/unmasked, and tolerated/untolerated categories.]

FIGURE 8.3 Classification of errors. Arrows lead from a category to mutually exclusive subcategories. For example, unmasked errors include both unmaskable errors and maskable errors that the designer decides not to mask.

8.3.2 Fault Tolerance Models

The distinctions among detectable, detected, maskable, and tolerated errors allow us to specify for a system a fault tolerance model, one of the components of the fault tolerance design process described in Section 8.1.2, as follows:

1. Analyze the system and categorize possible error events into those that can be reliably detected and those that cannot. At this stage, detectable or not, all errors are untolerated.

2. For each undetectable error, evaluate the probability of its occurrence. If that probability is not negligible, modify the system design in whatever way necessary to make the error reliably detectable.

3. For each detectable error, implement a detection procedure and reclassify the module in which it is detected as fail-fast.

4. For each detectable error, try to devise a way of masking it. If there is a way, reclassify this error as a maskable error.

5. For each maskable error, evaluate its probability of occurrence, the cost of failure, and the cost of the masking method devised in the previous step. If the evaluation indicates it is worthwhile, implement the masking method and reclassify this error as a tolerated error.

When finished developing such a model, the designer should have a useful fault tolerance specification for the system. Some errors, which have negligible probability of occurrence or for which a masking measure would be too expensive, are identified as untolerated. When those errors occur the system fails, leaving its users to cope with the result. Other errors have specified recovery algorithms, and when those occur the system should continue to run correctly. A review of the system recovery strategy can now focus separately on two distinct questions:

• Is the designer’s list of potential error events complete, and is the assessment of the probability of each error realistic?

• Is the designer’s set of algorithms, procedures, and implementations that are supposed to detect and mask the anticipated errors complete and correct?

These two questions are different. The first is a question of models of the real world. It addresses an issue of experience and judgment about real-world probabilities and whether all real-world modes of failure have been discovered or some have gone unnoticed. Two different engineers, with different real-world experiences, may reasonably disagree on such judgments—they may have different models of the real world. The evaluation of modes of failure and of probabilities is a point at which a designer may easily go astray because such judgments must be based not on theory but on experience in the field, either personally acquired by the designer or learned from the experience of others. A new technology, or an old technology placed in a new environment, is likely to create surprises. A wrong judgment can lead to wasted effort devising detection and masking algorithms that will rarely be invoked rather than the ones that are really needed. On the other hand, if the needed experience is not available, all is not lost: the iteration part of the design process is explicitly intended to provide that experience.

The second question is more abstract and also more absolutely answerable, in that an argument for correctness (unless it is hopelessly complicated) or a counterexample to that argument should be something that everyone can agree on. In system design, it is helpful to follow design procedures that distinctly separate these classes of questions. When someone questions a reliability feature, the designer can first ask, “Are you questioning

the correctness of my recovery algorithm or are you questioning my model of what may fail?” and thereby properly focus the discussion or argument.

Creating a fault tolerance model also lays the groundwork for the iteration part of the fault tolerance design process. If a system in the field begins to fail more often than expected, or completely unexpected failures occur, analysis of those failures can be compared with the fault tolerance model to discover what has gone wrong. By again asking the two questions marked with bullets above, the model allows the designer to distinguish between, on the one hand, failure probability predictions being proven wrong by field experience, and on the other, inadequate or misimplemented masking procedures. With this information the designer can work out appropriate adjustments to the model and the corresponding changes needed for the system.

Iteration and review of fault tolerance models is also important to keep them up to date in the light of technology changes. For example, the Network File System described in Section 4.4 was first deployed using a local area network, where packet loss errors are rare and may even be masked by the link layer. When later users deployed it on larger networks, where lost packets are more common, it became necessary to revise its fault tolerance model and add additional error detection in the form of end-to-end checksums. The processor time required to calculate and check those checksums caused some performance loss, which is why its designers did not originally include checksums. But loss of data integrity outweighed loss of performance and the designers reversed the trade-off.

To illustrate, an example of a fault tolerance model applied to a popular kind of memory device, RAM, appears in Section 8.7. That fault tolerance model employs error detection and masking techniques that are described below in Section 8.4 of this chapter, so the reader may prefer to delay detailed study of Section 8.7 until completing Section 8.4.

8.4 Systematically Applying Redundancy

The designer of an analog system typically masks small errors by specifying design tolerances known as margins, which are amounts by which the specification is better than necessary for correct operation under normal conditions. In contrast, the designer of a digital system both detects and masks errors of all kinds by adding redundancy, either in time or in space. When an error is thought to be transient, as when a packet is lost in a data communication network, one method of masking is to resend it, an example of redundancy in time. When an error is likely to be persistent, as in a failure in reading bits from the surface of a disk, the usual method of masking is with spatial redundancy, having another component provide another copy of the information or control signal. Redundancy can be applied either in cleverly small quantities or by brute force, and both techniques may be used in different parts of the same system.

8.4.1 Coding: Incremental Redundancy

The most common form of incremental redundancy, known as forward error correction, consists of clever coding of data values. With data that has not been encoded to tolerate errors, a change in the value of one bit may transform one legitimate data value into another legitimate data value. Encoding for errors involves choosing as the representation of legitimate data values only some of the total number of possible bit patterns, being careful that the patterns chosen for legitimate data values all have the property that to transform any one of them to any other, more than one bit must change. The smallest number of bits that must change to transform one legitimate pattern into another is known as the Hamming distance between those two patterns. The Hamming distance is named after Richard Hamming, who first investigated this class of codes. Thus the patterns

100101
000111

have a Hamming distance of 2 because the upper pattern can be transformed into the lower pattern by flipping the values of two bits, the first bit and the fifth bit. Data fields that have not been coded for errors might have a Hamming distance as small as 1. Codes that can detect or correct errors have a minimum Hamming distance between any two legitimate data patterns of 2 or more. The Hamming distance of a code is the minimum Hamming distance between any pair of legitimate patterns of the code. One can calculate the Hamming distance between two patterns, A and B, by counting the number of ONEs in A ⊕ B, where ⊕ is the exclusive OR (XOR) operator.

Suppose we create an encoding in which the Hamming distance between every pair of legitimate data patterns is 2. Then, if one bit changes accidentally, since no legitimate data item can have that pattern, we can detect that something went wrong, but it is not possible to figure out what the original data pattern was. Thus, if the two patterns above were two members of the code and the first bit of the upper pattern were flipped from ONE to ZERO, there is no way to tell that the result, 000101, is not the result of flipping the fifth bit of the lower pattern. Next, suppose that we instead create an encoding in which the Hamming distance of the code is 3 or more. Here are two patterns from such a code; bits 1, 2, and 5 are different:

100101
010111


Now, a one-bit change will always transform a legitimate data pattern into an incorrect data pattern that is still at least 2 bits distant from any other legitimate pattern but only 1 bit distant from the original pattern. A decoder that receives a pattern with a one-bit error can inspect the Hamming distances between the received pattern and nearby legitimate patterns and by choosing the nearest legitimate pattern correct the error. If 2 bits change, this error-correction procedure will still identify a corrected data value, but it will choose the wrong one. If we expect 2-bit errors to happen often, we could choose the code patterns so that the Hamming distance is 4, in which case the code can correct

Saltzer & Kaashoek Ch. 8, p. 21

June 24, 2009 12:24 am


CHAPTER 8 Fault Tolerance: Reliable Systems from Unreliable

1-bit errors and detect 2-bit errors. But a 3-bit error would look just like a 1-bit error in some other code pattern, so it would decode to a wrong value. More generally, if the Hamming distance of a code is d, a little analysis reveals that one can detect d – 1 errors and correct ⌊(d – 1)/2⌋ errors. The reason that this form of redundancy is named “forward” error correction is that the creator of the data performs the coding before storing or transmitting it, and anyone can later decode the data without appealing to the creator. (Chapter 7[on-line] described the technique of asking the sender of a lost frame, packet, or message to retransmit it. That technique goes by the name of backward error correction.)

The systematic construction of forward error-detection and error-correction codes is a large field of study, which we do not intend to explore. However, two specific examples of commonly encountered codes are worth examining.

The first example is a simple parity check on a 2-bit value, in which the parity bit is the XOR of the 2 data bits. The coded pattern is 3 bits long, so there are 2³ = 8 possible patterns for this 3-bit quantity, only 4 of which represent legitimate data. As illustrated in Figure 8.4, the 4 “correct” patterns have the property that changing any single bit transforms the word into one of the 4 illegal patterns. To transform the coded quantity into another legal pattern, at least 2 bits must change (in other words, the Hamming distance of this code is 2). The conclusion is that a simple parity check can detect any single error, but it doesn’t have enough information to correct errors.

FIGURE 8.4 Patterns for a simple parity-check code. Each line connects patterns that differ in only one bit; bold-face patterns are the legitimate ones. (The legitimate patterns are 000, 011, 101, and 110; the illegal ones are 001, 010, 100, and 111.)

The second example, in Figure 8.5, shows a forward error-correction code that can correct 1-bit errors in a 4-bit data value, by encoding the 4 bits into 7-bit words. In this code, bits P7, P6, P5, and P3 carry the data, while bits P4, P2, and P1 are calculated from the data bits. (This out-of-order numbering scheme creates a multidimensional binary coordinate system with a use that will be evident in a moment.) We could analyze this code to determine its Hamming distance, but we can also observe that three extra bits can carry exactly enough information to distinguish 8 cases: no error, an error in bit 1, an error in bit 2, … or an error in bit 7. Thus, it is not surprising that an error-correction code can be created. This code calculates bits P1, P2, and P4 as follows:

P1 = P7 ⊕ P5 ⊕ P3
P2 = P7 ⊕ P6 ⊕ P3
P4 = P7 ⊕ P6 ⊕ P5
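As a concrete illustration, here is a small Python sketch of this encoding and of the syndrome-based correction step described next. The function names and the list-based bit representation are our own, not from the text; this is a sketch under the text's bit-numbering scheme, not a production decoder.

```python
def hamming74_encode(d7, d6, d5, d3):
    """Encode 4 data bits into positions P7, P6, P5, P3 of a 7-bit word.
    Parity bits P4, P2, and P1 are computed as in the text."""
    p1 = d7 ^ d5 ^ d3
    p2 = d7 ^ d6 ^ d3
    p4 = d7 ^ d6 ^ d5
    # word[i] holds bit Pi; index 0 is unused padding
    return [None, p1, p2, d3, p4, d5, d6, d7]

def hamming74_correct(word):
    """Recompute the parity checks; the mismatches, read in order
    P4 P2 P1 as a binary number, name the position of a flipped bit."""
    syndrome = ((word[4] ^ word[7] ^ word[6] ^ word[5]) << 2) \
             | ((word[2] ^ word[7] ^ word[6] ^ word[3]) << 1) \
             |  (word[1] ^ word[7] ^ word[5] ^ word[3])
    if syndrome:              # a nonzero syndrome identifies the bad bit
        word[syndrome] ^= 1
    return word

# Flip bit P5 in transit; the syndrome reads 101 in binary = 5,
# so the decoder repairs exactly that bit.
sent = hamming74_encode(1, 0, 1, 1)
received = list(sent)
received[5] ^= 1
assert hamming74_correct(received) == sent
```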

8.4 Systematically Applying Redundancy


Now, suppose that the array of bits P1 through P7 is sent across a network and noise causes bit P5 to flip. If the recipient recalculates P1, P2, and P4, the recalculated values of P1 and P4 will be different from the received bits P1 and P4. The recipient then writes P4 P2 P1 in order, representing the troubled bits as ONEs and untroubled bits as ZEROs, and notices that their binary value is 101₂ = 5, the position of the flipped bit. In this code, whenever there is a one-bit error, the troubled parity bits directly identify the bit to correct. (That was the reason for the out-of-order bit-numbering scheme, which created a 3-dimensional coordinate system for locating an erroneous bit.)

The use of 3 check bits for 4 data bits suggests that an error-correction code may not be efficient, but in fact the apparent inefficiency of this example is only because it is so small. Extending the same reasoning, one can, for example, provide single-error correction for 56 data bits using 7 check bits in a 63-bit code word.

In both of these examples of coding, the assumed threat to integrity is that an unidentified bit out of a group may be in error. Forward error correction can also be effective against other threats. A different threat, called erasure, is also common in digital systems. An erasure occurs when the value of a particular, identified bit of a group is unintelligible or perhaps even completely missing. Since we know which bit is in question, the simple parity-check code, in which the parity bit is the XOR of the other bits, becomes a forward error-correction code. The unavailable bit can be reconstructed simply by calculating the XOR of the unerased bits. Returning to the example of Figure 8.4, if we find a pattern in which the first and last bits have values 0 and 1 respectively, but the middle bit is illegible, the only possibilities are 001 and 011. Since 001 is not a legitimate code pattern, the original pattern must have been 011.
The simple parity check allows correction of only a single erasure. If there is a threat of multiple erasures, a more complex coding scheme is needed. Suppose, for example, we have 4 bits to protect, and they are coded as in Figure 8.5. In that case, if as many as 3 bits are erased, the remaining 4 bits are sufficient to reconstruct the values of the 3 that are missing. Since erasure, in the form of lost packets, is a threat in a best-effort packet network, this same scheme of forward error correction is applicable. One might, for example, send four numbered, identical-length packets of data followed by a parity packet that contains

FIGURE 8.5 A single-error-correction code. The payload bits are P7, P6, P5, and P3, and the redundant bits are P4, P2, and P1, chosen as follows:

Choose P1 so that the XOR of every other bit (P7 ⊕ P5 ⊕ P3 ⊕ P1) is 0
Choose P2 so that the XOR of every other pair (P7 ⊕ P6 ⊕ P3 ⊕ P2) is 0
Choose P4 so that the XOR of every other four (P7 ⊕ P6 ⊕ P5 ⊕ P4) is 0

The “every other” notes describe a 3-dimensional coordinate system that can locate an erroneous bit.

as its payload the bit-by-bit XOR of the payloads of the previous four. (That is, the first bit of the parity packet is the XOR of the first bit of each of the other four packets; the second bits are treated similarly, etc.) Although the parity packet adds 25% to the network load, as long as any four of the five packets make it through, the receiving side can reconstruct all of the payload data perfectly without having to ask for a retransmission. If the network is so unreliable that more than one packet out of five typically gets lost, then one might send seven packets, of which four contain useful data and the remaining three are calculated using the formulas of Figure 8.5. (Using the numbering scheme of that figure, the payload of packet 4, for example, would consist of the XOR of the payloads of packets 7, 6, and 5.) Now, if any four of the seven packets make it through, the receiving end can reconstruct the data.

Forward error correction is especially useful in broadcast protocols, where the existence of a large number of recipients, each of which may miss different frames, packets, or stream segments, makes the alternative of backward error correction by requesting retransmission unattractive. Forward error correction is also useful when controlling jitter in stream transmission because it eliminates the round-trip delay that would be required in requesting retransmission of missing stream segments. Finally, forward error correction is usually the only way to control errors when communication is one-way or round-trip delays are so long that requesting retransmission is impractical, for example, when communicating with a deep-space probe. On the other hand, using forward error correction to replace lost packets may have the side effect of interfering with congestion control techniques in which an overloaded packet forwarder tries to signal the sender to slow down by discarding an occasional packet.
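The five-packet scheme can be sketched in a few lines of Python. The packet contents and helper names below are invented for illustration, and the sketch assumes equal-length payloads, as the text requires:

```python
def xor_bytes(a, b):
    # bit-by-bit XOR of two equal-length byte strings
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    # XOR together any number of equal-length packets
    result = packets[0]
    for p in packets[1:]:
        result = xor_bytes(result, p)
    return result

data = [b"pkt1", b"pkt2", b"pkt3", b"pkt4"]   # identical-length payloads
parity = make_parity(data)

# Suppose packet 2 (index 1) is lost in transit; the XOR of the four
# survivors, including the parity packet, reconstructs it exactly.
survivors = [data[0], data[2], data[3], parity]
assert make_parity(survivors) == data[1]
```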
Another application of forward error correction to counter erasure is in storing data on magnetic disks. The threat in this case is that an entire disk drive may fail, for example because of a disk head crash. Assuming that the failure occurs long after the data was originally written, this example illustrates one-way communication in which backward error correction (asking the original writer to write the data again) is not usually an option. One response is to use a RAID array (see Section in a configuration known as RAID 4. In this configuration, one might use an array of five disks, with four of the disks containing application data and each sector of the fifth disk containing the bit-by-bit XOR of the corresponding sectors of the first four. If any of the five disks fails, its identity will quickly be discovered because disks are usually designed to be fail-fast and report failures at their interface. After replacing the failed disk, one can restore its contents by reading the other four disks and calculating, sector by sector, the XOR of their data (see exercise 8.9). To maintain this strategy, whenever anyone updates a data sector, the RAID 4 system must also update the corresponding sector of the parity disk, as shown in Figure 8.6. That figure makes it apparent that, in RAID 4, forward error correction has an identifiable read and write performance cost, in addition to the obvious increase in the amount of disk space used. Since loss of data can be devastating, there is considerable interest in RAID, and much ingenuity has been devoted to devising ways of minimizing the performance penalty.

Although it is an important and widely used technique, successfully applying incremental redundancy to achieve error detection and correction is harder than one might expect. The first case study of Section 8.8 provides several useful lessons on this point. In addition, there are some situations where incremental redundancy does not seem to be applicable. For example, there have been efforts to devise error-correction codes for numerical values with the property that the coding is preserved when the values are processed by an adder or a multiplier. While it is not too hard to invent schemes that allow a limited form of error detection (for example, one can verify that residues are consistent, using analogues of casting out nines, which school children use to check their arithmetic), these efforts have not yet led to any generally applicable techniques. The only scheme that has been found to systematically protect data during arithmetic processing is massive redundancy, which is our next topic.
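The residue idea can be illustrated with a short sketch. The modulus 9 and the function below are our own rendering of the casting-out-nines check mentioned above, not a general error-control scheme:

```python
def residue_ok(a, b, claimed_product, m=9):
    """Check a claimed product by comparing residues mod m,
    the 'casting out nines' check when m = 9."""
    return claimed_product % m == ((a % m) * (b % m)) % m

assert residue_ok(1234, 5678, 1234 * 5678)           # a correct result passes
assert not residue_ok(1234, 5678, 1234 * 5678 + 1)   # an off-by-one is caught
# The detection is only partial: any error that happens to be a
# multiple of 9 slips through unnoticed.
assert residue_ok(1234, 5678, 1234 * 5678 + 9)
```

The last assertion shows why the text calls this only a limited form of error detection.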

8.4.2 Replication: Massive Redundancy

In designing a bridge or a skyscraper, a civil engineer masks uncertainties in the strength of materials and other parameters by specifying components that are 5 or 10 times as strong as minimally required. The method is heavy-handed, but simple and effective.

FIGURE 8.6 Update of a sector on disk 2 of a five-disk RAID 4 system. The old parity sector contains parity ← data 1 ⊕ data 2 ⊕ data 3 ⊕ data 4. To construct a new parity sector that includes the new data 2, one could read the corresponding sectors of data 1, data 3, and data 4 and perform three more XORs. But a faster way is to read just the old parity sector and the old data 2 sector and compute the new parity sector as new parity ← old parity ⊕ old data 2 ⊕ new data 2.
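The faster parity-update rule in the caption can be sketched directly. The sector contents below are made-up two-byte examples, and the function name is ours:

```python
def update_parity(old_parity, old_data, new_data):
    # Fast RAID 4 parity update from Figure 8.6:
    # new parity = old parity XOR old data XOR new data
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

# Four data sectors and their parity (sector size shortened for illustration)
sectors = [bytes([1, 2]), bytes([3, 4]), bytes([5, 6]), bytes([7, 8])]
parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*sectors))

# Overwrite sector 2 and update the parity with two reads and one write
new_sector = bytes([9, 9])
parity = update_parity(parity, sectors[1], new_sector)
sectors[1] = new_sector

# The shortcut agrees with recomputing the parity from scratch
assert parity == bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*sectors))
```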

The corresponding way of building a reliable system out of unreliable discrete components is to acquire multiple copies of each component. Identical multiple copies are called replicas, and the technique is called replication. There is more to it than just making copies: one must also devise a plan to arrange or interconnect the replicas so that a failure in one replica is automatically masked with the help of the ones that don’t fail. For example, if one is concerned about the possibility that a diode may fail by either shorting out or creating an open circuit, one can set up a network of four diodes as in Figure 8.7, creating what we might call a “superdiode”. This interconnection scheme, known as a quad component, was developed by Claude E. Shannon and Edward F. Moore in the 1950s as a way of increasing the reliability of relays in telephone systems. It can also be used with resistors and capacitors in circuits that can tolerate a modest range of component values. This particular superdiode can tolerate a single short circuit and a single open circuit in any two component diodes, and it can also tolerate certain other multiple failures, such as open circuits in both upper diodes plus a short circuit in one of the lower diodes. If the bridging connection of the figure is added, the superdiode can tolerate additional multiple open-circuit failures (such as one upper diode and one lower diode), but it will be less tolerant of certain short-circuit failures (such as one left diode and one right diode). A serious problem with this superdiode is that it masks failures silently. There is no easy way to determine how much failure tolerance remains in the system.

8.4.3 Voting

Although there have been attempts to extend quad-component methods to digital logic, the intricacy of the required interconnections grows much too rapidly. Fortunately, there is a systematic alternative that takes advantage of the static discipline and level regeneration that are inherent properties of digital logic. In addition, it has the nice feature that it can be applied at any level of module, from a single gate on up to an entire computer. The technique is to substitute in place of a single module a set of replicas of that same module, all operating in parallel with the same inputs, and compare their outputs with a device known as a voter. This basic strategy is called N-modular redundancy, or NMR. When N has the value 3 the strategy is called triple-modular redundancy, abbreviated TMR. When other values are used for N the strategy is named by replacing the N of NMR with the number, as in 5MR. The combination of N replicas of some module and

FIGURE 8.7 A quad-component superdiode. The dotted line represents an optional bridging connection, which allows the superdiode to tolerate a different set of failures, as described in the text.

the voting system is sometimes called a supermodule. Several different schemes exist for interconnection and voting, only a few of which we explore here.

The simplest scheme, called fail-vote, consists of NMR with a majority voter. One assembles N replicas of the module and a voter that consists of an N-way comparator and some counting logic. If a majority of the replicas agree on the result, the voter accepts that result and passes it along to the next system component. If any replicas disagree with the majority, the voter may in addition raise an alert, calling for repair of the replicas that were in the minority. If there is no majority, the voter signals that the supermodule has failed. In failure-tolerance terms, a triply-redundant fail-vote supermodule can mask the failure of any one replica, and it is fail-fast if any two replicas fail in different ways.

If the reliability, as was defined in Section 8.2.2, of a single replica module is R and the underlying fault mechanisms are independent, a TMR fail-vote supermodule will operate correctly if all 3 modules are working (with reliability R³) or if 1 module has failed and the other 2 are working (with reliability R²(1 – R)). Since a single-module failure can happen in 3 different ways, the reliability of the supermodule is the sum,

Rsupermodule = R³ + 3R²(1 – R) = 3R² – 2R³        Eq. 8–10

but the supermodule is not always fail-fast. If two replicas fail in exactly the same way, the voter will accept the erroneous result and, unfortunately, call for repair of the one correctly operating replica. This outcome is not as unlikely as it sounds because several replicas that went through the same design and production process may have exactly the same set of design or manufacturing faults. This problem can arise despite the independence assumption used in calculating the probability of correct operation. That calculation assumes only that the probability that different replicas produce correct answers be independent; it assumes nothing about the probability of producing specific wrong answers. Without more information about the probability of specific errors and their correlations, the only thing we can say about the probability that an incorrect result will be accepted by the voter is that it is not more than

(1 – Rsupermodule) = (1 – 3R² + 2R³)

These calculations assume that the voter is perfectly reliable. Rather than trying to create perfect voters, the obvious thing to do is replicate them, too. In fact, everything—modules, inputs, outputs, sensors, actuators, etc.—should be replicated, and the final vote should be taken by the client of the system. Thus, three-engine airplanes vote with their propellers: when one engine fails, the two that continue to operate overpower the inoperative one. On the input side, the pilot’s hand presses forward on three separate throttle levers. A fully replicated TMR supermodule is shown in Figure 8.8. With this interconnection arrangement, any measurement or estimate of the reliability, R, of a component module should include the corresponding voter. It is actually customary (and more logical) to consider a voter to be a component of the next module in the chain rather than, as the diagram suggests, the previous module. This fully replicated design is sometimes described as recursive.

The numerical effect of fail-vote TMR is impressive. If the reliability of a single module at time T is 0.999, equation 8–10 says that the reliability of a fail-vote TMR supermodule at that same time is 0.999997. TMR has reduced the probability of failure from one in a thousand to three in a million. This analysis explains why airplanes intended to fly across the ocean have more than one engine. Suppose that the rate of engine failures is such that a single-engine plane would fail to complete one out of a thousand trans-Atlantic flights. Suppose also that a 3-engine plane can continue flying as long as any 2 engines are operating, but it is too heavy to fly with only 1 engine. In 3 flights out of a thousand, one of the three engines will fail, but if engine failures are independent, in 999 out of each thousand first-engine failures, the remaining 2 engines allow the plane to limp home successfully.

Although TMR has greatly improved reliability, it has not made a comparable impact on MTTF. In fact, the MTTF of a fail-vote TMR supermodule can be smaller than the MTTF of the original, single-replica module. The exact effect depends on the failure process of the replicas, so for illustration consider a memoryless failure process, not because it is realistic but because it is mathematically tractable. Suppose that airplane engines have an MTTF of 6,000 hours, they fail independently, the mechanism of engine failure is memoryless, and (since this is a fail-vote design) we need at least 2 operating engines to get home. When flying with three engines, the plane accumulates 6,000 hours of engine running time in only 2,000 hours of flying time, so from the point of view of the airplane as a whole, 2,000 hours is the expected time to the first engine failure. While flying with the remaining two engines, it will take another 3,000 flying hours to accumulate 6,000 more engine hours.
Because the failure process is memoryless we can calculate the MTTF of the 3-engine plane by adding:

Mean time to first failure:              2000 hours (three engines)
Mean time from first to second failure:  3000 hours (two engines)
Total mean time to system failure:       5000 hours
Thus the mean time to system failure is less than the 6,000 hour MTTF of a single engine. What is going on here is that we have actually sacrificed long-term reliability in order to enhance short-term reliability. Figure 8.9 illustrates the reliability of our hypothetical airplane during its 6 hours of flight, which amounts to only 0.001 of the single-engine MTTF—the mission time is very short compared with the MTTF and the reliability is far higher. Figure 8.10 shows the same curve, but for flight times that are comparable with the MTTF. In this region, if the plane tried to keep flying for 8,400 hours (1.4 times the single-engine MTTF), a single-engine plane would fail to complete the flight in 3 out of 4 tries, but the 3-engine plane would fail to complete the flight in 5 out of 6 tries. (One should be wary of these calculations because the assumptions of independence and memoryless operation may not be met in practice. Sidebar 8.2 elaborates.)

FIGURE 8.8 Triple-modular redundant supermodule, with three inputs, three voters, and three outputs.



FIGURE 8.9 Reliability with triple modular redundancy, for mission times much less than the MTTF of 6,000 hours. The vertical dotted line represents a six-hour flight.

FIGURE 8.10 Reliability with triple modular redundancy, for mission times comparable to the MTTF of 6,000 hours. The two vertical dotted lines represent mission times of 6,000 hours (left) and 8,400 hours (right).
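Under the same memoryless assumption, the curves of Figures 8.9 and 8.10 can be reproduced with a few lines of Python. The exponential reliability model is the text's assumption; the function names are ours:

```python
import math

MTTF = 6000.0  # single-engine MTTF in hours, from the text

def r_single(t):
    # reliability of one engine under a memoryless (exponential) failure process
    return math.exp(-t / MTTF)

def r_tmr(t):
    # fail-vote TMR reliability, equation 8-10: 3R^2 - 2R^3
    r = r_single(t)
    return 3 * r**2 - 2 * r**3

# A six-hour flight: failure probability drops from about 1 in 1,000
# to about 3 in 1,000,000.
assert round(1 - r_single(6.0), 4) == 0.001
assert round(1 - r_tmr(6.0), 6) == 3e-06

# A mission of 1.4 MTTF (8,400 hours): the TMR plane is now the one
# more likely to fail, as Figure 8.10 shows.
assert r_tmr(8400.0) < r_single(8400.0)
```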

Sidebar 8.2: Risks of manipulating MTTFs

The apparently casual manipulation of MTTFs in Sections 8.4.3 and 8.4.4 is justified by assumptions of independence of failures and memoryless processes. But one can trip up by blindly applying this approach without understanding its limitations. To see how, consider a computer system that has been observed for several years to have a hardware crash an average of every 2 weeks and a software crash an average of every 6 weeks. The operator does not repair the system, but simply restarts it and hopes for the best. The composite MTTF is 1.5 weeks, determined most easily by considering what happens if we run the system for, say, 60 weeks. During that time we expect to see:

10 software failures
30 hardware failures
——
40 system failures in 60 weeks → 1.5 weeks between failures

New hardware is installed, identical to the old except that it never fails. The MTTF should jump to 6 weeks because the only remaining failures are software, right? Perhaps—but only if the software failure process is independent of the hardware failure process. Suppose the software failure occurs because there is a bug (fault) in a clock-updating procedure: the bug always crashes the system exactly 420 hours (2 1/2 weeks) after it is started—if it gets a chance to run that long. The old hardware was causing crashes so often that the software bug only occasionally had a chance to do its thing—only about once every 6 weeks. Most of the time, the recovery from a hardware failure, which requires restarting the system, had the side effect of resetting the process that triggered the software bug. So, when the new hardware is installed, the system has an MTTF of only 2.5 weeks, much less than hoped. MTTFs are useful, but one must be careful to understand what assumptions go into their measurement and use.

If we had assumed that the plane could limp home with just one engine, the MTTF would have increased, rather than decreased, but only modestly. Replication provides a dramatic improvement in reliability for missions of duration short compared with the MTTF, but the MTTF itself changes much less. We can verify this claim with a little more analysis, again assuming memoryless failure processes to make the mathematics tractable. Suppose we have an NMR system with the property that it somehow continues to be useful as long as at least one replica is still working. (This system requires using fail-fast replicas and a cleverer voter, as described in Section 8.4.4 below.) If a single replica has an MTTFreplica = 1, there are N independent replicas, and the failure process is memoryless, the expected time until the first failure is MTTFreplica/N, the expected time from then until the second failure is MTTFreplica/(N – 1), etc., and the expected time until the system of N replicas fails is the sum of these times,

MTTFsystem = 1 + 1/2 + 1/3 + … + 1/N        Eq. 8–11

which for large N is approximately ln(N). As we add to the cost by adding more replicas, MTTFsystem grows disappointingly slowly—proportional to the logarithm of the cost. To multiply the MTTFsystem by K, the number of replicas required is e^K—the cost grows exponentially. The significant conclusion is that in systems for which the mission time is long compared with MTTFreplica, simple replication escalates the cost while providing little benefit. On the other hand, there is a way of making replication effective for long missions, too. The method is to enhance replication by adding repair.
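The harmonic-sum behavior of equation 8-11 is easy to check numerically. A sketch with our own function name, taking MTTFreplica = 1 as in the text:

```python
import math

def mttf_n_replicas(n, mttf_replica=1.0):
    # Equation 8-11: expected time until the last of n memoryless
    # replicas fails, when the system stays useful as long as one works.
    return mttf_replica * sum(1.0 / k for k in range(1, n + 1))

assert mttf_n_replicas(1) == 1.0
# For large N the sum is approximately ln(N) (plus Euler's constant):
assert abs(mttf_n_replicas(100) - (math.log(100) + 0.5772)) < 0.01
# Tripling the MTTF this way already takes roughly e**3 (about 20) replicas:
assert mttf_n_replicas(11) < 3.1 and mttf_n_replicas(31) > 4.0
```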

8.4.4 Repair

Let us return now to a fail-vote TMR supermodule (that is, it requires that at least two replicas be working) in which the voter has just noticed that one of the three replicas is producing results that disagree with the other two. Since the voter is in a position to report which replica has failed, suppose that it passes such a report along to a repair person who immediately examines the failing replica and either fixes or replaces it. For this approach, the mean time to repair (MTTR) measure becomes of interest. The supermodule fails if either the second or third replica fails before the repair to the first one can be completed. Our intuition is that if the MTTR is small compared with the combined MTTF of the other two replicas, the chance that the supermodule fails will be similarly small.

The exact effect on chances of supermodule failure depends on the shape of the reliability function of the replicas. In the case where the failure and repair processes are both memoryless, the effect is easy to calculate. Since the rate of failure of 1 replica is 1/MTTF, the rate of failure of 2 replicas is 2/MTTF. If the repair time is short compared with MTTF, the probability of a failure of 1 of the 2 remaining replicas while waiting a time T for repair of the one that failed is approximately 2T/MTTF. Since the mean time to repair is MTTR, we have

Pr(supermodule fails while waiting for repair) = (2 × MTTR) / MTTF        Eq. 8–12

Continuing our airplane example and temporarily suspending disbelief, suppose that during a long flight we send a mechanic out on the airplane’s wing to replace a failed engine. If the replacement takes 1 hour, the chance that one of the other two engines fails during that hour is approximately 1/3000. Moreover, once the replacement is complete, we expect to fly another 2000 hours until the next engine failure. Assuming further that the mechanic is carrying an unlimited supply of replacement engines, completing a 10,000 hour flight—or even a longer one—becomes plausible. The general formula for the MTTF of a fail-vote TMR supermodule with memoryless failure and repair processes is (this formula comes out of the analysis of continuous-transition birth-and-death Markov processes, an advanced probability technique that is beyond our scope):

MTTFsupermodule = (MTTFreplica / 3) × (MTTFreplica / (2 × MTTRreplica)) = (MTTFreplica)² / (6 × MTTRreplica)        Eq. 8–13
Thus, our 3-engine plane with hypothetical in-flight repair has an MTTF of 6 million hours, an enormous improvement over the 6000 hours of a single-engine plane. This equation can be interpreted as saying that, compared with an unreplicated module, the MTTF has been reduced by the usual factor of 3 because there are 3 replicas, but at the same time the availability of repair has increased the MTTF by a factor equal to the ratio of the MTTF of the remaining 2 engines to the MTTR. Replacing an airplane engine in flight may be a fanciful idea, but replacing a magnetic disk in a computer system on the ground is quite reasonable. Suppose that we store 3 replicas of a set of data on 3 independent hard disks, each of which has an MTTF of 5 years (using as the MTTF the expected operational lifetime, not the “MTTF” derived from the short-term failure rate). Suppose also, that if a disk fails, we can locate, install, and copy the data to a replacement disk in an average of 10 hours. In that case, by eq. 8–13, the MTTF of the data is

MTTFsupermodule = (MTTFreplica)² / (6 × MTTRreplica) = (5 years)² / (6 × (10 hours) / (8760 hours/year)) = 3650 years        Eq. 8–14
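Equations 8-13 and 8-14 can be checked with a short sketch. The function name is ours, and all times must be expressed in the same units:

```python
def mttf_tmr_with_repair(mttf_replica, mttr_replica):
    # Equation 8-13: fail-vote TMR with memoryless failure and repair,
    # MTTF = (MTTF_replica)^2 / (6 * MTTR_replica)
    return mttf_replica**2 / (6 * mttr_replica)

# Three-engine plane: 6,000-hour engines, 1-hour in-flight engine swap
assert mttf_tmr_with_repair(6000, 1) == 6_000_000  # hours

# Triply replicated disks: 5-year MTTF, 10-hour repair (equation 8-14,
# converting the repair time to years)
years = mttf_tmr_with_repair(5, 10 / 8760)
assert round(years) == 3650
```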

In effect, redundancy plus repair has reduced the probability of failure of this supermodule to such a small value that for all practical purposes, failure can be neglected and the supermodule can operate indefinitely.

Before running out to start a company that sells superbly reliable disk-storage systems, it would be wise to review some of the overly optimistic assumptions we made in getting that estimate of the MTTF, most of which are not likely to be true in the real world:

• Disks fail independently. A batch of real world disks may all come from the same vendor, where they acquired the same set of design and manufacturing faults. Or, they may all be in the same machine room, where a single earthquake—which probably has an MTTF of less than 3,650 years—may damage all three.

• Disk failures are memoryless. Real-world disks follow a bathtub curve. If, when disk #1 fails, disk #2 has already been in service for three years, disk #2 no longer has an expected operational lifetime of 5 years, so the chance of a second failure while waiting for repair is higher than the formula assumes. Furthermore, when disk #1 is replaced, its chances of failing are probably higher than usual for the first few weeks.

• Repair is also a memoryless process. In the real world, if we stock enough spares that we run out only once every 10 years and have to wait for a shipment from the factory, but doing a replacement happens to run us out of stock today, we will probably still be out of stock tomorrow and the next day.

• Repair is done flawlessly. A repair person may replace the wrong disk, forget to copy the data to the new disk, or install a disk that hasn’t passed burn-in and fails in the first hour.

Each of these concerns acts to reduce the reliability below what might be expected from our overly simple analysis. Nevertheless, NMR with repair remains a useful technique, and in Chapter 10[on-line] we will see ways in which it can be applied to disk storage.

One of the most powerful applications of NMR is in the masking of transient errors. When a transient error occurs in one replica, the NMR voter immediately masks it. Because the error is transient, the subsequent behavior of the supermodule is as if repair happened by the next operation cycle. The numerical result is little short of extraordinary. For example, consider a processor arithmetic logic unit (ALU) with a 1 gigahertz clock and which is triply replicated with voters checking its output at the end of each clock cycle. In equation 8–13 we have MTTRreplica = 1 cycle (in this application, equation 8–13 is only an approximation because the time to repair is a constant rather than the result of a memoryless process), and MTTFsupermodule = (MTTFreplica)²/6 cycles. If MTTFreplica is 10¹⁰ cycles (1 error in 10 billion cycles, which at this clock speed means one error every 10 seconds), MTTFsupermodule is 10²⁰/6 cycles, about 500 years. TMR has taken three ALUs that were for practical use nearly worthless and created a super-ALU that is almost infallible.

The reason things seem so good is that we are evaluating the chance that two transient errors occur in the same operation cycle. If transient errors really are independent, that chance is small. This effect is powerful, but the leverage works in both directions, thereby creating a potential hazard: it is especially important to keep track of the rate at which transient errors actually occur. If they are happening, say, 20 times as often as hoped, MTTFsupermodule will be 1/400 of the original prediction—the super-ALU is likely to fail once per year. That may still be acceptable for some applications, but it is a big change.
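The super-ALU arithmetic above can be checked the same way. The clock rate and error rate are the text's numbers; the conversion to years is our own:

```python
clock_hz = 1e9                    # 1 GHz ALU clock
mttf_replica = 1e10               # cycles between transient errors (one per 10 s)
mttf_super = mttf_replica**2 / 6  # equation 8-13 with MTTR = 1 cycle

seconds_per_year = 365 * 24 * 3600
years = mttf_super / clock_hz / seconds_per_year
assert 500 < years < 550          # the text's "about 500 years"

# A 20x higher transient-error rate cuts the supermodule MTTF by
# 20**2 = 400x, leaving an MTTF of roughly one year.
assert 1.0 < years / 400 < 1.5
```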
Also, as usual, the assumption of independence is absolutely critical. If all the ALUs came from the same production line, it seems likely that they will have at least some faults in common, in which case the super-ALU may be just as worthless as the individual ALUs.

Several variations on the simple fail-vote structure appear in practice:

• Purging. In an NMR design with a voter, whenever the voter detects that one replica disagrees with the majority, the voter calls for its repair and in addition marks that replica DOWN and ignores its output until hearing that it has been repaired. This technique doesn't add anything to a TMR design, but with higher levels of replication, as long as replicas fail one at a time and any two replicas continue to operate correctly, the supermodule works.

• Pair-and-compare. Create a fail-fast module by taking two replicas, giving them the same inputs, and connecting a simple comparator to their outputs. As long as the comparator reports that the two replicas of a pair agree, the next stage of the system accepts the output. If the comparator detects a disagreement, it reports that the module has failed. The major attraction of pair-and-compare is that it can be used to create fail-fast modules starting with easily available commercial, off-the-shelf components, rather than commissioning specialized fail-fast versions. Special high-reliability components typically have a cost that is much higher than off-the-shelf designs, for two reasons. First, since they take more time to design and test,

Saltzer & Kaashoek Ch. 8, p. 33

June 24, 2009 12:24 am


CHAPTER 8 Fault Tolerance: Reliable Systems from Unreliable

the ones that are available are typically of an older, more expensive technology. Second, they are usually low-volume products that cannot take advantage of economies of large-scale production. These considerations also conspire to produce long delivery cycles, making it harder to keep spares in stock. An important aspect of using standard, high-volume, low-cost components is that one can afford to keep a stock of spares, which in turn means that MTTR can be made small: just replace a failing replica with a spare (the popular term for this approach is pair-and-spare) and do the actual diagnosis and repair at leisure.

• NMR with fail-fast replicas. If each of the replicas is itself a fail-fast design (perhaps using pair-and-compare internally), then a voter can restrict its attention to the outputs of only those replicas that claim to be producing good results and ignore those that are reporting that their outputs are questionable. With this organization, a TMR system can continue to operate even if 2 of its 3 replicas have failed, since the 1 remaining replica is presumably checking its own results. An NMR system with repair and constructed of fail-fast replicas is so robust that it is unusual to find examples for which N is greater than 2.

Figure 8.11 compares the ability to continue operating until repair arrives of 5MR designs that use fail-vote, purging, and fail-fast replicas. The observant reader will note that this chart can be deemed guilty of a misleading comparison, since it claims that the 5MR system continues working when only one fail-fast replica is still running. But if that fail-fast replica is actually a pair-and-compare module, it might be more accurate to say that there are two still-working replicas at that point.

Another technique that takes advantage of repair, can improve availability, and can degrade gracefully (in other words, it can be fail-soft) is called partition.
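The pair-and-compare idea described above can be captured in a few lines. The following is an illustrative Python sketch, not the book's code; it assumes the replicas behave as pure functions over the same inputs:

```python
def pair_and_compare(replica_a, replica_b, inputs):
    """Fail-fast module: run two replicas on the same inputs and compare.
    On agreement the output is accepted; on disagreement the module
    reports its own failure instead of handing on an arbitrary answer."""
    out_a = replica_a(inputs)
    out_b = replica_b(inputs)
    if out_a == out_b:
        return out_a                    # next stage may accept this value
    raise RuntimeError("pair-and-compare: replicas disagree; module failed")

# Hypothetical example: a correct adder paired with one that has a
# transient fault on one particular input.
add_ok = lambda xy: xy[0] + xy[1]
add_bad = lambda xy: xy[0] + xy[1] + (1 if xy == (2, 2) else 0)

print(pair_and_compare(add_ok, add_ok, (2, 2)))   # prints 4
try:
    pair_and_compare(add_ok, add_bad, (2, 2))
except RuntimeError as e:
    print("detected:", e)
```

Note that the comparator detects the fault but cannot say which replica is wrong; that is exactly the fail-fast (as opposed to fail-vote) property.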
If there is a choice of purchasing a system that has either one fast processor or two slower processors, the two-processor system has the virtue that when one of its processors fails, the system

[FIGURE 8.11 — chart: vertical axis "Number of replicas still working correctly" (5 down to 0) versus time, marking the successive points at which "5MR with fail-vote fails", "5MR with purging fails", and "5MR with fail-fast replicas fails".]

FIGURE 8.11 Failure points of three different 5MR supermodule designs, if repair does not happen in time.



can continue to operate with half of its usual capacity until someone can repair the failed processor. An electric power company, rather than installing a single generator of capacity K megawatts, may install N generators of capacity K/N megawatts each.

When equivalent modules can easily share a load, partition can extend to what is called N + 1 redundancy. Suppose a system has a load that would require the capacity of N equivalent modules. The designer partitions the load across N + 1 or more modules. Then, if any one of the modules fails, the system can carry on at full capacity until the failed module can be repaired.

N + 1 redundancy is most applicable to modules that are completely interchangeable, can be dynamically allocated, and are not used as storage devices. Examples are processors, dial-up modems, airplanes, and electric generators. Thus, one extra airplane located at a busy hub can mask the failure of any single plane in an airline's fleet. When modules are not completely equivalent (for example, electric generators come in a range of capacities, but can still be interconnected to share load), the design must ensure that the spare capacity is greater than the capacity of the largest individual module. For devices that provide storage, such as a hard disk, it is also possible to apply partition and N + 1 redundancy with the same goals, but it requires a greater level of organization to preserve the stored contents when a failure occurs, for example by using RAID, as was described in Section 8.4.1, or some more general replica management system such as those discussed in Section 10.3.7.

For some applications an occasional interruption of availability is acceptable, while in others every interruption causes a major problem. When repair is part of the fault tolerance plan, it is sometimes possible, with extra care and added complexity, to design a system to provide continuous operation.
Adding this feature requires that when failures occur, one can quickly identify the failing component, remove it from the system, repair it, and reinstall it (or a replacement part) all without halting operation of the system. The design required for continuous operation of computer hardware involves connecting and disconnecting cables and turning off power to some components but not others, without damaging anything. When hardware is designed to allow connection and disconnection from a system that continues to operate, it is said to allow hot swap.

In a computer system, continuous operation also has significant implications for the software. Configuration management software must anticipate hot swap so that it can stop using hardware components that are about to be disconnected, as well as discover newly attached components and put them to work. In addition, maintaining state is a challenge. If there are periodic consistency checks on data, those checks (and repairs to data when the checks reveal inconsistencies) must be designed to work correctly even though the system is in operation and the data is perhaps being read and updated by other users at the same time.

Overall, continuous operation is not a feature that should be casually added to a list of system requirements. When someone suggests it, it may be helpful to point out that it is much like trying to keep an airplane flying indefinitely. Many large systems that appear to provide continuous operation are actually designed to stop occasionally for maintenance.
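Before leaving partition and N + 1 redundancy, the capacity rule stated above can be expressed as a small check. This is an illustrative sketch (the function name and numbers are ours, not the book's):

```python
def n_plus_1_ok(module_capacities, load):
    """N + 1 redundancy check: after losing the single largest module,
    can the surviving modules still carry the full load?  This encodes
    the rule that spare capacity must exceed the capacity of the
    largest individual module."""
    surviving = sum(module_capacities) - max(module_capacities)
    return surviving >= load

# Four equivalent 25 MW generators carrying a 75 MW load: N + 1 holds.
print(n_plus_1_ok([25, 25, 25, 25], 75))   # True
# Unequal modules: the same total installed capacity (100 MW) is not
# enough if the largest module dwarfs the spare capacity.
print(n_plus_1_ok([50, 25, 25], 75))       # False: losing the 50 leaves 50
```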


8.5 Applying Redundancy to Software and Data

The examples of redundancy and replication in the previous sections all involve hardware. A seemingly obvious next step is to apply the same techniques to software and to data. In the case of software the goal is to reduce the impact of programming errors, while in the case of data the goal is to reduce the impact of any kind of hardware, software, or operational error that might affect its integrity. This section begins the exploration of several applicable techniques: N-version programming, valid construction, and building a firewall to separate stored state into two categories: state whose integrity must be preserved and state that can casually be abandoned because it is easy to reconstruct.

8.5.1 Tolerating Software Faults

Simply running three copies of the same buggy program is likely to produce three identical incorrect results. NMR requires independence among the replicas, so the designer needs a way of introducing that independence. An example of a way of introducing independence is found in the replication strategy for the root name servers of the Internet Domain Name System (DNS, described in Section 4.4). Over the years, slightly different implementations of the DNS software have evolved for different operating systems, so the root name server replicas intentionally employ these different implementations to reduce the risk of replicated errors.

To try to harness this idea more systematically, one can commission several teams of programmers and ask each team to write a complete version of an application according to a single set of specifications. Then, run these several versions in parallel and compare their outputs. The hope is that the inevitable programming errors in the different versions will be independent and voting will produce a reliable system. Experiments with this technique, known as N-version programming, suggest that the necessary independence is hard to achieve. Different programmers may be trained in similar enough ways that they make the same mistakes. Use of the same implementation language may encourage the same errors. Ambiguities in the specification may be misinterpreted in the same way by more than one team and the specification itself may contain errors. Finally, it is hard to write a specification in enough detail that the outputs of different implementations can be expected to be bit-for-bit identical. The result is that after much effort, the technique may still mask only a certain class of bugs and leave others unmasked.
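The voting step of N-version programming can be sketched in a few lines. This is an illustrative Python sketch, not the book's code; the three "versions" here are deliberately trivial stand-ins for independently written implementations of one specification (integer square root):

```python
from collections import Counter
import math

def n_version_vote(versions, inputs):
    """Run independently written versions of the same specification on
    the same inputs and take a majority vote.  Assumes the outputs are
    directly comparable, which the text notes is itself hard to
    guarantee in practice."""
    outputs = [v(inputs) for v in versions]
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(versions) // 2:
        return value
    raise RuntimeError("no majority among versions")

# Three hypothetical implementations of integer square root.
v1 = lambda n: math.isqrt(n)                              # library routine
v2 = lambda n: int(math.sqrt(n))                          # via floating point
v3 = lambda n: max(i for i in range(n + 1) if i * i <= n) # brute force

print(n_version_vote([v1, v2, v3], 10))   # prints 3
```

The sketch also illustrates the text's caveat: if two of the teams make the same mistake (say, both mishandle rounding), the vote masks nothing.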
Nevertheless, there are reports that N-version programming has been used, apparently with success, in at least two safety-critical aerospace systems, the flight control system of the Boeing 777 aircraft (with N = 3) and the on-board control system for the Space Shuttle (with N = 2). Incidentally, the strategy of employing multiple design teams can also be applied to hardware replicas, with a goal of increasing the independence of the replicas by reducing the chance of replicated design errors and systematic manufacturing defects.

Much of software engineering is devoted to a different approach: devising specification and programming techniques that avoid faults in the first place and test techniques



that systematically root out faults so that they can be repaired once and for all before deploying the software. This approach, sometimes called valid construction, can dramatically reduce the number of software faults in a delivered system, but because it is difficult both to completely specify and to completely test a system, some faults inevitably remain. Valid construction is based on the observation that software, unlike hardware, is not subject to wear and tear, so if it is once made correct, it should stay that way. Unfortunately, this observation can turn out to be wishful thinking, first because it is hard to make software correct, and second because it is nearly always necessary to make changes after installing a program because the requirements, the environment surrounding the program, or both, have changed. There is thus a potential for tension between valid construction and the principle that one should design for iteration.

Worse, later maintainers and reworkers often do not have a complete understanding of the ground rules that went into the original design, so their work is likely to introduce new faults for which the original designers did not anticipate providing tests. Even if the original design is completely understood, when a system is modified to add features that were not originally planned, the original ground rules may be subjected to some violence. Software faults more easily creep into areas that lack systematic design.

8.5.2 Tolerating Software (and other) Faults by Separating State

Designers of reliable systems usually assume that, despite the best efforts of programmers, there will always be a residue of software faults, just as there is also always a residue of hardware, operation, and environment faults. The response is to develop a strategy for tolerating all of them. Software adds the complication that the current state of a running program tends to be widely distributed. Parts of that state may be in non-volatile storage, while other parts are in temporary variables held in volatile memory locations, processor registers, and kernel tables. This wide distribution of state makes containment of errors problematic. As a result, when an error occurs, any strategy that involves stopping some collection of running threads, tinkering to repair the current state (perhaps at the same time replacing a buggy program module), and then resuming the stopped threads is usually unrealistic.

In the face of these observations, a programming discipline has proven to be effective: systematically divide the current state of a running program into two mutually exclusive categories and separate the two categories with a firewall. The two categories are:

• State that the system can safely abandon in the event of a failure.
• State whose integrity the system should preserve despite failure.

Upon detecting a failure, the plan becomes to abandon all state in the first category and instead concentrate just on maintaining the integrity of the data in the second category. An important part of the strategy is a sweeping simplification: classify the state of running threads (that is, the thread table, stacks, and registers) as abandonable. When a failure occurs, the system abandons the thread or threads that were running at the time and instead expects a restart procedure, the system operator, or the individual


user to start a new set of threads with a clean slate. The new thread or threads can then, working with only the data found in the second category, verify the integrity of that data and return to normal operation. The primary challenge then becomes to build a firewall that can protect the integrity of the second category of data despite the failure.

The designer can base a natural firewall on the common implementations of volatile (e.g., CMOS memory) and non-volatile (e.g., magnetic disk) storage. As it happens, writing to non-volatile storage usually involves mechanical movement such as rotation of a disk platter, so most transfers move large blocks of data to a limited region of addresses, using a GET/PUT interface. On the other hand, volatile storage technologies typically provide a READ/WRITE interface that allows rapid-fire writes to memory addresses chosen at random, so failures that originate in or propagate to software tend to quickly and untraceably corrupt random-access data. By the time an error is detected the software may thus have already damaged a large and unidentifiable part of the data in volatile memory. The GET/PUT interface instead acts as a bottleneck on the rate of spread of data corruption. The goal can be succinctly stated: to detect failures and stop the system before it reaches the next PUT operation, thus making the volatile storage medium the error containment boundary. It is only incidental that volatile storage usually has a READ/WRITE interface, while non-volatile storage usually has a GET/PUT interface, but because that is usually true it becomes a convenient way to implement and describe the firewall.

This technique is widely used in systems whose primary purpose is to manage long-lived data. In those systems, two aspects are involved:

• Prepare for failure by recognizing that all state in volatile memory devices can vanish at any instant, without warning.
When it does vanish, automatically launch new threads that start by restoring the data in non-volatile storage to a consistent, easily described state. The techniques to do this restoration are called recovery. Doing recovery systematically involves atomicity, which is explored in Chapter 9[on-line].

• Protect the data in non-volatile storage using replication, thus creating the class of storage known as durable storage. Replicating data can be a straightforward application of redundancy, so we will begin the topic in this chapter. However, there are more effective designs that make use of atomicity and geographical separation of replicas, so we will revisit durability in Chapter 10[on-line].

When the volatile storage medium is CMOS RAM and the non-volatile storage medium is magnetic disk, following this programming discipline is relatively straightforward because the distinctively different interfaces make it easy to remember where to place data. But when a one-level store is in use, giving the appearance of random access to all storage, or the non-volatile medium is flash memory, which allows fast random access, it may be necessary for the designer to explicitly specify both the firewall mechanism and which data items are to reside on each side of the firewall.
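The discipline just described can be sketched in miniature. The following is an illustrative toy, not the book's code: a dict stands in for the non-volatile side, every durable update passes through a single put() bottleneck that checks integrity before the data crosses the firewall, and all other state is abandonable:

```python
class DurableStore:
    """Toy firewall: the only path to the 'non-volatile' side is put(),
    where suspect data is stopped before it can cross."""
    def __init__(self):
        self.non_volatile = {}          # survives a crash in this model

    def put(self, key, value, check):
        # The bottleneck: detect trouble and stop before the next PUT,
        # keeping corruption contained in volatile state.
        if not check(value):
            raise ValueError("integrity check failed; stopping before PUT")
        self.non_volatile[key] = value

    def get(self, key):
        return self.non_volatile[key]

store = DurableStore()
volatile_scratch = {"balance": 100}     # category 1: abandonable state
store.put("balance", volatile_scratch["balance"], lambda v: v >= 0)

# Simulate a crash: throw away all volatile state, then restart and
# recover by working only from the preserved category.
del volatile_scratch
print(store.get("balance"))             # prints 100
```

A corrupted value (say, a negative balance produced by a software fault) is rejected at put(), so the damage never spreads past the firewall.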



A good example of the firewall strategy can be found in most implementations of Internet Domain Name System servers. In a typical implementation the server stores the authoritative name records for its domain on magnetic disk, and copies those records into volatile CMOS memory either at system startup or the first time it needs a particular record. If the server fails for any reason, it simply abandons the volatile memory and restarts. In some implementations, the firewall is reinforced by not having any PUT operations in the running name server. Instead, the service updates the authoritative name records using a separate program that runs when the name server is off-line. In addition to employing independent software implementations and a firewall between categories of data, DNS also protects against environmental faults by employing geographical separation of its replicas, a topic that is explored more deeply in Section 10.3[on-line]. The three techniques taken together make DNS quite fault tolerant.

8.5.3 Durability and Durable Storage

For the discipline just described to work, we need to make the result of a PUT operation durable. But first we must understand just what "durable" means. Durability is a specification of how long the result of an action must be preserved after the action completes. One must be realistic in specifying durability because there is no such thing as perfectly durable storage in which the data will be remembered forever. However, by choosing enough genuinely independent replicas, and with enough care in management, one can meet any reasonable requirement.

Durability specifications can be roughly divided into four categories, according to the length of time that the application requires that data survive. Although there are no bright dividing lines, as one moves from one category to the next the techniques used to achieve durability tend to change.

• Durability no longer than the lifetime of the thread that created the data. For this case, it is usually adequate to place the data in volatile memory. For example, an action such as moving the gearshift may require changing the operating parameters of an automobile engine. The result must be reliably remembered, but only until the next shift of gears or the driver switches the engine off. The operations performed by calls to the kernel of an operating system provide another example. The CHDIR procedure of the UNIX kernel (see Table 2.1 in Section 2.5.1) changes the working directory of the currently running process. The kernel state variable that holds the name of the current working directory is a value in volatile RAM that does not need to survive longer than this process. For a third example, the registers and cache of a hardware processor usually provide just the first category of durability. If there is a failure, the plan is to abandon those values along with the contents of volatile memory, so there is no need for a higher level of durability.
• Durability for times short compared with the expected operational lifetime of non-volatile storage media such as magnetic disk or flash memory. A designer typically


implements this category of durability by writing one copy of the data in the non-volatile storage medium. Returning to the automotive example, there may be operating parameters such as engine timing that, once calibrated, should be durable at least until the next tune-up, not just for the life of one engine use session. Data stored in a cache that writes through to a non-volatile medium has about this level of durability. As a third example, a remote procedure call protocol that identifies duplicate messages by recording nonces might write old nonce values (see Section 7.5.3) to a non-volatile storage medium, knowing that the real goal is not to remember the nonces forever, but rather to make sure that the nonce record outlasts the longest retry timer of any client. Finally, text editors and word-processing systems typically write temporary copies on magnetic disk of the material currently being edited so that if there is a system crash or power failure the user does not have to repeat the entire editing session. These temporary copies need to survive only until the end of the current editing session.

• Durability for times comparable to the expected operational lifetime of non-volatile storage media. Because actual non-volatile media lifetimes vary quite a bit around the expected lifetime, implementation generally involves placing replicas of the data on independent instances of the non-volatile media. This category of durability is the one that is usually called durable storage and it is the category for which the next section of this chapter develops techniques for implementation. Users typically expect files stored in their file systems and data managed by a database management system to have this level of durability. Section 10.3[on-line] revisits the problem of creating durable storage when replicas are geographically separated.

• Durability for many multiples of the expected operational lifetime of non-volatile storage media.
This highest level of durability is known as preservation, and is the specialty of archivists. In addition to making replicas and keeping careful records, it involves copying data from one non-volatile medium to another before the first one deteriorates or becomes obsolete. Preservation also involves (sometimes heroic) measures to preserve the ability to correctly interpret idiosyncratic formats created by software that has long since become obsolete. Although important, it is a separate topic, so preservation is not discussed any further here.

8.5.4 Magnetic Disk Fault Tolerance

In principle, durable storage can be constructed starting with almost any storage medium, but it is most straightforward to use non-volatile devices. Magnetic disks (see Sidebar 2.8) are widely used as the basis for durable storage because of their low cost, large capacity, and non-volatility—they retain their memory when power is turned off or is accidentally disconnected. Even if power is lost during a write operation, at most a small block of data surrounding the physical location that was being written is lost, and



disks can be designed with enough internal power storage and data buffering to avoid even that loss. In its raw form, a magnetic disk is remarkably reliable, but it can still fail in various ways and much of the complexity in the design of disk systems consists of masking these failures.

Conventionally, magnetic disk systems are designed in three nested layers. The innermost layer is the spinning disk itself, which provides what we will call raw storage. The next layer is a combination of hardware and firmware of the disk controller that provides for detecting the failures in the raw storage layer; it creates fail-fast storage. Finally, the hard disk firmware adds a third layer that takes advantage of the detection features of the second layer to create a substantially more reliable storage system, known as careful storage. Most disk systems stop there, but high-availability systems add a fourth layer to create durable storage. This section develops a disk failure model and explores error masking techniques for all four layers.

In early disk designs, the disk controller presented more or less the raw disk interface, and the fail-fast and careful layers were implemented in a software component of the operating system called the disk driver. Over the decades, first the fail-fast layer and more recently part or all of the careful layer of disk storage have migrated into the firmware of the disk controller to create what is known in the trade as a "hard drive". A hard drive usually includes a RAM buffer to hold a copy of the data going to and from the disk, both to avoid the need to match the data rate to and from the disk head with the data rate to and from the system memory and also to simplify retries when errors occur. RAID systems, which provide a form of durable storage, generally are implemented as an additional hardware layer that incorporates mass-market hard drives.
One reason for this move of error masking from the operating system into the disk controller is that as computational power has gotten cheaper, the incremental cost of a more elaborate firmware design has dropped. A second reason may explain the obvious contrast with the lack of enthusiasm for memory parity checking hardware that is mentioned in Section 8.8.1. A transient memory error is all but indistinguishable from a program error, so the hardware vendor is not likely to be blamed for it. On the other hand, most disk errors have an obvious source, and hard errors are not transient. Because blame is easy to place, disk vendors have a strong motivation to include error masking in their designs.

Magnetic Disk Fault Modes

Sidebar 2.8 described the physical design of the magnetic disk, including platters, magnetic material, read/write heads, seek arms, tracks, cylinders, and sectors, but it did not make any mention of disk reliability. There are several considerations:

• Disks are high-precision devices made to close tolerances. Defects in manufacturing a recording surface typically show up in the field as a sector that does not reliably record data. Such defects are a source of hard errors. Deterioration of the surface of a platter with age can cause a previously good sector to fail. Such loss is known as decay and, since any data previously recorded there is lost forever, decay is another example of hard error.


• Since a disk is mechanical, it is subject to wear and tear. Although a modern disk is a sealed unit, deterioration of its component materials as they age can create dust. The dust particles can settle on a magnetic surface, where they may interfere either with reading or writing. If interference is detected, then re-reading or re-writing that area of the surface, perhaps after jiggling the seek arm back and forth, may succeed in getting past the interference, so the fault may be transient. Another source of transient faults is electrical noise spikes. Because disk errors caused by transient faults can be masked by retry, they fall in the category of soft errors.

• If a running disk is bumped, the shock may cause a head to hit the surface of a spinning platter, causing what is known as a head crash. A head crash not only may damage the head and destroy the data at the location of impact, it also creates a cloud of dust that interferes with the operation of heads on other platters. A head crash generally results in several sectors decaying simultaneously. A set of sectors that tend to all fail together is known as a decay set. A decay set may be quite large, for example all the sectors on one drive or on one disk platter.

• As electronic components in the disk controller age, clock timing and signal detection circuits can go out of tolerance, causing previously good data to become unreadable, or bad data to be written, either intermittently or permanently. In consequence, electronic component tolerance problems can appear either as soft or hard errors.

• The mechanical positioning systems that move the seek arm and that keep track of the rotational position of the disk platter can fail in such a way that the heads read or write the wrong track or sector within a track. This kind of fault is known as a seek error.
System Faults

In addition to failures within the disk subsystem, there are at least two threats to the integrity of the data on a disk that arise from outside the disk subsystem:

• If the power fails in the middle of a disk write, the sector being written may end up being only partly updated. After the power is restored and the system restarts, the next reader of that sector may find that the sector begins with the new data, but ends with the previous data.

• If the operating system fails during the time that the disk is writing, the data being written could be affected, even if the disk is perfect and the rest of the system is fail-fast. The reason is that all the contents of volatile memory, including the disk buffer, are inside the fail-fast error containment boundary and thus at risk of damage when the system fails. As a result, the disk channel may correctly write on the disk what it reads out of the disk buffer in memory, but the faltering operating system may have accidentally corrupted the contents of that buffer after the



application called PUT. In such cases, the data that ends up on the disk will be corrupted, but there is no sure way in which the disk subsystem can detect the problem.

Raw Disk Storage

Our goal is to devise systematic procedures to mask as many of these different faults as possible. We start with a model of disk operation from a programmer's point of view. The raw disk has, at least conceptually, a relatively simple interface: There is an operation to seek to a (numbered) track, an operation that writes data on the track and an operation that reads data from the track. The failure model is simple: all errors arising from the failures just described are untolerated. (In the procedure descriptions, arguments are call-by-reference, and GET operations read from the disk into the argument named data.) The raw disk layer implements these storage access procedures and failure tolerance model:

RAW_SEEK (track)    // Move read/write head into position.
RAW_PUT (data)      // Write entire track.
RAW_GET (data)      // Read entire track.

• error-free operation: RAW_SEEK moves the seek arm to position track. RAW_GET returns whatever was most recently written by RAW_PUT at position track.

• untolerated error: On any given attempt to read from or write to a disk, dust particles on the surface of the disk or a temporarily high noise level may cause data to be read or written incorrectly. (soft error)

• untolerated error: A spot on the disk may be defective, so all attempts to write to any track that crosses that spot will be written incorrectly. (hard error)

• untolerated error: Information previously written correctly may decay, so RAW_GET returns incorrect data. (hard error)

• untolerated error: When asked to read data from or write data to a specified track, a disk may correctly read or write the data, but on the wrong track. (seek error)

• untolerated error: The power fails during a RAW_PUT with the result that only the first part of data ends up being written on track. The remainder of track may contain older data.

• untolerated error: The operating system crashes during a RAW_PUT and scribbles over the disk buffer in volatile storage, so RAW_PUT writes corrupted data on one track of the disk.

Fail-Fast Disk Storage

The fail-fast layer is the place where the electronics and microcode of the disk controller divide the raw disk track into sectors. Each sector is relatively small, individually protected with an error-detection code, and includes in addition to a fixed-sized space for data a sector and track number. The error-detection code enables the disk controller to

Saltzer & Kaashoek Ch. 8, p. 43

June 24, 2009 12:24 am


CHAPTER 8 Fault Tolerance: Reliable Systems from Unreliable

return a status code on FAIL_FAST_GET that tells whether a sector read correctly or incorrectly, and the sector and track numbers enable the disk controller to verify that the seek ended up on the correct track. The FAIL_FAST_PUT procedure not only writes the data, but it verifies that the write was successful by reading the newly written sector on the next rotation and comparing it with the data still in the write buffer. The sector thus becomes the minimum unit of reading and writing, and the disk address becomes the pair {track, sector_number}. For performance enhancement, some systems allow the caller to bypass the verification step of FAIL_FAST_PUT. When the client chooses this bypass, write failures become indistinguishable from decay events. There is always a possibility that the data on a sector is corrupted in such a way that the error-detection code accidentally verifies. For completeness, we will identify that case as an untolerated error, but point out that the error-detection code should be powerful enough that the probability of this outcome is negligible. The fail-fast layer implements these storage access procedures and failure tolerance model:

    status ← FAIL_FAST_SEEK (track)
    status ← FAIL_FAST_PUT (data, sector_number)
    status ← FAIL_FAST_GET (data, sector_number)

• error-free operation: FAIL_FAST_SEEK moves the seek arm to track. FAIL_FAST_GET returns whatever was most recently written by FAIL_FAST_PUT at sector_number on track and returns status = OK.
• detected error: FAIL_FAST_GET reads the data, checks the error-detection code and finds that it does not verify. The cause may be a soft error, a hard error due to decay, or a hard error because there is a bad spot on the disk and the invoker of a previous FAIL_FAST_PUT chose to bypass verification. FAIL_FAST_GET does not attempt to distinguish these cases; it simply reports the error by returning status = BAD.
• detected error: FAIL_FAST_PUT writes the data, on the next rotation reads it back, checks the error-detection code, finds that it does not verify, and reports the error by returning status = BAD.
• detected error: FAIL_FAST_SEEK moves the seek arm, reads the permanent track number in the first sector that comes by, discovers that it does not match the requested track number (or that the sector checksum does not verify), and reports the error by returning status = BAD.
• detected error: The caller of FAIL_FAST_PUT tells it to bypass the verification step, so FAIL_FAST_PUT always reports status = OK even if the sector was not written correctly. But a later caller of FAIL_FAST_GET that requests that sector should detect any such error.
• detected error: The power fails during a FAIL_FAST_PUT with the result that only the first part of data ends up being written on sector. The remainder of sector may contain older data. Any later call of FAIL_FAST_GET for that sector should discover that the sector checksum fails to verify and will thus return status = BAD.

8.5 Applying Redundancy to Software and Data


Many (but not all) disks are designed to mask this class of failure by maintaining a reserve of power that is sufficient to complete any current sector write, in which case loss of power would be a tolerated failure.
• untolerated error: The operating system crashes during a FAIL_FAST_PUT and scribbles over the disk buffer in volatile storage, so FAIL_FAST_PUT writes corrupted data on one sector of the disk.
• untolerated error: The data of some sector decays in a way that is undetectable—the checksum accidentally verifies. (Probability should be negligible.)

Careful Disk Storage

The fail-fast disk layer detects but does not mask errors. It leaves masking to the careful disk layer, which is also usually implemented in the firmware of the disk controller. The careful layer checks the value of status following each disk SEEK, GET and PUT operation, retrying the operation several times if necessary, a procedure that usually recovers from seek errors and soft errors caused by dust particles or a temporarily elevated noise level. Some disk controllers seek to a different track and back in an effort to dislodge the dust. The careful storage layer implements these storage procedures and failure tolerance model:

    status ← CAREFUL_SEEK (track)
    status ← CAREFUL_PUT (data, sector_number)
    status ← CAREFUL_GET (data, sector_number)

• error-free operation: CAREFUL_SEEK moves the seek arm to track. CAREFUL_GET returns whatever was most recently written by CAREFUL_PUT at sector_number on track. All three return status = OK.
• tolerated error: Soft read, write, or seek error. CAREFUL_SEEK, CAREFUL_GET and CAREFUL_PUT mask these errors by repeatedly retrying the operation until the fail-fast layer stops detecting an error, returning with status = OK. The careful storage layer counts the retries, and if the retry count exceeds some limit, it gives up and declares the problem to be a hard error.
• detected error: Hard error. The careful storage layer distinguishes hard from soft errors by their persistence through several attempts to read, write, or seek, and reports them to the caller by setting status = BAD. (But also see the note on revectoring below.)
• detected error: The power fails during a CAREFUL_PUT with the result that only the first part of data ends up being written on sector. The remainder of sector may contain older data. Any later call of CAREFUL_GET for that sector should discover that the sector checksum fails to verify and will thus return status = BAD. (Assuming that the fail-fast layer does not tolerate power failures.)
• untolerated error: Crash corrupts data. The system crashes during CAREFUL_PUT and corrupts the disk buffer in volatile memory, so CAREFUL_PUT correctly writes to the

disk sector the corrupted data in that buffer. The sector checksum of the fail-fast layer cannot detect this case.
• untolerated error: The data of some sector decays in a way that is undetectable—the checksum accidentally verifies. (Probability should be negligible.)

Figure 8.12 exhibits algorithms for CAREFUL_GET and CAREFUL_PUT. The procedure CAREFUL_GET, by repeatedly reading any data with status = BAD, masks soft read errors. Similarly, CAREFUL_PUT retries repeatedly if the verification done by FAIL_FAST_PUT fails, thereby masking soft write errors, whatever their source.

    procedure CAREFUL_GET (data, sector_number)
        for i from 1 to NTRIES do
            if FAIL_FAST_GET (data, sector_number) = OK then
                return OK
        return BAD

    procedure CAREFUL_PUT (data, sector_number)
        for i from 1 to NTRIES do
            if FAIL_FAST_PUT (data, sector_number) = OK then
                return OK
        return BAD

FIGURE 8.12 Procedures that implement careful disk storage.

The careful layer of most disk controller designs includes one more feature: if CAREFUL_PUT detects a hard error while writing a sector, it may instead write the data on a spare sector elsewhere on the same disk and add an entry to an internal disk mapping table so that future GETs and PUTs that specify that sector instead use the spare. This mechanism is called revectoring, and most disk designs allocate a batch of spare sectors for this purpose. The spares are not usually counted in the advertised disk capacity, but the manufacturer’s advertising department does not usually ignore the resulting increase in the expected operational lifetime of the disk. For clarity of the discussion we omit that feature.

As indicated in the failure tolerance analysis, there are still two modes of failure that remain unmasked: a crash during CAREFUL_PUT may undetectably corrupt one disk sector, and a hard error arising from a bad spot on the disk or a decay event may detectably corrupt any number of disk sectors.

Durable Storage: RAID 1

For durability, the additional requirement is to mask decay events, which the careful storage layer only detects. The primary technique is that the PUT procedure should write several replicas of the data, taking care to place the replicas on different physical devices with the hope that the probability of disk decay in one replica is independent of the probability of disk decay in the next one, and the number of replicas is large enough that when a disk fails there is enough time to replace it before all the other replicas fail. Disk system designers call these replicas mirrors. A carefully designed replica strategy can create storage that guards against premature disk failure and that is durable enough to substantially exceed the expected operational lifetime of any single physical disk.

Errors on reading are detected by the fail-fast layer, so it is not usually necessary to read more than one copy unless that copy turns out to be bad. Since disk operations may involve more than one replica, the track and sector numbers are sometimes encoded into a virtual sector number and the durable storage layer automatically performs any needed seeks. The durable storage layer implements these storage access procedures and failure tolerance model:

    status ← DURABLE_PUT (data, virtual_sector_number)
    status ← DURABLE_GET (data, virtual_sector_number)

• error-free operation: DURABLE_GET returns whatever was most recently written by DURABLE_PUT at virtual_sector_number with status = OK.
• tolerated error: Hard errors reported by the careful storage layer are masked by reading from one of the other replicas. The result is that the operation completes with status = OK.
• untolerated error: A decay event occurs on the same sector of all the replicas, and the operation completes with status = BAD.
• untolerated error: The operating system crashes during a DURABLE_PUT and scribbles over the disk buffer in volatile storage, so DURABLE_PUT writes corrupted data on all mirror copies of that sector.
• untolerated error: The data of some sector decays in a way that is undetectable—the checksum accidentally verifies. (Probability should be negligible.)

In this accounting there is no mention of soft errors or of positioning errors because they were all masked by a lower layer. One configuration of RAID (see Section ), known as “RAID 1”, implements exactly this form of durable storage. RAID 1 consists of a tightly-managed array of identical replica disks in which DURABLE_PUT (data, sector_number) writes data at the same sector_number of each disk and DURABLE_GET reads from whichever replica copy has the smallest expected latency, which includes queuing time, seek time, and rotation time. With RAID, the decay set is usually taken to be an entire hard disk. If one of the disks fails, the next DURABLE_GET that tries to read from that disk will detect the failure, mask it by reading from another replica, and put out a call for repair. Repair consists of first replacing the disk that failed and then copying all of the disk sectors from one of the other replica disks.

Improving on RAID 1

Even with RAID 1, an untolerated error can occur if a rarely-used sector decays, and before that decay is noticed all other copies of that same sector also decay. When there is

finally a call for that sector, all fail to read and the data is lost. A closely related scenario is that a sector decays and is eventually noticed, but the other copies of that same sector decay before repair of the first one is completed. One way to reduce the chances of these outcomes is to implement a clerk that periodically reads all replicas of every sector, to check for decay. If CAREFUL_GET reports that a replica of a sector is unreadable at one of these periodic checks, the clerk immediately rewrites that replica from a good one. If the rewrite fails, the clerk calls for immediate revectoring of that sector or, if the number of revectorings is rapidly growing, replacement of the decay set to which the sector belongs. The period between these checks should be short enough that the probability that all replicas have decayed since the previous check is negligible. By analyzing the statistics of experience for similar disk systems, the designer chooses such a period, Td. This approach leads to the following failure tolerance model:

    status ← MORE_DURABLE_PUT (data, virtual_sector_number)
    status ← MORE_DURABLE_GET (data, virtual_sector_number)

• error-free operation: MORE_DURABLE_GET returns whatever was most recently written by MORE_DURABLE_PUT at virtual_sector_number with status = OK.
• tolerated error: Hard errors reported by the careful storage layer are masked by reading from one of the other replicas. The result is that the operation completes with status = OK.
• tolerated error: Data of a single decay set decays, is discovered by the clerk, and is repaired, all within Td seconds of the decay event.
• untolerated error: The operating system crashes during a MORE_DURABLE_PUT and scribbles over the disk buffer in volatile storage, so MORE_DURABLE_PUT writes corrupted data on all mirror copies of that sector.
• untolerated error: All decay sets fail within Td seconds. (With a conservative choice of Td, the probability of this event should be negligible.)
• untolerated error: The data of some sector decays in a way that is undetectable—the checksum accidentally verifies. (With a good quality checksum, the probability of this event should be negligible.)

A somewhat less effective alternative to running a clerk that periodically verifies integrity of the data is to notice that the bathtub curve of Figure 8.1 applies to magnetic disks, and simply adopt a policy of systematically replacing the individual disks of the RAID array well before they reach the point where their conditional failure rate is predicted to start climbing. This alternative is not as effective for two reasons: First, it does not catch and repair random decay events, which instead accumulate. Second, it provides no warning if the actual operational lifetime is shorter than predicted (for example, if one happens to have acquired a bad batch of disks).
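The interplay of the careful layer's retry loop (Figure 8.12), replica fallback, and the clerk's periodic scrub can be sketched in a few dozen lines. This is a toy model under stated assumptions, not the book's implementation: `Replica`, `careful_get`, `more_durable_get`, and `clerk_scrub` are hypothetical names, and decay is simulated by marking sectors unreadable rather than by a real per-sector checksum.

```python
OK, BAD = "OK", "BAD"
NTRIES = 3       # retry limit in the careful layer (illustrative)
NREPLICAS = 3    # mirror count in the durable layer (illustrative)

class Replica:
    """Stand-in for one fail-fast disk: reads report BAD for any sector
    that has 'decayed' (simulated), instead of returning bad data."""
    def __init__(self):
        self.data = {}
        self.decayed = set()
    def fail_fast_put(self, data, sector):
        self.data[sector] = data
        self.decayed.discard(sector)   # a rewrite repairs the sector
        return OK
    def fail_fast_get(self, sector):
        if sector in self.decayed or sector not in self.data:
            return BAD, None
        return OK, self.data[sector]

def careful_get(replica, sector):
    # Figure 8.12 pattern: retry to mask soft errors; report hard errors.
    for _ in range(NTRIES):
        status, data = replica.fail_fast_get(sector)
        if status == OK:
            return OK, data
    return BAD, None

replicas = [Replica() for _ in range(NREPLICAS)]

def more_durable_put(data, sector):
    for r in replicas:
        if r.fail_fast_put(data, sector) != OK:
            return BAD
    return OK

def more_durable_get(sector):
    # Mask hard errors by falling back to another replica.
    for r in replicas:
        status, data = careful_get(r, sector)
        if status == OK:
            return OK, data
    return BAD, None    # all replicas decayed: untolerated

def clerk_scrub(sectors):
    """Periodic integrity check: rewrite any replica that fails to read."""
    for sector in sectors:
        status, good = more_durable_get(sector)
        if status != OK:
            continue                   # nothing left to repair from
        for r in replicas:
            if careful_get(r, sector)[0] == BAD:
                r.fail_fast_put(good, sector)

more_durable_put(b"payroll", 42)
replicas[0].decayed.add(42)            # one replica decays
assert more_durable_get(42) == (OK, b"payroll")   # masked by a mirror
clerk_scrub([42])                      # clerk repairs the bad replica
assert careful_get(replicas[0], 42) == (OK, b"payroll")
```

The final assertions trace the tolerated-error path above: a single decay event is masked on read and then repaired by the clerk well before a second replica can fail.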

Saltzer & Kaashoek Ch. 8, p. 48

June 24, 2009 12:24 am

8.5 Applying Redundancy to Software and Data

Detecting Errors Caused by System Crashes

With the addition of a clerk to watch for decay, there is now just one remaining untolerated error that has a significant probability: the hard error created by an operating system crash during CAREFUL_PUT. Since that scenario corrupts the data before the disk subsystem sees it, the disk subsystem has no way of either detecting or masking this error. Help is needed from outside the disk subsystem—either the operating system or the application. The usual approach is that either the system or, even better, the application program, calculates and includes an end-to-end checksum with the data before initiating the disk write. Any program that later reads the data verifies that the stored checksum matches the recalculated checksum of the data. The end-to-end checksum thus monitors the integrity of the data as it passes through the operating system buffers and also while it resides in the disk subsystem.

Sidebar 8.3: Are disk system checksums a wasted effort? From the adjacent paragraph, an end-to-end argument suggests that an end-to-end checksum is always needed to protect data on its way to and from the disk subsystem, and that the fail-fast checksum performed inside the disk subsystem thus may not be essential. However, the disk system checksum cleanly subcontracts one rather specialized job: correcting burst errors of the storage medium. In addition, the disk system checksum provides a handle for disk-layer erasure code implementations such as RAID, as was described in Section 8.4.1. Thus the disk system checksum, though superficially redundant, actually turns out to be quite useful.

The end-to-end checksum allows only detecting this class of error. Masking is another matter—it involves a technique called recovery, which is one of the topics of the next chapter.

Table 8.1 summarizes where failure tolerance is implemented in the several disk layers. The hope is that the remaining untolerated failures are so rare that they can be neglected. If they are not, the number of replicas could be increased until the probability of untolerated failures is negligible.

Still More Threats to Durability

The various procedures described above create storage that is durable in the face of individual disk decay but not in the face of other threats to data integrity. For example, if the power fails in the middle of a MORE_DURABLE_PUT, some replicas may contain old versions of the data, some may contain new versions, and some may contain corrupted data, so it is not at all obvious how MORE_DURABLE_GET should go about meeting its specification. The solution is to make MORE_DURABLE_PUT atomic, which is one of the topics of Chapter 9[on-line].

RAID systems usually specify that a successful return from a PUT confirms that writing of all of the mirror replicas was successful. That specification in turn usually requires that the multiple disks be physically co-located, which in turn creates a threat that a single


physical disaster—fire, earthquake, flood, civil disturbance, etc.—might damage or destroy all of the replicas.

Since magnetic disks are quite reliable in the short term, a different strategy is to write only one replica at the time that MORE_DURABLE_PUT is invoked and write the remaining replicas at a later time. Assuming there are no inopportune failures in the short run, the results gradually become more durable as more replicas are written. Replica writes that are separated in time are less likely to have replicated failures because they can be separated in physical location, use different disk driver software, or be written to completely different media such as magnetic tape. On the other hand, separating replica writes in time increases the risk of inconsistency among the replicas. Implementing storage that has durability that is substantially beyond that of RAID 1 and MORE_DURABLE_PUT/GET generally involves use of geographically separated replicas and systematic mechanisms to keep those replicas coordinated, a challenge that Chapter 10[on-line] discusses in depth.

Perhaps the most serious threat to durability is that although different storage systems have employed each of the failure detection and masking techniques discussed in this section, it is all too common to discover that a typical off-the-shelf personal computer file

                                      raw          fail-fast    careful      durable      more durable
                                      layer        layer        layer        layer        layer

soft read, write, or seek error       untolerated  detected     masked       masked       masked
hard read, write error                untolerated  detected     detected     masked       masked
power failure interrupts a write      untolerated  detected     detected     masked       masked
single data decay                     untolerated  detected     detected     masked       masked
multiple data decay spaced in time    untolerated  detected     detected     untolerated  masked
multiple data decay within Td         untolerated  detected     detected     untolerated  untolerated*
undetectable decay                    untolerated  untolerated  untolerated  untolerated  untolerated*
system crash corrupts write buffer    untolerated  untolerated  untolerated  untolerated  untolerated

Table 8.1: Summary of disk failure tolerance models. Each entry shows the effect of this error at the interface between the named layer and the next higher layer. With careful design, the probability of the two failures marked with an asterisk should be negligible. Masking of corruption caused by system crashes is discussed in Chapter 9[on-line].
Saltzer & Kaashoek Ch. 8, p. 50

June 24, 2009 12:24 am

8.6 Wrapping up Reliability


system has been designed using an overly simple disk failure model and thus misses some—or even many—straightforward failure masking opportunities.

8.6 Wrapping up Reliability

8.6.1 Design Strategies and Design Principles

Standing back from the maze of detail about redundancy, we can identify and abstract three particularly effective design strategies:

• N-modular redundancy is a simple but powerful tool for masking failures and increasing availability, and it can be used at any convenient level of granularity.
• Fail-fast modules provide a sweeping simplification of the problem of containing errors. When containment can be described simply, reasoning about fault tolerance becomes easier.
• Pair-and-compare allows fail-fast modules to be constructed from commercial, off-the-shelf components.

Standing back still further, it is apparent that several general design principles are directly applicable to fault tolerance. In the formulation of the fault-tolerance design process in Section 8.1.2, we invoked be explicit, design for iteration, keep digging, and the safety margin principle, and in exploring different fault tolerance techniques we have seen several examples of adopt sweeping simplifications. One additional design principle that applies to fault tolerance (and also, as we will see in Chapter 11[on-line], to security) comes from experience, as documented in the case studies of Section 8.8:

Avoid rarely used components
Deterioration and corruption accumulate unnoticed—until the next use.

Whereas redundancy can provide masking of errors, redundant components that are used only when failures occur are much more likely to cause trouble than redundant components that are regularly exercised in normal operation. The reason is that failures in regularly exercised components are likely to be immediately noticed and fixed. Failures in unused components may not be noticed until a failure somewhere else happens. But then there are two failures, which may violate the design assumptions of the masking plan. This observation is especially true for software, where rarely-used recovery procedures often accumulate unnoticed bugs and incompatibilities as other parts of the system evolve. The alternative of periodic testing of rarely-used components to lower their failure latency is a band-aid that rarely works well.

In applying these design principles, it is important to consider the threats, the consequences, the environment, and the application. Some faults are more likely than others,

some failures are more disruptive than others, and different techniques may be appropriate in different environments. A computer-controlled radiation therapy machine, a deep-space probe, a telephone switch, and an airline reservation system all need fault tolerance, but in quite different forms. The radiation therapy machine should emphasize fault detection and fail-fast design, to avoid injuring patients. Masking faults may actually be a mistake. It is likely to be safer to stop, find their cause, and fix them before continuing operation. The deep-space probe, once the mission begins, needs to concentrate on failure masking to ensure mission success. The telephone switch needs many nines of availability because customers expect to always receive a dial tone, but if it occasionally disconnects one ongoing call, that customer will simply redial without thinking much about it. Users of the airline reservation system might tolerate short gaps in availability, but the durability of its storage system is vital. At the other extreme, most people find that a digital watch has an MTTF that is long compared with the time until the watch is misplaced, becomes obsolete, goes out of style, or is discarded. Consequently, no provision for either error masking or repair is really needed. Some applications have built-in redundancy that a designer can exploit. In a video stream, it is usually possible to mask the loss of a single video frame by just repeating the previous frame.

8.6.2 How about the End-to-End Argument?

There is a potential tension between error masking and an end-to-end argument. An end-to-end argument suggests that a subsystem need not do anything about errors and should not do anything that might compromise other goals such as low latency, high throughput, or low cost. The subsystem should instead let the higher layer system of which it is a component take care of the problem because only the higher layer knows whether or not the error matters and what is the best course of action to take. There are two counter-arguments to that line of reasoning:

• Ignoring an error allows it to propagate, thus contradicting the modularity goal of error containment. This observation points out an important distinction between error detection and error masking. Error detection and containment must be performed where the error happens, so that the error does not propagate wildly. Error masking, in contrast, presents a design choice: masking can be done locally or the error can be handled by reporting it at the interface (that is, by making the module design fail-fast) and allowing the next higher layer to decide what masking action—if any—to take.
• The lower layer may know the nature of the error well enough that it can mask it far more efficiently than the upper layer. The specialized burst error correction codes used on DVDs come to mind. They are designed specifically to mask errors caused by scratches and dust particles, rather than random bit-flips. So we have a trade-off between the cost of masking the fault locally and the cost of letting the error propagate and handling it in a higher layer.



These two points interact: When an error propagates it can contaminate otherwise correct data, which can increase the cost of masking and perhaps even render masking impossible. The result is that when the cost is small, error masking is usually done locally. (That is, assuming that masking is done at all. Many personal computer designs omit memory error masking. Section 8.8.1 discusses some of the reasons for this design decision.)

A closely related observation is that when a lower layer masks a fault it is important that it also report the event to a higher layer, so that the higher layer can keep track of how much masking is going on and thus how much failure tolerance there remains. Reporting to a higher layer is a key aspect of the safety margin principle.
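The end-to-end checksum described earlier for detecting crash-corrupted writes is a concrete instance of handling an error at the highest layer. The sketch below illustrates the idea under simplifying assumptions: a Python dictionary stands in for the entire lower-layer stack (OS buffers plus disk subsystem), and `e2e_write`/`e2e_read` are invented names, not an API from the text.

```python
import hashlib

def e2e_write(storage, key, data):
    """Application-level PUT: attach an end-to-end checksum before the
    data enters any lower layer (OS buffers, disk controller, ...)."""
    digest = hashlib.sha256(data).digest()
    storage[key] = digest + data          # lower layers may corrupt this

def e2e_read(storage, key):
    """Application-level GET: recompute and compare the checksum."""
    stored = storage[key]
    digest, data = stored[:32], stored[32:]
    if hashlib.sha256(data).digest() != digest:
        raise IOError("end-to-end checksum mismatch: data corrupted "
                      "between application write and application read")
    return data

store = {}
e2e_write(store, "record", b"balance=100")
assert e2e_read(store, "record") == b"balance=100"

# Simulate corruption somewhere in a lower layer.
store["record"] = store["record"][:-1] + b"?"
try:
    e2e_read(store, "record")
    raise AssertionError("corruption went undetected")
except IOError:
    pass    # detected, as the end-to-end argument requires
```

Note that, exactly as the text says, this checksum only detects the corruption; masking it would require recovery machinery at a higher layer.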

8.6.3 A Caution on the Use of Reliability Calculations

Reliability calculations seem to be exceptionally vulnerable to the garbage-in, garbage-out syndrome. It is all too common that calculations of mean time to failure are undermined because the probabilistic models are not supported by good statistics on the failure rate of the components, by measures of the actual load on the system or its components, or by accurate assessment of independence between components.

For computer systems, back-of-the-envelope calculations are often more than sufficient because they are usually at least as accurate as the available input data, which tends to be rendered obsolete by rapid technology change. Numbers predicted by formula can generate a false sense of confidence. This argument is much weaker for technologies that tend to be stable (for example, production lines that manufacture glass bottles). So reliability analysis is not a waste of time, but one must be cautious in applying its methods to computer systems.
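A back-of-the-envelope calculation of the kind recommended here takes only a few lines. The MTTF and MTTR figures below are invented inputs, and that is the point: quoting the result to more than one or two significant digits would outrun the accuracy of those inputs.

```python
# Rough availability estimate from the standard MTTF/(MTTF + MTTR) formula.
mttf_hours = 50_000     # claimed component MTTF -- a vendor figure, +/- a lot
mttr_hours = 24         # guessed time to notice and repair a failure

availability = mttf_hours / (mttf_hours + mttr_hours)
downtime_min_per_year = (1 - availability) * 365 * 24 * 60

# Roughly 0.9995, i.e. on the order of 250 minutes of downtime a year.
print(f"availability ~ {availability:.4f}")
print(f"expected downtime ~ {downtime_min_per_year:.0f} minutes/year")
```

Changing the guessed MTTR from 24 hours to 8 or to 72 swings the answer by a factor of several, which illustrates why the input data, not the arithmetic, dominates the quality of such estimates.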

8.6.4 Where to Learn More about Reliable Systems

Our treatment of fault tolerance has explored only the first layer of fundamental concepts. There is much more to the subject. For example, we have not considered another class of fault that combines the considerations of fault tolerance with those of security: faults caused by inconsistent, perhaps even malevolent, behavior. These faults have the characteristic that they generate inconsistent error values, possibly error values that are specifically designed by an attacker to confuse or confound fault tolerance measures. These faults are called Byzantine faults, recalling the reputation of ancient Byzantium for malicious politics. Here is a typical Byzantine fault: suppose that an evil spirit occupies one of the three replicas of a TMR system, waits for one of the other replicas to fail, and then adjusts its own output to be identical to the incorrect output of the failed replica. A voter accepts this incorrect result and the error propagates beyond the intended containment boundary. In another kind of Byzantine fault, a faulty replica in an NMR system sends different result values to each of the voters that are monitoring its output. Malevolence is not required—any fault that is not anticipated by a fault detection mechanism can produce Byzantine behavior. There has recently been considerable attention to techniques


that can tolerate Byzantine faults. Because the tolerance algorithms can be quite complex, we defer the topic to advanced study.

We also have not explored the full range of reliability techniques that one might encounter in practice. For an example that has not yet been mentioned, Sidebar 8.4 describes the heartbeat, a popular technique for detecting failures of active processes.

This chapter has oversimplified some ideas. For example, the definition of availability proposed in Section 8.2 of this chapter is too simple to adequately characterize many large systems. If a bank has hundreds of automatic teller machines, there will probably always be a few teller machines that are not working at any instant. For this case, an availability measure based on the percentage of transactions completed within a specified response time would probably be more appropriate.

A rapidly moving but in-depth discussion of fault tolerance can be found in Chapter 3 of the book Transaction Processing: Concepts and Techniques, by Jim Gray and Andreas Reuter. A broader treatment, with case studies, can be found in the book Reliable Computer Systems: Design and Evaluation, by Daniel P. Siewiorek and Robert S. Swarz. Byzantine faults are an area of ongoing research and development, and the best source is current professional literature.

This chapter has concentrated on general techniques for achieving reliability that are applicable to hardware, software, and complete systems. Looking ahead, Chapters 9[on-line] and 10[on-line] revisit reliability in the context of specific software techniques that permit reconstruction of stored state following a failure when there are several concurrent activities. Chapter 11[on-line], on securing systems against malicious attack, introduces a redundancy scheme known as defense in depth that can help both to contain and to mask errors in the design or implementation of individual security mechanisms.
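The evil-spirit TMR scenario described above can be made concrete with a tiny simulation (hypothetical code, not from the text): a simple majority voter happily accepts the wrong value once a Byzantine replica echoes a failed replica's incorrect output.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by at least two of three replicas,
    or None if there is no majority."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None

correct_output = 42
failed_output = 13               # replica 1 has failed
byzantine_output = failed_output # replica 2 maliciously copies replica 1

# The voter now accepts the WRONG value: the error escapes containment.
assert majority_vote([correct_output, failed_output, byzantine_output]) == 13
```

With only fail-stop faults the same voter would be correct, which is why Byzantine fault tolerance requires more replicas and more elaborate agreement protocols than plain NMR.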

Sidebar 8.4: Detecting failures with heartbeats. An activity such as a Web server is usually intended to keep running indefinitely. If it fails (perhaps by crashing) its clients may notice that it has stopped responding, but clients are not typically in a position to restart the server. Something more systematic is needed to detect the failure and initiate recovery. One helpful technique is to program the thread that should be performing the activity to send a periodic signal to another thread (or a message to a monitoring service) that says, in effect, “I'm still OK”. The periodic signal is known as a heartbeat and the observing thread or service is known as a watchdog. The watchdog service sets a timer, and on receipt of a heartbeat message it restarts the timer. If the timer ever expires, the watchdog assumes that the monitored service has gotten into trouble and it initiates recovery. One limitation of this technique is that if the monitored service fails in such a way that the only thing it does is send heartbeat signals, the failure will go undetected. As with all fixed timers, choosing a good heartbeat interval is an engineering challenge. Setting the interval too short wastes resources sending and responding to heartbeat signals. Setting the interval too long delays detection of failures. Since detection is a prerequisite to repair, a long heartbeat interval increases MTTR and thus reduces availability.
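Sidebar 8.4's heartbeat-and-watchdog pattern can be sketched with two threads. The `Watchdog` class, the intervals, and the polling-based check are all invented for illustration; a production watchdog would typically run its own timer rather than being polled, and would trigger a real recovery action instead of setting a flag.

```python
import threading
import time

HEARTBEAT_INTERVAL = 0.05   # seconds between "I'm still OK" signals (illustrative)
WATCHDOG_TIMEOUT = 0.2      # silence longer than this is declared a failure

class Watchdog:
    """Restarts its timer on every heartbeat; declares failure if the
    timer ever expires (recovery here is just setting a flag)."""
    def __init__(self):
        self.last_beat = time.monotonic()
        self.failed = False
    def heartbeat(self):
        self.last_beat = time.monotonic()
    def check(self):
        if time.monotonic() - self.last_beat > WATCHDOG_TIMEOUT:
            self.failed = True

watchdog = Watchdog()
running = True

def monitored_service():
    # The monitored activity sends a heartbeat on every iteration.
    while running:
        watchdog.heartbeat()
        time.sleep(HEARTBEAT_INTERVAL)

t = threading.Thread(target=monitored_service)
t.start()
time.sleep(0.3)
watchdog.check()
assert not watchdog.failed      # service alive: heartbeats keep arriving

running = False                 # simulate a crash of the service
t.join()
time.sleep(0.3)
watchdog.check()
assert watchdog.failed          # timer expired: time to initiate recovery
```

The sketch also exhibits the sidebar's engineering trade-off directly: shrinking `WATCHDOG_TIMEOUT` detects the simulated crash sooner, while shrinking `HEARTBEAT_INTERVAL` spends more cycles on heartbeats.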

Saltzer & Kaashoek Ch. 8, p. 54

June 24, 2009 12:24 am

8.7 Application: A Fault Tolerance Model for CMOS RAM

This section develops a fault tolerance model for words of CMOS random access memory, first without and then with a simple error-correction code, comparing the probability of error in the two cases. CMOS RAM is both low in cost and extraordinarily reliable, so much so that error masking is often not implemented in mass production systems such as television sets and personal computers. But some systems, for example life-support, air traffic control, or banking systems, cannot afford to take unnecessary risks. Such systems usually employ the same low-cost memory technology but add incremental redundancy.

A common failure of CMOS RAM is that noise intermittently causes a single bit to read or write incorrectly. If intermittent noise affected only reads, then it might be sufficient to detect the error and retry the read. But the possibility of errors on writes suggests using a forward error-correction code.

We start with a fault tolerance model that applies when reading a word from memory without error correction. The model assumes that errors in different bits are independent, and it assigns p as the (presumably small) probability that any individual bit is in error. The notation O(p^n) means terms involving p^n and higher, presumably negligible, powers. Here are the possibilities and their associated probabilities:

Fault tolerance model for raw CMOS random access memory

    error-free case: all 32 bits are correct    (1 – p)^32 = 1 – O(p)
    one bit is in error:                        32p(1 – p)^31 = O(p)
    two bits are in error:                      (31 · 32 ⁄ 2) p^2 (1 – p)^30 = O(p^2)
    three or more bits are in error:            (30 · 31 · 32 ⁄ (3 · 2)) p^3 (1 – p)^29 + … + p^32 = O(p^3)

The coefficients 32, (31 · 32) ⁄ 2, etc., arise by counting the number of ways that one, two, etc., bits could be in error.

Suppose now that the 32-bit block of memory is encoded using a code of Hamming distance 3, as described in Section 8.4.1. Such a code allows any single-bit error to be

corrected and any double-bit error to be detected. After applying the decoding algorithm, the fault tolerance model changes to:

Fault tolerance model for CMOS memory with error correction

    error-free case: all 32 bits are correct    (1 – p)^32 = 1 – O(p)
    one bit corrected:                          32p(1 – p)^31 = O(p)
    two bits are in error:                      (31 · 32 ⁄ 2) p^2 (1 – p)^30 = O(p^2)
    three or more bits are in error:            (30 · 31 · 32 ⁄ (3 · 2)) p^3 (1 – p)^29 + … + p^32 = O(p^3)

The interesting change is in the probability that the decoded value is correct. That probability is the sum of the probabilities that there were no errors and that there was one, tolerated error:

    Prob(decoded value is correct) = (1 – p)^32 + 32p(1 – p)^31
                                   = (1 – 32p + (31 · 32 ⁄ 2) p^2 – …) + (32p – 31 · 32 p^2 + …)
                                   = 1 – O(p^2)

The decoding algorithm has thus eliminated the errors that have probability of order p. It has not eliminated the two-bit errors, which have probability of order p^2, but for two-bit errors the algorithm is fail-fast, so a higher-level procedure has an opportunity to recover, perhaps by requesting retransmission of the data. The code is not helpful if there are errors in three or more bits, which situation has probability of order p^3, but presumably the designer has determined that probabilities of that order are negligible. If they are not, the designer should adopt a more powerful error-correction code.

With this model in mind, one can review the two design questions suggested on page 8–19. The first question is whether the estimate of bit error probability is realistic and whether it is realistic to suppose that multiple bit errors are statistically independent of one another. (Error independence appeared in the analysis in the claim that the probability of an n-bit error has the order of the nth power of the probability of a one-bit error.) Those questions concern the real world and the accuracy of the designer's model of it. For example, this failure model doesn't consider power failures, which might take all the bits out at once, or a driver logic error that might take out all of the even-numbered bits.
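The order-of-magnitude claims in this model are easy to check numerically. The sketch below assumes an illustrative per-bit error probability of p = 10^–6 (a value chosen for the example, not taken from the text) and compares the word error probability with and without single-error correction:

```python
from math import comb

p = 1e-6   # assumed per-bit error probability (illustrative, not from the text)
n = 32     # bits per word

def prob_k_errors(k):
    # Probability that exactly k of the n independent bits are in error.
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The coefficients in the model arise by counting combinations:
assert comb(n, 1) == 32
assert comb(n, 2) == 31 * 32 // 2
assert comb(n, 3) == 30 * 31 * 32 // (3 * 2)

# Without error correction, only the error-free case yields a correct word.
prob_raw_error = 1 - prob_k_errors(0)                      # order p

# With single-error correction, one-bit errors are masked as well.
prob_residual = 1 - (prob_k_errors(0) + prob_k_errors(1))  # order p^2

print(prob_raw_error)   # roughly 32p = 3.2e-5
print(prob_residual)    # roughly (31 * 32 / 2) p^2 = 5.0e-10
```

The five-orders-of-magnitude gap between the two printed values is the payoff of the error-correction code under this independence assumption.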

It also ignores the possibility of faults that lead to errors in the logic of the error-correction circuitry itself.

The second question is whether the coding algorithm actually corrects all one-bit errors and detects all two-bit errors. That question is explored by examining the mathematical structure of the error-correction code and is quite independent of anybody's estimate or measurement of real-world failure types and rates. There are many off-the-shelf coding algorithms that have been thoroughly analyzed and for which the answer is yes.
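That kind of exhaustive analysis can be illustrated on a small scale. The sketch below brute-force checks the classic Hamming (7,4) code (a 4-data-bit code, not the 32-bit code of this section's example), verifying that the decoder recovers the data from every single-bit error in every codeword:

```python
from itertools import product

def encode(d):
    # Hamming (7,4): data bits occupy positions 3, 5, 6, 7 (1-based);
    # parity bits at positions 1, 2, 4 each cover the positions whose
    # binary index includes that parity position.
    d3, d5, d6, d7 = d
    p1 = d3 ^ d5 ^ d7
    p2 = d3 ^ d6 ^ d7
    p4 = d5 ^ d6 ^ d7
    return [p1, p2, d3, p4, d5, d6, d7]

def decode(c):
    # Recompute the parity checks; the syndrome is the 1-based position
    # of a single flipped bit (0 means no error detected).
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    c = list(c)
    if syndrome:
        c[syndrome - 1] ^= 1  # correct the indicated bit
    return (c[2], c[4], c[5], c[6])

# Exhaustive check: every single-bit error in every codeword is corrected.
ok = all(decode([bit ^ (i == j) for j, bit in enumerate(encode(d))]) == d
         for d in product((0, 1), repeat=4)
         for i in range(7))
print(ok)  # True
```

A 32-bit code with distance 3 admits the same style of check; the search space is larger but still trivially enumerable by machine, which is why such codes can be certified independently of any failure-rate measurements.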

8.8 War Stories: Fault Tolerant Systems that Failed

8.8.1 Adventures with Error Correction*

The designers of the computer systems at the Xerox Palo Alto Research Center in the early 1970s encountered a series of experiences with error-detecting and error-correcting memory systems. From these experiences follow several lessons, some of which are far from intuitive, and all of which still apply several decades later.

MAXC. One of the first projects undertaken in the newly-created Computer Systems Laboratory was to build a time-sharing computer system, named MAXC. A brand new 1024-bit memory chip, the Intel 1103, had just appeared on the market, and it promised to be a compact and economical choice for the main memory of the computer. But since the new chip had unknown reliability characteristics, the MAXC designers implemented the memory system using a few extra bits for each 36-bit word, in the form of a single-error-correction, double-error-detection code.

Experience with the memory in MAXC was favorable. The memory was solidly reliable—so solid that no errors in the memory system were ever reported.

The Alto. When the time came to design the Alto personal workstation, the same Intel memory chips still appeared to be the preferred component. Because these chips had performed so reliably in MAXC, the designers of the Alto memory decided to relax a little, omitting error correction. But they were still conservative enough to provide error detection, in the form of one parity bit for each 16-bit word of memory.

This design choice seemed to be a winner because the Alto memory systems also performed flawlessly, at least for the first several months. Then, mysteriously, the operating system began to report frequent memory-parity failures.

Some background: the Alto started life with an operating system and applications that used a simple typewriter-style interface. The display was managed with a character-by-character teletype emulator. But the purpose of the Alto was to experiment with better

* These experiences were reported by Butler Lampson, one of the designers of the MAXC computer and the Alto personal workstations at Xerox Palo Alto Research Center.

things. One of the first steps in that direction was to implement the first what-you-see-is-what-you-get editor, named Bravo. Bravo took full advantage of the bit-map display, filling it not only with text, but also with lines, buttons, and icons. About half the memory system was devoted to display memory. Curiously, the installation of Bravo coincided with the onset of memory parity errors.

It turned out that the Intel 1103 chips were pattern-sensitive—certain read/write sequences of particular bit patterns could cause trouble, probably because those pattern sequences created noise levels somewhere on the chip that systematically exceeded some critical threshold. The Bravo editor's display management was the first application that generated enough different patterns to have an appreciable probability of causing a parity error. It did so, frequently.

Lesson 8.8.1a: There is no such thing as a small change in a large system. A new piece of software can bring down a piece of hardware that is thought to be working perfectly. You are never quite sure just how close to the edge of the cliff you are standing.

Lesson 8.8.1b: Experience is a primary source of information about failures. It is nearly impossible, without specific prior experience, to predict what kinds of failures you will encounter in the field.

Back to MAXC. This circumstance led to a more careful review of the situation on MAXC. MAXC, being a heavily used server, would be expected to encounter at least some of this pattern sensitivity. It was discovered that although the error-correction circuits had been designed to report both corrected errors and uncorrectable errors, the software logged only uncorrectable errors; corrected errors were being ignored. When logging of corrected errors was implemented, it turned out that the MAXC's Intel 1103s were actually failing occasionally, and the error-correction circuitry was busily setting things right.
Lesson 8.8.1c: Whenever systems implement automatic error masking, it is important to follow the safety margin principle by tracking how often errors are successfully masked. Without this information, one has no way of knowing whether the system is operating with a large or small safety margin for additional errors. Otherwise, despite the attempt to put some guaranteed space between yourself and the edge of the cliff, you may be standing on the edge again.

The Alto 2. In 1975, it was time to design a follow-on workstation, the Alto 2. A new generation of memory chips, this time with 4096 bits, was now available. Since it took up much less space and promised to be cheaper, this new chip looked attractive, but again there was no experience with its reliability. The Alto 2 designers, having been made wary by the pattern sensitivity of the previous generation chips, again resorted to a single-error-correction, double-error-detection code in the memory system.

Once again, the memory system performed flawlessly. The cards passed their acceptance tests and went into service. In service, not only were no double-bit errors detected, but single-bit errors were only rarely being corrected. The initial conclusion was that the chip vendors had worked the bugs out and these chips were really good.

About two years later, someone discovered an implementation mistake. In one quadrant of each memory card, neither error correction nor error detection was actually working. All computations done using memory in the misimplemented quadrant were completely unprotected from memory errors.

Lesson 8.8.1d: Never assume that the hardware actually does what it says in the specifications.

Lesson 8.8.1e: It is harder than it looks to test the fault tolerance features of a fault tolerant system.

One might conclude that the intrinsic memory chip reliability had improved substantially—so much that it was no longer necessary to take heroic measures to achieve system reliability. Certainly the chips were better, but they weren't perfect. The other effect here is that errors often don't lead to failures. In particular, a wrong bit retrieved from memory does not necessarily lead to an observed failure. In many cases a wrong bit doesn't matter; in other cases it does but no one notices; in still other cases, the failure is blamed on something else.

Lesson 8.8.1f: Just because it seems to be working doesn't mean that it actually is.

The bottom line. One of the designers of MAXC and the Altos, Butler Lampson, suggests that the possibility that a failure is blamed on something else can be viewed as an opportunity, and it may be one of the reasons that PC manufacturers often do not provide memory parity checking hardware. First, the chips are good enough that errors are rare. Second, if you provide parity checks, consider who will be blamed when the parity circuits report trouble: the hardware vendor. Omitting the parity checks probably leads to occasional random behavior, but occasional random behavior is indistinguishable from software error and is usually blamed on the software.

Lesson 8.8.1g (in Lampson's words): “Beauty is in the eye of the beholder. The various parties involved in the decisions about how much failure detection and recovery to implement do not always have the same interests.”

8.8.2 Risks of Rarely-Used Procedures: The National Archives

The National Archives and Records Administration of the United States government has the responsibility, among other things, of advising the rest of the government how to preserve electronic records such as e-mail messages for posterity. Quite separate from that responsibility, the organization also operates an e-mail system at its Washington, D.C. headquarters for a staff of about 125 people, and about 10,000 messages a month pass through this system. To ensure that no messages are lost, it arranged with an outside contractor to perform daily incremental backups and to make periodic complete backups of its e-mail files. On the chance that something may go wrong, the system has audit logs that track actions regarding incoming and outgoing mail as well as maintenance on files.

Over the weekend of June 18–21, 1999, the e-mail records for the previous four months (an estimated 43,000 messages) disappeared. No one has any idea what went wrong—the files may have been deleted by a disgruntled employee or a runaway housecleaning program, or the loss may have been caused by a wayward system bug. In any case, on Monday morning when people came to work, they found that the files were missing.

On investigation, the system managers reported that the audit logs had been turned off because they were reducing system performance, so there were no clues available to diagnose what went wrong. Moreover, since the contractor's employees had never gotten around to actually performing the backup part of the contract, there were no backup copies. It had not occurred to the staff of the Archives to verify the existence of the backup copies, much less to test them to see if they could actually be restored. They assumed that since the contract required it, the work was being done. The contractor's project manager and the employee responsible for making backups were immediately replaced. The Assistant Archivist reports that backup systems have now been beefed up to guard against another mishap, but he added that the safest way to save important messages is to print them out.*

Lesson 8.8.2: Avoid rarely used components. Rarely used failure-tolerance mechanisms, such as restoration from backup copies, must be tested periodically. If they are not, there is not much chance that they will work when an emergency arises. Fire drills (in this case, performing a restoration of all files from a backup copy) seem disruptive and expensive, but they are not nearly as disruptive and expensive as the discovery, too late, that the backup system isn't really operating. Even better, design the system so that all the components are exposed to day-to-day use, so that failures can be noticed before they cause real trouble.

8.8.3 Non-independent Replicas and Backhoe Fade

In Eagan, Minnesota, Northwest Airlines operated a computer system, named WorldFlight, that managed the Northwest flight dispatching database, provided weight-and-balance calculations for pilots, and managed e-mail communications between the dispatch center and all Northwest airplanes. It also provided data to other systems that managed passenger check-in and the airline's Web site. Since many of these functions involved communications, Northwest contracted with U.S. West, the local telephone company at that time, to provide these communications in the form of fiber-optic links to airports that Northwest serves, to government agencies such as the Weather Bureau and the Federal Aviation Administration, and to the Internet. Because these links were vital, Northwest paid U.S. West extra to provide each primary link with a backup secondary link. If a primary link to a site failed, the network control computers automatically switched over to the secondary link to that site.

At 2:05 p.m. on March 23, 2000, all communications to and from WorldFlight dropped out simultaneously. A contractor who was boring a tunnel (for fiber-optic lines for a different telephone company) at the nearby intersection of Lone Oak and Pilot Knob roads accidentally bored through a conduit containing six cables carrying the U.S. West fiber-optic and copper lines. In a tongue-in-cheek analogy to the fading in and out of long-distance radio signals, this kind of communications disruption is known in the trade as “backhoe fade.” WorldFlight immediately switched from the primary links to the secondary links, only to find that they were not working, either. It seems that the primary and secondary links were routed through the same conduit, and both were severed.

Pilots resorted to manual procedures for calculating weight and balance, and radio links were used by flight dispatchers in place of the electronic message system, but about 125 of Northwest's 1700 flights had to be cancelled because of the disruption, about the same number that are cancelled when a major snowstorm hits one of Northwest's hubs.

Much of the ensuing media coverage concentrated on whether or not the contractor had followed “dig-safe” procedures that are intended to prevent such mistakes. But a news release from Northwest at 5:15 p.m. blamed the problem entirely on U.S. West: “For such contingencies, U.S. West provides to Northwest a complete redundancy plan. The U.S. West redundancy plan also failed.”*

In a similar incident, the ARPAnet, a predecessor to the Internet, had seven separate trunk lines connecting routers in New England to routers elsewhere in the United States. All the trunk lines were purchased from a single long-distance carrier, AT&T. On December 12, 1986, all seven trunk lines went down simultaneously when a contractor accidentally severed a single fiber-optic cable running from White Plains, New York, to Newark, New Jersey.†

A complication for communications customers who recognize this problem and request information about the physical location of their communication links is that, in the name of security, communications companies sometimes refuse to reveal it.

* George Lardner Jr. “Archives Loses 43,000 E-Mails; officials can't explain summer erasure; backup system failed.” The Washington Post, Thursday, January 6, 2000, page A17.
Lesson 8.8.3: The calculation of mean time to failure of a redundant system depends critically on the assumption that failures of the replicas are independent. If they aren't independent, then the replication may be a waste of effort and money, while producing a false complacency.

This incident also illustrates why it can be difficult to test fault tolerance measures properly. What appears to be redundancy at one level of abstraction turns out not to be redundant at a lower level of abstraction.

8.8.4 Human Error May Be the Biggest Risk

Telehouse was an East London “telecommunications hotel”, a seven-story building housing communications equipment for about 100 customers, including most British Internet companies, many British and international telephone companies, and dozens of financial institutions. It was designed to be one of the most secure buildings in Europe, safe against “fire, flooding, bombs, and sabotage”. Accordingly, Telehouse had extensive protection against power failure, including two independent connections to the national

* Tony Kennedy. “Cut cable causes cancellations, delays for Northwest Airlines.” Minneapolis Star Tribune, March 22, 2000.
† Peter G. Neumann. Computer Related Risks (Addison-Wesley, New York, 1995), page 14.


electric power grid, a room full of batteries, and two diesel generators, along with systems to detect failures in supply and automatically cut over from one backup system to the next, as needed. On May 8, 1997, all the computer systems went off line for lack of power. According to Robert Bannington, financial director of Telehouse, “It was due to human error.” That is, someone pulled the wrong switch. The automatic power supply cutover procedures did not trigger because they were designed to deploy on failure of the outside power supply, and the sensors correctly observed that the outside power supply was intact.*

Lesson 8.8.4a: The first step in designing a fault tolerant system is to identify each potential fault and evaluate the risk that it will happen. People are part of the system, and mistakes made by authorized operators are typically a bigger threat to reliability than trees falling on power lines.

Anecdotes concerning failures of backup power supply systems seem to be common. Here is a typical report of an experience in a Newark, New Jersey, hospital operating room that was equipped with three backup generators: “On August 14, 2003, at 4:10pm EST, a widespread power grid failure caused our hospital to suffer a total OR power loss, regaining partial power in 4 hours and total restoration 12 hours later... When the backup generators initially came on-line, all ORs were running as usual. Within 20 minutes, one parallel-linked generator caught fire from an oil leak. After being subjected to twice its rated load, the second in-line generator quickly shut down... Hospital engineering, attempting load-reduction to the single surviving generator, switched many hospital circuit breakers off. Main power was interrupted to the OR.”†

Lesson 8.8.4b: A backup generator is another example of a rarely used component that may not have been maintained properly. The last two sentences of that report reemphasize Lesson 8.8.4a.

For yet another example, the M.I.T. Information Services and Technology staff posted the following system services notice on April 2, 2004: “We suffered a power failure in W92 shortly before 11AM this morning. Most services should be restored now, but some are still being recovered. Please check back here for more information as it becomes available.” A later posting reported: “Shortly after 10AM Friday morning the routine test of the W92 backup generator was started. Unknown to us was that the transition of the computer room load from commercial power to the backup generator resulted in a power surge within the computer room's Uninterruptable [sic] Power Supply (UPS). This destroyed an internal surge protector, which started to smolder. Shortly before 11AM the smoldering protector triggered the VESDA® smoke sensing system


within the computer room. This sensor triggered the fire alarm, and as a safety precaution forced an emergency power down of the entire computer room.”*

Lesson 8.8.4c: A failure masking system not only can fail, it can cause a bigger failure than the one it is intended to mask.

8.8.5 Introducing a Single Point of Failure

“[Rabbi Israel Meir HaCohen Kagan described] a real-life situation in his town of Radin, Poland. He lived at the time when the town first purchased an electrical generator and wired all the houses and courtyards with electric lighting. One evening something broke within the machine, and darkness descended upon all of the houses and streets, and even in the synagogue.

“So he pointed out that before they had electricity, every house had a kerosene light—and if in one particular house the kerosene ran out, or the wick burnt away, or the glass broke, that only that one house would be dark. But when everyone is dependent upon one machine, darkness spreads over the entire city if it breaks for any reason.”†

Lesson 8.8.5: Centralization may provide economies of scale, but it can also reduce robustness—a single failure can interfere with many unrelated activities. This phenomenon is commonly known as introducing a single point of failure. By carefully adding redundancy to a centralized design one may be able to restore some of the lost robustness, but it takes planning and adds to the cost.

8.8.6 Multiple Failures: The SOHO Mission Interruption

“Contact with the SOlar Heliospheric Observatory (SOHO) spacecraft was lost in the early morning hours of June 25, 1998, Eastern Daylight Time (EDT), during a planned period of calibrations, maneuvers, and spacecraft reconfigurations. Prior to this the SOHO operations team had concluded two years of extremely successful science operations.

“…The Board finds that the loss of the SOHO spacecraft was a direct result of operational errors, a failure to adequately monitor spacecraft status, and an erroneous decision which disabled part of the on-board autonomous failure detection. Further, following the occurrence of the emergency situation, the Board finds that insufficient time was taken by the operations team to fully assess the spacecraft status prior to initiating recovery operations. The Board discovered that a number of factors contributed to the circumstances that allowed the direct causes to occur.”‡

* Private internal communication.
† Chofetz Chaim (the Rabbi Israel Meir HaCohen Kagan of Radin), paraphrased by Rabbi Yaakov Menken, in a discussion of lessons from the Torah in Project Genesis Lifeline. Suggested by David Karger.


In a tour-de-force of the keep digging principle, the report of the investigating board quoted above identified five distinct direct causes of the loss: two software errors, a design feature that unintentionally amplified the effect of one of the software errors, an incorrect diagnosis by the ground staff, and a violated design assumption. It then goes on to identify three indirect causes in the spacecraft design process: lack of change control, missing risk analysis for changes, and insufficient communication of changes, and then three indirect causes in operations procedures: failure to follow planned procedures, to evaluate secondary telemetry data, and to question telemetry discrepancies.

Lesson 8.8.6: Complex systems fail for complex reasons. In systems engineered for reliability, it usually takes several component failures to cause a system failure. Unfortunately, when some of the components are people, multiple failures are all too common.

Exercises

8.1 Failures are

A. Faults that are latent.
B. Errors that are contained within a module.
C. Errors that propagate out of a module.
D. Faults that turn into errors.

1999–3–01

8.2 Ben Bitdiddle has been asked to perform a deterministic computation to calculate the orbit of a near-Earth asteroid for the next 500 years, to find out whether or not the asteroid will hit the Earth. The calculation will take roughly two years to complete, and Ben wants to be sure that the result will be correct. He buys 30 identical computers and runs the same program with the same inputs on all of them. Once each hour the software pauses long enough to write all intermediate results to a hard disk on that computer. When the computers return their results at the end

‡ Massimo Trella and Michael Greenfield. Final Report of the SOHO Mission Interruption Joint NASA/ESA Investigation Board (August 31, 1998). National Aeronautics and Space Administration and European Space Agency.

of the two years, a voter selects the majority answer. Which of the following failures can this scheme tolerate, assuming the voter works correctly?

A. The software carrying out the deterministic computation has a bug in it, causing the program to compute the wrong answer for certain inputs.
B. Over the course of the two years, cosmic rays corrupt data stored in memory at twelve of the computers, causing them to return incorrect results.
C. Over the course of the two years, on 24 different days the power fails in the computer room. When the power comes back on, each computer reboots and then continues its computation, starting with the state it finds on its hard disk.

2006–2–3

8.3 Ben Bitdiddle has seven smoke detectors installed in various places in his house. Since the fire department charges $100 for responding to a false alarm, Ben has connected the outputs of the smoke detectors to a simple majority voter, which in turn can activate an automatic dialer that calls the fire department. Ben returns home one day to find his house on fire, and the fire department has not been called. There is smoke at every smoke detector. What did Ben do wrong?

A. He should have used fail-fast smoke detectors.
B. He should have used a voter that ignores failed inputs from fail-fast sources.
C. He should have used a voter that ignores non-active inputs.
D. He should have done both A and B.
E. He should have done both A and C.

1997–0–01

8.4 You will be flying home from a job interview in Silicon Valley. Your travel agent gives you the following choice of flights:

A. Flight A uses a plane whose mean time to failure (MTTF) is believed to be 6,000 hours. With this plane, the flight is scheduled to take 6 hours.
B. Flight B uses a plane whose MTTF is believed to be 5,000 hours. With this plane, the flight takes 5 hours.

The agent assures you that both planes' failures occur according to memoryless random processes (not a “bathtub” curve). Assuming that model, which flight should you choose to minimize the chance of your plane failing during the flight?

2005–2–5

8.5 (Note: solving this problem is best done with use of probability through the level of Markov chains.) You are designing a computer system to control the power grid for the Northeastern United States. If your system goes down, the lights go out and civil disorder—riots, looting, fires, etc.—will ensue. Thus, you have set a goal of having a system MTTF of at least 100 years (about 10^6 hours). For hardware you are constrained to use a building block computer that has an MTTF of 1000 hours


and an MTTR of 1 hour. Assuming that the building blocks are fail-fast, memoryless, and fail independently of one another, how can you arrange to meet your goal?

1995–3–1a

8.6 The town council wants to implement a municipal network to connect the local area networks in the library, the town hall, and the school. They want to minimize the chance that any building is completely disconnected from the others. They are considering two network topologies:

1. “Daisy Chain”

2. “Fully connected”

Each link in the network has a failure probability of p.

8.6a. What is the probability that the daisy chain network is connecting all the buildings?

8.6b. What is the probability that the fully connected network is connecting all the buildings?

8.6c. The town council has a limited budget, with which it can buy either a daisy chain network with two high-reliability links (p = .000001), or a fully connected network with three low-reliability links (p = .0001). Which should they purchase?

1985–0–1

8.7 Figure 8.11 shows the failure points of three different 5MR supermodule designs, if repair does not happen in time. Draw the corresponding figure for the same three different TMR supermodule designs. 2001–3–05

8.8 An astronomer calculating the trajectory of Pluto has a program that requires the execution of 10^13 machine operations. The fastest processor available in the lab runs only 10^9 operations per second and, unfortunately, has a probability of failing on any one operation of 10^–12. (The failure process is memoryless.) The good news is that the processor is fail-fast, so when a failure occurs it stops dead in its tracks and starts ringing a bell. The bad news is that when it fails, it loses all state, so whatever it was doing is lost, and has to be started over from the beginning. Seeing that in practical terms, the program needs to run for about 3 hours, and the machine has an MTTF of only 1/10 of that time, Louis Reasoner and Ben Bitdiddle have proposed two ways to organize the computation:




• Louis says run it from the beginning and hope for the best. If the machine fails, just try again; keep trying till the calculation successfully completes.

• Ben suggests dividing the calculation into ten equal-length segments; if the calculation gets to the end of a segment, it writes its state out to the disk. When a failure occurs, restart from the last state saved on the disk. Saving state and restart both take zero time.

What is the ratio of the expected time to complete the calculation under the two strategies? Warning: A straightforward solution to this problem involves advanced probability techniques. 1976–0–3

8.9 Draw a figure, similar to that of Figure 8.6, that shows the recovery procedure for one sector of a 5-disk RAID 4 system when disk 2 fails and is replaced. 2005–0–1

8.10 Louis Reasoner has just read an advertisement for a RAID controller that provides a choice of two configurations. According to the advertisement, the first configuration is exactly the RAID 4 system described in Section 8.4.1. The advertisement goes on to say that the configuration called RAID 5 has just one difference: in an N-disk configuration, the parity block, rather than being written on disk N, is written on the disk number (1 + sector_address modulo N). Thus, for example, in a five-disk system, the parity block for sector 18 would be on disk 4 (because 1 + (18 modulo 5) = 4), while the parity block for sector 19 would be on




disk 5 (because 1 + (19 modulo 5) = 5). Louis is hoping you can help him understand why this idea might be a good one.

8.10a. RAID 5 has the advantage over RAID 4 that
A. It tolerates single-drive failures.
B. Read performance in the absence of errors is enhanced.
C. Write performance in the absence of errors is enhanced.
D. Locating data on the drives is easier.
E. Allocating space on the drives is easier.
F. It requires less disk space.
G. There's no real advantage; it's just another advertising gimmick.
1997–3–01

8.10b. Is there any workload for which RAID 4 has better write performance than RAID 5? 2000–3–01

8.10c. Louis is also wondering about whether he might be better off using a RAID 1 system (see Section ). How does the number of disks required compare between RAID 1 and RAID 5? 1998–3–01

8.10d. Which of RAID 1 and RAID 5 has better performance for a workload consisting of small reads and small writes? 2000–3–01

8.11 A system administrator notices that a file service disk is failing for two unrelated reasons. Once every 30 days, on average, vibration due to nearby construction breaks the disk’s arm. Once every 60 days, on average, a power surge destroys the disk’s electronics. The system administrator fixes the disk instantly each time it fails. The two failure modes are independent of each other, and independent of the age of the disk. What is the mean time to failure of the disk? 2002–3–01

Additional exercises relating to Chapter 8 can be found in problem sets 26 through 28.



Atomicity: All-or-Nothing and Before-or-After


CHAPTER CONTENTS

Overview..........................................................................................9–2

9.1 Atomicity...................................................................................9–4

9.1.1 All-or-Nothing Atomicity in a Database .................................... 9–5

9.1.2 All-or-Nothing Atomicity in the Interrupt Interface .................... 9–6

9.1.3 All-or-Nothing Atomicity in a Layered Application ...................... 9–8

9.1.4 Some Actions With and Without the All-or-Nothing Property ..... 9–10

9.1.5 Before-or-After Atomicity: Coordinating Concurrent Threads .... 9–13

9.1.6 Correctness and Serialization ............................................... 9–16

9.1.7 All-or-Nothing and Before-or-After Atomicity .......................... 9–19

9.2 All-or-Nothing Atomicity I: Concepts.......................................9–21

9.2.1 Achieving All-or-Nothing Atomicity: ALL_OR_NOTHING_PUT .......... 9–21

9.2.2 Systematic Atomicity: Commit and the Golden Rule ................ 9–27

9.2.3 Systematic All-or-Nothing Atomicity: Version Histories ............ 9–30

9.2.4 How Version Histories are Used ............................................ 9–37

9.3 All-or-Nothing Atomicity II: Pragmatics ..................................9–38

9.3.1 Atomicity Logs ................................................................... 9–39

9.3.2 Logging Protocols ............................................................... 9–42

9.3.3 Recovery Procedures .......................................................... 9–45

9.3.4 Other Logging Configurations: Non-Volatile Cell Storage .......... 9–47

9.3.5 Checkpoints ...................................................................... 9–51

9.3.6 What if the Cache is not Write-Through? (Advanced Topic) ....... 9–53

9.4 Before-or-After Atomicity I: Concepts .....................................9–54

9.4.1 Achieving Before-or-After Atomicity: Simple Serialization ........ 9–54

9.4.2 The Mark-Point Discipline .................................................... 9–58

9.4.3 Optimistic Atomicity: Read-Capture (Advanced Topic) ............. 9–63

9.4.4 Does Anyone Actually Use Version Histories for Before-or-After Atomicity? ........................................................................ 9–67

9.5 Before-or-After Atomicity II: Pragmatics ................................9–69

9.5.1 Locks ............................................................................... 9–70

9.5.2 Simple Locking .................................................................. 9–72

9.5.3 Two-Phase Locking ............................................................. 9–73

Saltzer & Kaashoek Ch. 9, p. 1


June 25, 2009 8:22 am



9.5.4 Performance Optimizations .................................................. 9–75

9.5.5 Deadlock; Making Progress .................................................. 9–76

9.6 Atomicity across Layers and Multiple Sites..............................9–79

9.6.1 Hierarchical Composition of Transactions ............................... 9–80

9.6.2 Two-Phase Commit ............................................................. 9–84

9.6.3 Multiple-Site Atomicity: Distributed Two-Phase Commit ........... 9–85

9.6.4 The Dilemma of the Two Generals ........................................ 9–90

9.7 A More Complete Model of Disk Failure (Advanced Topic) .......9–92

9.7.1 Storage that is Both All-or-Nothing and Durable ..................... 9–92

9.8 Case Studies: Machine Language Atomicity .............................9–95

9.8.1 Complex Instruction Sets: The General Electric 600 Line ......... 9–95

9.8.2 More Elaborate Instruction Sets: The IBM System/370 ............ 9–96

9.8.3 The Apollo Desktop Computer and the Motorola M68000 Microprocessor .................................................................. 9–97

Exercises........................................................................................9–98

Glossary for Chapter 9 .................................................................9–107

Index of Chapter 9 .......................................................................9–113

Last chapter page 9–115

Overview

This chapter explores two closely related system engineering design strategies. The first is all-or-nothing atomicity, a design strategy for masking failures that occur while interpreting programs. The second is before-or-after atomicity, a design strategy for coordinating concurrent activities. Chapter 8[on-line] introduced failure masking, but did not show how to mask failures of running programs. Chapter 5 introduced coordination of concurrent activities, and presented solutions to several specific problems, but it did not explain any systematic way to ensure that actions have the before-or-after property. This chapter explores ways to systematically synthesize a design that provides both the all-or-nothing property needed for failure masking and the before-or-after property needed for coordination.

Many useful applications can benefit from atomicity. For example, suppose that you are trying to buy a toaster from an Internet store. You click on the button that says “purchase”, but before you receive a response the power fails. You would like to have some assurance that, despite the power failure, either the purchase went through properly or nothing happened at all. You don’t want to find out later that your credit card was charged but the Internet store didn’t receive word that it was supposed to ship the toaster. In other words, you would like to see that the action initiated by the “purchase” button be all-or-nothing despite the possibility of failure. And if the store has only one toaster in stock and two customers both click on the “purchase” button for a toaster at about the same time, one of the customers should receive a confirmation of the purchase, and the other should receive a “sorry, out of stock” notice. It would be problematic if




both customers received confirmations of purchase. In other words, both customers would like to see that the activity initiated by their own click of the “purchase” button occur either completely before or completely after any other, concurrent click of a “purchase” button.

The single conceptual framework of atomicity provides a powerful way of thinking about both all-or-nothing failure masking and before-or-after sequencing of concurrent activities. Atomicity is the performing of a sequence of steps, called actions, so that they appear to be done as a single, indivisible step, known in operating system and architecture literature as an atomic action and in database management literature as a transaction. When a fault causes a failure in the middle of a correctly designed atomic action, it will appear to the invoker of the atomic action that the atomic action either completed successfully or did nothing at all—thus an atomic action provides all-or-nothing atomicity. Similarly, when several atomic actions are going on concurrently, each atomic action will appear to take place either completely before or completely after every other atomic action—thus an atomic action provides before-or-after atomicity. Together, all-or-nothing atomicity and before-or-after atomicity provide a particularly strong form of modularity: they hide the fact that the atomic action is actually composed of multiple steps.

The result is a sweeping simplification in the description of the possible states of a system. This simplification provides the basis for a methodical approach to recovery from failures and coordination of concurrent activities that simplifies design, simplifies understanding for later maintainers, and simplifies verification of correctness. These desiderata are particularly important because errors caused by mistakes in coordination usually depend on the relative timing of external events and among different threads.
When a timing-dependent error occurs, the difficulty of discovering and diagnosing it can be orders of magnitude greater than that of finding a mistake in a purely sequential activity. The reason is that even a small number of concurrent activities can have a very large number of potential real time sequences. It is usually impossible to determine which of those many potential sequences of steps preceded the error, so it is effectively impossible to reproduce the error under more carefully controlled circumstances. Since debugging this class of error is so hard, techniques that ensure correct coordination a priori are particularly valuable.

The remarkable thing is that the same systematic approach—atomicity—to failure recovery also applies to coordination of concurrent activities. In fact, since one must be able to deal with failures while at the same time coordinating concurrent activities, any attempt to use different strategies for these two problems requires that the strategies be compatible. Being able to use the same strategy for both is another sweeping simplification.

Atomic actions are a fundamental building block that is widely applicable in computer system design. Atomic actions are found in database management systems, in register management for pipelined processors, in file systems, in change-control systems used for program development, and in many everyday applications such as word processors and calendar managers.




Sidebar 9.1: Actions and transactions

The terminology used by system designers to discuss atomicity can be confusing because the concept was identified and developed independently by database designers and by hardware architects. An action that changes several data values can have any or all of at least four independent properties: it can be all-or-nothing (either all or none of the changes happen), it can be before-or-after (the changes all happen either before or after every concurrent action), it can be constraint-maintaining (the changes maintain some specified invariant), and it can be durable (the changes last as long as they are needed).

Designers of database management systems customarily are concerned only with actions that are both all-or-nothing and before-or-after, and they describe such actions as transactions. In addition, they use the term atomic primarily in reference to all-or-nothing atomicity. On the other hand, hardware processor architects customarily use the term atomic to describe an action that exhibits before-or-after atomicity. This book does not attempt to change these common usages. Instead, it uses the qualified terms “all-or-nothing atomicity” and “before-or-after atomicity.” The unqualified term “atomic” may imply all-or-nothing, or before-or-after, or both, depending on the context. The text uses the term “transaction” to mean an action that is both all-or-nothing and before-or-after.

All-or-nothing atomicity and before-or-after atomicity are universally defined properties of actions, while constraints are properties that different applications define in different ways. Durability lies somewhere in between because different applications have different durability requirements. At the same time, implementations of constraints and durability usually have a prerequisite of atomicity. Since the atomicity properties are modularly separable from the other two, this chapter focuses just on atomicity.
Chapter 10[on-line] then explores how a designer can use transactions to implement constraints and enhance durability.

The sections of this chapter define atomicity, examine some examples of atomic actions, and explore systematic ways of achieving atomicity: version histories, logging, and locking protocols. Chapter 10[on-line] then explores some applications of atomicity. Case studies at the end of both chapters provide real-world examples of atomicity as a tool for creating useful systems.

9.1 Atomicity

Atomicity is a property required in several different areas of computer system design. These areas include managing a database, developing a hardware architecture, specifying the interface to an operating system, and more generally in software engineering. The table below suggests some of the kinds of problems to which atomicity is applicable. In




this chapter we will encounter examples of both kinds of atomicity in each of these different areas.

Area                     All-or-nothing atomicity              Before-or-after atomicity
database management      updating more than one record         records shared between threads
hardware architecture    handling interrupts and exceptions    register renaming
operating systems        supervisor call interface             printer queue
software engineering     handling faults in layers             bounded buffer
9.1.1 All-or-Nothing Atomicity in a Database

As a first example, consider a database of bank accounts. We define a procedure named TRANSFER that debits one account and credits a second account, both of which are stored on disk, as follows:

1  procedure TRANSFER (debit_account, credit_account, amount)
2      GET (dbdata, debit_account)
3      dbdata ← dbdata - amount
4      PUT (dbdata, debit_account)
5      GET (crdata, credit_account)
6      crdata ← crdata + amount
7      PUT (crdata, credit_account)

where debit_account and credit_account identify the records for the accounts to be debited and credited, respectively.

Suppose that the system crashes while executing the PUT instruction on line 4. Even if we use the MORE_DURABLE_PUT described in Section 8.5.4, a system crash at just the wrong time may cause the data written to the disk to be scrambled, and the value of debit_account lost. We would prefer that either the data be completely written to the disk or nothing be written at all. That is, we want the PUT instruction to have the all-or-nothing atomicity property. Section 9.2.1 will describe a way to do that.

There is a further all-or-nothing atomicity requirement in the TRANSFER procedure. Suppose that the PUT on line 4 is successful but that while executing line 5 or line 6 the power fails, stopping the computer in its tracks. When power is restored, the computer restarts, but volatile memory, including the state of the thread that was running the TRANSFER procedure, has been lost. If someone now inquires about the balances in debit_account and in credit_account things will not add up properly because debit_account has a new value but credit_account has an old value. One might suggest postponing the first PUT to be just before the second one, but that just reduces the window of vulnerability, it does not eliminate it—the power could still fail in between the two PUTs. To eliminate the window, we must somehow arrange that the two PUT instructions, or perhaps even the entire TRANSFER procedure, be done as an all-or-nothing atomic




action. In Section 9.2.3 we will devise a TRANSFER procedure that has the all-or-nothing property, and in Section 9.3 we will see some additional ways of providing the property.
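The window of vulnerability can be demonstrated with a short simulation (a hypothetical sketch, not from the text; GET and PUT are modeled as reads and writes of a Python dict that stands in for durable disk storage):

```python
# Hypothetical sketch: a simulated power failure between the two PUTs of
# TRANSFER leaves the two balances inconsistent.

class PowerFailure(Exception):
    pass

disk = {"debit_account": 100, "credit_account": 50}

def transfer(debit_account, credit_account, amount, fail_between_puts=False):
    dbdata = disk[debit_account]       # GET (line 2)
    dbdata = dbdata - amount           # line 3
    disk[debit_account] = dbdata       # PUT (line 4)
    if fail_between_puts:
        raise PowerFailure("crash after the first PUT")
    crdata = disk[credit_account]      # GET (line 5)
    crdata = crdata + amount           # line 6
    disk[credit_account] = crdata      # PUT (line 7)

try:
    transfer("debit_account", "credit_account", 10, fail_between_puts=True)
except PowerFailure:
    pass

# The debit reached the disk but the credit did not: the books no longer balance.
print(disk)                  # {'debit_account': 90, 'credit_account': 50}
print(sum(disk.values()))    # 140 -- ten dollars have vanished
```

Postponing the first PUT merely narrows the interval in which `fail_between_puts` can strike; only an all-or-nothing construction eliminates it.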

9.1.2 All-or-Nothing Atomicity in the Interrupt Interface

A second application for all-or-nothing atomicity is in the processor instruction set interface as seen by a thread. Recall from Chapters 2 and 5 that a thread normally performs actions one after another, as directed by the instructions of the current program, but that certain events may catch the attention of the thread’s interpreter, causing the interpreter, rather than the program, to supply the next instruction. When such an event happens, a different program, running in an interrupt thread, takes control. If the event is a signal arriving from outside the interpreter, the interrupt thread may simply invoke a thread management primitive such as ADVANCE, as described in Section 5.6.4, to alert some other thread about the event. For example, an I/O operation that the other thread was waiting for may now have completed. The interrupt handler then returns control to the interrupted thread. This example requires before-or-after atomicity between the interrupt thread and the interrupted thread. If the interrupted thread was in the midst of a call to the thread manager, the invocation of ADVANCE by the interrupt thread should occur either before or after that call.

Another possibility is that the interpreter has detected that something is going wrong in the interrupted thread. In that case, the interrupt event invokes an exception handler, which runs in the environment of the original thread. (Sidebar 9.2 offers some examples.) The exception handler either adjusts the environment to eliminate some problem (such as a missing page) so that the original thread can continue, or it declares that the original thread has failed and terminates it. In either case, the exception handler will need to examine the state of the action that the original thread was performing at the instant of the interruption—was that action finished, or is it in a partially done state?
Sidebar 9.2: Events that might lead to invoking an exception handler

1. A hardware fault occurs:
   • The processor detects a memory parity fault.
   • A sensor reports that the electric power has failed; the energy left in the power supply may be just enough to perform a graceful shutdown.

2. A hardware or software interpreter encounters something in the program that is clearly wrong:
   • The program tried to divide by zero.
   • The program supplied a negative argument to a square root function.

3. Continuing requires some resource allocation or deferred initialization:
   • The running thread encountered a missing-page exception in a virtual memory system.
   • The running thread encountered an indirection exception, indicating that it encountered an unresolved procedure linkage in the current program.

4. More urgent work needs to take priority, so the user wishes to terminate the thread:
   • This program is running much longer than expected.
   • The program is running normally, but the user suddenly realizes that it is time to catch the last train home.

5. The user realizes that something is wrong and decides to terminate the thread:
   • Calculating e, the program starts to display 3.1415…
   • The user asked the program to copy the wrong set of files.

6. Deadlock:
   • Thread A has acquired the scanner, and is waiting for memory to become free; thread B has acquired all available memory, and is waiting for the scanner to be released. Either the system notices that this set of waits cannot be resolved or, more likely, a timer that should never expire eventually expires. The system or the timer signals an exception to one or both of the deadlocked threads.

Ideally, the handler would like to see an all-or-nothing report of the state: either the instruction that caused the exception completed or it didn’t do anything. An all-or-nothing report means that the state of the original thread is described entirely with values belonging to the layer in which the exception handler runs. An example of such a value is the program counter, which identifies the next instruction that the thread is to execute. An in-the-middle report would mean that the state description involves values of a lower layer, probably the operating system or the hardware processor itself. In that case, knowing the next instruction is only part of the story; the handler would also need to know which parts of the current instruction were executed and which were not. An example might be an instruction that increments an address register, retrieves the data at that new address, and adds that data value to the value in another register. If retrieving the data causes a missing-page exception, the description of the current state is that the address register has been incremented but the retrieval and addition have not yet been performed. Such an in-the-middle report is problematic because after the handler retrieves the missing page it cannot simply tell the processor to jump to the instruction that failed—that would increment the address register again, which is not what the programmer expected. Jumping to the next instruction isn’t right, either, because that would omit the addition step. An all-or-nothing report is preferable because it avoids the need for the handler to peer into the details of the next lower layer. Modern processor designers are generally careful to avoid designing instructions that don’t have the all-or-nothing property. As will be seen shortly, designers of higher-layer interpreters must be similarly careful.

Sections 9.1.3 and 9.1.4 explore the case in which the exception terminates the running thread, thus creating a fault. Section 9.1.5 examines the case in which the interrupted thread continues, oblivious (one hopes) to the interruption.
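The restart hazard of the auto-incrementing instruction discussed above can be sketched in code (a hypothetical model of an instruction, not any real instruction set):

```python
# Hypothetical sketch: a load-with-autoincrement instruction that changes a
# register before discovering a missing page cannot safely be restarted.

class MissingPage(Exception):
    pass

memory = {}                          # resident pages: address -> value
registers = {"addr": 99, "acc": 0}

def load_autoincrement_bad():
    # In-the-middle design: increments addr, THEN checks the page.
    registers["addr"] += 1
    if registers["addr"] not in memory:
        raise MissingPage(registers["addr"])
    registers["acc"] += memory[registers["addr"]]

def load_autoincrement_good():
    # All-or-nothing design: checks the page before changing any register.
    if registers["addr"] + 1 not in memory:
        raise MissingPage(registers["addr"] + 1)
    registers["addr"] += 1
    registers["acc"] += memory[registers["addr"]]

# Bad version: the handler fetches the missing page and naively restarts the
# instruction, but addr has already moved once, so it moves a second time.
try:
    load_autoincrement_bad()
except MissingPage as fault:
    memory[fault.args[0]] = 7        # handler retrieves page 100
    try:
        load_autoincrement_bad()     # restart skips address 100 entirely
    except MissingPage:
        pass                         # fails again, now asking for page 101

bad_addr = registers["addr"]         # 101: incremented twice

# Good version: restart behaves exactly as if it were the first attempt.
registers.update(addr=99, acc=0)
memory.clear()
try:
    load_autoincrement_good()
except MissingPage as fault:
    memory[fault.args[0]] = 7
    load_autoincrement_good()        # restart succeeds

print(bad_addr, registers)           # 101 {'addr': 100, 'acc': 7}
```

The good version gives the handler an all-or-nothing report: the failed instruction changed nothing, so "jump back and retry" is always correct.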




9.1.3 All-or-Nothing Atomicity in a Layered Application

A third example of all-or-nothing atomicity lies in the challenge presented by a fault in a running program: at the instant of the fault, the program is typically in the middle of doing something, and it is usually not acceptable to leave things half-done. Our goal is to obtain a more graceful response, and the method will be to require that some sequence of actions behave as an atomic action with the all-or-nothing property. Atomic actions are closely related to the modularity that arises when things are organized in layers. Layered components have the feature that a higher layer can completely hide the existence of a lower layer. This hiding feature makes layers exceptionally effective at error containment and for systematically responding to faults. To see why, recall the layered structure of the calendar management program of Chapter 2, reproduced in Figure 9.1 (that figure may seem familiar—it is a copy of Figure 2.10). The calendar program implements each request of the user by executing a sequence of Java language statements. Ideally, the user will never notice any evidence of the composite nature of the actions implemented by the calendar manager. Similarly, each statement of the Java language is implemented by several actions at the hardware layer. Again, if the Java interpreter is carefully implemented, the composite nature of the implementation in terms of machine language will be completely hidden from the Java programmer.

[FIGURE 9.1 An application system with three layers of interpretation. A human user generates requests to the calendar manager; the calendar program runs on a Java interpreter, which in turn runs on hardware. Typical instructions across each layer interface: “Add new event on February 27” at the calendar manager interface, “nextch = instring[j];” at the Java language interface, and “add R1,R2” at the machine language interface. The user has requested an action that will fail, but the failure will be discovered at the lowest layer. A graceful response involves atomicity at each interface.]




Now consider what happens if the hardware processor detects a condition that should be handled as an exception—for example, a register overflow. The machine is in the middle of interpreting an action at the machine language layer interface—an ADD instruction somewhere in the middle of the Java interpreter program. That ADD instruction is itself in the middle of interpreting an action at the Java language interface—a Java expression to scan an array. That Java expression in turn is in the middle of interpreting an action at the user interface—a request from the user to add a new event to the calendar. The report “Overflow exception caused by the ADD instruction at location 41574” is not intelligible to the user at the user interface; that description is meaningful only at the machine language interface. Unfortunately, the implication of being “in the middle” of higher-layer actions is that the only accurate description of the current state of affairs is in terms of the progress of the machine language program.

The actual state of affairs in our example as understood by an all-seeing observer might be the following: the register overflow was caused by adding one to a register that contained a two’s complement negative one at the machine language layer. That machine language add instruction was part of an action to scan an array of characters at the Java layer, and a zero means that the scan has reached the end of the array. The array scan was embarked upon by the Java layer in response to the user’s request to add an event on February 31. The highest-level interpretation of the overflow exception is “You tried to add an event on a non-existent date”. We want to make sure that this report goes to the end user, rather than the one about register overflow. In addition, we want to be able to assure the user that this mistake has not caused an empty event to be added somewhere else in the calendar or otherwise led to any other changes to the calendar.
Since the system couldn’t do the requested change, it should do nothing but report the error. Either a low-level error report or muddled data would reveal to the user that the action was composite. With the insight that in a layered application we want a fault detected by a lower layer to be contained in a particular way, we can now propose a more formal definition of all-or-nothing atomicity:

All-or-nothing atomicity

A sequence of steps is an all-or-nothing action if, from the point of view of its invoker, the sequence always either • completes, or • aborts in such a way that it appears that the sequence had never been undertaken in the first place. That is, it backs out.

In a layered application, the idea is to design each of the actions of each layer to be all-or-nothing. That is, whenever an action of a layer is carried out by a sequence of




actions of the next lower layer, the action either completes what it was asked to do or else it backs out, acting as though it had not been invoked at all. When control returns to a higher layer after a lower layer detects a fault, the problem of being “in the middle” of an action thus disappears.

In our calendar management example, we might expect that the machine language layer would complete the add instruction but signal an overflow exception; the Java interpreter layer, upon receiving the overflow exception, might then decide that its array scan has ended, and return a report of “scan complete, value not found” to the calendar management layer; the calendar manager would take this not-found report as an indication that it should back up, completely undo any tentative changes, and tell the user that the request to add an event on that date could not be accomplished because the date does not exist. Thus some layers run to completion, while others back out and act as though they had never been invoked, but either way the actions are all-or-nothing. In this example, the failure would probably propagate all the way back to the human user to decide what to do next.

A different failure (e.g., “there is no room in the calendar for another event”) might be intercepted by some intermediate layer that knows of a way to mask it (e.g., by allocating more storage space). In that case, the all-or-nothing requirement is that the layer that masks the failure find that the layer below has either never started what was to be the current action or else it has completed the current action but has not yet undertaken the next one.

All-or-nothing atomicity is not usually achieved casually, but rather by careful design and specification. Designers often get it wrong. An unintelligible error message is the typical symptom that a designer got it wrong. To gain some insight into what is involved, let us examine some examples.
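The chain of reports in the calendar example can be sketched in code (a hypothetical sketch with invented names; the book’s calendar program is written in Java, but the layering idea is language-independent):

```python
# Hypothetical sketch: each layer turns a lower-layer failure into an
# all-or-nothing outcome expressed in its own vocabulary, so the user
# never sees "overflow at location 41574".

MONTH_LENGTH = {"January": 31, "February": 28, "March": 31}

def scan_for_day(month, day):
    # Lower layer: either succeeds or reports failure in its own terms,
    # leaving no partial state behind.
    if day < 1 or day > MONTH_LENGTH[month]:
        raise LookupError("scan complete, value not found")

calendar = []

def add_event(month, day, description):
    # Calendar manager layer: on failure, back out the tentative change
    # entirely and report in the user's vocabulary -- all-or-nothing.
    tentative = calendar + [(month, day, description)]
    try:
        scan_for_day(month, day)
    except LookupError:
        # Back out: the tentative change is simply discarded.
        return f"Cannot add an event on {month} {day}: the date does not exist"
    calendar[:] = tentative          # commit the change only on success
    return "event added"

print(add_event("February", 31, "lunch"))
# -> Cannot add an event on February 31: the date does not exist
print(calendar)   # [] -- no half-done change remains anywhere
```

The lower layer runs to completion and signals; the upper layer backs out; either way each action is all-or-nothing at its own interface.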

9.1.4 Some Actions With and Without the All-or-Nothing Property

Actions that lack the all-or-nothing property have frequently been discovered upon adding multilevel memory management to a computer architecture, especially to a processor that is highly pipelined. In this case, the interface that needs to be all-or-nothing lies between the processor and the operating system. Unless the original machine architect designed the instruction set with missing-page exceptions in mind, there may be cases in which a missing-page exception can occur “in the middle” of an instruction, after the processor has overwritten some register or after later instructions have entered the pipeline. When such a situation arises, the later designer who is trying to add the multilevel memory feature is trapped. The instruction cannot run to the end because one of the operands it needs is not in real memory. While the missing page is being retrieved from secondary storage, the designer would like to allow the operating system to use the processor for something else (perhaps even to run the program that fetches the missing page), but reusing the processor requires saving the state of the currently executing program, so that it can be restarted later when the missing page is available. The problem is how to save the next-instruction pointer.

Saltzer & Kaashoek Ch. 9, p. 10

June 25, 2009 8:22 am

9.1 Atomicity


If every instruction is an all-or-nothing action, the operating system can simply save as the value of the next-instruction pointer the address of the instruction that encountered the missing page. The resulting saved state description shows that the program is between two instructions, one of which has been completely executed, and the next one of which has not yet begun. Later, when the page is available, the operating system can restart the program by reloading all of the registers and setting the program counter to the place indicated by the next-instruction pointer. The processor will continue, starting with the instruction that previously encountered the missing page exception; this time it should succeed. On the other hand, if even one instruction of the instruction set lacks the all-or-nothing property, when an interrupt happens to occur during the execution of that instruction it is not at all obvious how the operating system can save the processor state for a future restart. Designers have come up with several techniques to retrofit the all-or-nothing property at the machine language interface. Section 9.8 describes some examples of machine architectures that had this problem and the techniques that were used to add virtual memory to them.

A second example is the supervisor call (SVC). Section 5.3.4 pointed out that the SVC instruction, which changes both the program counter and the processor mode bit (and in systems with virtual memory, other registers such as the page map address register), needs to be all-or-nothing, to ensure that all (or none) of the intended registers change. Beyond that, the SVC invokes some complete kernel procedure. The designer would like to arrange that the entire call (the combination of the SVC instruction and the operation of the kernel procedure itself) be an all-or-nothing action. An all-or-nothing design allows the application programmer to view the kernel procedure as if it is an extension of the hardware.
That goal is easier said than done, since the kernel procedure may detect some condition that prevents it from carrying out the intended action. Careful design of the kernel procedure is thus required. Consider an SVC to a kernel READ procedure that delivers the next typed keystroke to the caller. The user may not have typed anything yet when the application program calls READ, so the designer of READ must arrange to wait for the user to type something. By itself, this situation is not especially problematic, but it becomes more so when there is also a user-provided exception handler. Suppose, for example, a thread timer can expire during the call to READ and the user-provided exception handler is to decide whether or not the thread should continue to run a while longer. The scenario, then, is that the user program calls READ, it is necessary to wait, and while waiting, the timer expires and control passes to the exception handler. Different systems choose one of three possibilities for the design of the READ procedure, the last one of which is not an all-or-nothing design:

1. An all-or-nothing design that implements the “nothing” option (blocking read): Seeing no available input, the kernel procedure first adjusts return pointers (“push the PC back”) to make it appear that the application program called AWAIT just ahead of its call to the kernel READ procedure and then it transfers control to the kernel AWAIT entry point. When the user finally types something, causing AWAIT to return, the user’s thread re-executes the original kernel call to READ, this time finding the typed
input. With this design, if a timer exception occurs while waiting, when the exception handler investigates the current state of the thread it finds the answer “the application program is between instructions; its next instruction is a call to READ.” This description is intelligible to a user-provided exception handler, and it allows that handler several options. One option is to continue the thread, meaning go ahead and execute the call to READ. If there is still no input, READ will again push the PC back and transfer control to AWAIT. Another option is for the handler to save this state description with a plan of restoring a future thread to this state at some later time. 2. An all-or-nothing design that implements the “all” option (non-blocking read): Seeing no available input, the kernel immediately returns to the application program with a zero-length result, expecting that the program will look for and properly handle this case. The program would probably test the length of the result and if zero, call AWAIT itself or it might find something else to do instead. As with the previous design, this design ensures that at all times the user-provided timer exception handler will see a simple description of the current state of the thread—it is between two user program instructions. However, some care is needed to avoid a race between the call to AWAIT and the arrival of the next typed character. 3. A blocking read design that is neither “all” nor “nothing” and therefore not atomic: The kernel READ procedure itself calls AWAIT, blocking the thread until the user types a character. Although this design seems conceptually simple, the description of the state of the thread from the point of view of the timer exception handler is not simple. Rather than “between two user instructions”, it is “waiting for something to happen in the middle of a user call to kernel procedure READ”. The option of saving this state description for future use has been foreclosed. 
To start another thread with this state description, the exception handler would need to be able to request “start this thread just after the call to AWAIT in the middle of the kernel READ entry.” But allowing that kind of request would compromise the modularity of the user-kernel interface. The user-provided exception handler could equally well make a request to restart the thread anywhere in the kernel, thus bypassing its gates and compromising its security. The first and second designs correspond directly to the two options in the definition of an all-or-nothing action, and indeed some operating systems offer both options. In the first design the kernel program acts in a way that appears that the call had never taken place, while in the second design the kernel program runs to completion every time it is called. Both designs make the kernel procedure an all-or-nothing action, and both lead to a user-intelligible state description—the program is between two of its instructions— if an exception should happen while waiting. One of the appeals of the client/server model introduced in Chapter 4 is that it tends to force the all-or-nothing property out onto the design table. Because servers can fail independently of clients, it is necessary for the client to think through a plan for recovery
from server failure, and a natural model to use is to make every action offered by a server all-or-nothing.
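Returning to the kernel READ example, design 2 (the non-blocking “all” option) can be modeled in a few lines. This is an illustrative sketch, not a real kernel interface: `kernel_read`, `kernel_await`, and the queue standing in for the keyboard buffer are all assumptions of the model. The point it demonstrates is that the kernel procedure always runs to completion, so the retry loop, and therefore the waiting, lives in the application between user instructions, where an exception handler can always see a simple thread state.

```python
import queue

keyboard = queue.Queue()        # stands in for the device input buffer

def kernel_read():
    """Design 2, the non-blocking "all" option: the kernel procedure always
    runs to completion, returning either a keystroke or a zero-length result."""
    try:
        return keyboard.get_nowait()
    except queue.Empty:
        return ""               # zero-length result: nothing typed yet

def kernel_await():
    """Stand-in for the kernel AWAIT entry point: blocks until input exists,
    then puts it back for the caller to re-read."""
    item = keyboard.get()       # blocks until something is "typed"
    keyboard.put(item)

def application_read():
    # The wait-and-retry loop is application code, so the thread is always
    # "between two user instructions" when an exception handler inspects it.
    while True:
        ch = kernel_read()
        if ch != "":
            return ch
        kernel_await()
```

In this single-queue model the put-back trick in `kernel_await` sidesteps the race between AWAIT and the arrival of the next character; as the text notes, a real kernel needs more care at exactly that point.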

9.1.5 Before-or-After Atomicity: Coordinating Concurrent Threads

In Chapter 5 we learned how to express opportunities for concurrency by creating threads, the goal of concurrency being to improve performance by running several things at the same time. Moreover, Section 9.1.2 above pointed out that interrupts can also create concurrency. Concurrent threads do not represent any special problem until their paths cross. The way that paths cross can always be described in terms of shared, writable data: concurrent threads happen to take an interest in the same piece of writable data at about the same time. It is not even necessary that the concurrent threads be running simultaneously; if one is stalled (perhaps because of an interrupt) in the middle of an action, a different, running thread can take an interest in the data that the stalled thread was, and will sometime again be, working with.

From the point of view of the programmer of an application, Chapter 5 introduced two quite different kinds of concurrency coordination requirements: sequence coordination and before-or-after atomicity. Sequence coordination is a constraint of the type “Action W must happen before action X”. For correctness, the first action must complete before the second action begins. For example, reading of typed characters from a keyboard must happen before running the program that presents those characters on a display. As a general rule, when writing a program one can anticipate the sequence coordination constraints, and the programmer knows the identity of the concurrent actions. Sequence coordination thus is usually explicitly programmed, using either special language constructs or shared variables such as the eventcounts of Chapter 5.

In contrast, before-or-after atomicity is a more general constraint that several actions that concurrently operate on the same data should not interfere with one another. We define before-or-after atomicity as follows:

Before-or-after atomicity

Concurrent actions have the before-or-after property if their effect from the point of view of their invokers is the same as if the actions occurred either completely before or completely after one another.

In Chapter 5 we saw how before-or-after actions can be created with explicit locks and a thread manager that implements the procedures ACQUIRE and RELEASE. Chapter 5 showed some examples of before-or-after actions using locks, and emphasized that programming correct before-or-after actions, for example coordinating a bounded buffer with several producers or several consumers, can be a tricky proposition. To be confident of correctness, one needs to establish a compelling argument that every action that touches a shared variable follows the locking protocol.


One thing that makes before-or-after atomicity different from sequence coordination is that the programmer of an action that must have the before-or-after property does not necessarily know the identities of all the other actions that might touch the shared variable. This lack of knowledge can make it problematic to coordinate actions by explicit program steps. Instead, what the programmer needs is an automatic, implicit mechanism that ensures proper handling of every shared variable. This chapter will describe several such mechanisms. Put another way, correct coordination requires discipline in the way concurrent threads read and write shared data.

Applications for before-or-after atomicity in a computer system abound. In an operating system, several concurrent threads may decide to use a shared printer at about the same time. It would not be useful for printed lines of different threads to be interleaved in the printed output. Moreover, it doesn’t really matter which thread gets to use the printer first; the primary consideration is that one use of the printer be complete before the next begins, so the requirement is to give each print job the before-or-after atomicity property.

For a more detailed example, let us return to the banking application and the TRANSFER procedure. This time the account balances are held in shared memory variables (recall that the declaration keyword reference means that the argument is call-by-reference, so that TRANSFER can change the values of those arguments):

procedure TRANSFER (reference debit_account, reference credit_account, amount)
    debit_account ← debit_account - amount
    credit_account ← credit_account + amount

Despite their unitary appearance, a program statement such as “X ← X + Y” is actually composite: it involves reading the values of X and Y, performing an addition, and then writing the result back into X. If a concurrent thread reads and changes the value of X between the read and the write done by this statement, that other thread may be surprised when this statement overwrites its change.

Suppose this procedure is applied to accounts A (initially containing $300) and B (initially containing $100) as in

TRANSFER (A, B, $10)

We expect account A, the debit account, to end up with $290, and account B, the credit account, to end up with $110. Suppose, however, a second, concurrent thread is executing the statement

TRANSFER (B, C, $25)

where account C starts with $175. When both threads complete their transfers, we expect B to end up with $85 and C with $200. Further, this expectation should be fulfilled no matter which of the two transfers happens first. But the variable credit_account in the first thread is bound to the same object (account B) as the variable debit_account in the second thread. The risk to correctness occurs if the two transfers happen at about the same time. To understand this risk, consider Figure 9.2, which illustrates several possible time sequences of the READ and WRITE steps of the two threads with respect to variable B.



With each time sequence the figure shows the history of values of the cell containing the balance of account B. If both steps 1–1 and 1–2 precede both steps 2–1 and 2–2 (or vice versa), the two transfers will work as anticipated, and B ends up with $85. If, however, step 2–1 occurs after step 1–1 but before step 1–2, a mistake will occur: one of the two transfers will not affect account B, even though it should have. The first two cases illustrate histories of shared variable B in which the answers are the correct result; the remaining four cases illustrate four different sequences that lead to two incorrect values for B.

Thread #2 (debit_account is B):  2–1 READ B  …  2–2 WRITE B
Thread #1 (credit_account is B):  1–1 READ B  …  1–2 WRITE B

[Figure 9.2 diagram: the six possible interleavings of these four steps, each shown with the resulting history of the value of B, which starts at $100. In cases 1 and 2 both steps of one thread precede both steps of the other, and B ends with the correct balance; in cases 3 through 6 the steps interleave and B ends with a wrong balance.]

FIGURE 9.2 Six possible histories of variable B if two threads that share B do not coordinate their concurrent activities.
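The six cases of Figure 9.2 can be reproduced mechanically. The following sketch (an illustration, with invented step names R1/W1/R2/W2) enumerates every ordering of the four steps that touch B, subject only to each thread performing its READ before its WRITE, and records the final balance each ordering produces.

```python
from itertools import permutations

def classify_interleavings(initial=100, credit=10, debit=25):
    """Enumerate every ordering of the four steps that touch B
    (thread 1: R1 then W1, crediting; thread 2: R2 then W2, debiting)
    and record the final value of B that each ordering produces."""
    results = {}
    for order in permutations(["R1", "W1", "R2", "W2"]):
        # Within each thread, the READ must come before the WRITE.
        if order.index("R1") > order.index("W1"):
            continue
        if order.index("R2") > order.index("W2"):
            continue
        b, saved = initial, {}
        for step in order:
            if step == "R1":
                saved[1] = b            # 1-1 READ B
            elif step == "W1":
                b = saved[1] + credit   # 1-2 WRITE B
            elif step == "R2":
                saved[2] = b            # 2-1 READ B
            elif step == "W2":
                b = saved[2] - debit    # 2-2 WRITE B
        results[order] = b
    return results
```

Running this yields exactly six valid orderings, matching the six cases of the figure: the two in which both steps of one thread precede both steps of the other end with the correct $85, while the other four end with $110 or $75, the two incorrect values.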


Thus our goal is to ensure that one of the first two time sequences actually occurs. One way to achieve this goal is to make the two steps 1–1 and 1–2 atomic and the two steps 2–1 and 2–2 similarly atomic. In the original program, the steps

debit_account ← debit_account - amount

and

credit_account ← credit_account + amount

should each be atomic. There should be no possibility that a concurrent thread that intends to change the value of the shared variable debit_account could read its value between the READ and WRITE steps of this statement.
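The lost update can be produced on demand. In this hedged sketch (names and the event-based choreography are illustrative), two events force exactly the bad interleaving in which thread 2's READ and WRITE of B fall between thread 1's READ and WRITE, so thread 2's debit is overwritten.

```python
import threading

balances = {"A": 300, "B": 100, "C": 175}
t1_read_B = threading.Event()
t2_wrote_B = threading.Event()

def thread1():                      # TRANSFER(A, B, $10); B is credit_account
    balances["A"] = balances["A"] - 10
    b = balances["B"]               # step 1-1: READ B (sees $100)
    t1_read_B.set()
    t2_wrote_B.wait()               # deliberately let thread 2 run in between
    balances["B"] = b + 10          # step 1-2: WRITE B, overwriting the debit

def thread2():                      # TRANSFER(B, C, $25); B is debit_account
    t1_read_B.wait()
    b = balances["B"]               # step 2-1: READ B (also sees $100)
    balances["B"] = b - 25          # step 2-2: WRITE B ($75)
    t2_wrote_B.set()
    balances["C"] = balances["C"] + 25

w1 = threading.Thread(target=thread1)
w2 = threading.Thread(target=thread2)
w1.start(); w2.start(); w1.join(); w2.join()
# B ends at $110 instead of the expected $85: thread 2's debit was lost.
```

Making each balance-update statement atomic, for example by holding one lock from the READ through the WRITE, closes the window and forces one of the two correct histories.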

9.1.6 Correctness and Serialization

The notion that the first two sequences of Figure 9.2 are correct and the other four are wrong is based on our understanding of the banking application. It would be better to have a more general concept of correctness that is independent of the application. Application independence is a modularity goal: we want to be able to make an argument for correctness of the mechanism that provides before-or-after atomicity without getting into the question of whether or not the application using the mechanism is correct. There is such a correctness concept: coordination among concurrent actions can be considered to be correct if every result is guaranteed to be one that could have been obtained by some purely serial application of those same actions.

The reasoning behind this concept of correctness involves several steps. Consider Figure 9.3, which shows, abstractly, the effect of applying some action, whether atomic or not, to a system: the action changes the state of the system.

FIGURE 9.3 A single action takes a system from one state to another state.

Now, if we are sure that:

1. the old state of the system was correct from the point of view of the application, and
2. the action, performing all by itself, correctly transforms any correct old state to a correct new state,

then we can reason that the new state must also be correct. This line of reasoning holds for any application-dependent definition of “correct” and “correctly transform”, so our reasoning method is independent of those definitions and thus of the application. The corresponding requirement when several actions act concurrently, as in Figure 9.4, is that the resulting new state ought to be one of those that would have resulted from some serialization of the several actions, as in Figure 9.5. This correctness criterion means that concurrent actions are correctly coordinated if their result is guaranteed to be one that would have been obtained by some purely serial application of those same actions.



[Figure 9.4 diagram: actions #1, #2, and #3 act concurrently on the old system state, together producing a new system state.]

FIGURE 9.4 When several actions act concurrently, they together produce a new state. If the actions are before-or-after and the old state was correct, the new state will be correct.

So long as the only coordination requirement is before-or-after atomicity, any serialization will do. Moreover, we do not even need to insist that the system actually traverse the intermediate states along any particular path of Figure 9.5—it may instead follow the dotted trajectory through intermediate states that are not by themselves correct, according to the application’s definition. As long as the intermediate states are not visible above the implementing layer, and the system is guaranteed to end up in one of the acceptable final states, we can declare the coordination to be correct because there exists a trajectory that leads to that state for which a correctness argument could have been applied to every step.

[Figure 9.5 diagram: starting from the old system state, different serializations of atomic actions AA #1 and AA #2 lead to final state A, B, or C.]

FIGURE 9.5 We insist that the final state be one that could have been reached by some serialization of the atomic actions, but we don't care which serialization. In addition, we do not need to insist that the intermediate states ever actually exist. The actual state trajectory could be that shown by the dotted lines, but only if there is no way of observing the intermediate states from the outside.

Since our definition of before-or-after atomicity is that each before-or-after action act as though it ran either completely before or completely after each other before-or-after action, before-or-after atomicity leads directly to this concept of correctness. Put another way, before-or-after atomicity has the effect of serializing the actions, so it follows that before-or-after atomicity guarantees correctness of coordination. A different way of expressing this idea is to say that when concurrent actions have the before-or-after property, they are serializable: there exists some serial order of those concurrent transactions that would, if followed, lead to the same ending state.* Thus in Figure 9.2, the sequences of case 1 and case 2 could result from a serialized order, but the actions of cases 3 through 6 could not.

In the example of Figure 9.2, there were only two concurrent actions and each of the concurrent actions had only two steps. As the number of concurrent actions and the number of steps in each action grows there will be a rapidly growing number of possible orders in which the individual steps can occur, but only some of those orders will ensure a correct result. Since the purpose of concurrency is to gain performance, one would like to have a way of choosing from the set of correct orders the one correct order that has the highest performance. As one might guess, making that choice can in general be quite difficult. In Sections 9.4 and 9.5 of this chapter we will encounter several programming disciplines that ensure choice from a subset of the possible orders, all members of which are guaranteed to be correct but, unfortunately, may not include the correct order that has the highest performance.

In some applications it is appropriate to use a correctness requirement that is stronger than serializability. For example, the designer of a banking system may want to avoid anachronisms by requiring what might be called external time consistency: if there is any external evidence (such as a printed receipt) that before-or-after action T1 ended before before-or-after action T2 began, the serialization order of T1 and T2 inside the system should be that T1 precedes T2.
For another example of a stronger correctness requirement, a processor architect may require sequential consistency: when the processor concurrently performs multiple instructions from the same instruction stream, the result should be as if the instructions were executed in the original order specified by the programmer.

Returning to our example, a real funds-transfer application typically has several distinct before-or-after atomicity requirements. Consider the following auditing procedure; its purpose is to verify that the sum of the balances of all accounts is zero (in double-entry bookkeeping, accounts belonging to the bank, such as the amount of cash in the vault, have negative balances):

procedure AUDIT()
    sum ← 0
    for each W in bank.accounts
        sum ← sum + W.balance
    if (sum ≠ 0) call for investigation

Suppose that AUDIT is running in one thread at the same time that another thread is transferring money from account A to account B. If AUDIT examines account A before the transfer and account B after the transfer, it will count the transferred amount twice and thus will compute an incorrect answer. So the entire auditing procedure should occur either before or after any individual transfer: we want it to be a before-or-after action.

There is yet another before-or-after atomicity requirement: if AUDIT should run after the statement in TRANSFER

debit_account ← debit_account - amount

but before the statement

credit_account ← credit_account + amount

it will calculate a sum that does not include amount; we therefore conclude that the two balance updates should occur either completely before or completely after any AUDIT action; put another way, TRANSFER should be a before-or-after action.

* The general question of whether or not a collection of existing transactions is serializable is an advanced topic that is addressed in database management. Problem set 36 explores one method of answering this question.
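The second requirement can be shown concretely. In this illustrative sketch (`audit`, `transfer_halves`, and the account names are inventions for the example), the transfer is split into its two balance updates so that an audit can be slipped in between them, exactly the bad interleaving just described: the in-flight amount is missing from the sum.

```python
def audit(balances):
    # Double-entry bookkeeping invariant: all balances should sum to zero.
    return sum(balances.values())

def transfer_halves(balances, debit, credit, amount):
    """Split TRANSFER into its two balance updates so the bad interleaving
    (an audit running between them) can be produced on demand."""
    def debit_half():
        balances[debit] -= amount
    def credit_half():
        balances[credit] += amount
    return debit_half, credit_half

balances = {"A": 300, "B": 100, "bank_cash": -400}
debit_half, credit_half = transfer_halves(balances, "A", "B", 10)

debit_half()
mid_sum = audit(balances)       # audit slips in between the two updates
credit_half()
final_sum = audit(balances)
# mid_sum is -10: the in-flight $10 is missing, and the audit would call
# for an unwarranted investigation; final_sum is 0 again.
```

A single lock held by TRANSFER across both updates, and by AUDIT across its whole scan, would make both procedures before-or-after actions, so only sums of zero could ever be observed.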

9.1.7 All-or-Nothing and Before-or-After Atomicity

We now have seen examples of two forms of atomicity: all-or-nothing and before-or-after. These two forms have a common underlying goal: to hide the internal structure of an action. With that insight, it becomes apparent that atomicity is really a unifying concept:

Atomicity
An action is atomic if there is no way for a higher layer to discover the internal structure of its implementation.

This description is really the fundamental definition of atomicity. From it, one can immediately draw two important consequences, corresponding to all-or-nothing atomicity and to before-or-after atomicity:

1. From the point of view of a procedure that invokes an atomic action, the atomic action always appears either to complete as anticipated, or to do nothing. This consequence is the one that makes atomic actions useful in recovering from failures.
2. From the point of view of a concurrent thread, an atomic action acts as though it occurs either completely before or completely after every other concurrent atomic action. This consequence is the one that makes atomic actions useful for coordinating concurrent threads.

These two consequences are not fundamentally different. They are simply two perspectives, the first from other modules within the thread that invokes the action, the second from other threads. Both points of view follow from the single idea that the internal structure of the action is not visible outside of the module that implements the action. Such hiding of internal structure is the essence of modularity, but atomicity is an exceptionally strong form of modularity. Atomicity hides not just the details of which
steps form the atomic action, but the very fact that it has structure. There is a kinship between atomicity and other system-building techniques such as data abstraction and client/server organization. Data abstraction has the goal of hiding the internal structure of data; client/server organization has the goal of hiding the internal structure of major subsystems. Similarly, atomicity has the goal of hiding the internal structure of an action. All three are methods of enforcing industrial-strength modularity, and thereby of guaranteeing absence of unanticipated interactions among components of a complex system.

We have used phrases such as “from the point of view of the invoker” several times, suggesting that there may be another point of view from which internal structure is apparent. That other point of view is seen by the implementer of an atomic action, who is often painfully aware that an action is actually composite, and who must do extra work to hide this reality from the higher layer and from concurrent threads. Thus the interfaces between layers are an essential part of the definition of an atomic action, and they provide an opportunity for the implementation of an action to operate in any way that ends up providing atomicity.

There is one more aspect of hiding the internal structure of atomic actions: atomic actions can have benevolent side effects. A common example is an audit log, where atomic actions that run into trouble record the nature of the detected failure and the recovery sequence for later analysis. One might think that when a failure leads to backing out, the audit log should be rolled back, too; but rolling it back would defeat its purpose—the whole point of an audit log is to record details about the failure.
The important point is that the audit log is normally a private record of the layer that implemented the atomic action; in the normal course of operation it is not visible above that layer, so there is no requirement to roll it back. (A separate atomicity requirement is to ensure that the log entry that describes a failure is complete and not lost in the ensuing recovery.)

Another example of a benevolent side effect is performance optimization. For example, in a high-performance data management system, when an upper layer atomic action asks the data management system to insert a new record into a file, the data management system may decide as a performance optimization that now is the time to rearrange the file into a better physical order. If the atomic action fails and aborts, it need ensure only that the newly-inserted record be removed; the file does not need to be restored to its older, less efficient, storage arrangement. Similarly, a lower-layer cache that now contains a variable touched by the atomic action does not need to be cleared and a garbage collection of heap storage does not need to be undone. Such side effects are not a problem, as long as they are hidden from the higher-layer client of the atomic action except perhaps in the speed with which later actions are carried out, or across an interface that is intended to report performance measures or failures.


9.2 All-or-Nothing Atomicity I: Concepts


Section 9.1 of this chapter defined the goals of all-or-nothing atomicity and before-or-after atomicity, and provided a conceptual framework that at least in principle allows a designer to decide whether or not some proposed algorithm correctly coordinates concurrent activities. However, it did not provide any examples of actual implementations of either goal. This section of the chapter, together with the next one, describes some widely applicable techniques of systematically implementing all-or-nothing atomicity. Later sections of the chapter will do the same for before-or-after atomicity.

Many of the examples employ the technique introduced in Chapter 5 called bootstrapping, a method that resembles inductive proof. To review, bootstrapping means to first look for a systematic way to reduce a general problem to some much-narrowed particular version of that same problem. Then, solve the narrow problem using some specialized method that might work only for that case because it takes advantage of the specific situation. The general solution then consists of two parts: a special-case technique plus a method that systematically reduces the general problem to the special case. Recall that Chapter 5 tackled the general problem of creating before-or-after actions from arbitrary sequences of code by implementing a procedure named ACQUIRE that itself required before-or-after atomicity of two or three lines of code where it reads and then sets a lock value. It then implemented that before-or-after action with the help of a special hardware feature that directly makes a before-or-after action of the read and set sequence, and it also exhibited a software implementation (in Sidebar 5.2) that relies only on the hardware performing ordinary LOADs and STOREs as before-or-after actions.

This chapter uses bootstrapping several times. The first example starts with the special case and then introduces a way to reduce the general problem to that special case. The reduction method, called the version history, is used only occasionally in practice, but once understood it becomes easy to see why the more widely used reduction methods that will be described in Section 9.3 work.
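The Chapter 5 bootstrapping argument recalled above can be mirrored in a hedged Python model; this is not the book's code, and a private Lock merely stands in for the atomicity that real hardware provides in its read-and-set instruction. Everything above `rsm` is ordinary software: the general problem (making arbitrary composite actions before-or-after) is reduced to the special case (one atomic read-and-set of a lock cell).

```python
import threading

# The special case: a model of the hardware's atomic read-and-set instruction.
# Real hardware makes this a single indivisible instruction; here a private
# Lock stands in for that indivisibility, and nothing else relies on it.
_hw = threading.Lock()

def rsm(cell):
    """Atomically read a one-word lock cell and set it to 1."""
    with _hw:
        old = cell[0]
        cell[0] = 1
        return old

# The reduction: general before-or-after actions built on the special case.
def acquire(cell):
    while rsm(cell) == 1:       # spin until we observe that the lock was free
        pass

def release(cell):
    cell[0] = 0

lock_cell = [0]
shared = [0]

def worker():
    for _ in range(1000):
        acquire(lock_cell)
        shared[0] += 1          # the protected composite action
        release(lock_cell)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# shared[0] == 4000: every composite increment ran before or after the others.
```

The structure, not the spinning, is the point: a narrow primitive guaranteed atomic by a lower layer, plus a systematic reduction, yields before-or-after atomicity for arbitrary code sequences.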

9.2.1 Achieving All-or-Nothing Atomicity: ALL_OR_NOTHING_PUT

The first example is of a scheme that does an all-or-nothing update of a single disk sector. The problem to be solved is that if a system crashes in the middle of a disk write (for example, the operating system encounters a bug or the power fails), the sector that was being written at the instant of the failure may contain an unusable muddle of old and new data. The goal is to create an all-or-nothing PUT with the property that when GET later reads the sector, it always returns either the old or the new data, but never a muddled mixture.

To make the implementation precise, we develop a disk fault tolerance model that is a slight variation of the one introduced in Chapter 8[on-line], taking as an example application a calendar management program for a personal computer. The user is hoping that, if the system fails while adding a new event to the calendar, when the system later restarts the calendar will be safely intact. Whether or not the new event ended up in the

Saltzer & Kaashoek Ch. 9, p. 21

June 25, 2009 8:22 am


CHAPTER 9 Atomicity: All-or-Nothing and Before-or-After

calendar is less important than that the calendar not be damaged by inopportune timing of the system failure. This system comprises a human user, a display, a processor, some volatile memory, a magnetic disk, an operating system, and the calendar manager program. We model this system in several parts:

Overall system fault tolerance model.
• error-free operation: All work goes according to expectations. The user initiates actions such as adding events to the calendar and the system confirms the actions by displaying messages to the user.
• tolerated error: The user who has initiated an action notices that the system failed before it confirmed completion of the action and, when the system is operating again, checks to see whether or not it actually performed that action.
• untolerated error: The system fails without the user noticing, so the user does not realize that he or she should check or retry an action that the system may not have completed.

The tolerated error specification means that, to the extent possible, the entire system is fail-fast: if something goes wrong during an update, the system stops before taking any more requests, and the user realizes that the system has stopped. One would ordinarily design a system such as this one to minimize the chance of the untolerated error, for example by requiring supervision by a human user. The human user then is in a position to realize (perhaps from lack of response) that something has gone wrong. After the system restarts, the user knows to inquire whether or not the action completed. This design strategy should be familiar from our study of best effort networks in Chapter 7[on-line]. The lower layer (the computer system) is providing a best effort implementation. A higher layer (the human user) supervises and, when necessary, retries. For example, suppose that the human user adds an appointment to the calendar but just as he or she clicks “save” the system crashes.
The user doesn’t know whether or not the addition actually succeeded, so when the system comes up again the first thing to do is open up the calendar to find out what happened.

Processor, memory, and operating system fault tolerance model. This part of the model just specifies more precisely the intended fail-fast properties of the hardware and operating system:
• error-free operation: The processor, memory, and operating system all follow their specifications.
• detected error: Something fails in the hardware or operating system. The system is fail-fast: the hardware or operating system detects the failure and restarts from a clean slate before initiating any further PUTs to the disk.
• untolerated error: Something fails in the hardware or operating system. The processor muddles along and PUTs corrupted data to the disk before detecting the failure.



The primary goal of the processor/memory/operating-system part of the model is to detect failures and stop running before any corrupted data is written to the disk storage system. The importance of detecting failure before the next disk write lies in error containment: if the goal is met, the designer can assume that the only values potentially in error must be in processor registers and volatile memory, and the data on the disk should be safe, with one exception: if there was a PUT to the disk in progress at the time of the crash, the failing system may have corrupted the disk buffer in volatile memory, and consequently corrupted the disk sector that was being written. The recovery procedure can thus depend on the disk storage system to contain only uncorrupted information, or at most one corrupted disk sector. In fact, after restart the disk will contain the only surviving information. “Restarts from a clean slate” means that the system discards all state held in volatile memory. This step brings the system to the same state as if a power failure had occurred, so a single recovery procedure will be able to handle both system crashes and power failures. Discarding volatile memory also means that all currently active threads vanish, so everything that was going on comes to an abrupt halt and will have to be restarted.

Disk storage system fault tolerance model. Implementing all-or-nothing atomicity involves some steps that resemble the decay masking of MORE_DURABLE_PUT/GET in Chapter 8[on-line]—in particular, the algorithm will write multiple copies of data. To clarify how the all-or-nothing mechanism works, we temporarily back up to CAREFUL_PUT/GET (described in Chapter 8[on-line]), which masks soft disk errors but not hard disk errors or disk decay. To simplify further, we pretend for the moment that a disk never decays and that it has no hard errors.
(Since this perfect-disk assumption is obviously unrealistic, we will reverse it in Section 9.7, which describes an algorithm for all-or-nothing atomicity despite disk decay and hard errors.) With the perfect-disk assumption, only one thing can go wrong: a system crash at just the wrong time. The fault tolerance model for this simplified careful disk system then becomes:
• error-free operation: CAREFUL_GET returns the result of the most recent call to CAREFUL_PUT at sector_number on track, with status = OK.
• detectable error: The operating system crashes during a CAREFUL_PUT and corrupts the disk buffer in volatile storage, and CAREFUL_PUT writes corrupted data on one sector of the disk. We can classify the error as “detectable” if we assume that the application has included with the data an end-to-end checksum, calculated before calling CAREFUL_PUT and thus before the system crash could have corrupted the data.

The change in this revision of the careful storage layer is that when a system crash occurs, one sector on the disk may be corrupted, but the client of the interface is confident that (1) that sector is the only one that may be corrupted and (2) if it has been corrupted, any later reader of that sector will detect the problem. Between the processor model and the storage system model, all anticipated failures now lead to the same situation: the system detects the failure, resets all processor registers and volatile memory, forgets all active threads, and restarts. No more than one disk sector is corrupted.

Our problem is now reduced to providing the all-or-nothing property: the goal is to create all-or-nothing disk storage, which guarantees either to change the data on a sector completely and correctly or else appear to future readers not to have touched it at all. Here is one simple, but somewhat inefficient, scheme that makes use of virtualization: assign, for each data sector that is to have the all-or-nothing property, three physical disk sectors, identified as S1, S2, and S3. The three physical sectors taken together are a virtual “all-or-nothing sector”. At each place in the system where this disk sector was previously used, replace it with the all-or-nothing sector, identified by the triple {S1, S2, S3}. We start with an almost correct all-or-nothing implementation named ALMOST_ALL_OR_NOTHING_PUT, find a bug in it, and then fix the bug, finally creating a correct ALL_OR_NOTHING_PUT.

When asked to write data, ALMOST_ALL_OR_NOTHING_PUT writes it three times, on S1, S2, and S3, in that order, each time waiting until the previous write finishes, so that if the system crashes only one of the three sectors will be affected. To read data, ALL_OR_NOTHING_GET reads all three sectors and compares their contents. If the contents of S1 and S2 are identical, ALL_OR_NOTHING_GET returns that value as the value of the all-or-nothing sector. If S1 and S2 differ, ALL_OR_NOTHING_GET returns the contents of S3 as the value of the all-or-nothing sector. Figure 9.6 shows this almost correct pseudocode.

1  procedure ALMOST_ALL_OR_NOTHING_PUT (data, all_or_nothing_sector)
2      CAREFUL_PUT (data, all_or_nothing_sector.S1)
3      CAREFUL_PUT (data, all_or_nothing_sector.S2)    // Commit point.
4      CAREFUL_PUT (data, all_or_nothing_sector.S3)

5  procedure ALL_OR_NOTHING_GET (reference data, all_or_nothing_sector)
6      CAREFUL_GET (data1, all_or_nothing_sector.S1)
7      CAREFUL_GET (data2, all_or_nothing_sector.S2)
8      CAREFUL_GET (data3, all_or_nothing_sector.S3)
9      if data1 = data2 then data ← data1    // Return new value.
10     else data ← data3                     // Return old value.

FIGURE 9.6 Pseudocode for ALMOST_ALL_OR_NOTHING_PUT and ALL_OR_NOTHING_GET.

Let’s explore how this implementation behaves on a system crash. Suppose that at some previous time a record has been correctly stored in an all-or-nothing sector (in other words, all three copies are identical), and someone now updates it by calling ALMOST_ALL_OR_NOTHING_PUT.
The goal is that even if a failure occurs in the middle of the update, a later reader can always be assured of getting some complete, consistent version of the record by invoking ALL_OR_NOTHING_GET.

Suppose that ALMOST_ALL_OR_NOTHING_PUT were interrupted by a system crash some time before it finishes writing sector S2, and thus corrupts either S1 or S2. In that case, when ALL_OR_NOTHING_GET reads sectors S1 and S2, they will have different values, and it is not clear which one to trust. Because the system is fail-fast, sector S3 would not yet have been touched by ALMOST_ALL_OR_NOTHING_PUT, so it still contains the previous value. Returning the value found in S3 thus has the desired effect of ALMOST_ALL_OR_NOTHING_PUT having done nothing. Now, suppose that ALMOST_ALL_OR_NOTHING_PUT were interrupted by a system crash some time after successfully writing sector S2. In that case, the crash may have corrupted S3, but S1 and S2 both contain the newly updated value. ALL_OR_NOTHING_GET returns the value of S1, thus providing the desired effect of ALMOST_ALL_OR_NOTHING_PUT having completed its job.

So what’s wrong with this design? ALMOST_ALL_OR_NOTHING_PUT assumes that all three copies are identical when it starts. But a previous failure can violate that assumption. Suppose that ALMOST_ALL_OR_NOTHING_PUT is interrupted while writing S3. The next thread to call ALL_OR_NOTHING_GET finds data1 = data2, so it uses data1, as expected. The new thread then calls ALMOST_ALL_OR_NOTHING_PUT, but is interrupted while writing S2. Now, S1 doesn’t equal S2, so the next call to ALL_OR_NOTHING_GET returns the damaged S3.

The fix for this bug is for ALL_OR_NOTHING_PUT to guarantee that the three sectors are identical before updating. It can provide this guarantee by invoking a procedure named CHECK_AND_REPAIR, as in Figure 9.7. CHECK_AND_REPAIR simply compares the three copies and, if they are not identical, forces them to be identical.

1  procedure ALL_OR_NOTHING_PUT (data, all_or_nothing_sector)
2      CHECK_AND_REPAIR (all_or_nothing_sector)
3      ALMOST_ALL_OR_NOTHING_PUT (data, all_or_nothing_sector)

4  procedure CHECK_AND_REPAIR (all_or_nothing_sector)   // Ensure copies match.
5      CAREFUL_GET (data1, all_or_nothing_sector.S1)
6      CAREFUL_GET (data2, all_or_nothing_sector.S2)
7      CAREFUL_GET (data3, all_or_nothing_sector.S3)
8      if (data1 = data2) and (data2 = data3) return              // State 1 or 7, no repair.
9      if (data1 = data2)
10         CAREFUL_PUT (data1, all_or_nothing_sector.S3) return   // State 5 or 6.
11     if (data2 = data3)
12         CAREFUL_PUT (data2, all_or_nothing_sector.S1) return   // State 2 or 3.
13     CAREFUL_PUT (data1, all_or_nothing_sector.S2)              // State 4, go to state 5.
14     CAREFUL_PUT (data1, all_or_nothing_sector.S3)              // State 5, go to state 7.

FIGURE 9.7 Pseudocode for ALL_OR_NOTHING_PUT and CHECK_AND_REPAIR.

To see how this works, assume that someone calls ALL_OR_NOTHING_PUT at a time when all three of the copies do contain identical values, which we designate as “old”. Because ALL_OR_NOTHING_PUT writes “new” values into S1, S2, and S3 one at a time and in order, even if there is a crash, at the next call to ALL_OR_NOTHING_PUT there are only seven possible data states for CHECK_AND_REPAIR to consider:

data state   sector S1   sector S2   sector S3
    1           old         old         old
    2           bad         old         old
    3           new         old         old
    4           new         bad         old
    5           new         new         old
    6           new         new         bad
    7           new         new         new
The way to read this table is as follows: if all three sectors S1, S2, and S3 contain the “old” value, the data is in state 1. Now, if CHECK_AND_REPAIR discovers that all three copies are identical (line 8 in Figure 9.7), the data is in state 1 or state 7, so CHECK_AND_REPAIR simply returns. Failing that test, if the copies in sectors S1 and S2 are identical (line 9), the data must be in state 5 or state 6, so CHECK_AND_REPAIR forces sector S3 to match and returns (line 10). If the copies in sectors S2 and S3 are identical (line 11), the data must be in state 2 or state 3, so CHECK_AND_REPAIR forces sector S1 to match and returns (line 12). The only remaining possibility is that the data is in state 4, in which case sector S2 is surely bad, but sector S1 contains a new value and sector S3 contains an old one. The choice of which to use is arbitrary; as shown, the procedure copies the new value in sector S1 to both sectors S2 and S3.

What if a failure occurs while running CHECK_AND_REPAIR? That procedure systematically drives the state either forward from state 4 toward state 7, or backward from state 3 toward state 1. If CHECK_AND_REPAIR is itself interrupted by another system crash, rerunning it will continue from the point at which the previous attempt left off.

We can make several observations about the algorithm implemented by ALL_OR_NOTHING_GET and ALL_OR_NOTHING_PUT:

1. This all-or-nothing atomicity algorithm assumes that only one thread at a time tries to execute either ALL_OR_NOTHING_GET or ALL_OR_NOTHING_PUT. This algorithm implements all-or-nothing atomicity but not before-or-after atomicity.

2. CHECK_AND_REPAIR is idempotent. That means that a thread can start the procedure, execute any number of its steps, be interrupted by a crash, and go back to the beginning again any number of times with the same ultimate result, as far as a later call to ALL_OR_NOTHING_GET is concerned.

3. The completion of the CAREFUL_PUT on line 3 of ALMOST_ALL_OR_NOTHING_PUT, marked “commit point,” exposes the new data to future ALL_OR_NOTHING_GET actions. Until that step begins execution, a call to ALL_OR_NOTHING_GET sees the old data. After line 3 completes, a call to ALL_OR_NOTHING_GET sees the new data.

4. Although the algorithm writes three replicas of the data, the primary reason for the replicas is not to provide durability as described in Section 8.5. Instead, the reason for writing three replicas, one at a time and in a particular order, is to ensure observance at all times and under all failure scenarios of the golden rule of atomicity, which is the subject of the next section.
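The scheme can be checked with a small simulation (a sketch, not from the text: the names are invented, the "disk" is an in-memory list of three sectors, and a simulated crash corrupts exactly the sector being written, matching the fail-fast model above):

```python
# Sketch (invented names): simulate the three-sector all-or-nothing scheme.
OLD, NEW, BAD = "old", "new", "bad"

class Crash(Exception):
    """Simulated fail-fast system crash."""

def almost_all_or_nothing_put(disk, data, crash_before=None):
    # Write S1, S2, S3 in order, waiting for each write to finish.
    for i in range(3):
        if crash_before == i:
            disk[i] = BAD              # the write in progress is corrupted
            raise Crash()
        disk[i] = data

def check_and_repair(disk):
    s1, s2, s3 = disk
    if s1 == s2 == s3:                 # state 1 or 7: no repair needed
        return
    if s1 == s2:                       # state 5 or 6: force S3 to match
        disk[2] = s1
    elif s2 == s3:                     # state 2 or 3: force S1 to match
        disk[0] = s2
    else:                              # state 4: copy S1 to S2 and S3
        disk[1] = s1
        disk[2] = s1

def all_or_nothing_put(disk, data, crash_before=None):
    check_and_repair(disk)
    almost_all_or_nothing_put(disk, data, crash_before)

def all_or_nothing_get(disk):
    s1, s2, s3 = disk
    return s1 if s1 == s2 else s3

# Whichever write the crash interrupts, a later reader sees either the
# complete old value or the complete new value, never a mixture.
for crash in (0, 1, 2, None):
    disk = [OLD, OLD, OLD]
    try:
        all_or_nothing_put(disk, NEW, crash_before=crash)
    except Crash:
        pass
    assert all_or_nothing_get(disk) in (OLD, NEW)
```

Re-running `check_and_repair` after a crash mirrors observation 2: it is idempotent, so repeating it any number of times leaves the disk in the state a single complete run would produce.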



There are several ways of implementing all-or-nothing disk sectors. Near the end of Chapter 8[on-line] we introduced a fault tolerance model for decay events that did not mask system crashes, and applied the technique known as RAID to mask decay and produce durable storage. Here we started with a slightly different fault tolerance model that omits decay, and we devised techniques to mask system crashes and produce all-or-nothing storage. What we really should do is start with a fault tolerance model that considers both system crashes and decay, and devise storage that is both all-or-nothing and durable. Such a model, devised by Xerox Corporation researchers Butler Lampson and Howard Sturgis, is the subject of Section 9.7, together with the more elaborate recovery algorithms it requires. That model has the additional feature that it needs only two physical sectors for each all-or-nothing sector.

9.2.2 Systematic Atomicity: Commit and the Golden Rule

The example of ALL_OR_NOTHING_PUT and ALL_OR_NOTHING_GET demonstrates an interesting special case of all-or-nothing atomicity, but it offers little guidance on how to systematically create a more general all-or-nothing action. From the example, our calendar program now has a tool that allows writing individual sectors with the all-or-nothing property, but that is not the same as safely adding an event to a calendar, since adding an event probably requires rearranging a data structure, which in turn may involve writing more than one disk sector. We could do a series of ALL_OR_NOTHING_PUTs to the several sectors, to ensure that each sector is itself written in an all-or-nothing fashion, but a crash that occurs after writing one and before writing the next would leave the overall calendar addition in a partly done state. To make the entire calendar addition action all-or-nothing we need a generalization.

Ideally, one might like to be able to take any arbitrary sequence of instructions in a program, surround that sequence with some sort of begin and end statements as in Figure 9.8, and expect that the language compilers and operating system will perform some magic that makes the surrounded sequence into an all-or-nothing action. Unfortunately, no one knows how to do that. But we can come close, if the programmer is willing to make a modest concession to the requirements of all-or-nothing atomicity. This concession is expressed in the form of a discipline on the constituent steps of the all-or-nothing action.

The discipline starts by identifying some single step of the sequence as the commit point. The all-or-nothing action is thus divided into two phases, a pre-commit phase and a post-commit phase, as suggested by Figure 9.9.
During the pre-commit phase, the disciplining rule of design is that no matter what happens, it must be possible to back out of this all-or-nothing action in a way that leaves no trace. During the post-commit phase the disciplining rule of design is that no matter what happens, the action must run to the end successfully. Thus an all-or-nothing action can have only two outcomes. If the all-or-nothing action starts and then, without reaching the commit point, backs out, we say that it aborts. If the all-or-nothing action passes the commit point, we say that it commits.
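The two disciplines can be written as a schematic code template (a sketch with invented names, not an interface from the text): each pre-commit step records an undo so the whole action can back out without a trace, a single step serves as the commit point, and post-commit steps are ones that cannot fail.

```python
# Sketch (invented names): the pre-commit/commit/post-commit discipline.
def run_all_or_nothing(pre_commit_steps, commit, post_commit_steps):
    undo_log = []
    try:
        for step, undo in pre_commit_steps:   # reversible work only
            step()
            undo_log.append(undo)
    except Exception:
        for undo in reversed(undo_log):       # abort: back out, no trace
            undo()
        raise
    commit()                                  # the single commit step
    for step in post_commit_steps:            # planned, cannot-fail work
        step()
```

A failing pre-commit step causes the action to abort, undoing its earlier steps in reverse order; once `commit()` runs, the action is considered to have happened and only the planned post-commit work remains.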


___
___
___
begin all-or-nothing action
    ___
    ___
    ___        arbitrary sequence of
    ___        lower-layer actions
    ___
end all-or-nothing action
___
___
___


FIGURE 9.8 Imaginary semantics for painless programming of all-or-nothing actions.

We can make several observations about the restrictions of the pre-commit phase. The pre-commit phase must identify all the resources needed to complete the all-or-nothing action, and establish their availability. The names of data should be bound, permissions should be checked, the pages to be read or written should be in memory, removable media should be mounted, stack space must be allocated, etc. In other words, all the steps needed to anticipate the severe run-to-the-end-without-faltering requirement of the post-commit phase should be completed during the pre-commit phase. In addition, the pre-commit phase must maintain the ability to abort at any instant. Any changes that the pre-commit phase makes to the state of the system must be undoable in case this all-or-nothing action aborts. Usually, this requirement means that shared resources, once reserved, cannot be released until the commit point is passed. The reason is that if an all-or-nothing action releases a shared resource, some other, concurrent thread may capture that resource. If the resource is needed in order to undo some effect of the all-or-nothing action, releasing the resource is tantamount to abandoning the ability to abort. Finally, the reversibility requirement means that the all-or-nothing action should not do anything externally visible, for example printing a check or firing a missile, prior to the commit point. (It is possible, though more complicated, to be slightly less restrictive. Sidebar 9.3 explores that possibility.)

first step of all-or-nothing action
    ___
    ___
    ___        Pre-commit discipline: can back out, leaving no trace
    ___
    ___   <--  Commit point
    ___
    ___        Post-commit discipline: completion is inevitable
    ___
last step of all-or-nothing action

FIGURE 9.9 The commit point of an all-or-nothing action.

In contrast, the post-commit phase can expose results, it can release reserved resources that are no longer needed, and it can perform externally visible actions such as printing a check, opening a cash drawer, or drilling a hole. But it cannot try to acquire additional resources because an attempt to acquire might fail, and the post-commit phase is not permitted the luxury of failure. The post-commit phase must confine itself to finishing just the activities that were planned during the pre-commit phase.

It might appear that if a system fails before the post-commit phase completes, all hope is lost, so the only way to ensure all-or-nothing atomicity is to always make the commit step the last step of the all-or-nothing action. Often, that is the simplest way to ensure all-or-nothing atomicity, but the requirement is not actually that stringent. An important feature of the post-commit phase is that it is hidden inside the layer that implements the all-or-nothing action, so a scheme that ensures that the post-commit phase completes after a system failure is acceptable, so long as this delay is hidden from the invoking layer.
Some all-or-nothing atomicity schemes thus involve a guarantee that a cleanup procedure will be invoked following every system failure, or as a prelude to the next use of the data, before anyone in a higher layer gets a chance to discover that anything went wrong. This idea should sound familiar: the implementation of ALL_OR_NOTHING_PUT in Figure 9.7 used this approach, by always running the cleanup procedure named CHECK_AND_REPAIR before updating the data.

A popular technique for achieving all-or-nothing atomicity is called the shadow copy. It is used by text editors, compilers, calendar management programs, and other programs that modify existing files, to ensure that following a system failure the user does not end up with data that is damaged or that contains only some of the intended changes:

• Pre-commit: Create a complete duplicate working copy of the file that is to be modified. Then, make all changes to the working copy.
• Commit point: Carefully exchange the working copy with the original. Typically this step is bootstrapped, using a lower-layer RENAME entry point of the file system that provides certain atomic-like guarantees such as the ones described for the UNIX version of RENAME in Section 2.5.8.
• Post-commit: Release the space that was occupied by the original.

Sidebar 9.3: Cascaded aborts. (Temporary) sweeping simplification. In this initial discussion of commit points, we are intentionally avoiding a more complex and harder-to-design possibility. Some systems allow other, concurrent activities to see pending results, and they may even allow externally visible actions before commit. Those systems must therefore be prepared to track down and abort those concurrent activities (this tracking down is called cascaded abort) or perform compensating external actions (e.g., send a letter requesting return of the check or apologizing for the missile firing). The discussion of layers and multiple sites in Chapter 10[on-line] introduces a simple version of cascaded abort.

The ALL_OR_NOTHING_PUT algorithm of Figure 9.7 can be seen as a particular example of the shadow copy strategy, which itself is a particular example of the general pre-commit/post-commit discipline. The commit point occurs at the instant when the new value of S2 is successfully written to the disk. During the pre-commit phase, while ALL_OR_NOTHING_PUT is checking over the three sectors and writing the shadow copy S1, a crash will leave no trace of that activity (that is, no trace that can be discovered by a later caller of ALL_OR_NOTHING_GET). The post-commit phase of ALL_OR_NOTHING_PUT consists of writing S3. From these examples we can extract an important design principle:

The golden rule of atomicity: Never modify the only copy!

In order for a composite action to be all-or-nothing, there must be some way of reversing the effect of each of its pre-commit phase component actions, so that if the action does not commit it is possible to back out. As we continue to explore implementations of all-or-nothing atomicity, we will notice that correct implementations always reduce at the end to making a shadow copy. The reason is that this structure ensures that the implementation follows the golden rule.
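In Python, the shadow-copy discipline for a file update might look like the following (an invented example, not from the text). The commit point is `os.replace`, which relies on the file system's atomic rename; everything before it operates on a uniquely named working copy, and nothing after it can fail:

```python
# Sketch (invented example): shadow-copy update of a file. The file
# system's atomic rename serves as the commit point.
import os
import tempfile

def shadow_copy_update(path, transform):
    with open(path) as f:
        old_data = f.read()
    # Pre-commit: all changes go to a uniquely named working copy in the
    # same directory, invisible under the original's name.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, working_copy = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "w") as f:
        f.write(transform(old_data))
        f.flush()
        os.fsync(f.fileno())       # make the working copy durable first
    # Commit point: atomically exchange the working copy with the original.
    os.replace(working_copy, path)
    # Post-commit: nothing to do; the file system reclaims the old blocks
    # once the original's last name is gone.
```

A crash before `os.replace` leaves the original untouched (at worst an orphan temporary file remains to be cleaned up); a crash after it leaves the fully updated file, so a later reader never sees a mixture.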

9.2.3 Systematic All-or-Nothing Atomicity: Version Histories

This section develops a scheme to provide all-or-nothing atomicity in the general case of a program that modifies arbitrary data structures. It will be easy to see why the scheme is correct, but the mechanics can interfere with performance. Section 9.3 of this chapter then introduces a variation on the scheme that requires more thought to see why it is correct, but that allows higher-performance implementations. As before, we concentrate for the moment on all-or-nothing atomicity. While some aspects of before-or-after atomicity will also emerge, we leave a systematic treatment of that topic for discussion in Sections 9.4 and 9.5 of this chapter. Thus the model to keep in mind in this section is that only a single thread is running. If the system crashes, after a restart the original thread is gone—recall from Chapter 8[on-line] the sweeping simplification that threads are included in the volatile state that is lost on a crash and only durable state survives. After the crash, a new, different thread comes along and attempts to look at the data. The goal is that the new thread should always find that the all-or-nothing action that was in progress at the time of the crash either never started or completed successfully.



In looking at the general case, a fundamental difficulty emerges: random-access memory and disk usually appear to the programmer as a set of named, shared, and rewritable storage cells, called cell storage. Cell storage has semantics that are actually quite hard to make all-or-nothing because the act of storing destroys old data, thus potentially violating the golden rule of atomicity. If the all-or-nothing action later aborts, the old value is irretrievably gone; at best it can only be reconstructed from information kept elsewhere. In addition, storing data reveals it to the view of later threads, whether or not the all-or-nothing action that stored the value reached its commit point. If the all-or-nothing action happens to have exactly one output value, then writing that value into cell storage can be the mechanism of committing, and there is no problem. But if the result is supposed to consist of several output values, all of which should be exposed simultaneously, it is harder to see how to construct the all-or-nothing action. Once the first output value is stored, the computation of the remaining outputs has to be successful; there is no going back. If the system fails and we have not been careful, a later thread may see some old and some new values.

These limitations of cell storage did not plague the shopkeepers of Padua, who in the 14th century invented double-entry bookkeeping. Their storage medium was leaves of paper in bound books and they made new entries with quill pens. They never erased or even crossed out entries that were in error; when they made a mistake they made another entry that reversed the mistake, thus leaving a complete history of their actions, errors, and corrections in the book. It wasn’t until the 1950’s, when programmers began to automate bookkeeping systems, that the notion of overwriting data emerged.
Up until that time, if a bookkeeper collapsed and died while making an entry, it was always possible for someone else to seamlessly take over the books. This observation about the robustness of paper systems suggests that there is a form of the golden rule of atomicity that might allow one to be systematic: never erase anything.

Examining the shadow copy technique used by the text editor provides a second useful idea. The essence of the mechanism that allows a text editor to make several changes to a file, yet not reveal any of the changes until it is ready, is this: the only way another prospective reader of a file can reach it is by name. Until commit time the editor works on a copy of the file that is either not yet named or has a unique name not known outside the thread, so the modified copy is effectively invisible. Renaming the new version is the step that makes the entire set of updates simultaneously visible to later readers.

These two observations suggest that all-or-nothing actions would be better served by a model of storage that behaves differently from cell storage: instead of a model in which a store operation overwrites old data, we instead create a new, tentative version of the data, such that the tentative version remains invisible to any reader outside this all-or-nothing action until the action commits. We can provide such semantics, even though we start with traditional cell memory, by interposing a layer between the cell storage and the program that reads and writes data. This layer implements what is known as journal storage. The basic idea of journal storage is straightforward: we associate with every named variable not a single cell, but a list of cells in non-volatile storage; the values in the list represent the history of the variable. Figure 9.10 illustrates. Whenever any action


Variable A:  [ history of earlier versions … | current version | tentative next version ]

FIGURE 9.10 Version history of a variable in journal storage.

proposes to write a new value into the variable, the journal storage manager appends the prospective new value to the end of the list. Clearly this approach, being history-preserving, offers some hope of being helpful because if an all-or-nothing action aborts, one can imagine a systematic way to locate and discard all of the new versions it wrote. Moreover, we can tell the journal storage manager to expect to receive tentative values, but to ignore them unless the all-or-nothing action that created them commits. The basic mechanism to accomplish such an expectation is quite simple; the journal storage manager should make a note, next to each new version, of the identity of the all-or-nothing action that created it. Then, at any later time, it can discover the status of the tentative version by inquiring whether or not the all-or-nothing action ever committed.

Figure 9.11 illustrates the overall structure of such a journal storage system, implemented as a layer that hides a cell storage system. (To reduce clutter, this journal storage system omits calls to create new and delete old variables.) In this particular model, we assign to the journal storage manager most of the job of providing tools for programming all-or-nothing actions. Thus the implementer of a prospective all-or-nothing action should begin that action by invoking the journal storage manager entry NEW_ACTION, and later complete the action by invoking either COMMIT or ABORT. If, in addition, actions perform all reads and writes of data by invoking the journal storage manager’s READ_CURRENT_VALUE and WRITE_NEW_VALUE entries, our hope is that the result will automatically be all-or-nothing with no further concern of the implementer. How could this automatic all-or-nothing atomicity work?
The first step is that the journal storage manager, when called at NEW_ACTION, should assign a nonce identifier to the prospective all-or-nothing action, and create, in non-volatile cell storage, a record of this new identifier and the state of the new all-or-nothing action. This record is called an outcome record; it begins its existence in the state PENDING; depending on the outcome it should eventually move to one of the states COMMITTED or ABORTED, as suggested by Figure 9.12. No other state transitions are possible, except to discard the outcome record once

Saltzer & Kaashoek Ch. 9, p. 32


9.2 All-or-Nothing Atomicity I: Concepts


FIGURE 9.11 Interface to and internal organization of an all-or-nothing storage system based on version histories and journal storage. The journal storage manager is a layer over a cell storage system, which holds catalogs, versions, and outcome records.

FIGURE 9.12 The allowed state transitions of an outcome record: from non-existent to PENDING when a new all-or-nothing action is created; from PENDING to COMMITTED when the all-or-nothing action commits, or to ABORTED when it aborts; and from either COMMITTED or ABORTED to discarded once the outcome record’s state is no longer of any interest.


1   procedure NEW_ACTION ()
2       id ← NEW_OUTCOME_RECORD ()
3       id.outcome_record.state ← PENDING
4       return id

5   procedure COMMIT (reference id)
6       id.outcome_record.state ← COMMITTED

7   procedure ABORT (reference id)
8       id.outcome_record.state ← ABORTED

FIGURE 9.13 The procedures NEW_ACTION, COMMIT, and ABORT.

there is no further interest in its state. Figure 9.13 illustrates implementations of the three procedures NEW_ACTION, COMMIT, and ABORT.

When an all-or-nothing action calls the journal storage manager to write a new version of some data object, that action supplies the identifier of the data object, a tentative new value for the new version, and the identifier of the all-or-nothing action. The journal storage manager calls on the lower-level storage management system to allocate in non-volatile cell storage enough space to contain the new version; it places in the newly allocated cell storage the new data value and the identifier of the all-or-nothing action. Thus the journal storage manager creates a version history as illustrated in Figure 9.14. Now,

FIGURE 9.14 Portion of a version history for object A, with outcome records. Some thread has recently called WRITE_NEW_VALUE specifying data_id = A, new_value = 75, and client_id = 1794. A caller to READ_CURRENT_VALUE will read the value 24 for A.


1   procedure READ_CURRENT_VALUE (data_id, caller_id)
2       starting at end of data_id repeat until beginning
3           v ← previous version of data_id          // Get next older version
4           a ← v.action_id                          // Identify the action a that created it
5           s ← a.outcome_record.state               // Check action a’s outcome record
6           if s = COMMITTED then
7               return v.value
8           else skip v                              // Continue backward search
9       signal (“Tried to read an uninitialized variable!”)

10  procedure WRITE_NEW_VALUE (reference data_id, new_value, caller_id)
11      if caller_id.outcome_record.state = PENDING then
12          append new version v to data_id
13          v.value ← new_value
14          v.action_id ← caller_id
        else signal (“Tried to write outside of an all-or-nothing action!”)

FIGURE 9.15 Algorithms followed by READ_CURRENT_VALUE and WRITE_NEW_VALUE. The parameter caller_id is the action identifier returned by NEW_ACTION. In this version, only WRITE_NEW_VALUE uses caller_id. Later, READ_CURRENT_VALUE will also use it.
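The two procedures of Figure 9.15, together with NEW_ACTION, COMMIT, and ABORT of Figure 9.13, can be sketched in executable form. The following Python fragment is an illustrative in-memory model only; the structures `outcome` and `history` and all of the names are assumptions of this sketch, not the book’s code.

```python
# Illustrative in-memory model of the journal storage manager of
# Figures 9.13 and 9.15. Names and data structures are assumptions.
PENDING, COMMITTED, ABORTED = "pending", "committed", "aborted"

outcome = {}   # action identifier -> outcome record state
history = {}   # data identifier -> list of (value, action_id), oldest first
next_id = 0

def new_action():
    global next_id
    next_id += 1                    # a fresh (nonce) identifier
    outcome[next_id] = PENDING      # the outcome record begins PENDING
    return next_id

def commit(action_id):
    outcome[action_id] = COMMITTED  # the commit point

def abort(action_id):
    outcome[action_id] = ABORTED

def write_new_value(data_id, new_value, caller_id):
    if outcome[caller_id] != PENDING:
        raise RuntimeError("Tried to write outside of an all-or-nothing action!")
    history.setdefault(data_id, []).append((new_value, caller_id))

def read_current_value(data_id):
    # Backward search for the most recent *committed* version.
    for value, action_id in reversed(history.get(data_id, [])):
        if outcome[action_id] == COMMITTED:
            return value
    raise RuntimeError("Tried to read an uninitialized variable!")
```

A tentative version written by a pending or aborted action is skipped by `read_current_value`, so a crash before COMMIT leaves the earlier committed value visible, which is exactly the all-or-nothing behavior the text describes.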

when someone proposes to read a data value by calling READ_CURRENT_VALUE, the journal storage manager can review the version history, starting with the latest version, and return the value in the most recent committed version. By inspecting the outcome records, the journal storage manager can ignore those versions that were written by all-or-nothing actions that aborted or that never committed. The procedures READ_CURRENT_VALUE and WRITE_NEW_VALUE thus follow the algorithms of Figure 9.15.

The important property of this pair of algorithms is that if the current all-or-nothing action is somehow derailed before it reaches its call to COMMIT, the new versions it has created are invisible to invokers of READ_CURRENT_VALUE. (They are also invisible to the all-or-nothing action that wrote them. Since it is sometimes convenient for an all-or-nothing action to read something that it has tentatively written, a different procedure, named READ_MY_PENDING_VALUE, identical to READ_CURRENT_VALUE except for a different test on line 6, could do that.) Moreover if, for example, all-or-nothing action 99 crashes while partway through changing the values of nineteen different data objects, all nineteen changes would be invisible to later invokers of READ_CURRENT_VALUE. If all-or-nothing action 99 does reach its call to COMMIT, that call commits the entire set of changes simultaneously and atomically, at the instant that it changes the outcome record from PENDING to COMMITTED. Pending versions would also be invisible to any concurrent action that reads data with READ_CURRENT_VALUE, a feature that will prove useful when we introduce concurrent threads and discuss before-or-after atomicity, but for the moment our only


1   procedure TRANSFER (reference debit_account, reference credit_account, amount)
2       my_id ← NEW_ACTION ()
3       xvalue ← READ_CURRENT_VALUE (debit_account, my_id)
4       xvalue ← xvalue - amount
5       WRITE_NEW_VALUE (debit_account, xvalue, my_id)
6       yvalue ← READ_CURRENT_VALUE (credit_account, my_id)
7       yvalue ← yvalue + amount
8       WRITE_NEW_VALUE (credit_account, yvalue, my_id)
9       if xvalue > 0 then
10          COMMIT (my_id)
11      else
12          ABORT (my_id)
13          signal (“Negative transfers are not allowed.”)

FIGURE 9.16 An all-or-nothing TRANSFER procedure, based on journal storage. (This program assumes that it is the only running thread. Making the transfer procedure a before-or-after action because other threads might be updating the same accounts concurrently requires additional mechanism that is discussed later in this chapter.)

concern is that a system crash may prevent the current thread from committing or aborting, and we want to make sure that a later thread doesn’t encounter partial results. As in the case of the calendar manager of Section 9.2.1, we assume that when a crash occurs, any all-or-nothing action that was in progress at the time was being supervised by some outside agent who realizes that a crash has occurred, uses READ_CURRENT_VALUE to find out what happened, and if necessary initiates a replacement all-or-nothing action.

Figure 9.16 shows the TRANSFER procedure of Section 9.1.5 reprogrammed as an all-or-nothing (but not, for the moment, before-or-after) action using the version history mechanism. This implementation of TRANSFER is more elaborate than the earlier one: it tests to see whether or not the account to be debited has enough funds to cover the transfer, and if not it aborts the action. The order of steps in the transfer procedure is remarkably unconstrained by any consideration other than calculating the correct answer. The reading of credit_account, for example, could casually be moved to any point between NEW_ACTION and the place where yvalue is recalculated. We conclude that the journal storage system has made the pre-commit discipline much less onerous than we might have expected.

There is still one loose end: it is essential that updates to a version history and changes to an outcome record be all-or-nothing. That is, if the system fails while the thread is inside WRITE_NEW_VALUE, adjusting structures to append a new version, or inside COMMIT while updating the outcome record, the cell being written must not be muddled; it must either stay as it was before the crash or change to the intended new value. The solution is to design all modifications to the internal structures of journal storage so that they can



be done by overwriting a single cell. For example, suppose that the name of a variable that has a version history refers to a cell that contains the address of the newest version, and that versions are linked from the newest version backwards, by address references. Adding a version consists of allocating space for a new version, reading the current address of the prior version, writing that address in the backward link field of the new version, and then updating the descriptor with the address of the new version. That last update can be done by overwriting a single cell. Similarly, updating an outcome record to change it from PENDING to COMMITTED can be done by overwriting a single cell.

As a first bootstrapping step, we have reduced the general problem of creating all-or-nothing actions to the specific problem of doing an all-or-nothing overwrite of one cell. As the remaining bootstrapping step, recall that we already know how to do a single-cell all-or-nothing overwrite: apply the ALL_OR_NOTHING_PUT procedure of Figure 9.7. (If there is concurrency, updates to the internal structures of the version history also need before-or-after atomicity. Section 9.4 will explore methods of providing it.)
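The single-cell overwrite idea can be sketched as follows (Python; the names `Version`, `newest`, and `add_version` are illustrative, not the book’s code): the new version is completely assembled, including its backward link, before one assignment publishes it.

```python
# Sketch: adding a version so that only one cell ever changes.
class Version:
    def __init__(self, value, prev):
        self.value = value   # the tentative data value
        self.prev = prev     # backward link, written before publishing

newest = {"A": None}         # one cell per variable: address of newest version

def add_version(name, value):
    v = Version(value, newest[name])  # allocate and link to the prior version
    newest[name] = v                  # the all-or-nothing single-cell overwrite

add_version("A", 24)
add_version("A", 75)
```

A crash before the final assignment leaves the descriptor cell pointing at the old version chain, untouched; a crash after it leaves the new version fully in place.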

9.2.4 How Version Histories are Used

The careful reader will note two possibly puzzling things about the version history scheme just described. Both will become less puzzling when we discuss concurrency and before-or-after atomicity in Section 9.4 of this chapter:

1. Because READ_CURRENT_VALUE skips over any version belonging to another all-or-nothing action whose OUTCOME record is not COMMITTED, it isn’t really necessary to change the OUTCOME record when an all-or-nothing action aborts; the record could just remain in the PENDING state indefinitely. However, when we introduce concurrency, we will find that a pending action may prevent other threads from reading variables for which the pending action created a new version, so it will become important to distinguish aborted actions from those that really are still pending.

2. As we have defined READ_CURRENT_VALUE, versions older than the most recent committed version are inaccessible and they might just as well be discarded. Discarding could be accomplished either as an additional step in the journal storage manager, or as part of a separate garbage collection activity. Alternatively, those older versions may be useful as an historical record, known as an archive, with the addition of timestamps on commit records and procedures that can locate and return old values created at specified times in the past. For this reason, a version history system is sometimes called a temporal database or is said to provide time domain addressing. The banking industry abounds in requirements that make use of history information, such as reporting a consistent sum of balances in all bank accounts, paying interest on the fifteenth on balances as of the first of the month, or calculating the average balance last month. Another reason for not discarding old versions immediately will emerge when we discuss concurrency and


before-or-after atomicity: concurrent threads may, for correctness, need to read old versions even after new versions have been created and committed.

Direct implementation of a version history raises concerns about performance: rather than simply reading a named storage cell, one must instead make at least one indirect reference through a descriptor that locates the storage cell containing the current version. If the cell storage device is on a magnetic disk, this extra reference is a potential bottleneck, though it can be alleviated with a cache. A bottleneck that is harder to alleviate occurs on updates. Whenever an application writes a new value, the journal storage layer must allocate space in unused cell storage, write the new version, and update the version history descriptor so that future readers can find the new version. Several disk writes are likely to be required. These extra disk writes may be hidden inside the journal storage layer and with added cleverness may be delayed until commit and batched, but they still have a cost. When storage access delays are the performance bottleneck, extra accesses slow things down.

In consequence, version histories are used primarily in low-performance applications. One common example is found in revision management systems used to coordinate teams doing program development. A programmer “checks out” a group of files, makes changes, and then “checks in” the result. The check-out and check-in operations are all-or-nothing, and check-in makes each changed file the latest version in a complete history of that file, in case a problem is discovered later. (The check-in operation also verifies that no one else changed the files while they were checked out, which catches some, but not all, coordination errors.)
A second example is that some interactive applications such as word processors or image editing systems provide a “deep undo” feature, which allows a user who decides that his or her recent editing is misguided to step backwards to reach an earlier, satisfactory state. A third example appears in file systems that automatically create a new version every time any application opens an existing file for writing; when the application closes the file, the file system tags a number suffix to the name of the previous version of the file and moves the original name to the new version. These interfaces employ version histories because users find them easy to understand and they provide all-or-nothing atomicity in the face of both system failures and user mistakes. Most such applications also provide an archive that is useful for reference and that allows going back to a known good version.

Applications requiring high performance are a different story. They, too, require all-or-nothing atomicity, but they usually achieve it by applying a specialized technique called a log. Logs are our next topic.

9.3 All-or-Nothing Atomicity II: Pragmatics

Database management applications such as airline reservation systems or banking systems usually require high performance as well as all-or-nothing atomicity, so their designers use streamlined atomicity techniques. The foremost of these techniques sharply separates the reading and writing of data from the failure recovery mechanism.



The idea is to minimize the number of storage accesses required for the most common activities (application reads and updates). The trade-off is that the number of storage accesses for rarely performed activities (failure recovery, which one hopes is actually exercised only occasionally, if at all) may not be minimal. The technique is called logging. Logging is also used for purposes other than atomicity, several of which Sidebar 9.4 describes.

9.3.1 Atomicity Logs

The basic idea behind atomicity logging is to combine the all-or-nothing atomicity of journal storage with the speed of cell storage, by having the application record every change to data twice. The application first logs the change in journal storage, and then it installs the change in cell storage*. One might think that writing data twice must be more expensive than writing it just once into a version history, but the separation permits specialized optimizations that can make the overall system faster.

The first recording, to journal storage, is optimized for fast writing by creating a single, interleaved version history of all variables, known as a log. The information describing each data update forms a record that the application appends to the end of the log. Since there is only one log, a single pointer to the end of the log is all that is needed to find the place to append the record of a change of any variable in the system. If the log medium is magnetic disk, and the disk is used only for logging, and the disk storage management system allocates sectors contiguously, the disk seek arm will need to move only when a disk cylinder is full, thus eliminating most seek delays. As we will see, recovery does involve scanning the log, which is expensive, but recovery should be a rare event. Using a log is thus an example of following the hint to optimize for the common case.

The second recording, to cell storage, is optimized to make reading fast: the application installs by simply overwriting the previous cell storage record of that variable. The record kept in cell storage can be thought of as a cache that, for reading, bypasses the effort that would otherwise be required to locate the latest version in the log. In addition, by not reading from the log the logging disk’s seek arm can remain in position, ready for the next update.
The two steps, LOG and INSTALL, become a different implementation of the WRITE_NEW_VALUE interface of Figure 9.11. Figure 9.17 illustrates this two-step implementation. The underlying idea is that the log is the authoritative record of the outcome of the action. Cell storage is merely a reference copy; if it is lost, it can be reconstructed from the log. The purpose of installing a copy in cell storage is to make both logging and reading faster. By recording data twice, we obtain high performance in writing, high performance in reading, and all-or-nothing atomicity, all at the same time.

There are three common logging configurations, shown in Figure 9.18. In each of these three configurations, the log resides in non-volatile storage. For the in-memory

* A hardware architect would say “…it graduates the change to cell storage”. This text, somewhat arbitrarily, chooses to use the database management term “install”.


Sidebar 9.4: The many uses of logs

A log is an object whose primary usage method is to append a new record. Log implementations normally provide procedures to read entries from oldest to newest or in reverse order, but there is usually not any procedure for modifying previous entries. Logs are used for several quite distinct purposes, and this range of purposes sometimes gets confused in real-world designs and implementations. Here are some of the most common uses for logs:

1. Atomicity log. If one logs the component actions of an all-or-nothing action, together with sufficient before and after information, then a crash recovery procedure can undo (and thus roll back the effects of) all-or-nothing actions that didn’t get a chance to complete, or finish all-or-nothing actions that committed but that didn’t get a chance to record all of their effects.

2. Archive log. If the log is kept indefinitely, it becomes a place where old values of data and the sequence of actions taken by the system or its applications can be kept for review. There are many uses for archive information: watching for failure patterns, reviewing the actions of the system preceding and during a security breach, recovery from application-layer mistakes (e.g., a clerk incorrectly deleted an account), historical study, fraud control, and compliance with record-keeping requirements.

3. Performance log. Most mechanical storage media have much higher performance for sequential access than for random access. Since logs are written sequentially, they are ideally suited to such storage media. It is possible to take advantage of this match to the physical properties of the media by structuring data to be written in the form of a log. When combined with a cache that eliminates most disk reads, a performance log can provide a significant speed-up. As will be seen in the accompanying text, an atomicity log is usually also a performance log.

4. Durability log.
If the log is stored on a non-volatile medium—say magnetic tape—that fails in ways and at times that are independent from the failures of the cell storage medium—which might be magnetic disk—then the copies of data in the log are replicas that can be used as backup in case of damage to the copies of the data in cell storage. This kind of log helps implement durable storage. Any log that uses a non-volatile medium, whether intended for atomicity, archiving, or performance, typically also helps support durability.

It is essential to keep these various purposes—all-or-nothing atomicity, archive, performance, and durable storage—distinct in one’s mind when examining or designing a log implementation because they lead to different priorities among design trade-offs. When archive is the goal, low cost of the storage medium is usually more important than quick access because archive logs are large but, in practice, infrequently read. When durable storage is the goal, it may be important to use storage media with different physical properties, so that failure modes will be as independent as possible. When all-or-nothing atomicity or performance is the purpose, minimizing mechanical movement of the storage device becomes a high priority. Because of the competing objectives of different kinds of logs, as a general rule, it is usually a wise move to implement separate, dedicated logs for different functions.



FIGURE 9.17 Logging for all-or-nothing atomicity. The application performs WRITE_NEW_VALUE by first appending a record of the new value to the log in journal storage, and then installing the new value in cell storage by overwriting. The application performs READ_CURRENT_VALUE by reading just from cell storage.
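The two recordings of Figure 9.17 can be sketched in a few lines of Python. This is an illustrative model only: a list stands in for journal storage, a dict for cell storage, and all names are assumptions of this sketch.

```python
# Sketch of the LOG-then-INSTALL discipline (illustrative structures).
log = []    # journal storage: the authoritative, append-only record
cell = {}   # cell storage: a reference copy, optimized for reading

def write_new_value(data_id, new_value, action_id):
    # Step 1 (LOG): append the change record to journal storage first...
    log.append({"action": action_id, "data": data_id, "new": new_value})
    # Step 2 (INSTALL): ...then overwrite the cell storage copy.
    cell[data_id] = new_value

def read_current_value(data_id):
    return cell[data_id]   # reads never touch the log

write_new_value("balance", 100, 7)
write_new_value("balance", 90, 7)
```

Because every install is preceded by a log append, the log alone is sufficient to reconstruct cell storage, while ordinary reads pay only the cost of one cell storage access.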

FIGURE 9.18 Three common logging configurations. Arrows show data flow as the application reads, logs, and installs data. In the in-memory database, the log is in non-volatile storage and cell storage is in volatile storage; in the ordinary database, both the log and cell storage are in non-volatile storage; in the high-performance database, the log and cell storage are in non-volatile storage, and a cache of cell storage is also kept in volatile storage.


database, cell storage resides entirely in some volatile storage medium. In the second common configuration, cell storage resides in non-volatile storage along with the log. Finally, high-performance database management systems usually blend the two preceding configurations by implementing a cache for cell storage in a volatile medium, and a potentially independent multilevel memory management algorithm moves data between the cache and non-volatile cell storage.

Recording everything twice adds one significant complication to all-or-nothing atomicity because the system can crash between the time a change is logged and the time it is installed. To maintain all-or-nothing atomicity, logging systems follow a protocol that has two fundamental requirements. The first requirement is a constraint on the order of logging and installing. The second requirement is to run an explicit recovery procedure after every crash. (We saw a preview of the strategy of using a recovery procedure in Figure 9.7, which used a recovery procedure named CHECK_AND_REPAIR.)

9.3.2 Logging Protocols

There are several kinds of atomicity logs that vary in the order in which things are done and in the details of information logged. However, all of them involve the ordering constraint implied by the numbering of the arrows in Figure 9.17. The constraint is a version of the golden rule of atomicity (never modify the only copy), known as the write-ahead-log (WAL) protocol:

    Write-ahead-log protocol
    Log the update before installing it.

The reason is that logging appends but installing overwrites. If an application violates this protocol by installing an update before logging it and then for some reason must abort, or the system crashes, there is no systematic way to discover the installed update and, if necessary, reverse it. The write-ahead-log protocol ensures that if a crash occurs, a recovery procedure can, by consulting the log, systematically find all completed and intended changes to cell storage and either restore those records to old values or set them to new values, as appropriate to the circumstance.

The basic element of an atomicity log is the log record. Before an action that is to be all-or-nothing installs a data value, it appends to the end of the log a new record of type CHANGE containing, in the general case, three pieces of information (we will later see special cases that allow omitting item 2 or item 3):

1. The identity of the all-or-nothing action that is performing the update.

2. A component action that, if performed, installs the intended value in cell storage. This component action is a kind of an insurance policy in case the system crashes. If the all-or-nothing action commits, but then the system crashes before the action has a chance to perform the install, the recovery procedure can perform the install

Saltzer & Kaashoek Ch. 9, p. 42

June 25, 2009 8:22 am

9.3 All-or-Nothing Atomicity II: Pragmatics


on behalf of the action. Some systems call this component action the do action, others the redo action. For mnemonic compatibility with item 3, this text calls it the redo action.

3. A second component action that, if performed, reverses the effect on cell storage of the planned install. This component action is known as the undo action because if, after doing the install, the all-or-nothing action aborts or the system crashes, it may be necessary for the recovery procedure to reverse the effect of (undo) the install.

An application appends a log record by invoking the lower-layer procedure LOG, which itself must be atomic. The LOG procedure is another example of bootstrapping: starting with, for example, the ALL_OR_NOTHING_PUT described earlier in this chapter, a log designer creates a generic LOG procedure, and using the LOG procedure an application programmer then can implement all-or-nothing atomicity for any properly designed composite action.

As we saw in Figure 9.17, LOG and INSTALL are the logging implementation of the WRITE_NEW_VALUE part of the interface of Figure 9.11, and READ_CURRENT_VALUE is simply a READ from cell storage. We also need a logging implementation of the remaining parts of the Figure 9.11 interface. The way to implement NEW_ACTION is to log a BEGIN record that contains just the new all-or-nothing action’s identity. As the all-or-nothing action proceeds through its pre-commit phase, it logs CHANGE records. To implement COMMIT or ABORT, the all-or-nothing action logs an OUTCOME record that becomes the authoritative indication of the outcome of the all-or-nothing action. The instant that the all-or-nothing action logs the OUTCOME record is its commit point. As an example, Figure 9.19 shows our by now familiar TRANSFER action implemented with logging.
Because the log is the authoritative record of the action, the all-or-nothing action can perform installs to cell storage at any convenient time that is consistent with the write-ahead-log protocol, either before or after logging the OUTCOME record. The final step of an action is to log an END record, again containing just the action’s identity, to show that the action has completed all of its installs. (Logging all four kinds of activity—BEGIN, CHANGE, OUTCOME, and END—is more general than sometimes necessary. As we will see, some logging systems can combine, e.g., OUTCOME and END, or BEGIN with the first CHANGE.) Figure 9.20 shows examples of three log records: two typical CHANGE records of an all-or-nothing TRANSFER action, interleaved with the OUTCOME record of some other, perhaps completely unrelated, all-or-nothing action.

One consequence of installing results in cell storage is that for an all-or-nothing action to abort it may have to do some clean-up work. Moreover, if the system involuntarily terminates a thread that is in the middle of an all-or-nothing action (because, for example, the thread has gotten into a deadlock or an endless loop), some entity other than the hapless thread must clean things up. If this clean-up step were omitted, the all-or-nothing action could remain pending indefinitely. The system cannot simply ignore indefinitely pending actions because all-or-nothing actions initiated by other threads are likely to want to use the data that the terminated action changed. (This is actually a


1   procedure TRANSFER (debit_account, credit_account, amount)
2       my_id ← LOG (BEGIN_TRANSACTION)
3       dbvalue.old ← GET (debit_account)
4       dbvalue.new ← dbvalue.old - amount
5       crvalue.old ← GET (credit_account, my_id)
6       crvalue.new ← crvalue.old + amount
7       LOG (CHANGE, my_id,
8           “PUT (debit_account, dbvalue.new)”,          // redo action
9           “PUT (debit_account, dbvalue.old)” )         // undo action
10      LOG (CHANGE, my_id,
11          “PUT (credit_account, crvalue.new)”,         // redo action
12          “PUT (credit_account, crvalue.old)”)         // undo action
13      PUT (debit_account, dbvalue.new)                 // install
14      PUT (credit_account, crvalue.new)                // install
15      if dbvalue.new > 0 then
16          LOG (OUTCOME, COMMIT, my_id)
17      else
18          LOG (OUTCOME, ABORT, my_id)
19          signal (“Action not allowed. Would make debit account negative.”)
20      LOG (END_TRANSACTION, my_id)

FIGURE 9.19 An all-or-nothing TRANSFER procedure, implemented with logging.

before-or-after atomicity concern, one of the places where all-or-nothing atomicity and before-or-after atomicity intersect.)

If the action being aborted did any installs, those installs are still in cell storage, so simply appending to the log an OUTCOME record saying that the action aborted is not enough to make it appear to later observers that the all-or-nothing action did nothing. The solution to this problem is to execute a generic ABORT procedure.

    …older log records…

    type: CHANGE
    action_id: 9979
    redo_action: PUT(debit_account, $90)
    undo_action: PUT(debit_account, $120)

    type: OUTCOME
    action_id: 9974
    status: COMMITTED

    type: CHANGE
    action_id: 9979
    redo_action: PUT(credit_account, $40)
    undo_action: PUT(credit_account, $10)

    …newer log records…

FIGURE 9.20 An example of a section of an atomicity log, showing two CHANGE records for a TRANSFER action that has action_id 9979 and the OUTCOME record of a different all-or-nothing action.

The ABORT procedure restores to their old values all cell storage variables that the all-or-nothing action installed. The ABORT procedure simply scans the log backwards looking for log entries created by this all-or-nothing action; for each CHANGE record it finds, it performs the logged undo_action, thus restoring the old values in cell storage. The backward search terminates when the ABORT procedure finds that all-or-nothing action’s BEGIN record. Figure 9.21 illustrates.

The extra work required to undo cell storage installs when an all-or-nothing action aborts is another example of optimizing for the common case: one expects that most all-or-nothing actions will commit, and that aborted actions should be relatively rare. The extra effort of an occasional roll back of cell storage values will (one hopes) be more than repaid by the more frequent gains in performance on updates, reads, and commits.
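A list-structured sketch of this backward-scanning abort, paralleling Figure 9.21, follows (Python; the record layout and all names are illustrative assumptions of this sketch, not the book’s code).

```python
# Sketch of a generic ABORT over a list-structured log. Cell storage
# already holds the installed (new) values for action 9979.
cell = {"debit_account": 90, "credit_account": 40}
log = [
    {"type": "BEGIN",  "action": 9979},
    {"type": "CHANGE", "action": 9979,
     "redo": ("debit_account", 90), "undo": ("debit_account", 120)},
    {"type": "CHANGE", "action": 9979,
     "redo": ("credit_account", 40), "undo": ("credit_account", 10)},
]

def abort(action_id):
    for record in reversed(log):              # LIFO (backward) scan of the log
        if record["action"] != action_id:
            continue                          # some other action's record
        if record["type"] == "OUTCOME":
            raise RuntimeError("Can't abort an already completed action.")
        if record["type"] == "CHANGE":
            variable, old_value = record["undo"]
            cell[variable] = old_value        # perform the logged undo action
        if record["type"] == "BEGIN":
            break                             # all of this action's installs undone
    log.append({"type": "OUTCOME", "action": action_id, "status": "ABORTED"})
    log.append({"type": "END", "action": action_id})

abort(9979)
```

After the call, cell storage again holds the old values, and the appended OUTCOME record blocks any future undos of the same action.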

9.3.3 Recovery Procedures

The write-ahead log protocol is the first of the two required protocol elements of a logging system. The second required protocol element is that, following every system crash, the system must run a recovery procedure before it allows ordinary applications to use the data. The details of the recovery procedure depend on the particular configuration of the journal and cell storage with respect to volatile and non-volatile memory.

Consider first recovery for the in-memory database of Figure 9.18. Since a system crash may corrupt anything that is in volatile memory, including both the state of cell storage and the state of any currently running threads, restarting a crashed system usually begins by resetting all volatile memory. The effect of this reset is to abandon both the cell

procedure ABORT (action_id)
    starting at end of log repeat until beginning
        log_record ← previous record of log
        if log_record.action_id = action_id then
            if (log_record.type = OUTCOME)
                then signal ("Can't abort an already completed action.")
            if (log_record.type = CHANGE)
                then perform undo_action of log_record
            if (log_record.type = BEGIN)
                then break repeat
    LOG (action_id, OUTCOME, ABORTED)        // Block future undos.
    LOG (action_id, END)

FIGURE 9.21 Generic ABORT procedure for a logging system. The argument action_id identifies the action to be aborted. An atomic action calls this procedure if it decides to abort. In addition, the operating system may call this procedure if it decides to terminate the action, for example to break a deadlock or because the action is running too long. The LOG procedure must itself be atomic.
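The backward undo scan of Figure 9.21 can also be expressed in executable form. The Python sketch below is purely illustrative and not the book's code: the log is assumed to be a list of dictionary records, and each CHANGE record is assumed to carry its undo action as a blind write, a (variable, old value) pair.

```python
def abort(log, cell, action_id):
    """Undo this action's installs by a backward log scan (cf. Figure 9.21)."""
    for record in reversed(log):
        if record.get("action_id") != action_id:
            continue
        if record["type"] == "OUTCOME":
            raise RuntimeError("Can't abort an already completed action.")
        if record["type"] == "CHANGE":
            variable, old_value = record["undo"]   # the undo action is a blind write
            cell[variable] = old_value
        if record["type"] == "BEGIN":
            break                                  # no earlier records for this action
    log.append({"type": "OUTCOME", "action_id": action_id, "status": "ABORTED"})
    log.append({"type": "END", "action_id": action_id})

# A TRANSFER that installed two changes and then decides to abort:
cell = {"A": 90, "B": 60}                          # installs already in cell storage
log = [
    {"type": "BEGIN", "action_id": 9979},
    {"type": "CHANGE", "action_id": 9979, "undo": ("A", 100), "redo": ("A", 90)},
    {"type": "CHANGE", "action_id": 9979, "undo": ("B", 50), "redo": ("B", 60)},
]
abort(log, cell, 9979)
print(cell)                                        # → {'A': 100, 'B': 50}
```

Note that the scan stops at the action's BEGIN record, just as in the pseudocode, so the cost of an abort is proportional to the length of the aborting action's own history, not of the whole log.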




procedure RECOVER ()        // Recovery procedure for a volatile, in-memory database.
    winners ← NULL
    starting at end of log repeat until beginning
        log_record ← previous record of log
        if (log_record.type = OUTCOME)
            then winners ← winners + log_record        // Set addition.
    starting at beginning of log repeat until end
        log_record ← next record of log
        if (log_record.type = CHANGE) and
               (outcome_record ← find (log_record.action_id) in winners) and
               (outcome_record.status = COMMITTED)
            then perform log_record.redo_action

FIGURE 9.22 An idempotent redo-only recovery procedure for an in-memory database. Because RECOVER writes only to volatile storage, if a crash occurs while it is running it is safe to run it again.

storage version of the database and any all-or-nothing actions that were in progress at the time of the crash. On the other hand, the log, since it resides on non-volatile journal storage, is unaffected by the crash and should still be intact.

The simplest recovery procedure performs two passes through the log. On the first pass, it scans the log backward from the last record, so the first evidence it will encounter of each all-or-nothing action is the last record that the all-or-nothing action logged. A backward log scan is sometimes called a LIFO (for last-in, first-out) log review. As the recovery procedure scans backward, it collects in a set the identity and completion status of every all-or-nothing action that logged an OUTCOME record before the crash. These actions, whether committed or aborted, are known as winners.

When the backward scan is complete the set of winners is also complete, and the recovery procedure begins a forward scan of the log. The reason the forward scan is needed is that restarting after the crash completely reset the cell storage. During the forward scan the recovery procedure performs, in the order found in the log, all of the REDO actions of every winner whose OUTCOME record says that it COMMITTED. Those REDOs reinstall all committed values in cell storage, so at the end of this scan, the recovery procedure has restored cell storage to a desirable state. This state is as if every all-or-nothing action that committed before the crash had run to completion, while every all-or-nothing action that aborted or that was still pending at crash time had never existed. The database system can now open for regular business. Figure 9.22 illustrates.

This recovery procedure emphasizes the point that a log can be viewed as an authoritative version of the entire database, sufficient to completely reconstruct the reference copy in cell storage.
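The two-pass procedure of Figure 9.22 can be sketched concretely. The following Python fragment is a hedged illustration, not the book's implementation; the log record layout is an assumption made for the example, and redo actions are represented as blind writes of (variable, new value).

```python
def recover(log):
    """Idempotent redo-only recovery for an in-memory database (cf. Figure 9.22)."""
    # Backward (LIFO) scan: collect the winners, the actions that logged an OUTCOME.
    winners = {}
    for record in reversed(log):
        if record["type"] == "OUTCOME":
            winners.setdefault(record["action_id"], record["status"])
    # Forward scan: reinstall, in log order, every change made by a committed winner.
    cell = {}                                  # volatile cell storage, reset by the crash
    for record in log:
        if (record["type"] == "CHANGE"
                and winners.get(record["action_id"]) == "COMMITTED"):
            variable, new_value = record["redo"]    # the redo action is a blind write
            cell[variable] = new_value
    return cell

log = [
    {"type": "BEGIN", "action_id": 1},
    {"type": "CHANGE", "action_id": 1, "redo": ("A", 10)},
    {"type": "OUTCOME", "action_id": 1, "status": "COMMITTED"},
    {"type": "BEGIN", "action_id": 2},
    {"type": "CHANGE", "action_id": 2, "redo": ("A", 25)},   # pending at crash time
]
print(recover(log))                            # → {'A': 10}; action 2 never existed
```

Because the procedure writes only to a freshly created volatile dictionary, running it twice over the same log produces the same result, which is exactly the idempotence property discussed next.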
There exist cases for which this recovery procedure may be overkill, when the durability requirement of the data is minimal. For example, the all-or-nothing action may




have been to make a group of changes to soft state in volatile storage. If the soft state is completely lost in a crash, there would be no need to redo installs because the definition of soft state is that the application is prepared to construct new soft state following a crash. Put another way, given the options of “all” or “nothing,” when the data is all soft state “nothing” is always an appropriate outcome after a crash.

A critical design property of the recovery procedure is that, if there should be another system crash during recovery, it must still be possible to recover. Moreover, it must be possible for any number of crash-restart cycles to occur without compromising the correctness of the ultimate result. The method is to design the recovery procedure to be idempotent. That is, design it so that if it is interrupted and restarted from the beginning it will produce exactly the same result as if it had run to completion to begin with. With the in-memory database configuration, this goal is an easy one: just make sure that the recovery procedure modifies only volatile storage. Then, if a crash occurs during recovery, the loss of volatile storage automatically restores the state of the system to the way it was when the recovery started, and it is safe to run it again from the beginning. If the recovery procedure ever finishes, the state of the cell storage copy of the database will be correct, no matter how many interruptions and restarts intervened.

The ABORT procedure similarly needs to be idempotent because if an all-or-nothing action decides to abort and, while running ABORT, some timer expires, the system may decide to terminate and call ABORT for that same all-or-nothing action. The version of ABORT in Figure 9.21 will satisfy this requirement if the individual undo actions are themselves idempotent.

9.3.4 Other Logging Configurations: Non-Volatile Cell Storage

Placing cell storage in volatile memory is a sweeping simplification that works well for small and medium-sized databases, but some databases are too large for that to be practical, so the designer finds it necessary to place cell storage on some cheaper, non-volatile storage medium such as magnetic disk, as in the second configuration of Figure 9.18. But with a non-volatile storage medium, installs survive system crashes, so the simple recovery procedure used with the in-memory database would have two shortcomings:

1. If, at the time of the crash, there were some pending all-or-nothing actions that had installed changes, those changes will survive the system crash. The recovery procedure must reverse the effects of those changes, just as if those actions had aborted.

2. That recovery procedure reinstalls the entire database, even though in this case much of it is probably intact in non-volatile storage. If the database is large enough that it requires non-volatile storage to contain it, the cost of unnecessarily reinstalling it in its entirety at every recovery is likely to be unacceptable.

In addition, reads and writes to non-volatile cell storage are likely to be slow, so it is nearly always the case that the designer installs a cache in volatile memory, along with a




multilevel memory manager, thus moving to the third configuration of Figure 9.18. But that addition introduces yet another shortcoming:

3. In a multilevel memory system, the order in which data is written from volatile levels to non-volatile levels is generally under control of a multilevel memory manager, which may, for example, be running a least-recently-used algorithm. As a result, at the instant of the crash some things that were thought to have been installed may not yet have migrated to the non-volatile memory.

To postpone consideration of this shortcoming, let us for the moment assume that the multilevel memory manager implements a write-through cache. (Section 9.3.6, below, will return to the case where the cache is not write-through.) With a write-through cache, we can be certain that everything that the application program has installed has been written to non-volatile storage. This assumption temporarily drops the third shortcoming out of our list of concerns, and the situation is the same as if we were using the “Ordinary Database” configuration of Figure 9.18 with no cache. But we still have to do something about the first two shortcomings, and we also must make sure that the modified recovery procedure is still idempotent.

To address the first shortcoming, that the database may contain installs from actions that should be undone, we need to modify the recovery procedure of Figure 9.22. As the recovery procedure performs its initial backward scan, rather than looking for winners, it instead collects in a set the identity of those all-or-nothing actions that were still in progress at the time of the crash. The actions in this set are known as losers, and they can include both actions that committed and actions that did not. Losers are easy to identify because the first log record that contains their identity that is encountered in a backward scan will be something other than an END record.
To identify the losers, the pseudocode keeps track of which actions logged an END record in an auxiliary list named completeds. When RECOVER comes across a log record belonging to an action that is not in completeds, it adds that action to the set named losers. In addition, as it scans backwards, whenever the recovery procedure encounters a CHANGE record belonging to a loser, it performs the UNDO action listed in the record. In the course of the LIFO log review, all of the installs performed by losers will thus be rolled back and the state of the cell storage will be as if the all-or-nothing actions of losers had never started.

Next, RECOVER performs the forward scan of the log, performing the redo actions of the all-or-nothing actions that committed, as shown in Figure 9.23. Finally, the recovery procedure logs an END record for every all-or-nothing action in the list of losers. This END record transforms the loser into a completed action, thus ensuring that future recoveries will ignore it and not perform its undos again. For future recoveries to ignore aborted losers is not just a performance enhancement, it is essential, to avoid incorrectly undoing updates to those same variables made by future all-or-nothing actions.

As before, the recovery procedure must be idempotent, so that if a crash occurs during recovery the system can just run the recovery procedure again. In addition to the technique used earlier of placing the temporary variables of the recovery procedure in volatile storage, each individual undo action must also be idempotent. For this reason, both redo




procedure RECOVER ()        // Recovery procedure for non-volatile cell memory.
    completeds ← NULL
    losers ← NULL
    starting at end of log repeat until beginning
        log_record ← previous record of log
        if (log_record.type = END)
            then completeds ← completeds + log_record        // Set addition.
        if (log_record.action_id is not in completeds) then
            losers ← losers + log_record        // Add if not already in set.
            if (log_record.type = CHANGE) then
                perform log_record.undo_action

    starting at beginning of log repeat until end
        log_record ← next record of log
        if (log_record.type = CHANGE) and
               (log_record.action_id.status = COMMITTED) then
            perform log_record.redo_action

    for each log_record in losers do
        log (log_record.action_id, END)        // Show action completed.

FIGURE 9.23 An idempotent undo/redo recovery procedure for a system that performs installs to non-volatile cell memory. In this recovery procedure, losers are all-or-nothing actions that were in progress at the time of the crash.
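An executable sketch of this undo/redo recovery may make the two scans easier to follow. As before, the Python below is an illustration under assumed record layouts, not the book's code; undo and redo actions are represented as blind writes.

```python
def recover(log, cell):
    """Idempotent undo/redo recovery for non-volatile cell storage (cf. Figure 9.23)."""
    completeds, losers = set(), set()
    # Backward scan: undo every install belonging to an action with no END record.
    for record in reversed(log):
        if record["type"] == "END":
            completeds.add(record["action_id"])
        elif record["action_id"] not in completeds:
            losers.add(record["action_id"])
            if record["type"] == "CHANGE":
                variable, old_value = record["undo"]
                cell[variable] = old_value
    # Forward scan: redo every install belonging to a committed action.
    committed = {r["action_id"] for r in log
                 if r["type"] == "OUTCOME" and r["status"] == "COMMITTED"}
    for record in log:
        if record["type"] == "CHANGE" and record["action_id"] in committed:
            variable, new_value = record["redo"]
            cell[variable] = new_value
    # Log an END record for each loser so that future recoveries ignore it.
    for action_id in losers:
        log.append({"type": "END", "action_id": action_id})

cell = {"A": 10, "B": 99}           # B's install by pending action 2 survived the crash
log = [
    {"type": "BEGIN", "action_id": 1},
    {"type": "CHANGE", "action_id": 1, "undo": ("A", 0), "redo": ("A", 10)},
    {"type": "OUTCOME", "action_id": 1, "status": "COMMITTED"},
    {"type": "END", "action_id": 1},
    {"type": "BEGIN", "action_id": 2},
    {"type": "CHANGE", "action_id": 2, "undo": ("B", 5), "redo": ("B", 99)},
]
recover(log, cell)
print(cell)                          # → {'A': 10, 'B': 5}
```

Action 2, a loser, has its surviving install rolled back and receives an END record; action 1's committed change is reinstalled even though it was already in place, which is exactly the inefficiency the text addresses next.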

and undo actions are usually expressed as blind writes. A blind write is a simple overwriting of a data value without reference to its previous value. Because a blind write is inherently idempotent, no matter how many times one repeats it, the result is always the same. Thus, if a crash occurs part way through the logging of END records of losers, immediately rerunning the recovery procedure will still leave the database correct. Any losers that now have END records will be treated as completed on the rerun, but that is OK because the previous attempt of the recovery procedure has already undone their installs.

As for the second shortcoming, that the recovery procedure unnecessarily redoes every install, even installs not belonging to losers, we can significantly simplify (and speed up) recovery by analyzing why we have to redo any installs at all. The reason is that, although the WAL protocol requires logging of changes to occur before install, there is no necessary ordering between commit and install. Until a committed action logs its END record, there is no assurance that any particular install of that action has actually happened yet. On the other hand, any committed action that has logged an END record has completed its installs. The conclusion is that the recovery procedure does not need to




procedure RECOVER ()        // Recovery procedure for rollback recovery.
    completeds ← NULL
    losers ← NULL
    starting at end of log repeat until beginning        // Perform undo scan.
        log_record ← previous record of log
        if (log_record.type = OUTCOME)
            then completeds ← completeds + log_record        // Set addition.
        if (log_record.action_id is not in completeds) then
            losers ← losers + log_record        // New loser.
            if (log_record.type = CHANGE) then
                perform log_record.undo_action

    for each log_record in losers do
        log (log_record.action_id, OUTCOME, ABORTED)        // Block future undos.

FIGURE 9.24 An idempotent undo-only recovery procedure for rollback logging.

redo installs for any committed action that has logged its END record. A useful exercise is to modify the procedure of Figure 9.23 to take advantage of that observation.

It would be even better if the recovery procedure never had to redo any installs. We can arrange for that by placing another requirement on the application: it must perform all of its installs before it logs its OUTCOME record. That requirement, together with the write-through cache, ensures that the installs of every completed all-or-nothing action are safely in non-volatile cell storage and there is thus never a need to perform any redo actions. (It also means that there is no need to log an END record.) The result is that the recovery procedure needs only to undo the installs of losers, and it can skip the entire forward scan, leading to the simpler recovery procedure of Figure 9.24. This scheme, because it requires only undos, is sometimes called undo logging or rollback recovery. A property of rollback recovery is that for completed actions, cell storage is just as authoritative as the log. As a result, one can garbage collect the log, discarding the log records of completed actions. The now much smaller log may then be able to fit in a faster storage medium for which the durability requirement is only that it outlast pending actions.

There is an alternative, symmetric constraint used by some logging systems. Rather than requiring that all installs be done before logging the OUTCOME record, one can instead require that all installs be done after recording the OUTCOME record. With this constraint, the set of CHANGE records in the log that belong to that all-or-nothing action become a description of its intentions. If there is a crash before logging an OUTCOME record, we know that no installs have happened, so the recovery never needs to perform any undos. On the other hand, it may have to perform installs for all-or-nothing actions that committed.
This scheme is called redo logging or roll-forward recovery. Furthermore, because we are uncertain about which installs actually have taken place, the recovery procedure must




perform all logged installs for all-or-nothing actions that did not log an END record. Any all-or-nothing action that logged an END record must have completed all of its installs, so there is no need for the recovery procedure to perform them. The recovery procedure thus reduces to doing installs just for all-or-nothing actions that were interrupted between the logging of their OUTCOME and END records. Recovery with redo logging can thus be quite swift, though it does require both a backward and forward scan of the entire log.

We can summarize the procedures for atomicity logging as follows:

• Log to journal storage before installing in cell storage (WAL protocol).

• If all-or-nothing actions perform all installs to non-volatile storage before logging their OUTCOME record, then recovery needs only to undo the installs of incomplete uncommitted actions (rollback/undo recovery).

• If all-or-nothing actions perform no installs to non-volatile storage before logging their OUTCOME record, then recovery needs only to redo the installs of incomplete committed actions (roll-forward/redo recovery).

• If all-or-nothing actions are not disciplined about when they do installs to non-volatile storage, then recovery needs to both redo the installs of incomplete committed actions and undo the installs of incomplete uncommitted ones.

In addition to reading and updating memory, an all-or-nothing action may also need to send messages, for example, to report its success to the outside world. The action of sending a message is just like any other component action of the all-or-nothing action. To provide all-or-nothing atomicity, message sending can be handled in a way analogous to memory update. That is, log a CHANGE record with a redo action that sends the message. If a crash occurs after the all-or-nothing action commits, the recovery procedure will perform this redo action along with other redo actions that perform installs.
In principle, one could also log an undo_action that sends a compensating message (“Please ignore my previous communication!”). However, an all-or-nothing action will usually be careful not to actually send any messages until after the action commits, so roll-forward recovery applies. For this reason, a designer would not normally specify an undo action for a message or for any other action that has outside-world visibility such as printing a receipt, opening a cash drawer, drilling a hole, or firing a missile.

Incidentally, although much of the professional literature about database atomicity and recovery uses the terms “winner” and “loser” to describe the recovery procedure, different recovery systems use subtly different definitions for the two sets, depending on the exact logging scheme, so it is a good idea to review those definitions carefully.
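All of these recovery schemes lean on undo and redo actions being blind writes, and therefore idempotent. A small, purely illustrative Python experiment shows why an incremental undo, one that records a delta rather than the old value, would break crash-and-rerun recovery:

```python
cell = {"A": 7}                       # cell storage; the logged old value of A is 5

def blind_undo():
    cell["A"] = 5                     # blind write: installs the old value itself

def incremental_undo():
    cell["A"] -= 2                    # applies a delta instead: NOT idempotent

blind_undo()
blind_undo()                          # a crash-and-rerun repeats the undo harmlessly
print(cell["A"])                      # → 5

cell["A"] = 7
incremental_undo()
incremental_undo()                    # the rerun subtracts twice
print(cell["A"])                      # → 3, not the intended 5
```

This is why a CHANGE record stores old and new values outright rather than the operation that produced them: repeating a blind write any number of times leaves the same result.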

9.3.5 Checkpoints

Constraining the order of installs to be all before or all after the logging of the OUTCOME record is not the only thing we could do to speed up recovery. Another technique that can shorten the log scan is to occasionally write some additional information, known as a checkpoint, to non-volatile storage. Although the principle is always the same, the exact




information that is placed in a checkpoint varies from one system to another. A checkpoint can include information written either to cell storage or to the log (where it is known as a checkpoint record) or both.

Suppose, for example, that the logging system maintains in volatile memory a list of identifiers of all-or-nothing actions that have started but have not yet recorded an END record, together with their pending/committed/aborted status, keeping it up to date by observing logging calls. The logging system then occasionally logs this list as a CHECKPOINT record. When a crash occurs sometime later, the recovery procedure begins a LIFO log scan as usual, collecting the sets of completed actions and losers. When it comes to a CHECKPOINT record it can immediately fill out the set of losers by adding those all-or-nothing actions that were listed in the checkpoint that did not later log an END record. This list may include some all-or-nothing actions listed in the CHECKPOINT record as COMMITTED, but that did not log an END record by the time of the crash. Their installs still need to be performed, so they need to be added to the set of losers. The LIFO scan continues, but only until it has found the BEGIN record of every loser.

With the addition of CHECKPOINT records, the recovery procedure becomes more complex, but is potentially shorter in time and effort:

1. Do a LIFO scan of the log back to the last CHECKPOINT record, collecting identifiers of losers and undoing all actions they logged.

2. Complete the list of losers from information in the checkpoint.

3. Continue the LIFO scan, undoing the actions of losers, until every BEGIN record belonging to every loser has been found.

4. Perform a forward scan from that point to the end of the log, performing the redo actions of any all-or-nothing actions in the list of losers that logged an OUTCOME record with status COMMITTED.
In systems in which long-running all-or-nothing actions are uncommon, step 3 will typically be quite brief or even empty, greatly shortening recovery. A good exercise is to modify the recovery program of Figure 9.23 to accommodate checkpoints.

Checkpoints are also used with in-memory databases, to provide durability without the need to reprocess the entire log after every system crash. A useful checkpoint procedure for an in-memory database is to make a snapshot of the complete database, writing it to one of two alternating (for all-or-nothing atomicity) dedicated non-volatile storage regions, and then logging a CHECKPOINT record that contains the address of the latest snapshot. Recovery then involves scanning the log back to the most recent CHECKPOINT record, collecting a list of committed all-or-nothing actions, restoring the snapshot described there, and then performing redo actions of those committed actions from the CHECKPOINT record to the end of the log. The main challenge in this scenario is dealing with update activity that is concurrent with the writing of the snapshot. That challenge can be met either by preventing all updates for the duration of the snapshot or by applying more complex before-or-after atomicity techniques such as those described in later sections of this chapter.
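The snapshot-plus-replay scheme for an in-memory database can be sketched as follows. This Python fragment is an illustration only; the record and snapshot layouts are assumptions, and it further assumes the log contains at least one CHECKPOINT record.

```python
def recover_from_checkpoint(snapshots, log):
    """Snapshot-based recovery for an in-memory database: restore the snapshot
    named by the most recent CHECKPOINT record, then redo the committed actions
    logged after it (a sketch of the scheme described above)."""
    i = len(log) - 1
    while log[i]["type"] != "CHECKPOINT":          # backward scan to the most
        i -= 1                                     # recent CHECKPOINT record
    cell = dict(snapshots[log[i]["snapshot"]])     # restore that snapshot
    committed = {r["action_id"] for r in log[i:]
                 if r["type"] == "OUTCOME" and r["status"] == "COMMITTED"}
    for record in log[i:]:                         # redo only the log's tail
        if record["type"] == "CHANGE" and record["action_id"] in committed:
            variable, new_value = record["redo"]
            cell[variable] = new_value
    return cell

snapshots = {"region1": {"A": 10, "B": 0}}         # snapshot written at checkpoint time
log = [
    {"type": "CHECKPOINT", "snapshot": "region1"},
    {"type": "BEGIN", "action_id": 7},
    {"type": "CHANGE", "action_id": 7, "redo": ("A", 20)},
    {"type": "OUTCOME", "action_id": 7, "status": "COMMITTED"},
    {"type": "BEGIN", "action_id": 8},             # still pending at the crash
    {"type": "CHANGE", "action_id": 8, "redo": ("B", 99)},
]
print(recover_from_checkpoint(snapshots, log))     # → {'A': 20, 'B': 0}
```

Only the tail of the log after the checkpoint is replayed, which is the entire point: recovery time is bounded by the checkpoint interval rather than by the age of the database.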




9.3.6 What if the Cache is not Write-Through? (Advanced Topic)

Between the log and the write-through cache, the logging configurations just described require, for every data update, two synchronous writes to non-volatile storage, with attendant delays waiting for the writes to complete. Since the original reason for introducing a log was to increase performance, these two synchronous write delays usually become the system performance bottleneck. Designers who are interested in maximizing performance would prefer to use a cache that is not write-through, so that writes can be deferred until a convenient time when they can be done in batches. Unfortunately, the application then loses control of the order in which things are actually written to non-volatile storage. Loss of control of order has a significant impact on our all-or-nothing atomicity algorithms, since they require, for correctness, constraints on the order of writes and certainty about which writes have been done.

The first concern is for the log itself because the write-ahead log protocol requires that appending a CHANGE record to the log precede the corresponding install in cell storage. One simple way to enforce the WAL protocol is to make just log writes write-through, but allow cell storage writes to occur whenever the cache manager finds it convenient. However, this relaxation means that if the system crashes there is no assurance that any particular install has actually migrated to non-volatile storage. The recovery procedure, assuming the worst, cannot take advantage of checkpoints and must again perform installs starting from the beginning of the log. To avoid that possibility, the usual design response is to flush the cache as part of logging each checkpoint record. Unfortunately, flushing the cache and logging the checkpoint must be done as a before-or-after action to avoid getting tangled with concurrent updates, which creates another design challenge.
This challenge is surmountable, but the complexity is increasing. Some systems pursue performance even farther. A popular technique is to write the log to a volatile buffer, and force that entire buffer to non-volatile storage only when an all-or-nothing action commits. This strategy allows batching several CHANGE records with the next OUTCOME record in a single synchronous write. Although this step would appear to violate the write-ahead log protocol, that protocol can be restored by making the cache used for cell storage a bit more elaborate; its management algorithm must avoid writing back any install for which the corresponding log record is still in the volatile buffer. The trick is to number each log record in sequence, and tag each record in the cell storage cache with the sequence number of its log record. Whenever the system forces the log, it tells the cache manager the sequence number of the last log record that it wrote, and the cache manager is careful never to write back any cache record that is tagged with a higher log sequence number.

We have in this section seen some good examples of the law of diminishing returns at work: schemes that improve performance sometimes require significantly increased complexity. Before undertaking any such scheme, it is essential to evaluate carefully how much extra performance one stands to gain.
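The log-sequence-number discipline just described can be made concrete with a small sketch. The class below is hypothetical (the names and structure are not from the book); it shows only the essential rule: never write back a cache entry whose log record has not yet been forced.

```python
class Cache:
    """A cell-storage cache that cooperates with a buffered log: it refuses to
    write back any entry whose log record is still in the volatile log buffer."""
    def __init__(self):
        self.entries = {}            # variable → (value, log sequence number)
        self.flushed_lsn = 0         # highest log record forced to non-volatile storage
        self.disk = {}               # simulated non-volatile cell storage

    def install(self, variable, value, lsn):
        self.entries[variable] = (value, lsn)     # tag the entry with its log record

    def force_log(self, lsn):
        self.flushed_lsn = lsn       # the log buffer up through lsn is now durable

    def write_back(self, variable):
        value, lsn = self.entries[variable]
        if lsn > self.flushed_lsn:
            return False             # writing back now would violate WAL; wait
        self.disk[variable] = value
        return True

cache = Cache()
cache.install("A", 10, lsn=3)
print(cache.write_back("A"))         # → False: log record 3 not yet forced
cache.force_log(3)
print(cache.write_back("A"))         # → True: WAL ordering is now satisfied
```

The cache manager may choose write-back moments freely, subject only to this one comparison, which is what lets the log buffer batch many records into one synchronous write.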




9.4 Before-or-After Atomicity I: Concepts

The mechanisms developed in the previous sections of this chapter provide atomicity in the face of failure, so that other atomic actions that take place after the failure and subsequent recovery find that an interrupted atomic action apparently either executed all of its steps or none of them. This and the next section investigate how to also provide atomicity of concurrent actions, known as before-or-after atomicity. In this development we will provide both all-or-nothing atomicity and before-or-after atomicity, so we will now be able to call the resulting atomic actions transactions.

Concurrency atomicity requires additional mechanism because when an atomic action installs data in cell storage, that data is immediately visible to all concurrent actions. Even though the version history mechanism can hide pending changes from concurrent atomic actions, they can read other variables that the first atomic action plans to change. Thus, the composite nature of a multiple-step atomic action may still be discovered by a concurrent atomic action that happens to look at the value of a variable in the midst of execution of the first atomic action. Thus, making a composite action atomic with respect to concurrent threads—that is, making it a before-or-after action—requires further effort.

Recall that Section 9.1.5 defined the operation of concurrent actions to be correct if every result is guaranteed to be one that could have been obtained by some purely serial application of those same actions. So we are looking for techniques that guarantee to produce the same result as if concurrent actions had been applied serially, yet maximize the performance that can be achieved by allowing concurrency. In this Section 9.4 we explore three successively better before-or-after atomicity schemes, where “better” means that the scheme allows more concurrency.
To illustrate the concepts we return to version histories, which allow a straightforward and compelling correctness argument for each scheme. Because version histories are rarely used in practice, in the following Section 9.5 we examine a somewhat different approach, locks, which are widely used because they can provide higher performance, but for which correctness arguments are more difficult.

9.4.1 Achieving Before-or-After Atomicity: Simple Serialization

A version history assigns a unique identifier to each atomic action so that it can link tentative versions of variables to the action's outcome record. Suppose that we require that the unique identifiers be consecutive integers, which we interpret as serial numbers, and we modify the procedure BEGIN_TRANSACTION by adding enforcement of the following simple serialization rule: each newly created transaction n must, before reading or writing any data, wait until the preceding transaction n – 1 has either committed or aborted. (To ensure that there is always a transaction n – 1, assume that the system was initialized by creating a transaction number zero with an OUTCOME record in the committed state.) Figure 9.25 shows this version of BEGIN_TRANSACTION. The scheme forces all transactions to execute in the serial order that threads happen to invoke BEGIN_TRANSACTION. Since that



procedure BEGIN_TRANSACTION ()
    id ← NEW_OUTCOME_RECORD (PENDING)        // Create, initialize, assign id.
    previous_id ← id – 1
    wait until previous_id.outcome_record.state ≠ PENDING
    return id

FIGURE 9.25 BEGIN_TRANSACTION with the simple serialization discipline to achieve before-or-after atomicity. In order that there be an id – 1 for every value of id, startup of the system must include creating a dummy transaction with id = 0 and id.outcome_record.state set to COMMITTED. Pseudocode for the procedure NEW_OUTCOME_RECORD appears in Figure 9.30.

order is a possible serial order of the various transactions, by definition simple serialization will produce transactions that are serialized and thus are correct before-or-after actions. Simple serialization trivially provides before-or-after atomicity, and the transaction is still all-or-nothing, so the transaction is now atomic both in the case of failure and in the presence of concurrency.

Simple serialization provides before-or-after atomicity by being too conservative: it prevents all concurrency among transactions, even if they would not interfere with one another. Nevertheless, this approach actually has some practical value—in some applications it may be just the right thing to do, on the basis of simplicity. Concurrent threads can do much of their work in parallel because simple serialization comes into play only during those times that threads are executing transactions, which they generally would be only at the moments they are working with shared variables. If such moments are infrequent or if the actions that need before-or-after atomicity all modify the same small set of shared variables, simple serialization is likely to be just about as effective as any other scheme. In addition, by looking carefully at why it works, we can discover less conservative approaches that allow more concurrency, yet still have compelling arguments that they preserve correctness. Put another way, the remainder of study of before-or-after atomicity techniques is fundamentally nothing but invention and analysis of increasingly effective—and increasingly complex—performance improvement measures.

The version history provides a useful representation for this analysis. Figure 9.26 illustrates in a single figure the version histories of a banking system consisting of four accounts named A, B, C, and D, during the execution of six transactions, with serial numbers 1 through 6.
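The waiting discipline of Figure 9.25 can be prototyped directly with threads. The Python sketch below is an illustration, not the book's implementation: a threading.Event stands in for the change of an outcome record's state, and a lock hands out consecutive serial numbers.

```python
import threading

class Outcome:
    """A transaction outcome record whose state change can be waited on."""
    def __init__(self, state="PENDING"):
        self.state = state
        self.done = threading.Event()
        if state != "PENDING":
            self.done.set()

    def settle(self, state):                 # state is COMMITTED or ABORTED
        self.state = state
        self.done.set()

outcomes = {0: Outcome("COMMITTED")}         # dummy transaction 0, as in Figure 9.25
serial_lock = threading.Lock()

def begin_transaction():
    with serial_lock:                        # hand out consecutive serial numbers
        tid = max(outcomes) + 1
        outcomes[tid] = Outcome()
    outcomes[tid - 1].done.wait()            # wait until transaction tid-1 settles
    return tid

t1 = begin_transaction()                     # returns at once: transaction 0 committed
result = []
worker = threading.Thread(target=lambda: result.append(begin_transaction()))
worker.start()                               # transaction 2 blocks: transaction 1 pending
outcomes[t1].settle("COMMITTED")             # now transaction 2 may proceed
worker.join()
print(t1, result)                            # → 1 [2]
```

The worker thread cannot return from begin_transaction until transaction 1 commits or aborts, which is precisely the simple serialization rule.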
The first transaction initializes all the objects to contain the value 0 and the following transactions transfer various amounts back and forth between pairs of accounts. This figure provides a straightforward interpretation of why simple serialization works correctly. Consider transaction 3, which must read and write objects B and C in order to transfer funds from one to the other. The way for transaction 3 to produce results as if it ran after transaction 2 is for all of 3’s input objects to have values that include all the effects of transaction 2—if transaction 2 commits, then any objects it

Saltzer & Kaashoek Ch. 9, p. 55

June 25, 2009 8:22 am


CHAPTER 9 Atomicity: All-or-Nothing and Before-or-After

[Figure: a chart of the value of each of the four objects A, B, C, and D at the end of each transaction, together with each transaction's outcome record state (transactions 1–3 committed, 4 aborted, 5 committed, 6 pending). The six transactions are: 1: initialize all accounts to 0; 2: transfer 10 from B to A; 3: transfer 4 from C to B; 4: transfer 2 from D to A (aborts); 5: transfer 6 from B to C; 6: transfer 10 from A to B.]

FIGURE 9.26 Version history of a banking system.

changed and that 3 uses should have new values; if transaction 2 aborts, then any objects it tentatively changed and 3 uses should contain the values that they had when transaction 2 started. Since in this example transaction 3 reads B and transaction 2 creates a new version of B, it is clear that for transaction 3 to produce a correct result it must wait until transaction 2 either commits or aborts. Simple serialization requires that wait, and thus ensures correctness.

Figure 9.26 also provides some clues about how to increase concurrency. Looking at transaction 4 (the example shows that transaction 4 will ultimately abort for some reason, but suppose we are just starting transaction 4 and don’t know that yet), it is apparent that simple serialization is too strict. Transaction 4 reads values only from A and D, yet transaction 3 has no interest in either object. Thus the values of A and D will be the same whether or not transaction 3 commits, and a discipline that forces 4 to wait for 3’s completion delays 4 unnecessarily. On the other hand, transaction 4 does use an object that transaction 2 modifies, so transaction 4 must wait for transaction 2 to complete. Of course, simple serialization guarantees that, since transaction 4 can’t begin till transaction 3 completes and transaction 3 couldn’t have started until transaction 2 completed.
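The backward scan over a version history that this discussion relies on can be sketched in runnable Python, following the Figure 9.15 convention of skipping versions left by aborted or still-pending transactions. The data below models account A from the banking scenario; the function name and data layout are invented for illustration.

```python
# Sketch: read the current committed value of an object from its version
# history by scanning backward and skipping aborted or pending versions.
COMMITTED, ABORTED, PENDING = "COMMITTED", "ABORTED", "PENDING"

outcome = {1: COMMITTED, 2: COMMITTED, 4: ABORTED, 6: PENDING}

# Version history of account A: txn 1 initializes A to 0, txn 2 transfers
# 10 into A, txn 4 aborts, and txn 6 has not yet committed.
history_A = [(1, 0), (2, 10), (4, 12), (6, 0)]   # (action_id, value)

def read_current_value(history):
    for action_id, value in reversed(history):
        if outcome[action_id] == COMMITTED:
            return value
        # skip versions created by aborted or still-pending transactions
    raise ValueError("Tried to read an uninitialized variable")

print(read_current_value(history_A))   # -> 10, txn 2's committed value
```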


9.4 Before-or-After Atomicity I: Concepts


[Figure: the same system state history as Figure 9.26, redrawn so that every transaction column holds a version of every object; versions whose values are unchanged appear in dotted boxes and changed values in solid boxes. Outcome record states: transactions 1–3 committed, 4 aborted, 5 committed.]

FIGURE 9.27 System state history with unchanged values shown.

These observations suggest that there may be other, more relaxed, disciplines that can still guarantee correct results. They also suggest that any such discipline will probably involve detailed examination of exactly which objects each transaction reads and writes. Figure 9.26 represents the state history of the entire system in serialization order, but the slightly different representation of Figure 9.27 makes that state history more explicit. In Figure 9.27 it appears that each transaction has perversely created a new version of every object, with unchanged values in dotted boxes for those objects it did not actually change. This representation emphasizes that the vertical slot for, say, transaction 3 is in effect a reservation in the state history for every object in the system; transaction 3 has an opportunity to propose a new value for any object, if it so wishes.

The reason that the system state history is helpful to the discussion is that as long as we eventually end up with a state history that has the values in the boxes as shown, the actual order in real time in which individual object values are placed in those boxes is unimportant. For example, in Figure 9.27, transaction 3 could create its new version of object C before transaction 2 creates its new version of B. We don’t care when things happen, as long as the result is to fill in the history with the same set of values that would result from strictly following this serial ordering. Making the actual time sequence unimportant is exactly our goal, since that allows us to put concurrent threads to work on the various transactions. There are, of course, constraints on time ordering, but they become evident by examining the state history.

Figure 9.27 allows us to see just what time constraints must be observed in order for the system state history to record this particular sequence of transactions. In order for a transaction to generate results appropriate for its position in the sequence, it should use as its input values the latest versions of all of its inputs. If Figure 9.27 were available, transaction 4 could scan back along the histories of its inputs A and D, to the most recent solid boxes (the ones created by transactions 2 and 1, respectively) and correctly conclude that if transactions 2 and 1 have committed then transaction 4 can proceed—even if transaction 3 hasn’t gotten around to filling in values for B and C and hasn’t decided whether or not it should commit.

This observation suggests that any transaction has enough information to ensure before-or-after atomicity with respect to other transactions if it can discover the dotted-versus-solid status of those version history boxes to its left. The observation also leads to a specific before-or-after atomicity discipline that will ensure correctness. We call this discipline mark-point.

9.4.2 The Mark-Point Discipline

Concurrent threads that invoke READ_CURRENT_VALUE as implemented in Figure 9.15 cannot see a pending version of any variable. That observation is useful in designing a before-or-after atomicity discipline because it allows a transaction to reveal all of its results at once simply by changing the value of its OUTCOME record to COMMITTED. But in addition to that we need a way for later transactions that need to read a pending version to wait for it to become committed. The way to do that is to modify READ_CURRENT_VALUE to wait for, rather than skip over, pending versions created by transactions that are earlier in the sequential ordering (that is, they have a smaller caller_id), as implemented in lines 4–9 of Figure 9.28. Because, with concurrency, a transaction later in the ordering may create a new version of the same variable before this transaction reads it, READ_CURRENT_VALUE still skips over any versions created by transactions that have a larger caller_id. Also, as before, it may be convenient to have a READ_MY_VALUE procedure (not shown) that returns pending values previously written by the running transaction.

Adding the ability to wait for pending versions in READ_CURRENT_VALUE is the first step; to ensure correct before-or-after atomicity we also need to arrange that all variables that a transaction needs as inputs, but that earlier, not-yet-committed transactions plan to modify, have pending versions. To do that we call on the application programmer (for example, the programmer of the TRANSFER transaction) to do a bit of extra work: each transaction should create new, pending versions of every variable it intends to modify, and announce when it is finished doing so. Creating a pending version has the effect of marking those variables that are not ready for reading by later transactions, so we will call the point at which a transaction has created them all the mark point of the transaction. The



transaction announces that it has passed its mark point by calling a procedure named MARK_POINT_ANNOUNCE, which simply sets a flag in the outcome record for that transaction. The mark-point discipline then is that no transaction can begin reading its inputs until the preceding transaction has reached its mark point or is no longer pending.

This discipline requires that each transaction identify which data it will update. If the transaction has to modify some data objects before it can discover the identity of others that require update, it could either delay setting its mark point until it does know all of the objects it will write (which would, of course, also delay all succeeding transactions) or use the more complex discipline described in the next section.

For example, in Figure 9.27, the boxes under newly arrived transaction 7 are all dotted; transaction 7 should begin by marking the ones that it plans to make solid. For convenience in marking, we split the WRITE_NEW_VALUE procedure of Figure 9.15 into two parts, named NEW_VERSION and WRITE_VALUE, as in Figure 9.29. Marking then consists simply of a series of calls to NEW_VERSION. When finished marking, the transaction calls MARK_POINT_ANNOUNCE. It may then go about its business, reading and writing values as appropriate to its purpose. Finally, we enforce the mark-point discipline by putting a test and, depending on its outcome, a wait in BEGIN_TRANSACTION, as in Figure 9.30, so that no transaction may begin execution until the preceding transaction either reports that it has reached its mark point or is no longer PENDING. Figure 9.30 also illustrates an implementation of MARK_POINT_ANNOUNCE. No changes are needed in procedures ABORT and COMMIT as shown in Figure 9.13, so they are not repeated here.
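The mark-point bookkeeping just described can be sketched in runnable Python. All names here are invented, this is a single-threaded model rather than an implementation, and the blocking wait of BEGIN_TRANSACTION is reduced to the boolean condition it waits on.

```python
# Sketch of mark-point bookkeeping: a transaction marks the variables it
# will write, announces its mark point, and a successor may begin once
# its predecessor is marked or no longer pending.
PENDING, COMMITTED = "PENDING", "COMMITTED"

class Txn:
    def __init__(self, tid):
        self.tid, self.state, self.marked = tid, PENDING, False

versions = {}                      # variable name -> list of [tid, value]

def new_version(var, txn):
    if txn.marked:
        raise RuntimeError("Tried to create new version after mark point!")
    versions.setdefault(var, []).append([txn.tid, None])

def write_value(var, value, txn):
    for v in reversed(versions.get(var, [])):
        if v[0] == txn.tid:
            v[1] = value
            return
    raise RuntimeError("Tried to write without creating new version!")

def may_begin_after(prev):
    # The mark-point rule: begin once the predecessor has announced
    # its mark point or is no longer pending.
    return prev.marked or prev.state != PENDING

t1 = Txn(1)
assert not may_begin_after(t1)     # t1 still pending and unmarked

new_version("A", t1)               # t1 marks its intended output...
t1.marked = True                   # ...and announces its mark point
assert may_begin_after(t1)         # a successor may now begin

write_value("A", 10, t1)
t1.state = COMMITTED
assert versions["A"][-1] == [1, 10]
```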
Because no transaction can start until the previous transaction reaches its mark point, all transactions earlier in the serial ordering must also have passed their mark points, so every transaction earlier in the serial ordering has already created all of the versions that it ever will. Since READ_CURRENT_VALUE now waits for earlier, pending values to become

1  procedure READ_CURRENT_VALUE (data_id, this_transaction_id)
2      starting at end of data_id repeat until beginning
3          v ← previous version of data_id
4          last_modifier ← v.action_id
5          if last_modifier ≥ this_transaction_id then skip v  // Keep searching
6          wait until (last_modifier.outcome_record.state ≠ PENDING)
7          if (last_modifier.outcome_record.state = COMMITTED)
8              then return v.state
9              else skip v  // Resume search
10     signal (“Tried to read an uninitialized variable”)

FIGURE 9.28 READ_CURRENT_VALUE for the mark-point discipline. This form of the procedure skips all versions created by transactions later than the calling transaction, and it waits for a pending version created by an earlier transaction until that earlier transaction commits or aborts.


1  procedure NEW_VERSION (reference data_id, this_transaction_id)
2      if this_transaction_id.outcome_record.mark_state = MARKED
3          then signal (“Tried to create new version after announcing mark point!”)
4      append new version v to data_id
5      v.value ← NULL
6      v.action_id ← this_transaction_id

7  procedure WRITE_VALUE (reference data_id, new_value, this_transaction_id)
8      starting at end of data_id repeat until beginning
9          v ← previous version of data_id
10         if v.action_id = this_transaction_id
11             v.value ← new_value
12             return
13     signal (“Tried to write without creating new version!”)

FIGURE 9.29 Mark-point discipline versions of NEW_VERSION and WRITE_VALUE.

1  procedure BEGIN_TRANSACTION ()
2      id ← NEW_OUTCOME_RECORD (PENDING)
3      previous_id ← id - 1
4      wait until (previous_id.outcome_record.mark_state = MARKED)
5          or (previous_id.outcome_record.state ≠ PENDING)
6      return id

7  procedure NEW_OUTCOME_RECORD (starting_state)
8      ACQUIRE (outcome_record_lock)  // Make this a before-or-after action.
9      id ← TICKET (outcome_record_sequencer)
10     allocate id.outcome_record
11     id.outcome_record.state ← starting_state
12     id.outcome_record.mark_state ← NULL
13     RELEASE (outcome_record_lock)
14     return id

15 procedure MARK_POINT_ANNOUNCE (reference this_transaction_id)
16     this_transaction_id.outcome_record.mark_state ← MARKED

FIGURE 9.30 The procedures BEGIN_TRANSACTION, NEW_OUTCOME_RECORD, and MARK_POINT_ANNOUNCE for the mark-point discipline. BEGIN_TRANSACTION presumes that there is always a preceding transaction, so the system should be initialized by calling NEW_OUTCOME_RECORD to create an empty initial transaction in the starting_state COMMITTED and immediately calling MARK_POINT_ANNOUNCE for the empty transaction.



committed or aborted, it will always return to its client a value that represents the final outcome of all preceding transactions. All input values to a transaction thus contain the committed result of all transactions that appear earlier in the serial ordering, just as if it had followed the simple serialization discipline. The result is thus guaranteed to be exactly the same as one produced by a serial ordering, no matter in what real time order the various transactions actually write data values into their version slots. The particular serial ordering that results from this discipline is, as in the case of the simple serialization discipline, the ordering in which the transactions were assigned serial numbers by NEW_OUTCOME_RECORD.

There is one potential interaction between all-or-nothing atomicity and before-or-after atomicity. If pending versions survive system crashes, at restart the system must track down all PENDING transaction records and mark them ABORTED to ensure that future invokers of READ_CURRENT_VALUE do not wait for the completion of transactions that have forever disappeared.

The mark-point discipline provides before-or-after atomicity by bootstrapping from a more primitive before-or-after atomicity mechanism. As usual in bootstrapping, the idea is to reduce some general problem—here, that problem is to provide before-or-after atomicity for arbitrary application programs—to a special case that is amenable to a special-case solution—here, the special case is construction and initialization of a new outcome record. The procedure NEW_OUTCOME_RECORD in Figure 9.30 must itself be a before-or-after action because it may be invoked concurrently by several different threads and it must be careful to give out different serial numbers to each of them. It must also create completely initialized outcome records, with state and mark_state set to PENDING and NULL, respectively, because a concurrent thread may immediately need to look at one of those fields.
To achieve before-or-after atomicity, NEW_OUTCOME_RECORD bootstraps from the TICKET procedure of Section 5.6.3 to obtain the next sequential serial number, and it uses ACQUIRE and RELEASE to make its initialization steps a before-or-after action. Those procedures in turn bootstrap from still lower-level before-or-after atomicity mechanisms, so we have three layers of bootstrapping.

We can now reprogram the funds TRANSFER procedure of Figure 9.15 to be atomic under both failure and concurrent activity, as in Figure 9.31. The major change from the earlier version is the addition of lines 4 through 6, in which TRANSFER calls NEW_VERSION to mark the two variables that it intends to modify and then calls MARK_POINT_ANNOUNCE. The interesting observation about this program is that most of the work of making actions before-or-after is actually carried out in the called procedures. The only effort or thought required of the application programmer is to identify and mark, by creating new versions, the variables that the transaction will modify.

The delays (which under the simple serialization discipline would all be concentrated in BEGIN_TRANSACTION) are distributed under the mark-point discipline. Some delays may still occur in BEGIN_TRANSACTION, waiting for the preceding transaction to reach its mark point. But if marking is done before any other calculations, transactions are likely to reach their mark points promptly, and thus this delay should be not as great as waiting for them to commit or abort. Delays can also occur at any invocation of


1  procedure TRANSFER (reference debit_account, reference credit_account,
2                      amount)
3      my_id ← BEGIN_TRANSACTION ()
4      NEW_VERSION (debit_account, my_id)
5      NEW_VERSION (credit_account, my_id)
6      MARK_POINT_ANNOUNCE (my_id)
7      xvalue ← READ_CURRENT_VALUE (debit_account, my_id)
8      xvalue ← xvalue - amount
9      WRITE_VALUE (debit_account, xvalue, my_id)
10     yvalue ← READ_CURRENT_VALUE (credit_account, my_id)
11     yvalue ← yvalue + amount
12     WRITE_VALUE (credit_account, yvalue, my_id)
13     if xvalue > 0 then
14         COMMIT (my_id)
15     else
16         ABORT (my_id)
17         signal (“Negative transfers are not allowed.”)

FIGURE 9.31 An implementation of the funds transfer procedure that uses the mark point discipline to ensure that it is atomic both with respect to failure and with respect to concurrent activity.

READ_CURRENT_VALUE, but only if there is really something that the transaction must wait for, such as committing a pending version of a necessary input variable. Thus the overall delay for any given transaction should never be more than that imposed by the simple serialization discipline, and one might anticipate that it will often be less.

A useful property of the mark-point discipline is that it never creates deadlocks. Whenever a wait occurs, it is a wait for some transaction earlier in the serialization. That transaction may in turn be waiting for a still earlier transaction, but since no one ever waits for a transaction later in the ordering, progress is guaranteed. The reason is that at all times there must be some earliest pending transaction. The ordering property guarantees that this earliest pending transaction will encounter no waits for other transactions to complete, so it, at least, can make progress. When it completes, some other transaction in the ordering becomes earliest, and it now can make progress. Eventually, by this argument, every transaction will be able to make progress. This kind of reasoning about progress is a helpful element of a before-or-after atomicity discipline. In Section 9.5 of this chapter we will encounter before-or-after atomicity disciplines that are correct in the sense that they guarantee the same result as a serial ordering, but they do not guarantee progress. Such disciplines require additional mechanisms to ensure that threads do not end up deadlocked, waiting for one another forever.

Two other minor points are worth noting. First, if transactions wait to announce their mark point until they are ready to commit or abort, the mark-point discipline reduces to the simple serialization discipline. That observation confirms that one discipline is a relaxed version of the other. Second, there are at least two opportunities in the mark-point discipline to discover and report protocol errors to clients. A transaction should never call NEW_VERSION after announcing its mark point. Similarly, WRITE_VALUE can report an error if the client tries to write a value for which a new version was never created. Both of these error-reporting opportunities are implemented in the pseudocode of Figure 9.29.

9.4.3 Optimistic Atomicity: Read-Capture (Advanced Topic)

Both the simple serialization and mark-point disciplines are concurrency control methods that may be described as pessimistic. That means that they presume that interference between concurrent transactions is likely and they actively prevent any possibility of interference by imposing waits at any point where interference might occur. In doing so, they also may prevent some concurrency that would have been harmless to correctness. An alternative scheme, called optimistic concurrency control, is to presume that interference between concurrent transactions is unlikely, and allow them to proceed without waiting. Then, watch for actual interference, and if it happens take some recovery action, for example aborting an interfering transaction and making it restart. (There is a popular tongue-in-cheek characterization of the difference: pessimistic = “ask first”, optimistic = “apologize later”.) The goal of optimistic concurrency control is to increase concurrency in situations where actual interference is rare.

The system state history of Figure 9.27 suggests an opportunity to be optimistic. We could allow transactions to write values into the system state history in any order and at any time, but with the risk that some attempts to write may be met with the response “Sorry, that write would interfere with another transaction. You must abort, abandon this serialization position in the system state history, obtain a later serialization, and rerun your transaction from the beginning.”

A specific example of this approach is the read-capture discipline. Under the read-capture discipline, there is an option, but not a requirement, of advance marking. Eliminating the requirement of advance marking has the advantage that a transaction does not need to predict the identity of every object it will update—it can discover the identity of those objects as it works.
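Before looking at read-capture in detail, the general optimistic pattern can be sketched with a version-stamped cell in Python: proceed without waiting, then validate at write time and retry on interference. This is a generic illustration with invented names, not the read-capture discipline itself.

```python
# A generic "apologize later" sketch: read a value and its version stamp,
# compute, then write only if no one else has written in the meantime.
class VersionedCell:
    def __init__(self, value):
        self.value, self.stamp = value, 0

    def read(self):
        return self.value, self.stamp

    def try_write(self, new_value, expected_stamp):
        # Succeeds only if no one else wrote since our read.
        if self.stamp != expected_stamp:
            return False               # interference: caller must retry
        self.value, self.stamp = new_value, self.stamp + 1
        return True

def add_optimistically(cell, amount):
    while True:                        # retry loop: the "recovery action"
        value, stamp = cell.read()
        if cell.try_write(value + amount, stamp):
            return

balance = VersionedCell(100)
v, s = balance.read()
balance.try_write(150, s)              # a concurrent writer slips in first
assert balance.try_write(v + 10, s) is False   # our stale write is refused
add_optimistically(balance, 10)        # the retry loop then succeeds
print(balance.value)                   # -> 160
```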
Instead of advance marking, whenever a transaction calls READ_CURRENT_VALUE, that procedure makes a mark at this thread’s position in the version history of the object it read. This mark tells potential version-inserters earlier in the serial ordering but arriving later in real time that they are no longer allowed to insert—they must abort and try again, using a later serial position in the version history. Had the prospective version inserter gotten there sooner, before the reader had left its mark, the new version would have been acceptable, and the reader would have instead waited for the version inserter to commit, and taken that new value instead of the earlier one. Read-capture gives the reader the power of extending validity of a version through intervening transactions, up to the reader’s own serialization position. This view of the situation is illustrated in Figure 9.32, which has the same version history as did Figure 9.27.


[Figure: a version history for objects A, B, C, and D, annotated with high-water marks (HWM) on each version and a conflict marker where transaction 4's late attempt to create a new version of A collides with the high-water mark already set by transaction 6, forcing transaction 4 to abort. Outcome record states: committed, committed, aborted, committed, pending, pending.]

FIGURE 9.32 Version history with high-water marks and the read-capture discipline. First, transaction 6, which is running concurrently with transaction 4, reads variable A, thus extending the high-water mark of A to 6. Then, transaction 4 (which intends to transfer 2 from D to A) encounters a conflict when it tries to create a new version of A and discovers that the high-water mark of A has already been set by transaction 6, so 4 aborts and returns as transaction 7. Transaction 7 retries transaction 4, extending the high-water marks of A and D to 7.

The key property of read-capture is illustrated by an example in Figure 9.32. Transaction 4 was late in creating a new version of object A; by the time it tried to do the insertion, transaction 6 had already read the old value (+10) and thereby extended the validity of that old value to the beginning of transaction 6. Therefore, transaction 4 had to be aborted; it has been reincarnated to try again as transaction 7. In its new position as transaction 7, its first act is to read object D, extending the validity of its most recent committed value (zero) to the beginning of transaction 7. When it tries to read object A, it discovers that the most recent version is still uncommitted, so it must wait for transaction 6 to either commit or abort. Note that if transaction 6 should now decide to create a new version of object C, it can do so without any problem, but if it should try to create a new version of object D, it would run into a conflict with the old, now extended version of D, and it would have to abort.



1  procedure READ_CURRENT_VALUE (reference data_id, value, caller_id)
2      starting at end of data_id repeat until beginning
3          v ← previous version of data_id
4          if v.action_id ≥ caller_id then skip v
5          examine v.action_id.outcome_record
6          if PENDING then
7              WAIT for v.action_id to COMMIT or ABORT
8          if COMMITTED then
9              v.high_water_mark ← max(v.high_water_mark, caller_id)
10             return v.value
11         else skip v  // Continue backward search
12     signal (“Tried to read an uninitialized variable!”)

13 procedure NEW_VERSION (reference data_id, caller_id)
14     if (caller_id < data_id.high_water_mark)  // Conflict with later reader.
15         or (caller_id < (LATEST_VERSION[data_id].action_id))  // Blind write conflict.
16         then ABORT this transaction and terminate this thread
17     add new version v at end of data_id
18     v.value ← 0
19     v.action_id ← caller_id

20 procedure WRITE_VALUE (reference data_id, new_value, caller_id)
21     locate version v of data_id.history such that v.action_id = caller_id
22         (if not found, signal (“Tried to write without creating new version!”))
23     v.value ← new_value

FIGURE 9.33 Read-capture forms of READ_CURRENT_VALUE, NEW_VERSION, and WRITE_VALUE.




Read-capture is relatively easy to implement in a version history system. We start, as shown in Figure 9.33, by adding a new step (at line 9) to READ_CURRENT_VALUE. This new step records with each data object a high-water mark—the serial number of the highest-numbered transaction that has ever read a value from this object’s version history. The high-water mark serves as a warning to other transactions that have earlier serial numbers but are late in creating new versions. The warning is that someone later in the serial ordering has already read a version of this object from earlier in the ordering, so it is too late to create a new version now.

We guarantee that the warning is heeded by adding a step to NEW_VERSION (at line 14), which checks the high-water mark for the object to be written, to see if any transaction with a higher serial number has already read the current version of the object. If not, we can create a new version without concern. But if the transaction serial number in the high-water mark is greater than this transaction’s own serial number, this transaction must abort, obtain a new, higher serial number, and start over again.
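The high-water-mark checks can be sketched in runnable Python, following the Figure 9.32 scenario. Names are invented, and the wait for pending versions is omitted to keep the sketch single-threaded.

```python
# Sketch of read-capture: a read raises the object's high-water mark;
# a later attempt to create a version with a smaller serial number aborts.
class Aborted(Exception):
    pass

class Obj:
    def __init__(self, value):
        self.versions = [[0, value]]   # [creator id, value], committed
        self.high_water_mark = 0

def read_current_value(obj, caller_id):
    # (Pending-version waits omitted; the read is recorded in the
    # high-water mark before returning a value.)
    obj.high_water_mark = max(obj.high_water_mark, caller_id)
    for action_id, value in reversed(obj.versions):
        if action_id < caller_id:
            return value
    raise ValueError("Tried to read an uninitialized variable")

def new_version(obj, caller_id):
    if caller_id < obj.high_water_mark:    # a later reader got here first
        raise Aborted("conflict with later reader")
    if caller_id < obj.versions[-1][0]:    # blind write out of order
        raise Aborted("blind write conflict")
    obj.versions.append([caller_id, None])

A = Obj(10)
assert read_current_value(A, 6) == 10   # txn 6 reads A; HWM(A) becomes 6
try:
    new_version(A, 4)                   # txn 4 arrives late...
    conflicted = False
except Aborted:
    conflicted = True                   # ...and must abort, retrying as txn 7
assert conflicted
new_version(A, 7)                       # the retry, as txn 7, succeeds
```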


We have removed all constraints on the real-time sequence of the constituent steps of the concurrent transactions, so there is a possibility that a high-numbered transaction will create a new version of some object, and then later a low-numbered transaction will try to create a new version of the same object. Since our NEW_VERSION procedure simply tacks new versions on the end of the object history, we could end up with a history in the wrong order. The simplest way to avoid that mistake is to put an additional test in NEW_VERSION (at line 15), to ensure that every new version has a client serial number that is larger than the serial number of the next previous version. If not, NEW_VERSION aborts the transaction, just as if a read-capture conflict had occurred. (This test aborts only those transactions that perform conflicting blind writes, which are uncommon. If either of the conflicting transactions reads the value before writing it, the setting and testing of high_water_mark will catch and prevent the conflict.)

The first question one must raise about this kind of algorithm is whether or not it actually works: is the result always the same as some serial ordering of the concurrent transactions? Because the read-capture discipline permits greater concurrency than does mark-point, the correctness argument is a bit more involved. The induction part of the argument goes as follows:

1. The WAIT for PENDING values in READ_CURRENT_VALUE ensures that if any pending transaction k < n has modified any value that is later read by transaction n, transaction n will wait for transaction k to commit or abort.

2. The setting of the high-water mark when transaction n calls READ_CURRENT_VALUE, together with the test of the high-water mark in NEW_VERSION, ensures that if any transaction j < n tries to modify any value after transaction n has read that value, transaction j will abort and not modify that value.

3. Therefore, every value that READ_CURRENT_VALUE returns to transaction n will include the final effect of all preceding transactions 1...n – 1.

4. Therefore, every transaction n will act as if it serially follows transaction n – 1.

Optimistic coordination disciplines such as read-capture have the possibly surprising effect that something done by a transaction later in the serial ordering can cause a transaction earlier in the ordering to abort. This effect is the price of optimism; to be a good candidate for an optimistic discipline, an application probably should not have a lot of data interference.

A subtlety of read-capture is that it is necessary to implement bootstrapping before-or-after atomicity in the procedure NEW_VERSION, by adding a lock and calls to ACQUIRE and RELEASE, because NEW_VERSION can now be called by two concurrent threads that happen to add new versions to the same variable at about the same time. In addition, NEW_VERSION must be careful to keep versions of the same variable in transaction order, so that the backward search performed by READ_CURRENT_VALUE works correctly.

There is one final detail, an interaction with all-or-nothing recovery. High-water marks should be stored in volatile memory, so that following a crash (which has the effect



of aborting all pending transactions) the high-water marks automatically disappear and thus don’t cause unnecessary aborts.

9.4.4 Does Anyone Actually Use Version Histories for Before-or-After Atomicity?

The answer is yes, but the most common use is in an application not likely to be encountered by a software specialist. Legacy processor architectures typically provide a limited number of registers (the “architectural registers”) in which the programmer can hold temporary results, but modern large-scale integration technology allows space on a physical chip for many more physical registers than the architecture calls for. More registers generally allow better performance, especially in multiple-issue processor designs, which execute several sequential instructions concurrently whenever possible. To allow use of the many physical registers, a register mapping scheme known as register renaming implements a version history for the architectural registers. This version history allows instructions that would interfere with each other only because of a shortage of registers to execute concurrently.

For example, Intel Pentium processors, which are based on the x86 instruction set architecture described in Section 5.7, have only eight architectural registers. The Pentium 4 has 128 physical registers, and a register renaming scheme based on a circular reorder buffer. A reorder buffer resembles a direct hardware implementation of the procedures NEW_VERSION and WRITE_VALUE of Figure 9.29. As each instruction issues (which corresponds to BEGIN_TRANSACTION), it is assigned the next sequential slot in the reorder buffer. The slot is a map that maintains a correspondence between two numbers: the number of the architectural register that the programmer specified to hold the output value of the instruction, and the number of one of the 128 physical registers, the one that will actually hold that output value. Since machine instructions have just one output value, assigning a slot in the reorder buffer implements in a single step the effect of both NEW_OUTCOME_RECORD and NEW_VERSION.
Similarly, when the instruction commits, it places its output in that physical register, thereby implementing WRITE_VALUE and COMMIT as a single step. Figure 9.34 illustrates register renaming with a reorder buffer. In the program sequence of that example, instruction n uses architectural register five to hold an output value that instruction n + 1 will use as an input. Instruction n + 2 loads architectural register five from memory. Register renaming allows there to be two (or more) versions of register five simultaneously, one version (in physical register 42) containing a value for use by instructions n and n + 1 and the second version (in physical register 29) to be used by instruction n + 2. The performance benefit is that instruction n + 2 (and any later instructions that write into architectural register five) can proceed concurrently with instructions n and n + 1. An instruction following instruction n + 2 that requires the new value in architectural register five as an input uses a hardware implementation of READ_CURRENT_VALUE to locate the most recent preceding mapping of architectural register five in the reorder buffer. In this case that most recent mapping is to physical register 29.

Saltzer & Kaashoek Ch. 9, p. 67

June 25, 2009 8:22 am


CHAPTER 9 Atomicity: All-or-Nothing and Before-or-After

The later instruction then stalls, waiting for instruction n + 2 to write a value into physical register 29. Later instructions that reuse architectural register five for some purpose that does not require that version can proceed concurrently. Although register renaming is conceptually straightforward, the mechanisms that prevent interference when there are dependencies between instructions tend to be more intricate than either the mark-point or the read-capture discipline, so this description has been oversimplified. For more detail, the reader should consult a textbook on processor architecture, for example Computer Architecture: A Quantitative Approach, by Hennessy and Patterson [Suggestions for Further Reading 1.1.1].

The Oracle database management system offers several before-or-after atomicity methods, one of which it calls "serializable", though the label may be a bit misleading. This method uses a before-or-after atomicity scheme that the database literature calls snapshot isolation. The idea is that when a transaction begins, the system conceptually takes a snapshot of every committed value and the transaction reads all of its inputs from that snapshot. If two concurrent transactions (which might start with the same snapshot) modify the same variable, the first one to commit wins; the system aborts the other one with a "serialization error". This scheme effectively creates a limited variant of a version


[Figure 9.34 here: three entries of a reorder buffer, each mapping an architectural register number to a physical register number, alongside a physical register file with 128 registers numbered 0 through 127.]

FIGURE 9.34 Example showing how a reorder buffer maps architectural register numbers to physical register numbers. The program sequence corresponding to the three entries is:

n      R5 ← R4 × R2        // Write a result in register five.
n + 1  R4 ← R5 + R1        // Use result in register five.
n + 2  R5 ← READ (117492)  // Write content of a memory cell in register five.

Instructions n and n + 2 both write into register R5, so R5 has two versions, with mappings to physical registers 42 and 29, respectively. Instruction n + 2 can thus execute concurrently with instructions n and n + 1.
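The version-history flavor of register renaming can be sketched in a few lines of Python. This is a toy model, not the actual Pentium 4 mechanism; the names (`ReorderBuffer`, `issue`, `read_current_value`) are invented for illustration:

```python
# Toy model of register renaming with a reorder buffer: each issued
# instruction that writes an architectural register is assigned the next
# slot, which maps that architectural register to a freshly allocated
# physical register, creating a new version.

class ReorderBuffer:
    def __init__(self, num_physical):
        self.free = list(range(num_physical))  # free physical registers
        self.slots = []                        # issue order: (arch, phys)

    def issue(self, arch_reg):
        # One step plays the role of both NEW_OUTCOME_RECORD and NEW_VERSION.
        phys = self.free.pop(0)
        self.slots.append((arch_reg, phys))
        return phys

    def read_current_value(self, arch_reg):
        # Analog of READ_CURRENT_VALUE: the most recent preceding mapping
        # of arch_reg in the reorder buffer.
        for arch, phys in reversed(self.slots):
            if arch == arch_reg:
                return phys
        return None

rob = ReorderBuffer(num_physical=128)
v1 = rob.issue(5)   # instruction n writes R5
v2 = rob.issue(4)   # instruction n+1 writes R4 (its input R5 comes from v1)
v3 = rob.issue(5)   # instruction n+2 writes a second version of R5
# Two versions of R5 now coexist (v1 and v3); a later reader of R5 finds v3.
```

Real hardware must also reclaim physical registers when instructions retire; the sketch omits that bookkeeping.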




history that, in certain situations, does not always ensure that concurrent transactions are correctly coordinated.

Another specialized variant implementation of version histories, known as transactional memory, is a discipline for creating atomic actions from arbitrary instruction sequences that make multiple references to primary memory. Transactional memory was first suggested in 1993 and, with the widespread availability of multicore processors, has become the subject of quite a bit of recent research interest because it allows the application programmer to use concurrent threads without having to deal with locks. The discipline is to mark the beginning of an instruction sequence that is to be atomic with a "begin transaction" instruction, direct all ensuing STORE instructions to a hidden copy of the data that concurrent threads cannot read, and at the end of the sequence check to see that nothing read or written during the sequence was modified by some other transaction that committed first. If the check finds no such earlier modifications, the system commits the transaction by exposing the hidden copies to concurrent threads; otherwise it discards the hidden copies and the transaction aborts. Because it defers all discovery of interference to the commit point, this discipline is even more optimistic than the read-capture discipline described in Section 9.4.3 above, so it is most useful in situations where interference between concurrent threads is possible but unlikely. Transactional memory has been experimentally implemented in both hardware and software. Hardware implementations typically involve tinkering with either a cache or a reorder buffer to make it defer writing hidden copies back to primary memory until commit time, while software implementations create hidden copies of changed variables somewhere else in primary memory.
As with register renaming, this description of transactional memory is somewhat oversimplified, and the interested reader should consult the literature for fuller explanations. Other software implementations of version histories for before-or-after atomicity have been explored primarily in research environments. Designers of database systems usually use locks rather than version histories because there is more experience in achieving high performance with locks. Before-or-after atomicity by using locks systematically is the subject of the next section of this chapter.
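The hidden-copy and commit-time-validation idea behind transactional memory can be illustrated with a toy software sketch. All names here (`TMSystem`, `begin`, `commit`) are invented for illustration, and this single-threaded model ignores the concurrency machinery a real implementation needs:

```python
# Toy software transactional memory: each transaction buffers its writes
# in hidden copies and validates at commit that nothing it read or wrote
# was committed by another transaction in the meantime.

class TMSystem:
    def __init__(self):
        self.data = {}      # committed (exposed) values
        self.version = {}   # how many commits have touched each variable

    def begin(self):
        # A transaction is its snapshot info plus its hidden copies.
        return {"snap": {}, "writes": {}}

    def _note(self, tx, var):
        # Remember the commit count of var at first touch.
        if var not in tx["snap"]:
            tx["snap"][var] = self.version.get(var, 0)

    def read(self, tx, var):
        self._note(tx, var)
        return tx["writes"].get(var, self.data.get(var))

    def write(self, tx, var, value):
        self._note(tx, var)
        tx["writes"][var] = value   # hidden copy, invisible to others

    def commit(self, tx):
        # Abort if anything read or written was committed by another
        # transaction after this one first touched it.
        if any(self.version.get(v, 0) != seen
               for v, seen in tx["snap"].items()):
            return False            # abort: hidden copies are discarded
        for var, value in tx["writes"].items():
            self.data[var] = value  # expose the hidden copies
            self.version[var] = self.version.get(var, 0) + 1
        return True

tm = TMSystem()
t1, t2 = tm.begin(), tm.begin()
tm.write(t1, "x", 1)
tm.write(t2, "x", 2)
committed1 = tm.commit(t1)   # first to commit wins
committed2 = tm.commit(t2)   # the conflicting transaction aborts
```

The check happens only at commit, which is what makes this discipline more optimistic than read-capture: interference is never blocked, only detected.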

9.5 Before-or-After Atomicity II: Pragmatics

The previous section showed that a version history system that provides all-or-nothing atomicity can be extended to also provide before-or-after atomicity. When the all-or-nothing atomicity design uses a log and installs data updates in cell storage, other, concurrent actions can again immediately see those updates, so we again need a scheme to provide before-or-after atomicity. When a system uses logs for all-or-nothing atomicity, it usually adopts the mechanism introduced in Chapter 5—locks—for before-or-after atomicity. However, as Chapter 5 pointed out, programming with locks is hazardous, and the traditional programming technique of debugging until the answers seem to be correct is unlikely to catch all locking errors. We now revisit locks, this time with the goal




of using them in stylized ways that allow us to develop arguments that the locks correctly implement before-or-after atomicity.

9.5.1 Locks

To review, a lock is a flag associated with a data object and set by an action to warn other, concurrent, actions not to read or write the object. Conventionally, a locking scheme involves two procedures:

ACQUIRE (A.lock)

marks a lock variable associated with object A as having been acquired. If the object is already acquired, ACQUIRE waits until the previous acquirer releases it.

RELEASE (A.lock)

unmarks the lock variable associated with A, perhaps ending some other action's wait for that lock. For the moment, we assume that the semantics of a lock follow the single-acquire protocol of Chapter 5: if two or more actions attempt to acquire a lock at about the same time, only one will succeed; the others must find the lock already acquired. In Section 9.5.4 we will consider some alternative protocols, for example one that permits several readers of a variable as long as there is no one writing it.

The biggest problem with locks is that programming errors can create actions that do not have the intended before-or-after property. Such errors can open the door to races that, because the interfering actions are timing dependent, can make it extremely difficult to figure out what went wrong. Thus a primary goal is that coordination of concurrent transactions should be arguably correct. For locks, the way to achieve this goal is to follow three steps systematically:

• Develop a locking discipline that specifies which locks must be acquired and when.

• Establish a compelling line of reasoning that concurrent transactions that follow the discipline will have the before-or-after property.

• Interpose a lock manager, a program that enforces the discipline, between the programmer and the ACQUIRE and RELEASE procedures.

Many locking disciplines have been designed and deployed, including some that fail to correctly coordinate transactions (for an example, see exercise 9.5). We examine three disciplines that succeed. Each allows more concurrency than its predecessor, though even the best one is not capable of guaranteeing that concurrency is maximized. The first, and simplest, discipline that coordinates transactions correctly is the systemwide lock. When the system first starts operation, it creates a single lockable variable named, for example, System, in volatile memory. The discipline is that every transaction must start with




begin_transaction
    ACQUIRE (System.lock)
    …

and every transaction must end with

    …
    RELEASE (System.lock)
end_transaction

A system can even enforce this discipline by including the ACQUIRE and RELEASE steps in the code sequence generated for begin_transaction and end_transaction, independent of whether the result was COMMIT or ABORT. Any programmer who creates a new transaction then has a guarantee that it will run either before or after any other transactions.

The systemwide lock discipline allows only one transaction to execute at a time. It serializes potentially concurrent transactions in the order that they call ACQUIRE. The systemwide lock discipline is in all respects identical to the simple serialization discipline of Section 9.4. In fact, the simple serialization pseudocode

id ← NEW_OUTCOME_RECORD ()
preceding_id ← id - 1
wait until preceding_id.outcome_record.value ≠ PENDING
COMMIT (id) [or ABORT (id)]

and the systemwide lock invocation

ACQUIRE (System.lock)

are actually just two implementations of the same idea. As with simple serialization, systemwide locking restricts concurrency in cases where it doesn't need to because it locks all data touched by every transaction. For example, if systemwide locking were applied to the funds TRANSFER program of Figure 9.16, only one transfer could occur at a time, even though any individual transfer involves only two out of perhaps several million accounts, so there would be many opportunities for concurrent, non-interfering transfers. Thus there is an interest in developing less restrictive locking disciplines. The starting point is usually to employ a finer lock granularity: lock smaller objects, such as individual data records, individual pages of data records, or even fields within records. The trade-offs in gaining concurrency are, first, that when there is more than one lock, more time is spent acquiring and releasing locks and, second, that correctness arguments become more complex. One hopes that the performance gain from concurrency exceeds the cost of acquiring and releasing the multiple locks. Fortunately, there are at least two other disciplines for which correctness arguments are feasible: simple locking and two-phase locking.
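The systemwide lock discipline can be sketched with Python threads; `threading.Lock` stands in for the ACQUIRE and RELEASE procedures of this section, and the transaction bodies are stand-ins:

```python
# Sketch of the systemwide lock discipline: every transaction brackets its
# work with ACQUIRE (System.lock) / RELEASE (System.lock), so transactions
# serialize in the order their ACQUIREs succeed and their steps never
# interleave.

import threading

system_lock = threading.Lock()   # the single lockable variable "System"
log = []                         # records (transaction, step) in real order

def run_transaction(name, steps):
    system_lock.acquire()        # begin_transaction: ACQUIRE (System.lock)
    try:
        for step in range(steps):
            log.append((name, step))   # stand-in for reads and writes
    finally:
        system_lock.release()    # end_transaction: RELEASE (System.lock)

threads = [threading.Thread(target=run_transaction, args=(name, 3))
           for name in ("T1", "T2")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All three steps of whichever transaction acquired the lock first appear
# in the log before any step of the other transaction.
```

Releasing in a `finally` clause mirrors the requirement that RELEASE runs whether the transaction commits or aborts.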




9.5.2 Simple Locking

The second locking discipline, known as simple locking, is similar in spirit to, though not quite identical with, the mark-point discipline. The simple locking discipline has two rules. First, each transaction must acquire a lock for every shared data object it intends to read or write before doing any actual reading and writing. Second, it may release its locks only after the transaction installs its last update and commits or completely restores the data and aborts. Analogous to the mark point, the transaction has what is called a lock point: the first instant at which it has acquired all of its locks. The collection of locks it has acquired when it reaches its lock point is called its lock set. A lock manager can enforce simple locking by requiring that each transaction supply its intended lock set as an argument to the begin_transaction operation, which acquires all of the locks of the lock set, if necessary waiting for them to become available. The lock manager can also interpose itself on all calls to read data and to log changes, to verify that they refer to variables that are in the lock set. The lock manager also intercepts the call to commit or abort (or, if the application uses roll-forward recovery, to log an END record), at which time it automatically releases all of the locks of the lock set.

The simple locking discipline correctly coordinates concurrent transactions. We can make that claim using a line of argument analogous to the one used for correctness of the mark-point discipline. Imagine that an all-seeing outside observer maintains an ordered list to which it adds each transaction identifier as soon as the transaction reaches its lock point and removes it from the list when it begins to release its locks. Under the simple locking discipline each transaction has agreed not to read or write anything until that transaction has been added to the observer's list.
We also know that all transactions that precede this one in the list must have already passed their lock point. Since no data object can appear in the lock sets of two transactions, no data object in any transaction's lock set appears in the lock set of the transaction preceding it in the list, and by induction to any transaction earlier in the list. Thus all of this transaction's input values are the same as they will be when the preceding transaction in the list commits or aborts. The same argument applies to the transaction before the preceding one, so all inputs to any transaction are identical to the inputs that would be available if all the transactions ahead of it in the list ran serially, in the order of the list. Thus the simple locking discipline ensures that this transaction runs completely after the preceding one and completely before the next one. Concurrent transactions will produce results as if they had been serialized in the order that they reached their lock points.

As with the mark-point discipline, simple locking can miss some opportunities for concurrency. In addition, the simple locking discipline creates a problem that can be significant in some applications. Because it requires the transaction to acquire a lock on every shared object that it will either read or write (recall that the mark-point discipline requires marking only of shared objects that the transaction will write), applications that discover which objects need to be read by reading other shared data objects have no alternative but to lock every object that they might need to read. To the extent that the set of objects that an application might need to read is larger than the set for which it eventually




does read, the simple locking discipline can interfere with opportunities for concurrency. On the other hand, when the transaction is straightforward (such as the TRANSFER transaction of Figure 9.16, which needs to lock only two records, both of which are known at the outset) simple locking can be effective.
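As a sketch of what such a lock manager might look like, the following Python model acquires the declared lock set at begin_transaction, checks each access against it, and releases everything at the end. The names (`SimpleLockManager`, `check_access`) are invented, and a real manager would wait for held locks rather than raise:

```python
# Toy lock manager for the simple locking discipline: the whole lock set
# is acquired before any reading or writing, every access is checked
# against it, and all locks are released together at commit or abort.

class SimpleLockManager:
    def __init__(self):
        self.held = {}   # object name -> id of transaction holding it

    def begin(self, tid, lock_set):
        # Lock point: the transaction acquires its entire lock set here.
        if any(obj in self.held for obj in lock_set):
            raise RuntimeError("would wait: some lock is already held")
        for obj in lock_set:
            self.held[obj] = tid
        return set(lock_set)

    def check_access(self, lock_set, obj):
        # Interposed on every call to read data or to log a change.
        if obj not in lock_set:
            raise RuntimeError(repr(obj) + " is not in the declared lock set")

    def end(self, tid):
        # Commit or abort: release every lock of the lock set at once.
        for obj in [o for o, t in self.held.items() if t == tid]:
            del self.held[obj]

mgr = SimpleLockManager()
lock_set = mgr.begin("transfer-1", ["acct-A", "acct-B"])
mgr.check_access(lock_set, "acct-A")   # declared object: access allowed
mgr.end("transfer-1")                  # all locks released together
```

The `check_access` step is what turns the discipline into an enforced guarantee rather than a programmer convention.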

9.5.3 Two-Phase Locking

The third locking discipline, called two-phase locking, like the read-capture discipline, avoids the requirement that a transaction must know in advance which locks to acquire. Two-phase locking is widely used, but it is harder to argue that it is correct. The two-phase locking discipline allows a transaction to acquire locks as it proceeds, and the transaction may read or write a data object as soon as it acquires a lock on that object. The primary constraint is that the transaction may not release any locks until it passes its lock point. Further, the transaction can release a lock on an object that it only reads any time after it reaches its lock point if it will never need to read that object again, even to abort. The name of the discipline comes about because the number of locks acquired by a transaction monotonically increases up to the lock point (the first phase), after which it monotonically decreases (the second phase). Just as with simple locking, two-phase locking orders concurrent transactions so that they produce results as if they had been serialized in the order they reach their lock points. A lock manager can implement two-phase locking by intercepting all calls to read and write data; it acquires a lock (perhaps having to wait) on the first use of each shared variable. As with simple locking, it then holds the locks until it intercepts the call to commit, abort, or log the END record of the transaction, at which time it releases them all at once.

The extra flexibility of two-phase locking makes it harder to argue that it guarantees before-or-after atomicity. Informally, once a transaction has acquired a lock on a data object, the value of that object is the same as it will be when the transaction reaches its lock point, so reading that value now must yield the same result as waiting till then to read it.
Furthermore, releasing a lock on an object that it hasn't modified must be harmless if this transaction will never look at the object again, even to abort. A formal argument that two-phase locking leads to correct before-or-after atomicity can be found in most advanced texts on concurrency control and transactions. See, for example, Transaction Processing, by Gray and Reuter [Suggestions for Further Reading 1.1.5].

The two-phase locking discipline can potentially allow more concurrency than the simple locking discipline, but it still unnecessarily blocks certain serializable, and therefore correct, action orderings. For example, suppose transaction T1 reads X and writes Y, while transaction T2 just does a (blind) write to Y. Because the lock sets of T1 and T2 intersect at variable Y, the two-phase locking discipline will force transaction T2 to run either completely before or completely after T1. But the sequence

T1: READ X
T2: WRITE Y
T1: WRITE Y






in which the write of T2 occurs between the two steps of T1, yields the same result as running T2 completely before T1, so the result is always correct, even though this sequence would be prevented by two-phase locking. Disciplines that allow all possible concurrency while at the same time ensuring before-or-after atomicity are quite difficult to devise. (Theorists identify the problem as NP-complete.)

There are two interactions between locks and logs that require some thought: (1) individual transactions that abort, and (2) system recovery. Aborts are the easiest to deal with. Since we require that an aborting transaction restore its changed data objects to their original values before releasing any locks, no special account need be taken of aborted transactions. For purposes of before-or-after atomicity they look just like committed transactions that didn't change anything. The rule about not releasing any locks on modified data before the end of the transaction is essential to accomplishing an abort. If a lock on some modified object were released, and then the transaction decided to abort, it might find that some other transaction has now acquired that lock and changed the object again. Backing out an aborted change is likely to be impossible unless the locks on modified objects have been held.

The interaction between log-based recovery and locks is less obvious. The question is whether locks themselves are data objects for which changes should be logged. To analyze this question, suppose there is a system crash. At the completion of crash recovery there should be no pending transactions because any transactions that were pending at the time of the crash should have been rolled back by the recovery procedure, and recovery does not allow any new transactions to begin until it completes. Since locks exist only to coordinate pending transactions, it would clearly be an error if there were locks still set when crash recovery is complete.
That observation suggests that locks belong in volatile storage, where they will automatically disappear on a crash, rather than in nonvolatile storage, where the recovery procedure would have to hunt them down to release them. The bigger question, however, is whether or not the log-based recovery algorithm will construct a correct system state—correct in the sense that it could have arisen from some serial ordering of those transactions that committed before the crash.

Continue to assume that the locks are in volatile memory, and at the instant of a crash all record of the locks is lost. Some set of transactions—the ones that logged a BEGIN record but have not yet logged an END record—may not have been completed. But we know that the transactions that were not complete at the instant of the crash had non-overlapping lock sets at the moment that the lock values vanished. The recovery algorithm of Figure 9.23 will systematically UNDO or REDO installs for the incomplete transactions, but every such UNDO or REDO must modify a variable whose lock was in some transaction's lock set at the time of the crash. Because those lock sets must have been non-overlapping, those particular actions can safely be redone or undone without concern for before-or-after atomicity during recovery. Put another way, the locks created a particular serialization of the transactions and the log has captured that serialization. Since RECOVER performs UNDO actions in reverse order as specified in the log, and it performs REDO actions in forward order, again as specified in the log, RECOVER reconstructs exactly that same serialization. Thus even a recovery algorithm that reconstructs the




entire database from the log is guaranteed to produce the same serialization as when the transactions were originally performed. So long as no new transactions begin until recovery is complete, there is no danger of miscoordination, despite the absence of locks during recovery.
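The growing and shrinking phases of two-phase locking can be sketched in Python. This single-threaded model raises on a lock conflict where a real lock manager would make the caller wait, and all names are illustrative:

```python
# Sketch of two-phase locking: a transaction acquires each lock on first
# use (growing phase) and releases nothing until commit or abort, at which
# point it releases everything (shrinking phase, collapsed into commit).

class TwoPhaseTransaction:
    def __init__(self, tid, lock_table):
        self.tid, self.lock_table = tid, lock_table
        self.locks = set()

    def _acquire_on_first_use(self, obj):
        if obj in self.locks:
            return                   # already hold it
        holder = self.lock_table.get(obj)
        if holder is not None and holder != self.tid:
            raise RuntimeError("conflict: would wait for " + holder)
        self.lock_table[obj] = self.tid
        self.locks.add(obj)          # growing phase only: never release here

    def read(self, store, obj):
        self._acquire_on_first_use(obj)
        return store.get(obj)

    def write(self, store, obj, value):
        self._acquire_on_first_use(obj)
        store[obj] = value

    def commit(self):
        for obj in self.locks:       # shrinking phase: release all at once
            del self.lock_table[obj]
        self.locks.clear()

store, table = {"X": 10, "Y": 0}, {}
t1 = TwoPhaseTransaction("T1", table)
x = t1.read(store, "X")      # lock on X acquired here, not at begin
t1.write(store, "Y", x + 1)  # lock on Y acquired here
t1.commit()                  # both locks released together
```

Unlike the simple-locking sketch, no lock set is declared in advance; the lock set emerges as the transaction runs.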

9.5.4 Performance Optimizations

Most logging-locking systems are substantially more complex than the description so far might lead one to expect. The complications primarily arise from attempts to gain performance. In Section 9.3.6 we saw how buffering of disk I/O in a volatile memory cache, to allow reading, writing, and computation to go on concurrently, can complicate a logging system. Designers sometimes apply two performance-enhancing complexities to locking systems: physical locking and adding lock compatibility modes.

A performance-enhancing technique driven by buffering of disk I/O and physical media considerations is to choose a particular lock granularity known as physical locking. If a transaction makes a change to a six-byte object in the middle of a 1000-byte disk sector, or to a 1500-byte object that occupies parts of two disk sectors, there is a question about which "variable" should be locked: the object, or the disk sector(s)? If two concurrent threads make updates to unrelated data objects that happen to be stored in the same disk sector, then the two disk writes must be coordinated. Choosing the right locking granularity can make a big performance difference.

Locking application-defined objects without consideration of their mapping to physical disk sectors is appealing because it is understandable to the application writer. For that reason, it is usually called logical locking. In addition, if the objects are small, it apparently allows more concurrency: if another transaction is interested in a different object that is in the same disk sector, it could proceed in parallel. However, a consequence of logical locking is that logging must also be done on the same logical objects.
Different parts of the same disk sector may be modified by different transactions that are running concurrently, and if one transaction commits but the other aborts, neither the old nor the new disk sector is the correct one to restore following a crash; the log entries must record the old and new values of the individual data objects that are stored in the sector. Finally, recall that a high-performance logging system with a cache must, at commit time, force the log to disk and keep track of which objects in the cache it is safe to write to disk without violating the write-ahead log protocol. So logical locking with small objects can escalate cache record-keeping.

Backing away from the details, high-performance disk management systems typically require that the argument of a PUT call be a block whose size is commensurate with the size of a disk sector. Thus the real impact of logical locking is to create a layer between the application and the disk management system that presents a logical, rather than a physical, interface to its transaction clients; such things as data object management and garbage collection within disk sectors would go into this layer. The alternative is to tailor the logging and locking design to match the native granularity of the disk management system. Since matching the logging and locking granularity to the disk write granularity




can reduce the number of disk operations, both logging changes to and locking blocks that correspond to disk sectors rather than individual data objects is a common practice.

Another performance refinement appears in most locking systems: the specification of lock compatibility modes. The idea is that when a transaction acquires a lock, it can specify what operation (for example, READ or WRITE) it intends to perform on the locked data item. If that operation is compatible—in the sense that the result of concurrent transactions is the same as some serial ordering of those transactions—then this transaction can be allowed to acquire a lock even though some other transaction has already acquired a lock on that same data object.

The most common example involves replacing the single-acquire locking protocol with the multiple-reader, single-writer protocol. According to this protocol, one can allow any number of readers to simultaneously acquire read-mode locks for the same object. The purpose of a read-mode lock is to ensure that no other thread can change the data while the lock is held. Since concurrent readers do not present an update threat, it is safe to allow any number of them. If another transaction needs to acquire a write-mode lock for an object on which several threads already hold read-mode locks, that new transaction will have to wait for all of the readers to release their read-mode locks. There are many applications in which a majority of data accesses are for reading, and for those applications the provision of read-mode lock compatibility can reduce the amount of time spent waiting for locks by orders of magnitude. At the same time, the scheme adds complexity, both in the mechanics of locking and also in policy issues, such as what to do if, while a prospective writer is waiting for readers to release their read-mode locks, another thread calls to acquire a read-mode lock.
If there is a steady stream of arriving readers, a writer could be delayed indefinitely.

This description of performance optimizations and their complications is merely illustrative, to indicate the range of opportunities and kinds of complexity that they engender; there are many other performance-enhancement techniques, some of which can be effective, and others that are of dubious value; most have different values depending on the application. For example, some locking disciplines compromise before-or-after atomicity by allowing transactions to read data values that are not yet committed. As one might expect, the complexity of reasoning about what can or cannot go wrong in such situations escalates. If a designer intends to implement a system using performance enhancements such as buffering, lock compatibility modes, or compromised before-or-after atomicity, it would be advisable to study carefully the book by Gray and Reuter, as well as existing systems that implement similar enhancements.
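The multiple-reader, single-writer compatibility rule is small enough to state directly in code. This sketch models only the compatibility test, not the waiting machinery or the fairness policy just discussed:

```python
# Sketch of lock compatibility modes: any number of READ-mode holders may
# share a lock, but WRITE mode is exclusive. Whether new readers may
# overtake a waiting writer (the starvation policy question) is left out.

def compatible(requested_mode, current_holders):
    """May a new holder in requested_mode join current_holders?"""
    if not current_holders:
        return True   # an unheld lock is compatible with any request
    # READ is compatible only with other READs; WRITE with nothing.
    return (requested_mode == "READ"
            and all(mode == "READ" for mode in current_holders))

holders = []
for mode in ("READ", "READ"):              # two concurrent readers: allowed
    assert compatible(mode, holders)
    holders.append(mode)
writer_ok = compatible("WRITE", holders)   # a writer must wait for readers
```

A full lock manager would consult this table on every ACQUIRE and queue the request when the answer is no.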

9.5.5 Deadlock; Making Progress

Section 5.2.5 of Chapter 5 introduced the emergent problem of deadlock, the wait-for graph as a way of analyzing deadlock, and lock ordering as a way of preventing deadlock. With transactions and the ability to undo individual actions or even abort a transaction completely, we now have more tools available to deal with deadlock, so it is worth revisiting that discussion.




The possibility of deadlock is an inevitable consequence of using locks to coordinate concurrent activities. Any number of concurrent transactions can get hung up in a deadlock, either waiting for one another, or simply waiting for a lock to be released by some transaction that is already deadlocked. Deadlock leaves us a significant loose end: correctness arguments ensure us that any transactions that complete will produce results as though they were run serially, but they say nothing about whether or not any transaction will ever complete. In other words, our system may ensure correctness, in the sense that no wrong answers ever come out, but it does not ensure progress—no answers may come out at all.

As with methods for concurrency control, methods for coping with deadlock can also be described as pessimistic or optimistic. Pessimistic methods take a priori action to prevent deadlocks from happening. Optimistic methods allow concurrent threads to proceed, detect deadlocks if they happen, and then take action to fix things up. Here are some of the most popular methods:

1. Lock ordering (pessimistic). As suggested in Chapter 5, number the locks uniquely, and require that transactions acquire locks in ascending numerical order. With this plan, when a transaction encounters an already-acquired lock, it is always safe to wait for it, since the transaction that previously acquired it cannot be waiting for any locks that this transaction has already acquired—all those locks are lower in number than this one. There is thus a guarantee that somewhere, at least one transaction (the one holding the highest-numbered lock) can always make progress. When that transaction finishes, it will release all of its locks, and some other transaction will become the one that is guaranteed to be able to make progress.
A generalization of lock ordering that may eliminate some unnecessary waits is to arrange the locks in a lattice and require that they be acquired in some lattice traversal order. The trouble with lock ordering, as with simple locking, is that some applications may not be able to predict all of the locks they need before acquiring the first one.

2. Backing out (optimistic). An elegant strategy devised by Andre Bensoussan in 1966 allows a transaction to acquire locks in any order, but if it encounters an already-acquired lock with a number lower than one it has previously acquired itself, the transaction must back up (in terms of this chapter, UNDO previous actions) just far enough to release its higher-numbered locks, wait for the lower-numbered lock to become available, acquire that lock, and then REDO the backed-out actions.

3. Timer expiration (optimistic). When a new transaction begins, the lock manager sets an interrupting timer to a value somewhat greater than the time it should take for the transaction to complete. If a transaction gets into a deadlock, its timer will expire, at which point the system aborts that transaction, rolling back its changes and releasing its locks in the hope that the other transactions involved in the deadlock may be able to proceed. If not, another one will time out, releasing further locks. Timing out deadlocks is effective, though it has the usual defect: it

Saltzer & Kaashoek Ch. 9, p. 77

June 25, 2009 8:22 am


CHAPTER 9 Atomicity: All-or-Nothing and Before-or-After

is difficult to choose a suitable timer value that keeps things moving along but also accommodates normal delays and variable operation times. If the environment or system load changes, it may be necessary to readjust all such timer values, an activity that can be a real nuisance in a large system.

4. Cycle detection (optimistic). Maintain, in the lock manager, a wait-for graph (as described in Section 5.2.5) that shows which transactions have acquired which locks and which transactions are waiting for which locks. Whenever another transaction tries to acquire a lock, finds it is already locked, and proposes to wait, the lock manager examines the graph to see if waiting would produce a cycle, and thus a deadlock. If it would, the lock manager selects some cycle member to be a victim, and unilaterally aborts that transaction, so that the others may continue. The aborted transaction then retries in the hope that the other transactions have made enough progress to be out of the way and another deadlock will not occur.

When a system uses lock ordering, backing out, or cycle detection, it is common to also set a timer as a safety net because a hardware failure or a programming error such as an endless loop can create a progress-blocking situation that none of the deadlock detection methods can catch.

Since a deadlock detection algorithm can introduce an extra reason to abort a transaction, one can envision pathological situations where the algorithm aborts every attempt to perform some particular transaction, no matter how many times its invoker retries. Suppose, for example, that two threads named Alphonse and Gaston get into a deadlock trying to acquire locks for two objects named Apple and Banana: Alphonse acquires the lock for Apple, Gaston acquires the lock for Banana, Alphonse tries to acquire the lock for Banana and waits, then Gaston tries to acquire the lock for Apple and waits, creating the deadlock.
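The cycle test that a lock manager performs in method 4, before granting a wait, can be sketched as follows. This is an illustrative sketch only, not code from the text: the `waits_for` map and the function name are assumptions, and it relies on the property that a blocked transaction is waiting for at most one lock (and therefore at most one other transaction) at a time.

```python
# Hypothetical wait-for bookkeeping: waits_for[t] names the transaction
# that t is currently blocked on; a transaction not in the map can run.
def would_deadlock(waits_for, waiter, holder):
    """Return True if letting `waiter` wait for `holder` would close a cycle."""
    seen = set()
    t = holder
    while t is not None:
        if t == waiter:
            return True       # a wait-for path leads back to the waiter: deadlock
        if t in seen:
            return False      # walked into an existing loop not involving waiter
        seen.add(t)
        t = waits_for.get(t)  # whoever t is waiting on, or None if t can run
    return False
```

In the Alphonse and Gaston scenario, once Gaston is recorded as waiting for Alphonse, a later request by Alphonse to wait for Gaston would be recognized as closing a cycle, and one of the two would be selected as the victim.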
Eventually, Alphonse times out and begins rolling back updates in preparation for releasing locks. Meanwhile, Gaston times out and does the same thing. Both restart, and they get into another deadlock, with their timers set to expire exactly as before, so they will probably repeat the sequence forever. Thus we still have no guarantee of progress. This is the emergent property that Chapter 5 called livelock, since formally no deadlock ever occurs and both threads are busy doing something that looks superficially useful.

One way to deal with livelock is to apply a randomized version of a technique familiar from Chapter 7[on-line]: exponential random backoff. When a timer expiration leads to an abort, the lock manager, after clearing the locks, delays that thread for a random length of time, chosen from some starting interval, in the hope that the randomness will change the relative timing of the livelocked transactions enough that on the next try one will succeed and the other can then proceed without interference. If the transaction again encounters interference, it tries again, but on each retry not only does the lock manager choose a new random delay, but it also increases the interval from which the delay is chosen by some multiplicative constant, typically 2. Since on each retry there is an increased probability of success, one can push this probability as close to unity as desired by continued retries, with the expectation that the interfering transactions will



eventually get out of one another’s way. A useful property of exponential random backoff is that if repeated retries continue to fail it is almost certainly an indication of some deeper problem—perhaps a programming mistake or a level of competition for shared variables that is intrinsically so high that the system should be redesigned.

The design of more elaborate algorithms or programming disciplines that guarantee progress is a project that has only modest potential payoff, and an end-to-end argument suggests that it may not be worth the effort. In practice, systems that would have frequent interference among transactions are not usually designed with a high degree of concurrency anyway. When interference is not frequent, simple techniques such as safety-net timers and exponential random backoff not only work well, but they usually must be provided anyway, to cope with any races or programming errors such as endless loops that may have crept into the system design or implementation. Thus a more complex progress-guaranteeing discipline is likely to be redundant, and only rarely will it get a chance to promote progress.
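The backoff discipline just described can be sketched as a small retry driver. This is a sketch under stated assumptions: the function and parameter names are inventions for illustration, with only the multiplicative constant of 2 taken from the text, and a real lock manager would apply the delay before re-running the aborted transaction rather than calling a Python function.

```python
import random
import time

def run_with_backoff(attempt, start_interval=0.001, multiplier=2, max_tries=8):
    """Retry `attempt` (returns True on commit, False on abort), delaying
    a random time drawn from an interval that grows on each retry."""
    interval = start_interval
    for _ in range(max_tries):
        if attempt():
            return True
        # A random delay breaks the lockstep timing that causes livelock.
        time.sleep(random.uniform(0, interval))
        interval *= multiplier
    return False  # persistent failure hints at a deeper problem
```

Returning failure after a bounded number of tries matches the observation above: continued failure is better treated as a signal to a human than as something to retry forever.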

9.6 Atomicity across Layers and Multiple Sites

There remain some important gaps in our exploration of atomicity. First, in a layered system, a transaction implemented in one layer may consist of a series of component actions of a lower layer that are themselves atomic. The question is how the commitment of the lower-layer transactions should relate to the commitment of the higher-layer transaction. If the higher-layer transaction decides to abort, the question is what to do about lower-layer transactions that may have already committed. There are two possibilities:

• Reverse the effect of any committed lower-layer transactions with an UNDO action. This technique requires that the results of the lower-layer transactions be visible only within the higher-layer transaction.

• Somehow delay commitment of the lower-layer transactions and arrange that they actually commit at the same time that the higher-layer transaction commits.

Up to this point, we have assumed the first possibility. In this section we explore the second one.

Another gap is that, as described so far, our techniques to provide atomicity all involve the use of shared variables in memory or storage (for example, pointers to the latest version, outcome records, logs, and locks) and thus implicitly assume that the composite actions that make up a transaction all occur in close physical proximity. When the composing actions are physically separated, communication delay, communication reliability, and independent failure make atomicity both more important and harder to achieve.

We will edge up on both of these problems by first identifying a common subproblem: implementing nested transactions. We will then extend the solution to the nested transaction problem to create an agreement protocol, known as two-phase commit, that


procedure PAY_INTEREST (reference account)
    if account.balance > 0 then
        interest = account.balance * 0.05
        TRANSFER (bank, account, interest)
    else
        interest = account.balance * 0.15
        TRANSFER (account, bank, interest)

procedure MONTH_END_INTEREST ()
    for A ← each customer_account do
        PAY_INTEREST (A)

FIGURE 9.35 An example of two procedures, one of which calls the other, yet each should be individually atomic.

coordinates commitment of lower-layer transactions. We can then extend the two-phase commit protocol, using a specialized form of remote procedure call, to coordinate steps that must be carried out at different places. This sequence is another example of bootstrapping; the special case that we know how to handle is the single-site transaction and the more general problem is the multiple-site transaction. As an additional observation, we will discover that multiple-site transactions are quite similar to, but not quite the same as, the dilemma of the two generals.

9.6.1 Hierarchical Composition of Transactions

We got into the discussion of transactions by considering that complex interpreters are engineered in layers, and that each layer should implement atomic actions for its next-higher, client layer. Thus transactions are nested, each one typically consisting of multiple lower-layer transactions. This nesting requires that some additional thought be given to the mechanism of achieving atomicity.

Consider again a banking example. Suppose that the TRANSFER procedure of Section 9.1.5 is available for moving funds from one account to another, and it has been implemented as a transaction. Suppose now that we wish to create the two application procedures of Figure 9.35. The first procedure, PAY_INTEREST, invokes TRANSFER to move an appropriate amount of money from or to an internal account named bank, the direction and rate depending on whether the customer account balance is positive or negative. The second procedure, MONTH_END_INTEREST, fulfills the bank’s intention to pay (or extract) interest every month on every customer account by iterating through the accounts and invoking PAY_INTEREST on each one. It would probably be inappropriate to have two invocations of MONTH_END_INTEREST running at the same time, but it is likely that at the same time that MONTH_END_INTEREST is running there are other banking activities in progress that are also invoking TRANSFER.



It is also possible that the for each statement inside MONTH_END_INTEREST actually runs several instances of its iteration (and thus of PAY_INTEREST) concurrently. Thus we have a need for three layers of transactions. The lowest layer is the TRANSFER procedure, in which debiting of one account and crediting of a second account must be atomic. At the next higher layer, the procedure PAY_INTEREST should be executed atomically, to ensure that some concurrent TRANSFER transaction doesn’t change the balance of the account between the positive/negative test and the calculation of the interest amount. Finally, the procedure MONTH_END_INTEREST should be a transaction, to ensure that some concurrent TRANSFER transaction does not move money from an account A to an account B between the interest-payment processing of those two accounts, since such a transfer could cause the bank to pay interest twice on the same funds. Structurally, an invocation of the TRANSFER procedure is nested inside PAY_INTEREST, and one or more concurrent invocations of PAY_INTEREST are nested inside MONTH_END_INTEREST.

The reason nesting is a potential problem comes from a consideration of the commit steps of the nested transactions. For example, the commit point of the TRANSFER transaction would seem to have to occur either before or after the commit point of the PAY_INTEREST transaction, depending on where in the programming of PAY_INTEREST we place its commit point. Yet either of these positions will cause trouble. If the TRANSFER commit occurs in the pre-commit phase of PAY_INTEREST, then if there is a system crash PAY_INTEREST will not be able to back out as though it hadn’t tried to operate, because the values of the two accounts that TRANSFER changed may have already been used by concurrent transactions to make payment decisions.
But if the TRANSFER commit does not occur until the post-commit phase of PAY_INTEREST, there is a risk that the transfer itself cannot be completed, for example because one of the accounts is inaccessible. The conclusion is that somehow the commit point of the nested transaction should coincide with the commit point of the enclosing transaction.

A slightly different coordination problem applies to MONTH_END_INTEREST: no TRANSFERs by other transactions should occur while it runs (that is, it should run either before or after any concurrent TRANSFER transactions), but it must be able to do multiple TRANSFERs itself, each time it invokes PAY_INTEREST, and its own possibly concurrent transfer actions must be before-or-after actions, since they all involve the account named “bank”.

Suppose for the moment that the system provides transactions with version histories. We can deal with nesting problems by extending the idea of an outcome record: we allow outcome records to be organized hierarchically. Whenever we create a nested transaction, we record in its outcome record both the initial state (PENDING) of the new transaction and the identifier of the enclosing transaction. The resulting hierarchical arrangement of outcome records then exactly reflects the nesting of the transactions. A top-layer outcome record would contain a flag to indicate that it is not nested inside any other transaction. When an outcome record contains the identifier of a higher-layer transaction, we refer to it as a dependent outcome record, and the record to which it refers is called its superior.

The transactions, whether nested or enclosing, then go about their business, and depending on their success mark their own outcome records COMMITTED or ABORTED, as usual. However, when READ_CURRENT_VALUE (described in Section 9.4.2) examines the status of a version to see whether or not the transaction that created it is COMMITTED, it must additionally check to see if the outcome record contains a reference to a superior outcome record. If so, it must follow the reference and check the status of the superior. If that record says that it, too, is COMMITTED, it must continue following the chain upward, if necessary all the way to the highest-layer outcome record. The transaction in question is actually COMMITTED only if all the records in the chain are in the COMMITTED state. If any record in the chain is ABORTED, this transaction is actually ABORTED, despite the COMMITTED claim in its own outcome record. Finally, if neither of those situations holds, then there must be one or more records in the chain that are still PENDING. The outcome of this transaction remains PENDING until those records become COMMITTED or ABORTED. Thus the outcome of an apparently-COMMITTED dependent outcome record actually depends on the outcomes of all of its ancestors. We can describe this situation by saying that, until all its ancestors commit, this lower-layer transaction is sitting on a knife-edge, at the point of committing but still capable of aborting if necessary. For purposes of discussion we will identify this situation as a distinct virtual state of the outcome record and the transaction, by saying that the transaction is tentatively committed.

This hierarchical arrangement has several interesting programming consequences. If a nested transaction has any post-commit steps, those steps cannot proceed until all of the hierarchically higher transactions have committed. For example, if one of the nested transactions opens a cash drawer when it commits, the sending of the release message to the cash drawer must somehow be held up until the highest-layer transaction determines its outcome.

This output visibility consequence is only one example of many relating to the tentatively committed state.
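The chain-following test that READ_CURRENT_VALUE must perform can be sketched as follows. This is a minimal sketch under an assumed record layout: each outcome record is represented as a dict with a state and a superior reference (None at the top layer); the book’s actual data structures may differ.

```python
def effective_outcome(record):
    """Walk from a transaction's outcome record toward the top layer and
    return its true state: COMMITTED only if every record in the chain is."""
    saw_pending = False
    while record is not None:
        if record["state"] == "ABORTED":
            return "ABORTED"       # any ABORTED ancestor aborts the whole chain
        if record["state"] == "PENDING":
            saw_pending = True     # the outcome still hangs on this ancestor
        record = record["superior"]
    return "PENDING" if saw_pending else "COMMITTED"
```

A record marked COMMITTED whose superior is still PENDING is exactly the tentatively committed state: this function reports it as PENDING, which is what a reader in another transaction must see.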
The nested transaction, having declared itself tentatively committed, has renounced the ability to abort—the decision is in someone else’s hands. It must be able to run to completion or to abort, and it must be able to maintain the tentatively committed state indefinitely. Maintaining the ability to go either way can be awkward, since the transaction may be holding locks, keeping pages in memory or tapes mounted, or reliably holding on to output messages. One consequence is that a designer cannot simply take any arbitrary transaction and blindly use it as a nested component of a larger transaction. At the least, the designer must review what is required for the nested transaction to maintain the tentatively committed state.

Another, more complex, consequence arises when one considers possible interactions among different transactions that are nested within the same higher-layer transaction. Consider our earlier example of TRANSFER transactions that are nested inside PAY_INTEREST, which in turn is nested inside MONTH_END_INTEREST. Suppose that the first time that MONTH_END_INTEREST invokes PAY_INTEREST, that invocation commits, thus moving into the tentatively committed state, pending the outcome of MONTH_END_INTEREST. Then MONTH_END_INTEREST invokes PAY_INTEREST on a second bank account. PAY_INTEREST needs to be able to read as input data the value of the bank’s own interest account, which is a pending result of the previous, tentatively COMMITTED, invocation of PAY_INTEREST. The READ_CURRENT_VALUE algorithm, as implemented in Section 9.4.2, doesn’t distinguish between reads arising within the same group of nested transactions and reads from some



completely unrelated transaction. Figure 9.36 illustrates the situation. If the test in READ_CURRENT_VALUE for committed values is extended by simply following the ancestry of the outcome record controlling the latest version, it will undoubtedly force the second invocation of PAY_INTEREST to wait pending the final outcome of the first invocation of PAY_INTEREST. But since the outcome of that first invocation depends on the outcome of


[Figure 9.36 appears here: a tree of outcome records in which TRANSFER2, nested in the second, still-PENDING invocation of PAY_INTEREST, asks whether it may read the newest version of account bank, whose creator is TRANSFER1.]

FIGURE 9.36 Transaction TRANSFER2, nested in transaction PAY_INTEREST2, which is nested in transaction MONTH_END_INTEREST, wants to read the current value of account bank. But bank was last written by transaction TRANSFER1, which is nested in COMMITTED transaction PAY_INTEREST1, which is nested in still-PENDING transaction MONTH_END_INTEREST. Thus this version of bank is actually PENDING, rather than COMMITTED as one might conclude by looking only at the outcome of TRANSFER1. However, TRANSFER1 and TRANSFER2 share a common ancestor (namely, MONTH_END_INTEREST), and the chain of transactions leading from bank to that common ancestor is completely committed, so the read of bank can—and to avoid a deadlock, must—be allowed.


MONTH_END_INTEREST, and the outcome of MONTH_END_INTEREST currently depends on the success of the second invocation of PAY_INTEREST, we have a built-in cycle of waits that at best can only time out and abort. Since blocking the read would be a mistake, the question of when it might be OK to permit reading of data values created by tentatively COMMITTED transactions requires some further thought.

The before-or-after atomicity requirement is that no update made by a tentatively COMMITTED transaction should be visible to any transaction that would survive if for some reason the tentatively COMMITTED transaction ultimately aborts. Within that constraint, updates of tentatively COMMITTED transactions can freely be passed around. We can achieve that goal in the following way: compare the outcome record ancestry of the transaction doing the read with the ancestry of the outcome record that controls the version to be read. If these ancestries do not merge (that is, there is no common ancestor), then the reader must wait for the version’s ancestry to be completely committed. If they do merge and all the transactions in the ancestry of the data version that are below the point of the merge are tentatively committed, no wait is necessary.

Thus, in Figure 9.36, MONTH_END_INTEREST might be running the two (or more) invocations of PAY_INTEREST concurrently. Each invocation will call CREATE_NEW_VERSION as part of its plan to update the value of account “bank”, thereby establishing a serial order of the invocations. When later invocations of PAY_INTEREST call READ_CURRENT_VALUE to read the value of account “bank”, they will be forced to wait until all earlier invocations of PAY_INTEREST decide whether to commit or abort.
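The ancestry comparison just described can be sketched as follows, again assuming dict-shaped outcome records with "state" and "superior" fields (an illustrative layout, not the text’s actual machinery).

```python
def ancestry(record):
    """List the outcome records from a transaction up to the top layer."""
    chain = []
    while record is not None:
        chain.append(record)
        record = record["superior"]
    return chain

def may_read(reader, version_creator):
    """Merge rule: the reader may see the version if the two ancestries
    merge and everything below the merge point on the version's side
    has (at least tentatively) committed."""
    reader_ids = {id(r) for r in ancestry(reader)}
    for r in ancestry(version_creator):
        if id(r) in reader_ids:
            return True      # reached the common ancestor: OK to read
        if r["state"] != "COMMITTED":
            return False     # an uncommitted record below the merge: wait
    return False             # no common ancestor: wait for full commit
```

Run against the Figure 9.36 scenario, the second TRANSFER is allowed to read the version created by the first, because the walk up from TRANSFER1 passes only COMMITTED records before reaching the shared ancestor MONTH_END_INTEREST, while a reader from an unrelated transaction must wait.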

9.6.2 Two-Phase Commit

Since a higher-layer transaction can comprise several lower-layer transactions, we can describe the commitment of a hierarchical transaction as involving two distinct phases. In the first phase, known variously as the preparation or voting phase, the higher-layer transaction invokes some number of distinct lower-layer transactions, each of which either aborts or, by committing, becomes tentatively committed. The top-layer transaction evaluates the situation to establish that all (or enough) of the lower-layer transactions are tentatively committed that it can declare the higher-layer transaction a success. Based on that evaluation, it either COMMITs or ABORTs the higher-layer transaction.

Assuming it decides to commit, it enters the second, commitment phase, which in the simplest case consists of simply changing its own state from PENDING to COMMITTED or ABORTED. If it is the highest-layer transaction, at that instant all of the lower-layer tentatively committed transactions also become either COMMITTED or ABORTED. If it is itself nested in a still higher-layer transaction, it becomes tentatively committed and its component transactions continue in the tentatively committed state also. We are implementing here a coordination protocol known as two-phase commit. When we implement multiple-site atomicity in the next section, the distinction between the two phases will take on additional clarity.



If the system uses version histories for atomicity, the hierarchy of Figure 9.36 can be directly implemented by linking outcome records. If the system uses logs, a separate table of pending transactions can contain the hierarchy, and inquiries about the state of a transaction would involve examining this table.

The concept of nesting transactions hierarchically is useful in its own right, but our particular interest in nesting is that it is the first of two building blocks for multiple-site transactions. To develop the second building block, we next explore what makes multiple-site transactions different from single-site transactions.

9.6.3 Multiple-Site Atomicity: Distributed Two-Phase Commit

If a transaction requires executing component transactions at several sites that are separated by a best-effort network, obtaining atomicity is more difficult because any of the messages used to coordinate the transactions of the various sites can be lost, delayed, or duplicated. In Chapter 4 we learned of a method, known as Remote Procedure Call (RPC), for performing an action at another site. In Chapter 7[on-line] we learned how to design protocols such as RPC with a persistent sender to ensure at-least-once execution and duplicate suppression to ensure at-most-once execution. Unfortunately, neither of these two assurances is exactly what is needed to ensure atomicity of a multiple-site transaction. However, by properly combining a two-phase commit protocol with persistent senders, duplicate suppression, and single-site transactions, we can create a correct multiple-site transaction. We assume that each site, on its own, is capable of implementing local transactions, using techniques such as version histories or logs and locks for all-or-nothing atomicity and before-or-after atomicity. Correctness of the multiple-site atomicity protocol will be achieved if all the sites commit or if all the sites abort; we will have failed if some sites commit their part of a multiple-site transaction while others abort their part of that same transaction.

Suppose the multiple-site transaction consists of a coordinator Alice requesting component transactions X, Y, and Z of worker sites Bob, Charles, and Dawn, respectively. The simple expedient of issuing three remote procedure calls certainly does not produce a transaction for Alice because Bob may do X while Charles may report that he cannot do Y. Conceptually, the coordinator would like to send three messages, to the three workers, like this one to Bob:

    From: Alice
    To: Bob
    Re: my transaction 91
    if (Charles does Y and Dawn does Z) then do X, please.

and let the three workers handle the details. We need some clue how Bob could accomplish this strange request.

The clue comes from recognizing that the coordinator has created a higher-layer transaction and each of the workers is to perform a transaction that is nested in the higher-layer transaction. Thus, what we need is a distributed version of the two-phase commit protocol. The complication is that the coordinator and workers cannot reliably


communicate. The problem thus reduces to constructing a reliable distributed version of the two-phase commit protocol. We can do that by applying persistent senders and duplicate suppression.

Phase one of the protocol starts with coordinator Alice creating a top-layer outcome record for the overall transaction. Then Alice begins persistently sending to Bob an RPC-like message:

    From: Alice
    To: Bob
    Re: my transaction 271
    Please do X as part of my transaction.

Similar messages go from Alice to Charles and Dawn, also referring to transaction 271, and requesting that they do Y and Z, respectively. As with an ordinary remote procedure call, if Alice doesn’t receive a response from one or more of the workers in a reasonable time she resends the message to the non-responding workers as many times as necessary to elicit a response.

A worker site, upon receiving a request of this form, checks for duplicates and then creates a transaction of its own, but it makes the transaction a nested one, with its superior being Alice’s original transaction. It then goes about doing the pre-commit part of the requested action, reporting back to Alice that this much has gone well:

    From: Bob
    To: Alice
    Re: your transaction 271
    My part X is ready to commit.

Alice, upon collecting a complete set of such responses, then moves to the two-phase commit part of the transaction, by sending messages to each of Bob, Charles, and Dawn saying, e.g.:

    Two-phase-commit message #1:
    From: Alice
    To: Bob
    Re: my transaction 271
    Please PREPARE to commit X.

Bob, upon receiving this message, commits—but only tentatively—or aborts. Having created durable tentative versions (or logged to journal storage its planned updates) and having recorded an outcome record saying that it is PREPARED either to commit or abort, Bob then persistently sends a response to Alice reporting his state:



    Two-phase-commit message #2:
    From: Bob
    To: Alice
    Re: your transaction 271
    I am PREPARED to commit my part. Have you decided to commit yet? Regards.

or alternatively, a message reporting it has aborted. If Bob receives a duplicate request from Alice, his persistent sender sends back a duplicate of the PREPARED or ABORTED response.

At this point Bob, being in the PREPARED state, is out on a limb. Just as in a local hierarchical nesting, Bob must be able either to run to the end or to abort, to maintain that state of preparation indefinitely, and wait for someone else (Alice) to say which. In addition, the coordinator may independently crash or lose communication contact, increasing Bob’s uncertainty. If the coordinator goes down, all of the workers must wait until it recovers; in this protocol, the coordinator is a single point of failure.

As coordinator, Alice collects the response messages from her several workers (perhaps re-requesting PREPARED responses several times from some worker sites). If all workers send PREPARED messages, phase one of the two-phase commit is complete. If any worker responds with an abort message, or doesn’t respond at all, Alice has the usual choice of aborting the entire transaction or perhaps trying a different worker site to carry out that component transaction. Phase two begins when Alice commits the entire transaction by marking her own outcome record COMMITTED. Once the higher-layer outcome record is marked as COMMITTED or ABORTED, Alice sends a completion message back to each of Bob, Charles, and Dawn:

    Two-phase-commit message #3:
    From: Alice
    To: Bob
    Re: my transaction 271
    My transaction committed. Thanks for your help.
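Alice’s side of this exchange can be sketched as a simple driver. This is an illustrative sketch only: `send` and `await_vote` stand in for the persistent-sender machinery, and timeouts, retries, duplicate suppression, and durable logging of the outcome record are all omitted.

```python
def coordinate(workers, send, await_vote):
    """Run two-phase commit as coordinator over assumed messaging hooks."""
    # Phase one (voting): ask each worker to PREPARE, then collect votes.
    for w in workers:
        send(w, "PREPARE")
    votes = {w: await_vote(w) for w in workers}  # "PREPARED" or "ABORTED"
    # Decide; a real coordinator durably records the outcome here.
    decision = "COMMIT" if all(v == "PREPARED" for v in votes.values()) else "ABORT"
    # Phase two: send completion messages.
    for w in workers:
        send(w, decision)
    return decision
```

The essential point the sketch preserves is that the decision is taken only after every vote is in, and the same decision is then sent to every worker.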

Each worker site, upon receiving such a message, changes its state from PREPARED to COMMITTED, performs any needed post-commit actions, and exits. Meanwhile, Alice can go about other business, with one important requirement for the future: she must remember, reliably and for an indefinite time, the outcome of this transaction. The reason is that one or more of her completion messages may have been lost. Any worker sites that are in the PREPARED state are awaiting the completion message to tell them which way to go. If a completion message does not arrive in a reasonable period of time, the persistent sender at the worker site will resend its PREPARED message. Whenever Alice receives a duplicate PREPARED message, she simply sends back the current state of the outcome record for the named transaction.

If a worker site that uses logs and locks crashes, the recovery procedure at that site has to take three extra steps. First, it must classify any PREPARED transaction as a tentative winner that it should restore to the PREPARED state. Second, if the worker is using locks for


before-or-after atomicity, the recovery procedure must reacquire any locks the PREPARED transaction was holding at the time of the failure. Finally, the recovery procedure must restart the persistent sender, to learn the current status of the higher-layer transaction. If the worker site uses version histories, only the last step, restarting the persistent sender, is required.

Since the workers act as persistent senders of their PREPARED messages, Alice can be confident that every worker will eventually learn that her transaction committed. But since the persistent senders of the workers are independent, Alice has no way of ensuring that they will act simultaneously. Instead, Alice can only be certain of eventual completion of her transaction. This distinction between simultaneous action and eventual action is critically important, as will soon be seen.

If all goes well, two-phase commit of N worker sites will be accomplished in 3N messages, as shown in Figure 9.37: for each worker site a PREPARE message, a PREPARED message in response, and a COMMIT message. This 3N message protocol is complete and sufficient, although there are several variations one can propose.

An example of a simplifying variation is that the initial RPC request and response could also carry the PREPARE and PREPARED messages, respectively. However, once a worker sends a PREPARED message, it loses the ability to unilaterally abort, and it must remain on the knife edge awaiting instructions from the coordinator. To minimize this wait, it is usually preferable to delay the PREPARE/PREPARED message pair until the coordinator knows that the other workers seem to be in a position to do their parts.

Some versions of the distributed two-phase commit protocol have a fourth acknowledgment message from the worker sites to the coordinator. The intent is to collect a complete set of acknowledgment messages—the coordinator persistently sends completion messages until every site acknowledges.
Once all acknowledgments are in, the coordinator can then safely discard its outcome record, since every worker site is known to have gotten the word.

A system that is concerned both about outcome record storage space and the cost of extra messages can use a further refinement, called presumed commit. Since one would expect that most transactions commit, we can use a slightly odd but very space-efficient representation for the value COMMITTED of an outcome record: non-existence. The coordinator answers any inquiry about a non-existent outcome record by sending a COMMITTED response. If the coordinator uses this representation, it commits by destroying the outcome record, so a fourth acknowledgment message from every worker is unnecessary. In return for this apparent magic reduction in both message count and space, we notice that outcome records for aborted transactions cannot easily be discarded because if an inquiry arrives after discarding, the inquiry will receive the response COMMITTED. The coordinator can, however, persistently ask for acknowledgment of aborted transactions, and discard the outcome record after all these acknowledgments are in. This protocol that leads to discarding an outcome record is identical to the protocol described in Chapter 7[on-line] to close a stream and discard the record of that stream.

Distributed two-phase commit does not solve all multiple-site atomicity problems. For example, if the coordinator site (in this case, Alice) is aboard a ship that sinks after

Saltzer & Kaashoek Ch. 9, p. 88

June 25, 2009 8:22 am

9.6 Atomicity across Layers and Multiple Sites

[Figure 9.37: timing diagram with one vertical timeline each for Coordinator Alice (who logs BEGIN) and Workers Bob, Charles, and Dawn; the messages PREPARE, PREPARED (each worker's vote to commit or abort), and COMMIT flow between the coordinator and the workers.]






FIGURE 9.37 Timing diagram for distributed two-phase commit, using 3N messages. (The initial RPC request and response messages are not shown.) Each of the four participants maintains its own version history or recovery log. The diagram shows log entries made by the coordinator and by one of the workers.

sending the PREPARE message but before sending the COMMIT or ABORT message, the worker sites are left in the PREPARED state with no way to proceed. Even without that concern, Alice and her co-workers are standing uncomfortably close to a multiple-site atomicity problem that, at least in principle, can not be solved. The only thing that rescues them is our observation that the several workers will do their parts eventually, not necessarily simultaneously. If she had required simultaneous action, Alice would have been in trouble. The unsolvable problem is known as the dilemma of the two generals.
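The message pattern described in this section, together with the presumed-commit representation of the outcome record, can be made concrete with a toy Python model. All names and structure here are invented for illustration; this is a sketch under simplifying assumptions (no lost messages, no crashes), not the book's code:

```python
# Toy model of distributed two-phase commit with "presumed commit".
# Under presumed commit, committing deletes the outcome record, and a
# missing record is reported as COMMITTED; an aborted record must be
# kept until every worker acknowledges it.

class Worker:
    def __init__(self, name):
        self.name = name
        self.state = "working"

    def prepare(self):
        self.state = "PREPARED"   # loses the right to abort unilaterally
        return "PREPARED"

    def commit(self):
        self.state = "COMMITTED"

class Coordinator:
    def __init__(self):
        self.outcomes = {}        # transaction id -> "PENDING" or "ABORTED"

    def two_phase_commit(self, tid, workers):
        self.outcomes[tid] = "PENDING"
        messages = 0
        votes = []
        for w in workers:                   # phase 1: N PREPARE + N PREPARED
            votes.append(w.prepare())
            messages += 2
        if all(v == "PREPARED" for v in votes):
            del self.outcomes[tid]          # commit point: record destroyed
            for w in workers:               # phase 2: N COMMIT messages
                w.commit()
                messages += 1
        else:
            self.outcomes[tid] = "ABORTED"  # keep until acknowledged
        return messages

    def inquire(self, tid):
        # A missing outcome record is presumed to mean COMMITTED.
        return self.outcomes.get(tid, "COMMITTED")

coordinator = Coordinator()
workers = [Worker(n) for n in ("Bob", "Charles", "Dawn")]
messages_used = coordinator.two_phase_commit("alice-1", workers)
```

With three workers the exchange uses 3N = 9 messages, and a later inquiry about the now-deleted outcome record is answered COMMITTED, which is also why an aborted transaction's record cannot be discarded so casually.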



CHAPTER 9 Atomicity: All-or-Nothing and Before-or-After

9.6.4 The Dilemma of the Two Generals

An important constraint on possible coordination protocols when communication is unreliable is captured in a vivid analogy, called the dilemma of the two generals.* Suppose that two small armies are encamped on two mountains outside a city. The city is well enough defended that it can repulse and destroy either one of the two armies. Only if the two armies attack simultaneously can they take the city. Thus the two generals who command the armies desire to coordinate their attack.

The only method of communication between the two generals is to send runners from one camp to the other. But the defenders of the city have sentries posted in the valley separating the two mountains, so there is a chance that the runner, trying to cross the valley, will instead fall into enemy hands, and be unable to deliver the message. Suppose that the first general sends this message:

From: Julius Caesar
To: Titus Labienus
Date: 11 January

I propose to cross the Rubicon and attack at dawn tomorrow. OK?

expecting that the second general will respond either with:

From: Titus Labienus
To: Julius Caesar
Date: 11 January

Yes, dawn on the 12th.

or, possibly:

From: Titus Labienus
To: Julius Caesar
Date: 11 January

No. I am awaiting reinforcements from Gaul.

Suppose further that the first message does not make it through. In that case, the second general does not march because no request to do so arrives. In addition, the first general does not march because no response returns, and all is well (except for the lost runner). Now, instead suppose the runner delivers the first message successfully and the second general sends the reply “Yes,” but that the reply is lost. The first general cannot distinguish this case from the earlier case, so that army will not march. The second general has agreed to march, but knowing that the first general won’t march unless the “Yes” confirmation arrives, the second general will not march without being certain that the first

* The origin of this analogy has been lost, but it was apparently first described in print in 1977 by Jim N. Gray in his “Notes on Database Operating Systems”, reprinted in Operating Systems, Lecture Notes in Computer Science 60, Springer Verlag, 1978. At about the same time, Danny Cohen described another analogy he called the dating protocol, which is congruent with the dilemma of the two generals.



general received the confirmation. This hesitation on the part of the second general suggests that the first general should send back an acknowledgment of receipt of the confirmation:

From: Julius Caesar
To: Titus Labienus
Date: 11 January

The die is cast.

Unfortunately, that doesn’t help, since the runner carrying this acknowledgment may be lost and the second general, not receiving the acknowledgment, will still not march. Thus the dilemma. We can now leap directly to a conclusion: there is no protocol with a bounded number of messages that can convince both generals that it is safe to march. If there were such a protocol, the last message in any particular run of that protocol must be unnecessary to safe coordination because it might be lost, undetectably. Since the last message must be unnecessary, one could delete that message to produce another, shorter sequence of messages that must guarantee safe coordination. We can reapply the same reasoning repeatedly to the shorter message sequence to produce still shorter ones, and we conclude that if such a safe protocol exists it either generates message sequences of zero length or else of unbounded length. A zero-length protocol can’t communicate anything, and an unbounded protocol is of no use to the generals, who must choose a particular time to march.

A practical general, presented with this dilemma by a mathematician in the field, would reassign the mathematician to a new job as a runner, and send a scout to check out the valley and report the probability that a successful transit can be accomplished within a specified time. Knowing that probability, the general would then send several (hopefully independent) runners, each carrying a copy of the message, choosing a number of runners large enough that the probability is negligible that all of them fail to deliver the message before the appointed time. (The loss of all the runners would be what Chapter 8[on-line] called an intolerable error.) Similarly, the second general sends many runners each carrying a copy of either the “Yes” or the “No” acknowledgment. This procedure provides a practical solution of the problem, so the dilemma is of no real consequence.
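The practical general's arithmetic can be made concrete. If each runner independently gets through with probability p, then k runners all fail with probability (1 − p)^k, and the general picks the smallest k that drives this below an acceptable tolerance. A sketch (the function name is invented for the example):

```python
def runners_needed(p, tolerance):
    """Smallest number of independent runners such that the probability
    that every one of them is captured, (1 - p) ** k, falls below
    `tolerance`.  p is the per-runner success probability reported by
    the scout."""
    k = 1
    while (1 - p) ** k >= tolerance:
        k += 1
    return k

# Even if each runner has only a 50% chance of getting through, ten
# runners push the chance of total failure below one in a thousand.
count = runners_needed(0.5, 1e-3)
```

Here `runners_needed(0.5, 1e-3)` evaluates to 10, since 0.5¹⁰ ≈ 0.00098 is the first power below the tolerance.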
Nevertheless, it is interesting to discover a problem that cannot, in principle, be solved with complete certainty. We can state the theoretical conclusion more generally and succinctly: if messages may be lost, no bounded protocol can guarantee with complete certainty that both generals know that they will both march at the same time. The best that they can do is accept some non-zero probability of failure equal to the probability of non-delivery of their last message.

It is interesting to analyze just why we can’t use a distributed two-phase commit protocol to resolve the dilemma of the two generals. As suggested at the outset, it has to do with a subtle difference in when things may, or must, happen. The two generals require, in order to vanquish the defenses of the city, that they march at the same time.


The persistent senders of the distributed two-phase commit protocol ensure that if the coordinator decides to commit, all of the workers will eventually also commit, but there is no assurance that they will do so at the same time. If one of the communication links goes down for a day, when it comes back up the worker at the other end of that link will then receive the notice to commit, but this action may occur a day later than the actions of its colleagues. Thus the problem solved by distributed two-phase commit is slightly relaxed when compared with the dilemma of the two generals. That relaxation doesn’t help the two generals, but the relaxation turns out to be just enough to allow us to devise a protocol that ensures correctness. By a similar line of reasoning, there is no way to ensure with complete certainty that actions will be taken simultaneously at two sites that communicate only via a best-effort network. Distributed two-phase commit can thus safely open a cash drawer of an ATM in Tokyo, with confidence that a computer in Munich will eventually update the balance of that account. But if, for some reason, it is necessary to open two cash drawers at different sites at the same time, the only solution is either the probabilistic approach or to somehow replace the best-effort network with a reliable one. The requirement for reliable communication is why real estate transactions and weddings (both of which are examples of two-phase commit protocols) usually occur with all of the parties in one room.
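The difference between eventual and simultaneous action comes down to a retry loop: a persistent sender guarantees eventual delivery over a lossy channel, but says nothing about when. A minimal sketch, with the delivery pattern fixed only to make the example deterministic (the function name and channel model are invented here):

```python
def persistent_send(message, channel_attempts):
    """Keep resending `message` until the channel delivers it.
    `channel_attempts` yields False for each lost copy and True when
    one finally gets through; the return value is how many sends it
    took.  Eventual delivery is certain only if the channel is not
    dead forever."""
    tries = 0
    for delivered in channel_attempts:
        tries += 1
        if delivered:
            return tries
    raise RuntimeError("channel permanently dead")

# The COMMIT notice is lost twice (say, a link is down for a day),
# then delivered on the third try: later than its siblings, but surely.
tries = persistent_send("COMMIT", [False, False, True])
```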

9.7 A More Complete Model of Disk Failure (Advanced Topic)

Section 9.2 of this chapter developed a failure analysis model for a calendar management program in which a system crash may corrupt at most one disk sector—the one, if any, that was being written at the instant of the crash. That section also developed a masking strategy for that problem, creating all-or-nothing disk storage. To keep that development simple, the strategy ignored decay events. This section revisits that model, considering how to also mask decay events. The result will be all-or-nothing durable storage, meaning that it is both all-or-nothing in the event of a system crash and durable in the face of decay events.

9.7.1 Storage that is Both All-or-Nothing and Durable

In Chapter 8[on-line] we learned that to obtain durable storage we should write two or more replicas of each disk sector. In the current chapter we learned that to recover from a system crash while writing a disk sector we should never overwrite the previous version of that sector; instead, we should write a new version in a different place. To obtain storage that is both durable and all-or-nothing we combine these two observations: make more than one replica, and don’t overwrite the previous version. One easy way to do that would be to simply build the all-or-nothing storage layer of the current chapter on top of the durable storage layer of Chapter 8[on-line]. That method would certainly work, but it is a bit heavy-handed: with a replication count of just two, it would lead to allocating six disk sectors for each sector of real data. This is a case in which modularity has an excessive cost.

Recall that the parameter that Chapter 8[on-line] used to determine the frequency of checking the integrity of disk storage was the expected time to decay, Td. Suppose for the moment that the durability requirement can be achieved by maintaining only two copies. In that case, Td must be much greater than the time required to write two copies of a sector on two disks. Put another way, a large Td means that the short-term chance of a decay event is small enough that the designer may be able to safely neglect it. We can take advantage of this observation to devise a slightly risky but far more economical method of implementing storage that is both durable and all-or-nothing with just two replicas. The basic idea is that if we are confident that we have two good replicas of some piece of data for durability, it is safe (for all-or-nothing atomicity) to overwrite one of the two replicas; the second replica can be used as a backup to ensure all-or-nothing atomicity if the system should happen to crash while writing the first one. Once we are confident that the first replica has been correctly written with new data, we can safely overwrite the second one, to regain long-term durability. If the time to complete the two writes is short compared with Td, the probability that a decay event interferes with this algorithm will be negligible.

Figure 9.38 shows the algorithm and the two replicas of the data, here named D0 and D1. An interesting point is that ALL_OR_NOTHING_DURABLE_GET does not bother to check the status returned upon reading D1—it just passes the status value along to its caller. The reason is that in the absence of decay CAREFUL_GET has no expected errors when reading data that CAREFUL_PUT was allowed to finish writing. Thus the returned status would be BAD only in two cases:

1. The CAREFUL_PUT of D1 was interrupted in mid-operation, or

2. D1 was subject to an unexpected decay.

ALL_OR_NOTHING_DURABLE_PUT guarantees that the first case cannot happen: it doesn’t begin CAREFUL_PUT on data D1 until after the completion of its CAREFUL_PUT on data D0. At most one of the two copies could be BAD because of a system crash during CAREFUL_PUT. Thus if the first copy (D0) is BAD, then we expect that the second one (D1) is OK. The risk of the second case is real, but we have assumed its probability to be small: it arises only if there is a random decay of D1 in a time much shorter than Td. In reading D1 we have an opportunity to detect that error through the status value, but we have no way to recover when both data copies are damaged, so this detectable error must be classified as untolerated. All we can do is pass a status report along to the application so that it knows that there was an untolerated error. There is one currently unnecessary step hidden in the SALVAGE program: if D0 is BAD, nothing is gained by copying D1 onto D0, since ALL_OR_NOTHING_DURABLE_PUT, which called SALVAGE, will immediately overwrite D0 with new data. The step is included because it allows SALVAGE to be used in a refinement of the algorithm.


In the absence of decay events, this algorithm would be just as good as the all-or-nothing procedures of Figures 9.6 and 9.7, and it would perform somewhat better, since it involves only two copies. Assuming that errors are rare enough that recovery operations do not dominate performance, the usual cost of ALL_OR_NOTHING_DURABLE_GET is just one disk read, compared with three in the ALL_OR_NOTHING_GET algorithm. The cost of ALL_OR_NOTHING_DURABLE_PUT is two disk reads (in SALVAGE) and two disk writes, compared with three disk reads and three disk writes for the ALL_OR_NOTHING_PUT algorithm. That analysis is based on a decay-free system. To deal with decay events, thus making the scheme both all-or-nothing and durable, the designer adopts two ideas from the discussion of durability in Chapter 8[on-line], the second of which eats up some of the better performance:

1. Place the two copies, D0 and D1, in independent decay sets (for example, write them on two different disk drives, preferably from different vendors).

2. Have a clerk run the SALVAGE program on every atomic sector at least once every Td seconds.

procedure ALL_OR_NOTHING_DURABLE_GET (reference data, atomic_sector)
    ds ← CAREFUL_GET (data, atomic_sector.D0)
    if ds = BAD then
        ds ← CAREFUL_GET (data, atomic_sector.D1)
    return ds

procedure ALL_OR_NOTHING_DURABLE_PUT (new_data, atomic_sector)
    SALVAGE (atomic_sector)
    ds ← CAREFUL_PUT (new_data, atomic_sector.D0)
    ds ← CAREFUL_PUT (new_data, atomic_sector.D1)
    return ds

procedure SALVAGE (atomic_sector)    // Run this program every Td seconds.
    ds0 ← CAREFUL_GET (data0, atomic_sector.D0)
    ds1 ← CAREFUL_GET (data1, atomic_sector.D1)
    if ds0 = BAD then
        CAREFUL_PUT (data1, atomic_sector.D0)
    else if ds1 = BAD then
        CAREFUL_PUT (data0, atomic_sector.D1)
    if data0 ≠ data1 then
        CAREFUL_PUT (data0, atomic_sector.D1)





FIGURE 9.38 Data arrangement and algorithms to implement all-or-nothing durable storage on top of the careful storage layer of Figure 8.12.
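The algorithms of Figure 9.38 can be rendered as executable Python. The careful storage layer is modeled here as a two-entry dict in which None stands for a sector whose CAREFUL_GET would return status BAD; this model, and the guard that runs the final comparison only when both reads succeed, are assumptions of the sketch rather than the book's code:

```python
BAD, OK = "BAD", "OK"

def careful_get(sector, copy):
    """Model of CAREFUL_GET: None models a decayed or torn replica."""
    value = sector[copy]
    return (BAD, None) if value is None else (OK, value)

def careful_put(sector, copy, value):
    """Model of CAREFUL_PUT (always succeeds in this toy model)."""
    sector[copy] = value
    return OK

def salvage(sector):
    # Run every Td seconds, and before every durable PUT.
    ds0, data0 = careful_get(sector, "D0")
    ds1, data1 = careful_get(sector, "D1")
    if ds0 == BAD:
        careful_put(sector, "D0", data1)   # repair D0 from the good copy
    elif ds1 == BAD:
        careful_put(sector, "D1", data0)   # repair D1 from the good copy
    elif data0 != data1:
        careful_put(sector, "D1", data0)   # crash fell between the two PUTs

def all_or_nothing_durable_get(sector):
    ds, data = careful_get(sector, "D0")
    if ds == BAD:
        ds, data = careful_get(sector, "D1")   # fall back to the backup
    return ds, data

def all_or_nothing_durable_put(sector, new_data):
    salvage(sector)                             # make both copies consistent
    careful_put(sector, "D0", new_data)         # overwrite the first copy...
    return careful_put(sector, "D1", new_data)  # ...then the second

# A crash that tore D0 mid-write leaves the old value recoverable:
sector = {"D0": None, "D1": "old"}
status, value = all_or_nothing_durable_get(sector)
```

Running SALVAGE on a sector left half-written by a crash (D0 new, D1 old) propagates the new value, so a later read never sees a mixture of old and new.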



The clerk running the SALVAGE program performs 2N disk reads every Td seconds to maintain N durable sectors. This extra expense is the price of durability against disk decay. The performance cost of the clerk depends on the choice of Td, the value of N, and the priority of the clerk. Since the expected operational lifetime of a hard disk is usually several years, setting Td to a few weeks should make the chance of untolerated failure from decay negligible, especially if there is also an operating practice to routinely replace disks well before they reach their expected operational lifetime. A modern hard disk with a capacity of one terabyte would have about N = 10⁹ kilobyte-sized sectors. If it takes 10 milliseconds to read a sector, it would take about 2 × 10⁷ seconds, or about eight months, for a clerk to read all of the contents of two one-terabyte hard disks one sector at a time; reading many sectors with each disk request reduces that time substantially. If the work of the clerk is scheduled to occur at night, or uses a priority system that runs the clerk when the system is otherwise not being used heavily, that reading can spread out over a few weeks and the performance impact can be minor. A few paragraphs back we mentioned that there is the potential for a refinement: if we also run the SALVAGE program on every atomic sector immediately following every system crash, then it should not be necessary to do it at the beginning of every ALL_OR_NOTHING_DURABLE_PUT. That variation, which is more economical if crashes are infrequent and disks are not too large, is due to Butler Lampson and Howard Sturgis [Suggestions for Further Reading 1.8.7]. It raises one minor concern: it depends on the rarity of coincidence of two failures: the spontaneous decay of one data replica at about the same time that CAREFUL_PUT crashes in the middle of rewriting the other replica of that same sector. If we are convinced that such a coincidence is rare, we can declare it to be an untolerated error, and we have a self-consistent and more economical algorithm.
With this scheme the cost of ALL_OR_NOTHING_DURABLE_PUT reduces to just two disk writes.

9.8 Case Studies: Machine Language Atomicity

9.8.1 Complex Instruction Sets: The General Electric 600 Line

In the early days of mainframe computers, most manufacturers reveled in providing elaborate instruction sets, without paying much attention to questions of atomicity. The General Electric 600 line, which later evolved to be the Honeywell Information Systems, Inc., 68 series computer architecture, had a feature called “indirect and tally.” One could specify this feature by setting to ON a one-bit flag (the “tally” flag) stored in an unused high-order bit of any indirect address. The instruction

Load register A from Y indirect.

was interpreted to mean that the low-order bits of the cell with address Y contain another address, called an indirect address, and that indirect address should be used to retrieve the operand to be loaded into register A. In addition, if the tally flag in cell Y is ON, the processor is to increment the indirect address in Y by one and store the result back in Y. The idea is that the next time Y is used as an indirect address it will point to a different


operand—the one in the next sequential address in memory. Thus the indirect and tally feature could be used to sweep through a table. The feature seemed useful to the designers, but it was actually used only occasionally, because most applications were written in higher-level languages and compiler writers found it hard to exploit. On the other hand, the feature gave no end of trouble when virtual memory was retrofitted to the product line. Suppose that virtual memory is in use, and that the indirect word is located in a page that is in primary memory, but the actual operand is in another page that has been removed to secondary memory. When the above instruction is executed, the processor will retrieve the indirect address in Y, increment it, and store the new value back in Y. Then it will attempt to retrieve the actual operand, at which time it discovers that it is not in primary memory, so it signals a missing-page exception. Since it has already modified the contents of Y (and by now Y may have been read by another processor or even removed from memory by the missing-page exception handler running on another processor), it is not feasible to back out and act as if this instruction had never executed. The designer of the exception handler would like to be able to give the processor to another thread by calling a function such as AWAIT while waiting for the missing page to arrive. Indeed, processor reassignment may be the only way to assign a processor to retrieve the missing page. However, to reassign the processor it is necessary to save its current execution state. Unfortunately, its execution state is “half-way through the instruction last addressed by the program counter.” Saving this state and later restarting the processor in this state is challenging. The indirect and tally feature was just one of several sources of atomicity problems that cropped up when virtual memory was added to this processor.
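A toy simulation makes the hazard concrete. This models only the troublesome ordering, not the real 600-line semantics; the sentinel and function names are invented for the sketch. The tally update to Y happens before the operand fetch, so a missing-page exception arrives only after visible state has already changed:

```python
MISSING_PAGE = "missing"   # marks an operand living on a page not in memory

def load_a_indirect_tally(memory, y, registers):
    """Toy 'load register A from Y indirect, with tally'."""
    pointer = memory[y]
    memory[y] = pointer + 1            # tally: Y is incremented first...
    operand = memory.get(pointer, MISSING_PAGE)
    if operand == MISSING_PAGE:        # ...so the fault arrives too late:
        raise RuntimeError("missing-page exception mid-instruction")
    registers["A"] = operand

memory = {100: 200}                    # cell 200 is on a removed page
registers = {}
faulted = False
try:
    load_a_indirect_tally(memory, 100, registers)
except RuntimeError:
    faulted = True
# The instruction neither completed nor left things untouched:
# register A was never loaded, yet memory[100] already became 201,
# so simply re-executing the instruction would use the wrong pointer.
```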
The virtual memory designers desperately wanted to be able to run other threads on the interrupted processor. To solve this problem, they extended the definition of the current program state to contain not just the next-instruction counter and the program-visible registers, but also the complete internal state description of the processor—a 216-bit snapshot in the middle of the instruction. By later restoring the processor state to contain the previously saved values of the next-instruction counter, the program-visible registers, and the 216-bit internal state snapshot, the processor could exactly continue from the point at which the missing-page alert occurred. This technique worked but it had two awkward side effects: 1) when a program (or programmer) inquires about the current state of an interrupted processor, the state description includes things not in the programmer’s interface; and 2) the system must be careful when restarting an interrupted program to make certain that the stored micro-state description is a valid one. If someone has altered the state description the processor could try to continue from a state it could never have gotten into by itself, which could lead to unplanned behavior, including failures of its memory protection features.

9.8.2 More Elaborate Instruction Sets: The IBM System/370

When IBM developed the System/370 by adding virtual memory to its System/360 architecture, certain System/360 multi-operand character-editing instructions caused



atomicity problems. For example, the TRANSLATE instruction contains three arguments, two of which are addresses in memory (call them string and table) and the third of which, length, is an 8-bit count that the instruction interprets as the length of string. TRANSLATE takes one byte at a time from string, uses that byte as an offset in table, retrieves the byte at the offset, and replaces the byte in string with the byte it found in table. The designers had in mind that TRANSLATE could be used to convert a character string from one character set to another. The problem with adding virtual memory is that string may be as long as 256 bytes and table occupies up to 256 bytes, so either or both of those operands may cross a page boundary. Suppose just the first page of string is in physical memory. The TRANSLATE instruction works its way through the bytes at the beginning of string. When it comes to the end of that first page, it encounters a missing-page exception. At this point, the instruction cannot run to completion because data it requires is missing. It also cannot back out and act as if it never started because it has modified data in memory by overwriting it. After the virtual memory manager retrieves the missing page, the problem is how to restart the half-completed instruction. If it restarts from the beginning, it will try to convert the already-converted characters, which would be a mistake. For correct operation, the instruction needs to continue from where it left off. Rather than tampering with the program state definition, the IBM processor designers chose a dry run strategy in which the TRANSLATE instruction is executed using a hidden copy of the program-visible registers and making no changes in memory. If one of the operands causes a missing-page exception, the processor can act as if it never tried the instruction, since there is no program-visible evidence that it did.
The stored program state shows only that the TRANSLATE instruction is about to be executed. After the processor retrieves the missing page, it restarts the interrupted thread by trying the TRANSLATE instruction from the beginning again, another dry run. If there are several missing pages, several dry runs may occur, each getting one more page into primary memory. When a dry run finally succeeds in completing, the processor runs the instruction once more, this time for real, using the program-visible registers and allowing memory to be updated. Since the System/370 (at the time this modification was made) was a single-processor architecture, there was no possibility that another processor might snatch a page away after the dry run but before the real execution of the instruction. This solution had the side effect of making life more difficult for a later designer with the task of adding multiple processors.
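The dry-run strategy can be sketched in Python with a toy paging model. The page size, addresses, and function names are all invented for the sketch; a dry run touches every operand without mutating memory, each missing-page fault fetches one page, and only a fault-free dry run is followed by the real, mutating execution:

```python
PAGE_SIZE = 4                      # toy page size

class MissingPage(Exception):
    pass

def probe(memory, addr, resident_pages):
    """Read one byte, faulting if its page is not resident."""
    if addr // PAGE_SIZE not in resident_pages:
        raise MissingPage(addr // PAGE_SIZE)
    return memory[addr]

def translate(memory, string_addr, length, table_addr, resident_pages, dry_run):
    """Toy TRANSLATE: replace each string byte b with table[b]."""
    for i in range(length):
        b = probe(memory, string_addr + i, resident_pages)
        new = probe(memory, table_addr + b, resident_pages)
        if not dry_run:
            memory[string_addr + i] = new     # mutate only on the real run

def translate_with_dry_runs(memory, string_addr, length, table_addr, resident_pages):
    while True:
        try:
            # Dry run: touch every operand without changing memory.
            translate(memory, string_addr, length, table_addr, resident_pages, True)
            break
        except MissingPage as fault:
            resident_pages.add(fault.args[0])  # fetch one missing page; retry
    # Every page is now resident: run the instruction for real.
    translate(memory, string_addr, length, table_addr, resident_pages, False)

# string of 4 bytes on page 0; translation table on page 2, initially absent
memory = {0: 1, 1: 0, 2: 2, 3: 1, 8: 10, 9: 11, 10: 12}
resident = {0}
translate_with_dry_runs(memory, 0, 4, 8, resident)
```

Because the single-processor assumption holds in the model, no page fetched during a dry run can disappear before the real execution, which is exactly the property the later multiprocessor designers lost.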

9.8.3 The Apollo Desktop Computer and the Motorola M68000 Microprocessor

When Apollo Computer designed a desktop computer using the Motorola 68000 microprocessor, the designers, who wanted to add a virtual memory feature, discovered that the microprocessor instruction set interface was not atomic. Worse, because it was constructed entirely on a single chip it could not be modified to do a dry run (as in the IBM 370) or to make it store the internal microprogram state (as in the General Electric 600 line). So the Apollo designers used a different strategy: they installed not one, but two


Motorola 68000 processors. When the first one encounters a missing-page exception, it simply stops in its tracks, and waits for the operand to appear. The second Motorola 68000 (whose program is carefully planned to reside entirely in primary memory) fetches the missing page and then restarts the first processor. Other designers working with the Motorola 68000 used a different, somewhat risky trick: modify all compilers and assemblers to generate only instructions that happen to be atomic. Motorola later produced a version of the 68000 in which all internal state registers of the microprocessor could be saved, the same method used in adding virtual memory to the General Electric 600 line.

Exercises

9.1 Locking up humanities: The registrar’s office is upgrading its scheduling program for limited-enrollment humanities subjects. The plan is to make it multithreaded, but there is concern that having multiple threads trying to update the database at the same time could cause trouble. The program originally had just two operations:

status ← REGISTER (subject_name)
DROP (subject_name)

where subject_name was a string such as “21W471”. The REGISTER procedure checked to see if there was any space left in the subject, and if there was, it incremented the class size by one and returned the status value ZERO. If there was no space, it did not change the class size; instead it returned the status value –1. (This is a primitive registration system—it just keeps counts!) As part of the upgrade, subject_name has been changed to a two-component structure:

structure subject
    string subject_name
    lock slock

and the registrar is now wondering where to apply the locking primitives ACQUIRE and RELEASE.

Here is a typical application program, which registers the caller for two humanities




subjects, hx and hy:

procedure REGISTER_TWO (hx, hy)
    status ← REGISTER (hx)
    if status = 0 then
        status ← REGISTER (hy)
        if status = –1 then
            DROP (hx)
    return status

9.1a. The goal is that the entire procedure REGISTER_TWO should have the before-or-after property. Add calls for ACQUIRE and RELEASE to the REGISTER_TWO procedure that obey the simple locking protocol.

9.1b. Add calls to ACQUIRE and RELEASE that obey the two-phase locking protocol, and in addition postpone all ACQUIREs as late as possible and do all RELEASEs as early as possible.

Louis Reasoner has come up with a suggestion that he thinks could simplify the job of programmers creating application programs such as REGISTER_TWO. His idea is to revise the two programs REGISTER and DROP by having them do the ACQUIRE and RELEASE internally. That is, the procedure:

procedure REGISTER (subject)
    { current code }
    return status

would become instead:

procedure REGISTER (subject)
    ACQUIRE (subject.slock)
    { current code }
    RELEASE (subject.slock)
    return status

9.1c. As usual, Louis has misunderstood some aspect of the problem. Give a brief explanation of what is wrong with this idea. 1995–3–2a…c

9.2 Ben and Alyssa are debating a fine point regarding version history transaction disciplines and would appreciate your help. Ben says that under the mark point transaction discipline, every transaction should call MARK_POINT_ANNOUNCE as soon as possible, or else the discipline won't work. Alyssa claims that everything will come out correct even if no transaction calls MARK_POINT_ANNOUNCE. Who is right? 2006-0-1

9.3 Ben and Alyssa are debating another fine point about the way that the version history transaction discipline bootstraps. The version of NEW_OUTCOME_RECORD given in the text uses TICKET as well as ACQUIRE and RELEASE. Alyssa says this is overkill—it


should be possible to correctly coordinate NEW_OUTCOME_RECORD using just ACQUIRE and RELEASE. Modify the pseudocode of Figure 9.30 to create a version of NEW_OUTCOME_RECORD that doesn’t need the ticket primitive.

9.4 You have been hired by Many-MIPS corporation to help design a new 32-register RISC processor that is to have six-way multiple instruction issue. Your job is to coordinate the interaction among the six arithmetic-logic units (ALUs) that will be running concurrently. Recalling the discussion of coordination, you realize that the first thing you must do is decide what constitutes “correct” coordination for a multiple-instruction-issue system. Correct coordination for concurrent operations on a database was said to be: No matter in what order things are actually calculated, the final result is always guaranteed to be one that could have been obtained by some sequential ordering of the concurrent operations. You have two goals: (1) maximum performance, and (2) not surprising a programmer who wrote a program expecting it to be executed on a single-instruction-issue machine. Identify the best coordination correctness criterion for your problem.

A. Multiple instruction issue must be restricted to sequences of instructions that have non-overlapping register sets.

B. No matter in what order things are actually calculated, the final result is always guaranteed to be one that could have been obtained by some sequential ordering of the instructions that were issued in parallel.

C. No matter in what order things are actually calculated, the final result is always guaranteed to be the one that would have been obtained by the original ordering of the instructions that were issued in parallel.

D. The final result must be obtained by carrying out the operations in the order specified by the original program.

E. No matter in what order things are actually calculated, the final result is always guaranteed to be one that could have been obtained by some set of instructions carried out sequentially.

F. The six ALUs do not require any coordination.

1997–0–02

9.5 In 1968, IBM introduced the Information Management System (IMS) and it soon became one of the most widely used database management systems in the world. In fact, IMS is still in use today. At the time of introduction IMS used a before-or-after atomicity protocol consisting of the following two rules:

• A transaction may read only data that has been written by previously committed transactions.
• A transaction must acquire a lock for every data item that it will write.

Saltzer & Kaashoek Ch. 9, p. 100

June 25, 2009 8:22 am



Consider the following two transactions, which, for the interleaving shown, both adhere to the protocol:

 step      T1                 T2
  1    BEGIN(t1)
  2    ACQUIRE(y.lock)
  3    temp1 ← x
  4                       BEGIN(t2)
  5                       ACQUIRE(x.lock)
  6                       temp2 ← y
  7                       x ← temp2
  8    y ← temp1
  9    COMMIT(t1)

Previously committed transactions had set x ← 3 and y ← 4.

9.5a. After both transactions complete, what are the values of x and y? In what sense is this answer wrong? 1982–3–3a

9.5b. In the mid-1970’s, this flaw was noticed, and the before-or-after atomicity protocol was replaced with a better one, despite a lack of complaints from customers. Explain why customers may not have complained about the flaw. 1982–3–3b

9.6 A system that attempts to make actions all-or-nothing writes the following types of records to a log maintained on non-volatile storage:

• action i starts.
• action i writes the value new over the value old for the variable x.
• action i commits.
• action i aborts.
• At this checkpoint, actions i, j, … are pending.

Actions start in numerical order. A crash occurs, and the recovery procedure finds



CHAPTER 9 Atomicity: All-or-Nothing and Before-or-After

the following log records starting with the last checkpoint:

<53, y, 5, 6>
<53, x, 5, 9>
<54, y, 6, 4>
<55, z, 3, 4>
<51, q, 1, 9>
<55, y, 4, 3>
<55, y, 3, 7>
<56, x, 9, 2>
<56, w, 0, 1>
<57, u, 2, 1>
****************** crash happened here **************

9.6a. Assume that the system is using a rollback recovery procedure. How much farther back in the log should the recovery procedure scan?

9.6b. Assume that the system is using a roll-forward recovery procedure. How much farther back in the log should the recovery procedure scan?

9.6c. Which operations mentioned in this part of the log are winners and which are losers?

9.6d. What are the values of x and y immediately after the recovery procedure finishes? Why?

1994–3–3

9.7 The log of exercise 9.6 contains (perhaps ambiguous) evidence that someone didn’t follow coordination rules. What is that evidence? 1994–3–4

9.8 Roll-forward recovery requires writing the commit (or abort) record to the log before doing any installs to cell storage. Identify the best reason for this requirement.

A. So that the recovery manager will know what to undo.
B. So that the recovery manager will know what to redo.
C. Because the log is less likely to fail than the cell storage.
D. To minimize the number of disk seeks required.

1994–3–5




9.9 Two-phase locking within transactions ensures that

A. No deadlocks will occur.
B. Results will correspond to some serial execution of the transactions.
C. Resources will be locked for the minimum possible interval.
D. Neither gas nor liquid will escape.
E. Transactions will succeed even if one lock attempt fails.

1997–3–03

9.10 Pat, Diane, and Quincy are having trouble using e-mail to schedule meetings. Pat suggests that they take inspiration from the 2-phase commit protocol.

9.10a. Which of the following protocols most closely resembles 2-phase commit?

I.   a. Pat requests everyone’s schedule openings.
     b. Everyone replies with a list but does not guarantee to hold all the times available.
     c. Pat inspects the lists and looks for an open time. If there is a time, Pat chooses a meeting time and sends it to everyone. Otherwise, Pat sends a message canceling the meeting.

II.  a–c, as in protocol I.
     d. Everyone, if they received the second message, acknowledge receipt. Otherwise, send a message to Pat asking what happened.

III. a–c, as in protocol I.
     d. Everyone, if their calendar is still open at the chosen time, send Pat an acknowledgment. Otherwise, send Pat apologies.
     e. Pat collects the acknowledgments. If all are positive, send a message to everyone saying the meeting is ON. Otherwise, send a message to everyone saying the meeting is OFF.
     f. Everyone, if they received the ON/OFF message, acknowledge receipt. Otherwise, send a message to Pat asking what happened.

IV.  a–f, as in protocol III.
     g. Pat sends a message telling everyone that everyone has confirmed.
     h. Everyone acknowledges the confirmation.

9.10b. For the protocol you selected, which step commits the meeting time?

1994–3–7




9.11 Alyssa P. Hacker needs a transaction processing system for updating information about her collection of 97 cockroaches.*

9.11a. In her first design, Alyssa stores the database on disk. When a transaction commits, it simply goes to the disk and writes its changes in place over the old data. What are the major problems with Alyssa’s system?

9.11b. In Alyssa’s second design, the only structure she keeps on disk is a log, with a reference copy of all data in volatile RAM. The log records every change made to the database, along with the transaction which the change was a part of. Commit records, also stored in the log, indicate when a transaction commits. When the system crashes and recovers, it replays the log, redoing each committed transaction, to reconstruct the reference copy in RAM. What are the disadvantages of Alyssa’s second design?

To speed things up, Alyssa makes an occasional checkpoint of her database. To checkpoint, Alyssa just writes the entire state of the database into the log. When the system crashes, she starts from the last checkpointed state, and then redoes or undoes some transactions to restore her database. Now consider the five transactions in the illustration:

[Illustration: a timeline of transactions T1 through T5, showing each transaction’s begin and commit relative to the last checkpoint and the crash.]

Transactions T2, T3, and T5 committed before the crash, but T1 and T4 were still pending.

* Credit for developing exercise 9.11 goes to Eddie Kohler.




mark off in the table whether that transaction needs to be undone, redone, or neither.

        Undone    Redone    Neither
  T1
  T2
  T3
  T4
  T5

9.11d. Now, assume that transactions T2 and T3 were actually nested transactions: T2 was nested in T1, and T3 was nested in T2. Again, fill in the table.

        Undone    Redone    Neither
  T1
  T2
  T3
  T4
  T5


9.12 Alice is acting as the coordinator for Bob and Charles in a two-phase commit protocol. Here is a log of the messages that pass among them:

1. Alice ⇒ Bob: please do X
2. Alice ⇒ Charles: please do Y
3. Bob ⇒ Alice: done with X
4. Charles ⇒ Alice: done with Y
5. Alice ⇒ Bob: PREPARE to commit or abort
6. Alice ⇒ Charles: PREPARE to commit or abort
7. Bob ⇒ Alice: PREPARED
8. Charles ⇒ Alice: PREPARED
9. Alice ⇒ Bob: COMMIT
10. Alice ⇒ Charles: COMMIT

At which points in this sequence is it OK for Bob to abort his part of the




transaction?

A. After Bob receives message 1 but before he sends message 3.
B. After Bob sends message 3 but before he receives message 5.
C. After Bob receives message 5 but before he sends message 7.
D. After Bob sends message 7 but before he receives message 9.
E. After Bob receives message 9.

2008–3–11

Additional exercises relating to Chapter 9 can be found in problem sets 29 through 40.





CHAPTER CONTENTS
Overview........................................................................................10–2
10.1 Constraints and Interface Consistency ..................................10–2
10.2 Cache Coherence ...................................................................10–4
  10.2.1 Coherence, Replication, and Consistency in a Cache .............. 10–4
  10.2.2 Eventual Consistency with Timer Expiration ......................... 10–5
  10.2.3 Obtaining Strict Consistency with a Fluorescent Marking Pen .. 10–7
  10.2.4 Obtaining Strict Consistency with the Snoopy Cache ............. 10–7
10.3 Durable Storage Revisited: Widely Separated Replicas..........10–9
  10.3.1 Durable Storage and the Durability Mantra .......................... 10–9
  10.3.2 Replicated State Machines ................................................10–11
  10.3.3 Shortcuts to Meet More Modest Requirements .....................10–13
  10.3.4 Maintaining Data Integrity ................................................10–15
  10.3.5 Replica Reading and Majorities ..........................................10–16
  10.3.6 Backup ..........................................................................10–17
  10.3.7 Partitioning Data .............................................................10–18
10.4 Reconciliation......................................................................10–19
  10.4.1 Occasionally Connected Operation .....................................10–20
  10.4.2 A Reconciliation Procedure ................................................10–22
  10.4.3 Improvements ................................................................10–25
  10.4.4 Clock Coordination ..........................................................10–26
10.5 Perspectives........................................................................10–26
  10.5.1 History ..........................................................................10–27
  10.5.2 Trade-Offs ......................................................................10–28
  10.5.3 Directions for Further Study ..............................................10–31
Exercises......................................................................................10–32
Glossary for Chapter 10 ...............................................................10–35
Index of Chapter 10 .....................................................................10–37
Last chapter page 10–38




CHAPTER 10 Consistency

Overview

The previous chapter developed all-or-nothing atomicity and before-or-after atomicity, two properties that define a transaction. This chapter introduces or revisits several applications that can make use of transactions. Section 10.1 introduces constraints and discusses how transactions can be used to maintain invariants and implement memory models that provide interface consistency. Sections 10.2 and 10.3 develop techniques used in two different application areas, caching and geographically distributed replication, to achieve higher performance and greater durability, respectively. Section 10.4 discusses reconciliation, which is a way of restoring the constraint that replicas be identical if their contents should drift apart. Finally, Section 10.5 considers some perspectives relating to Chapters 9[on-line] and 10.

10.1 Constraints and Interface Consistency

One common use for transactions is to maintain constraints. A constraint is an application-defined requirement that every update to a collection of data preserve some specified invariant. Different applications can have quite different constraints. Here are some typical constraints that a designer might encounter:

• Table management: The variable that tells the number of entries should equal the number of entries actually in the table.
• Double-linked list management: The forward pointer in a list cell, A, should refer to a list cell whose back pointer refers to A.
• Disk storage management: Every disk sector should be assigned either to the free list or to exactly one file.
• Display management: The pixels on the screen should match the description in the display list.
• Replica management: A majority (or perhaps all) of the replicas of the data should be identical.
• Banking: The sum of the balances of all credit accounts should equal the sum of the balances of all debit accounts.
• Process control: At least one of the valves on the boiler should always be open.

As was seen in Chapter 9[on-line], maintaining a constraint over data within a single file can be relatively straightforward, for example by creating a shadow copy. Maintaining constraints across data that is stored in several files is harder, and that is one of the primary uses of transactions. Finally, two-phase commit allows maintaining a constraint that involves geographically separated files despite the hazards of communication.

A constraint usually involves more than one variable data item, in which case an update action by nature must be composite—it requires several steps. In the midst of those steps, the data will temporarily be inconsistent. In other words, there will be times when the data violates the invariant. During those times, there is a question about what




to do if someone—another thread or another client—asks to read the data. This question is one of interface, rather than of internal operation, and it reopens the discussion of memory coherence and data consistency models introduced in Chapter 2. Different designers have developed several data consistency models to deal with this inevitable temporary inconsistency. In this chapter we consider two of those models: strict consistency and eventual consistency.

The first model, strict consistency, hides the constraint violation behind modular boundaries. Strict consistency means that actions outside the transaction performing the update will never see data that is inconsistent with the invariant. Since strict consistency is an interface concept, it depends on actions honoring abstractions, for example by using only the intended reading and writing operations. Thus, for a cache, read/write coherence is a strict consistency specification: “The result of a READ of a named object is always the value that was provided by the most recent WRITE to that object”. This specification does not demand that the replica in the cache always be identical to the replica in the backing store; it requires only that the cache deliver data at its interface that meets the specification.

Applications can maintain strict consistency by using transactions. If an action is all-or-nothing, the application can maintain the outward appearance of consistency despite failures, and if an action is before-or-after, the application can maintain the outward appearance of consistency despite the existence of other actions concurrently reading or updating the same data. Designers generally strive for strict consistency in any situation where inconsistent results can cause confusion, such as in a multiprocessor system, and in situations where mistakes can have serious negative consequences, for example in banking and safety-critical systems.
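To make the banking constraint concrete, here is a small sketch of our own (all names are hypothetical, not from the text): a composite transfer necessarily violates the invariant that credits equal debits partway through, and a lock supplies before-or-after atomicity so that no other thread can observe the intermediate state.

```python
import threading

# Hypothetical sketch: the invariant is that the sum of credit-account
# balances equals the sum of debit-account balances.  A payment is a
# composite action of two writes; between them the invariant is
# violated, so a lock keeps other threads from looking.

balances = {"debit:checking": 100, "credit:loan": 100}
lock = threading.Lock()

def invariant_holds():
    credits = sum(v for k, v in balances.items() if k.startswith("credit:"))
    debits = sum(v for k, v in balances.items() if k.startswith("debit:"))
    return credits == debits

def pay_down_loan(amount):
    with lock:                          # before-or-after: hide the update
        balances["debit:checking"] -= amount
        # the invariant is violated right here, but no one else can see it
        balances["credit:loan"] -= amount

pay_down_loan(30)
print(invariant_holds())   # -> True
print(balances)
```

Without the lock, a concurrent thread could read the balances between the two writes and observe the violated invariant; with it, every observer sees the data either before or after the whole transfer.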
Section 9.1.6 mentioned two other consistency models, sequential consistency and external time consistency. Both are examples of strict consistency.

The second, more lightweight, way of dealing with temporary inconsistency is called eventual consistency. Eventual consistency means that after a data update the constraint may not hold until some unspecified time in the future. An observer may, using the standard interfaces, discover that the invariant is violated, and different observers may even see different results. But the system is designed so that once updates stop occurring, it will make a best effort drive toward the invariant.

Eventual consistency is employed in situations where performance or availability is a high priority and temporary inconsistency is tolerable and can be easily ignored. For example, suppose a Web browser is to display a page from a distant service. The page has both a few paragraphs of text and several associated images. The browser obtains the text immediately, but it will take some time to download the images. The invariant is that the appearance on the screen should match the Web page specification. If the browser renders the text paragraphs first and fills in the images as they arrive, the human reader finds that behavior not only acceptable, but perhaps preferable to staring at the previous screen until the new one is completely ready. When a person can say, “Oh, I see what is happening,” eventual consistency is usually acceptable, and in cases such as the Web browser it can even improve human engineering. For a second example, if a librarian catalogs a new book and places it on the shelf, but the public version of the library catalog doesn't include the new book until the next day, there is an observable inconsistency, but most library patrons would find it tolerable and not particularly surprising.

Eventual consistency is sometimes used in replica management because it allows for relatively loose coupling among the replicas, thus taking advantage of independent failure. In some applications, continuous service is a higher priority than always-consistent answers. If a replica server crashes in the middle of an update, the other replicas may be able to continue to provide service, even though some may have been updated and some may have not. In contrast, a strict consistency algorithm may have to refuse to provide service until a crashed replica site recovers, rather than taking a risk of exposing an inconsistency.

The remaining sections of this chapter explore several examples of strict and eventual consistency in action. A cache can be designed to provide either strict or eventual consistency; Section 10.2 provides the details. The Internet Domain Name System, described in Section 4.4 and revisited in Section 10.2.2, relies on eventual consistency in updating its caches, with the result that it can on occasion give inconsistent answers. Similarly, for the geographically replicated durable storage of Section 10.3 a designer can choose either a strict or an eventual consistency model. When replicas are maintained on devices that are only occasionally connected, eventual consistency may be the only choice, in which case reconciliation, the topic of Section 10.4, drives occasionally connected replicas toward eventual consistency. The reader should be aware that these examples do not provide a comprehensive overview of consistency; instead they are intended primarily to create awareness of the issues involved by illustrating a few of the many possible designs.

10.2 Cache Coherence

10.2.1 Coherence, Replication, and Consistency in a Cache

Chapter 6 described the cache as an example of a multilevel memory system. A cache can also be thought of as a replication system whose primary goal is performance, rather than reliability. An invariant for a cache is that the replica of every data item in the primary store (that is, the cache) should be identical to the corresponding replica in the secondary memory. Since the primary and secondary stores usually have different latencies, when an action updates a data value, the replica in the primary store will temporarily be inconsistent with the one in the secondary memory. How well the multilevel memory system hides that inconsistency is the question. A cache can be designed to provide either strict or eventual consistency. Since a cache, together with its backing store, is a memory system, a typical interface specification is that it provide read/write coherence, as defined in Chapter 2, for the entire name space of the cache:




• The result of a read of a named object is always the value of the most recent write to that object.

Read/write coherence is thus a specification that the cache provide strict consistency. A write-through cache provides strict consistency for its clients in a straightforward way: it does not acknowledge that a write is complete until it finishes updating both the primary and secondary memory replicas. Unfortunately, the delay involved in waiting for the write-through to finish can be a performance bottleneck, so write-through caches are not popular.

A non-write-through cache acknowledges that a write is complete as soon as the cache manager updates the primary replica, in the cache. The thread that performed the write can go about its business expecting that the cache manager will eventually update the secondary memory replica and the invariant will once again hold. Meanwhile, if that same thread reads the same data object by sending a READ request to the cache, it will receive the updated value from the cache, even if the cache manager has not yet restored the invariant. Thus, because the cache manager masks the inconsistency, a non-write-through cache can still provide strict consistency.

On the other hand, if there is more than one cache, or other threads can read directly from the secondary storage device, the designer must take additional measures to ensure that other threads cannot discover the violated constraint. If a concurrent thread reads a modified data object via the same cache, the cache will deliver the modified version, and thus maintain strict consistency. But if a concurrent thread reads the modified data object directly from secondary memory, the result will depend on whether or not the cache manager has done the secondary memory update. If the second thread has its own cache, even a write-through design may not maintain consistency because updating the secondary memory does not affect a potential replica hiding in the second thread’s cache.
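The single-cache write-through behavior can be sketched as follows (a toy model with assumed names; a dict stands in for secondary memory). A write is considered complete only after both replicas are updated, so a read through this cache, or directly from the backing store, returns the most recent write.

```python
# Sketch, not the book's code: a minimal write-through cache.

class WriteThroughCache:
    def __init__(self, backing):
        self.backing = backing      # secondary memory, modeled as a dict
        self.cache = {}             # primary replica

    def write(self, name, value):
        self.cache[name] = value
        self.backing[name] = value  # write is complete only after this

    def read(self, name):
        if name not in self.cache:          # miss: fetch and replicate
            self.cache[name] = self.backing[name]
        return self.cache[name]

backing = {"x": 1}
c = WriteThroughCache(backing)
c.write("x", 2)
print(c.read("x"), backing["x"])   # -> 2 2
```

Note that this guarantee holds only for the single-cache case: as the text explains next, a second thread with its own private cache could still hold a stale replica of x that this write never touches.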
Nevertheless, all is not lost. There are at least three ways to regain consistency, two of which provide strict consistency, when there are multiple caches.

10.2.2 Eventual Consistency with Timer Expiration

The Internet Domain Name System, whose basic operation was described in Section 4.4, provides an example of an eventual consistency cache that does not meet the read/write coherence specification. When a client calls on a DNS server to do a recursive name lookup, if the DNS server is successful in resolving the name it caches a copy of the answer as well as any intermediate answers that it received. Suppose that a client asks some local name server to resolve a name such as ginger.pedantic.edu. In the course of doing so, the local name server might accumulate the following name records in its cache:

[Table of cached name records, each with its Internet address: a name server for .edu, a name server for pedantic.edu, and the target host name, ginger.pedantic.edu.]


If the client then asks for another name in pedantic.edu, the local name server will be able to use the cached record for the pedantic.edu name server to ask that name server directly, without having to go back up to the root to find the .edu name server and thence down to pedantic.edu.

Now, suppose that a network manager at Pedantic University changes the Internet address of ginger.pedantic.edu. At some point the manager updates the authoritative record stored in the pedantic.edu name server. The problem is that local DNS caches anywhere in the Internet may still contain the old record of the address of ginger.pedantic.edu. DNS deals with this inconsistency by limiting the lifetime of a cached name record. Recall that every name server record comes with an expiration time, known as the time-to-live (TTL), that can range from seconds to months. A typical time-to-live is one hour; it is measured from the moment that the local name server receives the record. So, until the expiration time, the local cache will be inconsistent with the authoritative version at Pedantic University.

The system will eventually reconcile this inconsistency. When the time-to-live of that record expires, the local name server will handle any further requests for the name by asking the authoritative name server for a new name record. That new name record will contain the new, updated address. So this system provides eventual consistency.

There are two different actions that the network manager at Pedantic University might take to make sure that the inconsistency is not an inconvenience. First, the network manager may temporarily reconfigure the network layer of ginger.pedantic.edu to advertise both the old and the new Internet addresses, and then modify the authoritative DNS record to show the new address. After an hour has passed, all cached DNS records of the old address will have expired, and ginger.pedantic.edu can be reconfigured again, this time to stop advertising the old address.
Alternatively, the network manager may have realized this change is coming, so a few hours in advance he or she modifies just the time-to-live of the authoritative DNS record, say to five minutes, without changing the Internet address. After an hour passes, all cached DNS records of this address will have expired, and any currently cached record will expire in five minutes or less. The manager now changes both the Internet address of the machine and also the authoritative DNS record of that address, and within a few minutes everyone in the Internet will be able to find the new address. Anyone who tries to use an old, cached, address will receive no response. But a retry a few minutes later will succeed, so from the point of view of a network client the outcome is similar to the case in which ginger.pedantic.edu crashes and restarts—for a few minutes the server is non-responsive.

There is a good reason for designing DNS to provide eventual, rather than strict, consistency, and for not requiring read/write coherence. Replicas of individual name records may potentially be cached in any name server anywhere in the Internet—there are thousands, perhaps even millions of such caches. Alerting every name server that might have cached the record that the Internet address of ginger.pedantic.edu changed would be a huge effort, yet most of those caches probably don’t actually have a copy of this particular record. Furthermore, it turns out not to be that important because, as described in the previous paragraph, a network manager can easily mask any temporary inconsistency




by configuring address advertisement or adjusting the time-to-live. Eventual consistency with expiration is an efficient strategy for this job.
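A toy model of this expiration scheme may help (all names, addresses, and the TTL value here are hypothetical, and time is a logical clock passed in by the caller): a cached record may return a stale answer until its time-to-live expires, after which the cache re-fetches the authoritative record and the replicas become consistent.

```python
# Sketch, not DNS itself: a cache whose entries expire after a TTL.

class TTLCache:
    def __init__(self, authoritative, ttl):
        self.authoritative = authoritative   # the zone's name records
        self.ttl = ttl
        self.entries = {}                    # name -> (value, expires_at)

    def lookup(self, name, now):
        entry = self.entries.get(name)
        if entry is None or now >= entry[1]:      # missing or expired
            value = self.authoritative[name]      # fetch a fresh record
            self.entries[name] = (value, now + self.ttl)
            return value
        return entry[0]                           # possibly stale

authoritative = {"ginger.pedantic.edu": "10.0.0.1"}
cache = TTLCache(authoritative, ttl=3600)

print(cache.lookup("ginger.pedantic.edu", now=0))      # caches the record
authoritative["ginger.pedantic.edu"] = "10.0.0.2"      # address changes
print(cache.lookup("ginger.pedantic.edu", now=1800))   # stale: old address
print(cache.lookup("ginger.pedantic.edu", now=3600))   # expired: new address
```

The middle lookup illustrates the violated invariant an observer can see under eventual consistency; the last one shows the cache driving back toward the invariant once the timer expires.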

10.2.3 Obtaining Strict Consistency with a Fluorescent Marking Pen

In certain special situations, it is possible to regain strict consistency, and thus read/write coherence, despite the existence of multiple, private caches: If only a few variables are actually both shared and writable, mark just those variables with a fluorescent marking pen. The meaning of the mark is “don't cache me”. When someone reads a marked variable, the cache manager retrieves it from secondary memory and delivers it to the client, but does not place a replica in the cache. Similarly, when a client writes a marked variable, the cache manager notices the mark in secondary memory and does not keep a copy in the cache. This scheme erodes the performance-enhancing value of the cache, so it would not work well if most variables have don’t-cache-me marks. The World Wide Web uses this scheme for Web pages that may be different each time they are read. When a client asks a Web server for a page that the server has marked “don’t cache me”, the server adds to the header of that page a flag that instructs the browser and any intermediaries not to cache that page.

The Java language includes a slightly different, though closely related, concept, intended to provide read/write coherence despite the presence of caches, variables in registers, and reordering of instructions, all of which can compromise strict consistency when there is concurrency. The Java memory model allows the programmer to declare a variable to be volatile. This declaration tells the compiler to take whatever actions (such as writing registers back to memory, flushing caches, and blocking any instruction reordering features of the processor) might be needed to ensure read/write coherence for the volatile variable within the actual memory model of the underlying system.
Where the fluorescent marking pen marks a variable for special treatment by the memory system, the volatile declaration marks a variable for special treatment by the interpreter.
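As an illustration (a hypothetical marking scheme of our own, not code from the text), two private caches that bypass caching for marked variables stay coherent for exactly those variables, while unmarked variables keep the full performance benefit of the cache:

```python
# Sketch: reads and writes of variables in the "don't cache me" set go
# straight to the shared backing store; everything else is cached.

class MarkedCache:
    def __init__(self, backing, no_cache):
        self.backing = backing
        self.no_cache = no_cache       # the fluorescently marked names
        self.cache = {}

    def read(self, name):
        if name in self.no_cache:
            return self.backing[name]  # always the authoritative replica
        if name not in self.cache:
            self.cache[name] = self.backing[name]
        return self.cache[name]

    def write(self, name, value):
        if name in self.no_cache:
            self.backing[name] = value # no cached replica to go stale
        else:
            self.cache[name] = value   # non-write-through for the rest

backing = {"shared_flag": 0, "private_total": 0}
a = MarkedCache(backing, no_cache={"shared_flag"})
b = MarkedCache(backing, no_cache={"shared_flag"})

a.write("shared_flag", 1)
print(b.read("shared_flag"))   # -> 1: b sees a's write immediately
```

Had shared_flag not been marked, a's write would have stayed in a's private cache and b could have read a stale value indefinitely.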

10.2.4 Obtaining Strict Consistency with the Snoopy Cache

The basic idea of most cache coherence schemes is to somehow invalidate cache entries whenever they become inconsistent with the authoritative replica. One situation where a designer can use this idea is when several processors share the same secondary memory. If the processors could also share the cache, there would be no problem. But a shared cache tends to reduce performance, in two ways. First, to minimize latency the designer would prefer to integrate the cache with the processor, but a shared cache eliminates that option. Second, there must be some mechanism that arbitrates access to the shared cache by concurrent processors. That arbitration mechanism must enforce waits that increase access latency even more. Since the main point of a processor cache is to reduce latency, each processor usually has at least a small private cache.

Making the private cache write-through would ensure that the replica in secondary memory tracks the replica in the private cache. But write-through does not update any


replicas that may be in the private caches of other processors, so by itself it doesn’t provide read/write coherence. We need to add some way of telling those processors to invalidate any replicas their caches hold.

A naive approach would be to run a wire from each processor to the others and specify that whenever a processor writes to memory, it should send a signal on this wire. The other processors should, when they see the signal, assume that something in their cache has changed and, not knowing exactly what, invalidate everything their cache currently holds. Once all caches have been invalidated, the first processor can then confirm completion of its own write. This scheme would work, but it would have a disastrous effect on the cache hit rate. If 20% of processor data references are write operations, each processor will receive signals to invalidate the cache roughly every fifth data reference by each other processor. There would not be much point in having a big cache, since it would rarely have a chance to hold more than half a dozen valid entries.

To avoid invalidating the entire cache, a better idea would be to somehow communicate to the other caches the specific address that is being updated. To rapidly transmit an entire memory address in hardware could require adding a lot of wires. The trick is to realize that there is already a set of wires in place that can do this job: the memory bus. One designs each private cache to actively monitor the memory bus. If the cache notices that anyone else is doing a write operation via the memory bus, it grabs the memory address from the bus and invalidates any copy of data it has that corresponds to that address. A slightly more clever design will also grab the data value from the bus as it goes by and update, rather than invalidate, its copy of that data. These are two variations on what is called the snoopy cache [Suggestions for Further Reading 10.1.1]—each cache is snooping on bus activity. Figure 10.1 illustrates the snoopy cache.

The registers of the various processors constitute a separate concern because they may also contain copies of variables that were in a cache at the time a variable in the cache was invalidated or updated. When a program loads a shared variable into a register, it should be aware that it is shared, and provide coordination, for example through the use of locks, to ensure that no other processor can change (and thus invalidate) a variable that this processor is holding in a register. Locks themselves generally are implemented using write-through, to ensure that cached copies do not compromise the single-acquire protocol.

A small cottage industry has grown up around optimizations of cache coherence protocols for multiprocessor systems both with and without buses, and different designers have invented many quite clever speed-up tricks, especially with respect to locks. Before undertaking a multiprocessor cache design, a prospective processor architect should review the extensive literature of the area. A good place to start is with Chapter 8 of Computer Architecture: A Quantitative Approach, by Hennessy and Patterson [Suggestions for Further Reading 1.1.1].
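The snooping idea can be sketched in software (class names are our own assumptions; a real snoopy cache is hardware that monitors bus address cycles). Each private write-through cache observes every write broadcast on the shared bus and updates, rather than invalidates, any replica it holds of the written address:

```python
# Sketch of the update variant of a snoopy cache.

class Bus:
    def __init__(self):
        self.caches = []
        self.memory = {}                     # shared secondary memory

    def broadcast_write(self, writer, addr, value):
        self.memory[addr] = value            # write-through to memory
        for cache in self.caches:
            if cache is not writer:
                cache.snoop(addr, value)     # the other caches listen in

class SnoopyCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}                      # this processor's private cache
        bus.caches.append(self)

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.broadcast_write(self, addr, value)

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def snoop(self, addr, value):
        if addr in self.lines:               # update rather than invalidate
            self.lines[addr] = value

bus = Bus()
a, b = SnoopyCache(bus), SnoopyCache(bus)
a.write(0x10, 7)
print(b.read(0x10))    # -> 7
a.write(0x10, 8)
print(b.read(0x10))    # -> 8: b's replica was updated by snooping
```

Changing the snoop method to delete the cache line instead of overwriting it gives the invalidate variant; the next read then misses and fetches the fresh value from memory.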

Saltzer & Kaashoek Ch. 10, p. 8

June 24, 2009 12:28 am



10.3 Durable Storage Revisited: Widely Separated Replicas

10.3.1 Durable Storage and the Durability Mantra

Chapter 8[on-line] demonstrated how to create durable storage using a technique called mirroring, and Section 9.7[on-line] showed how to give the mirrored replicas the all-or-nothing property when reading and writing. Mirroring is characterized by writing the replicas synchronously—that is, waiting for all or a majority of the replicas to be written before going on to the next action. The replicas themselves are called mirrors, and they are usually created on a physical unit basis. For example, one common RAID configuration uses multiple disks, on each of which the same data is written to the same numbered sector, and a write operation is not considered complete until enough mirror copies have been successfully written.

Mirroring helps protect against internal failures of individual disks, but it is not a magic bullet. If the application or operating system damages the data before writing it, all the replicas will suffer the same damage. Also, as shown in the fault tolerance analyses in the previous two chapters, certain classes of disk failure can obscure discovery that a replica was not written successfully. Finally, there is a concern for where the mirrors are physically located. Placing replicas at the same physical location does not provide much protection against the threat of environmental faults, such as fire or earthquake. Having them all

[Figure 10.1 diagram: processors A, B, and C, each with a private cache, share a bus that connects them to secondary memory; arrows 1, 2, and 3 trace a write as described in the caption.]

FIGURE 10.1 A configuration for which a snoopy cache can restore strict consistency and read/write coherence. When processor A writes to memory (arrow 1), its write-through cache immediately updates secondary memory using the next available bus cycle (arrow 2). The caches for processors B and C monitor (“snoop on”) the bus address lines, and if they notice a bus write cycle for an address they have cached, they update (or at least invalidate) their replica of the contents of that address (arrow 3).



CHAPTER 10 Consistency

under the same administrative control does not provide much protection against administrative bungling. To protect against these threats, the designer uses a powerful design principle:

The durability mantra

Multiple copies, widely separated and independently administered…


Sidebar 4.5 referred to Ross Anderson’s Eternity Service, a system that makes use of this design principle. Another formulation of the durability mantra is “lots of copies keep stuff safe” [Suggestions for Further Reading 10.2.3]. The idea is not new: “…let us save what remains; not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.”*

The first step in applying this design principle is to separate the replicas geographically. The problem with separation is that communication with distant points has high latency and is also inherently unreliable. Both of those considerations make it problematic to write the replicas synchronously. When replicas are made asynchronously, one of the replicas (usually the first replica to be written) is identified as the primary copy, and the site that writes it is called the master. The remaining replicas are called backup copies, and the sites that write them are called slaves.

The constraint usually specified for replicas is that they should be identical. But when replicas are written at different times, there will be instants when they are not identical; that is, they violate the specified constraint. If a system failure occurs during one of those instants, violation of the constraint can complicate recovery because it may not be clear which replicas are authoritative. One way to regain some simplicity is to organize the writing of the replicas in a way understandable to the application, such as file-by-file or record-by-record, rather than in units of physical storage such as disk sector-by-sector. That way, if a failure does occur during replica writing, it is easier to characterize the state of the replica: some files (or records) of the replica are up to date, some are old, the one that was being written may be damaged, and the application can do any further recovery as needed.
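The recovery-friendly property of file-by-file writing can be illustrated with a small sketch. The push_replica function and the dictionaries standing in for file systems are hypothetical; the point is only that a crash partway through an asynchronous update leaves each file either wholly old or wholly new.

```python
# A minimal sketch of asynchronous, logical (file-by-file) replication: the
# master pushes one file at a time to a backup, so a failure mid-update
# leaves the backup in an easily characterized state (some files new, some
# old) rather than a scrambled disk image.

def push_replica(master_files, backup_files, fail_after=None):
    """Copy the master's files to a backup, one logical file at a time.
    fail_after simulates a crash after that many files have been copied."""
    copied = 0
    for name in sorted(master_files):
        if fail_after is not None and copied >= fail_after:
            return copied                 # simulated failure mid-update
        backup_files[name] = master_files[name]
        copied += 1
    return copied

master = {"a.txt": "v2", "b.txt": "v2", "c.txt": "v2"}
backup = {"a.txt": "v1", "b.txt": "v1", "c.txt": "v1"}

push_replica(master, backup, fail_after=2)
# After the simulated crash, each file is either fully old or fully new:
print(backup)  # {'a.txt': 'v2', 'b.txt': 'v2', 'c.txt': 'v1'}
```

A sector-by-sector physical copy interrupted at the same point could instead leave individual files half-written, a state much harder for the application to recover from.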
Writing replicas in a way understandable to the application is known as making logical copies, to contrast it with the physical copies usually associated with mirrors. Logical copying has the same attractions as logical locking, and also some of the performance disadvantages, because more software layers must be involved and it may require more disk seek arm movement.

In practice, replication schemes can be surprisingly complicated. The primary reason is that the purpose of replication is to suppress unintended changes to the data caused by random decay. But decay suppression also complicates intended changes, since one must

* Letter from Thomas Jefferson to the publisher and historian Ebenezer Hazard, February 18, 1791. Library of Congress, The Thomas Jefferson Papers Series 1. General Correspondence. 1651–1827.




now update more than one copy, while being prepared for the possibility of a failure in the midst of that update. In addition, if updates are frequent, the protocols to perform update must not only be correct and robust, they must also be efficient. Since multiple replicas can usually be read and written concurrently, it is possible to take advantage of that possibility to enhance overall system performance. But performance enhancement can then become a complicating requirement of its own, one that interacts strongly with a requirement for strict consistency.

10.3.2 Replicated State Machines

Data replicas require a management plan. If the data is written exactly once and never again changed, the management plan can be fairly straightforward: make several copies, put them in different places so they will not all be subject to the same environmental faults, and develop algorithms for reading the data that can cope with loss of, disconnection from, and decay of data elements at some sites.

Unfortunately, most real world data need to be updated, at least occasionally, and update greatly complicates management of the replicas. Fortunately, there exists an easily-described, systematic technique to ensure correct management. Unfortunately, it is surprisingly hard to meet all the conditions needed to make it work.

The systematic technique is a sweeping simplification known as the replicated state machine. The idea is to identify the data with the state of a finite state machine whose inputs are the updates to be made to the data, and whose operation is to make the appropriate changes to the data, as illustrated in Figure 10.2. To maintain identical data replicas, co-locate with each of those replicas a replica of the state machine, and send the same inputs to each state machine. Since the state of a finite state machine is at all times determined by its prior state and its inputs, the data of the various replicas will, in principle, perfectly match one another.

The concept is sound, but four real-world considerations conspire to make this method harder than it looks:

1. All of the state machine replicas must receive the same inputs, in the same order. Agreeing on the values and order of the inputs at separated sites is known as achieving consensus. Achieving consensus among sites that do not have a common clock, that can crash independently, and that are separated by a best-effort communication network is a project in itself.
Consensus has received much attention from theorists, who begin by defining its core essence, known as the consensus problem: to achieve agreement on a single binary value. There are various algorithms and protocols designed to solve this problem under specified conditions, as well as proofs that with certain kinds of failures consensus is impossible to reach. When conditions permit solving the core consensus problem, a designer can then apply bootstrapping to come to agreement on the complete set of values and order of inputs to a set of replicated state machines.




2. All of the data replicas (in Figure 10.2, the “prior state”) must be identical. The problem is that random decay events can cause the data replicas to drift apart, and updates that occur when they have drifted can cause them to drift further apart. So there needs to be a plan to check for this drift and correct it. The mechanism that identifies such differences and corrects them is known as reconciliation.

3. The replicated state machines must also be identical. This requirement is harder to achieve than it might at first appear. Even if all the sites run copies of the same program, the operating environment surrounding that program may affect its behavior, and there can be transient faults that affect the operation of individual state machines differently. Since the result is again that the data replicas drift apart, the same reconciliation mechanism that fights decay may be able to handle this problem.

4. To the extent that the replicated state machines really are identical, they will contain identical implementation faults. Updates that cause the faults to produce errors in the data will damage all the replicas identically, and reconciliation can neither detect nor correct the errors.


[Figure 10.2 diagram: update requests #1 and #2 arrive in the same order (2, then 1) at Sites 1, 2, and 3; at each site a state machine transforms its prior state into the same new state.]
FIGURE 10.2 Replicated state machines. If N identical state machines that all have the same prior state receive and perform the same update requests in the same order, then all N of the machines will enter the same new state.




The good news is that the replicated state machine scheme is not only systematic but also lends itself to modularization. One module can implement the consensus-achieving algorithm; a second set of modules, the state machines, can perform the actual updates; and a third module, responsible for reconciliation, can periodically review the data replicas to verify that they are identical and, if necessary, initiate repairs to keep them that way.
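Setting the consensus and reconciliation modules aside, the state-machine module itself can be sketched as a deterministic transition function applied at every site. The function names and the key-value updates below are illustrative assumptions; a consensus layer is presumed to have already delivered the same inputs in the same order to each site, which is the hard part the text describes.

```python
# A minimal illustration of the replicated state machine idea: N sites apply
# the same deterministic transition function to the same ordered inputs, so
# their states match; reordering the inputs at one site makes it drift.

def apply_update(state, update):
    # deterministic transition: prior state + input -> new state
    key, value = update
    new_state = dict(state)
    new_state[key] = value
    return new_state

def run_site(updates, initial=None):
    state = dict(initial or {})
    for u in updates:
        state = apply_update(state, u)
    return state

agreed_order = [("x", 1), ("y", 2), ("x", 3)]   # output of the consensus layer

site1 = run_site(agreed_order)
site2 = run_site(agreed_order)
site3 = run_site(agreed_order)
print(site1 == site2 == site3)   # True: identical inputs, identical states

# If one site sees the updates in a different order, its replica drifts,
# which is exactly the condition the reconciliation module must detect:
drifted = run_site(reversed(agreed_order))
print(drifted == site1)          # False: ("x", 1) now overwrites ("x", 3)
```

The sketch also hints at the fourth consideration above: a bug in apply_update would corrupt all sites identically, and comparing replicas would reveal nothing wrong.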

10.3.3 Shortcuts to Meet More Modest Requirements

The replicated state machine method is systematic, elegant, and modular, but its implementation requirements are severe. At the other end of the spectrum, some applications can get along with a much simpler method: implement just a single state machine. The idea is to carry out all updates at one replica site, generating a new version of the database at that site, and then somehow bring the other replicas into line. The simplest, brute force scheme is to send a copy of this new version of the data to each of the other replica sites, completely replacing their previous copies. This scheme is a particularly simple example of master/slave replication. One of the things that makes it simple is that there is no need for consultation among sites; the master decides what to do and the slaves just follow along.

The single state machine with brute force copies works well if:

• The data need to be updated only occasionally.

• The database is small enough that it is practical to retransmit it in its entirety.

• There is no urgency to make updates available, so the master can accumulate updates and perform them in batches.

• The application can get along with temporary inconsistency among the various replicas.

Requiring clients to read from the master replica is one way to mask the temporary inconsistency. On the other hand, if, for improved performance, clients are allowed to read from any available replica, then during an update a client reading from a replica that has already received the update may obtain different answers than a client reading from a replica to which the update hasn’t propagated yet.

This method is subject to data decay, just as is the replicated state machine, but the effects of decay are different. Undetected decay of the master replica can lead to a disaster in which the decay is propagated to the slave replicas.
On the other hand, since update installs a complete new copy of the data at each slave site, it incidentally blows away any accumulated decay errors in slave replicas, so if update is frequent, it is usually not necessary to provide reconciliation. If updates are so infrequent that replica decay is a hazard, the master can simply do an occasional dummy update with unchanged data to reconcile the replicas.

The main defect of the single state machine is that even though data access can be fault tolerant—if one replica goes down, the others may still be available for reading—data update is not: if the primary site fails, no updates are possible until that failure is detected




and repaired. Worse, if the primary site fails while in the middle of sending out an update, the replicas may remain inconsistent until the primary site recovers. This whole approach doesn't work well for some applications, such as a large database with a requirement for strict consistency and a performance goal that can be met only by allowing concurrent reading of the replicas.

Despite these problems, the simplicity is attractive, and in practice many designers try to get away with some variant of the single state machine method, typically tuned up with one or more enhancements:

• The master site can distribute just those parts of the database that changed (the updates are known as “deltas” or “diffs”) to the replicas. Each replica site must then run an engine that can correctly update the database using the information in the deltas. This scheme moves back across the spectrum in the direction of the replicated state machine. Though it may produce a substantial performance gain, such a design can end up with the disadvantages of both the single and the replicated state machines.

• Devise methods to reduce the size of the time window during which replicas may appear inconsistent to reading clients. For example, the master could hold the new version of the database in a shadow copy, and ask the slave sites to do the same, until all replicas of the new version have been successfully distributed. Then, short messages can tell the slave sites to make the shadow file the active database. (This model should be familiar: a similar idea was used in the design of the two-phase commit protocol described in Chapter 9[on-line].)

• If the database is large, partition it into small regions, each of which can be updated independently. Section 10.3.7, below, explores this idea in more depth. (The Internet Domain Name System is for the most part managed as a large number of small, replicated partitions.)
• Assign a different master to each partition, to distribute the updating work more evenly and increase availability of update.

• Add fault tolerance for data update when a master site fails by using a consensus algorithm to choose a new master site.

• If the application is one in which the data is insensitive to the order of updates, implement a replicated state machine without a consensus algorithm. This idea can be useful if the only kind of update is to add new records to the data and the records are identified by their contents, rather than by their order of arrival. Members of a workgroup collaborating by e-mail typically see messages from other group members this way. Different users may find that received messages appear in different orders, and may even occasionally see one member answer a question that another member apparently hasn’t yet asked, but if the e-mail system is working correctly, eventually everyone sees every message.




• The master site can distribute just its update log to the replica sites. The replica sites can then run REDO on the log entries to bring their database copies up to date. Alternatively, a replica site might just maintain a complete log replica rather than the database itself. In the case of a disaster at the master site, one of the log replicas can then be used to reconstruct the database.

This list just touches the surface. There seems to be an unlimited number of variations in application-dependent ways of doing replication.
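The last enhancement, shipping the update log and running REDO at each replica, might look roughly like the sketch below. The (sequence number, key, value) log format and the function names are assumptions for illustration, not the book's design; sequence numbers let a replica skip entries it has already applied.

```python
# A sketch of log shipping: the master distributes its update log, and each
# replica replays (REDOes) only the entries it has not yet applied.

def redo(replica, log, applied_upto):
    """Replay log entries beyond the replica's high-water mark.
    Entries are (sequence_number, key, value); returns the new mark."""
    for seq, key, value in log:
        if seq > applied_upto:
            replica[key] = value
            applied_upto = seq
    return applied_upto

master_log = [(1, "balance", 100), (2, "owner", "alice"), (3, "balance", 90)]

replica = {}
mark = redo(replica, master_log[:2], 0)   # first shipment of the log
mark = redo(replica, master_log, mark)    # later shipment; old entries skipped
print(replica)  # {'balance': 90, 'owner': 'alice'}
```

Because already-applied entries are skipped, re-sending an overlapping portion of the log is harmless, which simplifies recovery when a shipment may or may not have arrived.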

10.3.4 Maintaining Data Integrity

In updating a replica, many things can go wrong: data records can be damaged or even completely lost track of in memory buffers of the sending or receiving systems, transmission can introduce errors, and operators or administrators can make blunders, to name just some of the added threats to data integrity. The durability mantra suggests imposing physical and administrative separation of replicas to make threats to their integrity more independent, but the threats still exist.

The obvious way to counter these threats to data integrity is to apply the method suggested on page 9–94 to counter spontaneous data decay: plan to periodically compare replicas, doing so often enough that it is unlikely that all of the replicas have deteriorated. However, when replicas are not physically adjacent this obvious method has the drawback that bit-by-bit comparison requires transmission of a complete copy of the data from one replica site to another, an activity that can be time-consuming and possibly expensive.

An alternative and less costly method that can be equally effective is to calculate a witness of the contents of a replica and transmit just that witness from one site to another. The usual form for a witness is a hash value that is calculated over the content of the replica, thus attesting to that content. By choosing a good hash algorithm (for example, a cryptographic quality hash such as described in Sidebar 11.7) and making the witness sufficiently long, the probability that a damaged replica will have a hash value that matches the witness can be made arbitrarily small. A witness can thus stand in for a replica for purposes of confirming data integrity or detecting its loss. The idea of using witnesses to confirm or detect loss of data integrity can be applied in many ways.
We have already seen checksums used in communications, both for end-to-end integrity verification (page 7–31) and in the link layer (page 7–40); checksums can be thought of as weak witnesses. For another example of the use of witnesses, a file system might calculate a separate witness for each newly written file, and store a copy of the witness in the directory entry for the file. When later reading the file, the system can recalculate the hash and compare the result with the previously stored witness to verify the integrity of the data in the file. Two sites that are supposed to be maintaining replicas of the file system can verify that they are identical by exchanging and comparing lists of witnesses. In Chapter 11[on-line] we will see that by separately protecting a witness one can also counter threats to data integrity that are posed by an adversary.
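A minimal sketch of witness computation, using SHA-256 as one example of a cryptographic-quality hash. The per-file witness lists mirror the file system example above; the file names and contents are illustrative.

```python
# Witnesses: compute a hash over each replica's content and compare just the
# short witnesses, instead of shipping the full replicas between sites.

import hashlib

def witness(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

# Each site computes a witness per file and exchanges only these lists:
site_a = {"notes.txt": b"multiple copies keep stuff safe"}
site_b = {"notes.txt": b"multiple copies keep stuff safe"}

witnesses_a = {name: witness(data) for name, data in site_a.items()}
witnesses_b = {name: witness(data) for name, data in site_b.items()}
print(witnesses_a == witnesses_b)  # True: witnesses attest to identical content

# A decayed replica is detected without transmitting the file itself:
site_b["notes.txt"] = b"multiple copies keep stuff safX"
print(witness(site_b["notes.txt"]) == witnesses_a["notes.txt"])  # False
```

Each witness here is 32 bytes (64 hex digits) regardless of file size, which is what makes exchanging witness lists so much cheaper than exchanging the replicas themselves.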




10.3.5 Replica Reading and Majorities

So far, we have explored various methods of creating replicas, but not how to use them. The simplest plan, with a master/slave system, is to direct all client read and write requests to the primary copy located at the master site, and treat the slave replicas exclusively as backups whose only use is to restore the integrity of a damaged master copy. What makes this plan simple is that the master site is in a good position to keep track of the ordering of read and write requests, and thus enforce a strict consistency specification such as the usual one for memory coherence: that a read should return the result of the most recent write.

A common enhancement to a replica system, intended to increase availability for read requests, is to allow reads to be directed to any replica, so that the data continues to be available even when the master site is down. In addition to improving availability, this enhancement may also have a performance advantage, since the several replicas can probably provide service to different clients at the same time. Unfortunately, the enhancement has the complication that there will be instants during update when the several replicas are not identical, so different readers may obtain different results, a violation of the strict consistency specification. To restore strict consistency, some mechanism that ensures before-or-after atomicity between reads and updates would be