Principles of Computer System Design

An Introduction

Part II Chapters 7–11

Jerome H. Saltzer M. Frans Kaashoek Massachusetts Institute of Technology

Version 5.0


Copyright © 2009 by Jerome H. Saltzer and M. Frans Kaashoek. Some Rights Reserved.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. For more information on what this license means, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which the authors are aware of a claim, the product names appear in initial capital or all capital letters. All trademarks that appear or are otherwise referred to in this work belong to their respective owners.

Suggestions, Comments, Corrections, and Requests to waive license restrictions: Please send correspondence by electronic mail to: [email protected] and [email protected]


Contents


PART I [In Printed Textbook]

List of Sidebars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix

Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxvii

Where to Find Part II and other On-line Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxxvii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xxxix

Computer System Design Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xliii

CHAPTER 1 Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1

Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1. Systems and Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.1 Common Problems of Systems in Many Fields . . . . . . . . . . . . . . . . . . 3

1.1.2 Systems, Components, Interfaces and Environments . . . . . . . . . . . . . 8

1.1.3 Complexity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.2. Sources of Complexity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.2.1 Cascading and Interacting Requirements . . . . . . . . . . . . . . . . . . . . . 13

1.2.2 Maintaining High Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.3. Coping with Complexity I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1.3.1 Modularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1.3.2 Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

1.3.3 Layering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

1.3.4 Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.3.5 Putting it Back Together: Names Make Connections . . . . . . . . . . . . 26

1.4. Computer Systems are the Same but Different . . . . . . . . . . . . . . . . . . . . 27

1.4.1 Computer Systems Have no Nearby Bounds on Composition . . . . . 28

1.4.2 d(technology)/dt is Unprecedented. . . . . . . . . . . . . . . . . . . . . . . . . . 31

1.5. Coping with Complexity II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

1.5.1 Why Modularity, Abstraction, Layering, and Hierarchy aren’t Enough . . . . . . . . . . . . . . . . . . . . . 36

1.5.2 Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

1.5.3 Keep it Simple . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

What the Rest of this Book is about . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


CHAPTER 2 Elements of Computer System Organization . . . . . . . . . . . . . . 43

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44

2.1. The Three Fundamental Abstractions . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2.1.1 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .45

2.1.2 Interpreters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .53

2.1.3 Communication Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59

2.2. Naming in Computer Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

2.2.1 The Naming Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .61

2.2.2 Default and Explicit Context References . . . . . . . . . . . . . . . . . . . . . .66

2.2.3 Path Names, Naming Networks, and Recursive Name Resolution . . .71

2.2.4 Multiple Lookup: Searching through Layered Contexts . . . . . . . . . . .73

2.2.5 Comparing Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .75

2.2.6 Name Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .76

2.3. Organizing Computer Systems with Names and Layers . . . . . . . . . . . . . . 78

2.3.1 A Hardware Layer: The Bus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80

2.3.2 A Software Layer: The File Abstraction . . . . . . . . . . . . . . . . . . . . . . .87

2.4. Looking Back and Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

2.5. Case Study: UNIX® File System Layering and Naming . . . . . . . . . . . . . . 91

2.5.1 Application Programming Interface for the UNIX File System . . . . . .91

2.5.2 The Block Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93

2.5.3 The File Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .95

2.5.4 The Inode Number Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .96

2.5.5 The File Name Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .96

2.5.6 The Path Name Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .98

2.5.7 Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .99

2.5.8 Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .101

2.5.9 The Absolute Path Name Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . .102

2.5.10 The Symbolic Link Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .104

2.5.11 Implementing the File System API . . . . . . . . . . . . . . . . . . . . . . . .106

2.5.12 The Shell, Implied Contexts, Search Paths, and Name Discovery .110

2.5.13 Suggestions for Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . .112

Exercises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .112

CHAPTER 3 The Design of Naming Schemes . . . . . . . . . . . . . . . . . . . . . . . 115

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .115

3.1. Considerations in the Design of Naming Schemes . . . . . . . . . . . . . . . . 116

3.1.1 Modular Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .116


3.1.2 Metadata and Name Overloading . . . . . . . . . . . . . . . . . . . . . . . . . . 120

3.1.3 Addresses: Names that Locate Objects . . . . . . . . . . . . . . . . . . . . . . 122

3.1.4 Generating Unique Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

3.1.5 Intended Audience and User-Friendly Names. . . . . . . . . . . . . . . . . 127

3.1.6 Relative Lifetimes of Names, Values, and Bindings . . . . . . . . . . . . . 129

3.1.7 Looking Back and Ahead: Names are a Basic System Component . 131

3.2. Case Study: The Uniform Resource Locator (URL) . . . . . . . . . . . . . . . 132

3.2.1 Surfing as a Referential Experience; Name Discovery . . . . . . . . . . . 132

3.2.2 Interpretation of the URL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

3.2.3 URL Case Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

3.2.4 Wrong Context References for a Partial URL . . . . . . . . . . . . . . . . . 135

3.2.5 Overloading of Names in URLs . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

3.3. War Stories: Pathologies in the Use of Names. . . . . . . . . . . . . . . . . . . . 138

3.3.1 A Name Collision Eliminates Smiling Faces . . . . . . . . . . . . . . . . . . 139

3.3.2 Fragile Names from Overloading, and a Market Solution . . . . . . . . 139

3.3.3 More Fragile Names from Overloading, with Market Disruption . . 140

3.3.4 Case-Sensitivity in User-Friendly Names . . . . . . . . . . . . . . . . . . . . 141

3.3.5 Running Out of Telephone Numbers . . . . . . . . . . . . . . . . . . . . . . . 142

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

CHAPTER 4 Enforcing Modularity with Clients and Services . . . . . . . . . . .147

Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

4.1. Client/service organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

4.1.1 From soft modularity to enforced modularity . . . . . . . . . . . . . . . . . 149

4.1.2 Client/service organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

4.1.3 Multiple clients and services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

4.1.4 Trusted intermediaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

4.1.5 A simple example service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

4.2. Communication between client and service . . . . . . . . . . . . . . . . . . . . . 167

4.2.1 Remote procedure call (RPC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

4.2.2 RPCs are not identical to procedure calls . . . . . . . . . . . . . . . . . . . . 169

4.2.3 Communicating through an intermediary . . . . . . . . . . . . . . . . . . . 172

4.3. Summary and the road ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

4.4. Case study: The Internet Domain Name System (DNS) . . . . . . . . . . . 175

4.4.1 Name resolution in DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

4.4.2 Hierarchical name management . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

4.4.3 Other features of DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181


4.4.4 Name discovery in DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .183

4.4.5 Trustworthiness of DNS responses . . . . . . . . . . . . . . . . . . . . . . . . .184

4.5. Case study: The Network File System (NFS). . . . . . . . . . . . . . . . . . . . . 184

4.5.1 Naming remote files and directories. . . . . . . . . . . . . . . . . . . . . . . . .185

4.5.2 The NFS remote procedure calls . . . . . . . . . . . . . . . . . . . . . . . . . . .187

4.5.3 Extending the UNIX file system to support NFS. . . . . . . . . . . . . . . .190

4.5.4 Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .192

4.5.5 NFS version 3 and beyond . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .194

Exercises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .195

CHAPTER 5 Enforcing Modularity with Virtualization . . . . . . . . . . . . . . . . . 199

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .200

5.1. Client/Service Organization within a Computer using Virtualization 201

5.1.1 Abstractions for Virtualizing Computers . . . . . . . . . . . . . . . . . . . . .203

5.1.1.1 Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .204

5.1.1.2 Virtual Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .206

5.1.1.3 Bounded Buffer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .206

5.1.1.4 Operating System Interface . . . . . . . . . . . . . . . . . . . . . . . . . . .207

5.1.2 Emulation and Virtual Machines. . . . . . . . . . . . . . . . . . . . . . . . . . .208

5.1.3 Roadmap: Step-by-Step Virtualization. . . . . . . . . . . . . . . . . . . . . . .208

5.2. Virtual Links using SEND, RECEIVE, and a Bounded Buffer . . . . . . . . . . . . 210

5.2.1 An Interface for SEND and RECEIVE with Bounded Buffers. . . . . . . . . .210

5.2.2 Sequence Coordination with a Bounded Buffer . . . . . . . . . . . . . . . .211

5.2.3 Race Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .214

5.2.4 Locks and Before-or-After Actions. . . . . . . . . . . . . . . . . . . . . . . . . .218

5.2.5 Deadlock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .221

5.2.6 Implementing ACQUIRE and RELEASE . . . . . . . . . . . . . . . . . . . . . . . . . .222

5.2.7 Implementing a Before-or-After Action Using the One-Writer Principle . . . . . . . . . . . . . . . . . . . 225

5.2.8 Coordination between Synchronous Islands with Asynchronous Connections . . . . . . . . . . . . . . 228

5.3. Enforcing Modularity in Memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

5.3.1 Enforcing Modularity with Domains. . . . . . . . . . . . . . . . . . . . . . . .230

5.3.2 Controlled Sharing using Several Domains . . . . . . . . . . . . . . . . . . .231

5.3.3 More Enforced Modularity with Kernel and User Mode . . . . . . . . .234

5.3.4 Gates and Changing Modes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .235

5.3.5 Enforcing Modularity for Bounded Buffers . . . . . . . . . . . . . . . . . . .237


5.3.6 The Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

5.4. Virtualizing Memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

5.4.1 Virtualizing Addresses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

5.4.2 Translating Addresses using a Page Map . . . . . . . . . . . . . . . . . . . . . 245

5.4.3 Virtual Address Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

5.4.3.1 Primitives for Virtual Address Spaces . . . . . . . . . . . . . . . . . . . 248

5.4.3.2 The Kernel and Address Spaces . . . . . . . . . . . . . . . . . . . . . . . 250

5.4.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

5.4.4 Hardware versus Software and the Translation Look-Aside Buffer. . 252

5.4.5 Segments (Advanced Topic) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253

5.5. Virtualizing Processors using Threads . . . . . . . . . . . . . . . . . . . . . . . . . 255

5.5.1 Sharing a processor among multiple threads . . . . . . . . . . . . . . . . . . 255

5.5.2 Implementing YIELD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

5.5.3 Creating and Terminating Threads . . . . . . . . . . . . . . . . . . . . . . . . . 264

5.5.4 Enforcing Modularity with Threads: Preemptive Scheduling . . . . . 269

5.5.5 Enforcing Modularity with Threads and Address Spaces . . . . . . . . . 271

5.5.6 Layering Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271

5.6. Thread Primitives for Sequence Coordination . . . . . . . . . . . . . . . . . . . 273

5.6.1 The Lost Notification Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

5.6.2 Avoiding the Lost Notification Problem with Eventcounts and Sequencers . . . . . . . . . . . . . . . . . 275

5.6.3 Implementing AWAIT, ADVANCE, TICKET, and READ (Advanced Topic) . . . . . . . . . . . . . . . . . 280

5.6.4 Polling, Interrupts, and Sequence coordination. . . . . . . . . . . . . . . . 282

5.7. Case study: Evolution of Enforced Modularity in the Intel x86 . . . . . . 284

5.7.1 The early designs: no support for enforced modularity . . . . . . . . . . 285

5.7.2 Enforcing Modularity using Segmentation . . . . . . . . . . . . . . . . . . . 286

5.7.3 Page-Based Virtual Address Spaces . . . . . . . . . . . . . . . . . . . . . . . . . 287

5.7.4 Summary: more evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288

5.8. Application: Enforcing Modularity using Virtual Machines . . . . . . . . . 290

5.8.1 Virtual Machine Uses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290

5.8.2 Implementing Virtual Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 291

5.8.3 Virtualizing Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294

CHAPTER 6 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .299

Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300


6.1. Designing for Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300

6.1.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .302

6.1.1.1 Capacity, Utilization, Overhead, and Useful Work . . . . . . . . .302

6.1.1.2 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .302

6.1.1.3 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .303

6.1.2 A Systems Approach to Designing for Performance . . . . . . . . . . . . .304

6.1.3 Reducing latency by exploiting workload properties . . . . . . . . . . . .306

6.1.4 Reducing Latency Using Concurrency. . . . . . . . . . . . . . . . . . . . . . .307

6.1.5 Improving Throughput: Concurrency . . . . . . . . . . . . . . . . . . . . . . .309

6.1.6 Queuing and Overload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .311

6.1.7 Fighting Bottlenecks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .313

6.1.7.1 Batching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .314

6.1.7.2 Dallying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .314

6.1.7.3 Speculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .314

6.1.7.4 Challenges with Batching, Dallying, and Speculation . . . . . . .315

6.1.8 An Example: the I/O bottleneck . . . . . . . . . . . . . . . . . . . . . . . . . . .316

6.2. Multilevel Memories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321

6.2.1 Memory Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .322

6.2.2 Multilevel Memory Management using Virtual Memory. . . . . . . . .323

6.2.3 Adding multilevel memory management to a virtual memory . . . . .327

6.2.4 Analyzing Multilevel Memory Systems . . . . . . . . . . . . . . . . . . . . . .331

6.2.5 Locality of reference and working sets . . . . . . . . . . . . . . . . . . . . . . .333

6.2.6 Multilevel Memory Management Policies . . . . . . . . . . . . . . . . . . . .335

6.2.7 Comparative analysis of different policies . . . . . . . . . . . . . . . . . . . .340

6.2.8 Other Page-Removal Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . .344

6.2.9 Other aspects of multilevel memory management . . . . . . . . . . . . . .346

6.3. Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347

6.3.1 Scheduling Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .348

6.3.2 Scheduling metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .349

6.3.3 Scheduling Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .352

6.3.3.1 First-Come, First-Served . . . . . . . . . . . . . . . . . . . . . . . . . . . . .353

6.3.3.2 Shortest-job-first . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .354

6.3.3.3 Round-Robin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .355

6.3.3.4 Priority Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .357

6.3.3.5 Real-time Schedulers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .359


6.3.4 Case study: Scheduling the Disk Arm. . . . . . . . . . . . . . . . . . . . . . . 360

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362

About Part II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369

Appendix A: The Binary Classification Trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371

Suggestions for Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375

Problem Sets for Part I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425

Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475

Index of Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513

Part II [On-Line]

CHAPTER 7 The Network as a System and as a System Component . . . . . . . . . . . . 7–1

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–2

7.1. Interesting Properties of Networks . . . . . . . . . . . . . . . . . . . . . . . . . 7–3

7.1.1 Isochronous and Asynchronous Multiplexing . . . . . . . . . . . . . . . . . 7–5

7.1.2 Packet Forwarding; Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–9

7.1.3 Buffer Overflow and Discarded Packets . . . . . . . . . . . . . . . . . . . . 7–12

7.1.4 Duplicate Packets and Duplicate Suppression . . . . . . . . . . . . . . . . 7–15

7.1.5 Damaged Packets and Broken Links . . . . . . . . . . . . . . . . . . . . . . . 7–18

7.1.6 Reordered Delivery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–19

7.1.7 Summary of Interesting Properties and the Best-Effort Contract . 7–20

7.2. Getting Organized: Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–20

7.2.1 Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–23

7.2.2 The Link Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–25

7.2.3 The Network Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–27

7.2.4 The End-to-End Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–28

7.2.5 Additional Layers and the End-to-End Argument. . . . . . . . . . . . . 7–30

7.2.6 Mapped and Recursive Applications of the Layered Model . . . . . . 7–32

7.3. The Link Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–34

7.3.1 Transmitting Digital Data in an Analog World . . . . . . . . . . . . . . . 7–34

7.3.2 Framing Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–38

7.3.3 Error Handling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–40

7.3.4 The Link Layer Interface: Link Protocols and Multiplexing . . . . . 7–41

7.3.5 Link Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–44


7.4. The Network Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–46

7.4.1 Addressing Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7–46

7.4.2 Managing the Forwarding Table: Routing . . . . . . . . . . . . . . . . . . .7–48

7.4.3 Hierarchical Address Assignment and Hierarchical Routing. . . . . .7–56

7.4.4 Reporting Network Layer Errors . . . . . . . . . . . . . . . . . . . . . . . . . .7–59

7.4.5 Network Address Translation (An Idea That Almost Works) . . . . .7–61

7.5. The End-to-End Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–62

7.5.1 Transport Protocols and Protocol Multiplexing . . . . . . . . . . . . . . .7–63

7.5.2 Assurance of At-Least-Once Delivery; the Role of Timers . . . . . . .7–67

7.5.3 Assurance of At-Most-Once Delivery: Duplicate Suppression . . . .7–71

7.5.4 Division into Segments and Reassembly of Long Messages . . . . . .7–73

7.5.5 Assurance of Data Integrity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7–73

7.5.6 End-to-End Performance: Overlapping and Flow Control. . . . . . .7–75

7.5.6.1 Overlapping Transmissions . . . . . . . . . . . . . . . . . . . . . . . . . .7–75

7.5.6.2 Bottlenecks, Flow Control, and Fixed Windows . . . . . . . . . .7–77

7.5.6.3 Sliding Windows and Self-Pacing . . . . . . . . . . . . . . . . . . . . .7–79

7.5.6.4 Recovery of Lost Data Segments with Windows . . . . . . . . . .7–81

7.5.7 Assurance of Stream Order, and Closing of Connections . . . . . . . .7–82

7.5.8 Assurance of Jitter Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7–84

7.5.9 Assurance of Authenticity and Privacy . . . . . . . . . . . . . . . . . . . . . .7–85

7.6. A Network System Design Issue: Congestion Control . . . . . . . . . . . . 7–86

7.6.1 Managing Shared Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7–86

7.6.2 Resource Management in Networks . . . . . . . . . . . . . . . . . . . . . . .7–89

7.6.3 Cross-layer Cooperation: Feedback . . . . . . . . . . . . . . . . . . . . . . . .7–91

7.6.4 Cross-layer Cooperation: Control . . . . . . . . . . . . . . . . . . . . . . . . .7–93

7.6.5 Other Ways of Controlling Congestion in Networks . . . . . . . . . . .7–94

7.6.6 Delay Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7–98

7.7. Wrapping up Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–99

7.8. Case Study: Mapping the Internet to the Ethernet. . . . . . . . . . . . . . 7–100

7.8.1 A Brief Overview of Ethernet . . . . . . . . . . . . . . . . . . . . . . . . . . .7–100

7.8.2 Broadcast Aspects of Ethernet . . . . . . . . . . . . . . . . . . . . . . . . . . .7–101

7.8.3 Layer Mapping: Attaching Ethernet to a Forwarding Network . .7–103

7.8.4 The Address Resolution Protocol. . . . . . . . . . . . . . . . . . . . . . . . .7–105

7.9. War Stories: Surprises in Protocol Design . . . . . . . . . . . . . . . . . . . . 7–107

7.9.1 Fixed Timers Lead to Congestion Collapse in NFS . . . . . . . . . . .7–107

7.9.2 Autonet Broadcast Storms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7–108

7.9.3 Emergent Phase Synchronization of Periodic Protocols . . . . . . . .7–108


7.9.4 Wisconsin Time Server Meltdown . . . . . . . . . . . . . . . . . . . . . . . 7–109

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–111

CHAPTER 8 Fault Tolerance: Reliable Systems from Unreliable Components . . . . . . 8–1

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–2

8.1. Faults, Failures, and Fault Tolerant Design . . . . . . . . . . . . . . . . . . . 8–3

8.1.1 Faults, Failures, and Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–3

8.1.2 The Fault-Tolerance Design Process . . . . . . . . . . . . . . . . . . . . . . . . 8–6

8.2. Measures of Reliability and Failure Tolerance. . . . . . . . . . . . . . . . . . . 8–8

8.2.1 Availability and Mean Time to Failure . . . . . . . . . . . . . . . . . . . . . . 8–8

8.2.2 Reliability Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–13

8.2.3 Measuring Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–16

8.3. Tolerating Active Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–16

8.3.1 Responding to Active Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–16

8.3.2 Fault Tolerance Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–18

8.4. Systematically Applying Redundancy . . . . . . . . . . . . . . . . . . . . . . . . 8–20

8.4.1 Coding: Incremental Redundancy . . . . . . . . . . . . . . . . . . . . . . . . 8–21

8.4.2 Replication: Massive Redundancy. . . . . . . . . . . . . . . . . . . . . . . . . 8–25

8.4.3 Voting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–26

8.4.4 Repair. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–31

8.5. Applying Redundancy to Software and Data . . . . . . . . . . . . . . . . . . 8–36

8.5.1 Tolerating Software Faults. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–36

8.5.2 Tolerating Software (and other) Faults by Separating State . . . . . . 8–37

8.5.3 Durability and Durable Storage . . . . . . . . . . . . . . . . . . . . . . . . . . 8–39

8.5.4 Magnetic Disk Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–40

8.5.4.1 Magnetic Disk Fault Modes . . . . . . . . . . . . . . . . . . . . . . . . . 8–41

8.5.4.2 System Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–42

8.5.4.3 Raw Disk Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–43

8.5.4.4 Fail-Fast Disk Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–43

8.5.4.5 Careful Disk Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–45

8.5.4.6 Durable Storage: RAID 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–46

8.5.4.7 Improving on RAID 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–47

8.5.4.8 Detecting Errors Caused by System Crashes . . . . . . . . . . . . . 8–49

8.5.4.9 Still More Threats to Durability . . . . . . . . . . . . . . . . . . . . . . 8–49


8.6. Wrapping up Reliability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–51

8.6.1 Design Strategies and Design Principles. . . . . . . . . . . . . . . . . . . . .8–51

8.6.2 How about the End-to-End Argument?. . . . . . . . . . . . . . . . . . . . .8–52

8.6.3 A Caution on the Use of Reliability Calculations. . . . . . . . . . . . . .8–53

8.6.4 Where to Learn More about Reliable Systems . . . . . . . . . . . . . . . .8–53

8.7. Application: A Fault Tolerance Model for CMOS RAM . . . . . . . . . . 8–55

8.8. War Stories: Fault Tolerant Systems that Failed . . . . . . . . . . . . . . . . . 8–57

8.8.1 Adventures with Error Correction . . . . . . . . . . . . . . . . . . . . . . . . .8–57

8.8.2 Risks of Rarely-Used Procedures: The National Archives . . . . . . . .8–59

8.8.3 Non-independent Replicas and Backhoe Fade . . . . . . . . . . . . . . . .8–60

8.8.4 Human Error May Be the Biggest Risk . . . . . . . . . . . . . . . . . . . . .8–61

8.8.5 Introducing a Single Point of Failure . . . . . . . . . . . . . . . . . . . . . . .8–63

8.8.6 Multiple Failures: The SOHO Mission Interruption . . . . . . . . . . .8–63

Exercises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8–64

CHAPTER 9 Atomicity: All-or-Nothing and Before-or-After . . . . . . . . . . . . . . . . . . . 9–1

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–2

9.1. Atomicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–4

9.1.1 All-or-Nothing Atomicity in a Database . . . . . . . . . . . . . . . . . . . . .9–5

9.1.2 All-or-Nothing Atomicity in the Interrupt Interface . . . . . . . . . . . .9–6

9.1.3 All-or-Nothing Atomicity in a Layered Application . . . . . . . . . . . . .9–8

9.1.4 Some Actions With and Without the All-or-Nothing Property . . .9–10

9.1.5 Before-or-After Atomicity: Coordinating Concurrent Threads. . . .9–13

9.1.6 Correctness and Serialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9–16

9.1.7 All-or-Nothing and Before-or-After Atomicity. . . . . . . . . . . . . . . .9–19

9.2. All-or-Nothing Atomicity I: Concepts . . . . . . . . . . . . . . . . . . . . . . . . 9–21

9.2.1 Achieving All-or-Nothing Atomicity: ALL_OR_NOTHING_PUT . . .9–21

9.2.2 Systematic Atomicity: Commit and the Golden Rule . . . . . . . . . .9–27

9.2.3 Systematic All-or-Nothing Atomicity: Version Histories . . . . . . . .9–30

9.2.4 How Version Histories are Used . . . . . . . . . . . . . . . . . . . . . . . . . .9–37

9.3. All-or-Nothing Atomicity II: Pragmatics . . . . . . . . . . . . . . . . . . . . . . 9–38

9.3.1 Atomicity Logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9–39

9.3.2 Logging Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9–42

9.3.3 Recovery Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9–45

9.3.4 Other Logging Configurations: Non-Volatile Cell Storage. . . . . . .9–47

9.3.5 Checkpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9–51

9.3.6 What if the Cache is not Write-Through? (Advanced Topic) . . . . .9–53


9.4. Before-or-After Atomicity I: Concepts . . . . . . . . . . . . . . . . . . . . . . . 9–54

9.4.1 Achieving Before-or-After Atomicity: Simple Serialization . . . . . . 9–54

9.4.2 The Mark-Point Discipline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–58

9.4.3 Optimistic Atomicity: Read-Capture (Advanced Topic) . . . . . . . . 9–63

9.4.4 Does Anyone Actually Use Version Histories for Before-or-After Atomicity? . . . . . . . . . . . . . . . 9–67

9.5. Before-or-After Atomicity II: Pragmatics . . . . . . . . . . . . . . . . . . . . . 9–69

9.5.1 Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–70

9.5.2 Simple Locking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–72

9.5.3 Two-Phase Locking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–73

9.5.4 Performance Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–75

9.5.5 Deadlock; Making Progress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–76

9.6. Atomicity across Layers and Multiple Sites . . . . . . . . . . . . . . . . . . . . 9–79

9.6.1 Hierarchical Composition of Transactions . . . . . . . . . . . . . . . . . . 9–80

9.6.2 Two-Phase Commit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–84

9.6.3 Multiple-Site Atomicity: Distributed Two-Phase Commit . . . . . . 9–85

9.6.4 The Dilemma of the Two Generals . . . . . . . . . . . . . . . . . . . . . . . . 9–90

9.7. A More Complete Model of Disk Failure (Advanced Topic) . . . . . . . 9–92

9.7.1 Storage that is Both All-or-Nothing and Durable . . . . . . . . . . . . . 9–92

9.8. Case Studies: Machine Language Atomicity . . . . . . . . . . . . . . . . . . . 9–95

9.8.1 Complex Instruction Sets: The General Electric 600 Line. . . . . . . 9–95

9.8.2 More Elaborate Instruction Sets: The IBM System/370 . . . . . . . . 9–96

9.8.3 The Apollo Desktop Computer and the Motorola M68000 Microprocessor . . . . . . . . . . . . . . . . 9–97

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–98

CHAPTER 10 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–1

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–2

10.1. Constraints and Interface Consistency . . . . . . . . . . . . . . . . . . . . . . 10–2

10.2. Cache Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–4

10.2.1 Coherence, Replication, and Consistency in a Cache . . . . . . . . . 10–4

10.2.2 Eventual Consistency with Timer Expiration . . . . . . . . . . . . . . . 10–5

10.2.3 Obtaining Strict Consistency with a Fluorescent Marking Pen . . 10–7

10.2.4 Obtaining Strict Consistency with the Snoopy Cache. . . . . . . . . 10–7

10.3. Durable Storage Revisited: Widely Separated Replicas . . . . . . . . . . 10–9

10.3.1 Durable Storage and the Durability Mantra . . . . . . . . . . . . . . . . 10–9

10.3.2 Replicated State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–11


10.3.3 Shortcuts to Meet more Modest Requirements . . . . . . . . . . . . .10–13

10.3.4 Maintaining Data Integrity . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–15

10.3.5 Replica Reading and Majorities . . . . . . . . . . . . . . . . . . . . . . . . .10–16

10.3.6 Backup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–17

10.3.7 Partitioning Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–18

10.4. Reconciliation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–19

10.4.1 Occasionally Connected Operation . . . . . . . . . . . . . . . . . . . . . .10–20

10.4.2 A Reconciliation Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–22

10.4.3 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–25

10.4.4 Clock Coordination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–26

10.5. Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10–26

10.5.1 History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–27

10.5.2 Trade-Offs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–28

10.5.3 Directions for Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . .10–31

Exercises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10–32

CHAPTER 11 Information Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–1

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–4

11.1. Introduction to Secure Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–5

11.1.1 Threat Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–7

11.1.2 Security is a Negative Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–9

11.1.3 The Safety Net Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–10

11.1.4 Design Principles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–13

11.1.5 A High d(technology)/dt Poses Challenges For Security . . . . . .11–17

11.1.6 Security Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–18

11.1.7 Trusted Computing Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–26

11.1.8 The Road Map for this Chapter . . . . . . . . . . . . . . . . . . . . . . . .11–28

11.2. Authenticating Principals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–28

11.2.1 Separating Trust from Authenticating Principals . . . . . . . . . . . .11–29

11.2.2 Authenticating Principals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–30

11.2.3 Cryptographic Hash Functions, Computationally Secure, Window of Validity . . . . . . . . . . . . 11–32

11.2.4 Using Cryptographic Hash Functions to Protect Passwords . . . .11–34

11.3. Authenticating Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–36

11.3.1 Message Authentication is Different from Confidentiality . . . . .11–37

11.3.2 Closed versus Open Designs and Cryptography . . . . . . . . . . . .11–38

11.3.3 Key-Based Authentication Model . . . . . . . . . . . . . . . . . . . . . . .11–41


11.3.4 Properties of SIGN and VERIFY . . . . . . . . . . . . . . . . . . . . . . . . . 11–41

11.3.5 Public-key versus Shared-Secret Authentication . . . . . . . . . . . . 11–44

11.3.6 Key Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–45

11.3.7 Long-Term Data Integrity with Witnesses . . . . . . . . . . . . . . . . 11–48

11.4. Message Confidentiality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–49

11.4.1 Message Confidentiality Using Encryption . . . . . . . . . . . . . . . . 11–49

11.4.2 Properties of ENCRYPT and DECRYPT . . . . . . . . . . . . . . . . . . . . 11–50

11.4.3 Achieving both Confidentiality and Authentication . . . . . . . . . 11–52

11.4.4 Can Encryption be Used for Authentication? . . . . . . . . . . . . . . 11–53

11.5. Security Protocols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–54

11.5.1 Example: Key Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–54

11.5.2 Designing Security Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . 11–60

11.5.3 Authentication Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–63

11.5.4 An Incorrect Key Exchange Protocol . . . . . . . . . . . . . . . . . . . . 11–66

11.5.5 Diffie-Hellman Key Exchange Protocol . . . . . . . . . . . . . . . . . . 11–68

11.5.6 A Key Exchange Protocol Using a Public-Key System . . . . . . . . 11–69

11.5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–71

11.6. Authorization: Controlled Sharing . . . . . . . . . . . . . . . . . . . . . . . . 11–72

11.6.1 Authorization Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–73

11.6.2 The Simple Guard Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–73

11.6.2.1 The Ticket System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–74

11.6.2.2 The List System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–74

11.6.2.3 Tickets Versus Lists, and Agencies . . . . . . . . . . . . . . . . . . 11–75

11.6.2.4 Protection Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–76

11.6.3 Example: Access Control in UNIX . . . . . . . . . . . . . . . . . . . . . . . 11–76

11.6.3.1 Principals in UNIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–76

11.6.3.2 ACLs in UNIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–77

11.6.3.3 The Default Principal and Permissions of a Process . . . . . 11–78

11.6.3.4 Authenticating Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–79

11.6.3.5 Access Control Check. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–79

11.6.3.6 Running Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–80

11.6.3.7 Summary of UNIX Access Control . . . . . . . . . . . . . . . . . . 11–80

11.6.4 The Caretaker Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–80

11.6.5 Non-Discretionary Access and Information Flow Control . . . . 11–81

11.6.5.1 Information Flow Control Example . . . . . . . . . . . . . . . . . 11–83

11.6.5.2 Covert Channels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–84


11.7. Advanced Topic: Reasoning about Authentication. . . . . . . . . . . . . 11–85

11.7.1 Authentication Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–86

11.7.1.1 Hard-wired Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–88

11.7.1.2 Internet Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–88

11.7.2 Authentication in Distributed Systems . . . . . . . . . . . . . . . . . . .11–89

11.7.3 Authentication across Administrative Realms. . . . . . . . . . . . . . .11–90

11.7.4 Authenticating Public Keys . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–92

11.7.5 Authenticating Certificates . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–94

11.7.6 Certificate Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–97

11.7.6.1 Hierarchy of Central Certificate Authorities . . . . . . . . . . .11–97

11.7.6.2 Web of Trust . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–98

11.8. Cryptography as a Building Block (Advanced Topic). . . . . . . . . . . 11–99

11.8.1 Unbreakable Cipher for Confidentiality (One-Time Pad) . . . . . .11–99

11.8.2 Pseudorandom Number Generators. . . . . . . . . . . . . . . . . . . . .11–101

11.8.2.1 RC4: A Pseudorandom Generator and its Use . . . . . . . . . . . . . . . . . . 11–101

11.8.2.2 Confidentiality using RC4. . . . . . . . . . . . . . . . . . . . . . . .11–102

11.8.3 Block Ciphers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–103

11.8.3.1 Advanced Encryption Standard (AES). . . . . . . . . . . . . . .11–103

11.8.3.2 Cipher-Block Chaining . . . . . . . . . . . . . . . . . . . . . . . . . .11–105

11.8.4 Computing a Message Authentication Code . . . . . . . . . . . . . .11–106

11.8.4.1 MACs Using Block Cipher or Stream Cipher . . . . . . . . .11–107

11.8.4.2 MACs Using a Cryptographic Hash Function. . . . . . . . .11–107

11.8.5 A Public-Key Cipher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–109

11.8.5.1 Rivest-Shamir-Adleman (RSA) Cipher . . . . . . . . . . . . . .11–109

11.8.5.2 Computing a Digital Signature . . . . . . . . . . . . . . . . . . . .11–111

11.8.5.3 A Public-Key Encrypting System. . . . . . . . . . . . . . . . . . .11–112

11.9. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–112

11.10. Case Study: Transport Layer Security (TLS) for the Web. . . . . . 11–116

11.10.1 The TLS Handshake . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–117

11.10.2 Evolution of TLS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–120

11.10.3 Authenticating Services with TLS . . . . . . . . . . . . . . . . . . . . .11–121

11.10.4 User Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–123

11.11. War Stories: Security System Breaches . . . . . . . . . . . . . . . . . . . . 11–125

11.11.1 Residues: Profitable Garbage . . . . . . . . . . . . . . . . . . . . . . . . .11–126

11.11.1.1 1963: Residues in CTSS . . . . . . . . . . . . . . . . . . . . . . . .11–126


11.11.1.2 1997: Residues in Network Packets . . . . . . . . . . . . . . . 11–127

11.11.1.3 2000: Residues in HTTP . . . . . . . . . . . . . . . . . . . . . . . 11–127

11.11.1.4 Residues on Removed Disks . . . . . . . . . . . . . . . . . . . . . 11–128

11.11.1.5 Residues in Backup Copies. . . . . . . . . . . . . . . . . . . . . . 11–128

11.11.1.6 Magnetic Residues: High-Tech Garbage Analysis . . . . . 11–129

11.11.1.7 2001 and 2002: More Low-tech Garbage Analysis . . . . 11–129

11.11.2 Plaintext Passwords Lead to Two Breaches . . . . . . . . . . . . . . 11–130

11.11.3 The Multiply Buggy Password Transformation . . . . . . . . . . . 11–131

11.11.4 Controlling the Configuration . . . . . . . . . . . . . . . . . . . . . . . 11–131

11.11.4.1 Authorized People Sometimes do Unauthorized Things 11–132

11.11.4.2 The System Release Trick . . . . . . . . . . . . . . . . . . . . . . . 11–132

11.11.4.3 The Slammer Worm. . . . . . . . . . . . . . . . . . . . . . . . . . . 11–132

11.11.5 The Kernel Trusts the User . . . . . . . . . . . . . . . . . . . . . . . . . . 11–135

11.11.5.1 Obvious Trust. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–135

11.11.5.2 Nonobvious Trust (Tocttou) . . . . . . . . . . . . . . . . . . . . . 11–136

11.11.5.3 Tocttou 2: Virtualizing the DMA Channel. . . . . . . . . . 11–136

11.11.6 Technology Defeats Economic Barriers. . . . . . . . . . . . . . . . . 11–137

11.11.6.1 An Attack on Our System Would be Too Expensive . . . 11–137

11.11.6.2 Well, it Used to be Too Expensive . . . . . . . . . . . . . . . . 11–137

11.11.7 Mere Mortals Must be Able to Figure Out How to Use it . . . 11–138

11.11.8 The Web can be a Dangerous Place . . . . . . . . . . . . . . . . . . . 11–139

11.11.9 The Reused Password . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–140

11.11.10 Signaling with Clandestine Channels . . . . . . . . . . . . . . . . . 11–141

11.11.10.1 Intentionally I: Banging on the Walls . . . . . . . . . . . . . 11–141

11.11.10.2 Intentionally II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–141

11.11.10.3 Unintentionally . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–142

11.11.11 It Seems to be Working Just Fine . . . . . . . . . . . . . . . . . . . . 11–142

11.11.11.1 I Thought it was Secure . . . . . . . . . . . . . . . . . . . . . . . 11–143

11.11.11.2 How Large is the Key Space…Really?. . . . . . . . . . . . . 11–144

11.11.11.3 How Long are the Keys? . . . . . . . . . . . . . . . . . . . . . . . 11–145

11.11.12 Injection For Fun and Profit . . . . . . . . . . . . . . . . . . . . . . . . 11–145

11.11.12.1 Injecting a Bogus Alert Message to the Operator . . . . 11–146

11.11.12.2 CardSystems Exposes 40,000,000 Credit Card Records to SQL Injection . . . . . . . . . . . . 11–146

11.11.13 Hazards of Rarely-Used Components . . . . . . . . . . . . . . . . . 11–148


11.11.14 A Thorough System Penetration Job . . . . . . . . . . . . . . . . . . . . . . . . 11–148

11.11.15 Framing Enigma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–149

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11–151

Suggestions for Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . SR–1

Problem Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . PS–1

Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . GL–1

Complete Index of Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . INDEX–1


List of Sidebars


PART I [In Printed Textbook]

CHAPTER 1 Systems

Sidebar 1.1: Stopping a Supertanker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Sidebar 1.2: Why Airplanes can’t Fly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Sidebar 1.3: Terminology: Words used to Describe System Composition . . . . . . . . . . 9
Sidebar 1.4: The Cast of Characters and Organizations . . . . . . . . . . . . . . . . . . . . . . . 14
Sidebar 1.5: How Modularity Reshaped the Computer Industry . . . . . . . . . . . . . . . . 21
Sidebar 1.6: Why Computer Technology has Improved Exponentially with Time . . . 32

CHAPTER 2 Elements of Computer System Organization

Sidebar 2.1: Terminology: durability, stability, and persistence . . . . . . . . . . . . . . . . . . 46
Sidebar 2.2: How magnetic disks work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Sidebar 2.3: Representation: pseudocode and messages . . . . . . . . . . . . . . . . . . . . . . . 54
Sidebar 2.4: What is an operating system? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Sidebar 2.5: Human engineering and the principle of least astonishment . . . . . . . . . . 85

CHAPTER 3 The Design of Naming Schemes

Sidebar 3.1: Generating a unique name from a timestamp . . . . . . . . . . . . . . . . . . . . 125
Sidebar 3.2: Hypertext links in the Shakespeare Electronic Archive . . . . . . . . . . . . . 129

CHAPTER 4 Enforcing Modularity with Clients and Services

Sidebar 4.1: Enforcing modularity with a high-level language . . . . . . . . . . . . . . . . . 154
Sidebar 4.2: Representation: Timing diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Sidebar 4.3: Representation: Big-Endian or Little-Endian? . . . . . . . . . . . . . . . . . . . . 158
Sidebar 4.4: The X Window System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Sidebar 4.5: Peer-to-peer: computing without trusted intermediaries . . . . . . . . . . . . 164

CHAPTER 5 Enforcing Modularity with Virtualization

Sidebar 5.1: RSM, test-and-set and avoiding locks . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Sidebar 5.2: Constructing a before-or-after action without special instructions . . . . 226
Sidebar 5.3: Bootstrapping an operating system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
Sidebar 5.4: Process, thread, and address space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Sidebar 5.5: Position-independent programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Sidebar 5.6: Interrupts, exceptions, faults, traps, and signals . . . . . . . . . . . . . . . . . . . 259
Sidebar 5.7: Avoiding the lost notification problem with semaphores . . . . . . . . . . . . 277

CHAPTER 6 Performance

Sidebar 6.1: Design hint: When in doubt use brute force . . . . . . . . . . . . . . . . . . . . . 301
Sidebar 6.2: Design hint: Optimize for the common case . . . . . . . . . . . . . . . . . . . . . 307
Sidebar 6.3: Design hint: Instead of reducing latency, hide it . . . . . . . . . . . . . . . . . . 310
Sidebar 6.4: RAM latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
Sidebar 6.5: Design hint: Separate mechanism from policy . . . . . . . . . . . . . . . . . . . 330
Sidebar 6.6: OPT is a stack algorithm and optimal . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Sidebar 6.7: Receive livelock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
Sidebar 6.8: Priority inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358

Part II [On-Line]

CHAPTER 7 The Network as a System and as a System Component

Sidebar 7.1: Error detection, checksums, and witnesses . . . . . . . . . . . . . . . . . . . . . 7–10
Sidebar 7.2: The Internet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–32
Sidebar 7.3: Framing phase-encoded bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–37
Sidebar 7.4: Shannon’s capacity theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–37
Sidebar 7.5: Other end-to-end transport protocol interfaces . . . . . . . . . . . . . . . . . . 7–66
Sidebar 7.6: Exponentially weighted moving averages . . . . . . . . . . . . . . . . . . . . . . . 7–70
Sidebar 7.7: What does an acknowledgment really mean? . . . . . . . . . . . . . . . . . . . . 7–77
Sidebar 7.8: The tragedy of the commons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–93
Sidebar 7.9: Retrofitting TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–95
Sidebar 7.10: The invisible hand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7–98

CHAPTER 8 Fault Tolerance: Reliable Systems from Unreliable Components

Sidebar 8.1: Reliability functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–14
Sidebar 8.2: Risks of manipulating MTTFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–30
Sidebar 8.3: Are disk system checksums a wasted effort? . . . . . . . . . . . . . . . . . . . . 8–49
Sidebar 8.4: Detecting failures with heartbeats . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8–54

CHAPTER 9 Atomicity: All-or-Nothing and Before-or-After

Sidebar 9.1: Actions and transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–4
Sidebar 9.2: Events that might lead to invoking an exception handler . . . . . . . . . . . . 9–7
Sidebar 9.3: Cascaded aborts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–29
Sidebar 9.4: The many uses of logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9–40

CHAPTER 10 Consistency

CHAPTER 11 Information Security

Sidebar 11.1: Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–7
Sidebar 11.2: Should designs and vulnerabilities be public? . . . . . . . . . . . . . . . . . . .11–14
Sidebar 11.3: Malware: viruses, worms, trojan horses, logic bombs, bots, etc. . . . . . .11–19
Sidebar 11.4: Why are buffer overrun bugs so common? . . . . . . . . . . . . . . . . . . . . .11–23
Sidebar 11.5: Authenticating personal devices: the resurrecting duckling policy . . . . .11–47
Sidebar 11.6: The Kerberos authentication system . . . . . . . . . . . . . . . . . . . . . . . . . .11–58
Sidebar 11.7: Secure Hash Algorithm (SHA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–108
Sidebar 11.8: Economics of computer security . . . . . . . . . . . . . . . . . . . . . . . . . . . .11–115


Preface to Part II


This textbook, Principles of Computer System Design: An Introduction, is an introduction to the principles and abstractions used in the design of computer systems. It is an outgrowth of notes written by the authors for the M.I.T. Electrical Engineering and Computer Science course 6.033, Computer System Engineering, over a period of 40-plus years.

The book is published in two parts:

• Part I, containing chapters 1–6 and supporting materials for those chapters, is a traditional printed textbook published by Morgan Kaufmann, an imprint of Elsevier. (ISBN: 978–012374957–4)

• Part II, consisting of Chapters 7–11 and supporting materials for those chapters, is made available on-line by M.I.T. OpenCourseWare and the authors as an open educational resource.

Availability of the two parts and various supporting materials is described in the section with that title below.

Part II of the textbook continues a main theme of Part I—enforcing modularity—by introducing still stronger forms of modularity. Part I introduces methods that help prevent accidental errors in one module from propagating to another. Part II introduces stronger forms of modularity that can help protect against component and system failures and against malicious attacks. Part II explores communication networks, constructing reliable systems from unreliable components, creating all-or-nothing and before-or-after transactions, and implementing security. In doing so, Part II also continues a second main theme of Part I by introducing several additional design principles related to stronger forms of modularity.

A detailed description of the contents of the chapters of Part II can be found in Part I, in the section “About Part II” on page 369. Part II also includes a table of contents for both Parts I and II, copies of the Suggested Additional Readings and Glossary, Problem Sets for both Parts I and II, and a comprehensive Index of Concepts with page numbers for both Parts I and II in a single alphabetic list.


Availability

The authors and MIT OpenCourseWare provide, free of charge, on-line versions of Chapters 7 through 11, the problem sets, the glossary, and a comprehensive index. Those materials can be found at

http://ocw.mit.edu/Saltzer-Kaashoek

in the form of a series of PDF files (requires Adobe Reader), one per chapter or major supporting section, as well as a single PDF file containing the entire set. The publisher of the printed book also maintains a set of on-line resources at

www.ElsevierDirect.com/9780123749574

Click on the link “Companion Materials”, where you will find Part II of the book as well as other resources, including figures from the text in several formats. Additional materials for instructors (registration required) can be found by clicking the “Manual” link.

There are two additional sources of supporting material related to the teaching of course 6.033, Computer Systems Engineering, at M.I.T. The first source is an OpenCourseWare site containing materials from the teaching of the class in 2005: a class description; lecture, reading, and assignment schedule; board layouts; and many lecture videos. These materials are at

http://ocw.mit.edu/6-033

The second source is a Web site for the current 6.033 class. This site contains the current lecture schedule, which includes assignments, lecturer notes, and slides. There is also a thirteen-year archive of class assignments, design projects, and quizzes. These materials are all at

http://mit.edu/6.033

(Some copyrighted or privacy-sensitive materials on that Web site are restricted to current MIT students.)


Acknowledgments


This textbook began as a set of notes for the advanced undergraduate course Engineering of Computer Systems (6.033, originally 6.233), offered by the Department of Electrical Engineering and Computer Science of the Massachusetts Institute of Technology starting in 1968. The text has benefited from some four decades of comments and suggestions by many faculty members, visitors, recitation instructors, teaching assistants, and students. Over 5,000 students have used (and suffered through) draft versions, and observations of their learning experiences (as well as frequent confusion caused by the text) have informed the writing. We are grateful for those many contributions. In addition, certain aspects deserve specific acknowledgment.

1. Naming (Section 2.2 and Chapter 3)

The concept and organization of the materials on naming grew out of extensive discussions with Michael D. Schroeder. The naming model (and part of our development) follows closely the one developed by D. Austin Henderson in his Ph.D. thesis. Stephen A. Ward suggested some useful generalizations of the naming model, and Roger Needham suggested several concepts in response to an earlier version of this material. That earlier version, including in-depth examples of the naming model applied to addressing architectures and file systems, and an historical bibliography, was published as Chapter 3 in Rudolf Bayer et al., editors, Operating Systems: An Advanced Course, Lecture Notes in Computer Science 60, pages 99–208. Springer-Verlag, 1978, reprinted 1984. Additional ideas have been contributed by many others, including Ion Stoica, Karen Sollins, Daniel Jackson, Butler Lampson, David Karger, and Hari Balakrishnan.

2. Enforced Modularity and Virtualization (Chapters 4 and 5)

Chapter 4 was heavily influenced by lectures on the same topic by David L. Tennenhouse. Both chapters have been improved by substantial feedback from Hari Balakrishnan, Russ Cox, Michael Ernst, Eddie Kohler, Chris Laas, Barbara H. Liskov, Nancy Lynch, Samuel Madden, Robert T. Morris, Max Poletto, Martin Rinard, Susan Ruff, Gerald Jay Sussman, Julie Sussman, and Michael Walfish.

3. Networks (Chapter 7[on-line])

Conversations with David D. Clark and David L. Tennenhouse were instrumental in laying out the organization of this chapter, and lectures by Clark were the basis for part of the presentation. Robert H. Halstead Jr. wrote an early draft set of notes about networking, and some of his ideas have also been borrowed. Hari Balakrishnan provided many suggestions and corrections and helped sort out muddled explanations, and Julie Sussman and Susan Ruff pointed out many opportunities to improve the presentation. The material on congestion control was developed with the help of extensive discussions


with Hari Balakrishnan and Robert T. Morris, and is based in part on ideas from Raj Jain.

4. Fault Tolerance (Chapter 8[on-line])

Most of the concepts and examples in this chapter were originally articulated by Claude Shannon, Edward F. Moore, David Huffman, Edward J. McCluskey, Butler W. Lampson, Daniel P. Siewiorek, and Jim N. Gray.

5. Transactions and Consistency (Chapters 9[on-line] and 10[on-line])

The material of the transactions and consistency chapters has been developed over the course of four decades with aid and ideas from many sources. The concept of version histories is due to Jack Dennis, and the particular form of all-or-nothing and before-or-after atomicity with version histories developed here is due to David P. Reed. Jim N. Gray not only came up with many of the ideas described in these two chapters, he also provided extensive comments. (That doesn’t imply endorsement—he disagreed strongly about the importance of some of the ideas!) Other helpful comments and suggestions were made by Hari Balakrishnan, Andrew Herbert, Butler W. Lampson, Barbara H. Liskov, Samuel R. Madden, Larry Rudolph, Gerald Jay Sussman, and Julie Sussman.

6. Computer Security (Chapter 11[on-line])

Sections 11.1 and 11.6 draw heavily from the paper “The Protection of Information in Computer Systems” by Jerome H. Saltzer and Michael D. Schroeder, Proceedings of the IEEE 63, 9 (September, 1975), pages 1278–1308. Ronald Rivest, David Mazières, and Robert T. Morris made significant contributions to material presented throughout the chapter. Brad Chen, Michael Ernst, Kevin Fu, Charles Leiserson, Susan Ruff, and Seth Teller made numerous suggestions for improving the text.

7. Suggested Outside Readings

Ideas for suggested readings have come from many sources. Particular thanks must go to Michael D. Schroeder, who uncovered several of the classic systems papers in places outside computer science where nobody else would have thought to look, Edward D. Lazowska, who provided an extensive reading list used at the University of Washington, and Butler W. Lampson, who provided a thoughtful review of the list.

8. The Exercises and Problem Sets

The exercises at the end of each chapter and the problem sets at the end of the book have been collected, suggested, tried, debugged, and revised by many different faculty members, instructors, teaching assistants, and undergraduate students over a period of 40 years in the process of constructing quizzes and examinations while teaching the material of the text.


Certain of the longer exercises and most of the problem sets, which are based on lead-in stories and include several related questions, represent a substantial effort by a single individual. For those problem sets not developed by one of the authors, a credit line appears in a footnote on the first page of the problem set. Following each problem or problem set is an identifier of the form “1978–3–14”. This identifier reports the year, examination number, and problem number of the examination in which some version of that problem first appeared.

Jerome H. Saltzer
M. Frans Kaashoek
2009


Computer System Design Principles


Throughout the text, the description of a design principle presents its name in a boldfaced display, and each place that the principle is used highlights it in underlined italics.

Design principles applicable to many areas of computer systems

• Adopt sweeping simplifications So you can see what you are doing.

• Avoid excessive generality If it is good for everything, it is good for nothing.

• Avoid rarely used components Deterioration and corruption accumulate unnoticed—until the next use.

• Be explicit Get all of the assumptions out on the table.

• Decouple modules with indirection Indirection supports replaceability.

• Design for iteration You won't get it right the first time, so make it easy to change.

• End-to-end argument The application knows best.

• Escalating complexity principle Adding a feature increases complexity out of proportion.

• Incommensurate scaling rule Changing a parameter by a factor of ten requires a new design.

• Keep digging principle Complex systems fail for complex reasons.

• Law of diminishing returns The more one improves some measure of goodness, the more effort the next improvement will require.

• Open design principle Let anyone comment on the design; you need all the help you can get.

• Principle of least astonishment People are part of the system. Choose interfaces that match the user’s experience, expectations, and mental models.

• Robustness principle Be tolerant of inputs, strict on outputs.

• Safety margin principle Keep track of the distance to the edge of the cliff or you may fall over the edge.

• Unyielding foundations rule It is easier to change a module than to change the modularity.

Design principles applicable to specific areas of computer systems

• Atomicity: Golden rule of atomicity Never modify the only copy!

• Coordination: One-writer principle If each variable has only one writer, coordination is simpler.

• Durability: The durability mantra Multiple copies, widely separated and independently administered.

• Security: Minimize secrets Because they probably won’t remain secret for long.

• Security: Complete mediation Check every operation for authenticity, integrity, and authorization.

• Security: Fail-safe defaults Most users won’t change them, so set defaults to do something safe.

• Security: Least privilege principle Don’t store lunch in the safe with the jewels.

• Security: Economy of mechanism The less there is, the more likely you will get it right.

• Security: Minimize common mechanism Shared mechanisms provide unwanted communication paths.

Design Hints (useful but not as compelling as design principles)

• Exploit brute force
• Instead of reducing latency, hide it
• Optimize for the common case
• Separate mechanism from policy


CHAPTER 7
The Network as a System and as a System Component

CHAPTER CONTENTS

Overview..........................................................................................7–2

7.1 Interesting Properties of Networks ...........................................7–3

7.1.1 Isochronous and Asynchronous Multiplexing ............................ 7–5

7.1.2 Packet Forwarding; Delay ...................................................... 7–9

7.1.3 Buffer Overflow and Discarded Packets ................................. 7–12

7.1.4 Duplicate Packets and Duplicate Suppression ......................... 7–15

7.1.5 Damaged Packets and Broken Links ...................................... 7–18

7.1.6 Reordered Delivery ............................................................. 7–19

7.1.7 Summary of Interesting Properties and the Best-Effort Contract 7–20

7.2 Getting Organized: Layers .......................................................7–20

7.2.1 Layers .............................................................................. 7–23

7.2.2 The Link Layer ................................................................... 7–25

7.2.3 The Network Layer ............................................................. 7–27

7.2.4 The End-to-End Layer ......................................................... 7–28

7.2.5 Additional Layers and the End-to-End Argument ..................... 7–30

7.2.6 Mapped and Recursive Applications of the Layered Model ........ 7–32

7.3 The Link Layer.........................................................................7–34

7.3.1 Transmitting Digital Data in an Analog World ......................... 7–34

7.3.2 Framing Frames ................................................................. 7–38

7.3.3 Error Handling ................................................................... 7–40

7.3.4 The Link Layer Interface: Link Protocols and Multiplexing ........ 7–41

7.3.5 Link Properties .................................................................. 7–44

7.4 The Network Layer ..................................................................7–46

7.4.1 Addressing Interface .......................................................... 7–46

7.4.2 Managing the Forwarding Table: Routing ............................... 7–48

7.4.3 Hierarchical Address Assignment and Hierarchical Routing ....... 7–56

7.4.4 Reporting Network Layer Errors ........................................... 7–59

7.4.5 Network Address Translation (An Idea That Almost Works) ...... 7–61

7.5 The End-to-End Layer ..............................................................7–62

7.5.1 Transport Protocols and Protocol Multiplexing ......................... 7–63

7.5.2 Assurance of At-Least-Once Delivery; the Role of Timers ......... 7–67

7.5.3 Assurance of At-Most-Once Delivery: Duplicate Suppression .... 7–71


7.5.4 Division into Segments and Reassembly of Long Messages ...... 7–73

7.5.5 Assurance of Data Integrity ................................................. 7–73

7.5.6 End-to-End Performance: Overlapping and Flow Control .......... 7–75

7.5.6.1 Overlapping Transmissions ............................................ 7–75

7.5.6.2 Bottlenecks, Flow Control, and Fixed Windows ................. 7–77

7.5.6.3 Sliding Windows and Self-Pacing .................................... 7–79

7.5.6.4 Recovery of Lost Data Segments with Windows................ 7–81

7.5.7 Assurance of Stream Order, and Closing of Connections .......... 7–82

7.5.8 Assurance of Jitter Control .................................................. 7–84

7.5.9 Assurance of Authenticity and Privacy ................................... 7–85

7.6 A Network System Design Issue: Congestion Control..............7–86

7.6.1 Managing Shared Resources ................................................ 7–86

7.6.2 Resource Management in Networks ...................................... 7–89

7.6.3 Cross-layer Cooperation: Feedback ....................................... 7–91

7.6.4 Cross-layer Cooperation: Control ......................................... 7–93

7.6.5 Other Ways of Controlling Congestion in Networks .................. 7–94

7.6.6 Delay Revisited .................................................................. 7–98

7.7 Wrapping up Networks............................................................7–99

7.8 Case Study: Mapping the Internet to the Ethernet ................7–100

7.8.1 A Brief Overview of Ethernet ...............................................7–100

7.8.2 Broadcast Aspects of Ethernet ............................................7–101

7.8.3 Layer Mapping: Attaching Ethernet to a Forwarding Network ...7–103

7.8.4 The Address Resolution Protocol ..........................................7–105

7.9 War Stories: Surprises in Protocol Design .............................7–107

7.9.1 Fixed Timers Lead to Congestion Collapse in NFS ..................7–107

7.9.2 Autonet Broadcast Storms ..................................................7–108

7.9.3 Emergent Phase Synchronization of Periodic Protocols ............7–108

7.9.4 Wisconsin Time Server Meltdown ........................................7–109

Exercises......................................................................................7–111
Glossary for Chapter 7 .................................................................7–125
Index of Chapter 7 .......................................................................7–135
Last chapter page 7–139

Overview

Almost every computer system includes one or more communication links, and these communication links are usually organized to form a network, which can be loosely defined as a communication system that interconnects several entities. The basic abstraction remains SEND (message) and RECEIVE (message), so we can view a network as an elaboration of a communication link. Networks have several interesting properties—interface style, interface timing, latency, failure modes, and parameter ranges—that require careful design attention. Although many of these properties appear in latent form


in other system components, they become important or even dominate when the design includes communication.

Our study of networks begins, in Section 7.1, by identifying and investigating the interesting properties just mentioned, as well as methods of coping with those properties. Section 7.2 describes a three-layer reference model for a data communication network that is based on a best-effort contract, and Sections 7.3, 7.4, and 7.5 then explore more carefully a number of implementation issues and techniques for each of the three layers. Finally, Section 7.6 examines the problem of controlling network congestion.

A data communication network is an interesting example of a system itself. Most network designs make extensive use of layering as a modularization technique. Networks also provide in-depth examples of the issues involved in naming objects, in achieving fault tolerance, and in protecting information. (This chapter mentions fault tolerance and protection only in passing. Later chapters will return to these topics in proper depth.) In addition to layering, this chapter identifies several techniques that have wide applicability both within computer networks and elsewhere in networked computer systems—framing, multiplexing, exponential backoff, best-effort contracts, latency masking, error control, and the end-to-end argument. A glance at the glossary will show that the chapter defines a large number of concepts. A particular network design is not likely to require them all, and in some contexts some of the ideas would be overkill. The engineering of a network as a system component requires trade-offs and careful judgement.

It is easy to be diverted into an in-depth study of networks because they are a fascinating topic in their own right. However, we will limit our exploration to their uses as system components and as a case study of system issues. If this treatment sparks a deeper interest in the topic, the Suggestions for Further Reading at the end of this book include several good books and papers that provide wide-ranging treatments of all aspects of networks.

7.1 Interesting Properties of Networks

The design of communication networks is dominated by three intertwined considerations: (1) a trio of fundamental physical properties, (2) the mechanics of sharing, and (3) a remarkably wide range of parameter values. The first dominating consideration is the trio of fundamental physical properties:

1. The speed of light is finite. Using the most direct route, and accounting for the velocity of propagation in real-world communication media, it takes about 20 milliseconds to transmit a signal across the 2,600 miles from Boston to Los Angeles. This time is known as the propagation delay, and there is no way to avoid it without moving the two cities closer together. If the signal travels via a geostationary satellite perched 22,400 miles above the equator and at a longitude halfway between those two cities, the propagation delay jumps to 244 milliseconds, a latency large enough that a human, not just a computer, will notice.


But communication between two computers in the same room may have a propagation delay of only 10 nanoseconds. That shorter latency makes some things easier to do, but the important implication is that network systems may have to accommodate a range of delay that spans seven orders of magnitude.

2. Communication environments are hostile. Computers are usually constructed of incredibly reliable components, and they are usually operated in relatively benign environments. But communication is carried out using wires, glass fibers, or radio signals that must traverse far more hostile environments ranging from under the floor to deep in the ocean. These environments endanger communication. Threats range from a burst of noise that wipes out individual bits to careless backhoe operators who sever cables that can require days to repair.

3. Communication media have limited bandwidth. Every transmission medium has a maximum rate at which one can transmit distinct signals. This maximum rate is determined by its physical properties, such as the distance between transmitter and receiver and the attenuation characteristics of the medium. Signals can be multilevel, not just binary, so the data rate can be greater than the signaling rate. However, noise limits the ability of a receiver to distinguish one signal level from another. The combination of limited signaling rate, finite signal power, and the existence of noise limits the rate at which data can be sent over a communication link.* Different network links may thus have radically different data rates, ranging from a few kilobits per second over a long-distance telephone line to several tens of gigabits per second over an optical fiber. Available data rate thus represents a second network parameter that may range over seven orders of magnitude.

The second dominating consideration of communications networks is that they are nearly always shared. Sharing arises for two distinct reasons.

1. Any-to-any connection. Any communication system that connects more than two things intrinsically involves an element of sharing. If you have three computers, you usually discover quickly that there are times when you want to communicate between any pair. You can start by building a separate communication path between each pair, but this approach runs out of steam quickly because the number of paths required grows with the square of the number of communicating entities. Even in a small network, a shared communication system is usually much more practical—it is more economical and it is easier to manage. When the number of entities that need to communicate begins to grow, as suggested in Figure 7.1, there is little choice. A closely related observation is that networks may connect three entities or 300 million entities.

* The formula that relates signaling rate, signal power, noise level, and maximum data rate, known as Shannon’s capacity theorem, appears on page 7–37.


The number of connected entities is thus a third network parameter with a wide range, in this case covering eight orders of magnitude.

2. Sharing of communication costs. Some parts of a communication system follow the same technological trends as do processors, memory, and disk: things made of silicon chips seem to fall in price every year. Other parts, such as digging up streets to lay wire or fiber, launching a satellite, or bidding to displace an existing radio-based service, are not getting any cheaper. Worse, when communication links leave a building, they require right-of-way, which usually subjects them to some form of regulation. Regulation operates on a majestic time scale, with procedures that involve courts and attorneys, legislative action, long-term policies, political pressures, and expediency. These procedures can eventually produce useful results, but on time scales measured in decades, whereas technological change makes new things feasible every year. This incommensurate rate of change means that communication costs rarely fall as fast as technology would permit, so sharing of those costs between otherwise independent users persists even in situations where the technology might allow them to avoid it.

The third dominating consideration of network design is the wide range of parameter values. We have already seen that propagation times, data rates, and the number of communicating computers can each vary by seven or more orders of magnitude. There is a fourth such wide-ranging parameter: a single computer may at different times present a network with widely differing loads, ranging from transmitting a file at 30 megabytes per second to interactive typing at a rate of one byte per second.

These three considerations, unyielding physical limits, sharing of facilities, and existence of four different parameters that can each range over seven or more orders of magnitude, intrude on every level of network design, and even carefully thought-out modularity cannot completely mask them. As a result, systems that use networks as a component must take them into account.

7.1.1 Isochronous and Asynchronous Multiplexing

Sharing has significant consequences. Consider the simplified (and gradually becoming obsolescent) telephone network of Figure 7.1, which allows telephones in Boston to talk with telephones in Los Angeles: There are three shared components in this picture: a switch in Boston, a switch in Los Angeles, and an electrical circuit acting as a communication link between the two switches. The communication link is multiplexed, which means simply that it is used for several different communications at the same time.

Let’s focus on the multiplexed link. Suppose that there is an earthquake in Los Angeles, and many people in Boston simultaneously try to call their relatives in Los Angeles to find out what happened. The multiplexed link has a limited capacity, and at some point the next caller will be told the “network is busy.” (In the U.S. telephone network this event is usually signaled with “fast busy,” a series of beeps repeated at twice the speed of a usual busy signal.)


FIGURE 7.1 A simple telephone network. (Telephones B1, B2, and B3 attach to a shared Boston switch, telephones L1 through L4 to a shared Los Angeles switch, and a multiplexed link connects the two switches.)

This “network busy” phenomenon strikes rather abruptly because the telephone system traditionally uses a line multiplexing technique known as isochronous (from Greek roots meaning “equally timed”) communication. Suppose that the telephones are all digital, operating at 64 kilobits per second, and the multiplexed link runs at 45 megabits per second. If we look for the bits that represent the conversation between B2 and L3, we will find them on the wire as shown in Figure 7.2: At regular intervals we will find 8-bit blocks (called frames) carrying data from B2 to L3. To maintain the required data rate of 64 kilobits per second, another B2-to-L3 frame comes by every 5,624 bit times or 125 microseconds, producing a rate of 8,000 frames per second. In between each pair of B2-to-L3 frames there is room for 702 other frames, which may be carrying bits belonging to other telephone conversations. A 45 megabits/second link can thus carry up to 703 simultaneous conversations, but if a 704th person tries to initiate a call, that person will receive the “network busy” signal. Such a capacity-limiting scheme is sometimes called hard-edged, meaning in this case that it offers no resistance to the first 703 calls, but it absolutely refuses to accept the 704th one.

This scheme of dividing up the data into equal-size frames and transmitting the frames at equal intervals—known in communications literature as time-division multiplexing (TDM)—is especially suited to telephony because, from the point of view of any one telephone conversation, it provides a constant rate of data flow and the delay from one end to the other is the same for every frame.

FIGURE 7.2 Data flow on an isochronous multiplexed link. (Along the time axis, an 8-bit frame belonging to one conversation recurs every 5,624 bit times.)
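The frame arithmetic above is easy to check. The following sketch (ours; the rates are the ones used in the example, and the variable names are invented for illustration) recomputes the frame rate, the spacing between successive frames of one conversation, and the number of conversations the link can carry.

```python
# Frame arithmetic for the isochronous (TDM) example above.
link_rate = 45_000_000       # shared multiplexed link, bits per second
phone_rate = 64_000          # one digital telephone, bits per second
frame_bits = 8               # each frame carries 8 bits of one conversation

frames_per_second = phone_rate / frame_bits        # 8,000 frames per second
frame_interval = 1 / frames_per_second             # 125 microseconds
bits_between_frames = link_rate * frame_interval   # about 5,625 bit times,
                                                   # close to the 5,624 shown in Figure 7.2
conversations = int(link_rate // phone_rate)       # about 703 simultaneous calls

print(f"{frames_per_second:.0f} frames/s, one every {frame_interval * 1e6:.0f} microseconds")
print(f"about {bits_between_frames:.0f} bit times between successive frames of one call")
print(f"capacity for about {conversations} conversations; the next caller hears busy")
```

The hard edge of the scheme is visible in the last line: the 704th call does not get degraded service, it gets none.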


One prerequisite to using isochronous communication is that there must be some prior arrangement between the sending switch and the receiving switch: an agreement that this periodic series of frames should be sent along to L3. This agreement is an example of a connection and it requires some previous communication between the two switches to set up the connection, storage for remembered state at both ends of the link, and some method to discard (tear down) that remembered state when the conversation between B2 and L3 is complete.

Data communication networks usually use a strategy different from telephony for multiplexing shared links. The starting point for this different strategy is to examine the data rate and latency requirements when one computer sends data to another. Usually, computer-related activities send data on an irregular basis—in bursts called messages—as compared with the continuous stream of bits that flows out of a simple digital telephone. Bursty traffic is particularly ill-suited to fixed size and spacing of isochronous frames. During those times when B2 has nothing to send to L3 the frames allocated to that connection go unused. Yet when B2 does have something to send it may be larger than one frame in size, in which case the message may take a long time to send because of the rigidly fixed spacing between frames. Even if intervening frames belonging to other connections are unfilled, they can’t be used by the connection from B2 to L3. When communicating data between two computers, a system designer is usually willing to forgo the guarantee of uniform data rate and uniform latency if in return an entire message can get through more quickly. Data communication networks achieve this trade-off by using what is called asynchronous (from Greek roots meaning “untimed”) multiplexing. For example, in Figure 7.3, a network connects several personal computers and a service. In the middle of the network is a 45 megabits/second multiplexed link, shared by many network users. But, unlike the telephone example, this link is multiplexed asynchronously.

FIGURE 7.3 A simple data communication network. (Personal computers A, B, and C and a service D are connected through a shared multiplexed link; data crosses this link in bursts and can tolerate variable delay.)
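The payoff of giving up fixed frame slots can be seen with a one-line calculation. The sketch below is ours; the 4,000-bit message size is borrowed from Figure 7.4, and queuing and per-packet overhead are ignored. It compares sending one message over a dedicated 64 kilobit/second isochronous channel with sending it as a single burst on the shared 45 megabit/second link.

```python
message_bits = 4_000         # one bursty message, roughly the larger frame in Figure 7.4
channel_rate = 64_000        # a single isochronous channel, bits per second
link_rate = 45_000_000       # the shared asynchronous link, bits per second

isochronous_time = message_bits / channel_rate   # trickled out 8 bits every 125 microseconds
burst_time = message_bits / link_rate            # sent as one asynchronous burst

print(f"isochronous channel: {isochronous_time * 1e3:.1f} ms")      # about 62.5 ms
print(f"asynchronous burst:  {burst_time * 1e6:.1f} microseconds")  # about 89 microseconds
```

The burst finishes almost three orders of magnitude sooner, which is exactly the trade described above: faster delivery of whole messages in exchange for giving up a uniform data rate and uniform latency.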


FIGURE 7.4 Data flow on an asynchronous multiplexed link. (Along the time axis, two frames of different lengths, 4,000 bits and 750 bits, one destined for B and one for D, each carry their own guidance information.)

On an asynchronous link, a frame can be of any convenient length, and can be carried at any time that the link is not being used for another frame. Thus in the time sequence shown in Figure 7.4 we see two frames, the first going to B and the second going to D. Since the receiver can no longer figure out where the message in the frame is destined by simply counting bits, each frame must include a few extra bits that provide guidance about where to deliver it. A variable-length frame together with its guidance information is called a packet. The guidance information can take any of several forms. A common form is to provide the destination address of the message: the name of the place to which the message should be delivered. In addition to delivery guidance information, asynchronous data transmission requires some way of figuring out where each frame starts and ends, a process known as framing. In contrast, both addressing and framing with isochronous communication are done implicitly, by watching the clock. Since a packet carries its own destination guidance, there is no need for any prior agreement between the ends of the multiplexed link. Asynchronous communication thus offers the possibility of connectionless transmission, in which the switches do not need to maintain state about particular end-user communications.*

An additional complication arises because most links place a limit on the maximum size of a frame. When a message is larger than this maximum size, it is necessary for the sender to break it up into segments, each of which the network carries in a separate packet, and include enough information with each segment to allow the original message to be reassembled at the other end.

Asynchronous transmission can also be used for continuous streams of data such as from a digital telephone, by breaking the stream up into segments. Doing so does create a problem that the segments may not arrive at the other end at a uniform rate or with a uniform delay. On the other hand, if the variations in rate and delay are small enough, or the application can tolerate occasional missing segments of data, the method is still effective. In the case of telephony, the technique is called “packet voice” and it is gradually replacing many parts of the traditional isochronous voice network.

* Network experts make a subtle distinction among different kinds of packets by using the word datagram to describe a packet that carries all of the state information (for example, its destination address) needed to guide the packet through a network of packet forwarders that do not themselves maintain any state about particular end-to-end connections.

FIGURE 7.5 A packet forwarding network. (A workstation at network attachment point A and a service at network attachment point B are connected through an interconnected set of packet switches; the upper right packet switch has links numbered 1, 2, and 3.)

7.1.2 Packet Forwarding; Delay

Asynchronous communication links are usually organized in a communication structure known as a packet forwarding network. In this organization, a number of slightly specialized computers known as packet switches (in contrast with the circuit switches of Figure 7.1) are placed at convenient locations and interconnected with asynchronous links. Asynchronous links may also connect customers of the network to network attachment points, as in Figure 7.5. This figure shows two attachment points, named A and B, and it is evident that a packet going from A to B may follow any of several different paths, called routes, through the network. Choosing a particular path for a packet is known as routing.

The upper right packet switch has three numbered links connecting it to three other packet switches. The packet coming in on its link #1, which originated at the workstation at attachment point A and is destined for the service at attachment point B, contains the address of its destination. By studying this address, the packet switch will be able to figure out that it should send the packet on its way via its link #3. Choosing an outgoing link is known as forwarding, and is usually done by table lookup. The construction of the forwarding tables is one of several methods of routing, so packet switches are also called forwarders or routers. The resulting organization resembles that of the postal service.

A forwarding network imposes a delay (known as its transit time) in sending something from A to B. There are four contributions to transit time, several of which may be different from one packet to the next.


1. Propagation delay. The time required for the signal to travel across a link is determined by the speed of light in the transmission medium connecting the packet switches and the physical distance the signals travel. Although it does vary slightly with temperature, from the point of view of a network designer propagation delay for any given link can be considered constant. (Propagation delay also applies to the isochronous network.)

2. Transmission delay. Since the frame that carries the packet may be long or short, the time required to send the frame at one switch—and receive it at the next switch—depends on the data rate of the link and the length of the frame. This time is known as transmission delay. Although some packet switches are clever enough to begin sending a packet out before completely receiving it (a trick known as cut-through), error recovery is simpler if the switch does not forward a packet until the entire packet is present and has passed some validity checks. Each time the packet is transmitted over another link, there is another transmission delay. A packet going from A to B via the dark links in Figure 7.5 will thus be subject to four transmission delays, one when A sends it to the first packet switch, one at each forwarding step, and finally one to transmit it to B.

3. Processing delay. Each packet switch will have to examine the guidance information in the packet to decide to which outgoing link to send it. The time required to figure this out, together with any other work performed on the packet, such as calculating a checksum (see Sidebar 7.1) to allow error detection or copying it to an output buffer that is somewhere else in memory, is known as processing delay.

Sidebar 7.1: Error detection, checksums, and witnesses

A checksum on a block of data is a stylized kind of error-detection code in which redundant error-detecting information, rather than being encoded into the data itself (as Chapter 8[on-line] will explain), is placed in a separate field. A typical simple checksum algorithm breaks the data block up into k-bit chunks and performs an exclusive OR on the chunks to produce a k-bit result. (When k = 1, this procedure is called a parity check.) That simple k-bit checksum would catch any one-bit error, but it would miss some two-bit errors, and it would not detect that two chunks of the block have been interchanged. Much more sophisticated checksum algorithms have been devised that can detect multiple-bit errors or that are good at detecting particular kinds of expected errors. As will be seen in Chapter 11[on-line], by using cryptographic techniques it is possible to construct a high-quality checksum with the property that it can detect all changes—even changes that have been intentionally introduced by a malefactor—with near certainty. Such a checksum is called a witness, or fingerprint, and is useful for ensuring long-term integrity of stored data. The trade-off is that more elaborate checksums usually require more time to calculate and thus add to processing delay. For that reason, communication systems typically use the simplest checksum algorithm that has a reasonable chance of detecting the expected errors.
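To make the sidebar's algorithm concrete, here is a minimal sketch (ours, not from the text) of the simple exclusive-OR checksum it describes; setting k = 1 turns it into a parity check, and the final assertion demonstrates one of its blind spots, interchanged chunks.

```python
def simple_checksum(data: bytes, k: int = 8) -> int:
    """Exclusive-OR of the k-bit chunks of data, in the style of Sidebar 7.1."""
    value = int.from_bytes(data, "big")    # view the block as one large integer
    checksum = 0
    for shift in range(0, len(data) * 8, k):
        checksum ^= (value >> shift) & ((1 << k) - 1)   # peel off one k-bit chunk
    return checksum

block = b"an example data block"
print(simple_checksum(block))          # an 8-bit checksum
print(simple_checksum(block, k=1))     # k = 1: a single parity bit
# Interchanging two chunks leaves the checksum unchanged, so that error goes undetected:
swapped = block[1:2] + block[0:1] + block[2:]
assert simple_checksum(swapped) == simple_checksum(block)
```

A cryptographic witness would catch the interchange, but, as the sidebar notes, at a higher processing cost.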


This delay typically has one part that is relatively constant from one packet to the next and a second part that is proportional to the length of the packet.

4. Queuing delay. When the packet from A to B arrives at the upper right packet switch, link #3 may already be transmitting another packet, perhaps one that arrived from link #2, and there may also be other packets queued up waiting to use link #3. If so, the packet switch will hold the arriving packet in a queue in memory until it has finished transmitting the earlier packets. The duration of this delay depends on the amount of other traffic passing through that packet switch, so it can be quite variable. Queuing delay can sometimes be estimated with queuing theory, using the queuing theory formula in Section 6.1.6. If packets arrive according to a random, memoryless process and have randomly distributed service times (technically, a Poisson distribution in which for this case the service time is the transmission delay of the outgoing link), the average queuing delay, measured in units of the packet service time and including the service time of this packet, will be 1/(1 – ρ). Here ρ is the utilization of the outgoing line, which can range from 0 to 1.

When we plot this result in Figure 7.6 we notice a typical system phenomenon: delay rises rapidly as the line utilization approaches 100%. This plot tells us that the asynchronous system has introduced a trade-off: if we wish to limit the average queuing delay, for example to the amount labeled in the figure “maximum tolerable delay,” it will be necessary to leave unused, on average, some of the capacity of each link; in the example this maximum utilization is labeled ρmax. Alternatively, if we allow the utilization to approach 100%, delays will grow without bound. The asynchronous system seems to have replaced the abrupt appearance of the busy signal of the isochronous network with a gradual trade-off: as the system becomes busier, the delays increase. However, as we will see in Section 7.1.3, below, the replacement is actually more subtle than that.

FIGURE 7.6 Queuing delay as a function of utilization. (Average queuing delay, 1/(1 – ρ), plotted against utilization ρ from 0 to 100%; the utilization at which the curve crosses the maximum tolerable delay is labeled ρmax.)


The formula and accompanying graph tell us only the average delay. If we try to load up a link so that its utilization is ρmax, the actual delay will exceed our tolerance threshold about as often as it is below that threshold. If we are serious about keeping the maximum delay almost always below a given value, we must prepare for occasional worse peaks by holding utilization below the level of ρmax suggested by the figure. If packets do not obey memoryless arrival statistics (for example, they arrive in long convoys, and all are the same, maximum size), the model no longer applies, and we need a better understanding of the arrival process before we can say anything about delays. This same utilization versus delay trade-off also applies to non-network components of a computer system that have queues, for example scheduling the processor or reading and writing a magnetic disk.

We have talked about queuing theory as if it might be useful in predicting the behavior of a network. It is not. In practice, network systems put a bound on link queuing delays by limiting the size of queues and by exerting control on arrivals. These mechanisms allow individual links to achieve high utilization levels, while shifting delays to other places in the network. The next section explains how, and it also explains just what happened to the isochronous network’s hard-edged busy signal. Later, in Section 7.6 of this chapter we will see how the delays can be shifted all the way back to the entry point of the network.
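Before moving on, the four contributions can be added up in a back-of-the-envelope way. The sketch below is ours and every number in it is a made-up illustrative value; the queuing term uses the 1/(1 – ρ) estimate quoted above, counting the packet's own service time under transmission delay rather than twice.

```python
def hop_transit_time(distance_m, utilization,
                     propagation_speed=2.0e8,   # signal speed in the medium, m/s (assumed)
                     packet_bits=8_000, link_rate=45.0e6,
                     processing=10e-6):
    """Rough one-hop transit time in seconds: propagation + transmission
    + processing + queuing wait, with the wait taken from 1/(1 - rho)."""
    propagation = distance_m / propagation_speed
    transmission = packet_bits / link_rate                   # one service time
    queuing_wait = (1 / (1 - utilization) - 1) * transmission
    return propagation + transmission + processing + queuing_wait

# A hypothetical 1,000 km link: at 50% utilization propagation dominates,
# while at 90% the queuing wait has grown roughly ninefold.
print(f"{hop_transit_time(1_000_000, 0.5) * 1e3:.2f} ms")   # about 5.4 ms
print(f"{hop_transit_time(1_000_000, 0.9) * 1e3:.2f} ms")   # about 6.8 ms
```

Multiplying by the number of hops gives a first estimate of end-to-end transit time, though, as the next section shows, queuing behavior near full utilization is not this well behaved.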

7.1.3 Buffer Overflow and Discarded Packets

Continuing for a moment to apply queuing theory, queuing has an implication: buffer space is needed to hold the queue of packets waiting for transmission. How large a buffer should the designer allocate? Under the memoryless arrival interval assumption, the average number of packets awaiting transmission (including the one currently being transmitted) is 1/(1 – ρ). As with queuing delay, that number is only the average—queuing theory tells us that the variance of the queue length is also 1/(1 – ρ). For a ρ of 0.8 the average queue length and the variance are both 5, so if one wishes to allow enough buffers to handle peaks that are, say, three standard deviations above the average, one must be prepared to buffer not only the 5 packets predicted as the average but also (3 × √5 ≅ 7) more, a total of 12 packets. Worse, in many real networks packets don’t actually arrive independently at random; they come in buffer-bursting batches. At this point, we can imagine three quite different strategies for choosing a buffer size:

1. Plan for the worst case. Examine the network traffic carefully, figure out what the worst-case traffic situation will be, and allocate enough buffers to handle it.

2. Plan for the usual case and fight back. Based on a calculation such as the one above, choose a buffer size that will work most of the time, and if the buffers fill up send messages back through the network asking someone to stop sending.

3. Plan for the usual case and discard overflow. Again, choose a buffer size that will work most of the time, and ruthlessly discard packets when the buffers are full.
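Strategy 2's "calculation such as the one above" can be written down in a few lines. This sketch is ours and simply follows the statistics quoted in the preceding paragraph (mean and variance of the queue length both 1/(1 – ρ)), provisioning for the mean plus three standard deviations.

```python
import math

def buffer_size(rho: float, n_std_devs: float = 3.0) -> int:
    """Packets of buffer needed for the mean queue length plus a safety margin."""
    mean = 1 / (1 - rho)                 # average queue length, including packet in service
    std_dev = math.sqrt(1 / (1 - rho))   # the text quotes the variance as 1/(1 - rho)
    return math.ceil(mean + n_std_devs * std_dev)

print(buffer_size(0.8))   # mean 5 plus 3 x sqrt(5) (about 6.7) -> 12 packets, as in the text
```

As the text warns, this only covers statistically well-behaved traffic; convoys of back-to-back packets blow through a margin sized this way.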


Let’s explore these three possibilities in turn. Buffer memory is usually low in cost, so planning for the worst case seems like an attractive idea, but it is actually much harder than it sounds. For one thing, in a large network, it may be impossible to figure out what the worst case is—there just isn’t enough information available about what can happen. Even if one can estimate the worst case, the estimate may not be useful. Consider, for example, the Hypothetical Bank of Canada, which has 21,000 tellers scattered across the country. The branch at Moose Jaw, Saskatchewan, has one teller and usually is the target of only three transactions a day. Although it has never happened, and almost certainly never will, the worst case is that every one of the 20,999 other tellers simultaneously posts a withdrawal against a Moose Jaw account. Thus a worst-case design would require that there be enough buffers in the packet switch leading to Moose Jaw to handle 20,999 simultaneous messages. The problem with worst-case analysis is that the worst case can be many orders of magnitude larger than the average case, as well as extremely unlikely. Moreover, even if one decided to buy that large a buffer, the resulting queue to process all the transactions would be so long that many of the other tellers would give up in disgust and abort their transactions, so the large buffer wouldn’t really help.

This observation makes it sound attractive to choose a buffer size based on typical, rather than worst-case, loads. But then there is always going to be a chance that traffic will exceed the average for long enough to run out of buffer space. This situation is called congestion. What to do then?

One idea is to push back. If buffer space begins to run low, send a message back along an incoming link saying “please don’t send any more until you hear from me”. This message (called a quench request) may go to the packet switch at the other end of that link, or it may go all the way back to the original source that introduced the data into the network. Either way, pushing back is also harder than it sounds. If a packet switch is experiencing congestion, there is a good chance that the adjacent switch is also congested (if it is not already congested, it soon will be if it is told to stop sending data over the link to this switch), and sending an extra message is adding to the congestion. Worse, a set of packet switches configured in a cycle like that of Figure 7.5 can easily end up in a form of deadlock (called gridlock when it happens to automobile traffic), with all buffers filled and each switch waiting for the next switch to say that it is OK to start sending again.

One way to avoid deadlock among the packet switches is to send the quench request all the way back to the source. This method is hard too, for at least three reasons. First, it may not be clear to which source to send the quench. In our Moose Jaw example, there are 21,000 different sources, no one of which is, by itself, the cause of (nor capable of doing much about) the problem. Second, such a request may not have any effect because the source you choose to quench is no longer sending anyway. Again in our example, by the time the packet switch on the way to Moose Jaw detects the overload, all of the 21,000 tellers may have already sent their transaction requests, so asking them not to send anything else would accomplish nothing.
Third, assuming that the quench message is itself forwarded back through the packet-switched network, it may run into congestion and be subject to queuing delays. The busier the network, the longer it will take to exert

Saltzer & Kaashoek Ch. 7, p. 13

June 25, 2009 8:22 am

7–14

CHAPTER 7 The Network as a System and as a System Component

control. We are proposing to create a feedback system with delay and should expect to see oscillations. Even if all the data is coming from one source, by the time the quench gets back and the source acts on it, the packets already in the pipeline may exceed the buffer capacity. Controlling congestion by quenching either the adjacent switch or the source is used in various special situations, but as a general technique it is currently an unsolved problem. The remaining possibility is what most packet networks actually do in the face of con­ gestion: when the buffers fill up, they start throwing packets away. This seems like a somewhat startling thing for a communication system to do because it will disrupt the communication, and eventually each discarded packet will have to be sent again, so the effort to send the packet this far will have been wasted. Nevertheless, this is an action that every packet switching network that is not configured for the worst case must be pre­ pared to take. Overflowing buffers and discarded packets lead to two remarkable consequences. First, the sender of a packet can interpret the lack of its acknowledgment as a sign that the network is congested, and can in turn reduce the rate at which it introduces new packets into the network. This idea, called automatic rate adaptation, is explored in depth in Section 7.6 of this chapter. The combination of discarded packets and automatic rate adaptation in turn produce the second consequence: simple theoretical models of net­ work behavior based on standard queuing theory do not apply when a service may serve some requests and may discard others. Modeling of networks that have rate adaptation requires a much deeper understanding of the specific algorithms used not just by the net­ work but also by network applications. In the final analysis, the asynchronous network replaces the hard-edged blocking of the isochronous network with a variable transmission rate that depends on the instanta­ neous network load. Which scheme (asynchronous or isochronous) for dealing with overload is preferable depends on the application. For some applications it may be better to be told at the outset of a communications attempt to come back later, rather than to be allowed to start work only to encounter such variations in available capacity that it is hard to do anything useful. In other applications it may be more helpful to have some work done, slowly or at variable rates, rather than none at all. The possibility that a network may actually discard packets to cope with congestion leads to a useful distinction between two kinds of forwarding networks. So far, we have been discussing what is usually described as a best-effort network, which, if it cannot dis­ patch a packet soon after receipt, may discard it. The alternative design is the guaranteeddelivery network (sometimes called a store-and-forward network, although that term is often applied to all forwarding networks), which takes heroic measures to avoid ever dis­ carding payload data. Guaranteed delivery networks usually are designed to work with complete messages rather than packets. Typically, a guaranteed delivery network uses non-volatile storage such as a magnetic disk for buffering, so that it can handle large peaks of message load and can be confident that messages will not be lost even if there is a power failure or the forwarding computer crashes. Also, a guaranteed delivery network usually, when faced with the prospect of being completely unable to deliver a message

Saltzer & Kaashoek Ch. 7, p. 14

June 25, 2009 8:22 am

7.1 Interesting Properties of Networks

7–15

(perhaps because the intended recipient has vanished), explicitly returns the message to its originator along with an explanation of why delivery failed. Finally, in keeping with the spirit of not losing a message, a guaranteed delivery switch usually tracks individual messages carefully to make sure that none are lost or damaged during transmission, for example by a burst of noise. A switch of a best-effort network can be quite a bit simpler than a switch of a guaranteed-delivery network. Since the best-effort network may casu­ ally discard packets anyway, it does not need to make any special provisions for retransmitting damaged packets, for preserving packets in transit when the switch crashes and restarts, or for worrying about the case when the link to a destination node suddenly stops accepting data. The best-effort network is said to provide a best-effort contract to its customers (this contract is defined more carefully in Section 7.1.7, below), rather than a guarantee of delivery. Of course, in the real world there are no absolute guarantees—the real distinc­ tion between the two designs is that there is intended to be a significant difference in the probability of undetected loss. When we examine network layering in Section 7.2 of this chapter, it will become apparent that these differences can be characterized another way: guaranteed-delivery networks are usually implemented in a higher network layer, besteffort networks in a lower network layer. In these terms, the U.S. Postal Service operates a guaranteed delivery system for firstclass mail, but a best-effort system for third-class (junk) mail, because postal regulations allow it to discard third-class mail that is misaddressed or when congestion gets out of hand. The Internet is organized as a best-effort system, but the Internet mechanisms for handling e-mail are designed as a guaranteed delivery system. The Western Union com­ pany has always prided itself on operating a true guaranteed-delivery system, to the extent that when it decommissions an office it normally disassembles the site completely in a search for misplaced telegrams. There is a (possibly apocryphal) tale that such a dis­ assembly once discovered a 75-year-old telegram that had fallen behind a water pipe. The company promptly delivered it to the astonished heirs of the original addressee.

7.1.4 Duplicate Packets and Duplicate Suppression

As it turns out, discarded packets are not as much of a problem to the higher-level application as one might expect because when a client sends a request to a service, it is always possible that the service is not available, or the service crashed just after receiving the request. So unanswered requests are actually a routine occurrence, and many network protocols include some kind of timer expiration and resend mechanism to recover from such failures. The timing diagram of Figure 7.7* illustrates the situation, showing a first packet carrying a request, followed by a packet going the other way carrying the response to the first request. A has set a timer, indicated by a vertical line, but the arrival of response 1 before the expiration of the timer causes A to switch off the timer, indicated by the small X.

* The conventions for representation of timing diagrams were described in Sidebar 4.2.

[Timing diagram: A sends request 1 and sets a timer; response 1 from B arrives before the timer expires, so A resets the timer. A then sends request 2 and sets a timer; an overloaded forwarder discards the request packet, the timer expires, and A resends the request as request 2' and receives response 2.]
FIGURE 7.7 Lost packet recovery.

The packet carrying the second request is lost in transit (as indicated by the large X), perhaps having been damaged or discarded by an overloaded forwarder, the timer expires, and A resends request 2 in the packet labeled request 2'.

When a congested forwarder discards a packet, there are two important consequences. First, the client doesn't receive a response as quickly as originally hoped because a timer expiration period has been added to the overall response time. This extra delay can have a significant impact on performance. Second, users of the network must be prepared for duplicate requests and responses. The reason lies in the recovery mechanism just described. Suppose a network packet switch gets overloaded and must discard a response packet, as in Figure 7.8. Client A can't tell the difference between this case and the case of Figure 7.7, so it resends its request. The service sees this resent request as a duplicate. Suppose B does not realize this is a duplicate, does what is requested, and sends back a response. Client A receives the response and assumes that everything is OK. That may be a correct assumption, or it may not, depending on whether or not the first arrival of request 3 changed B's state. If B is a spelling checker, it will probably give the same response to both copies of the request. But if B is a bank and the request is to transfer funds, doing the request twice would be a mistake. So detecting duplicates may or may not be important, depending on the particular application.

[Timing diagram: A sends request 3 and sets a timer; B's response 3 is discarded by an overloaded forwarder. A's timer expires and A resends the request, setting a new timer; the duplicate arrives at B, which sends response 3'. A receives the response and resets its timer.]
FIGURE 7.8 Lost packet recovery leading to duplicate request.

For another example, if for some reason the network delays pile up and exceed the resend timer expiration period, the client may resend a request even though the original response is still in transit. Since B can't tell any difference between this case and the previous one, it responds in the same way, by doing what is requested. But now A receives a duplicate response, as in Figure 7.9. Again, this duplicate may or may not matter to A, but at minimum A must take steps not to be confused by the arrival of a duplicate response.

What if the arrival of a request from A causes B to change state, as in the bank transfer example? If so, it is usually important to detect and suppress duplicates generated by the lost packet recovery mechanism. The general procedure to suppress duplicates has two components.

[Timing diagram: A sends request 4 and sets a timer; the packet containing response 4 is delayed in the network, so the timer expires and A resends the request as request 4'. The duplicate arrives at B, which sends response 4'; A receives both the original and the duplicate response.]
FIGURE 7.9 Network delay combined with recovery leading to duplicate response.

The first component is hinted at by the request and response numbers used in the illustrations: each request includes a nonce, which is a unique identifier that will never be reused by A when sending requests to B. The illustration uses monotonically increasing serial numbers as nonces, but any unique identifier will do. The second duplicate suppression component is that B must maintain a list of nonces on which it has taken action or is still working, and whenever a request arrives B should look through this list to see whether or not this apparently new request is actually a duplicate of one previously received. If it is a duplicate B must not perform the action requested. On the other hand, B should not simply ignore the request, either, because the reason for the duplicate may be that A never received B's response. So B needs some way of reconstructing and resending that previous response. The simplest way of doing this is usually for B to add to its list of previously handled nonces a copy of the corresponding responses so that it can easily resend them. Thus in Figure 7.9, the last action of B should be replaced with "B resends response 4".

In some network designs, A may even receive duplicate responses to a single, unrepeated request. The reason is that a forwarding link deep inside the network may be using a timer expiration and resend protocol similar to the one above. For this reason, most protocols that are concerned about duplicate suppression include a copy of the nonce in the response, and the originator, A, maintains a list of nonces used in its outstanding requests. When a response comes back, A can check for the nonce in the list and delete that list entry or, if there is no list entry, assume it is a duplicate of a previously received response and ignore it.

The procedure we have just described allows A to keep its list of nonces short, but B might have to maintain an ever-growing list of nonces and responses to be certain that it never accidentally processes a request twice. A related problem concerns what happens if either participant crashes and restarts, losing its volatile memory, which is probably where it is keeping its list of nonces. Refinements to cope with these problems will be explored in detail when we revisit the topic of duplicate suppression on page 7–71 of this chapter.

Ensuring suppression of duplicates is a significant complication so, if possible, it is wise to design the service and its protocol in such a way that suppression is not required. Recall that the reason that duplicate suppression became important was that a request changed the state of the service. It is often possible to design a service interface so that it is idempotent, which for a network request means that repeating the same request or sequence of requests several times has the same effect as doing it just once. This design approach is explored in depth in the discussion of atomicity and error recovery in Chapter 9[on-line].
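In program form, the service side of this procedure amounts to a table lookup before acting. The sketch below is only an illustration of the idea, not the book's protocol: the names handle_request and perform_action and the in-memory dictionary are hypothetical, and a real service would also need the refinements just mentioned to bound the table and to survive crashes.

# Sketch of service-side duplicate suppression, assuming each request
# carries a nonce that the client never reuses. Names are illustrative.
class Service:
    def __init__(self):
        self.completed = {}   # nonce -> saved response, so duplicates can be re-answered

    def handle_request(self, nonce, request):
        if nonce in self.completed:
            # Duplicate: do not redo the action, just resend the saved response.
            return self.completed[nonce]
        response = self.perform_action(request)   # may change the service's state
        self.completed[nonce] = response          # remember it for possible resends
        return response

    def perform_action(self, request):
        return "did " + request                   # stand-in for the real work

If perform_action were idempotent, the completed table would not be needed for correctness at all, only perhaps as a performance optimization.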

7.1.5 Damaged Packets and Broken Links

At the beginning of the chapter we noted that noise is one of the fundamental considerations that dominates the design of data communication. Data can be damaged during transmission, during transit through a switch, or in the memory of a forwarding node. Noise, transmission errors, and techniques for detecting and correcting errors are fascinating topics in their own right, explored in some depth in Chapter 8[on-line]. As a
general rule it is possible to sub-contract this area to a specialist in the theory of error detection and correction, with one requirement in the contract: when we receive data, we want to know whether or not it is correct. That is, we require that a reliable error detection mechanism be part of any underlying data transmission system. Section 7.3.3 of this chapter expands a bit on this error detection requirement. Once we have contracted for data transmission with an error detection mechanism in which we have confidence, intermediate packet switches can then handle noise-damaged packets by simply discarding them. This approach changes the noise problem into one for which there is already a recovery procedure. Put another way, this approach trans­ forms data loss into performance degradation. Finally, because transmission links traverse hostile environments and must be consid­ ered fragile, a packet network usually has multiple interconnection paths, as in Figure 7.5. Links can go down while transmitting a frame; they may stay down briefly, e.g. because of a power interruption, or for long periods of time while waiting for someone to dig up a street or launch a replacement satellite. Flexibility in routing is an important property of a network of any size. We will return to the implications of broken links in the discussion of the network layer, in Section 7.4 of this chapter.

7.1.6 Reordered Delivery

When a packet-forwarding network has an interconnection topology like that of Figure 7.5, in which there is more than one path that a packet can follow from A to B, there is a possibility that a series of packets departing from A in sequential order may arrive at B in a different order. Some networks take special precautions to avoid this possibility by forcing all packets between the same two points to take the same path or by delaying delivery at the destination until all earlier packets have arrived. Both of these techniques introduce additional delay, and there are applications for which reducing delay is more important than receiving the segments of a message in the order in which they were transmitted.

Recalling that a message may have been divided into segments, the possibility of reordered delivery means that reassembly of the original message requires close attention. We have here a model of communication much like when a friend is touring on holiday by car, stopping each night in a different motel, and sending a motel postcard with an account of the day's adventures. Whenever a day's story doesn't fit on one card, your friend uses two or three postcards, as necessary. The Post Office may deliver these cards to you in almost any order, and something on the postcard—probably the date—will be needed to enable you to read them in the proper order. Even when two cards are mailed at the same time from the same motel (as indicated by the motel photograph on the front) the Post Office may deliver them to you on different days, so there must be further information on the postcard to allow you to realize that the sender broke the original message into segments and you may need to wait for the next delivery before starting to read.
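The "date on the postcard" corresponds to a segment number carried in each segment. The following sketch is a minimal illustration under assumed conventions (segments of one message are numbered 0 through total − 1; the class and method names are invented): it holds out-of-order arrivals and delivers the reassembled message only when every segment is present.

# Minimal reassembly buffer for one message whose segments may arrive
# in any order. Assumes each segment carries (segment_number, total, data).
class Reassembler:
    def __init__(self):
        self.pieces = {}      # segment number -> data
        self.total = None

    def accept(self, segment_number, total, data):
        self.total = total
        self.pieces[segment_number] = data        # a duplicate simply overwrites
        if len(self.pieces) == self.total:
            # Every segment has arrived; deliver them in order.
            return b"".join(self.pieces[i] for i in range(self.total))
        return None                               # still waiting for more segments

A real end-to-end protocol would also bound the buffer and cope with segments that never arrive, topics taken up later in this chapter.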

7.1.7 Summary of Interesting Properties and the Best-Effort Contract

Most of the ideas introduced in this section can be captured in just two illustrations. Figure 7.10 summarizes the differences in application characteristics and in response to overload between isochronous and asynchronous multiplexing. Similarly, Figure 7.11 briefly summarizes the interesting (the term "challenging" may also come to mind) properties of computer networks that we have encountered. The "best-effort contract" of the caption means that when a network accepts a segment, it offers the expectation that it will usually deliver the segment to its destination, but it does not guarantee success, and the client of the network is expected to be sophisticated enough to take in stride the possibility that segments may be lost, duplicated, variably delayed, or delivered out of order.

7.2 Getting Organized: Layers

To deal with the interesting properties of networks that we identified in Section 7.1, it is necessary to get organized. The primary organizing tool for networks is an example of the design principle adopt sweeping simplifications. All networks use the divide-and-conquer technique known as layering of protocols. But before we come to layers, we must establish what a protocol is.

[Table summarizing application characteristics and response to load for the two network types:
  Continuous stream (e.g., interactive voice): isochronous (e.g., telephone network) is a good match; asynchronous (e.g., Internet) variable latency upsets the application.
  Bursts of data (most computer-to-computer data): isochronous wastes capacity; asynchronous is a good match.
  Response to load variations: isochronous is hard-edged (either accepts or blocks the call); asynchronous degrades gradually (1 variable delay, 2 discards data, 3 rate adaptation).]
FIGURE 7.10 Isochronous versus asynchronous multiplexing.

Suppose we are examining the set of programs used by a defense contractor who is retooling for a new business, video games. In the main program we find the procedure call

    FIRE (#_of_missiles, target, action_if_defended)

and elsewhere we find the corresponding procedure, which begins

    procedure FIRE (nmissiles, where, reaction)

These constructs are interpreted at two levels. First, the system matches the name FIRE in the main program with the program that exports a procedure of the same name, and it arranges to transfer control from the main program to that procedure. The procedure, in turn, matches the arguments of the calling program, position by position, with its own parameters. Thus, in this example, the second argument, target, of the calling program is matched with the second parameter, where, of the called procedure. Beyond this mechanical matching, there is an implicit agreement between the programmer of the main program and the programmer of the procedure that this second argument is to be interpreted as the location that the missiles are intended to hit.

This set of agreements on how to interpret both the order and the meaning of the arguments stands as a kind of contract between the two programs. In programming languages, such contracts are called "specifications"; in networks, such contracts are called protocols.

1. Networks encounter a vast range of
   • Data rates
   • Propagation, transmission, queuing, and processing delays
   • Loads
   • Numbers of users
2. Networks traverse hostile environments
   • Noise damages data
   • Links stop working
3. Best-effort networks have
   • Variable delays
   • Variable transmission rates
   • Discarded packets
   • Duplicate packets
   • Maximum packet length
   • Reordered delivery

FIGURE 7.11 A summary of the "interesting" properties of computer networks. The last group of bullets defines what is called the best-effort contract.

[Diagram: the client calls result ← FIRE (#, target, action). The client stub prepares a request message (proc: FIRE, args: 3; an integer with value 2, a string with value "Lucifer", a procedure with value EVADE), sends it to the service, and waits for the response. The service stub receives the request message, calls the requested procedure FIRE (nmiss, where, react), prepares a response message (acknowledgment; a string with value "destroyed"), and sends it back to the client.]
FIGURE 7.12 A remote procedure call.

More generally, a protocol goes beyond just the interpretation of the arguments; it encompasses everything that either of the two parties can depend on about how the other will act or react. For example, in a client/service system, a request/response protocol might specify that the service send an immediate acknowledgment when it gets a request, so that the client knows that the service is there, and send the eventual response as a third message. An example of a protocol that we have already seen is that of the Network File System shown in Figure 4.10.

Let us suppose that our defense contractor wishes to further convert the software from a single-user game to a multiuser game, using a client/service organization. The main program will run as a client and the FIRE program will now run in a multiclient, game-coordinating service. To simplify the conversion, the contractor has chosen to use the remote procedure call (RPC) protocol illustrated in Figure 7.12. As described in Chapter 4, a stub procedure that runs in the client machine exports the name FIRE so that when the main program calls FIRE, control actually passes to the stub with that name. The stub collects the arguments, marshals them into a request message, and sends them over the network to the game-coordinating service. At the service, a corresponding stub waits for such a request to arrive, unmarshals the arguments in the request message, and uses them to perform a call to the real FIRE procedure. When FIRE completes its operation and returns, the service stub marshals any output value into a response message and sends it to the client. The client stub waits for this response message, and when it arrives, it unmarshals the return value in the response message and returns it as its own value to the main program. The procedure call protocol has been honored and the main program continues as if the procedure named FIRE had executed locally.

Figure 7.12 also illustrates a second, somewhat different, protocol between the client stub and the service stub, as compared with the protocol between the main program and the procedure it calls. Between the two stubs the request message spells out the name of the procedure to be called, the number of arguments, and the types of each argument.

[Diagram: the main program and the called procedure communicate via the application protocol; the RPC client stub and the RPC service stub communicate via the presentation protocol.]
FIGURE 7.13 Two protocol layers.

The details of the protocol between the RPC stubs need have little in common with the corresponding details of the protocol between the original main program and the procedure it calls.
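To make the two protocols concrete, here is a rough sketch of the two stubs in Python. The message format (an ordinary tuple) and the helper names send and receive are invented for this illustration; they are not the wire format or interface of any particular RPC system.

# Illustrative client and service stubs for FIRE.
def fire_client_stub(n_missiles, target, action_if_defended, send, receive):
    request = ("FIRE", 3, ("integer", n_missiles),
                          ("string", target),
                          ("procedure", action_if_defended))   # marshal the arguments
    send(request)                        # hand the request message to the layer below
    response = receive()                 # wait for the response message
    return response[1]                   # unmarshal the result and return it

def fire_service_stub(request, fire):
    procedure_name, nargs, *args = request
    assert procedure_name == "FIRE" and nargs == len(args)
    result = fire(*(value for _type, value in args))    # call the real procedure
    return ("acknowledgment", result)                    # marshal the response

The main program still just calls FIRE; only the stubs know that a request and a response message travel between them.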

7.2.1 Layers

In that example, the independence of the MAIN-to-FIRE procedure call protocol from the RPC stub-to-stub protocol is characteristic of a layered design. We can make those layers explicit by redrawing our picture as in Figure 7.13. The contract between the main program and the procedure it calls is called the application protocol. The contract between the client-side and service-side RPC stubs is known as a presentation protocol because it translates data formats and semantics to and from locally preferred forms.

The request message must get from the client RPC stub to the service RPC stub. To communicate, the client stub calls some network procedure, using an elaboration of the SEND abstraction:

    SEND_MESSAGE (request_message, service_name)

specifying in a second argument the identity of the service that should receive this request message. The service stub invokes a similar procedure that provides the RECEIVE abstraction to pick up the message. These two procedures represent a third layer, which provides a transport protocol, and we can extend our layered protocol picture as in Figure 7.14.

This figure makes apparent an important property of layering as used in network designs: every module has not two, but three interfaces. In the usual layered organization, a module has just two interfaces, an interface to the layer above, which hides a second interface to the layer below. But as used in a network, layering involves a third interface. Consider, for example, the RPC client stub in the figure. As expected, it provides an interface that the main program can use, and it uses an interface of the client network package below. But the whole point of the RPC client stub is to construct a request message that convinces its correspondent stub at the service to do something. The presentation protocol thus represents a third interface of the presentation layer module. The presentation module thus hides both the lower layer interface and the presentation protocol from the layer above.

[Diagram: the main program calls fire on the RPC client stub, which calls send_message on the client network package; on the service side the service network package delivers the message via receive_message to the RPC service stub, which calls the procedure fire and returns the result back down the same path. The corresponding pairs communicate via the application protocol (main program and called procedure), the presentation protocol (the two RPC stubs), and the transport protocol (the two network packages).]
FIGURE 7.14 Three protocol layers.

This observation is a general one—each layer in a network implementation provides an interface to the layer above, and it hides the interface to the layer below as well as the protocol interface to the correspondent with which it communicates.

Layered design has proven to be especially effective, and it is used in some form in virtually every network implementation. The primary idea of layers is that each layer hides the operation of the layer below from the layer above, and instead provides its own interpretation of all the important features of the lower layer. Every module is assigned to some layer, and interconnections are restricted to go between modules in adjacent layers. Thus in the three-layer system of Figure 7.15, module A may call any of the modules J, K, or L, but A doesn't even know of the existence of X, Y, and Z. The figure shows A using module K. Module K, in turn, may call any of X, Y, or Z.

[Diagram: Layer One contains modules A, B, C, and D; Layer Two contains modules J, K, and L; Layer Three contains modules X, Y, and Z. Module A is shown using module K.]
FIGURE 7.15 A layered system.

Different network designs, of course, will have different layering strategies. The particular layers we have discussed are only an illustration—as we investigate the design of the transport protocol of Figure 7.14 in more detail, we will find it useful to impose further layers, using a three-layer reference model that provides quite a bit of insight into how networks are organized. Our choice strongly resembles the layering that is used in the design of the Internet. The three layers we choose divide the problem of implementing a network as follows (from the bottom up):

• The link layer: moving data directly from one point to another.
• The network layer: forwarding data through intermediate points to move it to the place it is wanted.
• The end-to-end layer: everything else required to provide a comfortable application interface.

The application itself can be thought of as a fourth, highest layer, not part of the network. On the other hand, some applications intertwine themselves so thoroughly with the end-to-end layer that it is hard to make a distinction.

The terms frame, packet, segment, message, and stream that were introduced in Section 7.1 can now be identified with these layers. Each is the unit of transmission of one of the protocol layers. Working from the top down, an application starts by asking the end-to-end layer to transmit a message or a stream of data to a correspondent. The end-to-end layer splits long messages and streams into segments, it copes with lost or duplicated segments, it places arriving segments in proper order, it enforces specific communication semantics, it performs presentation transformations, and it calls on the network layer to transmit each segment. The network layer accepts segments from the end-to-end layer, constructs packets, and transmits those packets across the network, choosing which links to follow to move a given packet from its origin to its destination. The link layer accepts packets from the network layer, and constructs and transmits frames across a single link between two forwarders or between a forwarder and a customer of the network.

Some network designs attempt to impose a strict layering among various parts of what we call the end-to-end layer, but it is often such a hodgepodge of function that no single layering can describe it in a useful way. On the other hand, the network and link layers are encountered frequently enough in data communication networks that one can almost consider them universal.

With this high-level model in mind, we next sketch the basic contracts for each of the three layers and show how they relate to one another. Later, we examine in much more depth how each of the three layers is actually implemented.

7.2.2 The Link Layer

At the bottom of a packet-switched network there must be some underlying communication mechanism that connects one packet switch with another or a packet switch to a customer of the network. The link layer is responsible for managing this low-level communication. The goal of the link layer is to move the bits of the packet across one (usually, but not necessarily, physical) link, hiding the particular mechanics of data transmission that are involved.

[Diagram: packet switch B has link 1 to switch A and link 2 to switch C. A call to LINK_SEND (pkt, link2) causes B's link layer to wrap the packet DATA in a frame with link header LH and link trailer LT and to send it over link 2 using the link protocol; arriving data is passed up to the layer above via NETWORK_HANDLE.]
FIGURE 7.16 A link layer in a packet switch that has two physical links.

A typical, somewhat simplified, interface to the link layer looks something like this:

    LINK_SEND (data_buffer, link_identifier)

where data_buffer names a place in memory that contains a packet of information ready to be transmitted, and link_identifier names, in a local address space, one of possibly sev­ eral links to use. Figure 7.16 illustrates the link layer in packet switch B, which has links to two other packet switches, A and C. The call to the link layer identifies a packet buffer named pkt and specifies that the link layer should place the packet in a frame suitable for transmission over link2, the link to packet switch C. Switches B and C both have imple­ mentations of the link layer, a program that knows the particular protocol used to send and receive frames on this link. The link layer may use a different protocol when sending a frame to switch A using link number 1. Nevertheless, the link layer typically presents a uniform interface (LINK_SEND) to higher layers. Packet switch B and packet switch C may use different labels for the link that connects them. If packet switch C has four links, the frame may arrive on what C considers to be its link number 3. The link identifier is thus a name whose scope is limited to one packet switch. The data that actually appears on the physical wire is usually somewhat different from the data that appeared in the packet buffer at the interface to the link layer. The link layer is responsible for taking into account any special properties of the underlying physical channel, so it may, for example, encode the data in a way that is less fragile in the local noise environment, it may fragment the data because the link protocol requires shorter frames, and it may repeatedly resend the data until the other end of the link acknowl­ edges that it has received it. These channel-specific measures generally require that the link layer add information to the data provided by the network layer. In a layered communication system, the data passed from an upper layer to a lower layer for transmission is known as the payload. When a lower layer adds to the front of the payload some data intended only for the use of the corresponding lower layer at the other end, the addition is called a header, and when the lower layer adds something to the end, the addition is called a trailer. In Figure

7.16, the link layer has added a link layer header LH (perhaps indicating which network layer program to deliver the packet to) and a link layer trailer LT (perhaps containing a checksum for error detection). The combination of the header, payload, and trailer becomes the link-layer frame. The receiving link layer module will, after establishing that the frame has been correctly received, remove the link layer header and trailer before passing the payload to the network layer. The particular method of waiting for a frame, packet, or message to arrive and trans­ ferring payload data and control from a lower layer to an upper layer depends on the available thread coordination procedures. Throughout this chapter, rather than having an upper layer call down to a lower-layer procedure named RECEIVE (as Section 2.1.3 sug­ gested), we use upcalls, which means that when data arrives, the lower layer makes a procedure call up to an entry point in the higher layer. Thus in Figure 7.16 the link layer calls a procedure named NETWORK_HANDLE in the layer above.
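A minimal sketch of the sending and receiving sides of such a link layer appears below. The one-byte header, the CRC-32 trailer, and the names other than the LINK_SEND/NETWORK_HANDLE pattern are assumptions made for illustration; a real link layer is shaped by the details of its particular physical channel.

# Illustrative link layer: wrap the payload in a header and trailer on the way
# out, and verify and strip them before the upcall on the way in.
import zlib

def link_send(data_buffer: bytes, link_identifier: int, links) -> None:
    header = bytes([link_identifier])                       # hypothetical 1-byte link header (LH)
    trailer = zlib.crc32(data_buffer).to_bytes(4, "big")    # checksum trailer (LT)
    frame = header + data_buffer + trailer
    links[link_identifier].transmit(frame)                  # hand the frame to the physical channel

def link_receive(frame: bytes, network_handle) -> None:
    payload, trailer = frame[1:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        return                                              # damaged frame: simply discard it
    network_handle(payload)                                 # upcall to the network layer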

7.2.3 The Network Layer

A segment enters a forwarding network at one of its network attachment points (the source), accompanied by instructions to deliver it to another network attachment point (the destination). To reach the destination it will probably have to traverse several links. Providing a systematic naming scheme for network attachment points, determining which links to traverse, creating a packet that contains the segment, and forwarding the packet along the intended path are the jobs of the network layer. The interface to the network layer, again somewhat simplified, resembles that of the link layer:

    NETWORK_SEND (segment_buffer, network_identifier, destination)

The NETWORK_SEND procedure transmits the segment found in segment_buffer (the payload, from the point of view of the network layer), using the network named in network_identifier (a single computer may participate in more than one network), to destination (the address within that network that names the network attachment point to which the segment should be delivered).

The network layer, upon receiving this call, creates a network-layer header, labeled NH in Figure 7.17, and/or trailer, labeled NT, to accompany the segment as it traverses the network named "IP", and it assembles these components into a packet. The key item of information in the network-layer header is the address of the destination, for use by the next packet switch in the forwarding chain. Next, the network layer consults its tables to choose the most appropriate link over which to send this packet with the goal of getting it closer to its destination. Finally, the network layer calls the link layer asking it to send the packet over the chosen link. When the frame containing the packet arrives at the other end of the link, the receiving link layer strips off the link layer header and trailer (LH and LT in the figure) and hands the packet to its network layer by an upcall to NETWORK_HANDLE.

[Diagram: a call to NETWORK_SEND (segment, "IP", nap_1197) causes the network layer to add network header NH and trailer NT to the segment DATA and to call LINK_SEND (packet, link2); the link layer adds LH and LT and transmits the frame over link 2 using the link protocol. At the receiving switch the link layer passes the packet up via NETWORK_HANDLE, and that switch's network layer forwards it with LINK_SEND (packet, link5).]
FIGURE 7.17 Relation between the network layer and the link layer.

This network layer module examines the network layer header and trailer to determine the intended destination of the packet. It consults its own tables to decide on which outgoing link to forward the packet, and it calls the link layer to send the packet on its way. The network layer of each packet switch along the way repeats this procedure, until the packet traverses the link to its destination. The network layer at the end of that link recognizes that the packet is now at its destination, it extracts the data segment from the packet, and passes that segment to the end-to-end layer, with another upcall.
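The per-switch forwarding step can be sketched as follows. The packet representation and the forwarding table (here, a dictionary from destination addresses to outgoing link identifiers) are invented for this illustration; how such tables are actually built and maintained is the subject of Section 7.4.

# Illustrative per-switch network layer. Each packet is (destination, segment).
def network_handle(packet, my_address, forwarding_table, link_send, end_to_end_handle):
    destination, segment = packet
    if destination == my_address:
        end_to_end_handle(segment)              # at the destination: upcall with the payload
    else:
        link = forwarding_table[destination]    # choose the outgoing link
        link_send(packet, link)                 # forward the packet one more hop

def network_send(segment, destination, my_address, forwarding_table, link_send, end_to_end_handle):
    packet = (destination, segment)             # attach the network-layer "header"
    network_handle(packet, my_address, forwarding_table, link_send, end_to_end_handle)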

7.2.4 The End-to-End Layer

We can now put the whole picture together. The network and link layers together provide a best-effort network, which has the "interesting" properties that were listed in Figure 7.11 on page 7–21. These properties may be problematic to an application, and the function of the end-to-end layer is to create a less "interesting" and thus easier to use interface for the application. For example, Figure 7.18 shows the remote procedure call of Figure 7.12 from a different perspective. Here the RPC protocol is viewed as an end-to-end layer of a complete network implementation. As with the lower layers, the end-to-end layer has added a header and a trailer to the data that the application gave it, and inspecting the bits on the wire we now see three distinct headers and trailers, corresponding to the three layers of the network implementation.

The RPC implementation in the end-to-end layer provides several distinct end-to-end services, each intended to hide some aspect of the underlying network from its application:

• Presentation services. Translating data formats and emulating the semantics of a procedure call. For this purpose the end-to-end header might contain, for example, a count of the number of arguments in the procedure call.

• Transport services. Dividing streams and messages into segments and dealing with lost, duplicated, and out-of-order segments. For this purpose, the end-to-end header might contain serial numbers of the segments.

• Session services. Negotiating a search, handshake, and binding sequence to locate and prepare to use a service that knows how to perform the requested procedure. For this purpose, the end-to-end header might contain a unique identifier that tells the service which client application is making this call.

Depending on the requirements of the application, different end-to-end layer implementations may provide all, some, or none of these services, and the end-to-end header and trailer may contain various different bits of information.

There is one other important property of this layering that becomes evident in examining Figure 7.18. Each layer considers the payload transmitted by the layer above to be information that it is not expected, or even permitted, to interpret. Thus the end-to-end layer constructs a segment with an end-to-end header and trailer that it hands to the network layer, with the expectation that the network layer will not look inside or perform any actions that require interpretation of the segment.

[Diagram: the call FIRE (7, "Lucifer", evade) passes through the end-to-end layer (RPC), which produces a segment consisting of EH, DATA, and ET; the network layer wraps that segment into a packet by adding NH and NT; the link layer wraps the packet into a frame by adding LH and LT before transmission.]
FIGURE 7.18 Three network layers in action. The arguments of the procedure call become the payload of the end-to-end segment. The network layer forwards the packet across two links on the way from the client to the service. The frame on the wire contains the headers and trailers of three layers.

The network layer, in turn, adds a network-layer header and trailer and hands the resulting packet to the link layer, again with the expectation that the link layer will consider this packet to be an opaque string of bits, a payload to be carried in a link-layer frame. Violation of this rule would lead to interdependence across layers and consequent loss of modularity of the system.
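This discipline is easy to state in code. In the sketch below (hypothetical names, with two-character stand-ins for the headers and trailers), each layer wraps the payload it is given without ever parsing it, producing the nesting shown in Figure 7.18.

# Illustrative encapsulation: each layer treats its payload as opaque bytes.
def end_to_end_wrap(data: bytes) -> bytes:
    return b"EH" + data + b"ET"          # end-to-end header and trailer

def network_wrap(segment: bytes) -> bytes:
    return b"NH" + segment + b"NT"       # network header and trailer; segment not inspected

def link_wrap(packet: bytes) -> bytes:
    return b"LH" + packet + b"LT"        # link header and trailer; packet not inspected

frame = link_wrap(network_wrap(end_to_end_wrap(b"DATA")))
# frame == b"LHNHEHDATAETNTLT", matching the nesting of Figure 7.18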

7.2.5 Additional Layers and the End-to-End Argument To this point, we have suggested that a three-layer reference model is both necessary and sufficient to provide insight into how networks operate. Standard textbooks on network design and implementation mention a reference model from the International Organi­ zation for Standardization, known as “Open Systems Interconnect”, or OSI. The OSI reference model has not three, but seven layers. What is the difference? There are several differences. Some are trivial; for example, the OSI reference model divides the link layer into a strategy layer (known as the “data link layer”) and a physical layer, recognizing that many different kinds of physical links can be managed with a small number of management strategies. There is a much more significant difference between our reference model and the OSI reference model in the upper layers. The OSI reference model systematically divides our end-to-end layer into four distinct layers. Three of these layers directly correspond, in the RPC example, to the layers of Figure 7.14: an application layer, a presentation layer, and a transport layer. In addition just above the transport layer the ISO model inserts a layer that provides the session services mentioned just above. We have avoided this approach for the simple reason that different applications have radically different requirements for transport, session, and presentation services—even to the extent that the order in which they should be applied may be different. This situation makes it difficult to propose any single layering, since a layering implies an ordering. For example, an application that consists of sending a file to a printer would find most useful a transport service that guarantees to deliver to the printer a stream of bytes in the same order in which they were sent, with none missing and none duplicated. But a file transfer application might not care in what order different blocks of the file are delivered, so long as they all eventually arrive at the destination. A digital telephone application would like to see a stream of bits representing successive samples of the sound waveform delivered in proper order, but here and there a few samples can be missing without inter­ fering with the intelligibility of the conversation. This rather wide range of application requirements suggests that any implementation decisions that a lower layer makes (for example, to wait for out-of-order segments to arrive so that data can be delivered in the correct order to the next higher layer) may be counterproductive for at least some appli­ cations. Instead, it is likely to be more effective to provide a library of service modules that can be selected and organized by the programmer of a specific application. Thus, our end-to-end layer is an unstructured library of service modules, of which the RPC protocol is an example.

This argument against additional layers is an example of a design principle known as

    The end-to-end argument
    The application knows best.

In this case, the basic thrust of the end-to-end argument is that the application knows best what its real communication requirements are, and for a lower network layer to try to implement any feature other than transporting the data risks implementing something that isn’t quite what the application needed. Moreover, if it isn’t exactly what is needed, the application will probably have to reimplement that function on its own. The end-to­ end argument can thus be paraphrased as: don’t bury it in a lower layer, let the end points deal with it because they know best what they need. A simple example of this phenomenon is file transfer. To transfer a file carefully, the appropriate method is to calculate a checksum from the contents of the file as it is stored in the file system of the originating site. Then, after the file has been transferred and writ­ ten to the new file system, the receiving site should read the file back out of its file system, recalculate the checksum anew, and compare it with the original checksum. If the two checksums are the same, the file transfer application has quite a bit of confidence that the new site has a correct copy; if they are different, something went wrong and recovery is needed. Given this end-to-end approach to checking the accuracy of the file transfer, one can question whether or not there is any value in, for example, having the link layer protocol add a frame checksum to the link layer trailer. This link layer checksum takes time to calculate, it adds to the data to be sent, and it verifies the correctness of the data only while it is being transmitted across that link. Despite this protection, the data may still be damaged while it is being passed through the network layer, or while it is buffered by the receiving part of the file transfer application, or while it is being written to the disk. Because of those threats, the careful file transfer application cannot avoid calculating its end-to-end checksum, despite the protection provided by the link layer checksum. This is not to say that the link layer checksum is worthless. If the link layer provides a checksum, that layer will discover data transmission errors at a time when they can be easily corrected by resending just one frame. Absent this link-layer checksum, a transmis­ sion error will not be discovered until the end-to-end layer verifies its checksum, by which point it may be necessary to redo the entire file transfer. So there may be a signif­ icant performance gain in having this feature in a lower-level layer. The interesting observation is that a lower-layer checksum does not eliminate the need for the application layer to implement the function, and it is thus not required for application correctness. It is just a performance enhancement. The end-to-end argument can be applied to a variety of system design issues in addi­ tion to network design. It does not provide an absolute decision technique, but rather a useful argument that should be weighed against other arguments in deciding where to place function.
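As a sketch of what "careful" means here, the following code computes the checksum from the file as stored at the origin and re-reads the copy from the destination's file system before comparing. The function names and the choice of SHA-256 are illustrative assumptions, not the book's prescription; any strong checksum computed at the two end points makes the same point.

# Sketch of an end-to-end check on a file transfer. The transfer itself
# (send_file) is assumed to exist; the checksum is taken from the file system
# at the source and re-read from the file system at the destination, so it
# also covers damage that happens outside the network.
import hashlib

def file_checksum(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def careful_transfer(source_path, destination_path, send_file):
    original = file_checksum(source_path)          # checksum as stored at the origin
    send_file(source_path, destination_path)       # transfer and write at the new site
    copy = file_checksum(destination_path)         # re-read the copy from its file system
    if copy != original:
        raise IOError("transfer failed the end-to-end check; recovery needed")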

7.2.6 Mapped and Recursive Applications of the Layered Model

When one begins decomposing a particular existing network into link, network, and end-to-end layers, it sometimes becomes apparent that some of the layers of the network are themselves composed of what are obviously link, network, or end-to-end layers. These compositions come in two forms: mapped and recursive.

Mapped composition occurs when a network layer is built directly on another network layer by mapping higher-layer network addresses to lower-layer network addresses. A typical application for mapping arises when a better or more popular network technology comes along, yet it is desirable to keep running applications that are designed for the old network. For example, Apple designed a network called Appletalk that was used for many years, and then later mapped the Appletalk network layer to the Ethernet, which, as described in Section 7.8, has a network and link layer of its own but uses a somewhat different scheme for its network layer addresses.

Another application for mapped composition is to interconnect several independently designed network layers, a scheme called internetworking. Probably the best example of internetworking is the Internet itself (described in Sidebar 7.2), which links together many different network layers by mapping them all to a universal network layer that uses a protocol known as Internet protocol (IP).

Sidebar 7.2: The Internet  The Internet provides examples of nearly every concept in this chapter. Much of the Internet is a network layer that is mapped onto some other network layer such as a satellite network, a wireless network, or an Ethernet. Internet protocol (IP) is the primary network layer protocol, but it is not the only network layer protocol used in the Internet. There is a network layer protocol for managing the Internet, known as ICMP. There are also several different network layer routing protocols, some providing routing within small parts of the Internet, others providing routing between major regions. But every point that can be reached via the Internet implements IP.

The link layer of the Internet includes all of the link layers of the networks that the Internet maps onto and it also includes many separate, specialized links: a wire, a dial-up telephone line, a dedicated line provided by the telephone company, a microwave link, a digital subscriber line (DSL), a free-space optical link, etc. Almost anything that carries bits has been used somewhere as a link in the Internet.

The end-to-end protocols used on the Internet are many and varied. The primary transport protocols are TCP, UDP, and RTP, described briefly on page 7–65. Built on these transport protocols are hundreds of application protocols. A short list of some of the most widely used application protocols would include file transfer (FTP), the World Wide Web (HTTP), mail dispatch and pickup (SMTP and POP), text messaging (IRC), telephone (VoIP), and file exchange (Gnutella, bittorrent, etc.).

The current chapter presents a general model of networks, rather than a description of the Internet. To learn more about the Internet, see the books and papers listed in Section 7 of the Suggestions for Further Reading.

Section 7.8 explains how the network layer addresses of the Ethernet are mapped to and from the IP addresses of the Internet using what is known as an Address Resolution Protocol. The Internet also maps the internal network addresses of many other networks—wireless networks, satellite networks, cable TV networks, etc.—into IP addresses.

Recursive composition occurs when a network layer rests on a link layer that itself is a complete three-layer network. Recursive composition is not a general property of layers, but rather it is a specific property of layered communication systems: the send/receive semantics of an end-to-end connection through a network can be designed to have the same semantics as a single link, so such an end-to-end connection can be used as a link in a higher-level network. That property facilitates recursive composition, as well as the implementation of various interesting and useful network structures. Here are some examples of recursive composition:

• A dial-up telephone line is often used as a link to an attachment point of the Internet. This dial-up line goes through a telephone network that has its own link, network, and end-to-end layers.

• An overlay network is a network layer structure that uses as links the end-to-end layer of an existing network. Gnutella (see problem set 20) is an example of an overlay network that uses the end-to-end layer of the Internet for its links.

• With the advance of "voice over IP" (VoIP), the traditional voice telephone network is gradually converting to become an overlay on the Internet.

• A tunnel is a structure that uses the end-to-end layer of an existing network as a link between a local network-layer attachment point and a distant one to make it appear that the attachment is at the distant point. Tunnels, combined with the encryption techniques described in Chapter 11, are used to implement what is commonly called a "virtual private network" (VPN).

Recursive composition need not be limited to two levels. Figure 7.19 illustrates the case of Gnutella overlaying the Internet, with a dial-up telephone connection being used as the Internet link layer.

The primary concern when one is dealing with a link layer that is actually an end-to-end connection through another network is that discussion can become confusing unless one is careful to identify which level of decomposition is under discussion. Fortunately our terminology helps keep track of the distinctions among the various layers of a network, so it is worth briefly reviewing that terminology. At the interface between the application and the end-to-end layer, data is identified as a stream or message. The end-to-end layer divides the stream or message up into a series of segments and hands them to the network layer for delivery. The network layer encapsulates each segment in a packet which it forwards through the network with the help of the link layer. The link layer transmits the packet in a frame. If the link layer is itself a network, then this frame is a message as viewed by the underlying network.

[Diagram: a file transfer system stacks the File Transfer Program (end-to-end layer) on Gnutella (network layer) on the Internet (link layer); the Internet in turn stacks a Transport Protocol (end-to-end layer) on Internet Protocol (network layer) on a dialed connection (link layer); the dial-up telephone network stacks the dialed connection (end-to-end layer) on the telephone switch (network layer) on the physical wire (link layer).]
FIGURE 7.19 A typical recursive network composition. The overlay network Gnutella uses for its link layer an end-to-end transport protocol of the Internet. The Internet uses for one of its links an end-to-end transport protocol of the dial-up telephone system.

This discussion of layered network organization has been both general and abstract. In the next three sections we investigate in more depth the usual functions and some typical implementation techniques of each of the three layers of our reference model. However, as the introduction pointed out, what follows is not a comprehensive treatment of networking. Instead it identifies many of the major issues and for each issue exhibits one or two examples of how that issue is typically handled in a real network design. For readers who have a goal of becoming network engineers, and who therefore would like to learn the whole remarkable range of implementation strategies that have been used in networks, the Suggestions for Further Reading list several comprehensive books on the subject.

7.3 The Link Layer

The link layer is the bottom-most of the three layers of our reference model. The link layer is responsible for moving data directly from one physical location to another. It thus gets involved in several distinct issues: physical transmission, framing bits and bit sequences, detecting transmission errors, multiplexing the link, and providing a useful interface to the network layer above.

7.3.1 Transmitting Digital Data in an Analog World

The purpose of the link layer is to move bits from one place to another. If we are talking about moving a bit from one register to another on the same chip, the mechanism is fairly simple: run a wire that connects the output of the first register to the input of the next. Wait until the first register's output has settled and the signal has propagated to the input of the second; the next clock tick reads the data into the second register.

[Diagram: modules A and B connected by three lines labeled data, ready, and acknowledge.]
FIGURE 7.20 A simple protocol for data communication.

If all of the voltages are within their specified tolerances, the clock ticks are separated enough in time to allow for the propagation, and there is no electrical interference, then that is all there is to it.

Maintaining those three assumptions is relatively easy within a single chip, and even between chips on the same printed circuit board. However, as we begin to consider sending bits between boards, across the room, or across the country, these assumptions become less and less plausible, and they must be replaced with explicit measures to ensure that data is transmitted accurately. In particular, when the sender and receiver are in separate systems, providing a correctly timed clock signal becomes a challenge.

A simple method for getting data from one module to another module that does not share the same clock is with a three-wire (plus common ground) ready/acknowledge protocol, as shown in Figure 7.20. Module A, when it has a bit ready to send, places the bit on the data line, and then changes the steady-state value on the ready line. When B sees the ready line change, it acquires the value of the bit on the data line, and then changes the acknowledge line to tell A that the bit has been safely received. The reason that the ready and acknowledge lines are needed is that, in the absence of any other synchronizing scheme, B needs to know when it is appropriate to look at the data line, and A needs to know when it is safe to stop holding the bit value on the data line. The signals on the ready and acknowledge lines frame the bit. If the propagation time from A to B is Δt, then this protocol would allow A to send one bit to B every 2Δt plus the time required for A to set up its output and for B to acquire its input, so the maximum data rate would be a little less than 1/(2Δt). Over short distances, one can replace the single data line with N parallel data lines, all of which are framed by the same pair of ready/acknowledge lines, and thereby increase the data rate to N/(2Δt). Many backplane bus designs as well as peripheral attachment systems such as SCSI and personal computer printer interfaces use this technique, known as parallel transmission, along with some variant of a ready/acknowledge protocol, to achieve a higher data rate. However, as the distance between A and B grows, Δt also grows, and the maximum data rate declines in proportion, so the ready/acknowledge technique rapidly breaks down.
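A back-of-the-envelope calculation shows how quickly distance erodes this scheme. In the sketch below the propagation speed and the eight parallel data lines are assumed figures chosen only for illustration; the formula is the N/(2Δt) estimate just derived.

# Rough estimate of ready/acknowledge throughput versus distance.
PROPAGATION_SPEED = 2e8        # meters per second, roughly 2/3 the speed of light in a cable
N_LINES = 8                    # parallel data lines framed by one ready/acknowledge pair

def max_data_rate(distance_meters):
    dt = distance_meters / PROPAGATION_SPEED       # one-way propagation time
    return N_LINES / (2 * dt)                      # bits per second, ignoring setup time

for d in (0.1, 10, 1000, 100e3):                   # board, room, campus, cross-country scales
    print(f"{d:>9.1f} m: about {max_data_rate(d):.3e} bits/s")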

The usual requirement is to send data at higher rates over longer distances with fewer wires, and this requirement leads to employment of a different system called serial transmission. The idea is to send a stream of bits down a single transmission line, without waiting for any response from the receiver and with the expectation that the receiver will somehow recover those bits at the other end with no additional signaling. Thus the output at the transmitting end of the link looks as in Figure 7.21. Unfortunately, because the underlying transmission line is analog, the farther these bits travel down the line, the more attenuation, noise, and line-charging effects they suffer. By the time they arrive at the receiver they will be little more than pulses with exponential leading and trailing edges, as suggested by Figure 7.22. The receiving module, B, now has a significant problem in understanding this transmission: Because it does not have a copy of the clock that A used to create the bits, it does not know exactly when to sample the incoming line.

A typical solution involves having the two ends agree on an approximate data rate, so that the receiver can run a voltage-controlled oscillator (VCO) at about that same data rate. The output of the VCO is multiplied by the voltage of the incoming signal and the product suitably filtered and sent back to adjust the VCO. If this circuit is designed correctly, it will lock the VCO to both the frequency and phase of the arriving signal. (This device is commonly known as a phase-locked loop.) The VCO, once locked, then becomes a clock source that a receiver can use to sample the incoming data.

One complication is that with certain patterns of data (for example, a long string of zeros) there may be no transitions in the data stream, in which case the phase-locked loop will not be able to synchronize. To deal with this problem, the transmitter usually encodes the data in a way that ensures that no matter what pattern of bits is sent, there will be some transitions on the transmission line. A frequently used method is called phase encoding, in which there is at least one level transition associated with every data bit. A common phase encoding is the Manchester code, in which the transmitter encodes each bit as two bits: a zero is encoded as a zero followed by a one, while a one is encoded as a one followed by a zero. This encoding guarantees that there is a level transition in the center of every transmitted bit, thus supplying the receiver with plenty of clocking information. It has the disadvantage that the maximum data rate of the communication channel is effectively cut in half, but the resulting simplicity of both the transmitter and the receiver is often worth this price. Other, more elaborate, encoding schemes can ensure that there is at least one transition for every few data bits. These schemes don't reduce the maximum data rate as much, but they complicate encoding, decoding, and synchronization.
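A Manchester encoder and decoder fit in a few lines. This sketch follows the convention just described (a one becomes 10, a zero becomes 01), operates on Python lists of bits, and ignores the framing question taken up in Sidebar 7.3.

# Manchester code sketch: each data bit becomes two code bits, guaranteeing
# a transition in the middle of every bit cell. Works on lists of 0/1 values.
def manchester_encode(bits):
    out = []
    for b in bits:
        out += [1, 0] if b == 1 else [0, 1]
    return out

def manchester_decode(code_bits):
    assert len(code_bits) % 2 == 0
    data = []
    for first, second in zip(code_bits[0::2], code_bits[1::2]):
        assert first != second          # a valid Manchester cell always has a transition
        data.append(first)              # the first half-cell carries the data bit value
    return data

assert manchester_decode(manchester_encode([1, 0, 1, 1, 0])) == [1, 0, 1, 1, 0]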

FIGURE 7.21 Serial transmission. (The figure plots the line voltage V against time; the waveform encodes the bit sequence 1 0 1 0 1 0 1 0 1.)

FIGURE 7.22 Bit shape deterioration with distance.


rate will occur exactly at the point where the arriving data signal is just on the ragged edge of being correctly decodable, and any noise on the line will show up in the form of clock jitter or signals that just miss expected thresholds, either of which will lead to decoding errors.

The data rate of a digital link is conventionally measured in bits per second. Since digital data is ultimately carried using an analog channel, the question arises of what might be the maximum digital carrying capacity of a specified analog channel. A perfect analog channel would have an infinite capacity for digital data because one could both set and measure a transmitted signal level with infinite precision, and then change that setting infinitely often. In the real world, noise limits the precision with which a receiver can measure the signal value, and physical limitations of the analog channel such as chromatic dispersion (in an optical fiber), charging capacitance (in a copper wire), or spectrum availability (in a wireless signal) put a ceiling on the rate at which a receiver can detect a change in value of a signal. These physical limitations are summed up in a single measure known as the bandwidth of the analog channel. To be more precise, the number of different signal values that a receiver can distinguish is proportional to the logarithm of the ratio of the signal power to the noise power, and the maximum rate at which a receiver can distinguish changes in the signal value is proportional to the analog bandwidth.

Sidebar 7.4: Shannon’s capacity theorem

    C ≤ W · log₂(1 + S/(N·W))

where:
    C = channel capacity, in bits per second
    W = channel bandwidth, in hertz
    S = maximum allowable signal power, as seen by the receiver
    N = noise power per unit of bandwidth

These two parameters (signal-to-noise ratio and analog bandwidth) allow one to calculate a theoretical maximum possible channel capacity (that is, data transmission rate) using Shannon’s capacity theorem (see Sidebar 7.4).* Although this formula adopts a particular definition of bandwidth, assumes a particular randomness for the noise, and says nothing about the delay that might be encountered if one tries to operate near the channel capacity, it turns out to be surprisingly useful for estimating capacities in the real world.

Sidebar 7.3: Framing phase-encoded bits The astute reader may have spotted a puzzling gap in the brief description of the Manchester code: while it is intended as a way of framing bits as they appear on the transmission line, it is also necessary to frame the data bits themselves, in order to know whether a data bit is encoded as bits (n, n + 1) or bits (n + 1, n + 2). A typical approach is to combine code bit framing with data bit framing (and even provide some help in higher-level framing) by specifying that every transmission must begin with a standard pattern, such as some minimum number of coded one-bits followed by a coded zero. The series of consecutive ones gives the Phase-Locked Loop something to synchronize on, and at the same time provides examples of the positions of known data bits. The zero frames the end of the framing sequence.
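To make the Manchester code described above (and whose framing Sidebar 7.3 discusses) concrete, here is a small Python sketch of our own, not taken from the text: it encodes each data bit as a pair of line levels and recovers the data by reading the first level of each pair. A real receiver, of course, must first recover the bit clock with a phase-locked loop as described earlier.

def manchester_encode(data_bits):
    # Each data bit becomes two line levels: 1 -> (1, 0), 0 -> (0, 1),
    # so there is a level transition in the middle of every data bit.
    out = []
    for b in data_bits:
        out += [1, 0] if b == 1 else [0, 1]
    return out

def manchester_decode(line_levels):
    # Take the levels two at a time; the first level of each pair is the data bit.
    return [line_levels[i] for i in range(0, len(line_levels), 2)]

data = [1, 0, 1, 1, 0, 0, 0, 0]          # even a long run of zeros produces transitions
encoded = manchester_encode(data)
assert manchester_decode(encoded) == data
assert 0 in encoded and 1 in encoded     # the line level never stays constant for long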


Since some methods of digital transmission come much closer to Shannon’s theoretical capacity than others, it is customary to use as a measure of goodness of a digital transmission system the number of bits per second that the system can transmit per hertz of bandwidth. Setting W = 1, the capacity theorem says that the maximum bits per second per hertz is log₂(1 + S/N). An elementary signalling system in a low-noise environment can easily achieve 1 bit per second per hertz. On the other hand, for a 28 kilobits per second modem to operate over the 2.4 kilohertz telephone network, it must transmit about 12 bits per second per hertz. The capacity theorem says that the logarithm must be at least 12, so the signal-to-noise ratio must be at least 2¹², or using a more traditional analog measure, 36 decibels, which is just about typical for the signal-to-noise ratio of a properly working telephone connection. The copper-pair link between a telephone handset and the telephone office does not go through any switching equipment, so it actually has a bandwidth closer to 100 kilohertz and a much better signal-to-noise ratio than the telephone system as a whole; these combine to make possible “digital subscriber line” (DSL) modems that operate at 1.5 megabits/second (and even up to 50 megabits/second over short distances) using a physical link that was originally designed to carry just voice.

One other parameter is often mentioned in characterizing a digital transmission system: the bit error rate, abbreviated BER and measured as a ratio to the transmission rate. For a transmission system to be useful, the bit error rate must be quite low; it is typically reported with numbers such as one error in 10⁶, 10⁷, or 10⁸ transmitted bits. Even the best of those rates is not good enough for digital systems; higher levels of the system must be prepared to detect and compensate for errors.
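To make the bits-per-hertz arithmetic above concrete, the following short Python sketch evaluates Shannon’s formula for the telephone-channel numbers used in the text. The helper name is ours, not part of any standard library.

import math

def shannon_capacity(bandwidth_hz, snr_ratio):
    # Shannon's capacity theorem: C <= W * log2(1 + S/N)
    return bandwidth_hz * math.log2(1 + snr_ratio)

snr = 2 ** 12                                  # about 36 decibels
print(math.log2(1 + snr))                      # about 12 bits per second per hertz
print(shannon_capacity(2400, snr))             # roughly 28,800 bits per second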

7.3.2 Framing Frames

The previous section explained how to obtain a stream of neatly framed bits, but because the job of the link layer is to deliver frames across the link, it must also be able to figure out where in this stream of bits each frame begins and ends. Framing frames is a distinct, and quite independent, requirement from framing bits, and it is one of the reasons that some network models divide the link layer into two layers, a lower layer that manages physical aspects of sending and receiving individual bits and an upper layer that implements the strategy of transporting entire frames.

There are many ways to frame frames. One simple method is to choose some pattern of bits, for example, seven one-bits in a row, as a frame-separator mark. The sender simply inserts this mark into the bit stream at the end of each frame.

* The derivation of this theorem is beyond the scope of this textbook. The capacity theorem was originally proposed by Claude E. Shannon in the paper “A mathematical theory of communication,” Bell System Technical Journal 27 (1948), pages 379–423 and 623–656. Most modern texts on information theory explore it in depth.


Whenever this pattern appears in the received data, the receiver takes it to mark the end of the previous frame, and assumes that any bits that follow belong to the next frame. This scheme works nicely, as long as the payload data stream never contains the chosen pattern of bits. Rather than explaining to the higher layers of the network that they cannot transmit certain bit patterns, the link layer implements a technique known as bit stuffing. The transmitting end of the link layer, in addition to inserting the frame-separator mark between frames, examines the data stream itself, and if it discovers six ones in a row it stuffs an extra bit into the stream, a zero. The receiver, in turn, watches the incoming bit stream for long strings of ones. When it sees six one-bits in a row it examines the next bit to decide what to do. If the seventh bit is a zero, the receiver discards the zero bit, thus reversing the stuffing done by the sender. If the seventh bit is a one, the receiver takes the seven ones as the frame separator. Figure 7.23 shows a simple pseudocode implementation of the procedure to send a frame with bit stuffing, and Figure 7.24 shows the corresponding procedure on the receiving side of the link. (For simplicity, the illustrated receive procedure ignores two important considerations. First, the receiver uses only one frame buffer. A better implementation would have multiple buffers to allow it to receive the next frame while processing the current one. Second, the same thread that acquires a bit also runs the network level protocol by calling LINK_RECEIVE. A better implementation would probably NOTIFY a separate thread that would then call the higher-level protocol, and this thread could continue processing more incoming bits.)

Bit stuffing is one of many ways to frame frames. There is little need to explore all the possible alternatives because frame framing is easily specified and subcontracted to the implementer of the link layer (the entire link layer, along with bit framing, is often done in the hardware), so we now move on to other issues.

procedure FRAME_TO_BIT (frame_data, length)
    ones_in_a_row ← 0
    for i from 1 to length do                 // First send frame contents.
        SEND_BIT (frame_data[i])
        if frame_data[i] = 1 then
            ones_in_a_row ← ones_in_a_row + 1
            if ones_in_a_row = 6 then
                SEND_BIT (0)                  // Stuff a zero so that data doesn’t
                ones_in_a_row ← 0             //   look like a framing marker.
        else
            ones_in_a_row ← 0
    for i from 1 to 7 do                      // Now send framing marker.
        SEND_BIT (1)

FIGURE 7.23 Sending a frame with bit stuffing.
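The pseudocode of Figure 7.23 (and its receiving counterpart in Figure 7.24, below) is written in terms of SEND_BIT and LINK_RECEIVE, which are not defined further here. As a rough, self-contained illustration of the same stuffing rule—written in Python rather than the figures’ pseudocode, and ours rather than the text’s—one can check that an arbitrary payload survives the round trip:

def stuff_and_mark(payload_bits):
    # Apply the stuffing rule of Figure 7.23: after six ones in a row insert a
    # zero, then append the frame-separator mark (seven ones).
    out, ones = [], 0
    for b in payload_bits:
        out.append(b)
        if b == 1:
            ones += 1
            if ones == 6:
                out.append(0)                 # stuffed zero
                ones = 0
        else:
            ones = 0
    return out + [1] * 7                      # framing marker

def unstuff(line_bits):
    # Reverse the stuffing, as in Figure 7.24; return the payloads of the frames
    # found on the line.
    frames, frame, ones = [], [], 0
    for b in line_bits:
        if ones < 6:
            frame.append(b)
            ones = ones + 1 if b == 1 else 0
        elif b == 0:
            ones = 0                          # stuffed bit, discard it
        else:
            frames.append(frame[:-6])         # drop the six ones of the marker
            frame, ones = [], 0
    return frames

payload = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]      # contains six ones in a row
assert unstuff(stuff_and_mark(payload)) == [payload]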


7.3.3 Error Handling

An important issue is what the receiving side of the link layer should do about bits that arrive with doubtful values. Since the usual design pushes the data rate of a transmission link up until the receiver can barely tell the ones from the zeros, even a small amount of extra noise can cause errors in the received bit stream.

The first and perhaps most important line of defense in dealing with transmission errors is to require that the design of the link be good at detecting such errors when they occur. The usual method is to encode the data with an error detection code, which entails adding a small amount of redundancy. A simple form of such a code is to have the transmitter calculate a checksum and place the checksum at the end of each frame. As soon as the receiver has acquired a complete frame, it recalculates the checksum and compares its result with the copy that came with the frame. By carefully designing the checksum algorithm and making the number of bits in the checksum large enough, one can make the probability of not detecting an error as low as desired.

The more interesting issue is what to do when an error is detected. There are three alternatives:

1. Have the sender encode the transmission using an error correction code, which is a code that has enough redundancy to allow the receiver to identify the particular bits that have errors and correct them. This technique is widely used in situations where the noise behavior of the transmission channel is well understood and the redundancy can be targeted to correct the most likely errors. For example, compact disks are recorded with a burst error-correction code designed to cope particularly well with dust and scratches. Error correction is one of the topics of Chapter 8 [on-line].

procedure BIT_TO_FRAME (rcvd_bit)
    ones_in_a_row integer initially 0
    if ones_in_a_row < 6 then
        bits_in_frame ← bits_in_frame + 1
        frame_data[bits_in_frame] ← rcvd_bit
        if rcvd_bit = 1 then
            ones_in_a_row ← ones_in_a_row + 1
        else
            ones_in_a_row ← 0
    else                             // This may be a seventh one-bit in a row, check it out.
        if rcvd_bit = 0 then
            ones_in_a_row ← 0        // Stuffed bit, don’t use it.
        else                         // This is the end-of-frame marker.
            LINK_RECEIVE (frame_data, (bits_in_frame - 6), link_id)
            bits_in_frame ← 0
            ones_in_a_row ← 0

FIGURE 7.24 Receiving a frame with bit stuffing.


2. Ask the sender to retransmit the frame that contained an error. This alternative requires that the sender hold the frame in a buffer until the receiver has had a chance to recalculate and compare its checksum. The sender needs to know when it is safe to reuse this buffer for another frame. In most such designs the receiver explicitly acknowledges the correct (or incorrect) receipt of every frame. If the propagation time from sender to receiver is long compared with the time required to send a single frame, there may be several frames in flight, and acknowledgments (especially the ones that ask for retransmission) are disruptive. On a high-performance link an explicit acknowledgment system can be surprisingly complex.

3. Let the receiver discard the frame. This alternative is a reasonable choice in light of our previous observation (see page 7–12) that congestion in higher network levels must be handled by discarding packets anyway. Whatever higher-level protocol is used to deal with those discarded packets will also take care of any frames that are discarded because they contained errors.

Real-world designs often involve blending these techniques, for example by having the sender apply a simple error-correction code that catches and repairs the most common errors and that reliably detects and reports any more complex irreparable errors, and then by having the receiver discard the frames that the error-correction code could not repair.
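As a concrete, deliberately simple illustration of the checksum idea described above, the following Python sketch has the sender append a 32-bit CRC to each frame and the receiver recompute it, discarding frames that do not match. This is only a sketch of the technique, not the checksum that any particular link standard uses.

import zlib

def add_checksum(frame_bytes):
    # Sender: append a 32-bit CRC so the receiver can detect damage in transit.
    crc = zlib.crc32(frame_bytes)
    return frame_bytes + crc.to_bytes(4, "big")

def check_and_strip(received_bytes):
    # Receiver: recompute the CRC; return the payload, or None to discard the frame.
    payload, crc_bytes = received_bytes[:-4], received_bytes[-4:]
    if zlib.crc32(payload) == int.from_bytes(crc_bytes, "big"):
        return payload
    return None                       # damaged frame: discard (alternative 3 above)

frame = add_checksum(b"Fire")
assert check_and_strip(frame) == b"Fire"
damaged = bytes([frame[0] ^ 0x01]) + frame[1:]    # flip one bit in transit
assert check_and_strip(damaged) is None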

7.3.4 The Link Layer Interface: Link Protocols and Multiplexing

The link layer, in addition to sending bits and frames at one end and receiving them at the other end, also has interfaces to the network layer above, as illustrated in Figure 7.16 on page 7–26. As described so far, the interface consists of an ordinary procedure call (to LINK_SEND) that the network layer uses to tell the link layer to send a packet, and an upcall (to NETWORK_HANDLE) from the link layer to the network layer at the other end to alert the network layer that a packet arrived. To be practical, this interface between the network layer and the link layer needs to be expanded slightly to incorporate two additional features not previously mentioned: multiple lower-layer protocols, and higher-layer protocol multiplexing. To support these two functions we add two arguments to LINK_SEND, named link_protocol and network_protocol:

LINK_SEND (data_buffer, link_identifier, link_protocol, network_protocol)

Over any given link, it is sometimes appropriate to use different protocols at different times. For example, a wireless link may occasionally encounter a high noise level and need to switch from the usual link protocol to a “robustness” link protocol that employs a more expensive form of error detection with repeated retry, but runs more slowly. At other times it may want to try out a new, experimental link protocol. The third argument to LINK_SEND, link_protocol tells LINK_SEND which link protocol to use for this_data, and its addition leads to the protocol layering illustrated in Figure 7.25.


FIGURE 7.25 Layer composition with multiple link protocols. (The figure shows a single network protocol in the network layer that may be carried over any of three link layer protocols: a standard protocol, a high robustness protocol, or an experimental protocol.)

FIGURE 7.26 Layer composition with multiple link protocols and link layer multiplexing to support multiple network layer protocols. (The figure shows four network layer protocols, Internet Protocol, Address Resolution Protocol, Appletalk Protocol, and Path Vector Exchange Protocol, each of which may use any of the three link layer protocols.)

The second feature of the interface to the link layer is more involved: the interface should support protocol multiplexing. Multiplexing allows several different network layer protocols to use the same link. For example, Internet Protocol, Appletalk Protocol, and Address Resolution Protocol (we will talk about some of these protocols later in this chapter) might all be using the same link. Several steps are required. First, the network layer protocol on the sending side needs to specify which protocol handler should be invoked on the receiving side, so one more argument, network_protocol, is needed in the interface to LINK_SEND. Second, the value of network_protocol needs to be transmitted to the receiving side, for example by adding it to the link-level packet header. Finally, the link layer on the receiving side needs to examine this new header field to decide to which of the various network layer implementations it should deliver the packet. Our protocol layering orga­ nization is now as illustrated in Figure 7.26. This figure demonstrates the real power of the layered organization: any of the four network layer protocols in the figure may use any of the three link layer protocols.


With the addition of multiple link protocols and link multiplexing, we can summarize the discussion of the link layer in the form of pseudocode for the procedures LINK_SEND and LINK_RECEIVE, together with a structure describing the frame that passes between them, as in Figure 7.27. In procedure LINK_SEND, the procedure variable sendproc is selected from an array of link layer protocols; the value found in that array might be, for example, a version of the procedure FRAME_TO_BIT of Figure 7.23 that has been extended with a third argument that identifies which link to use. The procedures CHECKSUM and LENGTH are programs we assume are found in the library. Procedure LINK_RECEIVE might be called, for example, by procedure BIT_TO_FRAME of Figure 7.24.

structure frame
    structure checked_contents
        bit_string net_protocol                     // multiplexing parameter
        bit_string payload                          // payload data
    bit_string checksum

procedure LINK_SEND (data_buffer, link_identifier, link_protocol, network_protocol)
    frame instance outgoing_frame
    outgoing_frame.checked_contents.payload ← data_buffer
    outgoing_frame.checked_contents.net_protocol ← network_protocol
    frame_length ← LENGTH (data_buffer) + header_length
    outgoing_frame.checksum ← CHECKSUM (outgoing_frame.checked_contents, frame_length)
    sendproc ← link_protocol[that_link.protocol]    // Select link protocol.
    sendproc (outgoing_frame, frame_length, link_identifier)    // Send frame.

procedure LINK_RECEIVE (received_frame, length, link_id)
    frame instance received_frame
    if CHECKSUM (received_frame.checked_contents, length) = received_frame.checksum then
        good_frame_count ← good_frame_count + 1     // Pass good packets up to next layer.
        GIVE_TO_NETWORK_HANDLER (received_frame.checked_contents.payload,
            received_frame.checked_contents.net_protocol)
    else
        bad_frame_count ← bad_frame_count + 1       // Just count damaged frame.

// Each network layer protocol handler must call SET_HANDLER before the first packet
// for that protocol arrives.
procedure SET_HANDLER (handler_procedure, handler_protocol)
    net_handler[handler_protocol] ← handler_procedure

procedure GIVE_TO_NETWORK_HANDLER (received_packet, network_protocol)
    handler ← net_handler[network_protocol]
    if (handler ≠ NULL) call handler (received_packet, network_protocol)
    else unexpected_protocol_count ← unexpected_protocol_count + 1

FIGURE 7.27 The LINK_SEND and LINK_RECEIVE procedures, together with the structure of the frame transmitted over the link and a dispatching procedure for the network layer.


The procedure LINK_RECEIVE verifies the checksum, and then extracts the payload and net_protocol fields from the frame and passes them to the procedure that calls the network handler, together with the identifier of the link over which the packet arrived. These procedures also illustrate an important property of layering that was discussed on page 7–29. The link layer handles its argument data_buffer as an unstructured string of bits. When we examine the network layer in the next section of the chapter, we will see that data_buffer contains a network-layer packet, which has its own internal structure. The point is that as we pass from an upper layer to a lower layer, the content and structure of the payload data is not supposed to be any concern of the lower layer.

As an aside, the division we have chosen for our sample implementation of a link layer, with one program doing framing and another program verifying checksums, corresponds to the OSI reference model division of the link layer into physical and strategy layers, as was mentioned in Section 7.2.5.

Since the link is now multiplexed among several network-layer protocols, when a frame arrives, the link layer must dispatch the packet contained in that frame to the proper network layer protocol handler. Figure 7.27 shows a handler dispatcher named GIVE_TO_NETWORK_HANDLER. Each of several different network-layer protocol-implementing programs specifies the protocol it knows how to handle, through arguments in a call to SET_HANDLER. Control then passes to a particular network-layer handler only on arrival of a frame containing a packet of the protocol it specified. With some additional effort (not illustrated; the reader can explore this idea as an exercise), one could also make this dispatcher multithreaded, so that as it passes a packet up to the network layer a new thread takes over and the link layer thread returns to work on the next arriving frame.

With or without threads, the network_protocol field of a frame indicates to whom in the network layer the packet contained in the frame should be delivered. From a more general point of view, we are multiplexing the lower-layer protocol among several higher-layer protocols. This notion of multiplexing, together with an identification field to support it, generally appears in every protocol layer, and in every layer-to-layer interface, of a network architecture.

An interesting challenge is that the multiplexing field of a layer names the protocols of the next higher layer, so some method is needed to assign those names. Since higher-layer protocols are likely to be defined and implemented by different organizations, the usual solution is to hand the name conflict avoidance problem to some national or international standard-setting body. For example, the names of the protocols of the Internet are assigned by an outfit called ICANN, which stands for the Internet Corporation for Assigned Names and Numbers.
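The SET_HANDLER/GIVE_TO_NETWORK_HANDLER pattern of Figure 7.27 amounts to a table of procedures indexed by protocol identifier. A minimal Python sketch of that idea follows; the protocol numbers and handler bodies are purely illustrative, not taken from any standard.

handlers = {}                          # net_handler table, indexed by protocol id
unexpected_protocol_count = 0

def set_handler(handler_procedure, handler_protocol):
    handlers[handler_protocol] = handler_procedure

def give_to_network_handler(received_packet, network_protocol):
    global unexpected_protocol_count
    handler = handlers.get(network_protocol)
    if handler is not None:
        handler(received_packet, network_protocol)
    else:
        unexpected_protocol_count += 1  # count packets for unregistered protocols

# Example use: register two hypothetical network-layer handlers, then dispatch.
set_handler(lambda pkt, proto: print("internet handler got", pkt), 0x0800)
set_handler(lambda pkt, proto: print("arp handler got", pkt), 0x0806)
give_to_network_handler(b"example packet", 0x0800)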

7.3.5 Link Properties Some final details complete our tour of the link layer. First, links come in several flavors, for which there is some standard terminology: A point-to-point link directly connects exactly two communicating entities. A simplex link has a transmitter at one end and a receiver at the other; two-way communication


requires installing two such links, one going in each direction. A duplex link has both a transmitter and a receiver at each end, allowing the same link to be used in both direc­ tions. A half-duplex link is a duplex link in which transmission can take place in only one direction at a time, whereas a full-duplex link allows transmission in both directions at the same time over the same physical medium. A broadcast link is a shared transmission medium in which there can be several trans­ mitters and several receivers. Anything sent by any transmitter can be received by many—perhaps all—receivers. Depending on the physical design details, a broadcast link may limit use to one transmitter at a time, or it may allow several distinct transmis­ sions to be in progress at the same time over the same physical medium. This design choice is analogous to the distinction between half duplex and full duplex but there is no standard terminology for it. The link layers of the standard Ethernet and the popular wireless system known as Wi-Fi are one-transmitter-at-a-time broadcast links. The link layer of a CDMA Personal Communication System (such as ANSI–J–STD–008, which is used by cellular providers Verizon and Sprint PCS) is a broadcast link that permits many transmitters to operate simultaneously. Finally, most link layers impose a maximum frame size, known as the maximum transmission unit (MTU). The reasons for limiting the size of a frame are several: 1. The MTU puts an upper bound on link commitment time, which is the length of time that a link will be tied up once it begins to transmit the frame. This consideration is more important for slow links than for fast ones. 2. For a given bit error rate, the longer a frame the greater the chance of an uncorrectable error in that frame. Since the frame is usually also the unit of error control, an uncorrectable error generally means loss of the entire frame, so as the frame length increases not only does the probability of loss increase, but the cost of the loss increases because the entire frame will probably have to be retransmitted. The MTU puts a ceiling on both of these costs. 3. If congestion leads a forwarder to discard a packet, the MTU limits the amount of transmission capacity required to retransmit the packet. 4. There may be mechanical limits on the maximum length of a frame. A hardware interface may have a small buffer or a short counter register tracking the number of bits in the frame. Similar limits sometimes are imposed by software that was originally designed for another application or to comply with some interoperability standard. Whatever the reason for the MTU, when an application needs to send a message that does not fit in a maximum-sized frame, it becomes the job of some end-to-end protocol to divide the message into segments for transmission and to reassemble the segments into the complete message at the other end. The way in which the end-to-end protocol dis­ covers the value of the MTU is complicated—it needs to know not just the MTU of the link it is about to use, but the smallest MTU that the segment will encounter on the path


through the network to its destination. For this purpose, it needs some help from the net­ work layer, which is our next topic.
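To illustrate the last point, here is a small Python sketch (our own illustration, not a protocol specification) of the job an end-to-end protocol takes on when a message is larger than the path MTU: divide the message into MTU-sized segments and reassemble them at the other end.

def segment(message, mtu):
    # Split a message into segments no larger than the path MTU.
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]

def reassemble(segments):
    # The receiving end-to-end protocol puts the segments back together.
    return b"".join(segments)

message = b"a message considerably longer than one maximum-sized frame"
segments = segment(message, mtu=16)
assert all(len(s) <= 16 for s in segments)
assert reassemble(segments) == message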

7.4 The Network Layer

The network layer is the middle layer of our three-layer reference model. The network layer moves a packet across a series of links. While conceptually quite simple, the challenges in implementation of this layer are probably the most difficult in network design because there is usually a requirement that a single design span a wide range of performance, traffic load, and number of attachment points. In this section we develop a simple model of the network layer and explore some of the challenges.

7.4.1 Addressing

The conceptual model of a network is a cloud bristling with network attachment points identified by numbers known as network addresses, as in Figure 7.28. A segment enters the network at one attachment point, known as the source. The network layer wraps the segment in a packet and carries the packet across the network to another attachment point, known as the destination, where it unwraps the original segment and delivers it.

FIGURE 7.28 The network layer. (The figure shows a network cloud with numbered network attachment points, identified by network addresses such as 01, 07, 11, 16, 24, 33, 35, 39, 40, 41, and 42, each joined to the cloud by an interface.)

The model in the figure is misleading in one important way: it suggests that delivery of a segment is accomplished by sending it over one final, physical link. A network attachment point is actually a virtual concept rather than a physical concept. Every network participant, whether a packet forwarder or a client computer system, contains an implementation of the network layer, and when a packet finally reaches the network layer of its destination, rather than forwarding it further, the network layer unwraps the segment contained in the packet and passes that segment to the end-to-end layer inside the system that contains the network attachment point. In addition, a single system may have several network attachment points, each with its own address, all of which result in delivery to the same end-to-end layer; such a system is said to be multihomed. Even packet forwarders need network attachment points with their own addresses, so that a network manager can send them instructions about their configuration and maintenance.


Since a network has many attachment points, the end-to-end layer must specify to the network layer not only a data segment to transmit but also its intended destination. Further, there may be several available networks and protocols, and several end-to-end protocol handlers, so the interface from the end-to-end layer to the network layer is parallel to the one between the network layer and the link layer:

NETWORK_SEND (segment_buffer, destination, network_protocol, end_layer_protocol)

The argument network_protocol allows the end-to-end layer to select a network and protocol with which to send the current segment, and the argument end_layer_protocol allows for multiplexing, this time of the network layer by the end-to-end layer. The value of end_layer_protocol tells the network layer at the destination to which end-to-end protocol handler the segment should be delivered. The network layer also has a link-layer interface, across which it receives packets. Following the upcall style of the link layer of Section 7.3, this interface would be

NETWORK_HANDLE (packet, network_protocol)

and this procedure would be the handler_procedure argument of a call to SET_HANDLER in Figure 7.27. Thus whenever the link layer has a packet to deliver to the network layer, it does so by calling NETWORK_HANDLE. The pseudocode of Figure 7.29 describes a model network layer in detail, starting with the structure of a packet, and followed by implementations of the procedures NETWORK_HANDLE and NETWORK_SEND. NETWORK_SEND creates a packet, starting with the seg­ ment provided by the end-to-end layer and adding a network-layer header, which here comprises three fields: source, destination, and end_layer_protocol. It fills in the destina­ tion and end_layer_protocol fields from the corresponding arguments, and it fills in the source field with the address of its own network attachment point. Figure 7.30 shows this latest addition to the overhead of a packet. Procedure NETWORK_HANDLE may do one of two rather different things with a packet, distinguished by the test on line 11. If the packet is not at its destination, NETWORK_HANDLE looks up the packet’s destination in forwarding_table to determine the best link on which to forward it, and then it calls the link layer to send the packet on its way. On the other hand, if the received packet is at its destination, the network layer passes its payload up to the end-to-end layer rather than sending the packet out over another link. As in the case of the interface between the link layer and the network layer, the interface to the end-to-end layer is another upcall that is intended to go through a handler dispatcher similar to that of the link layer dispatcher of Figure 7.27. Because in a network, any net­ work attachment point can send a packet to any other, the last argument of GIVE_TO_END_LAYER, the source of the packet, is a piece of information that the end-layer recipient generally finds useful in deciding how to handle the packet. One might wonder what led to naming the procedure NETWORK_HANDLE rather than NETWORK_RECEIVE. The insight in choosing that name is that forwarding a packet is always done in exactly the same way, whether the packet comes from the layer above or from the layer below. Thus, when we consider the steps to be taken by NETWORK_SEND, the straightforward implementation is simply to place the data in a packet, add a network


layer header, and hand the packet to NETWORK_HANDLE. As an extra feature, this architec­ ture allows a source to send a packet to itself without creating a special case. Just as the link layer used the net_protocol field to decide which of several possible network handlers to give the packet to, NETWORK_SEND can use the net_protocol argument for the same purpose. That is, rather than calling NETWORK_HANDLE directly, it could call the procedure GIVE_TO_NETWORK_HANDLER of Figure 7.27.
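The forward-or-deliver decision that NETWORK_HANDLE makes (the test on line 11 of Figure 7.29) can be sketched in a few lines of Python. The dictionary standing in for the forwarding table, the addresses, and the callback parameters are purely illustrative.

MY_NETWORK_ADDRESS = 24                     # illustrative address of this participant
forwarding_table = {41: 2, 40: 2, 33: 1}    # destination address -> outgoing link

def network_handle(packet, link_send, give_to_end_layer):
    # Forward the packet toward its destination, or deliver it locally.
    if packet["destination"] != MY_NETWORK_ADDRESS:
        next_hop = forwarding_table[packet["destination"]]
        link_send(packet, next_hop)
    else:
        give_to_end_layer(packet["payload"], packet["end_protocol"], packet["source"])

def network_send(segment, destination, end_protocol, link_send, give_to_end_layer):
    # Originating a packet is just wrapping the segment in a network-layer
    # header and handing it to the same handler used for arriving packets.
    packet = {"source": MY_NETWORK_ADDRESS, "destination": destination,
              "end_protocol": end_protocol, "payload": segment}
    network_handle(packet, link_send, give_to_end_layer)

network_send(b"Fire", 41, "RPC",
             link_send=lambda pkt, link: print("forward on link", link),
             give_to_end_layer=lambda *args: print("deliver locally", args))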

7.4.2 Managing the Forwarding Table: Routing

The primary challenge in a packet forwarding network is to set up and manage the forwarding tables, which generally must be different for each network-layer participant. Constructing these tables requires first figuring out appropriate paths (sometimes called routes) to follow from each source to each destination, so the exercise is variously known as path-finding or routing. In a small network, one might set these tables up by hand. As the scale of a network grows, this approach becomes impractical, for several reasons:

structure packet
    bit_string source
    bit_string destination
    bit_string end_protocol
    bit_string payload

 1  procedure NETWORK_SEND (segment_buffer, destination, network_protocol, end_protocol)
 2      packet instance outgoing_packet
 3      outgoing_packet.payload ← segment_buffer
 4      outgoing_packet.end_protocol ← end_protocol
 5      outgoing_packet.source ← MY_NETWORK_ADDRESS
 6      outgoing_packet.destination ← destination
 7      NETWORK_HANDLE (outgoing_packet, network_protocol)
 8
 9  procedure NETWORK_HANDLE (net_packet, net_protocol)
10      packet instance net_packet
11      if net_packet.destination ≠ MY_NETWORK_ADDRESS then
12          next_hop ← LOOKUP (net_packet.destination, forwarding_table)
13          LINK_SEND (net_packet, next_hop, link_protocol, net_protocol)
14      else
15          GIVE_TO_END_LAYER (net_packet.payload,
16              net_packet.end_protocol, net_packet.source)

FIGURE 7.29 Model implementation of a network layer. The procedure NETWORK_SEND originates packets, while NETWORK_HANDLE receives packets and either forwards them or passes them to the local end-to-end layer.


1. The amount of calculation required to determine the best paths grows combinatorially with the number of nodes in the network. 2. Whenever a link is added or removed, the forwarding tables must be recalculated. As a network grows in size, the frequency of links being added and removed will probably grow in proportion, so the combinatorially growing routing calculation will have to be performed more and more frequently. 3. Whenever a link fails or is repaired, the forwarding tables must be recalculated. For a given link failure rate, the number of such failures will be proportional to the number of links, so for a second reason the combinatorially growing routing calculation will have to be performed an increasing number of times. 4. There are usually several possible paths available, and if traffic suddenly causes the originally planned path to become congested, it would be nice if the forwarding tables could automatically adapt to the new situation. All four of these reasons encourage the development of automatic routing algorithms. If reasons 1 and 2 are the only concerns, one can leave the resulting forwarding tables in place for an indefinite period, a technique known as static routing. The on-the-fly recal­ culation called for by reasons 3 and 4 is known as adaptive routing, and because this feature is vitally important in many networks, routing algorithms that allow for easy update when things change are almost always used. A packet forwarder that also partic-

Segment presented to the network layer:
    DATA

Packet presented to the link layer:
    source & destination | end protocol | DATA

Frame appearing on the link:
    frame mark | network protocol | source & destination | end protocol | DATA | checksum | frame mark

Example:
    1111111 | IP | 41 → 24 | RPC | “Fire” | 97142 | 1111111

FIGURE 7.30 A typical accumulation of network layer and link layer headers and trailers. The additional information added at each layer can come from control information passed from the higher layer as arguments (for example, the end protocol type and the destination are arguments in the call to the network layer). In other cases they are added by the lower layer (for example, the link layer adds the frame marks and checksum).


FIGURE 7.31 Routing example. (The figure shows a small network in which routers G, H, J, and K interconnect the workstations and services A, B, C, D, E, and F. A is the source and D is the destination, and each link is labeled with a one-digit link identifier at each of its two ends.)

ipates in a routing algorithm is usually called a router. An adaptive routing algorithm requires exchange of current reachability information. Typically, the routers exchange this information using a network-layer routing protocol transmitted over the network itself. To see how adaptive routing algorithms might work, consider the modest-sized net­ work of Figure 7.31. To minimize confusion in interpreting this figure, each network address is lettered, rather than numbered, while each link is assigned two one-digit link identifiers, one from the point of view of each of the stations it connects. In this figure, routers are rectangular while workstations and services are round, but all have network addresses and all have network layer implementations. Suppose now that the source A sends a packet addressed to destination D. Since A has only one outbound link, its forwarding table is short and simple:

destination     link
A               end-layer
all other       1

so the packet departs from A by way of link 1, going to router G for its next stop. However, the forwarding table at G must be considerably more complicated. It might contain, for example, the following values:

destination     link
A               1
B               2
C               2
D               3
E               4
F               4
G               end-layer
H               2
J               3
K               4

This is not the only possible forwarding table for G. Since there are several possible paths to most destinations, there are several possible values for some of the table entries. In addition, it is essential that the forwarding tables in the other routers be coordinated with this forwarding table. If they are not, when router G sends a packet destined for E to router K, router K might send it back to G, and the packet could loop forever.

The interesting question is how to construct a consistent, efficient set of forwarding tables. Many algorithms that sound promising have been proposed and tried; few work well. One that works moderately well for small networks is known as path vector exchange. Each participant maintains, in addition to its forwarding table, a path vector, each element of which is a complete path to some destination. Initially, the only path it knows about is the zero-length path to itself, but as the algorithm proceeds it gradually learns about other paths. Eventually its path vector accumulates paths to every point in the network. After each step of the algorithm it can construct a new forwarding table from its new path vector, so the forwarding table gradually becomes more and more complete. The algorithm involves two steps that every participant repeats over and over: path advertising and path selection.

To illustrate the algorithm, suppose participant G starts with a path vector that contains just one item, an entry for itself, as in Figure 7.32.

to   path
G    < >

FIGURE 7.32 Initial state of path vector for G. < > is an empty path.

In the advertising step, each participant sends its own network address and a copy of its path vector down every attached link to its immediate neighbors, specifying the network-layer protocol PATH_EXCHANGE. The routing algorithm of G would thus receive from its four neighbors the four path vectors of Figure 7.33. This advertisement allows G to discover the names, which are in this case network addresses, of each of its neighbors.


From A, via link 1:       From H, via link 2:       From J, via link 3:       From K, via link 4:
to   path                 to   path                 to   path                 to   path
A    < >                  H    < >                  J    < >                  K    < >

FIGURE 7.33 Path vectors received by G in the first round.

path vector               forwarding table
to   path                 to   link
A    <A>                  A    1
G    < >                  G    end-layer
H    <H>                  H    2
J    <J>                  J    3
K    <K>                  K    4

FIGURE 7.34 First-round path vector and forwarding table for G.

From A, via link 1:
to   path
A    < >
G    <G>

From H, via link 2:
to   path
B    <B>
C    <C>
G    <G>
H    < >
J    <J>
K    <K>

From J, via link 3:
to   path
D    <D>
E    <E>
G    <G>
H    <H>
J    < >
K    <K>

From K, via link 4:
to   path
E    <E>
F    <F>
G    <G>
H    <H>
J    <J>
K    < >

FIGURE 7.35 Path vectors received by G in the second round.

path vector               forwarding table
to   path                 to   link
A    <A>                  A    1
B    <H, B>               B    2
C    <H, C>               C    2
D    <J, D>               D    3
E    <J, E>               E    3
F    <K, F>               F    4
G    < >                  G    end-layer
H    <H>                  H    2
J    <J>                  J    3
K    <K>                  K    4

FIGURE 7.36 Second-round path vector and forwarding table for G.


G now performs the path selection step by merging the information received from its neighbors with that already in its own previous path vector. To do this merge, G takes each received path, prepends the network address of the neighbor that supplied it, and then decides whether or not to use this path in its own path vector. Since on the first round in our example all of the information from neighbors gives paths to previously unknown destinations, G adds all of them to its path vector, as in Figure 7.34. G can also now construct a forwarding table for use by NET_HANDLE that allows NET_HANDLE to forward packets to destinations A, H, J, and K as well as to the end-to-end layer of G itself. In a similar way, each of the other participants has also constructed a better path vector and forwarding table.

Now, each participant advertises its new path vector. This time, G receives the four path vectors of Figure 7.35, which contain information about several participants of which G was previously unaware. Following the same procedure again, G prepends to each element of each received path vector the identity of the router that provided it, and then considers whether or not to use this path in its own path vector. For previously unknown destinations, the answer is yes. For previously known destinations, G compares the paths that its neighbors have provided with the path it already had in its table to see if the neighbor has a better path. This comparison raises the question of what metric to use for “better”. One simple answer is to count the number of hops. More elaborate schemes might evaluate the data rate of each link along the way or even try to keep track of the load on each link of the path by measuring and reporting queue lengths.

Assuming G is simply counting hops, G looks at the path that A has offered to reach G, namely

    to G: <A, G>

and notices that G’s own path vector already contains a zero-length path to G, so it ignores A’s offering. A second reason to ignore this offering is that its own name, G, is in the path, which means that this path would involve a loop. To ensure loop-free forwarding, the algorithm always ignores any offered path that includes this router’s own name. When it is finished with the second round of path selection, G will have constructed the second-round path vector and forwarding table of Figure 7.36.

On the next round G will begin receiving longer paths. For example it will learn that H offers the path to D:

    to D: <H, J, D>

Since this path is longer than the one that G already has in its own path vector for D, G will ignore the offer. If the participants continue to alternate advertising and path selec­ tion steps, this algorithm ensures that eventually every participant will have in its own path vector the best (in this case, shortest) path to every other participant and there will be no loops. If static routing would suffice, the path vector construction procedure described above could stop once everyone’s tables had stabilized. But a nice feature of this algo­ rithm is that it is easily extended to provide adaptive routing. One method of extension would be, on learning of a change in topology, to redo the entire procedure, starting


again with path vectors containing just the path to the local end layer. A more efficient approach is to use the existing path vectors as a first approximation. The one or two par­ ticipants who, for example, discover that a link is no longer working simply adjust their own path vectors to stop using that link and then advertise their new path vectors to the neighbors they can still reach. Once we realize that readvertising is a way to adjust to topology change, it is apparent that the straightforward way to achieve adaptive routing is simply to have every router occasionally repeat the path vector exchange algorithm. If someone adds a new link to the network, on the next iteration of the exchange algo­ rithm, the routers at each end of the new link will discover it and propagate the discovery throughout the network. On the other hand, if a link goes down, an additional step is needed to ensure that paths that traversed that link are discarded: each router discards any paths that a neighbor stops advertising. When a link goes down, the routers on each end of that link stop receiving advertisements; as soon as they notice this lack they dis­ card all paths that went through that link. Those paths will be missing from their own next advertisements, which will cause any neighbors using those paths to discard them in turn; in this way the fact of a down link retraces each path that contains the link, thereby propagating through the network to every router that had a path that traversed the link. A model implementation of all of the parts of this path vector algorithm appears in Figure 7.37. When designing a routing algorithm, there are a number of questions that one should ask. Does the algorithm converge? (Because it selects the shortest path this algorithm will converge, assuming that the topology remains constant.) How rapidly does it converge? (If the shortest path from a router to some participant is N steps, then this algorithm will insert that shortest path in that router’s table after N advertising/path-selection exchanges.) Does it respond equally well to link deletions? (No, it can take longer to con­ vince all participants of deletions. On the other hand, there are other algorithms—such as distance vector, which passes around just the lengths of paths rather than the paths themselves—that are much worse.) Is it safe to send traffic before the algorithm con­ verges? (If a link has gone down, some packets may loop for a while until everyone agrees on the new forwarding tables. This problem is serious, but in the next paragraph we will see how to fix it by discarding packets that have been forwarded too many times.) How many destinations can it reasonably handle? (The Border Gateway Protocol, which uses a path vector algorithm similar to the one described above, has been used in the Internet to exchange information concerning 100,000 or so routes.) The possibility of temporary loops in the forwarding tables or more general routing table inconsistencies, buggy routing algorithms, or misconfigurations can be dealt with by a network layer mechanism known as the hop limit. The idea is to add a field to the network-layer header containing a hop limit counter. The originator of the packet ini­ tializes the hop limit. Each router that handles the packet decrements the hop limit by one as the packet goes by. If a router finds that the resulting value is zero, it discards the packet. The hop limit is thus a safety net that ensures that no packet continues bouncing around the network forever.
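As a sketch of the hop limit mechanism just described, a forwarder can wrap its forwarding step as shown below; the field name, the starting value, and the callback parameter are illustrative rather than taken from any particular protocol.

def forward_with_hop_limit(packet, link_send, forwarding_table):
    # Decrement the hop limit; discard the packet when the limit reaches zero,
    # so a forwarding loop cannot circulate it forever.
    packet["hop_limit"] -= 1
    if packet["hop_limit"] <= 0:
        return                        # discard: the packet has traveled too far
    next_hop = forwarding_table[packet["destination"]]
    link_send(packet, next_hop)

packet = {"destination": "D", "hop_limit": 32, "payload": b"example"}
forward_with_hop_limit(packet, lambda p, link: print("sent on link", link), {"D": 3})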


// Maintain routing and forwarding tables.
vector associative array                // vector[d_addr] contains path to destination d_addr
neighbor_vector instance of vector      // A path vector received from some neighbor.
my_vector instance of vector            // My current path vector.
addr associative array                  // addr[j] is the address of the network attachment
                                        //   point at the other end of link j.
                                        // my_addr is address of my network attachment point.
// A path is a parsable list of addresses, e.g. {a,b,c,d}

procedure main()                        // Initialize, then start advertising.
    SET_TYPE_HANDLER (HANDLE_ADVERTISEMENT, exchange_protocol)
    clear my_vector                     // Listen for advertisements
    do occasionally                     //   and advertise my paths
        for each j in link_ids do       //   to all of my neighbors.
            status ← SEND_PATH_VECTOR (j, my_addr, my_vector, exch_protocol)
            if status ≠ 0 then          // If the link was down,
                clear new_vector        //   forget about any paths
                FLUSH_AND_REBUILD (j)   //   that start with that link.

procedure HANDLE_ADVERTISEMENT (advt, link_id)     // Called when an advertisement arrives.
    addr[link_id] ← GET_SOURCE (advt)               // Extract neighbor’s address
    neighbor_vector ← GET_PATH_VECTOR (advt)        //   and path vector.
    for each neighbor_vector.d_addr do              // Look for better paths.
        new_path ← {addr[link_id], neighbor_vector[d_addr]}    // Build potential path.
        if my_addr is not in new_path then          // Skip it if I’m in it.
            if my_vector[d_addr] = NULL then        // Is it a new destination?
                my_vector[d_addr] ← new_path        // Yes, add this one.
            else                                    // Not new; if better, use it.
                my_vector[d_addr] ← SELECT_PATH (new_path, my_vector[d_addr])
    FLUSH_AND_REBUILD (link_id)

procedure SELECT_PATH (new, old)        // Decide if new path is better than old one.
    if first_hop(new) = first_hop(old) then return new    // Update any path we were already using.
    else if length(new) ≥ length(old) then return old     // We know a shorter path, keep it.
    else return new                                        // OK, the new one looks better.

procedure FLUSH_AND_REBUILD (link_id)   // Flush out stale paths from this neighbor.
    for each d_addr in my_vector
        if first_hop(my_vector[d_addr]) = addr[link_id] and new_vector[d_addr] = NULL then
            delete my_vector[d_addr]    // Delete paths that are no longer advertised.
    REBUILD_FORWARDING_TABLE (my_vector, addr)     // Pass info to forwarder.

FIGURE 7.37 Model implementation of a path vector exchange routing algorithm. These procedures run in every participating router. They assume that the link layer discards damaged packets. If an advertisement is lost, it is of little consequence because the next advertisement will replace it. The procedure REBUILD_FORWARDING_TABLE is not shown; it simply constructs a new forwarding table for use by this router, using the latest path vector information.
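The following Python sketch is a toy, synchronous version of the same exchange: it ignores link failures, FLUSH_AND_REBUILD, and asynchrony, and the adjacency table is our reading of the example network of Figure 7.31. It is useful mainly for watching the path vectors converge round by round.

# Adjacency of the example network: node -> neighbors (all links assumed up).
topology = {
    "A": ["G"], "B": ["H"], "C": ["H"], "D": ["J"],
    "E": ["J", "K"], "F": ["K"],
    "G": ["A", "H", "J", "K"],
    "H": ["B", "C", "G", "J", "K"],
    "J": ["D", "E", "G", "H", "K"],
    "K": ["E", "F", "G", "H", "J"],
}

# Each participant starts knowing only the zero-length path to itself.
vectors = {node: {node: []} for node in topology}

def one_round(vectors):
    # Every participant advertises its current vector to its neighbors,
    # then each selects the shortest loop-free path for every destination.
    new_vectors = {node: dict(paths) for node, paths in vectors.items()}
    for node, neighbors in topology.items():
        for nbr in neighbors:
            for dest, path in vectors[nbr].items():    # nbr's advertisement
                candidate = [nbr] + path               # prepend the advertiser
                if node in candidate:
                    continue                           # would create a loop
                best = new_vectors[node].get(dest)
                if best is None or len(candidate) < len(best):
                    new_vectors[node][dest] = candidate
    return new_vectors

for round_number in range(1, 4):
    vectors = one_round(vectors)
    print("after round", round_number, "G knows", vectors["G"])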


There are some obvious refinements that can be made to the path vector algorithm. For example, since nodes such as A, B, C, D, and F are connected by only one link to the rest of the network, they can skip the path selection step and just assume that all des­ tinations are reachable via their one link—but when they first join the network they must do an advertising step, to ensure that the rest of the network knows how to reach them (and it would be wise to occasionally repeat the advertising step, to make sure that link failures and router restarts don’t cause them to be forgotten). A service node such as E, which has two links to the network but is not intended to be used for transit traffic, may decide never to advertise anything more than the path to itself. Because each participant can independently decide which paths it advertises, path vector exchange is sometimes used to implement restrictive routing policies. For example, a country might decide that packets that both originate and terminate domestically should not be allowed to transit another country, even if that country advertises a shorter path. The exchange of data among routers is just another example of a network layer pro­ tocol. Since the link layer already provides network layer protocol multiplexing, no extra effort is needed to add a routing protocol to the layered system. Further, there is nothing preventing different groups of routers from choosing to use different routing protocols among themselves. In the Internet, there are many different routing protocols simulta­ neously in use, and it is common for a single router to use different routing protocols over different links.

7.4.3 Hierarchical Address Assignment and Hierarchical Routing

The system for identifying attachment points of a network as described so far is workable, but does not scale up well to large numbers of attachment points. There are two immediate problems:

1. Every attachment point must have a unique address. If there are just ten attachment points, all located in the same room, coming up with a unique identifier for an eleventh is not difficult. But if there are several hundred million attachment points in locations around the world, as in the Internet, it is hard to maintain a complete and accurate list of addresses already assigned.

2. The path vector grows in size with the number of attachment points. Again, for routers to exchange a path vector with ten entries is not a problem; a path vector with 100 million entries could be a hassle.

The usual way to tackle these two problems is to introduce hierarchy: invent some scheme by which network addresses have a hierarchical structure that we can take advantage of, both for decentralizing address assignments and for reducing the size of forwarding tables and path vectors.

For example, consider again the abstract network of Figure 7.28, in which we arbitrarily assigned two-digit numbers as network addresses. Suppose we instead adopt a more structured network address consisting, say, of two parts, which we might call


“region” and “station”. Thus in Figure 7.31 we might assign to A the network address “11,75” where 11 is a region identifier and 75 is a station identifier. By itself, this change merely complicates things. However, if we also adopt a policy that regions must correspond to the set of network attachment points served by an iden­ tifiable group of closely-connected routers, we have a lever that we can use to reduce the size of forwarding tables and path vectors. Whenever a router for region 11 gets ready to advertise its path vector to a router that serves region 12, it can condense all of the paths for the region 11 network destinations it knows about into a single path, and simply advertise that it knows how to forward things to any region 11 network destination. The routers that serve region 11 must, of course, still maintain complete path vectors for every region 11 station, and exchange those vectors among themselves, but these vectors are now proportional in size to the number of attachment points in region 11, rather than to the number of attachment points in the whole network. When a network uses hierarchical addresses, the operation of forwarding involves the same steps as before, but the table lookup process is slightly more complicated: The for­ warder must first extract the region component of the destination address and look that up in its forwarding table. This lookup has two possible outcomes: either the forwarding table contains an entry showing a link over which to send the packet to that region, or the forwarding table contains an entry saying that this forwarder is already in the desti­ nation region, in which case it is necessary to extract the station identifier from the destination address and look that up in a distinct part of the forwarding table. In most implementations, the structure of the forwarding table reflects the hierarchical structure of network addresses. Figure 7.38 illustrates the use of a forwarding table for hierarchical addresses that is constructed of two sections. Hierarchical addresses also offer an opportunity to grapple with the problem of assigning unique addresses in a large network because the station part of a network address needs to be unique only within its region. A central authority can assign region identifiers, while different local authorities can assign the station identifiers within each region, without consulting other regional authorities. For this decentralization to work, the boundaries of each local administrative authority must coincide with the boundaries of the regions served by the packet forwarders. While this seems like a simple thing to arrange, it can actually be problematic. One easy way to define regions of closely con­ nected packet forwarders is to do it geographically. However, administrative authority is often not organized on a strictly geographic basis. So there may be a significant tension between the needs of address assignment and the needs of packet forwarding. Hierarchical network addresses are not a panacea—in addition to complexity, they introduce at least two new problems. With the non-hierarchical scheme, the geographi­ cal location of a network attachment point did not matter, so a portable computer could, for example, connect to the network in either Boston or San Francisco, announce its net­ work address, and after the routers have exchanged path vectors a few times, expect to communicate with its peers. But with hierarchical routing, this feature stops working. 
When a portable computer attaches to the network in a different region, it cannot simply advertise the same network address that it had in its old region. It will instead have to first acquire a network address within the region to which it is attaching. In addition, unless some provision has been made at the old address for forwarding, other stations in the network that remember the old network address will find that they receive no-answer responses when they try to contact this station, even though it is again attached to the network.

The second complication is that paths may no longer be the shortest possible because the path vector algorithm is working with less detailed information. If there are two different routers in region 5 that have paths leading to region 7, the algorithm will choose the path to the nearest of those two routers, even though the other router may be much closer to the actual destination inside region 7.

We have used in this example a network address with two hierarchical levels, but the same principle can be extended to as many levels as are needed to manage the network. In fact, any region can do hierarchical addressing within just the part of the address space that it controls, so the number of hierarchical levels can be different in different places. The public Internet uses just two hierarchical addressing levels, but some large subnetworks of the Internet implement the second level internally as a two-level hierarchy. Similarly, North American telephone providers have created a four-level hierarchy for telephone numbers: country code, area code, exchange, and line number, for exactly the same reasons: to reduce the size of the tables used in routing calls, and to allow local administration of line numbers. Other countries agree on the country codes but internally may have a different number of hierarchical levels.

FIGURE 7.38 Example of a forwarding table with regional addressing in network node R1.B. The forwarder first looks up the region identifier in the region forwarding section of the table. If the target address is R3.C, the region identifier is R3, so the table tells it that it should forward the packet on link 1. If the target address is R1.C, which is in its own region R1, the region forwarding table tells it that R1 is the local region, so it then looks up R1.C in the local forwarding section of the table. There may be hundreds of network attachment points in region R3, but just one entry is needed in the forwarding table at node R1.B. (The table in the figure, for node R1.B, has a region forwarding section: R1 is local, R2 and R3 go out on link 1, R4 goes out on link 3; and a local forwarding section: R1.A on link 1, R1.B is this node's own end layer, R1.C on link 2, R1.D on link 3.)
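To make the two-section lookup concrete, here is a small illustrative sketch in Python (rather than the pseudocode used elsewhere in this chapter). The table contents follow the Figure 7.38 example for node R1.B; the procedure and variable names are invented for this sketch and are not part of any real router implementation.

# Illustrative sketch of forwarding with a two-section table, as in Figure 7.38.
# The table contents follow the example for node R1.B; all names are hypothetical.
forwarding_table = {
    "region_section": {        # region identifier -> outgoing link (or "local")
        "R1": "local",
        "R2": 1,
        "R3": 1,
        "R4": 3,
    },
    "local_section": {         # station identifier -> outgoing link (or "end-layer")
        "R1.A": 1,
        "R1.B": "end-layer",   # this node itself
        "R1.C": 2,
        "R1.D": 3,
    },
}

def forward(destination):
    """Return the link on which to send a packet addressed to destination,
    where destination is a hierarchical address such as "R3.C"."""
    region = destination.split(".", 1)[0]
    outcome = forwarding_table["region_section"].get(region)
    if outcome is None:
        raise KeyError("no route to region " + region)  # report a network-layer error
    if outcome != "local":
        return outcome                  # forward toward the destination region
    # The packet is already in its destination region; look up the station.
    return forwarding_table["local_section"][destination]

# Examples from the figure caption:
assert forward("R3.C") == 1   # hundreds of stations in R3, one table entry
assert forward("R1.C") == 2   # local region, so the local section decides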

7.4.4 Reporting Network Layer Errors

The network layer can encounter trouble when trying to forward a packet, so it needs a way of reporting that trouble. The network layer is in a uniquely awkward position when this happens because the usual reporting method (return a status value to the higher-layer program that asked for this operation) may not be available. An intermediate router receives a packet from a link layer below, and it is expected to forward that packet via another link layer. Even if there is a higher layer in the router, that layer probably has no interest in this packet. Instead, the entity that needs to hear about the problem is more likely to be the upper layer program that originated the packet, and that program may be located several hops away in another computer. Even the network layer at the destination address may need to report something to the original sender such as the lack of an upper-layer handler for the end-to-end type that the sender specified.

The obvious thing to do is send a message to the entity that needs to know about the problem. The usual method is that the network layer of the router creates a new packet on the spot and sends it back to the source address shown in the problem packet. The message in this new packet reports details of the problem using some standard error reporting protocol. With this design, the original higher-layer sender of a packet is expected to listen not only for replies but also for messages of the error reporting protocol. Here are some typical error reports:

• The buffers of the router were full, so the packet had to be discarded.
• The buffers of the router are getting full—please stop sending so many packets.
• The region identifier part of the target address does not exist.
• The station identifier part of the target address does not exist.
• The end type identifier was not recognized.
• The packet is larger than the maximum transmission unit of the next link.
• The packet hop limit has been exceeded.

In addition, a copy of the header of the doomed packet goes into a data field of the error message, so that the recipient can match it with an outstanding SEND request.

One might suggest that a router send an error report when discarding a packet that is received with a wrong checksum. This idea is not as good as it sounds because a damaged packet may have garbled header information, in which case the error message might be sent to a wrong—or even nonexistent—place. Once a packet has been identified as containing unknown damage, it is not a good idea to take any action that depends on its contents.

A network-layer error reporting protocol is a bit unusual. An error message originates in the network layer, but is delivered to the end-to-end layer. Since it crosses layers, it can be seen as violating (in a minor way) the usual separation of layers: we have a network layer program preparing an end-to-end header and inserting end-to-end data; a strict layer doctrine would insist that the network layer not touch anything but network layer headers.

An error reporting protocol is usually specified to be a best-effort protocol, rather than one that takes heroic efforts to get the message through. There are two reasons why this design decision makes sense. First, as will be seen in Section 7.5 of this chapter, implementing a more reliable protocol adds a fair amount of machinery: timers, keeping copies of messages in case they need to be retransmitted, and watching for receipt acknowledgments. The network layer is not usually equipped to do any of these functions, and not implementing them minimizes the violation of layer separation. Second, error messages can be thought of as hints that allow the originator of a packet to more quickly discover a problem. If an error message gets lost, the originator should, one way or another, eventually discover the problem in some other way, perhaps after timing out, resending the original packet, and getting an error message on the retry.

A good example of the best-effort nature of an error reporting protocol is that it is common to not send an error message about every discarded packet; if congestion is causing the discard rate to climb, that is exactly the wrong time to increase the network load by sending many “I discarded your packet” notices. But sending a few such notices can help alert sources who are flooding the network that they need to back off—this topic is explored in more depth in Section 7.6.

The basic idea of an error reporting protocol can be used for other communications to and from the network layer of any participant in the network. For example, the Internet has a protocol named internet control message protocol (ICMP) that includes an echo request message (also known as a “ping,” from an analogy with submarine active sonar systems). If an end node sends an echo request to any network participant, whether a packet forwarder or another end node, the network layer in that participant is expected to respond by immediately sending the data of the message back to the sender in an echo reply message. Echo request/reply messages are widely used to determine whether or not a participant is actually up and running. They are also sometimes used to assess network congestion by measuring the time until the reply comes back.

Another useful network error report is “hop limit exceeded”. Recall from page 7–54 that to provide a safety net against the possibility of forwarding loops, a packet may contain a hop limit field, which a router decrements in each packet that it forwards. If a router finds that the hop limit field contains zero, it discards the packet and it also sends back a message containing the error report. The “hop limit exceeded” error message provides feedback to the originator, for example, that it may have chosen a hop limit that is too small for the network configuration.

The “hop limit exceeded” error message can also be used in an interesting way to help locate network problems: send a test message (usually called a probe) to some distant destination address, but with the hop limit set to 1. This probe will cause the first router that sees it to send back a “hop limit exceeded” message whose source address identifies that first router. Repeat the experiment, sending probes with hop limits set to 2, 3,…, etc. Each response will reveal the network address of the next router along the current path between the source and the destination. In addition, the time required for the response to return gives a rough indication of the network load between the source and that router.
In this way one can trace the current path through the network to the destination address, and identify points of congestion.
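The probing procedure just described can be written as a short loop. In the Python sketch below, send_probe is a made-up stand-in for transmitting a probe packet with a given hop limit and collecting whatever error report or reply comes back; a real implementation would invoke the network layer directly.

# Sketch of tracing a path by sending probes with increasing hop limits.
# send_probe() is a hypothetical stand-in: it returns the address of the router that
# reported "hop limit exceeded", or the destination's own address when a probe
# finally gets through. The toy path below exists only to make the sketch runnable.
SIMULATED_PATH = ["router.a", "router.b", "router.c", "host.d"]

def send_probe(destination, hop_limit):
    if hop_limit < len(SIMULATED_PATH):
        return SIMULATED_PATH[hop_limit - 1]  # intermediate router answers with an error report
    return destination                        # probe reached the destination

def trace_path(destination, max_hops=32):
    path = []
    for hop_limit in range(1, max_hops + 1):
        responder = send_probe(destination, hop_limit)
        path.append(responder)
        if responder == destination:          # the probe got all the way through
            break
    return path

print(trace_path("host.d"))   # ['router.a', 'router.b', 'router.c', 'host.d']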

Another way to use an error reporting protocol is for the end-to-end layer to send a series of probes to learn the smallest maximum transmission unit (MTU) that lies on the current path between it and another network attachment point. It first sends a packet of the largest size the application has in mind. If this probe results in an “MTU exceeded” error response, it halves the packet size and tries again. A continued binary search will quickly home in on the smallest MTU along the path. This procedure is known as MTU discovery.
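A sketch of that search follows; probe_fits is another made-up helper standing in for sending a probe of a given size and watching for an “MTU exceeded” error report, and the hidden path MTU and lower bound are illustrative values, not part of any real protocol.

# Sketch of path MTU discovery by binary search. probe_fits() is a hypothetical
# helper: it sends a probe of the given size and returns True unless an
# "MTU exceeded" error report comes back.
PATH_MTU = 1480   # hidden value the search is trying to discover (for the toy probe)

def probe_fits(size):
    return size <= PATH_MTU

def discover_mtu(largest, smallest=68):
    """Return the largest packet size that fits on the current path.
    smallest is assumed to be a size that always fits."""
    if probe_fits(largest):
        return largest
    low, high = smallest, largest        # invariant: low fits, high does not
    while high - low > 1:
        middle = (low + high) // 2
        if probe_fits(middle):
            low = middle
        else:
            high = middle
    return low

print(discover_mtu(4096))   # 1480 for the toy path above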

7.4.5 Network Address Translation (An Idea That Almost Works)

From a naming point of view, the Internet provides a layered naming environment with two contexts for its network attachment points, known as “Internet addresses”. An Internet address has two components, a network number and a host number. Most network numbers are global names, but a few, such as network 10, are designated for use in private networks. These network numbers can be used either completely privately, or in conjunction with the public Internet. Completely private use involves setting up an independent private network, and assigning host addresses using the network number 10. Routers within this network advertise and forward just as in the public Internet. Routers on the public Internet follow the convention that they do not accept routes to network 10, so if this private network is also directly attached to the public Internet, there is no confusion. Assuming that the private network accepts routes to globally named networks, a host inside the private network could send a message to a host on the public Internet, but a host on the public Internet cannot send a response back because of the routing convention. Thus any number of private networks can each independently assign numbers using network number 10—but hosts on different private networks cannot talk to one another and hosts on the public Internet cannot talk to them.

Network Address Translation (NAT) is a scheme to bridge this gap. The idea is that a specialized translating router (known informally as a “NAT box”) stands at the border between a private network and the public Internet. When a host inside the private network wishes to communicate with a service on the public Internet, it first makes a request to the translating router. The translator sets up a binding between that host’s private address and a temporarily assigned public address, which the translator advertises to the public Internet. The private host then launches a packet that has a destination address in the public Internet, and its own private network source address. As this packet passes through the translating router, the translator modifies the source address by replacing it with the temporarily assigned public address. It then sends the packet on its way into the public Internet. When a response from the service on the public Internet comes back to the translating router, the translator extracts the destination address from the response, looks it up in its table of temporarily assigned public addresses, finds the internal address to which it corresponds, modifies the destination address in the packet, and sends the packet on its way on the internal network, where it finds its way to the private host that initiated the communication.
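The heart of the translator is its table of bindings. The sketch below shows only that bookkeeping, for the simple case just described in which only the network-layer source and destination addresses are rewritten; the class, the pool of public addresses, and the example addresses are invented for illustration.

# Sketch of the address bookkeeping in a NAT translator. Only network-layer source
# and destination rewriting is shown; real translators must also handle ports,
# checksums, and addresses buried in higher-layer payloads.
class NatTranslator:
    def __init__(self, public_pool):
        self.free_public = list(public_pool)   # temporarily assignable public addresses
        self.private_to_public = {}
        self.public_to_private = {}

    def outbound(self, packet):
        """Rewrite the source address of a packet leaving the private network."""
        private = packet["source"]
        if private not in self.private_to_public:
            public = self.free_public.pop()           # assign a temporary public address
            self.private_to_public[private] = public
            self.public_to_private[public] = private
        packet["source"] = self.private_to_public[private]
        return packet

    def inbound(self, packet):
        """Rewrite the destination address of a response from the public Internet."""
        packet["destination"] = self.public_to_private[packet["destination"]]
        return packet

# Example: a host on private network 10 talks to a public service.
nat = NatTranslator(public_pool=["128.30.0.7"])
request = nat.outbound({"source": "10.0.0.5", "destination": "18.26.0.1"})
reply = nat.inbound({"source": "18.26.0.1", "destination": request["source"]})
assert reply["destination"] == "10.0.0.5"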

The scheme works, after a fashion, but it has a number of limitations. The most severe limitation is that some end-to-end network protocols place Internet addresses in fields buried in their payloads; there is nothing restricting Internet addresses to packet source and destination fields of the network layer header. For example, some protocols between two parties start by mentioning the Internet address of a third party, such as a bank, that must also participate in the protocol. If the Internet address of the third party is on the public Internet, there may be no problem, but if it is an address on the private network, the translator needs to translate it as it goes by. The trouble is that translation requires that the translator peer into the payload data of the packet and understand the format of the higher-layer protocol. The result is that NAT works only for those protocols that the translator is programmed to understand. Some protocols may present great difficulties. For example, if a secure protocol uses key-driven cryptographic transformations for either privacy or authentication, the NAT gateway would need to have a copy of the keys, but giving it the keys may defeat the purpose of the secure protocol. (This concern will become clearer after reading Chapter 11[on-line].)

A second problem is that all of the packets between the public Internet and the private network must pass through the translating router, since it is the only place that knows how to do the address translation. The translator thus introduces both a potential bottleneck and a potential single point of failure, and NAT becomes a constraint on routing policy.

A third problem arises if two such organizations later merge. Each organization will have assigned addresses in network 10, but since their assignments were not coordinated, some addresses will probably have been assigned in both organizations, and all of the colliding addresses must be discovered and changed.

Although originally devised as a scheme to interconnect private networks to the public Internet, NAT has become popular as a technique to beef up security of computer systems that have insecure operating system or network implementations. In this application, the NAT translator inspects every packet coming from the public Internet and refuses to pass along any whose origin seems suspicious or that try to invoke services that are not intended for public use. The scheme does not in itself provide much security, but in conjunction with other security mechanisms described in Chapter 11[on-line], it can help create what that chapter describes as “defense in depth”.

7.5 The End-to-End Layer

The network layer provides a useful but not completely dependable best-effort communication environment that will deliver data segments to any destination, but with no guarantees about delay, order of arrival, certainty of arrival, accuracy of content, or even of delivery to the right place. This environment is too hostile for most applications, and the job of the end-to-end layer is to create a more comfortable communication environment that has the features of performance, reliability, and certainty that an application needs. The complication is that different applications can have quite different
communication needs, so no single end-to-end design is likely to suffice. At the same time, applications tend to fall in classes all of whose members have somewhat similar requirements. For each such class it is usually possible to design a broadly useful protocol, known as a transport protocol, for use by all the members of the class.

7.5.1 Transport Protocols and Protocol Multiplexing

A transport protocol operates between two attachment points of a network, with the goal of moving either messages or a stream of data between those points while providing a particular set of specified assurances. As was explained in Chapter 4, it is convenient to distinguish the two attachment points by referring to the application program that initiates action as the client and the application program that responds as the service. At the same time, data may flow either from client to service, from service to client, or both, so we will need to refer to the sending and receiving sides for each message or stream. Transport protocols almost always include multiplexing, to tell the receiving side to which application it should deliver the message or direct the stream. Because the mechanics of application multiplexing can be more intricate than in lower layers, we first describe a transport protocol interface that omits multiplexing, and then add multiplexing to the interface.

In contrast with the network layer, where an important feature is a uniform application programming interface, the interface to an end-to-end transport protocol varies with the particular end-to-end semantics that the protocol provides. Thus a simple message-sending protocol that is intended to be used by only one application might have a first-version interface such as:

   v.1   SEND_MESSAGE (destination, message)

in which, in addition to supplying the content of the message, the sender specifies in destination the network attachment point to which the message should be delivered. The sender of a message needs to know both the message format that the recipient expects and the destination address. Chapter 3 described several methods of discovering destination addresses, any of which might be used.

The prospective receiver must provide an interface by which the transport protocol delivers the message to the application. Just as in the link and network layers, receiving a message can’t happen until the message arrives, so receiving involves waiting and the corresponding receive-side interface depends on the system mechanisms that are available for waiting and for thread or event coordination. For illustration, we again use an upcall: when a message arrives, the message transport protocol delivers it by calling an application-provided procedure entry point:

   v.1   DELIVER_MESSAGE (message)

This first version of an upcall interface omits not only multiplexing but another important requirement: When sending a message, the sender usually expects a reply. While a programmer may be able to ask someone down the hall the appropriate destination address to use for some service, it is usually the case that a service has many clients. Thus
the service needs to know where each message came from so that it can send a reply. A message transport protocol usually provides this information, for example by including a second argument in the upcall interface:

   v.2   DELIVER_MESSAGE (source, message)

In this second (but not quite final) version of the upcall, the transport protocol sets the value of source to the address from which this message originated. The transport protocol obtains the value of source as an argument of an upcall from the network layer.

Since the reason for designing a message transport protocol is that it is expected to be useful to several applications, the interface needs additional information to allow the protocol to know which messages belong to which application. End-to-end layer multiplexing is generally a bit more complicated than that of lower layers because not only can there be multiple applications, there can be multiple instances of the same application using the same transport protocol. Rather than assigning a single multiplexing identifier to an application, each instance of an application receives a distinct multiplexing identifier, usually known as a port. In a client/service situation, most application services advertise one of these identifiers, called that application’s well-known port. Thus the second (and again not final) version of the send interface is

   v.2   SEND_MESSAGE (destination, service_port, message)

where service_port identifies the well-known port of the application service to which the sender wants to have the message delivered.

At the receiving side each application that expects to receive messages needs to tell the message transport protocol what port it expects clients to use, and it must also tell the protocol what program to call to deliver messages. The application can provide both pieces of information by invoking the transport protocol procedure

   LISTEN_FOR_MESSAGES (service_port, message_handler)

which alerts the transport protocol implementation that whenever a message arrives at this destination carrying the port identifier service_port, the protocol should deliver it by calling the procedure named in the second argument (that is, the procedure message_handler). LISTEN_FOR_MESSAGES enters its two arguments in a transport layer table for future reference. Later, when the transport protocol receives a message and is ready to deliver it, it invokes a dispatcher similar to that of Figure 7.27, on page 7–43. The dispatcher looks in the table for the service port that came with the message, identifies the associated message_handler procedure, and calls it, giving as arguments the source and the message.

One might expect that the service might send replies back to the client using the same application port number, but since one service might have several clients at the same network attachment point, each client instance will typically choose a distinct port number for its own replies, and the service needs to know to which port to send the reply. So the
SEND interface must be extended one final time to allow the sender to specify a port number to use for reply:

   v.3   SEND_MESSAGE (destination, service_port, reply_port, message)

where reply_port is the identifier that the service can use to send a message back to this particular client. When the service does send its reply message, it may similarly specify a reply_port that is different from its well-known port if it expects that same client to send further, related messages. The reply_port arguments in the two directions thus allow a series of messages between a client and a service to be associated with one another.

Having added the port number to SEND_MESSAGE, we must communicate that port number to the recipient by adding an argument to the upcall that the message transport protocol performs when it delivers a message to the recipient:

   v.3   DELIVER_MESSAGE (source, reply_port, message)

This third and final version of DELIVER_MESSAGE is the handler that the application designated when it called LISTEN_FOR_MESSAGES. The three arguments tell the handler (1) who sent the message (source), (2) the port on which that sender said it will listen for a possible reply (reply_port) and (3) the content of the message itself (message).

The interface set {LISTEN_FOR_MESSAGES, SEND_MESSAGE, DELIVER_MESSAGE} is specialized to end-to-end transport of discrete messages. Sidebar 7.5 illustrates two other, somewhat different, end-to-end transport protocol interfaces, one for a request/response protocol and the second for streams. Each different transport protocol can be thought of as a prepackaged set of improvements on the best-effort contract of the network layer. Here are three examples of transport protocols used widely in the Internet, and the assurances they provide:

1. User datagram protocol (UDP). This protocol adds ports for multiple applications and a checksum for data integrity to the network-layer packet. Although UDP is used directly for some simple request/reply applications such as asking for the time of day or looking up the network address of a service, its primary use is as a component of other message transport protocols, to provide end-to-end multiplexing and data integrity. [For details, see Internet standard STD0006 or Internet request for comments RFC–768.]

2. Transmission control protocol (TCP). Provides a stream of bytes with the assurances that data is delivered in the order it was originally sent, nothing is missing, nothing is duplicated, and the data has a modest (but not terribly high) probability of integrity. There is also provision for flow control, which means that the sender takes care not to overrun the ability of the receiver to accept data, and TCP cooperates with the network layer to avoid congestion. This protocol is used for applications such as interactive typing that require a telephone-like connection in which the order of delivery of data is important. (It is also used in many bulk transfer applications that do not require delivery order, but that do want to take advantage of its data integrity, flow control, and congestion avoidance assurances.) [For details, see Internet standard STD0007 or Internet request for comments RFC–793.]

Sidebar 7.5: Other end-to-end transport protocol interfaces

Since there are many different combinations of services that an end-to-end transport protocol might provide, there are equally many transport protocol interfaces. Here are two more examples:

1. A request/response protocol sends a request message and waits for a response to that message before returning to the application. Since an interface that waits for a response ensures that there can be only one such call per thread outstanding, neither an explicit multiplexing parameter nor an upcall is necessary. A typical client interface to a request/response transport protocol is

   response ← SEND_REQUEST (service_identifier, request)

where service_identifier is a name used by the transport protocol to locate the service destination and service port. It then sends a message, waits for a matching response, and delivers the result. The corresponding application programming interface at the service side of a request/response protocol may be equally simple or it can be quite complex, depending on the performance requirements.

2. A reliable message stream protocol sends several messages to the same destination with the intent that they be delivered reliably and in the order in which they were sent. There are many ways of defining a stream protocol interface. In the following example, an application client begins by creating a stream:

   client_stream_id ← OPEN_STREAM (destination, service_port, reply_port)

followed by several invocations of:

   WRITE_STREAM (client_stream_id, message)

and finally ends with:

   CLOSE_STREAM (client_stream_id)

The service-side programming interface allows for several streams to be coming in to an application at the same time. The application starts by calling a LISTEN_FOR_STREAMS procedure to post a listener on the service port, just as with the message interface. When a client opens a new stream, the service’s network layer, upon receiving the open request, upcalls to the stream listener that the application posted:

   OPEN_STREAM_REQUEST (source, reply_port)

and upon receiving such an upcall OPEN_STREAM_REQUEST assigns a stream identifier for use within the service and invokes a transport layer dispatcher with

   ACCEPT_STREAM (service_stream_id, next_message_handler)

The arrival of each message on the stream then leads the dispatcher to perform an upcall to the program identified in the variable next_message_handler:

   HANDLE_NEXT_MESSAGE (stream_id, message)

With this design, a message value of NULL might signal that the client has closed the stream.

3. Real-time transport protocol (RTP). Built on UDP (but with checksums switched off), RTP provides a stream of time-stamped packets with no other integrity guarantee. This kind of protocol is useful for applications such as streaming video or voice, where order and stream timing are important, but an occasional lost packet is not a catastrophe, so out-of-order packets can be discarded, and packets with bits in error may still contain useful data. [For details, see Internet request for comments RFC–1889.]

There have, over the years, been several other transport protocols designed for use with the Internet, but they have not found enough application to be widely implemented. There are also several end-to-end protocols that provide services in addition to message transport, such as file transfer, file access, remote procedure call, and remote system management, and that are built using UDP or TCP as their underlying transport mechanism. These protocols are usually classified as presentation protocols because the primary additional service they provide is translating data formats between different computer platforms. This collection of protocols illustrates that the end-to-end layer is itself sometimes layered and sometimes not, depending on the requirements of the application. Finally, end-to-end protocols can be multipoint, which means they involve more than two players. For example, to complete a purchase transaction, there may be a buyer, a seller, and one or more banks, each of which needs various end-to-end assurances about agreement, order of delivery, and data integrity.

In the next several sections, we explore techniques for providing various kinds of end-to-end assurances. Any of these techniques may be applied in the design of a message transport protocol, a presentation protocol, or by the application itself.
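To pull together the message interface of Section 7.5.1, here is an illustrative Python sketch of the receive-side bookkeeping behind LISTEN_FOR_MESSAGES and the dispatcher. The data structures, the example port number, and the handler are invented for the sketch; they stand in for whatever a real transport protocol implementation would use.

# Sketch of receive-side multiplexing in a message transport protocol. An application
# registers a handler for its well-known port; when a segment arrives carrying that
# port, the dispatcher upcalls the handler, as DELIVER_MESSAGE does in the text.
listeners = {}          # service_port -> message_handler

def listen_for_messages(service_port, message_handler):
    listeners[service_port] = message_handler

def dispatch(segment):
    """Called by the transport protocol when a segment arrives from the network layer."""
    handler = listeners.get(segment["service_port"])
    if handler is None:
        return "no such port"     # might trigger a network-layer error report
    # Upcall corresponding to DELIVER_MESSAGE (source, reply_port, message):
    handler(segment["source"], segment["reply_port"], segment["message"])

# Example: a time-of-day service listening on a well-known port.
def handle_time_request(source, reply_port, message):
    print("request from", source, "- reply should go to port", reply_port)

listen_for_messages(37, handle_time_request)
dispatch({"service_port": 37, "source": "18.26.0.1",
          "reply_port": 1301, "message": "what time is it?"})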

7.5.2 Assurance of At-Least-Once Delivery; the Role of Timers

A property of a best-effort network is that it may lose packets, so a goal of many end-to-end transport protocols is to eliminate the resulting uncertainty about delivery. A persistent sender is a protocol participant that tries to ensure that at least one copy of each data segment is delivered, by sending it repeatedly until it receives an acknowledgment. The usual implementation of a persistent sender is to add to the application data a header containing a nonce and to set a timer that the designer estimates will expire in a little more than one network round-trip time, which is the sum of the network transit time for the outbound segment, the time the receiver spends absorbing the segment and preparing an acknowledgment, and the network transit time for the acknowledgment. Having set the timer, the sender passes the segment to the network layer for delivery, taking care to keep a copy. The receiving side of the protocol strips off the end-to-end header, passes the application data along to the application, and in addition sends back an acknowledgment that contains the nonce. When the acknowledgment gets back to the sender, the
sender uses the nonce to identify which previously-sent segment is being acknowledged. It then turns off the corresponding timer and discards its copy of that segment. If the timer expires before the acknowledgment returns, the sender restarts the timer and resends the segment, repeating this sequence indefinitely, until it receives an acknowledgment. For its part, the receiver sends back an acknowledgment every time it receives a segment, thereby extending the persistence in the reverse direction, thus covering the possibility that the best-effort network has lost one or more acknowledgments.

A protocol that includes a persistent sender does its best to provide an assurance of at-least-once delivery, which has semantics similar to the at-least-once RPC introduced in Section 4.2.2. The nonce, timer, retry, and acknowledgment together act to ensure that the data segment will eventually get through. As long as there is a non-zero probability of a message getting through, this protocol will eventually succeed. On the other hand, the probability may actually be zero, either for an indefinite time—perhaps the network is partitioned or the destination is not currently listening, or permanently—perhaps the destination is on a ship that has sunk. Because of the possibility that there will not be an acknowledgment forthcoming soon, or perhaps ever, a practical sender is not infinitely persistent. The sender limits the number of retries, and if the number exceeds the limit, the sender returns error status to the application that asked to send the message. The application must interpret this error status with some understanding of network communications. The lack of an acknowledgment means that one of two—significantly different—events has occurred:

1. The data segment was not delivered.

2. The data segment was delivered, but the acknowledgment never returned.

The good news is that the application is now aware that there is a problem. The bad news is that there is no way to determine which of the two problems occurred. This dilemma is intrinsic to communication systems, and the appropriate response depends on the particular application. Some applications will respond to this dilemma by making a note to later ask the other side whether or not it got the message; other applications may just ignore the problem. Chapter 10[on-line] investigates this issue further.

In summary, just as with at-least-once RPC, the at-least-once delivery protocol does not provide the absolute assurance that its name implies; it instead provides the assurance that if it is possible to get through, the message will get through, and if it is not possible to confirm delivery, the application will know about it.

The at-least-once delivery protocol provides no assurance about duplicates—it actually tends to generate duplicates. Furthermore, the assurance of delivery is weaker than appears on the surface: the data may have been corrupted along the way, or it may have been delivered to the wrong destination—and acknowledged—by mistake. Assurances on any of those points require additional techniques. Finally, the at-least-once delivery protocol ensures only that the message was delivered, not that the application actually acted on it—the receiving system may have been so overloaded that it ignored the message or it may have crashed an instant after acknowledging the message. When examining end-to-end assurances, it is important to identify the end points. In this case,
the receiving end point is the place in the protocol code that sends the acknowledgment of message receipt.

This protocol requires the sender to choose a value for the retry timer at the time it sends a packet. One possibility would be to choose in advance a timer value to be used for every packet—a fixed timer. But using a timer value fixed in advance is problematic because there is no good way to make that choice. To detect a lost packet by noticing that no acknowledgment has returned, the appropriate timer interval would be the expected network round-trip time plus some allowance for unusual queuing delays. But even the expected round-trip time between two given points can vary by quite a bit when routes change. In fact, one can argue that since the path to be followed and the amount of queuing to be tolerated is up to the network layer, and the individual transit times of links are properties of the link layer, for the end-to-end layer to choose a fixed value for the timer interval would violate the layering abstraction—it would require that the end-to-end layer know something about the internal implementation of the link and network layers.

Even if we are willing to ignore the abstraction concern, the end-to-end transport protocol designer has a dilemma in choosing a fixed timer interval. If the designer chooses too short an interval, there is a risk that the protocol will resend packets unnecessarily, which wastes network capacity as well as resources at both the sending and receiving ends. But if the designer sets the timer too long, then genuinely lost packets will take a long time to discover, so recovery will be delayed and overall performance will decline. Worse, setting a fixed value for a timer will not only force the designer to choose between these two evils, it will also embed in the system a lurking surprise that may emerge long in the future when someone else changes the system, for example to use a faster network connection. Going over old code to understand the rationale for setting the timers and choosing new values for them is a dismal activity that one would prefer to avoid by better design.

There are two common ways to minimize the use of fixed timers, both of which are applicable only when a transport protocol sends a stream of data segments to the same destination: adaptive timers and negative acknowledgments.

An adaptive timer is one whose setting dynamically adjusts to currently observed conditions. A common implementation scheme is to observe the round-trip times for each data segment and its corresponding response and calculate an exponentially weighted moving average of those measurements (Sidebar 7.6 explains the method). The protocol then sets its timers to, say, 150% of that estimate, with the intent that minor variations in queuing delay should rarely cause the timer to expire. Keeping an estimate of the round-trip time turns out to be useful for other purposes, too. An example appears in the discussion of flow control in Section 7.5.6, below.

A refinement for an adaptive timer is to assume that duplicate acknowledgments mean that the timer setting is too small, and immediately increase it. (A too-small timer setting would expire before the first acknowledgment returns, causing the sender to resend the original data segment, which in turn would trigger the duplicate acknowledgment.) It is usually a good idea to make any increase a big one, for example by doubling the value previously used to set the timer.

Sidebar 7.6: Exponentially weighted moving averages

One way of keeping a running average, A, of a series of measurements, M_i, is to calculate an exponentially weighted moving average, defined as

   A = (M_0 + M_1 × α + M_2 × α² + M_3 × α³ + …) × (1 – α)

where α < 1 and the subscript indicates the age of the measurement, the most recent being M_0. The multiplier (1 – α) at the end normalizes the result.

This scheme has two advantages over a simple average. First, it gives more weight to recent measurements. The multiplier, α, is known as the decay factor. A smaller value for the decay factor means that older measurements lose weight more rapidly as succeeding measurements are added into the average. The second advantage is that it can be easily calculated as new measurements become available using the recurrence relation:

   A_new ← (α × A_old + (1 – α) × M_new)

where M_new is the latest measurement. In a high-performance environment where measurements arrive frequently and calculation time must be minimized, one can instead calculate

   A_new / (1 – α) ← α × (A_old / (1 – α)) + M_new

which requires only one multiplication and one addition. Furthermore, if (1 – α) is chosen to be a fractional power of two (e.g., 1/8) the multiplication can be done with one register shift and one addition. Calculated this way, the result is too large by the constant factor 1 / (1 – α), but it may be possible to take a constant factor into account at the time the average is used.

In both computer systems and networks there are many situations in which it is useful to know the average value of an endless series of observations. Exponentially weighted moving averages are probably the most frequently used method.
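As a small illustration of the recurrence in Sidebar 7.6, applied to round-trip-time estimates, here is a Python sketch; the decay factor of 7/8 and the sample values are arbitrary choices for the example.

# Sketch of an exponentially weighted moving average of round-trip times, using the
# recurrence A_new = alpha * A_old + (1 - alpha) * M_new from Sidebar 7.6.
def updated_average(old_average, new_measurement, alpha=0.875):
    return alpha * old_average + (1 - alpha) * new_measurement

# Feeding in a series of measured round-trip times (milliseconds, illustrative values):
average = 100.0
for rtt in [100, 110, 95, 240, 105]:   # a single outlier shifts the average only modestly
    average = updated_average(average, rtt)
timer_setting = 1.5 * average          # e.g., 150% of the estimate, as in the text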

Repeatedly increasing a timer setting by multiplying its previous value by a constant on each retry (thus succeeding timer values might be, say, 1, 2, 4, 8, 16, … seconds) is known as exponential backoff, a technique that we will see again in other, quite different system applications. Doubling the value, rather than multiplying by, say, ten, is a good choice because it gets within a factor of two of the “right” value quickly without overshooting too much.

Adaptive techniques are not a panacea: the protocol must still select a timer value for the first data segment, and it can be a challenge to choose a value for the decay factor (in the sidebar, the constant α) that both keeps the estimate stable and also quickly responds to changes in network conditions. The advantage of an adaptive timer comes from being
able to amortize the cost of an uninformed choice on that first data segment over the ensuing several segments.

A different method for minimizing use of fixed timers is for the receiving side of a stream of data segments to infer from the arrival of later data segments the loss of earlier ones and request their retransmission by sending a negative acknowledgment, or NAK. A NAK is simply a message that lists missing items. Since data segments may be delivered out of order, the recipient needs some way of knowing which segment is missing. For example, the sender might assign sequential numbers as nonces, so arrival of segments #13 and #14 without having previously received segment #12 might cause the recipient to send a NAK requesting retransmission of segment #12. To distinguish transmission delays from lost segments, the recipient must decide how long to wait before sending a NAK, but that decision can be made by counting later-arriving segments rather than by measuring a time interval.

Since the recipient reports lost packets, the sender does not need to be persistent, so it does not need to use a timer at all—that is, until it sends the last segment of a stream. Because the recipient can’t depend on later segment arrivals to discover that the last segment has been lost, that discovery still requires the help of a timer. With NAKs, the persistent-sender strategy with a timer is needed only once per stream, so the penalty for choosing a timer setting that is too long (or too short) is just one excessive delay (or one risk of an unnecessary duplicate transmission) on the last segment of the stream. Compared with using an adaptive timer on every segment of the stream, this is probably an improvement.

The appropriate conclusion about timers is that fixed timers are a terrible mechanism to include in an end-to-end protocol (or indeed anywhere—this conclusion applies to many applications of timers in systems). Adaptive timers work better, but add complexity and require careful thought to make them stable. Avoidance and minimization of timers are the better strategies, but it is usually impossible to completely eliminate them. Where timers must be used they should be designed with care and the designer should clearly document them as potential trouble spots.
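Putting the persistent sender together with exponential backoff gives a skeleton like the following Python sketch; network_send and wait_for_acknowledgment are toy stand-ins for the real network-layer interface and timer machinery, and the retry limit and timeout values are illustrative.

# Sketch of a persistent sender with exponential backoff on its retry timer.
def network_send(segment, nonce):
    pass                       # stand-in: hand the segment to the network layer

def wait_for_acknowledgment(nonce, timeout):
    return False               # stand-in: in this toy example no acknowledgment ever arrives

def send_persistently(segment, nonce, first_timeout, retry_limit=5):
    timeout = first_timeout
    for attempt in range(retry_limit):
        network_send(segment, nonce)
        if wait_for_acknowledgment(nonce, timeout):
            return "delivered (at least once)"
        timeout = 2 * timeout  # exponential backoff: double the setting on each retry
    return "delivery not confirmed"   # the segment may or may not have arrived

print(send_persistently(b"data", nonce=17, first_timeout=1.0))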

7.5.3 Assurance of At-Most-Once Delivery: Duplicate Suppression

At-least-once delivery assurance was accomplished by remembering state at the sending side of the transport protocol: a copy of the data segment, its nonce, and a flag indicating that an acknowledgment is still needed. But a side effect of at-least-once delivery is that it tends to generate duplicates. To ensure at-most-once delivery, it is necessary to suppress these duplicates, as well as any other duplicates created elsewhere within the network, perhaps by a persistent sender in some link-layer protocol.

The mechanism of suppressing duplicates is a mirror image of the mechanism of at-least-once delivery: add state at the receiving side. We saw a preview of this mechanism in Section 7.1 of this chapter—the receiving side maintains a table of previously-seen nonces. Whenever a data segment arrives, the transport layer implementation checks the nonce of the incoming segment against the list of previously-seen nonces. If this nonce
is new, it adds the nonce to the list, delivers the data segment to the application, and sends an acknowledgment back to the sender. If the nonce is already in its list, it discards the data segment, but it resends the acknowledgment, in case the sender did not receive the previous one. If, in addition, the application has already sent a response to the original request, the transport protocol also resends that response.

The main problem with this technique is that the list of nonces maintained at the receiving side of the transport protocol may grow indefinitely, taking up space and, whenever a data segment arrives, taking time to search. Because they may have to be kept indefinitely, these nonces are described colorfully as tombstones. A challenge in designing a duplicate-suppression technique is to avoid accumulating an unlimited number of tombstones.

One possibility is for the sending side to use monotonically increasing sequence numbers for nonces, and include as an additional field in the end-to-end header of every data segment the highest sequence number for which it has received an acknowledgment. The receiving side can then discard that nonce and any others from that sender that are smaller, but it must continue to hold a nonce for the most recently-received data segment. This technique reduces the magnitude of the problem, but it leaves a dawning realization that it may never be possible to discard the last nonce, which threatens to become a genuine tombstone, one per sender. Two pragmatic responses to the tombstone problem are:

1. Move the problem somewhere else. For example, change the port number on which the protocol accepts new requests. The protocol should never reuse the old port number (the old port number becomes the tombstone), but if the port number space is large then it doesn’t matter.

2. Accept the possibility of making a mistake, but make its probability vanishingly small. If the sending side of the transport protocol always gives up and stops resending requests after, say, five retries, then the receiving side can safely discard nonces that are older than five network round-trip times plus some allowance for unusually large delays. This approach requires keeping track of the age of each nonce in the table, and it has some chance of failing if a packet that the network delayed a long time finally shows up. A simple defense against this form of failure is to wait a long time before discarding a tombstone.

Another form of the same problem concerns what to do when the computer at the receiving side crashes and restarts, losing its volatile memory. If the receiving side stores the list of previously handled nonces in volatile memory, following a crash it will not be able to recognize duplicates of packets that it handled before the crash. But if it stores that list in a non-volatile storage device such as a hard disk, it will have to do one write to that storage device for every message received. Writes to non-volatile media tend to be slow, so this approach may introduce a significant performance loss. To solve the problem without giving up performance, techniques parallel to the last two above are typically employed. For example, one can use a new port number each time the system restarts.
This technique requires remembering which port number was last used, but that number can be stored on a disk without hurting performance because it changes only once per restart. Or, if we know that the sending side of the transport protocol always gives up after some number of retries, whenever the receiving side restarts, it can simply ignore all packets until that number of round-trip times has passed since restarting. Either procedure may force the sending side to report delivery failure to its application, but that may be better than taking the risk of accepting duplicate data.

When techniques for at-least-once delivery (the persistent sender) and at-most-once delivery (duplicate detection) are combined, they produce an assurance that is called exactly-once delivery. This assurance is the one that would probably be wanted in an implementation of the Remote Procedure Call protocol of Chapter 4. Despite its name, and even if the sender is prepared to be infinitely persistent, exactly-once delivery is not a guarantee that the message will eventually be delivered. Instead, it ensures that if the message is delivered, it will be delivered only once, and if delivery fails, the sender will learn, by lack of acknowledgment despite repeated requests, that delivery probably failed. However, even if no acknowledgment returns, there is still a possibility that the message was delivered. Section 9.6.2[on-line] introduces a protocol known as two-phase commit that can reduce the uncertainty by adding a persistent sender of the acknowledgment. Unfortunately, there is no way to completely eliminate the uncertainty.
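A Python sketch of the receiving-side bookkeeping for duplicate suppression follows, assuming sequence numbers are used as nonces and applying the pruning technique described above (each incoming segment carries the highest sequence number its sender has seen acknowledged). The class and method names are invented for the example.

# Sketch of duplicate suppression at the receiving side of a transport protocol.
class DuplicateSuppressor:
    def __init__(self):
        self.seen = {}        # sender -> set of nonces still being remembered

    def accept(self, sender, nonce, acked_up_to):
        nonces = self.seen.setdefault(sender, set())
        nonces = {n for n in nonces if n > acked_up_to}   # prune acknowledged nonces
        self.seen[sender] = nonces
        if nonce <= acked_up_to or nonce in nonces:
            return False      # duplicate: discard it, but resend the acknowledgment
        nonces.add(nonce)
        return True           # new segment: deliver it to the application

receiver = DuplicateSuppressor()
assert receiver.accept("client-1", nonce=12, acked_up_to=11) is True
assert receiver.accept("client-1", nonce=12, acked_up_to=11) is False   # duplicate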

7.5.4 Division into Segments and Reassembly of Long Messages

Recall that the requirements of the application determine the length of a message, but the network sets a maximum transmission unit, arising from limits on the length of a frame at the link layer. One of the jobs of the end-to-end transport protocol is to bridge this difference. Division of messages that are too long to fit in a single packet is relatively straightforward. Each resulting data segment must contain, in its end-to-end header, an identifier to show to which message this segment belongs and a segment number indicating where in the message the segment fits (e.g., “message 914, segment 3 of 7”). The message identifier and segment number together can also serve as the nonce used to ensure at-least-once and at-most-once delivery.

Reassembly is slightly more complicated because segments of the same message may arrive at the receiving side in any order, and may be mingled with segments from other messages. The reassembly process typically consists of allocating a buffer large enough to hold the entire message, placing the segments in the proper position within that buffer as they arrive, and keeping a checklist of which segments have not yet arrived. Once the message has been completely reassembled, the receiving side of the transport protocol can deliver the message to the application and discard the checklist. Message division and reassembly is a special case of stream division and reassembly, the topic of Section 7.5.7, below.

7.5.5 Assurance of Data Integrity

Data integrity is the assurance that when a message is delivered, its contents are the same as when they left the sender. Adding data integrity to a protocol with a persistent sender
creates a reliable delivery protocol. Two additions are required, one at the sending side and one at the receiving side. The sending side of the protocol adds a field to the end-to-end header or trailer containing a checksum of the contents of the application message. The receiving side recalculates the checksum from the received version of the reassembled message and compares it with the checksum that came with the message. Only if the two checksums match does the transport protocol deliver the reassembled message to the application and send an acknowledgment. If the checksums do not match the receiver discards the message and waits for the sending side to resend it. (One might suggest immediately sending a NAK, to alert the sending side to resend the data identified with that nonce, rather than waiting for timers to expire. This idea has the hazard that the source address that accompanies the data may have been corrupted along with the data. For this reason, sending a NAK on a checksum error isn’t usually done in end-to-end protocols. However, as was described in Section 7.3.3, requesting retransmission as soon as an error is detected is useful at the link layer, where the other end of a point-to-point link is the only possible source.)

It might seem redundant for the transport protocol to provide a checksum, given that link layer protocols often also provide checksums. The reason the transport protocol might do so is an end-to-end argument: the link layer checksums protect the data only while it is in transit on the link. During the time the data is in the memory of a forwarding node, while being divided into multiple segments, being reassembled at the receiving end, or while being copied to the destination application buffer, it is still vulnerable to undetected accidents. An end-to-end transport checksum can help defend against those threats. On the other hand, reapplying the end-to-end argument suggests that an even better place for this checksum would be in the application program. But in the real world, many applications assume that a transport-protocol checksum covers enough of the threats to integrity that they don’t bother to apply their own checksum. Transport protocol checksums cater to this assumption.

As with all checksums, the assurance is not absolute. Its quality depends on the number of bits in the checksum, the structure of the checksum algorithm, and the nature of the likely errors. In addition, there remains a threat that someone has maliciously modified both the data and its checksum to match while en route; this threat is explored briefly in Section 7.5.9, below, and in more depth in Chapter 11[on-line].

A related integrity concern is that a packet might be misdelivered, perhaps because its address field has been corrupted. Worse, the unintended recipient may even acknowledge receipt of the segment in the packet, leading the sender to believe that it was correctly delivered. The transport protocol can guard against this possibility by, on the sending side, including a copy of the destination address in the end-to-end segment header, and, on the receiving side, verifying that the address is the recipient’s own before delivering the packet to the application and sending an acknowledgment back.
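A Python sketch of the two additions follows. For brevity it borrows a library CRC as its checksum; that is only a stand-in for whatever checksum the protocol designer actually chooses, and the field names are invented for the example.

# Sketch of an end-to-end integrity check: the sender attaches a checksum and the
# receiver verifies it before delivering the message and acknowledging.
import zlib

def attach_checksum(message):
    return {"data": message, "checksum": zlib.crc32(message)}

def verify_and_deliver(segment):
    if zlib.crc32(segment["data"]) != segment["checksum"]:
        return None          # discard; wait for the persistent sender to resend
    return segment["data"]   # deliver to the application and send an acknowledgment

segment = attach_checksum(b"hello")
assert verify_and_deliver(segment) == b"hello"
segment["data"] = b"hellp"   # simulate corruption while in a forwarder's memory
assert verify_and_deliver(segment) is None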

7.5.6 End-to-End Performance: Overlapping and Flow Control

End-to-end transport of a multisegment message raises some questions of strategy for the transport protocol, including an interesting trade-off between complexity and performance. The simplest method of sending a multisegment message is to send one segment, wait for the receiving side to acknowledge that segment, then send the second segment, and so on. This protocol, known as lock-step, is illustrated in Figure 7.39. An important virtue of the lock-step protocol is that it is easy to see how to apply each of the previous end-to-end assurance techniques to one segment at a time. The downside is that transmitting a message that occupies N segments will take N network round-trip times. If the network transit time is large, both ends may spend most of their time waiting.

7.5.6.1 Overlapping Transmissions

To avoid the wait times, we can employ a pipelining technique related to the pipelining described in Section 6.1.5: As soon as the first segment has been sent, immediately send the second one, then the third one, and so on, without waiting for acknowledgments. This technique allows both close spacing of transmissions and overlapping of transmissions with their corresponding acknowledgments. If nothing goes wrong, the technique leads to a timing diagram such as that of Figure 7.40. When the pipeline is completely filled, there may be several segments “in the net” traveling in both directions down transmission lines or sitting in the buffers of intermediate packet forwarders.

[Figure: timing diagram in which the sender transmits segment 1, waits for acknowledgment 1, then transmits segment 2, and so on, repeating N times until the receiver accepts segment N and the sender is done.]

FIGURE 7.39 Lock-step transmission of multiple segments.


This diagram shows a small time interval between the sending of segment 1 and the sending of segment 2. This interval accounts for the time to generate and transmit the next segment. It also shows a small time interval at the receiving side that accounts for the time required for the recipient to accept the segment and prepare the acknowledgment. Depending on the details of the protocol, it may also include the time the receiver spends acting on the segment (see Sidebar 7.7). With this approach, the total time to send N segments has dropped to N packet transmission times plus one round-trip time for the last segment and its acknowledgment—if nothing goes wrong.

Unfortunately, several things can go wrong, and taking care of them can add quite a bit of complexity to the picture. First, one or more packets or acknowledgments may be lost along the way. The first step in coping with this problem is for the sender to maintain a list of segments sent. As each acknowledgment comes back, the sender checks that segment off its list. Then, after sending the last segment, the sender sets a timer to expire a little more than one network round-trip time in the future. If, upon receiving an acknowledgment, the list of missing acknowledgments becomes empty, the sender can turn off the timer, confident that the entire message has been delivered. If, on the other hand, the timer expires and there is still a list of unacknowledged segments, the sender resends each one in the list, starts another timer, and continues checking off acknowledgments. The sender repeats this sequence until either every segment is acknowledged or the sender exceeds its retry limit, in which case it reports a failure to the application that initiated this message. Each timer expiration at the sending side adds one more round-trip time of delay in completing the transmission, but if packets get through at all, the process should eventually converge.
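
A minimal sketch of that bookkeeping appears below. It assumes hypothetical `send_segment` and `wait_for_acks` primitives and an invented retry limit; a real protocol would interleave this loop with window management and its other duties.

```python
def send_message(segments, send_segment, wait_for_acks,
                 round_trip_estimate, retry_limit=5):
    """Send all segments, resending any that remain unacknowledged.

    `send_segment(i, data)` transmits one segment; `wait_for_acks(timeout)`
    returns the set of segment numbers acknowledged before the timer
    expires. Both are assumed primitives, not part of the book's interface.
    """
    unacknowledged = {i: data for i, data in enumerate(segments)}
    for i, data in unacknowledged.items():
        send_segment(i, data)                 # overlapped transmission

    for _ in range(retry_limit):
        # Timer is set to a little more than one round-trip time.
        acked = wait_for_acks(timeout=1.1 * round_trip_estimate)
        for i in acked:
            unacknowledged.pop(i, None)       # check each one off the list
        if not unacknowledged:
            return True                       # entire message delivered
        for i, data in unacknowledged.items():
            send_segment(i, data)             # resend everything still missing
    return False                              # report failure to the application
```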

[Figure: timing diagram in which the sender transmits segments 1, 2, 3, … without waiting, and the corresponding acknowledgments return while later segments are still being sent, repeating N times.]

FIGURE 7.40 Overlapped transmission of multiple segments.


Sidebar 7.7: What does an acknowledgment really mean? An end-to-end acknowledgment is a widely used technique for the receiving side to tell the sending side something of importance, but since there are usually several different things going on in the end-to-end layer, there can also be several different purposes for acknowledgments. Some possibilities include

• it is OK to stop the timer associated with the acknowledged data segment
• it is OK to release the buffer holding a copy of the acknowledged segment
• it is OK to send another segment
• the acknowledged segment has been accepted for consideration
• the work requested in the acknowledged segment has been completed.

In some protocols, a single acknowledgment serves several of those purposes, while in other protocols a different form of acknowledgment may be used for each one; there are endless combinations. As a result, whenever the word acknowledgment is used in the discussion of a protocol, it is a good idea to establish exactly what the acknowledgment really means. This understanding is especially important if one is trying to estimate round-trip times by measuring the time for an acknowledgment to return; in some protocols such a measurement would include time spent doing processing in the receiving application, while in other cases it would not. If there really are five different kinds of acknowledgments, there is a concern that for every outgoing packet there might be five different packets returning with acknowledgments. In practice this is rarely the case because acknowledgments can be implemented as data items in the end-to-end header of any packet that happens to be going in the reverse direction. A single packet may thus carry any number of different kinds of acknowledgments and acknowledgments for a range of received packets, in addition to application data that may be flowing in the reverse direction. The technique of placing one or more acknowledgments in the header of the next packet that happens to be going in the reverse direction is known as piggybacking.

7.5.6.2 Bottlenecks, Flow Control, and Fixed Windows

A second set of issues has to do with the relative speeds of the sender in generating segments, the entry point to the network in accepting them, any bottleneck inside the network in transmitting them, and the receiver in consuming them. The timing diagram and analysis above assumed that the bottleneck was at the sending side, either in the rate at which the sender generates segments or the rate at which the first network link can transmit them. A more interesting case is when the sender generates data, and the network transmits it, faster than the receiver can accept it, perhaps because the receiver has a slow processor and eventually runs out of buffer space to hold not-yet-processed data. When this is a possibility, the transport protocol needs to include some method of controlling the rate at which the sender generates data. This mechanism is called flow control.


The basic concept involved is that the sender starts by asking the receiver how much data the receiver can handle. The response from the receiver, which may be measured in bits, bytes, or segments, is known as a window. The sender asks permission to send, and the receiver responds by quoting a window size, as illustrated in Figure 7.41. The sender then sends that much data and waits until it receives permission to send more. Any intermediate acknowledgments from the receiver allow the sender to stop the associated timer and release the send buffer, but they cannot be used as permission to send more data because the receiver is only acknowledging data arrival, not data consumption. Once the receiver has actually consumed the data in its buffers, it sends permission for another window’s worth of data. One complication is that the implementation must guard against both missing permission messages that could leave the sender with a zero-sized window and also duplicated permission messages that could increase the window size more than the receiver intends: messages carrying window-granting permission require exactly-once delivery.

The window provided by the scheme of Figure 7.41 is called a fixed window. The lock-step protocol described earlier is a flow control scheme with a window that is one data segment in size. With any window scheme, one network round-trip time elapses between the receiver’s sending of a window-opening message and the arrival of the first data that takes advantage of the new window. Unless we are careful, this time will be pure delay experienced by both parties.
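
To make the request/permission cycle concrete, here is a minimal sketch of the sender's side of a fixed-window exchange. The `ask_permission`, `send_segment`, and `wait_for_window` primitives are assumptions standing in for the permission messages of Figure 7.41, not part of the book's interface.

```python
def fixed_window_sender(segments, ask_permission, send_segment, wait_for_window):
    """Send segments under a fixed flow-control window (a sketch).

    `ask_permission()` and `wait_for_window()` both return the number of
    segments the receiver is currently prepared to accept.
    """
    window = ask_permission()            # "may I send?" -> "yes, 4 segments"
    next_to_send = 0
    while next_to_send < len(segments):
        # Send exactly one window's worth of data, then stop.
        for _ in range(min(window, len(segments) - next_to_send)):
            send_segment(next_to_send, segments[next_to_send])
            next_to_send += 1
        if next_to_send < len(segments):
            # Acknowledgments release send buffers but do not open the window;
            # only an explicit permission message lets the sender continue.
            window = wait_for_window()
```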

[Figure: timing diagram in which the sender asks “may I send?”, the receiver opens a 4-segment window, the sender transmits segments 1–4 and collects their acknowledgments, and after the receiver finishes processing segments 1–4 it reopens the window so that segments 5 and 6 can be sent.]

FIGURE 7.41 Flow control with a fixed window.


A clever receiver could anticipate this delay, and send the window-opening message one round-trip time before it expects to be ready for more data. This form of prediction is still using a fixed window, but it keeps data flowing more smoothly. Unfortunately, it requires knowing the network round-trip time which, as the discussion of timers explained, is a hard thing to estimate. Exercises 7.13, on page 7–114, and 7.16, on page 7–115, explore the bang-bang protocol and pacing, two more variants on the fixed window idea.

7.5.6.3 Sliding Windows and Self-Pacing

An even more clever scheme is the following: as soon as it has freed up a segment buffer, the receiver could immediately send permission for a window that is one segment larger (either by sending a separate message or, if there happens to be an ACK ready to go, piggy-backing on that ACK). The sender keeps track of how much window space is left, and increases that number whenever additional permission arrives. When a window can have space added to it on the fly it is called a sliding window. The advantage of a sliding window is that it can automatically keep the pipeline filled, without need to guess when it is safe to send permission-granting messages.

The sliding window appears to eliminate the need to know the network round-trip time, but this appearance is an illusion. The real challenge in flow control design is to develop a single flow control algorithm that works well under all conditions, whether the bottleneck is the sender’s rate of generating data, the network transmission capacity, or the rate at which the receiver can accept data. When the receiver is the bottleneck, the goal is to ensure that the receiver never waits. Similarly, when the sender is the bottleneck, the goal is to ensure that the sender never waits. When the network is the bottleneck, the goal is to keep the network moving data at its maximum rate. The question is what window size will achieve these goals.

The answer, no matter where the bottleneck is located, is determined by the bottleneck data rate and the round-trip time of the network. If we multiply these two quantities, the product tells us the amount of buffering, and thus the minimum window size, needed to ensure a continuous flow of data. That is,

    window size ≥ round-trip time × bottleneck data rate

To see why, imagine for a moment that we are operating with a sliding window one segment in size. As we saw before, this window size creates a lock-step protocol with one segment delivered each round-trip time, so the realized data rate will be the window size divided by the round-trip time. Now imagine operating with a window of two segments. The network will then deliver two segments each round-trip time. The realized data rate is still the window size divided by the round-trip time, but the window size is twice as large. Now, continue to try larger window sizes until the realized data rate just equals the bottleneck data rate. At that point the window size divided by the round-trip time still tells us the realized data rate, so we have equality in the formula above. Any window size less than this will produce a realized data rate less than the bottleneck.


The window size can be larger than this minimum, but since the realized data rate cannot exceed the bottleneck, there is no advantage. There is actually a disadvantage to a larger window size: if something goes wrong that requires draining the pipeline, it will take longer to do so. Further, a larger window puts a larger load on the network, and thereby contributes to congestion and discarded packets in the network routers.

The most interesting feature of a sliding window whose size satisfies the inequality is that, although the sender does not know the bottleneck data rate, it is sending at exactly that rate. Once the sender fills a sliding window, it cannot send the next data element until the acknowledgment of the oldest data element in the window returns. At the same time, the receiver cannot generate acknowledgments any faster than the network can deliver data elements. Because of these two considerations, the rate at which the window slides adjusts itself automatically to be equal to the bottleneck data rate, a property known as self-pacing. Self-pacing provides the needed mechanism to adjust the sender’s data rate to exactly equal the data rate that the connection can sustain.

Let us consider what the window-size formula means in practice. Suppose a client computer in Boston that can absorb data at 500 kilobytes per second wants to download a file from a service in San Francisco that can send at a rate of 1 megabyte per second, and the network is not a bottleneck. The round-trip time for the Internet over this distance is about 70 milliseconds,* so the minimum window size would be

    70 milliseconds × 500 kilobytes/second = 35 kilobytes

and if each segment carries 512 bytes, there could be as many as 70 such segments enroute at once. If, instead, the two computers were in the same building, with a 1 millisecond round-trip time separating them, the minimum window size would be 500 bytes. Over this short distance a lock-step protocol would work equally well.
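
The arithmetic of the two examples above is easy to reproduce. The small calculation below is only a sketch that evaluates the window-size inequality for both round-trip times, with the 500 kilobytes/second receiver rate taken from the text.

```python
def minimum_window(round_trip_seconds, bottleneck_bytes_per_second):
    # window size >= round-trip time x bottleneck data rate
    return round_trip_seconds * bottleneck_bytes_per_second

# Boston to San Francisco: 70 ms round trip, receiver limited to 500 kB/s.
print(minimum_window(0.070, 500_000))   # 35000.0 bytes, about 70 512-byte segments

# Same building: 1 ms round trip, same 500 kB/s bottleneck.
print(minimum_window(0.001, 500_000))   # 500.0 bytes; lock-step would do
```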

* Measurements of round-trip time from Boston to San Francisco over the Internet in 2005 typically show a minimum of about 70 milliseconds. A typical route might take a packet via New York, Cleveland, Indianapolis, Kansas City, Denver, and Sacramento, a distance of 11,400 kilometers, and through 15 packet forwarders in each direction. The propagation delay over that distance, assuming a velocity of propagation in optical fiber of 66% of the speed of light, would be about 57 milliseconds. Thus the 30 packet forwarders apparently introduce about another 13 milliseconds of processing and transmission delay, roughly 430 microseconds per forwarder.


So, despite the effort to choose the appropriate window size, we still need an estimate of the round-trip time of the network, with all the hazards of making an accurate estimate. The protocol may be able to use the same round-trip time estimate that it used in setting its timers, but there is a catch. To keep from unnecessarily retransmitting packets that are just delayed in transit, an estimate that is used in timer setting should err by being too large. But if a too-large round-trip time estimate is used in window setting, the resulting excessive window size will simply increase the length of packet forwarding queues within the network; those longer queues will increase the transit time, in turn leading the sender to think it needs a still larger window. To avoid this positive feedback, a round-trip time estimator that is to be used for window size adjustment needs to err on the side of being too small, and be designed not to react too quickly to an apparent increase in round-trip time—exactly the opposite of the desiderata for an estimate used for setting timers.

Once the window size has been established, there is still a question of how big to make the buffer at the receiving side of the transport protocol. The simplest way to ensure that there is always space available for arriving data is to allocate a buffer that is at least as large as the window size.

7.5.6.4 Recovery of Lost Data Segments with Windows

While the sliding window may have addressed the performance problem, it has complicated the problem of recovering lost data segments. The sender can still maintain a checklist of expected acknowledgments, but the question is when to take action on this list. One strategy is to associate with each data segment in the list a timestamp indicating when that segment was sent. When the clock indicates that more than one round-trip time has passed, it is time for a resend. Or, assuming that the sender is numbering the segments for reassembly, the receiver might send a NAK when it notices that several segments with higher numbers have arrived. Either approach raises a question of how resent segments should count against the available window. There are two cases: either the original segment never made it to the receiver, or the receiver got it but the acknowledgment was lost. In the first case, the sender has already counted the lost segment, so there is no reason to count its replacement again. In the second case, presumably the receiver will immediately discard the duplicate segment. Since it will not occupy the recipient’s attention or buffers for long, there is no need to include it in the window accounting. So in both cases the answer is the same: do not count a resent segment against the available window. (This conclusion is fortunate because the sender can’t tell the difference between the two cases.)

We should also consider what might go wrong if a window-increase permission message is lost. The receiver will eventually notice that no data is forthcoming, and may suspect the loss. But simply resending permission to send more data carries the risk that the original permission message has simply been delayed and may still be delivered, in which case the sender may conclude that it can send twice as much data as the receiver intended. For this reason, sending a window-increasing message as an incremental value is fragile. Even resending the current permitted window size can lead to confusion if window-opening messages happen to be delivered out of order. A more robust approach is for the receiver to always send the cumulative total of all permissions granted since transmission of this message or stream began. (A cumulative total may grow large, but a field size of 64 bits can handle window sizes of about 10^19 transmission units, which probably is sufficient for most applications.) This approach makes it easy to discover and ignore an out-of-order total because a cumulative total should never decrease. Sending a cumulative total also simplifies the sender’s algorithm, which now merely maintains the cumulative total of all permissions it has used since the transmission began. The difference between the total used so far and the largest received total of permissions granted is a self-correcting, robust measure of the current window size.
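
A minimal sketch of this accounting, using invented variable names, shows why cumulative totals are robust against lost, duplicated, and reordered permission messages:

```python
class CumulativeWindow:
    """Sender-side window accounting based on cumulative permission totals.

    `granted_total` is the largest cumulative total of permissions ever
    received; `used_total` is the cumulative total of data units already
    sent. Both names are illustrative, not the book's.
    """
    def __init__(self):
        self.granted_total = 0
        self.used_total = 0

    def receive_permission(self, cumulative_total):
        # Duplicates and out-of-order copies carry a smaller or equal total,
        # so taking the maximum quietly ignores them.
        self.granted_total = max(self.granted_total, cumulative_total)

    def window_remaining(self):
        # Self-correcting measure of how much the sender may still transmit.
        return self.granted_total - self.used_total

    def record_send(self, units):
        self.used_total += units
```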


This model is familiar. A sliding window is an example of the producer–consumer problem described in Chapter 5, and the cumulative total window sizes granted and used are examples of eventcounts. Sending of a message that contains the cumulative permission count can be repeated any number of times without affecting the correctness of the result. Thus a persistent sender (in this case the receiver of the data is the persistent sender of the permission message) is sufficient to ensure exactly-once delivery of a permission increase. With this design, the sender’s permission receiver is an example of an idempotent service interface, as suggested in the last paragraph of Section 7.1.4.

There is yet one more rate-matching problem: the blizzard of packets arising from a newly-opened flow control window may encounter or even aggravate congestion somewhere within the network, resulting in packets being dropped. Avoiding this situation requires some cooperation between the end-to-end protocol and the network forwarders, so we defer its discussion to Section 7.6 of this chapter.

7.5.7 Assurance of Stream Order, and Closing of Connections

A stream transport protocol transports a related series of elements, which may be bits, bytes, segments, or messages, from one point to another with the assurance that they will be delivered to the recipient in the order in which the sender dispatched them. A stream protocol usually—but not always—provides additional assurances, such as no missing elements, no duplicate elements, and data integrity. Because a telephone circuit has some of these same properties, a stream protocol is sometimes said to create a virtual circuit.

The simple-minded way to deliver things in order is to use the lock-step transmission protocol described in Section 7.5.3, in which the sending side does not send the next element until the receiving side acknowledges that the previous one has arrived safely. But applications often choose stream protocols to send large quantities of data, and the round-trip delays associated with a lock-step transmission protocol are enough of a problem that stream protocols nearly always employ some form of overlapped transmission. When overlapped transmission is added, the several elements that are simultaneously enroute can arrive at the receiving side out of order. Two quite different events can lead to elements arriving out of order: different packets may follow different paths that have different transit times, or a packet may be discarded if it traverses a congested part of the network or is damaged by noise. A discarded packet will have to be retransmitted, so its replacement will almost certainly arrive much later than its adjacent companions.

The transport protocol can ensure that the data elements are delivered in the proper order by adding to the transport-layer header a serial number that indicates the position in the stream where the element or elements in the current data segment belong. At the receiving side, the protocol delivers elements to the application and sends acknowledgments back to the sender as long as they arrive in order. When elements arrive out of order, the protocol can follow one of two strategies:


1. Acknowledge only when the element that arrives is the next element expected or a duplicate of a previously received element. Discard any others. This strategy is simple, but it forces a capacity-wasting retransmission of elements that arrive before their predecessors.

2. Acknowledge every element as it arrives, and hold in buffers any elements that arrive before their predecessors. When the predecessors finally arrive, the protocol can then deliver the elements to the application in order and release the buffers. This technique is more efficient in its use of network resources, but it requires some care to avoid using up a large number of buffers while waiting for an earlier element that was in a packet that was discarded or damaged.

The two strategies can be combined by acknowledging an early-arriving element only if there is a buffer available to hold it, and discarding any others. This approach raises the question of how much buffer space to allocate. One simple answer is to provide at least enough buffer space to hold all of the elements that would be expected to arrive during the time it takes to sort out an out-of-order condition. This question is closely related to the one explored earlier of how many buffers to provide to go with a given size of sliding window. A requirement of delivery in order is one of the reasons why it is useful to make a clear distinction between acknowledging receipt of data and opening a window that allows the sending of more data.

It may be possible to speed up the resending of lost packets by taking advantage of the additional information implied by arrival of numbered stream elements. If stream elements have been arriving quite regularly, but one element of the stream is missing, rather than waiting for the sender to time out and resend, the receiver can send an explicit negative acknowledgment (NAK) for the missing element. If the usual reason for an element to appear to be missing is that it has been lost, sending NAKs can produce a useful performance enhancement. On the other hand, if the usual reason is that the missing element has merely suffered a bit of extra delay along the way, then sending NAKs may lead to unnecessary retransmissions, which waste network capacity and can degrade performance. The decision whether or not to use this technique depends on the specific current conditions of the network. One might try to devise an algorithm that figures out what is going on (e.g., if NAKs are causing duplicates, stop sending NAKs) but it may not be worth the added complexity.
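
A minimal sketch of the second strategy (buffer early arrivals, deliver in serial-number order) might look like the following; the `deliver` callback and the bookkeeping names are assumptions for illustration, and a real receiver would also bound the number of buffered early arrivals.

```python
class InOrderDelivery:
    """Receiver-side reassembly of a numbered element stream (a sketch)."""

    def __init__(self, deliver):
        self.deliver = deliver          # application callback (assumed)
        self.next_expected = 0          # serial number that can be delivered next
        self.early_arrivals = {}        # serial number -> element held in a buffer

    def element_arrived(self, serial, element):
        if serial < self.next_expected:
            return "duplicate"          # already delivered; acknowledge anyway
        self.early_arrivals[serial] = element
        # Deliver as long as the next expected element is on hand.
        while self.next_expected in self.early_arrivals:
            self.deliver(self.early_arrivals.pop(self.next_expected))
            self.next_expected += 1
        return "acknowledged"
```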


As the interface described in Section 7.5.1 above suggests, using a stream transport protocol involves a call to open the stream, a series of calls to write to or read from the stream, and a call to close the stream. Opening a stream involves creating a record at each end of the connection. This record keeps track of which elements have been sent, which have been received, and which have been acknowledged. Closing a stream involves two additional considerations. First and simplest, after the receiving side of the transport protocol delivers the last element of the stream to the receiving application, it then needs to report an end-of-stream indication to that application. Second, both ends of the connection need to agree that the network has delivered the last element and the stream should be closed. This agreement requires some care to reach.

A simple protocol that ensures agreement is the following: Suppose that Alice has opened a stream to Bob, and has now decided that the stream is no longer needed. She begins persistently sending a close request to Bob, specifying the stream identifier. Bob, upon receiving a close request, checks to see if he agrees that the stream is no longer needed. If he does agree, he begins persistently sending a close acknowledgment, again specifying the stream identifier. Alice, upon receiving the close acknowledgment, can turn off her persistent sender and discard her record of the stream, confident that Bob has received all elements of the stream and will not be making any requests for retransmissions. In addition, she sends Bob a single “all done” message, containing the stream identifier. If she receives a duplicate of the close acknowledgment, her record of the stream will already be discarded, but it doesn’t matter; she can assume that this is a duplicate close acknowledgment from some previously closed stream and, from the information in the close acknowledgment, she can fabricate an “all done” message and send it to Bob. When Bob receives the “all done” message he can turn off his persistent sender and, confident that Alice agrees that there is no further use for the stream, discard his copy of the record of the stream. Alice and Bob can in the future safely discard any late duplicates that mention a stream for which they have no record. (The tombstone problem still exists for the stream itself. It would be a good idea for Bob to delay deletion of his record until there is no chance that a long-delayed duplicate of Alice’s original request to open the stream will arrive.)
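
The message sequence of that closing handshake can be written down as a pair of event handlers. The sketch below keeps only the three message types named in the text; the `persistent_send`, `stop_sending`, and `send_once` primitives and the `records` table are assumptions, not the book's interface.

```python
# Alice's side: she initiated the close.
def alice_handle(event, stream_id, persistent_send, stop_sending, send_once, records):
    if event == "want_to_close":
        persistent_send("close_request", stream_id)      # repeat until answered
    elif event == "close_acknowledgment":
        stop_sending("close_request", stream_id)
        records.pop(stream_id, None)                      # safe to forget the stream
        send_once("all_done", stream_id)                  # even for a duplicate ACK

# Bob's side: he answers the close.
def bob_handle(event, stream_id, persistent_send, stop_sending, records):
    if event == "close_request" and stream_id in records:
        persistent_send("close_acknowledgment", stream_id)
    elif event == "all_done":
        stop_sending("close_acknowledgment", stream_id)
        # A real implementation would delay this deletion long enough to
        # avoid the tombstone problem mentioned in the text.
        records.pop(stream_id, None)
```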

7.5.8 Assurance of Jitter Control

Some applications, such as delivering sound or video to a person listening or watching on the spot, are known as real-time. For real-time applications, reliability, in the sense of never delivering an incorrect bit of data, is often less important than timely delivery. High reliability can actually be counter-productive if the transport protocol achieves it by requesting retransmission of a damaged data element, and then holds up delivery of the remainder of the stream until the corrected data arrives. What the application wants is continuous delivery of data, even if the data is not completely perfect. For example, if a few bits are wrong in one frame of a movie (note that this video use of the term “frame” has a meaning similar but not identical to the “frame” used in data communications), it probably won’t be noticed. In fact, if one video frame is completely lost in transit, the application program can probably get away with repeating the previous video frame while waiting for the following one to be delivered. The most important assurance that an end-to-end stream protocol can provide to a real-time application is that delivery of successive data elements be on a regular schedule. For example, a standard North American television set consumes one video frame every 33.37 milliseconds and the next video frame must be presented on that schedule.

Transmission across a forwarding network can produce varying transit times from one data segment to the next. In real-time applications, this variability in delivery time is known as jitter, and the requirement is to control the amount of jitter. The basic strategy is for the receiving side of the transport protocol to delay all arriving segments to make it look as though they had encountered the worst allowable amount of delay.


One can in principle estimate an appropriate amount of extra buffering for the delayed segments as follows (assume for the television example that there is one video frame in each segment):

1. Measure the distribution of segment delivery delays between sending and receiving points and plot that distribution in a chart showing delay time versus frequency of that delay.

2. Choose an acceptable frequency of delivery failure. For a television application one might decide that 1 out of 100 video frames won’t be missed.

3. From the distribution, determine a delay time large enough to ensure that 99 out of 100 segments will be delivered in less than that delay time. Call this delay Dlong.

4. From the distribution determine the shortest delay time that is observed in practice. Call this value Dshort.

5. Now, provide enough buffering to delay every arriving segment so that it appears to have arrived with delay Dlong. The largest number of segments that would need to be buffered is

    Number of segment buffers = (Dlong – Dshort) / Dheadway

where Dheadway is the average time between arriving segments. With this much buffering, we would expect that about one out of every 100 segments will arrive too late; when that occurs, the transport protocol simply reports “missing data” to the application and discards that segment if it finally does arrive. In practice, there is no easy way to measure one-way segment delivery delay, so a common strategy is simply to set the buffer size by trial and error.

Although the goal of this technique is to keep the rate of missing video frames below the level of human perceptibility, you can sometimes see the technique fail when watching a television program that has been transmitted by satellite or via the Internet. Occasionally there may be a freeze-frame that persists long enough that you can see it, but that doesn’t seem to be one that the director intended. This event probably indicates that the transmission path was disrupted for a longer time than the available buffers were prepared to handle.
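
As a concrete check of the buffer-sizing arithmetic above, the sketch below evaluates the formula for the television example. The delay numbers are invented, since, as the text notes, such one-way measurements are hard to obtain in practice.

```python
def segment_buffers_needed(d_long, d_short, d_headway):
    # Number of segment buffers = (Dlong - Dshort) / Dheadway
    return (d_long - d_short) / d_headway

# Hypothetical measurements: 99th-percentile delay 250 ms, minimum delay 80 ms,
# one video frame arriving every 33.37 ms (the North American frame time).
print(segment_buffers_needed(0.250, 0.080, 0.03337))   # about 5.1 buffers
```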

7.5.9 Assurance of Authenticity and Privacy

Most of the assurance-providing techniques described above are intended to operate in a benign environment, in which the designer assumes that errors can occur but that the errors are not maliciously constructed to frustrate the intended assurances. In many real-world environments, the situation is worse than that: one must defend against the threat that someone hostile intercepts and maliciously modifies packets, or that some end-to-end layer participants violate a protocol with malicious intent. To counter these threats, the end-to-end layer can apply two kinds of key-based mathematical transformations to the data:


1. sign and verify, to establish the authenticity of the source and the integrity of the contents of a message, and

2. encrypt and decrypt, to maintain the privacy of the contents of a message.

These two techniques can, if applied properly, be effective, but they require great care in design and implementation. Without such care, they may not work, but because they were applied the user may believe that they do, and thus have a false sense of security. A false assurance can be worse than no assurance at all. The issues involved in providing security assurances are a whole subject in themselves, and they apply to many system components in addition to networks, so we defer them to Chapter 11[on-line], which provides an in-depth discussion of protecting information in computer systems.

With this examination of end-to-end topics, we have worked our way through the highest layer that we identify as part of the network. The next section of this chapter, on congestion control, is a step sideways, to explore a topic that requires cooperation of more than one layer.

7.6 A Network System Design Issue: Congestion Control

7.6.1 Managing Shared Resources

Chapters 5 and 6 discussed shared resources and their management: a thread manager creates many virtual processors from a few real, shared processors that must then be scheduled, and a multilevel memory manager creates the illusion of large, fast virtual memories for several clients by combining a small and fast shared memory with large and slow storage devices. In both cases we looked at relatively simple management mechanisms because more complex mechanisms aren’t usually needed. In the network context, the resource that is shared is a set of communication links and the supporting packet forwarders. The geographically and administratively distributed nature of those components and their users adds delay and complication to resource management, so we need to revisit the topic.

In Section 7.1.2 of this chapter we saw how queues manage the problem that packets may arrive at a packet switch at a time when the outgoing link is already busy transmitting another packet, and Figure 7.6 showed the way that queues grow with increased utilization of the link. This same phenomenon applies to processor scheduling and supermarket checkout lines: any time there is a shared resource, and the demand for that resource comes from several statistically independent sources, there will be fluctuations in the arrival of load, and thus in the length of the queue and the time spent waiting for service. Whenever the offered load (in the case of a packet switch, that is the rate at which packets arrive and need to be forwarded) is greater than the capacity (the rate at which the switch can forward packets) of a resource for some duration, the resource is overloaded for that time period.


When sources are statistically independent of one another, occasional overload is inevitable but its significance depends critically on how long it lasts. If the duration is comparable to the service time, which is the typical time for the resource to handle one customer (in a supermarket), one thread (in a processor manager), or one packet (in a packet forwarder), then a queue is simply an orderly way to delay some requests for service until a later time when the offered load drops below the capacity of the resource. Put another way, a queue handles short bursts of too much demand by time-averaging with adjacent periods when there is excess capacity.

If, on the other hand, overload persists for a time significantly longer than the service time, there begins to develop a risk that the system will fail to meet some specification such as maximum delay or acceptable response time. When this occurs, the resource is said to be congested. Congestion is not a precisely defined concept. The duration of overload that is required to classify a resource as congested is a matter of judgement, and different systems (and observers) will use different thresholds. Congestion may be temporary, in which case clever resource management schemes may be able to rescue the situation, or it may be chronic, meaning that the demand for service continually exceeds the capacity of the resource. If the congestion is chronic, the length of the queue will grow without bound until something breaks: the space allocated for the queue may be exceeded, the system may fail completely, or customers may go elsewhere in disgust.

The stability of the offered load is another factor in the frequency and duration of congestion. When the load on a resource is aggregated from a large number of statistically independent small sources, averaging can reduce the frequency and duration of load peaks. On the other hand, if the load comes from a small number of large sources, even if the sources are independent, the probability that they all demand service at about the same time can be high enough that congestion can be frequent or long-lasting.

A counter-intuitive concern of shared resource management is that competition for a resource sometimes leads to wasting of that resource. For example, in a grocery store, customers who are tired of waiting in the checkout line may just walk out of the store, leaving filled shopping carts behind. Someone has to put the goods from the abandoned carts back on the shelves. Suppose that one or two of the checkout clerks leave their registers to take care of the accumulating abandoned carts. The rate of sales being rung up drops while they are away from their registers, so the queues at the remaining registers grow longer, causing more people to abandon their carts, and more clerks will have to turn their attention to restocking. Eventually, the clerks will be doing nothing but restocking and the number of sales rung up will drop to zero. This regenerative overload phenomenon is called congestion collapse. Figure 7.42 plots the useful work getting done as the offered load increases, for three different cases of resource limitation and waste, including one that illustrates collapse. Congestion collapse is dangerous because it can be self-sustaining.
Once temporary congestion induces a collapse, even if the offered load drops back to a level that the resource could handle, the already-induced waste rate can continue to exceed the capacity of the resource, causing it to continue to waste the resource and thus remain congested indefinitely.


When developing or evaluating a resource management scheme, it is important to keep in mind that you can’t squeeze blood out of a turnip: if a resource is congested, either temporarily or chronically, delays in receiving service are inevitable. The best a management scheme can do is redistribute the total amount of delay among waiting customers. The primary goal of resource management is usually quite simple: to avoid congestion collapse. Occasionally other goals, such as enforcing a policy about who gets delayed, are suggested, but these goals are often hard to define and harder to achieve. (Doling out delays is a tricky business; overall satisfaction may be higher if a resource serves a few customers well and completely discourages the remainder, rather than leaving all equally disappointed.)

Chapter 6 suggested two general approaches to managing congestion. Either:

• increase the capacity of the resource, or
• reduce the offered load.

In both cases the goal is to move quickly to a state in which the load is less than the capacity of the resource.

[Figure: useful work done (vertical axis) versus offered load (horizontal axis), with curves for an unlimited resource, a limited resource with no waste, and congestion collapse.]

FIGURE 7.42 Offered load versus useful work done. The more work offered to an ideal unlimited resource, the more work gets done, as indicated by the 45-degree unlimited resource line. Real resources are limited, but in the case with no waste, useful work asymptotically approaches the capacity of the resource. On the other hand, if overloading the resource also wastes it, useful work can decline when offered load increases, as shown by the congestion collapse line.


When measures are taken to reduce offered load, it is useful to separately identify the intended load, which would have been offered in the absence of control. Of course, in reducing offered load, the amount by which it is reduced doesn’t really go away, it is just deferred to a later time. Reducing offered load acts by averaging periods of overload with periods of excess capacity, just like queuing, but with involvement of the source of the load, and typically over a longer period of time.

To increase capacity or to reduce offered load it is necessary to provide feedback to one or more control points. A control point is an entity that determines, in the first case, the amount of resource that is available and, in the second, the load being offered. A congestion control system is thus a feedback system, and delay in the feedback path can lead to oscillations in load and in useful work done.

For example, in a supermarket, a common strategy is for the store manager to watch the queues at the checkout lines; whenever there are more than two or three customers in any line the manager calls for staff elsewhere in the store to drop what they are doing and temporarily take stations as checkout clerks, thereby increasing capacity. In contrast, when you call a customer support telephone line you may hear an automatic response message that says something such as, “Your call is important to us. It will be approximately 21 minutes till we are able to answer it.” That message will probably lead some callers to hang up and try again at a different time, thereby decreasing (actually deferring) the offered load. In both the supermarket and the telephone customer service system, it is easy to create oscillations. By the time the fourth supermarket clerk stops stacking dog biscuits and gets to the front of the store, the lines may have vanished, and if too many callers decide to hang up, the customer service representatives may find there is no one left to talk to. In the commercial world, the choice between these strategies is a complex trade-off involving economics, physical limitations, reputation, and customer satisfaction. The same thing is true inside a computer system or network.

7.6.2 Resource Management in Networks

In a computer network, the shared resources are the communication links and the processing and buffering capacity of the packet forwarders. There are several things that make this resource management problem more difficult than, say, scheduling a processor among competing threads.

1. There is more than one resource. Even a small number of resources can be used up in an alarmingly large number of different ways, and the mechanisms needed to keep track of the situation can rapidly escalate in complexity. In addition, there can be dynamic interactions among different resources—as one nears capacity it may push back on another, which may push back on yet another, which may push back on the first one. These interactions can create either deadlock or livelock, depending on the details.


2. It is easy to induce congestion collapse. The usually beneficial independence of the layers of a packet forwarding network contributes to the ease of inducing congestion collapse. As queues for a particular communication link grow, delays grow. When queuing delays become too long, the timers of higher layer protocols begin to expire and trigger retransmissions of the delayed packets. The retransmitted packets join the long queues but, since they are duplicates that will eventually be discarded, they just waste capacity of the link.

Designers sometimes suggest that an answer to congestion is to buy more or bigger buffers. As memory gets cheaper, this idea is tempting, but it doesn’t work. To see why, suppose memory is so cheap that a packet forwarder can be equipped with an infinite number of packet buffers. That many buffers can absorb an unlimited amount of overload, but as more buffers are used, the queuing delay grows. At some point the queuing delay exceeds the time-outs of the end-to-end protocols and the end-to-end protocols begin retransmitting packets. The offered load is now larger, perhaps twice as large as it would have been in the absence of congestion, so the queues grow even longer. After a while the retransmissions cause the queues to become long enough that end-to-end protocols retransmit yet again, and packets begin to appear in the queue three times, and then four times, etc. Once this phenomenon begins, it is self-sustaining until the real traffic drops to less than half (or 1/3 or 1/4, depending on how bad things got) of the capacity of the resource. The conclusion is that the infinite buffers did not solve the problem, they made it worse. Instead, it may be better to discard old packets than to let them use up scarce transmission capacity.

3. There are limited options to expand capacity. In a network there may not be many options to raise capacity to deal with temporary overload. Capacity is generally determined by physical facilities: optical fibers, coaxial cables, wireless spectrum availability, and transceiver technology. Each of these things can be augmented, but not quickly enough to deal with temporary congestion. If the network is mesh-connected, one might consider sending some of the queued packets via an alternate path. That can be a good response, but doing it on a fast enough timescale to overcome temporary congestion requires knowing the instantaneous state of queues throughout the network. Strategies to do that have been tried; they are complex and haven’t worked well. It is usually the case that the only realistic strategy is to reduce demand.

4. The options to reduce load are awkward. The alternative to increasing capacity is to reduce the offered load. Unfortunately, the control point for the offered load is distant and probably administered independently of the congested packet forwarder. As a result, there are at least three problems:


• The feedback path to a distant control point may be long. By the time the feedback signal gets there the sender may have stopped sending (but all the previously sent packets are still on their way to join the queue) or the congestion may have disappeared and the sender no longer needs to hold back. Worse, if we use the network to send the signal, the delay will be variable, and any congestion on the path back may mean that the signal gets lost. The feedback system must be robust to deal with all these eventualities.

• The control point (in this case, an end-to-end protocol or application) must be capable of reducing its offered load. Some end-to-end protocols can do this quite easily, but others may not be able to. For example, a stream protocol that is being used to send files can probably reduce its average data rate on short notice. On the other hand, a real-time video transmission protocol may have a commitment to deliver a certain number of bits every second. A single-packet request/response protocol will have no control at all over the way it loads the network; control must be exerted by the application, which means there must be some way of asking the application to cooperate—if it can.

• The control point must be willing to cooperate. If the congestion is discovered by the network layer of a packet forwarder, but the control point is in the end-to-end layer of a leaf node, there is a good chance these two entities are under the responsibility of different administrations. In that case, obtaining cooperation can be problematic; the administration of the control point may be more interested in keeping its offered load equal to its intended load in the hope of capturing more of the capacity in the face of competition.

These problems make it hard to see how to apply a central planning approach such as the one that worked in the grocery store. Decentralized schemes seem more promising. Many mechanisms have been devised to try to manage network congestion. Sections 7.6.3 and 7.6.4 describe the design considerations surrounding one set of decentralized mechanisms, similar to the ones that are currently used in the public Internet. These mechanisms are not especially well understood, but they not only seem to work, they have allowed the Internet to operate over an astonishing range of capacity. In fact, the Internet is probably the best existing counterexample of the incommensurate scaling rule. Recall that the rule suggests that a system needs to be redesigned whenever any important parameter changes by a factor of ten. The Internet has increased in scale from a few hundred attachment points to a few hundred million attachment points with only modest adjustments to its underlying design.

7.6.3 Cross-layer Cooperation: Feedback

If the designer can arrange for cross-layer cooperation, then one way to attack congestion would be for the packet forwarder that notices congestion to provide feedback to one or more end-to-end layer sources, and for the end-to-end source to respond by reducing its offered load.

Several mechanisms have been suggested for providing feedback. One of the first ideas that was tried is for the congested packet forwarder to send a control message, called a source quench, to one or more of the source addresses that seems to be filling the queue.


Unfortunately, preparing a control message distracts the packet forwarder at a time when it least needs extra distractions. Moreover, transmitting the control packet adds load to an already-overloaded network. Since the control protocol is best-effort the chance that the control message will itself be discarded increases as the network load increases, so when the network most needs congestion control the control messages are most likely to be lost.

A second feedback idea is for a packet forwarder that is experiencing congestion to set a flag on each forwarded packet. When the packet arrives at its destination, the end-to-end transport protocol is expected to notice the congestion flag and in the next packet that it sends back it should include a “slow down!” request to alert the other end about the congestion. This technique has the advantage that no extra packets are needed. Instead, all communication is piggybacked on packets that were going to be sent anyway. But the feedback path is even more hazardous than with a source quench—not only does the signal have to first reach the destination, the next response packet of the end-to-end protocol may not go out immediately.

Both of these feedback ideas would require that the feedback originate at the packet forwarding layer of the network. But it is also possible for congestion to be discovered in the link layer, especially when a link is, recursively, another network. For these reasons, Internet designers converged on a third method of communicating feedback about congestion: a congested packet forwarder just discards a packet. This method does not require interpretation of packet contents and can be implemented simply in any component in any layer that notices congestion. The hope is that the source of that packet will eventually notice a lack of response (or perhaps receive a NAK). This scheme is not a panacea because the end-to-end layer has to assume that every packet loss is caused by congestion, and the speed with which the end-to-end layer responds depends on its timer settings. But it is simple and reliable.

This scheme leaves a question about which packet to discard. The choice is not obvious; one might prefer to identify the sources that are contributing most to the congestion and signal them, but a congested packet forwarder has better things to do than extensive analysis of its queues. The simplest method, known as tail drop, is to limit the size of the queue; any packet that arrives when the queue is full gets discarded. A better technique (random drop) may be to choose a victim from the queue at random. This approach has the virtue that the sources that are contributing most to the congestion are the most likely to receive the feedback. One can even make a plausible argument to discard the packet at the front of the queue, on the basis that of all the packets in the queue, the one at the front has been in the network the longest, and thus is the one whose associated timer is most likely to have already expired. Another refinement (early drop) is to begin dropping packets before the queue is completely full, in the hope of alerting the source sooner. The goal of early drop is to start reducing the offered load as soon as the possibility of congestion is detected, rather than waiting until congestion is confirmed, so it can be viewed as a strategy of avoidance rather than of recovery. Random drop and early drop are combined in a scheme known as RED, for random early detection.
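
A minimal sketch of these drop policies at a forwarder's outgoing queue appears below. The queue limit and early-drop threshold are invented parameters, and a production RED implementation would use a smoothed average queue length rather than this simplified instantaneous test.

```python
import random
from collections import deque

class OutgoingQueue:
    """Illustrative packet queue with tail-drop, random-drop, and early-drop."""

    def __init__(self, limit=100, early_threshold=80, policy="tail"):
        self.queue = deque()
        self.limit = limit                    # hypothetical maximum queue length
        self.early_threshold = early_threshold
        self.policy = policy                  # "tail", "random", or "early"

    def enqueue(self, packet):
        if self.policy == "early" and len(self.queue) >= self.early_threshold:
            # Begin dropping before the queue is full, with a probability that
            # rises as the queue approaches its limit (a simplified RED-like rule).
            fraction = (len(self.queue) - self.early_threshold) / (
                self.limit - self.early_threshold)
            if random.random() < fraction:
                return "dropped"
        if len(self.queue) >= self.limit:
            if self.policy == "random":
                # Evict a random victim, so heavy senders are likelier to notice.
                del self.queue[random.randrange(len(self.queue))]
                self.queue.append(packet)
                return "dropped a queued packet"
            return "dropped"                  # tail drop: discard the arrival
        self.queue.append(packet)
        return "queued"
```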


7.6.4 Cross-layer Cooperation: Control

Suppose that the end-to-end protocol implementation learns of a lost packet. What then? One possibility is that it just drives forward, retransmitting the lost packet and continuing to send more data as rapidly as its application supplies it. The end-to-end protocol implementation is in control, and there is nothing compelling it to cooperate. Indeed, it may discover that by sending packets at the greatest rate it can sustain, it will push more data through the congested packet forwarder than it would otherwise. The problem, of course, is that if this is the standard mode of operation of every client, congestion will set in and all clients of the network will suffer, as predicted by the tragedy of the commons (see Sidebar 7.8).

Sidebar 7.8: The tragedy of the commons “Picture a pasture open to all…As a rational being, each herdsman seeks to maximize his gain…he asks, ‘What is the utility to me of adding one more animal to my herd?’ This utility has one negative and one positive component…Since the herdsman receives all the proceeds from the sale of the additional animal, the positive utility is nearly +1. Since, however, the effects of overgrazing are shared by all the herdsmen, the negative utility for any particular decision-making herdsman is only a fraction of –1.

“Adding together the component partial utilities, the rational herdsman concludes that the only sensible course for him to pursue is to add another animal to his herd. And another…. But this is the conclusion reached by each and every rational herdsman sharing a commons. Therein is the tragedy. Each man is locked into a system that compels him to increase his herd without limit—in a world that is limited…Freedom in a commons brings ruin to all.”

— Garrett Hardin, Science 162, 3859 [Suggestions for Further Reading 1.4.5]

There are at least two things that the end-to-end protocol can do to cooperate. The first is to be careful about its use of timers, and the second is to pace the rate at which it sends data, a technique known as automatic rate adaptation. Both these things require having an estimate of the round-trip time between the two ends of the protocol.

The usual way of detecting a lost packet in a best-effort network is to set a timer to expire after a little more than one round-trip time, and assume that if an acknowledgment has not been received by then the packet is lost. In Section 7.5 of this chapter we introduced timers as a way of ensuring at-least-once delivery via a best-effort network, expecting that lost packets had encountered mishaps such as misrouting, damage in transmission, or an overflowing packet buffer. With congestion management in operation, the dominant reason for timer expiration is probably that either a queue in the network has grown too long or a packet forwarder has intentionally discarded the packet. The designer needs to take this additional consideration into account when choosing a value for a retransmit timer. As described in Section 7.5.6, a protocol can develop an estimate of the round trip time by directly measuring it for the first packet exchange and then continuing to update that estimate as additional packets flow back and forth.


those observations will increase the round-trip estimate used for setting future retransmit timers. In addition, when a timer does expire, the algorithm for timer setting should use exponential backoff for successive retransmissions of the same packet (exponential backoff was described in Section 7.5.2). It does not matter whether the reason for expiration is that the packet was delayed in a growing queue or it was discarded as part of congestion control. Either way, exponential backoff immediately reduces the retransmission rate, which helps ease the congestion problem. Exponential backoff has been demonstrated to be quite effective as a way to avoid contributing to congestion collapse. Once acknowledgments begin to confirm that packets are actually getting through, the sender can again allow timer settings to be controlled by the round-trip time estimate.

The second cooperation strategy involves managing the flow control window. Recall from the discussion of flow control in Section 7.5.6 that to keep the flow of data moving as rapidly as possible without overrunning the receiving application, the flow control window and the receiver’s buffer should both be at least as large as the bottleneck data rate multiplied by the round trip time. Anything larger than that will work equally well for end-to-end flow control. Unfortunately, when the bottleneck is a congested link inside the network, a larger than necessary window will simply result in more packets piling up in the queue for that link. The additional cooperation strategy, then, is to ensure that the flow control window is no larger than necessary. Even if the receiver has buffers large enough to justify a larger flow control window, the sender should restrain itself and set the flow control window to the smallest size that keeps the connection running at the data rate that the bottleneck permits. In other words, the sender should force equality in the expression on page 7–79.

Relatively early in the history of the Internet, it was realized (and verified in the field) that congestion collapse was not only a possibility, but that some of the original Internet protocols had unexpectedly strong congestion-inducing properties. Since then, almost all implementations of TCP, the most widely used end-to-end Internet transport protocol, have been significantly modified to reduce the risk, as described in Sidebar 7.9.

While having a widely-deployed, cooperative strategy for controlling congestion reduces both congestion and the chance of congestion collapse, there is one unfortunate consequence: Since every client that cooperates may be offering a load that is less than its intended load, there is no longer any way to estimate the size of that intended load. Intermediate packet forwarders know that if they are regularly discarding some packets, they need more capacity, but they have no clue how much more capacity they really need.
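As a concrete illustration of these two cooperation strategies, the sketch below (illustrative only; the 1.5x slack factor and the example numbers are assumptions, not values from the text) computes the smallest window that keeps a connection running at the bottleneck data rate, and a retransmit timeout that starts a little above the round-trip estimate and backs off exponentially on successive retransmissions of the same packet:

    def window_segments(bottleneck_bytes_per_s, rtt_s, segment_bytes):
        """Smallest window (in segments) that keeps the bottleneck busy:
        the bandwidth-delay product, rounded up."""
        bdp = bottleneck_bytes_per_s * rtt_s
        return max(1, -(-int(bdp) // segment_bytes))   # ceiling division

    def retransmit_timeout(rtt_estimate_s, attempts, slack=1.5):
        """Timer for the next retransmission of the same packet: a little more
        than one estimated round trip, doubled for each successive attempt."""
        return slack * rtt_estimate_s * (2 ** attempts)

    # Example: 1 megabyte/second bottleneck, 50 ms round trip, 1000-byte segments
    print(window_segments(1_000_000, 0.05, 1000))   # -> 50 segments
    print(retransmit_timeout(0.05, 0))              # -> 0.075 seconds
    print(retransmit_timeout(0.05, 3))              # -> 0.6 seconds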

7.6.5 Other Ways of Controlling Congestion in Networks

Overprovisioning: Configure each link of the network to have 125% (or 150% or 200%) as much capacity as the offered load at the busiest minute (or five minutes or hour) of the day. This technique works best on interior links of a large network, where no individual client represents more than a tiny fraction of the load.


Sidebar 7.9: Retrofitting TCP

The Transmission Control Protocol (TCP), probably the most widely used end-to-end transport protocol of the Internet, was designed in 1974. At that time, previous experience was limited to lock-step protocols on networks with no more than a few hundred nodes. As a result, avoiding congestion collapse was not in its list of requirements. About a decade later, when the Internet first began to expand rapidly, this omission was noticed, and a particular collapse-inducing feature of its design drew attention.

The only form of acknowledgment in the original TCP was “I have received all the bytes up to X”. There was no way for a receiver to say, for example, “I am missing bytes Y through Z”. In consequence, when a timer expired because some packet or its acknowledgment was lost, as soon as the sender retransmitted that packet the timer of the next packet expired, causing its retransmission. This process would repeat until the next acknowledgment finally returned, a full round trip (and full flow control window) later. On long-haul routes, where flow control windows might be fairly large, if an overloaded packet forwarder responded to congestion by discarding a few packets (each perhaps from a different TCP connection), each discarded packet would trigger retransmission of a window full of packets, and the ensuing blizzard of retransmitted packets could immediately induce congestion collapse. In addition, an insufficiently adaptive time-out scheme ensured that the problem would occur frequently.

By the time this effect was recognized, TCP was widely deployed, so changes to the protocol were severely constrained. The designers found a way to change the implementation without changing the data formats. The goal was to allow new and old implementations to interoperate, so new implementations could gradually replace the old. The new implementation works by having the sender tinker with the size of the flow control window (Warning: this explanation is somewhat oversimplified!):

1. Slow start. When starting a new connection, send just one packet, and wait for its acknowledgment. Then, for each acknowledged packet, add one to the window size and send two packets. The result is that in each round trip time, the number of packets that the sender dispatches doubles. This doubling procedure continues until one of three things happens: (1) the sender reaches the window size suggested by the receiver, in which case the network is not the bottleneck, and the sender maintains the window at that size; (2) all the available data has been dispatched; or (3) the sender detects that a packet it sent has been discarded, as described in step 2.

2. Duplicate acknowledgment: The receiving TCP implementation is modified very slightly: whenever it receives an out-of-order packet, it sends back a duplicate of its latest acknowledgment. The idea is that a duplicate acknowledgment can be interpreted by the sender as a negative acknowledgment for the next unacknowledged packet.

3. Equilibrium: Upon duplicate acknowledgment, the sender retransmits just the first unacknowledged packet and also drops its window size to some fixed fraction (for example, 1/2) of its previous size. From then on it operates in an equilibrium mode in which it continues to watch for duplicate acknowledgments but it also probes gently to see if more capacity might be available. The equilibrium mode has two components:


• Additive increase: Whenever all of the packets in a round trip time are successfully acknowledged, the sender increases the size of the window by one.
• Multiplicative decrease: Whenever a duplicate acknowledgment arrives, the sender decreases the size of the window by the fixed fraction.

4. Restart: If the sender’s retransmission timer expires, self-pacing based on ACKs has been disrupted, perhaps because something in the network has radically changed. So the sender waits a short time to allow things to settle down, and then goes back to slow start, to allow assessment of the new condition of the network.

By interpreting a duplicate acknowledgment as a negative acknowledgment for a single packet, TCP eliminates the massive retransmission blizzard, and by reinitiating slow start on each timer expiration, it avoids contributing to congestion collapse. The figure below illustrates the evolution of the TCP window size with time in the case where the bottleneck is inside the network. TCP begins with one packet and slow start, until it detects the first packet loss. The sender immediately reduces the window size by half and then begins gradually increasing it by one for each round trip time until detecting another lost packet. This sawtooth behavior may continue indefinitely, unless the retransmission timer expires. The sender pauses and then enters another slow start phase, this time switching to additive increase as soon as it reaches the window size it would have used previously, which is half the window size that was in effect before it encountered the latest round of congestion.

This cooperative scheme has not been systematically analyzed, but it seems to work in practice, even though not all of the traffic on the Internet uses TCP as its end-to-end transport protocol. The long and variable feedback delays that inevitably accompany lost packet detection by the use of duplicate acknowledgments induce oscillations (as evidenced by the sawteeth) but the additive increase—multiplicative decrease algorithms strongly damp those oscillations. Exercise 7.12 compares slow start with “fast start”, another scheme for establishing an initial estimate of the window size. There have been dozens (perhaps hundreds) of other proposals for fixing both real and imaginary problems in TCP. The interested reader should consult Section 7.4 in the Suggestions for Further Reading.

[Figure: window size versus time, showing slow start, a multiplicative decrease when a duplicate acknowledgment is received, additive increase until the next loss, a pause when the retransmission timer expires, and then slow start again.]
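The window adjustments that the sidebar describes can be summarized in a few lines. The sketch below is only an illustration of the slow start and additive-increase/multiplicative-decrease rules; the class name, the fraction of 1/2, and the use of the receiver's suggested window as the initial slow-start threshold are assumptions added for the example, not details from the text:

    class CongestionWindow:
        """Sketch of TCP-style window adjustment: slow start, then additive
        increase / multiplicative decrease (structure and values simplified)."""
        def __init__(self, receiver_window=64):
            self.cwnd = 1.0                     # congestion window, in segments
            self.threshold = receiver_window    # where slow start ends
            self.receiver_window = receiver_window

        def on_round_trip_acked(self):
            # All segments sent in the last round trip were acknowledged.
            if self.cwnd < self.threshold:
                self.cwnd = min(self.cwnd * 2, self.threshold)  # slow start: double
            else:
                self.cwnd += 1                                  # additive increase
            self.cwnd = min(self.cwnd, self.receiver_window)

        def on_duplicate_ack(self):
            # Interpreted as a negative acknowledgment: multiplicative decrease.
            self.cwnd = max(1.0, self.cwnd / 2)
            self.threshold = self.cwnd

        def on_timeout(self):
            # Self-pacing lost: remember half the old window, return to slow start.
            self.threshold = max(1.0, self.cwnd / 2)
            self.cwnd = 1.0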


When that is the case, the average load offered by the large number of statistically independent sources is relatively stable and predictable. Internet backbone providers generally use overprovisioning to avoid congestion. The problems with this technique are:

• Odd events can disrupt statistical independence. An earthquake in California or a hurricane in Florida typically clogs up all the telephone trunks leading to and from the affected state, even if the trunks themselves haven’t been damaged. Everyone tries to place a call at once.
• Overprovisioning on one link typically just moves the congestion to a different link. So every link in a network must be overprovisioned, and the amount of overprovisioning has to be greater on links that are shared by fewer customers because statistical averaging is not as effective in limiting the duration of load peaks.
• At the edge of the network, statistical averaging across customers stops working completely. The link to an individual customer may become congested if the customer’s Web service is featured in Newsweek—a phenomenon known as a “flash crowd”. Permanently increasing the capacity of that link to handle what is probably a temporary but large overload may not make economic sense.
• Adaptive behavior of users can interfere with the plan. In Los Angeles, the opening of a new freeway initially provides additional traffic capacity, but new traffic soon appears and absorbs the new capacity, as people realize that they can conveniently live in places that are farther from where they work. Because of this effect, it does not appear to be physically possible to use overprovisioning as a strategy in the freeway system—the load always increases to match (or exceed) the capacity. Anecdotally, similar effects seem to occur in the Internet, although they have not yet been documented.

Over the life of the Internet there have been major changes in both telecommunications regulation and fiber optic technology that between them have transformed the Internet’s central core from capacity-scarce to capacity-rich. As a result, the locations at which congestion occurs have moved as rapidly as techniques to deal with it have been invented. But so far congestion hasn’t gone away.

Pricing: Another approach to congestion control is to rearrange the rules so that the interest of an individual client coincides with the interest of the network community and let the invisible hand take over, as explained in Sidebar 7.10. Since network resources are just another commodity, it should be possible to use pricing as a congestion control mechanism. The idea is that, if demand for a resource temporarily exceeds its capacity, clients will bid up the price. The increased price will cause some clients to defer their use of the resource until a time when it is cheaper, thereby reducing offered load; it will also induce additional suppliers to provide more capacity.

There is a challenge in trying to make pricing mechanisms work on the short timescales associated with network congestion; in addition there is a countervailing need for predictability of costs in the short term that may make the idea unworkable. However,


Sidebar 7.10: The invisible hand Economics 101: In a free market, buyers have the option of buying a good or walking away, and sellers similarly have the option of offering a good or leaving the market. The higher the price, the more sellers will be attracted to the profit opportunity, and they will collectively thus make additional quantities of the good available. At the same time, the higher the price, the more buyers will balk, and collectively they will reduce their demand for the good. These two effects act to create an equilibrium in which the supply of the good exactly matches the demand for the good. Every buyer is satisfied with the price paid and every seller with the price received. When the market is allowed to set the price, surpluses and shortages are systematically driven out by this equilibrium-seeking mechanism. “Every individual necessarily labors to render the annual revenue of the society as great as he can. He generally indeed neither intends to promote the public interest, nor knows how much he is promoting it. He intends only his own gain, and he is in this, as in many other cases, led by an invisible hand to promote an end which was no part of his intention. By pursuing his own interest he frequently promotes that of the society more effectually than when he really intends to promote it.”* * Adam Smith (1723–1790). The Wealth of Nations 4, Chapter 2. (1776)

as a long-term strategy, pricing can be quite an effective mechanism to match the supply of network resources with demand. Even in the long term, the invisible hand generally requires that there be minimal barriers to entry by alternate suppliers; this is a hard condition to maintain when installing new communication links involves digging up streets, erecting microwave towers, or launching satellites.

Congestion control in networks is by no means a solved problem—it is an active research area. This discussion has just touched the highlights, and there are many more design considerations and ideas that must be assimilated before one can claim to understand this topic.

7.6.6 Delay Revisited

Section 7.1.2 of this chapter identified four sources of delay in networks: propagation delay, processing delay, transmission delay, and queuing delay. Congestion control and flow control both might seem to add a fifth source of delay, in which the sender waits for permission from the receiver to launch a message into the network. In fact this delay is not of a new kind; it is actually an example of a transmission delay arising in a different protocol layer. At the time when we identified the four kinds of delay, we had not yet discussed protocol layers, so this subtlety did not appear.

Each protocol layer of a network can impose any or all of the four kinds of delay. For example, what Section 7.1.2 identified as processing delay is actually composed of processing delay in the link layer (e.g., time spent bit-stuffing and calculating checksums),


processing delay in the network layer (e.g., time spent looking up addresses in forwarding tables), and processing delay in the end-to-end layer (e.g., time spent compressing data, dividing a long message into segments and later reassembling it, and encrypting or decrypting message contents).

Similarly, transmission delay can also arise in each layer. At the link layer, transmission delay is measured from when the first bit of a frame enters a link until the last bit of that same frame enters the link. The length of the frame and the data rate of the link together determine its magnitude. The network layer does not usually impose any additional transmission delays of its own, but in choosing a route (and thus the number of hops) it helps determine the number of link-layer transmission delays. The end-to-end layer imposes an additional transmission delay whenever the pacing effect of either congestion control or flow control causes it to wait for permission to send. The data rate of the bottleneck in the end-to-end path, the round-trip time, and the size of the flow-control window together determine the magnitude of the end-to-end transmission delay. The end-to-end layer may also delay delivering a message to its client when waiting for an out-of-order segment of that message to arrive, and it may delay delivery in order to reduce jitter. These delivery delays are another component of end-to-end transmission delay.

Any layer that imposes either processing or transmission delays can also cause queuing delays for subsequent packets. The transmission delays of the link layer can thus create queues, where packets wait for the link to become available. The network layer can impose queuing delays if several packets arrive at a router during the time it spends figuring out how to forward a packet. Finally, the end-to-end layer can also queue up packets waiting for flow control or congestion control permission to enter the network.

Propagation delay might seem to be unique to the link layer, but a careful accounting will reveal small propagation delays contributed by the network and end-to-end layers as messages are moved around inside a router or end-node computer. Because the distances involved in a network link are usually several orders of magnitude larger than those inside a computer, the propagation delays of the network and end-to-end layers can usually be ignored.
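A short worked example may help fix the magnitudes. The sketch below (illustrative numbers only; the function names and the propagation speed of roughly two-thirds the speed of light are assumptions for the example) adds up the four sources of delay for a single hop, using the link-layer definition of transmission delay given above:

    def transmission_delay(frame_bits, link_bits_per_s):
        """Time to push a frame onto a link: frame length divided by data rate."""
        return frame_bits / link_bits_per_s

    def propagation_delay(distance_m, speed_m_per_s=2.0e8):
        """Time for a bit to travel the length of the link."""
        return distance_m / speed_m_per_s

    def one_hop_delay(frame_bits, link_bits_per_s, distance_m,
                      processing_s=0.0, queuing_s=0.0):
        """One hop's contribution: the four sources of delay added together."""
        return (propagation_delay(distance_m)
                + transmission_delay(frame_bits, link_bits_per_s)
                + processing_s
                + queuing_s)

    # Example: a 12,000-bit frame crossing a 100 km, 10 megabit/second link
    print(one_hop_delay(12_000, 10_000_000, 100_000))   # -> about 0.0017 seconds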

7.7 Wrapping up Networks

This chapter has introduced a lot of concepts and techniques for designing and dealing with data communication networks. A natural question arises: “Is all of this stuff really needed?” The answer, of course, is “It depends.” It obviously depends on the application, which may not require all of the features that the various network layers provide. It also depends on several lower-layer aspects.

For example, if at the link layer the entire network consists of just a single point-to-point link, there is no need for a network layer at all. There may still be a requirement to multiplex the link, but multiplexing does not require any of the routing function of a


network layer because everything that goes in one end of the link is destined for whatever is attached at the other end. In addition, there is probably no need for some of the transport services of the end-to-end layer because frames, segments, streams, or messages come out of the link in the same order they went in. A short link is sometimes quite reliable, in which case the end-to-end layer may not need to provide a duplicate-generating resend mechanism and in turn can omit duplicate suppression. What remains in the end-to-end function is session services (such as authenticating the identity of the user and encrypting the communication for privacy) and presentation services (marshaling application data into a form that can be transmitted as a message or a stream).

Similarly, if at the link layer the entire network consists of just a single broadcast link, a network layer is needed, but it is vestigial: it consists of just enough intelligence at each receiver to discard packets addressed to different targets. For example, the backplane bus described in Chapter 3 is a reliable broadcast network with an end-to-end layer that provides only presentation services. For another example, an Ethernet, which is less reliable, needs a healthier set of end-to-end services because it exhibits greater variations in delay. On the other hand, packet loss is still rare enough that it may be possible to ignore it, and reordered packet delivery is not a problem.

As with all aspects of computer system design, good judgement and careful consideration of trade-offs are required for a design that works well and also is economical.

This summary completes our conceptual material about networks. In the remaining sections of this chapter are a case study of a popular network design, the Ethernet, and a collection of network-related war stories.

7.8 Case Study: Mapping the Internet to the Ethernet

This case study begins with a brief description of Ethernet using the terminology and network model of this chapter. It then explores the issues involved in routing that are raised when one maps a packet-forwarding network such as the Internet to an Ethernet.

7.8.1 A Brief Overview of Ethernet

Ethernet is the generic name for a family of local area networks based on broadcast over a shared wire or fiber link on which all participants can hear one another’s transmissions. Ethernet uses a listen-before-sending rule (known as “carrier sense”) to control access and it uses a listen-while-sending rule to minimize wasted transmission time if two stations happen to start transmitting at the same time, an error known as a collision. This protocol is named Carrier Sense Multiple Access with Collision Detection, and abbreviated CSMA/CD. Ethernet was demonstrated in 1974 and documented in a 1976 paper by Metcalfe and Boggs [see Suggestions for Further Reading 7.1.2]. Since that time several successively higher-speed versions have evolved. Originally designed as a half duplex system, a full duplex, point-to-point specification that relaxes length restrictions was a later


development. The primary forms of Ethernet that one encounters either in the literature or in the field are the following:

• Experimental Ethernet, a long obsolete 3 megabit per second network that was used only in laboratory settings. The 1976 paper describes this version.
• Standard Ethernet, a 10 megabit per second version.
• Fast Ethernet, a 100 megabit per second version.
• Gigabit Ethernet, which operates at the eponymous speed.

Standard, fast, and gigabit Ethernet all share the same basic protocol design and format. The format of an Ethernet frame (with some subfield details omitted) is:

    leader        64 bits
    destination   48 bits
    source        48 bits
    type          16 bits
    data          368 to 12,000 bits
    checksum      32 bits

The leader field contains a standard bit pattern that frames the payload and also provides an opportunity for the receiver’s phase-locked loop to synchronize. The destination and source fields identify specific stations on the Ethernet. The type field is used for protocol multiplexing in some applications and to contain the length of the data field in others. (The format diagram does not show that each frame is followed by 96 bit times of silence, which allows finding the end of the frame when the length field is absent.)

The maximum extent of a half duplex Ethernet is determined by its propagation time; the controlling requirement is that the maximum two-way propagation time between the two most distant stations on the network be less than the 576 bit times required to transmit the shortest allowable packet. This restriction guarantees that if a collision occurs, both colliding parties are certain to detect it. When a sending station does detect a collision, it waits a random time before trying again; when there are repeated collisions it uses exponential backoff to increase the interval from which it randomly chooses the time to wait. In a full duplex, point-to-point Ethernet there are no collisions, and the maximum length of the link is determined by the physical medium.

There are many fascinating aspects of Ethernet design and implementation ranging from debates about its probabilistic character to issues of electrical grounding; we omit all of them here. For more information, a good place to start is with the paper by Metcalfe and Boggs. The Ethernet is completely specified in a series of IEEE standards numbered 802.3, and it is described in great detail in most books devoted to networking.
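To make the frame layout above concrete, the sketch below shows how a receiver might pull the fields apart once the leader has been stripped off. This is purely illustrative; real interfaces do this parsing (and the checksum verification) in hardware, and the function and variable names are invented for the example:

    import struct

    def parse_ethernet_frame(frame: bytes):
        """Split a received frame (leader already removed) into the fields of
        the format shown above. Field widths follow the format diagram:
        48-bit destination and source, 16-bit type, 32-bit checksum."""
        if len(frame) < 18:            # 6 + 6 + 2 + 4 bytes of non-data fields
            raise ValueError("frame too short")
        destination = frame[0:6]       # destination station identifier
        source      = frame[6:12]      # source station identifier
        (eth_type,) = struct.unpack("!H", frame[12:14])   # type field
        data        = frame[14:-4]     # 46 to 1500 bytes of payload
        (checksum,) = struct.unpack("!I", frame[-4:])     # frame checksum
        return destination, source, eth_type, data, checksum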

7.8.2 Broadcast Aspects of Ethernet

Section 7.3.5 of this chapter mentioned Ethernet as an example of a network that uses a broadcast link. As illustrated in Figure 7.43, the Ethernet link layer is quite simple: every frame is delivered to every station. At its network layer, each Ethernet station has a 48-bit address, which to avoid confusion with other addresses we will call a station identifier. (To help reduce ambiguity in the examples that follow, station identifiers will be the only two-digit numbers.)


FIGURE 7.43  An Ethernet. [The figure shows several stations attached to a shared broadcast link; each attachment point is labeled with its station identifier (Ethernet address): 17, 24, 12, 05, and 19.]

The network layer of Ethernet is quite simple. On the sending side, ETHERNET_SEND does nothing but pass the call along to the link layer. On the receiving side, the network handler procedure of the Ethernet network layer is straightforward:

    procedure ETHERNET_HANDLE (net_packet, length)
        destination ← net_packet.target_id
        if destination = my_station_id then
            GIVE_TO_END_LAYER (net_packet.data,
                               net_packet.end_protocol,
                               net_packet.source_id)
        else ignore packet

There are two differences between this network layer handler and the network layer handler of a packet-forwarding network:

• Because the underlying physical link is a broadcast link, it is up to the network layer of the station to figure out that it should ignore packets not addressed specifically to it.
• Because every packet is delivered to every Ethernet station, there is no need to do any forwarding.

Most Ethernet implementations actually place ETHERNET_HANDLE completely in hardware. One consequence is that the hardware of each station must know its own station identifier, so it can ignore packets addressed to other stations. This identifier is wired in at manufacturing time, but most implementations also provide a programmable identifier register that overrides the wired-in identifier.

Since the link layer of Ethernet is a broadcast link, it offers a convenient additional opportunity for the network layer to create a broadcast network. For this purpose, Ethernet reserves one station identifier as a broadcast address, and the network handler procedure acquires one additional test:

    procedure ETHERNET_HANDLE (net_packet, length)
        destination ← net_packet.target_id
        if destination = my_station_id or destination = BROADCAST_ID then
            GIVE_TO_END_LAYER (net_packet.data,
                               net_packet.end_protocol,
                               net_packet.source_id)
        else ignore packet


The Ethernet broadcast feature is seductive. It has led people to propose also adding broadcast features to packet-forwarding networks. It is possible to develop broadcast algorithms for a forwarding network, but it is a much trickier business. Even in Ethernet, broadcast must be used judiciously. Reliable transport protocols that require that every receiving station send back an acknowledgment lead to a problematic flood of acknowledgment packets. In addition, broadcast mechanisms are too easily triggered by mistake. For example, if a request is accidentally sent with its source address set to the broadcast address, the response will be broadcast to all network attachment points. The worst case is a broadcast sent from the broadcast address, which can lead to a flood of broadcasts. Such mechanisms make a good target for malicious attack on a network, so it is usually thought to be preferable not to implement them at all.

7.8.3 Layer Mapping: Attaching Ethernet to a Forwarding Network

Suppose we have several workstations and perhaps a few servers in one building, all connected using an Ethernet, and we would like to attach this Ethernet to the packet-forwarding network illustrated in Figure 7.31 on page 7–50, by making the Ethernet a sixth link on router K in that figure. This connection produces the configuration of Figure 7.44.

There are three kinds of network-related labels in the figure. First, each link is numbered with a local single-digit link identifier (in italics), as viewed from within the station that attaches that link. Second, as in Figure 7.43, each Ethernet attachment point has a two-digit Ethernet station identifier. Finally, each station has a one-letter name, just as in the packet-forwarding network in the figure on page 7–50. With this configuration, workstation L sends a remote procedure call to server N by sending one or more packets to station 18 of the Ethernet attached to it as link number 1.

FIGURE 7.44  Connecting an Ethernet to a packet forwarding network. [The figure shows workstations L, M, P, and Q and server N attached to the Ethernet, each via its own link number 1; the discussion that follows places L at Ethernet station 17, M at station 15, N at station 18, and router K at station 19, with stations 14 and 22 belonging to the remaining workstations. Router K attaches the Ethernet as its link 6, and its other links, 1 through 5, connect it to the packet-forwarding network containing G, H, J, E, and F. The labels distinguish upper-layer network addresses, link identifiers, and Ethernet station identifiers.]


Workstation L might also want to send a request to the computer connected to the destination E, which requires that L actually send the request packet to router K at Ethernet station 19 for forwarding to destination E. The complication is that E may be at address 15 of the packet-forwarding network, while workstation M is at station 15 of the Ethernet. Since Ethernet station identifiers may be wired into the hardware interface, we probably can’t set them to suit our needs, and it might be a major hassle to go around changing addresses on the original packet-forwarding network. The bottom line here is that we can’t simply use Ethernet station identifiers as the network addresses in our packet-forwarding network.

But this conclusion seems to leave station L with no way of expressing the idea that it wants to send a packet to address E. We were able to express this idea in words because in the two figures we assigned a unique letter identifier to every station. What our design needs is a more universal concept of network—a cloud that encompasses every station in both the Ethernet and the packet-forwarding network and assigns each station a unique network address. Recall that the letter identifiers originally stood for addresses in the packet-forwarding network; they may even be hierarchical identifiers. We can simply extend that concept and assign identifiers from that same numbering plan to each Ethernet station, in addition to the wired-in Ethernet station identifiers.

What we are doing here is mapping the letter identifiers of the packet-forwarding network to the station identifiers of the Ethernet. Since the Ethernet is itself decomposable into a network layer and a link layer, we can describe this situation, as was suggested on page 7–34, as a mapping composition—an upper-level network layer is being mapped to a lower-level network layer. The upper network layer is a simplified version of the Internet, so we will label it with the name “internet,” using a lower case initial letter as a reminder that it is simplified. Our internet provides us with a language in which workstation L can express the idea that it wants to send an RPC request to server E, which is located somewhere beyond the router:

    NETWORK_SEND (data, length, RPC, INTERNET, E)

where E is the internet address of the server, and the fourth argument selects our internet forwarding protocol from among the various available network protocols. With this scheme, station A also uses the same network address E to send a request to that server. In other words, this internet provides a universal name space.

Our new, expanded, internet network layer must now map its addresses into the Ethernet station identifiers required by the Ethernet network layer. For example, when workstation L sends a remote procedure call to server N by

    NETWORK_SEND (data, length, RPC, INTERNET, N)

the internet network layer must turn this into the Ethernet network-layer call

    NETWORK_SEND (data, length, RPC, ENET, 18)

in which we have named the Ethernet network-layer protocol ENET.


For this purpose, L must maintain a table such as that of Figure 7.45, in which each internet address maps to an Ethernet station identifier. This table maps, for example, address N to ENET, station 18, as required for the NETWORK_SEND call above. Since our internet is a forwarding network, our table also indicates that for address E the thing to do is send the packet on ENET to station 19, in the hope that it (a router in our diagram) will be well enough connected to pass the packet along to its destination. This table is just another example of a forwarding table like the ones in Section 7.4 of this chapter.

    internet address    Ethernet/station
    M                   enet/15
    N                   enet/18
    P                   enet/14
    Q                   enet/22
    K                   enet/19
    E                   enet/19

FIGURE 7.45  Forwarding table to connect upper and lower layer addresses.
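The sketch below illustrates how the internet network layer in station L might use this table. It is not from the text; the dictionary, the function names, and the ethernet_send callback standing in for the lower-layer NETWORK_SEND interface are assumptions for the example:

    # Illustrative forwarding table for station L, copied from Figure 7.45.
    FORWARDING_TABLE = {
        "M": ("enet", 15), "N": ("enet", 18), "P": ("enet", 14),
        "Q": ("enet", 22), "K": ("enet", 19), "E": ("enet", 19),
    }

    def internet_network_send(data, length, end_protocol, destination,
                              ethernet_send):
        """Map an internet address to an Ethernet station and hand the packet
        to the Ethernet network layer."""
        network, station = FORWARDING_TABLE[destination]
        # For a directly attached destination this is the final station; for a
        # distant one (such as E) it is the router that will forward the packet.
        ethernet_send(data, length, end_protocol, network.upper(), station)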

7.8.4 The Address Resolution Protocol

The forwarding table could simply be filled in by hand, by a network administrator who, every time a new station is added to an Ethernet, visits every station already on that Ethernet and adds an entry to its forwarding table. But the charm of manual network management quickly wears thin as the network grows in number of stations, and a more automatic procedure is usually implemented.

An elegant scheme, known as the address resolution protocol (ARP), takes advantage of the broadcast feature of Ethernet to dynamically fill in the forwarding table as it is needed. Suppose we start with an empty forwarding table and that an application calls the internet NETWORK_SEND interface in L, asking that a packet be sent to internet address M. The internet network layer in L looks in its local forwarding table, and finding nothing there that helps, it asks the Ethernet network layer to send a query such as the following:

    NETWORK_SEND (“where is M?”, 11, ARP, ENET, BROADCAST)

where 11 is the number of bytes in the query, ARP is the network-layer protocol we are using, rather than INTERNET, and BROADCAST is the station identifier that is reserved for broadcast on this Ethernet.

Since this query uses the broadcast address, it will be received by the Ethernet network layer of every station on the attached Ethernet. Each station notices the ARP protocol type and passes it to its ARP handler in the upper network layer. Each ARP handler checks the query, and if it discovers its own internet address in the inquiry, sends a response:

    NETWORK_SEND (“M is at station 15”, 18, ARP, ENET, BROADCAST)

At most, one station—the one whose internet address is named by the ARP request—will respond. All the others will ignore the ARP request. When the ARP response arrives at station 17, that station’s Ethernet network layer will pass it up to the ARP handler in its upper network layer, which will immediately add an entry relating address M to station 15 to its forwarding table:

    internet address    Ethernet/station
    M                   enet/15

The internet network handler of station 17 can now proceed with its originally requested send operation.

Suppose now that station L tries to send a packet to server E, which is on the internet but not directly attached to the Ethernet. In that case, server E does not hear the broadcast, but the router at station 19 does, and it sends a suitable ARP response instead. The forwarding table then has a second entry:

    internet address    Ethernet/station
    M                   enet/15
    E                   enet/19

Station L can now send the packet to the router, which presumably knows how to forward the packet to its intended destination.

One more step is required—the server at E will not be able to reply to station L unless L is in its own forwarding table. This step is easy to arrange: whenever router K hears, via ARP, of the existence of a station on its attached Ethernet, it simply adds that internet address to the list of addresses that it advertises, and whatever routing protocol it is using will propagate that information throughout the internet. If hierarchical addresses are in use, the region designer might assign a region number to be used exclusively for all the stations on one Ethernet, to simplify routing.

Mappings from Ethernet station identifiers to the addresses of the higher network level are thus dynamically built up, and eventually station L will have the full table shown in Figure 7.45. Typical systems deployed in the field have developed and refined this basic set of dynamic mapping ideas in many directions: The forwarding table is usually managed as a cache, with entries that time out or can be explicitly updated, to allow stations to change their station identifiers; the ARP response may also be noted by stations that didn’t send the original ARP request for their own future reference; a newly-attached station may, without being asked, broadcast what appears to be an ARP response simply to make itself known to existing stations (advertising); and there is even a reverse version of the ARP protocol that can be used by a station to ask if anyone knows its own higher-level network address, or to ask that a higher-level address be assigned to it. These refinements are not important to our case study, but many of them are essential to smooth network management.
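The refinements listed above (a table managed as a cache, entries that time out, learning from overheard responses) are easy to see in a few lines of code. The following sketch is illustrative only; the class name, the 600-second timeout, and the method names are assumptions rather than details from the text:

    import time

    class ArpCache:
        """Sketch of the dynamically filled forwarding table described above:
        entries map an internet address to an Ethernet station identifier and
        expire after a timeout so that stations may change identifiers."""
        def __init__(self, timeout_s=600):
            self.entries = {}        # internet address -> (station, time learned)
            self.timeout_s = timeout_s

        def learn(self, internet_address, station):
            # Called both for responses we asked for and for overheard responses.
            self.entries[internet_address] = (station, time.time())

        def lookup(self, internet_address):
            entry = self.entries.get(internet_address)
            if entry is None:
                return None                     # caller should broadcast an ARP query
            station, learned = entry
            if time.time() - learned > self.timeout_s:
                del self.entries[internet_address]
                return None                     # stale entry: ask again
            return station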


7.9 War Stories: Surprises in Protocol Design

7.9.1 Fixed Timers Lead to Congestion Collapse in NFS

A classic example of congestion collapse appeared in early releases of the Sun Network File System (NFS) described in the case study in Section 4.5. The NFS server implemented at-least-once semantics with an idempotent stateless interface. The NFS client was programmed to be persistent. If it did not receive a response after some fixed number of seconds, it would resend its request, repeating forever, if necessary. The server simply ran a first-in, first-out queue, so if several NFS clients happened to make requests of the server at about the same time, the server would handle the requests one at a time in the order that they arrived. These apparently plausible arrangements on the parts of the client and the server, respectively, set the stage for the problem.

As the number of clients increased, the length of the queue increased accordingly. With enough clients, the queue would grow long enough that some requests would time out before the server got to them. Those clients, upon timing out, would repeat their requests. In due course, the server would handle the original request of a client that had timed out, send a response, and that client would go away happy. But that client’s duplicate request was still in the server’s queue. The stateless NFS server had no way to tell that it had already handled the duplicate request, so when it got to the duplicate it would go ahead and handle it again, taking the same time as before, and sending an unneeded response. The client ignored this response, but the time spent by the server handling the duplicate request was wasted, and the waste occurred at a time when the server could least afford it—it was already so heavily loaded that at least one client had timed out.

Once the server began wasting time handling duplicate requests, the queue grew still longer, causing more clients to time out, leading to more duplicate requests. The observed effect was that a steady increase of load would result in a steady increase of satisfied requests, up to the point that the server was near full capacity. If the load ever exceeded the capacity, even for a short time, every request from then on would time out, and be duplicated, resulting in a doubling of the load on the server. That wasn’t the end—with a doubled load, clients would begin to time out a second time, send their requests yet again, thus tripling the load. From there, things would continue to deteriorate, with no way to recover. From the NFS server’s point of view, it was just doing what its clients were asking, but from the point of view of the clients the useful throughput had dropped to zero.

The solution to this problem was for the clients to switch to an exponential backoff algorithm in their choice of timer setting: each time a client timed out it would double the size of its timer setting for the next repetition of the request.

Lesson: Fixed timers are always a source of trouble, sometimes catastrophic trouble.
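A sketch of the client-side fix follows. It is illustrative rather than the actual NFS code; the function name, the one-second initial timeout, the attempt limit, the jitter before each retry, and the send_request callback (which is assumed to return None on timeout) are all assumptions added for the example:

    import random, time

    def call_with_backoff(send_request, initial_timeout_s=1.0, max_attempts=8):
        """Retry an idempotent request, doubling the timeout after each failure,
        instead of using a fixed timer."""
        timeout = initial_timeout_s
        for attempt in range(max_attempts):
            response = send_request(timeout)        # returns None on timeout
            if response is not None:
                return response
            time.sleep(random.uniform(0, timeout))  # jitter: an extra refinement,
                                                    # not something the text describes
            timeout *= 2                            # exponential backoff
        raise TimeoutError("server did not respond")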


7.9.2 Autonet Broadcast Storms

Autonet, an experimental local area network designed at the Digital Equipment Corporation Systems Research Center, handled broadcast in an elegant way. The network was organized as a tree. When a node sent a packet to the broadcast address, the network first routed the packet up to the root of the tree. The root turned the packet around and sent it down every path of the tree. Nodes accepted only packets going downward, so this procedure ensured that a broadcast packet would reach every connected node, but no more than once. But every once in a while, the network collapsed with a storm of repeated broadcast packets. Analysis of the software revealed no possible source of the problem. It took a hardware expert to figure it out.

The physical layer of the Autonet consisted of point-to-point coaxial cables. An interesting property of an unterminated coaxial cable is that it will almost perfectly reflect any signal sent down the cable. The reflection is known as an “echo”. Echos are one of the causes of ghosts in analog cable television systems. In the case of the Autonet, the network card in each node properly terminated the cable, eliminating echos. But if someone disconnected a computer from the network, and left the cable dangling, that cable would echo everything back to its source.

Suppose someone disconnects a cable, and someone else in the network sends a packet to the broadcast address. The network routes the packet up to the root of the tree, the root turns the packet around and sends it down the tree. When the packet hits the end of the unterminated cable, it reflects and returns to the other end of the cable looking like a new upward bound packet with the broadcast address. The node at that end dutifully forwards the packet toward the root node, which, upon receipt, turns it around and sends it again. And again, and again, as fast as the network can carry the packet.

Lesson: Emergent properties often arise from the interaction of apparently unrelated system features operating at different system layers, in this case, link-layer reflections and network-layer broadcasts.

7.9.3 Emergent Phase Synchronization of Periodic Protocols

Some network protocols involve periodic polling. Examples include picking up mail, checking for chat buddies, and sending “are-you-there?” inquiries for reassurance that a co-worker hasn’t crashed. For a specific example, a workstation might send a broadcast packet every five minutes to announce that it is still available for conversations. If there are dozens of such workstations on the same local area network, the designer would prefer that they not all broadcast simultaneously. One might assume that, even if they all broadcast with the same period, if they start at random their broadcasts would be out of phase and it would take a special effort to synchronize their phases and keep them that way. Unfortunately, it is common to discover that they have somehow synchronized themselves and are all trying to broadcast at the same time.

How can this be? Suppose, for example, that each one of a group of workstations sends a broadcast and then sets a timer for a fixed interval. When the timer expires, it


sends another broadcast and, after sending, it again sets the timer. During the time that it is sending the broadcast message, the timer is not running. If a second workstation happens to send a broadcast during that time, both workstations take a network interrupt, each accepts the other station’s broadcast, and makes a note of it, as might be expected. But the time required to handle the incoming broadcast interrupts slightly delays the start of the next timing cycle for both of the workstations, whereas broadcasts that arrive while a workstation’s timer is running don’t affect the timer. Although the delay is small, it does shift the timing of these workstations’ broadcasts relative to all of the other workstations. The next time this workstation’s timer expires, it will again be interrupted by the other workstation, since they are both using the same timer value, and both of their timing cycles will again be slightly lengthened. The two workstations have formed a phase-locked group, and will remain that way indefinitely.

More important, the two workstations that were accidentally synchronized are now polling with a period that is slightly larger than all the other workstations. As a result, their broadcasts now precess relative to the others, and eventually will overlap the time of broadcast of a third workstation. That workstation will then join the phase-locked group, increasing the rate of precession, and things continue from there. The problem is that the system design unintentionally includes an emergent phase-locked loop, similar to the one described on page 7–36.

The generic mechanism is that the supposed “fixed” interval does not count the running time of the periodic program, and that for some reason that running time is different when two or more participants happen to run concurrently. In a network, it is quite common to find that unsynchronized activities with identical timing periods become synchronized.

Lesson: Fixed timers have many evils. Don’t assume that unsynchronized periodic activities will stay that way.
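One way to avoid building this accidental phase-locked loop is to schedule each broadcast relative to the intended time of the previous one, rather than relative to whenever the handler happens to finish, and to add a little random jitter for good measure. The sketch below illustrates that idea; it is an inference from the lesson rather than a fix described in the text, and the period, jitter, and function names are assumptions:

    import random, time

    def periodic_broadcast(send, period_s=300.0, jitter_s=5.0):
        """Broadcast every period_s seconds without letting the time spent
        handling a neighbor's broadcast stretch the period."""
        next_time = time.monotonic()
        while True:
            send()
            # Schedule relative to the intended firing time, not to "now",
            # so interrupt-handling time does not lengthen the period.
            next_time += period_s + random.uniform(-jitter_s, jitter_s)
            time.sleep(max(0.0, next_time - time.monotonic()))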

7.9.4 Wisconsin Time Server Meltdown

NETGEAR®, a manufacturer of Ethernet and wireless equipment, added a feature to four of its low-cost wireless routers intended for home use: a log of packets that traverse the router. To be useful in debugging, the designers realized that the log needed to timestamp each log entry, but adding timestamps required that the router know the current date and time. Since the router would be attached to the Internet, the designers added a few lines of code that invoked a simple network time service protocol known as SNTP. Since SNTP requires that the client invoke a specific time service, there remained a name discovery problem. They solved it by configuring the firmware code with the Internet address of a network time service. Specifically, they inserted the address 128.105.39.11, the network address of one of the time servers operated by the University of Wisconsin. The designers surrounded this code with a persistent sender that would retry the protocol once per second until it received a response. Upon receiving a response, it refreshed the clock with another invocation of SNTP, using the same persistent sender, on a schedule ranging from once per minute to once per day, depending on the firmware version.


On May 14, 2003, at about 8:00 a.m. local time, the network staff at the University of Wisconsin noticed an abrupt increase in the rate of inbound Internet traffic at their connection to the Internet—the rate jumped from 20,000 packets per second to 60,000 packets per second. All of the extra traffic seemed to be SNTP packets targeting one of their time servers, and specifying the same UDP response port, port 23457. To prevent disruption to university network access, the staff installed a temporary filter at their border routers that discarded all incoming SNTP request packets that specified a response port of 23457. They also tried invoking an SNTP protocol access control feature in which the service can send a response saying, in effect, “go away”, but it had no effect on the incoming packet flood.

Over the course of the next few weeks, SNTP packets continued to arrive at an increasing rate, soon reaching around 270,000 packets per second, and consuming about 150 megabits per second of Internet connection capacity. Analysis of the traffic showed that the source addresses seemed to be legitimate and that any single source was sending a packet about once per second. A modest amount of sleuthing identified the NETGEAR routers as the source of the packets and the firmware as containing the target address and response port numbers. Deeper analysis established that the immediate difficulty was congestion collapse. NETGEAR had sold over 700,000 routers containing this code world-wide. As the number in operation increased, the load on the Wisconsin time service grew gradually until one day the response latency of the server exceeded one second. At that point, the NETGEAR router that made that request timed out and retried, thereby increasing its load on the time service, which increased the time service response latency for future requesters. After a few such events, essentially all of the NETGEAR routers would start to time out, thereby multiplying the load they presented by a factor of 60 or more, which ensured that the server latency would continue to exceed their one-second timer. How Wisconsin and NETGEAR solved this problem, and at whose expense, is a whole separate tale.*

Lesson(s): There are several. (1) Fixed timers were once again found at the scene of an accident. (2) Configuring a fixed Internet address, which is overloaded with routing information, is a bad idea. In this case, the wired-in address made it difficult to repair the problem by rerouting requests to a different time service, such as one provided by NETGEAR. The address should have been a variable, preferably one that could be hidden with indirection (decouple modules with indirection). (3) There is a reason for features such as the “go away” response in SNTP; it is risky for a client to implement only part of a protocol.

* For that story, see . This incident is also described in David Mills, Judah Levine, Richard Schmidt, and David Plonka. “Coping with overload on the Network Time Protocol public servers.” Proceedings of the Precision Time and Time Interval (PTTI) Applications and Planning Meeting (Washington DC, December 2004), pages 5–16.


Exercises

7.1 Chapter 1 discussed four general methods for coping with complexity: modularity, abstraction, hierarchy, and layering. Which of those four methods does a protocol stack use as its primary organizing scheme? 1996–1–1e

7.2 The end-to-end argument
A. is a guideline for placing functions in a computer system;
B. is a rule for placing functions in a computer system;
C. is a debate about where to place functions in a computer system;
D. is a debate about anonymity in computer networks.
1999–2–03

7.3 Of the following, the best example of an end-to-end argument is:
A. If you laid all the Web hackers in the world end to end, they would reach from Cambridge to CERN.
B. Every byte going into the write end of a UNIX pipe eventually emerges from the pipe’s read end.
C. Even if a chain manufacturer tests each link before assembly, he’d better test the completed chain.
D. Per-packet checksums must be augmented by a parity bit for each byte.
E. All important network communication functions should be moved to the application layer.
1998–2–01

7.4 Give two scenarios in the form of timing diagrams showing how a duplicate request might end up at a service. 1995-1-5a

7.5 After sending a frame, a certain piece of network software waits one second for an acknowledgment before retransmitting the frame. After each retransmission, it cuts delay in half, so after the first retransmission the wait is 1/2 second, after the second retransmission the wait is 1/4 second, etc. If it has reduced the delay to 1/1024


second without receiving an acknowledgment, the software gives up and reports to its caller that it was not able to deliver the frame.
7.5a. Is this a good way to manage retransmission delays for Ethernet? Why or why not? 1987–1–2a
7.5b. Is this a good way to manage retransmission delays for a receive-and-forward network? Why or why not? 1987–1–2b

7.6 Variable delay is an intrinsic problem of isochronous networks. True or False? 1995–1–1f

7.7 Host A is sending frames to host B over a noisy communication link. The median transit time over the communication link is 100 milliseconds. The probability of a frame being damaged en route in either direction across the communication link is α, and B can reliably detect the damage. When B gets a damaged frame it simply discards it. To ensure that frames arrive safely, B sends an acknowledgment back to A for every frame received intact.
7.7a. How long should A wait for a frame to be acknowledged before retransmitting it? 1987–1–3a
7.7b. What is the average number of times that A will have to send each frame? 1987–1–3b

7.8 Consider the protocol reference model of this chapter with the link, network, and end-to-end layers. Which of the following is a behavior of the reference model?
A. An end-to-end layer at an end host tells its network layer which network layer protocol to use to reach a destination.
B. The network layer at a router maintains a separate queue of packets for each end-to-end protocol.
C. The network layer at an end host looks at the end-to-end type field in the network header to decide which end-to-end layer protocol handler to invoke.
D. The link layer retransmits packets based on the end-to-end type of the packets: if the end-to-end protocol is reliable, then a link-layer retransmission occurs when a loss is detected at the link layer, otherwise not.
2000–2–02


7.9 Congestion is said to occur in a receive-and-forward network when
A. Communication stalls because of cycles in the flow-control dependencies.
B. The throughput demanded of a network link exceeds its capacity.
C. The volume of e-mail received by each user exceeds the rate at which users can read e-mail.
D. The load presented to a network link persistently exceeds its capacity.
E. The amount of space required to store routing tables at each node becomes burdensome.
1997–1–1e

7.10 Alice has arranged to send a stream of data to Bob using the following protocol:
• Each message segment has a block number attached to it; block numbers are consecutive starting with 1.
• Whenever Bob receives a segment of data with the number N he sends back an acknowledgment saying “OK to send block N + 1”.
• Whenever Alice receives an “OK to send block K” she sends block K.
Alice initiates the protocol by sending a block numbered 1; she terminates the protocol by ignoring any “OK to send block K” for which K is larger than the number on the last block she wants to send. The network has been observed to never lose message segments, so Bob and Alice have made no provision for timer expirations and retries. They have also made no provision for deduplication. Unfortunately, the network systematically delivers every segment twice. Alice starts the protocol, planning to send a three-block stream. How many “OK to send block 4” responses does she ignore at the end? 1994–2–6

7.11 A and B agree to use a simple window protocol for flow control for data going from A to B: When the connection is first established, B tells A how many message segments B can accept, and as B consumes the segments it occasionally sends a message to A saying “you can send M more”. In operation, B notices that occasionally A sends more segments than it was supposed to. Explain. 1980–3–3

7.12 Assume a client and a service are directly connected by a private, 800,000 bytes per second link. Also assume that the client and the service produce and consume

message segments at the same rate. Using acknowledgments, the client measures the round-trip time between itself and the service to be 10 milliseconds.
7.12a. If the client is sending message segments that require 1000-byte frames, what is the smallest window size that allows the client to achieve 800,000 bytes per second throughput? 1995–2–2a
7.12b. One scheme for establishing the window size is similar to the slow start congestion control mechanism. The idea is that the client starts with a window size of one. For every segment received, the service responds with an acknowledgment telling the client to double the window size. The client does so until it realizes that there is no point in increasing it further. For the same parameters as in part 7.12a, how long would it take for the client to realize it has reached the maximum throughput? 1995–2–2b
7.12c. Another scheme for establishing the window size is called fast start. In (an oversimplified version of) fast start, the client simply starts sending segments as fast as it can, and watches to see when the first acknowledgment returns. At that point, it counts the number of outstanding segments in the pipeline, and sets the window size to that number. Again using the same parameters as in part 7.12a, how long will it take for the client to know it has achieved the maximum throughput? 1995–2–2c

7.13 A satellite in stationary orbit has a two-way data channel that can send frames containing up to 1000 data bytes in a millisecond. Frames are received without error after 249 milliseconds of propagation delay. A transmitter T frequently has a data file that takes 1000 of these maximal-length frames to send to a receiver R. T and R start using lock-step flow control. R allocates a buffer which can hold one message segment. As soon as the buffered segment is used and the buffer is available to hold new data, R sends an acknowledgment of the same length. T sends the next segment as soon as it sees the acknowledgment for the last one.
7.13a. What is the minimum time required to send the file? 1988–2–2a
7.13b. T and R decide that lock-step is too slow, so they change to a bang-bang protocol. A bang-bang protocol means that R sends explicit messages to T saying "go ahead" or "pause". The idea is that R will allocate a receive buffer of some size B, and send a go-ahead message when it is ready to receive data. T then sends data segments as fast as the channel can absorb them. R sends a pause message at just the right time so that its buffer will not overflow even if R stops consuming message segments.

Suppose that R sends a go-ahead, and as soon as it sees the first data arrive it sends a pause. What is the minimum buffer size Bmin that it needs? 1988–2–2b
7.13c. What now is the minimum time required to send the file? 1988–2–2c

7.14 Some end-to-end protocols include a destination field in the end-to-end header. Why?
A. So the protocol can check that the network layer routed the packet containing the message segment correctly.
B. Because an end-to-end argument tells us that routing should be performed at the end-to-end layer.
C. Because the network layer uses the end-to-end header to route the packet.
D. Because the end-to-end layer at the sender needs it to decide which network protocol to use.
2000–2–09

7.15 One value of hierarchical naming of network attachment points is that it allows a reduction in the size of routing tables used by packet forwarders. Do the packet forwarders themselves have to be organized hierarchically to take advantage of this space reduction? 1994–2–5

7.16 The System Network Architecture (SNA) protocol family developed by IBM uses a flow control mechanism called pacing. With pacing, a sender may transmit a fixed number of message segments, and then must pause. When the receiver has accepted all of these segments, it can return a pacing response to the sender, which can then send another burst of message segments. Suppose that this scheme is being used over a satellite link, with a delay from earth station to earth station of 250 milliseconds. The frame size on the link is 1000 bits, four segments are sent before pausing for a pacing response, and the satellite channel has a data rate of one megabit per second.
7.16a. The timing diagram below illustrates the frame carrying the first segment. Fill in the diagram to show the next six frames exchanged in the pacing system. Assume no frames are lost, delays are uniform, and sender and receiver have no internal

delays (for example, the first bit of the second frame may immediately follow the last bit of the first).
[Timing diagram: two parallel timelines, sender and receiver, with time marked in milliseconds. At t = 0 the first bit of the first frame leaves the sender; at t = 1 the last bit of the first frame leaves the sender; at t = 250 the first bit of the first frame arrives at the receiver; at t = 251 the last bit of the first frame arrives at the receiver.]
7.16b. What is the maximum fraction of the available satellite capacity that can be used by this pacing scheme?
7.16c. We would like to increase the utilization of the channel to 50% but we can't increase the frame size. How many message segments would have to be sent between pacing responses to achieve this capacity? 1982–3–4

7.17 Which are true statements about network address translators as described in Section 7.4.5?
A. NATs break the universal addressing scheme of the Internet.
B. NATs break the layering abstraction of the network model of Chapter 7.
C. NATs increase the consumption of Internet addresses.
D. NATs address the problem that the Internet has a shortage of Internet addresses.
E. NATs constrain the design of new end-to-end protocols.
F. When a NAT translates the Internet address of a packet, it must also modify the Ethernet checksum, to ensure that the packet is not discarded by the next router that handles it. The client application might be sending its Internet address in the TCP payload to the server.
G. When a packet from the public Internet arrives at a NAT box for delivery to a host behind the NAT, the NAT must examine the payload and translate any Internet addresses found therein.
H. Clients behind a NAT cannot communicate with servers that are behind the same NAT because the NAT does not know how to forward those packets.
2001–2–01, 2002–2–02, and 2004–2–2

7.18 Some network protocols deal with both big-endian and little-endian clients by providing two different network ports. Big-endian clients send requests and data to one port, while little-endian clients send requests and data to the other. The service may, of course, be implemented on either a big-endian or a little-endian machine. This approach is unusual—most Internet protocols call for just one network port, and require that all data be presented at that port in "network standard form", which is little-endian. Explain the advantage of the two port structure as compared with the usual structure. 1994–1–2

7.19 Ethernet cannot scale to large sizes because a centralized mechanism is used to control network contention. True or False? 1994–1–3b

7.20 Ethernet
A. uses luminiferous ether to carry packets.
B. uses Manchester encoding to frame bits.
C. uses exponential back-off to resolve repeated conflicts between multiple senders.
D. uses retransmissions to avoid congestion.
E. delegates arbitration of conflicting transmissions to each station.
F. always guarantees the delivery of packets.
G. can support an unbounded number of computers.
H. has limited physical range.
1999–2–01, 2000–1–04

7.21 Ethernet cards have unique addresses built into them. What role do these unique addresses play in the Internet?
A. None. They are there for Macintosh compatibility only.
B. A portion of the Ethernet address is used as the Internet address of the computer using the card.
C. They provide routing information for packets destined to non-local subnets.
D. They are used as private keys in the Security Layer of the ISO protocol.
E. They provide addressing within each subnet for an Internet address resolution protocol.
F. They provide secure identification for warranty service.
1998–2–02

7.22 If eight stations on an Ethernet all want to transmit one packet, which of the following statements is true?

A. It is guaranteed that all transmissions will succeed.
B. With high probability all stations will eventually end up being able to transmit their data successfully.
C. Some of the transmissions may eventually succeed, but it is likely some may not.
D. It is likely that none of the transmissions will eventually succeed.
2004–1–3

7.23 Ben Bitdiddle has been thinking about remote procedure call. He remembers that one of the problems with RPC is the difficulty of passing pointers: since pointers are really just addresses, if the service dereferences a client pointer, it’ll get some value from its address space, rather than the intended value in the client’s address space. Ben decides to redesign his RPC system to always pass, in the place of a bare pointer, a structure consisting of the original pointer plus a context reference. Louis Reasoner, excited by Ben’s insight, decides to change all end-to-end protocols along the same lines. Argue for or against Louis’s decision. 1996–2–1a

7.24 Alyssa's mobiles:* Alyssa P. Protocol-Hacker is designing an end-to-end protocol for locating mobile hosts. A mobile host is a computer that plugs into the network at different places at different times, and gets assigned a new network address at each place. The system she starts with assigns each host a home location, which can be found simply by looking the user up in a name service. Her end-to-end protocol will use a network that can reorder packets, but doesn't ever lose or duplicate them.
Her first protocol is simple: every time a user moves, store a forwarding pointer at the previous location, pointing to the new location. This creates a chain of forwarding pointers with the permanent home location at the beginning and the mobile host at the end. Packets meant for the mobile host are sent to the home location, which forwards them along the chain until they reach the mobile host itself. (The chain is truncated when a mobile host returns to a previously visited location.) Alyssa notices that because of the long chains of forwarding pointers, performance generally gets worse each time she moves her mobile host.
Alyssa's first try at fixing the problem works like this: Each time a mobile host moves, it sends a message to its home location indicating its new location. The home location maintains a pointer to the new location. With this protocol, there are no chains at all. Places other than the home location do not maintain forwarding information.
7.24a. When this protocol is implemented, Alyssa notices that packets regularly get lost when she moves from one location to another. Explain why or give an example.

Alyssa is disappointed with her first attempt, and decides to start over. In her new scheme, no forwarding pointers are maintained anywhere, not even at the home

node. Say a packet destined for a mobile host A arrives at a node N. If N can directly communicate with A, then N sends the packet to A, and we're done. Otherwise, N broadcasts a search request for A to all the other fixed nodes in the network. If A is near a different fixed node N', then N' responds to the search request. On receiving this response, N forwards the packet for A to N'.
7.24b. Will packets get lost with this protocol, even if A moves before the packet gets to N'? Explain.

Unfortunately the network doesn't support broadcast efficiently, so Alyssa goes back to the keyboard and tries again. Her third protocol works like this. Each time a mobile host moves, say from N to N', a forwarding pointer is stored at N pointing to N'. Every so often, the mobile host sends a message to its permanent home node with its current location. Then, the home node propagates a message down the forwarding chain, asking the intermediate nodes to delete their forwarding state.
7.24c. Can Alyssa ever lose packets with this protocol? Explain. (Hint: think about the properties of the underlying network.)
7.24d. What additional steps can the home node take to ensure that the scheme in question 7.24c never loses packets? 1996–2–2

* Credit for developing exercise 7.24 goes to Anant Agarwal.

7.25 ByteStream Inc. sells three data-transfer products: Send-and-wait, Blast, and Flow-control. Mike R. Kernel is deciding which product to use. The protocols work as follows:
• Send-and-wait sends one segment of a message and then waits for an acknowledgment before sending the next segment.
• Flow-control uses a sliding window of 8 segments. The sender sends until the window closes (i.e., until there are 8 unacknowledged segments). The receiver sends an acknowledgment as soon as it receives a segment. Each acknowledgment opens the sender's window with one segment.
• Blast uses only one acknowledgment. The sender blasts all the segments of a message to the receiver as fast as the network layer can accept them. The last segment of the blast contains a bit indicating that it is the last segment of the message. After sending all segments in a single blast, the sender waits for one acknowledgment from the receiver. The receiver sends an acknowledgment as soon as it receives the last segment.
Mike asks you to help him compute for each protocol its maximum throughput. He is planning to use a 1,000,000 bytes per second network that has a packet size of 1,000 bytes. The propagation time from the sender to the receiver is 500 microseconds. To simplify the calculation, Mike suggests making the following approximations: (1) there is no processing time at the sender and the receiver; (2) the time to send an acknowledgment is just the propagation time (number of data

bytes in an ACK is zero); (3) the data segments are always 1,000 bytes; and (4) all headers are zero-length. He also assumes that the underlying communication medium is perfect (frames are not lost, frames are not duplicated, etc.) and that the receiver has unlimited buffering.
7.25a. What is the maximum throughput for Send-and-wait?
7.25b. What is the maximum throughput for Flow-control?
7.25c. What is the maximum throughput for Blast?

Mike needs to choose one of the three protocols for an application which periodically sends arbitrary-sized messages. He has a reliable network, but his application involves unpredictable computation times at both the sender and the receiver. And this time the receiver has a 20,000-byte receive buffer.
7.25d. Which product should he choose for maximum reliable operation?
A. Send-and-wait, the others might hang.
B. Blast, which outperforms the others.
C. Flow-control, since Blast will be unreliable and Send-and-wait is slower.
D. There is no way to tell from the information given.
1997–2–2

7.26 Suppose the longest packet you can transmit across the Internet can contain 480 bytes of useful data, you are using a lock-step end-to-end protocol, and you are sending data from Boston to California. You have measured the round-trip time and found that it is about 100 milliseconds.
7.26a. If there are no lost packets, estimate the maximum data rate you can achieve.
7.26b. Unfortunately, 1% of the packets are getting lost. So you install a resend timer, set to 1000 milliseconds. Estimate the data rate you now expect to achieve.
7.26c. On Tuesdays the phone company routes some westward-bound packets via satellite link, and we notice that 50% of the round trips now take exactly 100 extra milliseconds. What effect does this delay have on the overall data rate when the resend timer is not in use? (Assume the network does not lose any packets.)
7.26d. Ben turns on the resend timer, but since he hadn't heard about the satellite delays he sets it to 150 milliseconds. What now is the data rate on Tuesdays? (Again, assume the network does not lose any packets.)
7.26e. Usually, when discussing end-to-end data rate across a network, the first parameter one hears is the data rate of the slowest link in the network. Why wasn't that parameter needed to answer any of the previous parts of this question?
1994–1–5

7.27 Ben Bitdiddle is called in to consult for Microhard. Bill Doors, the CEO, has set up an application to control the Justice department in Washington, D.C. The client running on the TNT operating system makes RPC calls from Seattle to the server running in Washington, D.C. The server also runs on TNT (surprise!). Each RPC call instructs the Justice department on how to behave; the response acknowledges the request but contains no data (the Justice department always complies with requests from Microhard). Bill Doors, however, is unhappy with the number of requests that he can send to the Justice department. He therefore wants to improve TNT's communication facilities.
Ben observes that the Microhard application runs in a single thread and uses RPC. He also notices that the link between Seattle and Washington, D.C. is reliable. He then proposes that Microhard enhance TNT with a new communication primitive, pipe calls. Like RPCs, pipe calls initiate remote computation on the server. Unlike RPCs, however, pipe calls return immediately to the caller and execute asynchronously on the server. TNT packs multiple pipe calls into request messages that are 1000 bytes long. TNT sends the request message to the server as soon as one of the following two conditions becomes true: 1) the message is full, or 2) the message contains at least 1 pipe call and it has been 1 second since the client last performed a pipe call. Pipe calls have no acknowledgments. Pipe calls are not synchronized with respect to RPC calls.
Ben quickly settles down to work and measures the network traffic between Seattle and Washington. Here is what he observes:
    Seattle to D.C. transit time:              12.5 × 10^-3 seconds
    D.C. to Seattle transit time:              12.5 × 10^-3 seconds
    Channel bandwidth in each direction:       1.5 × 10^6 bits per second
    RPC or Pipe data per call:                 10 bytes
    Network overhead per message:              40 bytes
    Size of RPC request message (per call):    50 bytes = 10 bytes data + 40 bytes overhead
    Size of pipe request message:              1000 bytes (96 pipe calls per message)
    Size of RPC reply message (no data):       50 bytes
    Client computation time per request:       100 × 10^-6 seconds
    Server computation time per request:       50 × 10^-6 seconds

The Microhard application is the only one sending messages on the link.
7.27a. What is the transmission delay the client thread observes in sending an RPC request message?
7.27b. Assuming that only RPCs are used for remote requests, what is the maximum number of RPCs per second that will be executed by this application?
7.27c. Assuming that all RPC calls are changed to pipe calls, what is the maximum number of pipe calls per second that will be executed by this application?
7.27d. Assuming that every pipe call includes a serial number argument, and serial numbers increase by one with every pipe call, how could you know the last pipe call was executed?
A. Ensure that serial numbers are synchronized to the time of day clock, and wait at the client until the time of the last serial number.
B. Call an RPC both before and after the pipe call, and wait for both calls to return.
C. Call an RPC passing as an argument the serial number that was sent on the last pipe call, and design the remote procedure called to not return until a pipe call with a given serial number had been processed.
D. Stop making pipe calls for twice the maximum network delay, and reset the serial number counter to zero.
1998–1–2a…d

7.28 Alyssa P. Hacker is implementing a client/service spell checker in which a network will stand between the client and the service. The client scans an ASCII file, sending each word to the service in a separate message. The service checks each word against its database of correctly spelled words and returns a one-bit answer. The client displays the list of incorrectly spelled words.
7.28a. The client's cost for preparing a message to be sent is 1 millisecond, regardless of length. The network transit time is 10 milliseconds, and the network data rate is infinite. The service can look up a word and determine whether or not it is misspelled in 100 microseconds. Since the service runs on a supercomputer, its cost for preparing a message to be sent is zero milliseconds. Both the client and service can receive messages with no overhead. How long will Alyssa's design take to spell check a 1,000-word file if she uses RPC for communication (ignore acknowledgments to requests and replies, and assume that messages are not lost or reordered)?
7.28b. Alyssa does the same computations that you did and decides that the design is too slow. She decides to group several words into each request. If she packs 10 words in each request, how long will it take to spell check the same file?
7.28c. Alyssa decides that grouping words still isn't fast enough, so she wants to know how long it would take if she used an asynchronous message protocol (with

grouping words) instead of RPC. How long will it take to spell check the same file? (For this calculation, assume that messages are not lost or reordered.)
7.28d. Alyssa is so pleased with the performance of this last design that she decides to use it (without grouping) for a banking system. The service maintains a set of accounts and processes requests to debit and credit accounts (i.e., modify account balances). One day Alyssa deposits $10,000 and transfers it to Ben's account immediately afterwards. The transfer fails with a reply saying she is overdrawn. But when she checks her balance afterwards, the $10,000 is there! Draw a time diagram explaining these events.
1996–1–4a…d

Additional exercises relating to Chapter 7 can be found in problem sets 17 through 25.

CHAPTER 8
Fault Tolerance: Reliable Systems from Unreliable Components

CHAPTER CONTENTS
Overview..........................................................................................8–2

8.1 Faults, Failures, and Fault Tolerant Design................................8–3

8.1.1 Faults, Failures, and Modules ................................................. 8–3

8.1.2 The Fault-Tolerance Design Process ........................................ 8–6

8.2 Measures of Reliability and Failure Tolerance............................8–8

8.2.1 Availability and Mean Time to Failure ...................................... 8–8

8.2.2 Reliability Functions ............................................................ 8–13

8.2.3 Measuring Fault Tolerance ................................................... 8–16

8.3 Tolerating Active Faults...........................................................8–16

8.3.1 Responding to Active Faults ................................................. 8–16

8.3.2 Fault Tolerance Models ........................................................ 8–18

8.4 Systematically Applying Redundancy ......................................8–20

8.4.1 Coding: Incremental Redundancy ......................................... 8–21

8.4.2 Replication: Massive Redundancy ......................................... 8–25

8.4.3 Voting .............................................................................. 8–26

8.4.4 Repair .............................................................................. 8–31

8.5 Applying Redundancy to Software and Data ............................8–36

8.5.1 Tolerating Software Faults ................................................... 8–36

8.5.2 Tolerating Software (and other) Faults by Separating State ...... 8–37

8.5.3 Durability and Durable Storage ............................................ 8–39

8.5.4 Magnetic Disk Fault Tolerance .............................................. 8–40

8.5.4.1 Magnetic Disk Fault Modes ............................................ 8–41

8.5.4.2 System Faults ............................................................. 8–42

8.5.4.3 Raw Disk Storage ........................................................ 8–43

8.5.4.4 Fail-Fast Disk Storage................................................... 8–43

8.5.4.5 Careful Disk Storage .................................................... 8–45

8.5.4.6 Durable Storage: RAID 1 .............................................. 8–46

8.5.4.7 Improving on RAID 1 ................................................... 8–47

8.5.4.8 Detecting Errors Caused by System Crashes.................... 8–49

8.5.4.9 Still More Threats to Durability ...................................... 8–49

8.6 Wrapping up Reliability ...........................................................8–51

8.6.1 Design Strategies and Design Principles ................................ 8–51

8.6.2 How about the End-to-End Argument? .................................. 8–52

8.6.3 A Caution on the Use of Reliability Calculations ...................... 8–53

8.6.4 Where to Learn More about Reliable Systems ......................... 8–53

8.7 Application: A Fault Tolerance Model for CMOS RAM ...............8–55

8.8 War Stories: Fault Tolerant Systems that Failed......................8–57

8.8.1 Adventures with Error Correction ......................................... 8–57

8.8.2 Risks of Rarely-Used Procedures: The National Archives .......... 8–59

8.8.3 Non-independent Replicas and Backhoe Fade ......................... 8–60

8.8.4 Human Error May Be the Biggest Risk ................................... 8–61

8.8.5 Introducing a Single Point of Failure ..................................... 8–63

8.8.6 Multiple Failures: The SOHO Mission Interruption ................... 8–63

Exercises........................................................................................8–64
Glossary for Chapter 8 ...................................................................8–69
Index of Chapter 8 .........................................................................8–75
Last chapter page 8–77

Overview
Construction of reliable systems from unreliable components is one of the most important applications of modularity. There are, in principle, three basic steps to building reliable systems:
1. Error detection: discovering that there is an error in a data value or control signal. Error detection is accomplished with the help of redundancy, extra information that can verify correctness.
2. Error containment: limiting how far the effects of an error propagate. Error containment comes from careful application of modularity. When discussing reliability, a module is usually taken to be the unit that fails independently of other such units. It is also usually the unit of repair and replacement.
3. Error masking: ensuring correct operation despite the error. Error masking is accomplished by providing enough additional redundancy that it is possible to discover correct, or at least acceptably close, values of the erroneous data or control signal. When masking involves changing incorrect values to correct ones, it is usually called error correction.
Since these three steps can overlap in practice, one sometimes finds a single error-handling mechanism that merges two or even all three of the steps. In earlier chapters each of these ideas has already appeared in specialized forms:
• A primary purpose of enforced modularity, as provided by client/server architecture, virtual memory, and threads, is error containment.

• Network links typically use error detection to identify and discard damaged frames.
• Some end-to-end protocols time out and resend lost data segments, thus masking the loss.
• Routing algorithms find their way around links that fail, masking those failures.
• Some real-time applications fill in missing data by interpolation or repetition, thus masking loss.
and, as we will see in Chapter 11 [on-line], secure systems use a technique called defense in depth both to contain and to mask errors in individual protection mechanisms.
In this chapter we explore systematic application of these techniques to more general problems, as well as learn about both their power and their limitations.
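The first and third of these steps can be seen in miniature in ordinary code. The following sketch is an illustration only, not an excerpt from any system described in this book: it uses a checksum as the extra information for error detection, and triplicated copies with majority voting for error masking. The function names and the stored data are invented for the example.

    import hashlib

    def store_with_checksum(value: bytes):
        # Redundancy for detection: keep a checksum alongside the data.
        return value, hashlib.sha256(value).digest()

    def detect_error(value: bytes, checksum: bytes) -> bool:
        # An error is detected when the recomputed checksum disagrees.
        return hashlib.sha256(value).digest() != checksum

    def store_replicated(value: bytes):
        # Redundancy for masking: keep three independent copies.
        return [value, value, value]

    def read_with_masking(replicas):
        # Majority voting masks a single corrupted replica (error correction).
        for candidate in replicas:
            if sum(1 for r in replicas if r == candidate) >= 2:
                return candidate
        raise RuntimeError("too many errors to mask")

A checksum by itself can only detect; it is the additional copies that make masking possible, which is the pattern developed systematically in Section 8.4.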

8.1 Faults, Failures, and Fault Tolerant Design
8.1.1 Faults, Failures, and Modules
Before getting into the techniques of constructing reliable systems, let us distinguish between concepts and give them separate labels. In ordinary English discourse, the three words "fault," "failure," and "error" are used more or less interchangeably or at least with strongly overlapping meanings. In discussing reliable systems, we assign these terms to distinct formal concepts. The distinction involves modularity. Although common English usage occasionally intrudes, the distinctions are worth maintaining in technical settings.
A fault is an underlying defect, imperfection, or flaw that has the potential to cause problems, whether it actually has, has not, or ever will. A weak area in the casing of a tire is an example of a fault. Even though the casing has not actually cracked yet, the fault is lurking. If the casing cracks, the tire blows out, and the car careens off a cliff, the resulting crash is a failure. (That definition of the term "failure" by example is too informal; we will give a more careful definition in a moment.) One fault that underlies the failure is the weak spot in the tire casing. Other faults, such as an inattentive driver and lack of a guard rail, may also contribute to the failure.
Experience suggests that faults are commonplace in computer systems. Faults come from many different sources: software, hardware, design, implementation, operations, and the environment of the system. Here are some typical examples:
• Software fault: A programming mistake, such as placing a less-than sign where there should be a less-than-or-equal sign. This fault may never have caused any trouble because the combination of events that requires the equality case to be handled correctly has not yet occurred. Or, perhaps it is the reason that the system crashes twice a day. If so, those crashes are failures.

• Hardware fault: A gate whose output is stuck at the value ZERO. Until something depends on the gate correctly producing the output value ONE, nothing goes wrong. If you publish a paper with an incorrect sum that was calculated by this gate, a failure has occurred. Furthermore, the paper now contains a fault that may lead some reader to do something that causes a failure elsewhere.
• Design fault: A miscalculation that has led to installing too little memory in a telephone switch. It may be months or years until the first time that the presented load is great enough that the switch actually begins failing to accept calls that its specification says it should be able to handle.
• Implementation fault: Installing less memory than the design called for. In this case the failure may be identical to the one in the previous example of a design fault, but the fault itself is different.
• Operations fault: The operator responsible for running the weekly payroll ran the payroll program twice last Friday. Even though the operator shredded the extra checks, this fault has probably filled the payroll database with errors such as wrong values for year-to-date tax payments.
• Environment fault: Lightning strikes a power line, causing a voltage surge. The computer is still running, but a register that was being updated at that instant now has several bits in error. Environment faults come in all sizes, from bacteria contaminating ink-jet printer cartridges to a storm surge washing an entire building out to sea.
Some of these examples suggest that a fault may either be latent, meaning that it isn't affecting anything right now, or active. When a fault is active, wrong results appear in data values or control signals. These wrong results are errors. If one has a formal specification for the design of a module, an error would show up as a violation of some assertion or invariant of the specification. The violation means that either the formal specification is wrong (for example, someone didn't articulate all of the assumptions) or a module that this component depends on did not meet its own specification. Unfortunately, formal specifications are rare in practice, so discovery of errors is more likely to be somewhat ad hoc.
If an error is not detected and masked, the module probably does not perform to its specification. Not producing the intended result at an interface is the formal definition of a failure. Thus, the distinction between fault and failure is closely tied to modularity and the building of systems out of well-defined subsystems. In a system built of subsystems, the failure of a subsystem is a fault from the point of view of the larger subsystem that contains it. That fault may cause an error that leads to the failure of the larger subsystem, unless the larger subsystem anticipates the possibility of the first one failing, detects the resulting error, and masks it. Thus, if you notice that you have a flat tire, you have detected an error caused by failure of a subsystem you depend on. If you miss an appointment because of the flat tire, the person you intended to meet notices a failure of

a larger subsystem. If you change to a spare tire in time to get to the appointment, you have masked the error within your subsystem. Fault tolerance thus consists of noticing active faults and component subsystem failures and doing something helpful in response.
One such helpful response is error containment, which is another close relative of modularity and the building of systems out of subsystems. When an active fault causes an error in a subsystem, it may be difficult to confine the effects of that error to just a portion of the subsystem. On the other hand, one should expect that, as seen from outside that subsystem, the only effects will be at the specified interfaces of the subsystem. In consequence, the boundary adopted for error containment is usually the boundary of the smallest subsystem inside which the error occurred. From the point of view of the next higher-level subsystem, the subsystem with the error may contain the error in one of four ways:
1. Mask the error, so the higher-level subsystem does not realize that anything went wrong. One can think of failure as falling off a cliff and masking as a way of providing some separation from the edge.
2. Detect and report the error at its interface, producing what is called a fail-fast design. Fail-fast subsystems simplify the job of detection and masking for the next higher-level subsystem. If a fail-fast module correctly reports that its output is questionable, it has actually met its specification, so it has not failed. (Fail-fast modules can still fail, for example by not noticing their own errors.)
3. Immediately stop dead, thereby hoping to limit propagation of bad values, a technique known as fail-stop. Fail-stop subsystems require that the higher-level subsystem take some additional measure to discover the failure, for example by setting a timer and responding to its expiration. A problem with fail-stop design is that it can be difficult to distinguish a stopped subsystem from one that is merely running more slowly than expected. This problem is particularly acute in asynchronous systems.
4. Do nothing, simply failing without warning. At the interface, the error may have contaminated any or all output values. (Informally called a "crash" or perhaps "fail-thud".)
Another useful distinction is that of transient versus persistent faults. A transient fault, also known as a single-event upset, is temporary, triggered by some passing external event such as lightning striking a power line or a cosmic ray passing through a chip. It is usually possible to mask an error caused by a transient fault by trying the operation again. An error that is successfully masked by retry is known as a soft error. A persistent fault continues to produce errors, no matter how many times one retries, and the corresponding errors are called hard errors. An intermittent fault is a persistent fault that is active only occasionally, for example, when the noise level is higher than usual but still within specifications. Finally, it is sometimes useful to talk about latency, which in reliability terminology is the time between when a fault causes an error and when the error is

detected or causes the module to fail. Latency can be an important parameter because some error-detection and error-masking mechanisms depend on there being at most a small fixed number of errors—often just one—at a time. If the error latency is large, there may be time for a second error to occur before the first one is detected and masked, in which case masking of the first error may not succeed. Also, a large error latency gives time for the error to propagate and may thus complicate containment.
Using this terminology, an improperly fabricated stuck-at-ZERO bit in a memory chip is a persistent fault: whenever the bit should contain a ONE the fault is active and the value of the bit is in error; at times when the bit is supposed to contain a ZERO, the fault is latent. If the chip is a component of a fault tolerant memory module, the module design probably includes an error-correction code that prevents that error from turning into a failure of the module. If a passing cosmic ray flips another bit in the same chip, a transient fault has caused that bit also to be in error, but the same error-correction code may still be able to prevent this error from turning into a module failure. On the other hand, if the error-correction code can handle only single-bit errors, the combination of the persistent and the transient fault might lead the module to produce wrong data across its interface, a failure of the module. If someone were then to test the module by storing new data in it and reading it back, the test would probably not reveal a failure because the transient fault does not affect the new data. Because simple input/output testing does not reveal successfully masked errors, a fault tolerant module design should always include some way to report that the module masked an error. If it does not, the user of the module may not realize that persistent errors are accumulating but hidden.
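The advice that a module should report the errors it masks can be made concrete with a small sketch. This is an illustration only, not a design from the text: a wrapper masks suspected soft errors by retrying an operation a bounded number of times, counts the retries so that masked errors do not stay hidden, and otherwise behaves as a fail-fast module. The operation, the retry limit, and the log format are all invented for the example.

    import logging

    masked_error_count = 0  # visible to whoever reads the module's statistics

    def read_with_retry(read_once, max_tries=3):
        """Mask transient (soft) errors by retrying; report what was masked."""
        global masked_error_count
        last_error = None
        for attempt in range(max_tries):
            try:
                value = read_once()
                if attempt > 0:
                    # The error was masked; record it so it does not stay hidden.
                    masked_error_count += 1
                    logging.warning("masked a soft error after %d retries", attempt)
                return value
            except IOError as error:   # a persistent fault will keep raising
                last_error = error
        # Retries exhausted: report the error at the interface (fail-fast).
        raise IOError("unmasked error, giving up") from last_error

Watching the masked-error count grow is exactly the kind of safety-margin monitoring that the design process described next calls for.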

8.1.2 The Fault-Tolerance Design Process
One way to design a reliable system would be to build it entirely of components that are individually so reliable that their chance of failure can be neglected. This technique is known as fault avoidance. Unfortunately, it is hard to apply this technique to every component of a large system. In addition, the sheer number of components may defeat the strategy. If all N of the components of a system must work, the probability of any one component failing is p, and component failures are independent of one another, then the probability that the system works is (1 – p)^N. No matter how small p may be, there is some value of N beyond which this probability becomes too small for the system to be useful.
The alternative is to apply various techniques that are known collectively by the name fault tolerance. The remainder of this chapter describes several such techniques that are the elements of an overall design process for building reliable systems from unreliable components. Here is an overview of the fault-tolerance design process:
1. Begin to develop a fault-tolerance model, as described in Section 8.3:
• Identify every potential fault.
• Estimate the risk of each fault, as described in Section 8.2.
• Where the risk is too high, design methods to detect the resulting errors.

2. Apply modularity to contain the damage from the high-risk errors.
3. Design and implement procedures that can mask the detected errors, using the techniques described in Section 8.4:
• Temporal redundancy. Retry the operation, using the same components.
• Spatial redundancy. Have different components do the operation.
4. Update the fault-tolerance model to account for those improvements.
5. Iterate the design and the model until the probability of untolerated faults is low enough that it is acceptable.
6. Observe the system in the field:
• Check logs of how many errors the system is successfully masking. (Always keep track of the distance to the edge of the cliff.)
• Perform postmortems on failures and identify all of the reasons for each failure.
7. Use the logs of masked faults and the postmortem reports about failures to revise and improve the fault-tolerance model and reiterate the design.
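A quick numerical check of the (1 – p)^N expression given before this list shows why fault avoidance alone runs out of steam as systems grow. The sketch below is illustrative only; the per-component failure probability and component counts are invented for the example.

    # Probability that a system of N components works when every component
    # must work and each fails independently with probability p.
    def p_system_works(p: float, n: int) -> float:
        return (1.0 - p) ** n

    p = 1e-4  # an assumed per-component failure probability
    for n in (10, 1_000, 100_000):
        print(f"N = {n:>7}: P(system works) = {p_system_works(p, n):.6f}")

    # With p = 1e-4 this prints roughly 0.999000, 0.904833, and 0.000045:
    # even quite reliable components cannot, by themselves, carry a large N.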

The fault-tolerance design process includes some subjective steps, for example, deciding that a risk of failure is "unacceptably high" or that the "probability of an untolerated fault is low enough that it is acceptable." It is at these points that different application requirements can lead to radically different approaches to achieving reliability. A personal computer may be designed with no redundant components, the computer system for a small business is likely to make periodic backup copies of most of its data and store the backup copies at another site, and some space-flight guidance systems use five completely redundant computers designed by at least two independent vendors. The decisions required involve trade-offs between the cost of failure and the cost of implementing fault tolerance. These decisions can blend into decisions involving business models and risk management. In some cases it may be appropriate to opt for a nontechnical solution, for example, deliberately accepting an increased risk of failure and covering that risk with insurance.
The fault-tolerance design process can be described as a safety-net approach to system design. The safety-net approach involves application of some familiar design principles and also some not previously encountered. It starts with a new design principle:
Be explicit
Get all of the assumptions out on the table.

The primary purpose of creating a fault-tolerance model is to expose and document the assumptions and articulate them explicitly. The designer needs to have these assumptions not only for the initial design, but also in order to respond to field reports of

unexpected failures. Unexpected failures represent omissions or violations of the assumptions.
Assuming that you won't get it right the first time, the second design principle of the safety-net approach is the familiar design for iteration. It is difficult or impossible to anticipate all of the ways that things can go wrong. Moreover, when working with a fast-changing technology it can be hard to estimate probabilities of failure in components and in their organization, especially when the organization is controlled by software. For these reasons, a fault tolerant design must include feedback about actual error rates, evaluation of that feedback, and update of the design as field experience is gained. These two principles interact: to act on the feedback requires having a fault tolerance model that is explicit about reliability assumptions.
The third design principle of the safety-net approach is also familiar: the safety margin principle, described near the end of Section 1.3.2. An essential part of a fault tolerant design is to monitor how often errors are masked. When fault tolerant systems fail, it is usually not because they had inadequate fault tolerance, but because the number of failures grew unnoticed until the fault tolerance of the design was exceeded. The key requirement is that the system log all failures and that someone pay attention to the logs. The biggest difficulty to overcome in applying this principle is that it is hard to motivate people to expend effort checking something that seems to be working.
The fourth design principle of the safety-net approach came up in the introduction to the study of systems; it shows up here in the instruction to identify all of the causes of each failure: keep digging. Complex systems fail for complex reasons. When a failure of a system that is supposed to be reliable does occur, always look beyond the first, obvious cause. It is nearly always the case that there are actually several contributing causes and that there was something about the mind set of the designer that allowed each of those causes to creep in to the design.
Finally, complexity increases the chances of mistakes, so it is an enemy of reliability. The fifth design principle embodied in the safety-net approach is to adopt sweeping simplifications. This principle does not show up explicitly in the description of the fault-tolerance design process, but it will appear several times as we go into more detail.
The safety-net approach is applicable not just to fault tolerant design. Chapter 11 [on-line] will show that the safety-net approach is used in an even more rigorous form in designing systems that must protect information from malicious actions.

8.2 Measures of Reliability and Failure Tolerance
8.2.1 Availability and Mean Time to Failure
A useful model of a system or a system component, from a reliability point of view, is that it operates correctly for some period of time and then it fails. The time to failure (TTF) is thus a measure of interest, and it is something that we would like to be able to predict. If a higher-level module does not mask the failure and the failure is persistent,

the system cannot be used until it is repaired, perhaps by replacing the failed component, so we are equally interested in the time to repair (TTR). If we observe a system through N run–fail–repair cycles and observe in each cycle i the values of TTFi and TTRi, we can calculate the fraction of time it operated properly, a useful measure known as availability:

\[
\text{Availability} \;=\; \frac{\text{time system was running}}{\text{time system should have been running}} \;=\; \frac{\displaystyle\sum_{i=1}^{N} TTF_i}{\displaystyle\sum_{i=1}^{N} \left( TTF_i + TTR_i \right)} \tag{Eq. 8–1}
\]

By separating the denominator of the availability expression into two sums and dividing each by N (the number of observed failures) we obtain two time averages that are frequently reported as operational statistics: the mean time to failure (MTTF) and the mean time to repair (MTTR):

\[
MTTF \;=\; \frac{1}{N} \sum_{i=1}^{N} TTF_i \qquad\qquad MTTR \;=\; \frac{1}{N} \sum_{i=1}^{N} TTR_i \tag{Eq. 8–2}
\]

The sum of these two statistics is usually called the mean time between failures (MTBF). Thus availability can be variously described as

\[
\text{Availability} \;=\; \frac{MTTF}{MTBF} \;=\; \frac{MTTF}{MTTF + MTTR} \;=\; \frac{MTBF - MTTR}{MTBF} \tag{Eq. 8–3}
\]

In some situations, it is more useful to measure the fraction of time that the system is not working, known as its down time:

\[
\text{Down time} \;=\; (1 - \text{Availability}) \;=\; \frac{MTTR}{MTBF} \tag{Eq. 8–4}
\]
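A small worked example shows how these definitions fit together. The observed times below are invented for the illustration; the sketch computes MTTF, MTTR, MTBF, availability, and down time directly from Eq. 8–1 through Eq. 8–4.

    # Hypothetical observations of three run-fail-repair cycles, in hours.
    ttf = [900.0, 1100.0, 1000.0]   # time to failure in each cycle
    ttr = [2.0, 4.0, 3.0]           # time to repair in each cycle

    n = len(ttf)
    mttf = sum(ttf) / n                                 # Eq. 8-2
    mttr = sum(ttr) / n                                 # Eq. 8-2
    mtbf = mttf + mttr                                  # mean time between failures
    availability = sum(ttf) / (sum(ttf) + sum(ttr))     # Eq. 8-1 (= MTTF / MTBF)
    down_time = 1.0 - availability                      # Eq. 8-4 (= MTTR / MTBF)

    print(f"MTTF = {mttf:.1f} h, MTTR = {mttr:.1f} h, MTBF = {mtbf:.1f} h")
    print(f"Availability = {availability:.5f}, down time = {down_time:.5f}")
    # With these numbers: MTTF = 1000.0 h, MTTR = 3.0 h, MTBF = 1003.0 h,
    # availability is about 0.99701 and down time about 0.00299.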

One thing that the definition of down time makes clear is that MTTR and MTBF are in some sense equally important. One can reduce down time either by reducing MTTR or by increasing MTBF.
Components are often repaired by simply replacing them with new ones. When failed components are discarded rather than fixed and returned to service, it is common to use a slightly different method to measure MTTF. The method is to place a batch of N components in service in different systems (or in what is hoped to be an equivalent test environment), run them until they have all failed, and use the set of failure times as the TTFi in equation 8–2. This procedure substitutes an ensemble average for the time average. We could use this same procedure on components that are not usually discarded when they fail, in the hope of determining their MTTF more quickly, but we might obtain a different value for the MTTF. Some failure processes do have the property that the ensemble average is the same as the time average (processes with this property are

called ergodic), but other failure processes do not. For example, the repair itself may cause wear, tear, and disruption to other parts of the system, in which case each successive system failure might on average occur sooner than did the previous one. If that is the case, an MTTF calculated from an ensemble-average measurement might be too optimistic.
As we have defined them, availability, MTTF, MTTR, and MTBF are backward-looking measures. They are used for two distinct purposes: (1) for evaluating how the system is doing (compared, for example, with predictions made when the system was designed) and (2) for predicting how the system will behave in the future. The first purpose is concrete and well defined. The second requires that one take on faith that samples from the past provide an adequate predictor of the future, which can be a risky assumption. There are other problems associated with these measures. While MTTR can usually be measured in the field, the more reliable a component or system the longer it takes to evaluate its MTTF, so that measure is often not directly available. Instead, it is common to use and measure proxies to estimate its value. The quality of the resulting estimate of availability then depends on the quality of the proxy.
A typical 3.5-inch magnetic disk comes with a reliability specification of 300,000 hours "MTTF", which is about 34 years. Since the company quoting this number has probably not been in business that long, it is apparent that whatever they are calling "MTTF" is not the same as either the time-average or the ensemble-average MTTF that we just defined. It is actually a quite different statistic, which is why we put quotes around its name. Sometimes this "MTTF" is a theoretical prediction obtained by modeling the ways that the components of the disk might be expected to fail and calculating an expected time to failure. A more likely possibility is that the manufacturer measured this "MTTF" by running an array of disks simultaneously for a much shorter time and counting the number of failures.
For example, suppose the manufacturer ran 1,000 disks for 3,000 hours (about four months) each, and during that time 10 of the disks failed. The observed failure rate of this sample is 1 failure for every 300,000 hours of operation. The next step is to invert the failure rate to obtain 300,000 hours of operation per failure and then quote this number as the "MTTF". But the relation between this sample observation of failure rate and the real MTTF is problematic. If the failure process were memoryless (meaning that the failure rate is independent of time; Section 8.2.2, below, explores this idea more thoroughly), we would have the special case in which the MTTF really is the inverse of the failure rate. A good clue that the disk failure process is not memoryless is that the disk specification may also mention an "expected operational lifetime" of only 5 years. That statistic is probably the real MTTF—though even that may be a prediction based on modeling rather than a measured ensemble average. An appropriate re-interpretation of the 34-year "MTTF" statistic is to invert it and identify the result as a short-term failure rate that applies only within the expected operational lifetime. The paragraph discussing equation 8–9 on page 8–13 describes a fallacy that sometimes leads to miscalculation of statistics such as the MTTF.
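The manufacturer's calculation described above is easy to reproduce, which also makes its hidden assumption easy to see. The sketch below is an illustration only, using the same sample sizes as the example in the text; inverting an observed short-term failure rate yields the true MTTF only if the failure process really is memoryless.

    disks = 1_000          # disks in the test sample
    hours_each = 3_000     # hours each disk was run (about four months)
    failures = 10          # failures observed during the test

    device_hours = disks * hours_each          # 3,000,000 device-hours of operation
    failure_rate = failures / device_hours     # failures per device-hour
    quoted_mttf_hours = 1.0 / failure_rate     # 300,000 hours, the quoted "MTTF"

    print(f"observed failure rate = {failure_rate:.2e} per hour")
    print(f"quoted 'MTTF' = {quoted_mttf_hours:,.0f} hours "
          f"(~{quoted_mttf_hours / 8766:.0f} years)")
    # Inverting the rate equates "MTTF" with 1/rate, which is valid only for a
    # memoryless failure process; the 5-year expected operational lifetime is a
    # strong hint that real disks are not memoryless.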
Magnetic disks, light bulbs, and many other components exhibit a time-varying statistical failure rate known as a bathtub curve, illustrated in Figure 8.1 and defined more

carefully in Section 8.2.2, below. When components come off the production line, a certain fraction fail almost immediately because of gross manufacturing defects. Those components that survive this initial period usually run for a long time with a relatively uniform failure rate. Eventually, accumulated wear and tear cause the failure rate to increase again, often quite rapidly, producing a failure rate plot that resembles the shape of a bathtub.
Several other suggestive and colorful terms describe these phenomena. Components that fail early are said to be subject to infant mortality, and those that fail near the end of their expected lifetimes are said to burn out. Manufacturers sometimes burn in such components by running them for a while before shipping, with the intent of identifying and discarding the ones that would otherwise fail immediately upon being placed in service. When a vendor quotes an "expected operational lifetime," it is probably the mean time to failure of those components that survive burn in, while the much larger "MTTF" number is probably the inverse of the observed failure rate at the lowest point of the bathtub. (The published numbers also sometimes depend on the outcome of a debate between the legal department and the marketing department, but that gets us into a different topic.) A chip manufacturer describes the fraction of components that survive the burn-in period as the yield of the production line. Component manufacturers usually exhibit a phenomenon known informally as a learning curve, which simply means that the first components coming out of a new production line tend to have more failures than later ones. The reason is that manufacturers design for iteration: upon seeing and analyzing failures in the early production batches, the production line designer figures out how to refine the manufacturing process to reduce the infant mortality rate.
One job of the system designer is to exploit the nonuniform failure rates predicted by the bathtub and learning curves. For example, a conservative designer exploits the learning curve by avoiding the latest generation of hard disks in favor of slightly older designs that have accumulated more field experience. One can usually rely on other designers who may be concerned more about cost or performance than availability to shake out the bugs in the newest generation of disks.

[Figure: plot of the conditional failure rate, h(t), versus time, t, shaped like a bathtub.]
FIGURE 8.1 A bathtub curve, showing how the conditional failure rate of a component changes with time.
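One simple way to see how a bathtub shape can arise is to add three conditional-failure-rate contributions: a quickly decaying infant-mortality term, a roughly constant floor, and a slowly growing wear-out term. The sketch below is only an illustration of that idea, not the formal definition given in Section 8.2.2, and every constant in it is invented for the example.

    import math

    def bathtub_hazard(t_years: float) -> float:
        """Illustrative conditional failure rate (per year) at age t."""
        infant_mortality = 0.30 * math.exp(-t_years / 0.5)   # decays quickly
        steady_state     = 0.02                               # roughly constant floor
        wear_out         = 0.001 * math.exp(t_years / 2.0)    # grows with age
        return infant_mortality + steady_state + wear_out

    for t in (0.1, 1, 3, 5, 8, 12):
        print(f"age {t:>4} years: h(t) = {bathtub_hazard(t):.3f} per year")
    # The printed rates fall at first, stay nearly flat for a while, and then
    # climb again, tracing the bathtub shape of Figure 8.1.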


The 34-year “MTTF” disk drive specification may seem like public relations puffery in the face of the specification of a 5-year expected operational lifetime, but these two numbers actually are useful as a measure of the nonuniformity of the failure rate. This nonuniformity is also susceptible to exploitation, depending on the operation plan. If the operation plan puts the component in a system such as a satellite, in which it will run until it fails, the designer would base system availability and reliability estimates on the 5-year figure. On the other hand, the designer of a ground-based storage system, mindful that the 5-year operational lifetime identifies the point where the conditional failure rate starts to climb rapidly at the far end of the bathtub curve, might include a plan to replace perfectly good hard disks before burn-out begins to dominate the failure rate—in this case, perhaps every 3 years. Since one can arrange to do scheduled replacement at conve­ nient times, for example, when the system is down for another reason, or perhaps even without bringing the system down, the designer can minimize the effect on system avail­ ability. The manufacturer’s 34-year “MTTF”, which is probably the inverse of the observed failure rate at the lowest point of the bathtub curve, then can be used as an esti­ mate of the expected rate of unplanned replacements, although experience suggests that this specification may be a bit optimistic. Scheduled replacements are an example of pre­ ventive maintenance, which is active intervention intended to increase the mean time to failure of a module or system and thus improve availability. For some components, observed failure rates are so low that MTTF is estimated by accelerated aging. This technique involves making an educated guess about what the dominant underlying cause of failure will be and then amplifying that cause. For exam­ ple, it is conjectured that failures in recordable Compact Disks are heat-related. A typical test scenario is to store batches of recorded CDs at various elevated temperatures for sev­ eral months, periodically bringing them out to test them and count how many have failed. One then plots these failure rates versus temperature and extrapolates to estimate what the failure rate would have been at room temperature. Again making the assump­ tion that the failure process is memoryless, that failure rate is then inverted to produce an MTTF. Published MTTFs of 100 years or more have been obtained this way. If the dominant fault mechanism turns out to be something else (such as bacteria munching on the plastic coating) or if after 50 years the failure process turns out not to be memo­ ryless after all, an estimate from an accelerated aging study may be far wide of the mark. A designer must use such estimates with caution and understanding of the assumptions that went into them. Availability is sometimes discussed by counting the number of nines in the numerical representation of the availability measure. Thus a system that is up and running 99.9% of the time is said to have 3-nines availability. Measuring by nines is often used in mar­ keting because it sounds impressive. A more meaningful number is usually obtained by calculating the corresponding down time. A 3-nines system can be down nearly 1.5 min­ utes per day or 8 hours per year, a 5-nines system 5 minutes per year, and a 7-nines system only 3 seconds per year. 
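Converting a count of nines into allowable down time is a one-line calculation; the sketch below (plain Python, assuming a 365.25-day year) reproduces the figures quoted above.

```python
def downtime_seconds_per_year(nines: int) -> float:
    # An availability of n nines means the system is down a fraction 10**-n of the time.
    unavailability = 10.0 ** (-nines)
    return unavailability * 365.25 * 24 * 3600

for n in (3, 5, 7):
    print(n, "nines:", round(downtime_seconds_per_year(n), 1), "seconds per year")
# 3 nines: about 8.8 hours; 5 nines: about 5.3 minutes; 7 nines: about 3.2 seconds
```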
Another problem with measuring by nines is that it tells only about availability, without any information about MTTF. One 3-nines system may have a brief failure every day, while a different 3-nines system may have a single eight


hour outage once a year. Depending on the application, the difference between those two systems could be important. Any single measure should always be suspect. Finally, availability can be a more fine-grained concept. Some systems are designed so that when they fail, some functions (for example, the ability to read data) remain avail­ able, while others (the ability to make changes to the data) are not. Systems that continue to provide partial service in the face of failure are called fail-soft, a concept defined more carefully in Section 8.3.

8.2.2 Reliability Functions

The bathtub curve expresses the conditional failure rate h(t) of a module, defined to be the probability that the module fails between time t and time t + dt, given that the component is still working at time t. The conditional failure rate is only one of several closely related ways of describing the failure characteristics of a component, module, or system. The reliability, R, of a module is defined to be

R(t) = Pr(the module has not yet failed at time t, given that the module was operating at time 0)    Eq. 8–5

and the unconditional failure rate f(t) is defined to be

f(t) = Pr(module fails between t and t + dt)    Eq. 8–6

(The bathtub curve and these two reliability functions are three ways of presenting the same information. If you are rusty on probability, a brief reminder of how they are related appears in Sidebar 8.1.) Once f(t) is at hand, one can directly calculate the MTTF:

MTTF = ∫₀^∞ t ⋅ f(t) dt    Eq. 8–7

One must keep in mind that this MTTF is predicted from the failure rate function f(t), in contrast to the MTTF of eq. 8–2, which is the result of a field measurement. The two MTTFs will be the same only if the failure model embodied in f(t) is accurate. Some components exhibit relatively uniform failure rates, at least for the lifetime of the system of which they are a part. For these components the conditional failure rate, rather than resembling a bathtub, is a straight horizontal line, and the reliability function becomes a simple declining exponential:

R(t) = e^(–t ⁄ MTTF)    Eq. 8–8

This reliability function is said to be memoryless, which simply means that the conditional failure rate is independent of how long the component has been operating. Memoryless failure processes have the nice property that the conditional failure rate is the inverse of the MTTF. Unfortunately, as we saw in the case of the disks with the 34-year “MTTF”, this property is sometimes misappropriated to quote an MTTF for a component whose conditional failure rate does change with time.
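For the memoryless case these definitions are easy to check numerically. The sketch below (in Python; the 6,000-hour MTTF is simply an assumed illustration value) builds f(t), R(t), and h(t) for an exponential failure law, confirms that h(t) is constant at 1/MTTF, and approximates the integral of Eq. 8–7.

```python
from math import exp

MTTF = 6000.0   # hours; an assumed value for illustration only

def R(t):
    # Eq. 8-8: memoryless (exponential) reliability function
    return exp(-t / MTTF)

def f(t):
    # unconditional failure rate: probability density of failing at time t
    return (1.0 / MTTF) * exp(-t / MTTF)

def h(t):
    # conditional failure rate (see Sidebar 8.1): f(t) / R(t)
    return f(t) / R(t)

print(h(0.0), h(5000.0))   # both equal 1/6000: the rate does not change with age

# crude numerical check of Eq. 8-7: MTTF = integral of t * f(t) dt
dt = 1.0
print(sum(t * f(t) * dt for t in range(0, 200_000)))   # close to 6000
```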


Sidebar 8.1: Reliability functions

The failure rate function, the reliability function, and the bathtub curve (which in probability texts is called the conditional failure rate function, and which in operations research texts is called the hazard function) are actually three mathematically related ways of describing the same information. The failure rate function, f(t) as defined in equation 8–6, is a probability density function, which is everywhere non-negative and whose integral over all time is 1. Integrating the failure rate function from the time the component was created (conventionally taken to be t = 0) to the present time yields

F(t) = ∫₀^t f(t) dt

F(t) is the cumulative probability that the component has failed by time t. The cumulative probability that the component has not failed is the probability that it is still operating at time t given that it was operating at time 0, which is exactly the definition of the reliability function, R(t). That is,

R(t) = 1 – F(t)

The bathtub curve of Figure 8.1 reports the conditional probability h(t) that a failure occurs between t and t + dt, given that the component was operating at time t. By the definition of conditional probability, the conditional failure rate function is thus

h(t) = f(t) ⁄ R(t)

This misappropriation starts with a fallacy: an assumption that the MTTF, as defined in eq. 8–7, can be calculated by inverting the measured failure rate. The fallacy arises because in general,

E(1 ⁄ t) ≠ 1 ⁄ E(t)    Eq. 8–9

That is, the expected value of the inverse is not equal to the inverse of the expected value, except in certain special cases. The important special case in which they are equal is the memoryless distribution of eq. 8–8. When a random process is memoryless, calculations and measurements are so much simpler that designers sometimes forget that the same simplicity does not apply everywhere.

Just as availability is sometimes expressed in an oversimplified way by counting the number of nines in its numerical representation, reliability in component manufacturing is sometimes expressed in an oversimplified way by counting standard deviations in the observed distribution of some component parameter, such as the maximum propagation time of a gate. The usual symbol for standard deviation is the Greek letter σ (sigma), and the standard normal distribution has a standard deviation of 1.0, so saying that a component has “4.5 σ reliability” is a shorthand way of saying that the production line controls variations in that parameter well enough that the specified tolerance is 4.5 standard deviations away from the mean value, as illustrated in Figure 8.2. Suppose, for example, that a production line is manufacturing gates that are specified to have a mean propagation time of 10 nanoseconds and a maximum propagation time of 11.8 nanoseconds with 4.5 σ reliability. The difference between the mean and the maximum, 1.8 nanoseconds, is the tolerance. For that tolerance to be 4.5 σ, σ would have to be no more than 0.4 nanoseconds. To meet the specification, the production line designer would measure the actual propagation times of production line samples and, if the observed standard deviation is greater than 0.4 ns, look for ways to reduce it to that level.

Another way of interpreting “4.5 σ reliability” is to calculate the expected fraction of components that are outside the specified tolerance. That fraction is the integral of one tail of the normal distribution from 4.5 σ to ∞, which is about 3.4 × 10⁻⁶, so in our example no more than 3.4 out of each million gates manufactured would have delays greater than 11.8 nanoseconds. Unfortunately, this measure describes only the failure rate of the production line; it does not say anything about the failure rate of the component after it is installed in a system.

A currently popular quality control method, known as “Six Sigma”, is an application of two of our design principles to the manufacturing process. The idea is to use measurement, feedback, and iteration (design for iteration: “you won’t get it right the first time”) to reduce the variance (the robustness principle: “be strict on outputs”) of production-line manufacturing. The “Six Sigma” label is somewhat misleading because in the application of the method, the number 6 is allocated to deal with two quite different effects. The method sets a target of controlling the production line variance to the level of 4.5 σ, just as in the gate example of Figure 8.2. The remaining 1.5 σ is the amount that the mean output value is allowed to drift away from its original specification over the life of the production line.

FIGURE 8.2 The normal probability density function applied to production of gates that are specified to have a mean propagation time of 10 nanoseconds and a maximum propagation time of 11.8 nanoseconds. The upper numbers on the horizontal axis measure the distance from the mean in units of the standard deviation, σ; the lower numbers depict the corresponding propagation times (9.6 to 12.8 ns). The integral of the tail from 4.5 σ to ∞ is so small that it is not visible in the figure.


So even though the production line may start 6 σ away from the tolerance limit, after it has been operating for a while one may find that the failure rate has drifted upward to the same 3.4 in a million calculated for the 4.5 σ case. In manufacturing quality control literature, these applications of the two design principles are known as Taguchi methods, after their popularizer, Genichi Taguchi.
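The 3.4-per-million figure is just the upper tail of the standard normal distribution beyond 4.5 σ, which a few lines of Python can confirm (a sketch for checking the arithmetic, not part of the design method itself):

```python
from math import erfc, sqrt

# expected fraction of gates falling beyond a 4.5 sigma tolerance
tail = 0.5 * erfc(4.5 / sqrt(2))
print(tail)   # about 3.4e-6, i.e., roughly 3.4 gates per million

# the "Six Sigma" budget: after the mean drifts by 1.5 sigma, the remaining
# margin is 6 - 1.5 = 4.5 sigma, giving the same tail fraction
print(0.5 * erfc((6.0 - 1.5) / sqrt(2)))
```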

8.2.3 Measuring Fault Tolerance

It is sometimes useful to have a quantitative measure of the fault tolerance of a system. One common measure, sometimes called the failure tolerance, is the number of failures of its components that a system can tolerate without itself failing. Although this label could be ambiguous, it is usually clear from context that a measure is being discussed. Thus a memory system that includes single-error correction (Section 8.4 describes how error correction works) has a failure tolerance of one bit.

When a failure occurs, the remaining failure tolerance of the system goes down. The remaining failure tolerance is an important thing to monitor during operation of the system because it shows how close the system as a whole is to failure. One of the most common system design mistakes is to add fault tolerance but not include any monitoring to see how much of the fault tolerance has been used up, thus ignoring the safety margin principle. When systems that are nominally fault tolerant do fail, later analysis invariably discloses that there were several failures that the system successfully masked but that somehow were never reported and thus were never repaired. Eventually, the total number of failures exceeded the designed failure tolerance of the system.

Failure tolerance is actually a single number in only the simplest situations. Sometimes it is better described as a vector, or even as a matrix showing the specific combinations of different kinds of failures that the system is designed to tolerate. For example, an electric power company might say that it can tolerate the failure of up to 15% of its generating capacity, at the same time as the downing of up to two of its main transmission lines.

8.3 Tolerating Active Faults

8.3.1 Responding to Active Faults

In dealing with active faults, the designer of a module can provide one of several responses:

• Do nothing. The error becomes a failure of the module, and the larger system or subsystem of which it is a component inherits the responsibilities both of discovering and of handling the problem. The designer of the larger subsystem then must choose which of these responses to provide. In a system with several layers of modules, failures may be passed up through more than one layer before


being discovered and handled. As the number of do-nothing layers increases, containment generally becomes more and more difficult.

• Be fail-fast. The module reports at its interface that something has gone wrong. This response also turns the problem over to the designer of the next higher-level system, but in a more graceful way. Example: when an Ethernet transceiver detects a collision on a frame it is sending, it stops sending as quickly as possible, broadcasts a brief jamming signal to ensure that all network participants quickly realize that there was a collision, and it reports the collision to the next higher level, usually a hardware module of which the transceiver is a component, so that the higher level can consider resending that frame.

• Be fail-safe. The module transforms any value or values that are incorrect to values that are known to be acceptable, even if not right or optimal. An example is a digital traffic light controller that, when it detects a failure in its sequencer, switches to a blinking red light in all directions. Chapter 11[on-line] discusses systems that provide security. In the event of a failure in a secure system, the safest thing to do is usually to block all access. A fail-safe module designed to do that is said to be fail-secure.

• Be fail-soft. The system continues to operate correctly with respect to some predictably degraded subset of its specifications, perhaps with some features missing or with lower performance. For example, an airplane with three engines can continue to fly safely, albeit more slowly and with less maneuverability, if one engine fails. A file system that is partitioned into five parts, stored on five different small hard disks, can continue to provide access to 80% of the data when one of the disks fails, in contrast to a file system that employs a single disk five times as large.

• Mask the error. Any value or values that are incorrect are made right and the module meets its specification as if the error had not occurred.

We will concentrate on masking errors because the techniques used for that purpose can be applied, often in simpler form, to achieving a fail-fast, fail-safe, or fail-soft system. As a general rule, one can design algorithms and procedures to cope only with specific, anticipated faults. Further, an algorithm or procedure can be expected to cope only with faults that are actually detected. In most cases, the only workable way to detect a fault is by noticing an incorrect value or control signal; that is, by detecting an error. Thus when trying to determine if a system design has adequate fault tolerance, it is helpful to classify errors as follows:

• A detectable error is one that can be detected reliably. If a detection procedure is in place and the error occurs, the system discovers it with near certainty and it becomes a detected error.


• A maskable error is one for which it is possible to devise a procedure to recover correctness. If a masking procedure is in place and the error occurs, is detected, and is masked, the error is said to be tolerated.

• Conversely, an untolerated error is one that is undetectable, undetected, unmaskable, or unmasked. An untolerated error usually leads to a failure of the system. (“Usually,” because we could get lucky and still produce a correct output, either because the error values didn’t actually matter under the current conditions, or some measure intended to mask a different error incidentally masks this one, too.)

This classification of errors is illustrated in Figure 8.3. A subtle consequence of the concept of a maskable error is that there must be a well-defined boundary around that part of the system state that might be in error. The masking procedure must restore all of that erroneous state to correctness, using information that has not been corrupted by the error. The real meaning of detectable, then, is that the error is discovered before its consequences have propagated beyond some specified boundary. The designer usually chooses this boundary to coincide with that of some module and designs that module to be fail-fast (that is, it detects and reports its own errors). The system of which the module is a component then becomes responsible for masking the failure of the module.

FIGURE 8.3 Classification of errors (the categories are detectable/undetectable, detected/undetected, maskable/unmaskable, masked/unmasked, and tolerated/untolerated). Arrows lead from a category to mutually exclusive subcategories. For example, unmasked errors include both unmaskable errors and maskable errors that the designer decides not to mask.

8.3.2 Fault Tolerance Models

The distinctions among detectable, detected, maskable, and tolerated errors allow us to specify for a system a fault tolerance model, one of the components of the fault tolerance design process described in Section 8.1.2, as follows:

1. Analyze the system and categorize possible error events into those that can be reliably detected and those that cannot. At this stage, detectable or not, all errors are untolerated.


2. For each undetectable error, evaluate the probability of its occurrence. If that probability is not negligible, modify the system design in whatever way necessary to make the error reliably detectable.

3. For each detectable error, implement a detection procedure and reclassify the module in which it is detected as fail-fast.

4. For each detectable error try to devise a way of masking it. If there is a way, reclassify this error as a maskable error.

5. For each maskable error, evaluate its probability of occurrence, the cost of failure, and the cost of the masking method devised in the previous step. If the evaluation indicates it is worthwhile, implement the masking method and reclassify this error as a tolerated error.

When finished developing such a model, the designer should have a useful fault tolerance specification for the system. Some errors, which have negligible probability of occurrence or for which a masking measure would be too expensive, are identified as untolerated. When those errors occur the system fails, leaving its users to cope with the result. Other errors have specified recovery algorithms, and when those occur the system should continue to run correctly. A review of the system recovery strategy can now focus separately on two distinct questions:

• Is the designer’s list of potential error events complete, and is the assessment of the probability of each error realistic?

• Is the designer’s set of algorithms, procedures, and implementations that are supposed to detect and mask the anticipated errors complete and correct?

These two questions are different. The first is a question of models of the real world. It addresses an issue of experience and judgment about real-world probabilities and whether all real-world modes of failure have been discovered or some have gone unnoticed. Two different engineers, with different real-world experiences, may reasonably disagree on such judgments—they may have different models of the real world. The evaluation of modes of failure and of probabilities is a point at which a designer may easily go astray because such judgments must be based not on theory but on experience in the field, either personally acquired by the designer or learned from the experience of others. A new technology, or an old technology placed in a new environment, is likely to create surprises. A wrong judgment can lead to wasted effort devising detection and masking algorithms that will rarely be invoked rather than the ones that are really needed. On the other hand, if the needed experience is not available, all is not lost: the iteration part of the design process is explicitly intended to provide that experience.

The second question is more abstract and also more absolutely answerable, in that an argument for correctness (unless it is hopelessly complicated) or a counterexample to that argument should be something that everyone can agree on. In system design, it is helpful to follow design procedures that distinctly separate these classes of questions. When someone questions a reliability feature, the designer can first ask, “Are you questioning


the correctness of my recovery algorithm or are you questioning my model of what may fail?” and thereby properly focus the discussion or argument. Creating a fault tolerance model also lays the groundwork for the iteration part of the fault tolerance design process. If a system in the field begins to fail more often than expected, or completely unexpected failures occur, analysis of those failures can be com­ pared with the fault tolerance model to discover what has gone wrong. By again asking the two questions marked with bullets above, the model allows the designer to distin­ guish between, on the one hand, failure probability predictions being proven wrong by field experience, and on the other, inadequate or misimplemented masking procedures. With this information the designer can work out appropriate adjustments to the model and the corresponding changes needed for the system. Iteration and review of fault tolerance models is also important to keep them up to date in the light of technology changes. For example, the Network File System described in Section 4.4 was first deployed using a local area network, where packet loss errors are rare and may even be masked by the link layer. When later users deployed it on larger networks, where lost packets are more common, it became necessary to revise its fault tolerance model and add additional error detection in the form of end-to-end checksums. The processor time required to calculate and check those checksums caused some performance loss, which is why its designers did not originally include checksums. But loss of data integrity outweighed loss of performance and the designers reversed the trade-off. To illustrate, an example of a fault tolerance model applied to a popular kind of mem­ ory devices, RAM, appears in Section 8.7. This fault tolerance model employs error detection and masking techniques that are described below in Section 8.4 of this chapter, so the reader may prefer to delay detailed study of that section until completing Section 8.4.

8.4 Systematically Applying Redundancy

The designer of an analog system typically masks small errors by specifying design tolerances known as margins, which are amounts by which the specification is better than necessary for correct operation under normal conditions. In contrast, the designer of a digital system both detects and masks errors of all kinds by adding redundancy, either in time or in space. When an error is thought to be transient, as when a packet is lost in a data communication network, one method of masking is to resend it, an example of redundancy in time. When an error is likely to be persistent, as in a failure in reading bits from the surface of a disk, the usual method of masking is with spatial redundancy, having another component provide another copy of the information or control signal. Redundancy can be applied either in cleverly small quantities or by brute force, and both techniques may be used in different parts of the same system.


8.4.1 Coding: Incremental Redundancy

The most common form of incremental redundancy, known as forward error correction, consists of clever coding of data values. With data that has not been encoded to tolerate errors, a change in the value of one bit may transform one legitimate data value into another legitimate data value. Encoding for errors involves choosing as the representation of legitimate data values only some of the total number of possible bit patterns, being careful that the patterns chosen for legitimate data values all have the property that to transform any one of them to any other, more than one bit must change. The smallest number of bits that must change to transform one legitimate pattern into another is known as the Hamming distance between those two patterns. The Hamming distance is named after Richard Hamming, who first investigated this class of codes. Thus the patterns

100101
000111

have a Hamming distance of 2 because the upper pattern can be transformed into the lower pattern by flipping the values of two bits, the first bit and the fifth bit. Data fields that have not been coded for errors might have a Hamming distance as small as 1. Codes that can detect or correct errors have a minimum Hamming distance between any two legitimate data patterns of 2 or more. The Hamming distance of a code is the minimum Hamming distance between any pair of legitimate patterns of the code. One can calcu­ late the Hamming distance between two patterns, A and B, by counting the number of ONEs in A ⊕ B , where ⊕ is the exclusive OR (XOR) operator. Suppose we create an encoding in which the Hamming distance between every pair of legitimate data patterns is 2. Then, if one bit changes accidentally, since no legitimate data item can have that pattern, we can detect that something went wrong, but it is not possible to figure out what the original data pattern was. Thus, if the two patterns above were two members of the code and the first bit of the upper pattern were flipped from ONE to ZERO, there is no way to tell that the result, 000101, is not the result of flipping the fifth bit of the lower pattern. Next, suppose that we instead create an encoding in which the Hamming distance of the code is 3 or more. Here are two patterns from such a code; bits 1, 2, and 5 are different: 100101

010111

Now, a one-bit change will always transform a legitimate data pattern into an incor­ rect data pattern that is still at least 2 bits distant from any other legitimate pattern but only 1 bit distant from the original pattern. A decoder that receives a pattern with a onebit error can inspect the Hamming distances between the received pattern and nearby legitimate patterns and by choosing the nearest legitimate pattern correct the error. If 2 bits change, this error-correction procedure will still identify a corrected data value, but it will choose the wrong one. If we expect 2-bit errors to happen often, we could choose the code patterns so that the Hamming distance is 4, in which case the code can correct


1-bit errors and detect 2-bit errors. But a 3-bit error would look just like a 1-bit error in some other code pattern, so it would decode to a wrong value. More generally, if the Hamming distance of a code is d, a little analysis reveals that one can detect d – 1 errors and correct ⌊(d – 1)/2⌋ errors. The reason that this form of redundancy is named “forward” error correction is that the creator of the data performs the coding before storing or transmitting it, and anyone can later decode the data without appealing to the creator. (Chapter 7[on-line] described the technique of asking the sender of a lost frame, packet, or message to retransmit it. That technique goes by the name of backward error correction.)

The systematic construction of forward error-detection and error-correction codes is a large field of study, which we do not intend to explore. However, two specific examples of commonly encountered codes are worth examining. The first example is a simple parity check on a 2-bit value, in which the parity bit is the XOR of the 2 data bits. The coded pattern is 3 bits long, so there are 2³ = 8 possible patterns for this 3-bit quantity, only 4 of which represent legitimate data. As illustrated in Figure 8.4, the 4 “correct” patterns have the property that changing any single bit transforms the word into one of the 4 illegal patterns. To transform the coded quantity into another legal pattern, at least 2 bits must change (in other words, the Hamming distance of this code is 2). The conclusion is that a simple parity check can detect any single error, but it doesn’t have enough information to correct errors.

FIGURE 8.4 Patterns for a simple parity-check code. Each line connects patterns that differ in only one bit; the bold-face patterns (000, 011, 101, and 110) are the legitimate ones.
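Because the Hamming distance is just the number of ONEs in A ⊕ B, it takes only a few lines to check these claims. The following Python sketch (illustrative only) computes distances for the patterns above and verifies that the parity-check code of Figure 8.4 has a minimum distance of 2:

```python
def hamming_distance(a: int, b: int) -> int:
    # count the ONEs in the XOR of the two patterns
    return bin(a ^ b).count("1")

print(hamming_distance(0b100101, 0b000111))   # 2: the first and fifth bits differ
print(hamming_distance(0b100101, 0b010111))   # 3: bits 1, 2, and 5 differ

# Figure 8.4: two data bits followed by their XOR as a parity bit
legitimate = [0b000, 0b011, 0b101, 0b110]
code_distance = min(hamming_distance(a, b)
                    for a in legitimate for b in legitimate if a != b)
print(code_distance)   # 2, so single errors are detectable but not correctable
```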

The second example, in Figure 8.5, shows a forward error-correction code that can correct 1-bit errors in a 4-bit data value, by encoding the 4 bits into 7-bit words. In this code, bits P7, P6, P5, and P3 carry the data, while bits P4, P2, and P1 are calculated from the data bits. (This out-of-order numbering scheme creates a multidimensional binary coordinate system with a use that will be evident in a moment.) We could analyze this code to determine its Hamming distance, but we can also observe that three extra bits can carry exactly enough information to distinguish 8 cases: no error, an error in bit 1, an error in bit 2, … or an error in bit 7. Thus, it is not surprising that an error-correction code can be created. This code calculates bits P1, P2, and P4 as follows: P1 = P7 ⊕ P5 ⊕ P3

P2 = P7 ⊕ P6 ⊕ P3

P4 = P7 ⊕ P6 ⊕ P5
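To make the construction concrete, here is a small Python sketch of this encoder together with the syndrome-based correction step that the next paragraph walks through; the function names and the list layout are ours, not the book's.

```python
def encode(d7, d6, d5, d3):
    # data goes in positions 7, 6, 5, 3; check bits 1, 2, 4 are XORs of data bits
    p1 = d7 ^ d5 ^ d3
    p2 = d7 ^ d6 ^ d3
    p4 = d7 ^ d6 ^ d5
    return [None, p1, p2, d3, p4, d5, d6, d7]   # word[i] holds bit Pi; index 0 unused

def correct_single_error(word):
    # recompute each check; a failed check contributes its weight to the syndrome
    s1 = word[7] ^ word[5] ^ word[3] ^ word[1]
    s2 = word[7] ^ word[6] ^ word[3] ^ word[2]
    s4 = word[7] ^ word[6] ^ word[5] ^ word[4]
    syndrome = 4 * s4 + 2 * s2 + s1   # 0 means no single-bit error detected
    if syndrome:
        word[syndrome] ^= 1           # the syndrome names the bit to flip back
    return word

word = encode(1, 0, 1, 1)
word[5] ^= 1                          # noise flips P5 in transit
print(correct_single_error(word) == encode(1, 0, 1, 1))   # True
```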


Now, suppose that the array of bits P1 through P7 is sent across a network and noise causes bit P5 to flip. If the recipient recalculates P1, P2, and P4, the recalculated values of P1 and P4 will be different from the received bits P1 and P4. The recipient then writes P4 P2 P1 in order, representing the troubled bits as ONEs and untroubled bits as ZEROs, and notices that their binary value is 101 2 = 5 , the position of the flipped bit. In this code, whenever there is a one-bit error, the troubled parity bits directly identify the bit to cor­ rect. (That was the reason for the out-of-order bit-numbering scheme, which created a 3-dimensional coordinate system for locating an erroneous bit.) The use of 3 check bits for 4 data bits suggests that an error-correction code may not be efficient, but in fact the apparent inefficiency of this example is only because it is so small. Extending the same reasoning, one can, for example, provide single-error correc­ tion for 56 data bits using 7 check bits in a 63-bit code word. In both of these examples of coding, the assumed threat to integrity is that an uni­ dentified bit out of a group may be in error. Forward error correction can also be effective against other threats. A different threat, called erasure, is also common in digital systems. An erasure occurs when the value of a particular, identified bit of a group is unintelligible or perhaps even completely missing. Since we know which bit is in question, the simple parity-check code, in which the parity bit is the XOR of the other bits, becomes a forward error correction code. The unavailable bit can be reconstructed simply by calculating the XOR of the unerased bits. Returning to the example of Figure 8.4, if we find a pattern in which the first and last bits have values 0 and 1 respectively, but the middle bit is illegible, the only possibilities are 001 and 011. Since 001 is not a legitimate code pattern, the original pattern must have been 011. The simple parity check allows correction of only a single erasure. If there is a threat of multiple erasures, a more complex coding scheme is needed. Suppose, for example, we have 4 bits to protect, and they are coded as in Fig­ ure 8.5. In that case, if as many as 3 bits are erased, the remaining 4 bits are sufficient to reconstruct the values of the 3 that are missing. Since erasure, in the form of lost packets, is a threat in a best-effort packet network, this same scheme of forward error correction is applicable. One might, for example, send four numbered, identical-length packets of data followed by a parity packet that contains

FIGURE 8.5 A single-error-correction code. In the table, the symbol ⊕ marks the bits that participate in the calculation of one of the redundant bits: choose P1 so that the XOR of every other bit (P7 ⊕ P5 ⊕ P3 ⊕ P1) is 0; choose P2 so that the XOR of every other pair (P7 ⊕ P6 ⊕ P3 ⊕ P2) is 0; choose P4 so that the XOR of every other four (P7 ⊕ P6 ⊕ P5 ⊕ P4) is 0. The payload bits are P7, P6, P5, and P3, and the redundant bits are P4, P2, and P1. The “every other” notes describe a 3-dimensional coordinate system that can locate an erroneous bit.


as its payload the bit-by-bit XOR of the payloads of the previous four. (That is, the first bit of the parity packet is the XOR of the first bit of each of the other four packets; the second bits are treated similarly, etc.) Although the parity packet adds 25% to the network load, as long as any four of the five packets make it through, the receiving side can reconstruct all of the payload data perfectly without having to ask for a retransmission. If the network is so unreliable that more than one packet out of five typically gets lost, then one might send seven packets, of which four contain useful data and the remaining three are calcu­ lated using the formulas of Figure 8.5. (Using the numbering scheme of that figure, the payload of packet 4, for example, would consist of the XOR of the payloads of packets 7, 6, and 5.) Now, if any four of the seven packets make it through, the receiving end can reconstruct the data. Forward error correction is especially useful in broadcast protocols, where the exist­ ence of a large number of recipients, each of which may miss different frames, packets, or stream segments, makes the alternative of backward error correction by requesting retransmission unattractive. Forward error correction is also useful when controlling jit­ ter in stream transmission because it eliminates the round-trip delay that would be required in requesting retransmission of missing stream segments. Finally, forward error correction is usually the only way to control errors when communication is one-way or round-trip delays are so long that requesting retransmission is impractical, for example, when communicating with a deep-space probe. On the other hand, using forward error correction to replace lost packets may have the side effect of interfering with congestion control techniques in which an overloaded packet forwarder tries to signal the sender to slow down by discarding an occasional packet. Another application of forward error correction to counter erasure is in storing data on magnetic disks. The threat in this case is that an entire disk drive may fail, for example because of a disk head crash. Assuming that the failure occurs long after the data was orig­ inally written, this example illustrates one-way communication in which backward error correction (asking the original writer to write the data again) is not usually an option. One response is to use a RAID array (see Section 2.1.1.4) in a configuration known as RAID 4. In this configuration, one might use an array of five disks, with four of the disks containing application data and each sector of the fifth disk containing the bit-by-bit XOR of the corresponding sectors of the first four. If any of the five disks fails, its identity will quickly be discovered because disks are usually designed to be fail-fast and report failures at their interface. After replacing the failed disk, one can restore its contents by reading the other four disks and calculating, sector by sector, the XOR of their data (see exercise 8.9). To maintain this strategy, whenever anyone updates a data sector, the RAID 4 sys­ tem must also update the corresponding sector of the parity disk, as shown in Figure 8.6. That figure makes it apparent that, in RAID 4, forward error correction has an identifi­ able read and write performance cost, in addition to the obvious increase in the amount of disk space used. 
Since loss of data can be devastating, there is considerable interest in RAID, and much ingenuity has been devoted to devising ways of minimizing the perfor­ mance penalty.
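The parity-packet scheme and the RAID 4 update of Figure 8.6 both rest on the same XOR identity, which the short sketch below illustrates (the sector contents are made-up example bytes):

```python
def xor_blocks(blocks):
    # bit-by-bit XOR of equal-length byte strings
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

d1, d2, d3, d4 = b"\x10\x32", b"\x54\x76", b"\x98\xba", b"\xdc\xfe"
parity = xor_blocks([d1, d2, d3, d4])       # written to the fifth disk (or sent as a fifth packet)

# erasure: disk 3 fails (or packet 3 is lost); rebuild it from the survivors
print(xor_blocks([d1, d2, d4, parity]) == d3)            # True

# the fast RAID 4 small-write update described in Figure 8.6
new_d2 = b"\x00\x01"
new_parity = xor_blocks([parity, d2, new_d2])            # old parity ^ old data ^ new data
print(new_parity == xor_blocks([d1, new_d2, d3, d4]))    # True
```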


Although it is an important and widely used technique, successfully applying incre­ mental redundancy to achieve error detection and correction is harder than one might expect. The first case study of Section 8.8 provides several useful lessons on this point. In addition, there are some situations where incremental redundancy does not seem to be applicable. For example, there have been efforts to devise error-correction codes for numerical values with the property that the coding is preserved when the values are pro­ cessed by an adder or a multiplier. While it is not too hard to invent schemes that allow a limited form of error detection (for example, one can verify that residues are consistent, using analogues of casting out nines, which school children use to check their arith­ metic), these efforts have not yet led to any generally applicable techniques. The only scheme that has been found to systematically protect data during arithmetic processing is massive redundancy, which is our next topic.

8.4.2 Replication: Massive Redundancy

In designing a bridge or a skyscraper, a civil engineer masks uncertainties in the strength of materials and other parameters by specifying components that are 5 or 10 times as strong as minimally required. The method is heavy-handed, but simple and effective.

FIGURE 8.6 Update of a sector on disk 2 of a five-disk RAID 4 system. The old parity sector contains parity ← data 1 ⊕ data 2 ⊕ data 3 ⊕ data 4. To construct a new parity sector that includes the new data 2, one could read the corresponding sectors of data 1, data 3, and data 4 and perform three more XORs. But a faster way is to read just the old parity sector and the old data 2 sector and compute the new parity sector as new parity ← old parity ⊕ old data 2 ⊕ new data 2.


The corresponding way of building a reliable system out of unreliable discrete compo­ nents is to acquire multiple copies of each component. Identical multiple copies are called replicas, and the technique is called replication. There is more to it than just making copies: one must also devise a plan to arrange or interconnect the replicas so that a failure in one replica is automatically masked with the help of the ones that don’t fail. For exam­ ple, if one is concerned about the possibility that a diode may fail by either shorting out or creating an open circuit, one can set up a network of four diodes as in Figure 8.7, cre­ ating what we might call a “superdiode”. This interconnection scheme, known as a quad component, was developed by Claude E. Shannon and Edward F. Moore in the 1950s as a way of increasing the reliability of relays in telephone systems. It can also be used with resistors and capacitors in circuits that can tolerate a modest range of component values. This particular superdiode can tolerate a single short circuit and a single open circuit in any two component diodes, and it can also tolerate certain other multiple failures, such as open circuits in both upper diodes plus a short circuit in one of the lower diodes. If the bridging connection of the figure is added, the superdiode can tolerate additional multiple open-circuit failures (such as one upper diode and one lower diode), but it will be less tolerant of certain short-circuit failures (such as one left diode and one right diode). A serious problem with this superdiode is that it masks failures silently. There is no easy way to determine how much failure tolerance remains in the system.

8.4.3 Voting

Although there have been attempts to extend quad-component methods to digital logic, the intricacy of the required interconnections grows much too rapidly. Fortunately, there is a systematic alternative that takes advantage of the static discipline and level regeneration that are inherent properties of digital logic. In addition, it has the nice feature that it can be applied at any level of module, from a single gate on up to an entire computer. The technique is to substitute in place of a single module a set of replicas of that same module, all operating in parallel with the same inputs, and compare their outputs with a device known as a voter. This basic strategy is called N-modular redundancy, or NMR. When N has the value 3 the strategy is called triple-modular redundancy, abbreviated TMR. When other values are used for N the strategy is named by replacing the N of NMR with the number, as in 5MR. The combination of N replicas of some module and the voting system is sometimes called a supermodule.

FIGURE 8.7 A quad-component superdiode. The dotted line represents an optional bridging connection, which allows the superdiode to tolerate a different set of failures, as described in the text.


Several different schemes exist for interconnection and voting, only a few of which we explore here.

The simplest scheme, called fail-vote, consists of NMR with a majority voter. One assembles N replicas of the module and a voter that consists of an N-way comparator and some counting logic. If a majority of the replicas agree on the result, the voter accepts that result and passes it along to the next system component. If any replicas disagree with the majority, the voter may in addition raise an alert, calling for repair of the replicas that were in the minority. If there is no majority, the voter signals that the supermodule has failed. In failure-tolerance terms, a triply-redundant fail-vote supermodule can mask the failure of any one replica, and it is fail-fast if any two replicas fail in different ways.

If the reliability, as was defined in Section 8.2.2, of a single replica module is R and the underlying fault mechanisms are independent, a TMR fail-vote supermodule will operate correctly if all 3 modules are working (with reliability R³) or if 1 module has failed and the other 2 are working (with reliability R²(1 – R)). Since a single-module failure can happen in 3 different ways, the reliability of the supermodule is the sum,

R_supermodule = R³ + 3R²(1 – R) = 3R² – 2R³    Eq. 8–10

but the supermodule is not always fail-fast. If two replicas fail in exactly the same way, the voter will accept the erroneous result and, unfortunately, call for repair of the one correctly operating replica. This outcome is not as unlikely as it sounds because several replicas that went through the same design and production process may have exactly the same set of design or manufacturing faults. This problem can arise despite the independence assumption used in calculating the probability of correct operation. That calculation assumes only that the probability that different replicas produce correct answers be independent; it assumes nothing about the probability of producing specific wrong answers. Without more information about the probability of specific errors and their correlations, the only thing we can say about the probability that an incorrect result will be accepted by the voter is that it is not more than

(1 – R_supermodule) = (1 – 3R² + 2R³)
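Before turning to the voter's own reliability, here is a minimal sketch of the fail-vote logic itself (in Python; a real voter would be a hardware comparator, and these names are ours):

```python
from collections import Counter

def fail_vote(outputs):
    # outputs: one result from each of the N replicas
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("supermodule failure: no majority")
    needs_repair = count < len(outputs)   # some replica was in the minority
    return value, needs_repair

print(fail_vote([42, 42, 42]))   # (42, False)
print(fail_vote([42, 17, 42]))   # (42, True): mask the error, flag a replica for repair
```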

These calculations assume that the voter is perfectly reliable. Rather than trying to create perfect voters, the obvious thing to do is replicate them, too. In fact, everything— modules, inputs, outputs, sensors, actuators, etc.—should be replicated, and the final vote should be taken by the client of the system. Thus, three-engine airplanes vote with their propellers: when one engine fails, the two that continue to operate overpower the inoperative one. On the input side, the pilot’s hand presses forward on three separate throttle levers. A fully replicated TMR supermodule is shown in Figure 8.8. With this interconnection arrangement, any measurement or estimate of the reliability, R, of a component module should include the corresponding voter. It is actually customary (and more logical) to consider a voter to be a component of the next module in the chain rather than, as the diagram suggests, the previous module. This fully replicated design is sometimes described as recursive.


The numerical effect of fail-vote TMR is impressive. If the reliability of a single module at time T is 0.999, equation 8–10 says that the reliability of a fail-vote TMR supermodule at that same time is 0.999997. TMR has reduced the probability of failure from one in a thousand to three in a million. This analysis explains why airplanes intended to fly across the ocean have more than one engine. Suppose that the rate of engine failures is such that a single-engine plane would fail to complete one out of a thousand trans-Atlantic flights. Suppose also that a 3-engine plane can continue flying as long as any 2 engines are operating, but it is too heavy to fly with only 1 engine. In 3 flights out of a thousand, one of the three engines will fail, but if engine failures are independent, in 999 out of each thousand first-engine failures, the remaining 2 engines allow the plane to limp home successfully.

Although TMR has greatly improved reliability, it has not made a comparable impact on MTTF. In fact, the MTTF of a fail-vote TMR supermodule can be smaller than the MTTF of the original, single-replica module. The exact effect depends on the failure process of the replicas, so for illustration consider a memoryless failure process, not because it is realistic but because it is mathematically tractable. Suppose that airplane engines have an MTTF of 6,000 hours, they fail independently, the mechanism of engine failure is memoryless, and (since this is a fail-vote design) we need at least 2 operating engines to get home. When flying with three engines, the plane accumulates 6,000 hours of engine running time in only 2,000 hours of flying time, so from the point of view of the airplane as a whole, 2,000 hours is the expected time to the first engine failure. While flying with the remaining two engines, it will take another 3,000 flying hours to accumulate 6,000 more engine hours. Because the failure process is memoryless we can calculate the MTTF of the 3-engine plane by adding:

Mean time to first failure: 2000 hours (three engines)
Mean time from first to second failure: 3000 hours (two engines)
Total mean time to system failure: 5000 hours

Thus the mean time to system failure is less than the 6,000 hour MTTF of a single engine. What is going on here is that we have actually sacrificed long-term reliability in order to enhance short-term reliability.
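Both of these numbers follow directly from Eq. 8–10 and the memoryless-failure argument; a few lines of Python (illustration only) reproduce them:

```python
def r_tmr(r):
    # Eq. 8-10: a fail-vote TMR supermodule works if at least 2 of 3 replicas work
    return 3 * r**2 - 2 * r**3

print(r_tmr(0.999))          # about 0.999997

mttf_engine = 6000           # hours; memoryless failures, need 2 of 3 engines
time_to_first = mttf_engine / 3     # 2,000 hours with three engines running
time_to_second = mttf_engine / 2    # 3,000 more hours with two engines running
print(time_to_first + time_to_second)   # 5,000 hours: less than one engine's MTTF
```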

FIGURE 8.8 Triple-modular redundant supermodule, with three inputs, three voters, and three outputs.


Figure 8.9 illustrates the reliability of our hypothetical airplane during its 6 hours of flight, which amounts to only 0.001 of the single-engine MTTF—the mission time is very short compared with the MTTF and the reliability is far higher. Figure 8.10 shows the same curve, but for flight times that are comparable with the MTTF. In this region, if the plane tried to keep flying for 8000 hours (about 1.4 times the single-engine MTTF), a single-engine plane would fail to complete the flight in 3 out of 4 tries, but the 3-engine plane would fail to complete the flight in 5 out of 6 tries. (One should be wary of these calculations because the assumptions of independence and memoryless operation may not be met in practice. Sidebar 8.2 elaborates.)

FIGURE 8.9 Reliability with triple modular redundancy, for mission times much less than the MTTF of 6,000 hours (vertical axis: reliability; horizontal axis: mission time in units of MTTF; the plotted curves are for a single engine and for three engines). The vertical dotted line represents a six-hour flight.

FIGURE 8.10 Reliability with triple modular redundancy, for mission times comparable to the MTTF of 6,000 hours (vertical axis: reliability; horizontal axis: mission time in units of MTTF; the plotted curves are for a single engine and for three engines). The two vertical dotted lines represent mission times of 6,000 hours (left) and 8,400 hours (right).


Sidebar 8.2: Risks of manipulating MTTFs

The apparently casual manipulation of MTTFs in Sections 8.4.3 and 8.4.4 is justified by assumptions of independence of failures and memoryless processes. But one can trip up by blindly applying this approach without understanding its limitations. To see how, consider a computer system that has been observed for several years to have a hardware crash an average of every 2 weeks and a software crash an average of every 6 weeks. The operator does not repair the system, but simply restarts it and hopes for the best. The composite MTTF is 1.5 weeks, determined most easily by considering what happens if we run the system for, say, 60 weeks. During that time we expect to see

10 software failures
30 hardware failures
40 system failures in 60 weeks → 1.5 weeks between failures

New hardware is installed, identical to the old except that it never fails. The MTTF should jump to 6 weeks because the only remaining failures are software, right? Perhaps—but only if the software failure process is independent of the hardware failure process. Suppose the software failure occurs because there is a bug (fault) in a clock-updating procedure: The bug always crashes the system exactly 420 hours (2 1/2 weeks) after it is started—if it gets a chance to run that long. The old hardware was causing crashes so often that the software bug only occasionally had a chance to do its thing—only about once every 6 weeks. Most of the time, the recovery from a hardware failure, which requires restarting the system, had the side effect of resetting the process that triggered the software bug. So, when the new hardware is installed, the system has an MTTF of only 2.5 weeks, much less than hoped. MTTFs are useful, but one must be careful to understand what assumptions go into their measurement and use.

If we had assumed that the plane could limp home with just one engine, the MTTF would have increased, rather than decreased, but only modestly. Replication provides a dramatic improvement in reliability for missions of duration short compared with the MTTF, but the MTTF itself changes much less. We can verify this claim with a little more analysis, again assuming memoryless failure processes to make the mathematics tractable. Suppose we have an NMR system with the property that it somehow continues to be useful as long as at least one replica is still working. (This system requires using fail-fast replicas and a cleverer voter, as described in Section 8.4.4 below.) If a single replica has an MTTF_replica = 1, there are N independent replicas, and the failure process is memoryless, the expected time until the first failure is MTTF_replica/N, the expected time from then until the second failure is MTTF_replica/(N – 1), etc., and the expected time until the system of N replicas fails is the sum of these times,

MTTF_system = 1 + 1/2 + 1/3 + … + 1/N    Eq. 8–11


which for large N is approximately ln(N). As we add to the cost by adding more replicas, MTTF_system grows disappointingly slowly—proportional to the logarithm of the cost. To multiply the MTTF_system by K, the number of replicas required is e^K—the cost grows exponentially. The significant conclusion is that in systems for which the mission time is long compared with MTTF_replica, simple replication escalates the cost while providing little benefit. On the other hand, there is a way of making replication effective for long missions, too. The method is to enhance replication by adding repair.
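The harmonic sum of Eq. 8–11 is easy to tabulate; the sketch below (plain Python) shows how slowly MTTF_system grows as replicas are added under these assumptions.

```python
from math import log

def mttf_nmr(n):
    # Eq. 8-11 with MTTF_replica = 1: survive until the last of n memoryless replicas fails
    return sum(1 / k for k in range(1, n + 1))

for n in (1, 2, 3, 10, 100, 1000):
    print(n, "replicas ->", round(mttf_nmr(n), 2))
# 1 -> 1.0, 2 -> 1.5, 3 -> 1.83, 10 -> 2.93, 100 -> 5.19, 1000 -> 7.49
print(log(1000))   # about 6.9: the sum tracks ln(N) for large N
```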

8.4.4 Repair

Let us return now to a fail-vote TMR supermodule (that is, it requires that at least two replicas be working) in which the voter has just noticed that one of the three replicas is producing results that disagree with the other two. Since the voter is in a position to report which replica has failed, suppose that it passes such a report along to a repair person who immediately examines the failing replica and either fixes or replaces it. For this approach, the mean time to repair (MTTR) measure becomes of interest. The supermodule fails if either the second or third replica fails before the repair to the first one can be completed. Our intuition is that if the MTTR is small compared with the combined MTTF of the other two replicas, the chance that the supermodule fails will be similarly small.

The exact effect on chances of supermodule failure depends on the shape of the reliability function of the replicas. In the case where the failure and repair processes are both memoryless, the effect is easy to calculate. Since the rate of failure of 1 replica is 1/MTTF, the rate of failure of 2 replicas is 2/MTTF. If the repair time is short compared with MTTF, the probability of a failure of 1 of the 2 remaining replicas while waiting a time T for repair of the one that failed is approximately 2T/MTTF. Since the mean time to repair is MTTR, we have

Pr(supermodule fails while waiting for repair) = (2 × MTTR) ⁄ MTTF    Eq. 8–12

Continuing our airplane example and temporarily suspending disbelief, suppose that during a long flight we send a mechanic out on the airplane's wing to replace a failed engine. If the replacement takes 1 hour, the chance that one of the other two engines fails during that hour is approximately 1/3000. Moreover, once the replacement is complete, we expect to fly another 2000 hours until the next engine failure. Assuming further that the mechanic is carrying an unlimited supply of replacement engines, completing a 10,000 hour flight—or even a longer one—becomes plausible. The general formula for the MTTF of a fail-vote TMR supermodule with memoryless failure and repair processes is (this formula comes out of the analysis of continuous-transition birth-and-death Markov processes, an advanced probability technique that is beyond our scope):

	MTTFsupermodule = (MTTFreplica / 3) × (MTTFreplica / (2 × MTTRreplica)) = (MTTFreplica)^2 / (6 × MTTRreplica)		Eq. 8–13


Thus, our 3-engine plane with hypothetical in-flight repair has an MTTF of 6 million hours, an enormous improvement over the 6000 hours of a single-engine plane. This equation can be interpreted as saying that, compared with an unreplicated module, the MTTF has been reduced by the usual factor of 3 because there are 3 replicas, but at the same time the availability of repair has increased the MTTF by a factor equal to the ratio of the MTTF of the remaining 2 engines to the MTTR.

Replacing an airplane engine in flight may be a fanciful idea, but replacing a magnetic disk in a computer system on the ground is quite reasonable. Suppose that we store 3 replicas of a set of data on 3 independent hard disks, each of which has an MTTF of 5 years (using as the MTTF the expected operational lifetime, not the “MTTF” derived from the short-term failure rate). Suppose also, that if a disk fails, we can locate, install, and copy the data to a replacement disk in an average of 10 hours. In that case, by Eq. 8–13, the MTTF of the data is

	(MTTFreplica)^2 / (6 × MTTRreplica) = (5 years)^2 / (6 × (10 hours) / (8760 hours/year)) = 3650 years		Eq. 8–14
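A quick back-of-the-envelope check of Eq. 8–13 for both examples, as a Python sketch (ours, not from the text; the answer comes out in whatever units MTTF and MTTR are expressed in):

def tmr_mttf_with_repair(mttf_replica, mttr_replica):
    # Eq. 8-13: fail-vote TMR supermodule, memoryless failure and repair processes.
    return mttf_replica ** 2 / (6 * mttr_replica)

# 3-engine plane: 6000-hour engines, 1-hour in-flight engine replacement.
print(tmr_mttf_with_repair(6000, 1))                 # 6,000,000 hours

# Triply mirrored disks: 5-year disks, 10-hour replacement (computed in hours).
print(tmr_mttf_with_repair(5 * 8760, 10) / 8760)     # about 3650 years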

In effect, redundancy plus repair has reduced the probability of failure of this supermod­ ule to such a small value that for all practical purposes, failure can be neglected and the supermodule can operate indefinitely. Before running out to start a company that sells superbly reliable disk-storage sys­ tems, it would be wise to review some of the overly optimistic assumptions we made in getting that estimate of the MTTF, most of which are not likely to be true in the real world: • Disks fail independently. A batch of real world disks may all come from the same vendor, where they acquired the same set of design and manufacturing faults. Or, they may all be in the same machine room, where a single earthquake—which probably has an MTTF of less than 3,650 years—may damage all three. • Disk failures are memoryless. Real-world disks follow a bathtub curve. If, when disk #1 fails, disk #2 has already been in service for three years, disk #2 no longer has an expected operational lifetime of 5 years, so the chance of a second failure while waiting for repair is higher than the formula assumes. Furthermore, when disk #1 is replaced, its chances of failing are probably higher than usual for the first few weeks. • Repair is also a memoryless process. In the real world, if we stock enough spares that we run out only once every 10 years and have to wait for a shipment from the factory, but doing a replacement happens to run us out of stock today, we will probably still be out of stock tomorrow and the next day. • Repair is done flawlessly. A repair person may replace the wrong disk, forget to copy the data to the new disk, or install a disk that hasn’t passed burn-in and fails in the first hour.


Each of these concerns acts to reduce the reliability below what might be expected from our overly simple analysis. Nevertheless, NMR with repair remains a useful technique, and in Chapter 10[on-line] we will see ways in which it can be applied to disk storage.

One of the most powerful applications of NMR is in the masking of transient errors. When a transient error occurs in one replica, the NMR voter immediately masks it. Because the error is transient, the subsequent behavior of the supermodule is as if repair happened by the next operation cycle. The numerical result is little short of extraordinary. For example, consider a processor arithmetic logic unit (ALU) with a 1 gigahertz clock and which is triply replicated with voters checking its output at the end of each clock cycle. In equation 8–13 we have MTTRreplica = 1 (in this application, equation 8–13 is only an approximation because the time to repair is a constant rather than the result of a memoryless process), and MTTFsupermodule = (MTTFreplica)^2 / 6 cycles. If MTTFreplica is 10^10 cycles (1 error in 10 billion cycles, which at this clock speed means one error every 10 seconds), MTTFsupermodule is 10^20 / 6 cycles, about 500 years. TMR has taken three ALUs that were for practical use nearly worthless and created a super-ALU that is almost infallible.

The reason things seem so good is that we are evaluating the chance that two transient errors occur in the same operation cycle. If transient errors really are independent, that chance is small. This effect is powerful, but the leverage works in both directions, thereby creating a potential hazard: it is especially important to keep track of the rate at which transient errors actually occur. If they are happening, say, 20 times as often as hoped, MTTFsupermodule will be 1/400 of the original prediction—the super-ALU is likely to fail once per year. That may still be acceptable for some applications, but it is a big change. Also, as usual, the assumption of independence is absolutely critical. If all the ALUs came from the same production line, it seems likely that they will have at least some faults in common, in which case the super-ALU may be just as worthless as the individual ALUs.

Several variations on the simple fail-vote structure appear in practice:

• Purging. In an NMR design with a voter, whenever the voter detects that one replica disagrees with the majority, the voter calls for its repair and in addition marks that replica DOWN and ignores its output until hearing that it has been repaired. This technique doesn’t add anything to a TMR design, but with higher levels of replication, as long as replicas fail one at a time and any two replicas continue to operate correctly, the supermodule works.

• Pair-and-compare. Create a fail-fast module by taking two replicas, giving them the same inputs, and connecting a simple comparator to their outputs. As long as the comparator reports that the two replicas of a pair agree, the next stage of the system accepts the output. If the comparator detects a disagreement, it reports that the module has failed. The major attraction of pair-and-compare is that it can be used to create fail-fast modules starting with easily available commercial, off-the-shelf components, rather than commissioning specialized fail-fast versions. Special high-reliability components typically have a cost that is much higher than off-the-shelf designs, for two reasons. First, since they take more time to design and test,


the ones that are available are typically of an older, more expensive technology. Second, they are usually low-volume products that cannot take advantage of economies of large-scale production. These considerations also conspire to produce long delivery cycles, making it harder to keep spares in stock. An important aspect of using standard, high-volume, low-cost components is that one can afford to keep a stock of spares, which in turn means that MTTR can be made small: just replace a failing replica with a spare (the popular term for this approach is pair-and-spare) and do the actual diagnosis and repair at leisure. • NMR with fail-fast replicas. If each of the replicas is itself a fail-fast design (perhaps using pair-and-compare internally), then a voter can restrict its attention to the outputs of only those replicas that claim to be producing good results and ignore those that are reporting that their outputs are questionable. With this organization, a TMR system can continue to operate even if 2 of its 3 replicas have failed, since the 1 remaining replica is presumably checking its own results. An NMR system with repair and constructed of fail-fast replicas is so robust that it is unusual to find examples for which N is greater than 2. Figure 8.11 compares the ability to continue operating until repair arrives of 5MR designs that use fail-vote, purging, and fail-fast replicas. The observant reader will note that this chart can be deemed guilty of a misleading comparison, since it claims that the 5MR system continues working when only one fail-fast replica is still running. But if that fail-fast replica is actually a pair-and-compare module, it might be more accurate to say that there are two still-working replicas at that point. Another technique that takes advantage of repair, can improve availability, and can degrade gracefully (in other words, it can be fail-soft) is called partition. If there is a choice of purchasing a system that has either one fast processor or two slower processors, the two-processor system has the virtue that when one of its processors fails, the system

FIGURE 8.11 Failure points of three different 5MR supermodule designs, if repair does not happen in time. [The original figure plots the number of replicas still working correctly, from 5 down to 0, against time, and marks the successive points at which the fail-vote design, the purging design, and the design with fail-fast replicas each stop working as replicas fail one at a time.]


can continue to operate with half of its usual capacity until someone can repair the failed processor. An electric power company, rather than installing a single generator of capac­ ity K megawatts, may install N generators of capacity K/N megawatts each. When equivalent modules can easily share a load, partition can extend to what is called N + 1 redundancy. Suppose a system has a load that would require the capacity of N equivalent modules. The designer partitions the load across N + 1 or more modules. Then, if any one of the modules fails, the system can carry on at full capacity until the failed module can be repaired. N + 1 redundancy is most applicable to modules that are completely interchangeable, can be dynamically allocated, and are not used as storage devices. Examples are proces­ sors, dial-up modems, airplanes, and electric generators. Thus, one extra airplane located at a busy hub can mask the failure of any single plane in an airline’s fleet. When modules are not completely equivalent (for example, electric generators come in a range of capac­ ities, but can still be interconnected to share load), the design must ensure that the spare capacity is greater than the capacity of the largest individual module. For devices that provide storage, such as a hard disk, it is also possible to apply partition and N + 1 redun­ dancy with the same goals, but it requires a greater level of organization to preserve the stored contents when a failure occurs, for example by using RAID, as was described in Section 8.4.1, or some more general replica management system such as those discussed in Section 10.3.7. For some applications an occasional interruption of availability is acceptable, while in others every interruption causes a major problem. When repair is part of the fault toler­ ance plan, it is sometimes possible, with extra care and added complexity, to design a system to provide continuous operation. Adding this feature requires that when failures occur, one can quickly identify the failing component, remove it from the system, repair it, and reinstall it (or a replacement part) all without halting operation of the system. The design required for continuous operation of computer hardware involves connecting and disconnecting cables and turning off power to some components but not others, without damaging anything. When hardware is designed to allow connection and disconnection from a system that continues to operate, it is said to allow hot swap. In a computer system, continuous operation also has significant implications for the software. Configuration management software must anticipate hot swap so that it can stop using hardware components that are about to be disconnected, as well as discover newly attached components and put them to work. In addition, maintaining state is a challenge. If there are periodic consistency checks on data, those checks (and repairs to data when the checks reveal inconsistencies) must be designed to work correctly even though the system is in operation and the data is perhaps being read and updated by other users at the same time. Overall, continuous operation is not a feature that should be casually added to a list of system requirements. When someone suggests it, it may be helpful to point out that it is much like trying to keep an airplane flying indefinitely. Many large systems that appear to provide continuous operation are actually designed to stop occasionally for maintenance.
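As a tiny illustration of the N + 1 sizing rule just described (a sketch of ours in Python, not from the text), a designer can check whether a set of not-necessarily-identical modules still covers the load after losing its largest member:

def can_mask_single_failure(capacities, load):
    # N + 1 rule for non-identical modules: after losing the largest
    # module, the remaining capacity must still cover the load.
    return sum(capacities) - max(capacities) >= load

print(can_mask_single_failure([50, 50, 50, 50], load=150))   # True: classic N + 1
print(can_mask_single_failure([120, 40, 40], load=150))      # False: spare < largest module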


8.5 Applying Redundancy to Software and Data

The examples of redundancy and replication in the previous sections all involve hardware. A seemingly obvious next step is to apply the same techniques to software and to data. In the case of software the goal is to reduce the impact of programming errors, while in the case of data the goal is to reduce the impact of any kind of hardware, software, or operational error that might affect its integrity. This section begins the exploration of several applicable techniques: N-version programming, valid construction, and building a firewall to separate stored state into two categories: state whose integrity must be preserved and state that can casually be abandoned because it is easy to reconstruct.

8.5.1 Tolerating Software Faults Simply running three copies of the same buggy program is likely to produce three iden­ tical incorrect results. NMR requires independence among the replicas, so the designer needs a way of introducing that independence. An example of a way of introducing inde­ pendence is found in the replication strategy for the root name servers of the Internet Domain Name System (DNS, described in Section 4.4). Over the years, slightly differ­ ent implementations of the DNS software have evolved for different operating systems, so the root name server replicas intentionally employ these different implementations to reduce the risk of replicated errors. To try to harness this idea more systematically, one can commission several teams of programmers and ask each team to write a complete version of an application according to a single set of specifications. Then, run these several versions in parallel and compare their outputs. The hope is that the inevitable programming errors in the different ver­ sions will be independent and voting will produce a reliable system. Experiments with this technique, known as N-version programming, suggest that the necessary indepen­ dence is hard to achieve. Different programmers may be trained in similar enough ways that they make the same mistakes. Use of the same implementation language may encourage the same errors. Ambiguities in the specification may be misinterpreted in the same way by more than one team and the specification itself may contain errors. Finally, it is hard to write a specification in enough detail that the outputs of different implemen­ tations can be expected to be bit-for-bit identical. The result is that after much effort, the technique may still mask only a certain class of bugs and leave others unmasked. Never­ theless, there are reports that N-version programming has been used, apparently with success, in at least two safety-critical aerospace systems, the flight control system of the Boeing 777 aircraft (with N = 3) and the on-board control system for the Space Shuttle (with N = 2). Incidentally, the strategy of employing multiple design teams can also be applied to hardware replicas, with a goal of increasing the independence of the replicas by reducing the chance of replicated design errors and systematic manufacturing defects. Much of software engineering is devoted to a different approach: devising specifica­ tion and programming techniques that avoid faults in the first place and test techniques


that systematically root out faults so that they can be repaired once and for all before deploying the software. This approach, sometimes called valid construction, can dramat­ ically reduce the number of software faults in a delivered system, but because it is difficult both to completely specify and to completely test a system, some faults inevitably remain. Valid construction is based on the observation that software, unlike hardware, is not sub­ ject to wear and tear, so if it is once made correct, it should stay that way. Unfortunately, this observation can turn out to be wishful thinking, first because it is hard to make soft­ ware correct, and second because it is nearly always necessary to make changes after installing a program because the requirements, the environment surrounding the pro­ gram, or both, have changed. There is thus a potential for tension between valid construction and the principle that one should design for iteration. Worse, later maintainers and reworkers often do not have a complete understanding of the ground rules that went into the original design, so their work is likely to introduce new faults for which the original designers did not anticipate providing tests. Even if the original design is completely understood, when a system is modified to add features that were not originally planned, the original ground rules may be subjected to some violence. Software faults more easily creep into areas that lack systematic design.
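To make the N-version programming idea described above concrete, a voter over independently written implementations might look like the following sketch (ours, in Python; the three "team" versions are hypothetical stand-ins, not real safety-critical software):

from collections import Counter

def n_version_vote(implementations, inputs):
    # Run every independently written version on the same inputs and
    # accept the majority answer; a disagreeing minority is masked.
    outputs = [impl(*inputs) for impl in implementations]
    answer, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:
        return answer
    raise RuntimeError("no majority: the versions disagree")

# Three hypothetical, independently coded versions of the same specification.
team_a = lambda x, y: x + y
team_b = lambda x, y: y + x
team_c = lambda x, y: sum((x, y))

print(n_version_vote([team_a, team_b, team_c], (2, 3)))   # 5

Note that the sketch assumes the specification is precise enough that correct versions produce bit-for-bit identical outputs—exactly the assumption the text identifies as hard to satisfy in practice.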

8.5.2 Tolerating Software (and other) Faults by Separating State Designers of reliable systems usually assume that, despite the best efforts of programmers there will always be a residue of software faults, just as there is also always a residue of hardware, operation, and environment faults. The response is to develop a strategy for tolerating all of them. Software adds the complication that the current state of a running program tends to be widely distributed. Parts of that state may be in non-volatile storage, while other parts are in temporary variables held in volatile memory locations, processor registers, and kernel tables. This wide distribution of state makes containment of errors problematic. As a result, when an error occurs, any strategy that involves stopping some collection of running threads, tinkering to repair the current state (perhaps at the same time replacing a buggy program module), and then resuming the stopped threads is usu­ ally unrealistic. In the face of these observations, a programming discipline has proven to be effective: systematically divide the current state of a running program into two mutually exclusive categories and separate the two categories with a firewall. The two categories are: • State that the system can safely abandon in the event of a failure. • State whose integrity the system should preserve despite failure. Upon detecting a failure, the plan becomes to abandon all state in the first category and instead concentrate just on maintaining the integrity of the data in the second cate­ gory. An important part of the strategy is an important sweeping simplification: classify the state of running threads (that is, the thread table, stacks, and registers) as abandonable. When a failure occurs, the system abandons the thread or threads that were running at the time and instead expects a restart procedure, the system operator, or the individual


user to start a new set of threads with a clean slate. The new thread or threads can then, working with only the data found in the second category, verify the integrity of that data and return to normal operation. The primary challenge then becomes to build a firewall that can protect the integrity of the second category of data despite the failure. The designer can base a natural firewall on the common implementations of volatile (e.g., CMOS memory) and non-volatile (e.g., magnetic disk) storage. As it happens, writing to non-volatile storage usually involves mechanical movement such as rotation of a disk platter, so most transfers move large blocks of data to a limited region of addresses, using a GET/PUT interface. On the other hand, volatile storage technologies typ­ ically provide a READ/WRITE interface that allows rapid-fire writes to memory addresses chosen at random, so failures that originate in or propagate to software tend to quickly and untraceably corrupt random-access data. By the time an error is detected the soft­ ware may thus have already damaged a large and unidentifiable part of the data in volatile memory. The GET/PUT interface instead acts as a bottleneck on the rate of spread of data corruption. The goal can be succinctly stated: to detect failures and stop the system before it reaches the next PUT operation, thus making the volatile storage medium the error containment boundary. It is only incidental that volatile storage usually has a READ/WRITE interface, while non-volatile storage usually has a GET/PUT interface, but because that is usually true it becomes a convenient way to implement and describe the firewall. This technique is widely used in systems whose primary purpose is to manage longlived data. In those systems, two aspects are involved: • Prepare for failure by recognizing that all state in volatile memory devices can vanish at any instant, without warning. When it does vanish, automatically launch new threads that start by restoring the data in non-volatile storage to a consistent, easily described state. The techniques to do this restoration are called recovery. Doing recovery systematically involves atomicity, which is explored in Chapter 9[on-line]. • Protect the data in non-volatile storage using replication, thus creating the class of storage known as durable storage. Replicating data can be a straightforward application of redundancy, so we will begin the topic in this chapter. However, there are more effective designs that make use of atomicity and geographical separation of replicas, so we will revisit durability in Chapter 10[on-line]. When the volatile storage medium is CMOS RAM and the non-volatile storage medium is magnetic disk, following this programming discipline is relatively straightfor­ ward because the distinctively different interfaces make it easy to remember where to place data. But when a one-level store is in use, giving the appearance of random access to all storage, or the non-volatile medium is flash memory, which allows fast random access, it may be necessary for the designer to explicitly specify both the firewall mecha­ nism and which data items are to reside on each side of the firewall.
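A minimal sketch of this discipline (ours, in Python; the file-based store and the names are hypothetical, not from the text): data whose integrity must be preserved crosses the firewall only through an explicit PUT-style write to non-volatile storage, everything in ordinary memory is treated as abandonable, and recovery consists of discarding the volatile part and reloading the durable part.

import json, os

class FirewalledCounter:
    """Toy service: the running total is the state worth preserving."""

    def __init__(self, path="counter.json"):
        self.path = path        # non-volatile side of the firewall
        self.cache = None       # volatile, abandonable copy

    def recover(self):
        # After a crash: abandon all volatile state, reload the preserved category.
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.cache = json.load(f)["total"]
        else:
            self.cache = 0

    def add(self, n):
        if self.cache is None:
            self.recover()
        self.cache += n                      # scratch work in volatile memory
        with open(self.path, "w") as f:      # the PUT that crosses the firewall
            json.dump({"total": self.cache}, f)

As the chapter goes on to explain, making that PUT safe against failures in mid-write requires the atomicity techniques of Chapter 9[on-line]; this sketch only illustrates which state lives on which side of the firewall.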


A good example of the firewall strategy can be found in most implementations of Internet Domain Name System servers. In a typical implementation the server stores the authoritative name records for its domain on magnetic disk, and copies those records into volatile CMOS memory either at system startup or the first time it needs a particular record. If the server fails for any reason, it simply abandons the volatile memory and restarts. In some implementations, the firewall is reinforced by not having any PUT oper­ ations in the running name server. Instead, the service updates the authoritative name records using a separate program that runs when the name server is off-line. In addition to employing independent software implementations and a firewall between categories of data, DNS also protects against environmental faults by employing geographical separation of its replicas, a topic that is explored more deeply in Section 10.3[on-line]. The three techniques taken together make DNS quite fault tolerant.

8.5.3 Durability and Durable Storage For the discipline just described to work, we need to make the result of a PUT operation durable. But first we must understand just what “durable” means. Durability is a speci­ fication of how long the result of an action must be preserved after the action completes. One must be realistic in specifying durability because there is no such thing as perfectly durable storage in which the data will be remembered forever. However, by choosing enough genuinely independent replicas, and with enough care in management, one can meet any reasonable requirement. Durability specifications can be roughly divided into four categories, according to the length of time that the application requires that data survive. Although there are no bright dividing lines, as one moves from one category to the next the techniques used to achieve durability tend to change. • Durability no longer than the lifetime of the thread that created the data. For this case, it is usually adequate to place the data in volatile memory. For example, an action such as moving the gearshift may require changing the oper­ ating parameters of an automobile engine. The result must be reliably remembered, but only until the next shift of gears or the driver switches the engine off. The operations performed by calls to the kernel of an operating system provide another example. The CHDIR procedure of the UNIX kernel (see Table 2.1 in Section 2.5.1) changes the working directory of the currently running process. The kernel state variable that holds the name of the current working directory is a value in volatile RAM that does not need to survive longer than this process. For a third example, the registers and cache of a hardware processor usually provide just the first category of durability. If there is a failure, the plan is to abandon those values along with the contents of volatile memory, so there is no need for a higher level of durability. • Durability for times short compared with the expected operational lifetime of non­ volatile storage media such as magnetic disk or flash memory. A designer typically


implements this category of durability by writing one copy of the data in the non­ volatile storage medium. Returning to the automotive example, there may be operating parameters such as engine timing that, once calibrated, should be durable at least until the next tune-up, not just for the life of one engine use session. Data stored in a cache that writes through to a non-volatile medium has about this level of durability. As a third example, a remote pro­ cedure call protocol that identifies duplicate messages by recording nonces might write old nonce values (see Section 7.5.3) to a non-volatile storage medium, knowing that the real goal is not to remember the nonces forever, but rather to make sure that the nonce record outlasts the longest retry timer of any client. Finally, text editors and word-pro­ cessing systems typically write temporary copies on magnetic disk of the material currently being edited so that if there is a system crash or power failure the user does not have to repeat the entire editing session. These temporary copies need to survive only until the end of the current editing session. • Durability for times comparable to the expected operational lifetime of non-volatile storage media. Because actual non-volatile media lifetimes vary quite a bit around the expected lifetime, implementation generally involves placing replicas of the data on independent instances of the non-volatile media. This category of durability is the one that is usually called durable storage and it is the category for which the next section of this chapter develops techniques for implementa­ tion. Users typically expect files stored in their file systems and data managed by a database management system to have this level of durability. Section 10.3[on-line] revis­ its the problem of creating durable storage when replicas are geographically separated. • Durability for many multiples of the expected operational lifetime of non-volatile storage media. This highest level of durability is known as preservation, and is the specialty of archi­ vists. In addition to making replicas and keeping careful records, it involves copying data from one non-volatile medium to another before the first one deteriorates or becomes obsolete. Preservation also involves (sometimes heroic) measures to preserve the ability to correctly interpret idiosyncratic formats created by software that has long since become obsolete. Although important, it is a separate topic, so preservation is not dis­ cussed any further here.

8.5.4 Magnetic Disk Fault Tolerance In principle, durable storage can be constructed starting with almost any storage medium, but it is most straightforward to use non-volatile devices. Magnetic disks (see Sidebar 2.8) are widely used as the basis for durable storage because of their low cost, large capacity and non-volatility—they retain their memory when power is turned off or is accidentally disconnected. Even if power is lost during a write operation, at most a small block of data surrounding the physical location that was being written is lost, and


disks can be designed with enough internal power storage and data buffering to avoid even that loss. In its raw form, a magnetic disk is remarkably reliable, but it can still fail in various ways and much of the complexity in the design of disk systems consists of masking these failures. Conventionally, magnetic disk systems are designed in three nested layers. The inner­ most layer is the spinning disk itself, which provides what we will call raw storage. The next layer is a combination of hardware and firmware of the disk controller that provides for detecting the failures in the raw storage layer; it creates fail-fast storage. Finally, the hard disk firmware adds a third layer that takes advantage of the detection features of the second layer to create a substantially more reliable storage system, known as careful stor­ age. Most disk systems stop there, but high-availability systems add a fourth layer to create durable storage. This section develops a disk failure model and explores error mask­ ing techniques for all four layers. In early disk designs, the disk controller presented more or less the raw disk interface, and the fail-fast and careful layers were implemented in a software component of the operating system called the disk driver. Over the decades, first the fail-fast layer and more recently part or all of the careful layer of disk storage have migrated into the firmware of the disk controller to create what is known in the trade as a “hard drive”. A hard drive usually includes a RAM buffer to hold a copy of the data going to and from the disk, both to avoid the need to match the data rate to and from the disk head with the data rate to and from the system memory and also to simplify retries when errors occur. RAID systems, which provide a form of durable storage, generally are implemented as an addi­ tional hardware layer that incorporates mass-market hard drives. One reason for this move of error masking from the operating system into the disk controller is that as com­ putational power has gotten cheaper, the incremental cost of a more elaborate firmware design has dropped. A second reason may explain the obvious contrast with the lack of enthusiasm for memory parity checking hardware that is mentioned in Section 8.8.1. A transient memory error is all but indistinguishable from a program error, so the hardware vendor is not likely to be blamed for it. On the other hand, most disk errors have an obvi­ ous source, and hard errors are not transient. Because blame is easy to place, disk vendors have a strong motivation to include error masking in their designs.

8.5.4.1 Magnetic Disk Fault Modes

Sidebar 2.8 described the physical design of the magnetic disk, including platters, magnetic material, read/write heads, seek arms, tracks, cylinders, and sectors, but it did not make any mention of disk reliability. There are several considerations:

• Disks are high precision devices made to close tolerances. Defects in manufacturing a recording surface typically show up in the field as a sector that does not reliably record data. Such defects are a source of hard errors. Deterioration of the surface of a platter with age can cause a previously good sector to fail. Such loss is known as decay and, since any data previously recorded there is lost forever, decay is another example of hard error.


• Since a disk is mechanical, it is subject to wear and tear. Although a modern disk is a sealed unit, deterioration of its component materials as they age can create dust. The dust particles can settle on a magnetic surface, where they may interfere either with reading or writing. If interference is detected, then re-reading or re­ writing that area of the surface, perhaps after jiggling the seek arm back and forth, may succeed in getting past the interference, so the fault may be transient. Another source of transient faults is electrical noise spikes. Because disk errors caused by transient faults can be masked by retry, they fall in the category of soft errors. • If a running disk is bumped, the shock may cause a head to hit the surface of a spinning platter, causing what is known as a head crash. A head crash not only may damage the head and destroy the data at the location of impact, it also creates a cloud of dust that interferes with the operation of heads on other platters. A head crash generally results in several sectors decaying simultaneously. A set of sectors that tend to all fail together is known as a decay set. A decay set may be quite large, for example all the sectors on one drive or on one disk platter. • As electronic components in the disk controller age, clock timing and signal detection circuits can go out of tolerance, causing previously good data to become unreadable, or bad data to be written, either intermittently or permanently. In consequence, electronic component tolerance problems can appear either as soft or hard errors. • The mechanical positioning systems that move the seek arm and that keep track of the rotational position of the disk platter can fail in such a way that the heads read or write the wrong track or sector within a track. This kind of fault is known as a seek error.

8.5.4.2 System Faults In addition to failures within the disk subsystem, there are at least two threats to the integrity of the data on a disk that arise from outside the disk subsystem: • If the power fails in the middle of a disk write, the sector being written may end up being only partly updated. After the power is restored and the system restarts, the next reader of that sector may find that the sector begins with the new data, but ends with the previous data. • If the operating system fails during the time that the disk is writing, the data being written could be affected, even if the disk is perfect and the rest of the system is fail-fast. The reason is that all the contents of volatile memory, including the disk buffer, are inside the fail-fast error containment boundary and thus at risk of damage when the system fails. As a result, the disk channel may correctly write on the disk what it reads out of the disk buffer in memory, but the faltering operating system may have accidentally corrupted the contents of that buffer after the


application called PUT. In such cases, the data that ends up on the disk will be corrupted, but there is no sure way in which the disk subsystem can detect the problem.

8.5.4.3 Raw Disk Storage

Our goal is to devise systematic procedures to mask as many of these different faults as possible. We start with a model of disk operation from a programmer’s point of view. The raw disk has, at least conceptually, a relatively simple interface: There is an operation to seek to a (numbered) track, an operation that writes data on the track and an operation that reads data from the track. The failure model is simple: all errors arising from the failures just described are untolerated. (In the procedure descriptions, arguments are call-by-reference, and GET operations read from the disk into the argument named data.) The raw disk layer implements these storage access procedures and failure tolerance model:

	RAW_SEEK (track)	// Move read/write head into position.
	RAW_PUT (data)	// Write entire track.
	RAW_GET (data)	// Read entire track.

• error-free operation: RAW_SEEK moves the seek arm to position track. RAW_GET returns whatever was most recently written by RAW_PUT at position track. • untolerated error: On any given attempt to read from or write to a disk, dust particles on the surface of the disk or a temporarily high noise level may cause data to be read or written incorrectly. (soft error) • untolerated error: A spot on the disk may be defective, so all attempts to write to any track that crosses that spot will be written incorrectly. (hard error) • untolerated error: Information previously written correctly may decay, so RAW_GET returns incorrect data. (hard error) • untolerated error: When asked to read data from or write data to a specified track, a disk may correctly read or write the data, but on the wrong track. (seek error) • untolerated error: The power fails during a RAW_PUT with the result that only the first part of data ends up being written on track. The remainder of track may contain older data. • untolerated error: The operating system crashes during a RAW_PUT and scribbles over the disk buffer in volatile storage, so RAW_PUT writes corrupted data on one track of the disk.

8.5.4.4 Fail-Fast Disk Storage

The fail-fast layer is the place where the electronics and microcode of the disk controller divide the raw disk track into sectors. Each sector is relatively small, individually protected with an error-detection code, and includes in addition to a fixed-sized space for data a sector and track number. The error-detection code enables the disk controller to


return a status code on FAIL_FAST_GET that tells whether a sector read correctly or incor­ rectly, and the sector and track numbers enable the disk controller to verify that the seek ended up on the correct track. The FAIL_FAST_PUT procedure not only writes the data, but it verifies that the write was successful by reading the newly written sector on the next rotation and comparing it with the data still in the write buffer. The sector thus becomes the minimum unit of reading and writing, and the disk address becomes the pair {track, sector_number}. For performance enhancement, some systems allow the caller to bypass the verification step of FAIL_FAST_PUT. When the client chooses this bypass, write failures become indistinguishable from decay events. There is always a possibility that the data on a sector is corrupted in such a way that the error-detection code accidentally verifies. For completeness, we will identify that case as an untolerated error, but point out that the error-detection code should be powerful enough that the probability of this outcome is negligible. The fail-fast layer implements these storage access procedures and failure tolerance model: status ← FAIL_FAST_SEEK (track)

status ← FAIL_FAST_PUT (data, sector_number)

status ← FAIL_FAST_GET (data, sector_number)

• error-free operation: FAIL_FAST_SEEK moves the seek arm to track. FAIL_FAST_GET returns whatever was most recently written by FAIL_FAST_PUT at sector_number on track and returns status = OK. • detected error: FAIL_FAST_GET reads the data, checks the error-detection code and finds that it does not verify. The cause may a soft error, a hard error due to decay, or a hard error because there is a bad spot on the disk and the invoker of a previous FAIL_FAST_PUT chose to bypass verification. FAIL_FAST_GET does not attempt to distinguish these cases; it simply reports the error by returning status = BAD. • detected error: FAIL_FAST_PUT writes the data, on the next rotation reads it back, checks the error-detection code, finds that it does not verify, and reports the error by returning status = BAD. • detected error: FAIL_FAST_SEEK moves the seek arm, reads the permanent track number in the first sector that comes by, discovers that it does not match the requested track number (or that the sector checksum does not verify), and reports the error by returning status = BAD. • detected error: The caller of FAIL_FAST_PUT tells it to bypass the verification step, so FAIL_FAST_PUT always reports status = OK even if the sector was not written correctly. But a later caller of FAIL_FAST_GET that requests that sector should detect any such error. • detected error: The power fails during a FAIL_FAST_PUT with the result that only the first part of data ends up being written on sector. The remainder of sector may contain older data. Any later call of FAIL_FAST_GET for that sector should discover that the sector checksum fails to verify and will thus return status = BAD.


Many (but not all) disks are designed to mask this class of failure by maintaining a reserve of power that is sufficient to complete any current sector write, in which case loss of power would be a tolerated failure. • untolerated error: The operating system crashes during a FAIL_FAST_PUT and scribbles over the disk buffer in volatile storage, so FAIL_FAST_PUT writes corrupted data on one sector of the disk. • untolerated error: The data of some sector decays in a way that is undetectable— the checksum accidentally verifies. (Probability should be negligible.)
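The core of the fail-fast layer is the per-sector error-detection code together with the recorded track and sector numbers. The following sketch (ours, in Python; a real controller would use a medium-appropriate error-detection code and would also verify a write by reading it back on the next rotation, which is omitted here) shows the shape of the idea over a simulated raw sector store:

import zlib

sectors = {}   # simulated raw disk: (track, sector_number) -> stored record

def fail_fast_put(data: bytes, track, sector_number):
    # Store the data together with its address and an error-detection code.
    sectors[(track, sector_number)] = (data, track, sector_number, zlib.crc32(data))
    return "OK"

def fail_fast_get(track, sector_number):
    data, t, s, code = sectors[(track, sector_number)]
    # Verify both the checksum and that this really is the requested sector.
    if (t, s) != (track, sector_number) or zlib.crc32(data) != code:
        return "BAD", None
    return "OK", data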

8.5.4.5 Careful Disk Storage

The fail-fast disk layer detects but does not mask errors. It leaves masking to the careful disk layer, which is also usually implemented in the firmware of the disk controller. The careful layer checks the value of status following each disk SEEK, GET and PUT operation, retrying the operation several times if necessary, a procedure that usually recovers from seek errors and soft errors caused by dust particles or a temporarily elevated noise level. Some disk controllers seek to a different track and back in an effort to dislodge the dust. The careful storage layer implements these storage procedures and failure tolerance model:

status ← CAREFUL_SEEK (track)

status ← CAREFUL_PUT (data, sector_number)

status ← CAREFUL_GET (data, sector_number)

• error-free operation: CAREFUL_SEEK moves the seek arm to track. CAREFUL_GET returns whatever was most recently written by CAREFUL_PUT at sector_number on track. All three return status = OK. • tolerated error: Soft read, write, or seek error. CAREFUL_SEEK, CAREFUL_GET and CAREFUL_PUT mask these errors by repeatedly retrying the operation until the failfast layer stops detecting an error, returning with status = OK. The careful storage layer counts the retries, and if the retry count exceeds some limit, it gives up and declares the problem to be a hard error. • detected error: Hard error. The careful storage layer distinguishes hard from soft errors by their persistence through several attempts to read, write, or seek, and reports them to the caller by setting status = BAD. (But also see the note on revectoring below.) • detected error: The power fails during a CAREFUL_PUT with the result that only the first part of data ends up being written on sector. The remainder of sector may contain older data. Any later call of CAREFUL_GET for that sector should discover that the sector checksum fails to verify and will thus return status = BAD. (Assuming that the fail-fast layer does not tolerate power failures.) • untolerated error: Crash corrupts data. The system crashes during CAREFUL_PUT and corrupts the disk buffer in volatile memory, so CAREFUL_PUT correctly writes to the


disk sector the corrupted data in that buffer. The sector checksum of the fail-fast layer cannot detect this case.
• untolerated error: The data of some sector decays in a way that is undetectable—the checksum accidentally verifies. (Probability should be negligible.)

Figure 8.12 exhibits algorithms for CAREFUL_GET and CAREFUL_PUT. The procedure CAREFUL_GET, by repeatedly reading any data with status = BAD, masks soft read errors. Similarly, CAREFUL_PUT retries repeatedly if the verification done by FAIL_FAST_PUT fails, thereby masking soft write errors, whatever their source.

The careful layer of most disk controller designs includes one more feature: if CAREFUL_PUT detects a hard error while writing a sector, it may instead write the data on a spare sector elsewhere on the same disk and add an entry to an internal disk mapping table so that future GETs and PUTs that specify that sector instead use the spare. This mechanism is called revectoring, and most disk designs allocate a batch of spare sectors for this purpose. The spares are not usually counted in the advertised disk capacity, but the manufacturer’s advertising department does not usually ignore the resulting increase in the expected operational lifetime of the disk. For clarity of the discussion we omit that feature.

As indicated in the failure tolerance analysis, there are still two modes of failure that remain unmasked: a crash during CAREFUL_PUT may undetectably corrupt one disk sector, and a hard error arising from a bad spot on the disk or a decay event may detectably corrupt any number of disk sectors.

procedure CAREFUL_GET (data, sector_number)
	for i from 1 to NTRIES do
		if FAIL_FAST_GET (data, sector_number) = OK then
			return OK
	return BAD

procedure CAREFUL_PUT (data, sector_number)
	for i from 1 to NTRIES do
		if FAIL_FAST_PUT (data, sector_number) = OK then
			return OK
	return BAD

FIGURE 8.12 Procedures that implement careful disk storage.

8.5.4.6 Durable Storage: RAID 1

For durability, the additional requirement is to mask decay events, which the careful storage layer only detects. The primary technique is that the PUT procedure should write several replicas of the data, taking care to place the replicas on different physical devices with the hope that the probability of disk decay in one replica is independent of the probability of disk decay in the next one, and the number of replicas is large enough that when a disk fails there is enough time to replace it before all the other replicas fail. Disk system designers call these replicas mirrors. A carefully designed replica strategy can create storage that guards against premature disk failure and that is durable enough to substantially exceed the expected operational lifetime of any single physical disk. Errors on reading are detected by the fail-fast layer, so it is not usually necessary to read more than one copy unless that copy turns out to be bad. Since disk operations may involve more than one replica, the track and sector numbers are sometimes encoded into a virtual sector number and the durable storage layer automatically performs any needed seeks. The durable storage layer implements these storage access procedures and failure tolerance model:

status ← DURABLE_PUT (data, virtual_sector_number)

status ← DURABLE_GET (data, virtual_sector_number)

• error-free operation: DURABLE_GET returns whatever was most recently written by DURABLE_PUT at virtual_sector_number with status = OK. • tolerated error: Hard errors reported by the careful storage layer are masked by reading from one of the other replicas. The result is that the operation completes with status = OK. • untolerated error: A decay event occurs on the same sector of all the replicas, and the operation completes with status = BAD. • untolerated error: The operating system crashes during a DURABLE_PUT and scribbles over the disk buffer in volatile storage, so DURABLE_PUT writes corrupted data on all mirror copies of that sector. • untolerated error: The data of some sector decays in a way that is undetectable— the checksum accidentally verifies. (Probability should be negligible) In this accounting there is no mention of soft errors or of positioning errors because they were all masked by a lower layer. One configuration of RAID (see Section 2.1.1.4), known as “RAID 1”, implements exactly this form of durable storage. RAID 1 consists of a tightly-managed array of iden­ tical replica disks in which DURABLE_PUT (data, sector_number) writes data at the same sector_number of each disk and DURABLE_GET reads from whichever replica copy has the smallest expected latency, which includes queuing time, seek time, and rotation time. With RAID, the decay set is usually taken to be an entire hard disk. If one of the disks fails, the next DURABLE_GET that tries to read from that disk will detect the failure, mask it by reading from another replica, and put out a call for repair. Repair consists of first replacing the disk that failed and then copying all of the disk sectors from one of the other replica disks.
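A sketch of the mirroring idea (ours, in Python; the in-memory "disks" and the careful-layer stand-ins are illustrative, not a real disk interface): DURABLE_PUT writes every mirror, and DURABLE_GET reads mirrors in turn until one returns good data.

# Stand-ins for the careful layer: each "disk" is a dict mapping sector -> data,
# and a missing (decayed) sector reads as BAD.
def careful_put(disk, data, sector):
    disk[sector] = data
    return "OK"

def careful_get(disk, sector):
    return ("OK", disk[sector]) if sector in disk else ("BAD", None)

def durable_put(data, sector, mirrors):
    # Write every mirror; report OK only if all writes succeed.
    return "OK" if all(careful_put(d, data, sector) == "OK" for d in mirrors) else "BAD"

def durable_get(sector, mirrors):
    # Read mirrors in turn; a hard error on one replica is masked by the next.
    for d in mirrors:
        status, data = careful_get(d, sector)
        if status == "OK":
            return "OK", data
    return "BAD", None

mirrors = [{}, {}, {}]
durable_put(b"hello", 7, mirrors)
del mirrors[0][7]                    # simulate decay of one replica
print(durable_get(7, mirrors))       # ('OK', b'hello') -- the decay is masked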

8.5.4.7 Improving on RAID 1

Even with RAID 1, an untolerated error can occur if a rarely-used sector decays, and before that decay is noticed all other copies of that same sector also decay. When there is


finally a call for that sector, all fail to read and the data is lost. A closely related scenario is that a sector decays and is eventually noticed, but the other copies of that same sector decay before repair of the first one is completed. One way to reduce the chances of these outcomes is to implement a clerk that periodically reads all replicas of every sector, to check for decay. If CAREFUL_GET reports that a replica of a sector is unreadable at one of these periodic checks, the clerk immediately rewrites that replica from a good one. If the rewrite fails, the clerk calls for immediate revectoring of that sector or, if the number of revectorings is rapidly growing, replacement of the decay set to which the sector belongs. The period between these checks should be short enough that the probability that all replicas have decayed since the previous check is negligible. By analyzing the statistics of experience for similar disk systems, the designer chooses such a period, Td. This approach leads to the following failure tolerance model:

status ← MORE_DURABLE_PUT (data, virtual_sector_number)
status ← MORE_DURABLE_GET (data, virtual_sector_number)

• error-free operation: MORE_DURABLE_GET returns whatever was most recently written by MORE_DURABLE_PUT at virtual_sector_number with status = OK • tolerated error: Hard errors reported by the careful storage layer are masked by reading from one of the other replicas. The result is that the operation completes with status = OK. • tolerated error: data of a single decay set decays, is discovered by the clerk, and is repaired, all within Td seconds of the decay event. • untolerated error: The operating system crashes during a DURABLE_PUT and scribbles over the disk buffer in volatile storage, so DURABLE_PUT writes corrupted data on all mirror copies of that sector. • untolerated error: all decay sets fail within T d seconds. (With a conservative choice of Td, the probability of this event should be negligible.) • untolerated error: The data of some sector decays in a way that is undetectable— the checksum accidentally verifies. (With a good quality checksum, the probability of this event should be negligible.) A somewhat less effective alternative to running a clerk that periodically verifies integ­ rity of the data is to notice that the bathtub curve of Figure 8.1 applies to magnetic disks, and simply adopt a policy of systematically replacing the individual disks of the RAID array well before they reach the point where their conditional failure rate is predicted to start climbing. This alternative is not as effective for two reasons: First, it does not catch and repair random decay events, which instead accumulate. Second, it provides no warn­ ing if the actual operational lifetime is shorter than predicted (for example, if one happens to have acquired a bad batch of disks).
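The clerk itself is little more than a periodic scrub loop. A sketch (ours, in Python, reusing the toy careful_get/careful_put stand-ins from the earlier mirroring fragment; a real system would schedule it so the scan period is comfortably shorter than Td):

def scrub(sector_numbers, mirrors):
    # Periodically read every replica of every sector; rewrite any replica
    # that fails from a replica that still reads correctly.
    for sector in sector_numbers:
        good = None
        for d in mirrors:
            status, data = careful_get(d, sector)
            if status == "OK":
                good = data
                break
        if good is None:
            continue          # all replicas already lost: nothing the clerk can do
        for d in mirrors:
            status, _ = careful_get(d, sector)
            if status != "OK":
                careful_put(d, good, sector)   # repair the decayed replica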


8.5.4.8 Detecting Errors Caused by System Crashes

Sidebar 8.3: Are disk system checksums a wasted effort? From the adjacent paragraph, an end-to-end argument suggests that an end-to-end checksum is always needed to protect data on its way to and from the disk subsystem, and that the fail-fast checksum performed inside the disk system thus may not be essential. However, the disk system checksum cleanly subcontracts one rather specialized job: correcting burst errors of the storage medium. In addition, the disk system checksum provides a handle for disk-layer erasure code implementations such as RAID, as was described in Section 8.4.1. Thus the disk system checksum, though superficially redundant, actually turns out to be quite useful.

With the addition of a clerk to watch for decay, there is now just one remaining untolerated error that has a significant probability: the hard error created by an operating system crash during CAREFUL_PUT. Since that scenario corrupts the data before the disk subsystem sees it, the disk subsystem has no way of either detecting or masking this error. Help is needed from outside the disk subsystem—either the operating system or the application. The usual approach is that either the system or, even better, the application program, calculates and includes an end-to-end checksum with the data before initiating the disk write. Any program that later reads the data verifies that the stored checksum matches the recalculated checksum of the data. The end-to-end checksum thus monitors the integrity of the data as it passes through the operating system buffers and also while it resides in the disk subsystem.
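A sketch of the end-to-end checksum discipline (ours, in Python; the choice of hash and record layout is illustrative only): the application attaches a checksum before handing data to the storage path and re-verifies it on every read, so corruption introduced anywhere in between is at least detected.

import hashlib

def wrap_for_put(data: bytes) -> bytes:
    # The application-level checksum travels with the data through the
    # operating system buffers and the disk subsystem.
    return hashlib.sha256(data).digest() + data

def unwrap_after_get(record: bytes) -> bytes:
    digest, data = record[:32], record[32:]
    if hashlib.sha256(data).digest() != digest:
        raise IOError("end-to-end checksum mismatch: data corrupted in transit or at rest")
    return data

print(unwrap_after_get(wrap_for_put(b"payroll records")))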

The end-to-end checksum allows only detecting this class of error. Masking is another matter—it involves a technique called recovery, which is one of the topics of the next chapter.

Table 8.1 summarizes where failure tolerance is implemented in the several disk layers. The hope is that the remaining untolerated failures are so rare that they can be neglected. If they are not, the number of replicas could be increased until the probability of untolerated failures is negligible.

8.5.4.9 Still More Threats to Durability

The various procedures described above create storage that is durable in the face of individual disk decay but not in the face of other threats to data integrity. For example, if the power fails in the middle of a MORE_DURABLE_PUT, some replicas may contain old versions of the data, some may contain new versions, and some may contain corrupted data, so it is not at all obvious how MORE_DURABLE_GET should go about meeting its specification. The solution is to make MORE_DURABLE_PUT atomic, which is one of the topics of Chapter 9[on-line].

RAID systems usually specify that a successful return from a PUT confirms that writing of all of the mirror replicas was successful. That specification in turn usually requires that the multiple disks be physically co-located, which in turn creates a threat that a single


physical disaster—fire, earthquake, flood, civil disturbance, etc.—might damage or destroy all of the replicas. Since magnetic disks are quite reliable in the short term, a different strategy is to write only one replica at the time that MORE_DURABLE_PUT is invoked and write the remaining replicas at a later time. Assuming there are no inopportune failures in the short run, the results gradually become more durable as more replicas are written. Replica writes that are separated in time are less likely to have replicated failures because they can be separated in physical location, use different disk driver software, or be written to completely different media such as magnetic tape. On the other hand, separating replica writes in time increases the risk of inconsistency among the replicas. Implementing storage with durability substantially beyond that of RAID 1 and MORE_DURABLE_PUT/GET generally involves use of geographically separated replicas and systematic mechanisms to keep those replicas coordinated, a challenge that Chapter 10[on-line] discusses in depth.

Perhaps the most serious threat to durability is that although different storage systems have employed each of the failure detection and masking techniques discussed in this section, it is all too common to discover that a typical off-the-shelf personal computer file system has been designed using an overly simple disk failure model and thus misses some—or even many—straightforward failure masking opportunities.

                                       raw        fail-fast  careful    durable    more durable
                                       layer      layer      layer      layer      layer
soft read, write, or seek error        failure    detected   masked
hard read or write error               failure    detected   detected   masked
power failure interrupts a write       failure    detected   detected   masked
single data decay                      failure    detected   detected   masked
multiple data decay spaced in time     failure    detected   detected   detected   masked
multiple data decay within Td          failure    detected   detected   detected   failure*
undetectable decay                     failure    failure    failure    failure    failure*
system crash corrupts write buffer     failure    failure    failure    failure    detected

Table 8.1: Summary of disk failure tolerance models. Each entry shows the effect of this error at the interface between the named layer and the next higher layer. With careful design, the probability of the two failures marked with an asterisk should be negligible. Masking of corruption caused by system crashes is discussed in Chapter 9[on-line].

8.6 Wrapping up Reliability

8.6.1 Design Strategies and Design Principles

Standing back from the maze of detail about redundancy, we can identify and abstract three particularly effective design strategies:

• N-modular redundancy is a simple but powerful tool for masking failures and increasing availability, and it can be used at any convenient level of granularity.
• Fail-fast modules provide a sweeping simplification of the problem of containing errors. When containment can be described simply, reasoning about fault tolerance becomes easier.
• Pair-and-compare allows fail-fast modules to be constructed from commercial, off-the-shelf components.

Standing back still further, it is apparent that several general design principles are directly applicable to fault tolerance. In the formulation of the fault-tolerance design process in Section 8.1.2, we invoked be explicit, design for iteration, keep digging, and the safety margin principle, and in exploring different fault tolerance techniques we have seen several examples of adopt sweeping simplifications. One additional design principle that applies to fault tolerance (and also, as we will see in Chapter 11[on-line], to security) comes from experience, as documented in the case studies of Section 8.8:

    Avoid rarely used components
    Deterioration and corruption accumulate unnoticed—until the next use.

Whereas redundancy can provide masking of errors, redundant components that are used only when failures occur are much more likely to cause trouble than redundant components that are regularly exercised in normal operation. The reason is that failures in regularly exercised components are likely to be immediately noticed and fixed. Failures in unused components may not be noticed until a failure somewhere else happens. But then there are two failures, which may violate the design assumptions of the masking plan. This observation is especially true for software, where rarely-used recovery procedures often accumulate unnoticed bugs and incompatibilities as other parts of the system evolve. The alternative of periodic testing of rarely-used components to lower their failure latency is a band-aid that rarely works well.

In applying these design principles, it is important to consider the threats, the consequences, the environment, and the application. Some faults are more likely than others,


some failures are more disruptive than others, and different techniques may be appropriate in different environments. A computer-controlled radiation therapy machine, a deep-space probe, a telephone switch, and an airline reservation system all need fault tolerance, but in quite different forms. The radiation therapy machine should emphasize fault detection and fail-fast design, to avoid injuring patients. Masking faults may actually be a mistake. It is likely to be safer to stop, find their cause, and fix them before continuing operation. The deep-space probe, once the mission begins, needs to concentrate on failure masking to ensure mission success. The telephone switch needs many nines of availability because customers expect to always receive a dial tone, but if it occasionally disconnects one ongoing call, that customer will simply redial without thinking much about it. Users of the airline reservation system might tolerate short gaps in availability, but the durability of its storage system is vital. At the other extreme, most people find that a digital watch has an MTTF that is long compared with the time until the watch is misplaced, becomes obsolete, goes out of style, or is discarded. Consequently, no provision for either error masking or repair is really needed.

Some applications have built-in redundancy that a designer can exploit. In a video stream, it is usually possible to mask the loss of a single video frame by just repeating the previous frame.

8.6.2 How about the End-to-End Argument?

There is a potential tension between error masking and an end-to-end argument. An end-to-end argument suggests that a subsystem need not do anything about errors and should not do anything that might compromise other goals such as low latency, high throughput, or low cost. The subsystem should instead let the higher layer system of which it is a component take care of the problem because only the higher layer knows whether or not the error matters and what is the best course of action to take.

There are two counterarguments to that line of reasoning:

• Ignoring an error allows it to propagate, thus contradicting the modularity goal of error containment. This observation points out an important distinction between error detection and error masking. Error detection and containment must be performed where the error happens, so that the error does not propagate wildly. Error masking, in contrast, presents a design choice: masking can be done locally or the error can be handled by reporting it at the interface (that is, by making the module design fail-fast) and allowing the next higher layer to decide what masking action—if any—to take.
• The lower layer may know the nature of the error well enough that it can mask it far more efficiently than the upper layer. The specialized burst error correction codes used on DVDs come to mind. They are designed specifically to mask errors caused by scratches and dust particles, rather than random bit-flips.

So we have a trade-off between the cost of masking the fault locally and the cost of letting the error propagate and handling it in a higher layer.


These two points interact: When an error propagates it can contaminate otherwise correct data, which can increase the cost of masking and perhaps even render masking impossible. The result is that when the cost is small, error masking is usually done locally. (That is assuming that masking is done at all. Many personal computer designs omit memory error masking. Section 8.8.1 discusses some of the reasons for this design decision.)

A closely related observation is that when a lower layer masks a fault it is important that it also report the event to a higher layer, so that the higher layer can keep track of how much masking is going on and thus how much failure tolerance there remains. Reporting to a higher layer is a key aspect of the safety margin principle.

8.6.3 A Caution on the Use of Reliability Calculations

Reliability calculations seem to be exceptionally vulnerable to the garbage-in, garbage-out syndrome. It is all too common that calculations of mean time to failure are undermined because the probabilistic models are not supported by good statistics on the failure rate of the components, by measures of the actual load on the system or its components, or by accurate assessment of independence between components.

For computer systems, back-of-the-envelope calculations are often more than sufficient because they are usually at least as accurate as the available input data, which tends to be rendered obsolete by rapid technology change. Numbers predicted by formula can generate a false sense of confidence. This argument is much weaker for technologies that tend to be stable (for example, production lines that manufacture glass bottles). So reliability analysis is not a waste of time, but one must be cautious in applying its methods to computer systems.

8.6.4 Where to Learn More about Reliable Systems

Our treatment of fault tolerance has explored only the first layer of fundamental concepts. There is much more to the subject. For example, we have not considered another class of fault that combines the considerations of fault tolerance with those of security: faults caused by inconsistent, perhaps even malevolent, behavior. These faults have the characteristic that they generate inconsistent error values, possibly error values that are specifically designed by an attacker to confuse or confound fault tolerance measures. These faults are called Byzantine faults, recalling the reputation of ancient Byzantium for malicious politics. Here is a typical Byzantine fault: suppose that an evil spirit occupies one of the three replicas of a TMR system, waits for one of the other replicas to fail, and then adjusts its own output to be identical to the incorrect output of the failed replica. A voter accepts this incorrect result and the error propagates beyond the intended containment boundary. In another kind of Byzantine fault, a faulty replica in an NMR system sends different result values to each of the voters that are monitoring its output. Malevolence is not required—any fault that is not anticipated by a fault detection mechanism can produce Byzantine behavior. There has recently been considerable attention to techniques


that can tolerate Byzantine faults. Because the tolerance algorithms can be quite complex, we defer the topic to advanced study.

We also have not explored the full range of reliability techniques that one might encounter in practice. For an example that has not yet been mentioned, Sidebar 8.4 describes the heartbeat, a popular technique for detecting failures of active processes.

This chapter has oversimplified some ideas. For example, the definition of availability proposed in Section 8.2 of this chapter is too simple to adequately characterize many large systems. If a bank has hundreds of automatic teller machines, there will probably always be a few teller machines that are not working at any instant. For this case, an availability measure based on the percentage of transactions completed within a specified response time would probably be more appropriate.

A rapidly moving but in-depth discussion of fault tolerance can be found in Chapter 3 of the book Transaction Processing: Concepts and Techniques, by Jim Gray and Andreas Reuter. A broader treatment, with case studies, can be found in the book Reliable Computer Systems: Design and Evaluation, by Daniel P. Siewiorek and Robert S. Swarz. Byzantine faults are an area of ongoing research and development, and the best source is current professional literature.

This chapter has concentrated on general techniques for achieving reliability that are applicable to hardware, software, and complete systems. Looking ahead, Chapters 9[on-line] and 10[on-line] revisit reliability in the context of specific software techniques that permit reconstruction of stored state following a failure when there are several concurrent activities. Chapter 11[on-line], on securing systems against malicious attack, introduces a redundancy scheme known as defense in depth that can help both to contain and to mask errors in the design or implementation of individual security mechanisms.

Sidebar 8.4: Detecting failures with heartbeats. An activity such as a Web server is usually intended to keep running indefinitely. If it fails (perhaps by crashing), its clients may notice that it has stopped responding, but clients are not typically in a position to restart the server. Something more systematic is needed to detect the failure and initiate recovery. One helpful technique is to program the thread that should be performing the activity to send a periodic signal to another thread (or a message to a monitoring service) that says, in effect, "I'm still OK". The periodic signal is known as a heartbeat and the observing thread or service is known as a watchdog. The watchdog service sets a timer, and on receipt of a heartbeat message it restarts the timer. If the timer ever expires, the watchdog assumes that the monitored service has gotten into trouble and it initiates recovery.

One limitation of this technique is that if the monitored service fails in such a way that the only thing it does is send heartbeat signals, the failure will go undetected. As with all fixed timers, choosing a good heartbeat interval is an engineering challenge. Setting the interval too short wastes resources sending and responding to heartbeat signals. Setting the interval too long delays detection of failures. Since detection is a prerequisite to repair, a long heartbeat interval increases MTTR and thus reduces availability.
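The sidebar's watchdog can be sketched in a few lines. The fragment below is Python rather than the book's pseudocode, and the timeout value, the recovery action, and the threading arrangement are illustrative assumptions; a production watchdog would normally run in a separate service so that it does not fail along with the activity it monitors.

    import threading, time

    class Watchdog:
        def __init__(self, timeout, recover):
            self.timeout = timeout              # seconds to wait for a heartbeat
            self.recover = recover              # action to initiate on expiration
            self.last_beat = time.monotonic()

        def heartbeat(self):
            # Called periodically by the monitored activity: "I'm still OK".
            self.last_beat = time.monotonic()

        def watch(self):
            # Run in the watchdog thread; checks whether the timer has expired.
            while True:
                time.sleep(self.timeout / 2)
                if time.monotonic() - self.last_beat > self.timeout:
                    self.recover()              # monitored service presumed failed
                    return

    # Example wiring (illustrative only; restart_server stands for whatever
    # recovery means in a particular system):
    # dog = Watchdog(timeout=10.0, recover=restart_server)
    # threading.Thread(target=dog.watch, daemon=True).start()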


8.7 Application: A Fault Tolerance Model for CMOS RAM

This section develops a fault tolerance model for words of CMOS random access memory, first without and then with a simple error-correction code, comparing the probability of error in the two cases. CMOS RAM is both low in cost and extraordinarily reliable, so much so that error masking is often not implemented in mass production systems such as television sets and personal computers. But some systems, for example life-support, air traffic control, or banking systems, cannot afford to take unnecessary risks. Such systems usually employ the same low-cost memory technology but add incremental redundancy.

A common failure of CMOS RAM is that noise intermittently causes a single bit to read or write incorrectly. If intermittent noise affected only reads, then it might be sufficient to detect the error and retry the read. But the possibility of errors on writes suggests using a forward error-correction code.

We start with a fault tolerance model that applies when reading a word from memory without error correction. The model assumes that errors in different bits are independent and it assigns p as the (presumably small) probability that any individual bit is in error. The notation O(p^n) means terms involving p^n and higher, presumably negligible, powers. Here are the possibilities and their associated probabilities:

Fault tolerance model for raw CMOS random access memory

                                                              probability
error-free case:  all 32 bits are correct                     (1 – p)^32 = 1 – O(p)
errors:
  untolerated:    one bit is in error                         32p(1 – p)^31 = O(p)
  untolerated:    two bits are in error                       (31 ⋅ 32 / 2) p^2 (1 – p)^30 = O(p^2)
  untolerated:    three or more bits are in error             (30 ⋅ 31 ⋅ 32 / (3 ⋅ 2)) p^3 (1 – p)^29 + … + p^32 = O(p^3)

The coefficients 32, (31 ⋅ 32)/2, etc., arise by counting the number of ways that one, two, etc., bits could be in error. Suppose now that the 32-bit block of memory is encoded using a code of Hamming distance 3, as described in Section 8.4.1. Such a code allows any single-bit error to be


corrected and any double-bit error to be detected. After applying the decoding algorithm, the fault tolerance model changes to:

Fault tolerance model for CMOS memory with error correction

                                                              probability
error-free case:  all 32 bits are correct                     (1 – p)^32 = 1 – O(p)
errors:
  tolerated:      one bit corrected                           32p(1 – p)^31 = O(p)
  detected:       two bits are in error                       (31 ⋅ 32 / 2) p^2 (1 – p)^30 = O(p^2)
  untolerated:    three or more bits are in error             (30 ⋅ 31 ⋅ 32 / (3 ⋅ 2)) p^3 (1 – p)^29 + … + p^32 = O(p^3)

The interesting change is in the probability that the decoded value is correct. That probability is the sum of the probabilities that there were no errors and that there was one, tolerated error:

    Prob(decoded value is correct) = (1 – p)^32 + 32p(1 – p)^31
                                   = (1 – 32p + (31 ⋅ 32 / 2)p^2 + …) + (32p – 31 ⋅ 32 p^2 + …)
                                   = (1 – O(p^2))

The decoding algorithm has thus eliminated the errors that have probability of order p. It has not eliminated the two-bit errors, which have probability of order p^2, but for two-bit errors the algorithm is fail-fast, so a higher-level procedure has an opportunity to recover, perhaps by requesting retransmission of the data. The code is not helpful if there are errors in three or more bits, which situation has probability of order p^3, but presumably the designer has determined that probabilities of that order are negligible. If they are not, the designer should adopt a more powerful error-correction code.

With this model in mind, one can review the two design questions suggested on page 8–19. The first question is whether the estimate of bit error probability is realistic and if it is realistic to suppose that multiple bit errors are statistically independent of one another. (Error independence appeared in the analysis in the claim that the probability of an n-bit error has the order of the nth power of the probability of a one-bit error.) Those questions concern the real world and the accuracy of the designer's model of it. For example, this failure model doesn't consider power failures, which might take all the bits out at once, or a driver logic error that might take out all of the even-numbered bits.


It also ignores the possibility of faults that lead to errors in the logic of the error-correction circuitry itself.

The second question is whether the coding algorithm actually corrects all one-bit errors and detects all two-bit errors. That question is explored by examining the mathematical structure of the error-correction code and is quite independent of anybody's estimate or measurement of real-world failure types and rates. There are many off-the-shelf coding algorithms that have been thoroughly analyzed and for which the answer is yes.
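The orders of magnitude that this model predicts are easy to check numerically. The short Python calculation below evaluates the raw and error-corrected cases of the model in this section for one illustrative per-bit error probability; the value of p is an arbitrary assumption, not a measured failure rate.

    from math import comb

    p = 1e-6                      # illustrative per-bit error probability (assumed)
    n = 32                        # bits in the word

    def prob_exactly(k):
        """Probability that exactly k of the n bits are in error,
        assuming independent bit errors."""
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # Raw memory: any bit error at all is untolerated.
    prob_wrong_raw = 1 - prob_exactly(0)                         # about 32p, i.e., O(p)

    # With single-error correction: one-bit errors are masked; two or more
    # bits in error leave the decoded value wrong (two-bit errors are at
    # least detected, three or more are untolerated).
    prob_wrong_corrected = 1 - prob_exactly(0) - prob_exactly(1)  # about 496*p**2, i.e., O(p^2)

    print(prob_wrong_raw, prob_wrong_corrected)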

8.8 War Stories: Fault Tolerant Systems that Failed

8.8.1 Adventures with Error Correction*

* These experiences were reported by Butler Lampson, one of the designers of the MAXC computer and the Alto personal workstations at Xerox Palo Alto Research Center.

The designers of the computer systems at the Xerox Palo Alto Research Center in the early 1970s encountered a series of experiences with error-detecting and error-correcting memory systems. From these experiences follow several lessons, some of which are far from intuitive, and all of which still apply several decades later.

MAXC. One of the first projects undertaken in the newly-created Computer Systems Laboratory was to build a time-sharing computer system, named MAXC. A brand new 1024-bit memory chip, the Intel 1103, had just appeared on the market, and it promised to be a compact and economical choice for the main memory of the computer. But since the new chip had unknown reliability characteristics, the MAXC designers implemented the memory system using a few extra bits for each 36-bit word, in the form of a single-error-correction, double-error-detection code.

Experience with the memory in MAXC was favorable. The memory was solidly reliable—so solid that no errors in the memory system were ever reported.

The Alto. When the time came to design the Alto personal workstation, the same Intel memory chips still appeared to be the preferred component. Because these chips had performed so reliably in MAXC, the designers of the Alto memory decided to relax a little, omitting error correction. But, they were still conservative enough to provide error detection, in the form of one parity bit for each 16-bit word of memory.

This design choice seemed to be a winner because the Alto memory systems also performed flawlessly, at least for the first several months. Then, mysteriously, the operating system began to report frequent memory-parity failures.

Some background: the Alto started life with an operating system and applications that used a simple typewriter-style interface. The display was managed with a character-by-character teletype emulator. But the purpose of the Alto was to experiment with better


things. One of the first steps in that direction was to implement the first what-you-see-is-what-you-get editor, named Bravo. Bravo took full advantage of the bit-map display, filling it not only with text, but also with lines, buttons, and icons. About half the memory system was devoted to display memory. Curiously, the installation of Bravo coincided with the onset of memory parity errors.

It turned out that the Intel 1103 chips were pattern-sensitive—certain read/write sequences of particular bit patterns could cause trouble, probably because those pattern sequences created noise levels somewhere on the chip that systematically exceeded some critical threshold. The Bravo editor's display management was the first application that generated enough different patterns to have an appreciable probability of causing a parity error. It did so, frequently.

Lesson 8.8.1a: There is no such thing as a small change in a large system. A new piece of software can bring down a piece of hardware that is thought to be working perfectly. You are never quite sure just how close to the edge of the cliff you are standing.

Lesson 8.8.1b: Experience is a primary source of information about failures. It is nearly impossible, without specific prior experience, to predict what kinds of failures you will encounter in the field.

Back to MAXC. This circumstance led to a more careful review of the situation on MAXC. MAXC, being a heavily used server, would be expected to encounter at least some of this pattern sensitivity. It was discovered that although the error-correction circuits had been designed to report both corrected errors and uncorrectable errors, the software logged only uncorrectable errors, and corrected errors were being ignored. When logging of corrected errors was implemented, it turned out that the MAXC's Intel 1103's were actually failing occasionally, and the error-correction circuitry was busily setting things right.

Lesson 8.8.1c: Whenever systems implement automatic error masking, it is important to follow the safety margin principle, by tracking how often errors are successfully masked. Without this information, one has no way of knowing whether the system is operating with a large or small safety margin for additional errors. Otherwise, despite the attempt to put some guaranteed space between yourself and the edge of the cliff, you may be standing on the edge again.

The Alto 2. In 1975, it was time to design a follow-on workstation, the Alto 2. A new generation of memory chips, this time with 4096 bits, was now available. Since it took up much less space and promised to be cheaper, this new chip looked attractive, but again there was no experience with its reliability. The Alto 2 designers, having been made wary by the pattern sensitivity of the previous generation chips, again resorted to a single-error-correction, double-error-detection code in the memory system.

Once again, the memory system performed flawlessly. The cards passed their acceptance tests and went into service. In service, not only were no double-bit errors detected, but single-bit errors were only rarely being corrected. The initial conclusion was that the chip vendors had worked the bugs out and these chips were really good.


About two years later, someone discovered an implementation mistake. In one quadrant of each memory card, neither error correction nor error detection was actually working. All computations done using memory in the misimplemented quadrant were completely unprotected from memory errors.

Lesson 8.8.1d: Never assume that the hardware actually does what it says in the specifications.

Lesson 8.8.1e: It is harder than it looks to test the fault tolerance features of a fault tolerant system.

One might conclude that the intrinsic memory chip reliability had improved substantially—so much that it was no longer necessary to take heroic measures to achieve system reliability. Certainly the chips were better, but they weren't perfect. The other effect here is that errors often don't lead to failures. In particular, a wrong bit retrieved from memory does not necessarily lead to an observed failure. In many cases a wrong bit doesn't matter; in other cases it does but no one notices; in still other cases, the failure is blamed on something else.

Lesson 8.8.1f: Just because it seems to be working doesn't mean that it actually is.

The bottom line. One of the designers of MAXC and the Altos, Butler Lampson, suggests that the possibility that a failure is blamed on something else can be viewed as an opportunity, and it may be one of the reasons that PC manufacturers often do not provide memory parity checking hardware. First, the chips are good enough that errors are rare. Second, if you provide parity checks, consider who will be blamed when the parity circuits report trouble: the hardware vendor. Omitting the parity checks probably leads to occasional random behavior, but occasional random behavior is indistinguishable from software error and is usually blamed on the software.

Lesson 8.8.1g (in Lampson's words): "Beauty is in the eye of the beholder. The various parties involved in the decisions about how much failure detection and recovery to implement do not always have the same interests."

8.8.2 Risks of Rarely-Used Procedures: The National Archives

The National Archives and Record Administration of the United States government has the responsibility, among other things, of advising the rest of the government how to preserve electronic records such as e-mail messages for posterity. Quite separate from that responsibility, the organization also operates an e-mail system at its Washington, D.C. headquarters for a staff of about 125 people, and about 10,000 messages a month pass through this system. To ensure that no messages are lost, it arranged with an outside contractor to perform daily incremental backups and to make periodic complete backups of its e-mail files. On the chance that something may go wrong, the system has audit logs that track actions regarding incoming and outgoing mail as well as maintenance on files.

Over the weekend of June 18–21, 1999, the e-mail records for the previous four months (an estimated 43,000 messages) disappeared. No one has any idea what went wrong—the files may have been deleted by a disgruntled employee or a runaway house-


cleaning program, or the loss may have been caused by a wayward system bug. In any case, on Monday morning when people came to work, they found that the files were missing.

On investigation, the system managers reported that the audit logs had been turned off because they were reducing system performance, so there were no clues available to diagnose what went wrong. Moreover, since the contractor's employees had never gotten around to actually performing the backup part of the contract, there were no backup copies. It had not occurred to the staff of the Archives to verify the existence of the backup copies, much less to test them to see if they could actually be restored. They assumed that since the contract required it, the work was being done. The contractor's project manager and the employee responsible for making backups were immediately replaced. The Assistant Archivist reports that backup systems have now been beefed up to guard against another mishap, but he added that the safest way to save important messages is to print them out.*

* George Lardner Jr. "Archives Loses 43,000 E-Mails; officials can't explain summer erasure; backup system failed." The Washington Post, Thursday, January 6, 2000, page A17.

Lesson 8.8.2: Avoid rarely used components. Rarely used failure-tolerance mechanisms, such as restoration from backup copies, must be tested periodically. If they are not, there is not much chance that they will work when an emergency arises. Fire drills (in this case performing a restoration of all files from a backup copy) seem disruptive and expensive, but they are not nearly as disruptive and expensive as the discovery, too late, that the backup system isn't really operating. Even better, design the system so that all the components are exposed to day-to-day use, so that failures can be noticed before they cause real trouble.

8.8.3 Non-independent Replicas and Backhoe Fade

In Eagan, Minnesota, Northwest Airlines operated a computer system, named WorldFlight, that managed the Northwest flight dispatching database, provided weight-and-balance calculations for pilots, and managed e-mail communications between the dispatch center and all Northwest airplanes. It also provided data to other systems that managed passenger check-in and the airline's Web site. Since many of these functions involved communications, Northwest contracted with U.S. West, the local telephone company at that time, to provide these communications in the form of fiber-optic links to airports that Northwest serves, to government agencies such as the Weather Bureau and the Federal Aviation Administration, and to the Internet. Because these links were vital, Northwest paid U.S. West extra to provide each primary link with a backup secondary link. If a primary link to a site failed, the network control computers automatically switched over to the secondary link to that site.

At 2:05 p.m. on March 23, 2000, all communications to and from WorldFlight dropped out simultaneously. A contractor who was boring a tunnel (for fiber optic lines for a different telephone company) at the nearby intersection of Lone Oak and Pilot Knob roads accidentally bored through a conduit containing six cables carrying the U.S.


West fiber-optic and copper lines. In a tongue-in-cheek analogy to the fading in and out of long-distance radio signals, this kind of communications disruption is known in the trade as "backhoe fade." WorldFlight immediately switched from the primary links to the secondary links, only to find that they were not working, either. It seems that the primary and secondary links were routed through the same conduit, and both were severed.

Pilots resorted to manual procedures for calculating weight and balance, and radio links were used by flight dispatchers in place of the electronic message system, but about 125 of Northwest's 1700 flights had to be cancelled because of the disruption, about the same number that are cancelled when a major snowstorm hits one of Northwest's hubs.

Much of the ensuing media coverage concentrated on whether or not the contractor had followed "dig-safe" procedures that are intended to prevent such mistakes. But a news release from Northwest at 5:15 p.m. blamed the problem entirely on U.S. West. "For such contingencies, U.S. West provides to Northwest a complete redundancy plan. The U.S. West redundancy plan also failed."*

In a similar incident, the ARPAnet, a predecessor to the Internet, had seven separate trunk lines connecting routers in New England to routers elsewhere in the United States. All the trunk lines were purchased from a single long-distance carrier, AT&T. On December 12, 1986, all seven trunk lines went down simultaneously when a contractor accidentally severed a single fiber-optic cable running from White Plains, New York, to Newark, New Jersey.†

A complication for communications customers who recognize this problem and request information about the physical location of their communication links is that, in the name of security, communications companies sometimes refuse to reveal it.

Lesson 8.8.3: The calculation of mean time to failure of a redundant system depends critically on the assumption that failures of the replicas are independent. If they aren't independent, then the replication may be a waste of effort and money, while producing a false complacency. This incident also illustrates why it can be difficult to test fault tolerance measures properly. What appears to be redundancy at one level of abstraction turns out not to be redundant at a lower level of abstraction.

* Tony Kennedy. "Cut cable causes cancellations, delays for Northwest Airlines." Minneapolis Star Tribune, March 22, 2000.

† Peter G. Neumann. Computer Related Risks (Addison-Wesley, New York, 1995), page 14.

8.8.4 Human Error May Be the Biggest Risk

Telehouse was an East London "telecommunications hotel", a seven-story building housing communications equipment for about 100 customers, including most British Internet companies, many British and international telephone companies, and dozens of financial institutions. It was designed to be one of the most secure buildings in Europe, safe against "fire, flooding, bombs, and sabotage". Accordingly, Telehouse had extensive protection against power failure, including two independent connections to the national


electric power grid, a room full of batteries, and two diesel generators, along with systems to detect failures in supply and automatically cut over from one backup system to the next, as needed.

On May 8, 1997, all the computer systems went off line for lack of power. According to Robert Bannington, financial director of Telehouse, "It was due to human error." That is, someone pulled the wrong switch. The automatic power supply cutover procedures did not trigger because they were designed to deploy on failure of the outside power supply, and the sensors correctly observed that the outside power supply was intact.*

* Robert Uhlig. "Engineer pulls plug on secure bunker." Electronic Telegraph, (9 May 1997).

Lesson 8.8.4a: The first step in designing a fault tolerant system is to identify each potential fault and evaluate the risk that it will happen. People are part of the system, and mistakes made by authorized operators are typically a bigger threat to reliability than trees falling on power lines.

Anecdotes concerning failures of backup power supply systems seem to be common. Here is a typical report of an experience in a Newark, New Jersey, hospital operating room that was equipped with three backup generators: "On August 14, 2003, at 4:10pm EST, a widespread power grid failure caused our hospital to suffer a total OR power loss, regaining partial power in 4 hours and total restoration 12 hours later... When the backup generators initially came on-line, all ORs were running as usual. Within 20 minutes, one parallel-linked generator caught fire from an oil leak. After being subjected to twice its rated load, the second in-line generator quickly shut down... Hospital engineering, attempting load-reduction to the single surviving generator, switched many hospital circuit breakers off. Main power was interrupted to the OR."†

† Ian E. Kirk, M.D. and Peter L. Fine, M.D. "Operating by Flashlight: Power Failure and Safety Lessons from the August, 2003 Blackout." Abstracts of the Annual Meeting of the American Society of Anesthesiologists, October 2005.

Lesson 8.8.4b: A backup generator is another example of a rarely used component that may not have been maintained properly. The last two sentences of that report reemphasize Lesson 8.8.4a.

For yet another example, the M.I.T. Information Services and Technology staff posted the following system services notice on April 2, 2004: "We suffered a power failure in W92 shortly before 11AM this morning. Most services should be restored now, but some are still being recovered. Please check back here for more information as it becomes available." A later posting reported: "Shortly after 10AM Friday morning the routine test of the W92 backup generator was started. Unknown to us was that the transition of the computer room load from commercial power to the backup generator resulted in a power surge within the computer room's Uninterruptable [sic] Power Supply (UPS). This destroyed an internal surge protector, which started to smolder. Shortly before 11AM the smoldering protector triggered the VESDA® smoke sensing system


within the computer room. This sensor triggered the fire alarm, and as a safety precaution forced an emergency power down of the entire computer room."*

* Private internal communication.

Lesson 8.8.4c: A failure masking system not only can fail, it can cause a bigger failure than the one it is intended to mask.

8.8.5 Introducing a Single Point of Failure

"[Rabbi Israel Meir HaCohen Kagan described] a real-life situation in his town of Radin, Poland. He lived at the time when the town first purchased an electrical generator and wired all the houses and courtyards with electric lighting. One evening something broke within the machine, and darkness descended upon all of the houses and streets, and even in the synagogue.

"So he pointed out that before they had electricity, every house had a kerosene light—and if in one particular house the kerosene ran out, or the wick burnt away, or the glass broke, that only that one house would be dark. But when everyone is dependent upon one machine, darkness spreads over the entire city if it breaks for any reason."†

Lesson 8.8.5: Centralization may provide economies of scale, but it can also reduce robustness—a single failure can interfere with many unrelated activities. This phenomenon is commonly known as introducing a single point of failure. By carefully adding redundancy to a centralized design one may be able to restore some of the lost robustness, but it takes planning and adds to the cost.

8.8.6 Multiple Failures: The SOHO Mission Interruption

"Contact with the SOlar Heliospheric Observatory (SOHO) spacecraft was lost in the early morning hours of June 25, 1998, Eastern Daylight Time (EDT), during a planned period of calibrations, maneuvers, and spacecraft reconfigurations. Prior to this the SOHO operations team had concluded two years of extremely successful science operations.

"…The Board finds that the loss of the SOHO spacecraft was a direct result of operational errors, a failure to adequately monitor spacecraft status, and an erroneous decision which disabled part of the on-board autonomous failure detection. Further, following the occurrence of the emergency situation, the Board finds that insufficient time was taken by the operations team to fully assess the spacecraft status prior to initiating recovery operations. The Board discovered that a number of factors contributed to the circumstances that allowed the direct causes to occur."‡

† Chofetz Chaim (the Rabbi Israel Meir HaCohen Kagan of Radin), paraphrased by Rabbi Yaakov Menken, in a discussion of lessons from the Torah in Project Genesis Lifeline. Suggested by David Karger.


In a tour-de-force of the keep digging principle, the report of the investigating board quoted above identified five distinct direct causes of the loss: two software errors, a design feature that unintentionally amplified the effect of one of the software errors, an incorrect diagnosis by the ground staff, and a violated design assumption. It then goes on to identify three indirect causes in the spacecraft design process: lack of change control, missing risk analysis for changes, and insufficient communication of changes, and then three indirect causes in operations procedures: failure to follow planned procedures, to evaluate secondary telemetry data, and to question telemetry discrepancies.

‡ Massimo Trella and Michael Greenfield. Final Report of the SOHO Mission Interruption Joint NASA/ESA Investigation Board (August 31, 1998). National Aeronautics and Space Administration and European Space Agency.

Lesson 8.8.6: Complex systems fail for complex reasons. In systems engineered for reliability, it usually takes several component failures to cause a system failure. Unfortunately, when some of the components are people, multiple failures are all too common.

Exercises

8.1 Failures are
A. Faults that are latent.
B. Errors that are contained within a module.
C. Errors that propagate out of a module.
D. Faults that turn into errors.
1999–3–01

8.2 Ben Bitdiddle has been asked to perform a deterministic computation to calculate the orbit of a near-Earth asteroid for the next 500 years, to find out whether or not the asteroid will hit the Earth. The calculation will take roughly two years to complete, and Ben wants to be sure that the result will be correct. He buys 30 identical computers and runs the same program with the same inputs on all of them. Once each hour the software pauses long enough to write all intermediate results to a hard disk on that computer. When the computers return their results at the end of the two years, a voter selects the majority answer. Which of the following failures can this scheme tolerate, assuming the voter works correctly?
A. The software carrying out the deterministic computation has a bug in it, causing the program to compute the wrong answer for certain inputs.
B. Over the course of the two years, cosmic rays corrupt data stored in memory at twelve of the computers, causing them to return incorrect results.
C. Over the course of the two years, on 24 different days the power fails in the computer room. When the power comes back on, each computer reboots and then continues its computation, starting with the state it finds on its hard disk.
2006–2–3

8.3 Ben Bitdiddle has seven smoke detectors installed in various places in his house. Since the fire department charges $100 for responding to a false alarm, Ben has connected the outputs of the smoke detectors to a simple majority voter, which in turn can activate an automatic dialer that calls the fire department. Ben returns home one day to find his house on fire, and the fire department has not been called. There is smoke at every smoke detector. What did Ben do wrong?
A. He should have used fail-fast smoke detectors.
B. He should have used a voter that ignores failed inputs from fail-fast sources.
C. He should have used a voter that ignores non-active inputs.
D. He should have done both A and B.
E. He should have done both A and C.
1997–0–01

8.4 You will be flying home from a job interview in Silicon Valley. Your travel agent gives you the following choice of flights:
A. Flight A uses a plane whose mean time to failure (MTTF) is believed to be 6,000 hours. With this plane, the flight is scheduled to take 6 hours.
B. Flight B uses a plane whose MTTF is believed to be 5,000 hours. With this plane, the flight takes 5 hours.

The agent assures you that both planes’ failures occur according to memoryless random processes (not a “bathtub” curve). Assuming that model, which flight should you choose to minimize the chance of your plane failing during the flight? 2005–2–5

8.5 (Note: solving this problem is best done with use of probability through the level of Markov chains.) You are designing a computer system to control the power grid for the Northeastern United States. If your system goes down, the lights go out and civil disorder—riots, looting, fires, etc.—will ensue. Thus, you have set a goal of having a system MTTF of at least 100 years (about 10^6 hours). For hardware you are constrained to use a building block computer that has an MTTF of 1000 hours


and an MTTR of 1 hour. Assuming that the building blocks are fail-fast, memoryless, and fail independently of one another, how can you arrange to meet your goal?
1995–3–1a

8.6 The town council wants to implement a municipal network to connect the local area networks in the library, the town hall, and the school. They want to minimize the chance that any building is completely disconnected from the others. They are considering two network topologies:

1. “Daisy Chain”

2. “Fully connected”

Each link in the network has a failure probability of p.
8.6a. What is the probability that the daisy chain network is connecting all the buildings?
8.6b. What is the probability that the fully connected network is connecting all the buildings?
8.6c. The town council has a limited budget, with which it can buy either a daisy chain network with two high reliability links (p = .000001), or a fully connected network with three low-reliability links (p = .0001). Which should they purchase?
1985–0–1

8.7 Figure 8.11 shows the failure points of three different 5MR supermodule designs, if repair does not happen in time. Draw the corresponding figure for the same three different TMR supermodule designs. 2001–3–05

8.8 An astronomer calculating the trajectory of Pluto has a program that requires the execution of 10^13 machine operations. The fastest processor available in the lab runs only 10^9 operations per second and, unfortunately, has a probability of failing on any one operation of 10^–12. (The failure process is memoryless.) The good news is that the processor is fail-fast, so when a failure occurs it stops dead in its tracks and starts ringing a bell. The bad news is that when it fails, it loses all state, so whatever it was doing is lost, and has to be started over from the beginning. Seeing that, in practical terms, the program needs to run for about 3 hours, and the machine has an MTTF of only 1/10 of that time, Louis Reasoner and Ben Bitdiddle have proposed two ways to organize the computation:


• Louis says run it from the beginning and hope for the best. If the machine fails, just try again; keep trying till the calculation successfully completes.
• Ben suggests dividing the calculation into ten equal-length segments; if the calculation gets to the end of a segment, it writes its state out to the disk. When a failure occurs, restart from the last state saved on the disk. Saving state and restart both take zero time.
What is the ratio of the expected time to complete the calculation under the two strategies? Warning: A straightforward solution to this problem involves advanced probability techniques.
1976–0–3

8.9 Draw a figure, similar to that of Figure 8.6, that shows the recovery procedure for one sector of a 5-disk RAID 4 system when disk 2 fails and is replaced. 2005–0–1

8.10 Louis Reasoner has just read an advertisement for a RAID controller that provides a choice of two configurations. According to the advertisement, the first configuration is exactly the RAID 4 system described in Section 8.4.1. The advertisement goes on to say that the configuration called RAID 5 has just one difference: in an N-disk configuration, the parity block, rather than being written on disk N, is written on the disk number (1 + sector_address modulo N). Thus, for example, in a five-disk system, the parity block for sector 18 would be on disk 4 (because 1 + (18 modulo 5) = 4), while the parity block for sector 19 would be on


disk 5 (because 1 + (19 modulo 5) = 5). Louis is hoping you can help him understand why this idea might be a good one.

8.10a. RAID 5 has the advantage over RAID 4 that
A. It tolerates single-drive failures.
B. Read performance in the absence of errors is enhanced.
C. Write performance in the absence of errors is enhanced.
D. Locating data on the drives is easier.
E. Allocating space on the drives is easier.
F. It requires less disk space.
G. There's no real advantage, it's just another advertising gimmick.
1997–3–01

8.10b. Is there any workload for which RAID 4 has better write performance than RAID 5?
2000–3–01

8.10c. Louis is also wondering about whether he might be better off using a RAID 1 system (see Section 8.5.4.6). How does the number of disks required compare between RAID 1 and RAID 5?
1998–3–01

8.10d. Which of RAID 1 and RAID 5 has better performance for a workload consisting of small reads and small writes?
2000–3–01

8.11 A system administrator notices that a file service disk is failing for two unrelated reasons. Once every 30 days, on average, vibration due to nearby construction breaks the disk’s arm. Once every 60 days, on average, a power surge destroys the disk’s electronics. The system administrator fixes the disk instantly each time it fails. The two failure modes are independent of each other, and independent of the age of the disk. What is the mean time to failure of the disk? 2002–3–01

Additional exercises relating to Chapter 8 can be found in problem sets 26 through 28.

CHAPTER 9
Atomicity: All-or-Nothing and Before-or-After

CHAPTER CONTENTS

Overview  9–2
9.1 Atomicity  9–4
  9.1.1 All-or-Nothing Atomicity in a Database  9–5
  9.1.2 All-or-Nothing Atomicity in the Interrupt Interface  9–6
  9.1.3 All-or-Nothing Atomicity in a Layered Application  9–8
  9.1.4 Some Actions With and Without the All-or-Nothing Property  9–10
  9.1.5 Before-or-After Atomicity: Coordinating Concurrent Threads  9–13
  9.1.6 Correctness and Serialization  9–16
  9.1.7 All-or-Nothing and Before-or-After Atomicity  9–19
9.2 All-or-Nothing Atomicity I: Concepts  9–21
  9.2.1 Achieving All-or-Nothing Atomicity: ALL_OR_NOTHING_PUT  9–21
  9.2.2 Systematic Atomicity: Commit and the Golden Rule  9–27
  9.2.3 Systematic All-or-Nothing Atomicity: Version Histories  9–30
  9.2.4 How Version Histories are Used  9–37
9.3 All-or-Nothing Atomicity II: Pragmatics  9–38
  9.3.1 Atomicity Logs  9–39
  9.3.2 Logging Protocols  9–42
  9.3.3 Recovery Procedures  9–45
  9.3.4 Other Logging Configurations: Non-Volatile Cell Storage  9–47
  9.3.5 Checkpoints  9–51
  9.3.6 What if the Cache is not Write-Through? (Advanced Topic)  9–53
9.4 Before-or-After Atomicity I: Concepts  9–54
  9.4.1 Achieving Before-or-After Atomicity: Simple Serialization  9–54
  9.4.2 The Mark-Point Discipline  9–58
  9.4.3 Optimistic Atomicity: Read-Capture (Advanced Topic)  9–63
  9.4.4 Does Anyone Actually Use Version Histories for Before-or-After Atomicity?  9–67
9.5 Before-or-After Atomicity II: Pragmatics  9–69
  9.5.1 Locks  9–70
  9.5.2 Simple Locking  9–72
  9.5.3 Two-Phase Locking  9–73
  9.5.4 Performance Optimizations  9–75
  9.5.5 Deadlock; Making Progress  9–76
9.6 Atomicity across Layers and Multiple Sites  9–79
  9.6.1 Hierarchical Composition of Transactions  9–80
  9.6.2 Two-Phase Commit  9–84
  9.6.3 Multiple-Site Atomicity: Distributed Two-Phase Commit  9–85
  9.6.4 The Dilemma of the Two Generals  9–90
9.7 A More Complete Model of Disk Failure (Advanced Topic)  9–92
  9.7.1 Storage that is Both All-or-Nothing and Durable  9–92
9.8 Case Studies: Machine Language Atomicity  9–95
  9.8.1 Complex Instruction Sets: The General Electric 600 Line  9–95
  9.8.2 More Elaborate Instruction Sets: The IBM System/370  9–96
  9.8.3 The Apollo Desktop Computer and the Motorola M68000 Microprocessor  9–97
Exercises  9–98
Glossary for Chapter 9  9–107
Index of Chapter 9  9–113
Last chapter page 9–115

Overview

This chapter explores two closely related system engineering design strategies. The first is all-or-nothing atomicity, a design strategy for masking failures that occur while interpreting programs. The second is before-or-after atomicity, a design strategy for coordinating concurrent activities. Chapter 8[on-line] introduced failure masking, but did not show how to mask failures of running programs. Chapter 5 introduced coordination of concurrent activities, and presented solutions to several specific problems, but it did not explain any systematic way to ensure that actions have the before-or-after property. This chapter explores ways to systematically synthesize a design that provides both the all-or-nothing property needed for failure masking and the before-or-after property needed for coordination.

Many useful applications can benefit from atomicity. For example, suppose that you are trying to buy a toaster from an Internet store. You click on the button that says "purchase", but before you receive a response the power fails. You would like to have some assurance that, despite the power failure, either the purchase went through properly or nothing happened at all. You don't want to find out later that your credit card was charged but the Internet store didn't receive word that it was supposed to ship the toaster. In other words, you would like to see that the action initiated by the "purchase" button be all-or-nothing despite the possibility of failure. And if the store has only one toaster in stock and two customers both click on the "purchase" button for a toaster at about the same time, one of the customers should receive a confirmation of the purchase, and the other should receive a "sorry, out of stock" notice. It would be problematic if


both customers received confirmations of purchase. In other words, both customers would like to see that the activity initiated by their own click of the “purchase” button occur either completely before or completely after any other, concurrent click of a “pur­ chase” button. The single conceptual framework of atomicity provides a powerful way of thinking about both all-or-nothing failure masking and before-or-after sequencing of concurrent activities. Atomicity is the performing of a sequence of steps, called actions, so that they appear to be done as a single, indivisible step, known in operating system and architec­ ture literature as an atomic action and in database management literature as a transaction. When a fault causes a failure in the middle of a correctly designed atomic action, it will appear to the invoker of the atomic action that the atomic action either completed suc­ cessfully or did nothing at all—thus an atomic action provides all-or-nothing atomicity. Similarly, when several atomic actions are going on concurrently, each atomic action will appear to take place either completely before or completely after every other atomic action—thus an atomic action provides before-or-after atomicity. Together, all-or-noth­ ing atomicity and before-or-after atomicity provide a particularly strong form of modularity: they hide the fact that the atomic action is actually composed of multiple steps. The result is a sweeping simplification in the description of the possible states of a sys­ tem. This simplification provides the basis for a methodical approach to recovery from failures and coordination of concurrent activities that simplifies design, simplifies under­ standing for later maintainers, and simplifies verification of correctness. These desiderata are particularly important because errors caused by mistakes in coordination usually depend on the relative timing of external events and among different threads. When a timing-dependent error occurs, the difficulty of discovering and diagnosing it can be orders of magnitude greater than that of finding a mistake in a purely sequential activity. The reason is that even a small number of concurrent activities can have a very large number of potential real time sequences. It is usually impossible to determine which of those many potential sequences of steps preceded the error, so it is effectively impossible to reproduce the error under more carefully controlled circumstances. Since debugging this class of error is so hard, techniques that ensure correct coordination a priori are par­ ticularly valuable. The remarkable thing is that the same systematic approach—atomicity—to failure recovery also applies to coordination of concurrent activities. In fact, since one must be able to deal with failures while at the same time coordinating concurrent activities, any attempt to use different strategies for these two problems requires that the strategies be compatible. Being able to use the same strategy for both is another sweeping simplification. Atomic actions are a fundamental building block that is widely applicable in com­ puter system design. Atomic actions are found in database management systems, in register management for pipelined processors, in file systems, in change-control systems used for program development, and in many everyday applications such as word proces­ sors and calendar managers.


Sidebar 9.1: Actions and transactions

The terminology used by system designers to discuss atomicity can be confusing because the concept was identified and developed independently by database designers and by hardware architects. An action that changes several data values can have any or all of at least four independent properties: it can be all-or-nothing (either all or none of the changes happen), it can be before-or-after (the changes all happen either before or after every concurrent action), it can be constraint-maintaining (the changes maintain some specified invariant), and it can be durable (the changes last as long as they are needed).

Designers of database management systems customarily are concerned only with actions that are both all-or-nothing and before-or-after, and they describe such actions as transactions. In addition, they use the term atomic primarily in reference to all-or-nothing atomicity. On the other hand, hardware processor architects customarily use the term atomic to describe an action that exhibits before-or-after atomicity. This book does not attempt to change these common usages. Instead, it uses the qualified terms “all-or-nothing atomicity” and “before-or-after atomicity.” The unqualified term “atomic” may imply all-or-nothing, or before-or-after, or both, depending on the context. The text uses the term “transaction” to mean an action that is both all-or-nothing and before-or-after.

All-or-nothing atomicity and before-or-after atomicity are universally defined properties of actions, while constraints are properties that different applications define in different ways. Durability lies somewhere in between because different applications have different durability requirements. At the same time, implementations of constraints and durability usually have a prerequisite of atomicity. Since the atomicity properties are modularly separable from the other two, this chapter focuses just on atomicity. Chapter 10[on-line] then explores how a designer can use transactions to implement constraints and enhance durability.

The sections of this chapter define atomicity, examine some examples of atomic actions, and explore systematic ways of achieving atomicity: version histories, logging, and locking protocols. Chapter 10[on-line] then explores some applications of atomicity. Case studies at the end of both chapters provide real-world examples of atomicity as a tool for creating useful systems.

9.1 Atomicity

Atomicity is a property required in several different areas of computer system design. These areas include managing a database, developing a hardware architecture, specifying the interface to an operating system, and more generally in software engineering. The table below suggests some of the kinds of problems to which atomicity is applicable. In


this chapter we will encounter examples of both kinds of atomicity in each of these different areas.

Area                     All-or-nothing atomicity                Before-or-after atomicity
database management      updating more than one record           records shared between threads
hardware architecture    handling interrupts and exceptions      register renaming
operating systems        supervisor call interface               printer queue
software engineering     handling faults in layers               bounded buffer

9.1.1 All-or-Nothing Atomicity in a Database

As a first example, consider a database of bank accounts. We define a procedure named TRANSFER that debits one account and credits a second account, both of which are stored on disk, as follows:

1   procedure TRANSFER (debit_account, credit_account, amount)
2       GET (dbdata, debit_account)
3       dbdata ← dbdata - amount
4       PUT (dbdata, debit_account)
5       GET (crdata, credit_account)
6       crdata ← crdata + amount
7       PUT (crdata, credit_account)

where debit_account and credit_account identify the records for the accounts to be deb­ ited and credited, respectively. Suppose that the system crashes while executing the PUT instruction on line 4. Even if we use the MORE_DURABLE_PUT described in Section 8.5.4, a system crash at just the wrong time may cause the data written to the disk to be scrambled, and the value of debit_account lost. We would prefer that either the data be completely written to the disk or nothing be written at all. That is, we want the PUT instruction to have the all-or-noth­ ing atomicity property. Section 9.2.1 will describe a way to do that. There is a further all-or-nothing atomicity requirement in the TRANSFER procedure. Suppose that the PUT on line 4 is successful but that while executing line 5 or line 6 the power fails, stopping the computer in its tracks. When power is restored, the computer restarts, but volatile memory, including the state of the thread that was running the TRANSFER procedure, has been lost. If someone now inquires about the balances in debit_account and in credit_account things will not add up properly because debit_account has a new value but credit_account has an old value. One might suggest postponing the first PUT to be just before the second one, but that just reduces the win­ dow of vulnerability, it does not eliminate it—the power could still fail in between the two PUTs. To eliminate the window, we must somehow arrange that the two PUT instruc­ tions, or perhaps even the entire TRANSFER procedure, be done as an all-or-nothing atomic


action. In Section 9.2.3 we will devise a TRANSFER procedure that has the all-or-nothing property, and in Section 9.3 we will see some additional ways of providing the property.
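To make the window of vulnerability concrete, here is a minimal simulation sketch in Python; it is not from the text, and the names Crash, Disk, and transfer are invented for illustration. It models each PUT of the TRANSFER procedure above as a separate disk write and shows that if the system stops after the first PUT but before the second, the stored balances no longer add up.

    class Crash(Exception):
        """Simulates a power failure or system crash."""

    class Disk:
        def __init__(self, sectors):
            self.sectors = dict(sectors)   # sector name -> value
            self.puts_remaining = None     # None means "never crash"

        def get(self, name):
            return self.sectors[name]

        def put(self, name, value):
            # Crash before this write if the allowed number of PUTs is used up.
            if self.puts_remaining is not None:
                if self.puts_remaining == 0:
                    raise Crash()
                self.puts_remaining -= 1
            self.sectors[name] = value

    def transfer(disk, debit_account, credit_account, amount):
        dbdata = disk.get(debit_account)
        disk.put(debit_account, dbdata - amount)     # like line 4 of TRANSFER
        crdata = disk.get(credit_account)
        disk.put(credit_account, crdata + amount)    # like line 7 of TRANSFER

    disk = Disk({"A": 300, "B": 100})
    disk.puts_remaining = 1          # allow the first PUT, then crash
    try:
        transfer(disk, "A", "B", 10)
    except Crash:
        pass
    # A was debited but B was never credited: the money has vanished.
    print(disk.sectors)              # {'A': 290, 'B': 100}

Making the pair of PUTs (or the whole transfer) an all-or-nothing action is exactly what rules out this in-between outcome.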

9.1.2 All-or-Nothing Atomicity in the Interrupt Interface A second application for all-or-nothing atomicity is in the processor instruction set inter­ face as seen by a thread. Recall from Chapters 2 and 5 that a thread normally performs actions one after another, as directed by the instructions of the current program, but that certain events may catch the attention of the thread’s interpreter, causing the interpreter, rather than the program, to supply the next instruction. When such an event happens, a different program, running in an interrupt thread, takes control. If the event is a signal arriving from outside the interpreter, the interrupt thread may simply invoke a thread management primitive such as ADVANCE, as described in Section 5.6.4, to alert some other thread about the event. For example, an I/O operation that the other thread was waiting for may now have completed. The interrupt handler then returns control to the interrupted thread. This example requires before-or-after atomicity between the interrupt thread and the interrupted thread. If the interrupted thread was in the midst of a call to the thread manager, the invocation of ADVANCE by the interrupt thread should occur either before or after that call. Another possibility is that the interpreter has detected that something is going wrong in the interrupted thread. In that case, the interrupt event invokes an exception handler, which runs in the environment of the original thread. (Sidebar 9.2 offers some exam­ ples.) The exception handler either adjusts the environment to eliminate some problem (such as a missing page) so that the original thread can continue, or it declares that the original thread has failed and terminates it. In either case, the exception handler will need to examine the state of the action that the original thread was performing at the instant of the interruption—was that action finished, or is it in a partially done state? Ideally, the handler would like to see an all-or-nothing report of the state: either the instruction that caused the exception completed or it didn’t do anything. An all-or-noth­ ing report means that the state of the original thread is described entirely with values belonging to the layer in which the exception handler runs. An example of such a value is the program counter, which identifies the next instruction that the thread is to execute. An in-the-middle report would mean that the state description involves values of a lower layer, probably the operating system or the hardware processor itself. In that case, know­ ing the next instruction is only part of the story; the handler would also need to know which parts of the current instruction were executed and which were not. An example might be an instruction that increments an address register, retrieves the data at that new address, and adds that data value to the value in another register. If retrieving the data causes a missing-page exception, the description of the current state is that the address register has been incremented but the retrieval and addition have not yet been per­ formed. Such an in-the-middle report is problematic because after the handler retrieves the missing page it cannot simply tell the processor to jump to the instruction that failed—that would increment the address register again, which is not what the program-


Sidebar 9.2: Events that might lead to invoking an exception handler

1. A hardware fault occurs:
   • The processor detects a memory parity fault.
   • A sensor reports that the electric power has failed; the energy left in the power supply may be just enough to perform a graceful shutdown.

2. A hardware or software interpreter encounters something in the program that is clearly wrong:
   • The program tried to divide by zero.
   • The program supplied a negative argument to a square root function.

3. Continuing requires some resource allocation or deferred initialization:
   • The running thread encountered a missing-page exception in a virtual memory system.
   • The running thread encountered an indirection exception, indicating that it encountered an unresolved procedure linkage in the current program.

4. More urgent work needs to take priority, so the user wishes to terminate the thread:
   • This program is running much longer than expected.
   • The program is running normally, but the user suddenly realizes that it is time to catch the last train home.

5. The user realizes that something is wrong and decides to terminate the thread:
   • Calculating e, the program starts to display 3.1415…
   • The user asked the program to copy the wrong set of files.

6. Deadlock:
   • Thread A has acquired the scanner, and is waiting for memory to become free; thread B has acquired all available memory, and is waiting for the scanner to be released. Either the system notices that this set of waits cannot be resolved or, more likely, a timer that should never expire eventually expires. The system or the timer signals an exception to one or both of the deadlocked threads.

mer expected. Jumping to the next instruction isn’t right, either, because that would omit the addition step. An all-or-nothing report is preferable because it avoids the need for the handler to peer into the details of the next lower layer. Modern processor design­ ers are generally careful to avoid designing instructions that don’t have the all-or-nothing property. As will be seen shortly, designers of higher-layer interpreters must be similarly careful. Sections 9.1.3 and 9.1.4 explore the case in which the exception terminates the run­ ning thread, thus creating a fault. Section 9.1.5 examines the case in which the interrupted thread continues, oblivious (one hopes) to the interruption.
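A minimal sketch, in Python and with invented names, of how an interpreter can give a composite instruction the all-or-nothing property: compute every effect into temporaries first, and install them only after each step that can fail has succeeded. The instruction below mimics the example in the text, which increments an address register, fetches memory at the new address, and adds the fetched value to another register.

    class MissingPage(Exception):
        pass

    def fetch(memory, address):
        if address not in memory:
            raise MissingPage(address)
        return memory[address]

    def add_indirect_all_or_nothing(registers, memory):
        # Stage every effect in local variables; touch no register yet.
        new_addr = registers["R1"] + 1
        value = fetch(memory, new_addr)          # may raise MissingPage
        new_acc = registers["R2"] + value
        # Commit: reached only if nothing failed, so an exception handler
        # always sees either the state before the instruction or the state after it.
        registers["R1"] = new_addr
        registers["R2"] = new_acc

    registers = {"R1": 99, "R2": 7}
    memory = {}                                   # the page containing address 100 is missing
    try:
        add_indirect_all_or_nothing(registers, memory)
    except MissingPage:
        assert registers == {"R1": 99, "R2": 7}   # no partial update to report

With this discipline, the handler never needs an in-the-middle report; the failed instruction can simply be retried after the missing page is supplied.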


9.1.3 All-or-Nothing Atomicity in a Layered Application

A third example of all-or-nothing atomicity lies in the challenge presented by a fault in a running program: at the instant of the fault, the program is typically in the middle of doing something, and it is usually not acceptable to leave things half-done. Our goal is to obtain a more graceful response, and the method will be to require that some sequence of actions behave as an atomic action with the all-or-nothing property. Atomic actions are closely related to the modularity that arises when things are organized in layers. Layered components have the feature that a higher layer can completely hide the existence of a lower layer. This hiding feature makes layers exceptionally effective at error containment and for systematically responding to faults. To see why, recall the layered structure of the calendar management program of Chapter 2, reproduced in Figure 9.1 (that figure may seem familiar—it is a copy of Figure 2.10). The calendar program implements each request of the user by executing a sequence of Java language statements. Ideally, the user will never notice any evidence of the composite nature of the actions implemented by the calendar manager. Similarly, each statement of the Java language is implemented by several actions at the hardware layer. Again, if the Java interpreter is carefully implemented, the composite nature of the implementation in terms of machine language will be completely hidden from the Java programmer.

[Figure 9.1 shows three layers of interpretation. A human user generating requests sits above the calendar manager layer interface (typical instruction across this interface: “Add new event on February 27”), implemented by the calendar program; below it is the Java language layer interface (typical instruction: nextch = instring[j];), implemented by the Java interpreter; below that is the machine language layer interface (typical instruction: add R1,R2), implemented by the hardware.]

FIGURE 9.1  An application system with three layers of interpretation. The user has requested an action that will fail, but the failure will be discovered at the lowest layer. A graceful response involves atomicity at each interface.


Now consider what happens if the hardware processor detects a condition that should be handled as an exception—for example, a register overflow. The machine is in the mid­ dle of interpreting an action at the machine language layer interface—an ADD instruction somewhere in the middle of the Java interpreter program. That ADD instruction is itself in the middle of interpreting an action at the Java language interface—a Java expression to scan an array. That Java expression in turn is in the middle of interpreting an action at the user interface—a request from the user to add a new event to the calendar. The report “Overflow exception caused by the ADD instruction at location 41574” is not intel­ ligible to the user at the user interface; that description is meaningful only at the machine language interface. Unfortunately, the implication of being “in the middle” of higherlayer actions is that the only accurate description of the current state of affairs is in terms of the progress of the machine language program. The actual state of affairs in our example as understood by an all-seeing observer might be the following: the register overflow was caused by adding one to a register that contained a two’s complement negative one at the machine language layer. That machine language add instruction was part of an action to scan an array of characters at the Java layer and a zero means that the scan has reached the end of the array. The array scan was embarked upon by the Java layer in response to the user’s request to add an event on February 31. The highest-level interpretation of the overflow exception is “You tried to add an event on a non-existent date”. We want to make sure that this report goes to the end user, rather than the one about register overflow. In addition, we want to be able to assure the user that this mistake has not caused an empty event to be added some­ where else in the calendar or otherwise led to any other changes to the calendar. Since the system couldn’t do the requested change it should do nothing but report the error. Either a low-level error report or muddled data would reveal to the user that the action was composite. With the insight that in a layered application, we want a fault detected by a lower layer to be contained in a particular way we can now propose a more formal definition of all-or-nothing atomicity: All-or-nothing atomicity

A sequence of steps is an all-or-nothing action if, from the point of view of its invoker, the sequence always either
• completes, or
• aborts in such a way that it appears that the sequence had never been undertaken in the first place. That is, it backs out.

In a layered application, the idea is to design each of the actions of each layer to be all-or-nothing. That is, whenever an action of a layer is carried out by a sequence of


actions of the next lower layer, the action either completes what it was asked to do or else it backs out, acting as though it had not been invoked at all. When control returns to a higher layer after a lower layer detects a fault, the problem of being “in the middle” of an action thus disappears.

In our calendar management example, we might expect that the machine language layer would complete the add instruction but signal an overflow exception; the Java interpreter layer, upon receiving the overflow exception, might then decide that its array scan has ended, and return a report of “scan complete, value not found” to the calendar management layer; the calendar manager would take this not-found report as an indication that it should back up, completely undo any tentative changes, and tell the user that the request to add an event on that date could not be accomplished because the date does not exist. Thus some layers run to completion, while others back out and act as though they had never been invoked, but either way the actions are all-or-nothing. In this example, the failure would probably propagate all the way back to the human user to decide what to do next. A different failure (e.g. “there is no room in the calendar for another event”) might be intercepted by some intermediate layer that knows of a way to mask it (e.g., by allocating more storage space). In that case, the all-or-nothing requirement is that the layer that masks the failure find that the layer below has either never started what was to be the current action or else it has completed the current action but has not yet undertaken the next one.

All-or-nothing atomicity is not usually achieved casually, but rather by careful design and specification. Designers often get it wrong. An unintelligible error message is the typical symptom that a designer got it wrong. To gain some insight into what is involved, let us examine some examples.
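As a concrete, hypothetical illustration of this discipline, the Python sketch below shows a calendar layer that backs out its tentative change and reports in its own vocabulary when a lower layer fails. The names Calendar, add_event, days_in_month, and DateError are invented for the example and are not taken from the text.

    class DateError(Exception):
        """Raised by the lower, date-handling layer."""

    def days_in_month(month):
        # Lower layer: it rejects impossible requests with its own exception.
        return {1: 31, 2: 28, 3: 31, 4: 30, 5: 31, 6: 30,
                7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31}[month]

    class Calendar:
        def __init__(self):
            self.events = []

        def add_event(self, month, day, description):
            # Tentative change, made before the request is fully validated.
            self.events.append((month, day, description))
            try:
                if not (1 <= day <= days_in_month(month)):
                    raise DateError(f"{month}/{day}")
            except DateError:
                # Back out: undo the tentative change and report the failure
                # in the vocabulary of this layer, not the layer below.
                self.events.pop()
                return "You tried to add an event on a non-existent date"
            return "Event added"

    cal = Calendar()
    print(cal.add_event(2, 31, "meeting"))   # backs out; the calendar is unchanged
    print(cal.events)                         # []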

9.1.4 Some Actions With and Without the All-or-Nothing Property Actions that lack the all-or-nothing property have frequently been discovered upon add­ ing multilevel memory management to a computer architecture, especially to a processor that is highly pipelined. In this case, the interface that needs to be all-or-nothing lies between the processor and the operating system. Unless the original machine architect designed the instruction set with missing-page exceptions in mind, there may be cases in which a missing-page exception can occur “in the middle” of an instruction, after the processor has overwritten some register or after later instructions have entered the pipe­ line. When such a situation arises, the later designer who is trying to add the multilevel memory feature is trapped. The instruction cannot run to the end because one of the operands it needs is not in real memory. While the missing page is being retrieved from secondary storage, the designer would like to allow the operating system to use the pro­ cessor for something else (perhaps even to run the program that fetches the missing page), but reusing the processor requires saving the state of the currently executing pro­ gram, so that it can be restarted later when the missing page is available. The problem is how to save the next-instruction pointer.


If every instruction is an all-or-nothing action, the operating system can simply save as the value of the next-instruction pointer the address of the instruction that encountered the missing page. The resulting saved state description shows that the program is between two instructions, one of which has been completely executed, and the next one of which has not yet begun. Later, when the page is available, the operating system can restart the program by reloading all of the registers and setting the program counter to the place indicated by the next-instruction pointer. The processor will continue, starting with the instruction that previously encountered the missing page exception; this time it should succeed. On the other hand, if even one instruction of the instruction set lacks the all-or-nothing property, when an interrupt happens to occur during the execution of that instruction it is not at all obvious how the operating system can save the processor state for a future restart. Designers have come up with several techniques to retrofit the all-or-nothing property at the machine language interface. Section 9.8 describes some examples of machine architectures that had this problem and the techniques that were used to add virtual memory to them.

A second example is the supervisor call (SVC). Section 5.3.4 pointed out that the SVC instruction, which changes both the program counter and the processor mode bit (and in systems with virtual memory, other registers such as the page map address register), needs to be all-or-nothing, to ensure that all (or none) of the intended registers change. Beyond that, the SVC invokes some complete kernel procedure. The designer would like to arrange that the entire call (the combination of the SVC instruction and the operation of the kernel procedure itself) be an all-or-nothing action. An all-or-nothing design allows the application programmer to view the kernel procedure as if it is an extension of the hardware. That goal is easier said than done, since the kernel procedure may detect some condition that prevents it from carrying out the intended action. Careful design of the kernel procedure is thus required.

Consider an SVC to a kernel READ procedure that delivers the next typed keystroke to the caller. The user may not have typed anything yet when the application program calls READ, so the designer of READ must arrange to wait for the user to type something. By itself, this situation is not especially problematic, but it becomes more so when there is also a user-provided exception handler. Suppose, for example, a thread timer can expire during the call to READ and the user-provided exception handler is to decide whether or not the thread should continue to run a while longer. The scenario, then, is the user program calls READ, it is necessary to wait, and while waiting, the timer expires and control passes to the exception handler. Different systems choose one of three possibilities for the design of the READ procedure, the last one of which is not an all-or-nothing design:

1. An all-or-nothing design that implements the “nothing” option (blocking read): Seeing no available input, the kernel procedure first adjusts return pointers (“push the PC back”) to make it appear that the application program called AWAIT just ahead of its call to the kernel READ procedure and then it transfers control to the kernel AWAIT entry point.
When the user finally types something, causing AWAIT to return, the user’s thread re-executes the original kernel call to READ, this time finding the typed


input. With this design, if a timer exception occurs while waiting, when the exception handler investigates the current state of the thread it finds the answer “the application program is between instructions; its next instruction is a call to READ.” This description is intelligible to a user-provided exception handler, and it allows that handler several options. One option is to continue the thread, meaning go ahead and execute the call to READ. If there is still no input, READ will again push the PC back and transfer control to AWAIT. Another option is for the handler to save this state description with a plan of restoring a future thread to this state at some later time.

2. An all-or-nothing design that implements the “all” option (non-blocking read): Seeing no available input, the kernel immediately returns to the application program with a zero-length result, expecting that the program will look for and properly handle this case. The program would probably test the length of the result and if zero, call AWAIT itself or it might find something else to do instead. As with the previous design, this design ensures that at all times the user-provided timer exception handler will see a simple description of the current state of the thread—it is between two user program instructions. However, some care is needed to avoid a race between the call to AWAIT and the arrival of the next typed character.

3. A blocking read design that is neither “all” nor “nothing” and therefore not atomic: The kernel READ procedure itself calls AWAIT, blocking the thread until the user types a character. Although this design seems conceptually simple, the description of the state of the thread from the point of view of the timer exception handler is not simple. Rather than “between two user instructions”, it is “waiting for something to happen in the middle of a user call to kernel procedure READ”. The option of saving this state description for future use has been foreclosed. To start another thread with this state description, the exception handler would need to be able to request “start this thread just after the call to AWAIT in the middle of the kernel READ entry.” But allowing that kind of request would compromise the modularity of the user-kernel interface. The user-provided exception handler could equally well make a request to restart the thread anywhere in the kernel, thus bypassing its gates and compromising its security.

The first and second designs correspond directly to the two options in the definition of an all-or-nothing action, and indeed some operating systems offer both options. In the first design the kernel program acts in a way that appears that the call had never taken place, while in the second design the kernel program runs to completion every time it is called. Both designs make the kernel procedure an all-or-nothing action, and both lead to a user-intelligible state description—the program is between two of its instructions—if an exception should happen while waiting.

One of the appeals of the client/server model introduced in Chapter 4 is that it tends to force the all-or-nothing property out onto the design table. Because servers can fail independently of clients, it is necessary for the client to think through a plan for recovery


from server failure, and a natural model to use is to make every action offered by a server all-or-nothing.
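Here is a minimal Python sketch, with invented names, of the second design above: a non-blocking READ that always runs to completion, so the waiting happens in the application layer and a timer exception always finds the thread between two of its own statements. It is an illustration under stated assumptions, not the book's code.

    import queue

    keyboard = queue.Queue()      # stands in for the buffer of typed characters

    def kernel_read_nonblocking():
        """The 'all' option: always runs to completion, returning '' if no input."""
        try:
            return keyboard.get_nowait()
        except queue.Empty:
            return ""             # zero-length result; the caller decides what to do

    def application_read():
        # The waiting happens here, in the application layer, so an exception
        # handler always sees the thread between two application-level calls.
        while True:
            ch = kernel_read_nonblocking()
            if ch:
                return ch
            # A real program would call AWAIT here; this sketch simply loops,
            # which also hints at the race the text warns about between the
            # decision to wait and the arrival of the next typed character.

    keyboard.put("x")
    print(application_read())     # prints 'x'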

9.1.5 Before-or-After Atomicity: Coordinating Concurrent Threads In Chapter 5 we learned how to express opportunities for concurrency by creating threads, the goal of concurrency being to improve performance by running several things at the same time. Moreover, Section 9.1.2 above pointed out that interrupts can also cre­ ate concurrency. Concurrent threads do not represent any special problem until their paths cross. The way that paths cross can always be described in terms of shared, writable data: concurrent threads happen to take an interest in the same piece of writable data at about the same time. It is not even necessary that the concurrent threads be running simultaneously; if one is stalled (perhaps because of an interrupt) in the middle of an action, a different, running thread can take an interest in the data that the stalled thread was, and will sometime again be, working with. From the point of view of the programmer of an application, Chapter 5 introduced two quite different kinds of concurrency coordination requirements: sequence coordina­ tion and before-or-after atomicity. Sequence coordination is a constraint of the type “Action W must happen before action X”. For correctness, the first action must complete before the second action begins. For example, reading of typed characters from a key­ board must happen before running the program that presents those characters on a display. As a general rule, when writing a program one can anticipate the sequence coor­ dination constraints, and the programmer knows the identity of the concurrent actions. Sequence coordination thus is usually explicitly programmed, using either special lan­ guage constructs or shared variables such as the eventcounts of Chapter 5. In contrast, before-or-after atomicity is a more general constraint that several actions that concurrently operate on the same data should not interfere with one another. We define before-or-after atomicity as follows: Before-or-after atomicity

Concurrent actions have the before-or-after property if their effect from the point of view of their invokers is the same as if the actions occurred either completely before or completely after one another.

In Chapter 5 we saw how before-or-after actions can be created with explicit locks and a thread manager that implements the procedures ACQUIRE and RELEASE. Chapter 5 showed some examples of before-or-after actions using locks, and emphasized that programming correct before-or-after actions, for example coordinating a bounded buffer with several producers or several consumers, can be a tricky proposition. To be confident of correct­ ness, one needs to establish a compelling argument that every action that touches a shared variable follows the locking protocol.


One thing that makes before-or-after atomicity different from sequence coordination is that the programmer of an action that must have the before-or-after property does not necessarily know the identities of all the other actions that might touch the shared variable. This lack of knowledge can make it problematic to coordinate actions by explicit program steps. Instead, what the programmer needs is an automatic, implicit mechanism that ensures proper handling of every shared variable. This chapter will describe several such mechanisms. Put another way, correct coordination requires discipline in the way concurrent threads read and write shared data.

Applications for before-or-after atomicity in a computer system abound. In an operating system, several concurrent threads may decide to use a shared printer at about the same time. It would not be useful for printed lines of different threads to be interleaved in the printed output. Moreover, it doesn’t really matter which thread gets to use the printer first; the primary consideration is that one use of the printer be complete before the next begins, so the requirement is to give each print job the before-or-after atomicity property.

For a more detailed example, let us return to the banking application and the TRANSFER procedure. This time the account balances are held in shared memory variables (recall that the declaration keyword reference means that the argument is call-by-reference, so that TRANSFER can change the values of those arguments):

procedure TRANSFER (reference debit_account, reference credit_account, amount)
    debit_account ← debit_account - amount
    credit_account ← credit_account + amount

Despite their unitary appearance, a program statement such as “X ← X + Y” is actu­ ally composite: it involves reading the values of X and Y, performing an addition, and then writing the result back into X. If a concurrent thread reads and changes the value of X between the read and the write done by this statement, that other thread may be sur­ prised when this statement overwrites its change. Suppose this procedure is applied to accounts A (initially containing $300) and B (ini­ tially containing $100) as in TRANSFER

(A, B, $10)

We expect account A, the debit account, to end up with $290, and account B, the credit account, to end up with $110. Suppose, however, a second, concurrent thread is executing the statement TRANSFER

(B, C, $25)

where account C starts with $175. When both threads complete their transfers, we expect B to end up with $85 and C with $200. Further, this expectation should be fulfilled no matter which of the two transfers happens first. But the variable credit_account in the first thread is bound to the same object (account B) as the variable debit_account in the second thread. The risk to correctness occurs if the two transfers happen at about the same time. To understand this risk, consider Figure 9.2, which illustrates several possible time sequences of the READ and WRITE steps of the two threads with respect to variable B.
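A small sketch in Python, with invented names, that parallels this scenario: two threads share the balance of B, and giving each transfer a lock makes its pair of updates a before-or-after action. (Whether the unsynchronized version actually exhibits the lost update on a given run depends on timing; the point of the lock is to rule the bad interleavings out by construction.)

    import threading

    balances = {"A": 300, "B": 100, "C": 175}
    lock = threading.Lock()

    def transfer(debit_account, credit_account, amount, use_lock=True):
        if use_lock:
            with lock:                       # before-or-after: one transfer at a time
                balances[debit_account] -= amount
                balances[credit_account] += amount
        else:                                # unsynchronized READ and WRITE steps may interleave
            balances[debit_account] -= amount
            balances[credit_account] += amount

    t1 = threading.Thread(target=transfer, args=("A", "B", 10))
    t2 = threading.Thread(target=transfer, args=("B", "C", 25))
    t1.start(); t2.start(); t1.join(); t2.join()
    print(balances)   # with the lock, B always ends at 85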


With each time sequence the figure shows the history of values of the cell containing the balance of account B. If both steps 1–1 and 1–2 precede both steps 2–1 and 2–2 (or vice versa), the two transfers will work as anticipated, and B ends up with $85. If, however, step 2–1 occurs after step 1–1, but before step 1–2, a mistake will occur: one of the two transfers will not affect account B, even though it should have. The first two cases illustrate histories of shared variable B in which the answer is the correct result; the remaining four cases illustrate four different sequences that lead to two incorrect values for B.

[Figure 9.2 (layout summarized): Thread #1 (credit_account is B) performs steps 1–1 READ B and 1–2 WRITE B; Thread #2 (debit_account is B) performs steps 2–1 READ B and 2–2 WRITE B. Starting with B = 100, the figure traces six possible interleavings of these four steps over time. In the two correct cases neither thread's READ–WRITE pair is interrupted by the other thread, and B ends at 85. In the four wrong cases the pairs interleave, one of the two updates is overwritten, and B ends at 110 or 75.]

FIGURE 9.2  Six possible histories of variable B if two threads that share B do not coordinate their concurrent activities.


Thus our goal is to ensure that one of the first two time sequences actually occurs. One way to achieve this goal is that the two steps 1–1 and 1–2 should be atomic, and the two steps 2–1 and 2–2 should similarly be atomic. In the original program, the steps debit_account ← debit_account - amount

and credit_account ← credit_account + amount

should each be atomic. There should be no possibility that a concurrent thread that intends to change the value of the shared variable debit_account read its value between the READ and WRITE steps of this statement.

9.1.6 Correctness and Serialization

The notion that the first two sequences of Figure 9.2 are correct and the other four are wrong is based on our understanding of the banking application. It would be better to have a more general concept of correctness that is independent of the application. Application independence is a modularity goal: we want to be able to make an argument for correctness of the mechanism that provides before-or-after atomicity without getting into the question of whether or not the application using the mechanism is correct. There is such a correctness concept: coordination among concurrent actions can be considered to be correct if every result is guaranteed to be one that could have been obtained by some purely serial application of those same actions.

The reasoning behind this concept of correctness involves several steps. Consider Figure 9.3, which shows, abstractly, the effect of applying some action, whether atomic or not, to a system: the action changes the state of the system. Now, if we are sure that:

1. the old state of the system was correct from the point of view of the application, and

FIGURE 9.3 A single action takes a system from one state to another state.

2. the action, performing all by itself, correctly transforms any correct old state to a correct new state, then we can reason that the new state must also be correct. This line of reasoning holds for any application-dependent definition of “correct” and “correctly transform”, so our reasoning method is independent of those definitions and thus of the application. The corresponding requirement when several actions act concurrently, as in Figure 9.4, is that the resulting new state ought to be one of those that would have resulted from some serialization of the several actions, as in Figure 9.5. This correctness criterion means that concurrent actions are correctly coordinated if their result is guaranteed to be one that would have been obtained by some purely serial application of those same actions.
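To make the serializability criterion concrete, here is a small Python sketch, invented for illustration, that enumerates every serial order of a set of actions and checks whether an observed final state could have been produced by one of them.

    from itertools import permutations

    def serial_outcomes(initial_state, actions):
        """Return the final states reachable by running the actions one at a
        time, in every possible order (states are dicts)."""
        outcomes = []
        for order in permutations(actions):
            state = dict(initial_state)
            for action in order:
                action(state)
            if state not in outcomes:
                outcomes.append(state)
        return outcomes

    def transfer(debit, credit, amount):
        def action(state):
            state[debit] -= amount
            state[credit] += amount
        return action

    initial = {"A": 300, "B": 100, "C": 175}
    actions = [transfer("A", "B", 10), transfer("B", "C", 25)]

    print(serial_outcomes(initial, actions))
    # [{'A': 290, 'B': 85, 'C': 200}] -- both serial orders give the same state,
    # so any correct coordination must produce exactly this result.

    observed = {"A": 290, "B": 110, "C": 200}                 # a lost-update outcome
    print(observed in serial_outcomes(initial, actions))       # False: not serializable

This brute-force enumeration is only practical for tiny examples; the point is the definition it embodies, not the method.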


[Figure 9.4: actions #1, #2, and #3 act concurrently on the old system state and together produce the new system state.]
FIGURE 9.4

old system action #2 state #3 action

When several actions act con­ currently, they together produce a new state. If the actions are before-or-after and the old state was correct, the new state will be correct.

new system state

So long as the only coordination requirement is before-or-after atomicity, any serializa­ tion will do. Moreover, we do not even need to insist that the system actually traverse the interme­ diate states along any particular path of Figure 9.5—it may instead follow the dotted trajectory through intermediate states that are not by themselves correct, according to the application’s definition. As long as the intermediate states are not visible above the implementing layer, and the system is guaranteed to end up in one of the acceptable final states, we can declare the coordination to be correct because there exists a trajectory that leads to that state for which a correctness argument could have been applied to every step. Since our definition of before-or-after atomicity is that each before-or-after action act as though it ran either completely before or completely after each other before-or-after action, before-or-after atomicity leads directly to this concept of correctness. Put another way, before-or-after atomicity has the effect of serializing the actions, so it follows that before-or-after atomicity guarantees correctness of coordination. A different way of

AA#3

AA #2 AA #1

final state A

AA#

3

AA #2

old system state AA #2

AA#3

AA#1

final state B final state C

FIGURE 9.5 We insist that the final state be one that could have been reached by some serialization of the atomic actions, but we don't care which serialization. In addition, we do not need to insist that the intermediate states ever actually exist. The actual state trajectory could be that shown by the dotted lines, but only if there is no way of observing the intermediate states from the outside.


expressing this idea is to say that when concurrent actions have the before-or-after prop­ erty, they are serializable: there exists some serial order of those concurrent transactions that would, if followed, lead to the same ending state.* Thus in Figure 9.2, the sequences of case 1 and case 2 could result from a serialized order, but the actions of cases 3 through 6 could not. In the example of Figure 9.2, there were only two concurrent actions and each of the concurrent actions had only two steps. As the number of concurrent actions and the number of steps in each action grows there will be a rapidly growing number of possible orders in which the individual steps can occur, but only some of those orders will ensure a correct result. Since the purpose of concurrency is to gain performance, one would like to have a way of choosing from the set of correct orders the one correct order that has the highest performance. As one might guess, making that choice can in general be quite difficult. In Sections 9.4 and 9.5 of this chapter we will encounter several programming disciplines that ensure choice from a subset of the possible orders, all members of which are guaranteed to be correct but, unfortunately, may not include the correct order that has the highest performance. In some applications it is appropriate to use a correctness requirement that is stronger than serializability. For example, the designer of a banking system may want to avoid anachronisms by requiring what might be called external time consistency: if there is any external evidence (such as a printed receipt) that before-or-after action T1 ended before before-or-after action T2 began, the serialization order of T1 and T2 inside the system should be that T1 precedes T2. For another example of a stronger correctness require­ ment, a processor architect may require sequential consistency: when the processor concurrently performs multiple instructions from the same instruction stream, the result should be as if the instructions were executed in the original order specified by the programmer. Returning to our example, a real funds-transfer application typically has several dis­ tinct before-or-after atomicity requirements. Consider the following auditing procedure; its purpose is to verify that the sum of the balances of all accounts is zero (in double-entry bookkeeping, accounts belonging to the bank, such as the amount of cash in the vault, have negative balances): procedure AUDIT()

sum ← 0

for each W in bank.accounts

sum ← sum + W.balance

if (sum ≠ 0) call for investigation

Suppose that AUDIT is running in one thread at the same time that another thread is transferring money from account A to account B. If AUDIT examines account A before the transfer and account B after the transfer, it will count the transferred amount twice and * The general question of whether or not a collection of existing transactions is serializable is an advanced topic that is addressed in database management. Problem set 36 explores one method of answering this question.


thus will compute an incorrect answer. So the entire auditing procedure should occur either before or after any individual transfer: we want it to be a before-or-after action. There is yet another before-or-after atomicity requirement: if AUDIT should run after the statement in TRANSFER debit_account ← debit_account - amount

but before the statement credit_account ← credit_account + amount

it will calculate a sum that does not include amount; we therefore conclude that the two balance updates should occur either completely before or completely after any AUDIT action; put another way, TRANSFER should be a before-or-after action.
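The sketch below, in Python with invented names, illustrates this requirement: if TRANSFER holds a lock across both balance updates and AUDIT takes the same lock around its whole scan, the audit can never observe a half-completed transfer. It is an illustration of the idea, not the book's implementation.

    import threading

    balances = {"A": 300, "B": 100, "bank": -400}   # double-entry: the balances sum to zero
    lock = threading.Lock()

    def transfer(debit_account, credit_account, amount):
        with lock:                      # both updates inside one before-or-after action
            balances[debit_account] -= amount
            balances[credit_account] += amount

    def audit():
        with lock:                      # the whole scan is before or after any transfer
            total = sum(balances.values())
        if total != 0:
            raise RuntimeError("call for investigation")

    t = threading.Thread(target=transfer, args=("A", "B", 10))
    a = threading.Thread(target=audit)
    t.start(); a.start(); t.join(); a.join()   # the audit never sees a partial transfer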

9.1.7 All-or-Nothing and Before-or-After Atomicity We now have seen examples of two forms of atomicity: all-or-nothing and before-or­ after. These two forms have a common underlying goal: to hide the internal structure of an action. With that insight, it becomes apparent that atomicity is really a unifying concept: Atomicity An action is atomic if there is no way for a higher layer to discover the internal structure of its implementation. This description is really the fundamental definition of atomicity. From it, one can immediately draw two important consequences, corresponding to all-or-nothing atom­ icity and to before-or-after atomicity: 1. From the point of view of a procedure that invokes an atomic action, the atomic action always appears either to complete as anticipated, or to do nothing. This consequence is the one that makes atomic actions useful in recovering from failures. 2. From the point of view of a concurrent thread, an atomic action acts as though it occurs either completely before or completely after every other concurrent atomic action. This consequence is the one that makes atomic actions useful for coordinating concurrent threads. These two consequences are not fundamentally different. They are simply two per­ spectives, the first from other modules within the thread that invokes the action, the second from other threads. Both points of view follow from the single idea that the inter­ nal structure of the action is not visible outside of the module that implements the action. Such hiding of internal structure is the essence of modularity, but atomicity is an exceptionally strong form of modularity. Atomicity hides not just the details of which


steps form the atomic action, but the very fact that it has structure. There is a kinship between atomicity and other system-building techniques such as data abstraction and cli­ ent/server organization. Data abstraction has the goal of hiding the internal structure of data; client/server organization has the goal of hiding the internal structure of major sub­ systems. Similarly, atomicity has the goal of hiding the internal structure of an action. All three are methods of enforcing industrial-strength modularity, and thereby of guar­ anteeing absence of unanticipated interactions among components of a complex system. We have used phrases such as “from the point of view of the invoker” several times, suggesting that there may be another point of view from which internal structure is apparent. That other point of view is seen by the implementer of an atomic action, who is often painfully aware that an action is actually composite, and who must do extra work to hide this reality from the higher layer and from concurrent threads. Thus the inter­ faces between layers are an essential part of the definition of an atomic action, and they provide an opportunity for the implementation of an action to operate in any way that ends up providing atomicity. There is one more aspect of hiding the internal structure of atomic actions: atomic actions can have benevolent side effects. A common example is an audit log, where atomic actions that run into trouble record the nature of the detected failure and the recovery sequence for later analysis. One might think that when a failure leads to backing out, the audit log should be rolled back, too; but rolling it back would defeat its pur­ pose—the whole point of an audit log is to record details about the failure. The important point is that the audit log is normally a private record of the layer that imple­ mented the atomic action; in the normal course of operation it is not visible above that layer, so there is no requirement to roll it back. (A separate atomicity requirement is to ensure that the log entry that describes a failure is complete and not lost in the ensuing recovery.) Another example of a benevolent side effect is performance optimization. For exam­ ple, in a high-performance data management system, when an upper layer atomic action asks the data management system to insert a new record into a file, the data management system may decide as a performance optimization that now is the time to rearrange the file into a better physical order. If the atomic action fails and aborts, it need ensure only that the newly-inserted record be removed; the file does not need to be restored to its older, less efficient, storage arrangement. Similarly, a lower-layer cache that now contains a variable touched by the atomic action does not need to be cleared and a garbage collec­ tion of heap storage does not need to be undone. Such side effects are not a problem, as long as they are hidden from the higher-layer client of the atomic action except perhaps in the speed with which later actions are carried out, or across an interface that is intended to report performance measures or failures.


9.2 All-or-Nothing Atomicity I: Concepts Section 9.1 of this chapter defined the goals of all-or-nothing atomicity and before-or­ after atomicity, and provided a conceptual framework that at least in principle allows a designer to decide whether or not some proposed algorithm correctly coordinates con­ current activities. However, it did not provide any examples of actual implementations of either goal. This section of the chapter, together with the next one, describe some widely applicable techniques of systematically implementing all-or-nothing atomicity. Later sections of the chapter will do the same for before-or-after atomicity. Many of the examples employ the technique introduced in Chapter 5 called boot­ strapping, a method that resembles inductive proof. To review, bootstrapping means to first look for a systematic way to reduce a general problem to some much-narrowed par­ ticular version of that same problem. Then, solve the narrow problem using some specialized method that might work only for that case because it takes advantage of the specific situation. The general solution then consists of two parts: a special-case tech­ nique plus a method that systematically reduces the general problem to the special case. Recall that Chapter 5 tackled the general problem of creating before-or-after actions from arbitrary sequences of code by implementing a procedure named ACQUIRE that itself required before-or-after atomicity of two or three lines of code where it reads and then sets a lock value. It then implemented that before-or-after action with the help of a spe­ cial hardware feature that directly makes a before-or-after action of the read and set sequence, and it also exhibited a software implementation (in Sidebar 5.2) that relies only on the hardware performing ordinary LOADs and STOREs as before-or-after actions. This chapter uses bootstrapping several times. The first example starts with the special case and then introduces a way to reduce the general problem to that special case. The reduc­ tion method, called the version history, is used only occasionally in practice, but once understood it becomes easy to see why the more widely used reduction methods that will be described in Section 9.3 work.

9.2.1 Achieving All-or-Nothing Atomicity: ALL_OR_NOTHING_PUT The first example is of a scheme that does an all-or-nothing update of a single disk sector. The problem to be solved is that if a system crashes in the middle of a disk write (for example, the operating system encounters a bug or the power fails), the sector that was being written at the instant of the failure may contain an unusable muddle of old and new data. The goal is to create an all-or-nothing PUT with the property that when GET later reads the sector, it always returns either the old or the new data, but never a muddled mixture. To make the implementation precise, we develop a disk fault tolerance model that is a slight variation of the one introduced in Chapter 8[on-line], taking as an example application a calendar management program for a personal computer. The user is hoping that, if the system fails while adding a new event to the calendar, when the system later restarts the calendar will be safely intact. Whether or not the new event ended up in the


calendar is less important than that the calendar not be damaged by inopportune timing of the system failure. This system comprises a human user, a display, a processor, some volatile memory, a magnetic disk, an operating system, and the calendar manager pro­ gram. We model this system in several parts: Overall system fault tolerance model. • error-free operation: All work goes according to expectations. The user initiates actions such as adding events to the calendar and the system confirms the actions by displaying messages to the user. • tolerated error: The user who has initiated an action notices that the system failed before it confirmed completion of the action and, when the system is operating again, checks to see whether or not it actually performed that action. • untolerated error: The system fails without the user noticing, so the user does not realize that he or she should check or retry an action that the system may not have completed. The tolerated error specification means that, to the extent possible, the entire system is fail-fast: if something goes wrong during an update, the system stops before taking any more requests, and the user realizes that the system has stopped. One would ordinarily design a system such as this one to minimize the chance of the untolerated error, for example by requiring supervision by a human user. The human user then is in a position to realize (perhaps from lack of response) that something has gone wrong. After the sys­ tem restarts, the user knows to inquire whether or not the action completed. This design strategy should be familiar from our study of best effort networks in Chapter 7[on-line]. The lower layer (the computer system) is providing a best effort implementation. A higher layer (the human user) supervises and, when necessary, retries. For example, sup­ pose that the human user adds an appointment to the calendar but just as he or she clicks “save” the system crashes. The user doesn’t know whether or not the addition actually succeeded, so when the system comes up again the first thing to do is open up the calen­ dar to find out what happened. Processor, memory, and operating system fault tolerance model. This part of the model just specifies more precisely the intended fail-fast properties of the hardware and operating system: • error-free operation: The processor, memory, and operating system all follow their specifications. • detected error: Something fails in the hardware or operating system. The system is fail-fast: the hardware or operating system detects the failure and restarts from a clean slate before initiating any further PUTs to the disk. • untolerated error: Something fails in the hardware or operating system. The processor muddles along and PUTs corrupted data to the disk before detecting the failure.


The primary goal of the processor/memory/operating-system part of the model is to detect failures and stop running before any corrupted data is written to the disk storage system. The importance of detecting failure before the next disk write lies in error containment: if the goal is met, the designer can assume that the only values potentially in error must be in processor registers and volatile memory, and the data on the disk should be safe, with the exception described in Section 8.5.4.2: if there was a PUT to the disk in progress at the time of the crash, the failing system may have corrupted the disk buffer in volatile memory, and consequently corrupted the disk sector that was being written. The recovery procedure can thus depend on the disk storage system to contain only uncorrupted information, or at most one corrupted disk sector. In fact, after restart the disk will contain the only surviving information. “Restarts from a clean slate” means that the system discards all state held in volatile memory. This step brings the system to the same state as if a power failure had occurred, so a single recovery procedure will be able to handle both system crashes and power failures. Discarding volatile memory also means that all currently active threads vanish, so everything that was going on comes to an abrupt halt and will have to be restarted.

Disk storage system fault tolerance model. Implementing all-or-nothing atomicity involves some steps that resemble the decay masking of MORE_DURABLE_PUT/GET in Chapter 8[on-line]—in particular, the algorithm will write multiple copies of data. To clarify how the all-or-nothing mechanism works, we temporarily back up to CAREFUL_PUT/GET (see Section 8.5.4.5), which masks soft disk errors but not hard disk errors or disk decay. To simplify further, we pretend for the moment that a disk never decays and that it has no hard errors. (Since this perfect-disk assumption is obviously unrealistic, we will reverse it in Section 9.7, which describes an algorithm for all-or-nothing atomicity despite disk decay and hard errors.) With the perfect-disk assumption, only one thing can go wrong: a system crash at just the wrong time. The fault tolerance model for this simplified careful disk system then becomes:

• error-free operation: CAREFUL_GET returns the result of the most recent call to CAREFUL_PUT at sector_number on track, with status = OK.

• detectable error: The operating system crashes during a CAREFUL_PUT and corrupts the disk buffer in volatile storage, and CAREFUL_PUT writes corrupted data on one sector of the disk. We can classify the error as “detectable” if we assume that the application has included with the data an end-to-end checksum, calculated before calling CAREFUL_PUT and thus before the system crash could have corrupted the data.

The change in this revision of the careful storage layer is that when a system crash occurs, one sector on the disk may be corrupted, but the client of the interface is confident that (1) that sector is the only one that may be corrupted and (2) if it has been corrupted, any later reader of that sector will detect the problem. Between the processor model and the storage system model, all anticipated failures now lead to the same situation:


1  procedure ALMOST_ALL_OR_NOTHING_PUT (data, all_or_nothing_sector)
2      CAREFUL_PUT (data, all_or_nothing_sector.S1)
3      CAREFUL_PUT (data, all_or_nothing_sector.S2)        // Commit point.
4      CAREFUL_PUT (data, all_or_nothing_sector.S3)

5  procedure ALL_OR_NOTHING_GET (reference data, all_or_nothing_sector)
6      CAREFUL_GET (data1, all_or_nothing_sector.S1)
7      CAREFUL_GET (data2, all_or_nothing_sector.S2)
8      CAREFUL_GET (data3, all_or_nothing_sector.S3)
9      if data1 = data2 then data ← data1                  // Return new value.
10     else data ← data3                                   // Return old value.

FIGURE 9.6 Algorithms for ALMOST_ALL_OR_NOTHING_PUT and ALL_OR_NOTHING_GET.

the system detects the failure, resets all processor registers and volatile memory, forgets all active threads, and restarts. No more than one disk sector is corrupted.

Our problem is now reduced to providing the all-or-nothing property: the goal is to create all-or-nothing disk storage, which guarantees either to change the data on a sector completely and correctly or else appear to future readers not to have touched it at all. Here is one simple, but somewhat inefficient, scheme that makes use of virtualization: assign, for each data sector that is to have the all-or-nothing property, three physical disk sectors, identified as S1, S2, and S3. The three physical sectors taken together are a virtual “all-or-nothing sector”. At each place in the system where this disk sector was previously used, replace it with the all-or-nothing sector, identified by the triple {S1, S2, S3}. We start with an almost correct all-or-nothing implementation named ALMOST_ALL_OR_NOTHING_PUT, find a bug in it, and then fix the bug, finally creating a correct ALL_OR_NOTHING_PUT.

When asked to write data, ALMOST_ALL_OR_NOTHING_PUT writes it three times, on S1, S2, and S3, in that order, each time waiting until the previous write finishes, so that if the system crashes only one of the three sectors will be affected. To read data, ALL_OR_NOTHING_GET reads all three sectors and compares their contents. If the contents of S1 and S2 are identical, ALL_OR_NOTHING_GET returns that value as the value of the all-or-nothing sector. If S1 and S2 differ, ALL_OR_NOTHING_GET returns the contents of S3 as the value of the all-or-nothing sector. Figure 9.6 shows this almost correct pseudocode.

Let’s explore how this implementation behaves on a system crash. Suppose that at some previous time a record has been correctly stored in an all-or-nothing sector (in other words, all three copies are identical), and someone now updates it by calling ALL_OR_NOTHING_PUT. The goal is that even if a failure occurs in the middle of the update, a later reader can always be ensured of getting some complete, consistent version of the record by invoking ALL_OR_NOTHING_GET. Suppose that ALMOST_ALL_OR_NOTHING_PUT were interrupted by a system crash some time before it finishes writing sector S2, and thus corrupts either S1 or S2. In that case,


1  procedure ALL_OR_NOTHING_PUT (data, all_or_nothing_sector)
2      CHECK_AND_REPAIR (all_or_nothing_sector)
3      ALMOST_ALL_OR_NOTHING_PUT (data, all_or_nothing_sector)

4  procedure CHECK_AND_REPAIR (all_or_nothing_sector)          // Ensure copies match.
5      CAREFUL_GET (data1, all_or_nothing_sector.S1)
6      CAREFUL_GET (data2, all_or_nothing_sector.S2)
7      CAREFUL_GET (data3, all_or_nothing_sector.S3)
8      if (data1 = data2) and (data2 = data3) return           // State 1 or 7, no repair
9      if (data1 = data2)
10         CAREFUL_PUT (data1, all_or_nothing_sector.S3) return // State 5 or 6.
11     if (data2 = data3)
12         CAREFUL_PUT (data2, all_or_nothing_sector.S1) return // State 2 or 3.
13     CAREFUL_PUT (data1, all_or_nothing_sector.S2)            // State 4, go to state 5
14     CAREFUL_PUT (data1, all_or_nothing_sector.S3)            // State 5, go to state 7

FIGURE 9.7 Algorithms for ALL_OR_NOTHING_PUT and CHECK_AND_REPAIR.

when ALL_OR_NOTHING_GET reads sectors S1 and S2, they will have different values, and it is not clear which one to trust. Because the system is fail-fast, sector S3 would not yet have been touched by ALMOST_ALL_OR_NOTHING_PUT, so it still contains the previous value. Returning the value found in S3 thus has the desired effect of ALMOST_ALL_OR_NOTHING_PUT having done nothing.

Now, suppose that ALMOST_ALL_OR_NOTHING_PUT were interrupted by a system crash some time after successfully writing sector S2. In that case, the crash may have corrupted S3, but S1 and S2 both contain the newly updated value. ALL_OR_NOTHING_GET returns the value of S1, thus providing the desired effect of ALMOST_ALL_OR_NOTHING_PUT having completed its job.

So what’s wrong with this design? ALMOST_ALL_OR_NOTHING_PUT assumes that all three copies are identical when it starts. But a previous failure can violate that assumption. Suppose that ALMOST_ALL_OR_NOTHING_PUT is interrupted while writing S3. The next thread to call ALL_OR_NOTHING_GET finds data1 = data2, so it uses data1, as expected. The new thread then calls ALMOST_ALL_OR_NOTHING_PUT, but is interrupted while writing S2. Now, S1 doesn't equal S2, so the next call to ALL_OR_NOTHING_GET returns the damaged S3.

The fix for this bug is for ALL_OR_NOTHING_PUT to guarantee that the three sectors be identical before updating. It can provide this guarantee by invoking a procedure named CHECK_AND_REPAIR as in Figure 9.7. CHECK_AND_REPAIR simply compares the three copies and, if they are not identical, it forces them to be identical. To see how this works, assume that someone calls ALL_OR_NOTHING_PUT at a time when all three of the copies do contain identical values, which we designate as “old”. Because ALL_OR_NOTHING_PUT writes “new”


values into S1, S2, and S3 one at a time and in order, even if there is a crash, at the next call to ALL_OR_NOTHING_PUT there are only seven possible data states for CHECK_AND_REPAIR to consider:

    data state    sector S1    sector S2    sector S3
        1            old          old          old
        2            bad          old          old
        3            new          old          old
        4            new          bad          old
        5            new          new          old
        6            new          new          bad
        7            new          new          new

The way to read this table is as follows: if all three sectors S1, S2, and S3 contain the “old” value, the data is in state 1. Now, if CHECK_AND_REPAIR discovers that all three copies are identical (line 8 in Figure 9.7), the data is in state 1 or state 7 so CHECK_AND_REPAIR simply returns. Failing that test, if the copies in sectors S1 and S2 are identical (line 9), the data must be in state 5 or state 6, so CHECK_AND_REPAIR forces sector S3 to match and returns (line 10). If the copies in sectors S2 and S3 are identical the data must be in state 2 or state 3 (line 11), so CHECK_AND_REPAIR forces sector S1 to match and returns (line 12). The only remaining possibility is that the data is in state 4, in which case sector S2 is surely bad, but sector S1 contains a new value and sector S3 contains an old one. The choice of which to use is arbitrary; as shown the procedure copies the new value in sector S1 to both sectors S2 and S3.

What if a failure occurs while running CHECK_AND_REPAIR? That procedure systematically drives the state either forward from state 4 toward state 7, or backward from state 3 toward state 1. If CHECK_AND_REPAIR is itself interrupted by another system crash, rerunning it will continue from the point at which the previous attempt left off.

We can make several observations about the algorithm implemented by ALL_OR_NOTHING_GET and ALL_OR_NOTHING_PUT:

1. This all-or-nothing atomicity algorithm assumes that only one thread at a time tries to execute either ALL_OR_NOTHING_GET or ALL_OR_NOTHING_PUT. This algorithm implements all-or-nothing atomicity but not before-or-after atomicity.

2. CHECK_AND_REPAIR is idempotent. That means that a thread can start the procedure, execute any number of its steps, be interrupted by a crash, and go back to the beginning again any number of times with the same ultimate result, as far as a later call to ALL_OR_NOTHING_GET is concerned.

3. The completion of the CAREFUL_PUT on line 3 of ALMOST_ALL_OR_NOTHING_PUT, marked “commit point,” exposes the new data to future ALL_OR_NOTHING_GET actions. Until that step begins execution, a call to ALL_OR_NOTHING_GET sees the old data. After line 3 completes, a call to ALL_OR_NOTHING_GET sees the new data.

4. Although the algorithm writes three replicas of the data, the primary reason for the replicas is not to provide durability as described in Section 8.5. Instead, the reason for writing three replicas, one at a time and in a particular order, is to ensure observance at all times and under all failure scenarios of the golden rule of atomicity, which is the subject of the next section.
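To see these algorithms in action, here is a small simulation, written in Python for concreteness; it is our own sketch, not part of the text. A dictionary stands in for the three careful disk sectors, and a deliberately corrupted sector plays the role of a crash that interrupts a CAREFUL_PUT.

    # A minimal simulation (not from the text) of the three-sector scheme of
    # Figures 9.6 and 9.7. A Python dict stands in for a fail-fast careful disk.
    disk = {}                          # sector name -> stored value

    def careful_put(data, sector):
        disk[sector] = data

    def careful_get(sector):
        return disk.get(sector)

    def check_and_repair(s1, s2, s3):
        d1, d2, d3 = careful_get(s1), careful_get(s2), careful_get(s3)
        if d1 == d2 == d3:             # state 1 or 7: nothing to repair
            return
        if d1 == d2:                   # state 5 or 6: finish the interrupted update
            careful_put(d1, s3)
            return
        if d2 == d3:                   # state 2 or 3: back out the interrupted update
            careful_put(d2, s1)
            return
        careful_put(d1, s2)            # state 4: drive forward to state 5 ...
        careful_put(d1, s3)            # ... and then to state 7

    def all_or_nothing_put(data, s1, s2, s3):
        check_and_repair(s1, s2, s3)
        careful_put(data, s1)
        careful_put(data, s2)          # commit point
        careful_put(data, s3)

    def all_or_nothing_get(s1, s2, s3):
        d1, d2, d3 = careful_get(s1), careful_get(s2), careful_get(s3)
        return d1 if d1 == d2 else d3  # new value if committed, else old value

    # Example: initialize, then simulate a crash that corrupts S2 mid-update.
    all_or_nothing_put("old", "S1", "S2", "S3")
    disk["S1"] = "new"                 # crash partway through the next update:
    disk["S2"] = "###garbage###"       # S1 written, S2 corrupted, S3 untouched
    assert all_or_nothing_get("S1", "S2", "S3") == "old"   # reader still sees "old"
    all_or_nothing_put("new", "S1", "S2", "S3")            # CHECK_AND_REPAIR runs first
    assert all_or_nothing_get("S1", "S2", "S3") == "new"

Running CHECK_AND_REPAIR at the start of the next ALL_OR_NOTHING_PUT drives the sectors back to a consistent state, exactly as the seven-state table predicts.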


There are several ways of implementing all-or-nothing disk sectors. Near the end of Chapter 8[on-line] we introduced a fault tolerance model for decay events that did not mask system crashes, and applied the technique known as RAID to mask decay to produce durable storage. Here we started with a slightly different fault tolerance model that omits decay, and we devised techniques to mask system crashes and produce all-or-nothing storage. What we really should do is start with a fault tolerance model that considers both system crashes and decay, and devise storage that is both all-or-nothing and durable. Such a model, devised by Xerox Corporation researchers Butler Lampson and Howard Sturgis, is the subject of Section 9.7, together with the more elaborate recovery algorithms it requires. That model has the additional feature that it needs only two physical sectors for each all-or-nothing sector.

9.2.2 Systematic Atomicity: Commit and the Golden Rule

The example of ALL_OR_NOTHING_PUT and ALL_OR_NOTHING_GET demonstrates an interesting special case of all-or-nothing atomicity, but it offers little guidance on how to systematically create a more general all-or-nothing action. From the example, our calendar program now has a tool that allows writing individual sectors with the all-or-nothing property, but that is not the same as safely adding an event to a calendar, since adding an event probably requires rearranging a data structure, which in turn may involve writing more than one disk sector. We could do a series of ALL_OR_NOTHING_PUTs to the several sectors, to ensure that each sector is itself written in an all-or-nothing fashion, but a crash that occurs after writing one and before writing the next would leave the overall calendar addition in a partly-done state. To make the entire calendar addition action all-or-nothing we need a generalization.

Ideally, one might like to be able to take any arbitrary sequence of instructions in a program, surround that sequence with some sort of begin and end statements as in Figure 9.8, and expect that the language compilers and operating system will perform some magic that makes the surrounded sequence into an all-or-nothing action. Unfortunately, no one knows how to do that. But we can come close, if the programmer is willing to make a modest concession to the requirements of all-or-nothing atomicity. This concession is expressed in the form of a discipline on the constituent steps of the all-or-nothing action.

The discipline starts by identifying some single step of the sequence as the commit point. The all-or-nothing action is thus divided into two phases, a pre-commit phase and a post-commit phase, as suggested by Figure 9.9. During the pre-commit phase, the disciplining rule of design is that no matter what happens, it must be possible to back out of this all-or-nothing action in a way that leaves no trace. During the post-commit phase the disciplining rule of design is that no matter what happens, the action must run to the end successfully. Thus an all-or-nothing action can have only two outcomes. If the all-or-nothing action starts and then, without reaching the commit point, backs out, we say that it aborts. If the all-or-nothing action passes the commit point, we say that it commits.


FIGURE 9.8 Imaginary semantics for painless programming of all-or-nothing actions. [Diagram: an arbitrary sequence of lower-layer actions bracketed by “begin all-or-nothing action” and “end all-or-nothing action” statements.]

We can make several observations about the restrictions of the pre-commit phase. The pre-commit phase must identify all the resources needed to complete the all-or-nothing action, and establish their availability. The names of data should be bound, permissions should be checked, the pages to be read or written should be in memory, removable media should be mounted, stack space must be allocated, etc. In other words, all the steps needed to anticipate the severe run-to-the-end-without-faltering requirement of the post-commit phase should be completed during the pre-commit phase. In addition, the pre-commit phase must maintain the ability to abort at any instant. Any changes that the pre-commit phase makes to the state of the system must be undoable in case this all-or-nothing action aborts. Usually, this requirement means that shared

FIGURE 9.9 The commit point of an all-or-nothing action. [Diagram: from the first step of the all-or-nothing action up to the commit point the pre-commit discipline applies (can back out, leaving no trace); from the commit point through the last step the post-commit discipline applies (completion is inevitable).]


resources, once reserved, cannot be released until the commit point is passed. The reason is that if an all-or-nothing action releases a shared resource, some other, concurrent thread may capture that resource. If the resource is needed in order to undo some effect of the all-or-nothing action, releasing the resource is tantamount to abandoning the ability to abort. Finally, the reversibility requirement means that the all-or-nothing action should not do anything externally visible, for example printing a check or firing a missile, prior to the commit point. (It is possible, though more complicated, to be slightly less restrictive. Sidebar 9.3 explores that possibility.)

In contrast, the post-commit phase can expose results, it can release reserved resources that are no longer needed, and it can perform externally visible actions such as printing a check, opening a cash drawer, or drilling a hole. But it cannot try to acquire additional resources because an attempt to acquire might fail, and the post-commit phase is not permitted the luxury of failure. The post-commit phase must confine itself to finishing just the activities that were planned during the pre-commit phase.

It might appear that if a system fails before the post-commit phase completes, all hope is lost, so the only way to ensure all-or-nothing atomicity is to always make the commit step the last step of the all-or-nothing action. Often, that is the simplest way to ensure all-or-nothing atomicity, but the requirement is not actually that stringent. An important feature of the post-commit phase is that it is hidden inside the layer that implements the all-or-nothing action, so a scheme that ensures that the post-commit phase completes after a system failure is acceptable, so long as this delay is hidden from the invoking layer. Some all-or-nothing atomicity schemes thus involve a guarantee that a cleanup procedure will be invoked following every system failure, or as a prelude to the next use of the data, before anyone in a higher layer gets a chance to discover that anything went wrong. This idea should sound familiar: the implementation of ALL_OR_NOTHING_PUT in Figure 9.7 used this approach, by always running the cleanup procedure named CHECK_AND_REPAIR before updating the data.

A popular technique for achieving all-or-nothing atomicity is called the shadow copy. It is used by text editors, compilers, calendar management programs, and other programs that modify existing files, to ensure that following a system failure the user does not end up with data that is damaged or that contains only some of the intended changes:

• Pre-commit: Create a complete duplicate working copy of the file that is to be modified. Then, make all changes to the working copy.

Sidebar 9.3: Cascaded aborts (Temporary) sweeping simplification. In this initial discussion of commit points, we are intentionally avoiding a more complex and harder-to-design possibility. Some systems allow other, concurrent activities to see pending results, and they may even allow externally visible actions before commit. Those systems must therefore be prepared to track down and abort those concurrent activities (this tracking down is called cascaded abort) or perform compensating external actions (e.g., send a letter requesting return of the check or apologizing for the missile firing). The discussion of layers and multiple sites in Chapter 10[on-line] introduces a simple version of cascaded abort.


• Commit point: Carefully exchange the working copy with the original. Typically this step is bootstrapped, using a lower-layer RENAME entry point of the file system that provides certain atomic-like guarantees such as the ones described for the UNIX version of RENAME in Section 2.5.8.

• Post-commit: Release the space that was occupied by the original.

The ALL_OR_NOTHING_PUT algorithm of Figure 9.7 can be seen as a particular example of the shadow copy strategy, which itself is a particular example of the general pre-commit/post-commit discipline. The commit point occurs at the instant when the new value of S2 is successfully written to the disk. During the pre-commit phase, while ALL_OR_NOTHING_PUT is checking over the three sectors and writing the shadow copy S1, a crash will leave no trace of that activity (that is, no trace that can be discovered by a later caller of ALL_OR_NOTHING_GET). The post-commit phase of ALL_OR_NOTHING_PUT consists of writing S3. From these examples we can extract an important design principle:

The golden rule of atomicity

    Never modify the only copy!

In order for a composite action to be all-or-nothing, there must be some way of reversing the effect of each of its pre-commit phase component actions, so that if the action does not commit it is possible to back out. As we continue to explore implementations of all-or-nothing atomicity, we will notice that correct implementations always reduce at the end to making a shadow copy. The reason is that structure ensures that the implementation follows the golden rule.
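For a concrete illustration of the shadow-copy discipline, here is a sketch in Python (ours, not the text's). It assumes a POSIX-style file system in which os.replace substitutes one name for another atomically, in the spirit of the RENAME guarantee mentioned above. Until the rename, the working copy is invisible to anyone who reaches the file by its name, so a crash or an abort leaves no trace.

    # A sketch (not from the text) of a shadow-copy update. The atomic rename is
    # the commit point; the working copy has a unique name until then.
    import os, tempfile

    def shadow_copy_update(filename, transform):
        with open(filename, "rb") as f:
            old_contents = f.read()                          # pre-commit: read the original
        new_contents = transform(old_contents)               # pre-commit: compute the new contents
        directory = os.path.dirname(os.path.abspath(filename))
        fd, working_name = tempfile.mkstemp(dir=directory)   # working copy, unique name
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(new_contents)                        # pre-commit: all changes go to the copy
                f.flush()
                os.fsync(f.fileno())                         # make the working copy durable first
            os.replace(working_name, filename)               # commit point: atomic exchange of names
        except BaseException:
            os.unlink(working_name)                          # abort: discard the copy, no trace remains
            raise
        # post-commit: the space of the original is reclaimed once no one holds it open.

    # Hypothetical use: append an event to a calendar file all-or-nothing.
    # shadow_copy_update("calendar.txt", lambda old: old + b"2009-06-25 lunch\n")

The golden rule is satisfied because the only copy reachable by name, the original file, is never modified; all writing happens on the working copy, and the rename makes the whole set of changes visible at once.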

9.2.3 Systematic All-or-Nothing Atomicity: Version Histories

This section develops a scheme to provide all-or-nothing atomicity in the general case of a program that modifies arbitrary data structures. It will be easy to see why the scheme is correct, but the mechanics can interfere with performance. Section 9.3 of this chapter then introduces a variation on the scheme that requires more thought to see why it is correct, but that allows higher-performance implementations. As before, we concentrate for the moment on all-or-nothing atomicity. While some aspects of before-or-after atomicity will also emerge, we leave a systematic treatment of that topic for discussion in Sections 9.4 and 9.5 of this chapter. Thus the model to keep in mind in this section is that only a single thread is running. If the system crashes, after a restart the original thread is gone—recall from Chapter 8[on-line] the sweeping simplification that threads are included in the volatile state that is lost on a crash and only durable state survives. After the crash, a new, different thread comes along and attempts to look at the data. The goal is that the new thread should always find that the all-or-nothing action that was in progress at the time of the crash either never started or completed successfully.


In looking at the general case, a fundamental difficulty emerges: random-access memory and disk usually appear to the programmer as a set of named, shared, and rewritable storage cells, called cell storage. Cell storage has semantics that are actually quite hard to make all-or-nothing because the act of storing destroys old data, thus potentially violating the golden rule of atomicity. If the all-or-nothing action later aborts, the old value is irretrievably gone; at best it can only be reconstructed from information kept elsewhere. In addition, storing data reveals it to the view of later threads, whether or not the all-or-nothing action that stored the value reached its commit point. If the all-or-nothing action happens to have exactly one output value, then writing that value into cell storage can be the mechanism of committing, and there is no problem. But if the result is supposed to consist of several output values, all of which should be exposed simultaneously, it is harder to see how to construct the all-or-nothing action. Once the first output value is stored, the computation of the remaining outputs has to be successful; there is no going back. If the system fails and we have not been careful, a later thread may see some old and some new values.

These limitations of cell storage did not plague the shopkeepers of Padua, who in the 14th century invented double-entry bookkeeping. Their storage medium was leaves of paper in bound books and they made new entries with quill pens. They never erased or even crossed out entries that were in error; when they made a mistake they made another entry that reversed the mistake, thus leaving a complete history of their actions, errors, and corrections in the book. It wasn’t until the 1950’s, when programmers began to automate bookkeeping systems, that the notion of overwriting data emerged. Up until that time, if a bookkeeper collapsed and died while making an entry, it was always possible for someone else to seamlessly take over the books. This observation about the robustness of paper systems suggests that there is a form of the golden rule of atomicity that might allow one to be systematic: never erase anything.

Examining the shadow copy technique used by the text editor provides a second useful idea. The essence of the mechanism that allows a text editor to make several changes to a file, yet not reveal any of the changes until it is ready, is this: the only way another prospective reader of a file can reach it is by name. Until commit time the editor works on a copy of the file that is either not yet named or has a unique name not known outside the thread, so the modified copy is effectively invisible. Renaming the new version is the step that makes the entire set of updates simultaneously visible to later readers.

These two observations suggest that all-or-nothing actions would be better served by a model of storage that behaves differently from cell storage: instead of a model in which a store operation overwrites old data, we instead create a new, tentative version of the data, such that the tentative version remains invisible to any reader outside this all-or-nothing action until the action commits. We can provide such semantics, even though we start with traditional cell memory, by interposing a layer between the cell storage and the program that reads and writes data. This layer implements what is known as journal storage.
The basic idea of journal storage is straightforward: we associate with every named variable not a single cell, but a list of cells in non-volatile storage; the values in the list represent the history of the variable. Figure 9.10 illustrates. Whenever any action


FIGURE 9.10 Version history of a variable in journal storage. [Diagram: variable A is shown as a list of values making up a history of earlier versions, a current version, and a tentative next version.]

proposes to write a new value into the variable, the journal storage manager appends the prospective new value to the end of the list. Clearly this approach, being history-preserving, offers some hope of being helpful because if an all-or-nothing action aborts, one can imagine a systematic way to locate and discard all of the new versions it wrote. Moreover, we can tell the journal storage manager to expect to receive tentative values, but to ignore them unless the all-or-nothing action that created them commits. The basic mechanism to accomplish such an expectation is quite simple; the journal storage manager should make a note, next to each new version, of the identity of the all-or-nothing action that created it. Then, at any later time, it can discover the status of the tentative version by inquiring whether or not the all-or-nothing action ever committed.

Figure 9.11 illustrates the overall structure of such a journal storage system, implemented as a layer that hides a cell storage system. (To reduce clutter, this journal storage system omits calls to create new and delete old variables.) In this particular model, we assign to the journal storage manager most of the job of providing tools for programming all-or-nothing actions. Thus the implementer of a prospective all-or-nothing action should begin that action by invoking the journal storage manager entry NEW_ACTION, and later complete the action by invoking either COMMIT or ABORT. If, in addition, actions perform all reads and writes of data by invoking the journal storage manager’s READ_CURRENT_VALUE and WRITE_NEW_VALUE entries, our hope is that the result will automatically be all-or-nothing with no further concern of the implementer.

How could this automatic all-or-nothing atomicity work? The first step is that the journal storage manager, when called at NEW_ACTION, should assign a nonce identifier to the prospective all-or-nothing action, and create, in non-volatile cell storage, a record of this new identifier and the state of the new all-or-nothing action. This record is called an outcome record; it begins its existence in the state PENDING; depending on the outcome it should eventually move to one of the states COMMITTED or ABORTED, as suggested by Figure 9.12. No other state transitions are possible, except to discard the outcome record once


FIGURE 9.11 Interface to and internal organization of an all-or-nothing storage system based on version histories and journal storage. [Diagram: the journal storage manager offers NEW_ACTION, READ_CURRENT_VALUE, WRITE_NEW_VALUE, COMMIT, and ABORT to its clients; internally it keeps catalogs, versions, and outcome records in a cell storage system that provides READ, WRITE, ALLOCATE, and DEALLOCATE.]

FIGURE 9.12 The allowed state transitions of an outcome record. [Diagram: when a new all-or-nothing action is created, its outcome record goes from non-existent to pending; if the action commits, the record moves to committed, and if it aborts, to aborted; once the outcome record state is no longer of any interest, the record may be discarded.]


1  procedure NEW_ACTION ()
2      id ← NEW_OUTCOME_RECORD ()
3      id.outcome_record.state ← PENDING
4      return id

5  procedure COMMIT (reference id)
6      id.outcome_record.state ← COMMITTED

7  procedure ABORT (reference id)
8      id.outcome_record.state ← ABORTED

FIGURE 9.13 The procedures NEW_ACTION, COMMIT, and ABORT.

there is no further interest in its state. Figure 9.13 illustrates implementations of the three procedures NEW_ACTION, COMMIT, and ABORT.

When an all-or-nothing action calls the journal storage manager to write a new version of some data object, that action supplies the identifier of the data object, a tentative new value for the new version, and the identifier of the all-or-nothing action. The journal storage manager calls on the lower-level storage management system to allocate in non-volatile cell storage enough space to contain the new version; it places in the newly allocated cell storage the new data value and the identifier of the all-or-nothing action. Thus the journal storage manager creates a version history as illustrated in Figure 9.14. Now,

FIGURE 9.14 Portion of a version history, with outcome records. Some thread has recently called WRITE_NEW_VALUE specifying data_id = A, new_value = 75, and client_id = 1794. A caller to READ_CURRENT_VALUE will read the value 24 for A.


1  procedure READ_CURRENT_VALUE (data_id, caller_id)
2      starting at end of data_id repeat until beginning
3          v ← previous version of data_id           // Get next older version
4          a ← v.action_id                           // Identify the action a that created it
5          s ← a.outcome_record.state                // Check action a’s outcome record
6          if s = COMMITTED then
7              return v.value
8          else skip v                               // Continue backward search
9      signal (“Tried to read an uninitialized variable!”)

10 procedure WRITE_NEW_VALUE (reference data_id, new_value, caller_id)
11     if caller_id.outcome_record.state = PENDING
12         append new version v to data_id
13         v.value ← new_value
14         v.action_id ← caller_id
15     else signal (“Tried to write outside of an all-or-nothing action!”)

FIGURE 9.15 Algorithms followed by READ_CURRENT_VALUE and WRITE_NEW_VALUE. The parameter caller_id is the action identifier returned by NEW_ACTION. In this version, only WRITE_NEW_VALUE uses caller_id. Later, READ_CURRENT_VALUE will also use it.

when someone proposes to read a data value by calling READ_CURRENT_VALUE, the journal storage manager can review the version history, starting with the latest version and return the value in the most recent committed version. By inspecting the outcome records, the journal storage manager can ignore those versions that were written by all-or-nothing actions that aborted or that never committed. The procedures READ_CURRENT_VALUE and WRITE_NEW_VALUE thus follow the algorithms of Figure 9.15.

The important property of this pair of algorithms is that if the current all-or-nothing action is somehow derailed before it reaches its call to COMMIT, the new version it has created is invisible to invokers of READ_CURRENT_VALUE. (They are also invisible to the all-or-nothing action that wrote them. Since it is sometimes convenient for an all-or-nothing action to read something that it has tentatively written, a different procedure, named READ_MY_PENDING_VALUE, identical to READ_CURRENT_VALUE except for a different test on line 6, could do that.) Moreover if, for example, all-or-nothing action 99 crashes while partway through changing the values of nineteen different data objects, all nineteen changes would be invisible to later invokers of READ_CURRENT_VALUE. If all-or-nothing action 99 does reach its call to COMMIT, that call commits the entire set of changes simultaneously and atomically, at the instant that it changes the outcome record from PENDING to COMMITTED. Pending versions would also be invisible to any concurrent action that reads data with READ_CURRENT_VALUE, a feature that will prove useful when we introduce concurrent threads and discuss before-or-after atomicity, but for the moment our only


1  procedure TRANSFER (reference debit_account, reference credit_account, amount)
2      my_id ← NEW_ACTION ()
3      xvalue ← READ_CURRENT_VALUE (debit_account, my_id)
4      xvalue ← xvalue - amount
5      WRITE_NEW_VALUE (debit_account, xvalue, my_id)
6      yvalue ← READ_CURRENT_VALUE (credit_account, my_id)
7      yvalue ← yvalue + amount
8      WRITE_NEW_VALUE (credit_account, yvalue, my_id)
9      if xvalue > 0 then
10         COMMIT (my_id)
11     else
12         ABORT (my_id)
13         signal (“Negative transfers are not allowed.”)

FIGURE 9.16 An all-or-nothing TRANSFER procedure, based on journal storage. (This program assumes that it is the only running thread. Making the transfer procedure a before-or-after action because other threads might be updating the same accounts concurrently requires additional mechanism that is discussed later in this chapter.)

concern is that a system crash may prevent the current thread from committing or aborting, and we want to make sure that a later thread doesn’t encounter partial results. As in the case of the calendar manager of Section 9.2.1, we assume that when a crash occurs, any all-or-nothing action that was in progress at the time was being supervised by some outside agent who realizes that a crash has occurred, uses READ_CURRENT_VALUE to find out what happened and if necessary initiates a replacement all-or-nothing action.

Figure 9.16 shows the TRANSFER procedure of Section 9.1.5 reprogrammed as an all-or-nothing (but not, for the moment, before-or-after) action using the version history mechanism. This implementation of TRANSFER is more elaborate than the earlier one—it tests to see whether or not the account to be debited has enough funds to cover the transfer and if not it aborts the action. The order of steps in the transfer procedure is remarkably unconstrained by any consideration other than calculating the correct answer. The reading of credit_account, for example, could casually be moved to any point between NEW_ACTION and the place where yvalue is recalculated. We conclude that the journal storage system has made the pre-commit discipline much less onerous than we might have expected.

There is still one loose end: it is essential that updates to a version history and changes to an outcome record be all-or-nothing. That is, if the system fails while the thread is inside WRITE_NEW_VALUE, adjusting structures to append a new version, or inside COMMIT while updating the outcome record, the cell being written must not be muddled; it must either stay as it was before the crash or change to the intended new value. The solution is to design all modifications to the internal structures of journal storage so that they can


be done by overwriting a single cell. For example, suppose that the name of a variable that has a version history refers to a cell that contains the address of the newest version, and that versions are linked from the newest version backwards, by address references. Adding a version consists of allocating space for a new version, reading the current address of the prior version, writing that address in the backward link field of the new version, and then updating the descriptor with the address of the new version. That last update can be done by overwriting a single cell. Similarly, updating an outcome record to change it from PENDING to COMMITTED can be done by overwriting a single cell.

As a first bootstrapping step, we have reduced the general problem of creating all-or-nothing actions to the specific problem of doing an all-or-nothing overwrite of one cell. As the remaining bootstrapping step, recall that we already know two ways to do a single-cell all-or-nothing overwrite: apply the ALL_OR_NOTHING_PUT procedure of Figure 9.7. (If there is concurrency, updates to the internal structures of the version history also need before-or-after atomicity. Section 9.4 will explore methods of providing it.)
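The bookkeeping just described is compact enough to simulate directly. The sketch below (ours, not the text's) is a single-threaded, in-memory Python rendering of the interface of Figure 9.11: outcome records, per-variable version histories, and a TRANSFER built from NEW_ACTION, READ_CURRENT_VALUE, WRITE_NEW_VALUE, COMMIT, and ABORT. It ignores the non-volatile storage and single-cell-overwrite concerns discussed above; its only purpose is to show why an action that aborts, or never reaches COMMIT, leaves nothing visible to later readers.

    # A single-threaded, in-memory sketch (not from the text) of journal storage
    # with version histories and outcome records, after Figures 9.13 and 9.15.
    PENDING, COMMITTED, ABORTED = "pending", "committed", "aborted"

    outcome_records = {}       # action id -> state
    version_histories = {}     # variable name -> list of (value, action_id), oldest first
    next_action_id = 0

    def new_action():
        global next_action_id
        next_action_id += 1
        outcome_records[next_action_id] = PENDING
        return next_action_id

    def commit(action_id):
        outcome_records[action_id] = COMMITTED

    def abort(action_id):
        outcome_records[action_id] = ABORTED

    def write_new_value(name, new_value, action_id):
        assert outcome_records[action_id] == PENDING, "write outside an all-or-nothing action"
        version_histories.setdefault(name, []).append((new_value, action_id))

    def read_current_value(name):
        # Scan backward for the most recent version whose creator committed.
        for value, action_id in reversed(version_histories.get(name, [])):
            if outcome_records[action_id] == COMMITTED:
                return value
        raise KeyError("tried to read an uninitialized variable: " + name)

    def transfer(debit_account, credit_account, amount):
        my_id = new_action()
        xvalue = read_current_value(debit_account) - amount
        write_new_value(debit_account, xvalue, my_id)
        yvalue = read_current_value(credit_account) + amount
        write_new_value(credit_account, yvalue, my_id)
        if xvalue > 0:
            commit(my_id)
        else:
            abort(my_id)

    # Example: a pending or aborted transfer is invisible to later readers.
    setup = new_action()
    write_new_value("A", 100, setup)
    write_new_value("B", 50, setup)
    commit(setup)
    transfer("A", "B", 30)         # commits
    assert read_current_value("A") == 70 and read_current_value("B") == 80
    transfer("A", "B", 200)        # would overdraw, so it aborts
    assert read_current_value("A") == 70 and read_current_value("B") == 80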

9.2.4 How Version Histories are Used

The careful reader will note two possibly puzzling things about the version history scheme just described. Both will become less puzzling when we discuss concurrency and before-or-after atomicity in Section 9.4 of this chapter:

1. Because READ_CURRENT_VALUE skips over any version belonging to another all-or-nothing action whose OUTCOME record is not COMMITTED, it isn’t really necessary to change the OUTCOME record when an all-or-nothing action aborts; the record could just remain in the PENDING state indefinitely. However, when we introduce concurrency, we will find that a pending action may prevent other threads from reading variables for which the pending action created a new version, so it will become important to distinguish aborted actions from those that really are still pending.

2. As we have defined READ_CURRENT_VALUE, versions older than the most recent committed version are inaccessible and they might just as well be discarded. Discarding could be accomplished either as an additional step in the journal storage manager, or as part of a separate garbage collection activity. Alternatively, those older versions may be useful as an historical record, known as an archive, with the addition of timestamps on commit records and procedures that can locate and return old values created at specified times in the past. For this reason, a version history system is sometimes called a temporal database or is said to provide time domain addressing. The banking industry abounds in requirements that make use of history information, such as reporting a consistent sum of balances in all bank accounts, paying interest on the fifteenth on balances as of the first of the month, or calculating the average balance last month. Another reason for not discarding old versions immediately will emerge when we discuss concurrency and


before-or-after atomicity: concurrent threads may, for correctness, need to read old versions even after new versions have been created and committed.

Direct implementation of a version history raises concerns about performance: rather than simply reading a named storage cell, one must instead make at least one indirect reference through a descriptor that locates the storage cell containing the current version. If the cell storage device is on a magnetic disk, this extra reference is a potential bottleneck, though it can be alleviated with a cache. A bottleneck that is harder to alleviate occurs on updates. Whenever an application writes a new value, the journal storage layer must allocate space in unused cell storage, write the new version, and update the version history descriptor so that future readers can find the new version. Several disk writes are likely to be required. These extra disk writes may be hidden inside the journal storage layer and with added cleverness may be delayed until commit and batched, but they still have a cost. When storage access delays are the performance bottleneck, extra accesses slow things down.

In consequence, version histories are used primarily in low-performance applications. One common example is found in revision management systems used to coordinate teams doing program development. A programmer “checks out” a group of files, makes changes, and then “checks in” the result. The check-out and check-in operations are all-or-nothing and check-in makes each changed file the latest version in a complete history of that file, in case a problem is discovered later. (The check-in operation also verifies that no one else changed the files while they were checked out, which catches some, but not all, coordination errors.) A second example is that some interactive applications such as word processors or image editing systems provide a “deep undo” feature, which allows a user who decides that his or her recent editing is misguided to step backwards to reach an earlier, satisfactory state. A third example appears in file systems that automatically create a new version every time any application opens an existing file for writing; when the application closes the file, the file system tags a number suffix to the name of the previous version of the file and moves the original name to the new version. These interfaces employ version histories because users find them easy to understand and they provide all-or-nothing atomicity in the face of both system failures and user mistakes. Most such applications also provide an archive that is useful for reference and that allows going back to a known good version.

Applications requiring high performance are a different story. They, too, require all-or-nothing atomicity, but they usually achieve it by applying a specialized technique called a log. Logs are our next topic.

9.3 All-or-Nothing Atomicity II: Pragmatics

Database management applications such as airline reservation systems or banking systems usually require high performance as well as all-or-nothing atomicity, so their designers use streamlined atomicity techniques. The foremost of these techniques sharply separates the reading and writing of data from the failure recovery mechanism.


The idea is to minimize the number of storage accesses required for the most common activities (application reads and updates). The trade-off is that the number of storage accesses for rarely-performed activities (failure recovery, which one hopes is actually exercised only occasionally, if at all) may not be minimal. The technique is called logging. Logging is also used for purposes other than atomicity, several of which Sidebar 9.4 describes.

9.3.1 Atomicity Logs

The basic idea behind atomicity logging is to combine the all-or-nothing atomicity of journal storage with the speed of cell storage, by having the application twice record every change to data. The application first logs the change in journal storage, and then it installs the change in cell storage.* One might think that writing data twice must be more expensive than writing it just once into a version history, but the separation permits specialized optimizations that can make the overall system faster.

The first recording, to journal storage, is optimized for fast writing by creating a single, interleaved version history of all variables, known as a log. The information describing each data update forms a record that the application appends to the end of the log. Since there is only one log, a single pointer to the end of the log is all that is needed to find the place to append the record of a change of any variable in the system. If the log medium is magnetic disk, and the disk is used only for logging, and the disk storage management system allocates sectors contiguously, the disk seek arm will need to move only when a disk cylinder is full, thus eliminating most seek delays. As we will see, recovery does involve scanning the log, which is expensive, but recovery should be a rare event. Using a log is thus an example of following the hint to optimize for the common case.

The second recording, to cell storage, is optimized to make reading fast: the application installs by simply overwriting the previous cell storage record of that variable. The record kept in cell storage can be thought of as a cache that, for reading, bypasses the effort that would otherwise be required to locate the latest version in the log. In addition, by not reading from the log the logging disk’s seek arm can remain in position, ready for the next update. The two steps, LOG and INSTALL, become a different implementation of the WRITE_NEW_VALUE interface of Figure 9.11. Figure 9.17 illustrates this two-step implementation.

The underlying idea is that the log is the authoritative record of the outcome of the action. Cell storage is merely a reference copy; if it is lost, it can be reconstructed from the log. The purpose of installing a copy in cell storage is to make both logging and reading faster. By recording data twice, we obtain high performance in writing, high performance in reading, and all-or-nothing atomicity, all at the same time.

There are three common logging configurations, shown in Figure 9.18. In each of these three configurations, the log resides in non-volatile storage.

* A hardware architect would say “…it graduates the change to cell storage”. This text, somewhat arbitrarily, chooses to use the database management term “install”.


Sidebar 9.4: The many uses of logs

A log is an object whose primary usage method is to append a new record. Log implementations normally provide procedures to read entries from oldest to newest or in reverse order, but there is usually not any procedure for modifying previous entries. Logs are used for several quite distinct purposes, and this range of purposes sometimes gets confused in real-world designs and implementations. Here are some of the most common uses for logs:

1. Atomicity log. If one logs the component actions of an all-or-nothing action, together with sufficient before and after information, then a crash recovery procedure can undo (and thus roll back the effects of) all-or-nothing actions that didn’t get a chance to complete, or finish all-or-nothing actions that committed but that didn’t get a chance to record all of their effects.

2. Archive log. If the log is kept indefinitely, it becomes a place where old values of data and the sequence of actions taken by the system or its applications can be kept for review. There are many uses for archive information: watching for failure patterns, reviewing the actions of the system preceding and during a security breach, recovery from application-layer mistakes (e.g., a clerk incorrectly deleted an account), historical study, fraud control, and compliance with record-keeping requirements.

3. Performance log. Most mechanical storage media have much higher performance for sequential access than for random access. Since logs are written sequentially, they are ideally suited to such storage media. It is possible to take advantage of this match to the physical properties of the media by structuring data to be written in the form of a log. When combined with a cache that eliminates most disk reads, a performance log can provide a significant speed-up. As will be seen in the accompanying text, an atomicity log is usually also a performance log.

4. Durability log. If the log is stored on a non-volatile medium—say magnetic tape—that fails in ways and at times that are independent from the failures of the cell storage medium—which might be magnetic disk—then the copies of data in the log are replicas that can be used as backup in case of damage to the copies of the data in cell storage. This kind of log helps implement durable storage. Any log that uses a non-volatile medium, whether intended for atomicity, archiving or performance, typically also helps support durability.

It is essential to have these various purposes—all-or-nothing atomicity, archive, performance, and durable storage—distinct in one’s mind when examining or designing a log implementation because they lead to different priorities among design trade-offs. When archive is the goal, low cost of the storage medium is usually more important than quick access because archive logs are large but, in practice, infrequently read. When durable storage is the goal, it may be important to use storage media with different physical properties, so that failure modes will be as independent as possible. When all-or-nothing atomicity or performance is the purpose, minimizing mechanical movement of the storage device becomes a high priority. Because of the competing objectives of different kinds of logs, as a general rule, it is usually a wise move to implement separate, dedicated logs for different functions.


FIGURE 9.17 Logging for all-or-nothing atomicity. The application performs WRITE_NEW_VALUE by first appending a record of the new value to the log in journal storage, and then installing the new value in cell storage by overwriting. The application performs READ_CURRENT_VALUE by reading just from cell storage.

FIGURE 9.18 Three common logging configurations. Arrows show data flow as the application reads, logs, and installs data. [Diagram: in all three configurations (in-memory database, ordinary database, and high-performance database) the log is in non-volatile storage; cell storage is in volatile storage for the in-memory database, in non-volatile storage for the ordinary database, and in non-volatile storage fronted by a volatile cache for the high-performance database.]


For the in-memory database, cell storage resides entirely in some volatile storage medium. In the second common configuration, cell storage resides in non-volatile storage along with the log. Finally, high-performance database management systems usually blend the two preceding configurations by implementing a cache for cell storage in a volatile medium, and a potentially independent multilevel memory management algorithm moves data between the cache and non-volatile cell storage.

Recording everything twice adds one significant complication to all-or-nothing atomicity because the system can crash between the time a change is logged and the time it is installed. To maintain all-or-nothing atomicity, logging systems follow a protocol that has two fundamental requirements. The first requirement is a constraint on the order of logging and installing. The second requirement is to run an explicit recovery procedure after every crash. (We saw a preview of the strategy of using a recovery procedure in Figure 9.7, which used a recovery procedure named CHECK_AND_REPAIR.)

9.3.2 Logging Protocols

There are several kinds of atomicity logs that vary in the order in which things are done and in the details of information logged. However, all of them involve the ordering constraint implied by the numbering of the arrows in Figure 9.17. The constraint is a version of the golden rule of atomicity (never modify the only copy), known as the write-ahead-log (WAL) protocol:

Write-ahead-log protocol

    Log the update before installing it.

The reason is that logging appends but installing overwrites. If an application violates this protocol by installing an update before logging it and then for some reason must abort, or the system crashes, there is no systematic way to discover the installed update and, if necessary, reverse it. The write-ahead-log protocol ensures that if a crash occurs, a recovery procedure can, by consulting the log, systematically find all completed and intended changes to cell storage and either restore those records to old values or set them to new values, as appropriate to the circumstance.

The basic element of an atomicity log is the log record. Before an action that is to be all-or-nothing installs a data value, it appends to the end of the log a new record of type CHANGE containing, in the general case, three pieces of information (we will later see special cases that allow omitting item 2 or item 3):

1. The identity of the all-or-nothing action that is performing the update.

2. A component action that, if performed, installs the intended value in cell storage. This component action is a kind of an insurance policy in case the system crashes. If the all-or-nothing action commits, but then the system crashes before the action has a chance to perform the install, the recovery procedure can perform the install


on behalf of the action. Some systems call this component action the do action, others the redo action. For mnemonic compatibility with item 3, this text calls it the redo action.

3. A second component action that, if performed, reverses the effect on cell storage of the planned install. This component action is known as the undo action because if, after doing the install, the all-or-nothing action aborts or the system crashes, it may be necessary for the recovery procedure to reverse the effect of (undo) the install.

An application appends a log record by invoking the lower-layer procedure LOG, which itself must be atomic. The LOG procedure is another example of bootstrapping: Starting with, for example, the ALL_OR_NOTHING_PUT described earlier in this chapter, a log designer creates a generic LOG procedure, and using the LOG procedure an application programmer then can implement all-or-nothing atomicity for any properly designed composite action.

As we saw in Figure 9.17, LOG and INSTALL are the logging implementation of the WRITE_NEW_VALUE part of the interface of Figure 9.11, and READ_CURRENT_VALUE is simply a READ from cell storage. We also need a logging implementation of the remaining parts of the Figure 9.11 interface. The way to implement NEW_ACTION is to log a BEGIN record that contains just the new all-or-nothing action’s identity. As the all-or-nothing action proceeds through its pre-commit phase, it logs CHANGE records. To implement COMMIT or ABORT, the all-or-nothing action logs an OUTCOME record that becomes the authoritative indication of the outcome of the all-or-nothing action. The instant that the all-or-nothing action logs the OUTCOME record is its commit point. As an example, Figure 9.19 shows our by now familiar TRANSFER action implemented with logging.

Because the log is the authoritative record of the action, the all-or-nothing action can perform installs to cell storage at any convenient time that is consistent with the write-ahead-log protocol, either before or after logging the OUTCOME record. The final step of an action is to log an END record, again containing just the action’s identity, to show that the action has completed all of its installs. (Logging all four kinds of activity—BEGIN, CHANGE, OUTCOME, and END—is more general than sometimes necessary. As we will see, some logging systems can combine, e.g., OUTCOME and END, or BEGIN with the first CHANGE.) Figure 9.20 shows examples of three log records, two typical CHANGE records of an all-or-nothing TRANSFER action, interleaved with the OUTCOME record of some other, perhaps completely unrelated, all-or-nothing action.

One consequence of installing results in cell storage is that for an all-or-nothing action to abort it may have to do some clean-up work. Moreover, if the system involuntarily terminates a thread that is in the middle of an all-or-nothing action (because, for example, the thread has gotten into a deadlock or an endless loop) some entity other than the hapless thread must clean things up. If this clean-up step were omitted, the all-or-nothing action could remain pending indefinitely. The system cannot simply ignore indefinitely pending actions because all-or-nothing actions initiated by other threads are likely to want to use the data that the terminated action changed. (This is actually a

1  procedure TRANSFER (debit_account, credit_account, amount)
2     my_id ← LOG (BEGIN_TRANSACTION)
3     dbvalue.old ← GET (debit_account)
4     dbvalue.new ← dbvalue.old - amount
5     crvalue.old ← GET (credit_account, my_id)
6     crvalue.new ← crvalue.old + amount
7     LOG (CHANGE, my_id,
8        “PUT (debit_account, dbvalue.new)”,    // redo action
9        “PUT (debit_account, dbvalue.old)” )   // undo action
10    LOG (CHANGE, my_id,
11       “PUT (credit_account, crvalue.new)”    // redo action
12       “PUT (credit_account, crvalue.old)”)   // undo action
13    PUT (debit_account, dbvalue.new)          // install
14    PUT (credit_account, crvalue.new)         // install
15    if dbvalue.new > 0 then
16       LOG (OUTCOME, COMMIT, my_id)
17    else
18       LOG (OUTCOME, ABORT, my_id)
19       signal(“Action not allowed. Would make debit account negative.”)
20    LOG (END_TRANSACTION, my_id)

FIGURE 9.19 An all-or-nothing TRANSFER procedure, implemented with logging.

before-or-after atomicity concern, one of the places where all-or-nothing atomicity and before-or-after atomicity intersect.) If the action being aborted did any installs, those installs are still in cell storage, so simply appending to the log an OUTCOME record saying that the action aborted is not enough to make it appear to later observers that the all-or-nothing action did nothing. The solution to this problem is to execute a generic ABORT procedure.



older log records …

   [type: CHANGE    action_id: 9979    redo_action: PUT(debit_account, $90)    undo_action: PUT(debit_account, $120)]
   [type: OUTCOME   action_id: 9974    status: COMMITTED]
   [type: CHANGE    action_id: 9979    redo_action: PUT(credit_account, $40)   undo_action: PUT(credit_account, $10)]

… newer log records

FIGURE 9.20 An example of a section of an atomicity log, showing two CHANGE records for a TRANSFER action that has action_id 9979 and the OUTCOME record of a different all-or-nothing action.

The ABORT procedure restores to their old values all cell storage variables that the all-or-nothing action installed. The ABORT procedure simply scans the log backwards looking for log entries created by this all-or-nothing action; for each CHANGE record it finds, it performs the logged undo_action, thus restoring the old values in cell storage. The backward search terminates when the ABORT procedure finds that all-or-nothing action’s BEGIN record. Figure 9.21 illustrates. The extra work required to undo cell storage installs when an all-or-nothing action aborts is another example of optimizing for the common case: one expects that most all-or-nothing actions will commit, and that aborted actions should be relatively rare. The extra effort of an occasional roll back of cell storage values will (one hopes) be more than repaid by the more frequent gains in performance on updates, reads, and commits.
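Before turning to recovery, it may help to see the logging machinery of this section in executable form. The following sketch is written in Python and is purely illustrative: the names Record, journal, cells, and change_and_install are invented for this example and are not part of the book’s pseudocode. It shows a CHANGE record that carries its redo and undo actions as blind writes, and the write-ahead ordering in which the record reaches the journal before the install reaches cell storage.

   from dataclasses import dataclass
   from typing import Optional, Tuple

   @dataclass
   class Record:
       kind: str                                 # BEGIN, CHANGE, OUTCOME, or END
       action_id: int
       redo: Optional[Tuple[str, int]] = None    # (variable, new value): a blind write
       undo: Optional[Tuple[str, int]] = None    # (variable, old value): a blind write
       status: Optional[str] = None              # COMMITTED or ABORTED, on OUTCOME records

   journal = []       # append-only log (journal storage)
   cells = {}         # cell storage: variable name -> current value

   def log(record):
       journal.append(record)                    # assumed atomic, as the text requires of LOG

   def change_and_install(action_id, name, new_value):
       old_value = cells.get(name, 0)
       # Write-ahead: the CHANGE record reaches the journal first ...
       log(Record("CHANGE", action_id, redo=(name, new_value), undo=(name, old_value)))
       # ... and only then is the new value installed in cell storage.
       cells[name] = new_value

   change_and_install(action_id=7, name="A", new_value=10)   # log precedes install

An ABORT for this hypothetical action would scan the journal backwards and apply the undo pairs, exactly as the generic ABORT procedure of Figure 9.21 does.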

9.3.3 Recovery Procedures

The write-ahead log protocol is the first of the two required protocol elements of a logging system. The second required protocol element is that, following every system crash, the system must run a recovery procedure before it allows ordinary applications to use the data. The details of the recovery procedure depend on the particular configuration of the journal and cell storage with respect to volatile and non-volatile memory.

Consider first recovery for the in-memory database of Figure 9.18. Since a system crash may corrupt anything that is in volatile memory, including both the state of cell storage and the state of any currently running threads, restarting a crashed system usually begins by resetting all volatile memory. The effect of this reset is to abandon both the cell

1  procedure ABORT (action_id)
2     starting at end of log repeat until beginning
3        log_record ← previous record of log
4        if log_record.id = action_id then
5           if (log_record.type = OUTCOME)
6              then signal (“Can’t abort an already completed action.”)
7           if (log_record.type = CHANGE)
8              then perform undo_action of log_record
9           if (log_record.type = BEGIN)
10             then break repeat
11    LOG (action_id, OUTCOME, ABORTED)    // Block future undos.
12    LOG (action_id, END)

FIGURE 9.21 Generic ABORT procedure for a logging system. The argument action_id identifies the action to be aborted. An atomic action calls this procedure if it decides to abort. In addition, the operating system may call this procedure if it decides to terminate the action, for example to break a deadlock or because the action is running too long. The LOG procedure must itself be atomic.

1  procedure RECOVER ()    // Recovery procedure for a volatile, in-memory database.
2     winners ← NULL
3     starting at end of log repeat until beginning
4        log_record ← previous record of log
5        if (log_record.type = OUTCOME)
6           then winners ← winners + log_record    // Set addition.
7     starting at beginning of log repeat until end
8        log_record ← next record of log
9        if (log_record.type = CHANGE)
10          and (outcome_record ← find (log_record.action_id) in winners)
11          and (outcome_record.status = COMMITTED)
12          then perform log_record.redo_action

FIGURE 9.22 An idempotent redo-only recovery procedure for an in-memory database. Because RECOVER writes only to volatile storage, if a crash occurs while it is running it is safe to run it again.

storage version of the database and any all-or-nothing actions that were in progress at the time of the crash. On the other hand, the log, since it resides on non-volatile journal stor­ age, is unaffected by the crash and should still be intact. The simplest recovery procedure performs two passes through the log. On the first pass, it scans the log backward from the last record, so the first evidence it will encounter of each all-or-nothing action is the last record that the all-or-nothing action logged. A backward log scan is sometimes called a LIFO (for last-in, first-out) log review. As the recovery procedure scans backward, it collects in a set the identity and completion status of every all-or-nothing action that logged an OUTCOME record before the crash. These actions, whether committed or aborted, are known as winners. When the backward scan is complete the set of winners is also complete, and the recovery procedure begins a forward scan of the log. The reason the forward scan is needed is that restarting after the crash completely reset the cell storage. During the for­ ward scan the recovery procedure performs, in the order found in the log, all of the REDO actions of every winner whose OUTCOME record says that it COMMITTED. Those REDOs reinstall all committed values in cell storage, so at the end of this scan, the recovery procedure has restored cell storage to a desirable state. This state is as if every all-or-nothing action that committed before the crash had run to completion, while every all-or-nothing action that aborted or that was still pending at crash time had never existed. The database sys­ tem can now open for regular business. Figure 9.22 illustrates. This recovery procedure emphasizes the point that a log can be viewed as an author­ itative version of the entire database, sufficient to completely reconstruct the reference copy in cell storage. There exist cases for which this recovery procedure may be overkill, when the dura­ bility requirement of the data is minimal. For example, the all-or-nothing action may

have been to make a group of changes to soft state in volatile storage. If the soft state is completely lost in a crash, there would be no need to redo installs because the definition of soft state is that the application is prepared to construct new soft state following a crash. Put another way, given the options of “all” or “nothing,” when the data is all soft state “nothing” is always an appropriate outcome after a crash. A critical design property of the recovery procedure is that, if there should be another system crash during recovery, it must still be possible to recover. Moreover, it must be possible for any number of crash-restart cycles to occur without compromising the cor­ rectness of the ultimate result. The method is to design the recovery procedure to be idempotent. That is, design it so that if it is interrupted and restarted from the beginning it will produce exactly the same result as if it had run to completion to begin with. With the in-memory database configuration, this goal is an easy one: just make sure that the recovery procedure modifies only volatile storage. Then, if a crash occurs during recov­ ery, the loss of volatile storage automatically restores the state of the system to the way it was when the recovery started, and it is safe to run it again from the beginning. If the recovery procedure ever finishes, the state of the cell storage copy of the database will be correct, no matter how many interruptions and restarts intervened. The ABORT procedure similarly needs to be idempotent because if an all-or-nothing action decides to abort and, while running ABORT, some timer expires, the system may decide to terminate and call ABORT for that same all-or-nothing action. The version of abort in Figure 9.21 will satisfy this requirement if the individual undo actions are them­ selves idempotent.
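To make the idempotence argument concrete, here is a small Python sketch (an illustration only; the dictionary-based record format and the function name recover are invented for this example) of the redo-only recovery of Figure 9.22. Because every redo action is a blind write into freshly reset cell storage, running the procedure a second time produces exactly the same result.

   log = [
       {"kind": "BEGIN",   "action": 1},
       {"kind": "CHANGE",  "action": 1, "redo": ("A", 10), "undo": ("A", 0)},
       {"kind": "OUTCOME", "action": 1, "status": "COMMITTED"},
       {"kind": "BEGIN",   "action": 2},
       {"kind": "CHANGE",  "action": 2, "redo": ("B", 7), "undo": ("B", 0)},
       # action 2 never logged an OUTCOME record: it was still pending at the crash
   ]

   def recover(log):
       cells = {}                               # volatile cell storage, reset by the crash
       winners = {}                             # action id -> OUTCOME status
       for record in reversed(log):             # backward (LIFO) scan
           if record["kind"] == "OUTCOME":
               winners[record["action"]] = record["status"]
       for record in log:                       # forward scan
           if (record["kind"] == "CHANGE"
                   and winners.get(record["action"]) == "COMMITTED"):
               name, value = record["redo"]
               cells[name] = value              # blind write: safe to repeat
       return cells

   print(recover(log))      # {'A': 10}
   print(recover(log))      # the same: rerunning recovery changes nothing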

9.3.4 Other Logging Configurations: Non-Volatile Cell Storage

Placing cell storage in volatile memory is a sweeping simplification that works well for small and medium-sized databases, but some databases are too large for that to be practical, so the designer finds it necessary to place cell storage on some cheaper, non-volatile storage medium such as magnetic disk, as in the second configuration of Figure 9.18. But with a non-volatile storage medium, installs survive system crashes, so the simple recovery procedure used with the in-memory database would have two shortcomings:

1. If, at the time of the crash, there were some pending all-or-nothing actions that had installed changes, those changes will survive the system crash. The recovery procedure must reverse the effects of those changes, just as if those actions had aborted.

2. That recovery procedure reinstalls the entire database, even though in this case much of it is probably intact in non-volatile storage. If the database is large enough that it requires non-volatile storage to contain it, the cost of unnecessarily reinstalling it in its entirety at every recovery is likely to be unacceptable.

In addition, reads and writes to non-volatile cell storage are likely to be slow, so it is nearly always the case that the designer installs a cache in volatile memory, along with a

multilevel memory manager, thus moving to the third configuration of Figure 9.18. But that addition introduces yet another shortcoming:

3. In a multilevel memory system, the order in which data is written from volatile levels to non-volatile levels is generally under control of a multilevel memory manager, which may, for example, be running a least-recently-used algorithm. As a result, at the instant of the crash some things that were thought to have been installed may not yet have migrated to the non-volatile memory.

To postpone consideration of this shortcoming, let us for the moment assume that the multilevel memory manager implements a write-through cache. (Section 9.3.6, below, will return to the case where the cache is not write-through.) With a write-through cache, we can be certain that everything that the application program has installed has been written to non-volatile storage. This assumption temporarily drops the third shortcoming out of our list of concerns and the situation is the same as if we were using the “Ordinary Database” configuration of Figure 9.18 with no cache. But we still have to do something about the first two shortcomings, and we also must make sure that the modified recovery procedure is still idempotent.

To address the first shortcoming, that the database may contain installs from actions that should be undone, we need to modify the recovery procedure of Figure 9.22. As the recovery procedure performs its initial backward scan, rather than looking for winners, it instead collects in a set the identity of those all-or-nothing actions that were still in progress at the time of the crash. The actions in this set are known as losers, and they can include both actions that committed and actions that did not. Losers are easy to identify because the first log record that contains their identity that is encountered in a backward scan will be something other than an END record. To identify the losers, the pseudocode keeps track of which actions logged an END record in an auxiliary list named completeds. When RECOVER comes across a log record belonging to an action that is not in completeds, it adds that action to the set named losers. In addition, as it scans backwards, whenever the recovery procedure encounters a CHANGE record belonging to a loser, it performs the undo action listed in the record. In the course of the LIFO log review, all of the installs performed by losers will thus be rolled back and the state of the cell storage will be as if the all-or-nothing actions of losers had never started. Next, RECOVER performs the forward scan of the log, performing the redo actions of the all-or-nothing actions that committed, as shown in Figure 9.23. Finally, the recovery procedure logs an END record for every all-or-nothing action in the list of losers. This END record transforms the loser into a completed action, thus ensuring that future recoveries will ignore it and not perform its undos again. For future recoveries to ignore aborted losers is not just a performance enhancement, it is essential, to avoid incorrectly undoing updates to those same variables made by future all-or-nothing actions.

As before, the recovery procedure must be idempotent, so that if a crash occurs during recovery the system can just run the recovery procedure again. In addition to the technique used earlier of placing the temporary variables of the recovery procedure in volatile storage, each individual undo action must also be idempotent. For this reason, both redo

1  procedure RECOVER ()    // Recovery procedure for non-volatile cell memory
2     completeds ← NULL
3     losers ← NULL
4     starting at end of log repeat until beginning
5        log_record ← previous record of log
6        if (log_record.type = END)
7           then completeds ← completeds + log_record    // Set addition.
8        if (log_record.action_id is not in completeds) then
9           losers ← losers + log_record    // Add if not already in set.
10          if (log_record.type = CHANGE) then
11             perform log_record.undo_action
12    starting at beginning of log repeat until end
13       log_record ← next record of log
14       if (log_record.type = CHANGE) and
15          (log_record.action_id.status = COMMITTED)
16          then perform log_record.redo_action
17    for each log_record in losers do
18       log (log_record.action_id, END)    // Show action completed.

FIGURE 9.23 An idempotent undo/redo recovery procedure for a system that performs installs to non-volatile cell memory. In this recovery procedure, losers are all-or-nothing actions that were in progress at the time of the crash.

and undo actions are usually expressed as blind writes. A blind write is a simple overwriting of a data value without reference to its previous value. Because a blind write is inherently idempotent, no matter how many times one repeats it, the result is always the same. Thus, if a crash occurs part way through the logging of END records of losers, immediately rerunning the recovery procedure will still leave the database correct. Any losers that now have END records will be treated as completed on the rerun, but that is OK because the previous attempt of the recovery procedure has already undone their installs.

As for the second shortcoming, that the recovery procedure unnecessarily redoes every install, even installs not belonging to losers, we can significantly simplify (and speed up) recovery by analyzing why we have to redo any installs at all. The reason is that, although the WAL protocol requires logging of changes to occur before install, there is no necessary ordering between commit and install. Until a committed action logs its END record, there is no assurance that any particular install of that action has actually happened yet. On the other hand, any committed action that has logged an END record has completed its installs. The conclusion is that the recovery procedure does not need to

1  procedure RECOVER ()    // Recovery procedure for rollback recovery.
2     completeds ← NULL
3     losers ← NULL
4     starting at end of log repeat until beginning    // Perform undo scan.
5        log_record ← previous record of log
6        if (log_record.type = OUTCOME)
7           then completeds ← completeds + log_record    // Set addition.
8        if (log_record.action_id is not in completeds) then
9           losers ← losers + log_record    // New loser.
10          if (log_record.type = CHANGE) then
11             perform log_record.undo_action
12    for each log_record in losers do
13       log (log_record.action_id, OUTCOME, ABORT)    // Block future undos.

FIGURE 9.24 An idempotent undo-only recovery procedure for rollback logging.

redo installs for any committed action that has logged its END record. A useful exercise is to modify the procedure of Figure 9.23 to take advantage of that observation. It would be even better if the recovery procedure never had to redo any installs. We can arrange for that by placing another requirement on the application: it must perform all of its installs before it logs its OUTCOME record. That requirement, together with the write-through cache, ensures that the installs of every completed all-or-nothing action are safely in non-volatile cell storage and there is thus never a need to perform any redo actions. (It also means that there is no need to log an END record.) The result is that the recovery procedure needs only to undo the installs of losers, and it can skip the entire forward scan, leading to the simpler recovery procedure of Figure 9.24. This scheme, because it requires only undos, is sometimes called undo logging or rollback recovery. A property of rollback recovery is that for completed actions, cell storage is just as author­ itative as the log. As a result, one can garbage collect the log, discarding the log records of completed actions. The now much smaller log may then be able to fit in a faster stor­ age medium for which the durability requirement is only that it outlast pending actions. There is an alternative, symmetric constraint used by some logging systems. Rather than requiring that all installs be done before logging the OUTCOME record, one can instead require that all installs be done after recording the OUTCOME record. With this constraint, the set of CHANGE records in the log that belong to that all-or-nothing action become a description of its intentions. If there is a crash before logging an OUTCOME record, we know that no installs have happened, so the recovery never needs to perform any undos. On the other hand, it may have to perform installs for all-or-nothing actions that committed. This scheme is called redo logging or roll-forward recovery. Furthermore, because we are uncertain about which installs actually have taken place, the recovery procedure must

perform all logged installs for all-or-nothing actions that did not log an END record. Any all-or-nothing action that logged an END record must have completed all of its installs, so there is no need for the recovery procedure to perform them. The recovery procedure thus reduces to doing installs just for all-or-nothing actions that were interrupted between the logging of their OUTCOME and END records. Recovery with redo logging can thus be quite swift, though it does require both a backward and forward scan of the entire log.

We can summarize the procedures for atomicity logging as follows:

• Log to journal storage before installing in cell storage (WAL protocol)

• If all-or-nothing actions perform all installs to non-volatile storage before logging their OUTCOME record, then recovery needs only to undo the installs of incomplete uncommitted actions. (rollback/undo recovery)

• If all-or-nothing actions perform no installs to non-volatile storage before logging their OUTCOME record, then recovery needs only to redo the installs of incomplete committed actions. (roll-forward/redo recovery)

• If all-or-nothing actions are not disciplined about when they do installs to non-volatile storage, then recovery needs to both redo the installs of incomplete committed actions and undo the installs of incomplete uncommitted ones.

In addition to reading and updating memory, an all-or-nothing action may also need to send messages, for example, to report its success to the outside world. The action of sending a message is just like any other component action of the all-or-nothing action. To provide all-or-nothing atomicity, message sending can be handled in a way analogous to memory update. That is, log a CHANGE record with a redo action that sends the message. If a crash occurs after the all-or-nothing action commits, the recovery procedure will perform this redo action along with other redo actions that perform installs. In principle, one could also log an undo_action that sends a compensating message (“Please ignore my previous communication!”). However, an all-or-nothing action will usually be careful not to actually send any messages until after the action commits, so roll-forward recovery applies (see the sketch below). For this reason, a designer would not normally specify an undo action for a message or for any other action that has outside-world visibility such as printing a receipt, opening a cash drawer, drilling a hole, or firing a missile.

Incidentally, although much of the professional literature about database atomicity and recovery uses the terms “winner” and “loser” to describe the recovery procedure, different recovery systems use subtly different definitions for the two sets, depending on the exact logging scheme, so it is a good idea to review those definitions carefully.
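The point about deferring outside-world effects can be made concrete with a small sketch. The following Python fragment is illustrative only: the class Action, the method send_message, and the transport parameter are invented names, not part of the book’s interface. The action queues its messages and releases them only after its OUTCOME record says COMMITTED, so an abort never needs a compensating message.

   class Action:
       def __init__(self, action_id, log):
           self.id = action_id
           self.log = log                        # shared append-only journal (a list)
           self.deferred_messages = []

       def send_message(self, message):
           # Do not actually send yet; log a redo action and remember the message.
           self.log.append({"kind": "CHANGE", "action": self.id,
                            "redo": ("SEND", message)})
           self.deferred_messages.append(message)

       def commit(self, transport):
           self.log.append({"kind": "OUTCOME", "action": self.id,
                            "status": "COMMITTED"})
           for message in self.deferred_messages:
               transport(message)                # only now does anything leave the system

       def abort(self):
           self.log.append({"kind": "OUTCOME", "action": self.id,
                            "status": "ABORTED"})
           self.deferred_messages.clear()        # nothing was sent, nothing to undo

Here transport stands for whatever mechanism actually delivers the message; if a crash occurs after commit but before delivery, the logged redo action lets recovery resend it.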

9.3.5 Checkpoints

Constraining the order of installs to be all before or all after the logging of the OUTCOME record is not the only thing we could do to speed up recovery. Another technique that can shorten the log scan is to occasionally write some additional information, known as a checkpoint, to non-volatile storage. Although the principle is always the same, the exact

information that is placed in a checkpoint varies from one system to another. A checkpoint can include information written either to cell storage or to the log (where it is known as a checkpoint record) or both.

Suppose, for example, that the logging system maintains in volatile memory a list of identifiers of all-or-nothing actions that have started but have not yet recorded an END record, together with their pending/committed/aborted status, keeping it up to date by observing logging calls. The logging system then occasionally logs this list as a CHECKPOINT record. When a crash occurs sometime later, the recovery procedure begins a LIFO log scan as usual, collecting the sets of completed actions and losers. When it comes to a CHECKPOINT record it can immediately fill out the set of losers by adding those all-or-nothing actions that were listed in the checkpoint that did not later log an END record. This list may include some all-or-nothing actions listed in the CHECKPOINT record as COMMITTED, but that did not log an END record by the time of the crash. Their installs still need to be performed, so they need to be added to the set of losers. The LIFO scan continues, but only until it has found the BEGIN record of every loser.

With the addition of CHECKPOINT records, the recovery procedure becomes more complex, but is potentially shorter in time and effort:

1. Do a LIFO scan of the log back to the last CHECKPOINT record, collecting identifiers of losers and undoing all actions they logged.

2. Complete the list of losers from information in the checkpoint.

3. Continue the LIFO scan, undoing the actions of losers, until every BEGIN record belonging to every loser has been found.

4. Perform a forward scan from that point to the end of the log, performing any committed actions belonging to all-or-nothing actions in the list of losers that logged an OUTCOME record with status COMMITTED.

In systems in which long-running all-or-nothing actions are uncommon, step 3 will typically be quite brief or even empty, greatly shortening recovery. A good exercise is to modify the recovery program of Figure 9.23 to accommodate checkpoints.

Checkpoints are also used with in-memory databases, to provide durability without the need to reprocess the entire log after every system crash. A useful checkpoint procedure for an in-memory database is to make a snapshot of the complete database, writing it to one of two alternating (for all-or-nothing atomicity) dedicated non-volatile storage regions, and then logging a CHECKPOINT record that contains the address of the latest snapshot. Recovery then involves scanning the log back to the most recent CHECKPOINT record, collecting a list of committed all-or-nothing actions, restoring the snapshot described there, and then performing redo actions of those committed actions from the CHECKPOINT record to the end of the log. The main challenge in this scenario is dealing with update activity that is concurrent with the writing of the snapshot. That challenge can be met either by preventing all updates for the duration of the snapshot or by applying more complex before-or-after atomicity techniques such as those described in later sections of this chapter.
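A rough sketch of the checkpoint bookkeeping, continuing the dictionary-based log records of the earlier sketches (write_checkpoint and losers_after_crash are invented names, and this fragment shows only how the checkpoint cuts the backward scan short, not the undo and forward-scan work that a full implementation must also do):

   def write_checkpoint(log, pending_actions):
       # pending_actions: ids of actions with no END record yet, mapped to their status.
       log.append({"kind": "CHECKPOINT", "pending": dict(pending_actions)})

   def losers_after_crash(log):
       losers = set()
       ended = set()
       for record in reversed(log):                 # backward (LIFO) scan
           if record["kind"] == "END":
               ended.add(record["action"])
           elif record["kind"] == "CHECKPOINT":
               # Fill out the loser set from the checkpoint and stop here; a full
               # implementation keeps scanning only to find the losers' BEGIN records.
               losers |= {a for a in record["pending"] if a not in ended}
               break
           elif record.get("action") is not None and record["action"] not in ended:
               losers.add(record["action"])
       return losers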

9.3.6 What if the Cache is not Write-Through? (Advanced Topic)

Between the log and the write-through cache, the logging configurations just described require, for every data update, two synchronous writes to non-volatile storage, with attendant delays waiting for the writes to complete. Since the original reason for introducing a log was to increase performance, these two synchronous write delays usually become the system performance bottleneck. Designers who are interested in maximizing performance would prefer to use a cache that is not write-through, so that writes can be deferred until a convenient time when they can be done in batches. Unfortunately, the application then loses control of the order in which things are actually written to non-volatile storage. Loss of control of order has a significant impact on our all-or-nothing atomicity algorithms, since they require, for correctness, constraints on the order of writes and certainty about which writes have been done.

The first concern is for the log itself because the write-ahead log protocol requires that appending a CHANGE record to the log precede the corresponding install in cell storage. One simple way to enforce the WAL protocol is to make just log writes write-through, but allow cell storage writes to occur whenever the cache manager finds it convenient. However, this relaxation means that if the system crashes there is no assurance that any particular install has actually migrated to non-volatile storage. The recovery procedure, assuming the worst, cannot take advantage of checkpoints and must again perform installs starting from the beginning of the log. To avoid that possibility, the usual design response is to flush the cache as part of logging each checkpoint record. Unfortunately, flushing the cache and logging the checkpoint must be done as a before-or-after action to avoid getting tangled with concurrent updates, which creates another design challenge. This challenge is surmountable, but the complexity is increasing.

Some systems pursue performance even farther. A popular technique is to write the log to a volatile buffer, and force that entire buffer to non-volatile storage only when an all-or-nothing action commits. This strategy allows batching several CHANGE records with the next OUTCOME record in a single synchronous write. Although this step would appear to violate the write-ahead log protocol, that protocol can be restored by making the cache used for cell storage a bit more elaborate; its management algorithm must avoid writing back any install for which the corresponding log record is still in the volatile buffer. The trick is to number each log record in sequence, and tag each record in the cell storage cache with the sequence number of its log record. Whenever the system forces the log, it tells the cache manager the sequence number of the last log record that it wrote, and the cache manager is careful never to write back any cache record that is tagged with a higher log sequence number.

We have in this section seen some good examples of the law of diminishing returns at work: schemes that improve performance sometimes require significantly increased complexity. Before undertaking any such scheme, it is essential to evaluate carefully how much extra performance one stands to gain.
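The sequence-number trick can be expressed compactly. In the following Python sketch (the class Cache and its method names are invented for illustration; the book does not prescribe this interface), each dirty cell is tagged with the sequence number of the log record that describes its latest change, and only cells whose records have already been forced to the journal are eligible for write-back.

   class Cache:
       def __init__(self):
           self.dirty = {}        # name -> (value, sequence number of its CHANGE record)
           self.forced_lsn = 0    # highest log sequence number known to be on disk

       def note_install(self, name, value, lsn):
           self.dirty[name] = (value, lsn)

       def note_log_forced(self, lsn):
           self.forced_lsn = max(self.forced_lsn, lsn)

       def eligible_for_write_back(self):
           # Only cells whose log records are already safely in the journal.
           return {name: value
                   for name, (value, lsn) in self.dirty.items()
                   if lsn <= self.forced_lsn}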

9.4 Before-or-After Atomicity I: Concepts

The mechanisms developed in the previous sections of this chapter provide atomicity in the face of failure, so that other atomic actions that take place after the failure and subsequent recovery find that an interrupted atomic action apparently either executed all of its steps or none of them. This and the next section investigate how to also provide atomicity of concurrent actions, known as before-or-after atomicity. In this development we will provide both all-or-nothing atomicity and before-or-after atomicity, so we will now be able to call the resulting atomic actions transactions.

Concurrency atomicity requires additional mechanism because when an atomic action installs data in cell storage, that data is immediately visible to all concurrent actions. Even though the version history mechanism can hide pending changes from concurrent atomic actions, they can read other variables that the first atomic action plans to change. Thus, the composite nature of a multiple-step atomic action may still be discovered by a concurrent atomic action that happens to look at the value of a variable in the midst of execution of the first atomic action. Thus, making a composite action atomic with respect to concurrent threads—that is, making it a before-or-after action—requires further effort.

Recall that Section 9.1.5 defined the operation of concurrent actions to be correct if every result is guaranteed to be one that could have been obtained by some purely serial application of those same actions. So we are looking for techniques that guarantee to produce the same result as if concurrent actions had been applied serially, yet maximize the performance that can be achieved by allowing concurrency.

In this Section 9.4 we explore three successively better before-or-after atomicity schemes, where “better” means that the scheme allows more concurrency. To illustrate the concepts we return to version histories, which allow a straightforward and compelling correctness argument for each scheme. Because version histories are rarely used in practice, in the following Section 9.5 we examine a somewhat different approach, locks, which are widely used because they can provide higher performance, but for which correctness arguments are more difficult.

9.4.1 Achieving Before-or-After Atomicity: Simple Serialization

A version history assigns a unique identifier to each atomic action so that it can link tentative versions of variables to the action’s outcome record. Suppose that we require that the unique identifiers be consecutive integers, which we interpret as serial numbers, and we modify the procedure BEGIN_TRANSACTION by adding enforcement of the following simple serialization rule: each newly created transaction n must, before reading or writing any data, wait until the preceding transaction n – 1 has either committed or aborted. (To ensure that there is always a transaction n – 1, assume that the system was initialized by creating a transaction number zero with an OUTCOME record in the committed state.) Figure 9.25 shows this version of BEGIN_TRANSACTION. The scheme forces all transactions to execute in the serial order that threads happen to invoke BEGIN_TRANSACTION. Since that

1  procedure BEGIN_TRANSACTION ()
2     id ← NEW_OUTCOME_RECORD (PENDING)    // Create, initialize, assign id.
3     previous_id ← id – 1
4     wait until previous_id.outcome_record.state ≠ PENDING
5     return id

FIGURE 9.25 BEGIN_TRANSACTION with the simple serialization discipline to achieve before-or-after atomicity. In order that there be an id – 1 for every value of id, startup of the system must include creating a dummy transaction with id = 0 and id.outcome_record.state set to COMMITTED. Pseudocode for the procedure NEW_OUTCOME_RECORD appears in Figure 9.30.

order is a possible serial order of the various transactions, by definition simple serializa­ tion will produce transactions that are serialized and thus are correct before-or-after actions. Simple serialization trivially provides before-or-after atomicity, and the transac­ tion is still all-or-nothing, so the transaction is now atomic both in the case of failure and in the presence of concurrency. Simple serialization provides before-or-after atomicity by being too conservative: it prevents all concurrency among transactions, even if they would not interfere with one another. Nevertheless, this approach actually has some practical value—in some applica­ tions it may be just the right thing to do, on the basis of simplicity. Concurrent threads can do much of their work in parallel because simple serialization comes into play only during those times that threads are executing transactions, which they generally would be only at the moments they are working with shared variables. If such moments are infrequent or if the actions that need before-or-after atomicity all modify the same small set of shared variables, simple serialization is likely to be just about as effective as any other scheme. In addition, by looking carefully at why it works, we can discover less con­ servative approaches that allow more concurrency, yet still have compelling arguments that they preserve correctness. Put another way, the remainder of study of before-or-after atomicity techniques is fundamentally nothing but invention and analysis of increasingly effective—and increasingly complex—performance improvement measures. The version history provides a useful representation for this analysis. Figure 9.26 illustrates in a single figure the version histories of a banking system consisting of four accounts named A, B, C, and D, during the execution of six transactions, with serial num­ bers 1 through 6. The first transaction initializes all the objects to contain the value 0 and the following transactions transfer various amounts back and forth between pairs of accounts. This figure provides a straightforward interpretation of why simple serialization works correctly. Consider transaction 3, which must read and write objects B and C in order to transfer funds from one to the other. The way for transaction 3 to produce results as if it ran after transaction 2 is for all of 3’s input objects to have values that include all the effects of transaction 2—if transaction 2 commits, then any objects it

                      value of object at end of transaction
   Object        1         2         3         4         5         6
     A           0        +10                 +12                   0
     B           0        -10        -6                 -12        -2
     C           0                   -4                  +2
     D           0                             -2

   outcome record state:
              Committed  Committed  Committed  Aborted  Committed  Pending

   transaction 1: initialize all accounts to 0
   transaction 2: transfer 10 from B to A
   transaction 3: transfer 4 from C to B
   transaction 4: transfer 2 from D to A (aborts)
   transaction 5: transfer 6 from B to C
   transaction 6: transfer 10 from A to B

FIGURE 9.26 Version history of a banking system.

changed and that 3 uses should have new values; if transaction 2 aborts, then any objects it tentatively changed and 3 uses should contain the values that they had when transac­ tion 2 started. Since in this example transaction 3 reads B and transaction 2 creates a new version of B, it is clear that for transaction 3 to produce a correct result it must wait until transaction 2 either commits or aborts. Simple serialization requires that wait, and thus ensures correctness. Figure 9.26 also provides some clues about how to increase concurrency. Looking at transaction 4 (the example shows that transaction 4 will ultimately abort for some reason, but suppose we are just starting transaction 4 and don’t know that yet), it is apparent that simple serialization is too strict. Transaction 4 reads values only from A and D, yet trans­ action 3 has no interest in either object. Thus the values of A and D will be the same whether or not transaction 3 commits, and a discipline that forces 4 to wait for 3’s com­ pletion delays 4 unnecessarily. On the other hand, transaction 4 does use an object that transaction 2 modifies, so transaction 4 must wait for transaction 2 to complete. Of course, simple serialization guarantees that, since transaction 4 can’t begin till transaction 3 completes and transaction 3 couldn’t have started until transaction 2 completed.
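For readers who want to experiment with the simple serialization rule of Figure 9.25, here is a minimal executable rendering using Python threads. It is an illustration only: the names outcome, begin_transaction, and finish are invented, and the condition variable stands in for whatever lower-level before-or-after mechanism the real NEW_OUTCOME_RECORD bootstraps from. Each new transaction receives the next serial number and then waits until its predecessor is no longer PENDING.

   import threading

   lock = threading.Condition()
   outcome = {0: "COMMITTED"}       # dummy transaction 0, as Figure 9.25's caption requires
   next_id = 1

   def begin_transaction():
       global next_id
       with lock:
           my_id = next_id
           next_id += 1
           outcome[my_id] = "PENDING"
           while outcome.get(my_id - 1, "PENDING") == "PENDING":
               lock.wait()          # wait for the preceding transaction to finish
           return my_id

   def finish(my_id, state):        # state is "COMMITTED" or "ABORTED"
       with lock:
           outcome[my_id] = state
           lock.notify_all()        # wake any transaction waiting on this one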

                      Value of object at end of transaction
   Object        1        2        3        4        5        6        7
     A          [0]     [+10]     +10     [+12]     +12      [0]       0
     B          [0]     [-10]     [-6]     -6      [-12]     [-2]     -2
     C          [0]       0       [-4]     -4      [+2]      +2       +2
     D          [0]       0        0      [-2]      -2       -2       -2

   OUTCOME record state:
              Committed  Committed  Committed  Aborted  Committed  Pending  Pending

   Values in brackets are changed values; unbracketed values are unchanged values,
   carried forward from the previous transaction.

FIGURE 9.27 System state history with unchanged values shown.

These observations suggest that there may be other, more relaxed, disciplines that can still guarantee correct results. They also suggest that any such discipline will probably involve detailed examination of exactly which objects each transaction reads and writes. Figure 9.26 represents the state history of the entire system in serialization order, but the slightly different representation of Figure 9.27 makes that state history more explicit. In Figure 9.27 it appears that each transaction has perversely created a new version of every object, with unchanged values in dotted boxes for those objects it did not actually change. This representation emphasizes that the vertical slot for, say, transaction 3 is in effect a reservation in the state history for every object in the system; transaction 3 has an opportunity to propose a new value for any object, if it so wishes. The reason that the system state history is helpful to the discussion is that as long as we eventually end up with a state history that has the values in the boxes as shown, the actual order in real time in which individual object values are placed in those boxes is unimportant. For example, in Figure 9.27, transaction 3 could create its new version of object C before transaction 2 creates its new version of B. We don’t care when things happen, as long as the result is to fill in the history with the same set of values that would result from strictly following this serial ordering. Making the actual time sequence

unimportant is exactly our goal, since that allows us to put concurrent threads to work on the various transactions. There are, of course, constraints on time ordering, but they become evident by examining the state history. Figure 9.27 allows us to see just what time constraints must be observed in order for the system state history to record this particular sequence of transactions. In order for a transaction to generate results appropriate for its position in the sequence, it should use as its input values the latest versions of all of its inputs. If Figure 9.27 were available, transaction 4 could scan back along the histories of its inputs A and D, to the most recent solid boxes (the ones created by transactions 2 and 1, respectively) and correctly conclude that if transactions 2 and 1 have committed then transaction 4 can proceed—even if transaction 3 hasn’t gotten around to filling in values for B and C and hasn’t decided whether or not it should commit. This observation suggests that any transaction has enough information to ensure before-or-after atomicity with respect to other transactions if it can discover the dotted-versus-solid status of those version history boxes to its left. The observation also leads to a specific before-or-after atomicity discipline that will ensure correctness. We call this discipline mark-point.

9.4.2 The Mark-Point Discipline

Concurrent threads that invoke READ_CURRENT_VALUE as implemented in Figure 9.15 cannot see a pending version of any variable. That observation is useful in designing a before-or-after atomicity discipline because it allows a transaction to reveal all of its results at once simply by changing the value of its OUTCOME record to COMMITTED. But in addition to that we need a way for later transactions that need to read a pending version to wait for it to become committed. The way to do that is to modify READ_CURRENT_VALUE to wait for, rather than skip over, pending versions created by transactions that are earlier in the sequential ordering (that is, they have a smaller caller_id), as implemented in lines 4–9 of Figure 9.28. Because, with concurrency, a transaction later in the ordering may create a new version of the same variable before this transaction reads it, READ_CURRENT_VALUE still skips over any versions created by transactions that have a larger caller_id. Also, as before, it may be convenient to have a READ_MY_VALUE procedure (not shown) that returns pending values previously written by the running transaction.

Adding the ability to wait for pending versions in READ_CURRENT_VALUE is the first step; to ensure correct before-or-after atomicity we also need to arrange that all variables that a transaction needs as inputs, but that earlier, not-yet-committed transactions plan to modify, have pending versions. To do that we call on the application programmer (for example, the programmer of the TRANSFER transaction) to do a bit of extra work: each transaction should create new, pending versions of every variable it intends to modify, and announce when it is finished doing so. Creating a pending version has the effect of marking those variables that are not ready for reading by later transactions, so we will call the point at which a transaction has created them all the mark point of the transaction. The

transaction announces that it has passed its mark point by calling a procedure named MARK_POINT_ANNOUNCE, which simply sets a flag in the outcome record for that transaction. The mark-point discipline then is that no transaction can begin reading its inputs until the preceding transaction has reached its mark point or is no longer pending. This discipline requires that each transaction identify which data it will update. If the trans­ action has to modify some data objects before it can discover the identity of others that require update, it could either delay setting its mark point until it does know all of the objects it will write (which would, of course, also delay all succeeding transactions) or use the more complex discipline described in the next section. For example, in Figure 9.27, the boxes under newly arrived transaction 7 are all dot­ ted; transaction 7 should begin by marking the ones that it plans to make solid. For convenience in marking, we split the WRITE_NEW_VALUE procedure of Figure 9.15 into two parts, named NEW_VERSION and WRITE_VALUE, as in Figure 9.29. Marking then consists sim­ ply of a series of calls to NEW_VERSION. When finished marking, the transaction calls MARK_POINT_ANNOUNCE. It may then go about its business, reading and writing values as appropriate to its purpose. Finally, we enforce the mark point discipline by putting a test and, depending on its outcome, a wait in BEGIN_TRANSACTION, as in Figure 9.30, so that no transaction may begin execution until the preceding transaction either reports that it has reached its mark point or is no longer PENDING. Figure 9.30 also illustrates an implementation of MARK_POINT_ANNOUNCE. No changes are needed in procedures ABORT and COMMIT as shown in Figure 9.13, so they are not repeated here. Because no transaction can start until the previous transaction reaches its mark point, all transactions earlier in the serial ordering must also have passed their mark points, so every transaction earlier in the serial ordering has already created all of the versions that it ever will. Since READ_CURRENT_VALUE now waits for earlier, pending values to become

1  procedure READ_CURRENT_VALUE (data_id, this_transaction_id)
2     starting at end of data_id repeat until beginning
3        v ← previous version of data_id
4        last_modifier ← v.action_id
5        if last_modifier ≥ this_transaction_id then skip v    // Keep searching
6        wait until (last_modifier.outcome_record.state ≠ PENDING)
7        if (last_modifier.outcome_record.state = COMMITTED)
8           then return v.state
9           else skip v    // Resume search
10    signal (“Tried to read an uninitialized variable”)

FIGURE 9.28 READ_CURRENT_VALUE for the mark-point discipline. This form of the procedure skips all versions created by transactions later than the calling transaction, and it waits for a pending version created by an earlier transaction until that earlier transaction commits or aborts.

1  procedure NEW_VERSION (reference data_id, this_transaction_id)
2     if this_transaction_id.outcome_record.mark_state = MARKED
3        then signal (“Tried to create new version after announcing mark point!”)
4     append new version v to data_id
5     v.value ← NULL
6     v.action_id ← this_transaction_id

7  procedure WRITE_VALUE (reference data_id, new_value, this_transaction_id)
8     starting at end of data_id repeat until beginning
9        v ← previous version of data_id
10       if v.action_id = this_transaction_id
11          v.value ← new_value
12          return
13    signal (“Tried to write without creating new version!”)

FIGURE 9.29 Mark-point discipline versions of NEW_VERSION and WRITE_VALUE.

1  procedure BEGIN_TRANSACTION ()
2     id ← NEW_OUTCOME_RECORD (PENDING)
3     previous_id ← id - 1
4     wait until (previous_id.outcome_record.mark_state = MARKED)
5        or (previous_id.outcome_record.state ≠ PENDING)
6     return id

7  procedure NEW_OUTCOME_RECORD (starting_state)
8     ACQUIRE (outcome_record_lock)    // Make this a before-or-after action.
9     id ← TICKET (outcome_record_sequencer)
10    allocate id.outcome_record
11    id.outcome_record.state ← starting_state
12    id.outcome_record.mark_state ← NULL
13    RELEASE (outcome_record_lock)
14    return id

15 procedure MARK_POINT_ANNOUNCE (reference this_transaction_id)
16    this_transaction_id.outcome_record.mark_state ← MARKED

FIGURE 9.30 The procedures BEGIN_TRANSACTION, NEW_OUTCOME_RECORD, and MARK_POINT_ANNOUNCE for the mark-point discipline. BEGIN_TRANSACTION presumes that there is always a preceding transaction, so the system should be initialized by calling NEW_OUTCOME_RECORD to create an empty initial transaction in the starting_state COMMITTED and immediately calling MARK_POINT_ANNOUNCE for the empty transaction.

committed or aborted, it will always return to its client a value that represents the final outcome of all preceding transactions. All input values to a transaction thus contain the committed result of all transactions that appear earlier in the serial ordering, just as if it had followed the simple serialization discipline. The result is thus guaranteed to be exactly the same as one produced by a serial ordering, no matter in what real time order the various transactions actually write data values into their version slots. The particular serial ordering that results from this discipline is, as in the case of the simple serialization discipline, the ordering in which the transactions were assigned serial numbers by NEW_OUTCOME_RECORD.

There is one potential interaction between all-or-nothing atomicity and before-or-after atomicity. If pending versions survive system crashes, at restart the system must track down all PENDING transaction records and mark them ABORTED to ensure that future invokers of READ_CURRENT_VALUE do not wait for the completion of transactions that have forever disappeared.

The mark-point discipline provides before-or-after atomicity by bootstrapping from a more primitive before-or-after atomicity mechanism. As usual in bootstrapping, the idea is to reduce some general problem—here, that problem is to provide before-or-after atomicity for arbitrary application programs—to a special case that is amenable to a special-case solution—here, the special case is construction and initialization of a new outcome record. The procedure NEW_OUTCOME_RECORD in Figure 9.30 must itself be a before-or-after action because it may be invoked concurrently by several different threads and it must be careful to give out different serial numbers to each of them. It must also create completely initialized outcome records, with value and mark_state set to PENDING and NULL, respectively, because a concurrent thread may immediately need to look at one of those fields. To achieve before-or-after atomicity, NEW_OUTCOME_RECORD bootstraps from the TICKET procedure of Section 5.6.3 to obtain the next sequential serial number, and it uses ACQUIRE and RELEASE to make its initialization steps a before-or-after action. Those procedures in turn bootstrap from still lower-level before-or-after atomicity mechanisms, so we have three layers of bootstrapping.

We can now reprogram the funds TRANSFER procedure of Figure 9.15 to be atomic under both failure and concurrent activity, as in Figure 9.31. The major change from the earlier version is addition of lines 4 through 6, in which TRANSFER calls NEW_VERSION to mark the two variables that it intends to modify and then calls MARK_POINT_ANNOUNCE. The interesting observation about this program is that most of the work of making actions before-or-after is actually carried out in the called procedures. The only effort or thought required of the application programmer is to identify and mark, by creating new versions, the variables that the transaction will modify. The delays (which under the simple serialization discipline would all be concentrated in BEGIN_TRANSACTION) are distributed under the mark-point discipline. Some delays may still occur in BEGIN_TRANSACTION, waiting for the preceding transaction to reach its mark point. But if marking is done before any other calculations, transactions are likely to reach their mark points promptly, and thus this delay should not be as great as waiting for them to commit or abort.

1  procedure TRANSFER (reference debit_account, reference credit_account,
2                      amount)
3     my_id ← BEGIN_TRANSACTION ()
4     NEW_VERSION (debit_account, my_id)
5     NEW_VERSION (credit_account, my_id)
6     MARK_POINT_ANNOUNCE (my_id);
7     xvalue ← READ_CURRENT_VALUE (debit_account, my_id)
8     xvalue ← xvalue - amount
9     WRITE_VALUE (debit_account, xvalue, my_id)
10    yvalue ← READ_CURRENT_VALUE (credit_account, my_id)
11    yvalue ← yvalue + amount
12    WRITE_VALUE (credit_account, yvalue, my_id)
13    if xvalue > 0 then
14       COMMIT (my_id)
15    else
16       ABORT (my_id)
17       signal(“Negative transfers are not allowed.”)

FIGURE 9.31 An implementation of the funds transfer procedure that uses the mark point discipline to ensure that it is atomic both with respect to failure and with respect to concurrent activity.

Delays can also occur at any invocation of READ_CURRENT_VALUE, but only if there is really something that the transaction must wait for, such as committing a pending version of a necessary input variable. Thus the overall delay for any given transaction should never be more than that imposed by the simple serialization discipline, and one might anticipate that it will often be less.

A useful property of the mark-point discipline is that it never creates deadlocks. Whenever a wait occurs it is a wait for some transaction earlier in the serialization. That transaction may in turn be waiting for a still earlier transaction, but since no one ever waits for a transaction later in the ordering, progress is guaranteed. The reason is that at all times there must be some earliest pending transaction. The ordering property guarantees that this earliest pending transaction will encounter no waits for other transactions to complete, so it, at least, can make progress. When it completes, some other transaction in the ordering becomes earliest, and it now can make progress. Eventually, by this argument, every transaction will be able to make progress. This kind of reasoning about progress is a helpful element of a before-or-after atomicity discipline. In Section 9.5 of this chapter we will encounter before-or-after atomicity disciplines that are correct in the sense that they guarantee the same result as a serial ordering, but they do not guarantee progress. Such disciplines require additional mechanisms to ensure that threads do not end up deadlocked, waiting for one another forever.

Two other minor points are worth noting. First, if transactions wait to announce their mark point until they are ready to commit or abort, the mark-point discipline reduces to the simple serialization discipline. That observation confirms that one

discipline is a relaxed version of the other. Second, there are at least two opportunities in the mark-point discipline to discover and report protocol errors to clients. A transaction should never call NEW_VERSION after announcing its mark point. Similarly, WRITE_VALUE can report an error if the client tries to write a value for which a new version was never created. Both of these error-reporting opportunities are implemented in the pseudocode of Figure 9.29.

9.4.3 Optimistic Atomicity: Read-Capture (Advanced Topic)

Both the simple serialization and mark-point disciplines are concurrency control methods that may be described as pessimistic. That means that they presume that interference between concurrent transactions is likely and they actively prevent any possibility of interference by imposing waits at any point where interference might occur. In doing so, they also may prevent some concurrency that would have been harmless to correctness. An alternative scheme, called optimistic concurrency control, is to presume that interference between concurrent transactions is unlikely, and allow them to proceed without waiting. Then, watch for actual interference, and if it happens take some recovery action, for example aborting an interfering transaction and making it restart. (There is a popular tongue-in-cheek characterization of the difference: pessimistic = “ask first”, optimistic = “apologize later”.) The goal of optimistic concurrency control is to increase concurrency in situations where actual interference is rare.

The system state history of Figure 9.27 suggests an opportunity to be optimistic. We could allow transactions to write values into the system state history in any order and at any time, but with the risk that some attempts to write may be met with the response “Sorry, that write would interfere with another transaction. You must abort, abandon this serialization position in the system state history, obtain a later serialization, and rerun your transaction from the beginning.”

A specific example of this approach is the read-capture discipline. Under the read-capture discipline, there is an option, but not a requirement, of advance marking. Eliminating the requirement of advance marking has the advantage that a transaction does not need to predict the identity of every object it will update—it can discover the identity of those objects as it works. Instead of advance marking, whenever a transaction calls READ_CURRENT_VALUE, that procedure makes a mark at this thread’s position in the version history of the object it read. This mark tells potential version-inserters earlier in the serial ordering but arriving later in real time that they are no longer allowed to insert—they must abort and try again, using a later serial position in the version history. Had the prospective version inserter gotten there sooner, before the reader had left its mark, the new version would have been acceptable, and the reader would have instead waited for the version inserter to commit, and taken that new value instead of the earlier one. Read-capture gives the reader the power of extending validity of a version through intervening transactions, up to the reader’s own serialization position. This view of the situation is illustrated in Figure 9.32, which has the same version history as did Figure 9.27.


[Figure 9.32, a chart of the version history with outcome states and high-water marks (HWM) of objects A, B, C, and D across transactions 1 through 7, is not reproduced here; its caption follows.]

FIGURE 9.32 Version history with high-water marks and the read-capture discipline. First, transaction 6, which is running concurrently with transaction 4, reads variable A, thus extending the high-water mark of A to 6. Then, transaction 4 (which intends to transfer 2 from D to A) encounters a conflict when it tries to create a new version of A and discovers that the high-water mark of A has already been set by transaction 6, so 4 aborts and returns as transaction 7. Transaction 7 retries transaction 4, extending the high-water marks of A and D to 7.

The key property of read-capture is illustrated by an example in Figure 9.32. Trans­ action 4 was late in creating a new version of object A; by the time it tried to do the insertion, transaction 6 had already read the old value (+10) and thereby extended the validity of that old value to the beginning of transaction 6. Therefore, transaction 4 had to be aborted; it has been reincarnated to try again as transaction 7. In its new position as transaction 7, its first act is to read object D, extending the validity of its most recent committed value (zero) to the beginning of transaction 7. When it tries to read object A, it discovers that the most recent version is still uncommitted, so it must wait for transac­ tion 6 to either commit or abort. Note that if transaction 6 should now decide to create a new version of object C, it can do so without any problem, but if it should try to create a new version of object D, it would run into a conflict with the old, now extended version of D, and it would have to abort.


1  procedure READ_CURRENT_VALUE (reference data_id, value, caller_id)
2      starting at end of data_id repeat until beginning
3          v ← previous version of data_id
4          if v.action_id ≥ caller_id then skip v
5          examine v.action_id.outcome_record
6          if PENDING then
7              WAIT for v.action_id to COMMIT or ABORT
8          if COMMITTED then
9              v.high_water_mark ← max(v.high_water_mark, caller_id)
10             return v.value
11         else skip v                                              // Continue backward search
12     signal (“Tried to read an uninitialized variable!”)

13 procedure NEW_VERSION (reference data_id, caller_id)
14     if (caller_id < data_id.high_water_mark)                     // Conflict with later reader.
15         or (caller_id < (LATEST_VERSION[data_id].action_id))     // Blind write conflict.
16         then ABORT this transaction and terminate this thread
17     add new version v at end of data_id
18     v.value ← 0
19     v.action_id ← caller_id

20 procedure WRITE_VALUE (reference data_id, new_value, caller_id)
21     locate version v of data_id.history such that v.action_id = caller_id
22         (if not found, signal (“Tried to write without creating new version!”))
23     v.value ← new_value

FIGURE 9.33 Read-capture forms of READ_CURRENT_VALUE, NEW_VERSION, and WRITE_VALUE.

Read-capture is relatively easy to implement in a version history system. We start, as shown in Figure 9.33, by adding a new step (at line 9) to READ_CURRENT_VALUE. This new step records with each data object a high-water mark—the serial number of the highest-numbered transaction that has ever read a value from this object’s version history. The high-water mark serves as a warning to other transactions that have earlier serial numbers but are late in creating new versions. The warning is that someone later in the serial ordering has already read a version of this object from earlier in the ordering, so it is too late to create a new version now. We guarantee that the warning is heeded by adding a step to NEW_VERSION (at line 14), which checks the high-water mark for the object to be written, to see if any transaction with a higher serial number has already read the current version of the object. If not, we can create a new version without concern. But if the transaction serial number in the high-water mark is greater than this transaction’s own serial number, this transaction must abort, obtain a new, higher serial number, and start over again.
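As a concrete illustration, here is a minimal single-threaded Python sketch of that bookkeeping. It models only the high-water-mark checks of Figure 9.33 (lines 9, 14, and 15); outcome records, waiting for PENDING versions, and the lock inside NEW_VERSION mentioned a few paragraphs below are all omitted, and the class and method names are inventions of this sketch.

class Abort(Exception):
    """Raised when a transaction must abort and retry with a later serial number."""

class VersionHistory:
    def __init__(self, initial_value=0):
        # list of (writer_serial, value), oldest first; writer 0 is "initialization"
        self.versions = [(0, initial_value)]
        self.high_water_mark = 0     # highest serial number that has read this object

    def read_current_value(self, caller_serial):
        # Backward search for the newest version created earlier in the serial
        # ordering, then record the read in the high-water mark (line 9).
        for writer_serial, value in reversed(self.versions):
            if writer_serial >= caller_serial:
                continue             # skip versions later in the serial order
            self.high_water_mark = max(self.high_water_mark, caller_serial)
            return value
        raise KeyError("tried to read an uninitialized variable")

    def new_version(self, caller_serial):
        # Line-14 check: a later transaction has already read the current version.
        if caller_serial < self.high_water_mark:
            raise Abort("conflict with a later reader")
        # Line-15 check: keep versions in serial order (blind-write conflict).
        if caller_serial < self.versions[-1][0]:
            raise Abort("blind write conflict")
        self.versions.append((caller_serial, None))

    def write_value(self, caller_serial, new_value):
        for i, (writer_serial, _) in enumerate(self.versions):
            if writer_serial == caller_serial:
                self.versions[i] = (caller_serial, new_value)
                return
        raise KeyError("tried to write without creating a new version")

# The situation of Figure 9.32: transaction 6 reads A first, so transaction 4
# arrives too late and must abort and retry with a later serial number.
a = VersionHistory(10)
a.read_current_value(caller_serial=6)   # high-water mark of A becomes 6
a.new_version(caller_serial=4)          # raises Abort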


We have removed all constraints on the real-time sequence of the constituent steps of the concurrent transaction, so there is a possibility that a high-numbered transaction will create a new version of some object, and then later a low-numbered transaction will try to create a new version of the same object. Since our NEW_VERSION procedure simply tacks new versions on the end of the object history, we could end up with a history in the wrong order. The simplest way to avoid that mistake is to put an additional test in NEW_VERSION (at line 15), to ensure that every new version has a client serial number that is larger than the serial number of the next previous version. If not, NEW_VERSION aborts the transaction, just as if a read-capture conflict had occurred. (This test aborts only those transactions that perform conflicting blind writes, which are uncommon. If either of the conflicting transactions reads the value before writing it, the setting and testing of high_water_mark will catch and prevent the conflict.)

The first question one must raise about this kind of algorithm is whether or not it actually works: is the result always the same as some serial ordering of the concurrent transactions? Because the read-capture discipline permits greater concurrency than does mark-point, the correctness argument is a bit more involved. The induction part of the argument goes as follows:

1. The WAIT for PENDING values in READ_CURRENT_VALUE ensures that if any pending transaction k < n has modified any value that is later read by transaction n, transaction n will wait for transaction k to commit or abort.

2. The setting of the high-water mark when transaction n calls READ_CURRENT_VALUE, together with the test of the high-water mark in NEW_VERSION, ensures that if any transaction j < n tries to modify any value after transaction n has read that value, transaction j will abort and not modify that value.

3. Therefore, every value that READ_CURRENT_VALUE returns to transaction n will include the final effect of all preceding transactions 1...n – 1.

4. Therefore, every transaction n will act as if it serially follows transaction n – 1.

Optimistic coordination disciplines such as read-capture have the possibly surprising effect that something done by a transaction later in the serial ordering can cause a transaction earlier in the ordering to abort. This effect is the price of optimism; to be a good candidate for an optimistic discipline, an application probably should not have a lot of data interference.

A subtlety of read-capture is that it is necessary to implement bootstrapping before-or-after atomicity in the procedure NEW_VERSION, by adding a lock and calls to ACQUIRE and RELEASE, because NEW_VERSION can now be called by two concurrent threads that happen to add new versions to the same variable at about the same time. In addition, NEW_VERSION must be careful to keep versions of the same variable in transaction order, so that the backward search performed by READ_CURRENT_VALUE works correctly.

There is one final detail, an interaction with all-or-nothing recovery. High-water marks should be stored in volatile memory, so that following a crash (which has the effect of aborting all pending transactions) the high-water marks automatically disappear and thus don’t cause unnecessary aborts.

9.4.4 Does Anyone Actually Use Version Histories for Before-or-After Atomicity?

The answer is yes, but the most common use is in an application not likely to be encountered by a software specialist. Legacy processor architectures typically provide a limited number of registers (the “architectural registers”) in which the programmer can hold temporary results, but modern large scale integration technology allows space on a physical chip for many more physical registers than the architecture calls for. More registers generally allow better performance, especially in multiple-issue processor designs, which execute several sequential instructions concurrently whenever possible. To allow use of the many physical registers, a register mapping scheme known as register renaming implements a version history for the architectural registers. This version history allows instructions that would interfere with each other only because of a shortage of registers to execute concurrently.

For example, Intel Pentium processors, which are based on the x86 instruction set architecture described in Section 5.7, have only eight architectural registers. The Pentium 4 has 128 physical registers, and a register renaming scheme based on a circular reorder buffer. A reorder buffer resembles a direct hardware implementation of the procedures NEW_VERSION and WRITE_VALUE of Figure 9.29. As each instruction issues (which corresponds to BEGIN_TRANSACTION), it is assigned the next sequential slot in the reorder buffer. The slot is a map that maintains a correspondence between two numbers: the number of the architectural register that the programmer specified to hold the output value of the instruction, and the number of one of the 128 physical registers, the one that will actually hold that output value. Since machine instructions have just one output value, assigning a slot in the reorder buffer implements in a single step the effect of both NEW_OUTCOME_RECORD and NEW_VERSION. Similarly, when the instruction commits, it places its output in that physical register, thereby implementing WRITE_VALUE and COMMIT as a single step.

Figure 9.34 illustrates register renaming with a reorder buffer. In the program sequence of that example, instruction n uses architectural register five to hold an output value that instruction n + 1 will use as an input. Instruction n + 2 loads architectural register five from memory. Register renaming allows there to be two (or more) versions of register five simultaneously, one version (in physical register 42) containing a value for use by instructions n and n + 1 and the second version (in physical register 29) to be used by instruction n + 2. The performance benefit is that instruction n + 2 (and any later instructions that write into architectural register 5) can proceed concurrently with instructions n and n + 1. An instruction following instruction n + 2 that requires the new value in architectural register five as an input uses a hardware implementation of READ_CURRENT_VALUE to locate the most recent preceding mapping of architectural register five in the reorder buffer. In this case that most recent mapping is to physical register 29.


The later instruction then stalls, waiting for instruction n + 2 to write a value into phys­ ical register 29. Later instructions that reuse architectural register five for some purpose that does not require that version can proceed concurrently. Although register renaming is conceptually straightforward, the mechanisms that pre­ vent interference when there are dependencies between instructions tend to be more intricate than either of the mark-point or read-capture disciplines, so this description has been oversimplified. For more detail, the reader should consult a textbook on processor architecture, for example Computer Architecture, a Quantitative Approach, by Hennessy and Patterson [Suggestions for Further Reading 1.1.1]. The Oracle database management system offers several before-or-after atomicity methods, one of which it calls “serializable”, though the label may be a bit misleading. This method uses a before-or-after atomicity scheme that the database literature calls snapshot isolation. The idea is that when a transaction begins the system conceptually takes a snapshot of every committed value and the transaction reads all of its inputs from that snapshot. If two concurrent transactions (which might start with the same snapshot) modify the same variable, the first one to commit wins; the system aborts the other one with a “serialization error”. This scheme effectively creates a limited variant of a version

[Figure 9.34 shows three reorder-buffer entries pointing into a physical register file of 128 registers (numbered 0 through 127). The mapping held by the three entries is:]

instruction    architectural register    physical register
n              R5                        42
n + 1          R4                        61
n + 2          R5                        29

FIGURE 9.34 Example showing how a reorder buffer maps architectural register numbers to physical register numbers. The program sequence corresponding to the three entries is:

n        R5 ← R4 × R2          // Write a result in register five.
n + 1    R4 ← R5 + R1          // Use result in register five.
n + 2    R5 ← READ (117492)    // Write content of a memory cell in register five.

Instructions n and n + 2 both write into register R5, so R5 has two versions, with mappings to physical registers 42 and 29, respectively. Instruction n + 2 can thus execute concurrently with instructions n and n + 1.


history that, in certain situations, does not always ensure that concurrent transactions are correctly coordinated. Another specialized variant implementation of version histories, known as transac­ tional memory, is a discipline for creating atomic actions from arbitrary instruction sequences that make multiple references to primary memory. Transactional memory was first suggested in 1993 and with widespread availability of multicore processors, has become the subject of quite a bit of recent research interest because it allows the applica­ tion programmer to use concurrent threads without having to deal with locks. The discipline is to mark the beginning of an instruction sequence that is to be atomic with a “begin transaction” instruction, direct all ensuing STORE instructions to a hidden copy of the data that concurrent threads cannot read, and at end of the sequence check to see that nothing read or written during the sequence was modified by some other transaction that committed first. If the check finds no such earlier modifications, the system com­ mits the transaction by exposing the hidden copies to concurrent threads; otherwise it discards the hidden copies and the transaction aborts. Because it defers all discovery of interference to the commit point this discipline is even more optimistic than the readcapture discipline described in Section 9.4.3 above, so it is most useful in situations where interference between concurrent threads is possible but unlikely. Transactional memory has been experimentally implemented in both hardware and software. Hard­ ware implementations typically involve tinkering with either a cache or a reorder buffer to make it defer writing hidden copies back to primary memory until commit time, while software implementations create hidden copies of changed variables somewhere else in primary memory. As with instruction renaming, this description of transactional mem­ ory is somewhat oversimplified, and the interested reader should consult the literature for fuller explanations. Other software implementations of version histories for before-or-after atomicity have been explored primarily in research environments. Designers of database systems usually use locks rather than version histories because there is more experience in achiev­ ing high performance with locks. Before-or-after atomicity by using locks systematically is the subject of the next section of this chapter.
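To make the caveat about snapshot isolation concrete, here is a toy, single-threaded Python sketch of the first-committer-wins rule. It is not Oracle's implementation; the variable names and the intended invariant (at least one of x and y stays equal to 1) are invented for this example. Two transactions that each read both variables but write different ones both commit, yet the combined result could not have come from any serial ordering of the two; this anomaly is usually called write skew.

committed = {"x": 1, "y": 1}   # application invariant: at least one of x, y stays 1
commit_log = []                # write set of each committed transaction, in commit order

class Transaction:
    def __init__(self):
        self.start = len(commit_log)       # snapshot taken at start
        self.snapshot = dict(committed)
        self.writes = {}

    def read(self, name):
        return self.writes.get(name, self.snapshot[name])

    def write(self, name, value):
        self.writes[name] = value

    def commit(self):
        # First-committer-wins: abort if anything we wrote was also written by
        # a transaction that committed after our snapshot was taken.
        for names in commit_log[self.start:]:
            if names & set(self.writes):
                raise RuntimeError("serialization error")
        committed.update(self.writes)
        commit_log.append(set(self.writes))

t1, t2 = Transaction(), Transaction()      # both start from the same snapshot
if t1.read("x") + t1.read("y") >= 2:       # t1 believes it is safe to clear x
    t1.write("x", 0)
if t2.read("x") + t2.read("y") >= 2:       # t2 believes it is safe to clear y
    t2.write("y", 0)
t1.commit()
t2.commit()          # also commits: write sets {"x"} and {"y"} do not overlap
print(committed)     # {'x': 0, 'y': 0}, an outcome no serial order could produce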

9.5 Before-or-After Atomicity II: Pragmatics

The previous section showed that a version history system that provides all-or-nothing atomicity can be extended to also provide before-or-after atomicity. When the all-or-nothing atomicity design uses a log and installs data updates in cell storage, other, concurrent actions can again immediately see those updates, so we again need a scheme to provide before-or-after atomicity. When a system uses logs for all-or-nothing atomicity, it usually adopts the mechanism introduced in Chapter 5—locks—for before-or-after atomicity. However, as Chapter 5 pointed out, programming with locks is hazardous, and the traditional programming technique of debugging until the answers seem to be correct is unlikely to catch all locking errors. We now revisit locks, this time with the goal


of using them in stylized ways that allow us to develop arguments that the locks correctly implement before-or-after atomicity.

9.5.1 Locks

To review, a lock is a flag associated with a data object and set by an action to warn other, concurrent, actions not to read or write the object. Conventionally, a locking scheme involves two procedures:

ACQUIRE (A.lock)

marks a lock variable associated with object A as having been acquired. If the object is already acquired, ACQUIRE waits until the previous acquirer releases it.

RELEASE (A.lock)

unmarks the lock variable associated with A, perhaps ending some other action’s wait for that lock. For the moment, we assume that the semantics of a lock follow the single-acquire protocol of Chapter 5: if two or more actions attempt to acquire a lock at about the same time, only one will succeed; the others must find the lock already acquired. In Section 9.5.4 we will consider some alternative protocols, for example one that permits several readers of a variable as long as there is no one writing it.

The biggest problem with locks is that programming errors can create actions that do not have the intended before-or-after property. Such errors can open the door to races that, because the interfering actions are timing dependent, can make it extremely difficult to figure out what went wrong. Thus a primary goal is that coordination of concurrent transactions should be arguably correct. For locks, the way to achieve this goal is to follow three steps systematically:

• Develop a locking discipline that specifies which locks must be acquired and when.

• Establish a compelling line of reasoning that concurrent transactions that follow the discipline will have the before-or-after property.

• Interpose a lock manager, a program that enforces the discipline, between the programmer and the ACQUIRE and RELEASE procedures.

Many locking disciplines have been designed and deployed, including some that fail to correctly coordinate transactions (for an example, see exercise 9.5). We examine three disciplines that succeed. Each allows more concurrency than its predecessor, though even the best one is not capable of guaranteeing that concurrency is maximized.

The first, and simplest, discipline that coordinates transactions correctly is the systemwide lock. When the system first starts operation, it creates a single lockable variable named, for example, System, in volatile memory. The discipline is that every transaction must start with


begin_transaction
    ACQUIRE (System.lock)
    …

and every transaction must end with

    …
    RELEASE (System.lock)
end_transaction

A system can even enforce this discipline by including the ACQUIRE and RELEASE steps in the code sequence generated for begin_transaction and end_transaction, independent of whether the result was COMMIT or ABORT. Any programmer who creates a new transaction then has a guarantee that it will run either before or after any other transactions.

The systemwide lock discipline allows only one transaction to execute at a time. It serializes potentially concurrent transactions in the order that they call ACQUIRE. The systemwide lock discipline is in all respects identical to the simple serialization discipline of Section 9.4. In fact, the simple serialization pseudocode

    id ← NEW_OUTCOME_RECORD ()
    preceding_id ← id - 1
    wait until preceding_id.outcome_record.value ≠ PENDING
    …
    COMMIT (id) [or ABORT (id)]

and the systemwide lock invocation

    ACQUIRE (System.lock)
    …
    RELEASE (System.lock)

are actually just two implementations of the same idea. As with simple serialization, systemwide locking restricts concurrency in cases where it doesn’t need to because it locks all data touched by every transaction. For example, if systemwide locking were applied to the funds TRANSFER program of Figure 9.16, only one transfer could occur at a time, even though any individual transfer involves only two out of perhaps several million accounts, so there would be many opportunities for concur­ rent, non-interfering transfers. Thus there is an interest in developing less restrictive locking disciplines. The starting point is usually to employ a finer lock granularity: lock smaller objects, such as individual data records, individual pages of data records, or even fields within records. The trade-offs in gaining concurrency are first, that when there is more than one lock, more time is spent acquiring and releasing locks and second, cor­ rectness arguments become more complex. One hopes that the performance gain from concurrency exceeds the cost of acquiring and releasing the multiple locks. Fortunately, there are at least two other disciplines for which correctness arguments are feasible, simple locking and two-phase locking.
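Before turning to those two disciplines, here is a minimal Python sketch of the systemwide-lock discipline itself, using the standard threading module. The helper run_transaction and the single global lock are inventions of this sketch; a real system would fold the ACQUIRE and RELEASE into begin_transaction and end_transaction as described above.

import threading

system_lock = threading.Lock()       # the single lockable variable "System"

def run_transaction(body, *args):
    # begin_transaction: acquire the systemwide lock before touching any data
    with system_lock:
        result = body(*args)         # all reading and writing happens here
        # end_transaction (commit or abort): the lock is released on exit
        return result

# Example: only one transfer at a time can run, even on unrelated accounts.
accounts = {"A": 100, "B": 50}

def transfer(src, dst, amount):
    accounts[src] -= amount
    accounts[dst] += amount

run_transaction(transfer, "A", "B", 10)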


9.5.2 Simple Locking

The second locking discipline, known as simple locking, is similar in spirit to, though not quite identical with, the mark-point discipline. The simple locking discipline has two rules. First, each transaction must acquire a lock for every shared data object it intends to read or write before doing any actual reading and writing. Second, it may release its locks only after the transaction installs its last update and commits or completely restores the data and aborts. Analogous to the mark point, the transaction has what is called a lock point: the first instant at which it has acquired all of its locks. The collection of locks it has acquired when it reaches its lock point is called its lock set. A lock manager can enforce simple locking by requiring that each transaction supply its intended lock set as an argument to the begin_transaction operation, which acquires all of the locks of the lock set, if necessary waiting for them to become available. The lock manager can also interpose itself on all calls to read data and to log changes, to verify that they refer to variables that are in the lock set. The lock manager also intercepts the call to commit or abort (or, if the application uses roll-forward recovery, to log an END record) at which time it automatically releases all of the locks of the lock set.

The simple locking discipline correctly coordinates concurrent transactions. We can make that claim using a line of argument analogous to the one used for correctness of the mark-point discipline. Imagine that an all-seeing outside observer maintains an ordered list to which it adds each transaction identifier as soon as the transaction reaches its lock point and removes it from the list when it begins to release its locks. Under the simple locking discipline each transaction has agreed not to read or write anything until that transaction has been added to the observer’s list. We also know that all transactions that precede this one in the list must have already passed their lock point. Since no data object can appear in the lock sets of two transactions that are on the list at the same time, no data object in any transaction’s lock set appears in the lock set of the transaction preceding it in the list, and by induction to any transaction earlier in the list. Thus all of this transaction’s input values are the same as they will be when the preceding transaction in the list commits or aborts. The same argument applies to the transaction before the preceding one, so all inputs to any transaction are identical to the inputs that would be available if all the transactions ahead of it in the list ran serially, in the order of the list. Thus the simple locking discipline ensures that this transaction runs completely after the preceding one and completely before the next one. Concurrent transactions will produce results as if they had been serialized in the order that they reached their lock points.

As with the mark-point discipline, simple locking can miss some opportunities for concurrency. In addition, the simple locking discipline creates a problem that can be significant in some applications. Because it requires the transaction to acquire a lock on every shared object that it will either read or write (recall that the mark-point discipline requires marking only of shared objects that the transaction will write), applications that discover which objects need to be read by reading other shared data objects have no alternative but to lock every object that they might need to read.
To the extent that the set of objects that an application might need to read is larger than the set for which it eventually


does read, the simple locking discipline can interfere with opportunities for concurrency. On the other hand, when the transaction is straightforward (such as the TRANSFER trans­ action of Figure 9.16, which needs to lock only two records, both of which are known at the outset) simple locking can be effective.
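A small Python sketch of a lock manager that enforces simple locking follows. The class and method names are inventions of this sketch; the lock set is declared at begin_transaction, every access is checked against it, and everything is released together at the end. Acquiring the set in sorted order is one simple way to keep two such transactions from deadlocking against each other.

import threading

class SimpleLockManager:
    def __init__(self):
        self.locks = {}                      # object name -> threading.Lock
        self.registry_lock = threading.Lock()

    def _lock_for(self, name):
        with self.registry_lock:
            return self.locks.setdefault(name, threading.Lock())

    def begin_transaction(self, lock_set):
        # Acquire every lock in the declared lock set up front (the lock point).
        held = []
        for name in sorted(lock_set):        # fixed acquisition order
            lock = self._lock_for(name)
            lock.acquire()
            held.append(lock)
        return set(lock_set), held

    def check_access(self, txn, name):
        # Interposed on reads and on logging of changes.
        lock_set, _ = txn
        if name not in lock_set:
            raise RuntimeError(f"access to {name!r} outside the declared lock set")

    def end_transaction(self, txn):
        # Called at commit or abort: release the entire lock set at once.
        _, held = txn
        for lock in held:
            lock.release()

# Usage sketch: a transfer declares both accounts before touching either one.
mgr = SimpleLockManager()
txn = mgr.begin_transaction({"account_A", "account_B"})
mgr.check_access(txn, "account_A")
mgr.check_access(txn, "account_B")
mgr.end_transaction(txn)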

9.5.3 Two-Phase Locking

The third locking discipline, called two-phase locking, like the read-capture discipline, avoids the requirement that a transaction must know in advance which locks to acquire. Two-phase locking is widely used, but it is harder to argue that it is correct. The two-phase locking discipline allows a transaction to acquire locks as it proceeds, and the transaction may read or write a data object as soon as it acquires a lock on that object. The primary constraint is that the transaction may not release any locks until it passes its lock point. Further, the transaction can release a lock on an object that it only reads any time after it reaches its lock point if it will never need to read that object again, even to abort. The name of the discipline comes about because the number of locks acquired by a transaction monotonically increases up to the lock point (the first phase), after which it monotonically decreases (the second phase). Just as with simple locking, two-phase locking orders concurrent transactions so that they produce results as if they had been serialized in the order they reach their lock points. A lock manager can implement two-phase locking by intercepting all calls to read and write data; it acquires a lock (perhaps having to wait) on the first use of each shared variable. As with simple locking, it then holds the locks until it intercepts the call to commit, abort, or log the END record of the transaction, at which time it releases them all at once.

The extra flexibility of two-phase locking makes it harder to argue that it guarantees before-or-after atomicity. Informally, once a transaction has acquired a lock on a data object, the value of that object is the same as it will be when the transaction reaches its lock point, so reading that value now must yield the same result as waiting till then to read it. Furthermore, releasing a lock on an object that it hasn’t modified must be harmless if this transaction will never look at the object again, even to abort. A formal argument that two-phase locking leads to correct before-or-after atomicity can be found in most advanced texts on concurrency control and transactions. See, for example, Transaction Processing, by Gray and Reuter [Suggestions for Further Reading 1.1.5].

The two-phase locking discipline can potentially allow more concurrency than the simple locking discipline, but it still unnecessarily blocks certain serializable, and therefore correct, action orderings. For example, suppose transaction T1 reads X and writes Y, while transaction T2 just does a (blind) write to Y. Because the lock sets of T1 and T2 intersect at variable Y, the two-phase locking discipline will force transaction T2 to run either completely before or completely after T1. But the sequence

T1: READ X

T2: WRITE Y

T1: WRITE Y


in which the write of T2 occurs between the two steps of T1, yields the same result as running T2 completely before T1, so the result is always correct, even though this sequence would be prevented by two-phase locking. Disciplines that allow all possible concurrency while at the same time ensuring before-or-after atomicity are quite difficult to devise. (Theorists identify the problem as NP-complete.) There are two interactions between locks and logs that require some thought: (1) individual transactions that abort, and (2) system recovery. Aborts are the easiest to deal with. Since we require that an aborting transaction restore its changed data objects to their original values before releasing any locks, no special account need be taken of aborted transactions. For purposes of before-or-after atomicity they look just like com­ mitted transactions that didn’t change anything. The rule about not releasing any locks on modified data before the end of the transaction is essential to accomplishing an abort. If a lock on some modified object were released, and then the transaction decided to abort, it might find that some other transaction has now acquired that lock and changed the object again. Backing out an aborted change is likely to be impossible unless the locks on modified objects have been held. The interaction between log-based recovery and locks is less obvious. The question is whether locks themselves are data objects for which changes should be logged. To ana­ lyze this question, suppose there is a system crash. At the completion of crash recovery there should be no pending transactions because any transactions that were pending at the time of the crash should have been rolled back by the recovery procedure, and recov­ ery does not allow any new transactions to begin until it completes. Since locks exist only to coordinate pending transactions, it would clearly be an error if there were locks still set when crash recovery is complete. That observation suggests that locks belong in vol­ atile storage, where they will automatically disappear on a crash, rather than in non­ volatile storage, where the recovery procedure would have to hunt them down to release them. The bigger question, however, is whether or not the log-based recovery algorithm will construct a correct system state—correct in the sense that it could have arisen from some serial ordering of those transactions that committed before the crash. Continue to assume that the locks are in volatile memory, and at the instant of a crash all record of the locks is lost. Some set of transactions—the ones that logged a BEGIN record but have not yet logged an END record—may not have been completed. But we know that the transactions that were not complete at the instant of the crash had nonoverlapping lock sets at the moment that the lock values vanished. The recovery algo­ rithm of Figure 9.23 will systematically UNDO or REDO installs for the incomplete transactions, but every such UNDO or REDO must modify a variable whose lock was in some transaction’s lock set at the time of the crash. Because those lock sets must have been non-overlapping, those particular actions can safely be redone or undone without con­ cern for before-or-after atomicity during recovery. Put another way, the locks created a particular serialization of the transactions and the log has captured that serialization. 
Since RECOVER performs UNDO actions in reverse order as specified in the log, and it per­ forms REDO actions in forward order, again as specified in the log, RECOVER reconstructs exactly that same serialization. Thus even a recovery algorithm that reconstructs the


entire database from the log is guaranteed to produce the same serialization as when the transactions were originally performed. So long as no new transactions begin until recov­ ery is complete, there is no danger of miscoordination, despite the absence of locks during recovery.
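By way of summary, here is a minimal Python sketch of a two-phase-locking lock manager: a lock is acquired the first time a transaction touches a variable (the growing phase) and all locks are released together when the transaction commits, aborts, or logs its END record (the shrinking phase). The names are inventions of this sketch, and it makes no attempt to deal with deadlock, which is the loose end taken up in Section 9.5.5.

import threading

class TwoPhaseLockManager:
    def __init__(self):
        self.locks = {}                        # variable name -> threading.Lock
        self.registry_lock = threading.Lock()

    def _lock_for(self, name):
        with self.registry_lock:
            return self.locks.setdefault(name, threading.Lock())

    def touch(self, txn_locks, name):
        # Growing phase: acquire on first use; may block (and may deadlock).
        if name not in txn_locks:
            lock = self._lock_for(name)
            lock.acquire()
            txn_locks[name] = lock

    def finish(self, txn_locks):
        # Shrinking phase, done all at once at commit, abort, or END-record time.
        for lock in txn_locks.values():
            lock.release()
        txn_locks.clear()

mgr = TwoPhaseLockManager()
my_locks = {}                    # per-transaction record of locks held
mgr.touch(my_locks, "X")         # read X: lock acquired on first use
mgr.touch(my_locks, "Y")         # write Y: second lock acquired
mgr.finish(my_locks)             # commit: all locks released together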

9.5.4 Performance Optimizations

Most logging-locking systems are substantially more complex than the description so far might lead one to expect. The complications primarily arise from attempts to gain performance. In Section 9.3.6 we saw how buffering of disk I/O in a volatile memory cache, to allow reading, writing, and computation to go on concurrently, can complicate a logging system. Designers sometimes apply two performance-enhancing complexities to locking systems: physical locking and adding lock compatibility modes.

A performance-enhancing technique driven by buffering of disk I/O and physical media considerations is to choose a particular lock granularity known as physical locking. If a transaction makes a change to a six-byte object in the middle of a 1000-byte disk sector, or to a 1500-byte object that occupies parts of two disk sectors, there is a question about which “variable” should be locked: the object, or the disk sector(s)? If two concurrent threads make updates to unrelated data objects that happen to be stored in the same disk sector, then the two disk writes must be coordinated. Choosing the right locking granularity can make a big performance difference.

Locking application-defined objects without consideration of their mapping to physical disk sectors is appealing because it is understandable to the application writer. For that reason, it is usually called logical locking. In addition, if the objects are small, it apparently allows more concurrency: if another transaction is interested in a different object that is in the same disk sector, it could proceed in parallel. However, a consequence of logical locking is that logging must also be done on the same logical objects. Different parts of the same disk sector may be modified by different transactions that are running concurrently, and if one transaction commits but the other aborts neither the old nor the new disk sector is the correct one to restore following a crash; the log entries must record the old and new values of the individual data objects that are stored in the sector. Finally, recall that a high-performance logging system with a cache must, at commit time, force the log to disk and keep track of which objects in the cache it is safe to write to disk without violating the write-ahead log protocol. So logical locking with small objects can escalate cache record-keeping.

Backing away from the details, high-performance disk management systems typically require that the argument of a PUT call be a block whose size is commensurate with the size of a disk sector. Thus the real impact of logical locking is to create a layer between the application and the disk management system that presents a logical, rather than a physical, interface to its transaction clients; such things as data object management and garbage collection within disk sectors would go into this layer. The alternative is to tailor the logging and locking design to match the native granularity of the disk management system. Since matching the logging and locking granularity to the disk write granularity


can reduce the number of disk operations, both logging changes to and locking blocks that correspond to disk sectors rather than individual data objects is a common practice. Another performance refinement appears in most locking systems: the specification of lock compatibility modes. The idea is that when a transaction acquires a lock, it can specify what operation (for example, READ or WRITE) it intends to perform on the locked data item. If that operation is compatible—in the sense that the result of concurrent transactions is the same as some serial ordering of those transactions—then this transac­ tion can be allowed to acquire a lock even though some other transaction has already acquired a lock on that same data object. The most common example involves replacing the single-acquire locking protocol with the multiple-reader, single-writer protocol. According to this protocol, one can allow any number of readers to simultaneously acquire read-mode locks for the same object. The purpose of a read-mode lock is to ensure that no other thread can change the data while the lock is held. Since concurrent readers do not present an update threat, it is safe to allow any number of them. If another transaction needs to acquire a write-mode lock for an object on which several threads already hold read-mode locks, that new transaction will have to wait for all of the readers to release their read-mode locks. There are many applications in which a majority of data accesses are for reading, and for those applica­ tions the provision of read-mode lock compatibility can reduce the amount of time spent waiting for locks by orders of magnitude. At the same time, the scheme adds complexity, both in the mechanics of locking and also in policy issues, such as what to do if, while a prospective writer is waiting for readers to release their read-mode locks, another thread calls to acquire a read-mode lock. If there is a steady stream of arriving readers, a writer could be delayed indefinitely. This description of performance optimizations and their complications is merely illustrative, to indicate the range of opportunities and kinds of complexity that they engender; there are many other performance-enhancement techniques, some of which can be effective, and others that are of dubious value; most have different values depend­ ing on the application. For example, some locking disciplines compromise before-or­ after atomicity by allowing transactions to read data values that are not yet committed. As one might expect, the complexity of reasoning about what can or cannot go wrong in such situations escalates. If a designer intends to implement a system using performance enhancements such as buffering, lock compatibility modes, or compromised before-or­ after atomicity, it would be advisable to study carefully the book by Gray and Reuter, as well as existing systems that implement similar enhancements.
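The multiple-reader, single-writer protocol can be sketched in a few lines of Python. This version is deliberately simple: once a writer is waiting, newly arriving readers are held up behind it, which is one possible answer to the policy question raised above. The class name is an invention of this sketch.

import threading

class ReadWriteLock:
    def __init__(self):
        self.mutex = threading.Lock()          # protects the reader count
        self.no_readers = threading.Condition(self.mutex)
        self.writer_lock = threading.Lock()    # held by the (single) writer
        self.readers = 0

    def acquire_read(self):
        with self.writer_lock:                 # a writer in progress blocks new readers
            with self.mutex:
                self.readers += 1

    def release_read(self):
        with self.mutex:
            self.readers -= 1
            if self.readers == 0:
                self.no_readers.notify_all()

    def acquire_write(self):
        self.writer_lock.acquire()             # excludes other writers and new readers
        with self.mutex:
            while self.readers > 0:            # wait for current readers to drain
                self.no_readers.wait()

    def release_write(self):
        self.writer_lock.release()

A read-mode ACQUIRE would map to acquire_read and a write-mode ACQUIRE to acquire_write; the compatibility rule is simply that any number of readers may hold the lock together, but a writer holds it alone.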

9.5.5 Deadlock; Making Progress

Section 5.2.5 of Chapter 5 introduced the emergent problem of deadlock, the wait-for graph as a way of analyzing deadlock, and lock ordering as a way of preventing deadlock. With transactions and the ability to undo individual actions or even abort a transaction completely we now have more tools available to deal with deadlock, so it is worth revisiting that discussion.


The possibility of deadlock is an inevitable consequence of using locks to coordinate concurrent activities. Any number of concurrent transactions can get hung up in a deadlock, either waiting for one another, or simply waiting for a lock to be released by some transaction that is already deadlocked. Deadlock leaves us a significant loose end: correctness arguments assure us that any transactions that complete will produce results as though they were run serially, but they say nothing about whether or not any transaction will ever complete. In other words, our system may ensure correctness, in the sense that no wrong answers ever come out, but it does not ensure progress—no answers may come out at all.

As with methods for concurrency control, methods for coping with deadlock can also be described as pessimistic or optimistic. Pessimistic methods take a priori action to prevent deadlocks from happening. Optimistic methods allow concurrent threads to proceed, detect deadlocks if they happen, and then take action to fix things up. Here are some of the most popular methods:

1. Lock ordering (pessimistic). As suggested in Chapter 5, number the locks uniquely, and require that transactions acquire locks in ascending numerical order. With this plan, when a transaction encounters an already-acquired lock, it is always safe to wait for it, since the transaction that previously acquired it cannot be waiting for any locks that this transaction has already acquired—all those locks are lower in number than this one. There is thus a guarantee that somewhere, at least one transaction (the one holding the highest-numbered lock) can always make progress. When that transaction finishes, it will release all of its locks, and some other transaction will become the one that is guaranteed to be able to make progress. A generalization of lock ordering that may eliminate some unnecessary waits is to arrange the locks in a lattice and require that they be acquired in some lattice traversal order. The trouble with lock ordering, as with simple locking, is that some applications may not be able to predict all of the locks they need before acquiring the first one.

2. Backing out (optimistic): An elegant strategy devised by Andre Bensoussan in 1966 allows a transaction to acquire locks in any order, but if it encounters an already-acquired lock with a number lower than one it has previously acquired itself, the transaction must back up (in terms of this chapter, UNDO previous actions) just far enough to release its higher-numbered locks, wait for the lower-numbered lock to become available, acquire that lock, and then REDO the backed-out actions.

3. Timer expiration (optimistic). When a new transaction begins, the lock manager sets an interrupting timer to a value somewhat greater than the time it should take for the transaction to complete. If a transaction gets into a deadlock, its timer will expire, at which point the system aborts that transaction, rolling back its changes and releasing its locks in the hope that the other transactions involved in the deadlock may be able to proceed. If not, another one will time out, releasing further locks. Timing out deadlocks is effective, though it has the usual defect: it


is difficult to choose a suitable timer value that keeps things moving along but also accommodates normal delays and variable operation times. If the environment or system load changes, it may be necessary to readjust all such timer values, an activity that can be a real nuisance in a large system.

4. Cycle detection (optimistic). Maintain, in the lock manager, a wait-for graph (as described in Section 5.2.5) that shows which transactions have acquired which locks and which transactions are waiting for which locks. Whenever another transaction tries to acquire a lock, finds it is already locked, and proposes to wait, the lock manager examines the graph to see if waiting would produce a cycle, and thus a deadlock. If it would, the lock manager selects some cycle member to be a victim, and unilaterally aborts that transaction, so that the others may continue. The aborted transaction then retries in the hope that the other transactions have made enough progress to be out of the way and another deadlock will not occur.

When a system uses lock ordering, backing out, or cycle detection, it is common to also set a timer as a safety net because a hardware failure or a programming error such as an endless loop can create a progress-blocking situation that none of the deadlock detection methods can catch.

Since a deadlock detection algorithm can introduce an extra reason to abort a transaction, one can envision pathological situations where the algorithm aborts every attempt to perform some particular transaction, no matter how many times its invoker retries. Suppose, for example, that two threads named Alphonse and Gaston get into a deadlock trying to acquire locks for two objects named Apple and Banana: Alphonse acquires the lock for Apple, Gaston acquires the lock for Banana, Alphonse tries to acquire the lock for Banana and waits, then Gaston tries to acquire the lock for Apple and waits, creating the deadlock. Eventually, Alphonse times out and begins rolling back updates in preparation for releasing locks. Meanwhile, Gaston times out and does the same thing. Both restart, and they get into another deadlock, with their timers set to expire exactly as before, so they will probably repeat the sequence forever. Thus we still have no guarantee of progress. This is the emergent property that Chapter 5 called livelock, since formally no deadlock ever occurs and both threads are busy doing something that looks superficially useful.

One way to deal with livelock is to apply a randomized version of a technique familiar from Chapter 7 [on-line]: exponential random backoff. When a timer expiration leads to an abort, the lock manager, after clearing the locks, delays that thread for a random length of time, chosen from some starting interval, in the hope that the randomness will change the relative timing of the livelocked transactions enough that on the next try one will succeed and the other can then proceed without interference. If the transaction again encounters interference, it tries again, but on each retry not only does the lock manager choose a new random delay, but it also increases the interval from which the delay is chosen by some multiplicative constant, typically 2. Since on each retry there is an increased probability of success, one can push this probability as close to unity as desired by continued retries, with the expectation that the interfering transactions will


eventually get out of one another’s way. A useful property of exponential random backoff is that if repeated retries continue to fail it is almost certainly an indication of some deeper problem—perhaps a programming mistake or a level of competition for shared variables that is intrinsically so high that the system should be redesigned. The design of more elaborate algorithms or programming disciplines that guarantee progress is a project that has only modest potential payoff, and an end-to-end argument suggests that it may not be worth the effort. In practice, systems that would have frequent interference among transactions are not usually designed with a high degree of concur­ rency anyway. When interference is not frequent, simple techniques such as safety-net timers and exponential random backoff not only work well, but they usually must be provided anyway, to cope with any races or programming errors such as endless loops that may have crept into the system design or implementation. Thus a more complex progress-guaranteeing discipline is likely to be redundant, and only rarely will it get a chance to promote progress.
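The retry loop for exponential random backoff might look like the following Python sketch. The names run_once and Abort, the starting interval, the doubling constant, and the retry cap are all arbitrary choices of this sketch.

import random
import time

class Abort(Exception):
    """Raised when the lock manager times out or aborts this transaction."""

def run_with_backoff(run_once, initial_interval=0.01, multiplier=2, max_retries=10):
    interval = initial_interval
    for attempt in range(max_retries):
        try:
            return run_once()                  # the transaction body
        except Abort:
            # Wait a random time drawn from the current interval, then widen
            # the interval, so repeatedly colliding transactions spread out.
            time.sleep(random.uniform(0, interval))
            interval *= multiplier
    # Persistent failure is a signal of a deeper problem, as noted in the text.
    raise RuntimeError("transaction could not make progress after repeated retries")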

9.6 Atomicity across Layers and Multiple Sites

There remain some important gaps in our exploration of atomicity. First, in a layered system, a transaction implemented in one layer may consist of a series of component actions of a lower layer that are themselves atomic. The question is how the commitment of the lower-layer transactions should relate to the commitment of the higher-layer transaction. If the higher-layer transaction decides to abort, the question is what to do about lower-layer transactions that may have already committed. There are two possibilities:

• Reverse the effect of any committed lower-layer transactions with an UNDO action. This technique requires that the results of the lower-layer transactions be visible only within the higher-layer transaction.

• Somehow delay commitment of the lower-layer transactions and arrange that they actually commit at the same time that the higher-layer transaction commits.

Up to this point, we have assumed the first possibility. In this section we explore the second one.

Another gap is that, as described so far, our techniques to provide atomicity all involve the use of shared variables in memory or storage (for example, pointers to the latest version, outcome records, logs, and locks) and thus implicitly assume that the composite actions that make up a transaction all occur in close physical proximity. When the composing actions are physically separated, communication delay, communication reliability, and independent failure make atomicity both more important and harder to achieve.

We will edge up on both of these problems by first identifying a common subproblem: implementing nested transactions. We will then extend the solution to the nested transaction problem to create an agreement protocol, known as two-phase commit, that


procedure PAY_INTEREST (reference account)
    if account.balance > 0 then
        interest = account.balance * 0.05
        TRANSFER (bank, account, interest)
    else
        interest = account.balance * 0.15
        TRANSFER (account, bank, interest)

procedure MONTH_END_INTEREST ()
    for A ← each customer_account do
        PAY_INTEREST (A)

FIGURE 9.35 An example of two procedures, one of which calls the other, yet each should be individually atomic.

coordinates commitment of lower-layer transactions. We can then extend the two-phase commit protocol, using a specialized form of remote procedure call, to coordinate steps that must be carried out at different places. This sequence is another example of boot­ strapping; the special case that we know how to handle is the single-site transaction and the more general problem is the multiple-site transaction. As an additional observation, we will discover that multiple-site transactions are quite similar to, but not quite the same as, the dilemma of the two generals.

9.6.1 Hierarchical Composition of Transactions

We got into the discussion of transactions by considering that complex interpreters are engineered in layers, and that each layer should implement atomic actions for its next-higher, client layer. Thus transactions are nested, each one typically consisting of multiple lower-layer transactions. This nesting requires that some additional thought be given to the mechanism of achieving atomicity.

Consider again a banking example. Suppose that the TRANSFER procedure of Section 9.1.5 is available for moving funds from one account to another, and it has been implemented as a transaction. Suppose now that we wish to create the two application procedures of Figure 9.35. The first procedure, PAY_INTEREST, invokes TRANSFER to move an appropriate amount of money from or to an internal account named bank, the direction and rate depending on whether the customer account balance is positive or negative. The second procedure, MONTH_END_INTEREST, fulfills the bank’s intention to pay (or extract) interest every month on every customer account by iterating through the accounts and invoking PAY_INTEREST on each one. It would probably be inappropriate to have two invocations of MONTH_END_INTEREST running at the same time, but it is likely that at the same time that MONTH_END_INTEREST is running there are other banking activities in progress that are also invoking TRANSFER.


It is also possible that the for each statement inside MONTH_END_INTEREST actually runs several instances of its iteration (and thus of PAY_INTEREST) concurrently. Thus we have a need for three layers of transactions. The lowest layer is the TRANSFER procedure, in which debiting of one account and crediting of a second account must be atomic. At the next higher layer, the procedure PAY_INTEREST should be executed atomically, to ensure that some concurrent TRANSFER transaction doesn’t change the balance of the account between the positive/negative test and the calculation of the interest amount. Finally, the proce­ dure MONTH_END_INTEREST should be a transaction, to ensure that some concurrent TRANSFER transaction does not move money from an account A to an account B between the interest-payment processing of those two accounts, since such a transfer could cause the bank to pay interest twice on the same funds. Structurally, an invocation of the TRANS­ FER procedure is nested inside PAY_INTEREST, and one or more concurrent invocations of PAY_INTEREST are nested inside MONTH_END_INTEREST. The reason nesting is a potential problem comes from a consideration of the commit steps of the nested transactions. For example, the commit point of the TRANSFER transac­ tion would seem to have to occur either before or after the commit point of the PAY_INTEREST transaction, depending on where in the programming of PAY_INTEREST we place its commit point. Yet either of these positions will cause trouble. If the TRANSFER commit occurs in the pre-commit phase of PAY_INTEREST then if there is a system crash PAY_INTEREST will not be able to back out as though it hadn’t tried to operate because the values of the two accounts that TRANSFER changed may have already been used by concur­ rent transactions to make payment decisions. But if the TRANSFER commit does not occur until the post-commit phase of PAY_INTEREST, there is a risk that the transfer itself can not be completed, for example because one of the accounts is inaccessible. The conclusion is that somehow the commit point of the nested transaction should coincide with the com­ mit point of the enclosing transaction. A slightly different coordination problem applies to MONTH_END_INTEREST: no TRANSFERs by other transactions should occur while it runs (that is, it should run either before or after any concurrent TRANSFER transactions), but it must be able to do multiple TRANSFERs itself, each time it invokes PAY_INTEREST, and its own possibly concurrent transfer actions must be before-or-after actions, since they all involve the account named “bank”. Suppose for the moment that the system provides transactions with version histories. We can deal with nesting problems by extending the idea of an outcome record: we allow outcome records to be organized hierarchically. Whenever we create a nested transaction, we record in its outcome record both the initial state (PENDING) of the new transaction and the identifier of the enclosing transaction. The resulting hierarchical arrangement of out­ come records then exactly reflects the nesting of the transactions. A top-layer outcome record would contain a flag to indicate that it is not nested inside any other transaction. When an outcome record contains the identifier of a higher-layer transaction, we refer to it as a dependent outcome record, and the record to which it refers is called its superior. 
The transactions, whether nested or enclosing, then go about their business, and depending on their success mark their own outcome records COMMITTED or ABORTED, as usual. However, when READ_CURRENT_VALUE (described in Section 9.4.2) examines the status of a version to see whether or not the transaction that created it is COMMITTED, it must additionally check to see if the outcome record contains a reference to a superior outcome record. If so, it must follow the reference and check the status of the superior. If that record says that it, too, is COMMITTED, it must continue following the chain upward, if necessary all the way to the highest-layer outcome record. The transaction in question is actually COMMITTED only if all the records in the chain are in the COMMITTED state. If any record in the chain is ABORTED, this transaction is actually ABORTED, despite the COMMITTED claim in its own outcome record. Finally, if neither of those situations holds, then there must be one or more records in the chain that are still PENDING. The outcome of this transaction remains PENDING until those records become COMMITTED or ABORTED.

Thus the outcome of an apparently-COMMITTED dependent outcome record actually depends on the outcomes of all of its ancestors. We can describe this situation by saying that, until all its ancestors commit, this lower-layer transaction is sitting on a knife-edge, at the point of committing but still capable of aborting if necessary. For purposes of discussion we will identify this situation as a distinct virtual state of the outcome record and the transaction, by saying that the transaction is tentatively committed.
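The rule for deciding the effective state of a nested transaction can be written compactly. The following Python sketch is illustrative only; the field names outcome and superior follow the description above, and TENTATIVE is this sketch's name for the tentatively committed virtual state.

PENDING, COMMITTED, ABORTED, TENTATIVE = "PENDING", "COMMITTED", "ABORTED", "TENTATIVE"

class OutcomeRecord:
    def __init__(self, superior=None):
        self.outcome = PENDING
        self.superior = superior        # None for a top-layer transaction

def effective_outcome(record):
    # Walk the chain of superiors; the transaction is really committed only
    # if every record in the chain says COMMITTED.
    saw_pending = False
    r = record
    while r is not None:
        if r.outcome == ABORTED:
            return ABORTED              # any aborted ancestor aborts the whole chain
        if r.outcome == PENDING:
            saw_pending = True
        r = r.superior
    if saw_pending:
        # The record itself may already say COMMITTED; if so it is only
        # tentatively committed, awaiting its ancestors.
        return TENTATIVE if record.outcome == COMMITTED else PENDING
    return COMMITTED

# usage: MONTH_END_INTEREST encloses PAY_INTEREST, which encloses TRANSFER
month_end = OutcomeRecord()
pay_interest = OutcomeRecord(superior=month_end)
transfer = OutcomeRecord(superior=pay_interest)
transfer.outcome = pay_interest.outcome = COMMITTED
print(effective_outcome(transfer))      # TENTATIVE: month_end is still PENDING
month_end.outcome = COMMITTED
print(effective_outcome(transfer))      # COMMITTED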
The READ_CURRENT_VALUE algorithm, as implemented in Section 9.4.2, doesn’t distinguish between reads arising within the same group of nested transactions and reads from some completely unrelated transaction. Figure 9.36 illustrates the situation. If the test in READ_CURRENT_VALUE for committed values is extended by simply following the ancestry of the outcome record controlling the latest version, it will undoubtedly force the second invocation of PAY_INTEREST to wait pending the final outcome of the first invocation of PAY_INTEREST. But since the outcome of that first invocation depends on the outcome of MONTH_END_INTEREST, and the outcome of MONTH_END_INTEREST currently depends on the success of the second invocation of PAY_INTEREST, we have a built-in cycle of waits that at best can only time out and abort. Since blocking the read would be a mistake, the question of when it might be OK to permit reading of data values created by tentatively COMMITTED transactions requires some further thought.

[Figure 9.36 is a diagram of the linked outcome records for MONTH_END_INTEREST (whose superior is none), its two nested invocations of PAY_INTEREST, and the TRANSFER transactions nested within them, together with the newest version of account bank, whose creator is TRANSFER1.]

FIGURE 9.36 Transaction TRANSFER2, nested in transaction PAY_INTEREST2, which is nested in transaction MONTH_END_INTEREST, wants to read the current value of account bank. But bank was last written by transaction TRANSFER1, which is nested in COMMITTED transaction PAY_INTEREST1, which is nested in still-PENDING transaction MONTH_END_INTEREST. Thus this version of bank is actually PENDING, rather than COMMITTED as one might conclude by looking only at the outcome of TRANSFER1. However, TRANSFER1 and TRANSFER2 share a common ancestor (namely, MONTH_END_INTEREST), and the chain of transactions leading from bank to that common ancestor is completely committed, so the read of bank can—and to avoid a deadlock, must—be allowed.

The before-or-after atomicity requirement is that no update made by a tentatively COMMITTED transaction should be visible to any transaction that would survive if for some reason the tentatively COMMITTED transaction ultimately aborts. Within that constraint, updates of tentatively COMMITTED transactions can freely be passed around.

We can achieve that goal in the following way: compare the outcome record ancestry of the transaction doing the read with the ancestry of the outcome record that controls the version to be read. If these ancestries do not merge (that is, there is no common ancestor) then the reader must wait for the version’s ancestry to be completely committed. If they do merge and all the transactions in the ancestry of the data version that are below the point of the merge are tentatively committed, no wait is necessary. Thus, in Figure 9.36, MONTH_END_INTEREST might be running the two (or more) invocations of PAY_INTEREST concurrently. Each invocation will call CREATE_NEW_VERSION as part of its plan to update the value of account “bank”, thereby establishing a serial order of the invocations. When later invocations of PAY_INTEREST call READ_CURRENT_VALUE to read the value of account “bank”, they will be forced to wait until all earlier invocations of PAY_INTEREST decide whether to commit or abort.
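This ancestry-merge test can also be sketched in code. The sketch below continues the hypothetical OutcomeRecord and effective_state of the earlier fragment and is likewise only an illustration; a real READ_CURRENT_VALUE would apply the same test to the outcome record controlling the version it is about to return.

    def ancestry(record):
        """The outcome records from record up to the highest layer, in order."""
        chain = []
        while record is not None:
            chain.append(record)
            record = record.superior
        return chain

    def read_may_proceed(reader_record, version_record):
        """True if the reader may use the version now; False if it must wait."""
        if effective_state(version_record) == COMMITTED:
            return True                       # fully committed data is always readable
        reader_ids = {id(r) for r in ancestry(reader_record)}
        below_merge = []
        for node in ancestry(version_record):
            if id(node) in reader_ids:
                # Found the common ancestor. The read may proceed only if every
                # record below the merge point has itself (tentatively) committed.
                return all(n.outcome == COMMITTED for n in below_merge)
            below_merge.append(node)
        return False                          # the ancestries never merge: wait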

9.6.2 Two-Phase Commit

Since a higher-layer transaction can comprise several lower-layer transactions, we can describe the commitment of a hierarchical transaction as involving two distinct phases. In the first phase, known variously as the preparation or voting phase, the higher-layer transaction invokes some number of distinct lower-layer transactions, each of which either aborts or, by committing, becomes tentatively committed. The top-layer transaction evaluates the situation to establish that all (or enough) of the lower-layer transactions are tentatively committed that it can declare the higher-layer transaction a success. Based on that evaluation, it either COMMITs or ABORTs the higher-layer transaction. Assuming it decides to commit, it enters the second, commitment phase, which in the simplest case consists of simply changing its own state from PENDING to COMMITTED or ABORTED. If it is the highest-layer transaction, at that instant all of the lower-layer tentatively committed transactions also become either COMMITTED or ABORTED. If it is itself nested in a still higher-layer transaction, it becomes tentatively committed and its component transactions continue in the tentatively committed state also. We are implementing here a coordination protocol known as two-phase commit. When we implement multiple-site atomicity in the next section, the distinction between the two phases will take on additional clarity.
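In code, the two phases amount to little more than a vote count followed by a single state change. The fragment below is an informal sketch continuing the hypothetical OutcomeRecord above, not the book's implementation; whether "all" or merely "enough" of the components must be tentatively committed is an application decision.

    def two_phase_commit(top, components):
        # Phase 1 (preparation, or voting): each component has by now either
        # aborted or moved to the tentatively committed state, which here shows
        # up as its own outcome record saying COMMITTED.
        success = all(c.outcome == COMMITTED for c in components)
        # Phase 2 (commitment): one write to the top-layer outcome record decides
        # the fate of every tentatively committed component at the same instant.
        top.outcome = COMMITTED if success else ABORTED
        return top.outcome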

If the system uses version histories for atomicity, the hierarchy of Figure 9.36 can be directly implemented by linking outcome records. If the system uses logs, a separate table of pending transactions can contain the hierarchy, and inquiries about the state of a transaction would involve examining this table.

The concept of nesting transactions hierarchically is useful in its own right, but our particular interest in nesting is that it is the first of two building blocks for multiple-site transactions. To develop the second building block, we next explore what makes multiple-site transactions different from single-site transactions.

9.6.3 Multiple-Site Atomicity: Distributed Two-Phase Commit

If a transaction requires executing component transactions at several sites that are separated by a best-effort network, obtaining atomicity is more difficult because any of the messages used to coordinate the transactions of the various sites can be lost, delayed, or duplicated. In Chapter 4 we learned of a method, known as Remote Procedure Call (RPC), for performing an action at another site. In Chapter 7[on-line] we learned how to design protocols such as RPC with a persistent sender to ensure at-least-once execution and duplicate suppression to ensure at-most-once execution. Unfortunately, neither of these two assurances is exactly what is needed to ensure atomicity of a multiple-site transaction. However, by properly combining a two-phase commit protocol with persistent senders, duplicate suppression, and single-site transactions, we can create a correct multiple-site transaction. We assume that each site, on its own, is capable of implementing local transactions, using techniques such as version histories or logs and locks for all-or-nothing atomicity and before-or-after atomicity. Correctness of the multiple-site atomicity protocol will be achieved if all the sites commit or if all the sites abort; we will have failed if some sites commit their part of a multiple-site transaction while others abort their part of that same transaction.

Suppose the multiple-site transaction consists of a coordinator Alice requesting component transactions X, Y, and Z of worker sites Bob, Charles, and Dawn, respectively. The simple expedient of issuing three remote procedure calls certainly does not produce a transaction for Alice because Bob may do X while Charles may report that he cannot do Y. Conceptually, the coordinator would like to send three messages, to the three workers, like this one to Bob:

    From: Alice
    To: Bob
    Re: my transaction 91
    if (Charles does Y and Dawn does Z) then do X, please.

and let the three workers handle the details. We need some clue how Bob could accomplish this strange request. The clue comes from recognizing that the coordinator has created a higher-layer transaction and each of the workers is to perform a transaction that is nested in the higher-layer transaction. Thus, what we need is a distributed version of the two-phase commit protocol. The complication is that the coordinator and workers cannot reliably
communicate. The problem thus reduces to constructing a reliable distributed version of the two-phase commit protocol. We can do that by applying persistent senders and duplicate suppression.

Phase one of the protocol starts with coordinator Alice creating a top-layer outcome record for the overall transaction. Then Alice begins persistently sending to Bob an RPC-like message:

    From: Alice
    To: Bob
    Re: my transaction 271
    Please do X as part of my transaction.

Similar messages go from Alice to Charles and Dawn, also referring to transaction 271, and requesting that they do Y and Z, respectively. As with an ordinary remote procedure call, if Alice doesn’t receive a response from one or more of the workers in a reasonable time she resends the message to the non-responding workers as many times as necessary to elicit a response. A worker site, upon receiving a request of this form, checks for duplicates and then creates a transaction of its own, but it makes the transaction a nested one, with its superior being Alice’s original transaction. It then goes about doing the pre-commit part of the requested action, reporting back to Alice that this much has gone well:

    From: Bob
    To: Alice
    Re: your transaction 271
    My part X is ready to commit.

Alice, upon collecting a complete set of such responses, then moves to the two-phase commit part of the transaction, by sending messages to each of Bob, Charles, and Dawn saying, e.g.:

    Two-phase-commit message #1:
    From: Alice
    To: Bob
    Re: my transaction 271
    PREPARE to commit X.

Bob, upon receiving this message, commits—but only tentatively—or aborts. Having created durable tentative versions (or logged to journal storage its planned updates) and having recorded an outcome record saying that it is PREPARED either to commit or abort, Bob then persistently sends a response to Alice reporting his state:

    Two-phase-commit message #2:
    From: Bob
    To: Alice
    Re: your transaction 271
    I am PREPARED to commit my part. Have you decided to commit yet? Regards.

or alternatively, a message reporting it has aborted. If Bob receives a duplicate request from Alice, his persistent sender sends back a duplicate of the PREPARED or ABORTED response.

At this point Bob, being in the PREPARED state, is out on a limb. Just as in a local hierarchical nesting, Bob must be able either to run to the end or to abort, to maintain that state of preparation indefinitely, and wait for someone else (Alice) to say which. In addition, the coordinator may independently crash or lose communication contact, increasing Bob’s uncertainty. If the coordinator goes down, all of the workers must wait until it recovers; in this protocol, the coordinator is a single point of failure.

As coordinator, Alice collects the response messages from her several workers (perhaps re-requesting PREPARED responses several times from some worker sites). If all workers send PREPARED messages, phase one of the two-phase commit is complete. If any worker responds with an abort message, or doesn’t respond at all, Alice has the usual choice of aborting the entire transaction or perhaps trying a different worker site to carry out that component transaction. Phase two begins when Alice commits the entire transaction by marking her own outcome record COMMITTED. Once the higher-layer outcome record is marked as COMMITTED or ABORTED, Alice sends a completion message back to each of Bob, Charles, and Dawn:

    Two-phase-commit message #3:
    From: Alice
    To: Bob
    Re: my transaction 271
    My transaction committed. Thanks for your help.

Each worker site, upon receiving such a message, changes its state from PREPARED to COMMITTED, performs any needed post-commit actions, and exits. Meanwhile, Alice can go about other business, with one important requirement for the future: she must remember, reliably and for an indefinite time, the outcome of this transaction. The reason is that one or more of her completion messages may have been lost. Any worker sites that are in the PREPARED state are awaiting the completion message to tell them which way to go. If a completion message does not arrive in a reasonable period of time, the persistent sender at the worker site will resend its PREPARED message. Whenever Alice receives a duplicate PREPARED message, she simply sends back the current state of the outcome record for the named transaction.

If a worker site that uses logs and locks crashes, the recovery procedure at that site has to take three extra steps. First, it must classify any PREPARED transaction as a tentative winner that it should restore to the PREPARED state. Second, if the worker is using locks for
before-or-after atomicity, the recovery procedure must reacquire any locks the PREPARED transaction was holding at the time of the failure. Finally, the recovery procedure must restart the persistent sender, to learn the current status of the higher-layer transaction. If the worker site uses version histories, only the last step, restarting the persistent sender, is required.

Since the workers act as persistent senders of their PREPARED messages, Alice can be confident that every worker will eventually learn that her transaction committed. But since the persistent senders of the workers are independent, Alice has no way of ensuring that they will act simultaneously. Instead, Alice can only be certain of eventual completion of her transaction. This distinction between simultaneous action and eventual action is critically important, as will soon be seen.

If all goes well, two-phase commit of N worker sites will be accomplished in 3N messages, as shown in Figure 9.37: for each worker site a PREPARE message, a PREPARED message in response, and a COMMIT message. This 3N message protocol is complete and sufficient, although there are several variations one can propose. An example of a simplifying variation is that the initial RPC request and response could also carry the PREPARE and PREPARED messages, respectively. However, once a worker sends a PREPARED message, it loses the ability to unilaterally abort, and it must remain on the knife edge awaiting instructions from the coordinator. To minimize this wait, it is usually preferable to delay the PREPARE/PREPARED message pair until the coordinator knows that the other workers seem to be in a position to do their parts.

Some versions of the distributed two-phase commit protocol have a fourth acknowledgment message from the worker sites to the coordinator. The intent is to collect a complete set of acknowledgment messages—the coordinator persistently sends completion messages until every site acknowledges. Once all acknowledgments are in, the coordinator can then safely discard its outcome record, since every worker site is known to have gotten the word.

A system that is concerned both about outcome record storage space and the cost of extra messages can use a further refinement, called presumed commit. Since one would expect that most transactions commit, we can use a slightly odd but very space-efficient representation for the value COMMITTED of an outcome record: non-existence. The coordinator answers any inquiry about a non-existent outcome record by sending a COMMITTED response. If the coordinator uses this representation, it commits by destroying the outcome record, so a fourth acknowledgment message from every worker is unnecessary. In return for this apparent magic reduction in both message count and space, we notice that outcome records for aborted transactions can not easily be discarded because if an inquiry arrives after discarding, the inquiry will receive the response COMMITTED. The coordinator can, however, persistently ask for acknowledgment of aborted transactions, and discard the outcome record after all these acknowledgments are in. This protocol that leads to discarding an outcome record is identical to the protocol described in Chapter 7[on-line] to close a stream and discard the record of that stream.

[Figure 9.37 is a timing diagram: coordinator Alice logs BEGIN and sends PREPARE messages for X, Y, and Z to workers Bob, Charles, and Dawn; each worker logs BEGIN and then PREPARED and answers with a PREPARED message (until then each is still free to commit or abort); Alice then logs COMMITTED and sends COMMIT messages, and each worker logs COMMITTED. Time runs downward in the diagram.]

FIGURE 9.37 Timing diagram for distributed two-phase commit, using 3N messages. (The initial RPC request and response messages are not shown.) Each of the four participants maintains its own version history or recovery log. The diagram shows log entries made by the coordinator and by one of the workers.

Distributed two-phase commit does not solve all multiple-site atomicity problems. For example, if the coordinator site (in this case, Alice) is aboard a ship that sinks after sending the PREPARE message but before sending the COMMIT or ABORT message, the worker sites are left in the PREPARED state with no way to proceed. Even without that concern, Alice and her co-workers are standing uncomfortably close to a multiple-site atomicity problem that, at least in principle, can not be solved. The only thing that rescues them is our observation that the several workers will do their parts eventually, not necessarily simultaneously. If she had required simultaneous action, Alice would have been in trouble. The unsolvable problem is known as the dilemma of the two generals.
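Before turning to that dilemma, the message flow just described can be condensed into a toy, single-process sketch. Everything in it is an assumption made for illustration: the class names, the shortcut of delivering messages by direct procedure call, and a run in which every worker votes PREPARED. A real implementation would add persistent senders, duplicate suppression, and per-site logging, as described above.

    PENDING, PREPARED, COMMITTED, ABORTED = "PENDING", "PREPARED", "COMMITTED", "ABORTED"

    class Worker:
        def __init__(self, name):
            self.name, self.state = name, PENDING

        def prepare(self, txid):
            # Write tentative versions (or log planned updates), record PREPARED,
            # and answer; from here on the worker may not abort unilaterally.
            self.state = PREPARED
            return PREPARED                   # a worker could instead answer ABORTED

        def complete(self, txid, outcome):
            if self.state == PREPARED:        # duplicates of this message are harmless
                self.state = outcome

    class Coordinator:
        def __init__(self, txid, workers):
            self.txid, self.workers = txid, workers
            self.outcome = PENDING            # the top-layer outcome record

        def run(self):
            votes = [w.prepare(self.txid) for w in self.workers]      # phase one
            self.outcome = COMMITTED if all(v == PREPARED for v in votes) else ABORTED
            for w in self.workers:                                    # phase two
                w.complete(self.txid, self.outcome)
            return self.outcome

        def answer_inquiry(self):
            # A worker whose completion message was lost resends PREPARED and asks
            # again; this is why the coordinator must remember the outcome indefinitely.
            return self.outcome

    workers = [Worker("Bob"), Worker("Charles"), Worker("Dawn")]
    alice = Coordinator(271, workers)
    assert alice.run() == COMMITTED and all(w.state == COMMITTED for w in workers)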

9.6.4 The Dilemma of the Two Generals

An important constraint on possible coordination protocols when communication is unreliable is captured in a vivid analogy, called the dilemma of the two generals.* Suppose that two small armies are encamped on two mountains outside a city. The city is well-enough defended that it can repulse and destroy either one of the two armies. Only if the two armies attack simultaneously can they take the city. Thus the two generals who command the armies desire to coordinate their attack.

The only method of communication between the two generals is to send runners from one camp to the other. But the defenders of the city have sentries posted in the valley separating the two mountains, so there is a chance that the runner, trying to cross the valley, will instead fall into enemy hands, and be unable to deliver the message. Suppose that the first general sends this message:

    From: Julius Caesar
    To: Titus Labienus
    Date: 11 January
    I propose to cross the Rubicon and attack at dawn tomorrow. OK?

expecting that the second general will respond either with:

    From: Titus Labienus
    To: Julius Caesar
    Date: 11 January
    Yes, dawn on the 12th.

or, possibly:

    From: Titus Labienus
    To: Julius Caesar
    Date: 11 January
    No. I am awaiting reinforcements from Gaul.

Suppose further that the first message does not make it through. In that case, the second general does not march because no request to do so arrives. In addition, the first general does not march because no response returns, and all is well (except for the lost runner).

Now, instead suppose the runner delivers the first message successfully and the second general sends the reply “Yes,” but that the reply is lost. The first general cannot distinguish this case from the earlier case, so that army will not march. The second general has agreed to march, but knowing that the first general won’t march unless the “Yes” confirmation arrives, the second general will not march without being certain that the first general received the confirmation.

* The origin of this analogy has been lost, but it was apparently first described in print in 1977 by Jim N. Gray in his “Notes on Database Operating Systems”, reprinted in Operating Systems, Lecture Notes in Computer Science 60, Springer Verlag, 1978. At about the same time, Danny Cohen described another analogy he called the dating protocol, which is congruent with the dilemma of the two generals.

This hesitation on the part of the second general suggests that the first general should send back an acknowledgment of receipt of the confirmation:

    From: Julius Caesar
    To: Titus Labienus
    Date: 11 January
    The die is cast.

Unfortunately, that doesn’t help, since the runner carrying this acknowledgment may be lost and the second general, not receiving the acknowledgment, will still not march. Thus the dilemma.

We can now leap directly to a conclusion: there is no protocol with a bounded number of messages that can convince both generals that it is safe to march. If there were such a protocol, the last message in any particular run of that protocol must be unnecessary to safe coordination because it might be lost, undetectably. Since the last message must be unnecessary, one could delete that message to produce another, shorter sequence of messages that must guarantee safe coordination. We can reapply the same reasoning repeatedly to the shorter message sequence to produce still shorter ones, and we conclude that if such a safe protocol exists it either generates message sequences of zero length or else of unbounded length. A zero-length protocol can’t communicate anything, and an unbounded protocol is of no use to the generals, who must choose a particular time to march.

A practical general, presented with this dilemma by a mathematician in the field, would reassign the mathematician to a new job as a runner, and send a scout to check out the valley and report the probability that a successful transit can be accomplished within a specified time. Knowing that probability, the general would then send several (hopefully independent) runners, each carrying a copy of the message, choosing a number of runners large enough that the probability is negligible that all of them fail to deliver the message before the appointed time. (The loss of all the runners would be what Chapter 8[on-line] called an intolerable error.) Similarly, the second general sends many runners each carrying a copy of either the “Yes” or the “No” acknowledgment. This procedure provides a practical solution of the problem, so the dilemma is of no real consequence. Nevertheless, it is interesting to discover a problem that cannot, in principle, be solved with complete certainty.

We can state the theoretical conclusion more generally and succinctly: if messages may be lost, no bounded protocol can guarantee with complete certainty that both generals know that they will both march at the same time. The best that they can do is accept some non-zero probability of failure equal to the probability of non-delivery of their last message.

It is interesting to analyze just why we can’t use a distributed two-phase commit protocol to resolve the dilemma of the two generals. As suggested at the outset, it has to do with a subtle difference in when things may, or must, happen. The two generals require, in order to vanquish the defenses of the city, that they march at the same time.

The persistent senders of the distributed two-phase commit protocol ensure that if the coordinator decides to commit, all of the workers will eventually also commit, but there is no assurance that they will do so at the same time. If one of the communication links goes down for a day, when it comes back up the worker at the other end of that link will then receive the notice to commit, but this action may occur a day later than the actions of its colleagues. Thus the problem solved by distributed two-phase commit is slightly relaxed when compared with the dilemma of the two generals. That relaxation doesn’t help the two generals, but the relaxation turns out to be just enough to allow us to devise a protocol that ensures correctness.

By a similar line of reasoning, there is no way to ensure with complete certainty that actions will be taken simultaneously at two sites that communicate only via a best-effort network. Distributed two-phase commit can thus safely open a cash drawer of an ATM in Tokyo, with confidence that a computer in Munich will eventually update the balance of that account. But if, for some reason, it is necessary to open two cash drawers at different sites at the same time, the only solution is either the probabilistic approach or to somehow replace the best-effort network with a reliable one. The requirement for reliable communication is why real estate transactions and weddings (both of which are examples of two-phase commit protocols) usually occur with all of the parties in one room.
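As a numerical footnote to the practical general's approach described a few paragraphs back, the number of runners to send follows directly from the scout's estimate. The failure probability and the tolerance in this sketch are made-up values, not figures from the text.

    import math

    def runners_needed(p_runner_lost, tolerance):
        """Smallest n such that the chance that all n runners are lost is at most tolerance."""
        return math.ceil(math.log(tolerance) / math.log(p_runner_lost))

    print(runners_needed(0.2, 1e-6))   # if each runner is lost with probability 0.2, nine suffice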

9.7 A More Complete Model of Disk Failure (Advanced Topic)

Section 9.2 of this chapter developed a failure analysis model for a calendar management program in which a system crash may corrupt at most one disk sector—the one, if any, that was being written at the instant of the crash. That section also developed a masking strategy for that problem, creating all-or-nothing disk storage. To keep that development simple, the strategy ignored decay events. This section revisits that model, considering how to also mask decay events. The result will be all-or-nothing durable storage, meaning that it is both all-or-nothing in the event of a system crash and durable in the face of decay events.

9.7.1 Storage that is Both All-or-Nothing and Durable

In Chapter 8[on-line] we learned that to obtain durable storage we should write two or more replicas of each disk sector. In the current chapter we learned that to recover from a system crash while writing a disk sector we should never overwrite the previous version of that sector; we should write a new version in a different place. To obtain storage that is both durable and all-or-nothing we combine these two observations: make more than one replica, and don’t overwrite the previous version.

One easy way to do that would be to simply build the all-or-nothing storage layer of the current chapter on top of the durable storage layer of Chapter 8[on-line]. That method would certainly work but it is a bit heavy-handed: with a replication count of just two, it would lead to
allocating six disk sectors for each sector of real data. This is a case in which modularity has an excessive cost.

Recall that the parameter that Chapter 8[on-line] used to determine frequency of checking the integrity of disk storage was the expected time to decay, Td. Suppose for the moment that the durability requirement can be achieved by maintaining only two copies. In that case, Td must be much greater than the time required to write two copies of a sector on two disks. Put another way, a large Td means that the short-term chance of a decay event is small enough that the designer may be able to safely neglect it. We can take advantage of this observation to devise a slightly risky but far more economical method of implementing storage that is both durable and all-or-nothing with just two replicas.

The basic idea is that if we are confident that we have two good replicas of some piece of data for durability, it is safe (for all-or-nothing atomicity) to overwrite one of the two replicas; the second replica can be used as a backup to ensure all-or-nothing atomicity if the system should happen to crash while writing the first one. Once we are confident that the first replica has been correctly written with new data, we can safely overwrite the second one, to regain long-term durability. If the time to complete the two writes is short compared with Td, the probability that a decay event interferes with this algorithm will be negligible.

Figure 9.38 shows the algorithm and the two replicas of the data, here named D0 and D1. An interesting point is that ALL_OR_NOTHING_DURABLE_GET does not bother to check the status returned upon reading D1—it just passes the status value along to its caller. The reason is that in the absence of decay CAREFUL_GET has no expected errors when reading data that CAREFUL_PUT was allowed to finish writing. Thus the returned status would be BAD only in two cases:

1. CAREFUL_PUT of D1 was interrupted in mid-operation, or
2. D1 was subject to an unexpected decay.

The ALL_OR_NOTHING_DURABLE_PUT algorithm guarantees that the first case cannot happen. It doesn’t begin CAREFUL_PUT on data D1 until after the completion of its CAREFUL_PUT on data D0. At most one of the two copies could be BAD because of a system crash during CAREFUL_PUT. Thus if the first copy (D0) is BAD, then we expect that the second one (D1) is OK.

The risk of the second case is real, but we have assumed its probability to be small: it arises only if there is a random decay of D1 in a time much shorter than Td. In reading D1 we have an opportunity to detect that error through the status value, but we have no way to recover when both data copies are damaged, so this detectable error must be classified as untolerated. All we can do is pass a status report along to the application so that it knows that there was an untolerated error.

There is one currently unnecessary step hidden in the SALVAGE program: if D0 is BAD, nothing is gained by copying D1 onto D0, since ALL_OR_NOTHING_DURABLE_PUT, which called SALVAGE, will immediately overwrite D0 with new data. The step is included because it allows SALVAGE to be used in a refinement of the algorithm.

In the absence of decay events, this algorithm would be just as good as the all-or-nothing procedures of Figures 9.6 and 9.7, and it would perform somewhat better, since it involves only two copies. Assuming that errors are rare enough that recovery operations do not dominate performance, the usual cost of ALL_OR_NOTHING_DURABLE_GET is just one disk read, compared with three in the ALL_OR_NOTHING_GET algorithm. The cost of ALL_OR_NOTHING_DURABLE_PUT is two disk reads (in SALVAGE) and two disk writes, compared with three disk reads and three disk writes for the ALL_OR_NOTHING_PUT algorithm.

That analysis is based on a decay-free system. To deal with decay events, thus making the scheme both all-or-nothing and durable, the designer adopts two ideas from the discussion of durability in Chapter 8[on-line], the second of which eats up some of the better performance:

1. Place the two copies, D0 and D1, in independent decay sets (for example write them on two different disk drives, preferably from different vendors).

2. Have a clerk run the SALVAGE program on every atomic sector at least once every Td seconds.

procedure ALL_OR_NOTHING_DURABLE_GET (reference data, atomic_sector)
    ds ← CAREFUL_GET (data, atomic_sector.D0)
    if ds = BAD then
        ds ← CAREFUL_GET (data, atomic_sector.D1)
    return ds

procedure ALL_OR_NOTHING_DURABLE_PUT (new_data, atomic_sector)
    SALVAGE (atomic_sector)
    ds ← CAREFUL_PUT (new_data, atomic_sector.D0)
    ds ← CAREFUL_PUT (new_data, atomic_sector.D1)
    return ds

procedure SALVAGE (atomic_sector)		// Run this program every Td seconds.
    ds0 ← CAREFUL_GET (data0, atomic_sector.D0)
    ds1 ← CAREFUL_GET (data1, atomic_sector.D1)
    if ds0 = BAD then
        CAREFUL_PUT (data1, atomic_sector.D0)
    else if ds1 = BAD then
        CAREFUL_PUT (data0, atomic_sector.D1)
    if data0 ≠ data1 then
        CAREFUL_PUT (data0, atomic_sector.D1)

[The accompanying drawing shows the two replicas of the data: sector D0 holds data0 and sector D1 holds data1.]

FIGURE 9.38 Data arrangement and algorithms to implement all-or-nothing durable storage on top of the careful storage layer of Figure 8.12.

The clerk running the SALVAGE program performs 2N disk reads every Td seconds to maintain N durable sectors. This extra expense is the price of durability against disk decay. The performance cost of the clerk depends on the choice of Td, the value of N, and the priority of the clerk. Since the expected operational lifetime of a hard disk is usually several years, setting Td to a few weeks should make the chance of untolerated failure from decay negligible, especially if there is also an operating practice to routinely replace disks well before they reach their expected operational lifetime. A modern hard disk with a capacity of one terabyte would have about N = 10^9 kilobyte-sized sectors. If it takes 10 milliseconds to read a sector, it would take about 2 × 10^7 seconds, or two days, for a clerk to read all of the contents of two one-terabyte hard disks. If the work of the clerk is scheduled to occur at night, or uses a priority system that runs the clerk when the system is otherwise not being used heavily, that reading can spread out over a few weeks and the performance impact can be minor.

A few paragraphs back it was mentioned that there is the potential for a refinement: if we also run the SALVAGE program on every atomic sector immediately following every system crash, then it should not be necessary to do it at the beginning of every ALL_OR_NOTHING_DURABLE_PUT. That variation, which is more economical if crashes are infrequent and disks are not too large, is due to Butler Lampson and Howard Sturgis [Suggestions for Further Reading 1.8.7]. It raises one minor concern: it depends on the rarity of coincidence of two failures: the spontaneous decay of one data replica at about the same time that CAREFUL_PUT crashes in the middle of rewriting the other replica of that same sector. If we are convinced that such a coincidence is rare, we can declare it to be an untolerated error, and we have a self-consistent and more economical algorithm. With this scheme the cost of ALL_OR_NOTHING_DURABLE_PUT reduces to just two disk writes.
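One way to picture the clerk is as a low-priority background loop that spreads one full SALVAGE pass over the decay window. The fragment below is only a sketch: salvage stands in for the SALVAGE procedure of Figure 9.38, and pacing the scan with simple sleeps (rather than a real priority mechanism) is an assumption.

    import time

    def run_clerk(salvage, n_sectors, td_seconds):
        """Repeatedly salvage every atomic sector, completing one full pass per Td."""
        pause = td_seconds / n_sectors        # budget so that a pass finishes within Td
        while True:
            for sector in range(n_sectors):
                salvage(sector)               # read both replicas, repair if necessary
                time.sleep(pause)             # spread the work; leave the disk mostly idle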

9.8 Case Studies: Machine Language Atomicity

9.8.1 Complex Instruction Sets: The General Electric 600 Line

In the early days of mainframe computers, most manufacturers reveled in providing elaborate instruction sets, without paying much attention to questions of atomicity. The General Electric 600 line, which later evolved to be the Honeywell Information System, Inc., 68 series computer architecture, had a feature called “indirect and tally.” One could specify this feature by setting to ON a one-bit flag (the “tally” flag) stored in an unused high-order bit of any indirect address. The instruction

    Load register A from Y indirect.

was interpreted to mean that the low-order bits of the cell with address Y contain another address, called an indirect address, and that indirect address should be used to retrieve the operand to be loaded into register A. In addition, if the tally flag in cell Y is ON, the processor is to increment the indirect address in Y by one and store the result back in Y. The idea is that the next time Y is used as an indirect address it will point to a different
operand—the one in the next sequential address in memory. Thus the indirect and tally feature could be used to sweep through a table. The feature seemed useful to the designers, but it was actually used only occasionally, because most applications were written in higher-level languages and compiler writers found it hard to exploit. On the other hand the feature gave no end of trouble when virtual memory was retrofitted to the product line.

Suppose that virtual memory is in use, and that the indirect word is located in a page that is in primary memory, but the actual operand is in another page that has been removed to secondary memory. When the above instruction is executed, the processor will retrieve the indirect address in Y, increment it, and store the new value back in Y. Then it will attempt to retrieve the actual operand, at which time it discovers that it is not in primary memory, so it signals a missing-page exception. Since it has already modified the contents of Y (and by now Y may have been read by another processor or even removed from memory by the missing-page exception handler running on another processor), it is not feasible to back out and act as if this instruction had never executed. The designer of the exception handler would like to be able to give the processor to another thread by calling a function such as AWAIT while waiting for the missing page to arrive. Indeed, processor reassignment may be the only way to assign a processor to retrieve the missing page. However, to reassign the processor it is necessary to save its current execution state. Unfortunately, its execution state is “half-way through the instruction last addressed by the program counter.” Saving this state and later restarting the processor in this state is challenging. The indirect and tally feature was just one of several sources of atomicity problems that cropped up when virtual memory was added to this processor.

The virtual memory designers desperately wanted to be able to run other threads on the interrupted processor. To solve this problem, they extended the definition of the current program state to contain not just the next-instruction counter and the program-visible registers, but also the complete internal state description of the processor—a 216-bit snapshot in the middle of the instruction. By later restoring the processor state to contain the previously saved values of the next-instruction counter, the program-visible registers, and the 216-bit internal state snapshot, the processor could exactly continue from the point at which the missing-page alert occurred. This technique worked but it had two awkward side effects: 1) when a program (or programmer) inquires about the current state of an interrupted processor, the state description includes things not in the programmer’s interface; and 2) the system must be careful when restarting an interrupted program to make certain that the stored micro-state description is a valid one. If someone has altered the state description the processor could try to continue from a state it could never have gotten into by itself, which could lead to unplanned behavior, including failures of its memory protection features.

9.8.2 More Elaborate Instruction Sets: The IBM System/370

When IBM developed the System/370 by adding virtual memory to its System/360 architecture, certain System/360 multi-operand character-editing instructions caused
atomicity problems. For example, the TRANSLATE instruction contains three arguments, two of which are addresses in memory (call them string and table) and the third of which, length, is an 8-bit count that the instruction interprets as the length of string. TRANSLATE takes one byte at a time from string, uses that byte as an offset in table, retrieves the byte at the offset, and replaces the byte in string with the byte it found in table. The designers had in mind that TRANSLATE could be used to convert a character string from one character set to another.

The problem with adding virtual memory is that both string and table may be as long as 65,536 bytes, so either or both of those operands may cross not just one, but several page boundaries. Suppose just the first page of string is in physical memory. The TRANSLATE instruction works its way through the bytes at the beginning of string. When it comes to the end of that first page, it encounters a missing-page exception. At this point, the instruction cannot run to completion because data it requires is missing. It also cannot back out and act as if it never started because it has modified data in memory by overwriting it. After the virtual memory manager retrieves the missing page, the problem is how to restart the half-completed instruction. If it restarts from the beginning, it will try to convert the already-converted characters, which would be a mistake. For correct operation, the instruction needs to continue from where it left off.

Rather than tampering with the program state definition, the IBM processor designers chose a dry run strategy in which the TRANSLATE instruction is executed using a hidden copy of the program-visible registers and making no changes in memory. If one of the operands causes a missing-page exception, the processor can act as if it never tried the instruction, since there is no program-visible evidence that it did. The stored program state shows only that the TRANSLATE instruction is about to be executed. After the processor retrieves the missing page, it restarts the interrupted thread by trying the TRANSLATE instruction from the beginning again, another dry run. If there are several missing pages, several dry runs may occur, each getting one more page into primary memory. When a dry run finally succeeds in completing, the processor runs the instruction once more, this time for real, using the program-visible registers and allowing memory to be updated. Since the System/370 (at the time this modification was made) was a single-processor architecture, there was no possibility that another processor might snatch a page away after the dry run but before the real execution of the instruction. This solution had the side effect of making life more difficult for a later designer with the task of adding multiple processors.

9.8.3 The Apollo Desktop Computer and the Motorola M68000 Microprocessor

When Apollo Computer designed a desktop computer using the Motorola 68000 microprocessor, the designers, who wanted to add a virtual memory feature, discovered that the microprocessor instruction set interface was not atomic. Worse, because it was constructed entirely on a single chip it could not be modified to do a dry run (as in the IBM 370) or to make it store the internal microprogram state (as in the General Electric 600 line). So the Apollo designers used a different strategy: they installed not one, but two
Motorola 68000 processors. When the first one encounters a missing-page exception, it simply stops in its tracks, and waits for the operand to appear. The second Motorola 68000 (whose program is carefully planned to reside entirely in primary memory) fetches the missing page and then restarts the first processor.

Other designers working with the Motorola 68000 used a different, somewhat risky trick: modify all compilers and assemblers to generate only instructions that happen to be atomic. Motorola later produced a version of the 68000 in which all internal state registers of the microprocessor could be saved, the same method used in adding virtual memory to the General Electric 600 line.

Exercises

9.1 Locking up humanities: The registrar’s office is upgrading its scheduling program for limited-enrollment humanities subjects. The plan is to make it multithreaded, but there is concern that having multiple threads trying to update the database at the same time could cause trouble. The program originally had just two operations:

    status ← REGISTER (subject_name)
    DROP (subject_name)

where subject_name was a string such as “21W471”. The REGISTER procedure checked to see if there is any space left in the subject, and if there was, it incremented the class size by one and returned the status value ZERO. If there was no space, it did not change the class size; instead it returned the status value –1. (This is a primitive registration system—it just keeps counts!) As part of the upgrade, subject_name has been changed to a two-component structure:

    structure subject
        string subject_name
        lock slock

and the registrar is now wondering where to apply the locking primitives,

    ACQUIRE (subject.slock)
    RELEASE (subject.slock)

Here is a typical application program, which registers the caller for two humanities
subjects, hx and hy:

    procedure REGISTER_TWO (hx, hy)
        status ← REGISTER (hx)
        if status = 0 then
            status ← REGISTER (hy)
            if status = –1 then
                DROP (hx)
        return status

9.1a. The goal is that the entire procedure REGISTER_TWO should have the before-or-after property. Add calls for ACQUIRE and RELEASE to the REGISTER_TWO procedure that obey the simple locking protocol.

9.1b. Add calls to ACQUIRE and RELEASE that obey the two-phase locking protocol, and in addition postpone all ACQUIREs as late as possible and do all RELEASEs as early as possible.

Louis Reasoner has come up with a suggestion that he thinks could simplify the job of programmers creating application programs such as REGISTER_TWO. His idea is to revise the two programs REGISTER and DROP by having them do the ACQUIRE and RELEASE internally. That is, the procedure:

    procedure REGISTER (subject)
        { current code }
        return status

would become instead:

    procedure REGISTER (subject)
        ACQUIRE (subject.slock)
        { current code }
        RELEASE (subject.slock)
        return status

9.1c. As usual, Louis has misunderstood some aspect of the problem. Give a brief explanation of what is wrong with this idea. 1995–3–2a…c

9.2 Ben and Alyssa are debating a fine point regarding version history transaction disciplines and would appreciate your help. Ben says that under the mark point transaction discipline, every transaction should call MARK_POINT_ANNOUNCE as soon as possible, or else the discipline won't work. Alyssa claims that everything will come out correct even if no transaction calls MARK_POINT_ANNOUNCE. Who is right? 2006-0-1

9.3 Ben and Alyssa are debating another fine point about the way that the version history transaction discipline bootstraps. The version of NEW_OUTCOME_RECORD given in the text uses TICKET as well as ACQUIRE and RELEASE. Alyssa says this is overkill—it
should be possible to correctly coordinate NEW_OUTCOME_RECORD using just ACQUIRE and RELEASE. Modify the pseudocode of Figure 9.30 to create a version of NEW_OUTCOME_RECORD that doesn't need the ticket primitive.

9.4 You have been hired by Many-MIPS corporation to help design a new 32-register RISC processor that is to have six-way multiple instruction issue. Your job is to coordinate the interaction among the six arithmetic-logic units (ALUs) that will be running concurrently. Recalling the discussion of coordination, you realize that the first thing you must do is decide what constitutes “correct” coordination for a multiple-instruction-issue system. Correct coordination for concurrent operations on a database was said to be: No matter in what order things are actually calculated, the final result is always guaranteed to be one that could have been obtained by some sequential ordering of the concurrent operations. You have two goals: (1) maximum performance, and (2) not surprising a programmer who wrote a program expecting it to be executed on a single-instruction-issue machine. Identify the best coordination correctness criterion for your problem.

    A. Multiple instruction issue must be restricted to sequences of instructions that have non-overlapping register sets.
    B. No matter in what order things are actually calculated, the final result is always guaranteed to be one that could have been obtained by some sequential ordering of the instructions that were issued in parallel.
    C. No matter in what order things are actually calculated, the final result is always guaranteed to be the one that would have been obtained by the original ordering of the instructions that were issued in parallel.
    D. The final result must be obtained by carrying out the operations in the order specified by the original program.
    E. No matter in what order things are actually calculated, the final result is always guaranteed to be one that could have been obtained by some set of instructions carried out sequentially.
    F. The six ALUs do not require any coordination.

1997–0–02

9.5 In 1968, IBM introduced the Information Management System (IMS) and it soon became one of the most widely used database management systems in the world. In fact, IMS is still in use today. At the time of introduction IMS used a before-or-after atomicity protocol consisting of the following two rules:

• A transaction may read only data that has been written by previously committed transactions.
• A transaction must acquire a lock for every data item that it will write.

Consider the following two transactions, which, for the interleaving shown, both adhere to the protocol (steps of t2 are shown indented to the right):

    1   BEGIN (t1); ACQUIRE (y.lock)
    2   temp1 ← x
    3               BEGIN (t2)
    4               ACQUIRE (x.lock)
    5               temp2 ← y
    6               x ← temp2
    7   y ← temp1
    8   COMMIT (t1)
    9               COMMIT (t2)

Previously committed transactions had set x ← 3 and y ← 4.

9.5a. After both transactions complete, what are the values of x and y? In what sense is this answer wrong? 1982–3–3a

9.5b. In the mid-1970’s, this flaw was noticed, and the before-or-after atomicity protocol was replaced with a better one, despite a lack of complaints from customers. Explain why customers may not have complained about the flaw. 1982–3–3b

9.6 A system that attempts to make actions all-or-nothing writes the following type of records to a log maintained on non-volatile storage:

• action i starts.
• action i writes the value new over the value old for the variable x.
• action i commits.
• action i aborts.
• At this checkpoint, actions i, j,… are pending.

Actions start in numerical order. A crash occurs, and the recovery procedure finds
the following log records starting with the last checkpoint:





<53, y, 5, 6>

<53, x, 5, 9>



<54, y, 6, 4>



<55, z, 3, 4>



<51, q, 1, 9>



<55, y, 4, 3>



<55, y, 3, 7>





<56, x, 9, 2>

<56, w, 0, 1>



<57, u, 2, 1>

****************** crash happened here **************

9.6a. Assume that the system is using a rollback recovery procedure. How much farther back in the log should the recovery procedure scan?

9.6b. Assume that the system is using a roll-forward recovery procedure. How much farther back in the log should the recovery procedure scan?

9.6c. Which operations mentioned in this part of the log are winners and which are losers?

9.6d. What are the values of x and y immediately after the recovery procedure finishes? Why? 1994–3–3

9.7 The log of exercise 9.6 contains (perhaps ambiguous) evidence that someone didn’t follow coordination rules. What is that evidence? 1994–3–4

9.8 Roll-forward recovery requires writing the commit (or abort) record to the log before doing any installs to cell storage. Identify the best reason for this requirement.

    A. So that the recovery manager will know what to undo.
    B. So that the recovery manager will know what to redo.
    C. Because the log is less likely to fail than the cell storage.
    D. To minimize the number of disk seeks required.

1994–3–5

9.9 Two-phase locking within transactions ensures that

    A. No deadlocks will occur.
    B. Results will correspond to some serial execution of the transactions.
    C. Resources will be locked for the minimum possible interval.
    D. Neither gas nor liquid will escape.
    E. Transactions will succeed even if one lock attempt fails.

1997–3–03

9.10 Pat, Diane, and Quincy are having trouble using e-mail to schedule meetings. Pat suggests that they take inspiration from the 2-phase commit protocol.

9.10a. Which of the following protocols most closely resembles 2-phase commit?

    I.   a. Pat requests everyone’s schedule openings.
         b. Everyone replies with a list but does not guarantee to hold all the times available.
         c. Pat inspects the lists and looks for an open time.
            If there is a time,
                Pat chooses a meeting time and sends it to everyone.
            Otherwise
                Pat sends a message canceling the meeting.

    II.  a–c, as in protocol I.
         d. Everyone, if they received the second message,
                acknowledge receipt.
            Otherwise
                send a message to Pat asking what happened.

    III. a–c, as in protocol I.
         d. Everyone, if their calendar is still open at the chosen time
                Send Pat an acknowledgment.
            Otherwise
                Send Pat apologies.
         e. Pat collects the acknowledgments. If all are positive
                Send a message to everyone saying the meeting is ON.
            Otherwise
                Send a message to everyone saying the meeting is OFF.
         f. Everyone, if they received the ON/OFF message,
                acknowledge receipt.
            Otherwise
                send a message to Pat asking what happened.

    IV.  a–f, as in protocol III.
         g. Pat sends a message telling everyone that everyone has confirmed.
         h. Everyone acknowledges the confirmation.

9.10b. For the protocol you selected, which step commits the meeting time? 1994–3–7

9.11 Alyssa P. Hacker needs a transaction processing system for updating information about her collection of 97 cockroaches.*

9.11a. In her first design, Alyssa stores the database on disk. When a transaction commits, it simply goes to the disk and writes its changes in place over the old data. What are the major problems with Alyssa’s system?

9.11b. In Alyssa’s second design, the only structure she keeps on disk is a log, with a reference copy of all data in volatile RAM. The log records every change made to the database, along with the transaction which the change was a part of. Commit records, also stored in the log, indicate when a transaction commits. When the system crashes and recovers, it replays the log, redoing each committed transaction, to reconstruct the reference copy in RAM. What are the disadvantages of Alyssa’s second design?

To speed things up, Alyssa makes an occasional checkpoint of her database. To checkpoint, Alyssa just writes the entire state of the database into the log. When the system crashes, she starts from the last checkpointed state, and then redoes or undoes some transactions to restore her database. Now consider the five transactions in the illustration:

[Illustration: a timeline showing when transactions T1 through T5 begin and commit, with a checkpoint followed later by a crash.]

Transactions T2, T3, and T5 committed before the crash, but T1 and T4 were still pending.

* Credit for developing exercise 9.11 goes to Eddie Kohler.

9.11c. When the system recovers, after the checkpointed state is loaded, some transactions will need to be undone or redone using the log. For each transaction, mark off in the table whether that transaction needs to be undone, redone, or neither.

            Undone      Redone      Neither
    T1
    T2
    T3
    T4
    T5

9.11d. Now, assume that transactions T2 and T3 were actually nested transactions: T2 was nested in T1, and T3 was nested in T2. Again, fill in the table.

            Undone      Redone      Neither
    T1
    T2
    T3
    T4
    T5

1996–3–3

9.12 Alice is acting as the coordinator for Bob and Charles in a two-phase commit protocol. Here is a log of the messages that pass among them:

    1   Alice ⇒ Bob: please do X
    2   Alice ⇒ Charles: please do Y
    3   Bob ⇒ Alice: done with X
    4   Charles ⇒ Alice: done with Y
    5   Alice ⇒ Bob: PREPARE to commit or abort
    6   Alice ⇒ Charles: PREPARE to commit or abort
    7   Bob ⇒ Alice: PREPARED
    8   Charles ⇒ Alice: PREPARED
    9   Alice ⇒ Bob: COMMIT
    10  Alice ⇒ Charles: COMMIT

At which points in this sequence is it OK for Bob to abort his part of the
transaction?

    A. After Bob receives message 1 but before he sends message 3.
    B. After Bob sends message 3 but before he receives message 5.
    C. After Bob receives message 5 but before he sends message 7.
    D. After Bob sends message 7 but before he receives message 9.
    E. After Bob receives message 9.

2008–3–11

Additional exercises relating to Chapter 9 can be found in problem sets 29 through 40.


CHAPTER 10

Consistency

CHAPTER CONTENTS

Overview........................................................................................10–2

10.1 Constraints and Interface Consistency ..................................10–2

10.2 Cache Coherence ...................................................................10–4

10.2.1 Coherence, Replication, and Consistency in a Cache .............. 10–4

10.2.2 Eventual Consistency with Timer Expiration ......................... 10–5

10.2.3 Obtaining Strict Consistency with a Fluorescent Marking Pen .. 10–7

10.2.4 Obtaining Strict Consistency with the Snoopy Cache ............. 10–7

10.3 Durable Storage Revisited: Widely Separated Replicas..........10–9

10.3.1 Durable Storage and the Durability Mantra .......................... 10–9

10.3.2 Replicated State Machines ................................................10–11

10.3.3 Shortcuts to Meet more Modest Requirements .....................10–13

10.3.4 Maintaining Data Integrity ................................................10–15

10.3.5 Replica Reading and Majorities ..........................................10–16

10.3.6 Backup ..........................................................................10–17

10.3.7 Partitioning Data .............................................................10–18

10.4 Reconciliation......................................................................10–19

10.4.1 Occasionally Connected Operation .....................................10–20

10.4.2 A Reconciliation Procedure ................................................10–22

10.4.3 Improvements ................................................................10–25

10.4.4 Clock Coordination ..........................................................10–26

10.5 Perspectives........................................................................10–26

10.5.1 History ..........................................................................10–27

10.5.2 Trade-Offs ......................................................................10–28

10.5.3 Directions for Further Study ..............................................10–31

Exercises......................................................................................10–32
Glossary for Chapter 10 ...............................................................10–35
Index of Chapter 10 .....................................................................10–37
Last chapter page 10–38


Overview

The previous chapter developed all-or-nothing atomicity and before-or-after atomicity, two properties that define a transaction. This chapter introduces or revisits several applications that can make use of transactions. Section 10.1 introduces constraints and discusses how transactions can be used to maintain invariants and implement memory models that provide interface consistency. Sections 10.2 and 10.3 develop techniques used in two different application areas, caching and geographically distributed replication, to achieve higher performance and greater durability, respectively. Section 10.4 discusses reconciliation, which is a way of restoring the constraint that replicas be identical if their contents should drift apart. Finally, Section 10.5 considers some perspectives relating to Chapters 9[on-line] and 10.

10.1 Constraints and Interface Consistency

One common use for transactions is to maintain constraints. A constraint is an application-defined requirement that every update to a collection of data preserve some specified invariant. Different applications can have quite different constraints. Here are some typical constraints that a designer might encounter:

• Table management: The variable that tells the number of entries should equal the number of entries actually in the table.
• Double-linked list management: The forward pointer in a list cell, A, should refer to a list cell whose back pointer refers to A.
• Disk storage management: Every disk sector should be assigned either to the free list or to exactly one file.
• Display management: The pixels on the screen should match the description in the display list.
• Replica management: A majority (or perhaps all) of the replicas of the data should be identical.
• Banking: The sum of the balances of all credit accounts should equal the sum of the balances of all debit accounts.
• Process control: At least one of the valves on the boiler should always be open.

As was seen in Chapter 9[on-line], maintaining a constraint over data within a single file can be relatively straightforward, for example by creating a shadow copy. Maintaining constraints across data that is stored in several files is harder, and that is one of the primary uses of transactions. Finally, two-phase commit allows maintaining a constraint that involves geographically separated files despite the hazards of communication.

A constraint usually involves more than one variable data item, in which case an update action by nature must be composite—it requires several steps. In the midst of those steps, the data will temporarily be inconsistent. In other words, there will be times when the data violates the invariant. During those times, there is a question about what


to do if someone—another thread or another client—asks to read the data. This question is one of interface, rather than of internal operation, and it reopens the discussion of memory coherence and data consistency models introduced in Section 2.1.1.1. Different designers have developed several data consistency models to deal with this inevitable temporary inconsistency. In this chapter we consider two of those models: strict consistency and eventual consistency.

The first model, strict consistency, hides the constraint violation behind modular boundaries. Strict consistency means that actions outside the transaction performing the update will never see data that is inconsistent with the invariant. Since strict consistency is an interface concept, it depends on actions honoring abstractions, for example by using only the intended reading and writing operations. Thus, for a cache, read/write coherence is a strict consistency specification: “The result of a READ of a named object is always the value that was provided by the most recent WRITE to that object”. This specification does not demand that the replica in the cache always be identical to the replica in the backing store; it requires only that the cache deliver data at its interface that meets the specification.

Applications can maintain strict consistency by using transactions. If an action is all-or-nothing, the application can maintain the outward appearance of consistency despite failures, and if an action is before-or-after, the application can maintain the outward appearance of consistency despite the existence of other actions concurrently reading or updating the same data. Designers generally strive for strict consistency in any situation where inconsistent results can cause confusion, such as in a multiprocessor system, and in situations where mistakes can have serious negative consequences, for example in banking and safety-critical systems. Section 9.1.6 mentioned two other consistency models, sequential consistency and external time consistency. Both are examples of strict consistency.

The second, more lightweight, way of dealing with temporary inconsistency is called eventual consistency. Eventual consistency means that after a data update the constraint may not hold until some unspecified time in the future. An observer may, using the standard interfaces, discover that the invariant is violated, and different observers may even see different results. But the system is designed so that once updates stop occurring, it will make a best effort drive toward the invariant.

Eventual consistency is employed in situations where performance or availability is a high priority and temporary inconsistency is tolerable and can be easily ignored. For example, suppose a Web browser is to display a page from a distant service. The page has both a few paragraphs of text and several associated images. The browser obtains the text immediately, but it will take some time to download the images. The invariant is that the appearance on the screen should match the Web page specification. If the browser renders the text paragraphs first and fills in the images as they arrive, the human reader finds that behavior not only acceptable, but perhaps preferable to staring at the previous screen until the new one is completely ready. When a person can say, “Oh, I see what is happening,” eventual consistency is usually acceptable, and in cases such as the Web browser it can even improve human engineering. For a second example, if a librarian


catalogs a new book and places it on the shelf, but the public version of the library catalog doesn't include the new book until the next day, there is an observable inconsistency, but most library patrons would find it tolerable and not particularly surprising.

Eventual consistency is sometimes used in replica management because it allows for relatively loose coupling among the replicas, thus taking advantage of independent failure. In some applications, continuous service is a higher priority than always-consistent answers. If a replica server crashes in the middle of an update, the other replicas may be able to continue to provide service, even though some may have been updated and some may have not. In contrast, a strict consistency algorithm may have to refuse to provide service until a crashed replica site recovers, rather than taking a risk of exposing an inconsistency.

The remaining sections of this chapter explore several examples of strict and eventual consistency in action. A cache can be designed to provide either strict or eventual consistency; Section 10.2 provides the details. The Internet Domain Name System, described in Section 4.4 and revisited in Section 10.2.2, relies on eventual consistency in updating its caches, with the result that it can on occasion give inconsistent answers. Similarly, for the geographically replicated durable storage of Section 10.3 a designer can choose either a strict or an eventual consistency model. When replicas are maintained on devices that are only occasionally connected, eventual consistency may be the only choice, in which case reconciliation, the topic of Section 10.4, drives occasionally connected replicas toward eventual consistency. The reader should be aware that these examples do not provide a comprehensive overview of consistency; instead they are intended primarily to create awareness of the issues involved by illustrating a few of the many possible designs.
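Before moving on, here is a minimal sketch of the shadow-copy approach to strict consistency mentioned at the start of Section 10.1. It is an illustration only, not the book's pseudocode: the TinyDatabase class and the transfer example are invented, and concurrency control (before-or-after atomicity) is omitted.

    # Hypothetical sketch: an all-or-nothing update installed via a shadow copy,
    # so readers of the interface never observe a half-completed transfer.
    class TinyDatabase:
        def __init__(self, data):
            self.data = dict(data)

        def transaction(self, update):
            shadow = dict(self.data)     # work on a shadow copy of the whole database
            update(shadow)               # if this raises, self.data is left untouched
            self.data = shadow           # commit: install the new version in one step

    accounts = TinyDatabase({"alice": 100, "bob": 50})

    def transfer(shadow):
        shadow["alice"] -= 30
        shadow["bob"] += 30

    accounts.transaction(transfer)
    assert sum(accounts.data.values()) == 150   # the invariant holds at the interface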

10.2 Cache Coherence

10.2.1 Coherence, Replication, and Consistency in a Cache

Chapter 6 described the cache as an example of a multilevel memory system. A cache can also be thought of as a replication system whose primary goal is performance, rather than reliability. An invariant for a cache is that the replica of every data item in the primary store (that is, the cache) should be identical to the corresponding replica in the secondary memory. Since the primary and secondary stores usually have different latencies, when an action updates a data value, the replica in the primary store will temporarily be inconsistent with the one in the secondary memory. How well the multilevel memory system hides that inconsistency is the question.

A cache can be designed to provide either strict or eventual consistency. Since a cache, together with its backing store, is a memory system, a typical interface specification is that it provide read/write coherence, as defined in Section 2.1.1.1, for the entire name space of the cache:


• The result of a read of a named object is always the value of the most recent write to that object.

Read/write coherence is thus a specification that the cache provide strict consistency. A write-through cache provides strict consistency for its clients in a straightforward way: it does not acknowledge that a write is complete until it finishes updating both the primary and secondary memory replicas. Unfortunately, the delay involved in waiting for the write-through to finish can be a performance bottleneck, so write-through caches are not popular.

A non-write-through cache acknowledges that a write is complete as soon as the cache manager updates the primary replica, in the cache. The thread that performed the write can go about its business expecting that the cache manager will eventually update the secondary memory replica and the invariant will once again hold. Meanwhile, if that same thread reads the same data object by sending a READ request to the cache, it will receive the updated value from the cache, even if the cache manager has not yet restored the invariant. Thus, because the cache manager masks the inconsistency, a non-write-through cache can still provide strict consistency.

On the other hand, if there is more than one cache, or other threads can read directly from the secondary storage device, the designer must take additional measures to ensure that other threads cannot discover the violated constraint. If a concurrent thread reads a modified data object via the same cache, the cache will deliver the modified version, and thus maintain strict consistency. But if a concurrent thread reads the modified data object directly from secondary memory, the result will depend on whether or not the cache manager has done the secondary memory update. If the second thread has its own cache, even a write-through design may not maintain consistency because updating the secondary memory does not affect a potential replica hiding in the second thread’s cache. Nevertheless, all is not lost. There are at least three ways to regain consistency, two of which provide strict consistency, when there are multiple caches.
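To make the two write policies concrete, here is a small single-client sketch. It is hypothetical (the Cache class and its backing_store dictionary are invented for illustration) and ignores the multi-cache complications discussed next.

    # Illustrative sketch: the write_through flag controls whether secondary memory
    # is updated before the write is acknowledged.
    class Cache:
        def __init__(self, backing_store, write_through=True):
            self.backing_store = backing_store   # secondary memory: name -> value
            self.contents = {}                   # primary store (the cache itself)
            self.write_through = write_through

        def write(self, name, value):
            self.contents[name] = value              # update the primary replica
            if self.write_through:
                self.backing_store[name] = value     # update secondary memory too
            # A non-write-through cache acknowledges here and lets the cache manager
            # propagate the update to secondary memory later (not shown).
            return "acknowledged"

        def read(self, name):
            if name in self.contents:                # the cache masks any pending update
                return self.contents[name]
            value = self.backing_store[name]
            self.contents[name] = value
            return value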

10.2.2 Eventual Consistency with Timer Expiration

The Internet Domain Name System, whose basic operation was described in Section 4.4, provides an example of an eventual consistency cache that does not meet the read/write coherence specification. When a client calls on a DNS server to do a recursive name lookup, if the DNS server is successful in resolving the name it caches a copy of the answer as well as any intermediate answers that it received. Suppose that a client asks some local name server to resolve the name ginger.pedantic.edu. In the course of doing so, the local name server might accumulate the following name records in its cache:

    names.edu               198.41.0.4       name server for .edu
    ns.pedantic.edu         128.32.25.19     name server for .pedantic.edu
    ginger.pedantic.edu     128.32.247.24    target host name


If the client then asks for thyme.pedantic.edu the local name server will be able to use the cached record for ns.pedantic.edu to directly ask that name server, without having to go back up to the root to find names.edu and thence to names.edu to find ns.pedantic.edu.

Now, suppose that a network manager at Pedantic University changes the Internet address of ginger.pedantic.edu to 128.32.201.15. At some point the manager updates the authoritative record stored in the name server ns.pedantic.edu. The problem is that local DNS caches anywhere in the Internet may still contain the old record of the address of ginger.pedantic.edu. DNS deals with this inconsistency by limiting the lifetime of a cached name record. Recall that every name server record comes with an expiration time, known as the time-to-live (TTL), that can range from seconds to months. A typical time-to-live is one hour; it is measured from the moment that the local name server receives the record. So, until the expiration time, the local cache will be inconsistent with the authoritative version at Pedantic University. The system will eventually reconcile this inconsistency. When the time-to-live of that record expires, the local name server will handle any further requests for the name ginger.pedantic.edu by asking ns.pedantic.edu for a new name record. That new name record will contain the new, updated address. So this system provides eventual consistency.

There are two different actions that the network manager at Pedantic University might take to make sure that the inconsistency is not an inconvenience. First, the network manager may temporarily reconfigure the network layer of ginger.pedantic.edu to advertise both the old and the new Internet addresses, and then modify the authoritative DNS record to show the new address. After an hour has passed, all cached DNS records of the old address will have expired, and ginger.pedantic.edu can be reconfigured again, this time to stop advertising the old address. Alternatively, the network manager may have realized this change is coming, so a few hours in advance he or she modifies just the time-to-live of the authoritative DNS record, say to five minutes, without changing the Internet address. After an hour passes, all cached DNS records of this address will have expired, and any currently cached record will expire in five minutes or less. The manager now changes both the Internet address of the machine and also the authoritative DNS record of that address, and within a few minutes everyone in the Internet will be able to find the new address. Anyone who tries to use an old, cached, address will receive no response. But a retry a few minutes later will succeed, so from the point of view of a network client the outcome is similar to the case in which ginger.pedantic.edu crashes and restarts—for a few minutes the server is non-responsive.

There is a good reason for designing DNS to provide eventual, rather than strict, consistency, and for not requiring read/write coherence. Replicas of individual name records may potentially be cached in any name server anywhere in the Internet—there are thousands, perhaps even millions of such caches. Alerting every name server that might have cached the record that the Internet address of ginger.pedantic.edu changed would be a huge effort, yet most of those caches probably don’t actually have a copy of this particular record.
Furthermore, it turns out not to be that important because, as described in the previous paragraph, a network manager can easily mask any temporary inconsistency


by configuring address advertisement or adjusting the time-to-live. Eventual consistency with expiration is an efficient strategy for this job.
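The expiration mechanism itself is simple enough to sketch. The fragment below is a hypothetical illustration, not an excerpt from any real DNS implementation: each record is cached together with the time at which its time-to-live runs out, and a lookup after that time falls through to the authoritative server.

    # Hypothetical sketch of TTL-based caching; the names (TtlCache, ask_authoritative)
    # are invented for illustration.
    import time

    class TtlCache:
        def __init__(self, ask_authoritative):
            self.ask_authoritative = ask_authoritative  # function: name -> (address, ttl_seconds)
            self.records = {}                           # name -> (address, expiration_time)

        def lookup(self, name):
            entry = self.records.get(name)
            if entry is not None:
                address, expires_at = entry
                if time.time() < expires_at:
                    return address           # possibly stale, until the time-to-live runs out
                del self.records[name]       # expired: discard and ask again
            address, ttl = self.ask_authoritative(name)
            self.records[name] = (address, time.time() + ttl)
            return address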

10.2.3 Obtaining Strict Consistency with a Fluorescent Marking Pen

In certain special situations, it is possible to regain strict consistency, and thus read/write coherence, despite the existence of multiple, private caches: If only a few variables are actually both shared and writable, mark just those variables with a fluorescent marking pen. The meaning of the mark is “don't cache me”. When someone reads a marked variable, the cache manager retrieves it from secondary memory and delivers it to the client, but does not place a replica in the cache. Similarly, when a client writes a marked variable, the cache manager notices the mark in secondary memory and does not keep a copy in the cache. This scheme erodes the performance-enhancing value of the cache, so it would not work well if most variables have don’t-cache-me marks.

The World Wide Web uses this scheme for Web pages that may be different each time they are read. When a client asks a Web server for a page that the server has marked “don’t cache me”, the server adds to the header of that page a flag that instructs the browser and any intermediaries not to cache that page.

The Java language includes a slightly different, though closely related, concept, intended to provide read/write coherence despite the presence of caches, variables in registers, and reordering of instructions, all of which can compromise strict consistency when there is concurrency. The Java memory model allows the programmer to declare a variable to be volatile. This declaration tells the compiler to take whatever actions (such as writing registers back to memory, flushing caches, and blocking any instruction reordering features of the processor) might be needed to ensure read/write coherence for the volatile variable within the actual memory model of the underlying system. Where the fluorescent marking pen marks a variable for special treatment by the memory system, the volatile declaration marks a variable for special treatment by the interpreter.
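The marking-pen idea amounts to a per-variable flag that the cache manager consults before keeping a replica. The sketch below is hypothetical (the MarkingPenCache class and its dont_cache set are invented); in the Web, the analogous mark travels in a response header such as Cache-Control: no-store.

    # Hypothetical sketch: a cache manager that honors a "don't cache me" mark.
    class MarkingPenCache:
        def __init__(self, backing_store, dont_cache):
            self.backing_store = backing_store   # secondary memory: name -> value
            self.contents = {}                   # the cache itself
            self.dont_cache = dont_cache         # names marked with the fluorescent pen

        def read(self, name):
            if name in self.dont_cache:
                return self.backing_store[name]  # deliver it, but keep no replica
            if name not in self.contents:
                self.contents[name] = self.backing_store[name]
            return self.contents[name]

        def write(self, name, value):
            self.backing_store[name] = value
            if name in self.dont_cache:
                self.contents.pop(name, None)    # make sure no stale replica lingers
            else:
                self.contents[name] = value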

10.2.4 Obtaining Strict Consistency with the Snoopy Cache

The basic idea of most cache coherence schemes is to somehow invalidate cache entries whenever they become inconsistent with the authoritative replica. One situation where a designer can use this idea is when several processors share the same secondary memory. If the processors could also share the cache, there would be no problem. But a shared cache tends to reduce performance, in two ways. First, to minimize latency the designer would prefer to integrate the cache with the processor, but a shared cache eliminates that option. Second, there must be some mechanism that arbitrates access to the shared cache by concurrent processors. That arbitration mechanism must enforce waits that increase access latency even more. Since the main point of a processor cache is to reduce latency, each processor usually has at least a small private cache.

Making the private cache write-through would ensure that the replica in secondary memory tracks the replica in the private cache. But write-through does not update any


replicas that may be in the private caches of other processors, so by itself it doesn’t provide read/write coherence. We need to add some way of telling those processors to invalidate any replicas their caches hold.

A naive approach would be to run a wire from each processor to the others and specify that whenever a processor writes to memory, it should send a signal on this wire. The other processors should, when they see the signal, assume that something in their cache has changed and, not knowing exactly what, invalidate everything their cache currently holds. Once all caches have been invalidated, the first processor can then confirm completion of its own write. This scheme would work, but it would have a disastrous effect on the cache hit rate. If 20% of processor data references are write operations, each processor will receive signals to invalidate the cache roughly every fifth data reference by each other processor. There would not be much point in having a big cache, since it would rarely have a chance to hold more than half a dozen valid entries.

To avoid invalidating the entire cache, a better idea would be to somehow communicate to the other caches the specific address that is being updated. To rapidly transmit an entire memory address in hardware could require adding a lot of wires. The trick is to realize that there is already a set of wires in place that can do this job: the memory bus. One designs each private cache to actively monitor the memory bus. If the cache notices that anyone else is doing a write operation via the memory bus, it grabs the memory address from the bus and invalidates any copy of data it has that corresponds to that address. A slightly more clever design will also grab the data value from the bus as it goes by and update, rather than invalidate, its copy of that data. These are two variations on what is called the snoopy cache [Suggestions for Further Reading 10.1.1]—each cache is snooping on bus activity. Figure 10.1 illustrates the snoopy cache.

The registers of the various processors constitute a separate concern because they may also contain copies of variables that were in a cache at the time a variable in the cache was invalidated or updated. When a program loads a shared variable into a register, it should be aware that it is shared, and provide coordination, for example through the use of locks, to ensure that no other processor can change (and thus invalidate) a variable that this processor is holding in a register. Locks themselves generally are implemented using write-through, to ensure that cached copies do not compromise the single-acquire protocol.

A small cottage industry has grown up around optimizations of cache coherence protocols for multiprocessor systems both with and without buses, and different designers have invented many quite clever speed-up tricks, especially with respect to locks. Before undertaking a multiprocessor cache design, a prospective processor architect should review the extensive literature of the area. A good place to start is with Chapter 8 of Computer Architecture: A Quantitative Approach, by Hennessy and Patterson [Suggestions for Further Reading 1.1.1].
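To make the bus-snooping idea concrete, here is a small simulation sketch. The names (Bus, SnoopyCache) are invented for illustration; it models the invalidate-versus-update choice in software rather than describing any particular hardware.

    # Hypothetical simulation of snoopy caches sharing a bus.
    class Bus:
        def __init__(self):
            self.caches = []

        def broadcast_write(self, writer, address, value):
            for cache in self.caches:
                if cache is not writer:
                    cache.snoop(address, value)

    class SnoopyCache:
        def __init__(self, bus, memory, update_on_snoop=False):
            self.bus = bus
            self.memory = memory                  # shared secondary memory: address -> value
            self.contents = {}                    # private cache contents
            self.update_on_snoop = update_on_snoop
            bus.caches.append(self)

        def write(self, address, value):
            self.contents[address] = value
            self.memory[address] = value          # write-through to secondary memory
            self.bus.broadcast_write(self, address, value)

        def read(self, address):
            if address not in self.contents:
                self.contents[address] = self.memory[address]
            return self.contents[address]

        def snoop(self, address, value):
            if address in self.contents:
                if self.update_on_snoop:
                    self.contents[address] = value   # grab the value as it goes by
                else:
                    del self.contents[address]       # invalidate the stale replica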


10.3 Durable Storage Revisited: Widely Separated Replicas

10.3.1 Durable Storage and the Durability Mantra

Chapter 8[on-line] demonstrated how to create durable storage using a technique called mirroring, and Section 9.7[on-line] showed how to give the mirrored replicas the all-or-nothing property when reading and writing. Mirroring is characterized by writing the replicas synchronously—that is, waiting for all or a majority of the replicas to be written before going on to the next action. The replicas themselves are called mirrors, and they are usually created on a physical unit basis. For example, one common RAID configuration uses multiple disks, on each of which the same data is written to the same numbered sector, and a write operation is not considered complete until enough mirror copies have been successfully written.

Mirroring helps protect against internal failures of individual disks, but it is not a magic bullet. If the application or operating system damages the data before writing it, all the replicas will suffer the same damage. Also, as shown in the fault tolerance analyses in the previous two chapters, certain classes of disk failure can obscure discovery that a replica was not written successfully. Finally, there is a concern for where the mirrors are physically located. Placing replicas at the same physical location does not provide much protection against the threat of environmental faults, such as fire or earthquake. Having them all

[Figure 10.1 diagram: processors A, B, and C, each with a private cache, attached by a shared bus to secondary memory; arrows 1, 2, and 3 mark the write, the write-through bus cycle, and the snooping caches.]

FIGURE 10.1 A configuration for which a snoopy cache can restore strict consistency and read/write coherence. When processor A writes to memory (arrow 1), its write-through cache immediately updates secondary memory using the next available bus cycle (arrow 2). The caches for processors B and C monitor (“snoop on”) the bus address lines, and if they notice a bus write cycle for an address they have cached, they update (or at least invalidate) their replica of the contents of that address (arrow 3).


under the same administrative control does not provide much protection against administrative bungling. To protect against these threats, the designer uses a powerful design principle:

The durability mantra

Multiple copies, widely separated and independently administered…

Multiple copies, widely separated and independently administered…

Sidebar 4.5 referred to Ross Anderson’s Eternity Service, a system that makes use of this design principle. Another formulation of the durability mantra is “lots of copies keep stuff safe” [Suggestions for Further Reading 10.2.3]. The idea is not new: “…let us save what remains; not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.”*

* Letter from Thomas Jefferson to the publisher and historian Ebenezer Hazard, February 18, 1791. Library of Congress, The Thomas Jefferson Papers Series 1. General Correspondence. 1651-1827.

The first step in applying this design principle is to separate the replicas geographically. The problem with separation is that communication with distant points has high latency and is also inherently unreliable. Both of those considerations make it problematic to write the replicas synchronously. When replicas are made asynchronously, one of the replicas (usually the first replica to be written) is identified as the primary copy, and the site that writes it is called the master. The remaining replicas are called backup copies, and the sites that write them are called slaves.

The constraint usually specified for replicas is that they should be identical. But when replicas are written at different times, there will be instants when they are not identical; that is, they violate the specified constraint. If a system failure occurs during one of those instants, violation of the constraint can complicate recovery because it may not be clear which replicas are authoritative. One way to regain some simplicity is to organize the writing of the replicas in a way understandable to the application, such as file-by-file or record-by-record, rather than in units of physical storage such as disk sector-by-sector. That way, if a failure does occur during replica writing, it is easier to characterize the state of the replica: some files (or records) of the replica are up to date, some are old, the one that was being written may be damaged, and the application can do any further recovery as needed. Writing replicas in a way understandable to the application is known as making logical copies, to contrast it with the physical copies usually associated with mirrors. Logical copying has the same attractions as logical locking, and also some of the performance disadvantages, because more software layers must be involved and it may require more disk seek arm movement.

In practice, replication schemes can be surprisingly complicated. The primary reason is that the purpose of replication is to suppress unintended changes to the data caused by random decay. But decay suppression also complicates intended changes, since one must


now update more than one copy, while being prepared for the possibility of a failure in the midst of that update. In addition, if updates are frequent, the protocols to perform update must not only be correct and robust, they must also be efficient. Since multiple replicas can usually be read and written concurrently, it is possible to take advantage of that possibility to enhance overall system performance. But performance enhancement can then become a complicating requirement of its own, one that interacts strongly with a requirement for strict consistency.

10.3.2 Replicated State Machines

Data replicas require a management plan. If the data is written exactly once and never again changed, the management plan can be fairly straightforward: make several copies, put them in different places so they will not all be subject to the same environmental faults, and develop algorithms for reading the data that can cope with loss of, disconnection from, and decay of data elements at some sites.

Unfortunately, most real world data need to be updated, at least occasionally, and update greatly complicates management of the replicas. Fortunately, there exists an easily-described, systematic technique to ensure correct management. Unfortunately, it is surprisingly hard to meet all the conditions needed to make it work.

The systematic technique is a sweeping simplification known as the replicated state machine. The idea is to identify the data with the state of a finite state machine whose inputs are the updates to be made to the data, and whose operation is to make the appropriate changes to the data, as illustrated in Figure 10.2. To maintain identical data replicas, co-locate with each of those replicas a replica of the state machine, and send the same inputs to each state machine. Since the state of a finite state machine is at all times determined by its prior state and its inputs, the data of the various replicas will, in principle, perfectly match one another.

The concept is sound, but four real-world considerations conspire to make this method harder than it looks:

1. All of the state machine replicas must receive the same inputs, in the same order. Agreeing on the values and order of the inputs at separated sites is known as achieving consensus. Achieving consensus among sites that do not have a common clock, that can crash independently, and that are separated by a best-effort communication network is a project in itself. Consensus has received much attention from theorists, who begin by defining its core essence, known as the consensus problem: to achieve agreement on a single binary value. There are various algorithms and protocols designed to solve this problem under specified conditions, as well as proofs that with certain kinds of failures consensus is impossible to reach. When conditions permit solving the core consensus problem, a designer can then apply bootstrapping to come to agreement on the complete set of values and order of inputs to a set of replicated state machines.


2. All of the data replicas (in Figure 10.2, the “prior state”) must be identical. The problem is that random decay events can cause the data replicas to drift apart, and updates that occur when they have drifted can cause them to drift further apart. So there needs to be a plan to check for this drift and correct it. The mechanism that identifies such differences and corrects them is known as reconciliation.

3. The replicated state machines must also be identical. This requirement is harder to achieve than it might at first appear. Even if all the sites run copies of the same program, the operating environment surrounding that program may affect its behavior, and there can be transient faults that affect the operation of individual state machines differently. Since the result is again that the data replicas drift apart, the same reconciliation mechanism that fights decay may be able to handle this problem.

4. To the extent that the replicated state machines really are identical, they will contain identical implementation faults. Updates that cause the faults to produce errors in the data will damage all the replicas identically, and reconciliation can neither detect nor correct the errors.

[Figure 10.2 diagram: update requests #1 and #2 arrive, in the same order, at state machines located at Sites 1, 2, and 3; each machine applies them to its prior state to produce a new state.]

FIGURE 10.2 Replicated state machines. If N identical state machines that all have the same prior state receive and perform the same update requests in the same order, then all N of the machines will enter the same new state.


The good news is that the replicated state machine scheme not only is systematic, but it lends itself to modularization. One module can implement the consensus-achieving algorithm; a second set of modules, the state machines, can perform the actual updates; and a third module responsible for reconciliation can periodically review the data replicas to verify that they are identical and, if necessary, initiate repairs to keep them that way.
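As a rough illustration of the scheme (not of any particular consensus protocol), the sketch below hands the same ordered sequence of update requests to several identical state machines and checks that their states match. All of the names in it are invented, and the hard part in practice, reaching consensus on the inputs and their order, is simply assumed.

    # Hypothetical sketch of replicated state machines. Consensus on the inputs
    # and their order is assumed: every replica is handed the same list of updates.
    class StateMachine:
        def __init__(self):
            self.state = {}                      # the data identified with the machine's state

        def apply(self, update):
            name, value = update                 # an update is just (name, new value)
            self.state[name] = value

    def run_replicas(updates, n_sites=3):
        replicas = [StateMachine() for _ in range(n_sites)]
        for update in updates:                   # same inputs, same order, at every site
            for machine in replicas:
                machine.apply(update)
        # Identical machines fed identical ordered inputs end in identical states.
        assert all(m.state == replicas[0].state for m in replicas)
        return replicas

    replicas = run_replicas([("balance", 100), ("owner", "Alice"), ("balance", 250)])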

10.3.3 Shortcuts to Meet more Modest Requirements

The replicated state machine method is systematic, elegant, and modular, but its implementation requirements are severe. At the other end of the spectrum, some applications can get along with a much simpler method: implement just a single state machine. The idea is to carry out all updates at one replica site, generating a new version of the database at that site, and then somehow bring the other replicas into line. The simplest, brute force scheme is to send a copy of this new version of the data to each of the other replica sites, completely replacing their previous copies. This scheme is a particularly simple example of master/slave replication. One of the things that makes it simple is that there is no need for consultation among sites; the master decides what to do and the slaves just follow along. The single state machine with brute force copies works well if:

• The data need to be updated only occasionally.
• The database is small enough that it is practical to retransmit it in its entirety.
• There is no urgency to make updates available, so the master can accumulate updates and perform them in batches.
• The application can get along with temporary inconsistency among the various replicas.

Requiring clients to read from the master replica is one way to mask the temporary inconsistency. On the other hand if, for improved performance, clients are allowed to read from any available replica, then during an update a client reading data from a replica that has received the update may receive different answers from another client reading data from a different replica to which the update hasn’t propagated yet.

This method is subject to data decay, just as is the replicated state machine, but the effects of decay are different. Undetected decay of the master replica can lead to a disaster in which the decay is propagated to the slave replicas. On the other hand, since update installs a complete new copy of the data at each slave site, it incidentally blows away any accumulated decay errors in slave replicas, so if update is frequent, it is usually not necessary to provide reconciliation. If updates are so infrequent that replica decay is a hazard, the master can simply do an occasional dummy update with unchanged data to reconcile the replicas.

The main defect of the single state machine is that even though data access can be fault tolerant—if one replica goes down, the others may still be available for reading—data update is not: if the primary site fails, no updates are possible until that failure is detected


and repaired. Worse, if the primary site fails while in the middle of sending out an update, the replicas may remain inconsistent until the primary site recovers. This whole approach doesn't work well for some applications, such as a large database with a requirement for strict consistency and a performance goal that can be met only by allowing concurrent reading of the replicas.

Despite these problems, the simplicity is attractive, and in practice many designers try to get away with some variant of the single state machine method, typically tuned up with one or more enhancements:

• The master site can distribute just those parts of the database that changed (the updates are known as “deltas” or “diffs”) to the replicas. Each replica site must then run an engine that can correctly update the database using the information in the deltas. This scheme moves back across the spectrum in the direction of the replicated state machine. Though it may produce a substantial performance gain, such a design can end up with the disadvantages of both the single and the replicated state machines.

• Devise methods to reduce the size of the time window during which replicas may appear inconsistent to reading clients. For example, the master could hold the new version of the database in a shadow copy, and ask the slave sites to do the same, until all replicas of the new version have been successfully distributed. Then, short messages can tell the slave sites to make the shadow file the active database. (This model should be familiar: a similar idea was used in the design of the two-phase commit protocol described in Chapter 9[on-line].)

• If the database is large, partition it into small regions, each of which can be updated independently. Section 10.3.7, below, explores this idea in more depth. (The Internet Domain Name System is for the most part managed as a large number of small, replicated partitions.)

• Assign a different master to each partition, to distribute the updating work more evenly and increase availability of update.

• Add fault tolerance for data update when a master site fails by using a consensus algorithm to choose a new master site.

• If the application is one in which the data is insensitive to the order of updates, implement a replicated state machine without a consensus algorithm. This idea can be useful if the only kind of update is to add new records to the data and the records are identified by their contents, rather than by their order of arrival. Members of a workgroup collaborating by e-mail typically see messages from other group members this way. Different users may find that received messages appear in different orders, and may even occasionally see one member answer a question that another member apparently hasn’t yet asked, but if the e-mail system is working correctly, eventually everyone sees every message.


• The master site can distribute just its update log to the replica sites. The replica sites can then run REDO on the log entries to bring their database copies up to date. Or, the replica site might just maintain a complete log replica rather than the database itself. In the case of a disaster at the master site, one of the log replicas can then be used to reconstruct the database.

This list just touches the surface. There seem to be an unlimited number of variations in application-dependent ways of doing replication.
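The log-shipping variant in the last bullet is simple enough to sketch. The fragment below is a hypothetical illustration (the Master and Slave classes are invented); a real system would also have to cope with lost messages, crashes in mid-shipment, and slaves that fall far behind.

    # Hypothetical sketch of master/slave replication by shipping the update log.
    class Master:
        def __init__(self):
            self.database = {}
            self.log = []                        # list of (name, new_value) entries

        def update(self, name, value):
            self.database[name] = value
            self.log.append((name, value))

        def ship_log_since(self, index):
            return self.log[index:]              # entries the slave has not yet seen

    class Slave:
        def __init__(self):
            self.database = {}
            self.applied = 0                     # how many log entries have been redone

        def catch_up(self, master):
            for name, value in master.ship_log_since(self.applied):
                self.database[name] = value      # REDO: reapply the logged change
                self.applied += 1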

10.3.4 Maintaining Data Integrity

In updating a replica, many things can go wrong: data records can be damaged or even completely lost track of in memory buffers of the sending or receiving systems, transmission can introduce errors, and operators or administrators can make blunders, to name just some of the added threats to data integrity. The durability mantra suggests imposing physical and administrative separation of replicas to make threats to their integrity more independent, but the threats still exist.

The obvious way to counter these threats to data integrity is to apply the method suggested on page 9–94 to counter spontaneous data decay: plan to periodically compare replicas, doing so often enough that it is unlikely that all of the replicas have deteriorated. However, when replicas are not physically adjacent this obvious method has the drawback that bit-by-bit comparison requires transmission of a complete copy of the data from one replica site to another, an activity that can be time-consuming and possibly expensive.

An alternative and less costly method that can be equally effective is to calculate a witness of the contents of a replica and transmit just that witness from one site to another. The usual form for a witness is a hash value that is calculated over the content of the replica, thus attesting to that content. By choosing a good hash algorithm (for example, a cryptographic quality hash such as described in Sidebar 11.7) and making the witness sufficiently long, the probability that a damaged replica will have a hash value that matches the witness can be made arbitrarily small. A witness can thus stand in for a replica for purposes of confirming data integrity or detecting its loss.

The idea of using witnesses to confirm or detect loss of data integrity can be applied in many ways. We have already seen checksums used in communications, both for end-to-end integrity verification (page 7–31) and in the link layer (page 7–40); checksums can be thought of as weak witnesses. For another example of the use of witnesses, a file system might calculate a separate witness for each newly written file, and store a copy of the witness in the directory entry for the file. When later reading the file, the system can recalculate the hash and compare the result with the previously stored witness to verify the integrity of the data in the file. Two sites that are supposed to be maintaining replicas of the file system can verify that they are identical by exchanging and comparing lists of witnesses. In Chapter 11[on-line] we will see that by separately protecting a witness one can also counter threats to data integrity that are posed by an adversary.
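Witness comparison is easy to express in code. The sketch below is an invented illustration (the function names are hypothetical); it uses SHA-256 as the cryptographic-quality hash, and the two sites exchange only the short hash values rather than whole replicas.

    # Hypothetical sketch: computing and comparing witnesses of replica contents.
    import hashlib

    def witness(replica_bytes):
        # A short value that attests to the content of the replica.
        return hashlib.sha256(replica_bytes).hexdigest()

    def replicas_match(local_bytes, remote_witness):
        # Only the witness needs to cross the network, not the whole replica.
        return witness(local_bytes) == remote_witness

    local_copy = b"catalog records for the local replica ..."
    remote = witness(local_copy)       # in reality this value arrives from the other site
    assert replicas_match(local_copy, remote)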


10.3.5 Replica Reading and Majorities

So far, we have explored various methods of creating replicas, but not how to use them. The simplest plan, with a master/slave system, is to direct all client read and write requests to the primary copy located at the master site, and treat the slave replicas exclusively as backups whose only use is to restore the integrity of a damaged master copy. What makes this plan simple is that the master site is in a good position to keep track of the ordering of read and write requests, and thus enforce a strict consistency specification such as the usual one for memory coherence: that a read should return the result of the most recent write.

A common enhancement to a replica system, intended to increase availability for read requests, is to allow reads to be directed to any replica, so that the data continues to be available even when the master site is down. In addition to improving availability, this enhancement may also have a performance advantage, since the several replicas can probably provide service to different clients at the same time. Unfortunately, the enhancement has the complication that there will be instants during update when the several replicas are not identical, so different readers may obtain different results, a violation of the strict consistency specification. To restore strict consistency, some mechanism that ensures before-or-after atomicity between reads and updates would be