The IPv6 Utopia

2021-01-25 ⏳ 39.4 min read (15.8k characters)

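This article was originally titled “The world in which IPv6 was a good design”. At first I assumed it was about IPv6. Only after finishing it did I realize it is a brilliant piece about networking in general (it barely discusses IPv6 at all). Anyone who has studied networking knows the OSI layers and roughly what each one does. But have you ever wondered why there are so many layers? I had long wanted to study the reasoning behind layering: what problem it actually solves, and how networking evolved step by step into what it is today. To my surprise, Avery Pennarun lays it all out clearly in a single article. I wanted to applaud when I finished it.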

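I read the article on New Year’s Day and have wanted to translate and share it ever since. It is long, though, and the author’s wording is quite free-spirited, which makes it hard going at my level of English. Still, a good article deserves a wider audience, so today I am gritting my teeth and forcing my way through. Because my ability is limited, I include the original English: each English paragraph is followed by my translation of it. I aim for faithfulness; fluency and elegance are too much to ask. If your English is good, I strongly recommend reading the original.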

Last November I went to an IETF meeting for the first time. The IETF is an interesting place; it seems to be about 1/3 maintenance grunt work, 1/3 extending existing stuff, and 1/3 blue sky insanity. I attended mostly because I wanted to see how people would react to TCP BBR, which was being presented there for the first time. (Answer: mostly positively, but with suspicion. It kinda seemed too good to be true.)

Anyway, the IETF meetings contain lots and lots of presentations about IPv6, the thing that was supposed to replace IPv4, which is what the Internet runs on. (Some would say IPv4 is already being replaced; some would say it has already happened.) Along with those presentations about IPv6, there were lots of people who think it’s great, the greatest thing ever, and they’re pretty sure it will finally catch on Any Day Now, and IPv4 is just a giant pile of hacks that really needs to die so that the Internet can be elegant again.

I thought this would be a great chance to really try to figure out what was going on. Why is IPv6 such a complicated mess compared to IPv4? Wouldn’t it be better if it had just been IPv4 with more address bits? But it’s not, oh goodness, is it ever not. So I started asking around. Here’s what I found.

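The paragraphs above are the author’s background for writing this article. In short, he attended an IETF meeting where many people were talking about IPv6, took the opportunity to dig into the Internet’s past, present, and future with them, and then wrote it all up.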

Buses ruined everything

Buses destroyed everything

Once upon a time, there was the telephone network, which used physical circuit switching. Essentially, that meant moving connectors around so that your phone connection was literally just a very long wire (“OSI layer 1”). A “leased line” was a very long wire that you leased from the phone company. You would put bits in one end of the wire, and they’d come out the other end, a fixed amount of time later. You didn’t need addresses because there was exactly one machine at each end.

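Long ago, the only network in the world was the telephone network, built from physical circuits and switches. (Think of the switchboard operators of the day. Translator’s note.) Its core function was to move connections around so that a caller effectively had a very long wire all to themselves, that is, OSI layer 1, the physical layer. (Which is why phone calls were expensive. Translator’s note.) A “leased line” was simply a very long wire rented from the phone company. You put bits in at one end and the same bits came out the other end a fixed amount of time later. Because there was exactly one machine at each end, you did not need addresses.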

Eventually the phone company optimized that a bit. Time-division multiplexing (TDM) and “virtual circuit switching” was born. The phone company could transparently take the bits at a slower bit rate from multiple lines, group them together with multiplexers and demultiplexers, and let them pass through the middle of the phone system using fewer wires than before. Making that work was a little complicated, but as far as we modem users were concerned, you still put bits in one end and they came out the other end. No addresses needed.

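Later, the phone company optimized things a bit, and time-division multiplexing (TDM) and “virtual circuit switching” were born. The phone company could transparently take bits at a slower bit rate from multiple lines, group them together with a multiplexer, carry them through the middle of the phone system, and use a demultiplexer to hand them to the right receivers, all over fewer wires than before. Making that work was a little complicated, but as far as modem users were concerned, bits still went in one end and came out the other. Still no addresses needed.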
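The multiplexing idea above can be sketched in a few lines. This is only an illustrative model, assuming each line delivers one unit per time slot and that the trunk carries the lines in a fixed order; the function names are made up for this example and do not come from the article.

```python
def tdm_multiplex(lines):
    """Interleave one unit from each line per frame, in a fixed line order."""
    trunk = []
    for frame in zip(*lines):  # one frame = one time slot per input line
        trunk.extend(frame)
    return trunk


def tdm_demultiplex(trunk, n_lines):
    """Recover line i by taking every n_lines-th unit, starting at slot i."""
    return [trunk[i::n_lines] for i in range(n_lines)]


# Three slow lines share one faster trunk.
lines = [list("aaaa"), list("bbbb"), list("cccc")]
trunk = tdm_multiplex(lines)
recovered = tdm_demultiplex(trunk, len(lines))
```

Note that nothing in the trunk carries an address: a unit’s position within the frame is what identifies its line, which matches the “no addresses needed” point.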

The Internet (not called the Internet at the time) was built on top of these circuits. You had a bunch of wires that you could put bits into and have them come out the other side. If one computer had two or three interfaces, then it could, if given the right instructions, forward bits from one line to another, and you could do something a lot more efficient than a separate line between each pair of computers. And so IP addresses (“layer 3”), subnets, and routing were born. Even then, with these point-to-point links, you didn’t need MAC addresses, because once a packet went into the wire, there was only one place it could come out. You used IP addresses to decide where it should go after that.

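The Internet (not yet called that) was built on top of these circuits. You had a bunch of wires; you put bits in at one end and received them at the other. If a computer had two or three interfaces and was given the right instructions, it could forward bits from one line to another, which was far more efficient than running a separate line between every pair of computers. And so IP addresses (layer 3), subnets, and routing were born. Even then, these point-to-point links did not need MAC addresses, because once a packet went onto the wire there was only one place it could come out. IP addresses were used only to decide where the packet should go after that.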

Meanwhile, LANs got invented as an alternative. If you wanted to connect computers (or terminals and a mainframe) together at your local site, it was pretty inconvenient to need multiple interfaces, one for each wire to each satellite computer, arranged in a star configuration. To save on electronics, people wanted to have a “bus” network (also known as a “broadcast domain,” a name that will be important later) where multiple stations could just be plugged into a single wire, and talk to any other station plugged into the same wire. These were not the same people as the ones building the Internet, so they didn’t use IP addresses for this. They all invented their own scheme (“layer 2”).

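Meanwhile, LANs were invented as an alternative. If you wanted to connect computers (or terminals and a mainframe) at your own site, needing one interface per attached machine, arranged in a star, was clearly inconvenient. To save on electronics, people wanted a “bus” network (also known as a “broadcast domain”, a name that will matter later), where all the stations plug into a single wire and can talk to any other station on that wire. The people who invented bus networks were not the people building the Internet, so they did not use IP addresses. They each invented their own scheme (layer 2).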

One of the early local bus networks was arcnet, which is dear to my heart (I wrote the first Linux arcnet driver and arcnet poetry way back in the 1990s, long after arcnet was obsolete). Arcnet layer 2 addresses were very simplistic: just 8 bits, set by jumpers or DIP switches on the back of the network card. As the network owner, it was your job to configure the addresses and make sure you didn’t have any duplicates, or all heck would ensue. This was kind of a pain, but arcnet networks were usually pretty small, so it was only kind of a pain.

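Arcnet was one of the early bus networks. It is dear to my heart: I wrote the first Linux arcnet driver (and arcnet poetry) back in the 1990s, long after arcnet was obsolete. Arcnet’s layer 2 addresses were very simple: just 8 bits, set with jumpers or DIP switches on the back of the network card. As the network owner, your job was to configure the addresses and make sure there were no duplicates; if there were, you were in for a surprise. It was kind of a pain, but arcnet networks were usually small, so it was only kind of a pain.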

A few years later, ethernet came along and solved that problem once and for all, by using many more bits (48, in fact) in the layer 2 address. That’s enough bits that you can assign a different (sharded-sequential) address to every device that has ever been manufactured, and not have any overlaps. And that’s exactly what they did! Thus the ethernet MAC address was born.

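A few years later, ethernet arrived and solved that problem once and for all by using a much longer layer 2 address (48 bits, in fact). That address space is large enough to give every device ever manufactured a different address (allocated in sequential blocks) without any overlap. And that is exactly what they did. Thus the ethernet MAC address was born.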
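To make the “sharded-sequential” idea concrete, here is a small sketch. It assumes the usual layout of a 48-bit MAC address, where the first three bytes identify the manufacturer’s block and the last three are handed out by that manufacturer; `split_mac` is a hypothetical helper written for this illustration.

```python
def split_mac(mac):
    """Split a colon-separated MAC into (manufacturer block, device part)."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")

    def fmt(chunk):
        return ":".join(f"{o:02x}" for o in chunk)

    return fmt(octets[:3]), fmt(octets[3:])


# 48 bits of address space, compared with arcnet's 8 bits.
ETHERNET_ADDRESSES = 2 ** 48
ARCNET_ADDRESSES = 2 ** 8

prefix, device = split_mac("11:22:33:44:55:66")
```

Each manufacturer can count upward within its own block, so no global coordination is needed beyond handing out the blocks.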

Various LAN technologies came and went, including one of my favourites, IPX (Internetwork Packet Exchange, though it had nothing to do with the “real” Internet) and Netware, which worked great as long as all the clients and servers were on a single bus network. You never had to configure any addresses, ever. It was beautiful, and reliable, and worked. The golden age of networking, basically.

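Various LAN technologies came and went. My favourites among them were IPX and Netware. IPX stands for Internetwork Packet Exchange, though it had nothing whatsoever to do with the real Internet. As long as all the clients and servers were on a single bus network, IPX and Netware worked great. You never had to configure any addresses. Elegant, reliable, and it worked. Basically the golden age of networking.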

Of course, someone had to ruin it: big company/university networks. They wanted to have so many computers that sharing 10 Mbps of a single bus network between them all became a huge bottleneck, so they needed a way to have multiple buses, and then interconnect - “internetwork,” if you will - those buses together. You’re probably thinking, of course! Use the Internet Protocol for that, right? Ha ha, no. The Internet protocol, still not called that, wasn’t mature or popular back then, and nobody took it seriously. Netware-over-IPX (and the many other LAN protocols at the time) were serious business, so as serious businesses do, they invented their own thing(s) to extend the already-popular thing, ethernet. Devices on ethernet already had addresses, MAC addresses, which were about the only thing the various LAN protocol people could agree on, so they decided to use ethernet addresses as the keys for their routing mechanisms. (Actually they called it bridging and switching instead of routing.)

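Of course, someone had to ruin the golden age: big company and university networks. They wanted so many computers that sharing a single 10 Mbps bus among all of them became a huge bottleneck, so they needed a way to interconnect multiple buses. You might assume they would naturally use the Internet Protocol for that. They did not. The protocol, not yet called the Internet Protocol, was neither mature nor popular, and nobody took it seriously. Netware-over-IPX, like the many other LAN protocols of the time, was serious business, and as serious businesses do, they invented their own extensions to ethernet, which was already popular. Devices on ethernet already had MAC addresses, about the only thing the various LAN protocol people could agree on, so they decided to use MAC addresses as the keys for their routing mechanisms. (Actually, they called it bridging and switching rather than routing.)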

The problem with ethernet addresses is they’re assigned sequentially at the factory, so they can’t be hierarchical. That means the “bridging table” is not as nice as a modern IP routing table, which can talk about the route for a whole subnet at a time. In order to do efficient bridging, you had to remember which network bus each MAC address could be found on. And humans didn’t want to configure each of those by hand, so it needed to figure itself out automatically. If you had a complex internetwork of bridges, this could get a little complicated. As I understand it, that’s what led to the spanning tree poem, and I think I’ll just leave it at that. Poetry is very important in networking.

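The problem with MAC addresses is that they are assigned sequentially at the factory, so they cannot be hierarchical. That means the “bridging table” is not as elegant as a modern IP routing table, which can describe the route for a whole subnet at once. To bridge efficiently, you had to remember which bus each individual MAC address lived on. Nobody wanted to configure that by hand, so the network had to figure it out automatically. With a complex internetwork of bridges, that could get complicated. As I understand it, this is what led to the spanning tree poem, and I think I will leave it at that. Poetry is very important in networking.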
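The contrast between a flat bridging table and a hierarchical routing table can be sketched as follows. This is a simplified model, not any real bridge’s implementation: the class and function names are invented here, and real bridges also handle broadcast frames, entry aging, and loops (hence spanning tree).

```python
import ipaddress


class LearningBridge:
    """Flat table: one entry per MAC address, learned from observed traffic."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.table = {}  # MAC address -> port where it was last seen

    def handle_frame(self, in_port, src_mac, dst_mac):
        self.table[src_mac] = in_port  # learn where the sender lives
        out_port = self.table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the source.
            return [p for p in self.ports if p != in_port]
        return [] if out_port == in_port else [out_port]


def route_lookup(routes, dst_ip):
    """Hierarchical table: one entry can cover a whole subnet."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [(net, hop) for net, hop in routes if dst in net]
    if not matches:
        return None
    # Longest prefix wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


routes = [
    (ipaddress.ip_network("10.1.0.0/16"), "port2"),  # 65536 addresses, 1 entry
    (ipaddress.ip_network("0.0.0.0/0"), "port1"),    # everything else
]
```

The bridge needs one table entry for every station it has ever heard from, while two routing entries here cover the entire address space.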

Anyway, it mostly worked, but it was a bit of a mess, and you got broadcast floods every now and then, and the routes weren’t always optimal, and it was pretty much impossible to debug. (You definitely couldn’t write something like traceroute for bridging, because none of the tools you need to make it work - such as the ability for an intermediate bridge to even have an address - exist in plain ethernet.)

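Anyway, it mostly worked, but it was a bit of a mess: you got broadcast floods every now and then, the routes were not always optimal, and it was nearly impossible to debug. (You could not write anything like traceroute for a bridged network, because none of what you would need exists in plain ethernet; an intermediate bridge does not even have an address.)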

On the other hand, all these bridges were hardware-optimized. The whole system was invented by hardware people, basically, as a way of fooling the software, which had no idea about multiple buses and bridging between them, into working better on large networks. Hardware bridging means the bridging could go really really fast - as fast as the ethernet could go. Nowadays that doesn’t sound very special, but at the time, it was a big deal. Ethernet was 10 Mbps, because you could maybe saturate it by putting a bunch of computers on the network all at once, not because any one computer could saturate 10 Mbps. That was crazy talk.

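On the other hand, all these bridges were hardware-optimized. The whole system was basically invented by hardware people as a way of fooling the software, which knew nothing about multiple buses or the bridging between them, into working better on large networks. Hardware bridging meant it could go really fast, as fast as the ethernet itself. That does not sound special now, but it was a big deal at the time. Ethernet ran at 10 Mbps because a whole group of computers on the network might saturate it all at once, not because any single computer could. That would have been crazy talk.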

Anyway, the point is, bridging was a mess, and impossible to debug, but it was fast.

Anyway, the point is that bridging was a mess and impossible to debug, but it was fast!

Internet over buses

The Internet on buses

While all that was happening, those Internet people were getting busy, and were of course not blind to the invention of cool cheap LAN technologies. I think it might have been around this time that the ARPANET got actually renamed to the Internet, but I’m not sure. Let’s say it was, because the story is better if I sound confident.

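Meanwhile, the Internet people were busy too, and they had of course noticed the newly invented, cheap, and cool LAN technologies. I think it was around then that the ARPANET was renamed the Internet, but I am not sure. Let’s say it was, because the story is better if I sound confident.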

At some point, things progressed from connecting individual Internet computers over point-to-point long distance links, to the desire to connect whole LANs together, over point-to-point links. Basically, you wanted a long-distance bridge.

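At some point, things moved on from connecting individual computers over long-distance point-to-point links to wanting to connect whole LANs over those links. Basically, people wanted a long-distance bridge.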

You might be thinking, hey, no big deal, why not just build a long distance bridge and be done with it? Sounds good, doesn’t work. I won’t go into the details right now, but basically the problem is congestion control. The deep dark secret of ethernet bridging is that it assumes all your links are about the same speed, and/or completely uncongested, because they have no way to slow down. You just blast data as fast as you can, and expect it to arrive. But when your ethernet is 10 Mbps and your point-to-point link is 0.128 Mbps, that’s completely hopeless. Separately, the idea of figuring out your routes by flooding all the links to see which one is right - this is the actual way bridging typically works - is hugely wasteful for slow links. And sub-optimal routing, an annoyance on local networks with low latency and high throughput, is nasty on slow, expensive long-distance links. It just doesn’t scale.

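You might think that is no big deal: why not just build a long-distance bridge and be done with it? It sounds fine, but it does not work. I won’t go into the details now, but basically the problem is congestion control. The deep dark secret of ethernet bridging is that it assumes all your links are about the same speed and/or completely uncongested, because bridges have no way to slow down. You just blast data out as fast as you can and expect it to arrive. But when your ethernet is 10 Mbps and your point-to-point link is 0.128 Mbps, that is hopeless. Separately, bridging typically finds its routes by flooding every link to see which one is right, which is hugely wasteful on slow links. And sub-optimal routing, a mere annoyance on a low-latency, high-throughput local network, is nasty on slow, expensive long-distance links. It simply does not scale.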
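A quick back-of-the-envelope calculation shows why the speed mismatch is hopeless. The two link speeds are the ones quoted above; the assumption that the LAN side sends at full rate for a whole second, and the helper name `backlog_bits`, are added here purely for illustration.

```python
LAN_BPS = 10_000_000  # 10 Mbps ethernet
WAN_BPS = 128_000     # 0.128 Mbps point-to-point link


def backlog_bits(in_bps, out_bps, seconds):
    """Bits that pile up (or get dropped) when input outpaces output."""
    return max(0, (in_bps - out_bps) * seconds)


# How many times faster the LAN is than the long-distance link.
speed_ratio = LAN_BPS / WAN_BPS

# Bits with nowhere to go after one second of full-rate sending.
backlog_after_1s = backlog_bits(LAN_BPS, WAN_BPS, 1)
```

With no way to tell senders to slow down, a bridge in this position can only queue or drop almost everything it receives.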

Luckily, those Internet people (if it was called the Internet yet) had been working on that exact set of problems. If we could just use Internet stuff to connect ethernet buses together, we’d be in great shape.

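Luckily, the Internet people (if it was called the Internet yet) had been working on exactly that set of problems. If we could just use Internet technology to connect ethernet buses together, we would be in great shape.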

And so they designed a “frame format” for Internet packets over ethernet (and arcnet, for that matter, and every other kind of LAN).

So they designed a “frame format” for carrying Internet packets over ethernet (and arcnet, and every other kind of LAN).

And that’s when everything started to go wrong.

And that is where the nightmare began.

The first problem that needed solving was that now, when you put an Internet packet onto a wire, it was no longer clear which machine was supposed to “hear” it and maybe forward it along. If multiple Internet routers were on the same ethernet segment, you couldn’t have them all picking it up and trying to forward it; that way lies packet storms and routing loops. No, you had to choose which router on the ethernet bus is supposed to pick it up. We can’t just use the IP destination field for that, because we’re already using that for the final destination, not the router destination. Instead, we identify the desired router using its MAC address in the ethernet frame.

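The first problem to solve: once you put an Internet packet onto the wire, how do you decide which machine should receive it and perhaps forward it? If several routers sit on the same ethernet segment, you cannot let them all pick it up and forward it; that leads to packet storms and routing loops. You have to choose which router on the bus picks it up. The IP destination field is already used for the final destination, so it cannot also name the router. Instead, we identify the router by its MAC address in the ethernet frame.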

So basically, to set up your local IP routing table, you want to be able to say something like, “send packets to IP address 10.1.1.1 via the router at MAC address 11:22:33:44:55:66.” That’s the actual thing you want to express. This is important! Your destination is an IP address, but your router is a MAC address. But if you’ve ever configured a routing table, you might have noticed that nobody writes it like that. Instead, because the writers of your operating system’s TCP/IP stack are stubborn, you write something like “send packets to IP address 10.1.1.1 via the router at IP address 192.168.1.1.”

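So basically, when you set up your local routing table, what you want to say is: “send packets for IP address 10.1.1.1 via the router at MAC address 11:22:33:44:55:66.” That is what you actually mean. This is important: your destination is an IP address, but your router is a MAC address. Yet if you have ever configured a routing table, you will have noticed that nobody writes it like that. Because the people who wrote your operating system’s TCP/IP stack are stubborn, you write: “send packets for IP address 10.1.1.1 via the router at IP address 192.168.1.1.”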

In truth, that really is just complicating things. Now your operating system has to first look up the ethernet address of 192.168.1.1, find out it’s 11:22:33:44:55:66, and finally generate a packet with destination ethernet address 11:22:33:44:55:66 and destination IP address 10.1.1.1. 192.168.1.1 shows up nowhere in the packet; it’s just an abstraction at the human level.

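In truth, that just complicates things. Your operating system first has to look up the ethernet address of 192.168.1.1, find that it is 11:22:33:44:55:66, and then generate a packet with destination MAC address 11:22:33:44:55:66 and destination IP address 10.1.1.1. The address 192.168.1.1 appears nowhere in the packet; it is just an abstraction for humans.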
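The two-step lookup described above might be modelled like this. It is a toy sketch: the route and neighbor tables are hard-coded with the addresses from the text, and `Frame`, `ROUTES`, `NEIGHBORS`, and `build_frame` are names invented for the example rather than any operating system’s real API.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Frame:
    dst_mac: str  # who on the local bus should pick the frame up
    dst_ip: str   # where the packet is ultimately going


# "send packets to 10.1.1.1 via the router at IP address 192.168.1.1"
ROUTES = [(ipaddress.ip_network("10.1.1.1/32"), "192.168.1.1")]

# Stand-in for the ARP cache: next-hop IP -> MAC address.
NEIGHBORS = {"192.168.1.1": "11:22:33:44:55:66"}


def build_frame(dst_ip):
    dst = ipaddress.ip_address(dst_ip)
    for network, gateway_ip in ROUTES:
        if dst in network:
            # The gateway's IP is used only to find its MAC address.
            return Frame(dst_mac=NEIGHBORS[gateway_ip], dst_ip=dst_ip)
    raise LookupError(f"no route to {dst_ip}")


frame = build_frame("10.1.1.1")
```

The resulting frame holds the router’s MAC address and the final IP address; the router’s own IP address has disappeared by the time anything is sent.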

To do that pointless intermediate step, you need to add ARP (address resolution protocol), a simple non-IP protocol whose job it is to convert IP addresses to ethernet addresses. It does this by broadcasting to everyone on the local ethernet bus, asking them all to answer if they own that particular IP address. If you have bridges, they all have to forward all the ARP packets to all their interfaces, because they’re ethernet broadcast packets, and that’s what broadcasting means. On a big, busy ethernet with lots of interconnected LANs, excessive broadcasts start becoming one of your biggest nightmares. It’s especially bad on wifi. As time went on, people started making bridges/switches with special hacks to avoid forwarding ARP as far as it’s technically supposed to go, to try to cut down on this problem. Some devices (especially wifi access points) just make fake ARP answers to try to help. But doing any of that is a hack, albeit sometimes a necessary hack.

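To perform that pointless intermediate step, you need ARP (Address Resolution Protocol), a simple non-IP protocol whose job is to convert IP addresses into MAC addresses. ARP broadcasts to every host on the local ethernet bus, asking whoever owns a particular IP address to answer. If your network has bridges, they have to forward ARP packets out of all their interfaces, because these are ethernet broadcasts and that is what broadcasting means. On a big, busy ethernet with lots of interconnected LANs, excessive broadcasts become one of your biggest nightmares, and it is especially bad on wifi. Over time, people started building bridges and switches with special hacks that avoid forwarding ARP as far as it is technically supposed to go. Some devices (especially wifi access points) simply fake ARP replies to help. But a hack is a hack, even if it is sometimes a necessary one.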
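A rough simulation makes the cost of that broadcast visible. This sketch assumes a single broadcast domain in which every host receives every ARP request; the `Host` class and `arp_resolve` function are hypothetical and ignore details such as caching and timeouts.

```python
class Host:
    def __init__(self, mac, ip):
        self.mac = mac
        self.ip = ip
        self.requests_seen = 0  # how many broadcasts this host had to process

    def on_arp_request(self, target_ip):
        self.requests_seen += 1
        # Only the owner of the address answers; everyone else just listens.
        return self.mac if target_ip == self.ip else None


def arp_resolve(segment, target_ip):
    """Broadcast 'who has target_ip?' to every host in the broadcast domain."""
    replies = [host.on_arp_request(target_ip) for host in segment]
    answers = [mac for mac in replies if mac is not None]
    return answers[0] if answers else None


segment = [Host(f"02:00:00:00:00:{i:02x}", f"192.168.1.{i}") for i in range(1, 101)]
router_mac = arp_resolve(segment, "192.168.1.1")
interrupted_hosts = sum(host.requests_seen for host in segment)
```

One lookup interrupts all 100 hosts to get a single answer, which is why large bridged networks end up needing the ARP-suppression hacks mentioned above.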

Death by legacy

Time passed. Eventually (and this actually took quite a while), people pretty much stopped using non-IP protocols on ethernet at all. So basically all networks became a physical wire (layer 1), with multiple stations on a bus (layer 2), with multiple buses connected over bridges (gotcha! still layer 2!), and those inter-buses connected over IP routers (layer 3).

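Time passed. Eventually (and it took quite a while), people pretty much stopped using non-IP protocols on ethernet. So basically every network became a physical wire (layer 1), with multiple stations on a bus (layer 2), with multiple buses connected by bridges (still layer 2!), and those groups of buses connected by IP routers (layer 3).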

After a while, people got tired of manually configuring IP addresses, arcnet style, and wanted them to auto-configure, ethernet style, except it was too late to literally do it ethernet style, because a) the devices had already been manufactured with ethernet addresses, not IP addresses, and b) IP addresses were only 32 bits, which is not enough to just manufacture them forever with no overlaps, and c) just assigning IP addresses sequentially instead of using subnets would bring us back to square one: it would just be ethernet over again, and we already have ethernet.

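After a while, people got tired of configuring IP addresses by hand, arcnet style, and wanted them to configure themselves the way ethernet MAC addresses do. But it was too late for that, because (a) network cards had already been manufactured with MAC addresses, not IP addresses; (b) IP addresses were only 32 bits, not enough to keep handing out at the factory forever without overlaps; and (c) assigning IP addresses sequentially instead of using subnets would take us back to square one: it would just be ethernet all over again, and we already have ethernet.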

So that’s where bootp and DHCP came from. Those protocols, by the way, are special kinda like ARP is special (except they pretend not to be special, by technically being IP packets). They have to be special, because an IP node has to be able to transmit them before it has an IP address, which is of course impossible, so it just fills the IP headers with essentially nonsense (albeit nonsense specified by an RFC), so the headers might as well have been left out. (You know these “IP” headers are nonsense because the DHCP server has to open a raw socket and fill them in by hand; the kernel IP layer can’t do it.) But nobody would feel nice if they were inventing a whole new protocol that wasn’t IP, so they pretended it was IP, and then they felt nice. Well, as nice as one can feel when one is inventing DHCP.

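That is where bootp and DHCP came from. Like ARP, these protocols are special, except that they pretend not to be by technically being IP packets. They have to be special, because an IP node must be able to send them before it has an IP address, which is of course impossible. So it simply fills the IP headers with what is essentially nonsense (albeit nonsense specified by an RFC), and the headers might as well have been left out. (You can tell these “IP” headers are nonsense because the DHCP server has to open a raw socket and fill them in by hand; the kernel’s IP layer cannot do it.) But nobody would feel good about inventing a whole new protocol that was not IP, so they pretended it was IP, and then they felt good. Well, as good as one can feel when inventing DHCP.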
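For a sense of what that “nonsense” looks like, here is a sketch that hand-builds the IPv4 header of a client’s first DHCP message. It assumes the addressing defined for DHCPDISCOVER in RFC 2131 (source 0.0.0.0, destination 255.255.255.255, carried over UDP); the helper names, TTL, and payload length are arbitrary choices for this illustration, and the UDP and DHCP parts are left out.

```python
import socket
import struct


def internet_checksum(data):
    """Ones'-complement sum of 16-bit words, as used by the IPv4 header."""
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header(src, dst, payload_len, proto=17, ttl=64):
    """Build a 20-byte IPv4 header by hand (proto 17 = UDP)."""
    version_ihl = (4 << 4) | 5  # IPv4, five 32-bit words, no options
    header = struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl, 0, 20 + payload_len,  # version/IHL, TOS, total length
        0, 0,                              # identification, flags/fragment
        ttl, proto, 0,                     # TTL, protocol, checksum placeholder
        socket.inet_aton(src), socket.inet_aton(dst),
    )
    checksum = internet_checksum(header)
    return header[:10] + struct.pack("!H", checksum) + header[12:]


# The client has no address yet, so it says "from nobody, to everybody".
discover_header = ipv4_header("0.0.0.0", "255.255.255.255", payload_len=308)
```

Neither address identifies a particular machine; the real identity of the sender is its ethernet address, which travels inside the DHCP payload and in the frame header.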

Anyway, I digress. The salient detail here is that unlike real IP services, bootp and DHCP need to know about ethernet addresses, because after all, it’s their job to hear your ethernet address and assign you an IP address to go with it. They’re basically the reverse of ARP, except we can’t say that, because there’s a protocol called RARP that is literally the reverse of ARP. Actually, RARP worked quite fine and did the same thing as bootp and DHCP while being much simpler, but we don’t talk about that.

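Anyway, I digress. The key detail is that, unlike real IP services, bootp and DHCP need to know about ethernet addresses, because their whole job is to hear your ethernet address and assign an IP address to go with it. They are basically the reverse of ARP, except we cannot call them that, because there is already a protocol called RARP that is literally the reverse of ARP. In fact, RARP worked fine and did the same thing as bootp and DHCP while being much simpler, but we don’t talk about that.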

The point of all this is that ethernet and IP were getting further and further intertwined. They’re nowadays almost inseparable. It’s hard to imagine a network interface (except ppp0) without a 48-bit MAC address, and it’s hard to imagine that network interface working without an IP address. You write your IP routing table using IP addresses, but of course you know you’re lying when you name the router by IP address; you’re just indirectly saying that you want to route via a MAC address. And you have ARP, which gets bridged but not really, and DHCP, which is an IP packet but is really an ethernet protocol, and so on.

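The point of all this is that ethernet and IP were becoming more and more intertwined, and today they are almost inseparable. It is hard to imagine a network interface (other than ppp0) without a 48-bit MAC address, and hard to imagine that interface working without an IP address. You write your routing table using IP addresses, but you know you are lying when you name the router by IP address; you are just saying, indirectly, that you want to route via a MAC address. And you have ARP, which gets bridged but not really, and DHCP, which is an IP packet but is really an ethernet protocol, and so on.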

Moreover, we still have both bridging and routing, and they both get more and more complicated as the LANs and the Internet get more and more complicated, respectively. Bridging is still, mostly, hardware based and defined by IEEE, the people who control the ethernet standards. Routing is still, mostly, software based and defined by the IETF, the people who control the Internet standards. Both groups still try to pretend the other group doesn’t exist. Network operators basically choose bridging vs routing based on how fast they want it to go and how much they hate configuring DHCP servers, which they really hate very much, which means they use bridging as much as possible and routing when they have to.

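Moreover, we still have both bridging and routing, and each keeps getting more complicated as LANs and the Internet, respectively, get more complicated. Bridging is still mostly hardware-based and defined by the IEEE, the people who control the ethernet standards. Routing is still mostly software-based and defined by the IETF, the people who control the Internet standards. Both groups still pretend the other does not exist. Network operators basically choose between bridging and routing based on how fast they want things to go and how much they hate configuring DHCP servers, which is a great deal. In other words, they bridge as much as possible and route only when they have to.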

In fact, bridging has gotten so completely out of control that people decided to extract the layer 2 bridging decisions out completely to a higher level (with configuration exchanged between bridges using a protocol layered over IP, of course!) so it can be centrally managed. That’s called software-defined networking (SDN). It helps a lot, compared to letting your switches and bridges just do whatever they want, but it’s also fundamentally silly, because you know what’s software defined networking? IP. It is literally and has always been the software-defined network you use for interconnecting networks that have gotten too big. But the problem is, IPv4 was initially too hard to hardware accelerate, and anyway, it didn’t get hardware accelerated, and configuring DHCP really is a huge pain, so network operators just learned how to bridge bigger and bigger things. And nowadays big data centers are basically just SDNed, and you might as well not be using IP in the data center at all, because nobody’s routing the packets. It’s all just one big virtual bus network.

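In fact, bridging has gotten so out of control that people decided to lift the layer 2 bridging decisions up to a higher level (with the configuration exchanged between bridges over a protocol layered on IP, of course!) so that it can be centrally managed. This is called software-defined networking (SDN). It helps a lot compared with letting your switches and bridges do whatever they want, but it is also fundamentally silly, because you know what a software-defined network is? IP. It literally is, and always has been, the software-defined network you use to interconnect networks that have grown too big. The trouble is that IPv4 was initially too hard to accelerate in hardware, and in any case it did not get hardware acceleration, and configuring DHCP really is a huge pain, so network operators just learned to bridge bigger and bigger things. Today’s big data centers are basically all SDN, and you might as well not be using IP in the data center at all, because nobody is routing the packets. It is all one giant virtual bus network.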

It is, in short, a mess.

In short, it is a complete mess!

Now forget I said all that…

Great story, right? Right. Now pretend none of that happened, and we’re back in the early 1990s, when most of that had in fact already happened, but people at the IETF were anyway pretending that it hadn’t happened and that the “upcoming” disaster could all be avoided. This is the good part!

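Great story, right? Now pretend none of it happened and we are back in the early 1990s. In reality most of it had already happened by then, but the IETF people were pretending it had not, and believed the “upcoming” disaster could still be avoided. This is the good part!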

There’s one thing I forgot to mention in that big long story above: somewhere in that whole chain of events, we completely stopped using bus networks. Ethernet is not actually a bus anymore. It just pretends to be a bus. Basically, we couldn’t get ethernet’s famous CSMA/CD to keep working as speeds increased, so we went back to the good old star topology. We run bundles of cables from the switch, so that we can run one cable from each station all the way back to the center point. Walls and ceilings and floors are filled with big, thick, expensive bundles of ethernet, because we couldn’t figure out how to make buses work well… at layer 1. It’s kinda funny actually when you think about it. If you find sad things funny.

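There is one thing I forgot to mention in the story above: somewhere along the way, we stopped using bus networks altogether. Ethernet has not been a bus for a long time; it just pretends to be one. Basically, as speeds increased we could not keep ethernet’s famous CSMA/CD (carrier-sense multiple access with collision detection. Translator’s note.) working, so we went back to the good old star topology. We run bundles of cables out from the switch, one cable per station, all the way back to the center point. Walls, ceilings, and floors are stuffed with big, thick, expensive bundles of ethernet cable, all because we could not make buses work well at layer 1. It is kind of funny when you think about it. If you find sad things funny.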

In fact, in a bonus fit of insanity, even wifi - the ultimate bus network, right, where literally everybody is sharing the same open-air “bus” - we almost universally use wifi in a mode, called “infrastructure mode,” which simulates a giant star topology. If you have two wifi stations connected to the same access point, they don’t talk to each other directly, even when they can hear each other just fine. They send a packet to the access point, but addressed to the MAC address of the other node. The access point then bounces it back out to the destination node.

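In fact, in a bonus fit of insanity, even wifi, the ultimate bus network, where everyone literally shares the same open-air “bus”, is almost always used in a mode called “infrastructure mode”, which simulates a giant star topology. If two wifi stations are connected to the same access point (AP), they do not talk to each other directly, even when they can hear each other perfectly well. They send the packet to the AP, but addressed to the MAC address of the other station. The AP then bounces it back out to the destination.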

HOLD THE HORSES LET ME JUST REVIEW THAT FOR YOU. There’s a little catch there. When node X wants to send to Internet node Z, via IP router Y, via wifi access point A, what does the packet look like? Just to draw a picture, here’s what we want to happen:

X -> [wifi] -> A -> [wifi] -> Y -> [internet] -> Z

Z is the IP destination, so obviously the IP destination field has to be Z. Y is the router, which we learned above that we specify by using its ethernet MAC address in the ethernet destination field. But in wifi, X can’t just send out a packet to Y, for various reasons (including that they don’t know each other’s WPA2 encryption keys). We have to send to A. Where do we put A’s address, you might ask?

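Hold your horses and let me walk you through that. There is a little catch here. When node X wants to send to Internet node Z via IP router Y, and both X and Y reach wifi access point A over wifi, what does the packet look like? To draw a picture, here is what we want to happen: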

X -> [wifi] -> A -> [wifi] -> Y -> [internet] -> Z

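Z is the IP destination, so the IP destination field obviously has to be Z. Y is the router, and as we learned above, we name it by putting its MAC address in the ethernet destination field. But on wifi, X cannot just send a packet straight to Y, for various reasons (including that they do not know each other’s WPA2 encryption keys). We have to send it to A. So where, you might ask, do we put A’s address?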

No problem! 802.11 has a thing called 3-address mode. They add a third ethernet MAC address to every frame, so they can talk about the real ethernet destination, and the intermediate ethernet destination. On top of that, there are bit fields called “to-AP” and “from-AP,” which tell you if the packet is going from a station to an AP, or from an AP to a station, respectively. But actually they can both be true at the same time, because that’s how you make wifi repeaters (APs send packets to APs).

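No problem! 802.11 has something called 3-address mode. It adds a third MAC address to every frame, so the frame can carry both the real ethernet destination and the intermediate one. On top of that there are bit fields called “to-AP” and “from-AP”, which say whether the packet is going from a station to an AP or from an AP to a station. Both can actually be set at the same time, because that is how you build wifi repeaters (APs sending packets to APs).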

Speaking of wifi repeaters! If A is a repeater, it has to send back to the base station, B, along the way, which looks like this:

X -> [wifi] -> A -> [wifi-repeater] -> B -> [wifi] -> Y -> [internet] -> Z

X->A uses three-address mode, but A->B has a problem: the ethernet source address is X, and the ethernet destination address is Y, but the packet on the air is actually being sent from A to B; X and Y aren’t involved at all. Suffice it to say that there’s a thing called 4-address mode, and it works pretty much like you think.

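Speaking of wifi repeaters: if A is a repeater, it has to send the packet back to the base station, B, along the way, which looks like this: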

X -> [wifi] -> A -> [repeater] -> B -> [wifi] -> Y -> [internet] -> Z

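X to A uses 3-address mode, but A to B has a problem: the ethernet source address is X and the ethernet destination address is Y, yet the packet on the air is actually going from A to B; X and Y are not involved at all. Suffice it to say there is something called 4-address mode, and it works pretty much the way you would guess.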
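The 3-address and 4-address layouts can be written down as a small lookup. This sketch follows the address ordering of the 802.11 frame header, where the two bits the text calls “to-AP” and “from-AP” are named To DS and From DS; the function and parameter names are invented here, and the case with neither bit set (no access point involved) is deliberately left out.

```python
def wifi_addresses(to_ap, from_ap, *, source, destination,
                   ap=None, tx_ap=None, rx_ap=None):
    """Return the MAC address fields of an 802.11 data frame, in header order."""
    if to_ap and from_ap:
        # 4-address mode, AP to AP: receiver, transmitter, final dst, original src.
        return (rx_ap, tx_ap, destination, source)
    if to_ap:
        # Station to AP: the AP is the over-the-air receiver.
        return (ap, source, destination)
    if from_ap:
        # AP to station: the AP is the over-the-air transmitter.
        return (destination, ap, source)
    raise ValueError("frames that bypass the AP are not covered by this sketch")


# X -> A in the first picture: three addresses.
x_to_a = wifi_addresses(True, False, source="X", destination="Y", ap="A")

# A -> B in the repeater picture: four addresses.
a_to_b = wifi_addresses(True, True, source="X", destination="Y",
                        tx_ap="A", rx_ap="B")

# A -> Y in the first picture: the AP delivers X's packet to the router.
a_to_y = wifi_addresses(False, True, source="X", destination="Y", ap="A")
```

In every case the first field is whoever should actually receive the radio transmission, which is how the frame manages to name both the hop and the real endpoints.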

(In 802.11s mesh networks, there’s a 6-address mode, and that’s about where I gave up trying to understand.)

(802.11s mesh networks also have a 6-address mode. That is where I gave up trying to understand.)

Avery, I was promised IPv6, and you haven’t even mentioned IPv6

Hold on, I was promised IPv6, and you haven’t even mentioned IPv6

Oh, oops. This post went a bit off the rails, didn’t it?

Damn, looks like I got a bit off track.

Here’s the point of the whole thing. The IETF people, when they were thinking about IPv6, saw this mess getting made - and maybe predicted some of the additional mess that would happen, though I doubt they could have predicted SDN and wifi repeater modes - and they said, hey wait a minute, stop right there. We don’t need any of this crap! What if instead the world worked like this?

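Here is the point of the whole thing. When the IETF people were thinking about IPv6, they saw this mess being made, and perhaps foresaw some of the mess still to come (though I doubt they could have predicted SDN and wifi repeater modes), and they said: hey, wait a minute, stop right there. We don’t need any of this crap! What if the world worked like this instead?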

Imagine that we lived in such a world: wifi repeaters would just be IPv6 routers. So would wifi access points. So would ethernet switches. So would SDN. ARP storms would be gone. “IGMP snooping bridges” would be gone. Bridging loops would be gone. Every routing problem would be traceroute-able. And best of all, we could drop 12 bytes (source/dest ethernet addresses) from every ethernet packet, and 18 bytes (source/dest/AP addresses) from every wifi packet. Sure, IPv6 adds an extra 24 bytes of address (vs IPv4), but you’re dropping 12 bytes of ethernet, so the added overhead is only 12 bytes - pretty comparable to using two 64-bit IP addresses but having to keep the ethernet header. The idea that we could someday drop ethernet addresses helped to justify the oversized IPv6 addresses.

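Imagine living in such a world: wifi repeaters would just be IPv6 routers. So would wifi access points. So would ethernet switches. So would SDN. ARP storms would be gone. “IGMP snooping bridges” would be gone. Bridging loops would be gone. Every routing problem could be diagnosed with traceroute. And best of all, we could drop 12 bytes (source/destination ethernet addresses) from every ethernet frame and 18 bytes (source/destination/AP addresses) from every wifi frame. Sure, IPv6 adds 24 bytes of address to each packet compared with IPv4, but dropping 12 bytes of ethernet leaves a net increase of only 12 bytes, which is pretty comparable to using two 64-bit IP addresses while keeping the ethernet header. The idea that ethernet addresses could someday be dropped helped justify the oversized IPv6 addresses.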
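The byte counts in that argument are easy to check. The address sizes below are the standard ones (48-bit MAC, 32-bit IPv4, 128-bit IPv6); the “64-bit IP” design is the hypothetical alternative mentioned in the text, and the variable names are just for this worked example.

```python
MAC_BYTES = 6     # 48-bit ethernet address
IPV4_BYTES = 4    # 32-bit address
IPV6_BYTES = 16   # 128-bit address
IP64_BYTES = 8    # hypothetical 64-bit address

# Source + destination addresses in each header.
ethernet_pair = 2 * MAC_BYTES                      # dropped from ethernet frames
wifi_triple = 3 * MAC_BYTES                        # dropped from wifi frames
ipv6_extra = 2 * IPV6_BYTES - 2 * IPV4_BYTES       # added versus IPv4

# IPv6 with ethernet addresses removed, versus IPv4 over ethernet.
net_increase_ipv6 = ipv6_extra - ethernet_pair

# 64-bit IP addresses while keeping the ethernet header, versus IPv4.
net_increase_ip64 = 2 * IP64_BYTES - 2 * IPV4_BYTES

# Total address bytes per packet under each design.
total_ipv4_over_ethernet = 2 * IPV4_BYTES + ethernet_pair
total_ipv6_no_ethernet = 2 * IPV6_BYTES
total_ip64_over_ethernet = 2 * IP64_BYTES + ethernet_pair
```

So the ethernet-free IPv6 design carries 32 bytes of addresses per packet against 28 for the 64-bit alternative, which is the sense in which the two are “pretty comparable”.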

It would have been beautiful. Except for one problem: it never happened.

It would have been beautiful. Except for one problem: it never happened.

Requiem for a dream

A requiem for the dream

One person at work put it best: “layers are only ever added, never removed.”

Someone summed it up perfectly: “layers are only ever added, never removed.”

All this wonderfulness depended on the ability to start over and throw away the legacy cruft we had built up. And that is, unfortunately, pretty much impossible. Even if IPv6 hits 99% penetration, that doesn’t mean we’ll be rid of IPv4. And if we’re not rid of IPv4, we won’t be rid of ethernet addresses, or wifi addresses. And if we have to keep the IEEE 802.3 and 802.11 framing standards, we’re never going to save those bytes. So we will always need the “IPv6 neighbour discovery” protocol, which is just a more complicated ARP. Even though we no longer have bus networks, we’ll always need some kind of simulator for broadcasts, because that’s how ARP works. We’ll need to keep running a local DHCP server at home so that our obsolete IPv4 light bulbs keep working. We’ll keep needing NAT so that our obsolete IPv4 light bulbs can keep reaching the Internet.

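All this wonderfulness depended on being able to start over and throw away the legacy cruft we had accumulated. Unfortunately, that is pretty much impossible. Even if IPv6 reaches 99% penetration, we will not be rid of IPv4. And if we are not rid of IPv4, we will not be rid of ethernet addresses or wifi addresses. And if we have to keep the IEEE 802.3 and 802.11 framing standards, we will never save those bytes. So we will always need the “IPv6 neighbour discovery” protocol, which is just a more complicated ARP. Even though we no longer have bus networks, we will always need something that simulates broadcast, because that is how ARP works. We will need to keep running a local DHCP server at home so that our obsolete IPv4 light bulbs (presumably meaning legacy devices. Translator’s note.) keep working. And we will keep needing NAT so that those obsolete IPv4 light bulbs can keep reaching the Internet.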

And that’s not the worst of it. The worst of it is we still need the infinite abomination that is layer 2 bridging, because of one more mistake the IPv6 team forgot to fix. Unfortunately, while they were blue-skying IPv6 back in the 1990s, they neglected to solve the “mobile IP” problem. As I understand it, the idea was to get IPv6 deployed first - it should only take a few years - and then work on it after IPv4 and MAC addresses had been eliminated, at which time it should be much easier to solve, and meanwhile, nobody really has a “mobile IP” device yet anyway. I mean, what would that even mean, like carrying your laptop around and plugging into a series of one ethernet port after another while you ftp a file? Sounds dumb.

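And that is not the worst of it. The worst of it is that we still need the infinite abomination that is layer 2 bridging, because of one more mistake the IPv6 team forgot to fix. Unfortunately, while they were blue-skying IPv6 in the 1990s, they neglected the “mobile IP” problem. As I understand it, the idea was to get IPv6 deployed first (it should only take a few years) and then work on mobility once IPv4 and MAC addresses had been eliminated, when it would be much easier to solve; besides, nobody really had a “mobile IP” device yet. I mean, what would that even look like: carrying your laptop around and plugging it into one ethernet port after another while you ftp a file? Sounds dumb.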

The killer app: mobile IP

The killer move: mobile IP

Of course, with a couple more decades of history behind us, now we know a few use cases for carrying around a computer - your phone - and letting it plug into one wireless access point after another. We do it all the time. And with LTE, it even mostly works! With wifi, it works sometimes. Good, right?

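Of course, with a couple more decades behind us, we now know a few use cases for carrying a computer around: your phone, “unplugging” from one wireless access point and attaching to the next. We do it all the time. On LTE it even mostly works! On wifi it works some of the time. Good, right?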

Not really, because of the Internet’s secret shame: all that stuff only works because of layer 2 bridging. Internet routing can’t handle mobility - at all. If you move around on an IP network, your IP address changes, and that breaks any connections you have open.

并不是。因为这是因特网不可告人的耻辱:这些东西能工作全靠二层桥接。因特网路由完全没法处理移动性。如果你在IP网络中移动,你的IP地址会改变,你已经打开的连接都会断掉。

Corporate wifi networks fake it for you, bridging their whole LAN together at layer 2, so that the giant central DHCP server always hands you the same IP address no matter which corporate wifi access point you join, and then gets your packets to you, with at most a few seconds of confusion while the bridge reconfigures. Those newfangled home wifi systems with multiple extenders/repeaters do the same trick. But if you switch from one wifi network to another as you walk down the street - like if there’s a “Public Wifi” service in a series of stores - well, too bad. Each of those gives you a new IP address, and each time your IP address changes, you kill all your connections.

公司的wifi网络替你做了伪装。它们把整个局域网在二层桥接到一起,于是不管你接入公司的哪一个wifi接入点,那个巨大的中心DHCP服务器总是给你分配相同的IP地址,然后把你的数据包送到你手上。桥接网络重新配置的时候,最多会有几秒钟的混乱。那些带有多个扩展器/中继器的新式家用wifi系统用的也是同样的把戏。但如果你沿着大街一直走,从一个wifi网络切到另一个(比如一连串店铺都提供「Public Wifi」服务),那就糟了。每个网络都会给你分配一个新的IP地址,而每次IP地址改变,你所有的连接都会断掉。

LTE tries even harder. You keep your IP address (usually an IPv6 address in the case of mobile networks), even if you travel miles and miles and hop between numerous cell towers. How? Well… they typically just tunnel all your traffic back to a central location, where it all gets bridged together (albeit with lots of firewalling) into one super-gigantic virtual layer 2 LAN. And your connections keep going. At the expense of a ton of complexity, and a truly embarrassing amount of extra latency, which they would really like to fix, but it’s almost impossible.

LTE网络更加卖力。即便你移动了很多英里,换了无数个基站,你依然保有同一个IP地址(在移动网络中一般是IPv6地址)。怎么做到的呢?他们一般就是把你所有的流量通过隧道送回一个中心位置,在那里全部桥接(当然伴随大量防火墙)成一个超级巨大的虚拟二层局域网。于是你的连接得以保持。代价就是巨大的复杂性,还有多得令人尴尬的额外延迟。他们真的很想解决延迟问题,但这几乎不可能。

Making mobile IP actually work [1]

拯救移动IP

So okay, this has been a long story, but I managed to extract it from those IETF people eventually. When we got to this point - the problem of mobile IP - I couldn’t help but ask. What went wrong? Why can’t we make it work?

好吧,这个故事已经讲得很长了,但我最终还是从IETF那拨人那里把它挖出来了。当我们谈到移动IP这个问题时,我忍不住发问:到底哪里出了问题?为什么我们不能让它工作?

The answer, it turns out, is surprisingly simple. The great design flaw was in how the famous “4-tuple” (source ip, source port, destination ip, destination port) was defined. We use the 4-tuple to identify a given TCP or UDP session; if a packet has those four fields the same, then it belongs to a given session, and we can deliver it to whatever socket is handling that session. But the 4-tuple crosses two layers: internetwork (layer 3) and transport (layer 4). If, instead, we had identified sessions using only layer 4 data, then mobile IP would have worked perfectly.

答案出奇的简单。最大的设计缺陷在于众所周知的四元组(源IP、源端口、目标IP、目标端口)的定义方式。我们使用四元组来识别TCP或者UDP会话:如果一个数据包的这四个字段都相同,那么它就属于对应的会话,我们就可以把它交给处理这一会话的套接字。只不过,四元组横跨了两层:网络层(三层)和传输层(四层)。如果我们当初只用四层的数据来标识会话,移动IP就能完美工作。

Let’s do a quick example. X port 1111 is talking to Y port 80, so it sends a packet with 4-tuple (X,1111,Y,80). The response comes back with (Y,80,X,1111), and the kernel delivers it to the socket that generated the original packet. When X sends more packets tagged (X,1111,Y,80), then Y delivers them all to the same server socket, and so on.

举个栗子。X的1111端口跟Y的80端口通信。所以,它发送的数据包对应的四元组是(X,1111,Y,80)。响应则带着(Y,80,X,1111)回来,内核会把它交给发出原始数据包的那个套接字。当X继续发送标记为(X,1111,Y,80)的数据包时,Y会把它们都交给同一个服务端套接字。如此往复。

Then, if X hops IP addresses, it gets a new name, say Q. Now it’ll start sending packets with (Q,1111,Y,80). Y has no idea what that means, and throws it away. Meanwhile, if Y sends packets tagged (Y,80,X,1111), they get lost, because there is no longer an X to receive them.

接下来,如果X换了IP地址,它就有了一个新名字,比如Q。现在它开始发送带有(Q,1111,Y,80)的数据包。Y完全不知道这是什么,只能丢弃。同时,如果Y继续发送标有(Y,80,X,1111)的数据包,它们会丢失,因为接收它们的X已经不存在了。
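The X-to-Q example above can be sketched as a toy lookup table. This is only an illustration of the idea, not how any real kernel is written: the class name `FourTupleDemux`, its methods, and the socket label `socket-1` are hypothetical, and addresses are plain strings.

```python
from typing import Dict, Optional, Tuple

# (source ip, source port, destination ip, destination port)
FourTuple = Tuple[str, int, str, int]


class FourTupleDemux:
    """Toy model of how host Y maps incoming packets to sockets."""

    def __init__(self) -> None:
        self.sessions: Dict[FourTuple, str] = {}

    def open_session(self, key: FourTuple, socket_name: str) -> None:
        # Remember which socket owns this exact 4-tuple.
        self.sessions[key] = socket_name

    def deliver(self, key: FourTuple) -> Optional[str]:
        # A packet is delivered only if all four fields match a known
        # session; otherwise it is dropped (represented here as None).
        return self.sessions.get(key)


demux = FourTupleDemux()
demux.open_session(("X", 1111, "Y", 80), "socket-1")

# While the client still has address X, its packets find the socket.
before_move = demux.deliver(("X", 1111, "Y", 80))

# After the client hops to address Q, the same session is unrecognisable.
after_move = demux.deliver(("Q", 1111, "Y", 80))
```

Because the layer 3 address is part of the lookup key, nothing short of an exact match can find the session again, which is the whole problem.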

Imagine now that we tagged sockets without reference to their IP address. For that to work, we’d need much bigger port numbers (which are currently 16 bits). Let’s make them, say, 128 or 256 bits, some kind of unique hash.

现在设想我们标记套接字时不再引用IP地址。为此,我们需要大得多的端口号(目前的端口号是16位)。假设是128位或者256位,某种唯一哈希(类似uuid,译者注)。

Now X sends out packets to Y with tag (uuid,80). Note, the packets themselves still contain the (X,Y) addressing information, down at layer 3 - that’s how they get routed to the right machine in the first place. But the kernel doesn’t use the layer 3 information to decide which socket to deliver to; it just uses the uuid. The destination port (80 in this case) is only needed to initiate a new session, to identify what service you want to connect to, and can be ignored or left out after that.

现在X给Y发的包使用(uuid,80)标记。注意,数据包本身在三层依然包含(X,Y)地址信息,这样它们才能被路由到正确的机器。但是,内核不再使用三层信息来决定交给哪个套接字,它只用uuid。目标端口(这里是80)只在发起新会话时需要,用来确定你想连接哪个服务,之后就可以忽略或者省略。

For the return direction, Y’s kernel caches the fact that packets for (uuid) go to IP address X, which is the address it most recently received (uuid) packets from.

在回包的方向上,Y的内核会缓存这样一条信息:(uuid)的数据包要发往IP地址X,也就是它最近一次收到(uuid)数据包时的来源地址(会一直变。译者注)。

Now imagine that X changes addresses to Q. It still sends out packets tagged with (uuid,80), to IP address Y, but now those packets come from address Q. On machine Y, it receives the packet and matches it to the socket associated with (uuid), notes that the packets for that socket are now coming from address Q, and updates its cache. Its return packets can now be sent, tagged as (uuid), back to Q instead of X. Everything works! (Modulo some care to prevent connection hijacking by impostors. [2])

现在设想X的地址变成了Q。它还是会把标记有(uuid,80)的数据包发往IP地址Y,只不过现在的来源地址变成Q了。Y收到数据包后,通过(uuid)找到了匹配的套接字,注意到这个套接字的数据包现在来自地址Q,于是更新了自己的缓存。它的回包(标记为(uuid))现在可以发回Q而不是X。一切正常!(前提是要小心防范冒名者劫持连接。)
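The uuid scheme described above can be sketched the same way. This is a minimal model under stated assumptions: `UuidDemux`, `Session`, `receive`, and `reply_address` are hypothetical names, a random `uuid4` stands in for the "unique hash", and the anti-hijacking check the author mentions is deliberately left out.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Session:
    service_port: int  # only meaningful when the session is created
    peer_addr: str     # most recent layer 3 source address seen


class UuidDemux:
    """Toy model of host Y matching packets by session id only."""

    def __init__(self) -> None:
        self.sessions: Dict[str, Session] = {}

    def receive(self, src_addr: str, session_id: str,
                dst_port: Optional[int] = None) -> Optional[Session]:
        session = self.sessions.get(session_id)
        if session is None:
            if dst_port is None:
                return None  # unknown session and no service named: drop
            # First packet: the port selects the service, then is ignored.
            session = Session(service_port=dst_port, peer_addr=src_addr)
            self.sessions[session_id] = session
        else:
            # Known session: just refresh the cached return address.
            session.peer_addr = src_addr
        return session

    def reply_address(self, session_id: str) -> Optional[str]:
        session = self.sessions.get(session_id)
        return session.peer_addr if session else None


y = UuidDemux()
sid = uuid.uuid4().hex  # 128-bit identifier standing in for the tag

y.receive("X", sid, dst_port=80)
reply_before = y.reply_address(sid)

# The client roams to address Q and keeps using the same session id.
y.receive("Q", sid)
reply_after = y.reply_address(sid)
```

The layer 3 address is now only a cached hint for where to send replies, so changing it no longer breaks the session lookup.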

There’s only one catch: that’s not how UDP and TCP work, and it’s too late to update them. Updating UDP and TCP would be like updating IPv4 to IPv6; a project that sounded simple, back in the 1990s, but decades later, is less than half accomplished (and the first half was the easy part; the long tail is much harder).

只有一个问题:UDP和TCP并不是这样工作的。而且想更新它们为时已晚。更新UDP和TCP跟将IPv4升级到IPv6也差不了多少。IPv6在1990年代听起来很容易,但是几十年之后,连一半都没有完成(而且前面的一半很容易,后面的长尾部分会更难)。

The positive news is we may be able to hack around it with yet another layering violation. If we throw away TCP - it’s getting rather old anyway - and instead use QUIC over UDP, then we can just stop using the UDP 4-tuple as a connection identifier at all. Instead, if the UDP port number is the “special mobility layer” port, we unwrap the content, which can be another packet with a proper uuid tag, match it to the right session, and deliver those packets to the right socket.

好消息是,我们也许可以再一次违反分层来绕过这个问题。如果我们抛弃TCP(反正它也相当老了),改用基于UDP的QUIC协议,我们就可以完全不再用UDP的四元组来标识连接。取而代之的是,如果UDP端口号是「特殊移动层」端口,我们就解开其内容(里面可以是另一个带有合适uuid标记的数据包),把它匹配到对应的会话,并把这些数据包交给对应的套接字。
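The unwrapping step could look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the port number `MOBILITY_PORT`, the 16-byte identifier at the front of the datagram, and the function names. It is not QUIC's actual wire format.

```python
from typing import Dict, Optional, Tuple

MOBILITY_PORT = 40000  # hypothetical "special mobility layer" UDP port
ID_LEN = 16            # hypothetical: 128-bit session id prefix


def unwrap(udp_dst_port: int,
           datagram: bytes) -> Optional[Tuple[bytes, bytes]]:
    """Split a datagram into (session id, inner payload), or None."""
    if udp_dst_port != MOBILITY_PORT or len(datagram) < ID_LEN:
        return None
    return datagram[:ID_LEN], datagram[ID_LEN:]


def route_to_socket(sockets: Dict[bytes, str], udp_dst_port: int,
                    datagram: bytes) -> Optional[str]:
    """Pick the socket from the inner session id, ignoring the 4-tuple."""
    parsed = unwrap(udp_dst_port, datagram)
    if parsed is None:
        return None
    session_id, _payload = parsed
    return sockets.get(session_id)


sid = bytes(range(ID_LEN))
sockets = {sid: "socket-1"}
chosen = route_to_socket(sockets, MOBILITY_PORT, sid + b"hello")
```

Note that neither function takes a source address or source port: the outer UDP header only gets the datagram to the right machine and the right demultiplexer.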

There’s even more good news: the experimental QUIC protocol already, at least in theory, has the right packet structure to work like this. It turns out you need unique session identifiers (keys) anyhow if you want to use stateless packet encryption and authentication, which QUIC does. So, perhaps with not much work, QUIC could support transparent roaming. What a world that would be!

还有更好的消息:实验性的QUIC协议(QUIC协议标准预计会在2021年发布。译者注)至少在理论上已经具备了这样工作所需的数据包结构。原来,如果你想使用无状态的报文加密和认证(QUIC正是这样做的),你本来就需要唯一的会话标识(密钥)。所以,也许不需要做太多工作,QUIC就能支持透明漫游。那将是多么美好的世界呀!

At that point, all we’d have to do is eliminate all remaining UDP and TCP from the Internet, and then we would definitely not need layer 2 bridging anymore, for real this time, and then we could get rid of broadcasts and MAC addresses and SDN and DHCP and all that stuff.

到那个时候,我们需要做的只是把因特网上剩下的所有UDP和TCP都淘汰掉。然后我们就肯定不再需要二层桥接了,这次是真的。之后,我们就可以抛弃广播、MAC地址、SDN、DHCP,还有其他乱七八糟的东西。

And then the Internet would be elegant again.

到那个时候,因特网才会重回优雅!

以下是作者更新。实在太累,翻不动了。

[1] Edit 2017-08-16: It turns out that nothing in this section requires IPv6. It would work fine with IPv4 and NAT, even roaming across multiple NATs.

[2] Edit 2017-08-15: Some people asked what “some care to prevent connection hijacking” might look like. There are various ways to do it, but the simplest would be to do something like the SYN, SYN-ACK, ACK exchange TCP does at connection startup. If Y just trusts the first packet from the new host Q, then it’s too easy for any attacker to take over the X->Y connection by simply sending a packet to Y from anywhere on the Internet. (Although it’s a bit hard to guess which 256-bit uuid to fill in.) But if Y sends back a cookie that Q must receive and process and send back to Y, that ensures that Q is at least a man-in-the-middle and not just an outside attacker (which is all TCP would guarantee anyway). If you’re using an encrypted protocol (like QUIC [3]), the handshake can also be protected by your session key.

[3] Edit 2017-10-24: Besides QUIC, there are several other candidates for such a protocol, including MinimaLT. I didn’t mention MinimaLT originally because it wasn’t part of my original conversation with the IETF people, but I don’t mean to imply that QUIC is the only possible option as a roaming-capable TCP replacement. In fact, MinimaLT is the first protocol I heard of that elegantly solved the roaming problem. Future solutions that might get adopted, including by QUIC, will likely be modeled after MinimaLT’s solution.
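The cookie exchange from the hijacking footnote can be sketched as follows. This is one possible shape under stated assumptions, not the method of any particular protocol: `RoamingValidator` and its methods are hypothetical, the cookie is an HMAC over the session id and the claimed new address, and a real design would also bind in a timestamp or nonce so that old cookies expire.

```python
import hashlib
import hmac
import secrets
from typing import Dict


class RoamingValidator:
    """Toy model: Y moves a session to a new address only after the
    new address proves it can receive packets sent to it."""

    def __init__(self) -> None:
        self._secret = secrets.token_bytes(32)  # known only to Y
        self.peer_addr: Dict[str, str] = {}     # session id -> address

    def _cookie(self, session_id: str, addr: str) -> bytes:
        msg = f"{session_id}|{addr}".encode()
        return hmac.new(self._secret, msg, hashlib.sha256).digest()

    def challenge(self, session_id: str, new_addr: str) -> bytes:
        # Sent to new_addr; an off-path attacker never sees it.
        return self._cookie(session_id, new_addr)

    def confirm(self, session_id: str, new_addr: str,
                echoed: bytes) -> bool:
        expected = self._cookie(session_id, new_addr)
        if hmac.compare_digest(expected, echoed):
            self.peer_addr[session_id] = new_addr
            return True
        return False


validator = RoamingValidator()
validator.peer_addr["session-1"] = "X"

cookie = validator.challenge("session-1", "Q")
forged_ok = validator.confirm("session-1", "Q", b"\x00" * 32)
addr_after_forgery = validator.peer_addr["session-1"]
genuine_ok = validator.confirm("session-1", "Q", cookie)
addr_after_echo = validator.peer_addr["session-1"]
```

As the footnote says, this only shows that whoever answers can see traffic sent to Q; it does not rule out a man-in-the-middle, which is where a session key would come in.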

Update 2020-07-09: I’ve posted more thoughts on IPv4/IPv6 migration and interoperability on the Tailscale blog.