Applying GENEVE encapsulation (VPC<->VPC NAT at AWS)

This post is about one of the technologies that powers networking at AWS: an application of Geneve encapsulation to do VPC to VPC NAT, or in other words, how you get packets from an ENI in one VPC to an ENI in another VPC.

What is the networking stack?

When you hear the “networking stack” you need to understand it in terms of layers of abstraction that build on top of each other. At the lowest level you have physical data transmission; the next level up you have Ethernet frames (data link); the next level contains IP packets (network layer); and finally you have TCP segments or UDP datagrams (transport layer). Colloquially we just refer to these as layers 1-4.

One way to understand the usefulness of the networking stack is in terms of what constraints you have: as you move down the networking stack you have fewer constraints. TCP at the transport layer enforces a socket per connection and has strict ordering rules, but when you get to layer 3 (the IP layer, where TUN devices operate) you can inspect TCP packets, UDP packets, and everything else (e.g. ICMP packets).

One way to interact with the lower levels of the networking stack is through something called a TUN/TAP device, which you can think of as a network interface that exists in the kernel but that you can access from user space to intercept layer 3 packets (TUN) or layer 2 frames (TAP).

In practice at AWS (and at a lot of cloud companies) we use TUN devices all the time. With a TUN device you can inspect a packet, read its IP header to learn where it’s going (it contains a source and destination address for IPv4/IPv6), then read the TCP or UDP payload to do additional work and rewrite the packet to wherever you want it to go. If you go down even further, to TAP devices at layer 2, you can inspect whole Ethernet frames (including the layer 3 and 4 payloads they carry).
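To make the “read the IP header” step concrete, here is a minimal sketch of parsing the fixed IPv4 header from a raw layer 3 packet, the kind of bytes a read from a Linux TUN device hands you. (Opening the device itself requires root and a Linux-specific ioctl on /dev/net/tun, so that part is omitted; the header layout below is just standard IPv4.)

```python
import struct
import socket

def parse_ipv4_header(packet: bytes):
    """Parse the fixed 20-byte IPv4 header from a raw layer 3 packet,
    as you would read it from a TUN device. Returns (proto, src, dst)."""
    (version_ihl, _tos, _total_len, _ident, _flags_frag,
     _ttl, proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    assert version_ihl >> 4 == 4, "not an IPv4 packet"
    return proto, socket.inet_ntoa(src), socket.inet_ntoa(dst)

# A minimal IPv4 header (no payload): 10.0.0.1 -> 10.0.0.2, protocol 6 (TCP).
hdr = struct.pack("!BBHHHBBH4s4s",
                  (4 << 4) | 5, 0, 20, 0, 0, 64, 6, 0,
                  socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
print(parse_ipv4_header(hdr))  # (6, '10.0.0.1', '10.0.0.2')
```

Once you have the protocol number and addresses, everything after byte 20 (plus any IP options) is the TCP or UDP payload you can rewrite.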

The reason this is important to understand is that it relates directly to why GENEVE encapsulation exists: it solves the problem of translating and sending packets from one network (an underlay network) to a network that sits on top of it (an overlay network), while keeping all of the necessary OSI-layer context (metadata) that you need to do this successfully.

Geneve encapsulation

RFC 8926 describes Geneve encapsulation.

The most relevant parts of the RFC are sections 3.1 to 3.3. From a technical perspective, the main thing you need to know to understand GENEVE is this: it is a layer 4 (transport layer) UDP scheme for encapsulating layer 2 through layer 4 information for network virtualization. You can send GENEVE-encapsulated packets using an ordinary user-space UDP socket, but each one communicates a lot more information because it has both an outer and an inner frame. The inner frame can contain an entire Ethernet frame, which can be intercepted by a TUN/TAP device (or a NIC) on a different machine. Any time you need to do network virtualization, you will most likely end up using GENEVE, as it is the standard that supersedes VXLAN.

The RFC’s figure of GENEVE over IPv4 shows that the payload can be anything you want it to be, or whatever the receiver expects. If you’re encapsulating a regular TCP packet, you would simply write the IPv4 header and TCP segment there.
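The fixed Geneve header itself is only 8 bytes (RFC 8926, section 3.4): a version and option length, two flag bits, a protocol type for the inner payload, and a 24-bit VNI. A minimal sketch of building and parsing it with no variable-length options:

```python
import struct

GENEVE_PORT = 6081  # IANA-assigned UDP destination port for Geneve

def build_geneve_header(vni: int, protocol_type: int = 0x6558) -> bytes:
    """Build the fixed 8-byte Geneve header with no options.
    0x6558 = Transparent Ethernet Bridging, i.e. the payload that
    follows is a full inner Ethernet frame."""
    ver_optlen = 0  # version 0, option length 0 (no variable options)
    flags = 0       # O (control packet) and C (critical options) bits clear
    return struct.pack("!BBH", ver_optlen, flags, protocol_type) \
        + vni.to_bytes(3, "big") + b"\x00"  # 24-bit VNI + reserved byte

def parse_vni(header: bytes) -> int:
    """Extract the 24-bit virtual network identifier (bytes 4-6)."""
    return int.from_bytes(header[4:7], "big")

hdr = build_geneve_header(vni=0xABCDE)
print(len(hdr), hex(parse_vni(hdr)))  # 8 0xabcde
```

On the wire this header sits inside a UDP datagram addressed to port 6081, and the encapsulated payload (for example an inner Ethernet frame or IPv4 packet) immediately follows it.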

Importantly, the details of how these packets are used are up to the network you’re connecting to.

Since GENEVE operates within the context of UDP, it is a natural analogue to the connectionless semantics of Ethernet and IP, which is why it lends itself so well to directly piping in data from a TUN device (as opposed to something connection-oriented like TCP).

GENEVE encapsulation is super important whenever you want to build mini-networks (overlay networks) on top of another network (an underlay), which is why it’s widely adopted among cloud computing companies. They have the raw hardware (the underlay), and the overlay (to oversimplify a bit) is the virtualized network that they create for you (e.g. a VPC with subnets). The following section describes this in more detail.

Real world application / intuition

This has real world applications! If you’re operating from a machine within a data center, inside one customer’s VPC, and you want to send traffic to another customer’s VPC, you need a way to do this. Assume the underlay network provides a route from customer A’s VPC to customer B’s VPC and treat that as an implementation detail; to get traffic from one virtualized network to the other, you still have to encode the packet with GENEVE in order to do both forward and reverse traffic routing. It should be intuitive that the raw headers in the packet alone are not enough to work with, and you need to attach additional metadata to make this possible. Part of the reason the VNI (virtual network identifier) is encoded in the packet is that, as these packets flow through the underlay network, you can directly identify which virtualized network “tunnel” they are associated with (the same role the VNI plays in VXLAN) and gain useful telemetry from it.
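That VNI lookup is the heart of the demux step: a host on the underlay receives a Geneve packet and uses the VNI to decide which tenant network the inner frame belongs to. A toy sketch, where the VNI-to-VPC table and its contents are purely illustrative (the real control plane that populates such a table is an AWS implementation detail):

```python
import struct

# Hypothetical control-plane table mapping a 24-bit VNI to a tenant's
# virtual network. The entries here are made up for illustration.
VNI_TO_VPC = {0x0000A1: "vpc-customer-a", 0x0000B2: "vpc-customer-b"}

def demux(geneve_packet: bytes) -> str:
    """Pick the destination virtual network from the VNI carried in
    bytes 4-6 of the fixed 8-byte Geneve header."""
    vni = int.from_bytes(geneve_packet[4:7], "big")
    return VNI_TO_VPC.get(vni, "unknown-vni")

# An options-free Geneve header carrying VNI 0x0000A1 (inner payload omitted).
pkt = struct.pack("!BBH", 0, 0, 0x6558) + (0x0000A1).to_bytes(3, "big") + b"\x00"
print(demux(pkt))  # vpc-customer-a
```

The same VNI field is what makes per-tenant telemetry cheap: any box on the underlay path can count or sample packets per VNI without parsing the inner frame at all.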

Gotchas

A major issue with operating at a lower level of the networking stack is that you’re intercepting packets that are just raw Ethernet/IP headers, with no knowledge of the client they’re destined for. When you need to send reverse traffic back to a client connected to your machine, this is a significant problem. For example, for something like NAT gateway, you need the entire packet frame in order to egress the packet, and it would be infeasible to manage a TCP connection pool, so you need to maintain a mapping for every forward packet that tells you where the packet on the reverse path should go.

If your NAT gateway operates in IPv6 space (that is, every client that sends traffic to your machine sends it with an IPv6 header) and you want to forward that traffic to a machine that operates exclusively in IPv4 space, this becomes a significant issue. On the reverse path you must identify the correct forward-initiating IPv6 flow from an IPv4 packet, so you are forced to store additional state in-process to map reverse traffic back to originating flows. This gets complicated the more flows you have, and you have to make sure that whatever form of hashing you’re using is resistant to collisions. This is a real problem that services have to solve when sending traffic from one machine to another across network boundaries.
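A minimal sketch of that forward/reverse flow mapping. All names here, and the naive sequential port allocator, are illustrative assumptions; a real gateway would use collision-resistant hashing, timeouts, and port recycling:

```python
import itertools

class FlowTable:
    """Toy state a NAT gateway keeps so an inbound IPv4 packet can be
    mapped back to the IPv6 client that initiated the flow."""

    def __init__(self, public_ipv4: str):
        self.public_ipv4 = public_ipv4
        self._ports = itertools.count(1024)  # naive allocator, never recycled
        self._reverse = {}                   # (proto, nat_port) -> (src6, sport)

    def forward(self, proto, src6, sport, dst4, dport):
        """Record an outbound IPv6 flow; return the translated IPv4 5-tuple
        the gateway will actually put on the wire."""
        nat_port = next(self._ports)
        self._reverse[(proto, nat_port)] = (src6, sport)
        return (proto, self.public_ipv4, nat_port, dst4, dport)

    def reverse(self, proto, dport):
        """Given an inbound IPv4 packet aimed at one of our NAT ports,
        recover the originating IPv6 flow, or None if no state exists."""
        return self._reverse.get((proto, dport))

nat = FlowTable("203.0.113.7")
_, _, nat_port, _, _ = nat.forward("tcp", "2001:db8::1", 40000, "198.51.100.9", 443)
print(nat.reverse("tcp", nat_port))  # ('2001:db8::1', 40000)
```

The hard parts the sketch skips are exactly the ones the text calls out: the table grows with every live flow, and the key you use to look up reverse traffic must never map two distinct client flows to the same entry.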

Conclusion

This post was brief, but I felt it was necessary to capture since I haven’t found any other articles about this online.

Thanks for reading!
