Virtual eXtensible Local Area Network (VXLAN)

Ido Schimmel edited this page Jan 24, 2019 · 2 revisions
Table of Contents
  1. Introduction
  2. Use with VLAN-unaware Bridges
    1. Decapsulation
    2. Encapsulation
      1. Unicast Forwarding
      2. Flooding
    3. Bridge MAC Address
  3. Use with VLAN-aware Bridges
  4. VXLAN Learning
  5. Neighbour Suppression
  6. VXLAN Routing
    1. Asymmetric Routing
    2. Symmetric Routing
  7. Features and Limitations
  8. Further Resources

Introduction

Virtual eXtensible Local Area Network (VXLAN) enables the encapsulation of Ethernet frames inside UDP packets with a designated UDP destination port (4789). VXLAN allows users to overlay L2 networks on top of existing L3 networks. In the data center, it is commonly used to stretch an L2 network across multiple racks.
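To make the encapsulation more concrete, the sketch below prints the 8-byte VXLAN header described in RFC 7348 as hex digits: a flags byte of 0x08 (VNI-valid bit set), three reserved bytes, the 24-bit VNI and one more reserved byte. The helper name vxlan_header_hex is made up for illustration, and the outer Ethernet, IP and UDP headers are not shown.

```shell
# Print the 8-byte VXLAN header (RFC 7348) for a given VNI as 16 hex digits.
# vxlan_header_hex is a hypothetical helper, not part of iproute2.
vxlan_header_hex() {
    vni=$1
    # The VNI field is 24 bits wide.
    if [ "$vni" -lt 0 ] || [ "$vni" -gt 16777215 ]; then
        echo "VNI out of range: $vni" >&2
        return 1
    fi
    # Flags byte (0x08) and 24 reserved bits, then the VNI and 8 reserved bits.
    printf '08000000%06x00\n' "$vni"
}

vxlan_header_hex 10
```

For VNI 10, as used by the vx10010 device on this page, the VNI occupies bytes 5 to 7 of the header.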

Initial VXLAN support appeared in kernel 3.7. Since kernel 4.20 it is possible to offload the VXLAN forwarding plane to the Spectrum ASIC.

Use with VLAN-unaware Bridges

The VXLAN data path can be split into two parts: decapsulation and encapsulation.

Decapsulation

Decapsulation occurs when the switch receives a VXLAN-encapsulated packet whose underlay destination IP corresponds to that of the local VTEP. The source IP of the local VTEP is usually assigned to the loopback device:

$ ip -d link show dev vx10010
3972: vx10010: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether de:49:10:b4:e3:79 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
    vxlan id 10 local 192.0.2.1 srcport 0 0 dstport 4789 tos inherit ttl 10 ageing 300 noudpcsum noudp6zerocsumtx noudp6zerocsumrx
...

$ ip address add 192.0.2.1/32 dev lo
$ ip route show 192.0.2.1 table local
local 192.0.2.1 dev lo proto kernel scope host src 192.0.2.1 offload

The local route is marked as offloaded since VXLAN-encapsulated packets that hit it are decapsulated by the hardware and forwarded in the overlay network.
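A quick way to script this check is to look for the offload flag in the route output. The following is a minimal sketch that only inspects text in the format shown above; route_is_offloaded is a hypothetical helper name, and other iproute2 versions may print the flag differently.

```shell
# Succeeds if any route line read from stdin carries the "offload" flag.
# route_is_offloaded is a hypothetical helper, not part of iproute2.
route_is_offloaded() {
    grep -qw 'offload'
}

# On a live system:
#   ip route show 192.0.2.1 table local | route_is_offloaded
sample='local 192.0.2.1 dev lo proto kernel scope host src 192.0.2.1 offload'
if printf '%s\n' "$sample" | route_is_offloaded; then
    echo "decapsulation route is offloaded"
else
    echo "decapsulation route is NOT offloaded"
fi
```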

Note: The same local route cannot be used to decapsulate IP-in-IP and VXLAN packets at the same time.

Encapsulation

Encapsulation takes place when the switch decides to forward a packet to a VXLAN tunnel. This can happen either due to an FDB entry pointing to a VXLAN device or due to the packet being flooded by the enslaving bridge.

Unicast Forwarding

In order for a packet to be forwarded to a single remote VTEP, FDB entries need to be configured in the FDB tables of both the bridge and the VXLAN device:

$ bridge fdb add 00:11:22:33:44:55 dev vx10010 self master static \
	dst 198.51.100.1
$ bridge fdb show brport vx10010
00:11:22:33:44:55 offload master br0 static
...
00:11:22:33:44:55 dst 198.51.100.1 self offload static

The self keyword will add the entry to the VXLAN FDB, whereas the master keyword will add the entry to the bridge FDB.
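Since unicast forwarding to a remote VTEP relies on an entry in each of the two tables, it can help to list MAC addresses that appear in only one of them. The sketch below does this on the text printed by bridge fdb show; fdb_unpaired is a hypothetical helper, and it assumes the output format shown on this page.

```shell
# List MACs that have only one of the two FDB entries needed for unicast
# forwarding to a remote VTEP. Reads "bridge fdb show brport <vxlan>" output
# from stdin. fdb_unpaired is a hypothetical helper.
fdb_unpaired() {
    awk '
        $1 == "00:00:00:00:00:00" { next }   # flood entries exist only in the VXLAN FDB
        / permanent/              { next }   # addresses of the device itself
        {
            for (i = 2; i <= NF; i++) {
                if ($i == "master") in_bridge[$1] = 1
                if ($i == "self")   in_vxlan[$1] = 1
            }
        }
        END {
            for (m in in_bridge) if (!(m in in_vxlan))  print m, "missing in VXLAN FDB"
            for (m in in_vxlan)  if (!(m in in_bridge)) print m, "missing in bridge FDB"
        }
    ' | sort
}

# On a live system: bridge fdb show brport vx10010 | fdb_unpaired
printf '%s\n' \
    '00:11:22:33:44:55 dst 198.51.100.1 self static' \
    '00:aa:bb:cc:dd:ee master br0 static' \
    '00:aa:bb:cc:dd:ee dst 198.51.100.1 self static' | fdb_unpaired
```

In the sample input only 00:11:22:33:44:55 is reported, because it has a VXLAN FDB entry but no bridge FDB entry.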

Note that both entries are squashed into one {MAC, VLAN/VNI} -> IP entry in the hardware. Therefore, if either entry is removed, the combined entry is removed from the hardware and the remaining entry loses its offload mark, since it is no longer offloaded:

$ bridge fdb del 00:11:22:33:44:55 dev vx10010 master
$ bridge fdb show brport vx10010
00:11:22:33:44:55 dst 198.51.100.1 self static

$ bridge fdb add 00:11:22:33:44:55 dev vx10010 master static
$ bridge fdb show brport vx10010
00:11:22:33:44:55 offload master br0 static
...
00:11:22:33:44:55 dst 198.51.100.1 self offload static

Flooding

When a packet does not match an FDB entry it is flooded to all the local ports enslaved to the bridge as well as to all the configured remote VTEPs. To add a new remote VTEP to the VXLAN device, use the all-zeroes MAC address:

$ bridge fdb append 00:00:00:00:00:00 dev vx10010 self static \
	dst 198.51.100.1
$ bridge fdb append 00:00:00:00:00:00 dev vx10010 self static \
	dst 198.51.100.64
$ bridge fdb show brport vx10010
...
00:00:00:00:00:00 dst 198.51.100.1 self offload static
00:00:00:00:00:00 dst 198.51.100.64 self offload static
...

In the above example, a flooded packet will be replicated twice in the hardware and routed to both remote VTEPs.
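With more than a couple of remote VTEPs, the append commands are easier to generate than to type. The following sketch prints one command per remote VTEP instead of running it, so the result can be reviewed first; flood_cmds is a hypothetical helper name.

```shell
# Print the commands that register each remote VTEP as a flood target.
# flood_cmds is a hypothetical helper; pipe its output to "sh" to apply it.
flood_cmds() {
    dev=$1
    shift
    for vtep in "$@"; do
        echo "bridge fdb append 00:00:00:00:00:00 dev $dev self static dst $vtep"
    done
}

flood_cmds vx10010 198.51.100.1 198.51.100.64
```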

Bridge MAC Address

By default, the MAC address of the bridge is inherited from the bridge slave with the lowest MAC address. If this is the VXLAN device - whose MAC is randomly generated by default - it might not be possible to create a router interface on top of the bridge. This is because all router interfaces must share the same most significant bits (MSBs) in their MAC addresses.

$ ip link set dev swp3 master br0

# MAC address of the bridge is inherited from swp3
$ ip -br link show dev swp3
swp3             DOWN           7c:fe:90:ff:27:d1 <BROADCAST,MULTICAST>
$ ip -br link show dev br0
br0              DOWN           7c:fe:90:ff:27:d1 <BROADCAST,MULTICAST>

# MAC address of the bridge is inherited from VXLAN device
$ ip -br link show dev vx10010
vx10010          UNKNOWN        7a:04:ef:cd:59:5f <BROADCAST,MULTICAST,UP,LOWER_UP>
$ ip link set dev vx10010 master br0
$ ip -br link show dev br0
br0              DOWN           7a:04:ef:cd:59:5f <BROADCAST,MULTICAST>

# IP address cannot be assigned to the bridge device
$ ip address add 10.1.1.1/24 dev br0
Error: mlxsw_spectrum: All router interface MAC addresses must have the same prefix.

To prevent the Linux bridge from inheriting a MAC address, explicitly set its MAC address to that of one of the physical interfaces:

$ ip link set dev br0 address 7c:fe:90:ff:27:d1
$ ip address add 10.1.1.1/24 dev br0
$ echo $?
0
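To predict which address a bridge would inherit before enslaving a new device, the MAC addresses of the prospective slaves can be compared. This is a small sketch under the assumption that no address was set explicitly on the bridge; lowest_mac is a hypothetical helper.

```shell
# Print the lowest MAC address among those read from stdin, one per line.
# lowest_mac is a hypothetical helper. With a fixed-width, same-case
# notation, byte-wise string order equals numeric order.
lowest_mac() {
    tr 'A-F' 'a-f' | LC_ALL=C sort | head -n 1
}

# MAC addresses of swp3 and vx10010 from the example above
printf '%s\n' 7c:fe:90:ff:27:d1 7a:04:ef:cd:59:5f | lowest_mac
```

Here the address of the VXLAN device sorts first, which matches the bridge address seen in the example after vx10010 was enslaved.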

Note: Only a single VXLAN device can be enslaved to a VLAN-unaware bridge.

Use with VLAN-aware Bridges

When using VLAN-aware bridges, multiple VXLAN devices can be enslaved to the bridge. The mapping between a VLAN and a VNI is performed by configuring the VLAN as PVID and egress untagged on the bridge slave corresponding to the VXLAN device:

$ ip link add name br0 type bridge vlan_filtering 1 vlan_default_pvid 0 \
	mcast_snooping 0

$ ip link set dev swp3 master br0
$ bridge vlan add vid 10 dev swp3 pvid untagged

$ ip link set dev vx10010 master br0
$ bridge vlan add vid 10 dev vx10010 pvid untagged

The same VLAN cannot be mapped to multiple VNIs:

$ ip link set dev vx10020 master br0
$ bridge vlan add vid 10 dev vx10020 pvid untagged
RTNETLINK answers: Invalid argument
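A planned VLAN-to-VNI mapping can be checked for this conflict before it is applied. The sketch below reads a plan as lines of VLAN ID and VXLAN device name and reports VLANs assigned to more than one device; the input format and the name dup_vlans are assumptions made for illustration.

```shell
# Report VLANs that a plan maps to more than one VXLAN device.
# Input: lines of "<vid> <vxlan-device>". dup_vlans is a hypothetical helper.
dup_vlans() {
    awk '
        NF < 2 { next }
        ($1 in dev) && dev[$1] != $2 { print "vid " $1 ": " dev[$1] " and " $2 }
        !($1 in dev) { dev[$1] = $2 }
    '
}

printf '%s\n' '10 vx10010' '10 vx10020' | dup_vlans
```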

When configuring FDB entries in a VLAN-aware bridge, the vlan keyword should be used to specify the VLAN that the FDB entry will be programmed with:

$ bridge fdb add 00:11:22:33:44:55 dev vx10010 self master static \
	dst 198.51.100.2 vlan 10

VXLAN Learning

When VXLAN learning ("snooping") is enabled, the VXLAN device's FDB is populated based on decapsulated packets. Whenever a packet is decapsulated, a new {MAC, VNI} -> IP FDB entry is created from the packet's overlay source MAC, VNI and underlay source IP. In case the entry already exists, it is refreshed.
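The learning rule can be modelled in a few lines. The sketch below replays a list of decapsulated packets and prints the resulting table; it is a toy model and not the kernel or hardware implementation, learn_fdb is a hypothetical name, and it assumes that the most recent underlay source IP wins when an entry is refreshed.

```shell
# Replay decapsulated packets and print the resulting {MAC, VNI} -> IP table.
# Input: lines of "<overlay-src-mac> <vni> <underlay-src-ip>".
# learn_fdb is a hypothetical model, not the kernel or hardware implementation.
learn_fdb() {
    awk '
        NF == 3 {
            key = $1 " " $2
            seen[key]++          # a count above 1 means the entry was refreshed
            ip[key] = $3         # assumption: the latest source IP wins
        }
        END {
            for (k in ip) print k, "->", ip[k], "(packets: " seen[k] ")"
        }
    ' | sort
}

printf '%s\n' \
    '00:11:22:33:44:55 10 198.51.100.1' \
    '00:aa:bb:cc:dd:ee 10 198.51.100.64' \
    '00:11:22:33:44:55 10 198.51.100.1' | learn_fdb
```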

To enable VXLAN learning:

$ ip link add name vx10010 up type vxlan id 10 noudpcsum tos inherit \
        ttl 10 local 192.0.2.1 dstport 4789 learning
$ ip link set dev vx10010 master br0

Learning also needs to be enabled on the bridge slave. That is the case by default, but if it has been turned off, it needs to be enabled again:

$ ip link set dev vx10010 type bridge_slave learning on

The bridge and the VXLAN drivers are notified on each FDB entry learned by the hardware. These entries will be marked as offloaded and externally learned:

$ bridge fdb show brport vx10010
00:11:22:33:44:55 vlan 10 extern_learn offload master br0
...
00:11:22:33:44:55 dst 198.51.100.1 self extern_learn offload

The default ageing time is 5 minutes, but can be changed as follows:

$ ip link set dev br0 type bridge ageing_time 1000

The above will set the ageing time to 10 seconds, since the value is given in hundredths of a second.
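Assuming the value is expressed in hundredths of a second, which is what the 1000 to 10 seconds example implies, the conversion can be scripted as follows. ageing_from_seconds is a hypothetical helper name.

```shell
# Convert seconds to the value passed to "ageing_time" in the command above.
# Assumes hundredths of a second, which matches 1000 -> 10 seconds.
# ageing_from_seconds is a hypothetical helper.
ageing_from_seconds() {
    echo $(( $1 * 100 ))
}

ageing_from_seconds 10     # 1000
ageing_from_seconds 300    # 30000, the 5 minute default
```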

Neighbour Suppression

Neighbour suppression allows the Linux bridge to answer IPv4 ARP requests and IPv6 neighbour discovery messages on behalf of remote hosts. It reduces the number of packets a VTEP needs to flood.

Upon the reception of such a packet the Linux bridge will try to find a corresponding neighbour entry on the bridge device itself or on a VLAN interface configured on top of the bridge (based on the packet's VLAN tag). Assuming an entry was found, the Linux bridge will look up the resulting MAC in its FDB. If the FDB entry points to an interface with neighbour suppression enabled, the Linux bridge will reply on behalf of the remote host.

To enable neighbour suppression on the VXLAN device:

$ ip link set dev vx10010 type bridge_slave neigh_suppress on

Note: Suppression of IPv6 Neighbour Discovery packets is currently not supported.

VXLAN Routing

VXLAN routing allows hosts in different overlay networks to communicate with each other. Two popular models for VXLAN routing are distributed asymmetric routing and distributed symmetric routing. In both models, each VTEP performs routing for the hosts connected to it, instead of special VTEPs being designated to perform routing as in the centralized routing model.

In the asymmetric model, the ingress VTEP performs the routing and the egress VTEP only performs bridging. In the symmetric model, routing occurs at both VTEPs.

While the interface configuration of the symmetric model is more involved, it scales better than the asymmetric model. In the symmetric model, each remote host consumes one route entry and each VTEP consumes a neighbour and an FDB entry. In the asymmetric model, each remote host consumes one neighbour entry and one FDB entry and each VTEP consumes a single route entry. In addition, when using the symmetric model, it is not required for every VTEP to be a member of all the VNIs it needs to communicate with.
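The per-model counts above can be turned into a rough estimate for a given deployment. The following sketch simply applies those counts to a number of remote hosts and remote VTEPs; vxlan_entries is a hypothetical helper and the figures are illustrative, not measured.

```shell
# Entries consumed on one VTEP, following the counts given in the text.
# Usage: vxlan_entries <symmetric|asymmetric> <remote-hosts> <remote-vteps>
# vxlan_entries is a hypothetical helper.
vxlan_entries() {
    model=$1
    hosts=$2
    vteps=$3
    case $model in
        symmetric)  echo "routes=$hosts neighbours=$vteps fdb=$vteps" ;;
        asymmetric) echo "routes=$vteps neighbours=$hosts fdb=$hosts" ;;
        *) echo "unknown model: $model" >&2; return 1 ;;
    esac
}

vxlan_entries symmetric 1000 10
vxlan_entries asymmetric 1000 10
```

With 1000 remote hosts behind 10 remote VTEPs, the symmetric model needs 1000 routes but only 10 neighbour and 10 FDB entries, whereas the asymmetric model needs 1000 neighbour and 1000 FDB entries.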

To allow VM mobility between different VTEPs, it is recommended to configure an anycast gateway on each VTEP. With an anycast gateway, a VM can be moved to a different VTEP without changing its default gateway configuration. In the examples below the anycast gateway is implemented using a macvlan device.

Asymmetric Routing

The following example illustrates the interface configuration on a switch connected to two hosts and a spine. The switch acts as a VTEP and performs routing between VNIs 1000 and 2000 that belong to a single tenant (VRF). The IP addresses 10.1.1.1 and 10.1.2.1 serve as the gateway IPs for the overlay networks corresponding to VNIs 1000 and 2000, respectively.

     + Host 1                                     + Host 2
     |                                            |
+----|--------------------------------------------|-------------------------+
|    |                                            |                         |
| +--|--------------------------------------------|-----------------------+ |
| |  + swp1                         br1           + swp2                  | |
| |    vid 10 pvid untagged                         vid 20 pvid untagged  | |
| |                                                                       | |
| |  + vx10                                       + vx20                  | |
| |    local 10.0.0.1                               local 10.0.0.1        | |
| |    remote 10.0.0.2                              remote 10.0.0.2       | |
| |    id 1000                                      id 2000               | |
| |    dstport 4789                                 dstport 4789          | |
| |    vid 10 pvid untagged                         vid 20 pvid untagged  | |
| |                                                                       | |
| +-----------------------------------+-----------------------------------+ |
|                                     |                                     |
| +-----------------------------------|-----------------------------------+ |
| |                                   |                                   | |
| |  +--------------------------------+--------------------------------+  | |
| |  |                                                                 |  | |
| |  + vlan10                                                   vlan20 +  | |
| |  | 10.1.1.11/24                                       10.1.2.11/24 |  | |
| |  |                                                                 |  | |
| |  + vlan10-v (macvlan)                           vlan20-v (macvlan) +  | |
| |    10.1.1.1/24                                         10.1.2.1/24    | |
| |    00:00:5e:00:01:01                             00:00:5e:00:01:01    | |
| |                               vrf-green                               | |
| +-----------------------------------------------------------------------+ |
|                                                                           |
|    + swp3                                       + lo                      |
|    | 192.0.2.1/24                                 10.0.0.1/32             |
+----|----------------------------------------------------------------------+
     |
     + Spine

The following commands were used:

# asymmetric routing - interface configuration

ip link add name br1 type bridge vlan_filtering 1 vlan_default_pvid 0 \
	mcast_snooping 0

# Make sure the bridge uses the MAC address of the local port and not
# that of the VXLAN device
ip link set dev br1 address <swp1's MAC address>
ip link set dev br1 up

ip link set dev swp3 up
ip address add dev swp3 192.0.2.1/24
ip route add 10.0.0.2/32 nexthop via 192.0.2.2

ip link add name vx10 type vxlan id 1000		\
	local 10.0.0.1 remote 10.0.0.2 dstport 4789	\
	nolearning noudpcsum tos inherit ttl 100
ip link set dev vx10 up

ip link set dev vx10 master br1
bridge vlan add vid 10 dev vx10 pvid untagged

ip link add name vx20 type vxlan id 2000		\
	local 10.0.0.1 remote 10.0.0.2 dstport 4789	\
	nolearning noudpcsum tos inherit ttl 100
ip link set dev vx20 up

ip link set dev vx20 master br1
bridge vlan add vid 20 dev vx20 pvid untagged

ip link set dev swp1 master br1
ip link set dev swp1 up
bridge vlan add vid 10 dev swp1 pvid untagged

ip link set dev swp2 master br1
ip link set dev swp2 up
bridge vlan add vid 20 dev swp2 pvid untagged

ip address add 10.0.0.1/32 dev lo

# Create tenant VRF
ip link add dev vrf-green up type vrf table 10
ip -4 route add table 10 unreachable default metric 4278198272
ip -6 route add table 10 unreachable default metric 4278198272
ip -4 rule add pref 32765 table local
ip -4 rule del pref 0
ip -6 rule add pref 32765 table local
ip -6 rule del pref 0

# Create SVIs
ip link add link br1 name vlan10 up master vrf-green type vlan id 10
ip address add 10.1.1.11/24 dev vlan10
ip link add link vlan10 name vlan10-v up master vrf-green \
	address 00:00:5e:00:01:01 type macvlan mode private
ip address add 10.1.1.1/24 dev vlan10-v metric 1024

ip link add link br1 name vlan20 up master vrf-green type vlan id 20
ip address add 10.1.2.11/24 dev vlan20
ip link add link vlan20 name vlan20-v up master vrf-green \
	address 00:00:5e:00:01:01 type macvlan mode private
ip address add 10.1.2.1/24 dev vlan20-v metric 1024

bridge vlan add vid 10 dev br1 self
bridge vlan add vid 20 dev br1 self

bridge fdb add 00:00:5e:00:01:01 dev br1 self local vlan 10
bridge fdb add 00:00:5e:00:01:01 dev br1 self local vlan 20

# Disable rp_filter and enable arp_ignore to make sure ARPs for the
# anycast IP are answered with the anycast MAC
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.vlan10-v.rp_filter=0
sysctl -w net.ipv4.conf.vlan20-v.rp_filter=0
sysctl -w net.ipv4.conf.all.arp_ignore=1

The full example is available here.
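The SVI part of the configuration repeats the same steps for every VLAN, so it lends itself to a small generator. The sketch below prints the commands for one VLAN rather than running them; svi_cmds is a hypothetical helper, and the bridge name, VRF name and anycast MAC are taken from the example above.

```shell
# Print the commands that create one SVI and its anycast-gateway macvlan.
# Usage: svi_cmds <vid> <svi-address/len> <gateway-address/len>
# svi_cmds is a hypothetical helper; br1, vrf-green and the anycast MAC
# are taken from the example configuration.
svi_cmds() {
    vid=$1
    svi=$2
    gw=$3
    cat <<EOF
ip link add link br1 name vlan${vid} up master vrf-green type vlan id ${vid}
ip address add ${svi} dev vlan${vid}
ip link add link vlan${vid} name vlan${vid}-v up master vrf-green address 00:00:5e:00:01:01 type macvlan mode private
ip address add ${gw} dev vlan${vid}-v metric 1024
bridge vlan add vid ${vid} dev br1 self
bridge fdb add 00:00:5e:00:01:01 dev br1 self local vlan ${vid}
EOF
}

svi_cmds 10 10.1.1.11/24 10.1.1.1/24
svi_cmds 20 10.1.2.11/24 10.1.2.1/24
```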

Symmetric Routing

The interface configuration in the symmetric model is similar to the asymmetric model. The main difference is the addition of an L3 VNI, which is used for routed traffic in both directions: from and to the VTEP.

     + Host 1                                     + Host 2
     |                                            |
+----|--------------------------------------------|-------------------------+
|    |                                            |                         |
| +--|--------------------------------------------|-----------------------+ |
| |  + swp1                         br1           + swp2                  | |
| |    vid 10 pvid untagged                         vid 20 pvid untagged  | |
| |                                                                       | |
| |  + vx10                                       + vx20                  | |
| |    local 10.0.0.1                               local 10.0.0.1        | |
| |    remote 10.0.0.2                              remote 10.0.0.2       | |
| |    id 1010                                      id 1020               | |
| |    dstport 4789                                 dstport 4789          | |
| |    vid 10 pvid untagged                         vid 20 pvid untagged  | |
| |                                                                       | |
| |                             + vx4001                                  | |
| |                               local 10.0.0.1                          | |
| |                               remote 10.0.0.2                         | |
| |                               id 104001                               | |
| |                               dstport 4789                            | |
| |                               vid 4001 pvid untagged                  | |
| |                                                                       | |
| +-----------------------------------+-----------------------------------+ |
|                                     |                                     |
| +-----------------------------------|-----------------------------------+ |
| |                                   |                                   | |
| |  +--------------------------------+--------------------------------+  | |
| |  |                                |                                |  | |
| |  + vlan10                         |                         vlan20 +  | |
| |  | 10.1.1.11/24                   |                   10.1.2.11/24 |  | |
| |  |                                |                                |  | |
| |  + vlan10-v (macvlan)             +             vlan20-v (macvlan) +  | |
| |    10.1.1.1/24                vlan4001                 10.1.2.1/24    | |
| |    00:00:5e:00:01:01                             00:00:5e:00:01:01    | |
| |                               vrf-green                               | |
| +-----------------------------------------------------------------------+ |
|                                                                           |
|    + swp3                                       + lo                      |
|    | 192.0.2.1/24                                 10.0.0.1/32             |
+----|----------------------------------------------------------------------+
     |
     + Spine

The following commands were used:

# symmetric routing - interface configuration

ip link add name br1 type bridge vlan_filtering 1 vlan_default_pvid 0 \
	mcast_snooping 0

# Make sure the bridge uses the MAC address of the local port and not
# that of the VXLAN device
ip link set dev br1 address <swp1's MAC address>
ip link set dev br1 up

ip link set dev swp3 up
ip address add dev swp3 192.0.2.1/24
ip route add 10.0.0.2/32 nexthop via 192.0.2.2

ip link add name vx10 type vxlan id 1010		\
	local 10.0.0.1 remote 10.0.0.2 dstport 4789	\
	nolearning noudpcsum tos inherit ttl 100
ip link set dev vx10 up

ip link set dev vx10 master br1
bridge vlan add vid 10 dev vx10 pvid untagged

ip link add name vx20 type vxlan id 1020		\
	local 10.0.0.1 remote 10.0.0.2 dstport 4789	\
	nolearning noudpcsum tos inherit ttl 100
ip link set dev vx20 up

ip link set dev vx20 master br1
bridge vlan add vid 20 dev vx20 pvid untagged

ip link set dev swp1 master br1
ip link set dev swp1 up
bridge vlan add vid 10 dev swp1 pvid untagged

ip link set dev swp2 master br1
ip link set dev swp2 up
bridge vlan add vid 20 dev swp2 pvid untagged

ip link add name vx4001 type vxlan id 104001		\
	local 10.0.0.1 dstport 4789			\
	nolearning noudpcsum tos inherit ttl 100
ip link set dev vx4001 up

ip link set dev vx4001 master br1
bridge vlan add vid 4001 dev vx4001 pvid untagged

ip address add 10.0.0.1/32 dev lo

# Create tenant VRF
ip link add dev vrf-green up type vrf table 10
ip -4 route add table 10 unreachable default metric 4278198272
ip -6 route add table 10 unreachable default metric 4278198272
ip -4 rule add pref 32765 table local
ip -4 rule del pref 0
ip -6 rule add pref 32765 table local
ip -6 rule del pref 0

# Create SVIs
ip link add link br1 name vlan10 up master vrf-green type vlan id 10
ip address add 10.1.1.11/24 dev vlan10
ip link add link vlan10 name vlan10-v up master vrf-green \
	address 00:00:5e:00:01:01 type macvlan mode private
ip address add 10.1.1.1/24 dev vlan10-v metric 1024

ip link add link br1 name vlan20 up master vrf-green type vlan id 20
ip address add 10.1.2.11/24 dev vlan20
ip link add link vlan20 name vlan20-v up master vrf-green \
	address 00:00:5e:00:01:01 type macvlan mode private
ip address add 10.1.2.1/24 dev vlan20-v metric 1024

ip link add link br1 name vlan4001 up master vrf-green \
	type vlan id 4001

bridge vlan add vid 10 dev br1 self
bridge vlan add vid 20 dev br1 self
bridge vlan add vid 4001 dev br1 self

bridge fdb add 00:00:5e:00:01:01 dev br1 self local vlan 10
bridge fdb add 00:00:5e:00:01:01 dev br1 self local vlan 20

# Disable rp_filter and enable arp_ignore to make sure ARPs for the
# anycast IP are answered with the anycast MAC
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.vlan10-v.rp_filter=0
sysctl -w net.ipv4.conf.vlan20-v.rp_filter=0
sysctl -w net.ipv4.conf.all.arp_ignore=1

The full example is available here.

Note: Symmetric routing is not supported on revision A0 of the Spectrum-1 ASIC. Refer to this section for instructions on how to determine the ASIC revision.

Features and Limitations

Features by Version

Kernel Version   Features
4.20             Support for VXLAN with VLAN-unaware bridges
5.0              Support for VXLAN with VLAN-aware bridges, support for VXLAN routing
5.1              Spectrum-2 support, FDB vetoing

Limitations

  • The bridge to which the VXLAN device is enslaved must have multicast snooping disabled. This means that packets with a multicast destination MAC are treated as broadcast and flooded

  • Only head-end-replication (HER) flooding is supported. Flooding packets to a multicast IP address in the underlay network is not supported

  • A source IP must be specified for the VXLAN tunnel

  • Only an IPv4 underlay is supported, and only in the default VRF (i.e., main table)

  • TOS must be inherited from the overlay packet. If the overlay packet is not an IP packet, 0 is used

  • A static TTL must be used. TTL cannot be inherited from the overlay packet

  • UDP checksum must be disabled on the VXLAN tunnel

  • The ASIC supports a single VXLAN tunnel endpoint (VTEP). Therefore, all the offloaded VXLAN tunnels must share the following properties: TTL, learning, UDP destination port, source IP

  • Runtime configuration change of a VXLAN tunnel is currently not supported while it is enslaved to a bridge. The device needs to be unlinked from the bridge and enslaved again for changes to take effect. Alternatively, it can be cycled down-up

  • A bridge with VXLAN device(s) enslaved will not be offloaded unless a physical port (or its upper) is also enslaved to the bridge

Valid configuration example:

$ ip link add name br0 type bridge mcast_snooping 0
$ ip link set dev swp3 master br0
$ ip link add name vx10010 up type vxlan id 10 noudpcsum tos inherit \
	ttl 10 local 192.0.2.1 dstport 4789
$ ip link set dev vx10010 master br0

In case of an invalid configuration (e.g., TTL set to inherit), an error message will be emitted:

$ ip link add name br0 type bridge mcast_snooping 0
$ ip link set dev swp3 master br0
$ ip link add name vx10010 up type vxlan id 10 noudpcsum tos inherit \
	ttl inherit local 192.0.2.1 dstport 4789
$ ip link set dev vx10010 master br0
Error: mlxsw_spectrum: VxLAN: TTL must not be configured to inherit.
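Some of the limitations above can be checked before the device is enslaved. The following is a rough sketch that inspects the "vxlan ..." detail line printed by ip -d link show; vxlan_check is a hypothetical helper, it covers only four of the listed limitations, and the driver's own error message remains the authoritative answer.

```shell
# Check one "vxlan ..." detail line (as printed by "ip -d link show") against
# some of the limitations listed above. Prints one message per violation and
# fails if there is any. vxlan_check is a hypothetical helper.
vxlan_check() {
    line=$1
    rc=0
    printf '%s\n' "$line" | grep -Eq '(^| )local [0-9.]+' ||
        { echo "no source IP (local)"; rc=1; }
    printf '%s\n' "$line" | grep -Eq '(^| )tos inherit( |$)' ||
        { echo "TOS is not inherited"; rc=1; }
    printf '%s\n' "$line" | grep -Eq '(^| )ttl [0-9]+( |$)' ||
        { echo "TTL is not static"; rc=1; }
    printf '%s\n' "$line" | grep -Eq '(^| )noudpcsum( |$)' ||
        { echo "UDP checksum is enabled"; rc=1; }
    return $rc
}

# Detail line of a device created with "ttl inherit"
vxlan_check 'vxlan id 10 local 192.0.2.1 srcport 0 0 dstport 4789 tos inherit ttl inherit ageing 300 noudpcsum' ||
    echo "check failed"
```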

Further Resources

  1. man ip-link
  2. ip link help vxlan
  3. RFC 7348
  4. VXLAN & Linux