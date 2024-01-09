#NIX.CZ #switched #VxLANEVPN #double #star #ring

From double star to ring

In 2019, the NIX.CZ association operated two nodes, NIX.CZ and NIX.SK, which were physically separate and operationally completely independent. I faced a fundamental decision: how to ensure the further development of the association, accommodate the growing number of ports and capacity and, above all, connect the two nodes into one.

We finally connected the two nodes in mid-2019, and from the point of view of network topology it was not a big complication. Towards the end of the same year, the association decided to expand to another location: Vienna. That meant a topology change and, in general, a complete revision of the structure of the entire network. NIX has been run as a pure L2 infrastructure ever since it started operating in 1997. As late as 2020, after the two nodes had been connected, NIX still had a double star topology. With the addition of a third city, however, this had to change.

Joining three cities created a ring topology, and since an L2 infrastructure cannot be operated very safely or efficiently in a ring, I chose to switch to a routed network with a virtualized overlay. In the end, I chose VxLAN technology, which was new and quite unique in the Internet exchange (IXP) environment. This month, we finished rebuilding the entire network: the whole NIX.CZ infrastructure has been migrated to new switches and now fully uses VxLAN/EVPN technology.

Note: In this article I mention the names of manufacturers and devices in some places for the sake of authenticity. Mentioning a specific name is not a sign of our dissatisfaction with, or reservations about, that device or manufacturer. We used the devices in our environment at our own risk, and in some cases the features were not fully documented for our use case at the time of purchase, so we could run into inconsistencies.

Why Leaf-Spine for NIX?

For a very long time, NIX.CZ used chassis switches: large devices that are fitted with interface cards as needed. There are many practical reasons to use them, especially when a high port density is required. In our case, however, it turned out that some combinations of card types are not supported at all. For example, combining 1G ports and 400G ports in one device is a big problem because of the huge buffers it requires.

Leaf-spine is a two-level network architecture that divides network elements into so-called “leaf” and “spine” roles. A network element in the “leaf” role aggregates traffic, typically from servers, while elements in the “spine” role switch or route data between the different “leaf” elements. Since this topology is connected as a full mesh, the capacity of the network can be increased by adding “leaf” devices.
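To make the full-mesh property concrete, here is a minimal Python sketch. The device counts and the 400G uplink speed used in the usage note are illustrative assumptions, not the actual NIX.CZ design.

```python
def leaf_spine_links(leaves: int, spines: int) -> int:
    """Every leaf connects to every spine, so the link count is the product."""
    return leaves * spines


def leaf_uplink_gbps(spines: int, uplink_gbps: int) -> int:
    """Total uplink capacity of a single leaf: one uplink per spine."""
    return spines * uplink_gbps
```

With these assumed numbers, `leaf_spine_links(4, 2)` gives the number of leaf-to-spine links for four leaves and two spines, and `leaf_uplink_gbps(2, 400)` gives the uplink capacity of one leaf. Each added leaf brings its own set of uplinks, one per spine.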

One of the other important reasons for choosing a Leaf-Spine topology made up of several separate devices, instead of one huge chassis, is saving electricity, which a few years later turns out to have been a pretty good idea. Another advantage is a better device life cycle: a device can be moved around the network as needed, which is very expensive and impractical with a chassis.

Virtualized L2 network

Building a virtualized L2 network was the most likely scenario, and the idea of maintaining an L3 network, instead of a network running only on one of the L2 protocols, was very appealing to us. This was also because L2 protocols such as Fabric-Path, Trill or SPB had lost manufacturer support and ended up in the abyss of history.

So the only sensible options left were encapsulation into MPLS/VPLS or into VxLAN. After several weeks of testing and consultation, we concluded that we should take the route of the simplest possible configuration changes, and chose VxLAN.

It remained to decide what type of control plane to choose. For the sake of data consistency, ensured by the proven BGP protocol, we decided to go the EVPN route. At the time, there was no experience (at least none that we were aware of) with deploying VxLAN/EVPN in an IXP environment. So we had to study everything, plan it, test it on a small scale and then deploy it.

Topology of NIX.CZ in 2018

Author: NIX.CZ

Here we go!

It played into our hands that two Nexus 7010 switches were operating in the Slovak part of the network, for which the manufacturer declared end of support and we had to replace them by the end of 2020. Despite the covid restrictions and uncertainty, the ordered equipment arrived and we started preparing the transition mechanism. It was necessary to install two new “core” switches in Prague and set them as a gateway to the new world of VxLAN/EVPN towards Bratislava.

We ordered these devices with 400G interface support right away, so that at the beginning of 2021 we could connect the first network at a line speed of 2×400G. Replacing the equipment in Bratislava meant disconnecting 50 optical cables, removing the original 130 kg switch, installing the new switches, connecting the cables correctly, connecting the lines to Prague and converting the network from L2 to L3, all in one night. The whole operation was coordinated across three sites.

The biggest problem was the non-functional “Fabric-Peering” technology that we planned to use instead of the original vPC configuration. It turned out that the intended combination of two different protection mechanisms was not supported in the real world. Under pressure, you become very keenly aware of how time flies when something is not going well. I finally managed to identify the problem and find a solution in time. By six o’clock in the morning it was done, but my head was spinning.

During this first migration, we confirmed once again how complex a task replacing a device is. It calls for a system and automation.

NIX.CZ topology in 2022

Author: NIX.CZ

Do you have documentation? Could I see it?

After several night-time operations, I very quickly realized that although the documentation of the original networks was fairly accurate, the data had to be looked up in several different systems that were not directly linked. This required a great deal of effort and long hours of preparation.

Together with our colleagues, we decided to create a new system in which everything would correspond to the real state after each change and the data would be linked together. So in 2020 we started implementing a new system based on the Netbox platform. When we started with Netbox, it lacked many features, and modeling our network in it took a great deal of effort. We persevered and are now reaping the benefits of our three-year effort.

When planning the transition to VxLAN/EVPN in Prague, the main emphasis was on the shortest possible outages for clients. However, this cannot be achieved without the physical connections in place, so we had to prepare the new cabling in advance. We also wanted to take advantage of that. Together we decided to completely rewire the largest data centers and come up with a new and better cabling system that would be clearer, take up less space and reduce the risk of patching errors.

New location – Vienna

Determined to try out the new network design, the migration scripts and the new cabling, all in a completely new location, we set out to install a new site in Vienna. A new installation is always easier than modifying an old one while it’s running. After launching Vienna and connecting this site to our VxLAN fabric, we had a complete ring topology that we could operate at layer 3.

Unfortunately, covid came in 2020 and delivery times for equipment became extremely long. It was hardly possible to travel to a neighboring district, let alone to a foreign country. We wanted to make use of this unplanned intervention from above, and started planning the transition of the Prague data centers to the new design.

Prague – how to do it

Prague was a big challenge: nothing remained of the original cabling. Immediately after we connected O2’s (then) new 400Gbit/s technology, we identified the biggest traffic contributors and connected them first, to the same switch. However, this was only the first step.

Because of the limit of eight 400G ports, we built a ring covering the other Prague data centers, with a capacity of 1.6 Tbit/s per data center in full redundancy. For the final configuration we lacked the necessary devices, whose delivery dates were unknown during the Covid-19 pandemic; in practice this usually meant a year or more. We had to consult and choose the path of a controlled switchover of the backbone network. Once the backbone layer was ready, we started gradually migrating customers to the new switches.

During 2022, we moved from the temporary topology to the newly delivered Nexus switches, which have 32 fast 400G ports, and since then the backbone network has been complete.

Ansible, Python and more

Each customer had to be moved to the new switches, which included new cables, new documentation and porting the configuration to the new switches.

To make as few mistakes as possible when transferring configurations, we wrote the first version of a migration tool that “translated” the original configuration files into the new syntax for VxLAN/EVPN. It mainly used Ansible and a set of Python libraries.

Everything worked pretty decently, except for the speed. Debugging CLI commands in this environment was lengthy, and there were many combinations of configurations. The biggest headache for us were cases where a configuration entered manually via the CLI worked, but failed with no apparent error when we sent it as a batch.

After this rather bitter experience, we finally decided to use another method, the DME (Data Management Engine) API, and we have stayed with it.

So we wrote a second set of scripts to handle the configuration conversion. These new scripts were responsible for configuring VLANs, VNIs, mappings, port-channels, and the like. The input is the familiar Cisco CLI configuration and the output is a set of REST calls over HTTPS with JSON objects, which the switch applies accordingly.
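The CLI-to-JSON step can be illustrated with a minimal Python sketch. This is not our production tool: it handles only the handful of commands that appear in the interface example later in this article, and the attribute names (id, layer, mode, trunkVlans, mtu) are taken from that example. The function name is made up for illustration.

```python
import json


def cli_interface_to_dme(cli_text: str) -> dict:
    """Translate one 'interface ...' CLI stanza into an l1PhysIf-style object."""
    attrs = {}
    for raw in cli_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("interface Ethernet"):
            # "interface Ethernet1/1" -> "eth1/1"
            attrs["id"] = "eth" + line[len("interface Ethernet"):]
        elif line == "switchport":
            attrs["layer"] = "Layer2"
        elif line.startswith("switchport mode "):
            attrs["mode"] = line.split()[-1]
        elif line.startswith("switchport trunk allowed vlan "):
            attrs["trunkVlans"] = line.split()[-1]
        elif line.startswith("mtu "):
            attrs["mtu"] = line.split()[-1]
        else:
            # Fail loudly instead of silently dropping configuration.
            raise ValueError(f"unsupported CLI line: {line!r}")
    return {"l1PhysIf": {"attributes": attrs}}


example = """interface Ethernet1/1
  switchport
  switchport mode trunk
  switchport trunk allowed vlan 100
  mtu 9192
"""
print(json.dumps(cli_interface_to_dme(example), indent=2))
```

Raising an error on any unknown line is a deliberate choice in this sketch: a translator that skips what it does not understand produces a configuration that looks complete but is not.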

The Nexus 9300 switches are equipped with a REST API that allows the device to be controlled at the level of objects arranged in a logical tree (similar to SNMP). Internally, these switches work only with this configuration model and everything else is translated into it. For example, the familiar CLI is only an emulation: CLI commands are translated into the internal object tree, and this is also how the configuration is stored. Thanks to the DME REST API, the objects can be controlled directly.
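As a rough illustration of addressing one object in the tree, the following Python sketch only builds the HTTPS request and does not send it. The URL layout (/api/mo/<dn>.json), the distinguished name used in the usage note and the cookie name are assumptions based on common NX-API REST conventions, and the host name and token are placeholders. Verify all of them against the documentation for your NX-OS release.

```python
import json
import urllib.request


def build_dme_request(host: str, dn: str, payload: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST that sets one object in the DME tree."""
    # Assumed URL layout: the object's distinguished name under /api/mo/.
    url = f"https://{host}/api/mo/{dn}.json"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed session cookie name; the token comes from a prior login.
            "Cookie": f"APIC-cookie={token}",
        },
    )
```

For example, `build_dme_request("switch.example.net", "sys/intf/phys-[eth1/1]", payload, token)` would target a single physical interface. Keeping request construction separate from sending makes it easy to log and review every change before it reaches a switch.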

Data can also be read from the switches this way. We tried “Streaming Telemetry”, but ended up reading the data with our own system. The biggest advantage is speed: the REST API is really fast and requests are processed within milliseconds. The downside is that with this low-level approach you can get the switch into an unwanted state very quickly. Caution and intensive testing are entirely appropriate.

Port Security – IXP vs. DC

Loops are the main enemy of any L2 network. If you run your own data center, you probably also control the infrastructure connected to your L2 network, so you don’t have to worry too much about the security and consistency of the ports on the VxLAN/EVPN fabric side. On the contrary, moving MAC addresses from one corner of the network to another is completely legitimate and desirable, for example when moving virtual machines between hypervisors.

At an Internet exchange, the situation is the opposite: moving MAC addresses is in the vast majority of cases considered an error and needs to be handled properly. So in an ideal world you want MAC addresses to be rigidly defined and never to appear anywhere else on the network without your knowledge. In a network like NIX.CZ, it is quite common for MAC addresses to appear in a different place than they should. We observe this several times a week, mainly at night, and it is mostly related to maintenance of network or transport devices connected to the shared peering network. It is quite common to see (unintentional) MAC address hijacking caused by a loop on the connected network’s side. The classic problem is traffic that we sent being returned to our interface.

The differences described above are the main problem with using data center technology in an IXP. Unfortunately, EVPN is defined for data centers and only accounts for MAC movement (MAC mobility), not for its prevention. Each manufacturer has to define “port-security” in its own way, and thanks to the hard work of our colleagues, we managed to implement the correct interface security settings in NX-OS (feature port-security).

In the current version of NX-OS, you can use the “port-security” function. The main implementation change is in how a static MAC is perceived on the remote switch. In the newly released NX-OS, a MAC address learned on a secured port is propagated into EVPN with a flag (sticky bit), so the remote switch knows that if the same MAC appears locally, it should ignore the local occurrence and not announce it to everyone around.
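The decision the remote switch makes can be sketched in a few lines of Python. This is a toy model of the behavior described above, not NX-OS code; the class and function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EvpnMacRoute:
    mac: str
    origin_switch: str
    sticky: bool  # set when the MAC was learned on a secured port


def accept_local_learn(mac: str, remote_routes: Dict[str, EvpnMacRoute]) -> bool:
    """Return True if a MAC seen on a local port may be learned and announced."""
    route = remote_routes.get(mac)
    if route is not None and route.sticky:
        # The MAC is pinned elsewhere: ignore the local occurrence.
        return False
    # Unknown MAC, or a non-sticky route that ordinary MAC mobility may move.
    return True
```

Without the sticky flag, the same local occurrence would be treated as a legitimate MAC move, which is exactly the data center behavior an IXP does not want.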

Another problem we encountered was the speed at which learned MAC addresses are distributed. If you have a limit set on the interface, say two MAC addresses, and the customer sends you data from another 100 MAC addresses within a short time (a few milliseconds), port security reacts and the port goes down. Before this happens (shutting the port down takes a few milliseconds), the MAC addresses beyond the allowed limit are still signaled to the entire EVPN via BGP. After a few more milliseconds, they are removed again (the source port is down and the MACs are cleared). However, even this causes collisions and hijacking of MAC addresses for a very short moment. If that short moment is 1 ms, then up to about 50 MB of data (400 Gbit/s × 1 ms = 400 Mbit) is lost on a fully loaded 400 Gbit/s interface.
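The amount of traffic exposed during such a window is simply line rate multiplied by duration. A small Python helper for that estimate, assuming the interface is fully loaded for the whole window:

```python
def bytes_in_window(rate_gbit_s: float, window_ms: float) -> float:
    """Bytes that cross a fully loaded link of the given rate during the window."""
    # 1 Gbit/s sustained for 1 ms is 1e9 * 1e-3 = 1e6 bits.
    bits = rate_gbit_s * 1e6 * window_ms
    return bits / 8
```

Calling `bytes_in_window(400, 1)` gives the estimate for a 400 Gbit/s interface and a 1 ms window; divide the result by 1e6 to get megabytes.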

If you’ve read this far, you might be wondering why we don’t just use a MAC ACL or set the MAC on the port statically. Well, because that is not supported on this platform. You can create a MAC ACL on the port and filter traffic by source MAC, but unfortunately this does not disable MAC address learning, so the traffic sent to the port is not forwarded, but the learned MAC address is hijacked anyway. The same problem applies to static MACs: if you enable this function, you must disable the “port-security” function, and then your port will not be shut down when the security policy is violated.

What went wrong

Of course, during the implementation we came across several issues that caught us by surprise. Some of them we could only test on a live network, because they could not be simulated in laboratory conditions. Another unpleasant aspect was that the manufacturer’s documentation changed over time, and with it the stated support for some of the functionality we used.

Fabric-Peering – or vPC (Virtual Port-channel) connected using EVPN. The original idea was to offer customers connected via LACP the ability to connect to any two switches in the network. The documentation mentions the possibility of using EVPN ESI, but in practice it turned out that these extended EVPN features, while usable, are not suitable for networks where the complete security of the connected elements cannot be guaranteed. So we narrowed the design down to supporting LACP towards two adjacent switches, and I really liked the idea of using Fabric-Peering. If you are familiar with vPC technology, fabric-peering gives you the same features but without a “peer-link”, which saves ports on the switch. I won’t keep you in suspense: in the end we removed fabric-peering and vPC from the network almost entirely. The main reasons are the difficulty of diagnostics, the unpredictable behavior when there are loops on customer ports and, above all, the lack of support for port security in combination with EVPN.

Port-security – probably the biggest stumbling block we encountered. The lack of port security support on the switches we used was a nightmare for two long years, but we tried to deal with it by technical means. In total, we found three problems with port security:

Layer 2 MAC Hijacking – this scenario was most evident during the migration. The original Nexus 7710 switches were placed behind a pair of new VxLAN/EVPN switches. The connection between these two worlds was made on backbone ports, so “port-security” could not be used there. If one of the connected networks sent a frame with the source MAC of another network over its link, and this frame arrived at the new switches via the trusted backbone lines, the L2 side learned the colliding MAC from the new direction (towards the Nexus 7710) and EVPN tried to move the MAC to the new direction as well. Because of its properties, however, EVPN rejected this move and the network remained split into two worlds: the original L2 network sent traffic to the (wrong) new location, while the EVPN fabric sent traffic to the original (correct) location. This resulted in awkward moments where a major carrier’s MAC was (unintentionally) hijacked and one of its routers was unreachable until the MAC table was cleared manually. We worked very hard on this problem until we finally developed a custom script that did the MAC “cleaning” by itself. The exact description, including the source code, is available on GitHub.

MAC Hijacking by VxLAN/EVPN – this scenario was also most evident during the migration. The sequence of events is very similar, except that the collision occurs on an already migrated switch port that has port-security enabled but allows more than one dynamic MAC, typically two. If just one additional MAC appeared on the connected network’s port and caused a collision, the system did not recover automatically and required manual intervention.

MAC address flood – we also discovered a situation where tens or hundreds of MAC addresses appear on a customer port within a very small time window (typically a few milliseconds) during a network loop. Even though port-security allowed only one or two MACs, flooding the port with new MACs merely starts the process of shutting the port down for exceeding the address limit. Because the whole process is parallelized, the port shutdown and the cleaning of the switching table are started on the switch, but at the same time all MAC addresses (even those beyond the limit) are sent via EVPN to the other switches, which add them to their switching tables. The data flow is thus moved to the port that announced the MAC last. Because the switches have fast CPUs, these BGP updates are processed very quickly, before the other process finishes shutting the port down. Once the port is down and the local switching table for the flooding customer’s port is cleared, all MACs that caused the flood are removed, EVPN announces the MAC withdrawals and the situation is rectified. However, for several tens of milliseconds traffic is directed to the port of the network that caused the flood.
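The race between advertising and shutting down can be expressed as a toy timing model in Python. All delays are hypothetical inputs. The model assumes that a MAC beyond the limit is advertised the moment it is learned, that the port is down a fixed delay after the first violation, and that the withdrawals reach the other switches a fixed delay after that.

```python
from typing import List, Optional, Tuple


def exposure_window_ms(
    learn_times_ms: List[float],
    mac_limit: int,
    shutdown_delay_ms: float,
    withdraw_delay_ms: float,
) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the interval in which remote switches hold
    routes for MACs beyond the limit, or None if the limit is never exceeded."""
    ordered = sorted(learn_times_ms)
    if len(ordered) <= mac_limit:
        return None
    # The first MAC beyond the limit triggers port security and is,
    # in this model, advertised to the fabric at the same instant.
    violation = ordered[mac_limit]
    # The bad routes stay in place until the port is down and the
    # withdrawals have propagated.
    return (violation, violation + shutdown_delay_ms + withdraw_delay_ms)
```

In this model the window never shrinks to zero as long as advertising is faster than shutting the port down. A stricter MAC limit only moves the start of the window earlier.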

As a result of these findings, we made a drastic change to the operating rules and now allow only one dynamic MAC address in the peering segment.

If you would like to know more, you can watch a video recording of my lecture on YouTube.

IP unnumbered on parallel lines – this was not supported (until NXOS 10.2(3)) and caused two serious incidents in which some connected networks stopped receiving BUM (Broadcast, Unknown unicast, Multicast) traffic from random MACs after a short L3 backbone outage. We spent several nights analyzing this problem, and I would call it looking for a needle in a haystack: in traffic of around 5 Mpps, you look for the one ARP packet that is dropped and analyze why it happened. Hell.

userCfgdFlags – using the REST API has its advantages. Unfortunately, when NXOS went from version 9.2.x to 10.3.x, we failed to notice a change in the documentation: a parameter was added to the API in which you have to state explicitly which parameters were changed by the user, a kind of recapitulation. If you do not set it, the configuration is saved, but after the next restart the affected parameters are not carried over into the running configuration, and you end up with unconfigured ports after restarting the switch. So you must specify the following correctly when configuring a port:

    {
      "l1PhysIf": {
        "attributes": {
          "id": "eth1/1",
          "layer": "Layer2",
          "mode": "trunk",
          "mtu": "9192",
          "trunkVlans": "100",
          "userCfgdFlags": "admin_layer,admin_mtu,admin_state"
        }
      }
    }

Equivalent notation in the CLI (as we are used to):

    interface Ethernet1/1
      switchport
      switchport mode trunk
      switchport trunk allowed vlan 100
      mtu 9192

The “userCfgdFlags” attribute must be set if you are running NXOS version 10.2 or higher.
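One way to avoid being bitten by this again is a pre-flight check before a payload is sent. The following Python sketch assumes payloads shaped like the example above. The function name is made up, and the check only verifies that the flag list is present, not that its contents are right.

```python
from typing import List


def objects_missing_flags(payload: dict, required_attr: str = "userCfgdFlags") -> List[str]:
    """Return 'class:id' for every object whose attributes lack the flag list."""
    missing = []
    for class_name, body in payload.items():
        attrs = body.get("attributes", {})
        # An absent or empty value counts as missing.
        if not attrs.get(required_attr):
            missing.append(f"{class_name}:{attrs.get('id', '?')}")
    return missing
```

A migration script could refuse to send any payload for which this function returns a non-empty list, turning a failure that only shows up after a reboot into an immediate error.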

Unknown unicast – the last problem we have solved so far was preventing “unknown unicast” flooding. If a customer is suddenly disconnected from your network and the customer’s port goes down, its MAC address becomes unknown to the entire network at that moment, and by default the switches treat traffic to it as broadcast. All traffic with that destination MAC is thus sent to all ports. Now imagine that the network that suddenly disconnected is a large content provider with 200Gbit/s of traffic. So you have to block such traffic for about 40 seconds, otherwise all customer interfaces will be overloaded.

What I would tell my past self back in 2019

take more notes,

learn the documentation by heart, and don’t trust it,

save the documentation locally every time you read it, tomorrow it may be different,

what lies beyond your network is dangerous, don’t trust it,

newer software does not mean better (but it often helps and is a lesser evil),

there will be covid, buy everything at once and you won’t have to wait months for deliveries,

there will be technical problems, support is important.

What awaits us next?

Thanks to the successful start, the required capacity grew very quickly, and the entire Prague – Bratislava – Vienna – Prague ring built at 4×100 Gbit/s is no longer enough. We are about to complete the capacity increase between the individual locations to 2×400Gbit/s, and we are already using 400G lines. In January, we will open another location, in Frankfurt am Main, and connect it to the rest of the network with additional 400G lines.

Next year we will increase capacities between locations, but above all we will plan development for the next period. I am considering the introduction of better protocols for traffic control and will be keenly watching the development of modules and interfaces with capacities higher than 400Gbit/s. We are also facing an overhaul of the internal systems that process traffic statistics, anomaly detection and monitoring. I would also like to focus on making the customer section more comfortable to use.

What else are we doing?

Because NIX.CZ co-organizes the CSNOG and Peering Days conferences, our association is developing a system for organizing, registering for and scheduling meetings. The application caught the attention of other organizers of similar events over the last year, and we are proud that it has already supported seven meetings. In total, approximately 4,000 participants have actively used the application. We named this tool Meet, and you can find information about it at nix.cz/meet.

For our own use, we have developed a set of Python libraries that allow us to control and monitor Nexus 9300 switches. Thanks to our own design, we are now able to obtain information about the state of an interface very quickly and thus process a large amount of data. We are still at the beginning of this project, but we have many ideas on how to handle the data further, including processing with LLM libraries. It’s a job for several years, but if you had shown me today’s NIX network in 2019, I wouldn’t have believed we could do it. We will continue to have bold plans and ideas that seem unrealistic today, but will start taking shape tomorrow.

A great challenge to improve

We’re at the end. At the end of the article, not of the journey. I will admit quite frankly that rebuilding the NIX.CZ node has been the biggest challenge I have experienced so far. Every incident and anomaly is the best opportunity for me to improve, learn and understand more about the issue. However, I could not have achieved any of this without the support of the entire team of colleagues who put up with my constant new ideas.

A big thank you goes to the cooperating representatives of manufacturers and distributors, and to colleagues and friends abroad, for their inspiration and persistent support. I firmly believe that NIX.CZ will continue to develop so that it keeps offering services at a high level and keeps supporting the community and the development of the Internet in the Czech Republic, Slovakia and the other countries where it operates or will one day operate.