As a Project-based Network Engineer, I’ve had the ‘pleasure’ of encountering a considerable number of bugs while deploying network infrastructure over the years. Sometimes, these ‘bugs’ have enough impact to grind operations to a halt. With a bit of luck, the challenges will arise early in the project or during the pilot phase, minimising potential disruptions in a live environment.
I was recently introduced to one such ‘bug’ during a large (actually very large) Wireless LAN deployment for a client. This happened after upgrading several Aruba Mobility Controllers midway through a greenfield rollout. The upgrade entailed moving from the Conservative Release in the AOS 8.6 train to 126.96.36.199, the newest release in the Long Service Release (LSR) at the time.
When the upgrade was complete – everything broke. And I mean everything!
Have you noticed that I keep putting ‘bug’ in quotes? I could have sworn it was a bug because the Mobility Gateway config was identical pre- and post-upgrade. To my surprise, the issue was caused by the introduction of an ‘undocumented feature’ in AOS version 8.8.0. But more on that soon.
You might be asking why I upgraded in the first place. Well, the Conservative Release did not offer support for the 505H model of APs. This particular AP model wasn’t on my radar for testing or deployment until much later in the project. Unfortunately, the lack of support for this AP model slipped through the cracks when I read over the release notes and decided which firmware to install early on.
After the upgrade, I discovered that the corporate SSID would experience 50% packet loss at precisely 25 minutes, and the guest SSID would never pass traffic. If the corporate users disconnected and reconnected to the wireless, they would be fine for another 25 minutes. The default gateways for the corporate VLANs terminated on a pair of Nexus 7K core switches running vPC and HSRP, and the gateways for the guest VLANs terminated on a virtualised pair of Mikrotik routers running VRRP. I had access to the Nexus 7K core switches but not the Mikrotik routers because a third party managed them. This made troubleshooting the corporate wireless issues much easier.
We ran packet captures across all devices in the data path.
We identified that ARP requests from the core switch reached the client successfully.
However, ARP response packets – where the client responded to the core switch – were present on the Aruba Mobility Gateway but not on the directly connected 7K core switches. The ARP responses never returned.
No packets dropped, and there were no runts, giant frames, CRC, or FCS errors on any of the interfaces in the data path. So, where were the ARP responses being dropped? And why the odd behaviour like the 25-minute dropout, or 50% packet loss from the client’s perspective?
After digging, I discovered the default ARP timer for Cisco Nexus switches is 25 minutes. This explains why the switches would lose their ARP entry for the client, but how was it ever maintained in the first place? Strangely, only one switch in the Nexus vPC pair lost its ARP entries, even with ARP synchronization enabled on the vPC pair. ARP synchronization worked as expected on hundreds of other VLANs within the customer’s network for many years until this point.
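To make the 25-minute pattern concrete, here is a minimal toy model of an ARP cache whose entries age out unless a reply refreshes them. It is only a sketch under simplifying assumptions: real switches typically try to refresh an entry before it expires, and the class name, client IP and MAC address are made up for illustration.

```python
from dataclasses import dataclass, field

ARP_TIMEOUT_S = 25 * 60  # 25 minutes, the Nexus default mentioned above


@dataclass
class ArpCache:
    """Toy ARP cache: an entry survives past the timeout only if refreshed."""
    timeout_s: int = ARP_TIMEOUT_S
    entries: dict = field(default_factory=dict)  # ip -> (mac, learned_at)

    def learn(self, ip, mac, now):
        # Models learning from a GARP or an ARP reply at time `now` (seconds).
        self.entries[ip] = (mac, now)

    def lookup(self, ip, now, reply_arrives):
        """Return the MAC for `ip`, or None if the entry has aged out.

        `reply_arrives` models whether the client's ARP response actually
        makes it back to the switch when the entry needs refreshing.
        """
        entry = self.entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if now - learned_at < self.timeout_s:
            return mac  # still fresh, no refresh needed
        if reply_arrives:
            self.learn(ip, mac, now)  # refresh succeeded
            return mac
        del self.entries[ip]  # refresh failed, entry is gone
        return None


# Hypothetical client: learned at connect time, replies dropped afterwards.
cache = ArpCache()
cache.learn("10.12.110.50", "aa:bb:cc:dd:ee:ff", now=0)
still_known = cache.lookup("10.12.110.50", now=24 * 60, reply_arrives=False)
after_timeout = cache.lookup("10.12.110.50", now=25 * 60, reply_arrives=False)
```

In this model the client stays reachable for the first 25 minutes and then disappears from the table, and reconnecting (a fresh `learn`) restarts the clock, which matches what the corporate users were seeing.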
The corporate wireless VLANs also worked as expected on the customer’s old Cisco WLCs. So, I didn’t want to pay too much attention to the configuration on the Nexus switches – despite Aruba TAC pointing the finger in this direction on numerous occasions. Again, a similar issue was happening on the guest network using completely different routers.
I discussed the ARP issues with a colleague. He mentioned that the Cisco Nexus switches can use various methods to maintain an ARP table. One method involves capturing the information from a client-initiated Gratuitous ARP (GARP) packet. A GARP is an unsolicited ARP used to announce or update a device’s IP-to-MAC address mapping across the network. This was likely how the client’s MAC address was added to the ARP table when the client first connected to the SSID. However, at that time, I wasn’t particularly concerned about how a single Nexus 7K maintained the ARP entry without ARP responses, or why the switches didn’t share GARP entries across the two 7K vPC peers. I wanted to focus on which device was dropping the ARP responses, and why.
After many days of non-stop troubleshooting with Aruba TAC, we spotted a large number of arp rcv drop entries in the output of a show datapath frame command on the Aruba Mobility Gateways. This counter matched the ARP response packets for the Nexus 7K core switches.
We were finally getting somewhere.
The Root Cause
After a bit of research, Aruba TAC noted a change in the way ARP traffic is handled from Aruba OS 8.8.0 onwards. If client isolation features are enabled on the Mobility Gateway, an ARP frame destined toward a non-default gateway IP is considered non-trusted and dropped. The specific client isolation settings that affect the behaviour are:
- deny inter user traffic – This disallows the forwarding of any frames between untrusted users. Set globally or on the Virtual AP (VAP) Profile via the GUI or CLI.
- deny-inter-user-bridging – This disallows the forwarding of non-IP frames between untrusted users. Set globally via the CLI.
This is a massive issue if you are operating First-Hop Redundancy Protocols (FHRP) such as HSRP, VRRP & GLBP. Depending on your topology and specific configuration, each of the routers participating in the FHRP cluster (not just the VIP) may send its own ARP request, forcing the client to respond to both routers. If the ARP response never returns, the router won’t know the client’s MAC address and can’t forward traffic to the client.
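The behaviour TAC described can be sketched as a simple decision function. This is my own simplified model, not Aruba’s actual implementation. The `VlanPolicy` fields and the function name are hypothetical, and I am assuming 10.12.110.1 is the FHRP virtual IP with .2 and .3 as the real router addresses.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class VlanPolicy:
    """Hypothetical summary of the settings that matter for this issue."""
    default_gateway: IPv4Address      # the FHRP virtual IP clients use
    client_isolation: bool            # e.g. deny inter user traffic enabled
    proxy_arp: bool = False           # Proxy ARP enabled on the client VLAN
    allowed: frozenset = frozenset()  # addresses on the allowed address list


def arp_reply_forwarded(target_ip, policy):
    """Return True if a client's ARP reply towards target_ip is forwarded.

    Simplified model of the post-8.8.0 behaviour as TAC described it:
    with client isolation on, replies to anything other than the default
    gateway are treated as untrusted and dropped, unless one of the two
    workarounds (Proxy ARP or the allowed address list) applies.
    """
    target = IPv4Address(target_ip)
    if not policy.client_isolation or policy.proxy_arp:
        return True
    return target == policy.default_gateway or target in policy.allowed


# Assumed addressing: .1 is the virtual IP, .2 and .3 are the real routers.
isolated = VlanPolicy(default_gateway=IPv4Address("10.12.110.1"),
                      client_isolation=True)
reply_to_vip = arp_reply_forwarded("10.12.110.1", isolated)
reply_to_real_router = arp_reply_forwarded("10.12.110.2", isolated)
```

Under this model the reply to the virtual IP gets through, but the reply to a real router address does not, so that router never learns the client’s MAC address.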
In my configuration, I had deny inter user traffic enabled in the VAP profiles. Temporarily disabling this made ARP traffic work again – Hooray!
The Workaround
If you encounter issues with Aruba AOS 8.8 or later while using FHRP and require either of the above client isolation features, there are two possible workarounds.
- Enable Proxy ARP on the Aruba Mobility Controller for the client VLANs affected by the above settings or
- Add the list of default gateways to the IPv4 Allowed Address List on the Mobility Gateway’s CLI.
When you enable Proxy ARP on the Aruba Mobility Gateways, the Gateway can intercept and reply to ARP requests from clients. In large networks, this can reduce ARP broadcasts, cutting down on unnecessary traffic and improving performance.
The ARP functionality changes in Aruba OS 8.8.0 and later don’t affect clients that have Proxy ARP enabled. You can enable Proxy ARP on the VLAN where the clients reside. Depending on your setup, this might negatively impact the network or produce unwanted effects. So, proceed with caution.
Navigate to Configuration > Interfaces > VLANs. Click on the VLAN. Click on the VLAN ID. Select IPv4 > Other Option. Tick Local-proxy ARP. Remember to click Submit and Deploy Changes.
You don’t require an IP address on this VLAN for Proxy ARP to function. The Mobility Gateway will use the management IP address in the absence of a VLAN IP address.
Allowed Address List
Another solution is to add IP addresses to a whitelist using the CLI. By doing this, you bypass all firewall, filtering, or other configurations that might block certain activities, including ARP traffic (especially in AOS 8.8.0 and later). Add the virtual default gateway and any real router addresses used in the FHRP to the whitelist.
Here’s the command to whitelist IP addresses at the CLI:
allowed-address-list ipv4 10.12.110.1
allowed-address-list ipv4 10.12.110.2
allowed-address-list ipv4 10.12.110.3
Repeat for each VLAN where clients are affected.
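If many VLANs are affected, typing these lines by hand gets tedious. Below is a small sketch that generates the commands from a per-VLAN list of gateway addresses. The function name, the VLAN IDs and the second VLAN’s addresses are made up for illustration. Only the `allowed-address-list ipv4` syntax comes from the commands above.

```python
from ipaddress import IPv4Address


def allowed_address_commands(gateways_by_vlan):
    """Build one allowed-address-list line per unique gateway address.

    gateways_by_vlan maps a VLAN ID to the virtual and real router
    addresses used on that VLAN. Malformed addresses raise ValueError.
    """
    lines = []
    seen = set()
    for vlan in sorted(gateways_by_vlan):
        for raw in gateways_by_vlan[vlan]:
            ip = IPv4Address(raw)  # validates the address format
            if ip in seen:
                continue  # skip duplicates across VLANs
            seen.add(ip)
            lines.append(f"allowed-address-list ipv4 {ip}")
    return lines


# Hypothetical VLAN IDs; VLAN 120 and its addresses are invented.
commands = allowed_address_commands({
    110: ["10.12.110.1", "10.12.110.2", "10.12.110.3"],
    120: ["10.12.120.1", "10.12.120.2", "10.12.120.3"],
})
print("\n".join(commands))
```

Review the generated lines before pasting them into the CLI, since every address on the list is exempt from policy on the Mobility Gateway.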
You can verify the configuration using the command:
show allowed-address-list all
Be careful when using this feature. Once you whitelist these IP addresses, you can no longer apply policy to them on the Mobility Gateway by any other means! All IP traffic to and from these IP addresses via the Mobility Gateway will be permitted. Policy will have to be applied on upstream network devices.
Firstly, a huge shoutout to the Aruba TAC engineer who helped me resolve this. You just received the TLDR version, but this was one of the toughest issues I have ever had to troubleshoot. I wish I had written this when I experienced the issue. I had tonnes of logs, show commands, and packet captures that were interesting, to say the least.
- How on Earth did Aruba introduce this functionality without any mention of it in the 8.8.0 release notes?
- Aruba said this was the first time they had heard of the issue. Am I the only engineer enabling client isolation features while running HSRP on a pair of Cisco Nexus 7Ks? Or on Mikrotik routers running VRRP?
- Why is the whitelist workaround not available in the GUI? It needs to be a little more obvious.
- Why is there no workaround that allows ARP traffic to the default gateways? Does it really have to be all IP or nothing at all?
I want to hear your thoughts. Am I missing something?