Introduction
This article summarises a real-world customer project that required network connectivity from one of the customer's Azure Global sites into Azure China. The topic is not covered in great detail online. A number of lessons were learned along the way, so I felt it was important to share the issues encountered and the workarounds that were deployed.
The Azure Global region in question is Azure Southeast Asia, where the customer already had a number of production resources running, ranging from virtual machines to Web Application Gateway.
High Level Network Diagram
Below is a simplified network architecture diagram of the two regions:
- Network Virtual Appliance – NVAs were placed in both regions. A third party regulated and approved by the Chinese authorities was used to manage and configure these devices. This was a legal requirement. The third party then configured a site-to-site VPN tunnel between the two devices. Traffic subsequently passed through the network fabric and the “Great Firewall of China”.
- Azure Virtual Network Gateway – This was a requirement set by the customer. A VPN gateway was placed in each region behind the NVA device so that all network traffic was encrypted before it reached the NVA. In other words, the traffic was double encrypted to protect the business from potential packet sniffing.
- Route Tables – The green lines on the diagram represent user-defined routes. Instead of letting traffic flow directly between the two virtual network gateways, a route table was applied to the gateway subnet, which redirected traffic towards the NVA.
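The effect of the user-defined route can be illustrated with a minimal Python sketch. It models route selection the way Azure documents it: the longest matching prefix wins, and when prefixes tie, a user-defined route is preferred over a BGP route, which is preferred over a system route. The address ranges, the NVA IP address, and the names Route and select_route are hypothetical and are not taken from the project.

```python
import ipaddress
from dataclasses import dataclass

# Lower number wins when two matching routes share the same prefix length.
SOURCE_PRIORITY = {"user": 0, "bgp": 1, "system": 2}


@dataclass(frozen=True)
class Route:
    prefix: str    # destination CIDR
    next_hop: str  # hypothetical next-hop label or IP address
    source: str    # "user", "bgp", or "system"


def select_route(routes, destination):
    """Return the route chosen for destination, or None if nothing matches."""
    dest = ipaddress.ip_address(destination)
    matches = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not matches:
        return None
    # Longest prefix first, then route source priority as the tie-breaker.
    return min(
        matches,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen,
                       SOURCE_PRIORITY[r.source]),
    )


# Hypothetical route set: 10.20.0.0/16 stands in for the remote region.
ROUTES = [
    Route("10.20.0.0/16", "VirtualNetworkGateway", "bgp"),
    Route("10.20.0.0/16", "10.10.1.4", "user"),  # assumed NVA address
    Route("0.0.0.0/0", "Internet", "system"),
]

if __name__ == "__main__":
    print(select_route(ROUTES, "10.20.5.4").next_hop)
```

With this hypothetical route set, traffic for the remote range is handed to the NVA address because the user-defined route ties on prefix length with the gateway route and wins on priority.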
Microsoft Recommendations
Microsoft’s guidance and documentation on network connectivity between Azure Global and Azure China is limited. Three methods are highlighted by Microsoft:
1. Connection via two individual ExpressRoute circuits
2. Connection via China Express model
3. Connection via VPN model
Additional details can be found here. The third option, the VPN model, did not become an official recommendation until much later in the project. This option also requires a form to be submitted to Azure China technical support.
Technical Issues
Several issues were encountered when using VPN gateways in conjunction with the NVA appliances:
- There was unpredictable packet loss across both regions.
- Latency was unexpectedly high.
- There were MTU compatibility issues between the virtual network gateways and the NVA appliances.
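To see why nesting one tunnel inside another puts pressure on the MTU, the arithmetic can be sketched as follows. This is an illustration only: the 1500-byte link MTU and the 80 bytes of overhead per tunnel are assumptions, not values measured in this project. Real IPsec overhead varies with the cipher, the integrity algorithm, and whether NAT traversal is in use.

```python
IPV4_HEADER_BYTES = 20
TCP_HEADER_BYTES = 20  # TCP header without options

# Assumed per-tunnel encapsulation overhead in bytes (illustrative only).
ASSUMED_TUNNEL_OVERHEAD = 80


def inner_mtu(link_mtu, tunnel_overheads):
    """Largest inner packet that fits after every tunnel adds its overhead."""
    remaining = link_mtu - sum(tunnel_overheads)
    if remaining <= IPV4_HEADER_BYTES + TCP_HEADER_BYTES:
        raise ValueError("tunnel overhead leaves no room for payload")
    return remaining


def tcp_mss(mtu):
    """TCP payload per segment for an IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


if __name__ == "__main__":
    single = inner_mtu(1500, [ASSUMED_TUNNEL_OVERHEAD])
    double = inner_mtu(1500, [ASSUMED_TUNNEL_OVERHEAD] * 2)
    print("one tunnel: ", single, "MTU,", tcp_mss(single), "MSS")
    print("two tunnels:", double, "MTU,", tcp_mss(double), "MSS")
```

Under these assumptions, each additional layer of encryption shrinks the usable packet size, so a device whose MTU cannot be changed leaves the other layer to absorb the difference.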
It soon became apparent that there were technical incompatibilities, but finding the root cause was challenging. A series of packet captures was run on every hop. Test virtual machines were configured in both regions. PsPing tests and dummy file transfers were initiated in both directions, while packet captures were taken on the VMs, both virtual network gateways, and both NVA appliances.
This sprawled into a 12-week troubleshooting ordeal in which the root cause could not be correctly identified. What made matters worse was that the NVA MTU values were fixed and could not be increased. All parties concerned worked extremely hard on troubleshooting and analysing the network packets.
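When the same probe is repeated across many hops and in both directions, it helps to reduce each run to a few comparable numbers. The sketch below is one hedged way to do that in Python. The function name summarise_probes and the input format (one round-trip time in milliseconds per probe, with None for a probe that got no reply) are assumptions for illustration and are not the output format of PsPing or any other tool.

```python
from statistics import mean


def summarise_probes(samples_ms):
    """Summarise probe results.

    samples_ms holds one entry per probe sent: a round-trip time in
    milliseconds, or None when no reply was received.
    """
    if not samples_ms:
        raise ValueError("no probes recorded")
    replies = [s for s in samples_ms if s is not None]
    lost = len(samples_ms) - len(replies)
    summary = {
        "sent": len(samples_ms),
        "lost": lost,
        "loss_pct": 100.0 * lost / len(samples_ms),
    }
    if replies:
        # Latency figures only exist when at least one reply came back.
        summary.update(
            min_ms=min(replies),
            avg_ms=mean(replies),
            max_ms=max(replies),
        )
    return summary


if __name__ == "__main__":
    # Made-up sample values, purely to show the shape of the output.
    print(summarise_probes([10.0, None, 20.0, 30.0]))
```

Comparing these summaries per hop and per direction can narrow down where loss or latency is introduced before reaching for another full packet capture.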
Workaround
Eventually, the Microsoft product backend team acknowledged that the virtual network gateways were not correctly unpacking network packets passed on from the NVA appliances. The product team applied a fix to the virtual network gateways, which solved the issue. It must be noted that this was not an official fix rolled out to all customers. It was a temporary fix applied to our environment, which would then be considered as a bug fix for a future release.
The direct virtual network gateway to virtual network gateway option was deployed as a backup connectivity method.
Lessons Learned
In hindsight (and even during the project) a number of lessons were learned:
- Design principles: Even though using virtual network gateways behind the NVA appliances was an acknowledged risk in the project, that risk should perhaps have been pushed harder with the customer. As this was not an officially recommended (but technically possible) Microsoft approach to begin with, the solution and testing should have been time-boxed.
- Virtual Network Gateways only: When it was realised that using virtual network gateways in isolation was a supported and legal method, the team should have halted all progress on the main issue and proceeded with this option to allow the business to complete the project. The initial issue surrounding the gateways and NVAs should then have been revisited at the end.
- No more packet captures: The number of packet captures conducted during testing was substantial. At times it felt like we were chasing our tails. Performing packet captures was a necessary evil to get to the root cause, but it added to the project timelines and did not provide the team with much benefit overall. The number of captures could easily have been halved.
Conclusion
It was a difficult project, and one that also came with political challenges. Numerous tests and workarounds were completed during the initial troubleshooting phase. Option papers were written outlining the pros, cons, and next steps. But somewhere along the way, the customer became submerged in the virtual network gateway and NVA issue and was committed to making that work. This was certainly admirable. Sometimes, however, it is worth going back to the drawing board and considering the overall damage being done to the project. While technical and security sacrifices should be avoided on projects where possible, if the project is in jeopardy it is important to consider other workable solutions in parallel.
Why not select China Unicom’s cloud connection?
Good question. It was a limitation at the time. The project team was restricted and “locked in” with a provider for political rather than technical reasons.