In part one of this series (see my original post published on December 1, 2016), I compared AWS VPN CloudHub and Cisco DMVPN, giving an overview and use cases for each. In part two, I'll take a deeper dive into both solutions, covering the technical details you need to further evaluate and compare AWS VPN CloudHub and Cisco DMVPN.
AWS VPN CloudHub Technical Details
First, to get a closer look at AWS VPN CloudHub and how it works, here’s a quick overview of AWS VPN technology and terminology. You can find AWS VPC documentation here.
A Virtual Private Gateway (VPG) is a logical interface for communication with on-premises environments, either via VPN or Direct Connect (DX). A Customer Gateway (CG) is the logical representation in your AWS VPC of the VPN device located at a remote site; you will have one CG for each remote site that you want to connect to your VPC. The CG's purpose is to give AWS a static, publicly routable IP address for the remote site, along with the BGP Autonomous System Number (ASN) you want to use for dynamic routing (BGP), which is required for CloudHub. When you create and attach a VPN connection between a VPG and a CG, AWS creates two virtual VPN concentrators in two different availability zones; you can think of the VPG as the logical representation of those concentrators within your VPC. A VPN connection is created for each remote site that you want to connect privately to your VPC.
Figure 4: Technical AWS VPN CloudHub architecture
As you can see in Figure 4 above, under the covers, two IPsec tunnels are created, one to each VPN concentrator. Over these tunnels we establish BGP peering adjacencies to AWS using IP addresses from the link-local address space (169.254.0.0/16). Each customer gateway (CG) is configured with a unique BGP ASN, and AWS exchanges routes between our ASNs so that the remote sites can route traffic to each other. Because we are exchanging routes, your remote sites cannot have overlapping IP addresses. At your remote sites, you can use either a hardware-based or a software-based VPN device to connect to your VPC. A fantastic feature that AWS provides is the ability to download configuration text files for many of the VPN devices you might use at these remote sites in your AWS VPN CloudHub. See Figure 5 below for a list of some of the vendors AWS provides configurations for.
Note: If you read the AWS documentation on CloudHub and come across the statement that you can use the same BGP ASN for your remote sites, know that this was once valid but is no longer the case. In my proof of concept for CloudHub, I initially configured each remote site with the same ASN and it would not work. I confirmed with AWS support that you have to use separate BGP ASNs. Once I configured everything with separate ASNs, I was able to see routes from the other sites.
Figure 5: VPN Connection Configurations
As part of the downloadable configuration for your CG, AWS will automatically provide all of the security settings for IKE, IPsec, and the pre-shared keys (PSK) to build tunnel interfaces back to AWS. The BGP configuration is also provided for you so your VPN device can establish a BGP adjacency with the AWS VPN concentrators over the IPsec tunnels built. The BGP ASN that is used is what you set in the AWS console for the CG.
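As an illustration, the Cisco IOS flavor of the downloaded configuration builds a tunnel interface and BGP session per concentrator, roughly along the lines sketched below. All addresses, the key, and your ASN (65001 here) are placeholders; your download contains the real values for your VPN connection. The AWS-side ASN shown (7224) has historically been the Amazon default, but verify it against your own downloaded configuration:

```
! One of the two tunnels AWS generates (the second is identical
! but points at the concentrator in the other availability zone)
crypto keyring keyring-vpn-1-tunnel-1
 pre-shared-key address 203.0.113.1 key <psk-from-downloaded-config>
!
interface Tunnel1
 ip address 169.254.44.10 255.255.255.252
 tunnel source 198.51.100.5
 tunnel destination 203.0.113.1
 tunnel mode ipsec ipv4
 tunnel protection ipsec profile ipsec-vpn-1
!
! BGP peering to AWS over the link-local tunnel addressing
router bgp 65001
 neighbor 169.254.44.9 remote-as 7224
 network 10.100.0.0 mask 255.255.255.0
```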
Note: One thing to consider is that the default CG configuration will not work properly for AWS VPN CloudHub deployments. By default, the configurations advertise a default route into the BGP table, which we don't want with AWS VPN CloudHub. In many VPCs, private subnets already have a static default route pointing to an AWS NAT gateway so that instances in those subnets can reach the Internet (or pointing to a DX or VPN connection back to an on-premises data center that routes to the Internet). We only want remote site traffic to route to the VPG, either by summarizing the remote networks and placing static routes in the VPC routing table(s) or by propagating BGP-learned routes from the VPG into the VPC.
Using host routes (/32) attached to loopback interfaces at each remote site in my test environment, I observed AWS influencing which path traffic takes by advertising BGP Multi-Exit Discriminator (MED) values to my remote routers. Considering that AWS deploys the VPN concentrators in different availability zones, it makes sense that they would use the MED to keep all of the traffic going to a single AZ, as this is more deterministic. If the BGP peer on the primary path drops, traffic fails over to the other tunnel, and hence the other AZ. Once the primary path came back up, I noticed that AWS would again advertise a route with a lower MED (100 vs. 200) to influence traffic to flip back to the primary path. Figure 6 below shows this in action, with the preferred routes circled for each destination. You can also see the BGP AS paths. Notice the private AS of 65200 or 65201 in the path; that is the ASN of the other remote site in my test environment. In a production deployment, you would see all of your remote site networks along with the BGP ASNs you configured as part of the CGs in AWS.
Figure 6: Show ip bgp topology * output from remote site A
Figure 7: Show ip bgp topology * output from remote site B
Finally, it is worth mentioning that with AWS VPN CloudHub, you can configure your AWS routing tables to allow propagated routes to be installed automatically from your VPG. One thing to keep in mind is the default limit for propagated routes in an AWS routing table is 100. If you have more than 100 sites, it may be best to create static summary routes and point those to your VPG. Routing traffic to the proper VPN concentrator is handled by AWS, so you don’t have to configure or worry about that piece.
Cisco DMVPN Technical Details
Figure 8 below shows an example of a DMVPN solution in AWS. This is the model I used in my test environment, and it exposes more of the technical details of the solution.
Figure 8: Detailed DMVPN Solution
With DMVPN, an IPsec tunnel is statically configured from each spoke router to the hub routers' mGRE tunnel interface in our VPC; the hub routers do not initiate VPN connections to the spokes. A static IKE pre-shared key on the hub router is associated with 0.0.0.0 (any address), and this behavior, combined with the spokes initiating the IPsec tunnels, is how DMVPN supports dynamic IP addresses on the spoke routers.
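As a sketch of the behavior just described, the hub's IKE configuration might associate a pre-shared key with any peer address as shown below. The key, transform-set, and profile names are placeholders, and the specific crypto policy settings will vary by deployment:

```
! Accept IKE negotiation from any source address; this is what
! allows spokes with dynamic public IPs to initiate the tunnel
crypto isakmp policy 10
 encryption aes 256
 authentication pre-share
 group 14
crypto isakmp key MyDmvpnKey address 0.0.0.0 0.0.0.0
!
crypto ipsec transform-set DMVPN-TS esp-aes 256 esp-sha-hmac
 mode transport
crypto ipsec profile DMVPN-PROFILE
 set transform-set DMVPN-TS
```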
The hub routers maintain a mapping of private tunnel IP addresses to the public IP addresses each connection is using. Figure 9 below shows an example from my test environment:
Figure 9: DMVPN output from a hub CSR in AWS
One important detail for getting direct spoke-to-spoke tunnels working is to make sure that the hub routers do not advertise themselves as the next-hop router. For example, with EIGRP you can disable split horizon and apply the "no ip next-hop-self eigrp X" command on the hub's tunnel interface, and the hub will pass along the tunnel IP addresses of spoke routers to the other spokes in the DMVPN cloud. To demonstrate this, Figure 10 below shows pings between loopback interfaces defined on each of the spoke routers in my test environment. One router sits in the AWS Oregon region and the other in Northern California, while my hub router sits across the country in the Northern Virginia region, so latency is much higher when we do not use the dynamic spoke-to-spoke DMVPN tunnel between Oregon and Northern California.
Figure 10: Ping between spokes using hub vs direct spoke to spoke
Between running the two ping commands above, I modified the hub router to stop advertising itself as the next hop, so the routing tables would use the IP address of the spoke's tunnel interface. You might notice that the first packet was dropped in the spoke-to-spoke test. This happens because it takes a moment for the spokes to build the IPsec tunnel between themselves: the initiating spoke has to query the hub router for the public IP address of the other spoke and then build an IPsec tunnel to it if a DMVPN tunnel does not already exist. Once the tunnel is established, the same ping test succeeds for all five packets. How long the tunnels remain up depends on the NHRP holdtime settings on each spoke router. That setting controls the timeout value the spoke advertises to the Next Hop Server (our hub router) in NHRP, which in turn advertises that value to the other spoke routers to control how long they keep a tunnel up. Cisco recommends setting this timer to 5-10 minutes, or 300-600 seconds (the command is configured in seconds). In my lab, I noticed that Cisco changed the default NHRP holdtime between the two versions of IOS-XE I was using (16.03.01a and 03.16.04a). In the newer Denali release (16.03), the default holdtime was 10 minutes (ip nhrp holdtime 600), whereas in the older version it was two hours (ip nhrp holdtime 7200). The point here is to make sure you explicitly set the holdtime you want on your spoke routers, especially if you are running older versions of IOS or IOS-XE.
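Putting the pieces above together, a minimal sketch of the hub- and spoke-side tunnel configuration might look like the following. The interface names, EIGRP AS number (100), NHRP network ID, key string, and all IP addresses are placeholders for illustration, not values from my test environment:

```
! Hub mGRE tunnel: disable split horizon and next-hop-self so
! spokes learn each other's tunnel addresses for direct tunnels
interface Tunnel0
 ip address 10.0.0.1 255.255.255.0
 no ip split-horizon eigrp 100
 no ip next-hop-self eigrp 100
 ip nhrp authentication MyNhrpKey
 ip nhrp map multicast dynamic
 ip nhrp network-id 1
 tunnel source GigabitEthernet1
 tunnel mode gre multipoint
 tunnel protection ipsec profile DMVPN-PROFILE

! Spoke tunnel: register with the hub (NHS) and explicitly set
! the NHRP holdtime rather than relying on version-specific defaults
interface Tunnel0
 ip address 10.0.0.11 255.255.255.0
 ip nhrp authentication MyNhrpKey
 ip nhrp nhs 10.0.0.1
 ip nhrp map 10.0.0.1 203.0.113.10
 ip nhrp map multicast 203.0.113.10
 ip nhrp network-id 1
 ip nhrp holdtime 600
 tunnel source GigabitEthernet1
 tunnel mode gre multipoint
 tunnel protection ipsec profile DMVPN-PROFILE
```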
Which routing protocol to use?
Which routing protocol is better to run over DMVPN? Truth be told, we could probably write an entire article on this subject alone, but let's keep it simple. The protocol you choose matters because it can affect the scale of your DMVPN deployment. According to Cisco, the best protocols for large-scale deployments are BGP and EIGRP. I have personally seen more deployments using EIGRP because it scales better in DMVPN environments than OSPF and is easier to implement than BGP. For really large environments (more than 1,000 spokes), you need to run BGP.
If you want more information on this subject, check out this article.
DMVPN Redundancy in AWS
To get traffic from our instances to flow to our secondary CSR router in the event that our main CSR router fails, we have to modify the AWS subnet's routing table to point either the default route or a summary route covering all the DMVPN spoke networks to the Elastic Network Interface (ENI) of our CSR. If the CSR fails for whatever reason (host issues, etc.), we need a method to change the routing tables in AWS to point to the other CSR. An AWS VPC only supports layer 3 networking, so we cannot use traditional first-hop redundancy protocols like HSRP/VRRP/GLBP.
Here is an overview of the process to get redundancy in place (assuming you are using IOS-XE 16.03 or higher):
- Before you deploy any CSRs, make sure you create the AWS IAM role that allows the CSR instances to modify the AWS routing tables. This is important because you cannot assign an IAM role to an instance after that instance has already been launched. I learned this the hard way in the lab!
- Make sure you have licensed the CSRs with the AX license so you can run BFD between each CSR. As a reminder, BFD is used as a failure detection method so the surviving CSR instance can modify the AWS routing table(s).
- Build a GRE tunnel between the CSR routers using the Elastic IP address of the other peer. (Cisco recommends the Elastic IPs over the private addresses to prevent DHCP renewals from causing a false-positive BFD peer-down event.)
- Enable EIGRP across the tunnel interfaces so the routers form an adjacency. You can run any routing protocol between the CSRs that supports BFD.
- Enable BFD in the EIGRP process for the tunnel interfaces.
Note: Cisco also recommends changing the BFD interval between the CSRs from the default 50ms to 500ms (with a multiplier of three), which should detect an outage within 1.5 seconds. The reason for relaxing these values is the shared-tenancy model of the AWS environment: if the BFD intervals are too aggressive, a BFD peer-down event could occur and trigger a false-positive failover. Per Cisco best practices, 500ms produced acceptable failover times in most environments and was stable.
- Configure the redundancy/cloud provider AWS settings in IOS to change the AWS routing table to a new ENI. Example configuration is below.
Figure 11: IOS-XE 16.03 (Denali) AWS redundancy configuration example
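For reference, here is a condensed sketch of the pieces described in the steps above, roughly as they appeared on IOS-XE 16.03. The route table ID, ENI ID, region, and all addresses are placeholders, and the HA syntax has changed across IOS-XE releases, so check the CSR 1000v documentation for your version:

```
! GRE tunnel between the two CSRs, built to the peer's Elastic IP,
! with the relaxed 500ms BFD timers and multiplier of three
interface Tunnel2
 ip address 192.168.100.1 255.255.255.252
 bfd interval 500 min_rx 500 multiplier 3
 tunnel source GigabitEthernet1
 tunnel destination 198.51.100.20
!
! EIGRP adjacency across the tunnel with BFD failure detection
router eigrp 200
 network 192.168.100.0 0.0.0.3
 bfd interface Tunnel2
!
! On a BFD peer-down event, rewrite the AWS route to this CSR's ENI
redundancy
 cloud provider aws 1
  bfd peer 192.168.100.2
  route-table rtb-0123abcd
  cidr ip 0.0.0.0/0
  eni eni-0123abcd
  region us-east-1
```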
AWS Availability Zone (AZ) redundancy is also something to consider in this process. To keep costs to a minimum, you need to consider how your VPC is designed, since there is a small charge for traffic that traverses AZs in AWS. As of this writing, that charge was effectively $0.02/GB ($0.01/GB per EC2 instance). The traffic generated by BFD is very small; at these rates, running BFD over a GRE tunnel between AZs costs roughly $0.02 per BFD session per month, which is not expensive. The real consideration is the traffic that crosses AZs from other instances to reach the CSRs. This all depends on how your VPC is designed and how much traffic is sent to and from the CSRs for the remote sites, so your mileage will vary. One way to minimize this, and in my opinion make it a non-issue, is to route traffic from the subnets within a single AZ to a CSR running in that AZ and fail over to the CSR in the other AZ only when needed. Our pair of CSRs running BFD between each other could be "active" for their local subnets and standby for the subnets in other AZs. Figure 8 above illustrates this type of setup.
My network engineering background also has me wanting to default all subnets in my VPC to /24s, which in AWS gives 251 usable hosts per subnet. Network engineers size subnets this way to limit layer 2 broadcast domains on traditional networks, but AWS doesn't support layer 2 networking in a VPC, so we no longer have to worry about broadcast domain size. That frees us to use larger subnets (/23, /22, etc.) capable of handling more hosts, which would simplify our routing tables and DMVPN deployments for a cross-AZ configuration. This, of course, is one of many design considerations that need to be discussed when you are deploying applications to AWS.
AWS has hidden some of the complexity for customers looking to deploy a highly available VPN solution. Deploying the AWS VPN components into your VPC is very easy compared to what network engineers have had to do in the past for other VPN solutions. However, this simple service may not fit every use case for connecting remote sites, and some customers will need to deploy an EC2-based VPN solution like Cisco DMVPN to accommodate their requirements. You, as the customer, will have to build, manage, and make that EC2-based service highly available, since customers are responsible for the applications running on their EC2 instances. It is wise to plan and design your AWS environment carefully up front so you architect the best solution possible for your business. As I say to customers all the time, "Nerd knobs are great, but only if you need them." (For the record, I love nerd knobs.) The same holds true here: there are many more nerd knobs a customer can tweak and configure with Cisco DMVPN, but that also increases complexity compared to a VPN deployment such as AWS VPN CloudHub, where part of the service is managed on your behalf.
To learn more about considerations for VPN solutions, networking in the cloud, or hybrid cloud networking, engage with AHEAD’s networking experts when you sign up for a visit to the AHEAD Lab and Briefing Center today.