Vagrant, Ansible, Cumulus, and EVPN… Orchestrating a Virtualized Network on your Laptop.

In this post we will talk about how to use Vagrant, Ansible, and Cumulus Linux to build a virtual layer-2 extension across an IP fabric. We will use the standard MP-BGP protocol and its EVPN address family for the control plane, and VXLAN to provide the virtual data plane.

Each technology deserves a brief introduction prior to making things happen.

Vagrant up!

Vagrant is an orchestration tool that uses “providers” to deploy virtual machines to a variety of platforms. Vagrant is probably most famous for its integration with VirtualBox. It’s quite common for cloud and application developers to use Vagrant to build their application stacks locally, which makes it easier to then deploy them to production cloud environments in an automated fashion. We will use Vagrant to deploy the switches and servers in our topology.

Ansible.

Ansible is another orchestration tool that helps you manage your infrastructure as code. It is agentless and has probably the best toolbox for network automation of the various orchestration tools. In this post we will use Ansible to provision, or configure, our devices.

Cumulus Linux.

Cumulus Networks is a networking software company that publishes a machine image running its network operating system. This Linux image will run as the switches in our topology. In the real world, you can run this software on a physical switch and operate the topology in the same way; just replace Vagrant with your ops team installing and cabling the switches. Cumulus publishes the Cumulus VX image, along with several automated reference topologies, so that you can do exactly what we are doing in this post: automate your network with Cumulus.

Last but not least….EVPN.

EVPN stands for Ethernet Virtual Private Network. It’s another address family under the MP-BGP standard. The technology serves as the control plane, distributing MAC address reachability to your switches across an IP fabric.

Another technology that should be mentioned, though it’s not in the title, is VXLAN. We will use VXLAN in the data plane to encapsulate and transport our Ethernet frames.

Ok, here is the topology we are going to build:

[Topology diagram: a two-spine, multi-leaf IP fabric with servers attached to the leaf switches]

On the bottom are our servers. Each server is connected to a single leaf switch in a mode 2 port channel. Many of the labs that Cumulus publishes dual-attach the servers to two leaf switches and bond them using Cumulus’ MLAG implementation. That requires LACP, which, unfortunately, I was not able to get working locally. To keep the project moving forward, I implemented a workaround: I changed the physical connectivity and configured all of the bonds as mode 2 (balance-xor).

Cumulus provides some great tools for building these topologies; one of them is their topology converter.

Using their Python script topology_converter.py, I took the following “.dot” file and converted it into a Vagrantfile. Vagrant will use this to build, and most importantly connect, all of the instances.
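The .dot file itself lives in the repo rather than inline here, but a trimmed-down sketch of the converter’s format looks roughly like this (hostnames, box images, and port mappings are illustrative, not the repo’s exact contents):

    graph dc1 {
      "spine01"  [function="spine" os="CumulusCommunity/cumulus-vx" memory="768"]
      "leaf01"   [function="leaf"  os="CumulusCommunity/cumulus-vx" memory="768"]
      "server01" [function="host"  os="ubuntu/xenial64" memory="512"]

      "leaf01":"swp51" -- "spine01":"swp1"
      "server01":"eth1" -- "leaf01":"swp1"
    }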

The Vagrantfile it creates is fairly long and needed a number of modifications for what I wanted. Luckily, I’m going to share this with you… so clone this repo and take a look at it yourself. The repo will be needed to complete the build anyway.

The modifications that I made to the file involve commenting out some of the helper scripts and using Vagrant’s Ansible integration to run the playbooks.
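As a rough sketch, that integration boils down to a provision block like this inside each machine definition (the playbook name here is a stand-in, not the repo’s actual file):

    device.vm.provision :ansible do |ansible|
      ansible.playbook = "provision.yml"   # hypothetical playbook name
    end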

Let’s run a few commands:
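First, a look at what Vagrant is about to build (output omitted here):

    vagrant status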

That lists the instances we are about to create… so let’s build and provision them!
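Building and provisioning is a single command:

    vagrant up    # builds each VM and runs the Ansible provisioner as it comes up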

Our VMs are up…cool!

If you watched Vagrant do its magic, you will have seen that it also ran the Ansible playbooks in a rolling, per-machine fashion. Each time a machine build completes, the playbooks run, the dynamic inventory is matched against the new machine, and Ansible deploys the new configuration to it. The next machine that is built is provisioned the same way, but Ansible does not re-provision any of the previous instances because it has already completed them.

Let’s jump onto one of our servers and verify it’s working as expected. Here we log into server1 (10.1.1.101) and ping server2 (10.1.1.102).
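From the Vagrant host, that looks something like this (the machine name server01 is an assumption from the topology file):

    vagrant ssh server01
    ping -c 3 10.1.1.102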

It works! If you’ve gotten this far, congratulations! You have layer-2 connected a couple of hosts across a layer-3 IP fabric… all using EVPN as the control plane and VXLAN as the virtual data plane!

Now it’s time to let our networking geek-birds fly. Let’s view the configurations, verify the control plane and data plane functionality at the command line, and dig even deeper with a review of network captures from the IP fabric.

Here is the leaf control plane configuration. Notice how the neighbor statements refer to interfaces, and not actual neighbor IP addresses. This is because the neighbors are established using IPv6 link-local addressing. Using this strategy, you simply specify the interface and the peer addressing is derived from IPv6 neighbor discovery (router advertisements).
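The exact config is in the repo; here is a minimal sketch of the relevant BGP stanza in FRR syntax, with the ASN, router-id, and uplink names being assumptions:

    router bgp 65011
      bgp router-id 10.0.0.11
      neighbor swp51 interface remote-as external
      neighbor swp52 interface remote-as external
      address-family l2vpn evpn
        neighbor swp51 activate
        neighbor swp52 activate
        advertise-all-vni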

Here is the interface configuration:
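Again, a trimmed sketch rather than the repo’s exact file; the addresses and the loopback tunnel IP are assumptions (Cumulus ifupdown2 syntax):

    auto SERVER01
    iface SERVER01
        bond-slaves swp1
        bond-mode balance-xor
        bridge-access 10

    auto vxlan10
    iface vxlan10
        vxlan-id 10
        vxlan-local-tunnelip 10.0.0.11
        bridge-access 10

    auto bridge
    iface bridge
        bridge-vlan-aware yes
        bridge-ports SERVER01 vxlan10
        bridge-vids 10

    auto vlan10
    iface vlan10
        address 10.1.1.1/24
        vlan-id 10
        vlan-raw-device bridge
        vrf RED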

In the above snippet, we combine our vlan10, vxlan10, and SERVER01 interfaces into one bridge domain… named “bridge”. Our layer-3 interface, vlan10, is assigned to the RED vrf.

So there’s the config, lets verify it at the command line.
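Two useful views are the session summary and the EVPN routes themselves (NCLU and vtysh respectively; output omitted here):

    net show bgp summary
    sudo vtysh -c "show bgp l2vpn evpn route"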

Look closely at that output: the BGP EVPN address family is talking about MAC addresses. It’s associating each MAC with a loopback address as the next hop. VXLAN will use this information to establish the overlay in the data plane. Based on this output, the control plane seems to be working.

Next… we capture the data plane and test fault tolerance. We are using ECMP across our uplinks, so we will have to shut one path down to make sure our captures are plentiful.
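Killing a spine from the hypervisor side is the quickest way to do that (the machine name is an assumption from the topology):

    vagrant halt spine01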

We toasted spine1 and the network continues to roll…awesome.

We can now see that we are single-threaded through spine2… and can be sure we are capturing everything through a single interface.

Time to take captures.
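Something along these lines grabs the VXLAN-encapsulated traffic on the remaining uplink; 4789 is the IANA-assigned VXLAN port, and the interface name is an assumption:

    sudo tcpdump -i swp52 -w evpn.pcap udp port 4789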

Here’s our capture. I’ve filtered everything but some ICMP traffic. Check out the VXLAN header, more specifically the VNI. As defined by RFC 7348:

“Each VXLAN segment is identified through a 24-bit segment ID, termed the “VXLAN Network Identifier (VNI)”. This allows up to 16 million VXLAN segments to coexist within the same administrative domain. The VNI identifies the scope of the inner MAC frame originated by the individual VM.”

Once again we made it to the end of another pretty cool post. The use cases for combining Vagrant, Ansible, and Cumulus Linux are vast. In future posts, I hope to build on this topology by establishing routing between networks, to external networks, and by implementing security within the fabric.

I had an absolute blast building and sharing this environment. I… and hopefully we… learned a ton!


Using GnuPG to Handle Your Network Automation Credentials!

One thing I struggled with for a long time is the following:

How do we code our network while securely handling our device credentials? How do we do this in a way that is highly collaborative?

Here’s one issue that I ran into. It is easy to get roped into baking your credentials into a script (completely guilty here). But what happens when it’s time to deliver your code to a colleague, or even an external customer? You will need to refactor your code to deal with the AAA credentials that are sitting in plaintext in your code.

With Python and GnuPG, we can securely deal with device credentials in shareable code.

One of my favorite parts about this strategy is thinking about the extensibility of GnuPG… particularly its ability to send and receive secure messages. This post won’t dive into that much. Instead we’ll stick to the following objectives:

  1. Install GnuPG, the associated python libraries, and generate keys.
  2. Build an encrypted credentials file in yaml or json.
  3. Use python to interface with your keys and securely load your credentials.

Ok… that was highly summarized… let’s get into the details:

Installing gpg via brew…there is more chatter in real life, but this is a blog:
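    brew install gnupg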

Installing the required python libraries.
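The wrapper library that lets Python drive the gpg binary is python-gnupg:

    pip install python-gnupg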


Generating keys…Please read the entire section before starting.

This step generates a public and private key in the .gnupg folder. When you proceed to use this in code, you encrypt with the specified user’s public key, and decrypt with your own private key.

Run this command and follow the self-explanatory prompts. Be advised that not setting a passphrase is less secure. In this scenario I’m treating my keys like SSH RSA keys and giving them file permissions of 600.
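    gpg --gen-key
    chmod 600 ~/.gnupg/*    # lock the keyring files down, ssh-style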


Cool… let’s play with gnupg in the interpreter:

We specify our .gnupg location and begin to interact with our keys:
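    >>> import gnupg
    >>> gpg = gnupg.GPG(gnupghome='/Users/me/.gnupg')   # substitute your own .gnupg path
    >>> gpg.list_keys()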

Let’s encrypt some stuff. We set up a string to encrypt and perform the encryption with the gpg.encrypt() function. We also have ways to make sure the encryption worked, and to see the encrypted object:
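A minimal sketch; the string and recipient address are placeholders:

    >>> unencrypted_string = 'secret stuff'
    >>> encrypted_data = gpg.encrypt(unencrypted_string, 'me@example.com')
    >>> encrypted_data.ok
    True
    >>> type(encrypted_data)
    <class 'gnupg.Crypt'>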

Yes this is an object!

That means you have to convert it to a string with the str() function to decrypt it… you guessed it, that’s next:
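    >>> decrypted_data = gpg.decrypt(str(encrypted_data))
    >>> str(decrypted_data)
    'secret stuff'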


Ok, we have gnupg working in Python and bash. How do we automate our network credentials?

First we need to encrypt credentials_file.txt from bash. Here is the credentials file:
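    ---
    # placeholder values -- the real file holds your AAA creds
    username: admin
    password: SuperSecret1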

Here’s how we encrypt it.
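    # the recipient address is a placeholder for whichever key you generated
    gpg --encrypt --recipient me@example.com --output credentials_file.txt.gpg credentials_file.txt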


And here is our Python… we made it!

There might be a lot to look at below, but focus on the “Decrypt/Load credentials” section. We’re automating our network credentials securely! The creds are loaded and used by the connection handler… in code that’s shareable.

This script deploys a new vlan to a data center Ethernet fabric, and ensures the new vlan is available via an 802.1Q tag to a pre-specified VMware cluster.
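The full script is longer than what follows; this condensed sketch shows the credentials flow, assuming Netmiko as the connection handler and the file names from above (the device type, IP, and vlan details are placeholders):

    import gnupg
    import yaml
    from netmiko import ConnectHandler

    # Decrypt/Load credentials
    gpg = gnupg.GPG(gnupghome='/Users/me/.gnupg')       # substitute your own path
    with open('credentials_file.txt.gpg', 'rb') as f:
        decrypted = gpg.decrypt_file(f)
    creds = yaml.safe_load(str(decrypted))

    # Hand the creds to the connection handler -- nothing sensitive lives in the code
    switch = ConnectHandler(device_type='cisco_nxos',   # platform is a placeholder
                            host='10.1.1.10',
                            username=creds['username'],
                            password=creds['password'])
    output = switch.send_config_set(['vlan 123', 'name NEW_VLAN'])
    switch.disconnect()
    print(output)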

Automate Where it Makes Sense…or… It Makes Sense to Automate?


Well, certainly the layman would say to automate where it makes sense… but why not drive your network to a place where it makes sense to automate? Transform your network into one that’s conducive to automation, and the code will flow freely. Like the infamous Dan Bilzerian once said, “It’s all about setup”.

In many cases it’s useful to run a script to change several network devices, but I believe many stop there when it comes to network scripting, and fail to see the benefits of an “automate everything” culture. One-time-use scripts have their place, but driving a code-based architecture… and culture… helps to drive opportunities for automation. This approach can be daunting up front, but the investment will pay off if executed properly.

The way I envision a successful transition to a code-based network is to follow a three-step process.

  1. Standardize and document EVERYTHING. (Well that’s not very sexy, when do we start coding?)

This doesn’t seem very fun or exciting, but it’s absolutely critical. Picture this: you want to deploy some new SNMP configurations to your 2000 routers and switches, but many of them are from different vendors and have different AAA configurations. Congratulations, we’ve just hit the first non-starter for automation. Get the picture?

You can consider the potential for automation a direct function of how standardized your network is. The documentation of these standards will translate directly into business rules to write code against.

  2. Instantiate all configuration data as structured data.

Right now you probably have a configuration management platform that is backing up all of your configurations as text files. This is great, but we need our code (or orchestration tools) to be able to act on this data in meaningful and efficient ways. The goal of this step is to take your configurations and split them into variables and parameterized templates. The below example was built for Ansible, but a Python script can apply the variables to the Jinja2 template just as well.

Here’s my parameterized_template.yml:
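My production template isn’t reproduced here; a toy SNMP example in the same spirit looks like this (Jinja2, with variable names I made up for illustration):

    snmp-server location {{ snmp_location }}
    {% for community in snmp_communities %}
    snmp-server community {{ community }} RO
    {% endfor %}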

Here’s my variables.yml:
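And the matching (made-up) variables:

    ---
    snmp_location: dc1-rack12
    snmp_communities:
      - example-ro
      - example-mon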

  3. Manage your structured data like an application, not a network.

This is where I get to throw out buzzwords like Agile and DevOps. These terms describe methodologies for software development. I won’t go into the details of each one in this post, but the takeaway is that your network is now an application. Each configuration snippet should be treated as a software feature.

For example, we want our application to use ISP2 when ISP1 is down. How should we code this feature and deploy it? How can we unit test this code? How can we roll it back if the deployment goes bad? How can we canary test the deployment to reveal issues early with minimal business impact (fail early, fail often)?

The coming posts will aim to answer all of these questions…so stay tuned!

Today I added a WordPress blog to my site…and it was pretty good.

Today I added a WordPress blog to my site…cool story bro.

I wanted to bolt a blog onto my existing site rather than run a separate WordPress instance. The reason for this was basically to increase my skills with GCP, Ansible, PHP, Python, and MySQL.

If all you are after is getting a WordPress site up and running, then it’s probably much easier to go with the GCP Marketplace WP instance and walk away. I initially tried this and it was almost too easy. You more or less fill out a form and it’s built for you, and all of the automation behind it is available and unit tested.

That being said, my work is not entirely free of a quest for shortcuts. I played with the Google Marketplace instance, a canned WP Ansible role, and installing WP manually. At this point I don’t remember which approach I ended up with, but it didn’t matter once I learned how to deploy the existing site repeatably. The key here was two Google Storage buckets: one for the WP content files, and another for a restorable database dump.

Here is how this site is built and deployed:

The site is deployed across two Google Cloud instances. The first runs nginx and PHP and serves a static webpage as well as the WordPress site. The second is a database instance.

The two instances are fully orchestrated using Ansible. When you run the playbook, it builds the two instances and deploys roles to them.

The playbook pulls the static/WordPress content, along with a database dump, from Google Storage buckets.

To facilitate updates and repeatability, a Python script is used to keep the buckets updated.

Security??…Gosh I hope so.
Our variables are encrypted using Ansible Vault. What’s cool is that, under the hood, we use a vault password file to encrypt everything… no passphrase at the command line:
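    # file and path names are assumptions
    ansible-vault encrypt group_vars/all.yml --vault-password-file ~/.vault_pass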

Deploying to Google Cloud using Ansible.

Here is the primary playbook:
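The playbook itself isn’t inlined in this text; at a high level it resembles this sketch (module from the google.cloud collection; the project, zone, image, and role names are all assumptions):

    ---
    - name: Build the two GCP instances
      hosts: localhost
      connection: local
      tasks:
        - name: Create instance
          google.cloud.gcp_compute_instance:
            name: "{{ item }}"
            machine_type: e2-small
            zone: us-east1-b
            project: my-gcp-project              # placeholder project id
            auth_kind: serviceaccount
            service_account_file: ~/gcp-key.json
            disks:
              - auto_delete: true
                boot: true
                initialize_params:
                  source_image: projects/debian-cloud/global/images/family/debian-11
            network_interfaces:
              - access_configs:
                  - name: External NAT
                    type: ONE_TO_ONE_NAT
            state: present
          loop:
            - web01
            - db01

    - name: Configure the web tier
      hosts: web
      roles:
        - nginx_php_wordpress                    # role names are assumptions

    - name: Configure the database tier
      hosts: db
      roles:
        - mysql_restore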

Here is the directory structure that ansible uses:
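Roughly like this (the names match the sketch above; the real repo may differ):

    .
    ├── site.yml
    ├── group_vars/
    │   └── all.yml              # encrypted with ansible-vault
    ├── roles/
    │   ├── nginx_php_wordpress/
    │   │   └── tasks/main.yml
    │   └── mysql_restore/
    │       └── tasks/main.yml
    └── wp_db_gb_sync.py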

This python script is run to keep the content and database updated off platform. You run this after making live changes to maintain repeatability:

Python script… for the boys: wp_db_gb_sync.py
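The script itself isn’t embedded in this text; here’s a sketch of what wp_db_gb_sync.py does, assuming the google-cloud-storage client, a local mysqldump (with credentials supplied via ~/.my.cnf), and bucket names that are placeholders:

    import subprocess
    from google.cloud import storage

    CONTENT_BUCKET = 'my-wp-content-bucket'   # placeholder bucket names
    DB_BUCKET = 'my-wp-db-bucket'

    def dump_database():
        """Dump the WordPress database to a local file."""
        with open('wordpress.sql', 'w') as f:
            subprocess.run(['mysqldump', 'wordpress'], stdout=f, check=True)

    def upload(bucket_name, local_path, remote_name):
        """Push a local file up to a Google Storage bucket."""
        client = storage.Client()
        client.bucket(bucket_name).blob(remote_name).upload_from_filename(local_path)

    def main():
        # keep the restorable database dump current
        dump_database()
        upload(DB_BUCKET, 'wordpress.sql', 'wordpress.sql')
        # tar wp-content so it uploads as a single object
        subprocess.run(['tar', 'czf', 'wp-content.tgz', '/var/www/html/wp-content'],
                       check=True)
        upload(CONTENT_BUCKET, 'wp-content.tgz', 'wp-content.tgz')

    if __name__ == '__main__':
        main()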