
(Get it, like the law of Demeter, except completely different and totally unrelated).

My focus at work recently has been on performance and load testing the service that lies at the heart of our newest piece of functionality. The service is fairly straightforward, acting as a temporary data store and synchronization point between two applications, one installed at the client side and one mobile. It's multi-tenant, so all of the clients share the same service, and there can be multiple mobile devices per client, split by user (where each device is assumed to belong to a specific user, or at least is authenticated as one).

I’ve done some performance testing before, but mostly on desktop applications, checking to see whether or not the app was fast and how much memory it consumed. Nothing formal, just using the performance analysis tools built into Visual Studio to identify slow areas and tune them.

Load testing a service hosted in Amazon is an entirely different beast, and requires different tools.

Enter JMeter.

Yay! Java!

JMeter is well and truly a Java app. You can smell it. I tried to not hold that against it though, and found it to be very functional and easy enough to understand once you get past that initial (steep) learning curve.

At a high level, there are solution-esque containers (Test Plans), containing one or more Thread Groups. Each Thread Group can model one or more users (threads) that run through a series of configured actions that make up the content of the test. I've barely scratched the surface of the application, but you can at least configure loops and counters, specify variables and, of course, most importantly, HTTP requests. You can weave all of these things (and more) together to create whatever load test you desire.

It turns out that's actually the hard bit: deciding what the test should, well, test. Ideally you want something that approximates average expected usage, so you need to do some legwork to work out what that could be. Then you take that expected usage and replicate it repeatedly, up to the number of concurrent users you want to evaluate.

This would probably be difficult and time consuming, but luckily JMeter has an extremely useful feature that helps somewhat. A recording proxy.

Configure the proxy on your endpoints (in my case, an installed application on a virtual machine and a mobile device) and start it, and all requests to the internet at either of those places will be helpfully recorded in JMeter, ready to be played back. Thus, rather than trying to manually concoct the set of actions of a typical user, you can just use the software as normal, approximate some real usage and then use those recorded requests to form the baseline of your load test.
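
For the installed application endpoint, pointing Windows at the recording proxy can be as simple as flipping the system proxy settings. A hedged example follows: the address and port (8888 is JMeter's default recording port) are assumptions, as is whether your particular application respects the WinINET settings at all. The mobile device is just the equivalent manual proxy setting in its WiFi configuration.

# Hypothetical example: point a Windows endpoint at the JMeter recording proxy.
# Assumes the proxy is listening on 10.0.0.5:8888 and that the application
# under test resolves its proxy via the WinINET (Internet Options) settings.
$settings = "HKCU:\Software\Microsoft\Windows\CurrentVersion\Internet Settings"
Set-ItemProperty -Path $settings -Name ProxyEnable -Value 1
Set-ItemProperty -Path $settings -Name ProxyServer -Value "10.0.0.5:8888"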

This is obviously useful when you have a mostly completed application and someone has said "now load test it" just before release. I don't actually recommend working that way; load and performance testing should be done throughout the development process, with performance metrics set early and measured often. Getting to the end only to discover that your software works great for 1 person but fails miserably for 10 is an extremely bad situation to be in. Not only that, but throwing all the load and performance testing infrastructure together at the end doesn't give it enough time to mature, leading to oversights and other nasty side effects. Test early and test often applies to performance as much as it does to functional correctness.

Helpfully, if you have a variable setup (say for the base URL) and you’re recording, any instances of the value in the variable will be replaced by a reference to the variable itself. This saves a huge amount of time going through the recorded requests and changing them to allow for configurability.

The variable substitution is a bit of a double edged sword as I found out though. I had a variable set at the global scope with a value of 10 (it was a loop counter or something) and JMeter dutifully replaced all instances of 10 with references to that variable. Needless to say, that recording session had to be thrown out, and I moved the variable to the appropriate Thread Group level scope shortly after.

Spitting Images

All was not puppies and roses though, as the recording proxy seemed to choke on image uploads. The service deals in images that are attached to other pieces of data, so naturally image uploads are part of its API.

Specifically, when the proxy wasn’t in place, everything worked fine. As soon as I configured the proxy on either endpoint, the images failed to upload. Looking at the request, all of the content that would normally be attached to an image upload was being stripped out. By the time the request got to the service, after having passed through the proxy, the request looked like there should be an image attached, but there was none available.

With the image uploads failing, I couldn’t record a complete picture of interactions with the service, so I had to make some stuff up.

It wasn’t until much later, when I started incorporating image uploads into the test plan manually that I discovered JMeter does not like file uploads as part of a PUT request. It’s fine with POST, but the multi-part support does not seem to work with PUT. I wasn’t necessarily in a position to change either of our applications to force them to use POST just so I could record a load test though, and I think PUT better encapsulates what we are doing (placing content at a known ID), so I just had to live with my artificially constructed image uploads that approximated usage.

Custom Rims

Now that I had a collection of requests defining some nice expected usage, I only had one more problem to solve. It's kind of specific to our API, but I'll mention it here because the general approach is probably applicable elsewhere.

Our API uses an Auth Header. Not particularly uncommon, but our Auth Header isn't as simple as a token obtained from sending the username/password to an auth endpoint, or anything similarly sane. Our Auth Header is an encrypted package containing a set of customer information, which is decrypted on the server side and then validated. It contains user identifying information plus a timestamp, which gives it a validity period.

It needs to be freshly calculated on almost every outgoing request.

My recorded requests stored the Auth Header that they were made with, but of course, the token can’t be reused as time goes on, because it will time out. So I needed to create a fresh Auth Header from the information I had available. Obviously, this being a very custom function (involving AES encryption and some encoding), JMeter can’t help out of the box.
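
To give a sense of the shape of the thing, here is a rough PowerShell sketch of that kind of header construction. Everything in it (field names, key handling, the encoding) is made up for illustration; the real payload and encryption details are different, and the real implementation ended up being a Java JMeter function, as described below.

# Hypothetical sketch only: build an encrypted, base64 encoded auth header
# from user identifying information and a timestamp.
function New-AuthHeader
{
    param
    (
        [string]$userId,
        [byte[]]$key,   # shared AES key, assumed to be distributed out of band
        [byte[]]$iv     # initialisation vector, also an assumption
    )

    # User identifying information plus a timestamp for the validity period.
    $payload = @{ UserId = $userId; Timestamp = [DateTime]::UtcNow.ToString("o") } | ConvertTo-Json

    $aes = [System.Security.Cryptography.Aes]::Create()
    $aes.Key = $key
    $aes.IV = $iv
    $bytes = [System.Text.Encoding]::UTF8.GetBytes($payload)
    $encrypted = $aes.CreateEncryptor().TransformFinalBlock($bytes, 0, $bytes.Length)

    # Encode the result so it is safe to put into an HTTP header.
    return [Convert]::ToBase64String($encrypted)
}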

So it was time to extend.

I was impressed with JMeter in this respect. It has a number of ways of entering custom scripts/code when the in-built functionality of JMeter doesn’t allow you to do what you need to do. Originally I was going to use a BeanShell script, but after doing some reading, I discovered that BeanShell can be slow when executed many times (like for example on every single request), so I went with a custom function written in Java.

It's been a long time since I've written Java. I wouldn't say it's bad, but I definitely prefer (and am far more experienced with) C#. Java generics are weird.

Anyway, once I managed to implement the interface that JMeter supplies for custom functions, it was a simple matter to compile it into a JAR file and include the location of the JAR in the search_path when starting JMeter. JMeter will automatically load all of the custom functions it finds, and you can freely call them just like you would a variable (using the ${} syntax, functions are typically named with __ to distinguish them from variables).

Redistributable Fun

All of the load testing goodness above is encapsulated in a Git repository, as is normal for sane development practices.

I like my repositories to stand alone when possible (with the notable exception of taking an external dependency on NuGet.org or a private NuGet feed for dependencies), so I wanted to make sure that someone could pull this repository, and just run the load tests with no issues. It should just work.

To that end, I’ve wrapped the running of JMeter with and without a GUI into 2 Powershell scripts, dealing with things like including the appropriate directory containing custom functions, setting memory usage and locating the JRE and JMeter itself.

In the interests of “it should just work”, I also included the Java 1.8 JRE and the latest version of JMeter in the repository, archived. They are dynamically extracted as necessary whenever the scripts that run JMeter are executed.
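
The non-GUI wrapper ends up being a fairly small amount of PowerShell. The sketch below assumes a particular repository layout (tools, tests, functions and results directories), archive names and heap sizes, none of which are mandated by JMeter.

# Sketch of the non-GUI JMeter wrapper. All paths, archive names and heap sizes
# are assumptions about the repository layout described above.
$root = Split-Path -Parent $MyInvocation.MyCommand.Path
$jre = "$root\tools\jre1.8"
$jmeter = "$root\tools\apache-jmeter"

# Extract the archived JRE and JMeter if they haven't been already.
if (-not (Test-Path $jre)) { Expand-Archive "$root\tools\jre1.8.zip" -DestinationPath $jre }
if (-not (Test-Path $jmeter)) { Expand-Archive "$root\tools\apache-jmeter.zip" -DestinationPath $jmeter }

# -n runs JMeter without the GUI, -t is the test plan, -l is the results file
# and the search_paths property points JMeter at the custom function JAR.
$arguments = @(
    "-Xms512m", "-Xmx2048m",
    "-jar", "$jmeter\bin\ApacheJMeter.jar",
    "-n",
    "-t", "$root\tests\load-test.jmx",
    "-l", "$root\results\results.jtl",
    "-Jsearch_paths=$root\functions"
)
& "$jre\bin\java.exe" $arguments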

In the past I’ve shied away from including binaries in my repositories, because it tends to bloat them and make them take longer to pull down. Typically I would use a NuGet package for a dependency, or if one does not exist, create one. I considered doing this for the Java and JMeter dependencies, but it wasn’t really worth the effort at this stage.

You can find a cut down version of the repository (with a very lightweight jmx file of little to no substance) on GitHub, for your perusal.

Summary

Once I got past the initial distaste of a Java UI, and then subsequently got past the fairly steep learning curve, I was impressed with what JMeter could accomplish with regards to load testing. It can take a bit of reading to really grok the underlying concepts, but once you do, you can use it to do almost anything. I don't believe I have the greatest understanding of the software, but I was definitely able to use it to build a good load test that I felt put our service under some serious strain.

Of course, now that I had a working load test, I would need a way to interpret and analyse the results. Also, you can only do so much load testing on a single machine before you hit some sort of physical limit, so additional machines need to get involved to really push the service to its breaking point.

Guess what the topic of my next couple of blog posts will be?


It's that time of the release cycle where you need to see if your system will actually stand up to people using it. Ideally this is the sort of thing you do as you are developing the solution (so the performance and load tests grow and evolve in parallel with it), but I didn't get that luxury this time.

Now that I'm doing something new, that means another series of blog posts about things that I didn't understand before and now partially understand! Mmmmmm, partial understanding, the goal of true champions.

At a high level, I want to break our new API. I want to find the point at which it just gives up, because there are too many users trying to do things, and I want to find out why it gives up. Then I want to fix that and do the whole process again, hopefully working my way up to how many users we are expecting + some wiggle room.

Over the next couple of weeks I’m going to be talking about JMeter (our load testing application of choice), Nxlog (a log shipping component for Windows), ELK (our log aggregator), AWS (not a huge amount) and some other things that I’ve done while working towards good, reproducible load tests.

Today's topic is something that has bugged me for a few weeks, ever since I implemented shipping logs from our API instances into our log aggregator.

Configurific

Our log shipper of choice is Nxlog, being that we primarily use Windows servers. The configuration for Nxlog, the part that says what logs to process and how and where to send them, is/was installed at initialization time for the AWS instances inside our auto scaling group. This seemed like a good idea at the time, and it got us a working solution fairly quickly, which is always helpful.

Once I started actually using the data in the log aggregator though, I realised very quickly that I needed to make fairly frequent changes to the configuration, to either add more things being logged (the Windows event log, for example) or to change fields or fix bugs. I needed to be able to deploy changes, and I didn't want to have to do that by manually copying configuration files directly onto the X number of machines in the group. I certainly didn't want to recreate the whole environment every time I wanted to change the log configuration either.

In essence, the configuration quickly became very similar to a piece of software. Something that is deployed, rather than just incorporated into the environment once when it is created. I needed to be able to version the configuration, easily deploy new versions and to track what versions were installed where.

Sounds like a job for Octopus Deploy.

Many Gentle Yet Strong Arms

I’ve spoken about Octopus Deploy a few times before on this blog, but I’ll just reiterate one thing.

It's goddamn amazing.

Honestly, I’m not entirely sure how I lived without it. I assume with manual deployment of software and poor configuration management.

We use Octopus in a fairly standard way. When the instances are created in our environment they automatically install an Octopus Tentacle, and are tagged with appropriate roles. We then have a series of projects in Octopus that are configured to deploy various things to machines in certain roles, segregated by environment. Nothing special, very standard Octopus stuff.

If I wanted to leverage Octopus to manage my log configuration, I would need to package up my configuration files (and any dependencies, like Nxlog itself) into NuGet packages. I would then have to configure at least one project to deploy the package to the appropriate machines.

Tasty Tasty Nougat

I've used NuGet a lot before, but I've never really had to manually create a package. Mostly I just publish .NET libraries, and NuGet packaging is well integrated into Visual Studio through the usage of .csproj files. You don't generally have to think about the structure of the resulting package much, as it's taken care of by whatever the output of the project is. Even when producing packages for Octopus, intended to be deployed instead of just referenced from other projects, Octopack takes care of all the heavy lifting, producing a self contained package with all of the dependencies inside.

This time, however, there was no project.

I needed to package up the installer for Nxlog, along with any configuration files needed. I would also need to include some instructions for how the files should be deployed.

Luckily I had already written some scripts for an automated deployment of Nxlog, which were executed as part of environment setup, so all I needed to do was package all of the elements together, and give myself a way to easily create and version the resulting package (so a build script).
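
The build script itself is nothing special. A minimal sketch might look like the following, where the file names, version scheme, feed URL and API key handling are all assumptions, and the nuspec being packed is the one described in the next paragraph.

# Minimal sketch of a build script that packages and versions the Nxlog
# deployment package. File names, version scheme and feed details are assumptions.
param
(
    [string]$version = "1.0.$env:BUILD_NUMBER",
    [switch]$publish
)

$nuget = ".\tools\nuget.exe"
& $nuget pack ".\nxlog.nuspec" -Version $version -OutputDirectory ".\output"

if ($publish)
{
    # Push to the NuGet feed that Octopus Deploy is configured to watch.
    & $nuget push ".\output\Nxlog.$version.nupkg" -Source "http://octopus.example.com/nuget/packages" -ApiKey $env:OCTOPUS_API_KEY
}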

I created a custom nuspec file to define the contents of the resulting package, and relied on the fact that Octopus runs appropriately named scripts at various times during the deployment lifecycle. For example, the Deploy.ps1 script is run during deployment. The Deploy script acts as an entry point and executes other scripts to silently install Nxlog, modify and copy the configuration file to the appropriate location and start the Nxlog service.
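
As a sketch, the Deploy.ps1 entry point does little more than glue the helper scripts together. The script names, function names and service name below are placeholders for the ones in the actual package.

# Sketch of the Deploy.ps1 entry point that Octopus runs during deployment.
# Helper script and function names are placeholders.
$here = Split-Path -Parent $MyInvocation.MyCommand.Path

. "$here\scripts\Install-Nxlog.ps1"               # silent msiexec install of the bundled Nxlog MSI
. "$here\scripts\Install-NxlogConfiguration.ps1"  # select, tweak and copy the configuration file

Install-Nxlog
Install-NxlogConfiguration

# Nxlog runs as a Windows service, so the last step is just to (re)start it.
Restart-Service -Name "nxlog"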

Alas, Octopus does not appear to have any concept of uninstallation, so there is no way to include a cleanup component in the package. This would be particularly useful when deploying a new version of the package that installs a new version of Nxlog, for example. Instead of having to include the uninstallation script inside the new package, you could deploy the original package with the cleanup code already encapsulated inside, ready to be run when uninstallation is called for (for example, just before installing the new package).

There was only one issue left to solve.

How do you select the appropriate configuration file based on where the package is being deployed? I assumed that there would be multiple configuration files, because different machines require different logging. This is definitely not a one-size-fits-all kind of thing.

Selection Bias

In the end, the way in which I solved the configuration selection problem was a bit hacky. It worked (hacky solutions usually do), but it doesn’t feel nice and clean. I generally shy away from solutions like that, but as this continuous log configuration deployment was done while I was working on load/performance testing, I had to get something in place that worked, and I didn’t have time to sand off all the rough edges.

Essentially each unique configuration file gets its own project in Octopus, even though they all reference the same NuGet package. This project has a single step, which just installs the package onto the machines with the specified roles. Since machines could fill multiple roles (even if they don’t in our environments) I couldn’t use the current role to select the appropriate configuration file.

What I ended up doing was using the name of the project inside the Deploy script. The project name starts with NXLOG_ and the last part of the name is the name of the configuration file to select. Some simple string operations very quickly get the appropriate configuration file, and the project can target any number of roles and environments as necessary.
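
The relevant part of the Deploy script is just a bit of string splitting over one of the Octopus system variables. A hedged example (the naming convention is as described above, the configurations directory is an assumption):

# Inside Deploy.ps1: work out which configuration file to use from the name of
# the Octopus project, e.g. a project called "NXLOG_API" selects "API.conf".
$projectName = $OctopusParameters["Octopus.Project.Name"]
$configurationName = $projectName.Split("_")[-1]
$configurationFile = ".\configurations\$configurationName.conf"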

It definitely isn't the greatest solution, but it gets the job done. There are some other issues that I can foresee as well, like the package being installed multiple times on the same machine through different projects, which definitely won't give the intended result. That's partially a factor of the way Nxlog works though, so I'm not entirely to blame.

Results

In the end I managed to get something together that accomplished my goal. With the help of the build script (which will probably be incorporated into a TeamCity Build Configuration at some stage), I can now build, version and optionally deploy a package that contains Nxlog along with automated instructions on how and when to install it, as well as configuration describing what it should do. I can now make log changes locally on my machine, commit them, and have them appear on any number of servers in our AWS infrastructure within moments.

I couldn’t really ask for anything more.

I’ve already deployed about 50 tweaks to the log configuration since I implemented this continuous deployment, as a result of needing more information during load and performance testing, so I definitely think it has been worthwhile.

I also like that the log configuration isn't something that is just hacked together on each machine. It's controlled in source, we have a history of all configurations that were previously deployed, and we can keep an eye on which package is active in which environments.

Treating the log configuration like I would treat an application was definitely the right choice.

I’ve knocked together a repository containing the source for the Nxlog package and all of the supporting elements (i.e. build scripts and tests). As always, I can’t promise that I will maintain it moving forward, as the actual repository I’m using is in our private BitBucket, but this should be enough for anyone interested in what I’ve written about above.

Next time: Load testing and JMeter. I write Java for the first time in like 10 years!


Last time I outlined the start of setting up an ELK based Log Aggregator via CloudFormation. I went through some of the problems I had with executing someone else's template (permissions! proxies!), and then called it there, because the post was already a mile long.

Now for the thrilling conclusion where I talk about errors that mean nothing to me, accepting defeat, rallying and finally shipping some real data.

You Have Failed Me Again Java

Once I managed to get all of the proxy issues sorted, and everything was being downloaded and installed properly, the instance was still not responding to HTTP requests over the appropriate ports. Well, it seemed like it wasn’t responding anyway.

Looking into the syslog, I saw repeated attempts to start Elasticsearch and Logstash, with an equal number of failures, where the process had terminated immediately and unexpectedly.

The main issue appeared to be an error about “Bad Page Map”, which of course makes no sense to me.

Looking it up, it appears as though there was an issue with the kernel version in the Ubuntu release I was using (3.13? It's really hard to tell which version number means what), and it was not actually specific to Java. I'm going to blame Java anyway. Apparently the issue is fixed in 3.15.

After swapping the AMI to the latest distro of Ubuntu, the exceptions no longer appeared inside syslog, but the damn thing still wasn’t working.

I could get to the appropriate pages through the load balancer, which would redirect me to Google to supply OAuth credentials, but after supplying appropriate credentials, nothing else ever loaded. No Kibana, no anything. This meant of course that ES was (somewhat) working, as the load balancer was passing its health checks, and Kibana was (somewhat) working because it was at least executing the OAuth bit.

Victory?

Do It Yourself

It was at this point that I decided to just take it back to basics and start from scratch. I realised that I didn't understand some of the components being installed (LogCabin for example, something to do with the OAuth implementation?), and that they were getting in the way of me accomplishing a minimum viable product. I stripped out all of those components from the UserData script, looked up the latest compatible versions of ES, Logstash and Kibana, installed them, and started them as services. I had to make some changes to the CloudFormation template as well (ES defaults to port 9200 and Kibana 4 to port 5601, so I had to expose the appropriate ports and make some minor changes to the health check; Logstash was fine).

The latest version of Kibana is more self contained than previous ones, which is nice. It comes with its own web server, so all you have to do is start it up and it will listen to and respond to requests via port 5601 (which can be changed). This is different to the version that I was originally working with (3?), which seemed to be hosted directly inside Elasticsearch? I'm still not sure what the original template was doing to be honest; all I know is that it didn't work.

Success!

A Kibana dashboard, load balancers working, ES responding. Finally everything was up and running. I still didn't fully understand it, but it was a hell of a lot more promising than it was before.

Now all I had to do was get some useful information into it.

Nxlog Sounds Like a Noise You Make When You Get Stabbed

There are a number of ways to get logs into ES via Logstash. Logstash itself can be installed on other machines and forward local logs to a remote Logstash, but it's kind of heavyweight for that sort of thing. Someone has written a smaller component called Logstash-Forwarder which does a similar thing. You can also write directly to ES using Logstash compatible index names if you want to (Serilog offers a sink that does just that).

The Logstash solutions above seem to assume that you are gathering logs on a Unix based system though, and don’t really offer much in the way of documentation or help if you have a Windows based system.

After a small amount of investigation, a piece of software called Nxlog appears to be the most commonly used log shipper as far as Windows is concerned.

As with everything in automation, I couldn't just go onto our API hosting instances and install and configure Nxlog by hand. I had to script it, and then add those scripts to the CloudFormation template for our environment setup.

Installing Nxlog from the command line is relatively simple using msiexec and the appropriate "don't show the damn UI" flags, and configuring it is simple as well. All you need to do is have an nxlog.conf file configured with what you need (in my case, IIS and application logs being forwarded to the Logstash endpoint) and then copy it to the appropriate conf folder in the installation directory.

The nxlog configuration file takes some getting used to, but the documentation is pretty good, so it's just a matter of working through it. The best tip I can give is to use a file output until you are sure that Nxlog is doing what you think it's doing, and then flip everything over to output to Logstash. You'll save a lot of frustration if you know exactly where the failures are (and believe me, there will be failures).
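
For illustration, the install-and-configure step boils down to something like the sketch below. The installer path, installation directory, log paths and module choices are all assumptions about my particular setup, and the configuration deliberately starts with a file output (as per the tip above) before being flipped over to Logstash.

# Silent install of Nxlog, no UI. The installer path is an assumption.
Start-Process "msiexec.exe" -ArgumentList "/i", ".\nxlog.msi", "/quiet", "/norestart" -Wait

# A deliberately minimal nxlog.conf sketch: read IIS logs and, to start with,
# write them to a local file so you can see exactly what Nxlog is producing.
# Once that looks right, swap the om_file output for an om_tcp output pointed
# at the Logstash endpoint.
$configuration = @"
define ROOT C:\Program Files (x86)\nxlog
Moduledir %ROOT%\modules
CacheDir %ROOT%\data
Pidfile %ROOT%\data\nxlog.pid
SpoolDir %ROOT%\data
LogFile %ROOT%\data\nxlog.log

<Input iis>
    Module im_file
    File "C:\\inetpub\\logs\\LogFiles\\W3SVC1\\u_ex*.log"
</Input>

<Output debug_file>
    Module om_file
    File "C:\\logs\\nxlog-debug.log"
</Output>

<Route iis_to_debug>
    Path iis => debug_file
</Route>
"@

Set-Content -Path "C:\Program Files (x86)\nxlog\conf\nxlog.conf" -Value $configuration
Restart-Service -Name "nxlog"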

After setting up Nxlog, it all started to happen! Stuff was appearing in Kibana! It was one of those magical moments where you actually get a result, and it felt good.

Types? We need Types?

I got everything working nicely in my test environment, so I saved my configuration, tore down the environments and created them again (to verify they could be recreated). Imagine my surprise when I was getting Nxlog internal messages into ES, but nothing from IIS. I assumed that I had messed up Nxlog somehow, so I spent a few hours trying to debug what was going wrong. My Nxlog config appeared to be fine, so I assumed that there was something wrong with the way I had configured Logstash. Again, it seemed to be fine.

It wasn't until I looked into the Elasticsearch logs that I found out why all of my IIS logs were not making it. The first document sent to Elasticsearch had a field called EventReceivedTime (from the Nxlog internal source) which was a date, represented as ticks since X, i.e. a gigantic number. ES had inferred the type of this field as a long. The IIS source also had a field called EventReceivedTime, which was an actual date (i.e. YYYY-MM-DD HH:mm). When any IIS entry arrived in ES from Logstash, ES errored out trying to parse the datetime into a long, and discarded it. Because of the asynchronous nature of the system, there was no way for Logstash to communicate the failure back to anywhere that I could see it.

After making sure that both EventReceivedTimes were dates, everything worked swimmingly.

I suppose this might happen again in the future, with a different field name conflicting. I'm not sure exactly what the best way to deal with it would be. Maybe the Elasticsearch logs should be put into ES as well? At least then I could track these failures. You could also set up a template to strongly type the fields, but due to the fluidity of ES there are always going to be new fields, and ES will always try to infer an appropriate type for them, so a template won't stop the problem from occurring entirely.
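
For what it's worth, pinning a specific field down with an index template is simple enough to do. This is a hedged sketch against the version of Elasticsearch I was using at the time; the host name, index pattern and exact mapping syntax are assumptions and vary between ES versions.

# Sketch: force EventReceivedTime to be mapped as a date across all logstash-*
# indices, so a stray numeric value fails loudly instead of poisoning the mapping.
$template = @{
    template = "logstash-*"
    mappings = @{
        _default_ = @{
            properties = @{
                EventReceivedTime = @{ type = "date" }
            }
        }
    }
} | ConvertTo-Json -Depth 10

Invoke-RestMethod -Method Put -Uri "http://elasticsearch.example.com:9200/_template/logstash" -Body $template -ContentType "application/json"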

Mmmmmmm Kibana

Look at this dashboard.

Look at it.

So nice.

I haven’t even begun to plumb the depths of the information now at my fingertips. Most of those charts are simple (average latency, total requests, response codes, requested paths), but it still provides a fantastic picture of the web applications in question.

Conclusion

Last time I wrote a CloudFormation template, I didn't manage to get it into a publicly available repository, which made the blog posts around it significantly less useful.

This time I thought ahead. You can find all of the scripts and templates for the Log Aggregator in this repository. This is a copy of our actual repository (private, in Bitbucket), so I’m not sure if I will be able to keep it up to date as we make changes, but at least there is some context to what I’ve been speaking about.

I've included the scripts that set up and configure Nxlog as well. These are actually located in the repository that contains our environment setup, but I think they are useful inside this repository as a point of reference for setting up log shipping on a Windows system. Some high level instructions are available in the readme of the repository.

Having a Log Aggregator, even though it only contains IIS and application logs for a single application, has already proved useful. It adds a huge amount of transparency to what is going on, and Kibana’s visualisations are extremely helpful in making sense of the data.

Now to do some load testing on the web application and see what Kibana looks like when everything breaks.


Logging is one of the most important components of a good piece of software.

That was a statement, not a question, and is not up for debate.

Logging enables you to do useful things, like identify usage patterns (helpful when deciding what sections of the application need the most attention), investigate failures (because there are always failures, and you need to be able to get to their root causes) and keep an eye on performance (which is a feature, no matter what anyone else tells you). Good logging enables a piece of software to be supported long after the software developers who wrote it have moved on, extending the life expectancy of the application and thus improving its return on investment.

It is a shame really, that logging is not always treated like a first class citizen. Often it is an afterthought, added in later after some issue or failure proves that it would have been useful, and then barely maintained from that point forward.

Making sure your application has excellent logging is only the first part though; you also need somewhere to put the logs so that the people who need them can access them.

The most common approach is to have logs be output to a file, somewhere on the local file system relative to the location where the software is installed and running. Obviously this is better than not having logs, but only just barely. When you have log files locally, you are stuck in a reactive mindset, using the logs as a diagnostic tool when a problem is either observed or reported through some other channel (like the user complaining).

The better approach is to send the logs somewhere. Somewhere they can be watched and analysed and alerted on. You can be proactive when you have logs in a central location, finding issues before the users even notice and fixing them even faster.

I’m going to refer to that centralised place where the logs go as a Log Aggregator, although I’m not sure if that is the common term. It will do for now though.

Bucket O’ Logs

At my current job, we recently did some work to codify the setup of our environments. Specifically, we used AWS CloudFormation and Powershell to setup an auto scaling group + load balancer (and supporting constructs) to be the home for a new API that we were working on.

When you have a single machine, you can usually make do with local logs, even if it's not the greatest of ideas (as mentioned above). When you have a variable number of machines, whose endpoints are constantly shifting and changing, you really need a central location where you can keep an eye on the log output.

Thus I've spent the last week and a bit working on exactly that: implementing a log aggregator.

After some initial investigation, we decided to go with an ELK stack. ELK stands for Elasticsearch, Logstash and Kibana, three components that each serve a different purpose. Elasticsearch is a document database with strong analysis and search capabilities. Logstash is an ETL (Extract, Transform, Load) system, used for moving logs around as well as transforming and mutating them into appropriate structures to be stored in Elasticsearch. Kibana is a front end visualisation and analysis tool that sits on top of Elasticsearch.

We went with ELK because a few other teams in the organization had already experimented with it, so there was at least a little organizational knowledge floating around to exploit. Alas, the other teams had not treated their ELK stacks as anything reusable, so we still had to start from scratch in order to get anything up and running.

We did look at a few other options (Splunk, Loggly, Seq) but it seemed like ELK was the best fit for our organisation and needs, so that was what we went with.

Unix?

As is my pattern, I didn’t just want to jam something together and call that our log aggregator, hacking away at a single instance or machine until it worked “enough”. I wanted to make sure that the entire process was codified and reproducible. I particularly liked the way in which we had done the environment setup using CloudFormation, so I decided that would be a good thing to aim for.

Luckily someone else had already had the same idea, so in the true spirit of software development, I stole their work to bootstrap my own.

Stole in the sense that they had published a public repository on GitHub with a CloudFormation template to set up an ELK stack inside it.

I cloned the repository, wrote a few scripts around executing the CloudFormation template and that was that. ELK stack up and running.

Right?

Right?

Ha! It’s never that easy.

Throughout the rest of this post, keep in mind that I haven't used a Unix based operating system in anger in a long time. The ELK stack used an Ubuntu distro, so I was at a disadvantage from the word go. On the upside, having been using cmder a lot recently, I was far more comfortable inside a command line environment than I have ever been before. Certainly more comfortable than I was back when I last used Unix.

Structural Integrity

The structure of the CloudFormation template was fairly straightforward. There were two load balancers, backed by an auto scaling group. One of the load balancers was public, intended to expose Kibana. The other was internal (i.e. only accessible from within the specified VPC) intended to expose Logstash. There were some Route53 entries to give everything nice names, and an Auto Scaling Group with a LaunchConfig to define the configuration of the instances themselves.

The auto scaling group defaulted to a single instance, which is what I went with. I’ll look into scaling later, when it actually becomes necessary and we have many applications using the aggregator.

As I said earlier, the template didn’t just work straight out of the repository, which was disappointing.

The first issue I ran into was that the template called for the creation of an IAM role. The credentials I was using to execute the template did not have permissions to do that, so I simply removed it until I could get the appropriate permissions from our AWS account managers. It turns out I didn’t really need it anyway, as the only thing I needed to access using AWS credentials was an S3 bucket (for dependency distribution) which I could configure credentials for inside the template, supplied as parameters.

Removing the IAM role allowed me to successfully execute the template, and it eventually reached that glorious state of "Create Complete". Yay!

It still didn’t work though. Booooo!

It's Always a Proxy

The initial template assumed that the instance would be accessible over port 8080. The public load balancer relied on that fact and its health check queried the __es path. The first sign that something was wrong was that the load balancer thought that the instance inside it was unhealthy, so it was failing its health check.

Unfortunately, the instance was not configured to signal failure back to CloudFormation if its setup failed, so although CloudFormation had successfully created all of its resources, when I looked into the cloud-init-output.log file in /var/log, it turned out that large swathes of the init script (configured in the UserData section of the LaunchConfig) had simply failed to execute.

The issue here was that we require all internet access from within our VPC to the outside world to go through a proxy. Obviously the instance was not configured to use the proxy (how could it be, it came from a public git repo), so all communications to the internet were being blocked, including calls to apt-get and the download of various configuration files directly from git.

Simple enough to fix: set the http_proxy and https_proxy environment variables to the appropriate values.

It was at this point that I also added a call to install the AWS CloudFormation components on the instance during initialisation, so that I could use cfn-signal to indicate failures. This at least gave me an indication of whether or not the instance had actually succeeded its initialization, without having to remote into the machine to look at the logs.

When working on CloudFormation templates, it's always useful to have some sort of repeatable test that you can run in order to execute the template, ideally from the command line. You don't want to have to go into the AWS Dashboard to do that sort of thing, and it's good to have some tests outside the template itself to check its external API. As I was already executing the template through Powershell, it was a simple matter to include a Pester test that executed the template, checked that the outputs worked (the outputs being the Kibana and Logstash URLs) and then tore the whole thing down if everything passed.
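
That test is really just a Pester Describe block wrapped around the execution scripts. A sketch, with the script names and output properties being placeholders for the ones in my repository:

# Sketch of a Pester test that stands up the environment, checks the outputs
# and tears everything down again. Script and property names are placeholders.
Describe "Log Aggregator CloudFormation template" {
    It "creates an environment whose Kibana endpoint responds" {
        $environment = & "$repositoryRoot\scripts\New-LogAggregatorEnvironment.ps1" -EnvironmentName "test"
        try
        {
            $response = Invoke-WebRequest -Uri $environment.KibanaUrl -UseBasicParsing
            $response.StatusCode | Should Be 200
        }
        finally
        {
            & "$repositoryRoot\scripts\Remove-LogAggregatorEnvironment.ps1" -EnvironmentName "test"
        }
    }
}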

At this point I also tried to setup some CloudWatch logs that would automatically extract the contents of the various initialization log files to a common location, so that I could view them from the AWS Dashboard when things were not going as expected. I did not, in the end, manage to get this working. The irony of needing a log aggregator to successfully setup a log aggregator was not lost on me.

Setting the environment variables fixed the majority of the proxy issues, but there was one other proxy related problem left that I didn’t pick up until much later. All of the Elasticsearch plugins were failing to install, for exactly the same reason. No proxy settings. Apparently Java does not read the system proxy settings (bad Java!) so I had to manually supply the proxy address and port to the call to the Elasticsearch plugin installation script.

The initialisation log now showed no errors, and everything appeared to be installed correctly.

But it still wasn’t working.

To Be Continued

Tune in next week for the thrilling conclusion, where I discover a bug caused by the specific combination of Ubuntu version and Java version, get fed up with the components being installed and start from scratch and then struggle with Nxlog in order to get some useful information into the stack.


It's been a while since I posted the first part of this blog post, but now it's time for the thrilling conclusion! Note: the conclusion may not actually be thrilling.

Last time I gave a general outline of the problem we were trying to solve (automatic deployment of an API, into a controlled environment), went through our build process (TeamCity) and quickly ran through how we were using Amazon CloudFormation to setup environments.

This time I will be going over some additional pieces of the environment setup (including distributing dependencies and using Powershell Desired State Configuration for machine setup) and how we are using Octopus Deploy for deployment.

Like the last blog post, this one will be mostly explanation and high level descriptions, as opposed to a copy of the template itself and its dependencies (Powershell scripts, config files, etc). In other words, there won’t be a lot of code, just words.

Distribution Network

The environment setup has a number of dependencies, as you would expect, like scripts, applications (for example, something to zip and unzip archives), configuration files, etc. These dependencies are needed in a number of places. One of those places is wherever the script to create the environment is executed (a development machine or maybe even a CI machine) and another place is the actual AWS instances themselves, way up in the cloud, as they are being spun up and configured.

The most robust way to deal with this, to ensure that the correct versions of the dependencies are being used, is to deploy the dependencies as part of the execution of the script. This way you ensure that you are actually executing what you think you’re executing, and can make local changes and have those changes be used immediately, without having to upload or update scripts stored in some web accessible location (like S3 or an FTP site or something). I’ve seen approaches where dependencies are uploaded to some location once and then manually updated, but I think that approach is risky from a dependency management point of view, so I wanted to improve on it.

I did this in a similar way to what I did when automating the execution of our functional tests. In summary, all of the needed dependencies are collected and compressed during the environment creation script, and uploaded to a temporary location in S3, ready to be downloaded during the execution of the Cloud Formation template within Amazon’s infrastructure.
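
The mechanics of that are pleasantly boring. A sketch, assuming the AWS PowerShell tools are available, with made up bucket, key and directory names:

# Sketch: bundle the local dependencies and push them to a temporary S3 location
# for the CloudFormation template to pull down during instance initialization.
Import-Module AWSPowerShell

$archive = ".\build\dependencies.zip"
Compress-Archive -Path ".\scripts\*", ".\tools\*" -DestinationPath $archive -Force

# A unique key per execution, so concurrent environment creations don't collide.
$key = "environment-dependencies/$([Guid]::NewGuid())/dependencies.zip"
Write-S3Object -BucketName "my-environment-dependencies" -Key $key -File $archive

# The S3 path is then handed to the CloudFormation template as a parameter,
# along with the key and secret of a user that is allowed to read the bucket.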

The biggest issue I had with distributing the dependencies via S3 was getting the newly created EC2 instances specified in the CloudFormation template access to the S3 bucket where the dependencies were. I first tried to use IAM roles (which I don’t really understand), but I didn’t have much luck, probably as a result of inexperience. In the end I went with supplying 3 pieces of information as parameters in the CloudFormation template. The S3 path to the dependencies archive, and a key and secret for a pre-configured AWS user that had guaranteed access to the bucket (via a Bucket Policy).

Within the template, inside the LaunchConfiguration (which defines the machines to be spun up inside an Auto Scaling Group) there is a section for supplying credentials to be used when accessing files stored in an S3 bucket, and the parameters are used there.

"LaunchConfig" : {
    "Type" : "AWS::AutoScaling::LaunchConfiguration",
    "Metadata" : {
        "AWS::CloudFormation::Authentication" : {
            "S3AccessCreds" : {
                "type" : "S3",
                "accessKeyId" : { "Ref" : "S3AccessKey" },
                "secretKey" : { "Ref": "S3SecretKey" },
                "buckets" : [ { "Ref":"S3BucketName" } ]
            }
        }
    }
}

I’m not a huge fan of the approach I had to take in the end, as I feel the IAM roles are a better way to go about it, I’m just not experienced enough to know how to implement them. Maybe next time.

My Wanton Desires

I'll be honest, I don't have a lot of experience with Powershell Desired State Configuration (DSC from here on). What I have observed so far is that it is very useful, as it allows you to take a barebones Windows machine and specify (using a script, so it's repeatable) what components you would like to be installed and how you want them to be configured. Things like the .NET Framework, IIS and even third party components like Octopus Tentacles.

When working with virtualisation in AWS, this allows you to skip the part where you have to configure an AMI of your very own, and instead use one of the pre-built and maintained Amazon AMIs. This allows you to easily update to the latest, patched version of the OS whenever you want, because you can always just run the same DSC script on the new machine to get it into the state you want. You can even switch up the OS without too much trouble, flipping to a newer, greater version or maybe even dropping back to something older.

Even though I don’t have anything to add about DSC, I thought I’d mention it here as a record of the method we are using to configure the Windows instances in AWS during environment setup. Essentially what happens is that AWS gives you the ability to execute arbitrary scripts during the creation of a new EC2 instance. We use a barebones Windows Server 2012 AMI, so during setup it executes a series of scripts (via CloudFormation) one of which is the Powershell DSC script that installs IIS, .NET and an Octopus Tentacle.

I didn't write the DSC script that we are using; I just copied it from someone else in the organisation. It does what I need it to do though, so I haven't spent any time trying to really understand it. I'm sure I'll have to dig into it in more depth at some future point, and when I do, I'll make sure to write about it.
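
For reference, the general shape of such a DSC configuration is something like the minimal sketch below. It only covers the Windows features; the Octopus Tentacle part of the real script is more involved and omitted here.

# Minimal sketch of a DSC configuration that installs IIS and ASP.NET 4.5.
# The real script also installs and registers an Octopus Tentacle.
Configuration ApiServer
{
    Import-DscResource -ModuleName PSDesiredStateConfiguration

    Node "localhost"
    {
        WindowsFeature IIS
        {
            Name   = "Web-Server"
            Ensure = "Present"
        }

        WindowsFeature AspNet45
        {
            Name   = "Web-Asp-Net45"
            Ensure = "Present"
        }
    }
}

# Compile the configuration into a MOF and apply it to the local machine.
ApiServer -OutputPath ".\ApiServer"
Start-DscConfiguration -Path ".\ApiServer" -Wait -Verbose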

Many Arms Make Light Work

With the environment itself sorted (CloudFormation + dependencies dynamically uploaded to S3 + Powershell DSC for configuration + some custom scripts), now it's time to talk about deployment/release management.

Octopus Deploy is all about deployment/release management, and is pretty amazing.

I hadn’t actually used Octopus before this, but I had heard of it. I saw Paul speak at DDD Brisbane 2014, and had a quick chat with him after, so I knew what the product was for and approximately how it worked, but I didn’t realise how good it actually was until I started using it myself.

I think the thing that pleases me most about Octopus is that it treats automation and programmability as a first class citizen, rather than an afterthought. You never have to look too far or dig too deep to figure out how you are going to incorporate your deployment in an automated fashion. Octopus supplies a lot of different programmability options, from an executable to a set of .NET classes to a REST API, all of which let you do anything that you could do through the Octopus website.

It's great, and I wish more products were implemented in the same way.

Our usage of Octopus is very straightforward.

During environment setup Octopus Tentacles are installed on the machines with the appropriate configuration (i.e. in the appropriate role, belonging to the appropriate environment). These tentacles provide the means by which Octopus deploys releases.

One of the outputs from our build process is a NuGet Package containing everything necessary to run the API as a web service. We use Octopack for this (another awesome little tool supplied by Octopus Deploy), which is a simple NuGet Package that you add to your project, and then execute using an extra MSBuild flag at build time. Octopack takes care of expanding dependencies, putting everything in the right directory structure and including any pre and post deploy scripts that you have specified in the appropriate place for execution during deployment.
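
Triggering Octopack is just a couple of MSBuild properties at build time. The solution name and version number below are illustrative:

# Build the solution and have Octopack produce a deployable, versioned NuGet
# package as part of the build. Solution name and version are illustrative.
& msbuild .\Api.sln /t:Build /p:Configuration=Release /p:RunOctoPack=true /p:OctoPackPackageVersion=1.2.3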

The package containing our application (versioned appropriately) is uploaded to a NuGet feed, and then we have a project in Octopus Deploy that contains some logic for how to deploy it (stock standard stuff relating to IIS setup). At the end of a CI build (via TeamCity) we create a new release in Octopus Deploy for the version that was just built and then automatically publish it to our CI environment.

This automatic deployment during build is done via a Powershell script.

if ($isCI)
{
    Get-ChildItem -Path ($buildDirectory.FullName) | NuGet-Publish -ApiKey $octopusServerApiKey -FeedUrl "$octopusServerUrl/nuget/packages"

    . "$repositoryRootPath\scripts\common\Functions-OctopusDeploy.ps1"
    $octopusProject = "[OCTOPUS PROJECT NAME]"
    New-OctopusRelease -ProjectName $octopusProject -OctopusServerUrl $octopusServerUrl -OctopusApiKey $octopusServerApiKey -Version $versionChangeResult.New -ReleaseNotes "[SCRIPT] Automatic Release created as part of Build."
    New-OctopusDeployment -ProjectName $octopusProject -Environment "CI" -OctopusServerUrl $octopusServerUrl -OctopusApiKey $octopusServerApiKey
}

Inside the snippet above, $versionChangeResult is a local variable that defines the version information for the build, with the New property being the new version that was just generated. The NuGet-Publish function is a simple wrapper around NuGet.exe, and we've abstracted the usage of Octo.exe into a set of functions (New-OctopusRelease, New-OctopusDeployment).

function New-OctopusRelease
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusServerUrl,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$octopusApiKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$projectName,
        [string]$releaseNotes,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$version
    )

    if ($repositoryRoot -eq $null) { throw "repositoryRoot script scoped variable not set. Thats bad, its used to find dependencies." }

    $octoExecutable = Get-OctopusToolsExecutable
    $octoExecutablePath = $octoExecutable.FullName

    $command = "create-release"
    $arguments = @()
    $arguments += $command
    $arguments += "--project"
    $arguments += $projectName
    $arguments += "--server"
    $arguments += $octopusServerUrl
    $arguments += "--apiKey" 
    $arguments += $octopusApiKey
    if (![String]::IsNullOrEmpty($releaseNotes))
    {
        $arguments += "--releasenotes"
        $arguments += "`"$releaseNotes`""
    }
    if (![String]::IsNullOrEmpty($version))
    {
        $arguments += "--version"
        $arguments += $version
        $arguments += "--packageversion"
        $arguments += $version
    }

    (& "$octoExecutablePath" $arguments) | Write-Verbose
    $octoReturn = $LASTEXITCODE
    if ($octoReturn -ne 0)
    {
        throw "$command failed. Exit code [$octoReturn]."
    }
}

The only tricksy thing in the script above is the Get-OctopusToolsExecutable function. All this function does is ensure that the executable exists. It looks inside a known location (relative to the global $repositoryRoot variable) and if it can’t find the executable it will download the appropriate NuGet package.

The rest of the Octopus functions look very similar.

The End

I would love to be able to point you towards a Github repository containing the sum total of all that I have written about with regards to environment setup, deployment, publishing, etc, but I can't. The first reason is that it's very much tied to our specific requirements, and stripping out the sensitive pieces would take me too long. The second reason is that everything is not quite encapsulated in a single repository, which disappoints me greatly. There are some pieces in TeamCity and some in Octopus Deploy, with the rest in the repository. Additionally, the process is dependent on a few external components, like Octopus Deploy. This makes me uncomfortable actually, as I like everything related to an application to be self contained, ideally within the one repository, including build, deployment, environment setup, etc.

Ignoring the fact that I can’t share the detail of the environment setup (sorry!), I consider this whole adventure to be a massive success.

This is the first time that I've felt comfortable with the amount of control that has gone into the setup of an environment. It's completely repeatable, and even has some allowance for creating special, one-off environments for specific purposes. Obviously there would be a number of environments that are long lived (CI, Staging and Production at least), but the ability to create a temporary environment in a completely automated fashion, maybe to do load testing or something similar, is huge for me.

Mostly I just like the fact that the entire environment setup is codified, written into scripts and configuration files that can then be versioned and controlled via some sort of source control (we use Git, you should too). This gives a lot of traceability to the whole process, especially when something breaks, and you never have to struggle with trying to understand exactly what has gone into the environment. It's written right there.

For me personally, a lot of the environment setup was completely new, including CloudFormation, Powershell DSC and Octopus Deploy, so this was a fantastic learning experience.

I assume that I will continue to gravitate towards this sort of DevOps work in the future, and I'm not sad about that at all. It's great fun, if occasionally frustrating.