
It makes me highly uncomfortable if someone suggests that I support a piece of software without an automated build process.

In my opinion it's one of the cornerstones on which software delivery is built. It makes everything that comes afterwards easier and enables you to think and plan at a much higher level, freeing you up to worry about much more complicated topics.

Like continuous delivery.

But let's shy away from that for a moment, because sometimes you have to deal with the basics before you can do anything else.

In my current position I’m responsible for the engineering behind all of the legacy products in our organization. At this point in time those products make almost all of the money (yay!), but contain 90%+ of the technical terror (boo!) so the most important thing from my point of view is to ensure that we can at least deliver them reliably and regularly.

Now, some of the products are in a good place regarding delivery.

Some, however, are not.

Someone's Always Playing Continuous Integration Games

One specific product comes to mind. Now, that’s not to say that the other products are perfect (far from it in fact), but this product in particular is lacking some of the most fundamental elements of good software delivery, and it makes me uneasy.

In fairness, the product is still quite successful (multiple millions of dollars of revenue), but from an engineering point of view, that’s only because of the heroic efforts of the individuals involved.

With no build process, you suffer from the following shortcomings:

  • No versioning (or maybe ad-hoc versioning if you’re lucky). This makes it hard to reason about which version of the software a customer actually has, and can make support a nightmare. Especially true when you’re dealing with desktop software.
  • Mysterious or arcane build procedures. If no-one has ever taken the time to recreate the build environment (assuming there is one), then it probably has all sorts of crazy dependencies. This also has the side effect of making it really hard to get a new developer involved.
  • No automated tests. With no build process to run them, any tests you do have are probably not being run regularly. That’s assuming you have tests at all, because with nothing running them, people probably aren’t writing them either.
  • A poor or completely ad-hoc distribution mechanism. Without a build process to form its foundation, whatever distribution process does exist is likely to be manual and hard to follow.

But there is no point in dwelling on what we don’t have.

Instead, let's do something about it.

Who Cares They’re Always Changing Continuous Integration Names

The first step is a build script.

Now, as I’ve mentioned before on this blog, I’m a big fan of including the build script in the repository, so that anyone with the appropriate dependencies can just clone the repo and run the script to get a deliverable. Release candidates will be built on some sort of controlled build server obviously, but I’ve found it's important to be able to execute the same logic both locally and remotely in order to react to unexpected issues.

Of course, the best number of dependencies outside of the repository is zero, but sometimes that’s not possible. Aim to minimise them at least, either by isolating them and including them directly, or by providing some form of automated bootstrapping.
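As a purely illustrative example of the bootstrapping idea (the tool and placeholder URL here are hypothetical, not something this product actually does), the top of a build script might fetch anything it can't find locally:

# hypothetical bootstrap step: fetch an isolated copy of a tool the build needs
if [ ! -x "tools/7zip/7z" ]; then
    echo "7-Zip not found locally, downloading a known-good copy..."
    mkdir -p tools/7zip
    curl -L -o tools/7zip.zip {url-to-known-good-copy}
    unzip -o tools/7zip.zip -d tools/7zip
fi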

This particular product is built in a mixture of Delphi (specifically Delphi 7) and .NET, so it wasn’t actually all that difficult to use our existing build framework (a horrific aberration built in Powershell) to get something up and running fairly quickly.

The hardest part was figuring out how to get the Delphi compiler to work from the command line, while still producing the same output as it would if you just followed the current build process (i.e. compilation from within the IDE).
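For anyone facing the same thing, Delphi 7 ships with a command line compiler (dcc32.exe), so the invocation ends up looking something like the line below. Treat it as an illustration only though, because the switches that matter (unit search paths, output directories, conditional defines) depend entirely on what the IDE project was doing, which is exactly the part that took the time to untangle.

dcc32.exe -B -Q -U{unit-search-paths} -E{output-directory} {project}.dpr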

With the compilation out of the way, the second hardest part was creating an artifact that looked and acted like the artifact that was being manually created. This comes in the form of a self-extracting zip file containing an assortment of libraries and executables that make up the “update” package.

Having dealt with both of those challenges, it's nothing but smooth sailing.

We Just Want to Dance Here, But We Need An AMI

Ha ha ha ha no.

This being a piece of legacy software, the second step was to create a build environment that could be used from TeamCity.

This means an AMI with everything required in order to execute the build script.

For Delphi 7, that means an old version of the Delphi IDE and build tools. Good thing we still had the CD that the installer came on, so we just made an ISO and remotely mounted it in order to install the required software.

Then came the multitude of library and tool dependencies specific to this particular piece of software. Luckily, someone had actually documented enough instructions on how to set up a development environment, so we used that information to complete the configuration of the machine.

A few minor hiccups later and we had a build artifact coming out of TeamCity for this product for the very first time.

A considerable victory.

But it wasn’t versioned yet.

They Call Us Irresponsible, The Versioning Is A Lie

This next step is actually still under construction, but the plan is to use the TeamCity build number input and some static version configuration stored inside the repository to create a SemVer styled version for each build that passes through TeamCity.

Any build not passing through TeamCity, or built from a branch, should be tagged with an appropriate pre-release string (i.e. 1.0.0-[something]), allowing us to distinguish good release candidates (off master via TeamCity) from dangerous builds that should never be released to a real customer.

The abomination of a Powershell build framework allows for most of this sort of stuff, but assumes that a .NET style AssemblyInfo.cs file will exist somewhere in the source.

At the end of the day, we decided to just include such a file for ease of use, and then propagate the version generated by the script into the Delphi executables through means that I am currently unfamiliar with.
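The plumbing will live in the PowerShell build framework, but as a rough sketch of the scheme itself (the file name and variables here are illustrative, not what we actually use):

# combine static version configuration from the repository with the TeamCity build number
MAJOR_MINOR=$(cat ./build/version.txt)       # e.g. "2.3", stored in source control
PATCH=${BUILD_NUMBER:-0}                     # TeamCity supplies BUILD_NUMBER; defaults to 0 locally
VERSION="$MAJOR_MINOR.$PATCH"
# anything not built off master via TeamCity gets a pre-release tag
if [ -z "$BUILD_NUMBER" ] || [ "$BRANCH" != "master" ]; then
    VERSION="$VERSION-unofficial.$(date +%Y%m%d%H%M%S)"
fi
echo "Building version $VERSION"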

Finally, all builds automatically tag the appropriate commit in Git, but that’s pretty much built into TeamCity anyway, so barely worth mentioning.

Conclusion

Like I said at the start of the post, if you don’t have an automated build process, you’re definitely doing it wrong.

I managed to summarise the whole “let's construct a build process” journey into a single, fairly lightweight blog post, but a significant amount of work went into it over the course of a few months. I was only superficially involved (as is mostly the case these days), so I have to give all of the credit to my colleagues.

The victory that this build process represents cannot be overstated though, as it will form a solid foundation for everything to come.

A small step in the greater scheme of things, but I’m sure everyone knows the quote at this point.


Continuous delivery of your software is something you should aspire to, even if you may never quite reach that lofty goal of “code gets committed, changes available in production”.

There is just so much business value to be had by making sure that your improvements, features and bug fixes are always in the hands of your users, rather than spending dead time waiting for the stars to align and that mysterious release window to open up.

Of course, it can all take an incredible amount of engineering effort, so as with everything in software, you probably need to think about what exactly you are trying to accomplish and how much you’re willing to pay for it.

Along the way, you’ll run across a wide variety of situations that make constant deployments challenging, and that’s where this post gets relevant. One of my teams has recently been made responsible for an API whose core function is to execute tasks that might take up to an hour to run (spoiler alert: it's data migration), which means that being able to arbitrarily deploy our changes whenever we want is just not a capability we have right now.

In fact, the entire deployment process for this particular API is a bit of a special flower, differing in a number of ways from the rest of the organization.

And special flowers are not to be tolerated.

It's A Bit Of A Marathon

I’ve written about this a few times already, but our organization (like most) has an old, highly profitable piece of legacy desktop software. Its future is…limited, to say the least, so we’re engaged in a long term project to build a replacement that offers similar (but better) features using a SaaS (Software as a Service) model.

Ideally, we want to make the transition from old and busted to new hotness as easy as possible for every single one of our users, so there is a huge amount of value to be gained by investing in a reliable and painless migration process.

We’re definitely not there yet, but it's getting closer with every deployment that we make.

Architecturally, we have a specialist migration tool, backed by an API, all of which is completely separate from the main user interface to the system. At this point in time, migrations are executed by an internal team, but the dream is that the user will be able to do it themselves.

The API is basically a fancy ETL, in that it gets data from some source (extract), transforms it into a format that works for the cloud product (transform) and then injects everything as appropriate via the cloud APIs (load). It's written on the JVM (specifically in Kotlin) and leverages Spring for its API-ness, and Spring Batch for job scheduling and management. Deployment-wise, the API is encapsulated in a Docker image (the output of our build process) and when it's time to ship a new version, the existing Docker containers are simply replaced with new ones in a controlled fashion.

More importantly to the blog post at hand, each migration is a relatively long running task that executes a series of steps in sequence in order to get customer data from legacy system A into new shiny cloud system B.

Combine “long running uninterruptable task” with “container replacement” and you get in-flight migrations being terminated every time a deployment occurs, which in turn leads to the fear, and we all know where fear leads.

A manual deployment process, gated by another manual “hey, are you guys running any migrations” process.

Waiting At The Finish Line

To allow for arbitrary deployments, one of the simplest solutions is to have the deployment process simply wait for any in-flight migrations to complete before it does anything destructive.

Automation-wise, the approach is pretty straightforward:

  • Implement a new endpoint on the API that returns whether the API instance can safely be shut down, based on in-memory information about migrations in flight
  • Change the deployment process to use this endpoint to decide whether or not the container is ready to be replaced

With the way we’re using Spring Batch, each “migration” job runs from start to finish on a single API instance, so it's simple enough to increment an in-memory count whenever a job starts and decrement it when the job finishes (or fails).

The deployment process then just waits for each container to state whether or not it's allowed to shut down before tearing anything down. Each individual container needs to be asked directly (rather than the API through the load balancer), because each one only knows about its own in-flight migrations.
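A rough sketch of what the deployment side of that looks like (the endpoint name, port and variables are illustrative, not the real ones); it's little more than a polling loop over the individual containers:

# ask each old container directly whether it is safe to shut down
for ip in $OLD_CONTAINER_IPS; do
    until [ "$(curl -s "http://$ip:8080/admin/can-shutdown")" = "true" ]; do
        echo "Container at $ip still has migrations in flight, waiting..."
        sleep 60
    done
done
echo "All containers report no migrations in flight, continuing with the replacement"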

This approach has the unfortunate side effect of making it hard to reason about how long a deployment might actually take, as a migration in flight could have anywhere from a few seconds to a few hours of runtime left, and the deployment cannot continue until all migrations have finished. Additionally, if another migration is started while the process is waiting for the in-flight migrations to complete, it might never get the chance to continue, which is troublesome.

That is, unless you put the API into “maintenance mode” or something, where it's not allowed to start new migrations. That’s downtime though, which isn’t continuous delivery.

Running More Than One Race

A slight tweak to the first solution is to allow for some parallel execution between the old and new containers:

  • Spin up as many new version API containers as necessary
  • Take all of the old containers out of the load balancer (or equivalent) so no new migrations can be started on them
  • Leave the old ones around, but only until they finish up their migrations, and then terminate

This allows for continuous operation of the service (which is in line with the original goal, of the user not knowing that anything is going on behind the scenes), but can lead to complications.

The main one is what happens if the new API version contains any database changes. Those might make the database incompatible with the old version, which would cause everything to explode. Obviously, there is value in making sure that changes are at least one version backwards compatible, but that can be hard to enforce automatically, and it can be dangerous to just leave it up to people to remember.

The other complication is that this approach assumes that the new containers can answer requests about the jobs running on the old containers (i.e. status), which is probably true if everything is behind a load balancer anyway, but it's still something to be aware of.

So again, not an ideal solution, but at least it maintains availability while doing its thing.

Or We Could Just Do Sprints

If you really want to offer continuous delivery with something that does long running background tasks, the answer is to not have long running background tasks.

Well, to be fair, the entire operation can be long running, but it needs to be broken down into smaller atomic elements.

With smaller constituent elements, you can use a very similar process to the solutions above:

  • Spin up a bunch of new containers, have them automatically pick up new tasks as they become available
  • Stop traffic going directly to the old containers
  • Mark old containers as “to be shutdown” so they don’t grab new tasks
  • Wait for each old container to be “finished” and murder it

You get a much tighter deployment cycle, and you also get the nice side effect of potentially allowing for parallelisation of tasks across multiple containers (if there are tasks that will allow it obviously).

Conclusion

For our API, we went with the first option (wait for long running tasks to finish, then shutdown), mostly because it was the simplest, and with the assumption that we’ll probably revisit it in time. It's still a vast improvement over the manual “only do deployments when the system is not in use” approach, so I consider it a win.

More generally, and to echo my opening statement, the idea of “continuous delivery” is something that should be aimed for at the very least, even if you might not make it all the way. When you’re free to deploy any time that you want, you gain a lot of flexibility in the way that you are able to react to things, especially bug fixes and the like.

Also, each deployment is likely to be smaller, as you don’t have to wait for an “acceptable window” and bundle up a bunch of stuff together when that window arrives. This means that if you do get a failure (and you probably will) it's much easier to reason about what might have gone wrong.

Mostly I’m just really tired of only being able to deploy on Sundays though, so anything that stops that practice is amazing from my point of view.


I’ve written a lot of words on this blog about the data synchronization algorithm. Probably too many to be honest, but it's an interesting technical topic for me, so the words come easily.

Not much has changed since the optimization to the differencing check that stopped it wastefully scanning the entire table; it's just been quietly chugging along, happily grabbing data whenever clients opt in and generally being useful.

As we accumulate more data though, a flaw in the system is becoming more obvious.

Duplicates.

Once Uploaded, Data Lives Forever

I like to think that the data synchronization algorithm is really good at what it does.

Given a connection to a legacy database and some identifying information (i.e. the identity of the client), it will make sure that a copy of the data in that database exists remotely, and then, as the underlying database changes, ensure those changes are also present.

The actual sync algorithm is provided to clients in the form of a plugin for a component that enables services for their entire office, like server side automated backups and integrations with (other) cloud services. All of this hangs on a centralised store of registered databases, which the client is responsible for maintaining. The whole underlying system was built before my time, and while it's a little rough around the edges, it's pretty good.

Unfortunately, it does have one major flaw.

When a client registers a database (generally done by supplying a connection string), that database is given a unique identifier.

If the client registers the same physical database again (maybe they moved servers, maybe they lost their settings due to a bug, maybe support personnel think that re-registering databases is like doing a computer reboot), they get a new database identifier.

For the sync process, this means that all of the data gets uploaded again, appearing as another (separate) database belonging to the same client. In most cases the old registration continues to exist, but it probably stops being updated. Sometimes the client ends up actively uploading data from the same database more than once, but that kind of thing is pretty rare.

For most use cases this sort of thing is just annoying, as the client will select the right database whenever they interact with whatever system is pulling data from the cloud (and we generally try to hide databases that look like they would have no value to the customer, like ones that haven’t been actively updated in the last 24 hours).

For business intelligence though, it means a huge amount of useless duplicate data, which has all sorts of negative effects on the generated metrics.

Well, Until We Delete It And Take Its Power

From an engineering point of view, we should fix the root flaw, ensuring that the same physical database is identified correctly whenever it participates in the system.

As always, reality tends to get in the way, and unpicking that particular beast is not a simple task. It's not off the table completely, it's just less palatable than it could be.

Even if the flaw is fixed though, the duplicate data that already exists is not going to magically up and disappear out of a respect for our engineering prowess. We’re going to have to deal with it anyway, so we might as well start there.

Algorithmically, a data set (customer_id-database_id pair) can be considered a duplicate of another data set if and only if:

  • The customer_id matches (we ignore duplicates across clients, for now anyway)
  • The data set contains at least 25 GUID identifiers that also appear in the other data set (each entity in the database generally has both a numerical and GUID identifier, so we just use the most common entity)

Nothing particularly complicated or fancy.

For the automated process itself, there are a few things worth considering:

  • It needs to communicate clearly what it did and why
  • There is value in separating the analysis of the data sets from the actions that were performed (and their results)
  • We’ll be using TeamCity for actually scheduling and running the process, so we can store a full history of what the process has done over time
  • To minimise risk, it's useful to be able to tell the process “identify all the duplicates, but only delete the first X”, just in case it tries to delete hundreds and causes terrible performance problems

Taking all of the above into account, we created a simple C# command line application that could be run like this:

Cleanup.exe -c {connection-string} -a {path-to-analysis-file} -d {path-to-deletion-results-file} -limitDeletionCount {number-to-delete}

Like everything we do, it gets built, tested, packaged (versioned), and uploaded to our private Nuget feed. For execution, there is a daily task in TeamCity to download the latest package and run it against our production database.
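The daily task itself boils down to a couple of commands along these lines (the package id, feed URL and paths are illustrative):

nuget install {package-id} -Source {private-nuget-feed-url} -ExcludeVersion -OutputDirectory packages
packages\{package-id}\Cleanup.exe -c {connection-string} -a {path-to-analysis-file} -d {path-to-deletion-results-file} -limitDeletionCount {number-to-delete}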

It's Like A Less Impressive Quickening

The last thing to do is make sure that we don’t ever delete any data that might still have value, to either us or the client.

As I mentioned above, the main reason that duplicates happen is when a client re-registers the same database for some reason. Upon re-registration, the “new” database will begin its data synchronization from scratch.

During the period of time where data is still uploading for the “new” database, but all the “old” data is still hanging around, how can we reasonably say which data set is the duplicate and should be deleted?

If we go off number of records, we’d almost certainly delete the “new” database mistakenly, which would just start it uploading again from scratch, and we’d get into an infinite stupidity loop.

We need some sort of indication that the data is “recent”, but we can’t use the timestamps on the data itself, because they are just copies from the local database, and oldest data uploads first.

Instead we need to use timing information from when the data set last participated in the sync process, i.e. a recency indicator.

A small modification of the tool later, and its execution looks like this:

Cleanup.exe -c {connection-string} -a {path-to-analysis-file} -d {path-to-deletion-results-file} -limitDeletionCount {number-to-delete} -recencyLimitDays {dont-delete-if-touched-this-recently}

We currently use 7 days as our recency limit, but once we’re more comfortable with the process, we’ll probably tune it down to 1 or 2 days (just to get rid of the data as soon as we can).

Conclusion

To be honest, we’ve known about the duplicate data flaw for a while now, but as I mentioned earlier, it didn’t really affect customers all that much. We’d put some systems in place to allow customers to only select recently synced databases already, so from their point of view, there might be a period where they could see multiple, but that would usually go away relatively quickly.

It wasn’t until we noticed the duplicates seeping into our metrics (which we use to make business decisions!) that we realised we really needed to do something about them, thus the automated cleanup.

A nice side effect of this is that when we did the duplicate analysis, we realised that something like 30% of the entire database was worthless duplicate data, so there might actually be significant performance gains once we get rid of it all, which is always nice.

To be honest, we probably should have just fixed the flaw as soon as we noticed it, but it's in a component that is not well tested or understood, so there was a significant amount of risk in doing so.

Of course, as is always the case when you make that sort of decision, now we’re paying a different price altogether.

And who can really say which one is more expensive in the end?


Full disclosure, most of the Elastalert related work was actually done by a colleague of mine, I’m just writing about it because I thought it was interesting.

Unfortunately, this post brings me to the end of all the Elastalert goodness, at least for now.

Like I said right at the start (and embedded in the post titles), we’re finally paying attention to the wealth of information inside our ELK stack. Well, we aren’t really paying attention to everything right now, but when we notice something or even realize ahead of time that “it would be good if we got told when this happens” we actually have somewhere to put that logic.

I’ll call that a victory.

Anyway, to bring it all full circle.

To be honest, when you look at what we’ve done for Elastalert from a distance, it looks suspiciously similar to the ELK stack (specifically the Elasticsearch segment).

I don’t necessarily think that’s a bad thing though. Honestly, I think we’ve just found a pattern that works for us, so rather than reinventing the wheel each time, we just roll with it.

Consistency is a quality all on its own.

Rule The World

It's actually been almost a couple of months now since we put this all together, and people are slowly starting to incorporate different rules to notify us when interesting things happen.

A good example of this sort of thing is with one of our new features.

As a general rule of thumb, we try our best to include dedicated business intelligence events into the software for whatever features we develop, including major checkpoints like starting, finishing and failure. One of our recent features also raised a “configured” event, which indicated when a customer had put in the specific configuration necessary for the feature to be enabled (it was a third party integration, so required an externally provided API key to function).

We added a rule to detect when this relatively rare event occurred, and now we get a notification whenever someone configures the new feature. This sort of thing is useful when you still have a relatively small number of people coming online (so you can keep tabs on them and follow through to see if they are experiencing any issues), but we’ll probably turn it off once usage picks up so we’re not constantly being spammed.

Recently a customer came online with the new feature, but never followed up with actual usage beyond the initial configuration, so we were able to flag this with the relevant parties (like their account manager) and investigate why that was happening and how we could help.

Without Elastalert, we never would have known, even though the information was actually available for all to see.

Breaking All The Rules

Of course, no series of blog posts would be complete without noting down some potential ways in which we could improve the thing we literally just finished putting together.

I mean, we could barely call ourselves engineers if we weren’t already engineering a better version in our heads before the paint had even dried on the first one.

There are two areas that I think could use improvement, but neither of them are particularly simple:

  1. The architecture that we put together is not highly available, even though it is self healing. There is only one Elastalert instance and we don’t really have particularly good protection against that instance being “alive” according to AWS but not actually evaluating rules. We should probably put some more effort into detecting issues with Elastalert itself so that the AWS Auto Scaling Group self healing can kick in at the appropriate times (there's a rough sketch of what that might look like just after this list). I don’t think we can really do anything about side-by-side redundancy though, as Elastalert isn’t really designed to be a distributed alerting system. Two copies would probably just raise two alerts, which would get annoying quickly.
  2. There is no real concept of an alert getting worse over time, like there is with some other alerting platforms. Pingdom is a good example of this, though its alerts are a lot simpler (pretty much just up/down). If a website is down, different actions get triggered based on the length of the downtime. We use this sort of approach to first send a note to Hipchat, then to email, then to SMS some relevant parties in a natural progression. Elastalert really only seems to have on/off, as opposed to a schedule of notifications. You could probably accomplish the same thing by having multiple similar rules with different criteria, but that sounds like a massive pain to manage moving forward. This is something that will probably have to be done at the Elastalert level, and I doubt it would be a trivial change, so I’m not going to hold my breath.
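For the first improvement, the likely shape of the fix is a scheduled health check on the instance itself, something along the lines of the sketch below (assuming the AWS CLI is present and the instance is allowed to set its own health; none of this exists yet):

# cron-driven sketch: if the elastalert process is not running inside the container,
# tell the Auto Scaling Group so it terminates and replaces the instance
CID=$(docker ps --latest --quiet)
if [ -z "$CID" ] || ! docker top "$CID" | grep -q elastalert; then
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    aws autoscaling set-instance-health --instance-id "$INSTANCE_ID" --health-status Unhealthy
fi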

Having said that, the value that Elastalert provides in its current state is still astronomically higher than having nothing, so who am I to complain?

Conclusion

When all is said and done, I’m pretty happy that we finally have the capability to alert off our ELK stack.

I mean, it's not like the data was going to waste before we had that capability, it just feels better knowing that we don’t always have to be watching in order to find out when interesting things happen.

I know I don’t have time to watch the ELK stack all day, and I doubt anyone else does.

Though it is awfully pretty to look at.


Full disclosure, most of the Elastalert related work was actually done by a colleague of mine, I’m just writing about it because I thought it was interesting.

Continuing with the Elastalert theme, it's time to talk configuration and the deployment thereof.

Last week I covered off exactly how we put together the infrastructure for the Elastalert stack. It wasn’t anything fancy (AMI through Packer, CloudFormation template deployed via Octopus), but there were some tricksy bits relating to Python conflicts between Elastalert and the built-in AWS EC2 initialization scripts.

With that out of the way, we get into the meatiest part of the process; how we manage the configuration of Elastalert, i.e. the alerts themselves.

The Best Laid Plans

When it comes to configuring Elastalert, there are basically only two things to worry about: the overall configuration, and the rules and actions that make up the alerts.

The overall configuration covers things like where to find Elasticsearch, which Elasticsearch index to write results into, high level execution timings and so on. All that stuff is covered clearly in the documentation, and there aren’t really any surprises.

The rules are where it gets interesting. There are a wide variety of ways to trigger actions off the connected Elasticsearch cluster, and I provided an example in the initial blog post of this series. I’m not going to go into too much detail about the rules and their structure or capabilities because the documentation goes into that sort of thing at length. For the purposes of this post, the main thing to be aware of is that each rule is fully encapsulated within a file.

The nice thing about everything being inside files is that it makes deployment incredibly easy.

All you have to do is identify the locations where the files are expected to be and throw the new ones in, overwriting as appropriate. If you’re dealing with a set of files it's usually smart to clean out the destination first (so deletions are handled correctly), but it's still pretty straightforward.
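For the record, that sort of deployment is barely more than the following (the directory names are illustrative):

# clean out the destination so deleted rules actually go away, then copy the new set in
rm -rf "$RUN_DIR/rules" && mkdir -p "$RUN_DIR/rules"
cp -r "$PACKAGE_DIR/rules/." "$RUN_DIR/rules/"
cp -r "$PACKAGE_DIR/config/." "$RUN_DIR/config/"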

When we started on the whole Elastalert journey, the original plan was for a simple file copy + service restart.

Then Docker came along.

No Plan Survives Contact With The Enemy

To be fair, even with Docker, the original plan was still valid.

All of the configuration was still file based, so deployment was still as simple as copying some files around.

Mostly.

Docker did complicate a few things though. Instead of Elastalert being installed, we had to run an Elastalert image inside a Docker container.

Supplying the configuration files to the Elastalert container isn’t hard. When starting the container you just map certain local directories to directories in the container and it all works pretty much as expected. As long as the files exist in a known place after deployment, you’re fine.

However, in order to “restart” Elastalert, you have to find and murder the container you started last time, and then start up a new one so it will capture the new configuration files and environment variables correctly.

This is all well and good, but even after doing that you only really know whether or not the container itself is running, not necessarily the Elastalert process inside the container. If your config is bad in some way, the Elastalert process won’t start, even though the container will quite happily keep chugging along. So you need something to detect if Elastalert itself is up inside the container.

Putting all of the above together, you get something like this:

echo -e "STEP: Stop and remove existing docker containers..."
echo "Checking for any existing docker containers"
RUNNING_CONTAINERS=$(docker ps -a -q)
if [ -n "$RUNNING_CONTAINERS" ]; then
    echo "Found existing docker containers."
    echo "Stopping the following containers:"
    docker stop $(docker ps -a -q)
    echo "Removing the following containers:"
    docker rm $(docker ps -a -q)
    echo "All containers removed"
else
    echo "No existing containers found"
fi
echo -e "...SUCCESS\n"

echo -e "STEP: Run docker container..."
ELASTALERT_CONFIG_FILE="/opt/config/elastalert.yaml"
SUPERVISORD_CONFIG_FILE="/opt/config/supervisord.conf"
echo "Elastalert config file: $ELASTALERT_CONFIG_FILE"
echo "Supervisord config file: $SUPERVISORD_CONFIG_FILE"
echo "ES HOST: $ES_HOST"
echo "ES PORT: $ES_PORT"
docker run -d \
    -v $RUN_DIR/config:/opt/config \
    -v $RUN_DIR/rules:/opt/rules \
    -v $RUN_DIR/logs:/opt/logs \
    -e "ELASTALERT_CONFIG=$ELASTALERT_CONFIG_FILE" \
    -e "ELASTALERT_SUPERVISOR_CONF=$SUPERVISORD_CONFIG_FILE" \
    -e "ELASTICSEARCH_HOST=$ES_HOST" \
    -e "ELASTICSEARCH_PORT=$ES_PORT" \
    -e "SET_CONTAINER_TIMEZONE=true" \
    -e "CONTAINER_TIMEZONE=$TIMEZONE" \
    --cap-add SYS_TIME \
    --cap-add SYS_NICE $IMAGE_ID
if [ $? != 0 ]; then
    echo "docker run command returned a non-zero exit code."
    echo -e "...FAILED\n"
    exit -1
fi
CID=$(docker ps --latest --quiet)
echo "Elastalert container with ID $CID is now running"
echo -e "...SUCCESS\n"

echo -e "STEP: Checking for Elastalert process inside container..."
echo "Waiting 10 seconds for elastalert process"
sleep 10
if docker top $CID | grep -q elastalert; then
    echo "Found running Elastalert process. Nice."
else
    echo "Did not find elastalert running"
    echo "You can view logs for the container with: docker logs -f $CID"
    echo "You can shell into the container with: docker exec -it $CID sh"
    echo -e "...FAILURE\n"
    exit -1
fi
echo -e "...SUCCESS\n"

But wait, there’s more!

Environmental Challenges

Our modus operandi is to have multiple copies of our environments (CI, Staging, Production) which form something of a pipeline for deployment purposes. I’ve gone through this sort of thing in the past, the most recent occurrence of which was when I wrote about rebuilding the ELK stack. It's a fairly common pattern, but it does raise some interesting challenges, especially around configuration.

For Elastalert specifically, each environment should have the same baseline behaviour (rules, timings, etc), but also different settings for things like where the Elasticsearch cluster is located, or which Hipchat room notifications go to.

When using Octopus Deploy, the normal way to accomplish this is to have variables defined in your Octopus Deploy project that are scoped to the environments being deployed to, and then leverage some of the built in substitution functionality to do replacements in whatever files need to be changed.

This works great at first, but has a few limitations:

  • You now have two places to look when trying to track changes, which can become a bit of a pain. It's much nicer to be able to view all of the changes (barring sensitive credentials of course) in your source control tool of choice.
  • You can’t easily develop and test the environment outside of Octopus, especially if your deployment is only valid after passing through a successful round of substitutions in Octopus Deploy.

Keeping those two things in mind, we now lean towards having all of our environment specific parameters and settings in configuration files in source control (barring sensitive variables, which require some additional malarkey), and then loading the appropriate file based on some high level flags that are set either by Octopus or in the local development environment.

For Elastalert specifically we settled into having a default configuration file (which is always loaded) and then environment specific overrides. Which environment the deployment is executing in is decided by the following snippet of code:

echo -e "STEP: Determining Environment..."
if [ "$(type -t get_octopusvariable)" = function ]; then
    echo "get_octopusvariable function is defined => assuming we are running on Octopus"
    ENVIRONMENT=$(get_octopusvariable "Octopus.Environment.Name")
elif [ -n "$ENVIRONMENT" ]; then
    echo "--environment command line option was used"
else
    echo "Not running on Octopus and no --environment command line option used. Using 'Default'"
    ENVIRONMENT="Default"
fi
echo "ENVIRONMENT=$ENVIRONMENT"
echo -e "...SUCCESS\n"

Once the selection of the environment is out of the way, the deployed files are mutated by executing a substitution routine written in Python which does most of the heavy lifting (replacing any tokens of the format @@KEY@@ in the appropriate files).
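The routine itself isn't worth reproducing here, but conceptually it's doing little more than this (the keys, values and file locations below are illustrative):

# replace @@KEY@@ tokens in the deployed configuration with environment specific values
sed -i "s|@@ELASTICSEARCH_HOST@@|$ES_HOST|g" "$RUN_DIR/config/elastalert.yaml"
sed -i "s|@@HIPCHAT_ROOM_ID@@|$HIPCHAT_ROOM_ID|g" "$RUN_DIR/rules/"*.yaml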

To Be Continued

I’ve covered the two biggest challenges in the deployment of our Elastalert configuration, but I’ve glossed over quite a few pieces of the process because covering the entire thing in this blog post would make it way too big.

The best way to really understand how it works is to have a look at the actual repository.

With both the environment and configuration explained, all that is really left to do is bring it all together, and explain some areas that I think could use improvement.

That’s a job for next week though.