
We have a lot of logs.

It’s mostly my fault, to be honest. It was only a few years ago that I learned about log aggregation, and once you have an ELK stack, everything looks like a structured log event formatted as JSON.

We aggregate a wealth of information into our log stack these days, including (but not limited to) API traffic and application events like logins.

Now, if I had my way, we would keep everything forever. My dream would be to be able to ask the question “What did our aggregate API traffic look like over the last 12 months?”

Unfortunately, I can’t keep the raw data forever.

But I might be able to keep a part of it.

Storage Space

Storage space is pretty cheap these days, especially in AWS. In the Asia Pacific region, we pay $US 0.12 per GB per month for a stock standard, non-provisioned IOPS EBS volume.

Our ELK stack accumulates gigabytes of data every day though, and trying to store everything for all of eternity can add up pretty quickly. This gets even more complicated by the nature of Elasticsearch, because it likes to keep replicas of things just in case a node explodes, so you actually need more storage space than you think in order to account for redundancy.
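
To put some rough, purely illustrative numbers on it: at 5 GB of new log events per day, keeping 40 days with a single replica works out to 5 × 40 × 2 = 400 GB, or around $48 a month at $0.12 per GB, and that’s before you start talking about keeping things for years.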

In the end we somewhat arbitrarily decided to keep a bit more than a month’s worth of data (40 days), which gives us the capability to reliably support our products, and to have a decent window for viewing business intelligence and usage. We have a scheduled task in TeamCity that leverages Curator to remove data as appropriate.
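
For reference, the cleanup itself is nothing fancy. A rough sketch of the sort of thing the scheduled task runs is below; the host and the exact filters are illustrative and will depend on your index naming and Curator version:

# Illustrative only: delete logstash-prefixed indices whose date-stamped names are older than 40 days.
curator_cli --host elk.example.com --port 9200 delete_indices \
  --filter_list '[
    {"filtertype": "pattern", "kind": "prefix", "value": "logstash-"},
    {"filtertype": "age", "source": "name", "direction": "older", "timestring": "%Y.%m.%d", "unit": "days", "unit_count": 40}
  ]'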

Now, a little more than a month is a pretty long time.

But I want more.

In For The Long Haul

In any data set, you are likely to find patterns that emerge over a much longer period than a month.

A good example would be something like daily active users. This is the sort of trend that is likely to show itself over months or years, especially for a relatively stable product. Unless you’ve done something extreme of course, in which case you might get a meaningful trend over a much shorter period.

Ignoring the extremes, we have all the raw data required to calculate the metric, we’re just not keeping it. If we had some way of summarising it into a smaller data set though, we could keep it for a much longer period. Maybe some sort of mechanism to do some calculations and store the resulting derivation somewhere safe?

The simplest approach is some sort of script or application that runs on a schedule and uses the existing data in the ELK stack to create and store new documents, preferably back into the ELK stack. If we want to ensure those new documents don’t get deleted by Curator, all we have to do is put them into different indexes (as Curator only cleans up indexes prefixed with logstash).
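
As a purely illustrative example (the index and field names are made up), writing a summarised document into its own index is a single HTTP call, and anything outside of logstash-* is safely out of Curator’s reach:

# Hypothetical example: store a daily summary document in its own index so the logstash-* cleanup ignores it.
curl -s -X POST "http://elk.example.com:9200/metrics-daily-distinct-users/metric" \
  -H "Content-Type: application/json" \
  -d '{
    "date": "2018-03-01",
    "count": 1234,
    "source": "logstash-*",
    "generated_at": "2018-03-02T01:00:00Z"
  }'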

Seems simple enough.

Generator X

For once it actually was simple enough.

At some point in the past we actually implemented a variation of this idea, where we calculated some metrics from a database (yup, that database) and stored them in an Elasticsearch instance for later use.

Architecturally, the metric generator was a small C# command line application scheduled for daily execution through TeamCity, so nothing particularly complicated.

We ended up decommissioning those particular metrics (because it turned out they were useless) and disabling the scheduled task, but the framework already existed to do at least half of what I wanted to do; the part relating to generating documents and storing them in Elasticsearch. All I had to do was extend it to query a different data source (Elasticsearch) and generate a different set of metrics documents for indexing.

So that’s exactly what I did.

The only complicated part was figuring out how to query Elasticsearch from .NET, which, as you can see from the following metrics generation class, can be quite a journey.

public class ElasticsearchDailyDistinctUsersDbQuery : IDailyDistinctUsersDbQuery
{
    public ElasticsearchDailyDistinctUsersDbQuery
    (
        SourceElasticsearchUrlSetting sourceElasticsearch,
        IElasticClientFactory factory,
        IClock clock,
        IMetricEngineVersionResolver version
    )
    {
        _sourceElasticsearch = sourceElasticsearch;
        _clock = clock;
        _version = version;
        _client = factory.Create(sourceElasticsearch.Value);
    }

    private const string _indexPattern = "logstash-*";

    private readonly SourceElasticsearchUrlSetting _sourceElasticsearch;
    private readonly IClock _clock;
    private readonly IMetricEngineVersionResolver _version;

    private readonly IElasticClient _client;

    public IEnumerable<DailyDistinctUsersMetric> Run(DateTimeOffset parameters)
    {
        var start = parameters - parameters.TimeOfDay;
        var end = start.AddDays(1);

        var result = _client.Search<object>
        (
            s => s
                .Index(_indexPattern)
                .AllTypes()
                .Query
                (
                    q => q
                        .Bool
                        (
                            b => b
                                .Must(m => m.QueryString(a => a.Query("Application:GenericSoftwareName AND Event.Name:SomeLoginEvent").AnalyzeWildcard(true)))
                                .Must(m => m
                                    .DateRange
                                    (
                                        d => d
                                            .Field("@timestamp")
                                            .GreaterThanOrEquals(DateMath.Anchored(start.ToUniversalTime().DateTime))
                                            .LessThan(DateMath.Anchored(end.ToUniversalTime().DateTime))
                                    )
                                )
                        )
                )
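                // The cardinality aggregation is an approximate distinct count, which is accurate enough for a long term trend metric.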
                .Aggregations(a => a
                    .Cardinality
                    (
                        "DistinctUsers", 
                        c => c.Field("SomeUniqueUserIdentifier")
                    )
                )
        );

        var agg = result.Aggs.Cardinality("DistinctUsers");

        return new[]
        {
            new DailyDistinctUsersMetric(start)
            {
                count = Convert.ToInt32(agg.Value),
                generated_at = _clock.UtcNow,
                source = $"{_sourceElasticsearch.Value}/{_indexPattern}",
                generator_version = _version.ResolveVersion().ToString()
            }
        };
    }
}

Conclusion

The concept of calculating some aggregated values from our logging data and keeping them separately has been in my own personal backlog for a while now, so it was nice to have a chance to dig into it in earnest.

It was even nicer to be able to build on top of an existing component, as it would have taken me far longer if I had to put everything together from scratch. I think it’s a testament to the quality of our development process that even this relatively unimportant component was originally built following solid software engineering practices, and has plenty of automated tests, dependency injection and so on. It made refactoring it and turning it towards a slightly different purpose much easier.

Now all I have to do is wait months while the longer term data slowly accumulates.


A wild technical post appears!

This week’s post returns to a topic very close to my heart, the Elasticsearch, Logstash and Kibana (ELK) stack that we use for log aggregation. As you might be able to tell from my post history, logging, metrics and business intelligence rank pretty high on my list of priorities, regardless of any other focuses I might have. To me, if you don’t have good intelligence, you might as well be fighting in the dark, flailing about in the hopes that you hit something important.

This post specifically is about the process by which we deploy new versions of Elasticsearch, and an issue that can occur when you do rolling deployments and the Elasticsearch cluster is hosted in AWS.

Version Control

Way back in August 2017 I wrote about automating the deployment of new Elasticsearch versions to our ELK stack.

Long story short, the part of that post that is relevant to this one is the bit about unassigned shards in Elasticsearch when rebalancing after a version upgrade. Specifically, if you have nodes that are at a later version of Elasticsearch than others (which is normal when doing a rolling deployment), and a later version node is elected to hold the primary shard, replicas cannot be assigned to any of the nodes with the lower version.

This is troublesome if you’re waiting for a cluster to go green before progressing to the next node replacement, because unassigned shards equal a yellow cluster. You’ll be waiting forever (or you’ll hit your timeout, because you were smart enough to put a timeout in, right?).
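
For context, the “wait for green” step is really just a cluster health call with a timeout, something along these lines (the URL and timeout are illustrative):

# Illustrative only: returns when the cluster reaches green, or after 60 seconds, whichever comes first.
curl -s "http://elk.example.com:9200/_cluster/health?wait_for_status=green&timeout=60s"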

Without some additional change, the system will never reach a state of equilibrium.

La La La I Can’t Hear You

To expand on the content of the initial post, the solution was to check that all remaining unassigned shards were version conflicts whenever an appropriate end state was reached. An end state would be something like a timeout waiting for the cluster to go green, or maybe something fancier like “the number of unassigned shards has not changed over a period of time”.

If the only unassigned shards left are version conflicts, it’s relatively safe to just continue on with the process and let Elasticsearch sort it out (which it will once all of the nodes are replaced). There is minimal risk of data loss (the primary shards are all guaranteed to exist in order for this problem to happen in the first place), and each time a new node comes online, the cluster will rebalance into a better state anyway.

The script for checking for version conflicts is below:

function Get-UnassignedShards
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl
    )

    $shards = Invoke-RestMethod -Method GET -Uri "$elasticsearchUrl/_cat/shards" -Headers @{"accept"="application/json"} -Verbose:$false;
    $unassigned = $shards | Where-Object { $_.state -eq "UNASSIGNED" };

    return $unassigned;
}

function Test-AllUnassignedShardsAreVersionConflicts
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl
    )

    Write-Verbose "Getting all UNASSIGNED shards, to see if all of them are UNASSIGNED because of version conflicts";

    $unassigned = Get-UnassignedShards -elasticsearchUrl $elasticsearchUrl;

    foreach ($unassignedShard in $unassigned)
    {
        $primary = "true";
        if ($unassignedShard.prirep -eq "r")
        {
            $primary = "false";
        }
        $explainBody = "{ `"index`": `"$($unassignedShard.index)`", `"shard`": $($unassignedShard.shard), `"primary`": $primary }";
        Write-Verbose "Getting allocation explanation using query [$explainBody]";
        $explain = Invoke-RestMethod -Method POST -Uri "$elasticsearchUrl/_cluster/allocation/explain" -Headers @{"accept"="application/json"} -ContentType "application/json" -Body $explainBody -Verbose:$false;

        $versionConflictRegex = "target node version.*is older than the source node version.*";
        $sameNodeConflictRegex = "the shard cannot be allocated to the same node on which a copy of the shard already exists";
        $explanations = @();
        foreach ($node in $explain.node_allocation_decisions)
        {
            foreach ($decider in $node.deciders)
            {
                $explanations += @{Node=$node.node_name;Explanation=$decider.explanation};
            }
        }

        foreach ($explanation in $explanations)
        {
            if ($explanation.explanation -notmatch $versionConflictRegex -and $explanation.explanation -notmatch $sameNodeConflictRegex)
            {
                Write-Verbose "The node [$($explanation.Node)] in the explanation for shard [$($unassignedShard.index):$($unassignedShard.shard)] contained an allocation decider explanation that was unacceptable (i.e. not version conflict and not same node conflict). It was [$($explanation.Explanation)]";
                return $false;
            }
        }
    }

    return $true;
}

In The Zone…Or Out Of It?

This solution works really well for the specific issue that it was meant to detect, but to absolutely no-one’s surprise, it doesn’t work so well for other problems.

Case in point, if your Elasticsearch cluster is AWS Availability Zone aware, then you can encounter a very similar problem to what I’ve just described, except with availability zone conflicts instead of version conflicts.

An availability zone aware Elasticsearch cluster will avoid putting shard replicas in the same availability zone as the primary (within reason), which is just another way to protect itself against losing data in the event of a catastrophic failure. I’m sure you can disable the functionality, but that seems like a relatively sane safety measure, so I’m not sure why you would.

Unfortunately, when combined with version conflicts also preventing shard allocation, you can be left in a situation where there is no appropriate place to dump a shard, so our deployment process can’t move on because the cluster never goes green.

Interestingly enough, there are two possible solutions for this:

  • The first is to be more careful about the order that you annihilate nodes in. Alternating availability zones is the way to go here, but this approach can get complicated if you’re also dealing with version conflicts at the same time. Also, it doesn’t really work all that well if you don’t have a full complement of nodes (with redundancy) spread across both availability zones.
  • The second is to just replicate the version conflict solution above, except for unassigned shards that are the result of availability zone conflicts. This is by far the easier and less fiddly approach, assuming that the entire deployment finishes (so the cluster can rebalance as appropriate); there’s a rough sketch of what that check might look like just after this list.
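
To give a feel for the second option, a rough shell sketch is below. To be clear, this is not something we’ve actually implemented yet; the decider message for availability zone conflicts is an assumption based on the sort of explanations the allocation explain API returns, and the existing PowerShell above would be extended in the same spirit:

# Hypothetical sketch: print any allocation explanations for unassigned shards that are NOT
# a version conflict, a same-node conflict or an availability zone (awareness) conflict.
ES="http://elk.example.com:9200"
ACCEPTABLE="is older than the source node version|cannot be allocated to the same node|too many copies of the shard allocated to nodes with attribute"

curl -s "$ES/_cat/shards?format=json" | jq -c '.[] | select(.state == "UNASSIGNED")' |
while read -r shard; do
  index=$(echo "$shard" | jq -r '.index')
  number=$(echo "$shard" | jq -r '.shard')
  primary=$([ "$(echo "$shard" | jq -r '.prirep')" = "p" ] && echo "true" || echo "false")

  # Ask Elasticsearch why this shard is unassigned and keep only the decider explanations.
  curl -s -X POST "$ES/_cluster/allocation/explain" -H "Content-Type: application/json" \
    -d "{\"index\":\"$index\",\"shard\":$number,\"primary\":$primary}" |
    jq -r '.node_allocation_decisions[]?.deciders[]?.explanation' |
    grep -vE "$ACCEPTABLE" |
    sed "s|^|UNACCEPTABLE [$index:$number]: |"
done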

I haven’t actually updated our deployment since I discovered the issue, but my current plan is to go with the second option and see how far I get.

Conclusion

This is one of those cases where I knew that the initial solution that I put into place would probably not be enough over the long term, but there was just no value in trying to engineer for every single eventuality right at the start.

Also, truth be told, I didn’t know that Elasticsearch took AWS Availability Zones into account when allocating shards, so that was a bit of a surprise anyway.

Thinking about the actual deployment process some more, it might be easier to scale up, wait for a rebalance and then scale down again, terminating the oldest (and thus earlier version) nodes after all the new ones have already come online. The downside to this approach is mostly just time (you have to wait for 2N rebalances instead of just N, where N is the number of nodes), but it feels like it might be more robust in the face of unexpected weirdness.

Which, from my experience, I should probably just start expecting from now on, as it (ironically) seems like the one constant in software.


We’ve had full control over our Elasticsearch field mappings for a while now, but to be honest, once we got the initial round of mappings out of the way, we haven’t really had to deal with too many deployments. Sure, they happen every now and then, but it’s not something we do every single day.

In the intervening time period, we’ve increased the total amount of data that we store in the ELK stack, which has had an unfortunate side effect when it comes to the deployment of more mappings.

It tends to take the Elasticsearch cluster down.

Is The Highway To Hell Paved With Good Intentions?

When we originally upgraded the ELK stack and introduced the deployment of the Elasticsearch configuration, it was pretty straightforward. The deployment consisted of only those settings related to the cluster/node, and said settings wouldn’t be applied until the node was restarted anyway, so it was an easy decision to just force a node restart whenever the configuration was deployed.

When deploying to multiple nodes, the first attempt at orchestration just deployed sequentially, one node at a time, restarting each one and then waiting for the node to rejoin the cluster before moving on.

This approach….kind of worked.

It was good for when a fresh node was joining the cluster (i.e. after an auto scale) or when applying simple configuration changes (like log settings), but tended to fall apart when applying major configuration changes or when the cluster did not already exist (i.e. initial creation).

The easy solution was to just deploy to all nodes at the same time, which pretty much guaranteed a small downtime while the cluster reassembled itself.

Considering that we weren’t planning on deploying core configuration changes all that often, this seemed like a decent enough compromise.

Then we went and included index templates and field mappings into the configuration deployment.

Each time we deployed a new field mapping the cluster would go down for a few moments, but would usually come good again shortly after. Of course, this was when we still only had a week’s worth of data in the cluster, so it was pretty easy for it to crunch through all of the indexes and shards and sort them out when it came back online.

Now we have a little over a month’s worth of data, and every time the cluster goes down it takes a fair while to come back.

That’s real downtime for no real benefit, because most of the time we’re just deploying field mappings, which can actually just be updated using the HTTP API, no restart required.
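
For reference, “using the HTTP API” is just a PUT against the template endpoint; the template name and mapping below are purely illustrative, and the exact body depends on your Elasticsearch version:

# Illustrative only: update an index template (and therefore the mappings applied to new indices)
# without restarting anything. Older versions use "template": "logstash-*" instead of "index_patterns".
curl -s -X PUT "http://elk.example.com:9200/_template/logstash-custom" \
  -H "Content-Type: application/json" \
  -d '{
    "index_patterns": ["logstash-*"],
    "mappings": {
      "logs": {
        "properties": {
          "SomeNewField": { "type": "keyword" }
        }
      }
    }
  }'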

Dirty Deployments, Done Dirt Cheap

This situation could have easily been an opportunity to shift the field mappings into a deployment of their own, but I still had the same problem as I did the first time I had to make this decision – what’s the hook for the deployment when spinning up a new environment?

In retrospect the answer is probably “the environment is up and has passed its smoke test”, but that didn’t occur to me until later, so we went in a different direction.

What if we didn’t always have to restart the node on a configuration deployment?

We really only deploy three files that could potentially require a node restart:

  • The core Elasticsearch configuration file (/etc/elasticsearch/elasticsearch.yml)
  • The JVM options file (/etc/elasticsearch/jvm.options)
  • The log4j2 configuration file (/etc/elasticsearch/log4j2.properties)

If none of those files have changed, then we really don’t need to do a node restart, which means we can just move ahead with the deployment of the field mappings.

No fuss, no muss.

Linux is pretty sweet in this regard (well, at least the Amazon Linux baseline is) in that it provides a diff command that can be used to easily compare two files.

It was a relatively simple matter to augment the deployment script with some additional logic, like below:

… more script up here

 

temporary_jvm_options="/tmp/elk-elasticsearch/jvm.options"
destination_jvm_options="/etc/elasticsearch/jvm.options"

echo "Mutating temporary jvm.options file [$temporary_jvm_options] to contain appropriate memory allocation and to fix line endings"
es_memory=$(free -m | awk '/^Mem:/ { print int($2/2) }') || exit 1
sed -i "s/@@ES_MEMORY@@/$es_memory/" $temporary_jvm_options || exit 1
sed -i 's/\r//' $temporary_jvm_options || exit 1
sed -i '1 s/^\xef\xbb\xbf//' $temporary_jvm_options || exit 1

echo "Diffing $temporary_jvm_options and $destination_jvm_options"
diff --brief $temporary_jvm_options $destination_jvm_options
jvm_options_changed=$?

 

… more script down here

 

if [[ $jvm_options_changed == 1 || $configuration_file_changed == 1 || $log_config_file_changed == 1 ]]; then
    echo "Configuration change detected, will now restart elasticsearch service."
    sudo service elasticsearch restart || exit 1
else
    echo "No configuration change detected, elasticsearch service will not be restarted."
fi

No more unnecessary node restarts, no more unnecessary downtime.

Conclusion

This is another one of those cases that seems incredibly obvious in retrospect, but I suppose everything does. It’s still almost always better to go with the naive solution at first, and then improve, rather than try to deal with everything up front. It’s better to focus on making something easy to adapt in the face of unexpected issues than to try and engineer some perfect unicorn.

Regardless, with no more risk that the Elasticsearch cluster will go down for an unspecified amount of time whenever we just deploy field mapping updates, we can add new fields with impunity (it helps that we figured out how to reindex).

Of course, as with anything, there are still issues with the deployment logic:

  • Linking the index template/field mapping deployment to the core Elasticsearch configuration was almost certainly a terrible mistake, so we’ll probably have to deal with that eventually.
  • The fact that a configuration deployment can still result in the cluster going down is not great, but to be honest, I can’t really think of a better way either. You could deploy the configuration to the master nodes first, but that leaves you in a tricky spot if it fails (or if the configuration is a deep enough change to completely rename or otherwise move the cluster). You might be able to improve the logic to differentiate between “first time” and “additional node”, but you still have the problem of dealing with major configuration changes. It’s all very complicated, and honestly we don’t really do configuration deployments often enough to spend time solving that particular problem.
  • The index template/field mapping deployment technically occurs once on every node, simultaneously. For something that can be accomplished by an HTTP call, this is pretty wasteful (though it doesn’t have any obvious negative side effects).

There’s always room for improvement.


Full disclosure, most of the Elastalert related work was actually done by a colleague of mine, I’m just writing about it because I thought it was interesting.

Unfortunately, this post brings me to the end of all the Elastalert goodness, at least for now.

Like I said right at the start (and embedded in the post titles), we’re finally paying attention to the wealth of information inside our ELK stack. Well, we aren’t really paying attention to everything right now, but when we notice something, or even realize ahead of time that “it would be good if we got told when this happens”, we actually have somewhere to put that logic.

I’ll call that a victory.

Anyway, it’s time to bring it all full circle.

To be honest, when you look at what we’ve done for Elastalert from a distance, it looks suspiciously similar to the ELK stack (specifically the Elasticsearch segment).

I don’t necessarily think that’s a bad thing though. Honestly, I think we’ve just found a pattern that works for us, so rather than reinventing the wheel each time, we just roll with it.

Consistency is a quality all on its own.

Rule The World

It’s actually been almost a couple of months now since we put this all together, and people are slowly starting to incorporate different rules to notify us when interesting things happen.

A good example of this sort of thing is with one of our new features.

As a general rule of thumb, we try our best to include dedicated business intelligence events into the software for whatever features we develop, including major checkpoints like starting, finishing and failure. One of our recent features also raised a “configured” event, which indicated when a customer had put in the specific configuration necessary for the feature to be enabled (it was a third party integration, so required an externally provided API key to function).

We added a rule to detect when this relatively rare event occurred, and now we get a notification whenever someone configures the new feature. This sort of thing is useful while you still have a relatively small number of people coming online (so you can keep tabs on them and follow through to see if they are experiencing any issues), but we’ll probably turn it off once usage picks up, so we’re not constantly being spammed.
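
As a rough sketch of what one of those rules looks like (the event name, index and Hipchat details here are all made up for illustration), it’s really just another small YAML file dropped into the rules directory:

# Hypothetical sketch: the event name, index and Hipchat details are all illustrative.
cat > rules/new-feature-configured.yaml <<'EOF'
name: New feature configured
type: any
index: logstash-*
realert:
  hours: 1
filter:
- query:
    query_string:
      query: "Application: GenericSoftwareName AND Event.Name: SomeFeatureConfiguredEvent"
alert:
- hipchat
hipchat_auth_token: "REPLACE_ME"
hipchat_room_id: "REPLACE_ME"
EOF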

Recently a customer came online with the new feature, but never followed up with actual usage beyond the initial configuration, so we were able to flag this with the relevant parties (like their account manager) and investigate why that was happening and how we could help.

Without Elastalert, we never would have known, even though the information was actually available for all to see.

Breaking All The Rules

Of course, no series of blog posts would be complete without noting down some potential ways in which we could improve the thing we literally just finished putting together.

I mean, we could barely call ourselves engineers if we weren’t already engineering a better version in our heads before the paint had even dried on the first one.

There are two areas that I think could use improvement, but neither of them are particularly simple:

  1. The architecture that we put together is not highly available, even though it is self healing. There is only one Elastalert instance and we don’t really have particularly good protection against that instance being “alive” according to AWS but not actually evaluating rules. We should probably put some more effort into detecting issues with Elastalert so that the AWS Auto Scaling Group self healing can kick in at the appropriate times. I don’t think we can really do anything about side-by-side redundancy though, as Elastalert isn’t really designed to be a distributed alerting system. Two copies would probably just raise two alerts, which would get annoying quickly.
  2. There is no real concept of an alert getting worse over time, like there is with some other alerting platforms. Pingdom is a good example of this, though its alerts are a lot simpler (pretty much just up/down). If a website is down, different actions get triggered based on the length of the downtime. We use this sort of approach to first send a note to Hipchat, then to email, then to SMS some relevant parties in a natural progression. Elastalert really only seems to have on/off, as opposed to a schedule of notifications. You could probably accomplish the same thing by having multiple similar rules with different criteria, but that sounds like a massive pain to manage moving forward. This is something that will probably have to be done at the Elastalert level, and I doubt it would be a trivial change, so I’m not going to hold my breath.

Having said that, the value that Elastalert provides in its current state is still astronomically higher than having nothing, so who am I to complain?

Conclusion

When all is said and done, I’m pretty happy that we finally have the capability to alert off of our ELK stack.

I mean, it’s not like the data was going to waste before we had that capability, it just feels better knowing that we don’t always have to be watching in order to find out when interesting things happen.

I know I don’t have time to watch the ELK stack all day, and I doubt anyone else does.

Though it is awfully pretty to look at.


Full disclosure, most of the Elastalert related work was actually done by a colleague of mine, I’m just writing about it because I thought it was interesting.

Continuing with the Elastalert theme, it’s time to talk configuration and the deployment thereof.

Last week I covered off exactly how we put together the infrastructure for the Elastalert stack. It wasn’t anything fancy (AMI through Packer, CloudFormation template deployed via Octopus), but there were some tricksy bits relating to Python conflicts between Elastalert and the built-in AWS EC2 initialization scripts.

With that out of the way, we get into the meatiest part of the process; how we manage the configuration of Elastalert, i.e. the alerts themselves.

The Best Laid Plans

When it comes to configuring Elastalert, there are basically only two things to worry about: the overall configuration, and the rules and actions that make up the alerts.

The overall configuration covers things like where to find Elasticsearch, which Elasticsearch index to write results into, high level execution timings and so on. All that stuff is covered clearly in the documentation, and there aren’t really any surprises.

The rules are where it gets interesting. There are a wide variety of ways to trigger actions off the connected Elasticsearch cluster, and I provided an example in the initial blog post of this series. I’m not going to go into too much detail about the rules and their structure or capabilities because the documentation goes into that sort of thing at length. For the purposes of this post, the main thing to be aware of is that each rule is fully encapsulated within a file.

The nice thing about everything being inside files is that it makes deployment incredibly easy.

All you have to do is identify the locations where the files are expected to be and throw the new ones in, overwriting as appropriate. If you’re dealing with a set of files it’s usually smart to clean out the destination first (so deletions are handled correctly), but it’s still pretty straightforward.
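
In shell terms, that part of the deployment is barely a couple of lines (the paths here are illustrative):

# Illustrative only: blow away the currently deployed rules and replace them with the freshly deployed set,
# so rules deleted from source control also disappear from the destination.
rm -rf /opt/elastalert/rules/*
cp -r ./rules/. /opt/elastalert/rules/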

When we started on the whole Elastalert journey, the original plan was for a simple file copy + service restart.

Then Docker came along.

No Plan Survives Contact With The Enemy

To be fair, even with Docker, the original plan was still valid.

All of the configuration was still file based, so deployment was still as simple as copying some files around.

Mostly.

Docker did complicate a few things though. Instead of Elastalert being installed directly on the host, we had to run an Elastalert image inside a Docker container.

Supplying the configuration files to the Elastalert container isn’t hard. When starting the container you just map certain local directories to directories in the container and it all works pretty much as expected. As long as the files exist in a known place after deployment, you’re fine.

However, in order to “restart” Elastalert, you have to find and murder the container you started last time, and then start up a new one so it will capture the new configuration files and environment variables correctly.

This is all well and good, but even after doing that you only really know whether or not the container itself is running, not necessarily the Elastalert process inside the container. If your config is bad in some way, the Elastalert process won’t start, even though the container will quite happily keep chugging along. So you need something to detect if Elastalert itself is up inside the container.

Putting all of the above together, you get something like this:

echo -e "STEP: Stop and remove existing docker containers..."
echo "Checking for any existing docker containers"
RUNNING_CONTAINERS=$(docker ps -a -q)
if [ -n "$RUNNING_CONTAINERS" ]; then
    echo "Found existing docker containers."
    echo "Stopping the following containers:"
    docker stop $(docker ps -a -q)
    echo "Removing the following containers:"
    docker rm $(docker ps -a -q)
    echo "All containers removed"
else
    echo "No existing containers found"
fi
echo -e "...SUCCESS\n"

echo -e "STEP: Run docker container..."
ELASTALERT_CONFIG_FILE="/opt/config/elastalert.yaml"
SUPERVISORD_CONFIG_FILE="/opt/config/supervisord.conf"
echo "Elastalert config file: $ELASTALERT_CONFIG_FILE"
echo "Supervisord config file: $SUPERVISORD_CONFIG_FILE"
echo "ES HOST: $ES_HOST"
echo "ES PORT: $ES_PORT"
docker run -d \
    -v $RUN_DIR/config:/opt/config \
    -v $RUN_DIR/rules:/opt/rules \
    -v $RUN_DIR/logs:/opt/logs \
    -e "ELASTALERT_CONFIG=$ELASTALERT_CONFIG_FILE" \
    -e "ELASTALERT_SUPERVISOR_CONF=$SUPERVISORD_CONFIG_FILE" \
    -e "ELASTICSEARCH_HOST=$ES_HOST" \
    -e "ELASTICSEARCH_PORT=$ES_PORT" \
    -e "SET_CONTAINER_TIMEZONE=true" \
    -e "CONTAINER_TIMEZONE=$TIMEZONE" \
    --cap-add SYS_TIME \
    --cap-add SYS_NICE $IMAGE_ID
if [ $? != 0 ]; then
    echo "docker run command returned a non-zero exit code."
    echo -e "...FAILED\n"
    exit -1
fi
CID=$(docker ps --latest --quiet)
echo "Elastalert container with ID $CID is now running"
echo -e "...SUCCESS\n"

echo -e "STEP: Checking for Elastalert process inside container..."
echo "Waiting 10 seconds for elastalert process"
sleep 10
if docker top $CID | grep -q elastalert; then
    echo "Found running Elastalert process. Nice."
else
    echo "Did not find elastalert running"
    echo "You can view logs for the container with: docker logs -f $CID"
    echo "You can shell into the container with: docker exec -it $CID sh"
    echo -e "...FAILURE\n"
    exit -1
fi
echo -e "...SUCCESS\n"

But wait, there’s more!

Environmental Challenges

Our modus operandi is to have multiple copies of our environments (CI, Staging, Production) which form something of a pipeline for deployment purposes. I’ve gone through this sort of thing in the past, the most recent occurrence of which was when I wrote about rebuilding the ELK stack. It’s a fairly common pattern, but it does raise some interesting challenges, especially around configuration.

For Elastalert specifically, each environment should have the same baseline behaviour (rules, timings, etc), but also different settings for things like where the Elasticsearch cluster is located, or which Hipchat room notifications go to.

When using Octopus Deploy, the normal way to accomplish this is to have variables defined in your Octopus Deploy project that are scoped to the environments being deployed to, and then leverage some of the built in substitution functionality to do replacements in whatever files need to be changed.

This works great at first, but has a few limitations:

  • You now have two places to look when trying to track changes, which can become a bit of a pain. It’s much nicer to be able to view all of the changes (barring sensitive credentials of course) in your source control tool of choice.
  • You can’t easily develop and test the environment outside of Octopus, especially if your deployment is only valid after passing through a successful round of substitutions in Octopus Deploy.

Keeping those two things in mind, we now lean towards having all of our environment specific parameters and settings in configuration files in source control (barring sensitive variables, which require some additional malarkey), and then loading the appropriate file based on some high level flags that are set either by Octopus or in the local development environment.

For Elastalert specifically we settled into having a default configuration file (which is always loaded) and then environment specific overrides. Which environment the deployment is executing in is decided by the following snippet of code:

echo -e "STEP: Determining Environmnet..."
if [ "$(type -t get_octopusvariable)" = function ]; then
    echo "get_octopusvariable function is defined => assuming we are running on Octopus"
    ENVIRONMENT=$(get_octopusvariable "Octopus.Environment.Name")
elif [ -n "$ENVIRONMENT" ]; then
    echo "--environment command line option was used"
else
    echo "Not running on Octopous and no --environment command line option used. Using 'Default'"
    ENVIRONMENT="Default"
fi
echo "ENVIRONMENT=$ENVIRONMENT"
echo -e "...SUCCESS\n"

Once the selection of the environment is out of the way, the deployed files are mutated by executing a substitution routine written in Python which does most of the heavy lifting (replacing any tokens of the format @@KEY@@ in the appropriate files).
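
The actual routine is Python and I won’t reproduce it here, but the core idea is simple enough that a rough shell equivalent fits in a few lines (the settings file layout and paths are assumptions for the sake of the sketch):

# Rough sketch of the idea only (the real implementation is Python): replace @@KEY@@ tokens in the
# deployed files with values from a KEY=VALUE settings file chosen based on the environment above.
SETTINGS_FILE="$RUN_DIR/config/environments/$ENVIRONMENT.settings"

while IFS='=' read -r key value; do
    [ -z "$key" ] && continue
    echo "Substituting @@$key@@"
    find "$RUN_DIR/config" "$RUN_DIR/rules" -type f -exec sed -i "s|@@$key@@|$value|g" {} +
done < "$SETTINGS_FILE"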

To Be Continued

I’ve covered the two biggest challenges in the deployment of our Elastalert configuration, but I’ve glossed over quite a few pieces of the process because covering the entire thing in this blog post would make it way too big.

The best way to really understand how it works is to have a look at the actual repository.

With both the environment and configuration explained, all that is really left to do is bring it all together, and explain some areas that I think could use improvement.

That’s a job for next week though.