
Full disclosure, most of the Elastalert related work was actually done by a colleague of mine, I’m just writing about it because I thought it was interesting.

Continuing with the Elastalert theme, it’s time to talk configuration and the deployment thereof.

Last week I covered off exactly how we put together the infrastructure for the Elastalert stack. It wasn’t anything fancy (AMI through Packer, CloudFormation template deployed via Octopus), but there were some tricksy bits relating to Python conflicts between Elastalert and the built-in AWS EC2 initialization scripts.

With that out of the way, we get into the meatiest part of the process: how we manage the configuration of Elastalert, i.e. the alerts themselves.

The Best Laid Plans

When it comes to configuring Elastalert, there are basically only two things to worry about: the overall configuration, and the rules and actions that make up the alerts.

The overall configuration covers things like where to find Elasticsearch, which Elasticsearch index to write results into, high level execution timings and so on. All that stuff is covered clearly in the documentation, and there aren’t really any surprises.

The rules are where it gets interesting. There are a wide variety of ways to trigger actions off the connected Elasticsearch cluster, and I provided an example in the initial blog post of this series. I’m not going to go into too much detail about the rules and their structure or capabilities because the documentation goes into that sort of thing at length. For the purposes of this post, the main thing to be aware of is that each rule is fully encapsulated within a file.

The nice thing about everything being inside files is that it makes deployment incredibly easy.

All you have to do is identify the locations where the files are expected to be and throw the new ones in, overwriting as appropriate. If you’re dealing with a set of files it’s usually smart to clean out the destination first (so deletions are handled correctly), but it’s still pretty straightforward.
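As a trivial illustration, the core of that kind of deployment is only a few lines of shell (the paths here are made up for the example, not our actual layout):

# Clean out the destination so deleted rules actually disappear, then copy
# the new set in. Paths are illustrative.
rm -rf /opt/elastalert/rules/*
cp -r ./rules/. /opt/elastalert/rules/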

When we started on the whole Elastalert journey, the original plan was for a simple file copy + service restart.

Then Docker came along.

No Plan Survives Contact With The Enemy

To be fair, even with Docker, the original plan was still valid.

All of the configuration was still file based, so deployment was still as simple as copying some files around.

Mostly.

Docker did complicate a few things though. Instead of Elastalert being installed directly on the machine, we had to run an Elastalert image inside a Docker container.

Supplying the configuration files to the Elastalert container isn’t hard. When starting the container you just map certain local directories to directories in the container and it all works pretty much as expected. As long as the files exist in a known place after deployment, you’re fine.

However, in order to “restart” Elastalert, you have to find and murder the container you started last time, and then start up a new one so it will capture the new configuration files and environment variables correctly.

This is all well and good, but even after doing that you only really know whether or not the container itself is running, not necessarily the Elastalert process inside the container. If your config is bad in some way, the Elastalert process won’t start, even though the container will quite happily keep chugging along. So you need something to detect if Elastalert itself is up inside the container.

Putting all of the above together, you get something like this:

echo -e "STEP: Stop and remove existing docker containers..."
echo "Checking for any existing docker containers"
RUNNING_CONTAINERS=$(docker ps -a -q)
if [ -n "$RUNNING_CONTAINERS" ]; then
    echo "Found existing docker containers."
    echo "Stopping the following containers:"
    docker stop $(docker ps -a -q)
    echo "Removing the following containers:"
    docker rm $(docker ps -a -q)
    echo "All containers removed"
else
    echo "No existing containers found"
fi
echo -e "...SUCCESS\n"

echo -e "STEP: Run docker container..."
ELASTALERT_CONFIG_FILE="/opt/config/elastalert.yaml"
SUPERVISORD_CONFIG_FILE="/opt/config/supervisord.conf"
echo "Elastalert config file: $ELASTALERT_CONFIG_FILE"
echo "Supervisord config file: $SUPERVISORD_CONFIG_FILE"
echo "ES HOST: $ES_HOST"
echo "ES PORT: $ES_PORT"
docker run -d \
    -v $RUN_DIR/config:/opt/config \
    -v $RUN_DIR/rules:/opt/rules \
    -v $RUN_DIR/logs:/opt/logs \
    -e "ELASTALERT_CONFIG=$ELASTALERT_CONFIG_FILE" \
    -e "ELASTALERT_SUPERVISOR_CONF=$SUPERVISORD_CONFIG_FILE" \
    -e "ELASTICSEARCH_HOST=$ES_HOST" \
    -e "ELASTICSEARCH_PORT=$ES_PORT" \
    -e "SET_CONTAINER_TIMEZONE=true" \
    -e "CONTAINER_TIMEZONE=$TIMEZONE" \
    --cap-add SYS_TIME \
    --cap-add SYS_NICE $IMAGE_ID
if [ $? != 0 ]; then
    echo "docker run command returned a non-zero exit code."
    echo -e "...FAILED\n"
    exit -1
fi
CID=$(docker ps --latest --quiet)
echo "Elastalert container with ID $CID is now running"
echo -e "...SUCCESS\n"

echo -e "STEP: Checking for Elastalert process inside container..."
echo "Waiting 10 seconds for elastalert process"
sleep 10
if docker top $CID | grep -q elastalert; then
    echo "Found running Elastalert process. Nice."
else
    echo "Did not find elastalert running"
    echo "You can view logs for the container with: docker logs -f $CID"
    echo "You can shell into the container with: docker exec -it $CID sh"
    echo -e "...FAILURE\n"
    exit -1
fi
echo -e "...SUCCESS\n"

But wait, there’s more!

Environmental Challenges

Our modus operandi is to have multiple copies of our environments (CI, Staging, Production) which form something of a pipeline for deployment purposes. I’ve gone through this sort of thing in the past, the most recent occurrence of which was when I wrote about rebuilding the ELK stack. It’s a fairly common pattern, but it does raise some interesting challenges, especially around configuration.

For Elastalert specifically, each environment should have the same baseline behaviour (rules, timings, etc), but also different settings for things like where the Elasticsearch cluster is located, or which Hipchat room notifications go to.

When using Octopus Deploy, the normal way to accomplish this is to have variables defined in your Octopus Deploy project that are scoped to the environments being deployed to, and then leverage some of the built in substitution functionality to do replacements in whatever files need to be changed.

This works great at first, but has a few limitations:

  • You now have two places to look when trying to track changes, which can become a bit of a pain. It’s much nicer to be able to view all of the changes (barring sensitive credentials of course) in your source control tool of choice.
  • You can’t easily develop and test the environment outside of Octopus, especially if your deployment is only valid after passing through a successful round of substitutions in Octopus Deploy.

Keeping those two things in mind, we now lean towards having all of our environment specific parameters and settings in configuration files in source control (barring sensitive variables, which require some additional malarkey), and then loading the appropriate file based on some high level flags that are set either by Octopus or in the local development environment.

For Elastalert specifically we settled into having a default configuration file (which is always loaded) and then environment specific overrides. Which environment the deployment is executing in is decided by the following snippet of code:

echo -e "STEP: Determining Environmnet..."
if [ "$(type -t get_octopusvariable)" = function ]; then
    echo "get_octopusvariable function is defined => assuming we are running on Octopus"
    ENVIRONMENT=$(get_octopusvariable "Octopus.Environment.Name")
elif [ -n "$ENVIRONMENT" ]; then
    echo "--environment command line option was used"
else
    echo "Not running on Octopous and no --environment command line option used. Using 'Default'"
    ENVIRONMENT="Default"
fi
echo "ENVIRONMENT=$ENVIRONMENT"
echo -e "...SUCCESS\n"

Once the selection of the environment is out of the way, the deployed files are mutated by executing a substitution routine written in Python which does most of the heavy lifting (replacing any tokens of the format @@KEY@@ in the appropriate files).
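The real routine is Python, but conceptually it boils down to something like the following shell sketch (the environment file location and its key=value format are assumptions for illustration, not our actual implementation):

# Read KEY=VALUE pairs for the selected environment and replace @@KEY@@
# tokens in the deployed configuration files.
while IFS='=' read -r KEY VALUE; do
    [ -z "$KEY" ] && continue
    sed -i "s|@@${KEY}@@|${VALUE}|g" "$RUN_DIR/config/elastalert.yaml" "$RUN_DIR"/rules/*.yaml
done < "$RUN_DIR/config/environments/$ENVIRONMENT.properties"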

To Be Continued

I’ve covered the two biggest challenges in the deployment of our Elastalert configuration, but I’ve glossed over quite a few pieces of the process because covering the entire thing in this blog post would make it way too big.

The best way to really understand how it works is to have a look at the actual repository.

With both the environment and configuration explained, all that is really left to do is bring it all together, and explain some areas that I think could use improvement.

That’s a job for next week though.


In last week’s post I explained how we would have liked to do updates to our Elasticsearch environment using CloudFormation. Reality disagreed with that approach, and we encountered timing problems as a result of the ES cluster and CloudFormation not talking to one another during the update.

Of course, that means that we need to come up with something ourselves to accomplish the same result.

Move In, Now Move Out

Obviously the first thing we have to do is turn off the Update Policy for the Auto Scaling Groups containing the master and data nodes. With that out of the way, we can safely rely on CloudFormation to update the rest of the environment (including the Launch Configuration describing the EC2 instances that make up the cluster), safe in the knowledge that CloudFormation is ready to create new nodes, but will not until we take some sort of action.

At that point it’s just a matter of controlled node replacement using the auto healing capabilities of the cluster.

If you terminate one of the nodes directly, the AWS Auto Scaling Group will react by creating a replacement EC2 instance, and it will use the latest Launch Configuration for this purpose. When that instance starts up it will get some configuration deployed to it by Octopus Deploy, and shortly afterwards will join the cluster. With a new node in play, the cluster will react accordingly and rebalance, moving shards and replicas to the new node as necessary until everything is balanced and green.

This sort of approach can be written in just about any scripting language; our poison of choice is Powershell, which was then embedded inside the environment Nuget package to be executed whenever an update occurs.

I’d copy the script here, but it’s pretty long and verbose, so here is the high level algorithm instead:

  • Iterate through the master nodes in the cluster
    • Check the version tag of the EC2 instance behind the node
    • If equal to the current version, move on to the next node
    • If not equal to the current version
      • Get the list of current nodes in the cluster
      • Terminate the current master node
      • Wait for the cluster to report that the old node is gone
      • Wait for the cluster to report that the new node exists
  • Iterate through the data nodes in the cluster
    • Check the version tag of the EC2 instance behind the node
    • If equal to the current version, move on to the next node
    • If not equal to the current version
      • Get the list of current nodes in the cluster
      • Terminate the current data node
      • Wait for the cluster to report that the old node is gone
      • Wait for the cluster to report that the new node exists
      • Wait for the cluster to go yellow (indicating rebalancing is occurring)
      • Wait for the cluster to go green (indicating rebalancing is complete). This can take a while, depending on the amount of data in the cluster

As you can see, there isn’t really all that much to the algorithm, and the hardest part of the whole thing is knowing that you should wait for the node to leave/join the cluster and for the cluster to rebalance before moving on to the next replacement.

If you don’t do that, you risk destroying the cluster by taking away too many of its parts before it’s ready (which was exactly the problem with leaving the update to CloudFormation).
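To make that concrete, here is a rough shell sketch of the data node half of the algorithm above. The real script is Powershell and considerably more defensive; the tag name, variables and timeouts below are assumptions for illustration.

# Replace any data node whose EC2 "Version" tag does not match the current
# deployment, one node at a time, waiting for the cluster to recover in between.
for INSTANCE_ID in $(aws ec2 describe-instances \
    --filters "Name=tag:aws:autoscaling:groupName,Values=$DATA_ASG_NAME" "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].InstanceId" --output text); do

    TAGGED_VERSION=$(aws ec2 describe-tags \
        --filters "Name=resource-id,Values=$INSTANCE_ID" "Name=key,Values=Version" \
        --query "Tags[0].Value" --output text)
    [ "$TAGGED_VERSION" = "$CURRENT_VERSION" ] && continue

    NODE_COUNT=$(curl -s "$ES_URL/_cat/nodes?h=name" | wc -l)
    aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"

    # Wait for the old node to leave the cluster...
    while [ "$(curl -s "$ES_URL/_cat/nodes?h=name" | wc -l)" -ge "$NODE_COUNT" ]; do sleep 30; done
    # ...and for its replacement to join.
    until [ "$(curl -s "$ES_URL/_cat/nodes?h=name" | wc -l)" -ge "$NODE_COUNT" ]; do sleep 30; done

    # Wait for rebalancing to start (yellow) and then finish (green); the
    # green wait can take a long time on a cluster with a lot of data.
    curl -s "$ES_URL/_cluster/health?wait_for_status=yellow&timeout=30m" > /dev/null
    curl -s "$ES_URL/_cluster/health?wait_for_status=green&timeout=120m" > /dev/null
done

The master node loop is the same shape, just without the yellow/green waits at the end.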

Hands Up, Now Hands Down

For us, the most common reason to run an update on the ELK environment is when there is a new version of Elasticsearch available. Sure, we run updates to fix bugs and tweak things, but those are generally pretty rare (and will get rarer as time goes on and the stack gets more stable).

As a general rule of thumb, assuming you don’t try to jump too far all at once, new versions of Elasticsearch are pretty easily integrated.

In fact, you can usually have nodes in your cluster at the new version while there are still active nodes on the old version, which is nice.

There are at least two caveats that I’m aware of though:

  • The latest version of Kibana generally doesn’t work when you point it towards a mixed cluster. It requires that all nodes are running the same version.
  • If new indexes are created in a mixed cluster, and the primary shards for that index live on a node running the latest version, nodes with the old version cannot be assigned replicas.

The first one isn’t too problematic. As long as we do the upgrade overnight (unattended), no-one will notice that Kibana is down for a little while.

The second one is a problem though, especially for our production cluster.

We use hourly indexes for Logstash, so a new index is generally created every hour or so. Unfortunately it takes longer than an hour for the cluster to rebalance after a node is replaced.

This means that the cluster is almost guaranteed to be stuck in yellow status (indicating unassigned shards, in this case the replicas from the new index that cannot be assigned to the old nodes), which means that our whole “wait for green before continuing” process is not going to work properly when we do a version upgrade on the environment that actually matters: production.

Lucky for us, the API for Elasticsearch is pretty amazing, and allows you to get all of the unassigned shards, along with the reason why they were unassigned.

What this means is that we can keep our process the same, and when the “wait for green” part of the algorithm times out, we can check to see whether or not the remaining unassigned shards are just version conflicts, and if they are, just move on.

Works like a charm.
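For reference, the check itself is just a couple of API calls; a minimal sketch, assuming curl and a reachable cluster endpoint (ES_URL is illustrative):

# List unassigned shards along with the reason they are unassigned.
curl -s "$ES_URL/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
# Ask the cluster to explain the first unassigned shard it finds; if the only
# blocker is that the target node is running an older version, it is safe to
# move on to the next replacement.
curl -s "$ES_URL/_cluster/allocation/explain?pretty"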

Tell Me What You’re Gonna Do Now

The last thing that we need to take into account during an upgrade is related to Octopus Tentacles.

Each Elasticsearch node that is created by the Auto Scaling Group registers itself as a Tentacle so that it can have the Elasticsearch configuration deployed to it after coming online.

With us terminating nodes constantly during the upgrade, we generate a decent number of dead Tentacles in Octopus Deploy, which is not a situation you want to be in.

The latest versions (3+ I think) of Octopus Deploy allow you to automatically remove dead tentacles whenever a deployment occurs, but I’m still not sure how comfortable I am with that piece of functionality. It seems like if your Tentacle is dead for a bad reason (i.e. it’s still there, but broken) then you probably don’t want to just clean it up and keep on chugging along.

At this point I would rather clean up the Tentacles that I know to be dead because of my actions.

As a result of this, one of the outputs from the upgrade process is a list of the EC2 instances that were terminated. We can easily use the instance name to look up the Tentacle in Octopus Deploy and remove it.
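A rough sketch of that cleanup, using the Octopus REST API (the variable names and jq filtering are assumptions, not our actual script):

# Find the Tentacle registered under the terminated instance's name and
# remove it. Requires an API key with the appropriate permissions.
MACHINE_ID=$(curl -s -H "X-Octopus-ApiKey: $OCTOPUS_API_KEY" "$OCTOPUS_URL/api/machines/all" \
    | jq -r --arg name "$TERMINATED_INSTANCE_NAME" '.[] | select(.Name == $name) | .Id')
curl -s -X DELETE -H "X-Octopus-ApiKey: $OCTOPUS_API_KEY" "$OCTOPUS_URL/api/machines/$MACHINE_ID"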

Conclusion

What we’re left with at the end of this whole adventure is a fully automated process that allows us to easily deploy changes to our ELK environment and be confident that not only have all of the AWS components updated as we expect them to, but that Elasticsearch has been upgraded as well.

Essentially exactly what we would have had if the CloudFormation update policy had worked the way that I initially expected it to.

Speaking of which, it would be nice if AWS gave a little bit more control over that update policy (like timing, or waiting for a signal from a system component before moving on), but you can’t win them all.

Honestly, I wouldn’t be surprised if there was a way to override the whole thing with a custom behaviour, or maybe a custom CloudFormation resource or something, but I wouldn’t even know where to start with that.

I’ve probably run the update process around 10 times at this point, and while I usually discover something each time I run it, each tweak makes it more and more stable.

The real test will be what happens when Elastic.co releases version 6 of Elasticsearch and I try to upgrade.

I foresee explosions.


It’s been a little while since I made a post on Elasticsearch. Time to remedy that.

Our log stack has been humming along relatively well ever since we took control of it. It’s not perfect, but it’s much better than it was.

One of the nicest side effects of the restructure has been the capability to test our changes in the CI/Staging environments before pushing them into Production. It’s saved us from a few boneheaded mistakes already (mostly just ES configuration blunders), which has been great to see. It does make pushing things into the environments we actually care about a little bit slower than it otherwise would be, but I’m willing to make that tradeoff for a bit more safety.

When I was putting together the process for deploying our log stack (via Nuget, Powershell and Octopus Deploy), I tried to keep in mind what it would be like when I needed to deploy an Elasticsearch version upgrade. To be honest, I thought I had a pretty good handle on it:

  • Make an AMI with the new version of Elasticsearch on it
  • Change the environment definition to reference this new AMI instead of the old one
  • Deploy the updated package, leveraging the Auto Scaling Group instance replacement functionality
  • Dance like no-one is watching

The dancing part worked perfectly. I am a graceful swan.

The rest? Not so much.

Rollin’, Rollin’

I think the core issue was that I had a little bit too much faith in Elasticsearch to react quickly and robustly in the face of random nodes dying and being replaced.

Don’t get me wrong, it’s pretty amazing at what it does, but there are definitely situations where it is understandably incapable of adjusting and balancing itself.

Case in point, the process that occurs when an AWS Auto Scaling Group starts doing a rolling update because the definition of its EC2 instance launch configuration has changed.

When you use CloudFormation to initialize an Auto Scaling Group, you define the instances inside that group through a configuration structure called a Launch Configuration. This structure contains the definition of your EC2 instances, including the base AMI, security groups, tags and other meta information, along with any initialization that needs to be performed on startup (user data, CFN init, etc).

Inside the Auto Scaling Group definition in the template, you decide what should be the appropriate reaction upon detecting changes to the launch configuration, which mostly amounts to a choice between “do nothing” or “start replacing the instances in a sane way”. That second option is referred to as a “rolling update”, and you can specify a policy in the template for how you would like it to occur.
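For reference, the rolling update policy sits on the Auto Scaling Group resource in the template and looks something like this (YAML form, with illustrative values rather than our actual settings):

UpdatePolicy:
  AutoScalingRollingUpdate:
    MinInstancesInService: "1"
    MaxBatchSize: "1"
    PauseTime: PT15M
    WaitOnResourceSignals: false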

For our environment, a new ES version means a new AMI, so theoretically, it should be a simple matter to update the Launch Configuration with the new AMI and push out an update, relying on the Auto Scaling Group to replace the old nodes with the new ones, and relying on Elasticsearch to rebalance and adjust as appropriate.

Not that simple unfortunately, as I learned when I tried to apply it to the ES Master and Data ASGs in our ELK template.

Whenever changes were detected, CloudFormation would spin up a new node, wait for it to complete its initialization (which was just machine up + octopus tentacle registration), then it would terminate an old node and rinse and repeat until everything was replaced. This happened for both the master nodes and data nodes at the same time (two different Auto Scaling Groups).

Elasticsearch didn’t stand a chance.

With no feedback loop between ES and CloudFormation, there was no way for ES to tell CloudFormation to wait until it had rebalanced the cluster, replicated the shards and generally recovered from the traumatic event of having a piece of itself ripped out and thrown away.

The end result? Pretty much every scrap of data in the environment disappeared.

Good thing it was a scratch environment.

Rollin’, Rollin’

Sticking with the whole “we should probably leverage CloudFormation” approach, I implemented a script to wait for the node to join the cluster and for the cluster to be green (bash scripting is fun!). The intent was that this script would be present in the baseline ES AMI, would be executed as part of the user data during EC2 instance initialization, and would essentially force the auto scaling process to wait for Elasticsearch to actually be functional before moving on.
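The heart of that script was little more than a couple of calls against the cluster; a minimal sketch, assuming the node knows the cluster endpoint and its own node name (both illustrative here):

# Wait for this node to show up in the cluster...
until curl -s "$ES_URL/_cat/nodes?h=name" | grep -q "$NODE_NAME"; do
    sleep 10
done
# ...then wait for the cluster to finish recovering before signalling success.
curl -sf "$ES_URL/_cluster/health?wait_for_status=green&timeout=60m" > /dev/null || exit 1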

This wreaked havoc with the initial environment creation though, as the cluster isn’t even valid until enough master nodes exist to elect a master (three, in our case), so while it kind of worked for the updates, initial creation was broken.

Not only that, but in a cluster with a decent amount of data, the whole “wait for green” thing takes longer than the maximum time allowed for CloudFormation Auto Scaling Group EC2 instance replacements, which would cause the auto scaling to time out and the entire stack to fail.

So we couldn’t use CloudFormation directly.

The downside of that is that CloudFormation is really good at detecting changes and determining if it actually has to do anything, so not only did we need to find another way to update our nodes, we needed to find a mechanism that would safely know when that node update should be applied.

To Be Continued

That’s enough Elasticsearch for now I think, so next time I’ll continue with the approach we actually settled on.


With environments and configuration out of the way, it’s time to put all of the pieces together.

Obviously this isn’t the first time that both of those things have been put together. In order to validate that everything was working as expected, I was constantly creating/updating environments and deploying new versions of the configuration to them. Not only that, but with the way our deployment pipeline works (commit, push, build, test, deploy [test], [deploy]), the CI environment had been up and running for some time.

What’s left to do then?

Well, we still need to create the Staging and Production environments, which should be easy because those are just deployments inside Octopus now.

The bigger chunk of work is to use those new environments and to redirect all of our existing log traffic as appropriate.

Hostile Deployment

This is a perfect example of why I spend time on automating things.

With the environments setup to act just like everything else in Octopus, all I had to do to create a Staging environment was push a button. Once the deployment finished and the infrastructure was created, it was just another button push to deploy the configuration for that environment to make it operational. Rinse and repeat for all of the layers (Broker, Indexer, Cache, Elasticsearch) and Staging is up and running.

Production was almost the same, with one caveat. We use an entirely different AWS account for all of our production resources, so we had to override all of the appropriate Octopus variables for each environment project (like AWS credentials, VPC ID, Subnet IDs, etc). With those overrides in place, all that’s left is to make new releases (to capture the variables) and deploy to the appropriate environments.

It’s nice when everything works.

Redirecting Firepower

Of course, the new logging environments are worthless without log events. Luckily, we have plenty of those:

  • IIS logs from all of our APIs
  • Application logs from all of our APIs
  • ELB logs from a subset of our load balancers, most of which are APIs, but at least one is an Nginx router
  • Simple performance counter statistics (like CPU, Memory, Disk, etc) from basically every EC2 instance
  • Logs from on-premises desktop applications

We generally have CI, Staging and prod-X (green/blue) environments for all of our services/APIs (because that’s how our build/test/deployment pipeline works), so now that we have similarly partitioned logging environments, all we have to do is line them up (CI to CI, Staging to Staging and so on).

For the on-premises desktop applications, there is no CI, but they do generally have the capability to run in Staging mode, so we can use that setting to direct log traffic.

There are a few ways in which the log events hit the Broker layer:

  • Internal Logstash instance running on an EC2 instance with a TCP output pointing at the Broker hostname
  • Internal Lambda function writing directly to the Broker hostname via TCP (this is the ELB logs processor)
  • External application writing to an authenticated Logging API, which in turn writes to the Broker via TCP (this is for the on-premises desktop applications)

We can change the hostname used by all of these mechanisms simply by changing some variables in Octopus Deploy, making a new release and deploying it through the environments.
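For the internal Logstash instances, for example, the relevant piece of configuration is just a tcp output whose destination comes from an Octopus variable; a sketch, where the variable name and port are illustrative:

output {
  tcp {
    # Substituted by Octopus per environment at deployment time.
    host => "#{LogBrokerHostname}"
    port => 5000
    codec => "json_lines"
  }
}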

And that’s exactly what we did, making sure to monitor the log event traffic for each one to make sure we didn’t lose anything.

With all the logs going to their new homes, all that was left to do was destroy the old log stack, easily done manually through CloudFormation.

You might be wondering what happened to the log events that were stored in the old stack. Well, we generally only keep around 14 days’ worth of log events in the stack itself (because there are so many), so we pretty much just left the old stack up for searching purposes until it was no longer relevant, and then destroyed it.

Conclusion

And that basically brings us to the end of this series of posts about our logging environment and the reclamation thereof.

We’ve now got our standard deployment pipeline in place for both infrastructure and configuration and have separated our log traffic accordingly.

This puts us in a much better position moving forward. Not only is the entire thing fully under our control, but we now have the capability to test changes to infrastructure/configuration before just deploying them into production, something we couldn’t do before when we only had a single stack for everything.

In all fairness though, all we really did was polish an existing system so that it was a better fit for our specific usage.

Evolutionary, not revolutionary.


Continuing on from last week, it’s time to talk software and the configuration thereof.

With the environments themselves being managed by Octopus (i.e. the underlying infrastructure), we need to deal with the software side of things.

Four of the five components in the new ELK stack require configuration of some sort in order to work properly:

  • Logstash requires configuration to tell it how to get events, filter/process them and where to put them, so the Broker and Indexer layers require different, but similar, configuration.
  • Elasticsearch requires configuration for a variety of reasons including cluster naming, port setup, memory limits and so on.
  • Kibana requires configuration to know where Elasticsearch is, and a few other things.

For me, configuration is a hell of a lot more likely to change than the software itself, although with the pace that some organisations release software, that might not always be strictly true. Also, coming from a primarily Windows background, software is traditionally a lot more difficult to install and setup, and by that logic is not something you want to do all the time.

Taking those things into account, I’ve found it helpful to separate the installation of the software from the configuration of the software. What this means in practice is that a particular version of the software itself will be baked into an AMI, and then the configuration of that software will be handled via Octopus Deploy whenever a machine is created from the AMI.

Using AMIs or Docker Images to create immutable software + configuration artefacts is also a valid approach, and is superior in a lot of respects. It makes dynamic scaling easier (by facilitating a quick startup of fully functional nodes), helps with testing and generally simplifies the entire process. Docker Images in particular are something that I would love to explore in the future, just not right at this moment.

The good news is that this is pretty much exactly what the new stack was already doing, so we only need to make a few minor improvements and we’re good to go.

The Lament Configuration

As I mentioned in the first post in this series, software configuration was already being handled by TeamCity/Nuget/Octopus Deploy, it just needed to be cleaned up a bit. The first thing to do was to move the configuration out into its own repository for each layer and adjust the TeamCity builds as necessary. The Single Responsibility Principle doesn’t just apply to classes after all.

The next part is something of a personal preference, and it relates to the logic around deployment. All of the existing configuration deployments in the new stack had their logic (i.e. where to copy the files on the target machine, how to start/restart services and so on) encapsulated entirely inside Octopus Deploy. I’m not a fan of that. I much prefer to have all of this logic inside scripts in source control alongside the artefacts that will be deployed. This leaves projects in Octopus Deploy relatively simple, only responsible for deploying Nuget packages, managing variables (which is hard to encapsulate in source control because of sensitive values) and generally overseeing the whole process. This is the same sort of approach that I use for building software, with TeamCity acting as a relatively stupid orchestration tool, executing scripts that live inside source control with the code.

Octopus actually makes using source controlled scripts pretty easy, as it will automatically execute scripts named a certain way at particular times during the deployment of a Nuget package (for example, any script called deploy.ps1 at the root of the package will be executed after the package has been copied to the appropriate location on the target machine). The nice thing is that this also works with bash scripts for Linux targets (i.e. deploy.sh), which is particularly relevant here, because all of the ELK stuff happens on Linux.

Actually deploying most of the configuration is pretty simple. For example, this is the deploy.sh script for the ELK Broker configuration.

# The deploy script is automatically run by Octopus during a deployment, after Octopus does its thing.
# Octopus deploys the contents of the package to /tmp/elk-broker/
# At this point, the logstash configuration directory has been cleared by the pre-deploy script

# Echo commands after expansion
set -x

# Copy the settings file
cp /tmp/elk-broker/logstash.yml /etc/logstash/logstash.yml || exit 1

# Copy the config files
cp /tmp/elk-broker/*.conf /etc/logstash/conf.d/ || exit 1

# Remove the UTF-8 BOM from the config files
sed -i '1 s/^\xef\xbb\xbf//' /etc/logstash/conf.d/*.conf || exit 1

# Test the configuration
sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/ -t --path.settings /etc/logstash/logstash.yml || exit 1

# Set the ownership of the config files to the logstash user, which is what the service runs as
sudo chown -R logstash:logstash /etc/logstash/conf.d || exit 1

# Restart logstash - don't use the restart command, it throws errors when you try to restart a stopped service
sudo initctl stop logstash || true
sudo initctl start logstash

Prior to the script based approach, this logic was spread across five or six different steps inside an Octopus Project, which I found much harder to read and reason about.

Or Lemarchand’s Box If You Prefer

The only other difference worth talking about is the way in which we actually trigger configuration deployments.

Traditionally, we have asked Octopus to deploy the most appropriate version of the necessary projects during the initialization of a machine. For example, the ELK Broker EC2 instances had logic inside their LaunchConfiguration:UserData that said “register as Tentacle”, “deploy X”, “deploy Y” etc.

This time I tried something a little different, but which feels a hell of a lot smarter.

Instead of the machine being responsible for asking for projects to be deployed to it, we can just let Octopus react to the registration of a new Tentacle and deploy whatever Projects are appropriate. This is relatively easy to setup as well. All you need to do is add a trigger to your Project that says “deploy whenever a new machine comes online”. Octopus takes care of the rest, including picking what version is best (which is just the last successful deployment to the environment).

This is a lot cleaner than hardcoding project deployment logic inside the environment definition, and allows for changes to what software gets deployed where without actually having to edit or update the infrastructure definition. This sort of automatic deployment approach is probably more useful for our old way of handling environments (i.e. that whole terrible migration process with no update logic) than it is to the newer, easier to update environment deployments, but it’s still nice all the same.

Conclusion

There really wasn’t much effort required to clean up the configuration for each of the layers in the new ELK stack, but it was a great opportunity to try out the new trigger based deployments in Octopus Deploy, which was pretty cool.

With the configuration out of the way, and the environment creation also sorted, all that’s left is to actually create some new environments and start using them instead of the old one.

That’s a topic for next time though.