
Continuous delivery of your software is something you should aspire to, even if you may never quite reach that lofty goal of “code gets committed, changes available in production”.

There is just so much business value to be had by making sure that your improvements, features and bug fixes are always in the hands of your users, rather than spending dead time waiting for the stars to align and that mysterious release window to open up.

Of course, it can all take an incredible amount of engineering effort, so as with everything in software, you probably need to think about what exactly you are trying to accomplish and how much you’re willing to pay for it.

Along the way, you’ll run across a wide variety of situations that make constant deployments challenging, and that’s where this post gets relevant. One of my teams has recently been made responsible for an API whose core function is to execute tasks that might take up to an hour to run (spoiler alert: it’s data migration), which means being able to arbitrarily deploy our changes whenever we want is just not a capability we have right now.

In fact, the entire deployment process for this particular API is a bit of a special flower, differing in a number of ways from the rest of the organization.

And special flowers are not to be tolerated.

It’s A Bit Of A Marathon

I’ve written about this a few times already, but our organization (like most) has an old, highly profitable piece of legacy desktop software. Its future is…limited, to say the least, so we’re engaged in a long-term project to build a replacement that offers similar (but better) features using a SaaS (Software as a Service) model.

Ideally, we want to make the transition from old and busted to new hotness as easy as possible for every single one of our users, so there is a huge amount of value to be gained by investing in a reliable and painless migration process.

We’re definitely not there yet, but it’s getting closer with every deployment that we make.

Architecturally, we have a specialist migration tool, backed by an API, all of which is completely separate from the main user interface to the system. At this point in time, migrations are executed by an internal team, but the dream is that the user will be able to do it themselves.

The API is basically a fancy ETL, in that it gets data from some source (extract), transforms it into a format that works for the cloud product (transform) and then injects everything as appropriate via the cloud APIs (load). It’s written for the JVM (specifically in Kotlin) and leverages Spring for its API-ness, and Spring Batch for job scheduling and management. Deployment wise, the API is encapsulated in a Docker image (output from our build process) and when it’s time to ship a new version, the existing Docker containers are simply replaced with new ones in a controlled fashion.
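To give a feel for the shape of the thing, here’s a minimal sketch of what a Spring Batch job definition for a migration might look like in Kotlin. The job and step names are hypothetical stand-ins, and it assumes the Spring Batch 5 style builders; the real pipeline obviously does a lot more than this.

```kotlin
import org.springframework.batch.core.Job
import org.springframework.batch.core.Step
import org.springframework.batch.core.job.builder.JobBuilder
import org.springframework.batch.core.repository.JobRepository
import org.springframework.batch.core.step.builder.StepBuilder
import org.springframework.batch.repeat.RepeatStatus
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.transaction.PlatformTransactionManager

@Configuration
class MigrationJobConfig {

    // One job, three steps, executed strictly in sequence
    @Bean
    fun migrationJob(
        jobRepository: JobRepository,
        extractStep: Step,
        transformStep: Step,
        loadStep: Step
    ): Job = JobBuilder("customerMigration", jobRepository)
        .start(extractStep)   // pull data out of the legacy system
        .next(transformStep)  // reshape it for the cloud product
        .next(loadStep)       // push it in via the cloud APIs
        .build()

    // The transform and load steps would be declared the same way
    @Bean
    fun extractStep(
        jobRepository: JobRepository,
        transactionManager: PlatformTransactionManager
    ): Step = StepBuilder("extract", jobRepository)
        .tasklet({ _, _ -> RepeatStatus.FINISHED }, transactionManager) // real work goes here
        .build()
}
```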

More importantly to the blog post at hand, each migration is a relatively long running task that executes a series of steps in sequence in order to get customer data from legacy system A into new shiny cloud system B.

Combine “long running uninterruptible task” with “container replacement” and you get in-flight migrations being terminated every time a deployment occurs, which in turn leads to the fear, and we all know where fear leads.

A manual deployment process, gated by another manual “hey, are you guys running any migrations” process.

Waiting At The Finish Line

To allow for arbitrary deployments, one of the simplest solutions is to have the deployment process simply wait for any in-flight migrations to complete before it does anything destructive.

Automation wise, the approach is pretty straightforward:

  • Implement a new endpoint on the API that reports whether the API instance can be “shut down”, using in-memory information about migrations in flight
  • Change the deployment process to use this endpoint to decide whether or not the container is ready to be replaced

With the way we’re using Spring Batch, each “migration” job runs from start to finish on a single API instance, so it’s simple enough to just increment an in-memory count whenever a job starts, and decrement it when the job finishes (or fails).
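As a sketch of what that might look like (the endpoint path and response shape are made up for illustration), a JobExecutionListener can maintain the count and a tiny controller can expose it. The listener would still need to be registered on the job via .listener(...) for the callbacks to fire.

```kotlin
import org.springframework.batch.core.JobExecution
import org.springframework.batch.core.JobExecutionListener
import org.springframework.http.HttpStatus
import org.springframework.http.ResponseEntity
import org.springframework.stereotype.Component
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RestController
import java.util.concurrent.atomic.AtomicInteger

// Tracks how many migrations this particular instance is running right now
@Component
class InFlightMigrationTracker : JobExecutionListener {
    private val inFlight = AtomicInteger(0)

    override fun beforeJob(jobExecution: JobExecution) {
        inFlight.incrementAndGet()
    }

    // afterJob fires on success and failure alike, so the count always comes back down
    override fun afterJob(jobExecution: JobExecution) {
        inFlight.decrementAndGet()
    }

    fun canShutdown(): Boolean = inFlight.get() == 0
}

// Hypothetical endpoint the deployment process polls on each container
@RestController
class ShutdownController(private val tracker: InFlightMigrationTracker) {

    @GetMapping("/admin/can-shutdown")
    fun canShutdown(): ResponseEntity<Map<String, Boolean>> {
        val ok = tracker.canShutdown()
        val status = if (ok) HttpStatus.OK else HttpStatus.CONFLICT
        return ResponseEntity.status(status).body(mapOf("canShutdown" to ok))
    }
}
```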

The deployment process then just waits for each container to state whether or not it’s allowed to shut down before tearing anything down. Specifically, each individual container needs to be asked (not the API through the load balancer), because each one has its own in-memory state.
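The deployment side then reduces to a polling loop along these lines. The port, path and thirty-second interval are assumptions, and the real deployment automation is unlikely to be written in Kotlin, but the idea is the same: hit each container by its private address rather than going through the load balancer.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Keep polling every container directly until all of them report that they
// have no migrations in flight and are therefore safe to replace.
fun waitForContainersToDrain(containerAddresses: List<String>, pollSeconds: Long = 30) {
    val client = HttpClient.newHttpClient()
    var waitingOn = containerAddresses
    while (waitingOn.isNotEmpty()) {
        waitingOn = waitingOn.filterNot { address ->
            val request = HttpRequest.newBuilder()
                .uri(URI.create("http://$address:8080/admin/can-shutdown"))
                .GET()
                .build()
            val response = client.send(request, HttpResponse.BodyHandlers.ofString())
            response.statusCode() == 200 // 200 means this instance is safe to shut down
        }
        if (waitingOn.isNotEmpty()) Thread.sleep(pollSeconds * 1000)
    }
}
```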

This approach has the unfortunate side effect of making it hard to reason about how long a deployment might actually take though, as a migration in flight could have anywhere from a few seconds to a few hours of runtime left, and the deployment cannot continue until all migrations have finished. Additionally, if another migration is started while the process is waiting for any in-flight migrations to complete, it might never get the chance to continue, which is troublesome.

That is, unless you put the API into “maintenance mode” or something, where it’s not allowed to start new migrations. That’s downtime though, which isn’t continuous delivery.

Running More Than One Race

A slight tweak to the first solution is to allow for some parallel execution between the old and new containers:

  • Spin up as many new version API containers as necessary
  • Take all of the old containers out of the load balancer (or equivalent) so no new migrations can be started on them
  • Leave the old ones around, but only until they finish up their migrations, and then terminate

This allows for continuous operation of the service (which is in line with the original goal, of the user not knowing that anything is going on behind the scenes), but can lead to complications.

The main one is what happens if the new API version contains any database updates. Those might make the database incompatible with the old version, which would cause everything to explode. Obviously, there is value in making sure that changes are at least one version backwards compatible, but that can be hard to enforce automatically, and it’s dangerous to just leave it up to people to remember.

The other complication is that this approach assumes that the new containers can answer requests about the jobs running on the old containers (i.e. status), which is probably true if everything is behind a load balancer anyway, but it’s still something to be aware of.
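Assuming the Spring Batch JobRepository is backed by a shared database (its usual setup), that assumption mostly holds, because any instance can answer a status query by reading the job metadata back out rather than relying on its own memory. A minimal sketch, with a hypothetical endpoint path and response shape:

```kotlin
import org.springframework.batch.core.explore.JobExplorer
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.PathVariable
import org.springframework.web.bind.annotation.RestController

// Works no matter which container answers, because JobExplorer reads the
// job metadata tables in the shared JobRepository database rather than
// any in-memory state on this particular instance.
@RestController
class MigrationStatusController(private val jobExplorer: JobExplorer) {

    @GetMapping("/migrations/{executionId}/status")
    fun status(@PathVariable executionId: Long): Map<String, String?> {
        val execution = jobExplorer.getJobExecution(executionId)
        return mapOf(
            "executionId" to executionId.toString(),
            "status" to execution?.status?.toString()
        )
    }
}
```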

So again, not an ideal solution, but at least it maintains availability while doing its thing.

Or We Could Just Do Sprints

If you really want to offer continuous delivery with something that does long running background tasks, the answer is to not have long running background tasks.

Well, to be fair, the entire operation can be long running, but it needs to be broken down into smaller atomic elements.

With smaller constituent elements, you can use a very similar process to the solutions above:

  • Spin up a bunch of new containers, have them automatically pick up new tasks as they become available
  • Stop traffic going directly to the old containers
  • Mark old containers as “to be shutdown” so they don’t grab new tasks
  • Wait for each old container to be “finished” and murder it
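A rough sketch of the worker side of that model, under the assumption that tasks come off some shared queue; MigrationTask, the queue interface and the drain flag are all stand-ins for whatever the real mechanism would end up being:

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical shared queue of small, atomic pieces of migration work
interface TaskQueue {
    fun nextTask(): MigrationTask?   // null when nothing is currently available
}

interface MigrationTask {
    fun execute()
}

class Worker(private val queue: TaskQueue) {
    private val draining = AtomicBoolean(false)

    // The deployment process flips this on old containers before replacing them
    fun startDraining() = draining.set(true)

    fun run() {
        while (!draining.get()) {
            val task = queue.nextTask()
            if (task == null) {
                Thread.sleep(5_000)   // nothing to do right now, back off briefly
                continue
            }
            task.execute()            // each task is short, so draining is bounded
        }
    }
}
```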

You get a much tighter deployment cycle, and you also get the nice side effect of potentially allowing for parallelisation of tasks across multiple containers (if there are tasks that allow for it, obviously).

Conclusion

For our API, we went with the first option (wait for long running tasks to finish, then shut down), mostly because it was the simplest, and with the assumption that we’ll probably revisit it in time. It’s still a vast improvement over the manual “only do deployments when the system is not in use” approach, so I consider it a win.

More generally, and to echo my opening statement, the idea of “continuous delivery” is something that should be aimed for at the very least, even if you might not make it all the way. When you’re free to deploy any time that you want, you gain a lot of flexibility in the way that you are able to react to things, especially bug fixes and the like.

Also, each deployment is likely to be smaller, as you don’t have to wait for an “acceptable window” and bundle up a bunch of stuff together when that window arrives. This means that if you do get a failure (and you probably will), it’s much easier to reason about what might have gone wrong.

Mostly I’m just really tired of only being able to deploy on Sundays though, so anything that stops that practice is amazing from my point of view.