
A wild technical post appears!

This week's post returns to a topic very close to my heart: the Elasticsearch, Logstash and Kibana (ELK) Stack that we use for log aggregation. As you might be able to tell from my post history, logging, metrics and business intelligence rank pretty high on my list of priorities, regardless of any other focuses I might have. To me, if you don’t have good intelligence, you might as well be fighting in the dark, flailing about in the hopes that you hit something important.

This post specifically is about the process by which we deploy new versions of Elasticsearch, and an issue that can occur when you do rolling deployments and the Elasticsearch cluster is hosted in AWS.

Version Control

Way back in August 2017 I wrote about automating the deployment of new Elasticsearch versions to our ELK stack.

Long story short, the part of that post that is relevant to this one is the bit about unassigned shards in Elasticsearch when rebalancing after a version upgrade. Specifically, if you have nodes that are at a later version of Elasticsearch than others (which is normal when doing a rolling deployment), and a later version node is elected to hold the primary shard, replicas cannot be assigned to any of the nodes with the lower version.

This is troublesome if you’re waiting for a cluster to go green before progressing to the next node replacement, because unassigned shards equal a yellow cluster. You’ll be waiting forever (or you’ll hit your timeout because you were smart enough to put a timeout in, right?).

Without some additional change, the system will never reach a state of equilibrium.
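For reference, the “wait for green” part of the process is just a polling loop against the cluster health endpoint. A minimal sketch of that loop is below; the function name and default timeout are mine rather than lifted from our actual deployment scripts:

function Wait-ElasticsearchClusterGreen
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl,
        [int]$timeoutSeconds = 1800
    )

    $timeout = [DateTime]::UtcNow.AddSeconds($timeoutSeconds);
    while ([DateTime]::UtcNow -lt $timeout)
    {
        # _cluster/health reports the overall cluster status (green/yellow/red)
        $health = Invoke-RestMethod -Method GET -Uri "$elasticsearchUrl/_cluster/health" -Verbose:$false;
        if ($health.status -eq "green")
        {
            return;
        }
        Write-Verbose "Cluster status is [$($health.status)], waiting for green";
        Start-Sleep -Seconds 30;
    }

    throw "Cluster did not reach green status within [$timeoutSeconds] seconds";
}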

La La La I Can’t Hear You

To elaborate on the content of the initial post, the solution was to check that all remaining unassigned shards were version conflicts whenever an appropriate end state was reached. An end state would be something like a timeout waiting for the cluster to go green, or maybe something fancier like “the number of unassigned shards has not changed over a period of time”.

If the only unassigned shards left are version conflicts, it’s relatively safe to just continue on with the process and let Elasticsearch sort it out (which it will, once all of the nodes are replaced). There is minimal risk of data loss (the primary shards are all guaranteed to exist in order for this problem to happen in the first place), and each time a new node comes online, the cluster will rebalance into a better state anyway.

The script for checking for version conflicts is below:

function Get-UnassignedShards
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl
    )

    $shards = Invoke-RestMethod -Method GET -Uri "$elasticsearchUrl/_cat/shards" -Headers @{"accept"="application/json"} -Verbose:$false;
    $unassigned = $shards | Where-Object { $_.state -eq "UNASSIGNED" };

    return $unassigned;
}

function Test-AllUnassignedShardsAreVersionConflicts
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$elasticsearchUrl
    )

    Write-Verbose "Getting all UNASSIGNED shards, to see if all of them are UNASSIGNED because of version conflicts";

    $unassigned = Get-UnassignedShards -elasticsearchUrl $elasticsearchUrl;

    foreach ($unassignedShard in $unassigned)
    {
        $primary = "true";
        if ($unassignedShard.prirep -eq "r")
        {
            $primary = "false";
        }
        $explainBody = "{ `"index`": `"$($unassignedShard.index)`", `"shard`": $($unassignedShard.shard), `"primary`": $primary }";
        Write-Verbose "Getting allocation explanation using query [$explainBody]";
        $explain = Invoke-RestMethod -Method POST -Uri "$elasticsearchUrl/_cluster/allocation/explain" -Headers @{"accept"="application/json"} -ContentType "application/json" -Body $explainBody -Verbose:$false;

        $versionConflictRegex = "target node version.*is older than the source node version.*";
        $sameNodeConflictRegex = "the shard cannot be allocated to the same node on which a copy of the shard already exists";
        $explanations = @();
        foreach ($node in $explain.node_allocation_decisions)
        {
            foreach ($decider in $node.deciders)
            {
                $explanations += @{Node=$node.node_name;Explanation=$decider.explanation};
            }
        }

        foreach ($explanation in $explanations)
        {
            if ($explanation.explanation -notmatch $versionConflictRegex -and $explanation.explanation -notmatch $sameNodeConflictRegex)
            {
                Write-Verbose "The node [$($explanation.Node)] in the explanation for shard [$($unassignedShard.index):$($unassignedShard.shard)] contained an allocation decider explanation that was unacceptable (i.e. not version conflict and not same node conflict). It was [$($explanation.Explanation)]";
                return $false;
            }
        }
    }

    return $true;
}

In The Zone…Or Out Of It?

This solution works really well for the specific issue that it was meant to detect, but to absolutely no-one's surprise, it doesn’t work so well for other problems.

Case in point, if your Elasticsearch cluster is AWS Availability Zone aware, then you can encounter a very similar problem to what I’ve just described, except with availability zone conflicts instead of version conflicts.

An availability zone aware Elasticsearch cluster will avoid putting shard replicas in the same availability zone as the primary (within reason), which is just another way to protect itself against losing data in the event of a catastrophic failure. I’m sure you can disable the functionality, but that seems like a relatively sane safety measure, so I’m not sure why you would.

Unfortunately, when combined with version conflicts also preventing shard allocation, you can be left in a situation where there is no appropriate place to dump a shard, so our deployment process can’t move on because the cluster never goes green.

Interestingly enough, there are two possible solutions for this:

  • The first is to be more careful about the order that you annihilate nodes in. Alternating availability zones is the way to go here, but this approach can get complicated if you’re also dealing with version conflicts at the same time. Also, it doesn’t really work all that well if you don’t have a full complement of nodes (with redundancy) spread across both availability zones.
  • The second is to just replicate the version conflict solution above, except for unassigned shards that are the result of availability zone conflicts. This is by far the easier and less fiddly approach, assuming that the entire deployment finishes (so the cluster can rebalance as appropriate).

I haven’t actually updated our deployment since I discovered the issue, but my current plan is to go with the second option and see how far I get.
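If it pans out, the change is probably as small as teaching the inner loop of Test-AllUnassignedShardsAreVersionConflicts (above) to accept one more decider explanation. A rough sketch is below; note that the availability zone regex is a guess at the wording of the allocation awareness decider's message, and would need to be verified against the output of the allocation explain endpoint on a real cluster.

$availabilityZoneConflictRegex = "too many copies of the shard allocated to nodes with attribute \[aws_availability_zone\]";

foreach ($explanation in $explanations)
{
    # Acceptable reasons for an unassigned shard: version conflict, same node conflict, or availability zone conflict
    $acceptable = ($explanation.Explanation -match $versionConflictRegex) -or ($explanation.Explanation -match $sameNodeConflictRegex) -or ($explanation.Explanation -match $availabilityZoneConflictRegex);

    if (-not $acceptable)
    {
        Write-Verbose "The node [$($explanation.Node)] contained an allocation decider explanation that was unacceptable. It was [$($explanation.Explanation)]";
        return $false;
    }
}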

Conclusion

This is one of those cases where I knew that the initial solution that I put into place would probably not be enough over the long term, but there was just no value in trying to engineer for every single eventuality right at the start.

Also, truth be told, I didn’t know that Elasticsearch took AWS Availability Zones into account when allocating shards, so that was a bit of a surprise anyway.

Thinking about the actual deployment process some more, it might be easier to scale up, wait for a rebalance and then scale down again, terminating the oldest (and thus earlier version) nodes after all the new ones have already come online. The downside to this approach is mostly just time (because you have to wait for 2N rebalances instead of just N rebalances, where N is the number of nodes), but it feels like it might be more robust in the face of unexpected weirdness.

Which, from my experience, I should probably just start expecting from now on, as it (ironically) seems like the one constant in software.


Another week, another post.

Stepping away from the data synchronization algorithm for a bit, it’s time to explore some AWS weirdness.

Well, strictly speaking, AWS is not really all that weird. It’s just that sometimes things happen, and they don’t immediately make any sense, and it takes a few hours and some new information before you realise “oh, that makes perfect sense”.

That intervening period can be pretty confusing though.

Today’s post is about a disagreement between a Load Balancer and a CloudWatch alarm about just how healthy some EC2 instances were.

Schrödinger's Symptoms

At the start of this whole adventure, we got an alert email from CloudWatch saying that one of our Load Balancers contained some unhealthy instances.

This happens from time to time, but it’s one of those things that you need to get to the bottom of quickly, just in case it’s the first sign of something more serious.

In this particular case, the alert was for one of the very small number of services whose infrastructure we manually manage. That is, the service is hosted on hand crafted EC2 instances, manually placed into a load balancer. That means no auto scaling group, so no capability to self heal or scale if necessary, at least not without additional effort. Not the greatest situation to be in, but sometimes compromises must be made.

Our first step for this sort of thing is typically to log into the AWS dashboard and have a look around, starting with the entities involved in the alert.

After logging in and checking the Load Balancer though, everything seemed fine. No unhealthy instances were being reported. Fair enough, maybe the alarm had already reset and we just hadn’t gotten the followup email (it’s easy enough to forget to configure an alert to send emails when the alarm returns to the OK state).

But no, checking on the CloudWatch alarm showed that it was still triggered.

Two views on the same information: one says “BAD THING HAPPENING”, the other says “nah, it’s cool, I’m fine”.

But which one was right?

Diagnosis: Aggregations

When you’re working with instance health, one of the most painful things is that AWS does not offer detailed logs showing the results of Load Balancer health checks. Sure, you can see whether or not there were any healthy/unhealthy instances, but if you want to know how the machines are actually responding, you pretty much just have to execute the health check yourself and see what happens.

In our case, that meant hitting the EC2 instances directly over HTTP with a request for /index.html.

Unfortunately (fortunately?) everything appeared to be working just fine, and each request returned 200 OK (which is what the health check expects).
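For the record, “execute the health check yourself” amounted to nothing fancier than the following sketch (the instance addresses are placeholders; the real thing just iterated over the instances registered in the Load Balancer):

$instances = @("{instance-1-address}", "{instance-2-address}", "{instance-3-address}", "{instance-4-address}");

foreach ($instance in $instances)
{
    try
    {
        # Mimic the Load Balancer health check, which is a plain HTTP GET against /index.html
        $response = Invoke-WebRequest -Uri "http://$instance/index.html" -UseBasicParsing -TimeoutSec 10;
        Write-Host "[$instance] returned [$($response.StatusCode)]";
    }
    catch
    {
        Write-Host "[$instance] health check failed: [$($_.Exception.Message)]";
    }
}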

Our suspicion then fell to the CloudWatch alarm itself. Perhaps we had misconfigured it and it wasn’t doing what we expected? Maybe it was detecting missing data as an error and then alarming on that. That would still indicate a problem of some sort, but it would at least confirm that the instances were functional, which is what everything else appeared to be saying.

The alarm was correctly configured though, saying the equivalent of “alert when there are more than 0 unhealthy instances in the last 5 minutes”.

We poked around a bit in the other metrics available on the Load Balancer (request count, latency, errors, etc) and discovered that latency had spiked a bit, request count had dropped a bit and there was a tonne of backend errors, so something was definitely wrong.

Returning to the alarm, we noticed that the aggregation on the data point was “average”, which meant that it was actually saying “alert when there are more than 0 unhealthy instances on average over the last 5 minutes”. It’s not obvious what the average value of a health check is over time, but changing the aggregation to minimum showed that there were zero unhealthy instances over the same time period, and changing it to maximum showed that all four of the instances were unhealthy over the same time period.
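To illustrate with some entirely made up data points (not our actual numbers, just the shape of the problem), consider an UnHealthyHostCount metric that flaps between 0 and 4 over the alarm period; the three aggregations tell three very different stories:

# Hypothetical UnHealthyHostCount data points over the 5 minute alarm period
$dataPoints = @(0, 4, 0, 4, 0);

$stats = $dataPoints | Measure-Object -Average -Minimum -Maximum;
Write-Host "Average: $($stats.Average)";  # 1.6, which is > 0, so the alarm fires
Write-Host "Minimum: $($stats.Minimum)";  # 0, everything looks perfectly healthy
Write-Host "Maximum: $($stats.Maximum)";  # 4, everything looks like it is on fire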

Of course, this meant that the instances were flapping between up and down, which meant the health checks were sometimes failing and sometimes succeeding.

It was pure chance that when we looked at the Load Balancer directly it had never shown any unhealthy instances, and similarly, when we manually hit the health endpoint it had always responded appropriately. The CloudWatch alarm remembered though, and the average of [healthy, healthy, unhealthy] was unhealthy as far as it was concerned, so it was alerting correctly.

Long story short, both views of the data were strictly correct, and were just showing slightly different interpretations.

Cancerous Growth

The root cause of the flapping was exceedingly high CPU usage on the EC2 instances in question, which was leading to timeouts.

We’d done a deployment earlier that day, and it had increased the overall CPU usage of two of the services hosted on those instances by enough that the machines were starting to strain with the load.

More concerning though was the fact that the deployment had only really spiked the CPU from 85% to 95-100%. We’d actually been running those particular instances hot for quite a while.

In fact, looking back at the CPU usage over time, there was a clear series of step changes leading up to the latest increase, and we just hadn’t been paying attention.

Conclusion

It can be quite surprising when two different views on the same data suddenly start disagreeing, especially when you’ve never seen that sort of thing happen before.

Luckily for us, both views were actually correct, and it just took a little digging to understand what was going on. There was a moment there where we started thinking that maybe we’d found a bug in CloudWatch (or the Load Balancer), but realistically, at this point in the lifecycle of AWS, that sort of thing is pretty unlikely, especially considering that neither of those systems are exactly new or experimental.

More importantly, now we’re aware of the weirdly high CPU usage on the underlying EC2 instances, so we’re going to have to deal with that.

It’s not like we can auto scale them to deal with the traffic.


I learned a new thing about AWS, instance retirement and auto scaling groups a few weeks ago.

I mean, let’s be honest, the amount of things I don’t know dwarfs the amount of things I do know, but this one in particular was surprising. At the time the entire event was incredibly confusing, and no-one at my work really knew what was going on, but later on, with some new knowledge, it all made perfect sense.

I’m Getting Too Old For This

Let’s go back a step though.

Sometimes AWS needs to murder one of your EC2 instances. As far as I know, this tends to happen when AWS detects failures in the underlying hardware, and they schedule the instance to be “retired”, notifying the account holder as appropriate. Of course, this assumes that AWS notices the failure before it becomes an issue, so if you’re really unlucky, sometimes you don’t get a warning and an EC2 instance just disappears into the ether.

The takeaway here is that you should never rely on just one EC2 instance for any critical functions. This is one of the reasons why auto scaling groups are so useful, because you specify a template for the instances instead of just making one by itself. Of course, if your instances are accumulating important state, then you’ve still got a problem if one goes poof.

Interestingly enough, when you stop (not terminate, just stop) an EC2 instance that is owned by an auto scaling group, the auto scaling group tends to murder it and spin up another one, because it thinks the instance has gone bad and needs to be replaced.

Anyway, I was pretty surprised when AWS scheduled two of the Elasticsearch data nodes in our ELK stack for retirement and:

  1. The nodes hung around in a stopped state, i.e. they didn’t get replaced, even though they were owned by an auto scaling group
  2. AWS didn’t trigger the CloudWatch alarm on the Elasticsearch load balancer that is supposed to detect unhealthy instances, even though the stopped instances were clearly marked unhealthy

After doing some soul searching, I can explain the first point somewhat.

The second point still confuses me though.

You’re Suspended! Hand In Your Launch Configuration And Get Out Of Here

It turns out that when AWS schedules an instance for retirement, it doesn’t necessarily mean the instance is actually going to disappear forever. Well, it won’t if you’re using an EBS volume at least. If you’re just using an instance store volume you’re pretty boned, but those are ephemeral anyway, so you should really know better.

Once the instance is “retired” (i.e. stopped), you can just start it up again. It will migrate to some new (healthy) hardware, and off it goes, doing whatever the hell it was doing in the first place.

However, as I mentioned earlier, if you stop an EC2 instance owned by an auto scaling group, the auto scaling group will detect it as a failure, terminate the instance and spin up a brand new replacement.

Now, this sort of reaction can be pretty dangerous, especially when AWS is the one doing the shutdown (as opposed to the account holder), so AWS does the nice thing and suspends the terminate and launch processes of the auto scaling group, just to be safe.

Of course, the assumption here is that the account holder knows that the processes have been suspended and that some instances are being retired, so they go and restart the stopped instances, resume the auto scaling processes and continue on with their life, singing merrily to themselves.
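For future reference (mostly my own), the “informed account holder” response boils down to a couple of commands. A sketch using the AWS PowerShell cmdlets follows; the instance IDs and group name are placeholders, and this is not something we’ve wired into any automation:

# Start the retired (stopped) instances back up; they will come back on new underlying hardware
Start-EC2Instance -InstanceId @("{stopped-instance-id-1}", "{stopped-instance-id-2}");

# Resume the processes that AWS suspended on the auto scaling group
Resume-ASProcess -AutoScalingGroupName "{elasticsearch-data-node-asg}" -ScalingProcess @("Launch", "Terminate");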

Until this happened to us, I did not even know that suspending auto scaling group processes was an option, let alone that AWS would do it for me. When we happened to notice that two of our Elasticsearch data nodes had become unavailable through Octopus Deploy, I definitely was not the “informed account holder” in the equation, and instead went on an adventure trying to figure out what the hell was going on.

I tried terminating the stopped nodes, in the hopes that they would be replaced, but because the processes were suspended, I got nothing. I tried raising the number of desired instances, but again, the processes were suspended, so nothing happened.

In the end, I created a secondary auto scaling group using the same Launch Configuration and got it to spin up a few instances, which then joined the cluster and helped to settle everything down.

It wasn’t until the next morning, when cooler heads prevailed and I got a handle on what was actually happening, that we cleaned everything up properly.

Conclusion

This was one of those cases where AWS was definitely doing the right thing (helping people to avoid data loss because of an underlying failure out of their control), but a simple lack of knowledge on our part caused a bit of a kerfuffle.

The ironic thing is that if AWS had simply terminated the EC2 instances (which I’ve seen happen before) the cluster would have self-healed and rebalanced perfectly well (as long as only a few nodes were terminated of course).

Like I said earlier, I still don’t know why we didn’t get a CloudWatch alarm when the instances were stopped, as they were definitely marked as “unhealthy” in the Load Balancer. We didn’t even realise that something had gone wrong until someone noticed that the data nodes were reporting as unavailable in Octopus Deploy, and that happened purely by chance.

Granted, we still had four out of six data nodes in service, and we run our shards with one primary and two replicas, so we weren’t exactly in the danger zone, but we were definitely approaching it.

Maybe it’s time to try and configure an alarm on the cluster health.

That’s always nice and colourful.


And thus we come to the end of this particular saga of rewriting our AWS Lambda ELB Logs Processor in C# using .NET Core.

To summarise the posts in this series:

  1. I went through some of the basics around how to use C# and .NET Core to implement an AWS Lambda function
  2. I explained how we constructed the automated build, test and deployment pipeline for the .NET Core ELB Logs Processor
  3. I outlined how we tested the new .NET Core Lambda function, with a specific focus around how we handled the AWS Credentials

In conjunction with those posts, I’ve also uploaded a copy of the entire source code (build and deployment scripts included) to Solavirum.Logging.ELB.Processor.Lambda, mostly as a master reference point. It’s a copy of our internal repository, quickly sanitized to remove anything specific to our organization, so I can’t guarantee that it works perfectly, but there should be enough stuff there to demonstrate the entire process working end to end.

As with all the rest of our stuff, the build script is meant to run from inside TeamCity, and the deployment script is meant to run from inside Octopus Deploy, so if you don’t have either of those things, you will have to use your imagination.

Of course, everything I’ve gone through is purely academic unless it’s running successfully in Production…

Seize The Means Of Production

If you’ve read any of my previous posts about the Javascript/Node.js ELB Logs Processor, you’re probably aware of the fact that it tended to look like it was working fine right up to the point where I published it into production and the sheer weight of traffic caused it to have a meltdown.

Unsurprisingly, the difference between a sample file with a few hundred lines in it and a real file with hundreds of thousands of lines is significant, and that’s not even really taking into account the difference between my local development environment and AWS Lambda.

In an unprecedented result though, the .NET Core implementation worked just fine when I shipped it to our production environment.

Unfortunately it was around twenty times slower.

The old Javascript/Node.js implementation could process a file in around 9 seconds, which is pretty decent.

The .NET Core implementation processes a file in around 200 seconds, which is a hell of a lot slower.

Now, I can explain some of that away with the fact that the .NET Core implementation uses HTTP instead of TCP to communicate with Logstash (which was the primary reason I rewrote it in the first place), but it was still pretty jarring to see.
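To be clear about what “uses HTTP” means here: each batch of processed events just ends up as a JSON POST, presumably against a Logstash http input. The equivalent in PowerShell (rather than the C# the Lambda function actually uses, and with placeholder values) is roughly:

$event = @{
    message = "raw ELB log line goes here";
    Application = "{application}";
    Environment = "{environment}";
} | ConvertTo-Json;

# Assumes a Logstash http input plugin listening at the nominated URL
Invoke-RestMethod -Method POST -Uri "{logstash-url}" -Body $event -ContentType "application/json";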

Interestingly enough, because executing Lambda functions is so cheap, the total cost increase was something like $US8 a month, even though it was twenty times slower, so who really cares.

In the end, while the increase in processing time was undesirable, I’d rather have slower C# and HTTP over faster Javascript and TCP.

Summary

Overall, I’m pretty happy with how this whole .NET Core rewrite went, for a number of reasons:

  • I got to explore and experiment with the newest .NET and C# framework, while delivering actual business value, which was great.
  • In the process of developing this particular solution, I managed to put together some more reusable scripts and components for building other .NET Core components
  • I finally switched the ELB Logs Processor to HTTP, which means that the overall feed of events into our ELK stack is more reliable (TCP was prone to failure)
  • I managed to get a handle on fully automated testing of components that require AWS Credentials, without accidentally checking the credentials into source control

In the end though, I’m mostly just happy because it’s not Javascript.


Last week I wrote a high level overview of the build, package, test and deploy process that we created in order to get .NET Core C# into AWS Lambda.

As part of that post, I briefly touched on the validation part of the process, which basically amounted to “we wrote some tests in C# using XUnit, then used the dotnet command line interface to run them as part of the build”.

Considering that the testing took up a decent amount of the total time, that level of coverage isn’t really appropriate, so this post will elaborate further and explain some of the challenges we encountered and how we overcame them.

I’m A Little Testy

At a high level, testing a .NET Core AWS Lambda function written in C# is not really any more difficult than testing other C# code. Like the stuff I’ve been writing for years.

Unit test classes in isolation, integration test classes together, have a few end to end tests (if possible) and you’re probably in a good place.

When you write a lambda function, all you really need is a single class with a method on it (Handler::Handler or equivalent). You could, if you wanted, put literally all of your logic into this class/method combination and have a fully functional lambda function.

That would probably make it harder to test though.

A better idea is to break it down into its constituent parts, ideally taking the form of small classes with a single responsibility, working together to accomplish a greater goal.

For the ELB Logs Processor, there are three clear responsibilities:

  1. Read a file from S3, and expose the information therein as lines of text
  2. Take a line of text and transform it into some sort of structure, potentially breaking it down and adding meta information 
  3. Take a structure and output it to a Logstash endpoint

Technically there is also a fourth responsibility which is to tie the three things above together (i.e. orchestration), but we’ll get to that in a minute.

Anyway, if the responsibilities are clearly separated, we’re in unit testing territory, which is familiar enough that I don’t really want to talk about it too much. Grab your testing framework of choice (XUnit is a good pick for .NET Core, NUnit didn’t seem to work too well in Visual Studio + Resharper), mock out your dependencies and test away.

You can apply the same sort of testing to the orchestrator (the Handler) by giving it its dependencies through its constructor. Make sure you give it a parameterless constructor with default values though, or you’ll have issues running it in AWS.

This is basically just a poor man’s dependency injection, but my favourite dependency injection library, Ninject, does not exist in .NET Core and I didn’t want to invest the time to learn one of the other ones (like AutoFac) for such a small project.

Of course, while unit tests are valuable, they don’t really tell us whether or not everything will work when you put it all together.

That’s where integration tests come in.

Integration Functions

In my experience, I’ve consistently found there to be two distinct flavours of integration test.

The first is the test that validates whether or not everything works when all of your things are put together. This tends to end up being a test of your dependency injection configuration (if you’re using it), where you create some top level object using your container of choice, then execute a series of methods on it (and its children) echoing what would normally happen in your program execution. You might mock out the very edges of the system (like databases, API dependencies, disk-based configuration, etc), but in general you want to validate that your tens (hundreds? thousands?) of classes play nice together and actually accomplish their overarching goal.

The second type of test is the one that validates whether or not a class that relies on an external system works as expected. These are very close to unit tests in appearance (single class, some mocked out dependencies), but are much slower and more fragile (due to the reliance on something outside of the current code base). They do have a lot of value though, because beyond application failures, they inform you when your integration with the external dependency breaks for whatever reason.

The ELB Logs Processor features both types of integration test, which is nice, because it makes for a comprehensive blog post.

The simpler integration test is the one for the component that integrates with S3 to download a file and expose it as a series of lines. All you need to do is put a file in a real S3 bucket, then run the class with an appropriate set of parameters to find that file, download it and expose it in the desired format. You’re also free to simulate missing files (i.e. go get this file that doesn’t exist) and other error cases as desired, which is good.

The other integration test involves the Handler, using its default configuration and a realistic input. The idea would be to run the Handler in a way that is as close to how it will be run in reality as possible, and validate that it does what it advertises. This means we need to put some sort of sample file in S3, fire an event at the Handler describing that file (i.e. the same event that it would receive during a normal trigger in AWS), validate that the Handler ran as expected (perhaps measuring log output or something) and then validate that the events from the sample file were available in the destination ELK stack. A complicated test for sure, but one with lots of value.

As is always the case, while both of those tests are conceptually fairly simple and easy to spell out, they both have to surmount the same challenge.

AWS Authentication and Authorization.

401, Who Are You?

The various libraries supplied by AWS (Javascript, Powershell, .NET, Python, etc) all handle credential management in a similar way.

They use a sort of priority system for determining the source of their credentials, and it looks something like this:

  1. If credentials have been supplied directly in the code (i.e. key/secret), use those
  2. Else, if a profile has been supplied, look for that and use it
  3. Else, if no profile has been supplied, use the default profile
  4. Else, if environment variables have been set, use those
  5. Else, use information extracted from the local machine (i.e. IAM profiles or EC2 meta information)

The specific flow differs from language to language (.NET specific rules can be found here for example), but the general flow is pretty consistent.

What does that mean for our tests?

Well, the actual Lambda function code doesn’t need credentials at all, because the easiest way to run it is with an IAM profile applied to the Lambda function itself in AWS. Assuming no other credentials are available, this kicks in at option 5 in the list above, and is by far the least painful (and most secure) option to use.

When running the tests during development, or running them on the build server, there is no ambient IAM profile available, so we need to find some other way to get credentials to the code.

Luckily, the .NET AWS library allows you to override the logic that I described above, which is pretty great (even though it does it with static state, which makes me sad).

With a little work we can nominate an AWS profile with a specific name to be our current “context”, and then use those credentials to both setup a temporary S3 bucket with a file in it, and to run the code being tested, like in the following Handler test.

/// <summary>
/// This test is reliant on an AWS profile being setup with the name "elb-processor-s3-test" that has permissions
/// to create and delete S3 buckets with prefixes oth.console.elb.processor.tests. and read items in those buckets
/// </summary>
[Fact]
public void WhenHandlingAValidS3Event_ConnectsToS3ToDownloadTheFile_AndOutputsEventsToLogstash()
{
    var inMemoryLogger = new InMemoryLogger();
    var xunitLogger = new XUnitLogger(_output);
    var logger = new CompositeLogger(inMemoryLogger, xunitLogger);

    var application = Guid.NewGuid().ToString();
    var config = new Dictionary<string, string>
    {
        { Configuration.Keys.Event_Environment, Guid.NewGuid().ToString() },
        { Configuration.Keys.Event_Application, application},
        { Configuration.Keys.Event_Component, Guid.NewGuid().ToString() },
        { Configuration.Keys.Logstash_Url, "{logstash-url}" }
    };
    using (AWSProfileContext.New("elb-processor-s3-test"))
    {
        logger.Log("Creating test bucket and uploading test file");
        var s3 = new AmazonS3Client(new AmazonS3Config { RegionEndpoint = RegionEndpoint.APSoutheast2 });
        var testBucketManager = new TestS3BucketManager(s3, "oth.console.elb.processor.tests.");
        using (var bucket = testBucketManager.Make())
        {
            var templateFile = Path.Combine(ApplicationEnvironment.ApplicationBasePath, @"Helpers\Data\large-sample.log");
            var altered = File.ReadAllText(templateFile).Replace("@@TIMESTAMP", DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.ffffffZ"));
            var stream = new MemoryStream(Encoding.UTF8.GetBytes(altered));
            var putResult = s3.PutObjectAsync(new PutObjectRequest { BucketName = bucket.Name, InputStream = stream, Key = "test-file" }).Result;

            var message = new S3EventBuilder().ForS3Object(bucket.Name, "test-file").Build();

            logger.Log("Executing ELB logs processor");
            var handler = new Handler(logger, config);
            handler.Handle(message);
        }
    }

    // Check that there are some events in Elasticsearch with our Application
    Func<long> query = () =>
    {
        var client = new HttpClient {BaseAddress = new Uri("{elasticsearch-url}")};
        var raw = client.GetStringAsync($"/logstash-*/_search?q=Application:{application}").Result;
        dynamic result = JsonConvert.DeserializeObject(raw);
        return (long) result.hits.total;
    };

    query.ResultShouldEventuallyBe(hits => hits.Should().BeGreaterThan(0, $"because there should be some documents in Elasticsearch with Application:{application}"));
}

The code for the AWSProfileContext class is relatively simple, in that while it is in effect it inserts the profile into the first position of the credential fallback process, so that it takes priority over everything else. When disposed, it removes that override, returning everything to the way it was. A similar dispose-delete pattern is used for the temporary S3 bucket.

With a profile override in place, you can run the tests in local development just by ensuring you have a profile available with the correct name, which allows for a nice smooth development experience once a small amount of manual setup has been completed.

To run the tests on the build servers, we put together a simple helper in Powershell that creates an AWS profile, runs a script block and then removes the profile. The sensitive parts of this process (i.e. the credentials themselves) are stored in TeamCity, and supplied when the script is run, so nothing dangerous needs to go into source control at all.
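That helper is not particularly complicated. A simplified sketch is below; it assumes a reasonably recent version of the AWSPowerShell module (for the Set-AWSCredential and Remove-AWSCredentialProfile cmdlets), and the function and parameter names are illustrative rather than copied from our actual scripts:

function Invoke-WithTemporaryAWSProfile
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$profileName,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$accessKey,
        [Parameter(Mandatory=$true)]
        [ValidateNotNullOrEmpty()]
        [string]$secretKey,
        [Parameter(Mandatory=$true)]
        [scriptblock]$script
    )

    try
    {
        # Store the credentials as a named profile so that the code under test can resolve them
        Set-AWSCredential -AccessKey $accessKey -SecretKey $secretKey -StoreAs $profileName;
        & $script;
    }
    finally
    {
        # Always remove the profile afterwards, even if the script block throws
        Remove-AWSCredentialProfile -ProfileName $profileName -Force;
    }
}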

To Be Continued

The most complicated part of testing the ELB Logs Processor was getting the AWS credentials working consistently across a few different execution strategies, while not compromising the safety of the credentials themselves. The rest of the testing is pretty much just vanilla C# testing.

The only other thing worth mentioning is that the AWS libraries available for .NET Core aren’t quite the same as the normal .NET libraries, so some bits and pieces didn’t work as expected. The biggest issue I ran into was how a lot of the functionality for managing credentials is typically configured through the app.config, and that sort of thing doesn’t even seem to exist in .NET Core. Regardless, we still got it to work in the end, even though we did have to use the obsolete StoredProfileAWSCredentials class.

Next week will be the last post in this series, drawing everything together and focusing on actually running the new .NET Core ELB Logs Processor in a production environment.