
We have a lot of logs.

It’s mostly my fault, to be honest. It was only a few years ago that I learned about log aggregation, and once you have an ELK stack, everything looks like a structured log event formatted as JSON.

We aggregate a wealth of information into our log stack these days, including (but not limited to) API request logs, ELB traffic logs, application errors and usage events from our desktop software.

Now, if I had my way, we would keep everything forever. My dream would be to be able to ask the question “What did our aggregate API traffic look like over the last 12 months?”

Unfortunately, I can’t keep the raw data forever.

But I might be able to keep a part of it.

Storage Space

Storage space is pretty cheap these days, especially in AWS. In the Asia Pacific region, we pay US$0.12 per GB per month for a stock standard, non-provisioned IOPS EBS volume.

Our ELK stack accumulates gigabytes of data every day though, and trying to store everything for all of eternity can add up pretty quickly. This is further complicated by the nature of Elasticsearch, which likes to keep replicas of things just in case a node explodes, so you actually need more storage space than you might think in order to account for redundancy.

In the end we somewhat randomly decided to keep a bit more than a month’s worth of data (40 days), which gives us the capability to reliably support our products, and to have a decent window for viewing business intelligence and usage. We have a scheduled task in TeamCity that leverages Curator to remove data as appropriate.

Now, a little more than a month is a pretty long time.

But I want more.

In For The Long Haul

In any data set, you are likely to find patterns that emerge over a much longer period than a month.

A good example would be something like daily active users. This is the sort of trend that is likely to show itself over months or years, especially for a relatively stable product. Unless you’ve done something extreme of course, in which case you might get a meaningful trend over a much shorter period.

Ignoring the extremes, we have all the raw data required to calculate the metric, we’re just not keeping it. If we had some way of summarising it into a smaller data set though, we could keep it for a much longer period. Maybe some sort of mechanism to do some calculations and store the resulting derivation somewhere safe?

The simplest approach is some sort of script or application that runs on a schedule and uses the existing data in the ELK stack to create and store new documents, preferably back into the ELK stack. If we want to ensure those new documents don’t get deleted by Curator, all we have to do is put them into different indexes (as Curator only cleans up indexes prefixed with logstash).

Seems simple enough.

Generator X

For once it actually was simple enough.

At some point in the past we actually implemented a variation of this idea, where we calculated some metrics from a database (yup, that database) and stored them in an Elasticsearch instance for later use.

Architecturally, the metric generator was a small C# command line application scheduled for daily execution through TeamCity, so nothing particularly complicated.

We ended up decommissioning those particular metrics (because it turned out they were useless) and disabling the scheduled task, but the framework already existed to do at least half of what I wanted to do; the part relating to generating documents and storing them in Elasticsearch. All I had to do was extend it to query a different data source (Elasticsearch) and generate a different set of metrics documents for indexing.

So that’s exactly what I did.

The only complicated part was figuring out how to query Elasticsearch from .NET, which as you can see from the following metrics generation class, can be quite a journey.

public class ElasticsearchDailyDistinctUsersDbQuery : IDailyDistinctUsersDbQuery
{
    public ElasticsearchDailyDistinctUsersDbQuery
    (
        SourceElasticsearchUrlSetting sourceElasticsearch,
        IElasticClientFactory factory,
        IClock clock,
        IMetricEngineVersionResolver version
    )
    {
        _sourceElasticsearch = sourceElasticsearch;
        _clock = clock;
        _version = version;
        _client = factory.Create(sourceElasticsearch.Value);
    }

    private const string _indexPattern = "logstash-*";

    private readonly SourceElasticsearchUrlSetting _sourceElasticsearch;
    private readonly IClock _clock;
    private readonly IMetricEngineVersionResolver _version;

    private readonly IElasticClient _client;

    public IEnumerable<DailyDistinctUsersMetric> Run(DateTimeOffset parameters)
    {
        var start = parameters - parameters.TimeOfDay;
        var end = start.AddDays(1);

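        // Count distinct users for the day by running a cardinality aggregation over
        // the login events recorded in the logstash indexes.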
        var result = _client.Search<object>
        (
            s => s
                .Index(_indexPattern)
                .AllTypes()
                .Query
                (
                    q => q
                        .Bool
                        (
                            b => b
                                .Must(m => m.QueryString(a => a.Query("Application:GenericSoftwareName AND Event.Name:SomeLoginEvent").AnalyzeWildcard(true)))
                                .Must(m => m
                                    .DateRange
                                    (
                                        d => d
                                            .Field("@timestamp")
                                            .GreaterThanOrEquals(DateMath.Anchored(start.ToUniversalTime().DateTime))
                                            .LessThan(DateMath.Anchored(end.ToUniversalTime().DateTime))
                                    )
                                )
                        )
                )
                .Aggregations(a => a
                    .Cardinality
                    (
                        "DistinctUsers", 
                        c => c.Field("SomeUniqueUserIdentifier")
                    )
                )
        );

        var agg = result.Aggs.Cardinality("DistinctUsers");

        return new[]
        {
            new DailyDistinctUsersMetric(start)
            {
                count = Convert.ToInt32(agg.Value),
                generated_at = _clock.UtcNow,
                source = $"{_sourceElasticsearch.Value}/{_indexPattern}",
                generator_version = _version.ResolveVersion().ToString()
            }
        };
    }
}
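
For reference, the resulting metric document is just a plain C# object that gets indexed into a non-logstash prefixed index (so Curator leaves it alone). A minimal sketch, with the property names taken from the usage above and everything else assumed, might look like this:

// Sketch only; property names are inferred from the query class above, the rest is assumed.
public class DailyDistinctUsersMetric
{
    public DailyDistinctUsersMetric(DateTimeOffset date)
    {
        this.date = date;
    }

    // The day the metric describes, and the calculated value itself.
    public DateTimeOffset date { get; set; }
    public int count { get; set; }

    // Provenance information, so a metric can be traced back to its source and generator.
    public DateTimeOffset generated_at { get; set; }
    public string source { get; set; }
    public string generator_version { get; set; }
}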

Conclusion

The concept of calculating some aggregated values from our logging data and keeping them separately has been in my own personal backlog for a while now, so it was nice to have a chance to dig into it in earnest.

It was even nicer to be able to build on top of an existing component, as it would have taken me far longer if I had to put everything together from scratch. I think it’s a testament to the quality of our development process that even this relatively unimportant component was originally built following solid software engineering practices, and has plenty of automated tests, dependency injection and so on. It made refactoring it and turning it towards a slightly different purpose much easier.

Now all I have to do is wait months while the longer term data slowly accumulates.


A very long time ago I wrote a post on this blog about interceptors.

The idea behind an interceptor is pretty straightforward: dynamically create a wrapper around some interface or class to either augment or alter its behaviour when used by the rest of the system, without actually having to implement the interface or override the class. For example, my original post was about an interceptor that slowed down requests heading to an API, simulating how that particular application would feel for a user with terrible internet.

I honestly haven’t touched the concept since, until recently that is.

I wanted to add some logging around our usage of a third party API from our legacy application, and conceptually the problem seemed like a perfect fit for another interceptor. A quick implementation of a logging interceptor injected via Ninject and I’d have all the logging I could ever want, without having to mess around too much.

Reality had other ideas though.

Here’s Why I Came Here Tonight

Our legacy software is at that time in its life where it mostly just gets integrations. It’s pretty feature complete as far as core functionality goes, so until the day we finally grant it Finis Rerum and it can rest, we look to add value to our users by integrating with third party services.

The most recent foray in this space integrated a payment provider into the software, which is quite useful considering its core value proposition is trust account management. From a technical point of view, the payment provider has an API and we wrote a client to access that API, with structured request and response objects. Pretty standard stuff.

As part of our development, we included various log events that allowed us to track the behaviour of parts of the system, mostly so that we could more easily support the application and get accurate metrics and feedback from users in the wild relating to performance. This is all well and good, but those events generally cover off combined parts of the application logic; for example, an operation that queries the local DB and then augments that information by calling into the third party API to display a screen to the user.

This makes it relatively easy to see when any users are experiencing performance issues, but it makes it hard to see whether or not the local DB, the actual programming logic or the internet call was the root cause.

An improvement to this would be to also log all outgoing API requests and their responses, along with the execution time. With that information we would be able to either blame or absolve the client’s internet connection when it comes to performance questions.

Now, I’m an extremely lazy developer, so while we have a nice interface that I could implement to accomplish this (i.e. some sort of LoggingPaymentProviderClient), it’s got something like twenty methods and I really don’t have the time, patience or motivation for that. Also, while it’s unlikely that the interface would change all that much over time, it’s still something of a maintenance nightmare.

Interceptors to the rescue!

But I Got The Feeling That Something Ain’t Right

As I explained in my original post all those years ago, the IInterceptor interface supplied by the Castle library allows you to implement logic via a proxy and slot it seamlessly into a call stack. Its usage is made easier by the presence of good dependency injection, but it’s definitely not required at all.

Thus enters the logging interceptor.

public class PaymentProviderClientMethodExecutionLoggingInterceptor : IInterceptor
{
    public PaymentProviderClientMethodExecutionLoggingInterceptor(ILogger logger)
    {
        _logger = logger;
    }

    private readonly ILogger _logger;

    public void Intercept(IInvocation invocation)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            invocation.Proceed();
            stopwatch.Stop();

            var log = new PaymentProviderMethodCallCompleted(invocation, stopwatch.Elapsed);
            _logger.Information(log);
        }
        catch (Exception ex)
        {
            stopwatch.Stop();

            var log = new PaymentProviderMethodCallFailed(invocation, stopwatch.Elapsed, ex);
            _logger.Warning(log);

            throw;
        }
    }
}

It’s not an overly complicated class, and while it’s written to be specific, it’s actually quite generic.

Given a proxied class, all methods called on that class will be logged via Serilog, with the method name, its parameters and its return value (the structured logging being provided by the dedicated event classes).
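
For context, getting a proxied class in the first place is a one-liner with Castle’s ProxyGenerator. The sketch below is illustrative only (the interface name and the surrounding variables are assumptions, and in our case the wiring actually happens inside the Ninject bindings):

var generator = new ProxyGenerator();

// Wrap the real client in a proxy; every call now flows through the interceptor first.
IPaymentProviderClient proxy = generator.CreateInterfaceProxyWithTarget<IPaymentProviderClient>(
    realClient,
    new PaymentProviderClientMethodExecutionLoggingInterceptor(logger));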

Nothing ever works the first time though, and while I’m constantly being reminded of that, I’m always hopeful all the same. Denial is a powerful motivator.

The problem was that the IInterceptor interface is old enough that it doesn’t consider the existence of asynchronous methods. The interceptor does exactly what it says on the tin: starts a timer, proceeds with the method invocation and then, because the method is asynchronous, immediately logs an event with the wrong execution time and no return value.

It has no idea that it has to wait for the invocation to complete because it thinks everything is synchronous.

Clowns To The Left Of Me, Jokers To The Right

This is where everything is going to get a little bit fuzzier than I would like, because I wrote this blog post before I had a working solution.

From what I can tell, the situation is quite complex.

The simplest solution appears to be to leverage the existing interface and simply check for the presence of a Task (or Task<T>) return value. If detected, append a continuation to that Task to perform the desired functionality. For me this would be a continuation on both faulted and success (and maybe cancelled?) that would log the completion of the method. It seems like it would work, but I do have some concerns about the scheduling of the continuation and how that makes the code harder to reason about.
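
A minimal sketch of that continuation-based approach, reusing the interceptor from earlier, might look something like the following (glossing over Task<T> return values, cancellation, the synchronous failure path and continuation scheduling, which is exactly the part that worries me):

public void Intercept(IInvocation invocation)
{
    var stopwatch = Stopwatch.StartNew();
    invocation.Proceed();

    var task = invocation.ReturnValue as Task;
    if (task == null)
    {
        // Plain synchronous method; log immediately, as before.
        stopwatch.Stop();
        _logger.Information(new PaymentProviderMethodCallCompleted(invocation, stopwatch.Elapsed));
        return;
    }

    // Asynchronous method; defer the logging until the returned Task actually completes.
    task.ContinueWith(t =>
    {
        stopwatch.Stop();
        if (t.IsFaulted)
        {
            _logger.Warning(new PaymentProviderMethodCallFailed(invocation, stopwatch.Elapsed, t.Exception));
        }
        else
        {
            _logger.Information(new PaymentProviderMethodCallCompleted(invocation, stopwatch.Elapsed));
        }
    });
}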

Luckily, someone has already put together a reusable library that allows for asynchronous interceptors via a slightly different interface.

This is attractive because it’s code that I don’t have to write (remember, I’m lazy), but the fact that it isn’t built into the core Castle library does make me question its legitimacy. Surely if it was that critical the maintainers would have updated Castle.Core?

Regardless, I explored using the library first, but in order to use it I had to go on an adventure to upgrade a bunch of our NuGet dependencies (because it relied on the latest version of Castle.Core), which meant updates to Castle, Ninject and Ninject’s extension libraries. This caused knock-on effects because the Ninject.MockingKernel.NSubstitute library was not available for .NET 4.5 (even though all the others were), so I had to temporarily copy that particular implementation into our codebase.

Once everything was compiling, a full test run showed some failing tests that weren’t failing before the library upgrades, so I kind of stopped there.

For now.

Conclusion

Unfortunately this is one of those blog posts that comes off feeling a little hollow. I didn’t actually get to my objective (seamless per method logging for a third-party dependency), but I did learn a lot on the way, so I think it was still useful to write about.

Probably should have waited a little longer though, I jumped the gun a bit.

It’s not the only time in recent memory that asynchronous behaviour has made things more difficult than they would have been otherwise. In an unrelated matter, some of our automated tests have been flaky recently, and the core of the issue seems to be asynchronous behaviour that is either a) not being correctly overridden to be synchronous for testing purposes or b) not correctly being waited for before the tests proceed.

It’s not hard to write tests that are purely async of course (NUnit supports tests marked with the async keyword), but when you’re testing a view model and the commands are “synchronous” it gets a bit more challenging.

That’s probably a topic for another day though.


Are you monitoring a production environment with a system that raises alerts when something strange happens?

Do you also have some sort of pre-production or staging environment where you smoke test changes before pushing them live?

Does the pre-production environment have exactly the same monitoring as the production environment?

It really should.

Silence Definitely Doesn’t Equal Consent

A few months back we were revisiting one of our older APIs.

We’d just made a release into the pre-production environment and verified that the API was doing what it needed to do with a quick smoke test. Everything seemed fine.

We promoted to production, and a few moments later a whole bunch of alarms went off as they detected a swathe of errors occurring in the backend. I honestly can’t remember what the nature of the errors was, but they were real problems that were causing subtle failures in the migration process.

Of course, we were surprised, because we didn’t receive any such indication from the pre-production environment.

When we dug into it a bit deeper though, exactly the same things were happening in pre-prod, the environment was just missing equivalent alarms.

Samsies

I’m honestly not sure how we got ourselves into this particular situation, where there was a clear difference in behaviour between two environments that should be basically identical. Perhaps the production environment was manually tweaked to include different alarms? I’m not as familiar with the process used for this API as I am for others (Jenkins and Ansible vs TeamCity and Octopus Deploy), but regardless of the technology involved, it’s easy to accidentally fall into the trap of “I’ll just manually create this alarm here” when you’re in the belly of the beast during a production incident.

Thar be dragons that way though.

Ideally you should be treating your infrastructure as code and deploying it similarly to how you deploy your applications. Of course, this assumes you have an equivalent, painless deployment pipeline for your infrastructure, which can be incredibly difficult to put together.

We’ve had some good wins in the past with this approach (like our log stack rebuild), where the entire environment is encapsulated within a single Nuget package (using AWS CloudFormation), and then deployed using Octopus Deploy.

Following such a process strictly can definitely slow you down when you need to get something done fast (because you have to add it, test it, review it and then deploy it through the chain of CI, Staging, Production), but it does prevent situations like this from arising.

Our Differences Make Us Special

As always, there is at least one caveat.

Sometimes you don’t WANT your production and pre-production systems to have the same alarms.

For example, imagine if you had an alarm that fired when the traffic dropped below a certain threshold. Production is always getting enough traffic, such that a drop indicates a serious problem.

Your pre-production environment might not be getting as much traffic, and the alarm might always be firing.

An alarm that is always firing is a great way to get people to ignore all of the alarms, even when they matter.

By that logic, there might be good reasons to have some differences between the two environments, but they are almost certainly going to be exceptions rather than the rule.

Conclusion

To be honest, not having the alarms go off in the pre-production environment wasn’t exactly the worst thing that’s ever happened to us, but it was annoying.

Short term it was easy enough to just remember to check the logs before pushing to production, but the manual and unreliable nature of that check is why we have alarms in the first place.

Machines are really good at doing repetitive things after all.


As should be somewhat obvious by now, I’m a big fan of logging. I try to justify that interest in a logical fashion, stating the intelligence benefits, but really, I just personally find it interesting, and as a result it tends to be something that I focus on when developing software.

Log aggregation as a concept is still pretty new to me though. It wasn’t until a few years ago that I became aware that it even existed, even though something always lingered in the corners of my mind about the usefulness of logs that were trapped on hundreds of different machines and environments. The idea of taking the logs from a whole bunch of different locations (servers, client machines, everything), destructuring them and putting them all into a centralised, searchable location kind of blew my mind to be honest, and it’s been something of a passion ever since.

In an interesting twist of fate, my first exposure to log aggregation involved RavenDB. One of the previous developers had set up a RavenDB database to act as the destination for log events coming out of a piece of desktop software, and when I realised what it was, I was intent on using it to understand how the software was being used.

The idea was amazing.

The implementation?

Not so much.

A Missed Opportunity

As I mentioned above, at some point in the past the development team identified a need to raise and aggregate structured log events for a variety of reasons, like business intelligence, error reporting, general logging and so on. They picked RavenDB as the storage location, probably because it was so easy to use from a development point of view, and trivial to integrate into a C# application using the RavenDB client to write arbitrary plain old C# objects with no real messing around.

A simple custom logger interface was created, which would expose the capability to write log events that derived from some common base, and the implementation of that interface used the RavenDB client. From a storage point of view, staging and production instances of RavenDB were provisioned inside AWS, and exposed to the greater internet via an Elastic Load Balancer. The very next release of the software was writing all sorts of interesting events into RavenDB with very little difficulty, like notifications that the application was being started (including things like who started it and what version it was on) and errors that were occurring during normal usage.

It was all very useful.

That is, until you tried to actually search the resulting log events and glean some information from them.

This was back in RavenDB 2.X, so the default UI was the Silverlight implementation of the RavenDB Studio, which wasn’t exactly the most amazing piece of software.

Even ignoring the usability problems, RavenDB 2.X did not have good support for ad-hoc queries, and no one had thought ahead and set up fields for indexing. This meant that the wealth of information available was damn near unusable.

How Could You Not Call It Muninn?

Shortly after I started and found out about all of this awesome information going to waste, I put together an ELK stack and hacked together a bridge between the two.

Whenever a log event was written into RavenDB, a small memory resident application would automatically write it out to a JSON file, and Logstash would pick it up and put it into the ELK stack.

With the new ELK stack in place, whenever we wrote new code we would ensure that it could aggregate its logs as necessary, augmenting the system with new capabilities each time. In the end it was mostly just log files being picked up and processed by local Logstash instances, but we also process ELB logs in Lambda by posting directly to Logstash over HTTP, and we exposed a public facing, authenticated Logging API for software to aggregate log events from outside of our little walled garden in AWS.

The rest is history, and we’ve been using ELK ever since.

But we never changed the initial piece of desktop software, and it’s been happily writing into RavenDB this whole time, with the events being replicated into the ELK stack via that “temporary” hack.

Its The Perfect Name

Don’t get me wrong, the system worked, and log events from the desktop software were available in our ELK stack; it just felt so inefficient to have this whole other system in place providing no real additional value.

Not only that, while it did technically work, there were definitely times when that particular log event channel failed spectacularly, from full disks (no-one had ever implemented maintenance or cleanup of any sort on the RavenDB database) through to performance problems similar to the ones that we saw in one of our APIs.

The last straw was when I did an audit on our AWS costs and discovered that we were actually paying a non-trivial amount of cash for the RavenDB logging infrastructure and that just seemed wrong (plus, if I recovered that cash I could put it into other things, like the ELK stack).

All told, there were more reasons to get rid of it than there were to keep it.

But how?

Well, as I mentioned earlier, when the development team initially implemented the log aggregation, they put a simple logging interface in place, with the RavenDB specific implementation behind that interface.

Switching the logging to use our Serilog based HTTP sink (which connected to our authenticated Logging API) was as simple as providing a new implementation of the old logging interface, so even though its usage infested the entire system, we really only had to change it once at the top level.
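
As a rough illustration (the interface and event type names here are made up, since the real ones live in our own libraries), the new implementation amounts to little more than an adapter over Serilog:

// Hypothetical names; the real logging interface and event base class are internal to our code.
public class SerilogLogger : ICustomLogger
{
    private readonly Serilog.ILogger _logger;

    public SerilogLogger(Serilog.ILogger logger)
    {
        _logger = logger;
    }

    public void Write(LogEventBase logEvent)
    {
        // Serilog destructures the event into structured properties, and the configured
        // HTTP sink forwards it to the authenticated Logging API.
        _logger.Information("{@Event}", logEvent);
    }
}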

Of course, it wasn’t quite that simple (references to RavenDB were strewn through at least 3 different libraries that dealt with logging, so they had to be removed), but all told it was a hell of a lot easier than it would have been without that abstraction.

Obviously we want to switch to using Serilog directly (rather than a custom logging abstraction that provides no real value), but that can happen slowly, bit by bit, and the flow of log events will be maintained the entire time.

Summary

Unfortunately, we can’t actually decommission the RavenDB logging infrastructure just yet, as we have to wait until everyone in the wild upgrades to the latest version or we risk missing out on important intelligence from the slackers. I’m keeping a close eye on the total number of logs received through the old system, and the moment it drops below a certain threshold, I’m going to nuke the entire thing.

To be honest, the removal of the RavenDB logging channel that I described above makes me happier than it really has cause to. It’s been bugging me for a while (mostly from an engineering point of view), so it felt really good to get rid of it in the latest release of our software. That sort of maintenance and consolidation doesn’t always have a direct and measurable impact on the bottom line, but it’s still important for the system as a whole and important for engineering morale.

It’s like ripping out a mostly dead plant from a garden.

Better to make room for something that might actually be useful.


Last week I wrote a high level overview of the build, package, test and deploy process that we created in order to get .NET Core C# into AWS Lambda.

As part of that post, I briefly touched on the validation part of the process, which basically amounted to “we wrote some tests in C# using XUnit, then used the dotnet command line interface to run them as part of the build”.

Considering that the testing took a decent amount of the total time, that amount of coverage isn’t really appropriate, so this post will elaborate further, and explain some of the challenges we encountered and how we overcame them.

I’m A Little Testy

At a high level, testing a .NET Core AWS Lambda function written in C# is not really any more difficult than testing other C# code. Like the stuff I’ve been writing for years.

Unit test classes in isolation, integration test classes together, have a few end to end tests (if possible) and you’re probably in a good place.

When you write a lambda function, all you really need is a single class with a method on it (Handler::Handler or equivalent). You could, if you wanted, put literally all of your logic into this class/method combination and have a fully functional lambda function.

That would probably make it harder to test though.

A better idea is to break it down into its constituent parts, ideally taking the form of small classes with a single responsibility, working together to accomplish a greater goal.

For the ELB Logs Processor, there are three clear responsibilities:

  1. Read a file from S3, and expose the information therein as lines of text
  2. Take a line of text and transform it into some sort of structure, potentially breaking it down and adding meta information 
  3. Take a structure and output it to a Logstash endpoint

Technically there is also a fourth responsibility which is to tie the three things above together (i.e. orchestration), but we’ll get to that in a minute.
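
Sketching those seams as interfaces makes the shape of the thing a bit clearer. The names below are illustrative only, not the actual code:

// Responsibility 1: read a file from S3 and expose its contents as lines of text.
public interface ILogFileSource
{
    IEnumerable<string> GetLines(string bucket, string key);
}

// Responsibility 2: transform a raw line into a structured event (here just a property bag).
public interface ILogLineParser
{
    IDictionary<string, object> Parse(string line);
}

// Responsibility 3: output structured events to a Logstash endpoint.
public interface ILogstashOutput
{
    void Send(IEnumerable<IDictionary<string, object>> events);
}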

Anyway, if the responsibilities are clearly separated, we’re in unit testing territory, which is familiar enough that I don’t really want to talk about it too much. Grab your testing framework of choice (XUnit is a good pick for .NET Core, NUnit didn’t seem to work too well in Visual Studio + Resharper), mock out your dependencies and test away.

You can apply the same sort of testing to the orchestrator (the Handler) by giving it its dependencies through its constructor. Make sure you give it a parameterless constructor with default values though, or you’ll have issues running it in AWS.

This is basically just a poor man’s dependency injection, but my favourite dependency injection library, Ninject, does not exist in .NET Core and I didn’t want to invest the time to learn one of the other ones (like Autofac) for such a small project.
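
In practice the poor man’s version looks something like the sketch below (names and parameters are guesses based on how the Handler is used later in this post, not the real code):

public class Handler
{
    private readonly ILogger _logger;
    private readonly IDictionary<string, string> _config;

    // Lambda requires a parameterless constructor, so this one news up production defaults.
    public Handler() : this(new LambdaConsoleLogger(), Configuration.Defaults) { }

    // Tests use this constructor to supply their own logger and configuration.
    public Handler(ILogger logger, IDictionary<string, string> config)
    {
        _logger = logger;
        _config = config;
    }
}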

Of course, while unit tests are valuable, they don’t really tell us whether or not everything will work when you put it all together.

That’s where integration tests come in.

Integration Functions

In my experience, I’ve consistently found there to be two distinct flavours of integration test.

The first is the test that validates whether or not everything works when all of your things are put together. This tends to end up being a test of your dependency injection configuration (if you’re using it), where you create some top level object using your container of choice, then execute a series of methods on it (and its children) echoing what would normally happen in your program execution. You might mock out the very edges of the system (like databases, API dependencies, disk-based configuration, etc), but in general you want to validate that your tens (hundreds? thousands?) of classes play nice together and actually accomplish their overarching goal.

The second type of test is the one that validates whether or not a class that relies on an external system works as expected. These are very close to unit tests in appearance (single class, some mocked out dependencies), but are much slower and more fragile (due to the reliance on something outside of the current code base). They do have a lot of value though, because beyond application failures, they inform you when your integration with the external dependency breaks for whatever reason.

The ELB Logs Processor features both types of integration test, which is nice, because it makes for a comprehensive blog post.

The simpler integration test is the one for the component that integrates with S3 to download a file and expose it as a series of lines. All you need to do is put a file in a real S3 bucket, then run the class with an appropriate set of parameters to find that file, download it and expose it in the format that is desired. You’re free to simulate missing files (i.e. go get this file that doesn’t exist) and other error cases as desired as well, which is good.
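
As a sketch (using the same illustrative names as before, and assuming a known bucket and file that the test credentials can read), that test is short and sweet:

[Fact]
public void WhenTheFileExists_DownloadsIt_AndExposesItAsLines()
{
    using (AWSProfileContext.New("elb-processor-s3-test"))
    {
        var s3 = new AmazonS3Client(new AmazonS3Config { RegionEndpoint = RegionEndpoint.APSoutheast2 });
        var source = new S3LogFileSource(s3);

        // Bucket and key names here are placeholders; the real test manages its own test data.
        var lines = source.GetLines("oth.console.elb.processor.tests.known", "known-test-file").ToList();

        lines.Should().NotBeEmpty("because the known test file contains at least one ELB log entry");
    }
}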

The other integration test involves the Handler, using its default configuration and a realistic input. The idea would be to run the Handler in a way that is as close to how it will be run in reality as possible, and validate that it does what it advertises. This means we need to put some sort of sample file in S3, fire an event at the Handler describing that file (i.e. the same event that it would receive during a normal trigger in AWS), validate that the Handler ran as expected (perhaps measuring log output or something) and then validate that the events from the sample file were available in the destination ELK stack. A complicated test for sure, but one with lots of value.

As is always the case, while both of those tests are conceptually fairly simple and easy to spell out, they both have to surmount the same challenge.

AWS Authentication and Authorization.

401, Who Are You?

The various libraries supplied by AWS (Javascript, Powershell, .NET, Python, etc) all handle credential management in a similar way.

They use a sort of priority system for determining the source of their credentials, and it looks something like this:

  1. If credentials have been supplied directly in the code (i.e. key/secret), use those
  2. Else, if a profile has been supplied, look for that and use it
  3. Else, if no profile has been supplied, use the default profile
  4. Else, if environment variables have been set, use those
  5. Else, use information extracted from the local machine (i.e. IAM profiles or EC2 meta information)

The specific flow differs from language to language (.NET specific rules can be found here for example), but the general flow is pretty consistent.

What does that mean for our tests?

Well, the actual Lambda function code doesn’t need credentials at all, because the easiest way to run it is with an IAM profile applied to the Lambda function itself in AWS. Assuming no other credentials are available, this kicks in at option 5 in the list above, and is by far the least painful (and most secure) option to use.

When running the tests during development, or running them on the build server, there is no ambient IAM profile available, so we need to find some other way to get credentials to the code.

Luckily, the .NET AWS library allows you to override the logic that I described above, which is pretty great (even though it does it with static state, which makes me sad).

With a little work we can nominate an AWS profile with a specific name to be our current “context”, and then use those credentials both to set up a temporary S3 bucket with a file in it, and to run the code being tested, like in the following Handler test.

/// <summary>
/// This test is reliant on an AWS profile being setup with the name "elb-processor-s3-test" that has permissions
/// to create and delete S3 buckets with prefixes oth.console.elb.processor.tests. and read items in those buckets
/// </summary>
[Fact]
public void WhenHandlingAValidS3Event_ConnectsToS3ToDownloadTheFile_AndOutputsEventsToLogstash()
{
    var inMemoryLogger = new InMemoryLogger();
    var xunitLogger = new XUnitLogger(_output);
    var logger = new CompositeLogger(inMemoryLogger, xunitLogger);

    var application = Guid.NewGuid().ToString();
    var config = new Dictionary<string, string>
    {
        { Configuration.Keys.Event_Environment, Guid.NewGuid().ToString() },
        { Configuration.Keys.Event_Application, application},
        { Configuration.Keys.Event_Component, Guid.NewGuid().ToString() },
        { Configuration.Keys.Logstash_Url, "{logstash-url}" }
    };
    using (AWSProfileContext.New("elb-processor-s3-test"))
    {
        logger.Log("Creating test bucket and uploading test file");
        var s3 = new AmazonS3Client(new AmazonS3Config { RegionEndpoint = RegionEndpoint.APSoutheast2 });
        var testBucketManager = new TestS3BucketManager(s3, "oth.console.elb.processor.tests.");
        using (var bucket = testBucketManager.Make())
        {
            var templateFile = Path.Combine(ApplicationEnvironment.ApplicationBasePath, @"Helpers\Data\large-sample.log");
            var altered = File.ReadAllText(templateFile).Replace("@@TIMESTAMP", DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.ffffffZ"));
            var stream = new MemoryStream(Encoding.UTF8.GetBytes(altered));
            var putResult = s3.PutObjectAsync(new PutObjectRequest { BucketName = bucket.Name, InputStream = stream, Key = "test-file" }).Result;

            var message = new S3EventBuilder().ForS3Object(bucket.Name, "test-file").Build();

            logger.Log("Executing ELB logs processor");
            var handler = new Handler(logger, config);
            handler.Handle(message);
        }
    }

    // Check that there are some events in Elasticsearch with our Application
    Func<long> query = () =>
    {
        var client = new HttpClient {BaseAddress = new Uri("{elasticsearch-url}")};
        var raw = client.GetStringAsync($"/logstash-*/_search?q=Application:{application}").Result;
        dynamic result = JsonConvert.DeserializeObject(raw);
        return (long) result.hits.total;
    };

    query.ResultShouldEventuallyBe(hits => hits.Should().BeGreaterThan(0, $"because there should be some documents in Elasticsearch with Application:{application}"));
}

The code for the AWSProfileContext class is relatively simple, in that while it is in effect it inserts the profile into the first position of the credential fallback process, so that it takes priority over everything else. When disposed, it removes that override, returning everything to the way it was. A similar dispose-delete pattern is used for the temporary S3 bucket.
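
One possible shape for such a class, using the static AWSConfigs.AWSProfileName override, is sketched below. This is not the actual implementation (which, as noted later, ended up leaning on the obsolete StoredProfileAWSCredentials class), just an illustration of the idea:

// Illustration only; the real class differs, particularly on .NET Core.
public class AWSProfileContext : IDisposable
{
    private AWSProfileContext() { }

    public static AWSProfileContext New(string profileName)
    {
        // Push the named profile to the front of the credential fallback process.
        AWSConfigs.AWSProfileName = profileName;
        return new AWSProfileContext();
    }

    public void Dispose()
    {
        // Remove the override, returning credential resolution to its normal behaviour.
        AWSConfigs.AWSProfileName = null;
    }
}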

With a profile override in place, you can run the tests in local development just by ensuring you have a profile available with the correct name, which allows for a nice smooth development experience once a small amount of manual setup has been completed.

To run the tests on the build servers, we put together a simple helper in Powershell that creates an AWS profile, runs a script block and then removes the profile. The sensitive parts of this process (i.e. the credentials themselves) are stored in TeamCity, and supplied when the script is run, so nothing dangerous needs to go into source control at all.

To Be Continued

The most complicated part of testing the ELB Logs Processor was getting the AWS credentials working consistently across a few different execution strategies, while not compromising the safety of the credentials themselves. The rest of the testing is pretty much just vanilla C# testing.

The only other thing worth mentioning is that the AWS libraries available for .NET Core aren’t quite the same as the normal .NET libraries, so some bits and pieces didn’t work as expected. The biggest issue I ran into was how a lot of the functionality for managing credentials is typically configured through the app.config, and that sort of thing doesn’t even seem to exist in .NET Core. Regardless, we still got it to work in the end, even though we did have to use the obsolete StoredProfileAWSCredentials class.

Next week will be the last post in this series, drawing everything together and focusing on actually running the new .NET Core ELB Logs Processor in a production environment.