
Are you monitoring a production environment with a system that raises alerts when something strange happens?

Do you also have some sort of pre-production or staging environment where you smoke test changes before pushing them live?

Does the pre-production environment have exactly the same monitoring as the production environment?

It really should.

Silence Definitely Doesn’t Equal Consent

A few months back we were revisiting one of our older APIs.

We’d just made a release into the pre-production environment and verified that the API was doing what it needed to do with a quick smoke test. Everything seemed fine.

We promoted to production, and a few moments later a whole bunch of alarms went off as they detected a swathe of errors occurring in the backend. I honestly can’t remember what the nature of the errors was, but they were real problems that were causing subtle failures in the migration process.

Of course, we were surprised, because we didn’t receive any such indication from the pre-production environment.

When we dug into it a bit deeper though, exactly the same things were happening in pre-prod; the environment was just missing the equivalent alarms.

Samsies

I’m honestly not sure how we got ourselves into this particular situation, where there was a clear difference in behaviour between two environments that should be basically identical. Perhaps the production environment was manually tweaked to include different alarms? I’m not as familiar with the process used for this API as I am for others (Jenkins and Ansible vs TeamCity and Octopus Deploy), but regardless of the technology involved, it’s easy to accidentally fall into the trap of “I’ll just manually create this alarm here” when you’re in the belly of the beast during a production incident.

Thar be dragons that way though.

Ideally you should be treating your infrastructure as code and deploying it similarly to how you deploy your applications. Of course, this assumes you have an equivalent, painless deployment pipeline for your infrastructure, which can be incredibly difficult to put together.

We’ve had some good wins in the past with this approach (like our log stack rebuild), where the entire environment is encapsulated within a single NuGet package (using AWS CloudFormation), and then deployed using Octopus Deploy.

Following such a process strictly can definitely slow you down when you need to get something done fast (because you have to add it, test it, review it and then deploy it through the chain of CI, Staging, Production), but it does prevent situations like this from arising.

Our Differences Make Us Special

As always, there is at least one caveat.

Sometimes you don’t WANT your production and pre-production systems to have the same alarms.

For example, imagine if you had an alarm that fired when the traffic dropped below a certain threshold. Production is always getting enough traffic, such that a drop indicates a serious problem.

Your pre-production environment might not be getting as much traffic, and the alarm might always be firing.

An alarm that is always firing is a great way to get people to ignore all of the alarms, even when they matter.

By that logic, there might be good reasons to have some differences between the two environments, but they are almost certainly going to be exceptions rather than the rule.
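
None of this is how our particular pipeline manages alarms (that all goes through CloudFormation and Octopus Deploy), but as a rough sketch of keeping that kind of difference explicit and version controlled, you could imagine defining the alarm once in code and sourcing the threshold per environment. Something like the following, using the AWS SDK for .NET, where the metric names, namespace and numbers are all made up for the example:

using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

public static class TrafficAlarm
{
    // Hypothetical per-environment thresholds. The point is that the difference
    // between environments is declared in one reviewed place, rather than being
    // a manual tweak in the production console that nobody remembers.
    private static readonly Dictionary<string, double> MinimumRequestsPerPeriod = new Dictionary<string, double>
    {
        { "Production", 1000 },
        { "Staging", 1 },
    };

    public static Task EnsureAsync(IAmazonCloudWatch cloudWatch, string environment)
    {
        return cloudWatch.PutMetricAlarmAsync(new PutMetricAlarmRequest
        {
            AlarmName = $"{environment}-api-traffic-too-low",
            Namespace = "MyCompany/Api",     // illustrative namespace
            MetricName = "Requests",         // illustrative metric
            Statistic = Statistic.Sum,
            Period = 300,
            EvaluationPeriods = 3,
            Threshold = MinimumRequestsPerPeriod[environment],
            ComparisonOperator = ComparisonOperator.LessThanThreshold,
            TreatMissingData = "breaching",
        });
    }
}

The important part is that the lower staging threshold is a deliberate, reviewed exception that lives with the rest of the infrastructure definition, not an undocumented tweak.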

Conclusion

To be honest, not having the alarms go off in the pre-production environment wasn’t exactly the worst thing that’s ever happened to us, but it was annoying.

Short term it was easy enough to just remember to check the logs before pushing to production, but the manual and unreliable nature of that check is exactly why we have alarms in the first place.

Machines are really good at doing repetitive things after all.


I expect accounting software to make some pretty convincing guarantees about the integrity of its data over time.

From my experience, such software generally restricts the user’s ability to change something once it has been entered. Somewhat unforgiving of innocent mistakes (move money to X, whoops, now you’ve got to move it back and there is a full audit trail of your mistake), but it makes for a secure system in the long run.

Our legacy software is very strict about maintaining the integrity of its transactional history, and it has been for a very long time.

Except when you introduce the concept of database restores, but that’s not a topic for this blog post.

Nothing is perfect though, and a long history of development by a variety of parties (some highly competent, some….not) has led to a complicated system that doesn’t play by its own rules every now and then.

It’s like a reverse Butterfly Effect: changing the present can unfortunately change the past.

It’s All Wrong

A natural and correct assumption about any report that comes out of a piece of accounting software, especially one that focuses on transactions, is that it shouldn’t matter when you look at the report (today, tomorrow, six months from now): if you’re looking at data in the past, it shouldn’t be changing.

When it comes to the core transactional items (i.e. “Transferred $200 to Bob”) we’re good. Those sorts of things are immutable at the time when they occur, and no matter when you view the data, it’s always the same.

Being that this is a blog post, suspiciously titled “Protecting the Timeline”, I think you can probably guess that something is rotten in the state of Denmark.

While the core transactional information is unimpeachable, sometimes there is meta information attached to a transaction with less moral integrity. For example, if the transaction is an EFT payment exiting the system, it needs to record bank account details to be compliant with legislation (i.e. “Transferred $200 to Bob (EFT: 123-456, 12346785)”).

Looking at the system, it’s obvious that the requirement to capture this additional information came after the original implementation, and instead of capturing the entire payload when the operation is executed, the immutable transaction is dynamically linked to the entities involved and the bank account details (or equivalent) are loaded from the current state of the entity whenever a report is created.

So we know unequivocally that the transaction was an EFT transaction, but we don’t technically know which account the transfer targeted. If the current details change, then a re-printed report will technically lie.

Freeze Frame

The solution is obvious.

Capture all of the data at the time the operation is executed, not just some of it.

This isn’t overly difficult from a technical point of view, just hook into the appropriate places, capture a copy of the current data and store it somewhere safe.

Whenever the transactions are queried (i.e. in the report), simply load the same captured data and present it to the user.

Of course, if the requirements change again in the future (and we need to show additional information like Bank Name or something), then we will have to capture that as well, and all previous reports will continue to show the data as it was before the new requirement. That’s the tradeoff: you can’t capture everything, and whatever you don’t capture is never there later.
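
To make that a little more concrete, here’s a minimal sketch of the “capture a copy of the current data” idea. None of these types exist in our actual codebase (the real thing is buried inside a much older and larger system); it’s just the shape of it: snapshot the mutable details at the moment the operation is executed, store the snapshot with the transaction, and have reports read the snapshot instead of the live entity.

using System;

// Illustrative types only; the real transaction and entity models are far older
// and messier than this.
public record Payee(Guid Id, string Bsb, string AccountNumber);

public record BankAccountSnapshot(string Bsb, string AccountNumber);

public record EftTransaction(
    Guid Id,
    DateTimeOffset OccurredAt,
    decimal Amount,
    Guid PayeeId,
    BankAccountSnapshot PaymentDetails); // captured once, never re-read from the payee

public interface IPayeeRepository
{
    Payee Get(Guid id);
}

public class EftTransactionFactory
{
    private readonly IPayeeRepository _payees;

    public EftTransactionFactory(IPayeeRepository payees)
    {
        _payees = payees;
    }

    public EftTransaction Create(Guid payeeId, decimal amount, DateTimeOffset now)
    {
        // Read the payee's current bank details exactly once, at the time the
        // operation is executed, and store the copy with the transaction. Reports
        // then render EftTransaction.PaymentDetails and never touch the live
        // payee record again.
        var payee = _payees.Get(payeeId);
        var snapshot = new BankAccountSnapshot(payee.Bsb, payee.AccountNumber);

        return new EftTransaction(Guid.NewGuid(), now, amount, payeeId, snapshot);
    }
}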

But what about the literal mountain of data that already exists with no captured meta information?

Timecop!

There were two obvious options that we could see to deal with the existing data:

  1. Augment the reporting/viewing logic such that it would use the captured data if it existed, but would revert back to the old approach if not.
  2. Rewrite history using current information and “capture” the current data, then just use the captured data consistently (i.e. in reports and whatnot).

The benefit of option one is that we’re just extending the logic that has existed for years. When we have better data we can use that, but if not, we just fall back to old faithful. The problem here is one of complication, as every usage now needs to do two things, with alternate code paths. We want to make the system simpler over time (and more reliable), not harder to grok. Also, doing two operations instead of one, combined with the terrible frameworks in use (a positively ancient version of Crystal Reports), led to all sorts of terrible performance problems.

Option two basically replicates the logic in option one, but executes it only once when we distribute the upgrade to our users, essentially capturing data at that point in time, which then becomes immutable. From that point forward everything is simple: just use the new approach, and all of the old data is the same as it would have been if the reports had been printed out normally.

If you couldn’t guess, we went with option two.
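
Mechanically, option two is just option one’s fallback logic run exactly once at upgrade time, against every transaction that doesn’t already have a snapshot. Continuing the purely illustrative types from the sketch above, the backfill looks something like this:

using System;
using System.Collections.Generic;

public interface ITransactionRepository
{
    IEnumerable<EftTransaction> GetEftTransactionsWithoutPaymentDetails();

    void SetPaymentDetails(Guid transactionId, BankAccountSnapshot details);
}

public class PaymentDetailsBackfill
{
    private readonly ITransactionRepository _transactions;
    private readonly IPayeeRepository _payees;

    public PaymentDetailsBackfill(ITransactionRepository transactions, IPayeeRepository payees)
    {
        _transactions = transactions;
        _payees = payees;
    }

    public void Run()
    {
        // Run once, at upgrade time. Any EFT transaction without captured payment
        // details gets the payee's current details copied in, exactly as the old
        // reporting logic would have resolved them. From this point on, the
        // snapshot is the only thing reports read.
        foreach (var transaction in _transactions.GetEftTransactionsWithoutPaymentDetails())
        {
            var payee = _payees.Get(transaction.PayeeId);

            _transactions.SetPaymentDetails(transaction.Id, new BankAccountSnapshot(payee.Bsb, payee.AccountNumber));
        }
    }
}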

Conclusion

What we were left with was a more reliable reporting system, specifically focused around the chronological security of data.

Also, I’m pretty sure I made up the term “chronological security”, but it sounds cool, so I’m pretty happy.

I honestly don’t know what led to the original decision to not capture key parts of the transaction in an immutable fashion, and with hindsight it’s easy for me to complain about it. I’m going to assume the group (or maybe even individual) that developed the feature simply did not think through the ramifications of the implementation over time. Making good software requires a certain level of care, and I know for a fact that that level of care was not always present for our long-suffering legacy software.

We’re better now, but that’s still a small slice of the overall history pie, and sometimes we build on some very shaky foundations.


I have a lot of half-formed thoughts about job titles.

On one hand, the engineer in me loves classification systems, and job titles seem to provide the capability to classify people such that you know approximately where they sit in regards to responsibilities.

On the other, the cynic in me has seen behind the curtain enough to know that titles are at the very least used so inconsistently in our industry that they are functionally meaningless.

Therefore this post is an exploratory piece, written as I try to solidify my thoughts on the subject.

Set your expectations appropriately low.

That’s Classified

Historically, I have to imagine that the purpose of a job title was to provide structure. If you had title X, you were responsible for A, B and C and you were probably paid Z. When it came time for you to move on, someone else could look at your title, understand what was required and what the remuneration was like and make an informed decision about whether or not they wanted to pursue that opportunity. If you saw someone else with said title, even from another company, then you could probably reason about what they do on a day to day basis and how well they were paid.

That seems like a decent enough line of reasoning, and in an industry that is structured and consistent, it could probably work.

I don’t think software development is such an industry.

That is not to say that we don’t love classifying ourselves though:

  • Developer (Graduate/Junior/Senior)
    • Front End
    • Back End
    • Full Stack
    • {Technology Specific}
  • Designer (UX/UI)
  • Leads (Technical/Team/Practice)
  • Architects
  • Tester (Automation/Manual)
    • QA
  • Analyst (Business/Data)
  • Manager (Iteration/Delivery)

Looking at that list, I can kind of explain what each title is expected to do, which is good, but that list is nowhere near exhaustive (I’ve left out the differentiation between Engineer and Developer, for example), even from my own experience. The sheer variety of titles available just makes extracting any value from the system harder than it could be.

I tried to write some things here about “here are classifications that I think might work”, but all I was doing was creating the situation described in this XKCD comic, so I gave up.

Always With The Rugby Metaphor

Instead, let’s look at Scrum.

Scrum has three roles, which are kind of titles, but not really.

  • Scrum Team Member
  • Product Owner
  • Scrum Master

The simplicity in this approach is great, because it doesn’t spend any effort on classifying people in a team. You either help deliver, make decisions and provide direction, or keep the machine running. That’s it.

It doesn’t preclude people from specialising their skillsets either, but nor does it support it. If you’re a scrum team member, then you’re helping to deliver an increment however you can. You might be testing, providing designs, implementing code, automating build pipelines or deploying infrastructure, it really doesn’t care. It fully expects you to be a well rounded person who is capable of contributing in many different ways (even though you might be better at some than others).

Of course, the Scrum approach doesn’t really fix anything; it’s just a different way of looking at the situation.

If you’re correlating titles to remuneration, do you pay all of your Scrum Team Members the same? Probably not, as there might very well be a range of skills and experience there that you want to draw monetary attention to.

But doesn’t that defeat the entire point of them all having the same title? It’s the team that is valuable, not the individual.

Reality Bites

The last two sections attempted to explore any intrinsic value that might exist in a job title itself, mostly ignoring the reality.

Unless you work in a highly structured part of the industry (maybe Government?), I doubt a lot of thought was put into your title. Hell, maybe you even got to make it up yourself.

People with the same title can have wildly different salaries, usually as a result of being hired at different times or simply negotiating better. There is generally little to no drive to adjust salaries in line with titles, because all that might do is bring attention to the people who were being underpaid. This inconsistency is probably the biggest hit against titles as a useful mechanism in the wild, but it still assumes that the titles are being used in good faith.

There is a darker side still, where titles form nothing more than a power game for those who enjoy office politics, used as bargaining chips in place of actual monetary recognition.

Salesmanship

To alleviate some of the doom and gloom in the last section, perhaps there is a way that titles can be beneficial regardless of how you acquire them.

As humans, we generally make assumptions when we see that a person worked in a job with title X, and we use that information to form part of our evaluation of a candidate.

“Oh, they were a team leader at company Y, and they did it for 12 months, they probably aren’t terrible at it”.

Even knowing what I know now about the mostly arbitrary relationship between title and actual responsibilities and value delivery, I still would probably scan a candidate’s professional history and make assumptions like that.

So, there is value in fighting for a good title, ideally one that accurately represents what you do, with the understanding that the main value in the title comes after you leave the place that you acquired it.

Conclusion

In conclusion, I should probably stick to technical posts or posts about Mario Kart or D&D. They are, by far, much easier to write, and honestly, probably more valuable to any poor soul who decides to read this blog.

To paraphrase the principal from Happy Gilmore:

At no point in this rambling, incoherent blog post was I even close to anything that could be considered a rational thought. Everyone on the internet is now dumber for having read it. I award myself no points, and may God have mercy on my soul.

It was nice to get some of these thoughts out of my head though.


In my experience, you are far more likely to extend an existing system as a professional software developer than you are to build something from scratch. Unless, that is, you’re amazing at engineering situations for yourself where you never have to build on top of existing code.

You tricky developer you.

Assuming that you’re not always as tricky as you want to be, when you build on top of an existing system, I have found the following quote to be quite helpful:

First, refactor to make the change easy (warning, this might be hard),

Then make the easy change

The internet says that this particular piece of wisdom originally came from Kent Beck in his book on TDD, which seems pretty likely.

Regardless of where it came from though, I’m going to use this post to explore a case study where I applied the quote directly while I was extending a piece of code that I didn’t originally write.

Minimum Viable Product

To set the stage, we have a small application that extracts data from a database and pushes it to an S3 bucket, where it is consumed by some third parties. It runs quietly in the background on a regular schedule thanks to TeamCity.

One of the consuming third parties would prefer that the data was delivered via SFTP though, to allow them to trigger some process automatically when the files arrive, rather than having to poll the S3 bucket for new files on their own schedule.

A completely understandable and relatable desire.

In the same vein, when we run the automated process through TeamCity, it would be nice if the files generated were attached to the execution as Build Artifacts, so that we could easily look at them later without having to go into the S3 bucket.

Obviously neither of these two requirements existed when the application was originally written, but now there need to be multiple ways to export the data for each run. Perhaps we don’t always want to export to all the destinations either; maybe we want to do a test run that only exports to TeamCity, for example.

In my current role I don’t typically write as much code as I used to, but everyone else was engaged in their own priorities, so it was time for me to step up, extend a system I did not write and flex my quickly atrophying technical muscles.

Structural Integrity Lacking

The application in question is not particularly complicated.

There are fewer than 20 classes in total, and they are generally pretty focused. For example, there is a class that offers the capability to write a file to S3 (abstracting away some of the complexity inherent in the AWS supplied interfaces), another for building queries, another for executing a query on top of a PostgreSQL database and streaming the results to a CSV file, and finally a class that orchestrates the whole “get data, upload data” process for all of the queries that we’re interested in.

That last class (the aptly named CsvS3Export) is where we need to start refactoring.

public class CsvS3Export : IS3Export
{
    private readonly IClock _clock;
    private readonly IFileNamer _fileNamer;
    private readonly ITableToCsvExporter _tableExporter;
    private readonly IS3FileUploader _s3Uploader;
    private readonly ITableExportSpecificationResolver _tableExportSpecificationResolver;
    private readonly AmazonS3Client _s3Client;
    private readonly IReporter _reporter;

    public CsvS3Export(
        IClock clock, 
        IFileNamer fileNamer, 
        ITableToCsvExporter tableExporter,
        IS3FileUploader s3Uploader,
        ITableExportSpecificationResolver tableExportSpecificationResolver,
        AmazonS3Client s3Client, 
        IReporter reporter
    )
    {
        _clock = clock;
        _fileNamer = fileNamer;
        _tableExporter = tableExporter;
        _s3Uploader = s3Uploader;
        _tableExportSpecificationResolver = tableExportSpecificationResolver;
        _s3Client = s3Client;
        _reporter = reporter;
    }

    public S3ExportsSummary ExportTables(IList<string> tableNames, string workingDirectory, string s3BucketName)
    {
        try
        {
            var runTimestamp = _clock.UtcNow;
            var summaries = tableNames
                .AsParallel()
                .Select((tn, i) => ExportTable(workingDirectory, s3BucketName, tn, runTimestamp));

            return new S3ExportsSummary(summaries);
        }
        catch (Exception e)
        {
            _reporter.ReportInfo($"Error running export \n\n {e}");
            return new S3ExportsSummary(e, tableNames);
        }
    }

    private S3TableExportSummary ExportTable(string workingDirectory, string s3BucketName, string tn, DateTimeOffset runTimestamp)
    {
        try
        {
            var fileName = _fileNamer.GetFileName(runTimestamp, tn);
            var spec = _tableExportSpecificationResolver.Resolve(tn);
            var localOutputFile = new FileInfo(workingDirectory + "\\" + fileName);
            var exportResult = _tableExporter.Export(spec, localOutputFile);

            var s3Location = new S3FileLocation(s3BucketName, fileName.Replace("\\", "/"));
            var uploadResult = _s3Uploader.Upload(exportResult.FileLocation, s3Location);

            return new S3TableExportSummary(spec.TableName, localOutputFile, uploadResult);
        }
        catch (Exception e)
        {
            _reporter.ReportInfo($"Error running export for table {tn}\n\n {e}");
            return new S3TableExportSummary(tn, e);
        }
    }
}

As you can tell from the source above, the code is very focused around a specific type of file (a CSV) and a specific export destination (S3).

We don’t really care about the file generation part (we have no interest in generating non-CSV files for now), but if we want to start adding new destinations, we really should break this class apart into its constituent parts.

Remember, first make the change easy (refactor the existing code to allow for new destinations), then make the easy change (create the new destinations and slot them in).

Export Tariffs

Rather than coupling the class directly with the S3 functionality, all we have to do is extract a simple IExporter interface and then take a list of exporters as a constructor dependency.

The CsvS3Export then becomes more of an orchestrator, calling out to the class that does the actual data extraction with a target file and then iterating (safely) through the exporters, with the intent to report on the success or failure of any of the components involved.

public class DefaultExportOrchestrator : IExportOrchestrator
{
    public DefaultExportOrchestrator(IClock clock, ITableExportSpecificationResolver resolver, IFileNamer namer, ITableToCsvExtractor extractor, List<IExporter> exporters)
    {
        _clock = clock;
        _resolver = resolver;
        _namer = namer;
        _extractor = extractor;
        _exporters = exporters;
    }

    private readonly IClock _clock;
    private readonly ITableExportSpecificationResolver _resolver;
    private readonly IFileNamer _namer;
    private readonly ITableToCsvExtractor _extractor;
    private readonly List<IExporter> _exporters;

    public ExportOrchestrationResult Export(IList<string> tableNames, string workingDirectory)
    {
        if (!Directory.Exists(workingDirectory))
        {
            return new ExportOrchestrationResult(new DirectoryNotFoundException($"The supplied working directory [{workingDirectory}] did not exist"));
        }

        var now = _clock.UtcNow;
        var extracted = tableNames
            .AsParallel()
            .Select(a => ExtractTable(a, workingDirectory, now));

        return new ExportOrchestrationResult(extracted.ToList());
    }

    private TableExportOrchestrationResult ExtractTable(string table, string workingDirectory, DateTimeOffset runTimestamp)
    {
        var fileName = _namer.GetFileName(runTimestamp, table, "csv");
        var spec = _resolver.Resolve(table);
        var localOutputFile = new FileInfo(Path.Combine(workingDirectory, fileName.Combined));

        var result = _extractor.Extract(spec, localOutputFile);

        if (!result.Successfull)
        {
            return new TableExportOrchestrationResult(table, result);
        }

        var exported = _exporters.AsParallel().Select(a => SafeExport(table, result.FileLocation, a)).ToList();

        return new TableExportOrchestrationResult(table, result, exported);
    }

    private ExportResult SafeExport(string tableName, FileInfo file, IExporter exporter)
    {
        try
        {
            return exporter.Export(tableName, file);
        }
        catch (Exception ex)
        {
            return new ExportResult(exporter.Description, tableName, file, false, string.Empty, ex);
        }
    }
} 
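
The post above doesn’t show the extracted contract itself, so for completeness, here’s roughly what IExporter and one of the new destinations might look like. The member names on IExporter and the ExportResult constructor arguments are taken from the orchestrator code above; everything else (including using a TeamCity service message for the artifact) is illustrative rather than the exact code that shipped.

using System;
using System.IO;

// The contract extracted from CsvS3Export: given a table name and the file the
// extractor produced, push the file somewhere and report how it went.
public interface IExporter
{
    string Description { get; }

    ExportResult Export(string tableName, FileInfo file);
}

// One of the new destinations: publish the file as a TeamCity build artifact by
// writing a service message to the build log. The ExportResult arguments mirror
// the construction in SafeExport above.
public class TeamCityArtifactExporter : IExporter
{
    public string Description => "TeamCity build artifact";

    public ExportResult Export(string tableName, FileInfo file)
    {
        // TeamCity watches stdout for service messages and attaches the
        // referenced file to the running build as an artifact.
        Console.WriteLine($"##teamcity[publishArtifacts '{file.FullName}']");

        return new ExportResult(Description, tableName, file, true, file.FullName, null);
    }
}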

The important part of this process is that I did not add any new exporters until I was absolutely sure that the current functionality of the application was maintained (via tests and whatnot).

Only once the refactor was “complete” did I add the new functionality, and test it independently and in isolation, safe in the knowledge that I was extending a system designed to be extended.

From a Git point of view, the refactor itself was a single commit, and the addition of each new exporter was another commit.

Because a clean Git history is a beautiful thing.

Conclusion

Looking back, you might think that the initial structure of the code was wrong, and it should have been built this way in the first place.

I disagree with that, because it did exactly what it needed to do before we went and changed the requirements. It would be a fool’s errand to try and anticipate all the future changes that might need to be made and accommodate them, not to mention incredibly expensive and wasteful, so I really can’t recommend going down that path.

The most important thing is to ensure that we consistently follow good engineering practices, so that later on when we need to make some changes, we don’t have to struggle through terribly structured code that doesn’t have any tests.

To tie it all back to the quote that started this post, spending the time to refactor the code to be more accepting of the pattern that I wanted to use made the addition of the new export destinations much easier, while still following good software engineering practices.

I could probably have jammed some naïve code right into that original CsvS3Export class to export to the other destinations, but that would have made the code much more confusing to anyone coming by at a later date.

There’s already enough terrible code in the world, why add to it?


Good news, everyone!

I bet you heard Professor Farnsworth’s voice in your head just then, didn’t you?

Our monthly Mario Kart Tournament is going strong. We’ve just finished our fifth season, with strong intentions to immediately start a sixth. Our plush shells are slowly building up a real sense of history, getting covered in the names of people who have won a tournament so far. It’s really great to see.

The races continue to be a bright spot each day, acting both as social lubricant and as a way to rest and recuperate from the stresses of the morning.

Like any Agile organisation though, we’re still iterating and improving on the formula, hence this follow up post.

Tooling Around

When we first started the tournament, we were just using a Google Sheet to capture scores and calculate rankings. As we finished each season, we just created a new sheet (well, we copied the old one and cleared its data). It was a relatively simple system, where each race was a row of data, and the ELO calculation was done via some spreadsheet magic.

It was pretty good to be honest, but it was just a spreadsheet. We’re mostly software engineers, so we took that as a challenge. It didn’t help that the spreadsheet implementation of the ELO algorithm was pretty intense too, and somewhat hard to maintain and reason about.

For a replacement piece of software, the initial requirements were as simple as the sheet:

  • The ability to record races (participants, score, etc)
  • The ability to calculate a ranked list of players

The engineer that ended up implementing the tool went above and beyond though, and also included basic season and participant management (including disqualifications, which are super useful for when people have to pull out of the tournament for reasons) and alternate scoring algorithms.

There’s still a bunch of improvements that we want to make (like easily inputting scores from a chatbot, overall ELO ranking (i.e. not seasonal), charts of ranking changes and so on), but it is pretty amazing already.

Orchestral Score

Speaking of alternate scoring algorithms...

After the first season (where we used a pretty naïve combination of average score + knockouts), we’ve mostly been using ELO.
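
If you haven’t run into it before, ELO itself is only a few lines of code once you’ve got pairwise results; the spreadsheet version was painful mostly because a spreadsheet is a miserable place to express it. The standard two-player update looks like this (how you pick the K factor and how you reduce an eight-player race to pairwise results is up to you, and I’m not claiming this is exactly what our tool does):

using System;

public static class Elo
{
    // Standard two-player Elo update. score is 1.0 if player A beat player B,
    // 0.0 if they lost, 0.5 for a draw. k controls how quickly ratings move.
    public static (double NewRatingA, double NewRatingB) Update(double ratingA, double ratingB, double score, double k = 32)
    {
        // Expected score for A, given the current rating gap.
        var expectedA = 1.0 / (1.0 + Math.Pow(10.0, (ratingB - ratingA) / 400.0));

        var newA = ratingA + k * (score - expectedA);
        var newB = ratingB + k * ((1.0 - score) - (1.0 - expectedA));

        return (newA, newB);
    }
}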

Now that we have a custom tool, it’s much easier to implement another algorithm that sits in parallel to the existing ones, which gives us a lot of freedom to try new things. For example, last season we started out by implementing a different scoring algorithm based around the concept of average win rate.

As races are run, the system determines how likely you are to beat every other participant as a percentage score. If you’ve never raced someone before, it will infer a win rate by using other people who you have raced against (i.e. a fairly basic transitive relationship), until you actually do race them. If it can’t do that, it will just use your average.

Your win rate percentage against all other participants in the season (a score between 0-100) is then summed together and all the sums are ranked numerically.
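
For the curious, here’s a rough sketch of that scoring idea, not the actual implementation from our tool. It reduces each race to pairwise results (finishing ahead of someone counts as a win against them), and where two people have never met it just falls back to the overall average described above, skipping the transitive inference step for brevity:

using System.Collections.Generic;
using System.Linq;

public class WinRateScorer
{
    // _pairwise[(a, b)] = a's record against b: (races a finished ahead of b, total races together).
    private readonly Dictionary<(string, string), (int Wins, int Total)> _pairwise =
        new Dictionary<(string, string), (int Wins, int Total)>();
    private readonly HashSet<string> _players = new HashSet<string>();

    public void RecordRace(IList<string> finishOrder)
    {
        for (var i = 0; i < finishOrder.Count; i++)
        {
            _players.Add(finishOrder[i]);

            for (var j = i + 1; j < finishOrder.Count; j++)
            {
                // i finished ahead of j, so i "beat" j and j "lost to" i.
                Record(finishOrder[i], finishOrder[j], won: true);
                Record(finishOrder[j], finishOrder[i], won: false);
            }
        }
    }

    // Sum of win rates (0-100) against every other participant, highest first.
    public IEnumerable<(string Player, double Score)> Rank()
    {
        return _players
            .Select(p => (Player: p, Score: _players.Where(other => other != p)
                                                    .Sum(other => 100.0 * WinRate(p, other))))
            .OrderByDescending(x => x.Score);
    }

    private void Record(string player, string opponent, bool won)
    {
        _pairwise.TryGetValue((player, opponent), out var current);
        _pairwise[(player, opponent)] = (current.Wins + (won ? 1 : 0), current.Total + 1);
    }

    private double WinRate(string a, string b)
    {
        if (_pairwise.TryGetValue((a, b), out var headToHead) && headToHead.Total > 0)
        {
            return (double)headToHead.Wins / headToHead.Total;
        }

        // The real tool infers a rate transitively via common opponents before
        // giving up; this sketch goes straight to a's overall average win rate.
        var all = _pairwise.Where(kvp => kvp.Key.Item1 == a).Select(kvp => kvp.Value).ToList();
        var totalRaces = all.Sum(r => r.Total);
        return totalRaces == 0 ? 0.5 : (double)all.Sum(r => r.Wins) / totalRaces;
    }
}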

As a whole, it was a decent enough algorithm, but as more and more races were run, changes to the scores became less and less meaningful, as each subsequent race had less and less of an impact on your average.

It was a pretty accurate long term representation of how you stacked up against the other people, but it was slow to change if you improved. It was also really boring from a tournament point of view, and didn’t really allow for comebacks and upsets, which are the most interesting bits.

In the end, we just used ELO again, but that doesn’t mean it wasn’t worthwhile.

You’ll never get any better if you don’t try new things.

Spanner? Meet Works

In order to keep things interesting, we’ve also added season specific conditions and constraints, rather than just racing the same way every time.

For example:

  • Season 3: 150cc (mirror), Standard Kart, Shy Guy
  • Season 4: 150cc, Sports Bike, Yoshi
  • Season 5: 150cc, Free For All
  • Season 6: 200cc, Frantic Items, Free For All

This has honestly had a bit of a mixed reception.

Some people are happy to be disrupted and learn how to play a style different than they might normally be used to.

Other people are unhappy about having to change just to participate. This comes down to a few different reasons, including comfort level (“I like racing like this, why change”), a desire to get better at one particular build instead of becoming average at many, and just pure enjoyment or lack thereof (“It’s called Mario Kart, not Mario Bike”).

I personally enjoy the conditions and constraints, because I think it keeps everything from getting stale, but I can see the other side of the argument as well.

Conclusion

And that’s kind of it for the update on our Mario Kart Tournaments.

All in all I think it continues to be a great social activity that helps people get to know one another and provides a nice non-cerebral break in the middle of the day.

Having said that, our first season had the most participation by far, so I think the competitive nature of the tournament (and the improving skills of the regular combatants) is erecting something of a barrier to entry for the people who aren’t quite as competitive.

I have some ideas about how we might be able to deal with that, but I’m not entirely sure how effective they will be.

For example, if it was a requirement that anyone in the top four was unable to participate in the next season until they trained a protégé, that might encourage highly competitive people to induct new members into the group.

Or they might just stop playing completely, which would be unfortunate and make me sad.