
It’s the gift that keeps on giving, our data synchronization process!

Well, it keeps on giving to me anyway, because it’s fuel for the furnaces of this blog. Sometimes finding topics to write about every week can be hard, so it’s nice when they drop into your lap.

Anyway, the process has started to creak at the seams a bit, because we’re pushing more data through it than ever before.

And when I say creak at the seams, what I mean is that our Read IOPS usage on the underlying database has returned to being consistently ridiculous.

Couldn’t Eat Another Bite

The data synchronization process had been relatively stable over most of 2018. Towards the middle, we scaled the underlying database to allow for the syncing of one of the two biggest data sets in the application, and after a slow rollout, that seemed to be going okay.

Of course, with that success under our belt, we decided to sync the other biggest data set in the application. Living life on the edge.

We ended up getting about halfway through before everything started to fall apart again, with similar symptoms to last time (spiking Read IOPS capping out at the maximum allowed burst, which would consume the IO credits and then tank the performance completely). We tried a quick fix of switching to provisioned IOPS (to guarantee performance and remove the tipping point created by the consumption of IO credits), but it wasn’t enough.

The database just could not keep up with what was being demanded of it.

I’m A Very Understanding Person

Just like last time, the first step was to have a look at the queries being run and see if there was anything obviously inefficient.

With the slow queries related to the “version” of the remote table mostly dealt with in our last round of improvements, the majority of the slow queries remaining were focused on the part of the process that gets a table “manifest”. The worst offenders were the manifest calls for one of the big tables that we had only started syncing relatively recently. Keep in mind that this table is the “special” one featuring hard deletes (compared to the soft deletes of the other tables), so it was using the manifest functionality a lot more than any of the other tables were.

Having had enough of software level optimizations last time, we decided to try a different approach.

An approach that is probably, by far, the most common one when dealing with performance issues in a database.

Indexes.

Probably The Obvious Solution

The first time we had performance problems with the database we shied away from implementing additional indexes. At the time, we thought that the indexes we did have were the most efficient for our query load (being a Clustered Index on the two most selective fields in the schema), and we assumed we would have to look elsewhere for optimization opportunities. Additionally, we were worried that the performance issues might have an underlying cause related to total memory usage, and adding another index (or 10) just means more things to keep in memory.

Having scaled the underlying instance and seen no evidence that the core problem was memory related, we decided to pull the index lever this time.

Analysis showed that the addition of another index similar to the primary key would allow for a decent reduction in the amount of reads required to service a single request (in that the index would short-circuit the need to read the entire partition of the data set into memory in order to figure out what the max value was for the un-indexed field). A quick replication on our performance testing environment proved it unequivocally, which was nice.

For implementation, it’s easy enough to use Entity Framework to add an index as part of a database migration, so that’s exactly what we did.
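
For context, the migrations themselves looked something like the sketch below. It assumes EF6-style code-based migrations, and the table, column and index names are invented for illustration rather than lifted from our actual schema.

using System.Data.Entity.Migrations;

// A rough sketch of an index-adding migration. The composite index covers the
// selective field plus the previously un-indexed one, so the max value can be
// read straight from the index instead of scanning the whole partition.
public partial class AddEntityVersionIndex : DbMigration
{
    public override void Up()
    {
        CreateIndex(
            table: "entity",
            columns: new[] { "customer_id", "version" },
            name: "IX_entity_customer_id_version");
    }

    public override void Down()
    {
        DropIndex(table: "entity", name: "IX_entity_customer_id_version");
    }
}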

We only encountered two issues, which was nice:

  • We didn’t seem to be able to use the concurrent index creation feature in PostgreSQL with the version of EF and Npgsql that we were using (which are older than I would like)
  • Some of the down migrations would not consistently apply, no matter what we tried

Neither of those two factors could stop us though, and the indexes were created.

Now we just had to roll them out.

Be Free Indexes, Be Free!

That required a little finesse.

We had a decent number of indexes that we wanted to add, and the datasets we wanted to add them to were quite large. Some of the indexes only took a few minutes to initialise, but others took as long as twenty.

Since we couldn’t seem to get concurrent index creation working with Entity Framework data migrations, we had to apply them one at a time across sequential releases.

Not too hard, but a little bit more time consuming than we originally desired.

Of course, the sync process being what it is, it’s okay if it goes down for half an hour every now and then, so we just took everything out of service temporarily on each deployment to ensure that the database could focus on the index creation without having to worry too much about dealing with the constant flood of requests that it usually gets.

Conclusion

At the end of the day, this round of performance investigation and optimization actually took a hell of a lot less time and effort than the last, but I think that’s kind of to be expected when you’re actively trying to minimise code changes.

With the first few of the indexes deployed, we’ve already seen a significant drop in the Read IOPS of the database, and I think we’re going to be in a pretty good place to continue to sync the remainder of the massive data set that caused the database to choke.

The best indicator of future performance is past data though, so I’m sure there will be another post one day, talking all about the next terrible problem.

And how we solved it of course, because that’s what we do.


A new year means more blog posts, and there is no better time to start than now.

Or maybe a week ago I suppose when the new year actually started, but I was on holidays, and writing blog posts while I’m on holidays just seems wrong. Blog posts are written on the train on my way to work, and that pattern is far too ingrained to do anything about now.

Anyway, that’s probably enough rambling, so let’s get on with the show and discuss software prototypes, because I have opinions and this is the internet.

It’s Code, But You Throw It Away

Software prototyping is simple in concept, but quickly gets complicated in execution.

Typically a prototype consists of some engineers throwing something together, probably ignoring normal engineering practices, to prove an idea or approach. Then, once it’s served its purpose, those same engineers toss it in the garbage.

The goal is to learn, not to create a long lasting artefact, and that is often where prototypes become dangerous. If a business sees something working (and a prototype probably works, even though it might have rough edges), then it might be inclined to make plans based on that. Perhaps attempt to push it out to a larger audience than was originally intended, or to start making claims that a feature is complete and ready to use.

It’s a horrible feeling, watching the terrifying hacked together piece of code meant to prove a possibility become a core part of a business process. Especially so when you’re the one responsible for maintaining it, probably because you’re the only one who knows how it works.

Like I said, simple in concept, but complicated in the long run.

Of course, a prototype does not strictly have to be thrown away, but in my opinion, if you’re not throwing it away at the end, you’re probably just doing iterative development. If that’s the case, you really should be following good engineering practices all the way through instead of hacking something together and then trying to build on top of unstable foundations later.

Building Things To Answer The Wrong Questions

This blog post exists because we built a prototype recently.

I’m sure you think that the next paragraph is going to describe the situation where it “accidentally” became a core part of the business and it’s causing all sorts of problems, but that is surprisingly not the case. Everyone involved understood the purpose and limitations of the prototype and it was abandoned at the appropriate time, once it had served its purpose.

I had a completely different issue with our prototype experience; we probably shouldn’t have built one at all, and the construction of the prototype felt like it was wasted effort.

The situation we found ourselves in was that we wanted to provide some new functionality to the users of our legacy application that leveraged our new and shiny cloud platform. Kind of like a typical integration, with two different systems working in tandem, but we had a lot of control over both sides.

We prototyped the process for getting the two systems to talk to each other, with the plan that once we had that working at least partially, we could go and have early conversations with customers to see if they wanted to use it and how.

The reality was that the actual data flow between the two systems never really came up in any of those early conversations, as the topics covered were almost entirely focused around the new features available. We already had a one-off data migration process that would initialize the cloud system with information from the legacy software, and honestly, that would have been enough to start the conversation.

So the first mark against the prototype was that it just didn’t feel like its existence made a difference.

Hindsight Is Misleading

To be fair, I could very well be suffering from the curse of hindsight. Being able to look back at a situation with current knowledge and see a much more efficient way to do it is not really surprising after all. That’s how learning works.

Or it could be that we simply held on to the prototype for too long, and should have switched to constructing it (iteratively) for real sooner. Possibly as soon as we had answered the question “is it even possible?”.

Instead we held on to the prototype while we engaged with customers because we wanted to give them a sense of how the system would work in practice. Of course, because we were asking them to do real work in an environment that would one day be thrown away, they were rightly resistant, so not only did we gain little to nothing from throwing together the process from a customer conversation point of view, we actually made it harder to engage with them in relation to trying out the system for real.

If we had simply started building the process out, piece by piece, following good engineering practices, we probably would have ended up at the same place in the end. It might have taken us longer in terms of constructing the real version (not having the lessons of the prototype to build on top of), but total time spent would probably have been less. Not only that, but the customers would have been able to try it out for real sooner, which would have given us the feedback that we needed sooner as well.

That’s not to say that a prototype is never beneficial, just that in my most recent experience, it didn’t really feel like it generated an appropriate amount of value.

Conclusion

Unlike the technical posts that I make, this one feels much more like a series of vaguely connected musings.

I don’t really have a concrete conclusion or lesson to take away, I’m just left with a vague sense that our most recent experiment with building a prototype was a waste of time and effort that could have been better spent elsewhere.

Of course, there’s always the possibility that the specific situation we found ourselves in was simply a bad place to apply a prototype (which seems likely looking back), or maybe the prototype actually generated a huge amount of value and it’s just hard to see it in hindsight, because we have that knowledge now and it’s hard to analyse the situation without it.

Perhaps I’ll be making another post in a few months about a situation where I wished we had built a prototype…


With no need for additional fanfare, I now present to you the continuation of last week’s post about DDD 2018.

Break It Down

My fourth session for the day was presented by the wonderful Larene Le Gassick.

As a woman in tech, Larene was curious about the breakdown of gender for the speakers participating in the various Brisbane based Meetups, so she built a bot that would aggregate all of that information together and post it into Slack, thus bringing the data out into the open.

Well, the word “bot” might be overselling it.

It was Larene. Larene was the bot.

Regardless of the mechanism, there was some good tech stuff in there (including a neat website using an NES CSS style), but the real value from the process was in the data itself, and the conversation that it started when presented in a relatively public place on a regular basis.

From my own experience, the technology industry, and software development in particular, does seem to be male dominated. I’m honestly unsure whether that’s a good or bad thing, but I am fully in favour of encouraging more participation from anyone who wants to get involved, regardless of sex, race or any other discriminating factor you can think of.

DDD in particular is pretty great for this sort of inclusiveness actually, sometimes resulting in surprising feedback.

Actually, This Time It Does Mean What You Think It Means

The fifth session that I attended was delivered by Steve Morris in his usual style. Which is to say, awesomely.

To be honest, I probably could have skipped this session as it was basically Domain Driven Design 101, but it was still pretty useful as a refresher all the same.

Domain driven design is in a weird place in my head. The blue book is legendary for how dry and difficult to read it is, but there is some really great stuff in there. Actually trying to understand and then model the domain that your software is operating in seems like an extremely good idea, but it’s one of those things that’s really hard to do properly.

I’ve inherited at least one system built by people who had clearly read some of the book, but what I ended up with was a hard to maintain and understand system, so I’m going to assume that they did it wrong. I don’t know how to do it right though.

Regardless, I’ll keep trying to head in that direction as best I can.

Intelligent Design

The sixth session of the day was a presentation on UX and Design by Jamie Larkin. Her first such presentation in fact, which was actually really hard to tell, because she did extremely well.

The session itself was fantastic.

I’ve always questioned why developers seem to shy away from design (or why designers shy away from development), and I like to think that I’ve tried to keep UX high in my priority list when implementing things in the past. Having said that, I’m definitely not cognizant of many design patterns and principles, so it was really nice to see someone with experience in both design and development talk about the topic.

The main body of the talk was focused on UX design patterns presented in such a way that they would be relevant to developers. Even better, the presentation used real websites (MailChimp and Airbnb) as examples. This was pretty great, because it paired the generic design principles with concrete examples of how they had been applied, or in some cases, how the design principles had been broken and how it was negatively affecting the resulting user experience.

Some specific takeaways:

  • Consistency is key. If you’re building something inside a system, it’s probably a good idea to match the style that is already present, even if it results in a sub-optimal experience. Disjointed design can be extremely damaging to the user experience.
  • Put things where users will expect to find them. This might mean bending towards common interaction paradigms (i.e. it looks like Word), or even just spending the time to understand your users so that interaction elements appear in places that make sense to them.
  • Understand what the user wants to accomplish and focus the experience around that. That is, don’t just present information or actions for no reason, focus them around goals and intent.
  • Consider the context in which the user wants to use the software. Are they on a train? In a car? At home in bed? Smart answers to these questions can make a huge difference to the usability of your service.
  • Feedback to the user while operating your system is essential. Things like hover highlights, immediate feedback when validating user input and loading or processing indicators can really reinforce in the user’s mind that they are doing something meaningful and that the system recognizes that.

At the end of the session I left richer in knowledge than when I arrived, so I consider that a victory.

Don’t Trust Your Brain

The last session I attended was a presentation on cognitive bias by Joseph Cooney.

For me, this was the most interesting session of the day, as it really reinforced that I should never trust the first thing that comes into my brain, because it was probably created as a result of a lazy thought process that took as many shortcuts as it could.

I’ve been aware of the concept of cognitive bias for a while now, but I didn’t really understand it all that well. To be honest, I still don’t really understand it all that well, but I think I know more about it than I did before the session, so that’s probably a good outcome.

To quote my notes from the session verbatim:

Cognitive bias is the situations where people don't make rational decisions for a number of reasons (which may not be conscious). Kind of like an optical illusion, but harder to dispel.

Not the greatest definition in the world, but good enough to be illustrative I think.

What it comes down to is that the human brain appears to operate via a combination of two systems:

  • The first system is automatic, effortless, fast and specialized. It’s always running in the background and offers up images and feelings as opposed to raw data. It thinks in stories and deals with ambiguity well, even retconning past events to fit into a new model as necessary.
  • The second system is deliberate, effortful, slow, general purpose and incredibly lazy. That is, you have to actually try to engage it, as it’s expensive to run.

The first system does a lot of work, and helps you to make decisions quickly and without fuss. Unfortunately, sometimes it takes a shortcut that is less appropriate than it could be and makes a non-ideal decision, thus cognitive bias.

As conscious beings though, we can choose to be aware of the decisions being made by the first system, question them and kick the second system into gear if we need to (performing the rational and data based analysis that we thought we were probably doing in the first place).

I’m sure I haven’t done the topic justice here though, so if you’re interested, I recommend starting with the Wikipedia article and discovering all the ways in which I have misinterpreted and otherwise misrepresented such an interesting facet of the human psyche.

In summary, 10/10, would listen to talk again.

Conclusion

Unfortunately, I had to bug out before the locknote (a session on how to support constant change), but all in all the day was well worth it.

It’s always nice to see a decent chunk of the Brisbane Developer Community get together and share the knowledge they’ve gained and the lessons they’ve learned over the last year. DDD is one of those low-key conferences that just kind of happens (thanks to the excellent efforts of everyone involved obviously), but doesn’t seem to have the underlying agenda that others do. It really does feel like a bunch of friends getting together to just chat about software development stuff, and I appreciate that.

If you get a chance, I highly recommend attending.


This post is a week later than I originally intended it to be, but I think we can all agree that terrifying and unforeseen technical problems are much more interesting than conference summaries.

Speaking of conference summaries!

DDD Brisbane 2018 was on Saturday December 1, and, as always, it was a solid event for a ridiculously cheap price. I continue to heartily recommend it to any developer in Brisbane.

In an interesting twist of fate I actually made notes this time, so I’m slightly better prepared to author this summarization.

Let’s see if it makes a difference.

I Don’t Think That Word Means What You Think It Means

The first session of the day, and thus the keynote, was a talk on Domain Driven Design by Jessica Kerr.

Some pretty good points here about feedback/growth loops, and ensuring that when you establish a loop, you understand what indirect goal you are actually moving towards. One of the things that resonated the most with me here was how most long term destinations are actually the acquisition of domain knowledge in the brains of your people. This sort of knowledge acquisition allows for a self-perpetuating success cycle, as the people building and improving the software actually understand the problems faced by the people who use it and can thus make better decisions on a day to day basis.

As a lot of that knowledge is often sequestered inside specific people’s heads, it reinforced to me that while the software itself probably makes the money in an organization, it’s the people who put it together that allow you to move forward. Thus retaining your people is critically important, and the cost of replacing a person who is skilled at the domain is probably much higher than you think it is.

A softer, less technical session, but solid all round.

Scale Mail

The next session that I attended was about engineering for scale from a DDD staple, Andrew Harcourt.

Presented in his usual humorous fashion, it featured a purely hypothetical situation around a census website and the requirement that it be highly available. Something that would never happen in reality I’m sure.

Interestingly enough, it was a live demonstration as well, as he invited people to “attack” the website during the talk, to see if anyone could flood it with enough requests to bring it down. Unfortunately (fortunately?) no-one managed to do any damage to the website itself, but someone did manage to take out his Seq instance, which was pretty great.

Andrew went through a wealth of technical detail about how the website and underlying service was constructed (Docker, Kubernetes, Helm, React, .NET Core, Cloudflare) illustrating the breadth of technologies involved. He even did a live, zero-downtime deployment while the audience watched, which was impressive.

For me though, the best parts of the session were the items to consider when designing for scale, like:

  • Actually understand your expected load profile. Taking the Australian Census as an example, it needed to be designed for 25 million requests over an hour (i.e. after dinner as everyone logged on to do the thing), instead of that load spread evenly across a 24 hour period. In my opinion, understanding your load profile is one of the more challenging aspects of designing for scale, as it is very easy to make a small mistake or misunderstanding that snowballs from that point forward.
  • Make the system as simple as possible. A simpler system will have less overhead and generally be able to scale better than a complex one. The example he gave (his Hipster Census) contained a lot of technologies, but was conceptually pretty straightforward.
  • Provide developers with a curated path to access the system. This was a really interesting one, as when he invited people to try and take down the website, he supplied a client library for connecting to the underlying API. What he didn’t make obvious though, was that the supplied client library had rate limiting built in, which meant that anyone who used it to try and flood the service was kind of doomed from the start. A sneaky move indeed. I think this sort of thing would be surprisingly effective even against actual attackers, as it would catch out at least a few of them.
  • Do as little as possible up front, and as much as possible later on. For the census example specifically, Andrew made a good point that its more important to simply accept and store the data, regardless of its validity, because no-one really cares if it takes a few months to sort through it later.
  • Generate access tokens and credentials through math, so that it’s much easier to filter out bad credentials later. I didn’t quite grok this one entirely, because there was still a whitelist of valid credentials involved, but I think that might have just been for demonstration purposes. The intent here is to make it easier to sift through the data later on for valid traffic.

As is to be expected from Andrew, it was a great talk with a fantastic mix of both new and shiny technology and real-world pragmatism.

Core Competencies

The third session was from another DDD staple, Damien McLennan.

It was a harrowing tale of one man’s descent into madness.

But seriously, it was a great talk about some real-world experiences using .NET Core and Docker to build out an improved web presence for Work180. Damien comes from a long history of building enterprisey systems (his words, not mine) followed by a chunk of time being entirely off the tools altogether, and the completely different nature of the work he had to do in his new position (CTO at Work180) threw him for a loop initially.

The goal was fairly straightforward; replace an existing hosted solution that was not scaling well with something that would.

The first issue he faced was selecting a technology stack from the multitude that were available; Node, Python, Kotlin, .NET Core and so on.

The second issue he faced, once he had made the technology decision, was feeling like a beginner again as he learned the ins and outs of an entirely new thing.

To be honest, the best part of the session was watching a consummate industry professional share his experiences struggling through the whole process of trying a completely different thing. Not from a “ooooo, a train wreck” point of view though, because it wasn’t that at all. It was more about knowing that this is something that other people have gone through successfully, which can be really helpful when it’s something that you’re thinking about doing yourself.

Also, there was some cool tech stuff too.

To Be Continued

With three session summaries out of the way, I think this blog post is probably long enough.

Tune in next week for the thrilling conclusion!


And so I return from my break, with a fresh reenergized hatred for terrible and unexpected technical issues.

Speaking of which…

Jacked Up And Good To Go

I’ve been sitting on a data export project for a while now. It’s a simple C# command line application that connects to one of our databases, correlates some entities together, pulls some data out and dumps it into an S3 bucket as CSV files. It has a wide variety of automated unit tests proving that the various components of the system function as expected, and a few integration and end-to-end tests that show that the entire export process works (i.e. given a set of command line arguments + an S3 bucket, does running the app result in data in the bucket).

From a purely technical point of view, the project has been “complete” for a while now, but I couldn’t turn it on until some other factors cleared up.

Well, while I was on holidays those other factors cleared up, so I thought to myself “I’ll just turn on the schedule in TeamCity and all will be well”.

Honestly, you’d think having written as much software as I have that I would know better.

That first run was something of a crapshow, and while some of the expected files ended up in the S3 bucket, a bunch of others were missing completely.

Even worse, the TeamCity task that we use for job execution thought everything completed successfully. The only indicator that it had failed was that the task description in TeamCity was not updated to show the summary of what it had done (i.e. it simply showed the generic “Success” message instead of the custom one we were supplying), which is suspicious, because I’ve seen TeamCity fail miserably like that before.

Not Enough Time In The World

Breaking it down, there were two issues afoot:

  1. Something failed, but TeamCity thought everything was cool
  2. Something failed

The first problem was easy to deal with; there was a bug in the way the process was reporting back to TeamCity when there was an unhandled exception in the data extraction process. With that sorted, the next run at least indicated a failure had occurred.
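
Conceptually, the fix boiled down to something like the sketch below. The buildProblem and buildStatus service messages are real TeamCity features, but the surrounding structure (and the RunExport entry point) is purely illustrative, not our actual code.

using System;

public static class Program
{
    public static int Main(string[] args)
    {
        try
        {
            var summary = RunExport(args); // hypothetical entry point into the actual export logic
            Console.WriteLine($"##teamcity[buildStatus text='{summary}']");
            return 0;
        }
        catch (Exception ex)
        {
            // Surface the failure to TeamCity and via the exit code. Note that TeamCity
            // service messages need |, ', [ and ] escaped; omitted here for brevity.
            Console.WriteLine($"##teamcity[buildProblem description='{ex.Message}']");
            return 1;
        }
    }

    // Placeholder for the real export process.
    private static string RunExport(string[] args) => throw new NotImplementedException();
}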

With the process indicating correctly that a bad thing had happened, the cause of the second problem became obvious.

Timeouts.

Which explained perfectly why the tests had not picked up any issues, as while they run the full application (from as close to the command line as possible), they don’t run a full export, instead leveraging a limiter parameter to avoid the test taking too long.

Entirely my fault really, as I should have at least done a full export at some stage.

Normally I would look at a timeout with intense suspicion, as they typically indicate an inefficient query or operation of some sort. Simply raising the time allowed when timeouts start occurring is often a route to a terrible experience for the end-user, as operations take longer and longer to do what they want them to.

In this case though, it was reasonable that the data export would actually take a chunk of time greater than the paltry 60 seconds applied to command execution in Npgsql by default. Also, the exports that were failing were for the larger data sets (one of which had some joins onto other data sets) and being full exports, could not really make effective use of any indexes for optimisation.

So I upped the timeouts via the command line parameters and off it went.
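
For reference, the timeout in question is the Npgsql command timeout, which can be raised either through the connection string or per-command. Both approaches are sketched below with placeholder values; in the real process the numbers came in through command line parameters.

using Npgsql;

// Placeholder connection details and query, for illustration only.
var baseConnectionString = "Host=replica.example.internal;Database=app;Username=exporter;Password=secret";

// Option 1: raise the default for every command created from this connection
// string (CommandTimeout is measured in seconds).
var builder = new NpgsqlConnectionStringBuilder(baseConnectionString)
{
    CommandTimeout = 3600
};

// Option 2: override it on an individual command.
using (var connection = new NpgsqlConnection(builder.ConnectionString))
using (var command = new NpgsqlCommand("SELECT 1", connection)) // stand-in for the real export query
{
    command.CommandTimeout = 3600;
    connection.Open();
    // ... execute the long running export query here
}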

Three hours later though, and I was pretty sure something else was wrong.

Socket To Me

Running the same process with the same inputs from my development environment showed the same problem. The process just kept on chugging along, never finishing. Pausing the execution in the development environment showed that at least one thread was stuck waiting eternally for some database related thing to conclude.

The code makes use of parallelisation (each export is its own isolated operation), so my first instinct was that there was some sort of deadlock involving one or more exports.

With the hang appearing to be related to database connectivity, I thought that maybe it was happening in the usage of Npgsql connections, but each export creates its own connection, so that seemed unlikely. There was always the possibility that the problem was related to connection pooling though, which is built into the library and is pretty much static state, but I had disabled that via the connection string, so it shouldn’t have been a factor.
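
For completeness, disabling Npgsql’s pooling is just a connection string flag, along these lines (host and credentials are placeholders):

// Pooling=false tells Npgsql to open a fresh physical connection each time,
// taking its static pool state out of the equation entirely.
var connectionString = "Host=replica.example.internal;Database=app;Username=exporter;Password=secret;Pooling=false";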

I ripped out all of the parallelisation and ran the process again and it still hung. On closer investigation, it was only one specific export that was hanging, which was weird, because they all use exactly the same code.

Turning towards the PostgreSQL end of the system, I ran the process again, except this time started paying attention to the active connections to the database, the queries they were running, state transitions (i.e. active -> idle) and execution timestamps. This is pretty easy to do using the query:

SELECT * FROM pg_stat_activity

I could clearly see the export that was failing execute its query on a connection, stay active for 15 minutes and then transition back to idle, indicating that the database was essentially done with that operation. On the process side though, it kept chugging along, waiting eternally for data to come through the socket it was listening on; data that would apparently never come.

The 15 minute query time was oddly consistent too.

It turns out the query was actually being terminated server side because of replica latency (the export process queries a replica DB), which was set to max out at, wait for it, 15 minutes.

For some reason the stream returned by the NpgsqlConnection.BeginTextExport(sql) just never ends when the underlying query is terminated on the server side.

My plan is to put together some information and log an issue in the Npgsql GitHub repository, because I can’t imagine that the behaviour is intended.
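
For anyone curious, the read side of the export looks roughly like the sketch below. BeginTextExport is the real Npgsql API, but the query, connection string and output path are all invented for illustration.

using System.IO;
using Npgsql;

// Simplified version of the export read loop. When the server terminated the
// query because of replica latency, ReadLine neither returned null nor threw,
// so the process sat here forever.
var connectionString = "Host=replica.example.internal;Database=app;Username=exporter;Password=secret;Pooling=false";

using (var connection = new NpgsqlConnection(connectionString))
using (var output = new StreamWriter("export.csv"))
{
    connection.Open();
    using (var reader = connection.BeginTextExport("COPY (SELECT * FROM some_big_table) TO STDOUT WITH (FORMAT CSV)"))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            output.WriteLine(line);
        }
    }
}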

Solve For X

With the problem identified, the only question remaining was what to do about it.

I don’t even like that our maximum replica latency is set to 15 minutes, so raising it was pretty much out of the question (and this process is intended to be automated and ongoing, so I would have to raise it permanently).

The only real remaining option was to break down the bigger query into a bunch of smaller queries and then aggregate the results myself.

So that’s exactly what I did.

Luckily for me, the data set had a field that made segmentation easy, though running a bunch of queries and streaming the results into a single CSV file meant that they had to be run sequentially, so no parallelization bonus for me.
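
The end result looked something like the sketch below. The segment field, table name and query are invented for illustration, but the shape is the relevant part: one smaller COPY per segment, run sequentially, all appended to the same CSV file.

using System.Collections.Generic;
using System.IO;
using Npgsql;

public static class SegmentedExporter
{
    // Hypothetical sketch of the segmented export: each segment's query finishes
    // comfortably inside the replica latency limit, and every segment's rows get
    // appended to the same output file, so the final CSV looks identical to the
    // old single-query export.
    public static void ExportInSegments(string connectionString, IEnumerable<int> segmentIds, string outputPath)
    {
        using (var output = new StreamWriter(outputPath))
        using (var connection = new NpgsqlConnection(connectionString))
        {
            connection.Open();
            foreach (var segmentId in segmentIds)
            {
                var copy = $"COPY (SELECT * FROM some_big_table WHERE segment_id = {segmentId}) TO STDOUT WITH (FORMAT CSV)";
                using (var reader = connection.BeginTextExport(copy))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        output.WriteLine(line);
                    }
                }
            }
        }
    }
}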

Conclusion

This was one of those issues where I really should have had the foresight to see the first problem (timeouts when dealing with large data sets), but the existence of what looks to be a real bug made everything far more difficult than it could have been.

Still, it just goes to show that no matter how confident you are in a process, there is always the risk that when you execute it in reality that it all might go belly up.

Which just reinforces the idea that you should always be running it in reality as soon as you possibly can, and paying attention to what the results are.