
Spurred on by this video, we’ve been discussing the concept of technical debt, so this post is going to be something of an opinion piece.

I mean, what does everyone need more than anything? Another opinion on the internet, that’s what.

To be harsh for a moment, at a high level I think that a lot of the time people use the concept of technical debt as a way to justify crappy engineering.

Of course, that is a ridiculous simplification, so to elaborate further, let's look at three potential ways to classify a piece of code:

  • Technical debt
  • Terrible code
  • Everything else

The kicker is that a single piece of code might be classified into all three of those categories at different points in its life.

Blood Debt

Before we get into that though, let's dig into the classifications a bit.

For me, classifying something as technical debt implies that a conscious, well-founded, software-related decision was made, that all of the parties involved understood the risks and ramifications of that decision, and that they agreed upon a plan to rectify it. Something like: “we will implement this emailing mechanism in a synchronous way, which will present a longer-term scaling problem; if/when our total usage approaches a pre-defined number, we will return to the area and change it to be asynchronous in order to better deal with the traffic”. Generally, these sorts of decisions trade delivery speed for a limitation of some sort (like pretty much all code), but the key difference is the communication and acceptance of that limitation.
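To make that emailing example a little more concrete, here is a minimal, purely illustrative sketch (Python, with hypothetical names and addresses, not code from any real system) of the trade-off being described: the synchronous implementation that ships quickly, and the asynchronous shape it is agreed to move to later.

```python
# Hypothetical sketch of the decision described above; all names are illustrative.
import queue
import smtplib
from email.message import EmailMessage

def send_welcome_email_sync(smtp_host: str, recipient: str) -> None:
    """The deliberately simple version that ships first (the 'debt')."""
    message = EmailMessage()
    message["From"] = "noreply@example.com"  # placeholder sender
    message["To"] = recipient
    message["Subject"] = "Welcome!"
    message.set_content("Thanks for signing up.")
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(message)  # blocks the caller until SMTP responds

# The agreed remediation: once usage passes the pre-defined threshold,
# enqueue the work instead and let a background worker drain the queue.
outgoing_emails: "queue.Queue[str]" = queue.Queue()

def send_welcome_email_async(recipient: str) -> None:
    outgoing_emails.put(recipient)  # returns immediately; a worker sends later
```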

Honestly, the definition of technical debt outlined above is so restrictive that it is almost academic.

Terrible code, on the other hand, is just that. You usually know it when you see it: it's code that hurts you when you try to change or understand it.

For me, any single element out of the following list is a solid indication of terrible code:

  • No automated tests
  • Poorly written tests
  • Constant and invasive duplication
  • Having to update all seven hundred places where a constructor is instantiated when you change it
  • Massive classes with many responsibilities
  • Repeated or duplicated functionality
  • Poor versioning

I could go on and on, but the common thread is definitely “things that make change hard”. Keeping these sorts of things at bay requires consistent effort and discipline, especially in the face of delivery pressure, but they are all important for the long term health of a piece of software.
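The constructor point from that list is worth a quick sketch (hypothetical code, not from any codebase I'm describing): if every caller constructs the thing directly, a signature change ripples everywhere, whereas funnelling construction through one place contains the blast radius.

```python
# Illustrative only: the "update every call site" problem from the list above.
class ReportGenerator:
    # Adding a new required parameter here breaks every place that calls
    # ReportGenerator(...) directly - potentially hundreds of call sites.
    def __init__(self, data_source, formatter, logger):
        self.data_source = data_source
        self.formatter = formatter
        self.logger = logger

# One common mitigation is to funnel construction through a single factory,
# so a signature change is absorbed in one place instead of seven hundred.
def create_report_generator(config) -> ReportGenerator:
    return ReportGenerator(config.data_source, config.formatter, config.logger)
```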

The last category is literally everything else. Not necessarily debt, but not necessarily terrible code either. In reality, this is probably the majority of code produced in the average organization, assuming they have software engineers that care even a little bit. This code is not optimal or perfectly architected, but it delivers real value right now, and does only what it needs to do, in a well-engineered way. It almost certainly has limitations, but they are likely not widely communicated and understood.

Now, about that whole fluidity thing.

Fluid Dynamics

Over its lifetime (from creation to deletion), a single piece of software might meet the criteria to be classified into all of the definitions above.

Maybe it started its life as an acceptable implementation, then coding standards changed (i.e. got better) and it was considered terrible code. As time wore on, perhaps someone did some analysis, communicated its limitations and failings and put a plan into place to fix it up, thus mutating it into debt. Maybe a few short months later the organization went through a destructive change, that knowledge was lost, and the new developers just saw it as crappy code again. Hell, sometime after that maybe we all agreed that automated tests aren’t good practice and it was acceptable again.

It's all very fluid and dynamic, and altogether quite subjective, so why bother with classifications at all?

I suppose it's the same reason you would bother trying to put together a ubiquitous language when interacting with a domain: so you can communicate more effectively with stakeholders.

There is value in identifying and communicating the technical limitations of a piece of software to a wider audience, allowing that information to be used to inform business decisions. Of course, this can be challenging, because the ramifications have to be in terms the audience will understand (and empathize with), or the message might be lost.

But this comes with the problem of safely accumulating and storing that knowledge so that it doesn't get lost, which is one of the reasons debt can mutate into terrible code over time. Keeping it alive requires a consistent application of discipline that I have yet to bear witness to. It's also very easy for this sort of debt register to just turn into “everything we don't like about the codebase”, which is not its intent at all.

Moving away from debt, drawing a solid line between what is acceptable and what is not (i.e. the terrible code definition) obviously has beneficial effects on the quality of the software, but what happens when that line moves, which it is almost certain to do as people learn and standards change? Does everything that was already written just become terrible, implying that it should be fixed immediately (because why draw a line if you tolerate things that cross it) or does it just automatically become debt? It seems unlikely that the appropriate amount of due diligence would occur to ensure the issues are understood, and even less likely that they would be actioned.

In the end, a classification system has to provide real value as an input to decision making, and shouldn't exist “just because”.

Conclusion

Being able to constantly and consistently deliver valuable software is hard, even without having to take into account what might happen in the future. I don’t know about you, but I certainly don’t have a solid history of anticipating future requirements.

Classifying code is a useful tool for discussion (and for planning), but in the end, if the code is causing pain for developers, or issues for clients, then it should be improved in whatever way necessary, as the need arises. There is no point in going back and fixing some technical debt in an arcane part of the system that is not actively being developed (even if it is actively being used); that's just wasted effort. Assuming it does what it's supposed to, of course.

The key is really to do as little as possible, but to do it while following good engineering practices. If you do that at least, you should find that you never reach a place where your development grinds to a halt because the system can’t adapt to changing circumstances.

Evolve or die isn’t a suggestion after all.


When building a series of services to allow clients to access their own (previously office locked) data over the greater internet, there are a number of considerations to be made.

The old way was simple. There is a database. Stuff is in the database. When you want stuff, access the database. As long as the database in one office was powerful enough for the users in that office, you would be fine.

Moving all of that information into the cloud though…

Now everyone needs to access all their stuff at the same time. Now efficiency and isolation matters.

Well technically it mattered before as well, just not as much to the people who came before me.

I’m going to be talking about two things briefly in this post.

The first is isolating our upload and synchronization process from the actual service that needs to be queried.

The second is isolating binary data from all other requests.

Data Coming Right Up

In order to grant remote access to previously on-premises locked data, we need to get that data out somehow. Unfortunately, for this system, the source of truth needs to stay on-premises for a number of different reasons that I won't go into in much detail here. What we're focusing on is allowing authenticated, read-only access to the data from external systems.

The simplest solution to this is to have a replica of the data available in the cloud, and use that replica for all incoming remote requests. Obviously this isn't perfect (it's an eventually consistent model), but because it's read-only and we have some allowance for data latency (i.e. it's okay if a mobile application doesn't see exactly what is in the on-premises data the moment it changes), it's good enough for our purposes.

Of course, all of this data constantly being uploaded can put a considerable amount of strain on the system as a whole, so we need to make sure that if there is a surge in the quantity of synchronization requests, the service responding to queries (get me the last 100 X entities) is not negatively impacted.

Easiest solution? Simply separate the two services, and share the data via a common store of some sort (our initial implementation will have a database).

With this model we gain some protection from load on one side impacting the other.

It's not perfect, mind you, but the early separation gives us a lot of power moving forward if we need to change. For example, we could queue all synchronization requests to the sync service fairly easily, or split the shared database into a master and a number of read replicas. We don't know if we'll have a problem or what the solution to that problem will be; the important part is that we've isolated the potential danger, allowing for future change without as much effort.
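As a rough conceptual sketch of that separation (plain Python classes standing in for what would really be two separately deployed services and a real database; every name here is made up for illustration), the shape looks something like this:

```python
# Deliberately simplified: the two services only interact through the shared
# store, so load on one side does not directly consume resources on the other.
from typing import Any

class SharedStore:
    """Stands in for the common database shared by the two services."""
    def __init__(self) -> None:
        self._entities: list[dict[str, Any]] = []

    def append(self, entity: dict[str, Any]) -> None:
        self._entities.append(entity)

    def latest(self, count: int) -> list[dict[str, Any]]:
        return self._entities[-count:]

class SyncService:
    """Receives uploads from the on-premises side and writes them to the store."""
    def __init__(self, store: SharedStore) -> None:
        self._store = store

    def receive_upload(self, entity: dict[str, Any]) -> None:
        self._store.append(entity)

class QueryService:
    """Serves read-only queries ('get me the last 100 X entities')."""
    def __init__(self, store: SharedStore) -> None:
        self._store = store

    def last_entities(self, count: int = 100) -> list[dict[str, Any]]:
        return self._store.latest(count)
```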

10 Types of People

The system that we are constructing involves a moderate amount of binary data. I say moderate, but in reality, for most people who have large databases on premises, a good percentage of that data is binary data in various forms. Mostly images, but there are a lot of documents of various types as well (ranging from small and efficient PDF files to monstrous Word document abominations with embedded Excel spreadsheets).

Binary data is relatively problematic for a web service.

If you grant access to the binary data from a service, every request ties up one of your possible request handlers (whether they be threads, pseudo-threads or various other mechanisms of parallelism). This leaves fewer resources available for your other requests (data queries), which can make things difficult in the long run as the total number of binary data requests in flight at any particular moment slowly rises.

If you host the data outside the main service, you have to deal with the complexity of owning something else and making sure that it is secure (raw S3 would be ideal here, but then securing it is a pain).

In our case, the plan is to go with another service, purely for binary data. This allows us to leverage our existing authentication framework (so at least everything is secure), and to use our existing logging tools to track access.
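A hedged sketch of what a handler in that dedicated service might look like follows; authenticate() and audit-style logging here are placeholders for the existing authentication framework and logging tools, not real APIs, and the streaming is reduced to plain file chunks.

```python
# Hypothetical shape of a handler in the dedicated binary-data service.
import logging
from pathlib import Path
from typing import Iterator

logger = logging.getLogger("binary-service")

def authenticate(token: str) -> bool:
    """Placeholder for the existing authentication framework."""
    return bool(token)

def stream_document(token: str, document_path: Path) -> Iterator[bytes]:
    """Check the caller, record the access, then hand back a chunked stream."""
    if not authenticate(token):
        raise PermissionError("caller is not authorised to fetch binary data")
    logger.info("binary data requested: %s", document_path)
    return _read_chunks(document_path)

def _read_chunks(path: Path, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk
```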

The benefit of isolating access to binary data like this is that if there is a sudden influx of requests for images or documents or something, only that part of the system will be impacted (assuming there are no other shared components). Queries for normal data will still complete in a timely fashion, and assuming we have written our integration well, some retry strategy will ensure that the binary data is delivered appropriately once the service resumes normal operation.
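That retry strategy on the integration side doesn't need to be anything clever; something along these lines is usually enough (fetch_binary is a stand-in for whatever call actually hits the binary-data service, and the exception type is purely illustrative):

```python
# Minimal sketch of a retry-with-exponential-backoff strategy for the client side.
import time
from typing import Callable

def fetch_with_retries(fetch_binary: Callable[[], bytes],
                       attempts: int = 5,
                       base_delay: float = 0.5) -> bytes:
    """Retry the download, backing off a little longer after each failure."""
    for attempt in range(attempts):
        try:
            return fetch_binary()
        except OSError:  # stand-in for "the service is temporarily unavailable"
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")
```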

Summary

It is important to consider isolation concerns like the ones I have outlined above when you are designing the architecture of a system.

You don’t necessarily have to implement all of your considerations straight away, but you at least need to know where your flex areas are and where you can make changes without having to rewrite the entire thing. Understand how and when your architecture could adapt to potential changes, but don’t build it until you need it.

In our case, we also have a gateway/router sitting in front of everything, so we can remap URLs as we see fit moving into the future. The designs I’ve outlined above come from past (painful) experience: we’ve already encountered these issues while implementing similar systems, so we decided to go straight to the design that caters for them, rather than implement something we knew would have problems down the track.
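To illustrate why the gateway buys us that freedom to remap later, here it is reduced to its simplest possible form (the prefixes and backend names are made up for the example; the real router is a separate piece of infrastructure):

```python
# Purely illustrative: the gateway's job, reduced to a prefix-to-backend map.
ROUTES = {
    "/sync": "http://sync-service.internal",
    "/entities": "http://query-service.internal",
    "/documents": "http://binary-service.internal",
}

def resolve_backend(path: str) -> str:
    """Pick the backend whose prefix matches the incoming request path."""
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return backend
    raise KeyError(f"no backend registered for {path}")
```

Changing where a family of URLs points then becomes an edit to one table, rather than a change to every client.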

It’s this sort of learning from your prior experiences that really makes a difference to the viability of an architecture in the long run.