
Last week I described the re-occurrence of an issue from late last year with our data synchronization process.

Basically, read IOPS on the underlying database were much higher than expected, causing performance issues in AWS when the database volume ran out of burst balance (or IO credits, as they are sometimes called).

After identifying that the most recent problem had been caused by the addition of tables, we disabled those tables, restoring performance and stability.

Obviously we couldn’t just leave them turned off though, and we couldn’t throw money at the problem this time, like we did last time. It was time to dig in and fix the problem properly.

But first we’d need to reproduce it.

But The Read Performance Issue In My Store

The good thing about ensuring that your environment is codified is that you can always spin up another one that looks just like the existing one.

Well, theoretically anyway.

In the case of the sync API environment everything was fine. One script execution later and we had a new environment called “sync performance” with the same number of identically sized API instances running the exact same code as production.

The database was a little trickier unfortunately.

You see, the database environment was from the time before I improved our environment deployment process. This meant that it was easy to make one, but hard to update an existing one.

Unfortunately, it was hard enough to update an existing one that the most efficient course of action had been to simply update the live one each time we had to tweak it, so we had diverged from the source code.

First step? Get those two back in sync.

Second step, spin up a new one that looks just like prod, which needed to include a copy of the prod data. Luckily, RDS makes that easy with snapshots.

With a fully functional, prod-like data synchronization environment up and running, all we needed was traffic.

Good thing Gor exists. We still had a deployable Gor component from last time I wanted to replicate traffic, so all we needed to do was make a new Octopus project, configure it appropriately and deploy it to our production API.

Now we had two (mostly) identical environments processing the same traffic, behaving pretty much the same. Because we’d turned off multiple tables in order to stop the high read IOPS, it was a simple matter to turn one back on, causing a less severe version of the issue to reoccur in both environments (higher than normal read IOPS, but not enough to eat burst balance).

With that in place we were free to investigate and experiment.

Is Still Abhorred

I’m going to cut ahead in the timeline here, but we analysed the behaviour of the test environment for a few days, trying to get a handle on what was going on.

Leveraging some of the inbuilt query statistics in PostgreSQL, it looked like the most frequent and costly type of query was related to getting a “version” of the remote table for synchronization orchestration. The second most costly type of query was related to getting a “manifest” of a subset of the table being synchronized for the differencing engine.
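
For context, the kind of thing we were looking at is sketched below. This is a rough illustration only: it assumes the pg_stat_statements extension (or something like it) is what those inbuilt query statistics refer to, and the column names are from the 9.x era of that extension.

using System;
using Npgsql;

public static class QueryStatistics
{
    // Dumps the ten most expensive statement shapes by total execution time.
    public static void PrintTopQueries(string connectionString)
    {
        using (var connection = new NpgsqlConnection(connectionString))
        {
            connection.Open();
            using (var command = connection.CreateCommand())
            {
                command.CommandText =
                    @"SELECT query, calls, total_time
                      FROM pg_stat_statements
                      ORDER BY total_time DESC
                      LIMIT 10;";
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        Console.WriteLine("{0:F0}ms across {1} calls: {2}", reader.GetDouble(2), reader.GetInt64(1), reader.GetString(0));
                    }
                }
            }
        }
    }
}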

Disabling those parts of the API (but leaving the uploads alone) dropped the IOPS significantly, surprising exactly zero people. This did disagree with our hypothesis from last time though, so that was interesting.

Of course, the API is pretty useless without the ability to inform the synchronization process, so it was optimization time.

  • We could try to reduce the total number of calls, reducing the frequency that those queries are executed. We’d already done some work recently to dramatically reduce the total number of calls to the API from each synchronization process though, so it was unlikely we would be able to get any wins here
  • We could implement a cache in front of the API, but this just complicates things and all it will really result in is doing work repeatedly for no benefit (if the process syncs data then asks the API for a status, and gets the cached response, it will just sync the data again)
  • We could reduce the frequency of syncing. Since we already did the work I mentioned above to reduce overall calls, the potential gains here were small
  • We could try to make the queries more efficient. The problem here was that the queries were already using the primary keys of the tables in question, so I’m not entirely sure that any database level optimizations on those tables would have helped
  • We could make getting an answer to the question “give me the remote version of this table” more efficient by using a dedicated data structure to service those requests, basically a fancy database level cache

We prototyped the last option (basically a special cache within the database that contained table versions in a much easier to query format) and it had a positive effect on the overall read IOPS.
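
To give a feel for the shape of that prototype, here is a rough sketch (all of the names are invented for illustration, not our actual schema): a small side table holds the current version for each customer and table combination, gets bumped on upload, and turns the orchestration query into a lookup against a tiny table instead of work against the big synchronized ones.

public static class TableVersionCacheSql
{
    // Answering "what version is this table at for this customer" becomes a
    // primary key lookup on a table with one row per customer/table pair.
    public const string GetVersion =
        @"SELECT version
          FROM table_versions
          WHERE customer_id = @customer_id AND table_name = @table_name;";

    // Kept up to date whenever an upload lands, rather than being recalculated
    // from the synchronized data every time someone asks.
    public const string RecordUpload =
        @"INSERT INTO table_versions (customer_id, table_name, version)
          VALUES (@customer_id, @table_name, @version)
          ON CONFLICT (customer_id, table_name) DO UPDATE
          SET version = GREATEST(table_versions.version, EXCLUDED.version);";
}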

But it didn’t get rid of it entirely.

Within The Sound Of Synchronization

Looking into our traffic, we discovered that our baseline traffic had crept up since we’d implemented the skipping strategy in the sync process. Most of that baseline traffic appeared to be requests relating to the differencing engine (i.e. scanning the table to get primary key manifests for comparison purposes), which was one of the expensive types of queries that we identified above.

We’d made some changes to the algorithm to incorporate the ability to put a cap on the number of skips we did (for safety, to avoid de-sync edge cases) and to introduce forced skips for tables whose changes we were happy to only sync a few times a day.

A side effect of these changes was that whenever we decided NOT to skip using the local comparison, the most common result of the subsequent local vs remote comparison was choosing to execute the differencing engine. The algorithm should have been choosing to do nothing when the local and remote were identical, but that piece was effectively missing due to the way the skip resolution had been implemented.
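
A simplified sketch of the corrected decision follows (the types and names are invented, and the real algorithm also carries the forced-skip handling mentioned above); the branch that was effectively missing is the “versions match, do nothing” one.

public enum SyncAction { Skip, DoNothing, RunDifferencingEngine }

public static class SkipResolution
{
    public static SyncAction Decide(long localVersion, long remoteVersion, bool localLooksUnchanged, int skipsSoFar, int maxSkips)
    {
        // Cheap local comparison first, bounded by the skip cap.
        if (localLooksUnchanged && skipsSoFar < maxSkips)
        {
            return SyncAction.Skip;
        }

        // The missing piece: local and remote agree, so there is nothing to do
        // and no reason to involve the differencing engine at all.
        if (localVersion == remoteVersion)
        {
            return SyncAction.DoNothing;
        }

        // Otherwise, run the expensive manifest scan and comparison.
        return SyncAction.RunDifferencingEngine;
    }
}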

Fixing the bug and deploying it caused the read IOPS on our normal production server to drop a little bit, which was good.

The different pattern of traffic plus our prototype table version cache caused a much more dramatic drop in read IOPS in our test environment though. The two things acting together apparently reduced the demands on the database enough to prevent it from having to read so much all the time.

Conclusion

We’re still working on a production quality cached table version implementation, but I am cautiously optimistic. There are some tricky bits regarding the cache (like invalidation vs updates, and where that sort of decision is made), so we’ve still got a bit of work ahead of us.

At this point I’m pretty thankful that we were easily able to both spin up an entirely separate and self contained environment for testing purposes, and that we were able to replicate traffic from one environment to the other without a lot of fuss. Without the capability to reproduce the problem disconnected from our clients and experiment, I don’t think we would have been able to tackle the problem as efficiently as we did.

I’m a little disappointed that a bug in our sync process managed to slip through our quality assurance processes, but I can understand how it happened. It wasn’t strictly a bug in the process itself, as the actions it was performing were still valid, just not optimal. Software with many interconnected, dependent components can be a very difficult thing to reason about, and this problem was relatively subtle unless you were looking for it specifically. We might have been able to prevent it from occurring with additional tests, but it’s always possible that we actually had those tests, and during one of the subsequent changes a test failed and was fixed in an incorrect way. I mean, if that was the case, then we need to be more careful about “fixing” broken tests.

Regardless, we’re starting to head into challenging territory with our sync process now, as it is a very complicated beast. So complicated in fact that it’s getting difficult to keep the entire thing in your head at the same time.

Which is scary.


In late 2016 we deployed the first version of our data synchronization API. It was a momentous occasion, and it all worked perfectly…at least until we turned on the software that actually used it. Actual traffic has a habit of doing that.

Long story short, we experienced a bunch of growing pains as we scaled our traffic up during those first few weeks, and this post is a follow-up to one of those issues in particular.

Specifically, the issue of particularly high read IOPS on the RDS instance that backed the sync API.

At the time, we investigated, didn’t find the root cause of the problem, threw money at it, and ran away to do things that the business considered more important.

Why does it matter now?

Well, it happened again, and this time we can’t just throw money at it.

Sound familiar?

I’ve Come to Read From You Again

As I mentioned above, the main symptom was overly high read IOPS on our RDS server. Much higher than we expected for what we were doing.

This was an issue because of the way that data volumes work in AWS. You get some guaranteed IOPS (read and write combined) based on the size of the volume (1GB = 3 IOPS), while at the same time you get the capability to burst up to 3000 IOPS for brief periods of time.

The brief period of time is the important part, because it is actually modelled as a “balance” of IOPS credits that you consume if you go above your baseline, which get replenished if you’re below your baseline. Very, very similar to the way that CPU credits work on EC2 instances. Back in late 2016, what exactly the balance was at any particular point in time was a bit of a mystery though, because you couldn’t graph or alarm on it through CloudWatch. You only knew when it ran out and the performance of your database tanked because it was dependent on using more IOPS than its baseline.
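
As a rough mental model (the mechanics follow the AWS gp2 documentation, the code and numbers are purely illustrative): the bucket starts full at 5.4 million credits, refills at your baseline rate, and drains at whatever rate you are actually consuming above it.

using System;

public static class BurstBalanceModel
{
    private const double BucketSize = 5400000; // a gp2 volume's starting (and maximum) credit balance

    // Advances the balance by one second. While actual IOPS exceed the baseline
    // the balance drains; while they sit below it, the balance refills.
    public static double NextBalance(double balance, double baselineIops, double actualIops)
    {
        balance += baselineIops - actualIops;
        return Math.Max(0, Math.Min(balance, BucketSize));
    }
}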

Anyway, you can chart and alarm on the balance now, which is nice.

We tried a few things at the time, including scaling the underlying data volume (to give us more baseline), but it wasn’t enough.

We then increased the size of the RDS instance, and the high read IOPS vanished. I mean, that was obviously the best outcome, clearly. A problem we didn’t understand reacting in a completely unexpected way when we threw more power at it.

At the time, we thought that it might be related to EF querying the database on writes (to choose between an insert or update), but we never proved that hypothesis conclusively.

No-one in the business complained too loudly about the cost and everything was working swimmingly, so after an initial period of discomfort (mostly the engineers inside us sobbing quietly), we promptly forgot about it altogether.

Because Some Usage Softly Creeping

A few weeks back, our data synchronization API started behaving strangely.

The first we knew about the problem was when our external monitoring service (Pingdom) reported that the API was down. Then up. Then down again. Then up again. So on and so forth.

Looking at the RDS metrics there were thousands of connections active to the database, causing us to hit connection limits. My first thought was that someone internal to the organisation had somehow got the credentials to the database and was running insane queries for business intelligence. I mean, thinking about that conclusion now, it’s completely bogus, because:

  1. No-one has the credentials to the master database but a few members of my team
  2. No-one even knows what the URL to access the master database is
  3. We created a read replica specifically for the organisation to use for business intelligence, so why would someone use the master
  4. You can’t even connect to the master database without being on our internal AWS VPN, which most people in the organisation don’t have access to

It didn’t help that when PostgreSQL runs out of connections, you can’t even connect to it with pgAdmin or something similar to see what the connections are doing.

Anyway, after falsely bombarding some innocent people with polite requests to “please stop doing things to our master database, it’s killing our API”, I shut down our data synchronization API and all the connections disappeared.

Because it was our API.

The calls were coming from inside the house.

Wrecked Everything While I was Sleeping

Luckily, the data synchronization API is probably the one API we can take down for a few hours and not have anybody really notice. As long as the database is available, the other API that actually allows access to the data can continue to do what it normally does, just with increasingly stale information. It’s an eventually consistent model anyway, so some staleness is inherent.

We still only realistically had a few hours to dig into the issue before it became a real problem for actual users though.

Our only lead was the massive number of connections that were being established to the RDS instance. I managed to get a sample of what the connections were doing just as I shut the API down, and it wasn’t anything interesting. Just normal queries that I would expect from the API. Traffic hadn’t changed either, so we weren’t dealing with some weird DDOS attack or anything like that.

Well, to skip ahead, it turns out that the rise in the number of connections was a symptom, not the cause. The root issue was that we had burnt through our IO burst balance over the last few hours due to an increase in read IOPS. We’d almost done the same thing the day before, but it bottomed out at like 5% remaining because the day’s traffic started dropping off, so we didn’t notice. The lack of IOPS was causing each request to execute slower than normal, which just meant that there were more requests being executed in parallel.

Each API instance had a maximum number of connections that it could use, but the sum total across all of the instances was higher than the total number available to PostgreSQL (whoops).

But why were we eating all of our delicious burst balance all of a sudden?

Well, a few days prior we started syncing a new table, and even though it was a small one, it increased the drain on the infrastructure.

Interestingly enough, it wasn’t a case of read IOPS going from close to nothing to a massive number with that new table. Our read IOPS had actually increased to a ridiculous baseline a few weeks before that when we shipped a completely different table, and we just hadn’t noticed. The table we shipped recently had just been the straw that broke the camel’s back.

We turned both of the new tables off, restored the API and all was well.

Conclusion

Honestly, we really should have had some alarms. Any alarms. Literally any alarms on the things we were already suspicious of at the database level (connections, read IOPS, etc) would have given us some sort of warning that bad things were happening, weeks before they actually happened.

We have those alarms now, but it was still a pretty big oversight.

With the new tables turned off and stability and functionality restored, we looked to fix the problem.

Throwing money at it wasn’t really feasible, as we would have basically doubled the infrastructure costs, and it was already costing us a few thousand $US each month. No, this time we would have to understand the problem and fix it properly.

But that’s a topic for next time, because this post is already large.


It’s been at least a few years since I’ve actually worked with a dedicated DBA. I’m sure the role still exists, but it seems like a lot of organisations mostly just expect their developers to be able to do anything at least passably well, especially with the seemingly inexorable merging of the development and operations responsibilities.

As a developer, I personally like the trend, because it moves me closer to the end-user and it gives me a greater sense of responsibility about the software I produce. If I’m going to be the one who has to support it and the one that gets woken up at 1 in the morning because production servers are melting down due to a poorly optimised data update algorithm, then you better believe I will be writing the most maintainable, supportable and reliable software that I can.

I mean, I would have done that anyway, but now I have even more of a personal stake in the situation.

Where I work now, we still have a dedicated operations department, but they are mostly responsible for managing the IT operations of the business as a whole, rather than the deployment specifics of any software that my team writes. They do still participate in our process in an advisory role (something for which I am perpetually grateful), but other than that, we’re pretty much on our own.

Anyway, the point that this rambling preface is slowly trying to arrive at is we’ve had some teething issues with one of our more recent releases that I thought might be worth talking about.

Specifically, massive, unexpected read load on our relational databases during periods of high write activity.

Everything In a Box To The Write

The service at the core of the data freeing functionality that I’ve mentioned previously is our synchronization service.

Its only purpose is to facilitate the continual synchronization of data from on-premises databases to a central cloud repository, with the help of a locally installed application. I won’t go into too much detail about the actual algorithm in play, but it’s mostly just batched updates to the cloud database at regular intervals through an HTTP API, where the batched data is obtained by leveraging the SQL Server row version concept.

In the last few weeks, we’ve been deploying the on-premises application to customers in order to kick off the synchronization process, a chunk of customers at a time.

It went pretty well, releasing to each group and watching their data upload, until we hit some sort of breakpoint and the Read IOPS of our backend RDS service rose to an unsustainable level.

We were seeing numbers in excess of 1500 for read IOPS (and 200ish for write), which is somewhat problematic because the database is relatively small (150GB), meaning it only has around 450 baseline IOPS split between reads and writes. Considering the way that AWS volumes work (guaranteed baseline, spike up to 3000 by consuming IO credits), the consumption rate we were seeing would leave us high and dry within an hour or two. We scaled a number of things to deal with the traffic (which eventually subsided as the initial flurry of data gave way to more incremental updates).
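
Running those numbers gives a feel for how short the window was. This is back-of-the-envelope only: the 5.4 million figure is the documented starting credit balance for a gp2 volume, the rest come straight from the paragraph above.

public static class BurstMath
{
    // Roughly how long a full credit bucket lasts at the observed consumption rate.
    public static double HoursOfBurstRemaining()
    {
        const double baselineIops = 150 * 3;     // 150GB volume => 450 IOPS baseline
        const double observedIops = 1500 + 200;  // observed reads + writes
        const double startingCredits = 5400000;  // a full gp2 credit bucket

        double drainPerSecond = observedIops - baselineIops; // ~1250 credits per second
        return startingCredits / drainPerSecond / 3600;      // ~1.2 hours
    }
}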

But where did the massive amount of read load come from?

To understand that, I’m going to have to outline a little bit more about the service itself.

The API for the service exposes an endpoint for each table we’re synchronizing, differentiated by customer identity. Something like the following:

https://{url}/{customer-identity}/tables/{table-name}

A GET to this endpoint returns a manifest of sorts, stating information about what data is currently available in the service for that customer-table combination, which is then used to decide what to upload. A POST allows for inserts and updates and a DELETE allows for either the removal of all data or the removal of a subset of data defined by the body.

Inside the service, the POST body is handled by iterating through the rows contained therein, and executing an add or update for each one using Entity Framework, saving as we go.

Profiling the resulting queries from the execution of a POST, we discovered that Entity Framework will always do a SELECT first on each row in question, in order to determine whether to run an INSERT or an UPDATE. Given that the number of these SELECT operations dwarfed the number of reads resulting from GET requests, we assumed that that was where the majority of the read load was coming from.
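
A minimal sketch of that per-row handling is below (context, entity and mapping names are invented; this is the pattern, not our actual code). The Find call is where the extra SELECT comes from: EF has to know whether the row already exists before it can decide between an INSERT and an UPDATE.

private void ApplyRows(SyncContext context, IEnumerable<IncomingRow> rows)
{
    foreach (var row in rows)
    {
        var existing = context.Entities.Find(row.CustomerId, row.Id); // SELECT by primary key
        if (existing == null)
        {
            context.Entities.Add(MapToEntity(row));                   // becomes an INSERT
        }
        else
        {
            MapOntoExisting(row, existing);                           // becomes an UPDATE
        }

        context.SaveChanges();                                        // saving as we go
    }
}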

Now we just had to find out how to optimise it.

The EF Tradeoff

One of the really nice things when working with Entity Framework is that you don’t need to give too much thought to the resulting queries that it executes on the underlying data store. I mean, the whole point of the library is to reduce the visibility of those sorts of things, adding a nice abstraction layer over the top so you don’t have to deal with it.

When it comes to performance though, that is also one of the places where EF can hurt you the most.

For our purposes, EF increased the speed at which we could deliver this service (by removing some of the nastiness around talking to databases), but by being so far away from the proverbial metal, when it came time to try and deal with performance issues, there didn’t seem to be much that we could do.

We tried a few things with EF, including:

  • Just adding the entities, instead of add/update, which forced EF to insert and fail on primary key violations. This was faster, but a lot of our load is actually updates as opposed to pure inserts, so applied to our production environment it would have simply caused other performance issues (as a result of the constant exceptions).
  • Turning off the various entity tracking features in EF, as we didn’t need them for the relatively simple inserts we were doing. This helped a little bit, but the same pattern of SELECT, INSERT/UPDATE was still present afterwards, so it wouldn’t have resolved our read problem.
  • Updating to the very latest version of EF6 and Npgsql (just in case there had been some optimisations that we’d missed), but it was pretty much the same. We were not that far behind the curve anyway, so it wasn’t too surprising.
  • Batching our data updates under a single SaveChanges (there’s a rough sketch of this, combined with the tracking changes, just after this list). This helped a fair amount, but would require us to change the way we handle update errors to remove the bad rows and try the update again rather than simply moving on. We don’t have any real statistics as to how many errors we actually get during updates, so this one was plausible, but would need more investigation.
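
For illustration, the second and fourth options combined look something like the sketch below (same invented names as the earlier sketch; the error handling that a batched SaveChanges forces on you is omitted, which is exactly the investigation mentioned in the last point).

private void ApplyRowsAsBatch(SyncContext context, IEnumerable<IncomingRow> rows)
{
    context.Configuration.AutoDetectChangesEnabled = false; // we don't need per-property change tracking here
    context.Configuration.ValidateOnSaveEnabled = false;

    foreach (var row in rows)
    {
        var existing = context.Entities.Find(row.CustomerId, row.Id); // still a SELECT per row
        if (existing == null)
        {
            context.Entities.Add(MapToEntity(row));
        }
        else
        {
            MapOntoExisting(row, existing);
            context.Entry(existing).State = EntityState.Modified;     // tracking is off, so mark it ourselves
        }
    }

    context.SaveChanges(); // one round of writes for the whole batch instead of one per row
}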

Then we tried something completely different.

No EF.

Just pure ADO.NET insert calls, catching primary key violations and doing updates.

This approach was so much faster it wasn’t funny, but it left a bad taste in our mouths at multiple levels. For starters, we would have to write and maintain the SQL ourselves, which is something we’d rather not do (for various reasons, maintainability being one and database specific code being another). Even getting past that, we were still handling exceptions as part of normal application flow, which didn’t feel right, and would almost certainly lead to performance problems of its own, even if it was dodging the extra reads.

It was here that PostgreSQL came to our rescue.

Version 9.5 introduced an UPSERT capability in the form of INSERT … ON CONFLICT … DO UPDATE. Now we could leave the whole thing up to the database and move on with our lives. We’d still have to maintain the SQL ourselves (or write a generator of some sort), but that would be a small price to pay for a massive performance increase.
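
As a rough sketch of where that leads (schema and names invented for illustration), the raw Npgsql version of a single row upsert looks something like this: one round trip, no preliminary SELECT, and no primary key violation to catch.

using Npgsql;

public static class EntityUpsert
{
    public static void Upsert(NpgsqlConnection connection, int customerId, int id, string payload)
    {
        using (var command = connection.CreateCommand())
        {
            command.CommandText =
                @"INSERT INTO entity (customer_id, id, payload)
                  VALUES (@customer_id, @id, @payload)
                  ON CONFLICT (customer_id, id) DO UPDATE
                  SET payload = EXCLUDED.payload;";
            command.Parameters.AddWithValue("customer_id", customerId);
            command.Parameters.AddWithValue("id", id);
            command.Parameters.AddWithValue("payload", payload);
            command.ExecuteNonQuery(); // the database decides between insert and update
        }
    }
}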

Unfortunately this post sort of ends with an anti-climax, because we’re still in the process of implementing the raw ADO.NET, PostgreSQL specific UPSERT as an optimisation. Once I have more information I’ll follow up with another post filling in the end of the story.

In the meanwhile, we threw money at it by scaling up the RDS instance so it could keep more things in memory at once, which alleviated the issue, at least until we hit the next breakpoint.

Conclusion

It’s true that no plan survives contact with the enemy, even when the enemy isn’t particularly hostile. All of our data synchronization was a silent precursor to shipping some new cloud features, so no-one noticed when it exploded a few times, but it was still pretty depressing to see it be completely unable to handle the load we were throwing at it. I’m not surprised that Entity Framework doesn’t lead to optimal performance, even though it’s not pleasant having that particular fact slap you in the face. We can still use EF for a lot of what we need to do, and benefit from all of its goodies, like database migrations and easy in-memory testing, but we definitely need to put some things in place to bypass EF when the situation calls for it.

When it comes to this particular issue; even though throwing money at a problem doesn’t really work for us in the long term (for obvious reasons), it was nice to have that option available until we had something better in place. We’re still not sure if what we’re planning on doing is going to make a massive difference, but I remain hopeful.

Of course, software development and hope is never a good combination.

Look where it gets us when it comes to estimates.


Welp, holidays finished, so back into software delivery I go.

We just finished a 3 day Hackathon, which was pretty fun. The team I was in worked on a proof of concept for providing business intelligence on our customers to internal users by connecting and correlating our many disparate data sources. It was a fairly simple React.js website backed by a C# API, which connected to a number of different systems including Microsoft Dynamics CRM, Elasticsearch (specifically, our log stack), some crazy financial database called Helpy (which I don’t even seem to be able to Google) and finally a PostgreSQL database containing customer information.

It was cool to see it all come together over just a few days, but I doubt I’ll give it its own blog post like I did for the faster S3 clone from the last Hackathon. Funnily enough, the hardest part was getting the damn thing deployed to an instance of IIS on a manually provisioned server inside our office. I’ve spent so much time using automated scripts that I wrote months ago in AWS to set up websites and APIs that I’ve completely forgotten how to just hack a webserver together.

Why didn’t I use the scripts you might ask? Well, I tried, and they really only work inside AWS on top of the AMI’s we’ve built up specifically for that purpose. A bit disappointing, but I got over it.

Hackathon fun aside, I had to deal with something else when I got back from my holidays, identifying and resolving a massive unexpected increase in Read IOPS usage for one of our production databases.

And for once, it had nothing to do with RavenDB.

Zero Hour

While I was on holidays, one of our production APIs started exhibiting some pretty degraded performance. It happened suddenly too: it was running fine one minute, and then all of a sudden its latency spiked hard. Unfortunately I was not there for the initial investigation, so the first half of this post is going to be a second-hand account, with all the inherent inaccuracy that entails.

Looking into the issue, all queries to the underlying database (a PostgreSQL database hosted in RDS) were taking much longer than expected, but it wasn’t immediately obvious as to why. There had been no new deployments (so nothing had changed), and it didn’t look like the amount of traffic being received had drastically increased.

The RDS monitoring statistics showed the root cause of the performance problems. There had been a massive increase in the amount of Read IOPS being consumed by the database instance, from 300ish to 700ish. This had been happening for at least a few days before the performance went downhill, so the operative question was why it had become an issue all of a sudden.

Apparently, IO credits are a thing, like CPU credits.

With the increase in Read IOPS being sustained over many hours, the IO credits available to the instance were slowly being consumed, until they were all gone and performance tanked.

The way IO credits work, to my current understanding, is that when you provision an EBS volume, the size of the volume determines the minimum guaranteed IO throughput, at a rate of 3 IOPS per GB provisioned (unless you’ve specified provisioned IOPS, which is a whole different ballgame). The volume is capable of bursting to 3000 IOPS, but operating above the minimum will consume IO credits. Running out of IO credits means no more bursting, so if the performance of your service was reliant on pulling more than the minimum IO available, it will tank. Like ours did.

Apart from the unfortunate fact that unlike CPU credits, there is no way to monitor IO credits, the problem was relatively easy to solve. Either increase the volume size of the RDS instance (to increase the minimum IO performance) or leave the size the same, but switch to using provisioned IOPS. We opted for the cheaper option (bigger disk = more performance) and the performance stabilised.

This left us in a precarious position though, as we didn’t know what had caused the dramatic increase in Read IOPS.

Detective Agency

PostgreSQL has some pretty good logs, which is a relief, considering no equivalent tool to SQL Server Profiler exists to my knowledge.

Getting the logs for a PostgreSQL instance hosted in RDS isn’t particularly complicated. Either edit the existing parameter group associated with the instance, or create a new one with logging enabled and then associate it with the instance. The logging options we were interested in were log_statement and log_min_duration_statement, which represent what activities to log and how long a query needs to be running before it should be logged, respectively. More information about working with PostgreSQL logs in RDS can be found at the appropriate AWS webpage. Getting to the logs manually is pretty easy. They are available for download when looking at the RDS instance in the AWS console/dashboard.

The logs showed that there was a class of query that was constantly being executed to determine the most recent piece of information in the database for a subset of the data (i.e. a select top x from y where a, b). The only piece of software that can access the database is a C# API, and it uses Entity Framework 6 for all database communication, so the queries were monstrous, but once our attention had been drawn to the appropriate place, it was pretty easy to see some issues:

  • The data type on the first segmentation column was a string instead of an integer (only integer values were ever present in the column, and conceptually only integer values would ever be present)
  • The data type on the second segmentation column was a string instead of a guid (similar to the integer comment above)
  • The primary key on the table was a simple auto ID, whereas the actual primary key should have been a compound key with column a and b above, along with a third column
  • There did not appear to be any appropriate indexes on the segmentation columns

What all of this meant was that the most frequent query being executed on the database was incredibly sub-optimal. It was likely that because the most frequent query needed to do a table scan in order to get an answer, and because the RDS instance was only specified as a t2.micro, PostgreSQL was constantly churning memory, which would account for the high Read IOPS. The issue likely only occurred recently because we finally hit some threshold in traffic or database size where the churning manifested.

Fixing the problem was simple: edit the EF models to change the types, specify appropriate indexes, and add a new EF migration to be executed on the next deployment. The migration was a little more complex than usual because it had to upgrade data already in the database (for the wrong column types), but all in all, the fix didn’t take a huge amount of effort.
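
As a hedged sketch of what such a migration might look like (table, column and index names are invented; the USING clauses are what do the in-place conversion of the existing data, which is why it was more involved than usual):

using System.Data.Entity.Migrations;

public partial class FixSegmentationColumns : DbMigration
{
    public override void Up()
    {
        // Convert the text columns to their proper types, transforming the rows already present.
        Sql("ALTER TABLE entity ALTER COLUMN segment_a TYPE integer USING segment_a::integer;");
        Sql("ALTER TABLE entity ALTER COLUMN segment_b TYPE uuid USING segment_b::uuid;");

        // Give the most frequent query something better than a table scan to work with.
        CreateIndex("entity", new[] { "segment_a", "segment_b", "segment_c" }, name: "IX_entity_segmentation");
    }

    public override void Down()
    {
        DropIndex("entity", "IX_entity_segmentation");
        Sql("ALTER TABLE entity ALTER COLUMN segment_b TYPE text USING segment_b::text;");
        Sql("ALTER TABLE entity ALTER COLUMN segment_a TYPE text USING segment_a::text;");
    }
}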

Like good software developers though, we had to make sure that it actually did fix the problem without actually deploying to production.

Validate Me!

Thanks to previous efforts in traffic replication, I made the assumption that it would be relatively easy to replicate our production traffic into an updated version of the service.

I was wrong.

Our previous traffic replication had been accomplished using Gor replicating traffic from a single EC2 instance (a machine hosting RavenDB) to another similar instance. Pure HTTP traffic, no URLs, just IP addresses, and Gor had been manually installed and configured. I had started work on replicating traffic from our API instances to a URL of my choosing, but I hadn’t quite finished it before I went on holidays.

I needed to replicate traffic from a variable number of instances to a publicly accessible URL, and I really didn’t want to have to manually install it on each machine (and then have to do it again next time I wanted to replicate traffic).

Creating a deployable Nuget package containing Gor and some helper Powershell scripts wasn’t overly difficult, I just used a similar pattern to the one we have for our Logstash deployment package. After deploying it though, it wouldn’t work. When I say it wouldn’t work, I mean I didn’t get any feedback at all when running it, no matter how I ran it. The issue was that Gor running on windows is dependent on WinPCap being present in order to function at all. This is easy enough to accomplish when you’re doing a manual install (just go grab WinPCap and install it), but unfortunately much harder when you want to be able to deploy it to any old machine in a completely unattended fashion. WinPCap doesn’t supply a silent installer (why god why), so I was forced to install NMap silently on the machines as part of the Gor deployment. Why NMap? It also installs WinPCap silently, because they’ve customised it.

Not my greatest victory, but I’ll take it.

With Gor working, I still had an issue where even though Gor seemed to be aware of the traffic occurring on the machine (I could log it to stdout for example), no matter what I did I could not get the traffic to replicate to another service via a standard URL available to the greater internet. Long story short, the issue was that Gor (as of version 0.15.1) does not and cannot use a proxy. Being that these instances did not have public IP addresses (no need, they were hidden behind a load balancer), they had to go through a proxy to get anything meaningful done on the internet.

We’re switching to using NAT Gateways for all internet traffic from private instances, so in the long term, this won’t be an issue. For now, I have to live with the fact that proxies are a thing, so I was forced to create an internal load balancer pointing at the same EC2 instances as the public URL I was going to use.

With that in place, I finally managed to replicate the production traffic to a test service with the changes applied.

And it worked amazingly. With the changes in place, our Read IOPS dropped from 700ish to like 10. Much more efficient. The only cost was that the service had to be made unavailable to apply the database updates (because they involved the transformation of the column contents). That only took like 5 minutes though, so I just did it at night when nobody noticed or cared.

Conclusion

I suppose the conclusion of this entire journey is to never underestimate the impact that early design decisions can have on the long term health of a service, and that the value of detailed monitoring should never be undervalued.

Honestly, we should have picked up this sort of thing much earlier than we did (probably through load testing), but it was good that we had the systems in place to react quickly when it did happen unexpectedly.

Another benefit of the whole thing is that it proved the value of our ability to spin up an entirely self contained environment as a mechanism for testing changes under relatively real circumstances. With the ability to replicate traffic at will now (thanks to the working Gor deployment), we’re in a much better position to try things out before we go live with them.


As part of the work we did recently to add search to one of our newer APIs, we leveraged the support that Npgsql has for constructing PostgreSQL Full Text Search queries via Linq. This was a vast improvement over our previous attempts to use PostgreSQL Full Text Search, because we could combine it with sorting and filtering, and have everything apply right at the database level for maximum efficiency.

Internally, we use a Retriever class for each entity that we offer access to via the API. This class in turn leverages a number of different dependencies to provide for filtering, sorting and searching by modifying an IQueryable. For search, there is an IFullTextSearcher<TEntity> interface, which contains a single method Search, taking an IQueryable and the query and returning a modified IQueryable.

A searcher implementation for an entity called Entity looks like this:

public class EntityFullTextSearcher : IFullTextSearcher<Entity>
{
    private readonly IPostgresQueryBuilder _postgresQueryBuilder;

    public EntityFullTextSearcher(IPostgresQueryBuilder postgresQueryBuilder)
    {
        _postgresQueryBuilder = postgresQueryBuilder;
    }

    public IQueryable<EntityListProjectionModel> Search(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
    {
        var query = _postgresQueryBuilder.Build(searchQuery);
        if (string.IsNullOrEmpty(query)) return queryable;

        queryable = queryable.Where(b => NpgsqlTextFunctions.Match
        (
            NpgsqlTextFunctions.QueryOr(
                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldA), NpgsqlWeightLabel.A),
                NpgsqlTextFunctions.QueryOr(
                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldB), NpgsqlWeightLabel.A),
                    NpgsqlTextFunctions.QueryOr(
                        NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldC), NpgsqlWeightLabel.A),
                        NpgsqlTextFunctions.QueryOr(
                            NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldD), NpgsqlWeightLabel.B),
                            NpgsqlTextFunctions.QueryOr(
                                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldE), NpgsqlWeightLabel.C),
                                NpgsqlTextFunctions.QueryOr(
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldF), NpgsqlWeightLabel.C),
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldG), NpgsqlWeightLabel.C)
                                    )
                                )
                            )
                        )
                    )
                )
            ,
            NpgsqlTextFunctions.ToTsQuery(query))
        );

        return queryable;
    }
}

Pretty straightforward, but there is a massive inline lambda function that is somewhat hard to read.

It’s A Frowny Face

The essence of that massive chunk of lambda is the definition of the fields that will be indexed during the search, along with the weight at which they will be indexed and how they are combined, which in our case, is just using OR. This Full Text Search definition is then combined with the Match operator. This is the way to get Full Text Search working via Linq, as I wrote about in my previous post on the subject.

If you look at the definition of these functions inside the Npgsql source code, you’ll see that their implementations all throw an InvalidOperationException. How do they even work?

The answer lies in the magic of expression trees and Entity Framework. The Where function on an IQueryable actually takes an Expression<Func<TEntity, bool>>, which this lambda function is supplying.

At no point is this code intended to be executed directly. You could try (i.e. by running this on top of a List or In Memory Database via Effort), but it would fail miserably, throwing an InvalidOperationException as expected.

However, when the IQueryable is passed through the Npgsql EF Provider, it is transformed into actual SQL which is then executed and the results interpreted accordingly. To be honest, I’ve never had to really dig into the whole EF provider concepts too deeply, and I don’t really want to. I’m happy enough that it works, and allows us to combine Full Text Search with arbitrary selection, filtering and sorting, leaving all of the hard work up to the database.
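
The distinction that makes all of this work, in miniature (this assumes an Entity type with a string FieldA property, as in the definitions above): a Func is compiled code that can only be executed, while an Expression<Func> is a data structure describing that code, which the provider can walk and translate into SQL instead of ever invoking those deliberately throwing methods.

using System;
using System.Linq.Expressions;

public static class FuncVersusExpression
{
    // Compiled to IL; opaque to Entity Framework, it can only be run in .NET.
    public static readonly Func<Entity, bool> AsDelegate =
        e => e.FieldA == "foo";

    // Compiled to an expression tree; EF and the Npgsql provider can inspect
    // the nodes and emit the equivalent SQL.
    public static readonly Expression<Func<Entity, bool>> AsExpressionTree =
        e => e.FieldA == "foo";
}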

Back to the massive chunk of lambda though, at least it’s encapsulated inside this tiny class. Right?

Well, it was up until we needed to add sorting by relevance.

Maybe More Of A Grimace?

Have a look at this class and see if anything looks familiar.

public class EntitySearchRelevanceSorter : ISearchRelevanceSorter<EntityListProjectionModel>
{
    private readonly IPostgresQueryBuilder _postgresQueryBuilder;

    public EntitySearchRelevanceSorter(IPostgresQueryBuilder postgresQueryBuilder)
    {
        _postgresQueryBuilder = postgresQueryBuilder;
    }

    public IQueryable<EntityListProjectionModel> Sort(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
    {
        if(searchQuery == null || !searchQuery.Terms.Any()) return queryable;
        var query = _postgresQueryBuilder.Build(searchQuery);
        if (string.IsNullOrEmpty(query)) return queryable;

        return queryable.OrderByDescending(b => NpgsqlTextFunctions.TsRank(
               NpgsqlTextFunctions.QueryOr(
                   NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldA), NpgsqlWeightLabel.A),
                   NpgsqlTextFunctions.QueryOr(
                       NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldB), NpgsqlWeightLabel.A),
                       NpgsqlTextFunctions.QueryOr(
                           NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldC), NpgsqlWeightLabel.A),
                           NpgsqlTextFunctions.QueryOr(
                               NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldD), NpgsqlWeightLabel.B),
                               NpgsqlTextFunctions.QueryOr(
                                   NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldE), NpgsqlWeightLabel.C),
                                   NpgsqlTextFunctions.QueryOr(
                                       NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldF), NpgsqlWeightLabel.C),
                                       NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldG), NpgsqlWeightLabel.C)
                                       ))))))
               ,
               NpgsqlTextFunctions.ToTsQuery(query))
               );
    }
}

The lambda expression being used as the input for the TsRank function is exactly the same as the function used inside the searcher class.

It should be the same as well, because the relevance sorting should return the results in an order appropriate for how they were found via the search.

Now we have a problem though, because if the search definition changes, the relevance sorting will no longer make sense. The same is somewhat true for the reverse option (changing the sort order).

My first attempt at fixing this so that there was only a single place where the Full Text Search definition existed, was to simply extract the lambda function to a class of its own, and take a dependency on that class in both the searcher and sorter.

The definition class then looked like this:

public class EntitySearchDefinition
{
    public Expression<Func<Entity, string>> Definition
    {
        get
        {
            return b => NpgsqlTextFunctions.QueryOr( 
                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldA), NpgsqlWeightLabel.A), 
                NpgsqlTextFunctions.QueryOr( 
                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldB), NpgsqlWeightLabel.A), 
                    NpgsqlTextFunctions.QueryOr( 
                        NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldC), NpgsqlWeightLabel.A), 
                        NpgsqlTextFunctions.QueryOr( 
                            NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldD), NpgsqlWeightLabel.B), 
                            NpgsqlTextFunctions.QueryOr(
                                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldE), NpgsqlWeightLabel.C), 
                                NpgsqlTextFunctions.QueryOr( 
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldF), NpgsqlWeightLabel.C), 
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldG), NpgsqlWeightLabel.C) 
                                )
                            )
                        )
                    )
                )
            );
        }
    }
}

And the usage within the searcher looked like this:

public class EntityFullTextSearcher : IFullTextSearcher<Entity>
{
    private readonly IPostgresQueryBuilder _postgresQueryBuilder;
    private readonly EntitySearchDefinition _def;

    public EntityFullTextSearcher(IPostgresQueryBuilder postgresQueryBuilder, EntitySearchDefinition def)
    {
        _postgresQueryBuilder = postgresQueryBuilder;
        _def = def;
    }

    public IQueryable<EntityListProjectionModel> Search(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
    {
        var query = _postgresQueryBuilder.Build(searchQuery);
        if (string.IsNullOrEmpty(query)) return queryable;

        queryable = queryable.Where(b => NpgsqlTextFunctions.Match
        (
            _def.Definition,
            NpgsqlTextFunctions.ToTsQuery(query))
        );

        return queryable;
    }
}

This won’t compile at all though.

Nope, Definitely Rage

The problem here is that the definition of NpgsqlTextFunctions.Match is bool Match(string, string), so from the compiler’s point of view it doesn’t take an expression like the one I extracted out to the other class. We can’t just call it here either, because that would defeat the entire point of constructing the Where so that it can be turned into SQL and executed against the database. We have to put together an expression tree containing the logic we want to be pushed to the database.

How can we do that and maintain some common definition of the Full Text Search that will be used by both search and sorting components?

Long story short, my definition class ended up looking like this:

public class EntitySearchDefinition
{
    private Expression<Func<Entity, string>> Definition
    {
        get
        {
            return b => NpgsqlTextFunctions.QueryOr( 
                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldA), NpgsqlWeightLabel.A), 
                NpgsqlTextFunctions.QueryOr( 
                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldB), NpgsqlWeightLabel.A), 
                    NpgsqlTextFunctions.QueryOr( 
                        NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldC), NpgsqlWeightLabel.A), 
                        NpgsqlTextFunctions.QueryOr( 
                            NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldD), NpgsqlWeightLabel.B), 
                            NpgsqlTextFunctions.QueryOr(
                                NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldE), NpgsqlWeightLabel.C), 
                                NpgsqlTextFunctions.QueryOr( 
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldF), NpgsqlWeightLabel.C), 
                                    NpgsqlTextFunctions.SetWeight(NpgsqlTextFunctions.ToTsVector(b.FieldG), NpgsqlWeightLabel.C) 
                                )
                            )
                        )
                    )
                )
            );
        }
    }

    private static MethodInfo TsRank()
    {
        return MethodInfoFromExpression<Func<string, string, float>>((a, b) => NpgsqlTextFunctions.TsRank(a, b));
    }

    private static MethodInfo ToTsQuery()
    {
        return MethodInfoFromExpression<Func<string, string>>(a => NpgsqlTextFunctions.ToTsQuery(a));
    }

    private MethodInfo Match()
    {
        return MethodInfoFromExpression<Func<string, string, bool>>((a, b) => NpgsqlTextFunctions.Match(a, b));
    }
    
    public Expression<Func<Entity, float>> CreateOrderBy(string query)
    {
        return MakeMethodCallExpression<float>(query, TsRank());
    }

    public Expression<Func<Entity, bool>> CreateWhere(string query)
    {
        return MakeMethodCallExpression<bool>(query, Match());
    }

    private Expression<Func<Entity, TOut>> MakeMethodCallExpression<TOut>(string query, MethodInfo methodInfo)
    {
        var p = Expression.Parameter(typeof(Entity));
        var queryExpression = CreateConstantExpressionForQuery(query);
        var definitionBody = Definition.Body;
        var replacer = new ExpressionReplacer(a => a is ParameterExpression && a.Type == typeof(Entity), a => p);
        var visited = replacer.Visit(definitionBody);
        var expression = Expression.Lambda<Func<Entity, TOut>>(Expression.Call(methodInfo, visited, queryExpression), new List<ParameterExpression> { p });
        return expression;
    }

    private static MethodCallExpression CreateConstantExpressionForQuery(string query)
    {
        var queryConstant = Expression.Constant(query);
        var queryExpression = Expression.Call(ToTsQuery(), queryConstant);
        return queryExpression;
    }
    
    private class ExpressionReplacer : ExpressionVisitor
    {
        private readonly Func<Expression, bool> _selector;
        private readonly Func<Expression, Expression> _replace;

        public ExpressionReplacer(Func<Expression, bool> selector, Func<Expression, Expression> replace)
        {
            _selector = selector;
            _replace = replace;
        }

        public override Expression Visit(Expression exp)
        {
            if (exp == null) return exp;
            if (_selector(exp)) exp = _replace(exp);
            return base.Visit(exp);
        }
    }
}

Yup, that’s a lot of code, but a lot of it is just utility methods for interacting with expression trees (specifically the parts that get method names without resorting to Reflection or magic strings and the expression visitor that mutates the expression tree for a purpose I’ll get to in a minute).

Using this definition class the body of the search method in EntityFullTextSearcher now looks like this:

public IQueryable<EntityListProjectionModel> Search(IQueryable<EntityListProjectionModel> queryable, SearchQuery searchQuery)
{
    var query = _postgresQueryBuilder.Build(searchQuery);
    if (string.IsNullOrEmpty(query)) return queryable;

    queryable = queryable.Where(_def.CreateWhere(query));

    return queryable;
}

With the body of the sort method looking very very similar (calling CreateOrderBy instead).

But what does it all mean?

Expressing Myself

There are two important things to notice about the search definition class above.

The first is that it dynamically constructs the expression tree that would have previously been supplied by the inline lambda expression. You can see this construction inside the MakeMethodCallExpression method. It creates a ParameterExpression (for the incoming Entity from the IQueryable), and then constructs a Call node that has the definition as one parameter, and the query as the other, just like how it was originally specified inline. This method is then used in the construction of the expression trees for Where and OrderBy.

The second important thing is the usage of the ExpressionReplacer class. The lambda expression that contains the definition for the Full Text Search (the one returned by the Definition property) contains a parameter of type Entity. The lambda expression constructed by the CreateWhere and CreateOrderBy methods also contains a parameter of type Entity.

They are not the same thing.

In order for the inner expression parameter to be correctly bound to the same value as the outer expression parameter, it was necessary to walk the expression tree, find the inner parameter expression and replace it with the outer one. Assuming that it is unlikely there will be multiple parameter expressions throughout the tree of the same type that shouldn’t be bound to the same value, this approach will work just fine to consolidate all of the parameter expressions to the same eventual value.

The end result of the replacement is a new expression tree where all parameter expressions of type Entity are the same reference (there are only two, one inner and one outer).

Conclusion

Looking back at the effort that went into this, I’m not entirely sure it was worth it. I did accomplish my original goal (only one place where the Full Text Search definition was defined), but I’m not sure if the code is more understandable.

It felt good to separate the concerns, which makes it clear that the searcher and sorter classes are dependent on the same thing, and all of the complicated and hard to understand expression tree logic is encapsulated inside a single class, but I worry that it will be almost impenetrable to anyone who comes after me who doesn’t understand expression trees. Keep in mind, at this point, even after this entire adventure, I barely understand them. Me 3 months from now will probably look at the code, get confused, google part of it, find this blog post and then relive the entire affair.

On the upside, this did generate a significant increase in understanding of the way that the magic of Entity Framework works, and sparked quite a few discussions within the team that led to some interesting insights for all involved. It even spawned some other changes, where we can now use our own custom methods inside the expressions being passed into Entity Framework and then have them transformed appropriately into things that EF knows how to turn into SQL.

That’s a topic for another time though.