Mo Data, Mo Problems

January 18. 2019

It’s the gift that keeps on giving, our data synchronization process!

Well, it keeps on giving to me anyway, because it’s fuel for the furnaces of this blog. Sometimes finding topics to write about every week can be hard, so it’s nice when they drop into your lap.

Anyway, the process has started to creak at the seams a bit, because we’re pushing more data through it than ever before.

And when I say creak at the seams, what I mean is that our Read IOPS usage on the underlying database has returned to being consistently ridiculous.

Couldn’t Eat Another Bite

The data synchronization process had been relatively stable over most of 2018. Towards the middle, we scaled the underlying database to allow for the syncing of one of the two biggest data sets in the application, and after a slow rollout, that seemed to be going okay.

Of course, with that success under our belt, we decided to sync the other biggest data set in the application. Living life on the edge.

We only got about halfway through, because everything started to fall apart again, with similar symptoms to last time (spiking Read IOPS capping out at the maximum allowed burst, which would consume the IO credits and then tank the performance completely). We tried a quick fix of provisioning IOPS (to guarantee performance and remove the tipping point created by the consumption of IO credits), but it wasn’t enough.

The database just could not keep up with what was being demanded of it.

I’m A Very Understanding Person

Just like last time, the first step was to have a look at the queries being run and see if there was anything obviously inefficient.

With the slow queries related to the “version” of the remote table mostly dealt with in our last round of improvements, the majority of the slow queries remaining were focused on the part of the process that gets a table “manifest”. The worst offenders were the manifest calls for one of the big tables that we had only started syncing relatively recently. Keep in mind that this table is the “special” one featuring hard deletes (compared to the soft deletes of the other tables), so it was using the manifest functionality a lot more than any of the other tables were.

Having had enough of software level optimizations last time, we decided to try a different approach.

An approach that is probably, by far, the more common approach when dealing with performance issues in a database.

Indexes.

Probably The Obvious Solution

The first time we had performance problems with the database we shied away from implementing additional indexes. At the time, we thought that the indexes that we did have were the most efficient for our query load (being a Clustered Index on the two most selective fields in the schema), and we assumed we would have to look elsewhere for optimization opportunities. Additionally, we were worried that the performance issues might have an underlying cause related to total memory usage, and adding another index (or 10) is just more things to keep in memory.

Having scaled the underlying instance and seeing no evidence that the core problem was memory related, we decided to pull the index lever this time.

Analysis showed that the addition of another index similar to the primary key would allow for a decent reduction in the number of reads required to service a single request (the index would short circuit the need to read the entire partition of the data set into memory in order to figure out what the max value was for the un-indexed field). A quick replication in our performance testing environment proved it unequivocally, which was nice.
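To make the effect of that extra index a bit more concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for PostgreSQL. The table, column and index names (synced_rows, customer_id, row_version, ix_synced_rows_customer_version) are made up for illustration and are not our real schema, and SQLite’s planner is not PostgreSQL’s, so treat the printed plans as indicative only.

```python
import sqlite3


def explain(conn, sql, params=()):
    """Return SQLite's query plan for a statement as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")

# Hypothetical synced table; the primary key stands in for the existing index.
conn.execute(
    "CREATE TABLE synced_rows ("
    " customer_id INTEGER NOT NULL,"
    " id INTEGER NOT NULL,"
    " row_version INTEGER NOT NULL,"
    " PRIMARY KEY (customer_id, id))"
)
conn.executemany(
    "INSERT INTO synced_rows VALUES (?, ?, ?)",
    [(c, i, (i * 7919) % 1000) for c in range(1, 4) for i in range(1, 201)],
)

# The kind of query that needs the max of a field the key does not cover.
MAX_VERSION_SQL = "SELECT MAX(row_version) FROM synced_rows WHERE customer_id = ?"

plan_before = explain(conn, MAX_VERSION_SQL, (2,))
max_before = conn.execute(MAX_VERSION_SQL, (2,)).fetchone()[0]

# The extra index: same leading column as the key, plus the field being aggregated.
conn.execute(
    "CREATE INDEX ix_synced_rows_customer_version"
    " ON synced_rows (customer_id, row_version)"
)

plan_after = explain(conn, MAX_VERSION_SQL, (2,))
max_after = conn.execute(MAX_VERSION_SQL, (2,)).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
```

The answer to the query does not change, only the amount of data that has to be touched to get it. Before the index exists, every row for the customer has to be visited to find the max; afterwards the planner should be able to answer from the new index alone.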

For implementation, it’s easy enough to use Entity Framework to add an index as part of a database migration, so that’s exactly what we did.

We only encountered two issues, which was nice:

We didn’t seem to be able to use the concurrent index creation feature in PostgreSQL with the version of EF and Npgsql that we were using (which are older than I would like)
Some of the down migrations would not consistently apply, no matter what we tried

Neither of those two factors could stop us though, and the indexes were created.

Now we just had to roll them out.

Be Free Indexes, Be Free!

That required a little finesse.

We had a decent number of indexes that we wanted to add, and the datasets we wanted to add them to were quite large. Some of the indexes only took a few minutes to initialise, but others took as long as twenty.

Being that we couldn’t seem to get concurrent index creation working with Entity Framework data migrations, we had to sequence them out one at a time in sequential releases.

Not too hard, but a little bit more time consuming than we originally desired.

Of course, the sync process being what it is, it’s okay if it goes down for a half hour every now and then, so we just took everything out of service temporarily on each deployment to ensure that the database could focus on the index creation without having to worry too much about dealing with the constant flood of requests that it usually gets.

Conclusion

At the end of the day, this round of performance investigation and optimization actually took a hell of a lot less time and effort than the last, but I think that’s kind of to be expected when you’re actively trying to minimise code changes.

With the first few of the indexes deployed, we’ve already seen a significant drop in the Read IOPS of the database, and I think we’re going to be in a pretty good place to continue to sync the remainder of the massive data set that caused the database to choke.

The best indicator of future performance is past data though, so I’m sure there will be another post one day, talking all about the next terrible problem.

And how we solved it of course, because that’s what we do.

Delta Force

September 4. 2018

It’s been two months since I posted about our foray into continuous data migration from our legacy product to our cloud platform. It’s been mulling around in the back of my head ever since though, and a few weeks ago we finally got a chance to go back and poke at it again. Our intent was to extend the original prototype and get it into a place where it was able to be demonstrated to a real client (using their real data).

Spending development effort to build an early, extremely rough prototype and then demonstrating it to users and gathering feedback as quickly as possible is a great way to stop from building the wrong thing. You can get early indicators about whether or not you’re headed in the right direction without having to invest too much money, assuming you pick a representative example of your target audience of course.

When we finished building the first prototype, it quickly became apparent we couldn’t actually show it to a user. We could barely show it to ourselves.

I can just imagine the sort of awkward statement that such a demonstration would have started with:

And when you make a change in your legacy system like this, all you have to do is wait 45+ minutes before it’s available over here in the new and shiny cloud platform! How great is that!

It’s a pretty hard sell, so before we could even talk to anyone, we needed to do better.

It’s All So Pointless

The first prototype extended our existing migration process, and meant that instead of creating a brand new account in our cloud platform every time customer data was migrated, it could update an existing account.

In doing so, it just re-ran the entire migration again (query, transform, publish) over the entire customer data set, focusing its efforts on identifying whether or not a transformed entity was new or existing and then performing the appropriate actions via the cloud APIs.

This was something of a nuclear approach (like our original strategy for dealing with database restores in the sync algorithm) and resulted in a hell of a lot of wasted effort. More importantly, it resulted in a huge amount of wasted time, as the system still had to iterate through thousands of entities only to decide that nothing needed to be done.

The reality is that customers don’t change the entire data set all the time. They make small changes consistently throughout the day, so as long as we can identify only those changes and act on them, we should be able to do an update in a much shorter amount of time.

So that’s exactly what we did.

Tactical Strike

Whoever it was that implemented row level versioning in our legacy database, I should send them a gift basket or something, as it was one of the major contributing factors to the success of our data synchronization algorithm. With all of that delicious versioning information available to the migration process, the optimisation is remarkably simple.

Whenever we do a migration and it results in a set of transformed entities, we just store the versioning information about those entities.

When an update is triggered, we can query the database for the things that are greater than the maximum version we dealt with last time, vastly decreasing the amount of raw information that we have to deal with, decreasing the total amount of time taken.

The only complication is that because each transformed entity might be built up from many legacy entities, the version must be an aggregate, specifically the lowest version of the constituent entities.
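A rough sketch of that bookkeeping, in Python rather than the Kotlin the migration API is actually written in. Every name here (LegacyRow, aggregate_version, high_water_mark, rows_to_process, the "contact-a" keys) is hypothetical and only exists to illustrate the idea; the real process queries the legacy database instead of filtering a list.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class LegacyRow:
    """A row in the legacy database; row_version increases on every change."""
    row_id: int
    row_version: int


def aggregate_version(constituents: Iterable[LegacyRow]) -> int:
    """Version recorded for a transformed entity: the lowest constituent version."""
    return min(row.row_version for row in constituents)


def high_water_mark(stored_versions: Dict[str, int]) -> int:
    """The maximum version dealt with last time (0 if nothing was migrated yet)."""
    return max(stored_versions.values(), default=0)


def rows_to_process(rows: Iterable[LegacyRow], mark: int) -> List[LegacyRow]:
    """Only rows whose version is greater than the mark need to be looked at."""
    return [row for row in rows if row.row_version > mark]


# First migration: two transformed entities built from three legacy rows.
rows = [LegacyRow(1, 5), LegacyRow(2, 7), LegacyRow(3, 6)]
stored = {
    "contact-a": aggregate_version(rows[:2]),  # built from rows 1 and 2
    "contact-b": aggregate_version(rows[2:]),  # built from row 3
}
mark = high_water_mark(stored)

# Later, row 1 is edited and the legacy database bumps its version.
rows[0] = LegacyRow(1, 9)
changed = rows_to_process(rows, mark)
print(changed)
```

In this toy run row 2 gets picked up again even though it did not change, because the entity it belongs to was recorded with the lower version of row 1. In this sketch at least, taking the lowest version errs on the side of redoing a little work rather than missing a change.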

With that change in place, rather than the execution time of every migration being directly dependent on the total size of the customer’s data, it’s now dependent on how much has changed since we last migrated. As a result, it’s actually better to run the update process frequently to ensure that it makes many small updates over the course of the day, reducing the overall latency of the process and generally giving a much better user experience.

Excellent Timing

Speaking of running many small updates over the course of a day.

With the biggest blocker to a simple timer out of the way (the time required to execute the migration), we could actually put a timer into place.

The migration API is written in Kotlin, using Spring IO and Spring Batch, so it was a relatively simple matter to implement an in-memory job that runs every minute, identifies the migrations that should be updated (by picking the last successfully completed ones) and then executes an update operation on each.

For simplicity we execute the job synchronously, so each migration update must finish before the next can start, and the entire job cannot be rescheduled for re-execution until the previous job finishes. Obviously that approach doesn’t scale at all (every additional migration being updated increases the latency of the others), but in a controlled environment where we only have a limited set of migrations, it’s perfectly fine.

The only other thing we had to do in order to ensure the timer job worked as expected was to lock down the migration API to only have a single instance. Again, something that I would never advise in production, but is acceptable for a prototype. If we do end up using a timer in production, we’d probably have to leverage some sort of locking process to ensure that it only executes once.
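For what it’s worth, the shape of that timer job is roughly the following. This is a Python sketch, not the Spring Batch job itself, and the SingleFlightJob class and its callbacks are invented for the example. The in-process lock only protects a single instance, which is exactly why the real thing would need some sort of shared lock once there is more than one instance.

```python
import threading
from typing import Callable, List


class SingleFlightJob:
    """Runs the update job on demand, but never two runs at the same time."""

    def __init__(
        self,
        find_migrations: Callable[[], List[str]],
        update: Callable[[str], None],
    ) -> None:
        self._find_migrations = find_migrations
        self._update = update
        self._lock = threading.Lock()

    def tick(self) -> bool:
        """Called by the timer. Returns False if the previous run is still going."""
        if not self._lock.acquire(blocking=False):
            return False
        try:
            # Synchronous on purpose: each update finishes before the next starts.
            for migration_id in self._find_migrations():
                self._update(migration_id)
            return True
        finally:
            self._lock.release()


# Usage: a tick that arrives while the job is running is simply skipped.
updated: List[str] = []
overlapping_ticks: List[bool] = []


def update_migration(migration_id: str) -> None:
    updated.append(migration_id)
    # Simulate the timer firing again in the middle of a run.
    overlapping_ticks.append(job.tick())


job = SingleFlightJob(lambda: ["migration-1", "migration-2"], update_migration)
first_run = job.tick()
```

Skipping the overlapping tick (rather than queueing it) keeps the behaviour described above: the job cannot be re-executed until the previous run has finished.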

Oops I Did It Again

We are highly unlikely to go with the delta approach if this project pushes ahead though.

It provides just enough functionality to be able to demonstrate the concept to the anchor customer (and maybe a few additional validation customers), but it does not cater for at least two critical cases:

Entity deletions
Database restores

It could be augmented to cater for those cases of course.

It’s just software, and software is infinitely mutable, but all we would be doing is re-implementing the data synchronization algorithm, and it was hard enough to get right the first time. I don’t really want to write it all over again in a different system; we’ll just mess it up in new and unique ways.

Instead, we should leverage the existing algorithm, which is already really good at identifying the various events that can happen.

So the goal would be to implement some sort of event pipeline that contains a set of changes that have occurred (i.e. new entity, updated entity, deleted entity, database restored to X and so on), and then react to those events as appropriate from the migration side.
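As a very rough sketch of what reacting to such a pipeline might look like (in Python, with every type and function name invented for the example, and an in-memory dictionary standing in for the cloud platform):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Optional


class ChangeType(Enum):
    CREATED = "created"
    UPDATED = "updated"
    DELETED = "deleted"
    RESTORED = "restored"  # database restored to an earlier point


@dataclass(frozen=True)
class ChangeEvent:
    change: ChangeType
    entity_id: Optional[str] = None
    payload: Optional[dict] = None
    snapshot: Optional[Dict[str, dict]] = None  # only used by RESTORED


def apply_events(cloud: Dict[str, dict], events: Iterable[ChangeEvent]) -> None:
    """React to each event in order, mutating the in-memory 'cloud' state."""
    for event in events:
        if event.change in (ChangeType.CREATED, ChangeType.UPDATED):
            cloud[event.entity_id] = dict(event.payload or {})
        elif event.change is ChangeType.DELETED:
            cloud.pop(event.entity_id, None)
        elif event.change is ChangeType.RESTORED:
            # Simplest possible reaction: replace everything with the restored state.
            cloud.clear()
            cloud.update(event.snapshot or {})


cloud: Dict[str, dict] = {}
apply_events(cloud, [
    ChangeEvent(ChangeType.CREATED, "a", {"name": "Alice"}),
    ChangeEvent(ChangeType.CREATED, "b", {"name": "Bob"}),
    ChangeEvent(ChangeType.UPDATED, "a", {"name": "Alicia"}),
    ChangeEvent(ChangeType.DELETED, "b"),
])
print(cloud)
```

The point of the sketch is only that deletions and restores become first-class events that the migration side can react to, instead of things it has to infer by re-scanning everything.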

Obviously it’s not that simple in practice, but it is likely the direction that we will end up going if this all pans out.

Conclusion

What we’re left with right now is a prototype that allows for changes to customer data in their legacy system to be applied automatically to the cloud platform with a latency of under 5 minutes.

Of course, it has a bunch of conditions attached to it (every additional customer makes it slower for every other customer, doesn’t handle deletes, doesn’t handle database restores, was not built using normal engineering practices), but it’s enough to demonstrate the concept to a real person and start a useful conversation.

As is always the case with this sort of thing, there is a very real risk that this prototype might accidentally become production code, so it’s something that we as developers have to be eternally vigilant against.

That’s a hill I’m prepared to die on though.

That Syncing Feeling

June 26. 2018

As I’ve written about a bunch of times before, we have a data synchronization process for freeing customer data from the restrictive confines of their office. We built it to allow the user to get at their valuable information when they are on the road, as our users tend to be, as something of a stopgap while we continue to develop and improve our cloud platform.

A fantastic side effect of having the data is that we can make it extremely easy for our customers to migrate from our legacy product to the new cloud platform, because we’ve already dealt with the step of getting their information into a place where we can easily access it. There are certainly challenges inherent in this approach, but anything we can do to make it easier to move from one system to another is going to make our customers happier when they do decide to switch.

And happy customers are the best customers.

As is always the case with software though, while the new cloud platform is pretty amazing, it’s still competing with a piece of software that’s been actively developed for almost 20 years, and not everyone wants to take the plunge just yet.

In the meantime, it would be hugely beneficial to let our customers use at least some part of the new system that we’ve been developing, both because we honestly believe it’s a simpler and more focused experience and because the more people that use the new system, the more feedback we get, and that equals a better product for everyone.

Without a link between the two systems though, it’s all very pointless. Customers aren’t going to want to maintain two sets of identical information, and rightly so.

But what if we did it for them?

Data…Remember Who You Were

We already have a decent automated migration process. It’s not perfect, but it ticks a lot of boxes with regards to bringing across the core entities that our customers rely on day to day.

All we need to do is execute that migration process on a regular basis, following some sort of cadence and acceptable latency that we figure out with the help of some test users.

This is all very good in theory, but the current migration process results in a brand new cloud account every time it’s executed on a customer’s data, which is pretty unusable if we’re trying to make the customer’s life easier. Ideally we want to re-execute the migration, but targeting an existing account. New entities get created, existing entities get updated and deleted entities get deleted.

It’s basically another sync algorithm, except this time it’s more complicated. The current on-premises to remote sync is a relatively simple table by table, row by row process, because both sides need to look identical at the end. This time though, transforms occur, entities are broken down (or consolidated) and generally there is no one-to-one relationship that’s easy to follow.

On the upside, the same transformations are executed for both the initial migration and the update, so we can deal with the complexity somewhat by simply remembering the outcome of the previous process (actually all previous processes), and using that to identify what the destination was for all previously migrated entities.

What we’re left with is an update process that can be executed on top of any previous migration, which will:

Identify any previously migrated entities by their IDs, and perform appropriate update or replacement operations
Identify new entities and perform appropriate creation operations, storing the ID mapping (i.e. original to new) in a place where it can be retrieved later
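Those two steps boil down to something like the sketch below. It is Python with an in-memory FakeCloud standing in for the cloud API, and all of the names (FakeCloud, run_update, the "cloud-1" style IDs) are invented for illustration. Like the prototype, it ignores deletions entirely.

```python
from itertools import count
from typing import Dict, Iterable, Tuple


class FakeCloud:
    """In-memory stand-in for the cloud API (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.entities: Dict[str, dict] = {}
        self._ids = count(1)

    def create(self, data: dict) -> str:
        cloud_id = f"cloud-{next(self._ids)}"
        self.entities[cloud_id] = dict(data)
        return cloud_id

    def update(self, cloud_id: str, data: dict) -> None:
        self.entities[cloud_id] = dict(data)


def run_update(
    transformed: Iterable[Tuple[str, dict]],
    id_map: Dict[str, str],
    cloud: FakeCloud,
) -> None:
    """Apply transformed entities, reusing cloud IDs remembered from earlier runs."""
    for legacy_id, data in transformed:
        if legacy_id in id_map:
            # Previously migrated: update the entity we created last time.
            cloud.update(id_map[legacy_id], data)
        else:
            # New entity: create it and remember original -> new for next time.
            id_map[legacy_id] = cloud.create(data)


cloud = FakeCloud()
id_map: Dict[str, str] = {}

# Initial migration, then an update with one changed and one new entity.
run_update([("L1", {"rent": 400}), ("L2", {"rent": 350})], id_map, cloud)
run_update([("L1", {"rent": 420}), ("L3", {"rent": 500})], id_map, cloud)
print(id_map)
```

The important part is that the ID mapping outlives any single run, so the update can be executed on top of any previous migration.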

But what about deletions?

Well, at this point in time we’re just prototyping, so we kind of just swept that under the rug.

I’m sure it will be easy.

I mean, look at how easy deletions were to handle in the original data synchronization algorithm.

It’s All Just Too Much

Unfortunately, while the prototype I described above works perfectly fine to demonstrate that an account in the cloud system can be kept up to date with data from the legacy product, we can’t give it to a customer yet because every migration “update” is a full migration of all of the customer’s data, not just the stuff that’s changed.

Incredibly wasteful.

For our test data set, this isn’t really a problem. A migration only takes a minute at most, usually less, and it’s only really moving tens of entities around.

For a relatively average customer though, a migration takes like 45 minutes and moves thousands to tens of thousands of entities. It’s never really been a huge problem before, because we only did it once or twice while moving them to the new cloud platform, but it quickly becomes unmanageable if you want to do it in some sort of constant loop.

Additionally, we can’t easily use a dumb timer to automate the process or we’ll get into problems with re-entrancy, where two migrations for the same data are happening at the same time. Taking that into account though, even if we started the next migration the moment the previous one finished (in some sort of eternal chain), we’re still looking at a total latency that is dependent on the size of the customer’s complete data set, not just the stuff they are actively working with. Also, we’d be using resources pointlessly, which is a scaling nightmare.

We should only be acting on a delta, using only the things that have changed since the last “update”.

A solvable problem, but it does add to the overall complexity, as we can’t fall back on the row based versioning that we used for the core data synchronization algorithm because of the transforms, so we have to create some sort of aggregate version from the constituent elements and store that for later use.

It’s Time To Move On

Even if we do limit our migration updates to only the things that have changed since last time, we still have a problem if we’re just blindly running the process on a timer (or eternal chain). If there are no changes, why do a bunch of work for nothing?

Instead, we should be able to react to the changes as the existing synchronization system detects them.

To be clear, the existing process is also timer based, so it’s not perfect. It would be nice to not add another timer to the entire thing though.

With some effort we should be able to produce a stream of information from our current sync API that describes the changes being made. We can then consume that stream in order to optimise the migration update process, focusing only on those entities that have changed, and more importantly, only doing work when changes actually occur.

Conclusion

Ignoring the limitations for a moment, the good news is that the prototype was enough to prove to interested stakeholders that the approach has merit. At least enough merit to warrant a second prototype, under the assumption that the second prototype will alleviate some of the limitations and we can get it in front of some real customers for feedback.

I have no doubt that there will be some usability issues that come out during the user testing, but that’s to be expected when we do something as lean as this. Whether or not those usability issues are enough to cause us to run away from the idea altogether is a different question, but we won’t know until we try.

Even if the idea of using two relatively disparate systems on a day to day basis is unpalatable, we still get some nice things from being able to silently migrate our customers into our cloud platform, and keep that information up to date.

I mean, what better way is there to demo a new piece of software to a potential customer than to do it with their own data?

In The End, There Can Be Only One

April 10. 2018

I’ve written a lot of words on this blog about the data synchronization algorithm. Probably too many to be honest, but it’s an interesting technical topic for me, so the words come easily.

Not much has changed since the optimization to the differencing check to stop it from wastefully scanning the entire table. It’s just been quietly chugging along, happily grabbing data whenever clients opt in, and just generally being useful.

As we accumulate more data though, a flaw in the system is becoming more obvious.

Duplicates.

Once Uploaded, Data Lives Forever

I like to think that the data synchronization algorithm is really good at what it does.

Given a connection to a legacy database and some identifying information (i.e. the identity of the client), it will make sure that a copy of the data in that database exists remotely, and then, as the underlying database changes, ensure those changes are also present.

The actual sync algorithm is provided to clients in the form of a plugin for a component that enables services for their entire office, like server side automated backups and integrations with (other) cloud services. All of this hangs on a centralised store of registered databases, which the client is responsible for maintaining. The whole underlying system was built before my time, and while it’s a little rough around the edges, it’s pretty good.

Unfortunately, it does have one major flaw.

When a client registers a database (generally done by supplying a connection string), that database is given a unique identifier.

If the client registers the same physical database again (maybe they moved servers, maybe they lost their settings due to a bug, maybe support personnel think that re-registering databases is like doing a computer reboot), they get a new database identifier.

For the sync process, this means that all of the data gets uploaded again, appearing as another (separate) database belonging to the same client. In most cases the old registration continues to exist, but it probably stops being updated. Sometimes the client is actively uploading data from the same database more than once though, but that kind of thing is pretty rare.

For most use cases this sort of thing is mostly just annoying, as the client will select the right database whenever they interact with whatever system is pulling from the data in the cloud (and we generally try to hide databases that look like they would have no value to the customer, like ones that haven’t been actively updated in the last 24 hours).

For business intelligence though, it means a huge amount of useless duplicate data, which has all sorts of negative effects on the generated metrics.

Well, Until We Delete It And Take Its Power

From an engineering point of view, we should fix the root flaw, ensuring that the same physical database is identified correctly whenever it participates in the system.

As always, reality tends to get in the way, and unpicking that particular beast is not a simple task. It’s not off the table completely, it’s just less palatable than it could be.

Even if the flaw is fixed though, the duplicate data that already exists is not going to magically up and disappear out of a respect for our engineering prowess. We’re going to have to deal with it anyway, so we might as well start there.

Algorithmically, a data set (customer_id-database_id pair) can be considered a duplicate of another data set if and only if:

The customer_id matches (we ignore duplicates across clients, for now anyway)
The data set contains at least 25 GUID identifiers that also appear in the other data set (each entity in the database generally has both a numerical and GUID identifier, so we just use the most common entity)
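Expressed as code, the rule looks something like this. It is a Python sketch rather than the C# we actually wrote, the function and variable names are made up, and it assumes the GUIDs of the most common entity have already been loaded into a set per data set.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

DataSetKey = Tuple[str, str]  # (customer_id, database_id)
MIN_SHARED_GUIDS = 25  # threshold taken from the rule above


def find_duplicate_pairs(
    guids_by_data_set: Dict[DataSetKey, Set[str]],
    min_shared: int = MIN_SHARED_GUIDS,
) -> List[Tuple[DataSetKey, DataSetKey]]:
    """Return pairs of data sets that count as duplicates of each other."""
    pairs = []
    for left, right in combinations(sorted(guids_by_data_set), 2):
        same_customer = left[0] == right[0]
        shared = len(guids_by_data_set[left] & guids_by_data_set[right])
        if same_customer and shared >= min_shared:
            pairs.append((left, right))
    return pairs


# Usage with made-up data: the same database registered twice by cust-1.
common = {f"guid-{i}" for i in range(30)}
data_sets = {
    ("cust-1", "db-old"): common | {"only-in-old"},
    ("cust-1", "db-new"): common | {"only-in-new"},
    ("cust-1", "db-other"): {f"guid-{i}" for i in range(10)},
    ("cust-2", "db-a"): set(common),
}
duplicates = find_duplicate_pairs(data_sets)
print(duplicates)
```

Comparing every pair is fine for a sketch, but it says nothing about how the comparison is done efficiently against a real database.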

Nothing particularly complicated or fancy.

For the automated process itself, there are a few things worth considering:

It needs to communicate clearly what it did and why
There is value in separating the analysis of the data sets from the actions that were performed (and their results)
We’ll be using TeamCity for actually scheduling and running the process, so we can store a full history of what the process has done over time
To minimise risk, it’s useful to be able to tell the process “identify all the duplicates, but only delete the first X”, just in case it tries to delete hundreds and causes terrible performance problems

Taking all of the above into account, we created a simple C# command line application that could be run like this:

Cleanup.exe –c {connection-string} –a {path-to-analysis-file} –d {path-to-deletion-results-file} –limitDeletionCount {number-to-delete}

Like everything we do, it gets built, tested, packaged (versioned), and uploaded to our private NuGet feed. For execution, there is a daily task in TeamCity to download the latest package and run it against our production database.

It’s Like A Less Impressive Quickening

The last thing to do is make sure that we don’t ever delete any data that might still have value, to either us or the client.

As I mentioned above, the main reason that duplicates happen is when a client re-registers the same database for some reason. Upon re-registration, the “new” database will begin its data synchronization from scratch.

During the period of time where data is still uploading for the “new” database, but all the “old” data is still hanging around, how can we reasonably say which data set is the duplicate and should be deleted?

If we go off number of records, we’d almost certainly delete the “new” database mistakenly, which would just start it uploading again from scratch, and we’d get into an infinite stupidity loop.

We need some sort of indication that the data is “recent”, but we can’t use the timestamps on the data itself, because they are just copies from the local database, and oldest data uploads first.

Instead we need to use timing information from when the data set last participated in the sync process, i.e. a recency indicator.

A small modification of the tool later, and its execution looks like this:

Cleanup.exe –c {connection-string} –a {path-to-analysis-file} –d {path-to-deletion-results-file} –limitDeletionCount {number-to-delete} –recencyLimitDays {dont-delete-if-touched-this-recently}

We currently use 7 days as our recency limit, but once we’re more comfortable with the process, we’ll probably tune it down to 1 or 2 days (just to get rid of the data as soon as we can).
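The two safety valves (the recency limit and the deletion count limit) combine along these lines. Again, this is a Python sketch with invented names, not the code inside Cleanup.exe, and deleting the stalest candidates first is an assumption made for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def select_for_deletion(
    last_synced: Dict[str, datetime],
    now: datetime,
    recency_limit_days: int,
    limit_deletion_count: int,
) -> List[str]:
    """Pick duplicate data sets that are safe to delete on this run.

    last_synced maps each duplicate candidate to the last time it took part
    in the sync process (not to any timestamp inside the data itself).
    """
    cutoff = now - timedelta(days=recency_limit_days)
    stale = [key for key, seen in last_synced.items() if seen < cutoff]
    # Stalest first (an assumption), and never more than the configured count.
    stale.sort(key=lambda key: last_synced[key])
    return stale[:limit_deletion_count]


# Usage with made-up candidates and a 7 day recency limit.
candidates = {
    "db-a": datetime(2018, 1, 1),
    "db-b": datetime(2018, 4, 8),  # touched two days ago, so left alone
    "db-c": datetime(2018, 3, 1),
    "db-d": datetime(2018, 2, 1),
}
to_delete = select_for_deletion(
    candidates,
    now=datetime(2018, 4, 10),
    recency_limit_days=7,
    limit_deletion_count=2,
)
print(to_delete)
```

Anything skipped because of the count limit is simply picked up by a later run, which fits nicely with the daily schedule.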

Conclusion

To be honest, we’ve known about the duplicate data flaw for a while now, but as I mentioned earlier, it didn’t really affect customers all that much. We’d put some systems in place to allow customers to only select recently synced databases already, so from their point of view, there might be a period where they could see multiple, but that would usually go away relatively quickly.

It wasn’t until we noticed the duplicates seeping into our metrics (which we use to make business decisions!) that we realised we really needed to do something about them, thus the automated cleanup.

A nice side effect of this is that when we did the duplicate analysis, we realised that something like 30% of the entire database was worthless duplicate data, so there might actually be significant performance gains once we get rid of it all, which is always nice.

To be honest, we probably should have just fixed the flaw as soon as we noticed it, but it’s in a component that is not well tested or understood, so there was a significant amount of risk in doing so.

Of course, as is always the case when you make that sort of decision, now we’re paying a different price altogether.

And who can really say which one is more expensive in the end?

Time For A Dating Opportunity

January 30. 2018

Date and Time data structures are always so fun and easy to use, aren’t they?

Nope.

They are, in fact, the devil. No matter how hard I try (and believe me, I’ve tried pretty hard) or what decisions I make, I always find myself facing subtle issues relating to time that go unnoticed until the problem is way harder to fix than it should be. Like say, after your customers have created millions and millions of entities with important timestamps attached to them that aren’t timezone aware.

More recently, we were bitten when we tried to sync fields representing a whole day (i.e. 1 January 2017, no time) through our data synchronization algorithm.

To our surprise, a different day came out of the other end, which was less than ideal.

Dates Are Delicious

During the initial analysis of the tables earmarked to be synced from client databases, we located quite a few fields containing dates with time information. That is, usage of the actual DateTime data structure in SQL Server. As most of us had been bitten in the past by subtle conversion bugs when playing with raw DateTimes in .NET, we made the decision to convert all DateTimes to DateTimeOffsets (i.e. 2017-01-01 00:00+10) at the earliest possible stage in the sync process, using the timezone information on the client server that the sync process was running on.

What we didn’t know was that some of the DateTime fields actually represented whole dates, and they were just represented as that day at midnight because there was no better structure available when they were initially created.

Unfortunately, converting whole days stored as DateTimes into DateTimeOffsets isn’t actually the best idea, because an atomic day’s representation should not change when you move into different timezones. For example, 1 January 2017 in Brisbane does not magically turn into 31 December 2016 22:00 just because you’re in Western Australia. It’s still 1 January 2017.

This is one of the weird and frustrating things about the difference between whole Dates and DateTimes. Technically speaking, a Date as explained above probably should be location aware, especially as the timezone differences get more extreme. The difference between WA and QLD is pretty academic, but there’s a whole day between the US and Australia. If two users were to calculate something like rental arrears in two different places using whole dates, they would probably get two different numbers, which could lead to poor decisions. Of course, from a user’s point of view, the last thing they would expect is to have one day turn into another, or to add a time representation to something they entered as a day using a calendar selector, so all told, it’s confusing and terrible and I hate it.

If you want to get technical, the converted DateTime still represents the same instant in time, so as long as you know what the original offset was, you can use that to revert back to the original value (which is some day at midnight) without too much trouble, and then interpret it as necessary.
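The same arithmetic in a few lines of Python (using fixed offsets of +10 and +8 to match the Brisbane and Western Australia example, and ignoring daylight saving and everything else that makes real timezones painful):

```python
from datetime import datetime, timedelta, timezone

BRISBANE = timezone(timedelta(hours=10))  # +10
PERTH = timezone(timedelta(hours=8))      # +8, Western Australia

# A whole day stored as midnight, then given the client server's offset.
as_entered = datetime(2017, 1, 1, 0, 0)
with_offset = as_entered.replace(tzinfo=BRISBANE)

# Same instant viewed from Western Australia: the calendar day changes.
seen_in_wa = with_offset.astimezone(PERTH)

# Knowing the original offset is enough to get the entered day back.
recovered_day = seen_in_wa.astimezone(BRISBANE).date()

print(seen_in_wa.isoformat())
print(recovered_day.isoformat())
```

The recovery step only works because the original offset is still known, which is the part that matters for what happens next.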

Of course, that’s when PostgreSQL decided to get involved.

Lost Time

A long time ago when we started the data synchronization project, we decided to use PostgreSQL as the remote store. We did this mostly because PostgreSQL was cheaper to run in AWS via RDS (the licensing costs for SQL Server in RDS were kind of nuts in comparison).

In hindsight, this was a terrible decision.

We might have saved raw money on a month-to-month basis, but we exposed ourselves to all sorts of problems inherent to the differences between the two database engines, not to mention the generally poor quality of the PostgreSQL tools, at least in comparison to SQL Server Management Studio.

Returning to the date and time discussion, we chose to use Entity Framework (via Npgsql) as our interface to PostgreSQL and, to be honest, we pretty much just trusted it to get the database schema right. All of our DateTimeOffsets got mapped to the PostgreSQL data type timestamp with time zone (timestamptz), which looks like it’s pretty much the same thing.

Except it’s not. It’s not the same at all. It actually loses data when storing a DateTimeOffset, and it does this by design.

In PostgreSQL terms, using a timestamp with time zone column actually means “please automatically adjust the data I insert into this field using the given offset, and store it as UTC”. This makes sense, in a way, because strictly speaking, the data still represents the same instant in time, and can be freely adjusted to the user’s current offset as necessary (i.e. show me what the data looks like in +10).

Unfortunately, this approach means that the actual offset the data was inserted with is lost completely.

PostgreSQL has another data type called timestamp without time zone, but all it does is ignore the offset completely, so the offset still gets stripped out. Less than useful.

To summarise, here is the chain of events:

The user enters some data, representing a whole day: 1 Jan 2017
The system stores this data in SQL Server as a DateTime: 1 Jan 2017 00:00
Our sync process reads the data as a DateTimeOffset, using the local timezone: 1 Jan 2017 00:00 +10
The data is pushed to PostgreSQL and stored as UTC: 31 Dec 2016 14:00
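The chain of events above can be walked through in a few lines of Python. This is only a simulation of the behaviour described, not our sync code and not anything that talks to SQL Server or PostgreSQL; the +10 local offset and all of the names are assumptions.

```python
from datetime import datetime, timedelta, timezone

LOCAL = timezone(timedelta(hours=10))  # assumed local offset of the sync process

# Steps 1 and 2: the user picks a whole day, which is stored as midnight
# with no offset information at all.
stored_locally = datetime(2017, 1, 1, 0, 0)

# Step 3: the sync process reads it and attaches the local offset.
read_by_sync = stored_locally.replace(tzinfo=LOCAL)

# Step 4: the remote store normalises to UTC and keeps no record of the offset.
stored_remotely = read_by_sync.astimezone(timezone.utc)

# A reader who assumes the remote value is "just a date" is now a day out.
print(stored_remotely.date())  # 2016-12-31
```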

Technically the data still represents the exact same point in time, but its actual meaning is now compromised. If anyone reads the remote data and assumes it’s still just a date, they are now a full day off, which is terrible.

Daylight savings doesn’t help either, because now the offset is inconsistent, so in NSW sometimes you will see the date as the previous day at 14:00 and sometimes at 13:00.
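Here is a small sketch of that inconsistency, again in Python with fixed offsets. It assumes NSW is +10 in winter and +11 while daylight saving is active, and midnight_as_utc is a made-up helper.

```python
from datetime import datetime, timedelta, timezone

def midnight_as_utc(year, month, day, offset_hours):
    """Take local midnight at a fixed offset and express it in UTC."""
    local = timezone(timedelta(hours=offset_hours))
    return datetime(year, month, day, tzinfo=local).astimezone(timezone.utc)

# NSW offsets assumed here: +10 in winter, +11 while daylight saving is active.
winter = midnight_as_utc(2017, 7, 1, 10)
summer = midnight_as_utc(2017, 1, 1, 11)

print(winter.isoformat())  # 2017-06-30T14:00:00+00:00
print(summer.isoformat())  # 2016-12-31T13:00:00+00:00
```

So you can’t even apply a single blanket correction to the remote data, because the right correction depends on the time of year the value was written.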

I mean honestly, daylight savings doesn’t really help anyone anyway, but that’s a different story.

That Approach Is Sooooo Dated

For whole dates, the only realistic solution is to treat them exactly as they should be treated, as dates with no time component.

Completely obvious in retrospect.

Both SQL Server and PostgreSQL have a Date data type which does exactly what it says on the box and leaves no room for misinterpretation or weird conversion errors.
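Python’s standard library happens to have the same split between a date and a datetime, so it makes for a handy illustration (this is an analogy, not a claim about how either database implements its Date type). A plain date carries no time and no offset, so there is simply nothing to shift.

```python
from datetime import datetime

# A whole day that was stored the old way, as midnight with no offset.
stored_as_midnight = datetime(2017, 1, 1, 0, 0)

# Keeping only the date part leaves nothing for a timezone conversion to act on.
whole_day = stored_as_midnight.date()

print(whole_day.isoformat())             # 2017-01-01
print(hasattr(whole_day, "astimezone"))  # False
```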

Of course, .NET still only has the good old DateTime and DateTimeOffset data types, so there is room for shenanigans there, but at least the storage on either end would be correct.

For dates that actually do come with a time component (like an appointment), you have to really understand whether or not the time should be able to be interpreted in another timezone. Taking the appointment example, it might be reasonable to think that a company using your software product to track their appointments might exist in both NSW and QLD. The person executing the appointment would want to schedule it in their local time, but a remote administration person might want to know what time the appointment was in their own local time so that they know when to transfer calls.

SQL Server is fine in this respect, because a DateTimeOffset is perfect, freely able to be converted between one timezone and another with no room for misinterpretation.

In PostgreSQL, the timestamp with time zone data type might be good enough, assuming you don’t need to know with absolute certainty what the original offset was (and thus the time in the original creator’s context). If you do need to know that (maybe for business intelligence or analytics) you either need to know some information about the creator, or you should probably just use a timestamp data type, convert it to UTC yourself and store the original offset separately.
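A minimal sketch of that last option, in Python rather than .NET. The two-column layout (a UTC timestamp plus an offset in minutes) and the helper names to_storage and from_storage are assumptions for illustration, not something we have actually built.

```python
from datetime import datetime, timedelta, timezone

def to_storage(value):
    """Split an aware datetime into (naive UTC datetime, offset in minutes)."""
    offset_minutes = int(value.utcoffset().total_seconds() // 60)
    utc_naive = value.astimezone(timezone.utc).replace(tzinfo=None)
    return utc_naive, offset_minutes

def from_storage(utc_naive, offset_minutes):
    """Rebuild the original aware datetime from the two stored values."""
    original = timezone(timedelta(minutes=offset_minutes))
    return utc_naive.replace(tzinfo=timezone.utc).astimezone(original)

# An appointment created at 9:30 in a +11 offset.
appointment = datetime(2017, 1, 1, 9, 30, tzinfo=timezone(timedelta(hours=11)))

row = to_storage(appointment)
restored = from_storage(*row)

print(restored.isoformat())  # 2017-01-01T09:30:00+11:00
```

The UTC column can still be compared and sorted as an instant, while the offset column preserves the creator’s context, which is the bit that gets thrown away otherwise.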

Conclusion

Representing time is hard in software. I’ve heard some people say this is because the fundamentally explicit structures that we have to have in software are simply unable to represent such a fluid and mutable construct, and I’m not sure if I disagree.

For the case I outlined above, we made some bad decisions with good goals (let’s use DateTimeOffset, it’s impossible to misinterpret!) that built on top of other people’s bad decisions, but at least we understand the problem well enough now to potentially fix it.

Unfortunately, it’s unlikely that the concept of dates and times in software is just going to go away, as it seems pretty important.

The best we can hope for is probably just for existence to end.

Someone would probably want to know when existence is going to end taking daylight savings into account.