
It’s the gift that keeps on giving, our data synchronization process!

Well, it keeps on giving to me anyway, because it’s fuel for the furnaces of this blog. Sometimes finding topics to write about every week can be hard, so it’s nice when they drop into your lap.

Anyway, the process has started to creak at the seams a bit, because we’re pushing more data through it than ever before.

And when I say creak at the seams, what I mean is that our Read IOPS usage on the underlying database has returned to being consistently ridiculous.

Couldn’t Eat Another Bite

The data synchronization process had been relatively stable over most of 2018. Towards the middle of the year, we scaled the underlying database to allow for the syncing of one of the two biggest data sets in the application, and after a slow rollout, that seemed to be going okay.

Of course, with that success under our belt, we decided to sync the other biggest data set in the application. Living life on the edge.

We only got about halfway through before everything started to fall apart again, with similar symptoms to last time (spiking Read IOPS capping out at the maximum allowed burst, which would consume the IO credits and then tank performance completely). We tried a quick fix of provisioning IOPS (to guarantee performance and remove the tipping point created by the consumption of IO credits), but it wasn’t enough.

The database just could not keep up with what was being demanded of it.

I’m A Very Understanding Person

Just like last time, the first step was to have a look at the queries being run and see if there was anything obviously inefficient.

With the slow queries related to the “version” of the remote table mostly dealt with in our last round of improvements, the majority of the slow queries remaining were focused on the part of the process that gets a table “manifest”. The worst offenders were the manifest calls for one of the big tables that we had only started syncing relatively recently. Keep in mind that this table is the “special” one featuring hard deletes (compared to the soft deletes of the other tables), so it was using the manifest functionality a lot more than any of the other tables were.

Having had enough of software level optimizations last time, we decided to try a different approach.

An approach that is probably, by far, the most common one when dealing with performance issues in a database.

Indexes.

Probably The Obvious Solution

The first time we had performance problems with the database we shied away from implementing additional indexes. At the time, we thought that the indexes that we did have were the most efficient for our query load (being a Clustered Index on the two most selective fields in the schema), and we assumed we would have to look elsewhere for optimization opportunities. Additionally, we were worried that the performance issues might have an underlying cause related to total memory usage, and adding another index (or 10) is just more things to keep in memory.

Having scaled the underlying instance and seeing no evidence that the core problem was memory related, we decided to pull the index lever this time.

Analysis showed that the addition of another index similar to the primary key would allow for a decent reduction in the amount of reads required to service a single request (in that the index would short-circuit the need to read the entire partition of the data set into memory just to figure out the max value of the un-indexed field). A quick replication on our performance testing environment proved it unequivocally, which was nice.

For implementation, it’s easy enough to use Entity Framework to add an index as part of a database migration, so that’s exactly what we did.
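
To give a rough idea, here’s a minimal sketch of such a migration, assuming EF6-style code-first migrations; the namespace, table, column and index names are purely illustrative, not our actual schema:

using System.Data.Entity.Migrations;

namespace Solavirum.Sync.Migrations
{
    public partial class AddManifestIndex : DbMigration
    {
        public override void Up()
        {
            // A non-unique index leading with the same columns as the primary key, plus the
            // previously un-indexed column, so that "what is the max value for this
            // customer/database" can be answered from the index instead of the whole partition.
            CreateIndex(
                table: "public.entities",
                columns: new[] { "customer_id", "database_id", "entity_version" },
                name: "ix_entities_customer_database_version");
        }

        public override void Down()
        {
            DropIndex("public.entities", "ix_entities_customer_database_version");
        }
    }
}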

We only encountered two issues, which was nice:

  • We didn’t seem to be able to use the concurrent index creation feature in PostgreSQL with the version of EF and Npgsql that we were using (which are older than I would like); there’s a sketch of what that would normally look like below
  • Some of the down migrations would not consistently apply, no matter what we tried
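
For reference (and this is an assumption about how we would do it, not something we actually shipped): CREATE INDEX CONCURRENTLY builds the index without blocking writes, but it refuses to run inside a transaction, and EF wraps each migration in one by default, so with a newer EF/Npgsql pairing you would typically drop down to raw SQL and suppress the transaction, something like:

using System.Data.Entity.Migrations;

namespace Solavirum.Sync.Migrations
{
    public partial class AddManifestIndexConcurrently : DbMigration
    {
        public override void Up()
        {
            // CREATE INDEX CONCURRENTLY cannot run inside a transaction, hence suppressTransaction: true.
            Sql(
                "CREATE INDEX CONCURRENTLY ix_entities_customer_database_version " +
                "ON public.entities (customer_id, database_id, entity_version);",
                suppressTransaction: true);
        }

        public override void Down()
        {
            Sql(
                "DROP INDEX CONCURRENTLY IF EXISTS ix_entities_customer_database_version;",
                suppressTransaction: true);
        }
    }
}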

Neither of those two factors could stop us though, and the indexes were created.

Now we just had to roll them out.

Be Free Indexes, Be Free!

That required a little finesse.

We had a decent number of indexes that we wanted to add, and the datasets we wanted to add them to were quite large. Some of the indexes only took a few minutes to initialise, but others took as long as twenty.

Since we couldn’t seem to get concurrent index creation working with Entity Framework data migrations, we had to roll the indexes out one at a time, in sequential releases.

Not too hard, but a little bit more time consuming than we originally desired.

Of course, the sync process being what it is, it’s okay if it goes down for a half hour every now and then, so we just took everything out of service temporarily on each deployment to ensure that the database could focus on the index creation without having to worry too much about dealing with the constant flood of requests that it usually gets.

Conclusion

At the end of the day, this round of performance investigation and optimization actually took a hell of a lot less time and effort than the last, but I think that’s kind of to be expected when you’re actively trying to minimise code changes.

With the first few of the indexes deployed, we’ve already seen a significant drop in the Read IOPS of the database, and I think we’re going to be in a pretty good place to continue to sync the remainder of the massive data set that caused the database to choke.

The best indicator of future performance is past data though, so I’m sure there will be another post one day, talking all about the next terrible problem.

And how we solved it of course, because that’s what we do.


And so I return from my break, with a fresh, re-energized hatred for terrible and unexpected technical issues.

Speaking of which…

Jacked Up And Good To Go

I’ve been sitting on a data export project for a while now. It’s a simple C# command line application that connects to one of our databases, correlates some entities together, pulls some data out and dumps it into an S3 bucket as CSV files. It has a wide variety of automated unit tests proving that the various components of the system function as expected, and a few integration and end-to-end tests that show that the entire export process works (i.e. given a set of command line arguments + an S3 bucket, does running the app result in data in the bucket).

From a purely technical point of view, the project has been “complete” for a while now, but I couldn’t turn it on until some other factors cleared up.

Well, while I was on holidays those other factors cleared up, so I thought to myself “I’ll just turn on the schedule in TeamCity and all will be well”.

Honestly, you’d think having written as much software as I have that I would know better.

That first run was something of a crapshow, and while some of the expected files ended up in the S3 bucket, a bunch of others were missing completely.

Even worse, the TeamCity task that we use for job execution thought everything completed successfully. The only indicator that it had failed was that the task description in TeamCity was not updated to show the summary of what it had done (i.e. it simply showed the generic “Success” message instead of the custom one we were supplying), which is suspicious, because I’ve seen TeamCity fail miserably like that before.

Not Enough Time In The World

Breaking it down, there were two issues afoot:

  1. Something failed, but TeamCity thought everything was cool
  2. Something failed

The first problem was easy to deal with; there was a bug in the way the process was reporting back to TeamCity when there was an unhandled exception in the data extraction process. With that sorted, the next run at least indicated a failure had occurred.
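
The shape of the fix is nothing special; I won’t reproduce the real code, but conceptually it’s just making sure an unhandled exception turns into both a TeamCity build problem and a non-zero exit code. The names and messages below are illustrative:

using System;

namespace Solavirum.Exporting
{
    public static class Program
    {
        public static int Main(string[] args)
        {
            try
            {
                // Stand-in for the actual export entry point.
                var summary = RunExport(args);

                // Surface a custom status so the TeamCity task description shows a useful
                // summary rather than the generic "Success" message.
                Console.WriteLine($"##teamcity[buildStatus status='SUCCESS' text='{Escape(summary)}']");
                return 0;
            }
            catch (Exception ex)
            {
                // An unhandled exception has to become both a build problem and a non-zero
                // exit code, otherwise TeamCity happily reports success.
                Console.WriteLine($"##teamcity[buildProblem description='{Escape(ex.Message)}']");
                return 1;
            }
        }

        // TeamCity service messages need |, quotes, brackets and newlines escaped with |.
        private static string Escape(string value)
        {
            return value
                .Replace("|", "||")
                .Replace("'", "|'")
                .Replace("[", "|[")
                .Replace("]", "|]")
                .Replace("\r", "|r")
                .Replace("\n", "|n");
        }

        private static string RunExport(string[] args)
        {
            // The real export logic lives elsewhere; this is just a placeholder.
            return "exported 0 files";
        }
    }
}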

With the process indicating correctly that a bad thing had happened, the cause of the second problem became obvious.

Timeouts.

Which explained perfectly why the tests had not picked up any issues, as while they run the full application (from as close to the command line as possible), they don’t run a full export, instead leveraging a limiter parameter to avoid the test taking too long.

Entirely my fault really, as I should have at least done a full export at some stage.

Normally I would look at a timeout with intense suspicion, as they typically indicate an inefficient query or operation of some sort. Simply raising the time allowed when timeouts start occurring is often a route to a terrible experience for the end-user, as operations take longer and longer to do what they want them to.

In this case though, it was reasonable that the data export would actually take a chunk of time greater than the paltry 60 seconds applied to command execution in Npgsql by default. Also, the exports that were failing were for the larger data sets (one of which had some joins onto other data sets) and being full exports, could not really make effective use of any indexes for optimisation.
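
For reference, there are two places the Npgsql timeout can be raised, and the sketch below shows both; the connection string, table name and the one hour value are illustrative rather than what we actually use:

using Npgsql;

namespace Solavirum.Exporting
{
    public static class TimeoutExample
    {
        public static void RunLongQuery()
        {
            // "Command Timeout" (in seconds) applies to every command created on the connection.
            var connectionString =
                "Host=replica.example.internal;Database=app;Username=exporter;Command Timeout=3600";

            using (var connection = new NpgsqlConnection(connectionString))
            {
                connection.Open();
                using (var command = new NpgsqlCommand("SELECT * FROM some_large_table", connection))
                {
                    // Or override it for a single command.
                    command.CommandTimeout = 3600;

                    using (var reader = command.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            // Stream each row out to the CSV file here.
                        }
                    }
                }
            }
        }
    }
}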

So I upped the timeouts via the command line parameters and off it went.

Three hours later though, and I was pretty sure something else was wrong.

Socket To Me

Running the same process with the same inputs from my development environment showed the same problem. The process just kept on chugging along, never finishing. Pausing the execution in the development environment showed that at least one thread was stuck waiting eternally for some database related thing to conclude.

The code makes use of parallelisation (each export is its own isolated operation), so my first instinct was that there was some sort of deadlock involving one or more exports.

With the hang appearing to be related to database connectivity, I thought that maybe it was happening in the usage of Npgsql connections, but each export creates its own connection, so that seemed unlikely. There was always the possibility that the problem was related to connection pooling though, which is built into the library and is pretty much static state, but I had disabled that via the connection string, so it shouldn’t have been a factor.

I ripped out all of the parallelisation and ran the process again and it still hung. On closer investigation, it was only one specific export that was hanging, which was weird, because they all use exactly the same code.

Turning towards the PostgreSQL end of the system, I ran the process again, except this time started paying attention to the active connections to the database, the queries they were running, state transitions (i.e. active –> idle) and execution timestamps. This is pretty easy to do using the query:

SELECT * FROM pg_stat_activity

I could clearly see the export that was failing execute its query on a connection, stay active for 15 minutes and then transition back to idle, indicating that the database was essentially done with that operation. On the process side though, it kept chugging along, waiting eternally on the socket it was listening to for data that would apparently never come.

The 15 minute query time was oddly consistent too.

It turns out the query was actually being terminated server side because of replica latency (the export process queries a replica DB), which was set to max out at, wait for it, 15 minutes.

For some reason the stream returned by NpgsqlConnection.BeginTextExport(sql) just never ends when the underlying query is terminated on the server side.

My plan is to put together some information and log an issue in the Npgsql GitHub repository, because I can’t imagine that the behaviour is intended.

Solve For X

With the problem identified, the only question remaining was what to do about it.

I don’t even like that our maximum replica latency is set to 15 minutes, so raising it was pretty much out of the question (and this process is intended to be automated and ongoing, so I would have to raise it permanently).

The only real remaining option was to break down the bigger query into a bunch of smaller queries and then aggregate the results myself.

So that’s exactly what I did.

Luckily for me, the data set had a field that made segmentation easy, though running a bunch of queries and streaming the results into a single CSV file meant that they had to be run sequentially, so no parallelization bonus for me.
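
A rough sketch of the idea, with the segmentation column, table and query as placeholders rather than the real schema:

using System.Collections.Generic;
using System.IO;
using Npgsql;

namespace Solavirum.Exporting
{
    public static class SegmentedExport
    {
        // Runs one COPY per segment and appends everything to a single writer, so the
        // output is still one CSV file with a single header row.
        public static void Run(string connectionString, IEnumerable<int> segments, StreamWriter output)
        {
            var headerWritten = false;

            foreach (var segment in segments)
            {
                using (var connection = new NpgsqlConnection(connectionString))
                {
                    connection.Open();

                    // Only the first segment includes the CSV header.
                    var copySql =
                        "COPY (SELECT * FROM some_large_table WHERE segment_id = " + segment + ") " +
                        "TO STDOUT WITH DELIMITER ',' CSV" + (headerWritten ? "" : " HEADER");

                    using (var reader = connection.BeginTextExport(copySql))
                    {
                        int next;
                        while ((next = reader.Read()) != -1)
                        {
                            output.Write((char)next);
                        }
                    }

                    headerWritten = true;
                }
            }

            output.Flush();
        }
    }
}

The idea being that each individual COPY finishes well inside the fifteen minute replica cutoff.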

Conclusion

This was one of those issues where I really should have had the foresight to see the first problem (timeouts when dealing with large data sets), but the existence of what looks to be a real bug made everything far more difficult than it could have been.

Still, it just goes to show that no matter how confident you are in a process, there is always the risk that when you execute it in reality that it all might go belly up.

Which just reinforces the idea that you should always be running it in reality as soon as you possibly can, and paying attention to what the results are.


In previous posts, I’ve briefly mentioned how one of the most valuable outcomes of the data synchronization algorithm was the capability to gather meaningful business intelligence about our users. This includes information like how often they use the software, what parts they are using and how many of its entities they are creating (and deleting) during their normal daily process.

When used responsibly, this sort of intelligence is invaluable in helping to create a better, more valuable experience for everyone involved. Literally every part of your standard organization (support, sales, development, marketing, finance) can make use of the data in some way.

Of course, when it’s all sequestered inside a PostgreSQL database, the barrier to entry can be quite high. Generally this just means that the people who are best positioned to make use of the data either don’t get access to it (a waste) or they end up constantly bugging the members of the small group of people with the capability to do the analysis.

When I’m a member of that small group, I get suddenly motivated to find an alternate way to give them what they need.

Re-education Campaign

There are really two kinds of analysis that tend to happen for business intelligence purposes.

The first is answering ad-hoc questions. Typically these sorts of questions have not been asked before, or they could be a slight variation on the same theme as a previous question.

Regardless, these generally require someone who actually knows the data and has the skills to wrangle it in order to get to the meat of the issue and provide the answer. This can be taught (with some effort), but not everyone has the time to dedicate to that sort of thing, so new analysis will probably always have to be performed by trained professionals.

The second type of analysis is re-answering a previously answered question, now that time has passed.

This tends to occur quite frequently, especially when it comes to metrics (like “how many customers have X”, “how many Y does each customer have”, etc.), and it is here that there is an opportunity to optimise.

In the long term, there should be an easy platform for people to get at the information they want, without having to understand how to run SQL scripts (or Powershell). This is likely going to come in the form of Metabase, but unfortunately this is probably still a little ways off in the future, and people are asking for data now.

So in the short term, some way to easily run pre-canned queries and get the results as a CSV is desirable.

Copy That

The good news is that PostgreSQL makes it pretty easy to run a query and get CSV output using the COPY command:

COPY ({query}) TO STDOUT WITH DELIMITER ',' CSV HEADER

Executing that command will result in a stream containing the data specified in the query (along with headers), in CSV format.

Of course, it’s kind of useless without a mechanism to easily run it, so if you’re using .NET, you can leverage Npgsql to actually execute the command using the BeginTextExport function on an NpgsqlConnection, like in the following class:

using System.IO;

namespace Solavirum.Exporting
{
    public class PostgresTableStreamer
    {
        private readonly INpgsqlConnectionFactory _npgsqlConnectionFactory;

        public PostgresTableStreamer(INpgsqlConnectionFactory npgsqlConnectionFactory)
        {
            _npgsqlConnectionFactory = npgsqlConnectionFactory;
        }

        public long Stream(string copySql, StreamWriter output)
        {
            var numberOfLinesInFile = 0;
            using (var conn = _npgsqlConnectionFactory.Create())
            {
                conn.Open();
                using (var reader = conn.BeginTextExport(copySql))
                {
                    var i = reader.Read();
                    var insideQuotedField = false;
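                    // Copy the COPY output character by character: carriage returns are dropped,
                    // and newlines that appear inside quoted fields are written as a literal "\n"
                    // so that embedded line breaks don't break the row structure of the file.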

                    while (i != -1)
                    {
                        var c = (char) i;
                        if (c == '"') insideQuotedField = !insideQuotedField;
                        switch (c)
                        {
                            case '\r':
                                i = reader.Read();
                                continue;
                            case '\n':
                                if (insideQuotedField) output.Write("\\n");
                                else
                                {
                                    output.Write(c);
                                    numberOfLinesInFile++;
                                }
                                break;
                            default:
                                output.Write(c);
                                break;
                        }
                        i = reader.Read();
                    }
                    output.Flush();
                }
                conn.Close();
            }
            return numberOfLinesInFile;
        }
    }
}

The only tricksy thing is escaping newline characters so that they don’t explode the formatting of the resulting file.
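
To show how it hangs together, here’s a rough sketch of usage; the connection factory implementation is a guess at what the real INpgsqlConnectionFactory looks like (it isn’t shown here), and the connection string and query are just examples:

using System;
using System.IO;
using Npgsql;

namespace Solavirum.Exporting
{
    // Illustrative only; this assumes the factory just hands back an un-opened NpgsqlConnection.
    public class NpgsqlConnectionFactory : INpgsqlConnectionFactory
    {
        private readonly string _connectionString;

        public NpgsqlConnectionFactory(string connectionString)
        {
            _connectionString = connectionString;
        }

        public NpgsqlConnection Create()
        {
            return new NpgsqlConnection(_connectionString);
        }
    }

    public static class Example
    {
        public static void Main()
        {
            var factory = new NpgsqlConnectionFactory("Host=localhost;Database=app;Username=reporting");
            var streamer = new PostgresTableStreamer(factory);

            var copySql = "COPY (SELECT * FROM customers) TO STDOUT WITH DELIMITER ',' CSV HEADER";

            using (var writer = new StreamWriter("customers.csv"))
            {
                var lines = streamer.Stream(copySql, writer);
                Console.WriteLine($"Wrote {lines} lines");
            }
        }
    }
}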

A Model View

Of course, a C# class is just as useless to a non-technical person as an SQL query, so now we have to think about delivery.

The easiest way to give people access to some piece of functionality is typically a website, so that seems like a good approach.

My skills are somewhat limited when it comes to putting together a website in a hurry. Our previous websites have been React.JS, but they need a bit of a pipeline to get up and running and I didn’t want to invest that much in this prototype.

Eventually, I settled on an ASP.NET Core MVC website, deployed manually to a single Linux instance in AWS.

Nothing too complicated, just a single controller that lists the queries that are available (which are deployed with the code, no custom queries here), and then a way to run a single query and get the result as a file.
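
The controller itself is about as barebones as it sounds; the sketch below is a rough approximation, with the canned query provider being a hypothetical abstraction rather than the real thing:

using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.AspNetCore.Mvc;
using Solavirum.Exporting;

namespace Solavirum.Reporting.Web.Controllers
{
    // Hypothetical abstraction over the pre-canned queries deployed with the site.
    public interface ICannedQueryProvider
    {
        IEnumerable<string> List();
        string GetCopySql(string name);
    }

    public class QueriesController : Controller
    {
        private readonly ICannedQueryProvider _queries;
        private readonly PostgresTableStreamer _streamer;

        public QueriesController(ICannedQueryProvider queries, PostgresTableStreamer streamer)
        {
            _queries = queries;
            _streamer = streamer;
        }

        // Lists the queries that were deployed alongside the code.
        public IActionResult Index()
        {
            return View(_queries.List());
        }

        // Runs a single named query and hands the result back as a CSV download.
        public IActionResult Run(string name)
        {
            var copySql = _queries.GetCopySql(name);
            if (copySql == null)
            {
                return NotFound();
            }

            // Buffer the result in memory; fine for the sizes involved here.
            var buffer = new MemoryStream();
            using (var writer = new StreamWriter(buffer, Encoding.UTF8, 4096, leaveOpen: true))
            {
                _streamer.Stream(copySql, writer);
            }

            buffer.Position = 0;
            return File(buffer, "text/csv", name + ".csv");
        }
    }
}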

Getting ASP.NET Core MVC up and running on a Linux box is pretty straightforward, assuming you use a local Nginx instance as a reverse proxy (which is easy).

As always, because we’re not stupid, a little touch of security is also necessary, so:

  • The box is behind a load balancer, to allow for easy HTTPS.
  • The load balancer is only accessible from the static IP at our main office. If you’re out of the office you can still use the service, but you need to be on the VPN, which is required for a bunch of our other things anyway.

If the website lives for long enough, I’ll probably augment it with Active Directory logins, so people can just use their domain credentials, but short term, it’s secure enough.

Conclusion

That’s basically it.

Start to finish, it all took about 4 hours, and while I wouldn’t say it was my best work, it’s good enough to solve the immediate problem. At least until we get Metabase up and running and we throw the whole thing out anyway.

I’ll probably clean it up a bit (the manual deployment rankled me more than I wanted to admit) and then wash my hands of the whole thing.

More importantly, the number of people asking me to “please run query X and email me the results” has dropped dramatically, so I’ll consider this a success.

Now I have more time for the really important things.

Like participating in the Mario Kart Tournament.


I’ve written a lot of words on this blog about the data synchronization algorithm. Probably too many to be honest, but it’s an interesting technical topic for me, so the words come easily.

Not much has changed since the optimization to the differencing check to stop it from wastefully scanning the entire table; it’s just been quietly chugging along, happily grabbing data whenever clients opt in, and just generally being useful.

As we accumulate more data though, a flaw in the system is becoming more obvious.

Duplicates.

Once Uploaded, Data Lives Forever

I like to think that the data synchronization algorithm is really good at what it does.

Given a connection to a legacy database and some identifying information (i.e. the identity of the client), it will make sure that a copy of the data in that database exists remotely, and then, as the underlying database changes, ensure those changes are also present.

The actual sync algorithm is provided to clients in the form of a plugin for a component that enables services for their entire office, like server side automated backups and integrations with (other) cloud services. All of this hangs on a centralised store of registered databases, which the client is responsible for maintaining. The whole underlying system was built before my time, and while it’s a little rough around the edges, it’s pretty good.

Unfortunately, it does have one major flaw.

When a client registers a database (generally done by supplying a connection string), that database is given a unique identifier.

If the client registers the same physical database again (maybe they moved servers, maybe they lost their settings due to a bug, maybe support personnel think that re-registering databases is like doing a computer reboot), they get a new database identifier.

For the sync process, this means that all of the data gets uploaded again, appearing as another (separate) database belonging to the same client. In most cases the old registration continues to exist, but it probably stops being updated. Sometimes the client is actively uploading data from the same database more than once, but that kind of thing is pretty rare.

For most use cases this sort of thing is mostly just annoying, as the client will select the right database whenever they interact with whatever system is pulling from the data in the cloud (and we generally try to hide databases that look like they would have no value to the customer, like ones that haven’t been actively updated in the last 24 hours).

For business intelligence though, it means a huge amount of useless duplicate data, which has all sorts of negative effects on the generated metrics.

Well, Until We Delete It And Take Its Power

From an engineering point of view, we should fix the root flaw, ensuring that the same physical database is identified correctly whenever it participates in the system.

As always, reality tends to get in the way, and unpicking that particular beast is not a simple task. It’s not off the table completely, it’s just less palatable than it could be.

Even if the flaw is fixed though, the duplicate data that already exists is not going to magically up and disappear out of a respect for our engineering prowess. We’re going to have to deal with it anyway, so we might as well start there.

Algorithmically, a data set (customer_id-database_id pair) can be considered a duplicate of another data set if and only if:

  • The customer_id matches (we ignore duplicates across clients, for now anyway)
  • The data set contains at least 25 GUID identifiers that also appear in the other data set (each entity in the database generally has both a numerical and GUID identifier, so we just use the most common entity)

Nothing particularly complicated or fancy.
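
For illustration, the comparison boils down to something like the following; the class shape, identifier types and the way the GUIDs get loaded are assumptions, the threshold is the real one:

using System;
using System.Collections.Generic;
using System.Linq;

namespace Solavirum.Cleanup
{
    public class DataSet
    {
        public int CustomerId { get; set; }
        public string DatabaseId { get; set; }
        public HashSet<Guid> EntityGuids { get; set; }   // GUIDs of the most common entity in the data set
    }

    public static class DuplicateDetector
    {
        private const int SharedGuidThreshold = 25;

        public static bool IsDuplicateOf(this DataSet candidate, DataSet other)
        {
            // Duplicates are only considered within a single customer.
            if (candidate.CustomerId != other.CustomerId) return false;

            // A data set is obviously not a duplicate of itself.
            if (candidate.DatabaseId == other.DatabaseId) return false;

            // Two registrations of the same physical database share most of their GUID
            // identifiers, so 25 or more in common is enough to call it a duplicate.
            return candidate.EntityGuids.Intersect(other.EntityGuids).Count() >= SharedGuidThreshold;
        }
    }
}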

For the automated process itself, there are a few things worth considering:

  • It needs to communicate clearly what it did and why
  • There is value in separating the analysis of the data sets from the actions that were performed (and their results)
  • We’ll be using TeamCity for actually scheduling and running the process, so we can store a full history of what the process has done over time
  • To minimise risk, it’s useful to be able to tell the process “identify all the duplicates, but only delete the first X”, just in case it tries to delete hundreds and causes terrible performance problems

Taking all of the above into account, we created a simple C# command line application that could be run like this:

Cleanup.exe -c {connection-string} -a {path-to-analysis-file} -d {path-to-deletion-results-file} -limitDeletionCount {number-to-delete}

Like everything we do, it gets built, tested, packaged (versioned), and uploaded to our private NuGet feed. For execution, there is a daily task in TeamCity to download the latest package and run it against our production database.

It’s Like A Less Impressive Quickening

The last thing to do is make sure that we don’t ever delete any data that might still have value, to either us or the client.

As I mentioned above, the main reason that duplicates happen is when a client re-registers the same database for some reason. Upon re-registration, the “new” database will begin its data synchronization from scratch.

During the period of time where data is still uploading for the “new” database, but all the “old” data is still hanging around, how can we reasonably say which data set is the duplicate and should be deleted?

If we go off number of records, we’d almost certainly delete the “new” database mistakenly, which would just start it uploading again from scratch, and we’d get into an infinite stupidity loop.

We need some sort of indication that the data is “recent”, but we can’t use the timestamps on the data itself, because they are just copies from the local database, and oldest data uploads first.

Instead we need to use timing information from when the data set last participated in the sync process, i.e. a recency indicator.

A small modification of the tool later, and its execution looks like this:

Cleanup.exe -c {connection-string} -a {path-to-analysis-file} -d {path-to-deletion-results-file} -limitDeletionCount {number-to-delete} -recencyLimitDays {dont-delete-if-touched-this-recently}

We currently use 7 days as our recency limit, but once we’re more comfortable with the process, we’ll probably tune it down to 1 or 2 days (just to get rid of the data as soon as we can).

Conclusion

To be honest, we’ve known about the duplicate data flaw for a while now, but as I mentioned earlier, it didn’t really affect customers all that much. We’d put some systems in place to allow customers to only select recently synced databases already, so from their point of view, there might be a period where they could see multiple, but that would usually go away relatively quickly.

It wasn’t until we noticed the duplicates seeping into our metrics (which we use to make business decisions!) that we realised we really needed to do something about them, thus the automated cleanup.

A nice side effect of this is that when we did the duplicate analysis, we realised that something like 30% of the entire database was worthless duplicate data, so there might actually be significant performance gains once we get rid of it all, which is always nice.

To be honest, we probably should have just fixed the flaw as soon as we noticed it, but it’s in a component that is not well tested or understood, so there was a significant amount of risk in doing so.

Of course, as is always the case when you make that sort of decision, now we’re paying a different price altogether.

And who can really say which one is more expensive in the end?


Date and Time data structures are always so fun and easy to use, aren’t they?

Nope.

They are, in fact, the devil. No matter how hard I try (and believe me, I’ve tried pretty hard) or what decisions I make, I always find myself facing subtle issues relating to time that go unnoticed until the problem is way harder to fix than it should be. Like say, after your customers have created millions and millions of entities with important timestamps attached to them that aren’t timezone aware.

More recently, we were bitten when we tried to sync fields representing a whole day (i.e. 1 January 2017, no time) through our data synchronization algorithm.

To our surprise, a different day came out of the other end, which was less than ideal.

Dates Are Delicious

During the initial analysis of the tables earmarked to be synced from client databases, we located quite a few fields containing dates with time information. That is, usage of the actual DateTime data structure in SQL Server. As most of us had been bitten in the past by subtle conversion bugs when playing with raw DateTimes in .NET, we made the decision to convert all DateTimes to DateTimeOffsets (i.e. 2017-01-01 00:00+10) at the earliest possible stage in the sync process, using the timezone information on the client server that the sync process was running on.

What we didn’t know was that some of the DateTime fields actually represented whole dates, and they were just represented as that day at midnight because there was no better structure available when they were initially created.

Unfortunately, converting whole days stored as DateTimes into DateTimeOffsets isn’t actually the best idea, because an atomic day’s representation should not change when you move into a different timezone. For example, 1 January 2017 in Brisbane does not magically turn into 31 December 2016 22:00 just because you’re in Western Australia. It’s still 1 January 2017.

This is one of the weird and frustrating things about the difference between whole Dates and DateTimes. Technically speaking, a Date as explained above probably should be location aware, especially as the timezone differences get more extreme. The difference between WA and QLD is pretty academic, but there’s a whole day between the US and Australia. If two users were to calculate something like rental arrears in two different places using whole dates, they would probably get two different numbers, which could lead to poor decisions. Of course, from a user’s point of view, the last thing they would expect is to have one day turn into another, or to add a time representation to something they entered as a day using a calendar selector, so all told, it’s confusing and terrible and I hate it.

If you want to get technical, the converted value still represents the same instant in time, so as long as you know what the original offset was, you can use that to revert back to the original value (which is some day at midnight) without too much trouble, and then interpret it as necessary.

Of course, that’s when PostgreSQL decided to get involved.

Lost Time

A long time ago when we started the data synchronization project, we decided to use PostgreSQL as the remote store. We did this mostly because PostgreSQL was cheaper to run in AWS via RDS (the licensing costs for SQL Server in RDS were kind of nuts in comparison).

In hindsight, this was a terrible decision.

We might have saved raw money on a month-to-month basis, but we exposed ourselves to all sorts of problems inherent to the differences between the two database engines, not to mention the generally poor quality of the PostgreSQL tools, at least in comparison to SQL Server Management Studio.

Returning to the date and time discussion: we chose to use Entity Framework (via Npgsql) as our interface to PostgreSQL, and to be honest we pretty much just trusted it to get the database schema right. All of our DateTimeOffsets got mapped to the PostgreSQL data type timestamp with time zone, which looks like it’s pretty much the same thing.

Except it’s not. It’s not the same at all. It actually loses data when storing a DateTimeOffset, and it does this by design.

In PostgreSQL terms, using a timestamp with time zone column actually means “please automatically adjust the data I insert into this field using the given offset, and store it as UTC”. This makes sense, in a way, because strictly speaking the data still represents the same instant in time, and can be freely adjusted to the user’s current offset as necessary (i.e. show me what the data looks like in +10).

Unfortunately, this approach means that the actual offset the data was inserted with is lost completely.

PostgreSQL has another data type called timestamp without time zone, but all it does is ignore the offset completely, while still stripping it out. Less than useful.

To summarise, here is the chain of events:

  • The user enters some data, representing a whole day: 1 Jan 2017
  • The system stores this data in SQL Server as a DateTime: 1 Jan 2017 00:00
  • Our sync process reads the data as a DateTimeOffset, using the local timezone: 1 Jan 2017 00:00 +10
  • The data is pushed to PostgreSQL and stored: 31 Dec 2016 14:00
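
A tiny sketch makes the loss concrete, assuming a +10:00 local offset (the values just mirror the example above):

using System;

public static class WholeDateExample
{
    public static void Main()
    {
        var stored = new DateTime(2017, 1, 1, 0, 0, 0);                 // 1 Jan 2017 00:00, a "whole day" in SQL Server
        var read = new DateTimeOffset(stored, TimeSpan.FromHours(10));  // 1 Jan 2017 00:00 +10:00, what the sync process produces
        var persisted = read.UtcDateTime;                               // 31 Dec 2016 14:00, what timestamp with time zone keeps

        Console.WriteLine(persisted.Date);                              // 31 Dec 2016: reading it back as a date gives the wrong day
    }
}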

Technically the data still represents the exact same point in time, but its actual meaning is now compromised. If anyone reads the remote data and assumes it’s still just a date, they are now a full day off, which is terrible.

Daylight savings doesn’t help either, because now the offset is inconsistent, so in NSW sometimes you will see the date as the previous day at 1400 and sometimes at 1300.

I mean honestly, daylight savings doesn’t really help anyone anyway, but that’s a different story.

That Approach Is Sooooo Dated

For whole dates, the only realistic solution is to treat them exactly as they should be treated, as dates with no time component.

Completely obvious in retrospect.

Both SQL Server and PostgreSQL have a Date data type which does exactly what it says on the box and leaves no room for misinterpretation or weird conversion errors.

Of course, .NET still only has the good old DateTime and DateTimeOffset data types, so there is room for shenanigans there, but at least the storage on either end would be correct.
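
If you’re using Entity Framework code-first, forcing a whole-date property down to an actual date column is a one-liner in the fluent configuration; this is a sketch with made-up entity and property names, assuming EF6:

using System;
using System.Data.Entity;

namespace Solavirum.Sync
{
    public class Tenancy
    {
        public int Id { get; set; }
        public DateTime LeaseStartDate { get; set; }   // a whole day; the time component is meaningless
    }

    public class ExampleContext : DbContext
    {
        public DbSet<Tenancy> Tenancies { get; set; }

        protected override void OnModelCreating(DbModelBuilder modelBuilder)
        {
            // "date" exists on both SQL Server and PostgreSQL, so no offset conversion
            // can sneak in at either end of the sync process.
            modelBuilder.Entity<Tenancy>()
                .Property(t => t.LeaseStartDate)
                .HasColumnType("date");
        }
    }
}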

For dates that actually do come with a time component (like an appointment), you have to really understand whether or not the time should be able to be interpreted in another timezone. Taking the appointment example, it might be reasonable to think that a company using your software product to track their appointments might exist in both NSW and QLD. The person executing the appointment would want to schedule it in their local time, but a remote administration person might want to know what time the appointment was in their own local time so that they know when not to transfer calls.

SQL Server is fine in this respect, because a DateTimeOffset is perfect, freely able to be converted between one timezone and another with no room for misinterpretation.

In PostgreSQL, the timestamp with time zone data type might be good enough, assuming you don’t need to know with absolute certainty what the original offset was (and thus the time in the original creator’s context). If you do need to know that (maybe for business intelligence or analytics) you either need to know some information about the creator, or you should probably just use a plain timestamp data type, convert it to UTC yourself and store the original offset separately.

Conclusion

Representing time is hard in software. I’ve heard some people say this is because the fundamentally explicit structures that we have to have in software are simply unable to represent such a fluid and mutable construct, and I’m not sure if I disagree.

For the case I outlined above, we made some bad decisions with good goals (let’s use DateTimeOffset, it’s impossible to misinterpret!) that built on top of other people’s bad decisions, but at least we understand the problem well enough now to potentially fix it.

Unfortunately, it’s unlikely that the concept of dates and times in software is just going to go away, as it seems pretty important.

The best we can hope for is probably just for existence to end.

Someone would probably want to know when existence is going to end taking daylight savings into account.