Downtime Postmortem


On the 2nd of March, 2020 we had a period of downtime.

The downtime started at 4:29 PM AEST and ended at 7:27 PM AEST, lasting just under three hours.

The cause is simple enough: our data host ran into problems after upgrading part of their stack, and every attempt to access data failed with either an Unauthorized error or a Server Fault error.

We contacted them quickly, but they were already working on a fix.

Unfortunately, implementing that fix took some time.

We can't currently prevent this from happening again, as we rely on the data host to provide our data. Falling back to a second host is an option for the future, but we're not yet at the point where we can afford to store terabytes of data across multiple hosts.

We don't expect another failure any time soon, and the host does have a good reputation for stability. This has been the worst outage in several years.

