System outage of all services due to Datacentre issue at service provider

Updates

Resolved
Friday, December 24, 2021 at 6:29:23 PM
The issue was resolved several hours ago. A detailed report of what happened during the overnight outage follows.

Post-mortem report on overnight incident

Incident start time: 1:22 AM Pacific Time
Incident end time: 4:28 AM Pacific Time
Total incident duration: 3 hours 6 minutes
- There was a cooling system issue in the data centre in Montreal where my servers reside. To prevent hardware failure, the provider (OVH) proactively shut down their systems until the cooling issue was fixed.
- The cause was a burst pipe that could no longer deliver water to their cooling system, which took them a couple of hours to repair. Once it was repaired, they had to wait for temperatures to return to normal levels before they could safely power all the hardware back on.

The monitoring solutions I had in place alerted me to the incident immediately. I then investigated and identified the issue above through OVH's status page; their incident details are at https://public-cloud.status-ovhcloud.com/incidents/hh3bm7xxt1s9.

After the servers came back online at 4:28 AM Pacific Time, a second issue was reported to me a couple of hours later: email archives were missing. I investigated immediately and found that when the servers came back online at OVH, they had not recognized the external block storage where the emails are stored (a triple-replicated external storage device). Thankfully, the fix was simple: restarting the server reinitialized the connection to the block storage and restored the email archives. Everyone should have access to all of their emails again. This second issue was resolved at 7:16 AM.
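A check along the following lines could flag detached storage right after a reboot, before anyone notices missing archives. This is only a minimal sketch: the mount path `/mnt/email-archive` and the helper name `missing_mounts` are hypothetical, since this report does not describe how the block storage is actually attached.

```python
import os

# Hypothetical mount point; the real path is not given in this report.
ARCHIVE_MOUNT = "/mnt/email-archive"


def missing_mounts(paths, is_mount=os.path.ismount):
    """Return the paths that are not currently mount points.

    is_mount is injectable so the check can be exercised without
    real mounts; by default it uses os.path.ismount.
    """
    return [path for path in paths if not is_mount(path)]


if __name__ == "__main__":
    missing = missing_mounts([ARCHIVE_MOUNT])
    if missing:
        # In practice this would feed an alert rather than a print.
        print("Block storage not mounted:", ", ".join(missing))
    else:
        print("All expected block storage is mounted.")
```

Run at boot or on a schedule, a check like this would report the archive path as missing until the connection to the block storage is re-established.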

Because this incident occurred in the middle of the night, most people likely saw no impact. However, in keeping with my commitment to full transparency, I wanted everyone to know what happened, as it led to a prolonged outage between 1:22 AM and 4:28 AM.
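For anyone who wants to cross-check the times quoted in this report, the durations can be recomputed with a short script. This is a sketch that assumes all three times fall on December 24, 2021 in Pacific Time; the variable and function names are illustrative only.

```python
from datetime import datetime

# Times quoted in this report (Pacific Time, December 24, 2021).
outage_start = datetime(2021, 12, 24, 1, 22)
outage_end = datetime(2021, 12, 24, 4, 28)
archive_fix = datetime(2021, 12, 24, 7, 16)


def format_duration(delta):
    """Format a timedelta as 'H hours M minutes'."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


# Length of the outage itself.
outage_duration = format_duration(outage_end - outage_start)
# Time between servers returning and the email archives being restored.
archive_gap = format_duration(archive_fix - outage_end)

print("Outage duration:", outage_duration)
print("Servers online to archives restored:", archive_gap)
```

The first figure matches the total incident duration stated above; the second is the additional time during which email archives were unavailable.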

I appreciate your understanding and patience. If you have any questions or concerns, please don't hesitate to reach out to me.
Monitoring
Friday, December 24, 2021 at 12:34:50 PM
The incident appears to be resolved and services are operational again. However, there may be a period of instability over the next short while as the OVH data centre comes back online. I’ll continue to monitor until I’m satisfied that it won’t cause any more issues, but in the meantime everything should be working. Thank you all for your patience while the data centre issue was resolved.
Update
Friday, December 24, 2021 at 12:21:23 PM
The latest status from OVH (https://public-cloud.status-ovhcloud.com/incidents/hh3bm7xxt1s9) suggests that systems are starting to come back up, but I haven’t seen that for my servers yet. I’m continuing to monitor OVH’s status page for updates; hopefully this issue won’t last much longer now that they report things are slowly coming online. Stay tuned.
Identified
Friday, December 24, 2021 at 11:42:15 AM
There was a major data centre cooling issue a couple of hours ago, which forced the provider to shut down many of their systems.

The cooling pipe was repaired just 20 minutes ago and services should slowly start coming back online, but it may be a little while yet. No ETA at this time.

More details on the incident at the data centre can be tracked on OVH's own status page: https://network.status-ovhcloud.com/incidents/fl05qs8nzwjn

I apologize for the inconvenience and will provide further updates as soon as possible.

Hosting Services by Dustin Dauncey (d19) - System outage of all services due to Datacentre issue at service provider – Incident details

