(As of the 6th of May, the issue was fully resolved by DigitalOcean.)
On the 5th of May, our systems reported an unusually high number of errors with DigitalOcean snapshot backups. We could also see that the number of error emails we were sending to customers had increased.
We normally show customers the error that DigitalOcean gives us when a backup attempt fails. However, the email didn't explain what a 'Server Error' was, as it's not something we normally see. Some customers were rightly worried and sent us support emails: did it mean their server was broken? In fact, it was a DigitalOcean API error message, and everyone's servers were online and running fine. We will need to improve this for customers in the future.
Reporting the Issue to DigitalOcean
We replied to all customers who emailed us about this error, letting them know that the backup had in fact failed, that their droplets were still online, and that we had detected a high number of errors and were in the process of raising an issue.
From experience, I have found that the best way to resolve an issue quickly with DigitalOcean is to write a very good, comprehensive initial report. I provided error rates for the last 30 days, which showed them that the error rate had clearly increased.
I provided records of all the droplet IDs and the times at which API requests had failed. We also keep full logs of every DigitalOcean request we make (for this very reason), so I gave them a good sample of headers and responses from errors during the day. I also did a deep dive on a single customer's droplet that had hourly backups.
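For anyone wanting to put together a similar report, here is a minimal sketch of how daily error rates could be worked out from request logs. It is not our production code: the function name, the record shape (an ISO timestamp plus an HTTP status code) and the choice to count any 5xx status as an error are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def daily_error_rates(records):
    """Return {date: error_rate} for an iterable of (iso_timestamp, status) pairs.

    Assumption: a response counts as an error when its status is 500 or above.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for timestamp, status in records:
        day = datetime.fromisoformat(timestamp).date()
        totals[day] += 1
        if status >= 500:
            errors[day] += 1
    # Sorted by day so the rates read as a timeline in the report.
    return {day: errors[day] / totals[day] for day in sorted(totals)}
```

Feeding this the last 30 days of logged requests gives one rate per day, which makes a sudden jump easy to point at in a support ticket.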
Straight away, Karnik Modi from the customer success team got back to me. After a quick back and forth, he confirmed there was an issue and that it had been raised with the right team. The right developers were paged to come in and address the issue (or, I guess with Covid, to the home office!). The issue was raised as an investigation on the incident status page. At the same time, I also started a Twitter thread on @SnapShooterio.
At this point it was getting very late for me, and I clocked off for the night. When I was woken up by my six-month-old at 4am, I checked my phone and emails and saw that the DigitalOcean status page had updates: it seems that within two hours the issue had been found and fixed. When I woke up fully at 6am, the issue was fully resolved.
SnapShooter continued to monitor the issue from its side for the rest of the day, and our data confirmed that from the moment DigitalOcean said the issue was fixed, the number of 'Server Error' backups went to zero. Case closed!
Emailing All Affected Users
While all customers with affected backups would have been informed via our email and Slack notification system, I wanted to be very clear about what had happened.
I wrote a quick query to find all users who had received the error, and sent them this email:
In the last 24 hours you have had SnapShooter backups fail (if you had email notifications on, you would have received one saying 'Server Error'). DigitalOcean had an issue with API processing yesterday, and we received an abnormally high number of errors (1-2%) within a 12h window with backup jobs across all our customers. We worked with the DO team to raise and resolve the issue.
The matter seems to be fully resolved now, but we are continually monitoring the situation from our side.
Within 20 minutes of sending the email, support received a flood of replies from customers thanking us for looking after their backups and keeping them in the loop.
We felt we handled the situation well. Having backups fail to process is never great news and not something we wish to see, but it was great that all customers received error notifications in real time.
We are going to make further improvements to our error rate system so that it flags to the SnapShooter team sooner when an issue is happening. We have also penciled an improved status system for customers into the roadmap.
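To give a rough idea of the kind of check this could involve, here is a small sketch of a rolling error rate monitor. It is only an illustration, not what we run: the class name, the window size, the threshold and the minimum sample count are all placeholder assumptions.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the most recent backup outcomes and flag an elevated failure rate."""

    def __init__(self, window=200, threshold=0.01, min_samples=50):
        # deque(maxlen=...) silently drops the oldest outcome once full.
        self.results = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, failed):
        """Record one outcome; return True if the team should be alerted."""
        self.results.append(bool(failed))
        if len(self.results) < self.min_samples:
            # Too little data to judge, so stay quiet.
            return False
        return sum(self.results) / len(self.results) > self.threshold
```

Each finished backup job would call `record(...)`, and a `True` result would notify the team. The minimum sample count is there so that a single failure right after startup does not trigger an alert.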
Notes: While debugging an issue with DigitalOcean three years ago, we decided that we should keep full logs of all the requests we make and the responses we receive. We settled on storing the header information (response payloads can be huge). This was a very good business decision, and I encourage other businesses who have built products on APIs to do the same. It has become especially helpful now that we manage a large percentage of the snapshots made in DigitalOcean.
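If you want to try something similar, here is a minimal sketch of a header-only log entry. The function name and field names are made up for this example, and redacting the Authorization header is my own assumption about what you would want to keep out of logs, not a description of our system.

```python
import json
import time


def build_log_entry(method, url, status, request_headers, response_headers):
    """Return a JSON log line that stores headers only, never payloads."""
    # Assumption: API tokens travel in the Authorization header and must not be logged.
    safe_request_headers = {
        name: "[REDACTED]" if name.lower() == "authorization" else value
        for name, value in request_headers.items()
    }
    return json.dumps({
        "time": time.time(),
        "method": method,
        "url": url,
        "status": status,
        "request_headers": safe_request_headers,
        "response_headers": dict(response_headers),
    })
```

Writing one such line per API call keeps the logs small while still leaving enough detail (status codes, request IDs, rate limit headers and so on) to hand over to a provider's support team.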