Incident Report for BugHerd
An update regarding the recent outages:

All systems have been restored to working order. Please accept our apologies for the downtime. This was the first (and hopefully last) such outage we've had at BugHerd in 6 years. We pride ourselves on our service and support, and as a team we're extremely disappointed with these events.

Over the weekend our engineering team was alerted to intermittent performance issues. After diagnosis we found that our database server was suffering from intermittent iowait issues. We took steps to address the problem and, as performance returned to normal, believed we had resolved the issue. At this stage, we did not yet suspect a hardware failure to be the likely culprit.
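
For readers unfamiliar with iowait: it is the share of CPU time spent idle while waiting on disk I/O. On Linux it can be estimated from two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below is illustrative only and is not the tooling used during this incident. The function names are hypothetical, and it assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def parse_cpu_line(line):
    """Return the first eight counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # user, nice, system, idle, iowait, irq, softirq, steal
    return [int(value) for value in fields[1:9]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # iowait is the fifth counter (index 4)
    return 100.0 * deltas[4] / total
```

In practice the two samples would be read from `/proc/stat` a few seconds apart. A value that stays high while load is modest suggests that storage, rather than CPU, is the bottleneck.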

As traffic increased again after the weekend, we saw the issue return and began focusing again on finding the cause. The slow response time resulted in intermittent outages throughout the day. We took the server offline again for a short period to conduct some database maintenance. Shortly after we brought the server back online, we started seeing WAL errors and immediately took the server offline, this time to prevent data loss. We then took additional backups and checked them for data integrity.

The normal process at this stage would be to quickly switch over to a database follower, but we were unable to do this successfully (for reasons we are still discussing with our provider). At this point our primary concern switched from getting back online ASAP to ensuring customer data had not been lost. We then restored from the most recent backups. The restore was successful, but we still saw the same errors. At this stage we began to suspect that hardware was to blame.

At this point we contacted our provider for help diagnosing and confirming hardware as the cause. Once they had confirmed it, we began migrating our data to a new server. Given the size of our database, this unfortunately took much longer than we'd like and resulted in a much longer downtime than we'd normally deem acceptable.

We're now in the process of reviewing our processes and procedures. Whilst we're satisfied that our policies have meant that our customers have suffered no data loss, the length of the downtime was far from acceptable, and for that we're extremely sorry. If you were affected by this outage, we would like to apply a one week credit to your account. Please contact support@bugherd.com for details.
Posted Nov 30, 2016 - 01:18 UTC
Services have been restored. We'll have more details about the problem/solution shortly.
Posted Nov 30, 2016 - 00:25 UTC
Our provider is still experiencing problems. We're again investigating.
Posted Nov 29, 2016 - 23:06 UTC
All systems have been restored and we're now actively monitoring. More details to follow.
Posted Nov 29, 2016 - 23:00 UTC
We are currently working with our service provider to restore our failed servers. We don't currently have an ETA. Earlier today our database system suffered a major failure. We do maintain regular backups and no customer data has been lost. However, as the cause of the failure is still unknown at this stage, and to prevent any potential data loss, we are leaving servers in maintenance mode until backups have been restored, the integrity of the data has been validated and we have confirmed the root cause with our provider. We are very sorry for the inconvenience this causes.
Posted Nov 29, 2016 - 14:40 UTC
We're still working hard to resolve the downtime. We're in the process of deploying new servers and migrating data.
Posted Nov 29, 2016 - 13:44 UTC
The issue has been identified and a fix is being implemented.
Posted Nov 29, 2016 - 10:22 UTC
We are again seeing increased error rates for requests. We're investigating the cause.
Posted Nov 29, 2016 - 09:08 UTC
Database is back up, error rates are down and the app is more responsive.
We're still monitoring closely.
Posted Nov 29, 2016 - 02:38 UTC
BugHerd will be down for up to 15 minutes for database maintenance.
Posted Nov 28, 2016 - 23:07 UTC
We're still investigating the degraded performance. We'll update you soon.
Posted Nov 28, 2016 - 21:25 UTC
We're seeing increased error rates for requests. We're investigating the cause.
Posted Nov 28, 2016 - 09:23 UTC