An update regarding the recent outages:
All systems have been restored to working order. Please accept our apologies for the downtime. This was the first (and hopefully last) such outage we've had at BugHerd in 6 years. We pride ourselves on our service and support, and as a team we're extremely disappointed with these events.
Over the weekend our engineering team was alerted to intermittent performance issues. After diagnosis we found that our database server was suffering from intermittent iowait issues. We took steps to address the problem and, as performance returned to normal, thought we'd resolved the issue. At this stage, we did not yet suspect a hardware failure to be the likely culprit.
As traffic increased again after the weekend, we saw the issue return and began focusing again on finding the cause. The slow response times resulted in intermittent outages throughout the day. We took the server offline again for a short period to conduct some database maintenance. Shortly after we brought the server back online, we started seeing WAL errors and immediately took the server offline, this time to prevent data loss. We immediately took additional backups and checked them for data integrity.
The normal process at this stage would be to quickly switch over to a database follower, but we were unable to do this successfully (for reasons we are still discussing with our provider). At this point our primary concern switched from getting back online ASAP to ensuring customer data had not been lost. We then restored from the most recent backups. The restore was successful, but we still saw the same errors. At this stage we began to suspect that hardware was to blame.
At this point we contacted our provider for help diagnosing and confirming hardware as the cause. Once they had confirmed it, we began migrating our data to a new server. Given the size of our database, this unfortunately took much longer than we would have liked and resulted in a much longer downtime than we'd normally deem acceptable.
We're now in the process of reviewing our processes and procedures. Whilst we're satisfied that our policies have meant that our customers have suffered no data loss, the length of the downtime was far from acceptable, and for that we're extremely sorry. If you were affected by this outage, we would like to apply a one-week credit to your account. Please contact support@bugherd.com for details.