#1420 - Incident Review
This post summarizes Incident #1420, a period of authentication issues with the (gs) Grid-Service.
Earlier today, the AccountCenter became unavailable for approximately 15 minutes due to MySQL replication errors. Soon after, we began receiving reports of failed email and FTP authentication from customers on various Clusters. Our investigation determined that a portion of the account authentication servers used by each (gs) Grid-Service Cluster were out of sync. These servers are replicated database slaves, which are normally self-healing. Replication to them is the process by which all new password changes are stored and synced across our multi-node, clustered (gs) Grid-Service platform.
(mt) Engineers identified the source of this issue and made the appropriate corrections to restore functionality to these servers.
- Date/Time: The issue started at approximately 3:15 PM PDT on Tuesday, July 27, 2010 and was resolved by 6:30 PM PDT. Service impact was variable across the (gs) Grid-Service during this time.
- Symptoms: Customers creating or modifying email addresses or updating FTP/SSH passwords may have experienced authentication errors.
- Impact: All (gs) Grid-Service Clusters were affected. Email was not lost during this time.
- Root Cause and Takeaways: Although our investigation is ongoing, we have identified a point where the binary logs required for replication were corrupted. Going forward, we are looking into system changes that would help prevent this issue from recurring. We will also be looking into making our replication repair utilities more efficient, which would allow us to repair replication services for all (gs) Grid-Service Clusters simultaneously.
This concludes this System Incident. If you feel that you are still experiencing the symptoms outlined in this post, please open a support request from the (mt) AccountCenter.