Over the past few days you may have noticed that we’ve been experiencing intermittent service issues. I want to shed a little light on what has been going on, as well as what we’re doing to mitigate future downtime in the short and long term.
On the evening of February 19th, our servers began experiencing significant packet loss, which caused increased response times and random server crashes and restarts. Our hosting provider switched to a backup upstream provider, which resolved the majority of the issues.
The next day, February 20th, we experienced a similar degree of packet loss. Our hosting provider switched to the backup upstream provider again, which mitigated most of the problems. After some investigation, the packet loss was attributed to a large-scale distributed denial-of-service (DDoS) attack in the Seattle area that affected many different providers.
This brings us to the downtime throughout the day today, the 21st. This one was our fault. Our database came under unusually heavy load this morning, and queries against a very large table brought the server down. No data was lost as a result of this outage.
So what are we doing to fix these issues?
For today’s problems on the 21st, we’ve cleaned up that huge table. This immediately made responses faster (imagine that). We’re putting checks in place to prevent that table from getting too big in the future. We’re also in the process of splitting apps across different servers. In the past, running 30 or so apps on each server, all using the same database server, has been fine, but we’re rapidly reaching the point where this no longer scales with the number of requests we receive each day. We consider this a good problem!
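To give a sense of the kind of check we mean (this is a minimal sketch, not our actual monitoring code — it assumes a PostgreSQL database and the psycopg2 driver, and the table name, size ceiling, and connection string are hypothetical placeholders):

```python
# Minimal sketch of a table-size check. Assumes PostgreSQL + psycopg2;
# the table name, ceiling, and connection string below are hypothetical.
import psycopg2

SIZE_LIMIT_BYTES = 10 * 1024**3  # hypothetical 10 GB ceiling


def table_size_bytes(conn, table_name):
    """Return the on-disk size of a table (data plus indexes) in bytes."""
    with conn.cursor() as cur:
        cur.execute("SELECT pg_total_relation_size(%s)", (table_name,))
        return cur.fetchone()[0]


if __name__ == "__main__":
    conn = psycopg2.connect("dbname=app")        # hypothetical database
    size = table_size_bytes(conn, "page_views")  # hypothetical table
    if size > SIZE_LIMIT_BYTES:
        # A real check would page on-call and/or archive old rows.
        print(f"WARNING: page_views is {size / 1024**3:.1f} GB")
```

A check like this can run on a schedule, so a runaway table gets flagged long before it’s big enough to take a server down.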
We also want to ensure that courses (paid and free) keep working, where possible, during outages like this. We’re drawing up plans to make that happen.
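We haven’t settled on an approach yet, but to give a sense of the shape it could take, here’s a minimal sketch: serve the last cached copy of a course when the database is unreachable. The cache layer and fetch function here are hypothetical stand-ins, not our implementation.

```python
# Sketch of a read-through cache fallback for course content. The cache
# path and fetch function are hypothetical stand-ins.
import shelve

CACHE_PATH = "course_cache"  # hypothetical on-disk cache


def get_course(course_id, fetch_from_db):
    """Return course content, falling back to the last cached copy."""
    with shelve.open(CACHE_PATH) as cache:
        try:
            course = fetch_from_db(course_id)
            cache[str(course_id)] = course  # refresh cache on success
            return course
        except ConnectionError:
            # Database is down: serve the last good copy, if we have one.
            return cache.get(str(course_id))
```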
The packet loss problem is a bit more complex. For a long time we’ve wanted to make the deployment process for our entire infrastructure more portable. The goal is to be able to push a button and temporarily move our entire infrastructure to a new data center and/or host if our current host is experiencing persistent problems. We’ve been putting together plans for a while now, but it hadn’t been a high priority, until now. We’re shuffling people around to work on this. It will allow us to recover much faster from the kind of longer-duration outages that upstream attacks can cause.
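Here’s a rough, runnable outline of what that button might drive. Every step below is a stub that just logs what the real tooling would do — the host APIs, backup handling, and DNS provider are all still being decided:

```python
# Outline of a "push a button" failover. Each step is a stub standing in
# for real provisioning, deployment, and DNS tooling we haven't chosen yet.
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("failover")


def provision_servers(datacenter, count):
    log.info("provision %d servers in %s", count, datacenter)
    return [f"{datacenter}-app-{i}" for i in range(count)]


def restore_database(host):
    log.info("restore latest database backup onto %s", host)


def deploy_apps(hosts):
    for host in hosts:
        log.info("deploy application code to %s", host)


def update_dns(hosts):
    log.info("point DNS at %s", ", ".join(hosts))


def failover(datacenter, count=4):
    hosts = provision_servers(datacenter, count)
    restore_database(hosts[0])
    deploy_apps(hosts)
    update_dns(hosts)


if __name__ == "__main__":
    failover("backup-dc")  # hypothetical data-center identifier
```

The point of scripting the whole sequence, rather than documenting it as a runbook, is that moving hosts stops being a multi-day scramble and becomes something we can rehearse.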