Collective2 experienced system downtime between 1pm and 6pm ET on Friday, May 28.
Many people have written to me to ask what happened. The root cause of the Friday outage was that our database outgrew the size constraints we set when we started the company seven years ago. For those technically minded, we created one of our critical database tables using 32-bit unsigned integers as keys, which meant that the table could store at most 4,294,967,295 records. When we started C2, that number seemed large enough to meet any possible contingency. What we didn’t anticipate was how much the site would grow, and how heavily it would be used.
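For those who like to see the arithmetic, here is a minimal sketch of where that limit comes from. It is an illustration only, not our production code: the function names `max_unsigned` and `next_key` are hypothetical, and the 64-bit figure is shown purely for comparison.

```python
def max_unsigned(bits: int) -> int:
    """Largest value an unsigned integer of the given width can hold."""
    return 2**bits - 1


# A 32-bit unsigned key tops out at the figure quoted above.
UINT32_MAX = max_unsigned(32)
# A 64-bit unsigned key, shown only for comparison.
UINT64_MAX = max_unsigned(64)


def next_key(current: int, bits: int = 32) -> int:
    """Hypothetical key allocator: return the next key, or fail when the key space is full."""
    if current >= max_unsigned(bits):
        raise OverflowError(f"{bits}-bit key space exhausted at {current:,}")
    return current + 1


if __name__ == "__main__":
    print(f"32-bit limit: {UINT32_MAX:,}")  # 4,294,967,295
    print(f"64-bit limit: {UINT64_MAX:,}")  # 18,446,744,073,709,551,615
```

Once the highest key has been handed out, no further record can be given a new one, which is the wall the table ran into.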
When the database table outgrew the size constraints, it crashed. We took a series of steps to increase the size of the tables and get the site back up and running. Unfortunately, those steps led to a complete corruption of our database.
We’re serious about taking precautions and about keeping multiple redundant backups at C2. In fact, we have two replication slave databases running at all times, constantly making live copies of the database (each backup copy is in a different data center, in a different city). Unfortunately, the changes we made to support the larger file size replicated out to the slave databases, corrupting them too. So, even though we had more than one backup copy, we managed to corrupt them all.
This meant that we needed to resort to our next level of redundancy: we restored the corrupted tables from yet another offsite backup server (our fourth backup, if you are keeping count). This backup does not get updated on a live, dynamic basis, but rather is composed of a series of database snapshots taken over time.
We used these snapshots to restore the corrupted database tables, and the corrupted replication slaves. Ta da. Five hours of downtime later, we were back in business.
The purpose of this message is not to make excuses about our outage on Friday. There are no excuses. Our customers expect 100% uptime, and we strive to deliver it. We failed you on Friday. But I do want to reassure you that Collective2 takes many steps, and pays a lot of money, to have multiple rings of redundancy and safety for our customers and for our data. We understand that you place your trust in us, and we try very hard to meet your expectations.
We hope never to experience such a sustained period of downtime again.
I apologize for the problems you experienced on Friday.
Sincerely,
Matthew Klein
Founder
Collective2