Lessons learnt from the Facebook outage

– Some thoughts from our CTO Anurag Jain

Be it business or personal, social platforms are an integral part of our lives. When the outage started, my first reaction was to restart the app a few times, restart my router, and even check with a few friends and colleagues. After a bit of panic about how clients and friends would reach us, I relaxed, for reasons mentioned at the end below.

There are important lessons to be learnt, and while this is not a complete list, here are a few of the top ones:

  1. Many online services offer social media logins, which would have been affected. The more dependencies a service has, the greater its risk of failure. Hence, for critical services, a direct login might be the better option, even though it means giving up the comfort of an easy login and having yet another username/password to remember.
  2. Have a plan for the failure of services. This is called business continuity planning, and it is a very important part of security: a business needs to know what its failure points are and what can be done to compensate for them. If your hosting provider suddenly shuts down permanently, do you have remote backups? If customers have trouble logging in, can they still use the service via alternative means?
  3. One of the problems Facebook faced was how tightly integrated its services were, so much so that the outage caused the key people who could fix the issue to be locked out. Things will go wrong; what matters is how fast you can restore them. Hence, segment your infrastructure so that disasters don’t spread across your entire business.
  4. When you make a change or upgrade, especially one that is not purely aesthetic but plays a critical role in functionality and operations, ensure you have planned, tested, executed, and monitored it well.
  5. A roll-back plan is equally important. Your changes might not function, or in some cases may not even be liked, so a fast roll-back to the previous state is critical.
  6. It’s easy to offer insight into these incidents after they have happened and to say what should have been done. But these insights are important for gaining experience, and for fixing and preparing for a similar issue next time. Prepare an incident report for every medium to critical issue.
  7. And the most important of all: relax. Things do go wrong, and one should accept it. If giants like Facebook and Google can have critical problems, then surely your business or personal side can take a hit once in a blue moon. “Mistakes happen when you are trying to do something”. It means you are progressing: 10 steps forward with 1 step back is still moving forward. In a time of crisis, a cool mind will work wonders.
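
For readers who like to see ideas in code, points 4 and 5 can be sketched roughly as follows. This is only an illustration, not how Facebook or any particular tool does it: the function name `apply_with_rollback` and its three callbacks (`apply_change`, `roll_back`, `is_healthy`) are hypothetical names made up for this example, and a real deployment would use whatever checks and roll-back mechanism your own platform provides.

```python
from typing import Callable


def apply_with_rollback(
    apply_change: Callable[[], None],
    roll_back: Callable[[], None],
    is_healthy: Callable[[], bool],
    checks: int = 3,
) -> bool:
    """Apply a change, monitor it, and roll back if anything looks wrong.

    All three callbacks are hypothetical placeholders supplied by the caller.
    Returns True if the change was kept, False if it was rolled back.
    """
    try:
        apply_change()          # execute the planned, tested change
    except Exception:
        roll_back()             # the change itself failed: restore the old state
        return False

    for _ in range(checks):     # monitor: run the health check a few times
        if not is_healthy():
            roll_back()         # unhealthy after the change: restore the old state
            return False
    return True


# Tiny demonstration using an in-memory "configuration".
config = {"version": 1}


def upgrade() -> None:
    config["version"] = 2


def restore() -> None:
    config["version"] = 1


# Pretend the health check fails for anything other than version 1.
kept = apply_with_rollback(upgrade, restore, lambda: config["version"] == 1)
```

The point of the sketch is that the roll-back path is written and exercised before the change goes live, rather than improvised during the incident.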

In short:

  1. Reduce dependencies.
  2. Have a business continuity plan.
  3. Segment your infrastructure.
  4. Plan/test/execute/monitor.
  5. Roll-back option.
  6. Prepare incident reports.
  7. Relax. Not the end of the world.