Managing & Identifying Risks of 3rd Party APIs

Introduction

The explosion in 3rd party services to help developers is fantastic! We no longer need to set up our own mail servers when we can use a service like SendGrid or MailGun, and we don't need our own server to host a web app when we can just deploy it to Heroku, AppEngine or Parse.

The Risk

While this is great as it lets us spend a minimal amount of time on creating infrastructure and just get to the business of writing our software, it does introduce several risks, such as leaving us open to the outage or shutdown of the API service.

Parse recently announced (January 2016) that it will be closing down its service. Fortunately, they have been kind enough not only to provide a year of advance notice but also to open source their service! Despite bugs in the open source version, they deserve respect for taking the time and making the effort.

The original version of this article (April 2013) was prompted by Yahoo! once again shutting down another of their services with a very small window of warning: the "Upcoming API" was being shut down with 11 days' notice!

Risk Management

I'll focus here on what is called residual risk: the risk that remains after you have done what you reasonably can to reduce it. While this is nowhere near being a comprehensive post on risk management, it will give you an idea of how you can help protect yourself.

Risk Register

The first thing we need to do is identify all our risks using something like a five box risk matrix. Essentially this weighs the likelihood of a risk occurring against the consequence of it occurring. This gives us a much better idea of where to focus our plans and in how much detail.

Since I'm talking about third party API risks, you may consider the outage of an email service to have a lower consequence (getting more support calls) than the outage of a payment gateway service (loss of income). You could further break down your risks to include:

  • Outages
  • Shut down (I'm looking at you, Yahoo!)
  • Bankruptcy
  • Acquisition
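
To make the idea of weighing likelihood against consequence concrete, here is a minimal sketch of a risk register. The 1 to 5 scales, the multiplication used for scoring, and the ratings themselves are illustrative assumptions, not figures from any real assessment; use whatever scale your own matrix defines.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int   # assumed scale: 1 (rare) to 5 (almost certain)
    consequence: int  # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        # One common convention: score = likelihood x consequence
        return self.likelihood * self.consequence


def rank(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Made-up ratings, purely for illustration
register = [
    Risk("Email provider outage", likelihood=3, consequence=2),
    Risk("Payment gateway outage", likelihood=2, consequence=5),
    Risk("Email provider shut down", likelihood=1, consequence=3),
]

for risk in rank(register):
    print(f"{risk.score:>2}  {risk.name}")
```

With these ratings the payment gateway outage lands at the top of the list, which matches the intuition that loss of income deserves a more detailed plan than a spike in support calls.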

Planning - Mitigation and Response

The second thing to do for each risk you identify is to create two plans:

  • The Mitigation Plan
  • The Response Plan

Mitigation Plan

The mitigation plan is how you intend to reduce the likelihood of the risk occurring. It's important to note that this is not a plan to stop the risk from occurring, only to make it less likely.

Your mitigation plan may be making your app use either SendGrid or MailGun depending on an environment variable. An outage in your preferred provider would then cause only minimal interruption, because you can simply switch over to a different provider (think factories and interfaces).
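
As a rough sketch of the "factories and interfaces" idea, the code below picks an email provider based on an environment variable. The variable name EMAIL_PROVIDER, the class names and the send() signature are all hypothetical, and the provider classes are placeholders that return a string instead of calling the real SendGrid or MailGun services.

```python
import os
from abc import ABC, abstractmethod


class EmailProvider(ABC):
    """The interface the rest of the app depends on."""

    @abstractmethod
    def send(self, to: str, subject: str, body: str) -> str:
        ...


class SendGridProvider(EmailProvider):
    def send(self, to, subject, body):
        # Placeholder: a real implementation would call SendGrid here
        return f"sendgrid -> {to}: {subject}"


class MailGunProvider(EmailProvider):
    def send(self, to, subject, body):
        # Placeholder: a real implementation would call MailGun here
        return f"mailgun -> {to}: {subject}"


PROVIDERS = {"sendgrid": SendGridProvider, "mailgun": MailGunProvider}


def make_email_provider(env=None) -> EmailProvider:
    """Factory: choose a provider from EMAIL_PROVIDER (a hypothetical name)."""
    env = os.environ if env is None else env
    name = env.get("EMAIL_PROVIDER", "sendgrid").lower()
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"Unknown EMAIL_PROVIDER: {name!r}") from None


mailer = make_email_provider()
```

Because the app only ever talks to EmailProvider, switching providers means changing one environment variable and restarting, with no code changes.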

Response Plan

The response plan is how you are going to deal with the risk once it occurs. The point of risk management is not to stop things that are out of your control from happening, but to decide what you are going to do when they happen.

Your response plan might detail how to switch your app from using SendGrid to MailGun in the event of an outage. This could include how to change a setting in your database or config file and a template email or tweet to send to your users.
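
Parts of a response plan like this can be scripted ahead of time. The sketch below assumes the provider setting lives in a JSON config file under a key called email_provider, and that a short templated notice is enough for your users; the key name, file format and wording are all hypothetical.

```python
import json
from pathlib import Path
from string import Template

# Template for the email or tweet sent to users during an outage
NOTICE = Template(
    "We are having trouble sending email via $old_provider and have "
    "switched to $new_provider. Some messages may be delayed."
)


def switch_provider(config_path, new_provider):
    """Update the provider setting in the config file; return the old value."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    old_provider = config.get("email_provider")
    config["email_provider"] = new_provider
    path.write_text(json.dumps(config, indent=2))
    return old_provider


def outage_notice(old_provider, new_provider):
    """Fill in the notice template with the two provider names."""
    return NOTICE.substitute(old_provider=old_provider, new_provider=new_provider)
```

During an outage you would call something like switch_provider("config.json", "mailgun") and then send the text returned by outage_notice() to your users. Having both steps written down (or scripted) in advance means nobody has to improvise under pressure.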

Conclusion

While certainly not a detailed description of risk management, I hope that this post will give you an idea of the things that could go wrong and how to cover yourself when they do.

Being able to simply flick a switch when a provider is down is going to show your boss you have certainly planned for disaster!

Remember, the point of identifying risks is not to avoid them, but to embrace them with knowledge and understanding.