Third party services are here to stay. Not only do they let you reduce the amount of work that your team has to do, but they also offer integrations with other software on the web that your company may want to use. As a result, you’ll eventually have to work with some third party software, whether it is a payment processor, an email service provider, or even a CRM for your sales team.
The initial integration with most third party APIs is usually pretty easy, but over time you’ll learn that there is a price for being too tightly coupled to systems outside your organization. Third parties always have different goals than your organization does, and you need to design your interaction with their systems in a way that protects you from the consequences of those goals. Done properly, this design not only protects you from bad behavior on their part, but also makes your own systems more resilient.
Integrating with third-party APIs looks simple on the surface, but you need to be very careful how you do it. Beyond obvious problems such as error handling, integrating with a third party can directly impact your system’s uptime and make everything more complicated. However, it doesn’t have to be that way. If you plan ahead, a third party API can make your application more capable and truly provide value. That only happens if you are well-prepared.
10:10 Error Handling
You will encounter errors when integrating with anything. Many of these may not even be your fault. APIs upgrade or change implementations all the time. Network failures occur, and network configurations change as well. Third party APIs may get DDoSed.
You have to be able to recover from errors. Retry logic should be built in from the start, and it should use an escalating back-off model so that you don’t hammer a service that is already struggling.
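A minimal sketch of that escalating back-off model in Python. The function name, the `TransientError` class, and the delay values are all illustrative assumptions, not from any particular HTTP library:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for the retryable failures your HTTP client raises
    (timeouts, 5xx responses, connection resets)."""

def call_with_backoff(request, max_attempts=5, base_delay=1.0):
    """Retry a zero-argument callable with exponential back-off.

    Delays escalate as base_delay, 2*base_delay, 4*base_delay, ...
    plus random jitter so that many retrying clients don't all hit
    the recovering service at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

The jitter matters: without it, every client that failed at the same moment retries at the same moment, recreating the original spike.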
Errors on their side shouldn’t break your side. Any updates from them should be treated in an idempotent fashion. Webhooks on your side should also log errors and send an alert if the errors exceed a threshold.
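The idempotency and alert-threshold ideas can be sketched roughly as follows. The event shape, the threshold value, and the helper functions are assumptions for illustration; in production the seen-event set would live in a shared store rather than process memory:

```python
import logging

logger = logging.getLogger("webhooks")
_seen_events = set()        # in production: a database or cache shared across workers
_error_count = 0
ERROR_ALERT_THRESHOLD = 10  # hypothetical threshold; tune to your traffic

def handle_webhook(event):
    """Process a webhook event idempotently.

    Assumes each event carries a unique "id"; events replayed by the
    third party are acknowledged again but never processed twice.
    """
    global _error_count
    if event["id"] in _seen_events:
        return "duplicate"          # already handled; safe to ack again
    try:
        apply_update(event)         # your real processing goes here
        _seen_events.add(event["id"])
        return "processed"
    except Exception:
        _error_count += 1
        logger.exception("webhook failed: %s", event["id"])
        if _error_count >= ERROR_ALERT_THRESHOLD:
            send_alert()            # hypothetical alerting hook
        raise

def apply_update(event):
    pass  # placeholder for the actual business logic

def send_alert():
    pass  # placeholder: page the on-call, post to chat, etc.
```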
You need to be able to get useful debugging information from your integration points. This means being able to see responses from the API (as opposed to whatever you are turning those responses into). It also means sufficient logging, including how long requests take to process.
You need to be able to determine if your integration is incomplete. The other API may change without you knowing. This can mean new responses, new error conditions and new error codes. If you aren’t handling these appropriately, you may notice that your system slowly starts failing more frequently. If you are logging appropriately, you’ll be able to see what the API is sending to you and determine what happened.
You need to be able to determine how many requests you are sending and statistics on the results. At some point as your app scales, your integration points will become a bottleneck. You need to know when. External integrations are easily blamed when a system gets slow, so you need to build proof in from the start. This is also useful for troubleshooting rate limiting and billing issues.
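One lightweight way to build that proof in is to wrap every outbound call so it records a count, an outcome, and a latency. This is a sketch under the assumption that calls are plain callables; a real system would ship these numbers to a metrics backend instead of module-level variables:

```python
import time
from collections import Counter

stats = Counter()   # e.g. {"billing.success": 3, "billing.error": 1}
timings = []        # (integration name, seconds elapsed)

def instrumented_call(name, request):
    """Wrap an outbound API call, recording count, outcome, and latency.

    `name` identifies the integration point; `request` is any
    zero-argument callable that performs the actual call.
    """
    start = time.monotonic()
    try:
        result = request()
        stats[f"{name}.success"] += 1
        return result
    except Exception:
        stats[f"{name}.error"] += 1
        raise
    finally:
        # recorded on both success and failure
        timings.append((name, time.monotonic() - start))
```

When someone blames "the slow external API," you pull up `timings` and either confirm it or point elsewhere.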
Your network will go down or be under maintenance at some point. Even the best data centers will have the occasional outage – you can’t spend your way out of needing to handle this problem. You need to test how the system reacts to a failure. You might be surprised by the consequences of your assumptions.
Their network will not be working at some point. At some point, the API you are calling will have an outage, intentional or otherwise. You need to have a mechanism to retry any calls that can be retried and gracefully fail on others. You may want to consider using a feature flag to indicate whether a particular service is usable, to keep from slowing things down (e.g., web server threads getting tied up making calls to an external service that times out).
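One common way to implement that kill switch automatically is a circuit breaker: after enough consecutive failures, calls fail fast instead of tying up threads, then a probe is allowed through after a cool-down. This is a minimal single-threaded sketch with arbitrary thresholds, not a production implementation:

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive failures, fail fast for
    `reset_after` seconds instead of waiting on a dead service."""

    def __init__(self, max_failures=3, reset_after=60.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None   # None means the breaker is closed (healthy)

    def available(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            # half-open: let one probe call through to test recovery
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def call(self, request):
        if not self.available():
            raise RuntimeError("service marked unavailable; failing fast")
        try:
            result = request()
            self.failures = 0   # any success resets the count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
```

Unlike a manual feature flag, the breaker flips itself off and probes for recovery on its own; many teams use both.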
You need to think about how both sides catch up on processing after an outage. Once the outage has stopped, you still have to catch up processing. You need to make sure that you don’t immediately throw a ton of work at the API, especially if the results of that work get dumped back into your system. It’s easy to build up a queue of work that will break your system. Bear in mind that their system will also be catching up at the same time, so you’ll want to watch load on your webhooks.
You probably don’t want anything connecting with the outside world from deep in your network. This is why many network admins make sure there is a firewall between web servers and the inner network with sensitive data. If your main database holds data of a sensitive nature, your webhook or API integration point just became a huge target for a breach.
Webhooks and services interacting with the external API should be on app servers in the periphery. This tends to mean limited access to things internal to the system. This may even mean separate databases (and other required servers). It could even mean having to have an API behind your integration point that integrates it with the true backend of your system.
You have to think about what happens if there is a breach on their end. You should consider the timing of a breach to be “when”, not “if”. This means careful auditing of the data held by the API, the data held by you, and the data going over the pipe between. This is why a lot of companies use third party APIs. They’d rather not have the sensitive data and the liability for it.
29:40 Their update cycle
The third party does not care about your schedule. If your clients are tax preparers in the US, third party APIs may well ship a breaking change during the last week of March. You have to be prepared to quickly patch your integration points to handle any surprise changes. They may even have a reasonably sane deprecation schedule for their APIs, but if your clients control patching, you might still have a bad time.
You need a good way of finding out about updates before they go live. Get on their newsletter and email list. If the API is critical, you probably need to take the time to participate in their beta program so that you are better prepared for changes. If possible, you should also cultivate some technical contacts at the API provider.
You need to know how long they keep deprecated APIs around and how they update. You need to know this before you even consider working with an API. No matter how good their API is, it will be extremely painful to work with them if they are constantly breaking your stuff at random. Make sure that their deprecation cycle length is longer than your deployment cycle length.
35:00 Your update cycle
Third party integrations are often poorly tested (by you, not them). Do not fall into the temptation of thinking that your integration point is still working just because nothing on your side has changed. Some API vendors roll their updates out to the development environment ahead of rolling them out to production.
Weird things can happen during maintenance and you need to be careful. You might do well to disable external APIs until you are sure that your changes didn’t have any side-effects that you didn’t anticipate. If you do take the API offline during your maintenance window, make sure to follow our previous advice about how to recover from network outages.
What happens when the API expects to be able to call your webhook during an outage? Some APIs simply discard failed requests; you may have to explicitly request a replay to get them. Other APIs queue and retry failed requests; these may come in eventually, but possibly out of order. A period of extended downtime of your API integration point may result in your system getting hit hard when it comes back online. This can be especially nasty if things like caching systems aren’t already hot when it happens.
39:30 Rate limiting
Most APIs are going to limit the number of calls you are allowed to make. They don’t want people knocking their systems offline. This is true even of development systems. You may be limited on how many calls you can make in a short time period, or even billed for an excessive number of calls. When making a call, you need a clean way to determine whether you are over your limit. This gets tricky with multiple processes sending and receiving data. You’ll probably want to stop short of the maximum.
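A token bucket is one clean way to stop short of the maximum on your own side. This sketch assumes a hypothetical published limit of 100 calls per minute and budgets only 80, leaving headroom for other processes sharing the same API key:

```python
import time

class LocalRateLimiter:
    """Token bucket that deliberately stays under the vendor's limit.

    Tokens refill continuously at `calls_per_minute / 60` per second;
    a call may proceed only if a whole token is available.
    """

    def __init__(self, calls_per_minute=80):   # vendor limit assumed ~100
        self.capacity = calls_per_minute
        self.tokens = float(calls_per_minute)
        self.rate = calls_per_minute / 60.0
        self.updated = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should queue or delay the request
```

Note this only protects a single process; with multiple senders you would back the bucket with a shared store such as Redis.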
Rate limits can often result in calls failing. When rate limiting is in place, you may get an error back that has nothing to do with the failure of the actual call. You need to distinguish between failures due to rate limiting and actual errors.
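A small classifier makes that distinction explicit, so retry policy and alerting can treat each case differently. The category names are illustrative; the status-code ranges follow standard HTTP semantics:

```python
def classify_failure(status_code):
    """Separate rate limiting from genuine errors."""
    if status_code == 429:
        return "rate_limited"   # back off and retry later; not a bug
    if 500 <= status_code < 600:
        return "server_error"   # their problem; retry with back-off
    if 400 <= status_code < 500:
        return "client_error"   # likely our bug or a contract change; alert
    return "ok"
```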
If the API calls back to you, you have another implicit rate limit on your end. If you have a webhook that the API vendor calls back on, you need to consider what happens when they screw up and saturate your connection. When that happens, your response should be an HTTP 429: Too Many Requests. You need to know ahead of time how much pressure your webhook can take. Also consider what happens when a non-authorized user saturates the webhook (as in a DDoS situation).
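A framework-agnostic sketch of that load shedding. The request shape, the toy limiter, and the injected `authorized`/`process` helpers are all assumptions; the point is that unauthorized traffic is rejected before it consumes any processing budget:

```python
class SimpleLimiter:
    """Toy fixed-budget limiter; real code would refill over a time window."""

    def __init__(self, budget):
        self.budget = budget

    def try_acquire(self):
        if self.budget > 0:
            self.budget -= 1
            return True
        return False

def webhook_endpoint(request, limiter, authorized, process):
    """Return (status_code, body). Auth is checked first so a DDoS
    from unauthenticated sources never touches the limiter or the
    processing pipeline."""
    if not authorized(request):
        return 403, "forbidden"
    if not limiter.try_acquire():
        return 429, "too many requests"   # well-behaved vendors retry later
    process(request)
    return 200, "ok"
```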
43:15 When APIs die
What do you do when an API that you need is deprecated suddenly or changed to make it less useful? Companies go out of business every day, and some of them don’t warn you. Remember that if the company has gone bankrupt, your company is going to have a hard time suing them for costs incurred. If your business is now dependent on this API, how do you recover?
APIs can also die due to legal issues; for example, GDPR changed the legal status of sharing certain data. This can make the API useless for your purposes. You might also find that the API vendor has suddenly decided that your company is immoral and shut you off. This is an increasingly common problem for certain businesses. Your mental model for working with other companies should be that they react to perceived risk, not actual risk.
Outages may be extended as well. An API may be out for an extended period due to major hacks, financial/legal problems, etc. Business continuity plans need to consider external API dependencies more carefully than they typically do. If you don’t have an SLA (Service Level Agreement) with the API vendor, then you are out of luck when they don’t have their act together.
46:30 Handling multiple APIs
At some point, you’ll have to handle multiple APIs that do similar things. This may be because of client demand, hedging your bets, or simply because you needed features from both. Other than the complexity that it adds, having multiple options can protect you from a lot of the problems outlined in this episode.
This can be so that you have a failover or because both APIs have advantages. Sometimes different segments of your users are more appropriate for one API or the other. Sometimes users in different locations need a different API, for a variety of reasons.
Diagnostics get more interesting with multiple APIs. You now have to determine not only what went wrong, but which API it went through. If some of your codebase is a bit sloppy, these problems may not surface until some other process hits the data in question. CRUD is a terrible model for this sort of thing, at least without a ton of additional logging that can easily be correlated with the situation that caused the problem.
Tangram Factory’s IoT-enabled jump rope has LEDs embedded in it that display the number of jumps in front of the user while in use. Instead of using vertical acceleration or vibration, this jump rope measures jumping by rotations of the handle. The Smart Rope allows users to set workout plans and log their repetitions, and even gamifies workouts by allowing users to compete with other Smart Rope owners. The ropes can track progress in a free Smart Gym app or connect with several devices.
Tricks of the Trade
Remember the game Mouse Trap? Basically, it was an old game where you built a convoluted contraption that would improbably perform some action for you. Imagine errors coming from external systems as being like the output of this game, i.e., some ridiculously improbable set of circumstances will occur and some ridiculously improbable error will hit your system. Don’t think about things as “that can’t happen.” That will happen.