Beware of Salesmen Bearing SLAs

This post relates to some consulting I did with a business about whether or not to use an Off The Shelf (OTS) solution for some auth middleware.

To OTS or not to OTS – is that the question?

Sorry for mangling Shakespeare, but when considering an OTS solution, compared to doing something else (like in-house development or running something equivalent using open source), I tend to look at it from two points of view:

  1. Does the OTS solution fulfill the critical need?
  2. Does the OTS solution have an appropriate SLA?

Now the first usually comes down to working out exactly what is needed, rather than getting distracted by all the bells and whistles. This becomes particularly important when considering core infrastructure, as you need to keep a handle on the dependencies and minimise the security ‘surface’ you are actually letting yourself in for.

The second, with respect to cloud-based services, is where things get particularly interesting. The OTS solution being considered had the following ‘qualities’ (found after several email exchanges):

  1. All accounts operated on exactly the same cloud infrastructure, with no separation at all.
  2. Accounts were provided with different qualities and features depending on how much you spent, in roughly 3 tiers: free, paid for and enterprise (4 figures a year).
  3. If you really wanted to, you could have your own hosted solution (5 figure sum per annum!).

Now, given this is auth middleware, providing a critical auth service between users and systems and between systems and systems, its uptime should be better than that of the systems which call it. Otherwise it just degrades the uptime of everything that depends upon it, which is a bit of a fail really. So what is the SLA for the enterprise level service? Just 99.9%, or about 43 minutes of outage per month, without any comeback. I nearly fell off my chair! That is just the uptime for the basic AWS class service of a single instance with no resilience!

The company sales rep went on repeatedly about how their actual uptime was much better than that. I answered back: well, if that’s the case, why isn’t their SLA at least four nines, if not five nines (something you can readily do by having two independent instances in different availability zones or regions)? Again came the sales pitch: if you want a better SLA you can go down the hosted, 5 figure sum path… hmm.
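To put some numbers on that, here is a minimal Python sketch of the arithmetic. It assumes a 30-day month and, for the redundant case, that the two instances fail completely independently of each other (real deployments share things like DNS, deploy pipelines and config, so treat the parallel figure as an upper bound). The function names are my own for illustration and are not taken from any vendor tooling.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def downtime_minutes_per_month(availability: float) -> float:
    """Outage budget per month that a given availability still permits."""
    return (1.0 - availability) * MINUTES_PER_MONTH


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain where every component must be up,
    e.g. an application that cannot work without its auth service."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel_availability(*availabilities: float) -> float:
    """Availability when any one instance being up is enough.
    Assumes failures are independent, which is optimistic."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


if __name__ == "__main__":
    # Three nines: the enterprise SLA on offer.
    print(f"{downtime_minutes_per_month(0.999):.1f} min/month")
    # A 99.9% app sitting behind a 99.9% auth service.
    print(f"{serial_availability(0.999, 0.999):.6f}")
    # Two independent 99.9% instances, either of which can serve.
    print(f"{parallel_availability(0.999, 0.999):.6f}")
```

The serial case is the one that worried me: a system that is itself good for 99.9% drops to roughly 99.8% once it depends on a 99.9% auth service, which doubles its permitted outage to around 86 minutes a month. The parallel case is why the three nines figure looked so unambitious: on paper, two independent three-nines instances give you well beyond five nines.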

Then I realised: this is a deliberate sales policy decision. If you have a genuine requirement for an actual SLA with some real teeth to it, you are going to pay through the nose for the privilege, regardless of whether or not they could actually deliver it using their current set up. Then I also realised that the way they have it set up, they cannot actually operate distinct SLAs, because they have everything running in the one set up. To operate distinct SLA levels you need distinct systems, so that a failure in one SLA ‘pool’ doesn’t impact anyone else…

They are also further ‘stuffed’ in that they cannot do progressive development rollouts up through the quality chain (i.e. start with the free accounts and progress up to those paying the most, on the basis that if you haven’t paid you don’t have that much of an expectation of availability anyway). Essentially there is currently a risk (albeit small) of shooting everyone in the foot if an update fails in a way that cannot be detected until it is too late. This is akin to what a trading house recently encountered when they did a progressive update that produced erroneous trades and promptly traded themselves out of business in a few minutes… Such a characteristic is not good news if you are someone providing an auth service…
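For what it is worth, the kind of progressive rollout I have in mind looks something like the sketch below. It is only an illustration under stated assumptions: the tier names mirror the three account levels above, and `deploy`, `healthy` and `rollback` are hypothetical hooks standing in for whatever a real deployment pipeline would provide. The key point is that it only works if each tier runs in its own separate pool.

```python
from typing import Callable, List

# Assumption: one isolated pool per account tier, lowest expectation first.
TIERS = ["free", "paid", "enterprise"]


def progressive_rollout(
    tiers: List[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Update one tier at a time, stopping at the first unhealthy tier.

    Returns the tiers left running the new version. Tiers after the
    failing one are never touched, so they keep the old version.
    """
    updated: List[str] = []
    for tier in tiers:
        deploy(tier)
        if not healthy(tier):
            rollback(tier)  # undo only the tier that just failed
            break
        updated.append(tier)
    return updated


if __name__ == "__main__":
    # Simulate an update that only shows its fault on the paid tier.
    result = progressive_rollout(
        TIERS,
        deploy=lambda tier: print(f"deploying to {tier}"),
        healthy=lambda tier: tier != "paid",
        rollback=lambda tier: print(f"rolling back {tier}"),
    )
    print(result)
```

In the simulated run the enterprise pool is never touched, which is exactly the protection you cannot offer when every account sits on the one shared set up. Whether to also roll back the earlier tiers that passed their checks is a design choice I have left out here.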

Essentially the way they had set themselves up meant they had no incentive yet to improve their SLA level and instead were keeping it low to force people to pay top dollar for a better SLA…

So it was essentially one big cloud system acting as one big bucket, and a bucket I didn’t think was worth the risk of being in.