Friday, 6 January 2012

Risk Appetite - The Need for Security SLAs

While we were writing our 'The Business v Security Bugs – Risk Management of Software Security Vulnerabilities by ISVs' blog post, Phil Huggins raised a point which we incorporated into the business questions of that post. His full comment was:
"Once systems are up and running, deploying a critical security patch out of the regular patch cycle introduces a greater risk of outage and therefore, a failure to meet SLAs. Customers want a regular predictable patch cycle they can build into their SLAs, emergency critical patches screw these up."
We paraphrased and incorporated Phil's input into our post; as a vendor, whether or not your clients will deploy a patch is a key driver of the patch development process. Patches, although core to overall security, are relative to the risk position of the business. The 'risk appetite' of a business is not something to be over- or underestimated, and at times we've been very surprised. For example: one organisation considered upgrading its user network to gigabit Ethernet as a solution to the network slowdown caused by a rampant Conficker infection; they understood the cause, but they saw an upgrade as the 'least cost' option for dealing with the effects. For us, as a security consultancy and software development business, it's sometimes a challenge to understand the mindset of clients who have ravenous risk appetites, particularly when you're being paid to advise them on their technical risks.

The reality, of course, is that within a business there is a huge amount of decision conflict between the different risk positions of suppliers, stake-holders and customers. Balancing the three is a complex task where, more often than not, at least one party is left disappointed and with an uneasy feeling. However, Phil has presented an interesting perspective which is worthy of some more detailed discussion.

Service Level Agreements (SLAs) are commonplace; they exist at various levels, but mostly between a business and its suppliers and between the business and its customers. For a large proportion of the time, the focus is on the upkeep of the SLA. For example, ensuring that an Internet connection is maintained with five-nines [1][2] availability rather than keeping the router at the latest patch version; or ensuring that customers can check out of your web store within certain usability time frames rather than adding security checks to the input/output routines that may come with a performance penalty.

A typical SLA will have provisions which permit the provider to perform scheduled work such as maintenance, often outside the guaranteed service agreement. Under a five-nines agreement this leaves, as Phil implies, the remaining 0.001% for unplanned events, which encompass all manner of outages and disruptions as well as unplanned emergency security upgrades or mitigation exercises. The nature of emergency upgrades and changes is such that, unlike scheduled outages, they don't necessarily have the same logical planning and management oversight, and they are often deployed hurriedly to mitigate an exposure, which in itself is not a risk-free process.
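To put the 0.001% in perspective, here is a rough sketch of the arithmetic in Python. The function names, the 365-day year and the 30-minute patch window are our own assumptions for illustration; a real SLA will define its own measurement period and what counts as downtime.

```python
# Illustrative only: names and figures are assumptions, not taken from any real SLA.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Unplanned-downtime allowance implied by an availability target."""
    return period_minutes * (100.0 - availability_percent) / 100.0


def remaining_budget_minutes(availability_percent, outage_minutes_so_far, patch_window_minutes):
    """Budget left after past outages and a proposed emergency patch window.

    A negative result means the patch window alone would breach the SLA.
    """
    budget = downtime_budget_minutes(availability_percent)
    return budget - outage_minutes_so_far - patch_window_minutes


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.999)
    print(f"Five-nines yearly budget: {budget:.2f} minutes")
    # Hypothetical 30-minute emergency patch with no outages so far this year.
    left = remaining_budget_minutes(99.999, 0.0, 30.0)
    print(f"Budget left after a 30-minute patch window: {left:.2f} minutes")
```

On these assumptions the whole yearly allowance is a little over five minutes, so a single emergency patch window of any realistic length consumes it many times over, unless the component can be taken out of service without affecting availability.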

Of course, with a system implementing high availability, there should be sufficient resilience and resource to take down any component for maintenance without impacting the SLA; likewise there should be a structured duplicate test environment through which patches can be rolled out to provide detailed impact analysis. However, this is not always (and if we're being honest, rarely) the case, and with budgetary constraints getting ever tighter in all sectors, spare equipment or capacity is constantly being eyed up for repurposing. However robust a system, the penalties for breaching the SLA are often too commercially great to risk implementing unplanned emergency changes to a functioning system, when that downtime allowance may be needed for an actual fault or outage.

The stance you take as a vendor will likely never please all of your clients. They have to deal with the negative perception that service updates and outage announcements create among their own customers, and with how reactive their organisation appears, regardless of whether the choice is their own. Of course, if the vulnerability is being actively exploited then, as a vendor, you can't appear to react fast enough or provide detailed enough advice, irrespective of whether that impacts your normal patch development process and diligence. Sometimes you just have to be seen doing something.

Of course, you could read Phil's comment and say 'why do we care?'. An emergency patch was required and so you released it; if the SLAs maintained by your customers mean that they can't roll the patch out without risk of breach, surely that's their problem, and maybe they should have negotiated their terms more effectively. Although commercially harsh, this is a valid standpoint. As a business you can't bend to the whims of every customer. Inevitably you will have one who wants a patch yesterday, and another who'd like the whole thing delayed a year or so. There is also the potential for customers to place pressure on the vendor to downgrade or upgrade the classification of a vulnerability in order to avoid or force the implementation.

Providing work-arounds or alternative mitigations is an approach vendors often take, as is implementing solutions which support 'hot patching', preserving system up-time whilst allowing risks to be mitigated. However, neither of these approaches is sufficiently different from the deployment of an emergency patch to make them anything other than optional decision routes on the same branch.

The business question we're asking is how much weight you should give to customer SLAs when deciding whether to develop an emergency patch for a security exposure in your product; and of course there is no definitive answer. Advising your clients to build provisions into their service and customer SLAs to allow for reactive implementation of security fixes may be one solution: provisions under which security-related actions don't count against the overall SLA. However, this is of course ripe for abuse by unscrupulous providers looking to squeeze the agreement when balancing on the 0.00149% boundary.

Emergency patches do throw a spanner in the works. A regular, predictable cycle of patch releases, each robust and diligently tested to the nth degree, is the ideal, but it is almost impossible to achieve without offloading considerable risk onto your clients. Emergency patches will be released for your systems; it will happen and you will have to react to them. How you react, how you balance the risk and how big your risk portion is are all relative. However, we feel that leaving the security of your system potentially compromised because of an overbearing or highly restrictive SLA is unlikely to be a far-sighted solution.

With more and more companies moving services onto hosted or cloud solutions, the bigger security question is what impact 99.999% availability has on your security position. How is your provider patching the hardware, software and applications that make up your environment, and what provisions are there within your SLA for reacting to unplanned security risks?
