Understanding Risk in Control System Environments

Risk in our field is most often defined as risk = threat x vulnerability x consequence. While the formula is easy to state, it is very difficult to assign actual values to its variables. How do we quantitatively assign “real” values to the concepts of threat, vulnerability and consequence?

Of the three, consequence may be the most approachable. Consequence can to some degree be quantified in the almost universal units of dollars, with some exceptions. The consequence of an interruption in production of a plant or facility can be quantified as the loss in revenue that the product would have generated + the cost of fixing the problem + the costs of having idle workers at the facility + other normal operational expenses.
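
As a rough illustration of that sum, the sketch below adds the four components for a single outage. The function name, the parameter names and every dollar figure are made up for the example; a real estimate would come from the asset owner's own financial data.

```python
def interruption_cost(lost_revenue, repair_cost, idle_labor_cost, other_operating_costs):
    """Sum the cost components of a production interruption, in dollars."""
    return lost_revenue + repair_cost + idle_labor_cost + other_operating_costs


# Hypothetical figures for a two-day outage, not taken from any real facility.
example_cost = interruption_cost(
    lost_revenue=2 * 1_500_000,        # two days of lost product revenue
    repair_cost=250_000,               # parts and contractor time
    idle_labor_cost=2 * 80_000,        # wages paid while the line is down
    other_operating_costs=2 * 40_000,  # other normal operating expenses
)
print(f"Estimated consequence: ${example_cost:,}")
```

Keeping the components as separate arguments makes it easier to see which one dominates the total for a given facility.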

Most companies (asset owners) have already performed some of the consequence calculations and have them available in the form of a Recovery Time Objective (RTO) or Business Impact Analyses (BIA), which define the loss for a given time against a given system. RTO is measured in minutes, hours or days.

As an example, an RTO could be calculated using the following inputs: combine the total revenue loss and cost exposure estimates, then use the matrix below to determine the corresponding RTO classification.

Financial Impact / RTO Classification Matrix:

  • RTO 1     Daily Revenue and Cost Exposure is $10 million or greater
  • RTO 2     Daily Revenue and Cost Exposure is between $5 million and $9.9 million
  • RTO 3     Daily Revenue and Cost Exposure is between $3.3 million and $4.9 million
  • RTO 7     Daily Revenue and Cost Exposure is between $1.0 million and $3.2 million
  • RTO 30    Daily Revenue and Cost Exposure is less than $1.0 million
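
The matrix above can be read as a simple threshold lookup. The sketch below is one way to express it; the names `RTO_BANDS` and `classify_rto` are invented for the example, and it assumes that a value falling in the gap between two published bands (say $9.95 million) belongs to the lower band.

```python
# Lower bound of each band, in millions of dollars per day, paired with its RTO class.
# Assumption: values that fall between the published bands go to the lower band.
RTO_BANDS = [
    (10.0, 1),
    (5.0, 2),
    (3.3, 3),
    (1.0, 7),
]
DEFAULT_RTO = 30  # anything below $1.0 million per day


def classify_rto(daily_exposure_millions):
    """Return the RTO classification for a daily revenue and cost exposure."""
    for lower_bound, rto in RTO_BANDS:
        if daily_exposure_millions >= lower_bound:
            return rto
    return DEFAULT_RTO


for exposure in (12.0, 5.0, 3.49, 1.0, 0.4):
    print(f"${exposure}M per day -> RTO {classify_rto(exposure)}")
```

Because the bands are listed from highest to lowest, the first lower bound the exposure meets decides the class.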

This is then cross-referenced with customer impact tables that cover:

  • High Revenue Customer Risk Table – This table focuses on business processes / services that support customers providing substantial revenue to the organization such as business, corporate and institutional customers.
  • Blended Customer & Revenue Risk Table – This table focuses both on volume of customers impacted and revenue generated by the business process / service.  This table should be used in conjunction with Risk Table to confirm and determine the RTO classification most appropriate for the line of business.

The above can be combined with other inputs, which could include regulatory impacts, loss of life, reputation, etc. Each category gets a rating that is then used to calculate the final RTO.

Consequence multiplies exponentially when the effects of an event impinge upon the environment, human safety and human life. The Bhopal disaster, while not a “cyber” incident, cost Union Carbide over $470 million and will most likely bring further costs for environmental remediation. And while 3,800 people died immediately from the toxic cloud, it is estimated that 20,000 died from the effects of the release. When calculating risk, what value does one assign to a human life?

The other two values, threat and vulnerability, can be viewed as alpha values representing probability.

For vulnerability, given the state of current control systems, there is a probability of 1 that an attacker who penetrates the control system can do something bad. Knowing this, the probability can be reduced by defense in depth measures that make entry into the control system difficult and, in the best scenario, nigh on impossible. Through proper perimeter control, system monitoring, other technological solutions, policies & procedures, training of employees, and other mitigations, the probability of intrusion into a control system can start to approach 0, but as there are unknowns in any system it can never be viewed as a probability of 0.
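
One way to picture how layered defenses push that probability down is to multiply the chance of an attacker getting past each layer. The sketch below does that under a strong simplifying assumption, namely that the layers fail independently of one another, which real defenses rarely do. The layer names and probabilities are invented for illustration.

```python
from math import prod


def intrusion_probability(layer_bypass_probabilities):
    """Probability that an attacker gets past every defensive layer.

    Assumes the layers fail independently of one another.
    """
    return prod(layer_bypass_probabilities)


# Hypothetical chance of an attacker defeating each layer.
layers = {
    "perimeter firewall": 0.30,
    "network monitoring": 0.50,
    "hardened hosts": 0.40,
}

p_intrusion = intrusion_probability(layers.values())
# Once inside, the attacker is assumed to do damage with probability 1.
vulnerability = p_intrusion * 1.0
print(f"Vulnerability: {vulnerability:.3f}")
```

With no layers at all the product is 1, which matches the starting point of an undefended control system; each added layer can only shrink the value, and it never reaches 0 unless some layer is assumed to be perfect.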

Threat too can be viewed as an alpha value: the probability of an organization being targeted for an attack. It is important to understand the makeup of the threat, which runs the gamut from hostile nation states to terrorists, hacker teams, recreational hackers, script kiddies, insider threats, and malicious code agents. Just because an organization is not being directly targeted by an aggressor does not mean that there is no threat. The internet and our networked environments are replete with worms, trojans, and viruses that are automated and will wander into any network that they can leverage themselves into. Threat then must be viewed as a probability approaching 1, and the bigger and better known an organization, the closer to 1 that probability gets.

As consequence then becomes the driving factor in the equation, it is important to truly understand the consequence of a “worst case” scenario in a facility. This case must be viewed as the consequence of an event if the safety systems fail. Assigning a dollar amount to consequence then provides a reference point against which arguments can be made for further expenditures in safety, and in both physical and cyber security.
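
To see how consequence dominates once threat sits near 1, the sketch below evaluates risk = threat x vulnerability x consequence for two worst cases with identical threat and vulnerability. All inputs are hypothetical; the larger consequence is simply on the scale of the Bhopal figure mentioned earlier, and the function name `risk` is chosen for the example.

```python
def risk(threat, vulnerability, consequence_dollars):
    """risk = threat x vulnerability x consequence, with probabilities in [0, 1]."""
    for name, p in (("threat", threat), ("vulnerability", vulnerability)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability between 0 and 1")
    return threat * vulnerability * consequence_dollars


# Hypothetical inputs: threat near 1, the same defenses, two different worst cases.
THREAT = 0.95
VULNERABILITY = 0.05

scenarios = (
    ("production outage", 3_500_000),
    ("safety system failure", 470_000_000),
)
for label, consequence in scenarios:
    print(f"{label}: ${risk(THREAT, VULNERABILITY, consequence):,.0f}")
```

The two results differ by exactly the ratio of the consequences, which is the sense in which consequence drives the equation.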

20 comments to Understanding Risk in Control System Environments

  • A nice way of thinking by numbers. Are there any particular reasons for which you see it applicable to a particular kind of industrial control, namely distributed control systems? What would differ if you try it on another kind of industrial control, say SCADA?

  • lackey

    The methodology for SCADA is the same. I should probably edit the heading to include SCADA systems or all control systems. Fell to the common semantic failure of referring to all of these systems as digital control systems.

    There fixed it! Thanks.

  • cempics

    Quality post. Perhaps the concept of Corporate Social Responsibility (CSR) can be more formally inserted into the risk analysis. In CSR, the triple bottom line of social, financial, and environmental performance is tracked along with the traditional primary financial focus. For the vast majority of corporations practicing CSR today, the environmental and social metrics are then rolled up into the financial, tying it all back to a $ amount. However, for purposes of this risk assessment it would appear that using all three indicators individually may assist in a more sound, albeit less quantitative, consequence definition. To add to that, many organizations in scope of this type of risk assessment are government run (water management districts for example), so their focus is not always tied to making a profit, or even wise use of funds, I speculate. So for utility X with a financial focus, a power plant down leading to widespread power outages may be the worst case scenario from a financial (reputation + lost revenue + other impacts caused from power outage) perspective, while for water management district Y with a social (public service/public impact) focus it is widespread flooding from a dam malfunction. For a nuclear plant it is the effect on, or loss of, human life, etc.

    Perhaps keeping the three CSR areas of social, financial, and environmental separate is a good idea when defining consequence. Define worst case scenarios in each of the three areas and agree that in each area there are also events that you know you can’t quantify ahead of time but would never want to see happen. Luckily, government regulations often address those “you never want to see” events, but if you’re an organization you *should* do whatever is in your power to prevent them as well.

  • “Risk in our field is most often defined as risk = threat x vulnerability x consequence.”

    That doesn’t make it right. I know you’re not to blame for such a terrible equation, but I’ve always wondered, why is threat multiplied by vulnerability? I’ve never understood the reason for using multiplication there, or what real world relationship that’s supposed to represent. (And perhaps one of the reasons I don’t understand is that the equation is usually used with some flavor of ordinal scale, which is simply very silly.)

    Now no one enjoys creating probability distributions for risk factors more than I, but I’m wondering about how you’re arriving at those 0 – 1 values. Is there any rigor in setting the bar there? What role does frequency play?

  • Alexander,
    You’re right I didn’t write the equation and it is pretty horrible. I have often seen this equation bandied about, and I am through this posting trying to make some sense of it.

    As I noted, the only number in the equation that is really understandable and/or quantifiable is consequence, as it can be expressed in terms of dollars, except when trying to quantify the loss of human life (the worst case scenario for failure in a variety of these systems). In most cases I think consequence is underestimated, as it is difficult to fully comprehend/estimate all of the interconnected consequences of a truly large event.

    I have seen this equation used a lot but never explained in depth. I too have seen discussion as to whether threat and vulnerability are ordinals or probabilities.

    How do you come up with the probability factors? Well, in my case, mostly from my gut. With the knowledge I have of the current state of control systems, if an attacker is on the control system segment/LAN then the attacker owns the system. Hence the probability of 1 for that segment. Now multiply that by some probability for their penetrating into the control system through various (if any) impediments. The better the defense in depth, the closer this value goes towards zero. Outside of this, your guess/methodology is probably as good as or better than mine.

    As to the probability of being targeted (threat) my gut tells me that we are all targeted in some way every day by some malicious bit of code, virus, trojan, phishing attack, what have you, and I feel it safe to say that the threat probability approaches 1 that on any given day a system is targeted for some kind of attack.

    The sophistication of the attacker and his/her determination and resources push the threat value toward 1. Now how do you arrive at the specific threat (or vulnerability) value for any given organization? I have seen various methodologies for coming up with those values, from Probabilistic Risk Assessment and fault trees or success trees, to schemas based on average times to hack a particular type of system, to who knows what. None of them, I thought, represented reality very well.

    As you noted, where is frequency or any bearing to delta T, e.g. how does risk play across time, in this formula?

    One of my goals for this posting was to generate good discussion on how exactly to apply this equation and/or the relevance of this equation to reality. In the end I think it wise to overestimate the consequences of a catastrophic failure and plan accordingly.

  • Anonymous

    Alexander,
    Besides my opinion on the terribleness of the equation, I think the reason for multiplying is a mathematical one. A threat or vulnerability of zero will give you zero risk.
    Next to that, multiplying scales will enable you to get a more fine-grained classification of the risks. Say you measure threat and vulnerability on a scale of 0 to 5 (using various sources (including frequency of occurrence) to feed your gut), multiplying will give you 26 possible outcomes while adding will give you only 11.

    Kevin,
    Good post! It all starts with an RTO or BIA from Asset Management, since they are the people who will be paying for the countermeasures. I often find that these risk assessments are done by technicians who are deeply involved with their systems and often fail to see the role their system plays in the total organization.
    We have therefore split the equation. Consequence is only determined by Asset Management on a number of topics and threat x vulnerability is determined by the technicians.
    The challenge for risk managers is to work with realistic worst-case scenarios (and really eliminate the chance of occurrence in it).

  • rl

    Kevin, is there any reason why you limit your discussion of threats to attacks? As far as I know, other (non-intentional) threats account for 99.99% of real-world incidents.

  • peterson

    There has been general dissatisfaction with this equation in the control system world because:

    a. Consequence is huge in dollars and human life for many critical infrastructure control systems and makes risk unacceptably high unless threat or vulnerability approach zero.

    b. There has been no calculation or estimation of threat that has been viewed as credible by the community [which Ralph rightly states must include directed and non-directed, intentional and unintentional].

    This does not mean we can stop trying. If the community cannot credibly define risk, C-level executives will not take the issue seriously or allocate budget.

    A few interesting past data points:

    - Mike Peters of FERC and Sandia’s RAM say the threat should be set to 1, see http://www.digitalbond.com/index.php/2007/08/16/weiss-day-three-nist-event/ , essentially making the equation consequence x vulnerability. I have not met many C-level executives who would buy that, and it does not help in an efficient allocation of resources. Which of the myriad of vulnerabilities should you address first to reduce risk by the greatest amount or to an acceptable level?

    - Ralph Langner and Bryan Singer tried a different approach with “Scenario Based Threat Modeling” at S4 2008 precisely because the C x T x V equation has failed to date in this space. It was a very controversial approach, a third of the attendees said it was their favorite paper and a third said it was their least favorite paper.

    - The most practical and effective approach I’ve seen to date is Sean McBride and the team at INL’s “Ideal Driven Technical Metrics” to measure control system security – another S4 2008 paper. Theory, tried, modified and with actual numbers that can be used to compare security postures.

    - Finally, we always envisioned one role of a SCADA Honeynet is to gather threat data to help quantify or at least qualify risk.

  • I think the problem with the C x T x V is that these numbers change so rapidly that it’s very difficult for slow moving industries to keep up. Keep in mind that the design of defenses isn’t free or immediate. With new defenses, there are validations that need to be performed.

    I like Langner and Singer’s approach in this regard because it identifies a coherent defense against a specific threat. C x T x V threat modeling, if done correctly, may be exactly what the doctor ordered, but it doesn’t tell front-line people what to look for or what to recognize.

    Toward that end, pilots don’t practice random failures of the aircraft. They practice reactions to system failures, such as a failure of landing gear to extend and lock into place. We can’t defend against a numerically derived bogeyman. We have to form coherent strategies to defend against specific scenarios so that we can train people to do better.

  • southworthrg

    Hi guys. Metrics is one of the “big” topics… I will probably sound like a wet blanket about this, but I have changed my mind on this topic based on what I have learned in recent months.

    I have a couple of questions and something to offer everyone to ponder when discussing this.

    Firstly, with metrics, why are we re-inventing the wheel?

    Have you looked at ISO27004(draft)? We need to stop replicating efforts; I see so many efforts that are exactly the same thing.

    Erwin, will the vulnerability term in a SCADA system threat equation ever reach zero (apart from talking theoretically)?

    I would contend that this value would never reach zero. I contend that the rule applied to vulnerability should be invalid if ever assessed as zero in a formula.

    We need to be practical and use applied science guys. Some of the values we are talking about here are NEVER going to be accurately describable or quantified. Let’s face some hard facts.

    Honeypots are a truly great idea but we have some challenges.

    Network architectures are very similar if configured to best practices. If they are not configured to best practices, well, the net consequence is… there is a reduction in the mean time to compromise the system.

    This time is a variable figure based on the knowledge and experience of the attacker and the attack surface open to them and it will be tied to the “weakest” connection or segment or component or end point of the control system.

    I think Jake’s suggestion focussing on the “how we defend” our systems is more important (A better ROI to quote Bryan Singer) I think that we should all count to ten, swallow hard, and make the same assumptions when it comes to the bigger picture.

    We already have heaps of data on how corporate systems are compromised, what types of motivated individuals are in the wild, and what the effects have been on these systems.

    To a “motivated individual -internal or external” the net effect is the control system is just another information system segment hanging off the corporate network. (I know Joe Weiss will cringe at this).

    Why is this statement true? The deeper “cyber” communications extends towards the plant floor using non-deterministic interfacing the deeper the cyber problem is. (I know Joe will agree with this and maybe won’t cringe after all at my first statement)

    Jake is bang on. We need to be talking more about resilience. To be honest I think we need to understand that the time for this measurement is actually well and truly past.

    I think the quantity or even the likelihood of vulnerabilities are moot issues. The terms we need to metrify are When, Who and How. These are the unknown values we need to be defining as accurately as we can.
    As someone has already mentioned, the consequence of how our systems are connected & designed, installed & operated IS such that we are always going to have some level of risk. SCADA systems need to communicate with distributed plant and process assets. We have to look at minimising this risk to a point that it is acceptable to the risk appetite of the organisation (owner/operator). Whether this is a socially acceptable level depends on how much society is willing to pay for the level of risk. Everyone wants zero risk but nobody today wants to pay for it.

    I think we need to be developing early warning systems and networks, good auditing & forensics & mitigation processes and technologies and very accurately describe our systems and concentrate on developing good change management and control capabilities. We need to train our people and pay them well and provide a positive human resource environment. Defence in depth is the philosophy we need to remember as a first principle. These are the sorts of things I think we need to metrify.

  • Ron brings up an interesting point: C x T x V represents a wonderful *METRIC*, but it can’t tell us what to do. We need to divide control systems into maintainable segments and design layered defenses upon each segment. We need to develop policies and procedures for dealing with breaches of each layer.

    C x T x V doesn’t tell us that, nor can we expect managers to make this connection for us. This is a security and engineering effort. Let’s keep it there.

  • rl

    Talking about managers, Jake… I doubt that any CxO will give you a budget for your efforts of reducing risk from 0.76 to 0.34, for example. What do such numbers mean? Is a risk reduction of X counts worth N thousand dollars? The more abstract we make this stuff, the less plausible it gets, thus less convincing for executives.

  • @Ron, Good comment – a couple of things strike me:

    “Have you looked at ISO27004(draft)? We need to stop replicating efforts; I see so many efforts that are exactly the same thing.”

    Agree, but just because we can all agree on something doesn’t make it right. See the R = TxVxA equation.

    “think we need to be developing early warning systems and networks, good auditing & forensics & mitigation processes and technologies and very accurately describe our systems and concentrate on developing good change management and control capabilities. We need to train our people and pay them well and provide a positive human resource environment. Defence in depth is the philosophy we need to remember as a first principle. These are the sorts of things I think we need to metrify.”

    I offer that you simply cannot accomplish any of those without understanding the risks and your capability to manage them. Those are the things you should create metrics around; doing so makes prioritization a relatively simple process. Unfortunately, getting that understanding is something I’ve seen, maybe, 3 companies do effectively.

    @Jake, et al. – to expand on Ralph’s comment – the big problem with R = TxVxA is that you simply can’t multiply values in an ordinal scale. It boggles the mind how and why people insist that this is valid.

    @Erwin – “A threat or vulnerability of zero will give you zero risk.” – yes but using an ordinal scale falls apart there because a threat or vulnerability of zero also gives you an asset worth of zero (see ordinal scale rant above).

  • rob

    Alex,

    I usually try and limit my raising risk related questions to posting on your site :), but Ron raises some points that relate to the trust versus risk issue I raised there recently, when he states:

    “The terms we need to metrify are When, Who and How.”

    Ron also says something else that is important here:

    ” The deeper cyber communications extends towards the plant floor using non-deterministic interfacing the deeper the cyber problem is”,

    the key word here being non-deterministic.

    What is the impact of control systems and management networks that ARE deterministic, say, on the risk equation?

    Even if we arbitrarily set a value of 1 for V (if any vulnerabilities exist), it becomes a moot point if the control system removes the opportunity to act on any potential threats that stem from them.

    How is that done? Either by adding control mechanisms that prevent escalation of privileges, or by absolutely caging suspect services or applications, or both.

    Another difference between the status quo and deterministic systems is authorization for user activity post-authentication. This utilizes white listing approaches to data and systems access in a deny-by-default manner that simply rejects what has not been pre-determined as an acceptable behavior. So many control systems and IT environments rely solely on authentication technologies, but then do little about the “When, Who and How”, which are trust and authorization questions.

    This is really a key missing ingredient in today’s controls, and to my mind, is one of the most poorly understood and recognized issues in security across the board (all verticals), as it is discussed so little. It may be that many consider such a model too far from being achievable, but I worry that many just have not even considered this question at all. The language of authorization for business rules may be about trust, rather than risk.

  • southworthrg

    Hi Guys from NIST..

    NIST is pleased to announce the release of NIST Special Publication
    800-55, Revision 1, Performance Measurement Guide for Information
    Security. This publication provides assistance in the developing,
    selecting, and implementing security performance measures to be used
    at the information system and program levels. These measures
    indicate the effectiveness of security controls applied to
    information systems and supporting information security programs.

    URL to document:
    http://csrc.nist.gov/publications/PubsSPs.html#800-55_Rev1

    So there is one method that is released and one that is in draft form.

    Rob that is interesting what you have said and worthy of exploring especially when trying to explain when to enact what method of incident response in the resilience model.

    Using RISK to explain to our C levels is something conceptually they understand, and Trust is what we have based our info sharing on here in AU, so that too is something that I think people can understand fairly quickly and probably quantify fairly instinctively without too much trouble.

    Consequences are something that we are not so good at understanding, describing or necessarily identifying all the interdependencies.

    I have been looking at ways to describe to different stakeholders what is different from other information systems and determinism is a good concept to bring to the table.

  • rl

    “Consequences are something that we are not so good at understanding, describing or necessarily identifying all the interdependencies.”

    I’m glad you mentioned this, Ron. Funny enough, even though consequences are the one factor in the risk equation that provides for the proper scaling level for multiplication, experience shows that few asset owners are able to pull out dollar figures for production disruptions, for example. Even those who do BIA don’t necessarily extend it to the plant floor, which must seem strange if that is where your primary value-generating business process takes place. It is always puzzling to see folks staring at each other and scratching their heads when we try to collect such numbers during our interviews when performing a risk assessment.

    The same applies to dependencies. Few organizations have done a proper FMEA or FTA; as a result, many have unidentified single points of global failure.

  • Ron.Southworth

    Hi Ralph,

    I talk about, and try to place emphasis on, the need for objective (rather than using the word proper) risk assessment. Simply not being objective does not automatically imply an improper risk assessment. Safety systems assessment processes require an amount of independent peer review and perhaps this is an element that needs to be explored more by people conducting risk assessments.

    Quite often the people making the risk assessment do not have the necessary experience, qualifications or depth of knowledge (or a combination) to be in a position to make the necessary determination. One area that this is happening more and more is Health and Safety. I find that I am expected to make a call on work methods that I am just not capable of making an informed decision on. My way of managing this is to source the right talent and task them with giving me an informed assessment. Maybe more people need to do the same thing?

  • rob

    Hi Ron,

    “Trust is what we have based our info sharing on here in AU so that too is something that I think people can understand fairly quickly and probably quantify fairly instinctively without too much trouble.”

    This is insightful, as what we offer with our technology is full-blown information-centric security, with scalable MLS, multiple domain separation and multilevel integrity (also useful in critical infrastructure). Trustifier’s basis for rule enforcement is the pre-determined (hence deterministic) trust relationships within user groups, and between groups and user roles, providing an intuitive framework for determining and controlling business data flow. What this does is allow the security rules to be made in the language of business decisions.

    The basis for this model is basing privileges on how much we trust the users, how and when they can access what information, and everything else is denied. Changes may be easily made, but they must be signed off by the person responsible. Accountability is built into the system, and assumes evil admins, thereby acting as a behavior enforcer even for inside personnel.

    I have been thinking for a while about whether the focus on information flow means more consideration of trust than risk, even though Alex says risk is a metric of trust.

  • Ron.Southworth

    Hi Rob, I think that you guys are certainly using some good first principles to leverage your solution upon. It is good that the decisions are focussed on business decisions. Understanding control systems business elements and making certain technology meets these will yield an excellent product synergy!

    I don’t think we have to assume we are all evil; rather, I think we need to have rigor in place to encourage good work ethics and behaviors. It won’t stop people from not thinking rationally or committing harm, but it hopefully will record it and who ultimately committed the activity.

    I think information flows are very important to describe and analyze and defining where and when to trust is important too.

  • rob

    Ron,

    Not accusing any admins of being evil, but to have safeguards against compromised individuals who have high level access privileges, or the disgruntled, vengeful staffer, is probably not a bad thing to have especially if the potential consequences of unauthorized actions are large in monetary terms, or by any other measure.
