More on eWeek.com Hysteria

In my last blog entry I described why there was no “Hole Found in Protocol”. Now let’s talk about why there is no need to panic.

It really is pretty simple. OPC almost always runs on a Microsoft Windows OS because the protocol was designed around Microsoft’s DCOM. Critical patches for Windows come out on a monthly basis, aka Patch Tuesday. So a responsible asset owner has a process to patch their OPC server typically once a month, but at least once a quarter. The asset owner may accept the risk of a vulnerable system for a few weeks during testing and phased deployment of the patch because they have implemented defense in depth and compensating controls.

So the OPC patches are just one more patch that needs to be applied to a Microsoft system. The only difference is that it is a third-party application running on the Microsoft platform, but asset owners should be monitoring US-CERT and all third-party vendors’ support for vulnerability notices for all implemented software.

The fact is that software has bugs and some of these bugs can be exploited by an attacker to crash a system or gain remote control. Hopefully, software developers are getting better at secure software design and implementation, but that fact is unlikely to change in the next few years.

In this case, a security professional, Lluis Mora, found the exploitable bug(s) in several OPC implementations prior to a hacker finding these vulnerabilities, responsibly reported the vulnerabilities to US-CERT (as opposed to publishing details and exploit code on the Internet), and two vendors so far have issued patches two months after being notified. That is actually a very fast response time. Now it is up to the affected asset owners to do their part and patch the OPC servers. The eWeek.com article missed the point. This is a success story, not a cause for panic.

One last comment: I know there are some asset owners out there saying, “We cannot take our servers down to patch because they must run 365 x 24 x 7.” If you fall into this camp, you are fooling yourself and need to improve your redundancy.

Comments

Comment from Ralph Langner
Time: March 26, 2007, 6:40 am

“You are fooling yourself and need to improve your redundancy.” Good point, Dale. However, my bet is that asset owners will continue to fool themselves until they get hit by some major disaster. Why? Because it is common wisdom that IT and software are not “real” equipment and, therefore, have to be cheap. Dirt cheap. Free at best. Redundancy is expensive. Patching and testing are expensive. So many asset owners don’t even think twice about redundancy and proper security measures; they just don’t do it because it costs. Another “cultural” issue is the common wisdom to never change a running system. Which is, in this case, simply the habit of sitting things out until disaster strikes. So far, my only argument here is to tell everyone: You could have known better, and you should have known better.

Comment from Jake Brodsky
Time: March 26, 2007, 11:36 am

Regarding Ralph’s comment: If you think this, then you haven’t been to many SCADA or DCS sites. We may not be experts on security, but we DO know about redundancy and fail safe control systems. And I’ll bet there are a few folks out there who may be able to teach you a thing or two about the financial and practical justification of a backup.

Never in my career have I witnessed a control system without some sort of redundancy or fall-back. By the way, we don’t often resort to redundancy unless there is a good reason to want it. Typical systems have fall-back and fail-safe instruments and controls, not redundant gear. We really try to avoid common points of failure.

Engineers who design single points of failure into critical processes don’t last long in this business.

Comment from Ralph Langner
Time: March 26, 2007, 12:26 pm

Jake, don’t get me wrong here. You focus on redundancy for process control, in so far as it is related to safety. I certainly agree that engineers have learnt their lesson in this respect, which is partly due to regulation. My argument targets installations where the impact of failure is not equipment safety, but the organization’s bottom line, and where the link between cause and damage is not gear malfunction, but “only” some missing data. Just some missing or incorrect bits and bytes that result in economic loss for whatever complex reason that may be well beyond the process engineer’s scope and responsibility. I don’t blame the process engineer for this.

Let me add that I know quite a few process engineers who figured out that not all is fine and dandy with their DCOM configurations, network infrastructure, backup systems etc., but failed to obtain support and budget from management. The situation gets bizarre when the very same organization is spending big bucks on firewalls, virus scanners, UPS etc. in their office IT.

Comment from Dennis Brandl
Time: March 26, 2007, 3:55 pm

Dale, you are wrong. Redundancy is not the answer to availability. There are many systems that must run 365×24×7 with no down time for 2 years. The Microsoft model of patch, patch again, patch again, repeat as needed, is not suitable for these environments. These systems have redundancy, but that is not the answer to patching.

Comment from Dale Peterson
Time: March 26, 2007, 4:33 pm

Ok Dennis. You have my attention. My assumptions:

- having systems with multiple known vulnerabilities with public exploits for long periods of time is not an option

- you have said patching is not an option

- the fact that software has exploitable bugs is unlikely to change in the near future

What is the solution?

My point was that a benefit of redundancy is that it allows one to have a rare, small maintenance window (maybe one hour once a month) and still maintain 365 x 24 x 7 operations. Not to mention that expecting a Windows PC, the platform for OPC, to run forever without a memory leak or other need to reboot is probably overly optimistic for mission critical operations.
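The maintenance-window idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Server` class, the node names `opc-a` and `opc-b`, and the `rolling_patch` function are hypothetical, and the "patch" is just a flag rather than a real update. What it shows is the ordering: one node of a redundant pair is taken out, patched, and returned to service before its partner is touched, so at least one node is serving at every step.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical stand-in for one node of a redundant OPC server pair."""
    name: str
    patched: bool = False
    in_service: bool = True


def rolling_patch(pair):
    """Patch each server in turn; return how many nodes were serving after each step."""
    serving_log = []
    for server in pair:
        server.in_service = False   # take one node out for its maintenance window
        serving_log.append(sum(s.in_service for s in pair))
        server.patched = True       # apply the (already tested) patch and reboot
        server.in_service = True    # return it to service before touching its partner
        serving_log.append(sum(s.in_service for s in pair))
    return serving_log


pair = [Server("opc-a"), Server("opc-b")]
log = rolling_patch(pair)
print(log)                              # number of serving nodes after each step
print(all(s.patched for s in pair))     # both nodes end up patched
```

The log never drops to zero, which is the whole argument: the individual machine gets its downtime, the service does not.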

Comment from Anonymous Coward
Time: March 26, 2007, 4:51 pm

Dennis, as a security practitioner who is trying to learn more about SCADA and the emergence of the two practices into some sort of hybrid, I am also listening intently. If redundancy is not the answer to availability, what is? Are you seriously, honestly saying that an unpatched Windows PC is the model of availability? It’s easy to decry the state of affairs and others’ perspectives… instead, let’s ante up some thoughtful input that we can all discuss and learn from.

Comment from Ron Southworth
Time: March 26, 2007, 8:56 pm

Gentlemen, I think Jake made a valid point: it is not necessarily a question of redundancy but of elimination or mitigation of single points of failure. This does not necessarily translate into duplication of systems but can translate into selection of technology – selecting proven devices and using proven methodologies.

Patching an OS, whether it is Microsoft, Linux, VMS, UNIX or whatever, has to be performed using a change-managed approach, despite the claims that control systems lack good change management. I would actually disagree with that claim as a generalised statement: change management is usually performed reasonably well, otherwise process systems would not work, and the pressure to minimise downtime is ever present.

Dale, AC: What percentage of the total vulnerabilities released by CERT or patch updates actually translate into a high-risk vulnerability on a SCADA system if it is structured to defence-in-depth best practices?

The interconnectivity between the corporate information systems or other interconnections still seems to be the focus of much of the energy of discussions. I have yet to see a reason why the number of ports and services that I see open on a system boundary would really be required if the system was designed with defence-in-depth principles uncompromised.

At this point in time I cannot see the fragility of a control system, from a security perspective, changing: the closer to “the plant” security is focussed, the greater the need to verify and validate that an appliance meets availability requirements.

We have taken the decision at my present daytime enterprise to continue using deterministic systems (analogue or pneumatic technology) as much as possible to eliminate these sorts of issues with field devices until they are resolved; the risk of unreliable operation is just not worth it. More people need to take an affirmative decision on RISK assessment and not forsake quality for short-term commercial gain.

Comment from Bryan Singer
Time: March 27, 2007, 12:25 am

OK.. Dennis and Dale.. we started a firestorm on the ISA-99 email lists today on this redundancy topic as well. Some interesting points came out and bear repeating as well.

1- All businesses are not alike. I have seen very few process control operations that require anything above about 95-98% per year availability. Many process control operations SAY they are high availability (24×7 every day of the year), but then you dig deeper and find out they have mandatory shutdowns, maintenance cycles, cleaning cycles, etc.

2- Even at 99% available, a plant would still be down 3.65 days per year, leaving plenty of time for a patch management process to run as often as once per quarter, sufficient time if you are practicing other solid defense in depth strategies such as limiting Internet access, limiting protocols, etc.

3- For most operations, if your availability is less than or equal to around 99% (more than suitable for many plant operations), then there are few other excuses remaining.

4- For true high availability requirements, those that ISPs call “5 nines” (99.999) or “The rule of 4 nines” (99.9999), or anything greater than 99% for that matter, that does not excuse operations for not patching… the risk is just too great. As was mentioned on the ISA-99 list… many say they don’t want to patch because of the legitimate risk of losing a US$10,000,000 batch due to patch failure. This is indeed intimidating, but look at an asset owner I dealt with recently that lost $40-50M in about 8 hours due to an issue that a patch WOULD have prevented.

5- In this situation, I agree with Dale… you are fooling yourself if you think you shouldn’t patch. What we should be looking at is parallel or redundant systems. In my time in hospitals, telecommunications, and some true 24×7×52 (every day) environments, we really did need 100% uptime. We created service level agreements in the 99.999 to 99.9999% range, and then built redundancy and parallel systems to allow us to shut down one system, patch it (after testing in a testbed), run parallel processing or system tests, then switch over and patch the old system. This can be done; I did it on a near-daily basis without shutdowns for about 18 months. Now, this *was* an IT space, but we were using lots of unusual protocols in some of these situations, very similar to our industrial “non-standard” protocols.

6- The reason many people fight patching, IMHO, is that they feel that they can’t justify the additional exposure. They look at it as hardware and support costs, but not from a business continuity or risk mitigation standpoint. Redundancy does not have to be hard, and it does not have to be universal in many systems. Usually, only a few “critical” systems are truly required to be redundant, leaving some creative solutions with spanning trees, an extra switch or two, and an extra server or two. Of course, this is not necessarily the rule and each situation is unique. Nevertheless, I’ll draw another example from a batch processing plant where they lost a $20M batch because of poor network design and a critical system that failed without redundancy. In their case, only about $50k would have been required to implement the required redundancy at the critical system. Makes me scratch my head as to which one was really cheaper.

7- One person did, correctly, point out that redundancy is typically aimed at random hardware failure, while patches and performance are tied more to systematic behavior. I agree to a point, but security breaches are indeed a semi-random event (anyone that watches IDS logs knows that there is not as much entropy as we often think in security attacks). They also pointed out that some failures are not immediately obvious, making things more difficult to detect and recover from. In these potential situations, parallel processing for a suitable time is a must. This is where the REST of the security person’s job takes over, however. Incident Response and business continuity plans take over here to represent necessary and required discipline in our environments.
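For anyone who wants to check the availability arithmetic in the points above, here is a small Python sketch. It assumes a 365-day year and treats availability as a simple percentage of calendar time; the function name `downtime_hours_per_year` is just an illustrative choice. At 99% it gives 87.6 hours, which is the 3.65 days quoted in point 2.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year


def downtime_hours_per_year(availability_percent):
    """Allowed downtime in hours per year for a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Availability levels taken from the discussion above.
for pct in (95.0, 98.0, 99.0, 99.999, 99.9999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% available -> {hours:.4f} hours/year "
          f"({hours / 24:.4f} days, {hours * 60:.2f} minutes)")
```

The gap between 99% and 99.999% is the point: the first leaves days per year for maintenance, the second leaves only a few minutes, which is why the higher figures only work with parallel or redundant systems.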

The patch argument is an emotionally charged and heated one in this space, and for good reasons. As anyone that has heard me present will say… I mention in every speech that asset owners can attribute more downtime to patching than they can to security events…. likely this is true. But, I came out of several industries, including the military, where failure is not an option, including cyber. I imagine that no one wants to see the blue screen of death or require a system reboot in the Future Combat System or in airplanes, etc. In any high availability environment, many other fields understand redundancy, parallel processing, and failover systems. We certainly could borrow at least SOME of that knowledge.

Some processes will never be able to justify the additional project expenditures from an ROI perspective (where the cost to mitigate the risk exceeds the cost of the risk event); for those, OTHER mitigating controls (low cost / no cost, such as change management procedures, backup procedures, etc.) should be selected. If your operation is at or below about 99% available, then it is likely that many other mitigating controls are suitable. Above that, and in high currency loss processes (such as pharmaceutical batch operations), engineers and project managers should seriously look at the risk of the system without mitigating controls as compared to uptime and redundancy needs. I have seen lots of companies lose TONS of money, and then scratch their heads over how they became a statistic. This ends up being a simple risk question… are you willing to accept a multi-million dollar loss to save some tens of thousands in the project? :)
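That risk question can be put into a rough calculation. The Python sketch below is a simplification, not a risk methodology: it compares the mitigation cost against the expected loss (probability times loss) for a single event over whatever period the probability covers. The $50k and $20M figures are the batch plant example from point 6; the function names and the probabilities passed to `worth_mitigating` are hypothetical inputs for illustration only.

```python
def break_even_probability(mitigation_cost, loss_if_event):
    """Event probability above which mitigation costs less than the expected loss."""
    return mitigation_cost / loss_if_event


def worth_mitigating(mitigation_cost, loss_if_event, event_probability):
    """True when the expected loss exceeds the cost of the mitigation."""
    return event_probability * loss_if_event > mitigation_cost


# Figures from the batch plant example: about $50k of redundancy vs. a $20M batch.
p = break_even_probability(50_000, 20_000_000)
print(f"break-even probability: {p:.4%}")

# Hypothetical probabilities, chosen only to show both sides of the threshold.
print(worth_mitigating(50_000, 20_000_000, 0.01))    # 1% chance of the loss
print(worth_mitigating(50_000, 20_000_000, 0.001))   # 0.1% chance of the loss
```

On those numbers the redundancy pays for itself if the chance of losing the batch is anything above a quarter of one percent.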

Comment from Bryan Singer
Time: March 27, 2007, 12:30 am

Oh, and as an add on component… To me there is no “answer” to patching, it is always a question of acceptable risk. But, a solid patch management program would often look like this:

1- Network defenses … reduce the number of protocols through a shop floor firewall to a bare minimum
2- IPSec tunnels for complicated protocols or additional security through the firewall… again, lower the attack surface as much as possible
3- Relationships/support with vendors for patching information and procedures
4- Test bed for patches
5- change management, backup and recovery programs and procedures, tested periodically to ensure correctness and recovery time performance
6- Service level agreements with support staff in IT (if they manage your network) and vendor support
7- parallel or redundant systems at the point of critical systems. Usually not every system has to be redundant, but perhaps a fully redundant network isn’t a bad idea
8- quality, spc, or other data trending systems to detect abnormal process changes during a parallel run

Just some thoughts on the top of my head, but to me this is a situation that can be mitigated. There is certainly cost associated with it, but risk management should determine if it is justifiable or not.

Comment from Ralph Langner
Time: March 27, 2007, 5:16 am

Bryan, I think your argument on ROI is very valid and seems to counter patching and redundancy in some situations. I agree that there are cases where patching results in more downtime than security events. What we do in situations like this is suggest, as you mentioned, protecting the respective (unpatched) systems with other controls such as industrial firewalls, which usually results in good protection for less than $1,000 per unit.

At the end of the day, it all boils down to the ROI question, and if I got Dale right, this is where he says that asset owners are fooling themselves. Some of them don’t believe risk assessment and risk management are worthwhile for SCADA security, and some don’t accept the cost for a risk assessment. Which may lead to bizarre situations.

Here is an example. A client in the automotive industry is logging quality data, which is a requirement for this specific product. Ok, now everything in industrial IT must be cheap. So an outdated office PC was found, and some one-man-show subcontractor wrote the software (which was very cheap). As bad luck had it, the software crashes every now and then for no apparent reason. The subcontractor is out of business and can’t fix things. No one else can because there is no documentation. This unstable application runs only on NT4, so OS updates are out of the question. For data storage, a guy from quality management walks down to the plant floor once a week, where the dust-covered PC is residing in an unlocked cabinet, to copy the data files to CD. If the application (or the hard disk) crashes, products must be retested.

Now there is one smart guy in the company that figures out this is probably worth improving. Guess what, the same company is running a sexy IBM system i5 with redundant disks, nightly backups and stuff for their commercial applications. Wouldn’t it be a good idea to use the reliability of this box for their quality data? You bet! Technically it’s not a problem. Cost is lower than any hard disk crash of the dreaded NT box. Further cost will be cut as the once-per-week manual CD backup is no longer necessary. Added benefits are that the quality guys could browse through data online and create fancy reports. Not to mention that they could get rid of the (networked) NT box.

Our smart guy has been fighting with management for about a year to get the funding for this project. This is what I would call “fooling oneself”.

Comment from Jake Brodsky
Time: March 27, 2007, 7:28 am

There is nothing so permanent in large organizations as “temporary.” I fight tooth and nail to keep “temporary experiments” from staying in the field. In my experience, the “fooling oneself” argument is usually the end result of someone’s experiment that tried to stay under some boss’s radar.

Sometimes we have to resort to such measures when upper management is in flux or really needs a clue. However, with the use of such “experiments” comes the responsibility to document what was tried, what worked, and what didn’t. When the results are understood, we must replace the experimental system with a maintainable model.

That last step is where people really get in trouble. It is my experience that some managers deliberately try to stay blind to weak links in the hope that they’ll get promoted before there is any need to address the problem. I have no words to express my utter disgust for such morons.

It’s true, such managers do exist. However, we shouldn’t make policies or proclamations based upon such behavior. Instead, we should do all we can to push them out of the way.

Comment from Ron Southworth
Time: March 27, 2007, 11:15 am

Jake, the phrase “plausible deniability” is one that I have heard a manager use frequently in the past, and I am certain there are others who would not say the phrase out loud but would have it as a motto. The other aspect of temporary installations I see quite often, Jake, is where the quantity of patches and rework on a site has caused it to become untidy, to put it at its best. Temporary installations end up being permanent.

Bryan, thanks for sharing the discussion on availability. Process control systems with planned shutdowns in effect have availability requirement figures of perhaps less than 99%, as you rightly say. The reality is that finding resources to be available during a shutdown to perform patching, when there are so many other demands on staff, can be one of the challenges.

Typical utility availability requirements would start at 99% and go to the magical 99.9999% that you speak of. The mean target figure I hear people setting would be 99.98%.

These availability figures, we need to remember, represent part of 1 hour to a figure of up to 8 hours per year. Not a lot of time to patch without an effective change management methodology. Replication of control centers (triple replication is becoming popular for larger organisations and is something I am investigating as a serious option for our next update) adds to the complexity of the head-end system and communications support systems. As you say, these can be mitigated against if the culture permits.

The justification of risk or exposure to the public or to management is the interesting part of the equation. I have seen so much risk borne by management with the view that they are not around for a long time, so risk assessments often overlook the likelihood and the consequence aspects; in effect, the risks are treated as worth the return on investment whilst they are in office. Not all organisations do it, but more and more are being pushed down the familiar road.

There is a lot of good meat in the SP99 draft – keep the lively debate up, as the current draft is quite good and the work needs to be commended to all those that have participated.

Comment from Bryan Singer
Time: March 27, 2007, 6:16 pm

One thing that strikes me.. found in some criticism in these discussions leveled either way… is that there are GOOD practices and recommended actions that SHOULD be done… but there are no absolutes in security. Rather, there is an acceptable level of business risk. Organizations, including vendors, asset owners, integrators, and consultants, all have responsibilities… but, at the end of the day, the asset owner and their company have ultimate responsibility for site and systems security (as established by legal precedent).

From my perspective it is critical that asset owners understand the TRUE risk of patching versus not patching, and balance that against whether or not redundancy or parallel processing should be introduced into the system. It may not be an easy choice, but one that needs to be made…. and not one that should just be waved off as someone else’s issue….

I have just seen far too many people suffer catastrophic project and system failures under the auspices of “patching is too risky” to simply say they should wait for vendors to give them better processes. Vendors will not do that until they get industry direction and requirements heading that way… beyond the meetings we tend to have at this level. The first time a vendor loses a $40M project because they don’t have security features required in a procurement spec, maybe they will listen. Having come from a vendor, it’s hard to find people that are willing to own such critical issues as patching unless there is a product benefit tied to it….

Comment from Ralph Langner
Time: March 28, 2007, 4:48 am

Bryan, whether we’re talking about patching, redundancy or even performing a risk assessment, my experience is that management all too often buries its head in the sand when the security issue comes up. Why? Because it’s IT, and because they don’t see more or better products coming off the assembly line after controls have been deployed. The phrase “plausible deniability” nails it all too often. Reminds me of the feature movie “Training Day”, where Denzel Washington puts it this way: “It’s not what you know, it’s what you can prove”.

I think Bryan is correct in determining that the ultimate responsibility lies with the asset owner. I have experienced all too often that it is a waste of time to tell vendors how they can make their products more secure. Just look at the absurd industry acceptance of the dreaded Modbus protocol. It is probably not well known in the US, but one “modern” alternative to Modbus (PROFINET CBA) that is pushed by a European standards body (PROFIBUS foundation) is based on DCOM. Kiss your firewalls goodbye! One issue that we are pushing is teaching asset owners how to incorporate security features in their procurement specs, based on the good stuff from INL and others.

Comment from Jake Brodsky
Time: March 28, 2007, 2:33 pm

Ralph, I think we’re approaching a violent agreement from opposite ends of the spectrum. :-)

There are two issues swirling around here. First: where is redundancy appropriate? Once upon a time, I can recall using redundant TI 565 PLC gear because it was redundant and it must be more reliable, right? Nope. The redundancy was mostly a waste. In far too many cases we’d actually have problems related to the redundancy switching, and never once in the whole history of using this system did we ever encounter a situation where the redundancy actually saved us from a component failure.

Instead we learned to spread the functions among various controllers so that the failure of any one controller or I/O rack wouldn’t take out the process. I call this graceful degradation. Our new SCADA system is designed this way too. Operators can manually redirect tag service from one server to another. The servers are physically scattered all throughout our telecommunications network. No one site failure can take out so much that we can’t recover from it.
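To make the idea of redirecting tag service concrete, here is a toy Python sketch. It is not the actual system described above: the tag names, site names, and the `redirect_tags` function are all hypothetical, and the round-robin hand-off is an arbitrary policy picked for illustration (in the system described, operators make the redirect manually).

```python
def redirect_tags(assignments, failed_server, survivors):
    """Return a new tag -> server map with the failed server's tags moved to survivors."""
    if not survivors:
        raise ValueError("no surviving server to take over tag service")
    updated = dict(assignments)  # leave the caller's map untouched
    orphaned = sorted(tag for tag, server in assignments.items()
                      if server == failed_server)
    for i, tag in enumerate(orphaned):
        # Round-robin is an illustrative assumption, not the real hand-off policy.
        updated[tag] = survivors[i % len(survivors)]
    return updated


# Hypothetical tags spread across three physically separate servers.
assignments = {
    "pump1.flow": "site-a",
    "pump2.flow": "site-a",
    "tank1.level": "site-b",
    "valve3.pos": "site-c",
}
after = redirect_tags(assignments, "site-a", ["site-b", "site-c"])
print(after)
```

Losing one site moves its tags elsewhere and leaves everything else alone, which is the difference between degrading gracefully and failing over a complete duplicate.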

Contrast this with the great big holy control room of everything where a building fire or an employee going postal can bring the whole thing to a grinding halt. Our system will continue to work, even with massive failures of our telecommunications network. The nearby plants will simply load their local control center software. Whatever Master sites they can “see” they’ll manage locally.

Our solution to this problem is not to have a complete backup, but instead to allow for handoffs to other segments of the system. It is very survivable in the face of all sorts of terrible disasters. And after all, isn’t that why we do this exercise?

Comment from Ron Southworth
Time: March 28, 2007, 5:12 pm

Sounds like a good method of distributed control, Jake. “Defence” in depth! I think this is at the heart of the philosophy for control systems. It is not just security; it is more about business continuity. As you have stated elsewhere, if things get real ugly, the system degradation should not mean that manual control or site control is not possible (it just means you have to put people back out in the field). The finesse is in how many layers you can place in the system to facilitate transitional problems. I have seen a lot of redundancy systems that have had a number of flaws as you have described, Jake (a certain security control system used in a number of jails springs to mind). It can be that for a certain risk profile the single-point device can be more reliable than adding redundant devices. Application and implementation alternatives then come more into play.

Comment from cnioperator
Time: March 29, 2007, 4:59 am

Guys, just came back from a trip to find this enormous thread on Dale’s blog.

Good to see such interest.

For me (and my company) patching of control systems is a solved problem.

Our control systems have the requisite level of redundancy to meet our plant availability requirements. This allows for patching to take place, especially at the level in our networks where we find Windows OS.

In addition, we protect our systems with firewalls, procedures, continuity plans, etc.

If anyone in my company tells me they can’t patch because they have old unsupported systems, the answer is clear… get them upgraded!

It’s not acceptable to run critical systems on unsupported hardware/software.

It’s not acceptable to run critical systems with single points of failure.

It seems clear reading this thread that my industry (Oil & Gas) may not be typical, but this surprises me; good security is (for me) just good engineering.

Comment from Ron Southworth
Time: March 29, 2007, 7:46 am

Hi cnioperator, you may just be fortunate in where you are working. There are places that you find are exemplary in their engineering practices and there are others that are not. I am fortunate (or perhaps not) to work in one sector of the industry and to also have the opportunity to teach and consult with many areas.

I find that it is a mixed bag for certain but the vast majority have some level of cultural issues getting in the way of the sorts of practices we speak of. Sometimes these are not always apparent on the surface or on first glance!

I don’t disagree that it is not acceptable, but the problem is there and as a community we need to work towards changing things for the better.
