More on the Incident at Browns Ferry

The NRC report on the Browns Ferry cyber related incident is fertile ground for discussions and learning. Some have accurately pointed out this is not a unique incident, but it is one of the few that is publicly documented.

  • Nowhere in the report do they discuss correcting the software flaws in the PLC that failed and may have sent the data storm, nor in the VFD controller that couldn’t handle a data storm on a 10Mbps network. Compensating controls (firewalls) are mentioned and are a good interim step and defense in depth, but the vendors should have issued a patch or firmware upgrade to fix the software flaws that resulted in an exploitable vulnerability. This incident occurred on Aug 19, 2006. Unacceptable.
  • How many other critical installations have the same vulnerable PLCs and VFD controllers? Vulnerability disclosure is always a contentious topic and probably more so here. Was the vulnerability reported to US-CERT? Has it been now that the incident is public? Certainly Browns Ferry can’t get any more bad press than it will already get. This will be the next Maroochy Shire in every SCADA security presentation for the next five years. US-CERT is very good at placing pressure on vendors to correct problems, with the possibility of a vulnerability note looming, and is an independent arbiter of what should be disclosed.
  • US-CERT rarely publishes new information in a vulnerability note if a patch is not available. The exception is if the vulnerability is likely to be exploited because it is so simple or common and no patch is forthcoming. This allows the affected end users to implement compensating controls and consider replacing or ceasing to use the vulnerable device or application. Sounds like this may be the case here.
  • It appears that post-incident Browns Ferry installed a network firewall to segment the control network from other networks. This is obviously a good practice recommended by all in the SCADA security community, and I would argue it is now at the standard of due care level.
  • Browns Ferry apparently deployed a firewall in front of each vulnerable PLC and VFD controller. Given the apparent absence of a patch from the vendors this is prudent. Byres Security (Tofino), Innominate (mGuard), the Sandia OPSAID team, and other field security device vendors must be encouraged by this data point.
  • There have been discussions on the design of the system, particularly the safety system. I’ll leave that to others who are better qualified; a man’s got to know his limitations.

Comments

Comment from Jake Brodsky
Time: May 20, 2007, 10:58 pm

Dale, no control engineer worth a damn should be designing a real time network without evaluating whether there is sufficient bandwidth under all possible conditions. We design machine to machine (M2M) networks so that they have no connection to anything else. We’ll use crossover cables where possible, or very small, dumb switches with a limited number of ports.

Yes, we do need to perform peer to peer PLC communications from time to time, and yes, we do use Ethernet. But that Ethernet is very carefully administered. The switch, if we use one (and not a crossover cable), is powered by the same source that powers the PLC it goes to. It sits in the same cabinet with clear designations that show what it is and why it shouldn’t connect anywhere else. The PLC will alert us through other ports if anything in the M2M link fails.

I do not believe this is a valid application for a firewall. When it comes to M2M Ethernet, there should not be an outside connection. Putting a firewall on a thing like this is like putting a bandage on someone with a terminal disease. It doesn’t solve a thing. Furthermore, it would be another failure point.

If people are serious about M2M communications, they MUST maintain an isolated network. This is why I don’t think wireless communications will ever be a valid media for applications like this. The potential for denial of service is simply too scary to deal with.

Comment from Dale Peterson
Time: May 20, 2007, 11:42 pm

Jake,

My reading of the NRC report, and there is a bit of ambiguity, is the bandwidth was not the problem. The problem appears to be the VFD controller could not process packets at the rate they were presented which was above normal. The abnormal packets may or may not have come from another PLC in a fault condition.

Since this was only a 10M Ethernet LAN, the data rate was likely less than 6M and probably much less than that. A controller failing at that rate is truly pathetic but unfortunately not uncommon. In our assessments we see many cases where the controller fails prior to the bandwidth being saturated, even on 10M LANs.

My guess is they added a firewall or router ACL with a simple rule saying drop all broadcast traffic, as well as some least privilege rules. Any firewall or router would have no problem keeping up with this at 10M or 100M. It is a workable compensating control. That said, if I were Browns Ferry I’d be raising hell with the vendor to fix the protocol stack vulnerability and considering other alternatives.
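
To put the 10M and 6M figures in terms of what a controller actually has to process, the rough arithmetic below converts a bit rate into frames per second. It is only a sketch: it assumes every frame is a minimum-size 64-byte Ethernet frame (the worst case for the receiver), which the NRC report does not state, and the function and constant names are made up for illustration.

```python
# Ethernet per-frame overhead on the wire, in bytes.
PREAMBLE_AND_SFD = 8   # 7-byte preamble plus 1-byte start frame delimiter
INTERFRAME_GAP = 12    # minimum idle gap between frames, in byte times
MIN_FRAME = 64         # minimum Ethernet frame size, including the 4-byte FCS


def frames_per_second(bits_per_second: float, frame_bytes: int = MIN_FRAME) -> float:
    """Return how many frames of the given size fit into the given bit rate."""
    wire_bytes = frame_bytes + PREAMBLE_AND_SFD + INTERFRAME_GAP
    return bits_per_second / (wire_bytes * 8)


if __name__ == "__main__":
    for label, rate in (("10M line rate", 10e6), ("6M load", 6e6)):
        print(f"{label}: about {frames_per_second(rate):,.0f} minimum-size frames/s")
```

Under that assumption a 10M link tops out just under 15,000 frames per second and a 6M load is just under 9,000, so a device that fails below that level is failing on a per-packet processing budget rather than on raw bandwidth.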

Comment from Ron Southworth
Time: May 21, 2007, 6:59 am

Hi Dale, What Jake speaks of is something that I would agree with as a first principle in a control systems network design and his methodology has a lot of merit.

Certainly the VFD drive protocol stack has a problem from the sound of the report. Also probably not at the beginning of product cycle so fixing the drive may not be such an easy task. The first principle question to me is how did the problem affect two drives? This should not happen with this sort of plant and seems to be totally brushed aside in the report. A firewall used as the mitigation does not sound like it should have been acceptable as a risk treatment to me. Glad I live nowhere near that plant!

With respect to this particular case we don’t know the exact structure of the network design or all the factors surrounding the case so we are speculating somewhat & I am prepared to be corrected.

After reading the report, I find it raises a lot of questions for me personally with respect to control systems design standards and best practices, especially for the plant type concerned.

We have a project on the boards at the moment that would benefit from having the ability to learn from this sort of situation. Joe Weiss indicated that this type of problem has occurred in other instances.

I am really struggling to understand how the systems are being operated without a controls guy doing his lolly. Come to think of it, why am I so surprised?

With respect to “wireless” systems in plant applications the suggestion that they be used for control based applications is not being seriously considered... yet.

On the lighter side… I have heard the mitigation strategy for denial of service attacks for wireless systems is to counter with a denial of life counter measure! Ballistics aimed and launched at the jamming source!

Wouldn’t it be cheaper to just put the structured infrastructure into the plant and just support it vs the risk?

Keep smiling

Comment from Jake Brodsky
Time: May 21, 2007, 7:40 am

I inferred my position because of this sentence from the report:

“The licensee could not conclusively establish whether the failure of the PLC caused the VFD controllers to become nonresponsive, or the excessive network traffic, originating from a different source, caused the PLC and the VFD controllers to fail.”

Excessive traffic originating from a different source? What different source? There shouldn’t be any other sources.

This is why I think a firewall is not the answer for this problem. This network should be completely isolated. This is a Machine to Machine network. Unless you can tolerate a disruption of such control, nothing else should be on this network.

We use Ethernet media in control systems because it’s a very simple media that anyone can diagnose with off the shelf test gear. However, that doesn’t mean we should use it like any other Ethernet. I treat these networks as if they were some form of serial communications creature. You still have to treat it in a deterministic and reliable fashion.

Comment from Ron Southworth
Time: May 21, 2007, 9:34 am

Hi Jake,

I must confess that I am having real trouble understanding this picture, as it is somewhat incomplete.

The how, when, where, and why regarding this scenario. The discussion section of the report reinforces what we have been speaking of but did not seem to indicate a clear need for corrective actions.

Jake I believe that you are correct with your concept of methodologies. Best practices I would have thought dictate that you need to maintain and operate a process control network that facilitates deterministic operation in all modes of operation and without single points of failure. At least that is what I was taught. This raises more questions than it answers but does give a good example to use to highlight the differences between PCN and a corporate network.

Comment from Julian L. Rrushi
Time: May 21, 2007, 11:20 am

—–BEGIN PGP SIGNED MESSAGE—–
Hash: SHA1

In my opinion there are two (probably identical) problems here, namely: a problem, say A, which caused a packet storm in the control network; and problem B which caused the variable frequency drive controllers to become nonresponsive.

Taking into account the information given in the NRC report, the “problem B” seems to reside in variable frequency drive controllers… a fact which opens the way to the thesis of a vulnerability in those controllers.

Nevertheless, in my opinion we can’t be absolutely sure that it was a protocol stack vulnerability which forced the variable frequency drive controllers to become nonresponsive. One piece of information which could definitely act as a clue to our understanding of the situation is whether the failure was transient or permanent! I’ll explain myself.

Both variable frequency drive controllers and condensate demineralizer ones were running proprietary code. In some cases devices in a control network are equipped with an implementation of a bucket algorithm, like the token bucket algorithm or leaky bucket algorithm, for instance. This would regulate the amount of bytes sent to the network. On the other hand, devices in a control network may perform rate limiting, which means that excess packets may be discarded according to some policy. In theory, under these circumstances a packet storm is not likely to happen since sending devices will shape bursty traffic. In practice a sending device may experience faulty conditions, consequently the bucket algorithm may not be applied leading to creation of a packet storm. Obviously such an observation holds if we assume that the packet storm was generated by a legitimate PLC under a fault condition.

Under the assumptions described above, the variable frequency drive controllers may have performed rate limiting. Due to the packet storm, most of the frames holding operational commands may have been discarded. In this case problem A and problem B are the same vulnerability, namely the cause of the packet storm. Furthermore, such a vulnerability resides in the PLC which caused the packet storm. This is to say that the problem does not reside in the variable frequency drive controllers. Always under the assumptions described above, if there was a way to stop the packet storm, after a while the variable frequency drive controllers would become responsive.

If the failure was permanent, i.e. stopping the packet storm would not help in having frequency drive controllers become responsive, then problem A and problem B are distinct, and problem B resides in frequency drive controllers.

Thanks!

Regards,
Julian L. Rrushi

—–BEGIN PGP SIGNATURE—–
Version: GnuPG v1.4.2.2 (GNU/Linux)

iD8DBQFGUbiorq0d5u53c2QRAkSRAJ9JFGkcg5E/oyaBJ25IwEk2RVkPpACgi14p
pDh8ZGrD31XPwmw3S6kAPGM=
=KEoY
—–END PGP SIGNATURE—–
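
The token bucket Julian mentions can be sketched in a few lines. This is a generic illustration of the algorithm only, not the proprietary code in any device at Browns Ferry; the class name, parameters, and the use of an explicit `now` timestamp (instead of a real clock) are choices made here to keep the example deterministic.

```python
class TokenBucket:
    """Token bucket: permits bursts up to `capacity` bytes and `rate` bytes/s sustained."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate          # tokens (bytes) added per second
        self.capacity = capacity  # maximum tokens the bucket can hold
        self.tokens = capacity    # start with a full bucket
        self.last = 0.0           # timestamp of the previous call, in seconds

    def allow(self, size: int, now: float) -> bool:
        """Return True if a frame of `size` bytes may be sent at time `now`."""
        elapsed = max(0.0, now - self.last)
        # Refill according to elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # not enough tokens: the frame is held back or dropped


if __name__ == "__main__":
    bucket = TokenBucket(rate=1000, capacity=500)
    print(bucket.allow(500, now=0.0))  # burst fits in the full bucket
    print(bucket.allow(1, now=0.0))    # bucket is empty, frame refused
    print(bucket.allow(500, now=0.5))  # half a second later the bucket has refilled
```

The same shape applies on the receiving side as rate limiting: frames beyond the budget are discarded, and once the excess traffic stops the bucket refills and normal frames pass again, which is the transient behavior described above.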

Comment from Anonymous
Time: May 21, 2007, 11:34 am

An engineer being “worth a damn” should never be a factor in determining if implementing a specific vendor’s product will or will not adversely impact other devices on the network - M2M or otherwise - during unknown scenarios.

The IT industry struggled with this very argument about a decade ago when things like Code Red and other mass badness started hitting Windows servers that were connected to the internet. Common refrains were “No administrator worth a damn would ever yada yada yada”. While true, this still did not prevent the situation from occurring as, like it or not, every industry (IT, SCADA, fast food, etc.) is full of very skilled workers, marginally skilled workers, and people who lack skill.

What the vendors need to do is provide compensating controls for the weakest link in any implementation of technology. Call them the end user, system operator, network architect, engineer, whatever - they are human and WILL make mistakes sooner or later.

I believe that Dale and company are very right in insisting that equipment vendors take all reasonable measures to ensure they are not contributing to and compounding the problem when the human that is working with their gear makes a mistake. Will there be unnecessary FUD and misdirected funds at ineffective solutions? Absolutely. However, this is a necessary step in the maturity of any line of business as it seeks the most effective solutions for long-term industry benefit.

Unless and until people move back to true “air gapping” of SCADA networks as the previous poster insists on having, vendors MUST harden their systems with interoperability in mind (oxymoron?) before coming to market as we humans will always make mistakes.

Comment from Bryan Singer
Time: May 21, 2007, 12:29 pm

How exactly is a firewall supposed to prevent this? I actually would be MORE concerned with the firewall in line to handle excessive traffic, not less. This is a case for managed switching, VLAN’s, and IGMP snooping, not for firewalls. In fact, I had a case where someone did not have the switch under the firewall properly configured, and the firewall became the point of DoS when it was handling the burden of all the broadcast traffic. Completely isolating SCADA networks is quite often not an available solution.

I agree that isolation may be ideal, but there is still the matter of good design:

- Put your higher risk components (such as VFD’s, IO points, etc), on their own VLAN’s and switching hardware.
- Use MANAGED SWITCHES. The case for the 2am operator having to change out switches and not wanting them to have to configure the device is quickly diminishing. I would never recommend using unmanaged devices in any control network today, the risk is too great.
- Use IGMP snooping on control networks.
- Separate or remove high broadcasters such as cameras and VOIP.
- Isolate traffic types as much as possible. I often design networks with the HMI, PLC, and I/O traffic all on their own VLAN’s
- MONITOR YOUR NETWORK. This is key. Good network admins could quickly detect if they have an errant device malfunctioning.
- Validate and document the network infrastructure. Ensure there is enough capacity, check media lines, and document everything. I have seen what should have been a 2-3 minute fix to reconfigure a device take hours while someone hunted it down. I am always amazed when people don’t have solid network drawings up to date and on hand. This is the case for good asset inventories and network drawings.

I don’t know.. it is easy to play “armchair admin” here and second guess everyone at Browns Ferry (which, as it happens, is in my back yard in North Alabama). I’m also guessing that the folks there are (correctly) staying somewhat tight lipped about this. My guess is that they have applied anti-itch cream to a sucking chest wound here.

Firewalls *might* help separate an errant device from communicating, but I would guess that they have far more significant network design issues. I have looked at network designs for several hundred plants in the past couple of years, and have seen some very consistent “mass-badness” situations just waiting to happen, and far too many folks just “live with it” or try and justify why they are using outdated hardware, improperly designed networks, etc, and just band-aid it with some appliance someone sold them when it fails. The cost for managed gear is getting much better these days, and the difficulty in configuration can be as simple as flash memory cards today.
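
As one illustration of the “monitor your network” item above, the sketch below counts packets per source over a sliding time window and flags any source that exceeds a threshold. It is a toy model under stated assumptions: the source names, threshold, and window length are invented, timestamps are passed in explicitly, and a real deployment would take its input from switch counters or a capture tool rather than from calls like these.

```python
from collections import defaultdict, deque


class TalkerMonitor:
    """Flag sources whose packet count over a sliding window exceeds a threshold."""

    def __init__(self, max_packets: int, window_s: float) -> None:
        self.max_packets = max_packets
        self.window_s = window_s
        self.seen = defaultdict(deque)  # source -> timestamps still inside the window

    def observe(self, source: str, now: float) -> bool:
        """Record one packet from `source` at time `now`; True means over the limit."""
        stamps = self.seen[source]
        stamps.append(now)
        # Discard timestamps that have aged out of the window.
        while stamps and now - stamps[0] > self.window_s:
            stamps.popleft()
        return len(stamps) > self.max_packets


if __name__ == "__main__":
    monitor = TalkerMonitor(max_packets=3, window_s=1.0)
    for t in (0.0, 0.1, 0.2, 0.3):
        if monitor.observe("plc-a", t):
            print(f"plc-a exceeded the packet budget at t={t}")
```

Even something this simple, fed with real traffic data and paired with up-to-date network drawings, would narrow down which device to go and look at.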

Comment from Jake Brodsky
Time: May 21, 2007, 2:30 pm

Julian’s point is well taken. Too bad we can’t always afford it.

I’m sorry, large VFDs are very expensive. Few can afford to replace them as quickly as the IT industry comes up with new standards. The best we can hope for is a firmware upgrade and some very careful and expensive validation tests.

This is why I am so adamant about keeping these M2M networks isolated. If companies could replace motor controls at the whim of a security threat, we wouldn’t be discussing this.

Comment from Ron Southworth
Time: May 21, 2007, 5:25 pm

Hi Everyone,

I tend to believe that there is a larger issue of network design that has not been clearly identified as a contributing element of the situation, as Bryan indicated. Bob contributed some material on the SCADA mail list that is quite worthy of reading as well. I certainly don’t think we have all the facts, and in some aspects we probably never will, which is a shame as this would be a good example for justifying a number of improvements for the industry all round.

Comment from Matthew Franz
Time: May 21, 2007, 9:25 pm

A minor point starting with a non-SCADA example from my day job…

If your [enterprise] firewall is becoming unresponsive at 140k pps, due to architectural limitations of the underlying OS with regard to interrupt handling (at which throwing a faster CPU does nothing), you find a new OS or seek hardware acceleration (i.e. ASIC) to be able to handle that rate of performing stateful inspection, or you add boxes to spread the load. This has nothing to do with [implementation] vulnerabilities.

With this in mind, back to the PLC “becoming unresponsive”: saturation of device buffers (switch, nic, switchport, IP, UDP, application layer) caused by a flood of traffic may be impossible to prevent and may not be a bug in an individual device or protocol.

Like the firewall example, it is a question of the hardware (NIC, CPU, whatever) only supporting so much data. It may also be, as Ron brought out, a network design issue. In some cases there is nothing you can do about it short of upgrading the network infrastructure, implementing QoS or whatever. It might be a security (availability/resiliency) issue, but it is not a finding/fixing security problems (vulnerability) issue.

If the OS does not crash and the device keeps doing what it is supposed to do, yet you lose the ability to monitor it (assuming a PLC in which the control loop continues to execute on the CPU Module despite the fact that the Ethernet Communication Module is being flooded, etc.), that is the best you can hope for.

Assuming this optimal case (which it may or may not be), this is simply a traffic engineering issue. However, if a single, malformed message caused the devices to become unresponsive, then Dale’s drumbeat on finding and fixing implementation flaws makes sense. Both occurred with Code Red and other worms. The Code Red worm exploitation URLs took out Cisco DSL routers and found issues with ARP bugs in Catalyst switches (both implementation fixes) in addition to filling the pipes and exposing.

See http://www.cisco.com/warp/public/707/cisco-sa-20010720-code-red-worm.shtml

Comment from Ralph Langner
Time: May 22, 2007, 3:36 am

“That said if I were Browns Ferry I’d be raising hell with the vendor to fix the protocol stack vulnerability and considering other alternatives.” said Dale. Well, if the vendor is still in business, AND is able to fix the problem. We are having the same kind of issues with outdated 10 Mbps half duplex equipment that is no longer supported by the vendor. For the asset owner, migrating to a new solution might take months, in some cases over a year.

Anyway it is a good thing to bring up this example as it shows that we have to deal with threats other than attacks.

Comment from Dale Peterson
Time: May 22, 2007, 7:11 am

Lots of good comments on this post, and the ambiguity of the NRC report allows for some different interpretations.

Ralph - agree on the time issue, but the incident occurred in August 2006 so they should have a plan to correct the root cause by now, and they might. Implementing the plan could take longer than the nine months that have elapsed.

Matt - the ‘protocol stack’ problem I mention often in these posts would also include continuing proper operation during data storms that do not saturate the network. Basically processing packets properly that reach the Ethernet interface.

Comment from Bryan Singer
Time: May 22, 2007, 9:57 am

I wonder if it was the PLC that was “non-responsive” or the switch was just overloaded. My guess is that the latter, or something similar, is the case. The PLC certainly could be DoS’d, but I have normally seen the problem to be insufficient switching or network backbone capacity. A telltale sign being that a switch just either drops all traffic, or starts broadcasting out all ports like a hub (depending on the DoS condition, and the hardware implementation of the switch).

I don’t know the device in question, but I suspect that it is not the protocol stack that is the issue. These devices are not made to be highly robust to excessive network traffic, they are designed for reliability of communications under normal conditions. Designing them to be as robust as a switch or PC requires a lot more CPU, processing power, and memory to be effective. Which is why we spend our money on good switches (or at least we should be). I can smash most PC’s protocol stack with a little careful planning, and they have a lot more processing power than your typical PLC. I defer to the earlier post, that they are looking at the wrong thing.

I would put my money on the network infrastructure, having seen this failure scenario all too often.

Comment from Dale Peterson
Time: May 22, 2007, 11:27 pm

Bryan - you could be right about the infrastructure. A 10M network that may not even be switched would unfortunately not be terribly unique in the control system world, and it certainly warrants a review regardless of where the fault for this incident occurred.

That said, my money is still on the VFD controller faulting at substantially less than 6M of traffic. Imagine if it was on a 100M LAN.

My careful reading of the report, with phrases such as “the malfunction of the VFD controller” (not all of the devices on the 10M LAN) and “threshold levels for failure of VFD controllers due to excessive network traffic, as determined by on-site testing, can be achieved on the existing 10-megabit/second network”, points to a device fault rather than an infrastructure fault.

With all these diverse, and possible, scenarios in the comments, I should put together a pool and take bets but I’m not sure we will ever be told the answer.

Comment from Jake Brodsky
Time: May 23, 2007, 7:32 am

Dale, your comment “…I should put together a pool and take bets but I’m not sure we will ever be told the answer.” pretty much sums it up for me too.

The key thing to bring to customers (or in my case, to my managers) is “this could be you.” We’ve pretty much identified a laundry list of potential causes. My guess is that the actual cause may be almost irrelevant since any of the things we’ve cited could easily do the deed.

What we all need to understand is that Browns Ferry is not unique. There may be many more systems out there just like it. And in this case, although a security survey could turn out to be the most useful, a plain old process safety review should have turned up this problem as well.

Comment from Bryan Singer
Time: May 23, 2007, 9:16 am

Couldn’t agree more, Jake… this could happen to any environment. The network infrastructure issues certainly are not unique. I’ve probably analyzed about 24 VFD failures in the past 3 years, and it is a very common situation. What I see more of is that people go and upgrade or replace components on a system, add new components on, and then wonder why the component is failing.

The careful reading I did of the document strikes me as one of two things: 1- Either people are being so tight lipped as to make the report nearly useless in terms of lessons learned, or 2- They didn’t understand the VFD well enough, or the other components, to know what happened.

I make reference to the paragraph: “The licensee could not conclusively establish whether the failure of the PLC caused the VFD controllers to become nonresponsive, or the excessive network traffic, originating from a different source, caused the PLC and the VFD controllers to fail.” Sounds to me like there was much more going on than the people onsite and those involved later were capable of determining. Unless the network records traffic, there is no way to know what was happening for sure at the time.

The challenge here is that once the incident has occurred, it is VERY DIFFICULT for us to know what was the true cause of the failure. These devices do not log incidents all that well, in many cases. And, once the failure has occurred, issues may exist that cause future tests (such as the 6M traffic test) to give results that “point to a problem”.

The network situation is not unique, but that does not make it more acceptable. I lived industrial networks for the past 5 years through dozens upon dozens of these types of issues. People think that just because the network is “working” that it must be healthy. Outside of outright device failure, we didn’t have one network in all that we touched that had this type of failure once we were able to design and implement a suitable structure in the area.

Sounds kind of challenging, I suppose, but this was the frustration I have lived for years now. People often point at the device when if they would take care of the (albeit less sexy) issues of the infrastructure, they wouldn’t have to worry about many of these failure modes.

Comment from Ron Southworth
Time: May 23, 2007, 1:54 pm

Hi Everyone,

The pool sounds like a hoot!

I certainly share that frustration, Bryan, and I suspect quite a lot of us do.

Eric Byres raised a good point on the SCADA mail list regarding the PLC locking up as well, and the problems associated with 10/100 switches are also worth keeping in mind.

I suspect that there were problems in being able to recreate the fault - a lack of detailed data logging to reconstitute or replay. Something we were discussing at U5 - The need for Forensics and Incident Management.

The more people discuss this the more sorts of similar problems and solutions spring to mind and get discussed.

All good for the knowledge bank.

Comment from Anonymous
Time: May 23, 2007, 4:33 pm

Bryan Singer wrote:
“People often point at the device when if they would take care of the (albeit less sexy) issues of the infrastructure, they wouldn’t have to worry about many of these failure modes.”

It is also worth a look at the interests of those pointing at the device, for example, those who are looking to profit by selling “certification testing”. If the issues can be addressed via infrastructure, then they don’t profit as much.

The fact is that end control devices have limited processing power. They cannot process packets at wire speed or even a fraction thereof. Users would not want to pay for the cost of a device that was able to.

It will be possible to impact control traffic by introducing extraneous traffic on the network and presenting an interface with more traffic than it can handle.

In that case the end device should not fail (i.e., crash), but it is unrealistic to expect that the normal control traffic flow be unaffected. The novice user may not be able to tell the difference between a device “crashing” and the device appearing unresponsive because there is too much traffic on the interface.

This situation can be addressed to a large extent thru infrastructure configuration, where managed switches are used with per-port limits on ingress traffic (broadcast and unicast). Bottom line is that the traffic flows need to be managed in order to prevent overloading end devices — whether the end devices respond gracefully or not.
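
The per-port ingress limits described above can be modeled roughly as follows. Managed switches do this in hardware and each vendor exposes it differently, so this is only a conceptual sketch: the class, its fields, the one-second fixed window, and the example limits are all assumptions made for illustration, not any product’s actual feature.

```python
from dataclasses import dataclass, field


@dataclass
class PortLimiter:
    """Per-port ingress limiter with separate broadcast and unicast caps per second."""

    broadcast_pps: int
    unicast_pps: int
    window_start: int = 0
    counts: dict = field(default_factory=lambda: {"broadcast": 0, "unicast": 0})

    def admit(self, kind: str, now: float) -> bool:
        """Return True to forward a frame of `kind` ("broadcast" or "unicast"), False to drop."""
        second = int(now)
        if second != self.window_start:
            # A new one-second window has started: reset both counters.
            self.window_start = second
            self.counts = {"broadcast": 0, "unicast": 0}
        limit = self.broadcast_pps if kind == "broadcast" else self.unicast_pps
        if self.counts[kind] >= limit:
            return False
        self.counts[kind] += 1
        return True


if __name__ == "__main__":
    port = PortLimiter(broadcast_pps=2, unicast_pps=3)
    print([port.admit("broadcast", 0.1 * i) for i in range(5)])
    print(port.admit("unicast", 0.5))
```

In this model a broadcast storm on one port is clipped at the switch while unicast control traffic on the same port keeps its own budget, so the end device sees a bounded load whether or not it would have handled the storm gracefully.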
