
Ethernet PLC and VFD Crash / Vulnerability Causes Nuclear Plant Scram

(Big Hat Tip: Robert Lemos of Security Focus, see his article)

This is a fascinating real-world case study and an example of why protocol stack security and reliability are so important. From an NRC report dated April 17, 2007:

On August 19, 2006, operators at Browns Ferry, Unit 3, manually scrammed the unit following a loss of both the 3A and 3B reactor recirculation pumps. …

The licensee determined that the root cause of the event was the malfunction of the VFD controller because of excessive traffic on the plant ICS network. … The licensee could not conclusively establish whether the failure of the PLC caused the VFD controllers to become nonresponsive, or the excessive network traffic, originating from a different source, caused the PLC and the VFD controllers to fail. However, information received from the PLC vendor indicated that the PLC failure was a likely symptom of the excessive network traffic. …

A key point is that all network devices must allocate time and resources to read and interpret each broadcasted data packet, even if the packet is not intended for that particular device. Excessive data packet traffic on the network may cause connected devices to have a delayed response to new commands or even to lockup, thereby, disrupting normal network operations. This excessive network traffic is sometimes called a broadcast (or data) storm. …

The reason the licensee at Browns Ferry investigated whether the failure of one device, the condensate demineralizer PLC, may have been a factor in causing the malfunction of the VFD controllers is that there is documentation of such failures in commercial process control. For instance, a memory malfunction of one device has been shown to cause a data storm by continually transmitting data that disrupts normal network operations resulting in other network devices becoming ‘locked up’ or nonresponsive.
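The NRC's point that every device must spend time on each broadcast packet can be illustrated with a toy model. The sketch below is not taken from the report: the function name, the tick-based timing, the queue size and the traffic rates are all hypothetical, and the assumption that the command arrives last in each tick is deliberately pessimistic. It only shows the mechanism: once broadcast arrivals exceed what the device can process, its receive queue stays full and legitimate commands are discarded.

```python
from collections import deque

def simulate(ticks, broadcasts_per_tick, capacity_per_tick,
             queue_limit=64, command_every=10):
    """Toy model of one controller on a shared LAN (all values hypothetical).

    Each tick, `broadcasts_per_tick` broadcast frames arrive, plus one
    legitimate command every `command_every` ticks (assumed to arrive last).
    The device can process at most `capacity_per_tick` frames per tick and
    drops anything that arrives while its receive queue is full.
    Returns (commands_delivered, commands_dropped).
    """
    queue = deque()
    delivered = 0
    dropped = 0
    for tick in range(ticks):
        arrivals = ["broadcast"] * broadcasts_per_tick
        if tick % command_every == 0:
            arrivals.append("command")
        # Enqueue arrivals; frames that find the queue full are lost.
        for frame in arrivals:
            if len(queue) < queue_limit:
                queue.append(frame)
            elif frame == "command":
                dropped += 1
        # The device must look at every frame, broadcast or not.
        for _ in range(min(capacity_per_tick, len(queue))):
            if queue.popleft() == "command":
                delivered += 1
    return delivered, dropped

if __name__ == "__main__":
    print("normal traffic:", simulate(1000, 2, 5))
    print("broadcast storm:", simulate(1000, 20, 5))
```

With light broadcast traffic every command gets through. Under the storm settings the queue fills within a few ticks and almost every command is dropped, even though none of the broadcast frames were addressed to the device.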

I believe “scram” is the term for an emergency shutdown, so this was serious.

The write-up is unclear as to whether the 10Mbps LAN pipe was full and preventing legitimate communication, or whether the protocol stack in the VFD couldn’t process the traffic on a 10Mbps LAN. The preponderance of the report points to the VFD stack not processing the traffic.
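To put the 10Mbps figure in perspective, here is a rough back-of-the-envelope sketch of the worst case a device on such a link could face: a wire saturated with minimum-size frames. The framing constants are standard Ethernet values; the function names are mine, and nothing here describes the actual traffic mix or the processing capacity of the devices at Browns Ferry.

```python
# Worst-case frame arithmetic for a saturated Ethernet link.
PREAMBLE_SFD_BYTES = 8     # preamble plus start-of-frame delimiter
MIN_FRAME_BYTES = 64       # minimum Ethernet frame, including the FCS
INTERFRAME_GAP_BYTES = 12  # mandatory idle gap, expressed in byte times

def max_frames_per_second(link_bps, frame_bytes=MIN_FRAME_BYTES):
    """Upper bound on frames per second for a given link speed."""
    wire_bits = (PREAMBLE_SFD_BYTES + frame_bytes + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits

def per_frame_budget_us(link_bps, frame_bytes=MIN_FRAME_BYTES):
    """Microseconds a device has per frame if it must keep up at line rate."""
    return 1e6 / max_frames_per_second(link_bps, frame_bytes)

if __name__ == "__main__":
    for name, bps in (("10Mbps", 10_000_000), ("100Mbps", 100_000_000)):
        print(f"{name}: {max_frames_per_second(bps):,.0f} frames/s, "
              f"{per_frame_budget_us(bps):.2f} us per frame")
```

At 10Mbps the bound works out to roughly 14,880 minimum-size frames per second, or about 67 microseconds per frame. A faster link raises the frame rate a storm can reach and shrinks the time available per frame by the same factor.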

It is truly pathetic if an Ethernet connected device can’t handle 10Mbps traffic. (It may be equally pathetic that they have not upgraded to 100Mbps switches, although that would only have exacerbated the problem. Then again, if network utilization is low there is no need to upgrade. It shows how little traffic travels over the average control system network.) Any IT network device would be considered a complete failure if it had such unreliable and insecure performance. Shouldn’t we have higher standards of reliability in control systems for the critical infrastructure?

Unfortunately this is all too common for controllers and other Ethernet enabled devices in control system networks. Asset owners have enough security issues to address without having to worry about whether the controllers can process packets correctly. It is why we are such strong proponents of Achilles Certification. The storm test cases in the Achilles Certification tests would have identified the failures described in the NRC report. Vendors whose protocol stacks continue to operate correctly during the rigorous Achilles testing should be commended.

Congratulations to Robert Lemos of Security Focus for identifying this story. One small note on his article: my quote, “if you were to test any control systems that have any more than three or four different network-connected devices, they could be knocked over very easily,” refers to three or four different types of controllers or devices found in the field. Conservatively, we are seeing protocol stack problems in 25% to 33% of the controllers and other Ethernet enabled devices we assess. Many of our clients know this going in and tell us not to bother scanning field site equipment because they know it causes devices to crash.

This is why you need to be very careful about how you do assessments of control systems, and it accounts for all the horror stories of an IT professional running a Nessus scan on the control system subnets and causing outages. See our Scanning Control Systems whitepaper for our methodology.

Comments

Comment from Ron Southworth
Time: May 19, 2007, 6:13 pm

SCRAM - Scramble - A very quick but controlled shutdown of a plant.

Power generation standard joke ??

Q: - What steps do you take during a meltdown on a generating plant?

A: - Big ones in the opposite direction of the generator/reactor!

You just never know when you will ever have to answer that question yourself

Seriously

For certain, protocol stack problems are something that does need to be addressed.

Of far more importance IMHO is the need to ask what has happened in this instance to all the requirements, procedures and policies for the mitigation of common modes of failure.

How is it that this single point of failure problem affected TWO VFD drives and not just one? Were they the same brand?

Why were both drives connected directly or indirectly to the “same LAN” as the “faulty” PLC?

Sounds like the control system networks are also not isolated from each other - Defence in depth & compartmentalisation of system into zones is not being undertaken.

The problem is we are not seeing all of the data surrounding the incident so it is all pretty much conjecture - we will have to wait for the official findings.

Dale, you are correct in your point you are trying to make with regards to testing methodologies.

You may recall our discussions a while ago regarding control valves. These devices are probably of more significant interest (in terms of the number of devices installed in these sorts of industrial control applications).

There is something to be said for pneumatic safety systems after all - they are pretty hard to hack!

Thanks for sharing this one Dale, some good lessons learned if people bother to read and learn?
