Virtualization in the SCADA World: Part 4 - Recovery

Our last post in this series covered the benefits of virtualization for testing in a lab or development environment. Today we are going to address some of the same features but with a different twist – this time we’re talking about using VMs in a production environment.

It doesn’t take much exposure to virtualization before you realize one of the primary benefits is the ease of recovery. With a little planning and the snapshot ability that is available in nearly all of the VM products, a last known good configuration is only clicks or keystrokes away if something goes awry. But that’s really just the beginning, because virtualization provides recovery benefits beyond that. In an industry obsessed with reliability, I believe we are compelled to investigate further.

Because a virtualized system is essentially a file or small set of files, new opportunities exist for easily replicating those files to secondary servers and remote locations. Virtualization vendors are latching onto this ability and creating mechanisms that make replication easy for administrators to configure.

In a traditional redundancy scenario, the hardware had to be precisely duplicated. With the hardware independence that VM affords, though, recovery processes can be much more flexible.

So let’s assume for a moment that we’ve overcome the obstacles and have implemented virtualization for at least a partial list of the servers in our production system. What might this look like from a recovery perspective? We could have two physical machines at our primary location. Each one could host redundant pairs of a handful of Linux and Windows virtual machines. This helps eliminate any single point of failure issues from a physical hardware standpoint and it allows us to maintain the logical failover ability built into most control systems. At our backup site, we could have a third physical machine that gets regular backups of the virtual machines from the primary site.
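To make that layout concrete, here is a minimal Python sketch of a check that no redundant pair ends up on a single physical machine. The VM names, roles, and host names are hypothetical, and the inventory is a plain in-memory list rather than anything read from a real hypervisor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    role: str  # VMs that share a role form a redundant pair
    host: str  # physical machine the VM runs on


def single_points_of_failure(vms):
    """Return the roles whose VMs all sit on one physical host."""
    hosts_by_role = {}
    for vm in vms:
        hosts_by_role.setdefault(vm.role, set()).add(vm.host)
    return sorted(role for role, hosts in hosts_by_role.items() if len(hosts) < 2)


# Hypothetical primary-site inventory: two physical hosts, two redundant pairs.
layout = [
    VM("scada-a", "scada-server", "host1"),
    VM("scada-b", "scada-server", "host2"),
    VM("hist-a", "historian", "host1"),
    VM("hist-b", "historian", "host1"),  # deliberate mistake: both on host1
]

print(single_points_of_failure(layout))
```

In this made-up inventory the historian pair is flagged, because losing host1 would take out both copies even though the control system's logical failover is configured.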

In this example, if we lose our primary site, we have a backup that does not require waiting for tape recovery. We enforce a change control process that requires taking a snapshot before any update (plus regularly scheduled snapshots) so when a database load corrupts the system, we can simply revert. If the system is attacked, we can preserve evidence in a way that was not possible before and still be back up and running in a very short period of time. This is just the beginning – use your imagination, and I challenge you to come up with a scenario where having virtualized servers does not ease the recovery process.
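The snapshot-before-change rule can be sketched as a small wrapper. This is an illustration only: ToyVM is a hypothetical in-memory stand-in, not any vendor's API, and a real deployment would call the snapshot and revert operations of its own VM product.

```python
import copy


class ToyVM:
    """Hypothetical in-memory stand-in for a VM; not a real hypervisor API."""

    def __init__(self, state):
        self.state = dict(state)
        self._snapshots = {}

    def snapshot(self, label):
        # Keep an independent copy of the current state under this label.
        self._snapshots[label] = copy.deepcopy(self.state)

    def revert(self, label):
        # Restore the state captured by an earlier snapshot.
        self.state = copy.deepcopy(self._snapshots[label])


def apply_change(vm, label, change, verify):
    """Snapshot first, apply the change, and revert if verification fails."""
    vm.snapshot(label)
    try:
        change(vm)
        ok = bool(verify(vm))
    except Exception:
        ok = False
    if not ok:
        vm.revert(label)
    return ok


def bad_database_load(vm):
    # Simulated corrupt load: the version bumps but the tag database is wiped.
    vm.state["db_version"] = 42
    vm.state["tags"] = 0


def healthy(vm):
    return vm.state["tags"] > 0


server = ToyVM({"db_version": 41, "tags": 1200})
succeeded = apply_change(server, "pre-db-load", bad_database_load, healthy)
print(succeeded, server.state)
```

The point of the design is that the snapshot is not optional: the change cannot run unless the restore point already exists, so a failed load leaves the server exactly where it started.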

In the IT realm, the recovery options in virtualized systems have revolutionized business continuity planning. I believe they will have an equal impact on control system design and implementation at some point. Will there be “law of unintended consequences” repercussions? Yes, I’m sure there are many – that’s why I want to start the conversation now.

UPDATE - Dale’s Two Cents - Recovery is what I see as the biggest benefit to virtualization for both upgrade failures and catastrophic recovery.

As an asset owner, you wait until the vendor certifies a patch, you test it in the lab, and then apply it. There have been many cases where the patch worked fine in the lab, but failed under load or with additional features or functions that were unique to your production system. VMware or another virtualization product offers an extremely fast and effective way to roll back.

This goes beyond patching. It applies to all change control. Rollback should be part of any change control procedure and many control system application upgrades offer no rollback short of reinstalling the entire system. Creating a VM for rollback could be an effective part of any change control process.

I’ve blogged before about the overreliance on failover for recovery. Failover works fine for hardware failures but fails when a worm or other cyber attack takes out all of the systems. The outage time to rebuild a system that has not been rebuilt in years could be days and require vendor participation. A VMware snapshot would recover from a catastrophic failure quickly, in a matter of minutes after the affected systems are removed from the network.

Finally, virtualization may be a good solution for a tertiary control center (or backup if you can’t afford one today). You could put a realtime server, historian, and HMI all on one system.

Comments

Comment from Martin Solum
Time: March 10, 2008, 11:20 am

Sorry to be late to the party…

The use of virtualization in test & dev is pretty compelling. Some of the risk of using virtualization in production comes from unfamiliarity with the technology which can be overcome simply by using virtualization in test & dev until the learning curve is conquered.

As that risk is reduced over time, the benefits of upgrade failure recovery & catastrophic recovery Dale mentioned become increasingly compelling. However, there is another, more subtle benefit that should also be considered: bit rot prevention. Changes made by upgrade programs are often very difficult to validate or control. With virtualization, the cost of a fresh operating system and application installation, instead of merely running an upgrade program, is reduced. This makes fresh installs much more likely to happen much more often.

Arguably, being assured that the operating system and application instances are in sync with the best available current standard actually reduces the attack surface of the application instance.
