Lesson in Successful Disaster Planning

I found the story of the Federal Reserve on 9/11 to be fascinating. It seems they just flipped a switch on all their Y2K preparations, and it worked.

Posted on September 23, 2014 at 1:09 PM • 44 Comments

Comments

Gerry • September 23, 2014 2:56 PM

I'm surprised this is news. A lot of organizations fell back on Y2K plans. Their success varied with the competence of their DR and BC planners. After all, 9/11-type scenarios -- loss of building, loss of data center, succession planning -- were anticipated in good BCP plans and the good BCP plans included Y2K plans as just another scenario.

If the organization had a shoddy BC/DR plan, at least it could fall back onto the Y2K plan which tended to be reasonably current.

Geoffrey Kidd • September 23, 2014 3:31 PM

There was a blog entry over on Computerworld's "Shark Tank" feature about a consultant who had just come to work in one of the towers for the IT department of a major bank. He had finished a pull of the organization's code repository into his laptop when 9/11 broke loose, so he got out quickly and went home.

According to the story, a few days later he got a call from the bank's IT manager asking if he had a copy of the company's code. He replied in the affirmative and was told that his copy was the only one and they needed it for their recovery.

When the consultant asked about the off-site backup of their code, he was told "It was in the other building."

Presumably the bank now has off-site backup a little farther off-site.

Anura • September 23, 2014 7:26 PM

@Geoffrey Kidd

Yeah, I don't like the idea of backups being in the same state as the primary. If your primary datacenter is in Nevada, keep your backups in a datacenter in Virginia. Although, these days, if you are running a website it increasingly makes sense to have an active/active configuration, with load balancing and a near-live copy of the data at each datacenter (and each datacenter taking and storing its own backups).

joet • September 23, 2014 9:51 PM

It's comforting to know that at least one critical central government function has a workable and fire-tested BC/DR plan.

Here's an exercise for the reader: compare Amazon S3's durability specs (see http://aws.amazon.com/s3/details/) to the probability that their metropolitan-area replicas are both taken out by a large asteroid (e.g. per Wikipedia).

If you put more than 6 nines in your reliability numbers, you're probably ignoring a black swan.

Nick P • September 23, 2014 11:04 PM

@ joet

The larger number is their data reliability. See below for how it's calculated. The overall reliability is limited by network reliability, which is four nines. They also offer lower cost, lower reliability options.

http://aws.amazon.com/s3/faqs/

Bruce Clement • September 24, 2014 5:24 AM

@Nick P "They also offer lower cost, lower reliability options"

Aye, and therein lies the rub. Organisations expect 99.999999999% uptime and reliability but aren't prepared to pay for it.

Managers who won't allocate the funds for effective DR/BC blame the underlings they didn't empower when it all goes wrong.

Harry • September 24, 2014 7:11 AM

@Geoffrey - Nothing wrong with having a backup in the other building ... as long as they have one in a distant location as well. The next building is quick and cheap to get to, and covers many of the problems that call for a backup. The distant backup is for when you can't get to that area at all - e.g. a flood, a hurricane, or police action. That backup is hard and slow to get to, and often expensive as well.

Bad on that bank for not having both.

Darragh McCurragh • September 24, 2014 10:28 AM

I did not quite get the analogy to Y2K. The Fed obviously didn't behave the way many believe civil servants (which in a way they are) do; it was circumspect and agile. But the whole thing happened well after Y2K, and the Fed probably didn't have that many software programs that were Y2K-"infested" or that could not have been corrected swiftly if they had been tripped up - unlike a utility, where the ripple effects are far greater, e.g. if cooling of nuclear waste had failed on a certain New Year's Eve ...

Nick P • September 24, 2014 11:47 AM

@ Bruce Clement

Haha they might do just that. However, it's rational to list data reliability and network reliability separately. The organizations might be trusting a bunch of critical data to the service. That the data doesn't disappear or get corrupted might be *highly* important. That it can be accessed "right now" might be less important. An organization might like having these data protection levels at the prices paid.

I think, though, that keeping critical day-to-day data local is a better decision. LANs tend to run faster, more privately, and more reliably. Problems with many devices can typically be fixed with a quick reboot. I'd treat cloud offerings like this as an off-site backup option.

blaughw • September 24, 2014 12:41 PM

Darragh

Y2K is relevant because the Fed spent 4+ years planning for it and received a great deal of money from Congress to get it done. Because the Fed spent that much time planning and implementing DR, and 9/11 came only 21 months after Y2K, all of those plans were likely still directly applicable.

The big question, to me, is: Y2K preparation was a massive expenditure in time and resources. For a national "Tier 1" service (I'd say the central bank qualifies), what is the appropriate level of upkeep to ensure these processes remain as resilient as they were on 9/11?

How can those lessons apply to businesses and less-critical systems?

Presumably the runbooks and architecture from 1999 are not applicable today in many regards. Could the Fed achieve this level of resiliency if this happened today?

--

Hyperlinks to the Fed's quarterly reports to Congress seem to be broken in many places, but I'm sure they're out there.

Nick P • September 24, 2014 2:07 PM

It's a great read and plenty of heroism. I think, though, that the author missed an important thing about the financial crisis: Wall St and the Fed largely caused it by very bad decision-making. The worst decision they made was to put most critical stuff in one physical location. That's just asking for it. The next bad decision was for many elites and critical functions to put it in landmark buildings that were a known target, even hit once already. Disaster planning or not, these people were setting their stuff up to fail hard if a single event happened.

A long time ago, ARPA determined the proper solution while looking into survivable infrastructure for nuclear war. We call it the Internet today, but it's really the principles behind it that matter here. It was set up with much communications redundancy, computational redundancy, diversity, and most importantly massive decentralization. Even by the 90's there were ways to implement centralized services in a decentralized way, esp if the parties were trusted. Our financial backbone (and any huge firm's backbone) are the exact types of systems that justify this level of protection.

So, they created a spectacularly bad design in terms of resilience, then achieved a spectacularly good recovery upon its failure with backup options. That recovery cost a LOT of money, though. Much more than implementing a more decentralized system with recovery built-in. So, even though they fought the fire, the next one is right around the corner if they're still using the strategies with poor resiliency. They should start working on a better strategy right now, then implement it incrementally in parallel with the current system.

Note: Some banks did use decentralization to their advantage. One in the WTC ran all core services on a VMS cluster synced to a distant location. The fail-over went smoothly, service continued running, and no transactions were lost. Even that minimal scheme did better, at low cost, than many in the article.

name.withheld.for.obvious.reasons • September 24, 2014 3:03 PM

@ Nick P

"Wall St and the Fed largely caused it by very bad decision-making."
Okay, what's stopping them now? The Fed's balance sheet (if it can be called that, and I don't know how they've derived the authority) sits at 5 trillion dollars. That, and the interest/time to repay or rebalance it, represents not only a huge unpaid tax burden that the wealthy have offloaded onto us--it also tells you how inflated the markets really are. If you believe that the DOW, NASDAQ, or the S&P can sustain their current levels, I'd like to sell you several bridges to nowhere.

Most here are probably aware of this reality; it's just a question of when the chickens come home (to roost, that is). Most will argue that it's about supporting pensions or other long-term corporate debt obligations. And again, what color would you like your bridge to nowhere to be?

Nick P • September 24, 2014 3:35 PM

@ name.withheld

Yes, much of the deficit spending is a tax on the majority to feed the rich. The Fed's activities in 2008 were necessitated by the rich, benefited the rich, and left the rest (even "upper" classes) in debt. The system is still risky. Even years later they were trying to figure out how to safely account for that $1 trillion without messing us up. The Fed is too in bed with Wall St for them to regulate them, Wall St consistently proves to be a huge existential threat to us, and this combination means the Fed is also an existential threat to us. Their main approach to huge problems is to ignore them or push them into the future.

@ All

I was talking to a friend in stock trading. He asked how my decentralized schemes would allow high-frequency trading (HFT). That got me thinking that things like HFT might actually have added to the resistance to changing the centralized architecture. I pointed out that HFT mainly benefits a select few gaming the system, while dramatically increasing risks to stability across the board for everyone else. I've always voted that it should be banned. Further, I've proposed in the past that it might be a good idea to instill a mandatory holding period between when a given stock is bought and when it's sold. This reduces volatile behavior in the market, while the time window gives opportunity to perform all kinds of analysis on the market itself (including risk). And people can still get rich investing and even attempt gambling.

Personally, though, I'd rather structure our system to ban most forms of speculating in favor of real investing. We'll get more jobs, real economic activity, etc. Aligning incentives of investors with such goals can only benefit us.

name.withheld.for.obvious.reasons • September 24, 2014 3:55 PM

@ Nick P
Before the moderator gets irate, I'd like to make a final point, though I'd argue that disaster recovery is irrelevant given the real disaster in our financial systems.

"Personally, though, I'd rather structure our system to ban most forms of speculating in favor of real investing. We'll get more jobs, real economic activity, etc. Aligning incentives of investors with such goals can only benefit us."
The forex markets are a perfect example of a perverted system. Betting on the method of transaction, our monetary systems, erases value from the real economic system(s). It's ironic that the Fed, ECB, and Abe in Asia all claim that growth is the way out--this position seems to be at great odds with reality or probability.

Brammer • September 24, 2014 4:03 PM

Y2K, sure, but are you ready for Y10K?

All those pesky year date fields you upgraded for Y2K will need to accommodate 5 characters instead of 4.

:)
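A hypothetical sketch of that widening in Python. Note that Python's own `datetime` type stops at year 9999, so this just formats plain year/month/day integers; the function name is mine, not from any library:

```python
# Widening the Y2K-style 4-digit year field to 5 digits for Y10K.
# Plain integer fields are used because most date libraries,
# including Python's datetime, refuse years past 9999.
def format_date_y10k(year: int, month: int, day: int) -> str:
    """ISO-style date string with a 5-digit, zero-padded year."""
    return f"{year:05d}-{month:02d}-{day:02d}"

print(format_date_y10k(1999, 12, 31))   # 01999-12-31
print(format_date_y10k(10000, 1, 1))    # 10000-01-01
```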

Anura • September 24, 2014 4:22 PM

@Brammer

Well, I figure by the time we move to 128-bit CPU architectures within the next couple of decades we will move to using yoctoseconds since 2000-01-01 00:00, giving us about +/- 5 million years of range.
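The claimed range checks out; a quick back-of-envelope in Python (the epoch choice doesn't affect the span):

```python
# Sanity-checking the range: a signed 128-bit count of
# yoctoseconds (10^-24 s) measured from a 2000-01-01 epoch.
YOCTO = 1e-24                            # seconds per yoctosecond
max_count = 2**127 - 1                   # largest signed 128-bit integer
span_seconds = max_count * YOCTO
span_years = span_seconds / 31_556_952   # mean Gregorian year in seconds

print(f"range: +/- {span_years / 1e6:.1f} million years")  # roughly 5.4
```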

name.withheld.for.obvious.reasons • September 24, 2014 4:36 PM

@ Brammer
I don't know why there wasn't a move to use hexadecimal dates--it could be treated as a string (most dates are not typed--at least at the time).

Mark • September 24, 2014 4:50 PM

Amazon's claimed 11 nines of annual durability is complete nonsense. A dinosaur-killer asteroid is 8 nines; the Sun going red giant is less than 10 nines. Sure, S3 may be designed to survive the loss of any two datacenters, but either of the above events is likely to take out all of them.
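Those figures are easy to reproduce. The ~100-million-year asteroid recurrence interval below is a rough assumed rate for illustration, not an exact geological number:

```python
import math

def nines(annual_loss_probability: float) -> float:
    """Leading nines in the corresponding annual survival probability."""
    return -math.log10(annual_loss_probability)

s3_loss = 1e-11        # 99.999999999% advertised annual durability
asteroid_loss = 1e-8   # dinosaur-killer impact roughly every 100M years

print(nines(s3_loss))           # about 11 nines
print(nines(asteroid_loss))     # about 8 nines
print(asteroid_loss / s3_loss)  # asteroid risk ~1000x the claimed loss rate
```

In other words, the advertised durability is dominated by risks the durability model doesn't count.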

Buck • September 24, 2014 4:58 PM

@Brammer, Anura, name.withheld, et al.

Long before that, the 'Y2K38' problem will be upon us...

Hopefully, by Y10K, humans will be able to agree on natural-phenomena-based timekeeping methods in lieu of local political interests!
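For concreteness, here's roughly when that 'Y2K38' rollover hits: a signed 32-bit time_t counts seconds from 1970-01-01 UTC and runs out in January 2038 (a sketch in Python):

```python
from datetime import datetime, timedelta, timezone

# A signed 32-bit time_t counts seconds from the Unix epoch.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

last_valid = epoch + timedelta(seconds=2**31 - 1)
print(last_valid)    # 2038-01-19 03:14:07+00:00

# One second later the counter wraps to -2**31, which naive
# code reads as a date in December 1901.
wrapped = epoch + timedelta(seconds=-2**31)
print(wrapped)       # 1901-12-13 20:45:52+00:00
```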

Anura • September 24, 2014 6:31 PM

@Buck

Our calendar is natural-phenomena based, i.e. Earth's rotation and Earth's orbit. Anything else would get pretty annoying if you weren't space-faring, and at that point the only thing that is really deterministic regardless of location in the universe is Planck time. You would probably get rid of arbitrary days/months/years and start using a metric calendar or, if we wish, switch to base 12 or base 16 and modify accordingly.

Personally, I hope to be using a hex-based calendar when I leave Earth behind: 16^36 Planck times = 1 second (1.2 SI seconds), 64 seconds in a minute, 64 minutes or 16^3 seconds in an hour, 16 hours or 16^4 seconds in a day (21.8 hours), then 16 days in a month and 16 months in a year (about 0.64 Earth years, or 0.34 Mars years).
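The arithmetic holds up; a quick check in Python, using the approximate Planck time of 5.39e-44 s:

```python
# Verifying the hex-calendar unit conversions above.
PLANCK_SECONDS = 5.39e-44             # approximate Planck time in SI seconds

hex_second = 16**36 * PLANCK_SECONDS  # the proposed base unit
hex_day = hex_second * 16**4          # 16 hours of 16^3 hex-seconds each
hex_year = hex_day * 16 * 16          # 16-day months, 16-month years

print(hex_second)                     # ~1.2 SI seconds
print(hex_day / 3600)                 # ~21.9 Earth hours
print(hex_year / (365.25 * 86400))    # ~0.64 Earth years
```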

Buck • September 24, 2014 7:11 PM

@Anura

Yeah... Kinda! ;-)

Seconds are subdivisions of minutes - parts of hours - all based on the rotational speed of the Earth vs. the relative spatial position of the Sun...

This approximation doesn't quite mesh with our concept of days, months, seasons, etc.. A leap-second here-or-there, now-and-again, won't affect anyone, right..? How about time-zones? "It's this time in my town, while it's that time at your home!"

Daylight savings, anyone?

Extra bits and bytes are trivial to tack on the end; our real troubles are legal mazes...

Though, thinking more about it, the Planck scale is no longer totally out of our reach - is it?

Anura • September 24, 2014 8:55 PM

@Buck

Daylight savings time is nuts, but for the other stuff the obvious solution is to move Earth's orbit a bit wider so a year is exactly 366 days (which also helps with global warming); then you can get rid of leap days. Periodic adjustments for drift may be necessary, like we do with leap seconds.

Buck • September 24, 2014 9:14 PM

@Anura

I thought it was more like 365 & a quarter... Regardless, the Earth's seismological outgassing has hopefully already begun to tune our revolution/rotation and all other measures of time to the optimal (and most logical) parameters for continued survival of the species! Cheers to all, and thank you for all the phishes! Live long and prosper...

JohnP • September 25, 2014 9:14 AM

As a technical architect, it is my job to ensure failover and DR plans work. The only way I know to ensure that is to design failover to another location 500+ miles away into the total solution AND to test it, at least monthly.

Usually, for highly critical systems, the client wants to test the failover weekly, so we build out 2 production systems with near-real-time replication and every Friday night cause a failover to the other location, run there for a week, then fail back. This has been happening for years now. Once in a while, a change/update to 1 system fails and we have to fail over to the other site and fix it. That's fine. Of course, real-time replication means huge pipes due to the amount of data involved. For clients with more modest bandwidth capacity, even having nightly backups pushed to an alternate location can make DR possible for any data prior to that point - this is not ideal, but it is 98% better than not doing anything.

Clive Robinson • September 25, 2014 9:23 AM

@ Nick P,

When reading down the thread I got to your,

    Even by the 90's there were ways to implement centralized services in a decentralized way, esp if the parties were trusted. Our financial backbone (and any huge firm's backbone) are the exact types of systems that justify this level of protection.

The first thought that crossed my mind was "how do you decentralize the speed of light limit on HFT".

And lo and behold, you mention HFT further down.

I guess that stopping HFT would, all things considered, be the best solution.

As I've noted in the distant past --though a friend of yours disagreed--, it's fairly clear that the financial industry is inventing faux markets to increase its take in terms of fees etc. However, in more recent times they have also gone overboard on trading that fraction of a second faster, hence the speed-of-light "time cones" were "crimping their style". So: a big slice of the pie, for no real value added, in less than the blink of an eye. I guess that's why we have inflation.

Nick P • September 25, 2014 12:59 PM

@ JohnP

Good work. That's exactly the kind of thing I'm talking about. Many of these big financial traders didn't seem to be doing even that much.

@ Clive Robinson

Yeah, getting rid of HFT greatly simplifies things. It's just a hack a small number of major players use to create money out of thin air, like you said. A previous article on this blog showed it was also used in fraud where they leveraged their position to see what bets others were making, then undercut them with HFT trades. Seeing that, you'd think even Wall St themselves would want to get rid of the stuff just out of their own greed or risk calculations. Fortunately, that discovery did at least lead to a new trading organization with better internal controls.

That said, I think what I propose could even be done with HFT. The idea is that you'd have datacenters in several physical locations that are far apart, geographically safe from most disasters, have internal redundancies, and have fat pipes connecting them to each other. Topology to use is debatable with tradeoffs available, but the key thing is they're far apart and well-connected. The main systems would store the data in SAN's that are essentially a replication cache. Dedicated, high speed systems would replicate that data to the other sites constantly. The other sites would run their own copy of the centralized state of the system, updating it as they received things. There might also be local copies in different physical spots in the same site, like different floors of WTC.

In the event of an attack, the trading systems (if not failed already) would be shut down. If the network still works (and some did on 9/11), the data continues going out. Meanwhile, dedicated, hot-swappable storage appliances get a copy of all the activity that would go out on the line. The data that needs to go out on the network can quickly be offloaded onto that device, which is then physically moved out of the building in a hurry. It might even be a battery-backed giant RAM drive with a direct PCI interconnect for performance. If the network goes down or is too slow, this device can be connected to another network or even copied to other storage devices that are shipped overnight to other locations. Other locations keep track of what data they've received, so they can ensure they only pull replications off the storage device that they didn't already receive from the network. The sites can check each other too.

Once at least one is consistent, it becomes the new master site and the market opens through it. Others start taking replications from it. Replications are constantly loaded onto high speed storage device just in case. Data from the failover is kept and analysed. These kinds of methods might allow, depending on networking, recovery within hours. It should be a little over one day tops. The cost is some high end gear, regular SAN, installation/maintenance of the high bandwidth links, and whatever bandwidth is actually used*. The tests might use less data to keep their cost down.

*One can dramatically reduce costs of replication if you simply ship discs with the data (in protected containers of course). A number of organizations do this because the shipping charge of one 1TB disc or a 16TB server is way cheaper than bandwidth charges. Shipping these daily while keeping the network link open for just emergencies can keep costs down considerably. The delay goes from 1-3 days depending on shipping method.
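The economics of disc-shipping are easy to sketch. The 1 Gbps link speed below is an assumed figure for illustration, as no link speed is specified above:

```python
# Effective bandwidth of overnight-shipping a 16TB server versus
# streaming the same data over a dedicated link (assumed 1 Gbps).
data_bits = 16 * 10**12 * 8        # 16 TB expressed in bits

shipping_seconds = 24 * 3600       # overnight delivery
link_bps = 1e9                     # assumed dedicated 1 Gbps pipe

effective_bps = data_bits / shipping_seconds
print(effective_bps / 1e9)              # ~1.5 Gbps "sneakernet" bandwidth
print(data_bits / link_bps / 3600)      # ~35.6 hours to stream instead
```

So under these assumptions the overnight box outruns the dedicated link, before even counting bandwidth charges.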

name.withheld.for.obvious.reasons • September 25, 2014 1:39 PM

@ Nick P
It's all old hat--in the 90's we deployed hot sites across the country. Going beyond mere compliance, we did server mirroring and data replication to keep host OS's, applications, patches, configs, and data current across two physically separate sites (one in Dallas, one in Florida), with the west coast as primary. We could drop calls and screens from one host to another at a remote site.

Our first build-out was completed in 1998. One area of concern for state data that is more prevalent today is transaction-level redundancy on hosted VM guests with session-less web clients that proxy transactions. This also tends to push loads onto hot-site networks that can compete with database table-level replication (or even SAN-to-SAN replicas).

name.withheld.for.obvious.reasons • September 25, 2014 2:20 PM

@ moo
I'm afraid to ask! I believe Doctor Evil (mawahaaa) works at the IMF. Could you summarise, as I am unable (and more importantly, unwilling) to point my browser or wget at the URL/URI?

Nick P • September 25, 2014 7:14 PM

@ moo

Thanks for the link! I've never seen it, the predicted results are great, and it could be the first time I praise the IMF haha.

@ name.withheld

Good examples. Yeah, that's my point exactly. Approaches like I describe have been going on a long time. VMS long-range clusters started in the 1980's with Tandem NonStop following. Big banks and such were already using these systems for geographically-distributed, high-performance, transaction processing with fail-over. Did the Fed, NYSE, and huge Manhattan banks' IT departments just sleep through 20+ years of product development in high availability systems? Like I said, there were some banks that day that did different.

Btw, here's the article on the German bank with VMS clusters on 9/11. It's really an advertisement, but the details are usable. Reading it again, I noticed it's especially applicable to this situation given what the cluster did:

"These applications include a money transfer system responsible for the bank’s connection to the Federal Reserve and the New York Clearing House, a trading system, a banking system that handles internal banking requirements, a letter of credit system, a futures and options system, and much more — all running under OpenVMS."

Basically, all the things that failed hard at many other institutions in the main article just kept on working at this bank. Imagine that. ;)

name.withheld.for.obvious.reasons • September 25, 2014 7:39 PM

@ Nick P
Thanks Nick--good follow on by the way. Always trust you to keep the conversation honest and above board. Your value to the community cannot be overstated (yeah Buck, Clive, Dirk, and the other regulars I have not mentioned are on the list).

Also enjoyed your comment about the IMF. Seems both you and Clive know how to make me laugh...and that's hard to do nowadays. Hope you enjoyed my Doctor Evil comment, I forgot to choreograph the "with pinky finger in his mouth" (mauwahahaha).

I was familiar with at least a half dozen resiliency and high-availability architectures back in the 80's and 90's, with many different layers of redundancy (hw, link, data, session, and application). I don't know why we are still talking about this. It is the processes and laws underneath and on top of these systems that are problematic--and I'd argue the source of most of "our" problems.

y2kplus1hundred • September 25, 2014 8:31 PM

There is actually a year 2100 leap day problem: century years are not leap years unless divisible by 400. So I fully expect a lot of fun on March 1st, 2100. Or Feb 29th, if you're not into the whole doing-stuff-right thing.
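For reference, the full Gregorian rule in Python: years divisible by 4 are leap years, except century years, which must also be divisible by 400:

```python
def is_leap(year: int) -> bool:
    """Gregorian leap-year rule."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

print(is_leap(2000))   # True: divisible by 400
print(is_leap(2100))   # False: century year not divisible by 400
print(is_leap(2104))   # True: ordinary fourth year
```

Code that only checks `year % 4` worked fine through 2000 (which happens to be divisible by 400) and will quietly break in 2100.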

Nick P • September 25, 2014 8:49 PM

How about we solve the Y2038 problem first? Then worry about the other one in a few decades.

Nick P • September 26, 2014 10:51 AM

@ name.withheld

"Thanks Nick--good follow on by the way. Always trust you to keep the conversation honest and above board. Your value to the community cannot be overstated (yeah Buck, Clive, Dirk, and the other regulars I have not mentioned are on the list)."

Appreciate it. :)

"Hope you enjoyed my Doctor Evil comment, I forgot to choreograph the "with pinky finger in his mouth" (mauwahahaha)."

It was funny. Thing is, it's a meme now, so we gotta remember this is how most people will interpret Dr. Evil references. Unless it's a pun on an actual quote from the movie. Can't go wrong letting Mike Myers play an eccentric hero and villain in the same movie.

"I don't know why we are still talking about this? It is the process and laws underneath and on top/over these systems that are problematic--and I'd argue the source of most of "our" problems."

We're talking about it to educate *readers*, personal or professional. The other problems will largely be solved by *voters*, who so far don't care. So, I continue to help those who want to understand things while waiting for the rest to care (sigh).

name.withheld.for.obvious.reasons • September 26, 2014 12:38 PM

@ Nick P

"We're talking about it to educate *readers*, personal or professional. The other problems will largely be solved by *voters*, who so far don't care. So, I continue to help those who want to understand things while waiting for the rest to care (sigh)."

Agree wholeheartedly. 'Tis depressing that those who control the means of accessing useful information spend so much time spreading disinformation. That is what attracted me most to Bruce's blog--god forbid the "intellectual elite" get a clue or come to understand "enlightened self-interest". You keep going, Nick!!!

Adjuvant • September 27, 2014 11:16 PM

@moo, NickP, etc:
Just for the record, I previously mentioned the Chicago Plan and its potential contemporary application. See that mention here, and don't miss the linked video (in the third post) of the IMF's Michael Kumhof speaking at the LSE. Still absolutely electrifying! Easily one of my favorite "clean-slate" concepts.

thevoid • September 30, 2014 3:49 PM

so, they want to detect bomb making ingredients in the sewer:

"If you make homemade explosives or bombs, you need a place to be, you need to use some equipment, and some chemicals," he explains.

"In the process there could be a need to rinse equipment or pour it down the drain - and this is something we want to take advantage of."

The sensors can detect minute traces of the bomb ingredients, recording their concentration, the time they were found and their location.

This information is then sent out to police.

but a few lines later they say:

Most homemade bombs are hydrogen peroxide or fertiliser-based, he says.

"They have chemicals you buy in a normal supermarket, and they are using them to make bombs."

so if you are a gardener who gets a cut and cleans it with hydrogen peroxide, you get a visit from the police. nice.

the stupidity and insanity seem to be accelerating.
