On the Path to Outsourcing: The Biggest Computer Crash in Recent History

One of the crowning accomplishments of the Warner administration was the creation of the Virginia Information Technologies Agency, which was supposed to rationalize the state’s IT systems. Yesterday, widespread computer crashes across state government disrupted service to at least one-fifth of Virginia government agencies, reports Peter Bacque at the Times-Dispatch.

The effects: State and local police couldn’t check driver’s licenses and vehicle registrations. More than 14,000 child-support payments could be delayed. Consumers couldn’t examine corporate records. Some agencies were temporarily unable to pay end-of-fiscal-year bills.

Was this the biggest computer crash in state government history? To the best of my recollection, it was. Perhaps someone with a better memory can set me straight.

More to the point: What went wrong?

Bacque quotes VITA spokesperson Marcella Williamson as follows: “We know the state’s IT network needs work and requires money. That’s why the state and VITA partnered with Northrop Grumman. … The construction of the new data center in Chesterfield County and the backup data center in Lebanon in Russell County will help prevent these kinds of problems, or solve them much more quickly.”

The state is in the midst of a $1.9 billion outsourcing of its aging computer and communication systems to Northrop Grumman Corp. The Chesterfield center should be in use next month and the Lebanon operation by year’s end.

VITA needs more money? Really? I thought that the new management structure was supposed to achieve major efficiencies, increase security and provide more redundancy and back-up as protection against catastrophic failure. Clearly, it failed — big time — to provide the back-up. There is more to this story than has been reported so far. Let us hope that the Times-Dispatch gives Bacque the time he needs to dig deeper.


Comments

16 responses to “On the Path to Outsourcing: The Biggest Computer Crash in Recent History”

  1. Groveton

    Virginia’s transportation infrastructure has been allowed to decay. And, despite years of haggling, there really isn’t much of a plan in place to fix it.

    Virginia’s computing infrastructure has been allowed to decay. And, despite the outsourcing contract with NG, there are still serious outages and a need for more money.

    Tell me again – why should anyone vote for any incumbent this November?

  2. Larry Gross

    okay.. let’s be fair…

    Didn’t one of the airlines recently go belly up on their computer system?

    where I worked .. once.. I had to fight management tooth and nail to convince them that they needed an online backup system..

    The answer came back every time – “too expensive ..and we don’t have the money…”

    FINALLY, after we had a crash that ate two days of 200 people sitting in front of dead terminals… they relented… and the money miraculously appeared .. by magic..

    It’s not about how much money you spend.. whether it is IT or Transportation or even education.

    It’s about priorities because you never will have enough money to buy everything you think you need.

    I dunno who did what .. but this one is easy… having multiple state agencies “down” because you did not have “parts” is bogus.

    There’s a term called System Failover Capability. In some organizations (like GOOGLE) it’s a real thing. In other organizations.. it’s a buzzword on a power point…

    GOOGLE has hardware and software failures ALL the time because that’s what hardware and software do. What humans do – is accept that reality and plan for it … sometimes.

  3. Groveton

    OK – let’s talk technology.

    At the risk of sounding like some of the bloggers on this site whose arguments I criticize – let me start with some terms. These are de facto definitions rather than anything written in an official technical dictionary.

    Crash – when a specific piece of equipment or an isolated area of software fails. As in, my PC crashed.

    Outage – when a large application (or set of applications) is unavailable to many, most or all of the end users. As in, the reservation system experienced a two hour outage.

    Crashes happen all the time. Disk drives mechanically fail, pedestals (the green boxes that the phone company puts in your neighborhood) get wet when puddles form at their base during thunderstorms, firewalls become overwhelmed with requests and freeze up.

    Outages of important applications should be exceedingly rare. This is accomplished by building in a level of redundancy that will allow the application to continue operating (perhaps with degraded performance) even if some individual components crash. The broken components are then either repaired or replaced. Isolation is another related technique used to achieve an absence of outages even when there are inevitable crashes.

    A 777 can fly with one engine but it has two. Why? Partially because Boeing thought that redundancy would be important if one engine failed.

    Google is one of the acknowledged masters of avoiding outages. They start with lots of servers and lots of data centers. They protect the actual number of servers as a trade secret, but rumors have them operating over 500,000 servers. They have many data centers. In fact, they are especially clever in locating their data centers near cheap sources of power. The recently built data center (or server farm) near The Dalles Dam in Oregon is just one example. Finally, Google has a rather brilliant piece of software called MapReduce which distributes requests for system resources across their data centers and servers. Some of MapReduce is in the public domain and some is not. As my engineers have explained it to me, MapReduce is the brains behind Google’s exceptional fault tolerance. If a component or set of components crashes, MapReduce simply stops routing requests to those servers or data centers. The other servers and data centers get busier than usual when some of their “brethren” are out of commission but they keep running.
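    To make that routing idea concrete, here is a toy sketch (nothing like Google’s actual code, and every server name below is made up): a dispatcher that simply stops sending requests to machines that fail health checks and spreads the work across whatever is still healthy.

    ```python
    import random

    class Dispatcher:
        """Toy request router: skip servers marked unhealthy, spread load over the rest."""

        def __init__(self, servers):
            self.servers = set(servers)   # every server we know about
            self.unhealthy = set()        # servers currently failing health checks

        def mark_down(self, server):
            self.unhealthy.add(server)

        def mark_up(self, server):
            self.unhealthy.discard(server)

        def route(self, request):
            healthy = list(self.servers - self.unhealthy)
            if not healthy:
                raise RuntimeError("total outage: no healthy servers left")
            # the surviving servers get busier, but the service keeps running
            return random.choice(healthy), request

    dispatcher = Dispatcher(["dc1-web01", "dc1-web02", "dc2-web01"])
    dispatcher.mark_down("dc1-web01")     # a crash, not an outage
    print(dispatcher.route("GET /license-check"))
    ```

    The point is not the few lines of Python; it is that the surviving servers absorb the load, so an individual crash never turns into a widespread outage.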

    Google may represent the state of the art in this area of systems engineering. However, many companies have the skill required to build a systems architecture that will almost never experience a widespread outage (repeat: almost never; “never” is probably possible but very, very expensive). Northrop Grumman is certainly a company with the requisite skills to build infrastructure and applications which will nearly never experience a widespread outage.

    United Airlines also had a two-hour outage on Tuesday morning, disrupting flight operations across the USA.

    So, if outages can be avoided through proper systems engineering – why do they still happen?

    1. Economics. Redundancy costs more in hardware, network and software than non-redundancy. If an IT operation or an outsourcer has a contract that demands an ultra cheap solution – redundancy is often among the first victims.

    2. Organizational dysfunction. The cost of an outage to an IT shop or an outsourcer is sometimes minimal. The application stops working for a few hours, IT fixes it and then it starts running again. There was very little incremental cost to IT. However, there is often a huge cost to the user organizations. Customers turn away, orders are not processed, significant overtime has to be authorized to process the backlog of work that occurred during the outage. Some organizations optimize the economics of IT separately from the economics of other, user-based organizations. IT responds to these incentives by factoring in (usually subconsciously) a low value for redundancy and survivability. In simple terms, they just don’t get it. In outsourcing the usual answer to this is to provide bonuses for better than expected uptime and penalties for worse than expected uptime. In IT organizations you make sure that IT management knows they will be fired if they crush the business with outages. And, you pay them bonuses partly tied to uptime.

    3. Competence. A modern application has a lot of components. Somebody has to have the systems engineering competence to build a fault tolerant application even if there is money available for redundancy and the organizational maturity to properly understand the cost/benefit analysis of fault tolerance vs. outages.

    I have no idea specifically why United Airlines or the State of Virginia experienced widespread outages this week. However, I’ll bet that the root cause will turn out to be a poor decision regarding the value of redundancy in the application or infrastructure made long ago. The risk lay buried until a certain rare set of events unfolded and brought the problem to the surface in the form of a widespread outage.

    Both Northrop Grumman and United Airlines have skilled technical people. I’ll bet that the Virginia outage was caused by a design error that was made long before Northrop arrived on the scene. I’ll further bet that the United outage was caused by a design error that occurred when the company was in desperate financial condition and trying to cut corners.

    OK – well this has been a hum-dinger of a babble.

    I’ll close by saying that what bothers me about the state of Virginia is less the outage than the admission by an official that the systems have been allowed to decay into a poor condition. This is getting to be a recurring theme. Govs. Kilgore, Warner and Kaine have touted themselves as brilliant operators running the best run state in the USA. At this point they look more like “spin doctors” allowing critical infrastructure to decay while concocting a story of managerial competence designed to get them to the next political level. And the state legislature went along for the ride instead of providing the advice and consent they are paid to provide.

    So, like I said, why should anybody vote for any incumbents?

    And – what will be the next example of a crumbling infrastructure to come into the public eye? Energy is already a leading candidate. Is water the next shock for the voters in the “best run state in America”?

  4. Larry Gross

    Groveton is right on.

    My free GOOGLE Gmail account appears to be more reliable than million dollar systems…like the State system.

    But I would not blame the govs… my experience has been that corporate attitudes towards IT tell the tale.

    Some .. recognize it as the cylinders in the engine and others think it’s the hood decoration.

    I’ll bet the State had the opportunity to specify a failover system. It’s very easy to write it in a spec… but it usually causes quite a stir with the bean counters.

  5. Reid Greenmun

    Avoid single points of failure in system design – make good use of RAID and disk mirroring design.
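    A toy illustration of the mirroring principle (no particular RAID controller or product implied): every write lands on two copies, so a read still succeeds when one copy dies.

    ```python
    class MirroredStore:
        """Toy RAID-1-style mirror: write to both copies, read from whichever survives."""

        def __init__(self):
            self.disk_a = {}
            self.disk_b = {}

        def write(self, key, value):
            # the same data goes to both "disks" on every write
            self.disk_a[key] = value
            self.disk_b[key] = value

        def read(self, key, failed_disk=None):
            # if one disk is dead, answer from the surviving copy
            source = self.disk_b if failed_disk == "a" else self.disk_a
            return source.get(key)

    store = MirroredStore()
    store.write("vehicle:ABC123", "registered")
    print(store.read("vehicle:ABC123", failed_disk="a"))   # still readable after disk A dies
    ```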

  6. Anonymous

    I don’t have problems with the previous posts, but I’d like to add my own after 35+ years in IT: Why is there never time to do it right but always time to do it over after it fails?

    Deena Flinchum

  7. Reid Greenmun

    Because senior IT decision-makers are seldom those who understand the tasks they are deciding about.

    Money and schedule trump sound technical solutions 98% of the time.

    The goal of high level managers is to “get it off my plate” as quickly – and inexpensively – as possible.

  8. Ray Hyde

    “FINALLY, after we had a crash that ate two days of 200 people sitting in front of dead terminals… they relented… and the money miraculously appeared .. by magic..”

    Every day, tens of thousands of people spend hundreds of thousands of hours sitting behind dead steering wheels.

    How long do you think it will be before the powers that be relent, and the money, spine, and inventiveness, to fix the problem magically appear?

  9. Ray Hyde

    “It’s about priorities because you never will have enough money to buy everything you think you need.”

    I agree.

    Please try to explain that to EMR.

  10. Ray Hyde

    Groveton:

    That is a fascinating post.

    I would nominate it for a Bacon Emmy.

  11. Ray Hyde

    This morning when I woke up, we were enjoying a power outage.

    There was no particular reason for this. We have had no bad weather in this area.

    My wife is convinced that the power company is doing this on purpose to engender sympathy for their transmission line projects. Based on the fact that I have heard several “news” reports sympathetic to the need for more transmission lines, I’m inclined to agree.

    Maybe I’m hypersensitive on this, because I am at ground zero in multiple scenarios. But, when I hear the same sentiments, with the same key words, coming from multiple directions, then my B.S. flag goes up.

    Anyway, I “showered” with baby wipes, brushed my teeth with bottled water, and shaved with a wind-up razor.

    So, reading this post, I said to myself, where is the redundancy in power generation? I have my own generator, but it isn’t set up for automatic operation. Certainly, there is some redundancy in interconnecting electric grids.

    But, isn’t the real answer similar to the Google solution? More sources? More locations?

    Why does it make any difference whether what we are talking about is information, power, or jobs?

  12. Larry Gross

    to be fair – not all outages can be prevented by redundancy. There does have to be a cost/benefit decision on just how much redundancy is worth versus the cost and risk.

    what got my attention though with the state failure was two things.

    1. – that “parts” needed to be ordered… and were being “Fed Exed”.

    2. – that apparently the State’s calculation with respect to how much redundancy versus how much damage might result from an outage doesn’t add up ….

    The second one.. is typical of many corporate and government agencies. They just don’t want to consider what happens when there is a failure because as Reid sez the corporate culture is to NOT be the guy who stops the paperwork shuffling… by asking questions that cause others to have to work and money to be spent…

    But the first one tells the tale.

    If there is a 24/7 mandate – then parts are on-hand.. on purpose

    …and if failover is a mandate then those parts are actually already configured as part of a system much like a computer version of an emergency generator
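    To put that in concrete terms, here is a rough sketch (purely illustrative, not VITA’s actual setup) of the “emergency generator” logic: because the standby is already racked, powered and kept in sync, failing over is a yes/no decision rather than a parts order.

    ```python
    def failover_decision(health_history, max_misses=3):
        """Decide whether to promote the warm standby.

        health_history: recent primary health-check results, newest last.
        The standby is assumed to be racked, powered, and kept in sync already,
        the computer-room version of an emergency generator.
        """
        recent_misses = 0
        for ok in reversed(health_history):
            if ok:
                break
            recent_misses += 1
        return recent_misses >= max_misses

    # primary missed its last three checks: promote the standby instead of FedExing parts
    print(failover_decision([True, True, False, False, False]))   # -> True
    ```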

  13. Anonymous

    Reid Greenmun’s points are valid. I have also discovered that in the realm of government and non-profit areas, turf tends to come into play. Each director or department head tends to want his very own system, and forget what’s best for the whole organization. Net result: Numerous systems on various platforms, none of which “talk” to each other and all of which cost $$$$ to maintain. Plus most have lousy and/or outdated documentation, if any at all.

    I especially like “Because senior IT decision-makers are seldom those that understand the tasks they are deciding about. ” My last IT director was a CPA and had next to no idea HOW things actually worked in IT. I’m old school – measure twice and cut once – so I rarely had a major problem with what I worked on. I had a colleague who was, I admit, very, very smart. The problem was that he was also very, very careless and had a tendency to activate systems/programs without checking them out carefully. Some of his projects crashed spectacularly, sending key users into a tailspin. My colleague would then go in and fix his mistakes and things got back to normal.

    The director’s attitude was that my colleague was brilliant and doing a great job because he knew how to fix all of these problems. I, on the other hand, must have an awfully easy job BECAUSE PROBLEMS JUST DIDN’T SEEM TO HAPPEN ON MY PROJECTS. I kid you not.

    When he finally realized that I was serious about retiring, he offered me a job as a consultant and was stunned when I refused.

    Deena Flinchum

  14. Anonymous

    This is how it works. If there are no problems, there are no jobs. Systems that can take care of themselves do not need fixers. You screw yourself if you have systems that do not need fixing. That’s why there are software applications like Remedy, to provide the metrics that support getting rid of workers. Cutting once does not justify your job. Rebuilding a server that wasn’t built right to begin with, now that’s the ticket. Trouble ticket that is.

  15. Anonymous

    OK. I work there so I will have to make this post anonymous, something I rarely do.

    The State is still running on the old data center downtown while the new one is under construction. They actually thought they had redundancy, and they did, but it didn’t work. The hardware in question was a device that communicates between the mainframe servers and the peripherals; the design of the new data center is more modern and will not require these devices.

    After the dust settled it turned out that there was a defect in some memory that was known to the original manufacturer but apparently the memory was only replaced on devices that actually failed. Think of it like many of the car “recalls”. The car makers often know of a defect but sometimes they don’t broadcast the information so unless YOUR car stops in the middle of the road you will never know.

    Regarding the comments about “needs more money”: that is exactly what the partnership with Northrop Grumman is all about. Northrop Grumman is investing millions of $$ in the state IT infrastructure in a short time with a long-term payback. The problem with the state system is that the two-year budget cycle makes it hard to invest $$ in infrastructure, and you can’t get the cost savings without the up-front investment.

  16. Jim Bacon

    Anonymous, Thanks for the explanation. It makes sense to me. (But, then, what do I know about IT?)
