Limbo! How low can you go?

You’ve seen the limbo line, with each time around going a little lower, until the player crashes. Well, the past two weeks I’ve been playing “IT Limbo”.

If you’re an IT person, and responsible for a lot of gear, you’ll know about this. It all begins when an important system fails, and you have no choice but to hammer on it till it comes up.

We ended up in a situation where we had to move a bunch of blade servers and EMC VNX arrays. Such devices do not like to be moved, and require lots of careful planning. We did our due diligence, had a carefully devised plan, and on the appointed day, we spent many hours dropping volumes, LUNs, replication and mirroring links, failing things over to a secondary, etc. We run a system where every volume has a replication partner, and a third data storehouse for dead data (backups). Trouble is, when moving, one of the hot partners has to be shut off, leaving you with one hot array.

We had successfully finished the job, were cleaning up, and preparing to leave for the day, when hundreds of panicked calls started coming in - “the XYZ is down! OMG!” We ran into the data center, where the movers were just leaving, and began frantically trying to figure out what happened. Didn’t take long - one of the movers had accidentally unplugged 8 twinax 10gig cables, and dropped our remaining array (1 petabyte worth) by disconnecting it from the world. So we started plugging the cables back in.

And no ESX server would connect. After going through basic troubleshooting, nothing was working, and we started dialing support and lining up troops.

We call it going around the world. You get on a support call, and as each tech finishes his/her shift, they transfer you to the next open support center, usually somewhere to the west of you. This time, we went around the world with EMC for five days. 24/7.

We set up cots in our offices. I managed 9 hours sleep in 5 days. Everyone who saw me thought I had cancer, or some dread malady. I finally passed out one morning, and the staff was unable to wake me. I slept 7 hours. So evidently, 9 hours in 5 days was as low as I could go. Limbo!

BTW - system is back online, and all is much better now. I’m just about desperate to do my things now, or to have any time away from the office. Can’t wait to get back to my projects.

8 Likes

I listened to The Ticket’s “Junior” Miller recount his participation in the (bicycle) Race Across America - 9 hours in 5 days sounds about right for what he and his team-mates were able to manage.

Now you just need a bike!

Murphy is alive and well. I’ve experienced situations similar to this. No matter how much planning, how much expertise, how much risk management or how much care is taken, it just happens sometimes. It can strike terror into the hearts of the most qualified IT staff. The most important thing is how you deal with it, and it sounds like, while grueling, your team dealt with it very well. Congratulations!

Did the movers own up to their mistake? Did they have any insurance or did your business have insurance? Many moves involve the purchase of insurance to deal with the cost ramifications of accidental downtime.

Sometimes the job just sucks all of the fun out of life, doesn’t it? Sometimes, though, it is what we live for. I’m sure a lot of IT people can relate to your story. Thanks for sharing.

I wondered where you’d gone to since you hadn’t posted anything in a while. Welcome back!

1 Like

Hah! As if.

Honestly, I’ve gone through a lot of these over the years, but this one was so much worse. I’m responsible for IT. So it falls on me, accident or not. And I had to reassure and listen to the worries of a lot of other Directors throughout the process. Those of you who know where I work know that the stakes are considerably higher than for most other businesses. An event like this can literally kill people. So for me, it was less about the technical challenges, and more about the stresses caused by responsibility, the stakes, and other people’s (justified) fears.

Glad it’s over. Now, to figure out how to prevent it from happening again.

1 Like

Well, there comes a point at which you need rest, like it or not. When people don’t get the rest they need, they tend to make careless mistakes. I have done my fair share of 36-48 hour work days; likely it will not happen again. I also did 120-hour work weeks for about 2 months; the money didn’t matter to me at that point. It did make a nice down payment and paid for all the furniture in my house at the time. Those were at hospitals or data centers.

Fortunately, my current bosses know how much time I have put in over the years, so they don’t rely on me working too much overtime anymore. I used to be first or second in line if they needed someone.

2 Likes

RULE #1: Never leave movers next to anything that is important.
RULE #2: See Rule #1

:zipper_mouth:

2 Likes