Talk's Outage this morning

Thanks to everyone on @Team_Infrastructure, and thanks to @Draco for the announcement this morning. We were able to recover Talk from an outage fairly quickly. Hopefully this didn’t affect many of you during that time, but we do apologize for the inconvenience.

If you’re interested in seeing the post mortem, feel free to review helpdesk issue #41, where we dive into the root cause, build a runbook for resolving this in the future, and discuss solutions to prevent further outages.

2 Likes

Ahhh, it ran out of space …

What are our current procedures for upgrading / security updates for our sites?
I know your monitoring software should help a bit. Will it remind us for preventive maintenance?

4 Likes

I know your monitoring software should help a bit. Will it remind us for preventive maintenance?

For preventive maintenance we should be writing Ansible playbooks and having either GitLab or Travis CI execute those playbooks as part of our regular CI/CD automated chores.
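As a sketch of what that chore job could look like, here’s a hypothetical GitLab CI fragment; the job name, image, and playbook paths are all invented, not our actual repo layout:

```yaml
# Hypothetical .gitlab-ci.yml fragment: a scheduled "chores" job that
# runs a maintenance playbook. All names and paths here are made up.
maintenance-chores:
  image: cytopia/ansible:latest   # example Ansible-capable image
  script:
    # Dry run first so the job log shows what would change...
    - ansible-playbook -i inventory/production maintenance.yml --check --diff
    # ...then apply for real.
    - ansible-playbook -i inventory/production maintenance.yml
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
```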

There are also configuration settings for apt to prune out old kernels; part of the issue was that this had not been set up.
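For the kernel-pruning piece specifically, assuming a Debian/Ubuntu host with unattended-upgrades installed, a minimal sketch is a drop-in config like this (the filename is made up):

```
# /etc/apt/apt.conf.d/51-prune-kernels  (hypothetical filename)
# Let unattended-upgrades purge kernel packages that are no longer needed.
Unattended-Upgrade::Remove-Unused-Kernel-Packages "true";
```

A one-off `sudo apt-get autoremove --purge` will also reclaim the space from kernels that have already been superseded.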

The monitoring system helps with capacity planning, life-cycle management of hardware, performance monitoring, situational awareness / auditing, and alerting in general.

What are our current procedures for upgrading / security updates for our sites?

Since we’re moving towards an automated system, the current procedure is a bit dated. The updated procedure is to put all changes in GitHub with a Dockerfile, have Docker Hub build the images, and have Travis CI run the tests.

When everything is green and the Infrastructure Chair has signed off, we bump the tagged version in GitHub for release, and that code is what’s automatically pushed to the production nodes.
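The version bump itself is just an annotated tag; here’s a minimal self-contained sketch using a throwaway repo (the version number and commit are made up):

```shell
set -e

# Demo in a throwaway repo so this sketch runs anywhere git does.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -m "release candidate" -q

VERSION="v1.2.3"   # hypothetical version bump
git -c user.name=demo -c user.email=demo@example.com \
    tag -a "$VERSION" -m "Release $VERSION"
git tag            # the new tag is now listed
# In the real flow, `git push origin "$VERSION"` follows, and CI/CD
# deploys whatever that tag points at.
```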

As far as the undercloud goes (i.e., the VM and resulting OS that Docker runs on top of), that’s all managed by Ansible plus CI/CD. If anything would require a maintenance window, we discuss it with the infrastructure team and post the window on the calendar.

Breakfixes
Like the one this morning, these do happen, though they are rare in a CI/CD environment. The procedure we used this morning was:

  1. Check our egos at the door
  2. Create a ticket on helpdesk.dallasmakerspace.org about the issue, for tracking effort and documenting a runbook
  3. Create a conference room for calling in (Discord chat/voice works too)
  4. Determine the root cause and an action plan
  5. Execute the action plan
  6. Document changes and deltas
  7. Evaluate preventive measures
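Since the root cause this morning was a full disk, step 4 started with the usual disk triage; a quick sketch (the paths are just examples, adjust per host):

```shell
# Which filesystem is at (or near) 100%?
df -h

# Largest consumers under a suspect path (example path: /var).
du -xh --max-depth=2 /var 2>/dev/null | sort -h | tail -n 10

# On Debian/Ubuntu, see how many old kernels are still installed:
# dpkg -l 'linux-image-*'
```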

Love that one…

Reminds me of a good BOFH comic:

1 Like
  8. Don’t let Stan on the keyboard until he has had caffeine.
3 Likes

7.5 Don’t break things until @Denzuko has had his tenth cup of coffee

1 Like