Talk Maintenance

Hey everyone,

Just letting you know I’m tweaking some Talk settings to try to fix the various issues going on.

Thanks!

5 Likes

Up first: we’re out of CPU credits. We use one of Amazon’s lower-priced burstable instance tiers, since it’s worked fine and scales automatically with usage. However, it looks like something spiked on the 26th and the credit balance has never recovered since…

I’m temporarily switching our credits to unlimited to help it catch back up, so I can take a better look at things on a more functional system.
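
For anyone curious, this is roughly the knob I’m turning from the CLI; the instance ID here is a placeholder, not our actual one:

# Check the recent CPU credit balance (burstable instances report it to CloudWatch)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '2 days ago' +%FT%TZ)" --end-time "$(date -u +%FT%TZ)" \
  --period 3600 --statistics Average

# Switch the instance to unlimited credits (and back to standard once it recovers)
aws ec2 modify-instance-credit-specification \
  --instance-credit-specifications "InstanceId=i-0123456789abcdef0,CpuCredits=unlimited"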

3 Likes

That would have been the recent outage, yes?
It’s much more functional now than it was on the 25th, but it “feels” like something done during that recovery isn’t sitting right…

1 Like

We should be looking into setting up a swarm cluster in AWS for things like this.

1 Like

Where is the database for Talk? Inside the instance?

1 Like

From what I saw, it’s Postgres running inside the Docker container for Talk.
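
Quick way to confirm that from the host, assuming the standard discourse_docker setup with the container named app:

docker exec app ps -ef | grep [p]ostgres   # postgres processes live inside the container
docker port app                            # if no 5432 mapping is listed, the DB isn't exposed outside it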

2 Likes

It’s working!!!

9 Likes

How can we tell in the future if it gets behind?

1 Like

So Discourse uses Sidekiq to send out notifications and handle all potentially delayable actions, and it seems our Sidekiq is running extremely behind.

# sidekiq 5.0.3 discourse [5 of 5 busy]

Investigating that next.
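
To answer the earlier question about spotting this: Sidekiq keeps queue stats we can poll, so a rough check looks something like this (assuming the standard discourse_docker layout with the container named app; the hostname below is a placeholder):

# From the host: count enqueued jobs via the Rails runner inside the container
docker exec app bash -c 'cd /var/www/discourse && sudo -u discourse bundle exec rails runner "puts Sidekiq::Stats.new.enqueued"'

# Or, as an admin, just watch the queues in the built-in dashboard:
#   https://talk.example.org/sidekiq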

4 Likes

There are 56,854 pending jobs that Sidekiq needs to send out…

4 Likes

Merciful Zeus, that’s a lot of notifications.

Do we have a root cause for why that’s bogged down?

Can’t get an admin login because the emails are backed up ;p

Hoping the CPU usage goes down sometime today. If not, I’ll have to get Stan to log in as admin so I can look through the queues.

I may not have the same views as Stan, but it looks like we’ll also need to plan for a major upgrade: 2.1 is out and we’re drastically behind.
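
For reference, the usual upgrade path on a standard discourse_docker install is a rebuild from /var/discourse, roughly:

cd /var/discourse
git pull                 # update the launcher and templates
./launcher rebuild app   # rebuild the container on the current Discourse release (brief downtime)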

Just something else to have on our radar. In the meantime, it looks like Sidekiq can be pulled out of the stack as its own container and deployed outside the main system. If we turn Talk into a Docker swarm, we can dedicate a few temporary VMs just to the message queue and take them out of the swarm once it’s done processing (see the sketch after the deployment script below).

Deployment script

Borrowing from my devops talk repository:

#!/bin/bash
# build.sh - create a Docker swarm-mode cluster, then deploy a service stack
# Copyright (c)2017 Dwight Spencer <[email protected]>, All Rights Reserved.

ENVIRONMENT=${ENVIRONMENT:-"stage"}
MAXWORKERS=${MAXWORKERS:-"5"}
MAXMASTERS=${MAXMASTERS:-"3"}
SERVICES=${SERVICES:-"services.yml"}

# Aliases are not expanded in non-interactive scripts unless this is enabled.
shopt -s expand_aliases
alias swarm="docker swarm"
alias machine="docker-machine"
alias compose="docker stack deploy --compose-file"

# Provision a docker-machine host with the generic driver and the overlay2 storage driver.
dmcreate() {
  machine create --engine-storage-driver overlay2 -d generic "$@"
}

# Join nodes ${start}..${max} of the given role to the swarm led by ${master}.
connect() {
  local start=$1
  local max=$2
  local name=$3
  local token=$4
  local master=$5

  for x in $(seq "${start}" "${max}"); do
    eval "$(machine env ${ENVIRONMENT}.${name}-${x})"
    RHOST=$(machine ip "${ENVIRONMENT}.${name}-${x}")
    swarm join --advertise-addr "$RHOST" --listen-addr "$RHOST" --token "${token}" "${master}:2377"
  done
}

#vpc=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 | jq -r ".[].VpcId")
#aws ec2 create-subnet --vpc-id $vpc --cidr-block 10.0.0.0/24
#aws ec2 create-subnet --vpc-id $vpc --cidr-block 10.0.1.0/24
#igateway=$(aws ec2 create-internet-gateway --vpc-id $vpc | jq -r ".[].InternetGatewayId")
#ngateway=$(aws ec2 create-nat-gateway --vpc-id $vpc | jq -r ".[].NatGatewayId")
#iroute=$(aws ec2 create-route-table --vpc-id $vpc | jq -r ".[].RouteTableId")
#aws ec2 create-route --route-table-id $iroute --destination-ipv4-cidr-block 0.0.0.0/0 --gateway-id $igateway
#aws ec2 security group ...

for x in $(seq 1 "${MAXMASTERS}"); do dmcreate ${ENVIRONMENT}.master-$x; done
for x in $(seq 1 "${MAXWORKERS}"); do dmcreate ${ENVIRONMENT}.worker-$x; done

# Initialise the swarm on the first master and grab the join tokens.
eval "$(machine env ${ENVIRONMENT}.master-1)"

export master_ip=$(machine ip "${ENVIRONMENT}.master-1")
swarm init --advertise-addr "${master_ip}" --listen-addr "${master_ip}"

export manager_token=$(swarm join-token manager -q)
export worker_token=$(swarm join-token worker -q)

# Remaining masters join as managers; workers join as workers.
connect 2 "${MAXMASTERS}" "master" "${manager_token}" "${master_ip}"
connect 1 "${MAXWORKERS}" "worker" "${worker_token}" "${master_ip}"

# Deploy the service stack from a manager node.
eval "$(machine env ${ENVIRONMENT}.master-1)"
compose "${SERVICES}" "${ENVIRONMENT}"
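
And for the “temporary VMs just for the queue” idea mentioned above, once the stack is up it would look something like this; the service and image names here are made up for illustration:

# Label the temporary workers so the queue service only lands on them
docker node update --label-add role=queue stage.worker-4
docker node update --label-add role=queue stage.worker-5

# Run extra Sidekiq workers as their own service pinned to those nodes
# (talk-image is a placeholder; the real service needs the same DB/Redis env as the web container)
docker service create --name talk_sidekiq --replicas 4 \
  --constraint 'node.labels.role == queue' \
  --workdir /var/www/discourse \
  talk-image bundle exec sidekiq

# When the backlog is cleared, drain and remove the temporary nodes
docker node update --availability drain stage.worker-4
docker node rm --force stage.worker-4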

Looked at it with Stan over lunch; there were 450,000 queued jobs waiting because SMTP authentication kept failing.

Turns out the IP address somehow changed, so SmartPost was no longer allowing us to send.

If you get old emails, sorry; clearing them out of the queue is a bit out of reach right now.
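
For the record, a quick way to sanity-check the mail path after something like this (the relay hostname and credentials below are placeholders, and swaks needs to be installed):

# Confirm the outbound IP the relay is now seeing from us
curl -s https://ifconfig.me ; echo

# Test SMTP authentication against the relay without actually sending anything
swaks --server smtp.relay.example:587 --tls \
  --auth LOGIN --auth-user "$SMTP_USERNAME" --auth-password "$SMTP_PASSWORD" \
  --to [email protected] --quit-after AUTH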

5 Likes

Whatever it takes to flush that toilet, just do it.

1 Like

Down to 400,000 enqueued already.

… 300,000

… 200,000 (looks like ~100,000/hour)

1 Like

Those are spammer numbers right there.

1 Like