Hey everyone,
Just letting you know I’m tweaking some talk settings to try and fix the various issues going on.
Thanks!
Up first! We’re out of CPU credits. We use one of Amazon’s lower-priced burstable instance tiers, since it’s worked fine and automatically adjusts to the amount of usage. However, it looks like something spiked on the 26th and caused a chain of non-recovery…
I’m temporarily switching our credits to unlimited to help it catch back up, so I can take a better look at a more functional system.
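For the record, if we ever need to flip that setting from the command line instead of the console, something like this should do it (the instance ID is a placeholder, and this assumes the talk box is a T2/T3 burstable instance):

# check the current credit setting for the talk instance (instance ID is a placeholder)
aws ec2 describe-instance-credit-specifications --instance-ids i-0123456789abcdef0
# temporarily let it burst past its earned credits
aws ec2 modify-instance-credit-specification --instance-credit-specification "InstanceId=i-0123456789abcdef0,CpuCredits=unlimited"
# switch back once it has caught up
aws ec2 modify-instance-credit-specification --instance-credit-specification "InstanceId=i-0123456789abcdef0,CpuCredits=standard"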
That would have been the recent outage, yes?
Though it’s quite a bit more functional now than it was on the 25th, it “feels” like something done during that recovery isn’t sitting right…
We should be looking into setting up a swarm cluster in AWS for things like this.
Where is the database for talk? Inside the instance?
From what I saw, it’s Postgres running inside the Docker container for Talk.
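For anyone who wants to poke at it directly: assuming this is the stock Discourse docker install (container named "app" under /var/discourse, database named "discourse" — those names are guesses about our setup), you can get at Postgres like this:

cd /var/discourse
./launcher enter app                 # shell inside the talk container
su postgres -c 'psql discourse'      # psql session on the bundled database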
It’s working!!!
How can we tell in the future if it gets behind?
So Discourse uses Sidekiq to send out notifications and handle anything that can be deferred, and it seems our Sidekiq is extremely delayed.
# sidekiq 5.0.3 discourse [5 of 5 busy]
Investigating that next.
There are 56,854 pending jobs that Sidekiq needs to send out…
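On the “how can we tell next time” question, here’s a rough sketch of a check we could cron on the host. It assumes the stock Discourse layout (container named "app", app under /var/www/discourse) and an arbitrary threshold — and note the /sidekiq admin dashboard shows the same numbers per queue:

#!/bin/bash
# alert if the Sidekiq backlog looks unhealthy (names/paths are assumptions about our setup)
THRESHOLD=5000
backlog=$(docker exec app su discourse -c "cd /var/www/discourse && bundle exec rails runner 'puts Sidekiq::Stats.new.enqueued'")
if [ "${backlog:-0}" -gt "$THRESHOLD" ]; then
  echo "Sidekiq has ${backlog} jobs enqueued -- talk is probably falling behind"
fi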
Merciful Zeus that’s a lot of notifications
Do we have a root cause for why that’s bogged down?
Can’t get an admin login because the emails are backed up ;p
Hoping the CPU goes down sometime today. If not, I’ll have to get Stan to log in as admin so I can look through the queues.
May not have the same views as Stan, but it looks like we’ll also need to plan for a major upgrade: 2.1 is out and we’re drastically behind.
Just something else to have on our radar. In the meantime, it looks like Sidekiq can be pulled out of the stack as its own container and deployed outside the main system. If we make Talk into a Docker swarm, then we can dedicate a few temporary VMs just to the message queue and take them out of the swarm when it’s done processing.
Borrowing from my devops talk repository:
#!/bin/bash
# build.sh - create a docker swarm mode cluster, then deploy a service stack
# Copyright (c)2017 Dwight Spencer <[email protected]>, All Rights Reserved.
ENVIRONMENT=${ENVIRONMENT:-"stage"}
MAXWORKERS=${MAXWORKERS:-"5"}
MAXMASTERS=${MAXMASTERS:-"3"}
SERVICES=${SERVICES:-"services.yml"}
# plain functions instead of aliases (aliases are not expanded in non-interactive scripts)
swarm()   { docker swarm "$@"; }
machine() { docker-machine "$@"; }
compose() { docker stack deploy --compose-file "$@"; }

# provision a machine with the generic driver and the overlay2 storage driver
dmcreate() {
  machine create --engine-storage-driver overlay2 -d generic "$@"
}
# join each node in a group (master/worker) to the swarm
connect() {
  local max=$1
  local name=$2
  local command=$3
  local token=$4
  local master=${5:-""}
  for x in $(seq 1 "${max}"); do
    eval "$(machine env ${ENVIRONMENT}.${name}-${x})"
    # skip nodes that are already part of the swarm (e.g. the initial master)
    [ "$(docker info --format '{{.Swarm.LocalNodeState}}')" = "active" ] && continue
    RHOST=$(machine ip ${ENVIRONMENT}.${name}-${x})
    swarm "${command}" --advertise-addr "$RHOST" --listen-addr "$RHOST" --token "${token}" ${master}
  done
}
#vpc=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 | jq -r ".[].VpcId")
#aws ec2 create-subnet --vpc-id $vpc --cidr-block 10.0.0.0/24
#aws ec2 create-subnet --vpc-id $vpc --cidr-block 10.0.1.0/24
#igateway=$(aws ec2 create-internet-gateway --vpc-id $vpc | jq -r ".[].InternetGatewayId")
#ngateway=$(aws ec2 create-nat-gateway --vpc-id $vpc | jq -r ".[].NatGatewayId")
#iroute=$(aws ec2 create-route-table --vpc-id $vpc | jq -r ".[].RouteTableId")
#aws ec2 create-route --route-table-id $iroute --destination-cidr-block 0.0.0.0/0 --gateway-id $igateway
#aws ec2 security group ...
# provision the manager and worker machines
for x in $(seq 1 "${MAXMASTERS}"); do dmcreate ${ENVIRONMENT}.master-$x; done
for x in $(seq 1 "${MAXWORKERS}"); do dmcreate ${ENVIRONMENT}.worker-$x; done

# initialize the swarm on the first master and grab the join tokens
eval "$(docker-machine env ${ENVIRONMENT}.master-1)"
export master_ip=$(docker-machine ip ${ENVIRONMENT}.master-1)
swarm init --advertise-addr "${master_ip}" --listen-addr "${master_ip}"
export manager_token=$(swarm join-token manager -q)
export worker_token=$(swarm join-token worker -q)

# join the remaining masters as managers and all workers as workers
connect ${MAXMASTERS} "master" "join" ${manager_token} "${master_ip}:2377"
connect ${MAXWORKERS} "worker" "join" ${worker_token} "${master_ip}:2377"

# deploy the service stack onto the swarm
compose ${SERVICES} ${ENVIRONMENT}
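And for completeness, a sketch of the services.yml that script deploys, with Sidekiq split out as its own service the way I was describing — image names, env vars, and replica counts here are placeholders, not our actual config:

version: "3.3"
services:
  web:
    image: local/talk-discourse
    environment:
      - DISCOURSE_DB_HOST=postgres
      - DISCOURSE_REDIS_HOST=redis
    ports:
      - "80:80"
    deploy:
      replicas: 1
  sidekiq:
    image: local/talk-discourse          # same image, different command
    command: bundle exec sidekiq -e production
    environment:
      - DISCOURSE_DB_HOST=postgres
      - DISCOURSE_REDIS_HOST=redis
    deploy:
      replicas: 3                        # scale up while the queue drains, back down after
      placement:
        constraints: [node.role == worker]
  redis:
    image: redis:4
  postgres:
    image: postgres:10
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata: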
Looked with Stan at lunch: there were 450,000 queued jobs waiting because SMTP authentication kept failing.
Turns out the IP address somehow changed, so SmartPost was no longer allowing us to send.
If you get old emails, sorry, but it’s slightly out of reach to clear them out.
Whatever it takes to flush that toilet just do it.
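If we do end up force-flushing instead of letting it drain, Sidekiq’s API can drop jobs selectively. A sketch, run from inside the talk container (./launcher enter app) — the paths and especially the job class name are guesses about our setup, so check the tally before deleting anything:

cd /var/www/discourse
# tally which job classes are actually clogging the default queue
su discourse -c "bundle exec rails runner 'c = Hash.new(0); Sidekiq::Queue.new(\"default\").each { |j| c[j.klass] += 1 }; puts c.inspect'"
# then drop only the stale email jobs (swap in whichever class the tally shows is the offender)
su discourse -c "bundle exec rails runner 'Sidekiq::Queue.new(\"default\").each { |j| j.delete if j.klass == \"Jobs::UserEmail\" }'"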
Down to 400,000 enqueued already.
… 300,000
… 200,000 (looks like ~100,000/hour)
That’s spammer’s numbers right there.