Hey everyone,
Just letting you know I’m tweaking some talk settings to try and fix the various issues going on.
Thanks!
Up first! We’re out of CPU credits. We use one of Amazon’s lower-priced burstable instance tiers, since it’s worked fine and automatically adjusts to the amount of usage. However, it looks like something spiked on the 26th and caused a chain of non-recovery…
I’m temporarily switching our credits to unlimited to help it catch back up, so I can take a better look at a more functional system.
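For the record, if we ever need to flip that setting from the command line instead of the console, something like this should do it (the instance ID is a placeholder, and this assumes the talk box is a T2/T3 burstable instance):

# check the current credit setting for the talk instance (instance ID is a placeholder)
aws ec2 describe-instance-credit-specifications --instance-ids i-0123456789abcdef0
# temporarily let it burst past its earned credits
aws ec2 modify-instance-credit-specification --instance-credit-specification "InstanceId=i-0123456789abcdef0,CpuCredits=unlimited"
# switch back once it has caught up
aws ec2 modify-instance-credit-specification --instance-credit-specification "InstanceId=i-0123456789abcdef0,CpuCredits=standard"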
That would have been the recent outage, yes?
Though it’s quite a bit more functional now than it was on the 25th, it “feels” like something done during that recovery isn’t sitting right…
We should be looking into setting up a swarm cluster in AWS for things like this.
Where is the database for talk? Inside the instance?
From what I saw, it’s Postgres running inside the Docker container for Talk.
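For anyone who wants to poke at it directly: assuming this is the stock Discourse docker install (container named "app" under /var/discourse, database named "discourse" — those names are guesses about our setup), you can get at Postgres like this:

cd /var/discourse
./launcher enter app                 # shell inside the talk container
su postgres -c 'psql discourse'      # psql session on the bundled database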
It’s working!!!
How can we tell in the future if it gets behind?
So Discourse uses Sidekiq to send out notifications and handle anything that can be deferred, and it seems our Sidekiq is extremely delayed.
# sidekiq 5.0.3 discourse [5 of 5 busy]
Investigating that next.
There are 56,854 pending jobs that Sidekiq needs to send out…
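On the “how can we tell next time” question, here’s a rough sketch of a check we could cron on the host. It assumes the stock Discourse layout (container named "app", app under /var/www/discourse) and an arbitrary threshold — and note the /sidekiq admin dashboard shows the same numbers per queue:

#!/bin/bash
# alert if the Sidekiq backlog looks unhealthy (names/paths are assumptions about our setup)
THRESHOLD=5000
backlog=$(docker exec app su discourse -c "cd /var/www/discourse && bundle exec rails runner 'puts Sidekiq::Stats.new.enqueued'")
if [ "${backlog:-0}" -gt "$THRESHOLD" ]; then
  echo "Sidekiq has ${backlog} jobs enqueued -- talk is probably falling behind"
fi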
Merciful Zeus that’s a lot of notifications
Do we have a root cause for why that’s bogged down?
Can’t get an admin login because the emails are backed up ;p
Hoping the CPU goes down sometime today. If not, I’ll have to get Stan to log in as admin so I can look through the queues.
May not have the same views as Stan, but it looks like we’ll also need to plan for a major upgrade: 2.1 is out and we’re drastically behind.
Just something else to have on our radar. In the meantime, it looks like Sidekiq can be pulled out of the stack as its own container and deployed outside the main system. If we make Talk into a Docker swarm, then we can dedicate a few temporary VMs just to the message queue and take them out of the swarm when it’s done processing.
Borrowing from my devops talk repository:
#!/bin/bash
# build.sh - create a docker swarm mode cluster, then deploy a service stack
# Copyright (c)2017 Dwight Spencer <[email protected]>, All Rights Reserved.
ENVIRONMENT=${ENVIRONMENT:-"stage"}
MAXWORKERS=${MAXWORKERS:-"5"}
MAXMASTERS=${MAXMASTERS:-"3"}
SERVICES=${SERVICES:-"services.yml"}
# plain functions instead of aliases (aliases are not expanded in non-interactive scripts)
swarm()   { docker swarm "$@"; }
machine() { docker-machine "$@"; }
compose() { docker stack deploy --compose-file "$@"; }

# provision a machine with the generic driver and the overlay2 storage driver
dmcreate() {
  machine create --engine-storage-driver overlay2 -d generic "$@"
}
# join each node in a group (master/worker) to the swarm
connect() {
  local max=$1
  local name=$2
  local command=$3
  local token=$4
  local master=${5:-""}
  for x in $(seq 1 "${max}"); do
    eval "$(machine env ${ENVIRONMENT}.${name}-${x})"
    # skip nodes that are already part of the swarm (e.g. the initial master)
    [ "$(docker info --format '{{.Swarm.LocalNodeState}}')" = "active" ] && continue
    RHOST=$(machine ip ${ENVIRONMENT}.${name}-${x})
    swarm "${command}" --advertise-addr "$RHOST" --listen-addr "$RHOST" --token "${token}" ${master}
  done
}
#vpc=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 | jq -r ".[].VpcId")
#aws ec2 create-subnet --vpc-id $vpc --cidr-block 10.0.0.0/24
#aws ec2 create-subnet --vpc-id $vpc --cidr-block 10.0.1.0/24
#igateway=$(aws ec2 create-internet-gateway --vpc-id $vpc | jq -r ".[].InternetGatewayId")
#ngateway=$(aws ec2 create-nat-gateway --vpc-id $vpc | jq -r ".[].NatGatewayId")
#iroute=$(aws ec2 create-route-table --vpc-id $vpc | jq -r ".[].RouteTableId")
#aws ec2 create-route --route-table-id $iroute --destination-cidr-block 0.0.0.0/0 --gateway-id $igateway
#aws ec2 security group ...
# provision the manager and worker machines
for x in $(seq 1 "${MAXMASTERS}"); do dmcreate ${ENVIRONMENT}.master-$x; done
for x in $(seq 1 "${MAXWORKERS}"); do dmcreate ${ENVIRONMENT}.worker-$x; done

# initialize the swarm on the first master and grab the join tokens
eval "$(docker-machine env ${ENVIRONMENT}.master-1)"
export master_ip=$(docker-machine ip ${ENVIRONMENT}.master-1)
swarm init --advertise-addr "${master_ip}" --listen-addr "${master_ip}"
export manager_token=$(swarm join-token manager -q)
export worker_token=$(swarm join-token worker -q)

# join the remaining masters as managers and all workers as workers
connect ${MAXMASTERS} "master" "join" ${manager_token} "${master_ip}:2377"
connect ${MAXWORKERS} "worker" "join" ${worker_token} "${master_ip}:2377"

# deploy the service stack onto the swarm
compose ${SERVICES} ${ENVIRONMENT}
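And for completeness, a sketch of the services.yml that script deploys, with Sidekiq split out as its own service the way I was describing — image names, env vars, and replica counts here are placeholders, not our actual config:

version: "3.3"
services:
  web:
    image: local/talk-discourse
    environment:
      - DISCOURSE_DB_HOST=postgres
      - DISCOURSE_REDIS_HOST=redis
    ports:
      - "80:80"
    deploy:
      replicas: 1
  sidekiq:
    image: local/talk-discourse          # same image, different command
    command: bundle exec sidekiq -e production
    environment:
      - DISCOURSE_DB_HOST=postgres
      - DISCOURSE_REDIS_HOST=redis
    deploy:
      replicas: 3                        # scale up while the queue drains, back down after
      placement:
        constraints: [node.role == worker]
  redis:
    image: redis:4
  postgres:
    image: postgres:10
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata: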
Looked with Stan at lunch: there were 450,000 queued jobs waiting because SMTP authentication kept failing.
Turns out the IP address somehow changed, so SmartPost was no longer allowing us to send.
If you get old emails, sorry, but it’s slightly out of reach to clear them out.
Whatever it takes to flush that toilet just do it.
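If we do end up force-flushing instead of letting it drain, Sidekiq’s API can drop jobs selectively. A sketch, run from inside the talk container (./launcher enter app) — the paths and especially the job class name are guesses about our setup, so check the tally before deleting anything:

cd /var/www/discourse
# tally which job classes are actually clogging the default queue
su discourse -c "bundle exec rails runner 'c = Hash.new(0); Sidekiq::Queue.new(\"default\").each { |j| c[j.klass] += 1 }; puts c.inspect'"
# then drop only the stale email jobs (swap in whichever class the tally shows is the offender)
su discourse -c "bundle exec rails runner 'Sidekiq::Queue.new(\"default\").each { |j| j.delete if j.klass == \"Jobs::UserEmail\" }'"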
Down to 400,000 enqueued already.
… 300,000
… 200,000 (looks like ~100,000/hour)
That’s spammer’s numbers right there.