Auto-scaling and self-defensive services in Golang


The Raygun service is made up of many moving parts, each specialized for a particular task. One of these is a process written in Golang that is responsible for desymbolicating iOS crash reports. You don’t need to know exactly what that means, but in short, it takes native iOS crash reports, looks up the relevant dSYM files, and processes them together to produce human-readable stack traces.

The operation of the dsym-worker is simple. It receives jobs via a single consumer attached to a Redis queue. It grabs a job from the queue, performs the job, acknowledges the queue and repeats. We have a single dsym-worker running on a single machine, which has usually been enough to process the job load at a reasonable rate. There are a few things that can and have happened with this simple setup which **require on-call maintenance**:

  • Increased load. Every now and then, usually during the weekend when perhaps people use their iOS devices more, the number of iOS crash reports coming in can be too much to keep up with while only a single job is processed at a time. If this happens, more dsym-worker processes need to be manually started to handle the load. Each process that is started attaches a consumer to the job queue. The Golang Redis queue library we use then distributes jobs across the consumers so that multiple jobs can be done at the same time.
  • Unresponsiveness. That is to say that the process is still running, but isn’t doing any work. In general, this can occur due to infinite loops or deadlocks. This is particularly bad, as our process monitor sees that the process is still running, and so it is only when the queue reaches a threshold that alerts are raised. If this happens, the process needs to be manually killed and a new one started. (Or perhaps many, to catch up on the workload.)
  • Termination. The process crashes and shuts down entirely. This has never happened to the dsym-worker, but is always a possibility as the code is updated. If this happens, a monitor alerts that the process has died, and it needs to be manually started up again.

Needing to deal with these in the middle of the night is not good, and sometimes it isn’t so good for the person responsible for the code either.

These things can and should be automated, and so I set out to do so.

Theory

So overall, we need auto-scaling to handle variable and increased load, a way to detect and replace unresponsive workers, and the ability to detect and restart dead processes. Time to come up with a plan of attack.

My first idea was extremely simple. The Golang Redis queue library we use, as you may expect, has the ability to attach multiple consumers to the queue from within a single process. By attaching multiple consumers, more work can be done at once, which should help with implementing the auto-scaling. Furthermore, if each consumer keeps track of when it last completed a job, it can be regularly checked to see whether it has been too long since it did any work. This could be used to implement simple detection of unresponsive consumers. At the time, I wasn’t focused on the dead worker detection, and so started looking into the feasibility of the plan so far.

It didn’t take long to discover that this strategy was not going to cut it – not in Golang at least. Each consumer is managed in a goroutine within the Golang Redis queue library. If an unresponsive consumer is detected, then we need to kill it off, but it turns out that one does not simply kill off a goroutine (oh, I should mention, I’m quite new to Golang). For a goroutine to end, it should generally complete its work, or be told to break out of a loop using a channel or some other mechanism. If a consumer is stuck in an infinite loop though, as far as I can tell, there isn’t a way to command the goroutine to close down. If there is, it’s bound to mean modifying the Golang Redis queue library. This strategy is getting more complicated than it’s worth, so let’s try something else.

My next idea was to write a whole new program that spawns and manages several workers. Each worker can still just attach a single consumer to the queue, but more processes running means more work being done at once. Golang certainly has the ability to start up and shut down child processes, so that helps a lot with auto-scaling. There are various ways for separate processes to communicate with each other, so the workers can tell the master process when they last completed a job. If the master process sees that a worker hasn’t done any work for too long, we get both unresponsiveness and death detection – more on those later.

Another approach that I briefly thought of is more of a hivemind setup. A single worker process could contain both the logic to process a single job at a time and the ability to spawn and manage other worker processes. If an unresponsive or dead process is detected, one of the other running processes could assume responsibility for starting up a new one. Collectively they could make sure there was always a good number of processes running to handle the load. I did not look into this at all, so I have no idea how sane it is. It could be an interesting exercise though.

In the end, I went with the master process approach. The following is how I tackled each challenge.

Auto-scaling

The master process starts by spinning up a goroutine that regularly determines the number of worker processes that should be running to handle the load. This main goroutine then starts or stops worker processes to match that number. The calculation of the desired worker count is very simple: it considers both the current length of the queue and the rate at which the queue length is changing. The longer the queue, or the faster jobs are being queued, the more workers should be spawned. Here is a simplified look at the main goroutine:

func watch() {
  procs := make(map[int]*Worker)

  for {
    // Check the health of each worker
    checkProcesses(&procs)

    queueLength, rate := queueStats()

    // Calculate desired worker count, then start/stop workers to match
    desiredWorkerCount := calculateDesiredWorkerCount(queueLength, rate, len(procs))
    if len(procs) != desiredWorkerCount {
      manageWorkers(&procs, desiredWorkerCount)
    }

    time.Sleep(30 * time.Second)
  }
}
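
The scaling calculation itself isn’t shown above, so here is a hypothetical sketch that matches the call in watch(). The formula, thresholds, and types are all assumptions; the only real requirement is that the result grows with the queue length and with how quickly jobs are arriving.

// Hypothetical sketch only: the real formula and thresholds aren't shown here.
// The longer the queue, or the faster it is growing, the more workers we want.
func calculateDesiredWorkerCount(queueLength int, rate float64, currentCount int) int {
  const minWorkers, maxWorkers = 1, 8 // assumed bounds

  desired := 1 + queueLength/100 // e.g. one extra worker per 100 queued jobs
  if rate > 0 {
    desired++ // the queue is growing, so add some headroom
  }

  // currentCount could be used to smooth scaling up and down; ignored here
  _ = currentCount

  if desired < minWorkers {
    desired = minWorkers
  }
  if desired > maxWorkers {
    desired = maxWorkers
  }
  return desired
}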

The master process needs to keep track of the child processes that it starts. This is to help with auto-scaling, and to regularly check the health of each child. I found that the easiest way to do this was to use a map with integer keys, and instances of a Worker struct as the values. The length of this process-map is used to determine the next key to use when adding a worker, and also which key to delete when removing a worker.
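
The Worker struct isn’t listed in this post, but based on the fields used throughout (Command, Index, LastJob and StdinPipe), it looks roughly like this:

// Approximate shape of the Worker struct, inferred from how it is used in the
// rest of this post; the real struct likely holds more than this.
type Worker struct {
  Command   *exec.Cmd       // Handle to the spawned child process
  Index     int             // Key of this worker in the process-map
  LastJob   time.Time       // When the worker last reported a completed job
  StdinPipe *io.WriteCloser // Used to send "stop" for graceful shut down
}

With that in place, starting or stopping workers to match the desired count looks like this: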

func manageWorkers(procs *map[int]*Worker, desiredWorkerCount int) {
  currentCount := len(*procs)
  if currentCount < desiredWorkerCount {
    // Add workers:
    for currentCount < desiredWorkerCount {
      StartWorker(procs, currentCount)
      currentCount++
    }
  } else if currentCount > desiredWorkerCount {
    // Remove workers:
    for currentCount > desiredWorkerCount {
      StopWorker(procs, currentCount - 1)
      currentCount--
    }
  }
}

Golang provides an os/exec package for high-level process management. The master process uses this package to spawn new worker processes. The master and worker processes are deployed to the same folder, so “./dsym-worker” can be used to start the workers up. However, this does not work when running the master process from a Launch Daemon. The first line in the StartWorker function below shows how to get the directory of the running executable. With this we can build the full path of the worker executable to run it up reliably. Once the process is running, we create an object from the Worker struct and store it in the process-map.

func StartWorker(procs *map[int]*Worker, index int) {
  dir, err := filepath.Abs(filepath.Dir(os.Args[0]))
  if err != nil {
    // Handle error
  }
  cmd := exec.Command(dir + "/dsym-worker")

  // Some other stuff gets done here and stored in the Worker object
  // such as retrieving the standard in and out pipes as explained later

  if err := cmd.Start(); err != nil {
    // Handle error
  }

  worker := NewWorker(cmd, index)
  (*procs)[index] = worker
}

Determining the desired worker count for the job load, and then starting/stopping workers to meet that number is all there really is to auto-scaling in this case. I’ll cover how we stop workers further down.

Inter-process communication in Golang

As mentioned previously, my simple approach for detecting an unresponsive worker is to have each worker report the time at which it last completed a job. If the master process finds that a worker has not done a job for too long, it considers the worker unresponsive and replaces it with a new one. To implement this, each worker process needs some way to communicate with the master process to relay the current time whenever it completes a job. There are many ways this could be achieved:

  • Read and write to a file
  • Set up a local queue system such as Redis or RabbitMQ
  • Use the Golang rpc package
  • Transfer gobbed data through a local network connection
  • Utilize shared memory
  • Set up named pipes

In our case, all we are communicating is timestamps, not important customer data, so most of these are a bit overkill. I went with what I thought was the easiest solution – communicating through the standard out pipe of the worker processes.

After starting up a new process via exec.Command as described previously, the standard out pipe of the process can be obtained through:

stdoutPipe, err := cmd.StdoutPipe()

Once we have the standard out pipe, we can run up a goroutine to concurrently listen to it. Within the goroutine, I’ve gone with using a scanner to read from the pipe as seen here:

scanner := bufio.NewScanner(stdoutPipe)

for scanner.Scan() {
  line := scanner.Text()
  // Process the line here
}

The code inside the scanner loop will be executed every time a line of text is written to the standard out pipe by the worker process.

Unresponsiveness detection

Now that inter-process communication is in place, we can use it to implement the detection of unresponsive worker processes. I updated our existing worker to print out the current time, using the Golang fmt package, upon completing a job. This gets picked up by the scanner, where we parse the time using the same format it was printed in. The resulting time value is then assigned to the LastJob field of the relevant Worker object that we keep track of.

t, err := time.Parse(time.RFC3339, line)
if err != nil {
  // Handle error
} else {
  worker.LastJob = t
}
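
For completeness, the worker’s side of this is just a single line printed to standard out after each completed job. A minimal sketch, assuming the RFC 3339 layout that the parse above expects:

// In the worker process, after acknowledging a completed job:
fmt.Println(time.Now().Format(time.RFC3339))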

Back in the main goroutine that is regularly iterating the process-map, we can now compare the current time with the LastJob time of each worker. If this is too long, we kill off the process and start a new one.

func CheckWorker(procs *map[int]*Worker, worker *Worker, index int) {
  // Replace the given worker if it hasn't done any work for too long
  duration := time.Now().Sub(worker.LastJob)
  if duration.Minutes() > 4 {
    KillWorker(procs, index)
    StartWorker(procs, index)
  }
}

Killing a process can be done by calling the Kill function of the Process object. This is provided by the Command object we got when we spawned the process. Another thing we need to do is delete the Worker object from the process-map.

func KillWorker(procs *map[int]*Worker, index int) {
  worker := (*procs)[index]
  if worker != nil {
    process := worker.Command.Process

    delete(*procs, index) // Remove from process-map

    err := process.Kill() // Kill process
    if err != nil {
      // Handle error
    }
  }
}

After killing off the misbehaving worker, a new one can be started by calling the StartWorker function listed further above. The new worker gets referenced in the process-map with the same key as the worker that was just killed – thus completing the worker replacement logic.

Termination detection

Technically, the detection and resolution of unresponsive processes also covers the case of processes that terminate unexpectedly. The dead processes won’t be able to report that they are doing jobs, and so eventually they’ll be considered unresponsive and replaced. It would be nice though if we could detect this earlier.

Attempt 1

When we start a process, we get the pid assigned to it. The Golang os package has a function called FindProcess which returns a Process object for a given pid. So that’s great: if we just regularly check whether FindProcess returns a process for each pid we keep track of, then we’ll know if a particular worker has died, right? No. On Unix systems, FindProcess will always return something even if no process exists for the given pid…
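
For reference, this is roughly what that dead-end check looks like; the error value is never the signal we need here:

// pid here is the id of a worker we spawned earlier. On Unix systems,
// FindProcess always succeeds, even if that process has since died, so this
// tells us nothing about whether the worker is still alive.
process, err := os.FindProcess(pid)
if err != nil {
  // Never reached on Unix, even for a pid that no longer exists
}
_ = process

Let’s try something else.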

Attempt 2

Using a terminal, if you type “kill -s 0 {pid}”, then a signal will not be sent to the process, but error checking will still be performed. If there is no process for the given pid, then an error will occur. This can be implemented in Golang, so I tried it out. Unfortunately, running this from Golang did not produce any error. Similarly, sending a signal of 0 to a non-existent process also didn’t tell me whether the process exists.

Final solution

Fortunately, we have actually already written a mechanism that will allow us to detect dead processes. Remember the scanner being used to listen to the worker processes? Well, the standard out pipe object that the scanner is listening to is an io.ReadCloser. As the name suggests, it can be closed, which happens if the worker process at the other end stops in any way. If the pipe closes, the scanner stops listening and breaks out of the scanner loop. So right there, after the scanner loop, we have a point in code where we know a worker process has stopped.

All we need to do now is determine whether the worker shut down as a result of the normal operation of the master process (e.g. killing unresponsive workers, or down-scaling for decreased load), or whether it terminated unexpectedly. Before the master process kills/stops a worker for any reason, it deletes it from the process-map. So, if the worker process is still in the process-map after the scanner stops, then it has not shut down at the hands of the master process. If that is the case, start up a new one to take its place.

if (*procs)[worker.Index] == worker {
  StartWorker(procs, worker.Index)
}
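
Putting those pieces together, the goroutine that listens to a single worker might look something like this sketch (the listen name and the line handling are simplified; they aren’t the real implementation):

// Sketch of the per-worker listening goroutine: read lines until the pipe
// closes, then decide whether the worker needs replacing.
func listen(procs *map[int]*Worker, worker *Worker, stdoutPipe io.ReadCloser) {
  scanner := bufio.NewScanner(stdoutPipe)
  for scanner.Scan() {
    line := scanner.Text()
    // Handle timestamps and "stop" messages here, as described in this post
    _ = line
  }

  // The pipe has closed, so the worker process has stopped. If the master
  // didn't remove it from the process-map itself, the stop was unexpected.
  if (*procs)[worker.Index] == worker {
    StartWorker(procs, worker.Index)
  }
}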

The functionality of this can easily be tested by using the kill command in a terminal to snipe a worker process. The Mac Activity Monitor will show that a new one replaces it almost instantly.

Graceful shut down

When I first prototyped the auto-scaling behaviour in Golang, I was calling the KillWorker function listed further above to kill off processes when fewer were needed. If a worker is currently processing a job when it is killed off in some way, what happens to the job? Well, until the job has been completed, it sits safely in an unacked queue in Redis for that particular worker. Only when the worker acknowledges to the queue that the job has been completed will it disappear. The master process regularly checks for dead Redis connections, and moves any unacked jobs for them back to the ready queue. This is all managed by the Golang Redis queue library we’re using.

This means that when a worker process terminates unexpectedly, no jobs are lost. It also means that killing off processes manually works totally fine. However, it feels kinda rude, and means processing those jobs is delayed. A better solution is to implement graceful shut down – that is to allow the worker process to finish the job they are currently processing, and then naturally exit.

Step 1 – master process tells worker to stop

To start off, we need a way for the master process to tell a particular worker process to begin graceful shut down. I’ve read that a common way of doing this is to send an OS signal such as ‘interrupt’ to the worker process, and then have the worker handle those signals to perform graceful shut down. For now though, I preferred to leave the OS signals to their default behaviours, and instead have the master process send “stop” through the standard in pipe of a worker process.

func StopWorker(procs *map[int]*Worker, index int) {
  worker := (*procs)[index]
  // The standard in pipe was obtained and stored here when the worker was first started
  stdinPipe := worker.StdinPipe
  _, err := (*stdinPipe).Write([]byte("stop\n"))
  if err != nil {
    // Handle error
  }
}

Step 2 – worker begins graceful shut down

When a worker process receives a “stop” message, it uses the Golang Redis queue library to stop consuming, and sets a boolean field to indicate that it’s ready for graceful shut down. Another boolean field is used to keep track of whether or not a job is currently in progress. The program is kept alive as long as at least one of these boolean fields is true. If they are both false, the worker has no job to process and is marked for graceful shut down, so the program is allowed to terminate naturally.

func scan(consumer *Consumer) {
  reader := bufio.NewReader(os.Stdin)
  for {
    text, err := reader.ReadString('\n')
    if strings.Contains(text, "stop") {
      stop(consumer)
    }
    if err != nil {
      return // Standard in has closed, so stop scanning
    }
  }
}
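
The loop that actually keeps the worker alive isn’t shown above; a minimal sketch, assuming the two boolean fields described earlier, could be as simple as:

// Block until the consumer has been told to stop and has no job in flight,
// then let main return so the process exits naturally.
for consumer.running || consumer.doingJob {
  time.Sleep(time.Second)
}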

Step 3 – worker tells master that it’s done

In the master process, we need to stop keeping track of any shut down workers by deleting them from the process-map. We could do this after sending the worker a “stop” message, but what would happen if the last job happened to cause the worker to get stuck in an unexpected infinite loop? To clean this up better, when a worker process has finished its last job and is able to shut down, it prints out a “stop” message. Just like the timestamps, this message gets picked up in the scanner we set up previously. When the master sees this message, it’s fine to stop keeping track of that worker, so delete it from the process-map.

// In the worker process:
func stop(consumer *Consumer) {
  consumer.queue.StopConsuming()
  consumer.running = false
  if !consumer.doingJob {
    fmt.Println("stop")
  }
}
// In the scanner loop of the master process:
if strings.Contains(line, "stop") {
  delete(*procs, worker.Index)
  break
}

Who watches the watcher?

At this point, the dsym-worker can auto-scale to handle increased load, and has self-defensive mechanisms against unexpected termination and unresponsiveness. But what about the master process itself? It is a lot simpler than the worker processes, but is still at risk of crashing, especially as this is my first attempt at this kind of process setup. If it goes down, we’re right back to where we started with on-call goodness. We may as well set up a mechanism to automate restarting the master process too.

There are a few ways to make sure that the master process is restarted upon failure. One way could be to use the “KeepAlive” option in a Launch Daemon config. Another option could be to write a script that checks for the existence of the process, and starts it if not found. Such a script could be run every 5 minutes or so from a cron job.

What I ended up doing was to create yet another Golang program which initially starts the master process, and then restarts it if termination is detected. This is achieved using the same technique that the master process uses. Overall it is very small and simple, with nothing that I know of to go wrong. So far it’s holding up well. Another thing that would help is if I actually fix any issues that could cause the master process to crash… but I digress.
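
A minimal sketch of that watcher program could look like the following. The executable name and the use of cmd.Wait (rather than listening to a stdout pipe like the master does) are assumptions to keep the example short.

// Hypothetical watcher: start the master process, then restart it whenever
// it terminates, for any reason.
func main() {
  dir, err := filepath.Abs(filepath.Dir(os.Args[0]))
  if err != nil {
    // Handle error
  }
  for {
    cmd := exec.Command(dir + "/dsym-master") // assumed executable name
    if err := cmd.Start(); err != nil {
      // Handle error
    } else {
      cmd.Wait() // Returns once the master process has terminated
    }
    time.Sleep(5 * time.Second) // Brief pause before starting a replacement
  }
}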

Orphaned workers

If the master process is shut down via an interrupt signal, this is automatically handled in a way that shuts down all the child processes too. This is really handy for deploying a new version, as only the top-level process needs to be told to shut down. If the master process outright crashes though, it’s a different story. All the worker processes hang around and keep processing work, but with nothing to supervise them. When a new master process is started, more workers are spawned, which could cause a problem if this keeps happening without any clean up.

This is an important situation to handle, so here is my simple solution. The os package in Golang provides a function called Getppid(). It takes no parameters, so you can simply call it at any time to get the pid of the parent process. If the parent dies, the child is orphaned and the function will return 1 – the pid of init. So within a worker process, we can easily detect if it becomes orphaned. When a worker first starts, it can obtain and remember the pid of its initial parent. Then, regularly call Getppid and compare the result to the initial parent pid. If the parent pid has changed, the worker has been orphaned, so commence graceful shut down.

ppid := os.Getppid()
if ppid != initialParentId {
  stop(consumer) // Begin graceful shut down
}
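
Wrapped in the loop that runs it periodically, the check might look roughly like this (the polling interval and the use of a separate goroutine are assumptions):

// Remember the parent pid at startup, then poll it. If the master dies, the
// worker is re-parented and its ppid changes, so begin graceful shut down.
initialParentId := os.Getppid()
go func() {
  for range time.Tick(30 * time.Second) {
    if os.Getppid() != initialParentId {
      stop(consumer) // Begin graceful shut down
      return
    }
  }
}()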

Finishing words

And that’s about it. I hope you found this look into how I implemented an auto-scaling and self-defensive service in Golang at least a little bit interesting and not overly ridiculous. So far it’s all working really well, and I have some ideas to make it even more robust. If you know of any existing techniques, processes or libraries that I could have used instead, then I’d love to hear them in the comments below.

All the processes that make up our dsym-worker of course use Raygun to report any errors or panics that occur. This has made tracking down and fixing issues a breeze. Sign up for a free trial of Raygun if you also need error reporting in your services or applications.