
Threads, Forking and New Relic #1101

Closed

jasonrclark opened this issue Aug 9, 2013 · 32 comments

Comments

@jasonrclark

TL;DR: We've got ideas for avoiding deadlocks when using New Relic and Resque together, but we'd like feedback on the approach, and advice on what we could contribute to Resque to improve the situation.

Long story short, threads (with locks) and forking don't mix nicely. (:wavy_dash: + :fork_and_knife: = :collision: )

The Problem

There are long-standing reports of deadlocking Resque jobs when using New Relic (newrelic_rpm). Certain updates have alleviated issues for specific users, but we still hear about problems from others--especially those running lots of jobs (think millions of jobs per day).

Some great debugging by @benweint and others has led us to suspect the problems are variations on this sequence of events:

  1. New Relic starts in the Resque master process and spawns a background thread to transmit data back to our services.
  2. Resque starts running normally, forking to process jobs.
  3. New Relic harvests once a minute, and during that time takes various locks (internal locks in newrelic_rpm, locks in Ruby's Logger class, or even native locks around, for example, DNS lookup).
  4. While New Relic's harvest thread holds a lock, Resque forks a new job.
  5. Only Resque's main thread lives on in the child process, so New Relic's harvest thread isn't around to release the lock from step 3.
  6. If any code tries to acquire the lock from step 3, the Resque job hangs 😢

It seems we can't safely fork while another thread holds a lock, or bad things happen. If newrelic_rpm owned all the locks in question, we could deal with them in our after_fork handler (and we have, where we've identified them). However, at least one common case (DNS lookup) isn't even locking in Ruby-land.
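The failure mode above can be reproduced in a few lines of plain Ruby. The following is only a minimal sketch (not newrelic_rpm code, and Unix-only since it calls fork); on Rubies affected by the behavior described here the child hangs forever at the second synchronize, and native locks (glibc, DNS) aren't modeled at all:

    require 'thread'

    LOCK = Mutex.new

    # Stand-in for the harvest thread: it holds a lock at the moment we fork.
    background = Thread.new do
      LOCK.synchronize { sleep 5 }
    end

    sleep 0.1  # let the background thread actually take the lock

    pid = fork do
      # Only the forking (main) thread survives in the child. The background
      # thread is gone, but the mutex it held was copied in a locked state,
      # so this synchronize never returns on affected Ruby versions.
      LOCK.synchronize { puts "unreachable when the deadlock occurs" }
    end

    Process.wait(pid)
    background.join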

A Solution

I've pushed a branch, resque_harvest_locking, on newrelic_rpm with a proposed change to our instrumentation (newrelic/newrelic-ruby-agent@d703271). It synchronizes New Relic's background harvest thread with Resque's forking of a new job (just the fork itself, not the full job processing). We already lock around our harvest, so we simply synchronize on the same lock from a patched Resque::Worker#fork.
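For readers who don't want to dig through the branch, the shape of the change is roughly the following. This is a hedged sketch only: HARVEST_LOCK stands in for whatever mutex the agent's harvest already synchronizes on, and Module#prepend is used for brevity even though the actual branch targets older Rubies and hooks into Resque differently.

    require 'resque'

    HARVEST_LOCK = Mutex.new  # shared with the harvest thread in the real agent

    module HarvestSynchronizedFork
      def fork(*args, &block)
        # Wait for any in-progress harvest, then hold the lock only for the
        # duration of the fork call itself, not the whole job.
        HARVEST_LOCK.synchronize { super }
      end
    end

    Resque::Worker.prepend(HarvestSynchronizedFork)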

This has downsides for sure:

  • If the network is slow contacting New Relic, new jobs can't start until the harvest finishes.
  • Ugly patching to get the lock at the right spot in Resque's processing
  • The existing code doesn't look workable for Resque 2.x, so there's still work to do if this approach is taken

But in our testing it does alleviate deadlocks.

Gathering More Information

The latest version of the newrelic_rpm gem ships with an executable called nrdebug. Given a pid, nrdebug drops a log file with tons of information, up to and including gdb and Ruby backtraces from the stuck process. Anyone seeing hung workers with newrelic_rpm should definitely run nrdebug before killing those workers. We'd love to see the results.
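For anyone driving this from a Ruby script or console rather than a shell, the invocation is just the executable plus the stuck worker's pid (the pid below is a placeholder; in practice you'd grab it from ps, and would normally just run nrdebug with the pid straight from a shell):

    stuck_pid = 12345                    # placeholder for the hung worker's pid
    system("nrdebug", stuck_pid.to_s)    # drops a log with gdb and Ruby backtraces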

Other Ideas?

Has anyone else hit these types of issues using threads and Resque together?

Are there places we could add hooks in Resque to make this less invasive?

Even more drastically, could there be a better approach to running plugins periodically in Resque's master process, so we wouldn't need the background thread and could sidestep the locking woes?

Thanks!

@steveklabnik
Member

Thanks so much for looking into this. I've been ill, so sorry about the delay.

I think that this approach sounds reasonable, but I also don't use NewRelic with Resque personally, so I'd like to hear what people who do use it think. Let me tweet this out there, and also cross-post to the mailing list, and see if we can't get more feedback.

could there be a better approach to running plugins periodically in the Resque's master so we wouldn't need the background thread and could sidestep the locking woes?

I think this would be the ideal solution for 2.0.

@codeodor

I use NewRelic with Resque but I can't say I've ever noticed a problem. (But we're only running dozens of long-running jobs, not millions per day).

That said, I'm happy to help out with the app I'm talking about above if you'll tell me how I can be useful.

@MarkusQ

MarkusQ commented Aug 12, 2013

One possibility that occurs to me (without looking at the code) is to replace the background thread with another fork; the mental model is "no process should ever fork after calling Thread.new." This would clearly remove you from the TL;DR-defined (threads+locks+forks) space.
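To make that mental model concrete, a hedged sketch might look like the following (purely illustrative, not how newrelic_rpm works, and it deliberately ignores the hard part: getting data from the master and workers over to the reporter process):

    # The master forks one long-lived reporter child instead of spawning a
    # background thread, so the master never calls Thread.new and can keep
    # forking job workers without stranding any Ruby-level locks.
    reporter_pid = fork do
      loop do
        sleep 60
        # harvest buffered data and transmit it to the collector here
      end
    end

    at_exit do
      Process.kill("TERM", reporter_pid) rescue nil
      Process.wait(reporter_pid) rescue nil
    end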

My second suggestion would be pretty much what Jason implemented.

About ten years ago this came up and Joel VanderWerf sketched a solution http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/61918 but I'm not finding a live link or local copy.

Another idea (which gets messy and may not handle all cases, but I have seen "working") is to manually set all locks to unlocked on forking (iirc the attempt at this I saw used ObjectSpace, evil.rb, and goat entrails).

I keep thinking there are other, cleaner answers, but none are springing to mind.

@mperham

mperham commented Aug 12, 2013

This is relevant to my interests. Good to learn about nrdebug, could be useful for debugging Sidekiq lockups too.

@jasonrclark
Author

@MarkusQ Thanks for the feedback. We've definitely considered a forking model to get reporting back to New Relic separated from the app processes, but it's felt like a pretty big change from the agent's current behavior. We will keep it in mind to be sure.

Really appreciate as well the links to those old solutions. While I haven't found the code, they definitely led to some interesting reading.

@jasonrclark
Author

@codeodor We'll definitely get in touch if there's any way you could help out. We mostly encounter problems with short, frequent jobs, but if we do make changes we'll want to be sure things remain compatible with longer running jobs like you've got.

@jeremyolliver

I can't really comment on where to head codewise, but just reporting in as someone who's experienced this issue a lot.

I have an application running in production on an older version of Resque (resque 1.13.0, Ruby Enterprise Edition 1.8.7) which currently, reliably reproduces the locking-up issue (I've had to kill processes several times over the last couple of weeks for it alone). I'm in the process of upgrading that application to Ruby 1.9.3 and resque 1.24.1; I'm unsure whether that will resolve or alleviate these problems, given some work has already gone into the releases since then.

While I've still got the old system running and can reproduce it, would backtraces / gdb output be useful (from the older Resque version)? If so, let me know what you want to see; outside of reading New Relic's how-to post, it all looks a bit foreign to me.

EDIT: I was way off on my count of how many jobs go through. I forgot about some optimisations I did.

@yaauie
Member

yaauie commented Aug 27, 2013

@jeremyolliver I don't think backtraces from the older 1.x series will be all that helpful, since those releases won't see updates beyond the forward progress already made in 1.24 and beyond. Best of luck getting upgraded though 😄.

@jeremyolliver

I figured as much. I can report back post-upgrade if the issue still persists.

@benweint
Contributor

benweint commented Sep 5, 2013

@jeremyolliver Backtraces from the 1.x series would be useful inasmuch as they would allow us to confirm whether your issue is the one Jason describes above or some other problem. You can use the nrdebug tool that Jason mentions in the 'Gathering More Information' section above to automate the collection of backtraces, so you won't have to manually carry out the steps in that blog post.

If you're able to get backtraces, we'd be happy to take a look at them to at least confirm whether you're hitting the same issue or something different (I've seen a few cases where hangs people thought were caused by this issue were actually caused by unrelated issues in other libraries or application code).

@obrie

obrie commented Oct 7, 2013

Hey guys -

I've been following this thread as it actually is in line with what I've experienced on a separate project. I wanted to provide an update because it affects how folks address this issue. I found that there were two issues at play here:

First, there is a bug in Ruby. All mutexes in any background thread are supposed to be unlocked when you fork. Unfortunately there was a race condition in all versions of Ruby which made this fail. The good news is there's a patch I submitted which has been accepted: https://bugs.ruby-lang.org/issues/8433#note-2. This will need to get backported, but in the meantime I'm personally feeding the patch into rvm.

Second, instead of using the default DNS resolution implementation in Ruby, you can use resolv-replace. This makes DNS resolution thread-friendly.
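For anyone wanting to try that second change, it's a one-line require from the standard library (shown here in isolation; where exactly to load it in a Resque app is up to you):

    # resolv-replace swaps the native getaddrinfo-based lookup used by
    # TCPSocket and friends for the pure-Ruby Resolv library, so a DNS
    # lookup no longer holds a native lock that a fork in another thread
    # can strand.
    require 'resolv-replace'

    # From here on, ordinary socket code (Net::HTTP, TCPSocket.new, ...)
    # resolves hostnames in Ruby-land, where the VM can see the locks.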

After adding these two changes, all the issues I was seeing with hung forks and NewRelic went away completely. There are a few bugs I've encountered in NewRelic that I'll be putting PRs in for soon, but I don't think any of them were contributing to hung workers -- only the two I noted above.

Hope this is helpful -- let me know if you have any questions!

@jasonrclark
Author

@obrie That's absolutely fantastic news! Thanks for putting the work into that. We'd love to hear whenever this lands somewhere that we can test against.

We'll also take a look at resolv-replace and see if that improves the situation as well. It sounds promising given that a large number of the hangs we've seen have been DNS related.

@obrie

obrie commented Oct 7, 2013

No problem :) It's also worth noting that there's a bug in 1.9.3 pre-p374 that can also cause forks to deadlock: https://bugs.ruby-lang.org/issues/7693. We were experiencing this issue (in addition to all the others) -- so we're now using p448 in all of our testing.

@jasonrclark
Author

Hey all,
We've released version 3.6.8.164 of newrelic_rpm which includes an optional feature to implement the locking around harvest I describe in this issue. It's documented at https://docs.newrelic.com/docs/ruby/resque-instrumentation#deadlocks, but you can set this in the desired section of your newrelic.yml to enable the locks:

resque:
  use_harvest_lock: true

Keen to hear how this helps for anyone experiencing these sorts of deadlocks. Thanks all!

I'll be looking at Resque 2.x in the near future, so we'll see if we can do anything better at that point.

@obrie

obrie commented Oct 16, 2013

Just to give one more update... We've now been running live in production (with a high throughput) without any issues for about a week now. To provide a little more information, this was the environment we required to completely avoid hung workers:

  • Ruby 1.9.3-p448 ... p327+ fixes thread creation deadlock [https://bugs.ruby-lang.org/issues/7693]
  • railsexpress patch (if applicable, the falcon gc patch on p385+ fixes 100% cpu gc deadlock) [https://github.com/skaes/rvm-patchsets/blob/master/patchsets/ruby/1.9.3/p448/railsexpress]
  • Backported mutexes patch (fixes mutex deadlock) [https://bugs.ruby-lang.org/issues/8433]
  • Ubuntu 12.04 / glibc 2.15 (fixes a mutex locking bug / deadlock). Unfortunately I don't have a link for this one, but I can verify that glibc 2.11.1 has the issue on Ubuntu 10.04
  • resolv-replace Ruby library (fixes dns resolution deadlock)

I'll provide further updates if we discover anything else -- hopefully that's it :)

@steveklabnik
Member

❤️❤️❤️❤️❤️

@mperham

mperham commented Oct 16, 2013

Wow, thank you @obrie! This is helpful to everyone using threads.

@jasonrclark
Author

This is awesome @obrie! We're definitely pointing people to these options when they report deadlocks, since most of them are outside of what newrelic_rpm can address directly from within the gem.

Again, thanks for all the deep digging!

@TheCorp

TheCorp commented Oct 18, 2013

@obrie - Quick question, any idea if this issue is present on Ruby 2.x or just 1.9.3 p<237? I am on ruby 2.0.0p247.

Also @mperham - Based on the commit, it seems like the NewRelic upgrade with the harvest locking is specifically for Resque users? Does Sidekiq have a similar issue, or is resolv-replace all we need?

We process millions of jobs with Sidekiq so we're definitely very interested in whether this affects us or not.

@mperham

mperham commented Oct 18, 2013

@TheCorp Sidekiq itself never forks but I can't control if workers want to fork. It's crazy but possible. Sidekiq master requires resolv-replace now.

@TheCorp

TheCorp commented Oct 18, 2013

Well then, this might explain why we had so many issues when we tried to implement parallel HTTP calls (we do lots of scraping) by forking threads within a Sidekiq worker.

@mperham

mperham commented Oct 18, 2013

You don't fork threads, you fork processes. I mean a worker could actually fork the sidekiq process.

Slow DNS can cause those issues. We use a local DNS caching proxy which I think helps. resolv-replace should help a lot.
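For what it's worth, the thread-based version of what @TheCorp describes (parallel HTTP inside a job, no forking at all) can be sketched like this; the worker class and URLs are purely illustrative, not Sidekiq or New Relic code:

    require 'sidekiq'
    require 'net/http'
    require 'uri'

    class ScrapeWorker
      include Sidekiq::Worker

      def perform(urls)
        # One thread per URL; Thread#value waits for each response and
        # re-raises any exception raised inside that thread.
        urls.map { |url| Thread.new { Net::HTTP.get_response(URI(url)) } }
            .map(&:value)
      end
    end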

@TheCorp

TheCorp commented Oct 18, 2013

Gotcha, thanks.

@kajic

kajic commented Oct 25, 2013

In the process of upgrading Ruby (from 1.9.3p0 to 2.0.0p247) I was also forced to upgrade newrelic_rpm (from 3.5.3.25 to 3.6.8.168). During the switch, an old Resque machine (resque gem at 1.20.0), running Ruby 1.9.3p0 and newrelic_rpm 3.6.8.168, started to experience hanging processes. This was not happening at all with newrelic_rpm 3.5.3.25. I was led to this GitHub issue by examining the hanging processes using gdb. After upgrading Ruby to 2.0.0p247 the issue seems to have gone away. I hope this can be of help to someone.

@onyxraven

This issue is still open, so I'll comment on here now.

We had NewRelic disabled on our Resque workers (managed with resque-pool) via the environment variable, and just turned it on after upgrading to the last 1.x Resque. We are now seeing occasional locked-up workers again, though it's a lot better than it was back when we originally turned it off.

Relevant Specs:
Ubuntu 12.04.4 LTS (GNU/Linux 3.2.0-65-virtual x86_64)
ldd (Ubuntu EGLIBC 2.15-0ubuntu10.5) 2.15
ruby 1.9.3p545 (2014-02-24 revision 45159) [x86_64-linux]
resque (1.25.2)
resque-dynamic-queues (0.8.0)
resque-pool (0.3.0)
resque-queue-priority (0.6.0)
resque-scheduler (2.5.5)

I'm about to turn on the harvest locks flag now.

I think I've got a ticket open with NewRelic that I'll follow up on directly, but I wondered if there were any updates on this side.

@doliveirakn

This ticket has been open for several years. Has there been any progress, or do people just accept that running New Relic with Resque will cause deadlocks on occasion?

@jasonrclark
Author

Hey @doliveirakn! It turns out there was progress in the meantime that never got posted here. The most common issue people were hitting was with DNS resolution, as mentioned previously. As of newrelic_rpm 3.12.1 we started requiring resolv-replace in these scenarios by default. Since then, the team has pretty much stopped hearing about the issue, which is ✨

While it's always a possibility to deadlock when threads and forking mix, this seems to have dealt with the primary cause of resque + New Relic pain. Are you seeing problems yourself still, or just checking in?

@doliveirakn

Thanks for getting back.

I came across this issue about two years ago, I've had New Relic disabled for Resque ever since, and I've been periodically checking up on this issue to see the progress.

If you think it is resolved, I will look at re-enabling newrelic and I'll report back if we hit more deadlocks.

@onyxraven

onyxraven commented Oct 18, 2017

We have had NewRelic and Resque working side by side for a while now. We started using resolv-replace in our own code after digging into some of the stack traces (similar to what @jasonrclark mentioned). We don't have harvest locks on, and we haven't seen a recurrence of the deadlock in a long, long time.

In other words, I think this can be closed.

@jasonrclark
Author

I'm leaning toward closing as well @onyxraven. @doliveirakn could you report back if you get to try it out (or if it isn't going to happen for some reason), and I'll close then?

@doliveirakn

I have put this live as of last week (Oct 18th). So far, I have seen no deadlocks. This issue is likely safe to close.

@jasonrclark
Author

Thanks @doliveirakn for reviving the discussion!
