
Threads, Forking and New Relic #1101

Closed

jasonrclark opened this issue Aug 9, 2013 · 32 comments

Comments

@jasonrclark

TL;DR: We've got ideas for avoiding deadlocks when using New Relic and Resque together, but we'd like feedback on the approach, and advice on what we could contribute to Resque to improve the situation.

Long story short, threads (with locks) and forking don't mix nicely. (:wavy_dash: + :fork_and_knife: = :collision: )

The Problem

There are long-standing reports of deadlocking Resque jobs when using New Relic (newrelic_rpm). Certain updates have alleviated issues for specific users, but we still hear about problems from others--especially those running lots of jobs (think millions of jobs per day).

Some great debugging by @benweint and others has led us to suspect the problems are variations on this sequence of events:

  1. New Relic starts in the Resque master process and spawns a background thread to transmit data back to our services.
  2. Resque starts running normally, forking to process jobs.
  3. New Relic harvests once a minute, and during that time takes various locks (internal locks in newrelic_rpm, locks in Ruby's Logger class, or even native locks around, for example, DNS lookup).
  4. While New Relic's harvest thread holds a lock, Resque forks a new job.
  5. Only Resque's main thread lives on in the child process, so New Relic's harvest thread isn't around to release the lock from step 3.
  6. If any code tries to acquire the lock from step 3, the Resque job hangs 😢

It seems we can't safely fork while another thread holds a lock, or bad things happen. If newrelic_rpm owned all the locks in question, we could deal with them in our after_fork handler (and we have, where we've identified them). However, at least one common case (DNS lookup) isn't even locking in Ruby-land.
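The failure mode above can be reproduced in a few lines of plain Ruby. The following is only a minimal sketch (not newrelic_rpm code, and Unix-only since it calls fork); on Rubies affected by the behavior described here the child hangs forever at the second synchronize, and native locks (glibc, DNS) aren't modeled at all:

    require 'thread'

    LOCK = Mutex.new

    # Stand-in for the harvest thread: it holds a lock at the moment we fork.
    background = Thread.new do
      LOCK.synchronize { sleep 5 }
    end

    sleep 0.1  # let the background thread actually take the lock

    pid = fork do
      # Only the forking (main) thread survives in the child. The background
      # thread is gone, but the mutex it held was copied in a locked state,
      # so this synchronize never returns on affected Ruby versions.
      LOCK.synchronize { puts "unreachable when the deadlock occurs" }
    end

    Process.wait(pid)
    background.join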

A Solution

I've pushed a branch, resque_harvest_locking, on newrelic_rpm with a proposed change to our instrumentation (newrelic/newrelic-ruby-agent@d703271). It synchronizes New Relic's background harvest thread with Resque's forking of a new job (just the fork itself, not the full job processing). We already lock around our harvest, so we simply synchronize on the same lock from a patched Resque::Worker#fork.
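For readers who don't want to dig through the branch, the shape of the change is roughly the following. This is a hedged sketch only: HARVEST_LOCK stands in for whatever mutex the agent's harvest already synchronizes on, and Module#prepend is used for brevity even though the actual branch targets older Rubies and hooks into Resque differently.

    require 'resque'

    HARVEST_LOCK = Mutex.new  # shared with the harvest thread in the real agent

    module HarvestSynchronizedFork
      def fork(*args, &block)
        # Wait for any in-progress harvest, then hold the lock only for the
        # duration of the fork call itself, not the whole job.
        HARVEST_LOCK.synchronize { super }
      end
    end

    Resque::Worker.prepend(HarvestSynchronizedFork)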

This has downsides for sure:

  • If the network is slow contacting New Relic, new jobs can't start until the harvest finishes.
  • Ugly patching to get the lock at the right spot in Resque's processing
  • The existing code doesn't look workable for Resque 2.x, so there's still work to do if this approach is taken

But in our testing it does alleviate deadlocks.

Gathering More Information

The latest version of the newrelic_rpm gem ships with an executable called nrdebug. Given a pid, nrdebug drops a log file with tons of information, up to and including gdb and Ruby backtraces from the stuck process. Anyone seeing hung workers with newrelic_rpm should definitely run nrdebug before killing those workers. We'd love to see the results.
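For anyone driving this from a Ruby script or console rather than a shell, the invocation is just the executable plus the stuck worker's pid (the pid below is a placeholder; in practice you'd grab it from ps, and would normally just run nrdebug with the pid straight from a shell):

    stuck_pid = 12345                    # placeholder for the hung worker's pid
    system("nrdebug", stuck_pid.to_s)    # drops a log with gdb and Ruby backtraces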

Other Ideas?

Has anyone else hit these types of issues using threads and Resque together?

Are there places we could add hooks in Resque to make this less invasive?

Even more drastically, could there be a better approach to running plugins periodically in Resque's master process, so we wouldn't need the background thread and could sidestep the locking woes?

Thanks!

@steveklabnik
Member

Thanks so much for looking into this. I've been ill, so sorry about the delay.

I think that this approach sounds reasonable, but I also don't use NewRelic with Resque personally, so I'd like to hear what people who do use it think. Let me tweet this out there, and also cross-post to the mailing list, and see if we can't get more feedback.

could there be a better approach to running plugins periodically in the Resque's master so we wouldn't need the background thread and could sidestep the locking woes?

I think this would be the ideal solution for 2.0.

@codeodor

I use NewRelic with Resque but I can't say I've ever noticed a problem. (But we're only running dozens of long-running jobs, not millions per day).

That said, I'm happy to help out with the app I'm talking about above if you'll tell me how I can be useful.

@MarkusQ

MarkusQ commented Aug 12, 2013

One possibility that occurs to me (without looking at the code) is to replace the background thread with another fork; the mental model is "no process should ever fork after calling Thread.new." This would clearly remove you from the TL;DR-defined (threads+locks+forks) space.
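To make that mental model concrete, a hedged sketch might look like the following (purely illustrative, not how newrelic_rpm works, and it deliberately ignores the hard part: getting data from the master and workers over to the reporter process):

    # The master forks one long-lived reporter child instead of spawning a
    # background thread, so the master never calls Thread.new and can keep
    # forking job workers without stranding any Ruby-level locks.
    reporter_pid = fork do
      loop do
        sleep 60
        # harvest buffered data and transmit it to the collector here
      end
    end

    at_exit do
      Process.kill("TERM", reporter_pid) rescue nil
      Process.wait(reporter_pid) rescue nil
    end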

My second suggestion would be pretty much what Jason implemented.

About ten years ago this came up and Joel VanderWerf sketched a solution http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/61918 but I'm not finding a live link or local copy.

Another idea (which gets messy and may not handle all cases, but I have seen "working") is to manually set all locks to unlocked on forking (iirc the attempt at this I saw used ObjectSpace, evil.rb, and goat entrails).

I keep thinking there are other, cleaner answers, but none are springing to mind.

@mperham

mperham commented Aug 12, 2013

This is relevant to my interests. Good to learn about nrdebug, could be useful for debugging Sidekiq lockups too.

@jasonrclark
Author

@MarkusQ Thanks for the feedback. We've definitely considered a forking model to get reporting back to New Relic separated from the app processes, but it's felt like a pretty big change from the agent's current behavior. We will keep it in mind to be sure.

Really appreciate as well the links to those old solutions. While I haven't found the code, they definitely led to some interesting reading.

@jasonrclark
Author

@codeodor We'll definitely get in touch if there's any way you could help out. We mostly encounter problems with short, frequent jobs, but if we do make changes we'll want to be sure things remain compatible with longer running jobs like you've got.

@jeremyolliver

I can't really comment on where to head codewise, but just reporting in as someone who's experienced this issue a lot.

I have an application running in production on an older version of Resque (resque 1.13.0, Ruby Enterprise Edition 1.8.7) which currently, reliably reproduces the locking-up issue (I've had to kill processes several times over the last couple of weeks for it alone). I'm in the process of upgrading that application to Ruby 1.9.3 and resque 1.24.1; I'm unsure whether that will resolve or alleviate these problems, given some work has already gone into the releases since then.

While I've still got the old system running and can reproduce it, would backtraces / gdb output be useful (from the older Resque version)? If so, let me know what you want to see; outside of reading New Relic's how-to post, it all looks a bit foreign to me.

EDIT: I was way off on my count of how many jobs go through. I forgot about some optimisations I did.

@yaauie
Member

yaauie commented Aug 27, 2013

@jeremyolliver I don't think backtraces from the older 1.x series will be all that helpful, since those releases won't see updates beyond the forward progress already made in 1.24 and beyond. Best of luck getting upgraded though 😄.

@jeremyolliver

I figured as much. I can report back post-upgrade if the issue still persists.

@benweint
Contributor

benweint commented Sep 5, 2013

@jeremyolliver Backtraces from the 1.x series would be useful inasmuch as they would allow us to confirm whether your issue is the one Jason describes above or some other problem. You can use the nrdebug tool that Jason mentions in the 'Gathering More Information' section above to automate the collection of backtraces, so you won't have to manually carry out the steps in that blog post.

If you're able to get backtraces, we'd be happy to take a look at them to at least confirm whether you're hitting the same issue or something different (I've seen a few cases where hangs people thought were caused by this issue were actually caused by unrelated issues in other libraries or application code).

@obrie

obrie commented Oct 7, 2013

Hey guys -

I've been following this thread as it actually is in line with what I've experienced on a separate project. I wanted to provide an update because it affects how folks address this issue. I found that there were two issues at play here:

First, there is a bug in Ruby. All mutexes in any background thread are supposed to be unlocked when you fork. Unfortunately there was a race condition in all versions of Ruby which made this fail. The good news is there's a patch I submitted which has been accepted: https://bugs.ruby-lang.org/issues/8433#note-2. This will need to get backported, but in the meantime I'm personally feeding the patch into rvm.

Second, instead of using the default DNS resolution implementation in Ruby, you can use resolv-replace. This makes DNS resolution thread-friendly.
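For anyone wanting to try that second change, it's a one-line require from the standard library (shown here in isolation; where exactly to load it in a Resque app is up to you):

    # resolv-replace swaps the native getaddrinfo-based lookup used by
    # TCPSocket and friends for the pure-Ruby Resolv library, so a DNS
    # lookup no longer holds a native lock that a fork in another thread
    # can strand.
    require 'resolv-replace'

    # From here on, ordinary socket code (Net::HTTP, TCPSocket.new, ...)
    # resolves hostnames in Ruby-land, where the VM can see the locks.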

After adding these two changes, all the issues I was seeing with hung forks and NewRelic went away completely. There are a few bugs I've encountered in NewRelic that I'll be putting PRs in for soon, but I don't think any of them were contributing to hung workers -- only the two I noted above.

Hope this is helpful -- let me know if you have any questions!

@jasonrclark
Author

@obrie That's absolutely fantastic news! Thanks for putting the work into that. We'd love to hear whenever this lands somewhere that we can test against.

We'll also take a look at resolv-replace and see if that improves the situation as well. It sounds promising given that a large number of the hangs we've seen have been DNS related.

@obrie

obrie commented Oct 7, 2013

No problem :) It's also worth noting that there's a bug in 1.9.3 pre-p374 that can also cause forks to deadlock: https://bugs.ruby-lang.org/issues/7693. We were experiencing this issue (in addition to all the others) -- so we're now using p448 in all of our testing.

@jasonrclark
Author

Hey all,
We've released version 3.6.8.164 of newrelic_rpm which includes an optional feature to implement the locking around harvest I describe in this issue. It's documented at https://docs.newrelic.com/docs/ruby/resque-instrumentation#deadlocks, but you can set this in the desired section of your newrelic.yml to enable the locks:

resque:
  use_harvest_lock: true

Keen to hear how this helps for anyone experiencing these sorts of deadlocks. Thanks all!

I'll be looking at Resque 2.x in the near future, so we'll see if we can do anything better at that point.

@obrie

obrie commented Oct 16, 2013

Just to give one more update... We've now been running live in production (with a high throughput) without any issues for about a week now. To provide a little more information, this was the environment we required to completely avoid hung workers:

  • Ruby 1.9.3-p448 ... p327+ fixes thread creation deadlock [https://bugs.ruby-lang.org/issues/7693]
  • railsexpress patch (if applicable, the falcon gc patch on p385+ fixes 100% cpu gc deadlock) [https://github.com/skaes/rvm-patchsets/blob/master/patchsets/ruby/1.9.3/p448/railsexpress]
  • Backported mutexes patch (fixes mutex deadlock) [https://bugs.ruby-lang.org/issues/8433]
  • Ubuntu 12.04 / glibc 2.15 (fixes a mutex locking bug / deadlock). Unfortunately I don't have a link for this one, but I can verify that glibc 2.11.1 has the issue on Ubuntu 10.04
  • resolv-replace Ruby library (fixes dns resolution deadlock)

I'll provide further updates if we discover anything else -- hopefully that's it :)

@steveklabnik
Member

❤️❤️❤️❤️❤️

@mperham

mperham commented Oct 16, 2013

Wow, thank you @obrie! This is helpful to everyone using threads.

@jasonrclark
Author

This is awesome @obrie! We're definitely pointing people to these options when they report deadlocks, since most of them are outside of what newrelic_rpm can address directly from within the gem.

Again, thanks for all the deep digging!

@TheCorp

TheCorp commented Oct 18, 2013

@obrie - Quick question, any idea if this issue is present on Ruby 2.x or just 1.9.3 p<237? I am on ruby 2.0.0p247.

Also @mperham - Based on the commit, it seems like the NewRelic upgrade with the harvest locking is specifically for Resque users? Does Sidekiq have a similar issue, or is resolv-replace all we need?

We process millions of jobs with Sidekiq so we're definitely very interested in whether this affects us or not.

@mperham

mperham commented Oct 18, 2013

@TheCorp Sidekiq itself never forks but I can't control if workers want to fork. It's crazy but possible. Sidekiq master requires resolv-replace now.

@TheCorp

TheCorp commented Oct 18, 2013

Well then, this might explain why we had so many issues when we tried to implement parallel HTTP calls (we do lots of scraping) by forking threads within a Sidekiq worker.

@mperham

mperham commented Oct 18, 2013

You don't fork threads, you fork processes. I mean a worker could actually fork the sidekiq process.

Slow DNS can cause those issues. We use a local DNS caching proxy which I think helps. resolv-replace should help a lot.
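For what it's worth, the thread-based version of what @TheCorp describes (parallel HTTP inside a job, no forking at all) can be sketched like this; the worker class and URLs are purely illustrative, not Sidekiq or New Relic code:

    require 'sidekiq'
    require 'net/http'
    require 'uri'

    class ScrapeWorker
      include Sidekiq::Worker

      def perform(urls)
        # One thread per URL; Thread#value waits for each response and
        # re-raises any exception raised inside that thread.
        urls.map { |url| Thread.new { Net::HTTP.get_response(URI(url)) } }
            .map(&:value)
      end
    end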

@TheCorp

TheCorp commented Oct 18, 2013

Gotcha, thanks.

@kajic

kajic commented Oct 25, 2013

In the process of upgrading Ruby (from 1.9.3p0 to 2.0.0p247) I was also forced to upgrade newrelic_rpm (from 3.5.3.25 to 3.6.8.168). During the switch, an old Resque machine (resque gem at 1.20.0), running Ruby 1.9.3p0 and newrelic_rpm 3.6.8.168, started to experience hanging processes. This was not happening at all with newrelic_rpm 3.5.3.25. I was led to this GitHub issue by examining the hanging processes using gdb. After upgrading Ruby to 2.0.0p247 the issue seems to have gone away. I hope this can be of help to someone.

@onyxraven

This issue is still open, so I'll comment on here now.

We had NewRelic disabled on our Resque workers (managed with resque-pool) via the environment variable, and just turned it on after upgrading to the last 1.x Resque. We are now seeing occasional locked-up workers again, though it's a lot better than it was back when we originally turned it off.

Relevant Specs:
Ubuntu 12.04.4 LTS (GNU/Linux 3.2.0-65-virtual x86_64)
ldd (Ubuntu EGLIBC 2.15-0ubuntu10.5) 2.15
ruby 1.9.3p545 (2014-02-24 revision 45159) [x86_64-linux]
resque (1.25.2)
resque-dynamic-queues (0.8.0)
resque-pool (0.3.0)
resque-queue-priority (0.6.0)
resque-scheduler (2.5.5)

I'm about to turn on the harvest locks flag now.

I think I've got a ticket open with NewRelic that I'll follow up on directly, but I wondered if there were any updates on this side.

@doliveirakn

This ticket has been open for several years. Has there been any progress, or do people just accept that running New Relic with Resque will cause deadlocks on occasion?

@jasonrclark
Author

Hey @doliveirakn! It turns out there was progress in the meantime that never got posted here. The most common issue people were hitting was with DNS resolution, as mentioned previously. As of newrelic_rpm 3.12.1 we started requiring resolv-replace in these scenarios by default. Since then, the team has pretty much stopped hearing about the issue, which is ✨

While it's always a possibility to deadlock when threads and forking mix, this seems to have dealt with the primary cause of resque + New Relic pain. Are you seeing problems yourself still, or just checking in?

@doliveirakn

Thanks for getting back.

I came across this issue about two years ago, I've had New Relic disabled for Resque ever since, and I've been periodically checking up on this issue to see the progress.

If you think it is resolved, I will look at re-enabling newrelic and I'll report back if we hit more deadlocks.

@onyxraven

onyxraven commented Oct 18, 2017

We have had NewRelic and Resque working side by side for a while now. We started using resolv-replace in our own code after digging into some of the stack traces (similar to what @jasonrclark mentioned). We don't have harvest locks on, and we haven't seen a recurrence of the deadlock in a long, long time.

In other words, I think this can be closed.

@jasonrclark
Author

I'm leaning toward closing as well @onyxraven. @doliveirakn could you report back if you get to try it out (or if it isn't going to happen for some reason), and I'll close then?

@doliveirakn

I have put this live as of last week (Oct 18th). So far, I have seen no deadlocks. This issue is likely safe to close.

@jasonrclark
Author

Thanks @doliveirakn for reviving the discussion!
