Threads, Forking and New Relic #1101
Comments
Thanks so much for looking into this. I've been ill, so sorry about the delay. I think that this approach sounds reasonable, but I also don't use NewRelic with Resque personally, so I'd like to hear what people who do use it think. Let me tweet this out there, and also cross-post to the mailing list, and see if we can't get more feedback.
I think this would be the ideal solution for 2.0.
I use NewRelic with Resque but I can't say I've ever noticed a problem. (But we're only running dozens of long-running jobs, not millions per day). That said, I'm happy to help out with the app I'm talking about above if you'll tell me how I can be useful.
One possibility that occurs to me (without looking at the code) is to replace the background thread with another fork; the mental model is "no process should ever fork after calling Thread.new." This would clearly remove you from the TL;DR-defined (threads+locks+forks) space. My second suggestion would be pretty much what Jason implemented. About ten years ago this came up and Joel VanderWerf sketched a solution http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/61918 but I'm not finding a live link or local copy. Another idea (which gets messy and may not handle all cases, but I have seen "working") is to manually set all locks to unlocked on forking (iirc the attempt at this I saw used ObjectSpace, evil.rb, and goat entrails). I keep thinking there are other, cleaner answers, but none are springing to mind.
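A minimal, hypothetical sketch of the "fork instead of Thread.new" idea above (none of this comes from Resque or newrelic_rpm; the pipe protocol and event format are invented for illustration): periodic reporting is done from a forked child process, so the master never starts a background thread and can keep forking job children safely.

```ruby
# Hypothetical sketch: a reporter child process stands in for the
# background reporting thread, so the master never calls Thread.new.
reader, writer = IO.pipe

reporter_pid = fork do
  writer.close
  # The child inherits no locked mutexes from parent threads (there are
  # none); it simply drains events the master sends and "reports" them.
  reader.each_line { |event| warn "reporting: #{event.chomp}" }
end

reader.close

# The master (which may itself fork job workers) writes events to the
# pipe instead of handing them to an in-process thread.
5.times do |i|
  writer.puts "job #{i} finished"
  sleep 0.1
end

writer.close
Process.wait(reporter_pid)
```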
This is relevant to my interests. Good to learn about
@MarkusQ Thanks for the feedback. We've definitely considered a forking model to get reporting back to New Relic separated from the app processes, but it's felt like a pretty big change from the agent's current behavior. We will keep it in mind to be sure. Really appreciate as well the links to those old solutions. While we haven't found the code, it definitely led to some interesting reading.
@codeodor We'll definitely get in touch if there's any way you could help out. We mostly encounter problems with short, frequent jobs, but if we do make changes we'll want to be sure things remain compatible with longer running jobs like you've got.
I can't really comment on where to head codewise, but just reporting in as someone who's experienced this issue a lot. I have an application running in production on an older version of resque (resque-1.13.0 and Ruby Enterprise Edition 1.8.7) which currently reliably causes/reproduces the locking-up issue (I've had to kill processes several times over the last couple of weeks for it alone). I'm currently in the process of upgrading that application to ruby 1.9.3 and resque 1.24.1 - unsure if that will resolve or alleviate these problems given some work has already gone into the releases since then. While I've still got the old system running and can reproduce, would backtraces / gdb output be useful (from the older resque version)? If so, let me know what you want to see, as outside of reading newrelic's how-to post it all looks a bit foreign to me. EDIT: I was way off on my count of how many jobs go through. I forgot about some optimisations I did.
@jeremyolliver I don't think backtraces from the older 1.x series will be all that helpful, since those releases won't receive updates beyond the forward progress that has already been made in 1.24 and later. Best of luck getting upgraded though 😄.
I figured as much. I can report back post-upgrade if the issue still persists.
@jeremyolliver Backtraces from the 1.x series would be useful inasmuch as they would allow us to confirm whether your issue is the one Jason describes above or some other problem. You can use the nrdebug script mentioned in the issue description to capture them. If you're able to get backtraces, we'd be happy to take a look at them to at least confirm whether you're hitting the same issue or something different (I've seen a few cases where hangs people thought were caused by this issue were actually caused by unrelated issues in other libraries or application code).
Hey guys - I've been following this thread as it actually is in line with what I've experienced on a separate project. I wanted to provide an update because it affects how folks address this issue. I found that there were two issues at play here:

First, there is a bug in Ruby. All mutexes in any background thread are supposed to be unlocked when you fork. Unfortunately there was a race condition in all versions of Ruby which made this fail. The good news is there's a patch I submitted which has been accepted: https://bugs.ruby-lang.org/issues/8433#note-2. This will need to get backported, but in the meantime I'm personally feeding the patch into rvm.

Second, instead of using the default DNS resolution implementation in Ruby, you can use resolv-replace.

After adding in these two changes, it completely removed any issues I was seeing with hung forks and NewRelic. There are a few bug fixes I've encountered in NewRelic that I'll be putting PRs in for soon, but I don't think any of them were contributing to hung workers -- only the two I noted above.

Hope this is helpful -- let me know if you have any questions!
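For anyone wanting to try the DNS half of the suggestion above, here is a minimal sketch of switching to Ruby's pure-Ruby resolver via the stdlib's resolv-replace library (the example.com request is only an illustration):

```ruby
# Requiring resolv-replace (Ruby stdlib) swaps socket hostname lookups
# over to the pure-Ruby Resolv::DNS resolver, so DNS resolution no longer
# blocks inside the native resolver while holding a native lock.
require 'resolv-replace'
require 'net/http'

# Any subsequent socket-based lookup now resolves the hostname in Ruby-land.
puts Net::HTTP.get(URI('http://example.com/'))
```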
@obrie That's absolutely fantastic news! Thanks for putting the work into that. We'd love to hear whenever this lands somewhere that we can test against. We'll also take a look at resolv-replace.
No problem :) It's also worth noting that there's a bug in 1.9.3 pre-p374 that can also cause forks to deadlock: https://bugs.ruby-lang.org/issues/7693. We were experiencing this issue (in addition to all the others) -- so we're now using p448 in all of our testing.
Hey all,
Keen to hear how this helps for anyone experiencing these sorts of deadlocks. Thanks all! I'll be looking at Resque 2.x in the near future, so we'll see if we can do anything better at that point.
Just to give one more update... We've now been running live in production (with a high throughput) without any issues for about a week now. To provide a little more information, this was the environment we required to completely avoid hung workers:
I'll provide further updates if we discover anything else -- hopefully that's it :)
❤️❤️❤️❤️❤️
Wow, thank you @obrie! This is helpful to everyone using threads.
This is awesome @obrie! We're definitely pointing people to these options when they report deadlocks, since most of them are outside of what newrelic_rpm itself can control. Again, thanks for all the deep digging!
@obrie - Quick question, any idea if this issue is present on Ruby 2.x or just 1.9.3 p<374? I am on ruby 2.0.0p247. Also @mperham - Based on the commit it seems like the NewRelic upgrade with the harvest locking is specifically for Resque users? Does Sidekiq have a similar issue, or is resolv-replace all we need? We process millions of jobs with Sidekiq so we're definitely very interested in whether this affects us or not.
@TheCorp Sidekiq itself never forks but I can't control if workers want to fork. It's crazy but possible. Sidekiq master requires resolv-replace now.
Well then, this might explain why we had so many issues when we tried to implement parallel HTTP calls (we do lots of scraping) by forking threads within a Sidekiq worker.
You don't fork threads, you fork processes. I mean a worker could actually fork the sidekiq process. Slow DNS can cause those issues. We use a local DNS caching proxy which I think helps. resolv-replace should help a lot.
Gotcha, thanks.
In the process of upgrading ruby (from 1.9.3p0 to 2.0.0p247) I was also forced to upgrade newrelic_rpm (from 3.5.3.25
This issue is still open, so I'll comment on here now. We had NewRelic disabled on our Resque workers (managed with resque-pool) via the environment variable, and just turned it on after upgrading to the last 1.x Resque. We are now seeing occasional locked-up workers again, though it's a lot better than when we had turned it off before. Relevant Specs: I'm about to turn on the harvest locks flag now. I think I've got a ticket open with NewRelic that I'll contact them directly about, but wonder if there were any updates on this side.
This ticket has been open for several years. Has there been any progress, or do people just accept that having New Relic with Resque will cause deadlocks on occasion?
Hey @doliveirakn! Turns out there was progress in the meantime that never got updated here. The most common issue people were hitting was with DNS resolution, as mentioned previously. Turns out as of
While it's always a possibility to deadlock when threads and forking mix, this seems to have dealt with the primary cause of resque + New Relic pain. Are you seeing problems yourself still, or just checking in?
Thanks for getting back. I came across this issue about 2 years ago and I've had new relic disabled for resque since, and have been periodically checking up on this issue to see progress. If you think it is resolved, I will look at re-enabling newrelic and I'll report back if we hit more deadlocks.
We have had NewRelic and Resque working side by side for a while now. We started using resolv-replace in our own code after digging into some of the stacktraces (similar to what @jasonrclark mentioned). We don't have harvest locks on, and we haven't seen a recurrence of the deadlock in a long, long time. In other words, I think this can be closed.
I'm leaning toward closing as well @onyxraven. @doliveirakn could you report back if you get to try it out (or if it isn't going to happen for some reason), and I'll close then?
I have put this live as of last week (Oct 18th).
Thanks @doliveirakn for reviving the discussion!
TL;DR: We've got ideas for avoiding deadlocks when using New Relic and Resque together, but would like feedback on the approach, and advice on what we could contribute to Resque to improve the situation.
Long story short, threads (with locks) and forking don't mix nicely. (:wavy_dash: + :fork_and_knife: = :collision: )
The Problem
There are long-standing reports of deadlocking Resque jobs when using New Relic (newrelic_rpm). Certain updates have alleviated issues for specific users, but we still hear about problems from others--especially those running lots of jobs (think millions of jobs per day).

Some great debugging by @benweint and others has led us to suspect the problems are variations on this sequence of events: a thread in the Resque master takes a lock (whether in newrelic_rpm, locks in Ruby's Logger class, or even native locks around, for example, DNS lookup); the master forks to run a job while that lock is held; and the forked child hangs when it needs the lock, since the thread that would have released it doesn't exist in the child.

It seems we can't safely fork while holding a lock in another thread or bad things happen. If newrelic_rpm owned all the locks in question we could deal with them in our after_fork handler (and we have where we've identified them). However, at least one common case (DNS lookup) isn't even locking in Ruby-land.
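To make the failure mode concrete, here is a minimal, self-contained sketch (not taken from Resque or newrelic_rpm) of forking while another thread holds a Mutex. On interpreters affected by the mutex-at-fork race discussed in the comments above, the child can block forever on a lock that no surviving thread will ever release; the timeout exists only so the demo terminates.

```ruby
require 'timeout'

LOCK = Mutex.new

# Stand-in for a background reporting/harvest thread that periodically
# takes the lock in the master process.
Thread.new do
  loop { LOCK.synchronize { sleep 0.001 } }
end

sleep 0.05 # give the background thread time to start contending

pid = fork do
  # Only the forking thread survives in the child. If LOCK happened to be
  # held at the moment of fork, nothing in the child will release it.
  begin
    Timeout.timeout(2) { LOCK.synchronize { puts 'child acquired the lock' } }
  rescue Timeout::Error
    puts 'child deadlocked on a lock held at fork time'
  end
end

Process.wait(pid)
```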
A Solution
I've pushed a branch, resque_harvest_locking, on newrelic_rpm with a proposed change to our instrumentation. (newrelic/newrelic-ruby-agent@d703271) This synchronizes between New Relic's background thread running a harvest and Resque forking a new job (just the forking, not the full job processing). We already locked around our harvest thread, so we just sync on the same lock from a patched Resque::Worker#fork.
This has downsides for sure:
But in our testing it does alleviate deadlocks.
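For readers who just want the shape of the idea without reading the branch, here is a hypothetical sketch (module, constant, and method names are invented for illustration; this is not the actual resque_harvest_locking patch) of serializing Resque's fork against the same lock the harvest thread uses, assuming a Resque 1.x where Resque::Worker#fork exists:

```ruby
require 'resque'

# Hypothetical stand-in for the lock the agent's harvest thread already
# synchronizes on while it runs a harvest cycle.
HARVEST_LOCK = Mutex.new

module SynchronizedFork
  # Never fork a job child mid-harvest: wait until the harvest thread has
  # released the lock, then fork while holding it.
  def fork(*args, &block)
    HARVEST_LOCK.synchronize { super }
  end
end

Resque::Worker.prepend(SynchronizedFork)

# The harvest thread would take the same lock around each cycle, e.g.:
#   HARVEST_LOCK.synchronize { harvest_and_report }
```

The obvious trade-off is that a fork can be delayed while a harvest is in flight, and vice versa.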
Gathering More Information
The latest version of the newrelic_rpm gem ships with an executable called nrdebug. Given a pid, nrdebug drops a log file with tons of information, up to and including gdb and Ruby backtraces from the stuck process. Anyone seeing hung workers with newrelic_rpm should definitely run nrdebug before killing those workers. We'd love to see the results.
Other Ideas?
Has anyone else hit this type of issue when using threads and Resque together?
Are there places we could add hooks in Resque to make this less invasive?
Even more drastically, could there be a better approach to running plugins periodically in the Resque master so we wouldn't need the background thread and could sidestep the locking woes?
Thanks!