-
Notifications
You must be signed in to change notification settings - Fork 5.4k
Prevent fork from happening in getnameinfo #13990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@@ -752,7 +758,7 @@ rb_getnameinfo(const struct sockaddr *sa, socklen_t salen, | |||
} | |||
|
|||
pthread_t th; | |||
if (raddrinfo_pthread_create(&th, do_getnameinfo, arg) != 0) { | |||
if (raddrinfo_pthread_create(&th, fork_safe_do_getnameinfo, arg) != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have the crash info handy? IIRC you said the crash was in raddrinfo_pthread_create
?
I'm wondering if the lock should be higher or not. For rb_getaddrinfo
the lock is above the thread spawn and wait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have the crash info handy? IIRC you said the crash was in raddrinfo_pthread_create?
The backtrace is at https://github.com/shop/issues-ruby-infrastructure/issues/5#issuecomment-3098309280. It died on pthread_detach
after raddrinfo_pthread_create
. So,
I'm wondering if the lock should be higher or not.
I agree that it needs to be higher, but
For rb_getaddrinfo the lock is above the thread spawn and wait.
What rb_getaddrinfo
locks today seems to be at the same level as what this PR does. For Linux, GETADDRINFO_IMPL
would be 2, and the rb_getaddrinfo
for 2
is
Lines 531 to 537 in 78820e8
if (raddrinfo_pthread_create(&th, fork_safe_do_getaddrinfo, arg) != 0) { | |
int err = errno; | |
free_getaddrinfo_arg(arg); | |
errno = err; | |
return EAI_SYSTEM; | |
} | |
pthread_detach(th); |
Should we fix both rb_getaddrinfo
and rb_getnameinfo
to take the lock before spawning a thread and then release it at least after the detach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, rereading the ticket, the concern in #10864 / [Feature #20590] was that getaddrinfo
synchronize a mutex, so the goal was to avoid forking while it is held. And the issue it was trying to avoid is deadlocked child, not a crash in pthread.
I'm OK with moving the lock up so the pthread
functions are covered, as I don't think it makes a meqaningful difference, but I think we should at least have a theory that would explain this crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The backtrace is at shop/issues-ruby-infrastructure#5 (comment).
For outsiders:
(gdb) bt
#0 0x00007d12e809cb2c in pthread_kill () at /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007d12e804327e in raise () at /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007d12e86731c8 in ruby_default_signal (sig=<optimized out>) at signal.c:422
#3 0x00007d12e8434a00 in rb_bug_for_fatal_signal (default_sighandler=0x0, sig=sig@entry=11, ctx=ctx@entry=0x7d12e6f492c0, fmt=fmt@entry=0x7d12e88d5e6c "Segmentation fault at %p") at error.c:1134
#4 0x00007d12e8672446 in sigsegv (sig=11, info=0x7d12e6f493f0, ctx=0x7d12e6f492c0) at signal.c:933
#5 0x00007d12e8043330 in <signal handler called> () at /lib/x86_64-linux-gnu/libc.so.6
#6 0x00007d12e809bc90 in pthread_detach () at /lib/x86_64-linux-gnu/libc.so.6
#7 0x00007d12c5fb011f in rb_getnameinfo (sa=sa@entry=0x7d11ce7f1420, salen=16, host=host@entry=0x7fffdd5d8500 "\340\361-\326\021}", hostlen=hostlen@entry=1024, serv=serv@entry=0x7fffdd5d8900 "", servlen=servlen@entry=1024, flags=3) at raddrinfo.c:735
#8 0x00007d12c5fb0698 in addrinfo_getnameinfo (argc=argc@entry=1, argv=argv@entry=0x7fffdd5d8d40, self=self@entry=137513716759200) at raddrinfo.c:2399
#9 0x00007d12c5fb07a8 in addrinfo_ip_address (self=137513716759200) at raddrinfo.c:2457
#10 0x00007d12e8708ddf in vm_call_cfunc_with_frame_ (ec=0x7d12e786d550, reg_cfp=0x7d12e7983e38, calling=<optimized out>, argc=<optimized out>, argv=<optimized out>, stack_bottom=<optimized out>) at /tmp/ruby-build/ruby-3.4.4/vm_insnhelper.c:3794
#11 0x00007d12e871249c in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /tmp/ruby-build/ruby-3.4.4/vm_callinfo.h:415
#12 vm_exec_core (ec=0x7d1120a1f6c0, ec@entry=0x7d12e786d550) at /tmp/ruby-build/ruby-3.4.4/insns.def:898
#13 0x00007d12e87198b9 in rb_vm_exec (ec=0x7d12e786d550) at vm.c:2595
│ 711 int
│ 712 rb_getnameinfo(const struct sockaddr *sa, socklen_t salen,
│ 713 char *host, size_t hostlen,
│ 714 char *serv, size_t servlen, int flags)
│ 715 {
│ 716 int retry;
│ 717 struct getnameinfo_arg *arg;
│ 718 int err, gni_errno = 0;
│ 719
│ 720 start:
│ 721 retry = 0;
│ 722
│ 723 arg = allocate_getnameinfo_arg(sa, salen, hostlen, servlen, flags);
│ 724 if (!arg) {
│ 725 return EAI_MEMORY;
│ 726 }
│ 727
│ 728 pthread_t th;
│ 729 if (raddrinfo_pthread_create(&th, do_getnameinfo, arg) != 0) {
│ 730 int err = errno;
│ 731 free_getnameinfo_arg(arg);
│ 732 errno = err;
│ 733 return EAI_SYSTEM;
│ 734 }
│ > 735 pthread_detach(th);
│ 736
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I think we should at least have a theory that would explain this crash.
Looking at that backtrace, I don't quite see how fork
could be the cause. Unless I'm misreading the code.
th
is on the stack, so if somehow the GVL was released and another thread called fork, we wouldn't reach pthread_detach
.
Unless it's pthread_create
that has been broken by fork(2)
, and now in the child it's incapable of spawning threads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking, a signal handler is invoked before pthread_detach()
, it happens to start on the thread running getnameinfo, it also calls fork
, and the child process (doesn't exit and) moves on to call pthread_detach()
.
I guess that was an unreasonable guess. But this is all I can come up with for fork
.
Other than that, my guess is that a thread without GVL or a signal handler somehow discovers and detaches that thread. It doesn't seem like a good guess either, but otherwise pthread_detach()
wouldn't crash.
This PR applies the same fork-safety lock as getaddrinfo #10864 to getnameinfo.
Given that
do_getaddrinfo
anddo_getnameinfo
have a very similar structure and one of them needs a fork-safety lock, it seems likely that we need the same thing for the other one.We're seeing segfaults on
rb_getnameinfo
at Shopify. I'm not sure if this change fixes the situation, but it helps us investigate the issue by ruling out the possibility offork
duringrb_getnameinfo
.