Skip to content

Prevent fork from happening in getnameinfo #13990

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

k0kubun
Copy link
Member

@k0kubun k0kubun commented Jul 23, 2025

This PR applies the same fork-safety lock as getaddrinfo #10864 to getnameinfo.

Given that do_getaddrinfo and do_getnameinfo have a very similar structure and one of them needs a fork-safety lock, it seems likely that we need the same thing for the other one.

We're seeing segfaults on rb_getnameinfo at Shopify. I'm not sure if this change fixes the situation, but it helps us investigate the issue by ruling out the possibility of fork during rb_getnameinfo.

@k0kubun k0kubun requested a review from byroot July 23, 2025 17:50
@@ -752,7 +758,7 @@ rb_getnameinfo(const struct sockaddr *sa, socklen_t salen,
}

pthread_t th;
if (raddrinfo_pthread_create(&th, do_getnameinfo, arg) != 0) {
if (raddrinfo_pthread_create(&th, fork_safe_do_getnameinfo, arg) != 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have the crash info handy? IIRC you said the crash was in raddrinfo_pthread_create?

I'm wondering if the lock should be higher or not. For rb_getaddrinfo the lock is above the thread spawn and wait.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have the crash info handy? IIRC you said the crash was in raddrinfo_pthread_create?

The backtrace is at https://github.com/shop/issues-ruby-infrastructure/issues/5#issuecomment-3098309280. It died on pthread_detach after raddrinfo_pthread_create. So,

I'm wondering if the lock should be higher or not.

I agree that it needs to be higher, but

For rb_getaddrinfo the lock is above the thread spawn and wait.

What rb_getaddrinfo locks today seems to be at the same level as what this PR does. For Linux, GETADDRINFO_IMPL would be 2, and the rb_getaddrinfo for 2 is

ruby/ext/socket/raddrinfo.c

Lines 531 to 537 in 78820e8

if (raddrinfo_pthread_create(&th, fork_safe_do_getaddrinfo, arg) != 0) {
int err = errno;
free_getaddrinfo_arg(arg);
errno = err;
return EAI_SYSTEM;
}
pthread_detach(th);
which takes the fork lock in the function used by the spawned thread.

Should we fix both rb_getaddrinfo and rb_getnameinfo to take the lock before spawning a thread and then release it at least after the detach?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, rereading the ticket, the concern in #10864 / [Feature #20590] was that getaddrinfo synchronize a mutex, so the goal was to avoid forking while it is held. And the issue it was trying to avoid is deadlocked child, not a crash in pthread.

I'm OK with moving the lock up so the pthread functions are covered, as I don't think it makes a meqaningful difference, but I think we should at least have a theory that would explain this crash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backtrace is at shop/issues-ruby-infrastructure#5 (comment).

For outsiders:

(gdb) bt
#0  0x00007d12e809cb2c in pthread_kill () at /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007d12e804327e in raise () at /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007d12e86731c8 in ruby_default_signal (sig=<optimized out>) at signal.c:422
#3  0x00007d12e8434a00 in rb_bug_for_fatal_signal (default_sighandler=0x0, sig=sig@entry=11, ctx=ctx@entry=0x7d12e6f492c0, fmt=fmt@entry=0x7d12e88d5e6c "Segmentation fault at %p") at error.c:1134
#4  0x00007d12e8672446 in sigsegv (sig=11, info=0x7d12e6f493f0, ctx=0x7d12e6f492c0) at signal.c:933
#5  0x00007d12e8043330 in <signal handler called> () at /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007d12e809bc90 in pthread_detach () at /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007d12c5fb011f in rb_getnameinfo (sa=sa@entry=0x7d11ce7f1420, salen=16, host=host@entry=0x7fffdd5d8500 "\340\361-\326\021}", hostlen=hostlen@entry=1024, serv=serv@entry=0x7fffdd5d8900 "", servlen=servlen@entry=1024, flags=3) at raddrinfo.c:735
#8  0x00007d12c5fb0698 in addrinfo_getnameinfo (argc=argc@entry=1, argv=argv@entry=0x7fffdd5d8d40, self=self@entry=137513716759200) at raddrinfo.c:2399
#9  0x00007d12c5fb07a8 in addrinfo_ip_address (self=137513716759200) at raddrinfo.c:2457
#10 0x00007d12e8708ddf in vm_call_cfunc_with_frame_ (ec=0x7d12e786d550, reg_cfp=0x7d12e7983e38, calling=<optimized out>, argc=<optimized out>, argv=<optimized out>, stack_bottom=<optimized out>) at /tmp/ruby-build/ruby-3.4.4/vm_insnhelper.c:3794
#11 0x00007d12e871249c in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /tmp/ruby-build/ruby-3.4.4/vm_callinfo.h:415
#12 vm_exec_core (ec=0x7d1120a1f6c0, ec@entry=0x7d12e786d550) at /tmp/ruby-build/ruby-3.4.4/insns.def:898
#13 0x00007d12e87198b9 in rb_vm_exec (ec=0x7d12e786d550) at vm.c:2595
│      711 int
│      712 rb_getnameinfo(const struct sockaddr *sa, socklen_t salen,
│      713                char *host, size_t hostlen,
│      714                char *serv, size_t servlen, int flags)
│      715 {
│      716     int retry;
│      717     struct getnameinfo_arg *arg;
│      718     int err, gni_errno = 0;
│      719
│      720 start:
│      721     retry = 0;
│      722
│      723     arg = allocate_getnameinfo_arg(sa, salen, hostlen, servlen, flags);
│      724     if (!arg) {
│      725         return EAI_MEMORY;
│      726     }
│      727
│      728     pthread_t th;
│      729     if (raddrinfo_pthread_create(&th, do_getnameinfo, arg) != 0) {
│      730         int err = errno;
│      731         free_getnameinfo_arg(arg);
│      732         errno = err;
│      733         return EAI_SYSTEM;
│      734     }
│  >   735     pthread_detach(th);
│      736

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I think we should at least have a theory that would explain this crash.

Looking at that backtrace, I don't quite see how fork could be the cause. Unless I'm misreading the code.

th is on the stack, so if somehow the GVL was released and another thread called fork, we wouldn't reach pthread_detach.

Unless it's pthread_create that has been broken by fork(2), and now in the child it's incapable of spawning threads?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking, a signal handler is invoked before pthread_detach(), it happens to start on the thread running getnameinfo, it also calls fork, and the child process (doesn't exit and) moves on to call pthread_detach().

I guess that was an unreasonable guess. But this is all I can come up with for fork.

Other than that, my guess is that a thread without GVL or a signal handler somehow discovers and detaches that thread. It doesn't seem like a good guess either, but otherwise pthread_detach() wouldn't crash.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants