What could be causing make to hang when compiling on multiple cores?
Solution 1
I don't have an answer to this precise issue, but I can hint at what may be happening: missing dependencies in Makefiles.
Example:
target: a.bytecode b.bytecode
	link a.bytecode b.bytecode -o target
a.bytecode: a.source
	compile a.source -o a.bytecode
b.bytecode: b.source
	compile b.source a.bytecode -o b.bytecode
If you call make target, everything will compile correctly. Compilation of a.source is performed (arbitrarily, but deterministically) first. Then compilation of b.source is performed.
But if you run make -j2 target, both compile commands will be run in parallel, and you'll notice that your Makefile's dependencies are broken. The second compile assumes a.bytecode is already compiled, but it does not appear in the dependencies, so an error is likely to happen. The correct dependency line for b.bytecode should be:
b.bytecode: b.source a.bytecode
To come back to your problem: if you are unlucky, a command may hang in a 100% CPU loop because of a missing dependency. That is probably what is happening here: the missing dependency could not be revealed by a sequential build, but it has been revealed by your parallel build.
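A naive check for this kind of mistake can be sketched in Python. This is only an illustration under strong assumptions: the rules are already parsed into a dictionary (the RULES table and the find_missing_deps helper are made up for this example), and a recipe word counts as a dependency only if it is literally the name of another target. Implicit dependencies and real Makefile syntax are out of scope.

```python
import re

def find_missing_deps(rules):
    """Return {target: [names]} for recipe words that name another
    target but are not declared as prerequisites of the rule."""
    targets = set(rules)
    missing = {}
    for target, (prereqs, recipe) in rules.items():
        # Split the recipe into rough "words" (file names, flags, commands).
        words = re.findall(r"[\w.\-/]+", recipe)
        used = [w for w in words
                if w in targets and w != target and w not in prereqs]
        if used:
            missing[target] = sorted(set(used))
    return missing

# Hypothetical, hand-parsed version of the example Makefile above,
# with the incomplete dependency line for b.bytecode.
RULES = {
    "target": (["a.bytecode", "b.bytecode"],
               "link a.bytecode b.bytecode -o target"),
    "a.bytecode": (["a.source"], "compile a.source -o a.bytecode"),
    "b.bytecode": (["b.source"],
                   "compile b.source a.bytecode -o b.bytecode"),
}

print(find_missing_deps(RULES))  # {'b.bytecode': ['a.bytecode']}
```

A check like this only catches the obvious cases, since it knows nothing about what each command actually reads.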
Solution 2
I don't know how long you've had the machine, but my first recommendation would be to run a memory test and verify that the memory is functioning properly. Memory often isn't the problem, but if it is, it is best to eliminate it as a cause before trying to track down other probable issues.
Solution 3
I realize this is a really old question, but it still pops up at the top of search results, so here is my solution:
GNU make has a jobserver mechanism to ensure make and its recursive children do not consume more than the specified number of cores: http://make.mad-scientist.net/papers/jobserver-implementation/
It relies on a pipe shared by all processes. Each process that wants to fork additional children has to first consume tokens from the pipe, then relinquish them when done. If a child process does not return the tokens it consumed, the top-level make will hang forever waiting for them to be returned.
https://bugzilla.redhat.com/show_bug.cgi?id=654822
I encountered this error when building binutils with GNU make on my Solaris box, where "sed" is not GNU sed. Fiddling with PATH to make sed==gsed take priority over the system sed fixed the issue. I don't know why sed was consuming tokens from the pipe, though.
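The token mechanism described above can be modelled in a few lines of Python. This is a toy sketch, not GNU make's code: the TokenPipe class and its method names are invented for illustration, and it assumes the usual arrangement where the top-level process keeps one implicit token and puts the remaining ones in the pipe. The read end is set non-blocking so the demo reports an empty pipe instead of hanging the way make would.

```python
import os

class TokenPipe:
    """Toy model of a jobserver-style token pipe (illustrative only)."""

    def __init__(self, jobs):
        self.read_fd, self.write_fd = os.pipe()
        # Non-blocking reads: an empty pipe raises instead of hanging.
        os.set_blocking(self.read_fd, False)
        # Assumption: one token is implicit, the other jobs - 1 are in the pipe.
        os.write(self.write_fd, b"+" * (jobs - 1))

    def acquire(self):
        """Take one token; return False if none is available right now."""
        try:
            return os.read(self.read_fd, 1) == b"+"
        except BlockingIOError:
            return False

    def release(self):
        """Return one token to the pipe."""
        os.write(self.write_fd, b"+")

    def available(self):
        """Count the tokens in the pipe by draining and refilling it."""
        try:
            tokens = os.read(self.read_fd, 4096)
        except BlockingIOError:
            return 0
        os.write(self.write_fd, tokens)
        return len(tokens)

pipe = TokenPipe(jobs=4)   # 3 tokens in the pipe
pipe.acquire()             # a well-behaved child takes a token ...
pipe.release()             # ... and gives it back
pipe.acquire()             # a misbehaving child never returns its token
print(pipe.available())    # 2
```

With blocking reads, a parent waiting to collect all tokens back would wait forever for the leaked one, which is the hang described above.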
Solution 4
make seems to create a deadlock. Using ps -ef, these processes appear to be the culprit:
root 695 615 1 22:18 ? 00:00:00 make PREBUILD -j32
root 2127 695 20 22:18 ? 00:00:04 make -f Makefile.prenobuild
If you check what each is doing, the child process is writing to file descriptor 4, and the parent process is waiting for all child processes to exit:
root@ltzj2-6hl3t-b98zz:/# strace -p 2127
strace: Process 2127 attached
write(4, "+", 1
root@ltzj2-6hl3t-b98zz:/# strace -p 695
strace: Process 695 attached
wait4(-1,
File descriptor 4 happens to be a pipe:
root@ltzj2-6hl3t-b98zz:/# ls -la /proc/2127/fd/4
l-wx------ 1 root root 64 Sep 3 22:22 /proc/2127/fd/4 -> 'pipe:[1393418985]'
and that pipe is only between the parent and child processes:
root@ltzj2-6hl3t-b98zz:/# lsof | grep 1393418985
make 695 root 3r FIFO 0,12 0t0 1393418985 pipe
make 695 root 4w FIFO 0,12 0t0 1393418985 pipe
make 2127 root 3r FIFO 0,12 0t0 1393418985 pipe
make 2127 root 4w FIFO 0,12 0t0 1393418985 pipe
So it would appear that 2127 is stuck trying to write into the pipe back to 695, but 695 is blocked in wait4(), so it is never going to empty that pipe.
If I empty the pipe from the shell using cat, then the build resumes and completes as expected...
root@ltzj2-6hl3t-b98zz:/# cat /proc/695/fd/3 +++++++++++++++++++++++++++++++
The build unblocks and continues running...
My original understanding was off, but after more investigation I eventually ended up at a Linux kernel defect.
An exact explanation of how this hangs make is here: https://lore.kernel.org/lkml/1628086770.5rn8p04n6j.none@localhost/.
Pending a kernel patch, you can work around this by applying the following change to the GNU make source code:
--- a/src/posixos.c 2020-01-02 23:11:27.000000000 -0800
+++ b/src/posixos.c 2021-09-18 09:12:02.786563319 -0700
@@ -179,8 +179,52 @@
 jobserver_release (int is_fatal)
 {
   int r;
-  EINTRLOOP (r, write (job_fds[1], &token, 1));
-  if (r != 1)
+  int n;
+  char b[32];
+
+  /* Use non-blocking write to avoid deadlock from multiple make children
+   * releasing jobs at the same time. */
+  set_blocking (job_fds[1], 0);
+  memset(b,token,sizeof(b));
+  n = 1;
+  while ( n > 0 )
+    {
+      r = write (job_fds[1], b, n);
+      /* Interrupted System Call, try again */
+      if ( r == -1 )
+        {
+          if ( errno == EINTR )
+            continue;
+
+          /* We get here because this process and another both tried to write to the pipe at
+           * exactly the same time, and the pipe only contains 1 page. We lost, the other
+           * process won (wrote to the pipe). We can only reset this condition by first
+           * reading from the pipe. Of course, that means we then need to return an extra
+           * token. */
+          if ( errno == EWOULDBLOCK || errno == EAGAIN )
+            {
+              if ( jobserver_acquire(0) )
+                {
+                  n++;
+                  /* Probably close to impossible... */
+                  if ( n > 32 )
+                    break;
+                  continue;
+                }
+            }
+        }
+      if ( r == 0 ) /* Wrote 0 bytes, but not an error, try again */
+        continue;
+      if ( r > 0 )
+        {
+          n -= r;
+          continue;
+        }
+      break; /* All other errors, break. */
+    }
+  set_blocking (job_fds[1], 1);
+
+  if (n != 0)
     {
       if (is_fatal)
         pfatal_with_name (_("write jobserver"));
Updated on September 18, 2022
Comments
-
jlp (almost 12 years ago): It does sound like a race condition. One thing you could do is attach to the running make process (the one that is spinning) using, e.g., strace -p <pid> and see if you can find out what it's looking at/for. strace will only show you syscalls (not function calls), but it could still give you valuable info if it's spinning while looking at or for a particular file.
-
Nils (almost 12 years ago): The thread you found via Google leads to the conclusion that no one was able to compile it with -j >1.
-
jozxyqk (over 9 years ago): Not related to parallel compilation, but I had a hanging makefile which took forever to debug. It turned out that, in the initialization of a variable, $(shell ...) was ultimately running a command which was waiting for input from stdin. This happened when a variable was empty and no file arguments were passed to the command.
-
-
Jagjot (almost 12 years ago): Interesting. Do you know if there are any tools available that can run through a makefile and check these dependencies?
-
Stéphane Gimenez (almost 12 years ago): I don't know of any. In any case, such a tool could only find obvious mistakes, unless it understands the syntax of each command that appears in the Makefile and knows its (potentially implicit) dependencies.
-
David Faure (about 4 years ago): I don't see how this could lead to make hanging. The compile command for b.bytecode would fail with "a.bytecode not found", most likely.
-
antonio garcia (over 2 years ago): I was able to eventually trace it down to this kernel bug 759c01142a
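Regarding the stdin hang jozxyqk describes in the comments: one way to make that class of hang impossible is to run the command with its standard input connected to /dev/null, so a tool that falls back to reading stdin sees end-of-file immediately. A minimal Python sketch, assuming a Unix system where cat is available (run_without_stdin is a hypothetical helper name, and cat stands in for whatever command the makefile was invoking):

```python
import subprocess

def run_without_stdin(argv, timeout=5):
    """Run a command with stdin redirected from /dev/null, so a program
    that falls back to reading stdin gets EOF instead of waiting."""
    return subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,   # the command can never block on input
        capture_output=True,
        timeout=timeout,            # safety net if it hangs for another reason
    )

# cat with no file arguments reads stdin; here it sees EOF and exits at once.
result = run_without_stdin(["cat"])
print(result.returncode, result.stdout)
```

Inside a makefile, the equivalent precaution is to redirect the command's input from /dev/null in the $(shell ...) call.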