What could be causing make to hang when compiling on multiple cores?

Solution 1

I don't have an answer to this precise issue, but I can give you a hint about what may be happening: missing dependencies in Makefiles.

Example:

target: a.bytecode b.bytecode
    link a.bytecode b.bytecode -o target

a.bytecode: a.source
    compile a.source -o a.bytecode

b.bytecode: b.source
    compile b.source a.bytecode -o b.bytecode

If you call make target everything will compile correctly. Compilation of a.source is performed (arbitrarily, but deterministically) first. Then compilation of b.source is performed.

But if you run make -j2 target, both compile commands run in parallel, and the broken dependencies in the Makefile show up. The second compile assumes a.bytecode has already been built, but a.bytecode is not listed among the prerequisites of b.bytecode, so an error is likely. The correct dependency line for b.bytecode is:

b.bytecode: b.source a.bytecode

To come back to your problem: if you are unlucky, a missing dependency can leave a command stuck in a 100% CPU loop. That is probably what is happening here. A sequential build could not reveal the missing dependency, but your parallel build did.
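As a rough illustration of the missing-dependency idea above, here is a minimal Python sketch. The RULES table is a hand-written, hypothetical model of the example Makefile (make exposes no such structure), and the helper names undeclared_inputs and can_start_immediately are made up for this sketch. It lists files that a recipe reads without declaring them, and shows which targets a parallel build is free to start at the same time.

```python
# Hypothetical model of the example Makefile: for each target, the
# prerequisites it declares and the files its recipe actually reads.
RULES = {
    "target": {
        "declared": {"a.bytecode", "b.bytecode"},
        "reads": {"a.bytecode", "b.bytecode"},
    },
    "a.bytecode": {"declared": {"a.source"}, "reads": {"a.source"}},
    "b.bytecode": {
        "declared": {"b.source"},
        "reads": {"b.source", "a.bytecode"},
    },
}


def undeclared_inputs(rules):
    """Return {target: files the recipe reads but does not declare}."""
    missing = {}
    for target, rule in rules.items():
        extra = rule["reads"] - rule["declared"]
        if extra:
            missing[target] = extra
    return missing


def can_start_immediately(rules):
    """Targets whose declared prerequisites are all plain source files.

    A parallel build may launch every one of these at the same time.
    """
    return sorted(
        target
        for target, rule in rules.items()
        if rule["declared"].isdisjoint(rules)  # no prerequisite is itself built
    )


if __name__ == "__main__":
    print(undeclared_inputs(RULES))
    print(can_start_immediately(RULES))
```

Both a.bytecode and b.bytecode are reported as startable right away, while b.bytecode also shows a.bytecode as an undeclared input. That combination is the race described above.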

Solution 2

I don't know how long you've had the machine, but my first recommendation would be to run a memory test and verify that the memory is functioning properly. I know memory often isn't the problem, but if it is, it is best to rule it out before trying to track down other probable issues.

Solution 3

I realize this is a really old question, but it still pops up at the top of search results, so here is my solution:

GNU make has a jobserver mechanism to ensure make and its recursive children do not consume more than the specified number of cores: http://make.mad-scientist.net/papers/jobserver-implementation/

It relies on a pipe shared by all processes. Each process that wants to fork additional children has to first consume tokens from the pipe, then relinquish them when done. If a child process does not return the tokens it consumed, the top-level make will hang forever waiting for them to be returned.

https://bugzilla.redhat.com/show_bug.cgi?id=654822

I encountered this error when building binutils with GNU make on my Solaris box, where "sed" is not GNU sed. Fiddling with PATH to make sed==gsed take priority over the system sed fixed the issue. I don't know why sed was consuming tokens from the pipe, though.
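The token protocol described above can be sketched in a few lines of Python. This is a toy model, not GNU make's actual code: the class name TokenPipe and its methods are invented for illustration, and the read end is made non-blocking so the sketch reports an empty pipe instead of hanging the way a real make would.

```python
import os


class TokenPipe:
    """Toy model of a jobserver-style token pipe (not GNU make's code)."""

    def __init__(self, jobs):
        self.read_fd, self.write_fd = os.pipe()
        # Modelling assumption: the owner keeps one implicit job slot,
        # so only jobs - 1 tokens go into the pipe.
        os.write(self.write_fd, b"+" * (jobs - 1))
        # Non-blocking reads let the sketch observe "no tokens left".
        os.set_blocking(self.read_fd, False)

    def try_acquire(self):
        """Take one token; return False when the pipe is empty."""
        try:
            return len(os.read(self.read_fd, 1)) == 1
        except BlockingIOError:
            return False

    def release(self):
        """Give one token back to the pool."""
        os.write(self.write_fd, b"+")

    def close(self):
        os.close(self.read_fd)
        os.close(self.write_fd)


if __name__ == "__main__":
    pool = TokenPipe(jobs=4)
    taken = sum(pool.try_acquire() for _ in range(4))
    print("tokens acquired:", taken)
    # Only two of the three tokens are returned; one is "leaked".
    pool.release()
    pool.release()
    print("tokens left:", sum(pool.try_acquire() for _ in range(3)))
    pool.close()
```

With a blocking read, as in a real build, the process waiting for the leaked token would never wake up, which matches the hang described above.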

Solution 4

make seems to create a deadlock. Using ps -ef, these processes appear to be the culprit:

root  695 615  1 22:18 ? 00:00:00 make PREBUILD -j32
root 2127 695 20 22:18 ? 00:00:04 make -f Makefile.prenobuild

If you check what each is doing, the child process is writing to file descriptor 4, and the parent process is waiting for all child processes to exit:

root@ltzj2-6hl3t-b98zz:/# strace -p 2127
strace: Process 2127 attached
write(4, "+", 1
root@ltzj2-6hl3t-b98zz:/# strace -p 695
strace: Process 695 attached
wait4(-1, 

File descriptor 4 happens to be a pipe:

root@ltzj2-6hl3t-b98zz:/# ls -la /proc/2127/fd/4
l-wx------ 1 root root 64 Sep 3 22:22 /proc/2127/fd/4 -> 'pipe:[1393418985]'

and that pipe is only between the parent and child processes:

root@ltzj2-6hl3t-b98zz:/# lsof | grep 1393418985
make  695 root 3r FIFO 0,12 0t0 1393418985 pipe
make  695 root 4w FIFO 0,12 0t0 1393418985 pipe
make 2127 root 3r FIFO 0,12 0t0 1393418985 pipe
make 2127 root 4w FIFO 0,12 0t0 1393418985 pipe

So it appears that 2127 is stuck trying to write into the pipe back to 695, but 695 is blocked in wait4(), so it is never going to empty that pipe.

If I empty the pipe from the shell using cat, then the build resumes and completes as expected...

root@ltzj2-6hl3t-b98zz:/# cat /proc/695/fd/3
+++++++++++++++++++++++++++++++

The build unblocks and continues running...
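The ls and lsof steps above can also be scripted. The following is a minimal Python sketch, assuming a Linux /proc filesystem and permission to read the fd directories of both processes. The function names pipe_fds and shared_pipes are made up for this example.

```python
import os


def pipe_fds(pid):
    """Map each pipe (e.g. 'pipe:[1393418985]') to the fds `pid` has on it."""
    fd_dir = f"/proc/{pid}/fd"
    found = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed while we were scanning
        if target.startswith("pipe:["):
            found.setdefault(target, []).append(int(name))
    return {pipe: sorted(fds) for pipe, fds in found.items()}


def shared_pipes(pid_a, pid_b):
    """Return the pipes that both processes hold open, with their fd numbers."""
    a, b = pipe_fds(pid_a), pipe_fds(pid_b)
    return {pipe: (a[pipe], b[pipe]) for pipe in a.keys() & b.keys()}


if __name__ == "__main__":
    import sys

    # Usage: python shared_pipes.py <parent pid> <child pid>
    parent, child = int(sys.argv[1]), int(sys.argv[2])
    for pipe, (fds_a, fds_b) in shared_pipes(parent, child).items():
        print(pipe, "parent fds:", fds_a, "child fds:", fds_b)
```

Run against the two make processes from the ps listing, this should report the same pipe that lsof shows, along with the descriptor numbers each process holds on it.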


My original understanding was off, but after more investigation I eventually ended up at this Linux Kernel defect:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=46c4c9d1beb7f5b4cec4dd90e7728720583ee348

An exact explanation of how this hangs make is here: https://lore.kernel.org/lkml/1628086770.5rn8p04n6j.none@localhost/.

Pending a kernel patch, you can work around this by applying the following change to the GNU make source code:

--- a/src/posixos.c 2020-01-02 23:11:27.000000000 -0800
+++ b/src/posixos.c 2021-09-18 09:12:02.786563319 -0700
@@ -179,8 +179,52 @@
 jobserver_release (int is_fatal)
 {
   int r;
-  EINTRLOOP (r, write (job_fds[1], &token, 1));
-  if (r != 1)
+  int n;
+  char b[32];
+
+  /* Use non-blocking write to avoid deadlock from multiple make children
+   * releasing jobs at the same time. */
+  set_blocking (job_fds[1], 0);
+  memset(b,token,sizeof(b));
+  n = 1;
+  while ( n > 0 )
+    {
+      r = write (job_fds[1], b, n);
+      /* Interrupted System Call, try again */
+      if ( r == -1 )
+   {
+     if ( errno == EINTR )
+       continue;
+
+     /* We get here because this process and another both tried to write to the pipe at
+      * exactly the same time, and the pipe only contains 1 page.  We lost, the other
+      * process won (wrote to the pipe).  We can only reset this condition by first
+      * reading from the pipe.  Of course, that means we then need to return an extra
+      * token. */
+     if ( errno == EWOULDBLOCK || errno == EAGAIN )
+       {
+         if ( jobserver_acquire(0) )
+       {
+         n++;
+         /* Probably close to impossible... */
+         if ( n > 32 )
+           break;
+         continue;
+       }
+       }
+   }
+      if ( r == 0 )        /* Wrote 0 bytes, but not an error, try again */
+   continue;
+      if ( r > 0 )
+   {
+     n -= r;
+     continue;
+   }
+      break;           /* All other errors, break. */
+    }
+  set_blocking (job_fds[1], 1);
+
+  if (n != 0)
     {
       if (is_fatal)
         pfatal_with_name (_("write jobserver"));

Author by

for

Updated on September 18, 2022

Comments

  • for
    for over 1 year

    • jlp
      jlp almost 12 years
      It does sound like a race condition. One thing you could do is attach to the running make process (the one that is spinning) using, e.g. strace -p <pid> and see if you can find out what it's looking at/for. strace will only show you syscalls (not function calls), but it could still give you valuable info if it's spinning while looking at or for a particular file.
    • Nils
      Nils almost 12 years
      The thread you found via google leads to the conclusion that no one was able to compile it with -j >1.
    • jozxyqk
      jozxyqk over 9 years
      Not related to parallel compilation, but I had a hanging makefile which took forever to debug. Turns out it was simply in the initialization of a variable, $(shell ...) was ultimately running a command which was waiting for input from stdin. This was caused when a variable was empty and no file arguments were passed to the command.
  • Jagjot
    Jagjot almost 12 years
    Interesting. Do you know if there are any tools available that can run through a makefile and check these dependencies?
  • Stéphane Gimenez
    Stéphane Gimenez almost 12 years
    I don't know any. In any case such a tool could only find obvious mistakes. Unless it understands the syntax for each command that appears in the Makefile, and knows what are the (potentially implicit) dependencies.
  • David Faure
    David Faure about 4 years
    I don't see how this could lead to make hanging. The compile command for b.bytecode would fail with a.bytecode not found, most likely.
  • antonio garcia
    antonio garcia over 2 years
    I was able to eventually trace it down to this kernel bug 759c01142a