What is the reason for Broken Pipe on Unix Domain Sockets?

11,560

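'Broken pipe' means you have written to a connection that had already been closed by the other end. It is detected somewhat asynchronously due to buffering, so the write that fails is not necessarily the first one issued after the close. It basically means you have an error in your application protocol: one side is still sending after the other has stopped reading.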
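The behaviour described above can be reproduced in a few lines. This is a minimal sketch, not the asker's setup: it assumes Java 16 or later and uses the JDK's built-in `UnixDomainSocketAddress` instead of junixsocket, and the class name `BrokenPipeDemo`, the method `writeAfterPeerClose`, and the temporary socket path are hypothetical names chosen for illustration. The "server" side accepts the connection and closes it immediately, and the client then tries to write.

```java
import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Files;
import java.nio.file.Path;

class BrokenPipeDemo {

    /** Returns the exception raised when writing to a peer that has closed, or null if none. */
    static IOException writeAfterPeerClose() throws IOException {
        // Hypothetical socket location inside a fresh temporary directory.
        Path dir = Files.createTempDirectory("broken-pipe-demo");
        Path sockPath = dir.resolve("demo.sock");
        UnixDomainSocketAddress address = UnixDomainSocketAddress.of(sockPath);
        try (ServerSocketChannel server = ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
            server.bind(address);
            try (SocketChannel client = SocketChannel.open(StandardProtocolFamily.UNIX)) {
                client.connect(address);
                // The peer accepts the connection and closes its end straight away.
                server.accept().close();
                ByteBuffer data = ByteBuffer.wrap(new byte[1024]);
                try {
                    // Write a few times in case the first write is not the one that fails.
                    for (int i = 0; i < 10; i++) {
                        data.rewind();
                        client.write(data);
                    }
                    return null;
                } catch (IOException e) {
                    return e;
                }
            }
        } finally {
            // The socket file is not removed automatically when the channel closes.
            Files.deleteIfExists(sockPath);
            Files.deleteIfExists(dir);
        }
    }

    public static void main(String[] args) throws IOException {
        IOException e = writeAfterPeerClose();
        System.out.println(e == null ? "no error" : "write failed: " + e.getMessage());
    }
}
```

On Linux the reported message is typically "Broken pipe", though the exact text comes from the operating system. No network is involved; the only thing that happened is that the process on the other end of the socket closed it.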

Author: jbx

Updated on June 20, 2022

Comments

  • jbx
    jbx almost 2 years

    I have a server application which receives requests and forwards them to a Unix Domain Socket. This works perfectly under reasonable usage, but when I run load tests with a few thousand requests I get a Broken Pipe error.

    I am using Java 7 with junixsocket to send the requests. I have lots of concurrent requests, but I have a thread pool of 20 workers which is writing to the unix domain socket, so there is no issue of too many concurrent open connections.

    For each request I am opening, sending and closing the connection with the Unix Domain Socket.

    What is the reason that could cause a Broken Pipe on Unix Domain Sockets?

    UPDATE:

    Adding a code sample in case it helps:

    byte[] mydata = new byte[1024];
    //fill the data with bytes ...
    
    AFUNIXSocketAddress socketAddress = new AFUNIXSocketAddress(new File("/tmp/my.sock"));
    Socket socket = AFUNIXSocket.connectTo(socketAddress);
    OutputStream out = new BufferedOutputStream(socket.getOutputStream());
    InputStream in = new BufferedInputStream(socket.getInputStream());
    
    out.write(mydata);
    out.flush();  //The Broken Pipe occurs here, but only after a few thousand times
    
    //read the response back...
    
    out.close();
    in.close();
    socket.close();
    

    I have a thread pool of 20 workers, and they are doing the above concurrently (so up to 20 concurrent connections to the same Unix Domain Socket), with each one opening, sending and closing. This works fine for a load test with a burst of 10,000 requests, but when I add a few thousand more I suddenly get this error, so I am wondering whether it's coming from some OS limit.

    Keep in mind that this is a Unix Domain Socket, not a network TCP socket.

  • jbx
    jbx about 12 years
    Thanks, but this is a Unix Domain Socket; it's not a normal TCP socket, where a broken pipe is typically caused by network issues or the server closing the connection in a non-graceful manner.
  • user207421
    user207421 about 12 years
    @jbx None of that is true. 'Broken pipe' always means the peer closed the connection, and nothing else, whether TCP or Unix domain. Network errors do not cause this problem. Both graceful and ungraceful closes will cause this problem.
  • jbx
    jbx about 12 years
    Yes, but why is it happening on a Unix Domain Socket? It is essentially a local file handle on the OS. There is no other side which is closing anything; it's all local.
  • user207421
    user207421 about 12 years
    @jbx Because the peer closed the connection. The peer in this case is another process in the same OS but it is still the peer.
  • jbx
    jbx about 12 years
    OK, and why is the process closing the connection? Why would it work for the first 10,000 requests, and then out of the blue this occurs? (10,000 is not the exact number; it's actually more than that, but it does not reach the 20,000 load test limit.)
  • user207421
    user207421 about 12 years
    @jbx I don't know. It's your process, not mine. But it is closing. That's what the exception means.
  • jbx
    jbx about 12 years
    No, it's not closing; that's the whole point I am trying to understand. The listening process is not my process; it's the PHP FastCGI process, and it has no reason to close. Some load condition is triggering this, but there is nothing in the logs, which is why I asked the question.
  • user207421
    user207421 about 12 years
    @jbx As they say at AA, the first step is to get out of denial mode. You are getting a 'broken pipe' error. That happens when the peer closes its socket, and in no other circumstance. Ergo the peer is closing its socket. Period. Punto basta. Finis. Ende. Why, is another question.
  • jbx
    jbx about 12 years
    Lol, denial mode. OK. Could it be... just could it... that my load test is filling up some OS buffer, which is why I am getting the error exactly when I am trying to flush() the data through? I am just trying to find the relationship between my test and the behaviour.
  • user207421
    user207421 about 12 years
    @jbx That's exactly when I would expect you to get it. There is an OS buffer all right, and it can fill up all right: that would block your write or your flush, not cause this error. The cause remains what I said above, several times.
  • Brian Vandenberg
    Brian Vandenberg almost 4 years
    Disconnected on the other end is not the only possible reason for broken pipe on a domain socket. I'm in the middle of trying to solve this problem for a C++ program I'm working on. The read side of the domain socket is fine but the write side produces SIGPIPE. I've traced execution on both processes and neither one ever closes the socket.
  • jbx
    jbx over 2 years
    This is a very old post, but still... If you read my question I clearly said it is not a TCP socket, it is a UNIX Domain Socket, so completely local interprocess communication.
  • TheHans255
    TheHans255 over 2 years
    @jbx Apologies, I realize my terminology was wrong. UDP/datagram shouldn't even be part of this conversation, and I should have said "stream socket" instead of TCP. I've been inclined to use the latter since UNIX stream sockets present exactly like TCP sockets to your program once you plug in their address, since you use the same syscalls and listen/connect dynamic on them, and you can run TCP protocols such as HTTP on them. The bulk of my answer still stands, though, since it's taken from the documentation which does not mention TCP - the error is fully mine.
  • user207421
    user207421 over 2 years
    @BrianVandenberg What other cause did you come up with?
  • jbx
    jbx over 2 years
    In my case there was no data getting stuck as such. If I remember correctly (9 years have passed since), the problem was that I was reaching the system's default maximum connection and file limits. I increased settings such as fs.file-max and net.core.somaxconn and the problem went away.
  • Brian Vandenberg
    Brian Vandenberg over 2 years
    @user207421 I'm only ~70% sure what I'm about to say was the same problem. Before O_NONBLOCK there was FNDELAY and O_NDELAY. In Solaris at least the differences are easy to miss. For example, with the former if read() returns 0 that means EOF; for the latter it could just mean zero bytes were read. I know I learned this around the same time I solved that SIGPIPE problem, but if they're the same problem I don't remember how they're related.
  • jbx
    jbx over 2 years
    @BrianVandenberg Wow somehow this thread was revived after 9 years. In my case it was a case of reaching OS limits (that was why it was happening under load conditions). Increasing things like the max file handles and somaxconn solved it.
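The limits mentioned in the last comments can be inspected from inside the JVM before running a load test. The following is a minimal sketch assuming Linux and a HotSpot/OpenJDK runtime on Java 11 or later: `LimitCheck`, `readSetting` and `fileDescriptorUsage` are hypothetical names, the `/proc/sys` paths are the usual locations of `fs.file-max` and `net.core.somaxconn`, and `com.sun.management.UnixOperatingSystemMXBean` is a JDK-specific extension rather than part of the standard API. It only reports values; it does not change them.

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.nio.file.Files;
import java.nio.file.Path;

class LimitCheck {

    /** Reads a single-value kernel setting from /proc, or returns "unavailable" if it cannot be read. */
    static String readSetting(String procPath) {
        Path path = Path.of(procPath);
        try {
            return Files.isReadable(path) ? Files.readString(path).trim() : "unavailable";
        } catch (IOException e) {
            return "unavailable";
        }
    }

    /** Returns {open, max} file descriptor counts for this JVM, or null on a non-Unix JVM. */
    static long[] fileDescriptorUsage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            return new long[] { unix.getOpenFileDescriptorCount(), unix.getMaxFileDescriptorCount() };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] fds = fileDescriptorUsage();
        if (fds != null) {
            System.out.println("open file descriptors: " + fds[0] + " of " + fds[1]);
        }
        // System-wide settings named in the comments above (Linux only).
        System.out.println("fs.file-max        = " + readSetting("/proc/sys/fs/file-max"));
        System.out.println("net.core.somaxconn = " + readSetting("/proc/sys/net/core/somaxconn"));
    }
}
```

Logging the open descriptor count periodically during the load test would show whether connections are being released as fast as they are opened. Note that these are the limits of the process running the check; the listening process (PHP FastCGI in this thread) has its own per-process limits and listen backlog.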