What can cause a spontaneous EPIPE error without either end calling close() or crashing?

unix sockets ipc posix

16,864

Solution 1

Perhaps you could try strace as described in: http://modperlbook.org/html/6-9-1-Detecting-Aborted-Connections.html

I assume that your problem is related to the one described here: http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable

Unfortunately, I'm having a similar problem myself and couldn't fix it with the advice given there. However, perhaps the SO_LINGER approach works for you.

Solution 2

- shutdown() may have been called on one of the socket endpoints.
- If either side may fork and execute a child process, ensure that the FD_CLOEXEC (close-on-exec) flag is set on the socket file descriptor if you did not intend for it to be inherited by the child. Otherwise the child process could (accidentally or otherwise) be manipulating your socket connection.
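
Both points can be checked with a small C sketch. It is only an illustration under assumptions: the helper names `set_cloexec` and `errno_after_shutdown` are made up for this example and come from neither the question nor any library. The first helper sets FD_CLOEXEC with fcntl(). The second shuts down the write side of one end of a socketpair and then writes to that same end, which should fail with EPIPE while the peer reads end-of-file, even though nobody called close(). Since shutdown() acts on the connection rather than on a single descriptor, a child that inherited the socket could trigger the same effect.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: mark fd close-on-exec so that exec()ed children
 * do not inherit it. Returns 0 on success, -1 on error (errno is set). */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1)
        return -1;
    return 0;
}

/* Hypothetical demo: shut down the write side of one end of a socketpair,
 * then write to that same end. Returns the errno reported by write(),
 * 0 if the write unexpectedly succeeded, or -1 if setup failed.
 * *peer_read receives the peer's read() result (0 means end-of-file). */
int errno_after_shutdown(ssize_t *peer_read)
{
    int sv[2];
    char buf[8];
    int err = 0;

    assert(peer_read != NULL);

    /* Without this, the failing write() would raise SIGPIPE, whose
     * default action terminates the process. */
    signal(SIGPIPE, SIG_IGN);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    if (set_cloexec(sv[0]) == -1 || set_cloexec(sv[1]) == -1 ||
        shutdown(sv[0], SHUT_WR) == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* No close() has happened, yet this write is expected to fail. */
    if (write(sv[0], "x", 1) == -1)
        err = errno;

    /* The peer sees end-of-file because the other side stopped sending. */
    *peer_read = read(sv[1], buf, sizeof buf);

    close(sv[0]);
    close(sv[1]);
    return err;
}
```

Calling `errno_after_shutdown(&n)` should return EPIPE and leave `n` at 0, which matches the symptoms described in the question. To find out whether something similar happens in a real deployment, look for shutdown() calls in an strace of every process that could hold the descriptor.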


Author by

Hongli


Updated on June 19, 2022

Comments

Hongli almost 2 years
I have an application that consists of two processes (let's call them A and B), connected to each other through Unix domain sockets. Most of the time it works fine, but some users report the following behavior:
1. A sends a request to B. This works. A now starts reading the reply from B.
2. B sends a reply to A. The corresponding write() call returns an EPIPE error, and as a result B calls close() on the socket. However, A did not close() the socket, nor did it crash.
3. A's read() call returns 0, indicating end-of-file. A thinks that B prematurely closed the connection.
Users have also reported variations of this behavior, e.g.:
1. A sends a request to B. This works partially, but before the entire request is sent, A's write() call returns EPIPE, and as a result A calls close() on the socket. However, B did not close() the socket, nor did it crash.
2. B reads a partial request and then suddenly gets an EOF.
The problem is I cannot reproduce this behavior locally at all. I've tried OS X and Linux. The users are on a variety of systems, mostly OS X and Linux.

Things that I've already tried and considered:
- Double close() bugs (close() is called twice on the same file descriptor): probably not as that would result in EBADF errors, but I haven't seen them.
- Increasing the maximum file descriptor limit. One user reported that this worked for him; the rest reported that it did not.
What else can possibly cause behavior like this? I know for certain that neither A nor B calls close() on the socket prematurely, and I know for certain that neither of them has crashed, because both A and B were able to report the error. It is as if the kernel suddenly decided to pull the plug on the socket for some reason.
Hongli about 14 years

Thanks, but neither situation is applicable to my program.
ephemient about 14 years

... on a UNIX domain socket? That's a local-only protocol.
Nikolai Fetissov about 14 years

Oh ... shoot, I totally missed that. Thanks.
user206268 about 14 years

It turned out that the server's file descriptor had been added to the epoll queue with the EPOLLET flag, which seems to be wrong.
Hongli almost 13 years

Not exactly the answer I was looking for, but the TCP page you linked to is very informative! It's down now, but Archive.org still has it: ia700609.us.archive.org/22/items/…