Close open sockets from crashed program


The Unix kernel closes all of a process's sockets when the process terminates; a crashed program is no different from a program that exits normally without calling close()/shutdown().

Your problem may have to do with the TIME_WAIT state of the TCP state machine and should be solved with the SO_REUSEADDR option. One way to confirm this is to wait about 5 minutes after a crash before starting the program again. If enough sockets are available by then, you should study the TIME_WAIT logic and work around it. If waiting does not solve your problem, there is probably a different issue in your program that needs to be identified.
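Instead of only waiting, you can also count how many sockets are sitting in TIME_WAIT right after a crash. The following is a minimal sketch that assumes a Linux system, where /proc/net/tcp lists IPv4 TCP sockets and the state column uses the code 06 for TIME_WAIT. The helper names count_states and count_time_wait are made up for this illustration.

```python
from collections import Counter

# State code for TIME_WAIT in Linux's /proc/net/tcp (assumption: Linux only).
TIME_WAIT = "06"


def count_states(proc_net_tcp_text):
    """Count sockets per TCP state code in text formatted like /proc/net/tcp."""
    counts = Counter()
    # The first line is a column header; each following line is one socket.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        # Columns: sl, local_address, rem_address, st, ...
        if len(fields) > 3:
            counts[fields[3]] += 1
    return counts


def count_time_wait(path="/proc/net/tcp"):
    """Return the number of IPv4 TCP sockets currently in TIME_WAIT."""
    with open(path) as f:
        return count_states(f.read())[TIME_WAIT]
```

If count_time_wait() returns a large number right after a crash and falls back toward zero a few minutes later, TIME_WAIT is the likely cause. IPv6 sockets are listed separately in /proc/net/tcp6.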

Here is a good read on the subject: "TIME_WAIT and its design implications for protocols and scalable client server systems".

Two quick extracts from it, for reference:

TIME_WAIT is often also known as the 2MSL wait state. This is because the socket that transitions to TIME_WAIT stays there for a period that is 2 x Maximum Segment Lifetime in duration. The MSL is the maximum amount of time that any segment, for all intents and purposes a datagram that forms part of the TCP protocol, can remain valid on the network before being discarded. This time limit is ultimately bounded by the TTL field in the IP datagram that is used to transmit the TCP segment. Different implementations select different values for MSL and common values are 30 seconds, 1 minute or 2 minutes. RFC 793 specifies MSL as 2 minutes and Windows systems default to this value but can be tuned using the TcpTimedWaitDelay registry setting.

(PS: hence the 4+1 minute wait in the test I suggested above: with a 2-minute MSL, 2 x MSL is 4 minutes, plus 1 minute of margin.)

Changing the 2MSL delay is usually a machine wide configuration change. You can instead attempt to work around TIME_WAIT at the socket level with the SO_REUSEADDR socket option. This allows a socket to be created whilst an existing socket with the same address and port already exists. The new socket essentially hijacks the old socket. You can use SO_REUSEADDR to allow sockets to be created whilst a socket with the same port is already in TIME_WAIT but this can also cause problems such as denial of service attacks or data theft.
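As a rough illustration of the socket-level workaround described in this extract, here is a minimal Python sketch that sets SO_REUSEADDR before bind(). The function name make_listener and its default host, port and backlog values are assumptions made for this example. The option only has an effect if it is set before bind() is called.

```python
import socket


def make_listener(host="127.0.0.1", port=0, backlog=128):
    """Create a listening TCP socket with SO_REUSEADDR enabled.

    port=0 asks the kernel to pick a free port; pass a fixed port
    for a real server.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Must be set before bind(); lets bind() succeed even if an old
        # socket on the same address and port is still in TIME_WAIT.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))
        sock.listen(backlog)
    except OSError:
        sock.close()
        raise
    return sock
```

A restarted server would call make_listener("0.0.0.0", 8080) (the port number here is only an example) and then accept() on the returned socket as usual.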

The article describes one more way, but it comes with caveats of its own.

There's another way to terminate a TCP connection and that's by aborting the connection and sending an RST rather than a FIN. This is usually achieved by setting the SO_LINGER socket option to 0. This causes pending data to be discarded and the connection to be aborted with an RST rather than for the pending data to be transmitted and the connection closed cleanly with a FIN. It's important to realise that when a connection is aborted any data that might be in flow between the peers is discarded and the RST is delivered straight away; usually as an error which represents the fact that the "connection has been reset by the peer". The remote peer knows that the connection was aborted and neither peer enters TIME_WAIT.
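The abortive close described in this extract could look roughly like the sketch below. It assumes a platform where struct linger is two C ints (l_onoff, l_linger), as on Linux and most Unix systems; the helper names set_abortive_close and abort_connection are made up for this illustration. Keep in mind the caveat from the extract: any unsent data is discarded.

```python
import socket
import struct


def set_abortive_close(sock):
    """Enable SO_LINGER with a zero timeout on sock.

    Assumes struct linger is laid out as two C ints:
    l_onoff=1 (linger enabled), l_linger=0 (zero seconds).
    """
    linger = struct.pack("ii", 1, 0)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)


def abort_connection(sock):
    """Close sock abortively: a connected socket sends RST instead of FIN."""
    set_abortive_close(sock)
    sock.close()
```

Calling abort_connection(conn) on a connected socket drops the connection right away, and the remote peer typically sees a "connection reset by peer" error on its next read or write.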

Before using either of these schemes, it is a good idea to understand how the TCP state machine behaves, so that you do not inadvertently introduce other problems that will need debugging later. So at least read that article completely :-)


Author: Mahoni

Updated on September 18, 2022

Comments

  • Mahoni, over 1 year

    I am opening thousands of sockets, and sometimes the program crashes, leaving me with far fewer available sockets. Is there a way to clean up those hanging sockets?