I want to explain what the TIME_WAIT state is, and why it can cause problems with ephemeral ports.
We know that TCP sockets (layer 4 of the OSI model) have different states. There are 12 of them: ESTABLISHED, SYN_SENT, SYN_RECV, FIN_WAIT1, FIN_WAIT2, TIME_WAIT, CLOSE, CLOSE_WAIT, LAST_ACK, LISTEN, CLOSING and UNKNOWN (see man netstat, in the “State” section).
These states correspond to different stages of the connection. Let’s see an example: TCP establishes new connections with a three-way handshake. When host Alice sends a SYN packet to host Bob, the socket on Alice is in the SYN_SENT state. When Bob responds with a SYN-ACK, his socket is in SYN_RECV. Alice then sends an ACK packet to Bob and the connection is finally in the ESTABLISHED state. As we can see, the state of a TCP socket (a TCP connection) changes depending on the stage of the connection.
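We can watch this from code. Here is a small Python sketch (Linux-only, since it parses /proc/net/tcp, where the state code “01” means ESTABLISHED) that connects over loopback and checks that, once connect() returns, the handshake is done:

```python
import socket

# A loopback listener playing the role of Bob.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

# Alice: connect() performs the SYN / SYN-ACK / ACK exchange and
# only returns once the connection is ESTABLISHED.
cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()

def tcp_state(local_port):
    """Look up a socket's state code in /proc/net/tcp (fields are hex)."""
    for line in open("/proc/net/tcp").readlines()[1:]:
        fields = line.split()
        if int(fields[1].split(":")[1], 16) == local_port:
            return fields[3]
    return None

print(tcp_state(cli.getsockname()[1]))  # "01" means ESTABLISHED
```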
If you want to see the connections on your PC, just type:
$ sudo netstat -pan --tcp | less
And what are the ephemeral ports?
The definition of a socket is (SRC_IP, SRC_PORT, DST_IP, DST_PORT). When we want to establish a connection to a remote server, we know the DST_IP (well, we normally know the hostname and resolve the IP with DNS), we know the DST_PORT too (the service we are trying to reach: HTTP is port 80, HTTPS is port 443, etc.) and we know the SRC_IP, because it is our own IP address. But what about the SRC_PORT? The operating system assigns you an arbitrary SRC_PORT when you try to establish a connection. The number of these ports is limited. We can see the range of ephemeral ports by executing:
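We can see the kernel choosing the SRC_PORT for us with a few lines of Python (the server here is just a loopback stand-in for a remote service):

```python
import socket

# A throwaway loopback server standing in for the remote service.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))   # port 0: let the kernel pick one
srv.listen(1)
dst_ip, dst_port = srv.getsockname()

# We never call bind() on the client: during connect() the kernel
# assigns an ephemeral SRC_PORT from ip_local_port_range.
cli = socket.socket()
cli.connect((dst_ip, dst_port))
src_ip, src_port = cli.getsockname()

print((src_ip, src_port, dst_ip, dst_port))  # the full 4-tuple of the socket
```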
$ cat /proc/sys/net/ipv4/ip_local_port_range
32768	61000
So, from 32768 to 61000 (inclusive) we have 28,233 ports. These are the ephemeral ports available in the system. It means that if we try to establish more than 28,233 simultaneous connections, the kernel will run out of ephemeral ports and we will not be able to get a SRC_PORT to create a socket.
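The arithmetic above can be reproduced straight from the proc file (your range may differ from mine):

```python
# Count the ephemeral ports available on this machine.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())

available = high - low + 1   # inclusive range, e.g. 61000 - 32768 + 1 = 28233
print(low, high, available)
```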
Well, normally you don’t have to care about running out of ephemeral ports, because having 28,233 connections ESTABLISHED at the same time is insane. It’s not your everyday usage.
When we close a connection in the ESTABLISHED state, we assume that the socket is destroyed and the ephemeral port we were using is released, so it can be used again for another connection, right?
The answer is: nope.
When you close a connection that is in the ESTABLISHED state, it enters the TIME_WAIT state. The definition of this state is “The socket is waiting after close to handle packets still in the network”. It exists to prevent problems with delayed packets on the network, and the TCP specification says the socket must stay in it for 2*MSL seconds (MSL = Maximum Segment Lifetime). In practice, Linux uses a fixed TIME_WAIT length of 60 seconds, so we have to wait a full minute for a socket to be released after calling close() in our code. That’s a lot of time.
What does “waiting after close to handle packets still in the network” mean? It is possible that some packets were delayed for whatever reason, or are still sitting in the buffer of a very busy router/switch. If we destroyed the connection right away and the Linux kernel reused the port, we could receive packets belonging to the old connection, and that is something you don’t want to happen.
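You can actually watch a socket land in TIME_WAIT with a short Python sketch (again Linux-only: it parses /proc/net/tcp, where the state code “06” means TIME_WAIT). The side that closes first, the client here, is the one that lingers:

```python
import socket
import time

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()
port = cli.getsockname()[1]

cli.close()      # active close: the client sends the first FIN...
conn.close()     # ...and the server answers with its own FIN
srv.close()
time.sleep(0.5)  # let the FIN/ACK exchange finish on loopback

def in_time_wait(local_port):
    """True if /proc/net/tcp shows this port in state 06 (TIME_WAIT)."""
    for line in open("/proc/net/tcp").readlines()[1:]:
        fields = line.split()
        if int(fields[1].split(":")[1], 16) == local_port and fields[3] == "06":
            return True
    return False

print(in_time_wait(port))
```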
Ok. But is it really necessary to wait that long before destroying the socket? Can I reduce the time? Well, in theory yes, but if you are on a Linux kernel (I have 3.2.0-3 right now) you can’t. The question is: WHY?! Well, it’s something funny, guys: the TIME_WAIT length is hardcoded in the Linux kernel. Yep. Hardcoded. If you download the kernel sources and take a look at the file “include/net/tcp.h”, around line 105 you will see:
#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT state, about 60 seconds */
Imagine an application where, for every search request on my front page, I make 50 requests to a NoSQL database (let’s say MongoDB, for example) to get some denormalized information: search results or something like that. All the NoSQL requests are served in less than 300 ms, but we open a new connection for every request (yep, it doesn’t sound like a good idea, and it isn’t, but let’s see what happens). Ok: if one user on my front page makes 1 search, the web application will make 50 requests (remember: 50 new TCP connections) to MongoDB, and I will end up with 50 TCP connections in the TIME_WAIT state for about a minute.
Ok, now let’s say people are making 20 searches/second for 60 seconds. That means 20 searches/second * 50 connections opened per search * 60 seconds = 60,000 connections in TIME_WAIT. Ooooooops! This is not even possible, because we only have ~28K ephemeral ports: the requests would start failing as soon as all the ephemeral ports were stuck in TIME_WAIT.
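A quick way to convince yourself that every short-lived connection burns its own ephemeral port, even after close(), is to open a bunch of them against a loopback server and collect the source ports (a sketch, with the numbers scaled way down from the 60,000 above):

```python
import socket

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(128)
addr = srv.getsockname()

used_ports = set()
server_side = []                # keep the server ends alive
for _ in range(100):            # 100 connections = 2 searches x 50 queries
    c = socket.socket()
    c.connect(addr)
    server_side.append(srv.accept()[0])
    used_ports.add(c.getsockname()[1])
    c.close()                   # closed, but the port is NOT free yet

print(len(used_ports))          # 100: a fresh ephemeral port every time
```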
The first solution is to increase the number of ephemeral ports available. We have seen that by default we only have ~28K ports available, but we can increase this number by doing:
$ sudo su -
# echo "1025 61000" > /proc/sys/net/ipv4/ip_local_port_range
Now we have ~60K ephemeral ports. But as you have noticed, this is not a real solution for the problem we have created: at 20 searches/second we burn 1,000 ephemeral ports per second, so we would run out of the 60K ports in about a minute, before TIME_WAIT releases any of them.
The other solution is to use tcp_tw_recycle and tcp_tw_reuse. Both options are dangerous and violate RFC 1122, and they are not recommended in a production environment. The information I’ve found about them is:
TCP_TW_RECYCLE: enables fast recycling of TIME_WAIT sockets. The default value is 0 (disabled); the sysctl documentation incorrectly states the default as enabled. It can be changed to 1 (enabled) in many cases, but it is known to cause issues with hoststated (load balancing and failover) when enabled, so it should be used with caution.
Activated by doing:
$ echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
TCP_TW_REUSE: allows reusing sockets in the TIME_WAIT state for new connections when it is safe from the protocol viewpoint. The default value is 0 (disabled). It is generally a safer alternative to tcp_tw_recycle.
Note: the tcp_tw_reuse setting is particularly useful in environments where numerous short connections are opened and left in the TIME_WAIT state, such as web servers. Reusing the sockets can be very effective in reducing server load.
Activated by doing:
$ echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
And the definitive solution comes as a question: WHAT THE F*KING HELL ARE YOU DOING, CREATING 50 NEW CONNECTIONS ON EVERY REQUEST?! If you came here because of a TIME_WAIT problem on a production server under heavy load, I have to tell you something you will not like: you are doing something wrong! As you have seen, it’s not enough to close your opened connections (you are closing the connections you use, right?); when you have a lot of traffic you must REUSE them. Look in the documentation of the libraries you use to access your data for connection pooling and all that stuff. Reusing the connections is the key. If the browser uses the HTTP Keep-Alive flag, why shouldn’t you use something like that for the connections to your data storage?
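To make the idea concrete, here is a deliberately tiny, hypothetical connection pool in Python. Real drivers (PyMongo, for example, with its maxPoolSize option) ship their own pooling, so this only illustrates the acquire/release pattern that replaces open/close:

```python
import queue
import socket

class ConnectionPool:
    """Toy pool: open N connections once, then reuse them forever."""

    def __init__(self, addr, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(socket.create_connection(addr))

    def acquire(self):
        return self._pool.get()    # blocks until a connection is free

    def release(self, conn):
        self._pool.put(conn)       # hand it back; do NOT close() it

# Demo against a loopback stand-in for the database.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(5)

pool = ConnectionPool(srv.getsockname(), size=1)

c1 = pool.acquire()
port_first = c1.getsockname()[1]
pool.release(c1)               # instead of c1.close()

c2 = pool.acquire()            # the same socket comes back out
port_second = c2.getsockname()[1]
```

With a pool like this, the 20 searches/second above would keep cycling through the same handful of ephemeral ports instead of burning 1,000 new ones every second.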
And that’s it, guys. If you have any doubts about this, please leave a comment and I will try to answer.
(RFC 1122 is the one that says: “When a connection is closed actively, it MUST linger in TIME-WAIT state for a time 2xMSL (Maximum Segment Lifetime).”)