Gone in 3600 seconds: story about TCP Keep-Alives
One of the things I’ve been working on recently was monitoring dropped sessions on an internal firewall. This firewall (along with others) is positioned between an application server and a database server. The firewall only allows connections from ephemeral ports on the application server to port 1521 on the database server (that’s Oracle SQLNET). The following figure shows the setup:
The dropped packets log contained something interesting. From time to time, the firewall dropped some packets coming from the database servers, as shown below:
Dropped packets:
Source IP:Port Destination IP:Port
============================================
10.0.1.24:1521 10.1.1.15:11925
10.0.1.24:1521 10.1.1.15:31578
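Entries like these are easy to pick out automatically. The following is a minimal sketch, assuming the log lines have exactly the two-column `IP:port IP:port` shape shown above. The function name `server_side_drops` and the regular expression are illustrative and not part of any firewall tooling.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   "<src_ip>:<src_port> <dst_ip>:<dst_port>"
LINE_RE = re.compile(r"^(\d+\.\d+\.\d+\.\d+):(\d+)\s+(\d+\.\d+\.\d+\.\d+):(\d+)$")

SERVICE_PORT = 1521  # SQLNET listener port on the database server


def server_side_drops(lines, service_port=SERVICE_PORT):
    """Return (server_ip, client_ip, client_port) for drops sourced from the service port."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip headers, separators and blank lines
        src_ip, src_port, dst_ip, dst_port = match.groups()
        if int(src_port) == service_port:
            hits.append((src_ip, dst_ip, int(dst_port)))
    return hits


log = """\
Source IP:Port Destination IP:Port
============================================
10.0.1.24:1521 10.1.1.15:11925
10.0.1.24:1521 10.1.1.15:31578
""".splitlines()

for server, client, port in server_side_drops(log):
    print(f"{server}:{SERVICE_PORT} -> {client}:{port} dropped")
```

A packet whose source port is the service port is traffic flowing from the server back to a client, so a drop here suggests the firewall no longer had state for that session.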
This was pretty strange, as the database server should never open new connections, so I did some further research. I set up sniffers on both sides of the firewall and analyzed the captured packets. That allowed me to reconstruct what happened. The example below follows a session that starts at 10:00 AM:
- 10:00 – The application server connects to the database server on port 1521 (SQLNET) from an ephemeral port, 31578. After a normal TCP three-way handshake, the application server starts sending queries.
- 11:00 – The application server sends the last query to the database server which replies with results. The application server sends an empty ACK TCP packet acknowledging that it received this packet.
- 12:00 – One hour after the last packet was seen in the TCP session, the internal firewall’s idle timeout causes it to delete the session from its stateful connection table. From this point on, any packet claiming to belong to this session will be dropped.
- 13:00 – The database server sends an ACK packet to the application server. This is the TCP keep-alive mechanism described in RFC 1122: by default, after a session has been idle for 2 hours, the OS on the database server sends an ACK packet to check whether the remote side is still up. The firewall no longer has state for this session, so it drops the probe. Receiving no answer, the database server retries with further ACK packets (the retry schedule depends on the OS) and finally drops the session.
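The timeline above comes down to comparing two timers. Here is a small sketch of that arithmetic, assuming the 3600-second firewall idle timeout and the 2-hour keep-alive default described in this post. The function name `keepalive_outcome` and the calendar date are made up for illustration.

```python
from datetime import datetime, timedelta

FIREWALL_IDLE_TIMEOUT = timedelta(seconds=3600)  # firewall state timeout in this story
KEEPALIVE_IDLE = timedelta(seconds=7200)         # 2-hour keep-alive idle default


def keepalive_outcome(last_packet, fw_timeout=FIREWALL_IDLE_TIMEOUT, ka_idle=KEEPALIVE_IDLE):
    """Return (state_expiry, first_probe, dropped) for a session idle since last_packet."""
    state_expiry = last_packet + fw_timeout
    first_probe = last_packet + ka_idle
    # The probe is dropped if it is sent at or after the moment the firewall
    # deletes the session from its connection table.
    dropped = first_probe >= state_expiry
    return state_expiry, first_probe, dropped


last = datetime(2024, 1, 1, 11, 0)  # arbitrary date; only the 11:00 time matters
expiry, probe, dropped = keepalive_outcome(last)
print(expiry.strftime("%H:%M"), probe.strftime("%H:%M"), dropped)
```

With these values the state expires at 12:00 and the first probe goes out at 13:00, so the probe is dropped. Any firewall timeout longer than the keep-alive idle time flips the result.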
So the problem was caused by the application server not properly closing a session it no longer uses, and not using TCP keep-alives. Interestingly, the application server used the session for exactly 1 hour.
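For reference, this is roughly what enabling keep-alives looks like on a client socket. It is a generic sketch, not how the vendor's application or the Oracle client is configured. The helper name `enable_keepalive` and the timer values (30 minutes idle, chosen only because that is below the firewall's one-hour timeout) are assumptions. The per-socket timer options are platform-specific, so the code applies them only where Python exposes them.

```python
import socket


def enable_keepalive(sock, idle=1800, interval=60, count=5):
    """Turn on TCP keep-alives; tune the timers where the platform exposes them."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT exist on Linux and some other
    # platforms. Where they are missing or rejected, the OS-wide defaults apply.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            pass  # option known to Python but not accepted by this OS


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print("keep-alive enabled:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

With an idle time shorter than the firewall timeout, the probes themselves keep refreshing the firewall's state entry for as long as both ends are alive.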
To fix this properly, we would have to work with the application server’s vendor to find out why it stops using connections without closing them. An easier fix was to increase the timeout for the stateful connection table on the firewall to 9000 seconds (2.5 hours). Of course, this came only after carefully examining the impact on the firewall, since it will use more memory to keep such idle sessions in its table for longer. With the new timeout, the ACK packets (TCP keep-alives) sent from the database server passed through the firewall, and the application server correctly replied to them.
Why all this, you might ask? This is one example of why we should spend time cleaning up our local networks as well. During this exercise we found heaps of incorrectly configured servers and applications that people had lived with for ages without knowing what was going on at the (low) network layer.
--
Bojan