A friend of mine reported a FreeBSD oddity to me - TCP connections from FreeBSD to a certain (unknown) host hung – but only when coming from a 64bit machine. All appeared to be fine when connecting from a 32bit FreeBSD host.
A few tcpdumps later we found out that the receiving host was misbehaving – It accepted the TCP connection with a receive window of
0 (while unusual, this seems to be a common syncookie implementation). Practically this means the sender can never send data unless the receiver sends a window update.
tcpdump up to this point looks like this:
10.42.0.25.44852 > 10.42.0.2.1516: Flags [S], seq 3339144437, win 65535, options [...], length 0 10.42.0.2.1516 > 10.42.0.25.44852: Flags [S.], seq 42, ack 3339144438, win 0, length 0 10.42.0.25.44852 > 10.42.0.2.1516: Flags [.], seq 1, ack 1, win 65535, length 0
As this window update packet might get lost, TCP resorts to sending so called
window probe packets containing one byte of data to see if the receiver might have opened up his receive window.
[Note: this requires at least one byte of data in the send buffer, otherwise the connection would just be idle.]
In case the receive window is open, the ACK packet will contain a new window size, and the connection can proceed as usual.
10.42.0.25.44852 > 10.42.0.2.1516: Flags [.], seq 1:2, ack 1, win 65535, length 1 10.42.0.2.1516 > 10.42.0.25.44852: Flags [.], seq 1, ack 2, win 70, length 0
this is theoretically done in tcp_timer.c:407 that runs whenever the
persist timer runs out.
Unfortunately, this doesn't happen on FreeBSD 64bit. Instead, the connection just sits there and keeps on idling.
At this point I'd like to say thanks to VirtualBox here. Debugging all this was really helped by the fact that I could easily create both a 32bit and a 64bit virtual machine and quickly recompile testing kernels.
As I had virtually no understanding of the TCP code, I liberally sprinkled it with printf()s to narrow down where the persist timer was (not) enabled, and then went backwards through the function to find the diverging point.
The culprit was found in tcp_output.c:564 where adv is calculated as:
long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) - (tp->rcv_adv - tp->rcv_nxt);
Filling in the numbers from my debugging sessions, this is:
min(65536, 65535) - (65579 - 43) = 65535 - 65536 = (-1)
Unfortunately it's only (-1) on i386. Due to differing sizes of the variables involved, this turns out to be 4294967295 (
=0xffffffff) on amd64.
This lead to a FreeBSD PR: kern/154006 which proposes to change this code to:
long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) - (long) (tp->rcv_adv - tp->rcv_nxt);
This code yields (-1) on both architectures.
The only reply this PR got, was from Bruce Evans who critiqued my use of a simple
(long) cast, which appears to have derailed this PR, sticking it in the usual
never getting fixed limbo where unfortunately most of my PR's appear to end up.
There is actually a precedent for my
simple cast. Check
if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt)) recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
But I digress. Sorry.
Looking at the code in question including the comment directly above:
/* * Compare available window to amount of window * known to peer (as advertised window less * next expected input). If the difference is at least two * max size segments, or at least 50% of the maximum possible * window, then want to send a window update to peer. * Skip this if the connection is in T/TCP half-open state. * Don't send pure window updates when the peer has closed * the connection and won't ever send more data. */
It appears that adv never was intended to be negative, so I was wondering if the code was even more broken than it appeared at first glance.
I checked the revision history, but this part is essentially unchanged since rev 1.1 so I assumed it was all fine and stopped looking, and left it at that point for a while – secretly hoping that I'd get feedback on my PR, maybe even a comment from someone with more knowledge of the TCP code.
This week at work, I stumbled across our full set of TCP/IP Illustrated. Volume 2 contains the 4.4BSD-Lite network stack source with lengthy explanations. I checked, and voila on Page 860 (Section 26.3) we can find nearly exact same lines. (recwin is called win, but it's the same value)
[Note: if you want to follow along, there appear to be several pdf versions on the net. E.g. here]
I've since read most of Chapter 26
TCP Output, and was reassured that adv is never meant to be negative. It's a measure of how many bytes more we could receive, even after the other side sent as much data as we already told them they can send. (I.e. the maximum receive window the other side knows about).
min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) is the technical limit of what we can receive – recwin is the size of the receive buffer,
(TCP_MAXWIN << tp->rcv_scale) is the maximum receive window we can announce (given that the scaling option can not be changed during connection lifetime).
tp->rcv_nxt is the sequence number of the maximum received data,
tp->rcv_adv the sequence of the maximum announced receive window. If (as in our example case) we never received data,
tp->rcv_adv - tp->rcv_nxt should be exactly the size of the receive window we told the opposite host.
Well. At least according to Stevens.
Our debugging clearly shows it to be 65536 while the tcpdump shows a window of 65535. So there is a discrepancy there. Checking the sequence numbers in the tcpdump output, we see that we got
seq=42 from the remote host, so
rcv_nxt=43 is what we expect here. But rcv_adv is too big.
tcp_output.c, we see that
rcv_adv is set at line 1314
tp->rcv_adv = tp->rcv_nxt + recwin;
Adding a debugging statement here confirms that recwin at this point is correctly = 65535.
So rcv_adv must be increased somewhere else, and probably without checking against the
TCP_MAXWIN << tp->rcv_scale) and thus putting the TCP stack in a state not anticipated by the 4.4BSD code.
Searching for other places where rcv_adv is modified, we find the
tcp_rcvseqinit(tp) macro defined in
#define tcp_rcvseqinit(tp) \ (tp)->rcv_adv = (tp)->rcv_nxt = (tp)->irs + 1
and used in
tcp_input.c. As this sets
rcv_nxt to the same value, this can't be our problem.
tcp_input.c however, we find tcp_input.c:1759
tp->rcv_adv += tp->rcv_wnd;
[For the record, there is another such line at tcp_syncache.c:789 which we're currently not interested in.]
Luckily there is a Volume 3 of TCP/IP Illustrated which (among other things) contains a discussion of the T/TCP changes. There we can find the same code in in Section 11.4 (page 132 code line 607). Stevens goes on to note:
Set rcv_adv 607
rcv_advis defined as the highest advertised sequence number plus one (Figure 24.18, p 809 of Volume 2). But the
tcp_rcvseqinitmacro in Figure 11.4 initialized it to the received sequence number plus one. At this point in the processing,
rcv_wndwill be the size of the socket's receive buffer, from p. 941 of Volume 2. Therefore, by adding
rcv_adv, the latter points just beyond the current receive window.
rcv_advmust be initialized here because its value is used in the silly window avoidance in
tcp_output(p. 879 of Volume 2).
rcv_advis set at the end of
tcp_output, normally when the first segment is sent (which in this case would be the server SYN/ACK in response to the client's SYN). But with T/TCP
rcv_advneeds to be set the first time through
tcp_output, since we may be sending data with the first segment that we send.
Checking this code-path with a quick printf()-test shows that rcv_wnd at this point is 65536 which corresponds to the comment by Stevens. And this is the reason why adv can go negative at all.
In my opinion, the code is wrong. It should be using the same
min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -style logic to limit the window to what it can really advertise.
That would make the correct code something like this:
tp->rcv_adv += min(tp->rcv_wnd, (long)TCP_MAXWIN << tp->rcv_scale);
Any network gurus around? I'd be interested in your thoughts on this…
Update2: Interestingly, NetBSD (and thus OpenBSD) never picked up the T/TCP changes, so they aren't actually affected by this particular bug.
Update3: The fix has been committed as r220156 to FreeBSD-current, and was merged to FreeBSD-stable soon after. Yay!
P.S.: Regardless of this analysis, I still think the PR should be applied, just on a robustness principle (Who knows when else adv becomes negative)...
P.P.S.: If anyone wants to play along, I've created a small scapy
/ python script to trigger that problem. Before running, you need to make sure your host doesn't interfere with traffic on that port (defaults to port
1516). On FreeBSD a simple
ipfw add deny ip from any to me 1516
should be enough before you start tcpbug.py.