Sec, blogmal! - tidbits - tcp-bug

Categories:

Everything

Oktober '14

MoDiMiDoFrSaSo
293012345
6789101112
13141516171819
20212223242526
272829303112

Archive:

Flattr me:

Flattr this

Thu, 24 Mar 2011

The tale of a TCP bug

The following post is a bit longish, and details my foray into the BSD TCP/IP stack debugging and finding what I think is a 15-year old bug.

Part 1: Strange behavior

A friend of mine reported a FreeBSD oddity to me - TCP connections from FreeBSD to a certain (unknown) host hung – but only when coming from a 64bit machine. All appeared to be fine when connecting from a 32bit FreeBSD host.

A few tcpdumps later we found out that the receiving host was misbehaving – It accepted the TCP connection with a receive window of 0 (while unusual, this seems to be a common syncookie implementation). Practically this means the sender can never send data unless the receiver sends a window update.

tcpdump up to this point looks like this:

10.42.0.25.44852 > 10.42.0.2.1516: Flags [S], 
   seq 3339144437, win 65535, options [...], length 0
10.42.0.2.1516 > 10.42.0.25.44852: Flags [S.], 
   seq 42, ack 3339144438, win 0, length 0
10.42.0.25.44852 > 10.42.0.2.1516: Flags [.], 
   seq 1, ack 1, win 65535, length 0

As this window update packet might get lost, TCP resorts to sending so called window probe packets containing one byte of data to see if the receiver might have opened up his receive window.

[Note: this requires at least one byte of data in the send buffer, otherwise the connection would just be idle.]

In case the receive window is open, the ACK packet will contain a new window size, and the connection can proceed as usual.

10.42.0.25.44852 > 10.42.0.2.1516: Flags [.], 
   seq 1:2, ack 1, win 65535, length 1
10.42.0.2.1516 > 10.42.0.25.44852: Flags [.],
   seq 1, ack 2, win 70, length 0

this is theoretically done in tcp_timer.c:407 that runs whenever the persist timer runs out.

Unfortunately, this doesn't happen on FreeBSD 64bit. Instead, the connection just sits there and keeps on idling.

Digging deeper

At this point I'd like to say thanks to VirtualBox here. Debugging all this was really helped by the fact that I could easily create both a 32bit and a 64bit virtual machine and quickly recompile testing kernels.

As I had virtually no understanding of the TCP code, I liberally sprinkled it with printf()s to narrow down where the persist timer was (not) enabled, and then went backwards through the function to find the diverging point.

The culprit was found in tcp_output.c:564 where adv is calculated as:

long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
	(tp->rcv_adv - tp->rcv_nxt);

Filling in the numbers from my debugging sessions, this is:

min(65536, 65535) - (65579 - 43) = 65535 - 65536 = (-1)

Unfortunately it's only (-1) on i386. Due to differing sizes of the variables involved, this turns out to be 4294967295 (=0xffffffff) on amd64.

This lead to a FreeBSD PR: kern/154006 which proposes to change this code to:

long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
	(long) (tp->rcv_adv - tp->rcv_nxt);

This code yields (-1) on both architectures.

The only reply this PR got, was from Bruce Evans who critiqued my use of a simple (long) cast, which appears to have derailed this PR, sticking it in the usual never getting fixed limbo where unfortunately most of my PR's appear to end up.

There is actually a precedent for my simple cast. Check
tcp_output.c:1003

if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
	recwin = (long)(tp->rcv_adv - tp->rcv_nxt);

But I digress. Sorry.

How broken is it?

Looking at the code in question including the comment directly above:

/*
 * Compare available window to amount of window
 * known to peer (as advertised window less
 * next expected input).  If the difference is at least two
 * max size segments, or at least 50% of the maximum possible
 * window, then want to send a window update to peer.
 * Skip this if the connection is in T/TCP half-open state.
 * Don't send pure window updates when the peer has closed
 * the connection and won't ever send more data.
 */

It appears that adv never was intended to be negative, so I was wondering if the code was even more broken than it appeared at first glance.

I checked the revision history, but this part is essentially unchanged since rev 1.1 so I assumed it was all fine and stopped looking, and left it at that point for a while – secretly hoping that I'd get feedback on my PR, maybe even a comment from someone with more knowledge of the TCP code.

Stevens to the rescue

Volume 2

This week at work, I stumbled across our full set of TCP/IP Illustrated. Volume 2 contains the 4.4BSD-Lite network stack source with lengthy explanations. I checked, and voila on Page 860 (Section 26.3) we can find nearly exact same lines. (recwin is called win, but it's the same value)

[Note: if you want to follow along, there appear to be several pdf versions on the net. E.g. here]

I've since read most of Chapter 26 TCP Output, and was reassured that adv is never meant to be negative. It's a measure of how many bytes more we could receive, even after the other side sent as much data as we already told them they can send. (I.e. the maximum receive window the other side knows about).

min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) is the technical limit of what we can receive – recwin is the size of the receive buffer, (TCP_MAXWIN << tp->rcv_scale) is the maximum receive window we can announce (given that the scaling option can not be changed during connection lifetime).

tp->rcv_nxt is the sequence number of the maximum received data, tp->rcv_adv the sequence of the maximum announced receive window. If (as in our example case) we never received data, tp->rcv_adv - tp->rcv_nxt should be exactly the size of the receive window we told the opposite host.

Well. At least according to Stevens.

Our debugging clearly shows it to be 65536 while the tcpdump shows a window of 65535. So there is a discrepancy there. Checking the sequence numbers in the tcpdump output, we see that we got seq=42 from the remote host, so rcv_nxt=43 is what we expect here. But rcv_adv is too big.

Digging through tcp_output.c, we see that rcv_adv is set at line 1314

tp->rcv_adv = tp->rcv_nxt + recwin;

Adding a debugging statement here confirms that recwin at this point is correctly = 65535.

Mysterious increase

So rcv_adv must be increased somewhere else, and probably without checking against the technical limit TCP_MAXWIN << tp->rcv_scale) and thus putting the TCP stack in a state not anticipated by the 4.4BSD code.

Searching for other places where rcv_adv is modified, we find the tcp_rcvseqinit(tp) macro defined in tcp_seq.h as:

#define tcp_rcvseqinit(tp) \
        (tp)->rcv_adv = (tp)->rcv_nxt = (tp)->irs + 1

and used in tcp_input.c. As this sets rcv_adv and rcv_nxt to the same value, this can't be our problem.

Looking through tcp_input.c however, we find tcp_input.c:1759

tp->rcv_adv += tp->rcv_wnd;

[For the record, there is another such line at tcp_syncache.c:789 which we're currently not interested in.]

Volume 3

This part was introduced in r1.11 in 1995 which marked the inclusion of T/TCP. in FreeBSD.

Luckily there is a Volume 3 of TCP/IP Illustrated which (among other things) contains a discussion of the T/TCP changes. There we can find the same code in in Section 11.4 (page 132 code line 607). Stevens goes on to note:

Set rcv_adv
607
rcv_adv is defined as the highest advertised sequence number plus one (Figure 24.18, p 809 of Volume 2). But the tcp_rcvseqinit macro in Figure 11.4 initialized it to the received sequence number plus one. At this point in the processing, rcv_wnd will be the size of the socket's receive buffer, from p. 941 of Volume 2. Therefore, by adding rcv_wnd to rcv_adv, the latter points just beyond the current receive window. rcv_adv must be initialized here because its value is used in the silly window avoidance in tcp_output (p. 879 of Volume 2). rcv_adv is set at the end of tcp_output, normally when the first segment is sent (which in this case would be the server SYN/ACK in response to the client's SYN). But with T/TCP rcv_adv needs to be set the first time through tcp_output, since we may be sending data with the first segment that we send.

Checking this code-path with a quick printf()-test shows that rcv_wnd at this point is 65536 which corresponds to the comment by Stevens. And this is the reason why adv can go negative at all.

In my opinion, the code is wrong. It should be using the same min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -style logic to limit the window to what it can really advertise.

That would make the correct code something like this:

tp->rcv_adv += min(tp->rcv_wnd, (long)TCP_MAXWIN << tp->rcv_scale);

What now?

Any network gurus around? I'd be interested in your thoughts on this…

Update: I was informed that OpenBSD fixed the calcualation of adv in rev. 1.15 in 1998 while they were making their code 64bit-clean. Kudos to them.

Update2: Interestingly, NetBSD (and thus OpenBSD) never picked up the T/TCP changes, so they aren't actually affected by this particular bug.

Update3: The fix has been committed as r220156 to FreeBSD-current, and was merged to FreeBSD-stable soon after. Yay!

– Sec

P.S.: Regardless of this analysis, I still think the PR should be applied, just on a robustness principle (Who knows when else adv becomes negative)...

P.P.S.: If anyone wants to play along, I've created a small scapy / python script to trigger that problem. Before running, you need to make sure your host doesn't interfere with traffic on that port (defaults to port 1516). On FreeBSD a simple

ipfw add deny ip from any to me 1516

should be enough before you start tcpbug.py.


posted at: 14:43 | Category: /tidbits | permanent link to this entry | 5 comments (trackback)
 

Your Comment
 
Name:
URL/Email: [http://... or mailto:you@wherever] (optional)
Title: (optional)
Comment:
Save my Name and URL/Email for next time
(Note that comments will be rejected unless you enter 42 in the following box: )

powered by blosxom
in 1.00 s