Sec, blogmal! - tidbits - tcp-bug



Dezember '14



Flattr me:

Flattr this

Thu, 24 Mar 2011

The tale of a TCP bug
The following post is a bit longish, and details my foray into the [def|BSD]: [TCP/)def|IP]: stack debugging and finding what I think is a 15-year old bug. !4 Part 1: Strange behavior A friend of mine reported a [def|FreeBSD]: oddity to me - [def|TCP]: connections from FreeBSD to a certain (unknown) host hung -- but only when coming from a 64bit machine. All appeared to be fine when connecting from a 32bit FreeBSD host. A few [tcpdump(s]: later we found out that the receiving host was misbehaving -- It accepted the TCP connection with a receive window of %0% (while unusual, this seems to be a common [def|syncookie]: implementation). Practically this means the sender can never send data unless the receiver sends a window update. tcpdump up to this point looks like this: % !cmdx! % > Flags [S], % seq 3339144437, win 65535, options [...], length 0 % > Flags [S.], % seq 42, ack 3339144438, win 0, length 0 % > Flags [.], % seq 1, ack 1, win 65535, length 0 As this window update packet might get lost, TCP resorts to sending so called "window probe" packets containing one byte of data to see if the receiver might have opened up his receive window. [/Note: this requires at least one byte of data in the send buffer, otherwise the connection would just be idle./] In case the receive window is open, the ACK packet will contain a new window size, and the connection can proceed as usual. % !cmdx! % > Flags [.], % seq 1:2, ack 1, win 65535, length 1 % > Flags [.], % seq 1, ack 2, win 70, length 0 this is theoretically done in [tcp_timer.c:407]:;ln=1#l407 that runs whenever the "persist" timer runs out. Unfortunately, this doesn't happen on FreeBSD 64bit. Instead, the connection just sits there and keeps on idling. !4 Digging deeper At this point I'd like to say thanks to [VirtualBox]: here. Debugging all this was really helped by the fact that I could easily create both a 32bit and a 64bit virtual machine and quickly recompile testing kernels. As I had virtually no understanding of the TCP code, I liberally sprinkled it with [def|printf()s]: to narrow down where the persist timer was (not) enabled, and then went backwards through the function to find the diverging point. The culprit was found in [tcp_output.c:564]:;ln=1#l564 where *adv* is calculated as: % !cmdx! % long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) - % (tp->rcv_adv - tp->rcv_nxt); Filling in the numbers from my debugging sessions, this is: % !cmdout! % min(65536, 65535) - (65579 - 43) = 65535 - 65536 = (-1) Unfortunately it's only (-1) on i386. Due to differing sizes of the variables involved, this turns out to be 4294967295 (%=0xffffffff%) on amd64. This lead to a FreeBSD PR: [kern/154006]: which proposes to change this code to: % !cmdx! % long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) - % (long) (tp->rcv_adv - tp->rcv_nxt); This code yields (-1) on both architectures. The only reply this PR got, was from Bruce Evans who critiqued my use of a simple %(long)% cast, which appears to have derailed this PR, sticking it in the usual "never getting fixed limbo" where unfortunately most of my PR's appear to end up. There is actually a precedent for my "simple" cast. Check [tcp_output.c:1003]:;ln=1#l1003 % !cmdx! % if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt)) % recwin = (long)(tp->rcv_adv - tp->rcv_nxt); But I digress. Sorry. !4 How broken is it? Looking at the code in question including the comment directly above: % !cmdx! % /* % * Compare available window to amount of window % * known to peer (as advertised window less % * next expected input). If the difference is at least two % * max size segments, or at least 50% of the maximum possible % * window, then want to send a window update to peer. % * Skip this if the connection is in T/TCP half-open state. % * Don't send pure window updates when the peer has closed % * the connection and won't ever send more data. % */ It appears that *adv* never was intended to be negative, so I was wondering if the code was even more broken than it appeared at first glance. I checked the revision history, but this part is essentially unchanged since [rev 1.1]:;ln=1 so I assumed it was all fine and stopped looking, and left it at that point for a while -- secretly hoping that I'd get feedback on my PR, maybe even a comment from someone with more knowledge of the TCP code. !4 Stevens to the rescue _
_ [{Volume 2}:tcp-bug/vol2.jpg]: This week at work, I stumbled across our full set of [TCP/IP Illustrated]: Volume 2 contains the 4.4BSD-Lite network stack source with lengthy explanations. I checked, and voila on Page 860 (Section 26.3) we can find nearly exact same lines. (recwin is called win, but it's the same value) [/Note: if you want to follow along, there appear to be several pdf versions on the net. E.g. [here]:] I've since read most of Chapter 26 "TCP Output", and was reassured that *adv* is never meant to be negative. It's a measure of how many bytes more we could receive, even after the other side sent as much data as we already told them they can send. (I.e. the maximum receive window the other side knows about). %min(recwin, (long)TCP_MAXWIN << tp->rcv_scale)% is the technical limit of what we can receive -- recwin is the size of the receive buffer, %(TCP_MAXWIN << tp->rcv_scale)% is the maximum receive window we can announce (given that the scaling option can not be changed during connection lifetime). %tp->rcv_nxt% is the sequence number of the maximum received data, %tp->rcv_adv% the sequence of the maximum announced receive window. If (as in our example case) we never received data, %tp->rcv_adv - tp->rcv_nxt% should be exactly the size of the receive window we told the opposite host. Well. At least according to Stevens. Our debugging clearly shows it to be 65536 while the tcpdump shows a window of 65535. So there is a discrepancy there. Checking the sequence numbers in the tcpdump output, we see that we got %seq=42% from the remote host, so %rcv_nxt=43% is what we expect here. But rcv_adv is too big. Digging through %tcp_output.c%, we see that %rcv_adv% is set at [line 1314]:;ln=1#l1314 % !cmdx! % tp->rcv_adv = tp->rcv_nxt + recwin; Adding a debugging statement here confirms that recwin at this point is correctly = 65535. !4 Mysterious increase So rcv_adv must be increased somewhere else, and probably without checking against the "technical limit" %TCP_MAXWIN << tp->rcv_scale)% and thus putting the TCP stack in a state not anticipated by the 4.4BSD code. Searching for other places where rcv_adv is modified, we find the %tcp_rcvseqinit(tp)% macro defined in %tcp_seq.h% as: % !cmdx! % #define tcp_rcvseqinit(tp) \ % (tp)->rcv_adv = (tp)->rcv_nxt = (tp)->irs + 1 and used in %tcp_input.c%. As this sets %rcv_adv% and %rcv_nxt% to the same value, this can't be our problem. Looking through %tcp_input.c% however, we find [tcp_input.c:1759]:;ln=1#l1759 % !cmdx! % tp->rcv_adv += tp->rcv_wnd; [/For the record, there is another such line at [tcp_syncache.c:789]:;ln=1#l789 which we're currently not interested in./] _
_ [{Volume 3}:tcp-bug/vol3.jpg]: This part was introduced in [r1.11]:;r2=1.11 in 1995 which marked the inclusion of [T/TCP]: in FreeBSD. Luckily there is a Volume 3 of TCP/IP Illustrated which (among other things) contains a discussion of the T/TCP changes. There we can find the same code in in Section 11.4 (page 132 code line 607). Stevens goes on to note: _ _ _
Set rcv_adv
607 _
%rcv_adv% is defined as the highest advertised sequence number plus one (Figure 24.18, p 809 of Volume 2). But the %tcp_rcvseqinit% macro in Figure 11.4 initialized it to the received sequence number plus one. At this point in the processing, %rcv_wnd% will be the size of the socket's receive buffer, from p. 941 of Volume 2. Therefore, by adding %rcv_wnd% to %rcv_adv%, the latter points just beyond the current receive window. %rcv_adv% must be initialized here because its value is used in the silly window avoidance in %tcp_output% (p. 879 of Volume 2). %rcv_adv% is set at the end of %tcp_output%, normally when the first segment is sent (which in this case would be the server SYN/ACK in response to the client's SYN). But with T/TCP %rcv_adv% needs to be set the first time through %tcp_output%, since we may be sending data with the first segment that we send.
Checking this code-path with a quick printf()-test shows that rcv_wnd at this point is 65536 which corresponds to the comment by Stevens. And this is the reason why *adv* can go negative at all. In my opinion, the code is wrong. It should be using the same %min(recwin, (long)TCP_MAXWIN << tp->rcv_scale)% -style logic to limit the window to what it can really advertise. That would make the correct code something like this: % !cmdx! % tp->rcv_adv += min(tp->rcv_wnd, (long)TCP_MAXWIN << tp->rcv_scale); !4 What now? Any network gurus around? I'd be interested in your thoughts on this... *Update:* I was informed that [def|OpenBSD]: fixed the calcualation of *adv* in [rev. 1.15]:;r2=1.15 in 1998 while they were making their code 64bit-clean. Kudos to them. *Update2:* Interestingly, [def|NetBSD]: (and thus OpenBSD) never picked up the T/TCP changes, so they aren't actually affected by this particular bug. *Update3:* The fix has been committed as [r220156]: to FreeBSD-current, and was merged to FreeBSD-stable soon after. Yay! -- Sec P.S.: Regardless of this analysis, I still think the [PR#kern/154006]: should be applied, just on a robustness principle (Who knows when else *adv* becomes negative)... P.P.S.: If anyone wants to play along, I've created a small [scapy]: %/% [def|python]: script to trigger that problem. Before running, you need to make sure your host doesn't interfere with traffic on that port (defaults to port %1516%). On FreeBSD a simple % !cmdx! % ipfw add deny ip from any to me 1516 should be enough before you start []:tcp-bug/
posted at: 14:43 | Category: /tidbits | permanent link to this entry | 5 comments (trackback)
  • Re: The tale of a TCP bug
    pvarga wrote on Thu, 24 Mar 2011 19:07

    in libkern.h there is lmin macro that would work too.

  • Re: The tale of a TCP bug
    jhb wrote on Thu, 24 Mar 2011 20:49

    I actually noticed your PR and forwarded it to some folks for review back when it was filed. I have a similar bug pending review as well. Actually, it looks like the same bug, except I encountered this during active connections, not during setup. I used a different approach to the patch, but yours may be more correct:

    • Re: The tale of a TCP bug
      jhb wrote on Thu, 24 Mar 2011 21:09

      Hmmmm, thinking about this a bit more. Your proposed change would not help established connections (the bug I encountered). I would think a better approach would be to fix the calculation of tp->rcv_wnd in tcp_input() to DTRT in this case instead of assuming that rcv_adv is always >= rcv_nxt. Something like:

      tp->rcv_wnd = win; if (SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt)) tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt));

      Feel free to continue this on

      • Re: The tale of a TCP bug
        Sec wrote on Fri, 25 Mar 2011 00:17

        I've just subscribed to freebsd-net, let's discuss it there. If/When we reach a conclusion, I'll update this posting.

  • Re: The tale of a TCP bug
    rc wrote on Fri, 25 Mar 2011 17:40

    I really hope this gets fixed soon.

    I'm having problems with a certain server/URL which is certainly due to this bug.

Your Comment
URL/Email: [http://... or mailto:you@wherever] (optional)
Title: (optional)
Save my Name and URL/Email for next time
(Note that comments will be rejected unless you enter 42 in the following box: )

powered by blosxom
in 1.00 s