Monday, November 2, 2009

23.7 udp_input Function

TCP/IP Illustrated, Volume 2: The Implementation
By Gary R. Wright, W. Richard Stevens
Chapter 23. UDP: User Datagram Protocol




UDP output is driven by a process calling one of the five write functions. The functions shown in Figure 23.14 are all called directly as part of the system call. UDP input, on the other hand, occurs when IP input receives an IP datagram on its input queue whose protocol field specifies UDP. IP calls the function udp_input through the pr_input function in the protocol switch table (Figure 8.15). Since IP input is at the software interrupt level, udp_input also executes at this level. The goal of udp_input is to place the UDP datagram onto the appropriate socket's buffer and wake up any process blocked for input on that socket.


We'll divide our discussion of the udp_input function into three sections:


  1. the general validation that UDP performs on the received datagram,

  2. processing UDP datagrams destined for a unicast address: locating the appropriate PCB and placing the datagram onto the socket's buffer, and

  3. processing UDP datagrams destined for a broadcast or multicast address: the datagram may be delivered to multiple sockets.


This last step is new with the support of multicasting in Net/3, but consumes almost one-third of the code.


General Validation of Received UDP Datagram


Figure 23.21 shows the first section of UDP input.



Figure 23.21. udp_input function: general validation of received UDP datagram.



55-65

The two arguments to udp_input are m, a pointer to an mbuf chain containing the IP datagram, and iphlen, the length of the IP header (including possible IP options).



Discard IP options


67-76

If IP options are present they are discarded by ip_stripoptions. As the comments indicate, UDP should save a copy of the IP options and make them available to the receiving process through the IP_RECVOPTS socket option, but this isn't implemented yet.


77-88

If the length of the first mbuf on the mbuf chain is less than 28 bytes (the size of the IP header plus the UDP header), m_pullup rearranges the mbuf chain so that at least 28 bytes are stored contiguously in the first mbuf.



Verify UDP length


89-101

There are two lengths associated with a UDP datagram: the length field in the IP header (ip_len) and the length field in the UDP header (uh_ulen). Recall that ipintr subtracted the length of the IP header from ip_len before calling udp_input (Figure 10.11). The two lengths are compared and there are three possibilities:


  1. ip_len equals uh_ulen. This is the common case.

  2. ip_len is greater than uh_ulen. The IP datagram is too big, as shown in Figure 23.22.

    Figure 23.22. UDP length too small.

    The code believes the smaller of the two lengths (the length in the UDP header) and m_adj removes the excess bytes of data from the end of the datagram. In the code the second argument to m_adj is negative, which we said in Figure 2.20 trims data from the end of the mbuf chain. It is possible in this scenario that the UDP length field has been corrupted. If so, the datagram will probably be discarded shortly, assuming the sender calculated the UDP checksum, that this checksum detects the error, and that the receiver verifies the checksum. The IP length field should be correct since it was verified by IP against the amount of data received from the interface, and the IP length field is covered by the mandatory IP header checksum.

  3. ip_len is less than uh_ulen. The IP datagram is smaller than possible, given the length in the UDP header. Figure 23.23 shows this case.

    Figure 23.23. UDP length too big.

    Something is wrong and the datagram is discarded. There is no other choice here: if the UDP length field has been corrupted, it can't be detected with the UDP checksum. The correct UDP length is needed to calculate the checksum.


    As we've said, the UDP length is redundant. In Chapter 28 we'll see that TCP does not have a length field in its header; it uses the IP length field, minus the lengths of the IP and TCP headers, to determine the amount of data in the datagram. Why does the UDP length field exist? Possibly to add a small amount of error checking, since UDP checksums are optional.
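The three cases above can be sketched as a small decision function. This is only an illustration, not the Net/3 code from Figure 23.21: the enum, the function name udp_check_len, and the trim output parameter are hypothetical, and both lengths are assumed to be in host byte order with the IP header length already subtracted from ip_len.

```c
/* Possible outcomes of the length comparison (illustrative names). */
enum udp_len_action { UDP_LEN_OK, UDP_LEN_TRIM, UDP_LEN_DROP };

/*
 * ip_len:  IP length after ipintr has subtracted the IP header length.
 * uh_ulen: length field from the UDP header, in host byte order.
 * trim:    set to the number of excess trailing bytes for UDP_LEN_TRIM,
 *          otherwise 0.
 */
static enum udp_len_action
udp_check_len(int ip_len, int uh_ulen, int *trim)
{
    *trim = 0;
    if (ip_len == uh_ulen)
        return UDP_LEN_OK;           /* case 1: the common case */
    if (ip_len > uh_ulen) {
        /* case 2: believe the UDP length; the kernel would trim the
         * tail of the chain with m_adj and a negative count */
        *trim = ip_len - uh_ulen;
        return UDP_LEN_TRIM;
    }
    return UDP_LEN_DROP;             /* case 3: discard the datagram */
}
```

For example, an IP length of 120 with a UDP length of 100 yields UDP_LEN_TRIM with 20 bytes to remove, while the reverse yields UDP_LEN_DROP.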




Save copy of IP header and verify UDP checksum


102-106

udp_input saves a copy of the IP header before verifying the checksum, because the checksum computation wipes out some of the fields in the original IP header.


110

The checksum is verified only if UDP checksums are enabled for the kernel (udpcksum), and if the sender calculated a UDP checksum (the received checksum is nonzero).



This test is incorrect. If the sender calculated a checksum, it should be verified, regardless of whether outgoing checksums are calculated or not. The variable udpcksum should only specify whether outgoing checksums are calculated. Unfortunately many vendors have copied this incorrect test, although many vendors today finally ship their kernels with UDP checksums enabled by default.
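To make the difference concrete, here is a minimal sketch of the two predicates. The function names are hypothetical; only udpcksum and the zero-means-no-checksum convention come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* The test as described above: verify only when the kernel has UDP
 * checksums enabled AND the sender supplied a nonzero checksum. */
static bool verify_as_coded(bool udpcksum, uint16_t uh_sum)
{
    return udpcksum && uh_sum != 0;
}

/* The test the text argues for: udpcksum should affect only outgoing
 * datagrams, so a received nonzero checksum is always verified. */
static bool verify_as_argued(bool udpcksum, uint16_t uh_sum)
{
    (void)udpcksum;              /* deliberately ignored on input */
    return uh_sum != 0;
}
```

The two differ only when udpcksum is off and the sender did calculate a checksum: the first predicate skips verification, so a corrupted datagram would be accepted.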



111-120

Before calculating the checksum, the IP header is referenced as an ipovly structure (Figure 23.18) and the fields are initialized as described in the previous section when the UDP checksum is calculated by udp_output.
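As a user-space illustration of what the receiver checks, the sketch below computes the Internet checksum over a pseudo-header followed by the UDP header and data. It is not the kernel code: it assumes the 12-byte pseudo-header layout of RFC 768 rather than the ipovly overlay, and the names in_cksum_bytes and udp_pseudo_fill are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum: 16-bit one's-complement sum over big-endian words,
 * folded to 16 bits and complemented. */
static uint16_t in_cksum_bytes(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;

    while (n > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        n -= 2;
    }
    if (n == 1)                          /* odd length: pad with a zero byte */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)                    /* fold carries back in */
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Fill the 12-byte pseudo-header at the start of buf: source address,
 * destination address, a zero byte, the protocol (17), the UDP length. */
static void udp_pseudo_fill(uint8_t *buf, const uint8_t src[4],
                            const uint8_t dst[4], uint16_t udp_len)
{
    for (int i = 0; i < 4; i++) {
        buf[i] = src[i];
        buf[4 + i] = dst[i];
    }
    buf[8] = 0;
    buf[9] = 17;                         /* IPPROTO_UDP */
    buf[10] = (uint8_t)(udp_len >> 8);
    buf[11] = (uint8_t)(udp_len & 0xff);
}
```

A sender computes in_cksum_bytes with the checksum field set to zero and stores the result; a receiver running the same function over the pseudo-header and the received datagram, checksum field included, gets 0 when nothing was damaged.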


At this point special code is executed if the datagram is destined for a broadcast or multicast IP address. We defer this code until later in the section.



Demultiplexing Unicast Datagrams


Assuming the datagram is destined for a unicast address, Figure 23.24 shows the code that is executed.



Figure 23.24. udp_input function: demultiplex unicast datagram.



Check one-behind cache


206-209

UDP maintains a pointer to the last Internet PCB for which it received a datagram, udp_last_inpcb. Before calling in_pcblookup, which might have to search many PCBs on the UDP list, the foreign and local addresses and ports of that last PCB are compared against the received datagram. This is called a one-behind cache [Partridge and Pink 1993], and it is based on the assumption that the next datagram received has a high probability of being destined for the same socket as the last received datagram [Mogul 1991]. This cache was introduced with the 4.3BSD Tahoe release.


210-213

The order of the four comparisons between the cached PCB and the received datagram is intentional. If the PCBs don't match, the comparisons should stop as soon as possible. The highest probability is that the destination port numbers are different, so this is the first test. The lowest probability of a mismatch is between the local addresses, especially on a host with just one interface, so this is the last test.


Unfortunately this one-behind cache, as coded, is practically useless [Partridge and Pink 1993]. The most common type of UDP server binds only its well-known port, leaving its local address, foreign address, and foreign port wildcarded. The most common type of UDP client does not connect its UDP socket; it specifies the destination address for each datagram using sendto. Therefore most of the time the three values in the PCB inp_laddr, inp_faddr, and inp_fport are wildcards. In the cache comparison the four values in the received datagram are never wildcards, meaning the cache entry will compare equal with the received datagram only when the PCB has all four local and foreign values specified to nonwildcard values. This happens only for a connected UDP socket.
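The following sketch shows why an exact-match cache test fails for the typical server PCB. The structures are simplified stand-ins for the Internet PCB and the received headers, and the placement of the two middle comparisons is an assumption; the text fixes only the first test (destination port) and the last (local address).

```c
#include <stdbool.h>
#include <stdint.h>

#define ADDR_WILDCARD 0u     /* stands in for a wildcard address or port */

/* Simplified PCB: local/foreign address and port (illustrative). */
struct pcb4 {
    uint32_t laddr;
    uint32_t faddr;
    uint16_t lport;
    uint16_t fport;
};

/* The four values taken from a received datagram; never wildcards. */
struct dgram4 {
    uint32_t src;
    uint32_t dst;
    uint16_t sport;
    uint16_t dport;
};

/* Exact comparison against the cached PCB, most likely mismatch first. */
static bool cache_hit(const struct pcb4 *inp, const struct dgram4 *d)
{
    return inp->lport == d->dport &&     /* destination port: first test */
           inp->fport == d->sport &&
           inp->faddr == d->src &&
           inp->laddr == d->dst;         /* local address: last test */
}
```

A server PCB with only its local port bound has a wildcard foreign port, so the second comparison fails for every datagram it receives; only a PCB with all four values filled in, that is, a connected socket, can hit.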



On the system bsdi, the counter udpps_pcbcachemiss was 41,253 and the counter udps_ipackets was 42,485. This is less than a 3% cache hit rate.


The netstat -s command prints most of the fields in the udpstat structure (Figure 23.5). Unfortunately the Net/3 version, and most vendors' versions, never print udpps_pcbcachemiss. If you want to see the value, use a debugger to examine the variable in the running kernel.




Search all UDP PCBs


214-218

Assuming the comparison with the cached PCB fails, in_pcblookup searches for a match. The INPLOOKUP_WILDCARD flag is specified, allowing a wildcard match. If a match is found, the pointer to the PCB is saved in udp_last_inpcb, which we said is a cache of the last received UDP datagram's PCB.



Generate ICMP port unreachable error


220-230

If a matching PCB is not found, UDP normally generates an ICMP port unreachable error. First the m_flags for the received mbuf chain is checked to see if the datagram was sent to a link-level broadcast or multicast destination address. It is possible to receive an IP datagram with a unicast IP address that was sent to a broadcast or multicast link-level address, but an ICMP port unreachable error must not be generated. If it is OK to generate the ICMP error, the IP header is restored to its received value (save_ip) and the IP length is also set back to its original value.
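The decision just described can be sketched as follows. The flag values, the counter structure, and the function name are illustrative assumptions; only the flag names M_BCAST and M_MCAST and the counters udps_noport and udps_noportbcast correspond to names in the text.

```c
#include <stdbool.h>

#define M_BCAST 0x0100   /* link-level broadcast (illustrative value) */
#define M_MCAST 0x0200   /* link-level multicast (illustrative value) */

/* Stand-in for the two udpstat counters involved. */
struct udp_counters {
    unsigned long noport;        /* no PCB for the destination port */
    unsigned long noportbcast;   /* same, but link-level bcast/mcast */
};

/* Called when no PCB matches. Returns true if an ICMP port unreachable
 * may be generated, false if the datagram must be silently dropped. */
static bool no_pcb_found(int m_flags, struct udp_counters *st)
{
    st->noport++;
    if (m_flags & (M_BCAST | M_MCAST)) {
        st->noportbcast++;
        return false;
    }
    return true;
}
```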



This check for a link-level broadcast or multicast address is redundant. icmp_error also performs this check. The only advantage in this redundant check is to maintain the counter udps_noportbcast in addition to the counter udps_noport.


The addition of iphlen back into ip_len is a bug. icmp_error will also do this, causing the IP length field in the IP header returned in the ICMP error to be 20 bytes too large. You can tell if a system has this bug by adding a few lines of code to the Traceroute program (Chapter 8 of Volume 1) to print this field in the ICMP port unreachable that is returned when the destination host is finally reached.
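The arithmetic behind this bug is easy to trace. The helper below is hypothetical and models only the three adjustments to ip_len named in the text: ipintr subtracts the header length, udp_input adds it back, and icmp_error adds it again.

```c
/*
 * total_len:          IP total length as received
 * iphlen:             IP header length
 * udp_input_restores: nonzero if udp_input adds iphlen back before
 *                     calling icmp_error (the behavior described above)
 * Returns the IP length that ends up in the header returned by ICMP.
 */
static int reported_ip_len(int total_len, int iphlen, int udp_input_restores)
{
    int ip_len = total_len - iphlen;     /* done by ipintr */

    if (udp_input_restores)
        ip_len += iphlen;                /* done by udp_input */
    ip_len += iphlen;                    /* done by icmp_error */
    return ip_len;
}
```

With a 20-byte IP header and a 48-byte datagram, the buggy path reports 68 instead of 48: exactly one header length too large.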



Figure 23.25 is the next section of processing for a unicast datagram, delivering the datagram to the socket corresponding to the destination PCB.



Figure 23.25. udp_input function: deliver unicast datagram to socket.



Return source IP address and source port


231-236

The source IP address and source port number from the received IP datagram are stored in the global sockaddr_in structure udp_in. This structure is passed as an argument to sbappendaddr later in the function.


Using a global to hold the IP address and port number is OK because udp_input is single threaded. When this function is called by ipintr it processes the received datagram completely before returning. Also, sbappendaddr copies the socket address structure from the global into an mbuf.



IP_RECVDSTADDR socket option


237-244

The constant INP_CONTROLOPTS is the combination of the three socket options that the process can set to cause control information to be returned through the recvmsg system call for a UDP socket (Figure 22.5). The IP_RECVDSTADDR socket option returns the destination IP address from the received UDP datagram as control information. The function udp_saveopt allocates an mbuf of type MT_CONTROL and stores the 4-byte destination IP address in the mbuf. We show this function in Section 23.8.



This socket option appeared with 4.3BSD Reno and was intended for applications such as TFTP, the Trivial File Transfer Protocol, that should not respond to client requests that are sent to a broadcast address. Unfortunately, even if the receiving application uses this option, it is nontrivial to determine if the destination IP address is a broadcast address or not (Exercise 23.6).


When the multicasting changes were added in 4.4BSD, this code was left in only for datagrams destined for a unicast address. We'll see in Figure 23.26 that this option is not implemented for datagrams sent to a broadcast or multicast address. This defeats the purpose of the option!



Figure 23.26. udp_input function: demultiplexing of broadcast and multicast datagrams.




Unimplemented socket options


245-260

This code is commented out because it doesn't work. The intent of the IP_RECVOPTS socket option is to return the IP options from the received datagram as control information, and the intent of the IP_RECVRETOPTS socket option is to return source route information. The manipulation of the mp variable by all three IP_RECV socket options is to build a linked list of up to three mbufs that are then placed onto the socket's buffer by sbappendaddr. The code shown in Figure 23.25 only returns one option as control information, so the m_next pointer of that mbuf is always a null pointer.



Append data to socket's receive queue


262-272

At this point the received datagram (the mbuf chain pointed to by m) is ready to be placed onto the socket's receive queue along with a socket address structure representing the sender's IP address and port (udp_in), and optional control information (the destination IP address, the mbuf pointed to by opts). This is done by sbappendaddr. Before calling this function, however, the pointer and lengths of the first mbuf on the chain are adjusted to ignore the IP and UDP headers. Before returning, sorwakeup is called for the receiving socket to wake up any processes asleep on the socket's receive queue.
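The header adjustment amounts to advancing a data pointer and shrinking two length fields. The sketch below uses a simplified, hypothetical stand-in for the first mbuf of a chain and assumes a 20-byte IP header, which holds here because any IP options were stripped earlier.

```c
/* Minimal stand-in for the first mbuf of a chain (illustrative only). */
struct mini_mbuf {
    unsigned char *m_data;   /* start of valid data in this mbuf */
    int m_len;               /* bytes of data in this mbuf */
    int m_pkthdr_len;        /* total bytes in the whole packet */
};

enum { IP_HDR_LEN = 20, UDP_HDR_LEN = 8 };

/* Skip the IP and UDP headers so that only user data is queued. */
static void strip_ip_udp_headers(struct mini_mbuf *m)
{
    int hlen = IP_HDR_LEN + UDP_HDR_LEN;

    m->m_len -= hlen;
    m->m_pkthdr_len -= hlen;
    m->m_data += hlen;
}
```

No data is copied: the headers stay in the buffer but are no longer part of what the mbuf describes.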



Error return


273-276

If an error is encountered during UDP input processing, udp_input jumps to the label bad. The mbuf chain containing the datagram is released, along with the mbuf chain containing any control information (if present).



Demultiplexing Multicast and Broadcast Datagrams


We now return to the portion of udp_input that handles datagrams sent to a broadcast or multicast IP address. The code is shown in Figure 23.26.


121-138

As the comments indicate, these datagrams are delivered to all sockets that match, not just a single socket. The inadequacy of the UDP interface that is mentioned refers to the inability of a process to receive asynchronous errors on a UDP socket (notably ICMP port unreachables) unless the socket is connected. We described this in Section 22.11.


139-145

The source IP address and port number are saved in the global sockaddr_in structure udp_in, which is passed to sbappendaddr. The mbuf chain's length and data pointer are updated to ignore the IP and UDP headers.


146-164

The large for loop scans each UDP PCB to find all matching PCBs. in_pcblookup is not called for this demultiplexing because it returns only one PCB, whereas the broadcast or multicast datagram may be delivered to more than one PCB.


If the local port in the PCB doesn't match the destination port from the received datagram, the entry is ignored. If the local address in the PCB is not the wildcard, it is compared to the destination IP address and the entry is skipped if they're not equal. If the foreign address in the PCB is not a wildcard, it is compared to the source IP address and if they match, the foreign port must also match the source port. This last test assumes that if the socket is connected to a foreign IP address it must also be connected to a foreign port, and vice versa. This is the same logic we saw in in_pcblookup.
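These matching rules can be written as a single predicate. The structures and names below are simplified assumptions, not the kernel's inpcb, and a zero address stands in for the wildcard.

```c
#include <stdbool.h>
#include <stdint.h>

#define ADDR_WILDCARD 0u

/* Simplified PCB and received-datagram fields (illustrative). */
struct pcb4 {
    uint32_t laddr;
    uint32_t faddr;
    uint16_t lport;
    uint16_t fport;
};

struct dgram4 {
    uint32_t src;
    uint32_t dst;
    uint16_t sport;
    uint16_t dport;
};

/* True if a broadcast or multicast datagram should go to this PCB. */
static bool mcast_pcb_matches(const struct pcb4 *inp, const struct dgram4 *d)
{
    if (inp->lport != d->dport)
        return false;                    /* local port must match */
    if (inp->laddr != ADDR_WILDCARD && inp->laddr != d->dst)
        return false;                    /* bound to a different address */
    if (inp->faddr != ADDR_WILDCARD &&
        (inp->faddr != d->src || inp->fport != d->sport))
        return false;                    /* connected to a different peer */
    return true;
}
```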


165-177

If this is not the first match found (last is nonnull), a copy of the datagram is placed onto the receive queue for the previous match. Since sbappendaddr releases the mbuf chain when it is done, a copy is first made by m_copy. Any processes waiting for this data are awakened by sorwakeup. A pointer to this matching socket structure is saved in last.


This use of the variable last avoids calling m_copy (an expensive operation since an entire mbuf chain is copied) unless there are multiple recipients for a given datagram. In the common case of a single recipient, the for loop just sets last to the single matching PCB, and when the loop terminates, sbappendaddr places the mbuf chain onto the socket's receive queue; no copy is made.


178-188

If this matching socket doesn't have either the SO_REUSEPORT or the SO_REUSEADDR socket option set, then there's no need to check for additional matches and the loop is terminated. The datagram is placed onto the single socket's receive queue in the call to sbappendaddr outside the loop.


189-197

If last is null at the end of the loop, no matches were found. An ICMP error is not generated because the datagram was sent to a broadcast or multicast IP address.


198-204

The final matching entry (which could be the only matching entry) has the original datagram (m) placed onto its receive queue. After sorwakeup is called, udp_input returns, since the processing of the broadcast or multicast datagram is complete.
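The delivery loop described for lines 146-204 can be modeled in a few lines to show how the last variable defers copying. This is a sketch under simplifying assumptions: sockets are reduced to a hypothetical sock_stub with precomputed match and reuse flags, and copying and queueing are replaced by a counter and a flag.

```c
#include <stddef.h>

/* Illustrative stand-in for a PCB and its socket. */
struct sock_stub {
    int matches;     /* PCB matches the received datagram */
    int reuse;       /* SO_REUSEPORT or SO_REUSEADDR is set */
    int delivered;   /* set when a datagram is queued to this socket */
};

/* Returns the number of copies made (each one models an m_copy call),
 * or -1 if no socket matched and the datagram is dropped. */
static int deliver_all(struct sock_stub *socks, int n)
{
    struct sock_stub *last = NULL;
    int copies = 0;

    for (int i = 0; i < n; i++) {
        if (!socks[i].matches)
            continue;
        if (last != NULL) {          /* an earlier match gets a copy */
            copies++;
            last->delivered = 1;
        }
        last = &socks[i];
        if (!last->reuse)            /* no sharing: stop searching */
            break;
    }
    if (last == NULL)
        return -1;                   /* no match, and no ICMP error */
    last->delivered = 1;             /* final match gets the original */
    return copies;
}
```

With one matching socket the function reports zero copies; with three matching sockets that all allow reuse it reports two, one fewer than the number of recipients.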


The remainder of the function (shown previously in Figure 23.24) handles unicast datagrams.



Connected UDP Sockets and Multihomed Hosts


There is a subtle problem when using a connected UDP socket to exchange datagrams with a process on a multihomed host. Datagrams from the peer may arrive with a different source IP address and will not be delivered to the connected socket.


Consider the example shown in Figure 23.27.



Figure 23.27. Example of connected UDP socket sending datagram to a multihomed host.


Three steps take place.






  1. The client on bsdi creates a UDP socket and connects it to 140.252.1.29, the PPP interface on sun, not the Ethernet interface. A datagram is sent on the socket to the server.


    The server on sun receives the datagram and accepts it, even though it arrives on an interface that differs from the destination IP address. (sun is acting as a router, so whether it implements the weak end system model or the strong end system model doesn't matter.) The datagram is delivered to the server, which is waiting for client requests on an unconnected UDP socket.



  2. The server sends a reply, but since the reply is being sent on an unconnected UDP socket, the source IP address for the reply is chosen by the kernel based on the outgoing interface (140.252.13.33). The destination IP address in the request is not used as the source address for the reply.


    When the reply is received by bsdi it is not delivered to the client's connected UDP socket since the IP addresses don't match.



  3. bsdi generates an ICMP port unreachable error since the reply can't be demultiplexed. (This assumes that there is not another process on bsdi eligible to receive the datagram.)



The problem in this example is that the server does not use the destination IP address from the request as the source IP address of the reply. If it did, the problem wouldn't exist, but this solution is nontrivial (see Exercise 23.10). We'll see in Figure 28.16 that a TCP server uses the destination IP address from the client as the source IP address from the server, if the server has not explicitly bound a local IP address to its socket.




