Sunday, October 25, 2009

4.3 Ethernet Interface

TCP/IP Illustrated, Volume 2: The Implementation
By Gary R. Wright, W. Richard Stevens
Chapter 4. Interfaces: Ethernet




Net/3 Ethernet device drivers all follow the same general design. This is common for most Unix device drivers because the writer of a driver for a new interface card often starts with a working driver for another card and modifies it. In this section we'll provide a brief overview of the Ethernet standard and outline the design of an Ethernet driver. We'll refer to the LANCE driver to illustrate the design.


Figure 4.8 illustrates Ethernet encapsulation of an IP packet.



Figure 4.8. Ethernet encapsulation of an IP packet.


Ethernet frames consist of 48-bit destination and source addresses followed by a 16-bit type field that identifies the format of the data carried by the frame. For IP packets, the type is 0x0800 (2048). The frame is terminated with a 32-bit CRC (cyclic redundancy check), which detects errors in the frame.



We are describing the original Ethernet framing standard published in 1982 by Digital Equipment Corp., Intel Corp., and Xerox Corp., as it is the most common form used today in TCP/IP networks. An alternative form is specified by the IEEE (Institute of Electrical and Electronics Engineers) 802.2 and 802.3 standards. Section 2.2 in Volume 1 describes the differences between the two forms. See [Stallings 1987] for more information on the IEEE standards.


Encapsulation of IP packets for Ethernet is specified by RFC 894 [Hornig 1984] and for 802.3 networks by RFC 1042 [Postel and Reynolds 1988].



We will refer to the 48-bit Ethernet addresses as hardware addresses. The translation from IP to hardware addresses is done by the ARP protocol described in Chapter 21 (RFC 826 [Plummer 1982]) and from hardware to IP addresses by the RARP protocol (RFC 903 [Finlayson et al. 1984]). Ethernet addresses come in two types, unicast and multicast. A unicast address specifies a single Ethernet interface, and a multicast address specifies a group of Ethernet interfaces. An Ethernet broadcast is a multicast received by all interfaces. Ethernet unicast addresses are assigned by the device's manufacturer, although some devices allow the address to be changed by software.



Some DECNET protocols require the hardware addresses of a multihomed host to be identical, so DECNET must be able to change the Ethernet unicast address of a device.



Figure 4.9 illustrates the data structures and functions that are part of the Ethernet interface.



Figure 4.9. Ethernet device driver.



In figures, a function is identified by an ellipse (leintr), data structures by a box (le_softc[0]), and a group of functions by a rounded box (ARP protocol).



In the top left corner of Figure 4.9 we show the input queues for the OSI Connectionless Network Layer (clnl) protocol, IP, and ARP. We won't say anything more about clnlintrq, but include it to emphasize that ether_input demultiplexes Ethernet frames into multiple protocol queues.



Technically, OSI uses the term Connectionless Network Protocol (CLNP versus CLNL) but we show the terminology used by the Net/3 code. The official standard for CLNP is ISO 8473. [Stallings 1993] summarizes the standard.



The le_softc interface structure is in the center of Figure 4.9. We are interested only in the ifnet and arpcom portions of the structure. The remaining portions are specific to the LANCE hardware. We showed the ifnet structure in Figure 3.6 and the arpcom structure in Figure 3.26.


leintr Function


We start with the reception of Ethernet frames. For now, we assume that the hardware has been initialized and the system has been configured so that leintr is called when the interface generates an interrupt. In normal operation, an Ethernet interface receives frames destined for its unicast hardware address and for the Ethernet broadcast address. When a complete frame is available, the interface generates an interrupt and the kernel calls leintr.



In Chapter 12, we'll see that many Ethernet interfaces may be configured to receive Ethernet multicast frames (other than broadcasts).


Some interfaces can be configured to run in promiscuous mode in which the interface receives all frames that appear on the network. The tcpdump program described in Volume 1 can take advantage of this feature using BPF.



leintr examines the hardware and, if a frame has arrived, calls leread to transfer the frame from the interface to a chain of mbufs (with m_devget). If the hardware reports that a frame transmission has completed or an error has been detected (such as a bad checksum), leintr updates the appropriate interface statistics, resets the hardware, and calls lestart, which attempts to transmit another frame.


All Ethernet device drivers deliver their received frames to ether_input for further processing. The mbuf chain constructed by the device driver does not include the Ethernet header, so it is passed as a separate argument to ether_input. The ether_header structure is shown in Figure 4.10.



Figure 4.10. The ether_header structure.


38-42

The Ethernet CRC is not generally available. It is computed and checked by the interface hardware, which discards frames that arrive with an invalid CRC. The Ethernet device driver is responsible for converting ether_type between network and host byte order. Outside of the driver, it is always in host byte order.
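As a rough illustration of the byte-order conversion just described, the following sketch lays out a header with two 6-byte addresses and a 16-bit type, and converts the type field in place. The names eth_hdr_sketch, SK_ETHERTYPE_IP, and type_to_host are hypothetical; the actual definition is the one in Figure 4.10.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>   /* ntohs, htons */

/* Hypothetical stand-in for the ether_header structure of Figure 4.10. */
struct eth_hdr_sketch {
    uint8_t  dhost[6];   /* 48-bit destination address */
    uint8_t  shost[6];   /* 48-bit source address */
    uint16_t type;       /* frame type; network byte order on the wire */
};

#define SK_ETHERTYPE_IP 0x0800   /* type value for IP packets */

/* Convert the type field to host byte order once, as a driver would
 * before handing the header to the rest of the kernel. */
static void type_to_host(struct eth_hdr_sketch *eh)
{
    assert(eh != NULL);
    eh->type = ntohs(eh->type);
}
```

After the conversion, code outside the driver can compare the field directly against constants such as SK_ETHERTYPE_IP.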



leread Function


The leread function (Figure 4.11) starts with a contiguous buffer of memory passed to it by leintr and constructs an ether_header structure and a chain of mbufs. The chain contains the data from the Ethernet frame. leread also passes the incoming frame to BPF.



Figure 4.11. leread function.



528-539

The leintr function passes three arguments to leread: unit, which identifies the particular interface card that received a frame; buf, which points to the received frame; and len, the number of bytes in the frame (including the header and the CRC).


The function constructs the ether_header structure by pointing et to the front of the buffer and converting the Ethernet type value to host byte order.


540-551

The number of data bytes is computed by subtracting the sizes of the Ethernet header and the CRC from len.
Runt packets, which are too short to be a valid Ethernet frame, are logged, counted, and discarded.
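The length computation can be sketched as follows. This is a simplified model, not the driver code of Figure 4.11: the function name frame_data_len is hypothetical, and treating any frame with no data bytes as a runt is our assumption about the test.

```c
#include <assert.h>

#define SK_HDR_LEN 14   /* 6 + 6 + 2 bytes of Ethernet header */
#define SK_CRC_LEN 4    /* 32-bit CRC trailer */

/* Return the number of data bytes in a received frame of len bytes
 * (header and CRC included), or -1 if nothing is left after removing
 * the header and CRC (assumed here to mean a runt). */
static int frame_data_len(int len)
{
    int data;

    assert(len >= 0);
    data = len - SK_HDR_LEN - SK_CRC_LEN;
    return data > 0 ? data : -1;
}
```

A minimum-size 64-byte frame, for example, yields 46 data bytes.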


552-557

Next, the destination address is examined to determine if it is the Ethernet broadcast or an Ethernet multicast address. The Ethernet broadcast address is a special case of an Ethernet multicast address; it has every bit set. etherbroadcastaddr is an array defined as



u_char etherbroadcastaddr[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };


This is a convenient way to define a 48-bit value in C. This technique works only if we assume that characters are 8-bit values, something that isn't guaranteed by ANSI C.
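One way to make the 8-bit assumption explicit is to check CHAR_BIT at compile time. The sketch below is ours and does not appear in Net/3; sk_broadcast and addr_bits are hypothetical names.

```c
#include <assert.h>
#include <limits.h>   /* CHAR_BIT */

#if CHAR_BIT != 8
#error "this sketch assumes 8-bit characters"
#endif

/* Same initializer as etherbroadcastaddr, under a hypothetical name. */
static const unsigned char sk_broadcast[6] =
    { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

/* Number of bits the array represents on this platform. */
static unsigned addr_bits(void)
{
    return (unsigned)(sizeof(sk_broadcast) * CHAR_BIT);
}
```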



If bcmp reports that etherbroadcastaddr and ether_dhost are the same, the M_BCAST flag is set.


An Ethernet multicast address is identified by the low-order bit of the most significant byte of the address. Figure 4.12 illustrates this.



Figure 4.12. Testing for an Ethernet multicast address.


In Chapter 12 we'll see that not all Ethernet multicast frames are IP multicast datagrams and that IP must examine the packet further.


If the multicast bit is on in the address, M_MCAST is set in the flags variable. The order of the tests is important: first ether_input compares the entire 48-bit address to the Ethernet broadcast address, and if they are different it checks the low-order bit of the most significant byte to identify an Ethernet multicast address (Exercise 4.1).
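The ordering of the two tests can be sketched as below. The flag values and the names SK_M_BCAST, SK_M_MCAST, and classify_dest are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>   /* memcmp */

enum { SK_M_BCAST = 0x1, SK_M_MCAST = 0x2 };   /* hypothetical flag values */

static const uint8_t sk_broadcast[6] =
    { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

/* Classify a destination address. The broadcast comparison comes first:
 * the broadcast address also has the multicast bit set, so testing the
 * bit first would label every broadcast as a multicast. */
static int classify_dest(const uint8_t dhost[6])
{
    assert(dhost != NULL);
    if (memcmp(dhost, sk_broadcast, 6) == 0)
        return SK_M_BCAST;
    if (dhost[0] & 1)            /* low-order bit of the first byte */
        return SK_M_MCAST;
    return 0;                    /* unicast */
}
```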


558-573

If the interface is tapped by BPF, the frame is passed directly to BPF by calling bpf_tap. We'll see that for SLIP and the loopback interfaces, a special BPF frame is constructed since those networks do not have a link-level header (unlike Ethernet).


When an interface is tapped by BPF, it can be configured to run in promiscuous mode and receive all Ethernet frames that appear on the network instead of the subset of frames normally received by the hardware. The packet is discarded by leread if it was sent to a unicast address that does not match the interface's address.
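A minimal sketch of the discard decision in promiscuous mode follows, assuming the frame has already been handed to BPF. The function name drop_after_tap is hypothetical, and the real test in Figure 4.11 may be structured differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>   /* memcmp */

/* Return nonzero if a frame received in promiscuous mode should be
 * discarded once BPF has seen it: it was sent to a unicast address
 * that is not the interface's own address. */
static int drop_after_tap(const uint8_t dhost[6], const uint8_t my_addr[6])
{
    assert(dhost != NULL && my_addr != NULL);
    if (dhost[0] & 1)                       /* broadcast or multicast: keep */
        return 0;
    return memcmp(dhost, my_addr, 6) != 0;  /* someone else's unicast: drop */
}
```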


574-585

m_devget (Section 2.6) copies the data from the buffer passed to leread to an mbuf chain it allocates. The first argument to m_devget points to the first byte after the Ethernet header, which is the first data byte in the frame. If m_devget runs out of memory, leread returns immediately. Otherwise the broadcast and multicast flags are set in the first mbuf in the chain, and ether_input processes the packet.



ether_input Function


ether_input, shown in Figure 4.13, examines the ether_header structure to determine the type of data that has been received and then queues the received packet for processing.



Figure 4.13. ether_input function.




Broadcast and multicast recognition


196-209

The arguments to ether_input are ifp, a pointer to the receiving interface's ifnet structure; eh, a pointer to the Ethernet header of the received packet; and m, a pointer to the received packet (excluding the Ethernet header).


Any packets that arrive on an inoperative interface are silently discarded. The interface may not have been configured with a protocol address, or may have been disabled by an explicit request from the ifconfig(8) program (Section 6.6).


210-218

The variable time is a global timeval structure that the kernel maintains with the current time and date, as the number of seconds and microseconds past the Unix Epoch (00:00:00 January 1, 1970, Coordinated Universal Time [UTC]). A brief discussion of UTC can be found in [Itano and Ramsey 1993]. We'll encounter the timeval structure throughout the Net/3 sources:



struct timeval {
    long tv_sec;     /* seconds */
    long tv_usec;    /* and microseconds */
};

ether_input updates if_lastchange with the current time and increments if_ibytes by the size of the incoming packet (the packet length plus the 14-byte Ethernet header).
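The statistics update amounts to the following sketch. The structures sk_timeval and sk_ifstats and the function note_input are hypothetical simplifications; the real fields live in the ifnet structure, and the kernel reads the global time variable rather than taking the time as an argument.

```c
#include <assert.h>
#include <stddef.h>

struct sk_timeval { long tv_sec; long tv_usec; };

/* Hypothetical subset of the per-interface statistics. */
struct sk_ifstats {
    struct sk_timeval lastchange;   /* time of last change */
    unsigned long     ibytes;       /* total bytes received */
};

#define SK_ETHER_HDR_LEN 14

/* Record the arrival of a packet whose mbuf chain holds pktlen bytes.
 * The chain excludes the Ethernet header, so the header is added back. */
static void note_input(struct sk_ifstats *st, struct sk_timeval now,
                       unsigned long pktlen)
{
    assert(st != NULL);
    st->lastchange = now;
    st->ibytes += pktlen + SK_ETHER_HDR_LEN;
}
```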


Next, ether_input repeats the tests done by leread to determine if the packet is a broadcast or multicast packet.



Some kernels may not have been compiled with the BPF code, so the test must also be done in ether_input.




Link-level demultiplexing


219-227

ether_input jumps according to the Ethernet type field. For an IP packet, schednetisr schedules an IP software interrupt and the IP input queue, ipintrq, is selected. For an ARP packet, the ARP software interrupt is scheduled and arpintrq is selected.



An isr is an interrupt service routine.


In previous BSD releases, ARP packets were processed immediately while at the network interrupt level by calling arpinput directly. By queueing the packets, they can be processed at the software interrupt level.


If other Ethernet types are to be handled, a kernel programmer would add additional cases here. Alternatively, a process can receive other Ethernet types using BPF. For example, RARP servers are normally implemented using BPF under Net/3.
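The demultiplexing step can be pictured as a switch on the type field. In this sketch the enum sk_queue and the function select_queue are hypothetical stand-ins for selecting ipintrq or arpintrq; scheduling the software interrupt is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define SK_ETHERTYPE_IP  0x0800   /* IP */
#define SK_ETHERTYPE_ARP 0x0806   /* ARP */

/* Hypothetical labels for the protocol input queues. */
enum sk_queue { SK_Q_NONE, SK_Q_IP, SK_Q_ARP };

/* Choose an input queue from the Ethernet type (host byte order).
 * Supporting another Ethernet type would mean adding a case here. */
static enum sk_queue select_queue(uint16_t type)
{
    switch (type) {
    case SK_ETHERTYPE_IP:
        return SK_Q_IP;
    case SK_ETHERTYPE_ARP:
        return SK_Q_ARP;
    default:
        return SK_Q_NONE;     /* unrecognized type or 802.3 length */
    }
}
```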



228-307

The default case processes unrecognized Ethernet types or packets that are encapsulated according to the 802.3 standard (such as the OSI connectionless transport). The Ethernet type field and the 802.3 length field occupy the same position in an Ethernet frame. The two encapsulations can be distinguished because the range of types in an Ethernet encapsulation is distinct from the range of lengths in the 802.3 encapsulation (Figure 4.14). We have omitted the OSI code. [Stallings 1993] contains a description of the OSI link-level protocols.



Figure 4.14. Ethernet type and 802.3 length fields.



There are many additional Ethernet type values that are assigned to various protocols; we don't show them in Figure 4.14. RFC 1700 [Reynolds and Postel 1994] contains a list of the more common types.
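Because the two ranges do not overlap, a single comparison is enough to tell the encapsulations apart. The sketch below assumes that values up to 1500, the maximum Ethernet data size, are 802.3 lengths; SK_ETHERMTU and is_8023_length are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define SK_ETHERMTU 1500   /* largest data size carried by one frame */

/* Interpret the 16-bit field that follows the source address.
 * Returns nonzero if it is an 802.3 length, zero if it is an
 * Ethernet type. The field must be in host byte order. */
static int is_8023_length(uint16_t field)
{
    return field <= SK_ETHERMTU;
}
```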




Queue the packet


308-315

Finally, ether_input places the packet on the selected queue or discards the packet if the queue is full. We'll see in Figures 7.23 and 21.16 that the default limit for the IP and ARP input queues is 50 (ipqmaxlen) packets each.
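The queue-or-discard behavior can be sketched with a small linked queue. The structures sk_pkt and sk_ifqueue and the function sk_enqueue are hypothetical simplifications of the kernel's queue handling; in the kernel the caller would also free the discarded packet.

```c
#include <assert.h>
#include <stddef.h>

struct sk_pkt { struct sk_pkt *next; };

struct sk_ifqueue {
    struct sk_pkt *head, *tail;
    int len;      /* packets currently queued */
    int maxlen;   /* limit, for example 50 */
    int drops;    /* packets discarded because the queue was full */
};

/* Append p to q. Returns 1 if queued, 0 if the queue was full and the
 * packet was counted as dropped. */
static int sk_enqueue(struct sk_ifqueue *q, struct sk_pkt *p)
{
    assert(q != NULL && p != NULL);
    if (q->len >= q->maxlen) {
        q->drops++;
        return 0;
    }
    p->next = NULL;
    if (q->tail != NULL)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
    q->len++;
    return 1;
}
```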


When ether_input returns, the device driver tells the hardware that it is ready to receive the next packet, which may already be present in the device. The packet input queues are processed when the software interrupt scheduled by schednetisr occurs (Section 1.12). Specifically, ipintr is called to process the packets on the IP input queue, and arpintr is called to process the packets on the ARP input queue.



ether_output Function


We now examine the output of Ethernet frames, which starts when a network-level protocol such as IP calls the if_output function, specified in the interface's ifnet structure. The if_output function for all Ethernet devices is ether_output (Figure 4.2). ether_output takes the data portion of an Ethernet frame, encapsulates it with the 14-byte Ethernet header, and places it on the interface's send queue. This is a large function so we describe it in four parts:


  • verification,

  • protocol-specific processing,

  • frame construction, and

  • interface queueing.


Figure 4.15 includes the first part of the function.



Figure 4.15. ether_output function: verification.


49-64

The arguments to ether_output are ifp, which points to the outgoing interface's ifnet structure; m0, the packet to send; dst, the destination address of the packet; and rt0, routing information.


65-67

The macro senderr is called throughout ether_output.



#define senderr(e) { error = (e); goto bad;}

senderr saves the error code and jumps to bad at the end of the function, where the packet is discarded and ether_output returns error.


If the interface is up and running, ether_output updates the last change time for the interface. Otherwise, it returns ENETDOWN.
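The error-exit pattern can be sketched as follows. Everything except the shape of the macro is hypothetical: sk_senderr mirrors senderr, sk_output_check stands in for the start of ether_output, the flag values are chosen for the example, and the freed argument stands in for discarding the packet.

```c
#include <assert.h>
#include <errno.h>

#define SK_IFF_UP      0x1    /* hypothetical flag values */
#define SK_IFF_RUNNING 0x40

#define sk_senderr(e) { error = (e); goto bad; }

/* Verify that the interface is usable. On failure, control jumps to
 * the single cleanup point at the end of the function. */
static int sk_output_check(int if_flags, int *freed)
{
    int error = 0;
    const int need = SK_IFF_UP | SK_IFF_RUNNING;

    assert(freed != 0);
    if ((if_flags & need) != need)
        sk_senderr(ENETDOWN);
    /* ... the frame would be built and queued here ... */
    return error;

bad:
    *freed = 1;     /* stands in for discarding the packet */
    return error;
}
```

Funneling every failure through one label keeps the packet cleanup in a single place, however many checks the function performs.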



Host route


68-74

rt0 points to the routing entry located by ip_output and passed to ether_output. If ether_output is called from BPF, rt0 can be null, in which case control passes to the code in Figure 4.16. Otherwise, the route is verified. If the route is not valid, the routing tables are consulted and EHOSTUNREACH is returned if a route cannot be located. At this point, rt0 and rt point to a valid route for the next-hop destination.



Figure 4.16. ether_output function: network protocol processing.



Gateway route


75-85

If the next hop for the packet is a gateway (versus a final destination), a route to the gateway is located and pointed to by rt. If a gateway route cannot be found, EHOSTUNREACH is returned. At this point, rt points to the route for the next-hop destination. The next hop may be a gateway or the final destination.



Avoid ARP flooding


86-90

The RTF_REJECT flag is enabled by the ARP code to discard packets to the destination when the destination is not responding to ARP requests. This is described with Figure 21.24.


ether_output processing continues according to the destination address of the packet. Since Ethernet devices respond only to Ethernet addresses, to send a packet, ether_output must find the Ethernet address that corresponds to the IP address of the next-hop destination. The ARP protocol (Chapter 21) implements this translation. Figure 4.16 shows how the driver accesses the ARP protocol.



IP output


91-101

ether_output jumps according to sa_family in the destination address. We show only the AF_INET, AF_ISO, and AF_UNSPEC cases in Figure 4.16 and have omitted the code for AF_ISO.


The AF_INET case calls arpresolve to determine the Ethernet address corresponding to the destination IP address. If the Ethernet address is already in the ARP cache, arpresolve returns 1 and ether_output proceeds. Otherwise this IP packet is held by ARP, and when ARP determines the address, it calls ether_output from the function in_arpinput.


Assuming the ARP cache contains the hardware address, ether_output checks if the packet is going to be broadcast and if the interface is simplex (i.e., it can't receive its own transmissions). If both tests are true, m_copy makes a copy of the packet. After the switch, the copy is queued as if it had arrived on the Ethernet interface. This is required by the definition of broadcasting; the sending host must receive a copy of the packet.



We'll see in Chapter 12 that multicast packets may also be looped back to be received on the output interface.
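The decision to copy a broadcast packet reduces to two flag tests, sketched below. The flag values and the function needs_loopback_copy are hypothetical; in the kernel the flags come from the mbuf header and the ifnet structure.

```c
#include <assert.h>

#define SK_M_BCAST     0x0100   /* hypothetical: packet is a broadcast */
#define SK_IFF_SIMPLEX 0x0800   /* hypothetical: can't hear own transmissions */

/* A copy is needed only when both conditions hold: the packet is a
 * broadcast and the interface cannot receive what it sends. */
static int needs_loopback_copy(int m_flags, int if_flags)
{
    return (m_flags & SK_M_BCAST) != 0 && (if_flags & SK_IFF_SIMPLEX) != 0;
}
```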




Explicit Ethernet output


142-146


Some protocols, such as ARP, need to specify the Ethernet destination and type explicitly. The address family constant AF_UNSPEC indicates that dst points to an Ethernet header. bcopy duplicates the destination address in edst and assigns the Ethernet type to type. It isn't necessary to call arpresolve (as for AF_INET) because the Ethernet destination address has been provided explicitly by the caller.



Unrecognized address families


147-151

Unrecognized address families generate a console message and ether_output returns EAFNOSUPPORT.


In the next section of ether_output, shown in Figure 4.17, the Ethernet frame is constructed.



Figure 4.17. ether_output function: Ethernet frame construction.



Ethernet header


152-167

If the code in the switch made a copy of the packet, the copy is processed as if it had been received on the output interface by calling looutput. The loopback interface and looutput are described in Section 5.4.


M_PREPEND ensures that there is room for 14 bytes at the front of the packet.



Most protocols arrange to leave room at the front of the mbuf chain so that M_PREPEND needs only to adjust some pointers (e.g., sosend for UDP output in Section 16.7 and igmp_sendreport in Section 13.6).



ether_output forms the Ethernet header from type, edst, and ac_enaddr (Figure 3.26). ac_enaddr is the unicast Ethernet address associated with the output interface and is the source Ethernet address for all frames transmitted on the interface. ether_output overwrites the source address the caller may have specified in the ether_header structure with ac_enaddr. This makes it more difficult to forge the source address of an Ethernet frame.
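Header construction can be sketched on a plain byte array. This is not the code of Figure 4.17: build_ether_header is a hypothetical function, and the type is stored byte by byte in network order rather than through a structure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>   /* memcpy */

#define SK_HDR_LEN 14

/* Fill in the 14-byte header at the front of a frame. The source is
 * always the interface's own address (enaddr), whatever the caller
 * may have supplied. */
static void build_ether_header(uint8_t hdr[SK_HDR_LEN],
                               const uint8_t edst[6],
                               const uint8_t enaddr[6],
                               uint16_t type)
{
    assert(hdr && edst && enaddr);
    memcpy(hdr, edst, 6);               /* destination address */
    memcpy(hdr + 6, enaddr, 6);         /* source address */
    hdr[12] = (uint8_t)(type >> 8);     /* network byte order: high byte first */
    hdr[13] = (uint8_t)(type & 0xff);
}
```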


At this point, the mbuf contains a complete Ethernet frame except for the 32-bit CRC, which is computed by the Ethernet hardware during transmission. The code shown in Figure 4.18 queues the frame for transmission by the device.



Figure 4.18. ether_output function: output queueing.


168-185

If the output queue is full, ether_output discards the frame and returns ENOBUFS. If the output queue is not full, the frame is placed on the interface's send queue, and the interface's if_start function transmits the next frame if the interface is not already active.
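The queueing logic can be modeled with counters. The structure sk_ifnet and the functions sk_start and sk_queue_output are hypothetical; sk_start only marks the device busy and counts its invocations, where a real start routine would also dequeue and transmit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SK_IFF_OACTIVE 0x400   /* hypothetical: transmission in progress */

struct sk_ifnet {
    int snd_len;      /* frames on the send queue */
    int snd_maxlen;   /* send queue limit */
    int flags;
    int starts;       /* times the start routine was invoked */
};

/* Stub start routine for the sketch. */
static void sk_start(struct sk_ifnet *ifp)
{
    ifp->starts++;
    ifp->flags |= SK_IFF_OACTIVE;
}

/* Queue one frame, or report ENOBUFS if the queue is full. The start
 * routine is called only when the interface is not already active. */
static int sk_queue_output(struct sk_ifnet *ifp)
{
    assert(ifp != NULL);
    if (ifp->snd_len >= ifp->snd_maxlen)
        return ENOBUFS;
    ifp->snd_len++;
    if ((ifp->flags & SK_IFF_OACTIVE) == 0)
        sk_start(ifp);
    return 0;
}
```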


186-190


The senderr macro jumps to bad where the frame is discarded and an error code is returned.



lestart Function


The lestart function dequeues frames from the interface output queue and arranges for them to be transmitted by the LANCE Ethernet card. If the device is idle, the function is called to begin transmitting frames. An example appears at the end of ether_output (Figure 4.18), where lestart is called indirectly through the interface's if_start function.


If the device is busy, it generates an interrupt when it completes transmission of the current frame. The driver calls lestart to dequeue and transmit the next frame. Once started, the protocol layer can queue frames without calling lestart since the driver dequeues and transmits frames until the queue is empty.


Figure 4.19 shows the lestart function. lestart assumes splimp has been called to block any device interrupts.



Figure 4.19. lestart function.



Interface must be initialized


325-333


If the interface is not initialized, lestart returns immediately.



Dequeue frame from output queue


335-342


If the interface is initialized, the next frame is removed from the queue. If the interface output queue is empty, lestart returns.



Transmit frame and pass to BPF


343-350


leput copies the frame in m to the hardware buffer pointed to by the first argument to leput. If the interface is tapped by BPF, the frame is passed to bpf_tap. We have omitted the device-specific code that initiates the transmission of the frame from the hardware buffer.



Repeat if device is ready for more frames


359


lestart stops passing frames to the device when le->sc_txcnt equals LETBUF. Some Ethernet interfaces can queue more than one outgoing Ethernet frame. For the LANCE driver, LETBUF is the number of hardware transmit buffers available to the driver, and le->sc_txcnt keeps track of how many of the buffers are in use.



Mark device as busy


360-362


Finally, lestart turns on IFF_OACTIVE in the ifnet structure to indicate the device is busy transmitting frames.
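The control flow of lestart can be approximated with counters in place of the queue and the hardware buffers. The structure sk_le, the value of SK_LETBUF, and the function sk_lestart are hypothetical; the initialization check and the BPF tap are omitted, and the loop in Figure 4.19 may be shaped differently.

```c
#include <assert.h>
#include <stddef.h>

#define SK_LETBUF      2       /* hypothetical number of transmit buffers */
#define SK_IFF_OACTIVE 0x400   /* hypothetical: device busy transmitting */

struct sk_le {
    int queued;   /* frames waiting on the output queue */
    int txcnt;    /* hardware transmit buffers in use */
    int flags;
};

/* Hand frames to the device until its buffers are full, then mark the
 * device busy. If the queue empties first, return immediately. */
static void sk_lestart(struct sk_le *le)
{
    assert(le != NULL);
    while (le->txcnt < SK_LETBUF) {
        if (le->queued == 0)
            return;          /* output queue empty */
        le->queued--;        /* dequeue the next frame */
        le->txcnt++;         /* copy it into a hardware buffer */
    }
    le->flags |= SK_IFF_OACTIVE;
}
```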



There is an unfortunate side effect to queueing multiple frames in the device for transmission. According to [Jacobson 1988a], the LANCE chip is able to transmit queued frames with very little delay between frames. Unfortunately, some [broken] Ethernet devices drop the frames because they can't process the incoming data fast enough.


This interacts badly with an application such as NFS that sends large UDP datagrams (often greater than 8192 bytes) that are fragmented by IP and queued in the LANCE device as multiple Ethernet frames. Fragments are lost on the receiving side, resulting in many incomplete datagrams and high delays as NFS retransmits the entire UDP datagram.


Jacobson noted that Sun's LANCE driver only queued one frame at a time, perhaps to avoid this problem.
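To see how many back-to-back frames one such datagram produces, the arithmetic can be sketched as below. The function sk_fragment_count is hypothetical, and it assumes an 8-byte UDP header and a 20-byte IP header without options.

```c
#include <assert.h>

/* Number of IP fragments needed to carry udp_data bytes of UDP data
 * over a link with the given MTU. */
static int sk_fragment_count(int udp_data, int mtu)
{
    int ip_payload, per_frag;

    assert(udp_data >= 0 && mtu >= 28);
    ip_payload = udp_data + 8;           /* add the UDP header */
    per_frag = ((mtu - 20) / 8) * 8;     /* data per fragment, multiple of 8 */
    return (ip_payload + per_frag - 1) / per_frag;
}
```

With 8192 bytes of data and the Ethernet MTU of 1500, each fragment carries 1480 bytes and the datagram becomes six frames. Losing any one of them forces the whole datagram to be retransmitted.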





