Tuesday, October 20, 2009

Section 30.3.  Routing Table










30.3. Routing Table



The routing table
is the core of the routing sysbsystem. In its simplest definition, it consists of a database of routes that is available to other subsystemsIPv4, for examplethrough various functions, the most important being the one used to do lookups.


As you may already imagine, routes do not consist only of the basic information shown in the section "Routers, Routes, and Routing Tables." Over time, due both to code optimizations and to the introduction of new features, the amount of information that makes up an entry in the routing table has grown quite a bit. We will look at those details in Chapter 34.


In the following subsections, we will briefly see:


  • How Linux routes packets addressed to local addresses

  • What algorithm is used to look up addresses in the routing table

  • What administrative actions can be applied to traffic matching a route besides the default forwarding action

  • What extra information is stored in a route by upper protocols for their convenience



30.3.1. Special Routes


When a packet is received, a router needs to determine whether to deliver it locally to the next-higher layer (because the local host is the final destination) or to forward it. A simple way to accomplish this is to store all the local addresses in a list and scan the list for each packet as part of the routing lookup. Of course, a list would not be the best choice; there are better data structures that can provide faster lookup time. Linux uses a separate hash-based routing table where it stores only local addresses. To be more exact, it stores all of those addresses that it listens to, which includes both the locally configured addresses and the subnet broadcasts.


This means that by default, Linux uses two routing tables:


  • A table for local addresses. A successful lookup in this table means that the packet is to be delivered on the host itself. We will see in Chapter 32 how the kernel populates this table.

  • A table for all other routes, manually configured by the user or dynamically inserted by routing protocols.




30.3.2. Route Types and Actions










We saw in the section "Routers, Routes, and Routing Tables" what a basic route consists of. By default, the action taken to process a packet that matches a given route is to forward it according to the forwarding information returned from the routing table for that route: the next-hop router and the egress device.


However, Linux allows you to optionally define other kinds of actions as well.[*] Here are the main ones:

[*] We will see in Chapter 36 that you can configure these alternative route types only using the new-generation configuration tool IPROUTE2.



Black hole


Packets matching this type of route are silently discarded.


Unreachable


Packets matching this type of route are discarded and generate an Internet Control Message Protocol (ICMP) host unreachable message.


Prohibit


Packets matching this type of route are discarded and generate an ICMP packet filtered message.


Throw


This type is used in conjunction with policy routing, a feature covered in Chapter 31. When policy routing is configured, a matching route of this type will make the lookup abandon the current table and continue with the following one (if any).




30.3.3. Routing Cache


Depending on the role played by the router, the number of routes in its routing table can range from a few units to a few hundred thousand. Because of that, it should be obvious that it would be beneficial to maintain a smaller table that caches the results of lookups, both positive and negative. Linux splits the routing cache
into two components (where a protocol, in this context, means an L3 protocol such as IPv4 and IPv6):


  • A protocol-dependent cache

  • A protocol-independent destination cache, often called just DST


The first component represents the skeleton of the cache, where each element is defined as a collection of protocol-specific fields. The second component, which is embedded in the first, stores only protocol-independent information. Both the protocol-dependent cache and the protocol independent component of it are described in Chapter 33.


We will see in Chapter 31 that it is possible to create multiple independent routing tables on a Linux system that supports the policy routing feature. Regardless of the number of routing tables, Linux uses only one routing cache. If policy routing is supported, the cache does not provide any fairness, so it is possible that the routes of one routing table use many more entries of the cache than other routing tables (i.e., the space in the cache is not equally distributed among the routing tables). This approach, however, ensures greater routing throughput overall.




30.3.4. Routing Table Versus Routing Cache




The routing table and the routing cache, besides differing in size and structure, also differ in the granularity of their objects. The routing table uses subnets, aggregates of consecutive addresses. Entries of the cache, on the other hand, are associated with single IP addresses. Because of this, the lookup algorithm used by the routing table and the routing cache also differs, as we will see in the section "Lookups."


Let's view an example. Suppose our routing table includes, among other routes, the one in Table 30-4, which is the only one that leads to the subnet 10.0.1.0/24.


Table 30-4. Example of routing table entry

Destination

Next hop

Device to use

10.0.1.0/24

10.0.0.1

eth0



Let's also suppose the kernel was asked to transmit two packets to the addresses 10.0.1.100 and 10.0.1.101, respectively. Since the route in Table 30-4 would match in both cases, the kernel would use it to route the two packets and would install two entries into the routing cache that would look like those in Table 30-5.


Table 30-5. Example of routing cache entry

Destination

Next hop

Device to use

10.0.1.100

10.0.0.1

eth0

10.0.1.101

10.0.0.1

eth0



The elements in Table 30-5 are a simplified version, of course. In Chapter 33, we will see that entries of the routing cache include the source address, too.




30.3.5. Routing Cache Garbage Collection



Garbage collection is responsible for eliminating data structures, owned by the routing subsystem, that are no longer in use. However, data structures may be removed even if they are in use, for example, to free the memory needed to store something more important. The effects of the deletions done by the garbage collection will not lead to any loss of data, because all the information eliminated can be re-created. The deletion of an element from the cache can lead only to a cache miss in the worst case.


There are two kinds of garbage collection:



Synchronous


When the routing subsystem sees the need to free some memory, a cleanup is done right away. There are two cases where the routing code may force garbage collection without waiting for the regular timer to do it:


  • When a new entry is to be added to the routing cache and the number of entries currently in the cache has reached a particular threshold, which is configurable by the user.

  • When memory is needed by the neighboring subsystem cache. We saw in Chapter 27 that the routing cache and the neighboring subsystem cache keep references to each other. The creation of a new routing cache entry could trigger the creation of a new neighbor cache entry. If the neighboring protocolsay, ARPfailed to allocate the memory it needed, the routing subsystem would force a garbage collection to indirectly free data structures owned by the neighboring protocol and therefore help the latter find the memory it needed.


Asynchronous


To keep the cache size reasonable, a periodic timer is used to trigger regular cleanups. By default, routing cache entries do not expire. However, it is possible for external subsystems to tell the routing cache to expire certain entries after a given amount of time. The routing subsystem runs a timer that periodically scans the cache, looking for entries that:


  • Are expired and should be removed

  • Are not expired, but could be sacrificed if the kernel needs to free some memory



30.3.5.1. Examples of events that can expire cache entries

An entry is set to expire only in specific cases, including:


  • When the local system receives an ICMP UNREACHABLE or ICMP FRAGMENTATION NEEDED message, it hands it to the ICMP layer. Such a message notifies the local host about a packet that it previously sent out whose size exceeded the MTU of a router along the path to the destination address. The ICMP handler will scan the routing cache, update the PMTU field of all the affected entries, and set the latter to expire after a certain configurable amount of time, which is 10 minutes by default.

    ICMP also notifies the L4 protocol associated with the packet that triggered the ICMP message. For instance, TCP may use these notifications for the Path MTU discovery algorithm. See Chapters 18 and 25 for more details on path MTU discovery.

  • A destination IP address can be classified as unreachable when the neighboring protocol has failed to resolve the L3-to-L2 mapping (see Chapter 27) or when the local host is at one end of an IP tunnel, and the other end becomes unreachable for some reason (for example, a routing problem or misconfiguration).

    When a destination IP address is classified as unreachable, all the entries of the cache associated with the address need to be flushed and therefore will be set to expire right away.




30.3.5.2. Examples of eligible cache victims



There may be cases where the kernel needs to free some cache entries to make room for new ones, and the periodic timer is not able to guarantee by itself that the cache will always have some free room (i.e., to keep its size below some threshold). In those cases, the host must delete entries that the periodic timer would not pick because they are still valid. Even if the garbage collection system needs to select victims from valid entries, it can reduce the damage by selecting those that can be re-created quickly with only a small overhead.


Good candidates for removal include routes to broadcast and multicast addresses. Normally, when the routing subsystem deletes a routing cache entry, it may indirectly remove the L3-to-L2 association as well. When this happens, the next time the host needs to send data to the L3 address, the neighboring subsystem will need to resolve the L3-to-L2 association again. However, broadcast and multicast addresses can be resolved with low overhead because they do not need any solicitation request (see the section "Special Cases" in Chapter 26).


Particularly bad (high-overhead) candidates for removal include:



REDIRECT routes


This kind of route has been learned through an ICMP REDIRECT message; if it is removed, the host will use suboptimal routing for further traffic along that route. Removing the entry may also be a waste of time because the host will most likely receive another ICMP REDIRECT that just leads to reinserting the route.


Routes manually configured by the administrator


These are routes that a user, via a command such as ip route get 10.0.0.1 monitor, has asked the kernel to send a notification (via the Netlink socket) when it changes state. The user probably considers this route important for some reason. See Table 36-11 in Chapter 36 for more information.


In any case, entries with non-null reference counts are never considered eligible for deletion.














No comments:

Post a Comment